Apache Tika

Apache Tika 1.17 released: a content extraction toolkit


Apache Tika 1.17 has been released. Tika is a toolkit for text extraction: it integrates POI and PDFBox and provides a unified interface for extracting text. Tika also offers a handy extension API for adding support for third-party file formats.
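As a rough illustration of what the unified interface looks like in practice, here is a minimal sketch using the `org.apache.tika.Tika` facade class. It assumes the Tika core and parser jars are on the classpath; the class name `TikaExample`, the helper method names, and the input path are made up for this example and are not part of the release.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaExample {
    // The facade wraps both type detection and parsing behind one object.
    private static final Tika TIKA = new Tika();

    /** Guesses the media type from the file name (extension) alone. */
    static String detectType(String fileName) {
        return TIKA.detect(fileName);
    }

    /** Extracts plain text; the same call is used whatever the file format is. */
    static String extractText(Path file) throws IOException, TikaException {
        try (InputStream in = Files.newInputStream(file)) {
            return TIKA.parseToString(in);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input: a document path passed on the command line.
        Path file = Paths.get(args[0]);
        System.out.println(detectType(file.getFileName().toString()));
        System.out.println(extractText(file));
    }
}
```

The caller never names POI or PDFBox directly. Which underlying parser runs depends on the detected type and on which parser modules are installed.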

Apache Tika 1.17 includes many improvements and bug fixes:
  • Fix thread-safety in ChmExtractor (TIKA-2519).
  • Upgrade cxf to 3.0.16 (TIKA-2516).
  • Allow users to configure maxMainMemoryBytes for PDFs via shrike (PR-213).
  • Extract underline and strikethrough in docx (TIKA-2347 and TIKA-2512).
  • Cache TikaConfig in EmbeddedDocumentUtil for better performance in documents with a large number of attachments (TIKA-2511).
  • Extract media files from ooxml (TIKA-2510).
  • Standardize the way the Image and Video captioning dockers and extraction work (TIKA-2400, GitHub-208).
  • Upgrade to xmpcore 5.1.3 (TIKA-2034).
  • Upgrade to metadata-extractor 2.10.1 (TIKA-2486).
  • Upgrade to OpenNLP 1.8.3 (TIKA-2502).
  • Upgrade to Jackson 2.9.2 (TIKA-2501).

