The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
The parse method takes the document to be parsed and its related metadata as input, and outputs the results as XHTML SAX events and extra metadata. The parse context argument is used to specify context information (like the current locale) that is not related to any individual document. The main criteria that led to this design were:
- Streamed parsing
  - The interface should require neither the client application nor the parser implementation to keep the full document content in memory or spooled to disk. This allows even huge documents to be parsed without excessive resource requirements.
- Structured content
  - A parser implementation should be able to include structural information (headings, links, etc.) in the extracted content. A client application can use this information, for example, to better judge the relevance of different parts of the parsed document.
- Input metadata
  - A client application should be able to include metadata like the file name or declared content type with the document to be parsed. The parser implementation can use this information to better guide the parsing process.
- Output metadata
  - A parser implementation should be able to return document metadata in addition to document content. Many document formats contain metadata like the name of the author that may be useful to client applications.
- Context sensitivity
  - While the default settings and behaviour of Tika parsers should work well for most use cases, there are still situations where more fine-grained control over the parsing process is desirable. It should be easy to inject such context-specific information into the parsing process without breaking the layers of abstraction.
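
The design criteria above can be illustrated with a small sketch that uses only JDK classes. This is not Tika's own code: `SketchParser`, `LineParser`, `HeadingCollector`, `headingsOf`, the `encoding`, `lineCount` and `locale` keys, and the plain `Map` objects standing in for the metadata and parse context arguments are all hypothetical names chosen for this example. Only the `org.xml.sax` types are real JDK APIs.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class StreamingParseSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Hypothetical interface with the four-argument shape described above (not Tika's class). */
    interface SketchParser {
        void parse(InputStream stream, ContentHandler handler,
                   Map<String, String> metadata, Map<Class<?>, Object> context)
                throws IOException, SAXException;
    }

    /** Toy format: lines starting with "# " become h1 elements, other non-blank lines become p. */
    static class LineParser implements SketchParser {
        @Override
        public void parse(InputStream stream, ContentHandler handler,
                          Map<String, String> metadata, Map<Class<?>, Object> context)
                throws IOException, SAXException {
            // Input metadata: a declared encoding guides how the stream is decoded.
            Charset charset = Charset.forName(metadata.getOrDefault("encoding", "UTF-8"));
            // Context: settings such as the locale are not tied to any one document.
            Locale locale = (Locale) context.getOrDefault(Locale.class, Locale.ROOT);
            BufferedReader reader = new BufferedReader(new InputStreamReader(stream, charset));
            Attributes none = new AttributesImpl();

            handler.startDocument();
            handler.startElement(XHTML, "html", "html", none);
            handler.startElement(XHTML, "body", "body", none);
            int lines = 0;
            String line;
            // Streamed parsing: each line is emitted as SAX events and then discarded.
            while ((line = reader.readLine()) != null) {
                lines++;
                if (line.trim().isEmpty()) continue;
                boolean heading = line.startsWith("# ");
                String name = heading ? "h1" : "p";
                char[] text = (heading ? line.substring(2) : line).toCharArray();
                handler.startElement(XHTML, name, name, none);
                handler.characters(text, 0, text.length);
                handler.endElement(XHTML, name, name);
            }
            handler.endElement(XHTML, "body", "body");
            handler.endElement(XHTML, "html", "html");
            handler.endDocument();

            // Output metadata: facts about the document, returned alongside the content.
            metadata.put("lineCount", String.valueOf(lines));
            // Recorded only to show that the context reached the parser.
            metadata.put("locale", locale.toLanguageTag());
        }
    }

    /** Client-side handler that uses the structural events to keep only the headings. */
    static class HeadingCollector extends DefaultHandler {
        final List<String> headings = new ArrayList<>();
        private StringBuilder current;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("h1".equals(localName)) current = new StringBuilder();
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != null) current.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("h1".equals(localName)) {
                headings.add(current.toString());
                current = null;
            }
        }
    }

    /** Parses an in-memory string, returns its headings, and fills in the metadata map. */
    static List<String> headingsOf(String text, Map<String, String> metadata) {
        HeadingCollector collector = new HeadingCollector();
        Map<Class<?>, Object> context = new HashMap<>();
        context.put(Locale.class, Locale.ENGLISH);
        InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        try {
            new LineParser().parse(in, collector, metadata, context);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.headings;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("encoding", "UTF-8");
        List<String> headings = headingsOf("# Title\nFirst paragraph.\n# Second\n", metadata);
        System.out.println(headings + " lines=" + metadata.get("lineCount"));
        // prints: [Title, Second] lines=3
    }
}
```

The same map carries metadata in both directions here, and the handler sees only one line's worth of events at a time, so neither side needs to hold the whole document. In a real application the Tika parser, metadata and parse context classes would take the place of these stand-ins.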
* Fix thread safety bug in OpenOffice parser (TIKA-3334).
* The “writeLimit” header now applies to the combined number of characters
written for a container document and all of its embedded documents in the
/rmeta endpoint in tika-server (TIKA-3325); it no longer applies separately
to each container or embedded document.
* Extract more embedded files in PDFs by recursively processing the
embedded file tree (TIKA-3332).
* Allow for case-insensitive headers for configuration of the PDFParser
and the TesseractOCRParser in tika-server via Subhajit Das (TIKA-3320).
* Improve detection and parsing of XPS files (TIKA-3316).
* General dependency upgrades (TIKA-3244).
* Great optimization in ForkParser (TIKA-3237).
* Fix parsing of emails attached to other emails in PST files (TIKA-3004).
* MP3 parser now outputs the xmpDM:duration metadata in seconds rather than
milliseconds, consistent with the other audio and video parsers (TIKA-3318).
* MP4 parser now checks whether any of the Compatible Brands match when
identifying the subtype (TIKA-3310).