Free-text data, such as radiology reports or referral letters, is often stored as binary documents in a variety of formats, and frequently as scanned images. In such cases, the text must first be extracted from these documents before any meaningful operation can be performed on it. Apache Tika is one of the most commonly used open-source projects for this purpose: it implements a rich set of operations on documents, including text and metadata extraction, and handles many different file types. In CogStack, Apache Tika runs as a web service that exposes a RESTful API for processing the supplied documents. Although Tika Service was designed to be used within data flows, it can also be deployed as a standalone service.
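As an illustration of how a client might submit a document to such a RESTful API, here is a minimal Python sketch. The host, port (`8090`), endpoint path (`/api/process`), and JSON response format are assumptions made for this example and should be checked against the deployed service; the function names are hypothetical.

```python
import json
import urllib.request

# Assumptions (not stated in the text above): the service listens on
# localhost:8090 and accepts raw document bytes on POST /api/process.
TIKA_SERVICE_URL = "http://localhost:8090"
PROCESS_PATH = "/api/process"


def build_request(document: bytes, base_url: str = TIKA_SERVICE_URL) -> urllib.request.Request:
    """Build a POST request carrying the binary document as the body."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + PROCESS_PATH,
        data=document,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def extract_text(document: bytes, base_url: str = TIKA_SERVICE_URL, timeout: float = 60.0) -> dict:
    """Send the document to the service and return the decoded response.

    The response is assumed to be a JSON object; its exact fields depend
    on the service version and are not interpreted here.
    """
    request = build_request(document, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a running instance, calling `extract_text(open("report.pdf", "rb").read())` would return the parsed response, where `report.pdf` is a placeholder file name. Separating `build_request` from `extract_text` keeps the request construction inspectable without a live service.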
The recommended way of running Tika as a service is to use Docker with the provided image, cogstacksystems/tika-service:latest. The main reason is that Apache Tika relies on many external applications and dependencies to perform text extraction; Tesseract, for example, is used to perform OCR on images. For backwards compatibility, Tika Service also implements a custom parser that was previously used inside CogStack-Pipeline to process PDF files requiring OCR. Running the service in a container allows the text extraction process to run in isolation and to be scaled by running multiple instances according to the expected workload, such as when processing historical data.
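The idea of scaling by running several instances can be sketched in Python as follows. This is only one possible arrangement: the container port (`8090`), the container names, and the round-robin distribution of requests are assumptions made for this example, not part of the service itself.

```python
from itertools import cycle
from typing import Iterator, List

IMAGE = "cogstacksystems/tika-service:latest"  # image named in the text
CONTAINER_PORT = 8090  # assumption: port the service listens on in the container


def docker_run_commands(instances: int, first_host_port: int = 8090) -> List[List[str]]:
    """Return one `docker run` argument list per instance, each on its own host port."""
    commands = []
    for index in range(instances):
        host_port = first_host_port + index
        commands.append([
            "docker", "run", "-d",
            "--name", f"tika-service-{index}",  # hypothetical container name
            "-p", f"{host_port}:{CONTAINER_PORT}",
            IMAGE,
        ])
    return commands


def instance_urls(instances: int, first_host_port: int = 8090, host: str = "localhost") -> Iterator[str]:
    """Cycle endlessly over the instance base URLs (simple round-robin)."""
    return cycle(f"http://{host}:{first_host_port + i}" for i in range(instances))
```

Each argument list can be passed to `subprocess.run(command, check=True)` to start a container, and `next(urls)` then selects the instance that receives the next document. In practice a container orchestrator or a load balancer would usually take over both roles.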