Text extraction from binary documents


Free-text data, such as radiology reports or referral letters, is often stored as binary documents in a variety of formats, and frequently as scanned images. In such cases, the text must first be extracted from these documents before any meaningful operation can be performed on it. Apache Tika is one of the most commonly used open-source projects for this purpose: it implements a rich set of operations on documents, including text and metadata extraction, and handles many different file types. In CogStack, Apache Tika runs as a web service that exposes a RESTful API for processing the supplied documents. Although Tika Service was designed to be used within data flows, it can also be deployed as a standalone service.


Please see the tutorial CogStack using Apache NiFi Deployment Examples to see how Tika is used as part of a data pipeline.


Running Tika Service

The recommended way of running Tika as a service is to use Docker with the provided image, cogstacksystems/tika-service:latest. This is primarily because Apache Tika relies on many external applications and dependencies to perform text extraction; Tesseract, which performs OCR on images, is one of them. Tika Service also implements a custom parser for PDF files requiring OCR, which was previously used inside CogStack-Pipeline and is retained for backwards compatibility. Running the service in a container lets the text extraction process run in isolation and be scaled by running multiple instances according to the expected workload, for example when processing historical data.
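To illustrate the scaling idea, the sketch below spreads a batch of documents across several Tika Service instances in round-robin order and processes them concurrently. It is only a minimal sketch: the function name process_all, the instance URLs, and the send_one callable (which would perform the actual HTTP request) are hypothetical and not part of Tika Service. In a real deployment a load balancer or the data flow engine would typically take this role.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle
from typing import Callable, List


def process_all(
    documents: List[bytes],
    instance_urls: List[str],
    send_one: Callable[[str, bytes], dict],
) -> List[dict]:
    """Spread documents across instances round-robin and process them concurrently.

    send_one(url, document) is a caller-supplied (hypothetical) function that
    sends one document to one instance and returns the parsed response.
    """
    if not instance_urls:
        raise ValueError("at least one instance URL is required")

    # Pair each document with an instance URL, cycling through the instances.
    assignments = list(zip(cycle(instance_urls), documents))

    # One worker per instance; pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=len(instance_urls)) as pool:
        return list(pool.map(lambda pair: send_one(*pair), assignments))
```

With two instances and three documents, the first and third documents go to the first instance and the second document goes to the second instance.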

For more detailed information on running Tika Service, please see the documentation in the Tika Service GitHub repository.

Tika Service API

Tika Service exposes a RESTful API with the following endpoints:

  • GET /api/info - returns information about the service with its configuration,

  • POST /api/process - processes a binary data stream with the binary document content,

  • POST /api/process_file - processes a document file (multi-part request).

For more detailed information on the API specification, with usage examples, please see the Tika Service documentation.
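As a rough illustration of the first two endpoints, the following Python sketch builds a GET request for /api/info and a POST request for /api/process using only the standard library. The base URL, the helper names, and the application/octet-stream content type are assumptions made for this example; the Tika Service documentation is the authority on which headers the service expects.

```python
import json
import urllib.request

# Assumption: a local instance listening on port 8090.
BASE_URL = "http://localhost:8090"


def build_info_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a GET request for the /api/info endpoint."""
    return urllib.request.Request(f"{base_url}/api/info", method="GET")


def build_process_request(
    document: bytes, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a POST request for /api/process with the raw document bytes as the body."""
    return urllib.request.Request(
        f"{base_url}/api/process",
        data=document,
        # Assumption: a generic binary content type is acceptable to the service.
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a service instance running, send(build_info_request()) would return the decoded service information, and send(build_process_request(data)) would return the extraction result for the bytes in data.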

Example use

Using curl to send a document to a Tika Service instance running on localhost, port 8090:

curl -F file=@test.pdf http://localhost:8090/api/process_file | jq
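The same multi-part upload can be sketched in Python with the standard library alone. The helper below hand-builds a multipart/form-data body with a single part named file, mirroring the curl -F file=@test.pdf form field. The function name and the application/octet-stream part type are assumptions for illustration (curl may pick a more specific type for the part).

```python
import urllib.request
import uuid


def build_multipart_request(
    filename: str,
    content: bytes,
    url: str = "http://localhost:8090/api/process_file",
) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying one part named 'file'."""
    # A random boundary that is very unlikely to occur inside the document.
    boundary = uuid.uuid4().hex

    body = b"".join(
        [
            f"--{boundary}\r\n".encode(),
            (
                'Content-Disposition: form-data; name="file"; '
                f'filename="{filename}"\r\n'
            ).encode(),
            # Assumption: a generic binary type for the uploaded part.
            b"Content-Type: application/octet-stream\r\n\r\n",
            content,
            f"\r\n--{boundary}--\r\n".encode(),
        ]
    )

    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

The returned request can be passed to urllib.request.urlopen, and the response body decoded as JSON, once a service instance is available.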

Returned result:

{
  "result": {
    "text": "Sample Type / Medical Specialty: Lab Medicine - Pathology",
    "metadata": {
      "X-Parsed-By": [
        "org.apache.tika.parser.CompositeParser",
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.microsoft.ooxml.OOXMLParser"
      ],
      "X-OCR-Applied": "false",
      "Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    },
    "success": true,
    "timestamp": "2019-08-13T15:14:58.022+01:00"
  }
}
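A client will usually want only the extracted text and a few metadata fields from this response. The sketch below shows one way to read them, assuming the response always has the shape of the sample above; the helper names extract_text and ocr_was_applied are hypothetical. Note that in the sample X-OCR-Applied is the string "false" rather than a JSON boolean, so it is compared as text.

```python
import json
from typing import Optional


def extract_text(response_body: str) -> Optional[str]:
    """Return the extracted text, or None if the service reported a failure."""
    result = json.loads(response_body).get("result", {})
    if not result.get("success", False):
        return None
    return result.get("text")


def ocr_was_applied(response_body: str) -> bool:
    """Report whether the metadata says OCR was used for this document."""
    metadata = json.loads(response_body).get("result", {}).get("metadata", {})
    # The sample response encodes this flag as a string, so compare as text.
    return str(metadata.get("X-OCR-Applied", "false")).lower() == "true"
```

For the sample response, extract_text returns the "Sample Type / Medical Specialty: Lab Medicine - Pathology" string and ocr_was_applied returns False.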