Free-text data, such as radiology reports or referral letters, is often stored as binary documents in a variety of formats, and frequently as scanned images. In such cases, the text must first be extracted from these documents before any meaningful operation can be performed on it. Apache Tika is one of the most commonly used open-source projects for this purpose. It implements a rich set of document operations, including text and metadata extraction, and handles many different file types. In CogStack we run Apache Tika as a web service that exposes a RESTful API for processing the supplied documents. Although Tika Service was designed to be used within data flows, it can also be deployed as a standalone service.
Apache Tika resources:
The official website: https://tika.apache.org/
The official documentation: https://tika.apache.org/1.22/index.html
CogStack Tika Service useful links:
GitHub with code and documentation: https://github.com/CogStack/tika-service
DockerHub: https://cloud.docker.com/u/cogstacksystems/repository/docker/cogstacksystems/tika-service
Please see the tutorial CogStack using Apache NiFi Deployment Examples to see how Tika is used as part of a data pipeline.
The recommended way of running Tika as a service is to use Docker with the provided Docker image: cogstacksystems/tika-service:latest. This is primarily because Apache Tika relies on many external applications and dependencies to perform text extraction. Tesseract is one of these applications and is used to perform OCR on images. Moreover, Tika Service also implements, for backwards compatibility, the custom parser that was previously used inside CogStack-Pipeline for PDF files requiring OCR. This way, the text extraction process can be run in isolation and scaled by running multiple instances according to the expected workload, for example when processing historical data.
For more detailed information on running Tika Service, please see the documentation in the Tika Service GitHub repository.
Tika Service exposes a RESTful API with the following endpoints:
GET /api/info - returns information about the service and its configuration,
POST /api/process - processes a binary data stream containing the document content,
POST /api/process_file - processes a document file (multi-part request).
For more detailed information on the API specification, with usage examples, please see the Tika Service documentation.
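The three endpoints listed above can also be called from code. The following is a minimal Python sketch using only the standard library; it is an illustration, not the official client. The base URL reuses the localhost:8090 address from the curl example, the multipart field name "file" is taken from that same example, and the application/octet-stream content type for /api/process is an assumption that the service may not require. The helper names (info_request, process_request, process_file_request, send) are hypothetical.

```python
import json
import urllib.request
import uuid

# Host and port as in the curl example; adjust to your deployment.
BASE_URL = "http://localhost:8090"


def info_request(base_url=BASE_URL):
    """Build a GET request for /api/info."""
    return urllib.request.Request(f"{base_url}/api/info", method="GET")


def process_request(content, base_url=BASE_URL):
    """Build a POST request for /api/process carrying the raw document bytes."""
    return urllib.request.Request(
        f"{base_url}/api/process",
        data=content,
        # Assumption: a generic binary content type is acceptable.
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def process_file_request(filename, content, base_url=BASE_URL):
    """Build a multipart POST request for /api/process_file."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # The field name "file" mirrors `curl -F file=@test.pdf`.
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{base_url}/api/process_file",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def send(request, timeout=60):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With a running instance, a call such as send(process_file_request("test.pdf", open("test.pdf", "rb").read())) would be expected to return the decoded JSON response. Building the request separately from sending it keeps the endpoint logic easy to inspect without a live server.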
Using curl to send a document to a Tika Service instance running on localhost on port 8090:
curl -F file=@test.pdf http://localhost:8090/api/process_file | jq
Returned result:
{
  "result": {
    "text": "Sample Type / Medical Specialty: Lab Medicine - Pathology",
    "metadata": {
      "X-Parsed-By": [
        "org.apache.tika.parser.CompositeParser",
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.microsoft.ooxml.OOXMLParser"
      ],
      "X-OCR-Applied": "false",
      "Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    },
    "success": true,
    "timestamp": "2019-08-13T15:14:58.022+01:00"
  }
}
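A response of this shape can be handled in a few lines of Python. The sketch below assumes the layout shown in the sample result: a top-level "result" object holding "text", "metadata" and "success". It also assumes that "X-OCR-Applied" is the string "true" when OCR was used, which the sample (showing "false") suggests but does not confirm. The function names are hypothetical, and the embedded sample is a shortened copy of the result above.

```python
import json

# Shortened copy of the sample response shown above.
SAMPLE_RESPONSE = """
{
  "result": {
    "text": "Sample Type / Medical Specialty: Lab Medicine - Pathology",
    "metadata": {
      "X-OCR-Applied": "false"
    },
    "success": true,
    "timestamp": "2019-08-13T15:14:58.022+01:00"
  }
}
"""


def extract_text(payload):
    """Return the extracted text, or raise if the service reported a failure."""
    result = payload["result"]
    if not result.get("success"):
        raise ValueError("Tika Service reported an unsuccessful extraction")
    return result["text"]


def ocr_applied(payload):
    """Report whether OCR was used; the flag is a string in the sample response."""
    metadata = payload["result"].get("metadata", {})
    # Assumption: the value is "true" when OCR was applied.
    return metadata.get("X-OCR-Applied") == "true"


payload = json.loads(SAMPLE_RESPONSE)
print(extract_text(payload))
print(ocr_applied(payload))
```

Checking the "success" flag before reading "text" avoids silently passing an empty or partial extraction on to later steps in a data flow.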