CogStack Documentation

What is CogStack?

CogStack is a lightweight, distributed, fault-tolerant data processing architecture and ecosystem, intended to make NLP processing and preprocessing easier in resource-constrained environments. It comprises multiple components; CogStack Pipeline, the one covered in this documentation, has been designed to provide configurable data processing pipelines for working with EHR data. At the moment it mainly uses databases and files as the primary source of EHR data, with the possibility of adding custom data connectors in the near future. It uses the Java Spring Batch framework to provide a fully configurable data processing pipeline, with the goal of generating annotated JSON files that can be readily indexed into ElasticSearch, stored as files or pushed back to a database.

The CogStack ecosystem has been developed as an open-source project, with the code available on GitHub.


Starting from version 1.2, CogStack is preferably run as an ecosystem of microservices deployed using Docker Compose. Ready-to-use CogStack Pipeline images are available to pull directly from the official Docker Hub under the cogstacksystems organisation.


To get CogStack up and running, please refer to the CogStack Quickstart guide.


Please note that we are currently working on an improved version of CogStack-Pipeline, aiming to replace the Spring Batch data processing engine with Apache NiFi. The most up-to-date development version can be found in the CogStack-NiFi repository.


Why does this project exist?

CogStack consists of a range of technologies designed to support modern, open-source healthcare analytics within the NHS. It chiefly comprises the Elastic stack (ElasticSearch, Kibana, etc.), GATE, Bio-Yodie and MedCAT (clinical natural language processing for named entity extraction and linking), Tesseract for OCR, clinical text de-identification, and Apache Tika for document-to-text conversion. Since the processed EHR data can be represented and stored in databases or ElasticSearch, CogStack is well suited as one of the solutions for integrating EHR data with other types of data, such as biomedical, -omics and wearables data.

Why is batch processing difficult?

When processing very large datasets (tens to hundreds of millions of rows of data), it is likely that some rows will present difficulties for different processes. These problems are typically hard to predict: for example, some documents may have very long sentences, an unusual sequence of characters, or machine-only content. Such circumstances can create a range of problems for NLP algorithms, and thus a fault-tolerant batch framework is required to ensure robust, consistent processing.

To overcome these issues, the CogStack data processing pipeline is based on the Spring Batch framework. In the parlance of the batch processing domain language, CogStack uses the partitioning concept to create 'partition step' metadata for a database table. This metadata is persisted in the Spring database schema, after which each partition can be executed locally or farmed out to remote workers via a messaging middleware server (only ActiveMQ is supported at this time). Remote workers then retrieve metadata descriptions of their work units. The outcome of processing is persisted in the database, allowing robust tracking and simple restart of failed partitions.
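As a rough illustration of the partitioning concept (a sketch in the spirit of Spring Batch's column-range partitioning samples, not CogStack's actual implementation; the class and record names here are hypothetical), a partitioner can split a table's primary-key range into contiguous, independent work units, each described by the metadata a worker needs:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split the id range [minId, maxId] of a DB table
// into contiguous, non-overlapping partitions that workers can process
// independently. The resulting metadata is what would be persisted and
// later handed to a local or remote worker.
public class ColumnRangePartitioner {

    // Immutable description of one partition (one work unit).
    public record Partition(long minId, long maxId) {}

    public static List<Partition> partition(long minId, long maxId, int gridSize) {
        List<Partition> partitions = new ArrayList<>();
        long targetSize = (maxId - minId + 1) / gridSize + 1;
        long start = minId;
        long end = start + targetSize - 1;
        while (start <= maxId) {
            partitions.add(new Partition(start, Math.min(end, maxId)));
            start += targetSize;
            end += targetSize;
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Split ids 1..1000 into 4 work units.
        for (Partition p : partition(1, 1000, 4)) {
            System.out.println(p.minId() + " - " + p.maxId());
        }
    }
}
```

Because each partition is a self-contained range, a failed partition can simply be re-run from its persisted metadata without reprocessing the whole table.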

Although the CogStack data processing pipeline can be run in parallel mode, coordinating multiple remote workers, it is also possible to run it on a local machine under highly constrained resources, using only a few processing threads (or even just one).
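A minimal sketch of this local execution mode (illustrative only; the class and method names are assumptions, not CogStack's API): the same independent partitions are simply submitted to a small, fixed-size thread pool instead of being farmed out over messaging middleware.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: execute independent partitions on a bounded local
// thread pool, as one would under constrained resources.
public class LocalPartitionRunner {

    // Each partition is a {minId, maxId} pair; returns the number of
    // partitions successfully processed.
    public static int runAll(List<long[]> partitions, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger processed = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        for (long[] range : partitions) {
            futures.add(pool.submit(() -> {
                // A real worker would read rows range[0]..range[1] from the
                // source table, run NLP processing, and persist the outcome.
                processed.incrementAndGet();
            }));
        }
        try {
            for (Future<?> f : futures) f.get(); // surfaces any worker failure
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
        pool.shutdown();
        return processed.get();
    }

    public static void main(String[] args) {
        List<long[]> partitions = List.of(new long[]{1, 500}, new long[]{501, 1000});
        System.out.println("processed " + runAll(partitions, 1) + " partitions");
    }
}
```

Setting `threads` to 1 corresponds to the single-threaded case mentioned above; the partition metadata is identical whether the work runs locally or remotely.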

Related publications

  • Jackson, Richard, et al. “CogStack-experiences of deploying integrated information retrieval and extraction services in a large National Health Service Foundation Trust hospital.” BMC medical informatics and decision making 18.1 (2018): 47.
  • Wu, Honghan, et al. “SemEHR: A general-purpose semantic search system to surface semantic data from clinical notes for tailored care, trial recruitment, and clinical research.” Journal of the American Medical Informatics Association 25.5 (2018): 530-537.
  • Wu, Honghan, et al. “SemEHR: surfacing semantic data from clinical notes in electronic health records for tailored care, trial recruitment, and clinical research.” The Lancet 390 (2017): S97.
  • Bean, Daniel M., et al. “Semantic computational analysis of anticoagulation use in atrial fibrillation from real world data.” PloS one 14.11 (2019).
  • Searle, Thomas, et al. “MedCATTrainer: A Biomedical Free Text Annotation Interface with Active Learning and Research Use Case Specific Customisation.” arXiv preprint arXiv:1907.07322 (2019).
  • Kraljevic, Zeljko, et al. “MedCAT–Medical Concept Annotation Tool.” arXiv preprint arXiv:1912.10166 (2019).
  • Tissot, Hegler, et al. “Natural Language Processing for Mimicking Clinical Trial Recruitment in Critical Care: A Semi-automated Simulation Based on the LeoPARDS Trial.” IEEE Journal of Biomedical and Health Informatics (2020), doi: 10.1109/JBHI.2020.2977925.
  • Bean, Daniel, et al. (2020) “ACE-inhibitors and Angiotensin-2 Receptor Blockers are not associated with severe SARS-COVID19 infection in a multi-site UK acute Hospital Trust.” medRxiv 2020.04.07.20056788.
  • Carr, Ewan, et al. (2020) “Supplementing the National Early Warning Score (NEWS2) for anticipating early deterioration among patients with COVID-19 infection.” medRxiv 2020.04.24.20078006.
  • Zhang, Huayu, et al. (2020) “Risk prediction for poor outcome and death in hospital in-patients with COVID-19: derivation in Wuhan, China and external validation in London, UK.” medRxiv 2020.04.28.20082222.
