Quickstart

This content is archived.

Introduction

This tutorial demonstrates how to get CogStack running on a sample electronic health record (EHR) dataset that is initially stored in an external database. The CogStack ecosystem has been designed to handle both structured and unstructured EHR data efficiently. Its strength shows most with unstructured data, particularly when some of the input is provided as documents in PDF or image formats.

In this Quickstart tutorial we only show how to run CogStack on a set of structured and free-text EHRs that have already been digitized. Unstructured data in the form of PDF documents, images and other clinical notes, which needs to be processed prior to analysis and/or NLP, is covered extensively in the Examples part.

The main directory with resources used in this tutorial is available in the CogStack bundle under examples/. This tutorial is based on Example 2; more examples are available to play with, so please see the Examples section.

 

Tip

To skip the brief description and get hands-on with running CogStack, please head directly to the Running CogStack part.

Note

This tutorial is based on the newest version of CogStack 1.3.0.

Previous versions of CogStack and its documentation can be found in the official GitHub repository.

Tip

An online version of the Quickstart tutorial is also available on the CogStack AWS bucket.

 



 

Getting CogStack

The most convenient way to get the newest version of the CogStack bundle is to download it directly from the official GitHub repository:

wget 'https://github.com/CogStack/CogStack-Pipeline/releases/download/1.3.0/cogstack-pipeline-1.3.0-opejndk11.tar.gz'
tar -xzf cogstack-pipeline-1.3.0-opejndk11.tar.gz

The content will be decompressed into the CogStack-Pipeline-1.3.0/ directory.
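To confirm where the files will land before extracting, the archive can be listed first. The following is a minimal sketch using only standard tar options; the helper name top_level_entries is hypothetical and not part of the CogStack bundle.

```shell
#!/bin/sh
# Hypothetical helper (not part of CogStack): print the distinct top-level
# entries stored in a gzip-compressed tarball, without extracting anything.
top_level_entries() {
  # -t lists the contents, -z handles gzip compression, -f names the archive.
  tar -tzf "$1" | cut -d/ -f1 | sort -u
}

# Usage, assuming the archive downloaded above is in the current directory:
#   top_level_entries cogstack-pipeline-1.3.0-opejndk11.tar.gz
```

If the output is a single directory name, extraction will not scatter files across the current directory.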

 



 

How does CogStack work

Data processing workflow

The data processing workflow of CogStack is based on the Java Spring Batch framework. Without delving too far into technical details, the general idea is this: data is read from a predefined data source, passes through a number of processing operations, and the final result is stored in a predefined data sink. CogStack implements a variety of data readers, processors and writers, along with scalability mechanisms, all of which can be selected in the CogStack job configuration. Although the data can be read from different sources, the most frequently used data sink is ElasticSearch. For more details about CogStack functionality, please refer to the other parts of the documentation.

In this tutorial we focus only on a simple and very common use case, where CogStack reads and processes structured and free-text EHR data from a single PostgreSQL database. The result is then stored in ElasticSearch, where the data can be easily queried through the Kibana dashboard. The CogStack engine also supports multiple data sources; please see Example 3 (in the Examples section), which covers such a case.

CogStack ecosystem

The CogStack ecosystem consists of multiple interconnected microservices running together. For ease of use and deployment we use Docker (more specifically, Docker Compose) and provide Compose files for configuring and running the microservices. Which microservices are run depends mostly on the specification of the EHR data source(s) and on the data extraction and processing requirements.

In this tutorial the CogStack ecosystem is composed of the following microservices:

  • samples-db – PostgreSQL database loaded with a sample dataset under db_samples name,

  • cogstack-pipeline – CogStack data processing engine with worker(s),

  • cogstack-job-repo – PostgreSQL database for storing information about CogStack jobs,

  • elasticsearch-1 – ElasticSearch search engine (single node) for storing and querying the processed EHR data,

  • kibana – Kibana data visualization tool for querying the data from ElasticSearch.

Since all the examples share a common configuration for the microservices used, the base Docker Compose file is provided in examples/docker-common/docker-compose.yml. The Docker Compose file that overrides the microservice configuration for this example can be found in examples/example2/docker/docker-compose.override.yml. Both configuration files are automatically used by Docker Compose when deploying CogStack, as will be shown later.

For a more detailed description of the microservices used in the CogStack ecosystem, please refer to the CogStack ecosystem (v1) part.
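To inspect what the base file and the override file resolve to once combined, Docker Compose can print the merged configuration without starting any containers. The sketch below passes both files explicitly with -f and runs the config subcommand. The helper name show_merged_config is hypothetical, and this is not necessarily how the tutorial's own deployment step picks up the files.

```shell
#!/bin/sh
# Hypothetical helper (not part of CogStack): print the configuration that
# results from layering an override Compose file on top of a base file.
show_merged_config() {
  # Fail early with a readable message if either file is missing.
  for f in "$1" "$2"; do
    if [ ! -f "$f" ]; then
      echo "missing compose file: $f" >&2
      return 1
    fi
  done
  # Later -f files override or extend the settings of earlier ones.
  docker-compose -f "$1" -f "$2" config
}

# Usage from the bundle's root directory (paths as given in the text above):
#   show_merged_config examples/docker-common/docker-compose.yml \
#     examples/example2/docker/docker-compose.override.yml
```

The printed output contains the five services listed earlier with the example-specific settings applied, which makes it easier to see what the override file actually changes.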

 

Info

In some more advanced deployments, the Nginx reverse proxy and/or Fluentd logging services are used – for more details, please see Example 6 or Example 7 in the Examples section.

 



 

Sample datasets

The sample dataset used in this tutorial consists of two types of EHR data:

  • Synthetic – structured, synthetic EHRs, generated using the synthea application,

  • Medical reports – unstructured, medical health report documents obtained from MTSamples.

These datasets, although unrelated, are used together to compose a combined dataset.

Synthetic – synthea-based

This dataset consists of synthetic EHRs generated using synthea, a synthetic patient generator that models the medical history of the generated patients. For this tutorial, we generated EHRs for 100 patients and exported them as CSV files. The command used, run from the main synthea directory, was:

./run_synthea -p 100 \
  --generate.append_numbers_to_person_names=false \
  --exporter.csv.export=true

The pre-generated files are also provided in compressed form as examples/rawdata/synsamples.tgz, so running synthea is not required.

The generated dataset consists of the following files:

  • allergies.csv – Patient allergy data,

  • careplans.csv – Patient care plan data, including goals,

  • conditions.csv – Patient conditions or diagnoses,

  • encounters.csv – Patient encounter data,

  • imaging_studies.csv – Patient imaging metadata,

  • immunizations.csv – Patient immunization data,

  • medications.csv – Patient medication data,

  • observations.csv – Patient observations including vital signs and lab reports,

  • patients.csv – Patient demographic data,

  • procedures.csv – Patient procedure data including surgeries.

For more details about the generated files and the schema definition, please refer to the official synthea wiki page. Sample records are shown in the Advanced: preparing a DB schema for CogStack part.

In this example we use a subset of the available data: as a simple use case, only the patients.csv, encounters.csv and observations.csv files are used. Each of these files is also represented as a separate table in the db_samples database. For more advanced use cases, please check Example 3 in the Examples section, which uses the full dataset.
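After unpacking the pre-generated archive (for instance with tar -xzf examples/rawdata/synsamples.tgz -C into a directory of your choice), it can be useful to check that all ten CSV files listed above are present. The following is a minimal sketch; the helper name check_synthea_csvs is hypothetical, and the directory layout inside the archive is not documented here, so the directory holding the CSV files is passed in as an argument.

```shell
#!/bin/sh
# Hypothetical helper (not part of CogStack): verify that a directory holds
# every CSV file named in the list above. Returns 0 if all are present.
check_synthea_csvs() {
  dir=$1
  missing=0
  for name in allergies careplans conditions encounters imaging_studies \
              immunizations medications observations patients procedures; do
    if [ ! -f "$dir/$name.csv" ]; then
      echo "missing: $name.csv" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage, assuming the CSV files were unpacked into ./synsamples:
#   check_synthea_csvs ./synsamples && echo "all files present"
```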

Medical reports – MTSamples

MTSamples is a collection of transcribed medical sample reports and examples. The reports are in a free-text format and have been downloaded directly from the official website.

Each report contains information such as:

  • Sample Type,

  • Medical Specialty,

  • Sample Name,

  • Short Description,

  • Medical Transcription Sample Report (in free text format).

The collection comprises 4873 documents in total. A sample document is shown in the Advanced: preparing a DB schema for CogStack part.

Preparing the data

For ease of use, a database dump with a predefined schema and preloaded data is provided in the examples/example2/db_dump directory. This way, the PostgreSQL database is automatically initialized when deployed using Docker. The database dump for this example (alongside the others) can also be downloaded directly from Amazon S3 by running the following in the main examples/ directory:

bash download_db_dumps.sh

Alternatively, the PostgreSQL database schema definition used in this tutorial, db_create_schema.sql, is stored in the examples/example2/extra/ directory alongside the script prepare_db.sh, which generates the database dump. More information on creating the database schema can be found in the Advanced: preparing a DB schema for CogStack part.
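Whichever route is taken, the database container can only be initialized if the dump directory actually contains something. The following is a minimal sketch of such a check. The helper name dump_dir_ready is hypothetical, and because the names of the dump files are not stated here, it only tests that the directory exists and is not empty.

```shell
#!/bin/sh
# Hypothetical helper (not part of CogStack): succeed only if the given
# directory exists and contains at least one entry.
dump_dir_ready() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Usage from the bundle's root directory, after downloading the dumps:
#   (cd examples && bash download_db_dumps.sh)
#   dump_dir_ready examples/example2/db_dump || echo "no dump found" >&2
```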