Running CogStack

Overview

The CogStack Pipeline application can be run in either of two ways:

  • Locally, as a standalone Java application,
  • Inside a Docker container (possibly deployed as a microservice inside an ecosystem).

The latter way of running CogStack is the highly recommended one and is covered extensively by multiple examples in the Examples section.


Note

Please note that running a sample CogStack Pipeline job also requires a CogStack configuration file that defines the properties used, the pipeline components, the data processing steps, etc. Please refer to /wiki/spaces/COGEN/pages/37945560 for a detailed description of the available properties.

Running as a standalone app

Prerequisites

Java Runtime Environment

CogStack Pipeline requires Java SE Runtime Environment version 11.0 or later to be present on the system. The most commonly used JDK distributions are Oracle JDK and OpenJDK.


Note

Please note that, because of the licensing change introduced by Oracle with version 11 (a commercial license is required for production use and for support), we recommend using the OpenJDK variant. More information about the licensing is available here.

In our Docker images we are also now using a base image with OpenJDK.
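The version requirement above can be checked before launching the pipeline. The following is a minimal sketch, not part of CogStack itself: the helper name java_major_version is hypothetical, and the script assumes that the first line of `java -version` contains the version in double quotes, as OpenJDK and Oracle builds print it.

```shell
#!/bin/sh
# Extract the major Java version from the first line of `java -version` output.
# Handles both the legacy "1.8.0_201" scheme and the newer "11.0.2" scheme.
# $1: a line such as: openjdk version "11.0.2" 2019-01-15
java_major_version() {
    version=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
    case "$version" in
        1.*) version=${version#1.} ;;   # legacy scheme: 1.8.0_201 -> 8.0_201
    esac
    # Keep only the leading number (cut at the first '.', '+', '_' or '-').
    printf '%s\n' "${version%%[.+_-]*}"
}

if command -v java >/dev/null 2>&1; then
    line=$(java -version 2>&1 | head -n 1)   # java prints its version to stderr
    major=$(java_major_version "$line")
    case "$major" in
        ''|*[!0-9]*) echo "could not parse Java version from: $line" ;;
        *)
            if [ "$major" -ge 11 ]; then
                echo "Java $major found: OK"
            else
                echo "Java $major found: version 11 or later is required"
            fi
            ;;
    esac
else
    echo "java not found on PATH"
fi
```

The parsing is kept in a separate function so that it can be reused, for example in an installation script that has to decide whether to install OpenJDK.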

External applications

Selected components of CogStack Pipeline rely on additional, external applications when processing data. These need to be installed on the system prior to running CogStack:

  • Tesseract OCR (version >= 4.0) – for extracting text from images,
  • ImageMagick – for performing conversion between image formats.


Note

Tesseract version 4.0 introduced significant improvements in the quality of the OCR process, hence CogStack version 1.3.0 was updated to use it. However, please note that on some older Debian / Ubuntu distributions Tesseract may need to be installed manually or compiled from source. For more information, please refer to the official Tesseract wiki.
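Whether the external applications are present can be verified with a short script. This is a sketch under assumptions: the helper missing_tools is hypothetical, and the binary names tesseract and convert are assumed (ImageMagick 7 installs magick as its main command, earlier releases install convert).

```shell
#!/bin/sh
# Print the names of the given commands that cannot be found on PATH,
# one per line. Prints nothing when every command is available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed binary names; adjust to the local installation.
missing=$(missing_tools tesseract convert)
if [ -n "$missing" ]; then
    echo "missing external tools:" $missing
else
    # The first line of the output names the installed Tesseract version,
    # which should be 4.0 or later.
    tesseract --version 2>&1 | head -n 1
fi
```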

Running locally

Please see below: Running the pipeline.





Running as a containerised app

The CogStack Pipeline application can also be run inside a container, using the official Docker image published under cogstacksystems on Docker Hub. This is the highly recommended method of running CogStack Pipeline. Docker provides lightweight virtualisation of the various microservices that CogStack makes use of. Hence, when coupled with Docker Compose for microservice orchestration, all of the components required to use CogStack can be set up with a few simple commands.

There are two images available to use: cogstacksystems/cogstack-pipeline:latest (stable) and cogstacksystems/cogstack-pipeline:dev-latest (development) – see: Building CogStack for more information.

The Dockerfile used to build both images is available in the main CogStack pipeline directory.

Info

The base image used by CogStack Pipeline is OpenJDK JRE 11.


Prerequisites

The only prerequisite is to have Docker installed on the system, in version >= 1.13.

Running

CogStack Pipeline can be run either as a single container or as part of an ecosystem, communicating with other microservices.

Using docker run

To run CogStack Pipeline inside a single container using Docker one can type:

docker run -it cogstacksystems/cogstack-pipeline:latest /bin/bash

This will launch the CogStack container and spawn a bash console. From the console, one can launch CogStack Pipeline as explained in Running the pipeline.
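A container started this way only sees the files baked into the image, so a job configuration kept on the local machine has to be mounted into it. The sketch below shows one way of doing that. The function name cogstack_shell, the DRY_RUN switch and the default local directory are hypothetical, and the mount point /cogstack/job_config is an assumption borrowed from the command line used in the Docker Compose example on this page.

```shell
#!/bin/sh
# Start an interactive CogStack container with a local configuration
# directory mounted read-only. With DRY_RUN=1 the docker command is only
# printed, which is handy for reviewing it first.
# $1: local configuration directory (absolute path, as required by docker -v)
cogstack_shell() {
    config_dir=${1:-"$PWD/cogstack"}
    # Build the command as the function's own argument list.
    set -- docker run -it --rm \
        -v "$config_dir:/cogstack/job_config:ro" \
        cogstacksystems/cogstack-pipeline:latest /bin/bash
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Example: DRY_RUN=1 cogstack_shell "$PWD/cogstack"
```

The --rm flag removes the container once the console is closed, which suits one-off experiments; drop it to keep the container around.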

Using docker-compose 

Running CogStack Pipeline as a container within a configured stack of microservices relies on a microservices configuration file (a Docker Compose file, in YAML format). Multiple sample configurations are covered in the Examples part of the documentation.

For example, in the docker-compose.yml file from Example 2, the CogStack Pipeline service is defined as:

cogstack-pipeline:
  image: cogstacksystems/cogstack-pipeline:latest
  volumes:
    - ./cogstack:/cogstack/job_config:ro
  environment:
    - SERVICES_USED=cogstack-job-repo:5432,samples-db:5432,elasticsearch-1:9200
    - LOG_LEVEL=info
    - FILE_LOG_LEVEL=off
  depends_on:
    - samples-db
    - cogstack-job-repo
    - elasticsearch-1
  command: /cogstack/run_pipeline.sh /cogstack/cogstack-*.jar /cogstack/job_config

It uses the latest version of the cogstack-pipeline image from Docker Hub. It also mounts the local ./cogstack directory, where the CogStack Pipeline configuration file(s) usually reside, read-only at the path given in the volumes entry inside the container. When deployed, it launches the CogStack Pipeline application through the run_pipeline.sh script, which processes the data according to the pipeline configuration files found in the directory passed as its last argument (here /cogstack/job_config). This must be the same directory the configuration was mounted at.

The run_pipeline.sh script is just a helper script that waits for the services specified in SERVICES_USED to become available and only then launches the pipeline. However, the pipeline can also be run as described in the Running the pipeline part.
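To illustrate the idea of waiting for services, here is a minimal sketch of how a SERVICES_USED value could be parsed and polled. It is not the actual content of run_pipeline.sh: the function names are hypothetical, the value is assumed to be a comma-separated list of host:port pairs as in the example above, and the nc (netcat) utility with its -z option is assumed to be installed.

```shell
#!/bin/sh
# Split a SERVICES_USED-style value ("host:port,host:port") into one
# "host port" pair per line.
parse_services() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=: read -r host port; do
        if [ -n "$host" ]; then
            printf '%s %s\n' "$host" "$port"
        fi
    done
}

# Block until every listed service accepts TCP connections.
# Note: this sketch retries indefinitely; a real script may want a timeout.
wait_for_services() {
    parse_services "$1" | while read -r host port; do
        until nc -z "$host" "$port" 2>/dev/null; do
            echo "waiting for $host:$port ..."
            sleep 2
        done
        echo "$host:$port is available"
    done
}

# Example: wait_for_services "$SERVICES_USED" && echo "all services are up"
```

Note that depends_on in the Docker Compose file only controls the start order of the containers and does not wait for the services inside them to be ready, which is why a polling step of this kind is useful.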

To deploy the CogStack Pipeline application as one of the microservices in the specified configuration, one only needs to type the following in the directory containing the YAML file:

docker-compose up

For more examples of deploying the services, please see the Examples part.



Running the pipeline

CogStack Pipeline is run as a command-line application – to run it, just type:

java [parameters] -jar cogstack-*.jar <directory>

where <directory> specifies the directory in which the CogStack configuration file(s) are kept and which will be parsed by the CogStack Pipeline application. This is the only mandatory parameter.

Moreover, CogStack Pipeline accepts a number of optional [parameters]:

  • -DLOG_LEVEL=<level> (default: INFO ; available: DEBUG | INFO | ERROR) – specifies the verbosity level of the logs displayed on standard output,
  • -DLOG_FILE_NAME=<name> – specifies the name of the file in which the application logs will be stored (in HTML format),
  • -DFILE_LOG_LEVEL=<level> (default: INFO ; available: DEBUG | INFO | ERROR) – specifies the verbosity level of the logs written to the file.
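As an illustration of how the parameters above fit together, the following sketch wraps the launch command in a small shell function. The function name run_cogstack and the default log file name cogstack_job_log are hypothetical, and the script assumes it is run from the directory that contains the CogStack jar file.

```shell
#!/bin/sh
# Launch CogStack Pipeline with explicit logging options.
# $1: directory with the CogStack configuration file(s) (mandatory)
# $2: console log level (default: INFO)
# $3: log file name (hypothetical default: cogstack_job_log)
run_cogstack() {
    config_dir=$1
    level=${2:-INFO}
    log_file=${3:-cogstack_job_log}
    # Fail early, before starting the JVM, if the directory is missing.
    if [ ! -d "$config_dir" ]; then
        echo "configuration directory not found: $config_dir" >&2
        return 2
    fi
    java -DLOG_LEVEL="$level" \
         -DLOG_FILE_NAME="$log_file" \
         -DFILE_LOG_LEVEL=ERROR \
         -jar cogstack-*.jar "$config_dir"
}

# Example: run_cogstack ./job_config DEBUG
```

Here the console shows DEBUG output while only errors are written to the log file, a combination that can be convenient when testing a new job configuration.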


For a more detailed description of the available properties, please refer to the /wiki/spaces/COGEN/pages/37945560 page. Moreover, multiple Examples with sample job configurations are available.