Bio-Yodie

Note

This section is still a work in progress.

Overview

Bio-Yodie is a GATE application (stored as an xgapp application file) that can be used to extract medical concepts from free-text documents. It is also the core component of SemEHR.




Downloading

Bio-Yodie can be downloaded from the official GitHub repository: https://github.com/GateNLP/Bio-YODIE



Prerequisites

This section assumes a UNIX-based system for setting up and configuring Bio-Yodie. The same steps should also be possible on Windows with a UNIX environment installed, such as Cygwin, MinGW or Windows Subsystem for Linux.

Java Runtime Environment

The primary prerequisite is a Java Runtime Environment (JRE), version 8 or later.

Either the official Java JRE from Oracle or OpenJDK can be used.
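The version requirement can be checked from the command line. The snippet below is a minimal sketch: the function names java_major and installed_java_version are made up for illustration, and the parsing assumes the usual version strings printed by Oracle and OpenJDK builds.

```shell
# Extract the major Java version from a version string.
# Handles the legacy scheme ("1.8.0_292" -> 8) and the
# current scheme ("11.0.2" -> 11, "17" -> 17).
java_major() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;
        *)   echo "$1" | cut -d. -f1 ;;
    esac
}

# Read the quoted version string reported by the java binary on the PATH.
# `java -version` writes to stderr, hence the 2>&1 redirection.
installed_java_version() {
    java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }'
}

# Usage (only meaningful when java is installed):
#   v=$(installed_java_version)
#   [ "$(java_major "$v")" -ge 8 ] && echo "Java $v is recent enough"
```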

UMLS

Bio-Yodie has been primarily tested using the Unified Medical Language System (UMLS), a database of medical concept resources. However, distributing, accessing and using UMLS requires obtaining a license and accepting its terms and conditions. Hence, the Bio-Yodie resources need to be compiled manually using the provided scripts after obtaining UMLS (see below).

The UMLS dataset can be downloaded from the official website: https://www.nlm.nih.gov/research/umls/

The dataset is downloaded as a compressed file, umls-*-full.zip, which should occupy more than 4 GB compressed and 9.1 GB uncompressed.

GATE Developer

Note

Please note that two different versions of GATE are being used – one to prepare UMLS resources and one to run Bio-Yodie.

Bio-Yodie requires GATE Developer to be installed on the system to run. The required GATE version is >= 8.5.

However, the scripts provided for generating UMLS resources locally for Bio-Yodie (see below) are only compatible with GATE versions prior to 8.5. Therefore, an additional, older GATE installation needs to be present on the system.

This implies that, when running the resource generation scripts, the environment variable $GATE_HOME needs to point to the older GATE installation directory.

Afterwards, when running Bio-Yodie, $GATE_HOME should point to the newer GATE installation.
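Switching between the two installations can be wrapped in a small helper. This is only a sketch: the installation paths and the names use_gate, GATE_OLD_HOME and GATE_NEW_HOME are hypothetical and need to be adapted to the local system.

```shell
# Hypothetical installation paths; adjust to the local system.
GATE_OLD_HOME="${GATE_OLD_HOME:-/opt/gate-8.4.1}"   # used for resource preparation
GATE_NEW_HOME="${GATE_NEW_HOME:-/opt/gate-8.5}"     # used for running Bio-Yodie

# Point GATE_HOME at the installation matching the task.
use_gate() {
    case "$1" in
        prep) GATE_HOME="$GATE_OLD_HOME" ;;
        run)  GATE_HOME="$GATE_NEW_HOME" ;;
        *)    echo "usage: use_gate prep|run" >&2; return 1 ;;
    esac
    export GATE_HOME
}

# Usage:
#   use_gate prep   # before generating the UMLS resources
#   use_gate run    # before running Bio-Yodie
```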

Bio-Yodie

Bio-Yodie can be downloaded from the official GitHub repository: https://github.com/GateNLP/Bio-YODIE

The configuration guide and scripts are provided in the GitHub repository. Configuring Bio-Yodie should boil down to running the following command in the main Bio-Yodie directory:

bash update.sh

The script takes care of downloading and compiling all the prerequisites for Bio-Yodie, including external plugins. After that, one can proceed to preparing the UMLS resources.

The generated UMLS resources for Bio-Yodie should be placed in (or symlinked as) the directory bio-yodie-resources inside the main Bio-Yodie directory.

 

Preparing the Bio-Yodie resources

UMLS MetamorphoSys tool

After downloading and extracting the UMLS dataset, it needs to be pre-processed for Bio-Yodie by running the MetamorphoSys tool. MetamorphoSys is the UMLS installation wizard and Metathesaurus customisation tool included in each UMLS release. It is shipped as the mmsys.zip file inside the decompressed UMLS package.

Next, decompress the mmsys.zip file so that the mmsys application resides in the same directory as all the decompressed UMLS files.

To start mmsys, run the appropriate run_* script for the platform. The MetamorphoSys wizard will appear and allow customising which UMLS resources are to be exported. After following the wizard and selecting the appropriate configuration, the exported UMLS resources will be available in the specified output directory. These resources are used in the next step to generate the resources for Bio-Yodie.

A detailed description of the MetamorphoSys tool can be found on the official MetamorphoSys website.
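The extraction and launch steps can be sketched as follows. The launcher names run_linux.sh and run_mac.sh are assumptions based on the run_* pattern and should be checked against the files actually extracted from mmsys.zip; mmsys_launcher is a made-up helper name.

```shell
# Pick the MetamorphoSys launcher for an OS name as printed by `uname -s`.
# The script names are assumptions; verify them in the extracted directory.
mmsys_launcher() {
    case "$1" in
        Linux)  echo "run_linux.sh" ;;
        Darwin) echo "run_mac.sh" ;;
        *)      echo "unsupported OS: $1" >&2; return 1 ;;
    esac
}

# Usage, from the directory holding the decompressed UMLS files:
#   unzip mmsys.zip
#   sh "./$(mmsys_launcher "$(uname -s)")"
```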

Preparing UMLS resources for Bio-Yodie

Note

Please note that GATE version <= 8.4.1 should be used to generate the UMLS resources for Bio-Yodie, and the $GATE_HOME environment variable should be set accordingly.

The scripts to generate the Bio-Yodie resources are provided in a separate GitHub repository: https://github.com/GateNLP/bio-yodie-resource-prep

The official GitHub page covers setting up the environment, downloading the dependencies, linking the UMLS input and output resources, etc.

Info

Generating the resource files for Bio-Yodie from a default UMLS export can be a very resource-intensive process. Hence, it is advisable to run it on a machine (or cluster) with at least 16 GB of RAM and 100 GB of free disk space.
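A quick pre-flight check against these figures could look like the sketch below. The helper names are made up for illustration, and total_ram_gb reads /proc/meminfo, so it works on Linux only.

```shell
# Minimums taken from the recommendation above.
MIN_RAM_GB=16
MIN_DISK_GB=100

# Succeeds when the available amount (whole GB) meets the minimum.
meets_minimum() {
    [ "$1" -ge "$2" ]
}

# Free disk space in whole GB for the filesystem containing a path.
# `df -Pk` prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
free_disk_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Total RAM in GB, rounded to the nearest integer (Linux only).
# Rounding matters because a nominal 16 GB machine reports slightly less.
total_ram_gb() {
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024 + 0.5) }' /proc/meminfo
}

# Usage:
#   meets_minimum "$(total_ram_gb)" "$MIN_RAM_GB"    || echo "less than 16 GB of RAM"
#   meets_minimum "$(free_disk_gb .)" "$MIN_DISK_GB" || echo "less than 100 GB free"
```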

After downloading the required dependencies and linking the input UMLS resources, one should be able to generate the resources for Bio-Yodie by running the following command in the main directory:

bash bin/all.sh

Please note that the resource generation script contains multiple commented-out lines for generating additional data. These can be enabled and adapted to the requirements of the NLP pipeline.

Warn

There are numerous issues when trying to run the resource generation script bin/all.sh on macOS. Either the scripts need to be adapted to run on macOS, or GNU coreutils and GNU sed need to be installed as replacements for the bundled tools, which are outdated. Hence, the easiest way is to run the script on Linux.
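For those who still want to try macOS, one possible way of putting the GNU tools first on the PATH is sketched below. It assumes Homebrew, which is not mentioned in the original instructions, and its coreutils and gnu-sed formulae; gnu_tools_path is a made-up helper name. Whether this is enough for bin/all.sh to run cleanly has not been verified here.

```shell
# Build a PATH prefix exposing the unprefixed GNU tools installed by Homebrew.
# $1 is the Homebrew prefix, e.g. the output of `brew --prefix`.
gnu_tools_path() {
    echo "$1/opt/coreutils/libexec/gnubin:$1/opt/gnu-sed/libexec/gnubin"
}

# Usage on macOS (assumes Homebrew is installed):
#   brew install coreutils gnu-sed
#   PATH="$(gnu_tools_path "$(brew --prefix)"):$PATH"
#   export PATH
#   sed --version | head -n 1    # should now report GNU sed
#   bash bin/all.sh
```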


Output files

The generated resources will be stored in the output directory. This directory should contain the following sub-directories and files (shown with their sizes):

output
├── databases
└── en
    ├── databases
    │   ├── [1.8G]  labelinfo.h2.db
    │   └── [  72]  labelinfo.trace.db
    └── gazetteer-en-bio
        ├── [  28]  cased-labels.def
        ├── [ 16M]  cased-labels.lst
        ├── [140M]  labels.lst
        ├── [  32]  uncased-labels.def
        └── [150M]  uncased-labels.lst

These resources should either be placed in the directory bio-yodie-resources inside the main Bio-Yodie directory or symlinked there.

Please note that, depending on the version of the UMLS resources, the actual sizes of the generated resources may differ from the ones presented above.
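Before linking, it can be useful to confirm that the main files from the listing above are present. The following is a minimal sketch: check_resources is a made-up helper, the paths in the usage comment are placeholders, and the file list simply mirrors the listing above (the small labelinfo.trace.db file is left out on purpose).

```shell
# Report any expected resource file missing under the given output directory.
# Returns non-zero when at least one file is absent.
check_resources() {
    missing=0
    for f in \
        en/databases/labelinfo.h2.db \
        en/gazetteer-en-bio/cased-labels.def \
        en/gazetteer-en-bio/cased-labels.lst \
        en/gazetteer-en-bio/labels.lst \
        en/gazetteer-en-bio/uncased-labels.def \
        en/gazetteer-en-bio/uncased-labels.lst
    do
        if [ ! -f "$1/$f" ]; then
            echo "missing: $1/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (both paths are placeholders):
#   check_resources /path/to/output &&
#       ln -s /path/to/output /path/to/Bio-YODIE/bio-yodie-resources
```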

Running Bio-Yodie

Note

Please note that GATE version >= 8.5 should be used to run Bio-Yodie, and the $GATE_HOME environment variable should be set accordingly.

Using GATE Developer GUI

The easiest way to run Bio-Yodie is through the GATE Developer GUI. It can also be run as a standalone application using GATE Embedded.

More information about running custom GATE applications using GATE Developer can be found in the official GATE documentation and tutorials.


Using GATE GCP

Custom GATE applications can also be run in batch mode using GATE Cloud Paralleliser (GCP). Since GCP offers a highly configurable command line interface (CLI), there are multiple ways in which custom GATE applications can be run with it.

A sample batch configuration file is shown below:

example-batch.xml
<?xml version="1.0" encoding="UTF-8"?>
<batch id="sample" xmlns="http://gate.ac.uk/ns/cloud/batch/1.0">

	<!-- path to Bio-Yodie application -->
	<application file="./bioyodie/application.xgapp"/>

	<!-- path to output report file -->
	<report file="./sample-report.xml"/>

	<!-- configuration of the data source to be processed (directory, files type, ...) -->
	<input dir="./input-files"
		mimeType="text/plain"
		compression="none"
		encoding="UTF-8"
		class="gate.cloud.io.file.FileInputHandler"/>

	<!-- configuration of the data sink -->
	<output dir="./output-files-gate"
		compression="none"
		encoding="UTF-8"
		fileExtension=".GATE.xml"
		class="gate.cloud.io.file.GATEStandOffFileOutputHandler"/>

	<!-- documents list to be processed -->
	<documents>
		<id>5.txt</id>
	</documents>

</batch>

With the configuration stored as example-batch.xml, Bio-Yodie can be run from the command line as:

java -jar gcp-cli.jar example-batch.xml

Bio-Yodie will be run using the GATE GCP CLI with the provided batch configuration file. The example configuration can be further tailored to specific requirements.

More information on running GATE applications using GATE GCP can be found in the official GATE user guide.
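One simple way of tailoring the configuration is to list every file of the input directory in the documents element instead of a single document. The helper below is only a sketch: document_ids is a made-up name, and it assumes file names without newlines or XML special characters (&, <, >).

```shell
# Print one <id> element per regular file in the given input directory,
# indented with two tabs to match the sample configuration.
document_ids() {
    for f in "$1"/*; do
        [ -f "$f" ] && printf '\t\t<id>%s</id>\n' "$(basename "$f")"
    done
    return 0
}

# Usage: paste the output between <documents> and </documents>
#   document_ids ./input-files
```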


Integrating with CogStack

Note

We are currently revisiting this topic, so the way of integrating Bio-Yodie with CogStack Pipeline will change in order to simplify it.


Since Bio-Yodie is a GATE-based application, there are a couple of ways in which it can be run inside the CogStack Pipeline, in particular:

  • Run as a GATE application bundled with CogStack Pipeline using the gate profile,
  • Run as an external web application exposing a REST endpoint that communicates with CogStack Pipeline within the webservice profile.

Bundled with CogStack Pipeline

At the moment, due to issues with UMLS licensing, we cannot provide a prebuilt CogStack Pipeline Docker image with Bio-Yodie and UMLS bundled. Hence, such an image needs to be built manually. An example of building a CogStack Pipeline image with a sample GATE application is available in the Examples section. Please note that these examples still use GATE Embedded version 8.4.1.

As an alternative, CogStack Pipeline can be run on a system with GATE and Bio-Yodie with UMLS already installed. In such cases, the job properties file for the GATE application needs to point to the local instance of GATE and Bio-Yodie. For more details, please see /wiki/spaces/COGEN/pages/37945560 and the Examples section.

In order to overcome these issues, we are focusing on running NLP applications as a separate service.

Running as a separate service

In multiple internal releases, Bio-Yodie was run with CogStack Pipeline as a service, but using GATE Embedded version 8.4.1. Since the new version of Bio-Yodie requires GATE 8.5, we are now revisiting this topic and updating the webservice wrapper component.