Welcome to the PAIS and PIDB wiki!

Introduction

The systematic analysis of imaged pathology specimens produces a vast amount of morphological information at both the cellular and sub-cellular scales. This information has tremendous potential for providing insight into the underlying mechanisms responsible for disease onset and progression. One major obstacle to wider adoption of these new technologies throughout the clinical and scientific communities is the challenge of managing, querying, and integrating the large volumes of complex data that result from the analysis and annotation of large digital pathology data sets.

We have developed a Pathology Analytical Imaging Standards (PAIS) data management system to model, manage, and query analytical results and human annotations of pathology images, and a Pathology Image Database System (PIDB) to model, manage, and query whole slide images. The software has been deployed as the core data infrastructure for the NCI In Silico Brain Tumor Research Center at Emory University to support integrated morphologic analysis, managing 200 million nuclei and 15 billion image features. PAIS is also deployed at the Cancer Institute of New Jersey to manage analytical results and annotations from 4740 tissue microarray discs (histospots) of breast cancer. In the algorithm validation project at Emory University, PAIS serves as the backend engine for algorithm evaluation. In the Informatics for Integrative Brain Tumor Whole Slide Analysis project, the software serves as the data management infrastructure for analytical results derived from multiplex quantum dot IHC images.

The PAIS system includes three modules:

  • Database design and management
  • Data generation, pre-processing and uploading
  • Data analysis and visualization.

Each module is described in detail in the sections below.

Database Design and Management

We use the DB2 spatial database to store and manage our spatial data. To store the whole slide images and the related information, we define two database schemas: pi and pais. To analyze the image data, we provide a set of user-defined functions and stored procedures.

You can find how to create the database, schemas, and stored procedures here.

Data Generation, Pre-processing and Uploading

Given the whole slide images, we first generate the information the application needs, such as the boundaries of the nuclei in the images. A pre-processing step then converts the extracted raw data into the format that the later steps expect. Finally, the formatted data is uploaded to the database for further analysis and visualization. This module therefore has three sub-modules: data generation, data pre-processing, and data uploading.

Data Generation

For our application, we need to extract the boundary of each object (nucleus) in the whole slide images. Processing a whole slide image directly is unrealistic because of its large size (about 1GB) and high resolution (about 50000 × 50000). We therefore first cut the whole slide image into tiles of size 4k × 4k and then extract the boundary information from these smaller tiles. The whole slide image tiling is based on OpenSlide. The boundary information of objects can be extracted automatically with the image segmentation methods in MATLAB and OpenCV. For boundary information from manual annotations, Aperio's image viewer can generate the corresponding markup XML file.
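The tiling step can be illustrated with a minimal sketch that only computes the tile rectangles. It assumes that "4k" means 4096 pixels and that edge tiles are clipped to the image bounds; the function name `tile_grid` is hypothetical and is not part of the PAIS tools.

```python
from typing import Iterator, Tuple


def tile_grid(width: int, height: int, tile: int = 4096) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (x, y, w, h) rectangles that cover a width x height image.

    Tiles are emitted row by row; tiles on the right and bottom edges are
    clipped so that they never extend past the image.
    """
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield x, y, min(tile, width - x), min(tile, height - y)


# Example: a slide of roughly the size mentioned above.
tiles = list(tile_grid(50000, 50000))
```

With the OpenSlide Python bindings, each rectangle could then be read with `slide.read_region((x, y), 0, (w, h))`, although the original tiling code may be organized differently.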

A second kind of data supports the zoom-in and zoom-out functions used when visualizing whole slide images; it is based on the pyramidal file structure.
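As a rough illustration of a pyramidal structure, the sketch below lists the image dimensions at each level when every level halves the previous one. The downsampling factor of 2, the stopping threshold `min_side`, and the function name are all assumptions; the source does not specify how the pyramid is built.

```python
def pyramid_levels(width, height, min_side=256):
    """Return [(w, h), ...] from full resolution down to a small overview.

    Each level halves the previous one (rounding up) until the longer side
    is no larger than min_side.
    """
    levels = [(width, height)]
    while max(levels[-1]) > min_side:
        w, h = levels[-1]
        levels.append((max(1, (w + 1) // 2), max(1, (h + 1) // 2)))
    return levels


levels = pyramid_levels(50000, 50000)
```

A viewer can then serve a zoom request from the coarsest level that still has enough pixels for the requested view, rather than reading the full-resolution image.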

Data Pre-processing

We use the WKT (well-known text) format to store the boundary information of objects (polygons) in the whole slide image. For the markup XML file generated by Aperio, we provide a tool, AperioXMLParser, that converts the XML file to our wkt format.
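The conversion from Aperio markup to WKT can be sketched as follows. This is not the AperioXMLParser implementation; it assumes the annotation layout commonly written by Aperio's viewer (`Region` elements containing `Vertex` elements with `X` and `Y` attributes), and the sample document and function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed annotation file.
SAMPLE = """<Annotations>
  <Annotation Id="1">
    <Regions>
      <Region Id="1">
        <Vertices>
          <Vertex X="10" Y="10"/>
          <Vertex X="20" Y="10"/>
          <Vertex X="20" Y="20"/>
        </Vertices>
      </Region>
    </Regions>
  </Annotation>
</Annotations>"""


def regions_to_wkt(xml_text):
    """Return one WKT POLYGON string per Region with at least 3 vertices."""
    root = ET.fromstring(xml_text)
    polygons = []
    for region in root.iter("Region"):
        pts = [(float(v.get("X")), float(v.get("Y"))) for v in region.iter("Vertex")]
        if len(pts) < 3:
            continue  # not enough points to form a polygon
        if pts[0] != pts[-1]:
            pts.append(pts[0])  # WKT rings repeat the first point at the end
        coords = ", ".join(f"{x:.10g} {y:.10g}" for x, y in pts)
        polygons.append(f"POLYGON (({coords}))")
    return polygons


wkt_polygons = regions_to_wkt(SAMPLE)
```

Each resulting string describes one closed ring, which is the form a spatial database generally expects for polygon input.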

The extracted polygons in the wkt files may be invalid (some polygons may not be closed, or may be self-intersecting), so we developed another tool, PAISBoundayFixer, to fix these problems and ensure that all polygons are suitable for the subsequent analysis.
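The two kinds of invalid polygon mentioned above can be illustrated with a small pure-Python sketch. It closes an open ring and detects, but does not repair, self-intersections; it only catches edges that properly cross, not edges that merely touch or overlap. The function names are hypothetical, and the actual fixer may use a different strategy.

```python
def close_ring(points):
    """Drop consecutive duplicate points and repeat the first point at the end.

    Expects a non-empty list of (x, y) tuples.
    """
    ring = [points[0]]
    for p in points[1:]:
        if p != ring[-1]:
            ring.append(p)
    if ring[0] != ring[-1]:
        ring.append(ring[0])
    return ring


def _orient(a, b, c):
    """Signed area test: > 0 if c is left of a->b, < 0 if right, 0 if collinear."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _properly_cross(p1, p2, p3, p4):
    """True if segment p1-p2 and segment p3-p4 cross at an interior point."""
    d1, d2 = _orient(p3, p4, p1), _orient(p3, p4, p2)
    d3, d4 = _orient(p1, p2, p3), _orient(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0


def is_self_intersecting(ring):
    """Check every pair of non-adjacent edges of a closed ring (O(n^2))."""
    edges = list(zip(ring, ring[1:]))
    n = len(edges)
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # first and last edge share the closing vertex
            if _properly_cross(*edges[i], *edges[j]):
                return True
    return False


square = close_ring([(0, 0), (4, 0), (4, 4), (0, 4)])
bowtie = close_ring([(0, 0), (4, 4), (4, 0), (0, 4)])
```

The quadratic check is acceptable for nucleus-sized boundaries with few vertices; a sweep-line method or a geometry library would be a better fit for large polygons.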

Besides the valid polygons, some extra information is needed to generate the final XML file. One such piece of information is PAIS_UID, which distinguishes the results that different studies (or algorithms) produce for the same images. To generate a PAIS_UID, run the provided tool PAISIdentifierGenerator with a specific config file as input.

After obtaining all the required information, we use the tool PAISDocumentGenerator to generate the XML file that our system consumes. Once compressed, the XML files are ready to be uploaded to the database.

All of the above steps can be performed with our paistools.

Data Uploading

We use two schemas to store the information about whole slide images. The pi schema stores the basic information about the images, such as image size and image path. The pais schema stores the information about the objects in the images and other related information, such as the extracted features and some statistical information.

To upload the image information to pi, we provide the imageuploader tool.

To upload the information about image objects, we provide two tools. The first, PAISDocumentUploader, uploads the compressed XML files to a temporary table. The second, PAISDataLoadingManager, unzips the compressed XML files and stores their contents in the pais schema.
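The two-stage upload can be sketched with an in-memory SQLite database standing in for DB2. Everything here is an assumption made for illustration: the table names `staging` and `documents`, the use of zip archives as the compression format, and the function names. The real loader also parses the XML into the pais tables, which this sketch does not attempt.

```python
import io
import sqlite3
import zipfile


def make_zip(name, xml_text):
    """Compress one XML document into an in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, xml_text)
    return buf.getvalue()


def stage(conn, name, blob):
    """Stage 1: store the compressed document as-is in a temporary table."""
    conn.execute("INSERT INTO staging (name, payload) VALUES (?, ?)", (name, blob))


def load_staged(conn):
    """Stage 2: unzip every staged document and move it to its final table."""
    rows = conn.execute("SELECT id, payload FROM staging").fetchall()
    for row_id, payload in rows:
        with zipfile.ZipFile(io.BytesIO(payload)) as zf:
            for member in zf.namelist():
                xml_text = zf.read(member).decode("utf-8")
                conn.execute(
                    "INSERT INTO documents (name, xml) VALUES (?, ?)",
                    (member, xml_text),
                )
        conn.execute("DELETE FROM staging WHERE id = ?", (row_id,))
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, name TEXT, payload BLOB)")
conn.execute("CREATE TABLE documents (name TEXT, xml TEXT)")
stage(conn, "doc1.zip", make_zip("doc1.xml", "<doc/>"))
loaded = load_staged(conn)
```

Separating the upload from the loading keeps the transfer step cheap and lets the more expensive unzip-and-store step run on the database side.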

After all the above steps, intermediate data for further analysis and visualization can be generated with the provided stored procedures load_patient.sql and sp_histogram.sql. The first loads the patient information, and the second computes related statistical information.

Here you can find the detailed steps to upload data.

Data Analysis and Visualization

We developed a web portal to demonstrate our system. The web APIs are implemented in a RESTful style, and Tomcat + Apache2 serve as the web server. To build this website, you need to install and configure Apache2, Tomcat, and the OpenSlide library.

After installing all the required tools, place the two files portal.war and tcga.war in the deployment folder of Tomcat and Apache2. The setup is complete if you can successfully visit localhost/portal in a browser.

Detailed steps to build the web portal can be found here.