Tauber Bioinformatics Research Center
The Tauber Bioinformatics Research Center was established in 2012 at the University of Haifa in Israel, based on a donation from the Laszlo N. Tauber Family Foundation, funding from the University, and several scientific grants (including grants from DARPA). The Center is under the auspices of the Research Authority at the University of Haifa.
The Center aims to develop scalable high-performance computing (HPC) hardware/software solutions for the analysis and integration of big “omics” data, as well as for linking omics with clinical data.
Mastering the Data
If cost-effective health care strategies are to be determined efficiently, methods must be devised to discover clinically significant correlations that provide reliable predictions. We have addressed this fundamental challenge by developing an integrative platform capable of processing complex datasets from diverse sources, ranging from genomic analyses to clinical responses. The bioinformatics analysis principles and algorithms are introduced through the T-BioInfo web-based visual bioinformatics and machine learning platform, which incorporates popular open-source algorithms as well as proprietary algorithms developed by the Tauber group for the concrete needs of the biological and medical communities. T-BioInfo is a multi-omics platform that runs on high-performance server clusters accessed via a web interface, allowing anyone to analyze huge datasets from a laptop, tablet, or PC in a short time frame. A special version of the platform contains educational datasets and is enriched with multimedia resources and intuitive features designed to assist inexperienced users.
T-BioInfo Platform: Technical Description
The T-BioInfo platform consists of sections for the analysis of several data types: Next Generation Sequencing (NGS) reads for various types of nucleic acid sequence-based omics data (genomics, epigenetics, and transcriptomics), mass spectrometry data (metabolomics and proteomics), and structural information about biopolymers, including atom coordinates (structures of small molecules, proteins, DNA/RNA, and antibodies). Because the platform can work with data in tabular format in addition to the raw omics data formats, clinical data and the results of earlier analyses are readily integrated into programmed analyses.
The Platform utilizes a flexible user interface for data processing and integration. The interface is simple and intuitive, so biomedical researchers, technicians, and students who may not be familiar with bioinformatics can readily run many kinds of analyses. The interface allows preparation of several pipelines of algorithmic modules for analyzing the same input data. By running parallel pipelines, cross-comparison of the analysis results and multiple integrations may be achieved. An example of pipeline construction and its interface is depicted below: grey buttons are algorithmic modules that are combined into a pipeline (yellow dotted line with the selected modules highlighted), which runs across vertical groups of algorithms. Each group unites several algorithms with the same analysis objective, and the interface automatically suggests which particular algorithm from a vertical group is most compatible with the already selected modules. Put simply, the pipeline is dynamically constructed to achieve the highest degree of analytic power.
Analysis of genome mutations. Detection of somatic (tumor-specific) mutations is based on contrasting tumor and blood samples from the same patient through the platform’s combination of proprietary and public-domain programs. For every position/mutation variant, the algorithms contrast tumor and normal samples via statistical distances derived from all possible combinations of genome alleles. Detection of mutation hot-spots is performed by a proprietary segmentation algorithm (the BS-algorithm).
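As an illustration of the contrast step, the sketch below (our own simplification, not the platform's proprietary code) scores one genome position by the Hellinger distance between the tumor and normal allele-frequency distributions; a position where the tumor gains a variant allele scores far higher than a shared germline position.

```python
from math import sqrt

def allele_distance(tumor_counts, normal_counts):
    """Hellinger distance between the allele-frequency distributions of
    tumor and normal samples at one genome position (an illustrative
    stand-in for the platform's statistical distance).  Keys are
    alleles, values are read counts."""
    alleles = set(tumor_counts) | set(normal_counts)
    t_total = sum(tumor_counts.values()) or 1
    n_total = sum(normal_counts.values()) or 1
    s = sum((sqrt(tumor_counts.get(a, 0) / t_total)
             - sqrt(normal_counts.get(a, 0) / n_total)) ** 2
            for a in alleles)
    return sqrt(s) / sqrt(2)  # normalized to the range [0, 1]

# A position where the tumor gains a variant allele scores much higher
# than a position whose distributions match between the two samples.
somatic = allele_distance({"A": 40, "T": 60}, {"A": 98, "T": 2})
germline = allele_distance({"A": 50, "G": 50}, {"A": 49, "G": 51})
```

The counts above are invented; in practice such a score would be computed per position from aligned reads and thresholded for significance.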
Early cancer diagnostics. Diagnostics from plasma DNA rely on sequencing short double-stranded DNA fragments wrapped around histones. These fragments are typically released by necrotic and apoptotic cells, both tumor and normal. Thus, the highest-frequency mutation variants are expected to come from normal cells, while far lower-frequency mutation variants are expected from tumor cells. Because they are low-frequency, the mutation variants from cancers can be confounded with sequencing errors. To resolve this problem, our approach builds a consensus of matching reads from the sequenced ends of each dsDNA fragment and thus eliminates sequencing errors, i.e., cases in which a variant appears in one of the two matching reads but not in the other.
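The consensus idea can be sketched as follows; the function names are invented for illustration, and the real pipeline of course operates on full read alignments rather than bare strings:

```python
def consensus_variant(base_1, base_2):
    """Accept a base call only when the two matching reads covering the
    same dsDNA fragment position agree; disagreement is treated as a
    likely sequencing error."""
    return base_1 if base_1 == base_2 else None

def fragment_consensus(read_1, read_2):
    """Per-position consensus over two matching reads of equal length;
    None marks positions rejected as probable sequencing errors."""
    return [consensus_variant(b1, b2) for b1, b2 in zip(read_1, read_2)]
```

For instance, `fragment_consensus("ACGT", "ACTT")` rejects the third position, where the two reads disagree, while keeping the rest.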
Analysis of ChIP-seq data for detecting histone-modification signals across the genome is performed by the Platform with several public-domain algorithms (HMM and Bayesian models) as well as with our BS-segmentation approach. Tests with real and simulated data show that the BS-algorithm outperforms the public-domain approaches in the number and accuracy of detected histone-modification fragments. Notably, the majority of BS-detected histone-modified fragments have a nucleosome-scale length of 180–200 bp, which is expected for genomic fragments carrying histone-modification signals.
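Since the BS-algorithm itself is proprietary, the following toy sketch only conveys the flavor of segmentation-based detection: it reports maximal runs of coverage bins that rise above a background threshold, the kind of enriched fragments a segmentation approach would return.

```python
def enriched_segments(coverage, threshold, min_len=2):
    """Report maximal runs of bins whose signal exceeds a background
    threshold, keeping runs of at least min_len bins.  Segments are
    returned as half-open (start, end) bin intervals."""
    segments, start = [], None
    for i, c in enumerate(coverage):
        if c > threshold and start is None:
            start = i                      # a run of enriched bins begins
        elif c <= threshold and start is not None:
            if i - start >= min_len:       # keep only sufficiently long runs
                segments.append((start, i))
            start = None
    if start is not None and len(coverage) - start >= min_len:
        segments.append((start, len(coverage)))
    return segments

# A toy coverage track with two enriched runs separated by background.
track = [0, 1, 5, 6, 7, 1, 0, 1, 6, 6, 6, 0]
```

On this track the sketch finds the two enriched intervals; a real segmentation method would also choose the threshold and bin size statistically rather than by hand.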
Detection of DNA methylated islands through analysis of bisulfite sequencing data, in particular, differentiating hyper- and hypo-methylated islands, is performed by the platform by employing several public domain algorithms as well as two of our own algorithms, both based on the BS-segmentation approach.
Genome-based detection of transcripts and their expression levels is typically performed with public-domain algorithms: mapping of reads to the genome, isoform detection, and expression-level estimation for genes and isoforms. We employ several proprietary QC, error-correction, and read-mapping algorithms, as well as unsupervised detection of transcripts by BS segmentation, whereby transcripts are identified as genome fragments enriched by mapped reads.
Detection of de novo transcripts from raw reads is performed by the standard Trinity software as well as by our Bi-Clustering + P-clustering procedures. The Bi-Clustering algorithm employs a bipartite graph between sequences and k-mers. The link-enriched subgraphs of this graph are Bi-Clusters that are used for accurate de novo assembly. Follow-up clustering of Bi-Clusters is performed, where the distance between two Bi-Clusters is defined by the k-mers and sequences they share. Next, de Bruijn assembly of the k-mers in every cluster generates transcripts without chimeras. Note that in assembling genomes and expressed transcripts of the microbiome, samples are processed by the same Trinity and Bi-Clustering plus P-clustering techniques.
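A minimal sketch of the sequence/k-mer linking idea follows. It is a crude stand-in: it links reads that share many k-mers and takes connected components as groups, whereas the real Bi-Clustering procedure extracts link-enriched subgraphs of the full bipartite graph.

```python
from itertools import combinations

def kmers(seq, k):
    """The set of all k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmer_groups(seqs, k=4, min_shared=3):
    """Link two reads when they share at least min_shared k-mers, then
    take connected components (via union-find) as read groups."""
    kmer_sets = {i: kmers(s, k) for i, s in enumerate(seqs)}
    parent = list(range(len(seqs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for i, j in combinations(range(len(seqs)), 2):
        if len(kmer_sets[i] & kmer_sets[j]) >= min_shared:
            parent[find(i)] = find(j)       # merge the two components

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Invented reads drawn from two different transcripts: the first two
# overlap each other, and so do the last two.
reads = ["ACGTACGTAC", "GTACGTACGG", "TTTTAAAACC", "TTAAAACCGG"]
```

The reads separate into two groups, one per source transcript; within each group a de Bruijn assembly of the shared k-mers would then reconstruct the transcript.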
The analysis of metabolomics data utilizes peak picking, time warping, and peak annotation, performed by public-domain software, XCMS in particular. Our own approach is based on the isotope arrangement of peaks in mass-spectrometry metabolomics and can be used in peak picking to distinguish true peaks from artefacts. The idea underlying this approach is to find four-point patterns on the mass-spec image plane that cluster together in one location of the image and provide the expected ratios of Gaussian-like intensity profiles of the isotope peaks. Several public-domain proteomics programs are implemented in the platform in parallel with our own algorithms, likewise developed on the basis of isotope-pattern detection.
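A much-simplified version of such an isotope-based filter might look as follows; the two-condition check is an assumption for illustration, whereas the platform's four-point pattern matching on the image plane is more elaborate.

```python
def looks_like_isotope_pattern(peaks, spacing=1.003, tol=0.01):
    """Accept a run of (m/z, intensity) peaks as a plausible isotope
    envelope when consecutive m/z gaps match the ~1.003 Da isotope
    spacing and intensities decay monotonically, as expected for the
    M+0, M+1, M+2 ... peaks of a small metabolite."""
    mzs = [m for m, _ in peaks]
    intensities = [i for _, i in peaks]
    gaps_ok = all(abs((b - a) - spacing) <= tol
                  for a, b in zip(mzs, mzs[1:]))
    decay_ok = all(x > y for x, y in zip(intensities, intensities[1:]))
    return gaps_ok and decay_ok

# An invented genuine envelope versus an artefact pair at the wrong gap.
true_peaks = [(180.063, 1000), (181.066, 110), (182.069, 12)]
artefact = [(180.063, 1000), (180.563, 900)]
```

Such a filter would be applied to candidate peak groups after peak picking, discarding those whose geometry is inconsistent with an isotope envelope.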
The Platform processes highly accurate CirSeq sequencing data on virus mutations. To address virus evolution and determine virus quasispecies, an approach for detecting fitness profiles of individual mutations, called Time-Fit, was developed. After smooth Time-Fit fitness profiles across infection passages are obtained for all genome positions/variants, the mutation variants are clustered based on correlations of their Time-Fit profiles. We found that a group of mutation variants in one cluster followed epistatic characteristics in adapting to the host and, therefore, might be considered core mutation variants specific to quasispecies of the virus.
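The clustering step can be sketched as below, with invented variant names and profile values: variants whose fitness trajectories across passages correlate strongly are paired together.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_pairs(profiles, r_min=0.9):
    """Pair up mutation variants whose fitness profiles across
    infection passages correlate above r_min."""
    names = list(profiles)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if pearson(profiles[a], profiles[b]) >= r_min]

profiles = {"A123G": [0.1, 0.3, 0.6, 0.9],   # rising fitness
            "C88T":  [0.2, 0.4, 0.7, 1.0],   # rises in step with A123G
            "G45A":  [0.9, 0.5, 0.3, 0.1]}   # declining fitness
```

Here only the two co-rising variants pair up, the kind of co-behavior that, over real CirSeq profiles, would suggest epistatically linked core mutations.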
In parallel, the T-BioInfo platform performs analyses of CirSeq and regular NGS data on virus genome mutations in conjunction with a variety of host omics data: transcriptomics, genomics, and epigenetics.
The detection of structural similarity between biopolymers and molecules (similar sub-structures of two molecules or two proteins) is based on representing each 3D structure as a set of strings of atom descriptors. These descriptors take into account each atom’s 3D structural neighborhood, which defines its physico-chemical properties. By seeking optimal alignments, the algorithm compares sets of descriptor strings to determine which atoms of the two structures must be superimposed to provide the closest correspondence of the strings.
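The string-comparison step can be illustrated with Python's standard difflib; the one-letter descriptor encoding here is invented, and the matching blocks between the two strings suggest which runs of atoms to superimpose.

```python
from difflib import SequenceMatcher

def align_descriptor_strings(desc_a, desc_b):
    """Return (start_a, start_b, length) for every non-empty matching
    block between two atom-descriptor strings; matched runs identify
    candidate atoms of the two structures to superimpose."""
    matcher = SequenceMatcher(None, desc_a, desc_b, autojunk=False)
    return [(m.a, m.b, m.size)
            for m in matcher.get_matching_blocks() if m.size]
```

For example, aligning the invented descriptor strings "HHCCNNOO" and "XXCCNNYY" recovers the shared four-atom run starting at position 2 in both structures.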
Clustering of molecules or proteins. A distance between the 3D structures of two molecules or two biopolymers is calculated as an inverse similarity score for the optimal superimposition of their sub-structures. Clustering of 3D structures is performed by P-clustering, a density-type clustering of objects that identifies any enrichment of objects in a metric space. The distance between objects is defined in the metric space, and the enrichment of a neighborhood by objects is estimated from the fractal dimension of the neighborhood. For data flows in which not all objects can be stored in computer memory, clustering is performed by the big-data-oriented P-clustering BD approach: the data flow is processed in portions, and in each portion, elementary enrichments are collected and transferred to the next portion as pairs (center plus radius) of the most enriched neighborhoods.
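A toy density-clustering sketch follows. P-clustering proper estimates enrichment via the fractal dimension of a neighborhood; a plain neighbor count stands in here, but the notion of "enriched neighborhoods in a metric space" is the same.

```python
def dense_centers(points, radius, min_neighbors):
    """Return the points whose radius-neighborhood is enriched, i.e.
    contains at least min_neighbors other points (Euclidean metric)."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    return [p for p in points
            if sum(1 for q in points
                   if q is not p and dist(p, q) <= radius) >= min_neighbors]

# Four tightly packed structures form a dense core in this invented
# 2D embedding; the distant structure at (5, 5) is left out.
structures = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5)]
core = dense_centers(structures, radius=0.2, min_neighbors=2)
```

In the streaming (P-clustering BD) setting, only the centers and radii of such enriched neighborhoods would be carried forward between data portions.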
Docking of a ligand via structural similarity. The general idea underlying this approach is to find sub-structural similarities between a tested protein and proteins in crystallized protein-ligand complexes. By screening a database of such substructures, the algorithm finds a substructure of the tested protein that is similar to the ligand-neighboring substructure of a complexed protein and that has an adjoining cavity allowing optimal positioning of the ligand; the energy potential of the protein-ligand interaction is then calculated.
Association of heterogeneous data types in the same experiment is performed by a Bi-Association algorithm: given two types of observations on the same set of biological samples (two tables may serve as input, e.g., a table of gene expression and a corresponding table listing abundances of metabolites derived from the same samples), the hidden links between subsets of rows of the two tables may be detected by identifying similar grouping (clustering) patterns of the samples based on each subset of rows.
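The Bi-Association idea can be sketched as follows: each table's row subset induces a grouping of the samples, and two row subsets are linked when their groupings agree. The median-split "clustering" below is a deliberate simplification of whatever clustering the platform actually applies.

```python
def sample_partition(table, rows):
    """Split samples into two groups by whether their mean signal over
    the chosen rows lies above or below the median (a crude stand-in
    for clustering the samples on that row subset)."""
    n_samples = len(table[0])
    means = [sum(table[r][s] for r in rows) / len(rows)
             for s in range(n_samples)]
    median = sorted(means)[n_samples // 2]
    return [m >= median for m in means]

def partitions_agree(p1, p2):
    """Fraction of samples grouped consistently by two partitions
    (complementary labellings count as full agreement)."""
    same = sum(a == b for a, b in zip(p1, p2))
    return max(same, len(p1) - same) / len(p1)

# Two invented tables over the same four samples: a gene-expression
# row subset and a metabolite row that group the samples identically.
genes = [[1, 1, 9, 9], [2, 2, 8, 8]]
metabolites = [[0.1, 0.2, 5.0, 6.0]]
```

Here the gene rows and the metabolite row split the four samples the same way, so the two row subsets would be reported as a hidden link.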
A network of links and prediction (Bi-Association and stepwise regression) may be used for generating a network of many-to-many connections between objects of two or more kinds. Namely, a prediction of the behavior of an object of type “A” across samples, dependent on the behavior of several objects of type “B”, can be prepared by regression analysis. For instance, drug efficacy across cell lines may be modeled as dependent on the abundance of mutations in several mutation islands of the human genome. Initial studies have shown clinically relevant linkages in experiments on drug efficacy in cell lines (70 drugs across 46 cell lines). The same Bi-Association and sw-Regression approach was applied to demonstrate associations of groups of mutation islands with several drugs. Some of these islands neighbor genes, which may reflect the same phenomenon shown for the island neighboring the gene ROBO1, linked to the therapeutic activity of doxorubicin.
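A greedy forward-selection sketch of the stepwise-regression step follows; the island names and efficacy values are invented, and each step adds the single predictor that most reduces the residual sum of squares.

```python
def simple_fit(x, y):
    """One-variable least squares; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def forward_stepwise(predictors, y, steps=1):
    """At each step, add the predictor that most reduces the residual
    sum of squares, fitting it against the current residuals."""
    residual, chosen = list(y), []
    for _ in range(steps):
        best = None
        for name, x in predictors.items():
            if name in chosen:
                continue
            slope, intercept = simple_fit(x, residual)
            rss = sum((r - (slope * v + intercept)) ** 2
                      for v, r in zip(x, residual))
            if best is None or rss < best[0]:
                best = (rss, name, slope, intercept)
        rss, name, slope, intercept = best
        chosen.append(name)
        residual = [r - (slope * v + intercept)
                    for v, r in zip(predictors[name], residual)]
    return chosen

# Invented example: efficacy closely tracks the mutation load of
# "island1" across four cell lines, so it is selected first.
islands = {"island1": [1, 2, 3, 4], "island2": [4, 1, 3, 2]}
efficacy = [2.1, 4.0, 6.2, 7.9]
```

Running many such regressions, one drug against many candidate islands, yields the many-to-many network of links described above.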
The T-Bioinfo platform has several promising applications:
1. Potential for better physical examinations: Employing an automated, unsupervised method to analyze heterogeneous clinical and omics information, the Platform can determine how clinical signs of disease in a particular patient correlate with biological processes at the cell level.
2. Better population analytics: Patients can be grouped for optimizing applied drug care, where optimization is based on a network linking clinical data with omics biology of similar drug treatments in model animals and cell cultures.
3. Smarter clinical trial design: Effectively targeting patients’ clinical information with correlative omics data optimizes the likelihood of success of specific clinical trials.
4. Better clinical guidelines: Correlating underlying basic biology with patient symptoms will guide application of other available drugs and treatments.
5. Deeper biomarker understanding/discovery: Discovering the network of links between clinical data and omics biology in model animals and cell cultures under the same drug treatments might disclose key cellular elements/processes that are hubs of the disease regulation.
6. More efficient drug discovery: Revealing associations of a drug’s physico-chemical and structural descriptors with clinical efficacy has immediate application for drug design. The Platform’s capacity to disclose correlations of such descriptors with cellular processes, as derived from omics data, will enable rational screening of small-molecule databases for more efficient drug development.