The United States Department of Defense, Defense Advanced Research Projects Agency (DARPA) grant: INTERfering and Co-Evolving Prevention and Therapy (INTERCEPT). The Tauber Bioinformatics Research Center is participating in a project entitled rTIPs: “Trojan-horse” Interference Particles against pathogenic RNA viruses. The PI for this project is Prof. Raul Andino (UCSF). The collaborative team consists of the University of California, San Francisco, Stanford University, the University of Haifa’s Tauber Bioinformatics Research Center, IBM, Pine Biotech, and Aleph Therapeutics.

The program aims to explore and evaluate the potential of Therapeutic Interfering Particles (TIPs) as a therapeutic and preventive approach for the long-term control of a broad range of rapidly evolving viruses. The novel path explored in the program is based upon previously reported Defective Interfering Particles (DIPs), virus-derived particles with partially deleted genomes that arise during natural infections. DIPs have been isolated from numerous viral infections and shown to interfere with replication and packaging processes through stoichiometric competition for essential viral components.

The most important project target is the design of therapeutically efficient DIPs. Our planned approach to optimizing TIP design is to associate TIP candidates with clinical outcomes in animals and with virus quasispecies fitness in cell lines. The associations are based on machine learning and other computational analysis of integrated model simulations.


Participation in biological projects:

1.   The United States Department of Defense, Defense Advanced Research Projects Agency (DARPA) project titled “Linking Virus Population Genetic Structure to Infectivity and Adaptation,” collaboration with the University of California, San Francisco and Stanford University (CA); Mount Sinai School of Medicine (NY); Leloir Institute (Argentina); and SAIC Inc. (San Diego, CA).

The project focuses on dengue virus (DENV) with a goal to examine cross-species adaptation in the natural virus cycle from mosquito to mammalian cells. DENV offers a unique opportunity to capture how the quasispecies structure and dynamics determine cycling between insect and human environments. The objective is to predict viral evolutionary trajectories and pathogenic potential.

The study is focused on diversity and sequence structure of a viral population and how these features are associated with fitness. A new accurate approach to measure the mutation distribution of viral quasispecies (i.e. the genetic structure of the population) was developed, and this information is linked to experimentally measured viral fitness landscapes through developed computer algorithms.

The fundamental force shaping virus evolution is its interaction with the host machinery, whereby, for example, the cellular environment modulates viral diversity and immune mechanisms act as constant selective pressures. The computational task is to identify these forces via analysis and integration of heterogeneous omics data: NGS sequencing, mass-spec proteomics, image analysis, etc. The wider goal is to predict the ability of any RNA virus to transmit from other species into humans, spread in the human population, and cause disease.


2.  “Transcription of Repeats Activates Interferon (TRAIN) Mechanism in Oncology and Aging,” collaboration with Roswell Park Cancer Institute, Tartis Inc., Oncotartis Inc. and TartisAging Inc. (Buffalo, NY, USA).
Expression of interspersed repeats (SINEs) and pericentromeric tandem repeats in the genome of normal mouse cells is blocked by p53 and DNA methylation, either of which is sufficient for transcriptional silencing. Combined p53 inactivation and DNA demethylation through experimental means or natural circumstances (as in tumor cells) lead to unsilencing and massive transcription of these repeat elements, followed by triggering of a suicidal type I interferon (IFN) response. The predicted ability of repeat transcripts to form dsRNA likely mediates the observed IFN induction in cells with unsilenced repeats.


3.   “Genomics and Transcriptomics of the Blind Mole Rat (BMR), and Mechanisms Underlying its Hypoxia Tolerance,” collaboration between the University of Haifa, Beijing Genomics Institute (China), Bar-Ilan University (Israel) and several universities in Europe and America.
This project involves a wide genomics study of the BMR (NGS sequencing, assembly, and analysis of the BMR genome and transcriptome) that is focused on the BMR’s adaptive response to underground stresses, and to hypoxia in particular. The NGS data analysis results reveal several specific adaptive genomic features of the BMR, including high rates of RNA and DNA editing, reduced chromosome rearrangement, and an over-representation of SINE elements. We have discovered specific regulation of the expression of genes and repeats in the BMR that leads to its tolerant response to hypercapnia and hypoxia, and, putatively, resistance to cancer.


4.   NIH/NIAID P01 grant, entitled, “Protein Homeostasis Mechanisms Underlying Enterovirus Replication and Evolution,” collaboration with the University of California, San Francisco.
This project entails the integration of omics data (NGS, virus infection images, LC-MS proteomics) generated in the study of protein homeostasis mechanisms underlying enterovirus replication and evolution. The major goal of this project is to find statistically significant associations between features of heterogeneous datasets that are generated in the same process of enterovirus replication and evolution.


Algorithmic developments:

1.   Circular sequencing (CirSeq), developed at the University of California, San Francisco, Raul Andino Lab, utilizes a repetition code to detect and correct sequencing errors. Bioinformatics analysis of CirSeq data requires identifying tandem repeats within the circular read, generating a consensus of those repeats, and reconstructing linear genomic fragments. We developed an algorithm for fast and accurate identification of the consensus and its linearization into a genomic fragment. The algorithm is based on the distribution of distances between k-mers in each CirSeq read.

2.   The relative fitness of mutations in a virus genome across passages indicates the importance of mutations for virus adaptation to specific cellular environments. Because CirSeq reads show genome mutations with high precision, we applied a generalized Bayesian (auto)regression approach in the mutation fitness analysis to estimate the fitness of each mutation together with its confidence interval. In a poliovirus case study, after normalizing the fitness values by their confidence intervals, we observed a biologically interesting concentration of higher fitness values in a narrow genome fragment that encodes a protein responsible for interaction of the virus with the cell membrane.

3.   An algorithm for unsupervised analysis of “junk” DNA. It applies optimal clustering based on a probabilistic criterion to tens of millions of genomic elements. The distance between two elements is measured by the number of k-mers they share. In the analysis of total expressed RNA, the goal is to find a group of genomic elements that might provide specific regulation in an organism or under some particular biological condition.

4.   Integration/association of heterogeneous omics data (sequencing, microarray, mass-spec, structural, phenotype information). Integration/association of heterogeneous data is the key aspect of any project where information on the biological process of interest is collected from several data generation procedures. This is typical for biomedical projects, where divergent types of information (visual, medical, several types of omics features) are collected for every sample/patient. The idea of many-to-many association is based on the similarity of two matrices of distances between samples, each computed from a different group of features.

Another application of this association approach is reverse engineering of regulatory networks: the distance-matrix-based association technique may be applied to reconstruct a network of mutual gene regulations and/or a network of metabolite regulations.

The distance-matrix-based association may also be applied to screening a new sample against a database of previously studied samples. Such database screening, which relies on rather delicate one-to-many and many-to-many associations, would serve as a prediction method for the unknown features of the query samples. Another association measure for database screening, which we have utilized in a specific modification, is the sorting/enrichment-based association measure. This type of association is used in the CMAP database of collected microarray gene expressions after treatment of biological samples by small molecules (Broad Institute, Harvard-MIT). The two association approaches above, when applied together, support and cross-check each other.

5.   Docking and 3D protein-protein interaction by similarity: Computational detection of 3D contacts, including docking of a ligand/small molecule into a protein binding site, is a difficult, time-consuming task that is generally based on molecular mechanics calculations. Molecular-mechanics-based screening of a ligand across the surfaces of all proteins in the Protein Data Bank (PDB) of 3D structures is a very slow operation with rather slim chances of success. On the other hand, screening the PDB for a protein surface patch that is similar to a known binding site of a ligand of interest (or to a protein-protein interaction patch of interest) might be a relatively fast and promising procedure. The developed 3D similarity method is a key element of this approach.

6.   Structure-activity relationship (SAR) via atom voting: In drug design, one of the required steps is to improve the activity of the drug candidate (small molecule) “from hit to lead”. The chemical modifications of a small molecule depend on a pharmacophore scaffold (key activity atoms) of a group of active molecules. In our approach, the physico-chemical 3D similarity of conformers with different activity helps to identify the activity “pro” and “contra” voting atoms in the multiple 3D alignment of active and non-active molecules.
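
The many-to-many association described in item 4 above can be illustrated with a minimal Python sketch. It assumes Euclidean distances between samples and a Mantel-style permutation test on the two distance matrices; neither choice is stated in the description above, and the function names and the toy measurements are hypothetical. It is an illustration of the general idea, not the center's implementation.

```python
import math
import random


def euclidean_distance_matrix(rows):
    """Pairwise distances between samples (assumption: Euclidean metric)."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def upper_triangle(matrix, order=None):
    """Flatten the strict upper triangle, optionally after relabelling samples."""
    n = len(matrix)
    order = list(range(n)) if order is None else order
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation of two equal-length lists with non-zero variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def matrix_association(dist_a, dist_b, permutations=999, seed=0):
    """Return (correlation, permutation p-value) for two distance matrices
    computed over the same samples (assumption: Mantel-style test)."""
    rng = random.Random(seed)
    flat_a = upper_triangle(dist_a)
    observed = pearson(flat_a, upper_triangle(dist_b))
    n = len(dist_b)
    at_least_as_large = 1  # the observed labelling counts as one permutation
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)  # shuffle sample labels of the second matrix
        if pearson(flat_a, upper_triangle(dist_b, order)) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / (permutations + 1)


# Hypothetical toy data: five samples measured with two feature groups.
expression = [[1.0, 2.0], [1.2, 2.1], [5.0, 6.0], [5.1, 6.2], [9.0, 1.0]]
proteomics = [[0.1], [0.2], [3.0], [3.1], [7.0]]

d_expr = euclidean_distance_matrix(expression)
d_prot = euclidean_distance_matrix(proteomics)
correlation, p_value = matrix_association(d_expr, d_prot)
print(f"correlation={correlation:.3f}, p={p_value:.3f}")
```

A correlation close to 1 with a small permutation p-value suggests that the two feature groups arrange the samples in a similar way. With only a handful of samples, as in this toy case, the p-value has coarse resolution and should be read with caution.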

HPC implementations of developed algorithms:

1.   Highly parallelized GP-GPU calculation of distances between millions of sequences; distances (similarities) are calculated as correlations of k-mer spectra of sequences.

2.   GP-GPU implementation of the flexible BLAST-like algorithm.

3.   Multi-thread implementation of bi-clustering algorithm: each bi-cluster consists of a group of sequences associated with a group of k-mers.
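
The distance used in item 1 of this list, a correlation of k-mer spectra, can be sketched on the CPU as follows. This is a minimal single-threaded Python illustration, assuming a DNA alphabet, k = 3, Pearson correlation, and a distance defined as one minus that correlation; the function names and example sequences are hypothetical, and the GP-GPU parallelization itself is not reproduced here.

```python
import math
from collections import Counter
from itertools import product


def kmer_spectrum(sequence, k=3, alphabet="ACGT"):
    """Count every possible k-mer over the alphabet, in a fixed order,
    so that spectra of different sequences are directly comparable."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts["".join(kmer)] for kmer in product(alphabet, repeat=k)]


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant
    (an assumption made here to avoid division by zero)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denominator = math.sqrt(var_x * var_y)
    return cov / denominator if denominator > 0 else 0.0


def spectrum_distance(seq_a, seq_b, k=3):
    """Assumed distance: 1 - correlation of the two k-mer spectra."""
    return 1.0 - pearson(kmer_spectrum(seq_a, k), kmer_spectrum(seq_b, k))


# Hypothetical example sequences.
seq_a = "ACGTACGTACGTACGA"
seq_b = "ACGTACGTACGTACGT"
seq_c = "GGGGCCCCGGGGCCCC"

print(spectrum_distance(seq_a, seq_b))  # similar k-mer content, small distance
print(spectrum_distance(seq_a, seq_c))  # no shared 3-mers, larger distance
```

Each pair of sequences is scored independently of every other pair, which is what makes this kind of calculation a natural candidate for massive parallelization.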