Browsing by Subject "Computational Biology"
Now showing 1 - 10 of 12
Item Advances in translational bioinformatics facilitate revealing the landscape of complex disease mechanisms (Springer (Biomed Central Ltd.), 2014) Yang, Jack Y.; Dunker, A. Keith; Liu, Jun S.; Qin, Xiang; Arabnia, Hamid R.; Yang, William; Niemierko, Andrzej; Chen, Zhongxue; Luo, Zuojie; Wang, Liangjiang; Liu, Yunlong; Xu, Dong; Deng, Youping; Tong, Weida; Yang, Mary Qu; Department of Biochemistry and Molecular Biology, IU School of Medicine
Advances in high-throughput technologies have rapidly produced more and more data from DNAs and RNAs to proteins, especially large volumes of genome-scale data. However, connection of the genomic information to cellular functions and biological behaviours relies on the development of effective approaches at higher systems level. In particular, advances in RNA-Seq technology have helped the studies of transcriptome, RNA expressed from the genome, while systems biology on the other hand provides more comprehensive pictures, from which genes and proteins actively interact to lead to cellular behaviours and physiological phenotypes. As biological interactions mediate many biological processes that are essential for cellular function or disease development, it is important to systematically identify genomic information including genetic mutations from GWAS (genome-wide association study), differentially expressed genes, bidirectional promoters, intrinsically disordered proteins (IDP) and protein interactions to gain deep insights into the underlying mechanisms of gene regulations and networks. Furthermore, bidirectional promoters can co-regulate many biological pathways, where the roles of bidirectional promoters can be studied systematically for identifying co-regulating genes at interactive network level.
Combining information from different but related studies can ultimately help reveal the landscape of molecular mechanisms underlying complex diseases such as cancer.

Item Discovery and Interpretation of Subspace Structures in Omics Data by Low-Rank Representation (2022-10) Lu, Xiaoyu; Cao, Sha; Zhang, Chi; Yan, Jingwen; Zang, Yong
Biological functions in cells are highly complicated and heterogeneous, and can be reflected by omics data, such as gene expression levels. Detecting subspace structures in omics data and understanding the diversity of the biological processes is essential to the full comprehension of biological mechanisms and complicated biological systems. In this thesis, we are developing novel statistical learning approaches to reveal the subspace structures in omics data. Specifically, we focus on three types of subspace structures: low-rank subspace, sparse subspace and covariates explainable subspace. For low-rank subspace, we developed a semi-supervised model SSMD to detect cell type specific low-rank structures and predict their relative proportions across different tissue samples. SSMD is the first computational tool that utilizes semi-supervised identification of cell types and their marker genes specific to each mouse tissue transcriptomics data, for better understanding of the disease microenvironment and downstream disease mechanism. For sparsity-driven sparse subspace, we proposed a novel positive and unlabeled learning model, namely PLUS, that could identify cancer metastasis related genes, predict cancer metastasis status and specifically address the under-diagnosis issue in studying metastasis potential. We found that PLUS-predicted metastasis potential at diagnosis has a significantly strong association with patients’ progression-free survival in their follow-up data. Lastly, to discover the covariates explainable subspace, we proposed an analytical pipeline based on covariance regression, namely, scCovReg.
We utilized scCovReg to detect the pathway level second-order variations using scRNA-Seq data in a statistically powerful manner, and to associate the second-order variations with important subject-level characteristics, such as disease status. In conclusion, we presented a set of state-of-the-art computational solutions for identifying sparse subspaces in omics data, which promise to provide insights into the mechanism in complex diseases.

Item Identification of functionally connected multi-omic biomarkers for Alzheimer’s disease using modularity-constrained Lasso (PLOS, 2020-06-17) Xie, Linhui; Varathan, Pradeep; Nho, Kwangsik; Saykin, Andrew J.; Salama, Paul; Yan, Jingwen; Radiology and Imaging Sciences, School of Medicine
Large-scale genome-wide association studies (GWASs) have led to discovery of many genetic risk factors in Alzheimer’s disease (AD), such as APOE, TOMM40 and CLU. Despite the significant progress, it remains a major challenge to functionally validate these genetic findings and translate them into targetable mechanisms. Integration of multiple types of molecular data is increasingly used to address this problem. In this paper, we proposed a modularity-constrained Lasso model to jointly analyze the genotype, gene expression and protein expression data for discovery of functionally connected multi-omic biomarkers in AD. With a prior network capturing the functional relationship between SNPs, genes and proteins, the newly introduced penalty term maximizes the global modularity of the subnetwork involving selected markers and encourages the selection of multi-omic markers with dense functional connectivity, instead of individual markers. We applied this new model to the real data collected in the ROS/MAP cohort where the cognitive performance was used as disease quantitative trait. A functionally connected subnetwork involving 276 multi-omic biomarkers, including SNPs, genes and proteins, was identified to bear predictive power.
Within this subnetwork, multiple trans-omic paths from SNPs to genes and then proteins were observed. This suggests that cognitive performance deterioration in AD patients can be potentially a result of genetic variations due to their cascade effect on the downstream transcriptome and proteome level.

Item Method Development Involving Modeling Bacterial Metabolite Regulation of Vaginal Epithelial Cell Signaling in Bacterial Vaginosis (2022-04-28) Trinh, Alan; Brubaker, Douglas
BACKGROUND: Bacterial vaginosis (BV), which is the imbalance of normal vaginal microbiota, contributes to preterm delivery, vaginitis, and decreased drug efficacy. Despite metronidazole efficacy in reducing BV-contributing organisms, BV continues to recur in 50% of patients. Previous studies showing imidazole propionate’s role in the pathogenesis of type II diabetes suggest that similar metabolite-regulated pathways in vaginal microbiomes may be the key in pathogenesis of uterine diseases such as BV. Thus, the purpose of this study was to observe the relationship between vaginal metabolites, host or microbiome-derived, and transcriptomic responses in vaginal epithelial tissues stratified by vaginal microbiome composition (“microbiome group”). The hypothesis was that differences in vaginal microbiome composition result in differential regulation of metabolite-host pathway functional relationships. METHODS: Transcript levels and metabolite concentrations precollected from 23 East African women were processed and analyzed via R. Transcriptomic data were converted into KEGG pathway enrichment scores via ssGSEA2.0, a package within R. Enrichment scores were correlated (Spearman) with metabolite levels by microbiome group and lactobacillus dominant phenotypes, and relationships were visualized via Heatmap3 and Cytoscape.
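The correlation step described in the METHODS above can be illustrated with a minimal Python sketch. It is not the authors' R pipeline: the function names, the toy data, and the use of a permutation test for the p-value are all assumptions made for illustration (the abstract does not say how significance was computed). The thresholds mirror the ones quoted in the abstract, read as |rho| > 0.5 and p < 0.05.

```python
import random
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_p(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value (an assumed choice of test)."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def strong_pairs(pathway_scores, metabolite_levels, rho_min=0.5, alpha=0.05):
    """Keep pathway-metabolite pairs with |rho| > rho_min and p < alpha."""
    kept = []
    for pathway, scores in pathway_scores.items():
        for metabolite, levels in metabolite_levels.items():
            rho = spearman(scores, levels)
            p = permutation_p(scores, levels)
            if abs(rho) > rho_min and p < alpha:
                kept.append((pathway, metabolite, rho, p))
    return kept


# Hypothetical toy data: one pathway score and two metabolites, 8 samples.
pathway_scores = {"pathway_A": [1, 2, 3, 4, 5, 6, 7, 8]}
metabolite_levels = {
    "metabolite_up": [2, 4, 6, 8, 10, 12, 14, 16],
    "metabolite_noise": [5, 1, 8, 2, 7, 3, 6, 4],
}
kept = strong_pairs(pathway_scores, metabolite_levels)
```

In a stratified analysis like the one described, the same filter would presumably be run once per microbiome group and the surviving pairs compared across groups.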
RESULTS: The results showed varying strengths in correlation among metabolites and KEGG pathway enrichment scores after filtering for strong correlations (|R| > 0.5) and significance (p < 0.05). Nonlactobacillus dominant microbiomes showed fewer strongly associated metabolite-KEGG pathway relationships compared to the lactobacillus dominant microbiome group, specifically the imidazole-related networks. CONCLUSIONS: In this study, variations in significant correlations among metabolites and KEGG pathways suggest that microbiome diversity may contribute to how metabolites regulate host pathways in vaginal epithelial cells. The reduced pathway interactions observed in imidazole compounds suggest that dysregulation may contribute to recurrence of bacterial vaginosis. This method of modelling could be used to characterize the regulation of critical pathways associated with the pathogenesis of bacterial vaginosis.

Item MutSignatures: an R package for extraction and analysis of cancer mutational signatures (Nature Publishing Group, 2020-10-26) Fantini, Damiano; Vidimar, Vania; Yu, Yanni; Condello, Salvatore; Meeks, Joshua J.; Obstetrics and Gynecology, School of Medicine
Cancer cells accumulate somatic mutations as a result of DNA damage, inaccurate repair and other mechanisms. Different genetic instability processes result in characteristic non-random patterns of DNA mutations, also known as mutational signatures. We developed mutSignatures, an integrated R-based computational framework aimed at deciphering DNA mutational signatures. Our software provides advanced functions for importing DNA variants, computing mutation types, and extracting mutational signatures via non-negative matrix factorization. Specifically, mutSignatures accepts multiple types of input data, is compatible with non-human genomes, and supports the analysis of non-standard mutation types, such as tetra-nucleotide mutation types.
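To make the non-negative matrix factorization step concrete, the following is a minimal Python sketch of the generic multiplicative-update scheme for factoring a mutation-count matrix V (mutation types by samples) into signatures W and exposures H, with V ≈ WH. It is not the mutSignatures implementation or its API; the function names, the toy counts, and the choice of two signatures are assumptions made for illustration only.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    """Squared Frobenius distance between V and the product WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=300, seed=0, eps=1e-9):
    """Factor V into non-negative W and H by multiplicative updates."""
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(len(v))]
    h = [[rng.random() + 0.1 for _ in range(len(v[0]))] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[r][c] * num[r][c] / (den[r][c] + eps)
              for c in range(len(h[0]))] for r in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[r][c] * num[r][c] / (den[r][c] + eps)
              for c in range(rank)] for r in range(len(w))]
    return w, h


# Hypothetical toy counts: 6 mutation types (rows) x 4 samples (columns).
counts = [
    [20, 2, 18, 1],
    [15, 1, 14, 2],
    [1, 12, 2, 11],
    [2, 18, 1, 20],
    [10, 10, 9, 11],
    [0, 5, 1, 6],
]
signatures, exposures = nmf(counts, rank=2)
```

Each column of `signatures` is a candidate mutational pattern and each column of `exposures` gives how strongly a sample loads on those patterns. Because the updates only multiply by non-negative ratios, both factors stay non-negative, which is what makes the factors interpretable as additive mutational processes.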
We applied mutSignatures to analyze somatic mutations found in smoking-related cancer datasets. We characterized mutational signatures that were consistent with those reported before in independent investigations. Our work demonstrates that selected mutational signatures correlated with specific clinical and molecular features across different cancer types, and revealed complementarity of specific mutational patterns that has not previously been identified. In conclusion, we propose mutSignatures as a powerful open-source tool for detecting the molecular determinants of cancer and gathering insights into cancer biology and treatment.

Item A non-parametric Bayesian model for joint cell clustering and cluster matching: identification of anomalous sample phenotypes with random effects (Springer (Biomed Central Ltd.), 2014) Dundar, Murat; Akova, Ferit; Yerebakan, Halid Z.; Rajwa, Bartek; Department of Computer & Information Science, School of Science
BACKGROUND: Flow cytometry (FC)-based computer-aided diagnostics is an emerging technique utilizing modern multiparametric cytometry systems. The major difficulty in using machine-learning approaches for classification of FC data arises from limited access to a wide variety of anomalous samples for training. In consequence, any learning with an abundance of normal cases and a limited set of specific anomalous cases is biased towards the types of anomalies represented in the training set. Such models do not accurately identify anomalies, whether previously known or unknown, that may exist in future samples tested. Although one-class classifiers trained using only normal cases would avoid such a bias, robust sample characterization is critical for a generalizable model. Owing to sample heterogeneity and instrumental variability, arbitrary characterization of samples usually introduces feature noise that may lead to poor predictive performance.
Herein, we present a non-parametric Bayesian algorithm called ASPIRE (anomalous sample phenotype identification with random effects) that identifies phenotypic differences across a batch of samples in the presence of random effects. Our approach involves simultaneous clustering of cellular measurements in individual samples and matching of discovered clusters across all samples in order to recover global clusters using probabilistic sampling techniques in a systematic way. RESULTS: We demonstrate the performance of the proposed method in identifying anomalous samples in two different FC data sets, one of which represents a set of samples including acute myeloid leukemia (AML) cases, and the other a generic 5-parameter peripheral-blood immunophenotyping. Results are evaluated in terms of the area under the receiver operating characteristics curve (AUC). ASPIRE achieved AUCs of 0.99 and 1.0 on the AML and generic blood immunophenotyping data sets, respectively. CONCLUSIONS: These results demonstrate that anomalous samples can be identified by ASPIRE with almost perfect accuracy without a priori access to samples of anomalous subtypes in the training set. The ASPIRE approach is unique in its ability to form generalizations regarding normal and anomalous states given only very weak assumptions regarding sample characteristics and origin. Thus, ASPIRE could become highly instrumental in providing unique insights about observed biological phenomena in the absence of full information about the investigated samples.

Item Overlapping Genes Produce Proteins with Unusual Sequence Properties and Offer Insight into De Novo Protein Creation (American Society for Microbiology, 2009-10) Rancurel, Corinne; Khosravi, Mahvash; Dunker, A. Keith; Romero, Pedro R.; Karlin, David; Biochemistry and Molecular Biology, School of Medicine
It is widely assumed that new proteins are created by duplication, fusion, or fission of existing coding sequences.
Another mechanism of protein birth is provided by overlapping genes. They are created de novo by mutations within a coding sequence that lead to the expression of a novel protein in another reading frame, a process called "overprinting." To investigate this mechanism, we have analyzed the sequences of the protein products of manually curated overlapping genes from 43 genera of unspliced RNA viruses infecting eukaryotes. Overlapping proteins have a sequence composition globally biased toward disorder-promoting amino acids and are predicted to contain significantly more structural disorder than nonoverlapping proteins. By analyzing the phylogenetic distribution of overlapping proteins, we were able to confirm that 17 of these had been created de novo and to study them individually. Most proteins created de novo are orphans (i.e., restricted to one species or genus). Almost all are accessory proteins that play a role in viral pathogenicity or spread, rather than proteins central to viral replication or structure. Most proteins created de novo are predicted to be fully disordered and have a highly unusual sequence composition. This suggests that some viral overlapping reading frames encoding hypothetical proteins with highly biased composition, often discarded as noncoding, might in fact encode proteins. 
Some proteins created de novo are predicted to be ordered, however, and whenever a three-dimensional structure of such a protein has been solved, it corresponds to a fold previously unobserved, suggesting that the study of these proteins could enhance our knowledge of protein space.

Item Prediction of regulatory motifs from human Chip-sequencing data using a deep learning framework (Oxford University Press, 2019-09-05) Yang, Jinyu; Ma, Anjun; Hoppe, Adam D.; Wang, Cankun; Li, Yang; Zhang, Chi; Wang, Yan; Liu, Bingqiang; Ma, Qin; Medical and Molecular Genetics, School of Medicine
The identification of transcription factor binding sites and cis-regulatory motifs is a frontier whereupon the rules governing protein-DNA binding are being revealed. Here, we developed a new method (DEep Sequence and Shape mOtif or DESSO) for cis-regulatory motif prediction using deep neural networks and the binomial distribution model. DESSO outperformed existing tools, including DeepBind, in predicting motifs in 690 human ENCODE ChIP-sequencing datasets. Furthermore, the deep-learning framework of DESSO expanded motif discovery beyond the state-of-the-art by allowing the identification of known and new protein-protein-DNA tethering interactions in human transcription factors (TFs). Specifically, 61 putative tethering interactions were identified among the 100 TFs expressed in the K562 cell line. In this work, the power of DESSO was further expanded by integrating the detection of DNA shape features. We found that shape information has strong predictive power for TF-DNA binding and provides new putative shape motif information for human TFs.
Thus, DESSO improves the identification and structural analysis of TF binding sites, by integrating the complexities of DNA binding into a deep-learning framework.

Item RareVar: A Framework for Detecting Low-Frequency Single-Nucleotide Variants (Mary Ann Liebert, Inc., 2017-07) Hao, Yangyang; Xuei, Xiaoling; Li, Lang; Nakshatri, Harikrishna; Edenberg, Howard J.; Liu, Yunlong; Medical and Molecular Genetics, School of Medicine
Accurate identification of low-frequency somatic point mutations in tumor samples has important clinical utilities. Although high-throughput sequencing technology enables capturing such variants while sequencing primary tumor samples, our ability for accurate detection is compromised when the variant frequency is close to the sequencer error rate. Most current experimental and bioinformatic strategies target mutations with ≥5% allele frequency, which limits our ability to understand the cancer etiology and tumor evolution. We present an experimental and computational modeling framework, RareVar, to reliably identify low-frequency single-nucleotide variants from high-throughput sequencing data under standard experimental protocols. The RareVar protocol includes a benchmark design by pooling DNAs from already sequenced individuals at various concentrations to target variants at desired frequencies, 0.5%-3% in our case. By applying a generalized linear model-based, position-specific error model, followed by machine-learning-based variant calibration, our approach outperforms existing methods.
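Why detection degrades when the variant frequency approaches the sequencer error rate can be shown with a small Python sketch. This is deliberately a plain binomial test against a fixed per-position error rate, not RareVar's generalized linear error model or its machine-learning calibration; the function names, the read depth, the error rate, and the significance cutoff are all hypothetical values chosen for illustration.

```python
from math import exp, lgamma, log


def log_binom_pmf(i, n, p):
    """Log of the Binomial(n, p) probability of exactly i successes (0 < p < 1)."""
    return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p))


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k <= 0:
        return 1.0
    return sum(exp(log_binom_pmf(i, n, p)) for i in range(k, n + 1))


def exceeds_error_rate(alt_reads, depth, error_rate, alpha=1e-3):
    """True if the alternate-read count is unlikely under sequencing error alone."""
    return binomial_tail(alt_reads, depth, error_rate) < alpha


# Hypothetical position: 2000x depth and an assumed 0.1% error rate.
depth, error_rate = 2000, 0.001
clear_call = exceeds_error_rate(20, depth, error_rate)      # 1% allele frequency
borderline_call = exceeds_error_rate(3, depth, error_rate)  # 0.15% allele frequency
```

At 2000x depth about two erroneous reads are expected at this position, so three alternate reads are indistinguishable from noise while twenty are not. A position-specific error model refines this idea by replacing the single fixed `error_rate` with an estimate that varies by position.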
Our method can be applied on most capture and sequencing platforms without modifying the experimental protocol.

Item Systems Modeling of Gut Microbiome Regulation of Estrogen Receptor Beta Signaling in Ulcerative Colitis (2023-04-28) Trinh, Alan; Munoz, Javier; Cross, Tzu-Wen; Brubaker, Doug
Introduction: The pathogenesis of ulcerative colitis (UC), a chronic inflammatory disorder, involves interactions between gut microbiome dysbiosis, epithelial cell barrier disruption, and immune hyperactivity. Men are 20% more likely to develop UC and 60% more likely to progress to colitis-associated cancer than women. A possible explanation for this may be the anti-inflammatory and epithelial-protective role of estrogen via estrogen receptor beta (ESR2) in the gut. However, extracting insights into how microbiomes regulate host cell signaling is challenged by high-dimensional data integrations across kingdoms and the need to extract interpretable biological information from complex models. To address these challenges and understand microbiome regulation of ESR2 signaling, we developed a partial least squares path modeling (PLS-PM)-inspired microbiome multi-omic modeling framework. Materials and Methods: Gut metabolomic, colorectal transcriptomic, and stool 16S rRNA-seq data from unique UC or non-IBD control subjects (n=35) were obtained from the Inflammatory Bowel Disease Multi-Omics Database. Single sample gene set enrichment analysis was used to calculate pathway scores for genes up or down-regulated by ESR2 (ESR2UP/ESR2DN respectively). Latent variables (LV) obtained via regularized sparse partial least squares regression (sPLSR) models were extracted and used as predictors in two linear regression meta-models with dependent variables of ESR2UP or ESR2DN scores, and independent variables in each model consisting of patient LV scores on metabolites and 16S LVs along with sex and UC status.
Significance testing on regression coefficients identified LV interactions synergistically predictive of ER Beta pathway activity. Results and Discussion: The first two LVs from each single-omic sPLSR model were extracted to create terms in the multi-omic meta-model accounting for sex and disease status. The meta-model was predictive of ESR2UP pathway score, implicating UC status (p=0.046), microbiota LV1 (p=0.0006), metabolites LV2 (p=0.045), and interactions of metabolite LV1:microbiota LV1 (p=0.003), microbiota LV1:UC (p=0.0008), and microbiota LV2:sex (p=0.019) in predicting ESR2UP pathway status. For ESR2DN, the 16S model clustered by ESR2DN activity while the metabolomic model clustering was best illustrated by disease status. The ESR2DN meta-model was predictive of ESR2DN pathway activity, implicating main effects of microbiota LV1 (p=0.004), metabolites LV2 (p=0.004), and diagnosis and the interaction effects of metabolites LV1:microbiota LV1 (p=0.005), microbiota LV1:UC (p=0.014), microbiota LV2:sex (p=0.017), and metabolites LV2:UC (p=0.035) in predicting ESR2DN pathway status. Acesulfame, an artificial sweetener, and oxymetazoline, a nasal decongestant, were some of the metabolites predicted by our model to have a differential effect on ESR2 activity based on patient sex. The metabolites predicted in our models are tested in cancer cell lines to understand estrogen regulatory effects on inflammation observed in UC. The method developed in this study can be applied to gain insight regarding regulation of signaling pathways in pathologies not limited to UC. Conclusions: We demonstrate the effectiveness of a PLS-PM based method for modeling relationships between host signaling and microbiome multi-omics data via this investigation of ER Beta activity in UC patients.
We quantified significant multi-omic microbiome interactions with disease status and sex that impact ER Beta signaling which may aid in identifying new microbiome-targeted UC therapeutics stratified by sex-specific disease characteristics.