Other Abstract | With the development of sequencing, RNA-Seq, and mass spectrometry techniques, large volumes of metagenomic, transcriptomic, and proteomic data have accumulated. However, bioinformatics methods for analyzing the different types of omics data are still under development. In particular, there is no efficient and accurate method for integrating and mining omics big data from heterogeneous sources and levels. Given the challenges of integrating different types of omics data, as well as a single type of omics data from different sources, this study aimed to identify an optimized analytical strategy for integrating various omics data. The work comprises two parts: (1) model construction to explore the correlation between transcriptomic and proteomic data; and (2) development of a data mining method for biomarker selection based on metagenomic data. Model construction based on omics data: For many genes, expression at the transcriptomic level does not necessarily correlate well with expression at the proteomic level. This difference can be explained by a variety of factors, such as transcriptional regulation, protein degradation, codon bias, and the codon adaptation index in translation. Given these factors, such a correlation can be established manually for each gene, one at a time. However, this approach is slow and relies heavily on known genes and their expression properties. On the other hand, current RNA-Seq and mass spectrometry analyses provide high-throughput transcriptomic and proteomic data for global profiling of gene expression at both levels. Thus, there is an urgent need for batch analysis of the correlation between gene expression and protein abundance, as well as for interpretation of such correlations. In this work, we propose a bi-clustering method to cluster genes that show consistent patterns in the correlation between gene expression and protein abundance. 
The clustering results were then interpreted from the perspective of both transcriptomic and proteomic features, which shows that mRNA half-life, protein half-life, and protein 3D structure, acting in concert, have a significant effect on the correlation between gene expression and protein abundance. Based on these results, we further propose a model called the general Linear Model based on individual Clusters (CLM) for predicting protein abundance from gene expression using a well-selected set of features. For mitochondrial data from different model-organism tissues, this model performs well in predicting protein abundance from gene expression data when compared with the General Linear Model (GLM) and the Multivariate Adaptive Regression Splines (MARS) model, thus supporting the validity of the prediction model for these model-organism mitochondrial genes. On transcriptomic and proteomic data from another model organism, Saccharomyces cerevisiae, the model is built on a different set of features, and its predictions also reach higher accuracy than all-gene-based linear correlation, with higher correlation and lower Prediction Error Sum of Squares (SSE) than MARS in some clusters. Therefore, we conclude that the feature selection process based on bi-clustering results can select suitable sets of features for a variety of species, although different sets of features need to be selected for different species. The general method for feature selection and prediction model building would achieve relatively high accuracy for a wide range of gene sets. Biomarker selection based on metagenomic data: As more than 99% of the members of microbial communities cannot be isolated and cultivated, metagenomic methods are commonly used to analyze a microbial community as a whole. 
With the rapid accumulation of metagenomic samples, especially those from next-generation sequencing, it is now possible to quantify taxa (features) in metagenomic data accurately. The presence/absence or differing abundance values of a set of taxa can serve as taxonomical biomarkers for identifying the phenotype of the corresponding microbial community. Although some metagenomic biomarker analysis tools exist, current methods are not sufficiently robust, accurate, and fast in selecting non-redundant biomarkers for predicting a microbial community's phenotype. In this study, we present a novel method, MetaBoot, which combines mRMR (minimal redundancy maximal relevance) with bootstrapping to robustly and accurately discover non-redundant biomarkers for different microbial communities by mining metagenomic data. MetaBoot has been tested and compared with other methods on well-designed simulated datasets following normal and gamma distributions, as well as on publicly available metagenomic datasets. The results show that MetaBoot is robust across datasets of varied complexity and taxonomical distribution patterns and can select discriminative biomarkers with high accuracy and biological consistency. MetaBoot is a suitable framework for discovering taxonomical biomarkers that distinguish different microbial communities. Integrated analysis strategy for biological big data: Model construction based on transcriptomic and proteomic data is a form of integrated analysis of different omics data, and biomarker analysis based on metagenomic data is a study of data from different sources. Through this preliminary investigation of strategies for the two types of biological big data, we have arrived at an initial optimized analytical strategy and achieved relatively good results. |
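
The combination of mRMR and bootstrapping described in the abstract can be illustrated with a minimal sketch. This is not the MetaBoot implementation: the function names (`mutual_information`, `mrmr`, `bootstrap_mrmr`), the toy taxa, and the choice of discrete presence/absence values with a "relevance minus mean redundancy" score are all assumptions made for illustration. The idea shown is only that mRMR is run on many bootstrap resamples of the samples, and taxa are then ranked by how often they are selected.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr(features, labels, k):
    """Greedy mRMR: pick k features by relevance minus mean redundancy.

    features: dict mapping taxon name -> list of discrete values per sample.
    """
    relevance = {f: mutual_information(v, labels) for f, v in features.items()}
    selected = []
    candidates = sorted(features)  # sorted for deterministic tie-breaking

    def score(f):
        if not selected:
            return relevance[f]
        redundancy = sum(
            mutual_information(features[f], features[s]) for s in selected
        ) / len(selected)
        return relevance[f] - redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def bootstrap_mrmr(features, labels, k, n_boot=100, seed=0):
    """Run mRMR on bootstrap resamples; return selection frequency per taxon."""
    rng = random.Random(seed)
    n = len(labels)
    counts = Counter()
    for _ in range(n_boot):
        # Resample the samples (not the taxa) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        sub_features = {f: [v[i] for i in idx] for f, v in features.items()}
        sub_labels = [labels[i] for i in idx]
        counts.update(mrmr(sub_features, sub_labels, k))
    return {f: counts[f] / n_boot for f in features}


# Hypothetical presence/absence data: taxon_B duplicates taxon_A (redundant),
# taxon_C carries partly different information about the phenotype label.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "taxon_A": [0, 0, 0, 1, 1, 1, 1, 1],
    "taxon_B": [0, 0, 0, 1, 1, 1, 1, 1],
    "taxon_C": [0, 0, 1, 0, 1, 1, 0, 1],
}
freq = bootstrap_mrmr(features, labels, k=2, n_boot=50)
for taxon, f in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(f"{taxon}: selected in {f:.0%} of resamples")
```

Taxa that are selected in most resamples are the more stable biomarker candidates under this sketch. Real abundance data would first need to be discretized, or the mutual information estimate replaced with one suited to continuous values.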