A Preliminary Study of Integration and Mining Methods for Heterogeneous Omics Big Data

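	A Preliminary Study of Integration and Mining Methods for Heterogeneous Omics Big Data
	Wang Xiaojun (王晓君)
Advisor	Ning Kang (宁康)
	2014-11
Degree-granting institution	Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral	Beijing
Degree discipline	Biochemical Engineering
Keywords	gene expression; protein abundance; bi-clustering; half-life; prediction model; metagenomics; biomarker; machine learning; mRMR; bootstrapping; distribution pattern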
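Abstract	With the rapid development of sequencing, RNA-Seq and mass spectrometry technologies, large amounts of metagenomic, transcriptomic and proteomic data have accumulated. However, bioinformatics methods for analyzing the various kinds of omics data still need development. In particular, for omics big data that is heterogeneous in source and level, there is as yet no efficient and accurate method for integration and mining. This study addresses two problems in the integrated analysis of biological big data, the integration of different omics data types and the integration of data from different sources, and explores corresponding analysis strategies. It consists of two parts: 1. modeling of transcriptomic and proteomic data to explore the relationship between them; 2. improving biomarker selection methods for metagenomic data and carrying out the related data mining. Model construction based on omics data: According to the central dogma, gene expression is a multi-stage process in which genetic information passes from DNA through messenger RNA to protein. The mechanisms by which genes are transcribed, messenger RNA is translated on the ribosome into the corresponding polypeptide chain, and the polypeptide chain then folds into a functional protein are already well understood. For many genes, however, expression at the transcript level correlates poorly with expression at the protein level. This discrepancy can be explained by a variety of factors, such as transcriptional regulation, protein degradation, and codon bias and the codon adaptation index during translation. Given the relevant factors, the correlation between the transcript and protein levels can be established for an individual gene. This approach is inefficient, however, and requires extensive knowledge of each gene and the properties of its expression products. On the other hand, RNA-Seq and mass spectrometry provide high-throughput, global data at both the transcriptome and proteome levels. Batch analysis of the correlation between gene expression and the corresponding protein abundance, and interpretation of the mechanisms behind that correlation, are therefore urgently needed. In this work, a bi-clustering method is applied to paired gene expression and protein abundance values to identify gene groups (gene clusters) that share the same pattern of correlation between gene expression and protein abundance. The clusters are then interpreted in terms of transcript-level and protein-level features to explain the distinctive properties of each cluster. The interpretation shows that mRNA half-life, protein half-life and properties of the protein's three-dimensional structure have an important influence on the correlation between gene expression and the corresponding protein abundance. (Protein 3D structure is complex; in this work we mainly consider the surface-area-to-volume ratio, on the reasoning that the larger this ratio, the more residues are exposed and the more readily the protein may be degraded.) Based on these results, we further propose a model, the general linear model based on individual clusters (CLM), which predicts protein abundance from gene expression data using a set of selected features (features of the gene product, such as protein chain length). For the mitochondrial data from different tissues of the model organism used in this study, CLM was compared with a General Linear Model (GLM) and a Multivariate Adaptive Regression Splines (MARS) model, both built on the expression and abundance data of all genes, and performed better at predicting protein abundance from gene expression data. This supports the validity of CLM on the gene data of that model organism. On transcriptomic and proteomic data from another model organism, Saccharomyces cerevisiae, CLM has to be built on a new set of features (gene-product features obtained by repeating the variable selection process). Compared with a GLM built on the expression and abundance data of all genes, CLM achieves higher prediction accuracy; compared with MARS models built on the expression and abundance data of the genes within each individual bi-cluster, CLM achieves, in several clusters, higher correlation (between predicted and observed protein abundance) and a lower prediction error sum of squares (SSE). We therefore conclude that the feature selection process based on bi-clustering results can select feature sets suitable for a variety of species, although different species require different features. Biomarker selection based on metagenomic data: Because more than 99% of microorganisms cannot yet be isolated and cultured, metagenomic methods, which analyze a microbial community as a whole, are widely used. The rapid accumulation of metagenomic samples, especially those from next-generation sequencing, makes it possible to quantify taxa in metagenomic data more accurately. A set of taxa that are present or absent, or that differ in abundance, can serve as taxonomic markers for identifying the phenotype of the corresponding microbial community. Existing metagenomic marker analysis tools are not particularly robust, accurate or fast at selecting non-redundant markers for predicting the phenotype of a microbial community. In this study we propose a new method, MetaBoot, which combines mRMR (minimal redundancy maximal relevance) with bootstrapping; the combination mines metagenomic data to find non-redundant markers more robustly and accurately, and thereby distinguishes different microbial communities. We tested MetaBoot on a variety of designed simulated datasets and compared it with other methods. The simulated data were generated to reflect the actual distributions found in public metagenomic datasets, which include normal and gamma distributions. The results show that MetaBoot performs robustly on data with varied complexity and taxonomic distribution patterns, and that the markers it selects give high classification accuracy. MetaBoot is a method suited to discovering taxonomic biomarkers that distinguish different microbial samples well. Integrated analysis strategy for biological big data: The modeling of transcriptomic and proteomic data is a study in the integrated analysis of different omics data, while the improved biomarker selection for metagenomic data is a study of data from different sources. Through this preliminary exploration of these two kinds of integrated analysis, we gained an initial command of strategies for analyzing biological big data and obtained good results.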
Other abstract	With the development of sequencing, RNA-Seq and mass spectrometry techniques, large amounts of metagenomic, transcriptomic and proteomic data have accumulated. However, bioinformatics methods for analyzing the different kinds of omics data are still under development. In particular, for omics big data that is heterogeneous in source and level, there is no efficient and accurate integration and mining method. Considering the challenges of integrating different types of omics data, and of integrating one type of omics data from different sources, this study aimed to work out an optimized analytical strategy for integrating various omics data. This work includes two parts: 1. Model construction to explore the correlation between transcriptomics and proteomics; 2. Development of biomarker selection and data mining based on metagenomic data. Model construction based on omics data: For many genes, expression at the transcriptomic level does not necessarily correlate well with that at the proteomic level. This difference can be explained by a variety of factors, such as transcription regulation, protein degradation, codon bias and the codon adaptation index in translation. Given these factors, the correlation can be established manually for one gene at a time. However, this approach is slow and relies heavily on known genes and their expression properties. On the other hand, current RNA-Seq and mass spectrometry analyses provide high-throughput transcriptomic and proteomic data for global profiling of gene expression at both levels. Thus, there is an urgent need for batch analysis of the correlation between gene expression and protein abundance, as well as interpretation of such correlations. In this work, we have proposed a bi-clustering method to cluster genes that have consistent patterns for the correlation between gene expression and protein abundance.
The clustering results have then been interpreted from the perspective of both transcriptomic and proteomic features, which shows that mRNA half-life, protein half-life and protein 3D structure jointly have a significant effect on the correlation between gene expression and protein abundance. Based on these results, we have further proposed a model called the general Linear Model based on individual Clusters (CLM) for prediction of protein abundance from gene expression based on a well-selected set of features. For mitochondrial data from different tissues of a model organism, this model works well for protein abundance prediction from gene expression data when compared with the General Linear Model (GLM) and the Multivariate Adaptive Regression Splines (MARS) model, thus supporting the validity of the prediction model on the mitochondrial genes of this model organism. On transcriptomic and proteomic data from another model organism, Saccharomyces cerevisiae, the model is built on a different set of features, and the prediction reaches higher accuracy than all-gene-based linear correlation, and higher correlation and lower Prediction Error Sum of Squares (SSE) than MARS in some clusters. Therefore, we concluded that the feature selection process based on bi-clustering results could select sets of features suitable for a variety of species, yet different sets of features need to be selected for different species. The general method for feature selection and prediction model building would achieve relatively high accuracy for a wide range of gene sets. Biomarker selection based on metagenomic data: As more than 99% of microorganisms cannot be isolated and cultivated, metagenomic methods have been commonly used to analyze the microbial community as a whole.
With the fast accumulation of metagenomic samples, especially those from next-generation sequencing techniques, it is now possible to quantify taxa (features) in metagenomic data accurately. The presence/absence or differing abundance values of a set of taxa could be used as taxonomical biomarkers for identifying the corresponding microbial community's phenotype. Though some metagenomic biomarker analysis tools exist, current methods are not robust, accurate and fast enough at selecting non-redundant biomarkers for predicting a microbial community's phenotype. In this study, a novel method, MetaBoot, is presented that combines mRMR (minimal redundancy maximal relevance) and bootstrapping, and that could robustly and accurately discover non-redundant biomarkers for different microbial communities through mining of metagenomic data. MetaBoot has been tested and compared with other methods on well-designed simulated datasets based on normal and gamma distributions and on publicly available metagenomic datasets. Results have shown that MetaBoot was robust across datasets of varied complexity and taxonomical distribution patterns and could also select discriminative biomarkers with quite high accuracy and biological consistency. MetaBoot is a suitable framework for discovering taxonomical biomarkers that could distinguish different microbial communities. Integrated analysis strategy for biological big data: Model construction based on transcriptomic and proteomic data is a kind of integrated analysis of different omics data, and biomarker analysis based on metagenomic data is a study of data from different sources. Through this preliminary investigation of strategies for the two types of biological big data, we have gained an initial grasp of an optimized analytical strategy and achieved relatively good results.
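The combination of mRMR and bootstrapping described in the abstract can be illustrated with a minimal sketch. It is not the MetaBoot implementation: the function names are hypothetical, absolute Pearson correlation stands in for the mutual information that mRMR normally uses, the resampling is stratified by class as a convenience, and the data are synthetic.

```python
import numpy as np

def mrmr_select(X, y, k):
    """Greedy mRMR-style selection of k features (requires k <= number of columns).

    Assumption of this sketch: absolute Pearson correlation is used as a
    proxy for both relevance (feature vs. label) and redundancy
    (feature vs. feature) instead of mutual information.
    """
    with np.errstate(divide="ignore", invalid="ignore"):
        corr = np.corrcoef(np.column_stack([X, y]), rowvar=False)
    corr = np.abs(np.nan_to_num(corr))  # constant columns give NaN; treat as 0
    relevance = corr[:-1, -1]
    redundancy = corr[:-1, :-1]
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        # Score = relevance minus mean redundancy with already chosen features.
        scores = relevance[remaining] - redundancy[np.ix_(remaining, selected)].mean(axis=1)
        selected.append(remaining[int(np.argmax(scores))])
    return selected

def stratified_bootstrap_indices(y, rng):
    """Resample sample indices with replacement within each class."""
    parts = []
    for label in np.unique(y):
        members = np.flatnonzero(y == label)
        parts.append(rng.choice(members, size=members.size, replace=True))
    return np.concatenate(parts)

def bootstrap_mrmr(X, y, k, n_boot=50, seed=0):
    """Fraction of bootstrap resamples in which each feature is selected."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_boot):
        idx = stratified_bootstrap_indices(y, rng)
        counts[mrmr_select(X[idx], y[idx], k)] += 1
    return counts / n_boot

# Synthetic example: gamma-distributed "abundances" for 12 taxa, 200 samples.
rng = np.random.default_rng(1)
n, p = 200, 12
y = np.repeat([0, 1], n // 2)
X = rng.gamma(shape=2.0, scale=1.0, size=(n, p))
X[:, 0] += 2.0 * y                              # informative taxon
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)   # near-copy of taxon 0 (redundant)
X[:, 2] += 2.0 * y                              # second, independent informative taxon
freq = bootstrap_mrmr(X, y, k=2, n_boot=30)
print(np.round(freq, 2))
```

Features that are selected in most resamples are the stable candidates; in this toy setup the redundant pair (columns 0 and 1) competes for a single slot, which is the behaviour the redundancy penalty is meant to produce.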
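Author department	Single-Cell Center
Subject area	Bioinformatics
Release date	2017-06-30
Degree type	Master's; degree thesis
Language	Chinese
Document type	Degree thesis
Item identifier	http://ir.qibebt.ac.cn/handle/337004/8104
Collection	Single-Cell Center Group
Author affiliation	Qingdao Institute of Bioenergy and Bioprocess Technology, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	王晓君. 异质性生物组学大数据整合挖掘方法初探[D]. 北京. 中国科学院研究生院,2014.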

Files in this item
File name/size	Document type	Version type	Access type	License
异质性生物组学大数据整合挖掘方法初探.p（3505KB）	Degree thesis		Open access	CC BY-NC-SA