Improved reconstruction of transcripts and coding sequences from RNA-seq data

📄 Original title

Improved reconstruction of transcripts and coding sequences from RNA-seq data

🔗 Original link

https://academic.oup.com/nar/article/doi/10.1093/nar/gkag091/8488133?rss=1

📝 Original abstract (English)

Annotation of genes and transcripts is a key prerequisite for understanding the information that is encoded in newly sequenced genomes. One source of information suited for this purpose is RNA-seq data mapped to the respective genome sequence. RNA-seq-based approaches for transcript reconstruction generate transcript models from these data by combining regions of contiguous coverage (exons) and split read mappings (introns). Understanding phenotypes as a consequence of proteins encoded in a genome further requires the annotation of coding sequences within transcript models. We present GeMoSeq, a novel approach for transcript reconstruction from RNA-seq data that combines combinatorial enumeration of candidate transcripts with heuristics for splitting candidate transcripts into regions of contiguous coverage and subsequent likelihood-based quantification. Prediction of coding sequences is an integral part of the GeMoSeq algorithm. We benchmark GeMoSeq against previous approaches using a large collection of public RNA-seq data for seven species. For the majority of species, we observe an improved prediction performance of GeMoSeq, especially on the level of coding sequences and for species with dense genomes. We combine GeMoSeq with the homology-based approach GeMoMa to re-annotate two recently sequenced genomes of Nicotiana benthamiana lab strains, which illustrates the main purpose of GeMoSeq: the initial annotation of newly sequenced genomes with protein-coding genes.
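The general idea named in the abstract, deriving exons from regions of contiguous coverage and introns from split read mappings, can be illustrated with a minimal Python sketch. This is not GeMoSeq's implementation: the function names `covered_regions` and `introns_from_split_reads`, the `min_cov` and `min_support` thresholds, and the half-open coordinate convention are all assumptions made for illustration only.

```python
from collections import Counter
from typing import List, Tuple

# Half-open interval [start, end) in 0-based coordinates (an assumed convention).
Interval = Tuple[int, int]


def covered_regions(coverage: List[int], min_cov: int = 1) -> List[Interval]:
    """Return maximal runs of positions whose read depth is >= min_cov.

    Each run is a candidate exon in the sense of "contiguous coverage".
    """
    regions: List[Interval] = []
    start = None  # start of the run currently being extended, if any
    for pos, depth in enumerate(coverage):
        if depth >= min_cov and start is None:
            start = pos                      # a new covered run begins
        elif depth < min_cov and start is not None:
            regions.append((start, pos))     # the run ended just before pos
            start = None
    if start is not None:                    # run extends to the last position
        regions.append((start, len(coverage)))
    return regions


def introns_from_split_reads(
    split_gaps: List[Interval], min_support: int = 1
) -> List[Interval]:
    """Return gaps spanned by at least min_support split reads, sorted.

    split_gaps holds one (gap_start, gap_end) entry per split read mapping.
    """
    counts = Counter(split_gaps)
    return sorted(gap for gap, n in counts.items() if n >= min_support)


if __name__ == "__main__":
    # Toy per-base coverage: two covered blocks separated by an uncovered gap.
    toy_coverage = [0, 3, 3, 3, 0, 0, 2, 2, 0]
    # Toy split reads: two agree on the gap (4, 6), one disagrees.
    toy_gaps = [(4, 6), (4, 6), (4, 7)]
    print(covered_regions(toy_coverage))
    print(introns_from_split_reads(toy_gaps, min_support=2))
```

A full method would still have to combine these exons and introns into candidate transcripts, quantify them, and predict coding sequences, which the abstract describes only at a high level and which this sketch does not attempt.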