category
bioRxiv
date
Mar 22, 2026
slug
status
Published
summary
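Presents SIMD-vectorized algorithms for high-throughput FASTA/Q parsing, combining on-the-fly handling of non-ACTG characters, bitpacking of DNA sequences into compact representations, a finite state machine compiled into SIMD programs for both x86 and ARM CPUs, and a tunable interface in the Rust library Helicase. Benchmarks show throughput that meets or exceeds that of all evaluated state-of-the-art libraries.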
tags
Sequencing technology
type
Post

📄 Original Title

Helicase: Vectorized parsing and bitpacking of genomic sequences

🔗 Original Link

💡 AI Key Takeaways

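Presents SIMD-vectorized algorithms for high-throughput FASTA/Q parsing, combining on-the-fly handling of non-ACTG characters, bitpacking of DNA sequences into compact representations, a finite state machine compiled into SIMD programs for both x86 and ARM CPUs, and a tunable interface in the Rust library Helicase. Benchmarks show throughput that meets or exceeds that of all evaluated state-of-the-art libraries.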

📝 Original English Abstract

Modern sequencing pipelines routinely produce billions of reads, yet the dominant storage formats (FASTQ and FASTA) are text-based and sequential, making high-throughput parsing a persistent bottleneck in bioinformatics. Their regular, line-oriented structure makes them well-suited to SIMD vectorization, but existing libraries do not fully exploit it. We present vectorized algorithms for high-throughput FASTA/Q parsing, with on-the-fly handling of non-ACTG characters and built-in bitpacking of DNA sequences into multiple compact representations. The parsing logic is expressed as a finite state machine, compiled into efficient SIMD programs targeting both x86 and ARM CPUs. These algorithms are implemented in Helicase, a Rust library exposing a tunable interface that retrieves only caller-requested fields, minimizing unnecessary work. Exhaustive benchmarks across a wide range of CPUs show that Helicase meets or exceeds the throughput of all evaluated state-of-the-art libraries, making it the fastest general-purpose FASTA/Q parser to our knowledge. Availability: https://github.com/imartayan/helicase
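
To make the bitpacking idea in the abstract concrete, here is a minimal scalar sketch of 2-bit DNA packing in Rust. It is not Helicase's API or its SIMD implementation: the function names (`encode_base`, `pack_2bit`, `unpack_2bit`), the A=0, C=1, G=2, T=3 code assignment, the low-bits-first layout, and the choice to record non-ACGT positions in a side list are all assumptions made for illustration.

```rust
/// Map one nucleotide byte to a 2-bit code, or `None` for anything else (e.g. `N`).
/// The A=0, C=1, G=2, T=3 assignment is an illustrative assumption.
fn encode_base(b: u8) -> Option<u8> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack a sequence at 2 bits per base, four bases per byte,
/// with the first base of each group in the lowest bits.
/// Non-ACGT characters are stored as code 0 and their positions
/// are returned separately so the caller can decide how to treat them.
fn pack_2bit(seq: &[u8]) -> (Vec<u8>, Vec<usize>) {
    let mut packed = vec![0u8; (seq.len() + 3) / 4];
    let mut ambiguous = Vec::new();
    for (i, &b) in seq.iter().enumerate() {
        let code = match encode_base(b) {
            Some(c) => c,
            None => {
                ambiguous.push(i);
                0
            }
        };
        packed[i / 4] |= code << ((i % 4) * 2);
    }
    (packed, ambiguous)
}

/// Decode `len` bases from a packed buffer back to uppercase ASCII.
/// Positions that were ambiguous in the input decode as `A` (code 0).
fn unpack_2bit(packed: &[u8], len: usize) -> Vec<u8> {
    const BASES: [u8; 4] = *b"ACGT";
    (0..len)
        .map(|i| BASES[((packed[i / 4] >> ((i % 4) * 2)) & 3) as usize])
        .collect()
}
```

With this layout, `pack_2bit(b"ACGT")` yields the single byte `0xE4`, a fourfold size reduction over one ASCII byte per base. A vectorized implementation would process many bytes per instruction instead of looping one base at a time.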