category
bioRxiv
date
Mar 22, 2026
slug
status
Published
summary
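Presents SIMD-vectorized algorithms for high-throughput FASTA/Q parsing, with on-the-fly handling of non-ACTG characters, built-in bitpacking of DNA sequences, parsing logic expressed as a finite state machine and compiled into SIMD programs for x86 and ARM CPUs, and a tunable Rust interface (Helicase) that retrieves only the fields the caller requests. In the authors' benchmarks, Helicase meets or exceeds the throughput of all evaluated state-of-the-art parsers.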
tags
Sequencing technology
type
Post
📄 Original title
Helicase: Vectorized parsing and bitpacking of genomic sequences
🔗 Original link
💡 AI key takeaways
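Presents SIMD-vectorized algorithms for high-throughput FASTA/Q parsing, with on-the-fly handling of non-ACTG characters, built-in bitpacking of DNA sequences, parsing logic expressed as a finite state machine and compiled into SIMD programs for x86 and ARM CPUs, and a tunable Rust interface (Helicase) that retrieves only the fields the caller requests. In the authors' benchmarks, Helicase meets or exceeds the throughput of all evaluated state-of-the-art parsers.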
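To make the finite-state-machine idea concrete, here is a minimal scalar sketch of FASTA parsing driven by an explicit state machine. It is not Helicase's API and it is not vectorized: the names `State`, `Record`, and `parse_fasta` are hypothetical, and the byte-at-a-time loop only illustrates the kind of transition logic that a SIMD implementation would apply to many bytes at once.

```rust
/// Parser states for a simple FASTA state machine (hypothetical names).
#[derive(Clone, Copy, PartialEq, Debug)]
enum State {
    LineStart, // at the first byte of a line
    Header,    // inside a '>' header line
    Sequence,  // inside a sequence line
}

/// One FASTA record: header text (without '>') and concatenated sequence lines.
#[derive(Debug, PartialEq, Default)]
struct Record {
    header: Vec<u8>,
    seq: Vec<u8>,
}

/// Scalar sketch: one state transition per input byte.
/// Bytes that appear before the first header are ignored.
fn parse_fasta(input: &[u8]) -> Vec<Record> {
    let mut records: Vec<Record> = Vec::new();
    let mut state = State::LineStart;
    for &b in input {
        state = match (state, b) {
            // '>' at the start of a line opens a new record.
            (State::LineStart, b'>') => {
                records.push(Record::default());
                State::Header
            }
            // A newline always returns to the start-of-line state.
            (_, b'\n') => State::LineStart,
            // Carriage returns are skipped without changing state.
            (_, b'\r') => state,
            // Any other byte on a header line belongs to the header.
            (State::Header, _) => {
                if let Some(r) = records.last_mut() {
                    r.header.push(b);
                }
                State::Header
            }
            // Any other byte belongs to the current record's sequence.
            (State::LineStart, _) | (State::Sequence, _) => {
                if let Some(r) = records.last_mut() {
                    r.seq.push(b);
                }
                State::Sequence
            }
        };
    }
    records
}
```

Calling `parse_fasta` on a byte slice holding two records returns two `Record` values, with multi-line sequences joined into a single buffer. A FASTQ variant would need extra states for the `+` separator and the quality line.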
📝 Original English abstract
Modern sequencing pipelines routinely produce billions of reads, yet the dominant storage formats (FASTQ and FASTA) are text-based and sequential, making high-throughput parsing a persistent bottleneck in bioinformatics. Their regular, line-oriented structure makes them well-suited to SIMD vectorization, but existing libraries do not fully exploit it. We present vectorized algorithms for high-throughput FASTA/Q parsing, with on-the-fly handling of non-ACTG characters and built-in bitpacking of DNA sequences into multiple compact representations. The parsing logic is expressed as a finite state machine, compiled into efficient SIMD programs targeting both x86 and ARM CPUs. These algorithms are implemented in Helicase, a Rust library exposing a tunable interface that retrieves only caller-requested fields, minimizing unnecessary work. Exhaustive benchmarks across a wide range of CPUs show that Helicase meets or exceeds the throughput of all evaluated state-of-the-art libraries, making it the fastest general-purpose FASTA/Q parser to our knowledge. Availability: https://github.com/imartayan/helicase
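As an illustration of what bitpacking a DNA sequence means, the sketch below stores each base in 2 bits (four bases per byte) and records the positions of non-ACGT characters separately. This is an assumed encoding chosen for the example (A=0, C=1, G=2, T=3, first base in the low bits), written as plain scalar Rust; the function names are hypothetical, and the abstract does not specify which representations Helicase actually produces or how it handles such characters internally.

```rust
/// Map a nucleotide to a 2-bit code (A=0, C=1, G=2, T=3), case-insensitively.
/// Returns None for any other byte (for example 'N').
fn encode_base(b: u8) -> Option<u8> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack a sequence at 2 bits per base, four bases per byte,
/// with the first base of each group in the lowest bits.
/// Non-ACGT bytes are stored as code 0 and their positions are returned
/// in a side list (an assumption made for this sketch).
fn pack_2bit(seq: &[u8]) -> (Vec<u8>, Vec<usize>) {
    let mut packed = vec![0u8; (seq.len() + 3) / 4];
    let mut ambiguous = Vec::new();
    for (i, &b) in seq.iter().enumerate() {
        let code = match encode_base(b) {
            Some(c) => c,
            None => {
                ambiguous.push(i);
                0
            }
        };
        packed[i / 4] |= code << ((i % 4) * 2);
    }
    (packed, ambiguous)
}

/// Decode `len` bases back to uppercase ASCII.
/// Positions that held non-ACGT bytes decode as 'A' unless the caller
/// restores them from the side list.
fn unpack_2bit(packed: &[u8], len: usize) -> Vec<u8> {
    (0..len)
        .map(|i| {
            let code = (packed[i / 4] >> ((i % 4) * 2)) & 3;
            b"ACGT"[code as usize]
        })
        .collect()
}
```

With this layout, `pack_2bit(b"ACGT")` yields the single byte `0xE4` and an empty side list. A 2-bit code shrinks the sequence to a quarter of its ASCII size, and the side list is one simple way to keep the packed stream dense while still remembering where ambiguous bases occurred.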
- Author: NotionNext
- Link: https://tangly1024.com/article/32b48bd6-1f96-81cc-88ec-ff8991928e74
- License: This article is licensed under CC BY-NC-SA 4.0. Please credit the source when reposting.
