From Bottleneck to Breakthrough: Accelerating GSNAP with Fault-Tolerant Mapping

Overview

Excelra transformed a legacy, sequential GSNAP bulk RNA-seq pipeline into a high-performance, fault-tolerant, cloud-ready Nextflow workflow by integrating intelligent parallelization, adaptive failure handling, and automated requeuing, enabling faster and more cost-efficient RNA-seq alignment. The redesign preserved GSNAP’s high sensitivity for splice-aware mapping, variant detection, and precision transcriptomics while reducing alignment wall time by 50% and improving scalability for large-scale biomarker discovery, precision medicine, and therapeutic target validation programs. Excelra’s expertise in automation, cloud enablement, and data-driven bioinformatics aligns with its broader capabilities across Scientific Informatics and Data Curation, helping enterprises optimize workflows, accelerate insights, and drive innovation in drug discovery and development. Explore more about our Bioinformatics capabilities at Bioinformatics Services, our end-to-end Scientific Data Management solutions at Scientific Data Management, or connect with us to transform your workflow at Contact Us.

Our client

Our client is a bioinformatics and statistics group within a larger international pharmaceutical company. The group processes thousands of bulk RNA-seq libraries for biomarker discovery and therapeutic target validation, and it sought to accelerate its alignment pipeline while reducing cloud compute costs. The existing GSNAP-based pipeline suffered from long runtimes, uneven load balancing, and repeated failures caused by sporadic low-quality read segments, each of which required a manual pipeline restart.

Client’s challenge

The client was using a legacy GSNAP-based bulk RNA-seq pipeline hosted on an internal server, which lacked compatibility with modern cloud-ready bioinformatics solutions such as AWS. Due to its sequential execution model, the GSNAP alignment process was inflexible, leading to long runtimes, higher failure risk, and increased computational costs, especially when processing low-quality sequencing reads. The pipeline also lacked scalability, fault tolerance, and parallelization capabilities, making it inefficient for large datasets and precision medicine applications. Additionally, it did not support side-by-side benchmarking using alternative transcript quantification tools like Salmon or Kallisto, restricting scientific data management and bioinformatics analysis capabilities. The existing setup was also not user-friendly and limited accessibility across research teams, preventing wider organizational adoption.

Client’s goals

The client aimed to modernize their RNA-seq pipeline by transforming it into a scalable, cloud-enabled, and cost-efficient workflow with enhanced runtime performance and workflow automation. They wanted to redesign the pipeline with Nextflow, incorporating intelligent parallelization, automated failure handling, and support for multiple transcript quantification tools such as Salmon and Kallisto for comparative alignment and expression analysis. A key objective was to make the pipeline cloud-native on AWS, with features like fault tolerance, dynamic resource allocation, and scalable NGS pipeline optimization. They also wanted the solution to be user-friendly, easily deployable, and accessible across the organization to support cross-functional research teams and accelerate biomarker discovery, genomic data interpretation, and computational biology initiatives. Learn more about Excelra’s expertise in workflow modernization through our Computational Biology Services and FAIR Data Solutions.

Our Approach

For this pilot study, Excelra assessed the legacy Perl pipeline and created a Nextflow framework with modules for a selection of QC validation steps, GSNAP alignment, and the subsequent counts generation and reporting.
We then developed a novel Nextflow parallelization and adaptive requeuing framework for GSNAP within the larger bulk RNA-seq workflow. The strategy consisted of three components:

1. Intelligent parallelization

  • Reads were automatically partitioned into balanced subsets optimized for parallel GSNAP execution across multiple compute nodes.
  • Dynamic chunk sizing ensured even distribution of computational load.
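A minimal Python sketch of how balanced partitioning might be planned is shown here. It is an illustration under stated assumptions, not the pipeline's actual Nextflow code: the names `Chunk` and `plan_chunks` and the `max_pairs_per_chunk` cap are hypothetical. Given a sample's read-pair count, it picks the fewest chunks that respect the cap and makes their sizes differ by at most one read pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    start: int  # 0-based index of the first read pair in the chunk
    size: int   # number of read pairs in the chunk


def plan_chunks(total_pairs: int, max_pairs_per_chunk: int) -> list[Chunk]:
    """Plan the fewest chunks that respect the cap, sized within one pair of each other.

    Both parameters are hypothetical knobs chosen for this illustration.
    """
    if total_pairs < 0 or max_pairs_per_chunk <= 0:
        raise ValueError("total_pairs must be >= 0 and max_pairs_per_chunk > 0")
    if total_pairs == 0:
        return []
    n_chunks = -(-total_pairs // max_pairs_per_chunk)  # ceiling division
    base, extra = divmod(total_pairs, n_chunks)
    chunks: list[Chunk] = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks absorb the remainder, one pair each.
        size = base + (1 if i < extra else 0)
        chunks.append(Chunk(start, size))
        start += size
    return chunks


if __name__ == "__main__":
    for chunk in plan_chunks(total_pairs=10, max_pairs_per_chunk=4):
        print(chunk)
```

Sizing chunks from the per-sample read count, rather than using a fixed chunk size, avoids a small trailing chunk that would leave one compute node nearly idle.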

2. Adaptive failure handling

  • Jobs encountering failures due to poor-quality reads triggered an automatic re-partitioning step, splitting the problematic subset into finer-grained batches.
  • Failed batches were selectively re-queued without impacting successful jobs, avoiding costly restarts.
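The split-on-failure idea can be sketched in Python as follows. This is a simplified illustration, not the production Nextflow logic: `align_with_requeue`, `split_factor`, and `max_splits` are hypothetical names, and the `align` callable stands in for a GSNAP job that is assumed to signal failure by raising `RuntimeError`. Only the failing batch is re-partitioned and requeued, and results from successful batches are kept.

```python
from typing import Callable, Sequence


def align_with_requeue(
    reads: Sequence[str],
    align: Callable[[Sequence[str]], list[str]],
    split_factor: int = 5,  # hypothetical: sub-batches created per failure
    max_splits: int = 2,    # hypothetical: how many times a batch may be re-split
) -> tuple[list[str], list[str]]:
    """Return (aligned_results, unaligned_reads) after selective requeuing."""
    aligned: list[str] = []
    unaligned: list[str] = []
    queue = [(list(reads), 0)]  # (batch, number of splits that produced it)
    while queue:
        batch, depth = queue.pop()
        if not batch:
            continue
        try:
            aligned.extend(align(batch))
        except RuntimeError:
            if depth >= max_splits or len(batch) == 1:
                # Give up on this small batch only; everything else is unaffected.
                unaligned.extend(batch)
                continue
            step = -(-len(batch) // split_factor)  # ceiling division
            for i in range(0, len(batch), step):
                queue.append((batch[i:i + step], depth + 1))
    return aligned, unaligned


if __name__ == "__main__":
    def toy_align(batch: Sequence[str]) -> list[str]:
        # Stand-in aligner: fails whenever the batch contains the "bad" read.
        if "bad" in batch:
            raise RuntimeError("low-quality read")
        return [read.upper() for read in batch]

    reads = [f"r{i}" for i in range(50)]
    reads[17] = "bad"
    ok, lost = align_with_requeue(reads, toy_align)
    print(len(ok), sorted(lost))
```

In this toy run, a single bad read among 50 costs only the two reads in its final sub-batch, and no successful batch is rerun.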

3. Seamless pipeline integration

  • We integrated this GSNAP workflow into a new Nextflow pipeline, ensuring transparent orchestration, reproducibility, and compatibility with existing HPC and cloud backends.
  • Logging and checkpointing provided full traceability of partitions and retries.
  • Test profiles were generated.

Figure 1 Seed-extend alignment method: Short k-mer “seeds” from each RNA-seq read are first matched to candidate genomic loci using a fast index search. Each candidate location is then subjected to base-level dynamic programming extension, which reconstructs the read across mismatches, indels, and splice junctions, ultimately selecting the highest-scoring spliced alignment(s) for downstream transcriptomic analyses.
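To make the seed-and-extend idea concrete, here is a deliberately simplified Python toy. It is not GSNAP's algorithm: it assumes an exact k-mer hash index, extends without gaps, and ignores indels and splice junctions, and the function names are hypothetical. It shows only the two phases named in the caption: a fast seed lookup to find candidate loci, then base-level scoring of each candidate.

```python
from collections import defaultdict


def build_index(genome: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the genome to the positions where it occurs."""
    index: dict[str, list[int]] = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def seed_and_extend(read: str, genome: str, index: dict[str, list[int]], k: int):
    """Return (start, mismatches) for the best ungapped placement, or None."""
    # Seed phase: every k-mer hit proposes a candidate start for the whole read.
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if 0 <= start <= len(genome) - len(read):
                candidates.add(start)
    # Extend phase: score each candidate base by base (mismatches only here;
    # a real aligner would use dynamic programming to allow indels and introns).
    best = None
    for start in sorted(candidates):
        window = genome[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    return best


if __name__ == "__main__":
    genome = "ACGTACGTTAGCCGATAGGCTA"
    index = build_index(genome, k=3)
    # The read matches genome[8:16] except for one substituted base.
    print(seed_and_extend("TAGCAGAT", genome, index, k=3))
```

The extension step is where most of the cost lies in a real spliced aligner, which is one reason splitting reads across parallel jobs pays off.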

Our Solution

  • We fully automated and refactored all of the custom Perl workflows into a functional Nextflow pipeline containing 18 modules.
    One of those modules was a GSNAP alignment module with:

    • Automated workload partitioning for high-throughput parallel execution.
    • Failure-aware retry logic that isolates and reprocesses only problematic subsets.
    • Configurable parameters for chunk size, retry thresholds, and quality score cutoffs.
  • By automatically partitioning the FASTQ files into chunks and parallelizing the GSNAP alignment, we cut the GSNAP alignment time in half.
  • The resulting bulk RNA-seq Nextflow pipeline finished with a run-time of under 3 hours of wall time.

GSNAP Workflow

Figure 2 Novel GSNAP parallelization: In the mapping portion of the workflow, after the QC steps, the number of read pairs per sample is counted and the FASTQ reads are split into equally sized subsets. Each subset is trimmed and then mapped with GSNAP. If a subset job fails because of low-quality reads, that subset is divided into even smaller subsets (e.g. 5 additional GSNAP jobs, depending on sample size). This split-upon-failure process repeats once more, so that if low-quality reads cause errors in a sample, only a small percentage of its reads end up unaligned, and no one has to manually remove the offending read and requeue the whole mapping job before counts generation.
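A short back-of-the-envelope calculation shows why two rounds of splitting leave only a small share of reads unaligned. The Python sketch below assumes equally sized subsets and that every failure comes from isolated bad reads. The 5-way split follows the example in the caption, while the initial chunk count of 10 and the function name are hypothetical.

```python
def worst_case_unaligned_fraction(
    initial_chunks: int,
    split_factor: int,
    split_rounds: int,
    bad_reads: int = 1,
) -> float:
    """Upper bound on the fraction of a sample's reads left unaligned.

    Each bad read can, at worst, sink one smallest sub-batch, whose share of
    the sample is 1 / (initial_chunks * split_factor ** split_rounds).
    """
    smallest_share = 1 / (initial_chunks * split_factor ** split_rounds)
    return min(1.0, bad_reads * smallest_share)


if __name__ == "__main__":
    # Hypothetical sample: 10 initial chunks, 5-way splits, two split rounds.
    fraction = worst_case_unaligned_fraction(10, 5, 2)
    print(f"{fraction:.1%}")  # prints "0.4%"
```

Under these assumptions a single bad read costs at most 1 in 250 reads of the sample, compared with the entire mapping job when no splitting is done.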

Conclusion

Our approach directly addressed the GSNAP trade-off: GSNAP’s high-sensitivity mapping was retained, while alignment time was cut in half and fewer computational resources were wasted on failures caused by low-quality reads. This allows an RNA-seq workflow that demands full alignments to scale, keeping the technical advantages of seed-and-extend alignment and preserving its quality features, without the prohibitive cost and wall time typically associated with it.
Over just ten weeks, this pilot study delivered a robust, automated, cloud-compatible Nextflow pipeline from the legacy Perl code.
The success of the pilot led the client to continue and expand the project, with the GSNAP RNA-seq pipeline expanded to cover all of the more than 100 original steps of quality control, validation, alignment, counts generation, and reporting. Visualization tools were also developed. This work is ongoing.