High-throughput pipeline processing for 200+ sarcoma samples
Karolinska University Hospital needed a robust system capable of processing 200+ RNA-seq samples with precision and reproducibility. We delivered a scalable nf-core & Snakemake-based workflow that automated fusion detection, reduced computational overhead, and enabled seamless downstream research integration.
Fully containerized for long-term reproducibility
Optimized server infrastructure with parallel execution
Accurate fusion detection using STAR, Arriba & STAR-Fusion
Karolinska University Hospital in Sweden is one of Europe’s leading medical research institutions, conducting advanced studies in genomics, oncology, and molecular diagnostics. Their research teams rely on large-scale RNA sequencing data to identify biomarkers and genetic events that support clinical and translational research.
The research team faced multiple technical and infrastructure-related bottlenecks, including:
Scalability issues when processing hundreds of RNA-seq datasets simultaneously
High memory consumption and tool instability in STAR, Arriba, and STAR-Fusion
Manual workflows slowing down data validation and preprocessing
Fragmented environments, causing version conflicts and inconsistent outputs
Complex multi-server deployment, making reproducibility difficult
The need for a smooth transition from existing Nextflow components to a more modular Snakemake setup
These challenges prevented fast turnaround times and accurate, repeatable fusion detection for sarcoma samples.
We followed a structured, research-driven, and engineering-focused approach:
Technical Requirements Mapping - Defined data formats, research goals, quality thresholds, and desired report outputs.
Pipeline Architecture Design - Created a modular, reproducible workflow using nf-core/rnafusion and Snakemake, ensuring flexibility and future expansion.
Tool Optimization Research - Benchmarked STAR, Arriba, STAR-Fusion, and other tools to identify optimal configurations.
Containerized Environment Setup - Built Conda environments and Singularity containers to prevent version drift.
Infrastructure Planning - Architected Ubuntu server deployment with secure access, parallel job scheduling, and workflow automation.
Iterative Testing & Debugging - Ran multiple test cycles, identified memory leaks, validated genome references, and resolved software conflicts.
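The infrastructure planning step above comes down to a simple capacity question: how many alignment jobs can one server run at once without oversubscribing CPU or memory? The sketch below is one minimal way to reason about that, not the scheduler configuration used in the project. The `ServerSpec` class, the server size, and the per-job figures (8 threads, 40 GB) are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServerSpec:
    """Hypothetical description of one compute server (not from the project)."""
    name: str
    cores: int
    mem_gb: int


def max_concurrent_jobs(server: ServerSpec, threads_per_job: int, mem_per_job_gb: int) -> int:
    """Return how many jobs fit on the server without oversubscribing CPU or RAM."""
    if threads_per_job <= 0 or mem_per_job_gb <= 0:
        raise ValueError("per-job resources must be positive")
    by_cpu = server.cores // threads_per_job
    by_mem = server.mem_gb // mem_per_job_gb
    # The scarcer resource decides the number of parallel slots.
    return min(by_cpu, by_mem)


def plan_batches(samples: List[str], slots: int) -> List[List[str]]:
    """Split samples into consecutive batches of at most `slots` samples."""
    if slots <= 0:
        raise ValueError("server cannot fit a single job with these settings")
    return [samples[i:i + slots] for i in range(0, len(samples), slots)]


if __name__ == "__main__":
    # Assumed figures for illustration only.
    server = ServerSpec(name="node-1", cores=64, mem_gb=256)
    slots = max_concurrent_jobs(server, threads_per_job=8, mem_per_job_gb=40)
    samples = [f"sample_{i:03d}" for i in range(1, 201)]
    batches = plan_batches(samples, slots)
    print(f"{slots} concurrent jobs, {len(batches)} batches")
    # prints "6 concurrent jobs, 34 batches"
```

With these assumed numbers memory, not CPU, is the limiting factor (6 slots by memory versus 8 by cores), which is why memory tuning matters as much as thread counts in alignment-heavy workflows.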
A unified, scalable, and automated gene fusion detection ecosystem, including:
Modular RNA-seq pipeline using nf-core/rnafusion & Snakemake
Integration of industry-leading tools: STAR, Arriba, and STAR-Fusion
Conda & Singularity containerization for reproducibility
Multi-server deployment for parallel, distributed execution
Automated data syncing from S3 cloud storage
Workflow validation with test datasets and real patient samples
Error debugging for memory, compatibility, and version issues
Documentation for long-term maintenance and future scalability
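Automated syncing from cloud storage is only useful if the local copy can be trusted. One common safeguard is to verify every synced file against a checksum manifest before the pipeline starts. The sketch below assumes a hypothetical manifest in the `sha256sum` text layout (`<digest>  <relative path>` per line); the function names and manifest layout are illustrative and do not describe the project's actual sync tooling.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large FASTQ/BAM files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_manifest(manifest_path: Path) -> Dict[str, str]:
    """Parse lines of the form '<hex digest>  <relative path>' (assumed layout)."""
    expected: Dict[str, str] = {}
    for line in manifest_path.read_text().splitlines():
        if not line.strip():
            continue
        digest, rel_path = line.split(maxsplit=1)
        expected[rel_path.strip()] = digest.lower()
    return expected


def verify_sync(data_dir: Path, manifest_path: Path) -> Dict[str, List[str]]:
    """Return files that are missing locally or whose checksum differs from the manifest."""
    report: Dict[str, List[str]] = {"missing": [], "mismatched": []}
    for rel_path, digest in load_manifest(manifest_path).items():
        local = data_dir / rel_path
        if not local.is_file():
            report["missing"].append(rel_path)
        elif sha256_of(local) != digest:
            report["mismatched"].append(rel_path)
    return report
```

A pipeline wrapper could call `verify_sync` after each sync and refuse to start if either list in the report is non-empty.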
~40% faster data processing through optimized parallel execution
100% reproducible environment via Singularity & Conda
Significant reduction in workflow failures due to tool and memory optimization
High-confidence fusion detection integrated directly with downstream analysis tools
Modular pipeline ready for future expansion and additional research datasets
Modular Architecture allowed granular debugging, faster iteration, and flexible scaling
Containerized Environments ensured consistent results regardless of server differences
Optimized Resource Utilization reduced runtime and improved throughput
Strategic Tool Integration ensured accuracy across multiple gene fusion detection engines
Robust Infrastructure Setup allowed smooth deployment across various computing environments
All these elements combined to create a high-performance genomic analysis framework tailored for large-scale research applications.
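Running several fusion callers side by side raises the question of how to combine their output. One widely used approach, shown here as a sketch and not necessarily the rule applied in this project, is to keep fusions reported by at least two tools. The function name, the input layout (already-normalized gene pairs per tool), and the example calls are illustrative assumptions, not project results.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

GenePair = Tuple[str, str]


def consensus_fusions(
    calls_by_tool: Dict[str, Iterable[GenePair]],
    min_support: int = 2,
) -> Dict[GenePair, List[str]]:
    """Return gene pairs reported by at least `min_support` tools, with the supporting tools."""
    support: Dict[GenePair, set] = defaultdict(set)
    for tool, pairs in calls_by_tool.items():
        for gene5, gene3 in pairs:
            # Normalize case so the same pair from different tools compares equal.
            support[(gene5.upper(), gene3.upper())].add(tool)
    return {
        pair: sorted(tools)
        for pair, tools in support.items()
        if len(tools) >= min_support
    }


if __name__ == "__main__":
    # Illustrative inputs only; real calls would be parsed from each tool's output files.
    calls = {
        "arriba": {("EWSR1", "FLI1"), ("SS18", "SSX1")},
        "star_fusion": {("EWSR1", "FLI1")},
    }
    print(consensus_fusions(calls))
    # prints {('EWSR1', 'FLI1'): ['arriba', 'star_fusion']}
```

Requiring agreement trades some sensitivity for fewer false positives, so single-caller fusions are usually kept in a separate list for manual review instead of being discarded.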
Gene fusion detection requires careful tool-version control to avoid compatibility failures
Parallelization dramatically improves throughput in RNA-seq pipelines
Modularity is essential: non-modular pipelines create long-term technical debt
Memory optimization is critical for STAR and Arriba-based workflows
Cloud-to-local sync must be version-validated to maintain dataset integrity
These learnings allow us to build even better pipelines for future genomics clients.
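The first learning, strict tool-version control, can be enforced with a small pre-flight check that compares the versions found in an environment against a pinned list. The sketch below is a minimal illustration under stated assumptions: the function name is hypothetical, the version strings are placeholders, and how the observed versions are collected (for example from each container) is left out.

```python
from typing import Dict, Optional, Tuple


def find_version_drift(
    pinned: Dict[str, str],
    observed: Dict[str, str],
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each drifting tool to (pinned version, observed version or None if absent)."""
    drift: Dict[str, Tuple[str, Optional[str]]] = {}
    for tool, wanted in pinned.items():
        found = observed.get(tool)
        if found != wanted:
            drift[tool] = (wanted, found)
    return drift


if __name__ == "__main__":
    # Placeholder version numbers, not the versions used in the project.
    pinned = {"STAR": "1.0.0", "arriba": "2.0.0", "STAR-Fusion": "3.0.0"}
    observed = {"STAR": "1.0.0", "arriba": "2.1.0"}
    for tool, (wanted, found) in find_version_drift(pinned, observed).items():
        print(f"{tool}: pinned {wanted}, found {found}")
```

Failing the run when this check reports any drift keeps results comparable across servers and across re-runs months apart.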
The collaboration resulted in a powerful, scalable, and future-ready RNA-seq pipeline that gives sarcoma researchers accurate gene fusion insights. Vigous Technologies delivered a solution that blends precision, reproducibility, and automation, enabling long-term research impact and operational efficiency.

