Integration/pr batch 1 by Psy-Fer · Pull Request #77 · scverse/rustar-aligner

Psy-Fer · 2026-05-29T08:54:29Z

integration/pr-batch-1: merge 14 open PRs + human-readable RAM params

Summary

Batch merge of 14 open PRs into an integration branch for combined testing before landing on main. Also extends --limitGenomeGenerateRAM and --limitBAMsortRAM to accept human-readable suffixes (64G, 512M, 1T).

Merged PRs (in order)

feat(params): accept --limitGenomeGenerateRAM for STAR CLI compatibility #36 — --limitGenomeGenerateRAM CLI compatibility flag
fix(chimeric): create output parent directory before writing #37 — Create chimeric output parent directory before writing
fix(output): move pass-1 SJ tab into <prefix>_STARpass1/SJ.out.tab #44 — Move pass-1 SJ output to _STARpass1/SJ.out.tab
fix(stats): route reads exceeding --outFilterMultimapNmax to too_many_loci counter #54 — Fix PE too_many_loci counter (n_for_mapq=0 bug)
fix(bam): subtract Phred+33 offset from QUAL before writing to BAM #38 — Subtract Phred+33 offset from QUAL before writing to BAM
fix(bam): emit SAM-spec NM:i: tag distinct from STAR-internal nM:i: #42 — Emit SAM-spec NM:i: distinct from STAR-internal nM:i:
feat(bam): populate @PG header with PN/VN/CL fields #40 — Populate @PG header with PN/VN/CL fields
fix(sjdb): key SpliceJunctionDb in genome-absolute 0-based coordinates #45 — Key SpliceJunctionDb in genome-absolute 0-based coordinates
fix(stitch): use sjdb_score in place of motif_score on annotated junctions #52 — Use sjdb_score in place of motif_score on annotated junctions
fix: decode Gsj-region SA hits into split splice seeds #51 — Decode Gsj-region SA hits into split splice seeds
feat(bam): emit XS:A: strand tag when --outSAMstrandField intronMotif #43 — Emit XS:A: strand tag when --outSAMstrandField intronMotif
fix(transcriptome-bam): write per-record RG:Z: tag matching @RG header #39 — Write per-record RG:Z: tag on transcriptome BAM
fix(transcriptome-bam): populate mate fields on paired-end records #41 — Populate mate fields on PE transcriptome BAM records
chore: switch sam attributes to bitflags #67 — Switch SAM attributes from HashSet<String> to bitflags (SamAttributes)
fix(stats): route no-seeds-no-clusters reads to UnmappedReason::Other #49 — Confirmed already on main; closed without merge

Additional changes

--limitGenomeGenerateRAM and --limitBAMsortRAM now accept human-readable memory values (64G, 512M, 1T, 1024, etc.) in addition to raw byte counts. Default changed to 31G. Tests added for both params.

Conflict resolutions

feat(bam): emit XS:A: strand tag when --outSAMstrandField intronMotif #43 vs chore: switch sam attributes to bitflags #67 (params/mod.rs): HashSet<String> attribute methods removed; XS/intronMotif coupling baked into validate() — intronMotif adds XS, anything else strips it (matches STAR's final output behaviour regardless of input path).
fix(bam): emit SAM-spec NM:i: tag distinct from STAR-internal nM:i: #42 vs chore: switch sam attributes to bitflags #67 (sam.rs): SamAttributes::NM bit covers both "NM" and "nM" CLI tokens; both NM:i: (edit distance) and nM:i: (mismatches-only) are emitted together when the bit is set.
fix: decode Gsj-region SA hits into split splice seeds #51 (junction/mod.rs): Test struct literal updated to include new n_genome_real field added by fix: decode Gsj-region SA hits into split splice seeds #51.
fix(transcriptome-bam): write per-record RG:Z: tag matching @RG header #39 (sam.rs): Test used undeclared CigarOp type alias; replaced with cigar::Op::new(cigar::op::Kind::Match, 4).

Test plan

cargo test — 433 tests passing, 0 failures
cargo clippy --all-targets — 0 warnings
Run STAR faithfulness benchmark against yeast SE/PE test data
Verify BAM QUAL scores with samtools stats (average_quality ~35, not ~68)
Verify NM:i: tag present and correct with samtools view
Verify XS:A: tag emitted when --outSAMstrandField intronMotif
Verify splice count improvement from Gsj seeding (fix: decode Gsj-region SA hits into split splice seeds #51 + fix(sjdb): key SpliceJunctionDb in genome-absolute 0-based coordinates #45 + fix(stitch): use sjdb_score in place of motif_score on annotated junctions #52 combined)

STAR exposes --limitGenomeGenerateRAM as a memory cap for genomeGenerate. Pipelines that wrap STAR commonly derive a value from their task resources and pass it through. Currently rustar rejects the flag at the CLI parser, breaking those pipelines. Accept the flag with STAR's default of 31 GiB. The value is parsed but not enforced — rustar's memory management is independent. Fixes #25

The chimeric output writer constructs its path as <outFileNamePrefix>/Chimeric.out.junction, treating the prefix as a directory regardless of whether it ends in `/`. In two-pass mode the chim writer fires before any other output creates that directory, so the file open fails with "No such file or directory" and the entire run aborts. Call create_dir_all on the parent of the chim output path before opening the file. Without two-pass mode the bug was masked because another output writer happened to create the dir first. Fixes #35

The BAM binary QUAL field stores raw Phred values (0-93) per SAM spec §4.2.3. rustar was passing FASTQ ASCII bytes (Phred+33 encoded, e.g. 'B' = ASCII 66 = Phred 33) straight to noodles' QualityScores::from, which expects raw Phred values. Every QUAL byte in a rustar BAM was therefore +33 above what the spec mandates; samtools view rendered the SAM text with another +33 on top, producing biologically impossible quality strings like 'cccgg...'. Subtract 33 from each FASTQ ASCII byte at every BAM-write call site before constructing the QualityScores. Use saturating_sub so a malformed FASTQ byte < 33 clamps to 0 rather than underflowing. samtools stats average_quality on the nf-core/rnaseq test profile now reads 35.3 (matching STAR) rather than the previous 68.3. Fixes #34

@rg

When --outSAMattrRGline is supplied, rustar writes the @rg header to both the genome and transcriptome BAMs, but only stamps the per-record RG:Z: tag on the genome BAM. The transcriptome BAM was emitting 0 RG:Z: records vs STAR's 1-per-record output. Add the RG tag stamp to the transcriptome record builder (paired-end and single-end paths) so every record carries RG:Z:<id> matching the @rg header, byte-symmetric with STAR. Fixes #32

@pg

Per SAM spec §1.3, the @pg line conventionally carries PN (program name), VN (version), and CL (command line) alongside ID. rustar was emitting only ID:rustar-aligner, leaving downstream provenance tools (MultiQC's program-version table, dx-toolkit lineage tracking) with a blank entry. Expand the header writer to emit: @pg\tID:rustar-aligner\tPN:rustar-aligner\tVN:<cargo pkg version>\tCL:<args> The full command line is captured in main() before clap parses it, then threaded into Parameters via a new (skip) field so it reaches the SAM header builder. Version comes from CARGO_PKG_VERSION at compile time. This matches STAR's @pg format and gives downstream tools the provenance they need. Fixes #33 (the @pg header gap; AS divergence is a separate item).

build_transcriptome_records_pe was stamping SEGMENTED + FIRST/LAST_SEGMENT but leaving PROPERLY_SEGMENTED, MATE_REVERSE_COMPLEMENTED, RNEXT, PNEXT, and TLEN unset on every transcriptome record. Salmon's alignment-mode fragment-length inference reads those mate fields; with them missing it falls back to its 250+/-25 prior, distorting paired-end TPMs. Walk the projected (rec1, rec2) pairs after the existing flag-stamping loops and mirror the mate-bookkeeping the genome-space build_paired_mate_record already does: PROPERLY_SEGMENTED + MATE_REVERSE_COMPLEMENTED + mate ref id + mate position + signed TLEN (leftmost +, rightmost -). Fixes #22

--outSAMattributes NM was being routed to the existing nM writer (substitutions only), so requests for the SAM-spec edit-distance tag silently produced wrong values. Downstream tools that read NM:i: (samtools stats, picard, MultiQC) saw nothing. Treat NM and nM as distinct attribute tokens. When NM is requested, compute SAM-spec edit distance per the SAM v1 spec section 1.4: substitutions + inserted bases + deleted bases (excluding intron N skips). Keep nM emission unchanged when explicitly requested. Fixes #29

The flag was accepted at the CLI parser but never produced any XS:A: tags on the output BAM. Downstream tools that need strand inference (StringTie, Cufflinks, rseqc infer_experiment.py) saw nothing. Mirror STAR's two coupling paths: --outSAMattributes XS auto-enables --outSAMstrandField intronMotif, and vice versa. For each output record with at least one canonical intron motif, map the motif to + or - per STAR's convention; non-canonical or mixed motifs omit the tag. Fixes #30

STAR writes three log files alongside Log.final.out and keeps two-pass intermediates in <prefix>_STARpass1/. rustar was only writing Log.final.out and emitting the pass-1 SJ tab as <prefix>SJ.pass1.out.tab at the top level. Add minimal Log.out (parameters dump + per-phase timestamps) and Log.progress.out (timestamp + mapping speed) writers next to the existing Log.final.out writer. Move the pass-1 SJ tab into <prefix>_STARpass1/SJ.out.tab and mkdir the parent first. The Log.out content is intentionally a stub matching the file's existence rather than STAR's full verbosity; that's a follow-up. Fixes #28 Co-Authored-By: Claude <noreply@anthropic.com>

The DB was keyed in chromosome-local 1-based coords (straight from the GTF) but consulted in genome-absolute 0-based at stitch time, so every sjdb lookup during alignment missed. Annotated splice events never got the sjdb_score bonus and the stricter overhang gate fired in their place, dropping ~50 % of GT/AG splices on the test profile. The stats-time site queried in genome-absolute 1-based-equivalent, so it accidentally matched on chr 0 (chr_start=0) but missed on every other chromosome -- producing inconsistent answers between SJ.out.tab and Log.final.out on the same BAM. Normalise the DB to genome-absolute 0-based at construction, matching the convention prepared_junctions and SpliceJunctionStats already use. Update the stats-time call site to query in the new space. Stitch-time needs no change. Fixes #27 Co-Authored-By: Claude <noreply@anthropic.com>

The previous commit added writers that mimicked STAR's section-header structure with placeholder content (a Debug-format params dump + three timestamps for Log.out, a header + final-timestamp line for Log.progress.out). That closes the file-existence gap but makes the output look like STAR-equivalent verbose logging when it isn't — downstream tools parsing for memory usage, per-chunk progress, or warnings would silently get nothing. Drop the stub writers. The SJ.pass1.out.tab → <prefix>_STARpass1/SJ.out.tab placement fix stays — that's a real path-parity change with no fakery. Real STAR-equivalent Log.out / Log.progress.out content (per-phase progress, warnings, memory, periodic updates during long runs) is a separate engineering effort, not a stub. Co-Authored-By: Claude <noreply@anthropic.com>

Accepting the flag silently could mislead a user who passes a non-default cap into thinking rustar enforces it. Emit a warning whenever the value differs from STAR's default so the user knows the cap isn't applied. Co-Authored-By: Claude <noreply@anthropic.com>

The previous commit normalised the DB to genome-absolute 0-based and updated the stats-time query. Stitch-time was left unchanged on the assumption that it already passed genome-absolute 0-based — half right. donor_fwd was correct (first intron base, 0-based) but acceptor_fwd = donor_fwd + del landed on the first base AFTER the intron, while the DB keys store the last intron base. Lookup missed every annotated junction. Subtract 1 from the acceptor to land on the last intron base. After this, Log.final.out Annotated (sjdb) goes from 0 to a non-zero count that's consistent with the annotated=1 row count in SJ.out.tab on the same BAM. Co-Authored-By: Claude <noreply@anthropic.com>

Pass-1 alignment was silently dropping SA hits that fall in the appended splice-junction flanking buffer (Gsj). Those hits encode candidate splice events for annotated junctions: when a read seed straddles the donor/acceptor boundary of a Gsj slot, the matching genome bytes live in two non-adjacent real-genome positions. STAR decodes these via g1 >= P.nGenomeReal and feeds the (donor, acceptor) pair into the seed pipeline; rustar-aligner was discarding them at position_to_chr (which returns None for positions past the last real chromosome). Plumb n_genome_real onto Genome (pinned at the pre-sjdb forward total) and reload prepared junctions + sjdbOverhang from sjdbInfo.txt on the index-load path. cluster_seeds now expands each SA hit through decode_gsj_hit: real-genome hits pass through unchanged, single-flank Gsj hits are skipped (the equivalent real-genome SA entry already exists in the same range), and boundary-crossing Gsj hits split into two virtual seeds with the donor-side read run and the acceptor-side read run pointing at their respective real-genome positions. The existing stitch DP then chains them via its splice branch. Verified on the yeast SRR6357072 reproducer from the issue: Number of splices: Total jumps from 366 to 631 (target ~720) and GT/AG from 266 to 531 with non-canonical unchanged at 93. The remaining gap is gated on PR #45's annotated-junction lookup landing. Fixes #47 Co-Authored-By: Claude <noreply@anthropic.com>

…tions STAR's Transcript_alignScore.cpp adds either sjdbScore (annotated) or motifScore (unannotated) to the alignment score, not both. rustar was adding both on annotated junctions, over-crediting the score by motif_score per junction. Currently invisible because #27 keeps is_annotated false in practice; becomes the dominant remaining AS divergence on annotated splices as soon as #27 lands. Fixes #50 Co-Authored-By: Claude <noreply@anthropic.com>

…tion The helper at src/align/score.rs:137-143 had the same additive bug as the stitch-time call site this PR fixes — adding sjdb_score to the motif score rather than replacing it. Currently only called from its own tests, but a landmine if any other caller picks it up. Switch to replacement semantics matching the stitch path. Update the tests that locked in the old (wrong) additive result. Co-Authored-By: Claude <noreply@anthropic.com>

…_loci The existing AlignmentStats::record_alignment routing arm for n_alignments > max_multimaps was unreachable on the PE path - the too-many-loci filter returned n_for_mapq = 0, so the caller hit record_alignment(0, ...) and incremented the unmapped bucket. Log.final.out always reported 0 in the too-many-loci field and the per-bucket sum fell 3-of-50000 short of input on the yeast PE test profile (STAR records exactly 3 there on the same data). Return joint_pairs.len() as n_for_mapq from the PE filter site and route through record_alignment with that count, mirroring the SE caller pattern so the _ => arm fires. The read still isn't emitted in the BAM with outSAMmultNmax=-1, matching STAR; only the accounting changes. Fixes #53 Co-Authored-By: Claude <noreply@anthropic.com>

n_for_mapq is usize, so `if n > 0 { n } else { 0 }` is identical to passing n_for_mapq directly to record_alignment. Match the SE caller shape: the variable already carries the right count from the PE TooManyLoci return site, and record_alignment routes via max_multimaps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…integration/pr-batch-1

…o integration/pr-batch-1

… integration/pr-batch-1

…dditive' into integration/pr-batch-1

…to integration/pr-batch-1

…tegration/pr-batch-1

…into integration/pr-batch-1

…lds' into integration/pr-batch-1

…-batch-1

pinin4fjords and others added 30 commits May 12, 2026 17:59

chore(params): trim verbose doc comment

9a7be3e

chore(chimeric): remove narrative test comment

176a94c

chore(bam): remove narrative inline comment

586f206

chore(transcriptome-bam): remove narrative comments

5930c04

chore(bam): remove narrative comments

0e49edd

Merge branch 'main' into pr/pinin4fjords/51

4784bd0

Merge branch 'main' into pr/pinin4fjords/45

0ab761f

Merge branch 'main' into pr/pinin4fjords/42

aaa44de

Merge branch 'main' into pr/pinin4fjords/41

94f7e78

Merge branch 'main' into pr/pinin4fjords/37

c2967b0

Merge branch 'main' into fix/chim-parent-dir-create

3fa407e

Merge branch 'main' into pr/pinin4fjords/41

4bb2803

Psy-Fer added 12 commits May 29, 2026 17:47

Merge remote-tracking branch 'pinin4fjords/fix/nm-tag-sam-spec' into …

cc00dff

…integration/pr-batch-1

Merge remote-tracking branch 'pinin4fjords/fix/pg-header-content' int…

0275cac

…o integration/pr-batch-1

Merge remote-tracking branch 'pinin4fjords/fix/sjdb-coord-space' into…

8e5be3b

… integration/pr-batch-1

Merge remote-tracking branch 'pinin4fjords/fix/sjdb-replacement-not-a…

ebd3623

…dditive' into integration/pr-batch-1

Merge remote-tracking branch 'pinin4fjords/fix/pass1-sjdb-seeding' in…

67086b5

…to integration/pr-batch-1

add missing n_genome_real for #51

3194c97

Merge remote-tracking branch 'pinin4fjords/fix/xs-strand-tag' into in…

0fe9b08

…tegration/pr-batch-1

Merge remote-tracking branch 'pinin4fjords/fix/transcriptome-rg-tag' …

a47ddda

…into integration/pr-batch-1

Fix CigarOp to cigar::op::new

74ebfde

Merge remote-tracking branch 'pinin4fjords/fix/transcriptome-mate-fie…

98684f8

…lds' into integration/pr-batch-1

Merge remote-tracking branch 'origin/pa/sam-tags' into integration/pr…

188fe68

…-batch-1

fmt fix

baf11e4

Psy-Fer self-assigned this May 29, 2026

Update run_tests.sh

8c8eb24

Psy-Fer merged commit 7f4767e into main May 29, 2026
10 checks passed

Psy-Fer deleted the integration/pr-batch-1 branch May 29, 2026 12:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integration/pr batch 1#77

Integration/pr batch 1#77
Psy-Fer merged 57 commits into
mainfrom
integration/pr-batch-1

Psy-Fer commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Psy-Fer commented May 29, 2026

integration/pr-batch-1: merge 14 open PRs + human-readable RAM params

Summary

Merged PRs (in order)

Additional changes

Conflict resolutions

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants