VCF-103 Stats: Improvements to stats run-time performance #877

Merged
alancleary merged 4 commits into main from alancleary/VCF-103-stats
Mar 20, 2026

@alancleary
Member

VCF-103 implements a new parallelization strategy for the ingestion code path. Profiling that code path showed that the allele count, variant stats, and sample stats computations were slowing down the hot path, so some simple optimizations were made. This PR cherry-picks those changes to reduce the scope of the VCF-103 PR.

These methods return the total size of the classes' buffers in bytes.
This was done by using an unordered set for tracking sample names.
This was done by using an unordered map for saving sample stats, minimizing map lookups via memoization, and replacing a nested map with structs.
This was done by reusing vectors for (missing)GT values, using an unordered set for tracking sample names, minimizing map lookups via memoization, and creating separate code paths for updating v2 and v3 arrays, the former of which can be done much more efficiently via appending instead of using the v3 insertion strategy.
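
As a rough illustration of the techniques named above (a hash map with struct values instead of a nested map, a memoized lookup, and a reused vector for GT values), here is a minimal C++17 sketch. It is not the code from this PR: `SampleStats`, `SampleStatsAccumulator`, `parse_gt`, and the counter names are all hypothetical, and the het/hom-var classification is simplified for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical per-sample counters. A flat struct stands in for a nested
// map keyed by stat name, so each update touches plain fields.
struct SampleStats {
  std::uint64_t n_records = 0;
  std::uint64_t n_het = 0;
  std::uint64_t n_hom_var = 0;
  std::uint64_t n_missing = 0;
};

// Parse a genotype string such as "0/1" or ".|." into `out`.
// `out` is cleared but keeps its capacity, so a caller that passes the
// same vector for every record avoids repeated allocations.
// Missing alleles are stored as -1.
void parse_gt(const std::string& gt, std::vector<int>& out) {
  out.clear();
  int value = 0;
  bool have_digit = false;
  for (char c : gt) {
    if (c == '/' || c == '|') {
      out.push_back(have_digit ? value : -1);
      value = 0;
      have_digit = false;
    } else if (c >= '0' && c <= '9') {
      value = value * 10 + (c - '0');
      have_digit = true;
    }
    // Any other character (e.g. '.') leaves the allele marked missing.
  }
  out.push_back(have_digit ? value : -1);
}

class SampleStatsAccumulator {
 public:
  // Record one parsed genotype for `sample`.
  void update(const std::string& sample, const std::vector<int>& gt) {
    assert(!gt.empty());
    // Memoized lookup: only hash the sample name when it differs from the
    // previous call. Pointers to unordered_map elements stay valid across
    // rehashing, so caching the element address is safe as long as
    // nothing is erased.
    if (cached_ == nullptr || cached_->first != sample) {
      cached_ = &*stats_.try_emplace(sample).first;
    }
    SampleStats& s = cached_->second;
    ++s.n_records;

    bool missing = false;
    bool all_equal = true;
    for (int allele : gt) {
      if (allele < 0) missing = true;
      if (allele != gt[0]) all_equal = false;
    }
    if (missing) {
      ++s.n_missing;
    } else if (!all_equal) {
      ++s.n_het;
    } else if (gt[0] > 0) {
      ++s.n_hom_var;
    }
  }

  // Throws std::out_of_range if the sample was never seen.
  const SampleStats& get(const std::string& sample) const {
    return stats_.at(sample);
  }

 private:
  std::unordered_map<std::string, SampleStats> stats_;
  std::pair<const std::string, SampleStats>* cached_ = nullptr;
};
```

A caller would keep one `std::vector<int>` alive across records and pass it to `parse_gt` each time. The memoized lookup pays off only when consecutive records tend to belong to the same sample, which is an assumption of this sketch rather than something the PR states.
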
@alancleary alancleary merged commit ae3b198 into main Mar 20, 2026
14 of 15 checks passed
@alancleary alancleary deleted the alancleary/VCF-103-stats branch March 20, 2026 14:39