Bundler performance optimizations: 2x faster installs#9316
Bundler performance optimizations: 2x faster installs#9316tobi wants to merge 13 commits intoruby:masterfrom
Conversation
Split the ParallelInstaller into two distinct phases: Phase 1 (Download): Download ALL gems in parallel with a dedicated worker pool. Since .gem files are just archives, no dependency ordering is needed - all downloads can happen concurrently. Phase 2 (Install): Install gems with dependency-aware ordering, but with all gems already cached locally. Pure Ruby gems (no native extensions) are installed immediately without waiting for dependencies, since they don't execute code during installation. Only gems with native extensions wait for their dependencies. Additional changes in this commit: - Add Source::Rubygems#download() as standalone download method - Add has_native_extensions?() detection for install ordering - Add global gem cache at $XDG_CACHE_HOME/bundler/gems/ with hardlink-to-local-cache strategy for cross-Ruby-version sharing - Add early satisfaction check: skip entire pipeline when nothing changed (inspired by uv's SatisfiesResult::Fresh) - Make Definition#install_needed? public for the early check - Cache lockfile_exists? and use O(1) hash lookups in converge_specs - Memoize cached_gem and installed? to avoid redundant stat calls - Guard against unbounded growth of caches array - Short-circuit lockfile write when nothing changed
Use a hierarchical file copy strategy inspired by uv's linker: 1. clonefile (macOS APFS copy-on-write) via cp -cR - nearly instant 2. hardlink tree - shares inodes, no data copied 3. regular copy - fallback for cross-device or unsupported filesystems Reduce unnecessary filesystem operations: - Consolidate triple stat in strict_rm_rf to single lstat call - Read compact index info files once (was reading twice: once for MD5 checksum, once for data) - Skip mkdir_p for compact index cache dirs that already exist - Reorder cached! to check flag before File.exist? Add optional IO tracing via BUNDLER_IO_TRACE=1 environment variable for profiling filesystem operations during bundle install.
Inspired by uv's version.rs, pack gem versions into 64-bit integers: [16-bit major][16-bit minor][16-bit patch][16-bit extra]. Integer comparison is O(1) with zero allocations, replacing Gem::Version#<=> which splits strings and allocates arrays on every comparison. ~90% of real-world versions (those with <= 4 numeric segments, each <= 65535, no prerelease tags) use the fast integer path. Prerelease and unusual versions transparently fall back to Gem::Version. The resolver performs thousands of version comparisons during dependency resolution, so this reduces both CPU time and GC pressure.
Before starting PubGrub resolution, eagerly populate the spec cache for all known dependency names from both Gemfile requirements and lockfile transitive dependencies. This triggers the compact index's parallel fetching to batch network requests upfront rather than fetching specs one-by-one as the resolver discovers them. Also add a gem info cache to CompactIndex that memoizes fetched gem info by name, preventing redundant network requests when the resolver retries or multiple sources overlap. Both patterns are inspired by uv's batch_prefetch.rs and OnceMap.
Optimize data structures in frequently-called code paths: - LockfileParser: replace send() dispatch with case statement, add @specs_by_name hash for O(1) dependency-to-spec lookup - CompactIndexClient::Parser: reduce allocations in versions parsing by using index()+slice instead of split() - Index#empty?: direct @specs.empty? instead of Enumerable iteration - SpecSet: O(1) name checks in validate_deps, O(1) find_by_name_and_platform via lookup hash, cached reverse dependency map in what_required, O(1) rake lookup in sorted - LazySpecification: inline lock_name to avoid NameTuple allocation - LockfileGenerator: hash-based dedup instead of array scan - Gem::NameTuple: cache full_name, lock_name, and hash values - Gem::Installer: skip write_cache_file when cache already exists - Gem::Specification: memoize runtime_dependencies (called thousands of times during resolution, each creating a new filtered array)
The pool was hardcoded at 5 but the download phase uses 8+ workers, creating a bottleneck. Now scales to max(jobs, 8). Co-Authored-By: Claude Opus 4.6 <[email protected]>
Opt-in setting (BUNDLE_IGNORE_RUBY_UPPER_BOUNDS=true) that filters out upper-bound Ruby version requirements from gem metadata. Useful when gems haven't updated their metadata for newer Ruby versions. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add CompactVersion.compare() and versions_equal?() class methods for direct Gem::Version comparison using 64-bit packed integers. Apply in resolver sort, group_by, and GemVersionPromoter filter_versions to avoid expensive Gem::Version#<=> in hot paths. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Blake2b256 is ~3x faster than MD5 for hashing. Use it for local-only operations (cache path generation, etag paths) via fast_hexdigest. Falls back to MD5 when OpenSSL doesn't support Blake2. Protocol-level compact index checksums still use MD5 as required by the server. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Cache parsed compact index info arrays in Marshal format at info-binary/<name>.bin. On subsequent runs, load binary cache if the compact index checksum matches, skipping text parsing (string splitting, object allocation). Non-fatal on cache read/write failures. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Split installation into download -> install phases with inline extract+finalize per gem. No batch barrier between extraction and installation — each gem extracts and finalizes in one step. After downloading, a quick scan reads .gem metadata to detect native extensions WITHOUT extracting. This lets the install phase prioritize native ext gems first so compilation starts ASAP and overlaps with pure Ruby gem installation. Key changes: - extract_to_temp/finalize_with[out]_extensions in RubyGemsGemInstaller - extract_gem/finalize_gem in Source::Rubygems - Streaming install in ParallelInstaller: native ext gems enqueued first - scan_native_extensions: peek at .gem metadata to detect extensions early - uv-inspired ProgressReporter with spinner, aligned counts, slow item display - Git sources participate in parallel download phase - Worker threads silenced to prevent UI corruption (thread-local fix) - Native extension detection from real spec after extraction (not LazySpec) - Global extension cache in XDG_CACHE_HOME keyed by Ruby ABI - Global gem cache path aligned with rubygems#7249 convention - Pre-filter installed_specs by lockfile gem names for targeted lookup - Incremental Gem::Specification.add_spec instead of double reset Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Remove Bundler.ui.silence wrapper (per-worker silence suffices) - Remove stock "Installing X" / "Fetching X" / "Using X" UI messages from source/rubygems.rb — progress reporter handles all display - Remove scan_native_extensions pass that opened every .gem an extra time — native extensions detected inline during install from real spec - Single pass: download → install (no intermediate scan) - Fix progress reporter: guard write_header when no phase active, flush after finish_phase to keep summaries in scrollback Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
Could you split this large PR to small PRs? |
|
This is not reviewable. |
Key changes: - Global extracted gem cache (~/.cache/gem/extracted/) avoids re-extracting .gem files across projects/installs. Cache hit loads marshaled spec + hardlinks files into GEM_HOME in ~0.01s. - Single-pass .gem extraction reads tar once, piping data.tar.gz to native `tar` for zero-Ruby-allocation decompression. - Hardlink-first file placement (pure Ruby, no subprocess) with clonefile and cp_r fallbacks. - Hardlink .gem files into GEM_HOME/cache/ instead of copying. - Shallow git clones (--depth 1) for git source gems. - Hide cursor during parallel download/install phases. - Show current gem name in progress reporter. - Reset Gem::Specification only when actually compiling extensions.
Benchmark update — global extracted gem cacheAdded a global extracted gem cache ( Benchmark resultsEach iteration runs a paired cold→warm sequence: nuke all caches →
Geometric mean speedup vs stock cold:
What this means
The "rails" scenario uses the default |
|
Hey, thanks for sending this through @tobi Some of these optimisations overlap with things we've already got in-flight or on our radar, which is encouraging. As Kou mentioned though, the scope makes it pretty tough to review as one PR. We're going to dig through these and try to pull out the most impactful changes and look to land them incrementally so we can validate properly and make sure we're not breaking compatibility. Thanks! |
|
@colby-swandale I was going to work on splitting up this PR and rebenchmarking each step. I know there's a PR for refactoring the cache handling - are there other PRs in flight that compete with work in this PR that I should know about? |
|
@eileencodes #9210 comes to mind. The other PRs that have also overlap, but have since been merged are #9087 & #9230 |
|
Hello, indeed super hard to review one. I have checked out of curiosity one change (c859a8c) and it seems it needs more attention. Probably each of those optimizations will work as separate issue. Is there still plan to split this into smaller PRs? I have potential interested in exploring some of those suggestions, is there any tracker to follow? Is the list free to pick up? |
This comment was marked as off-topic.
This comment was marked as off-topic.
I wanted to circle back on this PR and surface visibility for other contributors that are interested in extracting some of the changes. Many of the most impactful optimizations from this PR were implemented or extracted out. All the "low" impact changes don't help move the needle much on the overall This is the list of changes that were merged or are going to
|
|
The following items are extracted from this PR and remain unaddressed:
There are also several small changes like memoization and hash-based lookups that could be worth extracting individually. |
This one is interesting. The benchmark I wrote showed that while this is faster on linux it's significantly slower on macos. There is some work that can be done to close the gap, that but I wasn't able to find a path that made it faster on macos. |
|
Hi @tobi , we spent time on your pull request after it was opened to figure out what could be pulled out and where we could find performance improvements in Rubygems/bundler. After reviewing the benchmark I originally wrote and the one you sent me, I wasn’t seeing reliable numbers. I had Claude rewrite the benchmark and eventually wrote a performance toolkit for Rubygems/bundler (it also supports benchmarking other Ruby package managers). One of the things that different benchmarks struggled with was not reducing the number of variables at play on any given run. To narrow down those variables the toolkit uses hyperfine for benchmarking, defaults to Ruby 4.0.1 (dynamically linked because it’s faster), and has a set definition for cold/warm cache, sets the correct environment variables to turn on the cache, and most importantly uses a local gemserver to avoid network time or rate limits skewing results. Ultimately while this PR does show some improvement in the warm cache, that improvement is only on Linux. On macOS the warm cache unfortunately is slower. For both Linux and macOS the cold cache is slower with this PR. In addition, when the cache is properly turned on in Bundler master (my original benchmark also wasn’t turning it on correctly either), the warm cache shows it’s already very fast. Lastly the original numbers looked more drastic because the benchmark compared against Bundler 2.7, which predates the performance improvements that had already landed. Even with my toolkit there’s still a lot of variables at play from current load on the machine to endpoint security software to what Ruby version is being used. I can get wildly different results depending on the machine and these variables. After many many many runs and lots of work to reduce uncertainty, these are the numbers I got for master versus this PR using Ruby 4.0.1 (dynamically linked, because that was faster), using a local fake network with real gems (so compiled time is still accounted for, but network time is not), 3 iterations, on a Gemfile with 164 gems. The macOS machine is my work macbook pro M3 but I also ran the numbers on my personal macbook air M3 and the numbers were similar. The Linux machine is a 32-core AWS sandbox isolated from noisy neighbors.
I was a bit surprised there was such a drastic difference between macOS and Linux so I had Claude bisect the PR to find where the regression was introduced. The goal was to understand if there were a set of commits that did meaningfully improve macOS. We found that the regression is isolated to a single code path in I put one of the bisect summaries and investigations that Claude did in a gist if anyone would like to read more about the process used. It is edited for clarity, but otherwise entirely written by Claude. ConclusionThe ideas that Claude came up with in this PR are interesting and were explored in depth. I bisected the PR in many different ways and pulled out parts that seemed most promising. Through this work we kept coming back to the warm cache on master is pretty fast already but it’s the cold cache that’s slow. The reason is because compiling native extensions completely dominates the threads in the cold cache. We can see this in the profiles below. A profile of cold cache installs on the Linux, blocking on installing native extensions: The pink bars are the threads waiting for compiling native extensions to be done. This is where we are spending all our time on a cold cache, which is arguably the most time consuming because it takes more 10-30s to install cold. Bigdecimal alone takes 3s to install, and while that’s happening the thread is just sitting around waiting.
A profile of warm cache installs on the Linux, not blocking on installing native extensions (warm caches have already compiled everything): Warm caches were always sub 5s, because they are not compiling native extensions. You can see we don’t have the same yellow and pink bars dominating the time.
All of the work we’ve done with AI and on our own points back to precompiling native extensions as our biggest win now that we’ve improved warm caches a lot. But degrading cold caches even farther is a non-starter. We have a project in the works that will solve the cold cache problem without requiring upstream gem authors to change anything. We already have data showing that we can reduce cold install times to sub 6s (so they’re inline with warm), which will be a huge win for the community. We can absolutely still find places to improve the warm cache, but that win isn’t going to be as noticeable as improving cold installs. Thank you for opening this PR. It pushed us to concretely define the benchmark that should be used for all performance work in Rubygems/bundler. We have opened issues to investigate some of the micro optimizations in this PR as a followup to our work on the cold cache performance. |


Summary
This branch implements ~35 performance optimizations to Bundler's installation pipeline, inspired by tenderlove's blog post and by studying uv's actual source code architecture. The result is a Bundler that is 1.5-2.5x faster across different workloads.
Benchmark Results (Sequential, 3 iterations each)
Cold Cache (fresh install, no gems cached)
Warm Cache (gems already downloaded, re-install)
Complete Optimization Table
Wave 1: Core Architecture (27 optimizations)
parallel_installer.rb,source/rubygems.rbparallel_installer.rbrubygems_gem_installer.rbmajor.minor.patch.extrainto one u64 for O(1) comparisoncompact_version.rb(new),resolver/candidate.rb$XDG_CACHE_HOME/gem/gems/across Ruby versionssource/rubygems.rbresolver.rbparallel_installer.rbinstaller.rb,definition.rbfetcher/compact_index.rbcaseinstead ofsend(@parse_method)lockfile_parser.rb@specs_by_namehash instead offindlockfile_parser.rblockfile_exists?— avoid repeatedFile.exist?callsdefinition.rbconverge_specs—deps_by_name+@gems_to_unlockas hashdefinition.rbindex()+ slice instead ofsplit(" ", 3)compact_index_client/parser.rbIndex#empty?fast path — direct@specs.empty?instead of iterationindex.rbvalidate_depsspec_set.rblookup[name]narrows candidatesspec_set.rb@reverse_depshash inwhat_requiredspec_set.rb@specs.findinsortedspec_set.rbspec_set.rblock_name— avoidGem::NameTupleallocationlazy_specification.rb{}instead of[]+include?lockfile_generator.rbfull_name,lock_name,hashrubygems/name_tuple.rbwrite_cache_filerubygems/installer.rbruntime_dependencies— memoize instead of filtering on every callrubygems/specification.rbWave 2: Advanced Optimizations (10 optimizations)
fetcher/gem_remote_fetcher.rbignore_ruby_upper_boundssetting — opt-in filter for<and<=Ruby constraintsmatch_metadata.rb,settings.rbcompare()andversions_equal?()class methodscompact_version.rb,resolver.rb,gem_version_promoter.rbshared_helpers.rb,compact_index_client/cache.rbcompact_index_client/parser.rbparallel_installer.rb,gem_installer.rb,rubygems_gem_installer.rb,source/rubygems.rbparallel_installer.rb$XDG_CACHE_HOME/gem/extensions/keyed by Ruby ABIsource/rubygems.rbstubs_for(name)instead of full globrubygems_integration.rb,source/rubygems.rbinstaller/progress_reporter.rb(new),parallel_installer.rbAdditional Fixes
install_needed?was private but called externallydefinition.rbcachesarray grew unbounded with duplicate entriessource/rubygems.rbcompact_index_client/cache.rbparallel_installer.rb(thread-local silence)LazySpecificationlacksextensions— native ext detection failedparallel_installer.rb(detect from real spec)source/git.rbGem::Specification.add_specinstead of double resetrubygems_gem_installer.rbArchitecture
Test plan
bundle installcold cache on synthetic workloads (small, chain, wide, medium)bundle installwarm cache on synthetic workloadsbundle execworks after installignore_ruby_upper_boundssetting works when enabled🤖 Generated with Claude Code