Conversation
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
📝 WalkthroughWalkthroughRestructures GPU testing infrastructure by introducing matrix-based parallel execution for multiple test variants, reducing job timeouts to 90 minutes, separating Megatron-specific test configuration into dedicated environments, and consolidating shared test fixtures. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/gpu_tests.yml (1)
42-47:⚠️ Potential issue | 🔴 CriticalBug:
tests/gpu_megatron/**is missing from the changed-files filter.Changes made only under
tests/gpu_megatron/won't trigger the PR GPU tests because this path isn't included in the file-change detection. Thegpu-tests-prjob will be skipped.Proposed fix
files: | .github/workflows/gpu_tests.yml modelopt/** tests/gpu/** + tests/gpu_megatron/** tox.ini setup.py
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #888 +/- ##
=======================================
Coverage 73.73% 73.73%
=======================================
Files 199 199
Lines 21165 21165
=======================================
Hits 15606 15606
Misses 5559 5559 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
What does this PR do?
[Short term]: Megatron based tests take a long time often resulting in CICD timeout. Splitting megatron tests into a dedicated CICD job for faster overall CI/CD run
[Mid/Long term]: Run all megatron gpu tests using
torchruninstead ofpytestso all dist processes are already created and all individual tests no longer need to setup and destroy their processes which adds a lot of overhead per testTesting