
Add SGLang Ray inference example#40

Open
xyuzh wants to merge 7 commits into main from sglang_ray

Conversation

@xyuzh
Contributor

@xyuzh xyuzh commented Feb 11, 2026

Add offline and online inference drivers with Dockerfile and Anyscale job configs for running SGLang on Ray.

xyuzh and others added 5 commits February 10, 2026 19:02
Add offline and online inference drivers with Dockerfile and Anyscale job configs for running SGLang on Ray.
…stness

- Dockerfile: use sglang[all]==0.5.8 + sgl-kernel==0.3.21 instead of git fork
- Drivers: add logging, named placement groups, exit codes, better error handling
- Job configs: add NCCL_DEBUG, fix submit path comment
- README: add How It Works, Troubleshooting, local run examples

Co-Authored-By: Claude Opus 4.6 <[email protected]>
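The driver-hardening items above (logging, exit codes, better error handling) can be sketched as a minimal stdlib-only skeleton. This is an illustration of the pattern, not the PR's actual driver code: `run_inference` is a hypothetical stand-in for the SGLang/Ray workload, and the real drivers also create named placement groups.

```python
import logging
import sys


def setup_logging() -> logging.Logger:
    # Configure a module-level logger once; the guard keeps re-imports
    # or repeated init from attaching duplicate handlers.
    logger = logging.getLogger("driver_offline")
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


def main(run_inference) -> int:
    # Wrap the workload so failures surface as a nonzero process exit
    # code (which the job scheduler can act on) instead of an
    # unhandled traceback.
    logger = setup_logging()
    try:
        run_inference()
        return 0
    except Exception:
        logger.exception("inference driver failed")
        return 1
```

A real entry point would call `sys.exit(main(...))` so the exit code propagates to the Anyscale job.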
- Rename sglang_ray_inference -> sglang_inference
- Batch inference (job.yaml + driver_offline.py) fully working with
  multi-node TP=4, PP=2 using SGLang's use_ray=True mode
- Ray Serve deployment (service.yaml + serve.py) uses same pattern as
  official Ray LLM SGLang integration with signal monkey-patching
- Add query.py script for testing the service
- Simplify configuration with environment variables

The serving example is still being validated with multi-replica
autoscaling. Single replica works; investigating occasional timeouts
with multiple replicas.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Robert Nishihara <[email protected]>
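The "simplify configuration with environment variables" item above could look like this minimal sketch. The variable names other than `MODEL_PATH` (which appears in the job config below) are assumptions mirroring the multi-node TP=4, PP=2 setup, not the example's actual interface:

```python
import os


def load_config(env=None) -> dict:
    # Read settings from the environment with defaults; TP_SIZE and
    # PP_SIZE are hypothetical knobs standing in for however the real
    # drivers receive their parallelism settings.
    if env is None:
        env = os.environ
    return {
        "model_path": env.get("MODEL_PATH", "Qwen/Qwen3-1.7B"),
        "tp_size": int(env.get("TP_SIZE", "4")),
        "pp_size": int(env.get("PP_SIZE", "2")),
    }
```

Keeping all tunables behind one `load_config()` call means the same driver runs locally and under an Anyscale job's `env_vars` block without code changes.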
worker_nodes:
- instance_type: g5.12xlarge # 4x A10G
min_nodes: 4
max_nodes: 8
Contributor Author
I see the min_nodes max_nodes settings are different for offline and serve config, is there a reason for this?

Contributor

Your fix is correct. 2 and 8 are right, since the replicas autoscale from 1 to 4.
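Under that sizing (each replica spans 2 GPU nodes, replicas autoscale 1-4), the serve config's worker pool would look roughly like the following sketch of the Anyscale config, not the exact file:

```yaml
worker_nodes:
  - instance_type: g5.12xlarge  # 4x A10G per node
    min_nodes: 2   # one replica needs 2 nodes (TP=4, PP=2)
    max_nodes: 8   # headroom for 4 replicas at 2 nodes each
```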

instance_type: m5.2xlarge # CPU-only head
worker_nodes:
- instance_type: g5.12xlarge # 4x A10G
min_nodes: 4
Contributor Author

Shouldn't we only need 2 nodes here?

max_nodes: 8

env_vars:
MODEL_PATH: "Qwen/Qwen3-1.7B"
Contributor Author

@xyuzh xyuzh Mar 8, 2026

Have you succeeded with the 30B model?

xyuzh added 2 commits March 8, 2026 19:47
- Switch from Engine to sglang.srt.ray.engine.RayEngine
- Upgrade base image to ray 2.54.0 and install from sglang main branch
- Update default model to Qwen3.5-27B
- Add threaded engine init with warmup in serve.py to avoid event loop conflicts
- Fix node counts in job/service configs (4 -> 2 worker nodes)
- Remove threading and signal monkey-patching from serve.py; use
  RayEngine directly with async_generate
- Add Dockerfile step to patch ray.serve replica.py with two-phase
  init support (compatible with Ray 2.54.0)
- Improve logging setup to avoid duplicate handlers
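The serve.py change described above (dropping the init thread and calling the engine's async generate API directly on the serving event loop) follows this shape. `StubEngine` is a stand-in for `sglang.srt.ray.engine.RayEngine`, whose real constructor arguments and response format are not shown here:

```python
import asyncio


class StubEngine:
    # Hypothetical stand-in for sglang.srt.ray.engine.RayEngine; the
    # real engine exposes an async generate API driven by the same
    # event loop that serves requests.
    async def async_generate(self, prompt: str, sampling_params: dict) -> dict:
        await asyncio.sleep(0)  # yield to the loop, as a real engine would
        return {"text": f"echo: {prompt}"}


class Deployment:
    def __init__(self, engine):
        # The engine is created and used on the running event loop.
        # With no separate init thread there is no second loop, so the
        # event-loop conflict the earlier commit worked around goes away.
        self.engine = engine

    async def __call__(self, prompt: str) -> str:
        out = await self.engine.async_generate(
            prompt, {"max_new_tokens": 16}
        )
        return out["text"]


async def demo() -> str:
    dep = Deployment(StubEngine())
    return await dep("hello")
```

In the real deployment, Ray Serve awaits `Deployment.__call__` per request, so `async_generate` calls from many requests interleave on one loop without threads or signal patching.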
