dafu-wu/README.md

Dafu Wu

AI Infrastructure Engineer

Building the substrate for AI systems that train, reason, and improve themselves — from GPU clusters to agentic RL pipelines.


What I Work On

Large-scale LLM Training Infrastructure

Led end-to-end design and implementation of cloud-native AI training infrastructure from the ground up, supporting large-scale distributed training across heterogeneous GPU clusters (A100, H100, GB200). Integrated GPU scheduling, high-performance networking, and distributed training frameworks (PyTorch, Ray) to achieve high cluster MFU (Model FLOPs Utilization). Drove system-level performance optimization across the compute, networking, and storage layers, addressing bottlenecks in NCCL communication, GPU utilization, and I/O throughput in multi-node environments.
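For readers unfamiliar with MFU, here is a minimal sketch of how it is commonly estimated. It assumes the widely used rule of thumb of roughly 6 FLOPs per parameter per token for dense transformer training (attention FLOPs ignored). The function name, model size, throughput, and GPU count are hypothetical illustration values, not measurements from the clusters described above; the peak figure is NVIDIA's published dense BF16 peak for the A100.

```python
def estimate_mfu(n_params: float, tokens_per_second: float,
                 n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Estimate Model FLOPs Utilization for dense transformer training.

    Assumption: ~6 FLOPs per parameter per token (forward + backward),
    which ignores attention FLOPs and activation recomputation.
    """
    achieved_flops = 6.0 * n_params * tokens_per_second
    peak_flops = n_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops


# Hypothetical example: a 7B-parameter model at 400k tokens/s on 128 A100s.
A100_BF16_PEAK = 312e12  # dense BF16 FLOP/s per GPU
mfu = estimate_mfu(n_params=7e9, tokens_per_second=400_000,
                   n_gpus=128, peak_flops_per_gpu=A100_BF16_PEAK)
print(f"Estimated MFU: {mfu:.1%}")
```

Because the estimate depends only on end-to-end token throughput, it folds communication stalls and I/O waits into a single number, which is why it is a convenient cluster-level health metric.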

GPU Cluster Scheduling & Kubernetes-Native AI Platforms

Architected a multi-cluster scheduling system spanning 5 GPU clusters, enabling cross-cluster workload orchestration, resource pooling, and improved global utilization. Reviewer and contributor to Volcano (CNCF), with contributions to gang scheduling, capacity plugin correctness, and DRA (Dynamic Resource Allocation) resource management. Designed end-to-end ML platforms on Kubernetes: job lifecycle management, GPU affinity, multi-tenancy, autoscaling, and observability.
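Gang scheduling means a distributed job is admitted only if all of its required pods can be placed at once, so a partially started job never holds GPUs while waiting for the rest. The toy sketch below illustrates that all-or-nothing idea only; it is not Volcano's implementation. The function `gang_place` and the node names are hypothetical, and `min_member` is loosely modeled on the `minMember` field of a Volcano PodGroup.

```python
from typing import Dict, List, Optional


def gang_place(free_gpus: Dict[str, int], min_member: int,
               gpus_per_pod: int) -> Optional[List[str]]:
    """Place `min_member` pods all-or-nothing; return node names or None.

    `free_gpus` is updated only when the whole gang fits.
    """
    remaining = dict(free_gpus)  # tentative copy, committed only on success
    placement: List[str] = []
    for _ in range(min_member):
        # First-fit over nodes in name order (a deliberately simple policy).
        node = next((name for name, free in sorted(remaining.items())
                     if free >= gpus_per_pod), None)
        if node is None:
            return None  # one pod does not fit, so reject the whole gang
        remaining[node] -= gpus_per_pod
        placement.append(node)
    free_gpus.update(remaining)  # commit the reservation
    return placement


cluster = {"node-a": 8, "node-b": 4}
placement = gang_place(cluster, min_member=3, gpus_per_pod=4)

busy = {"node-a": 8, "node-b": 4}
rejected = gang_place(busy, min_member=4, gpus_per_pod=4)  # needs 16 GPUs
```

In the second call only 12 GPUs are free, so the job is rejected and `busy` is left untouched rather than holding three pods' worth of GPUs idle.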

Agentic RL & Inference Infrastructure

Infrastructure for agentic RL training and inference. Integrated RL training frameworks (veRL, AReaL, NeMo-RL) and high-throughput inference engines (vLLM, SGLang) into production platforms. Built a pluggable OSWorld sandbox provider on the training platform, enabling closed-loop RL training pipelines for computer-use agents at scale.


Technical Stack

| Layer | Technologies |
| --- | --- |
| Training Frameworks | PyTorch, Ray |
| RL Training | veRL, AReaL, NeMo-RL |
| Inference | vLLM, SGLang |
| Distributed | NCCL |
| Orchestration | Kubernetes, Volcano |
| Hardware | A100, H100, GB200 |

Principles

  • Systems thinking first — AI infrastructure requires deep understanding of the full stack: hardware, networking, runtime, and model architecture.
  • Measure before optimizing — bottlenecks in distributed training are rarely where intuition suggests. Profile first, optimize second.
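As one way to act on "profile first", the sketch below accumulates wall-clock time per phase of a training step and reports each phase's share. It is a simplified illustration using only the standard library; `PhaseTimer` and the phase names are hypothetical, and the `time.sleep` calls stand in for real work. For asynchronous GPU kernels, the device would need to be synchronized before reading the clock, or a dedicated profiler used instead.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict


class PhaseTimer:
    """Accumulate wall-clock time per named phase."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = defaultdict(float)

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self) -> Dict[str, float]:
        """Return each phase's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


timer = PhaseTimer()
for _ in range(3):  # three simulated training steps
    with timer.phase("data_loading"):
        time.sleep(0.02)   # placeholder for reading a batch
    with timer.phase("compute"):
        time.sleep(0.005)  # placeholder for forward/backward

shares = timer.shares()
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.0%}")
```

In this simulated run the input pipeline dominates, which is the kind of result that redirects optimization effort away from the compute kernels.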

Open Source


Interests

Particularly drawn to automated research and self-improving agents — systems that can autonomously explore, experiment, and refine themselves. Interested in the infrastructure challenges these workloads introduce: long-horizon task execution, scalable sandbox environments, and tight feedback loops between inference and training.


Contact

GitHub

Pinned

  1. volcano-sh/volcano (Public)

    A Cloud Native Batch System (Project under CNCF)

    Go · 5.7k stars · 1.4k forks

  2. NVIDIA-NeMo/RL (Public)

    Scalable toolkit for efficient model reinforcement

    Python · 1.7k stars · 427 forks

  3. verl-project/verl (Public)

    verl/HybridFlow: A Flexible and Efficient RL Post-Training Framework

    Python · 22k stars · 4.1k forks

  4. areal-project/AReaL (Public)

    The RL Bridge for LLM-based Agent Applications. Made Simple & Flexible.

    Python · 5.3k stars · 522 forks