feat: add ccl registry, fix profiler by Chamberlain0w0 · Pull Request #109 · InfiniTensor/InfiniTrain

Chamberlain0w0 · 2026-02-25T05:49:52Z

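Moved the original DeviceGuard and its related definitions into the core/runtime folder.
a. Added a few runtime interfaces to DeviceGuard, including the stream/event ones
b. Grouped the blas_handle/stream/event definitions together and moved them into runtime_common.h
c. Added a RuntimeStatus definition; the plan is to later make every interface return RuntimeStatus, in line with the CUDA runtime API
d. Updated the include paths in all affected files accordingly
Removed the platform-specific macros from the Profiler code, which now goes through the runtime API provided by DeviceGuardImpl. Also tracked down the Profiler error seen in distributed runs with vpp: it came from a multi-threaded read/write conflict and is fixed by adding a mutex.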
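
The read/write conflict mentioned above is the usual case of several threads touching one shared record container. A minimal sketch of that kind of fix, assuming a hypothetical `ToyProfiler` class (it is not InfiniTrain's Profiler, and all names here are illustrative):

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a profiler; not the project's real class.
class ToyProfiler {
public:
    // May be called concurrently from several worker threads.
    void Record(const std::string &name, double elapsed_ms) {
        std::lock_guard<std::mutex> lock(mutex_); // serialize writers
        records_.push_back({name, elapsed_ms});
    }

    std::size_t Size() const {
        std::lock_guard<std::mutex> lock(mutex_); // readers take the same lock
        return records_.size();
    }

private:
    struct Entry {
        std::string name;
        double elapsed_ms;
    };
    mutable std::mutex mutex_;
    std::vector<Entry> records_;
};

// Spawns `num_threads` writers that each record `per_thread` entries,
// then returns how many entries the profiler ended up holding.
inline std::size_t RunConcurrentRecords(int num_threads, int per_thread) {
    ToyProfiler profiler;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&profiler, per_thread, t] {
            for (int i = 0; i < per_thread; ++i) {
                profiler.Record("kernel_" + std::to_string(t), 0.1 * i);
            }
        });
    }
    for (auto &worker : workers) {
        worker.join();
    }
    return profiler.Size();
}
```

Without the lock, concurrent `push_back` calls on the same vector are a data race; with it, no entry is lost.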

====== The parts below concern the communication library / distributed training ======

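Created the core/ccl/ directory and added CclGroupGuard, CclImpl and CclImplRegistry (counterparts of DeviceGuard, DeviceGuardImpl and DeviceGuardImplRegistry):
a. CclImplRegistry registers the backend communication library for each platform
b. CclImpl covers all interface definitions related to the communication library
c. CclGroupGuard is an RAII scope that automatically wraps a region similar to ncclGroupStart() / ncclGroupEnd()
d. ccl_common.h holds the definitions of comm/unique_id/ccl_status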
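
To make the registry and guard roles concrete, here is a minimal sketch. Only the class names CclImpl, CclImplRegistry and CclGroupGuard come from this PR; the method signatures, the `DeviceType` values and the `CountingCclImpl` backend are assumptions made for illustration and may differ from the real headers.

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <utility>

enum class DeviceType { kCPU, kCUDA }; // assumed values

// Assumed, reduced interface: the real CclImpl covers far more calls.
class CclImpl {
public:
    virtual ~CclImpl() = default;
    virtual void GroupStart() const = 0;
    virtual void GroupEnd() const = 0;
};

// One backend per device type, looked up at call time.
class CclImplRegistry {
public:
    static CclImplRegistry &Instance() {
        static CclImplRegistry registry;
        return registry;
    }
    void Register(DeviceType type, std::unique_ptr<CclImpl> impl) { impls_[type] = std::move(impl); }
    const CclImpl *Get(DeviceType type) const {
        auto it = impls_.find(type);
        if (it == impls_.end()) {
            throw std::runtime_error("no CCL backend registered for this device type");
        }
        return it->second.get();
    }

private:
    std::map<DeviceType, std::unique_ptr<CclImpl>> impls_;
};

// RAII scope: GroupStart() on entry, GroupEnd() on every exit path.
class CclGroupGuard {
public:
    explicit CclGroupGuard(DeviceType backend) : impl_(CclImplRegistry::Instance().Get(backend)) {
        impl_->GroupStart();
    }
    ~CclGroupGuard() { impl_->GroupEnd(); }
    CclGroupGuard(const CclGroupGuard &) = delete;
    CclGroupGuard &operator=(const CclGroupGuard &) = delete;

private:
    const CclImpl *impl_;
};

// Hypothetical backend that only counts calls, standing in for a real library.
class CountingCclImpl : public CclImpl {
public:
    void GroupStart() const override { ++starts; ++depth; }
    void GroupEnd() const override { ++ends; --depth; }
    mutable int starts = 0;
    mutable int ends = 0;
    mutable int depth = 0;
};
```

A caller would register a backend once and then open a `CclGroupGuard` scope around a batch of collective calls; the group is closed when the scope ends, including on early return or exception.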
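Removed the platform-specific code from ProcessGroup and Work, which now call the communication library API provided by CclImpl. A few details:
a. ProcessGroup gains a member of type DeviceType named backend, used to look up the matching CclImpl and call the registered platform's communication library API
b. Added ProcessGroupFactory::Instance(DeviceType), which requires a backend argument when creating or fetching the factory. The original no-argument ProcessGroupFactory::Instance() now has const-like semantics: it only returns a factory that has already been initialized with some backend and never creates one. The actual static instance is now declared at global scope.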
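
The two `Instance` overloads can be sketched as follows. This is an illustration under assumptions, not the PR's code: throwing on an uninitialized or mismatched backend is a choice made for the sketch, and function-local statics are used here where the PR places the static instance at global scope.

```cpp
#include <memory>
#include <mutex>
#include <stdexcept>

enum class DeviceType { kCPU, kCUDA }; // assumed values

class ProcessGroupFactory {
public:
    // Create-or-get: the first call fixes the backend for the process.
    static ProcessGroupFactory *Instance(DeviceType backend) {
        std::lock_guard<std::mutex> lock(Mutex());
        auto &slot = Slot();
        if (!slot) {
            slot.reset(new ProcessGroupFactory(backend));
        } else if (slot->backend() != backend) {
            // Assumed policy: refuse to switch backends silently.
            throw std::logic_error("factory already initialized with another backend");
        }
        return slot.get();
    }

    // Get-only: never creates; fails if no backend has been chosen yet.
    static ProcessGroupFactory *Instance() {
        std::lock_guard<std::mutex> lock(Mutex());
        auto &slot = Slot();
        if (!slot) {
            throw std::logic_error("factory has not been initialized with a backend");
        }
        return slot.get();
    }

    DeviceType backend() const { return backend_; }

private:
    explicit ProcessGroupFactory(DeviceType backend) : backend_(backend) {}

    static std::unique_ptr<ProcessGroupFactory> &Slot() {
        static std::unique_ptr<ProcessGroupFactory> slot;
        return slot;
    }
    static std::mutex &Mutex() {
        static std::mutex mutex;
        return mutex;
    }

    DeviceType backend_;
};
```

Both overloads take the same mutex, so a concurrent first-time creation and a get-only lookup cannot observe a half-constructed factory.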

kilinchange · 2026-02-28T03:31:16Z

infini_train/include/core/ccl/ccl.h

+
+    virtual void ReadUniqueId(CclUniqueId *unique_id, const std::string &pg_name) const;
+
+    virtual void CleanupUniqueIdFile(const std::string &pg_name) const;


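These three functions do not use any hardware-level ccl interface themselves, so they do not seem to belong in CclImpl. Could they be pulled out into a generic helper?

Moved them to ccl_utils.h and made the three functions platform-independent.

Because they read and write the contents of the unique id, CclUniqueId also gains Data()/Size()/Load() interfaces for the reworked functions to call; each platform-specific subclass of CclUniqueId needs to override these three interfaces.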
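
A sketch of how platform-independent file helpers could sit on top of Data()/Size()/Load(). The three accessor names come from this thread; their exact signatures, the `FakeUniqueId` subclass, the `WriteUniqueId` helper name, the free-function form, and the file naming scheme are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Assumed shape of the base class; signatures are guesses.
class CclUniqueId {
public:
    virtual ~CclUniqueId() = default;
    virtual const void *Data() const = 0;
    virtual std::size_t Size() const = 0;
    virtual void Load(const void *src, std::size_t size) = 0;
};

// Hypothetical backend id: 16 opaque bytes.
class FakeUniqueId : public CclUniqueId {
public:
    const void *Data() const override { return bytes; }
    std::size_t Size() const override { return sizeof(bytes); }
    void Load(const void *src, std::size_t size) override {
        if (size != sizeof(bytes)) {
            throw std::invalid_argument("unexpected unique id size");
        }
        std::memcpy(bytes, src, size);
    }
    char bytes[16] = {};
};

// Hypothetical naming scheme for the exchange file.
inline std::string UniqueIdPath(const std::string &pg_name) { return "ccl_unique_id_" + pg_name + ".bin"; }

inline void WriteUniqueId(const CclUniqueId &id, const std::string &pg_name) {
    std::ofstream out(UniqueIdPath(pg_name), std::ios::binary | std::ios::trunc);
    out.write(static_cast<const char *>(id.Data()), static_cast<std::streamsize>(id.Size()));
    out.flush();
    if (!out) {
        throw std::runtime_error("failed to write unique id file");
    }
}

inline void ReadUniqueId(CclUniqueId *id, const std::string &pg_name) {
    std::ifstream in(UniqueIdPath(pg_name), std::ios::binary);
    std::vector<char> buffer(id->Size());
    in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    if (!in) {
        throw std::runtime_error("failed to read unique id file");
    }
    id->Load(buffer.data(), buffer.size()); // the subclass decides how to store the bytes
}

inline void CleanupUniqueIdFile(const std::string &pg_name) { std::remove(UniqueIdPath(pg_name).c_str()); }
```

The helpers only see raw bytes and a length, so nothing in them depends on which communication library produced the id.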

kilinchange · 2026-02-28T09:03:44Z

infini_train/include/core/ccl/ccl.h

+
+    virtual void CreateUniqueId(CclUniqueId **unique_id) const;
+
+    virtual void GetUniqueId(CclUniqueId *unique_id) const;


nccl should only have a GetUniqueId interface, so why do we need an extra Create interface here?

Removed. Callers now invoke GetUniqueId directly and pass in a nullptr.

kilinchange · 2026-02-28T09:08:11Z

infini_train/include/core/ccl/ccl.h

+
+    virtual void GroupEnd() const;
+
+    virtual void CommGetAsyncError(const CclComm *comm, CclStatus *async_error) const;


Let's call it GetAsyncError, to align with the nccl naming
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclgeterrorstring

kilinchange · 2026-02-28T09:10:28Z

infini_train/include/core/ccl/ccl.h

+
+    virtual void CommGetAsyncError(const CclComm *comm, CclStatus *async_error) const;
+
+    virtual void CreateComm(CclComm **comm) const;


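Do comm/unique_id have create interfaces here because their types differ across platforms? If so, wouldn't it be more appropriate to wrap them in a unified type, as is done for cuda_stream, and keep the concrete type inside each hardware implementation?

Removed. CommInitAll / CommInitRank now check whether the pointer is null and new an object if so.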
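
The "allocate if null" behaviour described in the reply could look roughly like this. It is a sketch under assumptions: the parameter list of `CommInitRank`, the `FakeComm` subclass and the `CommDestroy` helper are hypothetical and only illustrate the ownership pattern.

```cpp
// Unified base type; each backend keeps its concrete handle in a subclass.
class CclComm {
public:
    virtual ~CclComm() = default;
};

// Hypothetical backend communicator.
class FakeComm : public CclComm {
public:
    int nranks = 0;
    int rank = -1;
};

// Assumed signature: initializes *comm, allocating it first when the caller passed null.
inline void CommInitRank(CclComm **comm, int nranks, int rank) {
    if (*comm == nullptr) {
        *comm = new FakeComm(); // lazily create the backend-specific object
    }
    auto *fake = static_cast<FakeComm *>(*comm);
    fake->nranks = nranks;
    fake->rank = rank;
}

// Hypothetical counterpart that releases the communicator.
inline void CommDestroy(CclComm *comm) { delete comm; }
```

With this shape the caller never needs a separate create call: it starts from `CclComm *comm = nullptr;`, passes `&comm`, and later hands the pointer back for destruction.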

kilinchange · 2026-02-28T11:14:26Z

infini_train/include/core/ccl/ccl_common.h

+        INFINI_TRAIN_CCL_STATUS_LIST(INFINI_TRAIN_CCL_STATUS_CASE)
+#undef INFINI_TRAIN_CCL_STATUS_CASE
+    default:
+        return "Unknown";


The status list already has kUnknown, so here either return "unsupported" or just crash.
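
The concern is that a `default:` branch returning "Unknown" is indistinguishable from the string of the real kUnknown entry. A small sketch of the X-macro pattern with a distinct fallback, using a hypothetical `TOY_STATUS_LIST` (the entries and values are illustrative, not the contents of ccl_common.h):

```cpp
#include <string>

// Hypothetical status list in the same X-macro style as the diff above.
#define TOY_STATUS_LIST(X) \
    X(kSuccess, 0)         \
    X(kInternal, -8)       \
    X(kUnknown, -127)

enum class ToyStatus : int {
#define TOY_STATUS_ENUM(name, value) name = value,
    TOY_STATUS_LIST(TOY_STATUS_ENUM)
#undef TOY_STATUS_ENUM
};

inline const char *ToyStatusToString(ToyStatus status) {
    switch (status) {
#define TOY_STATUS_CASE(name, value) \
    case ToyStatus::name:            \
        return #name;
        TOY_STATUS_LIST(TOY_STATUS_CASE)
#undef TOY_STATUS_CASE
    default:
        // Reached only for values outside the list, so it must not reuse "kUnknown".
        return "unsupported";
    }
}
```

Aborting in the `default:` branch instead of returning a string is the other option raised in the comment; which one fits depends on whether out-of-range values are considered a programming error.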

kilinchange · 2026-03-02T08:49:41Z

infini_train/src/core/runtime/cuda/cuda_guard_impl.h

    // event
+    void EventCreate(Event **event) const override;
+
+    void EventCreateWithFlags(Event **event, EventFlag flags) const override;


Why is creating an Event with different parameter forms implemented in GuardImpl rather than in the Event constructor?
The same goes for Destroy below.

Left a TODO.

kilinchange · 2026-03-02T08:58:41Z

infini_train/include/core/runtime/device_guard.h

+
+    virtual Stream *CreateStreamWithPriority(Device, int priority) const;
+
+    virtual void DestroyStream(Stream *) const;


Why does DeviceGuardImpl need Create/Destroy Stream interfaces? If a constructor that takes a priority is needed, add a constructor with a priority parameter to Stream.

Left a TODO.

kilinchange · 2026-03-02T09:00:22Z

infini_train/include/core/runtime/device_guard.h

+
+    virtual void EventCreateWithFlags(Event **event, EventFlag flags) const;
+
+    virtual void EventDestroy(Event *event) const;


Put the Event create/destroy interfaces into the Event class.

Left a TODO.
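
For reference, the direction suggested in these three threads (creation in the constructor, destruction in the destructor) might look like the sketch below. This is not what the PR implements, since the author deferred it with a TODO; the `EventFlag` values, the `FakeNativeEvent` handle and the live-object counter are hypothetical.

```cpp
enum class EventFlag { kDefault, kDisableTiming }; // assumed values

// Hypothetical stand-in for a platform event handle.
struct FakeNativeEvent {
    EventFlag flag;
};

// Counts live events so the ownership behaviour is observable.
inline int &LiveEventCount() {
    static int count = 0;
    return count;
}

class Event {
public:
    // Construction replaces EventCreate / EventCreateWithFlags.
    explicit Event(EventFlag flag = EventFlag::kDefault) : native_(new FakeNativeEvent{flag}) {
        ++LiveEventCount();
    }
    // Destruction replaces EventDestroy.
    ~Event() {
        delete native_;
        --LiveEventCount();
    }
    Event(const Event &) = delete;
    Event &operator=(const Event &) = delete;

    EventFlag flag() const { return native_->flag; }

private:
    FakeNativeEvent *native_;
};
```

The same constructor-based shape would apply to a Stream that takes a priority argument.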

kilinchange · 2026-03-02T09:03:54Z

infini_train/include/core/runtime/runtime_common.h

+        INFINI_TRAIN_RUNTIME_STATUS_LIST(INFINI_TRAIN_RUNTIME_STATUS_CASE)
+#undef INFINI_TRAIN_RUNTIME_STATUS_CASE
+    default:
+        return "Unknown";


There is already a kUnknown above, so use "unsupported" or LOG(FATAL) here.

kilinchange · 2026-03-02T09:14:19Z

infini_train/include/core/runtime/runtime_common.h

+    X(kAlreadyExists, -6)                                                                                              \
+    X(kPermissionDenied, -7)                                                                                           \
+    X(kInternal, -8)                                                                                                   \
+    X(kUnknown, -127)


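Is it enough for these values to guarantee that positive numbers mean normal behavior and negative numbers mean abnormal behavior, without having to match the actual status_value of each platform's underlying status_type? That is, because the status_type->status_value mapping is one we define ourselves?

The values here are our own encoding. Mapping each entry to the platform-specific value requires a switch or a map. kUnknown jumps straight to -127 because new error types may be added later; kUnknown serves as the catch-all value, which keeps the list easy to extend.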
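
The switch-based mapping mentioned in the reply can be sketched like this. The three negative entries and -127 for kUnknown come from the diff above; `kSuccess = 0`, the `FakePlatformError` codes, and the `IsOk` helper (which encodes the "negative means failure" reading from the question) are assumptions for illustration.

```cpp
// Framework-side encoding. kSuccess = 0 is an assumption; the rest is from the diff.
enum class RuntimeStatus : int {
    kSuccess = 0,
    kAlreadyExists = -6,
    kPermissionDenied = -7,
    kInternal = -8,
    kUnknown = -127, // catch-all, leaving room for new codes in between
};

// Hypothetical platform error codes; a real backend would use its own enum.
enum FakePlatformError : int {
    kFakeOk = 0,
    kFakeAlreadyMapped = 3,
    kFakeNotPermitted = 5,
    kFakeLaunchFailure = 9,
};

// Translates a platform code into the framework's own encoding.
inline RuntimeStatus ToRuntimeStatus(int platform_error) {
    switch (platform_error) {
    case kFakeOk:
        return RuntimeStatus::kSuccess;
    case kFakeAlreadyMapped:
        return RuntimeStatus::kAlreadyExists;
    case kFakeNotPermitted:
        return RuntimeStatus::kPermissionDenied;
    case kFakeLaunchFailure:
        return RuntimeStatus::kInternal;
    default:
        return RuntimeStatus::kUnknown; // anything unmapped falls through to the catch-all
    }
}

// Assumed convention: non-negative values are successes.
inline bool IsOk(RuntimeStatus status) { return static_cast<int>(status) >= 0; }
```

Because the numeric values are private to the framework, each backend only needs its own translation function, and adding a new RuntimeStatus entry does not disturb existing ones.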

Chamberlain0w0 added 3 commits February 12, 2026 17:21

draft: remove ProgressGroup and Work derivitives for NCCL

20ea5b2

fix: fix nccl error in process_group and seg fault in profiler

98a8dfe

feat: integrate runtime_common, and modify ProcessGroup related apis

4fa71bc

Chamberlain0w0 changed the title ~~[WIP] feat: add ccl registry, fix profiler~~ feat: add ccl registry, fix profiler Feb 27, 2026

Chamberlain0w0 requested a review from kilinchange February 27, 2026 03:19

fix: add EventFlag enum, fix mutex usage in ProcessGroupFactory::Inst…

ff5d317

…ance()

kilinchange requested changes Mar 2, 2026

fix: resolve requested changes, remove unnecessary api, remove nccl m…

c289934

…acros, mv unique_id file helper functions to utils

Chamberlain0w0 force-pushed the feat/ccl_registry branch 2 times, most recently from 85c777f to c289934 Compare March 3, 2026 08:41

