feat: add ccl registry, fix profiler #109

Open
Chamberlain0w0 wants to merge 5 commits into master from feat/ccl_registry

@Chamberlain0w0 (Contributor) commented Feb 25, 2026

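  1. Moved the original DeviceGuard and its related definitions under the core/runtime folder.
    a. Added a few runtime interfaces to DeviceGuard, including the stream/event-related ones.
    b. Gathered the blas_handle/stream/event definitions in one place, in runtime_common.h.
    c. Added a RuntimeStatus definition. The plan is to later make every interface return RuntimeStatus, in line with the CUDA runtime API.
    d. Updated the include paths in all files accordingly.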

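  2. Removed the platform-specific macros from the Profiler code, which now goes through the runtime API provided by DeviceGuardImpl. Also tracked down the Profiler error seen in distributed runs with vpp: it came from a multi-threaded read/write conflict and is fixed by adding a mutex.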
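
The mutex fix mentioned in item 2 can be pictured with the minimal sketch below. The Profiler's actual data structures are not shown in this PR, so `ProfileRecord`, `RecordStore`, and `RunConcurrentWriters` are hypothetical names used only for illustration. The point is simply that every reader and writer of the shared container takes the same lock.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical record type; the real Profiler's fields are not shown in the PR.
struct ProfileRecord {
    std::string name;
    double elapsed_ms;
};

// Hypothetical store shared by all worker threads.
class RecordStore {
public:
    void Add(const std::string &name, double elapsed_ms) {
        std::lock_guard<std::mutex> lock(mutex_); // serialize writers
        records_.push_back({name, elapsed_ms});
    }

    std::size_t Size() const {
        std::lock_guard<std::mutex> lock(mutex_); // readers take the same lock
        return records_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<ProfileRecord> records_;
};

// Spawns `num_threads` writers that each add `per_thread` records,
// then returns how many records ended up in the store.
inline std::size_t RunConcurrentWriters(int num_threads, int per_thread) {
    RecordStore store;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&store, per_thread] {
            for (int i = 0; i < per_thread; ++i) { store.Add("op", 1.0); }
        });
    }
    for (auto &th : threads) { th.join(); }
    return store.Size();
}
```

Without the lock, concurrent `push_back` calls on the same vector are a data race; with it, no record is lost regardless of thread interleaving.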

====== The part below concerns the communication library / distributed training ======

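  1. Created the core/ccl/ directory and added CclGroupGuard, CclImpl, and CclImplRegistry (the counterparts of DeviceGuard, DeviceGuardImpl, and DeviceGuardImplRegistry):
    a. CclImplRegistry registers the backend communication library for each platform.
    b. CclImpl covers all interface definitions related to the communication library.
    c. CclGroupGuard uses RAII scoping to automatically wrap a region in the equivalent of ncclGroupStart() and ncclGroupEnd().
    d. ccl_common.h holds the comm/unique_id/ccl_status definitions.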
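
A minimal sketch of how the three pieces could fit together is shown below. It is not the PR's code: the class names come from the description above, but the method signatures, the `DeviceType` values, and the `CountingCclImpl` test backend are assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <utility>

// Hypothetical device enum; the real DeviceType is defined elsewhere in the code base.
enum class DeviceType { kCPU, kCUDA };

// Backend interface. Only the two group calls are sketched here.
class CclImpl {
public:
    virtual ~CclImpl() = default;
    virtual void GroupStart() const = 0;
    virtual void GroupEnd() const = 0;
};

// Holds one implementation per device type, looked up at run time.
class CclImplRegistry {
public:
    static CclImplRegistry &Instance() {
        static CclImplRegistry registry;
        return registry;
    }

    void Register(DeviceType type, std::unique_ptr<CclImpl> impl) { impls_[type] = std::move(impl); }

    // Returns nullptr when no backend is registered for `type` (an assumption of this sketch).
    const CclImpl *Get(DeviceType type) const {
        auto it = impls_.find(type);
        return it == impls_.end() ? nullptr : it->second.get();
    }

private:
    std::map<DeviceType, std::unique_ptr<CclImpl>> impls_;
};

// RAII guard: GroupStart() on construction, GroupEnd() on scope exit.
class CclGroupGuard {
public:
    explicit CclGroupGuard(const CclImpl *impl) : impl_(impl) { impl_->GroupStart(); }
    ~CclGroupGuard() { impl_->GroupEnd(); }
    CclGroupGuard(const CclGroupGuard &) = delete;
    CclGroupGuard &operator=(const CclGroupGuard &) = delete;

private:
    const CclImpl *impl_;
};

// Hypothetical backend that only tracks how many groups are currently open.
class CountingCclImpl : public CclImpl {
public:
    void GroupStart() const override { ++depth; }
    void GroupEnd() const override { --depth; }
    mutable int depth = 0;
};
```

Because the guard calls `GroupEnd()` in its destructor, the group is closed on every exit path from the scope, including early returns.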

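  2. Removed the platform-specific code from ProcessGroup and Work, which now go through the communication library API provided by CclImpl. A few details:
    a. ProcessGroup gains a DeviceType member called backend, which is used to fetch the matching CclImpl and call the registered platform's communication library API.
    b. Added a ProcessGroupFactory::Instance(DeviceType) interface, so creating or fetching the factory requires a backend argument. The original no-argument ProcessGroupFactory::Instance() now behaves like a const accessor: it only returns a factory that has already been initialized with some backend and never creates one. The actual static instance is now declared at global scope.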
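
The two `Instance` overloads described in item b could look roughly like the sketch below. The PR does not say what happens when the no-argument overload is called before initialization, or when a second backend is requested, so returning `nullptr` and throwing `std::logic_error` are assumptions of this sketch, as are the `g_factory` and `g_factory_mutex` names.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <stdexcept>

// Hypothetical device enum; the real DeviceType is defined elsewhere in the code base.
enum class DeviceType { kCPU, kCUDA };

class ProcessGroupFactory {
public:
    // Creates the factory for `backend` on first use, otherwise returns the existing one.
    static ProcessGroupFactory *Instance(DeviceType backend);

    // Lookup only: returns the already-initialized factory and never creates one.
    static const ProcessGroupFactory *Instance();

    DeviceType backend() const { return backend_; }

private:
    explicit ProcessGroupFactory(DeviceType backend) : backend_(backend) {}
    DeviceType backend_;
};

namespace {
// The instance lives at namespace scope so that both overloads share it.
std::mutex g_factory_mutex;
std::unique_ptr<ProcessGroupFactory> g_factory;
} // namespace

inline ProcessGroupFactory *ProcessGroupFactory::Instance(DeviceType backend) {
    std::lock_guard<std::mutex> lock(g_factory_mutex);
    if (!g_factory) {
        g_factory.reset(new ProcessGroupFactory(backend));
    } else if (g_factory->backend() != backend) {
        // Assumption: mixing backends in one process is treated as a usage error.
        throw std::logic_error("ProcessGroupFactory already initialized with another backend");
    }
    return g_factory.get();
}

inline const ProcessGroupFactory *ProcessGroupFactory::Instance() {
    std::lock_guard<std::mutex> lock(g_factory_mutex);
    return g_factory.get(); // nullptr until Instance(DeviceType) has been called
}
```

Keeping creation in the overload that takes a backend means code that only reads the factory can never initialize it with the wrong backend by accident.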

@Chamberlain0w0 Chamberlain0w0 changed the title from "[WIP] feat: add ccl registry, fix profiler" to "feat: add ccl registry, fix profiler" Feb 27, 2026

virtual void ReadUniqueId(CclUniqueId *unique_id, const std::string &pg_name) const;

virtual void CleanupUniqueIdFile(const std::string &pg_name) const;
Collaborator

These three functions do not use any hardware-level ccl interface themselves, so they do not seem to belong in CclImpl. See whether they can be factored out into common helper functions.

Contributor Author

Moved them out into ccl_utils.h and made these three functions platform-independent.

Since they read and write the contents of the unique ID, I also added Data()/Size()/Load() interfaces to CclUniqueId for the three revised functions to call. The platform-specific subclasses of CclUniqueId need to override these three interfaces.
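
As a rough illustration of that split, the sketch below keeps the file helpers free of any backend type by going through Data()/Size()/Load() only. The exact signatures of those three methods are not shown in this thread, so the ones here are assumptions. `WriteUniqueId` is an assumed name for the third helper, `FakeUniqueId` and `UniqueIdFileName` are hypothetical, and `ReadUniqueId` returns `bool` here for easier checking whereas the snippet under review returns `void`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Platform-neutral view of a unique ID: raw bytes in, raw bytes out.
class CclUniqueId {
public:
    virtual ~CclUniqueId() = default;
    virtual const void *Data() const = 0;
    virtual std::size_t Size() const = 0;
    virtual void Load(const void *src, std::size_t size) = 0;
};

// Hypothetical backend ID made of 16 opaque bytes.
class FakeUniqueId : public CclUniqueId {
public:
    const void *Data() const override { return bytes_; }
    std::size_t Size() const override { return sizeof(bytes_); }
    void Load(const void *src, std::size_t size) override {
        // Copy at most sizeof(bytes_) bytes so an oversized input cannot overflow.
        std::memcpy(bytes_, src, size < sizeof(bytes_) ? size : sizeof(bytes_));
    }

private:
    unsigned char bytes_[16] = {};
};

// Hypothetical naming scheme for the exchange file.
inline std::string UniqueIdFileName(const std::string &pg_name) { return "ccl_unique_id_" + pg_name + ".bin"; }

inline void WriteUniqueId(const CclUniqueId &id, const std::string &pg_name) {
    std::ofstream out(UniqueIdFileName(pg_name), std::ios::binary | std::ios::trunc);
    out.write(static_cast<const char *>(id.Data()), static_cast<std::streamsize>(id.Size()));
}

// Returns false when the file is missing or shorter than the ID.
inline bool ReadUniqueId(CclUniqueId *id, const std::string &pg_name) {
    std::ifstream in(UniqueIdFileName(pg_name), std::ios::binary);
    if (!in) { return false; }
    std::vector<char> buffer(id->Size());
    in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    if (static_cast<std::size_t>(in.gcount()) != buffer.size()) { return false; }
    id->Load(buffer.data(), buffer.size());
    return true;
}

inline void CleanupUniqueIdFile(const std::string &pg_name) { std::remove(UniqueIdFileName(pg_name).c_str()); }
```

The helpers never name a backend type, so adding a new platform only requires a new CclUniqueId subclass.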


virtual void CreateUniqueId(CclUniqueId **unique_id) const;

virtual void GetUniqueId(CclUniqueId *unique_id) const;
Collaborator

NCCL should only have a GetUniqueId interface. Why do we have an extra Create interface here?

Contributor Author

Removed. Just call this one directly and pass in a nullptr.


virtual void GroupEnd() const;

virtual void CommGetAsyncError(const CclComm *comm, CclStatus *async_error) const;
Collaborator

Contributor Author

Fixed.


virtual void CommGetAsyncError(const CclComm *comm, CclStatus *async_error) const;

virtual void CreateComm(CclComm **comm) const;
Collaborator

Do comm/unique_id have create interfaces here because the comm/unique_id types differ across platforms? If so, wouldn't it be more appropriate to wrap them in a unified type, as with cuda_stream, and keep the concrete type inside each hardware implementation?

Contributor Author

Removed. CommInitAll / CommInitRank now check whether the comm is null and, if so, allocate a new one.
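
The null check described in this reply might look like the sketch below. The real CommInitRank signature is not shown in the thread, so the parameter list here is an assumption, and `FakeComm` and `FakeCclImpl` are hypothetical stand-ins for a platform backend.

```cpp
#include <cassert>

// Unified communicator type; each backend derives its own concrete class.
class CclComm {
public:
    virtual ~CclComm() = default;
};

// Hypothetical backend communicator that only remembers its rank information.
class FakeComm : public CclComm {
public:
    int nranks = 0;
    int rank = -1;
};

// Hypothetical backend implementation.
class FakeCclImpl {
public:
    // Allocates the backend communicator on demand, so no separate CreateComm is needed.
    void CommInitRank(CclComm **comm, int nranks, int rank) const {
        if (*comm == nullptr) { *comm = new FakeComm(); }
        // Safe only because this backend allocates nothing but FakeComm objects.
        auto *fake = static_cast<FakeComm *>(*comm);
        fake->nranks = nranks;
        fake->rank = rank;
    }

    void CommDestroy(CclComm *comm) const { delete comm; }
};
```

Callers start from a null pointer and let the backend decide which concrete type to allocate, which keeps the concrete type out of platform-neutral code.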

INFINI_TRAIN_CCL_STATUS_LIST(INFINI_TRAIN_CCL_STATUS_CASE)
#undef INFINI_TRAIN_CCL_STATUS_CASE
default:
return "Unknown";
Collaborator

@kilinchange kilinchange Feb 28, 2026

The status list already has kUnknown, so here either return "unsupported" or just crash.

Contributor Author

Fixed.
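
To make the reviewer's point concrete, here is a small X-macro sketch in which the `default` branch returns a string that cannot be confused with the `kUnknown` enumerator. The `DEMO_` macros, `DemoStatus`, and the three listed entries are hypothetical stand-ins for the project's real status list.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical status list in the same X-macro style as the code under review.
#define DEMO_STATUS_LIST(X) \
    X(kSuccess, 0)          \
    X(kInternal, -8)        \
    X(kUnknown, -127)

enum class DemoStatus : int {
#define DEMO_STATUS_ENUM(name, value) name = value,
    DEMO_STATUS_LIST(DEMO_STATUS_ENUM)
#undef DEMO_STATUS_ENUM
};

inline const char *DemoStatusToString(DemoStatus status) {
    switch (status) {
#define DEMO_STATUS_CASE(name, value) \
    case DemoStatus::name:            \
        return #name;
        DEMO_STATUS_LIST(DEMO_STATUS_CASE)
#undef DEMO_STATUS_CASE
    default:
        // Reached only for values outside the list, so it must not reuse "Unknown".
        return "unsupported";
    }
}
```

With this shape, `kUnknown` prints as "kUnknown" while a value that is not in the list at all prints as "unsupported", so the two cases stay distinguishable in logs.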

// event
void EventCreate(Event **event) const override;

void EventCreateWithFlags(Event **event, EventFlag flags) const override;
Collaborator

Why is creating an Event with different parameter forms implemented in GuardImpl rather than handled by Event's constructors?
The same applies to Destroy below.

Contributor Author

Left a TODO.


virtual Stream *CreateStreamWithPriority(Device, int priority) const;

virtual void DestroyStream(Stream *) const;
Collaborator

Why does DeviceGuardImpl need Create/Destroy Stream interfaces? If a constructor that supports priority is needed, add a constructor taking a priority to Stream.

Contributor Author

Left a TODO.


virtual void EventCreateWithFlags(Event **event, EventFlag flags) const;

virtual void EventDestroy(Event *event) const;
Collaborator

Move Event's create/destroy interfaces into the Event class.

Contributor Author

Left a TODO.

INFINI_TRAIN_RUNTIME_STATUS_LIST(INFINI_TRAIN_RUNTIME_STATUS_CASE)
#undef INFINI_TRAIN_RUNTIME_STATUS_CASE
default:
return "Unknown";
Collaborator

There is already a kUnknown above, so use "unsupported" or LOG(FATAL) here.

Contributor Author

Fixed.

X(kAlreadyExists, -6) \
X(kPermissionDenied, -7) \
X(kInternal, -8) \
X(kUnknown, -127)
Collaborator

For these values, is it enough to guarantee that positive numbers mean normal behavior and negative numbers mean abnormal behavior, without requiring them to match the actual status values of each platform's underlying status type? That is, because the status_type -> status_value mapping is one we define ourselves?

Contributor Author

The values here are our own encoding. Mapping each entry to its platform-specific value requires a switch or a map. kUnknown jumps straight to -127 because new error types may be added later: kUnknown acts as the catch-all value, which leaves room to extend the list.
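
A switch-based translation of the kind mentioned in this reply could be sketched as follows. The four negative codes are the ones visible in the snippet under review; `kSuccess = 0`, the `FakePlatformError` enum with its arbitrary values, and the `FromFakePlatform` and `IsError` helpers are assumptions made for illustration and do not correspond to any real vendor API.

```cpp
#include <cassert>

// Project-defined codes. The negative values match the snippet under review;
// kSuccess = 0 is an assumption of this sketch.
enum class RuntimeStatus : int {
    kSuccess = 0,
    kAlreadyExists = -6,
    kPermissionDenied = -7,
    kInternal = -8,
    kUnknown = -127,
};

// Hypothetical platform error codes with arbitrary values.
enum FakePlatformError {
    kFakeOk = 0,
    kFakeAlreadyMapped = 1,
    kFakeNotPermitted = 2,
    kFakeLaunchFailure = 3,
};

// Translates a platform code into the project's own encoding.
inline RuntimeStatus FromFakePlatform(int err) {
    switch (err) {
    case kFakeOk:
        return RuntimeStatus::kSuccess;
    case kFakeAlreadyMapped:
        return RuntimeStatus::kAlreadyExists;
    case kFakeNotPermitted:
        return RuntimeStatus::kPermissionDenied;
    case kFakeLaunchFailure:
        return RuntimeStatus::kInternal;
    default:
        // Anything without an explicit mapping falls back to the catch-all.
        return RuntimeStatus::kUnknown;
    }
}

// Negative codes signal failure, independent of what the platform uses.
inline bool IsError(RuntimeStatus status) { return static_cast<int>(status) < 0; }
```

Because callers only ever see the project's own codes, each backend is free to use whatever numbering its vendor library defines.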

…acros, mv unique_id file helper functions to utils
@Chamberlain0w0 Chamberlain0w0 force-pushed the feat/ccl_registry branch 2 times, most recently from 85c777f to c289934 Compare March 3, 2026 08:41
2 participants