-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
Summary / 概述
When creating EngineConn (EC) instances, the distribution may become uneven across EngineConnManager (ECM) nodes. ECs tend to be created on the same or a small subset of ECMs, even when other ECMs have abundant available resources. This leads to high load on specific ECM nodes while others remain idle.
在创建 EngineConn (EC) 实例时,EC 的分布可能在多个 EngineConnManager (ECM) 节点之间不均匀。即使其他 ECM 有充足的可用资源,EC 也倾向于在同一个或少数几个 ECM 上创建。这会导致特定 ECM 节点负载过高,而其他节点处于空闲状态。
Problem Description / 问题描述
In production environments with multiple ECM nodes, users have observed that:
在多 ECM 节点的生产环境中,用户观察到以下现象:
- Uneven EC distribution / EC 分布不均: New EC instances are frequently created on the same ECM or a small number of ECMs / 新的 EC 实例频繁地在同一个或少数几个 ECM 上创建
- Resource underutilization / 资源利用率低: Some ECMs remain idle while others are overloaded / 部分 ECM 处于空闲状态,而其他 ECM 负载过重
- Single point of high load / 单点高负载: One ECM may become a bottleneck, affecting overall system performance / 单个 ECM 可能成为瓶颈,影响整体系统性能
- Inefficient cluster utilization / 集群利用率低: The full cluster capacity is not being leveraged / 集群的全部容量未能得到充分利用
Current Implementation Analysis / 当前实现分析
The current ECM selection mechanism in DefaultEngineCreateService.selectECM() uses a multi-rule chain approach:
当前 DefaultEngineCreateService.selectECM() 中的 ECM 选择机制采用多规则责任链模式:
| Rule / 规则 | Priority / 优先级 | Function / 功能 |
|---|---|---|
| ScoreNodeSelectRule | 0 | Sort by label matching score / 按标签匹配分数排序 |
| AvailableNodeSelectRule | 2 | Filter unavailable nodes / 过滤不可用节点 |
| OverLoadNodeSelectRule | 3 | Sort by memory usage rate / 按内存使用率排序 |
| ResourceNodeSelectRule | 5 | Sort by remaining resources / 按剩余资源排序 |
| NewECMStandbyRule | 7 | New ECM cooldown period / 新 ECM 冷却期 |
| HotspotExclusionRule | MaxValue | Random shuffle top nodes / 随机打乱前几名节点 |
Key Files / 关键文件:
linkis-application-manager/.../service/engine/DefaultEngineCreateService.scalalinkis-application-manager/.../selector/DefaultNodeSelector.scalalinkis-application-manager/.../selector/rule/HotspotExclusionRule.scala
Root Cause Analysis / 根因分析
Despite the existing HotspotExclusionRule, the following issues may cause distribution imbalance:
尽管存在 HotspotExclusionRule 热点排除规则,以下问题仍可能导致分布不均:
1. Delayed Resource Reporting / 资源上报延迟
- Health report period is 10 seconds by default (
wds.linkis.ecm.health.report.period) - 健康上报周期默认为 10 秒(
wds.linkis.ecm.health.report.period) - During high concurrency, multiple requests may see stale resource data
- 高并发时,多个请求可能看到过时的资源数据
- All concurrent requests may choose the same "best" ECM
- 所有并发请求可能选择同一个"最优" ECM
2. Limited Random Scope / 随机范围有限
- HotspotExclusionRule only shuffles top 5 nodes
- HotspotExclusionRule 仅随机打乱前 5 个节点
- If fewer than 5 ECMs are available, randomization is ineffective
- 如果可用 ECM 少于 5 个,随机化效果不佳
- Label filtering may reduce available ECM pool significantly
- 标签过滤可能显著减少可选的 ECM 池
3. Deterministic Scoring / 确定性评分
- Label matching scores may consistently favor certain ECMs
- 标签匹配分数可能始终偏向某些 ECM
- Resource sorting is deterministic, leading to repeated selection of the same node
- 资源排序是确定性的,导致重复选择同一节点
4. Concurrent Race Condition / 并发竞态条件
- Multiple EC creation requests arriving simultaneously
- 多个 EC 创建请求同时到达
- All see the same snapshot of cluster state
- 所有请求看到相同的集群状态快照
- All select the same ECM before resource updates propagate
- 在资源更新传播之前,所有请求选择同一个 ECM
Proposed Solutions / 建议解决方案
Option 1: Enhanced Weighted Random Selection / 方案一:增强的加权随机选择
Implement weighted random selection based on available resources:
基于可用资源实现加权随机选择:
// Pseudocode / 伪代码
def selectECMWeighted(ecmNodes: Seq[EMNode]): EMNode = {
val weights = ecmNodes.map(node => node.leftResource.weight)
val totalWeight = weights.sum
val random = Random.nextDouble() * totalWeight
var cumulative = 0.0
ecmNodes.zip(weights).find { case (node, weight) =>
cumulative += weight
cumulative >= random
}.map(_._1).getOrElse(ecmNodes.head)
}Option 2: Distributed Counter with Lease / 方案二:带租约的分布式计数器
Implement a distributed counter mechanism to track pending EC creations:
实现分布式计数器机制来跟踪待创建的 EC:
- Before selecting ECM, acquire a "reservation" in distributed cache
- 选择 ECM 前,在分布式缓存中获取"预留"
- Include pending reservations in resource calculation
- 在资源计算中包含待处理的预留
- Release reservation after EC creation completes or fails
- EC 创建完成或失败后释放预留
Option 3: Real-time Resource Synchronization / 方案三:实时资源同步
Reduce resource reporting delay and implement push-based updates:
减少资源上报延迟,实现推送式更新:
- Decrease
wds.linkis.ecm.health.report.periodto 1-2 seconds - 将
wds.linkis.ecm.health.report.period减少到 1-2 秒 - Implement event-driven resource updates when EC is created/destroyed
- EC 创建/销毁时实现事件驱动的资源更新
- Use optimistic locking for concurrent selection
- 并发选择时使用乐观锁
Option 4: Adaptive Load Balancing / 方案四:自适应负载均衡
Implement adaptive selection with feedback:
实现带反馈的自适应选择:
- Track actual EC creation success/failure rates per ECM
- 跟踪每个 ECM 的实际 EC 创建成功/失败率
- Apply penalties to frequently selected ECMs
- 对频繁被选中的 ECM 施加惩罚
- Implement exponential backoff for overloaded nodes
- 对过载节点实现指数退避
Acceptance Criteria / 验收标准
- In concurrent EC creation scenarios (e.g., 100 concurrent requests), EC distribution across ECMs should be relatively even (variance < 20% of mean)
- 在并发 EC 创建场景下(如 100 个并发请求),EC 在各 ECM 间的分布应相对均匀(方差 < 平均值的 20%)
- No single ECM should receive more than 150% of the average load
- 单个 ECM 接收的负载不应超过平均负载的 150%
- All existing functionality must remain intact
- 所有现有功能必须保持完整
- Performance overhead should be minimal (< 5% latency increase)
- 性能开销应最小化(延迟增加 < 5%)
- Solution should be configurable and backward compatible
- 解决方案应可配置且向后兼容
Test Plan / 测试计划
1. Unit Tests / 单元测试
- Test weighted random selection algorithm / 测试加权随机选择算法
- Test resource calculation with pending reservations / 测试带待处理预留的资源计算
- Test edge cases (single ECM, all ECMs at capacity) / 测试边界情况(单 ECM、所有 ECM 满载)
2. Integration Tests / 集成测试
- Simulate concurrent EC creation requests / 模拟并发 EC 创建请求
- Verify even distribution across ECM nodes / 验证 EC 在各 ECM 节点间的均匀分布
- Test failover when selected ECM becomes unavailable / 测试所选 ECM 不可用时的故障转移
3. Performance Tests / 性能测试
- Measure latency impact of new selection logic / 测量新选择逻辑的延迟影响
- Verify scalability with increasing ECM count / 验证随 ECM 数量增加的可扩展性
- Benchmark under high concurrency / 高并发下的基准测试
Related Configuration / 相关配置
| Config Key / 配置项 | Default / 默认值 | Description / 描述 |
|---|---|---|
| wds.linkis.ecm.health.report.period | 10s | ECM health report interval / ECM 健康上报间隔 |
| linkis.node.select.hotspot.exclusion.rule.enable | true | Enable hotspot exclusion / 启用热点排除 |
| wds.linkis.manager.am.em.new.wait.mills | 60000 | New ECM cooldown period / 新 ECM 冷却期 |
References / 参考资料
- ECM Architecture / ECM 架构: https://linkis.apache.org/docs/latest/architecture/computation-governance-services/engine/engine-conn-manager
- EngineConn Overview / EngineConn 概述: https://linkis.apache.org/docs/latest/architecture/computation-governance-services/engine/engine-conn