Refactor VOS aggregation: global scanner ULT with per-object aggregation #18001
Replace per-container VOS aggregation ULTs with a single global aggregation scanner ULT per pool_child. The scanner iterates all containers and spawns per-object aggregation ULTs for each object, limited by the DAOS_AGG_MAX_ULTS environment variable (default: 8).

Key changes:
- VOS: Add vos_aggregate_obj() for single-object aggregation with AGG_MODE_OBJ_AGGREGATE that allows concurrent per-object ULTs
- Container: Add global agg_scanner_ult that enumerates objects via VOS iterator and spawns per-object ULTs with concurrency control
- Pool: Add ds_start_agg_ult()/ds_stop_agg_ult() and wire into pool_child lifecycle. Read DAOS_AGG_MAX_ULTS env var for concurrency limit.
- EC aggregation remains per-container ULTs (unchanged)

Agent-Logs-Url: https://github.com/daos-stack/daos/sessions/1d5c4c69-fe76-4740-98c9-574c557438a8
Co-authored-by: gnailzenh <7268050+gnailzenh@users.noreply.github.com>
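The concurrency limit described in this commit can be illustrated with a small standalone sketch. It is not the code from this PR: the `sk_` helpers, the fallback to the default on malformed or zero values, and the simple in-flight counter are all assumptions made for illustration. A real scanner ULT would presumably yield and retry rather than return immediately when the limit is reached.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper (not DAOS code): parse a positive decimal limit.
 * Falls back to dflt when the value is missing, malformed, zero, or too large. */
static unsigned int
sk_parse_limit(const char *val, unsigned int dflt)
{
	char          *end = NULL;
	unsigned long  parsed;

	/* Reject NULL, empty strings, signs and leading whitespace up front. */
	if (val == NULL || val[0] < '0' || val[0] > '9')
		return dflt;

	errno  = 0;
	parsed = strtoul(val, &end, 10);
	if (errno != 0 || *end != '\0' || parsed == 0 || parsed > UINT_MAX)
		return dflt;

	return (unsigned int)parsed;
}

/* Read the limit from an environment variable such as DAOS_AGG_MAX_ULTS. */
static unsigned int
sk_env_limit(const char *name, unsigned int dflt)
{
	assert(name != NULL);
	return sk_parse_limit(getenv(name), dflt);
}

/* Hypothetical gate that caps the number of in-flight per-object ULTs. */
struct sk_gate {
	unsigned int max_ults; /* upper bound on concurrent workers */
	unsigned int inflight; /* workers currently running */
};

/* Returns 1 and reserves a slot if one is free, 0 otherwise. */
static int
sk_gate_try_acquire(struct sk_gate *g)
{
	if (g->inflight >= g->max_ults)
		return 0;
	g->inflight++;
	return 1;
}

/* Called when a per-object worker finishes. */
static void
sk_gate_release(struct sk_gate *g)
{
	assert(g->inflight > 0);
	g->inflight--;
}
```

In this sketch the scanner would call `sk_gate_try_acquire()` before spawning a worker for an object, and each worker would call `sk_gate_release()` on completion.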
- Remove unused fail_count variable and simplify debug message
- Fix error path: when ds_start_agg_ult fails, goto out_cont directly since spc_agg_req is not set on failure (ds_stop_agg_ult handles NULL)

Agent-Logs-Url: https://github.com/daos-stack/daos/sessions/1d5c4c69-fe76-4740-98c9-574c557438a8
Co-authored-by: gnailzenh <7268050+gnailzenh@users.noreply.github.com>
Test stage Build on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18001/1/execution/node/280/log
Test stage Build on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18001/1/execution/node/273/log
Test stage Build on Leap 15 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18001/1/execution/node/288/log
Move the aggregation scanner from per-pool_child to per-xstream. One ULT per xstream iterates ALL pools → ALL containers → spawns per-object aggregation ULTs.

Key changes:
- pool_tls: add dt_agg_req for the per-xstream scanner
- pool/srv_target.c: agg_scanner_ult iterates dt_pool_list
- ds_start_agg_scanner()/ds_stop_agg_scanner(): per-xstream lifecycle
- Use SCHED_REQ_ANONYM since scanner is not tied to any single pool
- agg_rate_ctl/agg_space_check: use pool_child's GC req for space pressure (since ANONYM req has no pool_info)
- Remove spc_agg_req from ds_pool_child
- Export cont_aggregate_runnable, cont_child_aggregate, cont_vos_agg_per_obj, cont_vos_agg_per_obj_cb for pool module
- Stop scanner during pool_child_stop before container list changes
- Pass scanner_req to per-object ULTs via ao_scanner_req

Agent-Logs-Url: https://github.com/daos-stack/daos/sessions/3e549de0-fe79-4c41-8fad-b0dd3c605ea4
Co-authored-by: gnailzenh <7268050+gnailzenh@users.noreply.github.com>
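The pools → containers → objects traversal in this commit can be pictured with the following standalone sketch. The `sk_` types and functions are hypothetical stand-ins, not the DAOS structures: plain arrays replace dt_pool_list and the VOS iterator, and a callback replaces the spawning of a per-object ULT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for container and pool bookkeeping. */
struct sk_cont {
	const int *objs;    /* object ids held by this container */
	size_t     nr_objs;
};

struct sk_pool {
	const struct sk_cont *conts;
	size_t                nr_conts;
};

/* Callback invoked once per object; stands in for spawning a worker. */
typedef void (*sk_visit_fn)(size_t pool_idx, size_t cont_idx, int oid, void *arg);

/* One scanner pass: every pool, then every container, then every object.
 * Returns the number of objects handed to visit(). */
static size_t
sk_scan_pass(const struct sk_pool *pools, size_t nr_pools, sk_visit_fn visit, void *arg)
{
	size_t visited = 0;

	assert(visit != NULL);
	for (size_t p = 0; p < nr_pools; p++) {
		for (size_t c = 0; c < pools[p].nr_conts; c++) {
			const struct sk_cont *cont = &pools[p].conts[c];

			for (size_t o = 0; o < cont->nr_objs; o++) {
				visit(p, c, cont->objs[o], arg);
				visited++;
			}
		}
	}
	return visited;
}

/* Example visitor: sums object ids so a caller can check coverage. */
static void
sk_sum_visit(size_t pool_idx, size_t cont_idx, int oid, void *arg)
{
	(void)pool_idx;
	(void)cont_idx;
	*(long *)arg += oid;
}
```

Because the loop is not tied to any single pool, the sketch takes the whole pool array as input, which mirrors the commit's reasoning for using an anonymous scheduler request.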
Test stage Build on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18001/2/execution/node/272/log
Test stage Build on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18001/2/execution/node/280/log
Test stage Build on Leap 15 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18001/2/execution/node/288/log
- New env var DAOS_AGG_SCANNERS (default 4): controls the number of aggregation scanner ULTs per xstream, capped at the number of pools.
- Each scanner partitions pools via (pool_ordinal % num_scanners) so all pools are covered with minimal overlap.
- pool_tls: replace single dt_agg_req with dt_agg_reqs[] array and dt_num_scanners counter.
- ds_start_agg_scanner: creates min(agg_num_scanners, pool_count) scanner ULTs, safe to call multiple times (only adds missing ones).
- ds_stop_agg_scanner: stops all running scanner ULTs.
- Change agg_max_ults default from 8 to 16 per user request.

Agent-Logs-Url: https://github.com/daos-stack/daos/sessions/71b1c016-b4d5-4554-9d52-9a4d40fca490
Co-authored-by: gnailzenh <7268050+gnailzenh@users.noreply.github.com>
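The partitioning rule in this commit, (pool_ordinal % num_scanners) with the scanner count capped at min(agg_num_scanners, pool_count), can be sketched as below. The function names are hypothetical and the sketch assumes pool ordinals stay stable during a pass; under that assumption each pool maps to exactly one scanner.

```c
#include <assert.h>

/* Number of scanner ULTs actually started: the configured value,
 * capped at the number of pools (hypothetical helper). */
static unsigned int
sk_effective_scanners(unsigned int configured, unsigned int pool_count)
{
	return configured < pool_count ? configured : pool_count;
}

/* Returns nonzero when the scanner with index scanner_idx is responsible
 * for the pool at position pool_ordinal. */
static int
sk_scanner_owns_pool(unsigned int scanner_idx, unsigned int pool_ordinal,
		     unsigned int num_scanners)
{
	assert(num_scanners > 0);
	assert(scanner_idx < num_scanners);
	return pool_ordinal % num_scanners == scanner_idx;
}
```

With 4 scanners and 10 pools, scanner 1 would handle the pools at ordinals 1, 5 and 9. If ordinals shift while pools are added or removed, two scanners could briefly see the same pool, which may be what the commit means by "minimal overlap".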
Test stage Build on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18001/4/execution/node/281/log
Test stage Build on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18001/4/execution/node/273/log
Test stage Build on Leap 15 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18001/4/execution/node/289/log
Convert EC aggregation from per-container ULTs to global scanner ULTs, matching the approach used for VOS aggregation. EC scanners (default 16, tunable via DAOS_EC_AGG_SCANNERS) iterate pools/containers and call ds_obj_ec_agg_cont() for each container. VOS scanners remain at default 4 (tunable via DAOS_AGG_SCANNERS).

Key changes:
- Add ec_agg_num_scanners (default 16) and DAOS_EC_AGG_SCANNERS env var
- Add EC scanner state (dt_ec_agg_reqs[], dt_num_ec_scanners) to pool_tls
- Implement ec_agg_scanner_ult() in pool/srv_target.c
- Export ds_obj_ec_agg_cont() wrapper from srv_ec_aggregate.c
- Remove per-container EC agg ULT creation from cont_start_agg/cont_stop_agg
- Hook EC scanner start/stop into pool child lifecycle

Agent-Logs-Url: https://github.com/daos-stack/daos/sessions/2f09443d-48ca-46e5-a57c-3ab2a1b02ada
Co-authored-by: gnailzenh <7268050+gnailzenh@users.noreply.github.com>
Test stage NLT completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18001/5/execution/node/787/log
- Add ec_agg_num_scanners config (default 16) and DAOS_EC_AGG_SCANNERS env var in srv.c
- Add EC scanner state (dt_ec_agg_reqs[], dt_num_ec_scanners) to pool_tls in srv_internal.h
- Export ds_obj_ec_agg_cont() wrapper from srv_ec_aggregate.c for single-round EC agg per container
- Implement ec_agg_scanner_ult() in pool/srv_target.c (analogous to VOS scanner)
- ds_start_ec_agg_scanner()/ds_stop_ec_agg_scanner()
- Remove per-container EC agg ULT creation from cont_start_agg()/cont_stop_agg()
- pool.h and srv_obj_ec.h