Refactor VOS aggregation: global scanner ULT with per-object aggregation #18001

Draft
Copilot wants to merge 6 commits into master from copilot/refactor-ult-aggregation-logic

Copilot AI commented Apr 14, 2026

  • VOS aggregation: multi-scanner ULTs (up to 4, tunable via DAOS_AGG_SCANNERS)
  • EC aggregation: convert from per-container ULTs to multi-scanner ULTs (up to 16, tunable via DAOS_EC_AGG_SCANNERS)
    • Add ec_agg_num_scanners config (default 16) and DAOS_EC_AGG_SCANNERS env var in srv.c
    • Add EC scanner state (dt_ec_agg_reqs[], dt_num_ec_scanners) to pool_tls in srv_internal.h
    • Export ds_obj_ec_agg_cont() wrapper from srv_ec_aggregate.c for single-round EC agg per container
    • Implement ec_agg_scanner_ult() in pool/srv_target.c (analogous to VOS scanner)
    • Implement ds_start_ec_agg_scanner() / ds_stop_ec_agg_scanner()
    • Remove per-container EC agg ULT from cont_start_agg() / cont_stop_agg()
    • Hook EC scanner start/stop into pool child lifecycle
    • Export declarations in pool.h and srv_obj_ec.h
    • Run validation

Copilot AI and others added 2 commits April 14, 2026 09:04
Replace per-container VOS aggregation ULTs with a single global aggregation
scanner ULT per pool_child. The scanner iterates over all containers and
spawns one aggregation ULT per object; the number of concurrent per-object
ULTs is capped by the DAOS_AGG_MAX_ULTS environment variable (default: 8).

Key changes:
- VOS: Add vos_aggregate_obj() for single-object aggregation with
  AGG_MODE_OBJ_AGGREGATE that allows concurrent per-object ULTs
- Container: Add global agg_scanner_ult that enumerates objects via VOS
  iterator and spawns per-object ULTs with concurrency control
- Pool: Add ds_start_agg_ult()/ds_stop_agg_ult() and wire into pool_child
  lifecycle. Read DAOS_AGG_MAX_ULTS env var for concurrency limit.
- EC aggregation remains per-container ULTs (unchanged)
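
For illustration, reading a concurrency limit such as DAOS_AGG_MAX_ULTS with a fallback default might look like the sketch below. The helper names and the accepted range (1 to 1024) are assumptions made for this example; the PR does not show how the variable is actually parsed, and DAOS may use its own environment helpers instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper (not a DAOS API): parse an unsigned tunable from a
 * string. Returns `dflt` when the string is NULL, empty, malformed, or
 * outside [min, max]. */
static unsigned int
sk_parse_uint_or_default(const char *val, unsigned int dflt,
			 unsigned int min, unsigned int max)
{
	char		*end = NULL;
	unsigned long	 parsed;

	assert(min <= dflt && dflt <= max);
	if (val == NULL || *val == '\0')
		return dflt;

	errno = 0;
	parsed = strtoul(val, &end, 10);
	/* Reject overflow, trailing garbage, and out-of-range values. */
	if (errno != 0 || *end != '\0' || parsed < min || parsed > max)
		return dflt;

	return (unsigned int)parsed;
}

/* Hypothetical wrapper: look the tunable up in the environment. */
static unsigned int
sk_env_uint_or_default(const char *name, unsigned int dflt,
		       unsigned int min, unsigned int max)
{
	return sk_parse_uint_or_default(getenv(name), dflt, min, max);
}
```

A caller would use something like `sk_env_uint_or_default("DAOS_AGG_MAX_ULTS", 8, 1, 1024)` once at startup and pass the result to the scanner.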

Agent-Logs-Url: https://github.com/daos-stack/daos/sessions/1d5c4c69-fe76-4740-98c9-574c557438a8

Co-authored-by: gnailzenh <7268050+gnailzenh@users.noreply.github.com>

- Remove unused fail_count variable and simplify debug message
- Fix error path: when ds_start_agg_ult fails, goto out_cont directly
  since spc_agg_req is not set on failure (ds_stop_agg_ult handles NULL)

Agent-Logs-Url: https://github.com/daos-stack/daos/sessions/1d5c4c69-fe76-4740-98c9-574c557438a8

Co-authored-by: gnailzenh <7268050+gnailzenh@users.noreply.github.com>

Move the aggregation scanner from per-pool_child to per-xstream.
One ULT per xstream iterates ALL pools → ALL containers → spawns
per-object aggregation ULTs.

Key changes:
- pool_tls: add dt_agg_req for the per-xstream scanner
- pool/srv_target.c: agg_scanner_ult iterates dt_pool_list
- ds_start_agg_scanner()/ds_stop_agg_scanner(): per-xstream lifecycle
- Use SCHED_REQ_ANONYM since scanner is not tied to any single pool
- agg_rate_ctl/agg_space_check: use pool_child's GC req for space
  pressure (since ANONYM req has no pool_info)
- Remove spc_agg_req from ds_pool_child
- Export cont_aggregate_runnable, cont_child_aggregate,
  cont_vos_agg_per_obj, cont_vos_agg_per_obj_cb for pool module
- Stop scanner during pool_child_stop before container list changes
- Pass scanner_req to per-object ULTs via ao_scanner_req
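
The pools → containers → objects traversal with a cap on in-flight per-object work can be sketched as follows. All `sk_*` types and functions are hypothetical stand-ins, not the structures used in the PR. The sketch assumes that work on one xstream is scheduled cooperatively, so a plain counter is enough, and it models "wait for a worker to finish" by completing one in-flight item.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for container and pool state. */
struct sk_cont { size_t nr_objs; };
struct sk_pool { const struct sk_cont *conts; size_t nr_conts; };

/* In-flight limiter for per-object aggregation work. */
struct sk_gate {
	unsigned int max;	/* e.g. the DAOS_AGG_MAX_ULTS value */
	unsigned int in_flight;	/* currently running per-object items */
	unsigned int peak;	/* highest in_flight value observed */
};

/* Returns 1 and takes a slot if one is free, 0 otherwise. */
static int
sk_gate_try_acquire(struct sk_gate *g)
{
	if (g->in_flight >= g->max)
		return 0;
	g->in_flight++;
	if (g->in_flight > g->peak)
		g->peak = g->in_flight;
	return 1;
}

static void
sk_gate_release(struct sk_gate *g)
{
	assert(g->in_flight > 0);
	g->in_flight--;
}

/* Walk pools -> containers -> objects and return the number of objects
 * dispatched. A real scanner would yield until a worker finishes; here a
 * full gate is modelled by completing one in-flight item. */
static size_t
sk_scan(const struct sk_pool *pools, size_t nr_pools, struct sk_gate *g)
{
	size_t dispatched = 0;

	assert(g->max > 0);
	for (size_t p = 0; p < nr_pools; p++) {
		for (size_t c = 0; c < pools[p].nr_conts; c++) {
			for (size_t o = 0; o < pools[p].conts[c].nr_objs; o++) {
				while (!sk_gate_try_acquire(g))
					sk_gate_release(g); /* simulate one completion */
				dispatched++;
			}
		}
	}
	while (g->in_flight > 0)	/* drain remaining work */
		sk_gate_release(g);
	return dispatched;
}
```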

Agent-Logs-Url: https://github.com/daos-stack/daos/sessions/3e549de0-fe79-4c41-8fad-b0dd3c605ea4

Co-authored-by: gnailzenh <7268050+gnailzenh@users.noreply.github.com>
Copilot AI requested a review from gnailzenh April 14, 2026 09:50

Copilot AI and others added 2 commits April 14, 2026 12:38
- New env var DAOS_AGG_SCANNERS (default 4): controls the number of
  aggregation scanner ULTs per xstream, capped at the number of pools.
- Each scanner partitions pools via (pool_ordinal % num_scanners) so
  all pools are covered with minimal overlap.
- pool_tls: replace single dt_agg_req with dt_agg_reqs[] array and
  dt_num_scanners counter.
- ds_start_agg_scanner: creates min(agg_num_scanners, pool_count)
  scanner ULTs, safe to call multiple times (only adds missing ones).
- ds_stop_agg_scanner: stops all running scanner ULTs.
- Change agg_max_ults default from 8 to 16 per user request.
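
As a rough illustration of the partitioning and the idempotent start described above, the sketch below uses hypothetical `sk_*` helpers and a boolean array standing in for dt_agg_reqs[]. With a fixed scanner count, the modulo rule assigns each pool ordinal to exactly one scanner. How the real code behaves while the scanner count changes is not shown in the PR and is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Number of scanner ULTs to run: min(configured, pool_count). */
static unsigned int
sk_scanner_count(unsigned int configured, unsigned int pool_count)
{
	return configured < pool_count ? configured : pool_count;
}

/* True when scanner `idx` is responsible for the pool at `pool_ordinal`. */
static bool
sk_scanner_owns_pool(unsigned int idx, unsigned int pool_ordinal,
		     unsigned int num_scanners)
{
	assert(num_scanners > 0 && idx < num_scanners);
	return pool_ordinal % num_scanners == idx;
}

/* Idempotent start: only fills slots below `target` that are not running
 * yet. `running[]` is a hypothetical stand-in for the per-xstream request
 * array. Returns the number of scanners newly started by this call. */
static unsigned int
sk_start_missing(bool *running, unsigned int capacity, unsigned int target)
{
	unsigned int started = 0;

	assert(target <= capacity);
	for (unsigned int i = 0; i < target; i++) {
		if (!running[i]) {
			running[i] = true;
			started++;
		}
	}
	return started;
}
```

Calling `sk_start_missing()` again with the same target starts nothing, and calling it with a larger target (for example after more pools appear) adds only the missing scanners.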

Agent-Logs-Url: https://github.com/daos-stack/daos/sessions/71b1c016-b4d5-4554-9d52-9a4d40fca490

Co-authored-by: gnailzenh <7268050+gnailzenh@users.noreply.github.com>

Convert EC aggregation from per-container ULTs to global scanner ULTs,
matching the approach used for VOS aggregation. EC scanners (default 16,
tunable via DAOS_EC_AGG_SCANNERS) iterate pools/containers and call
ds_obj_ec_agg_cont() for each container. VOS scanners remain at
default 4 (tunable via DAOS_AGG_SCANNERS).

Key changes:
- Add ec_agg_num_scanners (default 16) and DAOS_EC_AGG_SCANNERS env var
- Add EC scanner state (dt_ec_agg_reqs[], dt_num_ec_scanners) to pool_tls
- Implement ec_agg_scanner_ult() in pool/srv_target.c
- Export ds_obj_ec_agg_cont() wrapper from srv_ec_aggregate.c
- Remove per-container EC agg ULT creation from cont_start_agg/cont_stop_agg
- Hook EC scanner start/stop into pool child lifecycle
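
One scanning round of an EC scanner might be structured like the sketch below. The callback type is a hypothetical stand-in for a single-round, per-container call such as ds_obj_ec_agg_cont(), whose real signature is not shown in this PR. Reusing the modulo pool partitioning for EC scanners and continuing past a failed container are both assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: run one EC aggregation round for one container.
 * Returns 0 on success, nonzero on failure. */
typedef int (*sk_ec_agg_cont_fn)(int cont_id, void *arg);

/* Hypothetical pool stand-in: just a list of container ids. */
struct sk_ec_pool { const int *cont_ids; size_t nr_conts; };

/* Example callback that only counts how many containers were visited. */
static int
sk_count_cb(int cont_id, void *arg)
{
	(void)cont_id;
	(*(size_t *)arg)++;
	return 0;
}

/* One round for EC scanner `idx` out of `num_scanners`. Visits every
 * container of every pool assigned to this scanner. Returns 0, or the
 * first nonzero return code seen; later containers are still visited
 * (an assumption of this sketch). */
static int
sk_ec_scan_round(const struct sk_ec_pool *pools, size_t nr_pools,
		 unsigned int idx, unsigned int num_scanners,
		 sk_ec_agg_cont_fn agg_cont, void *arg)
{
	int first_rc = 0;

	assert(num_scanners > 0 && idx < num_scanners && agg_cont != NULL);
	for (size_t p = 0; p < nr_pools; p++) {
		if (p % num_scanners != idx)
			continue;	/* pool belongs to another scanner */
		for (size_t c = 0; c < pools[p].nr_conts; c++) {
			int rc = agg_cont(pools[p].cont_ids[c], arg);

			if (rc != 0 && first_rc == 0)
				first_rc = rc;
		}
	}
	return first_rc;
}
```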

Agent-Logs-Url: https://github.com/daos-stack/daos/sessions/2f09443d-48ca-46e5-a57c-3ab2a1b02ada

Co-authored-by: gnailzenh <7268050+gnailzenh@users.noreply.github.com>