Skip to content

Ensemble read and plot updates#2056

Open
ukmo-huw-lewis wants to merge 9 commits intomainfrom
1972-ensemble-loading-not-recognising-member-0
Open

Ensemble read and plot updates#2056
ukmo-huw-lewis wants to merge 9 commits intomainfrom
1972-ensemble-loading-not-recognising-member-0

Conversation

@ukmo-huw-lewis
Copy link
Copy Markdown
Contributor

@ukmo-huw-lewis ukmo-huw-lewis commented Apr 15, 2026

Contribution checklist

Aim to have all relevant checks ticked off before merging. See the developer's guide for more detail.

  • Documentation has been updated to reflect change.
  • New code has tests, and affected old tests have been updated.
  • All tests and CI checks pass.
  • Ensured the pull request title is descriptive.
  • Ensure rose-suite.conf.example has been updated if new diagnostic added.
  • Conda lock files have been updated if dependencies have changed.
  • Attributed any Generative AI, such as GitHub Copilot, used in this PR.
  • Marked the PR as ready to review.

@ukmo-huw-lewis ukmo-huw-lewis linked an issue Apr 15, 2026 that may be closed by this pull request
@ukmo-huw-lewis
Copy link
Copy Markdown
Contributor Author

ukmo-huw-lewis commented Apr 15, 2026

Addresses #1972.

PART 1:
New function _merge_cubes_check_ensemble(cubes) attempts to merge input cubes.
Identifying ensemble inputs takes advantage of current issue that ensemble inputs do not merge, now that we set realization coordinate to 0 in instances (most inputs) where realization is not first set in input file. In this case, we now increment realization coordinate and then attempt to successfully merge cube.

This avoids the previous error iris.exceptions.DuplicateDataError: failed to merge into a single cube. as now realization coord is varying across an otherwise common cubelist.

New tests have been added to demonstrate expected behaviour.

PART 2:
Have also used this issue/PR to introduce some additional generalisations to ensemble plotting. This is anticipating example use-case in #2032 (i.e. where input stamp_coordinate may not always be "realization", and where number of ensembles to plot is not always a square grid).

Example outputs:

surface_spatial_plot with ensemble inputs (note here titles pick up "Realization" coord name rather than "Member" default)
image

cset --verbose bake -r ../src/CSET/recipes/surface_fields/generic_surface_spatial_plot_sequence.yaml --input-dir "${input_dir}" --output-dir ${output_dir} --VARNAME=${VARNAME} --MODEL_NAME="${model_name}" --METHOD="MEAN" --SUBAREA_TYPE=${SUBAREA_TYPE} --SUBAREA_EXTENT=${SUBAREA_EXTENT}

spatial_difference with 2 sets of ensemble inputs. Note some precision needed on how to read in 2 sets of ensembles using wildcards, to read in 2 sets of * inputs rather than risk seeing data as Nensemble x Nmodel path set.
image

input_dir=<path_to_data>/LFRic/20240121T0000Z*em*diagnostics_000_006.nc
input_dir2=<path_to_data>/UM/20240121T0000Z*em*pdiaga000.pp
cset --verbose bake -r ../src/CSET/recipes/surface_fields/surface_spatial_difference.yaml --input-dir "${input_dir}" "${input_dir2}" --output-dir ${output_dir} --VARNAME=${VARNAME} --BASE_MODEL="${model_name}" --OTHER_MODEL="${model_name2}" --METHOD="" --SUBAREA_TYPE=${SUBAREA_TYPE} --SUBAREA_EXTENT=${SUBAREA_EXTENT}

time_series
image

cset --verbose bake -r ../src/CSET/recipes/surface_fields/generic_surface_domain_mean_time_series.yaml --input-dir "${input_dir}" "${input_dir2}" --output-dir ${output_dir} --VARNAME=${VARNAME} --MODEL_NAME="['${model_name}', '${model_name2}']" --METHOD="MEAN"

histogram
image

cset --verbose bake -r ../src/CSET/recipes/surface_fields/generic_surface_histogram_series.yaml --input-dir "${input_dir}" --output-dir ${output_dir} --VARNAME=${VARNAME} --MODEL_NAME="${model_name}" --METHOD="MEAN" --SUBAREA_TYPE=${SUBAREA_TYPE} --SUBAREA_EXTENT=${SUBAREA_EXTENT} --SEQUENCE realization
image ``` cset --verbose bake -r ../src/CSET/recipes/surface_fields/generic_surface_histogram_series.yaml --input-dir "${input_dir}" --output-dir ${output_dir} --VARNAME=${VARNAME} --MODEL_NAME="${model_name}" --METHOD="MEAN" --SUBAREA_TYPE=${SUBAREA_TYPE} --SUBAREA_EXTENT=${SUBAREA_EXTENT} --SEQUENCE time ```

@ukmo-huw-lewis
Copy link
Copy Markdown
Contributor Author

ukmo-huw-lewis commented Apr 15, 2026

Marking as ready for review.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 15, 2026

Coverage

@ukmo-huw-lewis ukmo-huw-lewis added bug Something isn't working enhancement New feature or request labels Apr 15, 2026
Comment thread src/CSET/operators/read.py Outdated
cubes = cubes.merge()
except iris.exceptions.MergeError:
for ir, cube in enumerate(cubes):
cube.coord("realization").points = ir + 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very blunt fallback mechanism. It renumbers ensembles without any warning which could make tracing results back to a specific model run confusing.
I don't think we should be arbitrarily renumbering ensemble members at all, but if it's unavoidable we should at least log a warning, I'd also be tempted to say that if CSET is running inside a workflow we should even generate a cylc message so it shows up in the gui.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

Agree on room for additional output messaging here and will action.

Also agree this is a blunt fallback, but it does offer a bug fix. Note in previous iterations when ensemble support was working, we relied on em?? identifiers in filename (I think) to help determine ensemble numbers, so arguably just as blunt.

For instances where input files do already have a realization coord, note there is no impact of this change (see testing), so this is aiming to catch instances where ensemble outputs do not have a realization coord defined in inputs (turns out to be quite a lot of our use-cases though!).

Not sure if this helps you, but would currently argue this seems like a practical response to preserve functionality when members not identifiable by other ways.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm still a little confused about what case this is handling. Can you provide a concrete example of how multiple cubes with the same ensemble member have been generated?
The only way I know of is with time-lagging where you end up with multiple control members, in which case I would argue that only the problem cubes should be modified.

Copy link
Copy Markdown
Contributor Author

@ukmo-huw-lewis ukmo-huw-lewis Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See original issue log #1972 - currently ensemble loading not working, and the cube.merge() call this replaces returns error
iris.exceptions.DuplicateDataError: failed to merge into a single cube.

Tests highlight variety of cases:
a) test_read_cubes_ensemble_with_realization_coord(): input example file contains all ensemble members in output file, so iris.load already gives cube with e.g. dims [realization, time, lat, lon]. This change has no impact.
b) test_read_cubes_ensemble_separate_files: reads ensemble members from separate files, but these have different realization coords within the file, so iris.load returns a Cubelist with different realization scalar coords, which can then merge nicely (adds N-dimension realization coord to the merged cube). This change has no impact.
c) test_read_cubes_ensemble_without_realization_coord: here I took copies of the test ensemble files and removed realization information from the file. This is equivalent to the examples which led to the #1972 errors being flagged. Here there is no information on ensemble member numbers, so this fallback enables the ensemble inputs to be processed.

Tests test_read_cubes_merge_cubes_* run through some equivalent testing but at level of direct calls to this function rather than higher-level call to read.read_cubes.

Thanks for time on this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that helps. in that case your test for c) has actually identified a different issue. The current behaviour of assigning member 0 to any cubes that don't have realization information is insufficient. Personally I think that means extracting that information from the file name should be reconsidered.

For this PR, I would recommend a stricter check. Perhaps only apply this renumbering if all of the realization numbers are 0?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My experience of this in testing was that the assigning member 0 is effectively an initialisation callback, which creates a realization coord where one does not exist. This is helpful/necessary given much of the subsequent code assumes a valid realization code.

Proposing we update doc strings for that callback to indicate this is about creating and initialising the coord where not existing rather than specifically a "deterministic" thing.

We can't depend on that callback to do 'clever' things with ensemble numbering (to best of my bug-fixing anyhow) as the callback is applied to different instances from input files separately (i.e. different cube elements of the CubeList), so there is no awareness of extent of an ensemble input within that callback. For this reason we rely on two-stage a) callback on loading cube to ensure coord exists, b) checks in merge to address all zero once we come to merging loaded cubes/CubeList.

Have introduced slightly stricter check to only update where realization==0 (i.e. still at initialised value).
Suggest can look at filename-related numbering 'sometime' (i.e. in separate issue/PR), but equally think we need to support a flexibility of potential filenaming ecosystems with CSET, so makes sense to use the merge pass/fail to discriminate type of inputs.

Comment thread src/CSET/operators/read.py Outdated
if len(stamp_coords) == 1:
stamp_coordinate = stamp_coords[0]
else:
stamp_coordinate = "realization"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log a warning here.

Also, this seems like a bit of an odd fallback, what if the available coordinates are "member" and "pseudo_level"? I feel like this function should either raise an error if it can't find a unique stamp coordinate, or it should return a tuple of available stamp coordinates so routines that call it can decide whether they know how to handle multiple stamp coordinates or not.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for time on this, appreciated.

Happy to add log message.

For context note the current approach here is that we have stamp_coordinate="realization" being set in calls to functions as a default. In principle, users can set stamp_coordinate as a recipe-level input, but none of the generic recipes currently have this as set variable, and to me feels like more 'flexibility' than practically useful. So this change effectively introduces the ability to support something other than realization with existing recipe approach.

In case where len(stamp_coords) > 1, then we stick to effectively current default that stamp_coordinate=="realization" (line 362), preserving the current default behaviour I think.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logging statements now added.

Copy link
Copy Markdown
Collaborator

@daflack David Flack (daflack) left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, from my "science review" perspective, it looks good and makes sense, and I have yet to find an ensemble that doesn't work. It's been working nicely with data downloaded from MASS that was failing before (3 members including the control).

The only things to bring up are:

  1. (might be a different PR) the missing labels on the axes for the lat/lon lines

  2. Member vs. Realization in plot titles: I appreciate that realization is the cf compliant name, but thinking of publication plots (when included) it is usually member as the title. It would be great to be cf compliant - but I think we should probably stick with member as that has somewhat become the norm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ensemble loading not recognising member 0

3 participants