Conversation
|
Addresses #1972. PART 1: This avoids the previous error New tests have been added to demonstrate expected behaviour. PART 2: Example outputs:
```
cset --verbose bake -r ../src/CSET/recipes/surface_fields/generic_surface_histogram_series.yaml --input-dir "${input_dir}" --output-dir ${output_dir} --VARNAME=${VARNAME} --MODEL_NAME="${model_name}" --METHOD="MEAN" --SUBAREA_TYPE=${SUBAREA_TYPE} --SUBAREA_EXTENT=${SUBAREA_EXTENT} --SEQUENCE time
```
|
|
Marking as ready for review. |
| cubes = cubes.merge() | ||
| except iris.exceptions.MergeError: | ||
| for ir, cube in enumerate(cubes): | ||
| cube.coord("realization").points = ir + 1 |
There was a problem hiding this comment.
This is a very blunt fallback mechanism. It renumbers ensembles without any warning which could make tracing results back to a specific model run confusing.
I don't think we should be arbitrarily renumbering ensemble members at all, but if it's unavoidable we should at least log a warning, I'd also be tempted to say that if CSET is running inside a workflow we should even generate a cylc message so it shows up in the gui.
There was a problem hiding this comment.
Thanks.
Agree on room for additional output messaging here and will action.
Also agree this is a blunt fallback, but it does offer a bug fix. Note in previous iterations when ensemble support was working, we relied on em?? identifiers in filename (I think) to help determine ensemble numbers, so arguably just as blunt.
For instances where input files do already have a realization coord, note there is no impact of this change (see testing), so this is aiming to catch instances where ensemble outputs do not have a realization coord defined in inputs (turns out to be quite a lot of our use-cases though!).
Not sure if this helps you, but would currently argue this seems like a practical response to preserve functionality when members not identifiable by other ways.
There was a problem hiding this comment.
I think I'm still a little confused about what case this is handling. Can you provide a concrete example of how multiple cubes with the same ensemble member have been generated?
The only way I know of is with time-lagging where you end up with multiple control members, in which case I would argue that only the problem cubes should be modified.
There was a problem hiding this comment.
See original issue log #1972 - currently ensemble loading not working, and the cube.merge() call this replaces returns error
iris.exceptions.DuplicateDataError: failed to merge into a single cube.
Tests highlight variety of cases:
a) test_read_cubes_ensemble_with_realization_coord(): input example file contains all ensemble members in output file, so iris.load already gives cube with e.g. dims [realization, time, lat, lon]. This change has no impact.
b) test_read_cubes_ensemble_separate_files: reads ensemble members from separate files, but these have different realization coords within the file, so iris.load returns a Cubelist with different realization scalar coords, which can then merge nicely (adds N-dimension realization coord to the merged cube). This change has no impact.
c) test_read_cubes_ensemble_without_realization_coord: here I took copies of the test ensemble files and removed realization information from the file. This is equivalent to the examples which led to the #1972 errors being flagged. Here there is no information on ensemble member numbers, so this fallback enables the ensemble inputs to be processed.
Tests test_read_cubes_merge_cubes_* run through some equivalent testing but at level of direct calls to this function rather than higher-level call to read.read_cubes.
Thanks for time on this.
There was a problem hiding this comment.
Thanks, that helps. in that case your test for c) has actually identified a different issue. The current behaviour of assigning member 0 to any cubes that don't have realization information is insufficient. Personally I think that means extracting that information from the file name should be reconsidered.
For this PR, I would recommend a stricter check. Perhaps only apply this renumbering if all of the realization numbers are 0?
There was a problem hiding this comment.
My experience of this in testing was that the assigning member 0 is effectively an initialisation callback, which creates a realization coord where one does not exist. This is helpful/necessary given much of the subsequent code assumes a valid realization code.
Proposing we update doc strings for that callback to indicate this is about creating and initialising the coord where not existing rather than specifically a "deterministic" thing.
We can't depend on that callback to do 'clever' things with ensemble numbering (to best of my bug-fixing anyhow) as the callback is applied to different instances from input files separately (i.e. different cube elements of the CubeList), so there is no awareness of extent of an ensemble input within that callback. For this reason we rely on two-stage a) callback on loading cube to ensure coord exists, b) checks in merge to address all zero once we come to merging loaded cubes/CubeList.
Have introduced slightly stricter check to only update where realization==0 (i.e. still at initialised value).
Suggest can look at filename-related numbering 'sometime' (i.e. in separate issue/PR), but equally think we need to support a flexibility of potential filenaming ecosystems with CSET, so makes sense to use the merge pass/fail to discriminate type of inputs.
| if len(stamp_coords) == 1: | ||
| stamp_coordinate = stamp_coords[0] | ||
| else: | ||
| stamp_coordinate = "realization" |
There was a problem hiding this comment.
Log a warning here.
Also, this seems like a bit of an odd fallback, what if the available coordinates are "member" and "pseudo_level"? I feel like this function should either raise an error if it can't find a unique stamp coordinate, or it should return a tuple of available stamp coordinates so routines that call it can decide whether they know how to handle multiple stamp coordinates or not.
There was a problem hiding this comment.
Thanks for time on this, appreciated.
Happy to add log message.
For context note the current approach here is that we have stamp_coordinate="realization" being set in calls to functions as a default. In principle, users can set stamp_coordinate as a recipe-level input, but none of the generic recipes currently have this as set variable, and to me feels like more 'flexibility' than practically useful. So this change effectively introduces the ability to support something other than realization with existing recipe approach.
In case where len(stamp_coords) > 1, then we stick to effectively current default that stamp_coordinate=="realization" (line 362), preserving the current default behaviour I think.
There was a problem hiding this comment.
Logging statements now added.
David Flack (daflack)
left a comment
There was a problem hiding this comment.
Overall, from my "science review" perspective, it looks good and makes sense, and I have yet to find an ensemble that doesn't work. It's been working nicely with data downloaded from MASS that was failing before (3 members including the control).
The only things to bring up are:
-
(might be a different PR) the missing labels on the axes for the lat/lon lines
-
Member vs. Realization in plot titles: I appreciate that realization is the cf compliant name, but thinking of publication plots (when included) it is usually member as the title. It would be great to be cf compliant - but I think we should probably stick with member as that has somewhat become the norm.





Contribution checklist
Aim to have all relevant checks ticked off before merging. See the developer's guide for more detail.
rose-suite.conf.examplehas been updated if new diagnostic added.