Add measure_obs: persist per-cell centroid/area/equivalent diameter into the annotating table #705
… into the annotating table

`measure_obs(sdata, element=None, ...)` computes one centroid, area and equivalent diameter per instance of a shapes or 2D-labels element and writes them, squidpy-style, into the annotating AnnData table: centroids to `obsm["spatial"]` (the canonical (n_obs, 2) array), area and equivalent diameter to `obs`. Values are stored in the element's intrinsic coordinates/units; equivalent diameter is `2*sqrt(area/pi)`.

Labels use a streaming bincount aggregator that processes the raster block by block (one chunk plus O(n_labels) accumulators), so it stays out-of-core and scales to Xenium-size masks where a whole-array regionprops table would run out of memory; area (the per-label pixel count) is a free by-product. Shapes use shapely's vectorized centroid/area.

The function is idempotent: outputs already present and current are not recomputed, a pre-existing `obsm["spatial"]` is trusted and never overwritten, and an instance-count change invalidates the cache. `inplace` follows the scanpy convention (mutate and return None, or operate on a deep copy and return it). Per-cell measurements require an annotating table to write into.

Render-side wiring (routing `as_points` through these measurements for footprint dot sizing) is intentionally deferred to a follow-up PR.
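For readers who want to see the shape of the streaming bincount idea from the commit message above, here is a minimal sketch. It is not the code in this PR: the function name, the `(y0, x0, block)` block format, the (x, y) column order and the pixel-index centroid convention (no half-pixel offset) are all assumptions made for illustration.

```python
import numpy as np


def streaming_label_measurements(blocks, n_labels):
    """Accumulate per-label pixel counts and coordinate sums one block at a time.

    Hypothetical helper: `blocks` yields (y0, x0, block), where `block` is a 2D
    integer array of label ids in 0..n_labels (0 = background) and (y0, x0) is
    the block's offset in the full raster.
    """
    size = n_labels + 1
    # Only these three O(n_labels) accumulators persist across blocks.
    count = np.zeros(size, dtype=np.float64)
    sum_x = np.zeros(size, dtype=np.float64)
    sum_y = np.zeros(size, dtype=np.float64)
    for y0, x0, block in blocks:
        ids = block.ravel()
        yy, xx = np.indices(block.shape)
        count += np.bincount(ids, minlength=size)
        sum_x += np.bincount(ids, weights=xx.ravel() + x0, minlength=size)
        sum_y += np.bincount(ids, weights=yy.ravel() + y0, minlength=size)
    area = count[1:]  # pixel count per label, background dropped
    with np.errstate(invalid="ignore", divide="ignore"):
        centroids = np.column_stack([sum_x[1:], sum_y[1:]]) / area[:, None]
    equivalent_diameter = 2.0 * np.sqrt(area / np.pi)
    return centroids, area, equivalent_diameter


mask = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 0, 2],
        [0, 0, 0, 2],
        [0, 0, 0, 2],
    ]
)
# Two row blocks; label 2 straddles the block boundary.
blocks = [(0, 0, mask[:2]), (2, 0, mask[2:])]
centroids, area, equivalent_diameter = streaming_label_measurements(blocks, n_labels=2)
```

Because the accumulators are plain sums, a label split across blocks needs no special handling, and the pixel count used to normalise the centroid is the area itself.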
Codecov Report❌ Patch coverage is
Additional details and impacted files

| | main | #705 | +/- |
|---|---|---|---|
| Coverage | 75.96% | 76.40% | +0.43% |
| Files | 14 | 14 | |
| Lines | 4156 | 4314 | +158 |
| Branches | 964 | 1003 | +39 |
| Hits | 3157 | 3296 | +139 |
| Misses | 647 | 663 | +16 |
| Partials | 352 | 355 | +3 |
….area (=0) Circles are stored as `Point` geometries with a `radius` column, for which shapely `.area` is 0 — so `measure_obs` wrote area=0 and equivalent_diameter=0 for every circle (surfaced on the real Visium spots dataset, all circles). Compute their area as `pi * r**2`; equivalent diameter then equals the true diameter `2*r`. Polygons/multipolygons still use the geometric area. Adds a regression test on `blobs_circles`.
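A quick numeric check of the claim above, as a sketch rather than the PR's code: the radii are made-up values, and `point_area` simply stands in for the zero area that a point geometry reports.

```python
import numpy as np

# Hypothetical spot radii, in the element's intrinsic units.
radius = np.array([27.5, 27.5, 10.0])

# A point geometry has no extent, so its geometric area is 0 ...
point_area = np.zeros_like(radius)
diameter_from_point_area = 2.0 * np.sqrt(point_area / np.pi)

# ... whereas the circle's area follows from the radius.
circle_area = np.pi * radius**2
equivalent_diameter = 2.0 * np.sqrt(circle_area / np.pi)
```

With `circle_area`, the equivalent diameter reduces to `2 * radius`; with the point area it collapses to 0, which is the behaviour this commit fixes.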
Performance (real data)
Real cell-segmentation masks
Scale / out-of-core: real nucleus mask tiled to 671,232 cells / 164 M px, measured end-to-end (read from disk + persist) in 5.6 s at flat ~590 MB peak (4× the pixels → same memory; accumulators 16 MB). Real Visium HD: 5.48 M shapes in 4.4 s. A real-data run also surfaced a circle-area bug (shapely
measure_obs now just computes and writes the requested measurements, overwriting existing values for the element's rows — the scanpy `calculate_qc_metrics` model. Removed the provenance marker, staleness tracking, per-row finiteness checks, the want_*/stale gating and the `force` parameter (5 helpers + 2 uns constants, ~85 net lines). Reuse belongs on the render read-path (read obsm if present, else compute), not in this writer; `centroids=False` keeps a pre-existing obsm["spatial"]. Merged the one-call `_compute_label_measurements` into `_compute_element_measurements`. Kept: the masked partial write (a table may annotate several elements), the incompatible-obsm-shape guard, and element=None / table resolution. Tests updated to the overwrite contract (recompute-overwrites, centroids-keeps-obsm, incompatible-shape-raises) replacing the idempotency/staleness tests.
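To illustrate why the masked partial write is kept, here is a rough sketch of the idea using plain NumPy arrays in place of the table's columns. The function and argument names are hypothetical and do not correspond to the helpers in `utils.py`.

```python
import numpy as np


def write_region(column, region, instance_ids, element, measured_ids, values):
    """Overwrite `column` only on rows annotating `element` (hypothetical helper).

    Rows belonging to other elements keep their current values; rows whose
    instance id was not measured get NaN.
    """
    out = column.copy()
    mask = region == element
    # Align measurements to table rows by instance id.
    order = np.argsort(measured_ids)
    sorted_ids = measured_ids[order]
    wanted = instance_ids[mask]
    pos = np.clip(np.searchsorted(sorted_ids, wanted), 0, len(sorted_ids) - 1)
    found = sorted_ids[pos] == wanted
    out[mask] = np.where(found, values[order][pos], np.nan)
    return out


# One table annotating two elements; the "nuclei" row already holds a value.
region = np.array(["cells", "cells", "nuclei", "cells"])
instance_ids = np.array([2, 1, 1, 3])
area_column = np.array([np.nan, np.nan, 7.0, np.nan])

new_area = write_region(
    area_column,
    region,
    instance_ids,
    element="cells",
    measured_ids=np.array([1, 2, 3]),
    values=np.array([10.0, 20.0, 30.0]),
)
```

Measuring `cells` overwrites only the three `cells` rows and leaves the `nuclei` value alone, which is the overwrite contract scoped to the element's rows.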
Match the set_zero_in_cmap_to_transparent convention: measure_obs is a plain public function in pl/utils.py, accessed via `from spatialdata_plot.pl.utils import measure_obs` rather than promoted to `sdp.pl.measure_obs`.
Follow the established public-helper pattern (make_palette is defined under pl/ and re-exported in pl/__init__) rather than inventing a top-level spatialdata_plot.utils module. Public form: `from spatialdata_plot.pl import measure_obs`.
A comment on the latest message. With this PR (see in particular the text below), Two questions:
Not sure if I had the version from sdata 0.7.3, but I'll prototype the plotting speedups with this one here and if I like the UX, I'll upstream 👌 If it doesn't hold up, I'll just kick it out again before the next release |
I think the major benefit of my approach is that I'm getting the area for next to no extra cost as well which is also super useful for plotting + computations |
- #1 no-clobber: a populated obsm["spatial"] (reader- or prior-call-provided) is no longer overwritten; warn and skip that element's centroids. Coords stay in the element's intrinsic pixel space (documented); area/diameter still overwrite our own columns. Restores the per-element finiteness guard.
- #2 unmatched instance ids: instances annotated in the table but absent from the element (e.g. str-vs-int id dtype mismatch) now warn instead of silently writing NaN.
- #3 float-dtype labels: the dense relabelling bincounts integer searchsorted indices, never the raw labels, so a float-typed (integer-valued) mask no longer crashes np.bincount.
- #4 atomic writes: validate every obs target (non-numeric column collision) before the first mutation, so a bad column never leaves a half-written table.
- #5 O(n_labels) memory: relabel labels to a dense 0..k-1 range, so the aggregator's memory scales with the number of distinct labels, not the maximum label id (sparse/global ids no longer blow it up). Single max() pass replaced by a unique() pass; same pass count.
- #7 circle area: dispatch on geometry TYPE (all-Point) rather than the presence of a "radius" column, so a polygon element carrying a radius column uses geometry.area, not pi*r**2.

Tests updated to the new contract and extended for each fix.
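The dense relabelling behind #3 and #5 can be sketched in a few lines. This is an illustration under assumptions, not the PR's implementation: `dense_relabel` is a hypothetical name, and a real streaming version would first collect the unique labels across blocks.

```python
import numpy as np


def dense_relabel(block, unique_labels):
    """Map arbitrary label values to indices 0..k-1 (hypothetical helper).

    `unique_labels` must be sorted and contain every value in `block`.
    """
    idx = np.searchsorted(unique_labels, block.ravel())
    return idx.reshape(block.shape)


# Float-typed, integer-valued mask with a sparse, very large label id.
mask = np.array([[0.0, 1_000_000.0], [1_000_000.0, 42.0]])

unique_labels = np.unique(mask)  # sorted distinct labels
dense = dense_relabel(mask, unique_labels)

# bincount sees integer indices, never the raw float labels, and its output
# has one slot per distinct label rather than one per possible label id.
counts = np.bincount(dense.ravel(), minlength=len(unique_labels))
```

Here `counts` has three entries instead of 1,000,001, and the original ids remain available as `unique_labels[i]` for mapping results back to instances.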
- Collapse `_write_obs_region` + `_write_obsm_region` into one `_write_region` parameterized by `obsm=True/False` (same allocate-or-load -> masked-assign -> store-back, with the obsm shape guard).
- Drop the `obsm_key`/`area_key`/`diameter_key` public kwargs and their threading; the destinations are now module constants (`_CENTROID_OBSM_KEY`/`_AREA_OBS_KEY`/`_DIAMETER_OBS_KEY`). A column-name collision still raises with an actionable message.
- Inline the single-use `_transform_carrier`; collapse the throwaway `xy` dict; hoist `meas["area"].to_numpy()` so it's materialized once for area+diameter.

Behavior unchanged; 17 TestMeasureObs + 63 non-visual test_utils green.
_region_mask_and_keys (2 trivial lines) folds into _measure_into_table; _measurable_elements becomes a comprehension in measure_obs's element=None branch. The remaining 9 helpers are each reused or encapsulate non-obvious logic (the streaming aggregator, the writer, the no-clobber/dtype guards, table resolution with its error messages).
#705 (measure_obs) merged to main, superseding #703's pre-extraction centroid block. This merge takes main's utils.py + test_utils.py wholesale (zero #703 delta there) and rewires the labels as_points path onto main's primitive:

- render.py labels branch: drop the deleted `_get_or_compute_centroids` + the whole cache layer; compute full-resolution (scale0) centroids via main's `_compute_element_measurements`, and draw them with a transform built from the *same* scale0 element (`_prepare_transformation(_get_top_data_array(...))`), so positions are independent of any rasterization applied to the rendered `label`. Coerce point_ids to the label dtype so str/object instance ids (e.g. Xenium readers) align instead of silently reindexing to NaN.
- Net: utils.py == main (the 242-line #703 centroid block + cache layer is gone); the surviving diff vs main is just the as_points feature.
- Tests: added a non-identity-transform regression test asserting the dots land at the cells' coordinate-system positions in display space (the guard for the transform pairing); existing as_points position tests still pass.

Shapes branch unchanged (already intrinsic centroid + trans_data, positionally aligned to its post-filter color vector).
What

Public `measure_obs` utility: computes per-cell centroid, area and equivalent diameter for a shapes or 2D-labels element and writes them into the annotating `AnnData` table (squidpy-style): centroids → `obsm["spatial"]` · area → `obs["area"]` · equiv. diameter → `obs["equivalent_diameter"]`

Stored in the element's intrinsic units. Labels area = pixel count; shapes area = `geometry.area` (`pi*r**2` for circles).

Why

Persist centroids/area once so renders and downstream tools (squidpy) reuse them instead of recomputing. `obsm["spatial"]` is the canonical, coords-only home; area belongs in `obs`.

How

- Labels: streaming bincount aggregator (one chunk plus `O(n_labels)` accumulators); out-of-core, scales to Xenium-size masks; area is a free by-product.
- Shapes: shapely's vectorized centroid/area; circles (`Point` + `radius`) use `pi*r**2`.
- `centroids=False` keeps an existing `obsm["spatial"]`. Needs an annotating table. `inplace` follows the scanpy convention.

Scope

Utility only: wiring `as_points` rendering through these measurements is a follow-up.

Tested in

`tests/pl/test_utils.py::TestMeasureObs`; performance benchmarks in the comment below.