Missing values in AnnData .uns DataFrame become string "nan" after SpatialData Zarr round trip #1152

Description

@AlbertoFabbri93

When writing a SpatialData object to Zarr and reading it back with sd.read_zarr, missing values in a pandas DataFrame stored inside an AnnData table’s .uns are converted from real missing values (NaN / pd.NA) into the literal string "nan".

This changes the semantics of the metadata after a write/read round trip. In downstream code, the string "nan" is then treated as a real category rather than a missing value.
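To illustrate why this matters downstream, here is a minimal pandas-only sketch. The two toy columns are made up for illustration (they are not my real data): one holds real missing values, the other holds the literal string "nan" in the same positions.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the metadata column (illustrative values only).
original = pd.Series(["BL1", "BL2", np.nan, "LAR", np.nan], dtype=object)
reloaded = pd.Series(["BL1", "BL2", "nan", "LAR", "nan"], dtype=object)

# Real missing values are detected by isna(); the string "nan" is not.
n_missing_original = int(original.isna().sum())
n_missing_reloaded = int(reloaded.isna().sum())

# Converting to categorical skips real missing values,
# but keeps the string "nan" as an ordinary category.
cats_original = list(original.astype("category").cat.categories)
cats_reloaded = list(reloaded.astype("category").cat.categories)

print(n_missing_original, cats_original)
print(n_missing_reloaded, cats_reloaded)
```

In the second column, "nan" appears as a fourth category and no value is reported as missing, so any grouping or plotting by this column gains a spurious group.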

Reproducible check from my object

import spatialdata as sd

# sdata_filtered and config are defined earlier in my script
col = "TNBCtype4_n235_notPreCentered"

before = sdata_filtered.tables["table"].uns["patients"][col]

print("BEFORE WRITE")
print(before.dtype)
print(before.value_counts(dropna=False))
print("string nan:", (before == "nan").sum())
print("real NA:", before.isna().sum())

sdata_filtered.write(config.DATA_DIR / "TNBC_filtered2.zarr")

sdata_reloaded = sd.read_zarr(config.DATA_DIR / "TNBC_filtered2.zarr")
after = sdata_reloaded.tables["table"].uns["patients"][col]

print("\nAFTER READ")
print(after.dtype)
print(after.value_counts(dropna=False))
print("string nan:", (after == "nan").sum())
print("real NA:", after.isna().sum())

Observed output

BEFORE WRITE
object
TNBCtype4_n235_notPreCentered
BL1    39
BL2    25
NaN    24
LAR    22
M      20
Name: count, dtype: int64
string nan: 0
real NA: 24

AFTER READ
object
TNBCtype4_n235_notPreCentered
BL1    39
BL2    25
nan    24
LAR    22
M      20
Name: count, dtype: int64
string nan: 24
real NA: 0

Expected behavior

The missing values should round-trip as missing values.

Expected after reading:

string nan: 0
real NA: 24

Instead, all 24 real missing values are converted to the literal string "nan".
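As a possible stopgap after reading, the string can be mapped back to a real missing value. This is only a sketch: `restore_missing` is a hypothetical helper name, not part of SpatialData or AnnData, and it assumes the column never contains "nan" as a legitimate label, since such values would be turned into missing values too.

```python
import numpy as np
import pandas as pd


def restore_missing(column: pd.Series, sentinel: str = "nan") -> pd.Series:
    """Replace every entry equal to `sentinel` with a real missing value."""
    # Series.mask replaces entries where the condition is True with NaN.
    return column.mask(column == sentinel)


# Toy column mimicking the state after the round trip (illustrative values).
after_toy = pd.Series(["BL1", "nan", "LAR", "nan", "M"], dtype=object)
restored = restore_missing(after_toy)

print("string nan:", int((restored == "nan").sum()))
print("real NA:", int(restored.isna().sum()))
```

On the real object this would be applied as `restore_missing(after)` to each affected column, but it does not address the underlying serialization behavior.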

Context

The affected object is an AnnData table stored inside a SpatialData object:

sdata_filtered.tables["table"].uns["patients"]

The patients entry is a pandas DataFrame containing patient-level metadata. It is generated before writing and stored in .uns.

My script also sets:

ad.settings.allow_write_nullable_strings = True

but the affected column has dtype object both before and after the round trip.

Question

Is this expected behavior for pandas DataFrames stored in AnnData .uns during SpatialData Zarr serialization, or should missing values be preserved across the SpatialData.write() / sd.read_zarr() round trip?
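The expectation in question can be phrased as a small check on the missing-value masks before and after the round trip. This is a sketch using pandas only; `missing_mask_preserved` is a hypothetical helper name, and it assumes both columns have the same length and row order.

```python
import numpy as np
import pandas as pd


def missing_mask_preserved(before: pd.Series, after: pd.Series) -> bool:
    """Return True if the same positions are missing in both columns."""
    # Compare positionally so that index differences do not matter.
    return bool((before.isna().to_numpy() == after.isna().to_numpy()).all())


# Toy columns (illustrative values only).
before_toy = pd.Series(["BL1", np.nan, "LAR"], dtype=object)
good_after = pd.Series(["BL1", np.nan, "LAR"], dtype=object)
bad_after = pd.Series(["BL1", "nan", "LAR"], dtype=object)

print(missing_mask_preserved(before_toy, good_after))
print(missing_mask_preserved(before_toy, bad_after))
```

With my object, `missing_mask_preserved(before, after)` would be the condition I expect to hold after `sd.read_zarr`.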
