Description
When writing a SpatialData object to Zarr and reading it back with sd.read_zarr, missing values in a pandas DataFrame stored inside an AnnData table’s .uns are converted from real missing values (NaN / pd.NA) into the literal string "nan".
This changes the semantics of the metadata after a write/read round trip. In downstream code, the string "nan" is then treated as a real category rather than a missing value.
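To illustrate the downstream effect with a toy example (made-up labels, not my real data): pandas skips real missing values by default when counting categories, but it counts the string "nan" as one more label.

```python
import numpy as np
import pandas as pd

# Toy labels, not the real patient metadata: one genuinely missing subtype.
before = pd.Series(["BL1", "BL2", np.nan, "LAR"], dtype=object)
# The same column with the missing value replaced by the literal string "nan".
after = pd.Series(["BL1", "BL2", "nan", "LAR"], dtype=object)

# nunique() ignores real missing values by default...
n_before = before.nunique()
# ...but treats the string "nan" as an ordinary category.
n_after = after.nunique()

print(n_before, n_after)
print(sorted(after.value_counts().index))
```

Any groupby, plot legend, or statistical test keyed on this column would pick up the extra "nan" group in the same way.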
Reproducible check from my object
import spatialdata as sd
# `config` and `sdata_filtered` come from my own project code.

col = "TNBCtype4_n235_notPreCentered"

# Inspect the column before writing.
before = sdata_filtered.tables["table"].uns["patients"][col]
print("BEFORE WRITE")
print(before.dtype)
print(before.value_counts(dropna=False))
print("string nan:", (before == "nan").sum())
print("real NA:", before.isna().sum())

# Round trip: write to Zarr, then read back.
sdata_filtered.write(config.DATA_DIR / "TNBC_filtered2.zarr")
sdata_reloaded = sd.read_zarr(config.DATA_DIR / "TNBC_filtered2.zarr")

# Inspect the same column after reading.
after = sdata_reloaded.tables["table"].uns["patients"][col]
print("\nAFTER READ")
print(after.dtype)
print(after.value_counts(dropna=False))
print("string nan:", (after == "nan").sum())
print("real NA:", after.isna().sum())
Observed output
BEFORE WRITE
object
TNBCtype4_n235_notPreCentered
BL1 39
BL2 25
NaN 24
LAR 22
M 20
Name: count, dtype: int64
string nan: 0
real NA: 24
AFTER READ
object
TNBCtype4_n235_notPreCentered
BL1 39
BL2 25
nan 24
LAR 22
M 20
Name: count, dtype: int64
string nan: 24
real NA: 0
Expected behavior
The missing values should round-trip as missing values.
Expected after reading:
string nan: 0
real NA: 24
Instead, all 24 real missing values are converted to the literal string "nan".
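The expectation can be written as a small check. This is only a sketch: `missing_value_report` is a hypothetical helper name, and the two Series are toy stand-ins for the column before writing and after reading.

```python
import numpy as np
import pandas as pd


def missing_value_report(before: pd.Series, after: pd.Series) -> dict:
    """Summarize how missing values differ between two versions of a column."""
    same_mask = (before.isna().to_numpy() == after.isna().to_numpy()).all()
    return {
        "real_na_before": int(before.isna().sum()),
        "real_na_after": int(after.isna().sum()),
        "string_nan_before": int((before == "nan").sum()),
        "string_nan_after": int((after == "nan").sum()),
        # True only if exactly the same positions are missing in both versions.
        "na_mask_preserved": bool(same_mask),
    }


# Toy stand-ins, not the real data.
before = pd.Series(["BL1", np.nan, "LAR", np.nan], dtype=object)
after = pd.Series(["BL1", "nan", "LAR", "nan"], dtype=object)

report = missing_value_report(before, after)
print(report)
```

A round trip that preserves missing values would report `na_mask_preserved` as True and leave both "string nan" counts at 0.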
Context
The affected object is an AnnData table stored inside a SpatialData object:
sdata_filtered.tables["table"].uns["patients"]
The patients entry is a pandas DataFrame containing patient-level metadata. It is generated before writing and stored in .uns.
My script also sets:
ad.settings.allow_write_nullable_strings = True
but the affected column has dtype object both before and after the round trip.
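One mechanism that would produce exactly this output is a plain string conversion of every element in an object-dtype column somewhere during writing. This is a guess on my side, not a confirmed diagnosis of what SpatialData or AnnData does internally; the sketch below only shows that stringifying a float NaN yields the text "nan".

```python
import numpy as np
import pandas as pd

# Toy object-dtype column with one real missing value (float NaN).
before = pd.Series(["BL1", np.nan, "M"], dtype=object)

# Hypothetical writer step (an assumption, not taken from any library's code):
# convert every element to str before storing it.
stringified = pd.Series(
    [str(value) for value in before],
    index=before.index,
    dtype=object,
)

print(stringified.tolist())
print("real NA:", int(stringified.isna().sum()))
print("string nan:", int((stringified == "nan").sum()))
```

If something like this happens on write, the information that the value was missing is already gone before reading, which would match the unchanged object dtype after the round trip.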
Question
Is this expected behavior for pandas DataFrames stored in AnnData .uns during SpatialData Zarr serialization, or should missing values be preserved across the SpatialData.write() / sd.read_zarr() round trip?