[SPARK-56936][PYTHON][TESTS] Fix PandasUDFReturnTypeTests for pandas 3 by zhengruifeng · Pull Request #55974 · apache/spark

zhengruifeng · 2026-05-19T06:12:04Z

What changes were proposed in this pull request?

Make pyspark.sql.tests.coercion.test_pandas_udf_return_type.PandasUDFReturnTypeTests work under pandas >= 3.0 and on systems whose tzdata package no longer ships the legacy US/* aliases (e.g. Ubuntu 24.04 / noble).

- Switch the tz-aware fixture from US/Eastern to America/New_York. The values returned by pd.date_range(...).values are identical for the two aliases (same zone, same DST rules), so the on-disk golden file does not need to be regenerated.
- Patch the loaded golden DataFrame in memory for pandas >= 3.0. The golden file was generated under pandas 2 and the on-disk content is unchanged. At load time, when running under pandas >= 3.0, the test:
  - Renames column keys whose representation differs between the two versions: datetime64 ndarrays default to [us] instead of [ns], and pd.Categorical keeps str-dtyped categories instead of object.
  - Scales 13+ digit integers in cells of datetime64 / Timedelta-list columns by 1/1000. Pandas 3 returns microseconds where pandas 2 returned nanoseconds for the same cast (e.g. bigint <- pd.date_range(...).values flips from 86_400_000_000_000 to 86_400_000_000).
  - Overrides the single decimal(10,0) x ['12','34']@list cell, which flipped from X (pandas 2 errored) to [Decimal('12'), Decimal('34')] (pandas 3 succeeds at the string -> Decimal coercion).
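
The three load-time adjustments can be sketched without pandas as follows. This is only an illustration, not the code in the PR: the nested-dict layout, the column key strings, and the names `rename_key`, `scale_ns_to_us`, and `patch_for_pandas3` are all hypothetical, and the real test operates on the loaded golden DataFrame.

```python
import re
from decimal import Decimal

# Hypothetical stand-in for the golden table: {column key: {Spark type: cell}}.
# The key strings are made up for illustration and are not the real golden keys.
golden = {
    "values@ndarray[datetime64[ns]]": {"bigint": "[0, 86400000000000]"},
    "['12', '34']@list": {"decimal(10,0)": "X"},
}


def rename_key(key: str) -> str:
    """Map a pandas-2-shaped column key to its pandas-3 shape."""
    return key.replace("datetime64[ns]", "datetime64[us]")


def scale_ns_to_us(cell: str) -> str:
    """Divide every integer of 13 or more digits in the cell by 1000."""
    return re.sub(r"\b\d{13,}\b", lambda m: str(int(m.group()) // 1000), cell)


def patch_for_pandas3(table: dict) -> dict:
    """Return a patched copy; the input (the 'on-disk' content) is untouched."""
    patched = {}
    for key, column in table.items():
        if "datetime64" in key:
            column = {row: scale_ns_to_us(cell) for row, cell in column.items()}
        patched[rename_key(key)] = dict(column)
    # Single-cell override: pandas 2 errored ("X"), pandas 3 coerces the strings.
    override_key = "['12', '34']@list"
    if override_key in patched:
        patched[override_key]["decimal(10,0)"] = str([Decimal("12"), Decimal("34")])
    return patched


patched = patch_for_pandas3(golden)
print(patched)
```

Building a patched copy rather than editing in place mirrors the constraint in the description: the golden content generated under pandas 2 stays as it is, and only the in-memory view changes.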

Why are the changes needed?

The scheduled CI run on the python-312-pandas-3 image fails in this suite, e.g. https://github.com/apache/spark/actions/runs/26002965955/job/76430490989. Root causes:

- pd.date_range("19700101", periods=2, tz="US/Eastern").values raises zoneinfo._common.ZoneInfoNotFoundError: 'No time zone found with key US/Eastern'. Pandas 3 dropped pytz as a hard dependency and now resolves tz names through stdlib zoneinfo, which on Ubuntu 24.04 cannot find US/Eastern because Ubuntu moved the legacy aliases out of tzdata into a separate tzdata-legacy package that the CI image does not install.
- After the alias fix, golden.loc[str_t, str_v] raises KeyError because the column keys in the golden file are pandas-2-shaped (datetime64[ns], Categorical(..., object)) but the lookup keys built at runtime are pandas-3-shaped (datetime64[us], Categorical(..., str)).
- After the key rename, assertions still fail because the cast result values themselves changed: nanoseconds -> microseconds for datetime / Timedelta inputs, and one cell where pandas 3 now succeeds where pandas 2 errored.
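
The first root cause can be reproduced with the standard library alone. The sketch below is an assumption-laden illustration rather than part of the patch: `LEGACY_TO_CANONICAL`, `canonical_tz_name`, and `try_load` are hypothetical names, and whether `US/Eastern` resolves depends on which tz database the host has installed.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

# Hypothetical helper table: legacy alias -> canonical IANA name.
LEGACY_TO_CANONICAL = {
    "US/Eastern": "America/New_York",
    "US/Pacific": "America/Los_Angeles",
}


def canonical_tz_name(name: str) -> str:
    """Return the canonical IANA name for a known legacy alias, else the input."""
    return LEGACY_TO_CANONICAL.get(name, name)


def try_load(name: str):
    """Return a ZoneInfo for `name`, or None if the host tz database lacks the key."""
    try:
        return ZoneInfo(name)
    except ZoneInfoNotFoundError:
        return None


# On a host without the legacy aliases, the first lookup returns None while the
# canonical name still resolves (provided some tz database is installed).
for key in ("US/Eastern", canonical_tz_name("US/Eastern")):
    zone = try_load(key)
    print(key, "->", "not found" if zone is None else zone.key)
```

Using the canonical name in the fixture sidesteps the lookup failure without requiring the CI image to install an extra package.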

Does this PR introduce any user-facing change?

No. Test-only change.

How was this patch tested?

Ran the suite locally under two envs:

# pandas 2.3.3 / Python 3.13
$ python/run-tests --testnames "pyspark.sql.tests.coercion.test_pandas_udf_return_type PandasUDFReturnTypeTests"
...
Tests passed in 31 seconds

# pandas 3.0.2 / Python 3.13
$ python/run-tests --testnames "pyspark.sql.tests.coercion.test_pandas_udf_return_type PandasUDFReturnTypeTests"
...
Tests passed in 30 seconds

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

…24.04 tzdata

- Switch the tz-aware fixture from the legacy alias `US/Eastern` to its canonical IANA name `America/New_York`. On Ubuntu 24.04 the system `tzdata` package no longer ships the legacy `US/*` aliases (those moved to `tzdata-legacy`), so under pandas >= 3.0 (which resolves tz via stdlib zoneinfo instead of bundled pytz), the previous fixture raised `ZoneInfoNotFoundError` in CI.
- Remap the loaded golden DataFrame in memory when running under pandas >= 3.0 so the pandas-2-generated golden columns still line up: `datetime64[ns]` -> `[us]` and `Categorical` categories `object` -> `str`. Only the column keys are remapped; the on-disk golden file is unchanged.

Generated-by: Claude Code

@list

Extend the pandas-3 in-memory adapter so the value comparisons also line up:

- Scale 13+ digit integers in cells of datetime64 / Timedelta-list columns by 1/1000. Pandas 3 returns microseconds where pandas 2 returned nanoseconds for the same cast, e.g. bigint <- pd.date_range(...).values flips from 86_400_000_000_000 to 86_400_000_000.
- Override the single decimal(10,0) x ['12','34']@list cell, which flipped from "X" (pandas 2 errored) to [Decimal('12'), Decimal('34')] (pandas 3 succeeds).

Test now passes under both pandas 2.3.3 (spark-dev-313) and pandas 3.0.2 (spark-dev-313-p3) locally.

Generated-by: Claude Code

No behavior change. Folds _patch_golden_for_pandas3 directly into the loader block where it is used, since it is only called once. Also replaces the local re.sub helper with Series.str.replace(regex=True) to drop the `import re`.

Generated-by: Claude Code

No behavior change. Use self.repr_value(value) and self.repr_type(...) to derive both rename and scale targets directly from self.test_data and the affected Spark type, instead of grep-matching the golden column names. A single loop over test_data builds both rename and scale_cols.

Generated-by: Claude Code

zhengruifeng changed the title ~~[PYTHON][TESTS] Fix PandasUDFReturnTypeTests for pandas 3 and Ubuntu 24.04 tzdata~~ [PYTHON][TESTS] Fix PandasUDFReturnTypeTests for pandas 3 May 19, 2026

zhengruifeng changed the title ~~[PYTHON][TESTS] Fix PandasUDFReturnTypeTests for pandas 3~~ [SPARK-56936][PYTHON][TESTS] Fix PandasUDFReturnTypeTests for pandas 3 May 19, 2026

zhengruifeng requested a review from HyukjinKwon May 19, 2026 07:10

HyukjinKwon approved these changes May 19, 2026

zhengruifeng added 3 commits May 19, 2026 08:19

zhengruifeng wants to merge 4 commits into apache:master from zhengruifeng:SPARK-fix-tz-uneastern
