[SPARK-56936][PYTHON][TESTS] Fix PandasUDFReturnTypeTests for pandas 3 #55974

Draft
zhengruifeng wants to merge 4 commits into apache:master from zhengruifeng:SPARK-fix-tz-uneastern

@zhengruifeng zhengruifeng commented May 19, 2026

What changes were proposed in this pull request?

Make pyspark.sql.tests.coercion.test_pandas_udf_return_type.PandasUDFReturnTypeTests work under pandas >= 3.0 and on systems whose tzdata package no longer ships the legacy US/* aliases (e.g. Ubuntu 24.04 / noble).

  1. Switch the tz-aware fixture from US/Eastern to America/New_York. US/Eastern is a legacy alias for America/New_York (same zone, same DST rules), so pd.date_range(...).values returns identical values for both names and the on-disk golden file does not need to be regenerated.

  2. Patch the loaded golden DataFrame in memory for pandas >= 3.0. The golden file was generated under pandas 2 and the on-disk content is unchanged. At load time, when running under pandas >= 3.0, the test:

    • Renames column keys whose representation differs between the two versions: datetime64 ndarrays default to [us] instead of [ns], and pd.Categorical keeps str-dtyped categories instead of object.
    • Scales 13+ digit integers in cells of datetime64 / Timedelta-list columns by 1/1000. Pandas 3 returns microseconds where pandas 2 returned nanoseconds for the same cast (e.g. bigint <- pd.date_range(...).values flips from 86_400_000_000_000 to 86_400_000_000).
    • Overrides the single decimal(10,0) x ['12','34']@list cell, which flipped from X (pandas 2 errored) to [Decimal('12'), Decimal('34')] (pandas 3 succeeds at the string -> Decimal coercion).
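
The integer scaling step can be sketched without pandas. This is a minimal illustration rather than the code in this PR: the helper name `scale_ns_to_us` is hypothetical, and it assumes golden cells are stored as strings containing integer literals.

```python
import re

# Hypothetical pattern: any integer literal with 13 or more digits.
_NS_INT = re.compile(r"-?\d{13,}")


def scale_ns_to_us(cell: str) -> str:
    """Rewrite 13+ digit integers in a cell string from ns to us."""

    def _scale(match: re.Match) -> str:
        value = int(match.group())
        # Assumes the value is an exact multiple of 1000, as whole-day
        # nanosecond timestamps are.
        return str(value // 1000)

    return _NS_INT.sub(_scale, cell)


print(scale_ns_to_us("[0, 86400000000000]"))  # [0, 86400000000]
```

Shorter integers pass through unchanged, so cells that already hold microsecond-scale values of this size are not touched.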

Why are the changes needed?

The scheduled CI run on the python-312-pandas-3 image fails in this suite, e.g. https://github.com/apache/spark/actions/runs/26002965955/job/76430490989. Root causes:

  • pd.date_range("19700101", periods=2, tz="US/Eastern").values raises zoneinfo._common.ZoneInfoNotFoundError: 'No time zone found with key US/Eastern'. Pandas 3 dropped pytz as a hard dependency and now resolves tz names through stdlib zoneinfo, which on Ubuntu 24.04 cannot find US/Eastern because Ubuntu moved the legacy aliases out of tzdata into a separate tzdata-legacy package that the CI image does not install.
  • After the alias fix, golden.loc[str_t, str_v] raises KeyError because the column keys in the golden file are pandas-2-shaped (datetime64[ns], Categorical(..., object)) but the lookup keys built at runtime are pandas-3-shaped (datetime64[us], Categorical(..., str)).
  • After the key rename, assertions still fail because the cast result values themselves changed: nanoseconds -> microseconds for datetime / Timedelta inputs, and one cell where pandas 3 now succeeds where pandas 2 errored.
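
The first root cause can be illustrated with the standard library alone. The sketch below is not taken from the PR; the mapping `LEGACY_TZ_ALIASES` and both helper names are hypothetical. It resolves a legacy alias to its canonical IANA name before asking `zoneinfo` for the zone, and returns `None` when the system has no data for that key.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

# Hypothetical mapping from legacy aliases to canonical IANA names.
LEGACY_TZ_ALIASES = {"US/Eastern": "America/New_York"}


def canonical_tz_name(name: str) -> str:
    """Return the canonical IANA name for a legacy alias, else the name itself."""
    return LEGACY_TZ_ALIASES.get(name, name)


def load_zone(name: str):
    """Return a ZoneInfo for the canonical name, or None if the key is missing."""
    try:
        return ZoneInfo(canonical_tz_name(name))
    except ZoneInfoNotFoundError:
        # Raised when neither the system tz database nor the tzdata
        # package provides the requested key.
        return None


print(canonical_tz_name("US/Eastern"))  # America/New_York
```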

Does this PR introduce any user-facing change?

No. Test-only change.

How was this patch tested?

Ran the suite locally under two envs:

# pandas 2.3.3 / Python 3.13
$ python/run-tests --testnames "pyspark.sql.tests.coercion.test_pandas_udf_return_type PandasUDFReturnTypeTests"
...
Tests passed in 31 seconds

# pandas 3.0.2 / Python 3.13
$ python/run-tests --testnames "pyspark.sql.tests.coercion.test_pandas_udf_return_type PandasUDFReturnTypeTests"
...
Tests passed in 30 seconds

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

…24.04 tzdata

- Switch the tz-aware fixture from the legacy alias `US/Eastern` to its
  canonical IANA name `America/New_York`. On Ubuntu 24.04 the system
  `tzdata` package no longer ships the legacy `US/*` aliases (those
  moved to `tzdata-legacy`), so under pandas >= 3.0 (which resolves tz
  via stdlib zoneinfo instead of bundled pytz), the previous fixture
  raised `ZoneInfoNotFoundError` in CI.
- Remap the loaded golden DataFrame in memory when running under pandas
  >= 3.0 so the pandas-2-generated golden columns still line up:
  `datetime64[ns]` -> `[us]` and `Categorical` categories `object` ->
  `str`. Only the column keys are remapped; the on-disk golden file is
  unchanged.
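
  A pandas-free sketch of the key remapping, assuming the golden columns are
  keyed by repr-like strings. The key text, the `remap_keys` helper, and the
  dict model of the golden table are all hypothetical; the real keys may look
  different.

```python
# Hypothetical pandas-2-shaped column keys mapped to golden cell values.
golden_row = {
    "array([...], dtype='datetime64[ns]')": "X",
    "plain_key": "ok",
}


def remap_keys(row: dict, replacements: dict) -> dict:
    """Return a copy of row with substrings in each key rewritten."""
    remapped = {}
    for key, value in row.items():
        for old, new in replacements.items():
            key = key.replace(old, new)
        remapped[key] = value
    return remapped


# Only keys change; values and the original mapping are left alone.
patched = remap_keys(golden_row, {"datetime64[ns]": "datetime64[us]"})
print(sorted(patched))
```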

Generated-by: Claude Code
@zhengruifeng zhengruifeng changed the title [PYTHON][TESTS] Fix PandasUDFReturnTypeTests for pandas 3 and Ubuntu 24.04 tzdata [PYTHON][TESTS] Fix PandasUDFReturnTypeTests for pandas 3 May 19, 2026
@zhengruifeng zhengruifeng changed the title [PYTHON][TESTS] Fix PandasUDFReturnTypeTests for pandas 3 [SPARK-56936][PYTHON][TESTS] Fix PandasUDFReturnTypeTests for pandas 3 May 19, 2026
@zhengruifeng zhengruifeng requested a review from HyukjinKwon May 19, 2026 07:10

Extend the pandas-3 in-memory adapter so the value comparisons also
line up:

- Scale 13+ digit integers in cells of datetime64 / Timedelta-list
  columns by 1/1000. Pandas 3 returns microseconds where pandas 2
  returned nanoseconds for the same cast, e.g. bigint <-
  pd.date_range(...).values flips from 86_400_000_000_000 to
  86_400_000_000.
- Override the single decimal(10,0) x ['12','34']@list cell, which
  flipped from "X" (pandas 2 errored) to [Decimal('12'), Decimal('34')]
  (pandas 3 succeeds).
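
The two figures quoted above follow from unit arithmetic on one day, which can be checked with the standard library. The variable names are illustrative only. The digit counts are consistent with a 13-digit threshold separating the nanosecond value from the microsecond one, although the source does not state why that threshold was chosen.

```python
from datetime import timedelta

one_day = timedelta(days=1)

# Whole microseconds in one day, then nanoseconds via the 1000x factor.
us_per_day = one_day // timedelta(microseconds=1)
ns_per_day = us_per_day * 1000

print(ns_per_day, len(str(ns_per_day)))  # 86400000000000 14
print(us_per_day, len(str(us_per_day)))  # 86400000000 11
```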

Test now passes under both pandas 2.3.3 (spark-dev-313) and pandas
3.0.2 (spark-dev-313-p3) locally.

Generated-by: Claude Code

No behavior change. Folds _patch_golden_for_pandas3 directly into the
loader block where it is used, since it is only called once. Also
replaces the local re.sub helper with Series.str.replace(regex=True)
to drop the `import re`.

Generated-by: Claude Code

No behavior change. Use self.repr_value(value) and self.repr_type(...)
to derive both rename and scale targets directly from self.test_data
and the affected Spark type, instead of grep-matching the golden
column names. Single loop over test_data builds both rename and
scale_cols.

Generated-by: Claude Code

2 participants