[Python] Dataset.to_batches() / ParquetFileFragment.to_batches() hang forever #45214

@lhoestq

Description

In the datasets library we are using ParquetFileFragment.to_batches() to stream batches of data while applying filters file-per-file. We create fragments from file-like objects (because files can be local or remote).

However, @AlexKoff88 reported in huggingface/datasets#7357 that this causes the code to hang for some datasets, such as phiyodr/InpaintCOCO.

I managed to make a reproducible example:

wget https://huggingface.co/datasets/phiyodr/InpaintCOCO/resolve/c56e31947190173d2d6373c4833b0a9889ff6eee/data/test-00000-of-00003.parquet

import pyarrow.dataset as ds

file = "test-00000-of-00003.parquet"
with open(file, "rb") as f:
    parquet_fragment = ds.ParquetFileFormat().make_fragment(f)
    for record_batch in parquet_fragment.to_batches():
        print(len(record_batch))  # 100
        break  # hangs forever

Environment:

  • python 3.12.2
  • pyarrow 18.1.0
  • macbook pro m2

Most of the time the code hangs, and in some (rare) random cases it is able to terminate.
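Since the hang is intermittent, one way to put a number on it is to run the reproducer several times in a subprocess with a timeout and count how many runs never finish. This is only a rough sketch using the standard library: the helper name `count_hangs`, the file name `repro.py`, and the timeout values are assumptions for illustration and are not part of the original report.

```python
import subprocess
import sys


def count_hangs(script_args, runs=5, timeout_s=30.0):
    """Run `python <script_args>` `runs` times; return how many runs timed out."""
    hangs = 0
    for _ in range(runs):
        try:
            # subprocess.run kills the child process when the timeout expires,
            # so a hung interpreter does not linger in the background.
            subprocess.run(
                [sys.executable, *script_args],
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs
```

For example, with the snippet above saved as `repro.py` (hypothetical name), `count_hangs(["repro.py"], runs=10, timeout_s=60.0)` would return the number of runs out of 10 that did not terminate within a minute. A timeout only suggests a hang; a slow run would be counted the same way, so the timeout should be well above the normal run time.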

The issue appears both when running the code as a Python script and in a Python shell / IPython, where it hangs on exit().

The issue also appears with other datasets, for example eltorio/ROCOv2-radiology and bigcode/the-stack.

In the original issue in datasets this message was also reported:

Fatal Python error: PyGILState_Release: thread state 0x7fa1f409ade0 must be current when releasing
Python runtime state: finalizing (tstate=0x0000000000ad2958)

Thread 0x00007fa33d157740 (most recent call first):
  <no Python frame>

Component(s)

Parquet, Python
