In the datasets library we are using ParquetFileFragment.to_batches() to stream batches of data while applying filters file-per-file. We create fragments from file-like objects (because files can be local or remote).
However, @AlexKoff88 reported in huggingface/datasets#7357 that for some datasets, such as phiyodr/InpaintCOCO, this causes the code to hang.
I managed to make a reproducible example:
wget https://huggingface.co/datasets/phiyodr/InpaintCOCO/resolve/c56e31947190173d2d6373c4833b0a9889ff6eee/data/test-00000-of-00003.parquet
file info here
import pyarrow.dataset as ds

file = "test-00000-of-00003.parquet"

with open(file, "rb") as f:
    # Build the fragment from a file-like object rather than a path
    parquet_fragment = ds.ParquetFileFormat().make_fragment(f)
    for record_batch in parquet_fragment.to_batches():
        print(len(record_batch))  # 100
        break  # hangs forever
Environment:
- Python 3.12.2
- pyarrow 18.1.0
- MacBook Pro M2
Most of the time the code hangs, and in some (rare) random cases it is able to terminate.
The issue appears when running the Python script, and also in a Python shell / IPython when calling exit().
The issue also appears with other datasets, for example eltorio/ROCOv2-radiology and bigcode/the-stack.
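Since the hang is intermittent, it may help to measure how often it happens. Below is a minimal sketch using only the standard library: it runs a command several times and counts the runs that do not exit within a timeout. The helper name count_hangs, the file name repro.py, and the 30-second timeout are assumptions made for illustration, not part of the original report.

```python
import subprocess
import sys


def count_hangs(cmd, runs=10, timeout=30.0):
    """Run `cmd` `runs` times and return how many runs failed to exit within `timeout` seconds."""
    hangs = 0
    for _ in range(runs):
        try:
            subprocess.run(
                cmd,
                timeout=timeout,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child process before re-raising,
            # so a hung interpreter does not linger between runs.
            hangs += 1
    return hangs
```

Assuming the reproducer above is saved as repro.py (a hypothetical name), count_hangs([sys.executable, "repro.py"], runs=10) would return the number of runs out of 10 that hung.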
In the original issue in datasets this message was also reported:
Fatal Python error: PyGILState_Release: thread state 0x7fa1f409ade0 must be current when releasing
Python runtime state: finalizing (tstate=0x0000000000ad2958)
Thread 0x00007fa33d157740 (most recent call first):
<no Python frame>
Component(s)
Parquet, Python