Describe the enhancement requested
The Arrow IPC format assigns a dict_id to each dictionary array and serializes the dictionaries first.
The same dict_id may be used by multiple columns, so a dictionary can be serialized once and referenced many times.
Currently the IPC writer always assigns a new dict_id to every dictionary it encounters. Hence, although the IPC format supports dictionary deduplication, users cannot take advantage of it.
I suggest adding a new option, dedup_dictionaries, to IpcWriteOptions. Opinions on the option name are welcome.
If enabled, the writer will check whether each dictionary buffer is unique. It will serialize only the unique dictionaries to the file and reuse their dict_ids for the columns that share them.
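To make the intended behavior concrete, here is a rough sketch in plain Python. It does not use Arrow or reflect the actual writer code: the function assign_dict_ids and its arguments are hypothetical, dictionaries are modeled as lists of strings, and it assumes uniqueness is decided by comparing dictionary contents (a real implementation might compare buffer identity or a hash instead).

```python
from typing import Dict, List, Sequence, Tuple


def assign_dict_ids(
    columns: Sequence[Tuple[str, Sequence[str]]],
    dedup: bool,
) -> Tuple[Dict[str, int], List[Tuple[int, Tuple[str, ...]]]]:
    """Hypothetical helper: map each column to a dict_id and list the
    dictionaries that would have to be serialized."""
    column_ids: Dict[str, int] = {}
    to_write: List[Tuple[int, Tuple[str, ...]]] = []
    seen: Dict[Tuple[str, ...], int] = {}  # dictionary contents -> dict_id

    for name, dictionary in columns:
        key = tuple(dictionary)
        if dedup and key in seen:
            # Same contents already scheduled: reuse its dict_id.
            column_ids[name] = seen[key]
            continue
        # Otherwise allocate a fresh dict_id and schedule the dictionary.
        dict_id = len(to_write)
        seen[key] = dict_id
        column_ids[name] = dict_id
        to_write.append((dict_id, key))
    return column_ids, to_write


cities = ["Berlin", "Paris", "Rome"]
columns = [
    ("origin", cities),
    ("destination", cities),
    ("status", ["ok", "late"]),
]

# Roughly the current behavior: one dictionary per column.
ids_now, written_now = assign_dict_ids(columns, dedup=False)
# Proposed behavior: identical dictionaries share a dict_id.
ids_dedup, written_dedup = assign_dict_ids(columns, dedup=True)

print(len(written_now), len(written_dedup))
```

In this sketch the origin and destination columns end up with the same dict_id when dedup is on, so only two dictionaries are written instead of three.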
I am interested in this feature and can implement it if there are no objections from the dev community.
I can do the C++ and Python parts.
Component(s)
C++
Python