Summary
This PR cleans up the OasisDataManager library with ergonomic improvements, bug fixes, and expanded test coverage. Changes span the public API, internal reader logic, storage backends, and exception naming.
Public API improvements
Short import paths
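The re-exports listed below can be sketched as follows. This is a hypothetical stand-in (the `pkg` package and module layout here are invented for illustration, not the library's real structure): a package `__init__.py` simply re-binds the same class object at top level, so short and deep import paths refer to one class.

```python
import sys
import types

# Hypothetical sketch: how a package __init__ re-exports a class that
# lives in a deep backend module. "pkg" and its submodule are stand-ins.

# The deep backend module where the class is actually defined.
backend = types.ModuleType("pkg.df_reader.backends.pandas")

class PandasReader:  # stand-in for the real reader class
    pass

backend.PandasReader = PandasReader
sys.modules["pkg.df_reader.backends.pandas"] = backend

# The package __init__ would contain:
#     from .df_reader.backends.pandas import PandasReader
# which just re-binds the same object at the top level:
pkg = types.ModuleType("pkg")
pkg.PandasReader = backend.PandasReader
sys.modules["pkg"] = pkg

# Callers can now use the short path; both names are one class object.
from pkg import PandasReader as ShortPath

assert ShortPath is backend.PandasReader
```

Because the re-export is a plain name binding, `isinstance` checks and subclass relationships are unaffected by which path a caller imports from.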
- `oasis_data_manager/__init__.py` now exports `PandasReader`, `DaskReader`, `PyarrowReader`, `LocalStorage`, `AwsS3Storage`, and `AzureABFSStorage` directly, allowing `from oasis_data_manager import PandasReader` instead of importing from the full backend path.
- `oasis_data_manager/filestore/__init__.py` now exports all three storage classes and the config helpers, enabling `from oasis_data_manager.filestore import AwsS3Storage`.
- `oasis_data_manager/df_reader/__init__.py` re-exports all reader classes.

Storage backend module renames
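The rename shim described below can be sketched like this. The module names are stand-ins and the shim is simulated with a function rather than a real module file, but the pattern is the same: the old path re-exports the new module's names and emits a `DeprecationWarning` when loaded.

```python
import sys
import types
import warnings

# New canonical module (stand-in names, not the library's real layout).
new_mod = types.ModuleType("pkg.filestore.backends.aws")

class AwsS3Storage:  # stand-in for the real storage class
    pass

new_mod.AwsS3Storage = AwsS3Storage
sys.modules["pkg.filestore.backends.aws"] = new_mod

# The old module's body would be roughly:
#     import warnings
#     warnings.warn("... moved to ...backends.aws", DeprecationWarning, stacklevel=2)
#     from .aws import AwsS3Storage
# Simulated here as a loader function:
def load_old_module():
    warnings.warn(
        "pkg.filestore.backends.aws_s3 moved to pkg.filestore.backends.aws",
        DeprecationWarning,
        stacklevel=2,
    )
    old = types.ModuleType("pkg.filestore.backends.aws_s3")
    old.AwsS3Storage = new_mod.AwsS3Storage  # same class object, not a copy
    return old

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_mod = load_old_module()

assert old_mod.AwsS3Storage is new_mod.AwsS3Storage
assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```

Since the old path hands back the identical class object, existing `isinstance` checks and imports keep working while the warning nudges callers toward the canonical path.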
- `filestore/backends/aws_s3.py` → `filestore/backends/aws.py` (canonical path)
- `filestore/backends/azure_abfs.py` → `filestore/backends/azure.py` (canonical path)
- The old module paths remain importable and emit a `DeprecationWarning` on import, so existing code continues to work.

Exception rename
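The rename-with-alias pattern described below can be sketched as follows; the class names come from the PR, but the bodies are minimal stand-ins, not the library's actual implementation.

```python
# Sketch of the rename-with-alias pattern: the new name is canonical,
# while the old name stays bound to the *same* class object, so existing
# except clauses and isinstance checks keep working.

class OasisDataManagerException(Exception):
    """Base exception for the library (new canonical name)."""

# Backward-compatible alias: same object, not a subclass.
OasisException = OasisDataManagerException

class MissingInputsException(OasisDataManagerException):
    def __init__(self, missing):
        # f-string instead of str.format(), super() with no arguments
        super().__init__(f"Missing inputs: {', '.join(missing)}")

# Old-style code still catches the renamed exception:
try:
    raise MissingInputsException(["portfolio.csv"])
except OasisException as e:
    caught = str(e)

assert OasisException is OasisDataManagerException
assert caught == "Missing inputs: portfolio.csv"
```

Aliasing the class (rather than subclassing the old name from the new one) means `except OasisException` and `except OasisDataManagerException` are literally the same check, so no catch site can drift out of sync.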
- `OasisException` renamed to `OasisDataManagerException` to better reflect the library it belongs to and to avoid confusion with the same name used elsewhere in the Oasis platform.
- `OasisException` is kept as a backward-compatible alias (same class object) so nothing breaks.
- `MissingInputsException` updated to subclass `OasisDataManagerException`.

Bug fixes
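The guard-flag fix for the RecursionError described below can be sketched as follows. The attribute and method names follow the PR's description, but the class is a minimal stand-in with the backend read faked, so this shows the pattern rather than the real reader.

```python
# Sketch of the guard-flag fix: _read() marks itself as started *before*
# invoking the backend read, so code that touches self.df while the read
# is in flight (or failing) cannot re-enter _read() and recurse forever.

class Reader:
    def __init__(self, fail=False):
        self.has_read = False
        self._df = None
        self.fail = fail
        self.read_calls = 0

    @property
    def df(self):
        # Lazy property: triggers a read on first access only.
        if not self.has_read:
            self._read()
        return self._df

    def _read(self):
        self.read_calls += 1
        self.has_read = True              # set BEFORE reading (the fix)
        try:
            if self.fail:
                raise IOError("backend read failed")
            self._df = ["row1", "row2"]   # stand-in for read_parquet()
        except Exception:
            self.has_read = False         # reset so a later retry is possible
            raise

ok = Reader()
assert ok.df == ["row1", "row2"]
assert ok.df == ["row1", "row2"]
assert ok.read_calls == 1                 # cached: no redundant second read

bad = Reader(fail=True)
try:
    bad.df
except IOError:
    pass
assert bad.has_read is False              # flag reset; no infinite recursion
```

The same sketch also illustrates the "double `_read()`" fix: methods on the class itself should touch `self._df` directly once a read has happened, reserving the `df` property for external, lazy access.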
Dask: RecursionError on parquet reads
- `OasisReader._read()` now sets `has_read = True` before calling `read_parquet()`/`read_csv()`, wrapped in a try/except that resets the flag on failure.
- Previously, a failure inside `read_parquet()` could cause it to re-enter `_read()` via `self.df` (a property that calls `_read()`), producing infinite recursion.

Dask: copy_with_df type mismatch
- `OasisDaskReader.copy_with_df()` now converts any incoming pandas DataFrame to a Dask DataFrame before passing it to the base implementation.
- Previously the reader could end up storing `self._df` as pandas, causing an `AttributeError` when `as_pandas()` called `.compute()` on it.

Double `_read()` calls
- `OasisReader.filter()` and `OasisReader.as_pandas()` now access `self._df` directly instead of going through the `self.df` property, eliminating a redundant second `_read()` call.

Code quality
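The `for`/`else`/`break` to `any()` simplification listed below can be sketched like this. Both functions answer the same membership question; the extension list and function names are illustrative, not the library's actual code.

```python
# Sketch of the for/else -> any() refactor: both shapes short-circuit on
# the first match; the any() form removes the break/else bookkeeping.

EXTENSIONS = (".csv", ".parquet", ".pq")

def supported_for_else(path):
    # Old shape: loop with break, else-clause for the "no match" case.
    for ext in EXTENSIONS:
        if path.endswith(ext):
            found = True
            break
    else:
        found = False
    return found

def supported_any(path):
    # New shape: one generator expression, same short-circuit behaviour.
    return any(path.endswith(ext) for ext in EXTENSIONS)

for p in ("losses.parquet", "events.csv", "notes.txt"):
    assert supported_for_else(p) == supported_any(p)
```

The `any()` form also makes the intent ("is there at least one matching extension?") explicit, which is why it reads better than the loop even at equal behaviour.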
- Replaced a `for`/`else`/`break` loop in `_read()` with a one-line `any()` expression.
- Modernised `super()` calls: updated to `super()` (no arguments) in `AwsS3Storage`, `AzureABFSStorage`, `MissingInputsException`, and `OasisDataManagerException`.
- Replaced a `.format()` call in `MissingInputsException.__init__` with an f-string.
- Logging consistency: `delete_file()` and `delete_dir()` in `BaseStorage` now call `self.logger.info()` instead of the bare module-level `logging.info()`, consistent with the rest of the class. Fixed an "Unknwon" typo in the log message.
- `AwsS3Storage.config_options` serialization: the AWS and Azure backends now store the original `root_dir` argument (`self._root_dir_arg`) before joining it onto the bucket/container path, and use it in `config_options`. This avoids a fragile `Path.relative_to()` reverse-computation that could fail if the paths didn't align.
- `ComplexData.run()` clarity: added a comment explaining the `fetch_required` logic: CSV and Parquet files are read directly by the df_reader, so `fetch()` is only needed for formats the reader cannot handle directly.

Test coverage
New test files
- `tests/df_reader/test_pyarrow.py`: PyArrow backend tests covering parquet reads, column selection, and filter predicates.
- `tests/filestorage/test_storage_utils.py`: tests for `BaseStorage.create_traceback()` and `AwsS3Storage._strip_signing_parameters()`.

New tests in existing files
- `test_read_csv.py` / `test_read_parquet.py`: `OasisReader.query()`, `copy_with_df()`, and `OasisDaskReader.read_from_dataframe()`.
- `test_from_dataframe.py`: passing a pandas DataFrame via `dataframe=` to a Dask reader.
- `test_caching.py`: `OasisDataManagerException` backward-compat alias verification.

Test fixes
- `query()` test: added a `.compute()` call for lazy scalar results (e.g. `frame["D"].sum()`).
- `test_complex/test_base.py`: guarded the dask import with `pytest.importorskip` so the file is skipped cleanly when Dask is not installed.
- `type: ignore` comments on optional-dependency fallback assignments updated to suppress the correct error codes.

Readme creation