
refactor#35

Draft
Ha-Ree wants to merge 7 commits into develop from feature/cleanup-and-simplify

Ha-Ree commented Mar 10, 2026

Summary

This PR cleans up the OasisDataManager library with ergonomic improvements, bug fixes, and expanded test coverage. Changes span the public API, internal reader logic, storage backends, and exception naming.


Public API improvements

Short import paths

  • oasis_data_manager/__init__.py now exports PandasReader, DaskReader, PyarrowReader, LocalStorage, AwsS3Storage, AzureABFSStorage directly, allowing from oasis_data_manager import PandasReader instead of importing from the full backend path.
  • oasis_data_manager/filestore/__init__.py now exports all three storage classes and the config helpers, enabling from oasis_data_manager.filestore import AwsS3Storage.
  • oasis_data_manager/df_reader/__init__.py re-exports all reader classes.

Storage backend module renames

  • filestore/backends/aws_s3.py → filestore/backends/aws.py (canonical path)
  • filestore/backends/azure_abfs.py → filestore/backends/azure.py (canonical path)
  • The old paths are retained as backward-compatible shims that emit a DeprecationWarning on import, so existing code continues to work.
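
As a rough illustration of the shim pattern described above, the sketch below simulates an old-path module that warns and re-exports from the canonical one. The module names `demo_backends_aws` and `demo_backends_aws_s3`, the in-memory module setup, and the warning text are all assumptions made for the example; the actual shim files in this PR may look different.

```python
import sys
import types
import warnings


class AwsS3Storage:
    """Placeholder standing in for the real storage class."""


# Register an in-memory stand-in for the canonical module (hypothetical name).
canonical = types.ModuleType("demo_backends_aws")
canonical.AwsS3Storage = AwsS3Storage
sys.modules["demo_backends_aws"] = canonical

# What a shim module left at the old path might contain (assumed wording).
SHIM_SOURCE = """
import warnings

warnings.warn(
    "demo_backends_aws_s3 is deprecated; import from demo_backends_aws instead",
    DeprecationWarning,
    stacklevel=2,
)

from demo_backends_aws import AwsS3Storage  # re-export the canonical class
"""


def import_shim():
    """Execute the shim body as if the old module path were being imported."""
    shim = types.ModuleType("demo_backends_aws_s3")
    exec(SHIM_SOURCE, shim.__dict__)
    return shim


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    shim = import_shim()

# The old path resolves to the same class object, and a warning was recorded.
assert shim.AwsS3Storage is AwsS3Storage
assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```

Because the shim re-exports the same class object rather than a copy, isinstance checks keep working regardless of which path a caller imported from.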

Exception rename

  • OasisException renamed to OasisDataManagerException to better reflect the library it belongs to and avoid confusion with the same name used elsewhere in the Oasis platform.
  • OasisException is kept as a backward-compatible alias (same class object) so nothing breaks.
  • MissingInputsException updated to subclass OasisDataManagerException.
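
A minimal sketch of how the rename and alias fit together, so that handlers written against the old name keep catching the new classes. The constructor signatures, the message text, and the `load` helper are assumptions for illustration and are not taken from the library.

```python
class OasisDataManagerException(Exception):
    """Base error for the library (constructor signature is assumed)."""

    def __init__(self, msg, original_exception=None):
        super().__init__(msg)
        self.original_exception = original_exception


# Backward-compatible alias: the same class object under the old name.
OasisException = OasisDataManagerException


class MissingInputsException(OasisDataManagerException):
    def __init__(self, input_filepath):
        # Message wording is hypothetical; the f-string style follows the PR.
        super().__init__(f"Input file not found: {input_filepath}")
        self.input_filepath = input_filepath


def load(path):
    """Hypothetical helper that always fails, to exercise the handlers."""
    raise MissingInputsException(path)


try:
    load("missing.csv")
except OasisException as exc:  # legacy handler using the old name
    caught = exc
```

Since the alias is an assignment rather than a subclass, `except OasisException` and `except OasisDataManagerException` are interchangeable.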

Bug fixes

Dask RecursionError on parquet reads

  • OasisReader._read() now sets has_read = True before calling read_parquet() / read_csv(), wrapped in a try/except that resets the flag on failure.
  • Previously the flag was set after the read, causing Dask's read_parquet() to re-enter _read() via self.df (a property that calls _read()), producing infinite recursion.
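
The guard ordering described above can be sketched without Dask. `LazyReader` and both loader functions below are hypothetical stand-ins, not the library's classes; the loader deliberately touches `reader.df` to mimic the re-entrant access.

```python
class LazyReader:
    """Toy reader whose df property triggers a lazy, one-time read."""

    def __init__(self, loader):
        self._loader = loader
        self._df = None
        self.has_read = False
        self.read_calls = 0

    @property
    def df(self):
        self._read()
        return self._df

    def _read(self):
        if self.has_read:
            return
        # Set the flag first so a re-entrant access returns immediately.
        self.has_read = True
        self.read_calls += 1
        try:
            self._df = self._loader(self)
        except Exception:
            # Reset on failure so a later call can retry the read.
            self.has_read = False
            raise


def reentrant_loader(reader):
    _ = reader.df  # re-enters _read(); the guard stops the recursion
    return [1, 2, 3]


def failing_loader(reader):
    raise OSError("simulated read failure")


reader = LazyReader(reentrant_loader)
data = reader.df

broken = LazyReader(failing_loader)
try:
    broken.df
except OSError:
    pass
```

If the flag were set only after the loader returned, `reentrant_loader` would call back into `_read()` indefinitely, which is the failure mode this fix addresses.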

Dask copy_with_df type mismatch

  • OasisDaskReader.copy_with_df() now converts any incoming pandas DataFrame to a Dask DataFrame before passing it to the base implementation.
  • Previously, copying with a pandas DataFrame left self._df as pandas, causing AttributeError when as_pandas() called .compute() on it.

Double _read() calls

  • OasisReader.filter() and OasisReader.as_pandas() now access self._df directly instead of going through the self.df property, eliminating a redundant second _read() call.

Code quality

  • Extension detection: Replaced a nested for/else/break loop in _read() with a one-line any() expression.
  • Old-style super() calls: Updated to super() (no arguments) in AwsS3Storage, AzureABFSStorage, MissingInputsException, and OasisDataManagerException.
  • f-strings: Replaced .format() call in MissingInputsException.__init__ with an f-string.
  • Logger usage: delete_file() and delete_dir() in BaseStorage now call self.logger.info() instead of the module-level logging.info(), consistent with the rest of the class. Also fixed the "Unknwon" typo in the log message.
  • Stale docstring: Removed outdated Django-storage references and TODO stubs from AwsS3Storage.
  • config_options serialization: AWS and Azure backends now store the original root_dir argument (self._root_dir_arg) before joining it onto the bucket/container path, and use it in config_options. This avoids a fragile Path.relative_to() reverse-computation that could fail if the paths didn't align.
  • ComplexData.run() clarity: Added a comment explaining the fetch_required logic — CSV and Parquet files are read directly by the df_reader, so fetch() is only needed for formats the reader cannot handle directly.
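
To illustrate the extension-detection item, here is a sketch comparing a for/else loop with an equivalent any() expression. The function names, the extension tuple, and the case-insensitive matching are assumptions; the real `_read()` code is not shown in this PR description.

```python
PARQUET_EXTENSIONS = (".parquet", ".pq")  # hypothetical extension list


def is_parquet_loop(filename):
    """Loop-based check, in the spirit of the pattern that was replaced."""
    for ext in PARQUET_EXTENSIONS:
        if filename.lower().endswith(ext):
            found = True
            break
    else:
        # Runs only when the loop finishes without hitting break.
        found = False
    return found


def is_parquet_any(filename):
    """Same check as a single any() expression."""
    return any(filename.lower().endswith(ext) for ext in PARQUET_EXTENSIONS)


samples = ["data.parquet", "DATA.PQ", "data.csv", "parquet"]
results = {name: is_parquet_any(name) for name in samples}
```

Both versions short-circuit on the first match, so the change affects readability only.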

Test coverage

New test files

  • tests/df_reader/test_pyarrow.py: PyArrow backend tests covering parquet reads, column selection, and filter predicates.
  • tests/filestorage/test_storage_utils.py: Tests for BaseStorage.create_traceback() and AwsS3Storage._strip_signing_parameters().

New tests in existing files

  • test_read_csv.py / test_read_parquet.py: OasisReader.query(), copy_with_df(), and OasisDaskReader.read_from_dataframe().
  • test_from_dataframe.py: Passing a pandas DataFrame via dataframe= to a Dask reader.
  • test_caching.py: OasisDataManagerException backward-compat alias verification.

Test fixes

  • Dask query() test: added .compute() call for lazy scalar results (e.g. frame["D"].sum()).
  • test_complex/test_base.py: guarded dask import with pytest.importorskip so the file is skipped cleanly when Dask is not installed.
  • mypy type: ignore comments on optional-dependency fallback assignments updated to suppress the correct error codes.

Readme creation

  • The README, previously empty, now has content.

Ha-Ree changed the title from "seeing tests" to "refactor" on Mar 10, 2026
Ha-Ree linked an issue on Mar 11, 2026 that may be closed by this pull request

Development

Successfully merging this pull request may close these issues.

OasisDataManager Refactor
