
Make GCS filesystem lookup lazy to match S3 behavior (fixes #37445) #37674

Open

MansiSingh17 wants to merge 1 commit into apache:master from MansiSingh17:fix/gcs-filesystem-lazy-load-37445

Conversation

@MansiSingh17
Contributor

Make GCS filesystem lookup lazy to match S3 behavior (fixes #37445)
FileSystems.get_filesystem() previously raised a ValueError for gs:// paths when GCP dependencies were not installed, while s3:// paths returned a filesystem object and deferred dependency validation until usage time. This PR aligns GCS behavior with S3.
Changes:

  • Removed the top-level "from apache_beam.io.gcp import gcsio" import, replacing it with a try/except ImportError for backward compatibility with existing mocks
  • Made the _gcsIO() method lazy-load gcsio at call time
  • Converted CHUNK_SIZE from a class-level variable to a lazy @property so it doesn't require GCP dependencies at import time
  • Changed report_lineage to import gcsio lazily
  • Added test_get_filesystem_does_not_require_gcp_extra to verify the fix
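A minimal sketch of the pattern these bullets describe (simplified and illustrative; the class body, attribute names, and error messages here are stand-ins, not the actual Beam source):

```python
import importlib


def _try_import(name):
  # Return the module if available, else None (mirrors the try/except
  # ImportError added at the top of gcsfilesystem.py).
  try:
    return importlib.import_module(name)
  except ImportError:
    return None


# None when the gcp extra is not installed; the top-level name is kept
# so existing test mocks that patch it keep working.
gcsio = _try_import('apache_beam.io.gcp.gcsio')


class GCSFileSystem:
  """Simplified stand-in for apache_beam.io.gcp.gcsfilesystem.GCSFileSystem."""
  def __init__(self, pipeline_options=None):
    self._pipeline_options = pipeline_options

  @classmethod
  def scheme(cls):
    return 'gs'

  @property
  def CHUNK_SIZE(self):
    # Lazy: GCP dependencies are only required when the value is read.
    if gcsio is None:
      raise ImportError('Missing GCP dependencies: install apache-beam[gcp]')
    return gcsio.DEFAULT_READ_BUFFER_SIZE  # illustrative attribute

  def _gcsIO(self):
    # Import deferred to call time, matching S3's usage-time validation.
    mod = gcsio or importlib.import_module('apache_beam.io.gcp.gcsio')
    return mod.GcsIO(pipeline_options=self._pipeline_options)
```

With this shape, constructing the filesystem object succeeds without the gcp extra; the ImportError surfaces only when GCS IO or CHUNK_SIZE is actually used.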

Testing:
pytest -q sdks/python/apache_beam/io/gcp/gcsfilesystem_test.py
18 passed, 1 failed (test_lineage fails locally due to missing google.api_core; it was already skipped on master before this change)

fixes #37445

- Remove top-level gcsio import, replace with try/except for backward compat
- Add _get_gcsio() lazy import in _gcsIO() method
- Convert CHUNK_SIZE from class variable to lazy property
- Add test verifying get_filesystem() works without GCP extras installed
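The behavioral change the commit aims for can be illustrated with a toy scheme registry (all names here are hypothetical; Beam's actual FileSystems registry works differently): looking up a gs:// path now succeeds, and the dependency error is deferred until the filesystem is used.

```python
# Toy scheme registry; illustrative only, not Beam's FileSystems API.
_SCHEMES = {}


def register(scheme, factory):
  _SCHEMES[scheme] = factory


def get_filesystem(path):
  scheme = path.split('://', 1)[0]
  if scheme not in _SCHEMES:
    raise ValueError('Unable to get filesystem from specified path %s' % path)
  # Lazy style: constructing the filesystem does not import optional deps.
  return _SCHEMES[scheme]()


class LazyGcsFileSystem:
  def open(self, path):
    # Dependency validated only here, at usage time.
    raise ImportError('Missing GCP dependencies: install apache-beam[gcp]')


register('gs', LazyGcsFileSystem)

fs = get_filesystem('gs://bucket/object')  # succeeds without GCP extras
```

This is the S3 behavior the PR description references: lookup returns a filesystem object, and only an actual IO call raises.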
@gemini-code-assist
Contributor

Summary of Changes

Hello @MansiSingh17, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the GCS filesystem implementation to defer the loading of GCP-specific dependencies. The primary goal is to prevent ValueError exceptions when gs:// paths are encountered without GCP dependencies installed, mirroring the existing lazy dependency validation behavior of S3 paths. This change improves the user experience by allowing GCS filesystem objects to be instantiated without immediate dependency checks, postponing validation until actual usage.

Highlights

  • Lazy GCS Dependency Loading: The gcsio module import in GCSFileSystem has been made lazy, deferring its loading until it's actually needed, aligning GCS behavior with S3.
  • Lazy CHUNK_SIZE Property: The CHUNK_SIZE variable was converted into a lazy @property to prevent requiring GCP dependencies at import time.
  • Lazy _gcsIO Method: The _gcsIO() method now lazily loads gcsio at call time, ensuring dependencies are only checked when the GCS IO operations are performed.
  • Lazy report_lineage Import: The report_lineage method was updated to use a lazy import of gcsio for parsing GCS paths.
  • New Test Case: A new test, test_get_filesystem_does_not_require_gcp_extra, was added to verify that GCS filesystem lookup does not immediately require GCP dependencies.


Changelog
  • sdks/python/apache_beam/io/gcp/gcsfilesystem.py
    • Modified the top-level gcsio import to be conditional, wrapping it in a try/except ImportError block.
    • Converted CHUNK_SIZE from a class-level variable to a lazy @property to avoid early dependency loading.
    • Updated the _gcsIO method to perform a lazy import of gcsio if it hasn't been loaded yet.
    • Adjusted the report_lineage method to include a local, lazy import of gcsio for parse_gcs_path.
  • sdks/python/apache_beam/io/gcp/gcsfilesystem_test.py
    • Added test_get_filesystem_does_not_require_gcp_extra to confirm that GCS filesystem lookup does not require GCP dependencies at lookup time.
Activity
  • The author ran pytest -q sdks/python/apache_beam/io/gcp/gcsfilesystem_test.py.
  • The tests resulted in 18 passed and 1 failed, with the failure (test_lineage) noted as already being skipped on master due to a missing google.api_core dependency.

@github-actions
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment "assign set of reviewers".

@shunping
Collaborator

Could you fix "Precommit Python" tests please?

self.assertEqual(self.fs.scheme(), 'gs')
self.assertEqual(gcsfilesystem.GCSFileSystem.scheme(), 'gs')

def test_get_filesystem_does_not_require_gcp_extra(self):
Collaborator

If this test is run in a Python environment with the gcp extra installed, it will always pass.

To really test it, I think we need to mock the behavior of not having the gcp extra (i.e., not being able to import gcsio) and ensure get_filesystem() can run without it.
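A sketch of what that mock could look like, using unittest.mock to block the import even when the gcp extra happens to be installed (the GCSFileSystem here is a toy stand-in; the real test would exercise Beam's actual classes):

```python
import builtins
from unittest import mock


class GCSFileSystem:
  """Toy stand-in that defers the gcsio import to call time."""
  @classmethod
  def scheme(cls):
    return 'gs'

  def _gcsIO(self):
    from apache_beam.io.gcp import gcsio  # deferred; may raise ImportError
    return gcsio.GcsIO()


_real_import = builtins.__import__


def _no_beam(name, *args, **kwargs):
  # Simulate an environment without the gcp extra (or Beam at all).
  if name.startswith('apache_beam'):
    raise ImportError(name)
  return _real_import(name, *args, **kwargs)


with mock.patch.object(builtins, '__import__', side_effect=_no_beam):
  fs = GCSFileSystem()  # lookup/construction must still work
  usage_error = None
  try:
    fs._gcsIO()         # the dependency check happens only here
  except ImportError as e:
    usage_error = e
```

Blocking `__import__` this way makes the test meaningful in both environments: with or without the gcp extra installed, construction succeeds and only usage raises.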

  def _gcsIO(self):
-    return gcsio.GcsIO(pipeline_options=self._pipeline_options)
+    if gcsio is None:
+      from apache_beam.io.gcp import gcsio as _gcsio  # pylint: disable=g-import-not-at-top
Collaborator

Could you explain why we need to try importing gcsio again if the import at the beginning of the module fails?

  def report_lineage(self, path, lineage):
    try:
-      components = gcsio.parse_gcs_path(path, object_optional=True)
+      from apache_beam.io.gcp import gcsio as _gcsio  # pylint: disable=g-import-not-at-top
Collaborator

Same here. See the previous comment.


Development

Successfully merging this pull request may close these issues.

GCSFileSystem requires gcp extra at lookup time while S3FileSystem does not

2 participants