Add support for environment context#3441
Conversation
838ef11 to
07587a3
Compare
abnobdoss
left a comment
There was a problem hiding this comment.
This is great! This looks very helpful for debugging.
# Rationale for this change Allow users to create a view with fewer parameters. We can set an environment context (`engine-name` and `engine-version`) in `summary` field by default once #3441 is merged. ## Are these changes tested? Yes ## Are there any user-facing changes? No <!-- In the case of user-facing changes, please add the changelog label. -->
07587a3 to
afe8bec
Compare
kevinjqliu
left a comment
There was a problem hiding this comment.
Thanks for the PR! This will be very useful to users. I got a couple of nits
| timestamp_ms: int = Field(alias="timestamp-ms", default_factory=lambda: int(time.time() * 1000)) | ||
| """Timestamp when the version was created (ms from epoch)""" | ||
| summary: dict[str, str] = Field(default_factory=dict) | ||
| summary: dict[str, str] = Field(default_factory=lambda: EnvironmentContext.get()) |
There was a problem hiding this comment.
this will add EnvironmentContext to all ViewVersion which might not be the desired behavior.
Similar to Table Metadata, can we only add EnvironmentContext in the write path?
| "total-files-size": str(file_size), | ||
| "total-position-deletes": "0", | ||
| "total-records": "3", | ||
| "engine-name": "pyiceberg", |
There was a problem hiding this comment.
Could we find a way to avoid repeating env context everywhere? Perhaps a small test helper like this would keep the exact assertions while localizing this new default behavior:
def with_environment_context(summary: dict[str, str]) -> dict[str, str]:
return {**summary, **EnvironmentContext.get()}
assert summaries[0] == with_environment_context({
"added-data-files": "3",
"added-records": "5",
...
})
| def test_default_value() -> None: | ||
| actual = EnvironmentContext.get() | ||
| assert len(actual) == 2 | ||
| assert actual["engine-name"] == "pyiceberg" | ||
| assert re.match(r"^\d+\.\d+\.\d+", actual["engine-version"]) |
There was a problem hiding this comment.
| def test_default_value() -> None: | |
| actual = EnvironmentContext.get() | |
| assert len(actual) == 2 | |
| assert actual["engine-name"] == "pyiceberg" | |
| assert re.match(r"^\d+\.\d+\.\d+", actual["engine-version"]) | |
| def test_default_value() -> None: | |
| assert EnvironmentContext.get() == { | |
| "engine-name": "pyiceberg", | |
| "engine-version": __version__, | |
| } | |
| def test_get_returns_copy() -> None: | |
| actual = EnvironmentContext.get() | |
| actual["test-key"] = "test-value" | |
| assert "test-key" not in EnvironmentContext.get() |
EnvironmentContext might have been mutated, this is a better way to test that the returned value is a copy
There was a problem hiding this comment.
+1 on this. The regex is really brittle and won't work for RC builds.
There was a problem hiding this comment.
Ugh, I was missing the RC build.
I intentionally avoided using the same logic between environment_context.py and this test. I generally don't believe it's a good idea to write tests.
| def test_put_and_remove() -> None: | ||
| EnvironmentContext.put("test-key", "test-value") | ||
| assert EnvironmentContext.get()["test-key"] == "test-value" | ||
|
|
||
| EnvironmentContext.remove("test-key") | ||
| assert "test-key" not in EnvironmentContext.get() |
There was a problem hiding this comment.
| def test_put_and_remove() -> None: | |
| EnvironmentContext.put("test-key", "test-value") | |
| assert EnvironmentContext.get()["test-key"] == "test-value" | |
| EnvironmentContext.remove("test-key") | |
| assert "test-key" not in EnvironmentContext.get() | |
| def test_put_and_remove() -> None: | |
| try: | |
| EnvironmentContext.put("test-key", "test-value") | |
| assert EnvironmentContext.get()["test-key"] == "test-value" | |
| assert EnvironmentContext.remove("test-key") == "test-value" | |
| assert "test-key" not in EnvironmentContext.get() | |
| finally: | |
| EnvironmentContext.remove("test-key") |
use try/finally to avoid mutating EnvironmentContext
| for key, value in EnvironmentContext.get().items(): | ||
| summary.__setitem__(key, value) |
There was a problem hiding this comment.
| for key, value in EnvironmentContext.get().items(): | |
| summary.__setitem__(key, value) | |
| for key, value in EnvironmentContext.get().items(): | |
| summary[key] = value |
nit
| class EnvironmentContext: | ||
| _PROPERTIES: dict[str, str] = { | ||
| "engine-name": "pyiceberg", | ||
| "engine-version": version("pyiceberg"), |
There was a problem hiding this comment.
| "engine-version": version("pyiceberg"), | |
| "engine-version": __version__, |
nit
|
This looks great! My biggest note is to remove the version regexes, since they won't work on RCs. |
Rationale for this change
Adds
EnvironmentContextutility class to track engine metadata (name and version) and automatically includes it in snapshot summaries. This provides visibility into which engine version created/modified each snapshot.Are these changes tested?
Yes
Are there any user-facing changes?
No - it's internal change