
Add before/after dataset comparison with lineage tracking #47

Merged
rad1092 merged 1 commit into main from
codex/implement-dataset-fingerprint-and-lineage-logic
on Feb 15, 2026
Conversation

rad1092 (Owner) commented on Feb 15, 2026

Motivation

  • Provide basic dataset versioning and lineage capture so comparisons between dataset snapshots can be persisted and audited.
  • Enable programmatic detection of distribution drift (numeric and categorical) between two CSVs and expose this via CLI and web API for integration with workflows.

Description

  • Added bitnet_tools/versioning.py which computes a SHA256 dataset fingerprint via build_dataset_fingerprint and persists before/after lineage records under .bitnet_cache/lineage/ using save_lineage_link.
  • Added bitnet_tools/compare.py implementing compare_csv_texts and compare_csv_files that compute bucketed/numeric and categorical distributions and report psi, js_divergence, and chi_square per common column, and which stores lineage via the versioning API.
  • Extended CLI in bitnet_tools/cli.py with a new compare subcommand (--before, --after, --out) that writes the JSON comparison result using the compare helpers.
  • Added Web API endpoint POST /api/compare in bitnet_tools/web.py which accepts before and after payloads (same payload contract as other endpoints), runs the comparison, and returns the result JSON.
  • Added tests in tests/test_compare.py that validate zero-drift case, positive-drift case, CLI invocation, and API contract.
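The drift metrics named above (psi, js_divergence, chi_square) all compare two bucketed distributions. As a rough illustration of one of them, here is a generic PSI calculation over categorical probability dicts; the function name and signature are illustrative, not the repo's actual API:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two probability distributions.

    expected/actual map bucket -> probability. A small epsilon avoids
    log(0) for buckets present on only one side.
    """
    buckets = set(expected) | set(actual)
    total = 0.0
    for b in buckets:
        e = max(expected.get(b, 0.0), eps)
        a = max(actual.get(b, 0.0), eps)
        total += (a - e) * math.log(a / e)
    return total

same = {"red": 0.5, "blue": 0.5}
shifted = {"red": 0.9, "blue": 0.1}
assert psi(same, same) == 0.0      # identical distributions: zero drift
assert psi(same, shifted) > 0.0    # shifted distribution: positive drift
```

This mirrors the zero-drift and positive-drift cases the tests validate: identical inputs yield a metric of exactly zero, and any distribution shift yields a strictly positive value.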

Testing

  • Ran unit tests with pytest -q tests/test_compare.py tests/test_cli.py tests/test_web.py; all 28 tests passed.
  • Exercised CLI compare flow and web POST /api/compare via the new test cases which validated output structure and non-zero drift metrics for changed data.


rad1092 merged commit 87900f5 into main on Feb 15, 2026
4 checks passed
rad1092 deleted the codex/implement-dataset-fingerprint-and-lineage-logic branch on February 15, 2026 at 05:22

chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8810752080


Comment thread: bitnet_tools/compare.py
Comment on lines +175 to +176
before_path.read_text(encoding='utf-8'),
after_path.read_text(encoding='utf-8'),

P1: Strip BOM when loading CSV files for compare

compare_csv_files reads both files as plain utf-8, so a UTF-8 BOM is preserved in the first header (for example, \ufeffcity). If one input has a BOM and the other does not (a common Excel-export case), that first column will not match in common_columns, and drift metrics for it are silently omitted. Please decode BOM-safe (e.g., utf-8-sig) or normalize headers before comparison.
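One way to address this, assuming the loader can simply switch codecs: Python's utf-8-sig codec strips a leading BOM on read and decodes plain UTF-8 files unchanged. A minimal sketch, with read_csv_text as a hypothetical helper (not the repo's actual function):

```python
from pathlib import Path
import tempfile

def read_csv_text(path: Path) -> str:
    # 'utf-8-sig' strips a leading UTF-8 BOM if present and otherwise
    # decodes identically to 'utf-8', so header names compare cleanly.
    return path.read_text(encoding="utf-8-sig")

# Simulate the Excel-export case: writing with 'utf-8-sig' prepends a BOM.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8-sig", suffix=".csv", delete=False
) as f:
    f.write("city,population\nParis,2100000\n")
    bom_path = Path(f.name)

header = read_csv_text(bom_path).splitlines()[0]
assert header == "city,population"  # no stray \ufeff on the first column
bom_path.unlink()
```

With plain "utf-8" the same read would return "\ufeffcity,population", which is exactly the silent common_columns mismatch the comment describes.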


Comment on lines +22 to +24
lines = [line.rstrip() for line in csv_text.strip().splitlines() if line.strip()]
header = lines[0].split(',') if lines else []
row_count = max(len(lines) - 1, 0)

P2: Parse fingerprint metadata with a CSV parser

build_dataset_fingerprint computes header and row_count by manually splitting lines and commas, which breaks on valid CSV features like quoted commas and embedded newlines. That yields incorrect columns/column_count/row_count in lineage records and can change fingerprints for semantically identical datasets. Using csv.reader here would keep fingerprint metadata aligned with actual CSV semantics.
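A sketch of the suggested fix using the standard-library csv module; csv_metadata is a hypothetical helper name, not the repo's actual function:

```python
import csv
import io

def csv_metadata(csv_text: str):
    """Header and row count using real CSV semantics (quoted commas,
    embedded newlines) instead of naive line/comma splitting."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0] if rows else []
    return header, max(len(rows) - 1, 0)

# A quoted comma stays one column; a quoted newline stays one row.
text = 'city,note\n"Washington, DC","line1\nline2"\n'
header, row_count = csv_metadata(text)
assert header == ["city", "note"]
assert row_count == 1
```

Naive splitting would report three header-width columns and two data rows for this input, changing column_count and row_count in the lineage record for a semantically identical dataset.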

