
Add before/after dataset comparison with lineage tracking #47

Merged
rad1092 merged 1 commit into main from
codex/implement-dataset-fingerprint-and-lineage-logic
on Feb 15, 2026
Conversation

rad1092 (Owner) commented on Feb 15, 2026

Motivation

  • Provide basic dataset versioning and lineage capture so comparisons between dataset snapshots can be persisted and audited.
  • Enable programmatic detection of distribution drift (numeric and categorical) between two CSVs and expose this via CLI and web API for integration with workflows.

Description

  • Added bitnet_tools/versioning.py which computes a SHA256 dataset fingerprint via build_dataset_fingerprint and persists before/after lineage records under .bitnet_cache/lineage/ using save_lineage_link.
  • Added bitnet_tools/compare.py implementing compare_csv_texts and compare_csv_files that compute bucketed/numeric and categorical distributions and report psi, js_divergence, and chi_square per common column, and which stores lineage via the versioning API.
  • Extended CLI in bitnet_tools/cli.py with a new compare subcommand (--before, --after, --out) that writes the JSON comparison result using the compare helpers.
  • Added Web API endpoint POST /api/compare in bitnet_tools/web.py which accepts before and after payloads (same payload contract as other endpoints), runs the comparison, and returns the result JSON.
  • Added tests in tests/test_compare.py that validate zero-drift case, positive-drift case, CLI invocation, and API contract.
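The drift metrics named above (psi, js_divergence, chi_square) all compare two bucketed distributions. As a rough illustration of one of them, here is a generic PSI calculation over categorical probability dicts; the function name and signature are illustrative, not the repo's actual API:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two probability distributions.

    expected/actual map bucket -> probability. A small epsilon avoids
    log(0) for buckets present on only one side.
    """
    buckets = set(expected) | set(actual)
    total = 0.0
    for b in buckets:
        e = max(expected.get(b, 0.0), eps)
        a = max(actual.get(b, 0.0), eps)
        total += (a - e) * math.log(a / e)
    return total

same = {"red": 0.5, "blue": 0.5}
shifted = {"red": 0.9, "blue": 0.1}
assert psi(same, same) == 0.0      # identical distributions: zero drift
assert psi(same, shifted) > 0.0    # shifted distribution: positive drift
```

This mirrors the zero-drift and positive-drift cases the tests validate: identical inputs yield a metric of exactly zero, and any distribution shift yields a strictly positive value.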

Testing

  • Ran unit tests with pytest -q tests/test_compare.py tests/test_cli.py tests/test_web.py; all 28 tests passed.
  • Exercised CLI compare flow and web POST /api/compare via the new test cases which validated output structure and non-zero drift metrics for changed data.


rad1092 merged commit 87900f5 into main on Feb 15, 2026
4 checks passed
rad1092 deleted the codex/implement-dataset-fingerprint-and-lineage-logic branch on February 15, 2026 at 05:22

chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8810752080


Comment thread: bitnet_tools/compare.py
Comment on lines +175 to +176
before_path.read_text(encoding='utf-8'),
after_path.read_text(encoding='utf-8'),

P1: Strip BOM when loading CSV files for compare

compare_csv_files reads both files as plain utf-8, so a UTF-8 BOM is preserved in the first header (for example, \ufeffcity). If one input has a BOM and the other does not (a common Excel-export case), that first column will not match in common_columns, and drift metrics for it are silently omitted. Please decode BOM-safe (e.g., utf-8-sig) or normalize headers before comparison.
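One way to address this, assuming the loader can simply switch codecs: Python's utf-8-sig codec strips a leading BOM on read and decodes plain UTF-8 files unchanged. A minimal sketch, with read_csv_text as a hypothetical helper (not the repo's actual function):

```python
from pathlib import Path
import tempfile

def read_csv_text(path: Path) -> str:
    # 'utf-8-sig' strips a leading UTF-8 BOM if present and otherwise
    # decodes identically to 'utf-8', so header names compare cleanly.
    return path.read_text(encoding="utf-8-sig")

# Simulate the Excel-export case: writing with 'utf-8-sig' prepends a BOM.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8-sig", suffix=".csv", delete=False
) as f:
    f.write("city,population\nParis,2100000\n")
    bom_path = Path(f.name)

header = read_csv_text(bom_path).splitlines()[0]
assert header == "city,population"  # no stray \ufeff on the first column
bom_path.unlink()
```

With plain "utf-8" the same read would return "\ufeffcity,population", which is exactly the silent common_columns mismatch the comment describes.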


Comment on lines +22 to +24
lines = [line.rstrip() for line in csv_text.strip().splitlines() if line.strip()]
header = lines[0].split(',') if lines else []
row_count = max(len(lines) - 1, 0)

P2: Parse fingerprint metadata with a CSV parser

build_dataset_fingerprint computes header and row_count by manually splitting lines and commas, which breaks on valid CSV features like quoted commas and embedded newlines. That yields incorrect columns/column_count/row_count in lineage records and can change fingerprints for semantically identical datasets. Using csv.reader here would keep fingerprint metadata aligned with actual CSV semantics.
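A sketch of the suggested fix using the standard-library csv module; csv_metadata is a hypothetical helper name, not the repo's actual function:

```python
import csv
import io

def csv_metadata(csv_text: str):
    """Header and row count using real CSV semantics (quoted commas,
    embedded newlines) instead of naive line/comma splitting."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0] if rows else []
    return header, max(len(rows) - 1, 0)

# A quoted comma stays one column; a quoted newline stays one row.
text = 'city,note\n"Washington, DC","line1\nline2"\n'
header, row_count = csv_metadata(text)
assert header == ["city", "note"]
assert row_count == 1
```

Naive splitting would report three header-width columns and two data rows for this input, changing column_count and row_count in the lineage record for a semantically identical dataset.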

