💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8810752080
    before_path.read_text(encoding='utf-8'),
    after_path.read_text(encoding='utf-8'),
Strip BOM when loading CSV files for compare
compare_csv_files reads both files as plain utf-8, so a UTF-8 BOM is preserved in the first header (for example, \ufeffcity). If one input has a BOM and the other does not (a common Excel-export case), that first column will not match in common_columns, and drift metrics for it are silently omitted. Please decode BOM-safe (e.g., utf-8-sig) or normalize headers before comparison.
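A minimal sketch of the BOM-safe decoding suggested above. The helper names read_csv_text and header_of are hypothetical and do not come from the PR; only the utf-8-sig codec and csv.reader are standard library features. It shows that a file with a BOM and one without yield the same first header once both are decoded with utf-8-sig.

```python
import csv
import io
from pathlib import Path


def read_csv_text(path: Path) -> str:
    """Read a CSV file as text, dropping a leading UTF-8 BOM if present.

    Hypothetical helper: 'utf-8-sig' decodes plain UTF-8 unchanged and
    strips the BOM when the file starts with one.
    """
    return path.read_text(encoding="utf-8-sig")


def header_of(csv_text: str) -> list[str]:
    """Return the first row parsed by csv.reader, or [] for empty input."""
    return next(csv.reader(io.StringIO(csv_text)), [])
```

With this approach, an Excel export that begins with a BOM and a plain UTF-8 export both produce city as the first column name, so the column would be included in the common-column comparison.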
    lines = [line.rstrip() for line in csv_text.strip().splitlines() if line.strip()]
    header = lines[0].split(',') if lines else []
    row_count = max(len(lines) - 1, 0)
Parse fingerprint metadata with a CSV parser
build_dataset_fingerprint computes header and row_count by manually splitting lines and commas, which breaks on valid CSV features like quoted commas and embedded newlines. That yields incorrect columns/column_count/row_count in lineage records and can change fingerprints for semantically identical datasets. Using csv.reader here would keep fingerprint metadata aligned with actual CSV semantics.
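One way the csv.reader suggestion could look is sketched below. The function name parse_csv_metadata and the returned dictionary shape are assumptions for illustration; the real build_dataset_fingerprint is not shown in this review, so this is not its implementation. The sketch also chooses to skip rows whose cells are all empty or whitespace, which approximates the blank-line filter in the quoted code.

```python
import csv
import io


def parse_csv_metadata(csv_text: str) -> dict:
    """Derive header, column count, and row count with csv.reader.

    Hypothetical helper: unlike str.split(','), csv.reader keeps quoted
    commas and quoted newlines inside a single field.
    """
    # newline="" disables newline translation so the csv module sees
    # embedded line breaks in quoted fields exactly as written.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    # Drop rows whose cells are all empty or whitespace (blank lines parse as []).
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    header = rows[0] if rows else []
    return {
        "columns": header,
        "column_count": len(header),
        "row_count": max(len(rows) - 1, 0),
    }
```

For input such as a record containing "Smith, J" or a quoted multi-line note, this counts one row and two columns, where the manual split would report extra columns or extra rows.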
Motivation
Description
- bitnet_tools/versioning.py, which computes a SHA256 dataset fingerprint via build_dataset_fingerprint and persists before/after lineage records under .bitnet_cache/lineage/ using save_lineage_link.
- bitnet_tools/compare.py, implementing compare_csv_texts and compare_csv_files, which compute bucketed numeric and categorical distributions, report psi, js_divergence, and chi_square per common column, and store lineage via the versioning API.
- bitnet_tools/cli.py, with a new compare subcommand (--before, --after, --out) that writes the JSON comparison result using the compare helpers.
- POST /api/compare in bitnet_tools/web.py, which accepts before and after payloads (same payload contract as other endpoints), runs the comparison, and returns the result JSON.
- tests/test_compare.py, with tests that validate the zero-drift case, the positive-drift case, CLI invocation, and the API contract.
Testing
- pytest -q tests/test_compare.py tests/test_cli.py tests/test_web.py: all tests passed (28 passed).
- The compare flow and the web POST /api/compare endpoint were exercised via the new test cases, which validated output structure and non-zero drift metrics for changed data.
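For readers unfamiliar with the psi and js_divergence metrics named in the Description, the sketch below shows one common way to compute them from per-bucket counts. It is not the code in bitnet_tools/compare.py, which is not shown here. The function names, the epsilon floor, and the use of natural logarithms are all assumptions. Both inputs are assumed to be non-empty lists of equal length covering the same buckets.

```python
import math


def _normalize(counts, eps=1e-6):
    """Convert bucket counts to probabilities, flooring at eps to avoid log(0)."""
    total = sum(counts)
    if total <= 0:
        # No observations: fall back to a uniform distribution (assumption).
        return [1.0 / len(counts)] * len(counts)
    probs = [max(c / total, eps) for c in counts]
    norm = sum(probs)
    return [p / norm for p in probs]


def psi(before_counts, after_counts):
    """Population Stability Index: sum((a - b) * ln(a / b)) over buckets."""
    b = _normalize(before_counts)
    a = _normalize(after_counts)
    return sum((ai - bi) * math.log(ai / bi) for ai, bi in zip(a, b))


def _kl(p, q):
    """Kullback-Leibler divergence KL(p || q) with natural logarithm."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def js_divergence(before_counts, after_counts):
    """Jensen-Shannon divergence: symmetric, and at most ln(2) with natural log."""
    b = _normalize(before_counts)
    a = _normalize(after_counts)
    m = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    return 0.5 * _kl(b, m) + 0.5 * _kl(a, m)
```

Identical count vectors give zero for both metrics, which corresponds to the zero-drift test case. Any shift in bucket proportions gives a positive value, which corresponds to the positive-drift case.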