37 changes: 37 additions & 0 deletions .github/workflows/collect-fix-commits.yml
@@ -0,0 +1,37 @@
name: Hourly sync for collecting fix commits

on:
  workflow_dispatch:
  schedule:
    - cron: '0 * * * *'

permissions:
  contents: write

jobs:
  scheduled:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install required packages
        run: pip install GitPython==3.1.46 packageurl-python==0.17.6 aboutcode.pipeline==0.2.1

      - name: Run sync
        run: python fix_commits_collector.py

      - name: Commit and push if it changed
        run: |-
          git config user.name "AboutCode Automation"
          git config user.email "automation@aboutcode.org"
          git add -A
          timestamp=$(date -u)
          git commit -m "$(echo -e "Sync Collecting Fix Commits: $timestamp\n\nSigned-off-by: AboutCode Automation <automation@aboutcode.org>")" || exit 0
          git push
40 changes: 40 additions & 0 deletions .github/workflows/collect-issues-prs.yml
@@ -0,0 +1,40 @@
name: Hourly sync for collecting issues and pull requests

on:
  workflow_dispatch:
  schedule:
    - cron: '0 * * * *'

permissions:
  contents: write

jobs:
  scheduled:
    runs-on: ubuntu-latest
    env:
      GITHUB_TOKEN: ${{ secrets.GH_API_TOKEN }}
      GITLAB_TOKEN: ${{ secrets.GLAB_API_TOKEN }}

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install required packages
        run: pip install PyGithub==2.8.1 packageurl-python==0.17.6 python-gitlab==8.1.0 aboutcode.pipeline==0.2.1

      - name: Run sync
        run: python issues_prs_collector.py

      - name: Commit and push if it changed
        run: |-
          git config user.name "AboutCode Automation"
          git config user.email "automation@aboutcode.org"
          git add -A
          timestamp=$(date -u)
          git commit -m "$(echo -e "Sync Collecting Issues and Pull requests related to vulnerabilities: $timestamp\n\nSigned-off-by: AboutCode Automation <automation@aboutcode.org>")" || exit 0
          git push
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
/.env
61 changes: 60 additions & 1 deletion README.md
@@ -1 +1,60 @@
# vulnerablecode-vcs-collector
Collect data (fix commits, issues, and PRs) related to vulnerabilities.


#### Fix commits:
To collect fix commits, we clone the target Git repository and scan every commit message for a CVE ID or a GHSA ID.

File structure:

```json
{
  "vcs_url": "https://github.com/mirror/busybox",
  "vulnerabilities": {
    "CVE-2023-42363": {
      "fb08d43d44d1fea1f741fafb9aa7e1958a5f69aa": "awk: fix use after free (CVE-2023-42363)\n\nfunction old new delta\nevaluate 3377 3385 +8\n\nFixes https://bugs.busybox.net/show_bug.cgi?id=15865\n\nSigned-off-by: Natanael Copa <ncopa@alpinelinux.org>\nSigned-off-by: Denys Vlasenko <vda.linux@googlemail.com>"
    }
  }
}
```
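
As a rough illustration of the matching step, the sketch below applies the same two regular expressions used in `fix_commits_collector.py` to a single commit message. The function name `extract_vulnerability_ids` and the de-duplication step are assumptions made for this example; the collector itself keeps every match and upper-cases it later.

```python
import re

# Patterns copied from the collector in this change.
PATTERNS = [
    r"\bCVE-\d{4}-\d{4,19}\b",
    r"GHSA-[2-9cfghjmpqrvwx]{4}-[2-9cfghjmpqrvwx]{4}-[2-9cfghjmpqrvwx]{4}",
]


def extract_vulnerability_ids(message: str) -> list[str]:
    """Return upper-cased vulnerability IDs found in a commit message."""
    found = []
    for pattern in PATTERNS:
        found.extend(re.findall(pattern, message, flags=re.IGNORECASE))
    # Assumption for this sketch: normalize case and drop repeated IDs,
    # keeping the order in which they were first seen.
    return list(dict.fromkeys(match.upper() for match in found))


message = "awk: fix use after free (CVE-2023-42363)"
print(extract_vulnerability_ids(message))  # ['CVE-2023-42363']
```

A commit message with no CVE or GHSA ID yields an empty list, which is the case the collector skips.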

#### Issues and PRs:
To collect issues and pull requests, we use the GitHub and GitLab APIs to run a quick search for `CVE-`.

File structure:

```json
{
  "vcs_url": "https://github.com/python/cpython",
  "vulnerabilities": {
    "CVE-2026-2297": {
      "Issues": [
        "https://github.com/python/cpython/issues/145506"
      ],
      "PRs": [
        "https://github.com/python/cpython/pull/145514",
        "https://github.com/python/cpython/pull/145516",
        "https://github.com/python/cpython/pull/145515",
        "https://github.com/python/cpython/pull/145507",
        "https://github.com/python/cpython/pull/145512",
        "https://github.com/python/cpython/pull/145513"
      ]
    }
  }
}
```
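
The grouping shown above could be built along the following lines. This is only a sketch: `issues_prs_collector.py` is not part of the visible diff, so the function `group_by_cve`, the `(kind, title, url)` input shape, and the placeholder titles are all assumptions, and no API call is made here.

```python
import json
import re
from collections import defaultdict

CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,19}\b", flags=re.IGNORECASE)


def group_by_cve(items):
    """
    Group search results by CVE ID.

    `items` is a hypothetical iterable of (kind, title, url) tuples,
    where kind is either "Issues" or "PRs".
    """
    grouped = defaultdict(lambda: {"Issues": [], "PRs": []})
    for kind, title, url in items:
        for cve_id in CVE_PATTERN.findall(title):
            bucket = grouped[cve_id.upper()][kind]
            if url not in bucket:  # avoid listing the same URL twice
                bucket.append(url)
    return dict(grouped)


# Placeholder titles; only the URLs come from the example above.
items = [
    ("Issues", "placeholder title mentioning CVE-2026-2297",
     "https://github.com/python/cpython/issues/145506"),
    ("PRs", "placeholder title mentioning cve-2026-2297",
     "https://github.com/python/cpython/pull/145507"),
]
print(json.dumps(group_by_cve(items), indent=2))
```

Upper-casing the ID before using it as a key keeps `cve-…` and `CVE-…` spellings in the same bucket.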

### File Naming
The results are stored in a JSON file named `{repo_name}-{repo_url_hash}.json`, for example `nginx-9251c307.json`.

**Note:** `repo_url_hash` is the first 8 hexadecimal characters of the `SHA-256` hash of the repository URL.
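
A minimal sketch of the naming rule, assuming the URL is hashed as UTF-8 text exactly as listed in the config file. The helper name `result_file_name` is made up for this example, and the repository name is passed in directly, whereas the collector derives it from the URL with `url2purl`.

```python
import hashlib


def result_file_name(repo_name: str, repo_url: str) -> str:
    """Build `{repo_name}-{repo_url_hash}.json` for a repository URL."""
    # First 8 hex characters of the SHA-256 digest of the URL.
    url_hash = hashlib.sha256(repo_url.encode("utf-8")).hexdigest()[:8]
    return f"{repo_name}-{url_hash}.json"


print(result_file_name("nginx", "https://github.com/nginx/nginx"))
```

Because the hash depends on the exact URL string, a trailing slash or a `.git` suffix would produce a different file name.
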
## Usage

To get started, clone the repository:

```bash
git clone https://github.com/aboutcode-data/vulnerablecode-vcs-collector.git
```


Once cloned, you can find the existing data in the `data/fix-commits` and `data/issues-prs` directories.
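
As one way to read the collected data, the sketch below walks `data/fix-commits` and counts the commits recorded for each vulnerability. The helper `summarize_fix_commits` is hypothetical and assumes every file follows the fix-commits structure shown earlier.

```python
import json
from pathlib import Path


def summarize_fix_commits(data_dir):
    """Map each repository URL to {vulnerability ID: number of commits}."""
    root = Path(data_dir)
    if not root.is_dir():
        return {}

    summary = {}
    for path in sorted(root.glob("*.json")):
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        summary[record["vcs_url"]] = {
            vuln_id: len(commits)
            for vuln_id, commits in record["vulnerabilities"].items()
        }
    return summary


# Run from the root of the cloned repository.
print(json.dumps(summarize_fix_commits("data/fix-commits"), indent=2))
```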
39 changes: 39 additions & 0 deletions config/fix_commits_targets.json
@@ -0,0 +1,39 @@
[
"https://github.com/torvalds/linux",
"https://github.com/mirror/busybox",
"https://github.com/nginx/nginx",
"https://github.com/apache/tomcat",
"https://github.com/mysql/mysql-server",
"https://github.com/postgres/postgres",
"https://github.com/mongodb/mongo",
"https://github.com/redis/redis",
"https://github.com/sqlite/sqlite",
"https://github.com/php/php-src",
"https://github.com/python/cpython",
"https://github.com/ruby/ruby",
"https://github.com/golang/go",
"https://github.com/nodejs/node",
"https://github.com/rust-lang/rust",
"https://github.com/openjdk/jdk",
"https://github.com/swiftlang/swift",
"https://github.com/django/django",
"https://github.com/rails/rails",
"https://github.com/laravel/framework",
"https://github.com/spring-projects/spring-framework",
"https://github.com/facebook/react",
"https://github.com/angular/angular",
"https://github.com/WordPress/WordPress",
"https://github.com/moby/moby",
"https://github.com/kubernetes/kubernetes",
"https://gitlab.com/qemu-project/qemu",
"https://github.com/xen-project/xen",
"https://github.com/mirror/vbox",
"https://github.com/containerd/containerd",
"https://github.com/ansible/ansible",
"https://github.com/hashicorp/terraform",
"https://gitlab.com/wireshark/wireshark",
"https://github.com/the-tcpdump-group/tcpdump",
"https://github.com/git/git",
"https://github.com/jenkinsci/jenkins",
"https://gitlab.com/gitlab-org/gitlab-foss"
]
31 changes: 31 additions & 0 deletions config/issues_prs_targets.json
@@ -0,0 +1,31 @@
[
"https://github.com/mirror/busybox",
"https://github.com/nginx/nginx",
"https://github.com/apache/tomcat",
"https://github.com/mongodb/mongo",
"https://github.com/redis/redis",
"https://github.com/php/php-src",
"https://github.com/python/cpython",
"https://github.com/ruby/ruby",
"https://github.com/golang/go",
"https://github.com/nodejs/node",
"https://github.com/rust-lang/rust",
"https://github.com/openjdk/jdk",
"https://github.com/swiftlang/swift",
"https://github.com/django/django",
"https://github.com/rails/rails",
"https://github.com/laravel/framework",
"https://github.com/spring-projects/spring-framework",
"https://github.com/facebook/react",
"https://github.com/angular/angular",
"https://github.com/moby/moby",
"https://github.com/kubernetes/kubernetes",
"https://github.com/containerd/containerd",
"https://github.com/ansible/ansible",
"https://github.com/hashicorp/terraform",
"https://github.com/the-tcpdump-group/tcpdump",
"https://github.com/jenkinsci/jenkins",
"https://gitlab.com/gitlab-org/gitlab-foss",
"https://gitlab.com/wireshark/wireshark",
"https://gitlab.com/qemu-project/qemu"
]
143 changes: 143 additions & 0 deletions fix_commits_collector.py
@@ -0,0 +1,143 @@
#
# Copyright (c) nexB Inc. and others. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# See http://www.apache.org/licenses/LICENSE-2.0 for the license text.
# See https://aboutcode.org for more information about nexB OSS projects.
#

import hashlib
import json
import re
import shutil
import sys
import tempfile
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path

from aboutcode.pipeline import BasePipeline, LoopProgress
from git import Repo
from packageurl.contrib.url2purl import url2purl


class CollectVCSFixCommitPipeline(BasePipeline):
    """
    Pipeline to collect fix commits from any git repository.
    """

    vcs_url: str
    patterns: list[str] = [
        r"\bCVE-\d{4}-\d{4,19}\b",
        r"GHSA-[2-9cfghjmpqrvwx]{4}-[2-9cfghjmpqrvwx]{4}-[2-9cfghjmpqrvwx]{4}",
    ]

    def __init__(self, vcs_url: str, *args, **kwargs):
        self.vcs_url = vcs_url
        super().__init__(*args, **kwargs)

    @classmethod
    def steps(cls):
        return (
            cls.clone,
            cls.collect_fix_commits,
            cls.store_items,
            cls.clean_downloads,
        )

    def log(self, message):
        now_local = datetime.now(timezone.utc).astimezone()
        timestamp = now_local.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
        message = f"{timestamp} {message}"
        print(message)

    def clone(self):
        """Clone the repository."""
        self.repo = Repo.clone_from(
            url=self.vcs_url,
            to_path=tempfile.mkdtemp(),
            bare=True,
            no_checkout=True,
            multi_options=["--filter=blob:none"],
        )

    def extract_vulnerability_id(self, commit) -> list[str]:
        """
        Extract vulnerability IDs from a commit message and return a list of the matched IDs.
        """
        matches = []
        for pattern in self.patterns:
            found = re.findall(pattern, commit.message, flags=re.IGNORECASE)
            matches.extend(found)
        return matches

    def collect_fix_commits(self):
        """
        Iterate through repository commits and group them by vulnerability identifiers.
        """
        self.log(
            "Processing git repository fix commits (grouped by vulnerability IDs)."
        )

        self.collected_items = {
            "vcs_url": self.vcs_url,
            "vulnerabilities": defaultdict(dict),
        }

        for commit in self.repo.iter_commits("--all"):
            matched_ids = self.extract_vulnerability_id(commit)
            if not matched_ids:
                continue

            commit_id = commit.hexsha
            commit_message = commit.message.strip()

            for vuln_id in matched_ids:
                vuln_id = vuln_id.upper()
                self.collected_items["vulnerabilities"][vuln_id][
                    commit_id
                ] = commit_message

        # Count the vulnerability IDs, not the top-level keys of collected_items.
        self.log(
            f"Found {len(self.collected_items['vulnerabilities'])} vulnerabilities with related commits."
        )
        self.log("Finished processing all commits.")
        return self.collected_items

    def store_items(self):
        """Store the collected fix commits for this repository."""
        self.log("Storing collected fix commits")
        purl = url2purl(self.vcs_url)

        if not (purl and purl.name) or not self.collected_items.get("vulnerabilities"):
            self.log("Nothing to store for collected fix commits")
            return

        vcs_url_hash = hashlib.sha256(self.vcs_url.encode("utf-8")).hexdigest()[:8]
        path = Path(f"data/fix-commits/{purl.name}-{vcs_url_hash}.json")
        path.parent.mkdir(parents=True, exist_ok=True)

        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.collected_items, f, indent=2)
        return

    def clean_downloads(self):
        """Clean up any temporary repository data."""
        self.log("Cleaning up local repository resources")
        if hasattr(self, "repo") and self.repo.working_dir:
            shutil.rmtree(path=self.repo.working_dir)


if __name__ == "__main__":
    with open("config/fix_commits_targets.json") as f:
        vcs_urls = json.load(f)

    progress = LoopProgress(
        total_iterations=len(vcs_urls),
        logger=print,
    )

    for vcs_url in progress.iter(vcs_urls):
        status_code, error_msg = CollectVCSFixCommitPipeline(vcs_url=vcs_url).execute()
        print(error_msg)

    sys.exit(0)