PepSeqPred is a residue-level epitope prediction pipeline for protein workflows.
It converts upstream assay and sequence data into training-ready artifacts, trains feed-forward neural network models on ESM-2 embeddings, produces binary residue masks for downstream inference, and supports post-training evaluation workflows on held-out datasets.
PepSeqPred can be installed via:
```shell
pip install pepseqpred
```

This API allows you to use any of the pretrained models in your own code, or load your own model(s) for downstream predictions.
Example usage:
```python
from pepseqpred import load_predictor

predictor = load_predictor("path/to/model.pt", device="cuda")
result = predictor.predict_sequence("ACDEFGHIKLMNP")
print(result.binary_mask, result.n_epitopes)
```

Disclaimer: The API is a simplified layer over the overall codebase so you can build epitope inference into your existing or new workflows. For more control over exactly how PepSeqPred works, feel free to fork this repository; be sure to adhere to the GPL-3.0 license.
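As a sketch of how downstream code might consume a predicted mask, the stdlib-only helper below (hypothetical, not part of the PepSeqPred API) converts a binary residue mask into half-open `(start, end)` epitope intervals:

```python
def mask_to_intervals(mask: str) -> list[tuple[int, int]]:
    """Convert a binary residue mask (e.g. '0011100') into half-open
    (start, end) intervals of predicted epitope residues."""
    intervals = []
    start = None
    for i, ch in enumerate(mask):
        if ch == "1" and start is None:
            start = i  # interval opens at first '1'
        elif ch != "1" and start is not None:
            intervals.append((start, i))  # interval closes before this '0'
            start = None
    if start is not None:
        intervals.append((start, len(mask)))  # mask ended inside an interval
    return intervals

print(mask_to_intervals("0011100110"))  # [(2, 5), (7, 9)]
```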
PepSeqPred is designed for research workflows where you need to:
- map peptide-level signals to residue-level supervision,
- train reproducible epitope prediction models,
- run large jobs on HPCs with DistributedDataParallel (DDP),
- generate per-residue binary epitope predictions to develop new peptide libraries,
- evaluate trained checkpoints or ensemble manifests with residue-level and peptide-level metrics.
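The first bullet, mapping peptide-level signals to residue-level supervision, can be illustrated with a toy stdlib sketch (the coverage threshold and coordinate convention here are illustrative assumptions, not PepSeqPred's actual labeling rules):

```python
def peptides_to_residue_labels(protein_len: int,
                               peptides: list[tuple[int, int]],
                               min_hits: int = 2) -> str:
    """Toy sketch: mark a residue '1' (epitope) if it is covered by at
    least `min_hits` reactive peptides, else '0'. `peptides` holds
    half-open (start, end) coordinates of reactive peptides."""
    coverage = [0] * protein_len
    for start, end in peptides:
        for i in range(start, min(end, protein_len)):
            coverage[i] += 1  # count overlapping reactive peptides
    return "".join("1" if c >= min_hits else "0" for c in coverage)

print(peptides_to_residue_labels(10, [(0, 5), (3, 8)]))  # 0001100000
```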
The pipeline is built around CLI entrypoints in src/pepseqpred/apps/ and matching HPC scripts in scripts/hpc/.
Typical inputs:
- Metadata and reactivity tables (TSV) for preprocessing and label generation.
- Protein FASTA files for embedding generation and prediction.
- Optional metadata file for ID-family embedding key generation.
Typical outputs:
- Preprocessed model input table (TSV).
- Per-protein ESM-2 embedding shards (`.pt`) plus embedding index CSV.
- Residue-level label shards (`.pt`) including epitope / uncertain / non-epitope supervision.
- FFNN checkpoints and run summaries (for standard training or Optuna tuning).
- Predicted binary epitope mask FASTA files for inference.
- FFNN evaluation summary JSON and optional per-peptide comparison CSV/JSON artifacts.
1. Preprocess data -> cleaned + labeled metadata TSV for downstream steps.
2. Generate ESM-2 embeddings -> per-protein embedding `.pt` files (+ index CSV), often sharded.
3. Generate residue-level labels -> label shard `.pt` files aligned to embedding keys/shards.
4. Train FFNN (DDP) -> checkpoints + metrics artifacts.
5. Predict epitopes -> output FASTA with binary residue mask predictions.
6. Evaluate trained FFNN -> residue-level metrics JSON and optional peptide-level comparison outputs.

Optional: Optuna hyperparameter tuning -> study storage + trial CSV + best-model artifacts.
This repository supports both:
- local development and validation of pipeline logic, and
- production-scale HPC execution for embedding generation, training, and tuning.
- Use GitHub issues for normal development questions, bug reports, and feature requests.
- Use email for private or sensitive matters that should not be posted publicly.
- Maintainer contact: Jeffrey Hoelzel or Jason Ladner.
Software:
- At least Python `3.12` (required by project configuration and CI).
- `pip` and virtual environments (`venv`) or `conda`.
- `git` for cloning and contribution workflows.
Platform notes:
- Local development can run on CPU.
- GPU is highly recommended for embedding generation and required for practical training/tuning runtimes.
- HPC scripts in `scripts/hpc/` assume a SLURM-style environment and module-based setup (for example, `anaconda3` and `cuda`).
Local CPU-oriented environment:

```shell
conda env create -f envs/environment.local.yml
conda activate pepseqpred
pip install -e .[dev]
```

HPC/GPU-oriented environment:

```shell
conda env create -f envs/environment.hpc.yml
conda activate pepseqpred
pip install -e .[dev]
```

Linux/macOS:

```shell
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -e .[dev]
```

Windows PowerShell:

```shell
python -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -e .[dev]
```

CI-equivalent install shortcut:
```shell
pip install -r requirements.txt
```

Confirm package and CLI entrypoints:

```shell
python -c "import pepseqpred; print('pepseqpred import ok')"
pepseqpred-preprocess --help
pepseqpred-esm --help
pepseqpred-train-ffnn --help
pepseqpred-predict --help
pepseqpred-eval-ffnn --help
```

Run required preflight checks before any development or pipeline usage:

```shell
ruff check .
pytest -m "unit or integration or e2e"
```

This repository expects all of the checks above to pass before you start development work or run pipeline stages.
The HPC shell scripts in scripts/hpc/ execute zipapp files such as esm.pyz, labels.pyz, train_ffnn.pyz, predict.pyz, and evaluate_ffnn.pyz.
Build only the app(s) you need for your current stage, then copy those .pyz files plus the matching SLURM script(s) to HPC.
List available zipapp targets:
```shell
python scripts/tools/buildpyz.py --list
```

Build one runtime app (recommended default):

```shell
python scripts/tools/buildpyz.py <app_name>
```

Examples:

```shell
python scripts/tools/buildpyz.py esm
python scripts/tools/buildpyz.py train_ffnn
```

Optional: build all apps:

```shell
python scripts/tools/buildpyz.py all
```

By default this writes artifacts to dist/ as:
- versioned files: `<name>_<gitrev>.pyz`
- convenience copies: `<name>_latest.pyz`
Each HPC script expects a plain .pyz filename in the same working directory:
- `generateembeddings.sh` -> `esm.pyz`
- `generatelabels.sh` -> `labels.pyz`
- `trainffnn.sh` -> `train_ffnn.pyz`
- `trainffnnoptuna.sh` -> `train_ffnn_optuna.pyz`
- `predictepitope.sh` -> `predict.pyz`
- `evaluateffnn.sh` -> `evaluate_ffnn.pyz` (or fallback module import)
- `preprocessdata.sh` -> `preprocess.pyz` (optional)
For the Cocci evaluation workflow in evaluateffnn.sh, also transfer:
- `scripts/tools/cocci_eval_pipeline.py` (called directly by the shell script for the `prepare` and `compare` stages)
Example: transfer only embedding stage artifacts:

```shell
scp dist/esm_latest.pyz <user>@<cluster-host>:/home/<user>/pepseqpred_hpc/esm.pyz
scp scripts/hpc/generateembeddings.sh <user>@<cluster-host>:/home/<user>/pepseqpred_hpc/
```

Example: transfer multiple stage artifacts:

```shell
scp dist/labels_latest.pyz <user>@<cluster-host>:/home/<user>/pepseqpred_hpc/labels.pyz
scp dist/train_ffnn_latest.pyz <user>@<cluster-host>:/home/<user>/pepseqpred_hpc/train_ffnn.pyz
scp scripts/hpc/generatelabels.sh scripts/hpc/trainffnn.sh <user>@<cluster-host>:/home/<user>/pepseqpred_hpc/
```

Example: transfer evaluation artifacts:

```shell
scp dist/esm_latest.pyz <user>@<cluster-host>:/home/<user>/pepseqpred_hpc/esm.pyz
scp dist/labels_latest.pyz <user>@<cluster-host>:/home/<user>/pepseqpred_hpc/labels.pyz
scp dist/predict_latest.pyz <user>@<cluster-host>:/home/<user>/pepseqpred_hpc/predict.pyz
scp dist/evaluate_ffnn_latest.pyz <user>@<cluster-host>:/home/<user>/pepseqpred_hpc/evaluate_ffnn.pyz
scp scripts/hpc/evaluateffnn.sh scripts/tools/cocci_eval_pipeline.py <user>@<cluster-host>:/home/<user>/pepseqpred_hpc/
```

On the cluster:

```shell
cd /home/<user>/pepseqpred_hpc
chmod +x *.sh
ls -lh *.pyz *.sh
# Run help checks for the app(s) you uploaded
python3 esm.pyz --help
```

Run SLURM jobs from this directory so relative .pyz filenames resolve correctly.
Any CLI code change requires rebuilding and redeploying the corresponding .pyz files:
- Re-run `python scripts/tools/buildpyz.py <app_name>`.
- Re-transfer the updated `dist/<app>_latest.pyz` to HPC as `<app>.pyz`.
- Re-run jobs using the updated artifact.
Run the main pipeline in this order:
- Preprocess metadata/reactivity data.
- Generate ESM-2 embeddings.
- Generate residue-level labels.
- Train FFNN model.
- Predict epitopes on new FASTA input.
- Evaluate FFNN outputs on labeled embeddings and/or Cocci reduced subsets.
Optional branch:
- Run Optuna tuning after label generation instead of fixed-parameter FFNN training.
| Stage | Hardware target | Default SLURM request in repo scripts |
|---|---|---|
| Preprocess | CPU only | Local shell helper (no fixed #SBATCH resources) |
| Embeddings | GPU | a100 (1 GPU), 2 CPU/GPU, 8G/GPU, 01:00:00, array 0-3 |
| Labels | CPU only | 1 CPU, 16G RAM, 01:00:00 |
| Train FFNN | Multi-GPU | 4x a100, 20 CPU, 256G RAM, 12:00:00 |
| Train FFNN Optuna | Multi-GPU | 4x a100, 20 CPU, 448G RAM, 48:00:00 |
| Predict | GPU | a100 (1 GPU), 4 CPU, 32G RAM, 00:30:00 |
| Evaluate FFNN | GPU recommended | a100 (1 GPU), 8 CPU, 64G RAM, 08:00:00 |
These are baseline defaults from the current HPC scripts, not strict requirements for every dataset size. Adjust SLURM requests up or down to match your data and the hardware available to you.
Purpose:
- Merge metadata with z-score reactivity data and generate training-ready labels at the preprocessing stage.
Required inputs:
- Metadata TSV (for example, PV1 metadata).
- Reactivity/z-score TSV.
Local CLI example:
```shell
pepseqpred-preprocess data/meta.tsv data/zscores.tsv --save
```

Expected outputs:
- A generated TSV in the working directory, named like `input_data_<is_epi_z>_<is_epi_min_subs>_<not_epi_z>_<not_epi_max_subs|all>.tsv`.
Expected hardware:
- CPU only; lightweight compared with downstream stages.
Purpose:
- Convert protein FASTA sequences into per-residue ESM-2 embeddings.
Required inputs:
- FASTA file.
- Metadata file when using `id-family` embedding keys (default mode).
Local CLI example:
```shell
pepseqpred-esm \
  --fasta-file data/targets.fasta \
  --metadata-file data/targets.metadata \
  --embedding-key-mode id-family \
  --key-delimiter - \
  --model-name esm2_t33_650M_UR50D \
  --max-tokens 1022 \
  --batch-size 8 \
  --out-dir localdata/esm2_run
```

HPC script example:

```shell
sbatch --export=ALL,IN_FASTA=/scratch/$USER/data/targets.fasta scripts/hpc/generateembeddings.sh
```

Expected outputs:
- Per-sequence `.pt` embeddings under `<out_dir>/artifacts/pts/` (or shard subfolders in sharded mode).
- Embedding index CSV under `<out_dir>/artifacts/`.
Expected hardware:
- GPU strongly recommended; this is the first expensive stage.
Purpose:
- Build dense residue labels aligned to generated embedding keys/shards.
Required inputs:
- Preprocessed metadata TSV.
- Embedding directory (or shard directories).
Local CLI example:
```shell
pepseqpred-labels \
  data/input_data_20_4_10_all.tsv \
  localdata/labels/labels_shard_000.pt \
  --emb-dir localdata/esm2_run/artifacts/pts/shard_000 \
  --restrict-to-embeddings \
  --calc-pos-weight \
  --embedding-key-delim -
```

HPC script example:

```shell
sbatch --array=0-3 scripts/hpc/generatelabels.sh data/input_data_20_4_10_all.tsv /scratch/$USER/labels /scratch/$USER/esm2/artifacts/pts
```

Expected outputs:
- Label shard files such as `labels_shard_000.pt`.
- Optional positive class weight in the saved payload when `--calc-pos-weight` is used.
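The positive class weight saved with `--calc-pos-weight` conventionally follows the negative-to-positive ratio used by weighted BCE losses; a stdlib sketch of that convention (the exact formula PepSeqPred uses may differ):

```python
def pos_weight(labels: list[int]) -> float:
    """Ratio of negative to positive residues, the usual pos_weight for
    torch.nn.BCEWithLogitsLoss-style class weighting."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("no positive residues in label shard")
    return n_neg / n_pos

print(pos_weight([0, 0, 0, 1]))  # 3.0
```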
Expected hardware:
- CPU only; moderate memory.
Purpose:
- Train PepSeqPred FFNN on embedding shards and label shards.
Required inputs:
- One or more embedding shard directories.
- One or more label shard `.pt` files aligned to those embeddings.
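Before submitting a long training job, it can help to sanity-check that embedding shard directories and label shard files pair up by shard index. A stdlib sketch, assuming the `shard_NNN` naming convention shown in the examples in this README:

```python
from pathlib import Path
import re

def check_shard_alignment(embedding_dirs: list[str], label_shards: list[str]) -> None:
    """Raise if embedding shard directories (e.g. .../shard_000) and label
    shard files (e.g. labels_shard_000.pt) do not pair up by shard index."""
    emb_idx = [re.search(r"shard_(\d+)", Path(d).name).group(1) for d in embedding_dirs]
    lab_idx = [re.search(r"shard_(\d+)", Path(f).stem).group(1) for f in label_shards]
    if emb_idx != lab_idx:
        raise ValueError(f"shard mismatch: embeddings {emb_idx} vs labels {lab_idx}")

check_shard_alignment(
    ["esm2/artifacts/pts/shard_000", "esm2/artifacts/pts/shard_001"],
    ["labels/labels_shard_000.pt", "labels/labels_shard_001.pt"],
)
print("shards aligned")
```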
Local smoke-test CLI example (small subset):
```shell
pepseqpred-train-ffnn \
  --embedding-dirs localdata/esm2_run/artifacts/pts/shard_000 \
  --label-shards localdata/labels/labels_shard_000.pt \
  --epochs 1 \
  --subset 100 \
  --save-path localdata/models/ffnn_smoke \
  --results-csv localdata/models/ffnn_smoke/runs.csv
```

HPC script example:

```shell
sbatch scripts/hpc/trainffnn.sh \
  /scratch/$USER/esm2/artifacts/pts/shard_000 /scratch/$USER/esm2/artifacts/pts/shard_001 /scratch/$USER/esm2/artifacts/pts/shard_002 /scratch/$USER/esm2/artifacts/pts/shard_003 \
  -- \
  /scratch/$USER/labels/labels_shard_000.pt /scratch/$USER/labels/labels_shard_001.pt /scratch/$USER/labels/labels_shard_002.pt /scratch/$USER/labels/labels_shard_003.pt
```

Expected outputs:
- Run directories under `--save-path` containing checkpoints (for example `fully_connected.pt`).
- Multi-run CSV summary (default `multi_run_results.csv`, or the `--results-csv` path).
- Aggregated `multi_run_summary.json`.
Expected hardware:
- Practical training is multi-GPU/HPC-oriented.
Purpose:
- Run distributed hyperparameter optimization over FFNN architecture/training ranges.
Required inputs:
- Same embedding shard directories and label shard files used by training.
HPC script example:
```shell
sbatch scripts/hpc/trainffnnoptuna.sh \
  /scratch/$USER/esm2/artifacts/pts/shard_000 /scratch/$USER/esm2/artifacts/pts/shard_001 /scratch/$USER/esm2/artifacts/pts/shard_002 /scratch/$USER/esm2/artifacts/pts/shard_003 \
  -- \
  /scratch/$USER/labels/labels_shard_000.pt /scratch/$USER/labels/labels_shard_001.pt /scratch/$USER/labels/labels_shard_002.pt /scratch/$USER/labels/labels_shard_003.pt
```

Expected outputs:
- Trial metrics CSV (`--csv-path`).
- Optuna storage DB (`--storage`), SQLite on scratch.
- Trial checkpoint directories under `--save-path`.
- Best-trial metadata JSON under `--save-path`.
Expected hardware:
- Most expensive stage; budget multi-GPU runtime from several hours to days depending on `--n-trials`.
Purpose:
- Apply trained checkpoint to new FASTA input and emit binary residue masks.
Required inputs:
- Trained checkpoint `.pt`.
- Input FASTA file.
Local CLI example:
```shell
pepseqpred-predict \
  localdata/models/ffnn_smoke/run_001_split_11_train_101/fully_connected.pt \
  data/inference_targets.fasta \
  --output-fasta localdata/predictions/predictions.fasta \
  --model-name esm2_t33_650M_UR50D \
  --max-tokens 1022
```

HPC script example:

```shell
sbatch scripts/hpc/predictepitope.sh /scratch/$USER/models/ffnn_v1/run_001_split_11_train_101/fully_connected.pt /scratch/$USER/data/inference_targets.fasta /scratch/$USER/predictions/predictions.fasta
```

Expected outputs:
- Output FASTA with binary residue-level mask predictions.
- Prediction logs (console and optional log directory).
Expected hardware:
- Single GPU is recommended for throughput; CPU inference is possible but much slower.
Purpose:
- Evaluate trained checkpoints or ensemble manifests on residue-level labels.
- Optionally run Cocci-specific reduced-dataset preparation and peptide-level 1-count comparison.
Required inputs (minimum residue-level eval):
- Trained checkpoint `.pt` or ensemble manifest `.json`.
- One or more embedding directories.
- One or more label shard `.pt` files.
Local CLI example:
```shell
pepseqpred-eval-ffnn \
  localdata/models/ffnn_smoke/run_001_split_11_train_101/fully_connected.pt \
  --embedding-dirs localdata/esm2_run/artifacts/pts/shard_000 \
  --label-shards localdata/labels/labels_shard_000.pt \
  --output-json localdata/eval/ffnn_eval_summary.json \
  --batch-size 64 \
  --num-workers 0
```

HPC script example (Cocci workflow):

```shell
sbatch --export=ALL,EVAL_MODE=nonreactive,SKIP_IF_EXISTS=1,RUN_PREP=1,RUN_EMBED=1,RUN_LABELS=1,RUN_PREDICT=1,RUN_EVAL=1,RUN_COMPARE=1 \
  scripts/hpc/evaluateffnn.sh \
  /scratch/$USER/models/phaseA/ffnn_ens_1.0_xxxxxxxx/ensemble_manifest.json \
  /scratch/$USER/evals/cocci_eval/nonreactive
```

Expected outputs:
- `prepared/eval_metadata.tsv`, `prepared/eval_proteins.fasta`, `prepared/prepare_summary.json`.
- `embeddings/artifacts/eval_embedding_index.csv`.
- `labels/labels_eval.pt`.
- `prediction/predictions.fasta`.
- `evaluation/ffnn_eval_summary.json`.
- `peptide_compare/peptide_comparison.csv` and `peptide_compare/peptide_comparison_summary.json`.
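Residue-level metrics of the kind reported in `ffnn_eval_summary.json` can be illustrated with a stdlib sketch over predicted and true binary masks (the metric set PepSeqPred actually emits may differ):

```python
def residue_metrics(pred: str, true: str) -> dict[str, float]:
    """Precision, recall, and F1 over residue-level binary masks."""
    assert len(pred) == len(true), "masks must cover the same residues"
    tp = sum(p == "1" and t == "1" for p, t in zip(pred, true))
    fp = sum(p == "1" and t == "0" for p, t in zip(pred, true))
    fn = sum(p == "0" and t == "1" for p, t in zip(pred, true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(residue_metrics("01100", "01010"))
```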
- Keep the embedding key scheme (`id` vs `id-family`) consistent between embedding and label generation.
- Keep shard alignment explicit: embedding shard directories should map cleanly to label shard files.
- Use local smoke settings (`--subset`, low epochs, single shard) before submitting expensive HPC jobs.
- For `evaluateffnn.sh`, ensure `esm.pyz`, `labels.pyz`, `predict.pyz`, `evaluate_ffnn.pyz`, and `cocci_eval_pipeline.py` are present in the submission working directory.
- Use a new output root per run/study to avoid accidental overwrite of prior artifacts.
- Keep preprocessing outputs, embeddings, labels, checkpoints, and predictions in separate subdirectories.
- Keep `split-seeds`, `train-seeds`, and core hyperparameters fixed when comparing experiments.
- Keep `split-type` and the embedding key scheme (`id` vs `id-family`) unchanged within a single experiment.
- Do not mix artifacts from different preprocessing thresholds into one training run.
Suggested layout:
```
localdata/
  runs/
    <run_name_or_date>/
      preprocess/
      embeddings/
      labels/
      models/
      predictions/
      logs/
```
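The layout above can be created up front so every stage writes into its own subdirectory; a stdlib sketch (the helper name is hypothetical):

```python
from pathlib import Path

def make_run_layout(root: str, run_name: str) -> Path:
    """Create the per-run directory layout suggested above."""
    run_dir = Path(root) / "runs" / run_name
    for sub in ("preprocess", "embeddings", "labels", "models", "predictions", "logs"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    return run_dir

run = make_run_layout("localdata", "2024-06-01_baseline")
print(sorted(p.name for p in run.iterdir()))
```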
Common issues and fixes:
- `Metadata file is required for --embedding-key-mode='id-family'`: Provide `--metadata-file`, or use `--embedding-key-mode id` if that is your intended key scheme.
- `--embedding-key-delim must be '' or '-'`: Use `-` for `ID-family.pt` naming and the empty delimiter for `ID.pt` naming.
- `No .pt files found in <emb_dir>` during label generation: Verify Stage 2 completed successfully and `--emb-dir` points to the directory that directly contains embedding `.pt` files.
- Label shard missing `class_stats` when training: Rebuild labels with `--calc-pos-weight`, or pass `--pos-weight` explicitly to training.
- `--hidden-sizes and --dropouts must be the same length`: Ensure both CSV lists have one value per hidden layer.
- Prediction threshold errors (`(0.0, 1.0)` required): Set `--threshold` strictly between `0` and `1`, or omit it to use checkpoint/default behavior.
- `python: can't open file ... esm.pyz` (or `labels.pyz`/`predict.pyz`): Transfer the missing `.pyz` files to the same working directory as the HPC shell script, or run from an installed package environment using CLI entrypoints.
- `labels_eval.pt` or `predictions.fasta` not found during evaluation: An upstream stage failed or was skipped; rerun with stage flags set (`RUN_EMBED=1`, `RUN_LABELS=1`, `RUN_PREDICT=1`) or disable dependent downstream stages.
- DDP or multi-GPU runs stall/hang: Confirm the requested GPU count matches `torchrun --nproc_per_node`, and validate with a small single-shard smoke test first.
- CUDA OOM: Reduce the embedding `--batch-size`, reduce the training `--batch-size`, or lower the trial/search scope for Optuna.
Before starting a real run:
- Environment is activated and `pip install -e .[dev]` completed.
- Required preflight checks pass: `ruff check .` and `pytest -m "unit or integration or e2e"`.
- Stage input files exist and are from the intended experiment branch.
- Output paths are unique for this run and will not overwrite prior artifacts.
- Embedding key scheme/delimiter and split configuration are consistent across stages.