This repository contains the code for the paper "Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives" by Ander Artola Velasco, Stratis Tsirtsis, Nastaran Okati and Manuel Gomez-Rodriguez.
State-of-the-art large language models require specialized hardware and substantial energy to operate. As a consequence, cloud-based services that provide access to large language models have become very popular. In these services, the price users pay for an output provided by a model depends on the number of tokens the model uses to generate it—they pay a fixed price per token. In this work, we show that this pricing mechanism creates a financial incentive for providers to strategize and misreport the (number of) tokens a model used to generate an output, and users cannot prove, or even know, whether a provider is overcharging them. However, we also show that, if an unfaithful provider is obliged to be transparent about the generative process used by the model, misreporting optimally without raising suspicion is hard. Nevertheless, as a proof-of-concept, we introduce an efficient heuristic algorithm that allows providers to significantly overcharge users without raising suspicion, highlighting the vulnerability of users under the current pay-per-token pricing mechanism. Further, to completely eliminate the financial incentive to strategize, we introduce a simple incentive-compatible token pricing mechanism. Under this mechanism, the price users pay for an output provided by a model depends on the number of characters of the output—they pay a fixed price per character. Along the way, to illustrate and complement our theoretical results, we conduct experiments with several large language models from the Llama, Gemma and Mistral families, and input prompts from the LMSYS Chatbot Arena platform.
All experiments were performed using Python 3.11.2. To create a virtual environment and install the project dependencies, run:
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt

├── data
│ ├── LMSYS.txt (English prompts)
│ ├── LMSYS_ch.txt (Chinese prompts)
│ ├── LMSYS_esp.txt (Spanish prompts)
│ └── LMSYS_ru.txt (Russian prompts)
├── figures
│ ├── energy (per-prompt energy scatter plots)
│ ├── heur (heuristic-misreporting success/over curves)
│ ├── margin_dist (provider-margin distribution plots)
│ └── price_dist (character-per-token price distributions)
├── notebooks
│ ├── appendix_example.ipynb
│ ├── cpt.ipynb
│ ├── energy_profit_plots.ipynb
│ ├── plot_profit_no_transparency.ipynb
│ ├── plots_heur_new.ipynb
│ ├── plots_heur_random.ipynb
│ └── process_ds.ipynb
├── outputs
│ ├── cpt (JSON results from LMSYS_generation.py)
│ ├── energy_outputs (JSON results from energy.py)
│ └── heuristic_new (JSON results from heuristic_misreporting.py)
├── scripts
│ ├── pkl_to_json.py
│ ├── script_slurm_heur.sh
│ ├── script_slurm_lmsys.sh
│ ├── script_slurm_lmsys_energy.sh
│ ├── script_slurm_lmsys_generation.sh
│ └── script_slurm_lmsys_generation_loop.sh
└── src
  ├── energy.py
  ├── heuristic_misreporting.py
  ├── LMSYS_generation.py
  ├── tokenizations.py
  ├── tokenizations_fixed.py
  ├── tokenizations_fixed_plausible.py
  └── utils.py
data/ contains the LMSYS prompt sets used in the experiments, one file per language (English, Chinese, Spanish, Russian).

figures/ contains all the figures presented in the paper, organized by experiment type (energy, heuristic, margin distribution, price distribution).

notebooks/ contains the Jupyter notebooks that generate the figures:
- cpt.ipynb: character-per-token price and provider-margin distributions across models, temperatures, top-p values, and languages.
- plots_heur_new.ipynb: per-model "plausible sequences" and "overcharged tokens" curves from the heuristic-misreporting evaluation.
- plots_heur_random.ipynb: random-variant heuristic plausibility curve.
- plot_profit_no_transparency.ipynb: random-split profit increase as a function of split iterations.
- energy_profit_plots.ipynb: per-call energy demo, per-model energy statistics, and the profit / utility plots that combine energy ratios with heuristic-misreporting results.
- process_ds.ipynb: builds the LMSYS prompt sets used by the other notebooks.
- appendix_example.ipynb: generates the worked examples in Appendix C.2 of the paper.

outputs/ contains intermediate files produced by the experiment scripts and consumed by the notebooks. Each subdirectory ships .json snapshots so the notebooks can be re-executed from a fresh clone:
- cpt: generation outputs from LMSYS_generation.py.
- energy_outputs: GPU energy results from energy.py.
- heuristic_new: plausibility counts from heuristic_misreporting.py.

scripts/ contains SLURM submission templates and a small utility:
- script_slurm_heur.sh: runs heuristic_misreporting.py on the LMSYS prompts.
- script_slurm_lmsys_generation.sh: runs LMSYS_generation.py on the LMSYS prompts.
- script_slurm_lmsys_generation_loop.sh: loops LMSYS_generation.py over a grid of (language, temperature, top-p, model).
- script_slurm_lmsys_energy.sh: runs energy.py for GPU energy measurement.
- script_slurm_lmsys.sh: convenience submission template kept for reproducibility.
- pkl_to_json.py: converts a directory of .pkl outputs into portable .json files (used to produce the JSON snapshots shipped under outputs/cpt/ and outputs/energy_outputs/).
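
The pickle-to-JSON conversion step can be pictured with the minimal sketch below. It is not the code of pkl_to_json.py: the function name `convert_pickles`, the one-JSON-file-per-pickle layout, and the `default=str` fallback are all assumptions made for illustration.

```python
import json
import pickle
import tempfile
from pathlib import Path


def convert_pickles(src_dir, dst_dir):
    """Convert every *.pkl file in src_dir to a *.json file in dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for pkl_path in sorted(src.glob("*.pkl")):
        # Only unpickle files you produced yourself: pickle can execute code.
        with pkl_path.open("rb") as f:
            obj = pickle.load(f)
        json_path = dst / (pkl_path.stem + ".json")
        with json_path.open("w", encoding="utf-8") as f:
            # Assumption: default=str is a crude fallback for non-JSON values.
            json.dump(obj, f, ensure_ascii=False, indent=2, default=str)
        written.append(json_path)
    return written


# Round-trip demo in a throwaway directory (hypothetical payload).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "pkl"
    src.mkdir()
    with (src / "demo.pkl").open("wb") as f:
        pickle.dump({"prompt": "hello", "n_tokens": [3, 4]}, f)
    out_files = convert_pickles(src, Path(tmp) / "json")
    restored = json.loads(out_files[0].read_text(encoding="utf-8"))
```

Objects that JSON cannot represent directly (for example tensors or arrays) would need an explicit conversion; the string fallback here only keeps the sketch from failing.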

src/ contains the source code used to produce the experiment outputs:
- LMSYS_generation.py: generates outputs from a HuggingFace model under top-p or top-k sampling for a list of prompts.
- heuristic_misreporting.py: main script of the heuristic experiments. It generates outputs, iteratively splits tokens, and checks whether each depth still satisfies the original top-p / top-k sampling condition.
- energy.py: measures GPU energy during one generation call and one scoring forward pass.
- tokenizations_fixed.py: enumerates all tokenizations of a fixed output string and computes the model's conditional probability of each.
- tokenizations_fixed_plausible.py: same enumeration but flags which tokenizations are top-p / top-k plausible.
- tokenizations.py: helper routines used by the above (enumerate tokenizations, compute tokenization probability, verify sampling conditions).
- utils.py: plotting helpers (LaTeX-friendly figure dimensions) and the optimal-tokenization (shortest-cover) routine.

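
To make "enumerating tokenizations" and "shortest cover" concrete, here is a small sketch on a toy vocabulary. The function names and the vocabulary are hypothetical and this is not the code in tokenizations.py or utils.py; it only illustrates that one output string can correspond to several token sequences of different lengths.

```python
def enumerate_tokenizations(text, vocab):
    """Return every way to split `text` into a sequence of tokens from `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            for tail in enumerate_tokenizations(text[end:], vocab):
                results.append([head] + tail)
    return results


def shortest_tokenization(text, vocab):
    """Dynamic program for a fewest-token cover of `text`, or None if none exists."""
    best = [None] * (len(text) + 1)  # best[i] = shortest tokenization of text[:i]
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


# Toy vocabulary (an assumption for illustration, not a real tokenizer).
toy_vocab = {"a", "b", "c", "ab", "bc", "abc"}
all_tokenizations = enumerate_tokenizations("abc", toy_vocab)
shortest = shortest_tokenization("abc", toy_vocab)
longest = max(all_tokenizations, key=len)
```

In this toy case the same string "abc" could be billed as one token or as three. The number of tokenizations grows quickly with string length, so exhaustive enumeration is only practical for short outputs.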
.pkl files are excluded from the repository via .gitignore.
The outputs/ subdirectories (cpt, energy_outputs, heuristic_new)
ship .json snapshots so the
corresponding notebooks can be re-executed from a fresh clone. To
regenerate any of them from scratch, run the relevant script in src/
(or its SLURM submission template) and convert the resulting pickles to
JSON with scripts/pkl_to_json.py.
Our experiments use LLMs from the Llama, Gemma and Mistral families. These are gated models and require accepting the corresponding license. Request access at:
Once you have access, authenticate with huggingface-cli login and the scripts will cache the model weights under models/ (which is excluded from the repository via .gitignore).
The main heuristic-misreporting experiment runs heuristic_misreporting.py. It can be invoked locally or via the cluster template script_slurm_heur.sh (which reads prompts from data/LMSYS.txt). Key flags:
- --model sets the HuggingFace model id (e.g. meta-llama/Llama-3.2-1B-Instruct).
- --temperature sets the sampling temperature.
- --p / --k set top-p or top-k.
- --prompts is the list of prompt strings to process.
- --splits is the list of iteration depths at which to record plausibility.
To produce the per-model heuristic-misreporting figures from the resulting pickles, run plots_heur_new.ipynb.
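
The plausibility check recorded at each depth can be illustrated with a toy top-p example. The sketch below assumes the common nucleus-sampling definition (the smallest set of highest-probability tokens whose cumulative mass reaches p) and a hypothetical `toy_model` that returns a next-token distribution for a prefix; the actual script works with a HuggingFace model and may handle ties and thresholds differently.

```python
def top_p_set(probs, p):
    """Smallest set of highest-probability tokens whose total mass reaches p."""
    kept, mass = set(), 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        if mass >= p:
            break
        kept.add(token)
        mass += prob
    return kept


def is_plausible(tokens, next_token_probs, p):
    """True if every token lies in the top-p set given the tokens before it."""
    for i, token in enumerate(tokens):
        probs = next_token_probs(tuple(tokens[:i]))
        if token not in top_p_set(probs, p):
            return False
    return True


def toy_model(prefix):
    """Hypothetical next-token distributions, standing in for a real LLM."""
    if prefix == ():
        return {"ab": 0.6, "a": 0.3, "b": 0.1}
    return {"b": 0.7, "ab": 0.2, "a": 0.1}


honest_ok = is_plausible(["ab"], toy_model, p=0.8)
split_ok_loose = is_plausible(["a", "b"], toy_model, p=0.8)
split_ok_strict = is_plausible(["a", "b"], toy_model, p=0.5)
```

With p = 0.8 the split sequence ["a", "b"] passes the check even though it reports two tokens for the same string "ab"; with p = 0.5 the token "a" falls outside the top-p set and the split would be detectable.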
energy.py measures the GPU energy of one generate() call and one scoring forward pass for a list of prompts, sampling NVML power draw at 10 Hz. The aggregated per-model results (shipped under outputs/energy_outputs/) are consumed by energy_profit_plots.ipynb.
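
Turning power samples into an energy figure amounts to integrating power over time. The following sketch assumes evenly spaced samples in watts and uses the trapezoidal rule; whether energy.py integrates this way is an assumption, and `energy_joules` is a hypothetical name. When reading from NVML through pynvml, `nvmlDeviceGetPowerUsage` reports milliwatts, so values would need to be divided by 1000 first.

```python
def energy_joules(power_watts, sample_rate_hz=10.0):
    """Trapezoidal integral of evenly spaced power samples (W) over time (s)."""
    if len(power_watts) < 2:
        return 0.0
    dt = 1.0 / sample_rate_hz  # seconds between consecutive samples
    total = 0.0
    for left, right in zip(power_watts, power_watts[1:]):
        total += 0.5 * (left + right) * dt
    return total


# Eleven samples at 10 Hz span one second of a constant 100 W draw.
constant_draw = energy_joules([100.0] * 11)
# Three samples at 10 Hz span 0.2 s of a linearly rising draw.
ramp_draw = energy_joules([0.0, 100.0, 200.0])
```

At 10 Hz, short bursts of power that fall between two samples are not captured, so per-call figures for very short generations are approximate.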
LMSYS_generation.py generates outputs under a chosen sampling configuration. The resulting outputs (shipped under outputs/cpt/) are loaded by cpt.ipynb to compute character-per-token price and provider-margin distributions across models, temperatures, top-p values, and languages. script_slurm_lmsys_generation_loop.sh sweeps the full (language, temperature, top-p, model) grid in one submission.
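
The difference between the two pricing mechanisms discussed in the paper can be seen in a few lines. This sketch uses made-up integer prices and hypothetical function names; it is not the computation in cpt.ipynb, only an illustration of why a bill that depends on characters is unaffected by how the output is tokenized.

```python
def bill_per_token(reported_tokens, price_per_token):
    """Pay-per-token: the bill grows with the number of reported tokens."""
    return len(reported_tokens) * price_per_token


def bill_per_character(reported_tokens, price_per_char):
    """Pay-per-character: the bill depends only on the output string."""
    return sum(len(token) for token in reported_tokens) * price_per_char


def chars_per_token(tokens):
    """Average number of characters carried by each token."""
    return sum(len(token) for token in tokens) / len(tokens)


# Two tokenizations of the same 12-character string (illustrative only).
honest = ["token", "ization"]
inflated = ["tok", "en", "iz", "ation"]

token_bills = (bill_per_token(honest, 2), bill_per_token(inflated, 2))
char_bills = (bill_per_character(honest, 1), bill_per_character(inflated, 1))
ratios = (chars_per_token(honest), chars_per_token(inflated))
```

Under the per-token price the inflated report doubles the bill, while under the per-character price both reports cost the same because they spell the same string.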
If you have questions about the code, find potential bugs, or would like us to add functionality, feel free to open an issue or contact Ander Artola Velasco.
If you use parts of the code in this repository for your own research, please consider citing:
@misc{velasco2026llmoverchargingyoutokenization,
title={Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives},
author={Ander Artola Velasco and Stratis Tsirtsis and Nastaran Okati and Manuel Gomez-Rodriguez},
year={2026},
eprint={2505.21627},
archivePrefix={arXiv},
primaryClass={cs.GT},
url={https://arxiv.org/abs/2505.21627},
}