This dataset is designed to evaluate tokenizer performance before you use it for model pre-training/fine-tuning, across:
- 🌍 Human languages (multilingual + scripts)
- 💻 Programming languages (syntax-heavy)
- 🧮 Math & science expressions (symbols, unicode, formulas)
This dataset helps evaluate:
- Multilingual tokenization quality
- Code token handling
- Mathematical symbol parsing
- Robustness to noisy and mixed inputs
The dataset is organized into modular Python files:
```
data/
├── human_languages.py
├── programming_languages.py
├── scientific_formulas.py
├── edge_cases.py
```
Each file contains structured dictionaries that can be directly imported and used for tokenizer evaluation.
```python
from tokenizerbench.data.human_languages import human_languages
from tokenizerbench.data.programming_languages import programming_languages
from tokenizerbench.data.scientific_formulas import scientific_formulas

dataset = {
    "human_languages": human_languages,
    "programming_languages": programming_languages,
    "scientific_formulas": scientific_formulas,
}
```

Example using any tokenizer (HuggingFace, TikToken, SentencePiece, etc.):
```python
def evaluate_tokenizer(tokenizer, dataset):
    results = {}
    for category, data in dataset.items():
        results[category] = {}
        for subcategory, samples in data.items():
            token_counts = []
            for text in samples:
                tokens = tokenizer.encode(text)
                token_counts.append(len(tokens))
            results[category][subcategory] = {
                "avg_tokens": sum(token_counts) / len(token_counts),
                "max_tokens": max(token_counts),
                "min_tokens": min(token_counts),
            }
    return results


def compression_ratio(tokenizer, text):
    tokens = tokenizer.encode(text)
    return len(tokens) / len(text)
```

👉 Run this across:
- Different languages
- Code snippets
- Math expressions
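As a quick illustration, the compression-ratio idea can be exercised on one sample from each category. The `WhitespaceTokenizer` below is a hypothetical stand-in for a real tokenizer (anything exposing `encode()` works), and `compression_ratio` is restated so the snippet runs on its own:

```python
# Minimal sketch: compare tokens-per-character across categories using a
# toy whitespace tokenizer. WhitespaceTokenizer is a hypothetical stand-in;
# swap in any real tokenizer that exposes .encode().

class WhitespaceTokenizer:
    def encode(self, text):
        return text.split()

def compression_ratio(tokenizer, text):
    # tokens per character: lower means more text packed into each token
    return len(tokenizer.encode(text)) / len(text)

tok = WhitespaceTokenizer()
samples = {
    "language": "The quick brown fox jumps over the lazy dog",
    "code": "def add(a, b):\n    return a + b",
    "math": "E = m * c ** 2",
}
for name, text in samples.items():
    print(name, round(compression_ratio(tok, text), 3))
```

A whitespace tokenizer does well on prose and poorly on dense symbols, which is exactly the spread this dataset is meant to surface.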
```python
def unicode_test(tokenizer, text):
    tokens = tokenizer.encode(text)
    decoded = tokenizer.decode(tokens)
    return text == decoded
```

Test on:
- Multilingual text
- Emojis
- Scientific symbols
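The round-trip check above can be sketched end to end with a byte-level tokenizer, which is lossless by construction. `ByteTokenizer` here is a hypothetical minimal stand-in for byte-level BPE-style tokenizers:

```python
# Sketch: decode(encode(x)) == x round-trip with a toy byte-level tokenizer
# (one token per UTF-8 byte). ByteTokenizer is a hypothetical stand-in.

class ByteTokenizer:
    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        return bytes(tokens).decode("utf-8")

def unicode_test(tokenizer, text):
    return tokenizer.decode(tokenizer.encode(text)) == text

tok = ByteTokenizer()
for sample in ["Hello 世界", "🚀🔥", "α β γ ∑ ∫ √"]:
    print(sample, unicode_test(tok, sample))  # True for all three
```

Real subword tokenizers can fail this check when unknown codepoints map to an UNK token, which is precisely what `unicode_test` is designed to catch.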
```python
long_text = "AI_TOKEN_TEST " * 1000  # ~14K chars
tokens = tokenizer.encode(long_text)
print("Token count:", len(tokens))
```

👉 Helps evaluate:
- Context handling
- Token explosion
- Memory efficiency
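To make "token explosion" concrete: assuming a byte-level tokenizer that emits one token per UTF-8 byte, an emoji string of the same character length as an ASCII string costs four times as many tokens, since each emoji here occupies four bytes:

```python
# Byte-level view: a 1,000-char emoji string occupies 4x the bytes of a
# 1,000-char ASCII string, so a one-token-per-byte tokenizer emits 4x
# the tokens for the same character count.
ascii_text = "A" * 1000
emoji_text = "🚀" * 1000

print(len(ascii_text.encode("utf-8")))  # 1000 byte-level tokens
print(len(emoji_text.encode("utf-8")))  # 4000 byte-level tokens
```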
Run comparisons across:
- Multiple tokenizers (BPE, SentencePiece, Unigram)
- Multiple categories:
  - Human languages
  - Code
  - Math & symbols
Track:
- Token count
- Compression ratio
- Decode fidelity
- Stability on long inputs
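The metrics above can be collected in one pass per tokenizer. The two toy tokenizers below are hypothetical stand-ins; real runs would plug in BPE, SentencePiece, or Unigram models instead:

```python
# Sketch: gather token count, compression ratio, and decode fidelity for
# several tokenizers on the same text. CharTokenizer and WordTokenizer are
# hypothetical stand-ins for real tokenizer backends.

class CharTokenizer:
    def encode(self, text):
        return list(text)

    def decode(self, tokens):
        return "".join(tokens)

class WordTokenizer:
    def encode(self, text):
        return text.split(" ")

    def decode(self, tokens):
        return " ".join(tokens)

def benchmark(tokenizer, text):
    tokens = tokenizer.encode(text)
    return {
        "token_count": len(tokens),
        "compression_ratio": len(tokens) / len(text),
        "decode_fidelity": tokenizer.decode(tokens) == text,
    }

text = "tokenizers compress text into discrete units"
for name, tok in [("char", CharTokenizer()), ("word", WordTokenizer())]:
    print(name, benchmark(tok, text))
```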
For serious benchmarking, log results like:
```python
{
    "tokenizer": "tiktoken",
    "language": "hindi",
    "avg_tokens": 18.2,
    "compression_ratio": 0.32,
    "unicode_safe": True
}
```

👉 This allows you to build:
- Leaderboards
- Tokenizer comparisons
- Performance dashboards
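One simple way to accumulate such records is a JSON Lines file that leaderboard or dashboard code can aggregate later. The field names mirror the example record above; the file name `results.jsonl` is an assumption:

```python
import json

# Sketch: append one benchmark record per run to a JSON Lines log.
# The record fields mirror the example above; "results.jsonl" is an
# assumed file name, not part of the dataset.

record = {
    "tokenizer": "tiktoken",
    "language": "hindi",
    "avg_tokens": 18.2,
    "compression_ratio": 0.32,
    "unicode_safe": True,
}

with open("results.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```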
Measure how many tokens each input produces.
```python
tokens = tokenizer.encode(text)
print(len(tokens))
```

👉 Lower token count (for same meaning) = better efficiency

```python
compression_ratio = len(tokens) / len(text)
```

- Lower ratio → better tokenizer
- Indicates how efficiently text is represented
Test:
- Multilingual text
- Emojis
- Mathematical symbols
```python
test = "Hello 世界 🚀 α β γ ∑"
tokens = tokenizer.encode(test)
decoded = tokenizer.decode(tokens)
```

Check:
- Is decoded text identical?
- Any corruption?
- Any token explosion?
Test:
- Long sequences (2K–10K chars)
- Mixed scripts
- Noisy text
This dataset helps evaluate:
- Multilingual tokenization quality
- Code token handling
- Mathematical symbol parsing
- Robustness to noisy and long inputs
- Expand human_languages → 100 languages using ISO language list
- Keep same semantic structure across languages for consistency
- Add longer sequences (2K–10K chars) to test tokenizer limits