Why does BLEU fail for low-resource languages?

BLEU depends on multiple reference translations to smooth out noise; it was designed for well-resourced evaluation sets with several references per sentence (classically up to four). With single references (typical in low-resource settings), BLEU becomes unstable: valid paraphrases become false negatives, and scores can swing 5-10 points between runs on the same data.

What is tokenization bias and how does it affect low-resource NLP?

Subword tokenization (BPE, WordPiece) systematically biases against morphologically-rich languages. A Turkish word such as 'kalabalıklaştığınızdan' becomes 9+ tokens, while a comparable English word, 'crowdedness', is 3 tokens. Since BLEU counts token-level n-grams, a single differing suffix breaks every n-gram that spans it, so Turkish models are heavily penalized for morphological variants while English models have more flexibility: same fluency, different BLEU scores.

What metrics work best when you have only one reference translation?

chrF (Character F-Score) is the workhorse metric for single-reference evaluation. It computes an F-score over character n-grams instead of tokens, which sidesteps subword tokenization bias. It correlates with human judgment at 0.70-0.85 for morphologically-rich languages (better than BLEU) and needs no language-specific tokenizer, so it applies to any language.

How does chrF differ from BLEU?

chrF operates at the character level rather than the token level, which sidesteps tokenization bias. It handles morphological variants naturally, works well with single references, and correlates better with human judgment. The tradeoff: chrF is sensitive to output length. Its default F-score weights recall more heavily than precision (β = 2), so a longer output can score higher without being better.

When should you use TER instead of BLEU?

Use Translation Edit Rate (TER) when you want a metric that mirrors how humans judge translations. TER measures the minimum number of edits (insertions, deletions, substitutions, word shifts) needed to convert the hypothesis into the reference, normalized by reference length, so lower is better. A moved block of words counts as a single shift, which gives partial credit for word reordering. It correlates better with human judgment on morphologically-rich languages but is computationally more expensive than BLEU.

What is BERTScore and when should you use it?

BERTScore uses contextual embeddings from multilingual models (mBERT, XLM-RoBERTa) to compute semantic similarity between reference and hypothesis, so matching no longer depends on exact token overlap. It handles paraphrases naturally and needs only a single reference, provided your language is in the model's pretraining data. Limitation: it fails silently for truly rare languages not in pretraining.

Evaluating Language Models for Low-Resource Languages: Beyond BLEU

Standard NLP evaluation metrics (BLEU, CIDEr) are designed for high-resource language pairs with multiple reference translations, making them unsuitable for low-resource languages (fewer than 100M speakers) where you have at most one or two reference translations and minimal annotation budget.

Evidence: Building a Uyghur-English translation system with exactly one reference translation per sentence and a team of three annotators, I found that BLEU would swing ±5 points between runs on identical data. CIDEr penalized valid paraphrases. Tokenization bias systematically punishes morphologically-rich languages like Turkish (9+ tokens for a single word) while favoring English (3 tokens). No standard metric reliably indicated whether the model was improving.

The solution: Character-level metrics (chrF) and semantic similarity measures (embedding-based scores) work where token-level metrics fail. They are robust to paraphrases, morphological variants, and single-reference evaluation. For low-resource NLP, the standard metrics require replacement, not fine-tuning.

The Problem: BLEU Assumes Abundance

BLEU (Bilingual Evaluation Understudy) was designed around multiple reference translations. Classic multi-reference test sets provide up to four references per sentence, and high-resource benchmarks such as WMT (Workshop on Machine Translation) English↔German can at least rely on large, professionally translated test sets. This redundancy smooths out noise.

In low-resource settings, you're lucky to have one reference translation. With a single reference, BLEU becomes unstable: a valid paraphrase that differs from the reference becomes a false negative. Scores can swing 5–10 points between runs on the same data.
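To make the false-negative effect concrete, here is a minimal sketch of BLEU-style clipped n-gram precision against one versus two references. It is not full BLEU (no brevity penalty, no geometric mean over orders), and the function names and example sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hypothesis, references, n):
    """BLEU-style modified n-gram precision against one or more references."""
    hyp_counts = ngrams(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0
    # For each n-gram, keep the highest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())

# Hypothetical example: the hypothesis is a valid paraphrase of the first reference.
hyp = "the meeting was postponed until monday"
single = ["the meeting has been delayed to monday"]
multi = single + ["the meeting was postponed until monday morning"]

print(clipped_precision(hyp, single, 2))  # 0.2: the paraphrase looks like an error
print(clipped_precision(hyp, multi, 2))   # 1.0: a second reference absorbs it
```

With only the first reference, four of the five hypothesis bigrams count as wrong even though the translation is fine. That is the instability a single reference introduces.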

But that's only the first problem.

Problem 2: Tokenization Bias

Subword tokenization (BPE, WordPiece) introduced a clever solution to the out-of-vocabulary problem for neural models. It also introduced a systematic bias against morphologically-rich languages.

Consider this Turkish word:

kalabalıklaştığınızdan
(because you got crowded)

Tokenized: kal##ab##al##ık##las##tir##di##gin##iz##dan
(9+ tokens)

English equivalent:

crowdedness

Tokenized: crowd##ed##ness
(3 tokens)

Now, BLEU counts n-gram matches at the token level. When the Turkish model produces a morphologically related variant, even one that differs by a single suffix, the change lands inside a long token sequence and breaks every n-gram that spans it, so BLEU penalizes it heavily. The English model, with 3 tokens, has much more flexibility because far fewer n-grams are exposed to any single change.

Same fluency. Different BLEU scores. This isn't a measurement problem; it's a metric bias problem.
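A small sketch of why exact matching is harsh on morphological variants: two Turkish forms that differ only in the case suffix score zero under exact match but share most of their character trigrams. The helper names are hypothetical, and the Jaccard overlap used here is only an illustration, not a standard MT metric.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def exact_match(hyp, ref):
    """All-or-nothing credit, as in token-level n-gram matching."""
    return 1.0 if hyp == ref else 0.0

def char_overlap(hyp, ref, n=3):
    """Jaccard overlap of character n-gram sets (illustrative only)."""
    h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
    if not h and not r:
        return 0.0
    return len(h & r) / len(h | r)

# "evlerinizden" (from your houses) vs "evlerinizde" (in your houses)
print(exact_match("evlerinizden", "evlerinizde"))   # 0.0
print(char_overlap("evlerinizden", "evlerinizde"))  # 0.9
```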

What Actually Works

Here's the practical guide when you're evaluating low-resource models:

1. chrF (Character F-Score) — The Workhorse

Compute F-score over character n-grams (typically all orders up to 6) instead of token n-grams. This sidesteps subword tokenization.

✅ Works excellently with single references
✅ Language-agnostic (needs no language-specific tokenizer, so it applies to Turkish, Finnish, and Uyghur alike)
✅ Correlates with human judgment at 0.70–0.85 (better than BLEU for many language pairs)
⚠️ Sensitive to output length (recall is weighted more heavily than precision, so more characters → higher score, not necessarily better)

Use chrF when: You have 1 reference and morphological diversity to handle.
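Here is a minimal sentence-level chrF sketch, assuming whitespace is ignored, orders 1 to 6 are used, and β = 2. It averages precision and recall across orders before combining them. Maintained implementations differ in details such as whitespace handling and how orders are averaged, so expect small numeric differences from them.

```python
from collections import Counter

def char_ngram_counts(text, n):
    """Character n-gram counts, ignoring whitespace (an assumption of this sketch)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: average precision/recall over orders 1..max_n, then F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngram_counts(hypothesis, n)
        ref = char_ngram_counts(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is too short for this order
        overlap = sum((hyp & ref).values())  # clipped matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision.
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# A one-suffix difference still earns most of the credit.
print(chrf("evlerinizden geldik", "evlerinizde geldik"))
print(chrf("xyz", "abc"))  # 0.0: no shared characters
```

Because β = 2 favors recall, padding the hypothesis with extra material costs less than dropping material from it, which is where the length sensitivity comes from.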

2. TER (Translation Edit Rate) — The Human Proxy

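Measure the minimum number of edits (insertions, deletions, substitutions, word shifts) needed to convert the hypothesis into the reference, divided by the reference length. Lower is better.

✅ Gives partial credit for word reordering: a moved block of words counts as one shift rather than several errors (fluency matters)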
✅ Correlates better with human judgment on morphologically-rich languages
⚠️ Computationally expensive; requires reference translation

Use TER when: You want a metric that mirrors how humans judge translations.
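The sketch below computes a simplified edit rate using word-level insertions, deletions, and substitutions only. It deliberately leaves out TER's shift operation, so it behaves like word error rate and gives an upper bound on true TER. The function names are hypothetical.

```python
def edit_distance(hyp_tokens, ref_tokens):
    """Word-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, h in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, r in enumerate(ref_tokens, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def ter_no_shifts(hypothesis, reference):
    """Edits divided by reference length; omits shifts, so it overestimates TER."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(hyp, ref) / len(ref)

print(ter_no_shifts("the cat sat on mat", "the cat sat on the mat"))  # 1 edit / 6 words
print(ter_no_shifts("yesterday we met", "we met yesterday"))          # 2 edits / 3 words
```

In the second example, full TER would move "yesterday" with a single shift, for 1 edit instead of 2. That is the partial credit for reordering that this simplified version lacks.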

3. Perplexity — When References Don't Exist

No gold-standard translations? Use perplexity computed on held-out in-domain data.

✅ Requires zero references (this is huge for low-resource)
✅ Stable even with tiny validation sets (100–1000 examples)
⚠️ Doesn't directly measure task quality (model A might have lower perplexity but worse translations)
⚠️ Domain-specific (meaningless to compare perplexity across different domains, or across models with different tokenizers)

Use perplexity when: Collecting references is impossible, but you have in-domain data.
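Perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming you have already extracted natural-log token probabilities from your model (how you get them depends on your framework and is not shown):

```python
import math

def perplexity(token_logprobs):
    """Corpus perplexity from per-sentence lists of natural-log token probabilities."""
    total_logprob = sum(sum(sentence) for sentence in token_logprobs)
    token_count = sum(len(sentence) for sentence in token_logprobs)
    if token_count == 0:
        raise ValueError("need at least one token")
    return math.exp(-total_logprob / token_count)

# Hypothetical held-out set: a model that gives every token probability 0.25
# is as uncertain as a uniform choice among 4 options.
held_out = [[math.log(0.25)] * 5, [math.log(0.25)] * 3]
print(perplexity(held_out))  # approximately 4.0
```

Averaging over tokens rather than sentences keeps long sentences from being under-weighted. It is also why the number depends on the tokenizer.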

4. BERTScore with Multilingual Models — The Semantic Shortcut

Use contextual embeddings from mBERT or XLM-RoBERTa to compute semantic similarity between reference and hypothesis, so that matching no longer depends on exact token overlap or surface morphology.

✅ Needs only one reference (it compares hypothesis against reference, so it is not reference-free)
✅ Handles paraphrases naturally
⚠️ Only works if your language is in the multilingual model's pretraining
⚠️ Fails silently for truly rare languages

Use BERTScore when: Your language is in XLM-R's pretraining and you want semantic matching.
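The core of BERTScore is greedy matching: each token embedding is paired with its most similar counterpart on the other side, and the similarities are averaged into precision, recall, and F1. The sketch below shows only that matching step with toy 2-D vectors standing in for real contextual embeddings. The vectors and function names are assumptions for illustration, and real BERTScore adds options such as idf weighting and baseline rescaling.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_match_f1(hyp_vecs, ref_vecs):
    """BERTScore-style F1: each token is matched to its best counterpart."""
    if not hyp_vecs or not ref_vecs:
        return 0.0
    precision = sum(max(cosine(h, r) for r in ref_vecs) for h in hyp_vecs) / len(hyp_vecs)
    recall = sum(max(cosine(r, h) for h in hyp_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall <= 0:
        return 0.0  # guard for this sketch; avoids dividing by zero
    return 2 * precision * recall / (precision + recall)

# Toy embeddings (hypothetical): a paraphrase lands near the reference tokens.
reference = [[1.0, 0.0], [0.0, 1.0]]
paraphrase = [[0.9, 0.1], [0.1, 0.9]]
unrelated = [[-1.0, 0.0], [0.0, -1.0]]
print(greedy_match_f1(paraphrase, reference))  # close to 1
print(greedy_match_f1(unrelated, reference))   # 0.0
```

If the language is missing from pretraining, the embeddings themselves are unreliable, and nothing in this computation will warn you. That is the silent failure mode.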

The Decision Tree

Your Constraint	Metric	Why
Single reference, any language	chrF	Stable, language-agnostic
No references, in-domain data	Perplexity	Only option; good enough for ranking
Morphologically-rich language	TER	Human-like judgment
Want semantic matching	BERTScore (mBERT or XLM-R)	Matches in embedding space instead of on exact tokens
Budget: 10–20 examples	Human pairwise preference	Relative ranking works; absolute scores don't

The Real Constraint: Annotation Budget

Here's what actually changes in low-resource settings: You can't afford big evaluation campaigns.

For English, you might evaluate on BLEU (automated, free) + human judgment on 500 examples (expensive, but worthwhile). For a minority language, 500 references might be 20% of your entire corpus.

Cascade your evaluation:

1. Run perplexity on all 10,000 examples (free)
2. Use perplexity to identify high-uncertainty cases (models A and B are close)
3. Get humans to judge only those uncertain cases (maybe 100 examples)
4. Use the human judgments to make the final ranking

This way, automation does the heavy lifting, and human time is allocated where it matters.
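One way to sketch the selection step: given per-example perplexities from two models, send the examples where the models are closest to human annotators. Using the absolute log-ratio as the closeness measure is an assumption of this sketch, as are the function name and the budget default.

```python
import math

def select_for_human_review(ppl_a, ppl_b, budget=100):
    """Return indices of the examples where models A and B are hardest to separate."""
    if len(ppl_a) != len(ppl_b):
        raise ValueError("both models must be scored on the same examples")
    # Absolute log-ratio: 0 means the two models are equally (un)certain.
    gaps = [(abs(math.log(a) - math.log(b)), i)
            for i, (a, b) in enumerate(zip(ppl_a, ppl_b))]
    gaps.sort()
    return [i for _, i in gaps[:budget]]

# Hypothetical per-example perplexities for four sentences.
model_a = [10.0, 50.0, 20.0, 8.0]
model_b = [30.0, 52.0, 21.0, 8.0]
print(select_for_human_review(model_a, model_b, budget=2))  # [3, 1]
```

Example 0, where model A is clearly ahead, never reaches an annotator. The budget goes to the cases automation cannot decide.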

One More Thing: Inter-Annotator Agreement

When you do use human evaluators (even if you can afford only 20 examples), measure inter-annotator agreement with Cohen's κ (for two annotators) or Krippendorff's α (for any number of annotators).

If κ < 0.60, your evaluation criteria are ambiguous — clarify the definition before scaling. If κ > 0.75, you can trust the scores.

At low-resource scale, this is your sanity check.
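Cohen's κ compares observed agreement with the agreement two annotators would reach by chance given their label frequencies. A minimal sketch for two annotators and categorical labels (the labels below are made up):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

annotator_1 = ["good", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "bad"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Here the annotators agree on 3 of 4 items (75%), but half of that is expected by chance, so κ is only 0.5. That is below the 0.60 bar, so the guidelines need another pass.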

Takeaway

Don't use BLEU for low-resource languages. Use chrF instead. If you have no references, use perplexity. If you're comparing morphologically-rich languages, use TER. The metric you choose should match your constraints, not your habit.

The best part? These metrics are open-source and free. chrF is a few lines of Python. TER is available via the TERCOM toolkit. No expensive APIs needed.

Your language model evaluation deserves better than a metric designed for English ↔ German abundance.