Standard NLP evaluation metrics (BLEU, CIDEr) were designed for high-resource language pairs with multiple reference translations. That makes them a poor fit for low-resource languages (here, roughly those with fewer than 100M speakers and little parallel data), where you have at most one or two reference translations and a minimal annotation budget.
Evidence: Building a Uyghur-English translation system with exactly one reference translation per sentence and a team of three annotators, I found that BLEU would swing ±5 points between runs on identical data. CIDEr penalized valid paraphrases. Tokenization bias systematically punishes morphologically-rich languages like Turkish (9+ tokens for a single word) while favoring English (3 tokens). No standard metric reliably indicated whether the model was improving.
The solution: Character-level metrics (chrF) and semantic similarity measures (embedding-based scores) work where token-level metrics fail. chrF is more robust to morphological variants, embedding-based scores are more robust to paraphrases, and both hold up better under single-reference evaluation. For low-resource NLP, the standard metrics require replacement, not fine-tuning.
The Problem: BLEU Assumes Abundance
BLEU (Bilingual Evaluation Understudy) was designed to be scored against multiple reference translations. The original BLEU experiments and the NIST OpenMT test sets used up to 4 reference translations per test sentence, and even WMT (Workshop on Machine Translation) English↔German, which usually ships a single reference, draws on large, professionally translated test sets. This redundancy smooths out noise.
In low-resource settings, you're lucky to have one reference translation. With a single reference, BLEU becomes unstable: a valid paraphrase that differs from the reference is scored as an error. Scores can swing 5–10 points between runs on the same data.
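To make the paraphrase penalty concrete, here is a minimal sketch of the clipped n-gram precision that sits at the core of BLEU. It leaves out the brevity penalty and the geometric mean over orders, and the sentences and the function name `clipped_precision` are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n):
    """Modified n-gram precision: each hypothesis n-gram is credited at most
    as often as it appears in the single reference that contains it most."""
    hyp = ngrams(hypothesis.split(), n)
    if not hyp:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp.items())
    return matched / sum(hyp.values())


# Toy data (invented): a valid paraphrase of the only available reference.
hypothesis = "the kids are playing outside"
one_ref = ["the children play outdoors"]
two_refs = one_ref + ["the kids are playing outside today"]

print(clipped_precision(hypothesis, one_ref, 2))   # no bigram overlap at all
print(clipped_precision(hypothesis, two_refs, 2))  # second reference rescues it
```

With one reference the paraphrase shares no bigrams with it; adding a second reference that happens to use the same wording recovers every one of them. The hypothesis did not change, only the number of references.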
But that's only the first problem.
Problem 2: Tokenization Bias
Subword tokenization (BPE, WordPiece) introduced a clever solution to the out-of-vocabulary problem for neural models. It also introduced a systematic bias against morphologically-rich languages.
Consider this Turkish word:
kalabalıklaştığınızdan
(because you became crowded)
Tokenized (illustrative subword split): kal ##ab ##al ##ık ##laş ##tı ##ğın ##ız ##dan
(9+ tokens, depending on the vocabulary)
A rough English counterpart:
crowdedness
Tokenized: crowd ##ed ##ness
(3 tokens)
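Now, BLEU counts n-gram matches at the token level. If it is computed over subword tokens, a Turkish output that substitutes a morphologically related variant changes several of those 9 tokens and breaks every n-gram that spans them. If it is computed over whole words, as standard tooling does, the same variant turns the entire word into a miss even when only one suffix differs. Either way, the English output, with 3 tokens and far fewer inflected forms, loses much less for a comparable error.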
Same fluency. Different BLEU scores. This isn't a measurement problem; it's a metric bias problem.
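A small sketch of the effect, using a hypothetical pair of Turkish forms that differ in one suffix ("evlerimizden", from our houses, versus "evlerinizden", from your houses). The helper names are made up for this example, and the character overlap shown is a plain trigram precision, not a full metric.

```python
def char_ngrams(text, n):
    """Set of character n-grams in a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def word_match(hypothesis, reference):
    """Fraction of hypothesis words that appear verbatim in the reference."""
    hyp_words = hypothesis.split()
    ref_words = set(reference.split())
    return sum(word in ref_words for word in hyp_words) / len(hyp_words)


def char_trigram_overlap(hypothesis, reference):
    """Fraction of hypothesis character trigrams also found in the reference."""
    hyp = char_ngrams(hypothesis, 3)
    ref = char_ngrams(reference, 3)
    return len(hyp & ref) / len(hyp)


reference = "evlerinizden"    # "from your houses"
hypothesis = "evlerimizden"   # "from our houses": one suffix differs

print(word_match(hypothesis, reference))            # whole word counts as a miss
print(char_trigram_overlap(hypothesis, reference))  # most trigrams still match
```

At the word level the near-miss earns nothing, while at the character level most of the word still counts as correct.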
What Actually Works
Here's the practical guide when you're evaluating low-resource models:
1. chrF (Character F-Score) — The Workhorse
Compute F-score over character n-grams (typically all orders from 1 through 6) instead of token n-grams. This sidesteps word and subword tokenization.
- ✅ Works excellently with single references
- ✅ Language-agnostic (works for Turkish, Finnish, Uyghur equally well)
- ✅ Correlates with human judgment at roughly 0.70–0.85 (better than BLEU for many language pairs)
- ⚠️ Sensitive to length: the default weighting (β = 2) favors recall over precision, so longer outputs can score higher without being better
Use chrF when: You have 1 reference and morphological diversity to handle.
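The idea fits in a short function. The sketch below is a simplified chrF under stated assumptions: it averages precision and recall over character n-gram orders 1 to 6, combines them with β = 2, and ignores spaces. It is not guaranteed to reproduce the numbers from a reference implementation, which also handles corpus-level aggregation and optional word n-grams.

```python
from collections import Counter


def char_ngram_counts(text, n):
    """Count character n-grams, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Sentence-level chrF sketch: F-beta over averaged n-gram precision/recall."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngram_counts(hypothesis, n)
        ref = char_ngram_counts(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


print(chrf("evlerimizden", "evlerinizden"))  # partial credit for a near miss
print(chrf("abc def", "abc def"))            # identical strings
```

For reported results, prefer an established implementation so that scores are comparable across papers; the sketch is meant for understanding what the metric rewards.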
2. TER (Translation Edit Rate) — The Human Proxy
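Measure the minimum number of edits (insertions, deletions, substitutions, word shifts) needed to convert the hypothesis into the reference, normalized by reference length. Lower is better.
- ✅ Counts a moved phrase as a single shift edit, so reordering costs less than it does under n-gram metrics (fluency matters)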
- ✅ Correlates better with human judgment on morphologically-rich languages
- ⚠️ Computationally expensive; requires reference translation
Use TER when: You want a metric that mirrors how humans judge translations.
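As a rough illustration, the sketch below computes a TER-like score from word-level edit distance. It deliberately omits the shift operation, so it is an upper bound on true TER rather than TER itself; the function names are invented for this example.

```python
def word_edit_distance(hyp_words, ref_words):
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    prev = list(range(len(ref_words) + 1))
    for i, hyp_word in enumerate(hyp_words, 1):
        curr = [i]
        for j, ref_word in enumerate(ref_words, 1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def ter_without_shifts(hypothesis, reference):
    """Edits divided by reference length. Assumption: no shift edits."""
    hyp_words, ref_words = hypothesis.split(), reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hyp_words, ref_words) / len(ref_words)


print(ter_without_shifts("the cat sat on mat", "the cat sat on the mat"))
# Swapped words cost two substitutions here; real TER would count one shift.
print(ter_without_shifts("a b", "b a"))
```

The second call shows what the simplification costs: without shifts, a simple swap is charged twice, which is exactly the case full TER is designed to treat more leniently.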
3. Perplexity — When References Don't Exist
No gold-standard translations? Use perplexity computed on held-out in-domain data.
- ✅ Requires zero references (this is huge for low-resource)
- ✅ Stable even with tiny validation sets (100–1000 examples)
- ⚠️ Doesn't directly measure task quality (model A might have lower perplexity but worse translations)
- ⚠️ Domain-specific (meaningless to compare perplexity across different domains)
Use perplexity when: Collecting references is impossible, but you have in-domain data.
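Perplexity is the exponential of the average negative log-likelihood per token. The sketch below assumes you can already extract natural-log token probabilities from your model; how you obtain them depends on your framework, and the input format (one list of log-probabilities per sentence) is an assumption made for this example.

```python
import math


def corpus_perplexity(sentences_log_probs):
    """Perplexity over a corpus.

    sentences_log_probs: list of sentences, each a list of natural-log
    probabilities, one per token (assumed input format).
    """
    total_log_prob = 0.0
    total_tokens = 0
    for log_probs in sentences_log_probs:
        total_log_prob += sum(log_probs)
        total_tokens += len(log_probs)
    if total_tokens == 0:
        raise ValueError("need at least one token")
    return math.exp(-total_log_prob / total_tokens)


# Toy example: every token was assigned probability 0.25.
held_out = [[math.log(0.25)] * 3, [math.log(0.25)] * 5]
print(corpus_perplexity(held_out))
```

Averaging over tokens rather than over sentences keeps long sentences from being underweighted. Only compare perplexities between models that share the same tokenizer and the same held-out set.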
4. BERTScore with Multilingual Models — The Semantic Shortcut
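Use contextual embeddings from mBERT or XLM-RoBERTa to compute semantic similarity between reference and hypothesis. The encoder still tokenizes into subwords, but matching happens in embedding space, so surface differences in tokenization and morphology matter much less.
- ✅ Can be adapted to reference-free scoring by comparing the hypothesis with the source sentence, if you have a good cross-lingual semantic model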
- ✅ Handles paraphrases naturally
- ⚠️ Only works if your language is in the multilingual model's pretraining
- ⚠️ Fails silently for truly rare languages
Use BERTScore when: Your language is in XLM-R's pretraining and you want semantic matching.
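The matching step behind BERTScore can be sketched without any model: each hypothesis token is greedily matched to its most similar reference token by cosine similarity, and vice versa. The vectors below are toy stand-ins for real contextual embeddings, and the sketch leaves out the idf weighting and baseline rescaling that the full metric offers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_f1(hyp_vectors, ref_vectors):
    """BERTScore-style F1: greedy best-match cosine in both directions."""
    precision = sum(max(cosine(h, r) for r in ref_vectors)
                    for h in hyp_vectors) / len(hyp_vectors)
    recall = sum(max(cosine(r, h) for h in hyp_vectors)
                 for r in ref_vectors) / len(ref_vectors)
    if precision + recall <= 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy token embeddings (invented); real ones would come from mBERT or XLM-R.
hyp = [[1.0, 0.0], [0.0, 1.0]]
ref = [[0.0, 1.0], [1.0, 0.0]]  # same tokens, different order
print(greedy_match_f1(hyp, ref))
```

Because matching is done per token in embedding space, word order and exact surface form matter far less than under n-gram metrics. That is also why the metric fails silently when the encoder has never seen the language: the similarities are still numbers, they just stop meaning anything.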
The Decision Tree
| Your Constraint | Metric | Why |
|---|---|---|
| Single reference, any language | chrF | Stable, language-agnostic |
| No references, in-domain data | Perplexity | Only option; a rough proxy for ranking |
| Morphologically-rich language | TER | Human-like judgment |
| Want semantic matching | BERTScore (mBERT or XLM-R) | Tolerates paraphrase and surface variation |
| Budget: 10–20 examples | Human pairwise preference | Relative ranking works; absolute scores don't |
The Real Constraint: Annotation Budget
Here's what actually changes in low-resource settings: You can't afford big evaluation campaigns.
For English, you might evaluate with BLEU (automated, free) plus human judgment on 500 examples (expensive, but worthwhile). For a minority language, 500 references might be 20% of your entire corpus.
Cascade your evaluation:
- Run perplexity on all 10,000 examples (free)
- Use perplexity to identify high-uncertainty cases (examples where models A and B score close together)
- Get humans to judge only those uncertain cases (maybe 100 examples)
- Use the human judgments to make the final ranking
This way, automation does the heavy lifting, and human time is allocated where it matters.
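One way to sketch the selection step is below. It assumes you have a per-example perplexity from each of two models on the same examples, and it treats the examples with the smallest log-ratio between the two as the uncertain ones. The function name, the toy numbers, and the choice of log-ratio as the closeness measure are all assumptions for illustration.

```python
import math


def select_for_human_review(ppl_a, ppl_b, budget):
    """Return indices of the `budget` examples where the two models'
    perplexities are closest, i.e. where automation cannot separate them."""
    if len(ppl_a) != len(ppl_b):
        raise ValueError("both models must be scored on the same examples")
    # Log-ratio makes the gap relative: 10 vs 20 counts the same as 100 vs 200.
    gaps = [abs(math.log(a) - math.log(b)) for a, b in zip(ppl_a, ppl_b)]
    ranked = sorted(range(len(gaps)), key=lambda i: gaps[i])
    return sorted(ranked[:budget])


# Toy per-example perplexities for models A and B (invented numbers).
ppl_model_a = [10.0, 50.0, 30.0, 80.0]
ppl_model_b = [40.0, 52.0, 31.0, 20.0]
print(select_for_human_review(ppl_model_a, ppl_model_b, budget=2))
```

Examples where one model is clearly ahead are left to the automatic score; annotators only see the ones where the two models are nearly tied.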
One More Thing: Inter-Annotator Agreement
When you do use human evaluators, even if you can only afford 20 examples, measure inter-annotator agreement: Cohen's κ for two annotators, or Fleiss' κ or Krippendorff's α for three or more.
If κ < 0.60, your evaluation criteria are ambiguous — clarify the definition before scaling. If κ > 0.75, you can trust the scores.
At low-resource scale, this is your sanity check.
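Cohen's κ compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies. A minimal sketch for two annotators, with invented labels:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy judgments (invented): which system output is better, A or B?
annotator_1 = ["A", "A", "A", "B", "B"]
annotator_2 = ["A", "A", "B", "B", "B"]
print(cohens_kappa(annotator_1, annotator_2))
```

Here the annotators agree on 4 of 5 items, yet κ is noticeably lower than 0.8 because some of that agreement would be expected by chance. With only a handful of examples the estimate is noisy, so read it as a warning signal, not a precise measurement.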
Takeaway
Don't use BLEU for low-resource languages. Use chrF instead. If you have no references, use perplexity. If you're comparing morphologically-rich languages, use TER. The metric you choose should match your constraints, not your habit.
The best part? These metrics are open-source and free. chrF is a few lines of Python. TER is available via the TERCOM toolkit, and both are also implemented in the sacreBLEU library. No expensive APIs needed.
Your language model evaluation deserves better than a metric designed for English ↔ German abundance.