
Evaluating Language Models for Low-Resource Languages: Beyond BLEU

By Vache Sarkissian · 5 min read
Updated June 3, 2026 · Reviewed March 29, 2026
Tags: NLP, evaluation, low-resource, machine translation, metrics

Written by Claude (Opus 4.6). Vache prompted, reviewed, and published. The data and benchmarks are real; the prose is AI-generated.

Standard NLP evaluation metrics (BLEU, CIDEr) are designed for high-resource language pairs with multiple reference translations, making them unsuitable for low-resource languages (fewer than 100M speakers) where you have at most one or two reference translations and minimal annotation budget.

Evidence: Building a Uyghur-English translation system with exactly one reference translation per sentence and a team of three annotators, I found that BLEU would swing ±5 points between runs on identical data. CIDEr penalized valid paraphrases. Tokenization bias systematically punishes morphologically rich languages like Turkish (9+ tokens for a single word) while favoring English (3 tokens). No standard metric reliably indicated whether the model was improving.

The solution: Character-level metrics (chrF) and semantic similarity measures (embedding-based scores) work where token-level metrics fail. They are robust to paraphrases, morphological variants, and single-reference evaluation. For low-resource NLP, the standard metrics require replacement, not fine-tuning.

The Problem: BLEU Assumes Abundance

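BLEU (Bilingual Evaluation Understudy) was designed with multiple reference translations in mind. Classic high-resource test sets, such as the NIST evaluations for Chinese→English and Arabic→English, supply 4 reference translations per test sentence. Even benchmarks that usually ship a single reference, like WMT (Workshop on Machine Translation) for English↔German, have thousands of professionally translated test sentences to average over. This redundancy smooths out noise.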

In low-resource settings, you're lucky to have one reference translation. With a single reference, BLEU becomes unstable: a valid paraphrase that differs from the reference is scored as an error. Scores can swing 5–10 points between runs on the same data.
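To make the single-reference problem concrete, here is a minimal sketch of clipped n-gram precision, the building block of BLEU. It is not a full BLEU implementation (no brevity penalty, no averaging over n-gram orders), and the function names and example sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hypothesis, references, n):
    """Clipped n-gram precision against one or more references."""
    hyp_counts = ngrams(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0
    # Each hypothesis n-gram is credited up to its maximum count in any reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

# Hypothetical example: the hypothesis is a valid paraphrase of the first reference.
hypothesis = "the meeting was postponed until monday"
one_ref = ["the meeting was delayed until monday"]
two_refs = one_ref + ["the meeting was postponed to monday"]

p_single = modified_precision(hypothesis, one_ref, 2)   # bigram precision, one reference
p_multi = modified_precision(hypothesis, two_refs, 2)   # bigram precision, two references
print(p_single, p_multi)
```

With only the first reference, the paraphrase "postponed" breaks two of the five bigrams. Adding a second reference that happens to use the same wording recovers one of them, which is the smoothing effect that a single-reference test set cannot provide.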

But that's only the first problem.

Problem 2: Tokenization Bias

Subword tokenization (BPE, WordPiece) introduced a clever solution to the out-of-vocabulary problem for neural models. It also introduced a systematic bias against morphologically rich languages.

Consider this Turkish word:

kalabalıklaştığınızdan
(because you got crowded)

Tokenized (illustrative WordPiece-style split): kal ##ab ##al ##ık ##las ##ti ##gin ##iz ##dan
(9 tokens; the exact split depends on the tokenizer)

A rough English counterpart:

crowdedness

Tokenized: crowd ##ed ##ness
(3 tokens)

Now, BLEU counts n-gram matches at the token level. When the Turkish model outputs those 9 tokens in a slightly different order, or substitutes a morphologically related variant, BLEU penalizes it heavily: every n-gram that spans a changed piece stops matching. The English word, with only 3 tokens, exposes far fewer n-grams to that kind of mismatch.

Same fluency. Different BLEU scores. This isn't a measurement problem; it's a metric bias problem.

What Actually Works

Here's the practical guide when you're evaluating low-resource models:

1. chrF (Character F-Score) — The Workhorse

Compute F-score over character n-grams (typically all orders from 1 to 6) instead of token n-grams. This sidesteps tokenization completely.

  • ✅ Works excellently with single references
  • ✅ Language-agnostic (works for Turkish, Finnish, Uyghur equally well)
  • ✅ Correlates with human judgment at 0.70–0.85 (better than BLEU for many language pairs)
  • ⚠️ Sensitive to output length (chrF usually weights recall above precision, so longer outputs can score higher without being better)

Use chrF when: You have 1 reference and morphological diversity to handle.
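A simplified chrF fits in a few lines of standard-library Python. The sketch below assumes character n-gram orders 1 through 6, whitespace removed before counting, precision and recall averaged over orders, and β = 2 so recall counts more than precision. The names `char_ngrams` and `chrf` are illustrative, and the scores are not guaranteed to match any particular toolkit digit for digit.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams after dropping whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified chrF: F-beta over averaged character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is too short for this order
        overlap = sum((hyp & ref).values())  # clipped matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Two Turkish forms that differ in a single suffix consonant:
# "evlerinizden" (from your houses) vs "evlerimizden" (from our houses).
print(chrf("evlerinizden", "evlerimizden"))
```

A word-level exact match would give this pair zero credit. The character-level score stays well above zero because most of the word is shared.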

2. TER (Translation Edit Rate) — The Human Proxy

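Measure the minimum number of edits (insertions, deletions, substitutions, word shifts) needed to turn the hypothesis into the reference, normalized by the reference length. TER is an error rate, so lower is better.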

  • ✅ Gives partial credit for word reordering (fluency matters)
  • ✅ Correlates better with human judgment on morphologically rich languages
  • ⚠️ Computationally expensive; requires reference translation

Use TER when: You want a metric that mirrors how humans judge translations.
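The core of TER is a word-level edit distance divided by the reference length. The sketch below is deliberately simplified: it handles insertions, deletions, and substitutions but not the word-shift operation, which makes it equivalent to word error rate rather than full TER. The function name `edit_rate` is an illustrative choice.

```python
def edit_rate(hypothesis, reference):
    """Word-level edit distance divided by reference length (shift-free TER)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming row: edits needed to turn hyp[:i] into ref[:j].
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1] / len(ref)

print(edit_rate("the cat sat", "the cat sat down"))  # one missing word out of four
```

Full TER additionally lets a contiguous block of words move for the cost of a single edit, so reordered output scores better than it does here. For real evaluations, an established implementation such as TERCOM is the safer choice.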

3. Perplexity — When References Don't Exist

No gold-standard translations? Use perplexity computed on held-out in-domain data.

  • ✅ Requires zero references (this is huge for low-resource)
  • ✅ Stable even with tiny validation sets (100–1000 examples)
  • ⚠️ Doesn't directly measure task quality (model A might have lower perplexity but worse translations)
  • ⚠️ Domain-specific (meaningless to compare perplexity across different domains)

Use perplexity when: Collecting references is impossible, but you have in-domain data.
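Perplexity is the exponential of the average negative log-probability per token. The sketch below assumes you have already extracted the natural-log probabilities your model assigned to each held-out token; how you obtain them depends on your framework and is not shown.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)

# Sanity check: a model that spreads probability evenly over 4 choices
# at every step has perplexity 4.
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))
```

Because the average is taken per token, perplexities are only comparable between models that share the same tokenizer and the same held-out text.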

4. BERTScore with Multilingual Models — The Semantic Shortcut

Use embeddings from mBERT or XLM-RoBERTa to compute semantic similarity between reference and hypothesis. Matching happens in embedding space rather than on surface strings, so differences in tokenization and morphology matter far less.

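  • ✅ Works with a single reference, since meaning is compared rather than exact wording (scoring with zero references means comparing against the source sentence instead, which only works if the model aligns both languages well)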
  • ✅ Handles paraphrases naturally
  • ⚠️ Only works if your language is in the multilingual model's pretraining
  • ⚠️ Fails silently for truly rare languages

Use BERTScore when: Your language is in XLM-R's pretraining and you want semantic matching.
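The matching step behind BERTScore can be sketched without any model: each token vector in one sentence is paired with its most similar vector in the other, and the averaged similarities give precision, recall, and F1. The tiny 2-dimensional vectors below are hypothetical stand-ins for contextual embeddings from mBERT or XLM-R, and the sketch omits extras such as IDF weighting and baseline rescaling.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_match_f1(hyp_vectors, ref_vectors):
    """Greedy-matching F1 over token embeddings (BERTScore-style)."""
    if not hyp_vectors or not ref_vectors:
        raise ValueError("both sentences need at least one token vector")
    # Precision: how well each hypothesis token is covered by the reference.
    precision = sum(max(cosine(h, r) for r in ref_vectors)
                    for h in hyp_vectors) / len(hyp_vectors)
    # Recall: how well each reference token is covered by the hypothesis.
    recall = sum(max(cosine(r, h) for h in hyp_vectors)
                 for r in ref_vectors) / len(ref_vectors)
    if precision + recall <= 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy vectors: the hypothesis covers only one of the two reference tokens.
hyp = [[1.0, 0.0]]
ref = [[1.0, 0.0], [0.0, 1.0]]
print(greedy_match_f1(hyp, ref))
```

In practice the vectors come from a forward pass through the multilingual model, which is where the pretraining-coverage caveat above applies: if the encoder never saw your language, the similarities carry little meaning even though the code still returns a number.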

The Decision Tree

| Your Constraint | Metric | Why |
| --- | --- | --- |
| Single reference, any language | chrF | Stable, language-agnostic |
| No references, in-domain data | Perplexity | Only option; good enough for ranking |
| Morphologically rich language | TER | Human-like judgment |
| Want semantic matching | mBERT + cosine similarity | Bypasses tokenization |
| Budget: 10–20 examples | Human pairwise preference | Relative ranking works; absolute scores don't |
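The same guidance can be written as a small helper that returns every recommendation that applies. This is only one possible reading of the table: the function name, the parameter names, and the choice to return all applicable metrics (rather than a single winner) are assumptions made for illustration.

```python
def recommend_metrics(num_references, has_in_domain_data=False,
                      morphologically_rich=False,
                      wants_semantic_matching=False,
                      human_examples_budget=0):
    """Return the metrics from the table that fit the given constraints."""
    recommendations = []
    if num_references >= 1:
        recommendations.append("chrF")
        if morphologically_rich:
            recommendations.append("TER")
        if wants_semantic_matching:
            recommendations.append("mBERT + cosine similarity")
    elif has_in_domain_data:
        # No references at all: fall back to perplexity on in-domain text.
        recommendations.append("perplexity")
    if human_examples_budget >= 10:
        recommendations.append("human pairwise preference")
    return recommendations

print(recommend_metrics(num_references=1, morphologically_rich=True))
print(recommend_metrics(num_references=0, has_in_domain_data=True,
                        human_examples_budget=15))
```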

The Real Constraint: Annotation Budget

Here's what actually changes in low-resource settings: You can't afford big evaluation campaigns.

For English, you might evaluate on BLEU (automated, free) + human judgment on 500 examples (expensive, but worthwhile). For a minority language, 500 references might be 20% of your entire corpus.

Cascade your evaluation:

  1. Run perplexity on all 10,000 examples (free)
  2. Use perplexity to identify high-uncertainty cases (models A and B are close)
  3. Get humans to judge only those uncertain cases (maybe 100 examples)
  4. Use the human judgments to make the final ranking

This way, automation does the heavy lifting, and human time is allocated where it matters.
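Steps 2 and 3 of the cascade can be sketched as a selection function. The sketch assumes you have one automatic score per example for each model (for instance a per-sentence perplexity) and treats "uncertain" as "the two models score almost the same"; both the function name and that definition of uncertainty are assumptions.

```python
def select_for_human_review(scores_a, scores_b, budget):
    """Return indices of the `budget` examples where the two models are closest."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same examples")
    # Smallest absolute gap first: these are the cases automation cannot decide.
    gaps = sorted((abs(a - b), i) for i, (a, b) in enumerate(zip(scores_a, scores_b)))
    return [index for _, index in gaps[:budget]]

# Hypothetical per-example scores for models A and B on four sentences.
scores_a = [12.0, 30.5, 8.1, 19.9]
scores_b = [12.3, 18.0, 8.0, 25.0]
print(select_for_human_review(scores_a, scores_b, budget=2))
```

Examples 2 and 0 have the smallest gaps, so they go to annotators first, while the examples with a clear automatic winner are skipped.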

One More Thing: Inter-Annotator Agreement

When you do use human evaluators (because you can afford 20 examples), measure inter-annotator agreement with Cohen's κ (two annotators) or Krippendorff's α (any number of annotators).

If κ < 0.60, your evaluation criteria are ambiguous; clarify the definitions before scaling. If κ > 0.75, you can trust the scores.

At low-resource scale, this is your sanity check.
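Cohen's κ compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies. A minimal sketch, with made-up labels for ten translations:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical judgments: the annotators agree on 8 of 10 items.
annotator_1 = ["good"] * 6 + ["bad"] * 4
annotator_2 = ["good"] * 5 + ["bad"] * 4 + ["good"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Here 80% raw agreement shrinks to a κ of about 0.58 once chance agreement is removed, which falls just under the 0.60 threshold above and would call for tightening the guidelines first.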

Takeaway

Don't use BLEU for low-resource languages. Use chrF instead. If you have no references, use perplexity. If you're comparing morphologically rich languages, use TER. The metric you choose should match your constraints, not your habit.

The best part? These metrics are open-source and free. chrF is a few lines of Python. TER is available via the TERCOM toolkit. No expensive APIs needed.

Your language model evaluation deserves better than a metric designed for English ↔ German abundance.

About the Author

Vache Sarkissian

Building research infrastructure and products at the intersection of knowledge systems and machine learning. Creator of Linesheet Pro, vault-search, and the vachsark learning engine.

© 2026 Vache Sarkissian·Built with Claude Code
vachsark.com