AI · February 22, 2026 · 9 min read

Tamil NLP at 96% — what it actually took.

A Tamil sentence has more morphological variance than a German one. Generic models were never going to work.

We shipped a Tamil NLP system that classifies media articles at 96% accuracy across 14 taxonomies. For context: off-the-shelf multilingual BERT on the same data clocked 71%. Here's the gap we closed and how.

The problem with generic multilingual models

mBERT, XLM-R, and even IndicBERT treat Tamil as a morphologically light language. It isn't. A single Tamil verb can inflect across person, number, tense, aspect, and politeness, producing hundreds of surface forms that share a stem.

Generic tokenizers shatter those forms into meaningless subwords. "பார்க்கிறேன்" (I see) and "பார்த்தான்" (he saw) end up looking like unrelated strings.
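To make the fragmentation concrete, here is a toy greedy longest-match splitter in the spirit of WordPiece-style tokenizers. The vocabulary is invented for illustration (it is not mBERT's actual vocab); the point is that when the shared stem "பார்" is missing from the vocabulary, the two verb forms end up with no subword in common:

```python
# Toy greedy longest-match subword splitter (a simplification of how
# WordPiece-style tokenizers behave; the vocab below is invented for
# illustration and deliberately lacks the shared stem "பார்").
def greedy_split(word: str, vocab: set[str]) -> list[str]:
    pieces, i = [], 0
    while i < len(word):
        # take the longest vocab entry starting at i, else a single char
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

vocab = {"பார்க்", "கிறேன்", "பார்த்", "தான்"}

a = greedy_split("பார்க்கிறேன்", vocab)  # "I see"
b = greedy_split("பார்த்தான்", vocab)    # "he saw"
print(set(a) & set(b))                   # no shared pieces
```

With no overlapping pieces, the model has no lexical signal that the two forms share a meaning.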

The three-part fix

### 1. Morphology-aware tokenization

We pre-process with a Tamil morphological analyzer before the model ever sees a sentence. The stem survives; the inflections become features. This alone gained us ~11 percentage points.
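A minimal sketch of what that preprocessing can look like, with a toy lookup table standing in for a real Tamil morphological analyzer (the entries and feature tags here are illustrative assumptions, not the production analyzer's output):

```python
# Toy stand-in for a Tamil morphological analyzer: maps each surface
# form to (stem, inflectional features). A real analyzer covers the
# full lexicon; these two entries are for illustration only.
ANALYSES = {
    "பார்க்கிறேன்": ("பார்", ["PRES", "1SG"]),    # "I see"
    "பார்த்தான்":   ("பார்", ["PAST", "3SG.M"]),  # "he saw"
}

def preprocess(token: str) -> str:
    stem, feats = ANALYSES.get(token, (token, []))
    # emit the stem, then feature tags the model can attend to
    return " ".join([stem] + [f"<{f}>" for f in feats])

print(preprocess("பார்க்கிறேன்"))  # பார் <PRES> <1SG>
```

Both inflected forms now surface the same stem token, so the model sees them as related.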

### 2. Domain-specific corpus

Off-the-shelf Tamil embeddings are trained on Wikipedia + news. Our client's corpus was decades of news + cinema magazines + government notifications — three distinct registers. We built a 2.4M-article in-domain corpus and fine-tuned IndicBERT on masked-language modeling for 40 epochs.
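The core of masked-language-model fine-tuning is the corruption step. A from-scratch sketch of BERT-style dynamic masking (the mask id and 15% rate are standard defaults, assumed here rather than taken from our training config):

```python
import random

MASK_ID = 0  # placeholder id for the [MASK] token (assumption)

def mask_for_mlm(token_ids, mask_prob=0.15, rng=None):
    """BERT-style dynamic masking: hide ~15% of tokens; the model is
    trained to predict the originals (labels) from the corrupted input."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tid in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)
            labels.append(tid)    # model must recover this token
        else:
            inputs.append(tid)
            labels.append(-100)   # conventionally ignored by the loss
    return inputs, labels
```

Re-masking each epoch means 40 epochs over 2.4M articles show the model many different corruptions of the same in-domain text.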

### 3. Label-noise cleanup

Our client had 14 taxonomies labeled by different editors over a decade. Label consistency was garbage. We used confident learning (via the excellent Cleanlab library) to find ~3% of labels that were outright wrong and 8% that were ambiguous. Fixing them added another 7 points.
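The intuition behind confident learning can be shown in a few lines. This is a from-scratch simplification, not Cleanlab's actual API: an example is flagged when the model is confidently sure of a *different* class than the given label, using per-class self-confidence thresholds:

```python
# Simplified confident-learning sketch (not Cleanlab's implementation):
# threshold[c] = mean predicted probability of class c over examples
# labeled c; flag an example when another class beats its own label
# with probability at or above that class's threshold.
def label_issues(probs, labels):
    n_classes = len(probs[0])
    thresholds = []
    for c in range(n_classes):
        ps = [p[c] for p, y in zip(probs, labels) if y == c]
        thresholds.append(sum(ps) / len(ps) if ps else 1.0)
    flagged = []
    for idx, (p, y) in enumerate(zip(probs, labels)):
        best = max(range(n_classes), key=lambda c: p[c])
        if best != y and p[best] >= thresholds[best]:
            flagged.append(idx)
    return flagged

# The last example is labeled 1 but the model confidently predicts 0.
probs = [[0.9, 0.1], [0.2, 0.8], [0.85, 0.15], [0.9, 0.1]]
print(label_issues(probs, labels=[0, 1, 0, 1]))  # [3]
```

The flagged subset then goes to human review, which is how the wrong-vs-ambiguous split was made.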

The architecture

At serve time it's embarrassingly simple:

- Input Tamil text → morphological analyzer → IndicBERT fine-tuned on client corpus → softmax over 14 classes
- FastAPI service, 2-node Kubernetes deployment, p99 latency of 340ms
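The serving path reduces to a short pipeline function. A sketch with stand-in `analyzer` and `model` callables (the function names and shapes are assumptions; only the analyzer → encoder → softmax-over-14 shape comes from the architecture above):

```python
import math

def softmax(logits):
    """Numerically stable softmax over the 14 taxonomy classes."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(text, analyzer, model):
    """Sketch of the serving path: text → analyzer → encoder → softmax.
    `analyzer` and `model` are hypothetical stand-ins for the real
    morphological analyzer and the fine-tuned IndicBERT encoder."""
    features = analyzer(text)
    logits = model(features)      # 14 raw scores, one per class
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs
```

Everything stateful lives in the model weights, which is what makes the 2-node deployment sufficient.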

The hard work was data, not architecture.

What we'd do differently

With 2024's tools we'd probably use a 7B-param multilingual LLM with in-context examples instead of fine-tuning BERT. Accuracy would likely be similar, development time would halve, and inference cost would rise roughly 20x.
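A back-of-envelope version of that trade-off (every number below is an illustrative assumption, not a measured figure; only the ~20x ratio comes from the text):

```python
# Hypothetical cost model: all figures are illustrative assumptions.
ARTICLES_PER_MONTH = 1_000_000   # assumed volume
BERT_COST = 0.0001               # assumed $ per article for BERT serving
LLM_COST = 20 * BERT_COST        # the ~20x inference gap from the text

monthly_gap = ARTICLES_PER_MONTH * (LLM_COST - BERT_COST)
print(f"${monthly_gap:,.0f}/month extra inference spend")  # $1,900/month
```

At high volumes the recurring inference gap outweighs the one-time development savings, which is the shape of the economics that favored BERT here.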

For this client the economics favored the BERT approach. For new clients asking us this question in 2026, we'd benchmark both.

What this unlocks

Once a media archive is machine-classified at 96%, the downstream workflows compound:

- Auto-taxonomy on new articles (38% faster publishing)
- Searchable archive across 14 years of content
- Recommendation engine that actually works in Tamil

The client went from a catalog nobody could search to a data asset that's now one of their differentiators.