Using AI for Bengali folklore preservation

Much of our region’s folklore exists in fragile, hard-to-access forms — handwritten manuscripts, old print editions, and poetry in archaic language. Here’s a behind-the-scenes look at how we’re using AI across our folklore digitization pipeline to bridge the gap between these historical texts and today’s readers.

Digitization: Enhancing OCR with AI

Many of the books in our archive are over a century old. They have passed through countless hands and often carry paper creases, margin annotations, and other signs of wear that confuse traditional OCR (Optical Character Recognition) tools.

To tackle this, we use a two-step AI-enhanced digitization process:

  • First, we pass the raw OCR output through a large language model (LLM) to clean up the text — correcting spelling errors, removing scanning artifacts, and standardizing formatting.
  • Second, we ask the same model to summarize the cleaned content so that our human reviewers have a clearer understanding of the material when checking the output. This improves both the speed and accuracy of our review process.
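
As a rough illustration, the two steps above could be sketched as follows. This is a minimal sketch under stated assumptions, not the actual pipeline: `llm` is a hypothetical stand-in for whatever model client is used, and the prompt wording, the `digitize_page` function, and the `DigitizedPage` type are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: `LLM` is any function that sends a prompt to a model and
# returns its text reply. Swap in a real client of your choice.
LLM = Callable[[str], str]

# Illustrative prompt wording, not taken from a real pipeline.
CLEAN_PROMPT = (
    "The text below is raw OCR output from an old printed Bengali book. "
    "Correct spelling errors, remove scanning artifacts, and standardize "
    "the formatting. Do not add, drop, or reinterpret content. "
    "Return only the cleaned text.\n\nOCR text:\n{text}"
)
SUMMARY_PROMPT = (
    "Summarize the text below in a few sentences so that a human reviewer "
    "knows what to expect before checking it.\n\nText:\n{text}"
)


@dataclass
class DigitizedPage:
    raw_ocr: str   # untouched OCR output, kept for comparison during review
    cleaned: str   # step 1: model-corrected text
    summary: str   # step 2: short summary for the reviewer


def digitize_page(raw_ocr: str, llm: LLM) -> DigitizedPage:
    """Run the clean-up pass, then summarize the cleaned text."""
    cleaned = llm(CLEAN_PROMPT.format(text=raw_ocr)).strip()
    summary = llm(SUMMARY_PROMPT.format(text=cleaned)).strip()
    return DigitizedPage(raw_ocr=raw_ocr, cleaned=cleaned, summary=summary)
```

Keeping `raw_ocr` next to the cleaned text lets a reviewer compare the two and catch places where the model changed more than it should have.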

Translation: Making Archaic Bengali Accessible

A significant portion of medieval Bengali folklore comes in the form of narrative poetry written in archaic language. Even native speakers today often find it impenetrable without training.

Surprisingly, modern LLMs are proving capable of parsing these dense, poetic texts and reconstructing the gist of the narrative. We use them in two ways:

  • To help us interpret the original Bengali stories, giving us a rough understanding of the events, characters, and themes.
  • To translate the stories into English and even adapt them into prose retellings. While these AI-generated adaptations aren’t final drafts, they are often good enough to serve as functional plot summaries.
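
The two uses above might be wired together roughly like this. It is a hedged sketch only: `llm` is again a hypothetical prompt-in, text-out callable, and the prompts, the `draft_translation` function, and the `TranslationDraft` type are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: any function that takes a prompt and returns the model's reply.
LLM = Callable[[str], str]

# Illustrative prompt wording only.
INTERPRET_PROMPT = (
    "The passage below is medieval Bengali narrative poetry written in "
    "archaic language. Briefly explain the events, characters, and themes "
    "in plain language.\n\nPassage:\n{passage}"
)
RETELL_PROMPT = (
    "The passage below is medieval Bengali narrative poetry written in "
    "archaic language. Translate it into English and retell it as plain "
    "prose, keeping names as they appear.\n\nPassage:\n{passage}"
)


@dataclass
class TranslationDraft:
    interpretation: str       # rough reading of events, characters, themes
    english_retelling: str    # English prose adaptation
    reviewed: bool = False    # drafts start unreviewed until a human signs off


def draft_translation(passage: str, llm: LLM) -> TranslationDraft:
    """Produce an interpretation and an English prose draft of one passage."""
    interpretation = llm(INTERPRET_PROMPT.format(passage=passage)).strip()
    retelling = llm(RETELL_PROMPT.format(passage=passage)).strip()
    return TranslationDraft(interpretation, retelling)
```

The `reviewed` flag reflects the point that these outputs are working drafts: nothing is treated as final until a person has checked it.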

NLP Labeling: Structuring the Folktales

Once a story passes through our digitization and translation pipeline, we run it through another NLP (Natural Language Processing) phase using an LLM. Here, we extract and annotate key structural and cultural elements:

  • Genre classification (e.g., creation myth, romantic tragedy)
  • Identification of religious and cultural references
  • Named entity recognition to tag characters and place names

This labeling makes our archive searchable and linkable in ways that traditional literary archives are not — allowing researchers and creators to find stories with specific themes, characters, or motifs.
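
To make the idea concrete, here is one way the labels could be captured and turned into a simple search index. Treat it as a sketch under assumptions: the JSON keys, the `StoryLabels` type, and the `label_story` and `build_index` helpers are hypothetical, and `llm` stands in for any prompt-in, text-out model call.

```python
import json
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable

# Assumption: any function that takes a prompt and returns the model's reply.
LLM = Callable[[str], str]

# Illustrative prompt; the key names are invented for this example.
LABEL_PROMPT = (
    "Annotate the folktale below. Reply with JSON only, using the keys "
    '"genre" (string), "references" (list of religious or cultural '
    'references), "characters" (list of names), and "places" (list of '
    "names).\n\nStory:\n{story}"
)


@dataclass
class StoryLabels:
    genre: str
    references: list[str] = field(default_factory=list)
    characters: list[str] = field(default_factory=list)
    places: list[str] = field(default_factory=list)


def label_story(story: str, llm: LLM) -> StoryLabels:
    """Ask the model for labels and parse its JSON reply."""
    data = json.loads(llm(LABEL_PROMPT.format(story=story)))
    return StoryLabels(
        genre=str(data.get("genre", "unknown")),
        references=list(data.get("references", [])),
        characters=list(data.get("characters", [])),
        places=list(data.get("places", [])),
    )


def build_index(labelled: dict[str, StoryLabels]) -> dict[str, set[str]]:
    """Map lowercase keys such as 'character:<name>' to matching story IDs."""
    index: dict[str, set[str]] = defaultdict(set)
    for story_id, labels in labelled.items():
        index[f"genre:{labels.genre.lower()}"].add(story_id)
        for ref in labels.references:
            index[f"reference:{ref.lower()}"].add(story_id)
        for name in labels.characters:
            index[f"character:{name.lower()}"].add(story_id)
        for place in labels.places:
            index[f"place:{place.lower()}"].add(story_id)
    return dict(index)
```

A lookup such as `index.get("genre:creation myth", set())` then returns the IDs of every story tagged with that genre. Note that `json.loads` raises an error if the model wraps its reply in extra text, so a real system would need to handle malformed replies.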

Folklore Analysis: Identifying Global Motifs

The final step in our pipeline is what we call computational folklore analysis. Since the early 20th century, folklorists have classified folk narratives from around the world into recognizable motifs and tale types, from trickster patterns to resurrection plots.

We’ve digitized one such motif catalog into a structured database and now use LLMs to:

  • Automatically identify which motifs from the catalog appear in each story
  • Detect patterns across our collection, which in turn helps us trace cross-cultural influences
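
A simplified sketch of these two tasks is shown below. Everything specific in it is an assumption for illustration: the motif IDs are placeholders rather than entries from a real catalog, `llm` is a hypothetical prompt-in, text-out callable, and the `catalog` passed in is assumed to be a short list of candidate motifs rather than the full database.

```python
from collections import Counter
from typing import Callable

# Assumption: any function that takes a prompt and returns the model's reply.
LLM = Callable[[str], str]


def identify_motifs(story: str, catalog: dict[str, str], llm: LLM) -> list[str]:
    """Ask the model which catalogued motifs occur in a story.

    `catalog` maps motif ID -> description. IDs the model returns that are
    not in the catalog are discarded, and duplicates are dropped.
    """
    options = "\n".join(f"{motif_id}: {desc}" for motif_id, desc in catalog.items())
    prompt = (
        "Which of the catalogued motifs below appear in the story? "
        "Reply with a comma-separated list of motif IDs only.\n\n"
        f"Motifs:\n{options}\n\nStory:\n{story}"
    )
    reply = llm(prompt)
    found: list[str] = []
    for token in reply.replace("\n", ",").split(","):
        motif_id = token.strip()
        if motif_id in catalog and motif_id not in found:
            found.append(motif_id)
    return found


def motif_counts(tagged: dict[str, list[str]]) -> Counter:
    """Count how many stories each motif appears in (once per story)."""
    counts: Counter = Counter()
    for motif_ids in tagged.values():
        counts.update(set(motif_ids))
    return counts
```

Checking returned IDs against the catalog guards against the model inventing motifs that do not exist, and counting each motif once per story gives a first view of which patterns recur across the collection.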

অতিবেগুনী @otibeguni