
Cut through clutter: Tamil dataset to train AI models
Netizens engaging with AI models in Tamil or any regional language often come across incoherent translations, jumbled sentences, bizarre word choices and poor grammar, but overlook them as the babble of a budding ecosystem.
But not Raju Kandaswamy, a senior IT professional, who believes such errors undermine Tamil's linguistic integrity.
"Current training datasets are heavily distributed in English language and therefore do not accurately represent Tamil language or its cultural context. Users over a period of time absorb and internalise these biases leading to slow erosion of cultural values," he said. An increasing number of people, including senior citizens, rarely pick up books and consume Tamil content only through the internet or through speech.
In a world increasingly mediated by large language models (LLMs), which now figure in web searches, shopping and education, this creates a problem.
Kandaswamy is principal consultant at Thoughtworks in Coimbatore and part of AI Tamil Nadu, a non-profit community aiming to improve how AI models work in Tamil. The team is building a large-scale Tamil language dataset to train AI models, and is collaborating with authors and other organisations to curate large, high-quality datasets.
Their plan is to use these data repositories to fine-tune open-source models such as Meta's Llama and make them available to anyone who wants to build Tamil-specific models.
He believes that these models can be used to deliver govt services, communicate welfare schemes, and bring vernacular education to the masses, especially the rural population. Abinaya Mahendiran, a natural language processing expert and member of AI Tamil Nadu, is leading the initiative, named Vidhai.
She too thinks it is crucial for preserving Tamil culture.
"Access to high-quality datasets in less represented languages is limited, and Tamil is no exception. Machine-translated content is often inaccurate. So, we collect original Tamil texts such as books, essays and articles from various sources, clean them, and annotate them with the help of volunteers, students, linguists, retirees, and teachers. A trove of Tamil books and printed material is yet to be digitised," she said.
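The cleaning step she describes can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not Vidhai's actual pipeline: the function names `tamil_ratio` and `clean_lines` and the 0.6 threshold are hypothetical, and a real project would add many more checks.

```python
import re
import unicodedata

# Code points of the Unicode Tamil block (U+0B80 to U+0BFF).
TAMIL_BLOCK = range(0x0B80, 0x0C00)


def tamil_ratio(text):
    """Return the share of non-space characters that fall in the Tamil block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    tamil = sum(1 for ch in chars if ord(ch) in TAMIL_BLOCK)
    return tamil / len(chars)


def clean_lines(lines, min_ratio=0.6):
    """Normalise lines, drop exact duplicates, and keep mostly-Tamil text.

    min_ratio is an assumed threshold chosen only for illustration.
    """
    seen = set()
    cleaned = []
    for line in lines:
        # Put equivalent character sequences into one canonical form.
        text = unicodedata.normalize("NFC", line)
        # Collapse runs of whitespace left over from scanning or scraping.
        text = re.sub(r"\s+", " ", text).strip()
        if not text or text in seen:
            continue
        if tamil_ratio(text) < min_ratio:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

A filter like this only removes obvious noise such as stray English boilerplate and repeated lines; judging quality and annotating the text would still fall to the volunteers and linguists she mentions.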
Today, many independent researchers and language enthusiasts are spending their own money to improve Tamil AI models. But as Abinaya notes, lack of computing resources and difficulty in mobilising volunteers are major barriers.
The Tamil Virtual Academy (TVA) has a digital library of more than 1 lakh (100,000) books containing around 1.5 crore (15 million) pages, spanning subjects from science to history. It is also developing tools like syntactic parsers, morphological analysers, and part-of-speech taggers, resources critical for NLP research.
Yet, fragmented efforts, siloed developments, and fuzzy copyright guidelines hinder collaboration. A senior official confirmed that TVA could collaborate with AI technologists, but ambiguity around copyright and fair use remains a bottleneck.
Navaneeth Malingan, founder of AI Tamil Nadu, is attempting to bridge the ecosystem by bringing together its various elements, from scouting for student volunteers and linguists to getting access to computing resources through corporate sponsorship.
He says these kinds of models are crucial for delivery of govt services for locals, while commercial AI models will be useful for most business cases. "The govt can use it to fill forms through voice, give instructions to farmers and teach Tamil to the younger generation.
Various stakeholders including govt and companies should be brought together to build these models suitable for the use cases," he asserted.
The community is currently fine-tuning existing AI models to improve their performance in Tamil, but it also aims to build one from scratch, albeit a small, domain-focused model.
It will use a tokenisation method inspired by Nannul, the 13th-century Tamil grammar treatise, to better reflect the language's morphological structure, instead of the widely used Byte-Pair Encoding (BPE) method. Tokenisation refers to the process of breaking down text into smaller units called tokens.
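A short Python sketch can show why the choice of starting unit matters for Tamil. It is not the Nannul-inspired method, whose details the team has not published here, and it is not BPE itself. It simply compares raw UTF-8 bytes, the starting point for some byte-level BPE tokenisers, with letters formed by attaching combining marks to their base character. The function name `split_into_letters` is hypothetical.

```python
import unicodedata


def split_into_letters(text):
    """Group each base character with the combining marks that follow it.

    Tamil vowel signs and the pulli are Unicode 'Mark' characters (category
    Mn or Mc), so attaching them to the preceding character approximates the
    letters a reader sees. This is a simplification, not a full
    grapheme-cluster algorithm.
    """
    units = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


# The word "Tamil" in Tamil script, written as escapes so the code points are explicit.
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

byte_units = list(word.encode("utf-8"))   # raw bytes, as a byte-level tokeniser first sees them
letter_units = split_into_letters(word)   # units closer to written Tamil letters

print(len(word), "code points")
print(len(byte_units), "UTF-8 bytes")
print(len(letter_units), "letters:", letter_units)
```

In this example the five code points of the word occupy 15 bytes but form only three written letters. A tokeniser that starts from bytes has to learn to reassemble those letters from data, which is one reason a linguistically informed scheme may suit Tamil better.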
From adopting the printing press in the 1500s (the first in India), to adopting Unicode for the internet, Tamil has consistently been an early adopter of new communication technologies. Now, it should be able to find its place in the AI age.
Related Articles


Time of India
4 hours ago
Brains on autopilot: MIT study warns AI is eroding human thought. Here's how to stay intellectually alive
Inventions have redefined the very existence of humankind, challenging us to alter the way we think, learn, and live. The printing press etched history in bold letters. Calculators reshaped arithmetic. Now, artificial intelligence has entered the scene, permeating every niche of human life and painting it with a palette of new possibilities. Yet, like every groundbreaking invention, this too carries its fair share of repercussions.
But what happens when the very tools built to extend the human mind begin to replace it? The answer is unsettling: it produces a generation with crippled thinking abilities. A profound transition is already underway, one that, like an asymptomatic disease, may erupt into a full-blown cognitive pandemic in the years ahead.
Generative AI systems like ChatGPT promise instant answers, elegant prose, and streamlined tasks. But we now stand on the precipice of bidding adieu to creativity. Beneath the sheen of this alluring technology lies a deeper question: Are we keeping our thinking abilities on the shelf, and completely forgetting how to think? A striking study by the Massachusetts Institute of Technology (MIT) has surfaced some troubling trends. And no, it's not good news for the next generation.
Inside the MIT study: The brain on ChatGPT
Computer scientist Nataliya Kosmyna and her team at MIT's Media Lab set out to investigate whether heavy reliance on AI tools like ChatGPT alters the way our brains function. The experiment involved 60 college students aged 18 to 39, who were assigned to write short essays using one of three methods: ChatGPT, Google Search, or no external tools at all. Equipped with EEG headsets to monitor their neural activity, participants crafted essays in response to prompts like 'Should we always think before we speak?' The results?
Students who wrote without any assistance demonstrated the highest levels of cognitive engagement, showing strong neural connectivity across brain regions responsible for memory, reasoning, and decision-making. They thought harder and more deeply. By contrast, ChatGPT users showed the lowest neural activity. Their thinking was fragmented, their recall impaired, and their essays often lacked originality. Many participants could not even remember what they had written, clear evidence that the information had not been internalised. AI hadn't just helped them write. It had done the writing for them. Their brains had taken a backseat.
The risk of outsourcing thought
Cognitive offloading, as the researchers call it, is not about convenience; it is about control. The more we allow machines to handle the hard parts of thinking, the less often we exercise the mental muscles behind critical thinking, creativity, and memory formation. Over time, these muscles can weaken. When participants who initially used ChatGPT were later asked to write without it, their brain activity increased, but it never reached the levels of those who had worked independently from the start. The inference is clear: the potential for deep thinking is at risk of erosion.
Tools reflect intent, not intelligence
The invention is usually treated as the scapegoat, but the problem is not the tool; it is how we decide to put it to use. As one teacher once said, 'Every tool carries with it a story, not of what it is, but how it is used.' AI, like a pair of scissors, is brilliant in design, but only when built with everyone in mind. For decades, scissors excluded left-handed children, not because the tool was faulty, but because its design lacked inclusivity. AI's story is no different. There are two roads: it can either democratise education or further deepen inequality. It can hone creativity or dull it. Our actions will decide which road we push the next generation to travel.
According to the World Bank, students from disadvantaged backgrounds are 50% less likely to access AI-powered learning tools compared to their peers (World Bank, 2024). And as UNESCO's 2024 Global Education Monitoring Report reveals, nearly 40% of teachers worldwide fear being replaced by machines. But those outcomes are not the fault of AI. They're the result of how we've chosen to implement it.
Used well, AI can elevate learning
When utilised cautiously, AI can still elevate the quality of education. A study by McKinsey Global Institute has shown that personalised learning with the help of AI tools can bolster a student's performance by 30%. An Organization for Economic Co-operation and Development study (OECD, 2022) reported similar findings, adding that AI can reduce teacher workloads, which is critical given that educators spend 50% of their time on administrative duties. In rural India, digital initiatives like National Digital Education Architecture (NDEAR) aim to use AI to reach over 250 million school-age children who lack access to quality teachers.
However, even in a world driven and dominated by artificial intelligence, the human element in learning cannot be substituted. The struggle of reflection and the delight of discovery dwell at the heart of human learning. As it is said, we must begin with the end in mind. Are we cultivating a cohort of students who merely complete tasks, or ones who can think beyond limits and add meaning?
'AI is already born. We must learn to co-exist.'
In a conversation with The Times of India, Siddharth Rajgarhia, Chief Learner and Director of DPS, said it emphatically: 'AI is already born; we cannot keep it back in the womb. It is important to learn to co-exist with the guest and keep our human element alive.'
That co-existence begins by redefining the role of AI, not as a shortcut, but as a companion in the learning journey. Here's how educators and students can stay intellectually alive in the age of automation:
Think before you prompt: Encourage students to brainstorm ideas independently before turning to AI.
Reclaim authorship: Every AI-assisted draft should be critically revised and fully owned by the student.
Foster metacognition: Teach learners to reflect on how they think, not just what they produce.
Center equity in design: Ensure tools are accessible to all learners, not just the digitally privileged few.
Use AI to deepen, not replace, curiosity: Let it challenge assumptions, not hand out ready answers.
Final thought: Let AI assist, but let humans lead
The brain was never meant to idle. It was designed to wrestle with complexity, to stumble and reframe, to wonder and imagine. When we surrender that process to machines and allow AI to become the default setting for thought, we risk losing more than creativity. We risk losing our cognitive ownership, our voices, and our opinions. When we forget to think, we let go of the very power of being human.
AI is not the villain here. We need to understand that it amplifies our intentions, good or bad, lazy or inspired. The future of learning and the workplace does not depend on the fastest prompt or smartest algorithm. It stands on the shoulders of the brightest minds who have kept their curiosity intact and who resist easy answers. At the core of learning lie educators who remind us that the goal of education is not just knowledge, it is wisdom.
We so wish that prompts could generate wisdom and a human element. Alas, they cannot. It needs to be developed by the vanguards of imagination.


Time of India
8 hours ago
Gautham, Bhavya team up for a whodunit film with sci-fi elements
Debut director Sooriyaprathap S will be directing a sci-fi crime thriller starring Gautham Ram Karthik and Bhavya Trikha. Talking about his untitled film, Sooriyaprathap says, 'Gautham plays a CB-CID officer investigating a serial killing. Criminal profiling isn't that explored in Tamil cinema and we are blending that in this story.' He believes that the biggest plus of his film will be its sci-fi element. 'Many have explored sci-fi and murder investigation separately, but blending both would be an experiment. Also, our audiences have always received crime thriller movies well,' he states.
Speaking about what led him to cast Gautham and Bhavya as the leads, he reveals, 'Gautham and I had previously discussed doing a sports drama. I've always felt he would be perfect as a cop. Even in Pathu Thala, his character was an undercover cop. With Bhavya, when I met her, she came across as a 'namma veettu ponnu'. She plays a very mature role in the film and we felt her attitude works well for the part.'
Sooriyaprathap, who was the associate director of Kochadaiiyaan, shares, 'After Kochadaiiyaan, I was involved in the story discussions for a few movies. Around 2016, I was supposed to do a web series in Soundarya (Rajinikanth) ma'am's production, but it didn't take off. I've now helmed a web series, which will be released later this year.'


Time of India
20 hours ago
US Senate debates whether to adopt revised state AI regulation ban
Two key US Republican senators agreed to a revised federal moratorium on state regulation of artificial intelligence, cutting it to five years and allowing states to adopt rules on child online safety and protecting artists' image or likeness.
Senate Commerce Committee chair Ted Cruz originally proposed securing compliance by blocking states that regulate AI from a $42 billion broadband infrastructure fund as part of a broad tax and budget bill. A revised version released last week would only restrict states regulating AI from tapping a new $500 million fund to support AI [...].
Under a compromise announced Sunday by Senator Marsha Blackburn, a critic of the state AI regulatory moratorium, the proposed 10-year moratorium would be cut to five years and allow states to regulate issues like protecting artists' voices or child online safety if they do not impose an "undue or disproportionate burden" on [...].
Tennessee passed a law last year dubbed the ELVIS Act to protect songwriters and performers from the use of AI to make unauthorized fake works in the image and voice of well-known artists. Texas approved legislation to bar AI use for the creation of child pornography or to encourage a person to commit physical self-harm or commit [...].
It is not clear if the change will be enough to assuage concerns. On Friday, 17 Republican governors urged the Senate to drop the AI plan.
"We cannot support a provision that takes away states' powers to protect our citizens. Let states function as the laboratories of democracy they were intended to be and allow state leaders to protect our people," said the governors led by Arkansas' Sarah Huckabee Sanders.
Commerce Secretary Howard Lutnick voiced his support for the revised measure, calling it a pragmatic compromise. "Congress should stand by the Cruz provision to keep America First in AI," Lutnick wrote on [...].
Congress has failed for years to pass any meaningful AI regulations or safety [...].
Senator Maria Cantwell, the top Democrat on the Commerce Committee, said the Blackburn-Cruz amendment "does nothing to protect kids or consumers. It's just another giveaway to tech companies." Cantwell said Lutnick could simply opt to strip states of internet funding if they did not agree to the moratorium.