- Blockchain Council
- December 04, 2023
In the southwestern Indian state of Karnataka, a groundbreaking project has been underway: villagers are helping to create the country’s first AI-based chatbot for tuberculosis through an innovative language dataset initiative. Over the course of a few weeks, native speakers of Kannada read sentences aloud into a dedicated app. Kannada, spoken by more than 40 million people in India, is one of the country’s 22 official languages and among the 121 languages spoken by 10,000 people or more.
The endeavor is driven by the recognition that many Indian languages, including Kannada, are poorly covered by natural language processing (NLP), the branch of artificial intelligence that enables machines to understand and work with human language. This gap has left hundreds of millions of Indians cut off from valuable information and economic opportunities.
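One way to see the gap concretely: tokenizers trained mostly on English fragment Indian-language text into many more pieces, which typically degrades model quality and raises cost. The sketch below is illustrative only, assuming Python with the Hugging Face transformers library, GPT-2’s English-centric tokenizer as a stand-in, and an approximate Kannada sample sentence; none of these are from the projects described here.

```python
# Illustrative sketch: an English-centric byte-pair tokenizer (GPT-2's,
# used as a stand-in) splits Kannada text into far more tokens than
# comparable English text -- one symptom of the NLP coverage gap.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "English": "Tuberculosis can be cured with a full course of treatment.",
    # Approximate Kannada rendering of the same sentence.
    "Kannada": "ಕ್ಷಯರೋಗವನ್ನು ಪೂರ್ಣ ಚಿಕಿತ್ಸೆಯಿಂದ ಗುಣಪಡಿಸಬಹುದು.",
}

for language, text in samples.items():
    tokens = tokenizer.tokenize(text)
    print(f"{language}: {len(tokens)} tokens for {len(text)} characters")
```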
“For AI tools to work for everyone, they need to also cater to people who don’t speak English or French or Spanish,” said Kalika Bali, principal researcher at Microsoft Research India.
The villagers in Karnataka are part of a broader movement: thousands of speakers of various Indian languages are contributing speech data through the tech firm Karya. This data goes into datasets for major players like Microsoft and Google, supporting the development of AI models for education, healthcare, and other essential services.
“The government is pushing very strongly to create datasets to train large language models in Indian languages, and these are already in use in translation tools for education, tourism and in the courts,” said Pushpak Bhattacharyya, head of the Computation for Indian Language Technology Lab in Mumbai.
Bhashini, an AI-led language translation system, is a pivotal part of the Indian government’s push to digitize more services. The platform is an open-source initiative that crowdsources contributions from speakers of many languages; the collected data is used to train large language models for applications in education, tourism, and the legal system.
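Bhashini’s own interfaces are not reproduced in the article; as a rough sketch of what machine translation into an Indian language looks like in code, the snippet below uses Meta’s open NLLB model via Hugging Face purely as a stand-in, with FLORES-200 language codes (Kannada is kan_Knda).

```python
# Illustrative sketch: English-to-Kannada machine translation with an
# open multilingual model (NLLB), used here as a stand-in only.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",  # FLORES-200 code: English, Latin script
    tgt_lang="kan_Knda",  # FLORES-200 code: Kannada
)

text = "Please visit the clinic if your cough lasts more than two weeks."
print(translator(text)[0]["translation_text"])
```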
“Crowdsourcing also helps to capture linguistic, cultural and socio-economic nuances,” said Bali.
Pushpak Bhattacharyya highlighted the challenges in this ambitious undertaking, stating, “But there are many challenges: Indian languages mainly have an oral tradition, electronic records are not plentiful, and there is a lot of code mixing. Also, to collect data in less common languages is hard, and requires a special effort.”
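To see why code mixing is hard, consider that a single sentence can interleave Kannada and English scripts, so a system must first work out which language each word belongs to. The sketch below is an illustrative example, not drawn from any of the projects named; it tags tokens by Unicode block, a common first preprocessing step for such text.

```python
# Illustrative sketch: detecting code mixing by script. The Kannada
# script occupies the Unicode block U+0C80..U+0CFF.
def script_of(token: str) -> str:
    """Label a token as Kannada, Latin, or Other by its letters."""
    for ch in token:
        if "\u0C80" <= ch <= "\u0CFF":
            return "Kannada"
        if ch.isascii() and ch.isalpha():
            return "Latin"
    return "Other"

# A typical code-mixed sentence: English nouns inside a Kannada frame
# ("There is an appointment with the doctor tomorrow").
sentence = "ನಾಳೆ doctor ಹತ್ತಿರ appointment ಇದೆ"
for token in sentence.split():
    print(f"{token}\t{script_of(token)}")
```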
Of the more than 7,000 living languages worldwide, fewer than 100 are covered by major NLP systems, with English the best served. Grassroots organizations and startups around the world are working to bridge this gap: Masakhane is strengthening NLP research in African languages, and the United Arab Emirates has introduced Jais, a large language model powering generative AI applications in Arabic.
Crowdsourcing is a practical way to gather data in a country as diverse as India, and, as Bali noted above, it captures nuances that curated corpora miss. She stressed, however, that it must be done ethically, with awareness of bias and fair compensation for contributors.
Jugalbandi, a chatbot named after a duet in which two musicians riff off each other, uses language models from AI4Bharat and reasoning models from Microsoft, and can be accessed on WhatsApp, which is used by about 500 million people in India.
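The article does not spell out how those components fit together. As a purely hypothetical outline, a bot of this kind could run a translate-reason-translate loop like the one sketched below; every function here is a labeled placeholder, not the project’s actual code.

```python
# Hypothetical outline of a Jugalbandi-style message flow. None of
# these functions are the project's real APIs; they are placeholders.
def translate(text: str, src: str, tgt: str) -> str:
    # Placeholder for an Indic translation model (e.g. AI4Bharat's).
    raise NotImplementedError("plug in a translation model here")

def reason(question_en: str) -> str:
    # Placeholder for a hosted reasoning model (e.g. Microsoft's).
    raise NotImplementedError("plug in a reasoning model here")

def handle_message(user_text: str, user_lang: str = "kn") -> str:
    """Translate the question to English, answer it, translate back."""
    question_en = translate(user_text, src=user_lang, tgt="en")
    answer_en = reason(question_en)
    return translate(answer_en, src="en", tgt=user_lang)
```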
Karya, the tech firm behind the language datasets, works with non-profit organizations to identify workers living below the poverty line and pays them well above the minimum wage to generate data. Workers also own a share of the data they produce, giving them a chance to earn royalties and to contribute to AI products for their own communities in areas such as healthcare and farming.
“We see huge potential for adding economic value with speech data – an hour of Odia speech data used to cost about $3-$4, now it’s $40,” said Safiya Husain, co-founder of Karya.
India’s linguistic diversity also shapes the development of AI models for speech. Google-funded Project Vaani is collecting speech data from around 1 million Indians to support automatic speech recognition and speech-to-speech translation. Other initiatives, such as the Bengaluru-based EkStep Foundation’s AI-based translation tools and the government-backed AI4Bharat center’s Jugalbandi chatbot, further demonstrate AI’s impact on sectors including the legal system and welfare schemes.
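Project Vaani’s own systems are not described in the article; as a stand-in, the sketch below shows how a crowdsourced Kannada recording could be transcribed with OpenAI’s open Whisper model via Hugging Face. The model choice and the file name clip.wav are assumptions for illustration.

```python
# Illustrative sketch: transcribing a Kannada recording with an open
# multilingual speech recognition model (Whisper), as a stand-in.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    generate_kwargs={"language": "kannada", "task": "transcribe"},
)

# "clip.wav" is a hypothetical local audio file of Kannada speech.
print(asr("clip.wav")["text"])
```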
The economic value of these initiatives is evident for individuals like Swarnalata Nayak in Odisha, who earns extra income by contributing speech data in her native Odia to Karya. The demand for speech data has not only created economic opportunities but also empowered communities at the grassroots level.
Indian Prime Minister Narendra Modi’s emphasis on India’s role as a testing ground for innovative solutions aligns with the country’s commitment to digital inclusion.
“We are building ‘Bhashini,’ an AI powered language translation platform. It will support digital inclusion in all the diverse languages of India,” said the Prime Minister.
In conclusion, India’s drive to build language datasets for AI marks a significant step towards inclusive digital growth. The collaboration between government initiatives, tech firms, and grassroots organizations shows a holistic approach to overcoming linguistic barriers and harnessing AI for the benefit of all citizens. As the country continues on this journey, the impact on education, healthcare, and economic opportunity promises to be profound and far-reaching.