Sarvam AI, a Bengaluru-based artificial intelligence startup, has announced the launch of Sarvam 1, its latest open-source large language model (LLM) tailored for Indian languages. The model reportedly supports 10 Indic languages, namely Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, and Telugu, in addition to English.

How Does It Operate?

Sarvam 1 operates on a 2-billion-parameter architecture and is built on a specialised tokeniser developed by Sarvam AI.
The model was trained on 4 trillion tokens, using Nvidia's H100 Tensor Core GPUs as its computing backbone. Sarvam AI used synthetic data generation to create training datasets for Indian languages. In addition to Nvidia's hardware, the model drew on Yotta's data centres and AI4Bharat's language technology resources.
"The Sarvam 1 model is the first example of an LLM trained from scratch with data, research, and compute being fully in India. We expect it to power a range of use cases including voice and messaging agents. This is the beginning of our mission to build full stack sovereign AI, and we are deeply excited to be working together with Nvidia towards this mission," said Dr. Pratyush Kumar, Sarvam AI's co-founder, as reported by The Hindu. Developers can access the base model on the open-source platform Hugging Face to create AI applications for Indic language users.
The model provides a toolset that developers can leverage to build applications such as automated customer support, voice recognition, and language translation tools.

Past Product Launches

In August this year, Sarvam AI introduced its first foundational model, Sarvam 2B, trained on 4 trillion tokens. The startup also launched AI voice agents for customer service and sales in Indian languages, available at Rs 1 per minute and targeted at industries such as healthcare and banking.
Additionally, Sarvam rolled out A1, a generative AI tool for legal drafting and data extraction, along with Shuka v1, an audio model for understanding spoken Indic languages, and APIs for text-to-speech and translation. Previously, in December last year, the startup launched India's first Hindi-focused open-source LLM, OpenHathi, based on Meta AI's Llama 2-7B model. The model aims to innovate in Indian language AI and claims to have achieved GPT-3.5-level accuracy for Indic languages. Furthermore, it underwent two-phase training to reduce tokenization costs, which are particularly high for Hindi due to limited training data.

Sarvam Faces Challenges Amidst Ambitious Product Launches

Sarvam AI has raised $54 million to develop AI models, reported The Ken.
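The tokenization cost noted above has a simple root cause: Devanagari script occupies about three UTF-8 bytes per character versus one for ASCII English, so byte-level tokenizers trained mostly on English data emit far more tokens for Hindi text of comparable length. A minimal sketch of this byte-count gap (the example strings are illustrative, not drawn from Sarvam's training data):

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per character: a rough proxy for how many
    byte-level tokens a string needs when few learned merges apply."""
    return len(text.encode("utf-8")) / len(text)

english = "How are you"   # ASCII: 1 byte per character
hindi = "आप कैसे हैं"       # Devanagari: 3 bytes per character (spaces aside)

print(utf8_bytes_per_char(english))  # 1.0
print(utf8_bytes_per_char(hindi))    # roughly 2.6
```

This is why Indic-focused models invest in purpose-built tokenisers: representing common Devanagari sequences as single tokens shrinks sequence lengths and, with them, training and inference costs.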
Yet, it is reportedly struggling to gain traction within India's AI community. The launch of Sarvam 2B drew comparisons to Google's Gemini, albeit at a smaller scale. This model, alongside the voice model Shuka, which combines speech-to-text with translation, made up its August product lineup.
However, functionality challenges have reportedly emerged, such as low transcription accuracy and poor handling of multilingual audio.
The post "Indian AI Startup Sarvam Launches LLM Trained on 10 Indic Languages" appeared first on MEDIANAMA.