Speaking with Praveen Gorla and Ranjit Raja on Swecha’s Community centric AI project, data sets, and forthcoming tools.

Apr 02, 2026

In conversation with two members of Swecha’s team, we learn about developments in the organisation’s community-centric AI project, the forthcoming release of the data sets that its models are trained on, and an application that is presently in development and will be released to the public soon.

Praveen Gorla has been directly involved with Swecha since 2013, when he was a student at Jawaharlal Nehru Technological University (JNTU), and is presently a Research Scientist at Swecha’s community-centric AI project. Swecha has multiple initiatives in the free and open source software domains. While at JNTU he was part of a GLUG (GNU/Linux User Group) that conducted activities for Swecha there. Swecha’s AI initiative is Vishwam, a centre of excellence in AI in the Global South. Their website can be found here. For Swecha, Praveen and Ranjit Raja look after student and developer activities and the faculty. The faculty, drawn from engineering and non-engineering institutions such as pharmacy colleges, are regular college employees who collaborate with Swecha. They involve students in activities such as corpus collection initiatives, and shape the research ecosystem of Swecha.

Swecha itself, even before being established on 9 February 2005, started with the question of how to bridge the digital divide and spur innovation. Efforts towards this began from 2002 onwards. Free and open source software with community licences was thought to be the way to go, as opposed to Windows. So the goal was to make software available in local languages, moving towards building a Telugu operating system, which would require a glossary of words. The word used in Telugu for the internet, for example, is antarjalam, etymologically meaning “within the water”. Swecha Gonthuka, an initiative to create speech-to-text recognition software that could understand Telugu, was launched in 2021.

Coming back to the licence itself, however, what would it consist of? The Software Freedom Law Centre (SFLC) would host a panel discussing community AI on the 31st of January, where suggestions from those assembled would be welcome. The licence would also be a collaboration between SFLC, Swecha and IIIT. We should note that presently available licences such as the General Public Licence were originally meant for code and not for massive data sets, which is perhaps another reason why a new licence is being worked towards.

Commercially available AI systems have so far not made public the data sets that their large language models were trained on, or the models themselves. Making these public is something that Swecha is presently in the process of undertaking. Let us take Gontuka as an example: the data sets and LLMs that it has been trained on have not been made publicly available so far, as an appropriate people’s licence is still in the making. There is however the Swecha Telugu Automatic Speech Recognition dataset that is available here, though it does require login access to download. This corpus was collected as a part of the Swecha Gontuka drive.

Ranjit Raja, who was mentioned earlier, outlines the tools that Swecha is to release on the 17th. These include the Swecha Telugu Corpus, which is also being used within closed groups and by developers. It is a tool that features transcription and proofreading abilities. Apart from this there is also a peer review feature, which allows peers to edit descriptions and the release rights of a file. The corporate server application, which is yet to be released, can be thought of as the back end of the corporate user application, which is the app itself. The server would be the repository of the uploads from users and the place where the information is processed, for example where a transcript is extracted from an audio file. The result is then relayed back to the app, where it can be accessed by the user.
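The division of labour described here, with the app uploading a file, the server storing and processing it, and the result being relayed back to the app, can be pictured with a short sketch. This is only an illustration under stated assumptions and not Swecha's implementation: the class names ServerApplication and UserApplication, their methods, and the placeholder transcriber are all hypothetical, and a real system would call an actual speech recognition model over a network connection.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Upload:
    """One user upload held in the server's repository (hypothetical type)."""
    filename: str
    audio: bytes
    transcript: Optional[str] = None


@dataclass
class ServerApplication:
    """Hypothetical back end: stores uploads and processes them."""
    transcribe: Callable[[bytes], str]
    uploads: Dict[str, Upload] = field(default_factory=dict)

    def receive(self, filename: str, audio: bytes) -> str:
        # Store the upload and hand back an identifier for later lookups.
        upload_id = uuid.uuid4().hex
        self.uploads[upload_id] = Upload(filename, audio)
        return upload_id

    def process(self, upload_id: str) -> str:
        # Extract a transcript from the stored audio and keep it with the upload.
        upload = self.uploads[upload_id]
        upload.transcript = self.transcribe(upload.audio)
        return upload.transcript


class UserApplication:
    """Hypothetical front end: the app the user interacts with."""

    def __init__(self, server: ServerApplication) -> None:
        self.server = server

    def submit(self, filename: str, audio: bytes) -> str:
        # Upload the file, ask the server to process it, and return
        # the relayed transcript to the user.
        upload_id = self.server.receive(filename, audio)
        return self.server.process(upload_id)


def placeholder_transcriber(audio: bytes) -> str:
    # Assumption: stands in for a real speech recognition model.
    return f"<transcript of {len(audio)} bytes of audio>"


server = ServerApplication(transcribe=placeholder_transcriber)
app = UserApplication(server)
result = app.submit("story.wav", b"\x00\x01\x02")
print(result)
```

Keeping the transcriber as a pluggable function means the storage and relay logic stays the same whichever recognition model sits behind it.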

Apart from this there are also plans underway for an AI-based corpus review. The story generator that was launched earlier, built on a small language model trained on uploads of Chandamama Kathulu, is also being updated. It may be found here.

Subsequently, the latest version of Swecha’s operating system, which is AI enabled, will be launched on the day of the organisation’s foundation commemoration. The data sets that this AI would be trained on would include the corpus collected by volunteers during the Summer of AI internship programs. These include a collection of audio and text files covering recipes, customs, songs, places and rural professions, apart from the folk tale collection Chandamama Kathulu. These would be used, for example, to train the speech recognition system. Apart from this, the Summer of AI dataset preparation handbook is available here on a site run by Swecha, but it requires sign-in credentials to access.

12.01.2026