Indian English ASR Challenge Data (ASR Speech Data) - NLTM Pilot
  • Contributor: ASR Consortia
  • Product Code: NLTMP-ASR-CHALLENGE-ENG-001

Added on: 10 Jun 2021

The data set comprises Indian English read speech and lecture speech, along with the corresponding transcriptions. The read speech covers genres such as politics, sports and entertainment. It was collected by Speech Lab, IITM: text was crawled from newspapers, and volunteers were asked to read it aloud. The lecture speech was obtained from Computer Science and Electrical lectures of NPTEL. The read speech corpus is named IITM, whereas the lecture speech corpus is referred to as NPTEL. A lexicon, baseline models, results and recipes to replicate the baseline experiments are also made available.

The following data sets are released for this challenge:
  • Train set: 280 hours (IITM 80 hours + NPTEL 200 hours)
  • Development set IITM: 6 hours
  • Development set NPTEL: 5 hours
  • Evaluation set IITM: 6 hours
  • Evaluation set NPTEL: 5 hours
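The split sizes above can be cross-checked against the total duration stated for the data set (302 hours). The following is a minimal sketch of that arithmetic; the dictionary layout and function names are illustrative assumptions and are not part of the released recipes.

```python
# Hours per split and corpus, copied from the data set description.
# The structure and names below are hypothetical, for illustration only.
SPLIT_HOURS = {
    "train": {"IITM": 80, "NPTEL": 200},
    "dev": {"IITM": 6, "NPTEL": 5},
    "eval": {"IITM": 6, "NPTEL": 5},
}


def total_hours(splits):
    """Sum the hours over every split and corpus."""
    return sum(sum(corpora.values()) for corpora in splits.values())


def hours_per_corpus(splits):
    """Sum the hours contributed by each corpus across all splits."""
    totals = {}
    for corpora in splits.values():
        for name, hours in corpora.items():
            totals[name] = totals.get(name, 0) + hours
    return totals


if __name__ == "__main__":
    print(total_hours(SPLIT_HOURS))       # 302
    print(hours_per_corpus(SPLIT_HOURS))  # {'IITM': 92, 'NPTEL': 210}
```

The train, development and evaluation splits add up to 280 + 11 + 11 = 302 hours, which agrees with the listed duration.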

Speech Data Attributes
Language: Indian-accented English
Transcription: Yes, available
Duration: 302 hours
Recording Environment: Studio or classroom recordings (read speech and lectures)
Speaker Type: Indian speakers of any native language
Sampling Rate: 16 kHz
No. of Audio Segments: Tokens: 48k; Sentences: 168k
Speaker Gender: Male and female
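After downloading, it can be useful to confirm that an audio file matches the 16 kHz figure listed above. The sketch below assumes the audio is stored as uncompressed WAV files, which this listing does not state; the function name is hypothetical. It uses only Python's standard `wave` module.

```python
import wave

# Expected sampling rate in Hz, taken from the attribute list.
EXPECTED_RATE_HZ = 16000


def has_expected_rate(path, expected=EXPECTED_RATE_HZ):
    """Return True if the WAV file at `path` has the expected sampling rate.

    Assumes an uncompressed PCM WAV file; other containers (e.g. FLAC)
    would need a different reader.
    """
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == expected
```

A call such as `has_expected_rate("utterance.wav")`, where the file name is a placeholder, returns `False` for files recorded at any other rate, so they can be flagged or resampled before training.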

