97:43:54 hours | 62.2 GB speech data | 1,916 speakers | 1,916 audio segments | 48 kHz | 16-bit WAV
The LDC-IL Multi-Lingual Raw Speech Corpus dataset is extracted from the raw speech
corpora published by LDC-IL in various Indian languages. It is built to serve
several needs: applications such as language identification modules, which
require samples from multiple languages; the exploration of cross-linguistic
variation and diatopic comparison, to determine what generalizations are
possible about the types of variable features; and the building of multilingual
phoneme sets and models.
The Multi-Lingual speech dataset is sampled from the content type
‘Creative Text-T2’, which is extracted mainly from literary sources. The
creative text of the LDC-IL Speech dataset comprises essays and short
stories. One of these essays or short stories, selected randomly from a
dataset, is assigned to a speaker to read aloud. The same story may be read
by multiple speakers.
The available Speech
Total speakers: 1,916 (958 female and 958 male)
A detailed explanation of the Multi-Lingual Raw Speech Corpus will be available in the Multilingual Raw Speech Documentation.
For research-based citations, please use the following:
Choudhary, Narayan Kumar, Rajesha N. & Manasa G. 2021. Multilingual Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview