Tamil Raw Speech Corpus
  • Contributor: CIIL Mysore
  • Product Code: CIIL-TAM-RAW-Speech-138
Sample Download | size: 2.8MB | type: zip
Added on: 27 Aug 2021

Dataset Description

139:11:41 Hours | 86 GB speech data | 452 Speakers | 60,287 Audio segments | 48 kHz | 16 bit wav.
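
As a quick sanity check after downloading, the stated audio format (48 kHz, 16 bit wav) can be verified against a file header. The following is a minimal sketch using Python's standard `wave` module; the function name `matches_corpus_format` is an assumption made for illustration and is not part of the corpus distribution.

```python
import wave

# Expected format, taken from the dataset description on this page.
EXPECTED_RATE_HZ = 48_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bit = 2 bytes per sample


def matches_corpus_format(path):
    """Return True if the WAV file at `path` is sampled at 48 kHz with 16-bit samples.

    `path` is a string path to any WAV file; this helper is hypothetical
    and only inspects the file header.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
        )
```

Calling `matches_corpus_format` on each extracted file would flag any segment that was resampled or converted along the way.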

 

Tamil is one of the longest-surviving classical languages in the world and one of the prominent languages of the Dravidian language family. It is widely spoken in the state of Tamil Nadu and the Union Territory of Pondicherry, in Sri Lanka, in Southeast Asian and Pacific countries such as Burma, Malaysia, Singapore, Indonesia, Indochina and Fiji, in South Africa and British Guiana, and on islands such as Mauritius and Madagascar. It is an official language of the Indian state of Tamil Nadu and the Indian Union Territory of Pondicherry, as well as of Sri Lanka and Singapore. Tamil has its own script. The language is highly agglutinative and is characterized by phonological simplicity, morphological parity and primitiveness. All affixes in Tamil are separable and significant, and the language has no nominative case termination and no arbitrary words.

The LDC-IL speech data was collected in the Kongu, Kumari, Madurai, Nellai, Salem and Thanjai regions, from speakers of both genders and different age groups. Each speaker recorded datasets that were randomly selected from a master dataset.

 

Details of the available speech corpus:

Total Speakers: 452 (214 Female and 219 Male)

Domain | Audio Segments | Duration (hh:mm:ss)
Contemporary Text (News) | 433 | 57:53:48
Creative Text | 429 | 14:21:31
Sentence | 10,764 | 14:51:03
Date Format | 842 | 01:20:17
Command and Control Words | 12,882 | 12:57:06
Person Name | 8,755 | 03:57:29
Place Name | 4,002 | 10:34:38
Most Frequent Word - Part | 12,813 | 11:14:05
Most Frequent Word - Full Set | 2,000 | 02:26:05
Phonetically Balanced | 3,860 | 04:55:10
Form and Function - Word | 3,507 | 04:40:29
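
The per-domain figures listed above can be cross-checked against the headline totals (60,287 audio segments, 139:11:41 hours). The sketch below copies the figures from this page and sums them; the helper names `to_seconds` and `to_hms` are illustrative assumptions, not part of any corpus tooling.

```python
# Per-domain figures copied from this page: (domain, audio segments, duration).
DOMAINS = [
    ("Contemporary Text (News)", 433, "57:53:48"),
    ("Creative Text", 429, "14:21:31"),
    ("Sentence", 10764, "14:51:03"),
    ("Date Format", 842, "01:20:17"),
    ("Command and Control Words", 12882, "12:57:06"),
    ("Person Name", 8755, "03:57:29"),
    ("Place Name", 4002, "10:34:38"),
    ("Most Frequent Word - Part", 12813, "11:14:05"),
    ("Most Frequent Word - Full Set", 2000, "02:26:05"),
    ("Phonetically Balanced", 3860, "04:55:10"),
    ("Form and Function - Word", 3507, "04:40:29"),
]


def to_seconds(hms):
    """Convert an 'hh:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert a number of seconds back to 'h:mm:ss' (hours may exceed 24)."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(segments for _, segments, _ in DOMAINS)
total_duration = to_hms(sum(to_seconds(duration) for _, _, duration in DOMAINS))

print(total_segments)  # 60287
print(total_duration)  # 139:11:41
```

Both sums agree with the totals stated in the dataset description.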



A detailed explanation of the Tamil Speech Corpus will be available in the Tamil Speech Data Documentation.

For research citations, please cite the following:


  • Ramamoorthy, L., Narayan Choudhary, Thennarasu S, Prem Kumar L R, Amudha R, Prabagaran R, Srikanth D. 2021. Tamil Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
  • Narayan Choudhary, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

 

Speech Data Attributes
  • Annotation: Raw Speech Corpus
  • Language: Tamil
  • Duration: 139:11:41
  • Speaker Type: Native
  • No. of Audio Segments: 60,287
  • Speaker Gender: Male and Female


Tags: Tamil, Raw Speech Corpus, Speech Corpus

Disclaimer: The information provided on this page has been procured through different sources. Please write back to us at nplt_support[at]cdac[dot]in in case you would like to suggest an update.