Hindi - Assamese Parallel Chunked Text Corpus ILCI
  • Contributor: ILCI Consortia
  • Product Code: ILCI-HIN-ASM-CHUNK-TEXT-035

Available Under License: Commercial, Research

Sample Download | size: 460.5KB | type: rar
Added on: 16 Jul 2019

Under the Indian Languages Corpora Initiative (ILCI) project, the ILCI Consortia, led by Jawaharlal Nehru University, New Delhi, has created a parallel corpus with Hindi as the source language and Assamese as the target language. The corpus covers the Health, Tourism, Agriculture and Entertainment domains. Each sentence has a unique sentence ID, and the complete corpus is in UTF-8 encoding. The translated sentences have been POS tagged (per the Bureau of Indian Standards (BIS) tagset) and chunked. The chunking guideline used in creating the corpus is also provided.

Text Corpus Attributes
Language: Hindi and Assamese
Parallel or Monolingual: Parallel (Hindi to Assamese)
Annotation: POS tagged, chunk tagged
No. of Sentences: 70,000 Hindi sentences; 70,000 Assamese sentences
Word Count: 1,128,215 Hindi words; 893,681 Assamese words
File Format: Text file
Encoding: UTF-8
File Size: 6.77 MB
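
The sentence and word counts listed above imply an average sentence length for each side of the corpus. As a rough illustration only, the short Python sketch below derives those averages and the Assamese-to-Hindi word ratio from the published figures. The variable and function names are illustrative assumptions and are not part of the corpus distribution; the sketch does not read the corpus files, whose internal layout is not described on this page.

```python
# Figures copied from the attribute list on this page.
SENTENCES = 70_000          # sentences per language
HINDI_WORDS = 1_128_215     # Hindi word count
ASSAMESE_WORDS = 893_681    # Assamese word count


def words_per_sentence(words: int, sentences: int) -> float:
    """Return the mean number of words per sentence."""
    return words / sentences


# Mean sentence length on each side of the parallel corpus.
avg_hindi = words_per_sentence(HINDI_WORDS, SENTENCES)
avg_assamese = words_per_sentence(ASSAMESE_WORDS, SENTENCES)

# Number of Assamese words per Hindi word across the whole corpus.
length_ratio = ASSAMESE_WORDS / HINDI_WORDS

print(f"Hindi:    {avg_hindi:.2f} words per sentence")
print(f"Assamese: {avg_assamese:.2f} words per sentence")
print(f"Assamese/Hindi word ratio: {length_ratio:.2f}")
```

These are corpus-wide averages; individual sentence pairs may differ considerably in length.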

Tags: Hindi, Assamese, Text Corpus, Chunked, Parallel text corpus

Disclaimer: The information provided on this page has been procured through different sources. Please write back to us at nplt_support[at]cdac[dot]in in case you would like to suggest an update.