Search | Korea Science

Part-of-speech Tagging for Hindi Corpus in Poor Resource Scenario

Modi, Deepa;Nain, Neeta;Nehra, Maninder
- Journal of Multimedia Information System, v.5 no.3, pp.147-154, 2018
Natural language processing (NLP) is an emerging research area that studies how machines can be used to perceive and alter text written in natural languages. Different tasks can be performed on natural languages by analyzing them through annotation tasks such as parsing, chunking, part-of-speech tagging, and lexical analysis. These annotation tasks depend on the morphological structure of the particular language. The focus of this work is part-of-speech (POS) tagging for the Hindi language. POS tagging, also known as grammatical tagging, is the process of assigning a grammatical category to each word of a given text. These categories can be noun, verb, time, date, number, etc. Hindi is the most widely used and official language of India. It is also among the top five most spoken languages of the world. A diverse range of POS taggers is available for English and other languages, but these taggers cannot be applied to Hindi, as Hindi is one of the most morphologically rich languages. Furthermore, there is a significant difference between the morphological structures of these languages. Thus, in this work, a POS tagger is presented for the Hindi language. It takes a hybrid approach that combines probability-based and rule-based approaches. For tagging known words, a unigram model of the probability class is used, whereas for tagging unknown words various lexical and contextual features are used. Finite-state automata are constructed to demonstrate the different rules, and regular expressions are then used to implement these rules. A tagset is also prepared for this task, which contains 29 standard part-of-speech tags. The tagset also includes two unique tags, i.e., a date tag and a time tag, which support all possible formats. Regular expressions are used to implement all pattern-based tags such as time, date, number, and special symbols. The aim of the presented approach is to increase the correctness of automatic Hindi POS tagging while bounding the requirement for a large human-made corpus. The hybrid approach uses a probability-based model to increase automatic tagging and a rule-based model to bound the requirement for an already trained corpus. The approach is based on a very small labeled training set (around 9,000 words) and yields a best precision of 96.54% and an average precision of 95.08%. It also yields a best accuracy of 91.39% and an average accuracy of 88.15%.
https://doi.org/10.9717/JMIS.2018.5.3.147
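The hybrid idea described in the abstract above (a unigram lookup for known words, regular-expression rules for pattern-based tags) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the tag names (`NN`, `VB`, `DATE`, `TIME`, `NUM`, `SYM`, `UNK`), the patterns, and the function names are all assumptions, since the abstract gives neither the tagset details nor the actual rules.

```python
import re
from collections import Counter, defaultdict

# Hypothetical pattern rules; the paper's real rules and tag names are not
# given in the abstract. Order matters: more specific patterns come first.
PATTERN_RULES = [
    ("DATE", re.compile(r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}")),
    ("TIME", re.compile(r"\d{1,2}:\d{2}(:\d{2})?")),
    ("NUM", re.compile(r"\d+([.,]\d+)*")),
    ("SYM", re.compile(r"[।,.;:!?()\-]+")),
]


def train_unigram(tagged_words):
    """Map each known word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_words(words, model, default_tag="UNK"):
    """Tag known words by unigram lookup, unknown words by pattern rules."""
    tagged = []
    for word in words:
        if word in model:
            tagged.append((word, model[word]))
            continue
        for name, pattern in PATTERN_RULES:
            # fullmatch: the whole token must fit the pattern.
            if pattern.fullmatch(word):
                tagged.append((word, name))
                break
        else:
            # No rule matched; a real system would use further
            # lexical and contextual features here.
            tagged.append((word, default_tag))
    return tagged


# Toy training data, invented purely for illustration.
model = train_unigram([("राम", "NN"), ("राम", "NN"), ("गया", "VB")])
print(tag_words(["राम", "12/05/2018", "10:30", "42", "।"], model))
```

Putting the unigram lookup first mirrors the abstract's split between known and unknown words; the fallback `UNK` tag marks the place where the paper's lexical and contextual features would apply.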

Optical Character Recognition for Hindi Language Using a Neural-network Approach

Yadav, Divakar;Sanchez-Cuadrado, Sonia;Morato, Jorge
- Journal of Information Processing Systems, v.9 no.1, pp.117-140, 2013
Hindi is the most widely spoken language in India, with more than 300 million speakers. Because, unlike in English, there is no separation between the characters of texts written in Hindi, the Optical Character Recognition (OCR) systems developed for Hindi have a very poor recognition rate. In this paper we propose an OCR system for printed Hindi text in Devanagari script, using an Artificial Neural Network (ANN), which improves its efficiency. One of the major reasons for the poor recognition rate is error in character segmentation. The presence of touching characters in the scanned documents further complicates the segmentation process, creating a major problem when designing an effective character segmentation technique. The major steps followed by a general OCR system are preprocessing, character segmentation, feature extraction, and finally classification and recognition. The preprocessing tasks considered in the paper are conversion of grayscale images to binary images, image rectification, and segmentation of the document's textual contents into paragraphs, lines, words, and then basic symbols. The basic symbols, obtained as the fundamental unit from the segmentation process, are recognized by the neural classifier. In this work, three feature extraction techniques (histogram of projection based on mean distance, histogram of projection based on pixel value, and vertical zero crossing) have been used to improve the recognition rate. These feature extraction techniques are powerful enough to extract features of even distorted characters/symbols. For development of the neural classifier, a back-propagation neural network with two hidden layers is used. The classifier is trained and tested on printed Hindi texts. A correct recognition rate of approximately 90% is achieved.
https://doi.org/10.3745/JIPS.2013.9.1.117
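Two of the feature types named in the abstract above can be illustrated on a tiny binary image. This is a sketch under assumptions: the abstract does not define the features precisely, so "histogram of projection based on pixel value" is read here as counting foreground pixels per row and column, and "vertical zero crossing" as counting value changes down each column. The mean-distance variant is omitted, and the function names are hypothetical.

```python
def projection_histograms(image):
    """Count foreground (1) pixels in every row and every column."""
    row_hist = [sum(row) for row in image]
    col_hist = [sum(col) for col in zip(*image)]
    return row_hist, col_hist


def vertical_crossings(image):
    """For each column, count value changes when scanning top to bottom."""
    counts = []
    for col in zip(*image):
        counts.append(sum(1 for a, b in zip(col, col[1:]) if a != b))
    return counts


# A made-up 4x4 binary glyph (1 = ink), with a full top row that loosely
# resembles the Devanagari headline; not taken from the paper.
glyph = [
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]

rows, cols = projection_histograms(glyph)
print(rows, cols)                  # [4, 1, 2, 1] [1, 4, 2, 1]
print(vertical_crossings(glyph))   # [1, 0, 3, 1]
```

Concatenating such vectors gives a fixed-length input for a classifier, which is one plausible way features like these could feed the back-propagation network the abstract mentions.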

An Artificial Intelligence Approach for Word Semantic Similarity Measure of Hindi Language

Younas, Farah;Nadir, Jumana;Usman, Muhammad;Khan, Muhammad Attique;Khan, Sajid Ali;Kadry, Seifedine;Nam, Yunyoung
- KSII Transactions on Internet and Information Systems (TIIS), v.15 no.6, pp.2049-2068, 2021
AI combined with NLP techniques has promoted the use of Virtual Assistants and has made people rely on them for many diverse uses. Conversational Agents are the most promising technique for assisting computer users through their operation. An important challenge in developing Conversational Agents globally is transferring the groundbreaking expertise obtained in English to other languages. AI is making it possible to transfer this learning. There is a dire need to develop systems that understand secular languages. One such difficult language is Hindi, which is the fourth most spoken language in the world. Semantic similarity is an important part of Natural Language Processing, with applications such as ontology learning and information extraction, for developing conversational agents. Most of the research is concentrated on English and other European languages. This paper presents a corpus-based word semantic similarity measure for Hindi. An experiment involving the translation of the English benchmark dataset to Hindi is performed, investigating the incorporation of the corpus, with human and machine similarity ratings. A significant correlation between human intuition and the algorithm ratings has been calculated to analyze the accuracy of the proposed similarity measures. The method can be adapted to various applications of word semantic similarity or to a module for any other language.
https://doi.org/10.3837/tiis.2021.06.006

Hindi version of short form of douleur neuropathique 4 (S-DN4) questionnaire for assessment of neuropathic pain component: a cross-cultural validation study

Gudala, Kapil;Ghai, Babita;Bansal, Dipika
- The Korean Journal of Pain, v.30 no.3, pp.197-206, 2017
Background: Pain with neuropathic characteristics is generally more severe and associated with a lower quality of life compared to nociceptive pain (NcP). The short form of the Douleur Neuropathique en 4 Questions (S-DN4) is one of the most used and reliable screening questionnaires and is reported to have good diagnostic properties. This study aimed to cross-culturally validate the Hindi version of the S-DN4 in patients with various chronic pain conditions. Methods: The S-DN4 has already been translated into Hindi by Mapi Research Trust. This study assessed the psychometric properties of the Hindi version of the S-DN4, including internal consistency and test-retest reliability 3 days after the baseline assessment. Diagnostic performance was also assessed. Results: One hundred sixty patients with chronic pain, 80 each in the neuropathic pain (NeP) present and NeP absent groups, were recruited. Patients with NeP present reported significantly higher S-DN4 scores than patients in the NeP absent group (mean (SD), 4.7 (1.7) vs. 1.8 (1.6), P < 0.01). The S-DN4 was found to have an AUC of 0.88, with adequate internal consistency (Cronbach's α = 0.80) and test-retest reliability (ICC = 0.92), and an optimal cut-off value of 3 (Youden's index = 0.66, sensitivity and specificity of 88.7% and 77.5%). The diagnostic concordance rate between clinician diagnosis and the S-DN4 questionnaire was 83.1% (kappa = 0.66). Conclusions: Overall, the Hindi version of the S-DN4 has good internal consistency and test-retest reliability along with good diagnostic accuracy.
https://doi.org/10.3344/kjp.2017.30.3.197
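The diagnostic figures reported in the abstract above are related by standard formulas: Youden's index is sensitivity + specificity - 1, and Cohen's kappa compares observed agreement with chance agreement. The sketch below recomputes them from a 2x2 table. The cell counts (71, 9, 62, 18) are an assumption, inferred from the reported percentages and the group size of 80; the abstract does not state them, and the function name is hypothetical.

```python
def diagnostic_summary(tp, fn, tn, fp):
    """Compute standard screening metrics from 2x2 confusion counts."""
    n = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    youden = sensitivity + specificity - 1
    # Observed agreement between questionnaire and clinician diagnosis.
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": youden,
        "concordance": observed,
        "kappa": kappa,
    }


# Assumed counts: 80 NeP-present patients (71 screened positive) and
# 80 NeP-absent patients (62 screened negative). Inferred, not reported.
summary = diagnostic_summary(tp=71, fn=9, tn=62, fp=18)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts the values come out at about 0.8875, 0.775, 0.6625, 0.8313, and 0.6625, which is consistent with the reported 88.7%, 77.5%, 0.66, 83.1%, and 0.66 up to rounding.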

The Role of Contrast in Prosodically Induced Acoustic Variation

Choi, Han-Sook
- Phonetics and Speech Sciences, v.1 no.3, pp.29-37, 2009
This paper presents results from speech production experiments on English, Korean, and Hindi that compare variation in the acoustic expression of dissimilar phonological laryngeal contrasts in stops conditioned by prosodic prominence. Target stops are analyzed in utterance-initial, -medial, and -final positions, with variation in contrastive focal accent, in speech data from six male American English speakers, five male Seoul Korean speakers, and five male Delhi Hindi speakers. The results show that prosodic prominence conditions enhanced distinctiveness between contrastive segments in the three languages. The manner in which prosodic prominence and prosodic phrase structure are marked at the level of segmental variation is, however, found to be language-specific to some extent. In addition, a correlation between the size of the phonological inventory and the corresponding acoustic variation was found, but a linear correlation was not strongly supported by the findings of the present study.

A Text Processing Method for Devanagari Scripts in Android (Korean title: A Method for Processing Hindi Text on Android)

Kim, Jae-Hyeok;Maeng, Seung-Ryol
- The Journal of the Korea Contents Association, v.11 no.12, pp.560-569, 2011
In this paper, we propose a text processing method for Hindi characters, i.e., Devanagari script, on Android. The key points of the text processing are to devise automata, which define the rules for combining letters into syllables, and to implement a font rendering engine, which retrieves and displays the glyph images corresponding to specific characters. In general, an automaton depends on the type and the number of characters. For the soft keyboard, we designed the automata with 14 consonants and 34 vowels based on Unicode. Finally, a combined syllable is converted into a glyph index using the mapping table, which is used as a handle to load its glyph image. According to the multilingual framework of the FreeType font engine, Devanagari script can be supported at the system level by appending the implementation of our method to the font engine as the Hindi module. The proposed method is verified through a simple message system.
https://doi.org/10.5392/JKCA.2011.11.12.560
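The idea of an automaton that combines Devanagari letters into syllables can be illustrated with a deliberately simplified segmenter. This is not the automata from the paper: it applies only two assumed rules (a dependent sign attaches to the current syllable, and a consonant attaches when it follows a virama), and the function names and code point ranges chosen from the Unicode Devanagari block are this sketch's own.

```python
VIRAMA = "\u094D"  # Devanagari sign virama (halant)


def is_consonant(ch):
    """Devanagari consonants KA..HA (U+0915 to U+0939)."""
    return "\u0915" <= ch <= "\u0939"


def is_dependent_sign(ch):
    """Signs that cannot start a syllable: vowel signs, virama, nukta,
    and candrabindu/anusvara/visarga."""
    return (
        "\u093E" <= ch <= "\u094D"
        or ch == "\u093C"
        or "\u0901" <= ch <= "\u0903"
    )


def split_syllables(text):
    """Group characters into syllables using the two rules above."""
    syllables = []
    current = ""
    for ch in text:
        joins_previous = is_dependent_sign(ch) or (
            is_consonant(ch) and current.endswith(VIRAMA)
        )
        if current and joins_previous:
            current += ch
        else:
            if current:
                syllables.append(current)
            current = ch
    if current:
        syllables.append(current)
    return syllables


# U+0928 U+092E U+0938 U+094D U+0924 U+0947 ("namaste")
print(split_syllables("\u0928\u092E\u0938\u094D\u0924\u0947"))
```

For the "namaste" example this groups the six code points into three syllables, with the conjunct (consonant, virama, consonant, vowel sign) kept together as the last one. A rendering pipeline like the one described would then map each such syllable to a glyph index.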

Korean NPIs amu-(N)-to and amu-(N)-rato

Yoon, Young-Eun
- Language and Information, v.12 no.2, pp.21-47, 2008
This paper reviews the analysis of the so-called Korean NPIs, amu-(N)-to and amu-(N)-rato, proposed by An (2007). An proposes that the two so-called polarity items are identical semantically, tantamount to English even, but they are in complementary distribution due to the opposite scope properties of the emphatic particles to and rato contained in the NPIs in question. Resorting to Karttunen and Peters' (1979) and Wilkinson's (1996) scope analysis of even, Lahiri's (1998) analysis of Hindi NPIs, and Guerzoni's (2002) analysis of the negative bias of yes/no-questions containing minimizers, An accounts for the distributional properties of the two Korean NPIs. Given this, however, it is observed that unlike amu-(N)-to, amu-(N)-rato could be licensed in much broader contexts. Based on this observation, this paper proposes that the two particles to and rato are two different particles with different meanings.

The Interpretation of the Indian Stupa as Origin of Korean Pagoda (Korean title: Interpreting the Form of the Indian Stupa, Origin of the Pagoda: Based on Field Surveys across India)

Lee, Hee-Bong
- Journal of architectural history, v.18 no.6, pp.103-126, 2009
This study aims to discover the historical trends and changes of form of all stupas in India through field observation that is as direct as possible, by classifying, analyzing, and synthesizing the stupas. The study of Indian stupas in Korea has a number of shortcomings, since only an introductory, partial approach has been taken in order to seek the origin of the Korean pagoda. This study also aims to correct errors in stupa terminology in Chinese characters, caused by misinterpretation of the Hindi language, which was established by Japanese scholars several decades ago. Piled-up stupas were totally destroyed by pagans; therefore their remains tell us only of structure, material, size, and disposition. However, remains of carved stone at the torana and drum give us clues as to the original form of the stupa and worshipping activity, as well as the change to a more luxurious form. Many rock-cave stupas of India show us both simple forms matching the ascetic age of early Buddhism and luxurious changes in the Mahayana era, which introduced statues of Buddha. Indians recovered the spheric form of 'anda,' a Hindi term meaning cosmic egg, from the hemispheric form of the piled-up stupa. Therefore we might discard the erroneous term 'bokbal', which means an upset vessel. Railings and parasols became main factors of stupa design. Carved railings around the stupa became a sign of divinity. Serious worshipping activity made drums long or high and created multi-embossed stripes. Bases of circular drums of some cave stupas changed their shapes to rectangular or octagonal. Single parasols became multiple parasols with affluent flowerlike curved stems on carved stupas. The multistoried, elongated, and high parasols of Gandhara stupas are closely related to such factors as diverse changes of form in the Indian subcontinent. The four-sided torana gate and ayaka column of the circular form of original stupas suggest the rectangular form of the subsequent East Asian pagoda, and the higher and wider base of Indian stupas became the origin of the East Asian rectangular pagoda.

The Status of Ramsar wetlands in India: A review of ecosystem benefits, threats, and management strategies (Korean title: Status of Ramsar Wetlands in India: Ecosystem Benefits, Threats, and Management Strategies)

Farheen, K.S.;Reyes, N.J.D.G.;Jeon, M.S.;Kim, L.H.
- Journal of Wetlands Research, v.24 no.2, pp.123-141, 2022
Wetlands, known as "Jheelon" in Hindi, are among the most important natural resources, contributing various economic and ecological benefits. This study gives a short review of the current status of Ramsar wetlands in India. The wildlife species, conservation measures, and their significance in Indian wetlands are also explored. As of 2022, there are 49 Ramsar sites in India covering approximately 1,09363.6 km² of land. The largest Ramsar wetland is Sundarbans, while the smallest is Chandertal. It was found that preventing wetland loss is important, even though studies of wetland degradation caused directly by human activities in various nations, including India, are still limited. Monitoring and protecting natural wetlands, supporting scientific studies on the preservation and restoration of wetlands, and imposing regulations to limit pollutant discharges were recommended, allowing researchers, policymakers, and practitioners to better maintain wetlands and their ecosystem services.
https://doi.org/10.17663/JWR.2022.24.2.123

