• Title/Summary/Keyword: Corpus-based

Search Result 568, Processing Time 0.03 seconds

Modal Auxiliary Verbs in Japanese EFL Learners' Conversation: A Corpus-based Study

  • Nakayama, Shusaku
    • Asia Pacific Journal of Corpus Research
    • /
    • v.2 no.1
    • /
    • pp.23-34
    • /
    • 2021
  • This research examines Japanese non-native speakers' (JNNS) modal auxiliary verb use from two different perspectives: frequency of use and preferences for modalities. Additionally, error analysis is carried out to identify errors in modal use common among JNNSs. Their modal use is compared to that of English native speakers within a spoken dialogue corpus which is part of the International Corpus Network of Asian Learners' English. Research findings show at a statistically significant level that when compared to native speakers, JNNSs underuse past forms of modals and infrequently convey epistemic modality, indicating the possibility that JNNSs fail to express their opinions or thoughts indirectly when needed or to convey politeness appropriately. Error analysis identifies the following three types of common errors: (1) the use of incorrect tenses of modal verb phrases, (2) the use of inflected verb forms after modals, and (3) the non-use of main verbs after modals. The first type of error is largely because JNNSs do not master how to express past meanings of modals. The second and third types of errors seem to be due to first language transfer into second language acquisition and JNNSs' overgeneralization of the subject-verb agreement rules to modals respectively.

The Contruction of the Comparable Corpus Based on SGML (SGML 기반 비교 가능 코퍼스 구축)

  • 이창열;김용순;김성혁
    • Journal of the Korean Society for information Management
    • /
    • v.15 no.3
    • /
    • pp.7-26
    • /
    • 1998
  • The large scale documents of the data repository are utilized to the diverse applications. In the cross-language information retrieval, if the words of a query contain polymorphic meanings, the system needs multilingual corpus to exactly translate to the target words. We constructed the financial comparable corpus, called KFCM(Korean Financial Corpus corresponding to MLCC Corpus), comparing to the MLCC Polylingual Documents which consisted with the 6 European languages. It is independently constructed under the DTD of MLCC comparable corpus, and can be utilized to the cross-language information retrieval. In this paper, we discussed about the application and construction procedures of KFCM which is public domain data.

  • PDF

Lessons from Developing an Annotated Corpus of Patient Histories

  • Rost, Thomas Brox;Huseth, Ola;Nytro, Oystein;Grimsmo, Anders
    • Journal of Computing Science and Engineering
    • /
    • v.2 no.2
    • /
    • pp.162-179
    • /
    • 2008
  • We have developed a tool for annotation of electronic health record (EHR) data. Currently we are in the process of manually annotating a corpus of Norwegian general practitioners' EHRs with mainly linguistic information. The purpose of this project is to attain a linguistically annotated corpus of patient histories from general practice. This corpus will be put to future use in medical language processing and information extraction applications. The paper outlines some of our practical experiences from developing such a corpus and, in particular, the effects of semi-automated annotation. We have also done some preliminary experiments with part-of-speech tagging based on our corpus. The results indicated that relevant training data from the clinical domain gives better results for the tagging task in this domain than training the tagger on a corpus form a more general domain. We are planning to expand the corpus annotations with medical information at a later stage.

Corpus-Based Ambiguity-Driven Learning of Context- Dependent Lexical Rules for Part-of-Speech Tagging (품사태킹을 위한 어휘문맥 의존규칙의 말뭉치기반 중의성주도 학습)

  • 이상주;류원호;김진동;임해창
    • Journal of KIISE:Software and Applications
    • /
    • v.26 no.1
    • /
    • pp.178-178
    • /
    • 1999
  • Most stochastic taggers can not resolve some morphological ambiguities that can be resolved only by referring to lexical contexts because they use only contextual probabilities based ontag n-grams and lexical probabilities. Existing lexical rules are effective for resolving such ambiguitiesbecause they can refer to lexical contexts. However, they have two limitations. One is that humanexperts tend to make erroneous rules because they are deterministic rules. Another is that it is hardand time-consuming to acquire rules because they should be manually acquired. In this paper, wepropose context-dependent lexical rules, which are lexical rules based on the statistics of a taggedcorpus, and an ambiguity-driven teaming method, which is the method of automatically acquiring theproposed rules from a tagged corpus. By using the proposed rules, the proposed tagger can partiallyannotate an unseen corpus with high accuracy because it is a kind of memorizing tagger that canannotate a training corpus with 100% accuracy. So, the proposed tagger is useful to improve theaccuracy of a stochastic tagger. And also, it is effectively used for detecting and correcting taggingerrors in a manually tagged corpus. Moreover, the experimental results show that the proposed methodis also effective for English part-of-speech tagging.

A Method of Chinese and Thai Cross-Lingual Query Expansion Based on Comparable Corpus

  • Tang, Peili;Zhao, Jing;Yu, Zhengtao;Wang, Zhuo;Xian, Yantuan
    • Journal of Information Processing Systems
    • /
    • v.13 no.4
    • /
    • pp.805-817
    • /
    • 2017
  • Cross-lingual query expansion is usually based on the relationship among monolingual words. Bilingual comparable corpus contains relationships among bilingual words. Therefore, this paper proposes a method based on these relationships to conduct query expansion. First, the word vectors which characterize the bilingual words are trained using Chinese and Thai bilingual comparable corpus. Then, the correlation between Chinese query words and Thai words are computed based on these word vectors, followed with selecting the Thai candidate expansion terms via the correlative value. Then, multi-group Thai query expansion sentences are built by the Thai candidate expansion words based on Chinese query sentence. Finally, we can get the optimal sentence using the Chinese and Thai query expansion method, and perform the Thai query expansion. Experiment results show that the cross-lingual query expansion method we proposed can effectively improve the accuracy of Chinese and Thai cross-language information retrieval.

Sentence-Chain Based Seq2seq Model for Corpus Expansion

  • Chung, Euisok;Park, Jeon Gue
    • ETRI Journal
    • /
    • v.39 no.4
    • /
    • pp.455-466
    • /
    • 2017
  • This study focuses on a method for sequential data augmentation in order to alleviate data sparseness problems. Specifically, we present corpus expansion techniques for enhancing the coverage of a language model. Recent recurrent neural network studies show that a seq2seq model can be applied for addressing language generation issues; it has the ability to generate new sentences from given input sentences. We present a method of corpus expansion using a sentence-chain based seq2seq model. For training the seq2seq model, sentence chains are used as triples. The first two sentences in a triple are used for the encoder of the seq2seq model, while the last sentence becomes a target sequence for the decoder. Using only internal resources, evaluation results show an improvement of approximately 7.6% relative perplexity over a baseline language model of Korean text. Additionally, from a comparison with a previous study, the sentence chain approach reduces the size of the training data by 38.4% while generating 1.4-times the number of n-grams with superior performance for English text.

A study of flaps in American English based on the Buckeye Corpus (Buckeye corpus에 나타난 탄설음화 현상 분석)

  • Hwang, Byeonghoo;Kang, Seokhan
    • Phonetics and Speech Sciences
    • /
    • v.10 no.3
    • /
    • pp.9-18
    • /
    • 2018
  • This paper presents an acoustic and phonological study of the alveolar flaps in American English. Based on the Buckeye Corpus, the flapping tokens produced by twenty men are analyzed at both lexical and post-lexical levels. The data, analyzed with Pratt speech analysis, include duration, F2 and F3 in voicing during the flap, as well as duration, F1, F2, F3, and f0 in the adjacent vowels. The results provide evidence on two issues: (1) The different ways in which voiced and voiceless alveolar stops give rise to neutralized flapping stops by following lexical and post-lexical levels, (2) The extent to which the vowel features (height, frontness, and tenseness) affect flapping sounds. The results show that flaps are affected by pre-consonantal vowel features at the lexical as well as post-lexical levels. Unlike previous studies, this study uses the Praat method to distinguish flapped from unflapped tokens in the Buckeye Corpus and examines connections between the lexical and post-lexical levels.

Korean Nominal Bank, Using Language Resources of Sejong Project (세종계획 언어자원 기반 한국어 명사은행)

  • Kim, Dong-Sung
    • Language and Information
    • /
    • v.17 no.2
    • /
    • pp.67-91
    • /
    • 2013
  • This paper describes Korean Nominal Bank, a project that provides argument structure for instances of the predicative nouns in the Sejong parsed Corpus. We use the language resources of the Sejong project, so that the same set of data is annotated with more and more levels of annotation, since a new type of a language resource building project could bring new information of separate and isolated processing. We have based on the annotation scheme based on the Sejong electronic dictionary, semantically tagged corpus, and syntactically analyzed corpus. Our work also involves the deep linguistic knowledge of syntaxsemantic interface in general. We consider the semantic theories including the Frame Semantics of Fillmore (1976), argument structure of Grimshaw (1990) and argument alternation of Levin (1993), and Levin and Rappaport Hovav (2005). Various syntactic theories should be needed in explaining various sentence types, including empty categories, raising, left (or right dislocation). We also need an explanation on the idiosyncratic lexical feature, such as collocation and etc.

  • PDF

A Hybrid Sentence Alignment Method for Building a Korean-English Parallel Corpus (한영 병렬 코퍼스 구축을 위한 하이브리드 기반 문장 자동 정렬 방법)

  • Park, Jung-Yeul;Cha, Jeong-Won
    • MALSORI
    • /
    • v.68
    • /
    • pp.95-114
    • /
    • 2008
  • The recent growing popularity of statistical methods in machine translation requires much more large parallel corpora. A Korean-English parallel corpus, however, is not yet enoughly available, little research on this subject is being conducted. In this paper we present a hybrid method of aligning sentences for Korean-English parallel corpora. We use bilingual news wire web pages, reading comprehension materials for English learners, computer-related technical documents and help files of localized software for building a Korean-English parallel corpus. Our hybrid method combines sentence-length based and word-correspondence based methods. We show the results of experimentation and evaluate them. Alignment results from using a full translation model are very encouraging, especially when we apply alignment results to an SMT system: 0.66% for BLEU score and 9.94% for NIST score improvement compared to the previous method.

  • PDF

A study on the release burst spectra of the voiceless plosives from the English and Korean spontaneous speech corpus (영어와 한국어 자연발화 코퍼스에서의 무성 폐쇄음 개방 파열 스펙트럼 연구)

  • Hwang, Sunmi;Yoon, Kyuchul
    • Phonetics and Speech Sciences
    • /
    • v.9 no.4
    • /
    • pp.27-34
    • /
    • 2017
  • The purpose of this work is to examine the English and Korean voiceless plosives from the Buckeye[15] and Seoul[16] corpus in terms of their static spectral characteristics. The plosives were automatically extracted by a Praat script. In order to estimate the percent correctness in the classification of the plosives, discriminant analyses were performed whose trainings were based on four spectral moments, i.e. the center of gravity, variance, skewness and kurtosis as suggested in [6]. Another set of discriminant analyses were performed based on the spectral tilts. In the last set of analyeses, the spectral moments and tilts were both used in the training. Results showed that the correct classification rate did not exceed around 65% in the best case, which suggested that phonetic cues other than the release burst would be necessary including the dynamic spectral aspects and vowel-onset cues.