• Title/Summary/Keyword: English-Korean parallel corpus


A Hybrid Sentence Alignment Method for Building a Korean-English Parallel Corpus (한영 병렬 코퍼스 구축을 위한 하이브리드 기반 문장 자동 정렬 방법)

  • Park, Jung-Yeul;Cha, Jeong-Won
    • MALSORI / v.68 / pp.95-114 / 2008
  • The growing popularity of statistical methods in machine translation requires much larger parallel corpora. Korean-English parallel corpora, however, are not yet sufficiently available, and little research on this subject is being conducted. In this paper we present a hybrid method of aligning sentences for Korean-English parallel corpora. To build a Korean-English parallel corpus, we use bilingual newswire web pages, reading comprehension materials for English learners, computer-related technical documents, and help files of localized software. Our hybrid method combines sentence-length-based and word-correspondence-based methods. We report and evaluate the experimental results. Alignment results obtained with a full translation model are very encouraging, especially when applied to an SMT system: improvements of 0.66% in BLEU score and 9.94% in NIST score over the previous method.

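As a rough illustration of the hybrid idea described in the abstract above (a sentence-length signal combined with a word-correspondence signal), the following Python sketch scores a candidate Korean-English sentence pair. It is not the authors' implementation: the function names, the toy lexicon, the expected length ratio, and the equal weighting are all assumptions made for illustration.

```python
import math

# Hypothetical toy lexicon: Korean token -> set of English translations.
TOY_LEXICON = {
    "나는": {"i"},
    "학교에": {"school"},
    "갔다": {"went"},
}

def length_score(ko: str, en: str, expected_ratio: float = 1.0) -> float:
    """Return a value in (0, 1]; 1.0 when len(en)/len(ko) equals expected_ratio."""
    if not ko or not en:
        return 0.0
    ratio = len(en) / len(ko)
    return math.exp(-abs(math.log(ratio / expected_ratio)))

def lexical_score(ko: str, en: str, lexicon: dict) -> float:
    """Fraction of Korean tokens that have a lexicon translation in the English sentence."""
    ko_tokens = ko.split()
    en_tokens = set(en.lower().split())
    if not ko_tokens:
        return 0.0
    hits = sum(1 for tok in ko_tokens if lexicon.get(tok, set()) & en_tokens)
    return hits / len(ko_tokens)

def hybrid_score(ko: str, en: str, lexicon: dict = TOY_LEXICON, weight: float = 0.5) -> float:
    """Weighted sum of the length-based and word-correspondence-based scores (assumed weighting)."""
    return weight * length_score(ko, en) + (1 - weight) * lexical_score(ko, en, lexicon)

good = hybrid_score("나는 학교에 갔다", "I went to school")
bad = hybrid_score("나는 학교에 갔다", "The weather is nice today")
print(good > bad)
```

In this sketch a true translation pair scores higher than an unrelated pair because both signals agree; a real aligner would also search over one-to-many and many-to-one sentence groupings.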

Translating English By-Phrase Passives into Korean: A Parallel Corpus Analysis (영한 병렬 코퍼스에 나타난 영어 수동문의 한국어 번역)

  • Lee, Seung-Ah
    • Journal of English Language & Literature / v.56 no.5 / pp.871-905 / 2010
  • This paper is motivated by Watanabe's (2001) observation that English by-phrase passives are sometimes translated into Japanese object topicalization constructions. That is, the original English sentence in the passive may be translated into the active voice with the logical object topicalized. A number of scholars, including Chomsky (1981) and Baker (1992), have remarked that languages have various ways to avoid focusing on the logical subject. The aim of the present study is to examine the translation equivalents of the English by-phrase passives in an English-Korean parallel corpus compiled by the author. A small sample of articles from Newsweek magazine and its published Korean translation reveals that there are indeed many ways to translate English by-phrase passives, including object topicalization (12.5%). Among the 64 translated sentences analyzed and classified, 12 (18.8%) examples were problematic in terms of agent defocusing, which is the primary function of passives. Of these 12 instances, five cases were identified where an alternative translation would be more suitable. The results suggest that the functional characteristics of English by-phrase passives should be highlighted in translator training as well as language teaching.

A Study on the Performance Improvement of Machine Translation Using Public Korean-English Parallel Corpus (공공 한영 병렬 말뭉치를 이용한 기계번역 성능 향상 연구)

  • Park, Chanjun;Lim, Heuiseok
    • Journal of Digital Convergence / v.18 no.6 / pp.271-277 / 2020
  • Machine translation refers to software that translates a source language into a target language; following rule-based and statistical machine translation, neural machine translation is now being actively researched. One of the important factors in neural machine translation is obtaining a high-quality parallel corpus, but high-quality parallel corpora for Korean language pairs have not been easy to find. Recently, AI Hub of the National Information Society Agency (NIA) released a high-quality Korean-English parallel corpus of 1.6 million sentences. This paper attempts to verify the quality of each dataset by comparing the performance obtained with the data published by AI Hub and with OpenSubtitles, the most popular Korean-English parallel corpus. For objectivity, the test data is the test set published by IWSLT, an official test set for Korean-English machine translation. Experimental results show better performance than existing papers evaluated on the same test set, which demonstrates the importance of high-quality data.

The Study on the Principles of Selecting Korean Particle 'Ka' and 'Nun' Using Korean-English Parallel Corpus (한영 병렬 말뭉치를 이용한 한국어 조사 '가'와 '는'의 선택 원리 연구)

  • Yoo, Hyun-Kyung;An, Ye-Ri;Yang, Su-Hyang
    • Language and Information / v.11 no.1 / pp.1-23 / 2007
  • This study aims to investigate the meaning of the Korean particles 'ka' and 'nun' inductively by examining the correspondences between those particles and English articles in a Korean-English parallel corpus. The correspondences were checked in three ways: semantically, syntactically and pragmatically. This study found that when the semantic or syntactic tier is not salient, the pragmatic tier is activated and particles are selected according to pragmatic elements such as the amount of information or a change of topic. However, if the meaning of the particles is salient or if there is a syntactic motive, particles are selected in accordance with the semantic or syntactic elements. Former studies, which focused on only one of those three tiers, cannot properly explain such correspondences in the Korean-English parallel corpus. This study shows that the semantic, syntactic and pragmatic tiers hierarchically affect the selection of a particle and that the selection process is also related to the speaker's intention. This dimensional analysis of particles is expected to contribute to theoretical studies as well as applied studies such as Korean language education.


Design and Construction of Korean-Spoken English Corpus (K-SEC) (한국인의 영어 음성 코퍼스 설계 및 구축)

  • Rhee Seok-Chae;Lee Sook-Hyang;Kang Seok-keun;Lee Yong-Ju
    • MALSORI / no.46 / pp.159-174 / 2003
  • K-SEC (Korean-Spoken English Corpus) is a speech database currently under construction by the authors of this paper. This article discusses the need for the K-SEC in various academic disciplines and industrial circles, and introduces the characteristics of the K-SEC design and the catalogues and contents of the recorded database, illustrating what is being considered from the phonetics and phonology of both Korean and English. The K-SEC can be regarded as the beginning of a parallel speech corpus, and it is suggested that similar corpora be expanded for the future advancement of experimental phonetics and speech information technology.


An Automatic Extraction of English-Korean Bilingual Terms by Using Word-level Presumptive Alignment (단어 단위의 추정 정렬을 통한 영-한 대역어의 자동 추출)

  • Lee, Kong Joo
    • KIPS Transactions on Software and Data Engineering / v.2 no.6 / pp.433-442 / 2013
  • A set of bilingual terms is one of the most important resources for building language-related applications such as machine translation systems and cross-lingual information systems. In this paper, we introduce a new approach that automatically extracts candidate English-Korean bilingual terms by using a bilingual parallel corpus and a basic English-Korean lexicon. This approach can be useful even when the parallel corpus is small. Sentence alignment is performed first on the document-level parallel corpus. Words in a pair of aligned sentences can then be aligned by referencing a basic bilingual lexicon. For the words that remain unaligned in a pair of aligned sentences, several assumptions are applied in order to align bilingual term candidates of the two languages. Examples of these assumptions are the location of a sentence, the relation between words, and linguistic information about the two languages. An experimental result shows approximately 71.7% accuracy for the English-Korean bilingual term candidates automatically extracted from a bilingual parallel corpus of size 1,000.
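
The two-stage idea in the abstract above (align words through a seed lexicon, then treat leftover words as term candidates) can be sketched in Python as follows. This is only an illustration under stated assumptions: `align_with_lexicon`, `candidate_terms`, the toy lexicon, and the "exactly one leftover word on each side" heuristic are hypothetical stand-ins, not the assumptions actually used in the paper.

```python
def align_with_lexicon(en_tokens, ko_tokens, lexicon):
    """Align tokens via a seed lexicon; return index pairs and the leftover tokens."""
    aligned = []      # (english_index, korean_index) pairs
    used_ko = set()   # Korean positions already aligned
    for i, en in enumerate(en_tokens):
        translations = lexicon.get(en.lower(), set())
        for j, ko in enumerate(ko_tokens):
            if j not in used_ko and ko in translations:
                aligned.append((i, j))
                used_ko.add(j)
                break
    used_en = {i for i, _ in aligned}
    rest_en = [t for i, t in enumerate(en_tokens) if i not in used_en]
    rest_ko = [t for j, t in enumerate(ko_tokens) if j not in used_ko]
    return aligned, rest_en, rest_ko

def candidate_terms(en_tokens, ko_tokens, lexicon):
    """Assumed heuristic: one leftover word per side is proposed as a term pair."""
    _, rest_en, rest_ko = align_with_lexicon(en_tokens, ko_tokens, lexicon)
    if len(rest_en) == 1 and len(rest_ko) == 1:
        return [(rest_en[0], rest_ko[0])]
    return []

# Hypothetical seed lexicon and sentence pair (content words only).
seed = {"install": {"설치"}}
print(candidate_terms(["install", "driver"], ["드라이버", "설치"], seed))
```

Because the lexicon only needs to cover common words, a sketch like this can propose candidates even from a small corpus, which matches the motivation stated in the abstract.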

English-Korean speech translation corpus (EnKoST-C): Construction procedure and evaluation results

  • Jeong-Uk Bang;Joon-Gyu Maeng;Jun Park;Seung Yun;Sang-Hun Kim
    • ETRI Journal / v.45 no.1 / pp.18-27 / 2023
  • We present an English-Korean speech translation corpus, named EnKoST-C. End-to-end model training for speech translation tasks often suffers from a lack of parallel data, such as speech data in the source language and equivalent text data in the target language. Most available public speech translation corpora were developed for European languages, and there is currently no public corpus for English-Korean end-to-end speech translation. Thus, we created EnKoST-C, centered on TED Talks. In this process, we enhance the sentence alignment approach using subtitle time information and bilingual sentence embedding information. As a result, we built a 559-h English-Korean speech translation corpus. The proposed sentence alignment approach achieved an excellent F-measure of 0.96. We also show the baseline performance of an English-Korean speech translation model trained with EnKoST-C. EnKoST-C is freely available on a Korean government open data hub site.
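
To make the alignment signals named in the abstract above more concrete, here is a minimal Python sketch that combines subtitle time overlap with embedding similarity and computes an F-measure over predicted alignments. Everything in it is an assumption for illustration: the segment layout (`"time"` and `"vec"` keys), the `alpha` weighting, and the use of plain lists in place of real bilingual sentence embeddings are hypothetical, and the paper's actual procedure may differ.

```python
import math

def time_overlap(a, b):
    """Overlap of two (start, end) intervals divided by the span from earliest start to latest end."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    span = max(a[1], b[1]) - min(a[0], b[0])
    return max(inter, 0.0) / span if span > 0 else 0.0

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def pair_score(en_seg, ko_seg, alpha=0.5):
    """Assumed linear mix of the timing signal and the embedding signal."""
    return (alpha * time_overlap(en_seg["time"], ko_seg["time"])
            + (1 - alpha) * cosine(en_seg["vec"], ko_seg["vec"]))

def f_measure(predicted, gold):
    """Harmonic mean of precision and recall over sets of alignment pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

en = {"time": (0.0, 2.0), "vec": [1.0, 0.0]}
ko = {"time": (0.0, 2.0), "vec": [1.0, 0.0]}
print(pair_score(en, ko))
```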

Design and Construction of Korean-Spoken English Corpus (K-SEC) (한국인의 영어 음성코퍼스 설계 및 구축)

  • Rhee Seok-Chae;Lee Sook-Hyang;Kang Seok-keun;Lee Yong-Ju
    • Proceedings of the KSPS conference / 2003.05a / pp.12-20 / 2003
  • K-SEC (Korean-Spoken English Corpus) is a speech database currently under construction by the authors of this paper. This article discusses the need for the K-SEC in various academic disciplines and industrial circles, and introduces the characteristics of the K-SEC design and the catalogues and contents of the recorded database, illustrating what is being considered from the phonetics and phonology of both Korean and English. The K-SEC can be regarded as the beginning of a parallel speech corpus, and it is suggested that similar corpora be expanded for the future advancement of experimental phonetics and speech information technology.


English Hedge Expressions and Korean Endings: Grammar Explanation for English-Speaking Learners of Korean (영어 완화 표지와 한국어 종결어미 비교 - 영어권 학습자를 위한 문법 설명 -)

  • Kim, Young A
    • Journal of Korean language education / v.25 no.1 / pp.1-27 / 2014
  • This study investigates how common English hedge expressions such as 'I think' and 'I guess' appear in Korean, with the aim of providing explicit explanations for English-speaking learners of Korean. Based on a contrastive analysis of spoken English and Korean corpora, this study argues three points. First, 'I guess' appears with a wider variety of modalities in Korean than 'I think'. Second, Korean textbooks use inappropriate registers in the English translations of '-geot -gat-': although these markers are used in spoken Korean, they were translated into written English. This study therefore suggests that '-geot -gat-' be translated as 'I think' in spoken English and as 'it seems' in written English and narratives. Last, the contrastive analysis shows that when 'I think' is used with deontic modalities, as in 'I think I have to', Korean uses '-a-ya-get-': using the hedge marker 'I think' with 'I have to', which expresses obligation or the speaker's volition, turns the deontic modality into an expression of the speaker's opinion.

Chunking Korean and an Application (한국어 낱말 묶기와 그 응용)

  • Un Koaunghi;Hong Jungha;You Seok-Hoon;Lee Kiyong;Choe Jae-Woong
    • Language and Information / v.9 no.2 / pp.49-68 / 2005
  • Application of chunking to English and some other European languages has shown that it is a viable parsing mechanism for natural languages. Although a small number of attempts have been made to apply chunking to the analysis of Korean, it is still not clear what criteria should identify appropriate units of chunking, or how efficient and valid chunking algorithms would be when applied to authentic Korean texts. The purpose of this research is to provide an alternative set of algorithms for chunking Korean, to implement them, and to test them against an English-Korean parallel corpus, namely English and Korean Bibles matched sentence by sentence. The paper shows that aligning related texts and identifying matched phrases between the two languages can be achieved through appropriate chunking and matching algorithms defined on the morphologically tagged parallel corpus. The chunking and matching processes are based on content words rather than function words, and the matching itself is done by means of a transfer dictionary. The implementation is done in C and XML, and can be accessed through the Internet.

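The content-word-based chunking and dictionary-based matching described in the abstract above can be illustrated with a small Python sketch. The paper's own implementation is in C and XML; the version below is only a toy under stated assumptions: the two-way tag set ("C" for content, "F" for function), the sample sentence, the transfer dictionary entries, and the function names are all hypothetical.

```python
def chunk(tagged):
    """Group (form, tag) pairs: each content morpheme opens a chunk and absorbs following function morphemes."""
    chunks = []
    for form, tag in tagged:
        if tag == "C" or not chunks:
            chunks.append([form])
        else:
            chunks[-1].append(form)
    return chunks

def match_chunks(ko_chunks, en_words, transfer):
    """Pair each Korean chunk with the first transfer-dictionary translation of its head found in the English sentence."""
    en_lower = [w.lower() for w in en_words]
    matches = []
    for ch in ko_chunks:
        head = ch[0]  # the content word that opened the chunk
        for candidate in transfer.get(head, []):
            if candidate in en_lower:
                matches.append(("".join(ch), candidate))
                break
    return matches

# Hypothetical morphologically tagged Korean sentence and toy transfer dictionary.
tagged_ko = [("하나님", "C"), ("이", "F"), ("세상", "C"), ("을", "F"), ("사랑하", "C"), ("셨다", "F")]
transfer = {"하나님": ["god"], "세상": ["world"], "사랑하": ["love", "loved"]}

ko_chunks = chunk(tagged_ko)
print(match_chunks(ko_chunks, "God so loved the world".split(), transfer))
```

Matching on the chunk head rather than on particles or endings reflects the abstract's point that content words, not function words, drive the alignment.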