• Title/Summary/Keyword: parallel corpus

Building an Annotated English-Vietnamese Parallel Corpus for Training Vietnamese-related NLPs

  • Dien Dinh;Kiem Hoang
    • Proceedings of the IEEK Conference
    • /
    • summer
    • /
    • pp.103-109
    • /
    • 2004
  • In natural language processing (NLP) tasks, the greatest difficulty computers face is the inherent ambiguity of natural languages. Earlier systems resolved this ambiguity with human-devised rules. Building a complete rule set is time-consuming and labor-intensive, yet it still does not cover all cases. Moreover, as a system grows, the rule set becomes very difficult to control. For this reason, many NLP tasks have recently moved from rule-based approaches to corpus-based approaches that rely on large annotated corpora. Corpus-based NLP for widely used languages such as English and French has been well studied, with satisfactory results. In contrast, corpus-based NLP for Vietnamese is at a deadlock because annotated training data is lacking. Furthermore, hand-annotation of even reasonably well-determined features such as part-of-speech (POS) tags has proved labor-intensive and costly. In this paper, we present the construction of an annotated, aligned English-Vietnamese parallel corpus, named EVC, for training Vietnamese-related NLP tasks such as word segmentation, POS tagging, word-order transfer, word sense disambiguation, and English-to-Vietnamese machine translation.

Bilingual lexicon induction through a pivot language

  • Kim, Jae-Hoon;Seo, Hyeong-Won;Kwon, Hong-Seok
    • Journal of Advanced Marine Engineering and Technology
    • /
    • v.37 no.3
    • /
    • pp.300-306
    • /
    • 2013
  • This paper presents a new method for constructing bilingual lexicons through a pivot language. The proposed method is adapted from the context-based approach, known as the standard approach, which is widely used to build bilingual lexicons from comparable corpora. The main difference between the two lies in how context vectors are represented: the standard approach represents them in the target language, whereas the proposed method represents them in a pivot language, which makes it considerably simpler than the standard approach. Furthermore, the proposed method is more accurate than the standard approach because it uses parallel corpora instead of comparable corpora. The experiments are conducted on one language pair, Korean and Spanish. The experimental results show that the proposed method is attractive when a parallel corpus directly between the source and target languages is unavailable but both source-pivot and pivot-target parallel corpora are available.
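
The idea of representing context vectors in a pivot language can be illustrated with a minimal sketch. It is not the authors' implementation: the raw co-occurrence counts, the cosine similarity, the choice of English as the pivot, and the toy sentences are all assumptions made for illustration, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def pivot_vectors(pairs):
    """Map each word to counts of pivot words seen in its aligned sentences.

    `pairs` holds (sentence, pivot_sentence) tuples from a parallel corpus.
    """
    vectors = defaultdict(Counter)
    for sentence, pivot_sentence in pairs:
        pivot_tokens = pivot_sentence.split()
        for word in set(sentence.split()):
            vectors[word].update(pivot_tokens)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def best_translation(word, src_vecs, tgt_vecs):
    """Return the target word whose pivot-space vector is closest to `word`."""
    return max(tgt_vecs, key=lambda t: cosine(src_vecs[word], tgt_vecs[t]))


# Toy corpora (assumed data): Korean-English and Spanish-English, English as pivot.
ko_en = [("고양이 잔다", "cat sleeps"), ("개 잔다", "dog sleeps"), ("고양이 먹는다", "cat eats")]
es_en = [("gato duerme", "cat sleeps"), ("perro duerme", "dog sleeps"), ("gato come", "cat eats")]

src_vecs = pivot_vectors(ko_en)
tgt_vecs = pivot_vectors(es_en)
print(best_translation("고양이", src_vecs, tgt_vecs))
```

Because both sides are projected into the same pivot vocabulary, no seed dictionary is needed to translate context vectors before comparing them, which is the simplification the abstract refers to.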

An Automatic Extraction of English-Korean Bilingual Terms by Using Word-level Presumptive Alignment (단어 단위의 추정 정렬을 통한 영-한 대역어의 자동 추출)

  • Lee, Kong Joo
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.2 no.6
    • /
    • pp.433-442
    • /
    • 2013
  • A set of bilingual terms is one of the most important resources for building language-related applications such as machine translation systems and cross-lingual information systems. In this paper, we introduce a new approach that automatically extracts English-Korean bilingual term candidates by using a bilingual parallel corpus and a basic English-Korean lexicon. The approach remains useful even when the parallel corpus is small. Sentence alignment is performed first on the document-level parallel corpus. Words in each pair of aligned sentences are then aligned by consulting the basic bilingual lexicon. For words that remain unaligned, several assumptions are applied to pair up bilingual term candidates across the two languages. Examples of these assumptions include the position of a sentence, relations between words, and linguistic information about the two languages. The experimental results show approximately 71.7% accuracy for the English-Korean bilingual term candidates automatically extracted from a bilingual parallel corpus of size 1,000.
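
The two-stage procedure described above (lexicon-based word alignment, then pairing of the leftover words) can be sketched as follows. This is an illustrative sketch only: the abstract does not spell out the assumptions the paper uses, so the "exactly one leftover per side" rule below is a deliberately trivial stand-in, and the function names and toy lexicon are hypothetical.

```python
def align_with_lexicon(src_tokens, tgt_tokens, lexicon):
    """Link tokens whose translation is listed in the basic lexicon.

    Returns (aligned_pairs, unaligned_src, unaligned_tgt).
    """
    aligned = []
    used_tgt = set()
    unaligned_src = []
    for s in src_tokens:
        # First unused target position whose token is a known translation of s.
        match = next(
            (j for j, t in enumerate(tgt_tokens)
             if j not in used_tgt and t in lexicon.get(s, ())),
            None,
        )
        if match is None:
            unaligned_src.append(s)
        else:
            used_tgt.add(match)
            aligned.append((s, tgt_tokens[match]))
    unaligned_tgt = [t for j, t in enumerate(tgt_tokens) if j not in used_tgt]
    return aligned, unaligned_src, unaligned_tgt


def propose_candidate(unaligned_src, unaligned_tgt):
    """Toy assumption: pair the leftovers only when one remains on each side."""
    if len(unaligned_src) == 1 and len(unaligned_tgt) == 1:
        return (unaligned_src[0], unaligned_tgt[0])
    return None


# Assumed toy lexicon and one aligned sentence pair.
lexicon = {"reads": {"읽는다"}, "text": {"텍스트를"}}
english = ["parser", "reads", "text"]
korean = ["파서가", "텍스트를", "읽는다"]

aligned, left_en, left_ko = align_with_lexicon(english, korean, lexicon)
candidate = propose_candidate(left_en, left_ko)
print(aligned, candidate)
```

Words already covered by the lexicon are removed from consideration first, so the candidate search only has to look at the small residue of unaligned words in each sentence pair.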

Automatic Extraction of Alternative Words using Parallel Corpus (병렬말뭉치를 이용한 대체어 자동 추출 방법)

  • Baik, Jong-Bum;Lee, Soo-Won
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.16 no.12
    • /
    • pp.1254-1258
    • /
    • 2010
  • In information retrieval, different surface forms of the same object can degrade system performance. In this paper, we propose a method that extracts alternative words by using translation words as the features of each word, taken from a parallel corpus of Korean/English title pairs in patent information. We also propose an association-word filtering method that removes association words from the alternative word list. Evaluation results show that the proposed method outperforms other alternative word extraction methods.

A study on performance improvement considering the balance between corpus in Neural Machine Translation (인공신경망 기계번역에서 말뭉치 간의 균형성을 고려한 성능 향상 연구)

  • Park, Chanjun;Park, Kinam;Moon, Hyeonseok;Eo, Sugyeong;Lim, Heuiseok
    • Journal of the Korea Convergence Society
    • /
    • v.12 no.5
    • /
    • pp.23-29
    • /
    • 2021
  • Recent deep learning-based natural language processing studies seek to improve performance by training on large amounts of data from various sources together. However, combining data from various sources into a single training set may hinder performance improvement. In machine translation, the data diverge because of differences in translation type (liberal, literal), style (colloquial, written, formal, etc.), domain, and so on, and combining such corpora into one for training can adversely affect performance. In this paper, we propose a new Corpus Weight Balance (CWB) method that considers the balance between parallel corpora in machine translation. In the experiments, the model trained on the balanced corpus outperformed the existing model. In addition, we propose an additional corpus construction process that can coexist with the human translation market and can build a high-quality parallel corpus even from a monolingual corpus.

Design and Construction of Korean-Spoken English Corpus(K-SEC) (한국인의 영어 음성 코퍼스 설계 및 구축)

  • Rhee Seok-Chae;Lee Sook-Hyang;Kang Seok-keun;Lee Yong-Ju
    • MALSORI
    • /
    • no.46
    • /
    • pp.159-174
    • /
    • 2003
  • K-SEC (Korean-Spoken English Corpus) is a speech database currently under construction by the authors of this paper. This article discusses the need for K-SEC across various academic disciplines and industrial circles, and introduces the characteristics of the K-SEC design, its catalogues, and the contents of the recorded database, illustrating what is being considered from the phonetics and phonology of both Korean and English. K-SEC can be regarded as the beginning of a parallel speech corpus, and we suggest that similar corpora be expanded to advance experimental phonetics and speech information technology.

Design and Construction of Korean-Spoken English Corpus (K-SEC) (한국인의 영어 음성코퍼스 설계 및 구축)

  • Rhee Seok-Chae;Lee Sook-Hyang;Kang Seok-keun;Lee Yong-Ju
    • Proceedings of the KSPS conference
    • /
    • 2003.05a
    • /
    • pp.12-20
    • /
    • 2003
  • K-SEC (Korean-Spoken English Corpus) is a speech database currently under construction by the authors of this paper. This article discusses the need for K-SEC across various academic disciplines and industrial circles, and introduces the characteristics of the K-SEC design, its catalogues, and the contents of the recorded database, illustrating what is being considered from the phonetics and phonology of both Korean and English. K-SEC can be regarded as the beginning of a parallel speech corpus, and we suggest that similar corpora be expanded to advance experimental phonetics and speech information technology.

Method for Detecting Errors of Korean-Chinese MT Using Parallel Corpus (병렬 코퍼스를 이용한 한중 기계번역 오류 탐지 방법)

  • Jin, Yun;Kim, Young-Kil
    • Annual Conference on Human and Language Technology
    • /
    • 2008.10a
    • /
    • pp.113-117
    • /
    • 2008
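  • In this paper, we propose a method that uses a parallel corpus to detect errors automatically, in order to improve the translation performance of a pattern-based machine translation system efficiently. Most errors in a translation system fall broadly into knowledge errors and engine errors. Typically, trained bilingual linguists detect and analyze such errors by reading large volumes of machine-translated output sentences, and then revise or extend the translation knowledge or improve the engine. This work, however, requires a great deal of time and effort. We therefore propose a method that automatically detects the knowledge and engine errors present in a translation system by comparing, in various ways, the target-language sentences of the parallel corpus (that is, the reference sentences) with the machine-translated output sentences. We applied the proposed method to a Korean-Chinese machine translation system, measured its precision and recall, and demonstrated that errors can be detected and extracted automatically.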
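
Comparing reference sentences with machine-translated output can be sketched in a few lines. The abstract only says the sentences are compared "in various ways", so the token-level similarity from Python's standard `difflib` module, the 0.5 threshold, and the toy Chinese sentences below are assumptions chosen for illustration, not the paper's method; `flag_suspect_translations` is a hypothetical name.

```python
from difflib import SequenceMatcher


def flag_suspect_translations(references, hypotheses, threshold=0.5):
    """Return (index, similarity) for outputs that diverge from the reference.

    Sentences are pre-tokenized, space-separated strings. The threshold is an
    arbitrary illustrative value.
    """
    flagged = []
    for i, (ref, hyp) in enumerate(zip(references, hypotheses)):
        score = SequenceMatcher(None, ref.split(), hyp.split()).ratio()
        if score < threshold:
            flagged.append((i, round(score, 2)))
    return flagged


# Assumed toy data: target-side reference sentences and system output.
references = ["我 喜欢 苹果", "他 去 学校"]
hypotheses = ["我 喜欢 苹果", "他 吃 饭"]

flagged = flag_suspect_translations(references, hypotheses)
print(flagged)
```

Sentences flagged this way would still need to be classified as knowledge errors or engine errors, a step this sketch does not attempt.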

English-Korean speech translation corpus (EnKoST-C): Construction procedure and evaluation results

  • Jeong-Uk Bang;Joon-Gyu Maeng;Jun Park;Seung Yun;Sang-Hun Kim
    • ETRI Journal
    • /
    • v.45 no.1
    • /
    • pp.18-27
    • /
    • 2023
  • We present an English-Korean speech translation corpus, named EnKoST-C. End-to-end model training for speech translation tasks often suffers from a lack of parallel data, such as speech data in the source language and equivalent text data in the target language. Most publicly available speech translation corpora were developed for European languages, and there is currently no public corpus for English-Korean end-to-end speech translation. We therefore built EnKoST-C around TED Talks. In this process, we enhanced the sentence alignment approach by using subtitle time information and bilingual sentence embedding information. As a result, we built a 559-hour English-Korean speech translation corpus. The proposed sentence alignment approach achieved an F-measure of 0.96. We also report the baseline performance of an English-Korean speech translation model trained with EnKoST-C. EnKoST-C is freely available on a Korean government open data hub site.

Building a Korean-English Parallel Corpus by Measuring Sentence Similarities Using Sequential Matching of Language Resources and Topic Modeling (언어 자원과 토픽 모델의 순차 매칭을 이용한 유사 문장 계산 기반의 위키피디아 한국어-영어 병렬 말뭉치 구축)

  • Cheon, JuRyong;Ko, YoungJoong
    • Journal of KIISE
    • /
    • v.42 no.7
    • /
    • pp.901-909
    • /
    • 2015
  • In this paper, to build a Korean-English parallel corpus from Wikipedia, we propose a method that finds similar sentences based on language resources and topic modeling. We first apply language resources (a Wiki-dictionary, numbers, and the Daum online dictionary) to match words sequentially. The Wiki-dictionary is constructed from Wikipedia titles. To take advantage of Wikipedia, we use the translation probabilities in the Wiki-dictionary for word matching. In addition, we improve the accuracy of the sentence similarity measure by using word distributions based on topic modeling. In the experiments, a previous study achieved an F1-score of 48.4% with a linear combination of language resources alone, and 51.6% when topic modeling over entire word distributions was added. Our proposed method, which adds translation probability to the language resources through sequential matching, achieved 58.3%, which is 9.9 percentage points higher than the previous study. When sequential matching of language resources was combined with topic modeling that considers only important word distributions, the proposed system achieved 59.1%, which is 7.5 percentage points higher than the previous study.
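
The reported improvements are differences in F1-score, that is, percentage points rather than relative gains. The short check below recomputes them from the four scores quoted in the abstract. The relative gains it also prints are derived here for comparison and are not stated in the source; `gain` is a hypothetical helper name.

```python
def gain(baseline, proposed):
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = round(proposed - baseline, 1)
    relative = round(100 * (proposed - baseline) / baseline, 1)
    return absolute, relative


# F1-scores (%) quoted in the abstract: previous study vs. proposed method.
language_resources_only = gain(48.4, 58.3)
with_topic_modeling = gain(51.6, 59.1)

print(language_resources_only)
print(with_topic_modeling)
```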