• Title/Summary/Keyword: English-Korean parallel corpus

Search Result 24

An Use of the Patterns for an Efficient Example-Based Machine Translation (효율적인 예제 기반 기계번역을 위한 패턴의 사용)

  • Lee, Gi-Yeong;Kim, Han-U
    • Journal of the Institute of Electronics Engineers of Korea CI / v.37 no.3 / pp.1-11 / 2000
  • Example-based machine translation is a new paradigm for resolving various problems caused by the rules of conventional rule-based machine translation. In pure example-based machine translation, however, it is very hard to find examples similar to the input sentences using a parallel corpus of reasonable size. This problem causes large overheads in the process of sentence generation. This paper proposes a new method of English-Korean transfer that uses both patterns and examples. The patterns are composed of sentence patterns and phrase patterns. The meta parts of the patterns make example-based machine translation more practical by raising the probability of finding similar examples. Using patterns together with examples can reduce ambiguities in source-language analysis and yield high-quality machine translation. Experimental results with a test corpus are also discussed.

Addressing Low-Resource Problems in Statistical Machine Translation of Manual Signals in Sign Language (말뭉치 자원 희소성에 따른 통계적 수지 신호 번역 문제의 해결)

  • Park, Hancheol;Kim, Jung-Ho;Park, Jong C.
    • Journal of KIISE / v.44 no.2 / pp.163-170 / 2017
  • Despite the rise of studies on spoken-to-sign language translation, the low-resource problems of sign language corpora have rarely been addressed. As a first step towards translating from spoken to sign language, we addressed the problems arising from resource scarcity when translating spoken language into manual signals using statistical machine translation techniques. More specifically, we proposed three preprocessing methods: 1) paraphrase generation, which increases the size of the corpora, 2) lemmatization, which increases the frequency of each word in the corpora and the translatability of new input words in spoken language, and 3) elimination of function words that are not glossed into manual signals, which matches the corresponding constituents of the bilingual sentence pairs. In our experiments, we used different types of English-American Sign Language parallel corpora. The experimental results showed that the system with each method, and with the combination of the methods, improved the quality of manual signal translation, regardless of the type of corpus.
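
Two of the preprocessing steps named in this abstract, lemmatization and function-word elimination, can be illustrated with a minimal sketch. The lemma table, the function-word list, and the name `preprocess` are hypothetical and chosen only for illustration; the abstract does not say which lemmatizer or word list the authors used, and paraphrase generation is not shown.

```python
# Hypothetical resources for illustration only; a real system would use a
# proper lemmatizer and a function-word list derived from the gloss data.
FUNCTION_WORDS = {"a", "an", "the", "is", "are", "to", "of"}
LEMMAS = {"bought": "buy", "books": "book", "went": "go"}


def preprocess(sentence):
    """Lowercase, lemmatize, and drop function words from a spoken-language sentence."""
    tokens = sentence.lower().split()
    # Lemmatization: map inflected forms to a base form so that rare
    # surface forms share counts in a small corpus.
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]
    # Function-word elimination: remove words assumed to have no
    # counterpart in the manual-signal glosses.
    return [tok for tok in lemmas if tok not in FUNCTION_WORDS]


example = preprocess("She bought the books")
print(example)
```

The output of such a step would be fed to the statistical translation system in place of the raw sentence, on both the training and the input side.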

Deletion-Based Sentence Compression Using Sentence Scoring Reflecting Linguistic Information (언어 정보가 반영된 문장 점수를 활용하는 삭제 기반 문장 압축)

  • Lee, Jun-Beom;Kim, So-Eon;Park, Seong-Bae
    • KIPS Transactions on Software and Data Engineering / v.11 no.3 / pp.125-132 / 2022
  • Sentence compression is a natural language processing task that generates concise sentences that preserve the important meaning of the original sentence. For grammatically appropriate sentence compression, early studies utilized human-defined linguistic rules. Furthermore, because sequence-to-sequence models perform well on various natural language processing tasks, such as machine translation, there have been studies that utilize them for sentence compression. However, the linguistic rule-based studies require all rules to be defined by humans, and the sequence-to-sequence model-based studies require a large amount of parallel data for model training. To address these challenges, Deleter, a sentence compression model that leverages the pre-trained language model BERT, was proposed. Because Deleter uses a perplexity-based score computed with BERT to compress sentences, it requires neither linguistic rules nor a parallel dataset. However, because Deleter compresses sentences considering only perplexity, it does not reflect the linguistic information of the words in the sentences. Furthermore, since the data used for pre-training BERT are far from compressed sentences, this can lead to incorrect sentence compression. To address these problems, this paper proposes a method to quantify the importance of linguistic information and reflect it in perplexity-based sentence scoring. In addition, by fine-tuning BERT on a corpus of news articles, which often contain proper nouns and often omit unnecessary modifiers, we allow BERT to measure perplexity appropriate for sentence compression. Evaluations on English and Korean datasets confirm that the sentence compression performance of sentence-scoring-based models can be improved by utilizing the proposed method.
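
The general idea of deletion-based compression driven by a sentence score can be sketched as a greedy loop. This is not the authors' model: the BERT perplexity is replaced by a toy stand-in (`toy_score`, an average of made-up per-token surprisal values), and the `importance` weights, `alpha`, and all other names are assumptions introduced for illustration.

```python
def compress(tokens, score_fn, importance, target_len, alpha=1.0):
    """Greedily delete one token at a time until target_len tokens remain.

    At each step the deletion with the lowest cost is applied, where
    cost = score of the remaining sentence (lower is better, as with
    perplexity) + alpha * importance of the deleted token.
    """
    tokens = list(tokens)
    while len(tokens) > target_len:
        best_cost, best_candidate = None, None
        for i, tok in enumerate(tokens):
            candidate = tokens[:i] + tokens[i + 1:]
            cost = score_fn(candidate) + alpha * importance.get(tok, 0.0)
            if best_cost is None or cost < best_cost:
                best_cost, best_candidate = cost, candidate
        tokens = best_candidate
    return tokens


# Toy stand-in for a language-model score (hypothetical values).
SURPRISAL = {"the": 1.0, "very": 5.0, "old": 4.0, "mayor": 2.0, "resigned": 2.0}


def toy_score(tokens):
    """Average per-token surprisal; 0.0 for an empty sentence."""
    if not tokens:
        return 0.0
    return sum(SURPRISAL.get(t, 3.0) for t in tokens) / len(tokens)


# Hypothetical linguistic importance: protect the subject noun and the verb.
importance = {"mayor": 10.0, "resigned": 10.0}
sentence = ["the", "very", "old", "mayor", "resigned"]
compressed = compress(sentence, toy_score, importance, target_len=3)
print(compressed)
```

In a real setting `score_fn` would wrap a language model, and the importance term is where linguistic information about each word would enter the score.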

Pivot Discrimination Approach for Paraphrase Extraction from Bilingual Corpus (이중 언어 기반 패러프레이즈 추출을 위한 피봇 차별화 방법)

  • Park, Esther;Lee, Hyoung-Gyu;Kim, Min-Jeong;Rim, Hae-Chang
    • Korean Journal of Cognitive Science / v.22 no.1 / pp.57-78 / 2011
  • Paraphrasing is the act of writing a text using other words without altering the meaning. Paraphrases can be used in many fields of natural language processing. In particular, paraphrases can be incorporated into machine translation to improve the coverage and quality of translation. Recent approaches to paraphrase extraction utilize bilingual parallel corpora, which consist of aligned sentence pairs. In these approaches, paraphrases are identified from the word alignment result by pivot phrases, which are phrases in one language to which two or more phrases in the other language are connected. However, word alignment is itself a very difficult task, so there can be many alignment errors. Moreover, alignment errors can lead to the selection of incorrect pivot phrases. In this study, we propose a paraphrase extraction method that discriminates good pivot phrases from bad ones. Each pivot phrase is weighted according to its reliability, which is scored by considering lexical and part-of-speech information. The experimental results show that the proposed method achieves higher precision and recall in paraphrase extraction than the baseline. We also show that the extracted paraphrases can increase the coverage of Korean-English machine translation.
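
The pivot idea described in this abstract can be sketched in a few lines: phrases that align to the same pivot phrase become paraphrase candidates, and each pivot contributes its reliability weight to the candidate's score. The function name, the placeholder pivot labels, and the weights below are hypothetical; in the paper the weights are derived from lexical and part-of-speech information, which this sketch does not model.

```python
from collections import defaultdict
from itertools import combinations


def extract_paraphrases(aligned_pairs, pivot_weight=None):
    """Score paraphrase candidates that share a pivot phrase.

    aligned_pairs: iterable of (phrase, pivot_phrase) alignments.
    pivot_weight:  optional dict of pivot reliability weights; without it
                   every pivot counts as 1.0 (unweighted baseline).
    Returns a dict mapping (phrase_a, phrase_b) to an accumulated score,
    with each pair stored in sorted order.
    """
    by_pivot = defaultdict(set)
    for phrase, pivot in aligned_pairs:
        by_pivot[pivot].add(phrase)

    scores = defaultdict(float)
    for pivot, phrases in by_pivot.items():
        weight = 1.0 if pivot_weight is None else pivot_weight.get(pivot, 0.0)
        # Every pair of phrases connected to the same pivot is a candidate.
        for a, b in combinations(sorted(phrases), 2):
            scores[(a, b)] += weight
    return dict(scores)


# Toy alignments; PIVOT_B stands for a pivot produced by an alignment error.
alignments = [
    ("under control", "PIVOT_A"),
    ("in check", "PIVOT_A"),
    ("in check", "PIVOT_B"),
    ("the", "PIVOT_B"),
]
weights = {"PIVOT_A": 0.9, "PIVOT_B": 0.1}  # hypothetical reliability scores
scores = extract_paraphrases(alignments, weights)
print(scores)
```

With uniform weights both candidate pairs would score the same; down-weighting the unreliable pivot is what lets a threshold separate the plausible pair from the spurious one.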