• Title/Abstract/Keywords: Corpora

Search results: 249 items (processing time: 0.026 seconds)

Usage Analysis of Vocabulary in Korean High School English Textbooks Using Multiple Corpora

  • 김영미;서진희
    • 영어어문교육 / Vol. 12, No. 4 / pp. 139-157 / 2006
  • As the Communicative Approach has become the norm in foreign language teaching, the objectives of teaching English in Korean schools have changed radically. The focus of high school English textbooks has shifted from mere mastery of structures to communicative proficiency. This paper studies five polysemous words that appear in twelve high school English textbooks used in Korea. The twelve textbooks are combined into a single corpus and analyzed to classify the usage of the selected words. The usage of each word is then compared with that of three other corpus-based sources: the BNC (British National Corpus) Sampler, ICE Singapore (the International Corpus of English for Singapore), and the Collins COBUILD learner's dictionary, which is based on the corpus "The Bank of English". The comparisons demonstrate that Korean textbooks do not always supply the full range of meanings of polysemous words.

A Study on the Markup Scheme for Building the Corpora of Korean Culinary Manuscripts

  • 안의정;박진양;남길임
    • 한국언어정보학회지:언어와정보 / Vol. 12, No. 2 / pp. 95-114 / 2008
  • This study aims to establish a markup system for 17th- to 19th-century culinary manuscripts. Section 2 examines theoretical considerations in encoding large-scale historical corpora. Section 3 identifies and analyzes the characteristics of the textual theme and structure of the source text. Section 4 proposes an XML-based markup scheme for bibliographical and structural markup of the corpus as well as for grammatical annotation. We show that an XML-based markup system is highly desirable because it is expressive, flexible, and scalable. The scheme we suggest is a modified and extended version of TEI-P5 that accommodates the textual and linguistic characteristics of premodern Korean culinary manuscripts.

Mining Parallel Text from the Web based on Sentence Alignment

  • Li, Bo;Liu, Juan;Zhu, Huili
    • 한국언어정보학회: Conference Proceedings / 한국언어정보학회 2007 Regular Conference / pp. 285-292 / 2007
  • Parallel corpora are an important resource for data-driven natural language processing, but only a few are publicly available, mostly because of the heavy manual labor needed to construct them. This paper proposes a novel strategy for automatically fetching parallel text from the web, which may help to address the lack of high-quality parallel corpora. The system first downloads web pages from certain hosts. Candidate parallel page pairs are then selected from the page set based on the external features of the web pages. In the last step, the candidate page pairs are evaluated: the sentences in each candidate pair are extracted and aligned, and the similarity of the two web pages is then evaluated based on the similarities of the aligned sentences. Experiments on a multilingual web site show satisfactory performance of the system.
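
The final evaluation step in this abstract (scoring a candidate page pair from the similarities of its aligned sentences) might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' system: the alignment is a naive position-based 1-1 pairing, the sentence similarity is a simple character-length ratio, and every name (`length_similarity`, `page_similarity`, `is_parallel`) and the threshold value are hypothetical.

```python
from typing import List


def length_similarity(src: str, tgt: str) -> float:
    """Toy sentence similarity: ratio of the shorter to the longer length.

    Assumption: a stand-in for whatever bilingual similarity a real
    system would use.
    """
    longer = max(len(src), len(tgt))
    if longer == 0:
        return 0.0
    return min(len(src), len(tgt)) / longer


def page_similarity(src_sents: List[str], tgt_sents: List[str]) -> float:
    """Score a candidate page pair from its aligned sentence similarities.

    Sentences are aligned 1-1 by position (a deliberately naive choice).
    Dividing by the larger sentence count penalizes unaligned sentences.
    """
    if not src_sents or not tgt_sents:
        return 0.0
    total = sum(length_similarity(s, t) for s, t in zip(src_sents, tgt_sents))
    return total / max(len(src_sents), len(tgt_sents))


def is_parallel(src_sents: List[str], tgt_sents: List[str],
                threshold: float = 0.6) -> bool:
    """Accept the page pair if its similarity reaches a hypothetical threshold."""
    return page_similarity(src_sents, tgt_sents) >= threshold
```

A real pipeline would replace the positional pairing with a proper sentence aligner and tune the threshold on held-out page pairs.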

Semi-Automatic Annotation Tool to Build Large Dependency Tree-Tagged Corpus

  • Park, Eun-Jin;Kim, Jae-Hoon;Kim, Chang-Hyun;Kim, Young-Kill
    • 한국언어정보학회: Conference Proceedings / 한국언어정보학회 2007 Regular Conference / pp. 385-393 / 2007
  • Corpora annotated with rich linguistic information are required to develop robust statistical natural language processing systems. Building such corpora, however, is expensive, labor-intensive, and time-consuming. To support this work, we design and implement an annotation tool for building a Korean dependency tree-tagged corpus. Compared with other annotation tools, our tool is characterized by the following features: independence of applications, localization of errors, powerful error checking, instant sharing of annotated information, and user-friendliness. Using our tool, we have annotated 100,904 Korean sentences with dependency structures. The number of annotators is 33, the average annotation time is about 4 minutes per sentence, and the total annotation period is 5 months. We are confident that the tool yields accurate and consistent annotations while reducing labor and time.

An Analysis of Terminological Definitions

  • 이해윤
    • 한국독어학회지:독어학 / Vol. 7 / pp. 145-163 / 2003
  • In this paper, we examine various definitions of the terminological definition for the purpose of extracting terminological information from corpora. After reviewing research in lexicography and terminology, we introduce the qualia structure of the Generative Lexicon (Pustejovsky 1995) to analyze terminological definitions. Using the qualia structure, we analyze the definitions presented in terminological dictionaries. As a result, we confirm that terminological definitions can be decomposed into four subtypes of qualia structure. Based on this examination, we analyze terminological definitions in newspaper articles and show the usefulness of the qualia structure for extracting terminological definitions from corpora.

A Study on Korean Intonation Using Momel

  • 김선희;유현지;홍혜진;이호영
    • 대한음성학회지:말소리 / No. 63 / pp. 85-100 / 2007
  • This paper proposes a way to extract intonation patterns using Momel, a pitch stylization algorithm, and presents the results of analyzing speech corpora in comparison with those of earlier research. Two speech corpora are used: one consists of the sound files obtained from the K-ToBI web site, and the other consists of 80 passages spoken by 4 speakers (2 male and 2 female). The results show that Momel provides significant pitch targets that can be labeled as H and L tones within prosodic units such as the Accentual Phrase (AP) and the Intonation Phrase (IP). The resulting AP patterns and IP boundary tone patterns correspond to those reported in earlier research. This study thus contributes to the study of intonation as well as to the development of automatic intonation labeling systems.

The Use of MSVM and HMM for Sentence Alignment

  • Fattah, Mohamed Abdel
    • Journal of Information Processing Systems / Vol. 8, No. 2 / pp. 301-314 / 2012
  • This paper presents two new approaches to aligning English-Arabic sentences in bilingual parallel corpora, based on Multi-Class Support Vector Machine (MSVM) and Hidden Markov Model (HMM) classifiers. A feature vector is extracted from the text pair under consideration. This vector contains text features such as length, punctuation score, and cognate score values. A set of manually prepared training data was used to train the MSVM and the HMM, and another set of data was used for testing. The MSVM and HMM results outperform those of the length-based approach. Moreover, these new approaches are valid for any language pair and are quite flexible, since the feature vector may contain fewer, more, or different features than the ones used in the current research, such as a lexical matching feature or Hanzi characters in Japanese-Chinese texts.
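
To make the idea of a per-pair feature vector concrete, here is a minimal sketch of how length, punctuation, and cognate features might be computed for one candidate sentence pair. The abstract does not give the feature definitions, so the formulas below are assumptions chosen for illustration, and all function names are hypothetical. In particular, the "cognate" score is approximated by whitespace tokens that appear verbatim in both sentences (numbers, names, and the like).

```python
import unicodedata
from typing import List


def _ratio(a: int, b: int) -> float:
    """Ratio of the smaller to the larger count; 1.0 when both are zero."""
    hi = max(a, b)
    return 1.0 if hi == 0 else min(a, b) / hi


def punctuation_count(text: str) -> int:
    """Count characters whose Unicode category is punctuation (P*)."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("P"))


def cognate_score(src: str, tgt: str) -> float:
    """Assumed proxy: share of whitespace tokens found verbatim in both sides."""
    src_tokens = set(src.split())
    tgt_tokens = set(tgt.split())
    if not src_tokens or not tgt_tokens:
        return 0.0
    shared = src_tokens & tgt_tokens
    return len(shared) / min(len(src_tokens), len(tgt_tokens))


def feature_vector(src: str, tgt: str) -> List[float]:
    """Build [length ratio, punctuation score, cognate score] for a pair."""
    return [
        _ratio(len(src), len(tgt)),
        _ratio(punctuation_count(src), punctuation_count(tgt)),
        cognate_score(src, tgt),
    ]
```

Vectors like these would then be passed, together with manually labeled alignments, to whichever classifier is being trained.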

Creation of Speech Corpora for STiLL at SiTEC

  • 김영일;김봉완;최대림;이광현;정은순;이용주
    • 대한음성학회: Conference Proceedings / 대한음성학회 2005 Fall Conference Proceedings / pp. 13-16 / 2005
  • As language learning that utilizes speech and information processing technology is becoming popular, the Speech Information Technology & Promotion Center (SiTEC) has created and is distributing speech corpora for STiLL in order to support basic research and product development. We introduce the Korean corpus and the English corpora that we have created and are distributing.

The Statistical Relationship between Linguistic Items and Corpus Size

  • 양경숙;박병선
    • 한국언어정보학회지:언어와정보 / Vol. 7, No. 2 / pp. 103-115 / 2003
  • In recent years, many organizations have been constructing their own large corpora to achieve corpus representativeness. However, there is no reliable guideline as to how large a corpus should be, especially for Korean corpora. In this study, we devise a new statistical model, ARIMA (Autoregressive Integrated Moving Average), for predicting the relationship between linguistic items (the number of types) and corpus size (the number of tokens), overcoming the major flaws of several previous studies on this issue. We then illustrate that the ARIMA model presented is valid, accurate, and very reliable. We are confident that this study can contribute to solving some inherent problems of corpus linguistics, such as corpus predictability, corpus representativeness, and linguistic comprehensiveness.
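
The quantity being modeled here, the number of types as a function of the number of tokens, can be observed directly by sampling a corpus at regular intervals. The sketch below only collects that type/token growth series; it does not implement ARIMA or any predictive model. The function name `type_token_curve` and the `step` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple


def type_token_curve(tokens: Sequence[str], step: int) -> List[Tuple[int, int]]:
    """Return (tokens seen, distinct types seen) sampled every `step` tokens.

    The final corpus size is always included as the last point.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        # Record a point at each sampling interval and at the end of the corpus.
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve
```

A series produced this way is the kind of input to which a growth model could be fitted and then extrapolated to larger corpus sizes.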

A Rule-Based Analysis from Raw Korean Text to Morphologically Annotated Corpora

  • Lee, Ki-Yong;Markus Schulze
    • 한국언어정보학회지:언어와정보 / Vol. 6, No. 2 / pp. 105-128 / 2002
  • Morphologically annotated corpora are the basis for many tasks in computational linguistics. Most current approaches use statistically driven methods of morphological analysis, which provide only POS tags. While this is sufficient for some applications, many others need a rule-based full morphological analysis that also yields lemmatization and segmentation. This work thus aims at (1) introducing a rule-based Korean morphological analyzer called Kormoran, based on the principle of linearity that prohibits any combination of left-to-right or right-to-left analysis or backtracking, and (2) showing how it can be used as a POS tagger by adopting an ordinary preprocessing technique and by filtering out irrelevant morpho-syntactic information in the analyzed feature structures. It is shown that, besides providing a basis for subsequent syntactic or semantic processing, full morphological analyzers like Kormoran have greater power to resolve ambiguities than simple POS taggers. The focus of our present analysis is on Korean text.
