• Title/Abstract/Keywords: corpus linguistics

Procedures and Problems in Compiling a Disambiguated Tagged Corpus (어휘의미분석 말뭉치 구축의 절차와 문제)

  • 신지현;최민우;강범모
    • 한국정보과학회 언어공학연구회:학술대회논문집(한글 및 한국어 정보처리) / 한국정보과학회언어공학연구회 2001년도 제13회 한글 및 한국어 정보처리 학술대회 / pp.479-486 / 2001
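  • One way to efficiently distinguish the different meanings of homonyms is to use a sense-tagged corpus. Such a corpus builds on a morphologically analyzed corpus, which resolves ambiguity at the part-of-speech level, and additionally resolves the lexical ambiguity left unresolved at that stage. It is an important language resource for more precise linguistic research and for developing natural language processing technology such as word sense disambiguation. As groundwork for actually building a sense-tagged corpus, this study examines the design of such a corpus and the methodological issues in its construction, points out the distributional characteristics of ambiguous words and the problems that can arise at the word sense disambiguation stage, and explores ways to solve them.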

A Study on the Lexicon-Use Behaviour of Architects & the Basic Lexicons in House Design (주택디자인에서 건축가들의 어휘 사용행태 및 기본어휘에 관한 연구)

  • 윤대한
    • 한국주거학회논문집 / Vol. 17, No. 5 / pp.27-37 / 2006
  • This paper statistically analyzed two corpora constructed from texts about house designs written by Korean architects and by PA Awards architects. The main results are as follows: (1) The Korean house-design corpus contained 9,352 words and the PA Awards house-design corpus 2,379 words. These are 18.7% and 4.8%, respectively, of the roughly 50,000 words regarded as the scale of vocabulary used in actual life. (2) When the architects described their house designs, lexical concentration was pervasive in both groups; we can therefore infer that high-frequency lexical items are very important in house design. (3) The architects' patterns of house-design lexicon use followed rules according to word-frequency order, and the trend formulas for them had $R^{2}$ values of more than 90%. (4) In the Korean house-design corpus, the high-frequency words were '공간' (space), '층' (floor), '주택' (house), '집' (home), '대지' (site), '거실' (living room), and '실' (room). In the PA Awards house-design corpus, they were 'house', 'room', 'space', 'living', 'wall', 'level', and 'area'. From these results, we can tell that 'space' is the highest-frequency word in house design for both groups, and that '대지' and 'wall' are the words that best reveal the differences between the two groups.

In My Opinion: Modality in Japanese EFL Learners' Argumentative Essays

  • Pemberton, Christine
    • 아시아태평양코퍼스연구 / Vol. 1, No. 2 / pp.57-72 / 2020
  • This study seeks to add to the current understanding of learners' use of modality in argumentative writing. A learner corpus of argumentative essays on four topics was created and compared to native English speaker data from the International Corpus Network of Asian Learners of English (ICNALE). The relationship between learners' use of modal devices (MDs) and the devices' appearance in the school's curriculum was also examined. The results showed that learners relied on a very narrow range of MDs compared to those in previous studies. The frequency of use of MDs varied based on the topic and did not seem to be driven by cultural factors as has been previously suggested. Learners used more hedges than boosters on all topics, contradicting most previous studies. Curriculum was determined to have a direct correlation with MD use, and other important factors may include perception of topic and overreliance on certain MDs over others (the One-to-One Principle). This research implies that learners' perception of topic should be explored further as a variable affecting MD use. Curricula should be designed based on frequency of MD use by English native speakers, and learners should receive instruction that teaches the norms of MD use in academic writing. The methodology used in the study to determine correlations between MD use and the curriculum has a wide range of potential applications in the field of Contrastive Interlanguage Analysis.

The Effects of Corpus Use on Learning L2 Collocations of Light Verbs and Nouns

  • Yoshiho Satake
    • 아시아태평양코퍼스연구 / Vol. 4, No. 2 / pp.41-55 / 2023
  • In data-driven learning (DDL), learners explore a corpus to understand vocabulary and grammar. Although many studies have emphasized the role of DDL in second language (L2) acquisition, L2 light verbs have been largely under-explored. To bridge this gap, this study focused on the learning outcomes of L2 light verbs among 29 intermediate-level Japanese university students. The research zeroed in on six prevalent light verbs in English: "make," "do," "take," "have," "give," and "get." Over nine weeks, the participants engaged with verb-noun collocations using worksheets that juxtaposed Japanese translations of the target collocations with their English equivalents, with the verbs omitted. With the aid of Wordbanks Online, they filled in the blanks and constructed accurate sentences. Before this activity, a 20-minute tutorial was given to the participants on how to interpret the concordance lines. The effectiveness of the DDL method was evaluated using pre-tests, immediate post-tests, and delayed post-tests. The results showed that DDL significantly improved the participants' knowledge of the target collocations of light verbs and nouns; the post-test and delayed post-test scores were significantly higher than the pre-test scores. The results showed that, overall, DDL contributed to memorizing the collocations of light verbs and nouns; however, DDL had different effects on the memorization of collocations across different light verbs. The extent of work on the worksheet is not the only factor in its retention, and observing concordance lines may promote learners' memorization of light-verb collocations.

Extracting Multiword Sentiment Expressions by Using a Domain-Specific Corpus and a Seed Lexicon

  • Lee, Kong-Joo;Kim, Jee-Eun;Yun, Bo-Hyun
    • ETRI Journal / Vol. 35, No. 5 / pp.838-848 / 2013
  • This paper presents a novel approach to automatically generate Korean multiword sentiment expressions by using a seed sentiment lexicon and a large-scale domain-specific corpus. A multiword sentiment expression consists of a seed sentiment word and its contextual words occurring adjacent to the seed word. The multiword sentiment expressions that are the focus of our study have a different polarity from that of the seed sentiment word. The automatically extracted multiword sentiment expressions show that 1) the contextual words should be defined as a part of a multiword sentiment expression in addition to their corresponding seed sentiment word, 2) the identified multiword sentiment expressions contain various indicators for polarity shift that have rarely been recognized before, and 3) the newly recognized shifters contribute to assigning a more accurate polarity value. The empirical result shows that the proposed approach achieves improved performance of the sentiment analysis system that uses an automatically generated lexicon.

Building an RST-tagged Corpus and its Classification Scheme for Korean News Texts (한국어 수사구조 분류체계 수립 및 주석 코퍼스 구축)

  • 노은정;이연수;김연우;이도길
    • 한국어정보학회:학술대회논문집 / 한국어정보학회 2016년도 제28회 한글및한국어정보처리학술대회 / pp.33-38 / 2016
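  • Rhetorical structure refers to the relations among the constituents of a text, and a writer's intent is conveyed to readers more effectively through a logical structure. Paragraphs and sentences therefore need to be organized with rhetorical structure in mind so as to maximize the cognitive effect on the reader. Nevertheless, there has so far been no attempt to create a Korean classification scheme based on rhetorical structure or to design an annotated corpus. Building on existing rhetorical structure theory, this study refined a classification scheme of 30 types suited to the format of Korean news reports and built a corpus tagged at the level of minimal discourse units. Using this corpus, we examined the characteristics and distribution ratios of sentence structures, including topic sentences, as well as the genre characteristics of newspaper articles, and thereby identified how cohesion is realized in texts and what syntactic features they show. The significance of this study lies in designing a rhetorical structure classification scheme suited to Korean discourse and in building, for the first time, an annotated corpus that uses it.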

A Study of the Automatic Extraction of Hypernyms and Hyponyms from the Corpus (코퍼스를 이용한 상하위어 추출 연구)

  • 방찬성;이해윤
    • 인지과학 / Vol. 19, No. 2 / pp.143-161 / 2008
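  • This paper proposes a method for extracting hypernym-hyponym relation patterns between words from a corpus. Because of the free word order of Korean, previous studies have mainly extracted semantic-relation patterns from dictionary definitions; this paper instead uses a corpus to extract and present a wider variety of semantic-relation patterns. To this end, we first compiled a list of hypernym-hyponym pairs from existing dictionaries. We then extracted from the corpus the sentences containing the word pairs on this list, and from these selected the sentences that could be systematically patterned, generalizing them into 21 hypernym-hyponym relation patterns. After expressing the 21 patterns as regular expressions, we extracted from the corpus again the sentences matching each pattern, and measured a precision of 57%.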

A Corpus-Based Study of the Use of HEART and HEAD in English

  • Oh, Sang-suk
    • 한국언어정보학회지:언어와정보 / Vol. 18, No. 2 / pp.81-102 / 2014
  • The purpose of this paper is to provide corpus-based quantitative analyses of HEART and HEAD in order to examine their actual usage and to consider some cognitive linguistic aspects associated with their use. Two corpora, COCA and COHA, are used for analysis in this study. The analysis of COCA reveals that the total frequency of HEAD is much higher than that of HEART, and that the figurative use of HEART (60%) is about twice as frequent as its literal use (32%); by contrast, the figurative use of HEAD (41%) is only slightly higher than its literal use (38%). Among all four genres, both lexemes occur most frequently in fiction and then in magazines. Over the past two centuries, the use of HEART has been steadily decreasing; by contrast, the use of HEAD has been steadily increasing. It is assumed that the decreasing use of HEART is partly due to the decrease in its figurative use and that the increasing use of HEAD is attributable to its diverse meanings, the increase of its lexical use, and the partial increase in its figurative use. The analysis of the collocation of verbs and adjectives preceding HEART and HEAD, as well as the modifying and predicating forms of HEART and HEAD, also provides some relevant information on the usage of the two lexemes. This paper shows that quantitative information aids understanding not only of the actual usage of the two lexemes but also of the cognitive forces working behind it.

Fillers in the Hong Kong Corpus of Spoken English (HKCSE)

  • Seto, Andy
    • 아시아태평양코퍼스연구 / Vol. 2, No. 1 / pp.13-22 / 2021
  • The present study employed an analytical framework characterised by a synthesis of quantitative and qualitative analyses, together with a specially designed software program, SpeechActConc, to examine speech acts in business communication. The naturally occurring data from the audio recordings and the prosodic transcriptions of the business sub-corpora of the HKCSE (prosodic) were manually annotated with a speech act taxonomy to determine the frequency of fillers, the co-occurring patterns of fillers with other speech acts, and the linguistic realisations of fillers. The discoursal function of fillers to sustain the discourse or to hold the floor has diverse linguistic realisations, ranging from a sound (e.g. 'uhuh') and a word (e.g. 'well') to sounds (e.g. 'um er') and words, namely a phrase (e.g. 'sort of') and a clause (e.g. 'you know'). Some are even combinations of sound(s) and word(s) (e.g. 'and um', 'yes er um', 'sort of erm'). Among the five most frequent linguistic realisations of fillers, 'er' and 'um' are the most common, found in all six genres with relatively higher percentages of occurrence. The remaining more frequent realisations consist of a clause ('you know'), a word ('yeah') and a sound ('erm'). These common forms are syntactically simpler than the less frequent realisations found in the genres. The co-occurring patterns of fillers and other speech acts are diverse. The more common co-occurring speech acts with fillers include informing and answering. The findings show that fillers are not only frequently used by speakers in spontaneous conversation but also mostly represented in sounds or non-linguistic realisations.

Non-Discourse Marker Uses of So in EFL Writings: Functional Variability among Asian Learners

  • Sato, Shie
    • 아시아태평양코퍼스연구 / Vol. 1, No. 2 / pp.27-39 / 2020
  • This paper examines the frequency and distribution of the so-called "non-discourse marker functions" of so in essay writings produced by 200 L1 English speakers and 1,300 EFL learners in China, Japan, Korea, and Taiwan. Based on the data drawn from the International Corpus Network of Asian Learners of English, this study compares EFL learners and L1 English speakers' uses of so, identifying four grammatical uses, as (1) an adverb, (2) part of a fixed phrase, (3) a pro-form, and (4) a conjunction phrase specifying purpose. This study aims to show the wide variability among EFL learners with different L1s, identifying the tendency of usage both common among and specific to the sub-groups of EFL learners. The findings suggest that the learners demonstrate patterns distinctively different from those of L1 English speakers, indicating an underuse of so as a marker expressing "purpose" and an overuse as part of fixed phrases. Compared to L1 English speakers, the learners also tend to overuse so in the discourse marker functions, regardless of their L1s. The study proposes pedagogical implications focusing on discourse flow and diachronic aspects of so in order to understand its multifunctionality, although the latter is primarily suggested for advanced learners.