• Title/Summary/Keyword: word Weighting

The acoustic cue-weighting and the L2 production-perception link: A case of English-speaking adults' learning of Korean stops

  • Kong, Eun Jong;Kang, Soyoung;Seo, Misun
    • Phonetics and Speech Sciences
    • /
    • v.14 no.3
    • /
    • pp.1-9
    • /
    • 2022
  • The current study examined English-speaking adult learners' production and perception of L2 Korean stops (/t/ or /t'/ or /th/) to investigate whether the two modalities are linked in utilizing voice onset time (VOT) and fundamental frequency (F0) for the L2 sound distinction and how the learners' L2 proficiency mediates the relationship. Twenty-two English-speaking learners of Korean living in Seoul participated in the word-reading task of producing stop-initial words and the identification task of labelling CV stimuli synthesized to vary VOT and F0. Using logistic mixed-effects regression models, we quantified group- and individual-level weights of the VOT and F0 cues in differentiating the tense-lax, lax-aspirated, and tense-aspirated stops in Korean. The results showed that the learners as a group relied on VOT more than F0 both in production and perception (except the tense-lax pair), reflecting the dominant role of VOT in their L1 stop distinction. Individual-level analyses further revealed that the learners' L2 proficiency was related to their use of F0 in L2 production and their use of VOT in L2 perception. With this effect of L2 proficiency controlled in the partial correlation tests, we found a significant correlation between production and perception in using VOT and F0 for the lax-aspirated stop contrast. However, the same correlation was absent for the other stop pairs. We discuss a contrast-specific role of acoustic cues to address the non-uniform patterns of the production-perception link in the L2 sound learning context.

Context-Weighted Metrics for Example Matching (문맥가중치가 반영된 문장 유사 척도)

  • Kim, Dong-Joo;Kim, Han-Woo
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.43 no.6 s.312
    • /
    • pp.43-51
    • /
    • 2006
  • This paper proposes a metric for example matching in example-based English-Korean machine translation. The metric, which serves as a similarity measure, is based on the edit-distance algorithm and is used to retrieve the example sentences most similar to a given query. It makes use of simple information such as the lemma and part of speech of typographically mismatched words. The edit-distance algorithm cannot fully reflect the context of matched word units. In other words, as long as the matched word units appear in order, a fully matching context is considered to contribute the same amount to similarity as a partially matching context in which mismatched word units intervene. To overcome this drawback, we propose a context-weighting scheme that uses the contiguity of matched word units to capture the full context. We normalize the edit-distance metric, which represents dissimilarity, in order to convert it into a similarity metric, to apply the context-weighted metric to the example matching problem, and to rank examples by similarity. In addition, we generalize previous methods that use linguistic information into one representative system. To verify the proposed context-weighted metric, we compare it experimentally with the generalized previous methods.
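To make the normalization step concrete, here is a minimal sketch of turning a word-level edit distance into a similarity score and ranking examples by it. It does not reproduce the paper's context-weighting scheme or its use of lemma and part-of-speech information, which the abstract does not specify; the function names and the normalization by the longer sentence length are assumptions made for illustration.

```python
def word_edit_distance(a, b):
    """Levenshtein distance over word tokens (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # delete wa
                            curr[j - 1] + 1,      # insert wb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similarity(query, example):
    """Map distance to [0, 1]; dividing by the longer length is an assumption."""
    q, e = query.lower().split(), example.lower().split()
    longest = max(len(q), len(e))
    if longest == 0:
        return 1.0
    return 1.0 - word_edit_distance(q, e) / longest


def rank_examples(query, examples):
    """Return the examples ordered from most to least similar to the query."""
    return sorted(examples, key=lambda ex: similarity(query, ex), reverse=True)
```

Plain edit distance scores a contiguous run of matches and a scattered set of matches identically, which is the drawback the abstract says the context-weighting scheme addresses.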

A Research on Enhancement of Text Categorization Performance by using Okapi BM25 Word Weight Method (Okapi BM25 단어 가중치법 적용을 통한 문서 범주화의 성능 향상)

  • Lee, Yong-Hun;Lee, Sang-Bum
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.11 no.12
    • /
    • pp.5089-5096
    • /
    • 2010
  • Text categorization, which classifies documents according to given criteria, is an important function of information retrieval systems. The general approach extracts important index words from the target documents and assigns weights to them. The weighting algorithm therefore matters, because the performance and correctness of text categorization depend heavily on it. This paper introduces an enhanced text categorization method based on an improved word weighting technique. Okapi BM25 has proven effective in several information retrieval engines. We applied Okapi BM25 to categorization and showed that it performs well. It is compared with several other word weighting methods: TF-IDF, TF-ICF and TF-ISF. The experiments use the Reuters-21578 collection with SVM and KNN classifiers. Overall, the modified Okapi BM25 shows the best performance.
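As an illustration of BM25 used as a word weight rather than a query score, the sketch below computes a per-document term weight from the standard BM25 formula. The abstract does not describe the paper's modification, so this is the textbook form only; the function name, the defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant (which keeps weights positive) are assumptions.

```python
import math
from collections import Counter


def bm25_weights(docs, k1=1.2, b=0.75):
    """Return one {term: weight} dict per tokenized document.

    Assumes at least one non-empty document so the average length is positive.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are damped.
        length_norm = k1 * (1 - b + b * len(d) / avg_len)
        doc_weights = {}
        for term, f in tf.items():
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            doc_weights[term] = idf * f * (k1 + 1) / (f + length_norm)
        weights.append(doc_weights)
    return weights
```

The resulting dictionaries could be turned into feature vectors for a classifier such as SVM or KNN, in place of TF-IDF values.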

Heuristic-based Korean Coreference Resolution for Information Extraction

  • Euisok Chung;Soojong Lim;Yun, Bo-Hyun
    • Proceedings of the Korean Society for Language and Information Conference
    • /
    • 2002.02a
    • /
    • pp.50-58
    • /
    • 2002
  • Information extraction delimits in advance, as part of the task specification, the semantic range of the output and filters information from large volumes of text. The most representative words of a document consist of named entities and pronouns. It is therefore important to resolve coreference in order to extract meaningful information. Coreference resolution finds the named entities in a document that refer to the same real-world entities. Its results are used for named entity detection and template generation. This paper presents a heuristic-based approach to coreference resolution in Korean. We built the heuristics by expanding them gradually using a corpus and derived salience factors of antecedents as an importance measure in Korean. Our approach consists of antecedent selection and antecedent weighting. We use three kinds of salience factors to weight each antecedent of an anaphor. The experimental result shows 80% precision.

A Comparative Study of Feature Extraction Methods for Authorship Attribution in the Text of Traditional East Asian Medicine with a Focus on Function Words (한의학 고문헌 텍스트에서의 저자 판별 - 기능어의 역할을 중심으로 -)

  • Oh, Junho
    • Journal of Korean Medical classics
    • /
    • v.33 no.2
    • /
    • pp.51-59
    • /
    • 2020
  • Objectives: This study investigates which features are most appropriate for effective authorship attribution in texts of Traditional East Asian Medicine. Methods: Using the 'Variorum of the Nanjing' as an experimental corpus, the authorship attribution performance of a Support Vector Machine (SVM) was compared by cross-validation across three choices: function words versus content words, single words versus collocations, and whether or not IDF weights were applied. Results: The combination 'function words/uni-bigram/TF' performed best, with an accuracy of 0.732, and the combination 'content words/unigram/TFIDF' showed the lowest accuracy, 0.351. Conclusions: The results show the following about authorship attribution in texts of Traditional East Asian Medicine. First, function words play a more important role than content words. Second, collocations were relatively important among content words, whereas single words were more important among function words. Third, unlike in general text analysis, IDF weighting resulted in worse performance.

A Study on Design and Implementation of Filtering System on Hurtfulness Site (유해 사이트 필터링에 관한 연구)

  • 장혜숙;강일고;박기홍
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2002.11a
    • /
    • pp.636-639
    • /
    • 2002
  • This article presents the design of a system that shields juveniles from harmful content on the internet. Conventionally, software designed for this purpose deletes or grades harmful data. However, these conventional processes involve considerable complexity, such as the need for continual updates and the risk of grading mistakes. To solve these problems, we use a word-weighting process in which harmful words found on internet sites are identified and weighted using an AC machine. As a result, the rate at which harmful sites were isolated rose to 90%, which shows that the process is highly efficient.

A Study about interception on Hurtfulness Site using Aho-Corasick machine (AC 머신을 이용한 유해 사이트 차단에 관한 연구)

  • 정현수;정규철;김후남;박기홍
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2004.05b
    • /
    • pp.541-544
    • /
    • 2004
  • The shift to a knowledge and information society is making our lives more convenient and abundant, but it is also producing considerable unanticipated side effects for which solutions are urgently needed. One representative dysfunction of the information-oriented society is that teenagers who are not yet mature are openly exposed, through information networks, to a great deal of objectionable and harmful material such as violent content. To solve these problems, we use a word-weighting process in which harmful words found on internet sites are identified and weighted using an AC machine. As a result, the rate at which harmful sites were isolated rose to 90%, which shows that the process is highly efficient.
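The abstract gives no implementation details, so the following is only a sketch of how an Aho-Corasick (AC) machine could drive word weighting: every dictionary word is found in a single pass over the page text, and the weights of all matches are summed. The function names, the weight values supplied by the caller, and the decision threshold are hypothetical.

```python
from collections import deque


def build_automaton(weighted_words):
    """Build Aho-Corasick tables from a {word: weight} dict of non-empty words."""
    goto, fail, out = [{}], [0], [[]]
    for word, weight in weighted_words.items():
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append((word, weight))
    # Breadth-first pass: depth-1 states keep the root as their failure link.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure state (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def score_text(text, automaton):
    """Scan the text once and return (summed weight, matched words in order)."""
    goto, fail, out = automaton
    state, total, hits = 0, 0.0, []
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word, weight in out[state]:
            hits.append(word)
            total += weight
    return total, hits


def is_harmful(text, automaton, threshold):
    """Hypothetical decision rule: flag the page when the summed weight reaches the threshold."""
    total, _ = score_text(text, automaton)
    return total >= threshold
```

Because the automaton follows failure links instead of restarting, the scan time grows with the length of the text plus the number of matches, not with the size of the word list.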

An Effective Method for Blocking Illegal Sports Gambling Ads on Social Media

  • Kim, Ji-A;Lee, Geum-Boon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.24 no.12
    • /
    • pp.201-207
    • /
    • 2019
  • In this paper, we propose an effective method to block illegal gambling advertisements on social media. With the increase in smartphone and internet usage, users can easily access a wide range of information while sharing text and video with a large number of others. Illegal sports gambling advertisements also continue to be transmitted on SNS. Because the phrases that indicate illegal sports gambling are embedded in images to evade most surveillance networks, users are easily exposed to these advertisement images. To cope with these problems, we propose a method that actively blocks illegal sports gambling advertisements, in contrast to conventional passive methods. We select words frequently used for illegal sports gambling, classify them into three groups according to their importance, calculate WF for each word using a formula weighted by degree of relevance and frequency, and then sum the WF of the words in the image. Blocking, warning, or passing is determined by cv, the total of WF. In experiments with the proposed method, 193 out of 200 images were judged correctly, an accuracy of 96.5%; the remaining 7 images were not blocked even though they were illegal sports gambling advertisements. Further research is needed to block the 3.5% of illegal sports betting ads that currently cannot be blocked.
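The decision rule described above can be sketched as follows. The abstract does not give the WF formula, the keyword list, the group weights, or the thresholds, so everything named here is hypothetical: WF is assumed to be the group weight times the word's frequency, and the text is assumed to have already been extracted from the image.

```python
from collections import Counter

# Hypothetical weights for the three importance groups.
GROUP_WEIGHTS = {"high": 3.0, "medium": 2.0, "low": 1.0}

# Hypothetical keyword-to-group assignment.
KEYWORDS = {
    "betting": "high",
    "jackpot": "high",
    "odds": "medium",
    "bonus": "medium",
    "sports": "low",
    "event": "low",
}

# Hypothetical thresholds on cv, the total of WF.
BLOCK_AT = 6.0
WARN_AT = 3.0


def total_wf(tokens):
    """Sum WF over the keywords, assuming WF = group weight * frequency."""
    counts = Counter(tokens)
    return sum(GROUP_WEIGHTS[group] * counts[word] for word, group in KEYWORDS.items())


def decide(tokens):
    """Return 'block', 'warn', or 'pass' according to cv."""
    cv = total_wf(tokens)
    if cv >= BLOCK_AT:
        return "block"
    if cv >= WARN_AT:
        return "warn"
    return "pass"
```

For instance, under these assumed values `decide(["betting", "odds", "bonus"])` falls in the blocking range, while a single low-importance word passes.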

Keyword Extraction from News Corpus using Modified TF-IDF (TF-IDF의 변형을 이용한 전자뉴스에서의 키워드 추출 기법)

  • Lee, Sung-Jick;Kim, Han-Joon
    • The Journal of Society for e-Business Studies
    • /
    • v.14 no.4
    • /
    • pp.59-73
    • /
    • 2009
  • Keyword extraction is an essential technique for text mining applications such as information retrieval, text categorization, summarization and topic detection. A set of keywords extracted from large-scale electronic document data is used as significant features for text mining algorithms and contributes to improving the performance of document browsing, topic detection, and automated text classification. This paper presents a keyword extraction technique that can be used to detect topics for each news domain from a large document collection of internet news portal sites. We use six variants of the traditional TF-IDF weighting model. On top of the TF-IDF model, we propose a word filtering technique called 'cross-domain comparison filtering'. To demonstrate the effectiveness of our method, we analyze the usefulness of keywords extracted from Korean news articles and present how the keywords of each news domain change over time.
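For reference, the baseline that such variants build on can be sketched as plain TF-IDF keyword ranking. This is the textbook form only: it is not one of the paper's six variants and does not implement 'cross-domain comparison filtering', neither of which the abstract specifies. The function name and the alphabetical tie-breaking are assumptions.

```python
import math
from collections import Counter


def top_keywords(docs, k=3):
    """Return the k highest-scoring TF-IDF terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    results = []
    for d in docs:
        tf = Counter(d)
        scores = {
            term: (f / len(d)) * math.log(n_docs / doc_freq[term])
            for term, f in tf.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:k])
    return results
```

A term that occurs in every document gets an IDF of zero, so words shared across all articles drop to the bottom of each ranking.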

Constructing the Semantic Information Model using A Collective Intelligence Approach

  • Lyu, Ki-Gon;Lee, Jung-Yong;Sun, Dong-Eon;Kwon, Dai-Young;Kim, Hyeon-Cheol
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.5 no.10
    • /
    • pp.1698-1711
    • /
    • 2011
  • Knowledge is often represented as a set of rules or a semantic network in intelligent systems. Recently, ontologies have been widely used to represent semantic knowledge, because they organize thesaurus and hierarchical information between concepts in a particular domain. However, it is not easy to collect semantic relationships among concepts, and ontology construction incurs much time and expense. Collective intelligence can be a good alternative approach to these problems. In this paper, we propose a collective intelligence approach based on Games With A Purpose (GWAP) to collect various semantic resources, such as words and word senses. We detail how to construct the semantic information model or ontology from the collected semantic resources, building a system named FunWords. FunWords is a Korean lexical-based semantic resource collection tool. Experiments showed that the resources were grouped as common nouns, abstract nouns, adjectives and neologisms. We then analyzed their characteristics and acquired the semantic relationships noted above. For common nouns, structural semantic relationships such as hypernym and hyponym are highlighted. For abstract nouns, descriptive and characteristic relationships such as synonym and antonym stand out. For adjectives, relationships such as description, status, and illustration (for example, color and sound) are expressed more. Finally, for neologisms, relationships such as description and characteristics are emphasized. Weighting the semantic relationships according to these characteristics can help reduce time and cost, because unnecessary or only slightly related factors need not be considered. It can also improve expressive power, such as readability, by concentrating on the weighted characteristics. Our proposal to collect semantic resources through the collective intelligence approach of GWAP (our FunWords) and to weight their semantic relationships can help construct the semantic information model or ontology, and would be a more effective and expressive alternative.