• Title/Summary/Keyword: 나이브 베이즈 분류 (naive Bayes classification)

Utilizing Unlabeled Documents in Automatic Classification with Inter-document Similarities (문헌간 유사도를 이용한 자동분류에서 미분류 문헌의 활용에 관한 연구)

  • Kim, Pan-Jun;Lee, Jae-Yun
    • Journal of the Korean Society for Information Management / v.24 no.1 s.63 / pp.251-271 / 2007
  • This paper studies the problem of classifying documents using both labeled and unlabeled training data, with particular attention to document similarity features. Using unlabeled data is practically important because in many information systems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. A general semi-supervised learning algorithm has two steps. First, it trains a classifier on the available labeled documents and uses it to classify the unlabeled documents. Then, it trains a new classifier on all the training documents, whether they were labeled manually or automatically. We propose two types of semi-supervised learning algorithm that use document similarity features. The first is one-step semi-supervised learning, which uses unlabeled documents only to generate document similarity features. The second is two-step semi-supervised learning, which uses unlabeled documents as training examples as well as for similarity features. Experimental results, obtained with support vector machines and a naive Bayes classifier, show that a small set of labeled documents combined with a large set of unlabeled documents yields better performance than supervised learning on labeled data alone. Considering the efficiency of a classifier system, the one-step semi-supervised learning algorithm proposed in this study could be a good solution for improving classification performance with unlabeled documents.
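
The general two-step procedure described in this abstract can be sketched as follows. This is a minimal illustration, assuming a plain multinomial naive Bayes over word tokens with add-one smoothing; it does not include the paper's document similarity features, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Fit a multinomial naive Bayes model (counts only; smoothing is applied at prediction)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(class_counts[label] / total_docs)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def two_step_semi_supervised(labeled_docs, labels, unlabeled_docs):
    # Step 1: train on labeled documents, then label the unlabeled ones automatically.
    first = train_nb(labeled_docs, labels)
    pseudo_labels = [predict_nb(first, d) for d in unlabeled_docs]
    # Step 2: retrain on manually and automatically labeled documents together.
    return train_nb(labeled_docs + unlabeled_docs, labels + pseudo_labels)

# Hypothetical toy data.
labeled = [["goal", "match", "team"], ["election", "vote", "party"]]
labels = ["sports", "politics"]
unlabeled = [["team", "coach", "goal"], ["vote", "senate", "party"]]

model = two_step_semi_supervised(labeled, labels, unlabeled)
print(predict_nb(model, ["coach"]))
```

In this toy run the word "coach" appears only in an unlabeled document, so only the second-step classifier has any evidence about it.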

Rank-based Multiclass Gene Selection for Cancer Classification with Naive Bayes Classifiers based on Gene Expression Profiles (나이브 베이스 분류기를 이용한 유전발현 데이타기반 암 분류를 위한 순위기반 다중클래스 유전자 선택)

  • Hong, Jin-Hyuk;Cho, Sung-Bae
    • Journal of KIISE:Computer Systems and Theory / v.35 no.8 / pp.372-377 / 2008
  • Multiclass cancer classification has been actively investigated based on gene expression profiles, where it determines the type of cancer by analyzing the large amount of gene expression data collected by the DNA microarray technology. Since gene expression data include many genes not related to a target cancer, it is required to select informative genes in order to obtain highly accurate classification. Conventional rank-based gene selection methods often use ideal marker genes basically devised for binary classification, so it is difficult to directly apply them to multiclass classification. In this paper, we propose a novel method for multiclass gene selection, which does not use ideal marker genes but directly analyzes the distribution of gene expression. It measures the class-discriminability by discretizing gene expression levels into several regions and analyzing the frequency of training samples for each region, and then classifies samples by using the naive Bayes classifier. We have demonstrated the usefulness of the proposed method for various representative benchmark datasets of multiclass cancer classification.
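
A rough sketch of ranking genes by discretized expression levels is given below. The abstract does not state the exact discriminability measure, so the score used here (the fraction of training samples that fall in a region dominated by their own class), the equal-width regions, and all names and values are assumptions for illustration only.

```python
from collections import Counter

def discretize(values, n_regions=3):
    """Map each expression level to one of n_regions equal-width regions."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_regions or 1.0  # guard against constant expression
    return [min(int((v - lo) / width), n_regions - 1) for v in values]

def discriminability(values, labels, n_regions=3):
    """Assumed score: share of samples lying in a region where their class is the majority."""
    per_region = {}
    for region, label in zip(discretize(values, n_regions), labels):
        per_region.setdefault(region, Counter())[label] += 1
    return sum(max(c.values()) for c in per_region.values()) / len(labels)

def rank_genes(expression, labels, top_k=2):
    """Rank genes by descending score (ties broken by name) and keep the top_k."""
    scores = {g: discriminability(v, labels) for g, v in expression.items()}
    return sorted(scores, key=lambda g: (-scores[g], g))[:top_k]

# Hypothetical expression profiles for six samples from three classes.
labels = ["A", "A", "B", "B", "C", "C"]
expression = {
    "gene1": [0.1, 0.2, 1.1, 1.0, 2.0, 2.1],  # separates all three classes
    "gene2": [0.5, 1.9, 0.4, 2.0, 1.2, 0.1],  # unrelated to the classes
    "gene3": [0.0, 0.1, 0.1, 0.0, 3.0, 2.9],  # separates only class C
}
print(rank_genes(expression, labels))
```

The selected genes would then serve as the features of a naive Bayes classifier, as the abstract describes; that step is omitted here.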

Improving Text Categorization with High Quality Bigrams (고품질 바이그램을 이용한 문서 범주화 성능 향상)

  • Lee, Chan-Do;Tan, Chade-Meng;Wang, Yuan-Fang
    • The KIPS Transactions:PartB / v.9B no.4 / pp.415-420 / 2002
  • This paper presents an efficient text categorization algorithm that generates high quality bigrams by using the information gain metric, combined with various frequency thresholds. The bigrams, along with unigrams, are then given as features to a Naive Bayes classifier. The experimental results suggest that the bigrams, while small in number, can substantially contribute to improving text categorization. Upon close examination of the results, we conclude that the algorithm is most successful in correctly classifying more positive documents, but may cause more negative documents to be classified incorrectly.
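
Selecting bigrams by information gain under a frequency threshold might look like the sketch below. The thresholds, the use of document frequency, and the toy corpus are assumptions; the abstract does not give the specific values or the exact combination of thresholds used in the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def bigrams(tokens):
    """Set of adjacent word pairs in a tokenized document."""
    return {" ".join(pair) for pair in zip(tokens, tokens[1:])}

def information_gain(docs, labels, feature):
    """Entropy reduction from splitting documents on presence of the bigram."""
    with_f = [y for d, y in zip(docs, labels) if feature in bigrams(d)]
    without_f = [y for d, y in zip(docs, labels) if feature not in bigrams(d)]
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in (with_f, without_f) if p)
    return entropy(labels) - remainder

def select_bigrams(docs, labels, min_doc_freq=2, min_gain=0.1):
    """Keep bigrams that are frequent enough and informative enough."""
    doc_freq = Counter(b for d in docs for b in bigrams(d))
    candidates = [b for b, df in doc_freq.items() if df >= min_doc_freq]
    return sorted(b for b in candidates
                  if information_gain(docs, labels, b) >= min_gain)

# Hypothetical toy corpus.
docs = [
    "stock market fell today again".split(),
    "stock market rose today again".split(),
    "the team won today again".split(),
    "the team lost today again".split(),
]
labels = ["finance", "finance", "sports", "sports"]
print(select_bigrams(docs, labels))
```

Here "today again" occurs in every document, so it passes the frequency threshold but has zero information gain and is dropped. The surviving bigrams would be added to the unigram features given to the naive Bayes classifier.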

Comparison of Korean Classification Models' Korean Essay Score Range Prediction Performance (한국어 학습 모델별 한국어 쓰기 답안지 점수 구간 예측 성능 비교)

  • Cho, Heeryon;Im, Hyeonyeol;Yi, Yumi;Cha, Junwoo
    • KIPS Transactions on Software and Data Engineering / v.11 no.3 / pp.133-140 / 2022
  • We investigate the performance of deep learning-based Korean language models on the task of predicting the score range of Korean essays written by foreign students. We construct a data set containing a total of 304 essays, which include essays discussing the criteria for choosing a job ('job'), conditions of a happy life ('happ'), the relationship between money and happiness ('econ'), and the definition of success ('succ'). These essays were labeled according to four letter grades (A, B, C, and D), and a total of eleven essay score range prediction experiments were conducted (i.e., five for predicting the score range of 'job' essays, five for predicting the score range of 'happ' essays, and one for predicting the score range of mixed-topic essays). Three deep learning-based Korean language models, KoBERT, KcBERT, and KR-BERT, were fine-tuned using various training data. Moreover, two traditional probabilistic machine learning classifiers, naive Bayes and logistic regression, were also evaluated. Experimental results show that the deep learning-based Korean language models performed better than the two traditional classifiers, with KR-BERT performing the best at 55.83% overall average prediction accuracy. A close second was KcBERT (55.77%), followed by KoBERT (54.91%). The accuracies of the naive Bayes and logistic regression classifiers were 52.52% and 50.28%, respectively. Due to the scarcity of training data and the imbalance in class distribution, the overall prediction performance was not high for any classifier. Moreover, the classifiers' vocabulary did not explicitly capture the error features that were helpful in correctly grading the Korean essays. By overcoming these two limitations, we expect the score range prediction performance to improve.

A Method for Spam Message Filtering Based on Lifelong Machine Learning (Lifelong Machine Learning 기반 스팸 메시지 필터링 방법)

  • Ahn, Yeon-Sun;Jeong, Ok-Ran
    • Journal of IKEEE / v.23 no.4 / pp.1393-1399 / 2019
  • With the rapid growth of the Internet and the convenience of sending and receiving data, millions of indiscriminate advertising SMS messages are sent every day. Although spam words are still blocked manually, the emergence of machine learning has prompted active research into filtering spam in various ways. However, spam words and patterns change constantly to evade filtering, so existing machine learning mechanisms cannot detect or adapt to new words and patterns. Recently, the concept of lifelong learning has emerged to overcome these limitations by using existing knowledge to keep learning new knowledge continuously. In this paper, we propose a spam filtering method that uses an ensemble of naive Bayes, which is the most commonly used technique in document classification, and LLML (Lifelong Machine Learning). We validate the performance of lifelong learning by applying the ELLA model together with the naive Bayes classifier most commonly used in existing spam filters.
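
To illustrate why adapting to new spam words matters, the sketch below shows a naive Bayes filter whose counts are updated one message at a time. This is not ELLA and not the ensemble proposed in the paper; it is only a simplified, assumed stand-in, with a hypothetical class name and toy messages, that shows a filter absorbing obfuscated spam words it had not seen before.

```python
import math
from collections import Counter

class IncrementalNaiveBayes:
    """Naive Bayes spam filter whose counts can be updated message by message."""

    def __init__(self):
        self.doc_counts = Counter()
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.vocab = set()

    def update(self, tokens, label):
        """Absorb one newly labeled message without retraining from scratch."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        total = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            # Add-one smoothing on the prior avoids log(0) for an unseen class.
            score = math.log((self.doc_counts[label] + 1) / (total + 2))
            for w in tokens:
                if w in self.vocab:  # unknown words carry no evidence
                    score += math.log((counts[w] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

# Hypothetical messages.
nb = IncrementalNaiveBayes()
nb.update(["free", "prize", "click"], "spam")
nb.update(["lunch", "meeting", "tomorrow"], "ham")

message = ["fr33", "pr1ze", "meeting"]
before = nb.predict(message)                    # obfuscated words are still unknown
nb.update(["fr33", "pr1ze", "click"], "spam")   # a newly reported spam message
after = nb.predict(message)
print(before, after)
```

The obfuscated tokens contribute nothing until a message containing them has been added, after which the same message is scored differently.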

Causal Relation Extraction Using Cue Phrases and Lexical Pair Probabilities (단서 구문과 어휘 쌍 확률을 이용한 인과관계 추출)

  • Chang, Du-Seong;Choi, Key-Sun
    • Annual Conference on Human and Language Technology / 2003.10d / pp.163-169 / 2003
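  • Current question answering systems achieve a response success rate of up to about 80% on the TREC (Text Retrieval Conference) question set. However, performance varies considerably by question type, and the success rate for questions about causal relations is very low. This study aims to extract causal relations that exist between two adjacent sentences or between two noun phrases. Previous work on extracting causal relations between noun phrases relied mainly on causal cue phrases and the semantics of the two noun phrases, but it was difficult to make the correct choice when out-of-dictionary words were used. In addition, the training corpus had to be annotated with causal relations beforehand, which made it difficult to use large amounts of training data. This study proposes using lexical pairs extracted from causal noun phrase pairs together with the existing cue phrases. A naive Bayes classifier is used for causal relation classification, with an unsupervised learning procedure. Unlike existing classification models, the proposed model suffers no performance degradation from out-of-dictionary words and does not require the training corpus to be annotated with causal relations in advance. In experiments on extracting causal relations between noun phrases within a sentence, it achieved 79.07% accuracy. This is a 6.32% improvement over the method using cue phrases and noun phrase semantics, and is comparable to results obtained with supervised learning. The proposed learning and classification model is also applicable to causal relation extraction between sentences, and it achieved 74.68% accuracy in experiments on extracting causal relations between two adjacent sentences in Korean.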

An Empirical Comparison of Machine Learning Models for Classifying Emotions in Korean Twitter (한국어 트위터의 감정 분류를 위한 기계학습의 실증적 비교)

  • Lim, Joa-Sang;Kim, Jin-Man
    • Journal of Korea Multimedia Society / v.17 no.2 / pp.232-239 / 2014
  • As online texts grow rapidly, their automatic classification with machine learning methods is attracting more interest. Nevertheless, comparatively little research addresses Korean texts, and statistical evaluation of such methods is also rare. This study took a sample of tweets and used machine learning methods to classify emotions using morpheme and n-gram features. As a result, about 76% of the emotions contained in tweets were correctly classified. Of the two methods compared in this study, support vector machines were found to be more accurate than naive Bayes. The linear SVM model was not inferior to the non-linear one. Morphological features did not contribute to accuracy more than n-grams did.

A Hierarchical CPV Solar Generation Tracking System based on Modular Bayesian Network (베이지안 네트워크 기반 계층적 CPV 태양광 추적 시스템)

  • Park, Susang;Yang, Kyon-Mo;Cho, Sung-Bae
    • Journal of KIISE:Software and Applications / v.41 no.7 / pp.481-491 / 2014
  • Power production from renewable energy is becoming more important because of the limited supply of fossil fuels and the problem of global warming. Concentrated photovoltaic (CPV) systems are attracting attention for their high energy output as solar power production expands. These systems, however, need sophisticated tracking methods to achieve high power production. In this paper, we propose a hierarchical tracking system using modular Bayesian networks and a naive Bayes classifier. Bayesian networks can respond flexibly in uncertain situations and can be designed from domain knowledge even when data are insufficient. The Bayesian network modules infer the weather state, which is classified into nine classes. Then, the naive Bayes classifier selects the most effective tracking method given the inferred weather state, and the system makes a decision using rules. We collected real weather data for the experiments, and the average accuracy of the proposed method is 93.9%. In addition, photovoltaic efficiency improved by about 16.58% compared with the pinhole camera system.

Sensitivity Identification Method for New Words of Social Media based on Naive Bayes Classification (나이브 베이즈 기반 소셜 미디어 상의 신조어 감성 판별 기법)

  • Kim, Jeong In;Park, Sang Jin;Kim, Hyoung Ju;Choi, Jun Ho;Kim, Han Il;Kim, Pan Koo
    • Smart Media Journal / v.9 no.1 / pp.51-59 / 2020
  • From the era of PC communication through the development of the Internet, new words have continually been coined on social media; with the spread of smartphones a social media culture has formed, and newly coined words are becoming part of that culture. With social networking sites and smartphones serving as a bridge, the amount of data is growing in real time. New words have many advantages, such as allowing short sentences that work around the character limits of various messengers and reduce data. However, new words have no dictionary meaning, which limits and degrades algorithms such as data mining. Therefore, in this paper, we determine the opinion of a document by collecting data through web crawling, extracting the new words contained in the text data, and building a sentiment classification. The experiment proceeds in three stages. First, new words collected from social media are learned as positive or negative. Next, to derive and verify sentiment values using standard-language documents, TF-IDF is used to score the sentiment of nouns and assign sentiment values to the data. As with the new words, the classified sentiment values are applied to verify that sentiment is classified in standard-language documents. Finally, the sentiment values of the newly coined words and of standard language are combined for a comparative analysis of the technique.
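
The TF-IDF weighting mentioned in this abstract can be sketched as below. The formula shown is one common variant (term frequency normalized by document length, multiplied by the log of inverse document frequency), and the polarity lexicon and documents are hypothetical; the abstract does not specify how the paper combines TF-IDF with sentiment values.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document TF-IDF weights: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(w for d in docs for w in set(d))
    weights = []
    for d in docs:
        counts = Counter(d)
        weights.append({w: (c / len(d)) * math.log(n_docs / doc_freq[w])
                        for w, c in counts.items()})
    return weights

def sentiment_score(doc_weights, polarity):
    """Assumed scoring rule: TF-IDF weighted sum of word polarities (+1 / -1)."""
    return sum(weight * polarity.get(w, 0) for w, weight in doc_weights.items())

# Hypothetical tokenized documents and polarity lexicon.
docs = [["movie", "great", "great"], ["movie", "boring"], ["movie", "great", "plot"]]
polarity = {"great": 1, "boring": -1}

weights = tf_idf(docs)
scores = [sentiment_score(w, polarity) for w in weights]
print(scores)
```

A word such as "movie" that appears in every document receives zero weight, so it cannot influence the score regardless of its polarity.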

MMORPG Bot Detection Technique Using Character Names (캐릭터 이름을 이용한 MMORPG 봇 탐지 기법)

  • Kang, Sung Wook;Lee, Eun Jo
    • Review of KIISC / v.27 no.4 / pp.6-13 / 2017
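  • In online games, professional private operations that run game bots at scale using illegal programs are commonly called "gold farming groups" (GFGs). Game bots run by such groups used to maximize profit through repetitive farming around the clock, but as account registration for online games has become easier and free-to-play has become common, they are shifting to rotating through tens of thousands of accounts while lowering the play time and the amount of goods acquired by each individual bot account. As a result, existing detection models based on play activity patterns are gradually being neutralized, and detecting and sanctioning game bots quickly, soon after they enter the game, is becoming increasingly important. As a way to detect game bots early, we propose a game bot detection technique that exploits the characteristics of account and character names. To validate the proposed technique, we measured detection performance using information on about 200,000 accounts of "Blade & Soul", NCSOFT's MMORPG in service in North America. In the experiment, simply applying a basic naive Bayes classifier to character names achieved a performance of about 0.901 in terms of AUC.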
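
A naive Bayes classifier over character names, evaluated by AUC, could be sketched as follows. The abstract does not say which name features were used, so character bigrams are an assumption here, and all names in the toy data are invented; the example does not reproduce the reported 0.901 figure.

```python
import math
from collections import Counter

def char_bigrams(name):
    """Character bigrams of a lowercased name, with start (^) and end ($) markers."""
    padded = f"^{name.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def train(names, is_bot):
    """Count character bigrams per class (True = bot, False = human)."""
    counts = {True: Counter(), False: Counter()}
    priors = Counter(is_bot)
    for name, label in zip(names, is_bot):
        counts[label].update(char_bigrams(name))
    vocab = set(counts[True]) | set(counts[False])
    return counts, priors, vocab

def bot_log_odds(model, name):
    """Log odds that a name belongs to a bot, with add-one smoothing."""
    counts, priors, vocab = model
    totals = {label: sum(c.values()) + len(vocab) for label, c in counts.items()}
    score = math.log(priors[True] / priors[False])
    for gram in char_bigrams(name):
        if gram in vocab:  # bigrams unseen in training are skipped
            p_bot = (counts[True][gram] + 1) / totals[True]
            p_human = (counts[False][gram] + 1) / totals[False]
            score += math.log(p_bot / p_human)
    return score

def auc(scores, labels):
    """Probability that a random bot scores higher than a random human (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented training and evaluation names.
model = train(["xkqz8841", "qwzx7731", "zxkq9902", "DarkKnight", "MoonElf", "SwordMaster"],
              [True, True, True, False, False, False])
eval_names = ["kqxz5518", "ElfKnight", "zxqw1200", "MasterMoon"]
eval_labels = [True, False, True, False]
eval_scores = [bot_log_odds(model, n) for n in eval_names]
print(auc(eval_scores, eval_labels))
```

Because only the name is needed, a score like this is available at account creation, before any play activity has been logged, which matches the early-detection motivation in the abstract.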