Search | Korea Science

A Text Categorization Method Improved by Removing Noisy Training Documents (오류 학습 문서 제거를 통한 문서 범주화 기법의 성능 향상)

Han, Hyoung-Dong;Ko, Young-Joong;Seo, Jung-Yun
- Journal of KIISE:Software and Applications / v.32 no.9 / pp.912-919 / 2005
When binary classification is applied to multi-class text categorization, the One-Against-All method is generally used. However, this method has a problem: the documents in the negative set are not labeled by humans, so the training data can include many noisy documents. In this paper, we propose applying the Sliding Window technique and the EM algorithm to binary text classification to solve this problem. We improve binary text classification by extracting noisy documents from the training data with the Sliding Window technique and re-assigning the categories of these documents using the EM algorithm.

A Comparative Study on Category Assignment Methods of a KNN Classifier (KNN 분류기의 범주할당 방법 비교 실험)

이영숙;정영미
- Proceedings of the Korean Society for Information Management Conference / 2000.08a / pp.37-40 / 2000
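In automatic document classification using KNN (K-Nearest Neighbors), a category is assigned to a new input document from the per-category classification frequency or similarity of its K most similar documents. In this study, we adapted the category assignment methods commonly used with KNN to devise methods that give extra weight to the top-ranked document and to the top M documents among the K similar documents, and we compared their performance as the value of K varies.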
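
The assignment rules described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `assign_category` and the parameters `m` and `boost` are assumptions, since the abstract does not state how the extra weight for the top M documents is computed.

```python
from collections import defaultdict

def assign_category(neighbors, m=0, boost=1.0, use_similarity=True):
    """Pick a category from the K most similar documents.

    neighbors: list of (similarity, category) pairs, one per neighbor.
    m, boost:  hypothetical weighting scheme; the top m neighbors have
               their vote multiplied by boost (assumed, not from the paper).
    """
    # Rank neighbors from most to least similar.
    ranked = sorted(neighbors, key=lambda pair: pair[0], reverse=True)
    scores = defaultdict(float)
    for rank, (similarity, category) in enumerate(ranked):
        # Similarity-sum voting, or plain frequency voting.
        vote = similarity if use_similarity else 1.0
        if rank < m:
            vote *= boost
        scores[category] += vote
    # The category with the largest accumulated score wins.
    return max(scores, key=scores.get)

neighbors = [(0.9, "A"), (0.5, "B"), (0.45, "B")]
by_frequency = assign_category(neighbors, use_similarity=False)
by_similarity = assign_category(neighbors)
top_boosted = assign_category(neighbors, m=1, boost=2.0)
```

With these example neighbors, frequency voting and similarity-sum voting both favor the category that appears twice, while doubling the weight of the single top-ranked document shifts the decision to its category.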

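International Standardization Trends in ITU-T Application Service Security and Service Oriented Architecture (SOA) (ITU-T 응용서비스 보안 및 서비스 지향 구조(SOA) 국제표준화 동향)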

Lim, Hyung-Jin;Seo, Dae-Hee;Nah, Jae-Hoon
- Review of KIISC / v.21 no.2 / pp.53-60 / 2011
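Within the international standards organization ITU-T, Study Group 17 leads standardization of application security for information and communications, and develops international information security standards through four subordinate Questions. Among these, Q.7 (chair: Jae-Hoon Nah, ETRI) is responsible, under the heading of secure application services, for developing international standards that can be applied to protecting application services in the information and communications environment, such as secure application protocols, web services security, and P2P (Peer-to-Peer) security. Q.8 (chair: Liang Wei, CATR) is responsible, under the heading of service oriented architecture, for developing international standards related to SOA (Service Oriented Architecture) technology and communications security. Q.7 has so far established seven international standards, and six draft standards are under development. Q.8 currently has three draft standards under development. This paper presents the standardization status of these Questions and their future directions.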

A study on the optimal parameter design by analyzing the ordered categorical data (순차 범주형 데이타분석을 위한 최적모수설계에 관한 연구)

전태준;홍남표;박호일
- Proceedings of the Korean Operations and Management Science Society Conference / 1992.04b / pp.188-197 / 1992
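Experimental results from applied or development research on product development are sometimes classified as ordered categorical data, either because of the intrinsic nature of the quality characteristic or for convenience of measurement. For the analysis of the nominal-the-best type problem, this paper examines the problems of the existing Taguchi accumulation analysis as a method for ordered categorical data, and proposes a target accumulation method based on quality loss to improve on it. When the proposed technique was applied to post-etch contact window data, it made the optimal levels of the factors easy to determine.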

Automatic Text Categorization based on Semi-Supervised Learning (준지도 학습 기반의 자동 문서 범주화)

Ko, Young-Joong;Seo, Jung-Yun
- Journal of KIISE:Software and Applications / v.35 no.5 / pp.325-334 / 2008
The goal of text categorization is to classify documents into a certain number of pre-defined categories. Previous studies in this area have used a large number of labeled training documents for supervised learning. One problem is that labeled training documents are difficult to create: while it is easy to collect unlabeled documents, it is not so easy to categorize them manually to create training documents. In this paper, we propose a new text categorization method based on semi-supervised learning. The proposed method uses only unlabeled documents and keywords of each category, and it automatically constructs training data from them. A text classifier then learns from this data and classifies text documents. The proposed method shows a similar degree of performance compared with traditional supervised learning methods. Therefore, this method can be used in areas where low-cost text categorization is needed. It can also be used for creating labeled training documents.

An Analysis of Categorical Time Series Driven by Clipping GARCH Processes (연속형-GARCH 시계열의 범주형화(Clipping)를 통한 분석)

Choi, M.S.;Baek, J.S.;Hwan, S.Y.
- The Korean Journal of Applied Statistics / v.23 no.4 / pp.683-692 / 2010
This short article is concerned with a categorical time series obtained by clipping a heteroscedastic GARCH process. Estimation methods are discussed for the model parameters appearing both in the original process and in the binary time series that results from clipping (cf. Zhen and Basawa, 2009). Assuming an AR-GARCH model for heteroscedastic time series, three data sets from the Korean stock market are analyzed and illustrated with applications to calculating certain probabilities associated with the AR-GARCH process.
https://doi.org/10.5351/KJAS.2010.23.4.683
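
To make the idea of clipping concrete, the sketch below simulates an AR(1)-GARCH(1,1) series and turns it into a binary series by thresholding. It is only an illustration under assumed settings: the model orders, the parameter values (`phi`, `omega`, `alpha`, `beta`), the threshold, and the function names are all assumptions, not taken from the article.

```python
import math
import random

def simulate_ar_garch(n, phi=0.1, omega=0.1, alpha=0.1, beta=0.8, seed=0):
    """Simulate y_t = phi * y_{t-1} + eps_t with GARCH(1,1) errors.

    sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1}
    eps_t    = sqrt(sigma2_t) * z_t,  z_t ~ N(0, 1)
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    sigma2 = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    y_prev, eps_prev = 0.0, 0.0
    series = []
    for _ in range(n):
        sigma2 = omega + alpha * eps_prev ** 2 + beta * sigma2
        eps = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        y = phi * y_prev + eps
        series.append(y)
        y_prev, eps_prev = y, eps
    return series

def clip(series, threshold=0.0):
    """Clip a real-valued series into a binary (0/1) series."""
    return [1 if y > threshold else 0 for y in series]

returns = simulate_ar_garch(500)
binary = clip(returns)
# Empirical estimate of P(y_t > threshold) from the clipped series.
exceed_rate = sum(binary) / len(binary)
```

The share of ones in the clipped series is a simple empirical estimate of the probability that the original process exceeds the threshold, which is the kind of quantity the abstract refers to.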

Latent class model for mixed variables with applications to text data (혼합모드 잠재범주모형을 통한 텍스트 자료의 분석)

Shin, Hyun Soo;Seo, Byungtae
- The Korean Journal of Applied Statistics / v.32 no.6 / pp.837-849 / 2019
Latent class models (LCM) are useful tools for drawing hidden information from categorical data. Such a model can also be interpreted as a mixture model with multinomial component distributions. In some cases, however, an available dataset may contain both categorical and count or continuous data. For such cases, we can extend the LCM to a mixture model with both multinomial and other component distributions, such as normal and Poisson distributions. In this paper, we consider an LCM for data containing categorical and count data to analyze the Drug Review dataset, which contains categorical responses and text reviews. From this data analysis, we show that we can obtain more specific hidden information than from an LCM with categorical responses only.
https://doi.org/10.5351/KJAS.2019.32.6.837
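
As a rough illustration of the extension described in this abstract, a latent class model with $J$ categorical variables $x_j$ and $L$ count variables $y_l$ can be written as a finite mixture under the usual local independence assumption. The notation below ($G$ classes, mixing weights $\pi_g$, category probabilities $\theta_{gjc}$, Poisson rates $\lambda_{gl}$) is assumed for illustration and is not necessarily the parameterization used in the paper:

$$
f(\mathbf{x}, \mathbf{y}) = \sum_{g=1}^{G} \pi_g \left[ \prod_{j=1}^{J} \prod_{c=1}^{C_j} \theta_{gjc}^{\,\mathbb{1}(x_j = c)} \right] \left[ \prod_{l=1}^{L} \frac{e^{-\lambda_{gl}} \lambda_{gl}^{y_l}}{y_l!} \right],
\qquad \sum_{g=1}^{G} \pi_g = 1, \quad \sum_{c=1}^{C_j} \theta_{gjc} = 1 .
$$

Dropping the Poisson factor recovers the ordinary LCM with multinomial components only.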

Categorical Data Analysis System in the internet (인터넷상에서의 범주형 자료분석 시스템 개발)

홍종선;김동욱;오민권
- The Korean Journal of Applied Statistics / v.12 no.1 / pp.83-95 / 1999
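The purpose of this paper is to provide a categorical data analysis system on the Internet that general analysts without expert knowledge of categorical data analysis can use easily and conveniently. The system was designed and implemented around three aspects. First, it provides three kinds of histograms for exploratory analysis of categorical data. Second, it provides several measures of association for measuring the association among categorical variables. In particular, mosaic plots and association plots, which widely used statistical packages do not currently provide, are implemented with dynamic graphics so that users can obtain useful information for measuring association or specifying a model. Third, through the analysis of log-linear models, users can select the best-fitting log-linear model.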

Developing of Exact Tests for Order-Restrictions in Categorical Data (범주형 자료에서 순서화된 대립가설 검정을 위한 정확검정의 개발)

Nam, Jusun;Kang, Seung-Ho
- The Korean Journal of Applied Statistics / v.26 no.4 / pp.595-610 / 2013
Testing of order-restricted alternative hypotheses in $2 \times k$ contingency tables can be applied in various fields such as medicine, sociology, and business administration. Most testing methods have been developed based on large-sample theory. When the sample size is small or unbalanced, the Type I error rate of a test based on large-sample theory can differ substantially from the nominal level of 5%. In this paper, an exact testing method is introduced for testing an order-restricted alternative hypothesis in categorical data, particularly when the sample size is small or the data are extremely unbalanced. Power and exact p-values are calculated.
https://doi.org/10.5351/KJAS.2013.26.4.595

An Incremental Method Using Sample Split Points for Global Discretization (전역적 범주화를 위한 샘플 분할 포인트를 이용한 점진적 기법)

한경식;이수원
- Journal of KIISE:Software and Applications / v.31 no.7 / pp.849-858 / 2004
Most supervised learning algorithms can be applied only after continuous variables are transformed into categorical ones at a preprocessing stage, in order to avoid the difficulty of processing continuous variables. This preprocessing stage, called global discretization, uses a class distribution list called bins. However, when the data are large and the range of the variable to be discretized is very wide, a great deal of sorting and merging must be performed to produce a single bin, because most global discretization methods need a single bin. Also, because existing methods perform discretization in batch mode, when new data are added they must perform discretization from scratch to construct categories that reflect the new data. This paper proposes a method that extracts sample points and performs discretization from these sample points in order to solve these problems. Because the approach does not require merging to produce a single bin, it is efficient when large data need to be discretized. In this study, an experiment using real and synthetic datasets was conducted to compare the proposed method with an existing one.
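
The general idea of discretizing from sample points, rather than sorting the full data, can be illustrated with a much simpler stand-in: equal-frequency cut points estimated from a random sample. This is not the method proposed in the paper (which is supervised and uses class distributions); the function names and the parameters `n_bins` and `sample_size` are assumptions for illustration only.

```python
import bisect
import random

def sample_cut_points(values, n_bins, sample_size, seed=0):
    """Estimate equal-frequency cut points from a random sample of values.

    Only the sample is sorted, so the cost does not depend on sorting
    the full data. values must be a sequence (e.g. a list).
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(values, min(sample_size, len(values))))
    # Take the sample quantiles at 1/n_bins, 2/n_bins, ... as cut points.
    return [sample[(i * len(sample)) // n_bins] for i in range(1, n_bins)]

def discretize(value, cut_points):
    """Map a continuous value to the index of its bin."""
    return bisect.bisect_right(cut_points, value)

data = list(range(100))
cuts = sample_cut_points(data, n_bins=4, sample_size=100)
labels = [discretize(v, cuts) for v in data]
```

Because the cut points are stored separately from the data, newly arriving values can be assigned to bins with `discretize` without reprocessing earlier data, which loosely mirrors the incremental motivation stated in the abstract.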

Search Result 243, Processing Time 0.02 seconds
