A Co-training Method based on Classification Using Unlabeled Data

Journal of KIISE:Software and Applications (한국정보과학회논문지:소프트웨어및응용)

Volume 31, Issue 8, pp. 991-998, 2004, pISSN 1229-6848

Korean Institute of Information Scientists and Engineers (한국정보과학회)

비분류표시 데이타를 이용하는 분류 기반 Co-training 방법

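Hye-Sung Yoon (윤혜성; Department of Computer Science, Ewha Womans University);
Sang-Ho Lee (이상호; Department of Computer Science, Ewha Womans University);
Seung-Soo Park (박승수; Department of Computer Science, Ewha Womans University);
Hwan-Seung Yong (용환승; Department of Computer Science, Ewha Womans University);
Ju-Han Kim (김주한; Biomedical Informatics, Seoul National University College of Medicine)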

Published : 2004.08.01

Abstract

In many practical learning problems, including those in bioinformatics, only a small amount of labeled data is available alongside a large pool of unlabeled data. Labeled examples are expensive to obtain because they require human effort, whereas unlabeled examples can be gathered cheaply without an expert. Co-training is a common method for exploiting unlabeled data in classification and analysis. It uses a small set of labeled examples to learn a classifier in each of two views. Each classifier is then applied to all unlabeled examples, and co-training selects the examples on which each classifier makes its most confident predictions. Over successive iterations, new classifiers are learned from the training data and the number of labeled examples grows. In this paper, we propose a new co-training strategy that uses unlabeled data, and we evaluate it with two classifiers on two experimental data sets: WebKB and BIND XML data. Our experiments show that the proposed co-training technique effectively improves classification accuracy when the number of labeled examples is very small.

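When analyzing data in many application areas such as bioinformatics, there may be a small amount of labeled data and a large amount of unlabeled data. Labeled data is difficult and costly to obtain because it requires human effort, whereas unlabeled data can be obtained easily. A widely used method for classifying and analyzing data with the help of unlabeled data is the co-training algorithm. This method learns a classifier in each of two views from a small amount of labeled data. Each classifier then produces its most confident predictions over all the unlabeled data to be analyzed. Repeating this process over the training data set many times yields new classifiers in each view and increases the amount of labeled data. This paper presents a new co-training method that uses unlabeled data. The method was evaluated with two classifiers and two experimental data sets, WebKB and BIND XML. The experimental results show that the co-training method proposed in this paper can effectively improve classification accuracy when the amount of labeled data is very small.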
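The generic co-training loop that the abstract summarizes (one classifier per view, each nominating its most confident unlabeled examples for the labeled set) can be sketched as follows. This is a minimal illustration only, not the new strategy proposed in the paper, whose details are not given on this page. The nearest-centroid classifier, the margin-based confidence score, and all names (CentroidClassifier, fit_views, co_train, predict, rounds, per_round) are assumptions made for the example.

```python
from collections import Counter


class CentroidClassifier:
    """Nearest-centroid classifier over one view (a tuple of floats).

    Hypothetical stand-in for whichever base classifier is actually used.
    """

    def fit(self, X, y):
        counts = Counter(y)
        sums = {}
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict_with_confidence(self, x):
        # Squared distance to every class centroid, nearest first.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, centroid)), label)
            for label, centroid in self.centroids.items()
        )
        best_dist, best_label = dists[0]
        # Assumed confidence score: gap between nearest and second-nearest centroid.
        margin = dists[1][0] - best_dist if len(dists) > 1 else 0.0
        return best_label, margin


def fit_views(labeled):
    """Train one classifier per view on the current labeled set."""
    labels = [y for _, y in labeled]
    return [
        CentroidClassifier().fit([x[v] for x, _ in labeled], labels)
        for v in (0, 1)
    ]


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """labeled: list of ((view1, view2), label); unlabeled: list of (view1, view2)."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        picked = {}  # pool index -> predicted label
        for v, clf in enumerate(fit_views(labeled)):
            scored = [(clf.predict_with_confidence(x[v]), i) for i, x in enumerate(pool)]
            scored.sort(key=lambda item: item[0][1], reverse=True)
            for (label, _margin), i in scored[:per_round]:
                picked.setdefault(i, label)  # first view to claim an example wins
        for i in sorted(picked, reverse=True):  # pop from the end so indices stay valid
            labeled.append((pool.pop(i), picked[i]))
    return fit_views(labeled), labeled


def predict(clfs, x):
    """Combine the views by trusting whichever classifier is more confident."""
    guesses = [clf.predict_with_confidence(x[v]) for v, clf in enumerate(clfs)]
    return max(guesses, key=lambda guess: guess[1])[0]


if __name__ == "__main__":
    seed = [(((0.0,), (0.0,)), 0), (((10.0,), (10.0,)), 1)]
    pool = [((1.0,), (0.5,)), ((9.0,), (9.5,)), ((2.0,), (1.5,)), ((8.0,), (8.5,))]
    classifiers, grown = co_train(seed, pool)
    print(len(grown), predict(classifiers, ((1.2,), (0.8,))))
```

Each example is a pair of feature tuples, one per view. Starting from two labeled seeds, the loop moves pool examples into the labeled set a few at a time, so later rounds train on more data than the seeds alone. In practice the per-view classifiers and the confidence measure would be chosen to suit the data.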
