• Title/Summary/Keyword: Automatic Document Classification

The selection of Best suited Automatic Web Document Classification Based on Intranet (인트라넷 기반의 최적의 웹문서 자동 분류기법 선정)

  • 김국희;윤희병
    • Proceedings of the Korean Institute of Intelligent Systems Conference
    • /
    • 2004.10a
    • /
    • pp.423-426
    • /
    • 2004
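  • On intranets, web search engines are being actively introduced to search the growing volume of web documents, and most of them assume that the user already knows the keyword to look for. When users do not know what they should be looking for, however, a web document classification scheme can offer an efficient alternative. Some existing classification schemes are built by hand and therefore struggle to keep up with the growing volume of web documents, so classification based on automatic techniques should be more efficient. In this paper, taking the manually built classification scheme of the defense intranet as the target, we apply various classification techniques with different term-weighting methods, compare and evaluate their performance, and apply them to an automatic web document classification system in order to improve classification performance.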

A Feature Selection Technique for an Efficient Document Automatic Classification (효율적인 문서 자동 분류를 위한 대표 색인어 추출 기법)

  • 김지숙;김영지;문현정;우용태
    • The Journal of Information Technology and Database
    • /
    • v.8 no.1
    • /
    • pp.117-128
    • /
    • 2001
  • Much recent text-mining research seeks interesting patterns or association rules in large collections of text documents. However, the words extracted from informal documents tend to be irregular and include too many general words, so existing methods have difficulty retrieving knowledge effectively. In this paper, we propose a new feature extraction method for classifying large document collections using association rules, based on an unsupervised learning technique. Experiments show the efficiency of the proposed method in extracting features and classifying documents.

Automatic Document Classification by Term-Weighting Method (범주 대표어의 가중치 계산 방식에 의한 자동 문서 분류 시스템)

  • 이경찬;강승식
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2002.04b
    • /
    • pp.475-477
    • /
    • 2002
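  • Automatic document classification selects the most similar category by comparing the similarity between each category's feature vector and the input document vector. To implement the classification system, we built the feature vector of each category in the form of an inverted file, as used in information retrieval systems, and measured the accuracy of the system under different term-weighting methods. The test documents were a set of articles randomly sampled from daily newspapers, and the TF-IDF scheme commonly used in information retrieval models performed better than the modified variants.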
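
The comparison of category vectors and document vectors described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' system: the toy training data, the function names, the choice of computing IDF over categories, and the use of cosine similarity are all assumptions made for the example.

```python
import math
from collections import Counter


def build_category_vectors(training_docs):
    """Build one TF-IDF vector per category from tokenized training documents."""
    # Term frequency: all documents of a category are pooled into one bag of words.
    tf = {cat: Counter(tok for doc in docs for tok in doc)
          for cat, docs in training_docs.items()}
    # Assumption: IDF is computed over categories, not over individual documents.
    df = Counter(term for counts in tf.values() for term in counts)
    idf = {term: math.log(len(tf) / df[term]) + 1.0 for term in df}
    vectors = {cat: {t: n * idf[t] for t, n in counts.items()}
               for cat, counts in tf.items()}
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tokens, vectors, idf):
    """Return the category whose vector is most similar to the input document."""
    # Terms never seen in training carry no weight and are dropped.
    doc_vec = {t: n * idf[t] for t, n in Counter(tokens).items() if t in idf}
    return max(vectors, key=lambda cat: cosine(doc_vec, vectors[cat]))


# Hypothetical toy data standing in for the newspaper articles.
training = {
    "sports": [["game", "team", "score"], ["team", "win"]],
    "economy": [["market", "stock", "price"], ["price", "rise"]],
}
vectors, idf = build_category_vectors(training)
print(classify(["team", "score", "win"], vectors, idf))
```

Swapping the weighting line inside `build_category_vectors` is where alternative term-weighting schemes would be plugged in for a comparison like the one the abstract reports.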

Similar Patent Search Service System using Latent Dirichlet Allocation (잠재 의미 분석을 적용한 유사 특허 검색 서비스 시스템)

  • Lim, HyunKeun;Kim, Jaeyoon;Jung, Hoekyung
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.22 no.8
    • /
    • pp.1049-1054
    • /
    • 2018
  • Keyword searching was used in the past to find similar patents, and automated classification by machine learning has been used more recently. Keyword searching analyzes data that has been structured through data refinement. Its accuracy is high for short text, but it cannot analyze the meaning contained in the sentences of long, multi-word text such as documents. At the semantic analysis level, automatic classification is used to classify sentences composed of several words through unstructured data analysis. There have been attempts to find similar documents by combining the two methods. However, the combined algorithm is problematic because the two analysis methods differ, and structured and unstructured data must be handled at the same time. In this paper, we study a method that extracts the keywords implied in a document and uses LDA (Latent Dirichlet Allocation) to classify documents efficiently without human intervention and to find similar patents.

Automatic Classification of Academic Articles Using BERT Model Based on Deep Learning (딥러닝 기반의 BERT 모델을 활용한 학술 문헌 자동분류)

  • Kim, In hu;Kim, Seong hee
    • Journal of the Korean Society for Information Management
    • /
    • v.39 no.3
    • /
    • pp.293-310
    • /
    • 2022
  • In this study, we analyzed the performance of a BERT-based document classification model by automatically classifying documents in the field of library and information science using KoBERT. For this purpose, abstracts of 5,357 papers from 7 journals in the field were analyzed, and we evaluated whether automatic classification performance differed according to the size of the training data. Precision, recall, and F-measure were used as performance measures. Subject areas with large amounts of high-quality data showed a high level of performance, with an F-measure of 90% or more. On the other hand, where data quality was low, similarity with other subject areas was high, and few features clearly distinguished the subject, no meaningfully high performance could be obtained. This study is expected to serve as basic data suggesting the possibility of using a pre-trained model to automatically classify academic documents.
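
For reference, the three evaluation measures named in this abstract can be computed per subject area as in the sketch below. The labels and predictions are hypothetical, and the balanced F1 form of the F-measure is an assumption, since the abstract does not say which variant was used.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall, and F1 for one target label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold labels and model predictions for two subject areas.
y_true = ["A", "A", "A", "B", "B"]
y_pred = ["A", "A", "B", "A", "B"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, "A")
print(round(precision, 3), round(recall, 3), round(f1, 3))
```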

Automatic Title Detection by Spatial Feature and Projection Profile for Document Images (공간 정보와 투영 프로파일을 이용한 문서 영상에서의 타이틀 영역 추출)

  • Park, Hyo-Jin;Kim, Bo-Ram;Kim, Wook-Hyun
    • Journal of the Institute of Convergence Signal Processing
    • /
    • v.11 no.3
    • /
    • pp.209-214
    • /
    • 2010
  • This paper proposes an algorithm for segmentation and title detection in document images. The automated title detection method consists of two phases: segmentation and title area detection. In the first phase, we extract and segment the document image: the binary map is segmented by a combination of morphological operations and CCA (connected component algorithm). This phase provides the segmented regions that serve as title candidates for the second phase. Candidate title areas are detected using geometric information, and the title region is then extracted by removing non-title regions. After a classification step that removes non-text regions, projection is performed to detect the title region. Because the largest font in a document is usually used for the title, horizontal projection is performed within the text areas. The proposed method handles various forms of document images using geometric features and projection profile analysis, and is expected to have applications such as document title recognition, multimedia data searching, and real-time image processing.
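
The horizontal projection step mentioned in this abstract can be illustrated with a small sketch. It assumes a binarized text region given as rows of 0/1 pixels and uses band height as a stand-in for font size. The function names, the threshold, and the "tallest band" rule are illustrative assumptions, not the paper's actual criteria.

```python
def horizontal_projection(binary_image):
    """Count foreground pixels (value 1) in every row of a binary image."""
    return [sum(row) for row in binary_image]


def text_line_bands(profile, threshold=0):
    """Split the profile into (top, bottom) row ranges whose counts exceed the threshold."""
    bands, start = [], None
    for y, count in enumerate(profile):
        if count > threshold and start is None:
            start = y                      # a text band begins
        elif count <= threshold and start is not None:
            bands.append((start, y - 1))   # the band ended on the previous row
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands


def tallest_band(bands):
    """Pick the tallest band, on the assumption that the title uses the largest font."""
    return max(bands, key=lambda band: band[1] - band[0] + 1)


# Hypothetical 7x4 binary region: a three-row line followed by a one-row line.
image = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
profile = horizontal_projection(image)
bands = text_line_bands(profile)
print(profile, bands, tallest_band(bands))
```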

Comparison of Document Features Extraction Methods for Automatic Classification of Real World FAQ Mails (실세계의 FAQ 메일 자동분류를 위한 문서 특징추출 방법의 성능 비교)

  • 홍진혁;류중원;조성배
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2001.04b
    • /
    • pp.271-273
    • /
    • 2001
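  • The importance of automatic document classification has recently become widely recognized, and a variety of research is under way. In this paper, we implement various feature extraction methods for effective automatic classification of Korean documents and identify an efficient feature extraction method for real-world query mails. The experiments used seven feature extraction methods, including document frequency, information gain, mutual information, and $\chi^2$. Applied to 463 real test query mails, the $\chi^2$ method performed best, with a recognition rate of 74.7%. In contrast, information gain, one of the most commonly used methods alongside $\chi^2$, reached a recognition rate of at most 40.6%.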
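
As an illustration of the chi-square statistic named in this abstract, the sketch below scores each term against one category using the standard 2x2 contingency-table form. The mail data, labels, and function names are hypothetical, and the paper's exact preprocessing is not given in the abstract.

```python
def chi_square(a, b, c, d):
    """Chi-square for a term/category pair from document counts.

    a: in category, contains term      b: not in category, contains term
    c: in category, lacks term         d: not in category, lacks term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def chi_square_scores(docs, category):
    """Score every term for one category; docs is a list of (set_of_terms, label)."""
    vocab = set().union(*(terms for terms, _ in docs))
    scores = {}
    for term in vocab:
        a = sum(1 for terms, lab in docs if term in terms and lab == category)
        b = sum(1 for terms, lab in docs if term in terms and lab != category)
        c = sum(1 for terms, lab in docs if term not in terms and lab == category)
        d = sum(1 for terms, lab in docs if term not in terms and lab != category)
        scores[term] = chi_square(a, b, c, d)
    return scores


# Hypothetical FAQ mails, already tokenized into term sets.
mails = [
    ({"password", "reset"}, "account"),
    ({"password", "login"}, "account"),
    ({"refund", "order"}, "billing"),
    ({"order", "delay"}, "billing"),
]
scores = chi_square_scores(mails, "account")
# The statistic measures dependence in either direction, so a term that only
# appears outside the category ("order") scores as high as one that only appears inside it.
for term, score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print(term, round(score, 3))
```

Feature selection then keeps the top-scoring terms per category and discards the rest before training the classifier.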

A Study on the Documents's Automatic Classification Using Machine Learning (기계학습을 이용한 문서 자동분류에 관한 연구)

  • Kim, Seong-Hee;Eom, Jae-Eun
    • Journal of Information Management
    • /
    • v.39 no.4
    • /
    • pp.47-66
    • /
    • 2008
  • This study introduced machine learning algorithms to overcome the limitations of manual classification and to provide users with a faster and more accurate classification service. The experimental data consisted of 100 literature titles for each of eight subject categories in MeSH. The algorithms used in the experiments were a neural network, C5.0, CHAID, and KNN. The combination of the neural network and C5.0 recorded a classification accuracy of 83.75%, which was 2.5% and 3.75% higher than the neural network alone and C5.0 alone, respectively, and was the highest accuracy among the four classification experiments. Thus, using the neural network and C5.0 together can be expected to yield higher accuracy than using either technique alone.

Automatic Classification of Web documents According to their Styles (스타일에 따른 웹 문서의 자동 분류)

  • Lee, Kong-Joo;Lim, Chul-Su;Kim, Jae-Hoon
    • The KIPS Transactions: Part B
    • /
    • v.11B no.5
    • /
    • pp.555-562
    • /
    • 2004
  • A genre or style offers a view of documents that differs from subject or topic, and style is also a criterion for classifying documents. There have been several studies on detecting the style of textual documents, but only a few of them dealt with web documents. In this paper, we suggest sets of features for detecting the styles of web documents. Web documents differ from textual documents in that they contain URLs and HTML tags within their pages. We introduce features specific to web documents, which are extracted from URLs and HTML tags. Experimental results enable us to evaluate their characteristics and performance.

Improving the Performance of Document Clustering with Distributional Similarities (분포유사도를 이용한 문헌클러스터링의 성능향상에 대한 연구)

  • Lee, Jae-Yun
    • Journal of the Korean Society for Information Management
    • /
    • v.24 no.4
    • /
    • pp.267-283
    • /
    • 2007
  • In this study, measures of distributional similarity such as KL-divergence are applied to document clustering in place of the traditional cosine measure, which is the most prevalent vector similarity measure for document clustering. Three variations of KL-divergence are investigated: Jensen-Shannon divergence, symmetric skew divergence, and minimum skew divergence. To verify the contribution of distributional similarities to document clustering, two experiments were designed and carried out on three test collections. In the first experiment, the clustering performance of the three divergence measures was compared with that of the cosine measure. The results showed that minimum skew divergence outperformed both the other divergence measures and the cosine measure. In the second experiment, second-order distributional similarities were calculated with the Pearson correlation coefficient from the first-order similarity matrices. The results showed that second-order distributional similarities improved the overall performance of document clustering. These results suggest that minimum skew divergence should be selected as the document vector similarity measure when both time and accuracy are considered, and that second-order similarity is a good choice when only clustering accuracy is considered.
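
To make the distributional measures in this abstract concrete, the sketch below computes KL-divergence and Jensen-Shannon divergence between two term distributions. It assumes each document is already represented as a probability vector over a shared vocabulary and uses base-2 logarithms. The two skew divergence variants studied in the paper are not shown, because the abstract does not define them.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; q must be nonzero wherever p is nonzero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: a symmetric, always-finite variant of KL."""
    # The midpoint m is nonzero wherever p or q is, so both KL terms are defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical term distributions for two documents over a 3-term vocabulary.
doc_a = [0.5, 0.5, 0.0]
doc_b = [0.25, 0.25, 0.5]
print(js_divergence(doc_a, doc_b))
```

Unlike cosine, these measures are dissimilarities, so a clustering routine would merge the pair with the smallest value rather than the largest.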