• Title/Summary/Keyword: Automatic Categorization

Search Result 84, Processing Time 0.022 seconds

Machine Learning Based Automatic Categorization Model for Text Lines in Invoice Documents

  • Shin, Hyun-Kyung
    • Journal of Korea Multimedia Society
    • /
    • v.13 no.12
    • /
    • pp.1786-1797
    • /
    • 2010
  • Automatic understanding of contents in document image is a very hard problem due to involvement with mathematically challenging problems originated mainly from the over-determined system induced by document segmentation process. In both academic and industrial areas, there have been incessant and various efforts to improve core parts of content retrieval technologies by the means of separating out segmentation related issues using semi-structured document, e.g., invoice,. In this paper we proposed classification models for text lines on invoice document in which text lines were clustered into the five categories in accordance with their contents: purchase order header, invoice header, summary header, surcharge header, purchase items. Our investigation was concentrated on the performance of machine learning based models in aspect of linear-discriminant-analysis (LDA) and non-LDA (logic based). In the group of LDA, na$\"{\i}$ve baysian, k-nearest neighbor, and SVM were used, in the group of non LDA, decision tree, random forest, and boost were used. We described the details of feature vector construction and the selection processes of the model and the parameter including training and validation. We also presented the experimental results of comparison on training/classification error levels for the models employed.

Prescriptive Analytics System Design Fusing Automatic Classification Method and Intellectual Structure Analysis Method (자동 분류 기법과 지적 구조 분석 기법을 융합한 처방적 분석 시스템 구현 방안 연구)

  • Jeong, Do-Heon
    • Journal of the Korean Society for information Management
    • /
    • v.34 no.4
    • /
    • pp.33-57
    • /
    • 2017
  • This study aims to introduce an emerging prescriptive analytics method and suggest its efficient application to a category-based service system. Prescriptive analytics method provides the whole process of analysis and available alternatives as well as the results of analysis. To simulate the process of optimization, large scale journal articles have been collected and categorized by classification scheme. In the process of applying the concept of prescriptive analytics to a real system, we have fused a dynamic automatic-categorization method for large scale documents and intellectual structure analysis method for scholarly subject fields. The test result shows that some optimized scenarios can be generated efficiently and utilized effectively for reorganizing the classification-based service system.

An Experimental Study on the Performance Improvement of Automatic Classification for the Articles of Korean Journals Based on Controlled Keywords in International Database (해외 데이터베이스의 통제키워드에 기초한 국내 학술지 논문의 자동분류 성능 향상에 관한 실험적 연구)

  • Kim, Pan Jun;Lee, Jae Yun
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.48 no.3
    • /
    • pp.491-510
    • /
    • 2014
  • As a major factor for efficient management and retrieval of the articles in databases, keywords are classified into uncontrolled keywords and controlled keywords. Most of Korean scholarly databases fail to provide controlled vocabularies to indexing research articles which help users to retrieve relevant papers exhaustively. In this paper, we carried out automatic descriptor assignment experiments to Korean articles using automatic classifiers learned with descriptors in international database. The results of the experiments show that the classifier learning with descriptors in international database can potentially offer controlled vocabularies to Korean scholarly articles having English s. Also, we sought to improve the performance of automatic descriptor assignment using various classifiers and combination of them.

Design of Automatic Records Classification System Using Contextual Information (맥락정보를 이용한 기록 자동분류시스템 설계)

  • Jang, Ji-Sook;Rieh, Hae-Young
    • Journal of Korean Society of Archives and Records Management
    • /
    • v.9 no.1
    • /
    • pp.151-173
    • /
    • 2009
  • The classification in the Records and Archives Sciences focuses on the contextual information in producing and utilizing records rather than their contents. This study aimed at designing an automatic records classification system to enable an automatic classification focusing on the aggregation of the context of records rather than the contents of individual record in the classification scheme, structured on the basis of business activities analyses for records reflecting the business activities. The automatic records classification system was designed to have mutual supplements by constructing the classification scheme and thesaurus together as the classification reference, as well as the aggregation of records that have been already classified. Additionally included are plans to apply the classified contextual information of records to the classification reference on the real-time base right after the category assignment of records to be classified. Although there are limitations as the designed system depends on the quality of the contextual information, it is considered that the system could lead to ensure that the contextual information of records should be more substantial.

An Analytic Study on the Categorization of Query through Automatic Term Classification (용어 자동분류를 사용한 검색어 범주화의 분석적 고찰)

  • Lee, Tae-Seok;Jeong, Do-Heon;Moon, Young-Su;Park, Min-Soo;Hyun, Mi-Hwan
    • The KIPS Transactions:PartD
    • /
    • v.19D no.2
    • /
    • pp.133-138
    • /
    • 2012
  • Queries entered in a search box are the results of users' activities to actively seek information. Therefore, search logs are important data which represent users' information needs. The purpose of this study is to examine if there is a relationship between the results of queries automatically classified and the categories of documents accessed. Search sessions were identified in 2009 NDSL(National Discovery for Science Leaders) log dataset of KISTI (Korea Institute of Science and Technology Information). Queries and items used were extracted by session. The queries were processed using an automatic classifier. The identified queries were then compared with the subject categories of items used. As a result, it was found that the average similarity was 58.8% for the automatic classification of the top 100 queries. Interestingly, this result is a numerical value lower than 76.8%, the result of search evaluated by experts. The reason for this difference explains that the terms used as queries are newly emerging as those of concern in other fields of research.

An Experimental Study on the Automatic Classification of Korean Journal Articles through Feature Selection (자질선정을 통한 국내 학술지 논문의 자동분류에 관한 연구)

  • Kim, Pan Jun
    • Journal of the Korean Society for information Management
    • /
    • v.39 no.1
    • /
    • pp.69-90
    • /
    • 2022
  • As basic data that can systematically support and evaluate R&D activities as well as set current and future research directions by grasping specific trends in domestic academic research, I sought efficient ways to assign standardized subject categories (control keywords) to individual journal papers. To this end, I conducted various experiments on major factors affecting the performance of automatic classification, focusing on feature selection techniques, for the purpose of automatically allocating the classification categories on the National Research Foundation of Korea's Academic Research Classification Scheme to domestic journal papers. As a result, the automatic classification of domestic journal papers, which are imbalanced datasets of the real environment, showed that a fairly good level of performance can be expected using more simple classifiers, feature selection techniques, and relatively small training sets.

A Study on automatic assignment of descriptors using machine learning (기계학습을 통한 디스크립터 자동부여에 관한 연구)

  • Kim, Pan-Jun
    • Journal of the Korean Society for information Management
    • /
    • v.23 no.1 s.59
    • /
    • pp.279-299
    • /
    • 2006
  • This study utilizes various approaches of machine learning in the process of automatically assigning descriptors to journal articles. The effectiveness of feature selection and the size of training set were examined, after selecting core journals in the field of information science and organizing test collection from the articles of the past 11 years. Regarding feature selection, after reducing the feature set using $x^2$ statistics(CHI) and criteria that prefer high-frequency features(COS, GSS, JAC), the trained Support Vector Machines(SVM) performed the best. With respect to the size of the training set, it significantly influenced the performance of Support Vector Machines(SVM) and Voted Perceptron(VTP). However, it had little effect on Naive Bayes(NB).

A Novel Whale Optimized TGV-FCMS Segmentation with Modified LSTM Classification for Endometrium Cancer Prediction

  • T. Satya Kiranmai;P.V.Lakshmi
    • International Journal of Computer Science & Network Security
    • /
    • v.23 no.5
    • /
    • pp.53-64
    • /
    • 2023
  • Early detection of endometrial carcinoma in uterus is essential for effective treatment. Endometrial carcinoma is the worst kind of endometrium cancer among the others since it is considerably more likely to affect the additional parts of the body if not detected and treated early. Non-invasive medical computer vision, also known as medical image processing, is becoming increasingly essential in the clinical diagnosis of various diseases. Such techniques provide a tool for automatic image processing, allowing for an accurate and timely assessment of the lesion. One of the most difficult aspects of developing an effective automatic categorization system is the absence of huge datasets. Using image processing and deep learning, this article presented an artificial endometrium cancer diagnosis system. The processes in this study include gathering a dermoscopy images from the database, preprocessing, segmentation using hybrid Fuzzy C-Means (FCM) and optimizing the weights using the Whale Optimization Algorithm (WOA). The characteristics of the damaged endometrium cells are retrieved using the feature extraction approach after the Magnetic Resonance pictures have been segmented. The collected characteristics are classified using a deep learning-based methodology called Long Short-Term Memory (LSTM) and Bi-directional LSTM classifiers. After using the publicly accessible data set, suggested classifiers obtain an accuracy of 97% and segmentation accuracy of 93%.

Automatic Text Categorization by Term Weighting and Inverted Category Frequency (용어 가중치와 역범주 빈도에 의한 자동문서 범주화)

  • Lee, Kyung-Chan;Kang, Seung-Shik
    • Annual Conference on Human and Language Technology
    • /
    • 2003.10d
    • /
    • pp.14-17
    • /
    • 2003
  • 문서의 확률을 이용하여 자동으로 문서를 분류하는 문서 범주화 기법의 대표적인 방법이 나이브 베이지언 확률 모델이다. 이 방법의 기본 형식은 출현 용어의 확률 계산 방법이다. 하지만 실제 문서 범주화 과정에서 출현하지 않는 용어들도 성능에 많은 영향을 줄 수 있으며, 출현 용어들에 대한 빈도 이외의 역범주 빈도나 용어가중치를 적용하여 문서 범주화 시스템의 성능을 향상시킬 수 있다. 본 논문에서는 나이브 베이지언 확률 모델에 출현 용어와 출현하지 않는 용어들에 대한 smoothing 기법을 적용하여 실험하였다. 성능 평가를 위해 뉴스그룹 문서들을 이용하였으며, 역범주 빈도와 가중치를 적용했을 때 나이브 베이지언 확률 모델에 비해 약 7% 정도 성능 개선 효과가 있었다.

  • PDF

Automatic Document Categorization by the Importance of Features (자질 중요도 계산 기법에 의한 자동문서 범주화)

  • 이경찬;강승식
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2003.04c
    • /
    • pp.537-539
    • /
    • 2003
  • 문서 범주화를 위해 자질을 선별하는 기법으로는 자질의 출현 빈도에 따라 범주를 대표하는 자질들을 선별하는 것이 일반적이다. 출현 빈도에 의한 자질을 선별하는 통계적인 기법은 문서의 내용을 대표하는 용어들의 중요도를 간과하는 문제가 발생한다. 본 논문에서는 학습 문서 및 실험 문서에서 자질의 중요도에 의해 범주 대표어를 선별하는 문서 범주화 기법을 제안하였으며, 역범주 빈도 및 카이제곱 통계량에 의해 자질을 선별하는 방법과 비교-실험을 하였다. 문서 범주화 모델로는 나이브 베이지언 확률 모델을 이용하였으며, 성능 평가를 위해서 웹 디렉토리에서 수집된 데이터를 이용하여 실험하였다. 본 논문에서 제안한 자질 중요도에 의한 자질 선별 기법은 용어의 출현 빈도 및 카이제곱 통계량에 의해 자질을 선별한 방법보다 더 나은 성능을 보였다.

  • PDF