• Title/Summary/Keyword: Large-set Classification


CNN-based Weighted Ensemble Technique for ImageNet Classification (대용량 이미지넷 인식을 위한 CNN 기반 Weighted 앙상블 기법)

  • Jung, Heechul;Choi, Min-Kook;Kim, Junkwang;Kwon, Soon;Jung, Wooyoung
    • IEMEK Journal of Embedded Systems and Applications / v.15 no.4 / pp.197-204 / 2020
  • The ImageNet dataset is a large-scale dataset containing various natural scene images. In this paper, we propose a convolutional neural network (CNN)-based weighted ensemble technique for the ImageNet classification task. First, in order to fuse several models, our technique assigns a weight to each model, unlike the existing average-based ensemble technique. We then propose an algorithm that automatically finds the coefficients used in the subsequent ensemble process. Our algorithm sequentially selects the model with the best performance on the validation set, and then obtains a weight that improves performance when that model is combined with the previously selected models. We applied the proposed algorithm to a total of 13 heterogeneous models, and as a result, 5 models were selected. These selected models were combined with their weights, and we achieved a 3.297% Top-5 error rate on the ImageNet test dataset.
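As a rough illustration of the greedy procedure described in the abstract, the sketch below picks the best single model on validation data, then repeatedly adds a weighted model only when it improves validation accuracy. The weight grid and the use of plain accuracy are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def accuracy(probs, labels):
    """Fraction of samples whose argmax class matches the label."""
    return float(np.mean(probs.argmax(axis=1) == labels))

def greedy_weighted_ensemble(model_probs, labels, weight_grid=(0.1, 0.25, 0.5, 0.75, 1.0)):
    """Sequentially add models with weights that improve validation accuracy.

    model_probs: list of (n_samples, n_classes) probability arrays.
    Returns a list of (model_index, weight) pairs and the final accuracy.
    NOTE: the weight grid is a hypothetical choice, not the paper's.
    """
    # Start from the single best model with weight 1.0.
    best = max(range(len(model_probs)), key=lambda i: accuracy(model_probs[i], labels))
    selected = [(best, 1.0)]
    ensemble = model_probs[best].copy()
    best_acc = accuracy(ensemble, labels)
    remaining = set(range(len(model_probs))) - {best}

    improved = True
    while improved and remaining:
        improved = False
        for i in sorted(remaining):
            for w in weight_grid:
                acc = accuracy(ensemble + w * model_probs[i], labels)
                if acc > best_acc:
                    best_acc, best_pick = acc, (i, w)
                    improved = True
        if improved:
            i, w = best_pick
            ensemble = ensemble + w * model_probs[i]
            selected.append((i, w))
            remaining.discard(i)
    return selected, best_acc
```

By construction the ensemble accuracy never falls below the best single model, which matches the paper's motivation for weighting over plain averaging.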

Keyword Reorganization Techniques for Improving the Identifiability of Topics (토픽 식별성 향상을 위한 키워드 재구성 기법)

  • Yun, Yeoil;Kim, Namgyu
    • Journal of Information Technology Services / v.18 no.4 / pp.135-149 / 2019
  • Recently, much research has been conducted on extracting meaningful information from large amounts of text data. Among the various applications for extracting information from text, topic modeling, which expresses latent topics as groups of keywords, is mainly used. Topic modeling presents several topic keywords by term/topic weight, and the quality of those keywords is usually evaluated through coherence, which reflects the similarity of the keywords. However, a topic quality evaluation method based only on keyword similarity has its limitations, because it is difficult to describe the content of a topic accurately enough with just a set of similar words. In this research, therefore, we propose a topic keyword reorganizing method to improve the identifiability of topics. To reorganize topic keywords, each document first needs to be labeled with one representative topic, which can be extracted by traditional topic modeling. After that, classification rules for classifying each document into its corresponding label are generated, and new topic keywords are extracted based on the classification rules. To evaluate the performance of our method, we performed an experiment on 1,000 news articles. From the experiment, we confirmed that the keywords extracted by our proposed method have better identifiability than traditional topic keywords.

Improving the Effectiveness of Customer Classification Models: A Pre-segmentation Approach (사전 세분화를 통한 고객 분류모형의 효과성 제고에 관한 연구)

  • Chang, Nam-Sik
    • Information Systems Review / v.7 no.2 / pp.23-40 / 2005
  • Discovering customers' behavioral patterns from large data sets and providing them with corresponding services or products are critical components of managing a modern business. However, the diversity of customer needs, coupled with limited resources, suggests that companies should focus their efforts on understanding and managing specific groups of customers rather than the entire customer base. The key premise of this paper is that the behavioral patterns extracted from specific groups of customers will differ from those extracted from all customers. This paper proposes the idea of pre-segmentation before developing customer classification models. We collected demographic and transactional data sets from a credit card company, a telecommunications company, and an insurance company in Korea, and then segmented customers by major variables. Different churn prediction models were developed from each segment and from the whole data set using the decision tree induction approach, and were compared in terms of hit ratio and the simplicity of the generated rules.
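A minimal sketch of the pre-segmentation idea, assuming a pre-chosen segment variable and decision tree induction as in the paper (the tree depth, data shapes, and the use of plain accuracy as the hit ratio are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def per_segment_models(X, y, segment):
    """Train one churn model per pre-defined segment plus one on the
    whole data set, and report each model's hit ratio (accuracy).

    X: (n, d) feature matrix; y: churn labels; segment: segment id per row.
    """
    whole = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    results = {"whole": accuracy_score(y, whole.predict(X))}
    for s in np.unique(segment):
        mask = segment == s
        model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[mask], y[mask])
        results[f"segment_{s}"] = accuracy_score(y[mask], model.predict(X[mask]))
    return results
```

When each segment churns for different reasons, the per-segment trees can capture those reasons with shallower, simpler rules than a single model for all customers.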

A Classifier for Textured Images Based on Matrix Feature (행렬 속성을 이용하는 질감 영상 분별기)

  • 김준철;이준환
    • Journal of the Korean Institute of Telematics and Electronics B / v.31B no.3 / pp.91-102 / 1994
  • The analysis of textured images requires large storage space and computation time to calculate matrix features such as the SGLDM (Spatial Gray Level Dependence Matrix), NGLDM (Neighboring Gray Level Dependence Matrix), NSGLDM (Neighboring Spatial Gray Level Dependence Matrix), and GLRLM (Gray Level Run Length Matrix). Despite the large amount of information each matrix contains, a set of several correlated scalar features calculated from the matrix is not sufficient to approximate it. In this paper, we propose a new classifier for textured images based on these matrices, in which the projections of each matrix onto meaningful directions are used as features. In the proposed method, an unknown image is assigned to the class of the known image that gives the maximum similarity between the projected model vector from the known image and the vector from the unknown image. In an experiment classifying images of agricultural products, the proposed method shows good performance, achieving a correct classification ratio of 85-95%.
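The projection idea can be sketched as follows. Here a single-offset co-occurrence matrix plays the role of the SGLDM, and its row and column marginals stand in for the paper's "meaningful directions"; both simplifications are assumptions for illustration:

```python
import numpy as np

def glcm(image, levels, offset=(0, 1)):
    """Co-occurrence matrix of gray-level pairs at one pixel offset
    (a simplified stand-in for the SGLDM), normalized to sum to 1."""
    m = np.zeros((levels, levels))
    dr, dc = offset
    rows, cols = image.shape
    for r in range(rows - dr):
        for c in range(cols - dc):
            m[image[r, c], image[r + dr, c + dc]] += 1
    return m / m.sum()

def projection_features(m):
    """Project the matrix onto row and column directions (marginal sums),
    retaining more of the matrix than a few scalar statistics would."""
    return np.concatenate([m.sum(axis=0), m.sum(axis=1)])

def classify(unknown_matrix, models):
    """Assign the class whose stored model vector has the maximum
    cosine similarity with the unknown image's projected vector."""
    v = projection_features(unknown_matrix)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(models, key=lambda k: cos(models[k], v))
```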

TCAM Partitioning for High-Performance Packet Classification (고성능 패킷 분류를 위한 TCAM 분할)

  • Kim Kyu-Ho;Kang Seok-Min;Song Il-Seop;Kwon Teack-Geun
    • The Journal of Korean Institute of Communications and Information Sciences / v.31 no.2B / pp.91-97 / 2006
  • As network bandwidth increases, threats to the network also increase with the emergence of various new services. For high-performance network security, high-speed packet classification methods that employ hardware such as TCAM are generally used. Because these devices are expensive and their capacity is limited, a method for using them efficiently is needed. In this paper, we propose an efficient packet classification scheme using a Ternary CAM (TCAM), a device widely used for high-speed packet classification, to which we have applied the Snort rule set of the well-known intrusion detection system. In order to save space in the expensive TCAM, we eliminate duplicated IP addresses and port numbers in the rules according to the partitioning of a table in the TCAM, and we represent negation and range rules within a reduced TCAM size. We also keep the advantage of low TCAM capacity consumption and reduce the number of TCAM lookups by decreasing the TCAM partitioning through combined port numbers. According to simulation results on our TCAM partitioning, the size of the TCAM can be reduced by up to 98%, and performance does not degrade significantly for high-speed packet classification with a large number of rules.
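The space saving from partitioning comes from storing each distinct IP address and port only once and keeping a small cross-reference instead of full rules. This is a software sketch of that bookkeeping only; the actual TCAM hardware, lookup pipeline, and the paper's negation/range encoding are not modeled:

```python
def partition_rules(rules):
    """Split (src_ip, dst_port) rules into two de-duplicated tables plus
    a cross-reference list, mimicking how TCAM table partitioning removes
    duplicated IP addresses and port numbers. Illustrative sketch only."""
    ip_table, port_table, xref = [], [], []
    for ip, port in rules:
        if ip not in ip_table:
            ip_table.append(ip)
        if port not in port_table:
            port_table.append(port)
        # Each rule becomes a pair of indices into the two small tables.
        xref.append((ip_table.index(ip), port_table.index(port)))
    return ip_table, port_table, xref
```

With many rules sharing the same addresses and ports (as in the Snort rule set), the two tables stay far smaller than the original rule list.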

Classification of Protein Subcellular Locations Using n-Gram Features (단백질 서열의 n-Gram 자질을 이용한 세포내 위치 예측)

  • Kim, Jinsuk
    • Proceedings of the Korea Contents Association Conference / 2007.11a / pp.12-16 / 2007
  • The function of a protein is closely correlated with its subcellular location(s). Given a protein sequence, therefore, determining its subcellular location is a vitally important problem. We have developed a new prediction method for protein subcellular location(s) based on n-gram feature extraction and the k-nearest neighbor (kNN) classification algorithm. It classifies a protein sequence into one or more subcellular compartments based on the locations of the top k sequences that show the highest similarity weights against the input sequence. The similarity weight is a similarity measure determined by comparing the n-gram features of two sequences. Currently, our method extracts penta-grams as features of protein sequences, computes scores for the potential localization site(s) using the kNN algorithm, and finally presents the locations and their associated scores. We constructed a large-scale data set of protein sequences with known subcellular locations from the SWISS-PROT database; this data set contains 51,885 entries with one or more known subcellular locations. Our method shows very high prediction precision of about 93% on this data set, and it also showed comparable improvement over another method on a test collection used in previous work.
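The penta-gram kNN scheme can be sketched as below. The overlap-based similarity weight here is a plausible stand-in only; the paper's exact weighting formula is not reproduced:

```python
from collections import Counter

def ngrams(seq, n=5):
    """Penta-gram (n=5) feature counts of a protein sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def similarity_weight(a, b):
    """Similarity of two n-gram profiles: shared n-gram mass normalized
    by the smaller profile (an assumed formula, for illustration)."""
    common = sum(min(a[g], b[g]) for g in a if g in b)
    return common / max(1, min(sum(a.values()), sum(b.values())))

def predict_locations(query, labeled, k=3):
    """kNN: rank training sequences by similarity weight to the query and
    vote the subcellular locations of the top k neighbors.

    labeled: list of (sequence, location) pairs.
    """
    q = ngrams(query)
    scored = sorted(labeled,
                    key=lambda item: similarity_weight(q, ngrams(item[0])),
                    reverse=True)
    votes = Counter(loc for _, loc in scored[:k])
    return votes.most_common()  # locations with their vote scores
```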

A Study on Defect Prediction through Real-time Monitoring of Die-Casting Process Equipment (주조공정 설비에 대한 실시간 모니터링을 통한 불량예측에 대한 연구)

  • Chulsoon Park;Heungseob Kim
    • Journal of Korean Society of Industrial and Systems Engineering / v.45 no.4 / pp.157-166 / 2022
  • In a die-casting process, defects that are difficult to confirm by visual inspection, such as shrinkage bubbles, may occur due to an error in maintaining the vacuum state. Since these casting defects are discovered during post-processing operations such as heat treatment or finishing work, they cannot be addressed at casting time, which can lead to a large number of defects. In this study, we propose an approach that predicts the occurrence of casting defects by defect type, using machine learning on casting parameter data collected in real time from equipment in the die-casting process. Die-casting parameter data can basically be collected through the casting equipment controller. In order to perform classification analysis for predicting defects by defect type, the casting parameters must be labeled. In this study, the defective data set is first separated by primary clustering based on the total defect rate obtained during post-processing. Second, secondary cluster analysis is performed on the separated defect data using the defect rate by type, and the labeling task is performed by defect type using the cluster analysis results. Finally, a classification learning model is created from the entire labeled data set, and a real-time monitoring system for defect prediction was implemented using LabVIEW and Python. When a defect is predicted, the operator is notified, for example through the monitoring screen and an alarm, so that appropriate action can be taken.
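The two-stage labeling step can be sketched as follows. The choice of k-means and the cluster counts are illustrative assumptions; the paper does not specify them here:

```python
import numpy as np
from sklearn.cluster import KMeans

def label_by_two_stage_clustering(total_rate, type_rates, type_clusters=3):
    """Stage 1: cluster shots on total defect rate to split normal vs.
    defective groups. Stage 2: cluster only the defective shots on their
    per-type defect rates to assign a defect-type label.

    total_rate: (n,) total defect rate per shot.
    type_rates: (n, t) defect rate per defect type.
    Returns per-shot labels; -1 marks the normal group.
    """
    stage1 = KMeans(n_clusters=2, n_init=10, random_state=0)
    groups = stage1.fit_predict(total_rate.reshape(-1, 1))
    defective_group = int(np.argmax(stage1.cluster_centers_.ravel()))

    labels = np.full(len(total_rate), -1)
    mask = groups == defective_group
    if mask.sum() >= type_clusters:
        stage2 = KMeans(n_clusters=type_clusters, n_init=10, random_state=0)
        labels[mask] = stage2.fit_predict(type_rates[mask])
    return labels
```

The resulting labels can then feed any supervised classifier for the real-time prediction step.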

The Analysis of Korean Cities Biotope Type Characteristic using Cluster Analysis (군집분석을 통한 한국 도시 비오톱 유형 특성분석)

  • Kim, Jin-Hyo;Ra, Jung-Hwa;Lee, Soon-Ju;Kwon, Oh-Sung;Cho, Hyun-Ju
    • Journal of the Korean Institute of Landscape Architecture / v.43 no.4 / pp.112-123 / 2015
  • The purpose of this study is to analyze the biotope characteristics of Korean cities and to establish biotope type structures for them, based on biotope type classification, dominant biotope types, each city's human and natural environmental characteristics, and cluster analysis. The findings of the study are summarized as follows. First, the analysis of biotope type classification showed that cities differ in their standards of biotope classification and classification hierarchy. Next, the analysis of dominant biotope types showed that forest represents the largest area in most cities. Moreover, the analysis of city characteristics revealed large differences between cities. As a result of the cluster analysis, the cities were classified into five clusters overall. Cluster A showed lower population and urbanization levels and, unlike the other clusters, had the largest percentage of agricultural areas. Cluster C, which showed very high levels of population and urbanization, was named 'large-sized metropolitan cities centered on forest biotope area' based on its characteristics. These findings are expected to play an important role in enabling the detailed classification and preservation of biotope types suited to the characteristics of cities, and in minimizing the confusion caused by different biotope mapping methods when biotope maps are revised and complemented.

Application of Text-Classification Based Machine Learning in Predicting Psychiatric Diagnosis (텍스트 분류 기반 기계학습의 정신과 진단 예측 적용)

  • Pak, Doohyun;Hwang, Mingyu;Lee, Minji;Woo, Sung-Il;Hahn, Sang-Woo;Lee, Yeon Jung;Hwang, Jaeuk
    • Korean Journal of Biological Psychiatry / v.27 no.1 / pp.18-26 / 2020
  • Objectives: The aim was to find effective vectorization and classification models to predict a psychiatric diagnosis from text-based medical records. Methods: Electronic medical records (n = 494) of present illness were collected retrospectively from inpatient admission notes with three diagnoses: major depressive disorder, type 1 bipolar disorder, and schizophrenia. The data were split into 400 training samples and 94 independent validation samples, and were vectorized by two different models, term frequency-inverse document frequency (TF-IDF) and Doc2vec. Machine learning models for classification, including stochastic gradient descent, logistic regression, support vector classification, and deep learning (DL), were applied to predict the three psychiatric diagnoses. Five-fold cross-validation was used to find an effective model, and metrics such as accuracy, precision, recall, and F1-score were measured for comparison between the models. Results: Five-fold cross-validation on the training data showed that the DL model with Doc2vec was the most effective model for predicting the diagnosis (accuracy = 0.87, F1-score = 0.87). However, these metrics were reduced on the independent test data set with the final DL models (accuracy = 0.79, F1-score = 0.79), while the logistic regression and support vector machine models with Doc2vec showed slightly better performance (accuracy = 0.80, F1-score = 0.80) than the DL models with Doc2vec and the models with TF-IDF. Conclusions: The current results suggest that the vectorization may have more impact on classification performance than the choice of machine learning model. However, the data set had a number of limitations, including a small sample size, imbalance among the categories, and limited generalizability. In this regard, research with multiple sites and large samples is needed to improve the machine learning models.
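One arm of the comparison above (TF-IDF vectorization followed by logistic regression, scored with k-fold cross-validation) can be sketched as below; the Doc2vec arm would replace the vectorizer, e.g. via gensim, and is not shown. The function name and pipeline settings are illustrative, not the paper's configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def evaluate_tfidf_classifier(texts, labels, folds=5):
    """TF-IDF vectorization + logistic regression, evaluated with
    stratified k-fold cross-validation on accuracy."""
    pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return cross_val_score(pipe, texts, labels, cv=folds, scoring="accuracy")
```

Swapping only the vectorizer while keeping the classifier fixed is what lets the study attribute performance differences to the vectorization step.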

Hierarchical Automatic Classification of News Articles based on Association Rules (연관규칙을 이용한 뉴스기사의 계층적 자동분류기법)

  • Joo, Kil-Hong;Shin, Eun-Young;Lee, Joo-Il;Lee, Won-Suk
    • Journal of Korea Multimedia Society / v.14 no.6 / pp.730-741 / 2011
  • With the development of the internet and computer technology, the amount of information available through the internet is increasing rapidly, and much of it is managed in document form. For this reason, research into methods for managing large numbers of documents effectively is necessary. Conventional document categorization methods use only the keywords of related documents for document classification. This paper, however, proposes a keyword extraction method based on association rules. The method extracts a set of related keywords associated with a document's category and classifies the representative keywords using the classification rule proposed in this paper. In addition, this paper proposes a preprocessing method for efficient keyword creation and predicts the category of new documents. We designed the classifier and measured its performance through experiments in order to increase the profile's classification performance. When predicting a category, substituting all the classification rules one by one is the major cause of decreased processing performance in a profile. Finally, this paper suggests an automatic categorization scheme that can be applied to a hierarchical category architecture, extended from a simple category architecture.
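A minimal sketch of association-rule-style classification: per category, keyword pairs that co-occur in enough documents act as rules, and a new document is scored by how many rules fire. This simplifies full association-rule mining (pairs only, support threshold only, no confidence) and does not reproduce the paper's hierarchical extension:

```python
from collections import Counter
from itertools import combinations

def category_keyword_rules(docs, min_support=2):
    """For each category, keep keyword pairs that co-occur in at least
    min_support documents; these pairs serve as classification rules.

    docs: dict mapping category -> list of document strings.
    """
    rules = {}
    for cat, texts in docs.items():
        pairs = Counter()
        for text in texts:
            words = sorted(set(text.split()))
            pairs.update(combinations(words, 2))
        rules[cat] = {p for p, count in pairs.items() if count >= min_support}
    return rules

def predict_category(text, rules):
    """Score each category by the number of its keyword-pair rules that
    fire on the document, and return the best-scoring category."""
    words = set(text.split())
    scores = {cat: sum(1 for a, b in rs if a in words and b in words)
              for cat, rs in rules.items()}
    return max(scores, key=scores.get)
```

Scanning only the rules whose keywords appear in the document (instead of substituting every rule one by one) is one way to avoid the performance bottleneck the abstract points out.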