• Title/Summary/Keyword: Large-set Classification

The Unified Framework for AUC Maximizer

  • Jun, Jong-Jun;Kim, Yong-Dai;Han, Sang-Tae;Kang, Hyun-Cheol;Choi, Ho-Sik
    • Communications for Statistical Applications and Methods / v.16 no.6 / pp.1005-1012 / 2009
  • The area under the curve (AUC) is commonly used as a summary measure of the receiver operating characteristic (ROC) curve, which displays the performance of a set of binary classifiers for all feasible ratios of the costs associated with the true positive rate (TPR) and false positive rate (FPR). In the bipartite ranking problem, where one has to compare two different observations and decide which one is "better", the AUC measures the probability that the ranking score of a randomly chosen sample from one class is larger than that of a randomly chosen sample from the other class; hence, the function that maximizes the AUC of a bipartite ranking problem differs from the function that maximizes accuracy (i.e., minimizes the misclassification error rate) of a binary classification problem. In this paper, we construct a unified framework for AUC maximizers that includes support vector machines, based on maximizing a large margin, and logistic regression, based on estimating the posterior probability, and we develop an efficient algorithm for the proposed framework. Numerical results show that the proposed unified framework can treat various methodologies successfully.
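
The pairwise view of the AUC that the abstract relies on is easy to make concrete. Below is a minimal sketch (synthetic data, not the authors' formulation) of the empirical AUC as a pairwise ordering probability, together with a hinge-type pairwise surrogate of the kind a large-margin AUC maximizer would optimize:

```python
# Minimal sketch: empirical AUC as the probability that a positive sample
# outranks a negative one, plus a pairwise hinge surrogate. The data and
# loss here are illustrative, not the paper's formulation.
import numpy as np

def empirical_auc(scores, labels):
    """AUC = P(score of a random positive > score of a random negative)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    # Count correctly ordered pairs; ties count half.
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def pairwise_hinge_loss(scores, labels, margin=1.0):
    """Convex surrogate of the kind used by large-margin AUC maximizers."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    return np.maximum(0.0, margin - (pos[:, None] - neg[None, :])).mean()

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
scores = labels + rng.normal(scale=1.0, size=200)  # noisy ranking scores
print(empirical_auc(scores, labels), pairwise_hinge_loss(scores, labels))
```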

Printed Hangul Recognition with Adaptive Hierarchical Structures Depending on 6-Types (6-유형 별로 적응적 계층 구조를 갖는 인쇄 한글 인식)

  • Ham, Dae-Sung;Lee, Duk-Ryong;Choi, Kyung-Ung;Oh, Il-Seok
    • The Journal of the Korea Contents Association / v.10 no.1 / pp.10-18 / 2010
  • Due to the large number of classes in Hangul character recognition, a six-type preclassification stage is commonly used. After preclassification, the first consonant, vowel, and last consonant can be classified separately. Although each of the three components has only a few classes, classification errors still occur frequently due to shape similarity, as between 'ㅔ' and 'ㅖ'. This paper therefore proposes a hierarchical recognition method that adopts a multi-stage tree structure for each of the six types. In addition, to reduce the interference among the three components, the method uses the recognition results for the first consonant and vowel as features of the vowel classifier. The recognition accuracy on the test set of the PHD08 database was 98.96%.
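
A minimal sketch of the preclassify-then-recognize structure the abstract describes, assuming generic feature vectors; the PHD08 data handling, the six-type definitions, and the per-type multi-stage trees are not reproduced here, and the plain per-type classifiers below stand in for them:

```python
# Minimal two-stage sketch: a 6-type preclassifier routes each sample to a
# per-type character recognizer. Assumes each type covers multiple classes.
import numpy as np
from sklearn.linear_model import LogisticRegression

class SixTypeHierarchy:
    def __init__(self):
        self.type_clf = LogisticRegression(max_iter=1000)  # 6-type preclassifier
        self.char_clfs = {}                                 # one recognizer per type

    def fit(self, X, char_labels, type_labels):
        self.type_clf.fit(X, type_labels)
        for t in np.unique(type_labels):
            m = type_labels == t
            self.char_clfs[t] = LogisticRegression(max_iter=1000).fit(
                X[m], char_labels[m])
        return self

    def predict(self, X):
        types = self.type_clf.predict(X)
        out = np.empty(len(X), dtype=object)
        for t, clf in self.char_clfs.items():
            m = types == t
            if m.any():
                out[m] = clf.predict(X[m])
        return out
```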

Multi-channel Long Short-Term Memory with Domain Knowledge for Context Awareness and User Intention

  • Cho, Dan-Bi;Lee, Hyun-Young;Kang, Seung-Shik
    • Journal of Information Processing Systems / v.17 no.5 / pp.867-878 / 2021
  • In context awareness and user intention tasks, dataset construction is expensive because domain-specific data are required. Although pretraining with a large corpus can effectively resolve the lack of data, it ignores domain knowledge. Herein, we concentrate on domain knowledge while addressing data scarcity, and accordingly propose a multi-channel long short-term memory (LSTM) model. Because the multi-channel LSTM integrates pretrained vectors carrying both task-specific and general knowledge, it effectively prevents catastrophic forgetting between the two types of vectors while representing the context as a set of features. To evaluate the proposed model against the baseline single-channel LSTM, we performed two tasks: voice phishing detection with context awareness and movie review sentiment classification. The results verified that the multi-channel LSTM outperforms the single-channel LSTM in both tasks. We further experimented with multi-channel LSTM variants differing in the domain and data size of the general knowledge, and confirmed the effect of integrating the two types of knowledge, from downstream task data and from raw data, in overcoming the lack of data.
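
A minimal sketch of a two-channel LSTM in PyTorch, assuming two pretrained embedding matrices (one task-specific, one general); layer sizes, names, and the frozen-embedding choice are illustrative, not the authors' configuration:

```python
# Minimal two-channel LSTM sketch: each channel keeps its own (frozen)
# pretrained embedding, so task and general knowledge do not overwrite
# each other; final hidden states are concatenated for classification.
import torch
import torch.nn as nn

class MultiChannelLSTM(nn.Module):
    def __init__(self, task_emb, general_emb, hidden=128, n_classes=2):
        super().__init__()
        self.emb_task = nn.Embedding.from_pretrained(task_emb, freeze=True)
        self.emb_gen = nn.Embedding.from_pretrained(general_emb, freeze=True)
        self.lstm_task = nn.LSTM(task_emb.size(1), hidden, batch_first=True)
        self.lstm_gen = nn.LSTM(general_emb.size(1), hidden, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, token_ids):
        _, (h_task, _) = self.lstm_task(self.emb_task(token_ids))
        _, (h_gen, _) = self.lstm_gen(self.emb_gen(token_ids))
        # Concatenate the final hidden states of both channels.
        return self.fc(torch.cat([h_task[-1], h_gen[-1]], dim=-1))

# vocab of 1000, 100-dim embeddings, batch of 4 sequences of length 20
model = MultiChannelLSTM(torch.randn(1000, 100), torch.randn(1000, 100))
logits = model(torch.randint(0, 1000, (4, 20)))
```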

Image-based rainfall prediction from a novel deep learning method

  • Byun, Jongyun;Kim, Jinwon;Jun, Changhyun
    • Proceedings of the Korea Water Resources Association Conference / 2021.06a / pp.183-183 / 2021
  • Deep learning methods and their applications have become an essential part of prediction and modeling in water-related research areas, including hydrological processes and climate change. Deep learning is known to improve the availability of data sources in hydrology, which shows its usefulness in analyses of precipitation, runoff, groundwater level, evapotranspiration, and so on. However, microclimate analysis and prediction with deep learning methods remain limited by the scarcity of gauge-based data and the shortcomings of existing technologies. In this study, a real-time rainfall prediction model was developed from a sky image data set with convolutional neural networks (CNNs). The daily image data were collected at Chung-Ang University and Korea University. To achieve high accuracy, the proposed model incorporates data classification, image processing, and ratio adjustment of no-rain data. The rainfall predictions were compared with minutely rainfall data at rain gauge stations close to the image sensors. The results indicate that the proposed model could complement the current rainfall observation system and has large potential to fill observation gaps. Information from small-scale areas can advance accurate weather forecasting and hydrological modeling at the micro scale.
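
A minimal sketch of a CNN that maps a sky image to a rainfall estimate, written in PyTorch; the architecture, image size, and regression head are illustrative stand-ins, since the paper's actual network is not specified in the abstract:

```python
# Minimal image-to-rainfall CNN sketch: two conv blocks, global pooling,
# and a single-output regression head for rainfall intensity.
import torch
import torch.nn as nn

class SkyRainCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),   # global average pool to (B, 32, 1, 1)
        )
        self.head = nn.Linear(32, 1)   # rainfall intensity (regression)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = SkyRainCNN()
pred = model(torch.randn(8, 3, 128, 128))  # batch of 8 RGB sky images
```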

Computational Analysis of PCA-based Face Recognition Algorithms (PCA기반의 얼굴인식 알고리즘들에 대한 연산방법 분석)

  • Hyeon Joon Moon;Sang Hoon Kim
    • Journal of Korea Multimedia Society / v.6 no.2 / pp.247-258 / 2003
  • Principal component analysis (PCA) based algorithms form the basis of numerous algorithms and studies in the face recognition literature. PCA is a statistical technique, and its incorporation into a face recognition system requires numerous design decisions. We make these design decisions explicit by introducing a generic modular PCA algorithm, since some of them are not documented in the literature. We experiment with different implementations of each module and evaluate them using the September 1996 FERET evaluation protocol (the de facto standard method for evaluating face recognition algorithms). We experiment with (1) changing the illumination normalization procedure; (2) studying the effects on algorithm performance of compressing images with JPEG and wavelet compression algorithms; (3) varying the number of eigenvectors in the representation; and (4) changing the similarity measure in the classification process. We perform two experiments. In the first, we report performance results on the standard September 1996 FERET large gallery image sets. The results show that empirical analysis of preprocessing, feature extraction, and matching performance is extremely important for producing optimized performance. In the second, we examine variations in algorithm performance based on 100 randomly generated image sets (galleries) of the same size. The results show that a reasonable threshold for measuring a significant difference in classifier performance is 0.10.
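
Two of the design decisions the paper varies, the number of eigenvectors and the similarity measure, can be sketched compactly. The following eigenface-style sketch assumes flattened gallery and probe images as NumPy arrays; it is illustrative, not the paper's modular algorithm:

```python
# Minimal eigenface-style sketch: PCA projection plus nearest-neighbour
# matching with a configurable similarity measure. FERET data loading and
# illumination normalization are not reproduced here.
import numpy as np
from sklearn.decomposition import PCA

def fit_gallery(gallery, n_components=100):
    pca = PCA(n_components=n_components)  # design decision: eigenvector count
    return pca, pca.fit_transform(gallery)

def identify(pca, gallery_proj, probe, metric="l2"):
    p = pca.transform(probe.reshape(1, -1))
    if metric == "l2":                    # design decision: similarity measure
        return np.argmin(np.linalg.norm(gallery_proj - p, axis=1))
    if metric == "cosine":
        s = gallery_proj @ p.ravel()
        s /= np.linalg.norm(gallery_proj, axis=1) * np.linalg.norm(p)
        return np.argmax(s)
    raise ValueError(metric)
```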

Study on Improvement of Frost Occurrence Prediction Accuracy (서리발생 예측 정확도 향상을 위한 방법 연구)

  • Kim, Yongseok;Choi, Wonjun;Shim, Kyo-moon;Hur, Jina;Kang, Mingu;Jo, Sera
    • Korean Journal of Agricultural and Forest Meteorology / v.23 no.4 / pp.295-305 / 2021
  • In this study, we constructed a frost occurrence classification model using Random Forest (RF), selecting the meteorological factors related to the occurrence of frost. We found that, even when the data set is large, imbalance in the data set used for model development degrades the predictive power of the model. We also found that building a single integrated model, grouping the meteorological factors related to frost occurrence by region, is more efficient than building separate models, each reflecting its own high-importance meteorological factors. Based on these results, we expect that a high-accuracy frost occurrence prediction model can be constructed as further studies examine meteorological factors for frost prediction.
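
A minimal sketch of an RF classifier for a rare-event target with the class imbalance handled by weighting; the synthetic features stand in for the meteorological factors, and class weighting is one of several possible imbalance remedies, not necessarily the one the authors used:

```python
# Minimal sketch: Random Forest on meteorological-style features with the
# class imbalance addressed via class_weight="balanced". Data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 4))             # e.g. min temp, humidity, wind, cloud
y = (rng.random(5000) < 0.05).astype(int)  # frost nights are rare

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                             random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```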

The Development of a Fault Diagnosis Model Based on Principal Component Analysis and Support Vector Machine for a Polystyrene Reactor (주성분 분석과 서포트 벡터 머신을 이용한 폴리스티렌 중합 반응기 이상 진단 모델 개발)

  • Jeong, Yeonsu;Lee, Chang Jun
    • Korean Chemical Engineering Research / v.60 no.2 / pp.223-228 / 2022
  • In chemical processes, unintended faults can cause serious accidents. To tackle them, proper fault diagnosis models should be designed to identify the root causes of faults. Designing a fault diagnosis model requires analyzing the process and its data. However, most previous research in the field of fault diagnosis has handled only data sets from benchmark processes simulated in commercial programs, indicating how hard it is to obtain fresh data sets from real processes. In this study, real faulty conditions of an industrial polystyrene process are examined. In this process, a runaway reaction occurred and caused a large loss because operators became aware of the accident too late. To design a proper fault diagnosis model, we analyzed this process and the real accident data set. First, a mode classification model based on a support vector machine (SVM) was trained, and a principal component analysis (PCA) model for each mode was constructed under normal operating conditions. The results show that the proposed model can quickly diagnose the occurrence of a fault and is able to reduce the potential loss.
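
A minimal sketch of the two-part scheme the abstract outlines: an SVM assigns the operating mode, and a per-mode PCA model then monitors new samples via the Hotelling T² statistic. The monitoring statistic and thresholds are common practice in PCA-based fault detection, assumed here rather than taken from the paper:

```python
# Minimal sketch: SVM mode classification + per-mode PCA monitoring.
# A sample is scored with the Hotelling T^2 statistic under the PCA model
# of its predicted mode; large T^2 suggests a fault. Data are assumed
# generic process measurements under normal operation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

class ModePCA:
    def fit(self, X, modes, n_components=3):
        self.svm = SVC().fit(X, modes)
        self.pca = {m: PCA(n_components).fit(X[modes == m])
                    for m in np.unique(modes)}
        return self

    def t2(self, x):
        """Hotelling T^2 of one sample under its predicted mode's PCA."""
        m = self.svm.predict(x.reshape(1, -1))[0]
        pca = self.pca[m]
        t = pca.transform(x.reshape(1, -1)).ravel()
        return m, np.sum(t ** 2 / pca.explained_variance_)
```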

Prediction of Plant Operator Error Mode (원자력발전소 운전원의 오류모드 예측)

  • Lee, H.C.;E. Hollnagel;M. Kaarstad
    • Proceedings of the ESK Conference / 1997.04a / pp.56-60 / 1997
  • The study of human erroneous actions has traditionally proceeded along two different lines of approach. One has been concerned with finding and explaining the causes of erroneous actions, as in studies of the psychology of "error". The other has been concerned with the qualitative and quantitative prediction of possible erroneous actions, exemplified by the field of human reliability analysis (HRA). A further distinction is that the former approach has been dominated by an academic point of view, hence emphasising theories, models, and experiments, while the latter has been of a more pragmatic nature, hence putting greater emphasis on data and methods. We have been developing a method to make predictions about error modes. The input to the method is a detailed task description of a set of scenarios for an experiment. This description is analysed to characterise the nature of the individual task steps, as well as the conditions under which they must be carried out. The task steps are expressed in terms of a predefined set of cognitive activity types. Each task step is then examined against a systematic classification of possible error modes, and the likely error modes are identified. This effectively constitutes a qualitative analysis of the possibilities for erroneous action in a given task. To evaluate the accuracy of the predictions, data from a large-scale experiment were analysed. The experiment used the full-scale nuclear power plant simulator in the Halden Man-Machine Systems Laboratory (HAMMLAB) with six crews; data were collected through systematic performance observations by experts using a pre-defined task description, as well as audio and video recordings. The purpose of the analysis was to determine how well the predictions matched the actually observed performance failures. The results indicated a very acceptable rate of accuracy. The emphasis in this experiment has been on developing a practical method for qualitative performance prediction, i.e., a method that does not require too many resources or specialised human factors knowledge. If such methods are to become practical tools, it is important that they are valid, reliable, and robust.
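
The qualitative prediction step can be illustrated as a lookup from cognitive activity types to plausible error modes. The table entries below are invented placeholders; the method's actual activity types and error mode classification are not reproduced here:

```python
# Minimal sketch of the qualitative step: each task step is tagged with a
# cognitive activity type, which is looked up in a predefined table of
# plausible error modes. Table contents are illustrative placeholders.
ERROR_MODES = {
    "observe":  ["observation missed", "wrong object observed"],
    "diagnose": ["wrong hypothesis", "premature closure"],
    "execute":  ["wrong action", "action omitted", "wrong sequence"],
}

def predict_error_modes(task_steps):
    """task_steps: list of (step description, cognitive activity type)."""
    return [(desc, ERROR_MODES.get(activity, []))
            for desc, activity in task_steps]

scenario = [("read reactor pressure", "observe"),
            ("identify leaking valve", "diagnose"),
            ("close isolation valve", "execute")]
for desc, modes in predict_error_modes(scenario):
    print(desc, "->", modes)
```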

Feature selection for text data via sparse principal component analysis (희소주성분분석을 이용한 텍스트데이터의 단어선택)

  • Won Son
    • The Korean Journal of Applied Statistics / v.36 no.6 / pp.501-514 / 2023
  • When analyzing high-dimensional data such as text data, using all the variables as explanatory variables may cause statistical learning procedures to suffer from over-fitting. Furthermore, computational efficiency can deteriorate with a large number of variables. Dimensionality reduction techniques such as feature selection or feature extraction are useful for dealing with these problems. Sparse principal component analysis (SPCA) is a regularized least squares method that employs an elastic net-type objective function. The SPCA can be used to remove insignificant principal components and to identify important variables from noisy observations. In this study, we propose a dimension reduction procedure for text data based on the SPCA. Applying the proposed procedure to real data, we find that the reduced feature set maintains sufficient information in the text data while its size is reduced by removing redundant variables. As a result, the proposed procedure can improve classification accuracy and computational efficiency, especially for classifiers such as the k-nearest neighbors algorithm.
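
A minimal sketch of the selection rule implied by the abstract: fit SPCA to a document-term matrix and keep the terms that receive a nonzero loading in at least one sparse component. The toy count matrix and the alpha value are illustrative:

```python
# Minimal sketch: word selection via sparse PCA. Terms whose loadings are
# zero in every sparse component are dropped; alpha controls sparsity.
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(2)
X = rng.poisson(1.0, size=(200, 50)).astype(float)  # toy docs x terms matrix

spca = SparsePCA(n_components=5, alpha=1.0, random_state=0).fit(X)
selected = np.where(np.abs(spca.components_).sum(axis=0) > 0)[0]
print(f"kept {len(selected)} of {X.shape[1]} terms:", selected)
```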

Feature Extraction to Detect Hoax Articles (낚시성 인터넷 신문기사 검출을 위한 특징 추출)

  • Heo, Seong-Wan;Sohn, Kyung-Ah
    • Journal of KIISE / v.43 no.11 / pp.1210-1215 / 2016
  • Readership of online newspapers has grown with the proliferation of smart devices. However, fierce competition between Internet newspaper companies has resulted in a large increase in the number of hoax articles. Hoax articles are those whose title does not convey the content of the main story, giving readers the wrong impression of the contents. We note that hoax articles share certain characteristics, such as unnecessary celebrity quotations, a mismatch between title and content, or incomplete sentences. Based on these, we extract and validate features to identify hoax articles. We build a large-scale training dataset by analyzing text keywords in replies to articles, and from this we extract five effective features. We evaluate the performance of a support vector machine classifier on the extracted features, observing 92% accuracy on our validation set. In addition, we present a selective bigram model to measure the consistency between title and content, which can be effectively used to analyze short texts in general.
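
One idea from the abstract, measuring title-content consistency with bigrams, can be sketched directly; the overlap score, the second feature, and the toy articles below are illustrative, not the authors' five features or their selective bigram model:

```python
# Minimal sketch: a bigram-overlap score between title and body as a
# title-content consistency feature, fed with one other toy feature into
# an SVM classifier. Data and feature set are illustrative.
from sklearn.svm import SVC

def bigrams(text):
    toks = text.split()
    return {(a, b) for a, b in zip(toks, toks[1:])}

def title_consistency(title, body):
    """Fraction of title bigrams that also appear in the body."""
    t, b = bigrams(title), bigrams(body)
    return len(t & b) / len(t) if t else 0.0

# Each article contributes a small feature vector; label 1 marks a hoax.
articles = [("star shocked by scandal", "the actor said nothing unusual today", 1),
            ("rates rise again", "rates rise again as the central bank meets", 0)]
X = [[title_consistency(t, b), len(t.split())] for t, b, _ in articles]
y = [lab for _, _, lab in articles]
clf = SVC().fit(X, y)
```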