• Title/Summary/Keyword: tree classification method

Clustering and classification to characterize daily electricity demand (시간단위 전력사용량 시계열 패턴의 군집 및 분류분석)

  • Park, Dain;Yoon, Sanghoo
    • Journal of the Korean Data and Information Science Society
    • /
    • v.28 no.2
    • /
    • pp.395-406
    • /
    • 2017
  • The purpose of this study is to identify patterns of daily electricity demand through clustering and classification. Hourly data were collected from the Korea Power Exchange (KPX) between 2008 and 2012. Because electricity demand is time-series data, the time trend was removed before extracting the daily demand patterns. We considered k-means clustering, Gaussian mixture model clustering, and functional clustering in order to find the optimal clustering method. Classification analysis was then conducted to understand the relationship with external factors: day of the week, holidays, and weather. The data were divided into training and test sets: the training data consisted of the external factors and cluster assignments for 2008-2011, and the test data consisted of the daily external factors for 2012. Decision tree, random forest, support vector machine, and naive Bayes classifiers were used. Gaussian mixture model clustering combined with random forest showed the best prediction performance when the number of clusters was eight.
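The clustering stage described above can be sketched with a minimal pure-Python k-means over daily load profiles. The four-hour profiles, naive initialization, and cluster count below are illustrative stand-ins for the paper's 24-hour KPX data, not its actual method or results:

```python
def kmeans(profiles, k, iters=10):
    """Minimal k-means for daily load profiles (lists of hourly values)."""
    # Naive evenly spaced initialization (illustrative; k-means++ is preferable).
    centroids = [list(profiles[i * len(profiles) // k]) for i in range(k)]
    labels = [0] * len(profiles)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assign each profile to the nearest centroid (squared Euclidean distance).
        for idx, p in enumerate(profiles):
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[idx] = d.index(min(d))
            clusters[labels[idx]].append(p)
        # Move each centroid to the mean profile of its cluster.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Two synthetic shapes: daytime-peaking vs. evening-peaking (4 "hours" for brevity).
day = [[1, 5, 5, 1], [1, 6, 5, 1], [2, 5, 6, 1]]
evening = [[1, 1, 2, 6], [1, 2, 2, 7], [2, 1, 2, 6]]
labels = kmeans(day + evening, k=2)
```

In a pipeline like the study's, the cluster assignments from such a step become the class labels that the downstream classifiers (random forest, SVM, etc.) learn to predict from calendar and weather features.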

Improving the Accuracy of Document Classification by Learning Heterogeneity (이질성 학습을 통한 문서 분류의 정확성 향상 기법)

  • Wong, William Xiu Shun;Hyun, Yoonjin;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.3
    • /
    • pp.21-44
    • /
    • 2018
  • In recent years, the rapid development of internet technology and the popularization of smart devices have produced massive amounts of text data, distributed through media platforms such as the World Wide Web, internet news feeds, microblogs, and social media. However, this enormous amount of easily obtained information lacks organization, which has drawn the interest of many researchers seeking to manage it and has created demand for professionals capable of classifying relevant information; hence, text classification was introduced. Text classification is a challenging task in modern data analysis: a text document must be assigned to one or more predefined categories or classes. Various techniques are available, such as k-nearest neighbor, the naïve Bayes algorithm, support vector machines, decision trees, and artificial neural networks. However, when dealing with huge amounts of text data, model performance and accuracy become a challenge, and depending on the type of words used in the corpus and the type of features created for classification, performance can vary. Most previous attempts have proposed a new algorithm or modified an existing one, and that line of research can be said to have reached its limits for further improvement. In this study, instead of proposing or modifying an algorithm, we focus on a way to modify the use of data. It is widely known that classifier performance is influenced by the quality of the training data upon which the classifier is built. Real-world datasets usually contain noise, and such noisy data can affect the decisions made by classifiers built from them.
In this study, we consider that data from different domains, i.e., heterogeneous data, may carry noise-like characteristics that can be utilized in the classification process. Machine learning algorithms build classifiers on the assumption that the characteristics of the training data and the target data are the same or very similar. However, for unstructured data such as text, the features are determined by the vocabulary of the documents, so if the viewpoints of the training data and target data differ, their features may differ as well. We attempt to improve classification accuracy by strengthening the robustness of the document classifier through artificially injecting noise into the process of constructing it. Data coming from various sources are likely to be formatted differently, which causes difficulties for traditional machine learning algorithms, because they were not developed to recognize different types of data representation at once and combine them in the same generalization. Therefore, to utilize heterogeneous data in the learning process of the document classifier, we apply semi-supervised learning. However, unlabeled data may degrade the performance of the document classifier, so we further propose a method called the Rule Selection-Based Ensemble Semi-Supervised Learning Algorithm (RSESLA), which selects only the documents that contribute to improving the classifier's accuracy. RSESLA creates multiple views by manipulating the features using different types of classification models and different types of heterogeneous data. The most confident classification rules are then selected and applied for the final decision making.
In this paper, three types of real-world data sources were used: news, Twitter, and blogs.
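The abstract does not specify RSESLA's internals beyond selecting high-confidence rules, but the underlying semi-supervised idea can be sketched generically: a classifier pseudo-labels only the unlabeled documents it predicts with high confidence, then retrains. Below is a minimal self-training sketch with a naive Bayes text classifier; the toy documents, labels, and threshold are illustrative, not the paper's method or data:

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (tokens, label). Returns per-class word counts, doc counts, vocab."""
    counts, totals = {}, Counter()
    for tokens, label in docs:
        counts.setdefault(label, Counter()).update(tokens)
        totals[label] += 1
    vocab = {w for c in counts.values() for w in c}
    return counts, totals, vocab

def predict_nb(model, tokens):
    """Return (label, softmax-style confidence) under multinomial NB with Laplace smoothing."""
    counts, totals, vocab = model
    n = sum(totals.values())
    scores = {}
    for label, c in counts.items():
        s = math.log(totals[label] / n)
        size = sum(c.values())
        for w in tokens:
            s += math.log((c[w] + 1) / (size + len(vocab)))
        scores[label] = s
    best = max(scores, key=scores.get)
    z = sum(math.exp(v - scores[best]) for v in scores.values())
    return best, 1.0 / z

def self_train(labeled, unlabeled, threshold=0.8, rounds=3):
    """Pseudo-label only the unlabeled docs predicted with high confidence."""
    pool = list(unlabeled)
    for _ in range(rounds):
        model = train_nb(labeled)
        keep = []
        for tokens in pool:
            label, conf = predict_nb(model, tokens)
            if conf >= threshold:
                labeled = labeled + [(tokens, label)]
            else:
                keep.append(tokens)
        pool = keep
    return train_nb(labeled)

labeled = [(["ball", "goal", "team"], "sports"), (["election", "vote"], "politics")]
unlabeled = [["goal", "team", "win"], ["vote", "party", "election"]]
model = self_train(labeled, unlabeled)
label, _ = predict_nb(model, ["team", "goal"])
```

The confidence gate is the part RSESLA refines: instead of a single classifier's posterior, it reportedly selects the most confident rules across multiple views built from different models and heterogeneous sources.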

Characterization and Phylogenetic Analysis of Chitin Synthase Genes from the Genera Sporobolomyces and Bensingtonia subrosea

  • Nam, Jin-Sik
    • Korean Journal of Environmental Biology
    • /
    • v.23 no.4
    • /
    • pp.335-342
    • /
    • 2005
  • We cloned seven genes encoding chitin synthases (CHSs) by PCR amplification from the genomic DNA of four strains of the genus Sporobolomyces and of Bensingtonia subrosea, using degenerate primers based on conserved regions of the CHS genes. Although the deduced amino acid sequences were of similar length (176 to 189 amino acids, except SgCHS2), the DNA sequences differed in size owing to the various introns present in the seven fragments. Alignment and phylogenetic analysis of the deduced amino acid sequences together with the reported CHS genes of basidiomycetes separated the sequences into classes I, II, and III. This analysis also permitted the classification of the isolated CHSs: SgCHS1 belongs to class I; BsCHS1, SaCHS1, SgCHS2, SpgCHS1, and SsCHS1 belong to class II; and BsCHS2 belongs to class III. The deduced amino acid sequences in class II, discovered from five strains, were also compared with those of other basidiomycetes using the CLUSTAL X program. Bootstrap analysis and a phylogenetic tree built with the neighbor-joining method revealed the taxonomic and evolutionary positions of the four Sporobolomyces strains and of Bensingtonia subrosea, in agreement with the previous classification. The results clearly show that CHS fragments can serve as a valuable key for molecular taxonomic and phylogenetic studies of basidiomycetes.
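Before a neighbor-joining tree can be built, a pairwise distance matrix is computed from the aligned sequences. A minimal sketch of the uncorrected p-distance (proportion of differing sites) is shown below; the short peptide fragments are invented placeholders, not the paper's CHS sequences:

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences
    (positions with a gap in either sequence are skipped)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(seqs):
    """All-pairs p-distances, keyed by (name_i, name_j)."""
    names = list(seqs)
    return {(i, j): p_distance(seqs[i], seqs[j]) for i in names for j in names}

# Toy aligned fragments standing in for CHS amino-acid regions (illustrative only).
seqs = {
    "SgCHS1": "MQYDTG-RLI",
    "BsCHS1": "MQYDSGARLI",
    "SaCHS1": "MQYDSGARLV",
    "BsCHS2": "MKWETGGKLV",
}
D = distance_matrix(seqs)
```

A matrix like `D` is the input that neighbor-joining (as in CLUSTAL X's tree step) iteratively collapses into a tree, joining the pair that minimizes the NJ criterion at each step.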

A Method of Predicting Service Time Based on Voice of Customer Data (고객의 소리(VOC) 데이터를 활용한 서비스 처리 시간 예측방법)

  • Kim, Jeonghun;Kwon, Ohbyung
    • Journal of Information Technology Services
    • /
    • v.15 no.1
    • /
    • pp.197-210
    • /
    • 2016
  • With the advent of text analytics, VOC (voice of customer) data have become an important resource that provides managers and marketing practitioners with consumers' veiled opinions and requirements. In other words, making relevant use of VOC data can improve customer responsiveness and satisfaction, each of which eventually improves business performance. However, unstructured data such as customer complaints in VOC have seldom been used in marketing practice, for example to predict service time as an index of service quality, because the unstructured portion of VOC data is complicated in form and converting it into structured data is a difficult process. Hence, this study proposes a prediction model that improves the estimation accuracy of the level of customer satisfaction by combining unstructured features obtained from text mining with the structured data features in VOC, and examines the relationship between the unstructured data, the structured data, and service processing time through regression analysis. Text mining techniques, sentiment analysis, keyword extraction, classification algorithms, decision trees, and multiple regression are considered and compared. For the experiment, we used a company's actual VOC data.

An enhanced feature selection filter for classification of microarray cancer data

  • Mazumder, Dilwar Hussain;Veilumuthu, Ramachandran
    • ETRI Journal
    • /
    • v.41 no.3
    • /
    • pp.358-370
    • /
    • 2019
  • The main aim of this study is to select the optimal set of genes from microarray cancer datasets that contribute to the prediction of specific cancer types. This study proposes an enhancement of the feature selection filter algorithm based on Joe's normalized mutual information and its use for gene selection. The proposed algorithm is implemented and evaluated on seven benchmark microarray cancer datasets, namely, central nervous system, leukemia (binary), leukemia (3 class), leukemia (4 class), lymphoma, mixed lineage leukemia, and small round blue cell tumor, using five well-known classifiers: naive Bayes, radial basis function network, an instance-based classifier, decision table, and decision tree. An average increase in prediction accuracy of 5.1% is observed across all seven datasets, averaged over all five classifiers, with an average reduction in training time of 2.86 seconds. The performance of the proposed method is also compared with that of three other popular mutual-information-based feature selection filters, namely, information gain, gain ratio, and symmetric uncertainty. The results are impressive when all five classifiers are used on all the datasets.
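One of the baseline filters the study compares against, information gain, is simple enough to sketch directly: rank each feature by how much it reduces class entropy. The discretized "expression levels" below are toy values, not microarray data, and this is the plain information-gain filter, not the proposed normalized-mutual-information variant:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """H(class) - H(class | discretized feature): higher means more informative."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

# Toy discretized expression levels ("lo"/"hi") for two genes across 6 samples.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
genes = {
    "gene_a": ["hi", "hi", "hi", "lo", "lo", "lo"],  # perfectly separates the classes
    "gene_b": ["hi", "lo", "hi", "lo", "hi", "lo"],  # nearly uninformative
}
ranked = sorted(genes, key=lambda g: information_gain(genes[g], labels), reverse=True)
```

A filter method scores each gene independently like this and keeps the top-k, which is what makes it cheap enough for datasets with thousands of genes and few samples.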

Extraction of the Tree Regions in Forest Areas Using LIDAR Data and Ortho-image (라이다 자료와 정사영상을 이용한 산림지역의 수목영역추출)

  • Kim, Eui Myoung
    • Journal of Korean Society for Geospatial Information Science
    • /
    • v.21 no.2
    • /
    • pp.27-34
    • /
    • 2013
  • Due to increased interest in global warming, interest in forest resources as a means of reducing greenhouse gases has grown accordingly. To date, data on forest resources have been obtained from aerial photographs or satellite images by means of plotting. However, image data alone have a disadvantage: measurements such as tree height in dense forest areas lack accuracy. In this context, this study presents a data-processing method in which individual trees are isolated within forested areas using LIDAR data and ortho-images, yielding more efficient and accurate tree-height information. For the LIDAR processing, a normalized digital surface model was generated and tree points were extracted via local maxima filtering; to extract the forest areas, object-oriented image classification was applied to the ortho-images. The final tree points were obtained by combining the LIDAR and ortho-image results. Based on an experiment conducted in the Yongin area, the merits and demerits of the LIDAR-only and ortho-image-only methods were analyzed, and information on individual trees within forested areas was obtained by combining the two data sources, thereby verifying the efficiency of the proposed method.
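The local maxima filtering step can be sketched on a toy grid: a cell of the normalized DSM is a candidate tree top if it exceeds all eight neighbours and a minimum height. The 5x5 grid and the 2 m threshold below are invented for illustration; real implementations use a search window scaled to expected crown size:

```python
def local_maxima(ndsm, min_height=2.0):
    """Return (row, col) cells of a normalized DSM grid that exceed all 8
    neighbours and a minimum height -- candidate individual tree tops."""
    rows, cols = len(ndsm), len(ndsm[0])
    tops = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            h = ndsm[r][c]
            neighbours = [ndsm[r + dr][c + dc]
                          for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                          if (dr, dc) != (0, 0)]
            if h >= min_height and all(h > nb for nb in neighbours):
                tops.append((r, c))
    return tops

# Toy 5x5 nDSM (metres above ground) containing two tree crowns.
ndsm = [
    [0.0, 0.1, 0.0, 0.2, 0.1],
    [0.1, 6.0, 0.3, 0.2, 0.1],
    [0.0, 0.4, 0.2, 8.5, 0.3],
    [0.1, 0.2, 0.3, 0.4, 0.2],
    [0.0, 0.1, 0.0, 0.1, 0.0],
]
tops = local_maxima(ndsm)
```

In the study's workflow, candidate tops like these are then intersected with the forest mask derived from the object-oriented classification of the ortho-image.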

Protecting Accounting Information Systems using Machine Learning Based Intrusion Detection

  • Biswajit Panja
    • International Journal of Computer Science & Network Security
    • /
    • v.24 no.5
    • /
    • pp.111-118
    • /
    • 2024
  • In general, a network-based intrusion detection system is designed to detect malicious behavior directed at a network or its resources. The key goal of this paper is to examine network data and identify whether it is normal traffic or anomalous traffic, specifically for accounting information systems. There are many approaches to detecting various forms of network-based intrusion; in this paper, we use supervised machine learning techniques. Classification models are trained and validated on the data: the system is trained on a training dataset and then used to detect intrusions in a testing dataset. The proposed method determines whether network data are normal or anomalous, which helps prevent unauthorized activity on the network and the systems under it. Decision tree and k-nearest neighbor classifiers are applied in the proposed model to classify network traffic behavior as normal or abnormal. In addition, logistic regression and support vector classification algorithms are used in our model to support the proposed concepts. Furthermore, a feature selection method is used to extract valuable information from the dataset and enhance the efficiency of the proposed approach: the random forest algorithm helps the system identify the crucial features and focus on them rather than on all features. The experimental findings revealed that the suggested method for network intrusion detection has a negligible false alarm rate, with accuracy between 95% and 100%. As a result of the high precision rate, this concept can be used to detect intrusions in network data and prevent vulnerabilities on the network.

Soil Moisture Estimation Using CART Algorithm and Ancillary Data (CART기법과 보조자료를 이용한 토양수분 추정)

  • Kim, Gwang-Seob;Park, Han-Gyun
    • Journal of Korea Water Resources Association
    • /
    • v.43 no.7
    • /
    • pp.597-608
    • /
    • 2010
  • In this study, a method for soil moisture estimation was proposed to obtain the nationwide soil moisture distribution map using on-site soil moisture observations, rainfall, surface temperature, NDVI, land cover, effective soil depth, and CART (Classification And Regression Tree) algorithm. The method was applied to the Yong-dam dam basin since the soil moisture data (4 sites) of the basin were reliable. Soil moisture observations of 3 sites (Bu-gui, San-jeon, Cheon-cheon2) were used for training the algorithm and 1 site (Gye-buk2) was used for the algorithm validation. The correlation coefficient between the observed and estimated data of soil moisture in the validation sites is about 0.737. Results show that even though there are limitations of the lack of reliable soil moisture observation for various land use, soil type, and topographic conditions, the soil moisture estimation method using ancillary data and CART algorithm can be a reasonable approach since the algorithm provided a fairly good estimation of soil moisture distribution for the study area.
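The regression-tree half of CART works by repeatedly choosing the split that minimizes the squared error of the resulting leaf means. A single split on one predictor can be sketched as follows; the rainfall and soil-moisture numbers are invented for illustration, not the Yong-dam basin data:

```python
def best_split(xs, ys):
    """Return the threshold on one predictor that minimizes the summed
    squared error of the two resulting leaf means (one CART split)."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best_err, best_t = float("inf"), None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds (excluding the max)
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        err = sse(left) + sse(right)
        if err < best_err:
            best_err, best_t = err, t
    return best_t

# Toy data: daily rainfall (mm) vs. observed soil moisture (%) -- illustrative only.
rain = [0, 1, 2, 3, 20, 25, 30, 40]
moist = [8, 9, 8, 10, 28, 30, 29, 31]
threshold = best_split(rain, moist)
```

A full CART builds a tree by applying this search recursively over all ancillary predictors (rainfall, surface temperature, NDVI, land cover, effective soil depth in the study) until a stopping rule is met.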

Linear Interpolation and Machine Learning Methods for Gas Leakage Prediction Based on Multi-source Data Integration (다중소스 데이터 융합 기반의 가스 누출 예측을 위한 선형 보간 및 머신러닝 기법)

  • Dashdondov, Khongorzul;Jo, Kyuri;Kim, Mi-Hye
    • Journal of the Korea Convergence Society
    • /
    • v.13 no.3
    • /
    • pp.33-41
    • /
    • 2022
  • In this article, we propose predicting natural gas (NG) leakage levels through feature selection based on a factor analysis (FA) of integrated Korea Meteorological Administration data and natural gas leakage data, so that complex factors are considered. The approach has three modules. First, missing values in the integrated dataset are filled using linear interpolation, and essential features are selected using FA with OrdinalEncoder (OE)-based normalization. The dataset is then labeled by k-means clustering. The final module uses four algorithms, k-nearest neighbors (KNN), decision tree (DT), random forest (RF), and naive Bayes (NB), to predict gas leakage levels. The proposed method is evaluated by accuracy, area under the ROC curve (AUC), and mean squared error (MSE). The test results indicate that the OrdinalEncoder-factor analysis (OE-F)-based classification method is a clear improvement; in particular, OE-F-based KNN (OE-F-KNN) showed the best performance, with 95.20% accuracy, an AUC of 96.13%, and an MSE of 0.031.
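The first module's gap-filling step, linear interpolation between the nearest observed neighbours, can be sketched directly. The "methane readings" below are made-up values, and this sketch fills interior gaps only (leading/trailing gaps would need extrapolation or another rule):

```python
def interpolate_missing(series):
    """Fill None gaps by linear interpolation between the nearest
    observed neighbours (interior gaps only)."""
    out = list(series)
    known = [i for i, v in enumerate(out) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            frac = (i - a) / (b - a)          # position within the gap
            out[i] = out[a] + frac * (out[b] - out[a])
    return out

# Hourly sensor readings with dropouts (illustrative values).
readings = [1.0, None, None, 4.0, 5.0, None, 9.0]
filled = interpolate_missing(readings)
```

Filling gaps before the FA/feature-selection and clustering stages matters because both steps assume a complete feature matrix.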

Length of stay in PACU among surgical patients using data mining technique (데이터 마이닝을 활용한 외과수술환자의 회복실 체류시간 분석)

  • Yoo, Je-Bog;Jang, Hee Jung
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.14 no.7
    • /
    • pp.3400-3411
    • /
    • 2013
  • Data mining is a new approach for extracting useful information through effective analysis of huge datasets in numerous fields. This study analyzed the medical records of 1,500 patients with a decision tree model using Clementine's C&RT (Classification and Regression Tree, CART) as the data mining technique. The data were sorted by length of stay in the PACU and divided into three groups. The result, extracted by the C5.0 decision tree method, showed that the important factors related to length of stay in the PACU are type of operation, preoperative EKG abnormality, anesthetics, operative duration, and age.