• Title/Summary/Keyword: Learning data set

Performance Improvement of Web Document Classification through Incorporation of Feature Selection and Weighting (특징선택과 특징가중의 융합을 통한 웹문서분류 성능의 개선)

  • Lee, Ah-Ram; Kim, Han-Joon; Man, Xuan
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.13 no.4 / pp.141-148 / 2013
  • Automated classification systems that use machine learning develop classification models through a learning process and then classify unknown data into a predefined set of categories according to the model. The performance of machine learning-based classification systems depends greatly on the quality of the features composing the classification models. For textual data, word terms and structural information can be used to generate the feature set; in particular, extracting features from Web documents requires analyzing tag and hyperlink information. Recent studies on Web document classification focus on feature engineering rather than on the machine learning algorithms themselves. This paper therefore proposes a novel method of incorporating feature selection and feature weighting that can improve classification models effectively. In extensive experiments on the Web-KB document collection, the proposed method outperforms conventional ones.
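
The abstract names the two ingredients (feature selection and feature weighting) but gives no concrete pipeline; the sketch below is a minimal scikit-learn illustration of combining them, with TF-IDF standing in for the weighting scheme and chi-squared scores for the selection step. The dataset, k, and classifier are placeholder choices, not the paper's.

```python
# Minimal sketch: feature weighting (TF-IDF) combined with feature
# selection (chi-squared) ahead of a text classifier.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

cats = ["sci.space", "rec.autos"]                        # placeholder corpus
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

clf = Pipeline([
    ("weighting", TfidfVectorizer(sublinear_tf=True)),   # feature weighting
    ("selection", SelectKBest(chi2, k=2000)),            # feature selection
    ("model", MultinomialNB()),
])
clf.fit(train.data, train.target)
print("test accuracy:", clf.score(test.data, test.target))
```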

A Study on Multi-Dimensional Learning Data Composition Based on Wi-Fi Radio Fingerprint (Wi-Fi 전파 지문 기반 다차원 학습 데이터 구성에 관한 연구)

  • Yoon, Chang-Pyo; Hwang, Chi-Gon
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2018.10a / pp.639-640 / 2018
  • The technique of identifying location from radio-wave fingerprints is currently widely used in the indoor positioning field. To obtain reliable positions, the data needed for learning and testing must be collected and organized as multidimensional data. That is, location data collection and data management techniques are required that can cope with the environmental changes arising from variations in the surrounding radio fingerprint, such as wireless APs, BLE iBeacons, and mobile terminals. This paper therefore proposes a technique for constructing and managing multidimensional data that is less sensitive to the environmental changes in the radio fingerprints required for positioning.
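
The paper does not publish its data schema; as one plausible reading of "multidimensional learning data", the sketch below models a fingerprint record whose dimensions span Wi-Fi, BLE, and the collecting device. All field names are hypothetical.

```python
# Hypothetical multidimensional fingerprint record; field names are
# assumptions, since the paper's actual schema is not given here.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class FingerprintRecord:
    location: tuple          # (x, y, floor) reference position
    wifi_rssi: dict          # AP BSSID -> RSSI (dBm)
    ble_rssi: dict           # iBeacon identifier -> RSSI (dBm)
    device_model: str        # collecting terminal, one source of variation
    collected_at: datetime = field(default_factory=datetime.now)

record = FingerprintRecord(
    location=(12.5, 3.0, 2),
    wifi_rssi={"aa:bb:cc:dd:ee:ff": -61, "11:22:33:44:55:66": -74},
    ble_rssi={"f7826da6-4fa2-4e98-8024-bc5b71e0893e": -80},
    device_model="SM-G960N",
)
print(record.wifi_rssi)
```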

A Deep Learning Model for Extracting Consumer Sentiments using Recurrent Neural Network Techniques

  • Ranjan, Roop; Daniel, AK
    • International Journal of Computer Science & Network Security / v.21 no.8 / pp.238-246 / 2021
  • The rapid rise of the Internet and social media has resulted in a large number of text-based reviews being posted on sites such as social media. In the age of social media, using machine learning technologies to analyze the emotional context of comments aids in understanding the quality of service (QoS) for any product or service, and the classification and analysis of user reviews aids in improving QoS. Unlike traditional categorization models, which are based on sets of rules, machine learning algorithms have evolved into a powerful tool for analyzing user sentiment. In sentiment categorization, Bidirectional Long Short-Term Memory (BiLSTM) has shown significant results, and Convolutional Neural Networks (CNN) have shown promising results: using convolution and pooling layers, a CNN can successfully extract local information, while BiLSTM uses two LSTM orientations to increase the amount of context available to the model. The suggested hybrid model combines the benefits of these two deep learning-based algorithms. The data source for analysis and classification was user reviews of Indian Railway Services on Twitter. The hybrid model uses the Keras Embedding technique as its input source: it takes in data and generates lower-dimensional features that lead to a classification result. The model's performance was compared using Keras and Word2Vec embeddings, and it showed a significant improvement in response, with an accuracy of 95.19 percent.
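
The abstract specifies the building blocks (Keras Embedding, CNN with convolution and pooling, BiLSTM) but not the exact architecture; the sketch below is one minimal Keras arrangement of those blocks, with illustrative layer sizes rather than the paper's settings.

```python
# Sketch of a CNN + BiLSTM hybrid sentiment classifier in Keras;
# vocabulary size, sequence length, and layer widths are assumptions.
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN = 20000, 100

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 128),        # Keras Embedding as input source
    layers.Conv1D(64, 5, activation="relu"),  # convolution extracts local features
    layers.MaxPooling1D(2),                   # pooling lowers dimensionality
    layers.Bidirectional(layers.LSTM(64)),    # dual LSTM orientations for context
    layers.Dense(1, activation="sigmoid"),    # binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```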

A Study of Safety Accident Prediction Model (Focusing on Military Traffic Accident Cases) (안전사고 예측모형 개발 방안에 관한 연구(군 교통사고 사례를 중심으로))

  • Ki, Jae-Sug; Hong, Myeong-Gi
    • Journal of the Society of Disaster Information / v.17 no.3 / pp.427-441 / 2021
  • Purpose: This study proposes a method for developing a model that predicts the probability of traffic accidents in advance, in order to prevent traffic accidents, the most frequent type of accident in the military. Method: CRISP-DM (Cross Industry Standard Process for Data Mining) was applied in this study. The CRISP-DM process consists of six stages; unlike the Waterfall Model, the stages are not strictly one-directional but improve in completeness through feedback between stages. Results: When the same data set as the previously constructed accident-investigation data for the entire group was modeled with a classification criterion of 0.5, significant results were obtained for the accuracy, specificity, sensitivity, and AUC of the traffic accident prediction model. Conclusion: In designing the prediction model, it was confirmed that the lack of data made it difficult to obtain meaningful prediction values. To address the data shortage, a methodology for designing a predictive model was proposed that reorganizes and expands the data set in a way that supports rational inference.
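
The abstract reports accuracy, specificity, sensitivity, and AUC at a 0.5 classification criterion without further detail; the sketch below shows how those four metrics are typically computed at that threshold, using synthetic data and a logistic regression stand-in for the study's model.

```python
# Sketch: evaluating a binary accident-prediction model at a 0.5
# classification threshold with the metrics the abstract names.
# The data are synthetic; the study's actual features are not public.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # placeholder predictors
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
proba = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
pred = (proba >= 0.5).astype(int)                    # classification criterion 0.5

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print("accuracy:   ", (tp + tn) / (tp + tn + fp + fn))
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("AUC:        ", roc_auc_score(y_te, proba))
```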

A Modified Approach to Density-Induced Support Vector Data Description

  • Park, Joo-Young; Kang, Dae-Sung
    • International Journal of Fuzzy Logic and Intelligent Systems / v.7 no.1 / pp.1-6 / 2007
  • The SVDD (support vector data description) is one of the best-known one-class support vector learning methods, in which one adopts the strategy of utilizing balls defined on the feature space in order to distinguish a set of normal data from all other possible abnormal objects. Recently, with the objective of generalizing the SVDD, which treats all training data with equal importance, the so-called D-SVDD (density-induced support vector data description) was proposed, incorporating the idea that data in a higher-density region are more significant than those in a lower-density region. In this paper, we consider the problem of further improving the D-SVDD toward the use of a partial reference set for testing, and propose an LMI (linear matrix inequality)-based optimization approach to solve the improved version of the D-SVDD problems. Our approach utilizes a new class of density-induced distance measures based on the RSDE (reduced set density estimator), along with an LMI-based mathematical formulation in the form of SDP (semi-definite programming) problems, which can be efficiently solved by interior point methods. The validity of the proposed approach is illustrated via numerical experiments using real data sets.
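
As background for readers, the baseline SVDD with an RBF kernel coincides with scikit-learn's one-class SVM; the sketch below illustrates that baseline idea of enclosing normal data and flagging outliers. It does not implement the paper's density-induced or LMI-based extensions.

```python
# Baseline SVDD idea via scikit-learn's OneClassSVM (equivalent to SVDD
# for an RBF kernel); D-SVDD and the LMI extension are not shown here.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))    # normal training data
outliers = rng.uniform(low=-6, high=6, size=(20, 2))      # possible abnormal objects

model = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(normal)
print("inlier  predictions:", model.predict(normal[:5]))   # +1 = inside the ball
print("outlier predictions:", model.predict(outliers[:5])) # -1 = outside
```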

River Water Level Prediction Method based on LSTM Neural Network

  • Le, Xuan Hien; Lee, Giha
    • Proceedings of the Korea Water Resources Association Conference / 2018.05a / pp.147-147 / 2018
  • In this article we use TensorFlow, an open-source software library developed for conducting very complex machine learning and deep neural network applications, though it is general enough to be applicable in a wide variety of other domains as well. The proposed model is based on LSTM (Long Short-Term Memory), a deep neural network model, and predicts the river water level at Okcheon Station on the Geum River without using rainfall-forecast information. For LSTM modeling, the input data are hourly water levels over 15 years, from 2002 to 2016, at four stations: three upstream stations (Sutong, Hotan, and Songcheon) and the forecast-target station (Okcheon). The data are subdivided by purpose into a training data set, a testing data set, and a validation data set. The model was formulated to predict the Okcheon Station water level at lead times from 3 hours to 12 hours. Although the model does not require the many inputs, such as climate, geography, and land use, needed for rainfall-runoff simulation, the prediction is very stable and reliable up to a lead time of 9 hours, with a Nash-Sutcliffe efficiency (NSE) higher than 0.90 and a root mean square error (RMSE) lower than 12 cm. The results indicate that the method can reproduce the river water level time series and is applicable to practical flood forecasting in place of hydrologic modeling approaches.
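
The abstract describes the setup (hourly levels from three upstream gauges plus the target gauge, several lead times) but no code; the sketch below is a minimal TensorFlow/Keras rendering of that setup on random placeholder data. Window length and layer sizes are assumptions, not the paper's settings.

```python
# Sketch of an LSTM water-level predictor: sliding windows of hourly
# levels from 4 gauges predict the target gauge at a fixed lead time.
import numpy as np
from tensorflow.keras import layers, models

WINDOW, N_STATIONS, LEAD = 24, 4, 3      # 24 h history, 4 gauges, 3 h lead time

# Placeholder series; in practice, load the 2002-2016 hourly records.
series = np.random.rand(5000, N_STATIONS).astype("float32")
X = np.stack([series[i:i + WINDOW] for i in range(len(series) - WINDOW - LEAD)])
y = series[WINDOW + LEAD:, -1]           # future level at the target station

model = models.Sequential([
    layers.Input(shape=(WINDOW, N_STATIONS)),
    layers.LSTM(64),
    layers.Dense(1),                     # predicted water level
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=64, validation_split=0.2)
```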

Analysis Model Evaluation based on IoT Data and Machine Learning Algorithm for Prediction of Acer Mono Sap Liquid Water

  • Lee, Han Sung; Jung, Se Hoon
    • Journal of Korea Multimedia Society / v.23 no.10 / pp.1286-1295 / 2020
  • It has become increasingly difficult to predict the amount of Acer mono sap that can be collected, due to the droughts and cold waves caused by recent climate change, and few studies have been conducted on predicting its collection volume. This study therefore proposes a Big Data prediction system based on meteorological information for the collection of Acer mono sap. The proposed system analyzes the collected data and provides managers with statistical charts of predicted values for the climate factors that affect the amount of sap collected, thus enabling efficient work. It was designed on Hadoop for data collection, processing, and analysis. The study also analyzed and proposed an optimal prediction model for the climate conditions influencing the volume of Acer mono sap by applying a multiple regression analysis model based on Hadoop and Mahout.
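
The paper applies a Hadoop/Mahout multiple regression model to climate factors; the sketch below shows the same regression step with scikit-learn standing in for Mahout, on synthetic data. The three climate variables are assumptions, not the paper's feature list.

```python
# Multiple regression sketch: climate factors -> sap collection volume.
# Feature names and the synthetic relationship are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 300
temperature = rng.normal(5, 4, n)        # daily mean temperature (deg C)
humidity = rng.uniform(30, 90, n)        # relative humidity (%)
rainfall = rng.exponential(2, n)         # precipitation (mm)

X = np.column_stack([temperature, humidity, rainfall])
sap_volume = 1.2 * temperature - 0.05 * humidity + rng.normal(0, 1, n)

model = LinearRegression().fit(X, sap_volume)
print("coefficients:", model.coef_)      # effect of each climate factor
print("R^2:", model.score(X, sap_volume))
```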

Evaluation of Attribute Selection Methods and Prior Discretization in Supervised Learning

  • Cha, Woon Ock; Huh, Moon Yul
    • Communications for Statistical Applications and Methods / v.10 no.3 / pp.879-894 / 2003
  • We evaluated the efficiency of applying attribute selection methods and prior discretization to supervised learning, modeled with C4.5 and Naive Bayes. Three databases, consisting of continuous attributes except for one decision attribute, were obtained from the UCI data archive. Four methods were used for attribute selection: MDI, ReliefF, Gain Ratio, and a consistency-based method. MDI and ReliefF can be used for both continuous and discrete attributes, but the other two methods can be used only for discrete attributes. Discretization was performed using the Fayyad and Irani method. To investigate the effect of noise in the databases, noise was introduced into the data sets at levels of up to 10% or 20%, and the data, both with and without noise, were processed through the steps of attribute selection, discretization, and classification. The results indicate that classifying the data based on selected attributes yields higher accuracy than classifying the full data set, and that prior discretization does not lower the accuracy.
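
As a minimal illustration of the evaluated pipeline (attribute selection, prior discretization, then a classifier), the sketch below uses scikit-learn stand-ins: mutual information in place of ReliefF/MDI, and equal-frequency binning in place of Fayyad-Irani MDL discretization.

```python
# Pipeline sketch: attribute selection -> prior discretization -> Naive
# Bayes, with stand-in methods rather than the paper's exact ones.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)        # placeholder continuous-attribute data
pipe = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=2)),             # attribute selection
    ("discretize", KBinsDiscretizer(n_bins=5, encode="ordinal")),  # prior discretization
    ("classify", GaussianNB()),
])
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```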

Unlabeled Wi-Fi RSSI Indoor Positioning by Using IMU

  • Ju, Chanyeong; Yoo, Jaehyun
    • Journal of Positioning, Navigation, and Timing / v.12 no.1 / pp.37-42 / 2023
  • The Wi-Fi Received Signal Strength Indicator (RSSI) is considered one of the most important sensor data types for indoor localization. However, collecting an RSSI fingerprint, which consists of pairs of an RSSI measurement set and a corresponding location, is costly and time-consuming. In this paper, we propose a Wi-Fi RSSI learning technique that needs no true location data, to overcome the limitations of static database construction. Instead of true reference positions, inertial measurement unit (IMU) data are used to generate pseudo-locations, which allow the trainer to move during data collection and thereby improve collection efficiency dramatically. An experiment shows that the proposed algorithm successfully learns the unsupervised Wi-Fi RSSI positioning model, achieving 2 m accuracy at the 0.8 point of the cumulative distribution function (CDF).
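
The core idea, pseudo-locations generated from IMU data and used as training targets for an RSSI model, can be sketched as below; the dead-reckoning integration and the k-NN regressor are assumptions, not necessarily the paper's method, and the data are random placeholders.

```python
# Sketch: IMU dead reckoning produces pseudo position labels, which
# then supervise an RSSI -> position regressor.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def dead_reckon(accel, dt=0.01, p0=(0.0, 0.0)):
    """Double-integrate 2-D acceleration into pseudo positions (drift-prone)."""
    vel = np.cumsum(accel * dt, axis=0)
    return np.asarray(p0) + np.cumsum(vel * dt, axis=0)

rng = np.random.default_rng(3)
accel = rng.normal(0, 0.2, size=(1000, 2))       # placeholder IMU samples
pseudo_pos = dead_reckon(accel)                  # pseudo location labels
rssi = rng.normal(-70, 5, size=(1000, 6))        # placeholder RSSI from 6 APs

model = KNeighborsRegressor(n_neighbors=5).fit(rssi, pseudo_pos)
print("estimated position:", model.predict(rssi[:1]))
```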

Financial Fraud Detection using Text Mining Analysis against Municipal Cybercriminality (지자체 사이버 공간 안전을 위한 금융사기 탐지 텍스트 마이닝 방법)

  • Choi, Sukjae; Lee, Jungwon; Kwon, Ohbyung
    • Journal of Intelligence and Information Systems / v.23 no.3 / pp.119-138 / 2017
  • Recently, SNS has become an important channel for marketing as well as personal communication. However, cybercrime has also evolved with the development of information and communication technology, and illegal advertising is distributed on SNS in large quantities. As a result, personal information is leaked and even monetary damages occur more frequently. In this study, we propose a method to analyze which sentences and documents posted to SNS are related to financial fraud. First, as a conceptual framework, we developed a matrix of the conceptual characteristics of cybercriminality on SNS and emergency management, and suggested an emergency management process consisting of pre-cybercriminality steps (e.g., risk identification) and post-cybercriminality steps. Of these, this paper focuses on risk identification. The main process consists of data collection, preprocessing, and analysis. First, we selected the two words 'daechul' (loan) and 'sachae' (private loan) as seed words and collected data containing these words from SNS such as Twitter. The collected data were given to two researchers to decide whether or not they were related to cybercriminality, particularly financial fraud. We then selected some of them as keywords if the vocabulary was related to nominals and symbols. With the selected keywords, we searched and collected data from web sources such as Twitter, news sites, and blogs; more than 820,000 articles were collected. The collected articles were refined through preprocessing and made into learning data. The preprocessing process is divided into a morphological analysis step, a stop-word removal step, and a valid part-of-speech selection step. In the morphological analysis step, a complex sentence is transformed into morpheme units to enable mechanical analysis. In the stop-word removal step, non-lexical elements such as numbers, punctuation marks, and double spaces are removed from the text. In the valid part-of-speech selection step, only nouns and symbols are considered: since nouns refer to things, they express the intent of a message better than other parts of speech, and the more illegal a text is, the more frequently symbols are used. Each selected item is labeled 'legal' or 'illegal', since using the selected data as learning data requires classifying whether each item is legitimate or not. The processed data are then converted into a corpus and a document-term matrix. Finally, the 'legal' and 'illegal' files were mixed and randomly divided into a learning data set and a test data set; in this study, 70% of the data were used for learning and 30% for testing. SVM was used as the discrimination algorithm. Since SVM requires gamma and cost values as its main parameters, we set gamma to 0.5 and cost to 10, based on the optimal value function; the cost is set higher than in general cases. To show the feasibility of the proposed idea, we compared the proposed method with MLE (Maximum Likelihood Estimation), Term Frequency, and a Collective Intelligence method, using overall accuracy as the metric. As a result, the overall accuracy of the proposed method was 92.41% for illegal loan advertisements and 77.75% for illegal visit sales, clearly superior to Term Frequency, MLE, and the others. Hence, the results suggest that the proposed method is valid and practically usable.
    In this paper, we propose a framework for managing crises caused by abnormalities in unstructured data sources such as SNS. We hope this study contributes to academia by identifying what to consider when applying an SVM-like discrimination algorithm to text analysis, and to practitioners in the fields of brand management and opinion mining.
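
The abstract fixes the key settings: a document-term matrix, an RBF-kernel SVM with gamma 0.5 and cost 10, and a 70/30 train/test split. The sketch below renders those settings in scikit-learn; the documents and 'legal'/'illegal' labels are hypothetical placeholders, not the study's data.

```python
# Sketch of the discrimination step: document-term matrix -> SVM with
# gamma=0.5, C=10, and a 70/30 split, as described in the abstract.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

docs = [
    "quick loan no paperwork call now",      # hypothetical illegal ads
    "lowest rate private loan contact us",
    "easy money loan same day approval",
    "no credit check loan wire today",
    "bank announces new savings product",    # hypothetical legitimate texts
    "city council discusses budget plan",
    "library opens new reading room",
    "weather forecast says rain tomorrow",
]
labels = ["illegal"] * 4 + ["legal"] * 4

dtm = CountVectorizer().fit_transform(docs)  # document-term matrix
X_tr, X_te, y_tr, y_te = train_test_split(
    dtm, labels, test_size=0.3, stratify=labels, random_state=0)

clf = SVC(kernel="rbf", gamma=0.5, C=10).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```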