• Title/Summary/Keyword: Public dataset


An Intelligent Intrusion Detection Model Based on Support Vector Machines and the Classification Threshold Optimization for Considering the Asymmetric Error Cost (비대칭 오류비용을 고려한 분류기준값 최적화와 SVM에 기반한 지능형 침입탐지모형)

  • Lee, Hyeon-Uk;Ahn, Hyun-Chul
    • Journal of Intelligence and Information Systems / v.17 no.4 / pp.157-173 / 2011
  • With the recent surge in Internet use, malicious attacks and hacking against networked systems occur frequently, and such intrusions can cause fatal damage to the government agencies, public offices, and companies that operate these systems. Interest in and demand for intrusion detection systems (IDS), the security systems that detect, identify, and respond appropriately to unauthorized or abnormal activities, are therefore growing. The intrusion detection models used in conventional IDS are generally designed by modeling experts' implicit knowledge of network intrusions or hackers' abnormal behaviors. Such models perform well under normal conditions, but they perform poorly when they encounter new or unknown attack patterns. For this reason, several recent studies have adopted artificial intelligence techniques that can respond proactively to unknown threats. Artificial neural networks (ANNs) in particular have been widely applied in prior studies because of their superior prediction accuracy. However, ANNs have intrinsic limitations such as the risk of overfitting, the need for a large sample size, and the difficulty of understanding the prediction process (i.e., the black-box problem). As a result, the most recent studies on IDS have started to adopt the support vector machine (SVM), a classification technique that is more stable and powerful than ANNs and is known for its relatively high predictive power and generalization capability. Against this background, this study proposes a novel intelligent intrusion detection model that uses SVM as the classification model in order to improve the predictive ability of IDS. Our model is also designed to consider the asymmetric error cost by optimizing the classification threshold. There are two common types of error in intrusion detection. The first is the false-positive error (FPE), in which normal activity is wrongly judged to be an intrusion and may trigger unnecessary corrective action. The second is the false-negative error (FNE), in which malware is misjudged as a normal program. Compared with FPE, FNE is more fatal. Thus, when considering the total cost of misclassification in IDS, it is more reasonable to assign a heavier weight to FNE than to FPE. We therefore designed the proposed model to optimize the classification threshold so as to minimize the total misclassification cost. A conventional SVM cannot be used for this purpose because it generates a discrete output (i.e., a class). To resolve this problem, we used the revised SVM technique proposed by Platt (2000), which can generate probability estimates. To validate the practical applicability of our model, we applied it to a real-world dataset for network intrusion detection. The experimental dataset was collected from the IDS sensor of an official institution in Korea from January to June 2010. We collected 15,000 log records in total and selected 1,000 samples from them by random sampling. In addition, the SVM model was compared with logistic regression (LOGIT), decision trees (DT), and ANN to confirm the superiority of the proposed model. The LOGIT and DT experiments were run with PASW Statistics v18.0, and the ANN experiments with Neuroshell 4.0. For SVM, LIBSVM v2.90, a free software package for training SVM classifiers, was used. Empirical results showed that the proposed SVM-based model outperformed all the other comparative models in detecting network intrusions in terms of accuracy. They also showed that our model reduced the total misclassification cost compared with the ANN-based intrusion detection model. As a result, the intrusion detection model proposed in this paper is expected not only to enhance the performance of IDS but also to lead to better management of FNE.
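
The threshold optimization described in this abstract can be illustrated with a minimal sketch. It assumes the classifier already outputs a probability of intrusion for each sample (as a Platt-style calibrated SVM would); the cost weights, the sample probabilities, and the function names `total_cost` and `best_threshold` are hypothetical and are not taken from the paper.

```python
def total_cost(probs, labels, threshold, cost_fn=5.0, cost_fp=1.0):
    """Total misclassification cost when p >= threshold is flagged as an intrusion.

    cost_fn and cost_fp are illustrative weights (assumption): a missed
    intrusion (false negative) is charged more than a false alarm.
    """
    cost = 0.0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 0:
            cost += cost_fp      # false positive: normal traffic flagged
        elif predicted == 0 and y == 1:
            cost += cost_fn      # false negative: intrusion missed
    return cost


def best_threshold(probs, labels, cost_fn=5.0, cost_fp=1.0, steps=100):
    """Grid-search the threshold in [0, 1] that minimizes the total cost."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates,
               key=lambda t: total_cost(probs, labels, t, cost_fn, cost_fp))


# Made-up probability estimates and true labels (1 = intrusion).
probs = [0.05, 0.20, 0.35, 0.45, 0.60, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 1, 1]
best = best_threshold(probs, labels)
```

On this toy data the default cutoff of 0.5 misses one intrusion, whereas the cost-minimizing cutoff moves lower and accepts one false alarm instead. A grid search is only one way to pick the threshold; the abstract does not say which search procedure the authors used.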

Social Tagging-based Recommendation Platform for Patented Technology Transfer (특허의 기술이전 활성화를 위한 소셜 태깅기반 지적재산권 추천플랫폼)

  • Park, Yoon-Joo
    • Journal of Intelligence and Information Systems / v.21 no.3 / pp.53-77 / 2015
  • Korea has seen an increasing number of domestic patent applications, but the majority are not used to their full potential and end up becoming obsolete. According to the 2012 National Congress' Inspection of Administration, about 73% of the patents held by universities and publicly funded research institutions failed to create social value and remain latent. One of the main causes is that patent creators such as individual researchers, universities, and research institutions lack the ability to commercialize their patents into viable businesses together with the enterprises that need them. For enterprises, in turn, it is hard to find appropriate patents by keyword search every time. This study proposes a patent recommendation system that can identify and recommend intellectual property rights suited to users' fields of interest, among a rapidly accumulating number of patent assets, more easily and efficiently. The proposed system extracts core contents and technology sectors from the existing pool of patents and combines them with secondary social knowledge derived from user-created tag information in order to find the best patents to recommend. In the early stage, when no tag information has accumulated, recommendation relies on content characteristics identified by analyzing the keywords in attributes such as 'Title of Invention' and 'Claim'. To do this, the system extracts only nouns from the patents and, through TF-IDF analysis, assigns each noun a weight according to its importance across all patents. It then finds patents whose weights are similar to those of the patents a user prefers. In this paper, this similarity is called 'Domain Similarity'. Next, the system extracts technology-sector characteristics from each patent document by analyzing its International Patent Classification (IPC) codes. Every patent has one or more IPC codes, and each user can attach one or more tags to the patents they like. Thus, each user has a set of IPC codes drawn from the patents they have tagged. The system uses this IPC set to analyze each user's technology preference and find well-fitting patents. To do this, it calculates a 'Technology_Similarity' between the user's set of IPC codes and the IPC codes of all other patents. After that, once tag information from multiple users has accumulated, the system expands the recommendations by considering other users' social tag information on the patents tagged by the user concerned. The similarity between the tag information of a user's preferred patents and that of other patents is called 'Social Similarity' in this paper. Lastly, a 'Total Similarity' is calculated by adding these three similarities, and the patents with the highest 'Total Similarity' are recommended to each user. The system was applied to a total of 1,638 Korean patents obtained from the Korea Industrial Property Rights Information Service (KIPRIS), run by the Korea Intellectual Property Office. Because this original dataset does not include tag information, we created virtual tag information and used it to construct a semi-virtual dataset. The proposed recommendation algorithm was implemented in Java, and a prototype graphical user interface was also designed for this study. Because the proposed system has no dependent variable and uses virtual data, it is impossible to verify it with a statistical method. Therefore, the study uses a scenario test method to verify the operational feasibility and recommendation effectiveness of the system. The results of this study are expected to improve the possibility of matching promising patents with the most suitable businesses. It is assumed that users' experiential knowledge can be accumulated, managed, and utilized in the as-is patent system, which currently manages only standardized patent information.
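
The three similarities in this abstract can be sketched roughly as follows. The abstract names TF-IDF weighting and the final sum but does not state which similarity measures are used, so cosine similarity (for the noun weights) and Jaccard overlap (for IPC codes and tags) are assumptions here, as are all function names and the toy data.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of noun lists. Returns one {noun: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        size = len(doc) or 1  # guard against an empty document
        vectors.append({t: (c / size) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse weight vectors (assumed measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def jaccard(a, b):
    """Set overlap, used here for IPC code sets and tag sets (assumed measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def total_similarity(domain, technology, social):
    """The abstract states that the three similarities are added."""
    return domain + technology + social


# Toy nouns from three hypothetical patents.
docs = [["battery", "electrode", "lithium"],
        ["battery", "charger", "circuit"],
        ["antenna", "signal", "circuit"]]
vecs = tf_idf(docs)
domain = cosine(vecs[0], vecs[1])
technology = jaccard({"H01M", "H02J"}, {"H01M"})   # illustrative IPC codes
social = jaccard({"ev", "storage"}, {"storage"})    # illustrative user tags
score = total_similarity(domain, technology, social)
```

When no tags exist yet, the social term is simply zero and the score falls back on the content-based and IPC-based terms, which matches the early-stage behavior the abstract describes.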

Ready-to-eat Cereal Consumption Enhances Milk and Calcium Intake in Korean Population from 2001 Korean National Health and Nutrition Survey (한국인의 시리얼 섭취실태와 우유 및 칼슘섭취와의 관련성 연구 - 2001년도 국민건강영양조사 자료를 이용하여 -)

  • Chung, Chin-Eun
    • Journal of Nutrition and Health / v.39 no.8 / pp.786-794 / 2006
  • The purpose of this study was to establish an association between the consumption of ready-to-eat cereal (RTEC), milk, and calcium within the context of the most current population dietary practice in Korea. Inadequate calcium intake among Korean children and adults is an important public health concern. Milk is one of the best calcium sources because of its bioavailability, and RTEC is one of the foods commonly consumed with milk. The most recent Korean National Health and Nutrition Survey dataset (2001) was used as the source of data for this research. Subjects, excluding pregnant women, were categorized according to gender and age ($1{\sim}5,\;6{\sim}11,\;12{\sim}19,\;20{\sim}49,\;50+$ years) and then by consumption of RTEC and milk. SAS and SUDAAN were used for statistical analyses. Sample-weighted means, standard errors, and population percentages were calculated, and multiple regression models with adjustment for covariates were used to determine the predictability of total daily calcium intake from inclusion of RTEC and milk compared to the meal without RTEC and milk. RTEC was consumed by 2.4% of Korean people. Average calcium intake was 17 times greater when RTEC was consumed with milk than when RTEC was consumed without milk. Respondents who consumed RTEC with milk had significantly higher mean daily calcium and other nutrient intakes than respondents who consumed neither. In the multiple regression analysis, milk consumption with or without RTEC predicted total daily calcium intake after adjusting for age, income, and alcohol consumption (p<0.0001). The percentage of respondents below the estimated average requirement (EAR) level for calcium was lower for RTEC consumers than for RTEC non-consumers in all age-gender groups, with especially significant differences in children aged $1{\sim}5$, boys and girls aged $12{\sim}19$, men aged $20{\sim}49$, and women older than 50 years of age. RTEC consumption was not associated with intake in excess of the tolerable upper intake level (UL) for calcium. In conclusion, RTEC consumption was positively associated with both milk and calcium intakes in all age and gender groups in the Korean population.

Improvement of Model based on Inherent Optical Properties for Remote Sensing of Cyanobacterial Bloom (고유분광특성을 이용한 남조류 원격 추정 모델 개선)

  • Ha, Rim;Nam, Gibeom;Park, Sanghyun;Kang, Taegu;Shin, Hyunjoo;Kim, Kyunghyun;Rhew, Doughee;Lee, Hyuk
    • Korean Journal of Remote Sensing / v.33 no.2 / pp.111-123 / 2017
  • The phycocyanin pigment (PC) is a marker for cyanobacterial presence in eutrophic inland water. Accurate estimation of low PC concentration in turbid inland water is challenging due to the optical complexity, yet it is critical for issuing an early warning of the potential risks of cyanobacterial bloom to the public. To monitor cyanobacterial bloom in eutrophic inland waters, an approach is proposed to partition the non-water absorption coefficient from measured reflectance and to retrieve the absorption coefficient of PC, with the aim of improving the accuracy of remotely estimated PC, in particular at low concentrations. The proposed inversion model retrieves absorption spectra of PC ($a_{pc}({\lambda})$) with $R^2{\geq}0.8$ for $a_{pc}(620)$. For the same dataset, the algorithm achieved more accurate Chl-a and PC estimation, with $0.71{\leq}R^2{\leq}0.85$, relative root mean square error (rRMSE) ${\leq}39.4\%$, and mean relative error (RE) ${\leq}78.0\%$, than the widely used semi-empirical algorithm. In particular, low PC ($PC{\leq}50mg/m^3$) and low PC:Chl-a ratio values for all datasets used in this study were well predicted by the proposed algorithm.
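
For readers unfamiliar with the two error metrics quoted in this abstract, here is a small sketch using one common convention: rRMSE as the RMSE divided by the mean observed value, and RE as the mean of absolute errors relative to each observation, both in percent. The abstract does not give the formulas, so these definitions, the function names, and the sample values are assumptions.

```python
import math


def rrmse_percent(observed, estimated):
    """Relative RMSE: RMSE normalized by the mean observation (assumed convention)."""
    n = len(observed)
    rmse = math.sqrt(sum((e - o) ** 2 for o, e in zip(observed, estimated)) / n)
    return 100.0 * rmse / (sum(observed) / n)


def mean_relative_error_percent(observed, estimated):
    """Mean of |estimate - observation| / observation (assumed convention)."""
    n = len(observed)
    return 100.0 * sum(abs(e - o) / o for o, e in zip(observed, estimated)) / n


# Made-up PC concentrations in mg/m^3, for illustration only.
observed = [10.0, 20.0, 30.0]
estimated = [12.0, 18.0, 33.0]
rrmse = rrmse_percent(observed, estimated)
re = mean_relative_error_percent(observed, estimated)
```

Because RE divides by each individual observation, errors at low concentrations weigh heavily on it, which is one reason the low-PC range is hard to score well on.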

The Factors that Affects the Employment Type of The Graduates by Data-mining Approach (데이터마이닝 기법을 활용한 대졸자 고용에 미치는 영향요인 분석)

  • Kim, Hyoung-Rae;Jeon, Do-Hong
    • Journal of the Korea Society of Computer and Information / v.17 no.7 / pp.167-174 / 2012
  • Data mining techniques can be applied to employment information to discover valuable knowledge in large data. As employment issues such as joblessness among college graduates and the recruitment of women and older workers have become social problems, various public employment services and studies have made many efforts to address them. The factors that affect a college graduate's employment type (regular, temporary, daily) can be used to guide employment and to help college students prepare for it. When analyzing a large number of attributes and a huge amount of data, conventional statistical methods reach their limits; data mining is therefore more suitable for this dataset of about 170 attributes and 20,000 elements. We divide the factors that may affect employment type into personal, school, company, and experience factors, and use a decision tree algorithm to find interesting relationships between the attributes of these factors and employment type. Personal factors such as parents' income and marital status were the most influential factors for employment type. The learned decision tree classified employment type with 87% accuracy. We also assume that the level of the school affects the employment type of graduates.

Construction of a Full-length cDNA Library from Korean Stewartia (Stewartia koreana Nakai) and Characterization of EST Dataset (노각나무(Stewartia koreana Nakai)의 cDNA library 제작 및 EST 분석)

  • Im, Su-Bin;Kim, Joon-Ki;Choi, Young-In;Choi, Sun-Hee;Kwon, Hye-Jin;Song, Ho-Kyung;Lim, Yong-Pyo
    • Horticultural Science & Technology / v.29 no.2 / pp.116-122 / 2011
  • In this study, we report the generation and analysis of 1,392 expressed sequence tags (ESTs) from Korean Stewartia (Stewartia koreana Nakai). A cDNA library was generated from young leaf tissue, and a total of 1,392 cDNAs were partially sequenced. EST and unigene sequence quality was determined by computational filtering, manual review, and BLAST analyses. Finally, 1,301 ESTs were retained after removal of the vector sequence and filtering for a minimum length of 100 nucleotides. A total of 893 unigenes, consisting of 150 contigs and 743 singletons, were identified after assembly. We also identified 95 new microsatellite-containing sequences from the unigenes and classified their structure according to repeat unit. According to a homology search with BLASTX against the NCBI database, 65% of ESTs were homologous to sequences with known function, and 11.6% matched sequences with putative or unknown function. The remaining 23.2% of ESTs showed no significant similarity to any protein sequence in the public database. Annotation-based searches against multiple databases, including wine grape and Populus sequences, helped to identify putative functions of the ESTs and unigenes. Gene ontology (GO) classification showed that the most abundant GO terms were transport, nucleotide binding, and plastid in terms of biological process, molecular function, and cellular component, respectively. The sequence data will be used to characterize potential roles of new genes in Stewartia and will provide useful tools as a genetic resource.
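
The microsatellite screening and classification by repeat unit mentioned in this abstract can be sketched with a regular expression. The abstract does not say which tool or thresholds were used, so the motif length range (2 to 6 bases), the minimum repeat count, the function names, and the example sequence are all assumptions.

```python
import re
from collections import Counter


def find_microsatellites(seq, min_unit=2, max_unit=6, min_repeats=4):
    """Return (start, unit, repeat_count) for tandem repeats in a DNA sequence.

    The lazy quantifier makes the regex try the shortest repeat unit first,
    so ATATATAT is reported as AT x4 rather than ATAT x2.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        if len(set(unit)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAA
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits


def classify_by_unit_length(hits):
    """Count repeats by unit length (2 = di-, 3 = tri-nucleotide, and so on)."""
    return Counter(len(unit) for _, unit, _ in hits)


# Hypothetical unigene fragment containing one di- and one tri-nucleotide repeat.
seq = "GGCT" + "AG" * 5 + "TTC" + "CAG" * 4 + "GT"
hits = find_microsatellites(seq)
classes = classify_by_unit_length(hits)
```

Dedicated SSR finders apply different minimum repeat counts per unit length; a single `min_repeats` is used here only to keep the sketch short.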

Prediction Model of Real Estate ROI with the LSTM Model based on AI and Bigdata

  • Lee, Jeong-hyun;Kim, Hoo-bin;Shim, Gyo-eon
    • International journal of advanced smart convergence / v.11 no.1 / pp.19-27 / 2022
  • Across the world, 'housing' comprises a significant portion of wealth and assets. For this reason, fluctuations in real estate prices are highly sensitive issues to individual households. In Korea, housing prices have steadily increased over the years, and thus many Koreans view the real estate market as an effective channel for their investments. However, if one purchases a real estate property for the purpose of investing, then there are several risks involved when prices begin to fluctuate. The purpose of this study is to design a real estate price 'return rate' prediction model to help mitigate the risks involved with real estate investments and promote reasonable real estate purchases. Various approaches are explored to develop a model capable of predicting real estate prices based on an understanding of the immovability of the real estate market. This study employs the LSTM method, which is based on artificial intelligence and deep learning, to predict real estate prices and validate the model. LSTM networks are based on recurrent neural networks (RNN) but add cell states (which act as a type of conveyer belt) to the hidden states. LSTM networks are able to obtain cell states and hidden states in a recursive manner. Data on the actual trading prices of apartments in autonomous districts between January 2006 and December 2019 are collected from the Actual Trading Price Disclosure System of the Ministry of Land, Infrastructure and Transport (MOLIT). Additionally, basic data on apartments and commercial buildings are collected from the Public Data Portal and Seoul Metropolitan Government's data portal. The collected actual trading price data are scaled to monthly average trading amounts, and each data entry is pre-processed according to address to produce 168 data entries. 
An LSTM model for return rate prediction is prepared on a time series dataset in which the training period is April 2015 to August 2017 (29 months), the validation period is September 2017 to September 2018 (13 months), and the test period is December 2018 to December 2019 (13 months). The model achieved a prediction similarity level of almost 76%, which was confirmed after the time series data had been collected and the final prediction model prepared. All in all, the results demonstrate the reliability of the LSTM-based model for return rate prediction.
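
The role of the cell state described in this abstract can be made concrete with a single-unit LSTM step written in plain Python. This follows the standard LSTM gate equations, not the authors' implementation; the scalar weights, the price series, and the definition of the return rate as month-over-month relative change are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def monthly_returns(prices):
    """Month-over-month relative change (assumed definition of return rate)."""
    return [(cur - prev) / prev for prev, cur in zip(prices, prices[1:])]


def lstm_step(x, h_prev, c_prev, w):
    """One step of a single-unit LSTM.

    w maps each gate name to (input weight, recurrent weight, bias).
    """
    def gate(name, activation):
        wx, wh, b = w[name]
        return activation(wx * x + wh * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much of the cell state to expose
    c = f * c_prev + i * g             # cell state carried along the sequence
    h = o * math.tanh(c)               # hidden state passed to the next step
    return h, c


def run_lstm(sequence, w):
    """Feed a sequence through the cell, updating h and c recursively."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, w)
    return h, c


# Arbitrary illustrative weights and monthly average prices.
weights = {"forget": (0.5, 0.1, 1.0), "input": (0.6, 0.2, 0.0),
           "candidate": (0.9, 0.3, 0.0), "output": (0.4, 0.1, 0.0)}
prices = [100.0, 110.0, 99.0, 104.0]
h_last, c_last = run_lstm(monthly_returns(prices), weights)
```

The line `c = f * c_prev + i * g` is the "conveyor belt": the previous cell state passes through with only a multiplicative gate and an additive update, which is what lets information persist across many months.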

Implementation of AI-based Object Recognition Model for Improving Driving Safety of Electric Mobility Aids (전동 이동 보조기기 주행 안전성 향상을 위한 AI기반 객체 인식 모델의 구현)

  • Je-Seung Woo;Sun-Gi Hong;Jun-Mo Park
    • Journal of the Institute of Convergence Signal Processing / v.23 no.3 / pp.166-172 / 2022
  • In this study, we photograph driving obstacles such as crosswalks, side spheres, manholes, braille blocks, partial ramps, temporary safety barriers, stairs, and inclined curbs that hinder or inconvenience the movement of vulnerable people who use electric mobility aids. We develop an optimal AI model that classifies and automatically recognizes the photographed objects, and implement an algorithm that can efficiently identify obstacles in front of electric mobility aids. To enable the AI to learn object detection with high probability, objects are labeled as polygons when building the dataset. The model was developed using Mask R-CNN in the Detectron2 framework, which can detect objects labeled as polygons. Image acquisition was divided into two groups, the general public and the transportation vulnerable, and image information was obtained in two areas of the test bed. As for the parameter settings of Mask R-CNN training, the model trained with IMAGES_PER_BATCH: 2, BASE_LEARNING_RATE: 0.001, and MAX_ITERATION: 10,000 showed the highest performance at 68.532, so that the user can quickly and accurately recognize driving risks and obstacles.

Multi-Time Window Feature Extraction Technique for Anger Detection in Gait Data

  • Beom Kwon;Taegeun Oh
    • Journal of the Korea Society of Computer and Information / v.28 no.4 / pp.41-51 / 2023
  • In this paper, we propose a technique of multi-time window feature extraction for anger detection in gait data. In the previous gait-based emotion recognition methods, the pedestrian's stride, time taken for one stride, walking speed, and forward tilt angles of the neck and thorax are calculated. Then, minimum, mean, and maximum values are calculated for the entire interval to use them as features. However, each feature does not always change uniformly over the entire interval but sometimes changes locally. Therefore, we propose a multi-time window feature extraction technique that can extract both global and local features, from long-term to short-term. In addition, we also propose an ensemble model that consists of multiple classifiers. Each classifier is trained with features extracted from different multi-time windows. To verify the effectiveness of the proposed feature extraction technique and ensemble model, a public three-dimensional gait dataset was used. The simulation results demonstrate that the proposed ensemble model achieves the best performance compared to machine learning models trained with existing feature extraction techniques for four performance evaluation metrics.
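
The idea of extracting both global and local features can be sketched as follows: the same minimum, mean, and maximum statistics are computed over the whole interval and then over progressively shorter sub-windows. The abstract does not specify how the windows are chosen, so the halving scheme, the function names, and the sample series are assumptions.

```python
def window_stats(values):
    """Minimum, mean, and maximum of one window."""
    return [min(values), sum(values) / len(values), max(values)]


def multi_time_window_features(series, levels=(1, 2, 4)):
    """Split the series into 1, 2, 4, ... equal windows and collect the
    statistics of each window, from long-term (global) to short-term (local).

    Assumes len(series) >= max(levels) so that no window is empty.
    """
    n = len(series)
    features = []
    for k in levels:
        for j in range(k):
            start, end = j * n // k, (j + 1) * n // k
            features.extend(window_stats(series[start:end]))
    return features


# Hypothetical per-frame neck tilt angle with one brief local spike.
tilt = [1.0, 1.0, 1.0, 1.0, 3.0, 1.0, 1.0, 1.0]
features = multi_time_window_features(tilt, levels=(1, 2))
```

The global mean barely registers the spike, while the statistics of the second half-window do, which is the local change the abstract says whole-interval features can miss.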

A study on the aspect-based sentiment analysis of multilingual customer reviews (다국어 사용자 후기에 대한 속성기반 감성분석 연구)

  • Sungyoung Ji;Siyoon Lee;Daewoo Choi;Kee-Hoon Kang
    • The Korean Journal of Applied Statistics / v.36 no.6 / pp.515-528 / 2023
  • With the growth of the e-commerce market, consumers increasingly rely on user reviews to make purchasing decisions. Consequently, researchers are actively studying how to analyze these reviews effectively. Among the various methods of sentiment analysis, aspect-based sentiment analysis, which examines user reviews from multiple angles rather than relying solely on simple positive or negative sentiment, is gaining widespread attention. One methodology for aspect-based sentiment analysis uses transformer-based models, the latest natural language processing technology. In this paper, we conduct an aspect-based sentiment analysis of multilingual user reviews by applying these models to two real datasets. Specifically, we use restaurant data from the SemEval 2016 public dataset and multilingual user review data from the cosmetic domain. We compare the performance of transformer-based models for aspect-based sentiment analysis and apply various methodologies to improve their performance. Models using multilingual data are expected to be highly useful in that they can analyze multiple languages in one model without building separate models for each language.