• Title/Summary/Keyword: High Dimensionality Data

Distributed Processing System Design and Implementation for Feature Extraction from Large-Scale Malicious Code (대용량 악성코드의 특징 추출 가속화를 위한 분산 처리 시스템 설계 및 구현)

  • Lee, Hyunjong;Euh, Seongyul;Hwang, Doosung
    • KIPS Transactions on Computer and Communication Systems / v.8 no.2 / pp.35-40 / 2019
  • Traditional malware detection struggles to detect malware that has been modified by polymorphism or obfuscation techniques. By learning patterns embedded in malware code, machine learning algorithms can detect similar behaviors and replace current detection methods. Data must be collected continuously in order to learn malicious code patterns that change over time. However, storing and processing a large number of malware files incurs high space and time complexity. In this paper, an HDFS-based distributed processing system is designed to reduce space complexity and accelerate feature extraction. Using the distributed processing system, we extract two API-based features on a filtering basis, a 2-gram feature and an APICFG feature, and compare the generalization performance of ensemble learning models. In experiments, feature extraction was about 3.75 times faster than on a single computer, and space usage was about 5 times more efficient. The 2-gram feature gave the best classification performance among the features, but its learning time was long due to high dimensionality.
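
As a rough illustration of the 2-gram feature mentioned in the abstract above, the sketch below counts adjacent pairs of API calls in one sample's call sequence and maps them onto a fixed vocabulary. The function names, the example trace, and the vocabulary construction are assumptions made for illustration; the paper's filtering step and its HDFS-based pipeline are not reproduced here.

```python
from collections import Counter


def api_2grams(api_calls):
    """Count adjacent pairs (2-grams) in one sample's API call sequence."""
    return Counter(zip(api_calls, api_calls[1:]))


def to_vector(counts, vocabulary):
    """Map 2-gram counts onto a fixed vocabulary order (one dimension per 2-gram)."""
    return [counts.get(gram, 0) for gram in vocabulary]


# Hypothetical API call trace for a single sample (not taken from the paper).
trace = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]

counts = api_2grams(trace)
vocabulary = sorted(counts)             # in practice built over the whole corpus
vector = to_vector(counts, vocabulary)  # [1, 2, 1] for the trace above
```

With n distinct API names there can be up to n² distinct 2-grams, which is one way to see why this kind of feature becomes high-dimensional and slow to train on.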

Diagnosis by Rough Set and Information Theory in Reinforcing the Competencies of the Collegiate (러프집합과 정보이론을 이용한 대학생역량강화 진단)

  • Park, In-Kyoo
    • Journal of Digital Convergence / v.12 no.8 / pp.257-264 / 2014
  • This paper presents a core competencies diagnosis system for college students, intended to identify the core competencies that reinforce learning and employment capabilities. Because today's data exhibit a high level of redundancy and dimensionality, along with time complexity, they are more likely to contain spurious relationships, and even the weakest relationships will be highly significant under any statistical test. To address the measurement of uncertainty in the classification of categorical data and the implementation of an analytic system for it, an uncertainty measure based on rough entropy and information entropy is defined so that similar-behavior analysis can be carried out, and its clustering ability is demonstrated in comparison with a statistical approach. Because the students' acquired and necessary competencies, i.e., common core competencies and major core competencies, are deduced from the diagnosis results, they support not only college life and the reinforcement of employment capability but also the revitalization of employment and adjustment to college life.
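
The abstract above relies on information entropy as an uncertainty measure for categorical data. A minimal sketch of standard Shannon entropy for one categorical attribute follows; the function name and the sample values are assumptions, and the paper's rough entropy measure is not defined in the abstract, so it is not attempted here.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of a categorical attribute."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical survey responses for one competency item.
responses = ["high", "high", "low", "low"]
uncertainty = shannon_entropy(responses)  # 1.0 bit: two equally likely categories
```

An attribute whose values are all identical has zero entropy, so higher values indicate more uncertainty about the category of a randomly chosen record.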

Estimation of Spatial Distribution Using the Gaussian Mixture Model with Multivariate Geoscience Data (다변량 지구과학 데이터와 가우시안 혼합 모델을 이용한 공간 분포 추정)

  • Kim, Ho-Rim;Yu, Soonyoung;Yun, Seong-Taek;Kim, Kyoung-Ho;Lee, Goon-Taek;Lee, Jeong-Ho;Heo, Chul-Ho;Ryu, Dong-Woo
    • Economic and Environmental Geology / v.55 no.4 / pp.353-366 / 2022
  • Spatial estimation of geoscience data (geo-data) is challenging due to spatial heterogeneity, data scarcity, and high dimensionality. A novel spatial estimation method is needed that takes the characteristics of geo-data into account. In this study, we propose applying the Gaussian mixture model (GMM), a machine learning algorithm, to multivariate data for robust spatial prediction. The performance of the proposed approach was tested on soil chemical concentration data from a former smelting area. The concentrations of As and Pb determined by ex-situ ICP-AES were the primary variables to be interpolated, while the other metal concentrations from ICP-AES and all data determined by in-situ portable X-ray fluorescence (PXRF) were used as auxiliary variables in GMM and ordinary cokriging (OCK). Among the multidimensional auxiliary variables, important variables were selected using a variable selection method based on the random forest. GMM with the important multivariate auxiliary data decreased the root mean-squared error (RMSE) to 0.11 for As and 0.33 for Pb and increased the correlation (r) to 0.31 for As and 0.46 for Pb, compared to ordinary kriging and OCK using univariate or bivariate data. The use of GMM improved the performance of spatial interpretation of anthropogenic metals in soil. The multivariate spatial approach can be applied to understand complex and heterogeneous geological and geochemical features.
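
The two evaluation metrics reported in the abstract above, RMSE and the correlation r, can be computed as in the following sketch. It assumes r is the Pearson correlation coefficient, which the abstract does not state explicitly; the function names and sample numbers are illustrative and are not the study's data.

```python
import math


def rmse(observed, predicted):
    """Root mean-squared error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical held-out concentrations and model predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
error = rmse(observed, predicted)
correlation = pearson_r(observed, predicted)
```

A lower RMSE and a higher r both indicate predictions that track the held-out measurements more closely.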

Data Mining using Instance Selection in Artificial Neural Networks for Bankruptcy Prediction (기업부도예측을 위한 인공신경망 모형에서의 사례선택기법에 의한 데이터 마이닝)

  • Kim, Kyoung-jae
    • Journal of Intelligence and Information Systems / v.10 no.1 / pp.109-123 / 2004
  • Corporate financial distress and bankruptcy prediction is one of the major application areas of artificial neural networks (ANNs) in finance and management. ANNs have shown high prediction performance in this area, but they sometimes exhibit inconsistent and unpredictable performance on noisy data. In addition, when the amount of data is very large, it may not be possible to train an ANN, or to train it effectively, without data reduction, because training on a large data set requires much processing time and incurs additional data collection costs. Instance selection is one of the popular methods for dimensionality reduction and is directly related to data reduction. Although some researchers have addressed the need for instance selection in instance-based learning algorithms, there is little research on instance selection for ANNs. This study proposes a genetic algorithm (GA) approach to instance selection in ANNs for bankruptcy prediction. We use an ANN supported by the GA to optimize the connection weights between layers and to select relevant instances. The globally evolved weights are expected to mitigate the well-known limitations of the gradient descent used in the backpropagation algorithm. In addition, genetically selected instances should shorten the learning time and enhance prediction performance. This study compares the proposed model with other major data mining techniques. Experimental results show that the GA approach is a promising method for instance selection in ANNs.

A Node2Vec-Based Gene Expression Image Representation Method for Effectively Predicting Cancer Prognosis (암 예후를 효과적으로 예측하기 위한 Node2Vec 기반의 유전자 발현량 이미지 표현기법)

  • Choi, Jonghwan;Park, Sanghyun
    • KIPS Transactions on Software and Data Engineering / v.8 no.10 / pp.397-402 / 2019
  • Accurately predicting cancer prognosis to provide appropriate treatment strategies for patients is one of the critical challenges in bioinformatics. Many studies have proposed machine learning models that predict patients' outcomes from their gene expression data. Gene expression data are high-dimensional numerical data covering about 17,000 genes, so earlier studies used feature selection or dimensionality reduction to improve the performance of prognostic prediction models. These approaches, however, make it difficult for predictive models to capture biological interactions between the selected genes, because the feature selection and model training stages are performed independently. In this paper, we propose a novel two-dimensional image formatting approach for gene expression data that performs feature selection and prognostic prediction effectively. Node2Vec is used to integrate a biological interaction network with gene expression data, and a convolutional neural network learns the integrated two-dimensional gene expression images and predicts cancer prognosis. We evaluated the proposed model through double cross-validation and confirmed that its prognostic prediction accuracy is superior to that of traditional machine learning models based on raw gene expression data. Because the proposed approach can improve prediction models without the information loss caused by feature selection steps, we expect it to contribute to the development of personalized medicine.

A Study on the High Altitude Mountain Tourism Motivations and Constraints (고산지대 산악관광 동기와 제약요인에 대한 국제적 연구)

  • Lee, Seung-Koo;Sharma, Renuka
    • Korean Business Review / v.22 no.2 / pp.139-156 / 2009
  • Mountain tourism is regarded as an important form of inbound tourism worldwide. The Himalayan Mountains are home to the world's highest peaks, including over 100 mountains exceeding 8,500 meters. However, only limited research on visitors' constraints and motivations has been reported for high-altitude mountains. This research identifies some of the motivations and constraints involved in tourists' decision making about high-altitude mountains. The study was conducted in Korea, the Indian state of Sikkim, and Kathmandu, Nepal, because of their popularity as major destinations for mountain tourism. A set of 9 motives, 45 motivation items, and 40 constraints was initially generated from a review of research on visitor motivation and constraints. They were considered the most appropriate for measuring visitors' motivations and constraints in experiencing high-altitude mountain tourism. The validity of dimensionality and inter-correlation was evaluated by factor analysis, and analysis of the data revealed that the constraints of Korean visitors are significantly higher than those of Indian and other inbound tourists. Among the major constraints, structural constraints were recorded as higher for Indian, Korean, and other visitors. Similarly, the motives of different visitors varied significantly. The analysis also revealed that Korean visitors' motives for travelling were influenced by health and pleasure, whereas the motives of Indian and other visitors were mostly related to knowledge seeking and adventure. Environmental importance was given priority by visitors from all countries. The purposes of this study are (1) to identify the motives of visitors to high-altitude destinations, (2) to analyze the major motivation factors for high-altitude tourism, (3) to report the major constraints of visitors travelling to high altitudes, and (4) to study whether the strength of motivation helps to overcome the constraints.

Efficient Implementation of SVM-Based Speech/Music Classifier by Utilizing Temporal Locality (시간적 근접성 향상을 통한 효율적인 SVM 기반 음성/음악 분류기의 구현 방법)

  • Lim, Chung-Soo;Chang, Joon-Hyuk
    • Journal of the Institute of Electronics Engineers of Korea SP / v.49 no.2 / pp.149-156 / 2012
  • Support vector machines (SVMs) are well known for their pattern recognition capability, but care must be taken to reduce their inherent implementation cost, which results from high computational intensity and memory requirements, especially in embedded systems where only limited resources are available. Since the memory requirement, determined by the dimensionality and the number of support vectors, is generally too high for a cache in an embedded system to accommodate, main memory is inevitably accessed frequently, whenever the cache cannot provide the requested data to the processor. These frequent main memory accesses degrade overall performance and increase energy consumption, because a memory access typically takes longer and consumes more energy than a cache access or a register access. In this paper, we propose a technique that reduces the number of main memory accesses by optimizing the data access pattern of the SVM-based classifier so that the temporal locality of the accesses increases, fully utilizing data loaded into the processor chip. Through experiments, we confirm the improvement achieved by the proposed technique in terms of the number of memory accesses, overall execution time, and energy consumption.
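
One common way to raise temporal locality in an SVM decision function is to change the loop order so that each support vector, once fetched, is reused for several buffered input frames before the next one is loaded. The sketch below illustrates that general idea only; it is an assumption about what such a reordering can look like, not the authors' technique, and the RBF kernel, function names, and sample values are all hypothetical.

```python
import math


def rbf(u, v, gamma):
    """RBF kernel between two feature vectors (kernel choice is an assumption)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decide_frame_major(frames, support_vectors, coefs, bias, gamma):
    """Baseline order: every frame sweeps the whole support-vector set."""
    scores = []
    for x in frames:
        score = bias
        for sv, coef in zip(support_vectors, coefs):
            score += coef * rbf(sv, x, gamma)
        scores.append(score)
    return scores


def decide_sv_major(frames, support_vectors, coefs, bias, gamma):
    """Reordered: each support vector is fetched once and reused for all buffered frames."""
    scores = [bias] * len(frames)
    for sv, coef in zip(support_vectors, coefs):
        for i, x in enumerate(frames):
            scores[i] += coef * rbf(sv, x, gamma)
    return scores


# Hypothetical model and a small buffer of input frames.
support_vectors = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
coefs = [0.7, -0.4, 0.2]   # alpha_i * y_i for each support vector
frames = [[0.2, 0.9], [0.8, 0.1]]

baseline = decide_frame_major(frames, support_vectors, coefs, bias=-0.1, gamma=0.5)
reordered = decide_sv_major(frames, support_vectors, coefs, bias=-0.1, gamma=0.5)
```

Both orders compute the same scores; only the sequence of data accesses differs. Python itself will not expose the cache effect, so this shows the access pattern rather than the speedup, and buffering frames adds decision latency that a real-time classifier would have to budget for.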

A Study of The Determinants of Turnover Intention and Organizational Commitment by Data Mining (데이터마이닝을 활용한 이직의도와 조직몰입의 결정요인에 대한 연구)

  • Choi, Young Joon;Shim, Won Shul;Baek, Seung Hyun
    • Journal of the Korea Society for Simulation / v.23 no.1 / pp.21-31 / 2014
  • In this article, data mining simulation is applied to find a suitable approach to, and analysis results for, the study of organization-related variables. Turnover intention and organizational commitment are used as the target (dependent) variables in this simulation. Classification and regression trees (CART) with ensemble methods are used for the simulation. The data come from the Human Capital Corporate Panel of the Korea Research Institute for Vocational Education & Training (KRIVET), collected in 2005, 2007, and 2009. Organizational commitment variables are analyzed with combined measure variables, which are created after examining the reliability and unidimensionality of the multiple-item measures. The results of this study are as follows. First, the major determinants of turnover intention are trust, communication, and a talent-management-oriented trend. Second, the main determinants of organizational commitment are trust, the number of years worked, innovation, and communication. Two ensemble CART methods are used: CART with Bagging and CART with Arcing. Comparing the two, CART with Arcing (Arc-x4) extracted scenarios with very high coefficients of determination. In this study, a scenario with the maximum coefficient of determination and the minimum error is obtained using CART with an ensemble method, and practical implications are presented. Limitations and future research are also discussed.

Effects of Cosmetics Shopping Mall Attributes on Revisit Intentions of Total Mall and Specialty Mall at Internet (인터넷쇼핑몰 유형별 쇼핑몰속성이 화장품 쇼핑몰 재방문의도에 미치는 영향)

  • Park, Eun-Joo;Kim, Ji-Eun
    • Fashion & Textile Research Journal / v.12 no.1 / pp.38-45 / 2010
  • Cosmetics retailers would benefit from studies that examine which shopping-mall attributes can be manipulated to favorably affect consumer satisfaction and revisit intention on the Internet. The purposes of this study were (1) to examine the dimensionality of shopping-mall attributes for cosmetics retailers, (2) to determine which dimensions of shopping-mall attributes were significant predictors of consumer satisfaction and revisit intention, and (3) to identify the moderating effect of consumer satisfaction, through shopping-mall attributes, on revisit intention to buy cosmetics across types of Internet shopping mall (i.e., total mall and specialty mall). Data were collected from 209 online cosmetics shoppers among high school girls. Factor analysis identified five dimensions of Internet shopping-mall attributes: Convenience, Price, Loading speed, Sales promotion, and Service. Only two dimensions (i.e., convenience and service) were significant predictors of online shopper satisfaction in both the total mall and the specialty mall. The moderating effect of consumer satisfaction on revisit intention was significant in both mall types. For the total mall, price was a significant predictor of revisit intention through consumer satisfaction, while for the specialty mall, loading speed was a significant direct predictor of revisit intention. In light of the major findings, this study sets forth strategic implications for consumer satisfaction and revisit intention to buy cosmetics in the setting of electronic commerce.

Missing Value Estimation and Sensor Fault Identification using Multivariate Statistical Analysis (다변량 통계 분석을 이용한 결측 데이터의 예측과 센서이상 확인)

  • Lee, Changkyu;Lee, In-Beum
    • Korean Chemical Engineering Research / v.45 no.1 / pp.87-92 / 2007
  • Recently, the development of process monitoring systems to detect and diagnose process abnormalities has been in the spotlight in process systems engineering. Normal data obtained from processes provide useful information on process characteristics for modeling, monitoring, and control. Since modern chemical and environmental processes have high dimensionality, strong correlation, severe dynamics, and nonlinearity, it is not easy to analyze a process through a model-based approach. To overcome the limitations of the model-based approach, many system engineers and academic researchers have focused on statistical approaches combined with multivariate analysis, such as principal component analysis (PCA) and partial least squares (PLS). Several multivariate analysis methods have been modified for application to chemical processes with specific characteristics such as dynamics and nonlinearity. This paper discusses missing value estimation and sensor fault identification based on process variable reconstruction using dynamic PCA and canonical variate analysis.