• Title/Summary/Keyword: 빅데이터모델 (big data model)

A Study on Market Size Estimation Method by Product Group Using Word2Vec Algorithm (Word2Vec을 활용한 제품군별 시장규모 추정 방법에 관한 연구)

  • Jung, Ye Lim;Kim, Ji Hui;Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.1-21 / 2020
  • With the rapid development of artificial intelligence technology, various techniques have been developed to extract meaningful information from unstructured text data, which constitutes a large portion of big data. Over the past decades, text mining technologies have been used in various industries for practical applications. In business intelligence, text mining has been employed to discover new market and technology opportunities and to support rational decision making by business participants. Market information such as market size, market growth rate, and market share is essential for setting companies' business strategies. There has been continuous demand in various fields for market information at the level of specific products. However, such information has generally been provided only at the industry level or for broad categories based on classification standards, making it difficult to obtain specific and appropriate information. In this regard, we propose a new methodology that can estimate the market sizes of product groups at a more detailed level than previously available. We applied the Word2Vec algorithm, a neural network-based semantic word embedding model, to enable automatic market size estimation from individual companies' product information in a bottom-up manner. The overall process is as follows. First, data related to product information is collected, refined, and restructured into a form suitable for the Word2Vec model. Next, the preprocessed data is embedded into a vector space by Word2Vec, and product groups are derived by extracting similar product names based on cosine similarity. Finally, the sales data for the extracted products is summed to estimate the market size of each product group. As experimental data, product-name text from Statistics Korea's microdata (345,103 cases) was mapped into a multidimensional vector space by Word2Vec training.
We performed parameter optimization for training and then applied a vector dimension of 300 and a window size of 15 as the optimized parameters for further experiments. We employed the index words of the Korean Standard Industry Classification (KSIC) as a product name dataset to cluster product groups more efficiently. Product names similar to KSIC index words were extracted based on cosine similarity. The market size of the extracted products, treated as one product category, was calculated from individual companies' sales data. The market sizes of 11,654 specific product lines were automatically estimated by the proposed model. For performance verification, the results were compared with the actual market sizes of some items; the Pearson correlation coefficient was 0.513. Our approach has several advantages over previous studies. First, text mining and machine learning techniques were applied to market size estimation for the first time, overcoming the limitations of traditional methods that rely on sampling or require multiple assumptions. In addition, the level of market category can be adjusted easily and efficiently according to the purpose of use by changing the cosine similarity threshold. Furthermore, it has high potential for practical application since it can address unmet needs for detailed market size information in the public and private sectors. Specifically, it can be used in technology evaluation and technology commercialization support programs conducted by governmental institutions, as well as in business strategy consulting and market analysis reports published by private firms. The limitation of our study is that the presented model needs to be improved in terms of accuracy and reliability. The semantic word embedding module could be improved by imposing a proper order on the preprocessed dataset or by combining Word2Vec with another technique such as Jaccard similarity.
Also, the product group clustering method could be replaced with other types of unsupervised machine learning algorithms. Our group is currently working on follow-up studies, which we expect will further improve the performance of the basic model conceptually proposed in this study.
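
The bottom-up procedure described in this abstract (embed product names, collect those whose cosine similarity to an index word meets a threshold, then sum their sales) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy 3-dimensional vectors, the product names, the sales figures, and the 0.8 threshold are all hypothetical stand-ins for trained Word2Vec embeddings and the Statistics Korea microdata.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def estimate_market_size(index_vector, product_vectors, sales, threshold):
    """Return the product group close to the index word and its summed sales."""
    group = [
        name
        for name, vec in product_vectors.items()
        if cosine_similarity(index_vector, vec) >= threshold
    ]
    return group, sum(sales[name] for name in group)


# Hypothetical toy embeddings (real Word2Vec vectors would have e.g. 300 dimensions).
product_vectors = {
    "led_lamp": [0.9, 0.1, 0.0],
    "led_bulb": [0.8, 0.2, 0.1],
    "steel_pipe": [0.0, 0.1, 0.9],
}
# Hypothetical sales per product name, in arbitrary currency units.
sales = {"led_lamp": 120.0, "led_bulb": 80.0, "steel_pipe": 300.0}
# Hypothetical embedding of one classification index word.
index_vector = [1.0, 0.0, 0.0]

group, market_size = estimate_market_size(
    index_vector, product_vectors, sales, threshold=0.8
)
print(group, market_size)
```

Raising or lowering `threshold` narrows or widens the product group, which corresponds to the abstract's point that the level of market category can be adjusted through the cosine similarity threshold.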

Analyzing the Performance of the South Korean Men's National Football Team Using Social Network Analysis: Focusing on the Manager Bento's Matches (사회연결망분석을 활용한 한국 남자축구대표팀 경기성과 분석: 벤투 감독 경기를 중심으로)

  • Yeonsik Jung;Eunkyung Kang;Sung-Byung Yang
    • Knowledge Management Research / v.24 no.2 / pp.241-262 / 2023
  • In sports game analysis, the phenomena and records of matches are analyzed using advanced technologies and various scientific methods. Among these methods, social network analysis is actively employed to analyze pass networks. Because football is a representative sport in which the game unfolds through player interactions, efforts are being made to use social network analysis to provide insights into the game that were previously unattainable. Accordingly, this study analyzes the changes in the pass networks of a specific football team over time and compares them across different scenarios, including the nature of the game (Qatar World Cup games vs. A match games) and the strength of the opposing team (higher FIFA rankers vs. lower FIFA rankers). Specifically, we selected ten matches played by the Korean national football team after Coach Bento's appointment, extracted network indicators for these matches, and applied four indicators (efficiency, cohesion, vulnerability, and activity/leadership) from a football team performance evaluation model to the extracted data for analysis under different circumstances. The analysis of game performance over time revealed a significant increase in cohesion and a substantial decrease in vulnerability. In the comparison by the nature of the game, Qatar World Cup matches exhibited superior performance across all aspects of the evaluation model compared to A matches. Lastly, in the comparison by opposing team, matches against lower FIFA rankers displayed superior performance in all aspects of the evaluation model compared to matches against higher FIFA rankers.
We hope that the outcomes of this study can serve as essential foundational data for the selection of football team coaches and the development of game strategies, thereby contributing to the enhancement of the team's performance.

A Study of Life Safety Index Model based on AHP and Utilization of Service (AHP 기반의 생활안전지수 모델 및 서비스 활용방안 연구)

  • Oh, Hye-Su;Lee, Dong-Hoon;Jeong, Jong-Woon;Jang, Jae-Min;Yang, Sang-Woon
    • Journal of the Society of Disaster Information / v.17 no.4 / pp.864-881 / 2021
  • Purpose: This study aims to provide a total care solution for disaster prevention based on big data and AI technology, and to deliver safety services that take individual situations and various risk characteristics into account. To this end, it proposes a method for calculating a life safety index that quantitatively represents an individual's safety level in daily life, so that a customized comprehensive index service can help prevent and respond to safety accidents. Method: We use a method that combines the AHP (Analytic Hierarchy Process) and a Likert scale, drawn from a consensus formation model of an expert group. We organize the items for evaluating life safety prevention services into risk indicators, vulnerability indicators, and prevention indicators. We then construct an AHP hierarchy according to the AHP decision methodology and propose a method for calculating the relative weights of the evaluation criteria through pairwise comparison of the items at each level. In addition, in consideration of the future expansion of life safety prevention services, a Likert scale is used instead of AHP pairwise comparison to calculate the weights between individual services. Result: We obtained weights for the life safety prevention services and applied them to the individual risk indices calculated by the artificial intelligence prediction models of those services to compute the comprehensive index. Conclusion: To apply the implemented model, a test environment consisting of a life safety prevention service app and platform was built, and the efficacy of its functions was evaluated based on user scenarios. Through this, the life safety index presented in this study was confirmed to support the golden time for diagnosis, response, and prevention of safety risks by comprehensively indicating the user's current safety level.
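
As a rough illustration of how relative weights can be derived from an AHP pairwise comparison matrix, the sketch below uses the row geometric-mean method, a common approximation of the principal-eigenvector weights. The abstract does not say which weighting procedure the authors used, so the method choice is an assumption, and the comparison values are hypothetical; only the three criterion names come from the abstract.

```python
import math


def ahp_weights(matrix):
    """Approximate AHP priority weights with the row geometric-mean method.

    matrix[i][j] states how much more important criterion i is than
    criterion j; a reciprocal matrix has matrix[j][i] == 1 / matrix[i][j].
    """
    n = len(matrix)
    geometric_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geometric_means)
    return [g / total for g in geometric_means]


# Criterion names from the abstract; the judgments below are hypothetical.
criteria = ["risk", "vulnerability", "prevention"]
pairwise = [
    [1.0, 2.0, 4.0],   # risk compared with risk, vulnerability, prevention
    [0.5, 1.0, 2.0],   # vulnerability compared with the same three
    [0.25, 0.5, 1.0],  # prevention compared with the same three
]

weights = ahp_weights(pairwise)
for name, weight in zip(criteria, weights):
    print(f"{name}: {weight:.3f}")
```

Because this example matrix is perfectly consistent, the weights come out in the ratio 4:2:1. Real expert judgments are rarely consistent, and a full AHP analysis would normally also check a consistency ratio, which is omitted here.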

Customer Behavior Prediction of Binary Classification Model Using Unstructured Information and Convolution Neural Network: The Case of Online Storefront (비정형 정보와 CNN 기법을 활용한 이진 분류 모델의 고객 행태 예측: 전자상거래 사례를 중심으로)

  • Kim, Seungsoo;Kim, Jongwoo
    • Journal of Intelligence and Information Systems / v.24 no.2 / pp.221-241 / 2018
  • Deep learning has recently attracted considerable attention. The deep learning technique applied in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) competitions and in AlphaGo is the Convolutional Neural Network (CNN). A CNN divides the input image into small sections, recognizes partial features, and combines them to recognize the whole. Deep learning technologies are expected to bring many changes to our lives, but until now their applications have been largely limited to image recognition and natural language processing. The use of deep learning techniques for business problems is still at an early research stage. If their performance is proven, they can be applied to traditional business problems such as marketing response prediction, fraudulent transaction detection, and bankruptcy prediction. It is therefore a meaningful experiment to assess whether deep learning can solve business problems using the case of online shopping companies, which have big data, allow customer behavior to be identified relatively easily, and offer high utilization value. In online shopping in particular, the competitive environment is changing rapidly and becoming more intense, so analyzing customer behavior to maximize profit is becoming increasingly important. In this study, we propose a 'CNN model of Heterogeneous Information Integration' as a way to improve the prediction of customer behavior in online shopping enterprises.
The proposed model combines structured and unstructured information and learns with a convolutional neural network built on a multi-layer perceptron structure. To optimize its performance, we design three architectural components, 'heterogeneous information integration', 'unstructured information vector conversion', and 'multi-layer perceptron design', evaluate the performance of each, and confirm the proposed model based on the results. The target variables for predicting customer behavior are defined as six binary classification problems: re-purchaser, churn, frequent shopper, frequent refund shopper, high amount shopper, and high discount shopper. To verify the usefulness of the proposed model, we conducted experiments using actual transaction, customer, and VOC data from a specific online shopping company in Korea. The data cover 47,947 customers who registered at least one VOC in January 2011 (one month). The customer profiles of these customers, a total of 19 months of transaction data from September 2010 to March 2012, and the VOCs posted during that month are used. The experiment is divided into two stages. In the first stage, we evaluate the three architectures that affect the performance of the proposed model and select optimal parameters. In the second stage, we evaluate the performance of the proposed model. Experimental results show that the proposed model, which combines structured and unstructured information, outperforms NBC (Naïve Bayes classification), SVM (Support vector machine), and ANN (Artificial neural network). It is therefore significant that unstructured information contributes to predicting customer behavior, and that CNN can be applied to business problems as well as to image recognition and natural language processing problems.
The experiments confirm that CNN is more effective at understanding and interpreting the contextual meaning of text VOC data. It is also significant that this empirical study, based on the actual data of an e-commerce company, shows that meaningful information for predicting customer behavior can be extracted from VOC data written in text form directly by customers. Finally, the various experiments conducted here provide useful information for future research on parameter selection and its effect on performance.

Development of a complex failure prediction system using Hierarchical Attention Network (Hierarchical Attention Network를 이용한 복합 장애 발생 예측 시스템 개발)

  • Park, Youngchan;An, Sangjun;Kim, Mintae;Kim, Wooju
    • Journal of Intelligence and Information Systems / v.26 no.4 / pp.127-148 / 2020
  • A data center is a physical facility that houses computer systems and related components, and it is an essential foundation for next-generation core industries such as big data, smart factories, wearables, and smart homes. In particular, with the growth of cloud computing, a proportional expansion of data center infrastructure is inevitable. Monitoring the health of data center facilities is a way to maintain and manage the system and prevent failures. If a failure occurs in some element of the facility, it may affect not only that equipment but also other connected equipment, and may cause enormous damage. IT facilities in particular fail irregularly because of their interdependence, which makes the cause difficult to identify. Previous studies on failure prediction in data centers treated each server as a single, isolated state, without assuming that devices interact. In this study, data center failures were therefore classified into failures occurring inside the server (Outage A) and failures occurring outside the server (Outage B), and the analysis focused on complex failures occurring within servers. Failures external to the server include power, cooling, and user errors. Since such failures can be prevented in the early stages of data center construction, various solutions are being developed. In contrast, the cause of a failure occurring inside a server is difficult to determine, and adequate prevention has not yet been achieved. This is because server failures do not occur in isolation: a failure may cause failures in other servers or may itself be triggered by other servers. In other words, whereas existing studies analyzed failures on the assumption that a single server does not affect other servers, this study analyzes failures on the assumption that servers affect one another.
To define the complex failure situation in the data center, failure history data for each piece of equipment in the data center was used. Four major failures are considered in this study: Network Node Down, Server Down, Windows Activation Services Down, and Database Management System Service Down. The failures that occur on each device are sorted in chronological order, and when a failure occurs on one piece of equipment, a failure on another piece of equipment within 5 minutes of that time is defined as occurring simultaneously. After constructing sequences of the devices that failed at the same time, 5 devices that frequently failed together within those sequences were selected, and the cases in which the selected devices failed at the same time were confirmed through visualization. Because the server resource information collected for failure analysis is a time series with temporal flow, we used Long Short-term Memory (LSTM), a deep learning algorithm that can predict the next state from previous states. In addition, unlike the single-server case, a Hierarchical Attention Network deep learning structure was used, considering that each server contributes to a complex failure to a different degree. This algorithm increases prediction accuracy by giving a server more weight as its impact on the failure increases. The study began by defining the types of failure and selecting the analysis targets. In the first experiment, the same collected data was analyzed under a single-server assumption and under a multiple-server assumption, and the results were compared. The second experiment improved the prediction accuracy in the multiple-server case by optimizing the threshold for each server.
In the first experiment, under the single-server assumption, three of the five servers were predicted not to have failed even though failures had actually occurred. Under the multiple-server assumption, however, all five servers were predicted to have failed. These results support the hypothesis that servers affect one another. The study thus confirmed that prediction performance was superior under the multiple-server assumption compared with the single-server assumption. In particular, applying the Hierarchical Attention Network algorithm, which assumes that each server's effect differs, helped improve the analysis. In addition, applying a different threshold for each server improved the prediction accuracy. This study showed that failures whose cause is difficult to determine can be predicted from historical data, and it presents a model that can predict failures occurring in data center servers. It is expected that failures can be prevented in advance using the results of this study.
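
The 5-minute rule for defining simultaneous failures can be sketched as below. This is only one plausible reading of the abstract, not the authors' code: it assumes each window is anchored at the first failure in a group, and the device names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def group_simultaneous_failures(events, window=timedelta(minutes=5)):
    """Group (timestamp, device) failure events into co-failure sequences.

    Events are sorted chronologically. An event joins the current group if it
    occurs within `window` of that group's FIRST event (an assumption);
    otherwise it starts a new group.
    """
    groups = []
    for timestamp, device in sorted(events):
        if groups and timestamp - groups[-1][0][0] <= window:
            groups[-1].append((timestamp, device))
        else:
            groups.append([(timestamp, device)])
    return groups


# Hypothetical failure history, deliberately out of chronological order.
events = [
    (datetime(2020, 1, 1, 10, 4), "db-1"),
    (datetime(2020, 1, 1, 10, 0), "server-A"),
    (datetime(2020, 1, 1, 10, 20), "server-C"),
    (datetime(2020, 1, 1, 10, 3), "server-B"),
]

groups = group_simultaneous_failures(events)
device_sequences = [[device for _, device in group] for group in groups]
print(device_sequences)
```

Counting how often each device appears in multi-device sequences would then give a simple way to pick the devices that frequently fail together, in the spirit of the selection step described above.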

The effect of climate change on hydroelectric power generation of multipurpose dams according to SSP scenarios (SSP 시나리오에 따른 기후변화가 다목적댐 수력발전량에 미치는 영향 분석)

  • Wang, Sizhe;Kim, Jiyoung;Kim, Yongchan;Kim, Dongkyun;Kim, Tae-Woong
    • Journal of Korea Water Resources Association / v.57 no.7 / pp.481-491 / 2024
  • Recent droughts have reduced hydroelectric power generation (HPG). Climate change is expected to increase the frequency and intensity of droughts in the future, which will increase the uncertainty of HPG at multipurpose dams. It is therefore necessary to estimate HPG under climate change scenarios and to analyze the effect of drought on HPG. This study analyzed the future HPG of the Soyanggang Dam and Chungju Dam under the SSP2-4.5 and SSP5-8.5 scenarios. Regression equations for HPG were developed from past observations of power generation discharge and HPG provided by My Water, and future HPG was estimated under the SSP scenarios. The effect of drought on HPG was investigated based on drought severity calculated using the standardized precipitation index (SPI). Future SPIs were calculated using precipitation data from four GCMs (CanESM5, ACCESS-ESM1-5, INM-CM4-8, IPSL-CM6A) provided through the environmental big data platform. Overall, the results show that climate change has significant effects on HPG. For the Soyanggang Dam, HPG decreased under both the SSP2-4.5 and SSP5-8.5 scenarios: under SSP2-4.5 the CanESM5 model showed a 65% reduction in 2031, and under SSP5-8.5 the ACCESS-ESM1-5 model showed a 54% reduction in 2029. For the Chungju Dam, under both scenarios the average monthly HPG showed a decreasing trend relative to the reference period, except for the INM-CM4-8 model.
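
To make the regression step concrete, the sketch below fits a simple least-squares line relating power generation discharge to HPG and uses it to estimate HPG for a new discharge value. The abstract does not give the form of the regression equations, so the linear form is an assumption, and the discharge and generation numbers are hypothetical rather than My Water observations.

```python
def fit_linear(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: discharge and the HPG recorded with it.
discharge = [10.0, 20.0, 30.0, 40.0]
generation = [25.0, 45.0, 65.0, 85.0]

slope, intercept = fit_linear(discharge, generation)

# Estimate HPG for a hypothetical future discharge value.
future_discharge = 50.0
predicted_generation = slope * future_discharge + intercept
print(slope, intercept, predicted_generation)
```

In a scenario analysis, the future discharge values would come from the climate projections, and the fitted equation would be applied to each of them in turn.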

Exploring User Attitude to Information Privacy (개인정보 노출에 대한 인터넷 사용자의 태도에 관한 연구)

  • Baek, Seung Ik;Choi, Duk Sun
    • The Journal of Society for e-Business Studies / v.20 no.1 / pp.45-59 / 2015
  • As many companies have become interested in big data, they have invested considerable resources in obtaining more customer data. Some companies even try to trade the data illegally. To collect more customer data, companies offer customers various incentive programs. However, the results usually fall far short of expectations. This study explores the relative importance of the factors that influence customers' attitudes toward providing their personal information. It uses conjoint analysis to assess trade-offs among five influential factors: monetary reward, concern for data collection, concern for secondary use, concern for unauthorized use, and concern for errors. The study finds that customers' attitudes toward providing personal information are most influenced by concern for secondary use. Furthermore, it shows that the relative importance of these factors differs between light and heavy internet users. Monetary rewards appeal more to heavy internet users than to light internet users.
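
In conjoint analysis, the relative importance of a factor is commonly computed as the range of its part-worth utilities divided by the sum of the ranges over all factors. The sketch below applies that standard formula to the five factors named in the abstract. The part-worth values and the two-level design are hypothetical and are not the study's estimates.

```python
def relative_importance(part_worths):
    """Relative importance per attribute: utility range / sum of all ranges."""
    ranges = {
        attribute: max(levels.values()) - min(levels.values())
        for attribute, levels in part_worths.items()
    }
    total = sum(ranges.values())
    return {attribute: r / total for attribute, r in ranges.items()}


# Hypothetical part-worth utilities for two levels of each factor.
part_worths = {
    "monetary_reward": {"low": -0.5, "high": 0.5},
    "concern_data_collection": {"low": 0.25, "high": -0.25},
    "concern_secondary_use": {"low": 1.0, "high": -1.0},
    "concern_unauthorized_use": {"low": 0.5, "high": -0.5},
    "concern_errors": {"low": 0.25, "high": -0.25},
}

importance = relative_importance(part_worths)
most_influential = max(importance, key=importance.get)
print(importance, most_influential)
```

Running the same calculation separately on part-worths estimated for light and heavy internet users would be one way to compare the two groups.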

Retrieval of Land Surface Temperature Using Landsat 8 Images with Deep Neural Networks (Landsat 8 영상을 이용한 심층신경망 기반의 지표면온도 산출)

  • Kim, Seoyeon;Lee, Soo-Jin;Lee, Yang-Won
    • Korean Journal of Remote Sensing / v.36 no.3 / pp.487-501 / 2020
  • As a viable option for retrieving LST (Land Surface Temperature), this paper presents a DNN (Deep Neural Network)-based approach using 148 Landsat 8 images of South Korea. Because the brightness temperature and emissivity for band 10 (approx. 11-㎛ wavelength) of Landsat 8 are derived by combining physics-based equations and empirical coefficients, they include uncertainties that depend on regional conditions such as meteorology, climate, topography, and vegetation. To overcome this, we used several land surface variables such as NDVI (Normalized Difference Vegetation Index), land cover types, and topographic factors (elevation, slope, aspect, and ruggedness), as well as the T0 calculated from the brightness temperature and emissivity. We optimized four seasonal DNN models using these input variables and in-situ observations from ASOS (Automated Synoptic Observing System) to retrieve the LST, which is an advance over the existing method of bias correction using a linear equation. The validation statistics from 1,728 matchups during 2013-2019 showed good performance, with CC=0.910~0.917 and RMSE=3.245~3.365℃, especially for spring and fall. Our DNN models also produced a stable LST for all land cover types. Future work using big data from Landsat 5/7/8 with additional land surface variables will be necessary for more reliable retrieval of LST from high-resolution satellite images.

A step-by-step service encryption model based on routing pattern in case of IP spoofing attacks on clustering environment (클러스터링 환경에 대한 IP 스푸핑 공격 발생시 라우팅 패턴에 기반한 단계별 서비스 암호화 모델)

  • Baek, Yong-Jin;Jeong, Won-Chang;Hong, Suk-Won;Park, Jae-Hung
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.10 no.6 / pp.580-586 / 2017
  • Establishing a big data service environment requires both cloud-based network technology and clustering technology to improve the efficiency of information access. These cloud-based networks and clustering environments can provide a variety of valuable information in real time, which makes them a prime target for attackers attempting illegal access. In particular, attackers attempting IP spoofing can analyze information about the mutually trusted hosts that make up a cluster and attempt to attack systems inside the cluster directly. It is therefore necessary to detect and respond to illegal attacks quickly, and a stronger security policy is required than that of security systems built and operated for an existing single system. In this paper, we investigate changes in routing patterns and use them as detection information to enable active response and efficient information service when illegal attacks occur in this network environment. In addition, step-by-step encryption based on the routing information generated during detection makes it possible to manage service information stably, without frequently disconnecting the information service for resetting.

Evaluation of Applicability of RGB Image Using Support Vector Machine Regression for Estimation of Leaf Chlorophyll Content of Onion and Garlic (양파 마늘의 잎 엽록소 함량 추정을 위한 SVM 회귀 활용 RGB 영상 적용성 평가)

  • Lee, Dong-ho;Jeong, Chan-hee;Go, Seung-hwan;Park, Jong-hwa
    • Korean Journal of Remote Sensing / v.37 no.6_1 / pp.1669-1683 / 2021
  • AI-based intelligent agriculture and digital agriculture are important for agricultural science. Leaf chlorophyll content (LCC) is one of the most important indicators of the growth status of vegetable crops. In this study, support vector machine (SVM) regression models were built for onions and garlic using an unmanned aerial vehicle-based RGB camera and a multispectral (MSP) sensor, and the applicability of the RGB camera to LCC estimation was reviewed by comparing it with the MSP sensor. The RGB-based LCC model showed lower results than the MSP-based LCC model with an average R2 of 0.09, RMSE 18.66, and nRMSE 3.46%. However, the difference in accuracy between the two sensors was not large, and the accuracy did not drop significantly when compared with previous studies using various sensors and algorithms. In addition, the RGB-based LCC model reflects the field LCC trend well when compared with actual measured values, but it tends to underestimate at high chlorophyll concentrations. Considering the economic feasibility and versatility of the RGB camera, the applicability of LCC estimation with RGB was confirmed. The results of this study are expected to be useful in digital agriculture as an AI-based intelligent agriculture technology that applies the convergence of artificial intelligence and big data.