• Title/Abstract/Keywords: digital tree

Search Result 402, Processing Time 0.022 seconds

Clickstream Big Data Mining for Demographics based Digital Marketing (인구통계특성 기반 디지털 마케팅을 위한 클릭스트림 빅데이터 마이닝)

  • Park, Jiae;Cho, Yoonho
    • Journal of Intelligence and Information Systems / v.22 no.3 / pp.143-163 / 2016
  • The demographics of Internet users are the most basic and important source for target marketing and personalized advertising on digital marketing channels, which include email, mobile, and social media. However, it has gradually become difficult to collect the demographics of Internet users because their activities are anonymous in many cases. Although a marketing department can obtain demographics through online or offline surveys, these approaches are expensive, slow, and likely to include false statements. Clickstream data is the record an Internet user leaves behind while visiting websites. As the user clicks anywhere on a webpage, the activity is logged in semi-structured website log files. Such data shows what pages users visited, how long they stayed, how often and when they usually visited, which sites they prefer, what keywords they used to find the site, whether they purchased anything, and so forth. For this reason, some researchers have tried to infer the demographics of Internet users from their clickstream data. They derived various independent variables likely to be correlated with the demographics. The variables include search keywords, frequency and intensity by time, day, and month, the variety of websites visited, text information from the web pages visited, etc. The demographic attributes to predict also vary from paper to paper and cover gender, age, job, location, income, education, marital status, and presence of children. A variety of data mining methods, such as LSA, SVM, decision tree, neural network, logistic regression, and k-nearest neighbors, have been used to build prediction models. However, this research has not yet identified which data mining method is appropriate for predicting each demographic variable. Moreover, the independent variables studied so far need to be reviewed, combined as needed, and evaluated in order to build the best prediction model.
The objective of this study is to choose the clickstream attributes most likely to be correlated with the demographics from the results of previous research, and then to identify which data mining method is best suited to predicting each demographic attribute. Among the demographic attributes, this paper focuses on predicting gender, age, marital status, residence, and job. From the results of previous research, 64 clickstream attributes are applied to predict these demographic attributes. The overall process of predictive model building is composed of four steps. In the first step, we create user profiles that include 64 clickstream attributes and 5 demographic attributes. The second step performs dimension reduction of the clickstream variables to address the curse of dimensionality and the overfitting problem. We use three approaches, based on decision tree, PCA, and cluster analysis. In the third step, we build alternative predictive models for each demographic variable, using SVM, neural network, and logistic regression. The last step evaluates the alternative models in terms of model accuracy and selects the best model. For the experiments, we used clickstream data representing 5 demographics and 16,962,705 online activities for 5,000 Internet users. IBM SPSS Modeler 17.0 was used for the prediction process, and 5-fold cross validation was conducted to enhance the reliability of the experiments. The experimental results verify that there is a specific data mining method well suited to each demographic variable. For example, age prediction performs best with decision tree based dimension reduction and a neural network, whereas the prediction of gender and marital status is most accurate when SVM is applied without dimension reduction.
We conclude that the online behaviors of Internet users, captured through clickstream data analysis, can be used effectively to predict their demographics and can thus be applied to digital marketing.
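
The 5-fold cross validation mentioned in this abstract can be sketched in a few lines. This is a minimal illustration in plain Python, not the procedure run inside IBM SPSS Modeler; the function names and the `evaluate` callback are hypothetical and introduced only for this sketch.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Split range(n_samples) into k shuffled folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, evaluate, k=5, seed=0):
    """Average a score over k train/test splits.

    `evaluate(train_idx, test_idx)` is a hypothetical callback that fits a
    model on the training indices and returns its accuracy on the test ones.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except fold i is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k
```

With 5,000 users and k = 5, each model is trained on 4,000 profiles and tested on the remaining 1,000, and the reported accuracy is the mean over the five held-out folds.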

A Hierarchical Construction of Peer-to-Peer Systems Based on Super-Peer Networks (Super-Peer 네트워크에 기반을 둔 Peer-to-Peer 시스템의 계층적 구성)

  • Chung, Won-Ho
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.16 no.6 / pp.65-73 / 2016
  • Peer-to-Peer (P2P) systems with super-peer overlay networks combine the advantages of both hybrid and pure P2P systems. A super-peer is a special peer acting as a server to a cluster of generic peers. Organizing the super-peer network is one of the important issues for such systems. Conventional P2P systems are based on two-level hierarchies of peers: one layer for generic peers and the other for super-peers. Super-peer networks usually take the form of random graphs. However, to accommodate a large-scale collection of generic peers, the super-peer network also has to be extended. In this paper, we propose a scheme for hierarchically constructing super-peer networks for large-scale P2P systems. First, a two-level tree, called a simple super-peer network, is proposed. The simple super-peer network has several good features, but because its number of levels is fixed, it may have a scalability problem. It is therefore generalized and extended to a k-level tree of super-peers, called an extended super-peer network. The extended network shows good scalability and easy management of generic peers for large-scale P2P systems.
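
The scalability argument for adding levels can be illustrated with a small capacity calculation. The sketch below assumes a complete tree in which every super-peer has the same fan-out and the lowest level of super-peers serves the generic peers. The uniform fan-out, the counting convention, and the function names are assumptions made for illustration and are not taken from the paper.

```python
def capacity(levels, fanout):
    """Generic peers served by a complete tree with `levels` super-peer levels.

    Assumption: each super-peer has exactly `fanout` children, and the children
    of the lowest super-peer level are the generic peers.
    """
    if levels < 1 or fanout < 2:
        raise ValueError("need at least one level and a fan-out of at least 2")
    return fanout ** levels


def min_levels(n_generic, fanout):
    """Smallest number of super-peer levels whose capacity covers n_generic."""
    levels = 1
    while capacity(levels, fanout) < n_generic:
        levels += 1
    return levels
```

Under these assumptions a fixed two-level tree with fan-out 10 tops out at 100 generic peers, while a k-level tree grows its capacity by a factor of the fan-out with each added level.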

New VLSI Architecture of Parallel Multiplier-Accumulator Based on Radix-2 Modified Booth Algorithm (Radix-2 MBA 기반 병렬 MAC의 VLSI 구조)

  • Seo, Young-Ho;Kim, Dong-Wook
    • Journal of the Institute of Electronics Engineers of Korea SD / v.45 no.4 / pp.94-104 / 2008
  • In this paper, we propose a new architecture of multiplier-and-accumulator (MAC) for high-speed multiplication and accumulation arithmetic. Performance was improved by combining multiplication with accumulation and devising a hybrid type of carry save adder (CSA). Because the accumulator, which has the largest delay in the MAC, was removed and its function was merged into the CSA, overall performance is improved. The proposed CSA tree uses the 1's complement-based radix-2 modified Booth algorithm (MBA) and has a modified array for sign extension in order to increase the bit density of the operands. The CSA propagates the carries through the least significant bits of the partial products and generates the least significant bits in advance to decrease the number of input bits of the final adder. Also, the proposed MAC accumulates the intermediate results in the form of sum and carry bits rather than as the output of the final adder, which improves performance by optimizing the efficiency of the pipeline scheme. The proposed architecture was designed and then synthesized with $250{\mu}m,\;180{\mu}m,\;130{\mu}m$ and 90nm standard CMOS libraries. We analyzed results such as hardware resources, delay, and pipelining based on theoretical and experimental estimation. We used Sakurai's alpha power law for the delay modeling. The proposed MAC has superior properties to the standard design in many ways, and its performance is twice that of the previous research at a similar clock frequency.
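
The idea of accumulating in sum-and-carry form and deferring the carry-propagate addition can be shown with a bit-level sketch. This is a software illustration of a generic 3:2 carry-save step on non-negative integers, not the proposed hardware. The Booth-encoded partial-product tree is replaced here by Python's built-in multiplication, and the function names are hypothetical.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: reduce three non-negative ints to a (sum, carry) pair.

    The pair satisfies a + b + c == sum + carry, and no carry ripples
    between bit positions while it is computed.
    """
    partial_sum = a ^ b ^ c                      # per-bit sum without carries
    carry = ((a & b) | (b & c) | (a & c)) << 1   # per-bit majority, shifted left
    return partial_sum, carry


def mac_carry_save(pairs):
    """Accumulate the sum of x*y over pairs, keeping the running total as (sum, carry)."""
    s, c = 0, 0
    for x, y in pairs:
        # x * y stands in for the output of the multiplier's partial-product tree.
        s, c = carry_save_add(s, c, x * y)
    # A single carry-propagate addition resolves the result at the very end.
    return s + c
```

The slow carry-propagate addition runs once, after the last term, instead of once per accumulation step.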

Development of a Korean Speech Recognition Platform (ECHOS) (한국어 음성인식 플랫폼 (ECHOS) 개발)

  • Kwon Oh-Wook;Kwon Sukbong;Jang Gyucheol;Yun Sungrack;Kim Yong-Rae;Jang Kwang-Dong;Kim Hoi-Rin;Yoo Changdong;Kim Bong-Wan;Lee Yong-Ju
    • The Journal of the Acoustical Society of Korea / v.24 no.8 / pp.498-504 / 2005
  • We introduce a Korean speech recognition platform (ECHOS) developed for education and research purposes. ECHOS lowers the entry barrier to speech recognition research and can be used as a reference engine by providing elementary speech recognition modules. It has a simple object-oriented architecture, implemented in the C++ language with the standard template library. The input of ECHOS is digital speech data sampled at 8 or 16 kHz. Its output is the 1-best recognition result, N-best recognition results, and a word graph. The recognition engine is composed of MFCC/PLP feature extraction, HMM-based acoustic modeling, n-gram language modeling, and finite state network (FSN)- and lexical tree-based search algorithms. It can handle various tasks from isolated word recognition to large vocabulary continuous speech recognition. We compare the performance of ECHOS and the hidden Markov model toolkit (HTK) for validation. In an FSN-based task, ECHOS shows similar word accuracy, while the recognition time is doubled because of the object-oriented implementation. For an 8000-word continuous speech recognition task, using a lexical tree search algorithm different from the algorithm used in HTK, it increases the word error rate by 40% in relative terms but halves the recognition time.
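
Word error rate and a relative change in it are easy to confuse, so a short sketch may help. It uses the standard definition of word error rate as the word-level edit distance divided by the reference length. The function names are hypothetical, and the code is not part of ECHOS or HTK.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def relative_increase(baseline, new):
    """Relative change of `new` with respect to `baseline` (0.4 means +40%)."""
    return (new - baseline) / baseline
```

As a purely hypothetical illustration, a baseline error rate of 10% rising to 14% is a 40% relative increase, although the absolute difference is only 4 percentage points.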

Variation of Seasonal Groundwater Recharge Analyzed Using Landsat-8 OLI Data and a CART Algorithm (CART알고리즘과 Landsat-8 위성영상 분석을 통한 계절별 지하수함양량 변화)

  • Park, Seunghyuk;Jeong, Gyo-Cheol
    • The Journal of Engineering Geology / v.31 no.3 / pp.395-432 / 2021
  • Groundwater recharge rates vary widely by location and with time. They are difficult to measure directly and are thus often estimated using simulations. This study employed frequency and regression analysis and a classification and regression tree (CART) algorithm, a machine learning method, to estimate groundwater recharge. The CART algorithm considered the distribution of precipitation by subbasin (PCP), geomorphological data, indices of the relationship between vegetation and land use, and soil type. The geomorphological data considered were the digital elevation model (DEM), surface slope (SLOP), and surface aspect (ASPT), and the indices were the perpendicular vegetation index (PVI), normalized difference vegetation index (NDVI), normalized difference tillage index (NDTI), and normalized difference residue index (NDRI). The spatio-temporal distribution of groundwater recharge from the SWAT-MODFLOW program was classified into four groups. In R, the data were randomly sampled, a model was trained, and groundwater recharge was predicted by CART considering the modified PVI, NDVI, NDTI, NDRI, PCP, and geomorphological data. To assess inter-rater reliability for the four-group groundwater recharge classification, the Kappa coefficient, overall accuracy, and confusion matrix were calculated using K-fold cross-validation. The model obtained a Kappa coefficient of 0.3-0.6 and an overall accuracy of 0.5-0.7, indicating that the proposed model for estimating groundwater recharge with respect to soil type and vegetation cover is quite reliable.
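
The two agreement measures reported here can both be read off a confusion matrix. The sketch below uses the standard definitions of overall accuracy and Cohen's Kappa. It assumes a square matrix with actual classes in rows and predicted classes in columns, and the function names are hypothetical.

```python
def overall_accuracy(matrix):
    """Share of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def cohen_kappa(matrix):
    """Cohen's Kappa: agreement beyond what the class totals alone would give."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    p_observed = overall_accuracy(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    # Chance agreement: product of matching row and column shares, summed.
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)
```

Kappa discounts the agreement expected by chance, which is why it is lower than overall accuracy whenever chance agreement is above zero and the classifier is imperfect.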

Changes and determinants affecting on geographic variations in health behavior, prevalence of hypertension and diabetes in Korean (지역사회 건강행태, 고혈압, 당뇨병 유병률 변화와 변이 요인)

  • Kim, Yoo-Mi;Kang, Sung-Hong
    • Journal of Digital Convergence / v.13 no.11 / pp.241-254 / 2015
  • This study examined changes in health behavior and in the prevalence of hypertension and diabetes over five years and analyzed the determinants of their geographic variation. Data from the Korean Community Health Survey for 2008 and 2013, covering 246 small districts, were analyzed using convergence tools such as a geographic information system tool and a decision tree. During the five-year period, smoking and drinking increased in the southwest regions, and physical activity increased in the western regions. The prevalence of hypertension increased in the west and south regions, and the prevalence of diabetes increased in the east and north regions. The determinants of regional variation in the prevalence of hypertension and diabetes were drinking, physical activity, obesity, arthritis, depressive symptoms, and stress. Mental health programs should be developed for non-communicable diseases. Thus, to decrease the prevalence of hypertension and diabetes, our study emphasizes the need to develop customized mental health policies according to region-specific characteristics.

The Comparison of Risk-adjusted Mortality Rate between Korea and United States (한국과 미국 의료기관의 중증도 보정 사망률 비교)

  • Chung, Tae-Kyoung;Kang, Sung-Hong
    • Journal of Digital Convergence / v.11 no.5 / pp.371-384 / 2013
  • The purpose of this study was to develop a risk-adjusted mortality model using Korean Hospital Discharge Injury data and US National Hospital Discharge Survey data, and to suggest ways to manage hospital mortality rates by comparing the Hospital Standardized Mortality Ratios (HSMR) of Korea and the United States. This study used the data mining techniques of decision tree and logistic regression to develop risk-adjustment models of in-hospital mortality for Korea and the United States. Comparing HSMR with standardized variables shows concrete differences between the two countries. While the Korean HSMR increased every year (101.0 in 2006, 101.3 in 2007, 103.3 in 2008), the HSMR in the United States declined (102.3 in 2006, 100.7 in 2007, 95.9 in 2008). Korean HSMRs by number of hospital beds were higher than those of the United States. A two-level approach to managing hospital mortality rates is suggested, at the national level and the hospital level. The government should release the HSMR of large hospitals and offer consulting on effective hospital mortality management to small and medium hospitals.
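
For readers unfamiliar with the measure, HSMR is conventionally computed as observed deaths divided by expected deaths, multiplied by 100, where expected deaths is the sum of the per-patient death probabilities from a risk-adjustment model. The sketch below follows that conventional definition. The function name and inputs are hypothetical, and the study's own model is not reproduced here.

```python
def hsmr(observed_deaths, predicted_risks):
    """Hospital Standardized Mortality Ratio: 100 * observed / expected deaths.

    `predicted_risks` holds one model-predicted death probability per
    discharged patient; their sum is the expected number of deaths.
    """
    expected_deaths = sum(predicted_risks)
    if expected_deaths <= 0:
        raise ValueError("expected deaths must be positive")
    return 100.0 * observed_deaths / expected_deaths
```

A value above 100 means more deaths occurred than the risk model expected for that patient mix, and a value below 100 means fewer.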

The big data method for flash flood warning (돌발홍수 예보를 위한 빅데이터 분석방법)

  • Park, Dain;Yoon, Sanghoo
    • Journal of Digital Convergence / v.15 no.11 / pp.245-250 / 2017
  • A flash flood is defined as flooding caused by intense rainfall over a relatively small area, in which water flows rapidly through rivers and valleys in a short time with no advance warning. It can therefore cause property damage and casualties. This study aims to establish a flash flood warning system using 38 accident records reported by the National Disaster Information Center and the Land Surface Model (TOPLATS) between 2009 and 2012. Three variables from the Land Surface Model were used: precipitation, soil moisture, and surface runoff. The three variables for the 6 hours preceding each flash flood were reduced to 3 factors through factor analysis. Decision tree, random forest, Naive Bayes, support vector machine, and logistic regression models were considered as big data methods. Prediction performance was evaluated by comparing Accuracy, Kappa, TP Rate, FP Rate, and F-Measure. The best method was suggested based on a reproducibility evaluation at each point of flash flood occurrence and on predicted versus actual counts using the 4 years of data.
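
Most of the evaluation measures named above follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard definitions. It assumes that every denominator is non-zero, and the function name and dictionary keys are hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, TP rate, FP rate and F-measure from binary confusion counts.

    Assumes at least one actual positive, one actual negative and one
    predicted positive, so that no denominator is zero.
    """
    tp_rate = tp / (tp + fn)      # recall: share of real events that were flagged
    fp_rate = fp / (fp + tn)      # share of non-events that raised a false alarm
    precision = tp / (tp + fp)    # share of warnings that were real events
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "f_measure": 2 * precision * tp_rate / (precision + tp_rate),
    }
```

When events are rare, accuracy alone can look high even if most events are missed, which is one reason to report TP rate, FP rate, and F-Measure alongside it.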

Accuracy Evaluation of LiDAR Measurement in Forest Area (산림지역에서 LiDAR 측량의 정확도 평가)

  • Lee, Sang-Hoon;Lee, Byoung-Kil;Kim, Jin-Kwang;Kim, Chang-Jae
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / v.27 no.5 / pp.545-553 / 2009
  • Digital Elevation Models (DEMs) are widely used to establish topographic profiles in national spatial information. The aerial Light Detection And Ranging (LiDAR) system is one of the well-known means of producing a DEM. The system offers fast data acquisition and is less restricted by weather than photogrammetric approaches. In this regard, LiDAR has been widely used and accepted in the generation of national spatial information because of its sufficient positional accuracy. However, the accuracy of aerial LiDAR data over forested areas with various kinds of vegetation has rarely been investigated in Korea. Hence, this research investigates the accuracy of aerial LiDAR data over forested areas and evaluates the accuracy obtained according to the characteristics of the vegetation. The study areas include land with shrubs and an adjacent forest area with mixed tree species. The check points were selected to be well distributed over the whole study area, and their coordinates were surveyed by Global Positioning System (GPS). The surveyed information and the aerial LiDAR data were then compared, and the resulting accuracy was evaluated. In conclusion, it is recommended that LiDAR data collection be conducted after the defoliation period, especially over areas with broadleaf trees, because of the possibility of significant outliers.
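
The abstract does not say which accuracy statistic was used to compare LiDAR and GPS heights. As one common possibility, the sketch below computes the mean error (bias) and the root mean square error over paired check points. The choice of statistics and the function name are assumptions made for illustration.

```python
import math


def elevation_errors(lidar_z, gps_z):
    """Mean error and RMSE of LiDAR heights against GPS-surveyed heights.

    Both lists hold elevations for the same check points in the same order.
    The choice of these two statistics is an assumption, not the paper's.
    """
    if len(lidar_z) != len(gps_z) or not lidar_z:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [l - g for l, g in zip(lidar_z, gps_z)]
    mean_error = sum(diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mean_error, rmse
```

A positive mean error would indicate LiDAR heights above the surveyed ground, as can happen when returns come from foliage rather than terrain. A single large outlier raises the RMSE much more than it raises the mean error.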

Research of recognition factors of folk medicine using statistical testing and data mining (통계적 검정과 데이터마이닝기법의 융합을 통한 민간요법 인식 요인 탐색조사)

  • Yoo, Jin Ah;Choi, Kyoung-Ho;Cho, Jung-Keun
    • Journal of Digital Convergence / v.13 no.2 / pp.393-399 / 2015
  • Nowadays, beyond the era of well-being and LOHAS, many people take great interest in self-therapy, so this is called the healing era. As the folk medicine field is actively industrialized and interest in health improvement, rather than disease cure, increases, much research on alternative medicine and therapy is being carried out in various fields. At a time when interest in health improvement and in the spontaneous, natural healing ability of the human body is increasing, it is very meaningful to search for the factors that make up the recognition of folk medicine. In this study, we developed a questionnaire on the basis of previous studies, investigated the factors affecting the recognition of folk medicine using factor analysis, and statistically tested the differences in recognition according to demographic traits. As a result, the twenty-four measured variables related to folk medicine were grouped into four factors: a health improvement factor, a safety factor, a psychological factor, and a substitutional factor. Overall, middle-aged and senior people (in their forties to sixties) and more highly educated people have more experience with folk medicine than younger people (in their thirties and below) and less educated people. Sex makes little difference.