• Title/Summary/Keyword: Big Data Cluster


Design and Implementation of Big Data Cluster for Indoor Environment Monitering (실내 환경 모니터링을 위한 빅데이터 클러스터 설계 및 구현)

  • Jeon, Byoungchan;Go, Mingu
    • Journal of Korea Society of Digital Industry and Information Management / v.13 no.2 / pp.77-85 / 2017
  • As living space expands with population growth and lifestyle changes, most people spend nearly all of their time indoors except when traveling. Changes in the indoor environment are therefore very important, affecting both people's health and the economical use of resources. Yet most people do not recognize the importance of the indoor environment. A monitoring system is thus needed to sustain and manage the indoor environment systematically, and big data clusters should be used to store and manage the large volume of sensor data collected from many spaces. In this paper, we design a big data cluster for indoor environment monitoring that stores sensor data and monitors each unit of a large building. We design and implement a big data cluster-based system for analysis, building a distributed file system with Hadoop and HBase for big data processing. The system stores the various sensor data it collects, and monitoring is expected to enable effective indoor environment management and improved health.

Scalable Prediction Models for Airbnb Listing in Spark Big Data Cluster using GPU-accelerated RAPIDS

  • Muralidharan, Samyuktha;Yadav, Savita;Huh, Jungwoo;Lee, Sanghoon;Woo, Jongwook
    • Journal of information and communication convergence engineering / v.20 no.2 / pp.96-102 / 2022
  • We aim to build predictive models for Airbnb prices using GPU-accelerated RAPIDS in a big data cluster. The Airbnb Listings datasets are used for the predictive analysis. Several machine-learning algorithms were adopted to build models that predict the price of Airbnb listings. We compare the results of traditional and big data approaches to machine learning for price prediction and discuss the performance of the models. We built big data models using a Databricks Spark cluster, a distributed parallel computing system. Furthermore, we implemented models on multiple GPUs using RAPIDS in the Spark cluster. The GPU-based model was developed using the XGBoost algorithm, whereas the other models were developed using traditional central processing unit (CPU)-based algorithms. This study compared all models in terms of accuracy metrics and computing time. We observed that the XGBoost model with RAPIDS using GPUs had the highest accuracy and computing time.

Scaling of Hadoop Cluster for Cost-Effective Processing of MapReduce Applications (비용 효율적 맵리듀스 처리를 위한 클러스터 규모 설정)

  • Ryu, Woo-Seok
    • The Journal of the Korea institute of electronic communication sciences / v.15 no.1 / pp.107-114 / 2020
  • This paper studies a method for estimating the scale of a Hadoop cluster so that big data can be processed in a cost-effective manner. In medical institutions, demand for cloud-based big data analysis is increasing now that medical records can be stored outside the hospital. This paper first analyzes the Amazon EMR framework, one of the popular cloud-based big data frameworks. It then presents an efficiency model for scaling the Hadoop cluster to execute a MapReduce application more cost-effectively. The paper also analyzes the factors that influence the execution of a MapReduce application by performing several experiments under various conditions. The cost efficiency of big data analysis can be increased by choosing the cluster scale that gives the most efficient processing time relative to the operational cost.
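The trade-off between processing time and operational cost described in this abstract can be illustrated with a small sketch. The abstract does not give the paper's efficiency model, so the cost formula below (nodes × runtime × price per node-hour), the function name `pick_cluster_size`, and all runtimes and prices are hypothetical assumptions chosen only for illustration.

```python
def pick_cluster_size(runtimes_h, price_per_node_hour, deadline_h=None):
    """Return (best_node_count, costs) for measured runtimes.

    runtimes_h maps a node count to a measured job runtime in hours.
    Cost is assumed to be nodes * runtime * price (an illustrative model,
    not the one from the paper).
    """
    costs = {n: n * t * price_per_node_hour for n, t in runtimes_h.items()}
    # Keep only cluster sizes that finish within the optional deadline.
    feasible = [n for n, t in runtimes_h.items()
                if deadline_h is None or t <= deadline_h]
    best = min(feasible, key=lambda n: costs[n])
    return best, costs


# Hypothetical measurements: adding nodes shortens the job, with diminishing returns.
measured = {2: 10.0, 4: 5.5, 8: 3.5, 16: 3.0}
cheapest, costs = pick_cluster_size(measured, price_per_node_hour=0.25)
within_6h, _ = pick_cluster_size(measured, price_per_node_hour=0.25, deadline_h=6.0)
print(cheapest, within_6h, costs)
```

With these made-up numbers the smallest cluster is cheapest overall, but a deadline moves the choice to a larger cluster, which is the kind of time-versus-cost decision the abstract refers to.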

A Container Orchestration System for Process Workloads

  • Jong-Sub Lee;Seok-Jae Moon
    • International Journal of Internet, Broadcasting and Communication / v.15 no.4 / pp.270-278 / 2023
  • We propose a container orchestration system for process workloads that combines the potential of big data and machine learning technologies to integrate enterprise process-centric workloads. The proposed system analyzes big data generated from industrial automation to identify hidden patterns and build a machine learning prediction model. For each machine learning case, training data is loaded into a data store and preprocessed for model training. Next, an appropriate model is selected and fitted to the training data, and the model is then evaluated on test data. This step is called model construction and can be performed in a deployment framework. Additionally, a visual hierarchy is constructed to display prediction results and facilitate big data analysis. For evaluation and analysis, the cluster required for parallel computation of PCA was built by creating multiple virtual machines in a big data cluster. The proposed system is modeled as layers of individual components that can be connected together. The advantage of this design is that components can be added, replaced, or reused without affecting the rest of the system.

A Study on FIFA Partner Adidas of 2022 Qatar World Cup Using Big Data Analysis

  • Kyung-Won, Byun
    • International Journal of Internet, Broadcasting and Communication / v.15 no.1 / pp.164-170 / 2023
  • The purpose of this study is to analyze big data on the Adidas brand, which participated in the 2022 Qatar World Cup as a FIFA partner, in order to extract useful information, semantic connections, and context from unstructured data. To this end, the study collected big data about Adidas generated during the World Cup from major portal sites. According to the text mining analysis, 'Adidas' was the most frequent keyword, appearing 3,340 times, followed by 'World Cup', 'Qatar World Cup', 'Soccer', 'Lionel Messi', 'Qatar', 'FIFA', 'Korea', and 'Uniform'. The TF-IDF ranking was 'Qatar World Cup', 'Soccer', 'Lionel Messi', 'World Cup', 'Uniform', 'Qatar', 'FIFA', 'Ronaldo', 'Korea', and 'Nike'. Semantic network analysis and CONCOR analysis produced four groups. First, Cluster A was named 'Qatar World Cup Sponsor' because it grouped words such as 'Adidas', 'Nike', 'Qatar World Cup', 'Sponsor', 'Sponsor Company', 'Marketing', 'Nation', 'Launch', 'Official', 'Commemoration', and 'National Team'. Second, Cluster B was named 'Group stage' because it grouped words such as 'Qatar', 'Uruguay', 'FIFA', and 'group stage'. Third, Cluster C was named 'Winning' because it grouped words such as 'World Cup Winning', 'Champion', 'France', 'Argentina', 'Lionel Messi', 'Advertising', and 'Photograph'. Fourth, Cluster D was named 'Official Ball' because it grouped words such as 'Official Ball', 'World Cup Official Ball', 'Soccer Ball', 'All Times', 'Al Rihla', 'Public', and 'Technology'.

Big Data Analysis on the Perception of Home Training According to the Implementation of COVID-19 Social Distancing

  • Hyun-Chang Keum;Kyung-Won Byun
    • International Journal of Internet, Broadcasting and Communication / v.15 no.3 / pp.211-218 / 2023
  • Since COVID-19 distancing was implemented, interest in 'home training' and the number of its users have been increasing rapidly. The purpose of this study is therefore to identify how 'home training' is perceived through big data analysis of social media channels and to provide basic data to the related business sector. Big data were collected from news and social content provided on Naver and Google. Data were collected for three years starting March 22, 2020, when COVID-19 distancing was implemented in Korea. The collected data included 4,000 Naver blog posts, 2,673 news articles, 4,000 cafe posts, 3,989 Knowledge iN entries, and 953 Google news items. TF and TF-IDF were computed on these data through text mining, and semantic network analysis was then conducted on 70 keywords. Big data analysis programs such as Textom and Ucinet were used for the social big data analysis, and NetDraw was used for visualization. In the text mining analysis, 'home training' had the highest TF, appearing 4,045 times, followed by 'exercise', 'Homt', 'house', 'apparatus', 'recommendation', and 'diet'. By TF-IDF, the main keywords were 'exercise', 'apparatus', 'home', 'house', 'diet', 'recommendation', and 'mat'. Based on these results, the 70 most frequent keywords were extracted, and semantic indicators and centrality were analyzed. Finally, CONCOR analysis grouped the keywords into a 'purchase cluster', an 'equipment cluster', a 'diet cluster', and an 'execute method cluster'. From these four clusters, basic data for the 'home training' business sector were presented, based on consumers' main perceptions of 'home training' and analysis of the semantic network.
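The difference between the TF and TF-IDF rankings reported in this abstract can be reproduced on a toy corpus. The sketch below assumes a common formulation, corpus-wide term count multiplied by log(N / document frequency); the abstract does not state the exact weighting used by Textom, and the four token lists are hypothetical.

```python
import math
from collections import Counter


def term_frequency(docs):
    """Total number of occurrences of each term across all documents."""
    return Counter(tok for doc in docs for tok in doc)


def tf_idf(docs):
    """TF * log(N / DF), where DF counts the documents containing the term."""
    n_docs = len(docs)
    tf = term_frequency(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: tf[tok] * math.log(n_docs / df[tok]) for tok in tf}


# Hypothetical, already-tokenized posts (not data from the study).
docs = [
    ["home training", "exercise", "mat"],
    ["home training", "diet", "exercise"],
    ["home training", "apparatus", "recommendation"],
    ["home training", "exercise", "apparatus"],
]
tf = term_frequency(docs)
scores = tf_idf(docs)
print(tf.most_common(3))
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

A term that appears in every document, such as the search keyword itself, gets an IDF of zero under this formulation. That is one plausible reason the most frequent term can drop out of a TF-IDF ranking.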

A Classification Algorithm Based on Data Clustering and Data Reduction for Intrusion Detection System over Big Data

  • Wang, Qiuhua;Ouyang, Xiaoqin;Zhan, Jiacheng
    • KSII Transactions on Internet and Information Systems (TIIS) / v.13 no.7 / pp.3714-3732 / 2019
  • With the rapid development of networks, the Intrusion Detection System (IDS) plays an increasingly important role in network applications. Many data mining algorithms are used to build IDSs. However, with the advent of the big data era, massive amounts of data are generated. When dealing with large-scale data sets, most data mining algorithms suffer from a high computational burden, which makes the IDS much less efficient. To build an efficient IDS over big data, we propose a classification algorithm based on data clustering and data reduction. In the training stage, the training data are divided into clusters of similar size by the Mini Batch K-Means algorithm, and the center of each cluster is used as its index. Then, we select representative instances for each cluster to perform data reduction and use the clusters of representative instances to build a K-Nearest Neighbor (KNN) detection model. In the detection stage, we sort the clusters by the distance between the test sample and the cluster indexes, obtain the k nearest clusters, and find the k nearest neighbors within them. Experimental results show that searching for neighbors by cluster index significantly reduces the computational complexity, and that classification with the reduced data of representative instances not only improves efficiency but also maintains high accuracy.
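The two stages described in this abstract can be sketched in a few lines. This is a simplified illustration, not the authors' implementation. It uses plain k-means in place of Mini Batch K-Means to stay dependency-free. The abstract does not say how representative instances are selected, so the rule "keep the instances closest to the cluster center" is a hypothetical stand-in, and the function names and two-dimensional toy data are also assumptions.

```python
import math
import random
from collections import Counter


def assign(samples, centers):
    """Group (features, label) samples by their nearest center."""
    groups = [[] for _ in centers]
    for x, y in samples:
        nearest = min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
        groups[nearest].append((x, y))
    return groups


def build_index(samples, n_clusters, keep, iters=10, seed=0):
    """Training stage: cluster the data, then reduce each cluster."""
    rng = random.Random(seed)
    centers = [x for x, _ in rng.sample(samples, n_clusters)]
    for _ in range(iters):
        groups = assign(samples, centers)
        # Move each center to the mean of its members; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*(x for x, _ in g))) if g else c
            for c, g in zip(centers, groups)
        ]
    groups = assign(samples, centers)
    # Hypothetical reduction rule: keep the `keep` instances nearest the center.
    return [
        (c, sorted(g, key=lambda s: math.dist(s[0], c))[:keep])
        for c, g in zip(centers, groups)
    ]


def classify(index, x, k=3):
    """Detection stage: search only the k nearest clusters, then vote."""
    nearest_clusters = sorted(index, key=lambda entry: math.dist(x, entry[0]))[:k]
    candidates = [s for _, members in nearest_clusters for s in members]
    neighbors = sorted(candidates, key=lambda s: math.dist(x, s[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


# Toy data: two well-separated groups standing in for normal and attack traffic.
rng = random.Random(1)
train = [((rng.gauss(0, 1), rng.gauss(0, 1)), "normal") for _ in range(50)]
train += [((rng.gauss(10, 1), rng.gauss(10, 1)), "attack") for _ in range(50)]
index = build_index(train, n_clusters=4, keep=10)
print(classify(index, (0.5, -0.2)), classify(index, (9.5, 10.3)))
```

Each query is compared against the cluster centers first and then only against the retained instances of the nearest clusters, so the number of distance computations no longer grows with the full training set.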

Investigation on the Korean Cyclists' Body Type Through Anthropometric Measurements (사이클 선수들의 체형 특성에 관한 연구)

  • 최미성;정성필
    • Journal of the Korean Society of Clothing and Textiles / v.28 no.7 / pp.1019-1028 / 2004
  • The purpose of this study was to compare the body measurements of cyclists and non-cyclists and to classify cyclists' body types in order to offer basic information to bicycle apparel manufacturers in Korea. Anthropometric data, including both direct and indirect measurements, were collected from 81 cyclists (40 female, 41 male) aged 19 to 24. The measurements were analyzed using percentiles, t-tests, factor analysis, and cluster analysis. The results were as follows. Comparison of the anthropometric data showed that cyclists are larger than non-cyclists, with an especially large difference in thigh circumference. Factor analysis extracted 5 factors, explaining 74% of the variance, from all items for male and female cyclists. Cluster analysis classified body types into 3 groups. Among the three female groups, Cluster 1 has the biggest torso and an erect back, Cluster 2 has the smallest size and drooping shoulders, and Cluster 3 has forward-bent shoulders and a protruding back. Among the male cyclists, Cluster 1 has a thin body type, with large height measurements and small girth measurements. Cluster 2 has the biggest torso and thigh circumference. Cluster 3 has a large forward shoulder angle and a protruding back, as in the female cyclists.

A Study on the Effect of the Name Node and Data Node on the Big Data Processing Performance in a Hadoop Cluster (Hadoop 클러스터에서 네임 노드와 데이터 노드가 빅 데이터처리 성능에 미치는 영향에 관한 연구)

  • Lee, Younghun;Kim, Yongil
    • Smart Media Journal / v.6 no.3 / pp.68-74 / 2017
  • Big data processing handles various types of data, such as files, images, and video, to solve problems and provide useful insights. Various platforms are currently used for big data processing, but many organizations and enterprises use Hadoop because of its simplicity, productivity, scalability, and fault tolerance. In addition, Hadoop can build clusters on various hardware platforms and handles big data by dividing the work between a name node (master) and data nodes (slaves). In this paper, we use the fully distributed mode, the operation mode used by actual institutions and companies. To facilitate experimentation, we constructed a Hadoop cluster using low-power, low-cost single-board computers. The performance effect of the name node is analyzed by running the same data processing with a single-board computer and with a laptop as the name node. The influence of the number of data nodes is analyzed by doubling the number of data nodes in the existing cluster. The effects observed in these experiments were analyzed.

Shared Distributed Big-Data Processing Platform Model: a Study (대용량 분산처리 플랫폼 공유 모델 연구)

  • Jeong, Hwanjin;Kang, Taeho;Kim, GyuSeok;Shin, YoungHo;Jeong, Jinkyu
    • KIISE Transactions on Computing Practices / v.22 no.11 / pp.601-613 / 2016
  • With the increasing need for big data processing, building a shared big data processing platform is important for minimizing time and monetary costs. In shared big data processing, multitenancy is a major requirement: each user must be given an isolated personal big data platform, while the underlying hardware is shared among users to increase hardware utilization. In this paper, we explore two well-known shared big data processing platform models. One uses a native Hadoop cluster, and the other builds a virtual Hadoop cluster for each user. For each model, we verified whether it is sufficient to support multitenancy. We also present a method to complement the multitenancy features that a native Hadoop cluster model does not support. Lastly, we built prototype platforms and compared the performance of both models.