• Title/Summary/Keyword: Hadoop Distribution

An Analytical Approach to Evaluation of SSD Effects under MapReduce Workloads

  • Ahn, Sungyong; Park, Sangkyu
    • JSTS: Journal of Semiconductor Technology and Science, v.15 no.5, pp.511-518, 2015
  • As the cost-per-byte of SSDs dramatically decreases, introducing SSDs into Hadoop becomes an attractive choice for high-performance data processing. In this paper, the cost-per-performance of an SSD-based Hadoop cluster (SSD-Hadoop) and an HDD-based Hadoop cluster (HDD-Hadoop) is evaluated. To this end, we propose a MapReduce performance model that uses a queuing network to simulate the execution time of a MapReduce job with varying cluster sizes. To achieve an accurate model, the execution time distribution of MapReduce jobs is carefully profiled. The developed model can predict the execution time of MapReduce jobs with less than 7% error in most cases. Simulation with a varying number of cluster nodes also shows that SSD-Hadoop is 20% more cost efficient than HDD-Hadoop, because it needs fewer nodes to achieve comparable performance.
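As a rough illustration of the kind of comparison such a model enables, the sketch below approximates job time as waves of map and reduce tasks over a fixed slot pool and derives a cost figure from it. This is a back-of-the-envelope stand-in for the paper's queuing-network model, and every number in it (task times, node cost) is a hypothetical placeholder.

```python
# Back-of-the-envelope sketch (NOT the paper's queuing-network model):
# estimate MapReduce job time as waves of tasks over a fixed slot pool,
# then compare cost-per-performance for hypothetical SSD/HDD task times.
import math

def job_time(num_map, num_reduce, slots_per_node, nodes, t_map, t_reduce):
    """Rough wave-based estimate of MapReduce job execution time (seconds)."""
    slots = slots_per_node * nodes
    map_waves = math.ceil(num_map / slots)
    reduce_waves = math.ceil(num_reduce / slots)
    return map_waves * t_map + reduce_waves * t_reduce

# Placeholder per-task service times: SSD assumed faster on I/O-bound tasks.
SSD = dict(t_map=8.0, t_reduce=15.0, node_cost=1.5)   # hypothetical numbers
HDD = dict(t_map=12.0, t_reduce=25.0, node_cost=1.0)

for label, p in (("SSD-Hadoop", SSD), ("HDD-Hadoop", HDD)):
    for nodes in (4, 8, 16):
        t = job_time(num_map=512, num_reduce=64, slots_per_node=4,
                     nodes=nodes, t_map=p["t_map"], t_reduce=p["t_reduce"])
        cost = nodes * p["node_cost"] * t   # cost ~ nodes x price x time
        print(f"{label:11s} nodes={nodes:2d} time={t:7.1f}s cost={cost:8.1f}")
```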

Hadoop System Design for Big data Processing of RFID Distribution (RFID/NFC 물류의 빅 데이터 처리를 위한 하둡 시스템의 설계)

  • Kim, Nam-Ho; Noh, Jin-Heon; Jeong, Hee-Ja
    • Smart Media Journal, v.2 no.3, pp.47-53, 2013
  • Recently, RFID/NFC technology has been widely used in logistics systems as a typical application of IT convergence, and the distribution flows it tracks generate large volumes of big data. A Hadoop distributed system can collect the data items produced in distribution and, with its parallel processing capability, process logistics information and create records for logistics information management. We designed and developed a prototype of such a Hadoop system and examined the feasibility of its utilization.
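The core parallel aggregation such a system performs can be sketched with Hadoop Streaming. The sketch below counts read events per tag; the input layout (tag_id,reader_id,timestamp) and the single-file mapper/reducer arrangement are illustrative assumptions, not the paper's prototype.

```python
# Minimal Hadoop Streaming sketch: count RFID read events per tag.
# Assumed input lines: "tag_id,reader_id,timestamp" (hypothetical format).
# Example invocation with the streaming jar:
#   hadoop jar hadoop-streaming.jar \
#     -mapper "python3 rfid_count.py mapper" \
#     -reducer "python3 rfid_count.py reducer" \
#     -input /rfid/events -output /rfid/counts
import sys

def mapper(lines):
    """Emit (tag_id, 1) for every RFID read event."""
    for line in lines:
        fields = line.strip().split(",")
        if fields and fields[0]:
            yield f"{fields[0]}\t1"

def reducer(lines):
    """Sum counts per tag; streaming delivers keys already sorted."""
    current, count = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current and current is not None:
            yield f"{current}\t{count}"
            count = 0
        current = key
        count += int(value)
    if current is not None:
        yield f"{current}\t{count}"

if __name__ == "__main__":
    step = reducer if sys.argv[-1] == "reducer" else mapper
    for out in step(sys.stdin):
        print(out)
```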

Initial Authentication Protocol of Hadoop Distribution System based on Elliptic Curve (타원곡선기반 하둡 분산 시스템의 초기 인증 프로토콜)

  • Jeong, Yoon-Su; Kim, Yong-Tae; Park, Gil-Cheol
    • Journal of Digital Convergence, v.12 no.10, pp.253-258, 2014
  • Recently, as cloud computing technology develops and smartphone use increases, more users want to receive big data services. The Hadoop framework behind big data services provides the Hadoop file system and Hadoop MapReduce to support data-intensive distributed applications. However, smartphone services that use a Hadoop system are in a very vulnerable state with respect to data authentication. In this paper, we propose an initial authentication protocol for a Hadoop system that supports smartphone services. The proposed protocol combines symmetric-key cryptography with an ECC algorithm in order to support secure multiple data processing systems. In particular, when a user accesses the Hadoop system to process data, the proposed protocol improves security by using elliptic-curve public-key cryptography instead of a symmetric key for the initial authentication key.
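The general pattern of pairing ECC with symmetric-key cryptography can be illustrated as an ECDH key agreement whose shared secret is fed through a KDF to produce a symmetric session key. The sketch below uses the third-party Python cryptography package and is a generic illustration of that pattern, not the paper's protocol; the info label and the encrypted request are made up.

```python
# Generic ECC-to-symmetric-key sketch (NOT the paper's exact protocol).
# Requires the third-party 'cryptography' package.
import base64
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

# Each side generates an ephemeral elliptic-curve key pair (P-256).
client_priv = ec.generate_private_key(ec.SECP256R1())
server_priv = ec.generate_private_key(ec.SECP256R1())

# ECDH: both sides derive the same shared secret from the peer's public key,
# so only public keys ever cross the network during initial authentication.
shared = client_priv.exchange(ec.ECDH(), server_priv.public_key())
assert shared == server_priv.exchange(ec.ECDH(), client_priv.public_key())

# Derive a symmetric session key from the ECDH shared secret.
session_key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                   info=b"hadoop-initial-auth").derive(shared)

# Subsequent requests are protected with fast symmetric encryption.
cipher = Fernet(base64.urlsafe_b64encode(session_key))
token = cipher.encrypt(b"read /user/data/block-0001")
print(cipher.decrypt(token))
```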

An Efficient Parallel Construction Scheme of An R-Tree using Hadoop (Hadoop을 이용한 R-트리의 효율적인 병렬 구축 기법)

  • Cong, Viet-Ngu Huynh; Kim, Jongmin; Kwon, Oh-Heum; Song, Ha-Joo
    • Journal of Korea Multimedia Society, v.22 no.2, pp.231-241, 2019
  • Bulk-loading can be a good approach to building an efficient R-tree. However, bulk-loading an R-tree takes a lot of time for huge amounts of data. In this paper, we propose a parallel R-tree construction scheme based on the Hadoop framework. The proposed scheme divides the data set into a number of partitions, for which local R-trees are built in parallel via MapReduce operations. The local R-trees are then merged into a global R-tree that covers the whole data set. While generating the partitions, the scheme takes the spatial distribution of the data into account so that each partition holds a nearly equal amount of data. The proposed scheme therefore yields an efficient index structure while reducing construction time. Experimental tests show that the proposed scheme builds an R-tree more efficiently than existing approaches.
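The partition/build/merge pipeline can be mimicked on a single machine, with multiprocessing standing in for the Hadoop cluster and each local "R-tree" simplified to a node holding its minimum bounding rectangle; the strip-based x-sort partitioning below is one assumed way to make partitions spatially coherent and nearly equal in size.

```python
# Single-machine sketch of partition -> parallel local build -> merge
# (multiprocessing stands in for Hadoop; local trees are simplified).
from multiprocessing import Pool
import random

def bbox(points):
    """Minimum bounding rectangle (x0, y0, x1, y1) of a point set."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def build_local_tree(points):
    """Stand-in for a local R-tree bulk-load: a node with an MBR."""
    return {"mbr": bbox(points), "size": len(points)}

if __name__ == "__main__":
    pts = [(random.random(), random.random()) for _ in range(100_000)]
    k = 8
    # Space-aware partitioning: sorting on x makes each partition a
    # vertical strip holding a nearly equal number of points.
    pts.sort()
    chunk = (len(pts) + k - 1) // k
    parts = [pts[i:i + chunk] for i in range(0, len(pts), chunk)]
    with Pool(len(parts)) as pool:     # "map": build local trees in parallel
        local_trees = pool.map(build_local_tree, parts)
    corners = []                       # "reduce": merge under a global root
    for tree in local_trees:
        x0, y0, x1, y1 = tree["mbr"]
        corners += [(x0, y0), (x1, y1)]
    root = {"mbr": bbox(corners), "children": local_trees}
    print(len(root["children"]), "local trees merged; root MBR:", root["mbr"])
```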

Anomalous Pattern Analysis of Large-Scale Logs with Spark Cluster Environment

  • Sion Min; Youyang Kim; Byungchul Tak
    • Journal of the Korea Society of Computer and Information, v.29 no.3, pp.127-136, 2024
  • This study explores the correlation between system anomalies and large-scale logs within a Spark cluster environment. While research on log-based anomaly detection is growing, existing work makes limited use of logs from the various components of the cluster and of the relationship between anomalies and the system. This paper therefore analyzes the distribution of normal and abnormal logs and explores the potential for anomaly detection based on the occurrence of log templates. Using Hadoop and Spark, normal and abnormal log data are generated, and through t-SNE and K-means clustering, the templates of abnormal logs arising in anomalous situations are identified to characterize the anomalies. Ultimately, log templates that occur only during abnormal situations are identified, demonstrating the potential for anomaly detection.
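The template-based idea can be sketched on toy data: gather the sets of templates seen in normal and abnormal runs, report the templates unique to abnormal runs, and cluster the abnormal runs by their template counts. The runs below are invented, and t-SNE is omitted because it only aids visualization.

```python
# Toy sketch of template-based anomaly analysis (data is made up).
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

normal_runs = [["T1", "T2", "T3"], ["T1", "T3"], ["T2", "T3"]]
abnormal_runs = [["T1", "T4", "T5"], ["T2", "T4"], ["T4", "T5", "T5"]]

# Templates seen only during abnormal runs are candidate anomaly signals.
normal_templates = {t for run in normal_runs for t in run}
abnormal_templates = {t for run in abnormal_runs for t in run}
print("abnormal-only templates:", abnormal_templates - normal_templates)

# Represent each abnormal run as a template-count vector and cluster it.
vocab = sorted(abnormal_templates)
X = np.array([[Counter(run)[t] for t in vocab] for run in abnormal_runs])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster labels:", labels)
```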

A Study On Recommend System Using Co-occurrence Matrix and Hadoop Distribution Processing (동시발생 행렬과 하둡 분산처리를 이용한 추천시스템에 관한 연구)

  • Kim, Chang-Bok; Chung, Jae-Pil
    • Journal of Advanced Navigation Technology, v.18 no.5, pp.468-475, 2014
  • Real-time recommendation is becoming harder for recommender systems as preference data sets grow, straining both computing power and the recommendation algorithm. For this reason, distributed processing methods for large preference data sets are being actively studied. This paper studies a distributed processing method for large preference data sets using the Hadoop distributed processing platform and the Mahout machine learning library. The recommendation algorithm uses a co-occurrence matrix, similar to item-based collaborative filtering. Building the co-occurrence matrix requires a large amount of computation, but the work can be distributed across the many nodes of a Hadoop cluster, which reduces the computation per node. This paper simplifies the distributed computation of the co-occurrence matrix from four stages to three, reducing the number of MapReduce jobs while still generating the recommendation file. As a result, processing is faster and the map output data is reduced.
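An in-memory sketch of the co-occurrence step is shown below; the paper runs it as MapReduce stages on Hadoop with Mahout, whereas here the matrix build and the matrix-vector scoring are collapsed into plain Python over made-up preferences.

```python
# In-memory co-occurrence recommendation sketch (preferences are made up).
from collections import defaultdict

prefs = {                                  # user -> {item: preference}
    "u1": {"A": 5, "B": 3, "C": 4},
    "u2": {"A": 4, "C": 5},
    "u3": {"B": 4, "C": 3, "D": 5},
}

# Stage 1: co-occurrence matrix co[i][j] = number of users rating both i, j.
co = defaultdict(lambda: defaultdict(int))
for items in prefs.values():
    for i in items:
        for j in items:
            if i != j:
                co[i][j] += 1

# Stage 2: multiply the co-occurrence matrix by the user's preference
# vector, keeping only items the user has not rated yet.
def recommend(user):
    scores = defaultdict(float)
    for rated, value in prefs[user].items():
        for other, count in co[rated].items():
            if other not in prefs[user]:
                scores[other] += count * value
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(recommend("u2"))   # items B and D ranked by co-occurrence score
```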

Visualization Method of Social Networks Service based on Cloud (클라우드 기반의 소셜 네트워크 서비스 시각화 방법)

  • Kim, Yong IL; Park, Sun; Kim, Chul Won
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference, 2013.05a, pp.699-700, 2013
  • This paper proposes a new cloud-based visualization method that uses the internal relationships of user correlations and the external relations of a social network to visualize the user relationship hierarchy. The proposed method uses Hadoop and Hive for distributed storage and parallel processing, and the computed result is visualized as a hierarchy graph using D3.
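Of that pipeline, the hand-off to D3 is the part that can be sketched compactly: the snippet below (with invented rows standing in for the output of a Hive aggregation) nests parent/child correlation records into the JSON shape that D3 hierarchy layouts consume.

```python
# Sketch of shaping aggregated rows into a D3-style hierarchy (data made up).
import json

# (parent, child, weight) rows, as might come out of a Hive query.
edges = [("root", "groupA", 0.0), ("groupA", "u1", 0.9),
         ("groupA", "u2", 0.7), ("root", "groupB", 0.0),
         ("groupB", "u3", 0.8)]

def to_hierarchy(node):
    """Recursively nest children under their parent node."""
    children = [to_hierarchy(child) for parent, child, _ in edges
                if parent == node]
    return {"name": node, "children": children} if children else {"name": node}

print(json.dumps(to_hierarchy("root"), indent=2))
```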

Visualization of Social Networks Service based on Virtualization (가상화 기반의 SNS 시각화)

  • Park, Sun; Kim, Chul Won
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference, 2014.05a, pp.637-638, 2014
  • This paper proposes a new virtualization-based visualization method that uses the internal relationships of user correlations and external information from a social network to visualize the user relationship hierarchy. The proposed method runs Hadoop on OpenStack virtual machines for distributed, parallel processing, and the computed result is visualized as a hierarchy graph that lets users analyze the linked nodes of social network services.

A Development Study of The VPT for the improvement of Hadoop performance (하둡 성능 향상을 위한 VPT 개발 연구)

  • Yang, Ill Deung; Kim, Seong Ryeol
    • Journal of the Korea Institute of Information and Communication Engineering, v.19 no.9, pp.2029-2036, 2015
  • Hadoop MR (MapReduce) uses a partition function to pass the outputs of mappers to reducers. The partition function determines the target reducer by computing a hash value from the key and taking it modulo the number of reducers. The legacy partition function does not divide the job effectively because it is very sensitive to the key distribution. If the job is not divided effectively, some reducers need more time to finish, which can lengthen the total processing time of the job. This paper proposes the VPT (Virtual Partition Table) and tests it on heavily skewed data. Applying the VPT improved processing time by three seconds on average, and we expect the improvement to grow as data volume increases.
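One way to realize a virtual partition table, sketched under assumptions below (this is not necessarily the paper's VPT design), is to hash keys into many virtual buckets, measure bucket loads on a sample, and greedily assign buckets to the least-loaded reducer instead of computing hash(key) mod numReducers directly.

```python
# Assumed sketch of a skew-aware virtual partition table (not the paper's
# exact design): sample key frequencies, then balance buckets to reducers.
from collections import Counter

NUM_VIRTUAL = 64      # many virtual buckets...
NUM_REDUCERS = 4      # ...spread over few reducers

def build_vpt(sample_keys):
    """Greedily map the heaviest buckets onto the least-loaded reducer."""
    bucket_load = Counter(hash(k) % NUM_VIRTUAL for k in sample_keys)
    reducer_load = [0] * NUM_REDUCERS
    vpt = {}
    for bucket, load in bucket_load.most_common():
        target = reducer_load.index(min(reducer_load))
        vpt[bucket] = target
        reducer_load[target] += load
    return vpt, reducer_load

def partition(key, vpt):
    """VPT-based replacement for plain hash(key) % NUM_REDUCERS."""
    return vpt.get(hash(key) % NUM_VIRTUAL, hash(key) % NUM_REDUCERS)

# Skewed sample: "hot" dominates; plain mod would pile it and every key
# sharing its bucket onto one reducer, stretching total job time.
sample = ["hot"] * 900 + [f"k{i}" for i in range(100)]
vpt, loads = build_vpt(sample)
print("per-reducer sampled load:", loads)
print("reducer for 'hot':", partition("hot", vpt))
```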

An Adaptively Speculative Execution Strategy Based on Real-Time Resource Awareness in a Multi-Job Heterogeneous Environment

  • Liu, Qi; Cai, Weidong; Liu, Qiang; Shen, Jian; Fu, Zhangjie; Liu, Xiaodong; Linge, Nigel
    • KSII Transactions on Internet and Information Systems (TIIS), v.11 no.2, pp.670-686, 2017
  • MapReduce (MRV1), a popular programming model proposed by Google, has been widely used to process large datasets in Hadoop, an open-source cloud platform. Its new version, MapReduce 2.0 (MRV2), developed alongside the emergence of YARN, has achieved obvious improvements over MRV1. However, MRV2 suffers from long finishing times on certain types of jobs. Speculative Execution (SE) has been presented as an approach to this problem: delayed tasks are backed up from low-performance machines onto higher-performance ones. In this paper, an adaptive SE strategy (ASE) is presented in Hadoop-2.6.0. Experimental results show that ASE duplicates tasks according to real-time resource usage among worker nodes in a cloud, and that the ASE strategy largely improves the performance of MRV2 in both job execution time and resource consumption, even in a multi-job environment.
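A toy heuristic in the same spirit is sketched below: estimate each task's remaining time from its progress rate and duplicate a task only when it is clearly slower than its peers and a node has spare capacity. The thresholds and the CPU signal are illustrative assumptions, not the paper's ASE algorithm.

```python
# Toy resource-aware speculative-execution heuristic (not the paper's ASE).
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    progress: float   # fraction complete, 0..1
    elapsed: float    # seconds running so far

def remaining_time(t: Task) -> float:
    """Extrapolate remaining time from the task's observed progress rate."""
    rate = t.progress / t.elapsed if t.elapsed > 0 else 0.0
    return (1.0 - t.progress) / rate if rate > 0 else float("inf")

def should_speculate(t: Task, mean_remaining: float, node_free_cpu: float,
                     slowdown: float = 1.5, min_free_cpu: float = 0.3) -> bool:
    """Back up a task only if it is a clear straggler AND capacity exists."""
    return (remaining_time(t) > slowdown * mean_remaining
            and node_free_cpu >= min_free_cpu)

tasks = [Task("t1", 0.9, 60), Task("t2", 0.2, 60), Task("t3", 0.8, 60)]
mean_rem = sum(remaining_time(t) for t in tasks) / len(tasks)
for t in tasks:
    print(t.task_id, should_speculate(t, mean_rem, node_free_cpu=0.5))
```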