Title/Summary/Keyword: HDFS (Hadoop Distributed File System)


Performance Analysis of Distributed Hadoop Systems (분산 하둡 시스템의 성능 비교 분석)

  • Bae, Byoung-Jin; Kim, Young-Joo; Kim, Young-Kuk
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2014.05a / pp.479-482 / 2014
  • Open-source Hadoop systems are now widely used to manage fast-growing big data efficiently. A Hadoop system consists of the HDFS (Hadoop Distributed File System) distributed file system and the MapReduce distributed parallel processing system: MapReduce reads big data from HDFS, processes it, and writes the results back to HDFS. The internal structure of this processing pipeline differs across Hadoop versions. This paper therefore presents a comparative performance analysis of Hadoop systems. We devise a method for monitoring Hadoop systems and use it to measure the occurrence frequency of the processes, threads, and variables generated inside the system itself. The measured results then serve as indicators for predicting the internal performance of Hadoop systems.

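The abstract does not spell out how the monitoring is implemented. As a rough illustration of one way to sample such indicators, the sketch below reads thread counts from a running JVM (for example, a NameNode or DataNode process) through the standard java.lang.management API; it is an assumption-laden stand-in, not the authors' tool.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Minimal sketch: periodically sample thread counts inside a JVM as one
// crude performance indicator. Illustration only, not the paper's monitor.
public class ThreadSampler {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (int i = 0; i < 10; i++) {
            System.out.printf("live=%d peak=%d totalStarted=%d%n",
                    threads.getThreadCount(),
                    threads.getPeakThreadCount(),
                    threads.getTotalStartedThreadCount());
            Thread.sleep(1000); // sample once per second
        }
    }
}
```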

Processing Method of Mass Small File Using Hadoop Platform (하둡 플랫폼을 이용한 대량의 스몰파일 처리방법)

  • Kim, Chang-Bok; Chung, Jae-Pil
    • Journal of Advanced Navigation Technology / v.18 no.4 / pp.401-408 / 2014
  • Hadoop is composed of the MapReduce programming model for distributed processing and the HDFS distributed file system. Hadoop is a suitable framework for big data processing, but processing masses of small files raises several problems: one mapper is created per file, and storing the metadata of so many files requires a large amount of memory. This paper comparatively evaluates several methods of processing masses of small files on the Hadoop platform. Processing with a general compression format is inadequate because each file is handled by a single mapper regardless of data size. Packing small files into SequenceFile or Hadoop archive files removes the NameNode memory problem by compressing and combining them, and combining small files into a Hadoop archive is faster than combining them into a SequenceFile. Processing with the CombineFileInputFormat class requires no combining of small files at all and achieves a speed similar to big data processing methods.
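
For readers unfamiliar with the packing approaches being compared, the sketch below shows the common SequenceFile pattern: many small files are appended to one container file with the filename as key and the raw bytes as value, so the NameNode tracks a single file. The input and output paths are hypothetical, and this is a generic illustration rather than the paper's exact procedure.

```java
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: pack a directory of small files into a single SequenceFile so the
// NameNode tracks one file instead of thousands. Paths are example values.
public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/packed/smallfiles.seq"); // hypothetical output
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus st : fs.listStatus(new Path("/input/small"))) {
                byte[] buf = new byte[(int) st.getLen()]; // small files fit in memory
                try (InputStream in = fs.open(st.getPath())) {
                    IOUtils.readFully(in, buf, 0, buf.length);
                }
                // key = original filename, value = raw file contents
                writer.append(new Text(st.getPath().getName()), new BytesWritable(buf));
            }
        }
    }
}
```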

A Study on Security Improvement in Hadoop Distributed File System Based on Kerberos (Kerberos 기반 하둡 분산 파일 시스템의 안전성 향상방안)

  • Park, So Hyeon; Jeong, Ik Rae
    • Journal of the Korea Institute of Information Security & Cryptology / v.23 no.5 / pp.803-813 / 2013
  • With the development of smart devices and social network services, the amount of data has been exploding, and the world is facing the big data era. For these reasons, big data processing technologies that can handle such data have attracted much attention, and Hadoop is among the most representative of them. The Hadoop Distributed File System (HDFS), designed to run on commodity Linux servers, is an open-source framework that can store many terabytes of data. The initial version of Hadoop did not consider security because it focused only on efficient big data processing, but as the number of users grew rapidly, a great deal of sensitive data, including personal information, came to be stored on HDFS. Hadoop therefore released a new version in 2009 that introduced Kerberos and a token system. However, this system is vulnerable to replay attacks, impersonation attacks, and other attacks. In this paper, we analyze these vulnerabilities of HDFS security and propose a new protocol that remedies them while maintaining the performance of Hadoop.
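
The Kerberos and token machinery the paper analyzes lives inside Hadoop itself; on the client side, Kerberos authentication is usually entered through Hadoop's UserGroupInformation API. A minimal sketch follows, with a hypothetical principal and keytab path; it illustrates standard Kerberized access, not the paper's proposed protocol.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

// Sketch: log in to a Kerberized HDFS cluster from a keytab before making
// any filesystem call. Principal and keytab path are hypothetical examples.
public class KerberosClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
                "hdfsuser@EXAMPLE.COM",           // hypothetical principal
                "/etc/security/hdfsuser.keytab"); // hypothetical keytab
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/"))); // authenticated RPC call
    }
}
```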

A File Merging Scheme for Efficient Handling of Small Files in Hadoop Distributed File System (Hadoop Distribute file system에서 Small file을 효과적으로 처리하기 위한 파일 병합 기법 연구)

  • Park, Jong-Chang; Youn, Hee-Yong
    • Proceedings of the Korea Information Processing Society Conference / 2013.11a / pp.15-17 / 2013
  • HDFS (Hadoop Distributed File System) was designed for handling very large files and is currently in the spotlight as an ideal distributed file system. HDFS shares many similarities with existing distributed file systems, but it is distinguished by providing fault tolerance and supporting a streaming data access pattern, which lets it store large files efficiently. In practice, however, small files account for a considerable share of real HDFS data sets, and such masses of small files not only incur high data processing costs but also degrade the master node's file handling and memory performance. This paper therefore analyzes the impact of small files on HDFS and proposes a file merging scheme based on a local index file that can resolve these problems.
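
The paper's local-index format is not described in the abstract; the sketch below shows only the generic merge-with-index pattern such schemes build on: small files are appended into one large HDFS file while a (name, offset, length) index keeps each original file addressable. All paths are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch of the general merge-with-index idea: append small files into one
// large HDFS file and record (name -> offset, length) so each original file
// remains addressable. The paper's own index file format is not shown here.
public class IndexedMerger {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Map<String, long[]> index = new LinkedHashMap<>(); // name -> {offset, length}
        try (FSDataOutputStream out = fs.create(new Path("/merged/data.bin"))) {
            for (FileStatus st : fs.listStatus(new Path("/input/small"))) {
                long offset = out.getPos(); // where this file starts in the blob
                try (FSDataInputStream in = fs.open(st.getPath())) {
                    IOUtils.copyBytes(in, out, conf, false); // keep 'out' open
                }
                index.put(st.getPath().getName(), new long[]{offset, st.getLen()});
            }
        }
        // The index would be persisted alongside the merged file; printed here.
        index.forEach((name, ol) ->
                System.out.printf("%s offset=%d len=%d%n", name, ol[0], ol[1]));
    }
}
```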

Dynamic Cluster Management of Hadoop Distributed Filesystem (하둡 분산 파일시스템의 동적 클러스터 관리 기법)

  • Ryu, Wooseok
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2016.10a / pp.435-437 / 2016
  • The Hadoop Distributed File System (HDFS) is a file system for distributed big data processing that replicates data across distributed data nodes. An HDFS cluster scales well, up to thousands of nodes, but it assumes a dedicated cluster of numerous nodes reserved for big data processing; the various general-purpose worker machines used in offices are rarely considered part of the cluster. This paper discusses this problem and proposes a dynamic cluster management technique that increases the storage capacity and analytic performance of a Hadoop cluster. The proposed technique can add legacy systems to the cluster and remove them dynamically depending on their availability.

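As a hedged illustration of the kind of cluster state a dynamic manager would act on, the sketch below lists the live datanodes and their capacities through DistributedFileSystem.getDataNodeStats(); the paper's actual add/remove mechanism is not reproduced here.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

// Sketch: inspect the current datanode set, the kind of information a
// dynamic cluster manager would monitor before adding or removing nodes.
public class ClusterInspector {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            for (DatanodeInfo dn : dfs.getDataNodeStats()) {
                System.out.printf("%s capacity=%d remaining=%d%n",
                        dn.getHostName(), dn.getCapacity(), dn.getRemaining());
            }
        }
    }
}
```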

A Dynamic Data Replica Deletion Strategy on HDFS using HMM (HMM을 이용한 HDFS 기반 동적 데이터 복제본 삭제 전략)

  • Seo, Young-Ho; Youn, Hee-Yong
    • Proceedings of the Korean Society of Computer Information Conference / 2014.07a / pp.241-244 / 2014
  • This paper proposes a dynamic data replica deletion strategy using an HMM (Hidden Markov Model) to improve the replication policy that is a known problem in HDFS (Hadoop Distributed File System). HDFS is a distributed file system that can process large data sets effectively: it provides high fault tolerance and high-throughput data access, making it well suited to applications with very large data sets. Its replication mechanism improves the reliability and performance of the system, but the additional block replicas occupy a great deal of disk space and also increase maintenance costs. Using an HMM together with the Viterbi algorithm, which finds the most likely state sequence, we detect unnecessary data replicas and delete them, thereby saving HDFS disk space and maintenance cost.

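The abstract names the Viterbi algorithm as the decoding step. The sketch below is a compact, generic log-space Viterbi decoder with a toy two-state model (a replica being "hot" or "cold" given accessed/idle observations); the toy parameters are invented for illustration and are not the paper's trained model.

```java
// Generic Viterbi decoder sketch: finds the most likely hidden-state sequence
// for an observation sequence. The HMM parameters in main() are toy values,
// not the paper's model of replica access patterns.
public class Viterbi {
    static int[] viterbi(int[] obs, double[] start, double[][] trans, double[][] emit) {
        int n = obs.length, k = start.length;
        double[][] v = new double[n][k]; // log-probability of best path so far
        int[][] back = new int[n][k];    // backpointers for path recovery
        for (int s = 0; s < k; s++)
            v[0][s] = Math.log(start[s]) + Math.log(emit[s][obs[0]]);
        for (int t = 1; t < n; t++)
            for (int s = 0; s < k; s++) {
                double best = Double.NEGATIVE_INFINITY; int arg = 0;
                for (int p = 0; p < k; p++) {
                    double cand = v[t - 1][p] + Math.log(trans[p][s]);
                    if (cand > best) { best = cand; arg = p; }
                }
                v[t][s] = best + Math.log(emit[s][obs[t]]);
                back[t][s] = arg;
            }
        int[] path = new int[n];
        for (int s = 1; s < k; s++)
            if (v[n - 1][s] > v[n - 1][path[n - 1]]) path[n - 1] = s;
        for (int t = n - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }

    public static void main(String[] args) {
        // Toy model: states {0=hot, 1=cold}, observations {0=accessed, 1=idle}.
        double[] start = {0.6, 0.4};
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] emit = {{0.8, 0.2}, {0.1, 0.9}};
        int[] obs = {0, 0, 1, 1, 1}; // accessed, accessed, idle, idle, idle
        for (int s : viterbi(obs, start, trans, emit)) System.out.print(s + " ");
    }
}
```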

Sim-Hadoop : Leveraging Hadoop Distributed File System and Parallel I/O for Reliable and Efficient N-body Simulations (Sim-Hadoop : 신뢰성 있고 효율적인 N-body 시뮬레이션을 위한 Hadoop 분산 파일 시스템과 병렬 I / O)

  • Awan, Ammar Ahmad; Lee, Sungyoung; Chung, Tae Choong
    • Proceedings of the Korea Information Processing Society Conference / 2013.05a / pp.476-477 / 2013
  • Gadget-2 is a scientific simulation code that has been used for many different types of simulations, such as colliding galaxies, cluster formation, and the popular Millennium Simulation. The code is parallelized with the Message Passing Interface (MPI) and written in C. There is also a Java adaptation of the original code, called Java Gadget, written using MPJ Express. Java Gadget writes a large amount of checkpoint data, which may or may not use the HDF5 file format. Since HDF5 is MPI-IO compliant, our MPJ-IO library can perform parallel reading and writing of the checkpoint files and improve I/O performance. Additionally, to add reliability to code execution, we propose using the Hadoop Distributed File System (HDFS) for writing the intermediate data (checkpoint files) and final data (output files). The current code writes and reads the input, output, and checkpoint files sequentially, which can easily become a bottleneck for large-scale simulations. In this paper, we propose Sim-Hadoop, a framework that leverages HDFS and MPJ-IO to improve the I/O performance of the Java Gadget code.
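
Sim-Hadoop's MPJ-IO integration is more involved than an abstract can show; the basic move of redirecting checkpoint output from local disk to replicated HDFS storage looks roughly like the sketch below, with hypothetical paths and a placeholder payload.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: write simulation checkpoint bytes to HDFS instead of local disk so
// they are replicated for reliability. Paths and payload are hypothetical;
// Sim-Hadoop's actual MPJ-IO integration is more involved than this.
public class CheckpointWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // fs.defaultFS names the NameNode
        FileSystem fs = FileSystem.get(conf);
        byte[] snapshot = "particle positions...".getBytes(StandardCharsets.UTF_8);
        try (FSDataOutputStream out = fs.create(new Path("/sim/checkpoint_0001.bin"))) {
            out.write(snapshot); // replicated across datanodes by HDFS
        }
    }
}
```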

Efficient Multimedia Data File Management and Retrieval Strategy on Big Data Processing System

  • Lee, Jae-Kyung; Shin, Su-Mi; Kim, Kyung-Chang
    • Journal of the Korea Society of Computer and Information / v.20 no.8 / pp.77-83 / 2015
  • The storage and retrieval of multimedia data are becoming increasingly important in many application areas, including records management, video (CCTV) management, and the Internet of Things (IoT). In these applications, the number of multimedia files that must be stored and managed is tremendous and constantly growing. In this paper, we propose a technique for retrieving very large numbers of multimedia files using the Hadoop framework. Our strategy is based on managing metadata that describes the characteristics of the files stored in the Hadoop Distributed File System (HDFS). The metadata schema is represented in HBase and queried using SQL-on-Hadoop engines (Hive, Tajo); HBase, Hive, and Tajo are all part of the Hadoop ecosystem. A preliminary experiment on multimedia data files stored in HDFS shows the viability of the proposed strategy.
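
As a hedged sketch of the metadata pattern described above, the code below stores one HBase row per media file, keyed by its HDFS path, so that SQL-on-Hadoop engines can later query it; the table and column names are hypothetical, not the paper's schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: store per-file metadata in HBase, keyed by the file's HDFS path,
// so SQL-on-Hadoop engines can query it. Table and column names are
// hypothetical; the paper's schema is not reproduced here.
public class MediaMetadataStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("media_meta"))) {
            Put put = new Put(Bytes.toBytes("/video/cam01/20150801.mp4"));
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("type"),
                    Bytes.toBytes("video/mp4"));
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("duration_s"),
                    Bytes.toBytes("3600"));
            table.put(put); // one row of metadata per media file
        }
    }
}
```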

A Parallel HDFS and MapReduce Functions for Emotion Analysis (감성분석을 위한 병렬적 HDFS와 맵리듀스 함수)

  • Back, BongHyun; Ryoo, Yun-Kyoo
    • Journal of the Korea society of information convergence / v.7 no.2 / pp.49-57 / 2014
  • Recently, opinion mining has been introduced to extract useful information from SNS data and to evaluate users' true intentions. Opinion mining requires efficient techniques for collecting and analyzing large amounts of SNS data and extracting meaningful information from it. In this paper, we therefore propose a parallel HDFS (Hadoop Distributed File System) and MapReduce-based emotion functions that extract users' emotional information from the various kinds of unstructured big data on social networks. The experimental results verify that the proposed system and functions perform data gathering and loading in better than O(n) time and maintain stable load balancing of memory and CPU resources.

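The paper's emotion functions are not given in the abstract; as a generic stand-in, the sketch below is a MapReduce mapper that scores each SNS post against a tiny placeholder sentiment lexicon and emits a positive/negative tag.

```java
import java.io.IOException;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: a MapReduce mapper that tags each SNS post as positive or negative
// by counting lexicon hits. The tiny lexicon is a placeholder; the paper's
// emotion functions are not reproduced here.
public class EmotionMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Set<String> POSITIVE = Set.of("good", "love", "great");
    private static final Set<String> NEGATIVE = Set.of("bad", "hate", "awful");
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        int score = 0; // net sentiment of this post
        for (String tok : value.toString().toLowerCase().split("\\s+")) {
            if (POSITIVE.contains(tok)) score++;
            if (NEGATIVE.contains(tok)) score--;
        }
        ctx.write(new Text(score >= 0 ? "positive" : "negative"), ONE);
    }
}
```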

Secure Authentication Protocol in Hadoop Distributed File System based on Hash Chain (해쉬 체인 기반의 안전한 하둡 분산 파일 시스템 인증 프로토콜)

  • Jeong, So Won; Kim, Kee Sung; Jeong, Ik Rae
    • Journal of the Korea Institute of Information Security & Cryptology / v.23 no.5 / pp.831-847 / 2013
  • Large quantities of data of various types are being created as a result of the spread of social media and the popularization of mobile devices. Many companies want to obtain valuable business information by analyzing these large data sets, so integrating big data technologies into company work has become a trend. Hadoop in particular is regarded as the most representative big data technology thanks to its terabytes of storage capacity, inexpensive construction cost, and fast data processing speed. However, the authentication token system that the Hadoop Distributed File System (HDFS) uses for user authentication is currently vulnerable to replay attacks and datanode hacking attacks, which can expose company secrets or customers' personal information stored on HDFS. In this paper, we analyze the possible security threats to HDFS when tokens or datanodes are exposed to attackers, and we propose a secure authentication protocol for HDFS based on a hash chain.
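
The full protocol is not specified in the abstract, but its core primitive, a hash chain, is easy to illustrate: values are generated by repeated hashing, the verifier keeps only the final anchor, and earlier chain values are later revealed in reverse order and checked with a single hash each. A minimal sketch follows, with a hypothetical seed.

```java
import java.security.MessageDigest;
import java.util.Arrays;

// Sketch of the hash-chain primitive: h_i = H(h_{i-1}). The verifier stores
// only the final anchor h_n; the prover later reveals h_{n-1}, h_{n-2}, ...,
// each checked by hashing once. This illustrates the building block, not the
// paper's full HDFS authentication protocol.
public class HashChain {
    static byte[] sha256(byte[] in) throws Exception {
        return MessageDigest.getInstance("SHA-256").digest(in);
    }

    public static void main(String[] args) throws Exception {
        int n = 5;
        byte[][] chain = new byte[n + 1][];
        chain[0] = "secret-seed".getBytes("UTF-8"); // hypothetical seed
        for (int i = 1; i <= n; i++) chain[i] = sha256(chain[i - 1]);
        byte[] anchor = chain[n]; // the only value the verifier stores

        // Prover reveals chain[n-1]; verifier checks H(revealed) == anchor.
        byte[] revealed = chain[n - 1];
        System.out.println(Arrays.equals(sha256(revealed), anchor)); // true
        anchor = revealed; // anchor advances for the next authentication round
    }
}
```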