• Title/Summary/Keyword: HDFS(Hadoop Distributed File System)

Search Result 54, Processing Time 0.023 seconds

A Dynamic Prefetchiong Scheme for Handling Small Files based on Hadoop Distributed File System (하둡 분산 파일 시스템 기반 소용량 파일 처리를 위한 동적 프리페칭 기법)

  • Yoo, Sang-Hyun;Youn, Hee-Yong
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2014.07a
    • /
    • pp.329-332
    • /
    • 2014
  • 클라우드 컴퓨팅이 활성화 됨에 따라 기존의 파일 시스템과는 다른 대용량 파일 처리에 효율적인 분산파일시스템의 요구가 대두 되었다. 그 중에 하둡 분산 파일 시스템(Hadoop Distribute File System, HDFS)은 기존의 분산파일 시스템과는 달리 가용성과 내고장성을 보장하고, 데이터 접근 패턴을 스트리밍 방식으로 지원하여 대용량 파일을 효율적으로 저장할 수 있다. 이러한 장점 때문에, 클라우드 컴퓨팅의 파일시스템으로 대부분 채택하고 있다. 하지만 실제 HDFS 데이터 집합에서 대용량 파일 보다 소용량 파일이 차지하는 비율이 높으며, 이러한 다수의 소 용량 파일은 데이터 처리에 있어 높은 처리비용을 초래 할 뿐 만 아니라 메모리 성능에 악영향을 끼친다. 하지만 소 용량 파일을 프리패칭 함으로서 이러한 문제점을 해결 할 수 있다. HDFS의 데이터 프리페칭은 기존의 데이터 프리페칭의 기법으로는 적용하기 어려워 HDFS를 위한 데이터 프리패칭 기법을 제안한다.

  • PDF

Design of a Sentiment Analysis System to Prevent School Violence and Student's Suicide (학교폭력과 자살사고를 예방하기 위한 감성분석 시스템의 설계)

  • Kim, YoungTaek
    • The Journal of Korean Association of Computer Education
    • /
    • v.17 no.6
    • /
    • pp.115-122
    • /
    • 2014
  • One of the problems with current youth generations is increasing rate of violence and suicide in their school lives, and this study aims at the design of a sentiment analysis system to prevent suicide by uising big data process. The main issues of the design are economical implementation, easy and fast processing for the users, so, the open source Hadoop system with MapReduce algorithm is used on the HDFS(Hadoop Distributed File System) for the experimentation. This study uses word count method to do the sentiment analysis with informal data on some sns communications concerning a kinds of violent words, in terms of text mining to avoid some expensive and complex statistical analysis methods.

  • PDF

A Performance Analysis Based on Hadoop Application's Characteristics in Cloud Computing (클라우드 컴퓨팅에서 Hadoop 애플리케이션 특성에 따른 성능 분석)

  • Keum, Tae-Hoon;Lee, Won-Joo;Jeon, Chang-Ho
    • Journal of the Korea Society of Computer and Information
    • /
    • v.15 no.5
    • /
    • pp.49-56
    • /
    • 2010
  • In this paper, we implement a Hadoop based cluster for cloud computing and evaluate the performance of this cluster based on application characteristics by executing RandomTextWriter, WordCount, and PI applications. A RandomTextWriter creates given amount of random words and stores them in the HDFS(Hadoop Distributed File System). A WordCount reads an input file and determines the frequency of a given word per block unit. PI application induces PI value using the Monte Carlo law. During simulation, we investigate the effect of data block size and the number of replications on the execution time of applications. Through simulation, we have confirmed that the execution time of RandomTextWriter was proportional to the number of replications. However, the execution time of WordCount and PI were not affected by the number of replications. Moreover, the execution time of WordCount was optimum when the block size was 64~256MB. Therefore, these results show that the performance of cloud computing system can be enhanced by using a scheduling scheme that considers application's characteristics.

Delayed Block Replication Scheme of Hadoop Distributed File System for Flexible Management of Distributed Nodes (하둡 분산 파일시스템에서의 유연한 노드 관리를 위한 지연된 블록 복제 기법)

  • Ryu, Woo-Seok
    • The Journal of the Korea institute of electronic communication sciences
    • /
    • v.12 no.2
    • /
    • pp.367-374
    • /
    • 2017
  • This paper discusses management problems of Hadoop distributed node, which is a platform for big data processing, and proposes a novel technique for enabling flexible node management of Hadoop Distributed File System. Hadoop cannot configure Hadoop cluster dynamically because it judges temporarily unavailable nodes as a failure. Delayed block replication scheme proposed in this paper delays the removal of unavailable node as much as possible so as to be easily rejoined. Experimental results show that the proposed scheme increases flexibility of node management with little impact on distributed processing performance when the cluster size changes.

A Study on Phon Call Big Data Analytics (전화통화 빅데이터 분석에 관한 연구)

  • Kim, Jeongrae;Jeong, Chanki
    • Journal of Information Technology and Architecture
    • /
    • v.10 no.3
    • /
    • pp.387-397
    • /
    • 2013
  • This paper proposes an approach to big data analytics for phon call data. The analytical models for phon call data is composed of the PVPF (Parallel Variable-length Phrase Finding) algorithm for identifying verbal phrases of natural language and the word count algorithm for measuring the usage frequency of keywords. In the proposed model, we identify words using the PVPF algorithm, and measure the usage frequency of the identified words using word count algorithm in MapReduce. The results can be interpreted from various viewpoints. We design and implement the model based HDFS (Hadoop Distributed File System), verify the proposed approach through a case study of phon call data. So we extract useful results through analysis of keyword correlation and usage frequency.

Access efficiency of small sized files in Big Data using various Techniques on Hadoop Distributed File System platform

  • Alange, Neeta;Mathur, Anjali
    • International Journal of Computer Science & Network Security
    • /
    • v.21 no.7
    • /
    • pp.359-364
    • /
    • 2021
  • In recent years Hadoop usage has been increasing day by day. The need of development of the technology and its specified outcomes are eagerly waiting across globe to adopt speedy access of data. Need of computers and its dependency is increasing day by day. Big data is exponentially growing as the entire world is working in online mode. Large amount of data has been produced which is very difficult to handle and process within a short time. In present situation industries are widely using the Hadoop framework to store, process and produce at the specified time with huge amount of data that has been put on the server. Processing of this huge amount of data having small files & its storage optimization is a big problem. HDFS, Sequence files, HAR, NHAR various techniques have been already proposed. In this paper we have discussed about various existing techniques which are developed for accessing and storing small files efficiently. Out of the various techniques we have specifically tried to implement the HDFS- HAR, NHAR techniques.

LDBAS: Location-aware Data Block Allocation Strategy for HDFS-based Applications in the Cloud

  • Xu, Hua;Liu, Weiqing;Shu, Guansheng;Li, Jing
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.12 no.1
    • /
    • pp.204-226
    • /
    • 2018
  • Big data processing applications have been migrated into cloud gradually, due to the advantages of cloud computing. Hadoop Distributed File System (HDFS) is one of the fundamental support systems for big data processing on MapReduce-like frameworks, such as Hadoop and Spark. Since HDFS is not aware of the co-location of virtual machines in the cloud, the default scheme of block allocation in HDFS does not fit well in the cloud environments behaving in two aspects: data reliability loss and performance degradation. In this paper, we present a novel location-aware data block allocation strategy (LDBAS). LDBAS jointly optimizes data reliability and performance for upper-layer applications by allocating data blocks according to the locations and different processing capacities of virtual nodes in the cloud. We apply LDBAS to two stages of data allocation of HDFS in the cloud (the initial data allocation and data recovery), and design the corresponding algorithms. Finally, we implement LDBAS into an actual Hadoop cluster and evaluate the performance with the benchmark suite BigDataBench. The experimental results show that LDBAS can guarantee the designed data reliability while reducing the job execution time of the I/O-intensive applications in Hadoop by 8.9% on average and up to 11.2% compared with the original Hadoop in the cloud.

Design on the IoT Sensor Data Collection Envionment using Lambda Architecture (Lambda 구조를 적용한 IoT 센서 데이터 수집 환경 설계)

  • Hwang, Yun-Young;Kim, Soo-Hyun;Shin, Yong-Tae
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2020.07a
    • /
    • pp.547-548
    • /
    • 2020
  • 데이터의 양은 기술의 발전과 함께 크게 증가하였다. Hadoop은 빅데이터 분야에서 사용되는 대표적인 빅데이터 처리 플랫폼으로 IoT 분야에서도 사용된다. HDFS(Haddop Distributed File System)는 Hadoop의 코어 프로젝트로 블록 기반의 대용량 데이터 저장소다. 기존의 Hadoop 기반 IoT 센서 데이터 수집 환경은 HDFS를 사용한다. 그러나 HDFS의 Small File로 인한 네임노드의 과부하 문제와 한 번 Import된 데이터의 Update와 Delete를 지원하지 않는 Hadoop의 특징으로 인해 성능과 활용이 제한적이다. 본 논문에서는 기존 Hadoop 기반 IoT 센서 데이터 수집 환경의 단점을 극복하기 위해 Lambda 구조를 적용한 IoT 센서 데이터 수집 환경을 설계한다.

  • PDF

Design of Distributed Cloud System for Managing large-scale Genomic Data

  • Seine Jang;Seok-Jae Moon
    • International Journal of Internet, Broadcasting and Communication
    • /
    • v.16 no.2
    • /
    • pp.119-126
    • /
    • 2024
  • The volume of genomic data is constantly increasing in various modern industries and research fields. This growth presents new challenges and opportunities in terms of the quantity and diversity of genetic data. In this paper, we propose a distributed cloud system for integrating and managing large-scale gene databases. By introducing a distributed data storage and processing system based on the Hadoop Distributed File System (HDFS), various formats and sizes of genomic data can be efficiently integrated. Furthermore, by leveraging Spark on YARN, efficient management of distributed cloud computing tasks and optimal resource allocation are achieved. This establishes a foundation for the rapid processing and analysis of large-scale genomic data. Additionally, by utilizing BigQuery ML, machine learning models are developed to support genetic search and prediction, enabling researchers to more effectively utilize data. It is expected that this will contribute to driving innovative advancements in genetic research and applications.

A Distributed Cache Management Scheme for Efficient Accesses of Small Files in HDFS (HDFS에서 소형 파일의 효율적인 접근을 위한 분산 캐시 관리 기법)

  • Oh, Hyunkyo;Kim, Kiyeon;Hwang, Jae-Min;Park, Junho;Lim, Jongtae;Bok, Kyoungsoo;Yoo, Jaesoo
    • The Journal of the Korea Contents Association
    • /
    • v.14 no.11
    • /
    • pp.28-38
    • /
    • 2014
  • In this paper, we propose the distributed cache management scheme to efficiently access small files in Hadoop Distributed File Systems(HDFS). The proposed scheme can reduce the number of metadata managed by a name node since many small files are merged and stored in a chunk. It is also possible to reduce the file access costs, by keeping the information of requested files using the client cache and data node caches. The client cache keeps small files that a user requests and metadata. Each data node cache keeps the small files that are frequently requested by users. It is shown through performance evaluation that the proposed scheme significantly reduces the processing time over the existing scheme.