• Title/Summary/Keyword: HADOOP

A Study On Recommend System Using Co-occurrence Matrix and Hadoop Distribution Processing (동시발생 행렬과 하둡 분산처리를 이용한 추천시스템에 관한 연구)

  • Kim, Chang-Bok;Chung, Jae-Pil
    • Journal of Advanced Navigation Technology / v.18 no.5 / pp.468-475 / 2014
  • Real-time recommendation is becoming harder for recommender systems owing to larger preference data sets and to the demands on computing power and recommendation algorithms. For this reason, distributed processing of large preference data sets is being actively studied. This paper studies a distributed processing method for large preference data sets using the Hadoop distributed processing platform and the Mahout machine learning library. The recommendation algorithm uses a co-occurrence matrix, which is similar to item-based collaborative filtering. The co-occurrence matrix can be processed in a distributed manner across many nodes of a Hadoop cluster; it requires a large amount of computation, but distributed processing reduces that burden. This paper simplifies the distributed processing of the co-occurrence matrix by reducing it from four stages to three. As a result, the number of MapReduce jobs is reduced while the recommendation file is still generated, processing is faster, and the map output data is smaller.
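
The co-occurrence approach described in this abstract can be illustrated on a single machine. The following is a minimal sketch, not the authors' MapReduce pipeline and not the Mahout API: the function names and the toy preference data are assumptions made for illustration. It counts how often two items appear in the same user's preferences and scores unseen items by summing those counts.

```python
from collections import defaultdict
from itertools import permutations


def build_cooccurrence(prefs):
    """Count, for each ordered item pair, how many users prefer both items.

    prefs: dict mapping a user id to the set of items that user prefers.
    """
    cooc = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for a, b in permutations(sorted(items), 2):
            cooc[a][b] += 1
    return cooc


def recommend(prefs, cooc, user, top_n=3):
    """Score each unseen item by its co-occurrence with the user's items."""
    seen = prefs[user]
    scores = defaultdict(int)
    for item in seen:
        for other, count in cooc[item].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Hypothetical toy data, not taken from the paper.
prefs = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"B", "C", "D"},
    "u4": {"A", "D"},
}
cooc = build_cooccurrence(prefs)
print(recommend(prefs, cooc, "u2"))
```

In a distributed setting, the pair counting would typically be a map step keyed by item pair and the summation a reduce step, which is why the matrix lends itself to processing across cluster nodes.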

Shared Distributed Big-Data Processing Platform Model: a Study (대용량 분산처리 플랫폼 공유 모델 연구)

  • Jeong, Hwanjin;Kang, Taeho;Kim, GyuSeok;Shin, YoungHo;Jeong, Jinkyu
    • KIISE Transactions on Computing Practices / v.22 no.11 / pp.601-613 / 2016
  • With the increasing need for big data processing, building a shared big data processing platform is important to minimize time and monetary costs. In shared big data processing, multi-tenancy is a major requirement: each user should get an isolated personal big data platform, while the underlying hardware is shared among users to increase hardware utilization. In this paper, we explore two well-known shared big data processing platform models. One uses a native Hadoop cluster, and the other builds a virtual Hadoop cluster for each user. For each model, we verified whether it is sufficient to support multi-tenancy. We also present a method to complement the multi-tenancy features that a native Hadoop cluster model does not support. Lastly, we built prototype platforms and compared the performance of both models.

A Study on the Data Collection Methods based Hadoop Distributed Environment (하둡 분산 환경 기반의 데이터 수집 기법 연구)

  • Jin, Go-Whan
    • Journal of the Korea Convergence Society / v.7 no.5 / pp.1-6 / 2016
  • Many studies have recently been carried out on big data utilization and analysis technology. Government agencies and companies are increasingly adopting Hadoop as a processing platform for analyzing big data. Along with the growing interest in processing and analyzing big data, data collection technology has become a major issue. However, compared with research on data analysis techniques, research on collection technology remains scarce. Therefore, in this paper, we build a Hadoop cluster as a big data analysis platform and collect structured data from relational databases through Apache Sqoop. In addition, we provide a system that uses Apache Flume to collect, as streams, unstructured data such as sensor data, web application data files, and log files. Data collected through this convergence could serve as basic material for big data analysis.

Implement of MapReduce-based Big Data Processing Scheme for Reducing Big Data Processing Delay Time and Store Data (빅데이터 처리시간 감소와 저장 효율성이 향상을 위한 맵리듀스 기반 빅데이터 처리 기법 구현)

  • Lee, Hyeopgeon;Kim, Young-Woon;Kim, Ki-Young
    • Journal of the Korea Convergence Society / v.9 no.10 / pp.13-19 / 2018
  • MapReduce, a core technology of Hadoop, is most commonly used to process big data stored on the Hadoop distributed file system. However, existing MapReduce-based big data processing techniques divide and store files in blocks of a size predefined in the Hadoop distributed file system, which wastes substantial infrastructure resources. Therefore, in this paper, we propose an efficient MapReduce-based big data processing scheme. The proposed method enhances the storage efficiency of a big data infrastructure environment by converting the data to be processed, in advance, into a format suitable for MapReduce and compressing it. In addition, the proposed method solves the processing delay that arises when an implementation focuses on storage efficiency.

Big data distributed processing system using RHadoop (RHadoop을 이용한 빅데이터 분산처리 시스템)

  • Shin, Ji Eun;Jung, Byung Ho;Lim, Dong Hoon
    • Journal of the Korean Data and Information Science Society / v.26 no.5 / pp.1155-1166 / 2015
  • It is almost impossible to store or analyze exponentially growing big data with traditional technologies; Hadoop is a newer technology that makes this possible. Recently, R has been used as an engine for big data analysis based on distributed processing with Hadoop. Using RHadoop, which integrates the R and Hadoop environments, we implemented parallel multiple regression analysis on real and simulated data of various sizes. Experimental results showed that our RHadoop system became faster as the number of data nodes increased. We also compared the performance of our RHadoop system with the lm function and the biglm package available on bigmemory. The results showed that RHadoop was faster than the other packages owing to parallel processing, with the number of map tasks increasing as the data size increases.
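
One common way to parallelize multiple regression in a map-reduce style is to have each map task compute the partial sums X'X and X'y for its chunk, add them in the reduce step, and solve the normal equations once. The abstract does not state which decomposition the authors used, and their implementation is in R, so the Python sketch below is only an assumption-laden illustration; all function names and the synthetic data are hypothetical.

```python
from functools import reduce


def map_chunk(rows):
    """Map step: partial X'X and X'y for one chunk (intercept column added)."""
    p = len(rows[0][0]) + 1
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for features, y in rows:
        x = [1.0] + list(features)
        for i in range(p):
            xty[i] += x[i] * y
            for j in range(p):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def combine(a, b):
    """Reduce step: element-wise sum of two partial results."""
    xtx = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [u + v for u, v in zip(a[1], b[1])]
    return xtx, xty


def solve(xtx, xty):
    """Solve (X'X) beta = X'y by Gauss-Jordan elimination with pivoting."""
    n = len(xty)
    m = [row[:] + [rhs] for row, rhs in zip(xtx, xty)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Synthetic data generated from y = 1 + 2*x1 + 3*x2, split into four chunks.
data = [((x1, x2), 1 + 2 * x1 + 3 * x2) for x1 in range(4) for x2 in range(4)]
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]
xtx, xty = reduce(combine, map(map_chunk, chunks))
coef = solve(xtx, xty)
print(coef)  # approximately [1.0, 2.0, 3.0], up to floating-point error
```

Because the partial sums are small (p by p) regardless of chunk size, adding map tasks as the data grows increases parallelism without increasing the amount of data passed to the reduce step.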

Secure Authentication Protocol in Hadoop Distributed File System based on Hash Chain (해쉬 체인 기반의 안전한 하둡 분산 파일 시스템 인증 프로토콜)

  • Jeong, So Won;Kim, Kee Sung;Jeong, Ik Rae
    • Journal of the Korea Institute of Information Security & Cryptology / v.23 no.5 / pp.831-847 / 2013
  • Various types of data are being created in large quantities as a result of the spread of social media and the popularization of mobile devices. Many companies want to obtain valuable business information by analyzing these large data sets. As a result, there is a trend toward integrating big data technologies into company work. In particular, Hadoop is regarded as the most representative big data technology due to its terabyte-scale storage capacity, inexpensive construction cost, and fast data processing speed. However, the token system that the Hadoop Distributed File System (HDFS) uses for user authentication is currently vulnerable to replay attacks and datanode hacking attacks. This can expose company secrets or customers' personal information stored on HDFS. In this paper, we analyze the possible security threats to HDFS when tokens or datanodes are exposed to attackers. Finally, we propose a secure authentication protocol for HDFS based on a hash chain.
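
To show why a hash chain resists replay, here is a minimal sketch of a generic Lamport-style one-time token scheme. It is not the protocol proposed in the paper, whose details the abstract does not give; the class name, seed, and chain length are assumptions for illustration. The verifier stores only the last accepted value, and each new token must hash to it.

```python
import hashlib
import hmac


def h(data: bytes) -> bytes:
    """One application of the chain's hash function (SHA-256 here)."""
    return hashlib.sha256(data).digest()


def build_chain(seed: bytes, length: int) -> list:
    """Return [seed, h(seed), h(h(seed)), ...] with length + 1 entries."""
    chain = [seed]
    for _ in range(length):
        chain.append(h(chain[-1]))
    return chain


class Verifier:
    """Hypothetical server side: keeps only the most recently accepted value."""

    def __init__(self, anchor: bytes):
        self.current = anchor

    def verify(self, token: bytes) -> bool:
        # A valid token is the preimage of the stored value.
        if hmac.compare_digest(h(token), self.current):
            self.current = token  # the used token can never be accepted again
            return True
        return False


chain = build_chain(b"client-secret-seed", 5)
server = Verifier(chain[-1])        # server is given the end of the chain
first = server.verify(chain[4])     # client reveals the previous link
replay = server.verify(chain[4])    # an attacker replays the same token
second = server.verify(chain[3])    # client reveals the next link back
print(first, replay, second)        # True False True
```

A captured token is useless afterward because the verifier has already moved one step down the chain, and computing the next valid token would require inverting the hash.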

A Hot-Data Replication Scheme Based on Data Access Patterns for Enhancing Processing Speed of MapReduce (맵-리듀스의 처리 속도 향상을 위한 데이터 접근 패턴에 따른 핫-데이터 복제 기법)

  • Son, Ingook;Ryu, Eunkyung;Park, Junho;Bok, Kyoungsoo;Yoo, Jaesoo
    • The Journal of the Korea Contents Association / v.13 no.11 / pp.21-27 / 2013
  • In recent years, with the growth of social media and the development of mobile devices, the volume of data has increased significantly. Hadoop has been widely used as a typical distributed storage and processing framework. MapReduce tasks running on the Hadoop distributed file system are allocated to map tasks as close to the data as possible, taking data locality into account. However, some data are requested frequently, depending on the MapReduce data analysis tasks. In this paper, we propose a hot-data replication mechanism that improves the processing speed of MapReduce according to data access patterns. The proposed scheme reduces task processing time and improves data locality by applying a replica optimization algorithm to frequently accessed hot data. Performance evaluation shows that the proposed scheme outperforms the existing scheme in terms of access-frequency load.
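
The idea of replicating hot data more aggressively can be sketched as a simple policy: count accesses per block and grant extra replicas to frequently accessed blocks, up to a cap. This is an illustrative stand-in, not the replica optimization algorithm from the paper; the function name, thresholds, and access log are assumptions.

```python
from collections import Counter


def plan_replication(access_log, base=3, max_replicas=6, hot_threshold=3):
    """Return a target replica count per block from an access log.

    Each block starts at `base` replicas and gains one extra replica per
    `hot_threshold` accesses, capped at `max_replicas`.
    """
    counts = Counter(access_log)
    plan = {}
    for block, n in counts.items():
        extra = n // hot_threshold
        plan[block] = min(base + extra, max_replicas)
    return plan


# Hypothetical access log of block ids.
log = ["b1", "b2", "b1", "b1", "b3", "b1", "b1", "b1", "b2"]
plan = plan_replication(log)
print(plan)  # {'b1': 5, 'b2': 3, 'b3': 3}
```

More replicas of a hot block give the scheduler more nodes on which a map task can run locally, which is the link between replication and data locality that the abstract describes.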

Design and Implementation of a Hadoop-based Efficient Security Log Analysis System (하둡 기반의 효율적인 보안로그 분석시스템 설계 및 구현)

  • Ahn, Kwang-Min;Lee, Jong-Yoon;Yang, Dong-Min;Lee, Bong-Hwan
    • Journal of the Korea Institute of Information and Communication Engineering / v.19 no.8 / pp.1797-1804 / 2015
  • An integrated log management system can help predict security risks, contributes to improving an organization's security level, and supports the preparation of an appropriate security policy. In this paper, we designed and implemented a Hadoop-based log analysis system using a distributed database model that can store large amounts of data and reduces analysis time by automating the log collection procedure. In the proposed system, we use HBase to store large amounts of data efficiently in a scale-out fashion and propose a storage scheme that makes data easy to analyze using Hadoop-based regular expressions, which improves data processing speed compared to the existing system.
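
As a rough illustration of regular-expression-based log parsing ahead of storage, the sketch below extracts named fields from one log line and derives a row key. The log format, field names, and row-key layout are all hypothetical; the abstract does not describe the authors' actual log schema or HBase table design.

```python
import re

# Hypothetical security-log format: "<timestamp> <source ip> <action> <message>"
LOG_PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<src>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<action>ALLOW|DENY) "
    r"(?P<msg>.*)"
)


def parse_line(line):
    """Return (row_key, fields) for a matching line, or None otherwise."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Assumed row-key layout: source address first, then timestamp.
    row_key = f"{fields['src']}|{fields['ts']}"
    return row_key, fields


row = parse_line("2015-08-01 12:00:03 10.0.0.7 DENY port scan detected")
print(row)
```

Parsing lines into named fields at ingestion time means later analysis can filter on columns such as the action or source address instead of rescanning raw text.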

An Analysis of Factors Affecting Quality of Life through the Analysis of Public Health Big Data (클라우드 기반의 공개의료 빅데이터 분석을 통한 삶의 질에 영향을 미치는 요인분석)

  • Kim, Min-kyoung;Cho, Young-bok
    • Journal of the Korea Institute of Information and Communication Engineering / v.22 no.6 / pp.835-841 / 2018
  • In this study, we analyzed public health data using Hadoop-based Spark in a cloud environment, drawing on Community Health Survey data from 2012 to 2014, to identify the factors affecting quality of life. We constructed a cloud manager that supports parallel processing with Hadoop-based Spark for open medical big data analysis, and quickly analyzed the factors affecting individual "quality of life" in open medical big data without hardware restrictions. The effects of public health data on health-related quality of life were classified into personal characteristics and community characteristics, and multilevel regression analysis (ANOVA, t-test) was performed. As a result of the experiment, the quality-of-life scores were 73.8 points for men and 70.0 points for women, indicating that men had higher health-related quality of life than women.

Data Transmitting and Storing Scheme based on Bandwidth in Hadoop Cluster (하둡 클러스터의 대역폭을 고려한 압축 데이터 전송 및 저장 기법)

  • Kim, Youngmin;Kim, Heejin;Kim, Younggwan;Hong, Jiman
    • Smart Media Journal / v.8 no.4 / pp.46-52 / 2019
  • The size of data generated and collected at industrial sites and in public institutions is growing rapidly. Existing data processing servers often handle growing data by scaling up to increase performance. However, in the big data era, when data is generated at an explosive rate, there is a limit to data processing with a conventional server. To overcome such limitations, distributed cluster computing systems have been introduced that distribute data in a scale-out manner. However, because distributed cluster computing systems distribute data, inefficient use of network bandwidth can degrade the performance of the cluster as a whole. In this paper, we propose a scheme that compresses data for transmission in a Hadoop cluster, taking network bandwidth into account. The proposed scheme considers the network bandwidth and the characteristics of the compression algorithm, and selects the optimal compression transmission scheme before transmission. Experimental results show that the proposed scheme reduces data transfer time and size.
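
The trade-off behind bandwidth-aware compression can be sketched with a simple cost model: estimated time equals compression time plus the time to send the compressed bytes, and the sender picks the option with the lowest estimate. This is an illustrative model under stated assumptions, not the selection scheme from the paper; the codec names, ratios, and throughput figures are invented, and decompression cost is ignored.

```python
def transfer_time(size_mb, bandwidth_mb_s, ratio=1.0, compress_mb_s=None):
    """Estimated seconds to (optionally) compress and then send the data."""
    compress = size_mb / compress_mb_s if compress_mb_s else 0.0
    return compress + (size_mb * ratio) / bandwidth_mb_s


def choose_codec(size_mb, bandwidth_mb_s, codecs):
    """Pick the codec, or "none", with the lowest estimated total time."""
    best_name = "none"
    best_time = transfer_time(size_mb, bandwidth_mb_s)
    for name, (ratio, speed) in codecs.items():
        t = transfer_time(size_mb, bandwidth_mb_s, ratio, speed)
        if t < best_time:
            best_name, best_time = name, t
    return best_name, best_time


# Hypothetical profiles: name -> (compressed size ratio, compression MB/s).
CODECS = {"fast": (0.6, 400.0), "dense": (0.3, 50.0)}

slow_link = choose_codec(1000, 10, CODECS)     # 1000 MB over a 10 MB/s link
fast_link = choose_codec(1000, 1000, CODECS)   # 1000 MB over a 1000 MB/s link
print(slow_link, fast_link)  # ('dense', 50.0) ('none', 1.0)
```

On the slow link the heavier codec wins because transmission dominates, while on the fast link compression costs more time than it saves, which is why the choice has to depend on the measured bandwidth.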