• Title/Summary/Keyword: Apache Hadoop

Search Result 41, Processing Time 0.026 seconds

Anomaly Detection of Hadoop Log Data Using Moving Average and 3-Sigma (이동 평균과 3-시그마를 이용한 하둡 로그 데이터의 이상 탐지)

  • Son, Siwoon;Gil, Myeong-Seon;Moon, Yang-Sae;Won, Hee-Sun
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.5 no.6
    • /
    • pp.283-288
    • /
    • 2016
  • In recent years, there have been many research efforts on Big Data, and many companies developed a variety of relevant products. Accordingly, we are able to store and analyze a large volume of log data, which have been difficult to be handled in the traditional computing environment. To handle a large volume of log data, which rapidly occur in multiple servers, in this paper we design a new data storage architecture to efficiently analyze those big log data through Apache Hive. We then design and implement anomaly detection methods, which identify abnormal status of servers from log data, based on moving average and 3-sigma techniques. We also show effectiveness of the proposed detection methods by demonstrating that our methods identifies anomalies correctly. These results show that our anomaly detection is an excellent approach for properly detecting anomalies from Hadoop log data.

External Merge Sorting in Tajo with Variable Server Configuration (매개변수 환경설정에 따른 타조의 외부합병정렬 성능 연구)

  • Lee, Jongbaeg;Kang, Woon-hak;Lee, Sang-won
    • Journal of KIISE
    • /
    • v.43 no.7
    • /
    • pp.820-826
    • /
    • 2016
  • There is a growing requirement for big data processing which extracts valuable information from a large amount of data. The Hadoop system employs the MapReduce framework to process big data. However, MapReduce has limitations such as inflexible and slow data processing. To overcome these drawbacks, SQL query processing techniques known as SQL-on-Hadoop were developed. Apache Tajo, one of the SQL-on-Hadoop techniques, was developed by a Korean development group. External merge sort is one of the heavily used algorithms in Tajo for query processing. The performance of external merge sort in Tajo is influenced by two parameters, sort buffer size and fanout. In this paper, we analyzed the performance of external merge sort in Tajo with various sort buffer sizes and fanouts. In addition, we figured out that there are two major causes of differences in the performance of external merge sort: CPU cache misses which increase as the sort buffer size grows; and the number of merge passes determined by fanout.

Open-Source-leveraged Modeling for Marine Environment Monitoring (해양환경 모니터링을 위한 오픈소스 기반 모델링)

  • Park, Sun;Cha, ByungRae;Kwon, JinCheol;Kim, JongWon
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2017.10a
    • /
    • pp.716-717
    • /
    • 2017
  • In this paper, we propose a modeling approach for marine environment monitoring by leveraging open-source software to link IoT and Cloud together. The proposed model can be scale out by employing Apache Hadoop-based time-series database so that it can handle collected data increase with a resource pool of the same computers. It can also support the analyze monitored data of marine environment by visualizing collected data.

  • PDF

Sequential Pattern Mining for Intrusion Detection System with Feature Selection on Big Data

  • Fidalcastro, A;Baburaj, E
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.11 no.10
    • /
    • pp.5023-5038
    • /
    • 2017
  • Big data is an emerging technology which deals with wide range of data sets with sizes beyond the ability to work with software tools which is commonly used for processing of data. When we consider a huge network, we have to process a large amount of network information generated, which consists of both normal and abnormal activity logs in large volume of multi-dimensional data. Intrusion Detection System (IDS) is required to monitor the network and to detect the malicious nodes and activities in the network. Massive amount of data makes it difficult to detect threats and attacks. Sequential Pattern mining may be used to identify the patterns of malicious activities which have been an emerging popular trend due to the consideration of quantities, profits and time orders of item. Here we propose a sequential pattern mining algorithm with fuzzy logic feature selection and fuzzy weighted support for huge volumes of network logs to be implemented in Apache Hadoop YARN, which solves the problem of speed and time constraints. Fuzzy logic feature selection selects important features from the feature set. Fuzzy weighted supports provide weights to the inputs and avoid multiple scans. In our simulation we use the attack log from NS-2 MANET environment and compare the proposed algorithm with the state-of-the-art sequential Pattern Mining algorithm, SPADE and Support Vector Machine with Hadoop environment.

Hadoop Security Technologies and Vulnerability Analysis (하둡 보안 기술과 취약점 분석)

  • Kim, A-Yong;He, Yilun;Kim, Han-Kil;Park, Man-Seub;Jung, Hoe-Kyung
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2013.05a
    • /
    • pp.681-683
    • /
    • 2013
  • And were the prevalence of smartphones is the Big Data era, such as Facebook or Twitter, SNS (Social Network Service) routine is used in the real world. Take advantage of the analysis, and to extract and utilize developed in the Apache Foundation Hadoop (Hadoop) without abandoning the SNS unstructured data here. Hadoop is an open source framework that can handle large amounts of data. Hadoop has been introduced in the domestic corporate and commercial development and Compared to the technology development Hadoop has been pointed out that the lack of security sector. In this paper, we propose a method to enhance the security and vulnerability analysis of security technologies and Hadoop.

  • PDF

An Analysis of Utilization on Virtualized Computing Resource for Hadoop and HBase based Big Data Processing Applications (Hadoop과 HBase 기반의 빅 데이터 처리 응용을 위한 가상 컴퓨팅 자원 이용률 분석)

  • Cho, Nayun;Ku, Mino;Kim, Baul;Xuhua, Rui;Min, Dugki
    • Journal of Information Technology and Architecture
    • /
    • v.11 no.4
    • /
    • pp.449-462
    • /
    • 2014
  • In big data era, there are a number of considerable parts in processing systems for capturing, storing, and analyzing stored or streaming data. Unlike traditional data handling systems, a big data processing system needs to concern the characteristics (format, velocity, and volume) of being handled data in the system. In this situation, virtualized computing platform is an emerging platform for handling big data effectively, since virtualization technology enables to manage computing resources dynamically and elastically with minimum efforts. In this paper, we analyze resource utilization of virtualized computing resources to discover suitable deployment models in Apache Hadoop and HBase-based big data processing environment. Consequently, Task Tracker service shows high CPU utilization and high Disk I/O overhead during MapReduce phases. Moreover, HRegion service indicates high network resource consumption for transfer the traffic data from DataNode to Task Tracker. DataNode shows high memory resource utilization and Disk I/O overhead for reading stored data.

Advanced Resource Management with Access Control for Multitenant Hadoop

  • Won, Heesun;Nguyen, Minh Chau;Gil, Myeong-Seon;Moon, Yang-Sae
    • Journal of Communications and Networks
    • /
    • v.17 no.6
    • /
    • pp.592-601
    • /
    • 2015
  • Multitenancy has gained growing importance with the development and evolution of cloud computing technology. In a multitenant environment, multiple tenants with different demands can share a variety of computing resources (e.g., CPU, memory, storage, network, and data) within a single system, while each tenant remains logically isolated. This useful multitenancy concept offers highly efficient, and cost-effective systems without wasting computing resources to enterprises requiring similar environments for data processing and management. In this paper, we propose a novel approach supporting multitenancy features for Apache Hadoop, a large scale distributed system commonly used for processing big data. We first analyze the Hadoop framework focusing on "yet another resource negotiator (YARN)", which is responsible for managing resources, application runtime, and access control in the latest version of Hadoop. We then define the problems for supporting multitenancy and formally derive the requirements to solve these problems. Based on these requirements, we design the details of multitenant Hadoop. We also present experimental results to validate the data access control and to evaluate the performance enhancement of multitenant Hadoop.

Performance Comparison of DW System Tajo Based on Hadoop and Relational DBMS (하둡 기반 DW시스템 타조와 관계형 DBMS의 성능 비교)

  • Liu, Chen;Ko, Junghyun;Yeo, Jeongmo
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.3 no.9
    • /
    • pp.349-354
    • /
    • 2014
  • Since Hadoop which is the Big-data processing platform was announced, SQL-on-Hadoop is the spotlight as the technique to analyze data using SQL on Hadoop. Tajo created by Korean programmers has recently been promoted to Top-Level-Project status by the Apache in April and has been paid attention all around world. Despite a sensible change caused by Hadoop's appearance in DW market, researches of those performance is insufficient. Thus, this study has been conducted to help choose a DW solution based on SQL-on-Hadoop as progressing the test on comparison analysis of RDBMS and Tajo. It has shown that Tajo based on Hadoop is more superior than RDBMS if it is used with accurate strategy. In addition, open-source project Tajo is expected not only to achieve improvements in technique due to active participation of many developers but also to be in charge of an important role of DW in the filed of data analysis.

Analysis of big data using Rhipe (Rhipe를 활용한 빅데이터 처리 및 분석)

  • Ko, Youngjun;Kim, Jinseog
    • Journal of the Korean Data and Information Science Society
    • /
    • v.24 no.5
    • /
    • pp.975-987
    • /
    • 2013
  • The Hadoop system was developed by the Apache foundation based on GFS and MapReduce technologies of Google. Many modern systems for managing and processing the big data have been developing based on the Hadoop because the Hadoop was designed for scalability and distributed computing. The R software has been considered as a well-suited analytic tool in the Hadoop based systems because the R is flexible to other languages and has many libraries for complex analyses. We introduced Rhipe which is a R package supporting MapReduce programming easily under the Hadoop system, and implemented a MapReduce program using Rhipe for multiple regression especially. In addition, we compared the computing speeds of our program with the other packages (ff and bigmemory) for processing the large data. The simulation results showed that our program was more fast than ff and bigmemory as the size of data increases.

A performance comparison for Apache Spark platform on environment of limited memory (제한된 메모리 환경에서의 아파치 스파크 성능 비교)

  • Song, Jun-Seok;Kim, Sang-Young;Lee, Jung-June;Youn, Hee-Yong
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2016.01a
    • /
    • pp.67-68
    • /
    • 2016
  • 최근 빅 데이터를 이용한 시스템들이 여러 분야에서 활발히 이용되기 시작하면서 대표적인 빅데이터 저장 및 처리 플랫폼인 하둡(Hadoop)의 기술적 단점을 보완할 수 있는 다양한 분산 시스템 플랫폼이 등장하고 있다. 그 중 아파치 스파크(Apache Spark)는 하둡 플랫폼의 속도저하 단점을 보완하기 위해 인 메모리 처리를 지원하여 대용량 데이터를 효율적으로 처리하는 오픈 소스 분산 데이터 처리 플랫폼이다. 하지만, 아파치 스파크의 작업은 메모리에 의존적이므로 제한된 메모리 환경에서 전체 작업 성능은 급격히 낮아진다. 본 논문에서는 메모리 용량에 따른 아파치 스파크 성능 비교를 통해 아파치 스파크 동작을 위해 필요한 적정 메모리 용량을 확인한다.

  • PDF