• Title/Summary/Keyword: HADOOP

Search Result 395, Processing Time 0.029 seconds

A Digital Secret File Leakage Prevention System via Hadoop-based User Behavior Analysis (하둡 기반의 사용자 행위 분석을 통한 기밀파일 유출 방지 시스템)

  • Yoo, Hye-Rim;Shin, Gyu-Jin;Yang, Dong-Min;Lee, Bong-Hwan
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.22 no.11
    • /
    • pp.1544-1553
    • /
    • 2018
  • Recently internal information leakage in industries is severely increasing in spite of industry security policy. Thus, it is essential to prepare an information leakage prevention measure by industries. Most of the leaks result from the insiders, not from external attacks. In this paper, a real-time internal information leakage prevention system via both storage and network is implemented in order to protect confidential file leakage. In addition, a Hadoop-based user behavior analysis and statistics system is designed and implemented for storing and analyzing information log data in industries. The proposed system stores a large volume of data in HDFS and improves data processing capability using RHive, consequently helps the administrator recognize and prepare the confidential file leak trials. The implemented audit system would be contributed to reducing the damage caused by leakage of confidential files inside of the industries via both portable data media and networks.

Sequential Pattern Mining for Intrusion Detection System with Feature Selection on Big Data

  • Fidalcastro, A;Baburaj, E
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.11 no.10
    • /
    • pp.5023-5038
    • /
    • 2017
  • Big data is an emerging technology which deals with wide range of data sets with sizes beyond the ability to work with software tools which is commonly used for processing of data. When we consider a huge network, we have to process a large amount of network information generated, which consists of both normal and abnormal activity logs in large volume of multi-dimensional data. Intrusion Detection System (IDS) is required to monitor the network and to detect the malicious nodes and activities in the network. Massive amount of data makes it difficult to detect threats and attacks. Sequential Pattern mining may be used to identify the patterns of malicious activities which have been an emerging popular trend due to the consideration of quantities, profits and time orders of item. Here we propose a sequential pattern mining algorithm with fuzzy logic feature selection and fuzzy weighted support for huge volumes of network logs to be implemented in Apache Hadoop YARN, which solves the problem of speed and time constraints. Fuzzy logic feature selection selects important features from the feature set. Fuzzy weighted supports provide weights to the inputs and avoid multiple scans. In our simulation we use the attack log from NS-2 MANET environment and compare the proposed algorithm with the state-of-the-art sequential Pattern Mining algorithm, SPADE and Support Vector Machine with Hadoop environment.

A Study on the Improving Performance of Massively Small File Using the Reuse JVM in MapReduce (MapReduce에서 Reuse JVM을 이용한 대규모 스몰파일 처리성능 향상 방법에 관한 연구)

  • Choi, Chul Woong;Kim, Jeong In;Kim, Pan Koo
    • Journal of Korea Multimedia Society
    • /
    • v.18 no.9
    • /
    • pp.1098-1104
    • /
    • 2015
  • With the widespread use of smartphones and IoT (Internet of Things), data are being generated on a large scale, and there is increased for the analysis of such data. Hence, distributed processing systems have gained much attention. Hadoop, which is a distributed processing system, saves the metadata of stored files in name nodes; in this case, the main problems are as follows: the memory becomes insufficient; load occurs because of massive small files; scheduling and file processing time increases because of the increased number of small files. In this paper, we propose a solution to address the increase in processing time because of massive small files, and thus improve the processing performance, using the Reuse JVM method provided by Hadoop. Through environment setting, the Reuse JVM method modifies the JVM produced conventionally for every task, so that multiple tasks are reused sequentially in one JVM. As a final outcome, the Reuse JVM method showed the best processing performance when used together with CombineFileInputFormat.

A Parallel HDFS and MapReduce Functions for Emotion Analysis (감성분석을 위한 병렬적 HDFS와 맵리듀스 함수)

  • Back, BongHyun;Ryoo, Yun-Kyoo
    • Journal of the Korea society of information convergence
    • /
    • v.7 no.2
    • /
    • pp.49-57
    • /
    • 2014
  • Recently, opinion mining is introduced to extract useful information from SNS data and to evaluate the true intention of users. Opinion mining are required several efficient techniques to collect and analyze a large amount of SNS data and extract meaningful data from them. Therefore in this paper, we propose a parallel HDFS(Hadoop Distributed File System) and emotion functions based on Mapreduce to extract some emotional information of users from various unstructured big data on social networks. The experiment results have verified that the proposed system and functions perform faster than O(n) for data gathering time and loading time, and maintain stable load balancing for memory and CPU resources.

  • PDF

Scalable P2P Botnet Detection with Threshold Setting in Hadoop Framework (하둡 프레임워크에서 한계점 가변으로 확장성이 가능한 P2P 봇넷 탐지 기법)

  • Huseynov, Khalid;Yoo, Paul D.;Kim, Kwangjo
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.25 no.4
    • /
    • pp.807-816
    • /
    • 2015
  • During the last decade most of coordinated security breaches are performed by the means of botnets, which is a large overlay network of compromised computers being controlled by remote botmaster. Due to high volumes of traffic to be analyzed, the challenge is posed by managing tradeoff between system scalability and accuracy. We propose a novel Hadoop-based P2P botnet detection method solving the problem of scalability and having high accuracy. Moreover, our approach is characterized not to require labeled data and applicable to encrypted traffic as well.

A study on development method for practical use of Big Data related to recommendation to financial item (금융 상품 추천에 관련된 빅 데이터 활용을 위한 개발 방법)

  • Kim, Seok-Soo
    • Journal of the Korea Society of Computer and Information
    • /
    • v.19 no.8
    • /
    • pp.73-81
    • /
    • 2014
  • This study proposed development method for practical use techniques compromise data storage layer, data processing layer, data analysis layer, visualization layer. Data of storage, process, analysis of each phase can see visualization. After data process through Hadoop, the result visualize from Mahout. According to this course, we can capture several features of customer, we can choose recommendation of financial item on time. This study introduce background and problem of big data and discuss development method and case study that how to create big data has new business opportunity through financial item recommendation case.

Performance Comparison of Python and Scala APIs in Spark Distributed Cluster Computing System (Spark 기반에서 Python과 Scala API의 성능 비교 분석)

  • Ji, Keung-yeup;Kwon, Youngmi
    • Journal of Korea Multimedia Society
    • /
    • v.23 no.2
    • /
    • pp.241-246
    • /
    • 2020
  • Hadoop is a framework to process large data sets in a distributed way across clusters of nodes. It has been a popular platform to process big data, but in recent years, other platforms became competitive ones depending on the characteristics of the application. Spark is one of distributed platforms to enable real-time data processing and improve overall processing performance over Hadoop by introducing in-memory processing instead of disk I/O. Whereas Hadoop is designed to work on Java and data analysis is processed using Java API, Spark provides a variety of APIs with Scala, Python, Java and R. In this paper, the goal is to find out whether the APIs of different programming languages af ect the performances in Spark. We chose two popular APIs: Python and Scala. Python is easy to learn and is used in AI domain in a wide range. Scala is a programming language with advantages of parallelism. Our experiment shows much faster processing with Scala API than Python API. For the performance issues on AI-based analysis, further study is needed.

Design and Implementation of Big Data Platform for Image Processing in Agriculture (농업 이미지 처리를 위한 빅테이터 플랫폼 설계 및 구현)

  • Nguyen, Van-Quyet;Nguyen, Sinh Ngoc;Vu, Duc Tiep;Kim, Kyungbaek
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2016.10a
    • /
    • pp.50-53
    • /
    • 2016
  • Image processing techniques play an increasingly important role in many aspects of our daily life. For example, it has been shown to improve agricultural productivity in a number of ways such as plant pest detecting or fruit grading. However, massive quantities of images generated in real-time through multi-devices such as remote sensors during monitoring plant growth lead to the challenges of big data. Meanwhile, most current image processing systems are designed for small-scale and local computation, and they do not scale well to handle big data problems with their large requirements for computational resources and storage. In this paper, we have proposed an IPABigData (Image Processing Algorithm BigData) platform which provides algorithms to support large-scale image processing in agriculture based on Hadoop framework. Hadoop provides a parallel computation model MapReduce and Hadoop distributed file system (HDFS) module. It can also handle parallel pipelines, which are frequently used in image processing. In our experiment, we show that our platform outperforms traditional system in a scenario of image segmentation.

Job-aware Network Scheduling for Hadoop Cluster

  • Liu, Wen;Wang, Zhigang;Shen, Yanming
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.11 no.1
    • /
    • pp.237-252
    • /
    • 2017
  • In recent years, data centers have become the core infrastructure to deal with big data processing. For these big data applications, network transmission has become one of the most important factors affecting the performance. In order to improve network utilization and reduce job completion time, in this paper, by real-time monitoring from the application layer, we propose job-aware priority scheduling. Our approach takes the correlations of flows in the same job into account, and flows in the same job are assigned the same priority. Therefore, we expect that flows in the same job finish their transmissions at about the same time, avoiding lagging flows. To achieve load balancing, two approaches (Flow-based and Spray) using ECMP (Equal-Cost multi-path routing) are presented. We implemented our scheme using NS-2 simulator. In our evaluations, we emulate real network environment by setting background traffic, scheduling delay and link failures. The experimental results show that our approach can enhance the Hadoop job execution efficiency of the shuffle stage, significantly reduce the network transmission time of the highest priority job.

An Efficient Design and Implementation of an MdbULPS in a Cloud-Computing Environment

  • Kim, Myoungjin;Cui, Yun;Lee, Hanku
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.9 no.8
    • /
    • pp.3182-3202
    • /
    • 2015
  • Flexibly expanding the storage capacity required to process a large amount of rapidly increasing unstructured log data is difficult in a conventional computing environment. In addition, implementing a log processing system providing features that categorize and analyze unstructured log data is extremely difficult. To overcome such limitations, we propose and design a MongoDB-based unstructured log processing system (MdbULPS) for collecting, categorizing, and analyzing log data generated from banks. The proposed system includes a Hadoop-based analysis module for reliable parallel-distributed processing of massive log data. Furthermore, because the Hadoop distributed file system (HDFS) stores data by generating replicas of collected log data in block units, the proposed system offers automatic system recovery against system failures and data loss. Finally, by establishing a distributed database using the NoSQL-based MongoDB, the proposed system provides methods of effectively processing unstructured log data. To evaluate the proposed system, we conducted three different performance tests on a local test bed including twelve nodes: comparing our system with a MySQL-based approach, comparing it with an Hbase-based approach, and changing the chunk size option. From the experiments, we found that our system showed better performance in processing unstructured log data.