• Title/Summary/Keyword: Hadoop environment (하둡 환경)

Design of Spark SQL Based Framework for Advanced Analytics (Spark SQL 기반 고도 분석 지원 프레임워크 설계)

  • Chung, Jaehwa
    • KIPS Transactions on Software and Data Engineering / v.5 no.10 / pp.477-482 / 2016
  • As advanced analytics on big data becomes indispensable for agile decision-making and tactical planning in enterprises, distributed processing platforms such as Hadoop and Spark, which distribute and process large volumes of data across multiple nodes, are receiving great attention. In the Spark platform stack, Spark SQL was recently introduced so that Spark can support SQL-based distributed processing. However, Spark SQL cannot effectively handle advanced analytics involving machine learning and graph processing, in terms of iterative tasks and task allocation. Motivated by these issues, this paper proposes the design of an SQL-based optimal big data processing engine and a processing framework to support advanced analytics in Spark environments. The optimal big data processing engine handles complex SQL queries involving multiple parameters and join, aggregation, and sorting operations in a distributed/parallel manner, and the proposed framework optimizes the machine learning process in terms of relational operations.

Performance evaluation and prediction for number of slave nodes in Spark (스파크 기반 분산 환경에서 슬레이브 노드의 개수에 따른 성능 분석과 예측)

  • Bak, Bongwoo;Myung, Rohyoung;Chung, KwangSik;Yu, Heonchang;Choi, Sukyong
    • Proceedings of the Korea Information Processing Society Conference / 2017.04a / pp.94-96 / 2017
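  • As systems built on big data have come into active use in many fields, Apache Spark has emerged as a distributed system platform that compensates for the technical shortcomings of Hadoop, the representative big data storage and processing platform. On this platform, large-scale computation is carried out by distributing work across slave nodes. However, it is difficult to predict how many slave nodes are needed, and how much computing power each one requires, to reach a required level of performance. In this paper, we build a model through experiments to predict which conditions must be met to achieve the desired performance in Spark and what level of performance the current environment can deliver.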

Kerberos Authentication Deployment Policy of US in Big data Environment (빅데이터 환경에서 미국 커버로스 인증 적용 정책)

  • Hong, Jinkeun
    • Journal of Digital Convergence / v.11 no.11 / pp.435-441 / 2013
  • This paper reviews the Kerberos security authentication scheme and policy for big data services. It analyzes the problems of security technology based on the Hadoop framework in a big data service environment. Considering the problems of applying a Kerberos security authentication system, it also analyzes deployment policy, focusing on the main issues that arise in commercial business. Regarding Kerberos policy as applied in the US, it examines topics such as cross-platform interoperability support, automated Kerberos setup, integration issues, OPT authentication, SSO, ID, and so on.

A Study on Distributed Semantic Web Data Repository Using HBase (HBase를 이용한 분산 시맨틱 웹 데이터 저장소에 대한 연구)

  • Jo, Daewoong;Kim, Myung Ho
    • Proceedings of the Korea Information Processing Society Conference / 2012.04a / pp.111-114 / 2012
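  • Research on efficiently storing large volumes of data generated in real time is under way, using big data processing technologies related to Hadoop and NoSQL for distributed/parallel processing. However, models for processing the large volumes of data generated in the semantic web field are not currently being studied. In this paper, we propose mapping rules for the distributed storage of the large volume of ontology data generated in semantic web environments in HBase, a NoSQL database capable of big data processing. With these mapping rules, data that can be generated in large volumes in semantic web environments can also be stored efficiently in a distributed manner.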

Marine Environment Monitoring System based Open Source (오픈소스 기반 해양환경 모니터링 시스템)

  • Park, Sun;Cha, ByungRae;Kim, Jongwon
    • Smart Media Journal / v.6 no.3 / pp.75-82 / 2017
  • Recently, marine monitoring technology has been actively studied, since the sea is a rich repository of natural resources that is attracting attention worldwide. In particular, marine environment data should be collected continuously in order to understand and analyze the marine environment; however, research on automatic monitoring of the marine environment in Korea is insufficient. In this paper, we propose a marine environment monitoring system based on open source. The proposed system can be designed as a scale-out system using a Hadoop-based time series database, so it can easily process the growing volume of collected data by scaling out computing resources. It can also be used to analyze marine data by visualizing the collected data.

The Implementation and Performance Measurement for Hadoop-Based Android Mobile TPC-C Application (모바일 TPC-C: 하둡 기반 안드로이드 모바일 TPC-C 어플리케이션 구현 및 성능 측정)

  • Jang, Han-Uer;No, Jaechun;Kim, Byung-Moon;Lee, Ji-Eun;Park, Sung-Soon
    • Journal of the Institute of Electronics and Information Engineers / v.50 no.8 / pp.203-211 / 2013
  • Due to the rapid growth of mobile devices and applications, mobile cloud computing is becoming an important platform in the development of cloud services. However, mobile cloud computing faces many challenges in terms of computing resources and communications. One of them is the performance between mobile devices and the cloud server. In this paper, we implemented a Hadoop-based Android mobile application, called mobile TPC-C, and used it to evaluate performance between mobile devices and the cloud server. Mobile TPC-C was implemented based on the existing TPC-C so that it can run on Android mobile devices. The performance measurement using mobile TPC-C was executed on various transactions while changing the number of mobile clients. By comparing the results with an evaluation on a PC, we tried to identify the important factors affecting performance between mobile clients and the cloud server.

Trend analysis of Open Source Technologies for Cloud Storage Infrastructure (클라우드 스토리지 인프라 구축을 위한 오픈 소스 기술 동향)

  • Bae, Yu-Mi;Jung, Sung-Jae;Bae, Jung-Min;Park, Jeong-Su;Sung, Kyung
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2013.05a / pp.263-266 / 2013
  • The spread of cloud computing environments, the increase in mobile devices, and the emergence of various web-based services require large amounts of storage space. With the widespread use of web-based storage services such as Google Drive, Naver Ndrive, and Daum Cloud, the need for storage space keeps growing. Cloud storage, which provides virtualized storage resources over a network according to users' needs and which is large, easy to extend, and independent of a specific geographical location, is therefore coming into the limelight. This paper examines the features of the open source technologies Hadoop, Swift, and GlusterFS for cloud storage infrastructure.

Structuring of unstructured big data and visual interpretation (부산지역 교통관련 기사를 이용한 비정형 빅데이터의 정형화와 시각적 해석)

  • Lee, Kyeongjun;Noh, Yunhwan;Yoon, Sanggyeong;Cho, Youngseuk
    • Journal of the Korean Data and Information Science Society / v.25 no.6 / pp.1431-1438 / 2014
  • We analyzed articles from "Kukje Shinmun" and "Busan Ilbo", two local newspapers of Busan Metropolitan City, published from January 1, 2013 to December 31, 2013. We analyzed the meaningful patterns inherent in the 2889 articles whose titles include "Busan" and "Traffic", together with related data. Text mining, which is a part of data mining, was used for the social network analysis (SNA). HDFS and MapReduce (from the Hadoop ecosystem), an open-source framework based on Java, were used in a Linux environment (Ubuntu-12.04LTS) to structure the unstructured data and to store, process, and analyze the big data. We implemented a new algorithm that gives better visualization than the default one from the R package, by setting color and thickness according to the weight of each node and of each line connecting the nodes.

External Merge Sorting in Tajo with Variable Server Configuration (매개변수 환경설정에 따른 타조의 외부합병정렬 성능 연구)

  • Lee, Jongbaeg;Kang, Woon-hak;Lee, Sang-won
    • Journal of KIISE / v.43 no.7 / pp.820-826 / 2016
  • There is a growing requirement for big data processing which extracts valuable information from a large amount of data. The Hadoop system employs the MapReduce framework to process big data. However, MapReduce has limitations such as inflexible and slow data processing. To overcome these drawbacks, SQL query processing techniques known as SQL-on-Hadoop were developed. Apache Tajo, one of the SQL-on-Hadoop techniques, was developed by a Korean development group. External merge sort is one of the heavily used algorithms in Tajo for query processing. The performance of external merge sort in Tajo is influenced by two parameters, sort buffer size and fanout. In this paper, we analyzed the performance of external merge sort in Tajo with various sort buffer sizes and fanouts. In addition, we figured out that there are two major causes of differences in the performance of external merge sort: CPU cache misses which increase as the sort buffer size grows; and the number of merge passes determined by fanout.
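The relationship between fanout and the number of merge passes mentioned in this abstract can be sketched with the textbook external merge sort model. This is a minimal illustration and not Tajo's implementation: the function names `initial_runs` and `merge_passes` are hypothetical, and it assumes that each initial sorted run fills exactly one sort buffer and that every pass merges up to `fanout` runs into one.

```python
def initial_runs(data_bytes: int, sort_buffer_bytes: int) -> int:
    """Number of sorted runs produced when each run fills one sort buffer."""
    if sort_buffer_bytes <= 0:
        raise ValueError("sort buffer size must be positive")
    # Ceiling division without floating point.
    return -(-data_bytes // sort_buffer_bytes)


def merge_passes(num_runs: int, fanout: int) -> int:
    """Number of merge passes needed to reduce num_runs runs to a single run."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    passes = 0
    while num_runs > 1:
        # One pass merges groups of up to `fanout` runs into one run each.
        num_runs = -(-num_runs // fanout)
        passes += 1
    return passes


if __name__ == "__main__":
    runs = initial_runs(data_bytes=1000, sort_buffer_bytes=100)
    for f in (2, 4, 16):
        print(f"fanout={f}: {merge_passes(runs, f)} merge passes for {runs} runs")
```

In this model a larger sort buffer produces fewer initial runs and a larger fanout produces fewer passes. The CPU cache effect the abstract reports for large sort buffers is not captured here.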

SPARQL Query Processing in Distributed In-Memory System (분산 메모리 시스템에서의 SPARQL 질의 처리)

  • Jagvaral, Batselem;Lee, Wangon;Kim, Kang-Pil;Park, Young-Tack
    • Journal of KIISE / v.42 no.9 / pp.1109-1116 / 2015
  • In this paper, we propose a query processing approach that uses Spark's functional programming model and distributed memory system to reduce the computational overhead of SPARQL. In the semantic web, RDF ontology data is produced at large scale, and the main challenge is to query and manipulate such large ontologies with high throughput. Most existing studies on SPARQL have focused on deploying the Hadoop MapReduce framework, and although approaches based on Hadoop MapReduce have shown promising results, they achieve a low level of throughput due to the underlying distributed file processes. Therefore, in order to speed up query processing, we suggest query-processing methods based on memory caching in a distributed memory system. Our approach is also integrated with a clause unification method for propagating bindings between clauses, which exploits Spark's join, map, and filter methods along with caching. In our experiments, we achieved a high level of performance relative to other approaches. In particular, our performance was close to that of Sempala, which has been considered the fastest query processing system.
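The idea of answering a SPARQL basic graph pattern with filter, map, and join steps can be sketched in plain Python over an in-memory list of triples. This is an illustration only: it does not use Spark and it is not the authors' clause unification method. The sample triples and the helper names `match` and `hash_join` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("alice", "worksAt", "acme"),
    ("bob", "worksAt", "acme"),
    ("acme", "locatedIn", "seoul"),
    ("carol", "worksAt", "globex"),
]


def match(triples, predicate):
    """Filter triples by predicate, then map each to a (subject, object) pair."""
    return [(s, o) for s, p, o in triples if p == predicate]


def hash_join(left, right):
    """Join (a, b) pairs with (b, c) pairs on the shared value b."""
    index = defaultdict(list)
    for b, c in right:
        index[b].append(c)
    return [(a, b, c) for a, b in left for c in index.get(b, [])]


# Roughly: SELECT ?person ?org ?city
#          WHERE { ?person worksAt ?org . ?org locatedIn ?city }
works_at = match(TRIPLES, "worksAt")
located_in = match(TRIPLES, "locatedIn")
result = hash_join(works_at, located_in)

if __name__ == "__main__":
    for row in result:
        print(row)
```

In a distributed setting, the intermediate lists would correspond to cached in-memory datasets, so repeated patterns avoid rereading the triples from disk.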