Title/Summary/Keyword: HADOOP

394 search results

A Secure Model for Reading and Writing in Hadoop Distributed File System and its Evaluation (하둡 분산파일시스템에서 안전한 쓰기, 읽기 모델과 평가)

  • Pang, Sechung; Ra, Ilkyeun; Kim, Yangwoo
    • Journal of Internet Computing and Services / v.13 no.5 / pp.55-64 / 2012
  • As cloud computing becomes popular, the need for a distributed file system (DFS) is growing. However, current cloud computing environments offer no DFS framework sufficient to protect sensitive private information from attackers. We therefore designed and propose a secure scheme for distributed file systems that provides confidentiality and availability using a secret sharing method. In this paper, we measure the encryption and decryption speed of the proposed method and compare it with that of the SEED algorithm, the most widely used algorithm in this field; the comparison shows the computational efficiency of our method. Moreover, the proposed secure read/write model is independent of the Hadoop DFS structure, so the modified algorithm can easily be adapted for use in HDFS. Finally, the proposed model is evaluated theoretically using a performance measurement method for the distributed secret sharing model.
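
The abstract identifies secret sharing as the mechanism but not a specific scheme. As a hedged illustration, the sketch below shows Shamir's classic (k, n) threshold secret sharing, which provides exactly the confidentiality and availability properties described: fewer than k shares reveal nothing, and any k of n shares reconstruct the data. The prime field and parameters are illustrative assumptions, not the authors' design.

```python
# Minimal (k, n) Shamir secret sharing sketch in Python.
# Assumptions (not from the paper): a single large prime field and a
# small integer secret; a real HDFS block would be split chunk by chunk.
import random

PRIME = 2**127 - 1  # a Mersenne prime, large enough for the demo secret

def make_shares(secret, k, n):
    """Split `secret` into n shares; any k of them reconstruct it."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    def f(x):
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial
            acc = (acc * x + c) % PRIME
        return acc
    return [(x, f(x)) for x in range(1, n + 1)]

def recover(shares):
    """Lagrange interpolation at x = 0 recovers the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret

shares = make_shares(123456789, k=3, n=5)  # 5 DataNodes, any 3 suffice
assert recover(shares[:3]) == 123456789    # confidentiality + availability
```

In an HDFS setting, each of the n shares of a block would be written to a different DataNode, so a compromised node exposes nothing and up to n - k node failures are tolerated.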

A Design of Analysis System on TV Advertising Effect of Social Networking Using Hadoop (하둡을 이용한 소셜네트워킹의 TV광고효과 분석 시스템 설계)

  • Hur, Seoyeon; Kim, Yoonhee
    • Journal of Internet Computing and Services / v.14 no.6 / pp.49-57 / 2013
  • As big data has become a challenging issue, the development of new services based on Social Network Services (SNS), a typical source of big data, has become active. SNS has grown into a medium in which everyone communicates in real time, and the number of services analyzing SNS opinions is increasing. Meanwhile, a new approach to acquiring and analyzing Twitter data has become necessary for TV advertisement systems. This paper proposes LiveAD, a system that stores and analyzes big data such as Twitter data and evaluates TV advertising effects based on it. As a proof of concept, the proposed system was implemented to collect and analyze Twitter data using Hadoop. The information collected by the system improves the chance of analyzing TV advertising effects on Twitter in real time.
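
The abstract does not include LiveAD's code. As a loose illustration of the kind of Hadoop job involved, here is a minimal Hadoop Streaming mapper/reducer pair in Python that counts per-minute mentions of tracked brands in tweets; the tab-separated input layout, ISO-8601 timestamps, and the brand list are hypothetical assumptions.

```python
# mapper.py -- Hadoop Streaming sketch (illustrative, not LiveAD itself).
# Assumed input: one tweet per line as "timestamp<TAB>text" with an
# ISO-8601 timestamp; emits "YYYY-MM-DDTHH:MM<TAB>1" per brand mention.
import sys

BRANDS = {"samsung", "hyundai"}  # hypothetical advertised brands

for line in sys.stdin:
    try:
        ts, text = line.rstrip("\n").split("\t", 1)
    except ValueError:
        continue  # skip malformed records
    if set(text.lower().split()) & BRANDS:
        print(f"{ts[:16]}\t1")  # truncate the timestamp to the minute
```

```python
# reducer.py -- sums per-minute mention counts (input arrives key-sorted).
import sys

current, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = key, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")
```

A pair like this would be submitted with the standard hadoop-streaming jar, with HDFS paths to the collected tweets as input and the per-minute counts as output.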

A Customized Tourism System Using Log Data on Hadoop (로그 데이터를 이용한 하둡기반 맞춤형 관광시스템)

  • Ya, Ding; Kim, Kang-Chul
    • The Journal of the Korea Institute of Electronic Communication Sciences / v.13 no.2 / pp.397-404 / 2018
  • As Internet usage increases, much user behavior is recorded in log files, and research and industrial applications that use these log files have recently become active. This paper uses Hadoop, an open-source distributed computing platform, and proposes a customized tourism system based on analyzing user behavior in log files. The proposed system uses Google Analytics to obtain users' log files from the websites they visit and stores search terms extracted by MapReduce in HDFS. It also gathers, with the Octopus application, features of the sight-seeing places or cities that travelers want to tour from travel guide websites. The system suggests customized cities by matching the search terms against the city features; an NBP (next bit permutation) algorithm that rearranges the search terms and city features is used to increase the probability of matching. Customized city suggestions produced from the log files of 39 users demonstrate the performance of the proposed system.
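
The NBP rearrangement is specific to the paper and is not reproduced here. The sketch below only illustrates the final matching stage under simple assumptions: each city is scored by the frequency-weighted overlap between a user's extracted search terms and that city's feature set. All terms, cities, and weights are hypothetical.

```python
# Matching-stage sketch (hypothetical data; the paper's NBP reordering
# of terms and features is not reproduced here).
def score(search_terms, city_features):
    """Frequency-weighted overlap between user terms and city features."""
    return sum(w for term, w in search_terms.items() if term in city_features)

# Search terms extracted from a user's logs, weighted by frequency.
user_terms = {"beach": 5, "seafood": 3, "temple": 1}

cities = {
    "Busan": {"beach", "seafood", "port"},
    "Gyeongju": {"temple", "history"},
}

ranking = sorted(cities, key=lambda c: score(user_terms, cities[c]), reverse=True)
print(ranking)  # ['Busan', 'Gyeongju'] -- Busan matches beach(5)+seafood(3)
```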

S-PARAFAC: Distributed Tensor Decomposition using Apache Spark (S-PARAFAC: 아파치 스파크를 이용한 분산 텐서 분해)

  • Yang, Hye-Kyung; Yong, Hwan-Seung
    • Journal of KIISE / v.45 no.3 / pp.280-287 / 2018
  • Recently, the use of recommendation systems and of tensor analysis for high-dimensional data has been increasing, since decomposing a tensor lets us extract its latent factors and patterns. However, because of its large size and complexity, a tensor must be decomposed before its data can be analyzed. Several tools, such as rTensor, pyTensor, and MATLAB, are used for tensor decomposition, but since they run on a single machine they cannot handle large data. Distributed tensor decomposition tools based on Hadoop can handle tensors at scale, but their computing speed is too slow. In this paper, we propose S-PARAFAC, a tensor decomposition tool based on Apache Spark that runs in a distributed in-memory environment. We converted the PARAFAC algorithm into an Apache Spark version that enables rapid processing of tensor data, and we compared the performance of a Hadoop-based tensor tool with that of S-PARAFAC. The results show that S-PARAFAC is approximately 4 to 25 times faster than the Hadoop-based tool.
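
S-PARAFAC itself is a Spark program and is not reproduced in the abstract; the sketch below shows, in single-machine NumPy, the PARAFAC (CP) alternating-least-squares updates that such a tool distributes. Tensor shapes, rank, and iteration count are illustrative.

```python
# Single-machine PARAFAC (CP) decomposition via ALS in NumPy -- a sketch
# of the algorithm S-PARAFAC distributes over Spark, not the tool itself.
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product: (J,R),(K,R) -> (J*K,R)."""
    return np.einsum('jr,kr->jkr', U, V).reshape(-1, U.shape[1])

def parafac_als(X, rank, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A, B, C = (rng.standard_normal((d, rank)) for d in (I, J, K))
    for _ in range(n_iter):
        # Each update is a linear least-squares solve in one factor,
        # against the corresponding mode unfolding of the tensor.
        A = X.reshape(I, -1) @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.moveaxis(X, 1, 0).reshape(J, -1) @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.moveaxis(X, 2, 0).reshape(K, -1) @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Sanity check: recover a random rank-3 tensor.
rng = np.random.default_rng(1)
A0, B0, C0 = rng.random((4, 3)), rng.random((5, 3)), rng.random((6, 3))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = parafac_als(X, rank=3)
print(np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', A, B, C)))  # ~0
```

Each factor update touches the data only through matrix products with an unfolded tensor, which is what makes the algorithm amenable to data-parallel execution over Spark partitions.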

A Study on Possible Construction of Big Data Analysis System Applied to the Offline Market (오프라인 마켓에 적용 가능한 빅데이터 분석 시스템 구축 방안에 관한 연구)

  • Lee, Hoo-Young; Park, Koo-Rack; Kim, Dong-Hyun
    • Journal of Digital Convergence / v.14 no.9 / pp.317-323 / 2016
  • Big data is now seen as a major asset in a company's competitiveness, and its influence is expected to keep growing. Companies that recognize its importance are already actively applying big data to product development and marketing, and it is increasingly used across sectors of society, including politics and sports. However, limited knowledge of system implementation and high costs remain big obstacles to adopting big data systems. The objective of this study is to build a big data analysis system, based on the open-source Hadoop and Hive, that utilizes POS sales data of small and medium-sized offline markets. This convergence approach is expected to improve existing sales systems, which have focused simply on profit-and-loss analysis. It can also serve as a basis for executive decisions by enabling advance prediction of customer preferences and consumption patterns.
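
The abstract names Hadoop and Hive but shows no queries. The snippet below sketches the kind of HiveQL aggregation such a system would run over POS sales data; the pos_sales table and its columns are a hypothetical schema, not taken from the paper.

```python
# Sketch of a Hive aggregation over POS sales data (hypothetical schema).
# Hive compiles the query into MapReduce jobs over the HDFS-resident data.
import subprocess

QUERY = """
SELECT store_id,
       product_id,
       SUM(quantity)         AS units_sold,
       SUM(quantity * price) AS revenue
FROM   pos_sales              -- assumed external table over HDFS files
WHERE  sale_date BETWEEN '2016-01-01' AND '2016-01-31'
GROUP  BY store_id, product_id
ORDER  BY revenue DESC
LIMIT  20
"""

# `hive -e` runs a HiveQL string from the shell (assumes the Hive CLI
# is installed and on PATH).
subprocess.run(["hive", "-e", QUERY], check=True)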

Learning algorithms for big data logistic regression on RHIPE platform (RHIPE 플랫폼에서 빅데이터 로지스틱 회귀를 위한 학습 알고리즘)

  • Jung, Byung Ho; Lim, Dong Hoon
    • Journal of the Korean Data and Information Science Society / v.27 no.4 / pp.911-923 / 2016
  • Machine learning is becoming increasingly important in the big data era. Logistic regression is a classification method in machine learning that has been widely used in various fields, including medicine, economics, marketing, and the social sciences. RHIPE, which integrates the R and Hadoop environments, has been discussed by few researchers owing to the difficulty of its installation and of implementing MapReduce with it. In this paper, we present MapReduce implementations of the gradient descent and Newton-Raphson algorithms for logistic regression using RHIPE. The Newton-Raphson algorithm does not require a learning rate, while the gradient descent algorithm needs one to be picked manually; we choose the learning rate by a mixed procedure of grid search and binary search so as to process big data efficiently. In the performance study, our Newton-Raphson algorithm outperforms the gradient descent algorithm on all the tested data.
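
The authors' RHIPE/R code is not given in the abstract. The NumPy sketch below shows the structure their Newton-Raphson MapReduce implementation implies: each mapper computes a gradient and Hessian over its data block, a reducer sums them, and the driver applies the Newton update, which needs no learning rate. The synthetic data and block split are illustrative.

```python
# Newton-Raphson for logistic regression, written MapReduce-style
# (an illustrative NumPy sketch, not the authors' RHIPE/R code).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_block(Xb, yb, w):
    """Per-block sufficient statistics for one Newton step."""
    p = sigmoid(Xb @ w)
    grad = Xb.T @ (yb - p)                       # block gradient
    hess = Xb.T @ (Xb * (p * (1 - p))[:, None])  # block Fisher information
    return grad, hess

def newton_step(blocks, w):
    grads, hessians = zip(*(map_block(Xb, yb, w) for Xb, yb in blocks))
    grad, hess = sum(grads), sum(hessians)       # the "reduce" phase
    return w + np.linalg.solve(hess, grad)       # no learning rate needed

# Tiny synthetic run split into two blocks.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.standard_normal((200, 2))])
y = (sigmoid(X @ np.array([0.5, 2.0, -1.0])) > rng.random(200)).astype(float)
blocks = [(X[:100], y[:100]), (X[100:], y[100:])]

w = np.zeros(3)
for _ in range(8):  # Newton-Raphson converges in a few iterations
    w = newton_step(blocks, w)
print(w)  # approaches the coefficients used to generate y
```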

Constructing a Support Vector Machine for Localization on a Low-End Cluster Sensor Network (로우엔드 클러스터 센서 네트워크에서 위치 측정을 위한 지지 벡터 머신)

  • Moon, Sangook
    • Journal of the Korea Institute of Information and Communication Engineering / v.18 no.12 / pp.2885-2890 / 2014
  • Localization of sensor network nodes using machine learning has recently been studied. The support vector machine (SVM) algorithm is easy to implement in a high-level language that enables parallelism. The Raspberry Pi is a Linux system that can serve as a sensor node, and Pis can be used to construct IP-based Hadoop clusters. In this paper, we implemented a support vector machine in Python and built a sensor network cluster with five Pis, establishing a Hadoop software framework to employ the MapReduce mechanism. In our experiments, we ran the test sensor network with a variety of parameters and evaluated it in terms of proficiency, resource usage, and processing time. The experiments showed that, given more execution power and memory, the Pi is an appropriate member node for the cluster, accomplishing precise classification for sensor localization using machine learning.
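
The paper's Python SVM is not listed in the abstract; below is a minimal linear soft-margin SVM trained by Pegasos-style stochastic subgradient descent, as one plausible shape for such an implementation. The 2-D features and labels are hypothetical.

```python
# Minimal linear soft-margin SVM via stochastic subgradient descent
# (Pegasos-style) -- a sketch of the kind of classifier the paper
# implements in Python; features and labels (in {-1,+1}) are hypothetical.
import random

def train_svm(data, lam=0.01, epochs=50):
    dim = len(data[0][0])
    w, b, t = [0.0] * dim, 0.0, 0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Subgradient of lam/2 * ||w||^2 + hinge loss.
            w = [wi - eta * lam * wi for wi in w]
            if margin < 1:  # point inside the margin: push w toward it
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

# Hypothetical training set: signal-strength features -> near/far node.
data = [([1.0, 2.1], 1), ([1.2, 1.9], 1), ([3.0, 0.2], -1), ([2.8, 0.4], -1)]
w, b = train_svm(data[:])
print(w, b)  # a separating hyperplane for the toy data
```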

Structuring of unstructured big data and visual interpretation (부산지역 교통관련 기사를 이용한 비정형 빅데이터의 정형화와 시각적 해석)

  • Lee, Kyeongjun; Noh, Yunhwan; Yoon, Sanggyeong; Cho, Youngseuk
    • Journal of the Korean Data and Information Science Society / v.25 no.6 / pp.1431-1438 / 2014
  • We analyzed articles from the Kukje Shinmun and the Busan Ilbo, two local newspapers of Busan Metropolitan City, covering January 1, 2013 to December 31, 2013. Meaningful patterns inherent in the 2,889 articles whose titles include "Busan" and "traffic", together with related data, were analyzed. Text mining, a branch of data mining, was used for the social network analysis (SNA). HDFS and MapReduce from the Hadoop ecosystem, an open-source Java-based framework, were used in a Linux environment (Ubuntu 12.04 LTS) to structure the unstructured data and to store, process, and analyze the big data. We implemented a new algorithm that visualizes the network better than the default one in the R package by setting the color and thickness of each node and connecting line according to its weight.
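
The improved visualization was written against R's network plotting defaults, which the abstract does not reproduce. As a Python analogue of the same idea, the sketch below draws a keyword network with networkx and matplotlib, scaling edge thickness and color by co-occurrence weight; the toy edge list is hypothetical.

```python
# Weighted-network drawing analogous to the paper's improved R plot:
# edge thickness and color scale with co-occurrence weight.
# (Python/networkx analogue with a hypothetical edge list; the authors
# worked in R.)
import matplotlib.pyplot as plt
import networkx as nx

edges = [("Busan", "traffic", 9), ("traffic", "accident", 6),
         ("Busan", "subway", 4), ("subway", "fare", 2)]

G = nx.Graph()
G.add_weighted_edges_from(edges)

pos = nx.spring_layout(G, seed=42)
weights = [G[u][v]["weight"] for u, v in G.edges()]

nx.draw_networkx_nodes(G, pos, node_size=[300 * G.degree(n) for n in G])
nx.draw_networkx_edges(G, pos, width=weights,  # thickness ~ weight
                       edge_color=weights, edge_cmap=plt.cm.Blues)
nx.draw_networkx_labels(G, pos)
plt.axis("off")
plt.show()
```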

A Development Study of The VPT for the improvement of Hadoop performance (하둡 성능 향상을 위한 VPT 개발 연구)

  • Yang, Ill Deung; Kim, Seong Ryeol
    • Journal of the Korea Institute of Information and Communication Engineering / v.19 no.9 / pp.2029-2036 / 2015
  • Hadoop MapReduce (MR) uses a partition function to pass the outputs of mappers to reducers. The partition function determines the target reducer by computing a hash value from the key and taking it modulo the number of reducers. This legacy partition function does not divide a job effectively because it is sensitive to the key distribution: when the job is divided unevenly, some reducers need more time to process their share, which lengthens the total processing time of the job. This paper proposes the VPT (Virtual Partition Table) and tests it on heavily skewed data. The applied VPT improved processing time by three seconds on average, and we expect the improvement to grow as the data size increases.
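
The abstract describes the legacy partitioner exactly (hash of the key modulo the reducer count) but not the internals of the VPT. The sketch below implements the legacy behavior and one plausible reading of a virtual partition table: keys hash into many virtual slots, and a table built from sampled key loads maps slots to reducers to balance skew. The remapping policy is a hedged guess, not the paper's exact algorithm.

```python
# Legacy Hadoop-style partitioning vs. a virtual-partition-table style
# indirection. Illustrative only; the remapping policy below is a guess
# at the general idea, not the paper's exact VPT algorithm.
import zlib

NUM_REDUCERS = 4
NUM_VIRTUAL = 16  # more virtual slots than reducers

def h(key):
    return zlib.crc32(key.encode())  # deterministic hash for the demo

def legacy_partition(key):
    # Hadoop's HashPartitioner: hash(key) mod number of reducers.
    return h(key) % NUM_REDUCERS

def build_vpt(sampled_keys):
    """Greedily assign heavy virtual slots to the least-loaded reducer."""
    load = [0] * NUM_VIRTUAL
    for k in sampled_keys:
        load[h(k) % NUM_VIRTUAL] += 1
    vpt, reducer_load = [0] * NUM_VIRTUAL, [0] * NUM_REDUCERS
    for slot in sorted(range(NUM_VIRTUAL), key=lambda s: -load[s]):
        target = reducer_load.index(min(reducer_load))
        vpt[slot] = target
        reducer_load[target] += load[slot]
    return vpt

def vpt_partition(key, vpt):
    # Keys hash into virtual slots; the table maps slots to reducers,
    # so heavily loaded slots can be spread away from each other.
    return vpt[h(key) % NUM_VIRTUAL]

keys = ["seoul"] * 90 + ["busan"] * 5 + ["daegu"] * 5  # skewed key sample
vpt = build_vpt(keys)
print({k: vpt_partition(k, vpt) for k in set(keys)})
```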

Anomaly Detection of Hadoop Log Data Using Moving Average and 3-Sigma (이동 평균과 3-시그마를 이용한 하둡 로그 데이터의 이상 탐지)

  • Son, Siwoon; Gil, Myeong-Seon; Moon, Yang-Sae; Won, Hee-Sun
    • KIPS Transactions on Software and Data Engineering / v.5 no.6 / pp.283-288 / 2016
  • In recent years there have been many research efforts on big data, and many companies have developed a variety of related products. Accordingly, we can now store and analyze large volumes of log data that were difficult to handle in the traditional computing environment. To handle the large volumes of log data that rapidly accumulate on multiple servers, in this paper we design a new data storage architecture for efficiently analyzing big log data through Apache Hive. We then design and implement anomaly detection methods, based on moving-average and 3-sigma techniques, that identify abnormal server status from the log data. We also show the effectiveness of the proposed methods by demonstrating that they identify anomalies correctly. These results show that our approach is well suited to detecting anomalies in Hadoop log data.
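
Both techniques are standard, so they admit a direct sketch: a rolling mean and standard deviation over a numeric log metric, with points farther than three sigma from the moving average flagged as anomalies. The window size and synthetic series below are illustrative assumptions.

```python
# Moving-average + 3-sigma anomaly detection over a numeric log metric
# (a minimal pandas sketch; window size and data are illustrative).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
series = pd.Series(rng.normal(100, 5, 500))  # e.g., requests per second
series.iloc[350] = 160                        # inject one anomaly

window = 30
ma = series.rolling(window).mean()
sigma = series.rolling(window).std()

# Flag points farther than 3 sigma from the moving average.
anomalies = series[(series - ma).abs() > 3 * sigma]
print(anomalies)  # index 350 should be reported
```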