• Title/Summary/Keyword: Big data Processing

Matrix-based Filtering and Load-balancing Algorithm for Efficient Similarity Join Query Processing in Distributed Computing Environment (분산 컴퓨팅 환경에서 효율적인 유사 조인 질의 처리를 위한 행렬 기반 필터링 및 부하 분산 알고리즘)

  • Yang, Hyeon-Sik;Jang, Miyoung;Chang, Jae-Woo
    • The Journal of the Korea Contents Association / v.16 no.7 / pp.667-680 / 2016
  • With the development of distributed computing platforms such as Hadoop MapReduce, conventional query processing techniques that were designed for a single machine must now run efficiently in distributed computing environments. In particular, similarity join query processing has been studied in this setting, where a similarity join retrieves all pairs of highly similar data items from two given data sets. However, existing similarity join schemes for distributed environments suffer from a skewed computing load across clusters because they consider only the data transmission cost. In this paper, we propose a matrix-based load-balancing algorithm for efficient similarity join query processing in distributed computing environments. To balance the load evenly across clusters, the proposed algorithm estimates the expected computing cost using a matrix and generates partitions based on the estimated cost. In addition, it reduces the computing load by filtering out data that is not needed for query processing in each cluster. Our performance evaluation shows that the proposed algorithm outperforms the existing scheme in query processing performance.
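The matrix-based cost estimate described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm. It assumes both data sets are already split into buckets, estimates the cost of a bucket pair as the product of the two bucket sizes, sets filtered-out pairs to zero, and assigns the remaining cells to workers greedily. The names `estimate_cost_matrix`, `balance_cells`, and `may_match` are hypothetical.

```python
import heapq

def estimate_cost_matrix(r_sizes, s_sizes, may_match):
    """cost[i][j] = |R_i| * |S_j| if bucket pair (i, j) may hold a similar pair, else 0.

    may_match is an assumed filter predicate; pairs it rejects are never compared.
    """
    return [[r * s if may_match(i, j) else 0
             for j, s in enumerate(s_sizes)]
            for i, r in enumerate(r_sizes)]

def balance_cells(cost, n_workers):
    """Greedy assignment: heaviest cell first, always to the least-loaded worker."""
    cells = sorted(((c, i, j)
                    for i, row in enumerate(cost)
                    for j, c in enumerate(row) if c > 0),
                   reverse=True)
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    for c, i, j in cells:
        load, w = heapq.heappop(heap)
        assignment[w].append((i, j))
        heapq.heappush(heap, (load + c, w))
    loads = {w: load for load, w in heap}
    return assignment, loads

# Example: three buckets per data set; only neighbouring buckets can match.
cost = estimate_cost_matrix([4, 2, 1], [3, 3, 1], lambda i, j: abs(i - j) <= 1)
assignment, loads = balance_cells(cost, n_workers=2)
```

In this toy example the filter removes two of the nine cells, and the remaining estimated cost is split between the two workers.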

Design and Implementation of Multiple Filter Distributed Deduplication System Applying Cuckoo Filter Similarity (쿠쿠 필터 유사도를 적용한 다중 필터 분산 중복 제거 시스템 설계 및 구현)

  • Kim, Yeong-A;Kim, Gea-Hee;Kim, Hyun-Ju;Kim, Chang-Geun
    • Journal of Convergence for Information Technology / v.10 no.10 / pp.1-8 / 2020
  • In recent years, technologies built on data generated by enterprise business activities have become key to business success, creating a need for techniques to store, manage, and retrieve alternative data. Existing big data platforms must load large volumes of data generated in real time without delay in order to process unstructured data, which is a form of alternative data, and must manage storage space efficiently by using a deduplication system across different storages when redundant data occurs. In this paper, we propose a multi-layer distributed data deduplication system that uses Cuckoo hashing filter similarity and takes the characteristics of big data into account. Similarity between virtual machines is computed with Cuckoo hashing, individual storage nodes improve performance through deduplication efficiency, and a multi-layer Cuckoo filter is applied to reduce processing time. Experimental results show that the proposed method shortens processing time by 8.9% and increases the deduplication rate by 10.3%.
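For readers unfamiliar with the data structure named in this abstract, here is a minimal single-layer Cuckoo filter in Python. It is a generic sketch, not the multi-layer system the paper proposes. The class name, the 16-bit fingerprint, and the bucket parameters are illustrative assumptions.

```python
import hashlib
import random

class CuckooFilter:
    """Approximate set membership with deletion (partial-key cuckoo hashing)."""

    def __init__(self, n_buckets=1024, bucket_size=4, max_kicks=500):
        # A power-of-two bucket count makes the XOR index trick reversible.
        assert n_buckets & (n_buckets - 1) == 0, "n_buckets must be a power of two"
        self.mask = n_buckets - 1
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(n_buckets)]

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item: bytes) -> int:
        return (self._hash(b"fp" + item) & 0xFFFF) or 1  # 16-bit, never zero

    def _alt_index(self, index: int, fp: int) -> int:
        return (index ^ self._hash(fp.to_bytes(2, "big"))) & self.mask

    def _candidates(self, item: bytes):
        fp = self._fingerprint(item)
        i1 = self._hash(item) & self.mask
        return fp, i1, self._alt_index(i1, fp)

    def insert(self, item: bytes) -> bool:
        fp, i1, i2 = self._candidates(item)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            # Evict a random resident fingerprint and move it to its other bucket.
            slot = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is considered full; the last evicted fingerprint is dropped

    def contains(self, item: bytes) -> bool:
        fp, i1, i2 = self._candidates(item)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item: bytes) -> bool:
        fp, i1, i2 = self._candidates(item)
        for i in (i1, i2):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```

In a deduplication setting, a storage node could call `contains` on a chunk's content hash before writing and `insert` only for unseen chunks. A positive answer can be a false positive, so it would still need to be confirmed against the real index.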

A Proposal of Privacy Protection Method for Location Information to Utilize 5G-Based High-Precision Positioning Big Data (5G 기반 고정밀 측위 빅데이터 활용을 위한 위치정보 프라이버시 보호 기법 제안)

  • Lee, Donghyeok;Park, Namje
    • Journal of the Korea Institute of Information Security & Cryptology / v.30 no.4 / pp.679-691 / 2020
  • In the future, 5G technology will become the core infrastructure driving the 4th industrial era. Intelligent super-convergence services will need to collect various kinds of personal information, such as location data. If a person's high-precision location information is exposed by a malicious party, it can pose a serious privacy risk. Various approaches based on encryption and obfuscation have been studied to protect location information privacy. In this paper, we propose a new technique that enables statistical queries and data analysis without exposing location information. The proposed method applies polynomial-based transform processing so that the original data cannot be re-identified. In addition, since the quality of the original data is not compromised, the usability of positioning big data can be maximized.

Secure Authentication Protocol in Hadoop Distributed File System based on Hash Chain (해쉬 체인 기반의 안전한 하둡 분산 파일 시스템 인증 프로토콜)

  • Jeong, So Won;Kim, Kee Sung;Jeong, Ik Rae
    • Journal of the Korea Institute of Information Security & Cryptology / v.23 no.5 / pp.831-847 / 2013
  • Various types of data are being created in large quantities as a result of the spread of social media and the popularization of mobile devices. Many companies want to obtain valuable business information by analyzing this large-scale data, and as a result there is a trend toward integrating big data technologies into company work. In particular, Hadoop is regarded as the most representative big data technology because of its terabytes of storage capacity, inexpensive construction cost, and fast data processing speed. However, the authentication token system that the Hadoop Distributed File System (HDFS) uses for user authentication is currently vulnerable to replay attacks and datanode hacking attacks. This can expose company secrets or customers' personal information stored on HDFS. In this paper, we analyze the possible security threats to HDFS when tokens or datanodes are exposed to attackers. Finally, we propose a secure authentication protocol for HDFS based on a hash chain.
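The general hash-chain idea behind this abstract can be sketched in Python as follows. This is a generic one-time-token scheme in the style of Lamport's hash chain, not the protocol the paper proposes and not the HDFS token format. The names `build_chain` and `Verifier` are hypothetical.

```python
import hashlib
import hmac

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_chain(seed: bytes, n: int) -> list:
    """Return [seed, H(seed), H(H(seed)), ...]; the last element is the public anchor."""
    chain = [seed]
    for _ in range(n):
        chain.append(H(chain[-1]))
    return chain

class Verifier:
    """Server side: stores only the most recently accepted chain value."""

    def __init__(self, anchor: bytes):
        self.current = anchor

    def verify(self, token: bytes) -> bool:
        # A valid token is the preimage of the stored value.
        if hmac.compare_digest(H(token), self.current):
            self.current = token  # advance, so the same token cannot be replayed
            return True
        return False

# The client keeps the chain secret and reveals it from the end backwards.
chain = build_chain(b"client-secret-seed", 3)
server = Verifier(chain[3])
first_ok = server.verify(chain[2])   # accepted
replay_ok = server.verify(chain[2])  # rejected: the stored value has moved on
```

Because each token is accepted once and the stored value then advances, a captured token is of no use to an attacker who replays it later.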

Real-Time Ransomware Infection Detection System Based on Social Big Data Mining (소셜 빅데이터 마이닝 기반 실시간 랜섬웨어 전파 감지 시스템)

  • Kim, Mihui;Yun, Junhyeok
    • KIPS Transactions on Computer and Communication Systems / v.7 no.10 / pp.251-258 / 2018
  • Ransomware, malicious software that encrypts files and demands a ransom, is becoming more threatening as it spreads faster and grows more sophisticated. Rapid detection and risk analysis are required, but real-time analysis and reporting are lacking. In this paper, we propose a ransomware infection detection system that uses social big data mining technology to enable real-time analysis. The system analyzes the Twitter stream in real time and crawls tweets containing ransomware-related keywords. It also extracts ransomware-related keywords by crawling news servers through a news feed parser, and extracts news or statistical data from the servers of security companies or search engines. The collected data is analyzed by data mining algorithms. By comparing the number of related tweets, Google Trends (statistical information), and articles related to the WannaCry and Locky ransomware infections that spread in 2017, we show that our system can potentially detect ransomware infections from tweets. Moreover, the performance of the proposed system is shown through entropy and chi-square analysis.

Analyze the Open data for Natural Language Processing of Learning Counseling (학습 상담 내용의 자연어 처리를 위한 오픈 데이터 현황 분석)

  • Kim, Yu-Doo
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2019.05a / pp.500-501 / 2019
  • In the 4th generation industry, self-directed learning is more important than injection-style (rote) learning. Many educational institutions have therefore developed methods for self-directed learning. For self-directed learning to be effective, it is more important for faculty to manage the overall learning process than to be directly involved in the student's academic work. Learning counseling is therefore an important way to carry out self-directed learning effectively. In this paper, we analyze the status of open data for natural language processing that can be used to process learning counseling content, so that it can support various natural language processing applications.

Development of a Post-Processor for Three-Dimensional Forging Analysis (3차원 단조해석용 후처리기 개발)

  • 정완진;최석우
    • Transactions of Materials Processing / v.12 no.6 / pp.542-549 / 2003
  • Three-dimensional forging analysis has become an indispensable tool for making the design process more reliable and more producible. In this study, a new post-processor was developed so that three-dimensional forging analysis results can be investigated more conveniently and accurately. For post-processing of multi-stage forging simulations, an efficient data structure was proposed and implemented using STL. A new file architecture was developed to efficiently handle the large, successive data sets that are common in three-dimensional forging analysis. Since sectioning and flow tracing play an important role in investigating analysis results, we developed an algorithm suitable for 4-node and 10-node tetrahedral elements. This flow tracing algorithm can trace and reverse-trace flow through remeshing. The developed program shows good performance and functionality. In particular, large problems can be handled easily thanks to the proposed data structure and file architecture.

System Design of Logistics Delivery Route Optimizing (물류 배송 최적화 시스템 디자인)

  • Song, Ha-yoon;Kim, Tae-Hyeon
    • Annual Conference of KIPS / 2018.05a / pp.571-574 / 2018
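  • Logistics delivery is one of the systems essential to our daily lives. South Korea's logistics system is systematically organized to suit the size of its territory, but delivery routes are still wasteful in places. In this paper, we attempt to solve this problem using advanced information technologies such as Big Data, Deep Learning, and IoT. A data model designed with the characteristics of logistics in mind is collected through an IoT device that includes communication and positioning functions and is stored in a NoSQL database. The collected data is then trained with Deep Learning that uses the Longest Common Subsequence algorithm. When a delivery occurs, its route is analyzed on the basis of the trained data, and a new delivery route that saves time and material resources compared with the existing route is presented through the IoT device.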
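The Longest Common Subsequence algorithm mentioned in this abstract can be sketched in Python with the standard dynamic-programming formulation. The abstract does not say how LCS is combined with deep learning, so the code only shows how two routes, represented here as hypothetical lists of waypoint labels, can be compared.

```python
def lcs(a, b):
    """Return one longest common subsequence of sequences a and b."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Walk back through the table to recover the subsequence itself.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

# Two delivery routes as ordered waypoint labels (illustrative data).
route_1 = ["A", "B", "C", "D"]
route_2 = ["A", "C", "D", "E"]
shared = lcs(route_1, route_2)
```

The length of the shared subsequence relative to the route lengths gives a simple similarity score between two routes.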

A Study on Data Safety Test Methodology through De-Anonymization of Anonymized data for Privacy in BigData Environment (빅데이터 환경에서 개인정보보호를 위한 익명화된 데이터의 비익명화를 통한 데이터 안전성 테스트 방법론에 관한 연구)

  • Lee, Jae-Sik;Oh, Yong-Seok;Kim, Ho-Seong
    • Annual Conference of KIPS / 2013.11a / pp.684-687 / 2013
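  • A big data environment discovers and exploits value by combining large amounts of data. The prerequisite for such an environment is the disclosure, sharing, and opening of data. However, disclosed data may contain personal information, which can cause legal and ethical problems or secondary damage such as criminal use of the disclosed information, so personal information must be anonymized when data is disclosed. Even so, there is always a possibility that anonymized data will be re-identified and de-anonymized by combining it with other information. Therefore, this paper proposes a test methodology that evaluates the re-identification risk of anonymized data before it is disclosed. The proposed methodology consists of three processes for carrying out the actual test, the setting of test levels, and the points to consider during anonymization. We expect the proposed methodology to foster a safe data disclosure environment, so that data can be shared and opened in the big data era without compromising personal information.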
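One simple way to quantify the re-identification risk discussed in this abstract is to measure how many records are unique on their quasi-identifiers, in the spirit of k-anonymity. The Python sketch below is an assumed illustration and does not reproduce the paper's three-step methodology. The function name and the sample records are hypothetical.

```python
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Return (k, unique_fraction) for a list of dict records.

    k is the size of the smallest group of records sharing the same
    quasi-identifier values; unique_fraction is the share of records
    that are alone in their group and so easiest to link to outside data.
    """
    if not records:
        return 0, 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    group_sizes = Counter(keys)
    k = min(group_sizes.values())
    unique = sum(1 for key in keys if group_sizes[key] == 1)
    return k, unique / len(records)

# Illustrative anonymized records: names removed, attributes generalized.
records = [
    {"age": "30s", "district": "Gangnam", "diagnosis": "flu"},
    {"age": "30s", "district": "Gangnam", "diagnosis": "cold"},
    {"age": "40s", "district": "Mapo", "diagnosis": "flu"},
    {"age": "50s", "district": "Jongno", "diagnosis": "asthma"},
]
k, unique_fraction = reidentification_risk(records, ["age", "district"])
```

Here two of the four records are unique on age band and district, so anyone who knows those two attributes from another source could single them out.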

Consumption behavior in similar areas in Seoul Comparative analysis and diagnosis of deficient industries (서울시 유사지역 내 소비행태 비교분석 및 미비 업종 진단)

  • Daun Choi;Hyunji Shim;Suhwan Jo;Jung Yum;Taewon Kim;Minjoo Cho;Jin Kim
    • Annual Conference of KIPS / 2023.11a / pp.394-395 / 2023
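  • This study analyzes ways to address the problem that all areas of Seoul currently fall within the overconcentration control zone. Using variables such as income level, business statistics, resident population data, sales, and the number of public transit boarding passengers for each administrative dong of Seoul, the analysis was carried out through two rounds of clustering. Based on the characteristics of the clusters obtained from the two rounds of clustering, we propose a diagnosis of deficient industries in Seoul.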