• Title/Summary/Keyword: Big data Processing

Search Result 1,063, Processing Time 0.023 seconds

Technique for Concurrent Processing Graph Structure and Transaction Using Topic Maps and Cassandra (토픽맵과 카산드라를 이용한 그래프 구조와 트랜잭션 동시 처리 기법)

  • Shin, Jae-Hyun
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.1 no.3
    • /
    • pp.159-168
    • /
    • 2012
  • Relation in the new IT environment, such as the SNS, Cloud, Web3.0, has become an important factor. And these relations generate a transaction. However, existing relational database and graph database does not processe graph structure representing the relationships and transactions. This paper, we propose the technique that can be processed concurrently graph structures and transactions in a scalable complex network system. The proposed technique simultaneously save and navigate graph structures and transactions using the Topic Maps data model. Topic Maps is one of ontology language to implement the semantic web(Web 3.0). It has been used as the navigator of the information through the association of the information resources. In this paper, the architecture of the proposed technique was implemented and design using Cassandra - one of column type NoSQL. It is to ensure that can handle up to Big Data-level data using distributed processing. Finally, the experiments showed about the process of storage and query about typical RDBMS Oracle and the proposed technique to the same data source and the same questions. It can show that is expressed by the relationship without the 'join' enough alternative to the role of the RDBMS.

Subgraph Searching Scheme Based on Path Queries in Distributed Environments (분산 환경에서 경로 질의 기반 서브 그래프 탐색 기법)

  • Kim, Minyoung;Choi, Dojin;Park, Jaeyeol;Kim, Yeondong;Lim, Jongtae;Bok, Kyoungsoo;Choi, Han Suk;Yoo, Jaesoo
    • The Journal of the Korea Contents Association
    • /
    • v.19 no.1
    • /
    • pp.141-151
    • /
    • 2019
  • A network of graph data structure is used in many applications to represent interactions between entities. Recently, as the size of the network to be processed due to the development of the big data technology is getting larger, it becomes more difficult to handle it in one server, and thus the necessity of distributed processing is also increasing. In this paper, we propose a distributed processing system for efficiently performing subgraph and stores. To reduce unnecessary searches, we use statistical information of the data to determine the search order through probabilistic scoring. Since the relationship between the vertex and the degree of the graph network may show different characteristics depending on the type of data, the search order is determined by calculating a score to reduce unnecessary search through a different scoring method for a graph having various distribution characteristics. The graph is sequentially searched in the distributed servers according to the determined order. In order to demonstrate the superiority of the proposed method, performance comparison with the existing method was performed. As a result, the search time is improved by about 3 ~ 10% compared with the existing method.

MapReduce-based Localized Linear Regression for Electricity Price Forecasting (전기 가격 예측을 위한 맵리듀스 기반의 로컬 단위 선형회귀 모델)

  • Han, Jinju;Lee, Ingyu;On, Byung-Won
    • The Transactions of the Korean Institute of Electrical Engineers P
    • /
    • v.67 no.4
    • /
    • pp.183-190
    • /
    • 2018
  • Predicting accurate electricity prices is an important task in the electricity trading market. To address the electricity price forecasting problem, various approaches have been proposed so far and it is known that linear regression-based approaches are the best. However, the use of such linear regression-based methods is limited due to low accuracy and performance. In traditional linear regression methods, it is not practical to find a nonlinear regression model that explains the training data well. If the training data is complex (i.e., small-sized individual data and large-sized features), it is difficult to find the polynomial function with n terms as the model that fits to the training data. On the other hand, as a linear regression model approximating a nonlinear regression model is used, the accuracy of the model drops considerably because it does not accurately reflect the characteristics of the training data. To cope with this problem, we propose a new electricity price forecasting method that divides the entire dataset to multiple split datasets and find the best linear regression models, each of which is the optimal model in each dataset. Meanwhile, to improve the performance of the proposed method, we modify the proposed localized linear regression method in the map and reduce way that is a framework for parallel processing data stored in a Hadoop distributed file system. Our experimental results show that the proposed model outperforms the existing linear regression model. Specifically, the accuracy of the proposed method is improved by 45% and the performance is faster 5 times than the existing linear regression-based model.

The Standard Processing of a Time Series of Imaging Spectral Data Taken by the Fast Imaging Solar Spectrograph on the Goode Solar Telescope

  • Chae, Jongchul;Kang, Juhyeong;Cho, Kyuhyoun
    • The Bulletin of The Korean Astronomical Society
    • /
    • v.43 no.2
    • /
    • pp.46.1-46.1
    • /
    • 2018
  • The Fast Imaging Solar Spectrograph (FISS) on the Goode Solar Telescope (GST) at Big Bear Solar Observatory is the imaging Echelle spectrograph developed by the Solar Astronomy Group of Seoul National University and the Solar and Space Weather Group of Korea Astronomy and Space Science Institute. The instrument takes spectral data from a region on the Sun in two spectral bands simultaneously. The imaging is done by the organization of intensity data obtained from the fast raster scan of the slit over the field of view. Since the scan repeats many times, the whole set of data can be used to construct the movies of monochromatic intensity at arbitrary wavelengths within the spectral bands, and those of line-of-sight velocity inferred from different spectral lines. So far there are two standard observing configurations: one recording the $H{\alpha}$ line and the Ca II 8542 line simultaneously, and the other recording the Na I D2 line and Fe I 5435 line simultaneously. We have developed the procedures to produce the standard data for each observing configuration. The procedures include the spatial alignment, the correction of spectral shift of instrumental origin, and the lambdameter measurement of the line wavelength. The standard data include the movie of continuum intensity, the movies of intensity and velocity inferred from a chromospheric spectral line, the movies of intensity and velocity inferred from a photospheric line. The processed standard data will be freely available online (fiss.snu.ac.kr) to be used for research and public outreach. Moreover, the IDL procedures will be provided on request as well so that each researcher can adapt the programs for their own research.

  • PDF

Building Energy Time Series Data Mining for Behavior Analytics and Forecasting Energy consumption

  • Balachander, K;Paulraj, D
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.15 no.6
    • /
    • pp.1957-1980
    • /
    • 2021
  • The significant aim of this research has always been to evaluate the mechanism for efficient and inherently aware usage of vitality in-home devices, thus improving the information of smart metering systems with regard to the usage of selected homes and the time of use. Advances in information processing are commonly used to quantify gigantic building activity data steps to boost the activity efficiency of the building energy systems. Here, some smart data mining models are offered to measure, and predict the time series for energy in order to expose different ephemeral principles for using energy. Such considerations illustrate the use of machines in relation to time, such as day hour, time of day, week, month and year relationships within a family unit, which are key components in gathering and separating the effect of consumers behaviors in the use of energy and their pattern of energy prediction. It is necessary to determine the multiple relations through the usage of different appliances from simultaneous information flows. In comparison, specific relations among interval-based instances where multiple appliances use continue for certain duration are difficult to determine. In order to resolve these difficulties, an unsupervised energy time-series data clustering and a frequent pattern mining study as well as a deep learning technique for estimating energy use were presented. A broad test using true data sets that are rich in smart meter data were conducted. The exact results of the appliance designs that were recognized by the proposed model were filled out by Deep Convolutional Neural Networks (CNN) and Recurrent Neural Networks (LSTM and GRU) at each stage, with consolidated accuracy of 94.79%, 97.99%, 99.61%, for 25%, 50%, and 75%, respectively.

High-performance computing for SARS-CoV-2 RNAs clustering: a data science-based genomics approach

  • Oujja, Anas;Abid, Mohamed Riduan;Boumhidi, Jaouad;Bourhnane, Safae;Mourhir, Asmaa;Merchant, Fatima;Benhaddou, Driss
    • Genomics & Informatics
    • /
    • v.19 no.4
    • /
    • pp.49.1-49.11
    • /
    • 2021
  • Nowadays, Genomic data constitutes one of the fastest growing datasets in the world. As of 2025, it is supposed to become the fourth largest source of Big Data, and thus mandating adequate high-performance computing (HPC) platform for processing. With the latest unprecedented and unpredictable mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the research community is in crucial need for ICT tools to process SARS-CoV-2 RNA data, e.g., by classifying it (i.e., clustering) and thus assisting in tracking virus mutations and predict future ones. In this paper, we are presenting an HPC-based SARS-CoV-2 RNAs clustering tool. We are adopting a data science approach, from data collection, through analysis, to visualization. In the analysis step, we present how our clustering approach leverages on HPC and the longest common subsequence (LCS) algorithm. The approach uses the Hadoop MapReduce programming paradigm and adapts the LCS algorithm in order to efficiently compute the length of the LCS for each pair of SARS-CoV-2 RNA sequences. The latter are extracted from the U.S. National Center for Biotechnology Information (NCBI) Virus repository. The computed LCS lengths are used to measure the dissimilarities between RNA sequences in order to work out existing clusters. In addition to that, we present a comparative study of the LCS algorithm performance based on variable workloads and different numbers of Hadoop worker nodes.

Detecting Vehicles That Are Illegally Driving on Road Shoulders Using Faster R-CNN (Faster R-CNN을 이용한 갓길 차로 위반 차량 검출)

  • Go, MyungJin;Park, Minju;Yeo, Jiho
    • The Journal of The Korea Institute of Intelligent Transport Systems
    • /
    • v.21 no.1
    • /
    • pp.105-122
    • /
    • 2022
  • According to the statistics about the fatal crashes that have occurred on the expressways for the last 5 years, those who died on the shoulders of the road has been as 3 times high as the others who died on the expressways. It suggests that the crashes on the shoulders of the road should be fatal, and that it would be important to prevent the traffic crashes by cracking down on the vehicles intruding the shoulders of the road. Therefore, this study proposed a method to detect a vehicle that violates the shoulder lane by using the Faster R-CNN. The vehicle was detected based on the Faster R-CNN, and an additional reading module was configured to determine whether there was a shoulder violation. For experiments and evaluations, GTAV, a simulation game that can reproduce situations similar to the real world, was used. 1,800 images of training data and 800 evaluation data were processed and generated, and the performance according to the change of the threshold value was measured in ZFNet and VGG16. As a result, the detection rate of ZFNet was 99.2% based on Threshold 0.8 and VGG16 93.9% based on Threshold 0.7, and the average detection speed for each model was 0.0468 seconds for ZFNet and 0.16 seconds for VGG16, so the detection rate of ZFNet was about 7% higher. The speed was also confirmed to be about 3.4 times faster. These results show that even in a relatively uncomplicated network, it is possible to detect a vehicle that violates the shoulder lane at a high speed without pre-processing the input image. It suggests that this algorithm can be used to detect violations of designated lanes if sufficient training datasets based on actual video data are obtained.

A Study on COP-Transformation Based Metadata Security Scheme for Privacy Protection in Intelligent Video Surveillance (지능형 영상 감시 환경에서의 개인정보보호를 위한 COP-변환 기반 메타데이터 보안 기법 연구)

  • Lee, Donghyeok;Park, Namje
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.28 no.2
    • /
    • pp.417-428
    • /
    • 2018
  • The intelligent video surveillance environment is a system that extracts various information about a video object and enables automated processing through the analysis of video data collected in CCTV. However, since the privacy exposure problem may occur in the process of intelligent video surveillance, it is necessary to take a security measure. Especially, video metadata has high vulnerability because it can include various personal information analyzed based on big data. In this paper, we propose a COP-Transformation scheme to protect video metadata. The proposed scheme is advantageous in that it greatly enhances the security and efficiency in processing the video metadata.

3-D Hetero-Integration Technologies for Multifunctional Convergence Systems

  • Lee, Kang-Wook
    • Journal of the Microelectronics and Packaging Society
    • /
    • v.22 no.2
    • /
    • pp.11-19
    • /
    • 2015
  • Since CMOS device scaling has stalled, three-dimensional (3-D) integration allows extending Moore's law to ever high density, higher functionality, higher performance, and more diversed materials and devices to be integrated with lower cost. 3-D integration has many benefits such as increased multi-functionality, increased performance, increased data bandwidth, reduced power, small form factor, reduced packaging volume, because it vertically stacks multiple materials, technologies, and functional components such as processor, memory, sensors, logic, analog, and power ICs into one stacked chip. Anticipated applications start with memory, handheld devices, and high-performance computers and especially extend to multifunctional convengence systems such as cloud networking for internet of things, exascale computing for big data server, electrical vehicle system for future automotive, radioactivity safety system, energy harvesting system and, wireless implantable medical system by flexible heterogeneous integrations involving CMOS, MEMS, sensors and photonic circuits. However, heterogeneous integration of different functional devices has many technical challenges owing to various types of size, thickness, and substrate of different functional devices, because they were fabricated by different technologies. This paper describes new 3-D heterogeneous integration technologies of chip self-assembling stacking and 3-D heterogeneous opto-electronics integration, backside TSV fabrication developed by Tohoku University for multifunctional convergence systems. The paper introduce a high speed sensing, highly parallel processing image sensor system comprising a 3-D stacked image sensor with extremely fast signal sensing and processing speed and a 3-D stacked microprocessor with a self-test and self-repair function for autonomous driving assist fabricated by 3-D heterogeneous integration technologies.

A Study about Performance Evaluation of Various NoSQL Databases (다양한 NoSQL 데이터베이스의 성능 평가 연구)

  • Park, Hong-Jin
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology
    • /
    • v.9 no.3
    • /
    • pp.298-305
    • /
    • 2016
  • Various NoSQL databases are more excellent to process a large amount of big data than existing relational databases such as MySQL, PostgreSQL and Oracle. Among widely used NoSQL databases, performance of HBase, Cassandra, MongoDB and Redis was comparatively assessed. For distributed processing of a large amount of data, 12 servers were connected through switching hub and Ubuntu was installed as operating system. As for benchmark tool, YCSB was applied. Read and update ratios changed from 50% and 50%, 95% and 5% and finally, 100% and 0% and each of them was assessed as 200,000 commands developed into 1,200,000 commands for each case. Cassandra was most excellent with transaction processing per second while MongoDB was most excellent with the number of processes carried out per unit time.