• Title/Abstract/Keywords: Big data Processing

Resolving data imbalance through differentiated anomaly data processing based on verification data (검증데이터 기반의 차별화된 이상데이터 처리를 통한 데이터 불균형 해소 방법)

  • Hwang, Chulhyun
    • Journal of Intelligence and Information Systems
    • /
    • v.28 no.4
    • /
    • pp.179-190
    • /
    • 2022
  • Data imbalance refers to a phenomenon in which one category has far more or far fewer samples than another. It has been identified as a major factor that degrades the performance of machine learning with classification algorithms. To address data imbalance, various oversampling methods that amplify minority-class data have been proposed, of which SMOTE is the most representative. To maximize the amplification effect on minority-class data, variants have emerged that remove noise in the data (SMOTE-IPF) or amplify only borderline samples (Borderline SMOTE). This paper proposes a method that improves classification performance by changing how anomaly data are handled in the traditional SMOTE method for amplifying minority-class data. In experiments, the proposed method consistently showed higher classification performance than existing methods.
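For readers unfamiliar with SMOTE, the following is a minimal sketch of its core interpolation step only: a synthetic sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. It does not reproduce the anomaly-handling method proposed in the paper; the function names, the pure-Python distance computation, and the parameter defaults are assumptions made for illustration.

```python
import random
from typing import List

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like_samples(minority: List[Vector], n_new: int,
                       k: int = 2, seed: int = 0) -> List[Vector]:
    """Generate n_new synthetic minority samples by neighbour interpolation.

    Hypothetical helper: requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: squared_distance(base, minority[j]))
        # Choose one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # Interpolate: base + gap * (neighbour - base), with gap in [0, 1).
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Because each synthetic point is a convex combination of two existing minority points, noisy or anomalous minority samples are amplified along with the rest, which is the weakness that variants such as SMOTE-IPF and Borderline SMOTE target.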

Real-time data processing and visualization for road weather services (도로기상 서비스를 위한 실시간 자료처리 및 시각화)

  • Kim, DaeSung;Ahn, Sukhee;Lee, Chaeyeon;Yoon, Sanghoo
    • Journal of Digital Convergence
    • /
    • v.18 no.4
    • /
    • pp.221-228
    • /
    • 2020
  • As industrial technology advances, everyday convenience is improving as well. Many people in big cities commute and travel for leisure by bus, taxi, or car, so research is needed to reduce the damage caused by traffic accidents. This study estimates road-level rainfall in real time. Rainfall observation data and radar data provided by the Korea Meteorological Administration were collected in real time into an integrated database, from which road-level rainfall was estimated by universal kriging. In addition, we studied interactive real-time visualization that mashes up road traffic information with the rainfall information.

Machine learning application for predicting the strawberry harvesting time

  • Yang, Mi-Hye;Nam, Won-Ho;Kim, Taegon;Lee, Kwanho;Kim, Younghwa
    • Korean Journal of Agricultural Science
    • /
    • v.46 no.2
    • /
    • pp.381-393
    • /
    • 2019
  • A smart farm is a system that combines information and communication technology (ICT), the internet of things (IoT), and agricultural technology so that a farm can operate with minimal labor and automatically control the greenhouse environment. Data-driven machine learning has emerged alongside big data technologies and high-performance computing, creating opportunities to quantify data-intensive processes in agricultural operations. This paper presents research on applying machine learning to diagnose the growth status of crops and predict the harvest time of strawberries in a greenhouse using image processing techniques. To classify the growth stages of the strawberries, we used object inference and detection with a machine learning model based on deep neural networks and TensorFlow. Classification accuracy was compared across training data volumes and training epochs. The model classified with an accuracy of over 90% using 200 training images and 8,000 training steps. Strawberry maturity was detected and classified with an accuracy of over 90% at the mature and over-mature stages. The experimental results are promising: they show that this approach can be used to develop a machine learning model for predicting the strawberry harvesting time and to provide key decision support information to both farmers and policy makers about optimal harvest times and harvest planning.

Design and analysis of monitoring system for illegal overseas direct purchase based on C2C (C2C에 기반으로 해외직구 불법거래에 관한 모니터링 시스템 설계 및 분석)

  • Shin, Yong-Hun;Kim, Jeong-Ho
    • Journal of Digital Convergence
    • /
    • v.20 no.5
    • /
    • pp.609-615
    • /
    • 2022
  • In this paper, we propose a monitoring system for illegal overseas direct purchases based on C2C transactions between individuals. The Customs Act exempts direct purchases from overseas from taxation only if they are below a certain amount (US$150, or US$200 for the US) or are recognized as goods for personal use. Reselling items that were bought overseas under this tax exemption, online or otherwise, constitutes the crime of smuggling without a report. Nevertheless, the number of resales on online second-hand websites is increasing, and the continued violation of the Customs Act is becoming a controversial social issue. This study therefore collects unspecified transaction details related to overseas direct purchases, refines the data using big data methods, and designs and analyzes a monitoring system based on natural language processing and related techniques. The system could be used to crack down on illegal transactions of overseas direct purchase goods.

Job-related analysis and visualization using big data distributed processing system (빅데이터를 활용한 직업관련 분석 및 시각화)

  • Choi, Dong-Cheol;Choi, Nakjin;Kim, Min-Seok;Park, Jun-wook;Lee, Jun-Dong
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2020.07a
    • /
    • pp.249-251
    • /
    • 2020
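  • In this paper, we performed job-related analysis and visualization using big data to examine how the COVID-19 outbreak has affected the domestic job market. The base data were drawn from Statistics Korea and the WorkNet Open API, and we attempted to predict outcomes after a big data processing step. We compared and visualized employment figures from the 2020 WorkNet Open API against Statistics Korea data, and performed time-series analysis and forecasting on the number of employed persons from 2008 to 2020 to anticipate the future trend. The comparison of 2019 and 2020 showed no large difference. The time-series analysis additionally showed that employment increases overall each year, with a decrease in April and an increase in July. The results indicated that, owing to COVID-19, employment in transportation and communications rises because of public institutions and the video conferencing and remote work of the contact-free era, while self-employment and service occupations showed a larger decline than other occupations; employment is nevertheless predicted to increase gradually as the national economy is revitalized.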

Video Big Data Processing Scheme for Spatio-Temporal Analysis of Moving Objects (움직이는 물체의 시공간 분석을 위한 동영상 빅 데이터 처리 방안)

  • Jung, Seungwon;Kim, Yongsung;Jung, Sangwon;Kim, Yoonki;Hwang, Eenjun
    • Annual Conference of KIPS
    • /
    • 2017.04a
    • /
    • pp.833-836
    • /
    • 2017
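  • As video recording devices such as dashcams and CCTV have become widespread, vast amounts of video data are being generated in real time. If vehicle information could be extracted from this large-volume data, applications such as tracking vehicles involved in crimes and measuring traffic congestion would become possible. Realizing this requires a system that can process video data generated in real time by numerous cars, but such systems are hard to find in practice. This paper therefore proposes a video big data processing system using Apache Kafka and HBase. Apache Kafka handles lossless video transmission within the system and the scheduling of video processing nodes, while HBase stores the processed data in tables and handles queries sent by users. In addition, an index is built on HBase to enable fast query processing. Experiments confirmed that the proposed system achieves better query processing performance than the same system without an index.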

A Study on an Automatic Classification Model for Facet-Based Multidimensional Analysis of Civil Complaints (패싯 기반 민원 다차원 분석을 위한 자동 분류 모델)

  • Na Rang Kim
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.29 no.1
    • /
    • pp.135-144
    • /
    • 2024
  • In this study, we propose an automatic classification model for quantitative multidimensional analysis based on facet theory to understand public opinions and demands on major issues through big data analysis. Civil complaints, as a form of public feedback, are generated by various individuals on multiple topics repeatedly and continuously in real-time, which can be challenging for officials to read and analyze efficiently. Specifically, our research introduces a new classification framework that utilizes facet theory and political analysis models to analyze the characteristics of citizen complaints and apply them to the policy-making process. Furthermore, to reduce administrative tasks related to complaint analysis and processing and to facilitate citizen policy participation, we employ deep learning to automatically extract and classify attributes based on the facet analysis framework. The results of this study are expected to provide important insights into understanding and analyzing the characteristics of big data related to citizen complaints, which can pave the way for future research in various fields beyond the public sector, such as education, industry, and healthcare, for quantifying unstructured data and utilizing multidimensional analysis. In practical terms, improving the processing system for large-scale electronic complaints and automation through deep learning can enhance the efficiency and responsiveness of complaint handling, and this approach can also be applied to text data processing in other fields.

Data Transmitting and Storing Scheme based on Bandwidth in Hadoop Cluster (하둡 클러스터의 대역폭을 고려한 압축 데이터 전송 및 저장 기법)

  • Kim, Youngmin;Kim, Heejin;Kim, Younggwan;Hong, Jiman
    • Smart Media Journal
    • /
    • v.8 no.4
    • /
    • pp.46-52
    • /
    • 2019
  • The size of data generated and collected at industrial sites or in public institutions is growing rapidly. Existing data processing servers often cope with the growing data by scaling up. However, in the big data era, when the speed of data generation is exploding, there is a limit to what a conventional server can process. To overcome such limitations, distributed cluster computing systems have been introduced that distribute data in a scale-out manner. However, because distributed cluster computing systems distribute data, inefficient use of network bandwidth can degrade the performance of the cluster as a whole. In this paper, we propose a scheme that compresses data before transmitting it in a Hadoop cluster, taking network bandwidth into account. The proposed scheme considers the network bandwidth and the characteristics of each compression algorithm and selects the optimal compressed transmission scheme before transmission. Experimental results show that the proposed scheme reduces data transfer time and size.
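The trade-off described in this abstract can be illustrated with a simple cost model: compression pays off only when the time spent compressing is smaller than the transfer time it saves. The sketch below is not the paper's scheme; the codec names, compression ratios, throughputs, and the decision to ignore decompression time are all assumptions chosen for illustration.

```python
from typing import Dict, Tuple

# Hypothetical codec profiles: name -> (compressed/original size ratio,
# compression throughput in MB/s). The numbers are illustrative only.
CODECS: Dict[str, Tuple[float, float]] = {
    "fast": (0.50, 400.0),
    "strong": (0.25, 50.0),
}


def estimated_time(size_mb: float, bandwidth_mb_s: float,
                   ratio: float = 1.0, compress_mb_s: float = 0.0) -> float:
    """Estimated seconds to (optionally) compress and then send size_mb."""
    compress_time = size_mb / compress_mb_s if compress_mb_s > 0 else 0.0
    transfer_time = size_mb * ratio / bandwidth_mb_s
    return compress_time + transfer_time


def choose_codec(size_mb: float, bandwidth_mb_s: float,
                 codecs: Dict[str, Tuple[float, float]] = CODECS) -> str:
    """Return the option with the lowest estimated time; 'none' means send raw."""
    options = {"none": estimated_time(size_mb, bandwidth_mb_s)}
    for name, (ratio, speed) in codecs.items():
        options[name] = estimated_time(size_mb, bandwidth_mb_s, ratio, speed)
    return min(options, key=options.get)
```

Under these assumed profiles, a slow link favours the stronger codec, a mid-range link favours the faster one, and a very fast link favours sending the data uncompressed.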

An Efficient Angular Space Partitioning Based Skyline Query Processing Using Sampling-Based Pruning (데이터 샘플링 기반 프루닝 기법을 도입한 효율적인 각도 기반 공간 분할 병렬 스카이라인 질의 처리 기법)

  • Choi, Woosung;Kim, Minseok;Diana, Gromyko;Chung, Jaehwa;Jung, Soonyong
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.6 no.1
    • /
    • pp.1-8
    • /
    • 2017
  • Given a multi-dimensional dataset of tuples, a skyline query returns the subset of tuples that are not 'dominated' by any other tuple. Skyline queries are very useful in big data analysis because they filter out uninteresting items. Much interest has been devoted to MapReduce-based parallel processing of skyline queries in large-scale distributed environments. There are three requirements for improving parallelism in MapReduce-based algorithms: (1) the workload should be well balanced, (2) redundant computations should be avoided, and (3) network communication cost should be optimized. In this paper, we introduce MR-SEAP (MapReduce sample Skyline object Equality Angular Partitioning), an efficient angular space partitioning based skyline query processing method using sampling-based pruning, which satisfies the requirements above. We conduct extensive experiments to evaluate MR-SEAP.
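The dominance relation behind a skyline query can be sketched in a few lines. The example below is a naive single-machine version that only illustrates the definition; it is not MR-SEAP and involves no MapReduce, angular partitioning, or sampling. It assumes that smaller values are better in every dimension, and the function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if a is no worse in every dimension and strictly
    better in at least one (assuming smaller values are better)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Return the points not dominated by any other point (O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For example, with hypothetical (price, distance) pairs `[(50, 8), (60, 3), (80, 1), (70, 5), (90, 9)]`, the point `(70, 5)` is dominated by `(60, 3)` and `(90, 9)` by `(50, 8)`, so the skyline is the remaining three points.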

Design and Implementation of Memory-Centric Computing System for Big Data Analysis

  • Jung, Byung-Kwon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.27 no.7
    • /
    • pp.1-7
    • /
    • 2022
  • As applications such as big data and machine learning programs, which generate large amounts of data while running, become common, main memory alone is no longer sufficient, making it difficult to execute such programs quickly. In particular, the need to derive results more quickly became apparent with the outbreak of the coronavirus, when entire sequences had to be analyzed for genetic alterations. We measured performance by processing large-capacity data on a computing system equipped with a self-developed memory pool MOCA host adapter instead of on an existing SSD, and performance improved by 16% compared to the existing SSD system. In further benchmarks of individual workflow tasks, namely SortSampleBam, ApplyBQSR, and GatherBamFiles, IO performance on the system equipped with the memory pool MOCA host adapter was 92.8%, 80.6%, and 32.8% faster than on SSD, respectively. When analyzing large amounts of data, as in genome analysis pipelines, the delay occurring at runtime is expected to be reduced on a computing system equipped with the memory pool MOCA host adapter developed in this research.