• Title/Summary/Keyword: Data Collection and Preprocessing


Development of a Data Science Education Program for High School Students Taking the High School Credit System (고교학점제 수강 고등학생을 위한 데이터과학교육 프로그램 개발)

  • Semin Kim;SungHee Woo
    • Journal of Practical Engineering Education / v.14 no.3 / pp.471-477 / 2022
  • In this study, an educational program was developed that allows students taking data science courses under the high school credit system to explore related fields after completing the coursework. To that end, existing research on and requirements for data science education were analyzed, a learning plan was designed, and the program was developed step by step. Because no prior research addresses data science education for the high school credit system, the study was structured around the stages of problem definition, data collection, data preprocessing, data analysis, data visualization, and simulation, drawing on studies of data science education already conducted in schools; a sketch of such a staged pipeline appears below. Through this study, research on data science education in the high school credit system is expected to become more active.
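
The staged workflow named above maps naturally onto a small hands-on exercise. The sketch below is an illustrative example of such a pipeline, not material from the paper; the CSV file and column names are hypothetical.

```python
# Minimal sketch of the staged workflow: collection -> preprocessing ->
# analysis -> visualization. The file and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

# Data collection: load a dataset students might gather
df = pd.read_csv("student_survey.csv")

# Data preprocessing: drop missing rows and standardize a numeric column
df = df.dropna()
df["score_norm"] = (df["score"] - df["score"].mean()) / df["score"].std()

# Data analysis: simple group statistics
summary = df.groupby("grade")["score_norm"].mean()

# Data visualization: plot the result
summary.plot(kind="bar", title="Mean normalized score by grade")
plt.show()
```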

Design of Client-Server Model For Effective Processing and Utilization of Bigdata (빅데이터의 효과적인 처리 및 활용을 위한 클라이언트-서버 모델 설계)

  • Park, Dae Seo;Kim, Hwa Jong
    • Journal of Intelligence and Information Systems / v.22 no.4 / pp.109-122 / 2016
  • Recently, big data analysis has grown into a field of interest not only to companies and professionals but also to individuals and non-experts, who apply it to marketing and social problem solving by analyzing openly available or directly collected data. In Korea, many companies and individuals attempt big data analysis, but they struggle from the earliest stages of analysis because of limited data disclosure and collection difficulties. System improvements for big data activation and big data disclosure services are under way in Korea and abroad, chiefly services that open public data, such as the Korean Government 3.0 portal (data.go.kr). Beyond these government efforts, services that share data held by corporations or individuals also operate, but useful data is hard to find because so little is shared. Moreover, grasping even basic attributes of shared data currently requires downloading and examining the entire dataset, which can cause heavy network traffic. A new system for big data processing and utilization is therefore needed. First, pre-analysis technology is needed to address the data-sharing problem. Pre-analysis, a concept proposed in this paper, means providing users with results generated by analyzing the data in advance: when a user searches for big data, the pre-analysis results convey the properties and characteristics of the data, improving its usability. Sharing the summary or sample data produced by pre-analysis also mitigates the security problems that arise when raw data is disclosed, enabling sharing between data providers and data users. Second, appropriate preprocessing results must be generated quickly, according to the disclosure level of the raw data and the network status, through distributed big data processing with Spark. Third, to avoid heavy traffic, the system monitors network traffic in real time and, when preprocessing data requested by a user, reduces it to a size the current network can carry before transmission. This paper presents data of various sizes according to disclosure level via pre-analysis; this approach is expected to generate far less traffic than the conventional practice of sharing only raw data across many systems. The proposed client-server model uses Spark for fast analysis and processing of user requests and consists of a Server Agent and a Client Agent, deployed on the server and client sides, respectively. The Server Agent, required by the data provider, pre-analyzes the big data to generate a Data Descriptor containing information on Sample Data, Summary Data, and Raw Data; it also performs fast, efficient preprocessing through distributed processing and continuously monitors network traffic. The Client Agent, placed on the data-user side, searches big data through the Data Descriptor produced by pre-analysis and requests the desired data from the server for download according to its disclosure level. The model separates the Server Agent and Client Agent so that data published by a provider can be used by others. Focusing on big data sharing, distributed big data processing, and the heavy-traffic problem, the paper constructs the detailed modules of the client-server model and presents the design of each. In a system built on the proposed model, a user who acquires data can analyze it in a desired direction or preprocess it into new data; by publishing the newly processed data through a Server Agent, the data user takes on the role of data provider. Conversely, a provider can obtain useful statistical information from the Data Descriptor of its own disclosed data and, using the sample data, become a data user performing new analysis. Raw data is thus processed and the processed big data reused, naturally forming a shared environment in which the roles of data provider and data user are not fixed: everyone can be both. The client-server model solves the big data sharing problem, provides a free sharing environment for secure big data disclosure, and offers an ideal shared service in which big data is easy to find.
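
As a rough illustration of the pre-analysis idea, the sketch below builds a Data Descriptor with Spark, holding summary data and sample data that can be shared in place of the raw file. The path, sample fraction, and descriptor fields are assumptions, not the paper's implementation.

```python
# Illustrative sketch of the pre-analysis step: build a Data Descriptor
# (summary statistics plus a small sample) that can be shared instead of
# the raw data. Path, columns, and sample fraction are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pre-analysis").getOrCreate()
raw = spark.read.csv("hdfs:///shared/raw_data.csv", header=True, inferSchema=True)

descriptor = {
    "row_count": raw.count(),
    "columns": raw.columns,
    # Summary Data: per-column count/mean/stddev/min/max
    "summary": raw.describe().collect(),
    # Sample Data: a small random fraction instead of the full file
    "sample": raw.sample(fraction=0.01, seed=42).limit(100).collect(),
}

# A data user can inspect the descriptor to judge whether the raw data
# is worth requesting at a given disclosure level.
print(descriptor["row_count"], descriptor["columns"])
```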

Analysis of Ammunition Inspection Record Data and Development of Ammunition Condition Code Classification Model (탄약검사기록 데이터 분석 및 탄약상태기호 분류 모델 개발)

  • Young-Jin Jung;Ji-Soo Hong;Sol-Ip Kim;Sung-Woo Kang
    • Journal of the Korea Safety Management & Science / v.26 no.2 / pp.23-31 / 2024
  • Ammunition and explosives stored and managed by the military can cause serious damage if mishandled, so securing safety through the use of ammunition reliability data is essential. In this study, exploratory data analysis of ammunition inspection records is conducted to extract reliability information on stored ammunition and to predict the ammunition condition code, which represents the lifespan of the ammunition. The study consists of three stages: collection and preprocessing of ammunition inspection records, exploratory data analysis, and classification of ammunition condition codes. For the classification, five boosting-based models are employed (AdaBoost, GBM, XGBoost, LightGBM, CatBoost), and the best-performing model is selected using Accuracy, Precision, Recall, and F1-score. The ammunition in this study was produced mainly from the 1980s to the 1990s, with inspection volume peaking early in production and again around 30 years after production. Pre-issue inspections (PII) were predominant, and the grade of the ammunition condition code tended to decrease as the storage period increased. Among the classifiers, CatBoost performed best, with an Accuracy of 93% and an F1-score of 93%. Emphasizing ammunition safety and reliability, the study proposes a condition-code classification model built from inspection record data. The model can serve as a tool to assist ammunition inspectors and is expected to improve both the safety of ammunition and the efficiency of ammunition storage management.
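
A minimal sketch of the model-comparison stage follows, training the five named boosting classifiers and scoring them on accuracy and macro F1. The loader function is a hypothetical stand-in for the non-public inspection records.

```python
# Sketch of the model-comparison stage: train several boosting classifiers
# and compare accuracy and macro F1. load_inspection_features() is a
# hypothetical loader standing in for the ammunition inspection records.
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

X, y = load_inspection_features()  # hypothetical loader for the records

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "AdaBoost": AdaBoostClassifier(),
    "GBM": GradientBoostingClassifier(),
    "XGBoost": XGBClassifier(),
    "LightGBM": LGBMClassifier(),
    "CatBoost": CatBoostClassifier(verbose=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(name, accuracy_score(y_te, pred), f1_score(y_te, pred, average="macro"))
```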

Implementation of a pet product recommendation system using big data (빅 데이터를 활용한 애완동물 상품 추천 시스템 구현)

  • Kim, Sam-Taek
    • Journal of the Korea Convergence Society / v.11 no.11 / pp.19-24 / 2020
  • Recently, owing to the rapid increase in pet ownership, there is a need for an integrated, personalized pet-product recommendation service, such as feed recommendation based on pet health checks and various collected data. This paper implements a product recommendation system that performs personalized services such as collection, preprocessing, analysis, and management of pet-related data using big data. First, sensor data from devices worn by pets, customer purchase patterns, and SNS information are collected and stored in a database, and a platform for customized recommendation services, such as feed production and pet health management, is implemented using statistical analysis. The platform provides customers with information on products similar to the one being analyzed and finally outputs the result of the recommendation analysis.
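
One plausible reading of the similarity step is item-to-item cosine similarity over product feature vectors; the sketch below illustrates that approach with toy data, not the paper's actual platform code.

```python
# Illustrative similarity-based recommendation: compute cosine similarity
# between product feature vectors and return the most similar items.
# The feature matrix and product IDs are toy values.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows: products; columns: aggregated features from sensors, purchase
# patterns, and SNS signals (hypothetical values).
product_features = np.array([
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.4],
    [0.1, 0.9, 0.7],
])
product_ids = ["feed_A", "feed_B", "toy_C"]

sim = cosine_similarity(product_features)

def recommend(product_idx: int, top_k: int = 2) -> list:
    # Rank other products by similarity to the queried one
    order = np.argsort(-sim[product_idx])
    return [product_ids[i] for i in order if i != product_idx][:top_k]

print(recommend(0))  # products most similar to feed_A
```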

Deep sequencing of B cell receptor repertoire

  • Kim, Daeun;Park, Daechan
    • BMB Reports / v.52 no.9 / pp.540-547 / 2019
  • The immune repertoire is the enormously diverse collection of adaptive immune cells within an individual. Because the repertoire shapes and reflects immunological conditions, identifying clones and characterizing diversity are critical for understanding how we are protected against illnesses such as infectious diseases and cancers. Over the past several years, fast-growing high-throughput sequencing technologies have accelerated repertoire research, enabling observation of repertoire diversity at an unprecedented level. Here, we focus on the B cell receptor (BCR) repertoire and review approaches to B cell isolation and sequencing library construction. These experiments should be designed carefully according to the BCR regions to be interrogated, such as the heavy-chain full length, complementarity-determining regions, and isotypes. We also highlight preprocessing steps that remove sequencing and PCR errors using unique molecular indices and bioinformatics techniques; given the massive sequence variation in BCRs, caution is warranted when interpreting repertoire diversity from error-prone sequencing data. Furthermore, we summarize statistical frameworks and bioinformatics tools for clonal evolution and diversity. Finally, we discuss limitations of current BCR-seq technologies and future perspectives on advances in repertoire sequencing.
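
As an illustration of a downstream diversity statistic, the sketch below computes Shannon entropy over clone abundances, assuming UMI-collapsed counts are already available; the clone labels and counts are toy values.

```python
# Sketch of a repertoire diversity calculation: Shannon entropy over
# clone abundances after UMI-based deduplication. Clone counts here are
# toy values; real pipelines derive them from error-corrected reads.
import math
from collections import Counter

# Hypothetical clone -> read count table after UMI collapsing
clone_counts = Counter({"IGHV1-69": 120, "IGHV3-23": 80, "IGHV4-34": 40})

total = sum(clone_counts.values())
shannon = -sum((n / total) * math.log(n / total) for n in clone_counts.values())
print(f"Shannon diversity: {shannon:.3f}")
```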

A Path Travel Time Estimation Study on Expressways using TCS Link Travel Times (TCS 링크통행시간을 이용한 고속도로 경로통행시간 추정)

  • Lee, Hyeon-Seok;Jeon, Gyeong-Su
    • Journal of Korean Society of Transportation / v.27 no.5 / pp.209-221 / 2009
  • Estimating travel time under given traffic conditions is important for providing drivers with travel-time prediction information, but the present expressway travel-time estimation process cannot produce reliable values. The objective of this study is to estimate the path travel time in a through lane between origin and destination tollgates on an expressway, as a prerequisite for offering reliable prediction information, using the useful and abundant data from the toll collection system (TCS). The path travel time is estimated by combining link travel times obtained through a preprocessing step. Where TCS data are sparse, the TCS travel time of previous intervals is referenced by linear interpolation after analyzing the growth pattern of travel times; where TCS data are absent over a long period, a dynamic travel time is estimated using the VDS time-space diagram. The travel time estimated by the proposed model is statistically validated against travel times measured by vehicles traversing the path directly. The results show that the proposed model can reliably estimate travel times for long-distance paths in which travel times from the same departure time vary widely, intervals are long, and the representative travel time changes irregularly over short periods.
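
The gap-filling step described above can be sketched with pandas interpolation; the timestamps and travel-time values below are illustrative, and a real implementation would fall back to the VDS-based estimate for long outages, as the paper does.

```python
# Sketch of the gap-filling step: linearly interpolate short gaps in a
# link travel-time series, as done for missing TCS intervals.
# The timestamps and values are illustrative.
import pandas as pd

times = pd.date_range("2009-05-01 08:00", periods=6, freq="5min")
link_tt = pd.Series([320.0, 335.0, None, None, 360.0, 355.0], index=times)

# Linear interpolation referencing neighboring intervals; longer outages
# would instead use a VDS time-space-diagram estimate per the paper.
filled = link_tt.interpolate(method="time", limit=2)
print(filled)
```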

Spatial-temporal texture features for 3D human activity recognition using laser-based RGB-D videos

  • Ming, Yue;Wang, Guangchao;Hong, Xiaopeng
    • KSII Transactions on Internet and Information Systems (TIIS) / v.11 no.3 / pp.1595-1613 / 2017
  • An IR camera paired with a laser-based IR projector provides an effective solution for real-time capture of moving targets in RGB-D videos. Unlike traditional RGB videos, the captured depth videos are unaffected by illumination variation. In this paper, we propose a novel feature extraction framework for describing human activities based on this optical capture method, namely spatial-temporal texture features for 3D human activity recognition. Spatial-temporal texture features with depth information are insensitive to illumination and occlusion and efficient for describing fine motion. The proposed pipeline begins with video acquisition based on laser projection, applies video preprocessing with visual background extraction to obtain spatial-temporal key images, and then encodes texture features from the key images to generate discriminative features for human activity recognition. Experimental results on different databases and practical scenarios demonstrate the effectiveness of the proposed algorithm on large-scale data sets.
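
As a rough sketch of the preprocessing stage, the code below extracts motion-heavy frames as candidate key images. OpenCV's MOG2 subtractor is used here as a stand-in for the ViBe-style visual background extraction referenced in the paper; the input file and motion threshold are assumptions.

```python
# Sketch of the video-preprocessing stage: foreground extraction on a
# video to obtain candidate key frames. MOG2 stands in for ViBe-style
# background extraction; file name and threshold are assumptions.
import cv2

cap = cv2.VideoCapture("depth_video.avi")  # hypothetical input file
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)

key_frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)
    # Keep frames with substantial motion as spatial-temporal key images
    if cv2.countNonZero(mask) > 0.05 * mask.size:
        key_frames.append(frame)

cap.release()
print(f"{len(key_frames)} key frames extracted")
```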

Topic Modeling of Korean Newspaper Articles on Aging via Latent Dirichlet Allocation

  • Lee, So Chung
    • Asian Journal for Public Opinion Research / v.10 no.1 / pp.4-22 / 2022
  • The purpose of this study is to explore the structure of social discourse on aging in Korea by analyzing newspaper articles on aging. The analysis comprises three steps: first, data collection and preprocessing; second, identification of latent topics; and third, observation of yearly topic dynamics. In total, 1,472 newspaper articles containing the word "aging" in the title were collected from 10 major newspapers published between 2006 and 2019. The underlying topic structure was analyzed using Latent Dirichlet Allocation (LDA), a topic modeling method widely adopted by text mining academics and researchers. Seven latent topics emerged from the LDA model, labeled social issues, death, private insurance, economic growth, national debt, labor market innovation, and income security. The topic loadings show a clear recent increase in public interest in topics such as national debt and labor market innovation. The study concludes that media discourse on aging has shifted toward productivity- and efficiency-related issues that expect older people to be productive citizens. Such subjectivation connotes a reduced role for government and society, shifting responsibility onto individuals who are unable to adapt successfully as productive citizens within the labor market.
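
A minimal sketch of the LDA step follows, using scikit-learn with seven topics as in the study; the three placeholder documents stand in for the 1,472 Korean articles and are not from the corpus.

```python
# Minimal LDA sketch: vectorize article texts, fit a 7-topic model, and
# print top words per topic. The documents are English placeholders for
# the Korean newspaper articles analyzed in the paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "aging population national debt pension reform",
    "labor market innovation older workers productivity",
    "private insurance income security retirement",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=7, random_state=0)
lda.fit(X)

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {k}: {top}")
```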

Real-Time Streaming Traffic Prediction Using Deep Learning Models Based on Recurrent Neural Network (순환 신경망 기반 딥러닝 모델들을 활용한 실시간 스트리밍 트래픽 예측)

  • Kim, Jinho;An, Donghyeok
    • KIPS Transactions on Computer and Communication Systems / v.12 no.2 / pp.53-60 / 2023
  • Recently, demand and traffic volume for various multimedia contents have been increasing rapidly on real-time streaming platforms. In this paper, we predict real-time streaming traffic to improve quality of service (QoS). Statistical models have traditionally been used for network traffic prediction, but because real-time streaming traffic changes dynamically, we use recurrent-neural-network-based deep learning models instead. After collecting and preprocessing real-time streaming data, we apply vanilla RNN, LSTM, GRU, Bi-LSTM, and Bi-GRU models to predict the traffic. In the evaluation, the training time and accuracy of each model are measured and compared.
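
As a sketch of one of the compared models, the code below trains an LSTM to predict the next traffic value from a sliding window of past measurements; the synthetic series, window size, and layer sizes are illustrative assumptions.

```python
# Sketch of one compared model: an LSTM predicting the next traffic value
# from a sliding window of past values. Series, window, and layer sizes
# are illustrative assumptions, not the paper's configuration.
import numpy as np
import tensorflow as tf

series = np.sin(np.linspace(0, 50, 500)).astype("float32")  # stand-in traffic

window = 10
X = np.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., None]  # shape: (samples, timesteps, features)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(window, 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

print("next-step prediction:", model.predict(X[-1:], verbose=0))
```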

A Study on AI-Based Real Estate Rate of Return Decision Models of 5 Sectors for 5 Global Cities: Seoul, New York, London, Paris and Tokyo (인공지능 (AI) 기반 섹터별 부동산 수익률 결정 모델 연구- 글로벌 5개 도시를 중심으로 (서울, 뉴욕, 런던, 파리, 도쿄) -)

  • Wonboo Lee;Jisoo Lee;Minsang Kim
    • Journal of Korean Society for Quality Management / v.52 no.3 / pp.429-457 / 2024
  • Purpose: This study aims to provide useful information to real estate investors by developing a rate-of-return determination model using artificial intelligence. The model analyzes the real estate markets of the five selected cities from multiple perspectives, incorporating real estate market characteristics, economic indicators, and policies to determine potential returns. Methods: Data on real estate markets, economic indicators, and policies for the five cities were collected and cleaned, then normalized and split into training and testing sets. An AI model was developed using machine learning algorithms and trained on this data. The model was applied to the five cities, and its accuracy was evaluated by comparing predicted returns to actual outcomes using metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R-squared. Results: The model was successfully applied to the real estate markets of the five cities, showing high accuracy and predictability in its return forecasts and providing valuable insights that demonstrate its utility for informed investment decisions. Conclusion: The study identifies areas for future improvement, suggesting the integration of more diverse data sources and advanced machine learning techniques to enhance predictive capability.
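
A minimal sketch of the evaluation step, with synthetic stand-ins for the collected features and returns, shows how MAE, RMSE, and R-squared are computed; the model choice here is illustrative, not the paper's.

```python
# Sketch of the evaluation step: fit a regression model on the training
# split and score it with MAE, RMSE, and R-squared. Features and returns
# are synthetic stand-ins for the market/indicator/policy data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                          # synthetic features
y = X[:, 0] * 0.5 + rng.normal(scale=0.1, size=500)    # synthetic returns

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

print("MAE :", mean_absolute_error(y_te, pred))
print("RMSE:", mean_squared_error(y_te, pred) ** 0.5)
print("R2  :", r2_score(y_te, pred))
```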