• Title/Summary/Keyword: Korean human dataset

Search results: 169

Evaluating Korean Machine Reading Comprehension Generalization Performance using Cross and Blind Dataset Assessment (기계독해 데이터셋의 교차 평가 및 블라인드 평가를 통한 한국어 기계독해의 일반화 성능 평가)

  • Lim, Joon-Ho; Kim, Hyunki
    • Annual Conference on Human and Language Technology / 2019.10a / pp.213-218 / 2019
  • Machine reading comprehension (MRC) is the task of finding, within a given paragraph, the answer to a question expressed in natural language. Like other natural language processing tasks, MRC has recently achieved strong results by taking a pre-trained language model such as BERT, XLNet, or RoBERTa and fine-tuning it to predict answer span boundaries given a question and a paragraph; in particular, models trained and evaluated on the KorQuAD v1.0 dataset reach over 94% F1. This paper evaluates the generalization ability of current state-of-the-art MRC technology on general question-paragraph pairs rather than on evaluation sets that resemble the training set. To this end, we first performed cross-dataset evaluation using the publicly available KorQuAD v1.0 dataset, the NIA v2017 dataset, and the Exobrain v2018 dataset built in the Exobrain project. The cross evaluation showed that dataset statistics such as answer length and the lexical overlap ratio between question and paragraph are related to generalization performance. Next, we performed blind evaluation of an MRC model trained with the KorBERT pre-trained language model on all 210,000 available MRC training examples. The blind evaluation measured the generalization of a model trained on general domains to a legal-domain evaluation set, and measured MRC performance on an evaluation set in which questions were written first and answer paragraphs were retrieved afterwards, instead of writing questions after reading an answer paragraph. The blind evaluation showed that the model using the pre-trained language model generalized far better than an MRC model without one, but on evaluation sets with long answers and low lexical overlap between question and paragraph, performance remained below 80%. Our experiments confirm that, by the nature of the MRC task, difficulty and generalization performance vary with the lexical overlap between question and answer and with answer length, and that developing MRC models for general questions and paragraphs requires generalization evaluation on diverse types of evaluation sets.
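
As a rough illustration of the span-level metrics used in this kind of cross-dataset evaluation, the sketch below computes exact match and token-level F1 over per-question predictions. It assumes whitespace tokenization and simple ID-keyed dictionaries; the actual KorQuAD/Exobrain evaluation scripts may apply additional answer normalization.

```python
# Minimal sketch of exact-match / F1 scoring for extractive MRC answers.
# Whitespace tokenization is an assumption; real evaluation scripts may
# normalize punctuation and spacing differently.
from collections import Counter

def exact_match(prediction: str, ground_truth: str) -> float:
    return float(prediction.strip() == ground_truth.strip())

def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = prediction.strip().split()
    gold_tokens = ground_truth.strip().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Cross-dataset evaluation idea: score a model trained on one dataset
# against the references of every other dataset.
def evaluate(predictions: dict, references: dict) -> dict:
    em = sum(exact_match(predictions[qid], references[qid]) for qid in references)
    f1 = sum(f1_score(predictions[qid], references[qid]) for qid in references)
    n = len(references)
    return {"exact_match": 100.0 * em / n, "f1": 100.0 * f1 / n}
```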


High-Quality Multimodal Dataset Construction Methodology for ChatGPT-Based Korean Vision-Language Pre-training (ChatGPT 기반 한국어 Vision-Language Pre-training을 위한 고품질 멀티모달 데이터셋 구축 방법론)

  • Jin Seong; Seung-heon Han; Jong-hun Shin; Soo-jong Lim; Oh-woog Kwon
    • Annual Conference on Human and Language Technology / 2023.10a / pp.603-608 / 2023
  • This study addresses the need for a large-scale vision-language multimodal dataset for training Korean Vision-Language Pre-training models. Korean vision-language multimodal datasets are currently scarce, and obtaining high-quality data is difficult. We therefore propose a dataset construction methodology that uses machine translation to translate foreign-language (English) vision-language data into Korean and then applies generative AI on top of it. Among various caption generation approaches, we propose a new method that uses ChatGPT to automatically generate natural, high-quality Korean captions. This yields better caption quality than existing machine translation methods, and multiple translation results are ensembled to build the multimodal dataset effectively. In addition, we introduce Caption Projection Consistency, a semantic-similarity-based evaluation measure, compare English-to-Korean caption projection performance across translation systems, and present criteria for evaluating it. Finally, this study presents a new methodology for constructing a Korean image-text multimodal dataset using ChatGPT and demonstrates English-to-Korean caption projection performance superior to representative machine translators. Our work thus shows a direction for automatically constructing, at scale, the high-quality Korean datasets that are currently lacking, and is expected to contribute to improving the performance of deep-learning-based Korean Vision-Language Pre-training models.
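
The entry does not spell out how Caption Projection Consistency is computed, so the sketch below only illustrates a generic semantic-similarity comparison: candidate Korean captions from different translation or generation systems are ranked against the source English caption with a multilingual sentence-embedding model. The model name and sample captions are assumptions for illustration, not the authors' setup.

```python
# Hypothetical sketch: rank candidate Korean captions against the source
# English caption by cosine similarity of multilingual sentence embeddings.
# This is NOT the paper's Caption Projection Consistency implementation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # example model

def rank_candidates(english_caption: str, korean_candidates: list[str]) -> list[tuple[str, float]]:
    src_emb = model.encode(english_caption, convert_to_tensor=True)
    cand_embs = model.encode(korean_candidates, convert_to_tensor=True)
    scores = util.cos_sim(src_emb, cand_embs)[0]
    return sorted(zip(korean_candidates, scores.tolist()), key=lambda x: x[1], reverse=True)

candidates = ["강아지가 공원에서 공을 쫓고 있다.", "개가 공원에서 공을 쫓는다."]
print(rank_candidates("A dog is chasing a ball in the park.", candidates))
```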


Driver Group Clustering Technique and Risk Estimation Method for Traffic Accident Prevention

  • Tae-Wook Kim; Ji-Woong Yang; Hyeon-Jin Jung; Han-Jin Lee; Ellen J. Hong
    • Journal of the Korea Society of Computer and Information / v.29 no.8 / pp.53-58 / 2024
  • Traffic accidents not only threaten human lives but also impose significant societal costs. Recently, research has sought to address traffic accidents by predicting risk using deep learning and the spatiotemporal information of roads. However, although traffic accidents are influenced not only by the spatiotemporal characteristics of roads but also by human factors, research on the latter has been comparatively less active. This paper analyzes driver groups and their characteristics by applying clustering techniques to a traffic accident dataset, and proposes and applies a method for calculating a Risk Level for each driver group and characteristic. In this process, the preprocessing technique proposed in this paper achieves a higher Silhouette Score (0.255) than the commonly used One-Hot Embedding and Min-Max Scaling techniques, indicating its suitability as a preprocessing method.
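
A minimal sketch of the baseline pipeline the abstract compares against (one-hot encoding of categorical driver attributes, min-max scaling, k-means clustering, and silhouette scoring) might look as follows; the paper's own preprocessing and Risk Level computation are not reproduced here, and the records are placeholders.

```python
# Baseline-style pipeline: one-hot encode categorical driver attributes,
# min-max scale, cluster with k-means, and score cluster quality with the
# silhouette coefficient. Records below are illustrative placeholders.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

df = pd.DataFrame({
    "age_band":  ["20s", "30s", "20s", "60s", "50s", "30s"],
    "sex":       ["M", "F", "M", "M", "F", "M"],
    "violation": ["speeding", "signal", "speeding", "none", "signal", "none"],
})

X = pd.get_dummies(df)                       # one-hot encoding
X = MinMaxScaler().fit_transform(X)          # min-max scaling
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", silhouette_score(X, labels))
```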

DeNERT: Named Entity Recognition Model using DQN and BERT

  • Yang, Sung-Min; Jeong, Ok-Ran
    • Journal of the Korea Society of Computer and Information / v.25 no.4 / pp.29-35 / 2020
  • In this paper, we propose DeNERT, a new named entity recognition model. Recently, natural language processing has been actively studied using language representation models pre-trained on large corpora. In particular, named entity recognition, one subfield of natural language processing, relies on supervised learning, which requires a large training dataset and substantial computation. Reinforcement learning is a method that learns through trial-and-error experience without initial data; it is closer to the human learning process than other machine learning methodologies, but it has not yet been widely applied to natural language processing and is mostly used in simulation environments such as Atari games and AlphaGo. BERT is a general-purpose language model developed by Google and pre-trained on a large corpus at great computational cost; it shows high performance in natural language processing research and high accuracy on many downstream tasks. Building on these, DeNERT combines two deep learning models, DQN and BERT. The proposed model is trained by constructing a reinforcement learning environment on top of the language representations provided by the general-purpose language model. The DeNERT model trained in this way achieves faster inference and higher performance with a small amount of training data. We also validate its named entity recognition performance through experiments.

Generative optical flow based abnormal object detection method using a spatio-temporal translation network

  • Lim, Hyunseok; Gwak, Jeonghwan
    • Journal of the Korea Society of Computer and Information / v.26 no.4 / pp.11-19 / 2021
  • An abnormal object is a person, object, or mechanical device that behaves in an abnormal or unusual way and requires observation or supervision. To detect such objects with an artificial intelligence algorithm and without continuous human intervention, methods that observe distinctive temporal features using optical flow are widely used. In this study, abnormal situations are identified by training an algorithm that translates an input image frame into an optical flow image using a Generative Adversarial Network (GAN). In particular, to improve the model's identification of abnormal behavior, we propose techniques that refine the pre-processing step, to exclude unnecessary outliers, and the post-processing step, to increase identification accuracy on the test dataset after training. The UCSD Pedestrian and UMN Unusual Crowd Activity datasets were used for training. The proposed method achieved a frame-level AUC of 0.9450 and an EER of 0.1317 on the UCSD Ped2 dataset, an improvement over the models of previous studies.
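
The frame-level AUC and EER reported above can be computed from per-frame anomaly scores as in the sketch below; the scores and labels are placeholders standing in for the reconstruction error between generated and ground-truth optical-flow images.

```python
# Sketch: frame-level AUC and EER from per-frame anomaly scores.
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])             # 1 = abnormal frame (placeholder labels)
scores = np.array([0.1, 0.3, 0.8, 0.7, 0.2, 0.9, 0.4, 0.6])  # placeholder anomaly scores

fpr, tpr, _ = roc_curve(y_true, scores)
frame_auc = auc(fpr, tpr)

# EER: the operating point where the false-positive rate equals the false-negative rate.
fnr = 1 - tpr
eer = fpr[np.nanargmin(np.abs(fnr - fpr))]
print(f"AUC={frame_auc:.4f}, EER={eer:.4f}")
```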

Real-Time Stereoscopic Visualization of Very Large Volume Data on CAVE (CAVE상에서의 방대한 볼륨 데이타의 실시간 입체 영상 가시화)

  • 임무진; 이중연; 조민수; 이상산; 임인성
    • Journal of KIISE: Computing Practices and Letters / v.8 no.6 / pp.679-691 / 2002
  • Volume visualization is an important subarea of scientific visualization concerned with techniques for generating meaningful visual information from abstract and complex volume datasets defined in three- or higher-dimensional space. It has become increasingly important in various fields, including meteorology, medical science, and computational fluid dynamics. Virtual reality, on the other hand, is a research field focusing on techniques that help users experience virtual worlds through visual, auditory, and tactile senses. In this paper, we have developed a visualization system for CAVE, an immersive 3D virtual environment system, which generates stereoscopic images from huge human volume datasets in real time using an improved volume visualization technique. To complement 3D texture-mapping-based volume rendering methods, which easily slow down as data sizes increase, our system uses an image-based rendering technique to guarantee real-time performance. The system offers a variety of user interface functions for effective visualization. In this article, we present a detailed description of our real-time stereoscopic visualization system and show how the Visible Korean Human dataset is effectively visualized on CAVE.

Fake News Detection for Korean News Using Text Mining and Machine Learning Techniques (텍스트 마이닝과 기계 학습을 이용한 국내 가짜뉴스 예측)

  • Yun, Tae-Uk; Ahn, Hyunchul
    • Journal of Information Technology Applications and Management / v.25 no.1 / pp.19-32 / 2018
  • Fake news is defined as news articles that are intentionally and verifiably false and could mislead readers. The spread of fake news may provoke anxiety, chaos, fear, or irrational decisions among the public. Thus, detecting fake news and preventing its spread has become a very important issue in our society. However, given the huge amount of fake news produced every day, it is almost impossible for humans to identify all of it. In this context, researchers have tried to develop automated fake news detection methods using artificial intelligence techniques over the past years. Unfortunately, no prior studies have proposed an automated fake news detection method for Korean news. In this study, we aim to detect Korean fake news using text mining and machine learning techniques. Our proposed method consists of two steps. In the first step, the news content to be analyzed is converted into quantified values using various text mining techniques (topic modeling, TF-IDF, and so on). In the second step, classifiers are trained on the values produced in step 1. As classifiers, machine learning techniques such as multiple discriminant analysis, case-based reasoning, artificial neural networks, and support vector machines can be applied. To validate the effectiveness of the proposed method, we collected 200 Korean news articles from Seoul National University's FactCheck (http://factcheck.snu.ac.kr), which provides detailed analysis reports from about 20 media outlets and links to source documents for each case. Using this dataset, we identify which text features are important and which classifiers are effective in detecting Korean fake news.
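
A minimal sketch of the two-step method described above, using TF-IDF features and a support vector machine (one of the classifiers the abstract lists), could look like the following; Korean morphological analysis and topic-modeling features are omitted, and the texts and labels are placeholders.

```python
# Step 1: convert news text to quantified features with TF-IDF.
# Step 2: train a classifier (here a linear SVM) on those features.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["기사 본문 예시 1", "기사 본문 예시 2", "기사 본문 예시 3", "기사 본문 예시 4"]  # placeholder articles
labels = [0, 1, 0, 1]  # 0 = real, 1 = fake (placeholder labels)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("svm", LinearSVC()),
])
clf.fit(texts, labels)
print(clf.predict(["새 기사 본문"]))
```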

Variations of AlexNet and GoogLeNet to Improve Korean Character Recognition Performance

  • Lee, Sang-Geol; Sung, Yunsick; Kim, Yeon-Gyu; Cha, Eui-Young
    • Journal of Information Processing Systems / v.14 no.1 / pp.205-217 / 2018
  • Deep learning using convolutional neural networks (CNNs) is being studied in various fields of image recognition, and these studies show excellent performance. In this paper, we compare the performance of two CNN architectures, KCR-AlexNet and KCR-GoogLeNet. The experimental data are obtained from PHD08, a large-scale Korean character database with 2,187 samples of each of 2,350 Korean character classes, for a total of 5,139,450 data samples. In the training results, KCR-AlexNet showed an accuracy of over 98% on the top-1 test, and KCR-GoogLeNet showed an accuracy of over 99% on the top-1 test after the final training iteration. To compare classification success rates with commercial optical character recognition (OCR) programs and ensure the objectivity of the experiment, we built an additional Korean character dataset with fonts not included in PHD08. While the commercial OCR programs showed classification success rates of 66.95% to 83.16%, KCR-AlexNet and KCR-GoogLeNet showed average classification success rates of 90.12% and 89.14%, respectively, both higher than the commercial OCR programs. In terms of time, KCR-AlexNet trained faster on PHD08, whereas KCR-GoogLeNet had a faster classification speed.
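
For illustration only, a toy CNN classifier for PHD08-style Korean character images (2,350 classes) is sketched below; it is not the KCR-AlexNet or KCR-GoogLeNet architecture evaluated in the paper, and the 64x64 grayscale input size is an assumption.

```python
# Toy CNN for Korean character classification over 2,350 classes.
# Not the paper's architectures; input size is assumed to be 64x64 grayscale.
import torch
import torch.nn as nn

class TinyKoreanCharNet(nn.Module):
    def __init__(self, num_classes: int = 2350):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = TinyKoreanCharNet()
logits = model(torch.randn(8, 1, 64, 64))   # batch of 8 dummy character images
print(logits.shape)                          # torch.Size([8, 2350])
```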

Estimating and evaluating usual total fat and fatty acid intake in the Korean population using data from the 2019-2021 Korea National Health and Nutrition Examination Surveys: a cross-sectional study (우리 국민의 총 지방 및 지방산 일상 섭취량 추정 및 평가: 2019 - 2021년 국민건강영양조사 자료를 활용한 단면조사연구)

  • Gyeong-yoon Lee; Dong Woo Kim
    • Korean Journal of Community Nutrition / v.28 no.5 / pp.414-422 / 2023
  • Objectives: This study evaluated usual dietary intakes of total fat and fatty acids in the Korean population based on the revised Dietary Reference Intakes for Koreans 2020 (2020 KDRIs). Methods: This study used data from the eighth Korea National Health and Nutrition Examination Survey (KNHANES 2019-2021). We included 18,895 individuals aged 1 year and older whose 1-day 24-hour dietary recall data were available. To calculate the external variability used with the National Cancer Institute 1-day method, data from the U.S. NHANES 2017-March 2020 pre-pandemic dataset were employed. Total fat and fatty acid intakes were evaluated against the Acceptable Macronutrient Distribution Ranges (AMDRs) and Adequate Intake (AI) of the 2020 KDRIs for each sex and age group. Results: Approximately 86% of the Korean population obtained an adequate amount of energy from total fat (within the AMDR), indicating an appropriate level of intake. However, the percentage of individuals consuming saturated fatty acids below the AMDR was low: only 12% among those under 19 years of age and 52% among those aged 19 years and older. On a positive note, approximately 70% of the population consumed essential fatty acids above the AI. Nevertheless, monitoring the intake ratio of omega-3 (n-3) to omega-6 (n-6) fatty acids is essential to ensure an optimal balance. Conclusions: This study explored the possibility of estimating the distribution of nutrient intake in a population by applying an external variability ratio. Therefore, if future KNHANES rounds conduct multiple 24-hour recalls every few years, similar to the U.S. NHANES, even for a subset of participants, this may aid the accurate assessment of the nutritional status of the population.
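
As a highly simplified illustration of how an external within-/between-person variance ratio can turn single-day recalls into an estimated usual-intake distribution, consider the shrinkage-style sketch below; it is not the NCI 1-day method itself, and the intake values and variance share are placeholders chosen for illustration.

```python
# Simplified shrinkage illustration: single-day recalls are pulled toward the
# group mean so that the spread reflects between-person variation only.
# This is NOT the NCI 1-day method; it assumes normality and a known
# external between-person variance share.
import numpy as np

one_day_fat = np.array([35.0, 60.0, 42.0, 80.0, 25.0, 55.0])  # g/day, placeholder recalls
mean_intake = one_day_fat.mean()

# External estimate (e.g., from a survey with repeated recalls) of the share
# of total variance that is between-person rather than day-to-day.
between_share = 0.45  # placeholder value

usual_fat = mean_intake + np.sqrt(between_share) * (one_day_fat - mean_intake)
print("observed SD:", round(one_day_fat.std(ddof=1), 1),
      "| usual-intake SD:", round(usual_fat.std(ddof=1), 1))
```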

Numerical Model Test of Spilled Oil Transport Near the Korean Coasts Using Various Input Parametric Models

  • Hai Van Dang; Suchan Joo; Junhyeok Lim; Jinhwan Hur; Sungwon Shin
    • Journal of Ocean Engineering and Technology / v.38 no.2 / pp.64-73 / 2024
  • Oil spills pose significant threats to marine ecosystems, human health, socioeconomic conditions, and coastal communities. Accurate real-time prediction of oil slick transport along coastlines is paramount for rapid preparedness and response. This study used the open-source OpenOil numerical model to simulate the fate and trajectories of the oil slicks released during the 2007 Hebei Spirit accident along the Korean coast. Six combinations of input parameters, derived from a five-day met-ocean dataset incorporating various hydrodynamic, meteorological, and wave models, were investigated to determine the input variables that lead to the most reasonable results. The predictive performance of each combination was evaluated quantitatively by comparing the dimensions and matching rates of the simulated oil slicks against slicks extracted from synthetic aperture radar (SAR) imagery of the ocean surface. The results show that the combination using the Hybrid Coordinate Ocean Model (HYCOM) for the hydrodynamic parameters agreed more closely with the observed spill areas than the one using the Copernicus Marine Environment Monitoring Service (CMEMS), yielding up to 88% and 53% similarity, respectively, over more than four days of oil transport near the Taean coast. This study underscores the importance of integrating high-resolution met-ocean models into oil spill modeling efforts to improve predictive accuracy regarding oil spill dynamics and weathering processes.