Title/Summary/Keyword: Scientific Dataset


Building Sentence Meaning Identification Dataset Based on Social Problem-Solving R&D Reports (사회문제 해결 연구보고서 기반 문장 의미 식별 데이터셋 구축)

  • Hyeonho Shin;Seonki Jeong;Hong-Woo Chun;Lee-Nam Kwon;Jae-Min Lee;Kanghee Park;Sung-Pil Choi
    • KIPS Transactions on Software and Data Engineering, v.12 no.4, pp.159-172, 2023
  • In general, social problem-solving research aims to create important social value by offering meaningful answers to pressing social issues using scientific technologies. However, although numerous and extensive nationwide research efforts have been made to alleviate social problems, many important social challenges remain. To facilitate the entire social problem-solving research process and maximize its efficacy, it is vital to clearly identify and grasp the important and pressing problems to focus on. The problem-discovery step could be drastically improved if current social issues could be automatically identified from existing R&D resources such as technical reports and articles. This paper introduces a comprehensive dataset essential for building machine learning models that automatically detect social problems and solutions in national research reports. We first collected a total of 700 research reports on social problems and issues. Through an intensive annotation process, we built 24,022 sentences, each assigned a category or label closely related to social problem-solving, such as problem, purpose, solution, or effect. Furthermore, we implemented four sentence classification models based on various neural language models and conducted a series of performance experiments on our dataset. The model fine-tuned from the KLUE-BERT pre-trained language model showed the best performance, with an accuracy of 75.853% and an F1 score of 63.503%.
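The abstract above reports accuracy and F1 for its sentence classifiers. A minimal sketch of those two evaluation metrics (macro-averaged F1 over the sentence categories; the label names below are illustrative, not the paper's exact label set):

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Accuracy and macro-averaged F1 for a multi-class sentence classifier."""
    labels = sorted(set(y_true) | set(y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return acc, sum(f1s) / len(f1s)
```

Macro averaging weights every category equally, which is why the F1 (63.503%) can sit well below the accuracy (75.853%) when rare labels are classified poorly.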

Tomography Reconstruction of Ionospheric Electron Density with Empirical Orthonormal Functions Using Korea GNSS Network

  • Hong, Junseok;Kim, Yong Ha;Chung, Jong-Kyun;Ssessanga, Nicholas;Kwak, Young-Sil
    • Journal of Astronomy and Space Sciences, v.34 no.1, pp.7-17, 2017
  • In South Korea, there are about 80 Global Positioning System (GPS) monitoring stations providing total electron content (TEC) every 10 min, which can be accessed through the Korea Astronomy and Space Science Institute (KASI) for scientific use. We applied the computerized ionospheric tomography (CIT) algorithm to the TEC dataset from this GPS network to monitor the regional ionosphere over South Korea. The algorithm uses the multiplicative algebraic reconstruction technique (MART) with an initial condition from the latest International Reference Ionosphere-2016 model (IRI-2016). To reduce the number of unknown variables, the vertical profiles of electron density are expressed as a linear combination of empirical orthonormal functions (EOFs) derived from the IRI empirical profiles. Although the number of receiver sites is much smaller than in Japan, the CIT algorithm yielded a reasonable structure of the ionosphere over South Korea. We verified the CIT results with NmF2 from ionosondes in Icheon and Jeju, and also with GPS TEC at the center of South Korea. In addition, the total time required for the CIT calculation was only about 5 min, enabling exploration of the vertical ionospheric structure in near real time.
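The MART update mentioned above corrects each grid cell multiplicatively so that the projected ray sum approaches the measured TEC. A minimal sketch of the basic iteration, without the EOF basis reduction the paper applies (geometry matrix, relaxation, and stopping rule are illustrative):

```python
def mart_reconstruct(A, y, x0, relaxation=1.0, iterations=50):
    """Multiplicative Algebraic Reconstruction Technique (MART).

    A:  list of ray rows (path lengths through each cell, the geometry matrix)
    y:  measured slant TEC per ray
    x0: initial electron-density guess (e.g. from a background model like IRI)
    Each ray multiplicatively scales the cells it crosses toward agreement.
    """
    x = list(x0)
    for _ in range(iterations):
        for a_row, y_i in zip(A, y):
            proj = sum(a * xj for a, xj in zip(a_row, x))  # modeled ray sum
            if proj <= 0:
                continue
            ratio = y_i / proj
            amax = max(a_row)
            for j, a in enumerate(a_row):
                if a > 0:
                    # exponent weights the correction by the cell's path length
                    x[j] *= ratio ** (relaxation * a / amax)
    return x
```

Because updates are multiplicative, a positive initial guess stays positive, which is one reason MART suits electron-density reconstruction.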

Implementation of Saemangeum Coastal Environmental Information System Using GIS (지리정보시스템을 이용한 새만금 해양환경정보시스템 구축)

  • Kim, Jin-Ah;Kim, Chang-Sik;Park, Jin-Ah
    • Journal of the Korean Association of Geographic Information Studies, v.14 no.4, pp.128-136, 2011
  • To monitor and predict changes in the coastal environment caused by the construction of the Saemangeum sea dyke and the development of land reclamation, real-time and periodic ocean observations and numerical simulations have been conducted since 2002. Saemangeum coastal environmental data can be broadly classified into marine meteorology, ocean physics and circulation, water quality, marine geology, and marine ecosystem, and each category has been generated continuously and accumulated over about 10 years. The collected data form a huge, heterogeneous dataset characterized by multi-dimensional, multivariate, spatio-temporal distributions, so an information system supporting data collection, processing, management, and service is necessary. In this study, we implemented the Saemangeum coastal environmental information system using a geographic information system. It enables integrated data collection and management, as well as querying and analysis of this large, high-complexity dataset, through an intuitive and effective web user interface and scientific data visualization using statistical graphs and thematic cartography. Furthermore, through geo-spatial analysis with geo-processing of quantitative long-term trends, the system serves as a tool that provides a scientific basis for sustainable development and decision support on the Saemangeum coast. For effective web-based information service, a multi-level map cache, a multi-layer architecture, and a geospatial database were also implemented.

Pattern Recognition of the Herbal Drug, Magnoliae Flos According to their Essential Oil Components

  • Jeong, Eun-Sook;Choi, Kyu-Yeol;Kim, Sun-Chun;Son, In-Seop;Cho, Hwang-Eui;Ahn, Su-Youn;Woo, Mi-Hee;Hong, Jin-Tae;Moon, Dong-Cheul
    • Bulletin of the Korean Chemical Society, v.30 no.5, pp.1121-1126, 2009
  • This paper describes a pattern recognition method for Magnoliae flos based on gas chromatographic/mass spectrometric (GC/MS) analysis of its essential oil components. In Korea, the botanical drug mainly comprises four magnolia species (M. denudata, M. biondii, M. kobus, and M. liliflora), although some other species are also traded as the drug. GC/MS separation of the volatile components, which were extracted by simultaneous distillation and extraction (SDE), was performed on a carbowax column (Supelcowax 10; 30 m × 0.25 mm × 0.25 μm) using temperature programming. Variance in the retention times for all peaks of interest was within an RSD of 2% for repeated analyses (n = 9). Of the 74 essential oil components identified from the magnolia species, approximately 10 major components, namely α-pinene, β-pinene, sabinene, myrcene, d-limonene, eucalyptol (1,8-cineole), γ-terpinene, p-cymene, linalool, and α-terpineol, were commonly present in all four species. For statistical analysis, the original dataset was reduced to 13 variables by the Fisher criterion and factor analysis (FA). The essential oil patterns were processed by multivariate statistical analysis, including hierarchical cluster analysis (HCA), principal component analysis (PCA), and discriminant analysis (DA). All samples were divided into four groups with three principal components by PCA, and according to plant origin by HCA. Thirty-three samples (23 training sets and 10 test samples to be assessed) were correctly classified into the four groups predicted by PCA. This method would provide a practical strategy for assessing the authenticity or quality of the well-known herbal drug, Magnoliae flos.
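PCA, the workhorse of the chemometric grouping above, projects the autoscaled component table onto its leading eigenvectors. A minimal sketch of extracting the first loading vector by power iteration on the covariance matrix (toy data; a real analysis would extract three components, as the paper does):

```python
def first_principal_component(data, iterations=200):
    """Leading PCA eigenvector (loading vector) of mean-centered data
    via power iteration. data: list of samples, each a list of features."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    X = [[row[j] - means[j] for j in range(p)] for row in data]
    # sample covariance matrix (p x p)
    C = [[sum(X[i][a] * X[i][b] for i in range(n)) / (n - 1)
          for b in range(p)] for a in range(p)]
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(C[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v  # scores are then the projections X @ v
```

Projecting each sample onto the first few such vectors yields the score plot in which the four magnolia species separate into groups.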

A Study on Extending Successive Observation Coverage of MODIS Ocean Color Product (MODIS 해색 자료의 유효관측영역 확장에 대한 연구)

  • Park, Jeong-Won;Kim, Hyun-Cheol;Park, Kyungseok;Lee, Sangwhan
    • Korean Journal of Remote Sensing, v.31 no.6, pp.513-521, 2015
  • In the processing of ocean color remote sensing data, spatio-temporal binning is crucial for securing effective observation area. Validity determination for given source data refers to the information in the Level-2 flags. To minimize stray light contamination, NASA OBPG's standard algorithm suggests a large filtering window, but this results in the loss of effective observation area. This study aims at quality improvement of ocean color remote sensing data by recovering and extending a portion of the effective observation area. We analyzed the difference between the MODIS/Aqua standard and modified products in terms of chlorophyll-a concentration and spatial and temporal coverage. The recovery fractions in the Level-2 swath product, Level-3 daily composite, 8-day composite, and monthly composite were 13.2 (±5.2)%, 30.8 (±16.3)%, 15.8 (±9.2)%, and 6.0 (±5.6)%, respectively. The mean difference between the chlorophyll-a concentrations of the two products was only 0.012%, smaller than the nominal precision of the geophysical parameter estimation. The increase in areal coverage also increases the temporal density of the multi-temporal dataset, and this processing gain was most effective in the 8-day composite data. The proposed method can contribute to the quality enhancement of ocean color remote sensing data by improving not only data productivity but also the statistical stability gained from an increased number of samples.
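The recovery fractions quoted above compare valid-pixel counts before and after the relaxed flagging. A minimal sketch, assuming the fraction is defined as the gain in valid pixels relative to the standard product's valid-pixel count (the masks below are illustrative):

```python
def recovery_fraction(standard_mask, modified_mask):
    """Percent gain in valid ocean-color pixels after relaxing the
    stray-light filtering; masks are per-pixel booleans (True = valid)."""
    std = sum(standard_mask)   # valid pixels in the standard product
    mod = sum(modified_mask)   # valid pixels in the modified product
    return 100.0 * (mod - std) / std
```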

Development and Assessment of Real-Time Quality Control Algorithm for PM10 Data Observed by Continuous Ambient Particulate Monitor (부유분진측정기(PM10) 관측 자료 실시간 품질관리 알고리즘 개발 및 평가)

  • Kim, Sunyoung;Lee, Hee Choon;Ryoo, Sang-Boom
    • Atmosphere, v.26 no.4, pp.541-551, 2016
  • A real-time quality control algorithm has been developed for PM10 concentration measured by a Continuous Ambient Particulate Monitor (FH62C14, Thermo Fisher Scientific Inc.). The quality control algorithm for PM10 data consists of five main procedures. The first step is the valid-value check: values must fall within the acceptable range limits, and values outside the upper (5,000 μg m⁻³) and lower (0 μg m⁻³) instrument detection limits are eliminated as unrealistic. The second step is the valid-error check: whenever an unusual condition occurs, the instrument saves an error code, and values carrying an error code are eliminated. The third step is the persistence check, which requires a minimum variability of the data over a certain period: if the PM10 data have not varied over the past 60 minutes by more than the specified limit (0 μg m⁻³), the current 5-minute value fails the check. The fourth step is the time-continuity check, which eliminates gross outliers. The last step is the spike check, which screens the time series for spikes; outlier detection is based on the double-difference time series, using the median. Flags indicating normal or abnormal are added to the raw data after the quality control procedure. The algorithm was applied to PM10 data for Asian dust and non-Asian dust cases at the Seoul site, and to the 2013~2014 dataset at 26 sites in Korea.
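The five-step procedure above is straightforward to express as a flagging loop. A minimal sketch covering the range, error-code, persistence, and continuity checks (the continuity/spike thresholds here are illustrative, and the median-based double-difference spike test is omitted for brevity):

```python
def qc_pm10(series, window=12, valid_range=(0, 5000), jump_limit=100):
    """Flag a 5-minute PM10 series as 'N' (normal) or 'A' (abnormal).

    series: list of (value_in_ug_m3, error_code) tuples.
    window=12 five-minute samples = the 60-minute persistence window.
    Only the 0-5000 ug m-3 range comes from the text; other limits are
    illustrative placeholders.
    """
    flags = []
    lo, hi = valid_range
    for i, (value, error_code) in enumerate(series):
        flag = "N"
        if error_code != 0:                      # step 2: valid-error check
            flag = "A"
        elif not (lo <= value <= hi):            # step 1: valid-value check
            flag = "A"
        else:
            past = [v for v, e in series[max(0, i - window):i + 1] if e == 0]
            if len(past) == window + 1 and max(past) == min(past):
                flag = "A"                       # step 3: persistence check
            elif i >= 1 and abs(value - series[i - 1][0]) > jump_limit:
                flag = "A"                       # step 4: time continuity
        flags.append(flag)
    return flags
```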

Analyzing Factors Contributing to Research Performance using Backpropagation Neural Network and Support Vector Machine

  • Ermatita, Ermatita;Sanmorino, Ahmad;Samsuryadi, Samsuryadi;Rini, Dian Palupi
    • KSII Transactions on Internet and Information Systems (TIIS), v.16 no.1, pp.153-172, 2022
  • In this study, the authors analyze factors contributing to research performance using a Backpropagation Neural Network and a Support Vector Machine. The analysis of factors contributing to lecturer research performance starts by defining the features. The next stage is to collect datasets based on the defined features, then transform the raw dataset into data ready to be processed. After the data are transformed, features are selected; before selection, the target feature is determined, namely research performance. Feature selection uses Chi-Square selection (U) and the Pearson correlation coefficient (CM). It yields eight factors contributing to lecturer research performance: Scientific Papers (U: 154.38, CM: 0.79), Number of Citations (U: 95.86, CM: 0.70), Conference (U: 68.67, CM: 0.57), Grade (U: 10.13, CM: 0.29), Grant (U: 35.40, CM: 0.36), IPR (U: 19.81, CM: 0.27), Qualification (U: 2.57, CM: 0.26), and Grant Awardee (U: 2.66, CM: 0.26). To analyze the factors, two data mining classifiers were used: Backpropagation Neural Networks (BPNN) and Support Vector Machine (SVM), evaluated with accuracy scores of 95 percent for BPNN and 92 percent for SVM. The essence of this analysis is not to find the highest accuracy score, but whether the factors pass the test phase with the expected results. The findings reveal which factors have a significant impact on research performance and which do not.
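The CM values above are Pearson correlation coefficients between each candidate feature and the target. A minimal sketch of that selection statistic (the feature/target arrays below are illustrative):

```python
def pearson(xs, ys):
    """Pearson correlation between a candidate feature and the target
    variable (here, research performance), used to rank features."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Features whose |CM| is near zero carry little linear signal about the target and are dropped before training the BPNN and SVM classifiers.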

Intelligent prediction of engineered cementitious composites with limestone calcined clay cement (LC3-ECC) compressive strength based on novel machine learning techniques

  • Enming Li;Ning Zhang;Bin Xi;Vivian WY Tam;Jiajia Wang;Jian Zhou
    • Computers and Concrete, v.32 no.6, pp.577-594, 2023
  • Engineered cementitious composites with limestone calcined clay cement (LC3-ECC), a kind of green, low-carbon, high-toughness concrete, have recently received significant investigation. However, the complicated relationship between potential influential factors and LC3-ECC compressive strength makes prediction difficult. To address this, machine learning-based prediction models for the compressive strength of LC3-ECC are proposed and developed for the first time. The models combine three novel meta-heuristic algorithms (the golden jackal, butterfly, and whale optimization algorithms) with support vector regression (SVR) to improve prediction accuracy. A new LC3-ECC compressive strength dataset was compiled from 156 data points in previous studies and used to develop the SVR-based models, with thirteen potential factors affecting compressive strength comprehensively considered. The results show that all hybrid SVR models reach a coefficient of determination (R2) above 0.95 on the testing set and 0.97 on the training set. Radar and Taylor plots also show better overall prediction performance for the hybrid SVR models than for several traditional machine learning techniques, confirming the superiority of the three proposed methods. This predictive model can provide scientific guidance for LC3-ECC materials and be further applied to such low-carbon, sustainable cement-based materials.

Knowledge Mining from Many-valued Triadic Dataset based on Concept Hierarchy (개념계층구조를 기반으로 하는 다치 삼원 데이터집합의 지식 추출)

  • Suk-Hyung Hwang;Young-Ae Jung;Se-Woong Hwang
    • Journal of Platform Technology, v.12 no.3, pp.3-15, 2024
  • Knowledge mining is a research field that applies techniques such as data modeling, information extraction, analysis, visualization, and result interpretation to find valuable knowledge in diverse large datasets. It plays a crucial role in transforming raw data into useful knowledge across domains such as business, healthcare, and scientific research. In this paper, we propose analytical techniques for knowledge discovery and data mining from various data by extending the Formal Concept Analysis method. We define algorithms for representing the diverse formats and structures of the data to be analyzed, including models such as many-valued data tables and triadic data tables, as well as algorithms for data processing (dyadic scaling and flattening), the construction of concept hierarchies, and the extraction of association rules. The usefulness of the proposed technique is empirically demonstrated through experiments applying the method to public open data.
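After scaling and flattening, Formal Concept Analysis reduces to enumerating the formal concepts of a binary object-attribute context, from which the concept hierarchy is built. A brute-force sketch of that core step on a tiny context (the paper's own algorithms for many-valued and triadic tables are more elaborate):

```python
from itertools import combinations

def formal_concepts(objects, attributes, incidence):
    """All formal concepts (extent, intent) of a binary context.

    incidence: set of (object, attribute) pairs. A concept is a pair where
    the extent is exactly the objects sharing the intent, and vice versa.
    """
    def intent(objs):
        return frozenset(a for a in attributes
                         if all((o, a) in incidence for o in objs))

    def extent(attrs):
        return frozenset(o for o in objects
                         if all((o, a) in incidence for a in attrs))

    concepts = set()
    for r in range(len(objects) + 1):
        for objs in combinations(objects, r):
            B = intent(frozenset(objs))  # common attributes of the subset
            A = extent(B)                # closure: all objects with those attrs
            concepts.add((A, B))
    return concepts
```

Ordering these concepts by extent inclusion yields the concept lattice, i.e. the concept hierarchy the paper derives knowledge from; exponential enumeration like this is only viable for small contexts.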


A Decline of Observed Daily Peak Wind Gusts with Distinct Seasonality in Australia, 1941-2016

  • Cesar Azorin-Molina;Tim R. McVicar;Jose A. Guijarro;Blair Trewin;Andrew J. Frost;Gangfeng Zhang;Lorenzo Minola;Seok-Woo Son;Kaiqiang Deng;Deliang Chen
    • Journal of Climate Change Research, v.34 no.8, pp.3103-3127, 2021
  • Wind gusts represent one of the main natural hazards due to their increasing socioeconomic and environmental impacts on, for example, human safety, maritime-terrestrial-aviation activities, engineering and insurance applications, and energy production. However, existing scientific studies of observed wind gusts are relatively few compared to those on mean wind speed. In Australia, previous studies found a slowdown of near-surface mean wind speed, termed "stilling," but knowledge of the multidecadal variability and trends in the magnitude (wind speed maxima) and frequency (exceedances of the 90th percentile) of wind gusts is lacking. A new homogenized daily peak wind gusts (DPWG) dataset containing 548 time series across Australia for 1941-2016 is analyzed to determine long-term trends in wind gusts. Here we show that both the magnitude and frequency of DPWG declined across much of the continent, with a distinct seasonality: negative trends in summer-spring-autumn and weak negative or nontrending (even positive) trends in winter. We demonstrate that ocean-atmosphere oscillations such as the Indian Ocean dipole and the southern annular mode partly modulate decadal-scale variations of DPWG. The long-term declining trend of DPWG is consistent with the "stilling" phenomenon, suggesting that global warming may have reduced Australian wind gusts.
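A long-term decline like the DPWG trend above is typically summarized as a linear slope over the study period. A minimal sketch using an ordinary least-squares fit (the sample numbers are illustrative; trend studies of this kind often also use robust estimators such as Sen's slope):

```python
def linear_trend(years, values):
    """Ordinary least-squares slope (units per year) of an annual series,
    e.g. station-mean daily peak wind gusts over 1941-2016."""
    n = len(years)
    my, mv = sum(years) / n, sum(values) / n
    num = sum((y - my) * (v - mv) for y, v in zip(years, values))
    den = sum((y - my) ** 2 for y in years)
    return num / den
```

A negative slope indicates a declining gust series; computing the slope per season rather than per year is how the summer-spring-autumn versus winter contrast above would show up.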