• Title/Summary/Keyword: Scientific Dataset


Mining Clusters of Sequence Data using Sequence Element-based Similarity Measure (시퀀스 요소 기반의 유사도를 이용한 시퀀스 데이터 클러스터링)

  • 오승준;김재련
    • Proceedings of the Korea Intelligent Information System Society Conference / 2004.11a / pp.221-229 / 2004
  • Recently, there has been enormous growth in the amount of commercial and scientific data, such as protein sequences, retail transactions, and web-logs. Such datasets consist of sequence data that have an inherent sequential nature. However, only a few of the existing clustering algorithms consider sequentiality. This study presents a method for clustering such sequence datasets. The similarity between sequences must be decided before clustering the sequences. This study proposes a new similarity measure to compute the similarity between two sequences using a sequence element. Two clustering algorithms using the proposed similarity measure are proposed: a hierarchical clustering algorithm and a scalable clustering algorithm that uses sampling and a k-nearest neighbor method. Using a splice dataset and synthetic datasets, we show that the quality of clusters generated by our proposed clustering algorithms is better than that of clusters produced by traditional clustering algorithms.


KorSciQA: A Dataset for Machine Comprehension of Korean Scientific Paper (KorSciQA: 한국어 논문의 기계독해 데이터셋)

  • Hahm, Younggyun;Jeong, Youngbin;Jeong, Heeseok;Hwang, Hyekyong;Choi, Key-Sun
    • Annual Conference on Human and Language Technology / 2019.10a / pp.207-212 / 2019
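  • This paper proposes a machine reading comprehension task for scientific and technical papers written in Korean (called KorSciQA) and reports on the accompanying dataset construction and evaluation. Using a crowdsourcing design with various added constraints, we built a machine reading comprehension dataset of 2,490 question-answer pairs of consistent quality over 498 paper abstracts. The dataset consists of questions closely tied to the argumentative elements that appear in any paper, such as the problem discussed, the method of solution, and the related data and models. Answering them requires grasping the meaning, purpose, and rationale of each argumentative element and performing various kinds of inference. Experiments show that the KorSciQA dataset is a challenging task that is difficult to solve with the reading ability of existing machine reading comprehension models.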


Towards inferring reactor operations from high-level waste

  • Benjamin Jung;Antonio Figueroa;Malte Gottsche
    • Nuclear Engineering and Technology / v.56 no.7 / pp.2704-2710 / 2024
  • Nuclear archaeology research provides scientific methods to reconstruct the operating histories of fissile material production facilities to account for past fissile material production. While it has typically focused on analyzing material in permanent reactor structures, spent fuel or high-level waste also hold information about the reactor operation. In this computational study, we explore a Bayesian inference framework for reconstructing the operational history from measurements of isotope ratios from a sample of nuclear waste. We investigate two different inference models. The first model discriminates between three potential reactors of origin (Magnox, PWR, and PHWR) while simultaneously reconstructing the fuel burnup, time since irradiation, initial enrichment, and average power density. The second model reconstructs the fuel burnup and time since irradiation of two batches of waste in a mixed sample. Each of the models is applied to a set of simulated test data, and the performance is evaluated by comparing the highest posterior density regions to the corresponding parameter values of the test dataset. Both models perform well on the simulated test cases, which highlights the potential of the Bayesian inference framework and opens up avenues for further investigation.

The Automated Scoring of Kinematics Graph Answers through the Design and Application of a Convolutional Neural Network-Based Scoring Model (합성곱 신경망 기반 채점 모델 설계 및 적용을 통한 운동학 그래프 답안 자동 채점)

  • Jae-Sang Han;Hyun-Joo Kim
    • Journal of The Korean Association For Science Education / v.43 no.3 / pp.237-251 / 2023
  • This study explores the possibility of automated scoring of scientific graph answers by designing an automated scoring model based on convolutional neural networks and applying it to students' kinematics graph answers. The researchers prepared 2,200 answers, divided into 2,000 training and 200 validation items. In addition, 202 student answers were divided into 100 training and 102 test items. First, in designing the model and validating its performance, the model was optimized for graph image classification using the answer dataset prepared by the researchers. Next, the model was trained on various types of training datasets and used to score the student test dataset. Its performance improved as the training data increased in amount and diversity. Compared with human scoring, the accuracy was 97.06%, the kappa coefficient was 0.957, and the weighted kappa coefficient was 0.968. However, for answer types not included in the training data, scoring was almost identical among human scorers, whereas the automated scoring model performed inaccurately.
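
The agreement figures quoted in this abstract (accuracy, kappa, weighted kappa) can be illustrated with a minimal sketch. It assumes Cohen's kappa for two raters and integer ordinal scores; the abstract does not say which kappa variant or weighting scheme the authors used, so the function name `cohen_kappa`, the `weights` options, and the toy scores below are all illustrative assumptions.

```python
from collections import Counter


def accuracy(human, model):
    """Fraction of answers on which the two score lists agree exactly."""
    return sum(h == m for h, m in zip(human, model)) / len(human)


def cohen_kappa(rater_a, rater_b, weights=None):
    """Cohen's kappa for two raters.

    weights=None gives plain kappa; "linear" or "quadratic" gives weighted
    kappa and requires integer (ordinal) scores. Assumed variant only.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of equal length")
    labels = sorted(set(rater_a) | set(rater_b))
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))      # joint counts
    marg_a, marg_b = Counter(rater_a), Counter(rater_b)
    # Score range used to normalise distances; at least 1 to avoid 0-division.
    span = max(labels[-1] - labels[0], 1) if weights else 1

    def disagreement(i, j):
        if weights is None:
            return 0.0 if i == j else 1.0
        d = abs(i - j) / span
        return d if weights == "linear" else d * d

    obs = sum(disagreement(i, j) * observed[(i, j)] / n
              for i in labels for j in labels)
    exp = sum(disagreement(i, j) * marg_a[i] * marg_b[j] / (n * n)
              for i in labels for j in labels)
    # If chance disagreement is zero (one label only), agreement is trivial.
    return 1.0 if exp == 0 else 1.0 - obs / exp


# Toy scores (not data from the study): one disagreement out of eight.
human = [0, 1, 2, 2, 1, 0, 2, 1]
model = [0, 1, 2, 2, 1, 0, 1, 1]
print(accuracy(human, model))
print(cohen_kappa(human, model))
print(cohen_kappa(human, model, weights="quadratic"))
```

Weighted kappa penalises a near miss less than a distant one, which is why it can exceed plain kappa when most disagreements are between adjacent scores.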

Climate Change Impact on Korean Stone Heritage: Research Trends and Prospect (국내 석조유산의 기후변화 영향: 연구동향과 미래전망)

  • Kim, Jiyoung
    • Journal of Conservation Science / v.32 no.3 / pp.437-448 / 2016
  • Studies on the vulnerability of cultural heritage to worldwide climate change and on adaptation strategies have been actively carried out in advanced countries since the late 20th century, establishing a valid research methodology and accumulating climate and deterioration datasets. In Korea, meanwhile, related scientific data still need to be acquired, even though policy-oriented research is available for reference. When Korea's projected future climate is applied to impact analysis, the deterioration of Korean stone heritage is expected to become more complex in terms of physical, chemical, and biological weathering, which may affect conservation practice and the administration of cultural heritage. Further studies should detail the implications of climate change for Korean stone heritage by downscaling the analysis area to the local scale and the dataset frequency to hourly intervals. It is important to identify the capacity and vulnerability of stone heritage with respect to the future environment and to develop adaptation and prevention strategies.

A Scalable Clustering Method for Categorical Sequences (범주형 시퀀스들에 대한 확장성 있는 클러스터링 방법)

  • Oh, Seung-Joon;Kim, Jae-Yearn
    • Journal of the Korean Institute of Intelligent Systems / v.14 no.2 / pp.136-141 / 2004
  • There has been enormous growth in the amount of commercial and scientific data, such as retail transactions, protein sequences, and web-logs. Such datasets consist of sequence data that have an inherent sequential nature. However, few clustering algorithms consider sequentiality. In this paper, we study how to cluster sequence datasets. We propose a new similarity measure to compute the similarity between two sequences. We also present an efficient method for determining the similarity measure and develop a clustering algorithm. Due to the high computational complexity of hierarchical clustering algorithms for clustering large datasets, a new clustering method is required. Therefore, we propose a new scalable clustering method using sampling and a k-nearest-neighbor method. Using a real dataset and a synthetic dataset, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional algorithms.
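
The scalable scheme described in this abstract (cluster a sample, then place the remaining sequences with a k-nearest-neighbor step) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' algorithm: the abstract does not define their similarity measure, so `jaccard` is a stand-in, and the average-link clustering, the similarity-weighted vote, and all function names are hypothetical choices.

```python
import random
from collections import defaultdict


def jaccard(seq_a, seq_b):
    """Stand-in similarity (assumption): overlap of the two element sets."""
    a, b = set(seq_a), set(seq_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def agglomerate(items, n_clusters, sim):
    """Average-link agglomerative clustering; returns one label per item."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[x] for j in clusters[y]]
                avg = sum(sim(items[i], items[j]) for i, j in pairs) / len(pairs)
                if best is None or avg > best[0]:
                    best = (avg, x, y)
        _, x, y = best
        clusters[x] += clusters.pop(y)  # y > x, so index x stays valid
    labels = [0] * len(items)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def cluster_with_sampling(sequences, n_clusters, sample_size, k=3,
                          sim=jaccard, seed=0):
    """Cluster a random sample, then label the rest by weighted k-NN vote."""
    rng = random.Random(seed)
    sample_idx = rng.sample(range(len(sequences)), sample_size)
    sample = [sequences[i] for i in sample_idx]
    sample_labels = agglomerate(sample, n_clusters, sim)

    labels = [None] * len(sequences)
    for i, lab in zip(sample_idx, sample_labels):
        labels[i] = lab
    for i, seq in enumerate(sequences):
        if labels[i] is not None:
            continue
        # k most similar sampled sequences, each voting with its similarity.
        scored = sorted(((sim(seq, s), lab) for s, lab in zip(sample, sample_labels)),
                        key=lambda pair: pair[0], reverse=True)[:k]
        votes = defaultdict(float)
        for score, lab in scored:
            votes[lab] += score
        labels[i] = max(votes, key=votes.get)
    return labels


# Toy sequences (not data from the paper): two obvious groups.
sequences = ["ABCA", "BCAB", "CABC", "ABCC", "XYZX", "YZXY", "ZXYZ", "XYZZ"]
print(cluster_with_sampling(sequences, n_clusters=2, sample_size=5))
```

Only the sample goes through the expensive pairwise clustering; every other sequence costs one pass over the sample, which is the point of the sampling step.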

A Scientometric Social Network Analysis of International Collaborative Publications of All India Institute of Medical Sciences, India

  • Nishavathi, E.;Jeyshankar, R.
    • Journal of Information Science Theory and Practice / v.8 no.3 / pp.64-76 / 2020
  • Scientometrics and social network analysis (SNA) measures were used to analyze the international scientific collaboration (ISC) of the All India Institute of Medical Sciences (AIIMS) over a 10-year period (2009-2018). The dataset consists of 19,622 records retrieved from the Scopus database. The mean degree of collaboration of 0.95 implies that AIIMS researchers tend to collaborate, both domestically (80.29%) and internationally (14.67%). The data exhibit a hyper-authorship pattern, and medium-size research teams of 4 to 10 authors contributed the largest share of publications, 62.08% (12,182). 71.97% of research findings appear in journal articles. The most preferred journals published 58.55% of the medical literature. An undirected collaboration network was constructed in Pajek to study the ISC of AIIMS during 2009-2018; it consists of 179 vertices (Vn) and 11,938 edges. Degree centrality (Dc) identified the United States of America (Dc - 54; CC - 0.99) and the United Kingdom (Dc - 41; CC - 0.98) as the most collaborative and most influential countries in the whole network. The Louvain community detection method was used to detect influential research groups of AIIMS. The temporal evolution of the ISC of AIIMS, studied through scientometrics and SNA measures, sheds light on the structure and properties of the ISC networks of AIIMS. It reveals that AIIMS, India has taken keen steps to enrich the quality of research by extending and encouraging collaboration between institutions and industries at the international level.
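
Two of the measures named in this abstract can be sketched in a few lines. The sketch assumes the commonly used definition of degree of collaboration (multi-authored papers divided by all papers) and treats degree centrality as the raw number of distinct neighbours in an undirected network; the abstract does not spell out either formula, and the function names and toy data are illustrative only.

```python
from collections import defaultdict


def degree_of_collaboration(author_counts):
    """Share of papers with more than one author (assumed definition)."""
    multi = sum(1 for n in author_counts if n > 1)
    return multi / len(author_counts)


def degree_centrality(edges):
    """Number of distinct neighbours per node in an undirected network."""
    neighbours = defaultdict(set)
    for u, v in edges:
        if u != v:                      # ignore self-loops
            neighbours[u].add(v)
            neighbours[v].add(u)
    return {node: len(adj) for node, adj in neighbours.items()}


# Toy data (not figures from the paper).
papers = [1, 3, 5, 2]                   # authors per paper
links = [("India", "USA"), ("India", "UK"), ("India", "Japan"), ("USA", "UK")]
print(degree_of_collaboration(papers))
print(degree_centrality(links))
```

Repeated edges between the same pair of countries do not raise the count here, because neighbours are stored in sets.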

Domestic development situation of precision nutrition healthcare (PNH) system based on direct-to-consumer (DTC) obese genes (소비자대상 직접 (DTC) 비만유전자 기반 정밀영양 (PNH)의 국내 현황)

  • Oh Yoen Kim;Myoungsook Lee;Jounghee Lee;Cheongmin Sohn;Mi Ock Yoon
    • Journal of Nutrition and Health / v.55 no.6 / pp.601-616 / 2022
  • In the era of fourth industrial revolution technology, including personalized nutrition for healthcare (PNH) when establishing a healthcare platform to prevent chronic diseases such as obesity, diabetes, cerebrovascular and cardiovascular disease, pulmonary disease, and inflammatory diseases enhances national competitiveness in global healthcare markets. Furthermore, since Korea experienced COVID-19 and a population "dead cross" (deaths exceeding births) in 2020, as well as numerous health problems due to an increasingly super-aged society, there is an urgent need to secure, develop, and utilize PNH-related technologies. Three conditions are essential for the development of PNH technologies: establishing causality between the obesity genome (genotype) and prevalence (phenotype) in Koreans, validating clinical intervention research, and securing PNH-utilization technology (i.e., algorithm development, artificial intelligence-based platforms, direct-to-consumer [DTC]-based PNH, etc.). Therefore, a national control tower is required to establish appropriate PNH infrastructure (basic and clinical research, cultivation of PNH-related experts, etc.). The post-COVID-19 era will see aggressive sharing of data and knowledge and development of related technologies, and Korea needs to participate actively in the large-scale global healthcare markets. This review highlights the importance of scientific evidence based on a huge dataset, which is the primary prerequisite for DTC obesity gene-based PNH technologies to be competitive in the healthcare market. Furthermore, by comparing domestically and internationally approved DTC obesity genes and the current status of Korean obesity genome-based PNH research, we intend to provide direction to PNH planners (individuals and industries) for establishing scientific PNH guidelines for the prevention of obesity.

Analysis of the Individual Tree Growth for Urban Forest using Multi-temporal airborne LiDAR dataset (다중시기 항공 LiDAR를 활용한 도시림 개체목 수고생장분석)

  • Kim, Seoung-Yeal;Kim, Whee-Moon;Song, Won-Kyong;Choi, Young-Eun;Choi, Jae-Yong;Moon, Guen-Soo
    • Journal of the Korean Society of Environmental Restoration Technology / v.22 no.5 / pp.1-12 / 2019
  • Tree height is an essential measure for assessing forest health in urban areas. Extensive, dense forests therefore need an automated method that measures the height of individual trees as three-dimensional forest information. Because tree height (z-coordinate) is easy to derive from an airborne LiDAR dataset, individual tree heights can be measured to assess forest health, particularly in urban forests, which are adversely affected by habitat fragmentation and isolation. This study measured the height of individual trees to assess urban forest health and to identify environmental factors that affect forest growth. The survey was conducted on Mt. Bongseo in Seobuk-gu, Cheonan-si (Middle Chungcheong Province). Individual coniferous trees were segmented automatically from airborne LiDAR datasets of two periods (2016 and 2017) to determine individual tree growth. Segmentation used the watershed algorithm and local maxima, and tree growth was determined as the difference in tree height between the two periods. We then examined the relationship between tree growth and environmental factors. Tree growth on Mt. Bongseo was about 20 cm per year, lower than the 23.9 cm/year growth of the dominant species, Pinus rigida. This suggests an adverse effect on growth in isolated urban forests. Tree growth also differed by age, diameter, and density class in the stock map, and by effective soil depth and drainage grade in the soil map. Among environmental factors, distance to the road and solar radiation showed a statistically significant positive correlation with tree growth. Because the correlation was weak, other factors affecting tree growth in urban forests, beyond anthropogenic influences, need to be identified. This study provides the first data on the segmentation and growth of individual trees, and it can serve as scientific data for urban forest health assessment and management.
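
The growth calculation described in this abstract (treetops located as local maxima, growth taken as the height difference between two survey years) can be illustrated with a small sketch. It assumes the LiDAR points have already been rasterised into two aligned canopy height grids and covers only the local-maximum step, not the watershed segmentation; the function names, the `min_height` threshold, and the toy grids are assumptions for illustration.

```python
def local_maxima(chm, min_height=2.0):
    """Return (row, col) cells strictly higher than all 8 neighbours."""
    rows, cols = len(chm), len(chm[0])
    tops = []
    for r in range(rows):
        for c in range(cols):
            h = chm[r][c]
            if h < min_height:          # skip ground and low vegetation
                continue
            neighbours = [chm[rr][cc]
                          for rr in range(max(r - 1, 0), min(r + 2, rows))
                          for cc in range(max(c - 1, 0), min(c + 2, cols))
                          if (rr, cc) != (r, c)]
            if all(h > n for n in neighbours):
                tops.append((r, c))
    return tops


def height_growth(chm_first, chm_second, min_height=2.0):
    """Height change at each treetop found in the first survey (metres)."""
    return {(r, c): chm_second[r][c] - chm_first[r][c]
            for r, c in local_maxima(chm_first, min_height)}


# Toy canopy height grids in metres (not data from the study).
chm_2016 = [[1.0, 1.0, 1.0, 1.0],
            [1.0, 8.0, 1.0, 1.0],
            [1.0, 1.0, 1.0, 6.0],
            [1.0, 1.0, 1.0, 1.0]]
chm_2017 = [[1.0, 1.0, 1.0, 1.0],
            [1.0, 8.2, 1.0, 1.0],
            [1.0, 1.0, 1.0, 6.25],
            [1.0, 1.0, 1.0, 1.0]]
for cell, growth in height_growth(chm_2016, chm_2017).items():
    print(cell, round(growth, 2))
```

Reading the second-year height at the same cell assumes the treetop has not shifted between surveys; a real workflow would match treetops within a small search window.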

Building Sentence Meaning Identification Dataset Based on Social Problem-Solving R&D Reports (사회문제 해결 연구보고서 기반 문장 의미 식별 데이터셋 구축)

  • Hyeonho Shin;Seonki Jeong;Hong-Woo Chun;Lee-Nam Kwon;Jae-Min Lee;Kanghee Park;Sung-Pil Choi
    • KIPS Transactions on Software and Data Engineering / v.12 no.4 / pp.159-172 / 2023
  • In general, social problem-solving research aims to create social value by using science and technology to offer meaningful answers to pending social issues. However, although numerous and extensive research efforts have been made nationwide to alleviate social problems, many important social challenges remain. To facilitate the whole process of social problem-solving research and maximize its efficacy, it is vital to clearly identify the important and pressing problems to focus on. The problem discovery step could be drastically improved if current social issues were identified automatically from existing R&D resources such as technical reports and articles. This paper introduces a comprehensive dataset for building machine learning models that automatically detect social problems and solutions in national research reports. We first collected a total of 700 research reports on social problems and issues. Through an intensive annotation process, we built a set of 24,022 sentences, each labeled with a category closely related to social problem-solving, such as problem, purpose, solution, or effect. We also implemented four sentence classification models based on various neural language models and ran a series of performance experiments on our dataset. The model fine-tuned from the KLUE-BERT pre-trained language model performed best, with an accuracy of 75.853% and an F1 score of 63.503%.