• Title/Abstract/Keywords: Scientific Dataset

Search results: 41 items (processing time: 0.023 s)

Analysis and Implications of Australian National Data Service (ANDS)

  • 박동진
    • 디지털융복합연구 / Vol. 9, No. 3 / pp. 1-10 / 2011
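  • At present there is no national-level management regime or concrete preservation policy for scientific datasets. As a result, scientists and research groups carrying out research projects can neither search for information about datasets nor share them. With research that uses digitized data increasing rapidly, sharing and reusing scientific data among researchers is recognized as very important, and the need for a national scientific data policy is therefore growing. This study analyzes advanced cases abroad to identify important implications for Korea's strategic planning. After reviewing general matters concerning scientific data and the broad policy directions of individual countries, it focuses on Australia, whose government-led, centralized research environment, research support system, and information services resemble Korea's. Specifically, it analyzes the Australian National Data Service (ANDS) and derives implications applicable to Korea. Finally, it presents the most basic principles that should be reflected when Korea's scientific data policy is established.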

OryzaGP: rice gene and protein dataset for named-entity recognition

  • Larmande, Pierre;Do, Huy;Wang, Yue
    • Genomics & Informatics / Vol. 17, No. 2 / pp. 17.1-17.3 / 2019
  • Text mining has become an important research method in biology. Its original purpose was to extract biological entities, such as genes, proteins, and phenotypic traits, in order to extend knowledge from scientific papers. However, few thorough studies on text mining and application development have been performed for plant molecular biology data, especially for rice, resulting in a lack of datasets for named-entity recognition tasks for this species. Because few benchmarks are available for rice, we faced various difficulties in applying advanced machine learning methods to accurate analysis of the rice literature. To evaluate several approaches to automatically extracting information about gene/protein entities, we built a new dataset for rice as a benchmark. The dataset is composed of titles and abstracts, downloaded from PubMed, of scientific papers focusing on the rice species. During the 5th Biomedical Linked Annotation Hackathon, a portion of the dataset was uploaded to PubAnnotation for sharing. Our ultimate goal is to offer a shared task on rice gene/protein name recognition through the BioNLP Open Shared Tasks framework using the dataset, to facilitate open comparison and evaluation of different approaches to the task.

GLOVE: Distributed Shared Memory Based Parallel Visualization Tool for Massive Scientific Dataset

  • 이중연;김민아;이세훈;허영주
    • 정보처리학회논문지:소프트웨어 및 데이터공학 / Vol. 5, No. 6 / pp. 273-282 / 2016
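  • A visualization tool can be divided into three components: data input/output, visual transformation, and interactive rendering. This paper analyzes and defines the requirements for these three components in order to visualize very large scientific data in real time, and proposes ways to satisfy them. In particular, to develop the tool efficiently, we aimed to make maximum use of open software tools. We discuss how to integrate open software tools developed for different purposes into a single visualization tool, along with optimization methods for real-time visualization of spatiotemporal scientific data. On this basis we propose GLOVE, a distributed shared memory based parallel visualization tool for scientific data, and compare its performance with that of other data visualization software in experiments using scientific data from the fluid analysis field.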

Scientific and Technical Visualization for Ocean Process Simulations

  • 최병호
    • 한국전산유체공학회:학술대회논문집 / Proceedings of the 한국전산유체공학회 1999 Spring Conference / pp. 1-10 / 1999
  • This paper briefly introduces the numerical modeling of ocean processes carried out over the twenty years up to 1998, focusing on the seas neighbouring the Korean Peninsula. Modeling of global ocean dynamics has also been performed as a pathway to understanding the regional ocean dynamics. Ocean simulation produces vast multidimensional, multivariate datasets, so adopting scientific and technical visualization techniques was essential to properly understand the physics involved.

Stock News Dataset Quality Assessment by Evaluating the Data Distribution and the Sentiment Prediction

  • Alasmari, Eman;Hamdy, Mohamed;Alyoubi, Khaled H.;Alotaibi, Fahd Saleh
    • International Journal of Computer Science & Network Security / Vol. 22, No. 2 / pp. 1-8 / 2022
  • This work provides a reliable, classified stock dataset merged with Saudi stock news. The dataset allows researchers to analyze and better understand the realities, impacts, and relationships between stock news and stock fluctuations. The data were collected from the Saudi stock market via the Corporate News (CN) and Historical Data Stocks (HDS) datasets. As their names suggest, CN contains news, and HDS records how stock values change over time. Both datasets cover the period from 2011 to 2019, have 30,098 rows, and have 16 variables, four of which they share and 12 of which differ. The combined dataset presented here therefore includes 30,098 published news items and information about stock fluctuations across nine years. Stock news polarity has been interpreted in various ways by native Arabic speakers associated with the stock domain, so the polarity was categorized manually based on Arabic semantics. Because the Saudi stock market contributes massively to the international economy, this dataset is essential for stock investors and analysts. It was prepared for educational and scientific purposes, motivated by the scarcity of data describing the impact of Saudi stock news on stock activity, and should be useful across many areas, including stock market analytics, data mining, statistics, machine learning, and deep learning. The data were evaluated in two ways: by testing the distribution of the data over the classes and by measuring sentiment prediction accuracy. The results show that the distribution of polarity over sectors is balanced. An NB model was developed to evaluate data quality through sentiment classification; it achieved 68% accuracy, which the authors take as evidence of the data's reliability. The authors conclude that the evaluation results support the dataset's reliability, readiness, and quality for use.

A Study on Scientific Article Recommendation System with User Profile Applying TPIPF

  • 장령령;장우권
    • 정보관리학회지 / Vol. 33, No. 1 / pp. 317-336 / 2016
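  • Today's explosive growth of information forces users to spend enormous time and effort finding the information they want. To address this problem, article recommendation systems have emerged that analyze users' information needs and recommend suitable articles. However, most such systems overlook the user profile, which is their core component. This study therefore proposes computing the user profile, which determines the performance of an article recommendation system, with a new measure, TPIPF (Topic Proportion-Inverse Paper Frequency), rather than with the conventional average. Both the proposed method and the conventional method were applied to an article recommendation system, and their performance was compared experimentally on data provided by CiteULike, an online reference management tool. The results show that the recommendation system using the proposed TPIPF method performs better.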

Development of Dataset Evaluation Criteria for Learning Deepfake Video

  • 김량형;김태구
    • 산업경영시스템학회지 / Vol. 44, No. 4 / pp. 193-207 / 2021
  • The Deepfake phenomenon is spreading worldwide, mainly through videos on web platforms, and it urgently needs to be addressed. Researchers have recently discussed Deepfake video datasets extensively. However, it has been pointed out that existing Deepfake datasets do not properly reflect the potential threat and realism because of various limitations. Although research is needed that establishes an agreed-upon concept of a high-quality dataset or suggests evaluation criteria, only a handful of studies have examined this to date. This study therefore focused on developing evaluation criteria for Deepfake video datasets. The fitness of a Deepfake dataset was presented, and evaluation criteria were derived through a review of previous studies. AHP structuring and analysis were performed to refine the criteria. The results showed that Facial Expression, Validation, and Data Characteristics are important determinants of data quality. This is interpreted as reflecting the importance of minimizing defects and presenting results based on scientific methods when evaluating quality. The study is significant in that it proposes both the fitness and the evaluation criteria of a Deepfake dataset. Because the criteria were derived from items considered in previous studies, all of them are expected to be effective for quality improvement. They are also expected to serve as criteria for selecting an appropriate Deepfake dataset or as a reference for designing a Deepfake data benchmark. This study could not apply the criteria to existing Deepfake datasets. Future research will apply them to existing datasets to evaluate the strengths and weaknesses of each and to consider the implications for their use in Deepfake research.

Method for Importance based Streamline Generation on the Massive Fluid Dynamics Dataset

  • 이중연;김민아;이세훈
    • 한국콘텐츠학회논문지 / Vol. 18, No. 6 / pp. 27-37 / 2018
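  • Streamline generation is a representative visualization technique for interpreting flow in fluid analysis data. However, determining where to place seed points for an effective streamline layout is a very difficult problem. Moreover, for massive fluid analysis data, seed point selection and streamline computation take a very long time. This paper proposes a seed point selection method based on the importance of the fluid analysis data, for effective streamline placement, together with a parallel processing technique for a distributed parallel visualization system environment. It also presents implementation results on real fluid analysis data in the GLOVE visualization system, through which the proposed method is validated.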

Collaborative Research Network and Scientific Productivity: The Case of Korean Statisticians and Computer Scientists

  • Kwon, Ki-Seok;Kim, Jin-Guk
    • Asian Journal of Innovation and Policy / Vol. 6, No. 1 / pp. 85-93 / 2017
  • This paper focuses on the relationship between network characteristics and the productivity of scientists, which has rarely been examined in previous studies. Utilizing a unique dataset from the Korean Citation Index (KCI), we examine the overall characteristics of the research network (e.g. distribution of nodes, density and mean distance), and analyze whether network centrality is related to scientific productivity. First, we found that the collaborative research network of Korean academics in statistics and computer science is a scale-free network. Second, these research networks show a disciplinary difference: the network of statisticians is denser than that of computer scientists, and computer scientists are located in a more fragmented network than statisticians. Third, a significant relation between researchers' network position and scientific productivity, and a disciplinary difference in that relation, were observed. In particular, degree centrality is the strongest predictor of scientists' productivity. Based on these findings, some policy implications are put forward.
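
The abstract above refers to network density and degree centrality without saying how they were computed. The following is a minimal sketch using the standard textbook definitions for an undirected graph, not the authors' procedure. The function name `network_summary` and the author labels "A" to "D" are hypothetical and used only for illustration.

```python
from collections import defaultdict


def network_summary(edges):
    """Return (density, degree_centrality) for an undirected co-authorship graph.

    Assumes standard definitions: density = 2m / (n(n-1)) and
    degree centrality = degree / (n - 1).
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:  # ignore self-loops
            continue
        neighbors[a].add(b)  # sets collapse repeated collaborations
        neighbors[b].add(a)

    n = len(neighbors)
    # Each undirected edge appears in two neighbor sets, so halve the sum.
    m = sum(len(nb) for nb in neighbors.values()) // 2
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    centrality = {
        node: (len(nb) / (n - 1) if n > 1 else 0.0)
        for node, nb in neighbors.items()
    }
    return density, centrality


# Hypothetical co-authorship pairs, for illustration only.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
density, centrality = network_summary(edges)
```

In this toy graph, author "A" has co-authored with every other author and so has the highest degree centrality. A study like the one described would then relate such scores to a productivity measure.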

Accuracy Assessment of Global Land Cover Datasets in South Korea

  • Son, Sanghun;Kim, Jinsoo
    • 대한원격탐사학회지 / Vol. 34, No. 4 / pp. 601-610 / 2018
  • The national accuracy of global land cover (GLC) products is of great importance to ecological and environmental research. However, GLC products that are derived from different satellite sensors, with differing spatial resolutions, classification methods, and classification schemes are certain to show some discrepancies. The goal of this study is to assess the accuracy of four commonly used GLC datasets in South Korea, GLC2000, GlobCover2009, MCD12Q1, and GlobeLand30. First, we compared the area of seven classes between four GLC datasets and a reference dataset. Then, we calculated the accuracy of the four GLC datasets based on an aggregated classification scheme containing seven classes, using overall, producer's and user's accuracies, and kappa coefficient. GlobeLand30 had the highest overall accuracy (77.59%). The overall accuracies of MCD12Q1, GLC2000, and GlobCover2009 were 75.51%, 68.38%, and 57.99%, respectively. These results indicate that GlobeLand30 is the most suitable dataset to support a variety of national scientific endeavors in South Korea.
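
The overall, producer's, and user's accuracies and the kappa coefficient named in the abstract above are conventionally derived from a confusion matrix. The sketch below uses those conventional definitions and assumes that rows hold the reference classes and columns hold the mapped classes. The function name `accuracy_metrics` and the two-class counts are hypothetical, not data from the study.

```python
def accuracy_metrics(cm):
    """Compute accuracy measures from a square confusion matrix.

    Assumes cm[i][j] is the number of samples whose reference class is i
    and whose mapped (classified) class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    row_tot = [sum(cm[i]) for i in range(n)]                        # reference totals
    col_tot = [sum(cm[i][j] for i in range(n)) for j in range(n)]   # mapped totals

    overall = correct / total
    # Producer's accuracy: share of each reference class mapped correctly.
    producers = [cm[i][i] / row_tot[i] if row_tot[i] else float("nan")
                 for i in range(n)]
    # User's accuracy: share of each mapped class that is actually correct.
    users = [cm[j][j] / col_tot[j] if col_tot[j] else float("nan")
             for j in range(n)]
    # Cohen's kappa: agreement beyond what chance alone would give.
    expected = sum(row_tot[k] * col_tot[k] for k in range(n)) / total ** 2
    kappa = (overall - expected) / (1 - expected) if expected < 1 else float("nan")
    return {"overall": overall, "producers": producers,
            "users": users, "kappa": kappa}


# Hypothetical two-class counts, for illustration only.
result = accuracy_metrics([[50, 10], [5, 35]])
```

For these made-up counts, 85 of the 100 samples lie on the diagonal, so overall accuracy is 0.85. Kappa is lower because part of that agreement would be expected by chance.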