• Title/Summary/Keyword: large dataset

LD-based tagSNP Selection System for Large-scale Haplotype and Genotype Datasets (대용량의 Haplotype과 Genotype데이터에 대한 LD기반의 tagSNP 선택 시스템)

  • Kim, Sang-Jun; Yeo, Sang-Soo; Kim, Sung-Kwon
    • Proceedings of the Korean Society for Bioinformatics Conference / 2004.11a / pp.279-285 / 2004
  • In disease association studies, the tagSNP selection problem is important from the standpoint of time and cost. We developed a new tagSNP selection system that also provides facilities for haplotype reconstruction and missing-data processing. In our system, we improved the biological relevance of the selection by using LD coefficients together with a dynamic programming method, and the system is capable of processing large-scale datasets, such as all SNPs on a chromosome. We have tested our system with various datasets from Daly et al., Patil et al., the HapMap Project, artificial data, and so on.
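
As a rough illustration of the LD-based idea (not the authors' system), the sketch below computes pairwise r² between the SNP columns of a 0/1/2 genotype matrix and greedily picks tagSNPs until every SNP is covered by some tag at a chosen r² threshold; all names, the threshold, and the random data are illustrative assumptions.

```python
# Minimal sketch of LD-based tagSNP selection (hypothetical, not the paper's code).
import numpy as np

def ld_r2(genotypes: np.ndarray) -> np.ndarray:
    """Pairwise squared Pearson correlation between SNP columns (LD r^2)."""
    r = np.corrcoef(genotypes, rowvar=False)
    return r ** 2

def greedy_tagsnps(genotypes: np.ndarray, threshold: float = 0.8) -> list[int]:
    r2 = ld_r2(genotypes)
    n = genotypes.shape[1]
    uncovered, tags = set(range(n)), []
    while uncovered:
        # pick the SNP that tags the most still-uncovered SNPs
        best = max(uncovered, key=lambda i: sum(r2[i, j] >= threshold for j in uncovered))
        tags.append(best)
        uncovered -= {j for j in uncovered if r2[best, j] >= threshold}
    return tags

# Example: 100 individuals x 50 SNPs of random genotypes coded 0/1/2
rng = np.random.default_rng(0)
print(greedy_tagsnps(rng.integers(0, 3, size=(100, 50))))
```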

Multi-faceted Image Dataset Construction Method Based on Rotational Images (회전 영상 기반 다면 영상 데이터셋 구축 방법)

  • Kim, Ji-Seong; Heo, Gyeongyong; Jang, Si-Woong
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2021.10a / pp.75-77 / 2021
  • Finding objects in an image through deep learning requires an image dataset for training, and raising the recognition rate requires a large amount of training data. It is difficult for individuals to build large datasets because doing so is expensive. This paper introduces a method for more easily constructing an image dataset that includes several sides of an object by photographing the object as it rotates. We propose constructing the dataset by placing an object on a rotating plate, photographing it, and dividing and synthesizing the captured images as needed.
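
A minimal sketch of the rotating-plate capture step, assuming the rotation is recorded as a video and OpenCV is available; the file names and the number of views are illustrative, not the paper's settings.

```python
# Sample evenly spaced frames from a turntable video so each saved
# frame shows a different side of the object.
import cv2
import os

def extract_views(video_path: str, out_dir: str, views: int = 36) -> None:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // views, 1)           # ~one frame per 10 degrees for 36 views
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"view_{saved:03d}.png"), frame)
            saved += 1
        idx += 1
    cap.release()

extract_views("turntable.mp4", "dataset/object01")  # illustrative paths
```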

Korean Lip-Reading: Data Construction and Sentence-Level Lip-Reading (한국어 립리딩: 데이터 구축 및 문장수준 립리딩)

  • Sunyoung Cho; Soosung Yoon
    • Journal of the Korea Institute of Military Science and Technology / v.27 no.2 / pp.167-176 / 2024
  • Lip-reading is the task of inferring a speaker's utterance from silent video based on learned lip movements. It is very challenging due to the inherent ambiguities of lip movement, such as different characters producing the same lip appearance. Recent advances in deep learning models such as the Transformer and the Temporal Convolutional Network have improved lip-reading performance. However, most previous work deals with English lip-reading, which cannot be applied directly to Korean, and there has been no large-scale Korean lip-reading dataset. In this paper, we introduce the first large-scale Korean lip-reading dataset, with more than 120,000 utterances collected from TV broadcasts including news, documentaries, and dramas. We also present a preprocessing method that uniformly extracts a facial region of interest, and we propose a grapheme-level Transformer-based model for sentence-level Korean lip-reading. We demonstrate that our dataset and model are appropriate for Korean lip-reading through dataset statistics and experimental results.
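
The model operates on grapheme (jamo) units. Below is a minimal sketch of decomposing precomposed Hangul syllables into jamo with standard Unicode arithmetic, which is one plausible way to produce such units; the paper's exact tokenization may differ.

```python
# Jamo tables follow the standard Unicode Hangul composition rules
# (19 initial consonants, 21 vowels, 27 final consonants plus "none").
CHO = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
JUNG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

def to_graphemes(text: str) -> list[str]:
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:                # precomposed Hangul syllable block
            out.append(CHO[code // 588])
            out.append(JUNG[(code % 588) // 28])
            if code % 28:
                out.append(JONG[code % 28])
        else:
            out.append(ch)                    # keep spaces/punctuation as-is
    return out

print(to_graphemes("립리딩"))  # ['ㄹ', 'ㅣ', 'ㅂ', 'ㄹ', 'ㅣ', 'ㄷ', 'ㅣ', 'ㅇ']
```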

Mid-level Feature Extraction Method Based Transfer Learning to Small-Scale Dataset of Medical Images with Visualizing Analysis

  • Lee, Dong-Ho; Li, Yan; Shin, Byeong-Seok
    • Journal of Information Processing Systems / v.16 no.6 / pp.1293-1308 / 2020
  • In fine-tuning-based transfer learning, the size of the dataset can affect learning accuracy. When the dataset is small, fine-tuning-based transfer-learning methods still incur computing costs similar to those for a large-scale dataset. We propose a mid-level feature extractor that retrains only the mid-level convolutional layers, increasing efficiency and reducing computing costs. This mid-level feature extractor is likely to provide an effective alternative for training on a small-scale medical image dataset. Its performance is compared with that of low- and high-level feature extractors, as well as with the fine-tuning method. First, the mid-level feature extractor converges in less time than the other methods. Second, it shows good accuracy in validation-loss evaluation. Third, it obtains an area under the ROC curve (AUC) of 0.87 on an untrained test dataset that differs greatly from the training dataset. Fourth, it extracts clearer feature maps of the shape and parts of the chest in X-ray images than the fine-tuning method does.
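
A hedged PyTorch sketch of the "retrain only the mid-level convolutional layers" idea: freeze a pretrained backbone, unfreeze an assumed mid-level slice of its convolutional stack, and train only those parameters plus a new head. The backbone choice and exact layer range are assumptions, not the paper's specification.

```python
import torch
import torchvision

model = torchvision.models.vgg16(weights="IMAGENET1K_V1")
for p in model.parameters():
    p.requires_grad = False                  # freeze everything first

for layer in list(model.features)[10:17]:    # assumed mid-level conv slice
    for p in layer.parameters():
        p.requires_grad = True

# replace the classifier head for a binary medical-image task (e.g. chest X-ray)
model.classifier[6] = torch.nn.Linear(4096, 2)

# only the unfrozen mid-level layers and the new head are optimized
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```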

Towards Texture-Based Visualization of Multivariate Dataset

  • Mehmood, Raja Majid; Lee, Hyo Jong
    • Proceedings of the Korea Information Processing Society Conference / 2014.04a / pp.582-585 / 2014
  • Visualization is a science that makes the invisible visible through the techniques of experimental visualization and computer-aided visualization. This paper presents practical aspects of visualizing a multivariate dataset. We briefly discuss previous research and introduce a new visualization technique that helps us design and develop a tool for experimental visualization of multivariate datasets. The tool can be used in various domains; here we choose the software industry as the application domain and use a multivariate dataset of software components computed by VizzMaintenance, a software analysis tool that produces multiple software metrics for open-source Java programs. The main objective of this research is to develop a visualization tool for large multivariate datasets that is more efficient and easier for viewers to perceive. Because perception is central to this work, we test the perception of the proposed visualization approach with researchers from our lab.
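
As a loose illustration of texture-based encoding (not the authors' tool), the matplotlib sketch below maps two hypothetical software metrics of each component to hatch density and gray level; metric names and data are invented placeholders.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
data = rng.random((5, 2))                    # 5 components x 2 metrics in [0, 1]

fig, ax = plt.subplots()
for i, (complexity, coupling) in enumerate(data):
    hatch = "/" * (1 + int(complexity * 4))  # denser hatching = higher metric
    ax.bar(i, 1, color=f"{1 - coupling * 0.8:.2f}",  # darker = more coupling
           hatch=hatch, edgecolor="black")
ax.set_xlabel("software component")
ax.set_yticks([])
plt.show()
```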

Integration of a Large-Scale Genetic Analysis Workbench Increases the Accessibility of a High-Performance Pathway-Based Analysis Method

  • Lee, Sungyoung; Park, Taesung
    • Genomics & Informatics / v.16 no.4 / pp.39.1-39.3 / 2018
  • The rapid increase in the volume of genetic datasets has demanded extensive adoption of biological knowledge to reduce computational complexity, and the biological pathway is one well-known source of such knowledge. In this regard, we previously introduced a novel statistical method, PHARAOH, that enables pathway-based association studies of large-scale genetic datasets. However, researcher-level application of the PHARAOH method has been limited by a lack of support for generally used file formats and by the absence of various quality-control options that are essential to practical analysis. To overcome these limitations, we introduce the integration of the PHARAOH method into our recently developed all-in-one workbench. The new PHARAOH program not only supports various de facto standard genetic data formats but also provides many quality-control measures and filters based on those measures. We expect that the updated PHARAOH gives researchers better access to pathway-level analysis of large-scale genetic datasets.
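
The abstract mentions quality-control measures without listing them. As an illustration only, the sketch below applies two standard genetic QC filters (per-SNP missing rate and minor allele frequency) to a 0/1/2 genotype matrix with NaN marking missing calls; it is not the PHARAOH implementation, and the thresholds are conventional defaults.

```python
import numpy as np

def qc_filter(g: np.ndarray, max_missing: float = 0.05, min_maf: float = 0.01):
    miss = np.isnan(g).mean(axis=0)          # per-SNP missing-call rate
    maf = np.nanmean(g, axis=0) / 2          # allele frequency from dosages
    maf = np.minimum(maf, 1 - maf)           # fold to the minor allele
    keep = (miss <= max_missing) & (maf >= min_maf)
    return g[:, keep], keep

rng = np.random.default_rng(2)
g = rng.integers(0, 3, size=(200, 100)).astype(float)
g[rng.random(g.shape) < 0.02] = np.nan       # sprinkle missing calls
filtered, kept = qc_filter(g)
print(f"kept {kept.sum()} of {kept.size} SNPs")
```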

An Improved Deep Learning Method for Animal Images (동물 이미지를 위한 향상된 딥러닝 학습)

  • Wang, Guangxing; Shin, Seong-Yoon; Shin, Kwang-Weong; Lee, Hyun-Chang
    • Proceedings of the Korean Society of Computer Information Conference / 2019.01a / pp.123-124 / 2019
  • This paper proposes an improved deep learning method for animal image classification on small datasets. First, we use a CNN to build a training model for the small dataset and apply data augmentation to expand the training samples. Second, using a network pre-trained on a large-scale dataset, such as VGG16, we extract the bottleneck features of the small dataset and store them in two NumPy files as new training and test datasets. Finally, we train a fully connected network on the new datasets. We use the well-known Kaggle Dogs vs. Cats dataset, a binary classification dataset, as the experimental dataset.
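
A minimal Keras sketch of the bottleneck-feature step the abstract describes: run VGG16 without its top layers over a small image set and save the resulting features to NumPy files for a separate fully connected classifier. Directory names and the image size are illustrative assumptions.

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.preprocessing.image import ImageDataGenerator

base = VGG16(weights="imagenet", include_top=False, input_shape=(150, 150, 3))
gen = ImageDataGenerator(rescale=1.0 / 255)

for split in ("train", "test"):
    flow = gen.flow_from_directory(
        f"dogs_vs_cats/{split}",             # illustrative directory layout
        target_size=(150, 150),
        batch_size=32, class_mode=None, shuffle=False,
    )
    feats = base.predict(flow)               # bottleneck features
    np.save(f"bottleneck_{split}.npy", feats)
```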

STAR-24K: A Public Dataset for Space Common Target Detection

  • Zhang, Chaoyan; Guo, Baolong; Liao, Nannan; Zhong, Qiuyun; Liu, Hengyan; Li, Cheng; Gong, Jianglei
    • KSII Transactions on Internet and Information Systems (TIIS) / v.16 no.2 / pp.365-380 / 2022
  • Target detection algorithms based on supervised learning are the current mainstream in target detection, and a high-quality dataset is the prerequisite for good detection performance. The larger and better the dataset, the stronger the generalization ability of the model; that is, the dataset determines the upper limit of what the model can learn. A convolutional neural network optimizes its parameters under strong supervision: the error is calculated by comparing each predicted bounding box with the manually labeled ground-truth box and is then propagated back through the network for continuous optimization. Because strongly supervised learning relies on large numbers of images, the quantity and quality of images directly affect the learning results. This paper proposes STAR-24K (a dataset for Space TArget Recognition with more than 24,000 images), a dataset for detecting common targets in space. Since no publicly available dataset for space target detection currently exists, we extracted images from sources such as the pictures and videos released on the official websites of NASA (National Aeronautics and Space Administration) and ESA (the European Space Agency) and expanded the collection to 24,451 images. We evaluate popular object detection algorithms to build a benchmark. Our STAR-24K dataset is publicly available at https://github.com/Zzz-zcy/STAR-24K.
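
The comparison between predicted and manually labeled boxes that the abstract describes is conventionally measured with intersection-over-union (IoU); a minimal sketch, not tied to any particular detector:

```python
def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```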

Building a Korean Text Summarization Dataset Using News Articles of Social Media (신문기사와 소셜 미디어를 활용한 한국어 문서요약 데이터 구축)

  • Lee, Gyoung Ho; Park, Yo-Han; Lee, Kong Joo
    • KIPS Transactions on Software and Data Engineering / v.9 no.8 / pp.251-258 / 2020
  • A training dataset for text summarization consists of pairs of a document and its summary. Because conventional approaches to building text summarization datasets are labor-intensive, it is not easy to construct large datasets for the task. A collection of news articles is one of the most popular resources for text summarization because it is easily accessible, large-scale, and high-quality text. From social media news services, we can collect not only the headlines and subheads of news articles but also the summary descriptions that human editors write about them. We collected approximately 425,000 pairs of news articles and summaries from social media, implemented an automatic extractive summarizer, and trained it on the dataset. Compared with unsupervised models, the summarizer achieved better results in terms of ROUGE score.
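
The summarizer is evaluated by ROUGE score. As a reference point, here is a minimal sketch of ROUGE-1 F1 (unigram overlap) between a candidate and a reference summary, using whitespace tokenization for illustration; Korean evaluation would normally use morpheme- or subword-level tokens.

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())          # clipped unigram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("the cat sat on the mat", "the cat lay on the mat"))  # ≈ 0.833
```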

A GMDH-based estimation model for axial load capacity of GFRP-RC circular columns

  • Mohammed Berradia; El Hadj Meziane; Ali Raza; Mohamed Hechmi El Ouni; Faisal Shabbir
    • Steel and Composite Structures / v.49 no.2 / pp.161-180 / 2023
  • In previous research, axial compressive capacity models for glass fiber-reinforced polymer (GFRP)-reinforced circular concrete compression elements confined with a GFRP helix were put forward based on small, noisy datasets and a limited number of parameters, yielding low accuracy. It is therefore important to recommend an accurate model based on a refined, large testing dataset that considers the various parameters of such components. The core objective and novelty of the current research is to suggest a deep learning model for the axial compressive capacity of GFRP-reinforced circular concrete columns confined with a GFRP helix, utilizing the various parameters of a large experimental dataset to maximize the precision of the estimates. To this end, a test dataset of 61 GFRP-reinforced circular concrete columns confined with a GFRP helix was assembled from prior studies. Fifteen diverse theoretical models were assessed over this dataset using different statistical coefficients, and a novel model based on the group method of data handling (GMDH) was put forward. The recommended model performed well over the dataset by accounting for the axial contribution of the GFRP main bars and the confining effect of the transverse GFRP helix, and it achieved the highest precision, with MAE = 195.67, RMSE = 255.41, and R² = 0.94, compared with the previously recommended equations. The GMDH model also showed a good normal distribution of estimates, with only a 2.5% discrepancy from unity. The recommended model can accurately calculate the axial compressive capacity of FRP-reinforced concrete compression elements and can be considered for further analysis and design of such components in structural engineering.
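
As a sketch of the GMDH idea (not the paper's model), the code below builds a single GMDH layer: each candidate neuron fits a quadratic polynomial of one feature pair by least squares, and the pair with the lowest validation RMSE is kept. The data are random placeholders sized like the paper's 61-column dataset.

```python
import itertools
import numpy as np

def quad_design(a, b):
    """Ivakhnenko polynomial terms: 1, a, b, ab, a^2, b^2."""
    return np.column_stack([np.ones_like(a), a, b, a * b, a**2, b**2])

def gmdh_layer(X_tr, y_tr, X_va, y_va):
    best = None
    for i, j in itertools.combinations(range(X_tr.shape[1]), 2):
        w, *_ = np.linalg.lstsq(quad_design(X_tr[:, i], X_tr[:, j]), y_tr, rcond=None)
        pred = quad_design(X_va[:, i], X_va[:, j]) @ w
        err = np.sqrt(np.mean((pred - y_va) ** 2))   # validation RMSE
        if best is None or err < best[0]:
            best = (err, (i, j), w)
    return best   # (validation RMSE, selected feature pair, coefficients)

rng = np.random.default_rng(3)
X, y = rng.random((61, 5)), rng.random(61)           # placeholder data
rmse, pair, _ = gmdh_layer(X[:40], y[:40], X[40:], y[40:])
print(pair, round(rmse, 3))
```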