• Title/Summary/Keyword: Large Dataset

Search Result 560, Processing Time 0.024 seconds

Automatic Extraction of Training Dataset Using Expectation Maximization Algorithm - for Automatic Supervised Classification of Road Networks (기대최대화 알고리즘을 활용한 도로노면 training 자료 자동추출에 관한 연구 - 감독분류를 통한 도로 네트워크의 자동추출을 위하여)

  • Han, You-Kyung;Choi, Jae-Wan;Lee, Jae-Bin;Yu, Ki-Yun;Kim, Yong-Il
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography
    • /
    • v.27 no.2
    • /
    • pp.289-297
    • /
    • 2009
  • In the paper, we propose the methodology to extract training dataset automatically for supervised classification of road networks. For the preprocessing, we co-register the airborne photos, LIDAR data and large-scale digital maps and then, create orthophotos and intensity images. By overlaying the large-scale digital maps onto generated images, we can extract the initial training dataset for the supervised classification of road networks. However, the initial training information is distorted because there are errors propagated from registration process and, also, there are generally various objects in the road networks such as asphalt, road marks, vegetation, cars and so on. As such, to generate the training information only for the road surface, we apply the Expectation Maximization technique and finally, extract the training dataset of the road surface. For the accuracy test, we compare the training dataset with manually extracted ones. Through the statistical tests, we can identify that the developed method is valid.

An active learning method with difficulty learning mechanism for crack detection

  • Shu, Jiangpeng;Li, Jun;Zhang, Jiawei;Zhao, Weijian;Duan, Yuanfeng;Zhang, Zhicheng
    • Smart Structures and Systems
    • /
    • v.29 no.1
    • /
    • pp.195-206
    • /
    • 2022
  • Crack detection is essential for inspection of existing structures and crack segmentation based on deep learning is a significant solution. However, datasets are usually one of the key issues. When building a new dataset for deep learning, laborious and time-consuming annotation of a large number of crack images is an obstacle. The aim of this study is to develop an approach that can automatically select a small portion of the most informative crack images from a large pool in order to annotate them, not to label all crack images. An active learning method with difficulty learning mechanism for crack segmentation tasks is proposed. Experiments are carried out on a crack image dataset of a steel box girder, which contains 500 images of 320×320 size for training, 100 for validation, and 190 for testing. In active learning experiments, the 500 images for training are acted as unlabeled image. The acquisition function in our method is compared with traditional acquisition functions, i.e., Query-By-Committee (QBC), Entropy, and Core-set. Further, comparisons are made on four common segmentation networks: U-Net, DeepLabV3, Feature Pyramid Network (FPN), and PSPNet. The results show that when training occurs with 200 (40%) of the most informative crack images that are selected by our method, the four segmentation networks can achieve 92%-95% of the obtained performance when training takes place with 500 (100%) crack images. The acquisition function in our method shows more accurate measurements of informativeness for unlabeled crack images compared to the four traditional acquisition functions at most active learning stages. Our method can select the most informative images for annotation from many unlabeled crack images automatically and accurately. Additionally, the dataset built after selecting 40% of all crack images can support crack segmentation networks that perform more than 92% when all the images are used.

A streamlined pipeline based on HmmUFOtu for microbial community profiling using 16S rRNA amplicon sequencing

  • Hyeonwoo Kim;Jiwon Kim;Ji Won Cho;Kwang-Sung Ahn;Dong-Il Park;Sangsoo Kim
    • Genomics & Informatics
    • /
    • v.21 no.3
    • /
    • pp.40.1-40.11
    • /
    • 2023
  • Microbial community profiling using 16S rRNA amplicon sequencing allows for taxonomic characterization of diverse microorganisms. While amplicon sequence variant (ASV) methods are increasingly favored for their fine-grained resolution of sequence variants, they often discard substantial portions of sequencing reads during quality control, particularly in datasets with large number samples. We present a streamlined pipeline that integrates FastP for read trimming, HmmUFOtu for operational taxonomic units (OTU) clustering, Vsearch for chimera checking, and Kraken2 for taxonomic assignment. To assess the pipeline's performance, we reprocessed two published stool datasets of normal Korean populations: one with 890 and the other with 1,462 independent samples. In the first dataset, HmmUFOtu retained 93.2% of over 104 million read pairs after quality trimming, discarding chimeric or unclassifiable reads, while DADA2, a commonly used ASV method, retained only 44.6% of the reads. Nonetheless, both methods yielded qualitatively similar β-diversity plots. For the second dataset, HmmUFOtu retained 89.2% of read pairs, while DADA2 retained a mere 18.4% of the reads. HmmUFOtu, being a closed-reference clustering method, facilitates merging separately processed datasets, with shared OTUs between the two datasets exhibiting a correlation coefficient of 0.92 in total abundance (log scale). While the first two dimensions of the β-diversity plot exhibited a cohesive mixture of the two datasets, the third dimension revealed the presence of a batch effect. Our comparative evaluation of ASV and OTU methods within this streamlined pipeline provides valuable insights into their performance when processing large-scale microbial 16S rRNA amplicon sequencing data. The strengths of HmmUFOtu and its potential for dataset merging are highlighted.

Text Summarization on Large-scale Vietnamese Datasets

  • Ti-Hon, Nguyen;Thanh-Nghi, Do
    • Journal of information and communication convergence engineering
    • /
    • v.20 no.4
    • /
    • pp.309-316
    • /
    • 2022
  • This investigation is aimed at automatic text summarization on large-scale Vietnamese datasets. Vietnamese articles were collected from newspaper websites and plain text was extracted to build the dataset, that included 1,101,101 documents. Next, a new single-document extractive text summarization model was proposed to evaluate this dataset. In this summary model, the k-means algorithm is used to cluster the sentences of the input document using different text representations, such as BoW (bag-of-words), TF-IDF (term frequency - inverse document frequency), Word2Vec (Word-to-vector), Glove, and FastText. The summary algorithm then uses the trained k-means model to rank the candidate sentences and create a summary with the highest-ranked sentences. The empirical results of the F1-score achieved 51.91% ROUGE-1, 18.77% ROUGE-2 and 29.72% ROUGE-L, compared to 52.33% ROUGE-1, 16.17% ROUGE-2, and 33.09% ROUGE-L performed using a competitive abstractive model. The advantage of the proposed model is that it can perform well with O(n,k,p) = O(n(k+2/p)) + O(nlog2n) + O(np) + O(nk2) + O(k) time complexity.

Empowering Emotion Classification Performance Through Reasoning Dataset From Large-scale Language Model (초거대 언어 모델로부터의 추론 데이터셋을 활용한 감정 분류 성능 향상)

  • NunSol Park;MinHo Lee
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2023.07a
    • /
    • pp.59-61
    • /
    • 2023
  • 본 논문에서는 감정 분류 성능 향상을 위한 초거대 언어모델로부터의 추론 데이터셋 활용 방안을 제안한다. 이 방안은 Google Research의 'Chain of Thought'에서 영감을 받아 이를 적용하였으며, 추론 데이터는 ChatGPT와 같은 초거대 언어 모델로 생성하였다. 본 논문의 목표는 머신러닝 모델이 추론 데이터를 이해하고 적용하는 능력을 활용하여, 감정 분류 작업의 성능을 향상시키는 것이다. 초거대 언어 모델(ChatGPT)로부터 추출한 추론 데이터셋을 활용하여 감정 분류 모델을 훈련하였으며, 이 모델은 감정 분류 작업에서 향상된 성능을 보였다. 이를 통해 추론 데이터셋이 감정 분류에 있어서 큰 가치를 가질 수 있음을 증명하였다. 또한, 이 연구는 기존에 감정 분류 작업에 사용되던 데이터셋만을 활용한 모델과 비교하였을 때, 추론 데이터를 활용한 모델이 더 높은 성능을 보였음을 증명한다. 이 연구를 통해, 적은 비용으로 초거대 언어모델로부터 생성된 추론 데이터셋의 활용 가능성을 보여주고, 감정 분류 작업 성능을 향상시키는 새로운 방법을 제시한다. 제시한 방안은 감정 분류뿐만 아니라 다른 자연어처리 분야에서도 활용될 수 있으며, 더욱 정교한 자연어 이해와 처리가 가능함을 시사한다.

  • PDF

Crowd Activity Classification Using Category Constrained Correlated Topic Model

  • Huang, Xianping;Wang, Wanliang;Shen, Guojiang;Feng, Xiaoqing;Kong, Xiangjie
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.10 no.11
    • /
    • pp.5530-5546
    • /
    • 2016
  • Automatic analysis and understanding of human activities is a challenging task in computer vision, especially for the surveillance scenarios which typically contains crowds, complex motions and occlusions. To address these issues, a Bag-of-words representation of videos is developed by leveraging information including crowd positions, motion directions and velocities. We infer the crowd activity in a motion field using Category Constrained Correlated Topic Model (CC-CTM) with latent topics. We represent each video by a mixture of learned motion patterns, and predict the associated activity by training a SVM classifier. The experiment dataset we constructed are from Crowd_PETS09 bench dataset and UCF_Crowds dataset, including 2000 documents. Experimental results demonstrate that accuracy reaches 90%, and the proposed approach outperforms the state-of-the-arts by a large margin.

ValueRank: Keyword Search of Object Summaries Considering Values

  • Zhi, Cai;Xu, Lan;Xing, Su;Kun, Lang;Yang, Cao
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.13 no.12
    • /
    • pp.5888-5903
    • /
    • 2019
  • The Relational ranking method applies authority-based ranking in relational dataset that can be modeled as graphs considering also their tuples' values. Authority directions from tuples that contain the given keywords and transfer to their corresponding neighboring nodes in accordance with their values and semantic connections. From our previous work, ObjectRank extends to ValueRank that also takes into account the value of tuples in authority transfer flows. In a maked difference from ObjectRank, which only considers authority flows through relationships, it is only valid in the bibliographic databases e.g. DBLP dataset, ValueRank facilitates the estimation of importance for any databases, e.g. trading databases, etc. A relational keyword search paradigm Object Summary (denote as OS) is proposed recently, given a set of keywords, a group of Object Summaries as its query result. An OS is a multilevel-tree data structure, in which node (namely the tuple with keywords) is OS's root node, and the surrounding nodes are the summary of all data on the graph. But, some of these trees have a very large in total number of tuples, size-l OSs are the OS snippets, have also been investigated using ValueRank.We evaluated the real bibliographical dataset and Microsoft business databases to verify of our proposed approach.

Developing an Intrusion Detection Framework for High-Speed Big Data Networks: A Comprehensive Approach

  • Siddique, Kamran;Akhtar, Zahid;Khan, Muhammad Ashfaq;Jung, Yong-Hwan;Kim, Yangwoo
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.12 no.8
    • /
    • pp.4021-4037
    • /
    • 2018
  • In network intrusion detection research, two characteristics are generally considered vital to building efficient intrusion detection systems (IDSs): an optimal feature selection technique and robust classification schemes. However, the emergence of sophisticated network attacks and the advent of big data concepts in intrusion detection domains require two more significant aspects to be addressed: employing an appropriate big data computing framework and utilizing a contemporary dataset to deal with ongoing advancements. As such, we present a comprehensive approach to building an efficient IDS with the aim of strengthening academic anomaly detection research in real-world operational environments. The proposed system has the following four characteristics: (i) it performs optimal feature selection using information gain and branch-and-bound algorithms; (ii) it employs machine learning techniques for classification, namely, Logistic Regression, Naïve Bayes, and Random Forest; (iii) it introduces bulk synchronous parallel processing to handle the computational requirements of large-scale networks; and (iv) it utilizes a real-time contemporary dataset generated by the Information Security Centre of Excellence at the University of Brunswick (ISCX-UNB) to validate its efficacy. Experimental analysis shows the effectiveness of the proposed framework, which is able to achieve high accuracy, low computational cost, and reduced false alarms.

CNN-based Weighted Ensemble Technique for ImageNet Classification (대용량 이미지넷 인식을 위한 CNN 기반 Weighted 앙상블 기법)

  • Jung, Heechul;Choi, Min-Kook;Kim, Junkwang;Kwon, Soon;Jung, Wooyoung
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.15 no.4
    • /
    • pp.197-204
    • /
    • 2020
  • The ImageNet dataset is a large scale dataset and contains various natural scene images. In this paper, we propose a convolutional neural network (CNN)-based weighted ensemble technique for the ImageNet classification task. First, in order to fuse several models, our technique uses weights for each model, unlike the existing average-based ensemble technique. Then we propose an algorithm that automatically finds the coefficients used in later ensemble process. Our algorithm sequentially selects the model with the best performance of the validation set, and then obtains a weight that improves performance when combined with existing selected models. We applied the proposed algorithm to a total of 13 heterogeneous models, and as a result, 5 models were selected. These selected models were combined with weights, and we achieved 3.297% Top-5 error rate on the ImageNet test dataset.

Machine learning of LWR spent nuclear fuel assembly decay heat measurements

  • Ebiwonjumi, Bamidele;Cherezov, Alexey;Dzianisau, Siarhei;Lee, Deokjung
    • Nuclear Engineering and Technology
    • /
    • v.53 no.11
    • /
    • pp.3563-3579
    • /
    • 2021
  • Measured decay heat data of light water reactor (LWR) spent nuclear fuel (SNF) assemblies are adopted to train machine learning (ML) models. The measured data is available for fuel assemblies irradiated in commercial reactors operated in the United States and Sweden. The data comes from calorimetric measurements of discharged pressurized water reactor (PWR) and boiling water reactor (BWR) fuel assemblies. 91 and 171 measurements of PWR and BWR assembly decay heat data are used, respectively. Due to the small size of the measurement dataset, we propose: (i) to use the method of multiple runs (ii) to generate and use synthetic data, as large dataset which has similar statistical characteristics as the original dataset. Three ML models are developed based on Gaussian process (GP), support vector machines (SVM) and neural networks (NN), with four inputs including the fuel assembly averaged enrichment, assembly averaged burnup, initial heavy metal mass, and cooling time after discharge. The outcomes of this work are (i) development of ML models which predict LWR fuel assembly decay heat from the four inputs (ii) generation and application of synthetic data which improves the performance of the ML models (iii) uncertainty analysis of the ML models and their predictions.