• Title/Summary/Keyword: data for training


Domain Adaptation for Opinion Classification: A Self-Training Approach

  • Yu, Ning
    • Journal of Information Science Theory and Practice / v.1 no.1 / pp.10-26 / 2013
  • Domain transfer is a widely recognized problem for machine learning algorithms because models built on one data domain generally do not perform well in another. This is especially challenging for tasks such as opinion classification, which often must cope with insufficient quantities of labeled data. This study investigates the feasibility of self-training for the domain transfer problem in opinion classification by leveraging labeled data from non-target domain(s) and unlabeled data from the target domain. Specifically, self-training is evaluated for effectiveness in sparse-data situations and for feasibility in domain adaptation for opinion classification. Three types of Web content are tested: edited news articles, semi-structured movie reviews, and the informal, unstructured content of the blogosphere. The findings suggest that, when labeled data are limited, self-training is a promising approach for opinion classification, although its contribution varies across data domains. Significant improvement was demonstrated for the most challenging data domain, the blogosphere, when a domain transfer-based self-training strategy was implemented.
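
The self-training loop described in this abstract can be sketched roughly as follows. This is a minimal illustration only, not the author's implementation: the nearest-centroid classifier on 1-D features, the margin-based confidence rule, and the names `fit_centroids`, `predict`, `self_train`, and `min_margin` are all assumptions made for the example.

```python
from statistics import mean


def fit_centroids(xs, ys):
    """Return {label: centroid} for 1-D features (hypothetical stand-in classifier)."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}


def predict(centroids, x):
    """Return (label, margin): the nearest centroid and its lead over the runner-up."""
    ranked = sorted(centroids, key=lambda c: abs(x - centroids[c]))
    best, second = ranked[0], ranked[1]
    margin = abs(x - centroids[second]) - abs(x - centroids[best])
    return best, margin


def self_train(src_x, src_y, tgt_unlabeled, min_margin=0.5, max_rounds=10):
    """Start from labeled source-domain data, then repeatedly absorb
    confidently pseudo-labeled target-domain examples and refit."""
    xs, ys = list(src_x), list(src_y)
    pool = list(tgt_unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(xs, ys)
        remaining, added = [], 0
        for x in pool:
            label, margin = predict(centroids, x)
            if margin >= min_margin:      # confident: add with its pseudo-label
                xs.append(x)
                ys.append(label)
                added += 1
            else:                         # not confident yet: keep for a later round
                remaining.append(x)
        pool = remaining
        if added == 0 or not pool:
            break
    return fit_centroids(xs, ys)


# Toy usage: two labeled source examples per class, five unlabeled target examples.
model = self_train([0.0, 1.0, 4.0, 5.0], ["neg", "neg", "pos", "pos"],
                   [2.0, 3.0, 6.0, 7.0, 8.0])
```

In this toy run the class centroids shift toward the target-domain values once the pseudo-labeled examples are absorbed, which is the effect self-training relies on.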

Document Image Binarization by GAN with Unpaired Data Training

  • Dang, Quang-Vinh;Lee, Guee-Sang
    • International Journal of Contents / v.16 no.2 / pp.8-18 / 2020
  • Data is critical in deep learning, but data scarcity is common in research, especially when preparing paired training data. In this paper, document image binarization with unpaired data is studied by introducing adversarial learning, which removes the need for supervised or labeled datasets. However, simply extending previous unpaired training to binarization inevitably leads to poor performance compared to paired data training. Thus, a new deep learning approach is proposed by introducing a multi-diversity of higher quality generated images. A two-stage model is proposed that comprises a generative adversarial network (GAN) followed by a U-net network. In the first stage, the GAN uses the unpaired image data to create paired image data. In the second stage, the generated paired image data are passed through the U-net for binarization, so the trained U-net serves as the binarization model at test time. The proposed model has been evaluated on the publicly available DIBCO dataset, where it outperforms other techniques on unpaired training data. The paper shows, for the first time in the literature, the potential of using unpaired data for binarization, which can be further improved to replace paired data training in the future.

The Effect of the Number of Training Data on Speech Recognition

  • Lee, Chang-Young
    • The Journal of the Acoustical Society of Korea / v.28 no.2E / pp.66-71 / 2009
  • In practical applications of speech recognition, a fundamental question is how much training data should be provided for a specific task. Although plentiful training data would undoubtedly enhance system performance, it comes at a heavy cost. It is therefore crucial to determine the smallest amount of training data that affords a given level of accuracy. To this end, we investigate the effect of the number of training data on speaker-independent recognition of isolated words using FVQ/HMM. The results show that the error rate is roughly inversely proportional to the number of training data and grows linearly with the vocabulary size.
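
Read together, the two trends reported in this abstract are consistent with a simple scaling form such as the one below. This is only one possible reading: the symbols $E$ (error rate), $N$ (number of training data), $V$ (vocabulary size), and the constant $c$ are notation assumed here, and the abstract does not say whether the linear dependence on $V$ has an offset.

```latex
E(N, V) \approx c \, \frac{V}{N}, \qquad c > 0
```

Under this assumed form, doubling the training data at a fixed vocabulary size would roughly halve the error rate, while doubling the vocabulary at a fixed amount of training data would roughly double it.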

Multi-temporal Remote-Sensing Image Classification Using Artificial Neural Networks (인공신경망 이론을 이용한 위성영상의 카테고리분류)

  • Kang, Moon-Seong;Park, Seung-Woo;Lim, Jae-Chon
    • Proceedings of the Korean Society of Agricultural Engineers Conference / 2001.10a / pp.59-64 / 2001
  • The objective of this paper is to propose a pattern classification method for remote sensing data using an artificial neural network. First, we apply the error back-propagation algorithm to classify the remote sensing data. In this case, the classification performance depends on the training data set. Using the training data set and the error back-propagation algorithm, a layered neural network is trained so that the training patterns are classified with a specified accuracy. After the neural network is trained, pixels that are incorrectly classified are deleted from the original training data set, and a new training data set is built. Once training is complete, a testing data set is classified using the trained neural network. The classification results for Landsat TM data show that this approach produces excellent results that are more realistic and less noisy than those of a conventional Bayesian method.
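
The training-set cleaning step described in this abstract (train, drop the training pixels the model misclassifies, rebuild the training set) can be sketched as follows. The paper uses a back-propagation neural network; the nearest-centroid classifier here is only a small stand-in so the example stays self-contained, and the names `fit`, `classify`, and `clean_training_set` are hypothetical.

```python
from statistics import mean


def fit(xs, ys):
    """Stand-in classifier: one centroid per class over 1-D pixel values."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}


def classify(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))


def clean_training_set(xs, ys):
    """Train on the full set, then keep only the samples the trained
    model classifies correctly, yielding a new training set."""
    model = fit(xs, ys)
    kept = [(x, y) for x, y in zip(xs, ys) if classify(model, x) == y]
    return [x for x, _ in kept], [y for _, y in kept]


# Toy usage: two training pixels (9.0 and 2.0) carry labels that contradict their values.
pixels = [0.0, 1.0, 9.0, 10.0, 11.0, 2.0]
labels = ["water", "water", "water", "crop", "crop", "crop"]
clean_x, clean_y = clean_training_set(pixels, labels)
```

A model refit on `clean_x` and `clean_y` would then be used to classify the testing data set.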


The Effectiveness of the Training Program at HCL

  • Kumari, Neeraj
    • Asian Journal of Business Environment / v.5 no.3 / pp.23-28 / 2015
  • Purpose - The aim of this study is to evaluate the effectiveness of a corporate training program. The case study of HCL Technologies was used to investigate how training programs improve the performance of employees on the job, as well as to identify unnecessary aspects of the training so that they can be eliminated from future training programs. Research design, data, and methodology - An exploratory research design was used to conduct the study. The research sample included 50 HCL employees. The sampling technique for the data collection was convenience sampling. Results - Training is a crucial process in an organization and thus needs to be well designed. Specifically, the training programs should provide adequate knowledge to all employees, ensure correct methods are used for the selection of trainees, and avoid any perception of bias. Conclusions - Employees were not fully satisfied by the separation of the training program into two parts, on-the-job and off-the-job training, but if sufficient data is provided to employees in advance, this could help them during the training process.

Performance Comparison between Neural Network and Genetic Programming Using Gas Furnace Data

  • Bae, Hyeon;Jeon, Tae-Ryong;Kim, Sung-Shin
    • Journal of information and communication convergence engineering / v.6 no.4 / pp.448-453 / 2008
  • This study describes design and development techniques for estimation models used in process modeling. A case study is undertaken to design a model using standard gas furnace data. Neural networks (NN) and genetic programming (GP) are each employed to model the crucial relationships between input factors and output responses. In the case study, the NN and GP models were each generated using 70% of the data for training and evaluated using the remaining 30% for testing. Model performance was compared using RMSE values calculated from the model outputs. The average RMSE was 0.8925 (training) and 0.9951 (testing) for the NN model, and 0.707227 (training) and 0.673150 (testing) for the GP model. Regarding the results, the NN model has a strong advantage in model training (using all the data for training), and the GP model appears to have an advantage in model testing (using separate data for training and testing). The performance reproducibility of the GP model is good, so this approach appears suitable for modeling physical fabrication processes.
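
The evaluation protocol in this abstract (a 70%/30% split scored by RMSE) can be illustrated with the short sketch below. The abstract does not say how the split was drawn, so the ordered split is an assumption, and the helper names `split_70_30` and `rmse` are hypothetical.

```python
import math


def split_70_30(samples):
    """Split a sequence into the first 70% (training) and the last 30% (testing).
    An ordered split is assumed; the abstract does not specify the method."""
    cut = (len(samples) * 7) // 10   # integer arithmetic avoids float rounding
    return samples[:cut], samples[cut:]


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted outputs."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy usage with made-up numbers (not the gas furnace data).
train, test = split_70_30(list(range(10)))
error = rmse([0.0, 0.0], [3.0, 3.0])
```

Computing `rmse` separately on the training and testing portions gives the two figures that the abstract reports for each model.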

A Co-training Method based on Classification Using Unlabeled Data (비분류표시 데이타를 이용하는 분류 기반 Co-training 방법)

  • 윤혜성;이상호;박승수;용환승;김주한
    • Journal of KIISE:Software and Applications / v.31 no.8 / pp.991-998 / 2004
  • In many practical learning problems, including those in bioinformatics, there is a small amount of labeled data along with a large pool of unlabeled data. Labeled examples are fairly expensive to obtain because they require human effort. In contrast, unlabeled examples can be gathered inexpensively without an expert. A common method for using unlabeled data in data classification and analysis is co-training. This method uses a small set of labeled examples to learn a classifier in each of two views. Each classifier is then applied to all unlabeled examples, and co-training selects the examples on which each classifier makes the most confident predictions. After some iterations, new classifiers are learned from the training data and the number of labeled examples increases. In this paper, we propose a new co-training strategy using unlabeled data, and we evaluate our method with two classifiers and two experimental datasets: WebKB and BIND XML data. Our experiments show that the proposed co-training technique effectively improves classification accuracy when the number of labeled examples is very small.
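
The generic co-training loop summarized in this abstract (not the new strategy the paper proposes, which the abstract does not detail) can be sketched as follows. Each example has two views; a classifier per view labels the unlabeled example it is most confident about and adds it to the shared labeled set. The nearest-centroid classifier, the margin-based confidence, and the names `fit`, `predict`, and `co_train` are assumptions made for this illustration.

```python
from statistics import mean


def fit(values, labels):
    """Stand-in classifier for one view: one centroid per class."""
    return {c: mean(v for v, y in zip(values, labels) if y == c) for c in set(labels)}


def predict(centroids, v):
    """Return (label, margin), where margin is the lead over the runner-up class."""
    ranked = sorted(centroids, key=lambda c: abs(v - centroids[c]))
    margin = abs(v - centroids[ranked[1]]) - abs(v - centroids[ranked[0]])
    return ranked[0], margin


def co_train(labeled, unlabeled, per_round=1, rounds=5):
    """labeled: list of ((view_a, view_b), label); unlabeled: list of (view_a, view_b).
    Each round, the classifier of each view moves its most confident
    unlabeled examples into the shared labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            model = fit([x[view] for x, _ in labeled], [y for _, y in labeled])
            ranked = sorted(pool, key=lambda x: predict(model, x[view])[1], reverse=True)
            for x in ranked[:per_round]:
                labeled.append((x, predict(model, x[view])[0]))
                pool.remove(x)
    return labeled


# Toy usage: one labeled example per class, four unlabeled two-view examples.
result = co_train([((0.0, 0.0), "a"), ((10.0, 10.0), "b")],
                  [(1.0, 5.0), (5.0, 9.0), (9.0, 8.0), (2.0, 1.0)])
```

In the toy data, the example `(5.0, 9.0)` is ambiguous in the first view but clear in the second, so it ends up labeled by the second view's classifier; this is the kind of complementary evidence co-training exploits.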

Improving the Subject Independent Classification of Implicit Intention By Generating Additional Training Data with PCA and ICA

  • Oh, Sang-Hoon
    • International Journal of Contents / v.14 no.4 / pp.24-29 / 2018
  • EEG-based brain-computer interfaces have focused on explicitly expressed intentions to assist physically impaired patients. For EEG-based brain-computer interfaces to function effectively, they should be able to understand users' implicit information. Since it is hard to gather EEG signals from human brains, we do not have enough training data, which are essential for proper classification performance of implicit intention. In this paper, we improve the subject-independent classification of implicit intention by generating additional training data. In the first stage, we perform PCA (principal component analysis) on the training data to remove redundant components within the input data. After the dimension reduction by PCA, we train an ICA (independent component analysis) network whose outputs are statistically independent. We obtain additional training data by adding Gaussian noise to the ICA outputs and projecting them back to the input data domain. Through simulations with EEG data provided by CNSL, KAIST, we improve the classification performance from 65.05% to 66.69% with Gamma components. The proposed sample generation method can be applied to any machine learning problem with few samples.
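
A simplified version of this sample-generation idea is sketched below: project the training data onto its leading principal components, add Gaussian noise in the reduced space, and project back to the input domain. The ICA stage described in the abstract is deliberately omitted here, so this is a PCA-only stand-in rather than the paper's method; it assumes NumPy is available, and the function name and parameters are hypothetical.

```python
import numpy as np


def augment_with_pca_noise(X, n_components, noise_std, n_copies, seed=0):
    """Generate n_copies noisy variants of each row of X (samples x features).
    PCA-only simplification: the ICA stage from the abstract is not modeled."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    Xc = X - mean                                  # center the data
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components]                          # shape (k, d)
    Z = Xc @ W.T                                   # reduced representation, shape (n, k)
    copies = []
    for _ in range(n_copies):
        Z_noisy = Z + rng.normal(0.0, noise_std, size=Z.shape)
        copies.append(Z_noisy @ W + mean)          # project back to the input domain
    return np.vstack(copies)


# Toy usage on random data (not EEG recordings).
X = np.random.default_rng(1).normal(size=(20, 5))
extra = augment_with_pca_noise(X, n_components=2, noise_std=0.1, n_copies=3)
```

Adding the noise in the reduced space, rather than directly to the inputs, keeps the generated samples within the subspace spanned by the retained components.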

Generating and Validating Synthetic Training Data for Predicting Bankruptcy of Individual Businesses

  • Hong, Dong-Suk;Baik, Cheol
    • Journal of information and communication convergence engineering / v.19 no.4 / pp.228-233 / 2021
  • In this study, we analyze the credit information (loans, delinquency information, etc.) of individual business owners to generate voluminous training data for establishing a bankruptcy prediction model through a partial synthetic training technique. Furthermore, we evaluate the prediction performance of the newly generated data compared to the actual data. In the experiments (a logistic regression task), using training data generated with conditional tabular generative adversarial networks (CTGAN) improved recall by a factor of 1.75 compared to using the actual data. The probability that the actual and generated data are sampled from an identical distribution is verified to be much higher than 80%. Providing artificial intelligence training data through data synthesis in the fields of credit rating and default risk prediction of individual businesses, which have seen relatively little research activity, promotes further in-depth research efforts focused on utilizing such methods.

A Content Analysis of Research Data Management Training Programs at the University Libraries in North America: Focusing on Data Literacy Competencies (북미 대학도서관 연구데이터 관리 교육 프로그램 내용 분석: 데이터 리터러시 세부 역량을 중심으로)

  • Kim, Jihyun
    • Journal of the Korean Society for information Management / v.35 no.4 / pp.7-36 / 2018
  • This study aimed to analyze the content of Research Data Management (RDM) training programs provided by 51 out of 121 university libraries in North America that implemented RDM services, and to provide implications from the results. For the content analysis, 317 titles of classroom training programs and 42 headings at the highest level from the tables of contents of online tutorials were collected and coded based on 12 data literacy competencies identified from previous studies. Among classroom training programs, those regarding data processing and analysis competency were offered the most. The highest number of libraries provided classroom training programs related to data management and organization competency. The third largest number of classroom training programs dealt with data visualization and representation competency. However, each of the remaining 9 competencies was covered by only a few classroom training programs, which implies that classroom training programs focused on particular data literacy competencies. Five university libraries developed and provided their own online tutorials. The analysis of the headings showed that the competencies of data preservation, ethics and data citation, and data management and organization were mainly covered, which differed from the competencies stressed by the classroom training programs. For effective RDM training programs, it is necessary to understand and support the education of data literacy competencies that researchers need in order to derive research results, in addition to the competencies that university librarians have traditionally taught and emphasized. It is also necessary to develop educational resources that support continuing education for the librarians involved in RDM services.