• Title/Summary/Keyword: Data set

Search Result 10,917, Processing Time 0.039 seconds

Evaluation of Machine Learning Algorithm Utilization for Lung Cancer Classification Based on Gene Expression Levels

  • Podolsky, Maxim D;Barchuk, Anton A;Kuznetcov, Vladimir I;Gusarova, Natalia F;Gaidukov, Vadim S;Tarakanov, Segrey A
    • Asian Pacific Journal of Cancer Prevention
    • /
    • v.17 no.2
    • /
    • pp.835-838
    • /
    • 2016
  • Background: Lung cancer remains one of the most common cancers in the world, both in terms of new cases (about 13% of total per year) and deaths (nearly one cancer death in five), because of the high case fatality. Errors in lung cancer type or malignant growth determination lead to degraded treatment efficacy, because anticancer strategy depends on tumor morphology. Materials and Methods: We have made an attempt to evaluate effectiveness of machine learning algorithms in the task of lung cancer classification based on gene expression levels. We processed four publicly available data sets. The Dana-Farber Cancer Institute data set contains 203 samples and the task was to classify four cancer types and sound tissue samples. With the University of Michigan data set of 96 samples, the task was to execute a binary classification of adenocarcinoma and non-neoplastic tissues. The University of Toronto data set contains 39 samples and the task was to detect recurrence, while with the Brigham and Women's Hospital data set of 181 samples it was to make a binary classification of malignant pleural mesothelioma and adenocarcinoma. We used the k-nearest neighbor algorithm (k=1, k=5, k=10), naive Bayes classifier with assumption of both a normal distribution of attributes and a distribution through histograms, support vector machine and C4.5 decision tree. Effectiveness of machine learning algorithms was evaluated with the Matthews correlation coefficient. Results: The support vector machine method showed best results among data sets from the Dana-Farber Cancer Institute and Brigham and Women's Hospital. All algorithms with the exception of the C4.5 decision tree showed maximum potential effectiveness in the University of Michigan data set. However, the C4.5 decision tree showed best results for the University of Toronto data set. Conclusions: Machine learning algorithms can be used for lung cancer morphology classification and similar tasks based on gene expression level evaluation.

A study on the standardization strategy for building of learning data set for machine learning applications (기계학습 활용을 위한 학습 데이터세트 구축 표준화 방안에 관한 연구)

  • Choi, JungYul
    • Journal of Digital Convergence
    • /
    • v.16 no.10
    • /
    • pp.205-212
    • /
    • 2018
  • With the development of high performance CPU / GPU, artificial intelligence algorithms such as deep neural networks, and a large amount of data, machine learning has been extended to various applications. In particular, a large amount of data collected from the Internet of Things, social network services, web pages, and public data is accelerating the use of machine learning. Learning data sets for machine learning exist in various formats according to application fields and data types, and thus it is difficult to effectively process data and apply them to machine learning. Therefore, this paper studied a method for building a learning data set for machine learning in accordance with standardized procedures. This paper first analyzes the requirement of learning data set according to problem types and data types. Based on the analysis, this paper presents the reference model to build learning data set for machine learning applications. This paper presents the target standardization organization and a standard development strategy for building learning data set.

Influence of Self-driving Data Set Partition on Detection Performance Using YOLOv4 Network (YOLOv4 네트워크를 이용한 자동운전 데이터 분할이 검출성능에 미치는 영향)

  • Wang, Xufei;Chen, Le;Li, Qiutan;Son, Jinku;Ding, Xilong;Song, Jeongyoung
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.20 no.6
    • /
    • pp.157-165
    • /
    • 2020
  • Aiming at the development of neural network and self-driving data set, it is also an idea to improve the performance of network model to detect moving objects by dividing the data set. In Darknet network framework, the YOLOv4 (You Only Look Once v4) network model was used to train and test Udacity data set. According to 7 proportions of the Udacity data set, it was divided into three subsets including training set, validation set and test set. K-means++ algorithm was used to conduct dimensional clustering of object boxes in 7 groups. By adjusting the super parameters of YOLOv4 network for training, Optimal model parameters for 7 groups were obtained respectively. These model parameters were used to detect and compare 7 test sets respectively. The experimental results showed that YOLOv4 can effectively detect the large, medium and small moving objects represented by Truck, Car and Pedestrian in the Udacity data set. When the ratio of training set, validation set and test set is 7:1.5:1.5, the optimal model parameters of the YOLOv4 have highest detection performance. The values show mAP50 reaching 80.89%, mAP75 reaching 47.08%, and the detection speed reaching 10.56 FPS.

Data Classification Using the Robbins-Monro Stochastic Approximation Algorithm (로빈스-몬로 확률 근사 알고리즘을 이용한 데이터 분류)

  • Lee, Jae-Kook;Ko, Chun-Taek;Choi, Won-Ho
    • Proceedings of the KIPE Conference
    • /
    • 2005.07a
    • /
    • pp.624-627
    • /
    • 2005
  • This paper presents a new data classification method using the Robbins Monro stochastic approximation algorithm k-nearest neighbor and distribution analysis. To cluster the data set, we decide the centroid of the test data set using k-nearest neighbor algorithm and the local area of data set. To decide each class of the data, the Robbins Monro stochastic approximation algorithm is applied to the decided local area of the data set. To evaluate the performance, the proposed classification method is compared to the conventional fuzzy c-mean method and k-nn algorithm. The simulation results show that the proposed method is more accurate than fuzzy c-mean method, k-nn algorithm and discriminant analysis algorithm.

  • PDF

Ranking Candidate Genes for the Biomarker Development in a Cancer Diagnostics

  • Kim, In-Young;Lee, Sun-Ho;Rha, Sun-Young;Kim, Byung-Soo
    • Proceedings of the Korean Society for Bioinformatics Conference
    • /
    • 2004.11a
    • /
    • pp.272-278
    • /
    • 2004
  • Recently, Pepe et al. (2003) employed the receiver operating characteristic (ROC) approach to rank candidate genes from a microarray experiment that can be used for the biomarker development with the ultimate purpose of the population screening of a cancer, In the cancer microarray experiment based on n patients the researcher often wants to compare the tumor tissue with the normal tissue within the same individual using a common reference RNA. This design is referred to as a reference design or an indirect design. Ideally, this experiment produces n pairs of microarray data, where each pair consists of two sets of microarray data resulting from reference versus normal tissue and reference versus tumor tissue hybridizations. However, for certain individuals either normal tissue or tumor tissue is not large enough for the experimenter to extract enough RNA for conducting the microarray experiment, hence there are missing values either in the normal or tumor tissue data. Practically, we have $n_1$ pairs of complete observations, $n_2$ 'normal only' and $n_3$ 'tumor only' data for the microarray experiment with n patients, where n=$n_1$+$n_2$+$n_3$. We refer to this data set as a mixed data set, as it contains a mix of fully observed and partially observed pair data. This mixed data set was actually observed in the microarray experiment based on human tissues, where human tissues were obtained during the surgical operations of cancer patients. Pepe et al. (2003) provide the rationale of using ROC approach based on two independent samples for ranking candidate gene instead of using t or Mann -Whitney statistics. We first modify ROC approach of ranking genes to a paired data set and further extend it to a mixed data set by taking a weighted average of two ROC values obtained by the paired data set and two independent data sets.

  • PDF

A Real-Time Data Mining for Stream Data Sets (연속발생 데이터를 위한 실시간 데이터 마이닝 기법)

  • Kim Jinhwa;Min Jin Young
    • Journal of the Korean Operations Research and Management Science Society
    • /
    • v.29 no.4
    • /
    • pp.41-60
    • /
    • 2004
  • A stream data is a data set that is accumulated to the data storage from a data source over time continuously. The size of this data set, in many cases. becomes increasingly large over time. To mine information from this massive data. it takes much resource such as storage, memory and time. These unique characteristics of the stream data make it difficult and expensive to use this large size data accumulated over time. Otherwise. if we use only recent or part of a whole data to mine information or pattern. there can be loss of information. which may be useful. To avoid this problem. we suggest a method that efficiently accumulates information. in the form of rule sets. over time. It takes much smaller storage compared to traditional mining methods. These accumulated rule sets are used as prediction models in the future. Based on theories of ensemble approaches. combination of many prediction models. in the form of systematically merged rule sets in this study. is better than one prediction model in performance. This study uses a customer data set that predicts buying power of customers based on their information. This study tests the performance of the suggested method with the data set alone with general prediction methods and compares performances of them.

An Application of the Rough Set Approach to credit Rating

  • Kim, Jae-Kyeong;Cho, Sung-Sik
    • Proceedings of the Korea Inteligent Information System Society Conference
    • /
    • 1999.10a
    • /
    • pp.347-354
    • /
    • 1999
  • The credit rating represents an assessment of the relative level of risk associated with the timely payments required by the debt obligation. In this paper, we present a new approach to credit rating of customers based on the rough set theory. The concept of a rough set appeared to be an effective tool for the analysis of customer information systems representing knowledge gained by experience. The customer information system describes a set of customers by a set of multi-valued attributes, called condition attributes. The customers are classified into groups of risk subject to an expert's opinion, called decision attribute. A natural problem of knowledge analysis consists then in discovering relationships, in terms of decision rules, between description of customers by condition attributes and particular decisions. The rough set approach enables one to discover minimal subsets of condition attributes ensuring an acceptable quality of classification of the customers analyzed and to derive decision rules from the customer information system which can be used to support decisions about rating new customers. Using the rough set approach one analyses only facts hidden in data, it does not need any additional information about data and does not correct inconsistencies manifested in data; instead, rules produced are categorized into certain and possible. A real problem of the evaluation of the evaluation of credit rating by a department store is studied using the rough set approach.

  • PDF

The Conversion of a Set, a Sequence, and a Map in VDM to a Linked List in a Programming Language (VDM의 자료구조인 set, sequency, map의 프로그래밍 언어 자료구조인 linked list로의 변환)

  • Yu, Mun-Seong
    • The KIPS Transactions:PartD
    • /
    • v.8D no.4
    • /
    • pp.421-426
    • /
    • 2001
  • A formal development method is used to develop software rigorously and systematically. In a formal development method, we specify system by a formal specification language and gradually develop the system more concretely until we can implement the system. VDM is one of formal specification languages. VDM uses mathematical data structures such as sets, sequences, and maps to specify the system, but most programming languages do not have such data structures. Therefore, these data structures should be converted. We can convert mathematical data structures in VDM to a linked list, a data structure in a programming language. In this article, we propose a method to convert a set, a sequence, and a map in VDM to a linked list in a programming language and prove the correctness of this conversion mathematically.

  • PDF

Training for Huge Data set with On Line Pruning Regression by LS-SVM

  • Kim, Dae-Hak;Shim, Joo-Yong;Oh, Kwang-Sik
    • Proceedings of the Korean Statistical Society Conference
    • /
    • 2003.10a
    • /
    • pp.137-141
    • /
    • 2003
  • LS-SVM(least squares support vector machine) is a widely applicable and useful machine learning technique for classification and regression analysis. LS-SVM can be a good substitute for statistical method but computational difficulties are still remained to operate the inversion of matrix of huge data set. In modern information society, we can easily get huge data sets by on line or batch mode. For these kind of huge data sets, we suggest an on line pruning regression method by LS-SVM. With relatively small number of pruned support vectors, we can have almost same performance as regression with full data set.

  • PDF

Modeling Two-stage Choice Process (2단계 선택과정의 모형화)

  • 박상준;한민희
    • Journal of the Korean Operations Research and Management Science Society
    • /
    • v.25 no.2
    • /
    • pp.77-86
    • /
    • 2000
  • Consumers facing a large number of brands to choose from are known to use simplified heuristic to screen a set of relevant brands called the consideration set from the whole alternatives, Purchase decisions are then made from the brands in the consideration set, Two approaches have been suggested to model the two-stage choice process., One is to treat the con-sideration set as a crisp set (e.g Roberts and Lattin 1991) The other is to treat the set as a fuzzy set (e.g. Fortheringham 1988) The paper empirically compares the two types of models using data for soft drinks sneakers and departments. The results show that a model employing the crisp set approach fits the data better than that with the fuzzy set approach and better than a single-stage choice logit model.

  • PDF