• Title/Summary/Keyword: Data-set


Evaluation of Machine Learning Algorithm Utilization for Lung Cancer Classification Based on Gene Expression Levels

  • Podolsky, Maxim D;Barchuk, Anton A;Kuznetcov, Vladimir I;Gusarova, Natalia F;Gaidukov, Vadim S;Tarakanov, Sergey A
    • Asian Pacific Journal of Cancer Prevention
    • /
    • Volume 17, Issue 2
    • /
    • pp.835-838
    • /
    • 2016
  • Background: Lung cancer remains one of the most common cancers in the world, both in terms of new cases (about 13% of total per year) and deaths (nearly one cancer death in five), because of the high case fatality. Errors in lung cancer type or malignant growth determination lead to degraded treatment efficacy, because anticancer strategy depends on tumor morphology. Materials and Methods: We have made an attempt to evaluate effectiveness of machine learning algorithms in the task of lung cancer classification based on gene expression levels. We processed four publicly available data sets. The Dana-Farber Cancer Institute data set contains 203 samples and the task was to classify four cancer types and sound tissue samples. With the University of Michigan data set of 96 samples, the task was to execute a binary classification of adenocarcinoma and non-neoplastic tissues. The University of Toronto data set contains 39 samples and the task was to detect recurrence, while with the Brigham and Women's Hospital data set of 181 samples it was to make a binary classification of malignant pleural mesothelioma and adenocarcinoma. We used the k-nearest neighbor algorithm (k=1, k=5, k=10), naive Bayes classifier with assumption of both a normal distribution of attributes and a distribution through histograms, support vector machine and C4.5 decision tree. Effectiveness of machine learning algorithms was evaluated with the Matthews correlation coefficient. Results: The support vector machine method showed best results among data sets from the Dana-Farber Cancer Institute and Brigham and Women's Hospital. All algorithms with the exception of the C4.5 decision tree showed maximum potential effectiveness in the University of Michigan data set. However, the C4.5 decision tree showed best results for the University of Toronto data set. Conclusions: Machine learning algorithms can be used for lung cancer morphology classification and similar tasks based on gene expression level evaluation.
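As a rough illustration of the comparison described in this abstract, the sketch below fits k-NN (k=1, 5, 10), a Gaussian naive Bayes classifier, a linear SVM, and a decision tree, and scores each with the Matthews correlation coefficient. It assumes scikit-learn and uses a synthetic matrix in place of the actual Dana-Farber, Michigan, Toronto, and Brigham data sets; scikit-learn's tree is CART rather than C4.5, and the histogram-based naive Bayes variant is not included.

```python
# Sketch: compare k-NN, naive Bayes, SVM and a decision tree with the
# Matthews correlation coefficient. Synthetic data stands in for the
# real gene-expression matrices used in the study.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import matthews_corrcoef

# Stand-in for a gene-expression matrix: 200 samples x 500 "genes".
X, y = make_classification(n_samples=200, n_features=500, n_informative=30,
                           n_classes=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "1-NN": KNeighborsClassifier(n_neighbors=1),
    "5-NN": KNeighborsClassifier(n_neighbors=5),
    "10-NN": KNeighborsClassifier(n_neighbors=10),
    "naive Bayes (normal)": GaussianNB(),
    "SVM": SVC(kernel="linear"),
    "decision tree": DecisionTreeClassifier(random_state=0),  # CART, not C4.5
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    mcc = matthews_corrcoef(y_te, model.predict(X_te))
    print(f"{name:22s} MCC = {mcc:+.3f}")
```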

기계학습 활용을 위한 학습 데이터세트 구축 표준화 방안에 관한 연구 (A study on the standardization strategy for building of learning data set for machine learning applications)

  • 최정열
    • 디지털융복합연구
    • /
    • Volume 16, Issue 10
    • /
    • pp.205-212
    • /
    • 2018
  • With the development of high-performance CPUs/GPUs, artificial intelligence algorithms such as deep neural networks, and the availability of large amounts of data, machine learning is being applied to an expanding range of fields. In particular, the large volumes of data collected from the Internet of Things, social networking services, web pages, and public data are accelerating the use of machine learning. Learning data sets for machine learning exist in various formats depending on the application field and data type, which makes it difficult to process the data effectively and apply it to machine learning. This paper therefore studies a method for building learning data sets for machine learning according to a standardized procedure. First, the requirements that a learning data set must satisfy are analyzed by problem type and by data type. Based on this analysis, a reference model for building learning data sets for machine learning applications is proposed. In addition, to develop the reference model into an international standard, a target standardization organization is selected and a standardization strategy is presented.

YOLOv4 네트워크를 이용한 자동운전 데이터 분할이 검출성능에 미치는 영향 (Influence of Self-driving Data Set Partition on Detection Performance Using YOLOv4 Network)

  • 왕욱비;진락;이추담;손진구;정석용;송정영
    • 한국인터넷방송통신학회논문지
    • /
    • Volume 20, Issue 6
    • /
    • pp.157-165
    • /
    • 2020
  • One way to improve the performance of detecting moving objects when developing neural networks and self-driving data sets is to partition the data set appropriately. In the DarkNet framework, the YOLOv4 network was trained and validated on the Udacity data set. According to seven split ratios, the Udacity data set was divided into three subsets: a training set, a validation set, and a test set. The K-means++ algorithm was used to cluster the object bounding-box dimensions for the seven groups. By tuning the hyperparameters of the YOLOv4 network for training, the optimal model parameters were obtained for each of the seven groups. These model parameters were then used for detection and compared on the seven test sets. The experimental results show that the YOLOv4 network can detect large, medium, and small moving objects, represented by trucks, cars, and pedestrians, in the Udacity data set. When the ratio of the training, validation, and test sets is 7 : 1.5 : 1.5, the optimal model parameters yield the highest detection performance, reaching an mAP50 of 80.89%, an mAP75 of 47.08%, and a detection speed of 10.56 FPS.
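A minimal sketch of the two preprocessing steps mentioned in this abstract: splitting a detection data set in the 7 : 1.5 : 1.5 ratio and clustering bounding-box dimensions with k-means++ to obtain anchors. The `image_ids` and `boxes_wh` arrays are hypothetical placeholders; the actual Udacity label files and the Darknet training loop are not reproduced.

```python
# Sketch: 7 : 1.5 : 1.5 data-set split plus k-means++ anchor clustering,
# in the spirit of the experiment above. Inputs are placeholders.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
image_ids = np.arange(1000)                      # placeholder image identifiers
boxes_wh = rng.uniform(10, 300, size=(5000, 2))  # placeholder (width, height) pairs

# 7 : 1.5 : 1.5 split of the images into train / validation / test.
shuffled = rng.permutation(image_ids)
n = len(shuffled)
n_train, n_val = int(0.7 * n), int(0.15 * n)
train_ids = shuffled[:n_train]
val_ids = shuffled[n_train:n_train + n_val]
test_ids = shuffled[n_train + n_val:]
print(len(train_ids), len(val_ids), len(test_ids))

# k-means++ clustering of box dimensions; 9 anchors is the YOLO convention.
kmeans = KMeans(n_clusters=9, init="k-means++", n_init=10, random_state=0)
kmeans.fit(boxes_wh)
anchors = sorted(kmeans.cluster_centers_.tolist(), key=lambda wh: wh[0] * wh[1])
for w, h in anchors:
    print(f"anchor: {w:.0f} x {h:.0f}")
```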

로빈스-몬로 확률 근사 알고리즘을 이용한 데이터 분류 (Data Classification Using the Robbins-Monro Stochastic Approximation Algorithm)

  • 이재국;고춘택;최원호
    • 전력전자학회:학술대회논문집
    • /
    • 전력전자학회 2005년도 전력전자학술대회 논문집
    • /
    • pp.624-627
    • /
    • 2005
  • This paper presents a new data classification method using the Robbins-Monro stochastic approximation algorithm, the k-nearest neighbor (k-NN) algorithm, and distribution analysis. To cluster the data set, the centroid of the test data set is determined using the k-nearest neighbor algorithm and the local area of the data set. To decide the class of each datum, the Robbins-Monro stochastic approximation algorithm is applied to the selected local area of the data set. To evaluate its performance, the proposed classification method is compared with the conventional fuzzy c-means method and the k-NN algorithm. The simulation results show that the proposed method is more accurate than the fuzzy c-means method, the k-NN algorithm, and the discriminant analysis algorithm.
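The abstract does not spell out its exact procedure, so the sketch below only illustrates the core Robbins-Monro recursion, here used to estimate class centroids from streamed samples followed by a nearest-centroid decision; the data, step sizes, and decision rule are assumptions, not the authors' algorithm.

```python
# Sketch: the Robbins-Monro iteration x_{n+1} = x_n + a_n * (sample - x_n)
# with a_n = 1/n, used to estimate class centroids, followed by a
# nearest-centroid decision. Generic illustration only.
import numpy as np

rng = np.random.default_rng(1)
class_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
class_b = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(200, 2))

def robbins_monro_mean(samples):
    """Estimate the mean of a stream with the Robbins-Monro recursion."""
    estimate = np.zeros(samples.shape[1])
    for n, x in enumerate(samples, start=1):
        step = 1.0 / n                      # a_n = 1/n satisfies the RM conditions
        estimate = estimate + step * (x - estimate)
    return estimate

centroids = {"A": robbins_monro_mean(class_a), "B": robbins_monro_mean(class_b)}
query = np.array([2.5, 2.8])
label = min(centroids, key=lambda c: np.linalg.norm(query - centroids[c]))
print(centroids, "->", label)   # the query should fall in class B
```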


Ranking Candidate Genes for the Biomarker Development in a Cancer Diagnostics

  • Kim, In-Young;Lee, Sun-Ho;Rha, Sun-Young;Kim, Byung-Soo
    • 한국생물정보학회:학술대회논문집
    • /
    • 한국생물정보시스템생물학회 2004년도 The 3rd Annual Conference for The Korean Society for Bioinformatics Association of Asian Societies for Bioinformatics 2004 Symposium
    • /
    • pp.272-278
    • /
    • 2004
  • Recently, Pepe et al. (2003) employed the receiver operating characteristic (ROC) approach to rank candidate genes from a microarray experiment that can be used for biomarker development, with the ultimate purpose of population screening for a cancer. In a cancer microarray experiment based on n patients, the researcher often wants to compare the tumor tissue with the normal tissue within the same individual using a common reference RNA. This design is referred to as a reference design or an indirect design. Ideally, this experiment produces n pairs of microarray data, where each pair consists of two sets of microarray data resulting from reference versus normal tissue and reference versus tumor tissue hybridizations. However, for certain individuals either the normal tissue or the tumor tissue is not large enough for the experimenter to extract enough RNA to conduct the microarray experiment; hence there are missing values either in the normal or the tumor tissue data. Practically, we have $n_1$ pairs of complete observations, $n_2$ 'normal only' and $n_3$ 'tumor only' data for the microarray experiment with n patients, where n=$n_1$+$n_2$+$n_3$. We refer to this data set as a mixed data set, as it contains a mix of fully observed and partially observed pair data. This mixed data set was actually observed in a microarray experiment based on human tissues obtained during the surgical operations of cancer patients. Pepe et al. (2003) provide the rationale for using the ROC approach based on two independent samples for ranking candidate genes instead of using t or Mann-Whitney statistics. We first modify the ROC approach of ranking genes to a paired data set and further extend it to a mixed data set by taking a weighted average of two ROC values obtained from the paired data set and the two independent data sets.
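A small sketch of the extension described at the end of this abstract: an AUC (the ROC summary equivalent to the Mann-Whitney statistic) is computed from the paired samples and from the two independent samples, then combined as a weighted average. The simulated expression values and the sample-size weighting are assumptions for illustration only; the paper derives its own weights.

```python
# Sketch: rank one candidate gene by a weighted average of two AUC values,
# one from the n1 complete pairs and one from the n2 'normal only' vs
# n3 'tumor only' samples. Data and weights are illustrative assumptions.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n1, n2, n3 = 30, 10, 12   # complete pairs, normal-only, tumor-only

# Hypothetical expression values for one candidate gene.
normal_paired = rng.normal(0.0, 1.0, n1)
tumor_paired = rng.normal(1.0, 1.0, n1)
normal_only = rng.normal(0.0, 1.0, n2)
tumor_only = rng.normal(1.0, 1.0, n3)

def auc(normal, tumor):
    """AUC = P(tumor value > normal value), via roc_auc_score."""
    values = np.concatenate([normal, tumor])
    labels = np.concatenate([np.zeros(len(normal)), np.ones(len(tumor))])
    return roc_auc_score(labels, values)

auc_paired = auc(normal_paired, tumor_paired)
auc_unpaired = auc(normal_only, tumor_only)

# Weight each AUC by the number of patients contributing to it (an assumption).
w = n1 / (n1 + n2 + n3)
gene_score = w * auc_paired + (1 - w) * auc_unpaired
print(f"paired AUC={auc_paired:.3f}, unpaired AUC={auc_unpaired:.3f}, score={gene_score:.3f}")
```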


연속발생 데이터를 위한 실시간 데이터 마이닝 기법 (A Real-Time Data Mining for Stream Data Sets)

  • 김진화;민진영
    • 한국경영과학회지
    • /
    • Volume 29, Issue 4
    • /
    • pp.41-60
    • /
    • 2004
  • A stream data set is a data set that is accumulated in data storage from a data source continuously over time. The size of this data set, in many cases, becomes increasingly large over time. Mining information from this massive data requires much resource such as storage, memory, and time. These unique characteristics of stream data make it difficult and expensive to use this large data set accumulated over time. On the other hand, if we use only recent data or a part of the whole data to mine information or patterns, there can be a loss of information that may be useful. To avoid this problem, we suggest a method that efficiently accumulates information, in the form of rule sets, over time. It takes much smaller storage compared to traditional mining methods. These accumulated rule sets are used as prediction models in the future. Based on theories of ensemble approaches, a combination of many prediction models, in the form of systematically merged rule sets in this study, performs better than a single prediction model. This study uses a customer data set for predicting the buying power of customers based on their information. This study tests the performance of the suggested method on this data set along with general prediction methods and compares their performances.
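The sketch below illustrates only the general idea of the abstract, accumulating a compact rule set per data chunk and combining the accumulated rules as an ensemble by majority vote; the toy threshold rules and the voting scheme are placeholders, not the paper's rule-induction or merging procedure.

```python
# Sketch: keep one small "rule" per chunk of a stream instead of the raw data,
# then predict by majority vote over the accumulated rules. Illustrative only.
import numpy as np

rng = np.random.default_rng(3)

def induce_rule(chunk_x, chunk_y):
    """A toy one-feature threshold rule learned from one chunk."""
    best = None
    for j in range(chunk_x.shape[1]):
        threshold = np.median(chunk_x[:, j])
        pred = (chunk_x[:, j] > threshold).astype(int)
        acc = (pred == chunk_y).mean()
        if best is None or acc > best[0]:
            best = (acc, j, threshold)
    _, feature, threshold = best
    return lambda x: int(x[feature] > threshold)

# Simulate a stream arriving in chunks; keep only the rules, not the raw data.
rule_set = []
for _ in range(10):
    chunk_x = rng.normal(size=(200, 5))
    chunk_y = (chunk_x[:, 2] + 0.1 * rng.normal(size=200) > 0).astype(int)
    rule_set.append(induce_rule(chunk_x, chunk_y))

def predict(x):
    votes = sum(rule(x) for rule in rule_set)
    return int(votes > len(rule_set) / 2)     # majority vote over accumulated rules

print(predict(np.array([0.0, 0.0, 1.5, 0.0, 0.0])))   # likely class 1
print(predict(np.array([0.0, 0.0, -1.5, 0.0, 0.0])))  # likely class 0
```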

An Application of the Rough Set Approach to credit Rating

  • Kim, Jae-Kyeong;Cho, Sung-Sik
    • 한국지능정보시스템학회:학술대회논문집
    • /
    • 한국지능정보시스템학회 1999년도 추계학술대회-지능형 정보기술과 미래조직 Information Technology and Future Organization
    • /
    • pp.347-354
    • /
    • 1999
  • The credit rating represents an assessment of the relative level of risk associated with the timely payments required by a debt obligation. In this paper, we present a new approach to credit rating of customers based on rough set theory. The concept of a rough set appears to be an effective tool for the analysis of customer information systems representing knowledge gained by experience. The customer information system describes a set of customers by a set of multi-valued attributes, called condition attributes. The customers are classified into risk groups according to an expert's opinion, called the decision attribute. A natural problem of knowledge analysis then consists in discovering relationships, in terms of decision rules, between the description of customers by condition attributes and the particular decisions. The rough set approach enables one to discover minimal subsets of condition attributes ensuring an acceptable quality of classification of the customers analyzed, and to derive decision rules from the customer information system that can be used to support decisions about rating new customers. Using the rough set approach, one analyzes only the facts hidden in the data; it needs no additional information about the data and does not correct inconsistencies manifested in the data; instead, the rules produced are categorized into certain and possible. A real problem of evaluating credit rating by a department store is studied using the rough set approach.
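A minimal sketch of the rough set machinery the abstract relies on: indiscernibility classes over condition attributes, and the lower and upper approximations that separate certain from possible rules. The customer table and attribute names are hypothetical.

```python
# Sketch: lower and upper approximations of a "low risk" customer class under
# rough set theory, using a toy customer table.
from collections import defaultdict

# (income, payment_history) are condition attributes; 'risk' is the decision.
customers = {
    "c1": (("high", "good"), "low"),
    "c2": (("high", "good"), "low"),
    "c3": (("high", "good"), "high"),   # inconsistent with c1, c2
    "c4": (("low", "bad"), "high"),
    "c5": (("low", "good"), "low"),
}

# Indiscernibility classes: customers with identical condition attributes.
blocks = defaultdict(set)
for name, (conditions, _) in customers.items():
    blocks[conditions].add(name)

target = {name for name, (_, risk) in customers.items() if risk == "low"}

lower = set().union(*(b for b in blocks.values() if b <= target))
upper = set().union(*(b for b in blocks.values() if b & target))

print("lower approximation (certain rules):", sorted(lower))   # ['c5']
print("upper approximation (possible rules):", sorted(upper))  # ['c1', 'c2', 'c3', 'c5']
```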


VDM의 자료구조인 set, sequence, map의 프로그래밍 언어 자료구조인 linked list로의 변환 (The Conversion of a Set, a Sequence, and a Map in VDM to a Linked List in a Programming Language)

  • 유문성
    • 정보처리학회논문지D
    • /
    • Volume 8D, Issue 4
    • /
    • pp.421-426
    • /
    • 2001
  • A formal development method is used to develop software correctly and systematically: the system is specified with a formal specification language and then developed by progressively refining the specification until it is implemented. VDM is one such formal specification language; it specifies a system using the mathematical abstract data structures set, sequence, and map, which most programming languages do not provide. These data structures therefore need to be converted, and the mathematical data structures of VDM can be converted into a linked list, a data structure available in programming languages. This paper presents a method for converting the set, sequence, and map of VDM into a linked list and proves the validity of the conversion mathematically.
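As a small illustration of the conversion discussed above, the sketch below stores a VDM-style set in a singly linked list, with insertion that discards duplicates. Python stands in for the target programming language; a sequence or map would use the same list structure with ordered values or (key, value) pairs in the nodes.

```python
# Sketch: a VDM-style set represented as a singly linked list.
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

class LinkedSet:
    """A set stored as a linked list: insertion ignores duplicates."""
    def __init__(self):
        self.head = None

    def contains(self, value):
        node = self.head
        while node is not None:
            if node.value == value:
                return True
            node = node.next
        return False

    def insert(self, value):
        if not self.contains(value):          # set semantics: no duplicates
            self.head = Node(value, self.head)

    def to_list(self):
        out, node = [], self.head
        while node is not None:
            out.append(node.value)
            node = node.next
        return out

s = LinkedSet()
for v in [3, 1, 3, 2, 1]:
    s.insert(v)
print(s.to_list())   # three distinct elements; order is irrelevant for a set
```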


Training for Huge Data set with On Line Pruning Regression by LS-SVM

  • Kim, Dae-Hak;Shim, Joo-Yong;Oh, Kwang-Sik
    • 한국통계학회:학술대회논문집
    • /
    • 한국통계학회 2003년도 추계 학술발표회 논문집
    • /
    • pp.137-141
    • /
    • 2003
  • LS-SVM (least squares support vector machine) is a widely applicable and useful machine learning technique for classification and regression analysis. LS-SVM can be a good substitute for statistical methods, but computational difficulties still remain in inverting the matrix of a huge data set. In the modern information society, we can easily obtain huge data sets in online or batch mode. For these kinds of huge data sets, we suggest an online pruning regression method based on LS-SVM. With a relatively small number of pruned support vectors, we can achieve almost the same performance as regression with the full data set.
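A compact sketch of batch LS-SVM regression with a naive pruning step that keeps only the support vectors with the largest |alpha| and refits; the RBF kernel, regularization value, and pruning rule are assumptions, and the paper's online formulation is not reproduced.

```python
# Sketch: batch LS-SVM regression (RBF kernel) plus a naive pruning step.
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    n = len(X)
    K = rbf_kernel(X, X, sigma)
    # Solve [[0, 1^T], [1, K + I/gamma]] [b, alpha]^T = [0, y]^T
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]            # bias b, support values alpha

def lssvm_predict(X_train, b, alpha, X_new, sigma=1.0):
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

b, alpha = lssvm_fit(X, y)
keep = np.argsort(np.abs(alpha))[-100:]        # prune to the 100 largest |alpha|
b_p, alpha_p = lssvm_fit(X[keep], y[keep])

X_test = np.linspace(-3, 3, 7).reshape(-1, 1)
print(lssvm_predict(X, b, alpha, X_test))
print(lssvm_predict(X[keep], b_p, alpha_p, X_test))   # similar fit, fewer vectors
```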


2단계 선택과정의 모형화 (Modeling Two-stage Choice Process)

  • 박상준;한민희
    • 한국경영과학회지
    • /
    • Volume 25, Issue 2
    • /
    • pp.77-86
    • /
    • 2000
  • Consumers facing a large number of brands to choose from are known to use a simplified heuristic to screen a set of relevant brands, called the consideration set, from the whole set of alternatives. Purchase decisions are then made from the brands in the consideration set. Two approaches have been suggested to model this two-stage choice process. One is to treat the consideration set as a crisp set (e.g., Roberts and Lattin 1991); the other is to treat it as a fuzzy set (e.g., Fotheringham 1988). The paper empirically compares the two types of models using data for soft drinks, sneakers, and departments. The results show that a model employing the crisp set approach fits the data better than one with the fuzzy set approach, and better than a single-stage logit choice model.
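A toy sketch contrasting the two approaches named in this abstract: a single-stage logit over all brands versus a crisp two-stage rule that screens brands into a consideration set before applying the logit. The utilities and the screening attribute are invented for illustration; the paper estimates such models on real purchase data.

```python
# Sketch: single-stage logit vs. a crisp two-stage (consideration set) rule.
import numpy as np

utilities = np.array([1.2, 0.4, 0.9, -0.3])   # hypothetical utilities of 4 brands
awareness = np.array([0.9, 0.2, 0.8, 0.7])    # hypothetical screening attribute

def logit(u):
    e = np.exp(u - u.max())
    return e / e.sum()

# Single-stage logit: choice probabilities over all brands.
p_single = logit(utilities)

# Two-stage (crisp consideration set): screen first, then logit inside the set.
considered = awareness >= 0.5                  # crisp screening rule
p_two = np.zeros_like(utilities)
p_two[considered] = logit(utilities[considered])

print("single stage :", np.round(p_single, 3))
print("two stage    :", np.round(p_two, 3))    # screened-out brands get probability 0
```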
