Sequence driven features for prediction of subcellular localization of proteins

Sequence-based feature extraction methods for predicting the distribution of proteins across subcellular organelles

  • Kim, Jong-Kyoung (Department of Computer Science, Pohang University of Science and Technology) ;
  • Choi, Seung-Jin (Department of Computer Science, Pohang University of Science and Technology)
  • 김종경 (포항공과대학교 컴퓨터공학과) ;
  • 최승진 (포항공과대학교 컴퓨터공학과)
  • Published : 2005.07.01

Abstract

Predicting the cellular location of an unknown protein provides valuable information for inferring its possible function. A more accurate prediction system requires a good feature extraction method that transforms raw sequence data into numerical feature vectors while minimizing information loss. In this paper, we propose new methods for extracting underlying features from sequence data alone by computing pairwise sequence alignment scores. In addition, we use composition-based features to improve prediction accuracy. To construct an SVM ensemble from separately trained SVM classifiers, we propose specificity-based weighted majority voting. The overall prediction accuracy, evaluated by 5-fold cross-validation, reached $88.53\%$ on the eukaryotic animal data set. Comparing the prediction accuracy of the various feature extraction methods also gave biological insight into where the targeting information is located. Our numerical experiments confirm that the new feature extraction methods are very useful for predicting the subcellular localization of proteins.
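The two ideas named in the abstract, alignment-score features and specificity-weighted voting, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the simple match/mismatch/gap scoring (a real system would more likely use a substitution matrix), the choice of reference sequences, and the exact way specificities are used as vote weights.

```python
from collections import defaultdict


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not the paper's.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def alignment_features(seq, references):
    """Feature vector: alignment score of seq against each reference sequence."""
    return [global_alignment_score(seq, ref) for ref in references]


def specificity(y_true, y_pred, label):
    """TN / (TN + FP) for one class, treated one-vs-rest."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != label and p != label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    return tn / (tn + fp) if (tn + fp) else 0.0


def weighted_vote(predictions, weights):
    """Combine one predicted label per classifier.

    weights[k][label] is the (assumed) specificity of classifier k for label;
    each classifier's vote counts with that weight.
    """
    totals = defaultdict(float)
    for k, label in enumerate(predictions):
        totals[label] += weights[k].get(label, 0.0)
    return max(totals, key=totals.get)
```

In this sketch, a single classifier with high specificity for a class can outvote two less specific classifiers that agree with each other, which is the intuition behind weighting votes rather than counting them equally.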

Keywords