Signal Peptide Cleavage Site Prediction Using a String Kernel with Real Exponent Metric

Prediction of Signal Peptide Cleavage Sites Using a String Kernel Constructed from a Real-Exponent Metric

  • 지상문 (Department of Computer Science, Gyeongsang National University)
  • Published : 2009.10.15

Abstract

A kernel in a support vector machine can be regarded as a similarity measure between data points, and this measure is used to find the optimal hyperplane that separates the patterns. It is therefore important to incorporate the characteristics of the data into the similarity measure effectively. To find an optimal similarity between amino acid sequences, we propose a generalized form, with a real-valued exponent, of two metrics derived from the evolutionary relationships of amino acids and from their hydrophobicity. We prove that the proposed form satisfies the conditions of a metric, and we relate it to the metrics used in the string kernels that are widely applied to amino acid and DNA sequences. Experiments on predicting the cleavage site of signal peptides show that a metric better suited to the task can be found within the proposed family.

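A support vector machine computes the similarity between data with a kernel function and uses this similarity to find the optimal hyperplane that classifies patterns. It is therefore important to use a similarity that effectively reflects the characteristics of the data. In this study, to obtain an optimal similarity between amino acid sequences, we generalize the metrics derived from the evolutionary relationships and the hydrophobicity of amino acids into a form with a real exponent. We show that the proposed metric satisfies the conditions of a metric, and we examine its relation to the metrics used within the string kernels that are widely used to compute the similarity of amino acid sequences and DNA sequences. In addition, through experiments on predicting the cleavage site of signal peptides, we show that a metric more effective for the target problem can be found among the generalized metrics.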
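As a rough illustration of the idea of a real-exponent metric, the following Python sketch raises a per-residue distance to a power alpha and checks the metric axioms by brute force. It is not the paper's construction. The base distance (a discrete term plus a hydropathy gap), the use of Kyte-Doolittle hydropathy values as a stand-in scale, and every function name and parameter here are assumptions made for illustration. Positive definiteness of the resulting similarity is not checked.

```python
import itertools
import math

# Kyte-Doolittle hydropathy values, used only as a stand-in scale (assumption).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def base_distance(a, b):
    """Hypothetical base metric: discrete term plus hydropathy gap."""
    if a == b:
        return 0.0
    # The constant 1.0 keeps distinct residues with equal hydropathy apart.
    return 1.0 + abs(HYDROPATHY[a] - HYDROPATHY[b])


def power_distance(a, b, alpha):
    """Base distance raised to a real exponent alpha (0 for identical residues)."""
    if a == b:
        return 0.0
    return base_distance(a, b) ** alpha


def satisfies_metric_axioms(alpha, tol=1e-12):
    """Brute-force check of identity, symmetry and the triangle inequality."""
    residues = list(HYDROPATHY)
    for a, b in itertools.product(residues, repeat=2):
        d_ab = power_distance(a, b, alpha)
        if (d_ab == 0.0) != (a == b):
            return False
        if abs(d_ab - power_distance(b, a, alpha)) > tol:
            return False
    for a, b, c in itertools.product(residues, repeat=3):
        direct = power_distance(a, c, alpha)
        detour = power_distance(a, b, alpha) + power_distance(b, c, alpha)
        if direct > detour + tol:
            return False
    return True


def sequence_distance(x, y, alpha):
    """Sum of per-position distances between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(power_distance(a, b, alpha) for a, b in zip(x, y))


def similarity(x, y, alpha, gamma=0.1):
    """Exponential similarity built on the sequence distance (illustrative only)."""
    return math.exp(-gamma * sequence_distance(x, y, alpha))


if __name__ == "__main__":
    for alpha in (0.0, 0.5, 1.0, 2.0):
        print(alpha, satisfies_metric_axioms(alpha))
```

In this sketch, exponents in the range 0 to 1 keep the triangle inequality because a concave power of a metric is again a metric, while an exponent of 2 breaks it for this base distance. With alpha set to 0, every mismatch costs 1, so `sequence_distance` reduces to the Hamming distance, which is one way a real-exponent family can connect to the mismatch-counting metrics found in common string kernels.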
