The Use of MSVM and HMM for Sentence Alignment

  • Received : 2011.07.05
  • Accepted : 2012.04.13
  • Published : 2012.06.30

Abstract

This paper presents two new approaches to aligning English-Arabic sentences in bilingual parallel corpora, based on the Multi-Class Support Vector Machine (MSVM) and Hidden Markov Model (HMM) classifiers. A feature vector is extracted from each text pair under consideration; it contains text features such as length, punctuation score, and cognate score. A set of manually prepared training data was used to train the MSVM and the HMM, and another set of data was used for testing. Both the MSVM and the HMM outperform the length-based approach. Moreover, these approaches are valid for any language pair and are quite flexible, since the feature vector may contain fewer, more, or different features than the ones used in the current research, such as a lexical matching feature or Hanzi characters in Japanese-Chinese texts.
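The abstract names the three features but does not define them. As a minimal sketch, assuming simple ratio-based definitions that are not taken from the paper, a feature vector for one candidate sentence pair might be computed as follows. The function names, the punctuation set, and the cognate heuristic (tokens that appear verbatim on both sides, such as numbers or untranslated names) are all illustrative assumptions.

```python
import string

# Assumed punctuation inventory: ASCII marks plus three common Arabic marks.
PUNCT_CHARS = string.punctuation + "،؛؟"


def _ratio(a: int, b: int) -> float:
    """Smaller count divided by larger count; 1.0 when both are zero."""
    hi = max(a, b)
    return 1.0 if hi == 0 else min(a, b) / hi


def length_ratio(src: str, tgt: str) -> float:
    """Character-length agreement of the two sentences, in [0, 1]."""
    return _ratio(len(src), len(tgt))


def punctuation_score(src: str, tgt: str) -> float:
    """Agreement of punctuation-mark counts, in [0, 1]."""
    count_src = sum(ch in PUNCT_CHARS for ch in src)
    count_tgt = sum(ch in PUNCT_CHARS for ch in tgt)
    return _ratio(count_src, count_tgt)


def _tokens(text: str) -> set:
    """Lowercased whitespace tokens with edge punctuation stripped."""
    stripped = (tok.strip(PUNCT_CHARS).lower() for tok in text.split())
    return {tok for tok in stripped if tok}


def cognate_score(src: str, tgt: str) -> float:
    """Share of tokens that occur verbatim in both sentences (assumed heuristic)."""
    src_tokens, tgt_tokens = _tokens(src), _tokens(tgt)
    smaller = min(len(src_tokens), len(tgt_tokens))
    if smaller == 0:
        return 0.0
    return len(src_tokens & tgt_tokens) / smaller


def feature_vector(src: str, tgt: str) -> list:
    """Hypothetical three-element feature vector for one sentence pair."""
    return [
        length_ratio(src, tgt),
        punctuation_score(src, tgt),
        cognate_score(src, tgt),
    ]


if __name__ == "__main__":
    english = "The meeting started at 10:30 on 5 May, 2011."
    arabic = "بدأ الاجتماع في 10:30 يوم 5 مايو، 2011."
    print(feature_vector(english, arabic))
```

A vector of this kind could then be passed to any multi-class classifier or sequence model. Because the classifier only sees numbers, adding or removing a feature changes the vector length but not the rest of the pipeline, which is the flexibility the abstract points to.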
