Segmentation of Long Chinese Sentences using Comma Classification

Segmentation of Long Chinese Sentences Based on Automatic Comma Classification

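  • 김미훈 (Department of Computer Science and Engineering, Pohang University of Science and Technology)
  • 김미영 (School of Computer and Information, Sungshin Women's University)
  • 이종혁 (Department of Computer Science and Engineering, Pohang University of Science and Technology)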
  • Published : 2006.05.01

Abstract

The longer the input sentence, the worse the parsing result. To improve parsing performance, many methods for long-sentence segmentation have been researched. As an isolating language, Chinese offers fewer cues for sentence segmentation. However, commas are used more frequently on average in Chinese than in other languages. The syntactic information that a comma conveys can play an important role in segmenting long Chinese sentences. This paper proposes a method for classifying commas in Chinese sentences according to the context in which each comma occurs. Sentences are then segmented using the classification result. The experimental results show that the accuracy of comma classification reaches 87.1%, and that with our segmentation model the dependency parsing accuracy of our parser improves by 5.6%.

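As input sentences grow longer, parsing accuracy drops sharply. Many long-sentence segmentation methods have therefore been studied to raise parsing accuracy on long sentences. As an isolating language, Chinese lacks the inflection and word-ending information that can help natural language processing; instead, it uses commas relatively often and accurately, and this comma usage can help long-sentence segmentation. In this paper, we examine the context around each comma in a Chinese sentence and use a Support Vector Machine to decide whether the sentence can be segmented at that comma position. The accuracy of comma classification reaches 87.1%, and when sentences are parsed after applying this segmentation model, the accuracy of the dependency trees increases by 5.6%.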
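
The two steps described above (classify each comma from its context, then segment at the commas judged to be boundaries) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not list the features or the SVM configuration, so the window-based features, the function names `comma_context` and `segment`, and the rule-based `toy_classifier` (a hypothetical stand-in for the trained Support Vector Machine) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

COMMA = "，"  # full-width comma used in Chinese text


def comma_context(tokens: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Collect the tokens within `window` positions of the comma at index i.

    Assumption: plain surrounding tokens serve as features here; the
    features actually used in the paper are not given in the abstract.
    """
    feats: Dict[str, str] = {}
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        feats[f"tok[{off:+d}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    return feats


def segment(
    tokens: List[str],
    is_boundary: Callable[[Dict[str, str]], bool],
    window: int = 2,
) -> List[List[str]]:
    """Split a tokenized sentence after every comma classified as a boundary."""
    segments: List[List[str]] = []
    current: List[str] = []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok == COMMA and is_boundary(comma_context(tokens, i, window)):
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def toy_classifier(feats: Dict[str, str]) -> bool:
    """Hypothetical stand-in for the trained SVM.

    Treats a comma as a segmentation point only when the next token is a
    pronoun, a crude signal that a new clause begins there.
    """
    return feats["tok[+1]"] in {"我", "你", "他", "她"}


# The first comma separates list items; the second separates two clauses.
EXAMPLE_TOKENS = ["我", "喜欢", "苹果", "，", "香蕉", "和", "梨", "，",
                  "他", "喜欢", "茶", "。"]
EXAMPLE_SEGMENTS = segment(EXAMPLE_TOKENS, toy_classifier)
```

In this sketch only the second comma is treated as a segmentation point, so the example sentence is split into two parts that could then be parsed separately. A trained classifier would replace `toy_classifier` while `segment` stays unchanged.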
