Korean Mobile Spam Filtering System Considering Characteristics of Text Messages

  • Received : 2010.05.12
  • Accepted : 2010.07.06
  • Published : 2010.07.31

Abstract

This paper introduces a Korean mobile spam filtering system that detects spam by taking into account the style of the short text messages sent to mobile phones. Unlike previous methods, which rely only on the occurrence of content words, the proposed system also uses style information to reduce the critical errors in which legitimate messages containing spam-like words are misclassified as spam. In addition, normalizing messages by correcting word-spacing and spelling errors improves spam classification accuracy. Experimental results on real-world Korean text messages show that the proposed system is effective for Korean mobile spam filtering.
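The idea of combining content-word evidence with style evidence can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the abstract does not list the actual style features, so `style_features` uses hypothetical cues (length, digit ratio, punctuation ratio, URL presence), and `normalize` is only a toy stand-in for the word-spacing and spelling correction described above. Computing style cues on the raw message and content words on the normalized text is likewise a design guess, not something the abstract specifies.

```python
import re
from collections import Counter

# Hypothetical URL pattern, used only as one illustrative style cue.
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def normalize(message: str) -> str:
    """Toy stand-in for normalization: lowercase and collapse whitespace.

    The paper corrects word spacing and spelling errors; that step is
    not reproduced here.
    """
    return re.sub(r"\s+", " ", message.strip().lower())


def content_features(text: str) -> dict:
    """Bag-of-words counts, keyed as 'w=<token>'."""
    tokens = re.findall(r"\w+", text)
    return {f"w={tok}": float(cnt) for tok, cnt in Counter(tokens).items()}


def style_features(message: str) -> dict:
    """Hypothetical style cues computed on the raw (unnormalized) message."""
    n = max(len(message), 1)  # avoid division by zero on empty input
    digits = sum(ch.isdigit() for ch in message)
    punct = sum((not ch.isalnum()) and (not ch.isspace()) for ch in message)
    return {
        "style:length": float(len(message)),
        "style:digit_ratio": digits / n,
        "style:punct_ratio": punct / n,
        "style:has_url": 1.0 if URL_RE.search(message) else 0.0,
    }


def extract_features(message: str) -> dict:
    """Merge content-word features (normalized text) with style features (raw text)."""
    features = content_features(normalize(message))
    features.update(style_features(message))
    return features
```

A feature dictionary of this shape could be passed to any standard text classifier. The point of the sketch is only that a legitimate message and a spam message sharing the same words can still differ in the `style:` features.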
