A Normalization Method of Distorted Korean SMS Sentences for Spam Message Filtering

Kang, Seung-Shik

doi:10.3745/KTSDE.2014.3.7.271

KIPS Transactions on Software and Data Engineering (정보처리학회논문지:소프트웨어 및 데이터공학)

Volume 3 Issue 7 / Pages 271-276 / 2014 / 2287-5905 (pISSN) / 2734-0503 (eISSN)

Korea Information Processing Society (한국정보처리학회)



스팸 문자 필터링을 위한 변형된 한글 SMS 문장의 정규화 기법

Kang, Seung-Shik (강승식), School of Computer Science, Kookmin University

Received: 2014.03.19
Accepted: 2014.06.21
Published: 2014.07.31

https://doi.org/10.3745/KTSDE.2014.3.7.271


Abstract

Short message service (SMS) on mobile phones is a very convenient new form of communication. However, it has had the serious side effect of flooding users with advertising spam messages. To keep their messages from being blocked automatically, spam senders distort or deform Korean sentences in various ways. Blocking such messages automatically therefore requires recovering the distorted sentences into normal Korean sentences. This paper proposes a method that normalizes the various types of distorted spam sentences and then extracts keywords from the normalized sentences through automatic word spacing and compound noun decomposition.
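
The pipeline named in the abstract (normalization, automatic word spacing, compound noun decomposition, keyword extraction) can be illustrated with a minimal Python sketch. This is not the paper's algorithm, whose details are not given on this page. It assumes that distortion consists only of symbols and spaces inserted between Hangul syllables, and it substitutes greedy longest-match lookup for the word-spacing step. The names `NOUNS`, `LEXICON`, `normalize`, `segment`, `decompose`, and `extract_keywords` are hypothetical, and the lexicon is a toy example.

```python
import re

# Hypothetical toy noun list; a real system would use a large dictionary.
NOUNS = {"무료", "대출", "상담"}
# Segmentation lexicon: the nouns plus one compound noun built from them.
LEXICON = NOUNS | {"대출상담"}

def normalize(message: str) -> str:
    """Drop every character that is not a precomposed Hangul syllable.

    Assumption: spammers distort text by inserting symbols or spaces,
    so removing them recovers the underlying syllable sequence.
    """
    return re.sub(r"[^\uAC00-\uD7A3]", "", message)

def segment(text: str, lexicon: set, max_len: int = 6) -> list:
    """Greedy longest-match word spacing (a stand-in technique)."""
    words = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            # No lexicon entry starts here: emit the syllable on its own.
            words.append(text[i])
            i += 1
    return words

def decompose(word: str, nouns: set) -> list:
    """Split a compound into two known nouns, if such a split exists."""
    for k in range(1, len(word)):
        if word[:k] in nouns and word[k:] in nouns:
            return [word[:k], word[k:]]
    return [word]

def extract_keywords(message: str) -> list:
    """Run the full sketch and keep only pieces found in the noun list."""
    keywords = []
    for word in segment(normalize(message), LEXICON):
        keywords.extend(p for p in decompose(word, NOUNS) if p in NOUNS)
    return keywords

print(extract_keywords("무.료 대*출상~담"))  # ['무료', '대출', '상담']
```

Because normalization removes the original spaces along with the inserted symbols, the word boundaries have to be rebuilt afterwards, which is why a word-spacing step follows it in this sketch.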


