A Splog Detection System Using Support Vector Systems

Lee, Song-Wook

doi:10.6109/jkiice.2011.15.1.163

Journal of the Korea Institute of Information and Communication Engineering (한국정보통신학회논문지)

Volume 15 Issue 1 / Pages 163-168 / 2011 / 2234-4772 (pISSN) / 2288-4165 (eISSN)

The Korea Institute of Information and Communication Engineering (한국정보통신학회)

Korean title (translated): A Spam Blog (Splog) Detection System Using Support Vector Machines

Lee, Song-Wook (Chungju National University)

Received : 2010.10.19
Accepted : 2010.11.02
Published : 2011.01.31

Abstract

Blogs are an easy way to publish information, take part in discussions, and form communities on the Internet. Recently, however, many kinds of spam blogs (splogs) have appeared whose purpose is to host advertisements or raise the PageRank of target sites. Our goal is to develop a system that automatically detects splogs among blogs on the Web. After the HTML is removed from each blog, the text is tagged by a part-of-speech (POS) tagger, and word/POS-tag pairs are used as features. We select useful features with the χ² statistic, represent the weights of the selected features as a vector, and train a support vector machine (SVM) on them. On the SPLOG data set, the system achieved an F1 measure of 90.5%.
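The abstract gives no implementation details, so the following is only a minimal Python sketch of three steps it names: HTML removal, χ² feature selection over word/POS pairs, and the F1 measure. All function names are hypothetical. POS tagging is assumed to have been done by an external tagger, and the χ² formula is the common 2x2 contingency form used in text categorization, which may differ from the exact variant used in the paper. SVM training itself is left out of this sketch.

```python
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards tags (a crude stand-in for HTML removal)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return the text content of an HTML string with whitespace normalized."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 feature/class table.

    a: splogs containing the feature     b: non-splogs containing it
    c: splogs lacking the feature        d: non-splogs lacking it
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, k):
    """Rank (word, POS) features by chi-square and keep the top k.

    docs: list of sets of (word, pos) pairs, assumed already POS-tagged.
    labels: 1 for splog, 0 for ordinary blog.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    in_pos, in_neg = Counter(), Counter()
    for feats, y in zip(docs, labels):
        (in_pos if y == 1 else in_neg).update(feats)
    scores = {}
    for f in set(in_pos) | set(in_neg):
        a, b = in_pos[f], in_neg[f]
        scores[f] = chi_square(a, b, n_pos - a, n_neg - b)
    # Sort by descending score; ties are broken by the feature itself.
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return ranked[:k], scores


def f1_score(gold, pred):
    """Harmonic mean of precision and recall for the splog class (label 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In a full system, the selected features would be turned into weighted vectors and passed to an SVM trainer; that part is omitted here because the abstract does not specify the weighting scheme or kernel.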

Cited by

A Profanity Blocking System for Online Game Chat, vol.15, pp.7, 2011, https://doi.org/10.6109/jkiice.2011.15.7.1531
Trends in Text Analysis Technology and Its Applications, vol.42, pp.2, 2017, https://doi.org/10.7840/kics.2017.42.2.471
