Topical Clustering Techniques of Twitter Documents Using Korean Wikipedia

Chang, Jae-Young

doi:10.7236/JIIBC.2014.14.5.189

The Journal of the Institute of Internet, Broadcasting and Communication (한국인터넷방송통신학회논문지)

Volume 14, Issue 5 / Pages 189-196 / 2014 / 2289-0238 (pISSN) / 2289-0246 (eISSN)

The Institute of Internet, Broadcasting and Communication (한국인터넷방송통신학회)

한글 위키피디아를 이용한 트위터 문서의 주제별 클러스터링 기법

Chang, Jae-Young (Dept. of Computer Engineering, Hansung University)

장재영 (한성대학교 컴퓨터공학과)

Received : 2014.07.15
Accepted : 2014.10.10
Published : 2014.10.31

https://doi.org/10.7236/JIIBC.2014.14.5.189

Abstract

Recently, the need for document retrieval in SNS environments such as Twitter has been growing. Supporting Twitter search requires a clustering technique that groups the large number of retrieved documents by topic. However, because of the nature of Twitter, simple existing clustering techniques are of limited use when applied directly to Twitter documents. To overcome this problem, this paper proposes a new clustering technique suited to the Twitter environment. The proposed method uses Korean Wikipedia to augment the feature vector of each Twitter document with new terms and to recalculate the weights of its features. Experiments on Korean Twitter documents, with a performance comparison against previous techniques, demonstrate the usefulness of the proposed method.
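The abstract gives no formulas, so the following is only a minimal sketch of the general idea of augmenting and re-weighting a short document's feature vector with an external resource. It is not the paper's method. Everything specific in it is an assumption made for illustration: the `RELATED_TERMS` dictionary is a toy stand-in for a Korean Wikipedia lookup, the tokens are English placeholders, `added_weight` and `boost` are arbitrary constants, and raw term frequency with cosine similarity is chosen only because it is a common baseline.

```python
import math
from collections import Counter

# Hypothetical stand-in for a Korean Wikipedia lookup: term -> related terms.
# This mapping is invented for the example.
RELATED_TERMS = {
    "galaxy": ["samsung", "smartphone"],
    "iphone": ["apple", "smartphone"],
}


def base_vector(tokens):
    """Term-frequency feature vector for one short document."""
    return dict(Counter(tokens))


def augment(vector, related=RELATED_TERMS, added_weight=0.5, boost=1.5):
    """Return a copy of the vector with re-weighted and newly added features.

    added_weight and boost are assumed constants, not values from the paper.
    """
    # Re-weight terms that the external resource knows about.
    out = {t: (w * boost if t in related else w) for t, w in vector.items()}
    # Add the related terms as new (or strengthened) features.
    for term, weight in vector.items():
        for extra in related.get(term, []):
            out[extra] = out.get(extra, 0.0) + added_weight * weight
    return out


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two short documents that share no surface terms.
doc_a = base_vector(["galaxy", "release"])
doc_b = base_vector(["iphone", "price"])

sim_before = cosine(doc_a, doc_b)
sim_after = cosine(augment(doc_a), augment(doc_b))
print(sim_before, sim_after)
```

In this toy case the two documents have no terms in common, so their similarity is zero before augmentation. After augmentation both contain the added feature `smartphone`, so a similarity-based clustering step would have some evidence for grouping them together.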
