Financial Fraud Detection using Text Mining Analysis against Municipal Cybercriminality

Korean title: A Text Mining Method for Detecting Financial Fraud to Keep Municipal Cyberspace Safe

  • Received : 2017.07.12
  • Accepted : 2017.09.20
  • Published : 2017.09.30

Abstract

Social networking services (SNS) have recently become an important channel for marketing as well as personal communication. However, cybercrime has evolved along with information and communication technology, and illegal advertising is now distributed on SNS in large quantities. As a result, loss of personal information and monetary damage occur more frequently. In this study, we propose a method for determining which sentences and documents posted to SNS are related to financial fraud. As a conceptual framework, we first developed a matrix of the conceptual characteristics of cybercriminality on SNS and of emergency management. We also suggest an emergency management process consisting of Pre-Cybercriminality (e.g., risk identification) and Post-Cybercriminality steps; this paper focuses on risk identification. The main process consists of data collection, preprocessing, and analysis. First, we selected two seed words, 'daechul' (loan) and 'sachae' (private loan), and collected data containing these words from SNS such as Twitter. Two researchers then judged whether each collected item was related to cybercriminality, particularly financial fraud. From the vocabulary of these items, we selected terms as keywords if they were nominals or symbols. Using the selected keywords, we searched web sources such as Twitter, news sites, and blogs, and collected more than 820,000 articles. The collected articles were refined through preprocessing and turned into learning data. Preprocessing is divided into three steps: morphological analysis, stop-word removal, and selection of valid parts of speech. In the morphological analysis step, a complex sentence is decomposed into morpheme units to enable mechanical analysis. In the stop-word removal step, non-lexical elements such as numbers, punctuation marks, and double spaces are removed from the text.
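The stop-word removal step can be illustrated with a minimal Python sketch. It assumes that "non-lexical elements" means digits, a small set of punctuation marks, and repeated whitespace; the function name `strip_non_lexical` and the `PUNCTUATION` pattern are hypothetical and not taken from the study. Morphological analysis, which needs a Korean morphological analyzer, is not shown.

```python
import re

# Assumption: this small set stands in for the punctuation marks
# that the study treats as non-lexical.
PUNCTUATION = re.compile(r"[.,!?;:()]")


def strip_non_lexical(text: str) -> str:
    """Remove digits and punctuation, then collapse repeated whitespace."""
    text = re.sub(r"\d+", " ", text)       # numbers
    text = PUNCTUATION.sub(" ", text)      # punctuation marks
    return re.sub(r"\s+", " ", text).strip()  # double spaces
```

For example, `strip_non_lexical("Loan  approved in 24 hours, call now!")` keeps only the words, separated by single spaces.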
In the step of selecting valid parts of speech, only two kinds are retained: nouns and symbols. Because nouns refer to things, they express the intent of a message better than other parts of speech. Moreover, the more illegal a text is, the more frequently symbols are used. To turn the preprocessed data into learning data, each item must be classified as legitimate or not, so each selected item is labeled 'legal' or 'illegal'. The processed data are then converted into a corpus and a Document-Term Matrix. Finally, the 'legal' and 'illegal' files were mixed and randomly divided into a learning data set (70%) and a test data set (30%). A support vector machine (SVM) was used as the discrimination algorithm. Since SVM requires gamma and cost values as its main parameters, we set gamma to 0.5 and cost to 10, based on the optimal value function; this cost is higher than in general cases. To show the feasibility of the proposed idea, we compared the proposed method with MLE (Maximum Likelihood Estimation), Term Frequency, and a Collective Intelligence method, using overall accuracy as the metric. The overall accuracy of the proposed method was 92.41% for illegal loan advertisements and 77.75% for illegal door-to-door sales, which is superior to that of Term Frequency, MLE, and the other compared methods. Hence, the results suggest that the proposed method is valid and practically usable. In this paper, we propose a framework for managing crises caused by abnormalities in unstructured data sources such as SNS. We hope this study contributes to academia by identifying what to consider when applying SVM-like discrimination algorithms to text analysis. It should also be useful to practitioners in brand management and opinion mining.
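The conversion to a Document-Term Matrix, the random 70/30 split, and the overall accuracy metric can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the study's implementation: documents are assumed to be already tokenized lists of nouns and symbols, all function names are hypothetical, the fixed random seed is an arbitrary choice, and SVM training itself is left out.

```python
import random
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, rows); rows[i][j] counts term j in document i."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {term: j for j, term in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for term, count in Counter(doc).items():
            row[index[term]] = count
        rows.append(row)
    return vocab, rows


def split_70_30(items, seed=0):
    """Shuffle a copy of items and split it into 70% / 30% parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seed is an assumption
    cut = int(round(len(shuffled) * 0.7))
    return shuffled[:cut], shuffled[cut:]


def overall_accuracy(y_true, y_pred):
    """Share of documents whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```

Each matrix row, paired with its 'legal' or 'illegal' label, is the kind of input a classifier such as an SVM would be trained on; `overall_accuracy` is then computed on the held-out 30%.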

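Recently, SNS has become established as an important channel not only for personal communication but also for marketing. However, cybercrime has also evolved with advances in information and communication technology, and illegal advertisements are distributed on SNS in large quantities. As a result, personal information is stolen and monetary losses occur frequently. This study proposes a methodology that analyzes unstructured data, namely promotional posts delivered through SNS, to determine which posts are related to financial fraud (e.g., illegal lending and illegal door-to-door sales). The main components of the framework are the process of building learning data from illegal promotional posts, a scheme for composing input data in light of the data's characteristics, the choice of discrimination algorithm, and the selection of the information to be extracted. The proposed method was used in a pilot test of a financial fraud prevention program at a local government, and analysis with real data verified that its accuracy in identifying financial fraud posts was far superior to human judgment, keyword extraction (Term Frequency), and MLE.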
