Utility of Synthetic Data in Finances: An Application of Online P2P Lending Loan Default Analysis

  • Minchae Song (NH NongHyup Financial Group, NH Financial Research Institute)
  • Received: 2024.06.07
  • Reviewed: 2024.07.16
  • Published: 2024.08.31

Abstract

To promote AI applications in the financial industry, the financial sector has recently been paying attention to synthetic data technology. Synthetic data is generated using a purpose-built mathematical model or algorithm, with the aim of solving a set of data science tasks. This study evaluates the utility of synthetic data by analyzing heterogeneous tabular data that is composed of discrete, categorical, and continuous variables and is imbalanced, as is common in the financial sector. The TGAN and CTGAN models, which are designed for the characteristics of tabular data, are applied as the synthetic data generation techniques. When utility is evaluated in terms of resemblance and machine learning efficiency, the synthetic data from TGAN scores high, while the quality of the data from CTGAN is relatively poor. This gap is attributed mainly to how categorical variables are generated, which suggests that the way a generation model handles categorical variables is a major factor in determining the utility of the generated synthetic data.
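As a rough illustration of what a resemblance check on a categorical variable can look like, the sketch below compares category frequencies in a real column and a synthetic column. The abstract does not state which resemblance metrics the study uses, so the total variation distance here is an assumed stand-in, and the column values and function names are hypothetical.

```python
from collections import Counter


def category_frequencies(values):
    """Return the relative frequency of each category in a non-empty column."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(real_column, synthetic_column):
    """Total variation distance between two categorical columns.

    0.0 means identical category frequencies; 1.0 means no shared categories.
    This metric is an assumption for illustration, not the paper's stated method.
    """
    real_freq = category_frequencies(real_column)
    synth_freq = category_frequencies(synthetic_column)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq.get(c, 0.0) - synth_freq.get(c, 0.0)) for c in categories
    )


# Hypothetical, imbalanced loan-status columns (not the paper's data).
real_status = ["paid"] * 8 + ["default"] * 2
synthetic_good = ["paid"] * 8 + ["default"] * 2   # preserves the minority class
synthetic_poor = ["paid"] * 10                    # minority class collapsed

print(total_variation_distance(real_status, synthetic_good))
print(total_variation_distance(real_status, synthetic_poor))
```

A generator that drops or under-represents the minority category, as in the second hypothetical column, yields a larger distance, which is the kind of categorical-variable failure the abstract points to.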
