Feature Selection Using Submodular Approach for Financial Big Data

  • Attigeri, Girija (Dept. of Information and Communication Technology, Manipal Institute of Technology, Manipal Academy of Higher Education) ;
  • Manohara Pai, M.M. (Manipal Institute of Technology, Manipal Academy of Higher Education) ;
  • Pai, Radhika M. (Manipal Institute of Technology, Manipal Academy of Higher Education)
  • Received : 2018.07.20
  • Accepted : 2019.02.02
  • Published : 2019.12.31

Abstract

As the world moves towards digitization, data is generated from a growing variety of sources at an increasing rate. Data at this scale is termed big data. The financial sector is one domain that needs to leverage this data to identify financial risks, fraudulent activities, and so on. Designing predictive models for such financial big data is imperative for maintaining the health of a country's economy. Financial data has many features, such as transaction history, repayment data, purchase data, and investment data. The main problem in building a predictive algorithm is finding the right subset of representative features from which the model can be constructed for a particular task. This paper proposes a correlation-based method that uses submodular optimization to select an optimal number of features, thereby reducing the dimensionality of the data for faster and better prediction. The central proposition is that the optimal feature subset should contain features that are highly correlated with the class label but not correlated with each other. Experiments are conducted to understand the effect of the various subsets on different classification algorithms for loan data. The IBM Bluemix BigData platform is used for experimentation, along with the Spark notebook. The results indicate that the proposed approach achieves considerable accuracy with optimal subsets in significantly less execution time. The algorithm is also compared with existing feature selection and extraction algorithms.
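The proposition in the abstract (high correlation with the class label, low correlation among the selected features) can be illustrated with a small greedy sketch. The abstract does not state the paper's exact objective, so the scoring rule below is an assumption: relevance minus a pairwise redundancy penalty. That objective has diminishing marginal gains and is therefore submodular, though not necessarily monotone. The function name `greedy_select`, the weight `lam`, and the synthetic data are all hypothetical, and every column is assumed to be non-constant so that the correlations are defined.

```python
import numpy as np

def greedy_select(X, y, k, lam=1.0):
    """Greedily pick up to k feature indices (illustrative objective, not the paper's)."""
    n_features = X.shape[1]
    # Relevance: absolute Pearson correlation of each feature with the label.
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    # Redundancy: absolute pairwise correlation between features.
    redundancy = np.abs(np.corrcoef(X, rowvar=False))

    selected = []
    remaining = set(range(n_features))
    while remaining and len(selected) < k:
        # Marginal gain of a candidate: its relevance minus the penalty for
        # correlating with features that are already selected.
        gains = {
            j: relevance[j] - lam * sum(redundancy[j, i] for i in selected)
            for j in remaining
        }
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break  # no remaining feature improves the objective
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (hypothetical): columns 0 and 1 are near-duplicates,
# column 2 is an independent signal, column 3 is pure noise.
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
X = np.column_stack([a, a + 0.01 * rng.normal(size=n), b, rng.normal(size=n)])
y = (a + b > 0).astype(float)

selected = greedy_select(X, y, k=2)
print(selected)
```

On this data the sketch keeps the independent signal and only one of the two near-duplicate columns, because the second duplicate's redundancy penalty outweighs its relevance.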
