
Self-optimizing feature selection algorithm for enhancing campaign effectiveness

  • Seo, Jeoung-soo (ktds / Graduate School of Business IT, Kookmin University)
  • Ahn, Hyunchul (Graduate School of Business IT, Kookmin University)
  • Received : 2020.11.19
  • Accepted : 2020.12.28
  • Published : 2020.12.31

Abstract

Predicting the success of customer campaigns has long been studied in academia, and prediction models applying various techniques are still being developed. Recently, as the rapid growth of online channels has expanded the ways campaigns reach customers, companies run campaigns in a variety and volume incomparable to the past. Customers, however, increasingly perceive campaigns as spam as fatigue from duplicate exposure grows, while from the corporate standpoint the cost invested in campaigns keeps rising and the actual success rate keeps falling, so the effectiveness of campaigns themselves is declining. Accordingly, various studies are under way to improve campaign effectiveness in practice. A campaign system collects and analyzes diverse customer-related data and applies them to campaigns with the ultimate aim of raising the success rate, and recent work in particular attempts to predict campaign response using machine learning.

Because campaign data contain many heterogeneous features, selecting appropriate features is critical. If all input features are used to classify a large volume of data, training time grows as the classification task expands, so a minimal input feature set should be extracted from the full data. Moreover, a model trained on too many features can lose prediction accuracy through overfitting or correlation between features, so a feature selection technique that removes features close to noise should be applied; feature selection is a necessary step in analyzing high-dimensional data sets. Among greedy algorithms, SFS (Sequential Forward Selection), SBS (Sequential Backward Selection), and SFFS (Sequential Floating Forward Selection) have been widely used as traditional feature selection techniques, but because they build models that learn only the features judged optimal at each step, they carry a high risk of overfitting, and when the number of features is large they suffer from degraded classification performance and long training times.

This study therefore proposes an improved feature selection algorithm to enhance the effectiveness of existing campaigns. The purpose is to improve the sequential procedure of the existing SFFS in the search for the feature subsets that underpin machine learning model performance, using statistical characteristics of the data processed in the campaign system. Features that strongly influence performance are derived first and features with a negative effect are removed, after which the sequential procedure is applied, improving search efficiency and enabling generalized prediction.

Validated on real campaign data, the proposed model showed better search and prediction performance than the traditional greedy algorithms: campaign success prediction was higher than with the original feature set, the greedy algorithms, a genetic algorithm (GA), and recursive feature elimination (RFE). In addition, the improved feature selection algorithm provides the importance of the derived features, which helps analyze and interpret the prediction results.
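
To make the baseline concrete, below is a minimal sketch of SFFS as a wrapper method on synthetic data: forward steps greedily add the feature that most improves a cross-validated score, and floating steps conditionally drop a feature when the reduced subset beats the best subset previously found at that size. The random forest classifier, 3-fold accuracy criterion, and data here are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def subset_score(X, y, subset, model):
    # Wrapper criterion: mean 3-fold cross-validated accuracy on the subset.
    return cross_val_score(model, X[:, sorted(subset)], y, cv=3).mean()

def sffs(X, y, model, k):
    selected, best_at_size = set(), {}
    while len(selected) < k:
        # Forward step: add the feature that most improves the criterion.
        remaining = [f for f in range(X.shape[1]) if f not in selected]
        f_add = max(remaining, key=lambda f: subset_score(X, y, selected | {f}, model))
        selected.add(f_add)
        best_at_size[len(selected)] = subset_score(X, y, selected, model)
        # Floating step: drop a feature if the reduced subset strictly beats
        # the best subset previously recorded at that smaller size.
        while len(selected) > 2:
            f_rm = max(selected, key=lambda f: subset_score(X, y, selected - {f}, model))
            reduced = subset_score(X, y, selected - {f_rm}, model)
            if reduced > best_at_size.get(len(selected) - 1, -np.inf):
                selected.remove(f_rm)
                best_at_size[len(selected)] = reduced
            else:
                break
    return sorted(selected)

X, y = make_classification(n_samples=300, n_features=15, n_informative=5, random_state=0)
print(sffs(X, y, RandomForestClassifier(n_estimators=50, random_state=0), k=5))
```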
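
Of the benchmark methods named above, RFE ships with scikit-learn, so a brief usage example is easy to show; the estimator and the number of features to keep are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=0), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger values were eliminated earlier
```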
The selected features included attributes already known statistically to be important, such as age, customer rating, and sales. Unexpectedly, features that campaign planners had rarely used to select targets, such as the combined product name, the average data consumption rate over three months, and wireless data usage over the last three months, were also chosen as important for campaign response. It was also confirmed that basic customer attributes can be highly important features depending on the campaign type. These results make it possible to analyze and understand the important characteristics of each campaign type.
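
The two-stage idea the abstract describes can be sketched as follows: statistically pre-rank the features, discard those that look like noise or hurt performance, and only then run the sequential search over the survivors, reporting importances for interpretation at the end. The abstract does not name the statistical criterion or the importance measure, so the univariate ANOVA F-test filter and random forest importances used here are stand-in assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif

X, y = make_classification(n_samples=300, n_features=30, n_informative=6, random_state=1)

# Stage 1: statistical pre-filter (assumed criterion). Keep features whose
# univariate F-test suggests a real association with the campaign response.
f_scores, p_values = f_classif(X, y)
survivors = np.where(p_values < 0.05)[0]

# Stage 2: run the sequential (floating) search only over the survivors,
# e.g. the sffs() sketch above; returned indices would refer to `survivors`.
X_reduced = X[:, survivors]
model = RandomForestClassifier(n_estimators=100, random_state=0)
# chosen = sffs(X_reduced, y, model, k=5)

# Interpretation: fit on the reduced set and rank feature importances,
# mirroring the abstract's point that importances aid result analysis.
model.fit(X_reduced, y)
ranking = sorted(zip(survivors, model.feature_importances_), key=lambda t: -t[1])
print(ranking[:5])
```

Pruning the candidate pool before the sequential search is what reduces the wrapper's cost, since each forward or floating step retrains the model once per remaining candidate feature.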
