Feature selection for text data via topic modeling

Woosol Jang; Ye Eun Kim; Won Son

doi:10.5351/KJAS.2022.35.6.739

The Korean Journal of Applied Statistics (응용통계연구)

Volume 35, Issue 6 / Pages 739-754 / 2022 / 1225-066X (pISSN) / 2383-5818 (eISSN)

The Korean Statistical Society (한국통계학회)

토픽 모형을 이용한 텍스트 데이터의 단어 선택

Woosol Jang (장우솔), Department of Applied Statistics, Graduate School, Dankook University
Ye Eun Kim (김예은), Department of Applied Statistics, Graduate School, Dankook University
Won Son (손원), Department of Applied Statistics, Graduate School, Dankook University

Received : 2022.08.25
Accepted : 2022.10.10
Published : 2022.12.31

https://doi.org/10.5351/KJAS.2022.35.6.739

Abstract

Text data usually contain many variables, some of which are highly correlated. Such multicollinearity often makes statistical analysis inefficient or inaccurate. In supervised learning, features can be selected by examining the relationship between the target variable and the explanatory variables. In unsupervised learning, however, no target variable is available, so this kind of feature selection cannot be applied directly. In this study, we propose a word selection procedure that uses topic models to find latent topics. We substitute the topics for the target variable and select the terms that are most relevant to each topic. Applying the procedure to real data, we found that it gives a clearer interpretation of the topics by removing high-frequency words that are prevalent across many topics. We also observed that, when the selected terms are used in classifiers such as naïve Bayes classifiers and support vector machines, the proposed procedure gives results comparable to those obtained by selecting features with class label information.

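Text data generally contain many variables, and the variables are highly associated with one another, which can cause problems for the accuracy and efficiency of statistical analysis. To cope with these problems in supervised learning, where a target variable is given, words that explain the target variable well are sometimes selected and only these words are used in the analysis. In unsupervised learning, by contrast, no target variable is given, so a word selection procedure like that of supervised learning is difficult to apply. In this study, we propose a word selection procedure that uses a topic model to generate topics that can stand in for the target variable of supervised learning and then selects the words most strongly associated with each topic. When the proposed procedure was applied to real text data, it made the topics easier to identify by removing words that appear frequently across many topics. In addition, when applied to cluster analysis, it produced clustering results in which the clusters were strongly associated with the categories. We also confirmed that, when words selected through the topic model without any information on the target variable were used for classification, the classification accuracy was similar to that obtained when the words were selected using the target variable.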
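The selection step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a topic model has already been fitted and each document has been labeled with its dominant topic (the `topics` list), and it uses a chi-square statistic restricted to positively associated terms as the relevance score, which the abstract does not specify. The names `chi_square` and `select_terms` and the toy documents are hypothetical.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term-by-topic contingency table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # term occurs in every document or in none
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_terms(docs, topics, k):
    """Return, for each topic, the k terms with the largest relevance score.

    docs   : list of token lists
    topics : dominant topic of each document (stand-in for a target variable)
    """
    n_docs = len(docs)
    doc_sets = [set(d) for d in docs]
    vocab = sorted(set().union(*doc_sets))
    topic_sizes = Counter(topics)
    df = Counter(t for s in doc_sets for t in s)   # document frequency
    df_in = {z: Counter() for z in topic_sizes}    # frequency within a topic
    for s, z in zip(doc_sets, topics):
        df_in[z].update(s)

    selected = {}
    for z, n_z in topic_sizes.items():
        scores = {}
        for term in vocab:
            n11 = df_in[z][term]       # in topic, contains term
            n10 = df[term] - n11       # outside topic, contains term
            n01 = n_z - n11            # in topic, lacks term
            n00 = n_docs - n_z - n10   # outside topic, lacks term
            # Assumption: keep only terms positively associated with the topic.
            positive = n11 * n00 > n10 * n01
            scores[term] = chi_square(n11, n10, n01, n00) if positive else 0.0
        ranked = sorted(vocab, key=lambda t: (-scores[t], t))
        selected[z] = ranked[:k]
    return selected


# Toy corpus: "the" appears in every document, so it scores zero for all topics.
docs = [
    ["the", "game", "team", "score"],
    ["the", "team", "coach", "game"],
    ["the", "market", "stock", "price"],
    ["the", "stock", "bank", "market"],
]
topics = [0, 0, 1, 1]
selected = select_terms(docs, topics, k=2)
print(selected)  # {0: ['game', 'team'], 1: ['market', 'stock']}
```

In this toy example the word shared by all documents is never selected, which mirrors the abstract's observation that high-frequency words common to many topics are removed.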
