Pairwise fusion approach to cluster analysis with applications to movie data

A clustering method based on a pairwise fusion approach for movie data

  • Kim, Hui Jin (Department of Statistics, Sungkyunkwan University) ;
  • Park, Seyoung (Department of Statistics, Sungkyunkwan University)
  • 김희진 (성균관대학교 통계학과) ;
  • 박세영 (성균관대학교 통계학과)
  • Received : 2021.12.10
  • Accepted : 2022.02.05
  • Published : 2022.04.30

Abstract

MovieLens data consist of recorded movie ratings and have often been used to measure evaluation scores in recommender-system research. In this paper, we extract additional information by clustering user-specific genre preferences derived from the movie rating data and the movie genre data. Because the number of ratings per user is very low compared to the total number of movies, the missing rate in these data is very high, which limits the applicability of existing clustering methods. Motivated by the analysis of MovieLens data, we propose a convex clustering-based method that uses a pairwise fused penalty. In particular, the proposed method performs missing-value imputation and, at the same time, uses the movie ratings and the genre weights of each movie to cluster the genre preferences of each individual user. We solve the proposed optimization problem with the alternating direction method of multipliers (ADMM) algorithm. Simulations and an application to MovieLens data show that the proposed clustering method is less sensitive to noise and outliers than existing methods.

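MovieLens data, which record users' movie information, are of considerable value for exploring and validating ideas in recommender-system research, and have been used, for example, in studies that partition item sets based on user rating data using existing data partitioning and clustering algorithms. In this paper, we use the movie rating data and movie genre data commonly used in previous studies to predict each user's genre preferences, cluster users based on their preference patterns, and extract meaningful information. In MovieLens data, the average number of ratings per user is low relative to the total number of movies, so the missing rate is high. For this reason, there are limitations in applying existing clustering methods. Motivated by the characteristics of MovieLens data, we propose a convex clustering-based method that uses a pairwise fused penalty. In particular, the method differs from existing clustering analyses in that it is formulated as an optimization problem that also handles missing-value imputation at the same time. Clustering is carried out by solving the proposed optimization problem with ADMM, an iterative algorithm. In addition, simulations and an application to MovieLens data suggest that the proposed clustering method is relatively less sensitive to noise and outliers than existing methods.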
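
The combination described in the abstract (a pairwise fused penalty, simultaneous imputation, and an ADMM solver) can be illustrated with a minimal Python sketch. This is not the authors' formulation: it assumes a generic convex clustering objective with uniform pair weights, 0.5 * sum over observed entries of (x_ij - u_ij)^2 + lam * sum over pairs i<k of ||u_i - u_k||_2, and it omits the genre weights mentioned in the abstract. The function names, the toy matrix (rows as users, columns as hypothetical genre-preference scores), and the values of `lam`, `rho`, and `tol` are all illustrative assumptions.

```python
import numpy as np
from itertools import combinations


def convex_cluster_missing(X, lam, rho=1.0, n_iter=500):
    """ADMM sketch for convex clustering with missing entries (NaN in X).

    Assumes uniform pair weights and at least one observed entry per column.
    Returns the centroid matrix U; U at the NaN positions acts as an imputation.
    """
    n, p = X.shape
    mask = ~np.isnan(X)                      # True where an entry is observed
    X0 = np.where(mask, X, 0.0)              # zero-filled copy for the linear solves
    pairs = list(combinations(range(n), 2))
    D = np.zeros((len(pairs), n))            # pairwise difference operator: (D U)_l = u_i - u_k
    for l, (i, k) in enumerate(pairs):
        D[l, i], D[l, k] = 1.0, -1.0
    DtD = D.T @ D
    U = np.where(mask, X, np.nanmean(X, axis=0))  # start from column-mean imputation
    V = D @ U                                # split variable for the pairwise differences
    Z = np.zeros_like(V)                     # scaled dual variable
    for _ in range(n_iter):
        # U-update: one small linear system per column, since the mask differs by column
        R = D.T @ (V - Z)
        for j in range(p):
            A = np.diag(mask[:, j].astype(float)) + rho * DtD
            U[:, j] = np.linalg.solve(A, X0[:, j] + rho * R[:, j])
        # V-update: group soft-thresholding of each pairwise difference
        W = D @ U + Z
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        V = np.maximum(0.0, 1.0 - (lam / rho) / np.maximum(norms, 1e-12)) * W
        # Dual update
        Z = Z + D @ U - V
    return U


def labels_from_centroids(U, tol):
    """Greedy labelling: rows whose centroids lie within tol share a cluster."""
    labels = -np.ones(len(U), dtype=int)
    next_label = 0
    for i in range(len(U)):
        if labels[i] >= 0:
            continue
        close = np.linalg.norm(U - U[i], axis=1) < tol
        labels[close & (labels < 0)] = next_label
        next_label += 1
    return labels


# Toy data (hypothetical): two groups of three users, four scores each, four missing entries
base = np.array([[0.00, 0.05, -0.05, 0.00],
                 [0.05, 0.00, 0.00, -0.05],
                 [-0.05, 0.00, 0.05, 0.05]])
X = np.vstack([base, base + 5.0])
X[0, 1] = X[2, 3] = X[3, 0] = X[5, 2] = np.nan

U = convex_cluster_missing(X, lam=0.5)
labels = labels_from_centroids(U, tol=0.5)
```

In this sketch the fused penalty pulls the rows of `U` toward each other, so a user's missing scores are filled in from the users they fuse with rather than from a separate imputation step. The tolerance in `labels_from_centroids` is needed because a finite number of ADMM iterations fuses centroids only approximately.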
