Spectral clustering of weighted variables on multi-omics data

  • Yunjung Lee (Department of Statistics, Sungkyunkwan University) ;
  • Seyoung Park (Department of Statistics, Sungkyunkwan University)
  • Received : 2022.11.24
  • Reviewed : 2023.02.01
  • Published : 2023.06.30

Abstract

In recent years, high-throughput sequencing technologies have generated rich resources of diverse types of omics data. Integrating and analyzing multi-omics datasets, each of which carries a different part of the biological information, can improve our understanding of cancer etiology and treatment responses. One of the important goals of multi-omics analysis in cancer research is to identify subtypes within a cancer type. However, the high dimensionality and heterogeneity of omics data limit the applicability of existing clustering methods. In this paper, we propose a new clustering algorithm based on spectral clustering, a representative graph-theoretic method, in which a separate weight is assigned to each omics data type and to each gene. These weights make it possible to identify the important omics data types and genes, which distinguishes the proposed method from existing multi-omics analysis methods. The proposed optimization problem is non-convex, and clustering is carried out through an iterative update procedure. In simulations and real-data applications, the proposed method performed better than existing methods.
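The general idea of alternating between clustering and re-weighting can be illustrated with a small sketch. It is not the objective function or update rule of the paper, which the abstract does not state. It assumes a Gaussian kernel with a median-distance bandwidth, a normalized spectral embedding followed by a simple k-means, and weights set proportional to each gene's between-cluster sum of squares, a heuristic borrowed from sparse clustering. All function names (`gaussian_similarity`, `spectral_embedding`, `kmeans`, `between_cluster_ss`, `weighted_spectral_clustering`) and the toy data are hypothetical.

```python
import numpy as np


def gaussian_similarity(X, w):
    """Gaussian kernel on squared distances with per-column weights w."""
    Xw = X * np.sqrt(w)
    sq = np.sum(Xw ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Xw @ Xw.T, 0.0)
    pos = d2[d2 > 0]
    sigma = np.median(pos) if pos.size else 1.0  # bandwidth heuristic (assumption)
    return np.exp(-d2 / sigma)


def spectral_embedding(S, k):
    """Row-normalized top-k eigenvectors of D^{-1/2} S D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    A = d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(A)          # eigenvalues come back in ascending order
    U = vecs[:, -k:]
    return U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)


def kmeans(U, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [U[0]]
    for _ in range(1, k):
        d = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels


def between_cluster_ss(X, labels):
    """Per-column between-cluster sum of squares (total minus within)."""
    total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.zeros(X.shape[1])
    for j in np.unique(labels):
        Xj = X[labels == j]
        within += ((Xj - Xj.mean(axis=0)) ** 2).sum(axis=0)
    return np.maximum(total - within, 0.0)


def weighted_spectral_clustering(views, k, n_rounds=5):
    """Alternate between clustering and re-weighting omics types and genes."""
    feat_w = [np.full(X.shape[1], 1.0 / X.shape[1]) for X in views]
    view_w = np.full(len(views), 1.0 / len(views))
    for _ in range(n_rounds):
        # Step 1: cluster on the weighted combination of per-view similarities.
        S = sum(v * gaussian_similarity(X, w)
                for v, X, w in zip(view_w, views, feat_w))
        labels = kmeans(spectral_embedding(S, k), k)
        # Step 2: re-weight genes and views by how well they separate clusters.
        scores = [between_cluster_ss(X, labels) for X in views]
        feat_w = [s / max(s.sum(), 1e-12) for s in scores]
        ratio = np.array([s.sum() / ((X - X.mean(axis=0)) ** 2).sum()
                          for s, X in zip(scores, views)])
        view_w = ratio / max(ratio.sum(), 1e-12)
    return labels, view_w, feat_w


# Toy data: view 1 has two informative columns, view 2 is pure noise.
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 30)
X1 = rng.normal(size=(60, 10))
X1[:, :2] += 6.0 * truth[:, None]
X2 = rng.normal(size=(60, 5))
labels, view_w, feat_w = weighted_spectral_clustering([X1, X2], k=2)
```

On such toy data the informative view and its informative columns should end up with most of the weight, which is how weights of this kind can point to relevant omics types and genes. Because the alternating scheme is a heuristic for a non-convex problem, the result can depend on the initialization and on the bandwidth choice.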
