Case-Related News Filtering via Topic-Enhanced Positive-Unlabeled Learning

  • Wang, Guanwen (College of Information Engineering and Automation, Kunming University of Science and Technology) ;
  • Yu, Zhengtao (Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology) ;
  • Xian, Yantuan (College of Information Engineering and Automation, Kunming University of Science and Technology) ;
  • Zhang, Yu (College of Information Engineering and Automation, Kunming University of Science and Technology)
  • Received : 2021.01.22
  • Accepted : 2021.05.30
  • Published : 2021.12.31

Abstract

Case-related news filtering is a key task in legal text mining: it divides news into case-related and case-unrelated categories. Because case-related news originates from various fields and is written in different styles, it is difficult to establish complete filtering rules or keyword lists for data collection. In addition, labeled corpora for case-related news are sparse, so training a high-performance classification model would require annotating the corpus. To address this challenge, we propose topic-enhanced positive-unlabeled learning, which selects positive and negative samples guided by topics. Specifically, a topic model based on a variational autoencoder (VAE) is trained to extract topics from unlabeled samples. Using these topics in the iterative process of positive-unlabeled (PU) learning improves the accuracy of identifying case-related news. Experimental results show that the F1 value of our method on the test set is 1.8% higher than that of the PU learning baseline model. Our method is also more robust when few initial samples are available and when the number of iterations is high. Compared with advanced PU learning baselines such as nnPU and I-PU, it obtains a 1.1% higher F1 value, which indicates that our method can effectively identify case-related news.
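
The iterative, topic-guided sample selection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the selection rule, so the sketch assumes that each document already has a topic distribution (in the paper this would come from the VAE topic model; here it is a hand-written input), and it substitutes a simple cosine similarity to the centroid of the current positives for the trained classifier. The names `topic_guided_pu`, `k`, and `iterations` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def topic_guided_pu(topics, positive_ids, k=1, iterations=2):
    """Grow positive and negative sets from unlabeled documents.

    topics: dict mapping doc id -> topic distribution (assumed input,
            standing in for the output of a VAE-based topic model).
    positive_ids: ids of the initially labeled positive documents.
    k: documents moved to each set per iteration (assumed hyperparameter).
    """
    positives = set(positive_ids)
    negatives = set()
    for _ in range(iterations):
        unlabeled = [d for d in topics
                     if d not in positives and d not in negatives]
        if len(unlabeled) < 2 * k:
            break  # not enough unlabeled documents left to split
        center = centroid([topics[d] for d in sorted(positives)])
        # Rank unlabeled documents by topic similarity to the positives.
        ranked = sorted(unlabeled,
                        key=lambda d: cosine(topics[d], center),
                        reverse=True)
        positives.update(ranked[:k])   # most similar: treated as positive
        negatives.update(ranked[-k:])  # least similar: treated as negative
    return positives, negatives


# Toy data: two topics, one labeled positive document ("a").
toy_topics = {
    "a": [0.9, 0.1],
    "b": [0.8, 0.2],
    "c": [0.1, 0.9],
    "d": [0.7, 0.3],
    "e": [0.2, 0.8],
}
pos, neg = topic_guided_pu(toy_topics, ["a"], k=1, iterations=2)
print(sorted(pos), sorted(neg))
```

In a fuller pipeline, the ranking step would presumably combine classifier confidence with topic information rather than rely on topic similarity alone; the sketch only shows how topic distributions can steer which unlabeled samples are promoted in each round.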

Acknowledgement

This study was supported by the National Key Research and Development Project (No. 2018YFC0830100) and the Science and Technology Plan Projects of Yunnan Province (No. 202001AT070046).
