Source Code Identification Using Deep Neural Network

심층신경망을 이용한 소스 코드 원작자 식별 (Source Code Author Identification Using a Deep Neural Network)

  • 임지수 (Department of Computer Engineering, Inha University)
  • Received : 2019.01.21
  • Accepted : 2019.07.30
  • Published : 2019.09.30

Abstract

Because much program source code is openly available online, indiscriminate plagiarism and copyright problems are arising. Source code written repeatedly by the same author can carry a distinctive fingerprint that reflects that author's programming habits. This paper identifies the author of each program by training a deep neural network on Google Code Jam source code. The source code is first vectorized by a preprocessor, using either a prediction-based vector representation or a frequency-based approach such as TF-IDF, and a deep neural network is then trained on the resulting vectors to identify the original author of each program. Using the preprocessor, a language-independent learning system was constructed and compared with other existing learning methods. Among the configurations compared, the model combining TF-IDF with a deep neural network performed better than models using other preprocessors or other learning methods.
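
The pipeline described above (vectorize source code with a frequency-based preprocessor, then learn a mapping from vectors to authors) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy corpus, the author labels, the tokenizer regex, and all function names are hypothetical, and a cosine nearest-profile classifier stands in for the deep neural network, whose architecture the abstract does not specify. The IDF formula is one common smoothed variant.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus of (author, source) pairs; not data from the paper.
TRAIN = [
    ("alice", "for (int i = 0; i < n; ++i) { sum += a[i]; }"),
    ("alice", "for (int j = 0; j < m; ++j) { total += b[j]; }"),
    ("bob", "int idx = 0; while (idx < n) { sum = sum + a[idx]; idx++; }"),
    ("bob", "int k = 0; while (k < m) { total = total + b[k]; k++; }"),
]

# Lexical tokens only (identifiers, numbers, a few two-character operators,
# single punctuation marks), so no language-specific parser is required.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|--|[-+*/<>=!]=|[^\w\s]")


def tokenize(source):
    return TOKEN_RE.findall(source)


def fit_idf(docs):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def normalize(vec):
    """Scale a sparse vector (dict) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {tok: w / norm for tok, w in vec.items()}


def tfidf(doc, idf):
    """Unit-length TF-IDF vector; tokens unseen during fitting are dropped."""
    counts = Counter(tok for tok in doc if tok in idf)
    return normalize({tok: c * idf[tok] for tok, c in counts.items()})


def train(samples):
    """Fit IDF weights and build one unit-length profile vector per author."""
    docs = [tokenize(src) for _, src in samples]
    idf = fit_idf(docs)
    sums = {}
    for (author, _), doc in zip(samples, docs):
        acc = sums.setdefault(author, Counter())
        for tok, w in tfidf(doc, idf).items():
            acc[tok] += w
    profiles = {author: normalize(vec) for author, vec in sums.items()}
    return idf, profiles


def predict(source, idf, profiles):
    """Return the author whose profile has the highest cosine similarity.

    Stand-in classifier: the paper trains a deep neural network on the
    TF-IDF vectors instead of comparing against averaged profiles.
    """
    vec = tfidf(tokenize(source), idf)

    def score(author):
        profile = profiles[author]
        return sum(w * profile.get(tok, 0.0) for tok, w in vec.items())

    return max(profiles, key=score)


if __name__ == "__main__":
    idf, profiles = train(TRAIN)
    print(predict("for (int t = 0; t < q; ++t) { acc += c[t]; }", idf, profiles))
```

Because the tokenizer relies only on lexical patterns, the same preprocessing applies to source code in other languages. To move closer to the setup in the abstract, the `predict` step would be replaced by a neural network trained on the TF-IDF vectors.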
