The Identification Framework for source code author using Authorship Analysis and CNN

A Source Code Author Identification Framework Applying Authorship Analysis and CNN

  • Received : 2018.06.21
  • Accepted : 2018.08.29
  • Published : 2018.10.31

Abstract

As Internet technology has developed, a growing number of programs, and therefore a growing volume of source code, is being written by many different authors. Some people exploit this by passing off a program or code written by another author as their own, or by reusing other authors' code indiscriminately without indicating exactly which code was used. This makes code increasingly difficult to protect. In this paper, we propose an author identification framework that uses Authorship Analysis theory and natural language processing (NLP) based on a convolutional neural network (CNN). We apply Authorship Analysis theory to extract features suited to author identification from source code and combine them with features used in text mining to perform machine-learning-based author identification. In addition, we apply a CNN-based NLP method to source code to classify code authors. Our results indicate that identifying authors requires features designed specifically for author identification, and that CNN-based NLP methods can be applied to languages with a special structure, such as source code. Identification accuracy based on Authorship Analysis theory is 95.1%, and identification accuracy with the CNN is 98%.
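The feature-based half of the framework can be illustrated with a minimal sketch. The abstract does not list the actual features or the classifier, so everything below is an assumption made for illustration: the four layout features in `style_features`, character 3-grams standing in for the text-mining features, and a nearest-centroid classifier with cosine similarity standing in for the machine-learning step. All function names are hypothetical.

```python
from collections import Counter
import math


def style_features(code):
    """Hypothetical layout features; the paper's real feature set is not given here."""
    non_empty = [ln for ln in code.splitlines() if ln.strip()]
    n = len(non_empty) or 1  # avoid division by zero on empty input
    return {
        "avg_line_len": sum(len(ln) for ln in non_empty) / n / 100.0,
        "comment_ratio": sum(ln.lstrip().startswith(("#", "//")) for ln in non_empty) / n,
        "tab_indent_ratio": sum(ln.startswith("\t") for ln in non_empty) / n,
        "space_indent_ratio": sum(ln.startswith(" ") for ln in non_empty) / n,
    }


def char_ngrams(code, n=3):
    """Relative frequencies of character n-grams (a common text-mining feature)."""
    counts = Counter(code[i:i + n] for i in range(len(code) - n + 1))
    total = sum(counts.values()) or 1
    return {"ng:" + gram: c / total for gram, c in counts.items()}


def featurize(code):
    """Combine style features and n-gram features into one sparse vector."""
    feats = style_features(code)
    feats.update(char_ngrams(code))
    return feats


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train(samples):
    """samples: list of (author, code). Returns author -> mean feature vector."""
    sums, counts = {}, Counter()
    for author, code in samples:
        counts[author] += 1
        acc = sums.setdefault(author, Counter())
        for k, v in featurize(code).items():
            acc[k] += v
    return {a: {k: v / counts[a] for k, v in acc.items()} for a, acc in sums.items()}


def predict(centroids, code):
    """Return the author whose centroid is most similar to the given code."""
    feats = featurize(code)
    return max(centroids, key=lambda a: cosine(feats, centroids[a]))
```

Calling `train` on a list of `(author, code)` pairs and then `predict` on an unseen snippet returns the closest author label. A real system would likely use a richer feature set and a stronger classifier than this toy centroid model.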

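With the recent advancement of Internet technology, a wide variety of programs are being created, and accordingly a wide variety of code is being written by many people. Exploiting this, some people take code written by a particular author as-is and present it as their own, or use referenced code without accurate attribution, so protecting code is becoming increasingly difficult. This paper therefore proposes an author identification framework that applies Authorship Analysis theory and a natural language processing method based on a convolutional neural network (CNN). Authorship Analysis theory is applied to extract features suited to author identification from source code, and these are combined with features used in text mining to perform machine-learning-based author identification. A CNN-based natural language processing method is also applied to source code to classify code authors. Through this framework, the paper shows that identifying authors requires features dedicated to author identification, and that CNN-based natural language processing can also be applied to languages with a special structure, such as source code. In the experiments, author identification based on Authorship Analysis theory achieved an accuracy of 95.1%, and the CNN achieved an accuracy of 98% or higher when the number of iterations was 90 or more.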
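How a CNN-based NLP method might read source code can be sketched as follows. The abstract does not describe the architecture, so this is an assumption: a sentence-classification-style layout in which code tokens are embedded, convolved with filters of several widths, passed through ReLU, and reduced by max-over-time pooling. Only the forward pass is shown, with random untrained weights. The tokenizer, the dimensions, and all function names are hypothetical.

```python
import random
import re


def tokenize(code):
    """Split code into identifiers, numbers, and single punctuation symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def build_vocab(samples):
    """Map each token seen in the samples to an integer id."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for code in samples:
        for tok in tokenize(code):
            vocab.setdefault(tok, len(vocab))
    return vocab


def init_params(vocab_size, embed_dim=8, widths=(2, 3), filters_per_width=4, seed=0):
    """Random embeddings and filters; a real model would learn these by training."""
    rng = random.Random(seed)
    embed = [[rng.uniform(-0.5, 0.5) for _ in range(embed_dim)]
             for _ in range(vocab_size)]
    filters = {
        h: [[[rng.uniform(-0.5, 0.5) for _ in range(embed_dim)] for _ in range(h)]
            for _ in range(filters_per_width)]
        for h in widths
    }
    return embed, filters


def conv_max_features(code, vocab, embed, filters):
    """Embed tokens, slide each filter over them, apply ReLU, then max-pool over time."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokenize(code)]
    widest = max(filters)
    ids += [vocab["<pad>"]] * max(0, widest - len(ids))  # pad very short inputs
    x = [embed[i] for i in ids]
    feats = []
    for h, bank in sorted(filters.items()):
        for f in bank:
            # ReLU followed by max pooling equals max(0, max over positions).
            best = 0.0
            for start in range(len(x) - h + 1):
                s = sum(f[j][d] * x[start + j][d]
                        for j in range(h) for d in range(len(f[j])))
                best = max(best, s)
            feats.append(best)
    return feats
```

In a complete classifier, the pooled vector would feed a softmax layer with one output per author, and the embeddings and filters would be fitted over repeated passes through the training data.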

