Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task
In this paper we consider the research and development of classifiers that are trained to predict the task solved by source code. Possible applications of such task detection algorithms include method name prediction, hardware–software partitioning, programming standard violation detection, and sema...
Main Authors: | , , |
---|---|
Format: | Article |
Language: | English |
Published: |
MDPI AG
2023-09-01
|
Series: | Future Internet |
Subjects: | |
Online Access: | https://www.mdpi.com/1999-5903/15/9/314 |
_version_ | 1827726041988202496 |
---|---|
author | Artyom V. Gorchakov Liliya A. Demidova Peter N. Sovietov |
author_facet | Artyom V. Gorchakov Liliya A. Demidova Peter N. Sovietov |
author_sort | Artyom V. Gorchakov |
collection | DOAJ |
description | In this paper we consider the research and development of classifiers that are trained to predict the task solved by source code. Possible applications of such task detection algorithms include method name prediction, hardware–software partitioning, programming standard violation detection, and semantic code duplication search. We provide the comparative analysis of modern approaches to source code transformation into vector-based representations that extend the variety of classification and clustering algorithms that can be used for intelligent source code analysis. These approaches include word2vec, code2vec, first-order and second-order Markov chains constructed from abstract syntax trees (AST), histograms of assembly language instruction opcodes, and histograms of AST node types. The vectors obtained with the forementioned approaches are then used to train such classification algorithms as k-nearest neighbor (KNN), support vector machine (SVM), random forest (RF), and multilayer perceptron (MLP). The obtained results show that the use of program vectors based on first-order AST-based Markov chains with an RF-based classifier leads to the highest accuracy, precision, recall, and F1 score. Increasing the order of Markov chains considerably increases the dimensionality of a vector, without any improvements in classifier quality, so we assume that first-order Markov chains are best suitable for real world applications. Additionally, the experimental study shows that first-order AST-based Markov chains are least sensitive to the used classification algorithm. |
first_indexed | 2024-03-10T22:44:42Z |
format | Article |
id | doaj.art-bbe2cc774eb54572998d125d4c5a83a8 |
institution | Directory Open Access Journal |
issn | 1999-5903 |
language | English |
last_indexed | 2024-03-10T22:44:42Z |
publishDate | 2023-09-01 |
publisher | MDPI AG |
record_format | Article |
series | Future Internet |
spelling | doaj.art-bbe2cc774eb54572998d125d4c5a83a82023-11-19T10:49:29ZengMDPI AGFuture Internet1999-59032023-09-0115931410.3390/fi15090314Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification TaskArtyom V. Gorchakov0Liliya A. Demidova1Peter N. Sovietov2Institute of Information Technologies, Federal State Budget Educational Institution of Higher Education, MIREA—Russian Technological University, 78, Vernadsky Avenue, 119454 Moscow, RussiaInstitute of Information Technologies, Federal State Budget Educational Institution of Higher Education, MIREA—Russian Technological University, 78, Vernadsky Avenue, 119454 Moscow, RussiaInstitute of Information Technologies, Federal State Budget Educational Institution of Higher Education, MIREA—Russian Technological University, 78, Vernadsky Avenue, 119454 Moscow, RussiaIn this paper we consider the research and development of classifiers that are trained to predict the task solved by source code. Possible applications of such task detection algorithms include method name prediction, hardware–software partitioning, programming standard violation detection, and semantic code duplication search. We provide the comparative analysis of modern approaches to source code transformation into vector-based representations that extend the variety of classification and clustering algorithms that can be used for intelligent source code analysis. These approaches include word2vec, code2vec, first-order and second-order Markov chains constructed from abstract syntax trees (AST), histograms of assembly language instruction opcodes, and histograms of AST node types. The vectors obtained with the forementioned approaches are then used to train such classification algorithms as k-nearest neighbor (KNN), support vector machine (SVM), random forest (RF), and multilayer perceptron (MLP). The obtained results show that the use of program vectors based on first-order AST-based Markov chains with an RF-based classifier leads to the highest accuracy, precision, recall, and F1 score. Increasing the order of Markov chains considerably increases the dimensionality of a vector, without any improvements in classifier quality, so we assume that first-order Markov chains are best suitable for real world applications. Additionally, the experimental study shows that first-order AST-based Markov chains are least sensitive to the used classification algorithm.https://www.mdpi.com/1999-5903/15/9/314program classificationMarkov chainsabstract syntax treestask classificationstatic analysisk-nearest neighbor |
spellingShingle | Artyom V. Gorchakov Liliya A. Demidova Peter N. Sovietov Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task Future Internet program classification Markov chains abstract syntax trees task classification static analysis k-nearest neighbor |
title | Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task |
title_full | Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task |
title_fullStr | Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task |
title_full_unstemmed | Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task |
title_short | Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task |
title_sort | analysis of program representations based on abstract syntax trees and higher order markov chains for source code classification task |
topic | program classification Markov chains abstract syntax trees task classification static analysis k-nearest neighbor |
url | https://www.mdpi.com/1999-5903/15/9/314 |
work_keys_str_mv | AT artyomvgorchakov analysisofprogramrepresentationsbasedonabstractsyntaxtreesandhigherordermarkovchainsforsourcecodeclassificationtask AT liliyaademidova analysisofprogramrepresentationsbasedonabstractsyntaxtreesandhigherordermarkovchainsforsourcecodeclassificationtask AT peternsovietov analysisofprogramrepresentationsbasedonabstractsyntaxtreesandhigherordermarkovchainsforsourcecodeclassificationtask |