Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task

In this paper we consider the research and development of classifiers that are trained to predict the task solved by source code. Possible applications of such task detection algorithms include method name prediction, hardware–software partitioning, programming standard violation detection, and sema...

Full description

Bibliographic Details
Main Authors: Artyom V. Gorchakov, Liliya A. Demidova, Peter N. Sovietov
Format: Article
Language:English
Published: MDPI AG 2023-09-01
Series:Future Internet
Subjects:
Online Access:https://www.mdpi.com/1999-5903/15/9/314
_version_ 1827726041988202496
author Artyom V. Gorchakov
Liliya A. Demidova
Peter N. Sovietov
author_facet Artyom V. Gorchakov
Liliya A. Demidova
Peter N. Sovietov
author_sort Artyom V. Gorchakov
collection DOAJ
description In this paper we consider the research and development of classifiers that are trained to predict the task solved by source code. Possible applications of such task detection algorithms include method name prediction, hardware–software partitioning, programming standard violation detection, and semantic code duplication search. We provide the comparative analysis of modern approaches to source code transformation into vector-based representations that extend the variety of classification and clustering algorithms that can be used for intelligent source code analysis. These approaches include word2vec, code2vec, first-order and second-order Markov chains constructed from abstract syntax trees (AST), histograms of assembly language instruction opcodes, and histograms of AST node types. The vectors obtained with the forementioned approaches are then used to train such classification algorithms as k-nearest neighbor (KNN), support vector machine (SVM), random forest (RF), and multilayer perceptron (MLP). The obtained results show that the use of program vectors based on first-order AST-based Markov chains with an RF-based classifier leads to the highest accuracy, precision, recall, and F1 score. Increasing the order of Markov chains considerably increases the dimensionality of a vector, without any improvements in classifier quality, so we assume that first-order Markov chains are best suitable for real world applications. Additionally, the experimental study shows that first-order AST-based Markov chains are least sensitive to the used classification algorithm.
first_indexed 2024-03-10T22:44:42Z
format Article
id doaj.art-bbe2cc774eb54572998d125d4c5a83a8
institution Directory Open Access Journal
issn 1999-5903
language English
last_indexed 2024-03-10T22:44:42Z
publishDate 2023-09-01
publisher MDPI AG
record_format Article
series Future Internet
spelling doaj.art-bbe2cc774eb54572998d125d4c5a83a82023-11-19T10:49:29ZengMDPI AGFuture Internet1999-59032023-09-0115931410.3390/fi15090314Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification TaskArtyom V. Gorchakov0Liliya A. Demidova1Peter N. Sovietov2Institute of Information Technologies, Federal State Budget Educational Institution of Higher Education, MIREA—Russian Technological University, 78, Vernadsky Avenue, 119454 Moscow, RussiaInstitute of Information Technologies, Federal State Budget Educational Institution of Higher Education, MIREA—Russian Technological University, 78, Vernadsky Avenue, 119454 Moscow, RussiaInstitute of Information Technologies, Federal State Budget Educational Institution of Higher Education, MIREA—Russian Technological University, 78, Vernadsky Avenue, 119454 Moscow, RussiaIn this paper we consider the research and development of classifiers that are trained to predict the task solved by source code. Possible applications of such task detection algorithms include method name prediction, hardware–software partitioning, programming standard violation detection, and semantic code duplication search. We provide the comparative analysis of modern approaches to source code transformation into vector-based representations that extend the variety of classification and clustering algorithms that can be used for intelligent source code analysis. These approaches include word2vec, code2vec, first-order and second-order Markov chains constructed from abstract syntax trees (AST), histograms of assembly language instruction opcodes, and histograms of AST node types. The vectors obtained with the forementioned approaches are then used to train such classification algorithms as k-nearest neighbor (KNN), support vector machine (SVM), random forest (RF), and multilayer perceptron (MLP). The obtained results show that the use of program vectors based on first-order AST-based Markov chains with an RF-based classifier leads to the highest accuracy, precision, recall, and F1 score. Increasing the order of Markov chains considerably increases the dimensionality of a vector, without any improvements in classifier quality, so we assume that first-order Markov chains are best suitable for real world applications. Additionally, the experimental study shows that first-order AST-based Markov chains are least sensitive to the used classification algorithm.https://www.mdpi.com/1999-5903/15/9/314program classificationMarkov chainsabstract syntax treestask classificationstatic analysisk-nearest neighbor
spellingShingle Artyom V. Gorchakov
Liliya A. Demidova
Peter N. Sovietov
Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task
Future Internet
program classification
Markov chains
abstract syntax trees
task classification
static analysis
k-nearest neighbor
title Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task
title_full Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task
title_fullStr Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task
title_full_unstemmed Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task
title_short Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task
title_sort analysis of program representations based on abstract syntax trees and higher order markov chains for source code classification task
topic program classification
Markov chains
abstract syntax trees
task classification
static analysis
k-nearest neighbor
url https://www.mdpi.com/1999-5903/15/9/314
work_keys_str_mv AT artyomvgorchakov analysisofprogramrepresentationsbasedonabstractsyntaxtreesandhigherordermarkovchainsforsourcecodeclassificationtask
AT liliyaademidova analysisofprogramrepresentationsbasedonabstractsyntaxtreesandhigherordermarkovchainsforsourcecodeclassificationtask
AT peternsovietov analysisofprogramrepresentationsbasedonabstractsyntaxtreesandhigherordermarkovchainsforsourcecodeclassificationtask