Code Generation Using Machine Learning: A Systematic Review

Recently, machine learning (ML) methods have been used to create powerful language models for a broad range of natural language processing tasks. An important subset of this field is that of generating code of programming languages for automatic software development. This review provides a broad and...


Bibliographic Details
Main Authors: Enrique Dehaerne, Bappaditya Dey, Sandip Halder, Stefan De Gendt, Wannes Meert
Format: Article
Language: English
Published: IEEE 2022-01-01
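Series: IEEE Access
Subjects: Automatic programming, computer languages, data collection, machine learning, natural language processing, neural networks
Online Access: https://ieeexplore.ieee.org/document/9849664/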
author Enrique Dehaerne
Bappaditya Dey
Sandip Halder
Stefan De Gendt
Wannes Meert
author_sort Enrique Dehaerne
collection DOAJ
description Recently, machine learning (ML) methods have been used to create powerful language models for a broad range of natural language processing tasks. An important subset of this field is that of generating code of programming languages for automatic software development. This review provides a broad and detailed overview of studies for code generation using ML. We selected 37 publications indexed in arXiv and IEEE Xplore databases that train ML models on programming language data to generate code. The three paradigms of code generation we identified in these studies are description-to-code, code-to-description, and code-to-code. The most popular applications that work in these paradigms were found to be code generation from natural language descriptions, documentation generation, and automatic program repair, respectively. The most frequently used ML models in these studies include recurrent neural networks, transformers, and convolutional neural networks. Other neural network architectures, as well as non-neural techniques, were also observed. In this review, we have summarized the applications, models, datasets, results, limitations, and future work of 37 publications. Additionally, we include discussions on topics general to the literature reviewed. This includes comparing different model types, comparing tokenizers, the volume and quality of data used, and methods for evaluating synthesized code. Furthermore, we provide three suggestions for future work for code generation using ML.
first_indexed 2024-04-11T22:42:50Z
format Article
id doaj.art-cf2c07a4d4984e6e9e4de744411129a0
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-04-11T22:42:50Z
publishDate 2022-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-cf2c07a4d4984e6e9e4de744411129a0
2022-12-22T03:58:56Z
eng
IEEE
IEEE Access
2169-3536
2022-01-01
Volume 10, pages 82434-82455
DOI 10.1109/ACCESS.2022.3196347
IEEE Xplore document 9849664
Code Generation Using Machine Learning: A Systematic Review
Enrique Dehaerne (0) https://orcid.org/0000-0001-9021-2469
Bappaditya Dey (1) https://orcid.org/0000-0002-0886-137X
Sandip Halder (2) https://orcid.org/0000-0002-6314-2685
Stefan De Gendt (3) https://orcid.org/0000-0003-3775-3578
Wannes Meert (4) https://orcid.org/0000-0001-9560-3872
Affiliations, in record order (0 to 4):
Department of Computer Science, KU Leuven, Leuven, Belgium
Interuniversity Microelectronics Centre (IMEC), Leuven, Belgium
Interuniversity Microelectronics Centre (IMEC), Leuven, Belgium
Department of Computer Science, KU Leuven, Leuven, Belgium
Department of Computer Science, KU Leuven, Leuven, Belgium
https://ieeexplore.ieee.org/document/9849664/
Automatic programming
computer languages
data collection
machine learning
natural language processing
neural networks
title Code Generation Using Machine Learning: A Systematic Review
title_sort code generation using machine learning a systematic review
topic Automatic programming
computer languages
data collection
machine learning
natural language processing
neural networks
url https://ieeexplore.ieee.org/document/9849664/