Code Generation Using Machine Learning: A Systematic Review

Recently, machine learning (ML) methods have been used to create powerful language models for a broad range of natural language processing tasks. An important subset of this field is that of generating code of programming languages for automatic software development. This review provides a broad and...


Bibliographic Details
Main Authors:	Enrique Dehaerne, Bappaditya Dey, Sandip Halder, Stefan De Gendt, Wannes Meert
Format:	Article
Language:	English
Published:	IEEE 2022-01-01
Series:	IEEE Access
Subjects:	Automatic programming; computer languages; data collection; machine learning; natural language processing; neural networks
Online Access:	https://ieeexplore.ieee.org/document/9849664/

author	Enrique Dehaerne, Bappaditya Dey, Sandip Halder, Stefan De Gendt, Wannes Meert
author_sort	Enrique Dehaerne
collection	DOAJ
description	Recently, machine learning (ML) methods have been used to create powerful language models for a broad range of natural language processing tasks. An important subset of this field is that of generating code of programming languages for automatic software development. This review provides a broad and detailed overview of studies for code generation using ML. We selected 37 publications indexed in arXiv and IEEE Xplore databases that train ML models on programming language data to generate code. The three paradigms of code generation we identified in these studies are description-to-code, code-to-description, and code-to-code. The most popular applications that work in these paradigms were found to be code generation from natural language descriptions, documentation generation, and automatic program repair, respectively. The most frequently used ML models in these studies include recurrent neural networks, transformers, and convolutional neural networks. Other neural network architectures, as well as non-neural techniques, were also observed. In this review, we have summarized the applications, models, datasets, results, limitations, and future work of 37 publications. Additionally, we include discussions on topics general to the literature reviewed. This includes comparing different model types, comparing tokenizers, the volume and quality of data used, and methods for evaluating synthesized code. Furthermore, we provide three suggestions for future work for code generation using ML.
first_indexed	2024-04-11T22:42:50Z
format	Article
id	doaj.art-cf2c07a4d4984e6e9e4de744411129a0
institution	Directory of Open Access Journals
issn	2169-3536
language	English
last_indexed	2024-04-11T22:42:50Z
publishDate	2022-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-cf2c07a4d4984e6e9e4de744411129a0 | 2022-12-22T03:58:56Z | eng | IEEE | IEEE Access | 2169-3536 | 2022-01-01 | 10 | 82434 | 82455 | 10.1109/ACCESS.2022.3196347 | 9849664 | Code Generation Using Machine Learning: A Systematic Review | Enrique Dehaerne [0] https://orcid.org/0000-0001-9021-2469 | Bappaditya Dey [1] https://orcid.org/0000-0002-0886-137X | Sandip Halder [2] https://orcid.org/0000-0002-6314-2685 | Stefan De Gendt [3] https://orcid.org/0000-0003-3775-3578 | Wannes Meert [4] https://orcid.org/0000-0001-9560-3872 | Department of Computer Science, KU Leuven, Leuven, Belgium | Interuniversity Microelectronics Centre (IMEC), Leuven, Belgium | Interuniversity Microelectronics Centre (IMEC), Leuven, Belgium | Department of Computer Science, KU Leuven, Leuven, Belgium | Department of Computer Science, KU Leuven, Leuven, Belgium | Recently, machine learning (ML) methods have been used to create powerful language models for a broad range of natural language processing tasks. An important subset of this field is that of generating code of programming languages for automatic software development. This review provides a broad and detailed overview of studies for code generation using ML. We selected 37 publications indexed in arXiv and IEEE Xplore databases that train ML models on programming language data to generate code. The three paradigms of code generation we identified in these studies are description-to-code, code-to-description, and code-to-code. The most popular applications that work in these paradigms were found to be code generation from natural language descriptions, documentation generation, and automatic program repair, respectively. The most frequently used ML models in these studies include recurrent neural networks, transformers, and convolutional neural networks. Other neural network architectures, as well as non-neural techniques, were also observed. In this review, we have summarized the applications, models, datasets, results, limitations, and future work of 37 publications. Additionally, we include discussions on topics general to the literature reviewed. This includes comparing different model types, comparing tokenizers, the volume and quality of data used, and methods for evaluating synthesized code. Furthermore, we provide three suggestions for future work for code generation using ML. | https://ieeexplore.ieee.org/document/9849664/ | Automatic programming | computer languages | data collection | machine learning | natural language processing | neural networks
title	Code Generation Using Machine Learning: A Systematic Review
title_sort	code generation using machine learning a systematic review
topic	Automatic programming; computer languages; data collection; machine learning; natural language processing; neural networks
url	https://ieeexplore.ieee.org/document/9849664/


Similar Items