Software is eating the world…but software is also eating developers. Software development and maintenance have hit a complexity limit. Fortunately, AI is changing the game of software development.
We are getting overwhelmed by source code, by its growing complexity, and by the challenges that follow: shadow IT, lack of documentation, heterogeneous languages and frameworks, poor visibility into code history, no easy way to maintain the code…
Just as vast amounts of data enabled Big Data applications, large repositories of programs now enable “Big Code”. We’ve accumulated petabytes of source code (e.g. open source code on GitHub, Bitbucket…). This data is open, yet there have been few attempts to fully leverage the knowledge sealed inside. Code is the new data to look at!
Using Machine Learning on source code means automatically learning from existing code. We can now tackle tasks such as predicting program bugs, predicting program behavior, predicting identifier names, or automatically creating new code. This new approach opens the door to extremely exciting opportunities in the way software and code are developed.
Machine learning on source code: what is it?
Machine Learning on Source Code (#MLonCode) is emerging. It’s an exciting domain of research standing at the crossroads of deep learning, natural language processing, software engineering, and programming languages.
Unlike NLP or image analysis, the analysis of source code is still at a very early stage. As a result, there are no well-established standard techniques for using source code as a data source for predictions. We stand at a technological frontier where a lot of research is still required.
Processing code data
Machine Learning consists of mathematical algorithms applied to data. Consequently, any input data must have a mathematical representation. For example, when you want to process an image, you have to convert it into a matrix. This is easy to picture: an image is a matrix of pixels, and each pixel is an array of numbers describing its color.
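To make the image example concrete, here is a minimal sketch in plain Python. The 2x2 image, the pixel values, and the helper names (to_grayscale, flatten) are made up for illustration; real pipelines would typically rely on an image library rather than nested lists.

```python
# A tiny 2x2 RGB "image": each pixel is a list of three numbers (red, green, blue).
image = [
    [[255, 0, 0], [0, 255, 0]],      # row 0: a red pixel, a green pixel
    [[0, 0, 255], [255, 255, 255]],  # row 1: a blue pixel, a white pixel
]


def to_grayscale(img):
    """Average the three color channels of every pixel into one intensity value."""
    return [[sum(pixel) / len(pixel) for pixel in row] for row in img]


def flatten(matrix):
    """Flatten a 2-D matrix into the 1-D feature vector many models expect."""
    return [value for row in matrix for value in row]


gray = to_grayscale(image)
features = flatten(gray)
print(features)
```

The point is only that the picture ends up as numbers a model can consume; the grayscale step is one arbitrary choice among many possible encodings.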
When it comes to code, the challenge is a bit trickier. Code embeds several levels of understanding:
- semantic level: what is written
- structural level: how it is written
- graph flow level: how each part of the code interacts with the rest of the code
The intent and meaning of source code rely on those three levels of understanding.
Any machine learning technique applied to source code should provide a mathematical embedding of those three levels of understanding. The quality of the embedding directly affects the quality of the model.
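As a rough illustration of the three levels, the sketch below extracts a simplified view of each from a small Python function, using only the standard library (tokenize and ast). The sample function and the crude def-use dictionary standing in for the graph flow level are assumptions made for this example; they are not a full embedding technique.

```python
import ast
import io
import tokenize

SOURCE = '''
def area(width, height):
    result = width * height
    return result
'''

# Level 1, what is written: identifiers, keywords and operators as a flat token stream.
tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(SOURCE).readline)
    if tok.type in (tokenize.NAME, tokenize.OP)
]

# Level 2, how it is written: the node types of the abstract syntax tree (AST).
tree = ast.parse(SOURCE)
node_types = [type(node).__name__ for node in ast.walk(tree)]

# Level 3, how parts interact: a crude def-use relation, i.e. which names
# each simple assignment reads in order to produce the name it writes.
def_use = {}
for node in ast.walk(tree):
    if isinstance(node, ast.Assign) and isinstance(node.targets[0], ast.Name):
        reads = sorted(n.id for n in ast.walk(node.value) if isinstance(n, ast.Name))
        def_use[node.targets[0].id] = reads

print(tokens)
print(node_types)
print(def_use)
```

Each of these views would still have to be turned into vectors before a model can learn from it, which is where the choice of embedding comes in.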
(This topic will be detailed in a later article.)
Machine learning on code: Use Cases
AI and code testing
Software development includes unit and integration tests. Implementing and running every possible case would take a long (long, long…) time. For developers, testing in Continuous Integration (CI) therefore involves test case prioritization, selection, and execution at each cycle.
Selecting the test cases most likely to detect bugs is hard, especially when the impact of committed code changes is uncertain or when traceability links between code and tests are not available.
Today, automated approaches help prioritize test cases according to their duration, when they were last executed, and their failure history. In a constantly changing environment, algorithms such as the Retecs method learn to prioritize error-prone test cases.
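To give a feel for the inputs involved, here is a minimal sketch that ranks test cases from their duration, last execution, and failure history. It is a hand-tuned scoring heuristic, not the Retecs method, which learns its prioritization policy instead of using fixed weights. The CaseRecord class, the weights, and the sample tests are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CaseRecord:
    name: str
    duration_s: float            # how long the test takes to run
    cycles_since_last_run: int   # how many CI cycles ago it was last executed
    history: list = field(default_factory=list)  # 1 = failed, 0 = passed, newest last


def priority(test, w_fail=0.6, w_stale=0.3, w_cost=0.1):
    """Hand-tuned score: recent failures and stale tests rank higher, slow tests lower."""
    recent = test.history[-5:]
    # Tests with no history get a neutral prior of 0.5.
    failure_rate = sum(recent) / len(recent) if recent else 0.5
    staleness = min(test.cycles_since_last_run / 10, 1.0)
    cost = min(test.duration_s / 60, 1.0)
    return w_fail * failure_rate + w_stale * staleness - w_cost * cost


def select(tests, time_budget_s):
    """Greedily pick the highest-priority tests that fit in the CI time budget."""
    chosen, used = [], 0.0
    for test in sorted(tests, key=priority, reverse=True):
        if used + test.duration_s <= time_budget_s:
            chosen.append(test)
            used += test.duration_s
    return chosen


tests = [
    CaseRecord("login", 30, 1, [0, 0, 1, 1, 0]),
    CaseRecord("playback", 50, 8, [0, 0, 0, 0, 0]),
    CaseRecord("drm", 20, 2, [1, 1, 1, 0, 1]),
]
print([t.name for t in select(tests, time_budget_s=60)])
```

A learning-based approach would replace the fixed weights in priority with a policy updated from the outcome of each CI cycle.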
Real-world example: Netflix engineers run a series of tests and benchmarks to validate the service across multiple dimensions, including audio-video playback quality, license handling, encryption, security… All this leads to a plethora of test cases that need to be executed to validate the functionality of a device running Netflix. To speed up the test process, Netflix engineers use the Retecs technique. It helps them choose the most promising subset of tests out of the thousands of test cases available. They were looking for a set of test cases that would increase the probability of making the device fail, selected in real time.
AI and code completion
How many times have I looked for a way to solve a specific problem, only to find myself looking for the same answer six months later! (And I know… I’m not the only one.)
Engineers tackling a problem often look for a solution that has already been implemented. Over the years, many code search tools and platforms (bless you, StackOverflow) have been proposed to help developers. The usual approaches treat source code as textual documents, so these tools use information retrieval models to retrieve code snippets that match a given query. They lack a deep understanding of the semantics of both queries and source code.
Today, Machine Learning on Code makes it possible to run semantic similarity searches in the problem domain instead of searching in the solution domain.
One of the most widely used techniques is Code2Vec. Consider the famous Natural Language Processing example:
vec("man")-vec("woman") = vec("king")-vec("queen")
In the same way, the Code2Vec model learns analogies that are relevant to source code, such as:
vec("receive")-vec("send") = vec("download")-vec("upload")
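The analogy above boils down to vector arithmetic followed by a nearest-neighbor lookup. The sketch below shows that mechanism on toy three-dimensional vectors invented for illustration; they are not real Code2Vec embeddings, and the analogy helper is a hypothetical name, not part of any library.

```python
import math

# Toy embeddings, made up for this example only.
# Dimension 0 loosely encodes the direction of a transfer,
# dimensions 1 and 2 loosely encode the kind of operation.
EMBEDDINGS = {
    "receive":  [1.0, 0.9, 0.1],
    "send":     [-1.0, 0.9, 0.1],
    "download": [1.0, 0.1, 0.9],
    "upload":   [-1.0, 0.1, 0.9],
    "delete":   [0.0, -0.8, -0.5],
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' with the word nearest to vec(b) - vec(a) + vec(c)."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    ]
    candidates = [word for word in EMBEDDINGS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(target, EMBEDDINGS[word]))


print(analogy("receive", "send", "download"))  # prints "upload"
```

With learned embeddings the equality only holds approximately, which is why the lookup picks the closest vector rather than testing for an exact match.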
Real-world example: Facebook released Aroma this year. It’s a code-to-code search and recommendation tool. Leveraging AI, it makes the process of gaining insights from big codebases much easier. Let’s remember that Facebook has over 2B lines of code…
AI and code review
Determining program correctness requires a precise understanding of a program’s intended behavior and a means to convey this understanding unambiguously in a form suitable for automated inspection. But today’s code review tools lack this depth of code understanding, which leads to some unpleasant situations. For instance, you can have a high test coverage rate and still have a program that does not work. Or you can have heavily documented code in which every comment is out of date.
AI in software development opens the door to a new way of (deeply) and automatically understanding source code: understanding its intent and analyzing it. It’s like comparing a 2-D picture to a 3-D picture. Yes, current code review tools certainly provide insights on the code. But… it’s nothing compared with what ML-powered tools will be able to do.
Real-world example: CodistAI is working on the first version of a code review tool that assesses the alignment between code and its comments. They intend to automatically suggest updates when the comments diverge from the code. When reviewing code, this gives developers truly actionable insights on how to improve long-term readability and maintainability, which is key for us.
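To show what a divergence between code and comments can look like in practice, here is a deliberately simple, rule-based sketch. It is not CodistAI's approach and involves no machine learning: it only compares the parameters named in a docstring with the function's actual signature. The sample function, the ":param name:" docstring convention, and the stale_doc_report helper are assumptions made for this example.

```python
import ast
import re

SOURCE = '''
def resize(image, width, height):
    """Resize an image.

    :param image: the picture to resize
    :param size: the target size in pixels
    """
    return image
'''

# Matches Sphinx-style ":param name:" entries in a docstring.
PARAM_PATTERN = re.compile(r":param\s+(\w+)\s*:")


def stale_doc_report(source):
    """Map each function name to parameters missing from, or stale in, its docstring."""
    report = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        actual = {arg.arg for arg in node.args.args}
        documented = set(PARAM_PATTERN.findall(ast.get_docstring(node) or ""))
        undocumented = sorted(actual - documented)
        stale = sorted(documented - actual)
        if undocumented or stale:
            report[node.name] = {"undocumented": undocumented, "stale": stale}
    return report


print(stale_doc_report(SOURCE))
```

A check like this only catches mismatched names. Detecting that a comment describes behavior the code no longer has requires the deeper understanding of intent discussed above.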
Program induction and synthesis
This would be the Holy Grail of Computer Science: a fully automated programming system. In computer science, program synthesis is the task of automatically constructing a program that satisfies a given high-level specification. We currently live in the world of instructional programming. When we have to solve a complex problem, as developers, we break it down into smaller problems and write the code that solves them. This is known as working from first principles. Program synthesis works the other way around: we give the computer a complex problem (what we want) and leave it to the computer to figure out the details of how.
We are certainly not there yet, but tremendous advances have been made lately by both researchers and companies. Today we can talk about augmented programming.
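The idea of stating what we want and letting the computer find the how can be sketched with a brute-force search. In this toy example the specification is a handful of input/output pairs, and the synthesizer enumerates sequences of primitives until one fits every pair. The four primitives and the synthesize function are invented for illustration; research systems use far richer languages and much smarter search, often guided by learned models.

```python
from itertools import product

# A tiny, made-up domain-specific language: each primitive maps an int to an int.
PRIMITIVES = {
    "inc": lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
    "negate": lambda x: -x,
}


def run(program, value):
    """Apply each primitive of the program, in order, to the input value."""
    for name in program:
        value = PRIMITIVES[name](value)
    return value


def synthesize(examples, max_length=3):
    """Return the first shortest primitive sequence consistent with every example, or None."""
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                return list(program)
    return None


# The specification: input/output pairs for the function we want.
spec = [(1, 8), (2, 18), (3, 32)]
program = synthesize(spec)
print(program)
```

The search space grows exponentially with program length, which is one reason practical synthesis remains hard and why today's tools augment developers rather than replace them.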
Real-world example:
Pix2Code: traditionally, it has been the task of front-end developers to turn the work of designers, from raw graphical user interface mockups, into actual source code, but this might soon be a thing of the past. Pix2Code detects the shapes in a mockup, interprets their meaning (paragraph, header, image…) and generates the related code.
What is next?
While the solutions cited above currently perform at anywhere between 60 and 80% accuracy, AI-driven software development is going to be our reality soon.
Let’s take our imaginations one step further. If ML-generated code is at least as good as whatever the best human programmers might have produced, that could hasten the day when most developers never need to touch a single line of executable code ever again.
Yet in the meantime, we can already enjoy augmented programming to tackle many of today’s software engineering drawbacks (automated documentation generation, appropriate test writing, code optimization, suggestions…).
Machine Learning on Code is coming!
Maeliza Seymour – Shubhadeep Roychowdhury