What do Facebook, a car and a Boeing have in common? They all run on more than 20M lines of code (LOC).
Six years ago, Marc Andreessen said that “software is eating the world”, and David McCandless’s visualization shows how true that is. We are getting overwhelmed by source code, its increasing complexity, and the challenges that come with it: shadow IT, lack of documentation, language and framework heterogeneity, lack of visibility into code history, no easy way to maintain the code…
Just as vast amounts of data on the web enabled Big Data applications, large repositories of programs (e.g. open source code on GitHub, Bitbucket…) now enable a new class of applications: “Big Code”. We’ve accumulated petabytes of openly available source code, yet there have been few attempts to fully leverage the knowledge that is sealed inside. Code is the new data to look at!
Using Machine Learning on source code means automatically learning from existing code in order to solve tasks such as predicting program bugs, predicting program behavior, predicting identifier names, or automatically creating new code. This new approach opens the door to extremely exciting opportunities in the ways software and code are developed.
Machine learning on source code: what is it?
Machine Learning on Source Code (#MLonCode) is an emerging and exciting domain of research which stands at the crossroads of deep learning, natural language processing, software engineering, and programming languages.
The analysis of source code currently receives less attention in the field of machine learning than that of images or natural language. As a result, there are no well-established standard techniques for using source code as a data source for predictions. This is frontier technology, where a lot of research is still required.
Processing code data
Machine Learning consists of mathematical algorithms applied to data; consequently, any input data must have a mathematical representation. For example, when you want to process an image, you have to convert it into a matrix. This is easy to picture: an image is a matrix of pixels, and each pixel is an array of numbers describing its color.
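To make the image case concrete, here is a minimal sketch in plain Python. The pixel values are made up, and averaging the three color channels is only one simple (assumed) way to reduce a pixel to a single number.

```python
# A tiny 2x3 RGB image written as nested lists: rows -> pixels -> channels.
# The pixel values are invented for illustration.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],        # red, green, blue
    [(0, 0, 0), (128, 128, 128), (255, 255, 255)],  # black, grey, white
]


def to_grayscale(img):
    """Average the three color channels of every pixel (integer division)."""
    return [[sum(pixel) // 3 for pixel in row] for row in img]


def flatten(matrix):
    """Turn the 2-D matrix into the flat list of numbers a model would consume."""
    return [value for row in matrix for value in row]


gray = to_grayscale(image)
features = flatten(gray)
print(gray)
print(features)
```

Whatever the model, the input ends up as plain numbers like `features`.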
When it comes to code, the challenge is a bit trickier. Code embeds different levels of understanding:
- semantic level: what is written
- structural level: how it is written
- graph flow level: how each part of the code interacts with the rest of the code
The intent and meaning of source code rely on those three levels of understanding.
Any machine learning technique applied to source code should ensure a mathematical embedding of those three levels of understanding. The quality of the embedding will impact the quality of the model.
(This topic will be detailed in a later article)
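As a rough illustration of the three levels, the sketch below extracts a different view of the same snippet for each one, using only Python’s standard `tokenize` and `ast` modules. Mapping tokens, syntax-tree nodes and a call graph onto the three levels is our own simplification, and the sample functions are hypothetical. A real embedding would be learned from views like these, not read off directly.

```python
import ast
import io
import tokenize

# Hypothetical sample program used only for this illustration.
SOURCE = '''
def area(width, height):
    return width * height

def describe(width, height):
    return "area=" + str(area(width, height))
'''


def name_tokens(source):
    """'What is written': the keywords and identifiers, in reading order."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type == tokenize.NAME]


def node_types(source):
    """'How it is written': the syntax-tree node types (breadth-first order)."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def call_graph(source):
    """'How parts interact': for each function, the plain names it calls."""
    graph = {}
    for func in ast.walk(ast.parse(source)):
        if isinstance(func, ast.FunctionDef):
            callees = {
                node.func.id
                for node in ast.walk(func)
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            }
            graph[func.name] = sorted(callees)
    return graph


print(name_tokens(SOURCE))
print(node_types(SOURCE))
print(call_graph(SOURCE))
```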
Machine learning on code: Use Cases
Automatically test your code
Many software development processes have to cover a large number of unit and integration test cases, which takes a long (long, long…) time to fully implement. Testing in Continuous Integration (CI), in particular, involves test case prioritization, selection, and execution at each cycle.
Selecting the most promising test cases to detect bugs is hard if there are uncertainties about the impact of committed code changes or if traceability links between code and tests are not available.
Today, automatic understanding of the code helps you prioritize test cases according to their duration, last execution and failure history. In a constantly changing environment, algorithms such as the Retecs method learn to prioritize error-prone test cases.
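To show how those three signals (duration, last execution, failure history) could feed a ranking, here is a hand-written scoring sketch. It is not the Retecs method: Retecs learns its prioritization, while the formula, the weights and the test names below are arbitrary assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CaseRecord:
    name: str
    duration_s: float            # average run time, in seconds
    cycles_since_last_run: int   # how many CI cycles ago it last ran
    history: list = field(default_factory=list)  # most recent first; True = failed


def priority(case, window=5):
    """Higher score = run earlier. Weights are illustrative, not tuned."""
    recent = case.history[:window]
    # Tests without any history get a neutral 0.5 failure rate.
    failure_rate = sum(recent) / len(recent) if recent else 0.5
    staleness = min(case.cycles_since_last_run, 10) / 10
    # Penalize slow tests so cheap, failure-prone ones come first.
    return (0.7 * failure_rate + 0.3 * staleness) / (1.0 + case.duration_s / 60)


def select(cases, time_budget_s):
    """Greedily pick the highest-priority tests that fit in the time budget."""
    chosen, used = [], 0.0
    for case in sorted(cases, key=priority, reverse=True):
        if used + case.duration_s <= time_budget_s:
            chosen.append(case.name)
            used += case.duration_s
    return chosen


# Hypothetical suite for one CI cycle.
suite = [
    CaseRecord("login", 30, 1, [True, False, True]),
    CaseRecord("playback", 120, 0, [False, False, False]),
    CaseRecord("drm_license", 60, 6, [False, True]),
    CaseRecord("new_feature", 20, 10, []),
]

print(select(suite, time_budget_s=120))
```

A learning approach would replace the fixed `priority` formula with a model updated after each cycle from the observed pass/fail results.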
Real world example: Netflix engineers run a series of tests and benchmarks to validate the service across multiple dimensions, including audio-video playback quality, license handling, encryption, security… All this leads to a plethora of test cases, most of them automated, that need to be executed to validate the functionality of a device running Netflix. To speed up the test process, Netflix engineers use the Retecs technique. It helps them choose the most promising subset of tests out of the thousands of test cases available when running continuous integration against a device, or recommend a set of test cases to execute against the device that would increase the probability of failing the device in real time.
Code suggestion and completion
For years developers have been like assembly-line workers, writing the same lines of code again and again to solve the same kind of problem. How many times did I look for a way to solve a specific problem and find myself looking for the same answer every six months! (And I know… I’m not the only one.)
Engineers, when tackling a problem, are often looking for a way to solve it that has already been implemented. Over the years, many code search tools and platforms (bless you StackOverflow) have been proposed to help developers. Usual approaches often treat source code as textual documents and utilize information retrieval models to retrieve relevant code snippets that match a given query. They lack a deep understanding of the semantics of queries and source code.
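The limitation of purely textual retrieval is easy to reproduce. The sketch below is a minimal bag-of-words search with cosine similarity, and the snippet collection is hypothetical. It finds code when the query shares words with it, and finds nothing when the query describes the problem in different words.

```python
import math
import re
from collections import Counter

# Hypothetical snippet collection, indexed as plain text.
SNIPPETS = {
    "read_file": "def read_file(path): return open(path).read()",
    "fetch_url": "def fetch_url(url): return urllib.request.urlopen(url).read()",
    "sort_items": "def sort_items(items): return sorted(items)",
}


def bag_of_words(text):
    """Lowercase the text and count runs of letters (underscores split words)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query):
    """Rank every snippet by textual similarity to the query, best first."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(code)), name) for name, code in SNIPPETS.items()]
    return sorted(scored, reverse=True)


print(search("read file"))            # shares words with read_file
print(search("download a web page"))  # no shared words: every score is 0.0
```

The second query is answered by `fetch_url`, but no word overlaps, so a keyword model cannot see the match.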
Today, Machine Learning on Code makes it possible to run semantic similarity searches in the problem domain instead of searching in the solution domain.
The most widely used technique is Code2Vec. Like the famous Natural Language Processing example of:
vec("man")-vec("woman") = vec("king")-vec("queen")
the Code2Vec model learns analogies that are relevant to source code, such as:
vec("receive")-vec("send") = vec("download")-vec("upload")
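The arithmetic behind such analogies can be sketched with toy vectors. The three-dimensional embeddings below are hand-picked so that the analogy holds. They are not the output of Code2Vec, whose vectors are learned from code and have many more dimensions.

```python
import math

# Toy embeddings, invented for illustration (not produced by any model).
EMBEDDINGS = {
    "receive":  [1.0, 0.0, 0.2],
    "send":     [-1.0, 0.0, 0.2],
    "download": [1.0, 1.0, 0.1],
    "upload":   [-1.0, 1.0, 0.1],
    "delete":   [0.0, -1.0, 0.5],
}


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' with the word nearest to vec(b) - vec(a) + vec(c)."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    ]
    candidates = [word for word in EMBEDDINGS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(EMBEDDINGS[word], target))


print(analogy("receive", "send", "download"))
```

With learned embeddings the equality is only approximate, which is why the answer is taken as the nearest neighbor and not as an exact match.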
Real world example: This year, Facebook released Aroma, a code-to-code search and recommendation tool that uses machine learning (ML) to make the process of gaining insights from big codebases much easier. Let’s remember that Facebook has over 2B lines of code…
Code review
Determining program correctness requires a precise understanding of a program’s intended behavior, and a means to convey this understanding unambiguously in a form suitable for automated inspection.
Today’s code review tools lack this depth of code understanding, which can lead to unpleasant situations (e.g. a high test coverage rate on a non-working program, or a high level of code documentation where every comment is out of date).
Machine Learning on code opens the door to a new way to (deeply) understand the intent of the code and analyze it. It’s like comparing a 2-D picture to a 3-D picture. Current code review tools do provide insights on the code, but that is nothing compared with what ML-powered tools will be able to do.
Real world example: Autosoft has released the first version of a code reviewing tool that assesses the alignment of the code and its comments and automatically suggests updates if the comments diverge from the code. The next release will assess the usefulness of unit tests relative to the code. When reviewing code, this gives developers truly actionable insights on how to improve the long-term readability and maintainability of any code, which is key for us.
Program induction and synthesis
This has been the Holy Grail of computer science since its creation 200 years ago: a fully automated programming system. In computer science, program synthesis is the task of automatically constructing a program that satisfies a given high-level specification. We are currently in the instructional programming world. When we have to solve a complex problem, as developers, we break it down into smaller problems and write the code that solves them (we talk about first principles). Program synthesis works the other way around: we provide the computer with a complex problem (what we want) and leave the details of how to solve it to the computer.
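A very small instance of this idea is synthesis from input/output examples: the specification is a list of (input, output) pairs, and the computer searches for a program that satisfies all of them. The sketch below does this by brute-force enumeration over four hypothetical primitives. Real synthesizers use far richer languages and smarter search, so treat it only as an illustration of “say what, not how”.

```python
from itertools import product

# Hypothetical building blocks the synthesizer may combine.
PRIMITIVES = {
    "double": lambda x: x * 2,
    "increment": lambda x: x + 1,
    "square": lambda x: x * x,
    "negate": lambda x: -x,
}


def run(program, value):
    """Apply each named primitive in order, feeding each result to the next."""
    for name in program:
        value = PRIMITIVES[name](value)
    return value


def synthesize(examples, max_length=3):
    """Return the first shortest pipeline consistent with all examples, or None."""
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                return list(program)
    return None


# The specification: we only state what we want, here examples of (x + 1) squared.
spec = [(1, 4), (2, 9), (3, 16)]
print(synthesize(spec))
```

The caller never writes the steps, only the examples. The search space grows exponentially with program length, which is one reason learned models are used to guide the search.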
We are certainly not there yet (please wait, Mr. Spock), but tremendous advances have been made lately by both researchers and companies. Today we can talk about augmented programming.
Real world example:
Pix2Code: Traditionally, it has been the task of front-end developers to transform the work of designers from raw graphical user interface mockups into actual source code, but this might soon be a thing of the past. Pix2Code detects the shapes in a mockup, interprets their meaning (paragraph, header, image…) and generates the related code.
Let’s take our imaginations one step further. If ML-generated code is at least as good as whatever the best human programmers might have produced, that could hasten the day when most developers never need to touch a single line of executable code ever again.
Given that the solutions quoted above perform with roughly 60 to 80% accuracy, ML-driven software engineering is not going to make human programmers obsolete any time soon.
In the meantime, we can still enjoy augmented programming to tackle many of today’s software engineering drawbacks (automated documentation generation, appropriate test writing, code optimization, suggestion…).
Machine Learning on Code is coming!
Maeliza Seymour – Shubhadeep Roychowdhury