
Source Code Mining for Humans: Introducing tree-hugger

Photo by Markus Spiske on Unsplash

Why code mining matters

We at CodistAI are working hard to build an AI that can understand source code and its associated documentation. Being developers ourselves, we have felt the pain of writing code and keeping its documentation up to date at the same time (add to that the pressure of delivery and deadlines!). Finding that documentation again when we need it is another big problem. And we know we are not the only ones suffering from it.

But to build such a system we needed data, and lots of it. We needed to mine huge amounts of code spanning different languages, and, as it turns out, data sources are scarce when it comes to code as data. The main sources we could find were GitHub's CodeSearchNet challenge dataset, Google BigQuery's GitHub activity dataset, Py150, and a few others like these.

Mining code files in different languages and gathering the important information from them is not a trivial job. We did not want to create new parsers, so great parser-generator frameworks such as ANTLR or lex-yacc were not an option for us. What we needed was a good, high-level library that exposes a simple, Pythonic API on top of some kind of universal code parser.

So in the end the choice came down to two options: Babelfish and tree-sitter. Babelfish was the newer kid on the block and came with some nice properties, but its uAST (Universal AST) was not really something we liked that much, and the API was not that easy either. So tree-sitter was the natural choice (also, Babelfish is no longer maintained).

We were impressed by tree-sitter's clean design, speed, language coverage, and minimal dependencies. However, we were still struggling with the low-level interface its Python binding provides. So we started writing some code to build higher-level abstractions on top of it.

Thus, tree-hugger was born.

Tree-hugger: code mining for humans

tree-hugger is a light-weight, extendable, high-level, universal code parser built on top of tree-sitter.

Let’s unpack those words one by one.

  1. light-weight: tree-hugger aims to be a simple and easy-to-use framework. It gives a developer just enough tools to quickly start mining code data while it takes care of a lot of boilerplate, and it provides some command-line utilities to make life easier. To that end it remains very lightweight itself, and we are pretty low on dependencies.
  2. extendable: tree-hugger aims to be extendable by design. It achieves that mainly in two ways. First, the queries live outside the code: we read the queries (s-expressions) from a yml file (an example can be found here), which means we do not need to hard-code them and can iterate on them very easily. Second, it has a modular structure with the common boilerplate code already supplied for you, so you can focus on the actual thing: writing the part of the code that matters to you.
  3. high-level: tree-hugger hides the little details of running a query or walking the AST, as well as the tricky parts of retrieving code from a query result, under a clean, Pythonic API, so you are free to concentrate on the problem at hand.
  4. universal: We actually leverage the amazing tree-sitter, so by default we are (almost) language-agnostic 🙂
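To make the "external source of queries" idea from the list above concrete, here is a sketch of what such a yml file could contain. The key name and the exact capture names are illustrative only, not copied from tree-hugger's real example file; the s-expression itself is the standard tree-sitter pattern for a Python function with a docstring.

```yaml
# Hypothetical queries file: each entry maps a query name to a
# tree-sitter s-expression that can be run against a parse tree.
all_function_docstrings: >
  (function_definition
    name: (identifier) @function.name
    body: (block
            (expression_statement (string) @function.docstring)))
```

Keeping the queries in a file like this means tuning a query is an edit-and-rerun loop, with no library code to change.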

Use-case: Mining Code and Comments

Let’s see an example of code mining. Imagine you have a Python file with several functions defined in it, some of which have docstrings while others do not (check out an example here). Below are three lines of code (assuming you have installed tree-hugger and set up the environment; our documentation covers how to install tree-hugger and how to build the .so files) that read the file, parse it into a parse tree, run a query, and return a dict with function names as keys and their docstrings as values.

from tree_hugger.core import PythonParser

pp = PythonParser()
pp.parse_file("tests/assets/file_with_different_functions.py")
pp.get_all_function_docstrings()
And here is the result:
{'parent': '"""This is the parent function\n    \n    There are other lines in the doc string\n    This is the third line\n\n    And this is the fourth\n    """',
'first_child': "'''\n        This is first child\n        '''",
'second_child': '"""\n        This is second child\n        """',
'my_decorator': '"""\n    Outer decorator function\n    """',
'say_whee': '"""\n    Hellooooooooo\n\n    This is a function with decorators\n    """'
}

That was easy!
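Notice that the values in that dict keep their original quote delimiters and indentation. A few lines of plain Python are enough to post-process them, for example to pull out a one-line summary per function (docstring_summary below is our own little helper, not part of tree-hugger's API):

```python
def docstring_summary(raw):
    """Strip the ''' / \"\"\" delimiters and return the first non-empty line."""
    text = raw.strip("'\"")          # remove the quote delimiters at both ends
    for line in text.splitlines():
        line = line.strip()
        if line:
            return line
    return ""

# A couple of entries from the dict returned by get_all_function_docstrings()
docstrings = {
    'parent': '"""This is the parent function\n\n    There are other lines in the doc string\n    """',
    'say_whee': '"""\n    Hellooooooooo\n\n    This is a function with decorators\n    """',
}

summaries = {name: docstring_summary(doc) for name, doc in docstrings.items()}
print(summaries)
# {'parent': 'This is the parent function', 'say_whee': 'Hellooooooooo'}
```

The same kind of post-processing works no matter which language the docstrings were mined from, since the heavy lifting has already been done by the parser.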

And imagine being able to do that with the same API for all languages, without worrying about their underlying semantic differences, so that you can mine files in any language at scale and with minimum effort. This is what tree-hugger is about. Our aim is to be the standard for data mining on source code, with a clean, high-level, Pythonic API that sets you free from all the lower-level details and lets you focus on the more novel problems at hand.

Final Words

Today, we release the first version of tree-hugger. We have tried to provide very comprehensive documentation, so please go through it. If you find something missing, have a suggestion for improvement, or spot a bug, please open a GitHub issue so that we can discuss it. And if you want to contribute, you are more than welcome.

Happy coding! 😉
