Guesslang documentation¶
Guesslang detects the programming language of a given source code. It supports more than 50 programming languages and detects the correct programming language with more than 90% accuracy.
Guesslang is an open source deep learning software that have been trained with over a million source code files.
You can use Guesslang as a command line interface tool or as a Python module:
from guesslang import Guess
guess = Guess()
# Guess the language from code
language = guess.language_name("""
% Quick sort
-module (recursion).
-export ([qsort/1]).
qsort([]) -> [];
qsort([Pivot|T]) ->
qsort([X || X <- T, X < Pivot])
++ [Pivot] ++
qsort([X || X <- T, X >= Pivot]).
""")
print(language) # --> Erlang
Guesslang supports 54 of the world’s most popular programming languages:
Assembly | Batchfile | C | C# | C++ |
Clojure | CMake | COBOL | CoffeeScript | CSS |
CSV | Dart | DM | Dockerfile | Elixir |
Erlang | Fortran | Go | Groovy | Haskell |
HTML | INI | Java | JavaScript | JSON |
Julia | Kotlin | Lisp | Lua | Makefile |
Markdown | Matlab | Objective-C | OCaml | Pascal |
Perl | PHP | PowerShell | Prolog | Python |
R | Ruby | Rust | Scala | Shell |
SQL | Swift | TeX | TOML | TypeScript |
Verilog | Visual Basic | XML | YAML |
Guesslang is used by Visual Studio Code to automatically detect the programming language of the source code that you paste into the editor:
Guesslang is used by other projects including:
- Chameledit a web-editor that auto-highlights code,
- Pasta a Slack bot that pretty-pastes code,
- GG a guessing game.
Install Guesslang¶
Guesslang requires Python 3.7 or later.
Install from Pypi¶
You can run the following command to install Guesslang on your system:
pip install guesslang
Install from source code¶
To install Guesslang from source code, just download the source code from https://github.com/yoeo/guesslang, then run this command:
pip install .
Usage¶
Python package¶
Guesslang Python library helps you detect the programming language of a given text within your Python program. The Python classes are fully documentation here: Guesslang package reference.
Command line tool¶
On a terminal emulator, you can detect the programming language
of a source code file by running guesslang /path/to/file
.
As well, you can detect the programming language of a source code
provided through the standard input using a
pipeline
like some-command | guesslang
.
Examples:
Detect the programming language of
/etc/bashrc
configuration fileguesslang /etc/bashrc # ⟶ Programming language: Shell
Detect the programming language of a source code stored in a file
echo " class Array def quick_sort return self if length <= 1 pivot = self[0] less, greatereq = self[1..-1].partition { |x| x < pivot } less.quick_sort + [pivot] + greatereq.quick_sort end end " > /tmp/quicksort guesslang /tmp/quicksort # ⟶ Programming language: Ruby
Execute a command that generates source code then detect the programming language on the fly:
echo ' Array.prototype.quick_sort = function () { if (this.length < 2) { return this; } var pivot = this[Math.round(this.length / 2)]; return this.filter(x => x < pivot) .quick_sort() .concat(this.filter(x => x == pivot)) .concat(this.filter(x => x > pivot).quick_sort()); }; ' | guesslang # ⟶ Programming language: JavaScript
Show the programming language detection confidence score as probabilities:
echo " def qsort(items): if not items: return [] else: pivot = items[0] less = [x for x in items if x < pivot] more = [x for x in items[1:] if x >= pivot] return qsort(less) + [pivot] + qsort(more) if __name__ == '__main__': items = [1, 4, 2, 7, 9, 3] print(f'Sorted: {qsort(items)}') " | guesslang --probabilities # Language name Probability # Python 74.80% # Haskell 6.73% # CoffeeScript 5.32% # Groovy 1.95% # Markdown 0.93% # ...
With Guesslang command line tool you can also show the detection probabilities for a given source code and even train your custom programming language detection model.
Run guesslang --help
to see all the available options.
How does Guesslang guess?¶
Deep learning Model¶
Guesslang uses a deep learning Tensorflow model built with 1,900,000 unique source code files, randomly picked from 170,000 public Github projects.
Guesslang model is a Deep Neural Network classifier combined with Linear classifier. The model’s hyperparameters have been fine tuned to have both the best performances and the best generalization.
Training¶
Having a data set with a very large number of diverse examples is essential to correctly train a model. This large dataset is built with GuesslangTools. It is used to train, evaluate and test Guesslang’s deep learning model.
To avoid overfitting, each repositories is strictly associated with only one of the 3 aforementioned tasks. Therefore files from a repository assigned to the training task can only be used to train the model and cannot be used to evaluate nor test it.
The training and evaluation steps are done in a loop, as shown by the following loss curve.
The test in the other hand is done after the last training and evaluation steps to ensure that the final model performs well.
Accuracy¶
Guesslang deep learning model performs very well, with 93.45% accuracy. This accuracy was calculated by testing Guesslang with 230,000 distinct source files.
Limitations¶
Guesslang accuracy is very high but it is not perfect.
Some challenging source codes that are at the border between two languages can fool Guesslang. In fact, a valid C source code is almost always a valid C++ code, and a valid JavaScript source code is always a valid TypeScript code.
This phenomenon is shown by Guesslang’s confusion matrix:
In addition to that, Guesslang may not guess the correct programming languages of very small code snippets. Small snippets don’t always provide enough insights to accurately guess the programming language.
For example, print("Hello world")
is a valid code snippet in several
programming languages including Python, Scala, Ruby, Lua, Perl, etc…
References¶
- Guesslang source code is on Github.
- Guesslang is developped with Tensorflow machine learning framework.
- Use GuesslangTools to build your own training dataset.
- The example codes used in this documentation come from Rosetta Code.
- Guesslang logo has been created with Android Asset Studio and Eduardo Tunni’s Warnes font.
- Guesslang — Copyright (c) 2021 Y. SOMDA, MIT Licence.
- Index
- Module Index
- Search Page