youcefmco / ml

This project forked from src-d/ml

sourced.ml is a library to build and apply machine learning models on top of Universal Abstract Syntax Trees

License: Apache License 2.0

Python 100.00%

ml's Introduction

sourced.ml

Machine Learning models on top of Abstract Syntax Trees. Formerly known as ast2vec.

The following are currently implemented:

  • id2vec, source code identifier embeddings
  • docfreq, source code identifier document frequencies (part of TF-IDF)
  • nBOW, weighted bag of vectors, as in src-d/wmd-relax
  • topic modeling

This project can serve as the foundation for machine learning on source code (MLoSC) research and development. It abstracts feature extraction and working with models, so you can focus on higher-level tasks.

It is written in Python 3 and has been tested on Linux and macOS. sourced.ml is tightly coupled with Babelfish and delegates all AST parsing to it.

Here is the list of projects which are built with sourced.ml:

  • vecino - finding similar repositories
  • tmsc - topic modeling of repositories
  • role2vec - AST node embedding and correction
  • snippet-ranger - topic modeling of source code snippets

Installation

pip3 install ast2vec

You need to have libxml2 installed. For example, on Ubuntu run apt install libxml2-dev.

Usage

This project exposes two interfaces: a Python API and a command line. The command line entry point is

ast2vec --help

There is an example of using the Python API here.

The project exposes several tools to generate the models and set up the environment.

The API is divided into two domains: models and training. The first is about using models, the second is about creating them.

  • Models: Id2Vec, DocumentFrequencies, NBOW, Cooccurrences.
  • Transformers (keras/sklearn style): Repo2nBOWTransformer, Repo2CooccTransformer, PreprocessTransformer, SwivelTransformer and PostprocessTransformer.

Docker image

First, run the Babelfish image and install the language drivers:

docker run -d --privileged -p 9432:9432 --name bblfshd bblfsh/bblfshd:v2.2.0
docker exec -it bblfshd bblfshctl driver install image docker://bblfsh/python-driver:v1.2.6
docker exec -it bblfshd bblfshctl driver install image docker://bblfsh/java-driver:v1.2.2

You can find the full installation guide in the official getting started manual.

Then build the ast2vec image and run it:

docker build -t srcd/ast2vec .
docker run -it --rm srcd/ast2vec --help

If the first command fails with

Cannot connect to the Docker daemon. Is the docker daemon running on this host?

and you are sure that the daemon is running, then you need to add your user to the docker group: refer to the documentation.

Playing around with ast2vec in a Jupyter notebook

You can launch our Docker image with the example notebooks by running:

docker run --name ast2vec-jupyter -p 8888:8888 -it --rm --entrypoint jupyter srcd/ast2vec notebook --allow-root --ip=0.0.0.0

Go to http://localhost:8888/notebooks/how_to_use_ast2vec.ipynb to see the example notebook.

Don't forget to run the Babelfish image and build the srcd/ast2vec image as described in the previous section.

Algorithms

Identifier embeddings

We build the source code identifier co-occurrence matrix for every repository.

  1. Clone or read the repository from disk.
  2. Classify files using enry.
  3. Extract UAST from each supported file.
  4. Split and stem all the identifiers in each tree.
  5. Traverse the UAST, collapse all non-identifier paths and record all identifiers on the same level as co-occurring. In addition, connect them with their immediate parents.
  6. Write the individual co-occurrence matrices.
  7. Merge co-occurrence matrices from all repositories. Write the document frequencies model.
  8. Train the embeddings using Swivel running on TensorFlow. Interactively view the intermediate results in TensorBoard using --logs.
  9. Write the identifier embeddings model.
  10. Publish generated models to the Google Cloud Storage.

Steps 1-6 are performed with the repo2coocc tool / Repo2CooccTransformer class, step 7 with id2vec_preproc / id_embedding.PreprocessTransformer, step 8 with id2vec_train / id_embedding.SwivelTransformer, step 9 with id2vec_postproc / id_embedding.PostprocessTransformer and step 10 with publish.
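Steps 4 and 5 can be illustrated with a minimal, self-contained sketch. It is not the code used by repo2coocc: the toy tree format (dicts with optional "name" and "children" keys), the helper names, and the choice to also count tokens of the same identifier as co-occurring are assumptions made for illustration, and stemming is omitted.

```python
import re
from collections import Counter
from itertools import combinations


def split_identifier(name):
    """Split a snake_case / camelCase identifier into lowercase tokens (no stemming)."""
    tokens = []
    for part in re.split(r"[^A-Za-z]+", name):
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [t.lower() for t in tokens]


def identifier_children(node):
    """Return the nearest descendants that carry an identifier,
    skipping (collapsing) nodes without one."""
    found = []
    for child in node.get("children", []):
        if child.get("name"):
            found.append(child)
        else:
            found.extend(identifier_children(child))
    return found


def cooccurrences(root):
    """Count unordered token pairs seen on the same level or in a parent-child relation."""
    counts = Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        kids = identifier_children(node)
        level = [t for kid in kids for t in split_identifier(kid["name"])]
        # Identifiers on the same (collapsed) level co-occur with each other.
        for a, b in combinations(level, 2):
            if a != b:
                counts[tuple(sorted((a, b)))] += 1
        # Identifiers also co-occur with their immediate identifier parent.
        if node.get("name"):
            for parent_token in split_identifier(node["name"]):
                for t in level:
                    if parent_token != t:
                        counts[tuple(sorted((parent_token, t)))] += 1
        stack.extend(kids)
    return counts


# Hypothetical toy tree standing in for a UAST.
toy_tree = {
    "children": [
        {
            "name": "readFile",
            "children": [
                {"children": [{"name": "file_name"}, {"name": "buffer"}]},
            ],
        }
    ]
}
toy_counts = cooccurrences(toy_tree)
```

The resulting Counter plays the role of one repository's sparse co-occurrence matrix; summing such counters across repositories corresponds loosely to the merge in step 7.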

Weighted Bag of Vectors

We represent every repository as a weighted bag-of-vectors, provided that we have the document frequencies ("docfreq") and the identifier embeddings ("id2vec").

  1. Clone or read the repository from disk.
  2. Classify files using enry.
  3. Extract UAST from each supported file.
  4. Split and stem all the identifiers in each tree.
  5. Leave only those identifiers which are present in "docfreq" and "id2vec".
  6. Set the weight of each such identifier as TF-IDF.
  7. Set the value of each such identifier as the corresponding embedding vector.
  8. Write the nBOW model.
  9. Publish it to the Google Cloud Storage.

Steps 1-8 are performed with the repo2nbow tool / Repo2nBOWTransformer class and step 9 with publish.
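Steps 5-7 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the repo2nbow implementation: the function name, the dictionary-based "docfreq" and "id2vec" stand-ins, and the specific TF-IDF variant (raw count times log(N / df)) are all hypothetical choices.

```python
import math
from collections import Counter


def weighted_bag(tokens, docfreq, embeddings, n_docs):
    """Return {token: (tfidf_weight, vector)} for tokens known to both models."""
    # Step 5: keep only identifiers present in both "docfreq" and "id2vec".
    known = [t for t in tokens if t in docfreq and t in embeddings]
    term_freq = Counter(known)
    bag = {}
    for token, count in term_freq.items():
        # Step 6: weight = term frequency * inverse document frequency (assumed variant).
        idf = math.log(n_docs / docfreq[token])
        # Step 7: value = the corresponding embedding vector.
        bag[token] = (count * idf, embeddings[token])
    return bag


# Hypothetical toy inputs.
tokens = ["read", "file", "file", "tmp"]
docfreq = {"read": 10, "file": 50, "tmp": 5}
embeddings = {"read": [0.1, 0.2], "file": [0.3, 0.4]}
bag = weighted_bag(tokens, docfreq, embeddings, n_docs=100)
```

Here "tmp" is dropped because it has no embedding, and "file" gets a lower per-occurrence weight than "read" because it appears in more documents.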

Topic modeling

See here.

Contributions

PEP8

We use PEP8 with a line length of 99 and double quotes ("). All the tests must pass:

python3 -m unittest discover /path/to/ast2vec

License

Apache License Version 2.0, see LICENSE

ml's People

Contributors

vmarkovtsev, zurk, fineguy, galdude33, mcuadros, marnovo, dependabot[bot]
