mlh-fellowship / jamspam

GitHub App to jam the spam PRs on your repo and keep maintainers stress-free (even in Hacktober 🎃)

License: MIT License

Shell 1.34% Dockerfile 0.95% JavaScript 35.97% Python 61.74%

jamspam's Introduction

JamSpam

A Machine Learning powered GitHub App built with Probot to jam the spam PRs on your repo and keep maintainers stress-free (even in Hacktober 🎃)

Summary

Building Dataset

  • We collected links to PRs labelled ⚠ SPAM or INVALID ⚠ on some popular repositories, especially those that faced a flood of spam pull requests during the recently concluded Hacktoberfest 🎃, and stored them in a .csv file.
  • Similarly, we collected links to ✅ MERGED PRs on those repositories in a separate .csv file to provide Ham (not Spam) examples.
  • We used Octokit, GitHub's API client library, to fetch pull request information from the PR links and save the desired features locally to build our model.
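
As a minimal sketch of the step before any API call, assuming each .csv row holds one PR link of the form https://github.com/<owner>/<repo>/pull/<number> in its first column, the links can be split into the owner, repository and number that a pull request lookup needs. The function names and the column layout are illustrative assumptions, not taken from the JamSpam source.

```python
import csv
import io
import re

# Assumed link shape: https://github.com/<owner>/<repo>/pull/<number>
PR_LINK = re.compile(r"^https://github\.com/([^/]+)/([^/]+)/pull/(\d+)/?$")


def parse_pr_link(link):
    """Return (owner, repo, number) for a PR link, or None if it does not match."""
    match = PR_LINK.match(link.strip())
    if match is None:
        return None
    owner, repo, number = match.groups()
    return owner, repo, int(number)


def read_pr_links(csv_text):
    """Parse every first-column link in a CSV string, skipping rows that are not PR links."""
    rows = csv.reader(io.StringIO(csv_text))
    parsed = (parse_pr_link(row[0]) for row in rows if row)
    return [item for item in parsed if item is not None]
```

Header rows, blank rows and non-PR links are dropped rather than raising, so a hand-edited .csv does not stop the run.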

Feature Extraction

We chose standard PR attributes and some derived features to train our model:

  • Standard
    • Number of Commits
    • Number of Files Changed
    • Number of Changes (Additions + Deletions)
  • Derived
    • Number of Files Changed of Documentation Type

      # File Extensions considered to be of Doc-Type 
      ['md', 'txt', 'rst', '']
    • Occurrences of spam hit-words in text corpus of PR

      Text Corpus of a Pull Request includes the PR Title, Body, Commit Messages and Diffs.

      All text is pre-processed with regex to exclude any symbols.
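
The two derived features above can be sketched roughly as follows. The doc-type extension list is the one given above; the hit-word set, the exact regex and all function names are placeholders chosen for illustration, since the real ones are not listed here.

```python
import re

# Doc-type extensions as listed above; '' stands for files with no extension.
DOC_EXTENSIONS = {"md", "txt", "rst", ""}

# Hypothetical hit-words, for illustration only; the real list is not given here.
SPAM_HIT_WORDS = {"hacktoberfest", "typo", "readme"}


def file_extension(path):
    """Return the lower-cased extension of the file name, or '' if it has none."""
    name = path.rsplit("/", 1)[-1]
    return name.rsplit(".", 1)[-1].lower() if "." in name else ""


def count_doc_files(paths):
    """Count changed files whose extension is of documentation type."""
    return sum(1 for path in paths if file_extension(path) in DOC_EXTENSIONS)


def clean_text(text):
    """Replace symbols (anything but letters, digits, whitespace) with spaces, then lower-case."""
    return re.sub(r"[^A-Za-z0-9\s]", " ", text).lower()


def count_hit_words(corpus, hit_words=SPAM_HIT_WORDS):
    """Count how many tokens of the cleaned corpus are hit-words."""
    return sum(1 for token in clean_text(corpus).split() if token in hit_words)
```

Here `corpus` would be the PR title, body, commit messages and diffs joined into one string.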

Model Design

We use Keras to build our baseline model. It is essentially a (5-16-16-1) Sequential neural network in which the first three layers use ReLU activation and the final output layer uses a sigmoid activation.

The model is trained for 500 epochs with a batch size of 1.
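
To make the layer sizes concrete, here is a plain-Python stand-in (deliberately not Keras) for the forward pass of such a network. It assumes one reading of "(5-16-16-1)": five input features feeding four dense layers of 5, 16, 16 and 1 units, with ReLU on the first three and a sigmoid on the last. The weights are random and untrained, so the score itself is meaningless; only the shapes and the output range are the point.

```python
import math
import random

# One reading of "(5-16-16-1)": five input features feeding four dense layers
# of 5, 16, 16 and 1 units. This layout is an assumption, not the project's code.
INPUT_FEATURES = 5
LAYERS = [(5, "relu"), (16, "relu"), (16, "relu"), (1, "sigmoid")]


def parameter_count(input_size=INPUT_FEATURES, layers=LAYERS):
    """Count weights plus biases across all dense layers."""
    total, fan_in = 0, input_size
    for units, _ in layers:
        total += fan_in * units + units
        fan_in = units
    return total


def init_weights(seed=0, input_size=INPUT_FEATURES, layers=LAYERS):
    """Create small random weights and zero biases for each layer."""
    rng = random.Random(seed)
    params, fan_in = [], input_size
    for units, _ in layers:
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(fan_in)] for _ in range(units)]
        params.append((weights, [0.0] * units))
        fan_in = units
    return params


def forward(features, params, layers=LAYERS):
    """Run one feature vector through the network and return the spam score."""
    values = list(features)
    for (weights, biases), (_, activation) in zip(params, layers):
        sums = [sum(w * v for w, v in zip(row, values)) + b for row, b in zip(weights, biases)]
        if activation == "relu":
            values = [max(0.0, s) for s in sums]
        else:  # sigmoid on the output layer
            values = [1.0 / (1.0 + math.exp(-s)) for s in sums]
    return values[0]
```

Under this reading the network has 415 trainable parameters, and `forward([3, 1, 10, 1, 2], init_weights())` returns a value strictly between 0 and 1.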

Transfer Model to Bot

The model is exported from Python using tensorflowjs, which creates a model.json and a .bin file to store the model structure, variables and associated weights.

The model is then imported into Node.js using @tensorflow/tfjs-node so that predictions can be made for incoming PRs.
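
As a rough way to check an export before wiring it into the bot, the sketch below reads a model.json string and reports which .bin shards it points to and how many weight values they should hold. It assumes the usual TensorFlow.js layers-model layout (a `modelTopology` entry plus a `weightsManifest` list of groups with `paths` and `weights`); the function name is hypothetical.

```python
import json


def summarize_tfjs_model(model_json_text):
    """List weight shard files and the total weight count named in a model.json string."""
    model = json.loads(model_json_text)
    shard_paths, total_values = [], 0
    for group in model.get("weightsManifest", []):
        shard_paths.extend(group.get("paths", []))
        for weight in group.get("weights", []):
            count = 1
            for dim in weight["shape"]:
                count *= dim
            total_values += count
    return {
        "has_topology": "modelTopology" in model,
        "shard_paths": shard_paths,
        "total_values": total_values,
    }
```

If `total_values` does not match the parameter count of the trained model, the export is probably incomplete.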

Getting Started

Contributing

If you have suggestions for how JamSpam could be improved, or want to report a bug, open an issue! We'd love any and all contributions.

For more, check out the Contributing Guide.

Screenshots

  1. If you are a Collaborator, Contributor, Member, or Owner of the repository, your pull request will never be flagged. [Screenshot: Ham PR]

  2. If you are a First Timer, Mannequin, or First Time Contributor, your pull request will be checked.

     If the pull request is legitimate, it is not flagged. [Screenshot: Ham PR]

     If the pull request is suspected to be spam, it is marked as spam and closed. [Screenshot: Spam PR]
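
The decision flow described here could be sketched as below, using the upper-case author association values that the GitHub API reports. The 0.5 threshold, the function name and the handling of associations not named above are assumptions of this sketch, not documented JamSpam behaviour.

```python
# Associations named above as never flagged.
TRUSTED = {"COLLABORATOR", "CONTRIBUTOR", "MEMBER", "OWNER"}

# Hypothetical cut-off on the model's sigmoid output; the real value is not stated.
SPAM_THRESHOLD = 0.5


def decide(author_association, spam_score, threshold=SPAM_THRESHOLD):
    """Return 'skip', 'pass' or 'close' for a PR."""
    if author_association.upper() in TRUSTED:
        return "skip"  # trusted authors are never checked
    # FIRST_TIMER, MANNEQUIN and FIRST_TIME_CONTRIBUTOR are checked; treating any
    # other value (such as NONE) the same way is an assumption of this sketch.
    return "close" if spam_score >= threshold else "pass"
```

For example, `decide("OWNER", 0.99)` yields `"skip"`, while `decide("FIRST_TIMER", 0.9)` yields `"close"`.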

License

MIT © 2020 MLH Fellowship

Made with ❤️ by Ajwad Shaikh & Vrushti Mody during Sprint 3 of the MLH Fellowship Explorer Batch, Fall 2020.

jamspam's People

Contributors

ajwad-shaikh, vrushti-mody, juliasliu

jamspam's Issues

Build Classifier Model

  • Build Spam/Ham Classifier Model
  • Decide on passing threshold for misclassification rate
  • Test for Accuracy / Improve, if poor

Add Check For Valid Documentation PRs

Skip spam detection and keep the pull request open, irrespective of its contents, if all of the conditions below are met:

  • The PR has a linked issue.
  • The linked issue has the label documentation.
  • The labelled issue is assigned to the same user who sent the pull request.
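
A minimal sketch of this check, assuming the linked issues have already been fetched and reduced to plain dictionaries with `labels` and `assignees` lists (that shape and the function name are illustrative, not the bot's actual data model):

```python
def skips_spam_detection(pr_author, linked_issues):
    """Return True if any linked issue is labelled 'documentation' and assigned to the PR author."""
    for issue in linked_issues:
        has_label = "documentation" in issue.get("labels", [])
        assigned = pr_author in issue.get("assignees", [])
        if has_label and assigned:
            return True
    return False
```

A PR with no linked issue yields `False`, so it still goes through spam detection.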

NLP to detect hit keywords for SPAM/HAM dataset

Implement NLP to extract keywords from SPAM and HAM corpus.

A frequency vector of these keywords would be a great feature for our model. To make sure the keywords are specific to the SPAM and HAM characteristics of a PR, we propose the following.

N = complexity of the model (starting with 30, might change iteratively to achieve better results)

A = Top N keywords list from SPAM dataset
B = Top N keywords list from HAM dataset

SPAM_KEYWORDS = (A - B)
HAM_KEYWORDS = (B - A)

Suggest using multi-rake for rapid keyword extraction from corpus
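
The set arithmetic above can be sketched as follows. Plain word frequency stands in for the suggested keyword extractor here, purely to keep the example dependency-free; it is not the multi-rake algorithm, and the function names are hypothetical.

```python
import re
from collections import Counter


def top_keywords(corpus, n):
    """Return the set of the n most frequent lower-cased words across the documents."""
    counts = Counter()
    for text in corpus:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return {word for word, _ in counts.most_common(n)}


def split_keywords(spam_corpus, ham_corpus, n=30):
    """Compute SPAM_KEYWORDS = A - B and HAM_KEYWORDS = B - A."""
    a = top_keywords(spam_corpus, n)
    b = top_keywords(ham_corpus, n)
    return a - b, b - a
```

Words that rank in the top N of both corpora drop out of both sets, which is what keeps the remaining keywords specific to one class.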

Extract 'feature'ful attributes from fetched data

  • PR Title
  • PR Body
  • Diffs (actual changes)
  • Commit Messages
  • Files Changed
  • Documentation File Changes (.md, .rst, .txt)
  • Number of commits
  • Diffs number (additions + deletions)

(Those ticked have been worked on in #12 (WIP) or earlier)

Data Aggregation Script

Create a script that scrapes, or preferably uses APIs, to get attributes of spam and ham PRs in order to build a complete dataset.

Attributes

  • PR Title
  • PR Description
  • Diffs
  • Commit Messages
  • .. discussion open for more attributes
