osslab-pku / gfi-bot

[Work in Progress] ML-powered 🤖 for finding and labeling good first issues in your GitHub project!

Home Page: https://gfibot.io

License: GNU General Public License v3.0

Python 98.31% Shell 1.08% Dockerfile 0.62%

github-app machine-learning python

gfi-bot's Issues

Add documentation for RESTful APIs

Currently gfibot.data.backend does not have any documentation. It will be helpful to have documentation describing the behavior of each exposed RESTful API.

Python classes not found

There may be an issue with the link to "here" in the above image, which will appear as follows when clicked:

Improvements to the Panel of each Repository (Testing for GFI-Bot)

GFI => Good First Issues. It is non-intuitive for users to understand this abbreviation.
Add more repository statistics (# of stars, contributors, and forks) so that the users can know the quality of the repository.
Add more issue-related and module-related metrics, and show them directly in the repo panel (instead of on the per-issue page), as this information is also important for newcomers.
Add options to sort GFIs. Old issues may be harder to resolve while the latest issues may be easier or more relevant.
Add a title showing "XXXX Recommended Good First Issues"
The per-issue item can be revised to resemble a GitHub style look (e.g., at https://github.com/matplotlib/matplotlib/contribute), with issue labels, opened at, last update, etc. It is important to additionally show whether an issue has a pending PR, as those without pending PRs should be of higher priority for a newcomer willing to contribute.
Clicking the issue number will cause the browser to switch to the GitHub issue page. This is unintuitive. It is better to use a dedicated "To GitHub" button instead.
The user should be able to choose the issues per page. (To allow quicker issue browsing)

Revise the Per-Issue Page

With the revisions in #32, it does not make sense to include the current information on the per-issue page.

Ideally, a user should be able to absorb more information from the per-issue page, but there is no need for comprehensive coverage of everything (otherwise, the user may simply refer to the GitHub issue page).

My proposed revision is to show the issue description as expandable content for each issue item (similar to the design here). We may include the full issue description, but when the description is too long, it should be truncated.

It is also possible to display other detailed information, but I currently do not have a good idea.

Add TF-IDF to Model Training

Currently, the model does not take text into consideration. This affects the perceived quality and validity of recommended GFIs, as the model does not learn anything from text and only learns from historical features.

My proposal is to add two lightweight TF-IDF vectors (e.g., 50 dimensions) to learn from the title and description, respectively. Of course, additional effort needs to be spent on carefully parsing the text into bags of words (with stemming, etc.).
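A minimal sketch of the proposal, using only the standard library, might look like the following. The tokenizer, the function names `fit_tfidf` and `transform_tfidf`, and the choice of the most frequent terms as the capped vocabulary are illustrative assumptions, not GFI-Bot code. Stemming is deliberately left out, and a real implementation would probably rely on an existing vectorizer library.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Naive lowercase word tokenizer; no stemming or stop-word removal.
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_tfidf(docs: List[str], max_features: int = 50) -> Tuple[List[str], Dict[str, float]]:
    """Pick the most frequent terms as the vocabulary and compute smoothed IDF."""
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    # Sort by document frequency (descending), then alphabetically for determinism.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [term for term, _ in ranked][:max_features]
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf


def transform_tfidf(doc: str, vocab: List[str], idf: Dict[str, float]) -> List[float]:
    """Map one document to an L2-normalized TF-IDF vector over the fixed vocabulary."""
    counts = Counter(tokenize(doc))
    vec = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


# Hypothetical issue titles; descriptions would get their own vocabulary and vectors.
titles = ["Fix typo in README", "Crash when parsing config file", "Add typo check to docs"]
vocab, idf = fit_tfidf(titles, max_features=50)
title_vectors = [transform_tfidf(t, vocab, idf) for t in titles]
```

Fitting one vocabulary on titles and a second one on descriptions would yield the two fixed-size vectors, which could then be concatenated with the existing historical features.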

Add Explanations Somewhere for the Current GFI-Bot Logo

The current GFI-Bot logo was designed by Haonan Su. It looks good, but in my opinion its meaning is not obvious to others interested in our project. It may be better to add some explanation to README.md describing the rationale behind the logo design.

Add Logs for `gfibot.data.update` and `gfibot.data.dataset`

Dataset construction needs to be recorded with histories and current progress in MongoDB, which can be used for:

backend to read current progress and display to users
evaluating how dataset construction consumes GitHub API tokens (this can be an important part of evaluation in the final paper)
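One possible shape for such a log record is sketched below. The class name `UpdateLog` and every field name are hypothetical and not taken from the GFI-Bot schemas. The sketch only builds the document; writing it to a MongoDB collection is left to the caller.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class UpdateLog:
    """One record per repository update run (all field names are hypothetical)."""

    owner: str
    name: str
    started_at: datetime
    finished_at: Optional[datetime] = None
    issues_fetched: int = 0
    rate_limit_before: int = 0  # remaining API quota when the run started
    rate_limit_after: int = 0   # remaining API quota when the run finished

    @property
    def tokens_consumed(self) -> int:
        # Assumes the quota did not reset while the run was in progress.
        return self.rate_limit_before - self.rate_limit_after

    def to_document(self) -> dict:
        doc = asdict(self)
        doc["tokens_consumed"] = self.tokens_consumed
        return doc


# Example values are made up for illustration.
log = UpdateLog(
    "osslab-pku",
    "gfi-bot",
    started_at=datetime(2022, 1, 1, tzinfo=timezone.utc),
    rate_limit_before=5000,
)
log.issues_fetched = 120
log.rate_limit_after = 4870
log.finished_at = datetime(2022, 1, 1, 0, 5, tzinfo=timezone.utc)
document = log.to_document()  # the caller could insert this into MongoDB
```

The backend could read the latest record per repository to display progress, while summing `tokens_consumed` over all records would give the API cost of dataset construction.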

An empty page problem in the frontend

There are currently 98 repos in total, so page 20 holds only 3 repos and page 21 should not exist.
However, page 21 still exists and shows an empty page.
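The expected page count can be checked with a small sketch. A page size of 5 is inferred from the numbers in this report (19 × 5 + 3 = 98), not read from the frontend code, and both function names are hypothetical.

```python
import math


def total_pages(n_items: int, per_page: int) -> int:
    """Number of pages to show; at least 1 so an empty list still renders one page."""
    return max(1, math.ceil(n_items / per_page))


def items_on_page(n_items: int, per_page: int, page: int) -> int:
    """How many items a 1-indexed page holds (0 if the page is out of range)."""
    if page < 1:
        return 0
    start = (page - 1) * per_page
    return max(0, min(per_page, n_items - start))


pages = total_pages(98, 5)                   # expected last page number
last_page_count = items_on_page(98, 5, 20)   # items on page 20
overflow_count = items_on_page(98, 5, 21)    # page 21 should hold nothing
```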

The predicted probabilities of some issues are 99.99%

Currently, many open issues in some projects have a GFI probability of 99.99%, and some of these issues clearly should not be marked as GFI.

The performance metric of the model is also unusually high.

I examined the code and found two features that may be problematic. The first is 'created_at_timestamp', which is not one of the intended features and should not be included in X (see get_x_y() in gfibot/model/utils.py). The second is 'rpt_gfi_ratio': when I drop this feature, the model performance metrics drop significantly.

The problems can be solved by the following steps:

Add 'created_at_timestamp' to

gfi-bot/gfibot/model/utils.py

Line 135 in 7ed0761

["owner", "name", "number", "is_gfi", "created_at", "closed_at"]
The gfi_ratio and gfi_num features should be calculated with a new issue list, which only includes issues closed before the data collection time.

gfi-bot/gfibot/model/dataloader.py

Line 112 in 7ed0761

def _get_newcomer_ratio(n_user_commits: List[int], newcomer_thres: int) -> float:

gfi-bot/gfibot/model/dataloader.py

Line 118 in 7ed0761

def _get_newcomer_num(n_user_commits: List[int], newcomer_thres: int) -> int:

Currently, the following list is used. A new issue list such as issues = [i for i in user.issues if i.closed_at <= t] should be created for calculating gfi_ratio and gfi_num.

gfi-bot/gfibot/data/dataset.py

Line 205 in 7ed0761

issues = [i for i in user.issues if i.created_at <= t]
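The difference between the two filters can be illustrated with a short sketch. The `Issue` dataclass and the `gfi_ratio` helper are hypothetical stand-ins, not the actual GFI-Bot data model. The sketch also assumes that open issues carry `closed_at = None` and must be excluded explicitly.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Issue:
    # Hypothetical stand-in for the issue records attached to a user.
    created_at: datetime
    closed_at: Optional[datetime]
    is_gfi: bool


def issues_closed_before(issues: List[Issue], t: datetime) -> List[Issue]:
    """Keep only issues whose outcome was already known at time t."""
    return [i for i in issues if i.closed_at is not None and i.closed_at <= t]


def gfi_ratio(issues: List[Issue], t: datetime) -> float:
    """Share of GFIs among the issues that were closed no later than t."""
    closed = issues_closed_before(issues, t)
    if not closed:
        return 0.0
    return sum(i.is_gfi for i in closed) / len(closed)


t = datetime(2022, 6, 1)
history = [
    Issue(datetime(2022, 1, 1), datetime(2022, 2, 1), True),   # closed before t
    Issue(datetime(2022, 3, 1), datetime(2022, 4, 1), False),  # closed before t
    Issue(datetime(2022, 5, 1), datetime(2022, 7, 1), True),   # created before t, closed after t
    Issue(datetime(2022, 5, 15), None, False),                 # still open at t
]
ratio = gfi_ratio(history, t)
```

Filtering on `created_at` would also count the third issue, whose GFI label only becomes known after `t`, which is the kind of leakage that can inflate both the predicted probabilities and the performance metrics.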

After the above features are corrected, most of the predicted probabilities may end up close to 0 because of the imbalance between positive and negative instances in the training data. This can be addressed by balancing the training dataset with methods such as SMOTE and ADASYN. Then we can check whether the '99.99% probabilities' problem is solved.
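To make the balancing step concrete, the sketch below uses plain random oversampling, which simply duplicates minority-class rows. This is a simpler stand-in, not SMOTE or ADASYN, which synthesize new minority samples. The function name `random_oversample` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def random_oversample(
    X: Sequence[Sequence[float]], y: Sequence[int], seed: int = 0
) -> Tuple[List[Sequence[float]], List[int]]:
    """Duplicate randomly chosen minority-class rows until both classes are equal in size.

    This is NOT SMOTE or ADASYN: no synthetic samples are interpolated.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        raise ValueError("cannot oversample: one class has no samples")
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    indices = list(range(len(y))) + extra
    return [X[i] for i in indices], [y[i] for i in indices]


# Toy data: four negatives and one positive.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
```

Whatever method is chosen, it should only be applied to the training split, so that the evaluation data keeps the real class distribution.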

Incremental Fetch of User Statistics Over All GitHub

Description

Training RecGFI requires, for each issue, the overall development experience of every issue participant. For each participant, we choose to estimate this from their GitHub profile, with two key requirements:

For a participant, RecGFI will need to have some estimation of their GitHub experience before any specified time point.
To save precious GitHub rate limit, it is also necessary to incrementally build user statistics based on existing statistics (i.e., only fetch necessary new data from GitHub).

Implementing an approach to collect a comprehensive GitHub profile while satisfying the above requirements can be hard. Therefore, we resort to collecting only user-created repos, issues, and pull requests using the GitHub GraphQL API, because these statistics are both timestamped and support time-related queries (see User API). The collected data should be saved in the gfibot.users MongoDB collection following the schema provided in schemas/users.json.

Files to Touch

gfibot/data/update.py
gfibot/data/graphql.py
tests/data/test_update.py
tests/data/test_graphql.py
schemas/users.json

You can add new dependencies, create new tests, or add new GraphQL query files where necessary.

Implementation Steps

Implement and add tests for incremental fetch of GitHub user stats for a given username using GitHub GraphQL API in gfibot/data/graphql.py.
Implement and test the following function in gfibot/data/update.py to call your implementation and save the final data (adhering to the schema in schemas/users.json) in MongoDB.

from datetime import datetime


def update_user(user: str, since: datetime) -> None:
    # TODO: We need an efficient approach to fetch user profile from GitHub,
    #   we may use the GraphQL API with more user-related features than the REST API
    raise NotImplementedError()

Comment out the code in the update_repo() function of gfibot/data/update.py to test whether your contribution works!
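The two requirements from the description (experience before a given time point, and fetching only new data) can be sketched without touching the GitHub API. In the sketch below, `fetch_since` is a hypothetical callable standing in for the GraphQL query, records are plain dictionaries with assumed `id` and `created_at` keys, and the default `epoch` is just an arbitrary early date.

```python
from datetime import datetime
from typing import Callable, Dict, List

Record = Dict[str, object]  # e.g. {"id": "...", "created_at": datetime(...)}


def incremental_update(
    existing: List[Record],
    fetch_since: Callable[[datetime], List[Record]],
    epoch: datetime = datetime(2008, 1, 1),
) -> List[Record]:
    """Fetch only records at or after the latest stored timestamp and merge them in."""
    since = max((r["created_at"] for r in existing), default=epoch)
    known_ids = {r["id"] for r in existing}
    fresh = [r for r in fetch_since(since) if r["id"] not in known_ids]
    return sorted(existing + fresh, key=lambda r: r["created_at"])


def count_before(records: List[Record], t: datetime) -> int:
    """Experience estimate at time t: number of records created up to t."""
    return sum(1 for r in records if r["created_at"] <= t)


# Fake remote data and a fake fetcher that records how it was called.
stored = [{"id": "a", "created_at": datetime(2020, 1, 1)}]
remote = [
    {"id": "a", "created_at": datetime(2020, 1, 1)},
    {"id": "b", "created_at": datetime(2021, 1, 1)},
    {"id": "c", "created_at": datetime(2022, 1, 1)},
]
calls: List[datetime] = []


def fake_fetch(since: datetime) -> List[Record]:
    calls.append(since)
    return [r for r in remote if r["created_at"] >= since]


merged = incremental_update(stored, fake_fetch)
experience_mid_2021 = count_before(merged, datetime(2021, 6, 1))
```

Because every stored record keeps its timestamp, the same merged list can answer "experience before time t" for any issue without another API call.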

A Systematic Evaluation of The Trained ML Model

To align with the requirements of a research paper, the following evaluations need to be conducted:

Comparison with alternative models (LightGBM, Logistic Regression, SVM, DNN, and Random Forest)
Ablation studies on features (along multiple dimensions of features)
Feature importance analysis
Showing the advantage of automated hyperparameter tuning

All evaluation data should be stored in the MongoDB database (for visualization in the frontend in the future).
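As one illustration, the ablation study could be organized as a loop that removes one feature group at a time and records the score drop. In the sketch below, `ablation_study`, the feature groups, and the `toy_evaluate` scorer are all hypothetical. The scorer is a made-up additive function used only so the example runs; it is not a model and its numbers are not results.

```python
from typing import Callable, Dict, List, Sequence


def ablation_study(
    feature_groups: Dict[str, List[str]],
    evaluate: Callable[[Sequence[str]], float],
) -> Dict[str, float]:
    """Return the score drop observed when each feature group is removed in turn."""
    all_features = [f for group in feature_groups.values() for f in group]
    baseline = evaluate(all_features)
    drops: Dict[str, float] = {}
    for name, group in feature_groups.items():
        kept = [f for f in all_features if f not in group]
        drops[name] = baseline - evaluate(kept)
    return drops


# Toy scorer: a fixed additive contribution per feature (purely illustrative).
TOY_WEIGHTS = {"title_len": 0.05, "body_len": 0.05, "rpt_gfi_ratio": 0.2}


def toy_evaluate(features: Sequence[str]) -> float:
    return 0.5 + sum(TOY_WEIGHTS[f] for f in features)


groups = {"content": ["title_len", "body_len"], "reporter": ["rpt_gfi_ratio"]}
drops = ablation_study(groups, toy_evaluate)
```

In practice `evaluate` would train and score a model on the given feature subset, and the resulting dictionary could be stored as one evaluation document per run.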

Improve the "Sort By" and "Tag" Options in Frontend

In the current frontend of GFI-Bot, all repositories are sorted by their order in the MongoDB database by default. This does not make sense to end users. It may be better to default the order to "Popularity" and remove the "None" option, as newcomers are likelier to try out popular repositories than the least popular ones.

The names of the options should be more specific and self-explanatory. My suggestions:
- Popularity => Number of Stars
- GFI => Number of Good First Issues
- Newcomer Friendliness => can't understand, change it to the specific sorting metric used
- Tags => Programming Languages
These sorting and filtering options should be made more visible to end users, with a visual layout like:

Users can choose one option from all options in the "Sort By" line, and add or remove filtering conditions in the two "Filter by" lines. The second line is to filter by GitHub tags. Each of the options in the "Sort By" line should have tooltip text explaining its meaning.
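The suggested behavior (one sort key chosen from a fixed set, plus an optional language filter, with number of stars as the default) can be sketched as follows. The field names `stars`, `gfis`, and `language`, as well as the function `sort_and_filter`, are assumptions for illustration and may not match the backend's actual response format.

```python
from typing import Dict, List, Optional

# Sort keys mapped to the self-explanatory labels suggested above.
SORT_LABELS = {
    "stars": "Number of Stars",
    "gfis": "Number of Good First Issues",
}


def sort_and_filter(
    repos: List[Dict[str, object]],
    sort_by: str = "stars",
    language: Optional[str] = None,
) -> List[Dict[str, object]]:
    """Filter by programming language (if given), then sort descending by the chosen key."""
    if sort_by not in SORT_LABELS:
        raise ValueError(f"unknown sort option: {sort_by}")
    selected = [r for r in repos if language is None or r["language"] == language]
    return sorted(selected, key=lambda r: r[sort_by], reverse=True)


# Made-up repositories for illustration.
repos = [
    {"name": "repo-a", "stars": 120, "gfis": 4, "language": "Python"},
    {"name": "repo-b", "stars": 950, "gfis": 1, "language": "Go"},
    {"name": "repo-c", "stars": 300, "gfis": 9, "language": "Python"},
]
by_stars = [r["name"] for r in sort_and_filter(repos)]
python_by_gfis = [r["name"] for r in sort_and_filter(repos, "gfis", "Python")]
```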

Improvements to the Panel of each Repository

GFI => Good First Issues. It is non-intuitive for users to understand this abbreviation.
Add more repository statistics (# of stars, contributors, and forks) so that the users can know the quality of the repository.
Add more issue-related and module-related metrics, and show them directly in the repo panel (instead of on the per-issue page), as this information is also important for newcomers.
Add options to sort GFIs. Old issues may be harder to resolve while the latest issues may be easier or more relevant.
Add a title showing "XXXX Recommended Good First Issues"
The per-issue item can be revised to resemble a GitHub style look (e.g., at https://github.com/matplotlib/matplotlib/contribute), with issue labels, opened at, last update, etc. It is important to additionally show whether an issue has a pending PR, as those without pending PRs should be of higher priority for a newcomer willing to contribute.
Clicking the issue number will cause the browser to switch to the GitHub issue page. This is unintuitive. It is better to use a dedicated "To GitHub" button instead.
The user should be able to choose the issues per page. (To allow quicker issue browsing)

Compute Features and Train Models using the Newly Collected Data

Implement a feature computation module, which computes the features specified in our RecGFI paper and builds a dataset using fetched data in our MongoDB. The collected features (as the final dataset) should be stored in MongoDB.
Implement a model training module, which trains and evaluates machine learning models for GFI recommendation, as done in our RecGFI paper. It should save training progress, training performance, and recommendation results in our MongoDB.

Add Minimal Working Code for Frontend and Backend

Description

GFI-Bot needs to implement:

A Python backend using Flask for exposing JSON-based APIs. We choose Flask because it is lightweight and easy to use. This backend will be used to provide information including registered repositories, collected issues, PRs, commits, and users, data collection progress, training progress, the performance of current models, recommendations, etc.
A JavaScript frontend using React and React Bootstrap for showcasing our project, attracting potential users, and providing a dashboard for registered users.

As the first step, we need minimal working code for both frontend and backend.

For the frontend, we expect to have a basic React project showing a Home page, a navigation bar, and some copyright and about notices at the bottom of each page. We also need a "Repositories" page to list currently registered repositories (in the gfibot.repos collection) and display basic statistics for those repositories. Since the number of repositories may become large, this page needs pagination. For the Home page, we need basic information about GFI-Bot and a three-column description like that on this page. No need to fill in text on the Home page; we will fill it in later.

For the backend, we expect to have some APIs to return currently registered repositories as a paginated list.
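The core of such a paginated response can be sketched as a plain function that a Flask view could wrap and return as JSON. The function name `paginate` and the response keys are assumptions, not the actual GFI-Bot API.

```python
from typing import Any, Dict, List


def paginate(items: List[Any], page: int = 1, per_page: int = 10) -> Dict[str, Any]:
    """Build a JSON-serializable page of results (pages are 1-indexed)."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    total = len(items)
    start = (page - 1) * per_page
    return {
        "page": page,
        "per_page": per_page,
        "total": total,
        "total_pages": max(1, -(-total // per_page)),  # ceiling division
        "items": items[start:start + per_page],
    }


# Made-up repository names for illustration.
repo_names = [f"repo-{i}" for i in range(1, 24)]
first_page = paginate(repo_names, page=1, per_page=10)
last_page = paginate(repo_names, page=3, per_page=10)
```

Returning `total_pages` alongside the items lets the frontend render its pagination controls without a second request.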

Files to Touch

The main file changes should be made in the frontend folder and the gfibot/backend folder. You can also add tests, new dependencies, and new GitHub workflows (.github/workflows) where necessary.

Enable More Transparency in Model Training

We need to show users more information in the frontend about model training and evaluation results, including:

the performance on each project and the kind of dataset the model was trained on (aggregated as a table)
XGBoost feature weights

Add Tests for RESTful APIs

Request for Field Descriptions in 'dataset.bson' and 'resolved_issue.bson'

Hello,

I recently downloaded the dataset.bson and the resolved_issue.bson datasets from your project on Zenodo. However, I could not find detailed descriptions of the fields contained in these datasets within the repository documentation. I need this information to properly understand and utilize the dataset.

Could you please provide a detailed description of these data fields, or direct me to where I might find this information? It is crucial for my research.

Thank you!

Add User Features for Their Overall GitHub Profile

Outdated Documentation on How to Run and Deploy GFI-Bot

Currently, all documentation is severely outdated regarding collecting data, training models, understanding the code structure, and deploying the backend and frontend. My proposal is to create a separate DEVELOPMENT.md in the project root folder to explain how to run and deploy GFI-Bot, with the following sections:

A section to explain the current code structure,
A section to describe the database (and point to the database schemas in gfibot.collections),
A section to introduce how to test new functionalities,
A section to describe how to deploy GFI-Bot and make it autonomously collect data, train models, etc., in a new machine. We can add options to make this process more lightweight, such as limiting the number of projects involved so that GFI-Bot can be easily tested on a local machine.

Then, all outdated content in README.md can be replaced by a link to DEVELOPMENT.md.
