Saving data exactly as obtained from GitHub for later use. We can use MongoDB to just immediately save the JSON document from the query as-is. Start server:
mongod --noauth --dbpath $PWD/mongo/data/db
GUI to view database: mongodb-compass
Run the following in your terminal before starting a jupyter notebook
export CONNECTION_STRING="<connectionString>"
Main script is in ./ghtesting/ghtesting.py
.
All the data requested from github is hardcoded in a graphql query file: ghtesting/queries/github/repos.graphql
The repos obtained are determined by the search parameters that are provided to this query in the form of graphql variables.
Using the explorer to build and test queries is helpful: https://docs.github.com/en/graphql/overview/explorer
The GHRepo
class defines some functions to extract features from the json but not all features are coded right now.
The script ./ghtesting/dumpdb.py
is an example of iterating through all the repos in the database.
It would take too long to interact and authenticate with all the different CI services, so instead we just look at the badges on the README. Most services allow you to copy the badge markdown/html to paste into your README, so we think most people who take the time to setup CI will probably take this small step to add the badge.
The ./ghtesting/scrape_ci_badges.py
uses the Python requests and BeautifulSoup packages to scrape the README section of the GitHub page for each project.
It specifically grabs all the img tags, regardless of whether or not the element is actually one of the badges.
After obtaining all the images, the URLs were incrementally analyzed to filter out the badges and the regexes for the URLs were added to GHRepo.
Using Jupyter Notebook to analyze the data.
pip install jupyter
pip install matplotlib
pip install pandas
pip install matplotlib-venn #for one plot
`
Set environment variable with Github API tokens (generated in your profile) and then run pytest.
# Only need 1 token
export GITHUB_TOKENS="token1 token2"
pytest