Comments (4)

FelixNeutatz commented on August 16, 2024

Hi Johannes,

First, a quick tutorial on how to set everything up:

  1. Set up Maven:
git clone https://github.com/FelixNeutatz/GitHubRepositoryClassifier.git
cd GitHubRepositoryClassifier
mvn clean install
  2. Set up Python:
cd GitHubRepositoryClassifier/model
/usr/bin/python2.7 setup.py install
  3. Configure the GitHub token:
    Create a file, e.g. "/path/mygittoken.txt", and enter the generated token in the following format (without the brackets):
oauth=[my GitHub access token]

Then create a file "/path/tokenfilelist.csv" that lists all the token files, one per line:

/path/mygittoken.txt
/path/mygittoken2.txt
...

Now specify this file in ../GitHubRepositoryClassifier/commons/src/main/resources/conf/conf.properties:

sample.generation.git-accounts.file=/path/tokenfilelist.csv
  4. Run the classification:
cd GitHubRepositoryClassifier/model/ml
/usr/bin/python2.7 classify.py --input golden_data_set.csv
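
The token setup from the tutorial above could also be scripted. The following is only a minimal sketch, assuming one `oauth=` line per token file and one file path per line in the list file, as described above. The helper name `write_token_files`, the numbered file names, and the placeholder tokens are made up for illustration and are not part of the project.

```python
import os
import tempfile


def write_token_files(tokens, directory):
    """Write one token file per token, plus a list file naming them all.

    Hypothetical helper: the file names used here are illustrative only.
    """
    token_paths = []
    for index, token in enumerate(tokens, start=1):
        path = os.path.join(directory, "mygittoken%d.txt" % index)
        with open(path, "w") as handle:
            # Format from the tutorial: oauth=<token>, without brackets.
            handle.write("oauth=%s\n" % token)
        token_paths.append(path)

    list_path = os.path.join(directory, "tokenfilelist.csv")
    with open(list_path, "w") as handle:
        # One token file path per line.
        handle.write("\n".join(token_paths) + "\n")
    return list_path


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    list_file = write_token_files(
        ["PLACEHOLDER_TOKEN_1", "PLACEHOLDER_TOKEN_2"], demo_dir
    )
    # This is the line to put into conf.properties.
    print("sample.generation.git-accounts.file=" + list_file)
```

The printed line still has to be copied into conf.properties by hand; the sketch does not touch the project's configuration.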

Regarding your other questions: the extractor and attachment download steps are optional and only required for training and validation.

About the time constraint:
For every repository we download up to 10k issues, 10k commits, 10k releases, 10k branches, 10k languages, 10k contributors and some more statistics.
Given a rate limit of 5000 requests per hour, here is the number of requests needed for the linux repository alone:
10,000 commits (100 requests)
1 branch (1 request)
497 releases (5 requests)
6 languages (1 request)
more than 100 contributors (1 request)
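
The counts above can be reproduced with a few lines of arithmetic. This is a rough sketch, assuming 100 items per request (which is what "10,000 commits (100 requests)" implies) and that every category costs at least one request. The function name `requests_needed` is made up for illustration.

```python
import math

PER_PAGE = 100  # assumed page size, implied by the commit count above


def requests_needed(item_count, per_page=PER_PAGE):
    """Number of paginated requests needed to fetch item_count items."""
    # Even an empty or tiny category still costs one request.
    return max(1, int(math.ceil(item_count / float(per_page))))


# Item counts for the linux repository, taken from the list above.
linux_counts = {"commits": 10000, "branches": 1, "releases": 497, "languages": 6}

total = sum(requests_needed(count) for count in linux_counts.values())
# Contributors are counted as a single request, as in the list above.
total += 1

print(total)  # 108
```

So the linux repository alone accounts for roughly 108 of the 5000 requests available per hour under these assumptions.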

So if we have a lot of big repositories, the program will not finish within an hour. We can change our limit to 1000 elements per category (issues, commits, ...), but this would probably reduce the quality of the predictions, since we trained on unlimited issues, commits, and so on.

What should we do?
Changing the limit is super easy, we just change this line: https://github.com/FelixNeutatz/GitHubRepositoryClassifier/blob/master/serialization/src/main/java/tu/kn/ghrepoclassifier/serialization/data/RepoData.java#L79

Of course, you could also add some more authentication tokens to speed up downloading.
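
To get a feeling for how much extra tokens help, here is a back-of-the-envelope sketch. It assumes the hourly budgets of the tokens simply add up and ignores everything except the rate limit, so it gives a lower bound rather than a prediction. The function name `estimated_hours` and the workload of 100,000 requests are hypothetical.

```python
RATE_LIMIT_PER_HOUR = 5000  # per token, as stated above


def estimated_hours(total_requests, token_count, rate_limit=RATE_LIMIT_PER_HOUR):
    """Lower bound on download time if the tokens' budgets simply add up."""
    if token_count < 1:
        raise ValueError("at least one token is required")
    return total_requests / float(rate_limit * token_count)


# Hypothetical workload of 100,000 requests, not a measured figure.
for tokens in (1, 5):
    print("%d token(s): %.1f hours" % (tokens, estimated_hours(100000, tokens)))
```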

Best regards,
Felix & Kevin

from githubrepositoryclassifier.

jonico commented on August 16, 2024

Thanks @FelixNeutatz, your instructions for running the classifier are now much clearer to us. We will probably give you the benefit of the doubt, assume you classify the linux repo correctly, and only categorize the 220 much smaller Git repos.

kevinkepp commented on August 16, 2024

Hi @jonico, we committed the predictions of our algorithm on the golden data set to the tu-berlin-evaluation repository. We used 5 API tokens, and it took the program one night to download and classify the repositories.
Best,
Kevin & Felix

jonico commented on August 16, 2024

Thank you!
