How to run your program about githubrepositoryclassifier HOT 4 CLOSED

jonico commented on August 16, 2024

How to run your program

from githubrepositoryclassifier.

Comments (4)

FelixNeutatz commented on August 16, 2024

Hi Johannes,

First, a quick tutorial on how to set everything up:

Set up Maven

git clone https://github.com/FelixNeutatz/GitHubRepositoryClassifier.git
cd GitHubRepositoryClassifier
mvn clean install

Set up Python

cd GitHubRepositoryClassifier/model
/usr/bin/python2.7 setup.py install

Configure GitHub token
Now create a file, e.g. "/path/mygittoken.txt", and enter the generated token in the following format (without the brackets):

oauth=[my GitHub access token]

Then create a file "/path/tokenfilelist.csv" that lists all the token files, one per line:

/path/mygittoken.txt
/path/mygittoken2.txt
...

Now specify this file in ../GitHubRepositoryClassifier/commons/src/main/resources/conf/conf.properties:

sample.generation.git-accounts.file=/path/tokenfilelist.csv
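The token setup above can also be scripted. The following is only a minimal sketch, not part of the project: the helper name `write_token_files`, the numbered file names and the placeholder tokens are all hypothetical, and it assumes the classifier accepts a trailing newline in both files.

```python
import os
import tempfile


def write_token_files(directory, tokens):
    """Write one 'oauth=<token>' file per token and a list file naming them all.

    Hypothetical helper; file names are illustrative only.
    """
    token_paths = []
    for i, token in enumerate(tokens, start=1):
        path = os.path.join(directory, "mygittoken%d.txt" % i)
        with open(path, "w") as f:
            f.write("oauth=%s\n" % token)
        token_paths.append(path)

    # The list file contains one token file path per line.
    list_path = os.path.join(directory, "tokenfilelist.csv")
    with open(list_path, "w") as f:
        f.write("\n".join(token_paths) + "\n")
    return list_path, token_paths


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    # Placeholder values; replace with real GitHub access tokens.
    list_path, _ = write_token_files(demo_dir, ["PLACEHOLDER_1", "PLACEHOLDER_2"])
    # Line to put into conf.properties:
    print("sample.generation.git-accounts.file=" + list_path)
```

The printed line is what would go into conf.properties; the path it shows depends on the temporary directory chosen at run time.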

run classification:

cd GitHubRepositoryClassifier/model/ml
/usr/bin/python2.7 classify.py --input golden_data_set.csv

To your other questions: the extractor and attachment download steps are optional and only required for training and validation.

About the time constraint:
For every repository we download up to 10k issues, 10k commits, 10k releases, 10k branches, 10k languages, 10k contributors and some more statistics.
Given a rate limit of 5000 requests per hour, here is the number of requests needed for the linux repository alone:
10,000 commits (100 requests)
1 branch (1 request)
497 releases (5 requests)
6 languages (1 request)
more than 100 contributors (1 request)
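The request counts listed above can be reproduced with a small calculation. This is a sketch that assumes 100 items per paginated request, which is what "10,000 commits (100 requests)" implies; the contributors figure is copied from the list rather than computed, and issues are left out because the list gives no number for them.

```python
PER_PAGE = 100  # assumption: items returned per paginated request


def requests_needed(item_count, per_page=PER_PAGE):
    """Number of paginated requests for item_count items (at least one)."""
    return max(1, (item_count + per_page - 1) // per_page)


# Item counts for the linux repository, as quoted in the list above.
linux_counts = {"commits": 10000, "branches": 1, "releases": 497, "languages": 6}

requests = {name: requests_needed(n) for name, n in linux_counts.items()}
requests["contributors"] = 1  # taken directly from the list, not computed

total = sum(requests.values())
print(requests)
print("total requests:", total)  # 108 for the categories listed
```

For the listed categories this comes to 108 requests, a small share of the 5000 per hour, before issues and the other statistics are counted.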

So if we have a lot of big repositories, the program will not finish within an hour. We could lower our limit to 1000 elements per category (issues, commits, ...), but this would probably reduce the quality of the predictions, since we trained on unlimited issues, ...

What should we do?
Changing the limit is super easy; we just change this line: https://github.com/FelixNeutatz/GitHubRepositoryClassifier/blob/master/serialization/src/main/java/tu/kn/ghrepoclassifier/serialization/data/RepoData.java#L79

Of course, you could also add some more authentication tokens to speed up downloading.
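To get a rough feel for how extra tokens help, here is a sketch of the arithmetic. It assumes each token has its own budget of 5000 requests per hour (so the tokens would need to belong to different accounts) and that the budgets simply add up; the function name and the example numbers are hypothetical, not measurements.

```python
def hours_to_download(total_requests, token_count, limit_per_hour=5000):
    """Whole hours needed if every token contributes its full hourly budget."""
    budget = token_count * limit_per_hour
    return (total_requests + budget - 1) // budget


if __name__ == "__main__":
    # Hypothetical workload of 100,000 requests, purely for illustration.
    for tokens in (1, 5):
        print(tokens, "token(s):", hours_to_download(100000, tokens), "hour(s)")
```

This ignores network time and the classification itself, so it is only a lower bound on the actual run time.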

Best regards,
Felix & Kevin

jonico commented on August 16, 2024

Thanks @FelixNeutatz, your instructions for running the program are now much clearer to us. We will probably give you the benefit of the doubt, assume you classify the linux repo correctly, and only categorize the 220 much smaller Git repos.

kevinkepp commented on August 16, 2024

Hi @jonico, we committed the predictions of our algorithm on the golden data set to the tu-berlin-evaluation repository. We used 5 API tokens, and it took the program one night to download and classify the repositories.
Best,
Kevin & Felix

jonico commented on August 16, 2024

Thank you!
