Comments (4)
Hi Johannes,
First, a quick tutorial on how to set everything up:
- set up Maven
git clone https://github.com/FelixNeutatz/GitHubRepositoryClassifier.git
cd GitHubRepositoryClassifier
mvn clean install
- set up Python
cd GitHubRepositoryClassifier/model
/usr/bin/python2.7 setup.py install
- configure GitHub token
Now create a file, e.g. "/path/mygittoken.txt", and enter the generated token in the following way (without the brackets):
oauth=[my GitHub access token]
Then create a file "/path/tokenfilelist.csv" which contains all the token files like this:
/path/mygittoken.txt
/path/mygittoken2.txt
...
Now specify this file in ../GitHubRepositoryClassifier/commons/src/main/resources/conf/conf.properties:
sample.generation.git-accounts.file=/path/tokenfilelist.csv
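The token setup above can also be scripted. The following is only a minimal sketch: the helper name `write_token_files` and the numbered file names are made up for illustration, and it assumes the classifier accepts a trailing newline in each file. It writes one "oauth=..." file per token and a list file with one path per line, as described above.

```python
import os


def write_token_files(tokens, directory):
    """Write one token file per token plus a list file naming them all.

    Hypothetical helper, not part of the repository.
    """
    token_paths = []
    for index, token in enumerate(tokens, start=1):
        # File name pattern is an assumption; any path works.
        path = os.path.join(directory, "mygittoken%d.txt" % index)
        with open(path, "w") as handle:
            handle.write("oauth=%s\n" % token)
        token_paths.append(path)

    # The list file contains one token file path per line.
    list_path = os.path.join(directory, "tokenfilelist.csv")
    with open(list_path, "w") as handle:
        for path in token_paths:
            handle.write(path + "\n")
    return list_path
```

The returned path is the value to put into sample.generation.git-accounts.file.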
- run classification:
cd GitHubRepositoryClassifier/model/ml
/usr/bin/python2.7 classify.py --input golden_data_set.csv
Regarding your other questions: the extractor and attachment download steps are optional and only required for training and validation.
About the time constraint:
For every repository we download up to 10k issues, 10k commits, 10k releases, 10k branches, 10k languages, 10k contributors and some more statistics.
Given a rate limit of 5000 requests per hour, here is the number of requests needed for the linux repository alone:
10,000 commits (100 requests)
1 branch (1 request)
497 releases (5 requests)
6 languages (1 request)
more than 100 contributors (1 request)
So if we have a lot of big repositories, the program will not finish within an hour. We can change our limit to 1000 elements per category (issues, commits, ...), but this would probably reduce the quality of the predictions, since we trained on unlimited issues, ...
What should we do?
Changing the limit is super easy; we just change this line: https://github.com/FelixNeutatz/GitHubRepositoryClassifier/blob/master/serialization/src/main/java/tu/kn/ghrepoclassifier/serialization/data/RepoData.java#L79
Of course, you could also add some more authentication tokens to speed up downloading.
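To make the arithmetic above easier to redo for other repositories, here is a rough sketch. It assumes 100 items per request (which matches the numbers listed for linux) and that extra tokens scale the hourly budget linearly, which is an idealization. The function names are made up for illustration, and the contributor count is left out because only "more than 100" is known.

```python
import math

PAGE_SIZE = 100              # assumption: 100 items per API request
RATE_LIMIT_PER_HOUR = 5000   # requests per hour and token, as stated above


def requests_needed(item_count, page_size=PAGE_SIZE):
    """Number of paged requests for one category (at least one)."""
    return max(1, int(math.ceil(item_count / float(page_size))))


def hours_needed(total_requests, token_count=1, rate_limit=RATE_LIMIT_PER_HOUR):
    """Idealized download time if tokens share the work evenly."""
    return total_requests / float(rate_limit * token_count)


# Counts for the linux repository taken from the list above.
linux_counts = {"commits": 10000, "branches": 1, "releases": 497, "languages": 6}
linux_requests = sum(requests_needed(n) for n in linux_counts.values())
```

With these counts, linux_requests comes out at 107, so a single token covers only a few dozen repositories of that size per hour.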
Best regards,
Felix & Kevin
from githubrepositoryclassifier.
Thanks @FelixNeutatz, your run instructions are now much clearer to us. We will probably give you the benefit of the doubt and assume you classify the linux repo correctly, and only categorize the much smaller 220 Git repos.
Hi @jonico, we committed the predictions of our algorithm on the golden data set to the tu-berlin-evaluation repository. We used 5 API tokens, and it took the program one night to download and classify the repositories.
Best,
Kevin & Felix
Thank you!