Comments (4)
Thank you for the suggestion. This is not easy to implement for training (where shuffling between epochs requires having access to all the data, or at least some way of addressing smaller groups of samples at a time), but could be an easy enhancement for evaluation.
from codesearchnet.
One way is to store each training example in its own file, shuffle the list of filenames before every epoch (as you already do), prefetch and preprocess batches on the CPU while the model trains on the GPUs, and maybe use tf.data for the entire pipeline.
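A rough sketch of that idea using only the Python standard library rather than tf.data. The names `epoch_batches`, `prefetch`, and `load_example` are made up for illustration and are not part of the CodeSearchNet codebase; `load_example` stands in for whatever reads and preprocesses one file.

```python
import queue
import random
import threading
from typing import Callable, Iterator, List, Sequence


def epoch_batches(
    filenames: Sequence[str],
    load_example: Callable[[str], object],  # hypothetical per-file loader
    batch_size: int,
    rng: random.Random,
) -> Iterator[List[object]]:
    """Yield batches for one epoch, visiting files in a fresh random order."""
    order = list(filenames)
    rng.shuffle(order)  # only the list of names is shuffled, not the data
    batch: List[object] = []
    for name in order:
        batch.append(load_example(name))  # CPU-side loading and preprocessing
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch


def prefetch(
    batches: Iterator[List[object]], buffer_size: int = 4
) -> Iterator[List[object]]:
    """Produce batches in a background thread so loading overlaps training."""
    done = object()  # sentinel marking the end of the epoch
    buffer: queue.Queue = queue.Queue(maxsize=buffer_size)

    def worker() -> None:
        try:
            for batch in batches:
                buffer.put(batch)  # blocks while the buffer is full
        finally:
            buffer.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item
```

The training loop would then iterate over `prefetch(epoch_batches(...))` once per epoch. A single background thread is only the simplest option; heavy preprocessing would probably need several worker processes.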
from codesearchnet.
Yes, that's roughly how to do it. Two notes on that:
- One example per file has surprising side effects at this scale (e.g., many filesystems become very slow when millions of files are in one directory, so you need subfolders; reading one small file at a time can lead to horrible disk access patterns; and shuffling combined with the length of an epoch means that caches don't work well). These things can be mitigated by storing K samples per file and then reading from several files at once (so each minibatch of N samples is drawn from M << N files).
- The minibatching routine is a bit complicated to handle joint training on many languages, optional random switching between docstring and function name embeddings, etc. Pushing this into tf.data is possible, but painful.
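To make the first note concrete, here is a minimal sketch of the "K samples per file, M files at once" scheme. It is not what the released baselines do; the JSON-lines format, the directory layout, and the function names `write_shards`, `read_shard`, and `minibatches` are all assumptions made for the example.

```python
import json
import random
from pathlib import Path
from typing import Iterator, List, Sequence


def write_shards(
    samples: Sequence[dict],
    out_dir: Path,
    samples_per_file: int,      # K in the note above
    files_per_dir: int = 1000,  # keeps every directory small
) -> List[Path]:
    """Write K samples per JSON-lines file, spread over numbered subfolders."""
    paths = []
    for shard_id, start in enumerate(range(0, len(samples), samples_per_file)):
        subdir = out_dir / f"{shard_id // files_per_dir:05d}"
        subdir.mkdir(parents=True, exist_ok=True)
        path = subdir / f"shard_{shard_id:07d}.jsonl"
        with path.open("w", encoding="utf-8") as f:
            for sample in samples[start:start + samples_per_file]:
                f.write(json.dumps(sample) + "\n")
        paths.append(path)
    return paths


def read_shard(path: Path) -> List[dict]:
    """Read one shard sequentially, in a single pass."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def minibatches(
    shard_paths: Sequence[Path],
    batch_size: int,     # N
    files_at_once: int,  # M
    rng: random.Random,
) -> Iterator[List[dict]]:
    """One epoch: shuffle shard order, pool M shards, shuffle, cut into batches."""
    order = list(shard_paths)
    rng.shuffle(order)
    leftover: List[dict] = []
    for i in range(0, len(order), files_at_once):
        pool = leftover  # fewer than batch_size samples carried over
        for path in order[i:i + files_at_once]:
            pool.extend(read_shard(path))
        rng.shuffle(pool)
        full = len(pool) - len(pool) % batch_size
        for j in range(0, full, batch_size):
            yield pool[j:j + batch_size]
        leftover = pool[full:]
    if leftover:
        yield leftover  # final, smaller batch
```

Each minibatch is drawn from the M shards currently in memory plus a few carried-over samples, so this is a block shuffle rather than a full shuffle of the dataset. Joint training on many languages and the docstring/function-name switching are not handled here.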
So, overall: It's a substantial amount of work that we are most likely not going to do. The released baselines are really just meant as a "here's a simple straightforward approach, beat that" - we are happy for others to improve on this, either by entirely rewriting things or improving our codebase.
from codesearchnet.
Got your point :)
This is an example, here on the data pipeline side, of the system requirements you generally run into when trying to beat SOTA language models these days.
Anyway, closing the issue now. Thank you for your consideration.
from codesearchnet.