Instead of loading all training and test data at once, can we load the data into memory in batches?

Load data from disk into memory in batches (codesearchnet issue, 4 comments, closed)

github commented on May 13, 2024

Load data from disk into memory in batches

from codesearchnet.

Comments (4)

mmjb commented on May 13, 2024

Thank you for the suggestion. This is not easy to implement for training (where shuffling between epochs requires having access to all the data, or at least some way of addressing smaller groups of samples at a time), but could be an easy enhancement for evaluation.
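For the evaluation case, where no shuffling is needed, batched reading can be quite simple. The following is only a minimal sketch, not code from the codesearchnet repository: it assumes the data is stored as gzip-compressed JSON Lines, and the function name `iter_batches` is hypothetical.

```python
import gzip
import json
from itertools import islice
from typing import Any, Dict, Iterator, List


def iter_batches(path: str, batch_size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most `batch_size` parsed records from a .jsonl.gz file.

    Only one batch is held in memory at a time; the file is read lazily.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # Generator expression: lines are parsed only when a batch asks for them.
        records = (json.loads(line) for line in handle if line.strip())
        while True:
            batch = list(islice(records, batch_size))
            if not batch:
                return
            yield batch
```

An evaluation loop would then iterate `for batch in iter_batches(path, 200): ...` and accumulate its metrics batch by batch.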

yashjakhotiya commented on May 13, 2024

One way is to store each training example in its own file, shuffle the list of filenames before every epoch (as you already do), prefetch batches and preprocess them on the CPU while the model trains on GPUs, and maybe use tf.data for the entire process.
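The idea in this comment can be sketched without any framework. The code below is an illustration under stated assumptions, not the project's pipeline: `prefetch` and `epoch_batches` are hypothetical names, `load_example` stands in for whatever reads and preprocesses one file, and a background thread with a bounded queue plays the role that tf.data's prefetching would play.

```python
import queue
import random
import threading
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the stream


def prefetch(source: Iterator[T], buffer_size: int = 4) -> Iterator[T]:
    """Consume `source` in a background thread, keeping up to `buffer_size` items ready."""
    buf: "queue.Queue" = queue.Queue(maxsize=buffer_size)

    def worker() -> None:
        try:
            for item in source:
                buf.put(item)  # blocks while the buffer is full
        finally:
            # Always signal the end, even if `source` raised, so the consumer
            # does not wait forever (the error itself is not re-raised here).
            buf.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item


def epoch_batches(
    filenames: Sequence[str],
    load_example: Callable[[str], T],
    batch_size: int,
    rng: random.Random,
) -> Iterator[List[T]]:
    """Shuffle the filename list, then load and yield one batch at a time."""
    order = list(filenames)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [load_example(name) for name in order[start:start + batch_size]]
```

A training loop would call `prefetch(epoch_batches(files, load_example, 32, rng))` once per epoch, so that the next batches are loaded while the current one is being processed.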

mmjb commented on May 13, 2024

Yes, that's roughly how to do it. Two notes on that:

1. One example per file has surprising side effects at this scale (e.g., many filesystems become very slow when millions of files are in one directory, so you need subfolders; reading one file per example can lead to horrible disk access patterns; shuffling plus the length of an epoch mean that caches don't work well). These things can be mitigated by using K samples per file, and then reading from several files at once (so each minibatch of N samples is drawn from M << N files).
2. The minibatching routine is a bit complicated, because it has to handle joint training on many languages, optional random switching between docstring and function name embeddings, etc. Pushing this into tf.data is possible, but painful.

So, overall: It's a substantial amount of work that we are most likely not going to do. The released baselines are really just meant as a "here's a simple straightforward approach, beat that" - we are happy for others to improve on this, either by entirely rewriting things or improving our codebase.
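For readers who want to try it anyway, the K-samples-per-file mitigation described above could look roughly like the following. This is a sketch under assumptions, not the released baseline: shards are taken to be plain JSON Lines files, and `read_shard`, `interleaved_batches`, and the `open_shards` parameter (the M in the comment) are hypothetical names.

```python
import json
import random
from typing import Any, Iterator, List, Sequence


def read_shard(path: str) -> Iterator[Any]:
    """Lazily yield the samples stored in one shard, one JSON object per line."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def interleaved_batches(
    shard_paths: Sequence[str],
    batch_size: int,
    open_shards: int,
    rng: random.Random,
) -> Iterator[List[Any]]:
    """Yield minibatches drawn at random from a few shards that are open at once."""
    pending = list(shard_paths)
    rng.shuffle(pending)  # a new shard order for every epoch
    active: List[Iterator[Any]] = []
    batch: List[Any] = []
    while pending or active:
        # Keep up to `open_shards` shards open at any time.
        while pending and len(active) < open_shards:
            active.append(read_shard(pending.pop()))
        idx = rng.randrange(len(active))
        try:
            batch.append(next(active[idx]))
        except StopIteration:
            active.pop(idx)  # shard exhausted; it is replaced on the next pass
            continue
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch
```

Samples inside a shard are still read in their stored order, so the shards would need to be written in shuffled order (or a shuffle buffer added) to approximate a full shuffle. The sketch also ignores the multi-language and docstring/function-name switching logic mentioned in the second note.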

yashjakhotiya commented on May 13, 2024

Got your point :)

This is an example, albeit on the data-pipeline side, of the general system-requirement problems one faces when trying to beat SOTA language models these days.

Anyway, closing the issue now. Thank you for your consideration.
