Comments (4)

mmjb commented on May 13, 2024

Thank you for the suggestion. This is not easy to implement for training (where shuffling between epochs requires having access to all the data, or at least some way of addressing smaller groups of samples at a time), but could be an easy enhancement for evaluation.

from codesearchnet.

yashjakhotiya commented on May 13, 2024

One way is to have a single training example in a single file, shuffle the list of filenames before every epoch (as you already do), prefetch batches and preprocess them using a CPU while the model trains on GPUs, and maybe use tf.data in the entire process.
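A minimal sketch of that idea, written in plain Python rather than tf.data so that it runs without TensorFlow. The names `prefetched_batches` and `load_example` are hypothetical and not part of the CodeSearchNet codebase; it assumes one example per file and a user-supplied loader that does the CPU-side preprocessing.

```python
import queue
import random
import threading
from pathlib import Path
from typing import Callable, Iterator, List, Sequence

_SENTINEL = object()  # marks the end of an epoch in the prefetch queue


def prefetched_batches(
    filenames: Sequence[Path],
    load_example: Callable[[Path], object],  # hypothetical loader/preprocessor
    batch_size: int,
    prefetch: int = 4,
    seed: int = 0,
) -> Iterator[List[object]]:
    """Yield the batches of one epoch while a background thread loads files."""
    order = list(filenames)
    random.Random(seed).shuffle(order)  # pass a new seed for every epoch
    buffer: queue.Queue = queue.Queue(maxsize=prefetch)

    def producer() -> None:
        try:
            batch = []
            for path in order:
                batch.append(load_example(path))  # read + preprocess on the CPU
                if len(batch) == batch_size:
                    buffer.put(batch)  # blocks once `prefetch` batches are waiting
                    batch = []
            if batch:
                buffer.put(batch)  # final, possibly smaller, batch
        finally:
            buffer.put(_SENTINEL)  # always signal the end, even after an error

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _SENTINEL:
            return
        yield item
```

The training loop would iterate over `prefetched_batches(...)` once per epoch; while it consumes one batch, the producer thread is already reading the next ones, up to `prefetch` batches ahead.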


mmjb commented on May 13, 2024

Yes, that's roughly how to do it. Two notes on that:

  1. One example per file has surprising side effects at this scale (e.g., many filesystems become very slow when millions of files are in one directory, so you need subfolders; reading one small file per sample can lead to horrible disk access patterns; shuffling plus the length of an epoch ensure that caches don't work well). These things can be mitigated by storing K samples per file and then reading from several files at once (so each minibatch of N samples is drawn from M << N files).
  2. The minibatching routine is a bit complicated to handle joint training on many languages, optional random switching between docstring and function name embeddings, etc. Pushing this into tf.data is possible, but painful.

So, overall: It's a substantial amount of work that we are most likely not going to do. The released baselines are really just meant as a "here's a simple straightforward approach, beat that" - we are happy for others to improve on this, either by entirely rewriting things or improving our codebase.
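The mitigation in note 1 above could look roughly like the following sketch. It is not how the released baselines work: `sharded_minibatches` and `read_shard` are hypothetical names, and the sketch simplifies by pooling M shards at a time, shuffling the pool, and cutting it into minibatches, so the last minibatch of each pool may hold fewer than N samples.

```python
import random
from typing import Callable, Iterator, List, Sequence, TypeVar

Shard = TypeVar("Shard")
Sample = TypeVar("Sample")


def sharded_minibatches(
    shards: Sequence[Shard],
    read_shard: Callable[[Shard], List[Sample]],  # hypothetical: returns the K samples of one file
    batch_size: int,   # N in note 1
    open_shards: int,  # M in note 1, chosen so that M << N
    rng: random.Random,
) -> Iterator[List[Sample]]:
    """Yield minibatches that each draw on at most `open_shards` files."""
    order = list(shards)
    rng.shuffle(order)  # shuffle at the file level, which is cheap
    for start in range(0, len(order), open_shards):
        pool: List[Sample] = []
        for shard in order[start:start + open_shards]:
            pool.extend(read_shard(shard))  # one sequential read per file
        rng.shuffle(pool)  # shuffle at the sample level within the pool
        for i in range(0, len(pool), batch_size):
            yield pool[i:i + batch_size]
```

The shuffle is only approximate: within one epoch, samples are mixed only with the other samples in their pool of M files. Because the file order is reshuffled every epoch, the pools differ from epoch to epoch, and only M * K samples need to be in memory at any time.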


yashjakhotiya commented on May 13, 2024

Got your point :)

This is an example, albeit on the data pipeline side, of the general system-requirement problems people face when trying to beat SOTA language models these days.

Anyway, closing the issue now. Thank you for your consideration.

