Comments (4)

luisgabrielroldan commented on August 11, 2024

The save_ckpts_from_n_epochs must be a number. Use 0 instead of null.

from everydream2trainer.

luisgabrielroldan commented on August 11, 2024

I just realized there is a setup_args function.

Maybe add this check?

if args.save_ckpts_from_n_epochs is None or args.save_ckpts_from_n_epochs < 1:
    args.save_ckpts_from_n_epochs = 1

Macgyverops commented on August 11, 2024

I get something similar.

```
 * Saving diffusers model to logs\redacted\ckpts\errored-redacted-ep00-gs00095
Traceback (most recent call last):
  File "redacted\EveryDream2trainer\train.py", line 1132, in main
    model_pred, target, loss = get_model_prediction_and_target(batch["image"], batch["tokens"], args.zero_frequency_noise_ratio, return_loss=True)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "redacted\EveryDream2trainer\train.py", line 921, in get_model_prediction_and_target
    cuda_caption = tokens.to(text_encoder.device)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

victorchall commented on August 11, 2024

Setting save_ckpts_from_n_epochs to 0 should indeed solve the original issue. The setting declares the first epoch from which checkpoints are allowed to be saved, provided the other rules are also met. null doesn't make any sense, but 0 (always save when the other requirements are met) or perhaps something like 99999999 or 1e15 (effectively only save at the end of training) might.
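
To illustrate the semantics described above, here is a minimal sketch. It is not the trainer's actual code: the helper names `normalize_save_from_epoch` and `may_save_checkpoint` are hypothetical, and treating a missing value as 0 is an assumption based on this comment (the check proposed earlier in the thread uses 1 instead).

```python
from types import SimpleNamespace


def normalize_save_from_epoch(args):
    """Hypothetical helper: coerce a missing (null/None) or negative value to 0."""
    value = getattr(args, "save_ckpts_from_n_epochs", None)
    if value is None or value < 0:
        # Assumption: 0 means "saving is allowed from the very first epoch".
        args.save_ckpts_from_n_epochs = 0
    return args


def may_save_checkpoint(epoch, args):
    """Hypothetical gate: True once the epoch threshold is reached.

    The trainer's other save rules would still have to be met as well.
    """
    return epoch >= args.save_ckpts_from_n_epochs


# Stand-in for the parsed config, with the setting left as null.
args = normalize_save_from_epoch(SimpleNamespace(save_ckpts_from_n_epochs=None))
print(args.save_ckpts_from_n_epochs, may_save_checkpoint(0, args))
```

With a very large value such as 99999999, `may_save_checkpoint` stays False for every realistic epoch, which matches the "effectively only save at the end of training" case.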

@Macgyverops your issue looks completely unrelated. It looks like a potential hardware problem, as that shouldn't fail after 95 steps. You can try the suggested setting of CUDA_LAUNCH_BLOCKING=1 to get a better error (in Windows you can run set CUDA_LAUNCH_BLOCKING=1 in the command line to set the environment variable prior to running python train.py ...). If you are still having problems, please open a new issue, as it is not related to the original post here.
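
As an alternative to setting the variable in the shell, it can be set for a child process from Python. This is only a sketch under stated assumptions: `blocking_cuda_env` is a hypothetical helper, not part of the trainer, and the variable has to be in place before the training process initializes CUDA.

```python
import os


def blocking_cuda_env(base_env=None):
    """Hypothetical helper: return a copy of the environment with
    CUDA_LAUNCH_BLOCKING=1 added, leaving the original mapping untouched."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


env = blocking_cuda_env()
print(env["CUDA_LAUNCH_BLOCKING"])

# Usage sketch (not executed here): pass the environment to the child process
# that runs training, with your usual train.py arguments appended.
#   import subprocess, sys
#   subprocess.run([sys.executable, "train.py"], env=env)
```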
