
Comments (2)

Antoine101 commented on June 12, 2024

Hi @gboeer

I am "happy" to see I am not the only one having issues logging with MLFlow.

I am finetuning a pretrained transformer model on 2000ish images. So not an insane amount of data.

Here is what I am seeing:
image

As you can see, metrics such as validation_accuracy, although logged with on_step=False, on_epoch=True, only ever show the value of the last epoch. I would like to see an actual graph covering all my previous epochs, but here it's just a scalar.

Also, I tell my trainer to log every 50 steps, but in my epochs-step plot I see points at the following steps only: 49, 199, 349, 499, ... not every 50.

Here is my logger:

logger = MLFlowLogger(
    experiment_name=config['logger']['experiment_name'],
    tracking_uri=config['logger']['tracking_uri'],
    log_model=config['logger']['log_model'],
)

Passed to my trainer:

trainer = Trainer(
    accelerator=config['accelerator'],
    devices=config['devices'],
    max_epochs=config['max_epochs'],
    logger=logger,
    log_every_n_steps=50,
    callbacks=[early_stopping, lr_monitor, checkpoint, progress_bar],
)

My metrics are logged in the following way in the training_step and validation_step functions:

def training_step(self, batch, batch_idx): 
    index, audio_name, targets, inputs = batch
    logits = self.model(inputs) 
    loss = self.loss(logits, targets)
    predictions = torch.argmax(logits, dim=1)
    self.train_accuracy.update(predictions, targets)
    self.log("training_loss", loss, on_step=True, on_epoch=True, batch_size=self.hparams.batch_size, prog_bar=True)
    self.log("training_accuracy", self.train_accuracy, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("training_gpu_allocation", torch.cuda.memory_allocated(), on_step=True, on_epoch=False)        
    return {"inputs":inputs, "targets":targets, "predictions":predictions, "loss":loss}

        
def validation_step(self, batch, batch_idx):
    index, audio_name, targets, inputs = batch
    logits = self.model(inputs)
    loss = self.loss(logits, targets)
    predictions = torch.argmax(logits, dim=1)
    self.validation_accuracy(predictions, targets)
    self.validation_precision(predictions, targets)
    self.validation_recall(predictions, targets)
    self.validation_f1_score(predictions, targets)
    self.validation_confmat.update(predictions, targets)
    self.log("validation_loss", loss, on_step=True, on_epoch=True, batch_size=self.hparams.batch_size, prog_bar=True)
    self.log("validation_accuracy", self.validation_accuracy, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_precision", self.validation_precision, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_recall", self.validation_recall, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_f1_score", self.validation_f1_score, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
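
For context, here is a rough sketch of what on_step=True, on_epoch=True means for a plain scalar such as training_loss. It assumes the epoch-level value is the mean of the step values weighted by batch_size, which is my understanding of Lightning's default reduction. The helper name epoch_mean and the numbers are made up for illustration, and metric objects such as self.train_accuracy are computed by torchmetrics rather than this way.

```python
def epoch_mean(step_values, batch_sizes):
    """Batch-size-weighted mean of per-step values (assumed epoch reduction)."""
    total = sum(value * size for value, size in zip(step_values, batch_sizes))
    return total / sum(batch_sizes)


# Hypothetical per-step losses for one epoch; the last batch is smaller.
step_losses = [0.9, 0.7, 0.5]
batch_sizes = [32, 32, 16]

# With on_step=True the three step values are logged individually;
# with on_epoch=True a single aggregated value is logged at epoch end.
epoch_loss = epoch_mean(step_losses, batch_sizes)
print(f"epoch-level loss: {epoch_loss:.4f}")
```

So a metric logged with on_step=False, on_epoch=True should contribute exactly one point per epoch to its plot.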

I guess the problem comes from Lightning, but I'm not 100% sure.

I hope we'll get support soon. I serve my ML models on MLFlow and it works fine, so I don't want to go back to TensorBoard for my DL models only.

EDIT: My bad, it seems to do that only while training is still running. Once training has finished, the plots display correctly.
image

But still, I thought we were supposed to be able to follow the evolution of metrics as training progresses, and in this case that isn't really possible.

from pytorch-lightning.

gboeer commented on June 12, 2024

@Antoine101
Interesting that your plots change after training has finished. For me, though, they stay the same. I tried opening the app in a private window to rule out caching issues, but it didn't change anything.

I guess what you observed about the step size may just have to do with zero-indexing.
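
To make the zero-indexing point concrete, here is a minimal sketch. It assumes a point is written whenever (step + 1) is a multiple of log_every_n_steps, with steps counted from 0. That trigger rule is an assumption about how the trainer decides when to log, and logged_steps is a hypothetical helper, not part of Lightning.

```python
def logged_steps(total_steps: int, log_every_n_steps: int) -> list[int]:
    """Zero-indexed steps at which a point would be written, assuming the
    trigger is (step + 1) % log_every_n_steps == 0."""
    return [
        step
        for step in range(total_steps)
        if (step + 1) % log_every_n_steps == 0
    ]


steps = logged_steps(total_steps=500, log_every_n_steps=50)
print(steps)  # first entries are 49, 99, 149, 199

# The steps reported in the plot are a subset of these.
reported = [49, 199, 349, 499]
print(all(step in steps for step in reported))
```

Under that assumption the offset (49 instead of 50) is explained by zero-indexing, but the spacing of 150 between the reported points is not, so something else may be dropping the points in between.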
