Comments (2)
Hi @gboeer
I am "happy" to see I am not the only one having issues logging with MLflow.
I am fine-tuning a pretrained transformer model on roughly 2000 images, so not an insane amount of data.
As you can see, metrics such as validation_accuracy, although recorded with on_step=False, on_epoch=True, only ever show the value of the last epoch. I would like to see an actual graph covering all my previous epochs, but it's just a single scalar here.
Also, I tell my trainer to log every 50 steps, but in my per-step plot I only see points at steps 49, 199, 349, 499, ... not every 50.
Here is my logger:
logger = MLFlowLogger(
    experiment_name=config['logger']['experiment_name'],
    tracking_uri=config['logger']['tracking_uri'],
    log_model=config['logger']['log_model'],
)
Passed to my trainer:
trainer = Trainer(
    accelerator=config['accelerator'],
    devices=config['devices'],
    max_epochs=config['max_epochs'],
    logger=logger,
    log_every_n_steps=50,
    callbacks=[early_stopping, lr_monitor, checkpoint, progress_bar],
)
My metrics are logged in the following way in the training_step and validation_step functions:
def training_step(self, batch, batch_idx):
    index, audio_name, targets, inputs = batch
    logits = self.model(inputs)
    loss = self.loss(logits, targets)
    predictions = torch.argmax(logits, dim=1)
    self.train_accuracy.update(predictions, targets)
    self.log("training_loss", loss, on_step=True, on_epoch=True, batch_size=self.hparams.batch_size, prog_bar=True)
    self.log("training_accuracy", self.train_accuracy, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("training_gpu_allocation", torch.cuda.memory_allocated(), on_step=True, on_epoch=False)
    return {"inputs": inputs, "targets": targets, "predictions": predictions, "loss": loss}

def validation_step(self, batch, batch_idx):
    index, audio_name, targets, inputs = batch
    logits = self.model(inputs)
    loss = self.loss(logits, targets)
    predictions = torch.argmax(logits, dim=1)
    self.validation_accuracy(predictions, targets)
    self.validation_precision(predictions, targets)
    self.validation_recall(predictions, targets)
    self.validation_f1_score(predictions, targets)
    self.validation_confmat.update(predictions, targets)
    self.log("validation_loss", loss, on_step=True, on_epoch=True, batch_size=self.hparams.batch_size, prog_bar=True)
    self.log("validation_accuracy", self.validation_accuracy, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_precision", self.validation_precision, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_recall", self.validation_recall, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_f1_score", self.validation_f1_score, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
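For what it's worth, my understanding is that with on_step=False, on_epoch=True Lightning accumulates the per-batch values and writes a single epoch value, and passing batch_size= makes the mean reduction weighted by batch size. A minimal plain-Python sketch of that reduction (the numbers are made up):

```python
# Sketch of how on_epoch=True with batch_size= reduces per-batch values into
# one epoch value via a batch-size-weighted mean. Values are invented.
batch_losses = [0.90, 0.80, 0.70]   # per-batch validation_loss
batch_sizes = [32, 32, 16]          # the last batch is often smaller

epoch_loss = sum(l * n for l, n in zip(batch_losses, batch_sizes)) / sum(batch_sizes)
print(round(epoch_loss, 4))  # -> 0.82, not the unweighted mean 0.8
```

So only one point per epoch reaches the logger for these metrics, which is why a single-epoch run shows just a scalar.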
I guess it's a problem with Lightning, but I'm not 100% sure.
I hope we'll get support soon. I serve my ML models with MLflow and it works fine, so I don't want to go back to TensorBoard just for my DL models.
EDIT: My bad, it seems to do that only while training is still running. Once training is finished, the plots display correctly.
Still, I thought we were supposed to be able to follow the evolution of metrics as training progresses, and in this case that's not really possible.
from lightning.
@Antoine101
Interesting that your plots change after training is finished. For me, they stay the same. I tried opening the app in a private window to rule out caching issues, but it didn't change anything.
I guess what you observed about the step size may just come down to zero-indexing.
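That would match a check of the form (step + 1) % log_every_n_steps == 0 on the zero-indexed global step (an assumption about the internal condition, not a confirmed detail), which puts the first logged point at step 49:

```python
# Zero-indexed steps at which a "log every 50 steps" rule would fire, assuming
# the condition is (step + 1) % log_every_n_steps == 0 (assumed, not verified).
log_every_n_steps = 50
logged = [step for step in range(500) if (step + 1) % log_every_n_steps == 0]
print(logged[:4])  # -> [49, 99, 149, 199]
```

That explains the offset of 49; why you see gaps of 150 rather than 50 between points, I can't say, but epoch boundaries interacting with this check could be involved.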