Comments (2)
Hi @gboeer
I am "happy" to see I am not the only one having issues logging with MLflow.
I am fine-tuning a pretrained transformer model on 2000-ish images, so not an insane amount of data.
As you can see, metrics such as validation_accuracy, although recorded with on_step=False, on_epoch=True, always show me only the value of the last epoch. I would like to see an actual graph with all my previous epochs; instead there is just a single scalar here.
Also, I tell my trainer to log every 50 steps, but in my epoch-step plot I only see points at the following steps: 49, 199, 349, 499, ... not every 50.
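For context on what I expect on_epoch=True to do: as I understand it, Lightning reduces the per-step values into one value per epoch (a mean weighted by the batch_size you pass to self.log), so a single point per epoch should appear in the plot. A minimal sketch of that reduction, with made-up numbers:

```python
# Sketch of how an on_epoch=True metric is reduced to one value per epoch.
# My assumption: with batch_size given, Lightning computes a batch-size-
# weighted mean of the per-step values at epoch end. Numbers are invented.
def epoch_reduce(step_values, batch_sizes):
    """Weighted mean of per-step metric values, weights = batch sizes."""
    total = sum(v * b for v, b in zip(step_values, batch_sizes))
    return total / sum(batch_sizes)

losses = [0.9, 0.7, 0.4]  # per-batch training_loss values
sizes = [32, 32, 16]      # last batch is smaller
print(epoch_reduce(losses, sizes))  # the single scalar logged at epoch end
```

So each epoch should contribute exactly one such point to the metric's history.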
Here is my logger:
```python
logger = MLFlowLogger(
    experiment_name=config['logger']['experiment_name'],
    tracking_uri=config['logger']['tracking_uri'],
    log_model=config['logger']['log_model'],
)
```
Passed to my trainer:
```python
trainer = Trainer(
    accelerator=config['accelerator'],
    devices=config['devices'],
    max_epochs=config['max_epochs'],
    logger=logger,
    log_every_n_steps=50,
    callbacks=[early_stopping, lr_monitor, checkpoint, progress_bar],
)
```
My metrics are logged in the following way in the training_step and validation_step functions:
```python
def training_step(self, batch, batch_idx):
    index, audio_name, targets, inputs = batch
    logits = self.model(inputs)
    loss = self.loss(logits, targets)
    predictions = torch.argmax(logits, dim=1)
    self.train_accuracy.update(predictions, targets)
    self.log("training_loss", loss, on_step=True, on_epoch=True, batch_size=self.hparams.batch_size, prog_bar=True)
    self.log("training_accuracy", self.train_accuracy, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("training_gpu_allocation", torch.cuda.memory_allocated(), on_step=True, on_epoch=False)
    return {"inputs": inputs, "targets": targets, "predictions": predictions, "loss": loss}

def validation_step(self, batch, batch_idx):
    index, audio_name, targets, inputs = batch
    logits = self.model(inputs)
    loss = self.loss(logits, targets)
    predictions = torch.argmax(logits, dim=1)
    self.validation_accuracy(predictions, targets)
    self.validation_precision(predictions, targets)
    self.validation_recall(predictions, targets)
    self.validation_f1_score(predictions, targets)
    self.validation_confmat.update(predictions, targets)
    self.log("validation_loss", loss, on_step=True, on_epoch=True, batch_size=self.hparams.batch_size, prog_bar=True)
    self.log("validation_accuracy", self.validation_accuracy, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_precision", self.validation_precision, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_recall", self.validation_recall, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_f1_score", self.validation_f1_score, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
```
I guess it's a problem with Lightning, but I'm not 100% sure.
I hope we'll get support soon. I serve my ML models with MLflow and it works fine, so I don't want to go back to TensorBoard just for my DL models.
EDIT: My bad, it seems to do that only while training is still running. When the training is finished, the plots display correctly.
But still, I thought we were supposed to be able to follow the evolution of metrics as training progresses, and in this case that's not really possible.
from pytorch-lightning.
@Antoine101
Interesting that your plots change after the training is finished. For me, they stay the same. I tried opening the app in a private window to rule out caching issues, but it didn't change anything.
I guess what you observed about the step size may just have to do with zero-indexing.
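A quick sketch of what I mean, assuming logging fires when the zero-indexed step count completes a multiple of 50 (my assumption about the exact condition Lightning checks):

```python
# If Lightning checks `(global_step + 1) % log_every_n_steps == 0` on
# zero-indexed steps, the logged step numbers land at 49, 99, 149, ...
# rather than 50, 100, 150 — an off-by-one in the display, not in the rate.
log_every_n_steps = 50
logged = [step for step in range(500)
          if (step + 1) % log_every_n_steps == 0]
print(logged[:4])  # [49, 99, 149, 199]
```

That would explain the offset of the points you see, though not a larger-than-50 gap between them.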