belskikh / kekas
Just another DL library
License: MIT License
Tensorboard is not very convenient for Jupyter or Kaggle kernels.
The current mechanism of epoch metric aggregation is averaging of per-batch metrics. This is not correct: every batch mean gets equal weight even when the last batch is smaller, so the result differs from the true epoch-level average.
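A correct aggregation would weight each batch metric by its batch size. A minimal sketch (the helper name `epoch_metric` is hypothetical, not a kekas API):

```python
def epoch_metric(batch_metrics, batch_sizes):
    # Weight each batch's mean metric by the number of samples in it;
    # plain averaging of batch means is only exact when all batches
    # have the same size.
    total = sum(m * n for m, n in zip(batch_metrics, batch_sizes))
    return total / sum(batch_sizes)

# With a smaller final batch, naive averaging over-weights it:
naive = (0.5 + 0.5 + 1.0) / 3                       # ~0.667
exact = epoch_metric([0.5, 0.5, 1.0], [32, 32, 8])  # ~0.556
```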
Example:
If we have an epoch size of 12 batches and I call:
keker.kek_lr(final_lr=1.0, n_steps=250, logdir=logdir)
then only 12 iterations are made instead of 250, and the lr never reaches its maximum.
Epoch 1/1: 100% 12/12 [00:02<00:00, 6.90it/s, loss=8.5158]
It would be nice to be able to set an n_steps that is larger than a single epoch.
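One possible way to support this would be to derive the epoch count from n_steps and stop once the step budget is spent. A rough sketch under that assumption, with `loader_len` standing in for `len(loader)` (names are hypothetical):

```python
import math

def run_lr_finder(loader_len, n_steps):
    # Span as many epochs as needed to reach n_steps iterations,
    # instead of silently stopping after one epoch.
    n_epochs = math.ceil(n_steps / loader_len)
    steps = 0
    for epoch in range(n_epochs):
        for batch in range(loader_len):
            if steps == n_steps:
                return steps
            steps += 1  # one optimizer step + one lr update would go here
    return steps

run_lr_finder(12, 250)  # 250 steps spread over 21 epochs instead of 12 over 1
```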
On Windows there is a bug, described here https://discuss.pytorch.org/t/cant-pickle-local-object-dataloader-init-locals-lambda/31857 and in pytorch/vision#689 and pytorch/ignite#377, but it only appears when num_workers for the torch DataLoader is greater than 0. With num_workers=0 everything runs normally.
So if num_workers > 0 and there is a lambda function in the transforms code, for example:
def get_transforms(dataset_key, size, p):
    PRE_TFMS = Transformer(dataset_key, lambda x: cv2.resize(x, (size, size)))  # <-- here
    AUGS = Transformer(dataset_key, lambda x: augs()(image=x)["image"])  # <-- here
    NRM_TFMS = transforms.Compose([
        Transformer(dataset_key, to_torch()),  # <-- and here, inside to_torch(), there is a lambda
        Transformer(dataset_key, normalize()),
    ])
    train_tfms = transforms.Compose([PRE_TFMS, AUGS, NRM_TFMS])
    val_tfms = transforms.Compose([PRE_TFMS, NRM_TFMS])
    return train_tfms, val_tfms
I get an exception:
AttributeError Traceback (most recent call last)
<ipython-input-35-87bd5485ec48> in <module>
4 # !rm -r lrlogs/*
5
----> 6 BCE_keker.kek_lr(final_lr=0.1, logdir=lrlogdir)
7 # BCE_keker.plot_kek_lr(logdir=lrlogdir)
D:\metya\Anaconda3\lib\site-packages\kekas-0.1.17-py3.7.egg\kekas\keker.py in kek_lr(self, final_lr, logdir, init_lr, n_steps, opt, opt_params)
407 self.callbacks = Callbacks(self.core_callbacks + [lrfinder_cb])
408 self.kek(lr=init_lr, epochs=n_epochs, skip_val=True, logdir=logdir,
--> 409 opt=opt, opt_params=opt_params)
410 finally:
411 self.callbacks = callbacks
D:\metya\Anaconda3\lib\site-packages\kekas-0.1.17-py3.7.egg\kekas\keker.py in kek(self, lr, epochs, skip_val, opt, opt_params, sched, sched_params, stop_iter, logdir, cp_saver_params, early_stop_params)
276 for epoch in range(epochs):
277 self.set_mode("train")
--> 278 self._run_epoch(epoch, epochs)
279
280 if not skip_val:
D:\metya\Anaconda3\lib\site-packages\kekas-0.1.17-py3.7.egg\kekas\keker.py in _run_epoch(self, epoch, epochs)
425
426 with torch.set_grad_enabled(self.is_train):
--> 427 for i, batch in enumerate(self.state.core.loader):
428 self.callbacks.on_batch_begin(i, self.state)
429
D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __iter__(self)
191
192 def __iter__(self):
--> 193 return _DataLoaderIter(self)
194
195 def __len__(self):
D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, loader)
467 # before it starts, and __del__ tries to join but will get:
468 # AssertionError: can only join a started process.
--> 469 w.start()
470 self.index_queues.append(index_queue)
471 self.workers.append(w)
D:\metya\Anaconda3\lib\multiprocessing\process.py in start(self)
110 'daemonic processes are not allowed to have children'
111 _cleanup()
--> 112 self._popen = self._Popen(self)
113 self._sentinel = self._popen.sentinel
114 # Avoid a refcycle if the target function holds an indirect
D:\metya\Anaconda3\lib\multiprocessing\context.py in _Popen(process_obj)
221 @staticmethod
222 def _Popen(process_obj):
--> 223 return _default_context.get_context().Process._Popen(process_obj)
224
225 class DefaultContext(BaseContext):
D:\metya\Anaconda3\lib\multiprocessing\context.py in _Popen(process_obj)
320 def _Popen(process_obj):
321 from .popen_spawn_win32 import Popen
--> 322 return Popen(process_obj)
323
324 class SpawnContext(BaseContext):
D:\metya\Anaconda3\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
87 try:
88 reduction.dump(prep_data, to_child)
---> 89 reduction.dump(process_obj, to_child)
90 finally:
91 set_spawning_popen(None)
D:\metya\Anaconda3\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
58 def dump(obj, file, protocol=None):
59 '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60 ForkingPickler(file, protocol).dump(obj)
61
62 #
AttributeError: Can't pickle local object 'get_transforms.<locals>.<lambda>'
So I changed all the lambda functions to named ones and replaced to_torch() with torchvision.transforms.ToTensor() (I even monkey-patched the kekas transformation.py source), and it works for me with num_workers=0. With num_workers > 0 it fails with:
---------------------------------------------------------------------------
Empty Traceback (most recent call last)
D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _try_get_batch(self, timeout)
510 try:
--> 511 data = self.data_queue.get(timeout=timeout)
512 return (True, data)
D:\metya\Anaconda3\lib\multiprocessing\queues.py in get(self, block, timeout)
104 if not self._poll(timeout):
--> 105 raise Empty
106 elif not self._poll():
Empty:
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-106-87bd5485ec48> in <module>
4 # !rm -r lrlogs/*
5
----> 6 BCE_keker.kek_lr(final_lr=0.1, logdir=lrlogdir)
7 # BCE_keker.plot_kek_lr(logdir=lrlogdir)
D:\metya\Anaconda3\lib\site-packages\kekas\keker.py in kek_lr(self, final_lr, logdir, init_lr, n_steps, opt, opt_params)
407 self.callbacks = Callbacks(self.core_callbacks + [lrfinder_cb])
408 self.kek(lr=init_lr, epochs=n_epochs, skip_val=True, logdir=logdir,
--> 409 opt=opt, opt_params=opt_params)
410 finally:
411 self.callbacks = callbacks
D:\metya\Anaconda3\lib\site-packages\kekas\keker.py in kek(self, lr, epochs, skip_val, opt, opt_params, sched, sched_params, stop_iter, logdir, cp_saver_params, early_stop_params)
276 for epoch in range(epochs):
277 self.set_mode("train")
--> 278 self._run_epoch(epoch, epochs)
279
280 if not skip_val:
D:\metya\Anaconda3\lib\site-packages\kekas\keker.py in _run_epoch(self, epoch, epochs)
425
426 with torch.set_grad_enabled(self.is_train):
--> 427 for i, batch in enumerate(self.state.core.loader):
428 self.callbacks.on_batch_begin(i, self.state)
429
D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __next__(self)
574 while True:
575 assert (not self.shutdown and self.batches_outstanding > 0)
--> 576 idx, batch = self._get_batch()
577 self.batches_outstanding -= 1
578 if idx != self.rcvd_idx:
D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _get_batch(self)
551 else:
552 while True:
--> 553 success, data = self._try_get_batch()
554 if success:
555 return data
D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _try_get_batch(self, timeout)
517 if not all(w.is_alive() for w in self.workers):
518 pids_str = ', '.join(str(w.pid) for w in self.workers if not w.is_alive())
--> 519 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
520 if isinstance(e, queue.Empty):
521 return (False, None)
RuntimeError: DataLoader worker (pid(s) 11236, 6592) exited unexpectedly
I think it is a common bug with workers on Windows. I found related issues such as pytorch/pytorch#8976 and pytorch/pytorch#5301.
Funnily enough, if num_workers is set to 0 and the lambdas are put back, everything works fine. So maybe the problem is not the lambdas in the kekas code but the combination of Windows, DataLoader workers, and multiprocessing.
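For reference, the pickling failure can be reproduced without kekas: Windows uses the spawn start method, which pickles the Dataset (including its transforms) for every worker, and locally defined lambdas are not picklable. Module-level functions (optionally wrapped in functools.partial) are. A minimal illustration, with `resize_to` standing in for the cv2.resize lambda:

```python
import pickle
from functools import partial

def resize_to(x, size):
    # Stand-in for `lambda x: cv2.resize(x, (size, size))`; being a
    # module-level function, it can be pickled by worker processes.
    return x

picklable_tf = partial(resize_to, size=224)
pickle.dumps(picklable_tf)  # works

def get_transforms(size):
    return lambda x: resize_to(x, size)

try:
    # This is essentially what DataLoader workers hit on Windows.
    pickle.dumps(get_transforms(224))
except (AttributeError, pickle.PicklingError) as e:
    print(e)  # Can't pickle local object 'get_transforms.<locals>.<lambda>'
```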
Check if this is still needed and remove it if not.
keker.kek() crashes when len(dataloader) == 1, i.e. the epoch contains only one mini-batch.
I used a custom batch_sampler in the DataLoader which returns __len__() == 1.
Epoch 1/500: 0% 0/1 [00:00<?, ?it/s]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-26-71521e7b2e71> in <module>
14 "n_best": 2,
15 "prefix": "kek",
---> 16 "mode": "max"})
17
18 keker.kek_one_cycle(max_lr=3e-3, # the maximum learning rate
~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/keker.py in kek_one_cycle(self, max_lr, cycle_len, momentum_range, div_factor, increase_fraction, opt, opt_params, logdir, cp_saver_params, early_stop_params)
309 logdir=logdir,
310 cp_saver_params=cp_saver_params,
--> 311 early_stop_params=early_stop_params)
312 finally:
313 # set old callbacks without OneCycle
~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/keker.py in kek(self, lr, epochs, skip_val, opt, opt_params, sched, sched_params, stop_iter, logdir, cp_saver_params, early_stop_params)
235 if not skip_val:
236 self.set_mode("val")
--> 237 self._run_epoch(epoch, epochs)
238
239 if self.state.stop_train:
~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/keker.py in _run_epoch(self, epoch, epochs)
399 break
400
--> 401 self.callbacks.on_epoch_end(epoch, self.state)
402
403 if self.state.checkpoint:
~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/callbacks.py in on_epoch_end(self, epoch, state)
64 def on_epoch_end(self, epoch: int, state: DotDict) -> None:
65 for cb in self.callbacks:
---> 66 cb.on_epoch_end(epoch, state)
67
68 def on_train_begin(self, state: DotDict) -> None:
~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/callbacks.py in on_epoch_end(self, epoch, state)
356 metrics = state.get("epoch_metrics", {})
357 state.pbar.set_postfix_str(extend_postfix(state.pbar.postfix,
--> 358 metrics))
359 state.pbar.close()
360 elif state.mode == "test":
~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/utils.py in extend_postfix(postfix, dct)
107 def extend_postfix(postfix: str, dct: Dict) -> str:
108 postfixes = [postfix] + [f"{k}={v:.4f}" for k, v in dct.items()]
--> 109 return ", ".join(postfixes)
110
111
TypeError: sequence item 0: expected str instance, NoneType found
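The join fails because with a single batch tqdm's pbar.postfix is still None when on_epoch_end runs, so `", ".join([None, ...])` raises exactly this TypeError. A hedged sketch of a possible fix for extend_postfix (not the actual kekas patch) that treats a missing postfix as empty:

```python
from typing import Dict, Optional

def extend_postfix(postfix: Optional[str], dct: Dict) -> str:
    # Skip the old postfix when tqdm has not set one yet; it is None
    # when the epoch had too few batches to trigger a postfix update.
    postfixes = ([postfix] if postfix else []) + [f"{k}={v:.4f}" for k, v in dct.items()]
    return ", ".join(postfixes)

extend_postfix(None, {"loss": 0.5})    # 'loss=0.5000'
extend_postfix("1/1", {"loss": 0.5})   # '1/1, loss=0.5000'
```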
With the PyTorch 1.2 release there is no need for tensorboard/tensorflow to parse TF logs.
sayantan@kali:~$ sudo pip3 install kekas
Collecting kekas
Using cached https://files.pythonhosted.org/packages/2d/04/4487855bbc12532d54729b1bf07531c8b69202981fa69561a983e234220d/kekas-0.1.12.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-install-6wp28_se/kekas/setup.py", line 3, in <module>
    import kekas
  File "/tmp/pip-install-6wp28_se/kekas/kekas/__init__.py", line 1, in <module>
    from .keker import Keker
  File "/tmp/pip-install-6wp28_se/kekas/kekas/keker.py", line 12, in <module>
    from .callbacks import Callback, Callbacks, ProgressBarCallback,
  File "/tmp/pip-install-6wp28_se/kekas/kekas/callbacks.py", line 14, in <module>
    from tensorboardX import SummaryWriter
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/__init__.py", line 5, in <module>
    from .torchvis import TorchVis
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/torchvis.py", line 11, in <module>
    from .writer import SummaryWriter
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/writer.py", line 27, in <module>
    from .event_file_writer import EventFileWriter
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/event_file_writer.py", line 28, in <module>
    from .proto import event_pb2
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/proto/event_pb2.py", line 15, in <module>
    from tensorboardX.proto import summary_pb2 as tensorboardX_dot_proto_dot_summary__pb2
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/proto/summary_pb2.py", line 15, in <module>
    from tensorboardX.proto import tensor_pb2 as tensorboardX_dot_proto_dot_tensor__pb2
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/proto/tensor_pb2.py", line 15, in <module>
    from tensorboardX.proto import resource_handle_pb2 as tensorboardX_dot_proto_dot_resource__handle__pb2
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/proto/resource_handle_pb2.py", line 22, in <module>
    serialized_pb=_b('\n(tensorboardX/proto/resource_handle.proto\x12\x0ctensorboardX"r\n\x13ResourceHandleProto\x12\x0e\n\x06\x64\x65vice\x18\x01 \x01(\t\x12\x11\n\tcontainer\x18\x02 \x01(\t\x12\x0c\n\x04name\x18\x03 \x01(\t\x12\x11\n\thash_code\x18\x04 \x01(\x04\x12\x17\n\x0fmaybe_type_name\x18\x05 \x01(\tB/\n\x18org.tensorflow.frameworkB\x0eResourceHandleP\x01\xf8\x01\x01\x62\x06proto3')
TypeError: __new__() got an unexpected keyword argument 'serialized_options'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-6wp28_se/kekas/
sayantan@kali:~$
Hi,
If the n_steps parameter of kek_lr is set so that the number of epochs exceeds 1, the progress bar starts to glitch and prints extra lines
(for example, n_steps=1400 while the number of batches per epoch is 717; see below).
>>> keker.kek_lr(final_lr=0.1, logdir="logdir_lr", n_steps=1400)
```
Epoch 1/2: 100% 717/717 [03:17<00:00, 3.64it/s, loss=0.1435]
Epoch 2/2: 0% 0/717 [00:00<?, ?it/s]
Epoch 2/2: 0% 0/717 [00:00<?, ?it/s, loss=0.2583]
Epoch 2/2: 0% 1/717 [00:00<09:51, 1.21it/s, loss=0.2583]
Epoch 2/2: 0% 1/717 [00:01<09:51, 1.21it/s, loss=0.2422]
Epoch 2/2: 0% 2/717 [00:01<07:54, 1.51it/s, loss=0.2422]
Epoch 2/2: 0% 2/717 [00:01<07:54, 1.51it/s, loss=0.2315]
Epoch 2/2: 0% 3/717 [00:01<06:31, 1.82it/s, loss=0.2315]
Epoch 2/2: 0% 3/717 [00:01<06:31, 1.82it/s, loss=0.2271]
```
versions:
kekas 0.1.17
tqdm 4.30.0
jupyter-client 5.2.4
P.S. The pure python script output looks fine:
Warning: unknown JFIF revision number 0.00
Epoch 1/2: 12% 83/717 [00:22<02:46, 3.81it/s, loss=0.7120]Corrupt JPEG data: 399 extraneous bytes before marker 0xd9
Epoch 1/2: 17% 121/717 [00:32<02:36, 3.80it/s, loss=0.6896]Corrupt JPEG data: 128 extraneous bytes before marker 0xd9
Epoch 1/2: 17% 122/717 [00:32<02:36, 3.80it/s, loss=0.6943]Corrupt JPEG data: 226 extraneous bytes before marker 0xd9
Epoch 1/2: 20% 146/717 [00:38<02:29, 3.81it/s, loss=0.6542]Corrupt JPEG data: 254 extraneous bytes before marker 0xd9
Epoch 1/2: 29% 209/717 [00:55<02:13, 3.81it/s, loss=0.5616]Corrupt JPEG data: 239 extraneous bytes before marker 0xd9
Epoch 1/2: 58% 415/717 [01:49<01:19, 3.79it/s, loss=0.3014]Warning: unknown JFIF revision number 0.00
Epoch 1/2: 59% 422/717 [01:51<01:17, 3.79it/s, loss=0.2748]Corrupt JPEG data: 162 extraneous bytes before marker 0xd9
Epoch 1/2: 60% 429/717 [01:53<01:16, 3.79it/s, loss=0.2928]Corrupt JPEG data: 65 extraneous bytes before marker 0xd9
Epoch 1/2: 65% 469/717 [02:04<01:05, 3.78it/s, loss=0.2795]Corrupt JPEG data: 99 extraneous bytes before marker 0xd9
Epoch 1/2: 68% 491/717 [02:09<00:59, 3.78it/s, loss=0.2552]Corrupt JPEG data: 1403 extraneous bytes before marker 0xd9
Epoch 1/2: 72% 514/717 [02:15<00:53, 3.78it/s, loss=0.2325]Corrupt JPEG data: 2230 extraneous bytes before marker 0xd9
Epoch 1/2: 84% 604/717 [02:39<00:29, 3.78it/s, loss=0.2118]Corrupt JPEG data: 1153 extraneous bytes before marker 0xd9
Epoch 1/2: 100% 717/717 [03:09<00:00, 3.79it/s, loss=0.175Corrupt JPEG data: 399 extraneous bytes before marker 0xd9
Epoch 2/2: 0% 3/717 [00:01<06:18, 1.88it/s, loss=0.0926] Corrupt JPEG data: 128 extraneous bytes before marker 0xd9
Epoch 2/2: 6% 46/717 [00:12<02:57, 3.79it/s, loss=0.1687]Corrupt JPEG data: 239 extraneous bytes before marker 0xd9
Epoch 2/2: 12% 87/717 [00:23<02:46, 3.78it/s, loss=0.1583] Corrupt JPEG data: 1403 extraneous bytes before marker 0xd9
Epoch 2/2: 36% 260/717 [01:09<02:00, 3.78it/s, loss=0.2722]Corrupt JPEG data: 2230 extraneous bytes before marker 0xd9
Epoch 2/2: 53% 379/717 [01:40<01:29, 3.77it/s, loss=1.4624]Corrupt JPEG data: 1153 extraneous bytes before marker 0xd9