belskikh / kekas
Just another DL library
License: MIT License
Tensorboard is not very convenient for Jupyter or Kaggle kernels.
The current mechanism of epoch metric aggregation is averaging of per-batch metrics. This is not correct: every batch mean gets equal weight even when the last batch is smaller, so the result differs from the true epoch-level average.
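A correct aggregation would weight each batch metric by its batch size. A minimal sketch (the helper name `epoch_metric` is hypothetical, not a kekas API):

```python
def epoch_metric(batch_metrics, batch_sizes):
    # Weight each batch's mean metric by the number of samples in it;
    # plain averaging of batch means is only exact when all batches
    # have the same size.
    total = sum(m * n for m, n in zip(batch_metrics, batch_sizes))
    return total / sum(batch_sizes)

# With a smaller final batch, naive averaging over-weights it:
naive = (0.5 + 0.5 + 1.0) / 3                       # ~0.667
exact = epoch_metric([0.5, 0.5, 1.0], [32, 32, 8])  # ~0.556
```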
Example:
If we have an epoch size of 12 batches and I call:
keker.kek_lr(final_lr=1.0, n_steps=250, logdir=logdir)
then only 12 iterations are made instead of 250, and the lr never reaches its maximum.
Epoch 1/1: 100% 12/12 [00:02<00:00, 6.90it/s, loss=8.5158]
It would be nice to be able to set an n_steps that is larger than a single epoch.
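One possible way to support this would be to derive the epoch count from n_steps and stop once the step budget is spent. A rough sketch under that assumption, with `loader_len` standing in for `len(loader)` (names are hypothetical):

```python
import math

def run_lr_finder(loader_len, n_steps):
    # Span as many epochs as needed to reach n_steps iterations,
    # instead of silently stopping after one epoch.
    n_epochs = math.ceil(n_steps / loader_len)
    steps = 0
    for epoch in range(n_epochs):
        for batch in range(loader_len):
            if steps == n_steps:
                return steps
            steps += 1  # one optimizer step + one lr update would go here
    return steps

run_lr_finder(12, 250)  # 250 steps spread over 21 epochs instead of 12 over 1
```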
On Windows there is a bug, described here https://discuss.pytorch.org/t/cant-pickle-local-object-dataloader-init-locals-lambda/31857 and in pytorch/vision#689 and pytorch/ignite#377, but it only appears when num_workers for the torch DataLoader is greater than 0. With num_workers=0 everything runs normally.
So if num_workers > 0 and there is a lambda function in the transforms code, for example:
def get_transforms(dataset_key, size, p):
    PRE_TFMS = Transformer(dataset_key, lambda x: cv2.resize(x, (size, size)))  # <-- here
    AUGS = Transformer(dataset_key, lambda x: augs()(image=x)["image"])  # <-- here
    NRM_TFMS = transforms.Compose([
        Transformer(dataset_key, to_torch()),  # <-- and here, inside to_torch(), there is a lambda
        Transformer(dataset_key, normalize()),
    ])
    train_tfms = transforms.Compose([PRE_TFMS, AUGS, NRM_TFMS])
    val_tfms = transforms.Compose([PRE_TFMS, NRM_TFMS])
    return train_tfms, val_tfms
I get an exception:
AttributeError Traceback (most recent call last)
<ipython-input-35-87bd5485ec48> in <module>
4 # !rm -r lrlogs/*
5
----> 6 BCE_keker.kek_lr(final_lr=0.1, logdir=lrlogdir)
7 # BCE_keker.plot_kek_lr(logdir=lrlogdir)
D:\metya\Anaconda3\lib\site-packages\kekas-0.1.17-py3.7.egg\kekas\keker.py in kek_lr(self, final_lr, logdir, init_lr, n_steps, opt, opt_params)
407 self.callbacks = Callbacks(self.core_callbacks + [lrfinder_cb])
408 self.kek(lr=init_lr, epochs=n_epochs, skip_val=True, logdir=logdir,
--> 409 opt=opt, opt_params=opt_params)
410 finally:
411 self.callbacks = callbacks
D:\metya\Anaconda3\lib\site-packages\kekas-0.1.17-py3.7.egg\kekas\keker.py in kek(self, lr, epochs, skip_val, opt, opt_params, sched, sched_params, stop_iter, logdir, cp_saver_params, early_stop_params)
276 for epoch in range(epochs):
277 self.set_mode("train")
--> 278 self._run_epoch(epoch, epochs)
279
280 if not skip_val:
D:\metya\Anaconda3\lib\site-packages\kekas-0.1.17-py3.7.egg\kekas\keker.py in _run_epoch(self, epoch, epochs)
425
426 with torch.set_grad_enabled(self.is_train):
--> 427 for i, batch in enumerate(self.state.core.loader):
428 self.callbacks.on_batch_begin(i, self.state)
429
D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __iter__(self)
191
192 def __iter__(self):
--> 193 return _DataLoaderIter(self)
194
195 def __len__(self):
D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, loader)
467 # before it starts, and __del__ tries to join but will get:
468 # AssertionError: can only join a started process.
--> 469 w.start()
470 self.index_queues.append(index_queue)
471 self.workers.append(w)
D:\metya\Anaconda3\lib\multiprocessing\process.py in start(self)
110 'daemonic processes are not allowed to have children'
111 _cleanup()
--> 112 self._popen = self._Popen(self)
113 self._sentinel = self._popen.sentinel
114 # Avoid a refcycle if the target function holds an indirect
D:\metya\Anaconda3\lib\multiprocessing\context.py in _Popen(process_obj)
221 @staticmethod
222 def _Popen(process_obj):
--> 223 return _default_context.get_context().Process._Popen(process_obj)
224
225 class DefaultContext(BaseContext):
D:\metya\Anaconda3\lib\multiprocessing\context.py in _Popen(process_obj)
320 def _Popen(process_obj):
321 from .popen_spawn_win32 import Popen
--> 322 return Popen(process_obj)
323
324 class SpawnContext(BaseContext):
D:\metya\Anaconda3\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
87 try:
88 reduction.dump(prep_data, to_child)
---> 89 reduction.dump(process_obj, to_child)
90 finally:
91 set_spawning_popen(None)
D:\metya\Anaconda3\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
58 def dump(obj, file, protocol=None):
59 '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60 ForkingPickler(file, protocol).dump(obj)
61
62 #
AttributeError: Can't pickle local object 'get_transforms.<locals>.<lambda>'
So I changed all the lambda functions to named ones and replaced to_torch() with torchvision.transforms.ToTensor() (I even monkey-patched the kekas transformation.py source), and it works for me with num_workers=0. With num_workers > 0 it fails with:
---------------------------------------------------------------------------
Empty Traceback (most recent call last)
D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _try_get_batch(self, timeout)
510 try:
--> 511 data = self.data_queue.get(timeout=timeout)
512 return (True, data)
D:\metya\Anaconda3\lib\multiprocessing\queues.py in get(self, block, timeout)
104 if not self._poll(timeout):
--> 105 raise Empty
106 elif not self._poll():
Empty:
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-106-87bd5485ec48> in <module>
4 # !rm -r lrlogs/*
5
----> 6 BCE_keker.kek_lr(final_lr=0.1, logdir=lrlogdir)
7 # BCE_keker.plot_kek_lr(logdir=lrlogdir)
D:\metya\Anaconda3\lib\site-packages\kekas\keker.py in kek_lr(self, final_lr, logdir, init_lr, n_steps, opt, opt_params)
407 self.callbacks = Callbacks(self.core_callbacks + [lrfinder_cb])
408 self.kek(lr=init_lr, epochs=n_epochs, skip_val=True, logdir=logdir,
--> 409 opt=opt, opt_params=opt_params)
410 finally:
411 self.callbacks = callbacks
D:\metya\Anaconda3\lib\site-packages\kekas\keker.py in kek(self, lr, epochs, skip_val, opt, opt_params, sched, sched_params, stop_iter, logdir, cp_saver_params, early_stop_params)
276 for epoch in range(epochs):
277 self.set_mode("train")
--> 278 self._run_epoch(epoch, epochs)
279
280 if not skip_val:
D:\metya\Anaconda3\lib\site-packages\kekas\keker.py in _run_epoch(self, epoch, epochs)
425
426 with torch.set_grad_enabled(self.is_train):
--> 427 for i, batch in enumerate(self.state.core.loader):
428 self.callbacks.on_batch_begin(i, self.state)
429
D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __next__(self)
574 while True:
575 assert (not self.shutdown and self.batches_outstanding > 0)
--> 576 idx, batch = self._get_batch()
577 self.batches_outstanding -= 1
578 if idx != self.rcvd_idx:
D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _get_batch(self)
551 else:
552 while True:
--> 553 success, data = self._try_get_batch()
554 if success:
555 return data
D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _try_get_batch(self, timeout)
517 if not all(w.is_alive() for w in self.workers):
518 pids_str = ', '.join(str(w.pid) for w in self.workers if not w.is_alive())
--> 519 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
520 if isinstance(e, queue.Empty):
521 return (False, None)
RuntimeError: DataLoader worker (pid(s) 11236, 6592) exited unexpectedly
I think it is a common bug with workers on Windows. I found related issues such as pytorch/pytorch#8976 and pytorch/pytorch#5301.
Funnily enough, if num_workers is set to 0 and the lambdas are put back, everything works fine. So maybe the problem is not the lambdas in the kekas code but the combination of Windows, DataLoader workers, and multiprocessing.
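For reference, the pickling failure can be reproduced without kekas: Windows uses the spawn start method, which pickles the Dataset (including its transforms) for every worker, and locally defined lambdas are not picklable. Module-level functions (optionally wrapped in functools.partial) are. A minimal illustration, with `resize_to` standing in for the cv2.resize lambda:

```python
import pickle
from functools import partial

def resize_to(x, size):
    # Stand-in for `lambda x: cv2.resize(x, (size, size))`; being a
    # module-level function, it can be pickled by worker processes.
    return x

picklable_tf = partial(resize_to, size=224)
pickle.dumps(picklable_tf)  # works

def get_transforms(size):
    return lambda x: resize_to(x, size)

try:
    # This is essentially what DataLoader workers hit on Windows.
    pickle.dumps(get_transforms(224))
except (AttributeError, pickle.PicklingError) as e:
    print(e)  # Can't pickle local object 'get_transforms.<locals>.<lambda>'
```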
Check if this is still needed and remove it if not.
keker.kek() crashes when len(dataloader) == 1, i.e. the epoch contains only one mini-batch.
I used a custom batch_sampler in the DataLoader which returns __len__() == 1.
Epoch 1/500: 0% 0/1 [00:00<?, ?it/s]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-26-71521e7b2e71> in <module>
14 "n_best": 2,
15 "prefix": "kek",
---> 16 "mode": "max"})
17
18 keker.kek_one_cycle(max_lr=3e-3, # the maximum learning rate
~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/keker.py in kek_one_cycle(self, max_lr, cycle_len, momentum_range, div_factor, increase_fraction, opt, opt_params, logdir, cp_saver_params, early_stop_params)
309 logdir=logdir,
310 cp_saver_params=cp_saver_params,
--> 311 early_stop_params=early_stop_params)
312 finally:
313 # set old callbacks without OneCycle
~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/keker.py in kek(self, lr, epochs, skip_val, opt, opt_params, sched, sched_params, stop_iter, logdir, cp_saver_params, early_stop_params)
235 if not skip_val:
236 self.set_mode("val")
--> 237 self._run_epoch(epoch, epochs)
238
239 if self.state.stop_train:
~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/keker.py in _run_epoch(self, epoch, epochs)
399 break
400
--> 401 self.callbacks.on_epoch_end(epoch, self.state)
402
403 if self.state.checkpoint:
~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/callbacks.py in on_epoch_end(self, epoch, state)
64 def on_epoch_end(self, epoch: int, state: DotDict) -> None:
65 for cb in self.callbacks:
---> 66 cb.on_epoch_end(epoch, state)
67
68 def on_train_begin(self, state: DotDict) -> None:
~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/callbacks.py in on_epoch_end(self, epoch, state)
356 metrics = state.get("epoch_metrics", {})
357 state.pbar.set_postfix_str(extend_postfix(state.pbar.postfix,
--> 358 metrics))
359 state.pbar.close()
360 elif state.mode == "test":
~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/utils.py in extend_postfix(postfix, dct)
107 def extend_postfix(postfix: str, dct: Dict) -> str:
108 postfixes = [postfix] + [f"{k}={v:.4f}" for k, v in dct.items()]
--> 109 return ", ".join(postfixes)
110
111
TypeError: sequence item 0: expected str instance, NoneType found
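The join fails because with a single batch tqdm's pbar.postfix is still None when on_epoch_end runs, so `", ".join([None, ...])` raises exactly this TypeError. A hedged sketch of a possible fix for extend_postfix (not the actual kekas patch) that treats a missing postfix as empty:

```python
from typing import Dict, Optional

def extend_postfix(postfix: Optional[str], dct: Dict) -> str:
    # Skip the old postfix when tqdm has not set one yet; it is None
    # when the epoch had too few batches to trigger a postfix update.
    postfixes = ([postfix] if postfix else []) + [f"{k}={v:.4f}" for k, v in dct.items()]
    return ", ".join(postfixes)

extend_postfix(None, {"loss": 0.5})    # 'loss=0.5000'
extend_postfix("1/1", {"loss": 0.5})   # '1/1, loss=0.5000'
```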
With the PyTorch 1.2 release there is no need for tensorboard/tensorflow to parse TF logs.
sayantan@kali:~$ sudo pip3 install kekas
Collecting kekas
Using cached https://files.pythonhosted.org/packages/2d/04/4487855bbc12532d54729b1bf07531c8b69202981fa69561a983e234220d/kekas-0.1.12.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-install-6wp28_se/kekas/setup.py", line 3, in <module>
    import kekas
  File "/tmp/pip-install-6wp28_se/kekas/kekas/__init__.py", line 1, in <module>
    from .keker import Keker
  File "/tmp/pip-install-6wp28_se/kekas/kekas/keker.py", line 12, in <module>
    from .callbacks import Callback, Callbacks, ProgressBarCallback,
  File "/tmp/pip-install-6wp28_se/kekas/kekas/callbacks.py", line 14, in <module>
    from tensorboardX import SummaryWriter
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/__init__.py", line 5, in <module>
    from .torchvis import TorchVis
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/torchvis.py", line 11, in <module>
    from .writer import SummaryWriter
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/writer.py", line 27, in <module>
    from .event_file_writer import EventFileWriter
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/event_file_writer.py", line 28, in <module>
    from .proto import event_pb2
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/proto/event_pb2.py", line 15, in <module>
    from tensorboardX.proto import summary_pb2 as tensorboardX_dot_proto_dot_summary__pb2
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/proto/summary_pb2.py", line 15, in <module>
    from tensorboardX.proto import tensor_pb2 as tensorboardX_dot_proto_dot_tensor__pb2
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/proto/tensor_pb2.py", line 15, in <module>
    from tensorboardX.proto import resource_handle_pb2 as tensorboardX_dot_proto_dot_resource__handle__pb2
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/proto/resource_handle_pb2.py", line 22, in <module>
    serialized_pb=_b('\n(tensorboardX/proto/resource_handle.proto\x12\x0ctensorboardX"r\n\x13ResourceHandleProto\x12\x0e\n\x06\x64\x65vice\x18\x01 \x01(\t\x12\x11\n\tcontainer\x18\x02 \x01(\t\x12\x0c\n\x04name\x18\x03 \x01(\t\x12\x11\n\thash_code\x18\x04 \x01(\x04\x12\x17\n\x0fmaybe_type_name\x18\x05 \x01(\tB/\n\x18org.tensorflow.frameworkB\x0eResourceHandleP\x01\xf8\x01\x01\x62\x06proto3')
TypeError: __new__() got an unexpected keyword argument 'serialized_options'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-6wp28_se/kekas/
sayantan@kali:~$
Hi,
If the n_steps parameter of kek_lr is set so that the number of epochs exceeds 1, the progress bar starts to glitch and prints extra lines
(for example, n_steps=1400 while the number of batches per epoch is 717; see below).
>>> keker.kek_lr(final_lr=0.1, logdir="logdir_lr", n_steps=1400)
```
Epoch 1/2: 100% 717/717 [03:17<00:00, 3.64it/s, loss=0.1435]
Epoch 2/2: 0% 0/717 [00:00<?, ?it/s]
Epoch 2/2: 0% 0/717 [00:00<?, ?it/s, loss=0.2583]
Epoch 2/2: 0% 1/717 [00:00<09:51, 1.21it/s, loss=0.2583]
Epoch 2/2: 0% 1/717 [00:01<09:51, 1.21it/s, loss=0.2422]
Epoch 2/2: 0% 2/717 [00:01<07:54, 1.51it/s, loss=0.2422]
Epoch 2/2: 0% 2/717 [00:01<07:54, 1.51it/s, loss=0.2315]
Epoch 2/2: 0% 3/717 [00:01<06:31, 1.82it/s, loss=0.2315]
Epoch 2/2: 0% 3/717 [00:01<06:31, 1.82it/s, loss=0.2271]
```
versions:
kekas 0.1.17
tqdm 4.30.0
jupyter-client 5.2.4
P.S. The pure python script output looks fine:
Warning: unknown JFIF revision number 0.00
Epoch 1/2: 12% 83/717 [00:22<02:46, 3.81it/s, loss=0.7120]Corrupt JPEG data: 399 extraneous bytes before marker 0xd9
Epoch 1/2: 17% 121/717 [00:32<02:36, 3.80it/s, loss=0.6896]Corrupt JPEG data: 128 extraneous bytes before marker 0xd9
Epoch 1/2: 17% 122/717 [00:32<02:36, 3.80it/s, loss=0.6943]Corrupt JPEG data: 226 extraneous bytes before marker 0xd9
Epoch 1/2: 20% 146/717 [00:38<02:29, 3.81it/s, loss=0.6542]Corrupt JPEG data: 254 extraneous bytes before marker 0xd9
Epoch 1/2: 29% 209/717 [00:55<02:13, 3.81it/s, loss=0.5616]Corrupt JPEG data: 239 extraneous bytes before marker 0xd9
Epoch 1/2: 58% 415/717 [01:49<01:19, 3.79it/s, loss=0.3014]Warning: unknown JFIF revision number 0.00
Epoch 1/2: 59% 422/717 [01:51<01:17, 3.79it/s, loss=0.2748]Corrupt JPEG data: 162 extraneous bytes before marker 0xd9
Epoch 1/2: 60% 429/717 [01:53<01:16, 3.79it/s, loss=0.2928]Corrupt JPEG data: 65 extraneous bytes before marker 0xd9
Epoch 1/2: 65% 469/717 [02:04<01:05, 3.78it/s, loss=0.2795]Corrupt JPEG data: 99 extraneous bytes before marker 0xd9
Epoch 1/2: 68% 491/717 [02:09<00:59, 3.78it/s, loss=0.2552]Corrupt JPEG data: 1403 extraneous bytes before marker 0xd9
Epoch 1/2: 72% 514/717 [02:15<00:53, 3.78it/s, loss=0.2325]Corrupt JPEG data: 2230 extraneous bytes before marker 0xd9
Epoch 1/2: 84% 604/717 [02:39<00:29, 3.78it/s, loss=0.2118]Corrupt JPEG data: 1153 extraneous bytes before marker 0xd9
Epoch 1/2: 100% 717/717 [03:09<00:00, 3.79it/s, loss=0.175Corrupt JPEG data: 399 extraneous bytes before marker 0xd9
Epoch 2/2: 0% 3/717 [00:01<06:18, 1.88it/s, loss=0.0926] Corrupt JPEG data: 128 extraneous bytes before marker 0xd9
Epoch 2/2: 6% 46/717 [00:12<02:57, 3.79it/s, loss=0.1687]Corrupt JPEG data: 239 extraneous bytes before marker 0xd9
Epoch 2/2: 12% 87/717 [00:23<02:46, 3.78it/s, loss=0.1583] Corrupt JPEG data: 1403 extraneous bytes before marker 0xd9
Epoch 2/2: 36% 260/717 [01:09<02:00, 3.78it/s, loss=0.2722]Corrupt JPEG data: 2230 extraneous bytes before marker 0xd9
Epoch 2/2: 53% 379/717 [01:40<01:29, 3.77it/s, loss=1.4624]Corrupt JPEG data: 1153 extraneous bytes before marker 0xd9