
belskikh / kekas

182 stars · 10 watchers · 18 forks · 589 KB

Just another DL library

License: MIT License

Languages: Python 72.80%, Jupyter Notebook 27.20%
Topics: pytorch, deep-learning, neural-networks, kek

kekas's People

Contributors

alxmamaev, asanakoy, belskikh, erlemar, goodok, maruschin, metya, trigram19


kekas's Issues

Add metrics for epochs

The current mechanism for aggregating epoch metrics is averaging the per-batch metrics. This is not correct: a plain mean over batches mis-weights batches of unequal size (e.g. a smaller final batch) and is wrong for non-decomposable metrics such as F1 or AUC; see the sketch below.
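A minimal sketch of one possible fix for mean-type metrics, assuming batch sizes are available to the aggregator (the epoch_metric helper is hypothetical, not existing kekas API):

```
# Hypothetical sketch: weight each batch metric by its batch size, so the
# epoch value equals the metric computed over all samples. Mean-type
# metrics only; F1/AUC would still need predictions accumulated over the epoch.
def epoch_metric(batch_metrics, batch_sizes):
    total = sum(m * n for m, n in zip(batch_metrics, batch_sizes))
    return total / sum(batch_sizes)
```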

kek_lr bug when n_iter*batch_size is larger than the epoch

Example:
If we have an epoch size of 12 batches and call:
keker.kek_lr(final_lr=1.0, n_steps=250, logdir=logdir)

then only 12 iterations will be made instead of 250, and the lr will not reach its maximum.

Epoch 1/1: 100% 12/12 [00:02<00:00,  6.90it/s, loss=8.5158]

It would be nice to be able to set an n_steps value that spans more than a single epoch.
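A minimal sketch of the epoch arithmetic that would support this, assuming kek_lr can simply run the finder for several epochs (the function name is hypothetical):

```
import math

# Hypothetical sketch: how many epochs the LR finder must run so that all
# n_steps iterations actually execute instead of stopping after one epoch.
def lr_finder_epochs(n_steps, steps_per_epoch):
    return max(1, math.ceil(n_steps / steps_per_epoch))

# e.g. n_steps=250 with 12 batches per epoch -> 21 epochs
```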

pytorch dataloader with datakek can't pickle transforms lambda function on Windows

On Windows there is a bug, described here https://discuss.pytorch.org/t/cant-pickle-local-object-dataloader-init-locals-lambda/31857, here pytorch/vision#689, and here pytorch/ignite#377.

It appears when num_workers for the torch DataLoader is greater than 0; with num_workers=0 everything runs normally.

So if num_workers > 0 and there is a lambda function in the transforms code, for example:

def get_transforms(dataset_key, size, p):
    PRE_TFMS = Transformer(dataset_key, lambda x: cv2.resize(x, (size, size))) # <-- here
    AUGS = Transformer(dataset_key, lambda x: augs()(image=x)["image"]) # <-- here
    NRM_TFMS = transforms.Compose([
        Transformer(dataset_key, to_torch()), # <-- and here: inside to_torch() there is a lambda
        Transformer(dataset_key, normalize())
    ])
    train_tfms = transforms.Compose([PRE_TFMS, AUGS, NRM_TFMS])
    val_tfms = transforms.Compose([PRE_TFMS, NRM_TFMS])
    return train_tfms, val_tfms

I get this exception:

AttributeError                            Traceback (most recent call last)
<ipython-input-35-87bd5485ec48> in <module>
      4 # !rm -r lrlogs/*
      5 
----> 6 BCE_keker.kek_lr(final_lr=0.1, logdir=lrlogdir)
      7 # BCE_keker.plot_kek_lr(logdir=lrlogdir)

D:\metya\Anaconda3\lib\site-packages\kekas-0.1.17-py3.7.egg\kekas\keker.py in kek_lr(self, final_lr, logdir, init_lr, n_steps, opt, opt_params)
    407             self.callbacks = Callbacks(self.core_callbacks + [lrfinder_cb])
    408             self.kek(lr=init_lr, epochs=n_epochs, skip_val=True, logdir=logdir,
--> 409                      opt=opt, opt_params=opt_params)
    410         finally:
    411             self.callbacks = callbacks

D:\metya\Anaconda3\lib\site-packages\kekas-0.1.17-py3.7.egg\kekas\keker.py in kek(self, lr, epochs, skip_val, opt, opt_params, sched, sched_params, stop_iter, logdir, cp_saver_params, early_stop_params)
    276             for epoch in range(epochs):
    277                 self.set_mode("train")
--> 278                 self._run_epoch(epoch, epochs)
    279 
    280                 if not skip_val:

D:\metya\Anaconda3\lib\site-packages\kekas-0.1.17-py3.7.egg\kekas\keker.py in _run_epoch(self, epoch, epochs)
    425 
    426         with torch.set_grad_enabled(self.is_train):
--> 427             for i, batch in enumerate(self.state.core.loader):
    428                 self.callbacks.on_batch_begin(i, self.state)
    429 

D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __iter__(self)
    191 
    192     def __iter__(self):
--> 193         return _DataLoaderIter(self)
    194 
    195     def __len__(self):

D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, loader)
    467                 #     before it starts, and __del__ tries to join but will get:
    468                 #     AssertionError: can only join a started process.
--> 469                 w.start()
    470                 self.index_queues.append(index_queue)
    471                 self.workers.append(w)

D:\metya\Anaconda3\lib\multiprocessing\process.py in start(self)
    110                'daemonic processes are not allowed to have children'
    111         _cleanup()
--> 112         self._popen = self._Popen(self)
    113         self._sentinel = self._popen.sentinel
    114         # Avoid a refcycle if the target function holds an indirect

D:\metya\Anaconda3\lib\multiprocessing\context.py in _Popen(process_obj)
    221     @staticmethod
    222     def _Popen(process_obj):
--> 223         return _default_context.get_context().Process._Popen(process_obj)
    224 
    225 class DefaultContext(BaseContext):

D:\metya\Anaconda3\lib\multiprocessing\context.py in _Popen(process_obj)
    320         def _Popen(process_obj):
    321             from .popen_spawn_win32 import Popen
--> 322             return Popen(process_obj)
    323 
    324     class SpawnContext(BaseContext):

D:\metya\Anaconda3\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
     87             try:
     88                 reduction.dump(prep_data, to_child)
---> 89                 reduction.dump(process_obj, to_child)
     90             finally:
     91                 set_spawning_popen(None)

D:\metya\Anaconda3\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)
     61 
     62 #

AttributeError: Can't pickle local object 'get_transforms.<locals>.<lambda>'

So I changed all the lambda functions to named ones and replaced to_torch() with torchvision.transforms.ToTensor() (I even monkey-patched the kekas transformation.py source),

and it works with num_workers=0.
With num_workers > 0 it fails with:

---------------------------------------------------------------------------
Empty                                     Traceback (most recent call last)
D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _try_get_batch(self, timeout)
    510         try:
--> 511             data = self.data_queue.get(timeout=timeout)
    512             return (True, data)

D:\metya\Anaconda3\lib\multiprocessing\queues.py in get(self, block, timeout)
    104                     if not self._poll(timeout):
--> 105                         raise Empty
    106                 elif not self._poll():

Empty: 

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-106-87bd5485ec48> in <module>
      4 # !rm -r lrlogs/*
      5 
----> 6 BCE_keker.kek_lr(final_lr=0.1, logdir=lrlogdir)
      7 # BCE_keker.plot_kek_lr(logdir=lrlogdir)

D:\metya\Anaconda3\lib\site-packages\kekas\keker.py in kek_lr(self, final_lr, logdir, init_lr, n_steps, opt, opt_params)
    407             self.callbacks = Callbacks(self.core_callbacks + [lrfinder_cb])
    408             self.kek(lr=init_lr, epochs=n_epochs, skip_val=True, logdir=logdir,
--> 409                      opt=opt, opt_params=opt_params)
    410         finally:
    411             self.callbacks = callbacks

D:\metya\Anaconda3\lib\site-packages\kekas\keker.py in kek(self, lr, epochs, skip_val, opt, opt_params, sched, sched_params, stop_iter, logdir, cp_saver_params, early_stop_params)
    276             for epoch in range(epochs):
    277                 self.set_mode("train")
--> 278                 self._run_epoch(epoch, epochs)
    279 
    280                 if not skip_val:

D:\metya\Anaconda3\lib\site-packages\kekas\keker.py in _run_epoch(self, epoch, epochs)
    425 
    426         with torch.set_grad_enabled(self.is_train):
--> 427             for i, batch in enumerate(self.state.core.loader):
    428                 self.callbacks.on_batch_begin(i, self.state)
    429 

D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __next__(self)
    574         while True:
    575             assert (not self.shutdown and self.batches_outstanding > 0)
--> 576             idx, batch = self._get_batch()
    577             self.batches_outstanding -= 1
    578             if idx != self.rcvd_idx:

D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _get_batch(self)
    551         else:
    552             while True:
--> 553                 success, data = self._try_get_batch()
    554                 if success:
    555                     return data

D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _try_get_batch(self, timeout)
    517             if not all(w.is_alive() for w in self.workers):
    518                 pids_str = ', '.join(str(w.pid) for w in self.workers if not w.is_alive())
--> 519                 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
    520             if isinstance(e, queue.Empty):
    521                 return (False, None)

RuntimeError: DataLoader worker (pid(s) 11236, 6592) exited unexpectedly

I think it is a common bug with workers on Windows.
I found related issues like pytorch/pytorch#8976 and pytorch/pytorch#5301.

Funnily enough, if num_workers is set to 0 and the lambdas are put back, everything works fine.

So maybe the problem is not the lambdas in the kekas code, but the combination of Windows, DataLoader, and multiprocessing; I don't know.
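A minimal sketch of one possible workaround, assuming any picklable callable can be passed to Transformer: replace each lambda with an instance of a module-level class, which the spawn-based Windows workers can pickle (the Resize name is hypothetical, not kekas API):

```
import cv2

# Hypothetical sketch: a picklable replacement for
# `lambda x: cv2.resize(x, (size, size))`. Module-level classes can be
# pickled by spawn-based DataLoader workers; locally defined lambdas cannot.
class Resize:
    def __init__(self, size):
        self.size = size

    def __call__(self, img):
        return cv2.resize(img, (self.size, self.size))

# usage: PRE_TFMS = Transformer(dataset_key, Resize(size))  # instead of the lambda
```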

keker.kek() crashes when len(dataloader) == 1

keker.kek() crashes when len(dataloader) == 1, i.e. the epoch contains only one mini-batch.

I used my own custom batch_sampler in the DataLoader, whose __len__() returns 1.

Epoch 1/500:   0% 0/1 [00:00<?, ?it/s]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-26-71521e7b2e71> in <module>
    14                       "n_best": 2,
    15                       "prefix": "kek",
---> 16                       "mode": "max"})
    17
    18 keker.kek_one_cycle(max_lr=3e-3,                  # the maximum learning rate

~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/keker.py in kek_one_cycle(self, max_lr, cycle_len, momentum_range, div_factor, increase_fraction, opt, opt_params, logdir, cp_saver_params, early_stop_params)
   309                      logdir=logdir,
   310                      cp_saver_params=cp_saver_params,
--> 311                      early_stop_params=early_stop_params)
   312         finally:
   313             # set old callbacks without OneCycle

~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/keker.py in kek(self, lr, epochs, skip_val, opt, opt_params, sched, sched_params, stop_iter, logdir, cp_saver_params, early_stop_params)
   235                 if not skip_val:
   236                     self.set_mode("val")
--> 237                     self._run_epoch(epoch, epochs)
   238
   239                 if self.state.stop_train:

~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/keker.py in _run_epoch(self, epoch, epochs)
   399                     break
   400
--> 401         self.callbacks.on_epoch_end(epoch, self.state)
   402
   403         if self.state.checkpoint:

~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/callbacks.py in on_epoch_end(self, epoch, state)
    64     def on_epoch_end(self, epoch: int, state: DotDict) -> None:
    65         for cb in self.callbacks:
---> 66             cb.on_epoch_end(epoch, state)
    67
    68     def on_train_begin(self, state: DotDict) -> None:

~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/callbacks.py in on_epoch_end(self, epoch, state)
   356             metrics = state.get("epoch_metrics", {})
   357             state.pbar.set_postfix_str(extend_postfix(state.pbar.postfix,
--> 358                                                       metrics))
   359             state.pbar.close()
   360         elif state.mode == "test":

~/anaconda2/compvisgpu/envs/py36/lib/python3.6/site-packages/kekas/utils.py in extend_postfix(postfix, dct)
   107 def extend_postfix(postfix: str, dct: Dict) -> str:
   108     postfixes = [postfix] + [f"{k}={v:.4f}" for k, v in dct.items()]
--> 109     return ", ".join(postfixes)
   110
   111

TypeError: sequence item 0: expected str instance, NoneType found
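The traceback suggests that state.pbar.postfix is still None when the epoch ends after a single batch. A minimal sketch of a possible guard in kekas/utils.py, under that assumption:

```
from typing import Dict

# Hypothetical sketch: skip a None/empty tqdm postfix (observed when the
# epoch has a single batch) instead of joining None with the metric strings.
def extend_postfix(postfix: str, dct: Dict) -> str:
    postfixes = ([postfix] if postfix else []) + [f"{k}={v:.4f}" for k, v in dct.items()]
    return ", ".join(postfixes)
```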

Installation Issue

sayantan@kali:~$ sudo pip3 install kekas
Collecting kekas
Using cached https://files.pythonhosted.org/packages/2d/04/4487855bbc12532d54729b1bf07531c8b69202981fa69561a983e234220d/kekas-0.1.12.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-install-6wp28_se/kekas/setup.py", line 3, in <module>
    import kekas
  File "/tmp/pip-install-6wp28_se/kekas/kekas/__init__.py", line 1, in <module>
    from .keker import Keker
  File "/tmp/pip-install-6wp28_se/kekas/kekas/keker.py", line 12, in <module>
    from .callbacks import Callback, Callbacks, ProgressBarCallback, \
  File "/tmp/pip-install-6wp28_se/kekas/kekas/callbacks.py", line 14, in <module>
    from tensorboardX import SummaryWriter
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/__init__.py", line 5, in <module>
    from .torchvis import TorchVis
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/torchvis.py", line 11, in <module>
    from .writer import SummaryWriter
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/writer.py", line 27, in <module>
    from .event_file_writer import EventFileWriter
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/event_file_writer.py", line 28, in <module>
    from .proto import event_pb2
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/proto/event_pb2.py", line 15, in <module>
    from tensorboardX.proto import summary_pb2 as tensorboardX_dot_proto_dot_summary__pb2
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/proto/summary_pb2.py", line 15, in <module>
    from tensorboardX.proto import tensor_pb2 as tensorboardX_dot_proto_dot_tensor__pb2
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/proto/tensor_pb2.py", line 15, in <module>
    from tensorboardX.proto import resource_handle_pb2 as tensorboardX_dot_proto_dot_resource__handle__pb2
  File "/usr/local/lib/python3.6/dist-packages/tensorboardX/proto/resource_handle_pb2.py", line 22, in <module>
    serialized_pb=_b('\n(tensorboardX/proto/resource_handle.proto\x12\x0ctensorboardX"r\n\x13ResourceHandleProto\x12\x0e\n\x06\x64\x65vice\x18\x01 \x01(\t\x12\x11\n\tcontainer\x18\x02 \x01(\t\x12\x0c\n\x04name\x18\x03 \x01(\t\x12\x11\n\thash_code\x18\x04 \x01(\x04\x12\x17\n\x0fmaybe_type_name\x18\x05 \x01(\tB/\n\x18org.tensorflow.frameworkB\x0eResourceHandleP\x01\xf8\x01\x01\x62\x06proto3')
TypeError: __new__() got an unexpected keyword argument 'serialized_options'

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-6wp28_se/kekas/
sayantan@kali:~$
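Note: this TypeError is raised while importing tensorboardX's generated protobuf modules, a well-known symptom of an outdated protobuf package rather than a kekas bug; upgrading protobuf first (e.g. pip3 install --upgrade protobuf) usually resolves it.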

Progress bar (tqdm) issue in kek_lr when number of epochs exceed 1 (Jupyter notebook)

Hi,
if the n_steps parameter of kek_lr is set so that the number of epochs exceeds 1, the progress bar starts to glitch and prints extra lines
(for example, n_steps=1400 while the number of batches in the epoch is 717; see below):

>>> keker.kek_lr(final_lr=0.1, logdir="logdir_lr", n_steps=1400)

```
Epoch 1/2: 100% 717/717 [03:17<00:00,  3.64it/s, loss=0.1435]
Epoch 2/2:   0% 0/717 [00:00<?, ?it/s]
Epoch 2/2:   0% 0/717 [00:00<?, ?it/s, loss=0.2583]
Epoch 2/2:   0% 1/717 [00:00<09:51,  1.21it/s, loss=0.2583]
Epoch 2/2:   0% 1/717 [00:01<09:51,  1.21it/s, loss=0.2422]
Epoch 2/2:   0% 2/717 [00:01<07:54,  1.51it/s, loss=0.2422]
Epoch 2/2:   0% 2/717 [00:01<07:54,  1.51it/s, loss=0.2315]
Epoch 2/2:   0% 3/717 [00:01<06:31,  1.82it/s, loss=0.2315]
Epoch 2/2:   0% 3/717 [00:01<06:31,  1.82it/s, loss=0.2271]
```

versions:
kekas 0.1.17
tqdm 4.30.0
jupyter-client 5.2.4

P.S. The output from a pure Python script looks fine:

```
Warning: unknown JFIF revision number 0.00
Epoch 1/2:  12% 83/717 [00:22<02:46,  3.81it/s, loss=0.7120]Corrupt JPEG data: 399 extraneous bytes before marker 0xd9
Epoch 1/2:  17% 121/717 [00:32<02:36,  3.80it/s, loss=0.6896]Corrupt JPEG data: 128 extraneous bytes before marker 0xd9
Epoch 1/2:  17% 122/717 [00:32<02:36,  3.80it/s, loss=0.6943]Corrupt JPEG data: 226 extraneous bytes before marker 0xd9
Epoch 1/2:  20% 146/717 [00:38<02:29,  3.81it/s, loss=0.6542]Corrupt JPEG data: 254 extraneous bytes before marker 0xd9
Epoch 1/2:  29% 209/717 [00:55<02:13,  3.81it/s, loss=0.5616]Corrupt JPEG data: 239 extraneous bytes before marker 0xd9
Epoch 1/2:  58% 415/717 [01:49<01:19,  3.79it/s, loss=0.3014]Warning: unknown JFIF revision number 0.00
Epoch 1/2:  59% 422/717 [01:51<01:17,  3.79it/s, loss=0.2748]Corrupt JPEG data: 162 extraneous bytes before marker 0xd9
Epoch 1/2:  60% 429/717 [01:53<01:16,  3.79it/s, loss=0.2928]Corrupt JPEG data: 65 extraneous bytes before marker 0xd9
Epoch 1/2:  65% 469/717 [02:04<01:05,  3.78it/s, loss=0.2795]Corrupt JPEG data: 99 extraneous bytes before marker 0xd9
Epoch 1/2:  68% 491/717 [02:09<00:59,  3.78it/s, loss=0.2552]Corrupt JPEG data: 1403 extraneous bytes before marker 0xd9
Epoch 1/2:  72% 514/717 [02:15<00:53,  3.78it/s, loss=0.2325]Corrupt JPEG data: 2230 extraneous bytes before marker 0xd9
Epoch 1/2:  84% 604/717 [02:39<00:29,  3.78it/s, loss=0.2118]Corrupt JPEG data: 1153 extraneous bytes before marker 0xd9
Epoch 1/2: 100% 717/717 [03:09<00:00,  3.79it/s, loss=0.175Corrupt JPEG data: 399 extraneous bytes before marker 0xd9
Epoch 2/2:   0% 3/717 [00:01<06:18,  1.88it/s, loss=0.0926] Corrupt JPEG data: 128 extraneous bytes before marker 0xd9
Epoch 2/2:   6% 46/717 [00:12<02:57,  3.79it/s, loss=0.1687]Corrupt JPEG data: 239 extraneous bytes before marker 0xd9
Epoch 2/2:  12% 87/717 [00:23<02:46,  3.78it/s, loss=0.1583] Corrupt JPEG data: 1403 extraneous bytes before marker 0xd9
Epoch 2/2:  36% 260/717 [01:09<02:00,  3.78it/s, loss=0.2722]Corrupt JPEG data: 2230 extraneous bytes before marker 0xd9
Epoch 2/2:  53% 379/717 [01:40<01:29,  3.77it/s, loss=1.4624]Corrupt JPEG data: 1153 extraneous bytes before marker 0xd9
```
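Not a kekas fix, but a hedged suggestion assuming the library builds its bar from plain tqdm: in Jupyter the widget-based bar from tqdm.auto updates in place instead of printing a new line on each refresh, which would avoid the extra lines above.

```
# Hypothetical workaround sketch: tqdm.auto picks the notebook widget bar
# inside Jupyter and falls back to the plain text bar elsewhere.
from tqdm.auto import tqdm

for batch in tqdm(range(717), desc="Epoch 2/2"):
    pass  # training step here
```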
