marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository

License: Other

C++ 82.91% CMake 4.65% Cuda 7.61% Shell 0.80% Python 3.04% C 0.07% Batchfile 0.34% Perl 0.17% PowerShell 0.18% Dockerfile 0.19% Vim Script 0.02%

neural-machine-translation gpu-acceleration cpp11 fast cuda

marian-dev's Issues

Decaying learning rate and other schedules

Pass "reporter" around to optimizers and models in order to be able to implement decaying learning rates and other decays, for instance used for guided alignment.

OpenNMT allows multiplying the learning rate by a decay factor at each new epoch, or whenever the cost on the validation set does not improve by at least a given factor.

Maybe we should also rename "reporter" if we use it for things like that.

Options to be added would be for example:

--learning-rate-decay 0.7 --start-decay-epoch 5

What happens at each epoch after epoch 5 is:

lr = lr * 0.7
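
The proposed schedule can be sketched as a small function. This is only an illustration: the name `decayedLearningRate` and its parameters are hypothetical, and it assumes 1-based epochs with the first decay applied in the epoch right after `--start-decay-epoch`.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not part of Marian: learning rate used in `epoch`
// (1-based) when the base rate is multiplied by `decay` once for every
// epoch after `startDecayEpoch`.
double decayedLearningRate(double baseRate, double decay,
                           int startDecayEpoch, int epoch) {
  assert(decay > 0.0 && epoch >= 1);
  // Number of times the decay factor has been applied so far.
  const int steps = epoch > startDecayEpoch ? epoch - startDecayEpoch : 0;
  return baseRate * std::pow(decay, steps);
}
```

With `--learning-rate-decay 0.7 --start-decay-epoch 5`, this leaves the rate unchanged through epoch 5 and multiplies it by 0.7 in epoch 6, 0.49 in epoch 7, and so on.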

mnist example files are missing

$ ./mnist_benchmark
Loading train set...terminate called after throwing an instance of 'std::runtime_error'
  what(): Cannot open file../examples/mnist/train-images-idx3-ubyte!
Aborted (core dumped)

Either these files should be included, or there should be documentation and/or a script to get them.

Weird GPU Allocation

I ran marian with the parameter -d 0,
but nvidia-smi shows that the code actually runs on GPU 3.
Tested on Azure.

alfikri@4gpu:~$ nvidia-smi
Thu Mar  9 20:30:50 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 8A24:00:00.0     Off |                  Off |
| N/A   40C    P8    15W / 150W |      2MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 92AA:00:00.0     Off |                  Off |
| N/A   49C    P8    16W / 150W |      2MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | AEE7:00:00.0     Off |                  Off |
| N/A   37C    P8    14W / 150W |      2MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | B710:00:00.0     Off |                  Off |
| N/A   60C    P0    90W / 150W |   6404MiB /  8123MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    3      4778    C   ./marian                                      6402MiB |
+-----------------------------------------------------------------------------+

Output configuration file for amun when training dl4mt model

Memory Leak

Will this part cause a memory leak?
https://github.com/emjotde/Marian/blob/master/src/tensors/tensor.h#L40

Validation on dev set with BLEU scorer

Framework for validation every n steps. For instance it should be possible to specify --metric perplexity bleu and it should then calculate these values on the dev set. It would be good to have this as a skeleton. How the individual scoring methods are implemented would be a separate issue.

Perplexity can be evaluated just by running forward steps on dev set batches and could serve as a first example of how to extend the skeleton. Validation results should be logged to separate files using a specific logger.

For BLEU we of course need a translator and a BLEU scorer.
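
As a rough sketch of the perplexity metric, assuming the forward steps yield a natural-log probability for every reference word on the dev set (the function name and input layout are hypothetical):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Perplexity = exp(average negative log-likelihood per word).
// `wordLogProbs` holds the natural-log probability the model assigned to
// each reference word (an assumed input format, for illustration only).
double perplexity(const std::vector<double>& wordLogProbs) {
  assert(!wordLogProbs.empty());
  double sum = 0.0;
  for (double lp : wordLogProbs) sum += lp;
  return std::exp(-sum / wordLogProbs.size());
}
```

A model that assigns probability 0.25 to every word has a perplexity of 4 under this definition.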

NVIDIA Sample EULA requires a statement

src/3rd_party/reduce_all.h appears to derive from an NVIDIA CUDA SDK sample. This looks to be fine, but the EULA requires a statement: "This software contains source code provided by NVIDIA Corporation."

Batch iterator

Working batch iterator for general data sets. Iterator operation "++" should execute a training step with a provided SGD variant.

Use-cases

Application to MNIST toy example
Application to monolingual corpus for RNN-LM training.
Application to parallel corpus for NMT training.

Regression tests

A first simple regression test:

Download a specific WMT model, run marian_test on a mini training set and see if the gradients are still the same.

Dropout without CuDNN

Create a kernel that uses curand (?) to quickly create random 0/1 or 0/p (p would be the probability in scaling dropout). This should be roughly as fast as the cuDNN version without depending on cuDNN.

This issue only concerns the low-level kernel. We will see how to integrate this into the autodiff framework later.

Adding Marian version to models

It is worth keeping the Marian version in model files. This will be helpful in the future when a model is run on a different version of Marian than the one it was trained with.

Separate GPU code from CPU code

Refactor code in order to separate GPU code from CPU code as far as this is painlessly possible without major rewrites. One way would be to hide includes of CUDA code in the *.cu files that really require them. For instance, anything that includes ExpressionGraph currently needs to be a *.cu file when compiled into executable code. It would be good if it could remain a *.cpp file.

This would be a first step towards different back-ends which would require more extensive rewrites. This should be more of a clean-up.

Compare performance of softmax implementations

CuDNN is supposed to have very efficient softmax implementations for both the forward and the backward step. It would be interesting to set up a benchmark and compare to our own implementation on different (small and very large) matrix sizes.

Not sure, but CuDNN seems also to have a working log-softmax?

Reimplementation of known examples from other packages

Following issue #5 it would be interesting to see if we can already reimplement a couple of well-known examples from other packages, for instance Keras?

3-D tensors

Currently we only support 2D tensors. Some operations, like the attention mechanism, will require 3D tensors. Element-wise operations are easy, but proper broadcasting and reshaping are a challenge.

Tensor a({80, 200});
Tensor b({32,200});

z = reshape(a, {1,80,200}) + reshape(b, {32, 1, 200});

Something like the above should work and result in a tensor of shape {32, 80, 200}.
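
The expected result shape can be derived with a NumPy-style broadcasting rule. The helper below is a hypothetical sketch, not Marian code, and it assumes both shapes have already been reshaped to the same rank:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

using Shape = std::vector<int>;

// Hypothetical helper: result shape of an element-wise operation when a
// dimension of size 1 is stretched to match the other operand.
Shape broadcastShape(const Shape& a, const Shape& b) {
  assert(a.size() == b.size());
  Shape out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i] == b[i] || b[i] == 1)
      out[i] = a[i];
    else if (a[i] == 1)
      out[i] = b[i];
    else
      throw std::invalid_argument("shapes are not broadcastable");
  }
  return out;
}
```

For the shapes in the example, `broadcastShape({1, 80, 200}, {32, 1, 200})` yields `{32, 80, 200}`.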

Documentation of code and use cases

Documentation with doxygen for API. I guess use-cases should be put on the wiki?
Can doxygen somehow talk to the github wiki format?

Compilation issue

Under CentOS 7.3.1611 with

GCC 4.9.3
Boost 1.53.0
cmake 3.6.3

I had to make two changes to get marian to compile:

I had to add this line near the top of CMakeLists.txt:
```
set(THREADS_PREFER_PTHREAD_FLAG TRUE)
```
I had to change Ptr<Reporter> reporter_ (in src/training/graph_group.h) to be public in order to get the code to compile. (Obviously there is a more principled way to do this, but this at least gets things running).

cmake doesn't ensure that the cuda library is the right version

On gna:

export LD_LIBRARY_PATH="/usr/local/cuda-8.0/lib64"
export LIBRARY_PATH="/usr/local/cuda-8.0/lib64"
export PATH="/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/cuda-8.0/bin:/usr/local/sbin"

cd /marian/base/directory
rm -rf build
mkdir build
cd build
cmake ..
make -j

=> FAIL, because cmake finds the CUDA-7.0 library and is happy with it

export PATH="/usr/local/cuda-8.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
cd ..
rm -rf build/*
cd build
cmake ..
make -j

=> SUCCESS

Keep all parameters in single GPU matrix, create slices to view them

Keeping all parameters in a single GPU matrix should quite significantly improve the performance of all SGD variants, as they loop through all parameters when applying updates. All updates are element-wise and identical for all parameters, so they could be performed in a single step, maximizing GPU saturation.
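
A CPU-only sketch of the idea, with `std::vector<float>` standing in for the GPU matrix. All names here (`ParamView`, `FlatParams`, `sgdStep`) are hypothetical and only illustrate the layout:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A parameter is just an (offset, size) slice of the shared buffer.
struct ParamView {
  std::size_t offset;
  std::size_t size;
};

// Hypothetical container: all values and gradients live in two flat buffers.
struct FlatParams {
  std::vector<float> values;
  std::vector<float> grads;

  // Reserve `size` elements at the end of the buffers and return a view.
  ParamView add(std::size_t size) {
    ParamView v{values.size(), size};
    values.resize(values.size() + size, 0.0f);
    grads.resize(grads.size() + size, 0.0f);
    return v;
  }

  // Pointers are only valid until the next call to add().
  float* data(const ParamView& v) {
    assert(v.offset + v.size <= values.size());
    return values.data() + v.offset;
  }
  float* grad(const ParamView& v) {
    assert(v.offset + v.size <= grads.size());
    return grads.data() + v.offset;
  }

  // Plain SGD over every parameter in one pass, ignoring slice boundaries.
  void sgdStep(float learningRate) {
    for (std::size_t i = 0; i < values.size(); ++i)
      values[i] -= learningRate * grads[i];
  }
};
```

The update loop never needs to know where one parameter ends and the next begins, which is what would allow a single kernel launch on the GPU.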

Reimplement Nematus NMT model in Marian

Reimplement the Nematus NMT model in Marian. It should also be possible to read in available models. Once done: benchmarking etc.

Make asynchronous SGD scale linearly

We have synchronous and async SGD now. Currently all gradient updates are communicated to the central parameters. There would be two (orthogonal) ways to handle that: Alham's work and introducing a delayed update by tau steps. The second option requires some accumulation and possibly choosing optimizers per worker below the main SGD variant (Adagrad, Adam). Currently on the Titan X scaling is sublinear for a typical WMT-grade DL4MT model, though not tragically so:

1 GPU: 4100 w/s
2 GPU: 6500 w/s
3 GPU: 9200 w/s

Would be great to get to >12000 w/s on 3 GPUs.
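
To put these numbers in perspective, scaling efficiency can be computed as measured throughput divided by ideal linear throughput. The helper name below is hypothetical:

```cpp
#include <cassert>

// Scaling efficiency: measured multi-GPU throughput divided by the ideal
// throughput (number of GPUs times the single-GPU throughput).
double scalingEfficiency(double wordsPerSec, double singleGpuWordsPerSec,
                         int gpus) {
  assert(singleGpuWordsPerSec > 0.0 && gpus > 0);
  return wordsPerSec / (singleGpuWordsPerSec * gpus);
}
```

With the figures above this gives roughly 0.79 for 2 GPUs and 0.75 for 3 GPUs. The 12000 w/s target would correspond to about 0.98 of the linear ideal of 12300 w/s.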

Investigate problem with forking graphs

The current implementation does not support forking graphs out of the box. We need to either add this support or temporarily detect and prevent forks.

tensor_test segfaults

On zisa:

Thread 1 "tensor_test" received signal SIGSEGV, Segmentation fault.
spdlog::logger::_log_if_enabled (lvl=spdlog::level::info, this=0x0)
    at /home/heafield/marian/src/3rd_party/spdlog/details/logger_impl.h:68
68          return details::line_logger(this, lvl, should_log(lvl));
(gdb) bt
#0  spdlog::logger::_log_if_enabled (lvl=spdlog::level::info, this=0x0) at /home/heafield/marian/src/3rd_party/spdlog/details/logger_impl.h:68
#1  spdlog::logger::info (this=<optimized out>) at /home/heafield/marian/src/3rd_party/spdlog/details/logger_impl.h:213
#2  marian::TensorAllocator::reserve (this=this@entry=0x73b2f0, elements=<optimized out>)
    at /home/heafield/marian/src/tensors/tensor_allocator.h:99
#3  0x0000000000413001 in marian::TensorAllocator::checkSpace (
    shape=<error reading variable: access outside bounds of object referenced via synthetic pointer>, this=0x73b2f0)
    at /home/heafield/marian/src/tensors/tensor_allocator.h:84
#4  marian::TensorAllocator::allocate (this=0x73b2f0, t=std::shared_ptr (empty) 0x0, shape=...)
    at /home/heafield/marian/src/tensors/tensor_allocator.h:127
#5  0x00000000004053a6 in main () at /home/heafield/marian/src/test/tensor_test.cu:21

Logger isn't initialized properly?

Add integer tensors

Future work: add tensors with underlying integer type. Currently only floats are possible.

Consider using term Vertex for nodes in computational graph

The term "node" is overloaded.

In the context of neural networks, the term node encompasses input nodes, neurons, and output nodes. In the context of algorithmic differentiation (AD), the term node refers to a node in the computation graph.

Given that we are using AD in the context of neural networks, there is a motivation to avoid conflating the two meanings, and for making very clear which is being described.

To that end, one possible solution would be to always use the term "vertex" rather than "node" when referring to the items in a computation graph.

Likewise, given that some computation graph vertices refer to an entire layer of neural network nodes, it may be appropriate to consider naming such vertices explicitly with names such as InputLayer or OutputLayer.

Merge with AmuNMT

A shallow merge for now. Amun and Marian communicate through the model file, which should be ok as long as we do not have a Marian scorer in Amun.

What should we do with the repos? New repo? New organization? New name?

Separate CuDNN wrappers from own functions

Currently all wrappers around CuDNN functions (softmax, dropout) are located in tensor_operators.*; they should be moved to a separate file. This also requires some clever handling of cudnnHandle_t and similarly cublasHandle_t. Currently there are only single static instances of each one.

Potential race condition?

https://github.com/amunmt/marian/blob/master/src/training/graph_group.h line 73

Should we put cudaStreamSynchronize(0) here? I believe CUDA operations do not block the CPU, so the pushGradient function will finish (and release the lock) even though the parameter update is not complete yet.

Save configuration with model

Save configuration with model (model.npz and model.npz.yml)

Orthogonal initialization

Orthogonal initialization of weights. The easiest way to do this is probably on the CPU and then copy to GPU. This can be achieved with any LAPACK library, but there is also CULA, a CUDA LAPACK library.
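
For illustration only, here is a dependency-free CPU sketch. It uses modified Gram-Schmidt on a random Gaussian matrix rather than the LAPACK routines mentioned above, and the function name and square-matrix restriction are assumptions made for brevity:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical CPU sketch: fill an n x n matrix with Gaussian noise and
// orthonormalize its rows with modified Gram-Schmidt.
Matrix orthogonalInit(std::size_t n, unsigned seed) {
  assert(n > 0);
  std::mt19937 gen(seed);
  std::normal_distribution<double> dist(0.0, 1.0);
  Matrix m(n, std::vector<double>(n));
  for (auto& row : m)
    for (auto& x : row) x = dist(gen);

  for (std::size_t i = 0; i < n; ++i) {
    // Remove the components along all previously processed rows.
    for (std::size_t j = 0; j < i; ++j) {
      double dot = 0.0;
      for (std::size_t k = 0; k < n; ++k) dot += m[i][k] * m[j][k];
      for (std::size_t k = 0; k < n; ++k) m[i][k] -= dot * m[j][k];
    }
    // Scale the remainder to unit length (a zero remainder is not handled).
    double norm = 0.0;
    for (double x : m[i]) norm += x * x;
    norm = std::sqrt(norm);
    for (auto& x : m[i]) x /= norm;
  }
  return m;
}
```

The result would then be copied to the GPU as the initial weight matrix. A LAPACK-based QR or SVD would be the more robust choice for large matrices.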

Node-wise gradient checking

Add an operation to each node to compare numerical gradients with the actual gradients after the backprop step. It is probably enough to modify UnaryNodeOp and BinaryNodeOp. We can do this with fake random tensors and parameters.

Graph should probably have a function that calls gradient checks for all nodes.

Analogously we should check gradients of the whole graph on real data.
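
The comparison itself could look like the following sketch, which checks an analytic gradient against central finite differences. The function name, step size and tolerance are illustrative assumptions, and a real check would operate on Marian tensors rather than `std::vector`:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: returns true if `analyticGrad` agrees with the
// central-difference estimate of the gradient of `f` at `x`.
bool gradientsMatch(const std::function<double(const std::vector<double>&)>& f,
                    std::vector<double> x,
                    const std::vector<double>& analyticGrad,
                    double eps = 1e-5, double tol = 1e-6) {
  assert(x.size() == analyticGrad.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double saved = x[i];
    x[i] = saved + eps;
    const double plus = f(x);
    x[i] = saved - eps;
    const double minus = f(x);
    x[i] = saved;  // restore before moving to the next coordinate
    const double numeric = (plus - minus) / (2.0 * eps);
    if (std::fabs(numeric - analyticGrad[i]) > tol) return false;
  }
  return true;
}
```

For f(x) = x0^2 + 3*x1 at the point (2, 5), the gradient (4, 3) passes the check while (5, 3) does not.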

Fix and optimize dropout

The dropout operation is currently inefficient and probably broken. We need to learn how random number generators work on the GPU.

For the back burner: Create .shuf files in a temporary directory during training

Not very important, but before we get bitten by it: Marian currently creates the .shuf files in the same directory as the training data files it is given.

There's no guarantee that Marian has write access to those directories.
More importantly, this could cause problems if someone runs two training instances on the same data, as one process might change the files another process relies on. So either we should create names that are unique per process (e.g., <datafile>.<host>.<PID>.shuffled), or use a temporary directory for each training instance.
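
A minimal sketch of the per-process naming scheme, with a hypothetical helper name. The host name and process ID are passed in as arguments to keep the example portable:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper following the <datafile>.<host>.<PID>.shuffled pattern.
std::string shuffledFileName(const std::string& dataFile,
                             const std::string& host, long pid) {
  assert(!dataFile.empty() && !host.empty());
  return dataFile + "." + host + "." + std::to_string(pid) + ".shuffled";
}
```

On POSIX systems the two values could come from `gethostname()` and `getpid()`. This only addresses the name clash between processes, not the missing write access, for which a temporary directory would still be needed.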

Investigate why vocabulary creation is slow

Investigate why vocabulary creation is slow; it is probably a problem in yaml-cpp.

softmax appears to be broken in xor

"result": shape=4x1
0.995334
0.00920105
0.995334
0.000189304

"softmax": shape=4x1
1
1
1
1
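
One possible reading, offered as an assumption rather than a confirmed diagnosis: if softmax normalizes each row, then on a 4x1 matrix every row has a single element and always normalizes to 1. The sketch below (hypothetical function name) shows a numerically stable softmax over one row and reproduces that behaviour:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax over a single row: subtract the maximum
// before exponentiating, then divide by the sum.
std::vector<double> softmaxRow(const std::vector<double>& row) {
  assert(!row.empty());
  const double maxVal = *std::max_element(row.begin(), row.end());
  std::vector<double> out(row.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < row.size(); ++i) {
    out[i] = std::exp(row[i] - maxVal);
    sum += out[i];
  }
  for (double& v : out) v /= sum;
  return out;
}
```

Calling `softmaxRow` on any one-element row, such as `{0.995334}`, returns `{1}`, whereas a two-element row of equal values returns `{0.5, 0.5}`.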

cmake doesn't check if doxygen is installed

... building fails if it isn't.

Memory management for unrolled graphs

Graphs that are unrolled for different batch sizes when they contain RNNs should share the entire memory in order to avoid allocation. Parameters should be protected, the rest could be overwritten.

It probably makes sense to change the Tensor class to not hold its own memory; rather it should be a view on a segment of the memory the graph holds. This should be solved together with issue #9.

ReLU node causes numeric instability

When running the MNIST example with ReLU for more than 25 epochs, NaN values appear. This seems to be a known issue with ReLU and cross-entropy, and the suggested solution is some sort of clipping. This requires more investigation.
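
One common form of such clipping, shown here purely as an illustration, is to clamp the predicted probability away from zero before taking the logarithm. The function name and the epsilon value are assumptions, and other variants such as gradient clipping may be what is needed here:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical sketch: cross-entropy term for the correct class, with the
// predicted probability clamped to [eps, 1] so that log(0) cannot occur.
double clippedCrossEntropy(double predictedProb, double eps = 1e-9) {
  assert(eps > 0.0 && eps < 1.0);
  const double p = std::min(1.0, std::max(eps, predictedProb));
  return -std::log(p);
}
```

Without the clamp, a predicted probability of exactly 0 gives an infinite cost, which can then turn into NaN in later arithmetic.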

Binarization of models and parameters

Binarize graphs and parameters with boost::serialization. Values and adjoints of all nodes can be ignored apart from ParamNodeOp. Some nodes, like "DropoutNodeOp", have special parameters that also need to be serialized.

Error at /mt/marian/src/kernels/dropout.cu:22

From current master, seems to break no matter what I do (on Azure):

/mt/marian/build/marian -t data/corpus.bpe.ro data/corpus.bpe.en -d 0
[2017-03-27 16:43:16] [config] after-batches: 0
[2017-03-27 16:43:16] [config] after-epochs: 0
[2017-03-27 16:43:16] [config] clip-norm: 1
[2017-03-27 16:43:16] [config] devices:
[2017-03-27 16:43:16] [config] - 0
[2017-03-27 16:43:16] [config] dim-emb: 512
[2017-03-27 16:43:16] [config] dim-rnn: 1024
[2017-03-27 16:43:16] [config] dim-vocabs:
[2017-03-27 16:43:16] [config] - 50000
[2017-03-27 16:43:16] [config] - 50000
[2017-03-27 16:43:16] [config] disp-freq: 1000
[2017-03-27 16:43:16] [config] dropout-rnn: 0
[2017-03-27 16:43:16] [config] dropout-src: 0
[2017-03-27 16:43:16] [config] dropout-trg: 0
[2017-03-27 16:43:16] [config] early-stopping: 10
[2017-03-27 16:43:16] [config] layer-normalization: false
[2017-03-27 16:43:16] [config] layers-dec: 1
[2017-03-27 16:43:16] [config] layers-enc: 1
[2017-03-27 16:43:16] [config] learn-rate: 0.0001
[2017-03-27 16:43:16] [config] max-length: 50
[2017-03-27 16:43:16] [config] maxi-batch: 100
[2017-03-27 16:43:16] [config] mini-batch: 64
[2017-03-27 16:43:16] [config] model: model.npz
[2017-03-27 16:43:16] [config] no-reload: false
[2017-03-27 16:43:16] [config] no-shuffle: false
[2017-03-27 16:43:16] [config] optimizer: adam
[2017-03-27 16:43:16] [config] overwrite: false
[2017-03-27 16:43:16] [config] relative-paths: false
[2017-03-27 16:43:16] [config] save-freq: 10000
[2017-03-27 16:43:16] [config] seed: 1234
[2017-03-27 16:43:16] [config] skip: false
[2017-03-27 16:43:16] [config] train-sets:
[2017-03-27 16:43:16] [config] - data/corpus.bpe.ro
[2017-03-27 16:43:16] [config] - data/corpus.bpe.en
[2017-03-27 16:43:16] [config] type: dl4mt
[2017-03-27 16:43:16] [config] valid-freq: 10000
[2017-03-27 16:43:16] [config] valid-metrics:
[2017-03-27 16:43:16] [config] - cross-entropy
[2017-03-27 16:43:16] [config] workspace: 2048
[2017-03-27 16:43:16] [data] Loading vocabulary from data/corpus.bpe.ro.json (max: 50000)
[2017-03-27 16:43:16] [data] Loading vocabulary from data/corpus.bpe.en.json (max: 50000)
Error at /mt/marian/src/kernels/dropout.cu:22

Encode inference vs training in node types

This is being worked on in the Issue22 branch.

Segfault after the program runs for a while

I got a segfault after the code had run for a couple of hours. Not sure why. This is based on code pulled a week ago. I'm pulling the latest version to see if the issue is still there.

[2017-03-22 11:00:52] Ep. 4 : Up. 21900 : Sen. 281600 : Cost 27.39 : Time 49.85s : 2981.00 words/s
[2017-03-22 11:01:42] Ep. 4 : Up. 22000 : Sen. 288000 : Cost 26.35 : Time 49.64s : 2842.05 words/s
[2017-03-22 11:02:33] Ep. 4 : Up. 22100 : Sen. 294400 : Cost 27.27 : Time 50.95s : 2881.92 words/s
GPUassert: an illegal memory access was encountered /home/alfikri/ori/marian/src/tensors/tensor.cu 83
GPUassert: driver shutting down /home/alfikri/ori/marian/src/tensors/tensor.cu 46
GPUassert: driver shutting down /home/alfikri/ori/marian/src/tensors/tensor.cu 46
GPUassert: driver shutting down /home/alfikri/ori/marian/src/kernels/tensor_operators.cu 603
GPUassert: an illegal memory access was encountered /home/alfikri/ori/marian/src/kernels/tensor_operators.cu 608
*** Error in `./marian': double free or corruption (!prev): 0x000055821fb66320 ***
Segmentation fault (core dumped)

Create yaml vocabulary for each input file if none given

I think this should be an extension of the Vocab class.

Create yaml vocabulary for each input file filename if none given:

search for filename.{yml,json}, if exists load as vocab.

Else:

create a frequency list for that particular file,
sort by frequency
assign id from rank
create yml representation
dump representation to filename.yml
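
The "Else" branch can be sketched as follows. The function names are hypothetical, the output is a plain string instead of a yaml-cpp document, and no ids are reserved for special tokens, which a real vocabulary might need:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, std::size_t>;

// Count whitespace-separated words and return them ordered by rank:
// most frequent first, ties broken alphabetically for determinism.
std::vector<std::string> rankWords(std::istream& in) {
  std::map<std::string, std::size_t> counts;
  std::string word;
  while (in >> word) ++counts[word];

  std::vector<Entry> entries(counts.begin(), counts.end());
  std::sort(entries.begin(), entries.end(), [](const Entry& a, const Entry& b) {
    return a.second != b.second ? a.second > b.second : a.first < b.first;
  });

  std::vector<std::string> ranked;
  for (const Entry& e : entries) ranked.push_back(e.first);
  assert(ranked.size() == counts.size());
  return ranked;
}

// Emit one `"word": id` line per entry, where the id is the rank.
// Keys are double-quoted, with quotes and backslashes escaped.
std::string toYaml(const std::vector<std::string>& ranked) {
  std::ostringstream out;
  for (std::size_t id = 0; id < ranked.size(); ++id) {
    out << '"';
    for (char c : ranked[id]) {
      if (c == '"' || c == '\\') out << '\\';
      out << c;
    }
    out << "\": " << id << '\n';
  }
  return out.str();
}
```

For the input `b a b c b a`, the ranking is `b`, `a`, `c`, so the words receive ids 0, 1 and 2 respectively.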

Add --working-dir option

Add --working-dir option across all paths.

Logging not working properly (missing numbers)

It seems the new formatting options in spdlog are not working properly. Reports during training are missing information and are cut off at the end.

This is probably caused by commit 50ddd09.

Build is broken

[ 27%] Building NVCC (Device) object src/CMakeFiles/marian_lib.dir/marian_lib_generated_tensor.cu.o
/home/lanes/Marian/src/node_operators_unary.h(137): error: host or device annotation on lambda requires --expt-extended-lambda nvcc flag

/home/lanes/Marian/src/node_operators_unary.h(137): error: host or device annotation on lambda requires --expt-extended-lambda nvcc flag

/home/lanes/Marian/src/tensor_operators.h(88): error: The closure type for a lambda ("lambda [](float &, float)->float") cannot be used in the template argument type of a global function template instantiation, unless the lambda is defined within a device or global function, or the lambda is a 'extended lambda' and the flag --expt-extended-lambda is specified
detected during:
instantiation of "marian::gElement" based on template arguments <lambda [](float &, float)->float, marian::Tensor::TensorView, marian::Bernoulli>
(88): here
instantiation of "void marian::Element(Functor, T1, T2) [with Functor=lambda [](float &, float)->float, T1=marian::Tensor, T2=marian::Bernoulli]"
/home/lanes/Marian/src/node_operators_unary.h(140): here

/home/lanes/Marian/src/tensor_operators.h(88): error: A type defined inside a host function ("lambda [](float &, float)->float") cannot be used in the template argument type of a global function template instantiation
detected during:
instantiation of "marian::gElement" based on template arguments <lambda [](float &, float)->float, marian::Tensor::TensorView, marian::Bernoulli>
(88): here
instantiation of "void marian::Element(Functor, T1, T2) [with Functor=lambda [](float &, float)->float, T1=marian::Tensor, T2=marian::Bernoulli]"
/home/lanes/Marian/src/node_operators_unary.h(140): here

3 errors detected in the compilation of "/tmp/tmpxft_00006419_00000000-7_expression_operators.cpp1.ii".
CMake Error at marian_lib_generated_expression_operators.cu.o.cmake:266 (message):
Error generating file
/home/lanes/Marian/build/src/CMakeFiles/marian_lib.dir//./marian_lib_generated_expression_operators.cu.o

make[2]: *** [src/CMakeFiles/marian_lib.dir/marian_lib_generated_expression_operators.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
3 errors detected in the compilation of "/tmp/tmpxft_00006422_00000000-7_expression_graph.cpp1.ii".
CMake Error at marian_lib_generated_expression_graph.cu.o.cmake:266 (message):
Error generating file
/home/lanes/Marian/build/src/CMakeFiles/marian_lib.dir//./marian_lib_generated_expression_graph.cu.o

make[2]: *** [src/CMakeFiles/marian_lib.dir/marian_lib_generated_expression_graph.cu.o] Error 1
make[1]: *** [src/CMakeFiles/marian_lib.dir/all] Error 2
make: *** [all] Error 2

Consider renaming "stack" to "tape"

Naumann (2012) uses "tape" to refer to this data structure.

Given that we do not actually operate on this as a LIFO data structure, the term "stack" is misleading.

More comparisons with CuDNN (dropout, RNNs, ...)

Following issue #4, there are a number of other interesting layers that have been implemented in CuDNN.

We should evaluate and compare:

Activation layers
Dropout
RNNs (there are multiple variants, also bidirectional)
...

I guess the results might be similar, i.e. the forward steps will be faster than ours. We should then use those and keep our potentially faster backward steps.

The RNNs are particularly interesting, as this might save us quite a bit of headache; I just fear for the backward step. I am also not sure their implementation allows us to use conditional RNNs, which we need for attention. They do have separate inference and forward steps. This would be interesting for Amun as well.

The documentation for CuDNN can be downloaded after registration from: https://developer.nvidia.com/cudnn

Unit and regression tests

Should be able to tell if code is working without waiting an hour.

Unit tests: small graphs exercising only parts of the code?

Regression tests: save some checkpoints of small canned systems. Start from these checkpoints and verify the outcome is about the same.

Seed all random number generators

@AngusL points out that Marian is non-deterministic which makes regression tests hard (paging @afaji). Moreover https://github.com/amunmt/marian/blob/13ee36be159386f93259c3c6c8bfd80db5f52fff/src/training/config.cpp#L154 is misleading because it says "for all random number generators".

https://github.com/amunmt/marian/blob/9ce67850b80b0be5fb3d2a51c645b45ac170786d/src/data/batch_generator.h#L111 should change from the deprecated two-argument `random_shuffle` to the three-argument seeded version.

https://github.com/amunmt/marian/blob/13ee36be159386f93259c3c6c8bfd80db5f52fff/src/data/corpus.cpp#L44 does seed.
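
As a sketch of seeded shuffling, the example below uses `std::shuffle` with an explicitly seeded `std::mt19937` instead of the three-argument `random_shuffle` suggested above, since both `random_shuffle` overloads were deprecated in C++14 and removed in C++17. The helper name is hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: a permutation of 0..n-1 that depends only on `seed`.
std::vector<std::size_t> seededPermutation(std::size_t n, unsigned seed) {
  std::vector<std::size_t> ids(n);
  std::iota(ids.begin(), ids.end(), std::size_t{0});
  std::mt19937 gen(seed);  // explicitly seeded generator, no global state
  std::shuffle(ids.begin(), ids.end(), gen);
  return ids;
}
```

Two calls with the same seed return the same order within one build. The exact permutation produced by `std::shuffle` is not specified by the standard, so it may differ between standard library implementations.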
