marian-nmt / marian-dev
Fast Neural Machine Translation in C++ - development repository
Home Page: https://marian-nmt.github.io
License: Other
Pass "reporter" around to optimizers and models so that we can implement decaying learning rates and other decays, for instance those used for guided alignment.
OpenNMT allows multiplying the learning rate by a decay factor at each new epoch, or whenever the cost on the validation set does not improve by at least a given factor.
Maybe we should also rename "reporter" if we use it for things like that.
Options to be added would be, for example:
--learning-rate-decay 0.7 --start-decay-epoch 5
What happens at each epoch after epoch 5 is:
lr = lr * 0.7
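A minimal sketch of the schedule described above. The function name and its parameters are hypothetical and not part of Marian. It assumes epochs are counted from 1 and that the first multiplication happens at the epoch following the start epoch, which is one reading of "at each epoch after epoch 5".

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not Marian API): learning rate in effect during `epoch`.
// Assumption: epochs are 1-based and the rate is multiplied by `decay` once
// for every epoch that comes after `startDecayEpoch`.
double decayedLearningRate(double baseLr, double decay,
                           int startDecayEpoch, int epoch) {
  assert(epoch >= 1 && startDecayEpoch >= 1);
  // Number of decay steps applied so far; zero up to and including the start epoch.
  const int steps = epoch > startDecayEpoch ? epoch - startDecayEpoch : 0;
  return baseLr * std::pow(decay, steps);
}
```

With --learning-rate-decay 0.7 --start-decay-epoch 5 this leaves the rate untouched through epoch 5 and multiplies it by 0.7 once at epoch 6, twice by epoch 7, and so on.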
$ ./mnist_benchmark
Loading train set...terminate called after throwing an instance of 'std::runtime_error'
what(): Cannot open file ../examples/mnist/train-images-idx3-ubyte!
Aborted (core dumped)
Either these files should be included, or there should be documentation and/or a script to get them.
I ran marian with the parameter -d 0, but nvidia-smi shows that the code actually runs on GPU 3.
Tested on Azure.
alfikri@4gpu:~$ nvidia-smi
Thu Mar 9 20:30:50 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57 Driver Version: 367.57 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 Off | 8A24:00:00.0 Off | Off |
| N/A 40C P8 15W / 150W | 2MiB / 8123MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M60 Off | 92AA:00:00.0 Off | Off |
| N/A 49C P8 16W / 150W | 2MiB / 8123MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla M60 Off | AEE7:00:00.0 Off | Off |
| N/A 37C P8 14W / 150W | 2MiB / 8123MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla M60 Off | B710:00:00.0 Off | Off |
| N/A 60C P0 90W / 150W | 6404MiB / 8123MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 3 4778 C ./marian 6402MiB |
+-----------------------------------------------------------------------------+
Output configuration file for amun when training dl4mt model
Will this part cause a memory leak?
https://github.com/emjotde/Marian/blob/master/src/tensors/tensor.h#L40
Framework for validation every n steps. For instance, it should be possible to specify --metric perplexity bleu and it should then calculate these values on the dev set. It would be good to have this as a skeleton. How the individual scoring methods are implemented would be a separate issue.
Perplexity can be evaluated just by running forward steps on dev set batches and could be done as a first example of how to extend the skeleton. Validation results should be logged to separate files using a specific logger.
For BLEU we of course need a translator and a BLEU scorer.
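As a rough illustration of the perplexity case: once forward steps over the dev set have produced a cost per batch, the metric itself is a small reduction. The struct and function below are hypothetical names, and the sketch assumes each batch reports its cross-entropy summed over target words, in nats.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-batch result of a forward step on the dev set.
// Assumption: the cost is the summed cross-entropy in nats over all target words.
struct BatchCost {
  double sumCrossEntropy;
  std::size_t words;
};

// Perplexity = exp(total cross-entropy / total number of target words).
double perplexity(const std::vector<BatchCost>& batches) {
  double total = 0.0;
  std::size_t words = 0;
  for (const auto& b : batches) {
    total += b.sumCrossEntropy;
    words += b.words;
  }
  assert(words > 0);
  return std::exp(total / static_cast<double>(words));
}
```

Normalizing by the total word count rather than averaging per-batch perplexities keeps the result independent of how the dev set happens to be batched.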
src/3rd_party/reduce_all.h appears to derive from an NVIDIA CUDA SDK sample. This looks to be fine, but the EULA requires a statement: "This software contains source code provided by NVIDIA Corporation."
Working batch iterator for general data sets. Iterator operation "++" should execute a training step with a provided SGD variant.
Use-cases
A first simple regression test:
Download a specific WMT model, run marian_test on a mini training set and see if the gradients are still the same.
Create a kernel that uses curand (?) to quickly create random 0/1 or 0/p values (p would be the probability in scaling dropout). This should be roughly as fast as the cuDNN version without depending on cuDNN.
This issue only concerns the low-level kernel. We will see how to integrate this into the autodiff framework later.
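For reference, a CPU sketch of the values such a kernel would have to produce. This is not the GPU kernel and does not use curand; it swaps in std::mt19937 so the expected output can be checked on the host. The function name, the keepProb parameter and the choice of 1/keepProb as the scaled value are assumptions, since the issue leaves the exact scaling open.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical CPU reference for a dropout mask (not the curand kernel itself).
// Each entry is 0 with probability (1 - keepProb); otherwise it is 1, or
// 1 / keepProb when `scale` is set (assumed form of "scaling dropout").
std::vector<float> dropoutMask(std::size_t n, float keepProb, bool scale,
                               unsigned seed) {
  assert(keepProb > 0.0f && keepProb <= 1.0f);
  std::mt19937 gen(seed);
  std::bernoulli_distribution keep(keepProb);
  const float kept = scale ? 1.0f / keepProb : 1.0f;
  std::vector<float> mask(n);
  for (auto& m : mask)
    m = keep(gen) ? kept : 0.0f;
  return mask;
}
```

A GPU version could be compared against this reference statistically (fraction of zeros, value of the non-zero entries) rather than element by element, since the random streams will differ.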
It is worth storing the Marian version in model files. This will help in the future when a model is run with a different version of Marian than the one it was trained with.
Refactor the code to separate GPU code from CPU code as far as this is painlessly possible without major rewrites. One way would be to hide includes of CUDA code in the *.cu files that really require them. For instance, anything that includes ExpressionGraph currently needs to be a *.cu file when compiling into executable code. It would be good if it could remain a *.cpp file.
This would be a first step towards different back-ends which would require more extensive rewrites. This should be more of a clean-up.
CuDNN is supposed to have very efficient softmax implementations for both the forward and the backward step. It would be interesting to set up a benchmark and compare it to our own implementation on different (small and very large) matrix sizes.
Not sure, but CuDNN also seems to have a working log-softmax?
Following issue #5 it would be interesting to see if we can already reimplement a couple of well-known examples from other packages, for instance Keras?
Currently we only support 2D tensors. Some operations, like the attention mechanism, will require 3D tensors. Element-wise operations are easy, but proper broadcasting and reshaping is a challenge.
Tensor a({80, 200});
Tensor b({32,200});
z = reshape(a, {1,80,200}) + reshape(b, {32, 1, 200});
Something like the above should work and result in a tensor of shape {32, 80, 200}.
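The shape arithmetic behind that example can be sketched as follows. Shape and broadcastShape are illustrative names, not existing Marian types, and the rule assumed here is the usual one: two dimensions are compatible if they are equal or one of them is 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

using Shape = std::vector<int>;  // hypothetical stand-in for a tensor shape

// Result shape of an element-wise operation with broadcasting.
// Assumption: both shapes have the same rank (reshape first if they do not).
Shape broadcastShape(const Shape& a, const Shape& b) {
  if (a.size() != b.size())
    throw std::invalid_argument("shapes must have the same rank");
  Shape out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    assert(a[i] > 0 && b[i] > 0);
    // Compatible if equal, or if one side is 1 and gets repeated.
    if (a[i] != b[i] && a[i] != 1 && b[i] != 1)
      throw std::invalid_argument("incompatible dimension");
    out[i] = std::max(a[i], b[i]);
  }
  return out;
}
```

For the shapes in the example, {1, 80, 200} and {32, 1, 200}, this yields {32, 80, 200}.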
Under CentOS 7.3.1611 with
I had to make two changes to get marian to compile:
I had to add this line near the top of CMakeLists.txt:
set(THREADS_PREFER_PTHREAD_FLAG TRUE)
I had to change Ptr<Reporter> reporter_ (in src/training/graph_group.h) to be public in order to get the code to compile. (Obviously there is a more principled way to do this, but this at least gets things running.)
On gna:
export LD_LIBRARY_PATH="/usr/local/cuda-8.0/lib64"
export LIBRARY_PATH="/usr/local/cuda-8.0/lib64"
export PATH="/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/cuda-8.0/bin:/usr/local/sbin"
cd /marian/base/directory
rm -rf build
mkdir build
cd build
cmake ..
make -j
=> FAIL, because cmake finds the CUDA-7.0 library and is happy with it
export PATH="/usr/local/cuda-8.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
cd ..
rm -rf build/*
cd build
cmake ..
make -j
=> SUCCESS
Keeping all parameters in a single GPU matrix should significantly improve the performance of all SGD variants, as they loop through all parameters when applying updates. All updates are element-wise and identical for all parameters, so they could be performed in a single step, maximizing GPU saturation.
Reimplement the Nematus NMT model in Marian; this should also allow reading in available models. Once done, benchmarking etc.
We have synchronous and async SGD now. Currently all gradient updates are communicated to the central parameters. There would be two (orthogonal) ways to handle that: Alham's work and introducing a delayed update by tau steps. The second option requires some accumulation and possibly choosing optimizers per worker below the main SGD variant (Adagrad, Adam). Currently on the Titan X scaling is sublinear for a typical WMT-grade DL4MT model, though not tragically so:
1 GPU: 4100 w/s
2 GPU: 6500 w/s
3 GPU: 9200 w/s
Would be great to get to >12000 w/s on 3 GPUs.
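One way the accumulation for a delayed update could look on the worker side, as a sketch only. The class and its interface are hypothetical and say nothing about how Marian's graph groups are actually structured; it simply sums local gradients and signals when tau of them are ready to be communicated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical worker-side buffer: sums gradients locally and only hands
// them over to the central parameters every `tau` steps.
class DelayedGradient {
public:
  DelayedGradient(std::size_t size, std::size_t tau)
      : sum_(size, 0.0f), tau_(tau) {
    assert(tau >= 1);
  }

  // Adds one local gradient; returns true once tau gradients are accumulated.
  bool accumulate(const std::vector<float>& grad) {
    assert(grad.size() == sum_.size());
    for (std::size_t i = 0; i < grad.size(); ++i)
      sum_[i] += grad[i];
    return ++count_ == tau_;
  }

  // Returns the accumulated sum and resets the buffer for the next round.
  std::vector<float> flush() {
    std::vector<float> out(sum_.size(), 0.0f);
    out.swap(sum_);
    count_ = 0;
    return out;
  }

private:
  std::vector<float> sum_;
  std::size_t tau_;
  std::size_t count_ = 0;
};
```

With tau = 1 this degenerates to the current behaviour of communicating every gradient. Whether the pushed value should be the sum or the average, and which optimizer runs per worker, is left open here as it is in the issue.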
The current implementation does not support forking graphs out of the box. We need to either add this support or temporarily recognize and prevent forks.
On zisa:
Thread 1 "tensor_test" received signal SIGSEGV, Segmentation fault.
spdlog::logger::_log_if_enabled (lvl=spdlog::level::info, this=0x0)
at /home/heafield/marian/src/3rd_party/spdlog/details/logger_impl.h:68
68 return details::line_logger(this, lvl, should_log(lvl));
(gdb) bt
#0 spdlog::logger::_log_if_enabled (lvl=spdlog::level::info, this=0x0) at /home/heafield/marian/src/3rd_party/spdlog/details/logger_impl.h:68
#1 spdlog::logger::info (this=<optimized out>) at /home/heafield/marian/src/3rd_party/spdlog/details/logger_impl.h:213
#2 marian::TensorAllocator::reserve (this=this@entry=0x73b2f0, elements=<optimized out>)
at /home/heafield/marian/src/tensors/tensor_allocator.h:99
#3 0x0000000000413001 in marian::TensorAllocator::checkSpace (
shape=<error reading variable: access outside bounds of object referenced via synthetic pointer>, this=0x73b2f0)
at /home/heafield/marian/src/tensors/tensor_allocator.h:84
#4 marian::TensorAllocator::allocate (this=0x73b2f0, t=std::shared_ptr (empty) 0x0, shape=...)
at /home/heafield/marian/src/tensors/tensor_allocator.h:127
#5 0x00000000004053a6 in main () at /home/heafield/marian/src/test/tensor_test.cu:21
Logger isn't initialized properly?
Future work: add tensors with underlying integer type. Currently only floats are possible.
The term "node" is overloaded.
In the context of neural networks, the term node encompasses input nodes, neurons, and output nodes. In the context of algorithmic differentiation (AD), the term node refers to a node in the computation graph.
Given that we are using AD in the context of neural networks, there is a motivation to avoid conflating the two meanings, and for making very clear which is being described.
To that end, one possible solution would be to always use the term "vertex" rather than "node" when referring to the items in a computation graph.
Likewise, given that some computation graph vertices refer to an entire layer of neural network nodes, it may be appropriate to consider naming such vertices explicitly with names such as InputLayer or OutputLayer.
A shallow merge for now. Amun and Marian communicate through the model file, which should be ok as long as we do not have a Marian scorer in Amun.
What should we do with the repos? New repo? New organization? New name?
Currently all wrappers around CuDNN functions (softmax, dropout) are located in tensor_operators.*, they should be moved to a separate file. Also this requires some clever handling of cudnnHandle_t and similarly cublasHandle_t. Currently there are only single static instances of each one.
https://github.com/amunmt/marian/blob/master/src/training/graph_group.h line 73
Should we put cudaStreamSynchronize(0) here? I believe CUDA operations do not block the CPU, so the pushGradient function will finish (and release the lock) even though the parameter update is not complete yet.
Save configuration with model (model.npz and model.npz.yml)
Orthogonal initialization of weights. The easiest way to do this is probably on the CPU and then copy to GPU. This can be achieved with any LAPACK library, but there is also CULA, a CUDA LAPACK library.
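A dependency-free CPU sketch of the idea, assuming a square weight matrix. It does not use LAPACK or CULA; instead it orthonormalizes the rows of a random Gaussian matrix with modified Gram-Schmidt, which is a substitute technique chosen to keep the example self-contained. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: row-major n x n matrix with orthonormal rows, built by
// modified Gram-Schmidt on a Gaussian random matrix (not a LAPACK QR/SVD).
std::vector<float> orthogonalInit(std::size_t n, unsigned seed) {
  std::mt19937 gen(seed);
  std::normal_distribution<double> gauss(0.0, 1.0);
  std::vector<double> m(n * n);
  for (auto& v : m) v = gauss(gen);
  for (std::size_t i = 0; i < n; ++i) {
    double* row = &m[i * n];
    // Remove the components along all previously finished rows.
    for (std::size_t j = 0; j < i; ++j) {
      const double* prev = &m[j * n];
      double dot = 0.0;
      for (std::size_t k = 0; k < n; ++k) dot += row[k] * prev[k];
      for (std::size_t k = 0; k < n; ++k) row[k] -= dot * prev[k];
    }
    double norm = 0.0;
    for (std::size_t k = 0; k < n; ++k) norm += row[k] * row[k];
    norm = std::sqrt(norm);
    assert(norm > 0.0);
    for (std::size_t k = 0; k < n; ++k) row[k] /= norm;
  }
  return std::vector<float>(m.begin(), m.end());
}

// Largest absolute entry of W * W^T - I, useful as a sanity check.
double maxOrthogonalityError(const std::vector<float>& w, std::size_t n) {
  double worst = 0.0;
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j) {
      double dot = 0.0;
      for (std::size_t k = 0; k < n; ++k)
        dot += static_cast<double>(w[i * n + k]) * w[j * n + k];
      worst = std::max(worst, std::fabs(dot - (i == j ? 1.0 : 0.0)));
    }
  return worst;
}
```

The resulting buffer would then be copied to the GPU as the issue suggests. For large matrices a LAPACK-based QR or SVD is likely the better choice numerically.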
Graph should probably have a function that calls gradient checks for all nodes.
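The per-node check could be built on a central finite-difference comparison like the one below. This is a generic sketch over a scalar function of a plain vector; the function name and signature are hypothetical and independent of Marian's graph and node classes.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: largest absolute difference between an analytic
// gradient and central finite differences of f at x.
double maxGradientError(
    const std::function<double(const std::vector<double>&)>& f,
    std::vector<double> x,  // taken by value so it can be perturbed
    const std::vector<double>& analyticGrad,
    double eps = 1e-5) {
  assert(x.size() == analyticGrad.size());
  double worst = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double saved = x[i];
    x[i] = saved + eps;
    const double plus = f(x);
    x[i] = saved - eps;
    const double minus = f(x);
    x[i] = saved;  // restore before moving to the next coordinate
    const double numeric = (plus - minus) / (2.0 * eps);
    worst = std::max(worst, std::fabs(numeric - analyticGrad[i]));
  }
  return worst;
}
```

A graph-level function would run this for every node's inputs and report the nodes whose error exceeds a tolerance. With float tensors the step size and tolerance would need to be considerably larger than the double-precision defaults used here.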
The dropout operation is currently inefficient and probably broken. We need to learn how random number generators work on the GPU.
Not very important, but before we get bitten by it: Marian currently creates the .shuf files in the same directory as the training data files it is given.
Investigate why vocabulary creation is slow, probably a problem in yaml-cpp
"result": shape=4x1
0.995334
0.00920105
0.995334
0.000189304
"softmax": shape=4x1
1
1
1
1
... building fails if it isn't.
Graphs that are unrolled for different batch size when they contain RNNs should share the entire memory in order to avoid allocation. Parameters should be protected, the rest could be overwritten.
It probably makes sense to change the Tensor class to not hold its own memory; rather, it should be a view on a segment of the memory the graph holds. This should be solved together with issue #9.
When running the MNIST example with ReLU for more than 25 epochs, NaN values appear. This seems to be a known issue with ReLU and cross-entropy, and the suggested solution is some sort of clipping. This requires more investigation.
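One form the clipping could take, as a sketch: bound the predicted probabilities away from zero before taking the logarithm. The function name and the default epsilon are assumptions, and whether this is the right fix for the MNIST case is exactly what still needs investigating.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: cross-entropy -sum_i y_i * log(p_i), with p_i clipped
// to [eps, 1] so that log(0) can never produce inf or NaN.
float clippedCrossEntropy(const std::vector<float>& probs,
                          const std::vector<float>& targets,
                          float eps = 1e-7f) {
  assert(probs.size() == targets.size());
  float cost = 0.0f;
  for (std::size_t i = 0; i < probs.size(); ++i) {
    const float p = std::min(std::max(probs[i], eps), 1.0f);
    cost -= targets[i] * std::log(p);
  }
  return cost;
}
```

Without the clipping, a predicted probability of exactly 0 for the correct class gives an infinite cost, and 0 * log(0) terms give NaN. An alternative that avoids the issue at the source is to compute log-softmax directly instead of taking the log of a softmax output.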
Binarize graphs and parameters with boost::serialization. Values and adjoints of all nodes can be ignored apart from ParamNodeOp. Some nodes have special parameters, like "DropoutNodeOp", that also need to be serialized.
From current master, seems to break no matter what I do (on Azure):
/mt/marian/build/marian -t data/corpus.bpe.ro data/corpus.bpe.en -d 0
[2017-03-27 16:43:16] [config] after-batches: 0
[2017-03-27 16:43:16] [config] after-epochs: 0
[2017-03-27 16:43:16] [config] clip-norm: 1
[2017-03-27 16:43:16] [config] devices:
[2017-03-27 16:43:16] [config] - 0
[2017-03-27 16:43:16] [config] dim-emb: 512
[2017-03-27 16:43:16] [config] dim-rnn: 1024
[2017-03-27 16:43:16] [config] dim-vocabs:
[2017-03-27 16:43:16] [config] - 50000
[2017-03-27 16:43:16] [config] - 50000
[2017-03-27 16:43:16] [config] disp-freq: 1000
[2017-03-27 16:43:16] [config] dropout-rnn: 0
[2017-03-27 16:43:16] [config] dropout-src: 0
[2017-03-27 16:43:16] [config] dropout-trg: 0
[2017-03-27 16:43:16] [config] early-stopping: 10
[2017-03-27 16:43:16] [config] layer-normalization: false
[2017-03-27 16:43:16] [config] layers-dec: 1
[2017-03-27 16:43:16] [config] layers-enc: 1
[2017-03-27 16:43:16] [config] learn-rate: 0.0001
[2017-03-27 16:43:16] [config] max-length: 50
[2017-03-27 16:43:16] [config] maxi-batch: 100
[2017-03-27 16:43:16] [config] mini-batch: 64
[2017-03-27 16:43:16] [config] model: model.npz
[2017-03-27 16:43:16] [config] no-reload: false
[2017-03-27 16:43:16] [config] no-shuffle: false
[2017-03-27 16:43:16] [config] optimizer: adam
[2017-03-27 16:43:16] [config] overwrite: false
[2017-03-27 16:43:16] [config] relative-paths: false
[2017-03-27 16:43:16] [config] save-freq: 10000
[2017-03-27 16:43:16] [config] seed: 1234
[2017-03-27 16:43:16] [config] skip: false
[2017-03-27 16:43:16] [config] train-sets:
[2017-03-27 16:43:16] [config] - data/corpus.bpe.ro
[2017-03-27 16:43:16] [config] - data/corpus.bpe.en
[2017-03-27 16:43:16] [config] type: dl4mt
[2017-03-27 16:43:16] [config] valid-freq: 10000
[2017-03-27 16:43:16] [config] valid-metrics:
[2017-03-27 16:43:16] [config] - cross-entropy
[2017-03-27 16:43:16] [config] workspace: 2048
[2017-03-27 16:43:16] [data] Loading vocabulary from data/corpus.bpe.ro.json (max: 50000)
[2017-03-27 16:43:16] [data] Loading vocabulary from data/corpus.bpe.en.json (max: 50000)
Error at /mt/marian/src/kernels/dropout.cu:22
This is being worked on in the Issue22 branch.
Hi
I got a segfault after the code had been running for a couple of hours. Not sure why. This is based on code pulled a week ago. I'm pulling the new version to see if the issue is still there.
[2017-03-22 11:00:52] Ep. 4 : Up. 21900 : Sen. 281600 : Cost 27.39 : Time 49.85s : 2981.00 words/s
[2017-03-22 11:01:42] Ep. 4 : Up. 22000 : Sen. 288000 : Cost 26.35 : Time 49.64s : 2842.05 words/s
[2017-03-22 11:02:33] Ep. 4 : Up. 22100 : Sen. 294400 : Cost 27.27 : Time 50.95s : 2881.92 words/s
GPUassert: an illegal memory access was encountered /home/alfikri/ori/marian/src/tensors/tensor.cu 83
GPUassert: driver shutting down /home/alfikri/ori/marian/src/tensors/tensor.cu 46
GPUassert: driver shutting down /home/alfikri/ori/marian/src/tensors/tensor.cu 46
GPUassert: driver shutting down /home/alfikri/ori/marian/src/kernels/tensor_operators.cu 603
GPUassert: an illegal memory access was encountered /home/alfikri/ori/marian/src/kernels/tensor_operators.cu 608
*** Error in `./marian': double free or corruption (!prev): 0x000055821fb66320 ***
Segmentation fault (core dumped)
I think this should be an extension of the Vocab class.
Create a yaml vocabulary for each input file filename if none is given:
filename.{yml,json}, if it exists, load as vocab. Else: filename.yml
Add --working-dir option across all paths.
It seems the new formatting options in spdlog are not working properly. Reports during training are missing information and are cut off at the end.
Should be caused by 50ddd09
[ 27%] Building NVCC (Device) object src/CMakeFiles/marian_lib.dir/marian_lib_generated_tensor.cu.o
/home/lanes/Marian/src/node_operators_unary.h(137): error: host or device annotation on lambda requires --expt-extended-lambda nvcc flag
/home/lanes/Marian/src/node_operators_unary.h(137): error: host or device annotation on lambda requires --expt-extended-lambda nvcc flag
/home/lanes/Marian/src/tensor_operators.h(88): error: The closure type for a lambda ("lambda [](float &, float)->float") cannot be used in the template argument type of a global function template instantiation, unless the lambda is defined within a device or global function, or the lambda is a 'extended lambda' and the flag --expt-extended-lambda is specified
detected during:
instantiation of "marian::gElement" based on template arguments <lambda [](float &, float)->float, marian::Tensor::TensorView, marian::Bernoulli>
(88): here
instantiation of "void marian::Element(Functor, T1, T2) [with Functor=lambda [](float &, float)->float, T1=marian::Tensor, T2=marian::Bernoulli]"
/home/lanes/Marian/src/node_operators_unary.h(140): here
/home/lanes/Marian/src/tensor_operators.h(88): error: A type defined inside a host function ("lambda [](float &, float)->float") cannot be used in the template argument type of a global function template instantiation
detected during:
instantiation of "marian::gElement" based on template arguments <lambda [](float &, float)->float, marian::Tensor::TensorView, marian::Bernoulli>
(88): here
instantiation of "void marian::Element(Functor, T1, T2) [with Functor=lambda [](float &, float)->float, T1=marian::Tensor, T2=marian::Bernoulli]"
/home/lanes/Marian/src/node_operators_unary.h(140): here
/home/lanes/Marian/src/tensor_operators.h(88): error: The closure type for a lambda ("lambda [](float &, float)->float") cannot be used in the template argument type of a global function template instantiation, unless the lambda is defined within a device or global function, or the lambda is a 'extended lambda' and the flag --expt-extended-lambda is specified
detected during:
instantiation of "marian::gElement" based on template arguments <lambda [](float &, float)->float, marian::Tensor::TensorView, marian::Bernoulli>
(88): here
instantiation of "void marian::Element(Functor, T1, T2) [with Functor=lambda [](float &, float)->float, T1=marian::Tensor, T2=marian::Bernoulli]"
/home/lanes/Marian/src/node_operators_unary.h(140): here
/home/lanes/Marian/src/tensor_operators.h(88): error: A type defined inside a host function ("lambda [](float &, float)->float") cannot be used in the template argument type of a global function template instantiation
detected during:
instantiation of "marian::gElement" based on template arguments <lambda [](float &, float)->float, marian::Tensor::TensorView, marian::Bernoulli>
(88): here
instantiation of "void marian::Element(Functor, T1, T2) [with Functor=lambda [](float &, float)->float, T1=marian::Tensor, T2=marian::Bernoulli]"
/home/lanes/Marian/src/node_operators_unary.h(140): here
3 errors detected in the compilation of "/tmp/tmpxft_00006419_00000000-7_expression_operators.cpp1.ii".
CMake Error at marian_lib_generated_expression_operators.cu.o.cmake:266 (message):
Error generating file
/home/lanes/Marian/build/src/CMakeFiles/marian_lib.dir//./marian_lib_generated_expression_operators.cu.o
make[2]: *** [src/CMakeFiles/marian_lib.dir/marian_lib_generated_expression_operators.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
3 errors detected in the compilation of "/tmp/tmpxft_00006422_00000000-7_expression_graph.cpp1.ii".
CMake Error at marian_lib_generated_expression_graph.cu.o.cmake:266 (message):
Error generating file
/home/lanes/Marian/build/src/CMakeFiles/marian_lib.dir//./marian_lib_generated_expression_graph.cu.o
make[2]: *** [src/CMakeFiles/marian_lib.dir/marian_lib_generated_expression_graph.cu.o] Error 1
make[1]: *** [src/CMakeFiles/marian_lib.dir/all] Error 2
make: *** [all] Error 2
Naumann (2012) uses "tape" to refer to this data structure.
Given that we do not actually operate on this as a LIFO data structure, the term "stack" is misleading.
Following issue #4, there are a number of other interesting layers that have been implemented in CuDNN.
We should evaluate and compare:
I guess the results might be similar, i.e. the forward steps will be faster than ours. We should then use those and keep our potentially faster backward steps.
The RNNs are particularly interesting, as this might save us quite a bit of headache; I just fear for the backward step. I am also not sure their implementation allows us to use conditional RNNs, which we need for attention. They do have separate inference and forward steps. This would be interesting for Amun as well.
The documentation for CuDNN can be downloaded after registration from: https://developer.nvidia.com/cudnn
Should be able to tell if code is working without waiting an hour.
Unit tests: small graphs exercising only parts of the code?
Regression tests: save some checkpoints of small canned systems. Start from these checkpoints and verify the outcome is about the same.
@AngusL points out that Marian is non-deterministic which makes regression tests hard (paging @afaji). Moreover https://github.com/amunmt/marian/blob/13ee36be159386f93259c3c6c8bfd80db5f52fff/src/training/config.cpp#L154 is misleading because it says "for all random number generators".
https://github.com/amunmt/marian/blob/9ce67850b80b0be5fb3d2a51c645b45ac170786d/src/data/batch_generator.h#L111 should change from the deprecated two-argument `random_shuffle' to the three-argument seeded version.
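A sketch of what a seeded shuffle could look like. It uses std::shuffle with an explicitly seeded std::mt19937, which is the C++11 replacement for random_shuffle; the function name is hypothetical and the code is not taken from batch_generator.h.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: a permutation of 0..n-1 that depends only on `seed`.
std::vector<std::size_t> shuffledOrder(std::size_t n, unsigned seed) {
  std::vector<std::size_t> order(n);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::mt19937 gen(seed);  // explicitly seeded engine instead of global rand()
  std::shuffle(order.begin(), order.end(), gen);
  return order;
}
```

The same seed gives the same order on repeated runs with the same standard library. The exact permutation is not guaranteed to match across different standard library implementations, which matters if regression checkpoints are shared between machines.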