dmlc / mshadow Goto Github PK

Matrix Shadow:Lightweight CPU/GPU Matrix and Tensor Template Library in C++/CUDA for (Deep) Machine Learning

License: Other

Makefile 2.03% C++ 80.48% Cuda 12.16% Shell 0.23% CMake 4.88% C 0.22%

mshadow's Introduction

mshadow: Matrix Shadow

This code base has been donated to the Apache MXNet project per #373, and repo is deprecated. Future development should continue in Apache MXNet.

MShadow is a lightweight CPU/GPU Matrix/Tensor Template Library in C++/CUDA. The goal of mshadow is to support efficient, device invariant and simple tensor library for machine learning project that aims for maximum performance and control, while also emphasize simplicity.

MShadow also provides interface that allows writing Multi-GPU and distributed deep learning programs in an easy and unified way.

Features

Efficient: all the expression you write will be lazily evaluated and compiled into optimized code
- No temporal memory allocation will happen for expression you write
- mshadow will generate specific kernel for every expression you write in compile time.
Device invariant: you can write one code and it will run on both CPU and GPU
Simple: mshadow allows you to write machine learning code using expressions.
Whitebox: put a float* into the Tensor struct and take the benefit of the package, no memory allocation is happened unless explicitly called
Lightweight library: light amount of code to support frequently used functions in machine learning
Extendable: user can write simple functions that plugs into mshadow and run on GPU/CPU, no experience in CUDA is required.
MultiGPU and Distributed ML: mshadow-ps interface allows user to write efficient MultiGPU and distributed programs in an unified way.

Version

This version mshadow-2.x, there are a lot of changes in the interface and it is not backward compatible with mshadow-1.0
- If you use older version of cxxnet, you will need to use the legacy mshadow code
For legacy code, refer to Here
Change log in CHANGES.md

Projects Using MShadow

mshadow's People

Contributors

Stargazers

Watchers

Forkers

wachaong erkang hezila cnanlmlin fairymane ty01csbaidu lqniunjunlper jchjava popfido alienfeel kennydreame stoictraveler phecy ljwsummer lcode zhimingz he-yunlong vlsi1217 yanshanjing xuedx derekrose ericchanbd fanfannothing sijinli licstar jywang zhuxiaoqiang chagge yiiwood niuzhiheng spideryan wangdongfrank reyoung zhaowei-rs jack-1800 maydaygmail qhduan shaoguangcheng pl8787 putaozhuose dreadlord1984 wush978 starsnet83 ktoc fdoperezi mapleyustat winstywang narayana1208 timwee sinzero gumpu shannonyu njuhugn chonglinsun chris1201 mrgloom billy-inn dfrsg nksg bpiwowar hubberwisdom zencoding minghuam quantscientist3 zxsted datapython fireae hknerdgn draculaiii chhshen redsuncmx wyvern92 wl-gao kuyun-zhangyang yehao tongming beronx86 seaslee sigmaquan yt752 wuntoguo harveyliufly smalliao uwroute thomasdic2000 xdwangh orangelpai superchao1982 yifangxu alaxwang soledad89 ericxsun piotr-teterwak codenfish turi-code liyanghua wycharry nagyistoce sonach dengcy028

mshadow's Issues

'sum_rows' error, when the dimension of tensor is (4,1).

This error will occur when assignment operator execution, and the error line seems in this line. The cuda get no error, and posix error string is 'File exists'.

I test this simple program in cuda 5.5 and cuda 6, and they are both error.

inline void onExitPrintError(){
    cudaError_t err = cudaGetLastError();
    if(err != cudaSuccess)
    {
        // print the CUDA error message and exit
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    }
    printf("Posix errno %s\n",strerror(errno));

}

int main(){
    InitTensorEngine(1);
    atexit(onExitPrintError);
    TensorContainer<cpu, 2> a;
    a.Resize(Shape2(4,1));

    a[0][0] = 0.0f;
    a[1][0] = 1.0f;
    a[2][0] = 1.0f;
    a[3][0] = 0.0f;

    TensorContainer<gpu, 2> gpu_a;
    gpu_a.Resize(Shape2(4,1));
    Copy(gpu_a,a);


    TensorContainer<gpu, 1> b;
    b.Resize(Shape1(1));

    b = sum_rows(gpu_a);

    TensorContainer<cpu, 1> c;
    c.Resize(b.shape);
    Copy(c,b);
    for(int i=0;i<c.shape[0];++i){
        cout<< c[i]<<endl;
    }

    ShutdownTensorEngine();
    return 0;
}

sorry wrong position

sorry wrong position~

stream_gpu-inl.h:65: Default GPU stream was used when MSHADOW_FORCE_STREAM was on

Hi,
is there an idea why i'm getting this error when i attempt to run the binary from _basic_streams.cu_

stream_gpu-inl.h:65: Default GPU stream was used when MSHADOW_FORCE_STREAM was on

any tips and advice is appreciated.

Is Mshadow Cpp_version Theano?

I am currently using Mshadow to develop some NN learning tools from scratch.

And I wonder can I use Mshadow as a CPP Version of Theano?

I mean, nearly all the things that Theano is able to do, Mshadow can do them (faster on running and slower on coding) too in a similar way.

Is that correct?

3D Tensor operations?

Is there operation for 3D tensor( Tensor<cpu,3>) ?

For instance I would like to take one 2D tensor and dot product with the first two dimensions of a 3D tensor:
Tensor<cpu,2> 2d_ten(Shape2(100,50));
Tensor<cpu,3> 3d_ten(Shape3(50,30,20));

someop(2d_ten, 3d_ten);

I would like to get the result which is a 3D Tensor with shape = (100,30,20)

Thanks.

[Discussion] OpenCL support

I notice Android supports OpenCL, which will bring great benefit.

Let's discuss how much work need to be done to make mshadow support OpenCL

@mli @tqchen @piiswrong

question about reshape a tensor

hi,

I've got some problem when I try to reshape a mat from (1,2) to (2,1) , my code is something like this:

Tensor<cpu, 2> mat1 = NewTensor<cpu, float>(Shape2(1,2), 1.0);
//reshape mat from 1*2  to 2*1
Tensor<cpu, 2> mat2(Shape2(2,1));
mat2 = reshape(mat1, mat2.shape_);

I got segmentation falt while doing this. I think it's because I didn't allocate memory for mat2, but in my understanding, reshape operation need no extra memory for handling data since mat2 just share data with mat1. I wonder if its my misunderstanding or just because I've done something wrong.

thanks a lot.

Confusion about stream

Hey, I'm working on editingbackward andforward functions of guide/convnet.cu
As I writing the functions, I notice that the initial version of the .cu has two temp TensorContainers : tmp_col and tmp_dst. In my completion, I need more temp Containers, so I just imitate the code and add more TensorContainers like TensorContainer<xpu, 4, real_t> tmp41, tmp42, tmp43, tmp44; , and in the init-function set their stream.
But when I use them in the forward and bacward , I just can't use -gpu option, the output is

Default GPU stream was used when MSHADOW_FORCE_STREAM was on

In the function I Resize the temp Containers and assign them to some return value like swapaxis or unpack_patch2col, I debugged the process, and find when I assign them to a return value, the error happens.

And , I'm little confused about stream. It is used to store the operation process, right? My concept about stream is poor.
@antinucleon @tqchen
Thank you.

Question about `broadcast_with_axis`

Why broadcast_with_axis is designed to create a tensor of ndim+1? This is a little bit annoying sometimes. For example, if I need to broadcast a [1, 200] tensor to a [100, 200] tensor. I need to first broadcast it to [1, 100, 200] tensor then do a reshape to get rid of the redundant dim. Is there any special concern for this design? I think maybe a keepdim broadcasting will be more convenient (i.e, [1, 200] directly to [100, 200]).

A single space, a BUG?

Check the code here,

https://github.com/dmlc/mshadow/blob/master/mshadow/tensor_blob.h#L320

is different from code here,

https://github.com/dmlc/mshadow/blob/master/mshadow/tensor_cpu-inl.h#L40

there is a small space after the comma.
I think that will give rise to some troubles.

For example, if I use:

TShape shape=mshadow::Shape2(5,5);
std::stringstream ss;
string s;
ss<<shape;
ss>>s;
//s is (5,

Maximum element in Tensor

How to get the maximum element in Tensor?

Just a typo

In https://github.com/dmlc/mshadow/tree/master/guide
"for (index_t j = 0; j < ts.size(1), ++j)"
should be
"for (index_t j = 0; j < ts.size(1); ++j)".

Multiple GPU support

I do not know if mshadow could support multiple GPU now.
Instead of pre-define the id of device, it could also be templated ( with a default device number) to enable Tensor on multiple GPUs.

Sum all

Could there be a sum all function that can give the sum of elements in tensor? Or I missed it? It is useful for MSE calculations. Otherwise loops are inevitable like follows, which may be inefficient:

    tpv=sumall_except_dim<0>(F<Abs>(V));
    tpv2=sumall_except_dim<0>(F<sme>(V));
    for(int i=0;i<V.shape.shape_[0];i++){
        err+=tpv[i];
        err2+=tpv2[i];
    }

Record locations of pooling region.

Could we remember the locations of pooling operation such max pooling. So that we could reverse the pooling operation approximately.

../mshadow/./base.h:145:20: fatal error: cuda.h: Filr or directory doesn't exist

When compiling the simple example basic.cpp:
marco@pc:~/mshadow/prove$ g++ -std=c++11 -DUSE_BLAS=openblas -I .. basic.cpp -obasic
In file included from ../mshadow/tensor.h:16:0,
from basic.cpp:2:
../mshadow/./base.h:145:20: fatal error: cuda.h: File o directory non esistente
#include <cuda.h>
^
compilation terminated.

What to do?

Makefile error

The rule for $(BIN) is incorrect in the Makefile: the order of -o and LDFLAGS needs to be reversed, otherwise you get an error about unresolved symbols. That is, the target should be $(CXX) $(CFLAGS) -o $@ $(filter %.cpp %.o %.c, $^) $(LDFLAGS)

opencl support

Could it be possbile to add opencl support using clBLAS[¹]?

[¹] https://github.com/clMathLibraries/clBLAS

How to assign one tensorContainer to a bigger one?

Say I have a TensorContainer C1 whose shape is (2, 3), and another TensorContainer C2 whose shape is (4,3) (bigger), can I assign C1 to C2's particular highest dimensions?
I tried C2.Slice(1,3) = C1, which proved vain.
@antinucleon @tqchen

matrix transpose

Is there any operation that can deal with matrix transpose efficiently? thanks!

Potential cuda kernel lauching problem of `reduce_except_dim<0,..>` operator

I find that if we use reduce_except_dim<0,..>, we will ultimately call MapReduceKeepDim1(https://github.com/dmlc/mshadow/blob/master/mshadow/tensor_gpu-inl.h#L153-L155), which may have problems for large matrix. In fact, in the implementation of MapReduceKeepDim1, the dimGrid is set directly as p[1] (https://github.com/dmlc/mshadow/blob/master/mshadow/cuda/tensor_gpu-inl.cuh#L183-L184), which may exceed the boundary of 65536(https://github.com/dmlc/mshadow/blob/master/mshadow/cuda/tensor_gpu-inl.cuh#L45).

This problem does not exist for MapReduceKeepLowest, which has used MemUnits to set kernel launching parameters https://github.com/dmlc/mshadow/blob/master/mshadow/cuda/tensor_gpu-inl.cuh#L142-L143. Should we change the implementation of MapReduceKeepDim1 to be similar to MapReduceKeepLowest in the future?

Broadcasting along multiple axis

Sometimes we need to broadcast along some axis given in an std::vector -> broadcast_with_multi_axis. I'm not sure how to code this concisely for Tensor with arbitrary ndim, since we may not know the number of broadcasted axises in prior. One idea I have is to constrain the maximum axis of the broadcast_with_multi_axis to a large number like 5. Is this solution acceptable?

Is there a way to convert a 1x1 2D Tensor to scalar?

Hi,
My code looks something like this which is computing the square error of NN.

real_t = (sumall_except_dim<2>(sum_rows(F(outgrad)) / (real_t) pred.shape[1];

however, this seems not working with the following error msg.

error: cannot convert 'mshadow::expr::BinaryMapExp<mshadow::op::div, mshadow::expr::ReduceTo1DExp<mshadow::expr::ReduceTo1DExp<mshadow::expr::UnaryMapExp<square, mshadow::Tensor<mshadow::cpu, 2>, 1>, mshadow::red::sum, 0>, mshadow::red::sum, 2>, mshadow::expr::ScalarExp, 3>' to 'mshadow::real_t {aka float}' in initialization
real_t err = (sumall_except_dim<2>(sum_rows(F(outgrad))) / (real_t)pred.shape[1]);

which basically says the type is different.

I've also tried

(sumall_except_dim<2>(sum_rows(F(outgrad))[0]

but it says there is no operator[] defined on such data type.
It seems that I need to explicit go over the tensor and sum the result(which could be slow compared to optimized vectorized code)?

Thank you.

Why the default GPU stream was used?

When I compile and run the "mshadow/test/test.cu", it tells me:
"Default GPU stream was used when MSHADOW_FORCE_STREAM was on".
I don't know why the stream was used and how could I make it unused? I just turn off MSHADOW_FORCE_STREAM now. Doing so will make the performance worse? Thanks.

Use GPU To Random a tensor when shape is odd error

Error Code Shows like that

InitTensorEngine(0);
TensorContainer<gpu,1> tmp;
tmp.Resize(Shape1(1));  // 1 Or any odd value
Random<mshadow::gpu> rnd(0);
rnd.SampleGaussian(tmp, 0, 0.1);
ShutdownTensorEngine();

The error code is CURAND_STATUS_LENGTH_NOT_MULTIPLE. The error line maybe is here.

cpu pending

After around 8 hours running, the cpu could idle.

how to transport matrix by using mshadow

If i want to transport a matrix (eg.mshadow::Tensor<xpu, dim> grad), what can i do?

The usage of concat

Hi , @tqchen @antinucleon ,
I notice there's a function concat in mshadow, AWESOME.
But how to use it? I find the function's parameter is LVALUE, RVALUE, but how can I define which dimension to concat?

Compiling NNET examle

Hi,

When I try to compile neural net example I get the following error:

nvcc -o nnet_ps -O3 --use_fast_math -ccbin g++  -Xcompiler "-Wall -O3 -I../../ -fopenmp -msse3 -funroll-loops -Wno-unused-parameter -Wno-unknown-pragmas -I/usr/include/cuda/ -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_MKL=0 -DMSHADOW_DIST_PS=0" -Xlinker "-lm -lm -lcudart -lcublas -lcurand -L/usr/lib64 -lopenblas -L/usr/lib64/atlas" nnet_ps.cu
/tmp/tmpxft_00000cc5_00000000-16_nnet_ps.o: In function `void NNet<mshadow::gpu>::SyncProc<1>(mshadow::Tensor<mshadow::gpu, 1, float>, mshadow::Tensor<mshadow::gpu, 1, float>, int)':
tmpxft_00000cc5_00000000-3_nnet_ps.cudafe1.cpp:(.text._ZN4NNetIN7mshadow3gpuEE8SyncProcILi1EEEvNS0_6TensorIS1_XT_EfEES5_i[_ZN4NNetIN7mshadow3gpuEE8SyncProcILi1EEEvNS0_6TensorIS1_XT_EfEES5_i]+0x108): undefined reference to `NNet<mshadow::gpu>::UpdateEntry::ApplyUpdate(mshadow::Stream<mshadow::gpu>*, void*)'
/tmp/tmpxft_00000cc5_00000000-16_nnet_ps.o: In function `void NNet<mshadow::gpu>::SyncProc<2>(mshadow::Tensor<mshadow::gpu, 2, float>, mshadow::Tensor<mshadow::gpu, 2, float>, int)':
tmpxft_00000cc5_00000000-3_nnet_ps.cudafe1.cpp:(.text._ZN4NNetIN7mshadow3gpuEE8SyncProcILi2EEEvNS0_6TensorIS1_XT_EfEES5_i[_ZN4NNetIN7mshadow3gpuEE8SyncProcILi2EEEvNS0_6TensorIS1_XT_EfEES5_i]+0x165): undefined reference to `NNet<mshadow::gpu>::UpdateEntry::ApplyUpdate(mshadow::Stream<mshadow::gpu>*, void*)'
/tmp/tmpxft_00000cc5_00000000-16_nnet_ps.o: In function `void NNet<mshadow::cpu>::SyncProc<1>(mshadow::Tensor<mshadow::cpu, 1, float>, mshadow::Tensor<mshadow::cpu, 1, float>, int)':
tmpxft_00000cc5_00000000-3_nnet_ps.cudafe1.cpp:(.text._ZN4NNetIN7mshadow3cpuEE8SyncProcILi1EEEvNS0_6TensorIS1_XT_EfEES5_i[_ZN4NNetIN7mshadow3cpuEE8SyncProcILi1EEEvNS0_6TensorIS1_XT_EfEES5_i]+0xe1): undefined reference to `NNet<mshadow::cpu>::UpdateEntry::ApplyUpdate(mshadow::Stream<mshadow::cpu>*, void*)'
/tmp/tmpxft_00000cc5_00000000-16_nnet_ps.o: In function `void NNet<mshadow::cpu>::SyncProc<2>(mshadow::Tensor<mshadow::cpu, 2, float>, mshadow::Tensor<mshadow::cpu, 2, float>, int)':
tmpxft_00000cc5_00000000-3_nnet_ps.cudafe1.cpp:(.text._ZN4NNetIN7mshadow3cpuEE8SyncProcILi2EEEvNS0_6TensorIS1_XT_EfEES5_i[_ZN4NNetIN7mshadow3cpuEE8SyncProcILi2EEEvNS0_6TensorIS1_XT_EfEES5_i]+0x13d): undefined reference to `NNet<mshadow::cpu>::UpdateEntry::ApplyUpdate(mshadow::Stream<mshadow::cpu>*, void*)'
collect2: error: ld returned 1 exit status
Makefile:34: recipe for target 'nnet_ps' failed

Any ideas what can I do to fix it? Is this example old?

Szymon

precision issue

The operations defined with Map need float precision, otherwise it won't compile, is there a way to use double precision as input (for gradient checking etc)? For GPU card with Compute Capability 2.0 or above double precision is supported. (http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDA__MATH__DOUBLE.html)
By the way, I started working on MPI for multiple GPU with mshadow, hope this can work.

Question about operator= for Tensor and TensorContainer

Hi ,
I`m confused with operator= for Tensor and TensorContainer
Could someone explain how it works?
Many thanks :D

#include<mshadow/tensor.h>
#include<iostream>
#include<mshadow/tensor_container.h>
void TestTensor(){
  using namespace std;
  using namespace mshadow;
  TensorContainer<cpu, 3> tc3;
  TensorContainer<cpu, 2> tc2;
  tc3.Resize(Shape3(3, 2, 2));
  tc2.Resize(Shape2(2, 2));
  tc2[0][0] = 0; tc2[0][1] = 0.1;
  tc2[1][0] = 1.0; tc2[1][1] = 1.1;
  for (index_t i = 0; i < 3; i++){
    //tc3[i] = tc2; //in this case , failed
    Copy(tc3[i], tc2, tc3.stream_);//succeed
  }
  for (index_t i = 0; i < 3; i++){
    cout << "channel 0" << endl;
    for (index_t j = 0; j < 2; j++)
    {
      for (index_t k = 0; k < 2; k++)
      {
        cout << tc3[i][j][k] << " ";
      }
    }
    cout << endl;
  }
}

compile errors on dot towards TensorContainer

I add the following codes in basic.cpp to have a test, but it reports a compile error for me? Do I miss something? Thanks.

TensorContainer<cpu, 2, float> matc1(Shape2(2,3));
matc1[0]=2; matc1[1]=3;
matc1 = 1 / matc1;

TensorContainer<cpu, 2, float> matc2(Shape2(3,2));
matc2[0]=1; matc2[1]=2; matc2[2]=3;

// error: conversion from ‘mshadow::expr::DotExp<mshadow::Tensor<mshadow::cpu, 2, float>,
// mshadow::Tensor<mshadow::cpu, 2, float>, false, false, float>’ to non-scalar type
// ‘mshadow::TensorContainer<mshadow::cpu, 2, float>’ requested
TensorContainer<cpu, 2, float> matc3 = dot(matc1, matc2);

for (index_t i = 0; i < matc3.size(0); ++i) {
for (index_t j = 0; j < matc3.size(1); ++j) {
printf("%.2f ", matc3[i][j]);
}
printf("\n");
}
printf("\n");

half_t multiplication

/home/wallnuss/src/mxnet/mshadow/mshadow/././half.h: In instantiation of ‘mshadow::half::half_t mshadow::half::operator*(mshadow::half::half_t, T) [with T = mshadow::expr::CroppingExp<mshadow::expr::MakeTensorExp<mshadow::expr::UnPoolingExp<mshadow::red::maximum, mshadow::expr::MakeTensorExp<mshadow::expr::PaddingExp<mshadow::Tensor<mshadow::cpu, 4, mshadow::half::half_t>, mshadow::half::half_t, 4>, mshadow::Tensor<mshadow::cpu, 4, mshadow::half::half_t>, 4, mshadow::half::half_t>, mshadow::half::half_t, 4>, mshadow::expr::MakeTensorExp<mshadow::expr::PaddingExp<mshadow::Tensor<mshadow::cpu, 4, mshadow::half::half_t>, mshadow::half::half_t, 4>, mshadow::Tensor<mshadow::cpu, 4, mshadow::half::half_t>, 4, mshadow::half::half_t>, 4, mshadow::half::half_t>, mshadow::half::half_t, 4>]’:
src/operator/./pooling-inl.h:145:7:   required from ‘void mxnet::op::PoolingOp<xpu, Reducer, DType>::Backward(const mxnet::OpContext&, const std::vector<mshadow::TBlob>&, const std::vector<mshadow::TBlob>&, const std::vector<mshadow::TBlob>&, const std::vector<mxnet::OpReqType>&, const std::vector<mshadow::TBlob>&, const std::vector<mshadow::TBlob>&) [with xpu = mshadow::cpu; Reducer = mshadow::red::maximum; DType = mshadow::half::half_t]’
src/operator/pooling.cc:47:1:   required from here
/home/wallnuss/src/mxnet/mshadow/mshadow/././half.h:248:31: error: invalid cast from type ‘mshadow::expr::CroppingExp<mshadow::expr::MakeTensorExp<mshadow::expr::UnPoolingExp<mshadow::red::maximum, mshadow::expr::MakeTensorExp<mshadow::expr::PaddingExp<mshadow::Tensor<mshadow::cpu, 4, mshadow::half::half_t>, mshadow::half::half_t, 4>, mshadow::Tensor<mshadow::cpu, 4, mshadow::half::half_t>, 4, mshadow::half::half_t>, mshadow::half::half_t, 4>, mshadow::expr::MakeTensorExp<mshadow::expr::PaddingExp<mshadow::Tensor<mshadow::cpu, 4, mshadow::half::half_t>, mshadow::half::half_t, 4>, mshadow::Tensor<mshadow::cpu, 4, mshadow::half::half_t>, 4, mshadow::half::half_t>, 4, mshadow::half::half_t>, mshadow::half::half_t, 4>’ to type ‘float’
 MSHADOW_HALF_OPERATOR(half_t, *)
                               ^
/home/wallnuss/src/mxnet/mshadow/mshadow/././half.h:35:27: note: in definition of macro ‘MSHADOW_HALF_OPERATOR’
     return RTYPE(float(a) OP float(b));  /* NOLINT(*) */

I am encountering this in apache/mxnet#2280 for a case where I have constant * expr.
I am not that familiar with how mshadow works so I am wondering if it might be some weird interaction between the half_t implementation of multiplication taking precedent
over the expr multiplication.

Fix for Makefile on OS X

On my OS X, the shared libraries of CUDA are placed in CUDA_HOME/lib instead of CUDA_HOME/lib64. Please change the following line from

MSHADOW_LDFLAGS += -L$(USE_CUDA_PATH)/lib64

MSHADOW_LDFLAGS += -L$(USE_CUDA_PATH)/lib64 -L$(USE_CUDA_PATH)/lib

leverage cudnn?

Is there anything from cudnn can be leveraged? It shows ~24% speedup in Titan Black.

[Discussion] Support sort in mshadow?

Do we need to do this? For CPU we can use std::sort, for GPU we can refer CUDA examples.

sampe_api

dst, idx = sort(src)

where idx is the argsort result

Can't get correct answer from reduce_with_axis

I'm trying to implement a tensorflow-like reduce_sum operator for mxnet, based on mshadow::expr::reduce_with_axis. Firstly I wrote some testing code but got the wrong answer. Not sure whether my usage is inappropriate. Here's the test code:

#define MSHADOW_STAND_ALONE 1
#include <iostream>
#include "../mshadow/tensor.h"
#include "../mshadow/extension/reduce_with_axis.h"
using namespace mshadow;
using namespace mshadow::expr;
using namespace std;

int main() {
    Tensor<cpu, 3, float> t3(Shape3(2, 2, 5));
    AllocSpace(&t3);
    t3 = 1.0f;

    Tensor<cpu, 2, float> t2(Shape2(2, 2));
    AllocSpace(&t2);
    t2 = reduce_with_axis<red::sum, false>(t3, 2);
    for (index_t i = 0; i < t2.size(0); ++i) {
        for (index_t j = 0; j < t2.size(1); ++j) {
            cout << t2[i][j] << ' ';
        }
        cout << endl;
    }
}

The output varied from

5 5
7.3787e+19 0

5 5
4 0

or something else. However, if I reduce dimension 0 or 1, rather than 2, then the result is constant and correct.

I also read the source code of reduce_with_axis.h but failed to catch the idea.

@piiswrong

what does ``unpack_patch2col`` exactly do?

I know that this function is to vectorize a matrix to prepare for the convolution,
but in detail, the function receive params (input_image, filter_height, filter_width, stride)
and output a tensor<2>,whose shape is (channel * filter_height * filter_width, batch * ouput_height * output_width)

so I think this function convert the matrix to suit the “dot“ function with filter.

thus , for example , if I input a image tensor ``IMG`` 1 * 2 * 1 * 3, meaning 1 batch, 2 channels, 1 height , 3 width , the first channel of IMG is (1,2,3), the second is (4,5,6), then I apply `` unpack_patch2col(IMG, 1, 2, 1)`` which means the filter is 1 * 2, then the output shoud be like a tensor, (1, 2, 2, 3, 4, 5, 5, 6) but all I get is (6, 2, 6, 2, 6, 2, 6, 2)

Anyone helps me about the function?

does mshadow have safelog function ?

[question] why just replace the inline keyword with predefined macro?

This is just a question about the source code. To improve the performance, it will be benefit to use inline function as many as possible. The MSHADOW_XINLINE is a force inline macro which defined in base.h. But sometimes inline key word appear here and there.
for example,
https://github.com/dmlc/mshadow/blob/master/mshadow/expression.h#L137
why just replace the inline keyword with predefined macro? Is it matter the performance?
Or just use inline keywords where we can make sure that function will be execute in CPU .

potential random issue with DTypes

Currently random number generation only supporting float and double type using cuRAND. According to cuRAND doc(CUDA7.5, I haven't found the link to CUDA8.0), half type random number is not yet supported. A candidate solution is to create one extra float type tensor to generate values and convert them into DTypes other than float or double.

Currently the Random is used in dropout layer to created mask, which might be a issue if we want to support DType.
@tqchen

Initialization on GPU slow ( turns out that my cuda is broken)

The cublasInit() call is very time consuming, which is not trivial in debugging. Is there a possibility to circumvent using this function in your implementation? I have used CAFFE which seems to be very fast in GPU initialization.

Some misleading syntax

In Dot_engine-inl.h, the default type of a template argument is set by

template<typename Device, typename DType = default_real_t>

While in other places like Tensor.h, I notice that DType is set by #define MSHADOW_DEFAULT_DTYPE = default_real_t.

Should we define them in a consistent way?

Is there a function like slice in matlab?

I only find the slice(begin ,end),but what if I want a function similar to [:,:,3,:] to get a particular slice in matlab?
Thx.

take using real_t

Hi,
On line 41 in mshadow/extension/take.h, default_real_t is used as template prameter. Could this value be DType? I'm working on embedding DType and getting error on line 70.
@antinucleon

Building on Mac OS 10.9.4

I managed to build cxxnet+mshadow on Mac OS X 10.9.4 after doing several changes to the Makefile (which I can share if anyone's interested) mostly to deal with the well known libc++/libstdc++ issue in Mavericks.
Clang gave me several harmless warning about types being declared as structs and defined as classes (or the other way around). However, it was also failing to compile and I had to change line 453 in tensor.h because it looks like it was missing the dimkeep parameter and Clang was unforgiving about it:

template<typename Saver, typename Reducer, int dimkeep, typename E, int etype>

I'm new to mshadow/cxxnet (I discovered it yesterday evening) so I apologize beforehand if my contribution is not correct.

Regards,

Steven

The CroppingExp for RValue

Hi,
this issue is comming from:How to copy the data from the Tensor 'src' to the crop of Tensor 'dst'?. i'm not familiar with mshadow, it puzzled me here:

i found that the extensions which support RValue(like SliceExp, ConcatExp ) is inherited from TRValue, but the currently CroppingExp is inherited from MakeTensorExp, so should we change this?

would someone give some more advises for doing this?
i guess we should add REval in Plan structer, like slice and concat, are there any other place to pay attention to?
ths~

<gpu> and <cpu> generates totally different results

Help.
I improved the /guide/neuralnet/convnet.cu, add my own function
when I use -cpu parameter, things go well , the error rate is declining
however, -gpu generates totally different results: the error just stays at 0.9 or so(didn't change a little bit)...
It may seems there's a bug in mshadow's gpu implementation, but I don't know.

I didn't add my own gpu code, just use mshadow's(like <xpu>).

for cpu , I used blas lib. And my cuda version is 7.0.
Please help.
@tqchen @antinucleon

incorrect flag checking for -std=c++0x ?

Using gcc v. 4.6.3, I found that I needed to change the flag check from
__GXX_EXPERIMENTAL_CXX0X
to
GXX_EXPERIMENTAL_CXX0X
in tensor_base.h
in order to detect the -std=c++0x flag correctly.

(I was using lambdas in my code, so needed to add -std=c++0x to the compiler options, but then I got errors about not using constexpr, which is what that flag check in tensor_base.h is supposed to do ... so it looks like you need to add those underscores to get the flag check to work).

Compile error using cuda (basic_stream.cu, mshadow::InitTensorEngine)

nvcc -o basic_stream -O3 --use_fast_math -ccbin g++ -Xcompiler "-Wall -O3 -I../ -msse3 -funroll-loops -Wno-unused-parameter -Wno-unknown-pragmas -I/usr/local/cuda-7.0/include -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_MKL=0 -DMSHADOW_RABIT_PS=0 -DMSHADOW_DIST_PS=0" -Xlinker "-lm -lm -lcudart -lcublas -lcurand -L/usr/local/cuda-7.0/lib64 -lcblas" basic_stream.cu
basic_stream.cu(10): error: no instance of function template "mshadow::InitTensorEngine" matches the argument list

basic_stream.cu(31): error: no instance of function template "mshadow::ShutdownTensorEngine" matches the argument list