alibaba / ai-matrix
To make it easy to benchmark AI accelerators
License: Other
Hi there,
I'm facing an issue when trying to train SSD_ResNet34_PyTorch.
Inside macro_benchmark/SSD_ResNet34_PyTorch, if I run the default command:
python -u train.py --local_rank=0 --use-fp16 --nhwc --pad-input --jit --delay-allreduce --opt-loss --epochs 10 --batch-size 128 --max_iter 3200 --warmup-factor 0 --no-save
I get the following error:
Traceback (most recent call last):
  File "train.py", line 857, in <module>
    main()
  File "train.py", line 830, in main
    mlperf_compliance.mlperf_log.setdefault(
AttributeError: module 'mlperf_compliance.mlperf_log' has no attribute 'setdefault'
I ran that code inside the recommended Docker image: nvcr.io/nvidia/pytorch:19.05-py3
Looking at the mlperf_compliance lib, I don't see any 'setdefault' method.
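In case it helps triage, here is a minimal sketch to confirm what is installed in the container and reproduce the missing attribute; whether the package exposes a `__version__` attribute is an assumption, hence the fallback.

```python
# Minimal sketch: check the installed mlperf_compliance package and confirm
# that the attribute train.py expects is missing. The package may not expose
# __version__, so fall back to "unknown".
import mlperf_compliance
from mlperf_compliance import mlperf_log

print("version:", getattr(mlperf_compliance, "__version__", "unknown"))
print("has setdefault:", hasattr(mlperf_log, "setdefault"))  # False reproduces the error
```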
There are typos in micro_benchmark/gemm/test_allgemm.sh at lines 13 and 15: missing slashes.
Hello,
Could you please clarify the reasoning behind the 'u_t' calculation in the DIEN model?
According to the DIEN paper it should be u_t = a_t * u_t,
but in your implementation it is u_t = (1.0 - a_t) * u_t.
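For reference, a sketch of the AUGRU update as I read it in the DIEN paper (my own transcription, so treat it as an assumption): the a_t scaling applies to the update gate, while a (1 - .) factor appears in the hidden-state combination, which may be where the two forms diverge.

```latex
% AUGRU update, transcribed from my reading of the DIEN paper (not verbatim):
\tilde{u}'_t = a_t \cdot u'_t, \qquad
h'_t = (1 - \tilde{u}'_t) \circ h'_{t-1} + \tilde{u}'_t \circ \tilde{h}'_t
```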
Looking forward to your reply.
Thank you!
The error message is shown below:
Error downloading object: macro_benchmark/CNN_Caffe/ResNet-152-model.caffemodel (6253c4c): Smudge error: Error downloading macro_benchmark/CNN_Caffe/ResNet-152-model.caffemodel (6253c4c4132c0b25c112b166629aa57dcaeec044a4c68ac9f003b6c801329d55): batch response: This repository is over its data quota. Purchase more data packs to restore access.
Please help fix it, or could you send me a release package? Thanks.
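One possible workaround while the quota is exhausted, sketched below: skip the LFS smudge step during clone so that everything except the LFS-tracked model files is fetched. GIT_LFS_SKIP_SMUDGE=1 is a standard Git LFS switch; whether the non-LFS files are enough for your use case is an assumption.

```python
import os
import subprocess

# Sketch: clone without downloading LFS objects (the quota-limited
# .caffemodel is skipped); LFS-tracked files remain as pointer stubs.
env = dict(os.environ, GIT_LFS_SKIP_SMUDGE="1")
subprocess.run(
    ["git", "clone", "https://github.com/alibaba/ai-matrix.git"],
    env=env,
    check=True,
)
```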
In the r1.0.2 branch there is a DeepSpeech directory under macro_benchmark, but in r1.0.4 it has been removed. Why was it removed?
Why does the CNN_Tensorflow directory have no script for dataset preparation?
It looks like in DeepInterestNetwork the provided infer.py script does not reflect the typical expected inference usage model. infer.py invokes model.eval(), which is given two items: one (i) for which the expected correct answer is "positive" (expected interaction), and another (j) for which the expected correct answer is "negative" (expected no interaction). In practice, though, I believe a more representative inference use case is covered by the model.test() method, in which the entire set of items is checked for interactions with a specific user. I would like to confirm that this is correct, because if it is, I have identified potential issues within the model.test() implementation, which I would raise as a separate issue. A sketch of the contrast follows below.
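To make the contrast concrete, here is a hypothetical sketch of the two usage patterns; the model.eval/model.test call shapes are illustrative assumptions, not the repo's exact signatures.

```python
# Hypothetical sketch of the two inference patterns discussed above.
# The argument shapes are assumptions, not DIN's exact signatures.

def pairwise_eval(model, sess, user, pos_item, neg_item):
    # infer.py-style: score one expected-positive and one expected-negative item
    return model.eval(sess, [(user, pos_item), (user, neg_item)])

def full_catalog_test(model, sess, user, all_items):
    # model.test-style: score every catalog item for one user, then rank
    scores = model.test(sess, user, all_items)
    return sorted(zip(all_items, scores), key=lambda pair: -pair[1])
```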
It is nice to have these AI benchmarks.
From an academic perspective, this benchmark can be improved as follows:
Input datasets.
For an input-sensitivity study, many datasets are needed. Since this benchmark originates from industry, collecting datasets should be easier for Alibaba to address than for anyone else.
Correctness/Accuracy criteria.
With a compiler involved in the optimization process, it is easy to end up with an incorrectly compiled binary. It is therefore extremely important for a successful benchmark suite to have a correctness-checking feature. For example, SPEC CPU 2006/2017 has built-in correctness checking as part of its scripted tool chain; many HPC benchmarks, such as CloverLeaf/Cleverleaf, also have this kind of feature.
For approximate computation, especially in machine learning, numerical correctness may not be applicable; instead, accuracy may be a better criterion, as sketched below. Again, this domain-specific criterion is easy for Alibaba to provide and critical for researchers from other domains.
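As a concrete illustration of such a check, a minimal sketch assuming the benchmark dumps its predictions and ships a reference array plus a tolerance; the file names and tolerance value are hypothetical placeholders for what a suite would standardize.

```python
import numpy as np

# Minimal accuracy-check sketch: compare a run's output against a shipped
# reference within a relative tolerance. File names and TOLERANCE are
# hypothetical placeholders.
TOLERANCE = 1e-3
ref = np.load("reference.npy")
out = np.load("output.npy")
rel_err = np.abs(out - ref) / (np.abs(ref) + 1e-12)
print("PASS" if float(rel_err.max()) < TOLERANCE else "FAIL", float(rel_err.max()))
```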
Automated installation and report.
Installing big programs on mainstream Linux distributions, especially without root privilege, can be very challenging. Reporting benchmark results could also be an interesting feature to include.
So far, SPEC seems to be more successful in this respect than any other benchmark suite I have tried.
User-space software package management tools such as Linuxbrew and Spack (LLNL) are very useful for automating installation.
As another example, the ongoing exascale computing benchmark suite (https://proxyapps.exascaleproject.org/ecp-proxy-apps-suite/) is supported by Spack (https://spack.readthedocs.io/en/latest/package_list.html) for automatic installation of not only the package itself but also its dependencies, all in user space.
Hi, Ali ai-matrix team
I recently tried this repo and verified it on DIEN.
I tried both prepare_dataset.sh and prepare_data.sh to prepare the training data, and noticed that the current DIEN code only works with prepare_dataset.sh; if I use prepare_data.sh for feature generation, training always produces NaN (see the screenshot in the original issue).
Is this a known issue? I also tried another repo from Alibaba, https://github.com/alibaba/x-deeplearning/tree/master/xdl-algorithm-solution/DIEN, which seems to handle prepare_data.sh well.
Looking forward to your reply. I'll also see whether I can make a quick fix; in any case, I think this issue should be reported here.
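If it helps whoever picks this up: a small sketch for localizing where the NaNs first appear, assuming the per-iteration losses are collected into a list (the names here are hypothetical).

```python
import numpy as np

# Locate the first non-finite loss so the corresponding batch / feature
# pipeline output (prepare_data.sh vs prepare_dataset.sh) can be inspected.
def first_nonfinite(losses):
    bad = np.flatnonzero(~np.isfinite(np.asarray(losses, dtype=float)))
    return int(bad[0]) if bad.size else None

print(first_nonfinite([0.69, 0.58, float("nan"), 0.41]))  # -> 2
```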
Best regards,
Chendi
Hi,
I'd like to know whether a pretrained model is provided for DIEN. It seems that currently only the .meta and .index files are provided in dnn_best_model_trained. Could you please upload the .data file for this checkpoint, which contains the variable values (weights and biases)?
Thanks
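On the checkpoint question above: a minimal sketch to verify locally whether a TF checkpoint is complete. tf.train.load_checkpoint needs the .data shard, so it fails fast when only .meta/.index are present; the prefix path below is a hypothetical placeholder.

```python
import tensorflow as tf

# Sketch: confirm whether a checkpoint can actually be read. The prefix is a
# hypothetical placeholder for the files in dnn_best_model_trained.
ckpt_prefix = "dnn_best_model_trained/ckpt"  # hypothetical prefix
reader = tf.train.load_checkpoint(ckpt_prefix)  # raises if the .data shard is missing
print(sorted(reader.get_variable_to_shape_map())[:5])
```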
Running Mask R-CNN reports "images/second", but the result in the GitHub README says "secs". Could you please check it? Thanks.
Hi, the AUC of DIEN in the original paper is 0.8453 (mean) on the Amazon Books dataset, but I only got 0.8279. Is this result reasonable? Has anyone gotten a higher AUC? Thank you!
Regarding DeepInterestNetwork: training uses batch sizes of 256, 512, and 1024, and inference uses batch sizes of 1, 32, and 64. However, the benchmark result shows 256, 512, and 1024 for inference. Is there anything wrong? Please help confirm.
Is there any way to solve the problem of the .contrib module not being found when using TensorFlow >= 2?
I'm using --mode=test --data_type=FP16 --embedding_device=cpu.
Output:
Traceback (most recent call last):
  File "script/train.py", line 413, in <module>
    test(model_type=args.model, seed=SEED, batch_size=args.batch_size, data_type=args.data_type)
  File "script/train.py", line 358, in test
    fp32_variables = [var_name for var_name, _ in tf.contrib.framework.list_variables(model_path)]
AttributeError: module 'tensorflow' has no attribute 'contrib'
I tried with tf.compat.v1 and it doesn't solve the problem.
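A possible fix sketch for the failing line under TF 2.x: tf.train.list_variables returns the same (name, shape) pairs that tf.contrib.framework.list_variables did, so the list comprehension at script/train.py line 358 can be rewritten without tf.contrib; the model_path value below is a placeholder.

```python
import tensorflow as tf

# TF2 replacement sketch: tf.train.list_variables yields (name, shape) pairs,
# matching what tf.contrib.framework.list_variables returned in TF1.
model_path = "path/to/checkpoint"  # placeholder for the script's model_path
fp32_variables = [var_name for var_name, _ in tf.train.list_variables(model_path)]
print(len(fp32_variables))
```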