alibaba / ai-matrix
To make it easy to benchmark AI accelerators
License: Other
Hi there,
I'm facing an issue when trying to train SSD_ResNet34_PyTorch.
Inside macro_benchmark/SSD_ResNet34_PyTorch, if I run the default command:
python -u train.py --local_rank=0 --use-fp16 --nhwc --pad-input --jit --delay-allreduce --opt-loss --epochs 10 --batch-size 128 --max_iter 3200 --warmup-factor 0 --no-save
I get the following error:
Traceback (most recent call last):
  File "train.py", line 857, in <module>
    main()
  File "train.py", line 830, in main
    mlperf_compliance.mlperf_log.setdefault(
AttributeError: module 'mlperf_compliance.mlperf_log' has no attribute 'setdefault'
I ran that code inside the recommended Docker image: nvcr.io/nvidia/pytorch:19.05-py3
Looking at the mlperf_compliance lib, I don't see any 'setdefault' method.
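In case it helps triage, here is a minimal sketch to confirm what is installed in the container and reproduce the missing attribute; whether the package exposes a `__version__` attribute is an assumption, hence the fallback.

```python
# Minimal sketch: check the installed mlperf_compliance package and confirm
# that the attribute train.py expects is missing. The package may not expose
# __version__, so fall back to "unknown".
import mlperf_compliance
from mlperf_compliance import mlperf_log

print("version:", getattr(mlperf_compliance, "__version__", "unknown"))
print("has setdefault:", hasattr(mlperf_log, "setdefault"))  # False reproduces the error
```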
There are typos in micro_benchmark/gemm/test_allgemm.sh at lines 13 and 15: missing slashes.
Hello,
Could you please clarify the reasoning behind the 'u_t' calculation in the DIEN model?
According to the DIEN paper it should be u_t = a_t * u_t,
but in your implementation it is u_t = (1.0 - a_t) * u_t.
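For reference, a sketch of the AUGRU update as I read it in the DIEN paper (my own transcription, so treat it as an assumption): the a_t scaling applies to the update gate, while a (1 - .) factor appears in the hidden-state combination, which may be where the two forms diverge.

```latex
% AUGRU update, transcribed from my reading of the DIEN paper (not verbatim):
\tilde{u}'_t = a_t \cdot u'_t, \qquad
h'_t = (1 - \tilde{u}'_t) \circ h'_{t-1} + \tilde{u}'_t \circ \tilde{h}'_t
```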
Looking forward to your reply.
Thank you!
The error message is shown below:
Error downloading object: macro_benchmark/CNN_Caffe/ResNet-152-model.caffemodel (6253c4c): Smudge error: Error downloading macro_benchmark/CNN_Caffe/ResNet-152-model.caffemodel (6253c4c4132c0b25c112b166629aa57dcaeec044a4c68ac9f003b6c801329d55): batch response: This repository is over its data quota. Purchase more data packs to restore access.
Please help fix it, or could you send me a release package? Thanks.
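One possible workaround while the quota is exhausted, sketched below: skip the LFS smudge step during clone so that everything except the LFS-tracked model files is fetched. GIT_LFS_SKIP_SMUDGE=1 is a standard Git LFS switch; whether the non-LFS files are enough for your use case is an assumption.

```python
import os
import subprocess

# Sketch: clone without downloading LFS objects (the quota-limited
# .caffemodel is skipped); LFS-tracked files remain as pointer stubs.
env = dict(os.environ, GIT_LFS_SKIP_SMUDGE="1")
subprocess.run(
    ["git", "clone", "https://github.com/alibaba/ai-matrix.git"],
    env=env,
    check=True,
)
```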
In the r1.0.2 branch there is a DeepSpeech directory under macro_benchmark, but in r1.0.4 it has been removed. Why was it removed?
Why does the CNN_Tensorflow directory have no script for dataset preparation?
It looks like in DeepInterestNetwork the provided infer.py script does not reflect the typical expected inference usage model. infer.py invokes model.eval(), which is given two items: one (i) for which the expected correct answer is "positive" (expected interaction), and another (j) for which the expected correct answer is "negative" (expected no interaction). In practice, though, I believe a more representative inference use case is covered by the model.test() method, in which the entire set of items is checked for interactions with a specific user. I would like to confirm that this is correct, because if it is, I have identified potential issues within the model.test() implementation, which I would raise as a separate issue. A sketch of the contrast follows below.
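To make the contrast concrete, here is a hypothetical sketch of the two usage patterns; the model.eval/model.test call shapes are illustrative assumptions, not the repo's exact signatures.

```python
# Hypothetical sketch of the two inference patterns discussed above.
# The argument shapes are assumptions, not DIN's exact signatures.

def pairwise_eval(model, sess, user, pos_item, neg_item):
    # infer.py-style: score one expected-positive and one expected-negative item
    return model.eval(sess, [(user, pos_item), (user, neg_item)])

def full_catalog_test(model, sess, user, all_items):
    # model.test-style: score every catalog item for one user, then rank
    scores = model.test(sess, user, all_items)
    return sorted(zip(all_items, scores), key=lambda pair: -pair[1])
```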
It is nice to have these AI benchmarks.
From an academic perspective, this benchmark can be improved as follows:
Input datasets.
For an input-sensitivity study, many datasets are needed. Since this benchmark originates from industry, collecting datasets should be easier for Alibaba to address than for anyone else.
Correctness/Accuracy criteria.
With a compiler involved in the optimization process, it is easy to end up with an incorrectly compiled binary. It is therefore extremely important for a successful benchmark suite to have a correctness-checking feature. For example, SPEC CPU 2006/2017 has built-in correctness checking as part of its scripted tool chain; many HPC benchmarks, such as CloverLeaf/Cleverleaf, also have this kind of feature.
For approximate computation, especially in machine learning, numerical correctness may not be applicable; instead, accuracy may be a better criterion, as sketched below. Again, this domain-specific criterion is easy for Alibaba to provide and critical for researchers from other domains.
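As a concrete illustration of such a check, a minimal sketch assuming the benchmark dumps its predictions and ships a reference array plus a tolerance; the file names and tolerance value are hypothetical placeholders for what a suite would standardize.

```python
import numpy as np

# Minimal accuracy-check sketch: compare a run's output against a shipped
# reference within a relative tolerance. File names and TOLERANCE are
# hypothetical placeholders.
TOLERANCE = 1e-3
ref = np.load("reference.npy")
out = np.load("output.npy")
rel_err = np.abs(out - ref) / (np.abs(ref) + 1e-12)
print("PASS" if float(rel_err.max()) < TOLERANCE else "FAIL", float(rel_err.max()))
```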
Automated installation and report.
Installing big programs on mainstream Linux distributions, especially without root privilege, can be very challenging. Reporting benchmark results could also be an interesting feature to include.
So far, SPEC seems to be more successful in this respect than any other benchmark suite I have tried.
User-space software package management tools such as Linuxbrew and Spack (LLNL) are very useful for automating installation.
As another example, the ongoing exascale computing benchmark suite (https://proxyapps.exascaleproject.org/ecp-proxy-apps-suite/) is supported by Spack (https://spack.readthedocs.io/en/latest/package_list.html) for automatic installation of not only the package itself but also its dependencies, all in user space.
Hi, Ali ai-matrix team
I recently tried this repo and verified it on DIEN.
I tried both prepare_dataset.sh and prepare_data.sh to prepare the training data, and noticed that the current DIEN code only works with prepare_dataset.sh; if I use prepare_data.sh for feature generation, training always produces NaN (see the screenshot in the original issue).
Is this a known issue? I also tried another repo from Alibaba, https://github.com/alibaba/x-deeplearning/tree/master/xdl-algorithm-solution/DIEN, which seems to handle prepare_data.sh well.
Looking forward to your reply. I'll also see whether I can make a quick fix; in any case, I think this issue should be reported here.
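If it helps whoever picks this up: a small sketch for localizing where the NaNs first appear, assuming the per-iteration losses are collected into a list (the names here are hypothetical).

```python
import numpy as np

# Locate the first non-finite loss so the corresponding batch / feature
# pipeline output (prepare_data.sh vs prepare_dataset.sh) can be inspected.
def first_nonfinite(losses):
    bad = np.flatnonzero(~np.isfinite(np.asarray(losses, dtype=float)))
    return int(bad[0]) if bad.size else None

print(first_nonfinite([0.69, 0.58, float("nan"), 0.41]))  # -> 2
```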
Best regards,
Chendi
Hi,
I'd like to know whether a pretrained model is provided for DIEN. It seems that currently only the .meta and .index files are provided in dnn_best_model_trained. Could you please upload the .data file for this checkpoint, which contains the variable values (weights and biases)?
Thanks
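On the checkpoint question above: a minimal sketch to verify locally whether a TF checkpoint is complete. tf.train.load_checkpoint needs the .data shard, so it fails fast when only .meta/.index are present; the prefix path below is a hypothetical placeholder.

```python
import tensorflow as tf

# Sketch: confirm whether a checkpoint can actually be read. The prefix is a
# hypothetical placeholder for the files in dnn_best_model_trained.
ckpt_prefix = "dnn_best_model_trained/ckpt"  # hypothetical prefix
reader = tf.train.load_checkpoint(ckpt_prefix)  # raises if the .data shard is missing
print(sorted(reader.get_variable_to_shape_map())[:5])
```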
Running Mask R-CNN reports "images/second", but the result in the GitHub README says "secs". Could you please check it? Thanks.
Hi, the AUC of DIEN in the original paper is 0.8453 (mean) on the Amazon Books dataset, but I only got 0.8279. Is this result reasonable? Has anyone gotten a higher AUC? Thank you!
Regarding DeepInterestNetwork: training uses batch sizes of 256, 512, and 1024, and inference uses batch sizes of 1, 32, and 64. However, the benchmark result shows 256, 512, and 1024 for inference. Is there anything wrong? Please help confirm.
Is there any way to solve the problem of the .contrib module not being found when using TensorFlow >= 2?
I'm using --mode=test --data_type=FP16 --embedding_device=cpu.
Output:
Traceback (most recent call last):
  File "script/train.py", line 413, in <module>
    test(model_type=args.model, seed=SEED, batch_size=args.batch_size, data_type=args.data_type)
  File "script/train.py", line 358, in test
    fp32_variables = [var_name for var_name, _ in tf.contrib.framework.list_variables(model_path)]
AttributeError: module 'tensorflow' has no attribute 'contrib'
I tried with tf.compat.v1 and it doesn't solve the problem.
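A possible fix sketch for the failing line under TF 2.x: tf.train.list_variables returns the same (name, shape) pairs that tf.contrib.framework.list_variables did, so the list comprehension at script/train.py line 358 can be rewritten without tf.contrib; the model_path value below is a placeholder.

```python
import tensorflow as tf

# TF2 replacement sketch: tf.train.list_variables yields (name, shape) pairs,
# matching what tf.contrib.framework.list_variables returned in TF1.
model_path = "path/to/checkpoint"  # placeholder for the script's model_path
fp32_variables = [var_name for var_name, _ in tf.train.list_variables(model_path)]
print(len(fp32_variables))
```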