Comments (9)
All the information you might want is on the colab I linked in the original issue report, which reproduces the issue in full...
The file size shouldn't be relevant, since, as I point out above, 49 s (out of 51) is spent downloading URLs. But, sure: each artifact has one file in it, the average file size is 130.64 KB, and the total size of the files is 38.53 MB. I added a cell to the bottom of the colab calculating this. But, as I said, the problem is that wandb is spending 49 seconds downloading URLs when the files are already downloaded.
I also don't see how this could possibly be related to slow uploads, because, as I've said, the problem is the time spent downloading URLs. I gave a profile including all relevant lines when reporting this issue, and the colab contains the complete profile.
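As a quick consistency check on these numbers, dividing the total size by the average size gives the number of single-file artifacts. This sketch assumes 1 MB = 1024 KB, which is a guess about the units the colab cell used; under that assumption the result matches the 302 `download` calls in the profile posted in this thread:

```python
# Figures quoted above. KB_PER_MB = 1024 is an ASSUMPTION about the units used.
avg_file_kb = 130.64
total_mb = 38.53
KB_PER_MB = 1024

# Each artifact holds exactly one file, so the file count equals the artifact count.
n_artifacts = total_mb * KB_PER_MB / avg_file_kb
print(round(n_artifacts))
```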
0.16.1 is not capable of downloading artifacts at all, erroring with "CommError: It appears that you do not have permission to access the requested resource. Please reach out to the project owner to grant you access. If you have the correct permissions, verify that there are no issues with your networking setup.(Error 404: Not Found)" (#6783)
0.15.12 is basically the same as 0.16.3 (3m1s with no cache; 53s when everything is already cached).
from wandb.
I have added another cell at the bottom to test this option, but I expect this won't help at all. The problem is not that downloading is slower with the cache than without; it isn't. Downloading with the cache takes 45 seconds, while downloading without the cache takes 3 minutes and 17 seconds.
The problem is that even when the cache is turned on and the artifacts are all already in the cache, wandb is 20x slower at "downloading" artifacts than it needs to be: it spends only about 5% of the time fetching the artifacts from the cache, and the other 95% downloading/fetching URLs.
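The arithmetic behind the 20x figure, as a short worked sketch using only the numbers quoted in this comment (45 s total with a warm cache, about 5% of it spent actually reading from the cache):

```python
total_s = 45.0         # cached "download" time quoted above
cache_fraction = 0.05  # approximate share of time spent reading from the cache

cache_s = total_s * cache_fraction  # time that is strictly necessary
overhead_s = total_s - cache_s      # time spent on URL/manifest requests
slowdown = total_s / cache_s        # how much slower than necessary

print(f"{cache_s:.2f} s needed, {overhead_s:.2f} s overhead, {slowdown:.0f}x slower")
```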
from wandb.
Hey @JasonGross, thank you for the detailed report and sorry for the delayed response. I'm on the engineering team and can confirm that this is definitely an issue that we can address. We've filed an internal ticket to track it and will pick this up in the near future. I don't have an exact timeline right now, as we're tackling several other high-priority performance tasks. While we're slower than we could be in the case of cached artifacts, the impact of the extra URL signing is small compared to some other use cases involving reference artifacts and large-scale downloads. We're currently working on improving performance for those cases, which is why it will take a little longer to get to this issue.
I just wanted to share the context and some of our internal roadmap so you are aware of what is going on. Again, thank you for the detailed feedback; we will aim to get to it as soon as we can!
from wandb.
@JasonGross @benrhodes26, I created #7245 to address this, but you might be disappointed at how little difference it makes if your script downloads all the artifacts serially. We still have to make a network request to get the manifest for each artifact. In your script this gets glossed over because you load the manifests when you pre-load the artifacts, but in a typical case you still need to download them.
Using your script on the current wandb release, I was able to get it to run in <6 seconds by parallelizing the requests:
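from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor() as executor:
    # list() consumes the results so worker exceptions are re-raised here
    list(executor.map(lambda a: a.download(), logged_artifacts))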
Using your script as-is on my feature branch it runs in <200ms, but that ignores the 1m23s it takes to load the initial manifests. If you parallelize the initial load it finishes in <6 seconds.
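A minimal sketch of the "parallelize the initial load too" idea: fetch each artifact and download it inside the same worker, so the per-artifact metadata/manifest requests overlap instead of running one after another. `download_all` is a hypothetical helper name, not part of the wandb API, and `fetch` stands in for whatever returns an artifact object (for example `wandb.Api().artifact(name)`):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, List, Optional


def download_all(
    names: Iterable[str],
    fetch: Callable[[str], Any],
    max_workers: Optional[int] = None,
) -> List[str]:
    """Fetch and download every artifact in a thread pool (hypothetical helper).

    Doing the fetch inside the worker means the network round trips for
    loading each artifact run concurrently, not serially before the downloads.
    """

    def fetch_and_download(name: str) -> str:
        artifact = fetch(name)      # network-bound: loads the artifact's metadata
        return artifact.download()  # returns the local directory path

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() keeps results in input order and re-raises any worker exception here.
        return list(executor.map(fetch_and_download, names))
```

Assuming `api = wandb.Api()` and a list `names` of full artifact names, usage would look like `paths = download_all(names, api.artifact)`. Threads (rather than processes) fit here because the work is dominated by waiting on network I/O.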
from wandb.
Hi @JasonGross, thank you for reporting this issue. In wandb v0.16.3 we added the option to skip caching. Can you please try downloading the artifacts as follows and let us know whether that resolves the issue?
a = artifact.download(skip_cache=True)
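To compare the default path against `skip_cache=True` on the same set of artifacts, a small timing sketch. `time_downloads` is a hypothetical helper; it assumes only that each artifact object has a `download(**kwargs)` method, as in the snippet above:

```python
import time
from typing import Any, Dict, Iterable


def time_downloads(artifacts: Iterable[Any], **download_kwargs: Any) -> Dict[str, float]:
    """Download artifacts one by one and report wall-clock time (hypothetical helper)."""
    artifacts = list(artifacts)
    start = time.perf_counter()
    for artifact in artifacts:
        # Forward options such as skip_cache=True unchanged to download().
        artifact.download(**download_kwargs)
    elapsed = time.perf_counter() - start
    return {
        "total_s": elapsed,
        "per_artifact_s": elapsed / len(artifacts) if artifacts else 0.0,
    }
```

For example, `time_downloads(logged_artifacts)` and `time_downloads(logged_artifacts, skip_cache=True)` would give the two totals side by side.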
from wandb.
Profile of downloading artifacts without cache:
9009763 function calls (8946111 primitive calls) in 204.670 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
2 0.000 0.000 204.681 102.341 /usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py:3512(run_code)
4/2 0.000 0.000 204.681 102.341 {built-in method builtins.exec}
1 0.000 0.000 204.681 204.681 <ipython-input-7-c296c4927149>:1(<cell line: 3>)
1 0.004 0.004 204.069 204.069 <ipython-input-7-c296c4927149>:3(<listcomp>)
302 0.020 0.000 201.967 0.669 /usr/local/lib/python3.10/dist-packages/wandb/sdk/artifacts/artifact.py:1557(download)
302 0.051 0.000 201.833 0.668 /usr/local/lib/python3.10/dist-packages/wandb/sdk/artifacts/artifact.py:1683(_download)
910 0.026 0.000 134.378 0.148 /usr/local/lib/python3.10/dist-packages/requests/sessions.py:502(request)
910 0.038 0.000 132.273 0.145 /usr/local/lib/python3.10/dist-packages/requests/sessions.py:673(send)
910 0.036 0.000 131.859 0.145 /usr/local/lib/python3.10/dist-packages/requests/adapters.py:434(send)
910 0.036 0.000 130.531 0.143 /usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py:596(urlopen)
910 0.034 0.000 129.857 0.143 /usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py:381(_make_request)
910 0.049 0.000 103.747 0.114 /usr/local/lib/python3.10/dist-packages/urllib3/connection.py:435(getresponse)
910 0.013 0.000 103.441 0.114 /usr/lib/python3.10/http/client.py:1331(getresponse)
910 0.032 0.000 103.378 0.114 /usr/lib/python3.10/http/client.py:311(begin)
910 0.031 0.000 102.877 0.113 /usr/lib/python3.10/http/client.py:278(_read_status)
22359 0.029 0.000 102.852 0.005 {method 'readline' of '_io.BufferedReader' objects}
925 0.012 0.000 102.823 0.111 /usr/lib/python3.10/socket.py:691(readinto)
925 0.008 0.000 102.809 0.111 /usr/lib/python3.10/ssl.py:1292(recv_into)
925 0.009 0.000 102.800 0.111 /usr/lib/python3.10/ssl.py:1150(read)
925 102.791 0.111 102.791 0.111 {method 'read' of '_ssl._SSLSocket' objects}
906 0.028 0.000 94.507 0.104 /usr/local/lib/python3.10/dist-packages/wandb/sdk/artifacts/artifact.py:610(manifest)
302 0.011 0.000 71.980 0.238 /usr/local/lib/python3.10/dist-packages/wandb/sdk/artifacts/artifact.py:2166(_load_manifest)
302 0.005 0.000 71.825 0.238 /usr/local/lib/python3.10/dist-packages/requests/api.py:62(get)
302 0.008 0.000 71.820 0.238 /usr/local/lib/python3.10/dist-packages/requests/api.py:14(request)
14433 65.397 0.005 65.397 0.005 {method 'acquire' of '_thread.lock' objects}
911 0.007 0.000 65.259 0.072 /usr/lib/python3.10/threading.py:589(wait)
1183 0.016 0.000 65.253 0.055 /usr/lib/python3.10/threading.py:288(wait)
604 0.013 0.000 64.593 0.107 /usr/lib/python3.10/concurrent/futures/_base.py:201(as_completed)
910/608 0.008 0.000 64.324 0.106 /usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/retry.py:210(wrapped_fn)
910/608 0.031 0.000 64.319 0.106 /usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/retry.py:94(__call__)
608 0.006 0.000 63.728 0.105 /usr/local/lib/python3.10/dist-packages/wandb/apis/public/api.py:66(execute)
608 0.005 0.000 63.722 0.105 /usr/local/lib/python3.10/dist-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py:48(execute)
608 0.016 0.000 63.716 0.105 /usr/local/lib/python3.10/dist-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py:58(_get_result)
608 0.022 0.000 63.698 0.105 /usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/gql_request.py:42(execute)
608 0.007 0.000 62.635 0.103 /usr/local/lib/python3.10/dist-packages/requests/sessions.py:626(post)
302 0.006 0.000 42.088 0.139 /usr/local/lib/python3.10/dist-packages/wandb/sdk/artifacts/artifact.py:1773(_fetch_file_urls)
910 0.006 0.000 25.508 0.028 /usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py:1089(_validate_conn)
302 0.012 0.000 25.500 0.084 /usr/local/lib/python3.10/dist-packages/urllib3/connection.py:609(connect)
302 0.014 0.000 24.446 0.081 /usr/local/lib/python3.10/dist-packages/urllib3/connection.py:708(_ssl_wrap_socket_and_match_hostname)
302 0.008 0.000 24.158 0.080 /usr/local/lib/python3.10/dist-packages/urllib3/util/ssl_.py:398(ssl_wrap_socket)
302 22.790 0.075 22.790 0.075 {method 'load_verify_locations' of '_ssl._SSLContext' objects}
303 0.005 0.000 2.098 0.007 /usr/local/lib/python3.10/dist-packages/tqdm/std.py:1160(__iter__)
5 0.001 0.000 1.953 0.391 /usr/local/lib/python3.10/dist-packages/wandb/apis/paginator.py:55(_load_page)
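For anyone who wants to reproduce a profile in this format, a minimal sketch using the standard-library `cProfile` and `pstats` modules. `download_everything` is a hypothetical stand-in for the list comprehension in the colab that calls `download()` on each artifact:

```python
import cProfile
import io
import pstats


def download_everything():
    # Hypothetical stand-in for: [a.download() for a in logged_artifacts]
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
download_everything()
profiler.disable()

# Sort by cumulative time, as in the profile above, and print the top 40 rows.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(40)
report = stream.getvalue()
print(report)
```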
from wandb.
Thank you so much @JasonGross for the detailed download tests. Would it be possible to share some details about how many files are in your artifacts, what type they are, and their individual sizes? Also, any chance you could downgrade to wandb 0.16.1 and try again? There seems to be a reported issue with upload speed in 0.16.3, and I wanted to confirm the two aren't related.
from wandb.
Thank you @JasonGross for the detailed information and for the Colab repro. This is a great suggestion, and I've logged it as a feature request. We're focusing on performance overall, including artifacts download/upload, and that's certainly something we would like to address with either of your suggested solutions. I will keep you updated on its progress here.
from wandb.
+1. I do think this deserves attention. Fixing it would be a noticeable QoL boost.
from wandb.
Related Issues (20)
- [App]: wandb.init fail(UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte)
- [App]: Data not available for various plots
- [CLI]: repeated runs started instead of the correct ones in sweep
- [CLI]: After wandb sync, There no summary value. Then I checked wandb-summary.json, the content is empty.
- How to log confusion matrix that is precomputed?
- [Q] How to start official docker container with root user?
- [App]: "There was a problem rendering these panels." after resetting my workspace
- [App]: wandb.keras.WandbMetricsLogger requires tensorflow installation; even when using torch backend
- [Q] wandb webUI not showing number of runs for a filter
- [App]: Night mode cuts off colors in images
- [App]: Lineage for dataset crashes
- [Q] How to retrieve the entire history of GPU memory utilization?
- [Q] How to see recent runs in the updated wandb UI?
- [Q]how to launch job in self-built k8s environment, is there has guides about this?
- [App]: Excessive WebGL VRAM usage with Plotly graphs
- [App]: The list of options for X-axis is non-deterministic
- Changing WANDB_CACHE_DIR breaks artifact retrieval
- [CLI]: Wandb config and summary not fully saved during some offline runs
- [Q] Can't import wandb
- [Feature]: Delete an org
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from wandb.