data-apis / scipy-2023-presentation


Materials for the SciPy 2023 Data APIs Consortium presentation and proceedings paper

License: MIT License

Python 12.89% TeX 71.69% Shell 1.84% Jupyter Notebook 13.58%

scipy-2023-presentation's People

kgryte shoyer leofang rgommers honno tylerjereddy saulshanabrook thomasjpfan hyukjinkwon lezcano jakirkham

scipy-2023-presentation's Issues

Initial draft of the array APIs SciPy proceedings paper

I have an initial draft of the array APIs SciPy proceedings paper here.

There are still some TODOs in the document. Most notably, I still need to come up with a good motivating example to show in the introduction, and potentially to refer to throughout the document (see #2). Any suggestions there are welcome.

I'll be on PTO for the next week, so if people want to take a look, please do. Feel free to comment here with any suggestions, or to push or open a PR with suggestions as well. I'm particularly looking for high-level suggestions, such as whether the overall outline of the content looks good, and whether there is anything that I omitted or that should be trimmed down. A couple of pieces of the text are cribbed from the spec.

Here's a PDF to give an idea of what the final thing will look like. There is an 8 page limit (not including references). (I'm not going to keep this updated, though, so please refer to the rst file.)

The deadline for the draft submission is May 26. Anyone who wants to be an author on the paper will need to sign off on it, and will also need to be added on the SciPy website.

Add lockfile for benchmarks

We should ensure that we include a lockfile for the benchmarks in this repository in order to ensure reproducibility of results.
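
As a rough illustration of what such a lockfile would record, the sketch below lists the exact version of every package installed in the current environment in `name==version` form, using only the standard library. The function name `pinned_requirements` is hypothetical and not part of this repository; in practice a dedicated tool such as `pip freeze` would normally produce the file.

```python
from importlib.metadata import distributions


def pinned_requirements():
    """Return sorted 'name==version' pins for all installed distributions."""
    pins = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip entries with incomplete metadata
            pins[name.lower()] = version
    return [f"{name}=={version}" for name, version in sorted(pins.items())]


# Usage: write "\n".join(pinned_requirements()) to a file kept next to the
# benchmarks, so the environment used for the reported timings can be rebuilt.
```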

Outline

Here's the outline from the talk proposal (I've also uploaded it here https://github.com/data-apis/scipy-2023-presentation/blob/main/outline.md)

So the first question is if there's anything that we should add for the paper.

- A motivating example, adding array API standard usage to a real-world scientific data analysis script so it runs with CuPy and PyTorch in addition to NumPy.
- History of the Data APIs Consortium and array API specification.
- The scope and general design principles of the specification.
- Current status of implementations:
  - Two versions of the standard have been released, 2021.12 and 2022.12.
  - The standard includes all important core array functionality and extensions for linear algebra and Fast Fourier Transforms.
  - NumPy and CuPy have complete reference implementations in submodules (numpy.array_api).
  - NumPy, CuPy, and PyTorch have near full compliance and have plans to approach full compliance.
  - array-api-compat is a wrapper library designed to be vendored by consuming libraries like scikit-learn that makes NumPy, CuPy, and PyTorch use a uniform API.
  - The array-api-tests package is a rigorous and complete test suite for testing against the array API and can be used to determine where an array API library follows the specification and where it doesn’t.
- Future work:
  - Add full compliance to NumPy, as part of NumPy 2.0.
  - Focus on improving adoption by consuming libraries, such as SciPy and scikit-learn.
  - Reporting website that lists array API compliance by library.
  - Work is being done to create a similar standard for dataframe libraries. This work has already produced a common dataframe interchange API.

Paper authorship

For anyone who wants to be an author on the proceedings paper, you will need to do the following before Friday, June 2 (note, this was extended from the previous May 26 deadline):

- Review the current draft of the paper https://github.com/data-apis/scipy-2023-presentation/blob/main/paper.rst. You can also download a PDF build of the paper by going to https://github.com/data-apis/scipy-2023-presentation/actions, clicking on the latest build, then clicking built-paper under the "Artifacts" section. All authors need to sign off on the contents of the paper.
- Submit a pull request to this repository adding your name as an author to the top of the paper. Feel free to include any changes to the paper contents in your PR as well. Please be sure to pull first before making changes to avoid merge conflicts.
- You will need to be added as a co-presenter for the talk. This is a requirement of the SciPy proceedings committee: every co-author on the proceedings paper needs to be listed as a co-presenter on the talk. You will not need to actually present the talk with me at the conference, although if you are attending SciPy and are interested in that, please let me know.

Note that we are already listing "Data APIs Consortium" as an author on the paper. If you do not wish to complete the above steps, your contributions will be noted via that authorship.

Motivating example

Do we have a good motivating example for the talk/paper? I know we have @AnirudhDagar's scipy demo scipy/scipy@main...AnirudhDagar:scipy:array-api-demo as well as @thomasjpfan's scikit-learn PR https://github.com/scikit-learn/scikit-learn/pull/22554/files. I could crib some relevant parts from the diff(s) there. Or should we come up with a standalone script that does something? Some good things to show in the example would be:

That the majority of NumPy-like code will remain unchanged (other than np -> xp).
Use of array_namespace at the top of the function.
Some functions are renamed (e.g., concatenate -> concat).
Some functions aren't included and have to be worked around.
Some NumPy behaviors aren't guaranteed in the spec so should be written in a more portable way (e.g., explicitly indexing every axis, avoiding implicit cross-kind casting, not passing Python scalars to functions, not using int dtypes for floating-point functions).
Some libraries may need to be special-cased for performance purposes.

I can demonstrate all of these using the above scipy and scikit-learn PRs. So it's a question of whether it's better to show the actual real world usage, or if it's better to make the example more coherent and self-contained.

And we'll definitely mention scipy and scikit-learn efforts later regardless of the example we choose.
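
To make the first two points in the list above concrete, here is a minimal self-contained sketch. It is not taken from the SciPy or scikit-learn PRs. The local `array_namespace` function is a simplified, hypothetical stand-in for the helper of the same name in array-api-compat, and `standardize` is an invented example function.

```python
import numpy as np


def array_namespace(x):
    """Simplified stand-in for array_api_compat.array_namespace (assumption)."""
    # Arrays from libraries that follow the standard expose their namespace
    # through the __array_namespace__ method.
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    # Older NumPy arrays lack that method, so fall back to NumPy itself.
    if isinstance(x, np.ndarray):
        return np
    raise TypeError(f"unsupported array type: {type(x)!r}")


def standardize(x):
    """Center each column of a 2-D array and scale it to unit variance."""
    xp = array_namespace(x)  # the only library-specific line
    # The body is ordinary NumPy-style code with np replaced by xp.
    mean = xp.mean(x, axis=0, keepdims=True)
    std = xp.std(x, axis=0, keepdims=True)
    return (x - mean) / std
```

Called with a NumPy array, `standardize` behaves like the plain NumPy version. The intent is that the same body would run unchanged on another library's arrays once a full `array_namespace` implementation is used.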

The timers used for GPU libraries are inaccurate

Since we're using the CPU timer perf_counter() as the proxy (technically we should use CUDA events, but it's OK), we need to do device-wide sync before and after the sandwiched code section; that is, the synchronization should also be inserted before line 29 that calls welch:

scipy-2023-presentation/benchmarks/scipy_bench.py

Lines 28 to 33 in 26db2c0

    def main(x):
        f, p = welch(x, nperseg=8)
        if namespace == 'torch_gpu':
            torch.cuda.synchronize(device="cuda")
        elif namespace == 'cupy':
            cp.cuda.stream.get_current_stream().synchronize()
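
A minimal sketch of the sync-before-and-after pattern described above. The helper `timed` and its `sync` parameter are hypothetical and not part of the benchmark script. The caller would pass the library-specific synchronization call, and the default does nothing, which matches CPU-only libraries.

```python
import time


def timed(fn, *args, sync=lambda: None):
    """Run fn(*args) and return (result, elapsed seconds on the CPU clock)."""
    sync()  # drain work queued earlier so it is not charged to fn
    start = time.perf_counter()
    result = fn(*args)
    sync()  # wait for the work launched by fn to actually finish
    elapsed = time.perf_counter() - start
    return result, elapsed


# Possible sync arguments, using the calls from the excerpt above:
#   sync=lambda: torch.cuda.synchronize(device="cuda")
#   sync=cp.cuda.stream.get_current_stream().synchronize
```

Passing the synchronization as an argument keeps the timing logic identical across namespaces, so only the `sync` callable changes per library.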

Mention the current status of the compat layer?

To me this paragraph ends slightly abruptly:

scipy-2023-presentation/paper.rst

Lines 720 to 721 in acb2ff4

    array libraries. We expect the compatibility layer to have a significant impact
    in accelerating adoption among array-consuming libraries.

Perhaps we could mention that there is full support for NumPy/CuPy and partial (?) support for PyTorch?

Show the speedup against original code (without any change), not after change?

Just a thought... Right now, my understanding of the benchmarks is that they all use modified implementations that are not yet upstreamed and would use the standardized APIs, and we compare the perf of non-NumPy libraries against NumPy, but all using the modified implementation.

I wonder how difficult it is to compare the results against the current SciPy/scikit-learn implementations without any change? IIRC I've seen it mentioned somewhere that the adoption is actually helping improve the perf even for NumPy (at least for non-strict, I guess?), so it'd be nice to showcase this finding too. It's a nice story that "by adopting the standard, it also helps the project maintainers to discover a better pattern that could improve the CPU perf by X %".

Another question: It doesn't seem that any of the benchmarks use the compat layer? It'd be nice to note this.
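
One way such a comparison could be reported, assuming each configuration (the unmodified upstream code as well as the array API versions) has been timed separately. The function name `speedups` and the labels are hypothetical, and no measured numbers are implied.

```python
def speedups(timings, baseline):
    """Return baseline_time / time for each label in timings.

    timings maps a label (for example "unmodified_numpy") to a wall-clock
    time in seconds. Values above 1 mean faster than the baseline.
    """
    base = timings[baseline]
    return {label: base / seconds for label, seconds in timings.items()}
```

Choosing the unmodified implementation as `baseline` would show both the effect of the rewrite on NumPy itself and the gain from other libraries in one table.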


