data-apis / scipy-2023-presentation

Materials for the SciPy 2023 Data APIs Consortium presentation and proceedings paper

License: MIT License

Python 12.89% TeX 71.69% Shell 1.84% Jupyter Notebook 13.58%

scipy-2023-presentation's People

Contributors

alextp, asmeurer, honno, hyukjinkwon, jakirkham, kgryte, leofang, lezcano, rgommers, saulshanabrook, shoyer, szha, thomasjpfan, tylerjereddy

scipy-2023-presentation's Issues

Initial draft of the array APIs SciPy proceedings paper

I have an initial draft of the array APIs SciPy proceedings paper here.

There are still some TODOs in the document. Most notably, I still need to come up with a good motivating example to show in the introduction, and potentially to refer to throughout the document (see #2). Any suggestions there are welcome.

I'll be on PTO for the next week, so if people want to take a look, please do. Feel free to comment here with any suggestions, or to push or open a PR with suggestions as well. I'm particularly looking for high-level suggestions, such as whether the overall outline of the content looks good, and whether there is anything that I omitted or anything that should be trimmed down. A couple of pieces of the text are cribbed from the spec.

Here's a PDF to give an idea of what the final thing will look like. There is an 8-page limit (not including references). I'm not going to keep the PDF updated, though, so please refer to the rst file.

The deadline for the draft submission is May 26. Anyone who wants to be an author on the paper will need to sign off on it, and will also need to be added on the SciPy website.

Add lockfile for benchmarks

We should include a lockfile for the benchmarks in this repository to ensure reproducibility of results.

Outline

Here's the outline from the talk proposal (I've also uploaded it here: https://github.com/data-apis/scipy-2023-presentation/blob/main/outline.md).

So the first question is if there's anything that we should add for the paper.

  • A motivating example, adding array API standard usage to a real-world scientific data analysis script so it runs with CuPy and PyTorch in addition to NumPy.
  • History of the Data APIs Consortium and array API specification.
  • The scope and general design principles of the specification.
  • Current status of implementations:
    • Two versions of the standard have been released, 2021.12 and 2022.12.
    • The standard includes all important core array functionality and extensions for linear algebra and Fast Fourier Transforms.
    • NumPy and CuPy have complete reference implementations in submodules (numpy.array_api).
    • NumPy, CuPy, and PyTorch have near full compliance and have plans to approach full compliance.
    • array-api-compat is a wrapper library designed to be vendored by consuming libraries like scikit-learn that makes NumPy, CuPy, and PyTorch use a uniform API.
    • The array-api-tests package is a rigorous and complete test suite for testing against the array API and can be used to determine where an array API library follows the specification and where it doesn’t.
  • Future work
    • Add full compliance to NumPy, as part of NumPy 2.0.
    • Focus on improving adoption by consuming libraries, such as SciPy and scikit-learn.
    • Reporting website that lists array API compliance by library.
    • Work is being done to create a similar standard for dataframe libraries. This work has already produced a common dataframe interchange API.

Paper authorship

For anyone who wants to be an author on the proceedings paper, you will need to do the following before Friday, June 2 (note, this was extended from the previous May 26 deadline):

  • Review the current draft of the paper https://github.com/data-apis/scipy-2023-presentation/blob/main/paper.rst. You can also download a PDF build of the paper by going to https://github.com/data-apis/scipy-2023-presentation/actions, clicking on the latest build, then clicking built-paper under the "Artifacts" section. All authors need to sign off on the contents of the paper.

  • Submit a pull request to this repository adding your name as an author to the top of the paper. Feel free to also include any changes to the paper contents in your PR as well. Please be sure to pull first before making changes to avoid merge conflicts.

  • You will need to be added as a co-presenter for the talk. This is a requirement of the SciPy proceedings committee: every co-author on the proceedings paper needs to be listed as a co-presenter on the talk. You will not need to actually present the talk with me at the conference, although if you are attending SciPy and are interested in that please let me know.

Note that we are already listing "Data APIs Consortium" as an author on the paper. If you do not wish to complete the above steps, your contributions will be noted via that authorship.

Motivating example

Do we have a good motivating example for the talk/paper? I know we have @AnirudhDagar's scipy demo scipy/scipy@main...AnirudhDagar:scipy:array-api-demo as well as @thomasjpfan's scikit-learn PR https://github.com/scikit-learn/scikit-learn/pull/22554/files. I could crib some relevant parts from the diff(s) there. Or should we come up with a standalone script that does something? Some good things to show in the example would be:

  • That the majority of NumPy-like code will remain unchanged (other than np -> xp).
  • Use of array_namespace at the top of the function.
  • Some functions are renamed (e.g., concatenate -> concat).
  • Some functions aren't included and have to be worked around.
  • Some NumPy behaviors aren't guaranteed in the spec so should be written in a more portable way (e.g., explicitly indexing every axis, avoiding implicit cross-kind casting, not passing Python scalars to functions, not using int dtypes for floating-point functions).
  • Some libraries may need to be special-cased for performance purposes.

I can demonstrate all of these using the above scipy and scikit-learn PRs. So it's a question of whether it's better to show the actual real world usage, or if it's better to make the example more coherent and self-contained.

And we'll definitely mention scipy and scikit-learn efforts later regardless of the example we choose.
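
For discussion, a few of the points listed above can be sketched in a small standalone function. This is only an illustration, not code from the scipy or scikit-learn PRs: the zscore function is hypothetical, and the fallback array_namespace is a simplified stand-in (an assumption) used only when array_api_compat is not installed. It shows the namespace lookup at the top of the function, code that otherwise looks like NumPy with np replaced by xp, and explicit axis and keepdims arguments with no Python scalars passed to functions.

```python
try:
    # Preferred: the real helper from the array-api-compat package.
    from array_api_compat import array_namespace
except ImportError:
    def array_namespace(x):
        """Simplified stand-in (assumption) for array_api_compat.array_namespace."""
        if hasattr(x, "__array_namespace__"):
            # Arrays that implement the standard expose their namespace directly.
            return x.__array_namespace__()
        # Otherwise assume a NumPy array; the real helper covers more libraries.
        import numpy
        return numpy


def zscore(x):
    """Standardize each column of a 2-D floating-point array (hypothetical example)."""
    # Look up the array library once, at the top of the function.
    xp = array_namespace(x)
    # From here on the code reads like NumPy code with np replaced by xp.
    mean = xp.mean(x, axis=0, keepdims=True)
    std = xp.std(x, axis=0, keepdims=True)
    # Only arrays of the same floating-point dtype are combined, so no
    # cross-kind casting or Python scalars are involved.
    return (x - mean) / std
```

Calling zscore on a NumPy array returns a NumPy array. Under the same assumptions, passing an array from another library that array_namespace recognizes should return an array from that library without changes to the function body.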

The timers used for GPU libraries are inaccurate

Since we're using the CPU timer perf_counter() as the proxy (technically we should use CUDA events, but it's OK), we need to do a device-wide sync both before and after the timed code section. That is, a synchronization should also be inserted before line 29, which calls welch:

def main(x):
    # Nothing synchronizes the device before this call, which is the problem.
    f, p = welch(x, nperseg=8)
    # Wait for the GPU work to finish before the timer stops.
    if namespace == 'torch_gpu':
        torch.cuda.synchronize(device="cuda")
    elif namespace == 'cupy':
        cp.cuda.stream.get_current_stream().synchronize()
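
One possible shape for the fix is sketched here. It is a minimal sketch, not the benchmark script itself: the timed and get_synchronize helpers are hypothetical names, and the namespace strings are assumed to match those used in the snippet above. The point is only that the same synchronization call runs on both sides of the timed section.

```python
import time


def get_synchronize(namespace):
    """Return a callable that blocks until pending GPU work is done.

    The namespace strings are assumed to match the benchmark's own names.
    """
    if namespace == "torch_gpu":
        import torch
        return lambda: torch.cuda.synchronize(device="cuda")
    if namespace == "cupy":
        import cupy as cp
        return lambda: cp.cuda.stream.get_current_stream().synchronize()
    # CPU libraries run synchronously, so there is nothing to wait for.
    return lambda: None


def timed(fn, *args, synchronize=lambda: None, **kwargs):
    """Time fn(*args, **kwargs) with a sync before and after the call."""
    synchronize()  # drain earlier GPU work so it is not charged to fn
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    synchronize()  # wait for the work launched by fn to finish
    elapsed = time.perf_counter() - start
    return result, elapsed
```

With these helpers, the timed call would look something like timed(welch, x, nperseg=8, synchronize=get_synchronize(namespace)). CUDA events would still be the more precise tool, as noted above.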

Show the speedup against original code (without any change), not after change?

Just a thought... Right now, my understanding of the benchmarks is that they all use modified implementations that are not yet upstreamed and would use the standardized APIs, and we compare the perf of non-NumPy libraries against NumPy, but all using the modified implementation.

I wonder how difficult it is to compare the results against the current SciPy/scikit-learn implementations without any change? IIRC I've seen it mentioned somewhere that the adoption is actually helping improve the perf even for NumPy (at least for non-strict, I guess?), so it'd be nice to showcase this finding too. It's a nice story that "by adopting the standard, it also helps the project maintainers to discover a better pattern that could improve the CPU perf by X %".

Another question: It doesn't seem that any of the benchmarks use the compat layer? It'd be nice to note this.
