GithubHelp home page GithubHelp logo

huachunwang / torchsearchsorted Goto Github PK

View Code? Open in Web Editor NEW

This project forked from aliutkus/torchsearchsorted

0.0 1.0 0.0 90 KB

Pytorch Custom CUDA kernel for searchsorted

License: BSD 3-Clause "New" or "Revised" License

Python 37.43% Cuda 28.70% C++ 33.87%

torchsearchsorted's Introduction

Pytorch Custom CUDA kernel for searchsorted

This repository is an implementation of the searchsorted function to work for pytorch CUDA Tensors. Initially derived from the great C extension tutorial, but totally changed since then because building C extensions is not available anymore on pytorch 1.0.

Warnings:

  • only works with pytorch > v1.4 and CUDA >= v10.1
  • NOTE When using searchsorted() for practical applications, tensors need to be contiguous in memory. This can be easily achieved by calling tensor.contiguous() on the input tensors. Failing to do so will lead to inconsistent results across applications.

Description

Implements a function searchsorted(a, v, out, side) that works just like the numpy version except that a and v are matrices.

  • a is of shape either (1, ncols_a) or (nrows, ncols_a), and is contiguous in memory (do a.contiguous() to ensure this).
  • v is of shape either (1, ncols_v) or (nrows, ncols_v), and is contiguous in memory (do v.contiguous() to ensure this).
  • out is either None or of shape (nrows, ncols_v). If provided and of the right shape, the result is put there. This is to avoid costly memory allocations if the user already did it. If provided, out should be contiguous in memory too (do out.contiguous() to ensure this).
  • side is either "left" or "right". See the numpy doc. Please not that the current implementation does not correctly handle this parameter. Help welcome to improve the speed of this PR

the output is of size as (nrows, ncols_v). If all input tensors are on GPU, a cuda version will be called. Otherwise, it will be on CPU.

Disclaimers

  • This function has not been heavily tested. Use at your own risks
  • When a is not sorted, the results vary from numpy's version. But I decided not to care about this because the function should not be called in this case.
  • In some cases, the results vary from numpy's version. However, as far as I could see, this only happens when values are equal, which means we actually don't care about the order in which this value is added. I decided not to care about this also.
  • vectors have to be contiguous for torchsearchsorted to give consistant results. use .contiguous() on all tensor arguments before calling

Installation

Just pip install ., in the root folder of this repo. This will compile and install the torchsearchsorted module.

be careful that sometimes, nvcc needs versions of gcc and g++ that are older than those found by default on the system. If so, just create symbolic links to the right versions in your cuda/bin folder (where nvcc is)

For instance, on my machine, I had gcc and g++ v9 installed, but nvcc required v8. So I had to do:

sudo apt-get install g++-8 gcc-8
sudo ln -s /usr/bin/gcc-8 /usr/local/cuda-10.1/bin/gcc
sudo ln -s /usr/bin/g++-8 /usr/local/cuda-10.1/bin/g++

be careful that you need pytorch to be installed on your system. The code was tested on pytorch v1.5

Usage

Just import the torchsearchsorted package after installation. I typically do:

from torchsearchsorted import searchsorted

Testing

Under the examples subfolder, you may:

  1. try python test.py with torch available.
Looking for 50000x1000 values in 50000x300 entries
NUMPY:  searchsorted in 4851.592ms
CPU:  searchsorted in 4805.432ms
  difference between CPU and NUMPY: 0.000
GPU:  searchsorted in 1.055ms
  difference between GPU and NUMPY: 0.000

Looking for 50000x1000 values in 50000x300 entries
NUMPY:  searchsorted in 4333.964ms
CPU:  searchsorted in 4753.958ms
  difference between CPU and NUMPY: 0.000
GPU:  searchsorted in 0.391ms
  difference between GPU and NUMPY: 0.000

The first run comprises the time of allocation, while the second one does not.

  1. You may also use the nice benchmark.py code written by @baldassarreFe, that tests searchsorted on many runs:
Benchmark searchsorted:
- a [5000 x 300]
- v [5000 x 100]
- reporting fastest time of 20 runs
- each run executes searchsorted 100 times

Numpy: 	4.6302046799100935
CPU: 	5.041533078998327
CUDA: 	0.0007955809123814106

torchsearchsorted's People

Contributors

aliutkus avatar baldassarrefe avatar chrischoy avatar krrish94 avatar pl0xz0rz avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.