
[ARCHIVED] GPU String Manipulation --> Moved to cudf

Home Page: https://github.com/rapidsai/cudf

License: Apache License 2.0


custrings's Introduction

 cuStrings - GPU String Manipulation


NOTE: For the latest stable README.md ensure you are on the master branch.

Built with the columnar string operations of Pandas DataFrames in mind, cuStrings is a GPU string manipulation library for splitting, applying regexes, concatenating, and replacing tokens in arrays of strings.

nvStrings (the Python bindings for cuStrings) provides a pandas-like API that will be familiar to data engineers and data scientists, so they can easily accelerate their workflows without going into the details of CUDA programming.

For example, the following snippet loads a CSV, then uses the GPU to perform replacements typical in data-preparation tasks.

import nvstrings, nvcategory
import requests

url="https://github.com/plotly/datasets/raw/master/tips.csv"
content = requests.get(url).content.decode('utf-8')

#split content into a list, remove header
host_lines = content.strip().split('\n')[1:]

#copy strings to gpu
gpu_lines = nvstrings.to_device(host_lines)

#split into columns on gpu
gpu_columns = gpu_lines.split(',')
gpu_day_of_week = gpu_columns[4]

#use gpu `replace` to re-encode tokens on GPU
for idx, day in enumerate(['Sun', 'Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat']):
    gpu_day_of_week = gpu_day_of_week.replace(day, str(idx))

# or, use nvcategory's builtin GPU categorization
cat = nvcategory.from_strings(gpu_columns[4])

# copy category keys to host and print
print(cat.keys())

# copy "cleaned" strings to host and print
print(gpu_day_of_week)

Output:

['Fri', 'Sat', 'Sun', 'Thur']

# many entries omitted for brevity
['0', '0', '0', ..., '6', '6', '4']

cuStrings is a standalone library with no other dependencies. Other RAPIDS projects (like cuDF) depend on cuStrings and its nvStrings Python bindings.

For more examples, see Python API documentation.

Quick Start

Please see the Demo Docker Repository, choosing a tag based on the NVIDIA CUDA version you're running. This provides a ready-to-run Docker container with example notebooks and data, showcasing how you can utilize cuStrings.

Installation

Conda

cuStrings can be installed with conda (miniconda, or the full Anaconda distribution) from the rapidsai channel:

# for CUDA 9.2
conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults \
    nvstrings=0.8 python=3.6 cudatoolkit=9.2

# or, for CUDA 10.0
conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults \
    nvstrings=0.8 python=3.6 cudatoolkit=10.0

We also provide nightly conda packages built from the tip of our latest development branch.

Note: cuStrings is supported only on Linux, with Python 3.6 or 3.7.

See the Get RAPIDS version picker for more OS and version info.

Build/Install from Source

See detailed build instructions.

Build and install libcustrings and custrings using build.sh, which creates a build directory under cpp/ in the root of the git repository. build.sh requires the nvcc executable to be on your PATH or defined in the $CUDACXX environment variable.

$ ./build.sh -h                                     # Display help and exit
$ ./build.sh -n custrings                           # Build the custrings target without installing
$ ./build.sh                                        # Build and install libcustrings and custrings

Contributing

Please see our guide for contributing to cuStrings (CONTRIBUTING.md).

Contact

Find out more details on the RAPIDS site: https://rapids.ai/community.html

Open GPU Data Science

The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, while exposing that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

custrings's People

Contributors

ayushdg, beckernick, datametrician, davidwendt, dillon-cullinan, galipremsagar, gputester, jakirkham, kkraus14, mike-wendt, mluukkainen, okoskinen, randerzander, raydouglass, revans2, rlratzel, rommeldb, vibhujawa


custrings's Issues

Make nvstrings subscriptable

As a Python user, I want to index into lists of things without needing to know a special method name.

With nvstrings, to get a string at an index, I need to use sublist:

import nvstrings
s = nvstrings.to_device(["hello","there","world"])

print(s.sublist([0, 2]))

nvstrings should be subscriptable.

I want to be able to:

import nvstrings
s = nvstrings.to_device(["hello","there","world"])

print(s[0, 2])
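
A minimal sketch of how subscripting could be layered over the existing sublist method (SubscriptableStrings is a hypothetical wrapper, not part of the library):

import nvstrings

# Hypothetical wrapper: implements __getitem__ on top of sublist
class SubscriptableStrings:
    def __init__(self, strs):
        self.strs = strs

    def __getitem__(self, key):
        if isinstance(key, int):        # single index
            indices = [key]
        elif isinstance(key, slice):    # slice of indices
            indices = list(range(*key.indices(self.strs.size())))
        else:                           # tuple/list of indices
            indices = list(key)
        return self.strs.sublist(indices)

s = SubscriptableStrings(nvstrings.to_device(["hello", "there", "world"]))
print(s[0, 2])  # delegates to sublist([0, 2])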

Access to NVCategory Indices for performant reads

If we make a column that is the representation of an NVCategory in cuDF, it would be nice to be able to read the indices of that column, so that we can operate on the numeric representation of the data rather than on the data itself. Because we won't be modifying this information, only reading it, it would be much more efficient to access the data directly, without fear of corrupting the underlying representation (if everyone pinky-promises not to misbehave). Currently, get_values requires a copy to be made, which seems like a high price to pay for reading.

[FEA] Implement regex fixed quantifier functionality

Feature Request

We should implement the {} regex fixed quantifier functionality.

Example

This behavior:

import nvstrings
s = nvstrings.to_device(["hare","bunny","rabbit"])
s.findall('bb')

[<nvstrings count=0>, <nvstrings count=0>, <nvstrings count=1>]

Would ideally also be achieved with the following:

import nvstrings
s = nvstrings.to_device(["hare","bunny","rabbit"])
s.findall('b{2}')

Currently, the output of this is

[<nvstrings count=0>, <nvstrings count=0>, <nvstrings count=0>]

and the {} quantifier is matched literally, like this:

s = nvstrings.to_device(["hare","bunny","rabbit", "testb{2}"])
s.findall('b{2}')

[<nvstrings count=0>,
 <nvstrings count=0>,
 <nvstrings count=0>,
 <nvstrings count=1>]
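
For reference, Python's re module treats {n} as a fixed quantifier, which is the behavior requested here:

import re
for text in ["hare", "bunny", "rabbit"]:
    print(re.findall('b{2}', text))
# []
# []
# ['bb']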

[FEA] Add int column to nvstrings function

import cudf, nvstrings

col = cudf.Series([0, 1, 2, 3])
res = nvstrings.itos(col.to_gpu_array())
print(res)

Expected Result:

['0', '1', '2', '3']

@kkraus14 I'm uncertain how you'd handle nulls here. Would you pass the null mask array to nvstrings as well? I think we want to include a param for what token to use to represent nulls in the resulting nvstrings object.
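
A host-side sketch of the requested semantics, including a hypothetical na_rep-style parameter for nulls (neither the function nor the parameter exists in nvstrings today):

# hypothetical illustration only; `mask` is an Arrow-style (LSB-ordered) validity bitmask
def itos_host(values, mask=None, na_rep=None):
    out = []
    for i, v in enumerate(values):
        valid = mask is None or (mask[i // 8] >> (i % 8)) & 1
        out.append(str(v) if valid else na_rep)
    return out

print(itos_host([0, 1, 2, 3]))                                  # ['0', '1', '2', '3']
print(itos_host([0, 1, 2, 3], mask=[0b1011], na_rep='<null>'))  # ['0', '1', '<null>', '3']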

[BUG] Quantifiers like + appear to fail in combination with regex shorthands in a character class []

Bug

Quantifiers like + fail in combination with regex shorthands like \w, \d, and \s when they are within square brackets. As an nvstrings user, I expect the quantifiers to be consistent between standard alphanumerics and shorthands.

Example

The following two regexes work as expected. The + quantifier matches runs of the character 2 and runs of lowercase letters.

import nvstrings
s = nvstrings.to_device(["a1a1","22","c3"])
for result in s.extract('([2]+)'):
    print(result)
[None]
['22']
[None]
s = nvstrings.to_device(["a1a1","aabdbd","[d]"])
for result in s.extract('([a-z]+)'):
    print(result)
['a']
['aabdbd']
['d']

However, when using the \d shorthand, the behavior changes:

s = nvstrings.to_device(["a1a1","22","c3"])
for result in s.extract('([\d]+)'):
    print(result)
[None]
[None]
[None]

With the \d shorthand, the regex quantifier appears to fail. The same behavior is observed with the \w and \s shorthands, too.
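
For comparison, Python's re module applies the quantifier to the shorthand-in-class pattern as expected:

import re
print([re.findall(r'[\d]+', s) for s in ["a1a1", "22", "c3"]])
# [['1', '1'], ['22'], ['3']]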

Add nvcategory.from_strings support for specifying a known key list

As a cuDF user, I want to perform joins across datasets from different files.

nvcategory lets me encode string values as numerics, but numeric encodings depend on the order in which they appear in input data. If I create an nvcategory from one file, and a separate nvcategory from another file, their encodings may differ based on the order of entries in each file.

To join accurately across two datasets, I need to create two nvcategories, each encoded with the same set of keys:

Example:

lhs = nvstrings.to_device(["apple","orange","apple","banana","grape"])
rhs = nvstrings.to_device(["apple","grape","banana","kiwi","durian"])
cat = nvcategory.from_strings(lhs,rhs)

lhs_cat = nvcategory.from_strings(lhs, categories=cat.keys())
rhs_cat = nvcategory.from_strings(rhs, categories=cat.keys())

print(cat.keys())

print(lhs_cat.values())

print(rhs_cat.values())

Expected Result:

#  keys:
apple, orange, banana, grape, kiwi, durian

# lhs_cat
0, 1, 0, 2, 3

# rhs_cat
0, 3, 2, 4, 5

ToDos:

  1. Add a categories argument to nvcategory.from_strings
  2. categories can be provided as:
    A. a list on the host
    B. a device array
  3. nvcategory should expose its key and value members as device array pointers, so that other libraries like cuDF or numba can operate on them
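
A host-side sketch of the proposed semantics, encoding each column against a fixed, shared key list (encode_with_keys is illustrative only; the categories argument does not exist yet):

# illustrative only: encode values against an externally supplied key list
def encode_with_keys(values, keys):
    index = {k: i for i, k in enumerate(keys)}
    return [index[v] for v in values]

keys = ["apple", "orange", "banana", "grape", "kiwi", "durian"]
print(encode_with_keys(["apple", "orange", "apple", "banana", "grape"], keys))  # [0, 1, 0, 2, 3]
print(encode_with_keys(["apple", "grape", "banana", "kiwi", "durian"], keys))   # [0, 3, 2, 4, 5]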

[BUG] Handle empty device arrays on gather call

import nvstrings
from numba import cuda

data = ['a', 'b', 'c', 'd', 'e']
nvs = nvstrings.to_device(data)
empty_device = cuda.device_array(0, dtype='int32')  # assumed: a zero-length device array
nvs.gather(empty_device.device_ctypes_pointer.value, count=0)
# Expected to return an empty nvstrings object, but throws an error instead

[BUG] Looping through nvstrings object appears to cause an infinite loop

Bug

Looping through an nvstrings object appears to cause an infinite loop. I expect the loop to terminate after a number of iterations equal to the number of strings in the collection (three in this case).

Example

import nvstrings
s = nvstrings.to_device(["hello world","goodbye","well said"])

for i, d in enumerate(s):
    print(i, d, '\n')
    if i % 10 == 0 and i != 0:
        break
0 ['hello world'] 

1 ['goodbye'] 

2 ['well said'] 

3 [None] 

4 [None] 

5 [None] 

6 [None] 

7 [None] 

8 [None] 

9 [None] 

10 [None] 
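
Until iteration terminates properly, a workaround sketch is to bound the loop explicitly with size():

import nvstrings
s = nvstrings.to_device(["hello world", "goodbye", "well said"])
for i in range(s.size()):
    print(i, s.sublist([i]))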

Lower and Upper bounds for keys in NVCategory

Sometimes we want to do something like compare the values of an NVCategory to a literal.

Let's say you have the strings "a", "b", "d", and I want to know which strings are less than ("<") "c".

Because "c" does not exist in the dictionary, I can't just look up its index and say: if your index is less than this, you're solid. What could be done, though, is to get the lower and upper bound, so that it returns indices 1 and 2. Since I'm checking for <, and "c" lies between indices 1 and 2, I change < to <=, and now I can just compare indices in the column to see if they are <= 1. This gives me all the values less than "c" without having to add "c" to the dictionary.

[BUG] Installing from source doesn't install some needed files

When running make install from cpp/build:

-- Install configuration: "Release"
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/lib/libNVStrings.so
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/lib/librmm.so
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/include/rmm
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/include/rmm/thrust_rmm_allocator.h
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/include/rmm/rmm_api.h
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/include/rmm/detail
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/include/rmm/detail/cnmem.h
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/include/rmm/detail/memory_manager.hpp
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/include/rmm/rmm.hpp
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/include/rmm/rmm.h

It seems NVStrings.h, NVCategory.h, and libNVCategory.so aren't installed.

[BUG] nvstrings.cat with separator doesn't handle nulls properly

Looks like the nulls get arbitrary memory if na_rep isn't populated:

import nvstrings

data = ['a', 'b', 'c', 'd', 'e', 'f', 'g', None, 'i', 'j']
nvs = nvstrings.to_device(data)
print(nvs.cat(sep='|'))
# ['a|b|c|d|e|f|g|ii|j']

data = ['a', 'b', None, 'd', 'e', 'f', 'g', None, 'i', 'j']
nvs = nvstrings.to_device(data)
print(nvs.cat(sep='|'))
# ['a|b|']
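
For reference, populating na_rep should make the output deterministic; a sketch of the expected result (assuming na_rep substitutes a token for nulls, as in pandas' str.cat):

print(nvs.cat(sep='|', na_rep='_'))
# expected: ['a|b|_|d|e|f|g|_|i|j']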

[BUG] Unable to write nvcategory.values to a devptr

I'm trying to write nvcategory values into a device array:

import numpy as np
import nvstrings, nvcategory
from librmm_cffi import librmm

dev_str = nvstrings.to_device(['a', 'b', 'c'])
dev_array = librmm.device_array(dev_str.size(), dtype=np.int32)
nvcategory.from_strings(dev_str).values(devptr=dev_array)

Result:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: an integer is required

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
<ipython-input-70-a94f90f3165f> in <module>
      1 dev_str = nvstrings.to_device(['a', 'b', 'c'])
      2 dev_array = librmm.device_array(dev_str.size(), dtype=np.int32)
----> 3 nvcategory.from_strings(dev_str).values(devptr=dev_array)

/conda/envs/cudf/lib/python3.7/site-packages/nvcategory.py in values(self, devptr)
    247 
    248         """
--> 249         return pyniNVCategory.n_get_values(self.m_cptr, devptr)
    250 
    251     def add_strings(self, nvs):

SystemError: <built-in function n_get_values> returned a result with an error set
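
The "an integer is required" TypeError suggests values() expects a raw device pointer (an integer) rather than the array object itself; a possible workaround sketch is to pass the ctypes pointer value:

# pass the raw pointer instead of the device array object
nvcategory.from_strings(dev_str).values(devptr=dev_array.device_ctypes_pointer.value)
print(dev_array.copy_to_host())  # expected: [0 1 2]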

nrows support in cudf.io.csv.read_csv() is causing an exception

Describe the bug
When calling cudf.io.csv.read_csv_strings() using nrows to read a file in chunks, I get a thrust exception.

Exception:

nvs-idx: computing sizes: cudaErrorIllegalAddress(77):an illegal memory access was encountered
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  device free failed: an illegal memory access was encountered

Steps/Code to reproduce bug
Full python code in file attached

Snippet:

while True:
    frame = cudf.io.csv.read_csv_strings(file, quoting=False, doublequote=False, names=names, dtype=dtypes, skiprows=skip_rows, nrows=rows_to_read)
    skip_rows += rows_to_read
...
    if rows_read < rows_to_read:
        break

Expected behavior
Read the file in chunk windows. For the last chunk, return the remaining rows without an exception.

Environment details (please complete the following information):
Latest NGC RAPIDS CUDA 10 Container (0.5.1)

Additional context
Data file is large and private; provided on request.

Program File:
load_pos_file.zip

Start supporting RMM for memory allocation

As a cuDF & nvstrings user, I work with both nvstrings and DataFrames on the same device.

cuDF's use of RMM to pre-allocate and manage a memory pool ends up limiting the amount of memory on the card usable by nvstrings. If nvstrings adopted support for allocation via RMM, I could use all of my card's memory more efficiently and experience fewer OOMs when working with large datasets.

[BUG] Extract method scaling linearly with data size

Bug

The extract method appears to scale linearly with the size of the data (except for tiny data), which makes large-scale regex text extraction difficult. As a user, I'd like my regex extractions to scale sub-linearly with data size.

Example

import string
import time

import numpy as np
import nvstrings

def create_toy_strings(num_elements):
    return [''.join(np.random.choice(list(string.ascii_letters), 5)) for x in range(num_elements)]

NUM_RECORDS = [1000, 5000, 50000, 500000]
for n in NUM_RECORDS:
    data = create_toy_strings(n)
    device_data = nvstrings.to_device(data)
    
    start = time.time()
    new = device_data.extract('ad(x)')
    end = time.time()
    
    print('elements: {0}'.format(n), end - start)
    time.sleep(2)

elements: 1000 5.249366998672485
elements: 5000 0.3002920150756836
elements: 50000 2.9973433017730713
elements: 500000 31.241347551345825

Add gather/scatter method to nvstrings

cudf needs a gather/scatter method. An nvstrings instance would accept a list of indices and return a new nvstrings object containing the specified strings. Duplicates and random order are allowed.
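
A host-side sketch of the requested gather semantics (gather_host is illustrative only):

# illustrative only: gather strings by index; duplicates and arbitrary order allowed
def gather_host(strings, indices):
    return [strings[i] for i in indices]

print(gather_host(["a", "b", "c"], [2, 0, 0]))  # ['c', 'a', 'a']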

[BUG] to_offsets returns seemingly random data on the null bitmask buffer

import numpy as np
import pyarrow as pa
import nvstrings

test = nvstrings.to_device(['a', 'b', 'c', 'd', 'e'])

def to_arrow(str_obj):
    sbuf = np.empty(str_obj.byte_count(), dtype='int8')
    obuf = np.empty(len(str_obj), dtype='int32')
    # mask_size = gd.utils.utils.calc_chunk_size(len(str_obj), gd.utils.utils.mask_bitsize)
    mask_size = int(len(str_obj) / 8) + 1
    nbuf = np.empty(mask_size, dtype='int8') # Do you expect something larger here?
    
    str_obj.to_offsets(sbuf, obuf, nbuf=nbuf)
    
    print(nbuf)
    print(obuf)
    print(sbuf)
    
    sbuf = pa.py_buffer(sbuf)
    obuf = pa.py_buffer(obuf)
    nbuf = pa.py_buffer(nbuf)
    return pa.StringArray.from_buffers(len(str_obj), obuf, sbuf, nbuf,
                                       str_obj.null_count())

to_arrow(test)

NVCategory needs merge-remap function

In order to be able to use this, we need to be able to perform the following operation.
I have a series of n columns of type NVCategory. I don't want to merge the columns themselves; I just want the indices across them to be comparable to each other. This is necessary for many operations like gpu_compare, joining, etc. The current API doesn't seem to support this use case, so we have to be able to merge the dictionaries and remap the data. If we are still coupling the dictionaries with the data, then we at least need to be able to generate N gdf_columns of NVCategory that are the result of a merger. Merging two and outputting one doesn't allow us to perform almost any of the operations we need, especially since the way it works now doesn't make the indices comparable.
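
A host-side sketch of merge-remap (merge_remap is illustrative only): build one merged key set, then remap each column's indices so they are directly comparable across columns.

# illustrative only: each column is a (keys, values) pair
def merge_remap(columns):
    merged = sorted(set(k for keys, _ in columns for k in keys))
    index = {k: i for i, k in enumerate(merged)}
    remapped = [[index[keys[v]] for v in values] for keys, values in columns]
    return merged, remapped

a = (["pear", "apple"], [0, 1, 0])
b = (["plum", "pear"], [1, 0])
print(merge_remap([a, b]))
# (['apple', 'pear', 'plum'], [[1, 0, 1], [1, 2]])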

Linking NVCategory fails

I changed the add_library in cmake for nvcategory in a dumb and hacky way in my build so that I could get it to link when including it in cudf. This may just be because I'm dumber than an old brick with lead paint when it comes to cmake.

add_library(NVCategory SHARED
src/NVStrings.cu
src/custring_view.cu
src/custring.cu
src/util.cu
src/regex/regexec.cu
src/regex/regcomp.cpp
src/custring_view.cu
src/custring.cu
src/NVCategory.cu)

I don't know exactly what files are needed but without this I get things like /home/felipe/git-repos/forks/cudf/cpp/build/lib/libNVCategory.so: undefined reference to `NVStrings::create_from_index(std::pair<char const*, unsigned long>*, unsigned int, bool, NVStrings::sorttype)'

I did verify that it's not just NVStrings.cu.
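
A cleaner alternative sketch in CMake (assuming an NVStrings library target exists to link against) would be to compile only NVCategory's own sources and link the NVStrings symbols in, rather than recompiling them into both libraries:

# sketch: link against NVStrings instead of rebuilding its sources
add_library(NVCategory SHARED
src/NVCategory.cu)
target_link_libraries(NVCategory NVStrings)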

Support arbitrary length regex expressions

As an nvstrings user, I want to replace a long list of tokens. I can do this using nvstrings.replace, but the supported regex string length is too limited, forcing me to batch up calls, which is significantly slower due to multiple underlying kernel calls.

Example current usage (see nvstrings release post for more context):

import numpy as np

# STOPWORDS is assumed to be a predefined list of stopword strings
def remove_stopwords(nvstring_object, stopwords=STOPWORDS, num_batches=5):
    split_stopwords = np.array_split(stopwords, num_batches)

    for chunk in split_stopwords:
        combined_regex = '|'.join(['\\b({0})\\b'.format(x) for x in chunk])
        nvstring_object = nvstring_object.replace(combined_regex, '', regex=True)

    return nvstring_object

Instead, I would like to be able to make a single call to nvstrings.replace with a regex of arbitrary length, and have the library handle the details of calling a kernel with an optimized thread-only buffer, a global memory buffer, or multiple kernel calls as needed.

Rename several methods

split, extract, and rsplit are usually used in parallel, on short strings.

nvstrings has those methods today, but they're designed to operate on single, large chunks of text, turning it into many result records.

Columnar operations on short strings are supported today as split_column, extract_column, and rsplit_column.

This is the opposite of user intuition, so we should rename several functions to make it clearer how to use this library:

split -> split_record
split_column -> split

extract -> extract_record
extract_column -> extract

rsplit -> rsplit_record
rsplit_column -> rsplit

Add release-notes file

To more easily track what changes are made between nvstrings releases, we need a file that each PR should update with a short description of the change.

[FEA] Deep Copy method

We could use a function exposed to Python that returns a deep copy of the nvstrings instance. I imagine it would be something like:

my_nvstrings_obj.copy()
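
As a stopgap, a deep copy can be approximated by round-tripping through the host (a sketch; to_host() materializes the strings as a Python list):

import nvstrings
s = nvstrings.to_device(["hello", "world"])
s_copy = nvstrings.to_device(s.to_host())  # new device allocation, independent of s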
