
[ARCHIVED] GPU String Manipulation --> Moved to cudf

Home Page: https://github.com/rapidsai/cudf

License: Apache License 2.0


custrings's Introduction

 cuStrings - GPU String Manipulation


NOTE: For the latest stable README.md ensure you are on the master branch.

Built with the columnar string operations of Pandas DataFrames in mind, cuStrings is a GPU string manipulation library for splitting, applying regexes, concatenating, and replacing tokens in arrays of strings.

nvStrings (the Python bindings for cuStrings) provides a pandas-like API that will be familiar to data engineers and data scientists, so they can easily accelerate their workflows without going into the details of CUDA programming.

For example, the following snippet loads a CSV, then uses the GPU to perform replacements typical in data-preparation tasks.

import nvstrings, nvcategory
import requests

url="https://github.com/plotly/datasets/raw/master/tips.csv"
content = requests.get(url).content.decode('utf-8')

#split content into a list, remove header
host_lines = content.strip().split('\n')[1:]

#copy strings to gpu
gpu_lines = nvstrings.to_device(host_lines)

#split into columns on gpu
gpu_columns = gpu_lines.split(',')
gpu_day_of_week = gpu_columns[4]

#use gpu `replace` to re-encode tokens on GPU
for idx, day in enumerate(['Sun', 'Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat']):
    gpu_day_of_week = gpu_day_of_week.replace(day, str(idx))

# or, use nvcategory's builtin GPU categorization
cat = nvcategory.from_strings(gpu_columns[4])

# copy category keys to host and print
print(cat.keys())

# copy "cleaned" strings to host and print
print(gpu_day_of_week)

Output:

['Fri', 'Sat', 'Sun', 'Thur']

# many entries omitted for brevity
['0', '0', '0', ..., '6', '6', '4']

cuStrings is a standalone library with no other dependencies. Other RAPIDS projects (like cuDF) depend on cuStrings and its nvStrings Python bindings.

For more examples, see Python API documentation.

Quick Start

Please see the Demo Docker Repository, choosing a tag based on the NVIDIA CUDA version you're running. This provides a ready-to-run Docker container with example notebooks and data, showcasing how you can utilize cuStrings.

Installation

Conda

cuStrings can be installed with conda (miniconda, or the full Anaconda distribution) from the rapidsai channel:

# for CUDA 9.2
conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults \
    nvstrings=0.8 python=3.6 cudatoolkit=9.2

# or, for CUDA 10.0
conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults \
    nvstrings=0.8 python=3.6 cudatoolkit=10.0

We also provide nightly conda packages built from the tip of our latest development branch.

Note: cuStrings is supported only on Linux, with Python 3.6 or 3.7.

See the Get RAPIDS version picker for more OS and version info.

Build/Install from Source

See detailed build instructions.

Build and install libcustrings and custrings using build.sh, which creates a build directory under cpp/ in the root of the git repository. build.sh requires the nvcc executable to be on your PATH or defined in the $CUDACXX environment variable.

$ ./build.sh -h                                     # Display help and exit
$ ./build.sh -n custrings                           # Build the custrings target without installing
$ ./build.sh                                        # Build and install libcustrings and custrings

Contributing

Please see our guide for contributing to cuStrings (CONTRIBUTING.md).

Contact

Find out more details on the RAPIDS site: https://rapids.ai/community.html

Open GPU Data Science

The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, while exposing that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

custrings's People

Contributors

ayushdg, beckernick, datametrician, davidwendt, dillon-cullinan, galipremsagar, gputester, jakirkham, kkraus14, mike-wendt, mluukkainen, okoskinen, randerzander, raydouglass, revans2, rlratzel, rommeldb, vibhujawa


custrings's Issues

Make nvstrings subscriptable

As a Python user, I want to index into lists of things without needing to know a special method name.

With nvstrings, to get a string at an index, I need to use sublist:

import nvstrings
s = nvstrings.to_device(["hello","there","world"])

print(s.sublist([0, 2]))

nvstrings should be subscriptable.

I want to be able to:

import nvstrings
s = nvstrings.to_device(["hello","there","world"])

print(s[0, 2])
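
A minimal sketch of how subscripting could be layered over the existing sublist method (SubscriptableStrings is a hypothetical wrapper, not part of the library):

import nvstrings

# Hypothetical wrapper: implements __getitem__ on top of sublist
class SubscriptableStrings:
    def __init__(self, strs):
        self.strs = strs

    def __getitem__(self, key):
        if isinstance(key, int):        # single index
            indices = [key]
        elif isinstance(key, slice):    # slice of indices
            indices = list(range(*key.indices(self.strs.size())))
        else:                           # tuple/list of indices
            indices = list(key)
        return self.strs.sublist(indices)

s = SubscriptableStrings(nvstrings.to_device(["hello", "there", "world"]))
print(s[0, 2])  # delegates to sublist([0, 2])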

Access to NVCategory Indices for performant reads

If we make a column that is the representation of an NVCategory in cuDF, it would be nice to be able to read the indices of that column, so that we can operate on the numeric representation of the data rather than on the data itself. Because we won't be modifying this information, only reading it, it would be much more efficient to access the data directly, without fear of corrupting the underlying representation (if everyone pinky-promises not to misbehave). Currently, get_values requires a copy to be made, which seems like a high price to pay for reading.

[FEA] Implement regex fixed quantifier functionality

Feature Request

We should implement the {} regex fixed quantifier functionality.

Example

This behavior:

import nvstrings
s = nvstrings.to_device(["hare","bunny","rabbit"])
s.findall('bb')

[<nvstrings count=0>, <nvstrings count=0>, <nvstrings count=1>]

Would ideally also be achieved with the following:

import nvstrings
s = nvstrings.to_device(["hare","bunny","rabbit"])
s.findall('b{2}')

Currently, the output of this is

[<nvstrings count=0>, <nvstrings count=0>, <nvstrings count=0>]

and the {} quantifier is matched literally, like this:

s = nvstrings.to_device(["hare","bunny","rabbit", "testb{2}"])
s.findall('b{2}')

[<nvstrings count=0>,
 <nvstrings count=0>,
 <nvstrings count=0>,
 <nvstrings count=1>]
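
For reference, Python's re module treats {n} as a fixed quantifier, which is the behavior requested here:

import re
for text in ["hare", "bunny", "rabbit"]:
    print(re.findall('b{2}', text))
# []
# []
# ['bb']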

[FEA] Add int column to nvstrings function

import cudf, nvstrings

col = cudf.Series([0, 1, 2, 3])
res = nvstrings.itos(col.to_gpu_array())
print(res)

Expected Result:

['0', '1', '2', '3']

@kkraus14 I'm uncertain how you'd handle nulls here. Would you pass the null mask array to nvstrings as well? I think we want to include a param for what token to use to represent nulls in the resulting nvstrings object.
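
A host-side sketch of the requested semantics, including a hypothetical na_rep-style parameter for nulls (neither the function nor the parameter exists in nvstrings today):

# hypothetical illustration only; `mask` is an Arrow-style (LSB-ordered) validity bitmask
def itos_host(values, mask=None, na_rep=None):
    out = []
    for i, v in enumerate(values):
        valid = mask is None or (mask[i // 8] >> (i % 8)) & 1
        out.append(str(v) if valid else na_rep)
    return out

print(itos_host([0, 1, 2, 3]))                                  # ['0', '1', '2', '3']
print(itos_host([0, 1, 2, 3], mask=[0b1011], na_rep='<null>'))  # ['0', '1', '<null>', '3']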

[BUG] Quantifiers like + appear to fail in combination with regex shorthands in a character class []

Bug

Quantifiers like + fail in combination with regex shorthands like \w, \d, and \s when they are within square brackets. As an nvstrings user, I expect the quantifiers to be consistent between standard alphanumerics and shorthands.

Example

The following two regexes work as expected. The + quantifier matches runs of the character 2 and runs of lowercase letters.

import nvstrings
s = nvstrings.to_device(["a1a1","22","c3"])
for result in s.extract('([2]+)'):
    print(result)
[None]
['22']
[None]
s = nvstrings.to_device(["a1a1","aabdbd","[d]"])
for result in s.extract('([a-z]+)'):
    print(result)
['a']
['aabdbd']
['d']

However, when using the \d shorthand, the behavior changes:

s = nvstrings.to_device(["a1a1","22","c3"])
for result in s.extract('([\d]+)'):
    print(result)
[None]
[None]
[None]

With the \d shorthand, the regex quantifier appears to fail. The same behavior is observed with the \w and \s shorthands, too.
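
For comparison, Python's re module applies the quantifier to the shorthand-in-class pattern as expected:

import re
print([re.findall(r'[\d]+', s) for s in ["a1a1", "22", "c3"]])
# [['1', '1'], ['22'], ['3']]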

Add nvcategory.from_strings support for specifying a known key list

As a cuDF user, I want to perform joins across datasets from different files.

nvcategory lets me encode string values as numerics, but numeric encodings depend on the order in which they appear in input data. If I create an nvcategory from one file, and a separate nvcategory from another file, their encodings may differ based on the order of entries in each file.

To join accurately across two datasets, I need to create two nvcategories, each encoded with the same set of keys:

Example:

lhs = nvstrings.to_device(["apple","orange","apple","banana","grape"])
rhs = nvstrings.to_device(["apple","grape","banana","kiwi","durian"])
cat = nvcategory.from_strings(lhs,rhs)

lhs_cat = nvcategory.from_strings(lhs, categories=cat.keys())
rhs_cat = nvcategory.from_strings(rhs, categories=cat.keys())

print(cat.keys())

print(lhs_cat.values())

print(rhs_cat.values())

Expected Result:

#  keys:
apple, orange, banana, grape, kiwi, durian

# lhs_cat
0, 1, 0, 2, 3

# rhs_cat
0, 3, 2, 4, 5

ToDos:

  1. Add a categories argument to nvcategory.from_strings
  2. categories can be provided as:
    A. a list on the host
    B. a device array
  3. nvcategory should expose its key and value members as device array pointers, so that other libraries like cuDF or numba can operate on them
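
A host-side sketch of the proposed semantics, encoding each column against a fixed, shared key list (encode_with_keys is illustrative only; the categories argument does not exist yet):

# illustrative only: encode values against an externally supplied key list
def encode_with_keys(values, keys):
    index = {k: i for i, k in enumerate(keys)}
    return [index[v] for v in values]

keys = ["apple", "orange", "banana", "grape", "kiwi", "durian"]
print(encode_with_keys(["apple", "orange", "apple", "banana", "grape"], keys))  # [0, 1, 0, 2, 3]
print(encode_with_keys(["apple", "grape", "banana", "kiwi", "durian"], keys))   # [0, 3, 2, 4, 5]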

[BUG] Handle empty device arrays on gather call

import nvstrings
from numba import cuda

data = ['a', 'b', 'c', 'd', 'e']
nvs = nvstrings.to_device(data)
empty_device = cuda.device_array(0, dtype='int32')  # assumed: a zero-length device array
nvs.gather(empty_device.device_ctypes_pointer.value, count=0)
# Expected to return an empty nvstrings object, but throws an error instead

[BUG] Looping through nvstrings object appears to cause an infinite loop

Bug

Looping through an nvstrings object appears to cause an infinite loop. I expect the loop to terminate after a number of iterations equal to the number of strings in the collection (three in this case).

Example

import nvstrings
s = nvstrings.to_device(["hello world","goodbye","well said"])

for i, d in enumerate(s):
    print(i, d, '\n')
    if i % 10 == 0 and i != 0:
        break
0 ['hello world'] 

1 ['goodbye'] 

2 ['well said'] 

3 [None] 

4 [None] 

5 [None] 

6 [None] 

7 [None] 

8 [None] 

9 [None] 

10 [None] 
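
Until iteration terminates properly, a workaround sketch is to bound the loop explicitly with size():

import nvstrings
s = nvstrings.to_device(["hello world", "goodbye", "well said"])
for i in range(s.size()):
    print(i, s.sublist([i]))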

Lower and Upper bounds for keys in NVCategory

Sometimes we want to do something like compare the values of an NVCategory to a literal.

Let's say you have the strings "a", "b", "d", and I want to know which strings are less than ("<") "c".

Because "c" does not exist in the dictionary, I can't just look up its index and say: if your index is less than this, you're solid. What could be done, though, is to get the lower and upper bound, so that it returns indices 1 and 2. Since I'm checking for <, and "c" lies between indices 1 and 2, I change < to <=, and now I can just compare indices in the column to see if they are <= 1. This gives me all the values less than "c" without having to add "c" to the dictionary.

[BUG] Installing from source doesn't install some needed files

When running make install from cpp/build:

-- Install configuration: "Release"
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/lib/libNVStrings.so
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/lib/librmm.so
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/include/rmm
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/include/rmm/thrust_rmm_allocator.h
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/include/rmm/rmm_api.h
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/include/rmm/detail
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/include/rmm/detail/cnmem.h
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/include/rmm/detail/memory_manager.hpp
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/include/rmm/rmm.hpp
-- Installing: /home/nfs/kkraus/anaconda3/envs/cudf_dev/include/rmm/rmm.h

It seems NVStrings.h, NVCategory.h, and libNVCategory.so aren't installed.

[BUG] nvstrings.cat with separator doesn't handle nulls properly

Looks like the nulls get arbitrary memory if na_rep isn't populated:

import nvstrings

data = ['a', 'b', 'c', 'd', 'e', 'f', 'g', None, 'i', 'j']
nvs = nvstrings.to_device(data)
print(nvs.cat(sep='|'))
# ['a|b|c|d|e|f|g|ii|j']

data = ['a', 'b', None, 'd', 'e', 'f', 'g', None, 'i', 'j']
nvs = nvstrings.to_device(data)
print(nvs.cat(sep='|'))
# ['a|b|']
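
For reference, populating na_rep should make the output deterministic; a sketch of the expected result (assuming na_rep substitutes a token for nulls, as in pandas' str.cat):

print(nvs.cat(sep='|', na_rep='_'))
# expected: ['a|b|_|d|e|f|g|_|i|j']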

[BUG] Unable to write nvcategory.values to a devptr

I'm trying to write nvcategory values into a device array:

import numpy as np
import nvstrings, nvcategory
from librmm_cffi import librmm

dev_str = nvstrings.to_device(['a', 'b', 'c'])
dev_array = librmm.device_array(dev_str.size(), dtype=np.int32)
nvcategory.from_strings(dev_str).values(devptr=dev_array)

Result:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: an integer is required

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
<ipython-input-70-a94f90f3165f> in <module>
      1 dev_str = nvstrings.to_device(['a', 'b', 'c'])
      2 dev_array = librmm.device_array(dev_str.size(), dtype=np.int32)
----> 3 nvcategory.from_strings(dev_str).values(devptr=dev_array)

/conda/envs/cudf/lib/python3.7/site-packages/nvcategory.py in values(self, devptr)
    247 
    248         """
--> 249         return pyniNVCategory.n_get_values(self.m_cptr, devptr)
    250 
    251     def add_strings(self, nvs):

SystemError: <built-in function n_get_values> returned a result with an error set
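
The "an integer is required" TypeError suggests values() expects a raw device pointer (an integer) rather than the array object itself; a possible workaround sketch is to pass the ctypes pointer value:

# pass the raw pointer instead of the device array object
nvcategory.from_strings(dev_str).values(devptr=dev_array.device_ctypes_pointer.value)
print(dev_array.copy_to_host())  # expected: [0 1 2]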

nrows support in cudf.io.csv.read_csv() is causing an exception

Describe the bug
When calling cudf.io.csv.read_csv_strings() using nrows to read a file in chunks, I get a thrust exception.

Exception:

nvs-idx: computing sizes: cudaErrorIllegalAddress(77):an illegal memory access was encountered
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  device free failed: an illegal memory access was encountered

Steps/Code to reproduce bug
Full python code in file attached

Snippet:

while True:
    frame = cudf.io.csv.read_csv_strings(file, quoting=False, doublequote=False, names=names, dtype=dtypes, skiprows=skip_rows, nrows=rows_to_read)
    skip_rows += rows_to_read
...
    if rows_read < rows_to_read:
        break

Expected behavior
Read the file in chunk windows. For the last chunk, return the remaining rows without an exception.

Environment details (please complete the following information):
Latest NGC RAPIDS CUDA 10 Container (0.5.1)

Additional context
Data file is large and private; provided on request.

Program File:
load_pos_file.zip

Start supporting RMM for memory allocation

As a cuDF & nvstrings user, I work with both nvstrings and DataFrames on the same device.

cuDF's use of RMM to pre-allocate and manage a memory pool ends up limiting the amount of memory on the card usable by nvstrings. If nvstrings adopted support for allocation via RMM, I could use all of my card's memory more efficiently and experience fewer OOMs when working with large datasets.

[BUG] Extract method scaling linearly with data size

Bug

The extract method appears to scale linearly with the size of the data (except for tiny data), which makes large-scale regex text extraction difficult. As a user, I'd like my regex extractions to scale sub-linearly with data size.

Example

import string
import time

import numpy as np
import nvstrings

def create_toy_strings(num_elements):
    return [''.join(np.random.choice(list(string.ascii_letters), 5)) for x in range(num_elements)]

NUM_RECORDS = [1000, 5000, 50000, 500000]
for n in NUM_RECORDS:
    data = create_toy_strings(n)
    device_data = nvstrings.to_device(data)
    
    start = time.time()
    new = device_data.extract('ad(x)')
    end = time.time()
    
    print('elements: {0}'.format(n), end - start)
    time.sleep(2)

elements: 1000 5.249366998672485
elements: 5000 0.3002920150756836
elements: 50000 2.9973433017730713
elements: 500000 31.241347551345825

Add gather/scatter method to nvstrings

cudf needs a gather/scatter method. An nvstrings instance would accept a list of indices and return a new nvstrings object containing the specified strings. Duplicates and random order are allowed.
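
A host-side sketch of the requested gather semantics (gather_host is illustrative only):

# illustrative only: gather strings by index; duplicates and arbitrary order allowed
def gather_host(strings, indices):
    return [strings[i] for i in indices]

print(gather_host(["a", "b", "c"], [2, 0, 0]))  # ['c', 'a', 'a']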

[BUG] to_offsets returns seemingly random data on the null bitmask buffer

import numpy as np
import pyarrow as pa
import nvstrings

test = nvstrings.to_device(['a', 'b', 'c', 'd', 'e'])

def to_arrow(str_obj):
    sbuf = np.empty(str_obj.byte_count(), dtype='int8')
    obuf = np.empty(len(str_obj), dtype='int32')
    # mask_size = gd.utils.utils.calc_chunk_size(len(str_obj), gd.utils.utils.mask_bitsize)
    mask_size = int(len(str_obj) / 8) + 1
    nbuf = np.empty(mask_size, dtype='int8') # Do you expect something larger here?
    
    str_obj.to_offsets(sbuf, obuf, nbuf=nbuf)
    
    print(nbuf)
    print(obuf)
    print(sbuf)
    
    sbuf = pa.py_buffer(sbuf)
    obuf = pa.py_buffer(obuf)
    nbuf = pa.py_buffer(nbuf)
    return pa.StringArray.from_buffers(len(str_obj), obuf, sbuf, nbuf,
                                       str_obj.null_count())

to_arrow(test)

NVCategory needs merge-remap function

In order to be able to use this, we need to be able to perform the following operation.
I have a series of n columns of type NVCategory. I don't want to merge the columns themselves; I just want the indices across them to be comparable to each other. This is necessary for many operations like gpu_compare, joining, etc. The current API doesn't seem to support this use case, so we have to be able to merge the dictionaries and remap the data. If we are still coupling the dictionaries with the data, then we at least need to be able to generate N gdf_columns of NVCategory that are the result of a merger. Merging two and outputting one doesn't allow us to perform almost any of the operations we need, especially since the way it works now doesn't make the indices comparable.
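
A host-side sketch of merge-remap (merge_remap is illustrative only): build one merged key set, then remap each column's indices so they are directly comparable across columns.

# illustrative only: each column is a (keys, values) pair
def merge_remap(columns):
    merged = sorted(set(k for keys, _ in columns for k in keys))
    index = {k: i for i, k in enumerate(merged)}
    remapped = [[index[keys[v]] for v in values] for keys, values in columns]
    return merged, remapped

a = (["pear", "apple"], [0, 1, 0])
b = (["plum", "pear"], [1, 0])
print(merge_remap([a, b]))
# (['apple', 'pear', 'plum'], [[1, 0, 1], [1, 2]])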

Linking NVCategory fails

I changed the add_library in cmake for nvcategory in a dumb and hacky way in my build so that I could get it to link when including it in cudf. This may just be because I'm dumber than an old brick with lead paint when it comes to cmake.

add_library(NVCategory SHARED
src/NVStrings.cu
src/custring_view.cu
src/custring.cu
src/util.cu
src/regex/regexec.cu
src/regex/regcomp.cpp
src/custring_view.cu
src/custring.cu
src/NVCategory.cu)

I don't know exactly what files are needed but without this I get things like /home/felipe/git-repos/forks/cudf/cpp/build/lib/libNVCategory.so: undefined reference to `NVStrings::create_from_index(std::pair<char const*, unsigned long>*, unsigned int, bool, NVStrings::sorttype)'

I did verify that it's not just NVStrings.cu.
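
A cleaner alternative sketch in CMake (assuming an NVStrings library target exists to link against) would be to compile only NVCategory's own sources and link the NVStrings symbols in, rather than recompiling them into both libraries:

# sketch: link against NVStrings instead of rebuilding its sources
add_library(NVCategory SHARED
src/NVCategory.cu)
target_link_libraries(NVCategory NVStrings)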

Support arbitrary length regex expressions

As an nvstrings user, I want to replace a long list of tokens. I can do this using nvstrings.replace, but the supported regex string length is too limited, forcing me to batch up calls, which is significantly slower due to multiple underlying kernel calls.

Example current usage (see nvstrings release post for more context):

import numpy as np

# STOPWORDS is assumed to be a predefined list of stopword strings
def remove_stopwords(nvstring_object, stopwords=STOPWORDS, num_batches=5):
    split_stopwords = np.array_split(stopwords, num_batches)

    for chunk in split_stopwords:
        combined_regex = '|'.join(['\\b({0})\\b'.format(x) for x in chunk])
        nvstring_object = nvstring_object.replace(combined_regex, '', regex=True)

    return nvstring_object

Instead, I would like to be able to make a single call to nvstrings.replace with a regex of arbitrary length, and have the library handle the details of calling a kernel with an optimized thread-only buffer, a global memory buffer, or multiple kernel calls as needed.

Rename several methods

split, extract, and rsplit are usually used in parallel, on short strings.

nvstrings has those methods today, but they're designed to operate on single, large chunks of text, turning it into many result records.

Columnar operations on short strings are supported today as split_column, extract_column, and rsplit_column.

This is the opposite of user intuition, so we should rename several functions to make it clearer how to use this library:

split -> split_record
split_column -> split

extract -> extract_record
extract_column -> extract

rsplit -> rsplit_record
rsplit_column -> rsplit

Add release-notes file

To more easily track what changes are made between nvstrings releases, we need a file that each PR should update with a short description of the change.

[FEA] Deep Copy method

We could use a function exposed to Python that returns a deep copy of the nvstrings instance. I imagine it would be something like:

my_nvstrings_obj.copy()
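
As a stopgap, a deep copy can be approximated by round-tripping through the host (a sketch; to_host() materializes the strings as a Python list):

import nvstrings
s = nvstrings.to_device(["hello", "world"])
s_copy = nvstrings.to_device(s.to_host())  # new device allocation, independent of s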
