
Comments (17)

kimwalisch commented on August 15, 2024

Just a quick question:

The primesieve C API (primesieve.h) already has functions for generating an array of primes, though not in-place:

/** Get an array with the primes inside the interval [start, stop].
 *  @param size  The size of the returned primes array.
 *  @param type  The type of the primes to generate, e.g. INT_PRIMES.
 *  @pre stop <= 2^64 - 2^32 * 10.
 */
void* primesieve_generate_primes(uint64_t start, uint64_t stop, size_t* size, int type);

/** Get an array with the first n primes >= start.
 *  @param type  The type of the primes to generate, e.g. INT_PRIMES.
 *  @pre stop <= 2^64 - 2^32 * 10.
 */
void* primesieve_generate_n_primes(uint64_t n, uint64_t start, int type);

Please have a look at store_primes_in_array.c to see how it works.
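
For reference, a minimal usage sketch along the lines of store_primes_in_array.c (the actual example file may differ in details):

#include <primesieve.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
  uint64_t start = 0;
  uint64_t stop = 1000;
  size_t i;
  size_t size;

  /* store the primes inside [start, stop] in an array */
  int* primes = (int*) primesieve_generate_primes(start, stop, &size, INT_PRIMES);

  for (i = 0; i < size; i++)
    printf("%i\n", primes[i]);

  /* deallocate the primes array generated by primesieve */
  primesieve_free(primes);
  return 0;
}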

Do you see a possibility to use the above functions instead of your proposed functions and achieve similar performance?

pstoll commented on August 15, 2024

@kimwalisch - I think I looked at that and thought the underlying implementation still used std::vector. I'll double check to confirm. It might not suffer from the same perf issues, e.g. if it reserves all the space ahead of time.

Although that is easier to do with the "generate N primes" version than with the "generate primes between A and B" version, where you don't know the number of outputs in advance.

pstoll commented on August 15, 2024

Although I'll say that being able to specify an output buffer is a natural way to do this sort of numeric Python (i.e. NumPy) integration. Most NumPy functions allow the caller to optionally specify the output array.

kimwalisch commented on August 15, 2024

I think I looked at that and thought the underlying implementation still used std::vector.

I think the performance loss comes from the fact that the generate_primes() function from primesieve-python returns a list of primes, which seems to incur a very slow copy. Hence the in-place generate_primes() function you have implemented is much faster because it does not return a list of primes.

Does Python or Cython offer a function similar to the C memcpy for copying arrays very quickly? In that case we could implement the in-place generate_primes() function(s) in primesieve-python using the existing generate_primes() function from primesieve, with a memcpy at the end for copying the primes into the NumPy array.

Another option would be to implement the in-place generate_primes() functions in primesieve-python using the callback functions from primesieve's C/C++ API. This would incur no additional copy. If I remember correctly, the generate_primes() functions from primesieve-ruby are implemented this way.
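
For illustration, a rough sketch of the callback option in C. This assumes the C API's primesieve_callback_primes() (the exact name/signature should be checked against primesieve.h); since the callback receives only the prime, the output buffer lives in file-scope state here, and fill_primes/store_prime are hypothetical helper names:

#include <primesieve.h>
#include <stdint.h>
#include <stddef.h>

/* hypothetical helpers: fill a preallocated buffer via the callback API */
static uint64_t* out_buf;
static size_t out_pos;

static void store_prime(uint64_t prime)
{
  out_buf[out_pos++] = prime;
}

/* the caller must make sure buf is large enough for all primes in [start, stop] */
size_t fill_primes(uint64_t start, uint64_t stop, uint64_t* buf)
{
  out_buf = buf;
  out_pos = 0;
  primesieve_callback_primes(start, stop, store_prime);
  return out_pos;
}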

The reason why I would be happier if the new in-place functions were implemented in primesieve-python is that these functions would be introduced solely to fix Python performance issues.

I'm doing some work generating large sequences of primes.

Is this for your own research? Or are you trying to improve the prime number generation performance of a public library used by others as well? In the latter case it would be easier to convince me to add the requested functionality to primesieve's C/C++ API.

Thanks,
Kim

pstoll commented on August 15, 2024

Or are you trying to improve the prime number generation performance of a public library used by others as well?

That is definitely in scope for what I'm trying to do :) I'm helping on a project that needs prime numbers generated, so faster is better. I've done a fair bit of numeric computation and am happy to contribute to the broader effort.

pstoll commented on August 15, 2024

The option of using the C API function primesieve_generate_n_primes and then copying the output still leaves us in the situation of doubling the memory pressure. I'm generating sequential sets of primes as large as will fit in RAM. I'll think more about the idea of just writing my own callback method for the Python interface. I still think there might be a place in the API for writing the primes into a preallocated array, just to avoid copying it later. Thanks for the feedback.

kimwalisch commented on August 15, 2024

I still think there might be a place in the API for writing the primes into a preallocated array, just to avoid copying it later.

Another reason why I don't really like the in-place idea is that the API becomes very vulnerable to errors. The main function for generating primes is generate_primes(); it is used more frequently than generate_n_primes(). If I introduce in-place functionality then I want to introduce it for both generate_primes() and generate_n_primes(). But for generate_primes() it is not possible to know how large the primes array will be in advance, so the user might provide an array which is too small to store all the primes.

Please let me know if you have a good idea for generate_primes(), I am definitely open to suggestions.

A third option that comes to my mind for implementing in-place functionality in primesieve-python is using a primesieve::iterator. In pseudocode this solution would look like:

primesieve::iterator pi(start, stop);
uint64_t prime = 0;
uint64_t i = 0;

while ((prime = pi.next_prime()) <= stop)
{
    numpy_array[i++] = prime;
}

This solution is simpler than the callback solution, does not double the memory usage, and adds only minor overhead compared to an in-place function that stores primes directly into a NumPy array. We could also use this solution for generate_n_primes().
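
For reference, the same loop expressed against the C API's primesieve_iterator, which is roughly what a Cython binding would call (a sketch only; function names such as primesieve_skipto should be checked against the installed primesieve.h):

#include <primesieve.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: write the primes inside [start, stop] into a preallocated buffer,
 * returning the number of primes written (at most len). */
size_t primes_into(uint64_t start, uint64_t stop, uint64_t* buf, size_t len)
{
  primesieve_iterator it;
  uint64_t prime;
  size_t i = 0;

  primesieve_init(&it);
  /* next_prime() generates primes > start, so step back by one to include start */
  primesieve_skipto(&it, start > 0 ? start - 1 : 0, stop);

  while (i < len && (prime = primesieve_next_prime(&it)) <= stop)
    buf[i++] = prime;

  primesieve_free_iterator(&it);
  return i;
}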

pstoll commented on August 15, 2024

@kimwalisch - I wrote and benchmarked the iterator approach. The code is small, which is nice. It also doesn't require any new interfaces. It is marginally slower than the new array interface but is still dramatically faster than the current interface that uses vectors.

Do you think it's worth opening an issue about investigating the perf bottlenecks in the current use of std::vector?

The test scripts are here:
https://gist.github.com/pstoll/ceaf0db8d8827dca8734

Results for generating 10^8 primes (start = 0) are:

test1 for 100000000 primes: write directly to array with new generate_n_primes_array 2.64 sec
test1iter for 100000000 primes: write to numpy data using iter interface: 3.62 sec
test2 for 100000000 primes: current vector generate_n_primes: 21.14 sec
test2a for 100000000 primes: current vector + create numpy array: 37.06 sec

pstoll commented on August 15, 2024

Letting callers specify the memory where data is stored is something I've seen frequently in numeric libraries. The issue of unknown output sizes is a challenge, but there are ways to handle it.

As for not knowing the size of the output array for generate_primes, that is definitely an issue. There are a few approaches I can think of:

a) Allow the asymmetry of interface between (START, N) and (START, STOP).

I don't love it either but there is a difference in what the caller is asking for.

b) allow the function to fill as much data as it can.

So let the function indicate how many primes it filled and whether it ran out of space. The function would then write at most ARRAY_LEN primes, or all primes in [START, STOP], whichever limit is hit first.

There is already an output parameter in the calling interface (size) that can be used for this.

/** Store the primes inside the interval [start, stop] in the given array.
 *  @param size   On input, the length of the array ARRAY.
 *                On output, the actual number of primes generated.
 *                If ARRAY fills up before STOP is reached, prime
 *                generation stops early.
 *  @param type   The type of the primes to generate, e.g. INT_PRIMES.
 *  @param array  Pointer to the array to write the primes to.
 *  @pre stop <= 2^64 - 2^32 * 10.
 */
void* primesieve_generate_primes_array(uint64_t start, uint64_t stop, size_t* size, int type, void* array);

/** Store the first n primes >= start in the given array.
 *  @param type   The type of the primes to generate, e.g. INT_PRIMES.
 *  @param size   On input, the length of the array ARRAY.
 *                On output, the actual number of primes generated.
 *                If ARRAY fills up before n primes have been generated,
 *                prime generation stops early.
 *  @param array  Pointer to the array to write the primes to.
 *  @pre stop <= 2^64 - 2^32 * 10.
 */
void* primesieve_generate_n_primes_array(uint64_t n, uint64_t start, int type, size_t *size, void* array);
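
A hypothetical call of the proposed in-place variant (sketch only; these functions are a proposal in this thread, not part of primesieve):

uint64_t buf[1000];
size_t size = 1000;  /* in: capacity of buf */

primesieve_generate_primes_array(0, 10000, &size, UINT64_PRIMES, buf);

/* out: size holds the number of primes actually written; there are 1229
 * primes <= 10000, so here the buffer fills up, size comes back as 1000
 * and generation stops early */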

pstoll commented on August 15, 2024

Ok, I also finally did the test to figure out where the bottleneck is coming from. You are right - the vast majority of the overhead was in a faulty approach to converting the resulting std::vector to a Python data structure.

Adding a better approach to getting the output data brings the current functions much closer, but they are still about 2x slower than the new array approach. Here are the timings now:

The test1np case below uses the current functions as is and copies the data directly to an output array.

test1arr for 100000000 primes: write directly to array with new generate_n_primes_array 2.82 sec
test1iter2 for 100000000 primes: write to numpy data using iter interface data access: 3.40 sec
test1np for 100000000 primes: use normal vector generation but copy directly to numpy on output: 4.16 sec
test2 for 100000000 primes: current vector generate_n_primes with numpy copy from vector to output list: 21.03 sec

That still means doubling the memory. So I'll probably go w/ the iterator approach.

kimwalisch commented on August 15, 2024

I like this version best:

def test1np(sz = 10**8):
    import numpy_primesieve as npp
    a = npp.generate_n_primes_numpy(sz, 1)        # will create output array and return it
    #print(a[-1])
    return a

The API is extremely simple to use and generate_n_primes_numpy() is also a good function name.

That still means doubling the memory. So I'll probably go w/ the iterator approach.

That's a real issue. But then I don't understand why the version below does not double the memory usage as well?! Is my understanding right that returning an array from a python function doubles memory usage?

def test1arr(sz = 10**8):
    import numpy_primesieve as npp
    a = npp.generate_n_primes_array(sz,1)        # will create output array and return it
    #print(a[-1])
    return a

Question 1: Can we make this optional? Ideally the user should still be able to use primesieve-python if numpy is not installed.

kimwalisch commented on August 15, 2024

I have a crazy idea for an ideal solution ;-)

I came across this Stack Overflow post: Binding C array to Numpy array without copying. So we don't need in-place functionality: I could change the C API functions primesieve_generate_primes() and primesieve_generate_n_primes() to return a true C array allocated using malloc. Then we could bind the C primes array to a numpy array and return it.

I have a few questions:

  1. Does returning a numpy array from a function double memory usage (and hence have a minor overhead)?

  2. Could we change primesieve-python to work exclusively with numpy arrays without issues for the user? Does the user care if generate_primes() returns a numpy array or a plain Python list? And is it possible to automatically install numpy if the user installs primesieve-python and numpy is not installed on his system?

pstoll commented on August 15, 2024

Quick answers with more to follow:

  1. Yes, I believe we should make numpy optional. Making the user have numpy is a small burden.

    I now have the numpy extensions compiling conditionally on having numpy installed. It feels right - if the user has numpy, they get bonus faster versions of the functions.

  2. The generate_n_primes_array function didn't double memory because I wrote the primes directly into the output numpy array (using a new array version of Pushback). I didn't use the callback version because I didn't see a generate_n_primes_callback function in the API.

pstoll commented on August 15, 2024

  1. re: numpy arrays - no, they don't double memory per se. All objects in Python are passed/returned by reference, so no extra copies happen by accident. What was happening was:
    • generate_primes would allocate N values as a std::vector
    • Create an output buffer of size N as an array
    • (We now have 2 buffers of size N)
    • Copy the data from the vector to the array
    • Free the vector
    • Return the array of size N
  2. re: primesieve-python going all numpy - hmmm. I suppose if someone is installing primesieve they are probably able to install numpy.

I'd seen that article about creating numpy arrays from an allocated buffer of values too :) yes, that would avoid creating a (temporary) extra copy of data.

kimwalisch commented on August 15, 2024

Yes, I believe we should make numpy optional. Making the user have numpy is a small burden.

What's the best solution for doing this (how do other libraries support numpy)? I see that in your test you have created numpy_primesieve:

def test1np(sz = 10**8):
    import numpy_primesieve as npp
    a = npp.generate_n_primes_numpy(sz, 1)        # will create output array and return it
    #print(a[-1])
    return a

Do you know of any other libraries that handle numpy support the same way?

kimwalisch commented on August 15, 2024

Though I don't have much experience with Python, my current thinking is that we should support numpy using a submodule. So my suggestion is:

>>> from primesieve.numpy import *

# Generate a numpy array with the primes below 40
>>> generate_primes(40)

# Generate a numpy array with the first 10 primes
>>> generate_n_primes(10)

I think this is cleaner than e.g. generate_primes_numpy(). What's your opinion on this?

For testing purposes I have created a new primesieve branch (https://github.com/kimwalisch/primesieve/tree/malloc) which already contains the malloc functionality (in the C API) needed for supporting numpy. If it works well with numpy I will merge it into the master branch.

#include <primesieve.h>

uint64_t start = 0;
uint64_t stop = 1000;
uint64_t n = 100;
size_t size;

/* the primes arrays are now allocated using malloc ... */
int *primes1 = (int*) primesieve_generate_primes(start, stop, &size, INT_PRIMES);
int *primes2 = (int*) primesieve_generate_n_primes(n, start, INT_PRIMES);

/* ... so they can simply be deallocated using free() */
free(primes1);
free(primes2);

Below are a few posts I found on the internet which describe how to bind a C array to a numpy array without copying:

http://stackoverflow.com/questions/33478046/binding-c-array-to-numpy-array-without-copying
http://stackoverflow.com/questions/23872946/force-numpy-ndarray-to-take-ownership-of-its-memory-in-cython/
https://gist.github.com/GaelVaroquaux/1249305

kimwalisch commented on August 15, 2024

Discussion continues at primesieve-python: https://github.com/hickford/primesieve-python/issues/13
