Comments (7)
Hi!
Yes, originally we used a generator to avoid memory overload for large input data, and we had to switch to arrays to be able to write the entire input data to hdf5 in one shot.
I agree with your proposal to allow both arrays and generators with a buffer/writer solution.
Best regards,
Antoine
from pyphs.
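To make the buffer/writer idea concrete, here is a minimal sketch that accepts either an array or a generator of rows and writes it block by block. The names iter_chunks and write_chunked are hypothetical and are not taken from pyphs; a plain NumPy array stands in for the hdf5 dataset, on the assumption that the real dataset supports the same slice assignment.

```python
import itertools

import numpy as np


def iter_chunks(data, chunksize=1024):
    """Yield 2D blocks of at most `chunksize` rows from an array or an iterable of rows."""
    if isinstance(data, np.ndarray):
        # Arrays are already in memory: slicing yields views, no extra copy.
        for start in range(0, data.shape[0], chunksize):
            yield data[start:start + chunksize]
    else:
        rows = iter(data)
        while True:
            # Pull at most `chunksize` rows so the full input is never materialized.
            block = list(itertools.islice(rows, chunksize))
            if not block:
                return
            yield np.asarray(block)


def write_chunked(dataset, data, chunksize=1024):
    """Write `data` into `dataset` block by block and return the number of rows written."""
    offset = 0
    for block in iter_chunks(data, chunksize):
        dataset[offset:offset + len(block)] = block
        offset += len(block)
    return offset


# Stand-in for an hdf5 dataset (assumption: same slice-assignment semantics).
target = np.zeros((5, 2))
written = write_chunked(target, ((i, i + 0.5) for i in range(5)), chunksize=2)
```

The same write_chunked call works when data is already an array, so callers do not need to know which kind of input they hold.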
Hi,
I tested the snippet given above a little more. It seems to be quite inefficient, as it scans the generator at the Python level. Another option is to let numpy scan the generator (in a chunked way, without creating the full array), which is done on the C side.
Here is a small "benchmark":
import itertools
import time
import warnings

import h5py as h5
import numpy as np

from_iterable = itertools.chain.from_iterable


# Naive method: Python-side scan of the generator
def set_dataset_from_generator_naive(dataset, generator, chunksize=1024):
    offset, npt = 0, chunksize
    while npt:
        # zip stops after chunksize items, or earlier if the generator is exhausted
        buffer = [el for ind, el in zip(range(chunksize), generator)]
        npt = len(buffer)
        if npt:  # skip the empty write once the generator is exhausted
            dataset[offset: offset+npt] = buffer
        offset += npt


# Numpy method
def set_dataset_from_generator_numpy(dataset, generator, chunksize=1024):
    nt, nsamples = dataset.shape
    dtype = dataset.dtype
    offset, count = 0, chunksize * nsamples
    # np.fromiter only builds 1D arrays, so the rows are flattened first
    flattened_generator = from_iterable(generator)
    # Full chunks: read exactly chunksize * nsamples values at a time
    while nt >= chunksize + offset:
        buffer = np.fromiter(flattened_generator, dtype, count)
        dataset[offset: offset+chunksize] = buffer.reshape(chunksize, nsamples)
        offset += chunksize
    # Last, partial chunk (skipped when nt is a multiple of chunksize)
    if offset < nt:
        warnings.warn('Last chunk : %d < %d' % (nt - offset, chunksize))
        buffer = np.fromiter(flattened_generator, dtype)
        dataset[offset: nt] = buffer.reshape(-1, nsamples)


# Benchmark: time both methods for 2**14 to 2**25 values
lnpt = 2**np.arange(14, 26)
timings = np.full((len(lnpt), 2), np.nan)
with h5.File('/tmp/test1.h5', 'w') as fid:
    for inpt, npt in enumerate(lnpt):
        dset = fid.create_dataset('/arr1', shape=(npt, 1), dtype=np.float64)
        tip = time.time()
        set_dataset_from_generator_naive(dset, ((el,) for el in range(npt)), chunksize=1024)
        timings[inpt, 0] = time.time() - tip
        del fid['/arr1']

        dset = fid.create_dataset('/arr2', shape=(npt, 1), dtype=np.float64)
        tip = time.time()
        set_dataset_from_generator_numpy(dset, ((el,) for el in range(npt)), chunksize=1024)
        timings[inpt, 1] = time.time() - tip
        del fid['/arr2']

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(8, 5))
ax.loglog(lnpt, timings, marker='o')
ax.legend(('Naïve', 'Numpy'))
ax.set_xlabel('# values')
ax.set_ylabel('Elapsed time (s)')
Conclusion: an average x2 speed-up with the set_dataset_from_generator_numpy method.
Last point: I am unsure the scan of the generator is fully done at the C level, even in the second method.
In fact, np.fromiter only builds 1D arrays, and the itertools.chain.from_iterable function is used to produce a flattened version of the generator and work around this np.fromiter limitation.
Knowing the shape of the dataset and of the expected inputs, one could produce an even more efficient method.
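One possible direction, as a sketch only: to my knowledge, recent NumPy versions (1.23 and later) let np.fromiter take a subarray dtype, so each row can be consumed directly without the itertools flattening step. The helper name rows_to_chunks is hypothetical, and I have not benchmarked this against the two methods above, so whether it is actually faster remains an assumption.

```python
import itertools

import numpy as np


def rows_to_chunks(rows, nsamples, chunksize=1024, dtype=np.float64):
    """Yield (n, nsamples) arrays with n <= chunksize, built from an iterable of rows."""
    # Subarray dtype: every item pulled from the iterator is one full row.
    # Assumption: NumPy >= 1.23, where np.fromiter accepts subarray dtypes.
    row_dtype = np.dtype((dtype, (nsamples,)))
    rows = iter(rows)
    while True:
        chunk = np.fromiter(itertools.islice(rows, chunksize), dtype=row_dtype)
        if len(chunk) == 0:
            return
        yield chunk


# Five rows of two values each, read in chunks of two rows.
chunks = list(rows_to_chunks(((i, 2 * i) for i in range(5)), nsamples=2, chunksize=2))
```

Each yielded chunk already has the row layout of the dataset, so it could be assigned to a dataset slice without a reshape.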
Thank you @FabricioS for this benchmark. I implemented the "Numpy method" in the file numerics/simulations/h5data/tools.py, which is called in H5Data. The code is currently in the develop branch and the PR passes the tests.
@FabricioS @WetzelVictor Can you check that this solves your case?
I tested the patch on the same simulation as before: a system of dimensionality 4 and 1e6 samples. With this, the program barely used any memory. Previously it used at least 1/3 of it.
No problems on my machine, just a warning from hdf5 not being happy about pyphs using generators.
Thank you @afalaize and @FabricioS !
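For anyone who wants to reproduce this kind of memory comparison without running a full simulation, here is a rough sketch using the standard-library tracemalloc module. It only compares materializing all rows at once against consuming them in chunks; the function names and the sizes (100000 rows of 4 values) are arbitrary assumptions, and no hdf5 file is involved.

```python
import itertools
import tracemalloc


def peak_bytes(func):
    """Run `func` and return the peak number of bytes traced by tracemalloc."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def rows(n):
    # Generator of n rows with 4 values each (dimensionality 4, as in the test above).
    return ((float(i),) * 4 for i in range(n))


def consume_all_at_once(n=100_000):
    # Materializes every row before doing anything with them.
    data = list(rows(n))
    return len(data)


def consume_in_chunks(n=100_000, chunksize=1024):
    # Holds at most `chunksize` rows in memory at any time.
    it = rows(n)
    total = 0
    while True:
        block = list(itertools.islice(it, chunksize))
        if not block:
            return total
        total += len(block)


peak_full = peak_bytes(consume_all_at_once)
peak_chunked = peak_bytes(consume_in_chunks)
print('all at once: %d bytes, chunked: %d bytes' % (peak_full, peak_chunked))
```

The absolute numbers depend on the Python build, so only the ratio between the two peaks is meaningful.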
Hi! Good to hear that.
I just forgot to remove the deprecation warning. You can ignore it: it no longer has any meaning.