Comments (9)
Now that the output for the correlator class is final, there is nothing more to add to the format, right? At least from the current perspective? In that case, I'll try to come up with a good way to document the format.
from pyerrors.
I agree. I would also suggest to bump the format version to 1.0.
I have by now implemented another feature that we have talked about, but I am not sure if it should enter the main branch or if I am the only one who would find this neat:
I have written routines to export and import Python dictionaries to JSON. These are implemented as wrappers around load_json and dump_to_json. The export works by:
- Browsing through the dictionary,
- replacing all supported structures (Obs, Corr, np.ndarray, list) with placeholders and adding the structures to a list.
- Afterwards, the list is passed to dump_to_json, where the dictionary with the placeholders is put into the description of the file (which then becomes a dictionary with two keys: the placeholder dict and the original description).
On import, the same thing is done in reverse. The dict may contain anything that is valid in JSON (number, string, boolean, list, dict, None) as well as the supported structures defined above.
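The export and import steps above can be sketched roughly as follows. All names here (flatten_dict, rebuild_dict, the "OBSDATA#" placeholder format) are made up for illustration, and plain lists stand in for Obs, Corr and np.ndarray; the actual implementation may differ:

```python
def flatten_dict(d, structs):
    """Replace supported structures with string placeholders,
    collecting the structures themselves in the list `structs`."""
    out = {}
    for key, value in d.items():
        if isinstance(value, dict):
            out[key] = flatten_dict(value, structs)
        elif isinstance(value, (list, tuple)):  # stand-in for Obs, Corr, np.ndarray, list
            out[key] = "OBSDATA#%d" % len(structs)  # made-up placeholder format
            structs.append(value)
        else:
            out[key] = value  # plain JSON-compatible value, kept as-is
    return out


def rebuild_dict(d, structs):
    """Inverse operation: substitute the structures back for the placeholders."""
    out = {}
    for key, value in d.items():
        if isinstance(value, dict):
            out[key] = rebuild_dict(value, structs)
        elif isinstance(value, str) and value.startswith("OBSDATA#"):
            out[key] = structs[int(value.split("#")[1])]
        else:
            out[key] = value
    return out


structs = []
desc = flatten_dict({"corr": [1.0, 2.0], "meta": {"label": "run1"}}, structs)
# `structs` would be handed to dump_to_json and `desc` stored in the
# file's description field next to the user-provided description.
restored = rebuild_dict(desc, structs)
```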
So far, my only issue is that I am not able to tell the input routine that the file is a parsed dictionary. If the JSON file is imported using load_json, you get a list with all the structures (and the description, if you choose full_output) instead of a dict.
I am not sure if it is wise to have this feature: in my workflow, it would certainly help to switch from pickles to JSON in intermediate steps. Globally, this output could become confusing when large dicts are written to JSON. However, the placeholder dict in the description could actually be pretty helpful in understanding what is saved in the file, and since it is created automatically, the user would not have to write this description themselves, provided the keys of the dictionary are self-explanatory.
@fjosw, @JanNeuendorf, what do you think? I could create a pull request as soon as everything is tested.
My feeling is that relational databases like SQLite are better suited for what you want to achieve, but I have no objections if you want to add this functionality. My concerns are that it bloats up the specification of the format and that nested code is not easy to maintain.
If one does not have such a database (they don't run well on Lustre!), it could help, though. But I understand your concerns.
I tried to use the format as it is, so that I don't have to change anything in its specification. This means that one has to know whether to parse the JSON file as a list or as a dict - which is not really nice and prevents using the wrapper to read an arbitrary JSON file and decide whether it contains a dictionary or not (you could always read the file using load_json and get a valid list of structures). Telling the two cases apart would need an additional keyword, and I don't want to add one to the format, because dicts are somewhat specific.
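The ambiguity can be illustrated with the plain json module: the serialized body looks the same either way, and only some extra metadata (here a made-up "parsed_dict" flag inside the description) could tell a reader that the file encodes a flattened dict rather than a plain list of structures:

```python
import json

# Hypothetical file layout; the key names "description", "obsdata" and the
# "parsed_dict" flag are assumptions for illustration, not the real format.
payload = {
    "description": {"parsed_dict": True, "user_description": "my run"},
    "obsdata": [[1.0, 2.0], [3.0, 4.0]],
}

loaded = json.loads(json.dumps(payload))

# Without a flag like this, a reader cannot distinguish a flattened dict
# from an ordinary list of structures.
is_dict_file = (isinstance(loaded["description"], dict)
                and loaded["description"].get("parsed_dict", False))
```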
I'll think a bit more and then I'll propose an implementation.
In the meantime I implemented io routines for the json.gz format in a separate julia package: https://github.com/fjosw/ADjson.jl
This is a very good solution, thank you!
I am still working on the documentation. Meanwhile, I am still not satisfied with the performance for a large number of Obses. When I replace
pyerrors/pyerrors/input/json.py
Lines 230 to 232 in 032b0b5
by
jsonstring = ''.join(chunk for chunk in my_encoder(indent=indent, ensure_ascii=False).iterencode(d))
the memory consumption increases again (by about 40%), but the time needed for creating the string decreases significantly, such that about 60% of the total time is spent in create_json_string and the rest is needed for writing the file to disk (as opposed to 84% vs. 16% before - this most likely depends on the size of the file/string).
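For comparison, a sketch of the opposite end of the trade-off: iterencode can also stream its chunks directly to disk, which avoids holding the full string in memory at the cost of many small writes. Plain json.JSONEncoder stands in here for the custom my_encoder from the snippet above:

```python
import json


def dump_streaming(d, fname, indent=1):
    """Write `d` as JSON by streaming encoder chunks straight to the file,
    instead of first joining them into one big string in memory."""
    enc = json.JSONEncoder(indent=indent, ensure_ascii=False)
    with open(fname, "w") as f:
        for chunk in enc.iterencode(d):
            f.write(chunk)
```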
It is probably more important to have a fast routine than one that needs the minimal amount of memory. Instead of juggling with our own code, we could resort to faster JSON implementations such as https://github.com/ijl/orjson or https://github.com/ultrajson/ultrajson . These seem, in general, to be suited to our needs, apart from one aspect:
When using an indent for writing files, all of the standard implementations split up multi-dimensional lists, such that one line is written for each element of deltas. For a correlator based on N configs and T time slices, the file then has N*(T+1) lines instead of N, as in my current implementation. The fast packages do not support changing the encoder, since these are precompiled C routines. This leaves several possibilities:
- Use a faster package. This would, in principle, still support indentation, but visual inspection of a file becomes much more difficult, because there are many more lines. One would have to check which package is most compatible with @fjosw's need to have the string in memory instead of writing directly to a file.
- Stay with the current setup (or some variation that still uses the json module but is tuned such that speedup and memory requirements are well balanced for the use case). Visual inspection of the files (if an indent is used) remains easier.
- Use some kind of mixture, where the slower setup is chosen if indentation is needed - I am not in favor of this.
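The line-blowup of indented output from the standard encoder can be checked with a tiny example (exact counts also include the bracket lines, so they differ slightly from the N*(T+1) bookkeeping above):

```python
import json

# A stand-in for deltas of a correlator: N=2 "configs", T=3 "time slices".
data = [[1, 2, 3], [4, 5, 6]]

# With an indent, every element of the inner lists lands on its own line:
# each inner list contributes T+2 lines (T elements plus two bracket lines),
# plus two lines for the outer brackets.
n_indented = len(json.dumps(data, indent=1).splitlines())

# Without an indent, everything ends up on a single line.
n_compact = len(json.dumps(data).splitlines())
```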
It kind of boils down to the question of whether we think it is important to be able to write the files in a form that does not end up on a single line. Maybe we could think about this and fix the behavior in a possible release candidate.
I have to say that I never ran into any issues with creating JSON strings or writing out the corresponding files, so I would leave that totally up to you. I don't have a strong opinion on the indentation. As long as the library can still produce a Python string containing the JSON output, I would not have to alter my current workflow.