Comments (9)
Now that the output for the correlator class is final, there is nothing more to add to the format, right? At least from the current perspective? In that case, I'll try to come up with a good way to document the format.
from pyerrors.
I agree. I would also suggest to bump the format version to 1.0.
I have by now implemented another feature that we have talked about, but I am not sure if it should enter the main branch or if I am the only one who would find this neat:
I have written routines to export and import Python dictionaries to JSON. These are implemented as wrappers around load_json and dump_to_json. The export works by:
- Browsing through the dictionary,
- replacing all supported structures (Obs, Corr, np.ndarray, list) with placeholders and adding the structures to a list.
- Afterwards, the list is passed to dump_to_json, where the dictionary with the placeholders is put into the description of the file (which then becomes a dictionary with two keys: the placeholder dict and the original description).
On import, the same thing is done in reverse. The dict may contain anything that is valid in JSON (number, string, boolean, list, dict, None) as well as the supported structures defined above.
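The export and import steps above can be sketched roughly as follows. All names here (flatten_dict, rebuild_dict, the "OBSDATA#" placeholder format) are made up for illustration, and plain lists stand in for Obs, Corr and np.ndarray; the actual implementation may differ:

```python
def flatten_dict(d, structs):
    """Replace supported structures with string placeholders,
    collecting the structures themselves in the list `structs`."""
    out = {}
    for key, value in d.items():
        if isinstance(value, dict):
            out[key] = flatten_dict(value, structs)
        elif isinstance(value, (list, tuple)):  # stand-in for Obs, Corr, np.ndarray, list
            out[key] = "OBSDATA#%d" % len(structs)  # made-up placeholder format
            structs.append(value)
        else:
            out[key] = value  # plain JSON-compatible value, kept as-is
    return out


def rebuild_dict(d, structs):
    """Inverse operation: substitute the structures back for the placeholders."""
    out = {}
    for key, value in d.items():
        if isinstance(value, dict):
            out[key] = rebuild_dict(value, structs)
        elif isinstance(value, str) and value.startswith("OBSDATA#"):
            out[key] = structs[int(value.split("#")[1])]
        else:
            out[key] = value
    return out


structs = []
desc = flatten_dict({"corr": [1.0, 2.0], "meta": {"label": "run1"}}, structs)
# `structs` would be handed to dump_to_json and `desc` stored in the
# file's description field next to the user-provided description.
restored = rebuild_dict(desc, structs)
```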
So far, my only issue is that I am not able to tell the input routine that the file is a parsed dictionary. If the JSON file is imported using load_json, you get a list with all the structures (and the description, if you choose full_output) instead of a dict.
I am not sure if it is wise to have this feature: in my workflow, it would certainly help to switch from pickles to JSON in intermediate steps. Globally, this output could become confusing when large dicts are written to JSON. However, the placeholder dict in the description could actually be pretty helpful in understanding what is saved in the file, and since it is created automatically, the user would not have to write this description themselves, provided the keys of the dictionary are self-explanatory.
@fjosw, @JanNeuendorf, what do you think? I could create a pull request as soon as everything is tested.
My feeling is that relational databases like SQLite are better suited for what you want to achieve, but I have no objections if you want to add this functionality. My concerns are that it bloats up the specification of the format and that nested code is not easy to maintain.
If one does not have such a database (they don't run well on Lustre!), it could help, though. But I understand your concerns.
I tried to use the format as it is, so that I don't have to change anything in its specification. This means that one has to know whether to parse the JSON file as a list or as a dict - which is not really nice and prevents using the wrapper to read an arbitrary JSON file and decide whether it contains a dictionary or not (you could always read the file using load_json and get a valid list of structures). Telling the two cases apart would need an additional keyword, and I don't want to add one to the format, because dicts are somewhat specific.
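The ambiguity can be illustrated with the plain json module: the serialized body looks the same either way, and only some extra metadata (here a made-up "parsed_dict" flag inside the description) could tell a reader that the file encodes a flattened dict rather than a plain list of structures:

```python
import json

# Hypothetical file layout; the key names "description", "obsdata" and the
# "parsed_dict" flag are assumptions for illustration, not the real format.
payload = {
    "description": {"parsed_dict": True, "user_description": "my run"},
    "obsdata": [[1.0, 2.0], [3.0, 4.0]],
}

loaded = json.loads(json.dumps(payload))

# Without a flag like this, a reader cannot distinguish a flattened dict
# from an ordinary list of structures.
is_dict_file = (isinstance(loaded["description"], dict)
                and loaded["description"].get("parsed_dict", False))
```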
I'll think a bit more and then I'll propose an implementation.
In the meantime I implemented io routines for the json.gz format in a separate julia package: https://github.com/fjosw/ADjson.jl
This is a very good solution, thank you!
I am still working on the documentation. Meanwhile, I am still not satisfied with the performance for a large number of Obses. When I replace
pyerrors/pyerrors/input/json.py
Lines 230 to 232 in 032b0b5
by
jsonstring = ''.join(chunk for chunk in my_encoder(indent=indent, ensure_ascii=False).iterencode(d))
the memory consumption increases again (by about 40%), but the time needed for creating the string decreases significantly, such that about 60% of the total time is spent in create_json_string and the rest is needed for writing the file to disk (as opposed to 84% vs. 16% before - this most likely depends on the size of the file/string).
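For comparison, a sketch of the opposite end of the trade-off: iterencode can also stream its chunks directly to disk, which avoids holding the full string in memory at the cost of many small writes. Plain json.JSONEncoder stands in here for the custom my_encoder from the snippet above:

```python
import json


def dump_streaming(d, fname, indent=1):
    """Write `d` as JSON by streaming encoder chunks straight to the file,
    instead of first joining them into one big string in memory."""
    enc = json.JSONEncoder(indent=indent, ensure_ascii=False)
    with open(fname, "w") as f:
        for chunk in enc.iterencode(d):
            f.write(chunk)
```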
It is probably more important to have a fast routine than one that needs the minimal amount of memory. Instead of juggling with our own code, we could resort to faster JSON implementations such as https://github.com/ijl/orjson or https://github.com/ultrajson/ultrajson . These seem, in general, to be suited to our needs, apart from one aspect:
When using an indent for writing files, all of the standard implementations split up multi-dimensional lists, such that one line is written for each element of deltas. For a correlator based on N configs and T time slices, the file then has N*(T+1) lines instead of N, as in my current implementation. The fast packages do not support changing the encoder, since these are precompiled C routines. This leaves several possibilities:
- Use a faster package. This would, in principle, still support indentation, but visual inspection of a file becomes much more difficult, because there are many more lines. One would have to check which package is most compatible with @fjosw's need to have the string in memory instead of writing directly to a file.
- Stay with the current setup (or some variation that still uses the json module but is tuned such that speedup and memory requirements are well balanced for the use case). Visual inspection of the files (if an indent is used) remains easier.
- Use some kind of mixture, where the slower setup is chosen if indentation is needed - I am not in favor of this.
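The line-blowup of indented output from the standard encoder can be checked with a tiny example (exact counts also include the bracket lines, so they differ slightly from the N*(T+1) bookkeeping above):

```python
import json

# A stand-in for deltas of a correlator: N=2 "configs", T=3 "time slices".
data = [[1, 2, 3], [4, 5, 6]]

# With an indent, every element of the inner lists lands on its own line:
# each inner list contributes T+2 lines (T elements plus two bracket lines),
# plus two lines for the outer brackets.
n_indented = len(json.dumps(data, indent=1).splitlines())

# Without an indent, everything ends up on a single line.
n_compact = len(json.dumps(data).splitlines())
```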
It kind of boils down to the question of whether we think it is important to be able to write the files in a form that does not end up on a single line. Maybe we could think about this and fix the behavior in a possible release candidate.
I have to say that I never ran into any issues with creating JSON strings or writing out the corresponding files, so I would leave that totally up to you. I don't have a strong opinion on the indentation. As long as the library can still produce a Python string containing the JSON output, I would not have to alter my current workflow.