
data-prototype's Introduction

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

Check out our home page for more information.

Matplotlib produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, Python/IPython shells, web application servers, and various graphical user interface toolkits.

Install

See the install documentation, which is generated from /doc/install/index.rst

Contribute

You've discovered a bug or something else you want to change — excellent!

You've worked out a way to fix it — even better!

You want to tell us about it — best of all!

Start at the contributing guide!

Contact

Discourse is the discussion forum for general questions and discussions and our recommended starting point.

Our active mailing lists (which are mirrored on Discourse) are:

Gitter is for coordinating development and asking questions directly related to contributing to matplotlib.

Citing Matplotlib

If Matplotlib contributes to a project that leads to publication, please acknowledge this by citing Matplotlib.

A ready-made citation entry is available.

data-prototype's People

Contributors

ianthomas23, ksunden, qulogic, story645, tacaswell, timhoffm

data-prototype's Issues

DataContainer.query() missing unit conversion?

Should DataContainer.query() include unit conversion? Or where will this happen? Ideally, I would like DataContainer to hold the data in its original form and then convert, slice, etc. upon query. Subsequent queries would start from the original data. Ideally, the end user would also be able to inspect the DataContainer to verify its correctness. Today, that is impossible for some plotting functions and hard to do for others.
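A minimal sketch of what that could look like, assuming a hypothetical container class (the UnitConvertingContainer name, the converters argument, and this query signature are all illustrative, not the prototype's actual API):

import numpy as np

class UnitConvertingContainer:
    # Holds data in its original form and only converts when queried,
    # so repeated queries always start from the untouched original data.

    def __init__(self, data, converters=None):
        self._data = dict(data)              # original data, never mutated
        self._converters = converters or {}  # per-key conversion callables

    def inspect(self):
        # the end user can always see the data exactly as it was provided
        return dict(self._data)

    def query(self, view_limits=None):
        # view_limits is accepted but ignored in this sketch
        out = {}
        for key, values in self._data.items():
            values = np.asarray(values)
            convert = self._converters.get(key)
            if convert is not None:
                values = convert(values)     # unit conversion happens here
            out[key] = values
        return out

container = UnitConvertingContainer({"x": [0, 90, 180]}, converters={"x": np.deg2rad})
print(container.inspect())  # original degrees, unchanged
print(container.query())    # radians, converted on the way out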

query MUST filter or MAY filter

Thinking a bit more about @greglucas's argument that range-restricted auto-limiting may be easier than I thought led me to this idea.

As currently prototyped, the query semantics are "here is some information, you MAY use it to restrict the data returned." If we were going to start doing range-restricted autoscaling, then maybe we should change this to "here is some information, you MUST use it to restrict the data returned."

However, while this makes the API more explicit and lets us do a bit more with it on the calling side, it imposes a good deal more complexity on the implementation side. If we consider the data

x = [0, 1, 0, 1, 0, 1]
y = [1, 2, 3, 4, 5, 6]

and the x view limit [-0.5, 0.5], a naive approach to filtering could turn the zigzag into a vertical line. If we are doing a scatter plot that is fine, but if the data is meant to be discretely sampled continuous data then that is quite wrong! So maybe you could say "well, in that case the query should add points at the edge of the view limit", which is fine, but then we have to be able to tell (if, say, we are showing both the lines and the points) the difference between the "real" points that should have markers and the "synthetic" points that are there only to preserve continuity. We could (should) push the notion of continuity down into the containers so they can make these decisions.
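A minimal numpy sketch of that failure mode, using the data and view limit above (the masking is the naive filter, not any proposed implementation):

import numpy as np

x = np.array([0, 1, 0, 1, 0, 1])
y = np.array([1, 2, 3, 4, 5, 6])

# naive MUST-style filtering: keep only points inside the x view limit [-0.5, 0.5]
mask = (x >= -0.5) & (x <= 0.5)
print(x[mask])  # [0 0 0]
print(y[mask])  # [1 3 5]

# Drawn as markers this is fine, but drawn as a line the zigzag that keeps
# crossing the limit collapses to a vertical line at x = 0, with no synthetic
# points at the view edge to preserve continuity.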

However, going to MUST is a huge step up in obligatory complexity, so I think I still prefer MAY as a pragmatic choice, and would instead think about adding another get_limits(...) with a similar API to query but that only returns ranges.

Domain specific/interesting end goal use cases

While initial development work will focus more heavily on synthetic data/toy examples of some relatively low-level artists, it is a good idea to keep in mind a wide range of applications, as supporting them is the end goal that makes this work actually useful.

In particular the following, in no particular order:

  • Oceanography/Geospatial data
    • large datasets, subsampling data
    • transforms into map coordinates
    • integrations with cartopy, etc.
  • Astronomy data
    • It is a NASA-funded grant, after all
    • large datasets
    • stress test units
    • spatial-type data
    • integration with data sources used in that domain
  • Biological data
    • Of particular interest to CZI grant
    • Microscopy data/images
  • Spectroscopy data
    • It is my own area of expertise; I have several kinds of plots that span a variety of levels of difficulty
    • Specialized domain-specific data format
    • Composing multiple artists
    • particularly hard units support (spectroscopists can never agree what units to use, and like to say that length and energy units are interoperable)
    • easy "quick" plots from a self describing data format
    • multidimensional data, slicing into, etc.
    • interactivity, stress testing the level of hooks provided to modify the plot
  • Sports analytics data
    • relatively unique visualizations
    • see https://hockeyviz.com for many examples of a wide variety of plot types (made with matplotlib)
    • potential interest for live updating

These are just a few of the domains for which this dataset-centric approach may be useful, feel free to add more.

Ideas regarding "nu"

Mutation as the name for "nu"

Data prototype has a concept of nu for performing data-to-data transforms.

Firstly, this name is not descriptive at all, and while it makes sense in the context of a pure math description, programmers are unlikely to have that context.
A more descriptive name is preferable.

  • "Transforms" (and by extension, probably other "trans-" prefixed words) are confusing because mpl already uses this term to mean something specific.
  • "converters" is used by mpl to mean specifically "unit converters", so potentially falls into a similar boat
    • OTOH, unit conversion is a specific case of this system, so potentially an option
    • That said, it's not directly the converter; rather, it is some level of adaptation that is needed

Thus I propose the term "mutator", though I am certainly open to other options.
The term "mutate", while used in a few docstrings, tests, and variable names in mpl, is not really used in any type names or public signatures outside of Transforms.mutated[xy]?, which return booleans.
It also has the advantage of using the same vowel sound as "nu", so it may help those familiar with the mathematical framing connect the concepts.

Kinds of Mutators

compute

{'x': A} -> {'x': B}

Using the same variable name but achieving a (potentially) different value.

Identity is a subset of this.

spelling on current main:

nu={"x": lambda x: x+1}

rename

{'x': A} -> {'y': A}

This is actually somewhat redundant, as it amounts to reuse + deletion.

spelling on current main:

nu={"y": lambda x: x}

reuse

{'x': A} -> {'x': A, 'y': A}

e.g. "color" expanding to "facecolor" and "edgecolor"

spelling on current main:

nu={"y": lambda x: x, "x": lambda x: x}

(or by including "x" in expected/required keys, but only providing the y lambda)

combine

{'x': A, 'y': B} -> {'z': Z}

spelling on current main:

nu={"z": lambda x, y: x+y}

spelling with #17:

mutual mutation

{'x': A, 'y': B} -> {'x': C, 'y': D}

Importantly, the computations for C and D both depend on the values of A and B.

This potentially has some performance concerns, as C and D can often be computed together, but some frameworks may require computing them separately.

spelling on current main:

nu={"x": lambda x, y: x+y, "y": lambda x, y: x-y}

spelling with #17:

NOT POSSIBLE.

While in most cases #17 will upcast a single function to a list containing only that one function (plus units, if applicable), unlike main it applies the operations sequentially.
Thus the value of x gets overridden by the first operation, and it is no longer the original value when processing y.

If x=1, y=2, then main with the nu specified above will give an output of x=3, y=-1. #17 will give an output of x=3, y=1.
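A minimal sketch of that difference, using illustrative apply_simultaneous / apply_sequential helpers (these stand in for the main and #17 behaviors and are not the actual implementations):

import inspect

def _call_with(func, values):
    # pass only the arguments the function asks for, matched by name
    names = inspect.signature(func).parameters
    return func(**{n: values[n] for n in names})

def apply_simultaneous(nu, data):
    # main-style semantics: every nu function sees the original data
    return {key: _call_with(func, data) for key, func in nu.items()}

def apply_sequential(nu, data):
    # #17-style semantics: each result overwrites the working data before
    # the next function runs, so the order of the entries matters
    working = dict(data)
    for key, func in nu.items():
        working[key] = _call_with(func, working)
    return working

nu = {"x": lambda x, y: x + y, "y": lambda x, y: x - y}
data = {"x": 1, "y": 2}
print(apply_simultaneous(nu, data))  # {'x': 3, 'y': -1}
print(apply_sequential(nu, data))    # {'x': 3, 'y': 1}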

deletion

{'x': A} -> {}

spelling on current main:

Do not provide a nu for "x" and do not include it in required/expected keys (as keys listed there get a default identity).

chaining

{'x': A} -> {'y': B} -> {'z': C}

Importantly, each step may include more complex operations.

spelling on current main:

NOT POSSIBLE, at least not in an elegant/composable way

nu={"z": lambda x: (lambda y: y+1)(x) + 1}

This is kind of the idea, but it doesn't allow inspection or mutation of the internal structure.
Nor does it provide a way to, e.g., automatically add units anywhere other than strictly before or strictly after.

If you also want to keep "y" in the final output, you need to pass (and compute) it separately.

spelling with #17:

nu={"x": [lambda x: x+1, lambda x: x+1], "z": lambda x: x}

(which will necessarily keep both x and z, set to the same value)

or

nu={"y": lambda x: x+1, "z": lambda y: y+1}

(which will necessarily keep both y and z, with different values)

While chaining was the purpose of #17, its implementation is less elegant than I would like.
It works reasonably well when chaining things with the same name, but falls apart rather quickly when trying to change names, as in this example.

The deeply ingrained order dependence feels awkward and likely to do things that are not intended.

E.g. in the last example, did the user intend for the y used in the computation of z to be the newly modified version? (Perhaps not, but maybe.)
If you flip the y and z entries, the spelling looks the same, but the result is actually different on that branch, as illustrated below.
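Continuing the illustrative apply_sequential sketch from above, flipping the two entries gives a different z even though the spelling looks symmetric (the input values here are made up):

nu_a = {"y": lambda x: x + 1, "z": lambda y: y + 1}
nu_b = {"z": lambda y: y + 1, "y": lambda x: x + 1}
data = {"x": 1, "y": 10}
print(apply_sequential(nu_a, data))  # z = (1 + 1) + 1 = 3, built from the recomputed y
print(apply_sequential(nu_b, data))  # z = 10 + 1 = 11, built from the original y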

But I think having intermediate values is useful.

A proposal

The behavior on main has advantages, including order independence of nu and making it relatively easy to do computations with multiple inputs and outputs.

The behavior on #17 allows treating units as just another nu function, i.e. separating individual transforms into single logical functions.
It also has the advantage of being able to use intermediate calculations, though with the significant drawbacks of order dependence and not being the most understandable system.

#17 introduces a list of functions for each variable to accomplish its goals.

The proposal, then, is to invert that a bit: instead of having a list of functions for each variable, have a list of "mutation stages", each of which acts like the behavior on main today.

Thus if you want precisely the behavior of main, it is identical to just having a list with one stage.

But if you want intermediate values (and units behavior), you add separate stages.

I've not yet written code for this, but I don't think it'll be that hard to do so.

I think I would lean towards separate objects to manage the interactions, rather than relying on a pure list of dictionaries.

This would allow us to give stages names, which in turn allows a (relatively) ergonomic way of saying: [MyStagePreUnits("pre units", ...), "units", MyStage("post units", ...)]

Mutation stages could each have their own "expected/required" keys, rather than just an overall set (with the default being to pass through every input key plus every nu output).
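A minimal sketch of how staged application could compose with the main-style behavior (reusing the illustrative apply_simultaneous helper above; the apply_stages name and the pass-everything-through default are assumptions, not settled design):

def apply_stages(stages, data):
    # each stage is a main-style nu dict: within a stage every function sees
    # the same input, and the stage's outputs (plus any untouched keys) become
    # the input to the next stage
    working = dict(data)
    for stage in stages:
        working = {**working, **apply_simultaneous(stage, working)}
    return working

stages = [
    {"y": lambda x: x + 1},     # pre-units stage
    {"y": lambda y: y * 2.54},  # stand-in for a "units" stage
    {"z": lambda y: y - 1},     # post-units stage
]
print(apply_stages(stages, {"x": 1}))  # {'x': 1, 'y': 5.08, 'z': 4.08}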

More radical ideas/fallout that may be enabled (but I haven't thought through completely)

  • Doing the caching at the stage level
    • if so, do containers actually just become a MutationStage?
      • That may be a bridge too far, and keeping a divide may be more useful, even if it could collapse
  • Do FuncContainers actually cease to exist, even if not all containers do?
    • The container would hold the arguments to the func rather than the functions themselves; the functions become a MutationStage.
  • Does the behavior that reaches into axes to get the transform/size become an optional MutationStage?
    • Would decouple the majority of the stack from matplotlib specific code, potentially making this idea viable for other plotting/data analysis libraries.
    • Only FuncContainer even uses it at this time (other than passing it through)
    • How does the renderer/axes info get introduced if it does become optional?
      • perhaps the "core" gets implemented independent of this, but a mpl-specific wrapper introduces this and mpl units behavior?
  • Does argument parsing/defaulting just become a "MutationStage"?

These ideas may fall a little far into "I have a hammer, so everything looks like a nail" territory, but I could see a path where each of them makes sense.

Data library sources that are worth integrating with

Some of these may be proof-of-concept level at best while they are under our control, but it is good to keep many in mind to avoid overspecializing and painting ourselves into a corner:

  • numpy arrays (or rather dictionaries of numpy arrays)
  • pandas
  • xarray
  • dask
  • tiled
  • WrightTools (It's my graduate work, so useful as a test bed for integrations into libraries themselves)
  • raw h5py
  • Zarr
  • functions
  • web APIs
  • databases
  • networkx (graph data rather than array-like)

Add some more roadmapping...

This looks very interesting, but could use some description of what it is trying to do, and what it is trying to fix. I think it is providing a draw-time interface to the data, including converting to arrays and handling any unit conversion. I think it does this in a viewport-cognizant way so that subsampling can also be handled by the wrapper? But I'm not clear on that from the examples.

All the examples should get a few more sentences, and a bit of explanation.

I appreciate that this is all proof-of-concept, not for user consumption, but I think writing some of these things down will really help clarify objectives at this point.
