mahmoud / glom

☄️ Python's nested data operator (and CLI), for all your declarative restructuring needs. Got data? Glom it! ☄️

Home Page: https://glom.readthedocs.io

License: Other

Python 100.00%
declarative data recursion python utilities cli nested-structures data-transformation apis dictionaries

glom's People

Contributors

andriyor, antoine-gallix, bhrutledge, bryanoltman, cfbolz, dependabot[bot], dfezzie, dunedan, elliotwutingfeng, hinnefe2, hynek, julian, kurtbrose, mahmoud, mborus, mfekadu, twang817, viicos, ypankovych

glom's Issues

XML Example

annoying_xml = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Simple placemark</name>
    <description>Attached to the ground. Intelligently places itself 
       at the height of the underlying terrain.</description>
    <Point>
      <coordinates>-122.0822035425683,37.42228990140251,0</coordinates>
    </Point>
  </Placemark>
</kml>
"""

# Does this work?
import glom
glom.glom(target=annoying_xml, spec='Placemark.Point.coordinates')

# What about this?
import xml.etree.ElementTree as ET
tree = ET.fromstring(annoying_xml)
glom.glom(target=tree, spec='Placemark.Point.coordinates')

# I'm trying to get this
print(tree[0][2][0].text)
# -122.0822035425683,37.42228990140251,0
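
For the record, a sketch of one way this can work today: lean on ElementTree's own find() via T's method-call support. This assumes the default glom API; the KML namespace has to be spelled out because ElementTree prefixes every tag with it.

import xml.etree.ElementTree as ET
from glom import glom, T

NS = '{http://www.opengis.net/kml/2.2}'
tree = ET.fromstring(annoying_xml)

# T records the chained find() calls and replays them against the tree
spec = T.find(NS + 'Placemark').find(NS + 'Point').find(NS + 'coordinates').text
print(glom(tree, spec))
# -122.0822035425683,37.42228990140251,0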

Automatic registration of co-installed 3rd party library types

We already have a snippet documented for automatic Django ORM type iteration. Should this behavior happen automatically if glom and Django are in the same env?

If so, I think we may want to add an environment variable disabling this behavior, for those who wish to avoid the runtime import overhead. Right now glom.core depends on nothing but the stdlib and represents a very lightweight import. That won't stay true if it tries importing from a bunch of paths, either failing or pulling in large codebases that aren't even necessarily used. (see also: mahmoud/ashes#31)
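
For illustration, a sketch of what env-var-guarded auto-registration might look like (GLOM_NO_AUTO_REGISTER is an invented name, not an actual glom setting):

import os

if not os.environ.get('GLOM_NO_AUTO_REGISTER'):
    try:
        # only pay the import cost if Django is actually importable
        from django.db.models import Manager, QuerySet
        import glom
        glom.register(Manager, iterate=lambda m: m.all())
        glom.register(QuerySet, iterate=lambda qs: qs.all())
    except ImportError:
        pass  # Django not installed; skip registration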

Discussion on testing coverage for glom

I was thinking about the challenge of calculating the "coverage" of Glom that @mahmoud raised on the Test & Code podcast.

Manually writing parameterised tests for pytest would be cumbersome, and you still wouldn't know the coverage.

In the JSON specification, there are objects and arrays: an object maps keys to values drawn from a fixed set of types, and an array can likewise contain any of those values, including nested arrays and objects.

As there is only a fixed number of types for a value, you could convert this into a feature matrix and then, after deciding on a depth N, map out the potential combinations.

Feature 1    Feature 2    Feature 3    Feature 4    Feature N
null         object       object       object       object
string       number       array        array        string

Then, converting that feature matrix into a numpy array, you could dynamically generate all of the possible combinations. Since JSON supports an infinite level of nesting, you would have to fix the depth at some limit N.

Once you have this you can calculate the possible number of combinations, create test data for each and use them as parameterised values.

Then, since glom is a DSL, you again decide on some operation depth N and calculate the same feature matrix for glom.

The possible number of combinations (and your 100% coverage target) is then the product of the two feature matrices.

You could apply the same technique to generate the tests themselves.
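
A minimal sketch of the generation idea: enumerate every JSON value shape up to a fixed depth N and feed them to pytest as parameters (the trivial identity spec stands in for a real spec matrix here):

import pytest
from glom import glom, T

LEAVES = [None, True, 0, 1.5, 's']

def json_values(depth):
    # all JSON value shapes up to the given nesting depth
    if depth == 0:
        return list(LEAVES)
    inner = json_values(depth - 1)
    return list(LEAVES) + [[v] for v in inner] + [{'k': v} for v in inner]

@pytest.mark.parametrize('target', json_values(2))
def test_identity_spec(target):
    # the identity spec T should round-trip any target
    assert glom(target, T) == target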

Extracting an element from a list

I have some array of objects like this:

target = [
 {'id': 0}, 
 {'id': 1},
 ...
]

I now want to get the object with id=0 for example. Is there a way to do this with glom?
My current solution looks like this and doesn't feel that elegant:

glom.glom(target, [lambda t: t if t['id']==0 else glom.OMIT])[0]

I also got it working with

glom.glom(target, ([lambda t: t if t['id'] == pk else glom.OMIT], glom.T[0]))

which still doesn't look that clean to me.
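
For comparison, a sketch of the same thing with Check instead of a bare lambda, assuming a recent glom (SKIP is the newer name for OMIT):

from glom import glom, Check, SKIP, T

target = [{'id': 0}, {'id': 1}]
# drop items whose 'id' isn't 0, then take the first survivor
spec = ([Check('id', equal_to=0, default=SKIP)], T[0])
print(glom(target, spec))
# {'id': 0}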

"enumerate" equivalent: give the full path with the result

Python has the extremely handy enumerate function, allowing you to iterate through a list and get the index of the current element. What's the team's view on adding similar functionality to glom? It could be very useful for data exploration/validation (c.f. #7), so that you can find your way back to where a value came from.

Don't have a good idea of the best interface for this, but something like:

target = {
    "foo": [
        {"a": 1, "b": 2},
        {"a": 3, "b": 4}
    ]
}
spec = ("foo", ["b"])
result = glom(target, spec, trace=True)
# [("foo.0.b", 2), ("foo.1.b", 4)]

So if I was using this to power a validator ensuring that, say, no "b" was greater than 3, it could do:

for path, value in result:
    if value > 3:
        print(f"Warning: {path} greater than 3!")

It seems like the information needed to power this is all there already, based on the output of a recursive Inspect. Is there already some easy way of achieving this that I'm missing?

Python 3.7 has slightly changed the repr of KeyError, breaking some tests

=================================== FAILURES ===================================
________________________________ test_coalesce _________________________________

    def test_coalesce():
        val = {'a': {'b': 'c'},  # basic dictionary nesting
               'd': {'e': ['f'],    # list in dictionary
                     'g': 'h'},
               'i': [{'j': 'k', 'l': 'm'}],  # list of dictionaries
               'n': 'o'}

        assert glom(val, 'a.b') == 'c'
        assert glom(val, Coalesce('xxx', 'yyy', 'a.b')) == 'c'

        with pytest.raises(CoalesceError) as exc_info:
            glom(val, Coalesce('xxx', 'yyy'))

        msg = exc_info.exconly()
        assert "'xxx'" in msg
        assert "'yyy'" in msg
        assert msg.count('PathAccessError') == 2
>       assert "[PathAccessError(KeyError('xxx',), Path('xxx'), 0), PathAccessError(KeyError('yyy',), Path('yyy'), 0)], [])" in repr(exc_info.value)
E       assert "[PathAccessError(KeyError('xxx',), Path('xxx'), 0), PathAccessError(KeyError('yyy',), Path('yyy'), 0)], [])" in "CoalesceError(<glom.core.Coalesce object at 0x7ffff236b550>, [PathAccessError(KeyError('xxx'), Path('xxx'), 0), PathAccessError(KeyError('yyy'), Path('yyy'), 0)], [])"
E        +  where "CoalesceError(<glom.core.Coalesce object at 0x7ffff236b550>, [PathAccessError(KeyError('xxx'), Path('xxx'), 0), PathAccessError(KeyError('yyy'), Path('yyy'), 0)], [])" = repr(CoalesceError(<glom.core.Coalesce object at 0x7ffff236b550>, [PathAccessError(KeyError('xxx'), Path('xxx'), 0), PathAccessError(KeyError('yyy'), Path('yyy'), 0)], []))
E        +    where CoalesceError(<glom.core.Coalesce object at 0x7ffff236b550>, [PathAccessError(KeyError('xxx'), Path('xxx'), 0), PathAccessError(KeyError('yyy'), Path('yyy'), 0)], []) = <ExceptionInfo CoalesceError tblen=4>.value

glom/test/test_basic.py:75: AssertionError
________________________ test_path_access_error_message ________________________

    def test_path_access_error_message():

        # test fuzzy access
        with raises(GlomError) as exc_info:
            glom({}, 'a.b')
        assert ("PathAccessError: could not access 'a', part 0 of Path('a', 'b'), got error: KeyError"
                in exc_info.exconly())
>       assert repr(exc_info.value) == "PathAccessError(KeyError('a',), Path('a', 'b'), 0)"
E       assert "PathAccessEr...'a', 'b'), 0)" == "PathAccessErr...'a', 'b'), 0)"
E         - PathAccessError(KeyError('a'), Path('a', 'b'), 0)
E         + PathAccessError(KeyError('a',), Path('a', 'b'), 0)
E         ?                             +

glom/test/test_path_and_t.py:56: AssertionError
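
For context, the repr change at a glance: Python 3.7 dropped the trailing comma from the repr of single-argument exceptions.

repr(KeyError('a'))
# Python <= 3.6: "KeyError('a',)"
# Python >= 3.7: "KeyError('a')"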

Correction to Snippet "Filter Iterable"

The "Filter Iterable" snippet

glom(['cat', 1, 'dog', 2], Check(types=str, default=OMIT))

gives me this result:

Sentinel('OMIT')

Adding a pair of brackets gives me the expected result.

glom(['cat', 1, 'dog', 2], [Check(types=str, default=OMIT)])

Result:

['cat', 'dog']

accessing item from list within a list

Not sure if this is an issue, but I was hoping to get some insight into how you might write this spec.
I have a dictionary that looks like this:

data = {"data": [{'name': 'bob',
                  'custom': {
                      'hobbies': [{
                        'swimming': True
                      }]
                    }
                  },
                 {'name': 'joey',
                  'custom': {
                      'hobbies': [{
                        'swimming': False
                      }]
                    }
                  }]
        }

and I want to get a dictionary that looks like:
{'names': ['bob', 'joey'], 'swims': [True, False]}
the closest I'm able to get is by using this spec:
spec = {"names": ("data", ["name"]), "swims": ("data", ["custom"], ["hobbies"], [['swimming']])}
but the 'swims' attribute comes back as [[True], [False]]
Is there a way to get those attributes out of the inner lists by adjusting the spec? It seems trivial, but nested lists multiple levels deep seem pretty common in JSON.
BTW glom is super handy!
Thanks, W
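
One way to flatten the inner single-element lists, a sketch assuming the data above: index into the hobbies list with T[0] inside the per-item spec.

from glom import glom, T

spec = {
    'names': ('data', ['name']),
    'swims': ('data', [('custom.hobbies', T[0], 'swimming')]),
}
print(glom(data, spec))
# {'names': ['bob', 'joey'], 'swims': [True, False]}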

Assign support S-based paths

It would be handy if Assign could take S-based paths as well as T-based ones:

>>> glom({}, Assign(T['foo'], 'bar'))
{'foo': 'bar'}

it seems intuitive that Assign(S...) would work the same way:

>>> glom({}, (Assign(S['foo'], 'bar'), S['foo']))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\users\kurt\workspace\glom\glom\mutable.py", line 100, in __init__
    path = Path(path)
  File "c:\users\kurt\workspace\glom\glom\core.py", line 277, in __init__
    % sub_parts[0])
ValueError: path segment must be path from T, not T

Macro + Compiler

just expanding ideas, getting some terminology out that can nucleate further docs / cookbook items and guide future development --

a "macro" or "glomacro" in the glom context is a function or callable that:

  • outputs a spec

  • input may be anything (valid spec, or any python objects, or both)

  • is meant to run once at spec definition time

a "compiler" or "glompiler" in a glom context is a function or callable that:

  • takes a spec as input

  • output may be anything (valid spec, or any python object)

  • is meant to run once against a spec (since specs are meant to be small in number and global, things generated from specs should be similar)

both of these concepts could eventually be supported by official Macro and Compiler types; this would be a stepping stone to important tools like coverage checking

both of these concepts (but especially compilers) could be supported by glom specs that accept and/or output other glom specs, aka "meta-specs" or "glometas"
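
For illustration, a tiny "glomacro" sketch: a plain function, run once at spec-definition time, that expands its inputs into a spec (first_of is an invented helper, not part of glom):

from glom import glom, Coalesce

def first_of(*paths, **kw):
    # macro: expand several candidate paths into a Coalesce spec
    return Coalesce(*paths, default=kw.get('default'))

spec = {'name': first_of('user.display_name', 'user.login')}
print(glom({'user': {'login': 'mahmoud'}}, spec))
# {'name': 'mahmoud'}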

Debian jessie installation

I downloaded the sources and ran sudo ./configure, which failed with:

configure: error: *** A compiler with support for C++14 language features is required.

Can you help me?

Allow deep apply function

Assign is a great enhancement to glom's feature set, but it seems to work only with fixed values, whereas the fetch syntax allows executing a function at a given path. The simplest example would be:

from glom import assign, Call, T

o = {'a': 2}

def f(x):
    return 2 * x

# classic python
o['a'] = f(o['a'])
# > {'a': 4}

assign(o, 'a', f) # doesn't work
# > {'a': <function f at 0x7effe21b91e0>}

assign(o, 'a', Call(f)) # doesn't work
# > Call(<function f at 0x7f65c8c7a1e0>, args=(), kwargs={})

assign(o, 'a', f(T['a']))
# TypeError: unsupported operand type(s) for *: 'int' and 'TType'

A more realistic use case would be:

from functools import partial
from operator import itemgetter

sort_by_quantity = partial(sorted, key=itemgetter('quantity'))
assign(fat_nested_structure, 'long.path.to.a.list.of.dicts', sort_by_quantity)
# expected: fat_nested_structure["long"]["path"]["to"]["a"]["list"]["of"]["dicts"] gets sorted by 'quantity'
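
For what it's worth, a sketch of one way this works, assuming assign accepts a Spec as the value (a Spec value is evaluated against the target before assignment):

from glom import glom, assign, Spec

o = {'a': 2}

def f(x):
    return 2 * x

# evaluate ('a', f) against o -- fetch 'a', apply f -- then assign the result
assign(o, 'a', Spec(('a', f)))
print(o)
# {'a': 4}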

Coalesce + T usage

I'm mapping a large dictionary to a smaller one (though still with many properties) using glom, and ran into a strange issue. I'm using Coalesce to easily enable a default for every property in the mapped values, and T to improve the legibility of my code. I'm not sure if I'm missing something and using it incorrectly, or if this is a difference that shouldn't happen. It basically looks like, in the context of Coalesce, T works differently than a simple string path.

Simplified version:

from glom import glom, T, Coalesce

complete_data = {
    'object': {
        'prop1': 1,
        'prop2': {
            'nested': 2,
        },
    },
}

limited_data = {
    'object': {
        'prop1': 1,
        'prop2': None,
    },
}

def full_coalesce(obj_spec):
    return {k: Coalesce(v, default=None) for k, v in obj_spec.items()}

obj = T['object']

working_spec = full_coalesce({
    'prop1': obj['prop1'],
    'nested': 'object.prop2.nested',  
})

broken_spec = full_coalesce({
    'prop1': obj['prop1'],
    'nested': obj['prop2']['nested'],
})

print glom(complete_data, working_spec)
print glom(limited_data, working_spec)
print glom(complete_data, broken_spec)
print glom(limited_data, broken_spec)

The output:

{'prop1': 1, 'nested': 2}
{'prop1': 1, 'nested': None}
{'prop1': 1, 'nested': 2}
Traceback (most recent call last):
  File "testing-glom.py", line 41, in <module>
    print glom(limited_data, broken_spec)
  File "./venv/lib/python2.7/site-packages/glom/core.py", line 919, in glom
    ret = self._glom(target, spec, path=path, inspector=inspector)
  File "./venv/lib/python2.7/site-packages/glom/core.py", line 949, in _glom
    val = self._glom(target, subspec, path=path, inspector=next_inspector)
  File "./venv/lib/python2.7/site-packages/glom/core.py", line 994, in _glom
    ret = self._glom(target, subspec, path=path, inspector=next_inspector)
  File "./venv/lib/python2.7/site-packages/glom/core.py", line 978, in _glom
    ret = _t_eval(spec, target, path, inspector, self._glom)
  File "./venv/lib/python2.7/site-packages/glom/core.py", line 678, in _t_eval
    cur = cur[arg]
TypeError: 'NoneType' object has no attribute '__getitem__'
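
A sketch of one workaround, assuming the structures above: indexing T into None raises a plain TypeError rather than a GlomError, so tell Coalesce to swallow that too via skip_exc. This is a drop-in replacement for full_coalesce:

from glom import Coalesce, GlomError

def full_coalesce_tolerant(obj_spec):
    # also catch the TypeError raised when a T path hits None mid-way
    return {k: Coalesce(v, default=None, skip_exc=(GlomError, TypeError))
            for k, v in obj_spec.items()}

With that, broken_spec behaves like working_spec on limited_data.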

cookbook: omit

show options for how to filter a list:

1. a lambda with a list comprehension on the outside: lambda t: [e for e in t if cond]

2. a lambda returning OMIT inside a list spec: [lambda v: v if cond else OMIT]

glom equivalent for assignment to nested data structures?

First of all, this is a wonderful package, and basically ideal for my core use case.

When writing tests for glommy kinds of stuff, I sometimes have to go the other way: assigning something to some deeply nested data structure:

my_obj["a"]["b"]["c"]["d"] = True

Right now, all my ways of handling this are kind of kludgy. But it seems like there could be an inverse glom that could do something like this:

iglom(my_obj, "a.b.c.d", True)

Realistic? Is there something that already does this? I bet there's something that already does this and I missed the memo.
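
glom later grew exactly this as the assign() function (and the Assign spec type); a sketch assuming a recent glom version:

from glom import assign

my_obj = {'a': {'b': {'c': {}}}}
assign(my_obj, 'a.b.c.d', True)
print(my_obj)
# {'a': {'b': {'c': {'d': True}}}}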

coverage of gloms (track unevaluated child sub-specs)

as we build larger and larger glom specs, it's important to make sure unit tests are covering all the nooks and crannies; for that reason, add a coverage ability to Inspect or wherever is appropriate --

c = Coverage(SPEC)
result = glom(target, c)
print(c.coverage_report())

something akin to this (maybe Coverage is really Inspect)

two steps:

1. walk the whole spec, collect all child specs, and put them in a set

2. during execution, remove specs from the set as they are hit

Afterwards, generate a report -- the first idea is a pretty-printed version of the glom, with children that were hit colored green and children that were missed colored red.

Beginner question for docs: Can "T" do filters and/or grouping?

Hi,

sorry if this is too obvious, but from the tutorial I can't tell whether glom supports grouping or whether that's not something it's intended to do.

What if you extend the planets example to include a category and want to sum up the moons by that category?

from glom import glom, T
target = {'system': {'planets': [{'name': 'earth', 'category':1, 'moons': 1},
    {'name': 'pluto', 'category':1, 'moons': 5},
    {'name': 'uranus', 'category':2, 'moons': 5},
    {'name': 'jupiter', 'category':2, 'moons': 69}]}}

spec = T['system']['planets']

if I wanted to sum up the moons by category, I could run

[
    sum([p['moons'] for p in glom(target, spec) if p['category'] == 1]),
    sum([p['moons'] for p in glom(target, spec) if p['category'] == 2])
]

But is there a way of adding filtering (for example sum only category 2) or grouping directly to the spec?

Or is this outside the scope of glom?

UPDATE:

A spec without the T notation works for filtering:

from glom import OMIT
spec = ('system.planets', [lambda x : x['moons'] if x['category'] == 2 else OMIT], sum)

As the T-notation is not fixed yet, I'm closing the issue.
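
Postscript: later glom versions added a grouping mode; a sketch assuming a glom recent enough to ship Group and Sum:

from glom import glom, Group, Sum, T

# group planets by category, summing moons within each group
spec = ('system.planets', Group({T['category']: Sum(T['moons'])}))
print(glom(target, spec))
# {1: 6, 2: 74}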

Relative Paths (glom equivalent of '..')

Now that #32 has added scopes to glom, it's possible to do multi-target glomming. I think the next frontier in this area is enabling relative paths.

The obvious analogy is to filesystem paths: T is . and we could add something like .., which could enable embeddable spec components and other self-referential fun.

There's an undocumented UP constant that works with T, but it doesn't work on Path, and it doesn't function like .. in that it can't exist at the start of a path.

I haven't needed UP much, but I'll be keeping an eye out for utility. If it starts getting useful, I suggest we close the gaps above, and rename it to U to keep the pattern with S and T.

Traverse glom

the job of a Traverse is to walk its target recursively and return an iterator over all of the bits (as in depth-first or breadth-first traversal) -- this could perhaps share some bits with TargetRegistry

this is very useful when combined with Check and Assign for a kind of pattern-matching strategy:

# not sure if Traverse even needs an argument or if it should just implicitly walk the current target
# maybe the argument should specify what it iterates over: just items, items + paths, etc.
glom(target, (Traverse(T), (Check(T.val, validate=lambda t: t < 0), Assign('val', 0))))  # ensure T.val >= 0

if there were an un-traverse glom possible, that would be even more powerful; but in the absence of that, being able to do something to the items being traversed is still useful

the ultimate goal of this kind of approach is a useful meta-glom -- you can imagine transformations like "set all defaults to a unique marker object that stores the path" to debug why an output is coming back as None

the ultimate, ultimate goal being useful glom-macros (glomacro?) and glom-compilation (glompilation?)
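
A sketch of the depth-first walk a Traverse might do, written as a plain generator (Traverse itself doesn't exist yet):

def traverse(target):
    # depth-first, yielding every node including the root
    yield target
    if isinstance(target, dict):
        for v in target.values():
            yield from traverse(v)
    elif isinstance(target, (list, tuple)):
        for v in target:
            yield from traverse(v)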

Tests fail: missing YAML file

The YAML test test_yaml_target fails due to a missing YAML file (test_valid.yaml). That file doesn't exist in the distribution and it's not clear where it's supposed to come from.

=================================== FAILURES ===================================
_______________________________ test_yaml_target _______________________________

    def test_yaml_target():
        cwd = os.path.dirname(os.path.abspath(__file__))
        # Handles the filepath if running tox
        if '.tox' in cwd:
            cwd = os.path.join(cwd.split('.tox')[0] + '/glom/test/')
        path = os.path.join(cwd, 'data/test_valid.yaml')
        argv = ['__', '--target-file', path, '--target-format', 'yml', 'Hello']
>       assert main(argv) == 0

glom/test/test_main.py:23:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
glom/cli.py:83: in main
    return cmd.run(argv) or 0
/nix/store/jpyxgdwhniixch3cqq9g922vrsg8pfkj-python2.7-face-0.1.0/lib/python2.7/site-packages/face/command.py:380: in run
    return inject(wrapped, kwargs)
/nix/store/jpyxgdwhniixch3cqq9g922vrsg8pfkj-python2.7-face-0.1.0/lib/python2.7/site-packages/face/sinter.py:59: in inject
    return f(**kwargs)
<string>:6: in next_
    ???
glom/cli.py:173: in mw_get_target
    _error('could not read target file %r, got: %s' % (target_file, ose))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

msg = "could not read target file '/build/glom-18.3.1/glom/test/data/test_valid.yaml', got: [Errno 2] No such file or directory: '/build/glom-18.3.1/glom/test/data/test_valid.yaml'"

    def _error(msg):
        # TODO: build this functionality into face
        print('error:', msg)
>       raise CommandLineError(msg)
E       CommandLineError: could not read target file '/build/glom-18.3.1/glom/test/data/test_valid.yaml', got: [Errno 2] No such file or directory: '/build/glom-18.3.1/glom/test/data/test_valid.yaml'

glom/cli.py:101: CommandLineError
----------------------------- Captured stdout call -----------------------------
error: could not read target file '/build/glom-18.3.1/glom/test/data/test_valid.yaml', got: [Errno 2] No such file or directory: '/build/glom-18.3.1/glom/test/data/test_valid.yaml'

deeply nested setting

I work on a project that flings around deeply nested Python structures with wild abandon. glom nicely handles the "get something from this structure even if all the branches of the path aren't there" and now I can replace some code I wrote. Yay!

The other side of things that I need to handle is setting a value in a deeply nested structure where the branches of the path may not be there.

For example, maybe something like this which uses dicts:

>>> from glom import glom_set
>>> foo = {}
>>> glom_set(foo, 'a.b.c', value=5)
>>> foo
{'a': {'b': {'c': 5}}}

There are more complex tree manipulations that could be done, but at the moment I'm thinking about setting a single leaf value.

Is manipulating deeply nested data structures in place in-scope for glom?
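
glom eventually grew this as assign() with a missing= callable that backfills absent branches; a sketch assuming a recent glom:

from glom import assign

foo = {}
# missing=dict creates each absent intermediate branch as a new dict
assign(foo, 'a.b.c', 5, missing=dict)
print(foo)
# {'a': {'b': {'c': 5}}}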

nested lists

I'm dealing with some nested lists and would like to do something like:

>>> target = {
...        'f1': 'v',
...        'f2': [{
...            'f3': 'a',
...            'f4': 0,
...            'f5': [{
...                'f6': 1,
...                'f7': 2, ...
...            }, ...], ...
...        }, ...], ...
...    }
>>> glom(target, ...)
[{'f1': 'v', 'f2.f3': 'a', ..., 'f2.f5.f6': 1},
 {'f1': 'v', 'f2.f4': 0, ..., 'f2.f5.f6': 1},
 {'f1': 'v', 'f2.f3': 'a', ..., 'f2.f5.f7': 2}, ...]

I can get to the list of 'f2.f5.f6'-style fields, but how do I merge that list with the parent values? Is this even possible?

suggestion: curried form?

Look at this beauty ❤️

from toolz import curry
from toolz.curried import pipe, map
from glom import glom

callsigns = [{'callsign':'goose'}, {'callsign':'maverick'}]

@curry
def glom_curried(spec, v):
  return glom(v, spec)

# --- userland code ---

convert_callsigns = glom_curried({'name':'callsign'})

print(pipe(callsigns,
           map(convert_callsigns),
           list))

perhaps this can be a nice API:

from glom.curried import glom
convert_callsigns = glom({'name':'callsign'})

print(pipe(callsigns,
           map(convert_callsigns),
           list))
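
Worth noting the stdlib gets most of the way there already; a sketch binding the spec with functools.partial:

from functools import partial
from glom import glom

convert_callsigns = partial(glom, spec={'name': 'callsign'})
print(list(map(convert_callsigns, callsigns)))
# [{'name': 'goose'}, {'name': 'maverick'}]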

Extract tuples from complex objects?

Is there a spec that can do the following ?

d = {'a': 1, 'b': 2}
glom(d, <some_spec_here>)
>> (1,2)

My intended use case is to extract tuples of selected data from complex items, which could then be used for sorting or grouping. The glom API easily creates dict outputs, but in this case the output needs to be an immutable tuple. It also differs from the nested list example in the tutorial because each element of the tuple is reached through a different path.
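
A sketch of one way, assuming a glom recent enough to have Fill mode (which treats tuples as tuple literals rather than as chains):

from glom import glom, Fill, T

d = {'a': 1, 'b': 2}
# each element of the tuple is its own spec, evaluated against d
print(glom(d, Fill((T['a'], T['b']))))
# (1, 2)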

Help debugging a PathAccessError

@mahmoud @kurtbrose Hi guys, there's a question I can't figure out -- please help me.

from glom import glom
target = {'system': {'planets': [{'name': 'earth'}, {'name': 'jupiter'}, {'name2': 'jupiters'}]}}

print glom(target, ('system.planets', ['name']))

Why does this raise a PathAccessError?

Does this mean the dict whose key is name2 can't be in the list?

And how can I get data like {'plants': ['earth', 'jupiter', 'jupiters']} in a Pythonic way using glom?

thank you.
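
A sketch of one answer, assuming the target above: Coalesce over the two possible keys inside the per-item spec, so items missing 'name' fall back to 'name2'.

from glom import glom, Coalesce

spec = {'plants': ('system.planets', [Coalesce('name', 'name2')])}
print(glom(target, spec))
# {'plants': ['earth', 'jupiter', 'jupiters']}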

"Rename" target mutation

Related to #81

I have the same problem with other attributes in dicts that must be renamed before actually working with them.

def tranform_data(webhook_data):
    return glom(webhook_data, (
        Assign('project.id', Spec('project.uid')),

        # Issue transformation:
        Assign('object_attributes.id', Spec('object_attributes.uid')),
        Assign('issue', Spec('object_attributes')),
    ))

And then usage:

webhook_data = {'project': {'uid': 1}, 'object_attributes': {'uid': 2}}
modified = tranform_data(webhook_data)
# =>  {'project': {'id': 1, 'uid': 1}, 'object_attributes': {'id': 2, 'uid': 2}, 'issue': {'id': 2, 'uid': 2}}

That can quickly become too complex when we really just want to rename the key. That's why I suggest adding a Rename mutation.

That's how it would work:

def tranform_data(webhook_data):
    return glom(webhook_data, (
        Rename('project.id', Spec('project.uid')),

        # Issue transformation:
        Rename('object_attributes.id', Spec('object_attributes.uid')),
        Rename('issue', Spec('object_attributes')),
    ))

webhook_data = {'project': {'uid': 1}, 'object_attributes': {'uid': 2}}
modified = tranform_data(webhook_data)
# =>  {'project': {'id': 1}, 'issue': {'id': 2}}
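
Until something like Rename exists, a sketch of the same effect with Assign plus Delete (Delete is available in later glom versions):

from glom import glom, Assign, Delete, Spec

webhook_data = {'project': {'uid': 1}}
# copy the value to the new key, then drop the old key
glom(webhook_data, (Assign('project.id', Spec('project.uid')),
                    Delete('project.uid')))
print(webhook_data)
# {'project': {'id': 1}}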

Overload an operator to invoke glom

I've been using glom a fair amount lately, but one thing that's mildly frustrating is how obvious it is that I'm using it:

a = data['owner']['name']['last']
b = glom(data, T['owner']['name']['last'])

Personally, I think the second assignment is a bit more convoluted and hard to read at a glance. However, if there was some sort of operator overload on T, I think it could be a lot cleaner:

c = data | T['owner']['name']['last']

Looking at the definition of TType, it seems like adding an operator overload would be pretty simple. In this case, it would be the __ror__ operator.

Another idea that I think might clean things up a bit is a wrapper that facilitates something like this:

d = G(data)['owner']['name']['last']

But I'm not sure how you would "conclude" the lookup and tell it to evaluate, rather than keep providing a nestable object.

true "empty" path (P?)

T = target, S = scope
P = path?
P is the 0 of path operations?

Path(T, P, P, P, P) == Path(T)

currently there are a few places where we assume T is the "blank" path; this paints us into a corner when we want to break a path into chunks, as in assign, where we split the "GOTO" prefix from the "ASSIGN THIS VALUE" tail of the path

maybe we should have a true "empty" path global, and T and S should be like different root directories

P for Path?

R for Relative?

Extension API

Right now glom is only extensible in the sense that you can register new types for automatic handling, etc.

But internally there's an emerging signature of what plugins to glom's recursion could look like. I think after #7 adds validation we could look at turning that API into a GlomContext object and exposing this.

Configurable error messages for Check

Per @jcollado's comment in #7, most validation libraries don't support custom error messages.

I've just merged #25 which brings in the first iteration of Check(), which can perform a variety of validations. With that in place, we can discuss custom error messages as an enhancement.

If you look at the docstring of Check, you'll see that there's a validate kwarg, which accepts callables. I'm thinking of making this a mapping of callables, where the key is the callable and the value is a message or message template.

Thoughts, @jcollado, @kurtbrose, others?
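
To make the "mapping of callables" idea concrete, a hypothetical sketch (this is not current glom API -- today validate only accepts callables):

from glom import Check

# hypothetical: validate as a mapping of callable -> message template
Check(validate={
    (lambda v: v > 0): 'value must be positive, got {value!r}',
    (lambda v: v % 2 == 0): 'value must be even, got {value!r}',
})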

'yield' to get around recursion limit

glom.glom may be able to use yield for the same reason twisted / asyncio / etc. do -- to form a trampoline and avoid unbounded recursion

that is, we could avoid the recursion limit in cases like this:

>>> a = {'a': glom.T}
>>> for i in range(500): a = {'a': a}
...
>>> glom.glom(1, a)
#...
  File "glom\chainmap_backport.py", line 113, in new_child
    return self.__class__(m, *self.maps)
  File "glom\chainmap_backport.py", line 63, in __init__
    self.maps = list(maps) or [{}]          # always at least one map
RuntimeError: maximum recursion depth exceeded while calling a Python object
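
A minimal trampoline sketch of the idea, independent of glom's real internals: recursive steps become generators that yield subtasks, and a flat driver loop runs them, so the interpreter stack stays shallow no matter how deeply the spec nests. eval_spec is a toy stand-in for glom's dispatch (dict specs recurse, anything else echoes the target):

def eval_spec(target, spec):
    if isinstance(spec, dict):
        out = {}
        for k, subspec in spec.items():
            out[k] = yield (target, subspec)  # ask the driver to recurse
        return out
    return target

def trampoline(target, spec):
    stack = [(eval_spec(target, spec), None)]
    while stack:
        gen, send_val = stack[-1]
        try:
            t, s = gen.send(send_val)          # generator requested a subtask
            stack[-1] = (gen, None)
            stack.append((eval_spec(t, s), None))
        except StopIteration as stop:          # generator finished
            stack.pop()
            if not stack:
                return stop.value
            stack[-1] = (stack[-1][0], stop.value)  # feed result to parent

deep = {'a': 'leaf'}
for _ in range(500):
    deep = {'a': deep}
result = trampoline(1, deep)  # no "maximum recursion depth exceeded"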

scope vs not scope example

had a real-world example pop up of using the scope vs. keeping data structures outside the scope

thought it might be helpful to rework it for the cookbook or docs -- showing, in a relatively simple case, how both approaches work

def totally_outside_glom():
    models = queryset()
    models_by_id = {model.id: model for model in models}
    values = models.values('id', 'bar')
    for valdict in values:
        valdict['model'] = models_by_id[valdict['id']]
    return glom(values,
        [{
            'foo': ('model.foo', T()),
            'bar': 'bar',
        }])


def using_scope():
    models = queryset()
    values = models.values('id', 'bar')
    return glom(values,
        [{
            'foo': S['models-by-id'][T['id']].foo(),
            'bar': 'bar',
        }],
        scope={
            'models-by-id': {model.id: model for model in models},
        }
    )

maybe it could be pushed even further down into the scope

Check() and validate

While this wasn't on my mind when I first started out, it's been pointed out to me that glom may also benefit from a validation story.

Some preliminary design work suggests the following would work well:

  • Create a Check() specifier type
  • Support type=..., value=..., and maybe other kwargs.
  • Add an action kwarg to determine what to do if the Check fails the condition ('omit', 'raise', other?)
  • Explore having the spec work like Inspect, where it can wrap a spec or appear on its own (probably after the spec it's supposed to check)
  • Extend the recursion API to pass through a validation context of some sort so that multiple errors can be registered, instead of just raising immediately (maybe a top-level glom() kwarg controlling this)

This is great for an assert-like functionality here and there, but for heavily Checked specs, we may want to have a convenience construct of some sort.
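
Check() as it eventually landed covers several of these bullets; a short sketch assuming a recent glom:

from glom import glom, Check, SKIP

# type check: keep only the ints
print(glom([1, 'x', 2], [Check(type=int, default=SKIP)]))  # [1, 2]

# validate callable: keep only the non-negative values
print(glom([1, -2, 3], [Check(validate=lambda v: v >= 0, default=SKIP)]))  # [1, 3]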

Snippets Data-driven assignment error and question

From https://glom.readthedocs.io/en/latest/snippets.html

glom({1:2, 2:3}, Call(dict, args=T.items()) <--- missing a paren,
glom({1:2, 2:3}, lambda t: dict(t.items()))
glom({1:2, 2:3}, dict)

It also crashes when fixed:

>>> from glom import glom, T, Call
>>> glom({1:2, 2:3}, Call(dict, args=T.items()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/glom/core.py", line 1446, in glom
    ret = _glom(target, spec, scope)
  File "/usr/local/lib/python3.6/dist-packages/glom/core.py", line 1462, in _glom
    return spec.glomit(target, scope)
  File "/usr/local/lib/python3.6/dist-packages/glom/core.py", line 747, in glomit
    return _eval(self.func)(*args, **kwargs)
TypeError: dict expected at most 1 arguments, got 2

I got there because I can't figure out how to transform into a dict where the keys come from the data, which is a bunch of dicts in a list (the data is from a JQL response). For example:

{ 'issues' :  [ { 'id': '999999', 'summary': 'this is issue 999999'},
                {'id': '888888', 'summary': 'this is 888888'}
              ]
}

I'd like to get this into an new dict that looks like:
{ '999999': 'this is issue 999999', '888888': 'this is issue 888888' }
The data driven example above makes some sense but I can't wrap my head around how to do it in the context of the issues list.

Will go back and try some more. Thanks.
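
For the record, a sketch of the transformation being asked for, assuming the issues structure above: map each item to a (key, value) pair, then hand the pair list to dict as the final step of the chain.

from glom import glom

target = {'issues': [{'id': '999999', 'summary': 'this is issue 999999'},
                     {'id': '888888', 'summary': 'this is 888888'}]}

# chain: fetch the list, map items to pairs, feed the pairs to dict
spec = ('issues', [lambda i: (i['id'], i['summary'])], dict)
print(glom(target, spec))
# {'999999': 'this is issue 999999', '888888': 'this is 888888'}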

Broadening CLI input formats

From @moshez and @dreid:

  • JSON inputs should be more robust to trailing commas and comments (need to research robust JSON parsers)
  • Sequential JSON (e.g., json-seq and jsonl) would be useful, too. Might need to refactor to actually support streaming.
  • Maybe YAML?

more googleable T and S?

T and S are really hard to google for -- see Q and F in Django queries.

In the documentation verbiage, refer to "T" as "TARGET" and "S" as "SCOPE"; then in all code samples we can use T and S and advise users to do the same, but if you google "what is TARGET glom" you'll get meaningful hits in our docs.

Add python_requires

python_requires allows you to bundle machine-readable metadata about which Python versions are supported into your distributions. It is best to add it before you need it: pip falls back to the most recent version of a library whose requirements your Python version meets, so you don't want to add this only after your support matrix changes.

I think this project needs to add this to its setup():

python_requires=">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*"

Note that you must also build with a recent version of setuptools and upload your packages with twine in order for PyPI to respect it.
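
For context, a sketch of where this lands in setup.py:

from setuptools import setup

setup(
    name='glom',
    # pip on an unsupported Python will skip releases carrying this marker
    python_requires=">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*",
)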

Incrementalize tutorial doc

Right now the tutorial is coherently designed, tested, and even documented. However, it doesn't build up in a way that's very beginner friendly. It establishes glom's value and then immediately uses it at an intermediate level.

I'd like it to be a bit more drawn out: use basic features first, then add a multi-line Coalesce as the finisher. The announcement blog post does a better job of this, but doesn't go far enough before jumping ahead to T.

Rationalize T and Path differences

At first, T and Path seem like they have a lot of overlap. But T is for very specific access, and Path() is for looser, more general access. It doesn't look like either is going away, or merging into the other, and that's good, because it means this is all we need to do:

  • Allow T objects inside of Path, such that Path(T) gives the same result as Path() (See note below)
  • Make _get_path support resolving Ts
  • Eliminate GlomKeyError and so forth, opting instead for the PathAccessError throughout.

The one disadvantage is that users won't be able to use T as keys in their own dict targets, but that seems a very very niche case. The one challenge is making the PathAccessError messaging around the "index" of the path with the problem reflect the T traversal.

"Move" target mutation

Sometimes we need to move data from key to key.

def format_data(webhook_data):
    return glom(webhook_data, (
        Assign('object_attributes.project', Spec('project')),
    ))

And then using it:

webhook_data = {'project': {'uid': 1}, 'object_attributes': {'uid': 2}}
modified = format_data(webhook_data)
# => {'project': {'uid': 1}, 'object_attributes': {'uid': 2, 'project': {'uid': 1}}}

It works, but this way we use more memory than we need: keeping the extra 'project' key around is not required, and that may be way too memory-consuming when the dictionaries are big and there are many inputs.

So that's why I suggest adding a Move class to handle this.

With Move it will work like so:

def format_data(webhook_data):
    return glom(webhook_data, (
        Move('object_attributes.project', Spec('project')),
    ))

webhook_data = {'project': {'uid': 1}, 'object_attributes': {'uid': 2}}
modified = format_data(webhook_data)
# => {'object_attributes': {'uid': 2, 'project': {'uid': 1}}}

If you like the idea - I would be happy to work on it.

KeyError in _t_child from weakref

I'm running into a weird issue using glom in PySpark on Databricks.

This expression:

glom(ping, (T[stub]["values"].values(), sum), default=0)

(where stub is "a11y_time")

is consistently throwing this exception when I run it on my real data:

/databricks/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    318             raise Py4JJavaError(
    319                 "An error occurred while calling {0}{1}{2}.\n".
--> 320                 format(target_id, ".", name), value)
    321         else:
    322             raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 22.0 failed 4 times, most recent failure: Lost task 0.3 in stage 22.0 (TID 243, 10.166.248.213, executor 2): org.apache.spark.api.python.PythonException:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 229, in main
    process()
  File "/databricks/spark/python/pyspark/worker.py", line 224, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/databricks/spark/python/pyspark/serializers.py", line 372, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/databricks/spark/python/pyspark/rdd.py", line 1354, in takeUpToNumLeft
    yield next(iterator)
  File "<command-26292>", line 10, in to_row
  File "<command-26292>", line 5, in histogram_measures
  File "/databricks/python/local/lib/python2.7/site-packages/glom/core.py", line 753, in __getitem__
    return _t_child(self, '[', item)
  File "/databricks/python/local/lib/python2.7/site-packages/glom/core.py", line 791, in _t_child
    _T_PATHS[t] = _T_PATHS[parent] + (operation, arg)
  File "/usr/lib/python2.7/weakref.py", line 330, in __getitem__
    return self.data[ref(key)]
KeyError: <weakref at 0x7f84c7d2f6d8; to '_TType' at 0x7f84c8933f30>

The object that's crashing it is, itself, totally unremarkable:

{'submission_date': u'20180718', 'a11y_count': None, 'a11y_node_inspected_count': None, 'a11y_service_time': None, 'toolbox_time': None, 'toolbox_count': None, 'a11y_time': None, 'branch': u'Treatment', 'client_id': u'some-random-uuid', 'a11y_picker_time': None, 'a11y_select_accessible_for_node': None}

The Python that Databricks is running looks like 2.7.12 (default, Dec 4 2017, 14:50:18) [GCC 5.4.0 20160609].

I can't reproduce it on my Mac in 2.7.14 or 2.7.12.

CLI robustness

Python doesn't have a great track record of CLIs that handle piping well. Basically, we need to break that mold and make sure glom plays nicely in shell pipelines.

Some links on the topic:

face middleware should make it easy to semi-contextually catch the error, register a different signal handler, or even inject wrapped stdin/stdout handles.

how can I find item in an deeply nested data struct?

How could I find a branch item by name (maybe name='branch-a-a-a', or maybe name='branch-b') in a deeply nested JSON document (the depth may be > 10)?
Can anyone help me? Thanks.

{
  "root": [
    {
      "type": "branch",
      "name": "branch-a",
      "children": [
        {
          "type": "branch",
          "name": "branch-a-a",
          "children": [
            {
              "type": "branch",
              "name": "branch-a-a-a",
              "children": [
                {
                  "type": "leaf",
                  "name": "leaf-a"
                },
                {
                  "type": "leaf",
                  "name": "leaf-aa"
                }
              ]
            }
          ]
        }
      ]
    },
    {
      "type": "branch",
      "name": "branch-b",
      "children": [
        {
          "type": "leaf",
          "name": "leaf-ba"
        },
        {
          "type": "leaf",
          "name": "leaf-bb"
        }
      ]
    }
  ]
}
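
For reference, a sketch of a plain recursive search over this shape of data, independent of glom, assuming the JSON above has been parsed into a dict called data (a name introduced here for illustration):

def find_by_name(node, name):
    # depth-first search through nested dicts and lists
    if isinstance(node, dict):
        if node.get('name') == name:
            return node
        node = list(node.values())
    if isinstance(node, list):
        for child in node:
            found = find_by_name(child, name)
            if found is not None:
                return found
    return None

# find_by_name(data, 'branch-a-a-a') returns the matching branch dict, or None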

python-full isn't an accepted spec format

The CLI help specifies that 'python-full' is an acceptable option for the spec format, but if you try to use it, you get an error: expected spec-format to be one of python or json
