
ijson's People

Contributors

acrisci, dav1dde, davidfischer, explodingcabbage, isagalaev, matiasg, radhermit, rtobar, selik, signalpillar


ijson's Issues

Floaty numbers are parsed as ints

common.py checks whether a decimal value equals its integer conversion (1.0 == int(1.0)) and returns 1 in that case, so values sent as 1.0 over the wire show up as 1 in Python. I get that this might in some cases be considered a feature, and JSON makes no explicit distinction between integers and floats, but it is causing issues for round-trip (de)serialization.

For reference, Python's default json.loads() does parse ["1.0", 1.0, 1] as-is, whereas ijson parses it as ["1.0", 1, 1]. Should this behaviour not be either configurable or compatible with the standard library?
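The standard library's behaviour is easy to verify, and the coercion described above can be shown with a minimal sketch (an illustration, not ijson's exact code; `coerce` is a hypothetical name):

```python
import json
from decimal import Decimal

# The standard library keeps the int/float distinction intact.
parsed = json.loads('["1.0", 1.0, 1]')
assert parsed == ["1.0", 1.0, 1]
assert isinstance(parsed[1], float) and isinstance(parsed[2], int)

# Sketch of the coercion described above: a Decimal equal to its
# integer value collapses to an int, so 1.0 round-trips as 1.
def coerce(value):
    if isinstance(value, Decimal) and value == int(value):
        return int(value)
    return value

assert coerce(Decimal('1.0')) == 1                 # float-ness is lost
assert coerce(Decimal('1.5')) == Decimal('1.5')    # non-integral survives
```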

Iterating over a collection!?!?

Hello, I have a 12 GB JSON file that I am trying to iterate over.

The format is in a collection style as follows:
{
"key1": {...},
"key2": {...},
"key3": {...},
...
}

I cannot for the life of me figure it out. Essentially, I want to process the root-level keys one at a time and handle each individually.

Any help would greatly be appreciated!

Thanks!

Connor
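One way to approach this is with a SAX-style event stream (recent ijson releases also ship a kvitems helper aimed at exactly this case; availability depends on your version): group the (event, value) pairs a basic parser emits into one top-level member at a time, so only one member is ever in memory. The sketch below assumes events shaped like ijson.basic_parse output; the wiring is illustrative, not ijson's internals.

```python
def build_value(events, first=None):
    """Consume one JSON value from a stream of (event, value) pairs
    shaped like ijson.basic_parse output."""
    event, value = first if first is not None else next(events)
    if event == 'start_map':
        obj = {}
        while True:
            ev, val = next(events)
            if ev == 'end_map':
                return obj
            obj[val] = build_value(events)  # ev == 'map_key'
    if event == 'start_array':
        arr = []
        while True:
            nxt = next(events)
            if nxt[0] == 'end_array':
                return arr
            arr.append(build_value(events, first=nxt))
    return value  # scalar: string/number/boolean/null

def root_members(events):
    """Yield (key, value) for each member of the top-level object,
    building only one member in memory at a time."""
    events = iter(events)
    assert next(events)[0] == 'start_map'
    for ev, val in events:
        if ev == 'end_map':
            return
        yield val, build_value(events)  # ev == 'map_key'
```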

Support yajl2 backend

I get an AttributeError when parsing a json file from GitHub Archive.

I installed yajl with Homebrew; it looks like it uses YAJL version 2.0.4. I installed ijson from source hosted here on GitHub with python setup.py install, which appears to be version 0.8.0. I'm using Python 2.7.3, running IPython installed from source a couple of days ago.

Code

import gzip
import ijson

with gzip.open('/Users/mike/src/gads/data/GitHubArchive/2012-03-11-15.json.gz') as g:
    for item in ijson.items(g.read(), 'repository'):
        print(item)
        break

Output

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-cfc9bcb7679a> in <module>()
      1 import gzip
----> 2 import ijson
      3 
      4 with gzip.open('/Users/mike/src/gads/data/github/2012-03-11-15.json.gz') as g:
      5     for item in ijson.items(g.read(), 'repository'):

/Users/mike/venv/gads/lib/python2.7/site-packages/ijson/__init__.py in <module>()
----> 1 from ijson.parse import JSONError, IncompleteJSONError, \
      2                         basic_parse, parse, \
      3                         ObjectBuilder, items

/Users/mike/venv/gads/lib/python2.7/site-packages/ijson/parse.py in <module>()
      3 from decimal import Decimal
      4 
----> 5 from ijson.lib import yajl
      6 
      7 C_EMPTY = CFUNCTYPE(c_int, c_void_p)

/Users/mike/venv/gads/lib/python2.7/site-packages/ijson/lib.py in <module>()
     17 yajl.yajl_alloc.restype = POINTER(c_char)
     18 yajl.yajl_gen_alloc.restype = POINTER(c_char)
---> 19 yajl.yajl_gen_alloc2.restype = POINTER(c_char)
     20 yajl.yajl_get_error.restype = POINTER(c_char)

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ctypes/__init__.pyc in __getattr__(self, name)
    376         if name.startswith('__') and name.endswith('__'):
    377             raise AttributeError(name)
--> 378         func = self.__getitem__(name)
    379         setattr(self, name, func)
    380         return func

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ctypes/__init__.pyc in __getitem__(self, name_or_ordinal)
    381 
    382     def __getitem__(self, name_or_ordinal):
--> 383         func = self._FuncPtr((name_or_ordinal, self))
    384         if not isinstance(name_or_ordinal, (int, long)):
    385             func.__name__ = name_or_ordinal

AttributeError: dlsym(0x10253eec0, yajl_gen_alloc2): symbol not found

If I comment out line 19 in lib.py

-19 yajl.yajl_gen_alloc2.restype = POINTER(c_char)
+19 # yajl.yajl_gen_alloc2.restype = POINTER(c_char)

Then the same code causes the IPython notebook to report that the kernel crashed, and when run in the IPython console it gives a segmentation fault.

In [1]: import gzip
In [2]: import ijson
In [3]: with gzip.open('/Users/mike/src/gads/data/github/2012-03-11-15.json.gz') as g:
   ...:         for item in ijson.items(g.read(), 'repository.item'):
   ...:                 print(item)
   ...:                 break
   ...:     
Segmentation fault: 11

The segfault makes sense if setting the restype of yajl_gen_alloc2 did something useful.

Just in case, is this possibly related to the way GitHub Archive does not delimit its objects (igrigorik/gharchive.org#9)?

Ability to reinitialize parser mid stream

I'm trying to parse multiple discrete JSON objects per file, e.g.:
{
"object": 1
}
{
"object": 2
}

And while I can parse the first object, I'd like to be able to reinitialize the parser so that I can start from scratch with the second object (since these items are not in a JSON array). Currently I get an 'Additional data' error when I hit the second object.
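For reference, the standard library can already walk concatenated documents via JSONDecoder.raw_decode, and newer ijson releases accept a multiple_values=True flag for this (version-dependent; worth checking your release). A stdlib sketch (`iter_concatenated` is a hypothetical helper name):

```python
import json

def iter_concatenated(text):
    """Yield each top-level value from concatenated JSON documents,
    e.g. '{"object": 1} {"object": 2}' (hypothetical helper)."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        while pos < end and text[pos].isspace():
            pos += 1               # skip whitespace between documents
        if pos >= end:
            break
        value, pos = decoder.raw_decode(text, pos)
        yield value

assert list(iter_concatenated('{"object": 1}\n{"object": 2}')) == \
    [{"object": 1}, {"object": 2}]
```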

question regarding a stream of \n separated objects

Hi @isagalaev

I have a \n separated list of JSON objects arriving via a stream.

Example:

{
  "timestamp": "2015-07-29T20:09:45.304101",
  "ip_str": "1.2.3.4"
}
{
  "timestamp": "2015-07-29T20:09:45.304101",
  "ip_str": "5.6.7.8"
}
...

I wrote a small "SAX" parser based on ijson and so far it performs nicely except for the \n end of objects. I get this error message:

Traceback (most recent call last):
  File "insert_fingerprints.py", line 71, in <module>
    for prefix, event, value in parser:
  File "/usr/local/lib/python2.7/dist-packages/ijson/common.py", line 65, in parse
    for event, value in basic_events:
  File "/usr/local/lib/python2.7/dist-packages/ijson/backends/yajl2.py", line 96, in basic_parse
    raise exception(error.decode('utf-8'))
ijson.common.IncompleteJSONError: parse error: trailing garbage
          01",   "ip_str": "1.2.3.4" } {   "timestamp": "2015-07-29T20
                     (right here) ------^

So I assume that ijson will not parse this not-quite-JSON format. However, most streaming JSON files are formatted exactly like this: \n separated objects (with or without semicolons between them).
If I put everything into an array, it works fine. But many services just send the \n separated JSON objects.

I think ijson should be tolerant of this \n separated format.

What's needed in your opinion to allow this as well?

For reference see: https://en.wikipedia.org/wiki/JSON_Streaming#Line_delimited_JSON

I guess YAJL actually supports newline-delimited JSON; at least it supports "//" comments in JSON.
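Until the library handles this natively, one workaround is to cut the stream into complete top-level objects before handing each to a parser. A sketch that tracks brace depth outside of strings (`split_json_stream` is a hypothetical helper; it assumes every top-level value is an object):

```python
import json

def split_json_stream(chunks):
    """Yield complete top-level objects from an iterable of text chunks
    by tracking brace depth outside of strings (hypothetical helper)."""
    buf, depth, in_string, escape = [], 0, False, False
    for chunk in chunks:
        for ch in chunk:
            if depth:
                buf.append(ch)
            if in_string:
                if escape:
                    escape = False
                elif ch == '\\':
                    escape = True
                elif ch == '"':
                    in_string = False
                continue
            if ch == '"':
                in_string = True
            elif ch == '{':
                if depth == 0:
                    buf = ['{']    # start collecting a new object
                depth += 1
            elif ch == '}':
                depth -= 1
                if depth == 0:
                    yield json.loads(''.join(buf))

chunks = ['{"ip_str": "1.2.3.4"}\n{"ip_str": "5.6', '.7.8"}\n']
assert [o['ip_str'] for o in split_json_stream(chunks)] == ['1.2.3.4', '5.6.7.8']
```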

Bad cut on the buffer cause UnicodeDecodeError

Hello,

I've encountered a very rare bug in version 1.1 of ijson. I've tried to create some code to reproduce it, but I only managed to make it work (more or less) on Debian stable, so I'm going to describe it instead.

Here: https://github.com/isagalaev/ijson/blob/master/ijson/backends/python.py#L88 you do a .decode('utf-8') on a partial read of a file, which works in 99.9% of cases. But UTF-8 characters can be stored on more than one byte. In 0.1% of cases the read ends in the middle of a UTF-8 character, so you don't read the full character, and therefore .decode('utf-8') fails with a message like this:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 16383: unexpected end of data

My current dirty hack to fix this is:

    except ValueError:
        old_len = len(self.buffer)
        data = self.f.read(BUFSIZE)
        try:
            self.buffer += data.decode('utf-8')
        except UnicodeDecodeError:
            data += self.f.read(1)
            self.buffer += data.decode('utf-8')
        if len(self.buffer) == old_len:
             raise common.IncompleteJSONError()

But this is not a good solution, because a UTF-8 character can span up to four bytes, so one extra byte may not be enough. My intuition is that the right approach would be to only decode the content when you really need it, but I haven't managed to code this (to be honest, I've only tried for ten minutes).
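One standard way around this is the codecs module's incremental decoding, which buffers a trailing partial multi-byte sequence instead of raising:

```python
import codecs

# An incremental decoder buffers a trailing partial multi-byte
# sequence instead of raising, so a read that cuts a character in
# half is completed by the next chunk.
decoder = codecs.getincrementaldecoder('utf-8')()

data = 'héllo'.encode('utf-8')       # b'h\xc3\xa9llo'
first, rest = data[:2], data[2:]     # cut inside the 2-byte 'é'

assert decoder.decode(first) == 'h'      # b'\xc3' is held back
assert decoder.decode(rest) == 'éllo'    # completed with b'\xa9'
```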

Here is some code to reproduce this bug (but I've only managed to make it work on Debian stable where, apparently, the encoding is somewhat broken): https://gist.github.com/Psycojoker/7225794 (you obviously need ijson 1.1 to make this work).

Kind regards and thanks a lot for ijson, it really helped me a lot :)

About performance

Hi, I am writing validater, which is based on ijson, but I found that ijson's performance is poor, especially on deeply nested lists: '[' * 8000 + ']' * 8000.
Here is my test result; it is much slower than the standard json module:

[guyskk@localhost validater]$ python benchmark.py 
------------------------ijson python------------------------
normal data: 0.040082 sec
deep data: 24.490807 sec
----------------------ijson yajl2_cffi----------------------
normal data: 0.029769 sec
deep data: 24.238428 sec
-------------------------validater--------------------------
normal data: 0.061465 sec
deep data: 0.034018 sec
-----------------------standard json------------------------
normal data: 0.004895 sec
deep data: 0.065272 sec
[guyskk@localhost validater]$ python --version
Python 3.5.1
[guyskk@localhost validater]$ 

benchmark and usage:

git clone https://github.com/guyskk/validater
cd validater
pip install -e .
python benchmark.py

IncompleteJSONError with yajl2 on python3

Hi!
I get a strange IncompleteJSONError on OS X with ijson 2.2 from PyPI and yajl2 2.1.0 from Homebrew:

$ python3 -V
Python 3.4.3

$ echo '{"a": 1}' | python3 -c "import ijson.backends.yajl2 as ijson; import sys; print(list(ijson.items(sys.stdin, '')))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/private/tmp/jsonstream/.env/lib/python3.4/site-packages/ijson/common.py", line 138, in items
    current, event, value = next(prefixed_events)
  File "/private/tmp/jsonstream/.env/lib/python3.4/site-packages/ijson/common.py", line 65, in parse
    for event, value in basic_events:
  File "/private/tmp/jsonstream/.env/lib/python3.4/site-packages/ijson/backends/yajl2.py", line 96, in basic_parse
    raise exception(error.decode('utf-8'))
ijson.common.IncompleteJSONError: lexical error: invalid char in json text.
                                      {                     (right here) ------^

Looking for any ideas how to fix this.

'FFI' object has no attribute 'new_handle'

I was trying to use the yajl2_cffi backend on a system without a compiler installed and I get an exception:
'FFI' object has no attribute 'new_handle'

Is this the expected behavior? Is a compiler required for this to work?

Ted

Unicode char handling actually breaks unicode characters

The line 109 of backends/python.py contains:

yield ('string', unicode_escape_decode(symbol[1:-1])[0])

This is actually breaking unicode characters like Ç and à in JSON strings. Changing the parser to return the symbol as-is solves the issue in all of my tests.

yield ('string', symbol[1:-1])

My knowledge is shallow in this area, so I'm not sure what kind of unicode problem this escape is meant to solve, since JSON strings must be valid unicode (or escaped unicode) symbols.
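One alternative to the unicode_escape codec is letting the json module decode the quoted token itself, which handles backslash escapes and leaves non-ASCII characters untouched (a sketch, not ijson's actual fix; `decode_json_string` is a hypothetical helper):

```python
import json

def decode_json_string(symbol):
    """Decode a raw string token (quotes included) with the json
    module itself; handles \\uXXXX and \\n escapes and leaves
    non-ASCII characters untouched (hypothetical helper)."""
    return json.loads(symbol)

assert decode_json_string('"Caf\\u00e9"') == 'Café'
assert decode_json_string('"Ça và"') == 'Ça và'      # not mangled
assert decode_json_string('"a\\nb"') == 'a\nb'
```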

Parsing error

for event in parse_value(lexer, symbol):
 File "/home/ivan/workspace/catalog/env/src/ijson/ijson/backends/python.py", line 102, in parse_value
   for event in parse_object(lexer):
 File "/home/ivan/workspace/catalog/env/src/ijson/ijson/backends/python.py", line 139, in parse_object
   for event in parse_value(lexer):
 File "/home/ivan/workspace/catalog/env/src/ijson/ijson/backends/python.py", line 111, in parse_value
   raise UnexpectedSymbol(symbol, lexer)
ijson.backends.python.UnexpectedSymbol: Unexpected symbol "5" at 10432

But the JSON is parsed fine by simplejson.

parse error

What does the 'at <number>' mean: line or byte offset? I'm parsing big JSON files, and it would be nice if the error message showed some of the JSON prior to the error to make it easier to find and fix.
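Assuming the number is a character offset into the input (which matches how the pure-Python backend's lexer reports position; an assumption, not confirmed in the thread), converting it to a line and column is straightforward (`line_col` is a hypothetical helper):

```python
def line_col(text, pos):
    """Convert a 0-based character offset to 1-based (line, column),
    assuming the offset counts characters from the start of the input
    (hypothetical helper)."""
    line = text.count('\n', 0, pos) + 1
    last_nl = text.rfind('\n', 0, pos)
    col = pos - last_nl          # also correct when last_nl == -1
    return line, col

doc = '{"a": 1,\n "b": x}'
assert line_col(doc, doc.index('x')) == (2, 7)
```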

Allow to differentiate invalid and incomplete JSON

There is a difference between JSON that is definitely invalid and JSON that is invalid, but may be made correct by completing it.

And that's the information I need to get in my project and the only reason that I may want to use an iterative JSON parser.

Examples of invalid JSON:

[][
:
{"":"",[
a
{"":"",["a"]}
{""}
}
{}",:
[{]}
[a;
[]{
{"a","b"}
[{}"],
{"":"",["a"]}

Examples of (potentially) valid JSON:

{
{"a":"b"}
{"a":"b
["
["]
""
{"a":"b"
"z
{",":
{"a":"b}
{"
[
"
[{"{"

I see that you removed just the feature that I needed in commit e079cc2, which is not nice. There definitely was no harm in having it.

Here is what I have to deal with now:

def json_still_valid(js):
    try:
        list(ijson.parse(io.BytesIO(js)))
    except ijson.JSONError as e:
        if str(e) == "Incomplete JSON data":
            pass
        elif str(e) == "Incomplete string lexeme":
            try:
                # See if adding a quote would fix it
                list(ijson.parse(io.BytesIO(js+b'"')))
            except ijson.JSONError as e:
                return False
        else:
            return False
    return True

This code seems to work without flaws (I've tested it quite extensively on the above examples and much more), but obviously it's inelegant and relies on the text messages of exceptions.

Using the old version the code goes like this:

def json_still_valid(js):
    try:
        list(ijson.parse(io.BytesIO(js)))
    except ijson.IncompleteJSONError:
        pass
    except ijson.JSONError:
        return False
    return True

But it gives some misinformation, specifically when a string literal is involved (hence the workaround in previous snippet):

[{}"],
{}",:

These things are not Incomplete or empty JSON data, they're definitely wrong.

I would be thankful if you added this back, taking this detail into account.


I also made a comparably cringy solution with json from standard library:

def json_still_valid(js):
    try:
        json.loads(js)
    except ValueError as e:
        msg = str(e)
        if msg.startswith("Expecting"):
            # Expecting value: line 1 column 4 (char 3)
            n = int(msg.rstrip(')').split()[-1])
            # If the error is "outside" the string, it can still be valid
            return n >= len(js)
        elif msg.startswith("Unterminated"):
            # Unterminated string starting at: line 1 column 1 (char 0)
            return True
        return False
    return True

Please

Please move the latest release to PyPI.

Get root items only

Hi there,

imagine that I have a big array of json objects without an identifier.
For instance:
[
{
"id":"sdfsdf",
"service_name":"sdfsfd",
"expiration":"2099-12-31T00:00:00Z",
"brief_description":"sdfsdfsdfsdfsdf"
},
{
"id":"fsdfsdfsdf",
"service_name":"sdfsdf",
"expiration":"2099-12-31T00:00:00Z",
"brief_description":"sdfsdfdsf."
},
..
...
...
...
]

How is it possible to use the items function, since there is no key for each JSON object?
Thanks a lot
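For a top-level array, ijson addresses each element with the prefix 'item' (no key is needed), so ijson.items(f, 'item') yields one object at a time. The sketch below illustrates the prefix scheme itself; it is an illustration of the addressing convention, not ijson's implementation:

```python
def prefixes(value, prefix=''):
    """Yield (prefix, value) pairs the way ijson's prefix addressing
    works: array elements get '.item', object members get '.<key>'
    (illustrative sketch)."""
    yield prefix, value
    if isinstance(value, dict):
        for k, v in value.items():
            yield from prefixes(v, f'{prefix}.{k}' if prefix else k)
    elif isinstance(value, list):
        for v in value:
            yield from prefixes(v, f'{prefix}.item' if prefix else 'item')

doc = [{'id': 'sdfsdf'}, {'id': 'fsdfsdfsdf'}]
matches = [v for p, v in prefixes(doc) if p == 'item']
assert matches == [{'id': 'sdfsdf'}, {'id': 'fsdfsdfsdf'}]
```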

TypeError when trying to use CFFI backend

When trying to use the CFFI backend as instructed in the readme, I get the following error:

$ ./test.py  <<< "{}"
Traceback (most recent call last):
  File "./test.py", line 7, in <module>
    for prefix, event, value in parser:
  File "/usr/lib64/python3.4/site-packages/ijson/common.py", line 65, in parse
    for event, value in basic_events:
  File "/usr/lib64/python3.4/site-packages/ijson/backends/yajl2_cffi.py", line 218, in basic_parse
    yajl_parse(handle, buffer)
  File "/usr/lib64/python3.4/site-packages/ijson/backends/yajl2_cffi.py", line 179, in yajl_parse
    result = yajl.yajl_parse(handle, buffer, len(buffer))
TypeError: initializer for ctype 'unsigned char *' must be a bytes or list or tuple, not str

It seems to happen in all cases, a trivial example:

#!/usr/bin/env python3
import ijson.backends.yajl2_cffi as ijson
import sys


parser = ijson.parse(sys.stdin)
for prefix, event, value in parser:
    print(prefix, event, value)

The same code, when the import is changed to just import ijson, works just fine:

$ ./test.py <<< "{}"
 start_map None
 end_map None

OS: Gentoo Linux
ijson version: 2.3
yajl version: 2.1.0
cffi version: 1.8.3
python version: 3.4.3
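The TypeError message points at the cause: the cffi backend hands yajl raw byte buffers, but a text-mode stream yields str. Opening the input in binary mode (sys.stdin.buffer, or a file opened 'rb') avoids it. A small demonstration of the distinction; the commented ijson lines mirror the report's test.py and are not executed here:

```python
import io

# The C backends need byte buffers, so the input stream must yield
# bytes, not str.
binary = io.BytesIO(b'{}')
text = io.StringIO('{}')

assert isinstance(binary.read(), bytes)   # what the cffi backend needs
assert isinstance(text.read(), str)       # what triggers the TypeError

# Hypothetical fix for the report's test.py (not executed here):
#
#   import sys
#   import ijson.backends.yajl2_cffi as ijson
#   parser = ijson.parse(sys.stdin.buffer)   # binary stdin
```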

"JSONError: Additional Data" when parsing simple JSON string

I am trying to iteratively parse a large JSON file. However, after the first few events ijson raises JSONError: Additional data. I have looked into ijson's source code but fail to understand what the problem is.

Here is a minimal working example; my eventual goal is to extract all objects with '.com' in body.url.

import io
import ijson

fh = io.StringIO('''{"body": {"kids": [487171, 15, 234509, 454410, 82729], "descendants": 15, "url": "http://ycombinator.com", "title": "Y Combinator", "by": "pg", "score": 61, "time": 1160418111, "type": "story", "id": 1}, "source": "firebase", "id": 1, "retrieved_at_ts": 1435938464}
{"body": {"kids": [454411], "descendants": 0, "url": "http://www.paulgraham.com/mit.html", "title": "A Student's Guide to Startups", "by": "phyllis", "score": 16, "time": 1160418628, "type": "story", "id": 2}, "source": "firebase", "id": 2, "retrieved_at_ts": 1435938464}''')

parser = ijson.parse(fh)
for prefix, event, value in parser:
    print(prefix, event, value)

returns

In [6]:  start_map None
 map_key body
body start_map None
body map_key kids
body.kids start_array None
body.kids.item number 487171
body.kids.item number 15
body.kids.item number 234509
body.kids.item number 454410
body.kids.item number 82729
body.kids end_array None
body map_key descendants
body.descendants number 15
body map_key url
body.url string http://ycombinator.com
body map_key title
body.title string Y Combinator
body map_key by
body.by string pg
body map_key score
body.score number 61
body map_key time
body.time number 1160418111
body map_key type
body.type string story
body map_key id
body.id number 1
body end_map None
 map_key source
source string firebase
 map_key id
id number 1
 map_key retrieved_at_ts
retrieved_at_ts number 1435938464
 end_map None
---------------------------------------------------------------------------
JSONError                                 Traceback (most recent call last)
<ipython-input-6-db2d9e444fc8> in <module>()
----> 1 import codecs, os;__pyfile = codecs.open('''/tmp/py3985v6f''', encoding='''utf-8''');__code = __pyfile.read().encode('''utf-8''');__pyfile.close();os.remove('''/tmp/py3985v6f''');exec(compile(__code, '''/home/peon/edu/Econ 298 - Second Year Paper/src/data_management/minimal_working_example_ijson_failing.py''', 'exec'));

/home/peon/edu/Econ 298 - Second Year Paper/src/data_management/minimal_working_example_ijson_failing.py in <module>()
      9 
     10 parser = ijson.parse(fh)
---> 11 for prefix, event, value in parser:
     12     print(prefix, event, value)

/usr/local/lib/python3.4/dist-packages/ijson/common.py in parse(basic_events)
     63     '''
     64     path = []
---> 65     for event, value in basic_events:
     66         if event == 'map_key':
     67             prefix = '.'.join(path[:-1])

/usr/local/lib/python3.4/dist-packages/ijson/backends/python.py in basic_parse(file, buf_size)
    190         pass
    191     else:
--> 192         raise common.JSONError('Additional data')
    193 
    194 

JSONError: Additional data

asyncio backend

Recently @ethanfrey raised a question about using ijson with asyncio coroutines. To my surprise, I found an asyncio.py module in the ijson.backends package of version 2.2 downloaded from PyPI. There is no such file in the source repository for the same tag.

Unfortunately, that module doesn't implement any asyncio support and seems to be just an outdated Python backend that accidentally leaked into the PyPI package.

However, the question still stands. @isagalaev, are there any plans for an asyncio backend?

Publish current master to PyPI

The current release on PyPI is a bit behind the current master on GitHub. The latest commit is one I personally care about, so I'd love to see the current state on PyPI. As the int/float parsing thing was a bug in my environment, I find myself building the current master and distributing that build to people running into it, or still monkey-patching a fix into code.

Can the current state of things be released to PyPI or is there something blocking that?

Numeric decimals being converted into Decimal() objects

I'm reading from a geographic shapefile that has, among other data types, a list of coordinates, such as...

{
  "type": "FeatureCollection",
  "features": [
{
  "geometry": {
    "type": "Polygon", 
    "coordinates": [
      [
        [
          -123.09098499999993, 
          45.77992400000005
        ], 

The coordinates are getting converted in ijson into:

[Decimal('-123.09098499999993'), Decimal('45.77992400000005')]

I'm not sure whether this is a bug or a feature, but it's making my database choke. =)

My ijson loader looks like this:

def loader(shapefile, collection):
    jf = open(shapefile)
    data = ijson.items(jf,'features')
    for d in data:
        try:
            db[collection].insert(d)
        except:
            pass

My vanilla loader, without ijson, looked like this (and worked, but not on the large files I need ijson for)

def loader(shapefile, collection):
    jf = open(shapefile)
    data = json.load(jf)
    for d in data['features']:
        db[collection].insert(d)

My database is complaining about the Decimal() objects. Where would I add something to convert only the Decimal() objects into their numeric values?
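ijson yields numbers with a fractional part as decimal.Decimal to avoid losing precision, so one option is converting them on the way out, before the database insert (`plain_numbers` is a hypothetical helper; some newer ijson releases also expose an option to yield floats directly, which is version-dependent):

```python
from decimal import Decimal

def plain_numbers(value):
    """Recursively replace Decimal instances with int or float so the
    result can be handed to drivers that reject Decimal
    (hypothetical helper; float conversion may round very long
    decimals to the nearest double)."""
    if isinstance(value, Decimal):
        if value == value.to_integral_value():
            return int(value)
        return float(value)
    if isinstance(value, list):
        return [plain_numbers(v) for v in value]
    if isinstance(value, dict):
        return {k: plain_numbers(v) for k, v in value.items()}
    return value

coords = [Decimal('-123.09098499999993'), Decimal('45.77992400000005')]
assert plain_numbers({'coordinates': [coords]}) == \
    {'coordinates': [[-123.09098499999993, 45.77992400000005]]}
```

With a converter like this, each `d` from the items generator would be passed through `plain_numbers(d)` before the insert.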

error parsing 1.1e+1

I get an exception when parsing float numbers with a positive exponent sign.

Code:
import ijson

f=open("a.dat","rt")
objects = ijson.items(f, 'item')
objects.next()
f.close()
a.dat: ---------------------
[
{
"a":1.1e+3
}
]

Traceback (most recent call last):
  File "desimalerror.py", line 6, in <module>
    objects.next()
  File "D:\comp\Python27\lib\site-packages\ijson-1.0-py2.7.egg\ijson\common.py", line 143, in items
    current, event, value = next(prefixed_events)
  File "D:\comp\Python27\lib\site-packages\ijson-1.0-py2.7.egg\ijson\common.py", line 63, in parse
    for event, value in basic_events:
  File "D:\comp\Python27\lib\site-packages\ijson-1.0-py2.7.egg\ijson\backends\python.py", line 162, in basic_parse
    for value in parse_value(lexer):
  File "D:\comp\Python27\lib\site-packages\ijson-1.0-py2.7.egg\ijson\backends\python.py", line 102, in parse_value
    for event in parse_array(lexer):
  File "D:\comp\Python27\lib\site-packages\ijson-1.0-py2.7.egg\ijson\backends\python.py", line 122, in parse_array
    for event in parse_value(lexer, symbol):
  File "D:\comp\Python27\lib\site-packages\ijson-1.0-py2.7.egg\ijson\backends\python.py", line 105, in parse_value
    for event in parse_object(lexer):
  File "D:\comp\Python27\lib\site-packages\ijson-1.0-py2.7.egg\ijson\backends\python.py", line 143, in parse_object
    for event in parse_value(lexer):
  File "D:\comp\Python27\lib\site-packages\ijson-1.0-py2.7.egg\ijson\backends\python.py", line 112, in parse_value
    number = Decimal(symbol) if '.' in symbol else int(symbol)
  File "D:\comp\Python27\lib\decimal.py", line 548, in __new__
    "Invalid literal for Decimal: %r" % value)
  File "D:\comp\Python27\lib\decimal.py", line 3866, in _raise_error
    raise error(explanation)
decimal.InvalidOperation: Invalid literal for Decimal: '1.1e'
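The trace shows the lexer stopping at '1.1e', i.e. the '+' was not kept as part of the number token. A sketch of a number pattern that keeps the exponent sign attached (an illustration, not ijson's actual lexer):

```python
import re
from decimal import Decimal

# A JSON number per RFC 8259: optional sign, integer part, optional
# fraction, optional exponent with its own sign. Lexing with this
# pattern keeps '1.1e+3' as one token.
JSON_NUMBER = re.compile(r'-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?')

assert JSON_NUMBER.fullmatch('1.1e+3')
assert JSON_NUMBER.fullmatch('-0.5E-10')
assert not JSON_NUMBER.fullmatch('1.1e')     # the truncated token above

# Decimal itself is happy with the full token:
assert Decimal('1.1e+3') == Decimal('1100')
```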

Claims to be a finite state machine, but really is not

print [i for i in ijson.items(StringIO('['* 500), '')]

will generate long trace-back like:

...

.../ijson/backends/python.pyc in parse_array(lexer)
    136         if symbol != ']':
    137             while True:
--> 138                 for event in parse_value(lexer, symbol, pos):
    139                     yield event
    140                 pos, symbol = next(lexer)

.../ijson/backends/python.pyc in parse_value(lexer, symbol, pos)
    114             yield ('boolean', False)
    115         elif symbol == '[':
--> 116             for event in parse_array(lexer):
    117                 yield event
    118         elif symbol == '{':

.../ijson/backends/python.pyc in parse_array(lexer)
    136         if symbol != ']':
    137             while True:
--> 138                 for event in parse_value(lexer, symbol, pos):
    139                     yield event
    140                 pos, symbol = next(lexer)

.../ijson/backends/python.pyc in parse_value(lexer, symbol, pos)
    114             yield ('boolean', False)
    115         elif symbol == '[':
--> 116             for event in parse_array(lexer):
    117                 yield event
    118         elif symbol == '{':

...

But instead, it should handle deeply nested arrays for that case by consuming memory (heap), not the stack!
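Nesting depth does not inherently cost Python stack frames; an explicit counter keeps it on the heap. A bracket-counting sketch for the stress case above (`max_nesting` is a hypothetical helper; it ignores brackets inside strings for brevity, and does not require the input to be balanced):

```python
def max_nesting(text):
    """Track nesting with an explicit counter: depth costs heap, not
    Python stack frames (illustrative sketch; ignores brackets
    inside strings)."""
    depth = deepest = 0
    for ch in text:
        if ch in '[{':
            depth += 1
            deepest = max(deepest, depth)
        elif ch in ']}':
            if depth == 0:
                raise ValueError('unbalanced closing bracket')
            depth -= 1
    return deepest

assert max_nesting('[' * 500 + ']' * 500) == 500
assert max_nesting('[' * 50000) == 50000     # no RecursionError
```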

ijson raises AttributeError when importing yajl2 backends on RHEL6

Attempting to import either the yajl2_cffi or yajl2 backends on RHEL6 throws an AttributeError instead of ImportError.

Quick reproduction case:

try:
    import ijson.backends.yajl2_cffi as ijson
except ImportError:
    try:
        import ijson.backends.yajl2 as ijson
    except ImportError:
        import ijson

Instead of importing the pure-python backend, this results in the following stacktraces depending on whether cffi is installed or not:

With cffi:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/root/tmp/lib/python2.6/site-packages/ijson/backends/yajl2_cffi.py", line 65, in <module>
    yajl = backends.find_yajl_cffi(ffi, 2)
  File "/root/tmp/lib/python2.6/site-packages/ijson/backends/__init__.py", line 41, in find_yajl_cffi
    require_version(yajl.yajl_version(), required)
  File "/root/tmp/lib/python2.6/site-packages/cffi/api.py", line 866, in __getattr__
    make_accessor(name)
  File "/root/tmp/lib/python2.6/site-packages/cffi/api.py", line 862, in make_accessor
    accessors[name](name)
  File "/root/tmp/lib/python2.6/site-packages/cffi/api.py", line 792, in accessor_function
    value = backendlib.load_function(BType, name)
AttributeError: function/symbol 'yajl_version' not found in library 'libyajl.so.1': /usr/lib64/libyajl.so.1: undefined symbol: yajl_version

Without cffi:

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/root/tmp/lib/python2.6/site-packages/ijson/backends/yajl2.py", line 12, in <module>
    yajl = backends.find_yajl_ctypes(2)
  File "/root/tmp/lib/python2.6/site-packages/ijson/backends/__init__.py", line 29, in find_yajl_ctypes
    require_version(yajl.yajl_version(), required)
  File "/usr/lib64/python2.6/ctypes/__init__.py", line 366, in __getattr__
    func = self.__getitem__(name)
  File "/usr/lib64/python2.6/ctypes/__init__.py", line 371, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib64/libyajl.so.1: undefined symbol: yajl_version

The fix is to wrap require_version(yajl.yajl_version(), required) at lines 29 and 41 in a try/except that catches the AttributeError and raises an ImportError. I can open a pull request if you'd like.

Edit: clean up stacktrace formatting.
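The same fallback can also be done on the caller's side until the library is fixed. A generic sketch (`load_first` is a hypothetical helper) that treats the AttributeError from a broken yajl install like an ImportError:

```python
import importlib

def load_first(candidates):
    """Import the first module that loads cleanly, treating the
    AttributeError raised by broken yajl installs like an
    ImportError (hypothetical helper)."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except (ImportError, AttributeError):
            continue
    raise ImportError('no usable backend among %r' % (candidates,))

# With ijson installed one would call:
#   ijson = load_first(['ijson.backends.yajl2_cffi',
#                       'ijson.backends.yajl2',
#                       'ijson'])

assert load_first(['no_such_backend_xyz', 'json']).__name__ == 'json'
```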

internationalization support

In Python I've been working with some JSON that has a lot of kanji in it, and the parser fails decoding its internal buffer to UTF-8 if a read happens to split a kanji string in the wrong place. I've been using ijson because the JSON files can be quite large (2 GB) and the standard parser wants to read it all into memory, which caused swapping and made my machine crawl, but I may have to use it if I can't figure out how to deal with this situation.

Ability to iteratively parse a json array

I've got this kind of list:

[{"feed": {"map": 2}}, {"feed": "alert"}]

And I'd like to iteratively parse out the two objects {"feed": {"map": 2}} and {"feed": "alert"}. Right now, unfortunately, the full JSON gets parsed if I use ijson.items.

Python3 compatibility issue

Running this very simple example with python3:

import ijson

parser = ijson.parse(open('./tree.json'))

for prefix, event, value in parser:
    print(prefix)
    print(event)
    print(value)

Gives this error:

Traceback (most recent call last):
  File "ij.py", line 6, in <module>
    for prefix, event, value in parser:
  File "/usr/lib/python3.4/site-packages/ijson/common.py", line 58, in parse
    for event, value in basic_events:
  File "/usr/lib/python3.4/site-packages/ijson/backends/python.py", line 178, in basic_parse
    for value in parse_value(lexer):
  File "/usr/lib/python3.4/site-packages/ijson/backends/python.py", line 107, in parse_value
    pos, symbol = next(lexer)
  File "/usr/lib/python3.4/site-packages/ijson/backends/python.py", line 26, in Lexer
    buf = f.read(buf_size)
  File "/usr/lib/python3.4/codecs.py", line 493, in read
    data = self.bytebuffer + newdata
TypeError: can't concat bytes to str

It works just fine with Python 2.

Multiple file opens required for multiple iterators on same JSON

Something that might be clarified in the documentation:

While ijson returns a generator for a specific item within the JSON, stepping through that generator steps through all items, affecting any additional iterators on the same file object. Example:

f = open(filepath, 'r')
g1 = ijson.items(f, Category1)
g2 = ijson.items(f, Category2)
[...]
f.close()

If I call g1.next() within [...], it will yield the first Category1 item. If I then call g2.next(), it will yield the second Category2 item.

I can work around this by using different file opens:

f1 = open(filepath, 'r')
f2 = open(filepath, 'r')
g1 = ijson.items(f1, Category1)
g2 = ijson.items(f2, Category2)
[...]
f1.close()
f2.close()

CFFI Instead of Ctypes

So, I was playing around with parsing huge JSON files (19 GiB; the test file is ~520 MiB) and wanted to try some sample code with PyPy. It turns out PyPy needed ~1:30-2:00, whereas Python 2.7 needed ~13 seconds (the pure Python implementation was close, at ~8 minutes).

Apparently ctypes is really bad performance-wise, especially on PyPy. So I made a quick CFFI mockup: https://gist.github.com/Dav1dde/c509d472085f9374fc1d

Before:

Python 2.7: python -m emfas.server size dumps/echoprint-dump-1.json  11.89s user 0.36s system 98% cpu 12.390 total
PYPY: python -m emfas.server size dumps/echoprint-dump-1.json  117.19s user 2.36s system 99% cpu 1:59.95 total

After (CFFI):

Python 2.7: python jsonsize.py ../dumps/echoprint-dump-1.json  8.63s user 0.28s system 99% cpu 8.945 total
PyPy: python jsonsize.py ../dumps/echoprint-dump-1.json  4.04s user 0.34s system 99% cpu 4.392 total

Maybe it would make sense to add an additional CFFI backend which gets chosen over ctypes if CFFI is available.


Testcode:

import sys

_IGNORED_SIZE_EVENTS = ('end_map', 'end_array', 'map_key')

def size(ijson, path):
    s = 0
    with open(path) as f:
        events = ijson.parse(f)

        for space, event, data in events:
            if space == 'item' and event not in _IGNORED_SIZE_EVENTS:
                s += 1

    return s


def main():
    # from ijson.backends import yajl2 as ijson
    import cffibackend

    path = sys.argv[1]
    print size(cffibackend, path)


if __name__ == '__main__':
    main()

Doing this json procedure more efficiently in ijson?

(x-post from Stack Overflow)

I have this massive JSON file, and I run out of memory when trying to read it into Python. How would I implement a similar procedure using ijson?

import pandas as pd

#There are (say) 1m objects - each its own JSON object - in this file.
with open('my_file.json') as json_file:
    data = json_file.readlines()
    #So I take a list of these json objects
    list_of_objs = [obj for obj in data]

#But I only want about 200 of the json objects
desired_data = [obj for obj in list_of_objs if obj['feature']=="desired_feature"]

Basically, the file is a list of json objects. I want a list of json objects where the objects all have a certain value for a particular key. For such json objects, I want to include every attribute.

The file itself contains a list of objects like:


{
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",
    "stars": 4,
    "date": "2016-03-09",
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",
    "useful": 0,
    "funny": 0
}
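If the file really is one JSON object per line (which the readlines approach suggests), the filtering can be done one object at a time so that only the matches are held in memory. A sketch with the standard library, where the `'feature'` key and `"desired_feature"` value come from the question:

```python
import json

def filter_objects(lines, key, wanted):
    """Yield only the objects whose `key` equals `wanted`,
    parsing one line at a time to keep memory use flat."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        obj = json.loads(line)
        if obj.get(key) == wanted:
            yield obj

# e.g. desired = list(filter_objects(open('my_file.json'),
#                                    'feature', 'desired_feature'))
```

If the file is instead a single top-level JSON array, the equivalent streaming loop with ijson would be `(o for o in ijson.items(f, 'item') if o.get('feature') == 'desired_feature')`.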

yajl2_cffi backend type error

With the following minimal example

import sys
import ijson.backends.yajl2_cffi as ijson
# import ijson

items = ijson.items(open(sys.argv[1]), 'item')
for item in items:
    print(item)

and a simple json file containing, for example, [1,2,3]

I get the following error

Traceback (most recent call last):
File "minimal_example.py", line 6, in
for item in items:
File "/usr/local/lib/python3.5/dist-packages/ijson/common.py", line 138, in items
current, event, value = next(prefixed_events)
File "/usr/local/lib/python3.5/dist-packages/ijson/common.py", line 65, in parse
for event, value in basic_events:
File "/usr/local/lib/python3.5/dist-packages/ijson/backends/yajl2_cffi.py", line 218, in basic_parse
yajl_parse(handle, buffer)
File "/usr/local/lib/python3.5/dist-packages/ijson/backends/yajl2_cffi.py", line 179, in yajl_parse
result = yajl.yajl_parse(handle, buffer, len(buffer))
TypeError: initializer for ctype 'unsigned char *' must be a bytes or list or tuple, not str

There is no error when using the pure python backend.

Using Python 3.5, Ubuntu 16.04 with OS-packaged yajl2. ijson==2.3; cffi==1.7.0

Will be happy to provide other information.
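The traceback suggests the file object is yielding str while the C layer needs bytes. On Python 3, the likely workaround (an assumption based on the error message, not a confirmed ijson fix) is to open the file in binary mode so reads produce bytes; `open_for_ijson` below is our own helper name, not part of ijson:

```python
import json
import os
import tempfile

def open_for_ijson(path):
    # The C-backed ijson backends hand the buffer to yajl as
    # 'unsigned char *', so the file must be read as bytes.
    return open(path, 'rb')

# Demo: text mode yields str (what triggers the TypeError),
# binary mode yields bytes (what cffi accepts).
fd, path = tempfile.mkstemp(suffix='.json')
with os.fdopen(fd, 'w') as f:
    json.dump([1, 2, 3], f)

with open(path) as f:
    assert isinstance(f.read(), str)      # str -> cffi raises TypeError
with open_for_ijson(path) as f:
    assert isinstance(f.read(), bytes)    # bytes -> accepted by yajl
os.remove(path)
```

With that, `ijson.items(open(sys.argv[1], 'rb'), 'item')` is worth trying in the minimal example above.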

TypeError with Python 3.6

With yajl 2.1.0 and cffi 1.9.1

import ijson.backends.yajl2_cffi as ijson
f = ijson.parse(open('json_file.json', 'r'))
next(f)

produces

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-468f0afdf1b9> in <module>()
----> 1 next(f)

./py36/lib/python3.6/site-packages/ijson-2.3-py3.6.egg/ijson/common.py in parse(basic_events)
     63     '''
     64     path = []
---> 65     for event, value in basic_events:
     66         if event == 'map_key':
     67             prefix = '.'.join(path[:-1])

./py36/lib/python3.6/site-packages/ijson-2.3-py3.6.egg/ijson/backends/yajl2_cffi.py in basic_parse(f, buf_size, **config)
    216             # this calls the callbacks which will
    217             # fill the events list
--> 218             yajl_parse(handle, buffer)
    219
    220             if not buffer and not events:

./py36/lib/python3.6/site-packages/ijson-2.3-py3.6.egg/ijson/backends/yajl2_cffi.py in yajl_parse(handle, buffer)
    177 def yajl_parse(handle, buffer):
    178     if buffer:
--> 179         result = yajl.yajl_parse(handle, buffer, len(buffer))
    180     else:
    181         result = yajl.yajl_complete_parse(handle)

TypeError: initializer for ctype 'unsigned char *' must be a bytes or list or tuple, not str

python backend retains memory while streaming!

First of all, this library's been very helpful to me, thanks for all your time on it!

I ran into a weird issue where as I streamed through some json objects, the process held onto more and more memory. This is only with the python backend.

We managed to track the issue down to the Lexer method. https://github.com/isagalaev/ijson/blob/master/ijson/backends/python.py#L25

There, buf is being appended to continuously, but none of the old parsed data is ever discarded. We confirmed this is the issue by adding

            lexeme = match.group()
            yield discarded + match.start(), lexeme
            pos = match.end()
+           buf = buf[pos:]
+           pos = 0
        else:
            data = f.read(buf_size)
            if not data:
^ those two lines. Not a recommended solution, but it did stop the memory from growing. I ended up switching to the yajl2_cffi backend as it's faster anyway, but this tripped me up for a bit!
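The trimming idea can be shown with a self-contained toy lexer (whitespace-separated tokens instead of real JSON lexemes; all names here are illustrative, not ijson's code) whose buffer stays bounded because consumed input is discarded:

```python
import io
import re

LEXEME = re.compile(r'\S+')

def tokens(f, buf_size=1024):
    """Toy lexer illustrating the buffer-trimming fix: consumed input
    is dropped, so memory stays O(buf_size) instead of growing with
    the whole stream."""
    buf = ''
    pos = 0
    discarded = 0          # chars dropped so far; keeps offsets correct
    while True:
        match = LEXEME.search(buf, pos)
        if match and match.end() < len(buf):
            yield discarded + match.start(), match.group()
            pos = match.end()
            # the fix: discard everything already consumed
            discarded += pos
            buf = buf[pos:]
            pos = 0
        else:
            data = f.read(buf_size)
            if not data:
                break
            buf += data
    match = LEXEME.search(buf, pos)   # flush a trailing lexeme, if any
    if match:
        yield discarded + match.start(), match.group()
```

Note that trimming must also track the discarded count, otherwise the reported offsets go wrong, which is presumably why the two-line patch above is "not a recommended solution".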

add more documentation to the package

Hi, I am doing some research and found this package; it is really helpful for parsing my very large JSON files. The only thing is that the documentation is not sufficient at all, and adding docs would help this package get more attention:

This SO answer actually provides some brief but useful documentation already

Encoding

Hi, is it possible to set the encoding when parsing?

ijson on pypi is out of date.

Thanks so much for this lib. It was exactly what I was looking for. That said, is it possible to push a new version to PyPI? I encountered issue #15 with it and was glad to see it was fixed, but only after the last push to PyPI. It would be nice to have it updated, as it is over a year old now.

api for async use

I would like an API that can be used, for example, with asyncio. https://github.com/kashifrazzaqui/json-streamer can be used that way (but it doesn't "yield native Python objects out of a JSON stream located under a prefix"). Here's what one approach would look like:

def callback(item):
    # do something...

parser = ijson.Parser()
parser.add_item_listener(prefix, callback)
parser.consume(chunk1)
parser.consume(chunk2)
# etc
parser.close()
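A non-incremental sketch of what the proposed interface could look like (PushParser and its method names mirror the snippet above and are hypothetical; a real implementation would feed each chunk to an incremental parser rather than buffering until close):

```python
import json

class PushParser:
    """Sketch of the proposed push-style API. For illustration it
    buffers all chunks and parses on close(); a real version would
    emit items incrementally as chunks arrive."""
    def __init__(self):
        self._chunks = []
        self._listeners = []

    def add_item_listener(self, prefix, callback):
        self._listeners.append((prefix, callback))

    def consume(self, chunk):
        self._chunks.append(chunk)

    def close(self):
        doc = json.loads(''.join(self._chunks))
        for prefix, callback in self._listeners:
            node = doc
            for key in prefix.split('.'):
                if key:                     # '' means the document root
                    node = node[key]
            for item in node:               # fire callback per array item
                callback(item)
```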

Control buffer size

I have an application that reads a continuous stream of json data from a process as an infinite array. I need to read an object out of this array about once per second. Each object is about 200 characters long.

The current buffer size is 16 * 1024. This means that I don't get events until the buffer is full, which takes a very long time. I need to make the buffer size very small for my application to work correctly.

I would like to add a kwarg to ijson.parse() to control the buffer size.
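Until such a kwarg exists, one possible workaround (a sketch; SmallReads is a made-up name, not part of the ijson API) is to wrap the stream so every read() returns at most a few bytes, letting the parser see data promptly instead of blocking until the full 16 KiB arrive:

```python
import io

class SmallReads:
    """Wrap a stream so each read() returns at most max_bytes.
    A parser that asks for a large buffer then gets small, prompt
    chunks instead of blocking for the full amount."""
    def __init__(self, f, max_bytes=64):
        self._f = f
        self._max = max_bytes

    def read(self, size=-1):
        # Cap every request, including read() / read(-1).
        if size < 0 or size > self._max:
            size = self._max
        return self._f.read(size)
```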

cannot parse integer written in format 1.55E10

Dear Sir,
I have a large JSON file, around 1.5 GB in size. One item in the JSON is the price of a property, written in a format like 1.55E10 or 6.5E10. When I parse it with ijson, I get the error 'Unexpected symbol "E" at 377'.
A temporary solution is to replace this number format with 1.55e+10, but it's hard to replace all occurrences in a large JSON file. Please help me. Any help would be appreciated.
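For reference, these values are floats rather than integers, and the JSON grammar allows either case of 'e' in an exponent, so 1.55E10 is valid JSON; a lexer's number pattern just needs a case-insensitive exponent part, e.g.:

```python
import re

# JSON number grammar: integer part, optional fraction, optional
# exponent -- note [eE], since both cases are allowed by the spec.
NUMBER = re.compile(r'-?(0|[1-9]\d*)(\.\d+)?([eE][-+]?\d+)?')

assert NUMBER.fullmatch('1.55E10')
assert NUMBER.fullmatch('6.5E10')
assert NUMBER.fullmatch('1.55e+10')
```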

Ijson support for unicode

Characters such as ö and ™ throw UnexpectedSymbol when parsing.
Can somebody help with this issue? Will ijson support Unicode characters?

Python3 support for yajl2_cffi

With Python 3.5, I get the following error:

Traceback (most recent call last):
  File "./streaming_feature_count.py", line 8, in <module>
    print('# features: %s' % sum(1 for _ in features))
  File "./streaming_feature_count.py", line 8, in <genexpr>
    print('# features: %s' % sum(1 for _ in features))
  File "/Users/danvk/.virtualenvs/osm-segments/lib/python3.5/site-packages/ijson/common.py", line 138, in items
    current, event, value = next(prefixed_events)
  File "/Users/danvk/.virtualenvs/osm-segments/lib/python3.5/site-packages/ijson/common.py", line 65, in parse
    for event, value in basic_events:
  File "/Users/danvk/.virtualenvs/osm-segments/lib/python3.5/site-packages/ijson/backends/yajl2_cffi.py", line 218, in basic_parse
    yajl_parse(handle, buffer)
  File "/Users/danvk/.virtualenvs/osm-segments/lib/python3.5/site-packages/ijson/backends/yajl2_cffi.py", line 179, in yajl_parse
    result = yajl.yajl_parse(handle, buffer, len(buffer))
TypeError: initializer for ctype 'unsigned char *' must be a bytes or list or tuple, not str
./streaming_feature_count.py sf-blocks.json  0.15s user 0.04s system 66% cpu 0.297 total

My code works fine in Python 2.7, however.

Here's the full code:

#!/usr/bin/env python
# import ijson
import ijson.backends.yajl2_cffi as ijson
import sys

with open(sys.argv[1]) as f:
    features = ijson.items(f, 'features.item')
    print('# features: %s' % sum(1 for _ in features))

Fails on trivial JSON in Python 3.4

Not quite sure how to debug, but tried installing from github master.
Python 3.4.2 w/ libyajl 2.1.0 on OS X 10.10

import ijson.backends.yajl2 as ijson
import codecs

class FHoseFile(object):
  def __init__(self, filename, *parms, **kw):
    self.filename = filename

  def iter(self):
    with codecs.open(self.filename, 'r', encoding='utf8') as rawjson:
      objs = ijson.items(rawjson, "item")
      for o in objs:
        yield o

filename = "./sample2.json"

hf = FHoseFile(filename)
for tweet in hf.iter():
  print(tweet)

sample2.json contains

[{
  "foo":"bar"
}]

when I execute my script I get the following error:

(twit3)evil-jim-klo:src jklo$ ./run.py
Traceback (most recent call last):
  File "./run.py", line 20, in <module>
    for tweet in hf.iter():
  File "/Users/jklo/projects/Sunflower/EmergingEvents/twitterDemo/src/firehose/__init__.py", line 15, in iter
    for o in objs:
  File "/Users/jklo/projects/Sunflower/EmergingEvents/twitterDemo/twit3/lib/python3.4/site-packages/ijson/common.py", line 131, in items
    current, event, value = next(prefixed_events)
  File "/Users/jklo/projects/Sunflower/EmergingEvents/twitterDemo/twit3/lib/python3.4/site-packages/ijson/common.py", line 58, in parse
    for event, value in basic_events:
  File "/Users/jklo/projects/Sunflower/EmergingEvents/twitterDemo/twit3/lib/python3.4/site-packages/ijson/backends/yajl2.py", line 95, in basic_parse
    raise common.JSONError(error.decode('utf-8'))
ijson.common.JSONError: lexical error: invalid char in json text.
                                      [                     (right here) ------^

I stuck pdb on line 95 in yajl2.py, this is what I discovered:

(twit3)evil-jim-klo:src jklo$ ./run.py
> /Users/jklo/projects/Sunflower/EmergingEvents/twitterDemo/twit3/lib/python3.4/site-packages/ijson/backends/yajl2.py(96)basic_parse()
-> raise common.JSONError(error.decode('utf-8'))
(Pdb) buffer
'[{\n  "foo": "bar"\n}]'
(Pdb) error
b'lexical error: invalid char in json text.\n                                      [                     (right here) ------^\n'
(Pdb)

On the surface it looks like the \n isn't getting ignored or stripped; however, your Travis tests for Python 3.4 seem to be passing.

FWIW switching from codecs.open to open makes no difference, same error.

If I use Python 2 it all works, however my script has some Python 3 dependencies.

parse context '*.item' is ambiguous

One callback missing from yajl is an equivalent of yajl_map_key for arrays, i.e. yajl_array_item. The fact that ijson.parse() indicates the context has the potential to remedy situations where such a callback would be useful. However, the context is unreliable, due to the ambiguity when "item" is a key:

>>> pprint.pprint(list(ijson.parse(StringIO.StringIO('[{"item":"val"}]'))))
[('', 'start_array', None),
 ('item', 'start_map', None),
 ('item', 'map_key', 'item'),
 ('item.item', 'string', u'val'),
 ('item', 'end_map', None),
 ('', 'end_array', None)]
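One way to make prefixes unambiguous (a sketch over a hand-built event list, not ijson's actual API) is to address array elements by index rather than the literal word 'item', so an element's prefix can never collide with a map key named 'item':

```python
def indexed_prefixes(basic_events):
    """Rebuild prefixes from basic (event, value) pairs, numbering
    array elements '0', '1', ... instead of calling them 'item'."""
    path = []       # current prefix components
    frames = []     # per open container: element index, or None for maps
    for event, value in basic_events:
        if event in ('start_map', 'start_array',
                     'string', 'number', 'boolean', 'null'):
            # a new value begins: if the parent is an array, bump its index
            if frames and frames[-1] is not None:
                frames[-1] += 1
                path[-1] = str(frames[-1])
        if event == 'map_key':
            prefix = '.'.join(path[:-1])
            path[-1] = value                # subsequent events use the key
        elif event == 'start_map':
            prefix = '.'.join(path)
            path.append('')                 # placeholder until first key
            frames.append(None)
        elif event == 'start_array':
            prefix = '.'.join(path)
            path.append('')                 # placeholder until first element
            frames.append(-1)
        elif event in ('end_map', 'end_array'):
            path.pop()
            frames.pop()
            prefix = '.'.join(path)
        else:
            prefix = '.'.join(path)
        yield prefix, event, value
```

On the `[{"item":"val"}]` input above, this yields the string value under the prefix '0.item', which cannot be confused with an array element.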

NaN support

I'm trying to parse in Python some json that contain values of NaN but ijson is throwing an exception. I see that the standard json parser has the capability to handle NaN. Is this something you'd consider added to ijson?
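For what it's worth, the standard library's json.loads accepts NaN (and Infinity) out of the box through its parse_constant handling, which can serve as a stopgap for documents small enough to load whole:

```python
import json
import math

# The stdlib parser accepts these non-standard constants by default.
doc = json.loads('{"x": NaN, "y": Infinity}')
assert math.isnan(doc['x'])
assert math.isinf(doc['y'])
```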
