pdjson's Introduction

Public Domain JSON Parser for C

A public domain JSON parser focused on correctness, ANSI C99 compliance, full Unicode (UTF-8) support, minimal memory footprint, and a simple API. Because the API is streaming, arbitrarily large JSON can be processed with a small amount of memory (the size of the largest string in the JSON). It seems most C JSON libraries suck in some significant way: broken string support (what if the string contains \u0000?), broken or missing Unicode support, or a crappy software license (GPL or "do no evil"). This library intends to avoid these flaws.

The parser is intended to support exactly the JSON standard, no more, no less, so that even slightly non-conforming JSON is rejected. The input is assumed to be UTF-8, and all strings returned by the library are UTF-8 with possible nul characters in the middle, which is why the size output parameter is important. Encoded characters (\uxxxx) are decoded and re-encoded into UTF-8. UTF-16 surrogate pairs expressed as adjacent encoded characters are supported.
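As a rough illustration of the decoding described above (not pdjson's actual code), the sketch below combines a \uXXXX surrogate pair into a code point and re-encodes it as UTF-8. The function names combine_surrogates and encode_utf8 are hypothetical and exist only for this example.

```c
#include <stddef.h>

/* Combine a UTF-16 high/low surrogate pair into a Unicode code point.
 * Returns -1 if the two values do not form a valid pair. */
long combine_surrogates(unsigned hi, unsigned lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return -1;
    return 0x10000L + (((long)hi - 0xD800) << 10) + ((long)lo - 0xDC00);
}

/* Encode code point cp as UTF-8 into out, which must hold at least 4 bytes.
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (out of range, or a lone surrogate). */
size_t encode_utf8(long cp, unsigned char *out)
{
    if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

For example, the pair \ud83d\ude00 combines to U+1F600, which encodes as the four bytes F0 9F 98 80. \u0000 encodes as a single zero byte, which is why callers need the size output rather than strlen().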

One exception to this rule is made to support a "streaming" mode. When a JSON "stream" contains multiple JSON objects (optionally separated by JSON whitespace), the default behavior of the parser is to allow the stream to be "reset," and to continue parsing the stream.

The library is usable and nearly complete, but needs polish.

API Overview

All parser state is attached to a json_stream struct. Its fields should not be accessed directly. To initialize, it can be "opened" on an input FILE * stream or memory buffer. It's disposed of by being "closed."

void json_open_stream(json_stream *json, FILE * stream);
void json_open_string(json_stream *json, const char *string);
void json_open_buffer(json_stream *json, const void *buffer, size_t size);
void json_close(json_stream *json);

After opening a stream, custom allocator callbacks can be specified, in case allocations should not come from a system-supplied malloc. (When no custom allocator is specified, the system allocator is used.)

struct json_allocator {
    void *(*malloc)(size_t);
    void *(*realloc)(void *, size_t);
    void (*free)(void *);
};


void json_set_allocator(json_stream *json, json_allocator *a);
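A minimal sketch of callbacks that fit the three function-pointer signatures above: thin wrappers around the system allocator that count calls. The counters and the counting_* names are hypothetical and not part of pdjson; the closing comment shows how they might be attached, assuming pdjson.h is included and a stream has already been opened.

```c
#include <stdlib.h>

/* Hypothetical counters, for illustration only. */
size_t alloc_calls = 0;
size_t free_calls = 0;

/* Matches void *(*malloc)(size_t). */
void *counting_malloc(size_t size)
{
    alloc_calls++;
    return malloc(size);
}

/* Matches void *(*realloc)(void *, size_t). */
void *counting_realloc(void *ptr, size_t size)
{
    alloc_calls++;
    return realloc(ptr, size);
}

/* Matches void (*free)(void *). */
void counting_free(void *ptr)
{
    free_calls++;
    free(ptr);
}

/* With pdjson.h included, the callbacks would be attached after opening
 * the stream, along these lines:
 *
 *     json_allocator a = { counting_malloc, counting_realloc, counting_free };
 *     json_set_allocator(&json, &a);
 */
```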

By default, only one value is read from the stream. The parser can be reset to read more values; the overall line number and position are preserved.

void json_reset(json_stream *json);

If strict conformance to the JSON standard is desired, streaming mode can be disabled by calling json_set_streaming and setting the mode to false. This will cause any non-whitespace trailing data to trigger a parse error.

void json_set_streaming(json_stream *json, bool mode);

The JSON is parsed as a stream of events (enum json_type). After each event, the stream is in the indicated state, during which data can be queried and retrieved.

enum json_type json_next(json_stream *json);
enum json_type json_peek(json_stream *json);

const char *json_get_string(json_stream *json, size_t *length);
double json_get_number(json_stream *json);

Numbers can also be retrieved by json_get_string(), which will return the raw text number as it appeared in the JSON. This is useful if better precision is required.
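One case where the raw text helps: on typical IEEE 754 platforms a double cannot represent every integer above 2^53. The sketch below converts raw number text to an exact long long, assuming the text is NUL-terminated. parse_exact_integer is a hypothetical helper, not part of pdjson.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse raw JSON number text as an exact integer.
 * Returns 1 and stores the value in *out on success.
 * Returns 0 if the text has a fraction or exponent, is empty,
 * or does not fit in a long long.
 * Note: strtoll() also accepts leading whitespace and '+', which
 * valid JSON number text never contains. */
int parse_exact_integer(const char *text, long long *out)
{
    char *end;
    long long value;

    errno = 0;
    value = strtoll(text, &end, 10);
    if (end == text || *end != '\0' || errno == ERANGE)
        return 0;
    *out = value;
    return 1;
}
```

A caller could try this first on the text returned by json_get_string() and fall back to json_get_number() when it returns 0.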

In the case of a parse error, the event will be JSON_ERROR. The stream cannot be used again until it is reset. In the event of an error, a human-friendly, English error message is available, as well as the line number and byte position. (The line number and byte position are always available.)

const char *json_get_error(json_stream *json);
size_t json_get_lineno(json_stream *json);
size_t json_get_position(json_stream *json);

Outside of errors, a JSON_OBJECT event will always be followed by zero or more pairs of JSON_STRING (member name) events and their associated value events. That is, the stream of events will always be logical and consistent.

In the streaming mode the end of the input is indicated by returning a second JSON_DONE event. Note also that in this mode an input consisting of zero JSON values is valid and is represented by a single JSON_DONE event.

JSON values in the stream can be separated by zero or more JSON whitespace characters. Stricter or alternative separation can be implemented by reading and analyzing characters between values using the following functions.

int json_source_get(json_stream *json);
int json_source_peek(json_stream *json);
bool json_isspace(int c);

As an example, the following code fragment makes sure values are separated by at least one newline.

enum json_type e = json_next(json);

if (e == JSON_DONE) {
    int c = '\0';
    while (json_isspace(c = json_source_peek(json))) {
        json_source_get(json);
        if (c == '\n')
            break;
    }

    if (c != '\n' && c != EOF) {
        /* error */
    }

    json_reset(json);
}

pdjson's People

Contributors

aleks-f, astillich-igniti, boris-kolpackov, dhobsd, pavelxdd, skeeto, sw17ch

pdjson's Issues

few questions

@skeeto

  1. What is json_typename? I don't see it used anywhere, and it breaks some older MSVC builds. Can it be removed?

  2. Would you mind a few explicit typecasts so that pdjson.c can be compiled as C++? I know they are superfluous for C, but we are forced to compile as C++ for pre-VS2013 builds because C99 support is dismal there.

  3. These includes

#include <stdio.h>
#include <stdbool.h>

are in both the source and the header. Can we get rid of the source file entries?

If you're OK with the above, I'll send a pull request.

Howto use with unix socket returning array of dicts

Could you give some pointers on how to use this library when reading a possibly large number of JSON objects from a Unix socket stream?

Which helper functions would be needed?

Thank you in advance.

/edit
Previous version said a tcp stream

Error strings include "error" phrase, line number

I was surprised to find that the error messages returned by json_get_error() all contain the error: <line>: prefix. To me this seems to go against the overall philosophy of the library which is to provide the information in its most elementary form. In this case the line can be easily obtained with json_get_lineno() and then incorporated into diagnostics in the desired form (for example, I prefer the more traditional <file>:<line>: error: <message> format).

I am willing to do the work to fix this but would like to discuss the best approach. I am not sure that just dropping this prefix would be acceptable since existing code may by now rely on it being there.

Other, more elaborate options include having a flag, either a compile-time macro or a run-time flag (similar to existing JSON_FLAG_*).

Another option would be to have another function that returns a pointer to the "tail" of the existing strings without the prefix (this is, in fact, pretty much what I am doing now as a workaround). This would be pretty easy to implement with the only drawback being the potentially unnecessary work done formatting the line number. But perhaps it's ok since this only happens in case of an error?
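The "tail" workaround could look roughly like the sketch below. It assumes the prefix has exactly the form error: <line>: described above and returns the message unchanged if that form is not found. error_message_tail is a hypothetical name, not an existing pdjson function.

```c
#include <ctype.h>
#include <string.h>

/* Return a pointer past an "error: <line>: " prefix, or msg itself
 * if the string does not start with that exact form. */
const char *error_message_tail(const char *msg)
{
    static const char prefix[] = "error: ";
    const char *p = msg;

    if (strncmp(p, prefix, sizeof(prefix) - 1) != 0)
        return msg;
    p += sizeof(prefix) - 1;

    /* At least one digit of line number is required. */
    if (!isdigit((unsigned char)*p))
        return msg;
    while (isdigit((unsigned char)*p))
        p++;

    if (p[0] != ':' || p[1] != ' ')
        return msg;
    return p + 2;
}
```

The result could then be combined with json_get_lineno() to build a <file>:<line>: error: <message> diagnostic.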

What do you think?

rename json.c/h

@skeeto small request: rename json.c/h to pdjson or pd_json.

we have a name conflict - json is a pretty common name; renaming it would make it easier to simply copy/paste files without any renaming. not a huge deal, but would be nice; if you don't mind, I'll send a pull

Would you consider C90 support?

I need a C90 json library and this is so close. Would you consider accepting the changes necessary to make this C90 compliant?

It seems that the biggest change would be related to the json_error macro, which would need to become a function instead.

Locale can break floating point parsing

pdjson/pdjson.c

Line 850 in 67108d8

return p == NULL ? 0 : strtod(p, NULL);

strtod() is affected by the current locale and is therefore unsuitable.

For instance, under a locale that uses a comma instead of a period as the decimal separator, strtod() incorrectly parses 123.45 as 123: it stops at the period, since the period is not part of that locale's floating point number syntax.

On macOS, LC_ALL=sv_SE.UTF-8 selects such a locale; the name is different on Linux.
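One possible workaround, sketched under the assumption that the number text is valid JSON and NUL-terminated: swap the JSON period for the current locale's decimal point before calling strtod(). This is not how pdjson is implemented. parse_json_double is a hypothetical name, and the fixed 128-byte buffer is an arbitrary choice for the sketch.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Convert JSON number text to double independently of LC_NUMERIC by
 * replacing '.' with the locale's decimal point before strtod(). */
double parse_json_double(const char *text)
{
    const char *point = localeconv()->decimal_point;
    size_t point_len = strlen(point);
    const char *dot = strchr(text, '.');
    char buf[128];
    size_t head_len, tail_len;

    /* Nothing to rewrite: no fraction, or the locale already uses '.'. */
    if (dot == NULL || strcmp(point, ".") == 0)
        return strtod(text, NULL);

    head_len = (size_t)(dot - text);
    tail_len = strlen(dot + 1);
    if (head_len + point_len + tail_len + 1 > sizeof(buf))
        return strtod(text, NULL); /* too long for this sketch's buffer */

    memcpy(buf, text, head_len);
    memcpy(buf + head_len, point, point_len);
    memcpy(buf + head_len + point_len, dot + 1, tail_len + 1); /* copies NUL */
    return strtod(buf, NULL);
}
```

A hand-written, locale-free conversion would avoid the copy, at the cost of reimplementing correct rounding.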

There is no test-case for json_get_number().

Reproducer:

perm@NP pdjson % locale
LANG="en_SE.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
perm@NP pdjson % make test
tests/tests
PASS double locale C
PASS double locale en_US.UTF-8
PASS number
PASS true
PASS false
PASS null
PASS string
PASS string quotes
PASS object
PASS array
PASS number stream
PASS mixed stream
PASS empty stream
PASS stream separation
PASS incomplete array
PASS \uXXXX
PASS invalid surrogate pair
PASS invalid surrogate half
PASS surrogate misorder
PASS surrogate pair
20 pass, 0 fail
perm@NP pdjson % LC_ALL=sv_SE.UTF-8 make test
tests/tests
PASS double locale C
FAIL double locale sv_SE.UTF-8
PASS number
PASS true
PASS false
PASS null
PASS string
PASS string quotes
PASS object
PASS array
PASS number stream
PASS mixed stream
PASS empty stream
PASS stream separation
PASS incomplete array
PASS \uXXXX
PASS invalid surrogate pair
PASS invalid surrogate half
PASS surrogate misorder
PASS surrogate pair
19 pass, 1 fail
make: *** [check] Error 1
perm@NP pdjson % git rev-parse HEAD
67108d883061043e55d0fb13961ac1b6fc8a485c
perm@NP pdjson % git diff | cat
diff --git a/tests/tests.c b/tests/tests.c
index 03760fe..d5b289a 100644
--- a/tests/tests.c
+++ b/tests/tests.c
@@ -1,6 +1,8 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
+#include <locale.h>
+
 #include "../pdjson.h"
 
 #if _WIN32
@@ -113,6 +115,48 @@ main(void)
     int count_pass = 0;
     int count_fail = 0;
 
+    {
+      size_t i;
+      char const *locales[] =
+	{
+	 "C", // Standard POSIX ("ASCII") locale
+	 "" // Inherit from environment
+	};
+				
+				
+      for (i = 0; i < 2; i++) {
+	char const *locale = locales[i];
+
+	char const *name = "double";
+	int success = 0;
+	struct json_stream json[1];
+	char const *buf = "123.45";
+	double expect_double = 123.45;
+	char *locale_name = setlocale(LC_ALL, locale);
+
+	json_open_buffer(json, buf, strlen(buf));
+	json_set_streaming(json, false);
+
+	if (JSON_NUMBER == json_next(json)) {
+	  double d = json_get_number(json);
+	  if (d == expect_double) {
+	    success = 1;
+	  }
+	}
+	if (success) {
+	  count_pass++;
+	  printf(C_GREEN("PASS") " %s locale %s\n", name, locale_name);
+	} else {
+	  count_fail++;
+	  printf(C_RED("FAIL") " %s locale %s\n", name, locale_name);
+	}
+      }
+    }
+
+
+
+
+
     {
         const char str[] = "  1024\n";
         struct expect seq[] = {
perm@NP pdjson % 

Use RFC8259 terminology in error messages

While I think the existing error messages are pretty good and self-consistent, they don't use the same terminology as what's in RFC8259. Probably the most representative example (and a sticking point to me) is "property name" vs "object member name".

Would there be interest in having this fixed? If so, I would be willing to do the work.

End-of-stream indication indistinguishable from error

Currently, in the streaming mode the end-of-stream indicator is JSON_ERROR. For example:

$ ./stream <<<'1 2'
struct expect seq[] = {
    {JSON_NUMBER, "1"},
    {JSON_DONE},
    {JSON_NUMBER, "2"},
    {JSON_DONE},
    {JSON_ERROR},
};

Unfortunately, this is indistinguishable (or, at least, not easily distinguishable) from a real error. Compare:

$ ./stream <<<'1 2 }'
struct expect seq[] = {
    {JSON_NUMBER, "1"},
    {JSON_DONE},
    {JSON_NUMBER, "2"},
    {JSON_DONE},
    {JSON_ERROR},
};

It feels like a more natural indication would have been another JSON_DONE (in fact, since this is not documented in the README file, I initially assumed that's what happens). Are there any issues with this approach that I don't see? And if not, is there interest in making the change? If the answer to the second question is yes, I would be willing to work on a patch.

Streaming mode strange behavior

I cannot figure out how to use streaming mode.
Input file:

{"xx":"zz"}
{"aa":3}

My execution sequence at the end:

  • got first JSON_DONE
  • call json_reset
  • got JSON_ERROR

dependence on execution character set

The README rightly states that only UTF-8 is supported, as it
should be. It notes other libraries’ broken Unicode support and a
focus on correctness in this one. All that seems quite right,
except for one little thing: pdjson depends on the execution
character set being ASCII or the like; e.g., it requires the
constant '{' to have the value 0x7b.

That is a fine choice and works for most cases, but is not
maximally portable. To break pdjson, you can play around with
gcc -fexec-charset.

To fix, the character constants should be replaced by the integer
values (0x7b /* { */). C23 introduces UTF-8 character constants
(u8'{' of type unsigned char), but I suppose you don’t want to
depend on a C23 compiler.
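A small sketch of what that replacement might look like for the JSON structural characters. The enum constants and is_json_structural are hypothetical names chosen for this example and are not taken from pdjson.

```c
/* JSON structural bytes written as their UTF-8 (ASCII) values, so the
 * comparisons do not depend on the compiler's execution character set. */
enum {
    JSON_BYTE_COMMA    = 0x2c, /* , */
    JSON_BYTE_COLON    = 0x3a, /* : */
    JSON_BYTE_LBRACKET = 0x5b, /* [ */
    JSON_BYTE_RBRACKET = 0x5d, /* ] */
    JSON_BYTE_LBRACE   = 0x7b, /* { */
    JSON_BYTE_RBRACE   = 0x7d  /* } */
};

/* Return 1 if the input byte is one of the six JSON structural
 * characters, 0 otherwise. */
int is_json_structural(int byte)
{
    switch (byte) {
    case JSON_BYTE_COMMA:
    case JSON_BYTE_COLON:
    case JSON_BYTE_LBRACKET:
    case JSON_BYTE_RBRACKET:
    case JSON_BYTE_LBRACE:
    case JSON_BYTE_RBRACE:
        return 1;
    default:
        return 0;
    }
}
```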
