
jsonformer's Introduction

Jsonformer: A Bulletproof Way to Generate Structured JSON from Language Models.

Problem: Getting models to output structured JSON is hard

Solution: Only generate the content tokens and fill in the fixed tokens


Generating structured JSON from language models is a challenging task. The generated JSON must be syntactically correct, and it must conform to a schema that specifies the structure of the JSON.

Current approaches to this problem are brittle and error-prone. They rely on prompt engineering, fine-tuning, and post-processing, but they still fail to generate syntactically correct JSON in many cases.

Jsonformer is a new approach to this problem. In structured data, many tokens are fixed and predictable. Jsonformer is a wrapper around Hugging Face models that fills in the fixed tokens during the generation process, and only delegates the generation of content tokens to the language model. This makes it more efficient and bulletproof than existing approaches.
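
The core idea can be sketched in a few lines. The following is a conceptual illustration only, not the actual jsonformer source:

# Conceptual sketch (not the real jsonformer code): structural tokens such as
# braces, quoted keys, colons, and commas are emitted directly; only the values
# are delegated to the language model.
def generate_object_sketch(properties: dict, generate_value) -> dict:
    obj = {}
    for key, schema in properties.items():
        # '{', '"key":', ',' and '}' are fixed and predictable -- no model call.
        # generate_value invokes the LLM for just this value, constrained by type.
        obj[key] = generate_value(schema)
    return obj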

This currently supports a subset of JSON Schema. Below is a list of the supported schema types:

  • number
  • boolean
  • string
  • array
  • object

Example

from jsonformer import Jsonformer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b")
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b")

json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "is_student": {"type": "boolean"},
        "courses": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
}

prompt = "Generate a person's information based on the following schema:"
jsonformer = Jsonformer(model, tokenizer, json_schema, prompt)
generated_data = jsonformer()

print(generated_data)

Jsonformer works on complex schemas, even with tiny models. Here is an example of a schema with nested objects and arrays; the output below was generated by a 3B parameter model.

{"type": "object", "properties": {"car": {"type": "object", "properties": {"make": {"type": "string"}, "model": {"type": "string"}, "year": {"type": "number"}, "colors": {"type": "array", "items": {"type": "string"}}, "features": {"type": "object", "properties": {"audio": {"type": "object", "properties": {"brand": {"type": "string"}, "speakers": {"type": "number"}, "hasBluetooth": {"type": "boolean"}}}, "safety": {"type": "object", "properties": {"airbags": {"type": "number"}, "parkingSensors": {"type": "boolean"}, "laneAssist": {"type": "boolean"}}}, "performance": {"type": "object", "properties": {"engine": {"type": "string"}, "horsepower": {"type": "number"}, "topSpeed": {"type": "number"}}}}}}}, "owner": {"type": "object", "properties": {"firstName": {"type": "string"}, "lastName": {"type": "string"}, "age": {"type": "number"}}}}}
{
  car: {
    make: "audi",
    model: "model A8",
    year: 2016.0,
    colors: [
      "blue"
    ],
    features: {
      audio: {
        brand: "sony",
        speakers: 2.0,
        hasBluetooth: True
      },
      safety: {
        airbags: 2.0,
        parkingSensors: True,
        laneAssist: True
      },
      performance: {
        engine: "4.0",
        horsepower: 220.0,
        topSpeed: 220.0
      }
    }
  },
  owner: {
    firstName: "John",
    lastName: "Doe",
    age: 40.0
  }
}

Features

  • Bulletproof JSON generation: Jsonformer ensures that the generated JSON is always syntactically correct and conforms to the specified schema.
  • Efficiency: By generating only the content tokens and filling in the fixed tokens, Jsonformer is more efficient than generating a full JSON string and parsing it.
  • Flexible and extendable: Jsonformer is built on top of the Hugging Face transformers library, making it compatible with any model that supports the Hugging Face interface.

Installation

pip install jsonformer

Development

Poetry is used for dependency management.

poetry install
poetry run python -m jsonformer.example

License

Jsonformer is released under the MIT License. You are free to use, modify, and distribute this software for any purpose, commercial or non-commercial, as long as the original copyright and license notice are included.

jsonformer's People

Contributors

1rgs, dacus1995, danielcorin, eltociear, ishmandoo, wjessup


jsonformer's Issues

Need to support Generation Config for model generation parameters

I encounter this error when I try to run the latest version of Jsonformer. It looks like there is no support for generation configs yet.

You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation )

Can you please add this support?

How it works

Hi, I am really interested in your jsonformer project. I have read the code again and again, and I understand how it masks tokens, but I really don't know how it ensures the result is valid JSON. Lastly, can we use other GPT models, like Turbo, to do jsonformer's work?

Supporting InferenceClient

Thanks for the library. I would like to test large models such as llama2-70b from huggingface_hub. I wonder if I can use jsonformer via InferenceClient from the hub, because I don't want to download the model.

Attempting to run example in README fails

When I attempt to run the code in the README, it fails with the following stack trace:

Traceback (most recent call last):
  File "/home/oogali/lab/llmjson/./poc.py", line 32, in <module>
    sys.exit(main())
  File "/home/oogali/lab/llmjson/./poc.py", line 27, in main
    generated_data = jsonformer()
  File "/home/oogali/lab/llmjson/venv/lib/python3.10/site-packages/jsonformer/main.py", line 188, in __call__
    generated_data = self.generate_object(
  File "/home/oogali/lab/llmjson/venv/lib/python3.10/site-packages/jsonformer/main.py", line 114, in generate_object
    obj[key] = self.generate_value(schema, obj, key)
  File "/home/oogali/lab/llmjson/venv/lib/python3.10/site-packages/jsonformer/main.py", line 136, in generate_value
    return self.generate_array(schema["items"], new_array)
  File "/home/oogali/lab/llmjson/venv/lib/python3.10/site-packages/jsonformer/main.py", line 146, in generate_array
    element = self.generate_value(item_schema, obj)
  File "/home/oogali/lab/llmjson/venv/lib/python3.10/site-packages/jsonformer/main.py", line 131, in generate_value
    obj[key if key else -1] = self.generation_marker
IndexError: list assignment index out of range

High VRAM Requirement

With llama-2-7b, I can normally pass a 2k context and my GPU handles it, but when the model is wrapped with jsonformer, I get an out-of-memory error with just 500 tokens passed into the context.

Performance Issues - Multiple Cacheless Generate Calls

Coming here from this issue over on transformers. Looks like I'm no longer getting the error I was getting earlier but there are still performance issues.

Taking a look at the code, the key issue is that it calls generate once for every value, and there's no caching of the key and value tensors, so a ton of work is being redone. On my machine the example code takes about 4 seconds to generate when using gpt2 as the model, when it normally takes about 0.25 seconds to generate a comparable amount of text with free-form generation.

I see two solutions:

  1. Cache the keys and values so that work isn't redone on every generate call. I thought the transformers library had some built-in functionality for this, but I went looking just now and couldn't find it. Hopefully I'm missing something.
  2. Use the approach in my implementation: I create a DFA for the desired schema, then make a single call to generate, using the DFA to determine which tokens are legal continuations each time the callable passed as the prefix_allowed_tokens_fn argument of generate gets called (see the sketch after this list). My approach does have the problem mentioned in the other issue: it's very slow when implemented in Python, so it would need to be done in C/C++/Rust. It would also basically be a complete rewrite of this library.
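
For illustration, here is a minimal sketch of approach 2. The prefix_allowed_tokens_fn hook is a real transformers generate() parameter; SchemaDFA and its method are hypothetical placeholders for the automaton built from the schema:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

class SchemaDFA:
    """Hypothetical: tracks which tokens can legally extend the partial JSON."""
    def allowed_token_ids(self, generated_ids):
        # Replay generated_ids through the automaton and return the legal
        # continuations; this stub permissively allows every token.
        return list(range(len(tokenizer)))

dfa = SchemaDFA()
input_ids = tokenizer("Generate a person as JSON:", return_tensors="pt").input_ids
output_ids = model.generate(
    input_ids,
    max_new_tokens=64,
    # Called at every decoding step; sampling is restricted to the returned ids.
    prefix_allowed_tokens_fn=lambda batch_id, ids: dfa.allowed_token_ids(ids),
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))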

Failed in: RuntimeError: The expanded size of the tensor (151936) must match the existing size (151646) at non-singleton dimension 1. Target sizes: [1, 151936]. Tensor sizes: [151646]

from jsonformer import Jsonformer
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
text_generation_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

json_schema = {
  "type": "object",
  "properties": {
    "status": {
      "type": "string",
      "enum": ["success", "failure"]
    },
    "mcq_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "question": {
            "type": "string"
          },
          "options": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "option": {
                  "type": "string"
                },
                "reasoning": {
                  "type": "string"
                },
                "label": {
                  "type": "number"
                }
              },
              "required": ["option", "reasoning", "label"]
            }
          },
          "answer": {
            "type": "number"
          }
        },
        "required": ["question", "options", "answer"]
      }
    }
  },
  "required": ["status", "mcq_items"]
}


context = "A story revolving around a man forced by circumstances to participate in a mysterious game of survival. Zheng Kai Si has nothing in his name and in order to pay off his debts, he goes aboard a cruise ship as one of the players in a deadly game. It's a game of lies and deception to outsmart the enemy and emerge victoriously. For the sake of his mother and Liu Qing, Kai Si struggles to survive."
prompt = f"Generate multiple choice question(s) using provided context/topic \
                        and your general knowledge, including 1 correct option and 3 \
                        wrong options. Here is the context: {context}"
jsonformer = Jsonformer(text_generation_model, tokenizer, json_schema, prompt)
generated_data = jsonformer()
print(generated_data)

Fails when I use "number" as a datatype

Error:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "C:\Users\Lenovo\Desktop\project_minerva\jsonformer_test.py", line 57, in <module>
    generated_data = jsonformer()
                     ^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\project_minerva\.venv\Lib\site-packages\jsonformer\main.py", line 242, in __call__
    generated_data = self.generate_object(
                     ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\project_minerva\.venv\Lib\site-packages\jsonformer\main.py", line 147, in generate_object
    obj[key] = self.generate_value(schema, obj, key)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\project_minerva\.venv\Lib\site-packages\jsonformer\main.py", line 178, in generate_value
    return self.generate_array(schema["items"], new_array)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\project_minerva\.venv\Lib\site-packages\jsonformer\main.py", line 192, in generate_array
    element = self.generate_value(item_schema, obj)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\project_minerva\.venv\Lib\site-packages\jsonformer\main.py", line 185, in generate_value
    return self.generate_object(schema["properties"], new_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\project_minerva\.venv\Lib\site-packages\jsonformer\main.py", line 147, in generate_object
    obj[key] = self.generate_value(schema, obj, key)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\project_minerva\.venv\Lib\site-packages\jsonformer\main.py", line 178, in generate_value
    return self.generate_array(schema["items"], new_array)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\project_minerva\.venv\Lib\site-packages\jsonformer\main.py", line 192, in generate_array
    element = self.generate_value(item_schema, obj)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\project_minerva\.venv\Lib\site-packages\jsonformer\main.py", line 185, in generate_value
    return self.generate_object(schema["properties"], new_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\project_minerva\.venv\Lib\site-packages\jsonformer\main.py", line 147, in generate_object
    obj[key] = self.generate_value(schema, obj, key)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\project_minerva\.venv\Lib\site-packages\jsonformer\main.py", line 162, in generate_value
    return self.generate_number()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\project_minerva\.venv\Lib\site-packages\jsonformer\main.py", line 61, in generate_number
    response = self.model.generate(
               ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\project_minerva\.venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\project_minerva\.venv\Lib\site-packages\transformers\generation\utils.py", line 1758, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\project_minerva\.venv\Lib\site-packages\transformers\generation\utils.py", line 2410, in _sample
    next_token_scores = logits_processor(input_ids, next_token_logits)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\project_minerva\.venv\Lib\site-packages\transformers\generation\logits_process.py", line 98, in __call__
    scores = processor(input_ids, scores)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\Desktop\project_minerva\.venv\Lib\site-packages\jsonformer\logits_processors.py", line 81, in __call__
    mask = self.allowed_mask.expand_as(scores)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The expanded size of the tensor (151936) must match the existing size (151646) at non-singleton dimension 1. Target sizes: [1, 151936]. Tensor sizes: [151646]
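
This mismatch usually means the model's logits cover a padded vocabulary (config.vocab_size = 151936) that is larger than the tokenizer's vocabulary (151646), so the allowed-token mask is shorter than the scores row. A hedged sketch of a workaround, padding the mask up to the logits' width before expanding (names follow the traceback above; this is not the actual library code):

import torch

def pad_mask_to_scores(allowed_mask: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    # Pad the mask to the scores' vocab dimension so expand_as no longer fails.
    extra = scores.shape[-1] - allowed_mask.shape[-1]
    if extra > 0:
        # Padded embedding rows correspond to no real token: mark them disallowed.
        pad = torch.zeros(extra, dtype=allowed_mask.dtype, device=allowed_mask.device)
        allowed_mask = torch.cat([allowed_mask, pad])
    return allowed_mask.expand_as(scores)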

Support for GPTQ models

Jsonformer doesn't work with GPTQ models. For inference speed, it would be nice to have support for such models.

Documentation for returning an array of objects

Is there any way to return an array of objects (e.g. return multiple car objects):

{"type": "object", "properties": {"car": {"type": "object", "properties": {"make": {"type": "string"}, "model": {"type": "string"}, "year": {"type": "number"}, "colors": {"type": "array", "items": {"type": "string"}}, "features": {"type": "object", "properties": {"audio": {"type": "object", "properties": {"brand": {"type": "string"}, "speakers": {"type": "number"}, "hasBluetooth": {"type": "boolean"}}}, "safety": {"type": "object", "properties": {"airbags": {"type": "number"}, "parkingSensors": {"type": "boolean"}, "laneAssist": {"type": "boolean"}}}, "performance": {"type": "object", "properties": {"engine": {"type": "string"}, "horsepower": {"type": "number"}, "topSpeed": {"type": "number"}}}}}}}, "owner": {"type": "object", "properties": {"firstName": {"type": "string"}, "lastName": {"type": "string"}, "age": {"type": "number"}}}}}

Here is an example I tried that gave the below error:

json_schema = {
    "type": "array",
    "properties": {
        "type": "object",
        "properties": {
            "car": {
                "type": "object",
                "properties": {
                    "make": {"type": "string"},
                    "model": {"type": "string"},
                    "horsepower": {"type": "number"}
                }
            }
        }        
    }
}

error:

TypeError: string indices must be integers

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-0fa4792b-fa73-408b-b0a8-ecf9f5e56538/lib/python3.10/site-packages/jsonformer/main.py:242, in Jsonformer.__call__(self)
    240 def __call__(self) -> Dict[str, Any]:
    241     self.value = {}
--> 242     generated_data = self.generate_object(
    243         self.json_schema["properties"], self.value
    244     )
    245     return generated_data

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-0fa4792b-fa73-408b-b0a8-ecf9f5e56538/lib/python3.10/site-packages/jsonformer/main.py:147, in Jsonformer.generate_object(self, properties, obj)
    145 for key, schema in properties.items():
    146     self.debug("[generate_object] generating value for", key)
--> 147     obj[key] = self.generate_value(schema, obj, key)
    148 return obj

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-0fa4792b-fa73-408b-b0a8-ecf9f5e56538/lib/python3.10/site-packages/jsonformer/main.py:156, in Jsonformer.generate_value(self, schema, obj, key)
    150 def generate_value(
    151     self,
    152     schema: Dict[str, Any],
    153     obj: Union[Dict[str, Any], List[Any]],
    154     key: Union[str, None] = None,
    155 ) -> Any:
--> 156     schema_type = schema["type"]
    157     if schema_type == "number":
    158         if key:
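
Based on this traceback, __call__ always reads json_schema["properties"] as an object, so a top-level "array" type is not supported, and arrays describe their elements with an "items" key rather than "properties". A hedged sketch of a schema shape that should avoid the error, wrapping the array in an object property:

json_schema = {
    "type": "object",
    "properties": {
        "cars": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "make": {"type": "string"},
                    "model": {"type": "string"},
                    "horsepower": {"type": "number"}
                }
            }
        }
    }
}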

Description of keys

Is there a way to add descriptions to the keys so that the values are mapped correctly?

allow for do_sample=True

We might not always want greedy sampling, do we?
Could you implement do_sample as an init param for Jsonformer, or is there anything technical that prohibits this change?

Installation error on case-sensitive file systems

I get the error below when trying to run poetry install. This is because the README is listed as README.md in pyproject.toml, but the file in the repository is named readme.md (lowercase). Changing either of these to match solves the problem.

$ poetry install
Installing dependencies from lock file

No dependencies to install or update

[Errno 2] No such file or directory: '/home/mmior-admin/apps/jsonformer/README.md'

Javascript version started - Help needed

It would be awesome to have this in a JS version too.

My as-yet-incomplete attempt to translate Jsonformer into TypeScript:

https://github.com/vincenzodomina/jsonformerjs

I used numjs as a lightweight alternative to tensorflow to replace torch, but Hugging Face transformers is not available in JavaScript. Any help with how to work with or substitute the Hugging Face interface, and at least get it running with the OpenAI API, would be appreciated.

Return empty values

First off, great package! Thanks for the contribution.

I've been using this package to basically transform unstructured text to JSON. It works very well with one exception. If a value does not exist in the text, one is made up.

Instead, it'd be better for a number to come back as null, or for a string to come back empty, for example.

I'd issue a PR but I am not sure how to accomplish this.

Thanks again!

Running on CPU takes 1m+ for generation. Possible to run on GPU?

Hi, is it possible to run jsonformer using a GPU on Google Colab?

When I ran it today on Google Colab with an A100 GPU runtime, I only saw it using CPU/RAM resources. Is it possible to run Dolly + jsonformer on a GPU, and would that decrease the generation time?
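
For what it's worth, a minimal sketch of getting the README example onto the Colab GPU, assuming the slowdown is simply the model sitting on CPU (device_map="auto" requires the accelerate package):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-12b",
    device_map="auto",          # place weights on the available GPU(s)
    torch_dtype=torch.float16,  # halve memory so the 12B model fits more easily
)
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b")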

String stopping criteria might need to be more specific

I see in the code and the README that the stopping criterion for strings is the second quotation mark. However, in most JSON dialects you can escape a quote inside a string with \". This appears in real text from time to time, so it's not impossible for the LLM to produce it, too.

e.g. the prompt might be "extract the second sentence from the following paragraph" and the paragraph is:

This is the first sentence. This is the "second" sentence. This is the third sentence.

The LLM would ideally output "This is the \"second\" sentence." but the parser wouldn't handle this correctly.
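
A small repro of the concern, assuming the quote-splitting extraction that other reports show in generate_string (response.split('"')[1]):

# Repro sketch: splitting on '"' truncates a value that contains escaped quotes.
response = '"This is the \\"second\\" sentence."'
print(response.split('"')[1])  # -> 'This is the \' -- cut off at the escape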

More documentation for types, examples with models without tokenizers

Hey there, I think I came across this repo because GitHub Explore suggested it to me. Thanks for putting this out. I do a lot of prompt engineering to get LLMs to output clean JSON files and it is maddening how often the data returned will be malformatted in some trivially small way, so it's great to see something like this.

Had two requests for you, time permitting.

  1. Could you add more documentation about which types are supported? I see from the example in the README that there are strings, numbers, and booleans. Anything else? Any differentiation between floats and ints?
  2. I see that the API requires a tokenizer. Does this mean the package can only be used with Hugging Face models? I'd love to use OpenAI's GPT-3.5/4 with it if possible.

Thanks again, really appreciate your work.

Jsonformer Causing High VRAM Usage and Errors with Increasing Token Size

Description

Outputs enforced through Jsonformer run into insufficient-memory errors as the token size increases.

Traceback

RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.76 GiB total capacity; 9.57 GiB already allocated; 16.25 MiB free; 9.70 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CON

More supported model types

In your readme.md, your model and tokenizer are:

model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b")
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b")

I just want to use a typical T5ForConditionalGeneration model as follows:

"""This module provides a T5 model jsonformer."""

import transformers
from jsonformer import Jsonformer

pretrained_model_name = "t5-small"

model = transformers.T5ForConditionalGeneration.from_pretrained(
    pretrained_model_name
)
tokenizer = transformers.T5Tokenizer.from_pretrained(pretrained_model_name)


json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
    }
}

prompt = "Generate a person's information based on the following schema:"
jsonformer = Jsonformer(model, tokenizer, json_schema, prompt)
generated_data = jsonformer()

print(generated_data)

But it failed due to the following error when generating a string:

    111 self.debug("[generate_string] response", response)
    112 split = response.split('"')
--> 113 assert len(split) >= 2
    114 return split[1]

AssertionError:

And I might have to say: could you add more docstrings and type hints to the project?

the example prompt in the README doesn't make sense

Here's the example at the time of writing:

from jsonformer import Jsonformer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b")
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b")

json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "is_student": {"type": "boolean"},
        "courses": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
}

prompt = "Generate a person's information based on the following schema:"
jsonformer = Jsonformer(model, tokenizer, json_schema, prompt)
generated_data = jsonformer()

print(generated_data)

I find it confusing, because you're not giving the model any data to extract into JSON. Should the prompt be more like: "Generate a person's information based on the following schema. The person is John Doe, aged 23. John is a student at Georgia Tech and takes the following courses: Chemistry, Mathematics, and a minor in Japanese."

Add OpenAI API Key-based version of Jsonformer

Currently, the Jsonformer class uses a local transformer model and tokenizer to generate data in JSON schema format. However, it would be useful to have a version of the class that uses OpenAI's language models. Therefore, I would like to request a new version of the class that takes an OpenAI API key and a model name as parameters, in addition to the existing parameters of the Jsonformer class.

The OpenAI API key should be used to authenticate requests made to OpenAI's API, and the model name should specify which LLM to use for data generation. The new version of the class should work similarly to the current implementation but use OpenAI's API to generate data.

This feature would allow for the generation of data using state-of-the-art LLMs from OpenAI.

Thank you for considering this feature request.

Move input_ids to CUDA

Hey guys!

First, thank you so much for this project and for open-sourcing it! Absolutely great idea!

I keep running into issues when my model is already placed on the GPU, and I see no way to specify a device.

Issue I get is:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

It would be great if you could move the input_ids to cuda() before calling model.generate, or have a way to specify a device target. A sketch of what I mean is below.
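
A minimal sketch of the requested change (illustrative only; model, tokenizer, and prompt as in the README example, not the actual library source):

# Moving the encoded prompt to the model's device before generate avoids the
# cuda:0 / cpu mismatch when the model lives on the GPU.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=32)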

Thanks again!!

Support for JSON-LD

Hey!

I'm currently working with a group trying to automate metadata for scholarly articles, and I wanted to use Jsonformer to return the metadata. However, I am hoping to translate the JSON into JSON-LD, and I couldn't find whether you support JSON-LD or only plain JSON.

Could someone let me know if I can use this as a method to translate semi-structured text to JSON-LD specifically?

Add Optional and Union types

Great library, but some use-cases require that fields be omitted, or that values can be of one type or another.
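
For reference, standard JSON Schema expresses these with "anyOf" (and a "null" type for optional values); something like the following schema is what's being asked for, which jsonformer does not currently handle:

json_schema = {
    "type": "object",
    "properties": {
        "nickname": {"anyOf": [{"type": "string"}, {"type": "null"}]},  # Optional
        "id": {"anyOf": [{"type": "number"}, {"type": "string"}]}       # Union
    }
}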

Issue with array response

Hi,

I have an issue with the generated JSON response. It seems that it doesn't respond well to array-related prompt instructions.

from transformers import AutoModelForCausalLM, AutoTokenizer

print("Loading model and tokenizer...")
model_name = "databricks/dolly-v2-3b"
model = AutoModelForCausalLM.from_pretrained(model_name, use_cache=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, use_cache=True)
print("Loaded model and tokenizer")

Prompt:

from jsonformer.format import highlight_values
from jsonformer.main import Jsonformer

stock2 = {
  "type": "object",
  "properties": {
    "stocks": {
        "type": "array",
        "items": {"type": "string"}
      }
   }
}

builder = Jsonformer(
    model=model,
    tokenizer=tokenizer,
    json_schema=stock2,
    debug=True,
    prompt="generate 10 stocks code",
)

print("Generating...")

output = builder()

highlight_values(output)

Response:

Generating...
[generate_object] generating value for stocks
[generate_string] generate 10 stocks code
Output result in the following JSON schema format:
{"type": "object", "properties": {"stocks": {"type": "array", "items": {"type": "string"}}}}
Result: {"stocks": ["
[generate_string] |ABC",|
[generate_string] generate 10 stocks code
Output result in the following JSON schema format:
{"type": "object", "properties": {"stocks": {"type": "array", "items": {"type": "string"}}}}
Result: {"stocks": ["ABC", "
[generate_string] |XYZ",|
[generate_string] generate 10 stocks code
Output result in the following JSON schema format:
{"type": "object", "properties": {"stocks": {"type": "array", "items": {"type": "string"}}}}
Result: {"stocks": ["ABC", "XYZ", "
[generate_string] |PQR",|
{
  stocks: [
    "ABC",
    "XYZ",
    "PQR"
  ]
}

The response only contains 3 items, not 10 as requested in the prompt. I am not sure if it is an issue with the model or not.
Also, you may notice that the memory used for the 3b model is at 23GB of RAM. Is this normal?

Any help would be appreciated. Thank you.

Recommendations on underlying model fine-tuning

This work is very interesting and potentially useful in many domains.

Do you have any recommendations on how we might fine-tune models in specific domains to better support structured extraction? Specifically, we would be interested in extracting structured data from medical reports, where things such as a fixed set of conditions, location site, and other semantic labels are specific to the input text, but the patterns could be learned through fine-tuning. While we could fine-tune the model on report inputs paired with JSON responses, it is not clear that this would be the best approach.

Thanks for this work and any future response.

How to specify a schema where an object can be any key-value combination

I want the LLM to generate a JSON like this

{
  "name": "John Doe",
  "info": {
    "age": "41",
    "tennis_club": "Detroit Club",
    "wife": "Jan Doe",
  }
}

The info is not always the same. Some people might not have a wife, so that property would be omitted.
In theory, there could be thousands of combinations in the "info".

If I specify a schema like this:

{
    "type": "object",
    "properties": {
        "name": { "type": "string" },
        "info": {
            "type": "object"
        }
    }
}

It would error with:

  File "venv/lib/python3.11/site-packages/jsonformer/main.py", line 185, in generate_value
    return self.generate_object(schema["properties"], new_obj)
                                ~~~~~~^^^^^^^^^^^^^^
KeyError: 'properties'

So "properties" is required.

The only way I currently see to solve this is to use an array of objects of the format {"key": "...", "value": "..."}.

So the schema would look like:

{
    "type": "object",
    "properties": {
        "name": { "type": "string" },
        "info": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "key": { "type": "string" },
                    "value": { "type": "string" }
               }
           }
        }
    }
}

But I'd rather avoid this.

Infinite recursion bug when calling generate_number

In main.py, when jsonformer fails to generate a number, it calls generate_number again, but without incrementing the iterations variable, so the "Failed to generate a valid number" error is never raised and we hit a "maximum recursion depth exceeded" error instead.

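A hedged sketch of the suspected fix; the signature and helper below are inferred from the description above, not copied from the source:

# Pass iterations + 1 on the recursive retry so the guard can actually fire.
def generate_number(self, temperature=None, iterations=0):
    if iterations > 3:  # retry budget; the real limit may differ
        raise ValueError("Failed to generate a valid number")
    response = self.sample_number_text(temperature)  # hypothetical helper
    try:
        return float(response)
    except ValueError:
        # Retry with a hotter temperature, incrementing iterations this time.
        return self.generate_number(temperature=(temperature or 1.0) * 1.3,
                                    iterations=iterations + 1)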

feature parity with OpenAI function calling

OpenAI's JSON function calling supports a couple of additional keys that jsonformer doesn't seem to have the structure to parse: description, enum, and required.

Does anyone have interest in introducing these?

ctransformers / GGML support

Hey, any chance the team could provide ctransformers / GGML support? A key-description option would also be clutch. Thanks!

Error in logits_processors.py - OutputNumbersTokens().__call__() for some models

Hello: when running the tiiuae/falcon-7b model, I get no issue using the package as intended. But some models, such as tiiuae/falcon-rw-1b, hit an error in OutputNumbersTokens().__call__() like the one below:

The expanded size of the tensor (50304) must match the existing size (50257) at non-singleton dimension 1. Target sizes: [1, 50304]. Tensor sizes: [50257]

I've been trying to debug this on my own but have not figured out why self.allowed_mask and scores sometimes have mismatching shapes (depending on the model), which causes the above error when running:

self.allowed_mask.expand_as(scores)

Support for LLaVa

As per my current testing, jsonformer seems to be compatible only with text-based prompts. It is not compatible with prompts for multimodal models like LLaVa.

Array of objects always contains 1 item

I'm trying to generate an array containing objects with the following format:

"results": {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "T": {"type": "string"},
            "E": {"type": "string"},
        }
    }
}

But the resulting array always contains only 1 item.

Update: perhaps it's a problem with my prompt and my model, so I will close this for now.

Make `Jsonformer` derive from `PreTrainedModel`

As far as I understand it, this is currently not the case.

Having Jsonformer derive from PreTrainedModel would enable immediate use with e.g. pipeline and other ecosystem building blocks that require a PreTrainedModel.

It might even be possible to automatically derive directly from the (more specialized) base class loaded using from_pretrained (and automatically load the tokenizer from the same path unless specified otherwise). That way, almost no functions would need to be changed. Other ideas:

  • implement forward manually. This is probably tedious.
  • automatically load all functions from the other model and set them, e.g. along the lines of

    from inspect import getmembers, ismethod

    # somewhere in Jsonformer.__init__, probably; ismethod (not isfunction)
    # catches bound methods, and getmembers already yields (name, member) pairs
    for name, func in getmembers(self.model, ismethod):
        setattr(self, name, func)

Edit: Thinking about it some more (and understanding the PreTrainedModel interface better), it's probably not that easy.

AWQ RuntimeError

Is there a way to make it work with AWQ models?

Output:

Fetching 14 files: 100%|███████████████████████████| 14/14 [00:00<00:00, 115591.06it/s]
Replacing layers...: 100%|█████████████████████████████| 32/32 [00:02<00:00, 14.02it/s]
Fusing layers...: 100%|████████████████████████████████| 32/32 [00:04<00:00,  7.79it/s]
2023-12-10 18:50:15.290217: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-10 18:50:15.345000: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-10 18:50:15.951851: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
  File "/home/conic/llm_experimentation/./jsonformer_test.py", line 40, in <module>
    generated_data = jsonformer()
  File "/home/conic/.local/lib/python3.10/site-packages/jsonformer/main.py", line 242, in __call__
    generated_data = self.generate_object(
  File "/home/conic/.local/lib/python3.10/site-packages/jsonformer/main.py", line 147, in generate_object
    obj[key] = self.generate_value(schema, obj, key)
  File "/home/conic/.local/lib/python3.10/site-packages/jsonformer/main.py", line 168, in generate_value
    return self.generate_boolean()
  File "/home/conic/.local/lib/python3.10/site-packages/jsonformer/main.py", line 90, in generate_boolean
    output = self.model.forward(input_tensor.to(self.model.device))
  File "/home/conic/.local/lib/python3.10/site-packages/awq/models/base.py", line 37, in forward
    return self.model(*args, **kwargs)
  File "/home/conic/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/conic/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/conic/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1052, in forward
    logits = self.lm_head(hidden_states)
  File "/home/conic/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/conic/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/conic/.local/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Inference tensors cannot be saved for backward. To work around you can make a clone to get a normal tensor and use it in autograd.

Code:

from jsonformer import Jsonformer
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "TheBloke/Xwin-LM-7B-V0.2-AWQ"
device_map = 'auto'
json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "is_student": {"type": "boolean"},
        "courses": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
}

device = 'cuda' if torch.cuda.is_available() else 'cpu'
available_devices = torch.cuda.device_count()
device_name = torch.cuda.get_device_name(device=device)

model = AutoAWQForCausalLM.from_quantized(
    model_name_or_path,
    fuse_layers=True,
    trust_remote_code=False,
    safetensors=True,
    device_map=device_map,
)

model.device = device
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)

prompt = "Generate a person's information based on the following schema:"
jsonformer = Jsonformer(model, tokenizer, json_schema, prompt)
generated_data = jsonformer()

print(generated_data)
