Comments (6)
Hi @bohea thanks for the suggestion!
I am open to adding another encoder that uses jsoncomment.
Two questions:
- Are you including examples and prompt instructions telling the model not to include comments? If so, are the model outputs actually reasonable even though the model fails to omit comments?
- Are you looking for JSON comment functionality only for the decoding path? Or does your use case benefit from adding comments during encoding?
Is jsoncomment still maintained? https://pypi.org/project/jsoncomment/#data
The project page leads to a 404. Let me know if you're able to find a project page or any other well-maintained library that offers this functionality.
If not, a jsoncomment implementation could be added directly to kor using lark. (Let me know if you have interest in working on this.)
An alternative approach would be to introduce another decoder that first strips away comments using a regexp and then delegates to the standard JSON decoder.
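As a rough illustration of that alternative, here is a minimal sketch of such a regex-based pre-processing step (this is not kor's actual decoder; the function names are made up for the example, and the regexes are naive — they don't protect comment-like sequences inside string values):

```python
import json
import re


def strip_json_comments(text: str) -> str:
    """Remove /* ... */ block comments and full-line // comments.

    Naive sketch: a "//" inside a string value (e.g. a URL) on its own
    line could be mangled, so a real implementation needs a tokenizer.
    """
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    text = re.sub(r"^\s*//.*$", "", text, flags=re.MULTILINE)
    return text


def lenient_loads(text: str):
    """Strip comments first, then delegate to the standard JSON decoder."""
    return json.loads(strip_json_comments(text))
```
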
from kor.
It's not about comments in JSON, but extra commas in the LLM's response JSON, which make JSON decoding fail.
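For that specific failure mode, a pre-processing step narrower than full comment support may be enough. A minimal sketch (hypothetical helper, not kor's implementation; the regex is naive and would also match a comma-before-bracket inside a string value):

```python
import json
import re

# Drop trailing commas such as {"a": 1,} or [1, 2,] that LLMs sometimes
# emit and that json.loads rejects with a JSONDecodeError.
_TRAILING_COMMA = re.compile(r",\s*([}\]])")


def loads_dropping_trailing_commas(text: str):
    """Remove trailing commas, then delegate to the standard JSON decoder."""
    return json.loads(_TRAILING_COMMA.sub(r"\1", text))
```
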
I haven't found a project page yet; it seems the author only uploaded the package to PyPI.
By the way, I have found that removing the JSON tag in the schema encode makes JSON decoding work better, because the LLM's raw response often misses the closing JSON tag ("</json>") at the end (not due to token restrictions).
e.g.
<json>{"a": 1, "b":{"foo": 1, "bar": 2}}
The JSON decoder unwraps the JSON tag first using a regexp, but if there is no </json> at the end, the regexp matches nothing, so decoding fails.
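One way around this would be a regex that treats the closing tag as optional, so a truncated response still unwraps. A minimal sketch (assumed function name; not kor's actual implementation):

```python
import re

# Match "<json>...</json>", but also accept a response where the
# closing </json> tag was never emitted by falling back to end-of-string.
_TAG_PATTERN = re.compile(r"<json>(.*?)(?:</json>|$)", re.DOTALL)


def unwrap_json_tag(text: str) -> str:
    """Return the content between <json> tags, tolerating a missing </json>."""
    match = _TAG_PATTERN.search(text)
    return match.group(1) if match else text
```
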
@bohea apologies for the delayed responses -- I'm on vacation until the end of July, so I only have limited computer access.
A few questions:
- do you have benchmarking results?
- which LLMs are you testing with?
- are you including examples?
My personal experience:
- I experimented mostly with OpenAI text-davinci-003, gpt-3.5-turbo and Claude. My sense was that text-davinci-003 and Claude were significantly better than gpt-3.5-turbo.
- My experience with the <json> and </json> wrappers was that the OpenAI models often included explanations after the JSON section, making it more difficult to identify the JSON section correctly (or requiring some sort of hacks). The <json> and </json> tags significantly reduced parsing errors as far as I could tell.
I unfortunately don't have any benchmark datasets, so all of my conclusions should be treated as anecdotal, but based on my experience I don't want to change the default behavior of including the <json> tag without quantitative evidence that it improves results.
We should definitely make the presence of the tag controllable by a flag though -- it will allow the user to determine how the data should be encoded.
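The flag idea could look something like the following minimal sketch (hypothetical class and parameter name for illustration only -- not necessarily kor's actual encoder API):

```python
import json


class JSONEncoder:
    """Encode data as JSON, optionally wrapped in <json> tags.

    use_tags=True keeps the delimiting behavior described above;
    use_tags=False avoids decode failures when the model drops </json>.
    """

    def __init__(self, use_tags: bool = True) -> None:
        self.use_tags = use_tags

    def encode(self, data) -> str:
        text = json.dumps(data)
        return f"<json>{text}</json>" if self.use_tags else text
```
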
> It's not about comments in JSON, but extra commas in the LLM's response JSON, which make JSON decoding fail.
This sounds like it could improve extraction in some cases and make it worse in other cases (extracting incorrect information). Is this not the case?
@eyurtsev thanks for your response
do you have benchmarking results? -- Not yet. I did a simple statistic: ChatGPT has a 50% chance of not adding the JSON tag, so I simply set use_tag = False when doing JSON encoding, and use_tag = True when doing CSV encoding.
which LLMs are you testing with? -- Mostly gpt-3.5-turbo-16k.
are you including examples? -- No examples included.
You were right to ask the LLM to add the CSV/JSON tag; it's ChatGPT's problem that it doesn't follow the instruction (maybe my text is too long).
I didn't know text-davinci-003 might be better than ChatGPT at doing extractions. I need ChatGPT's reasoning abilities, but I will give text-davinci-003 a try.