eyurtsev / kor

LLM(😽)

Home Page: https://eyurtsev.github.io/kor/

License: MIT License

Python 100.00%
information-extraction llm natural-language natural-language-processing natural-language-understanding

kor's People

Contributors: boriswilhelms, chaimt, dependabot[bot], eltociear, eyurtsev, hwchase17, rishabhjain1198, seb0, smwitkowski, tomdyson, vbarda

kor's Issues

Introduce `valid_values` to `Text` which alters the prompt, in order to limit which values are returned by `predict_and_parse`

Hey - thanks for creating kor! I'm eager to start using it in my day job, in particular for aspect-based sentiment analysis.

I'm looking at reviews on food items and would like to label each review with an "aspect" if that aspect is mentioned in the review. You could begin to imagine which aspects are most relevant for this use case, flavor and texture are two that come to mind immediately. I want to limit this labeling exercise to only aspects that I am interested in.

I expect that including these instructions in the prompt would be sufficient, and I can think of two ways this could be incorporated into kor.

The most straightforward way is to allow an end user to alter the prompt, but I'd prefer the second solution listed below.

The second seems like the better long-term solution. AbstractSchemaNode could accept a new parameter, valid_values, indicating which values are valid for a given key defined in the attribute.

schema = Object(
    id="review_aspect",
    description="Extracts aspects from a review.",
    attributes=[
        Text(
            id="aspect",
            description="Aspects mentioned in the review",
            examples=[("Taste was fine it was just a weird texture.", ["Flavor", "Texture"])],
            valid_values=["Flavor", "Texture"],
            many=True
        )
    ]
)

Then, that valid_values would be passed to generate_instruction_segment along with the node, and the prompt would be updated to include instructions on how to restrict which values are returned for aspect.

kor/kor/prompts.py

Lines 89 to 93 in c3066c1

def generate_instruction_segment(self, node: AbstractSchemaNode) -> str:
    """Generate the instruction segment of the extraction."""
    type_description = self.type_descriptor.describe(node)
    instruction_segment = self.encoder.get_instruction_segment()
    return f"{self.prefix}\n\n{type_description}\n\n{instruction_segment}"

Happy to help contribute to this if it seems helpful!

Failure in parsing for longer queries

Following the LangChain tutorial, I create:

llm = ChatOpenAI(model_name="gpt-3.5-turbo")

Define schema:

schema = Object(
    id="travel_destination",
    description="A source or destination for travel",
    attributes=[
        Text(
            id="city_name",
            description="The name of a city.",
        )
    ],
    examples=[
        ("I have to travel somewhere near Toronto and Denver.", [{"city_name": "Toronto"}, {"city_name": "Denver"}])
    ],
    many=True,
)

Chain:

chain = create_extraction_chain(llm, schema)

Define 4 queries:

query1 = """ My team is split between San Francisco or Phoenix. """
query2 = """ My team is split 50/50 between Toronto and Denver and I want somewhere that's easy to get to for everyone. """
query3 = """ I'm looking to do a sales offsite for my team. They're split 50/50 between Toronto and Denver and I want somewhere that's easy to get to for everyone. Can you recommend 3–5 cities that might be a good choice, based on where is easy for everyone to get to? """
query4 = """ I'm looking to do a sales offsite for my team. My team is split 50/50 between Toronto and Denver and I want somewhere that's easy to get to for everyone. Can you recommend 3–5 cities that might be a good choice, based on where is easy for everyone to get to? """

Test:

for q in [query1, query2, query3, query4]:
    print(chain.predict_and_parse(text=q)["data"])

Result:

{'travel_destination': [{'city_name': 'San Francisco'}, {'city_name': 'Phoenix'}]}
{'travel_destination': [{'city_name': 'Toronto'}, {'city_name': 'Denver'}]}
{'travel_destination': []}
{'travel_destination': []}

Works great for the first two, but parsing apparently fails for the longer queries.

Is this a known issue?
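
For debugging, note that the kor result dict also carries the raw completion and any parse errors (key names as they appear in other issues in this list), which helps distinguish "the model returned nothing" from "the parser rejected the output":

for q in [query1, query2, query3, query4]:
    result = chain.predict_and_parse(text=q)
    # 'raw' is the unparsed LLM completion; 'errors' holds any ParseError objects
    print(result["data"], result["raw"], result["errors"])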

TypeError: issubclass() arg 1 must be a class

I'm trying to run the below code provided in the github repo.

from langchain.chat_models import ChatOpenAI
from kor import create_extraction_chain, Object, Text

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    max_tokens=2000,
    frequency_penalty=0,
    presence_penalty=0,
    top_p=1.0,
)

schema = Object(
    id="player",
    description=(
        "User is controlling a music player to select songs, pause or start them or play"
        " music by a particular artist."
    ),
    attributes=[
        Text(
            id="song",
            description="User wants to play this song",
            examples=[],
            many=True,
        ),
        Text(
            id="album",
            description="User wants to play this album",
            examples=[],
            many=True,
        ),
        Text(
            id="artist",
            description="Music by the given artist",
            examples=[("Songs by paul simon", "paul simon")],
            many=True,
        ),
        Text(
            id="action",
            description="Action to take one of: play, stop, next, previous.",
            examples=[
                ("Please stop the music", "stop"),
                ("play something", "play"),
                ("play a song", "play"),
                ("next song", "next"),
            ],
        ),
    ],
    many=False,
)

chain = create_extraction_chain(llm, schema, encoder_or_encoder_class='json')
chain.run("play songs by paul simon and led zeppelin and the doors")['data']

I was able to run this with chain.predict_and_parse() but not with chain.run(). Below is the trace:

in <cell line: 1>()
----> 1 from langchain.chat_models import ChatOpenAI
      2 from kor import create_extraction_chain, Object, Text
      3
      4 llm = ChatOpenAI(
      5     model_name="gpt-3.5-turbo",

/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
    169 # Import the desired module. If you're seeing this while debugging a failed import,
    170 # look at preceding stack frames for relevant error information.
--> 171 original_result = python_builtin_import(name, globals, locals, fromlist, level)
    172
    173 is_root_import = thread_local._nest_level == 1

(The Databricks import-instrumentation frame above repeats between each of the following frames; it is shown only once here.)

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/langchain/__init__.py in <module>
      4 from typing import Optional
      5
----> 6 from langchain.agents import MRKLChain, ReActChain, SelfAskWithSearchChain
      7 from langchain.cache import BaseCache
      8 from langchain.chains import (

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/langchain/agents/__init__.py in <module>
      1 """Interface for agents."""
----> 2 from langchain.agents.agent import (
      3     Agent,
      4     AgentExecutor,
      5     AgentOutputParser,

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/langchain/agents/agent.py in <module>
     23     Callbacks,
     24 )
---> 25 from langchain.chains.base import Chain
     26 from langchain.chains.llm import LLMChain
     27 from langchain.input import get_color_mapping

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/langchain/chains/__init__.py in <module>
      1 """Chains are easily reusable components which can be linked together."""
----> 2 from langchain.chains.api.base import APIChain
      3 from langchain.chains.api.openapi.chain import OpenAPIEndpointChain
      4 from langchain.chains.combine_documents.base import AnalyzeDocumentChain
      5 from langchain.chains.combine_documents.map_reduce import MapReduceDocumentsChain

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/langchain/chains/api/base.py in <module>
     10     CallbackManagerForChainRun,
     11 )
---> 12 from langchain.chains.api.prompt import API_RESPONSE_PROMPT, API_URL_PROMPT
     13 from langchain.chains.base import Chain
     14 from langchain.chains.llm import LLMChain

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/langchain/chains/api/prompt.py in <module>
      1 # flake8: noqa
----> 2 from langchain.prompts.prompt import PromptTemplate
      3
      4 API_URL_PROMPT_TEMPLATE = """You are given the below API Documentation:
      5 {api_docs}

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/langchain/prompts/__init__.py in <module>
     10     SystemMessagePromptTemplate,
     11 )
---> 12 from langchain.prompts.example_selector import (
     13     LengthBasedExampleSelector,
     14     MaxMarginalRelevanceExampleSelector,

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/langchain/prompts/example_selector/__init__.py in <module>
      2 from langchain.prompts.example_selector.length_based import LengthBasedExampleSelector
      3 from langchain.prompts.example_selector.ngram_overlap import NGramOverlapExampleSelector
----> 4 from langchain.prompts.example_selector.semantic_similarity import (
      5     MaxMarginalRelevanceExampleSelector,
      6     SemanticSimilarityExampleSelector,

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/langchain/prompts/example_selector/semantic_similarity.py in <module>
      6 from pydantic import BaseModel, Extra
      7
----> 8 from langchain.embeddings.base import Embeddings
      9 from langchain.prompts.example_selector.base import BaseExampleSelector
     10 from langchain.vectorstores.base import VectorStore

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/langchain/embeddings/__init__.py in <module>
     27 from langchain.embeddings.mosaicml import MosaicMLInstructorEmbeddings
     28 from langchain.embeddings.octoai_embeddings import OctoAIEmbeddings
---> 29 from langchain.embeddings.openai import OpenAIEmbeddings
     30 from langchain.embeddings.sagemaker_endpoint import SagemakerEndpointEmbeddings
     31 from langchain.embeddings.self_hosted import SelfHostedEmbeddings

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/langchain/embeddings/openai.py in <module>
    119
    120
--> 121 class OpenAIEmbeddings(BaseModel, Embeddings):
    122     """Wrapper around OpenAI embedding models.
    123

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pydantic/main.cpython-39-x86_64-linux-gnu.so in pydantic.main.ModelMetaclass.__new__()

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pydantic/fields.cpython-39-x86_64-linux-gnu.so in pydantic.fields.ModelField.infer()

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pydantic/fields.cpython-39-x86_64-linux-gnu.so in pydantic.fields.ModelField.__init__()

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pydantic/fields.cpython-39-x86_64-linux-gnu.so in pydantic.fields.ModelField.prepare()

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pydantic/fields.cpython-39-x86_64-linux-gnu.so in pydantic.fields.ModelField._type_analysis()

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pydantic/fields.cpython-39-x86_64-linux-gnu.so in pydantic.fields.ModelField._create_sub_type()

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pydantic/fields.cpython-39-x86_64-linux-gnu.so in pydantic.fields.ModelField.__init__()

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pydantic/fields.cpython-39-x86_64-linux-gnu.so in pydantic.fields.ModelField.prepare()

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pydantic/fields.cpython-39-x86_64-linux-gnu.so in pydantic.fields.ModelField._type_analysis()

/usr/lib/python3.9/typing.py in __subclasscheck__(self, cls)
    833         return issubclass(cls.__origin__, self.__origin__)
    834     if not isinstance(cls, _GenericAlias):
--> 835         return issubclass(cls, self.__origin__)
    836     return super().__subclasscheck__(cls)
    837

TypeError: issubclass() arg 1 must be a class

[suggestion]use jsoncomment instead of json in decode

The LLM has a chance of producing the wrong JSON format; the most common problem is extra (trailing) commas.

example:

import json

text = '{"a": 1, "b":{"foo": 1, "bar": 2,}}'
data = json.loads(text)  # raises json.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 34 (char 33)

If we use the jsoncomment package instead:

from jsoncomment import JsonComment

json = JsonComment()
text = '{"a": 1, "b":{"foo": 1, "bar": 2,}}'
data = json.loads(text)  # decodes successfully: {'a': 1, 'b': {'foo': 1, 'bar': 2}}

Use of Function/Tool with OpenAI

OpenAI's 2023-06-13 announcement of changes to the API allows passing a functions parameter, which supposedly improves the LLM's interpretation of the task. (LangChain already implemented the necessary changes in 0.199/0.200.) Do you see how this could be used to improve Kor data extraction?

when there are no examples provided, the parsing results are inaccurate

res = chain.run("Please help me search for all the tender notices of big data companies on record until the day before yesterday.")

{'data': {}, 'raw': '{\n "industry": "big data",\n "type": "tender",\n "deadline": "2023-08-05"\n}', 'errors': [ParseError('The LLM has returned structured data which does not match the expected schema. Providing additional examples may help improve the parse.')],
'validated_data': {}}

The top-level object is missing in the raw output.

Output to other formats than JSON

Is there any way to output anything else than JSON? I was thinking YAML for example.

The reason for asking this is that JSON is quite expensive in terms of "token usage" with OpenAI - I was thinking of using the output of kor multiple times in a LLM pipeline.
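
For reference, create_extraction_chain takes an encoder argument (it is used with 'json' elsewhere in these issues), and kor also ships a CSV encoder, which is typically lighter on tokens than JSON. A minimal sketch, assuming the 'csv' encoder name:

# CSV output states the schema once in a header row instead of repeating
# keys per record, which usually costs fewer tokens than JSON.
chain = create_extraction_chain(llm, schema, encoder_or_encoder_class="csv")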

Accessing the original LLM response

Hey Eugene,

Thanks for the hard work. This library works very well for us.
We have a requirement to access some metadata from the original openai API response. (Not just the usage).

Is there currently a way to do this?

Best way to work with embeddings?

Hey!

Wondering if we can use embeddings with this package? I have 500 json files that have been embedded - all following a base schema, which I am loading in the attributes:

attributes=[
    Object(
        id="shipment",
        attributes=[
            Text(id="origin", description="The origin of the shipment (unloco code)"),
            Text(id="destination", description="The destination of the shipment (unloco code)"),
            Text(id="mode", description="The transport mode; can be either air, sea, rail, road"),
            Text(id="type", description="The shipment type; can be either lcl or fcl"),
        ],
    )
]

Is there a way to load in embeddings using the examples? Or perhaps hook this up with LangChain / Llama Index even...

attribute description added to extracted info in Chinese

code snippet:

schema = Object(
    id="post",
    description=(
        '''
        社交媒体博主在社交媒体上发布的脚本
        '''  # "a script posted on social media by an influencer"
    ),
    attributes=[
        Text(
            id="ingredient",
            description="化妆品的原料和成分",  # ingredients and components of the cosmetic
            examples=[],
            many=True,
        ),
        Text(
            id="function",
            description="产品能够起到的作用",  # effects the product provides
            examples=[],
            many=True,
        ),
        Text(
            id="brand",
            description="文案中的化妆品品牌",  # cosmetic brands mentioned in the copy
            examples=[],
            many=True,
        ),
        Text(
            id="product",
            description="宣传的化妆品产品",  # the cosmetic product being promoted
            examples=[],
            many=True,
        ),
        Text(
            id="skin",
            description="皮肤的类型和状态",  # skin type and condition
            examples=[],
            many=True,
        ),
        Text(
            id="target",
            description="品牌或者产品适用的用户人群",  # target user group for the brand or product
            examples=[],
            many=True,
        ),
        Text(
            id="feeling",
            description="使用化妆品后的个人感受",  # personal feelings after using the cosmetic
            examples=[],
            many=True,
        ),
        Text(
            id="scene",
            description="适合使用化妆品的地点,气候,节日,季节,场合等",  # suitable places, climates, festivals, seasons, occasions, etc.
            examples=[],
            many=True,
        ),
        Text(
            id="promotion",
            description="产品促销信息",  # product promotion information
            examples=[],
            many=True,
        ),
        Text(
            id="special",
            description="产品的优势和特点",  # advantages and features of the product
            examples=[],
            many=True,
        ),
        Text(
            id="category",
            description="化妆品所属的品类",  # category the cosmetic belongs to
            examples=[
                ("第二 有一支好的防晒霜", "防晒霜")  # ("Second, have a good sunscreen", "sunscreen")
            ],
            many=True,
        ),
    ],
    many=False,
)

but the output looks like:

{'post': {'brand': ['ZOTO'],
          'product': ['防晒霜'],
          'function': ['防晒'],
          'skin': ['皮肤类型和状态'],
          'target': ['用户人群'],
          'feeling': ['使用化妆品后的个人感受'],
          'scene': ['适合使用化妆品的地点,气候,节日,季节,场合等'],
          'promotion': ['产品促销信息'],
          'category': ['防晒霜']}}

Actually, '皮肤类型和状态' ("skin type and condition") is the attribute description, not extracted info.

Kor with long documents

I really like Kor; it's very helpful. Congratulations on taking the time to create such a fantastic solution.

How can I use Kor with an accounts payable document with many pages and tables that exceeds the total tokens allowed by OpenAI?

Can you use embeddings?

Could you post an example code of a pdf document with many invoices?
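
A minimal sketch of one way to handle this, using the extract_from_documents helper referenced in other issues here plus a LangChain text splitter to keep each chunk under the context limit (file name, chunk sizes, and the helper's keyword arguments are assumptions):

import asyncio

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

from kor.extraction import extract_from_documents

# Split the (hypothetical) PDF into chunks that fit the model's context window.
docs = PyPDFLoader("invoices.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
split_docs = splitter.split_documents(docs)

# extract_from_documents is async and runs the extraction chain per chunk;
# `chain` is the kor extraction chain built as in the other issues.
# Results from the individual chunks then need to be merged downstream.
results = asyncio.run(
    extract_from_documents(chain, split_docs, max_concurrency=5, use_uid=False)
)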

about the kor output attribute mismatch

Hi, I guided kor to output up to 8 attributes. The problem I encountered was an output attribute mismatch: the answer that should have appeared in attribute 5 ('gender') appeared in attribute 4 ('age').

When I checked the prompt that kor formatted, using print(chain.prompt.format_prompt("[user_input]").to_string()), I got the following instruction about the output:

Please output the extracted information strictly in the above order. Please use a | as the delimiter.

And the example in the prompt appeared like:

attribute1 | attribute2 | attribute3 | attribute4 | attribute5 | attribute6 | attribute7 | attribute8
input1 | input2 | input3 | input4 | input5 | input6 | input7 | input8

I assume this is the main reason for the attribute mismatch, because: 1) sometimes the answer for attribute 1 is null, and the answer that should have gone to attribute 2 goes to attribute 1; 2) the model sometimes outputs 9 answers with 8 | delimiters, which indicates the LLM was confused about how many attributes (questions) it should answer.

So my question is: how can I modify kor so that the output won't use a | as a delimiter and won't output things like

att1 | att2 | att3
in1 | in2 | in3

Instead, the output should be a dict: {'att1': 'in1', 'att2': 'in2', 'att3': 'in3'}. In this way there won't be an attribute mismatch.

Or if you have other better ways please tell me as well.

Thank you very much.
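
One knob that seems directly relevant: create_extraction_chain accepts an encoder argument (used with 'json' elsewhere in these issues), which replaces the pipe-delimited CSV format entirely. A minimal sketch:

# The JSON encoder asks the model for objects keyed by attribute id,
# so fields are matched by name rather than by column position.
chain = create_extraction_chain(llm, schema, encoder_or_encoder_class="json")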

[BUG] TypeError: initial_value must be str or None, not dict


name: Bug Report
labels: bug
assignees: ''


Describe the bug

The code in the "Working With Objects" documentation example does not appear to run correctly.

I am using kor version 0.13.0.

To Reproduce

There appear to be some errors when using this demo.

https://eyurtsev.github.io/kor/objects.html#working-with-objects

code:

from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text, Number
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from variables import OPENAI_API_KEY
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    openai_api_key=OPENAI_API_KEY,
    temperature=0,
    max_tokens=2000,
    frequency_penalty=0,
    presence_penalty=0,
    top_p=1.0,
)
schema = Object(
    id="personal_info",
    description="Personal information about a given person.",
    attributes=[
        Text(
            id="first_name",
            description="The first name of the person",
            examples=[("John Smith went to the store", "John")],
        ),
        Text(
            id="last_name",
            description="The last name of the person",
            examples=[("John Smith went to the store", "Smith")],
        ),
        Number(
            id="age",
            description="The age of the person in years.",
            examples=[("23 years old", "23"), ("I turned three on sunday", "3")],
        ),
    ],
    examples=[
        (
            "John Smith was 23 years old. He was very tall. He knew Jane Doe. She was 5 years old.",
            [
                {"first_name": "John", "last_name": "Smith", "age": 23},
                {"first_name": "Jane", "last_name": "Doe", "age": 5},
            ],
        )
    ],
    many=True,
)
chain = create_extraction_chain(llm, schema)
print(
    chain.predict_and_parse(
        text=(
            "My name is Bob Alice and my phone number is (123)-444-9999. I found my true love one"
            " on a blue sunday. Her number was (333)1232832. Her name was Moana Sunrise and she was 10 years old."
        )
    )["data"]
)

error stack:

TypeError                                 Traceback (most recent call last)
Cell In[4], line 3
      1 chain = create_extraction_chain(llm, schema)
      2 print(
----> 3     chain.predict_and_parse(
      4         text=(
      5             "My name is Bob Alice and my phone number is (123)-444-9999. I found my true love one"
      6             " on a blue sunday. Her number was (333)1232832. Her name was Moana Sunrise and she was 10 years old."
      7         )
      8     )["data"]
      9 )

File c:\Users\xxx\anaconda3\envs\demo\lib\site-packages\langchain\chains\llm.py:281, in LLMChain.predict_and_parse(self, callbacks, **kwargs)
    279 result = self.predict(callbacks=callbacks, **kwargs)
    280 if self.prompt.output_parser is not None:
--> 281     return self.prompt.output_parser.parse(result)
    282 else:
    283     return result

File c:\Users\xxx\anaconda3\envs\demo\lib\site-packages\kor\extraction\parser.py:38, in KorParser.parse(self, text)
     36 """Parse the text."""
     37 try:
---> 38     data = self.encoder.decode(text)
     39 except ParseError as e:
...
   (...)
    102                 skipinitialspace=True,
    103             )

TypeError: initial_value must be str or None, not dict

Expected behavior

{'personal_info': [{'first_name': 'Bob', 'last_name': 'Alice', 'age': ''}, {'first_name': 'Moana', 'last_name': 'Sunrise', 'age': '10'}]}


Error in document output structure

Hi @eyurtsev, I think there is something wrong with the structure of the document extraction output.
The values for each key are mixed up with some other keys, it seems.
See the attached screenshot.

Here, the value for employment period is mixed with company location.
Similarly, the value for employment period should actually go to skills at job.
I think the value assignment should shift one step down.


WARNING! msgs when running kor style schema

When I try to run the first example (using the kor style schema), I get the following warnings:

WARNING! frequency_penalty is not default parameter.
                    frequency_penalty was transferred to model_kwargs.
                    Please confirm that frequency_penalty is what you intended.
WARNING! presence_penalty is not default parameter.
                    presence_penalty was transferred to model_kwargs.
                    Please confirm that presence_penalty is what you intended.
WARNING! top_p is not default parameter.
                    top_p was transferred to model_kwargs.
                    Please confirm that top_p is what you intended.
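
These warnings appear to come from langchain's ChatOpenAI moving non-default sampling parameters into model_kwargs on your behalf; assuming that is the cause, passing them there explicitly is a sketch that silences them:

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    max_tokens=2000,
    # Explicit model_kwargs avoids the "not default parameter" warnings.
    model_kwargs={"frequency_penalty": 0, "presence_penalty": 0, "top_p": 1.0},
)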

Is there a getting started guide with a complete working example?

'BaseLanguageModel' ERROR

If I try to run the following command

from kor import create_extraction_chain

I get
ImportError: cannot import name 'BaseLanguageModel' from 'langchain.schema' (/usr/local/lib/python3.10/dist-packages/langchain/schema.py)

Any idea how to fix it?
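
A sketch of a first check, under the assumption that this is a kor/langchain version mismatch (BaseLanguageModel moved between langchain modules across releases):

from importlib.metadata import version

# If these are far apart in release date, upgrading both usually realigns them:
#     pip install -U kor langchain
print(version("kor"), version("langchain"))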

Error in extract_from_documents function

I am unable to run the last step in the document extraction article.
The function extract_from_documents returns the error below:

[TypeError("__init__() got an unexpected keyword argument 'line_terminator'"),
 TypeError("__init__() got an unexpected keyword argument 'line_terminator'"),
 TypeError("__init__() got an unexpected keyword argument 'line_terminator'"),
 TypeError("__init__() got an unexpected keyword argument 'line_terminator'")]

Looks like something changed on the langchain end.
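
One assumption worth checking: pandas 1.5 renamed DataFrame.to_csv's line_terminator keyword to lineterminator, and pandas 2.0 removed the old spelling entirely, which would surface as exactly this TypeError when older calling code runs against a newer pandas. A quick sketch to check which side of the rename you are on:

import pandas as pd

# Succeeds on pandas >= 1.5 (new spelling); code still passing
# line_terminator= fails on pandas 2.x. Pinning pandas < 2.0 is a
# plausible workaround until the calling library is updated.
pd.DataFrame({"a": [1]}).to_csv(lineterminator="\n")
print(pd.__version__)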

Integrate with LangChain indexes

My understanding is that kor can structure data from a single piece of text - it would be helpful to add some further info on how to query a LangChain index, get the relevant docs, and structure the information in them.

How to limit number of item for fields in document extraction

Hi @eyurtsev, is there a way to mention (or limit) the number of items that should be extracted for each field in the schema?
Something like below -

class ShowOrMovie(BaseModel):
    name: str = Field(
        description="The name of the movie or tv show",
        limit=3,  # <------
    )

Also, how can I explicitly instruct GPT for certain cases, e.g. not adding random gibberish text if nothing is found?
Can we put such instructions in the description parameter above?
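
One hedged workaround: since attribute descriptions are folded into the prompt (other issues here show descriptions even leaking into the output), soft constraints can be stated there; the model may still ignore them:

from pydantic import BaseModel, Field

class ShowOrMovie(BaseModel):
    # Sketch: state the limit and the "nothing found" behavior in the
    # description itself rather than a dedicated parameter.
    name: str = Field(
        description=(
            "The name of the movie or tv show. "
            "Extract at most 3 names. Output nothing if none are mentioned."
        )
    )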

Index of start and end of entity

Hi, an excellent tool for extraction. Is there a way for the output to include the start and end index of each entity in the sentence? It would be a great improvement for highlighting entities inside the sentence.

For example:
{'personal_info': [{'first_name': {'text': str, 'start': int, 'end': int}, 'last_name': {'text': str, 'start': int, 'end': int}}]}

Thanks
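
In the meantime, a best-effort post-hoc lookup is a possible sketch, assuming the model copies values verbatim from the input (it returns -1 when the model paraphrased instead):

def locate(text: str, value: str) -> dict:
    """Find the span of the first occurrence of an extracted value."""
    start = text.find(value)
    end = start + len(value) if start != -1 else -1
    return {"text": value, "start": start, "end": end}

# locate("John Smith went to the store", "John")
# -> {'text': 'John', 'start': 0, 'end': 4}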

pydantic error for invalid identifier

Problem: it seems that enum values translated from pydantic don't comply with kor's identifier convention, which causes the following issue:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../.cache/pypoetry/virtualenvs/job-aggregator-u73mtD9n-py3.11/lib/python3.11/site-packages/kor/adapters.py:151: in from_pydantic
    schema = _translate_pydantic_to_kor(
../../../.cache/pypoetry/virtualenvs/job-aggregator-u73mtD9n-py3.11/lib/python3.11/site-packages/kor/adapters.py:109: in _translate_pydantic_to_kor
    options=[Option(id=choice.value) for choice in enum_choices],
../../../.cache/pypoetry/virtualenvs/job-aggregator-u73mtD9n-py3.11/lib/python3.11/site-packages/kor/adapters.py:109: in <listcomp>
    options=[Option(id=choice.value) for choice in enum_choices],
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   pydantic.error_wrappers.ValidationError: 1 validation error for Option
E   id
E     `<built-in function id>` is not a valid identifier. Please only use lower cased a-z, _ or the digits 0-9 (type=value_error)

pydantic/main.py:341: ValidationError

My pyproject.toml:

[tool.poetry]
name = "job-aggregator"
version = "0.1.0"
description = ""
authors = ["dreamerlzl <[email protected]>"]
readme = "README.md"
packages = [{include = "job_aggregator"}]

[tool.poetry.dependencies]
python = "^3.8.1"
beautifulsoup4 = "^4.11.2"
html5lib = "^1.1"
requests = "^2.28.2"
sqlitedict = "^2.1.0"
python-telegram-bot = "^20.1"
redis = "^4.5.5"
kor = "^0.12.0"
pydantic = "1.10.8"


[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

How to reproduce:

import enum
from pydantic import BaseModel
from kor import from_pydantic

class Bar(enum.Enum):
    A = "1 2 3"


class Foo(BaseModel):
    bar: Bar


from_pydantic(Foo)
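
A workaround sketch, assuming the failure is the space-containing enum value being used as an Option id: keep the enum values identifier-safe (lowercase a-z, 0-9, _, per the error message) and map them back to display strings outside the schema:

import enum

from pydantic import BaseModel
from kor import from_pydantic

class Bar(enum.Enum):
    # "1 2 3" is not a valid kor identifier; use a safe value instead.
    A = "one_two_three"

class Foo(BaseModel):
    bar: Bar

schema, validator = from_pydantic(Foo)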

"kor" is not a package

Hi!

First of all - this use case is exactly what I am looking for!

I have tried installing kor, but when I run my script, I get this error:

No module named 'kor.extraction'; 'kor' is not a package
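
One common cause, offered as an assumption: a local file or folder named kor (e.g. a script kor.py next to yours) shadowing the installed package. A quick check:

import kor

# Should point into site-packages; a path inside your project directory
# means a local kor.py / kor/ is shadowing the real package.
print(kor.__file__)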

AttributeError: 'FieldInfo' object has no attribute 'field_info'

AttributeError                            Traceback (most recent call last)
Cell In[12], line 1
----> 1 testschema, test_validator = from_pydantic(
      2     Model,
      3     description="test",
      4
     13     many=False,
     14 )

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/kor/adapters.py:151, in from_pydantic(model_class, description, examples, many)
    133 def from_pydantic(
    134     model_class: Type[BaseModel],
    135     *,
   (...)
    138     many: bool = False,
    139 ) -> Tuple[Object, Validator]:
    140     """Convert a pydantic model to Kor internal representation.
    141
    142     Args:
   (...)
    149         A tuple of the Kor internal representation of the model and a validator.
    150     """
--> 151     schema = _translate_pydantic_to_kor(
    152         model_class,
    153         description=description,
    154         examples=examples,
    155         many=many,
    156     )
    157     validator = PydanticValidator(model_class, schema.many)
    158     return schema, validator

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/kor/adapters.py:51, in _translate_pydantic_to_kor(model_class, name, description, examples, many)
     49 attributes: List[Union[ExtractionSchemaNode, Selection, "Object"]] = []
     50 for field_name, field in model_class.__fields__.items():
---> 51     field_info = field.field_info
     52     extra = field_info.extra
     53     if "examples" in extra:

AttributeError: 'FieldInfo' object has no attribute 'field_info'

My code. Something may be wrong with the model:

class Test(BaseModel):
    abc: str = Field(
        ...,
        description="test.",
        examples=[
            ("hi test", "test"),
        ],
    )
    cdfc: str = Field(
        ...,
        description="test calculation.",
        examples=[
            ("Risk-Based", "Risk"),
        ],
    )

This is dummy code, but you get the idea of the implementation and structure. I'm not sure what I'm missing; help me out.
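
One assumption worth checking: the failing line reads field.field_info, which exists on pydantic v1 fields but not in pydantic v2 (where fields are already FieldInfo objects), so a v2 install would fail exactly like this. A quick check:

import pydantic

# kor releases of this vintage expect a 1.x version here;
# if this prints 2.x, pinning pydantic<2 is a plausible workaround.
print(pydantic.VERSION)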

Access the final prompt

According to my understanding

chain.run(text=("some text input"))

constructs a prompt based on all the parameters defined in the Object, which is then sent to OpenAI. Is there any way I can access that final prompt?
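
The handle shown in another issue above (chain.prompt) suggests one way; a minimal sketch:

# Renders the instructions, encoded examples, and your input exactly
# as they would be sent to the model.
print(chain.prompt.format_prompt(text="some text input").to_string())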

Can Kor be used to identify sections in a document?

To give a brief overview, let's say I want to parse job application CVs. I don't know the structure of the data, i.e. various people write their CV in their own style and I want to identify sections belonging to specific topics such as Skills, Experience, Education, etc. Can Kor work with these kinds of unstructured data?

:grey_question: How to "pipe" around Kor and langchain :sloth:

Hi, langchain has recently supported the | operator:

chain = prompt | model

... and for now I've started using kor with the following syntax (very good results, btw):

chain = create_extraction_chain(llm, schema)

I wonder how I could use pipes to do the same with kor?

I find the | syntax very elegant and concise, and it seems to get a lot of traction, so I'm asking.

Thank you in advance for your help. I've searched to see whether the question had already been asked, but it seems it hasn't.
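
A sketch of one possible recomposition, assuming the chain.prompt and chain.prompt.output_parser handles visible in the tracebacks elsewhere in these issues, and that both are runnable-compatible in your langchain version (untested):

chain = create_extraction_chain(llm, schema)

# Hypothetical: reuse kor's prompt template and output parser as LCEL stages.
runnable = chain.prompt | llm | chain.prompt.output_parser
result = runnable.invoke({"text": "some text input"})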

Various types of examples for various types of documents

Hello, how are you? I want to tell you that you have created a great solution, and for something so recent it works very well.

Now, my case is the following: I want to parse various types of documents that do have the same fields, but not in the same order. I tried to create an example for each case, and it works fine when the relevant examples come first, but mixing multiple examples doesn't work well. So I want a way to select the proper examples depending on the type of document. Any idea how to do this in the best possible way?

encoder error

Here is the bug report when I feed a string input to extract entities:

response = self.chain.predict_and_parse(text=text)
File "/Users/kevinliu/opt/anaconda3/envs/s3demo/lib/python3.9/site-packages/langchain-0.0.235-py3.9.egg/langchain/chains/llm.py", line 281, in predict_and_parse
return self.prompt.output_parser.parse(result)
File "/Users/kevinliu/opt/anaconda3/envs/s3demo/lib/python3.9/site-packages/kor-0.13.0-py3.9.egg/kor/extraction/parser.py", line 38, in parse
data = self.encoder.decode(text)
File "/Users/kevinliu/opt/anaconda3/envs/s3demo/lib/python3.9/site-packages/kor-0.13.0-py3.9.egg/kor/encoders/csv_data.py", line 95, in decode
with StringIO(table_str) as buffer:
TypeError: initial_value must be str or None, not dict

Add postgres type-descriptor

Add a type-descriptor that can provide a description in postgres syntax. This is for experimentation purposes for SQL type extraction.
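
A minimal sketch of the idea, assuming the describe(node) interface shown in the prompts.py snippet above and naively mapping every attribute to TEXT (a hypothetical class, not part of kor):

from kor.nodes import Object

class PostgresTypeDescriptor:
    """Hypothetical descriptor that renders a schema as a CREATE TABLE statement."""

    def describe(self, node: Object) -> str:
        # Naive: every attribute becomes a TEXT column named after its id.
        columns = ",\n".join(f"    {attr.id} TEXT" for attr in node.attributes)
        return f"CREATE TABLE {node.id} (\n{columns}\n);"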

Is Chinese id supported?

Hi, I want to extract key field information from Chinese text. Does kor's Text id support Chinese?

Possible alternate approach to compare

I am very interested in this project and think it opens up a lot of possibilities. I wanted to mention, for consideration, an alternate and complementary approach I've taken to this problem. It involves generally shorter prompts and fairly good accuracy.

At a high level, my approach has been to format the desired output and examples as a markdown table. I chunk the input into sections of one or more paragraphs and feed them to the model one at a time with the same prompt (there is a trade-off here between speed and reliability, and in whether all the needed info is contained in the same chunk of text; it depends on the use case).

I then validate the output by checking whether it contains a table with the correct number of columns and the correct column headers (a sketch of this check appears at the end of this post). If not, I resend the prompt until the correct format is output. I save the resulting information into a list of dictionaries built from the content under the table headers, then reconstitute it into whatever format I want to use.

I find it most convenient and reliable to pretend that the assistant actually produced the example output. This can be repeated for as many examples as are provided.

It might be a bit more limited as far as the data types that can be structured out of the output. I haven't tried this, but I bet you could get the model to include comma-separated values within one field in a table.

Here's an example, specific to the OpenAI chat format. Note that the API allows submitting multiple messages from the same party in a row.

User:
You will assist with locating data from unstructured text input from the user. You will output the data in a table with the following column headers:
|Name|Age|

User:
Alice is 30 years old. Bob, Alice's brother is younger than her. Bob is 28 years old.

Assistant:
|Name|Age|
|Alice|30|
|Bob|28|

User:
My grandmother came over today. She's 90 years old, and her name is Mary. I'm 30 years old, my name is Alice. Leigh is 29. We're all going to a birthday party for Brenda, who is 2 years older than Mary. My grandfather is quite old. He will also be at the birthday party. He is 80. There will be cake! My grandfather's name is Bob.

Assistant:
Thank you for the input. Here's the table with the extracted information:

|Name|Age|
|Mary|90|
|Alice|30|
|Leigh|29|
|Brenda|92|
|Bob|80|

Please note that I have inferred Brenda's age from the statement "Brenda, who is 2 years older than Mary."

I theorize that this format has fairly high reliability, given the enormous volume of data structured in tables in the text on which the model was trained.

I imagine a person could find a way to convert from a desired output data structure, like the ones in your current documentation, to a columnar format and back, should this approach prove more reliable when tested.

Just putting this out there as I'm very interested in libraries that will facilitate structured output from text! If this is tried, I'd be happy to contribute.
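
A minimal sketch of the validation step described above (header check plus row parsing; a hypothetical helper, not part of any library):

def parse_table(response, headers):
    """Accept a reply only if it contains a pipe table with the expected headers."""
    def cells(line):
        return [c.strip() for c in line.strip().strip("|").split("|")]

    rows = [line for line in response.splitlines() if line.strip().startswith("|")]
    if not rows or cells(rows[0]) != headers:
        raise ValueError("missing table or wrong headers; resend the prompt")
    # Keep only body rows whose cell count matches the header.
    return [dict(zip(headers, cells(r))) for r in rows[1:] if len(cells(r)) == len(headers)]

# parse_table("|Name|Age|\n|Alice|30|\n|Bob|28|", ["Name", "Age"])
# -> [{'Name': 'Alice', 'Age': '30'}, {'Name': 'Bob', 'Age': '28'}]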

When I use the sample from the official document and reduce the number of attributes from 3 to 2, an error occurs.

This is the sample I copied from the official document; I removed the age to reduce the number of attributes to 2. When I run chain.predict_and_parse(text=text)["data"], an error occurs.

I tried some other cases with my custom use case; it seems that if the number of attributes is 2 and I add examples on the Object, the error always occurs. If I don't add examples on the Object and instead add examples on the attributes, it is OK.

schema = Object(
    id="personal_info",
    description="Personal information about a given person.",
    attributes=[
        Text(
            id="first_name",
            description="The first name of the person",
            examples=[("John Smith went to the store", "John")],
        ),
        Text(
            id="last_name",
            description="The last name of the person",
            examples=[("John Smith went to the store", "Smith")],
        ),
    ],
    examples=[
        (
            "John Smith was 23 years old. He was very tall. He knew Jane Doe. She was 5 years old.",
            [
                {"first_name": "John", "last_name": "Smith"},
                {"first_name": "Jane", "last_name": "Doe"},
            ],
        )
    ],
    many=True,
)

Error:

ValueError Traceback (most recent call last)
Cell In[178], line 1
----> 1 output = chain.predict_and_parse(text=text)["data"]

File D:\Dev\venv310\lib\site-packages\langchain\chains\llm.py:171, in LLMChain.predict_and_parse(self, **kwargs)
169 def predict_and_parse(self, **kwargs: Any) -> Union[str, List[str], Dict[str, str]]:
170 """Call predict and then parse the results."""
--> 171 result = self.predict(**kwargs)
172 if self.prompt.output_parser is not None:
173 return self.prompt.output_parser.parse(result)

File D:\Dev\venv310\lib\site-packages\langchain\chains\llm.py:151, in LLMChain.predict(self, **kwargs)
137 def predict(self, **kwargs: Any) -> str:
138 """Format prompt with kwargs and pass to LLM.
139
140 Args:
(...)
149 completion = llm.predict(adjective="funny")
150 """
--> 151 return self(kwargs)[self.output_key]

File D:\Dev\venv310\lib\site-packages\langchain\chains\base.py:116, in Chain.__call__(self, inputs, return_only_outputs)
114 except (KeyboardInterrupt, Exception) as e:
115 self.callback_manager.on_chain_error(e, verbose=self.verbose)
--> 116 raise e
117 self.callback_manager.on_chain_end(outputs, verbose=self.verbose)
118 return self.prep_outputs(inputs, outputs, return_only_outputs)

File D:\Dev\venv310\lib\site-packages\langchain\chains\base.py:113, in Chain.__call__(self, inputs, return_only_outputs)
107 self.callback_manager.on_chain_start(
108     {"name": self.__class__.__name__},
109     inputs,
110     verbose=self.verbose,
111 )
112 try:
--> 113 outputs = self._call(inputs)
114 except (KeyboardInterrupt, Exception) as e:
115 self.callback_manager.on_chain_error(e, verbose=self.verbose)

File D:\Dev\venv310\lib\site-packages\langchain\chains\llm.py:57, in LLMChain._call(self, inputs)
56 def _call(self, inputs: Dict[str, Any]) -> Dict[str, str]:
---> 57 return self.apply([inputs])[0]

File D:\Dev\venv310\lib\site-packages\langchain\chains\llm.py:118, in LLMChain.apply(self, input_list)
116 def apply(self, input_list: List[Dict[str, Any]]) -> List[Dict[str, str]]:
117 """Utilize the LLM generate method for speed gains."""
--> 118 response = self.generate(input_list)
119 return self.create_outputs(response)

File D:\Dev\venv310\lib\site-packages\langchain\chains\llm.py:61, in LLMChain.generate(self, input_list)
59 def generate(self, input_list: List[Dict[str, Any]]) -> LLMResult:
60 """Generate LLM result from inputs."""
---> 61 prompts, stop = self.prep_prompts(input_list)
62 return self.llm.generate_prompt(prompts, stop)

File D:\Dev\venv310\lib\site-packages\langchain\chains\llm.py:79, in LLMChain.prep_prompts(self, input_list)
77 for inputs in input_list:
78 selected_inputs = {k: inputs[k] for k in self.prompt.input_variables}
---> 79 prompt = self.prompt.format_prompt(**selected_inputs)
80 _colored_text = get_colored_text(prompt.to_string(), "green")
81 _text = "Prompt after formatting:\n" + _colored_text

File D:\Dev\venv310\lib\site-packages\kor\prompts.py:82, in ExtractionPromptTemplate.format_prompt(self, text)
79 """Format the prompt."""
80 text = format_text(text, input_formatter=self.input_formatter)
81 return ExtractionPromptValue(
---> 82 string=self.to_string(text), messages=self.to_messages(text)
83 )

File D:\Dev\venv310\lib\site-packages\kor\prompts.py:97, in ExtractionPromptTemplate.to_string(self, text)
95 """Format the template to a string."""
96 instruction_segment = self.format_instruction_segment(self.node)
---> 97 encoded_examples = self.generate_encoded_examples(self.node)
98 formatted_examples: List[str] = []
100 for in_example, output in encoded_examples:

File D:\Dev\venv310\lib\site-packages\kor\prompts.py:133, in ExtractionPromptTemplate.generate_encoded_examples(self, node)
131 """Generate encoded examples."""
132 examples = generate_examples(node)
--> 133 return encode_examples(
134 examples, self.encoder, input_formatter=self.input_formatter
135 )

File D:\Dev\venv310\lib\site-packages\kor\encoders\encode.py:59, in encode_examples(examples, encoder, input_formatter)
52 def encode_examples(
53 examples: Sequence[Tuple[str, str]],
54 encoder: Encoder,
55 input_formatter: InputFormatter = None,
56 ) -> List[Tuple[str, str]]:
57 """Encode the output using the given encoder."""
---> 59 return [
60 (
61 format_text(input_example, input_formatter=input_formatter),
62 encoder.encode(output_example),
63 )
64 for input_example, output_example in examples
65 ]

File D:\Dev\venv310\lib\site-packages\kor\encoders\encode.py:62, in <listcomp>(.0)
52 def encode_examples(
53 examples: Sequence[Tuple[str, str]],
54 encoder: Encoder,
55 input_formatter: InputFormatter = None,
56 ) -> List[Tuple[str, str]]:
57 """Encode the output using the given encoder."""
59 return [
60 (
61 format_text(input_example, input_formatter=input_formatter),
---> 62 encoder.encode(output_example),
63 )
64 for input_example, output_example in examples
65 ]

File D:\Dev\venv310\lib\site-packages\kor\encoders\csv_data.py:77, in CSVEncoder.encode(self, data)
74 if not isinstance(data_to_output, list):
75 # Should always output records for pd.Dataframe
76 data_to_output = [data_to_output]
---> 77 table_content = pd.DataFrame(data_to_output, columns=field_names).to_csv(
78 index=False, sep=DELIMITER
79 )
81 if self.use_tags:
82 return wrap_in_tag("csv", table_content)

File D:\Dev\venv310\lib\site-packages\pandas\core\frame.py:762, in DataFrame.__init__(self, data, index, columns, dtype, copy)
754 mgr = arrays_to_mgr(
755 arrays,
756 columns,
(...)
759 typ=manager,
760 )
761 else:
--> 762 mgr = ndarray_to_mgr(
763 data,
764 index,
765 columns,
766 dtype=dtype,
767 copy=copy,
768 typ=manager,
769 )
770 else:
771 mgr = dict_to_mgr(
772 {},
773 index,
(...)
776 typ=manager,
777 )

File D:\Dev\venv310\lib\site-packages\pandas\core\internals\construction.py:349, in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
344 # _prep_ndarraylike ensures that values.ndim == 2 at this point
345 index, columns = _get_axes(
346 values.shape[0], values.shape[1], index=index, columns=columns
347 )
--> 349 _check_values_indices_shape_match(values, index, columns)
351 if typ == "array":
353 if issubclass(values.dtype.type, str):

File D:\Dev\venv310\lib\site-packages\pandas\core\internals\construction.py:420, in _check_values_indices_shape_match(values, index, columns)
418 passed = values.shape
419 implied = (len(index), len(columns))
--> 420 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")

ValueError: Shape of passed values is (1, 1), indices imply (1, 2)

Loading schema from file

Hi,

is there any way to load a schema from a file? I tried serializing an existing schema to JSON, but loading (via Object.parse_raw) fails because there is no type discriminator for the attributes.

Is there already another way to do that? If not, are you planning to implement this or would you accept a PR for this?

Thanks
Boris

Feature: to allow more input variables in prompt template

Thank you for making this fantastic tool! Here's a feature request to consider:

Currently, when creating an extraction chain, only two optional input variables (type_description and format_instructions) are acceptable when defining the prompt template.

Would it make sense for it to accept more input variables? For instance, for use cases requiring the LLM to know current date and time, accepting an additional current_datetime input variable would be ideal to supply such context to the LLM.

Feature: Langchain Memory

Hey there,

Great job on this library, it's been really useful for my needs. However, I've hit a use case that would need langchain memory, which is not implemented (yet?).

My application needs to translate user input into search parameters for database filtering. But, because the translation can sometimes be inaccurate, I'd like to give users the ability to manually edit these search parameters as they wish. This could be done either within the app's UI and/or directly using natural language in a chat box. Basically, users should be able to tweak the output (the list of search parameters) until they're satisfied with it, which requires langchain memory.

I think this could be a really cool feature that would broaden the possible applications of the library.

What do you think?

Thanks!

predict_and_parse deprecated

Since langchain 0.0.205, the predict_and_parse methods on the chain are deprecated. The solution is to set an output parser on the Chain directly. Kor currently does not set the output parser. It also uses apredict_and_parse for extract_from_documents. predict_and_parse is also used in several places in the documentation.

As far as I can see, it is quite a trivial change in api.py, tests, and documentation.

Do you want to implement the change, or are you willing to accept a PR for it?
