Support for Upsert operation in IKernelMemory about kernel-memory HOT 5 CLOSED

drelyea commented on June 1, 2024

Support for Upsert operation in IKernelMemory

Comments (5)

dluc commented on June 1, 2024

> If the update operation depends on persisted pipeline records between operations, that would absolutely explain the behavior I see. I'll do a little more digging and see if this is the case.

Thanks for investigating. Yes, I think you're on the right track. All Import* methods act as Upsert, and for the upsert logic to work they require persistence in the content storage, which is where ID uniqueness/existence is detected. Serverless memory is volatile by default, and that would explain what you're seeing.

If you need Serverless Memory to be fully persistent:

- Set a content storage. The default SimpleFileStorage is ok, but you need to set StorageType to Disk.
- Set a vector storage. The default SimpleVectorDb is ok for tests/demos, but you need to set StorageType to Disk. If you want something more performant but local, I'd suggest using Qdrant or Postgres (other options are coming soon).

If by any chance you're setting Serverless to use queues, I would avoid using SimpleQueues and opt for Azure Queues or RabbitMQ. Or just don't use queues with Serverless, because it's an odd setup :-)
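
For reference, the two settings above might look like this in code. This is a minimal sketch, not a verified configuration. It assumes the `KernelMemoryBuilder` extension methods `WithSimpleFileStorage` and `WithSimpleVectorDb`, the `SimpleFileStorageConfig` and `SimpleVectorDbConfig` classes, and the `FileSystemTypes.Disk` value as they appear in recent Kernel Memory packages. The namespaces have moved between releases, the directory names are made up for illustration, and `WithOpenAIDefaults` is only a placeholder for whatever text and embedding generators you already use.

```csharp
using System;
using Microsoft.KernelMemory;
// Assumption: these namespaces match your package version; older releases used different ones.
using Microsoft.KernelMemory.DocumentStorage.DevTools;
using Microsoft.KernelMemory.FileSystem.DevTools;
using Microsoft.KernelMemory.MemoryStorage.DevTools;

var memory = new KernelMemoryBuilder()
    // Placeholder AI setup; swap in the generators you actually use.
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY"))
    // Content storage on disk, so pipeline records survive between process runs.
    .WithSimpleFileStorage(new SimpleFileStorageConfig
    {
        StorageType = FileSystemTypes.Disk,
        Directory = "km-content" // hypothetical folder name
    })
    // Vector storage on disk (fine for tests/demos, per the comment above).
    .WithSimpleVectorDb(new SimpleVectorDbConfig
    {
        StorageType = FileSystemTypes.Disk,
        Directory = "km-vectors" // hypothetical folder name
    })
    .Build<MemoryServerless>();
```

If the vector store is already a persistent service (Azure AI Search in this thread), only the content storage line should need to change.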

dluc commented on June 1, 2024

> it also means that there is no easy Update mechanism if my intention is to completely replace everything associated with the documentId in question.

Hi @drelyea, when you upload a document with the same ID, the resulting operation is equivalent to an Upsert: all the previous information is replaced. For instance, if you upload a PDF with ID "foo" and then upload a Word doc with the same ID "foo", the content of the PDF is replaced with the content of the Word doc. The same applies if you upload multiple files under one document ID (a document can be composed of multiple files).

Perhaps the "Import" name is confusing, but I can assure you it's designed to work this way:

- If a Document ID is provided => Upsert, i.e. Replace
- If a Document ID is not provided => Insert new record, return new Document ID in the response
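
The two rules above can be illustrated with a small toy model. `ToyDocumentStore` is a hypothetical in-memory stand-in written for this example. It is not Kernel Memory code, and it only mirrors the contract described here: a supplied ID replaces, a missing ID inserts and returns a new ID.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in for illustration only; not a Kernel Memory type.
public sealed class ToyDocumentStore
{
    private readonly Dictionary<string, string> _documents = new();

    // Returns the document ID that the text was stored under.
    public string Import(string text, string? documentId = null)
    {
        // No ID supplied: generate one, so this call is always an insert.
        string id = string.IsNullOrEmpty(documentId)
            ? Guid.NewGuid().ToString("N")
            : documentId;

        // Indexer assignment adds the key or overwrites the existing value,
        // which gives the "Upsert, i.e. Replace" behavior for a known ID.
        _documents[id] = text;
        return id;
    }

    public int Count => _documents.Count;

    public string Get(string documentId) => _documents[documentId];
}
```

Importing twice with the ID "foo" leaves a single entry holding the second text, while an import without an ID adds a second entry under a freshly generated ID.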

drelyea commented on June 1, 2024

Hey @dluc! Thanks for getting back to me. This seems at odds with the behavior I observe, at least with ImportText.

> uploading a document with the same ID, the resulting operation is equivalent to an Upsert

Is this true for both ImportDocument and ImportText?

As an example, I call IKernelMemory.ImportText at two different points in time with the same documentId but conflicting information:

// Call 1
await memory.ImportTextAsync(
    text: "Researchers from the International Society of Frog Enthusiasts have determined there are exactly 386 kinds of Frog in the world", // Random text
    documentId: "2cc8d19a-4fa1-484b-8f27-1a1b72f653f2"); // A specific guid

// Call 2 (later, same documentId, conflicting text)
await memory.ImportTextAsync(
    text: "Researchers from the International Society of Frog Enthusiasts have determined there are exactly 5 kinds of Frog in the world", // Random text
    documentId: "2cc8d19a-4fa1-484b-8f27-1a1b72f653f2"); // A specific guid

When I look at my index in Azure Search Service searching for 'frog', I can see 2 distinct entities with matching __document_id tag:

"value": [
    {
      "id": " [some distinct key] ",
      "tags": [
        "__document_id:2cc8d19a-4fa1-484b-8f27-1a1b72f653f2",
        ....
      ],
      "payload": "{\"url\":\"\",\"schema\":\"20231218A\",\"file\":\"content.txt\",\"text\":\"Researchers from the International Society of Frog Enthusiasts have determined there are exactly 386 kinds of Frog in the world\",\"vector_provider\":\"AI.AzureOpenAI.AzureOpenAITextEmbeddingGenerator\",\"vector_generator\":\"TODO\",\"last_update\":\"2024-01-04T22:25:59\"}",
      "embedding": [ <taken out for length> ]
    },
    {
      "id": " [some distinct key] ",
      "tags": [
        "__document_id:2cc8d19a-4fa1-484b-8f27-1a1b72f653f2",
        ....
      ],
      "payload": "{\"url\":\"\",\"schema\":\"20231218A\",\"file\":\"content.txt\",\"text\":\"Researchers from the International Society of Frog Enthusiasts have determined there are exactly 5 kinds of Frog in the world\",\"vector_provider\":\"AI.AzureOpenAI.AzureOpenAITextEmbeddingGenerator\",\"vector_generator\":\"TODO\",\"last_update\":\"2024-01-04T22:27:00\"}",
      "embedding": [ <taken out for length> ]
    }
]

Finally, when I call IKernelMemory.AskAsync("How many kinds of frogs are there?"), I get back:

There are conflicting reports. One source states that there are exactly 386 kinds of frog in the world, while another source states that there are only 5 kinds of frog in the world. Therefore, the exact number of frog species is unclear.

And the MemoryAnswer.RelevantSources object clearly shows 2 distinct sources with the same __document_id tag value.

drelyea commented on June 1, 2024

I believe I may have found the answer after looking into BaseOrchestrator - I'm using the MemoryServerless implementation, currently invoking via a CLI which instantiates and tears down dependencies after every SDK call. I've also omitted adding a true Azure Storage Content Storage dependency.

If the update operation depends on persisted pipeline records between operations, that would absolutely explain the behavior I see. I'll do a little more digging and see if this is the case.

drelyea commented on June 1, 2024

Appreciate it - I'll look into these options! Looks like I was using a persisted vector storage, but not a persisted content storage.

I also verified this with an integration test that runs the two ImportText operations one after another without a teardown in between, and I did observe upsert working properly. Persistence is what I was missing: I had assumed uniqueness came from the Azure Search dependency, not from the pipeline content storage.

Thanks for the help!
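
A rough mental model of the behavior resolved in this thread, as a sketch only. Every type below is a hypothetical stand-in invented for illustration, and the real Kernel Memory pipeline may work differently in detail. The assumption, taken from the explanation above, is that an import looks up what the same document ID produced last time in content storage and removes those records from the vector index before writing new ones.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical durable vector index (plays the role Azure AI Search had in this thread).
public sealed class ToyVectorIndex
{
    private readonly Dictionary<string, (string DocumentId, string Text)> _records = new();

    public string Add(string documentId, string text)
    {
        string recordId = Guid.NewGuid().ToString("N");
        _records[recordId] = (documentId, text);
        return recordId;
    }

    public void Remove(string recordId) => _records.Remove(recordId);

    public int CountFor(string documentId) =>
        _records.Values.Count(r => r.DocumentId == documentId);
}

// Hypothetical content storage: remembers which index records each document produced.
public sealed class ToyContentStorage
{
    private readonly Dictionary<string, List<string>> _recordIdsByDocument = new();

    public IReadOnlyList<string> PreviousRecords(string documentId) =>
        _recordIdsByDocument.TryGetValue(documentId, out var ids) ? ids : new List<string>();

    public void Save(string documentId, List<string> recordIds) =>
        _recordIdsByDocument[documentId] = recordIds;
}

public static class ToyPipeline
{
    public static void ImportText(
        ToyVectorIndex index, ToyContentStorage storage, string documentId, string text)
    {
        // Upsert step: delete whatever the previous import of this ID wrote.
        // With an empty (volatile) storage there is nothing to find, so nothing is deleted.
        foreach (string oldRecordId in storage.PreviousRecords(documentId))
        {
            index.Remove(oldRecordId);
        }

        string newRecordId = index.Add(documentId, text);
        storage.Save(documentId, new List<string> { newRecordId });
    }
}
```

Calling `ToyPipeline.ImportText` twice with the same `ToyContentStorage` instance leaves one record for the document ID. Passing a fresh `ToyContentStorage` on each call, which is roughly what a CLI that tears down volatile storage after every call does, leaves two records in the index, matching the duplicate entries seen in Azure Search.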
