Comments (5)

dluc commented on June 1, 2024

> If the update operation depends on persisted pipeline records between operations, that would absolutely explain the behavior I see. I'll do a little more digging and see if this is the case.

Thanks for investigating; yes, I think you're on the right track. All Import* methods act as upserts, and for the upsert logic to work they require persistence in the content storage, which is where ID uniqueness/existence is detected. Serverless memory is volatile by default, which would explain what you're seeing.

If you need Serverless Memory to be fully persistent:

  • set a content storage. The default SimpleFileStorage is ok, but you need to set StorageType to Disk.
  • set a vector storage. The default SimpleVectorDb is ok for tests/demos, but you need to set StorageType to Disk. If you want something more performant but local, I'd suggest using Qdrant or Postgres (other options are coming soon)

If by any chance you're setting Serverless to use queues, I would avoid using SimpleQueues and opt for Azure Queues or RabbitMQ. Or just don't use queues with Serverless, because it's an odd setup :-)
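For reference, the two settings above could look roughly like this when building a serverless instance. This is a sketch, not a verified configuration: the `using` namespaces differ between Kernel Memory releases, the directory names are made up, and `WithOpenAIDefaults` with an `OPENAI_API_KEY` environment variable is only an assumed stand-in for whatever text/embedding generators you actually use.

```csharp
using System;
using Microsoft.KernelMemory;
// Assumed namespaces; they have moved between releases, adjust to your version.
using Microsoft.KernelMemory.ContentStorage.DevTools; // SimpleFileStorageConfig
using Microsoft.KernelMemory.FileSystem.DevTools;     // FileSystemTypes
using Microsoft.KernelMemory.MemoryStorage.DevTools;  // SimpleVectorDbConfig

var memory = new KernelMemoryBuilder()
    // Assumption: OpenAI defaults; swap in your own generators.
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY")!)
    // Content storage on disk, so document/pipeline records survive a restart.
    .WithSimpleFileStorage(new SimpleFileStorageConfig
    {
        StorageType = FileSystemTypes.Disk,
        Directory = "km-files" // hypothetical path
    })
    // Vector storage on disk (fine for tests/demos).
    .WithSimpleVectorDb(new SimpleVectorDbConfig
    {
        StorageType = FileSystemTypes.Disk,
        Directory = "km-vectors" // hypothetical path
    })
    .Build<MemoryServerless>();
```

With both stores on disk, a process that is torn down and recreated between calls should still find the records the upsert logic relies on.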

from kernel-memory.

dluc commented on June 1, 2024

> it also means that there is no easy Update mechanism if my intention is to completely replace everything associated with the documentId in question.

Hi @drelyea, when you upload a document with the same ID, the resulting operation is equivalent to an Upsert: all the previous information is replaced. For instance, if you upload a PDF with ID "foo" and then upload a Word doc with the same ID "foo", the content of the PDF is replaced with the content of the Word doc. The same applies if you upload multiple files under one document ID (a document can be composed of multiple files).

Perhaps the "Import" name is confusing, but I can assure you it's designed to work this way:

  • If a Document ID is provided => Upsert, i.e. Replace
  • If a Document ID is not provided => Insert new record, return new Document ID in the response
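A small sketch of the two cases, assuming an already-configured `IKernelMemory` instance backed by persistent content storage. The class name, method name and sample texts are made up for illustration.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.KernelMemory;

public static class UpsertSketch // hypothetical helper, not part of the library
{
    public static async Task RunAsync(IKernelMemory memory)
    {
        // Document ID provided: the first call creates document "foo".
        await memory.ImportTextAsync("Version one of the note.", documentId: "foo");

        // Same ID again: intended to replace everything stored under "foo".
        await memory.ImportTextAsync("Version two of the note.", documentId: "foo");

        // No ID provided: a new record is inserted and its generated ID is returned.
        string generatedId = await memory.ImportTextAsync("An unrelated note.");
        Console.WriteLine($"New document ID: {generatedId}");
    }
}
```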

drelyea commented on June 1, 2024

Hey @dluc! Thanks for getting back to me. This seems at odds with the behavior I observe, at least with ImportText.

> uploading a document with the same ID, the resulting operation is equivalent to an Upsert

Is this true for both ImportDocument and ImportText?

As an example, I call IKernelMemory.ImportText at two different points in time with the same documentId but conflicting information:

// Call 1: first import under this document ID
await memory.ImportTextAsync(
    text: "Researchers from the International Society of Frog Enthusiasts have determined there are exactly 386 kinds of Frog in the world", // Random text
    documentId: "2cc8d19a-4fa1-484b-8f27-1a1b72f653f2"); // A specific guid

// Call 2: same document ID, conflicting text
await memory.ImportTextAsync(
    text: "Researchers from the International Society of Frog Enthusiasts have determined there are exactly 5 kinds of Frog in the world", // Random text
    documentId: "2cc8d19a-4fa1-484b-8f27-1a1b72f653f2"); // A specific guid

When I search my Azure Search Service index for 'frog', I can see 2 distinct entities with a matching __document_id tag:

"value": [
    {
      "id": " [some distinct key] ",
      "tags": [
        "__document_id:2cc8d19a-4fa1-484b-8f27-1a1b72f653f2",
        ....
      ],
      "payload": "{\"url\":\"\",\"schema\":\"20231218A\",\"file\":\"content.txt\",\"text\":\"Researchers from the International Society of Frog Enthusiasts have determined there are exactly 386 kinds of Frog in the world\",\"vector_provider\":\"AI.AzureOpenAI.AzureOpenAITextEmbeddingGenerator\",\"vector_generator\":\"TODO\",\"last_update\":\"2024-01-04T22:25:59\"}",
      "embedding": [ <taken out for length> ]
    },
    {
      "id": " [some distinct key] ",
      "tags": [
        "__document_id:2cc8d19a-4fa1-484b-8f27-1a1b72f653f2",
        ....
      ],
      "payload": "{\"url\":\"\",\"schema\":\"20231218A\",\"file\":\"content.txt\",\"text\":\"Researchers from the International Society of Frog Enthusiasts have determined there are exactly 5 kinds of Frog in the world\",\"vector_provider\":\"AI.AzureOpenAI.AzureOpenAITextEmbeddingGenerator\",\"vector_generator\":\"TODO\",\"last_update\":\"2024-01-04T22:27:00\"}",
      "embedding": [ <taken out for length> ]
    }
]

Finally, when I call IKernelMemory.AskAsync("How many kinds of frogs are there?"), I get back:

There are conflicting reports. One source states that there are exactly 386 kinds of frog in the world, while another source states that there are only 5 kinds of frog in the world. Therefore, the exact number of frog species is unclear.

And the MemoryAnswer.RelevantSources object clearly shows 2 distinct sources with the same __document_id tag value.

drelyea commented on June 1, 2024

I believe I may have found the answer after looking into BaseOrchestrator: I'm using the MemoryServerless implementation, currently invoked via a CLI that instantiates and tears down dependencies after every SDK call. I've also omitted adding a true Azure Storage content storage dependency.

If the update operation depends on persisted pipeline records between operations, that would absolutely explain the behavior I see. I'll do a little more digging and see if this is the case.

drelyea commented on June 1, 2024

Appreciate it - I'll look into these options! Looks like I was using a persisted vector storage, but not a persisted content storage.

I also verified this with an integration test that runs the two ImportText operations one after another without a teardown in between, and I did observe the upsert working properly. Persistence is what I was missing: I had assumed uniqueness came from the Azure Search dependency, not from the pipeline content storage.

Thanks for the help!
