
Add support for local text · 6 comments · CLOSED

mem0ai commented on July 30, 2024
Add support for local text

from mem0.

Comments (6)

cachho commented on July 30, 2024

The following questions should be thought through before adding a new endpoint:

  • Add a new endpoint. What should it be named?

I think add_local makes sense and is descriptive, yet still short. Can't think of anything better.

  • How many arguments should it take?

The add_local method is closely related to the add method; I just changed the argument name from url to content. I think we should only allow a single argument to the add_local method, just like the add method takes. Then a collection type (list, tuple, dict) can be passed, depending on the use cases we come up with in the future.
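As a minimal sketch of that single-argument idea (all names hypothetical; this is not the actual embedchain API), the expected shape of content could simply depend on data_type:

```python
def add_local(content, data_type="text"):
    """Hypothetical local counterpart of add(url): a single content
    argument whose expected shape depends on data_type."""
    expected = {"text": str, "qna_pair": tuple}
    if data_type not in expected:
        raise ValueError(f"unsupported local data_type: {data_type}")
    if not isinstance(content, expected[data_type]):
        raise TypeError(
            f"{data_type} expects {expected[data_type].__name__}, "
            f"got {type(content).__name__}"
        )
    return content
```

This keeps the call shape identical to add: one positional value, with collection types (a tuple here) reserved for richer data types later.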

  • What other formats can it support?

Plaintext should be the basis. A QnA pair is something that definitely makes sense, and I think the name qna-pair works well: calling it a pair communicates that the input is a tuple. Then you can allow local passing of everything that currently uses a URL. Passing a local video is probably too exotic a use case (I don't know if that even works; no idea how the video is chunked), but PDFs would make sense. I would skip webpages, since if you have them locally you can pass the important parts as text.

  • Where should the loader live?
  • Where should the chunker live?

I admit that making a separate folder for local loaders but not for chunkers is debatable. The reason I did this is that, unlike the rest of the loaders, this loader doesn't take a url as input, so I separated it. When it comes to chunking, local data really isn't treated differently from the other types, which is why I put it in the same folder. That's my reasoning.

  • What should be the url value in meta_data?

I don't know what the metadata is used for in the end and why it matters. If it's just used for hashing and deduplication, a url of 'local' should be fine, since then it effectively just compares the content. You could add filenames, but then you would have to add a second parameter, and not everything necessarily comes from its own file. For instance, URLs are unique per video/page, but for QnA pairs I can't think of a separate filename for each pair.
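To make the deduplication point concrete, here is a toy sketch, assuming ids are hashed from chunk content plus the metadata url (hypothetical names; the real implementation may differ):

```python
import hashlib

def doc_id(chunk: str, url: str) -> str:
    """Stable id derived from chunk text plus the metadata url.
    With url fixed at 'local', identical content collides, so
    dedup effectively compares the content alone."""
    return hashlib.sha256(f"{url}:{chunk}".encode("utf-8")).hexdigest()

# No per-file name is available (e.g. for a QnA pair), so:
meta_data = {"url": "local"}
```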

  • Who will format the data? What should be the standard formatting guideline here? Or should there be no guideline, leaving the user to format the data however they want?

I would generally say that this should not be a sanitization library. We need to keep focus, and that is embeddings and LLMs. If you want to add online sources with a single command, some sanitization has to be included; this package is meant to be an abstraction. But local files should have as little abstraction and opinionation at the embedchain level as possible. That also makes it possible, if you aren't happy with the way this package handles formatting for web pages, to use requests, do your own formatting, and then pass the result as local text. I think this is a smart move that makes the package adoptable for a wider range of requirements and skill levels.

  • What is a good chunking size here? Which chunker to use?

I cannot answer this.

  • What should be the retrieval prompt for this?


taranjeet commented on July 30, 2024

@cachho has put together some nice thoughts

Hey, first of all cool project.

I'm not sure if you want this to be an online-only project. I totally see where that comes in handy. I prefer to do the loading of the data myself, sanitize it locally, make sure everything is as I want it, and then have it embedded. I guess the difference in approach is that I want to supply my own learning material, while the online approach is more about teaching the right public material.

My specific use case is the following: I'm looking to switch from a classic QnA bot to an LLM. That's why I added the option to add a local QnA pair.

I made a new method, add_local, to make it clear that it's not a URL you're supposed to pass to the function. This might seem like it adds complexity now, but I think this split makes sense going forward. We're going to want to add more local options, for instance regular plain text. And then we could also consider adding local variants of the online stuff; maybe not videos, but local PDFs make sense.

I hope you approve of my idea, and agree that this functionality is the right way to move forward. Thanks.

I want to add that the prompt engineering I do using Q:... \nA:... is how OpenAI does it if you select the Q&A preset in the playground.

I'm happy to change the chunk size. I just looked at what you used for a PDF and decided it should be smaller, since QnA pairs are usually smaller than a PDF I'd say, but that's totally negotiable.


taranjeet commented on July 30, 2024

The following questions should be thought through before adding a new endpoint:

  • Add a new endpoint. What should it be named?
  • How many arguments should it take?
  • What other formats can it support?
  • Where should the loader live?
  • Where should the chunker live?
  • What should be the url value in meta_data?
  • Who will format the data? What should be the standard formatting guideline here? Or should there be no guideline, leaving the user to format the data however they want?
  • What is a good chunking size here? Which chunker to use?
  • What should be the retrieval prompt for this?


taranjeet commented on July 30, 2024

Here are my thoughts on the above.

Add a new endpoint. What should it be named?

  • The current name is .add_local. It is intuitive: the name suggests something local rather than online. An alternative could be .add_offline (not intuitive at all).

  • Let's go with .add_local.

How many arguments should it take?

Ideally 2: one is the data_type and the other is the text.
E.g.:
For a qna pair, it will be ('qna_pair', (ques, ans))
For text, it will be ('text', text)
For markdown, it will be ('markdown', markdown)

  • The current implementation is fine.
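The (data_type, content) shapes listed above can be sketched as a small dispatch table (illustrative only, not the actual embedchain code):

```python
def format_local(data_type, content):
    """Normalize each supported (data_type, content) pair to plain
    text, following the shapes listed above."""
    formatters = {
        "qna_pair": lambda c: f"Q: {c[0]}\nA: {c[1]}",
        "text": lambda c: c,
        "markdown": lambda c: c,  # markdown passed through verbatim
    }
    return formatters[data_type](content)
```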

What other formats can it support?

Covered in the examples above, but stating it here: text, markdown.
Right now let's add support for qna pairs; others can come later if the community suggests them.

Where should the loader live?

The loader should live in the loader directory only. Since it's local, its file name can be local_qna_pair.py: the local_ prefix makes clear that it's a local data loader, and the rest of the name is the data_type. We will assume all other loaders (where local_ is not present) to be online.

  • Change needed here: can we rename the loader per the above guideline?
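Under that naming guideline, a local loader might look like the following sketch (the real embedchain loader interface may differ; the path and return shape are assumptions):

```python
# Hypothetical embedchain/loaders/local_qna_pair.py, named per the
# guideline: local_ prefix + data_type.
class LocalQnaPairLoader:
    def load_data(self, content):
        """Unlike url-based loaders, receives the data directly
        instead of fetching it from a url."""
        question, answer = content
        return [{
            "content": f"Q: {question}\nA: {answer}",
            "meta_data": {"url": "local"},
        }]
```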

Where should the chunker live?

The chunker should live in the chunker directory only. For the chunker we can skip the local prefix, since a chunker operates on data and it doesn't matter where the data was retrieved from.

  • No change needed in the current implementation.

What should be the url value in meta_data?

For online loaders, meta_data is data specific, but it always has a key called url.
We will need the url key here too because some functionality depends on it.
Let's keep the url key.
Right now the intuitive value is 'local'.
The only concern is that when we fetch existing ids before creating embeddings, it will fetch all the chunks with url 'local'. Let's live with this for now, but in the future we may need to revisit it.

Who will format the data? What should be the standard formatting guideline here? Or should there be no guideline, leaving the user to format the data however they want?

  • All data formatting should be user specific, since we cannot anticipate in advance how the user wants to format the data.
  • Also, if the data is formatted in a specific way, it becomes limiting for other users.

What is a good chunking size here? Which chunker to use?

  • RecursiveChunker with a chunk size of 300 makes sense for now. This may be revisited in the future.
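As a rough sketch of what recursive chunking at size 300 does (this toy version only illustrates the idea of trying coarse separators first and falling back to finer ones; it is not the real RecursiveChunker):

```python
def chunk(text, chunk_size=300, separators=("\n\n", "\n", " ")):
    """Greedy recursive split: try coarser separators first, fall
    back to finer ones until every piece fits in chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, buf = [], ""
            for part in parts:
                candidate = f"{buf}{sep}{part}" if buf else part
                if len(candidate) <= chunk_size:
                    buf = candidate
                else:
                    if buf:
                        chunks.append(buf)
                    buf = part
            if buf:
                chunks.append(buf)
            # recurse on any piece that is still too large
            return [c for ch in chunks
                    for c in chunk(ch, chunk_size, separators)]
    # no separator worked: hard split at chunk_size boundaries
    return [text[i:i + chunk_size]
            for i in range(0, len(text), chunk_size)]
```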

What should be the retrieval prompt for this?

Right now, we can keep the same prompt, but in the future we might have data_type specific retrieval prompts.


RuairiSpain commented on July 30, 2024

Would like to see this; it would make this project compete with LlamaIndex and PersonalGPT.


cachho commented on July 30, 2024

Would like to see this; it would make this project compete with LlamaIndex and PersonalGPT.

This feature has been added.

