Comments (3)
Tables in Markdown need to be chunked into a single embedding; it doesn't make sense to split the content strictly based on the token limit.
Hi @Licantrop0, it's not that simple. What if a table is too big for the embedding model? I think a more advanced chunker would split by row, keeping the header row in each chunk. Even then, a single row might be too big, so more complex logic is needed.
from kernel-memory.
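The row-based splitting suggested above could look roughly like the sketch below. It is written in Python for brevity and is not Kernel Memory code: `chunk_markdown_table` and `count_tokens` are hypothetical names, and the word-count tokenizer is only a stand-in for the embedding model's real tokenizer. It repeats the header row and separator row in every chunk, and it does not attempt the per-cell splitting needed when a single row exceeds the limit.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Hypothetical token counter: whitespace-separated words stand in for model tokens."""
    return len(text.split())


def chunk_markdown_table(
    table: str,
    max_tokens: int,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Split a Markdown table by row, repeating the header in every chunk."""
    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        # Not a complete table (no header + separator); return it unchanged.
        return ["\n".join(lines)] if lines else []

    header = lines[:2]  # header row plus separator row, e.g. |---|---|
    header_cost = sum(counter(ln) for ln in header)

    chunks: List[str] = []
    current: List[str] = []
    current_cost = header_cost

    for row in lines[2:]:
        cost = counter(row)
        # Flush the current chunk when adding this row would exceed the budget.
        if current and current_cost + cost > max_tokens:
            chunks.append("\n".join(header + current))
            current, current_cost = [], header_cost
        # Assumption: an oversized row is still emitted whole in its own chunk;
        # splitting it by cell is deliberately left out of this sketch.
        current.append(row)
        current_cost += cost

    # Emit the last chunk, or the bare header if the table has no data rows.
    if current or not chunks:
        chunks.append("\n".join(header + current))
    return chunks


if __name__ == "__main__":
    sample = (
        "| Name | Qty |\n"
        "|---|---|\n"
        "| apple | 1 |\n"
        "| pear | 2 |\n"
        "| plum | 3 |"
    )
    for chunk in chunk_markdown_table(sample, max_tokens=16):
        print(chunk)
        print("---")
```

Repeating the header costs a few tokens per chunk, but each embedding then carries the column names that give the cell values their meaning.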
I described how to approach the problem if the entire table doesn't fit the embedding model, down to the single cell.
The current chunking model makes table data completely unusable, as it breaks the structure and doesn't maintain context (the table headers).
It's also difficult with the current Memory APIs to retrieve the nearby partitions, as was done in this old example: https://github.com/Azure-Samples/semantic-kernel-rag-chat/blob/cc51e164ac1e559e80437918c671ab6257e7c873/src/chapter2/Chapter2Function.cs#L45
For reference, the current approach is this:
Thanks for the details. The existing chunker is a sample, with its limitations, and we welcome improvements. The behavior with Markdown files is a bare-minimum implementation, and there are improvements that could be made with regard to tables, lists, headers, etc. If someone wants to work on adding the feature described above, or on other improvements, we'd be open to helping with PR reviews.