Comments (3)
Removing line endings is a tricky step because it changes the semantics of the text. At a high level, I usually leave all line endings in place until the moment of calculating embeddings and saving the text chunk. I would avoid removing line endings while chunking sentences, because doing so might join paragraphs together. There is also the scenario where one might want to merge "the chunk before" and "the chunk after" with the current chunk; in that case it is better to have the original line endings.
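To illustrate the idea, here is a minimal Python sketch, not Kernel Memory's actual chunker. All function names are hypothetical, and splitting on blank lines is only an assumed chunking rule. Chunks keep their original line endings, whitespace is collapsed only when preparing text for the embedding call, and neighbor merging works on the untouched text.

```python
import re


def split_into_chunks(document: str) -> list[str]:
    # Assumed rule: split on blank lines only. Line endings inside each
    # paragraph are preserved exactly as they appear in the source.
    return [p for p in re.split(r"\n\s*\n", document) if p.strip()]


def text_for_embedding(chunk: str) -> str:
    # Collapse all whitespace, including line endings, at the last moment,
    # just before the text would be sent to an embedding model.
    return " ".join(chunk.split())


def with_neighbors(chunks: list[str], index: int, separator: str = "\n\n") -> str:
    # Merge "the chunk before" and "the chunk after" with the current chunk,
    # using the original text so paragraph boundaries stay visible.
    start = max(index - 1, 0)
    end = min(index + 2, len(chunks))
    return separator.join(chunks[start:end])
```

Because normalization happens only in `text_for_embedding`, the stored chunk can still be merged with its neighbors without paragraphs running together.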
To be honest, I'm a bit surprised that cosine similarity is affected so much by this detail, and I would not jump the gun without a thorough investigation, with a considerable number of test cases and a report comparing results. I'm worried we could make things worse overall.
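A comparison of this kind could be sketched as follows. This is only an assumed harness: `embed` is a hypothetical callable standing in for whatever embedding generator is used, and the test cases are (query, chunk) pairs supplied by the person running the experiment.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def compare_line_ending_effect(
    cases: list[tuple[str, str]],
    embed: Callable[[str], Sequence[float]],  # hypothetical embedding function
) -> list[dict]:
    # For each (query, chunk) pair, score the chunk with and without
    # line endings, and record the difference between the two scores.
    rows = []
    for query, chunk in cases:
        query_vec = embed(query)
        with_endings = cosine_similarity(query_vec, embed(chunk))
        without_endings = cosine_similarity(query_vec, embed(" ".join(chunk.split())))
        rows.append({
            "query": query,
            "with": with_endings,
            "without": without_endings,
            "delta": without_endings - with_endings,
        })
    return rows
```

Running it over a large set of cases and looking at the distribution of `delta`, rather than a handful of examples, would show whether the change helps or hurts overall.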
There's also another way to look at it: for each piece of text we could keep multiple versions (with line endings, without, etc.) and calculate embeddings for all of them. It would increase cost, but could make things better for everyone.
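A rough Python sketch of the multiple-versions idea is shown below. The variant names and the `embed` callable are assumptions made for illustration and are not part of any existing API.

```python
from typing import Callable, Sequence


def text_variants(chunk: str) -> dict[str, str]:
    # Hypothetical variant set: the original text and a version with
    # line endings (and other whitespace runs) collapsed to single spaces.
    return {
        "original": chunk,
        "no_line_endings": " ".join(chunk.split()),
    }


def embed_variants(
    chunk: str,
    embed: Callable[[str], Sequence[float]],  # hypothetical embedding function
) -> dict[str, Sequence[float]]:
    # Embed every variant, but call the model only once per distinct text
    # so identical variants do not add to the cost.
    cache: dict[str, Sequence[float]] = {}
    result: dict[str, Sequence[float]] = {}
    for name, text in text_variants(chunk).items():
        if text not in cache:
            cache[text] = embed(text)
        result[name] = cache[text]
    return result
```

The extra cost scales with the number of distinct variants per chunk, which is why the sketch skips a second call when a variant is identical to one already embedded.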
from kernel-memory.
I understand your point, so I think the correct approach could be the feature I discussed in #379, so that everyone can run tests for their specific use case.