Comments (6)
Thanks for creating an issue on the topic.
I can imagine a keep_id flag when calling the run function could be enough. It could be set to False by default.
The id could then be reused at the end, when the new document is created.
from haystack.
I will take this issue. I believe @Phlasse's approach seems to be the way to go, as it keeps things simpler.
Hey @CarlosFerLo @julian-risch @Phlasse
Just saw your PR contribution @CarlosFerLo - thank you.
I was wondering: if we are already adding another init parameter, why not make it more powerful and future-proof, yet as simple as the keep_id flag?
We could add an id_generator init parameter:
id_generator: Optional[Callable[[Document, str], str]] = None
where the first parameter of the callable is the cleaned document, the second parameter is the id of the old document, and the callable returns a str.
It is assigned in init like this:
self.id_generator = id_generator or (lambda doc, old_id: doc.id) # or whatever the default is
and the end of the run method becomes:
clean_doc = Document(content=text, meta=deepcopy(doc.meta))
clean_doc.id = self.id_generator(clean_doc, doc.id)
cleaned_docs.append(clean_doc)
Let me know your thoughts about this approach.
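A minimal, self-contained sketch of the id_generator proposal above. The Document dataclass and the Cleaner class here are simplified stand-ins (assumptions for illustration, not the real haystack classes), and the default generator simply keeps whatever id the cleaned document computed for itself, which is only one possible default:

```python
import hashlib
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Document:
    """Simplified stand-in for a document class (assumption, not haystack's Document)."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = ""

    def __post_init__(self) -> None:
        # Derive a content-based id when none is given explicitly.
        if not self.id:
            self.id = hashlib.sha256(self.content.encode("utf-8")).hexdigest()


# (cleaned document, id of the old document) -> id for the cleaned document
IdGenerator = Callable[[Document, str], str]


class Cleaner:
    """Hypothetical cleaner that only strips surrounding whitespace."""

    def __init__(self, id_generator: Optional[IdGenerator] = None) -> None:
        # Assumed default: keep the id the cleaned document computed for itself.
        self.id_generator: IdGenerator = id_generator or (lambda doc, old_id: doc.id)

    def run(self, documents: List[Document]) -> Dict[str, List[Document]]:
        cleaned_docs = []
        for doc in documents:
            text = doc.content.strip()
            clean_doc = Document(content=text, meta=deepcopy(doc.meta))
            clean_doc.id = self.id_generator(clean_doc, doc.id)
            cleaned_docs.append(clean_doc)
        return {"documents": cleaned_docs}


def keep_id(doc: Document, old_id: str) -> str:
    """Id generator that reproduces the behaviour of a keep_id=True flag."""
    return old_id
```

With this shape, Cleaner(id_generator=keep_id) reuses the original ids, while Cleaner() lets each cleaned document derive a fresh id from its new content.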
@vblagoje then what are you suggesting?
We keep the keep_id flag, and we add an id_generator, where the id generator can overwrite the resulting id of the cleaned document, receiving the original document and the new id.
Or we delete the keep_id flag and put all the functionality in the id_generator parameter, and we can add a keep_id id generator that has this behaviour.
And the important part: what is the default id generator?
@CarlosFerLo I'm suggesting we don't use the keep_id flag, because an id_generator callable can replace it and also cover future use cases where people want to customize doc id generation. We can then use this familiar pattern not only in this component but in others as well: whenever a component creates derived documents and we need to generate ids for them.
@vblagoje Okay. I will create a new pull request to clean up the commit tree. Should I create a class for this kind of function, to make the code more readable and to allow adding more functionality in the future? Should I name the class DocumentIdGenerator or just IdGenerator? And to specify the input parameters: does it receive the new document and the old id? Or would it be better if it gets the new document and the old one? That would give more flexibility, but having the new document already formatted also seems a little redundant.
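To make the class-based option concrete, here is a rough sketch of the variant that receives both documents. Everything in it is hypothetical: the DocumentIdGenerator protocol, the KeepId and SuffixId classes, and the stripped-down Document stand-in are illustration names, not existing haystack API.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Document:
    """Stripped-down stand-in for a document (assumption, not haystack's Document)."""

    content: str
    id: str


class DocumentIdGenerator(Protocol):
    """Hypothetical interface: given the new and the old document, return the new id."""

    def __call__(self, new_doc: Document, old_doc: Document) -> str: ...


class KeepId:
    """Reuses the id of the old document unchanged."""

    def __call__(self, new_doc: Document, old_doc: Document) -> str:
        return old_doc.id


class SuffixId:
    """Derives the new id from the old one by appending a configurable suffix."""

    def __init__(self, suffix: str) -> None:
        self.suffix = suffix

    def __call__(self, new_doc: Document, old_doc: Document) -> str:
        return f"{old_doc.id}{self.suffix}"
```

Passing the whole old document lets a generator use its content or other fields as well as its id, at the cost of a slightly heavier signature than the (new document, old id) form. Classes like these also leave room for configuration, as SuffixId("-clean") shows, which a bare lambda does not express as readably.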