Comments (6)

Phlasse commented on May 30, 2024

Thanks for creating an issue on the topic.

I imagine a keep_id flag on the run function could be enough. It could be set to False by default.
The id could then be reused at the end, when the new document is created.
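To make the idea concrete, here is a minimal sketch of what a keep_id flag could look like. The Document class below is a simplified stand-in (not the real haystack class), and clean_documents with its whitespace-collapsing logic is a hypothetical placeholder for the actual cleaning step.

```python
import hashlib
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Simplified stand-in for a document; NOT the real haystack class."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = ""

    def __post_init__(self) -> None:
        # Assumption: when no id is given, derive one from the content.
        if not self.id:
            self.id = hashlib.sha256(self.content.encode("utf-8")).hexdigest()


def clean_documents(documents: List[Document], keep_id: bool = False) -> List[Document]:
    """Hypothetical cleaner: collapses whitespace, optionally reusing the old id."""
    cleaned_docs = []
    for doc in documents:
        text = " ".join(doc.content.split())
        clean_doc = Document(
            content=text,
            meta=deepcopy(doc.meta),
            # Reuse the original id only when the caller asks for it.
            id=doc.id if keep_id else "",
        )
        cleaned_docs.append(clean_doc)
    return cleaned_docs
```

With keep_id=False the cleaned document gets a fresh content-based id; with keep_id=True it keeps the id of the document it was derived from.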

from haystack.

CarlosFerLo commented on May 30, 2024

I will take this issue. I believe @Phlasse's approach is the way to go, as it keeps things simpler.

vblagoje commented on May 30, 2024

Hey @CarlosFerLo @julian-risch @Phlasse

Just saw your PR contribution @CarlosFerLo - thank you.

Since we are already adding another init parameter, I was wondering why not make it more powerful and future-proof, yet as simple as the keep_id flag.

We could add an id_generator init parameter:

id_generator: Optional[Callable[[Document, str], str]] = None

where the first parameter of the callable is the cleaned document, the second parameter is the id of the old document, and the callable returns a str.

assigned in init like this:

self.id_generator = id_generator or (lambda doc, id: doc.id) # or whatever default is

and the end of the run method is:

clean_doc = Document(content=text, meta=deepcopy(doc.meta))
clean_doc.id = self.id_generator(clean_doc, doc.id)
cleaned_docs.append(clean_doc)
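Put together, the pieces above could look roughly like the following self-contained sketch. The Document class is a simplified stand-in rather than the real haystack class, and Cleaner, its whitespace-collapsing step, and keep_old_id are hypothetical names used only for illustration.

```python
import hashlib
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Document:
    """Simplified stand-in for a document; NOT the real haystack class."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = ""

    def __post_init__(self) -> None:
        # Assumption: when no id is given, derive one from the content.
        if not self.id:
            self.id = hashlib.sha256(self.content.encode("utf-8")).hexdigest()


class Cleaner:
    """Hypothetical component showing only the id_generator wiring."""

    def __init__(self, id_generator: Optional[Callable[[Document, str], str]] = None):
        # Default: keep whatever id the cleaned document computed for itself.
        self.id_generator = id_generator or (lambda doc, old_id: doc.id)

    def run(self, documents: List[Document]) -> List[Document]:
        cleaned_docs = []
        for doc in documents:
            text = " ".join(doc.content.split())  # placeholder cleaning step
            clean_doc = Document(content=text, meta=deepcopy(doc.meta))
            clean_doc.id = self.id_generator(clean_doc, doc.id)
            cleaned_docs.append(clean_doc)
        return cleaned_docs


def keep_old_id(clean_doc: Document, old_id: str) -> str:
    """Generator that reproduces the keep_id behaviour."""
    return old_id
```

Passing id_generator=keep_old_id gives the same result a keep_id=True flag would, while the default leaves the freshly computed id in place.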

Let me know your thoughts about this approach.

CarlosFerLo commented on May 30, 2024

@vblagoje then what are you suggesting?

Do we keep the keep_id flag and add an id_generator that can overwrite the resulting id of the cleaned document, receiving the original document and the new id?

Or do we drop the keep_id flag, move all the functionality into the id_generator parameter, and provide a keep_id id generator that has this behaviour?

And the important part: what is the default id generator?

vblagoje commented on May 30, 2024

@CarlosFerLo I'm suggesting we don't use the keep_id flag, because an id_generator callable can replace it and also cover future use cases where people want to customize doc id generation. We can then use this familiar pattern not only in this component but in others as well: whenever a component creates derived documents and we need to generate ids for them.

CarlosFerLo commented on May 30, 2024

@vblagoje Okay. I will create a new pull request, to clean up the commit tree. Should I create a class for this kind of function, to make the code more readable and to allow adding more functionality in the future? Should I name the class DocumentIdGenerator or just IdGenerator? And to specify the input parameters: does it receive the new document and the old id? Or is it better if it gets the new document and the old one? That would give more flexibility, but having the new document already formatted seems a little redundant.
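For comparison, here is a rough sketch of the class-based variant where the generator receives both the new and the old document. Everything here is hypothetical: Document is a simplified stand-in, and DocumentIdGenerator, KeepId, SuffixId and assign_id are illustrative names, not existing haystack APIs.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Protocol


@dataclass
class Document:
    """Simplified stand-in for a document; NOT the real haystack class."""

    content: str
    id: str
    meta: Dict[str, Any] = field(default_factory=dict)


class DocumentIdGenerator(Protocol):
    """Hypothetical interface: any callable taking (new_doc, old_doc) fits."""

    def __call__(self, new_doc: Document, old_doc: Document) -> str: ...


class KeepId:
    """Reuses the id of the document the new one was derived from."""

    def __call__(self, new_doc: Document, old_doc: Document) -> str:
        return old_doc.id


@dataclass
class SuffixId:
    """Derives the new id from the old one by appending a suffix."""

    suffix: str = "cleaned"

    def __call__(self, new_doc: Document, old_doc: Document) -> str:
        return f"{old_doc.id}-{self.suffix}"


def assign_id(new_doc: Document, old_doc: Document, generator: DocumentIdGenerator) -> Document:
    """Sets the id of new_doc using the given generator and returns it."""
    new_doc.id = generator(new_doc, old_doc)
    return new_doc
```

Receiving the whole old document lets a generator use its meta or content as well as its id, at the cost of a slightly wider interface than the (new document, old id) signature.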
