Comments (9)
Yes, that helps! So essentially, the DocumentSplitter should return a List[Document].
But since it takes as input:
def run(self, documents: List[Document]):
it should return a List[List[Document]]. Do you agree?
from haystack.
Hi Sebastian, I can pick this up again after finishing some high-priority issues I need to handle, maybe by the end of the week. Just to let you know, I haven't forgotten it.
This issue #6706 is related since we currently do not keep page break information when converting a PDF file to a Haystack Document.
Hi @sjrl :)
This is my first issue. I'm trying to understand the requirements better. It seems to me that to keep the page number and the associated text, we would have to keep the chunks in the metadata, e.g.:
units = self._split_into_units(doc.content, self.split_by)
text_splits = self._concatenate_units(units, self.split_length, self.split_overlap)
metadata = deepcopy(doc.meta)
metadata["source_id"] = doc.id
metadata["page_number"] = units
split_docs += [Document(content=txt, meta=metadata) for txt in text_splits]
This has a few drawbacks:
- duplicated text: doc.content and doc.meta['page_number'] now hold the same information. A possible solution would be to have self._concatenate_units() triggered only when doc.content is called/needed
- the metadata["page_number"] has the page number 0, but this can be easily fixed
Hi @davidsbatista!
Thanks for taking on this issue :)
I don't think we need to keep the associated text for the use case I am imagining. Basically, what we are interested in, in Haystack terms, would look like this:
- Load a PDF File
- Convert a PDF file to a single Document object --> PyPDFToDocument
- Split the single Document into Chunked Documents (so Document to List of Documents) --> Document Splitter
- In this final step I would like to insert a page_number into each Doc's metadata in the List of Documents that would tell me which page the chunked doc came from, based on the original single Document. This tracking of page_number was done in Haystack v1 by counting and keeping track of page breaks (\f)
Does this make more sense?
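For illustration, the page-break counting described above could be sketched roughly as follows. This is only a sketch under assumptions, not the Haystack v1 or v2 implementation: the function name split_with_page_numbers and its split_length parameter are hypothetical, chunks are plain strings rather than Document objects, and the page number recorded is the page on which a chunk starts.

```python
from typing import List, Tuple


def split_with_page_numbers(text: str, split_length: int = 3) -> List[Tuple[str, int]]:
    """Hypothetical helper: split `text` into chunks of `split_length` words
    and pair each chunk with the 1-based page number on which it starts."""
    # Split on single spaces only, so form feed characters ("\f") are kept
    # inside the units instead of being swallowed as whitespace.
    words = text.split(" ")
    chunks: List[Tuple[str, int]] = []
    page = 1
    for start in range(0, len(words), split_length):
        chunk = " ".join(words[start:start + split_length])
        chunks.append((chunk, page))
        # Every page break seen in this chunk moves later chunks to a later page.
        page += chunk.count("\f")
    return chunks
```

In a splitter, the integer from each pair would be written to the chunk's metadata under a key such as page_number instead of being returned in a tuple.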
Hmm, I'm not entirely sure. Initially I would say that it makes sense to return List[List[Document]], but often we want a flattened list to be returned, since we will often directly write these documents to a document store, which I believe expects List[Document] as input.
So I think to keep that workflow working we should return List[Document] or have some way of flattening the list. What do you think?
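If the nested shape were kept, flattening it before writing to a document store would be cheap. A minimal sketch, assuming a hypothetical helper named flatten and using plain strings as stand-ins for Document objects:

```python
from itertools import chain
from typing import List, TypeVar

T = TypeVar("T")


def flatten(nested: List[List[T]]) -> List[T]:
    """Hypothetical helper: turn one list of chunks per input document
    into a single flat list, preserving order."""
    return list(chain.from_iterable(nested))


# Strings stand in for Document objects here.
per_document_chunks = [["doc1-chunk1", "doc1-chunk2"], [], ["doc2-chunk1"]]
flat_chunks = flatten(per_document_chunks)
```

The trade-off is that the flat list no longer shows which input each chunk came from, which is why the snippet earlier in the thread stores the parent id in metadata["source_id"].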
No problem! Thanks for the update
I would be interested in a follow-up on this 👀 Is there something I could do?
@lambda-science there hasn't been any follow-up. Feel free to start working on it if you feel like it.