Comments (5)
🤖
Sure, you can use a combination of libraries such as pdf.js for text extraction and pdf-image for image extraction. Here's a simplified example of how you can achieve this:
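import fs from 'fs';
import path from 'path';
// Node.js entry point for pdfjs-dist 2.x/3.x; 4.x ships it as 'pdfjs-dist/legacy/build/pdf.mjs'.
import { getDocument } from 'pdfjs-dist/legacy/build/pdf.js';
import { PDFImage } from 'pdf-image';

const filePath = 'docs';

export const run = async () => {
  try {
    const rawDocs = [];
    const files = fs.readdirSync(filePath);
    for (const file of files) {
      if (!file.endsWith('.pdf')) continue;
      const pdfPath = path.join(filePath, file);
      // pdf.js expects the raw bytes as a Uint8Array, not a Node Buffer.
      const data = new Uint8Array(fs.readFileSync(pdfPath));
      const pdfDoc = await getDocument({ data }).promise;
      const pdfImage = new PDFImage(pdfPath);
      // pdf.js page numbers are 1-based; pdf-image page numbers are 0-based.
      for (let i = 1; i <= pdfDoc.numPages; i++) {
        const page = await pdfDoc.getPage(i);
        const textContent = await page.getTextContent();
        // Marked-content items carry no `str`, so skip them.
        const text = textContent.items
          .map((item) => ('str' in item ? item.str : ''))
          .join(' ');
        // convertPage renders the page and resolves to the path of the image file.
        const image = await pdfImage.convertPage(i - 1);
        rawDocs.push({ text, image });
      }
    }
    // ... rest of the code
  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to ingest your data');
  }
};

(async () => {
  await run();
  console.log('ingestion complete');
})();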
In this code, pdfjs-dist loads the PDF document and extracts the text content of each page, while pdf-image renders each page of the PDF to an image file (it converts whole pages; it does not pull out individual embedded images). The text and the rendered image for each page are then stored in the rawDocs array.
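If the per-page entries in rawDocs need to flow into the rest of an ingestion pipeline, one option is to shape them into records with a pageContent string and a metadata object. The following is only a sketch under assumptions: PageRecord, toPageRecords, and the imagePath metadata field are hypothetical names introduced here for illustration, not part of the repository or of any library.

```typescript
// Hypothetical record shape: text goes in pageContent, everything else in metadata.
interface PageRecord {
  pageContent: string;
  metadata: { source: string; page: number; imagePath: string };
}

// Pairs the text of each page with the image rendered for that page.
// `texts[i]` and `imagePaths[i]` must both describe page i + 1 of `source`.
function toPageRecords(
  source: string,
  texts: string[],
  imagePaths: string[],
): PageRecord[] {
  if (texts.length !== imagePaths.length) {
    throw new Error(
      `Expected one image per page, got ${texts.length} texts and ${imagePaths.length} images`,
    );
  }
  return texts.map((pageContent, i) => ({
    pageContent,
    // Store a human-friendly 1-based page number.
    metadata: { source, page: i + 1, imagePath: imagePaths[i] },
  }));
}

// Usage with made-up values:
const pageRecords = toPageRecords(
  'docs/report.pdf',
  ['Intro text', 'Chart discussion'],
  ['report-0.png', 'report-1.png'],
);
console.log(pageRecords.length);
```

Keeping the image path in metadata rather than next to the text means the records stay usable by code that only reads pageContent.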
Please note that this is a simplified example, and you may need to adjust the code to fit your specific needs, for example by handling errors per file and optimizing performance. You also need to install the pdfjs-dist and pdf-image libraries using npm:
npm install pdfjs-dist pdf-image
Note that pdf-image shells out to external command-line tools (ImageMagick, Ghostscript, and Poppler's pdfinfo), so these must be installed separately.
As for the current method of processing PDFs in the gpt4-pdf-chatbot-langchain project, I wasn't able to find an answer within the repository. If you can share more information about it, I can give a more accurate answer.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
from gpt4-pdf-chatbot-langchain.
🤖
Hello @araza110! I'm Dosu, a friendly bot here to assist you with your queries, help resolve any issues you might have, and guide you in becoming a contributor to this project. I'm your resource to use while you're waiting for a human maintainer to get back to you. Don't hesitate to ask me anything you need. Let's make your experience with gpt4-pdf-chatbot-langchain as smooth as possible!
Based on the current implementation of the gpt4-pdf-chatbot-langchain repository, the chatbot does not have the capability to process images from PDF files. The CustomPDFLoader class, which is responsible for loading and parsing PDF files, only extracts the text content and the number of pages from the PDF file. It does not extract images.
To implement a feature where the chatbot can process both the PDF text and images, you would need to modify the CustomPDFLoader class to also extract images from the PDF files. This could potentially be done using a library that supports image extraction from PDF files, such as pdf.js or pdf-image.
Once the images are extracted, you would need to generate image embeddings. This could be done using a pre-trained image embedding model, such as those available in TensorFlow Hub or PyTorch Hub. The generated image embeddings could then be stored in the Pinecone vector store alongside the text embeddings.
Here is a rough example of how this could be implemented:
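// Hypothetical helpers: neither exists in the repository or in LangChain,
// so you would have to implement both yourself.
declare function extractImagesFromPDF(raw: Buffer): Promise<Buffer[]>;
declare function generateImageEmbeddings(images: Buffer[]): Promise<number[][]>;

export class CustomPDFLoader extends BufferLoader {
  public async parse(
    raw: Buffer,
    metadata: Document['metadata'],
  ): Promise<Document[]> {
    const { pdf } = await PDFLoaderImports();
    const parsed = await pdf(raw);
    // Extract images from the PDF file
    const images = await extractImagesFromPDF(raw);
    // Generate image embeddings
    const imageEmbeddings = await generateImageEmbeddings(images);
    return [
      new Document({
        pageContent: parsed.text,
        metadata: {
          ...metadata,
          pdf_numpages: parsed.numpages,
          // Document only has pageContent and metadata, so extra data must live here.
          // Pinecone metadata cannot hold nested number arrays: strip this field
          // before upserting and write the image vectors as their own records.
          imageEmbeddings,
        },
      }),
    ];
  }
}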
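Note that the textKey option of PineconeStore.fromDocuments is a single string that names the metadata field holding each document's text. It cannot carry image embeddings, so leave the call as it is and upsert the image vectors into the index separately:
await PineconeStore.fromDocuments(docs, embeddings, {
  pineconeIndex: index,
  namespace,
  textKey: 'text', // a single metadata key, not a list of fields
});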
Please note that this is a high-level overview and the actual implementation may be more complex. You would need to handle cases where a PDF file does not contain any images, and you would need to ensure that the image embeddings are compatible with the Pinecone vector store.
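The two caveats above (PDFs without images, and embeddings that must be compatible with the index) can be sketched as a small validation step that runs before anything is written to the vector store. This is only an illustration under assumptions: VectorRecord, buildImageRecords, and the id format are hypothetical names, and the id/values/metadata layout is modeled on the general shape of vector-store records rather than taken from the repository.

```typescript
// Hypothetical record shape for one image vector.
interface VectorRecord {
  id: string;
  values: number[];
  metadata: { source: string; kind: 'image'; imageIndex: number };
}

// Turns per-image embeddings into records for a separate upsert.
// Returns an empty array when the PDF had no images, and throws when an
// embedding's length does not match the dimension the index was created with.
function buildImageRecords(
  source: string,
  imageEmbeddings: number[][],
  indexDimension: number,
): VectorRecord[] {
  return imageEmbeddings.map((values, imageIndex) => {
    if (values.length !== indexDimension) {
      throw new Error(
        `Image ${imageIndex} of ${source} has dimension ${values.length}, ` +
          `but the index expects ${indexDimension}`,
      );
    }
    return {
      id: `${source}#image-${imageIndex}`,
      values,
      metadata: { source, kind: 'image', imageIndex },
    };
  });
}

// Usage with made-up 3-dimensional vectors:
const imageRecords = buildImageRecords('report.pdf', [[0.1, 0.2, 0.3]], 3);
console.log(imageRecords.length);
```

Because text and image models usually produce vectors of different lengths, a check like this tends to surface the need for a separate index (or a shared multimodal embedding model) early instead of at upsert time.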
As for the GPT-agent's inability to answer questions related to images, this is because the GPT-agent is a text-based model and does not have the ability to process images. To enable the GPT-agent to answer questions related to images, you would need to implement a system where the image content is converted into a textual description that the GPT-agent can process. This could potentially be done using an image captioning model, but it would likely be a complex task and may not always produce accurate results.
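As a rough sketch of the "textual description" idea, captions produced by some image captioning model could be appended to the page text before it is embedded, so that a text-only model sees them. The captioning step itself is not shown and is an assumption. mergeCaptionsIntoText and the "[Image N: ...]" format are hypothetical choices made for illustration.

```typescript
// Appends image captions to a page's text so a text-only model can use them.
// Blank captions are dropped; if none remain, the page text is returned unchanged.
function mergeCaptionsIntoText(pageText: string, captions: string[]): string {
  const cleaned = captions
    .map((caption) => caption.trim())
    .filter((caption) => caption.length > 0);
  if (cleaned.length === 0) {
    return pageText;
  }
  const captionLines = cleaned.map(
    (caption, i) => `[Image ${i + 1}: ${caption}]`,
  );
  return [pageText.trimEnd(), ...captionLines].join('\n');
}

// Usage with a made-up caption:
const enriched = mergeCaptionsIntoText('Quarterly revenue rose.', [
  'bar chart of revenue by quarter',
]);
console.log(enriched);
```

Marking captions with a recognizable prefix keeps them distinguishable from the original text, which can help when the model's answer needs to cite where a fact came from.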
I hope this helps! If you have any further questions, please let me know.
🤖
To implement the feature you're looking for, you'll need to modify the CustomPDFLoader class to extract images from the PDF files and generate image embeddings. Here's a step-by-step guide on how you can achieve this:
- Extract Images from PDF: You can use a library like pdf-image-extractor to extract images from the PDF files (check the package and its API on npm first, since the names below have not been verified). You'll need to install it using npm:
npm install pdf-image-extractor
Then, you can modify the CustomPDFLoader class to use this library to extract images:
// Class and method names as given in this answer; not verified against the published package.
import { PDFImageExtractor } from 'pdf-image-extractor';
// ...
public async load(): Promise<Document[]> {
  // ... existing code that produces `buffer` and `metadata` ...
  const extractor = new PDFImageExtractor(this.filePathOrBlob);
  const images = await extractor.extractImages();
  return this.parse(buffer, metadata, images);
}
- Attach the Images to the Document: LangChain's Document only has pageContent and metadata fields, so rather than adding a new images property to the Document object, carry the extracted images in metadata:
export class CustomPDFLoader extends BufferLoader {
  public async parse(
    raw: Buffer,
    metadata: Document['metadata'],
    // Optional so the override stays compatible with BufferLoader.parse(raw, metadata).
    images: Buffer[] = [],
  ): Promise<Document[]> {
    const { pdf } = await PDFLoaderImports();
    const parsed = await pdf(raw);
    return [
      new Document({
        pageContent: parsed.text,
        metadata: {
          ...metadata,
          pdf_numpages: parsed.numpages,
          images, // extracted images travel with the document's metadata
        },
      }),
    ];
  }
}
- Generate Image Embeddings: Once you have the images, you'll need to generate image embeddings. You can use a library like tensorflow.js to create a model that generates embeddings from the images. You can then store these embeddings in the PineconeStore along with the text embeddings.
Please note that this is a simplified example and might need adjustments based on your specific requirements and the structure of your project. Also, the pdf-image-extractor library might not work in a browser environment, as it uses Node.js file system APIs. If you need to extract images in a browser, you might need to use a different library or approach.
Regarding the structure of the Document object, I wasn't able to find specific information within the repository. If you could provide more details about the structure of the Document object, I might be able to give a more accurate answer.
I hope this helps! If you have any more questions or need further clarification, feel free to ask.
@dosu-bot, Sorry, I am confused with your solution. Can you give me another workaround?
@dosu-bot, What if I don't want to use CustomPDFLoader?