Comments (5)

dosubot avatar dosubot commented on June 4, 2024 3

🤖

Sure, you can combine pdf.js (the pdfjs-dist package) for text extraction with pdf-image, which renders each PDF page to an image file. Here's a simplified example of how you can achieve this:

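import fs from 'fs';
import path from 'path';
// pdfjs-dist exposes getDocument(), not a PDFDocument class. The import path
// for Node depends on the installed version (older releases use
// 'pdfjs-dist/legacy/build/pdf.js').
import { getDocument } from 'pdfjs-dist';
import { PDFImage } from 'pdf-image';

const filePath = 'docs';

export const run = async () => {
  try {
    const rawDocs = [];
    const files = fs.readdirSync(filePath);
    for (const file of files) {
      if (file.endsWith('.pdf')) {
        const pdfPath = path.join(filePath, file);
        // pdf.js expects a Uint8Array rather than a Node Buffer.
        const data = new Uint8Array(fs.readFileSync(pdfPath));
        const pdfDoc = await getDocument({ data }).promise;
        const pdfImage = new PDFImage(pdfPath);

        // pdf.js pages are 1-indexed; pdf-image pages are 0-indexed.
        for (let i = 1; i <= pdfDoc.numPages; i++) {
          const page = await pdfDoc.getPage(i);
          const textContent = await page.getTextContent();
          // Marked-content items carry no text, so fall back to ''.
          const text = textContent.items
            .map((item) => ('str' in item ? item.str : ''))
            .join(' ');
          // convertPage resolves to the path of the rendered image file.
          const image = await pdfImage.convertPage(i - 1);
          rawDocs.push({ text, image });
        }
      }
    }

    // ... rest of the code
  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to ingest your data');
  }
};

(async () => {
  await run();
  console.log('ingestion complete');
})();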

In this code, we are using pdfjs-dist to load the PDF document and extract the text content from each page. We are also using pdf-image to render each page of the PDF to an image file. The text and the path of the rendered image for each page are then stored in the rawDocs array.
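
If you would rather not go through CustomPDFLoader, one possible shape for the `// ... rest of the code` step is to turn each rawDocs entry into a plain `{ pageContent, metadata }` object, which is the shape a LangChain Document uses. This is only a minimal sketch: the `RawPage` and `PageDoc` types and the `toPageDocs` function are hypothetical names, and it assumes each entry also records its source file and page number, which the example above does not do.

```typescript
// Hypothetical shape of one rawDocs entry, extended with source and page.
interface RawPage {
  text: string;
  image: string; // path of the rendered page image
  source: string; // PDF file the page came from
  page: number; // 1-based page number
}

// Mirrors the { pageContent, metadata } shape of a LangChain Document.
interface PageDoc {
  pageContent: string;
  metadata: { source: string; page: number; imagePath: string };
}

// Hypothetical helper: convert raw pages into document-shaped objects.
function toPageDocs(rawPages: RawPage[]): PageDoc[] {
  return rawPages
    // Skip pages with no extractable text (for example, scanned pages).
    .filter((p) => p.text.trim().length > 0)
    .map((p) => ({
      pageContent: p.text.trim(),
      metadata: { source: p.source, page: p.page, imagePath: p.image },
    }));
}
```

Keeping the image path in metadata means the image can be looked up again when a text chunk from that page is retrieved, without changing how the text itself is embedded.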

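Please note that this is a simplified example and you may need to adjust the code to fit your specific needs, such as handling errors and optimizing performance. pdf-image also shells out to ImageMagick (convert), Ghostscript (gs), and Poppler (pdfinfo), so those tools must be installed on the system. You would also need to install the pdfjs-dist and pdf-image libraries using npm: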

npm install pdfjs-dist pdf-image

As for the current method of processing PDFs in the gpt4-pdf-chatbot-langchain project, I wasn't able to find an answer within the repository. If you have more information about this, it would be helpful to provide a more accurate answer.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

from gpt4-pdf-chatbot-langchain.

dosubot avatar dosubot commented on June 4, 2024 2

🤖

Hello @araza110! I'm Dosu, a friendly bot here to assist you with your queries, help resolve any issues you might have, and guide you in becoming a contributor to this project. I'm your resource to use while you're waiting for a human maintainer to get back to you. Don't hesitate to ask me anything you need. Let's make your experience with gpt4-pdf-chatbot-langchain as smooth as possible!

Based on the current implementation of the gpt4-pdf-chatbot-langchain repository, the chatbot does not have the capability to process images from PDF files. The CustomPDFLoader class, which is responsible for loading and parsing PDF files, only extracts the text content and the number of pages from the PDF file. It does not extract images.

To implement a feature where the chatbot can process both the PDF text and images, you would need to modify the CustomPDFLoader class to also extract images from the PDF files. This could potentially be done using a library that supports image extraction from PDF files, such as pdf.js or pdf-image.

Once the images are extracted, you would need to generate image embeddings. This could be done using a pre-trained image embedding model, such as those available in TensorFlow Hub or PyTorch Hub. The generated image embeddings could then be stored in the Pinecone vector store alongside the text embeddings.

Here is a rough example of how this could be implemented:

export class CustomPDFLoader extends BufferLoader {
  public async parse(
    raw: Buffer,
    metadata: Document['metadata'],
  ): Promise<Document[]> {
    const { pdf } = await PDFLoaderImports();
    const parsed = await pdf(raw);

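    // Extract images from the PDF file.
    // NOTE: extractImagesFromPDF is a placeholder you would need to implement.
    const images = await extractImagesFromPDF(raw);

    // Generate image embeddings.
    // NOTE: generateImageEmbeddings is also a placeholder.
    const imageEmbeddings = await generateImageEmbeddings(images);

    return [
      new Document({
        pageContent: parsed.text,
        // NOTE: LangChain's Document takes only pageContent and metadata, so
        // this extra field needs a custom Document type or a separate store.
        imageEmbeddings: imageEmbeddings,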
        metadata: {
          ...metadata,
          pdf_numpages: parsed.numpages,
        },
      }),
    ];
  }
}

The textKey option of PineconeStore.fromDocuments is a single string that names the metadata field where each document's text is stored, so it cannot be used to list image embeddings. Leave it set to 'text' and upsert the image vectors as separate records:

await PineconeStore.fromDocuments(docs, embeddings, {
  pineconeIndex: index,
  namespace,
  textKey: 'text',
});

Please note that this is a high-level overview and the actual implementation may be more complex. You would need to handle PDF files that contain no images, and the image embeddings must have the same dimension as the Pinecone index, because every vector in an index shares one fixed dimension.
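
The two checks mentioned above could be sketched roughly as follows. `validateEmbeddings` is a hypothetical helper, not part of LangChain or the Pinecone client, and the index dimension is assumed to be known from your own index configuration.

```typescript
// Hypothetical helper: verify image embeddings before storing them.
// Returns an empty array when the PDF had no images, and throws when a
// vector's length does not match the dimension of the target index.
function validateEmbeddings(
  embeddings: number[][],
  indexDimension: number,
): number[][] {
  if (embeddings.length === 0) {
    return []; // nothing to store for a PDF without images
  }
  embeddings.forEach((vector, i) => {
    if (vector.length !== indexDimension) {
      throw new Error(
        `Embedding ${i} has dimension ${vector.length}, expected ${indexDimension}`,
      );
    }
  });
  return embeddings;
}
```

Failing early here gives a clearer error than a rejected upsert, and the empty-array case lets the text-only path continue unchanged.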

As for the GPT-agent's inability to answer questions related to images, this is because the GPT-agent is a text-based model and does not have the ability to process images. To enable the GPT-agent to answer questions related to images, you would need to implement a system where the image content is converted into a textual description that the GPT-agent can process. This could potentially be done using an image captioning model, but it would likely be a complex task and may not always produce accurate results.
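
As a rough illustration of the captioning approach, the sketch below appends image captions to a page's text so that the existing text-only pipeline can embed and retrieve them. It assumes the captions have already been produced by some captioning model; the `PageCaption` type and the `appendCaptions` function are hypothetical names, not part of the repository.

```typescript
// Hypothetical caption record: which page the image was on, and its description.
interface PageCaption {
  page: number;
  caption: string;
}

// Hypothetical helper: append the captions for one page to that page's text.
function appendCaptions(
  pageText: string,
  page: number,
  captions: PageCaption[],
): string {
  const lines = captions
    .filter((c) => c.page === page && c.caption.trim() !== '')
    .map((c) => `[Image: ${c.caption.trim()}]`);
  // Leave the text untouched when the page has no captioned images.
  return lines.length === 0 ? pageText : `${pageText}\n${lines.join('\n')}`;
}
```

Because the result is ordinary text, no change to the vector store or the embedding model is needed; answer quality then depends on how accurate the captions are.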

I hope this helps! If you have any further questions, please let me know.

from gpt4-pdf-chatbot-langchain.

dosubot avatar dosubot commented on June 4, 2024 2

🤖

To implement the feature you're looking for, you'll need to modify the CustomPDFLoader class to extract images from the PDF files and generate image embeddings. Here's a step-by-step guide on how you can achieve this:

  1. Extract Images from PDF: You can use a library like pdf-image-extractor to extract images from the PDF files. You'll need to install it using npm:
npm install pdf-image-extractor

Then, you can modify the CustomPDFLoader class to use this library to extract images:

import { PDFImageExtractor } from 'pdf-image-extractor';

// ...

public async load(): Promise<Document[]> {
  // ... existing code ...
  const extractor = new PDFImageExtractor(this.filePathOrBlob);
  const images = await extractor.extractImages();
  return this.parse(buffer, metadata, images);
}

  2. Modify the Document Object: You'll need to modify the Document object to include the extracted images. You can do this by adding a new images property to the Document object:
export class CustomPDFLoader extends BufferLoader {
  public async parse(
    raw: Buffer,
    metadata: Document['metadata'],
    images: Buffer[],
  ): Promise<Document[]> {
    const { pdf } = await PDFLoaderImports();
    const parsed = await pdf(raw);
    return [
      new Document({
        pageContent: parsed.text,
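        // NOTE: Document accepts only pageContent and metadata by default,
        // so an images field needs a custom Document type.
        images: images, // add images to the document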
        metadata: {
          ...metadata,
          pdf_numpages: parsed.numpages,
        },
      }),
    ];
  }
}

  3. Generate Image Embeddings: Once you have the images, you'll need to generate image embeddings. You can use a library like tensorflow.js to create a model that generates embeddings from the images. You can then store these embeddings in the PineconeStore along with the text embeddings.
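
For the storage part of the last step, Pinecone upserts records made of an id, a values array, and optional metadata. A minimal sketch of building such records from already-computed image embeddings might look like this; `VectorRecord` and `buildImageRecords` are hypothetical names, and the id scheme and metadata fields are assumptions chosen for illustration.

```typescript
// Record shape for a vector upsert: id, values, and metadata.
interface VectorRecord {
  id: string;
  values: number[];
  metadata: { source: string; type: string; imageIndex: number };
}

// Hypothetical helper: one record per image embedding, with a stable id
// derived from the source file name and the image's position in the PDF.
function buildImageRecords(
  source: string,
  embeddings: number[][],
): VectorRecord[] {
  return embeddings.map((values, imageIndex) => ({
    id: `${source}#image-${imageIndex}`,
    values,
    metadata: { source, type: 'image', imageIndex },
  }));
}
```

Tagging each record with `type: 'image'` makes it possible to filter image vectors apart from text vectors at query time if both live in the same namespace.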

Please note that this is a simplified example and might need adjustments based on your specific requirements and the structure of your project. Also, the pdf-image-extractor library might not work in a browser environment, as it uses Node.js file system APIs. If you need to extract images in a browser, you might need to use a different library or approach.

Regarding the structure of the Document object, I wasn't able to find specific information within the repository. If you could provide more details about the structure of the Document object, I might be able to give a more accurate answer.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.

from gpt4-pdf-chatbot-langchain.

araza110 avatar araza110 commented on June 4, 2024

@dosu-bot, Sorry, I am confused with your solution. Can you give me another workaround?

from gpt4-pdf-chatbot-langchain.

araza110 avatar araza110 commented on June 4, 2024

@dosu-bot, What if I don't want to use CustomPDFLoader?

from gpt4-pdf-chatbot-langchain.
