
Comments (6)

mayooear avatar mayooear commented on May 12, 2024

Hi, did you have the same error before the chunk function was added?

Also have you run through the troubleshooting list here

from gpt4-pdf-chatbot-langchain.

swagnostic avatar swagnostic commented on May 12, 2024

I believe it was the same error yes.

And, yes, finished the troubleshooting list, restart and all. No dice. I'll keep trying.


janzheng avatar janzheng commented on May 12, 2024

I'm uploading a 12mb PDF that had some trouble. This helped me get it ingested:

  1. Re: instructions to add Dimensions: I had to go into my default project and create a new index, with the same name as I set in pinecone.ts, e.g. const PINECONE_INDEX_NAME = 'langchainjs-pdf-test';

  2. I also added async-sema from here: https://github.com/vercel/async-sema which helps slow your code down.
    Full ingest-data.ts looks like this:

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from 'langchain/embeddings';
import { PineconeStore } from 'langchain/vectorstores';
import { pinecone } from '@/utils/pinecone-client';
import { PDFLoader } from 'langchain/document_loaders';
import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';

import { Sema } from 'async-sema';
const s = new Sema(
  2, // Allow 2 concurrent async calls
  {
    capacity: 20 // Preallocate space for 20 tokens
  }
);

/* Path of the PDF file to ingest. You can change this as required */
// const filePath = 'docs/MorseVsFrederick.pdf';
const filePath = 'docs/Sacher_Jessica_C_201807_PhD.pdf';

export const run = async () => {
  try {
    /*load raw docs from the pdf file in the directory */
    const loader = new PDFLoader(filePath);
    const rawDocs = await loader.load();

    console.log(rawDocs);

    /* Split text into chunks */
    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 200,
    });

    const docs = await textSplitter.splitDocuments(rawDocs);
    console.log('split docs', docs);

    console.log('creating vector store...');
    /*create and store the embeddings in the vectorStore*/
    const embeddings = new OpenAIEmbeddings();
    const index = pinecone.Index(PINECONE_INDEX_NAME); //change to your own index name

    //embed the PDF documents

    /* Pinecone recommends a limit of 100 vectors per upsert request to avoid errors*/
    const chunkSize = 50;
    for (let i = 0; i < docs.length; i += chunkSize) {
      const chunk = docs.slice(i, i + chunkSize);
      console.log('chunk', i, chunk);
      await s.acquire();
      try {
        await PineconeStore.fromDocuments(
          index,
          chunk,
          embeddings,
          'text',
          PINECONE_NAME_SPACE,
        );
      } finally {
        s.release();
      }
    }
  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to ingest your data');
  }
};

(async () => {
  await run();
  console.log('ingestion complete');
})();


mayooear avatar mayooear commented on May 12, 2024

> I believe it was the same error yes.
>
> And, yes, finished the troubleshooting list, restart and all. No dice. I'll keep trying.

What version of LangChain and Pinecone are you using?


mayooear avatar mayooear commented on May 12, 2024

> I'm uploading a 12mb PDF that had some trouble. This helped me get it ingested: [...] I also added async-sema from here: https://github.com/vercel/async-sema which helps slow your code down. [...]

Thanks, but I'm not sure async-sema should be necessary here for this to work.


janzheng avatar janzheng commented on May 12, 2024

You're right, it worked without async-sema! For some reason I was getting timeouts that seemed to come from hitting the endpoint too often, but maybe I just misread the errors. Removed it and it works!

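If the timeouts come back, an alternative to throttling with a semaphore is to retry each upsert with exponential backoff, so transient failures recover on their own. A minimal sketch (the `withRetry` helper is illustrative, not part of the repo):

```typescript
// Hypothetical retry helper: runs an async call, and on failure waits
// baseDelayMs * 2^attempt before trying again, up to maxAttempts times.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        // Exponential backoff before the next attempt
        await new Promise((resolve) =>
          setTimeout(resolve, baseDelayMs * 2 ** attempt),
        );
      }
    }
  }
  // All attempts failed; surface the last error to the caller
  throw lastError;
}
```

In the ingest loop above, each upsert call could then be wrapped as `await withRetry(() => PineconeStore.fromDocuments(index, chunk, embeddings, 'text', PINECONE_NAME_SPACE))` instead of acquiring and releasing the semaphore.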
