
Comments (6)

mayooear avatar mayooear commented on May 12, 2024

Hi, did you have the same error before the chunk function was added?

Also have you run through the troubleshooting list here

from gpt4-pdf-chatbot-langchain.

swagnostic avatar swagnostic commented on May 12, 2024

I believe it was the same error yes.

And, yes, finished the troubleshooting list, restart and all. No dice. I'll keep trying.


janzheng avatar janzheng commented on May 12, 2024

I'm uploading a 12mb PDF that had some trouble. This helped me get it ingested:

  1. Re: instructions to add Dimensions: I had to go into my default project and create a new index, with the same name as I set in pinecone.ts, e.g. const PINECONE_INDEX_NAME = 'langchainjs-pdf-test';

  2. I also added async-sema from here: https://github.com/vercel/async-sema which helps slow your code down.
    Full ingest-data.ts looks like this:

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from 'langchain/embeddings';
import { PineconeStore } from 'langchain/vectorstores';
import { pinecone } from '@/utils/pinecone-client';
import { PDFLoader } from 'langchain/document_loaders';
import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';

import { Sema } from 'async-sema';
const s = new Sema(
  2, // Allow 2 concurrent async calls
  {
    capacity: 20 // Preallocate space for 20 tokens
  }
);

/* Path of the PDF file to ingest. You can change this as required */
// const filePath = 'docs/MorseVsFrederick.pdf';
const filePath = 'docs/Sacher_Jessica_C_201807_PhD.pdf';

export const run = async () => {
  try {
    /*load raw docs from the pdf file in the directory */
    const loader = new PDFLoader(filePath);
    const rawDocs = await loader.load();

    console.log(rawDocs);

    /* Split text into chunks */
    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 200,
    });

    const docs = await textSplitter.splitDocuments(rawDocs);
    console.log('split docs', docs);

    console.log('creating vector store...');
    /*create and store the embeddings in the vectorStore*/
    const embeddings = new OpenAIEmbeddings();
    const index = pinecone.Index(PINECONE_INDEX_NAME); //change to your own index name

    //embed the PDF documents

    /* Pinecone recommends a limit of 100 vectors per upsert request to avoid errors*/
    const chunkSize = 50;
    for (let i = 0; i < docs.length; i += chunkSize) {
      const chunk = docs.slice(i, i + chunkSize);
      console.log('chunk', i, chunk);
      await s.acquire();
      try {
        await PineconeStore.fromDocuments(
          index,
          chunk,
          embeddings,
          'text',
          PINECONE_NAME_SPACE,
        );
      } finally {
        s.release();
      }
    }
  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to ingest your data');
  }
};

(async () => {
  await run();
  console.log('ingestion complete');
})();


mayooear avatar mayooear commented on May 12, 2024

> I believe it was the same error yes.
>
> And, yes, finished the troubleshooting list, restart and all. No dice. I'll keep trying.

What version of LangChain and Pinecone are you using?


mayooear avatar mayooear commented on May 12, 2024

> I'm uploading a 12mb PDF that had some trouble. This helped me get it ingested: [...] I also added async-sema from here: https://github.com/vercel/async-sema which helps slow your code down. [...]

Thanks, but I'm not sure async-sema should be necessary here for this to work.


janzheng avatar janzheng commented on May 12, 2024

You're right, it worked without async-sema! For some reason I was getting timeouts that seemed to come from hitting the endpoint too often, but maybe I just misread the errors. Removed it and it works!

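If the timeouts come back, an alternative to throttling with a semaphore is to retry each upsert with exponential backoff, so transient failures recover on their own. A minimal sketch (the `withRetry` helper is illustrative, not part of the repo):

```typescript
// Hypothetical retry helper: runs an async call, and on failure waits
// baseDelayMs * 2^attempt before trying again, up to maxAttempts times.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        // Exponential backoff before the next attempt
        await new Promise((resolve) =>
          setTimeout(resolve, baseDelayMs * 2 ** attempt),
        );
      }
    }
  }
  // All attempts failed; surface the last error to the caller
  throw lastError;
}
```

In the ingest loop above, each upsert call could then be wrapped as `await withRetry(() => PineconeStore.fromDocuments(index, chunk, embeddings, 'text', PINECONE_NAME_SPACE))` instead of acquiring and releasing the semaphore.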
