Comments (6)
Hi, did you have the same error before the chunk function was added?
Also, have you run through the troubleshooting list here?
from gpt4-pdf-chatbot-langchain.
I believe it was the same error, yes.
And, yes, finished the troubleshooting list, restart and all. No dice. I'll keep trying.
from gpt4-pdf-chatbot-langchain.
I'm uploading a 12mb PDF that had some trouble. This helped me get it ingested:
- Re: instructions to add Dimensions: I had to go into my default project and create a new index with the same name as I set in pinecone.ts, e.g. const PINECONE_INDEX_NAME = 'langchainjs-pdf-test';
- I also added async-sema from here: https://github.com/vercel/async-sema, which helps slow your code down.
Full ingest-data.ts looks like this:
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from 'langchain/embeddings';
import { PineconeStore } from 'langchain/vectorstores';
import { pinecone } from '@/utils/pinecone-client';
import { PDFLoader } from 'langchain/document_loaders';
import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';
import { Sema } from 'async-sema';

const s = new Sema(
  2, // allow 2 concurrent async calls
  {
    capacity: 20, // prealloc space for 20 tokens
  },
);

/* Path of the PDF file to ingest. You can change this as required */
// const filePath = 'docs/MorseVsFrederick.pdf';
const filePath = 'docs/Sacher_Jessica_C_201807_PhD.pdf';

export const run = async () => {
  try {
    /* Load raw docs from the PDF file */
    const loader = new PDFLoader(filePath);
    const rawDocs = await loader.load();
    console.log(rawDocs);

    /* Split text into chunks */
    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 200,
    });
    const docs = await textSplitter.splitDocuments(rawDocs);
    console.log('split docs', docs);

    console.log('creating vector store...');
    /* Create and store the embeddings in the vector store */
    const embeddings = new OpenAIEmbeddings();
    const index = pinecone.Index(PINECONE_INDEX_NAME); // change to your own index name

    /* Embed the PDF documents. Pinecone recommends a limit of 100 vectors per upsert request to avoid errors */
    const chunkSize = 50;
    for (let i = 0; i < docs.length; i += chunkSize) {
      const chunk = docs.slice(i, i + chunkSize);
      console.log('chunk', i, chunk);
      await s.acquire();
      try {
        await PineconeStore.fromDocuments(
          index,
          chunk,
          embeddings,
          'text',
          PINECONE_NAME_SPACE,
        );
      } finally {
        s.release();
      }
    }
  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to ingest your data');
  }
};

(async () => {
  await run();
  console.log('ingestion complete');
})();
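For reference, the @/config/pinecone import above is just a small constants module. A minimal sketch (the namespace value here is an example assumption, not taken from the thread):

```typescript
// Hypothetical config/pinecone.ts sketch. The index name matches the index
// created in the Pinecone console; the namespace value is an example only.
export const PINECONE_INDEX_NAME = 'langchainjs-pdf-test';
export const PINECONE_NAME_SPACE = 'pdf-test'; // example value, not from the thread
```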
from gpt4-pdf-chatbot-langchain.
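The slice-based batching above is what keeps each upsert under Pinecone's per-request size limit. Pulled out as a standalone helper (chunkArray is hypothetical, not part of the repo), the loop is easy to sanity-check:

```typescript
// Hypothetical helper mirroring the docs.slice(i, i + chunkSize) loop above:
// splits an array into consecutive batches of at most `size` items.
function chunkArray<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// e.g. 120 docs with chunkSize 50 yields batches of 50, 50, and 20.
const batches = chunkArray(Array.from({ length: 120 }, (_, i) => i), 50);
console.log(batches.map((b) => b.length)); // [ 50, 50, 20 ]
```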
What version of LangChain and Pinecone are you using?
from gpt4-pdf-chatbot-langchain.
Thanks, but I'm not sure async-sema should be necessary here for it to work.
from gpt4-pdf-chatbot-langchain.
You're right, it worked without async-sema! For some reason I was getting timeouts that seemed to come from hitting the endpoint too much, but maybe I just misread the errors. Removed it and it works!
from gpt4-pdf-chatbot-langchain.
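If those timeouts ever do come back, a plain retry with exponential backoff is a lighter alternative to a semaphore. A sketch (withRetry is a hypothetical helper, not part of the repo):

```typescript
// Hypothetical retry helper: calls fn up to maxAttempts times,
// doubling the wait between attempts (500ms, 1000ms, 2000ms, ...).
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err; // out of retries
      await new Promise((resolve) =>
        setTimeout(resolve, baseDelayMs * 2 ** attempt),
      );
    }
  }
}

// Usage: wrap the upsert call instead of gating it with a semaphore, e.g.
// await withRetry(() => PineconeStore.fromDocuments(index, chunk, embeddings, 'text', PINECONE_NAME_SPACE));
```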