Comments (4)
Hello @araza110! 👋
I'm Dosu, your friendly neighborhood bot. While human maintainers are busy saving the world, I'm here to assist you with bugs, answer your questions, and help you become a stellar contributor. Let's get those vector counts aligning, shall we?
The discrepancy in the number of vectors between your NestJS implementation and the original code could be due to several factors. Here are a few possibilities:
- Different Text Splitting Parameters: The RecursiveCharacterTextSplitter is used to split the text into chunks. If the chunkSize and chunkOverlap parameters are different in your NestJS implementation, this could result in a different number of vectors. In your code, you've set chunkSize to 1000 and chunkOverlap to 200. If these values are different in the original code, it could explain the discrepancy.
- Different Document Loaders: The method you use to load the PDF files could also affect the number of vectors. If your NestJS implementation uses a different method to load the PDF files, it could result in a different number of vectors.
- Different Embedding Method: The OpenAIEmbeddings is used to create embeddings from the split text. If your NestJS implementation uses a different method to create embeddings, it could result in a different number of vectors.
Without more information, it's impossible to provide a more specific answer. Could you please provide the original code and the specific parameters used there? This would help in diagnosing the issue more accurately.
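To get a feel for how much the splitting parameters alone can move the vector count, here is a rough estimate, assuming a naive fixed-size sliding window. The helper name estimateChunkCount is hypothetical, and the real RecursiveCharacterTextSplitter breaks on separators such as paragraphs and sentences, so actual counts will differ somewhat.

```typescript
// Hypothetical helper, not part of LangChain: estimates how many chunks a
// naive fixed-size sliding window would produce for a text of a given length.
function estimateChunkCount(
  textLength: number,
  chunkSize: number,
  chunkOverlap: number
): number {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be in the range [0, chunkSize)");
  }
  if (textLength <= 0) return 0;
  if (textLength <= chunkSize) return 1;

  // Each new chunk advances by (chunkSize - chunkOverlap) characters.
  const stride = chunkSize - chunkOverlap;
  return Math.ceil((textLength - chunkOverlap) / stride);
}

// Same 10,000-character text, two different configurations.
console.log(estimateChunkCount(10000, 1000, 200)); // 13
console.log(estimateChunkCount(10000, 500, 50)); // 23
```

Since each chunk becomes one vector, two pipelines with different settings can disagree on the count even when they ingest the same PDFs.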
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
from gpt4-pdf-chatbot-langchain.
Hi, @araza110,
I'm helping the gpt4-pdf-chatbot-langchain team manage their backlog and am marking this issue as stale. The issue involves a discrepancy in the number of vectors stored when ingesting PDF documents into a Pinecone database using a NestJS application. It seems that the discrepancy has been resolved, with potential factors contributing to the issue identified and addressed. Additionally, a detailed explanation of the textSplitter and textKey parameters was provided to help achieve consistency in the number of vectors generated.
Could you please confirm if this issue is still relevant to the latest version of the gpt4-pdf-chatbot-langchain repository? If it is, please let the gpt4-pdf-chatbot-langchain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!
Hi @dosbut, please explain this code block:
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});
Also, what is textKey?
await PineconeStore.fromDocuments(docs, embeddings, {
  pineconeIndex: index,
  namespace,
  textKey: 'text',
});
@araza110 textSplitter splits PDF documents into smaller parts called "chunks".
chunkSize - the maximum size of each chunk (here, 1000 characters).
chunkOverlap - how many characters adjacent chunks should share (here, 200).
Say we load a document containing:
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed vitae mauris nec nisl imperdiet placerat. Nullam in semper velit. Phasellus congue varius felis, in efficitur tellus tempor sit amet. Duis mollis tincidunt ex, ac feugiat metus consectetur ut. Sed tincidunt lacus sed felis fringilla, eu blandit massa ullamcorper. Aliquam pulvinar, nisl eu consectetur dignissim, mi erat fermentum sapien, eu ultrices turpis lectus a nunc."
// directoryLoader is assumed to be created earlier in the ingest script
const rawDocs = await directoryLoader.load();
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});
// Each raw document is split into several smaller documents (chunks)
const docs = await textSplitter.splitDocuments(rawDocs);
console.log('split docs', docs);
We get something like the following. This is simplified for illustration: the sample is shorter than 1000 characters, so with these exact settings it would stay in a single chunk, and splitDocuments returns Document objects with pageContent and metadata rather than bare strings.
[
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed vitae mauris nec nisl imperdiet placerat. Nullam in semper velit. Phasellus congue varius felis, in efficitur tellus tempor sit amet. Duis mollis tincidunt ex, ",
  "ac feugiat metus consectetur ut. Sed tincidunt lacus sed felis fringilla, eu blandit massa ullamcorper. Aliquam pulvinar, nisl eu consectetur dignissim, mi erat fermentum sapien, eu ultrices turpis lectus a nunc."
]
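The effect of chunkOverlap is easier to see with small numbers. Below is a minimal sketch using a plain character-window splitter. The function splitWithOverlap is hypothetical and much simpler than RecursiveCharacterTextSplitter, which prefers to break on separators, but the size and overlap semantics are the same idea.

```typescript
// Hypothetical, simplified splitter: fixed-size character windows where each
// chunk repeats the last `chunkOverlap` characters of the previous one.
function splitWithOverlap(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const chunks: string[] = [];
  const stride = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += stride) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const sample = "abcdefghijklmnopqrstuvwxyz";
const chunks = splitWithOverlap(sample, 10, 3);
console.log(chunks);
// [ 'abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz' ]
```

Each chunk starts with the last three characters of the one before it, which is what the 200-character overlap does at full scale.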
For:
await PineconeStore.fromDocuments(docs, embeddings, {
pineconeIndex: index,
namespace,
textKey: 'text',
});
textKey names the metadata field under which the text of each chunk is stored in Pinecone alongside its vector, and from which it is read back when results are retrieved. The embeddings themselves are generated from each document's pageContent. With textKey: 'text', the metadata of each stored record carries a text property holding the chunk text:
[
  { text: "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed vitae mauris nec nisl imperdiet placerat. Nullam in semper velit. Phasellus congue varius felis, in efficitur tellus tempor sit amet. Duis mollis tincidunt ex, " },
  { text: "ac feugiat metus consectetur ut. Sed tincidunt lacus sed felis fringilla, eu blandit massa ullamcorper. Aliquam pulvinar, nisl eu consectetur dignissim, mi erat fermentum sapien, eu ultrices turpis lectus a nunc." }
]
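A rough sketch of what a textKey setting does in a vector store wrapper, assuming (as I understand LangChain's PineconeStore to work) that the chunk text is saved into the record's metadata under that key and restored from it on retrieval. The types Doc and VectorRecord and the helpers toRecord and toDoc are hypothetical stand-ins, not the library's API.

```typescript
// Simplified stand-ins for a LangChain Document and a Pinecone record.
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

interface VectorRecord {
  id: string;
  values: number[];
  metadata: Record<string, unknown>;
}

// On write: the chunk text is stored in metadata under `textKey`.
function toRecord(
  doc: Doc,
  embedding: number[],
  id: string,
  textKey: string = "text"
): VectorRecord {
  return {
    id,
    values: embedding,
    metadata: { ...doc.metadata, [textKey]: doc.pageContent },
  };
}

// On read: the text is pulled back out of metadata using the same key.
function toDoc(record: VectorRecord, textKey: string = "text"): Doc {
  const { [textKey]: pageContent, ...metadata } = record.metadata;
  return { pageContent: String(pageContent ?? ""), metadata };
}

const record = toRecord(
  { pageContent: "Lorem ipsum", metadata: { source: "a.pdf" } },
  [0.1, 0.2], // placeholder embedding values
  "doc-0"
);
console.log(record.metadata); // { source: 'a.pdf', text: 'Lorem ipsum' }
console.log(toDoc(record).pageContent); // Lorem ipsum
```

If writer and reader use different keys, the reader finds no text under its key, so the value should match between ingest and query code.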