Comments (16)

arnavroh45 commented on August 16, 2024

Can I get your Discord ID so that I can share the data with you?


joein commented on August 16, 2024

Hey,

This can be caused by two things:

  1. you have a very large payload
  2. you don't do batching

The former can be solved either by reducing the size of the payload or by increasing the allowed JSON size limit, while the latter can be solved by introducing batching.

Also, setting prefer_grpc=True when instantiating a Qdrant client would probably make this error go away, but that would only treat the symptom rather than the underlying problem.
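
For illustration, a minimal sketch of that option (the host is a placeholder):

from qdrant_client import QdrantClient

# prefer_grpc routes uploads over gRPC instead of REST, so the REST JSON
# request-size check no longer applies; the payload itself stays just as large
client = QdrantClient(host="localhost", prefer_grpc=True)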


arnavroh45 commented on August 16, 2024

Is there a way to increase the JSON payload limit in Qdrant somehow? I have tried batching as well, but I still get the same error.
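
For reference, a hedged sketch of where this limit lives for a self-hosted Qdrant instance (the default is 32 MB):

# config.yaml of a self-hosted Qdrant instance
service:
  max_request_size_mb: 64  # default is 32

# or via environment variable, e.g. when running the Docker image:
# QDRANT__SERVICE__MAX_REQUEST_SIZE_MB=64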


joein commented on August 16, 2024

Increasing the limit is not recommended. Could you please show how you are trying to upload the data?

Which methods do you use?
How do you do the batching, and what is the batch size?


arnavroh45 commented on August 16, 2024

I loaded documents using llama_index document loaders and then converted them to nodes.
I think the problem is that the size of the nodes' metadata (the payload for the Qdrant vectors) exceeds the allocated limit. I implemented the same pipeline with the plain documents and it works, but not with the nodes.

# Imports for the legacy llama_index API used here; `documents`, `client`, and `llm` are assumed to be defined earlier
from llama_index import ServiceContext, StorageContext, VectorStoreIndex
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.vector_stores import QdrantVectorStore
from langchain.embeddings import HuggingFaceEmbeddings

sentence_node_parser = SentenceWindowNodeParser.from_defaults(window_size=3, window_metadata_key="window", original_text_metadata_key="original_text")
nodes = sentence_node_parser.get_nodes_from_documents(documents)
vector_store = QdrantVectorStore(client=client, collection_name="collection_name", batch_size=20)  # batch size specified here
storage_context = StorageContext.from_defaults(vector_store=vector_store)
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2')
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embeddings, chunk_size=512, chunk_overlap=50)
index = VectorStoreIndex(nodes, storage_context=storage_context, service_context=service_context)


joein commented on August 16, 2024

Could you maybe try it with batch_size=1?

If it succeeds, could you check what is actually stored in the payload? Maybe this code stores the original document and not only the chunks?
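
For example, a quick check along these lines (a sketch, assuming the nodes variable from the snippet above):

import json

# Serialize each node's metadata and look at the largest blobs; a whole source
# document hiding in the payload shows up immediately
sizes = sorted((len(json.dumps(node.metadata, default=str).encode("utf-8")) for node in nodes), reverse=True)
print(sizes[:5])  # approximate byte sizes of the five largest metadata payloads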


arnavroh45 commented on August 16, 2024

Tried that; it isn't working.


arnavroh45 commented on August 16, 2024

Any breakthrough?


joein commented on August 16, 2024

Hi @arnavroh45, sorry for the delay; I haven't had time to look deeper into this yet.

@Anush008, maybe you could take a look at it, please?

I am not that familiar with llama_index, but this error usually occurs when there is a problem either with the batch size or with how the payload is loaded.


Anush008 commented on August 16, 2024

Hi @arnavroh45.
I rechecked the batching implementation. Looks fine.
I'll try reproducing the issue.


arnavroh45 commented on August 16, 2024

Okay


Anush008 commented on August 16, 2024

Hey @arnavroh45. I tried reproducing this with SentenceWindowNodeParser as per your snippet.
The upload worked fine for me, so I assume the issue has to do with your data specifically. Could you give some info about the data?


arnavroh45 commented on August 16, 2024

I am performing web scraping using Selenium and then storing the data in the format below. I then converted the documents into nodes using the code provided, and adding the nodes to the vector DB raised the error.

Document Format:
Document(id_='dac4c09b-5e50-4c8e-a3de-01ac624ab8e4', embedding=None, metadata={'title': 'title', 'source': 'sourcet'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='dd623d76410ed65006c454c4fbf93511baa6bb1bef1936d5a352191b62a0c1bc', text='text', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')

Conversion from documents to nodes:
sentence_node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = sentence_node_parser.get_nodes_from_documents(documents)

Adding the nodes to the vector DB:
client = QdrantClient(url="url")
vector_store = QdrantVectorStore(client=client, collection_name="collection_name", batch_size=10)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)

Error:
UnexpectedResponse: Unexpected Response: 400 (Bad Request)
Raw response content:
b'{"status":{"error":"Payload error: JSON payload (41900140 bytes) is larger than allowed (limit: 33554432 bytes)."},"time":0.0}'


Anush008 commented on August 16, 2024

@arnavroh45, the document schema is very similar to what I had in my attempt at reproducing this.

But this is nearly impossible to debug without the actual data, which seems to be the cause, since the batch upload appears to work fine and you even tried uploading a single point per batch.


arnavroh45 commented on August 16, 2024

Could it be that the size of my metadata exceeds the limit allowed for the payload?
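
If so, one possible mitigation (a sketch, not something prescribed in this thread; MAX_META_BYTES is an arbitrary illustrative cap) would be to trim oversized metadata values before building the index:

# Truncate very large string values, such as an accumulated "window",
# before the nodes are uploaded to Qdrant
MAX_META_BYTES = 10_000  # illustrative cap, not a Qdrant constant

for node in nodes:
    for key, value in node.metadata.items():
        if isinstance(value, str) and len(value.encode("utf-8")) > MAX_META_BYTES:
            node.metadata[key] = value[:MAX_META_BYTES]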


