Dataset.pop() not working as expected (deeplake issue, CLOSED)

j-beastman commented on May 29, 2024

Dataset.pop() not working as expected.

from deeplake.

Comments (9)

davidbuniat commented on May 29, 2024

@j-beastman thanks for reporting the issue and sorry about the experience. Do you mind sharing the stack trace and the error you are experiencing?

j-beastman commented on May 29, 2024

Hi @davidbuniat! Sorry I didn't include that initially, but what would a stack trace be when my program isn't crashing, just not behaving as expected? Here's my code:

def pull_deeplake_dataset() -> Dataset:
    # either load existing vector store or upload a new one to the hub
    ds = deeplake.load(f'hub://{ACTIVELOOP_ORG_NAME}/{ACTIVELOOP_DATASET}', token=ACTIVELOOP_TOKEN, read_only=False)
    return ds

def clear_dataset():
    try:
        ds = pull_deeplake_dataset()
        len = ds.max_len
        if len != 0:
            print("Deleting data")
            for i in range(0, len): # Apparently this is slow, idk the other way to do it.
                print("Popping index", i)
                ds.text.pop()
                ds.embedding.pop()
                ds.metadata.pop()
                ds.id.pop()
    except (GetChunkError, DatasetHandlerError):
        print("Dataset is already empty")

Basically, I have to pop each value off the individual tensors instead of being able to use ds.pop().

FayazRahman commented on May 29, 2024

Hey @j-beastman! When you do

for idx in range(length):
    ds.pop(idx)

The length of the dataset changes on each iteration, so the remaining indices shift: we end up popping the wrong samples and eventually index past the end of the dataset. Calling ds.pop(0) or ds.pop() on every iteration instead should work. (Did you try this as well and run into an issue?)
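The index-shifting problem is easy to reproduce outside Deep Lake. The following is a minimal sketch that uses a plain Python list as a stand-in for the dataset (an assumption made purely for illustration; it does not exercise Deep Lake itself, and both function names are hypothetical). It contrasts popping by an increasing index with always popping from the front.

```python
def pop_by_increasing_index(items):
    """Pop with idx = 0, 1, 2, ... while the list shrinks underneath us."""
    items = list(items)  # work on a copy so the caller's list is untouched
    removed = []
    try:
        for idx in range(len(items)):  # the range is fixed at the ORIGINAL length
            removed.append(items.pop(idx))
    except IndexError:
        # idx has overtaken the shrinking list, so we stop early
        pass
    return removed, items


def pop_from_front(items):
    """Always pop index 0, so a shrinking list never invalidates the index."""
    items = list(items)
    removed = []
    for _ in range(len(items)):
        removed.append(items.pop(0))
    return removed, items


if __name__ == "__main__":
    data = ["a", "b", "c", "d"]
    print(pop_by_increasing_index(data))  # every other element is skipped
    print(pop_from_front(data))           # everything is removed
```

With four elements, the first version removes only "a" and "c" before running off the end, leaving "b" and "d" behind; the second version empties the list.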

About the "slow indexing" warning, you can replace for i in range(0, len) with for i, sample in enumerate(ds.max_view):

for i, sample in enumerate(ds.max_view):
    ds.pop()

Also, avoid using len as a variable name, because it shadows the built-in len() function :)
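To illustrate the shadowing point, here is a minimal plain-Python sketch (the function name is hypothetical). Once len is rebound to an integer inside a scope, calling len(...) in that scope raises a TypeError, while the built-in remains intact elsewhere.

```python
def call_len_after_shadowing(values):
    """Rebind the name `len` locally, then try to call it."""
    len = 3  # shadows the built-in, but only inside this function
    try:
        return len(values)  # attempts to call the integer 3
    except TypeError as exc:
        return f"TypeError: {exc}"


if __name__ == "__main__":
    print(call_len_after_shadowing([1, 2, 3]))
    print(len([1, 2, 3]))  # the built-in still works at module level
```

Renaming the variable (for example to length or num_samples) avoids the problem entirely.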

j-beastman commented on May 29, 2024

Hey @FayazRahman! (Thanks for the tips.) I got rid of the index and now just call ds.pop(), but it doesn't remove anything from the dataset. When I call ds.tensor.pop() for each tensor, I'm able to clear the dataset. I'm not sure what is going on.

FayazRahman commented on May 29, 2024

@j-beastman Interesting, can you share the final state of your code so I can try it out on my end?

FayazRahman commented on May 29, 2024

Attaching the output of ds.summary() could be useful as well.

j-beastman commented on May 29, 2024

from deeplake.core.dataset import Dataset
from langchain.docstore.document import Document
from langchain.document_loaders import DirectoryLoader
from langchain.vectorstores import DeepLake, VectorStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
import deeplake
from deeplake.util.exceptions import GetChunkError, DatasetHandlerError

def pull_deeplake_dataset() -> Dataset:
    # either load existing vector store or upload a new one to the hub
    ds = deeplake.load(f'hub://{ACTIVELOOP_ORG_NAME}/{ACTIVELOOP_DATASET}', token=ACTIVELOOP_TOKEN, read_only=False)
    return ds

def clear_dataset():
    try:
        ds = pull_deeplake_dataset()
        for i, sample in enumerate(ds.max_len): # Try max_view too
            print("Popping index", i)
            ds.text.pop()
            ds.embedding.pop()
            ds.metadata.pop()
            ds.id.pop()
    except (GetChunkError, DatasetHandlerError):
        print("Dataset is already empty")

def get_embeddings():
    return SentenceTransformerEmbeddings(
        model_name=EMBEDDING_MODEL_NAME,
        cache_folder="../streamlit/cache/",
    )

def load_misc(directory):
    loader = DirectoryLoader(f"./{directory}")

    print(f"Loading {directory} directory for any {filter}")
    data = loader.load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=10,
    )
    r_docs = splitter.split_documents(data)
    return r_docs

def upload_latin():
    clear_dataset()
    chunked_text = load_misc("data/snippets")
    embedding_function = get_embeddings()
    DeepLake.from_documents(
        chunked_text,
        embedding_function,
        dataset_path=VECTOR_STORE_PATH,
        token=ACTIVELOOP_TOKEN,
    )

Try running it twice. I'll provide the file that I use for this snippet too.

j-beastman commented on May 29, 2024

latin.txt

FayazRahman commented on May 29, 2024

Thanks for the details and the code @j-beastman! I see what is causing the issue, and I'll let you know as soon as the fix is released.
