
quick-start-guide-to-llms's Issues

Remove env file from semantic-search-fastapi folder.

Noticed that the .env file in the semantic-search-fastapi folder was checked into the repository.
You probably need to update your .gitignore file, remove the .env file from the repository,
and regenerate your API keys to ensure the exposed ones are deactivated.

Not working anymore with openai>=1.0

Even if I explicitly install openai==0.28.1, I get

APIRemovedInV1:

You tried to access openai.Embedding, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.

You probably want to update the course on O'Reilly as well; otherwise the code does not work.

Chapter 7: name 'decoder_tokenizer' is not defined

Hey Sinan,
I downloaded the data from Kaggle, and while running the code:

import json
from collections import defaultdict
from tqdm import tqdm
from datasets import Dataset

# Function to load VQA data from the given annotation and question files
def load_vqa_data(annotations_file, questions_file, images_folder, start_at=None, end_at=None, max_images=None, max_questions=None):
    # Load the annotations and questions JSON files
    with open(annotations_file, "r") as f:
        annotations_data = json.load(f)
    with open(questions_file, "r") as f:
        questions_data = json.load(f)

    data = []
    images_used = defaultdict(int)

    # Create a dictionary to map question_id to the annotation data
    annotations_dict = {annotation["question_id"]: annotation for annotation in annotations_data["annotations"]}

    # Iterate through questions in the specified range
    for question in tqdm(questions_data["questions"][start_at:end_at]):
        # Assuming that "image_id" is available in the question data
        image_id = question.get("image_id", None)  # Replace with the actual code to get the image_id

        # Check if the image file exists and has not reached the max_questions limit
        # Add your image-checking logic here

        # Assuming that "annotation" is available in the annotation data
        annotation = annotations_dict.get(question["question_id"], {})  # Replace with the actual code to get the annotation

        # Assuming that other variables like "decoder_tokenizer" and "all_answers" are defined elsewhere in your code

        # Add the data as a dictionary
        data.append({
            "image_id": image_id,
            "question_id": question["question_id"],
            "question": question["question"],
            "answer": decoder_tokenizer.bos_token + ' ' + annotation.get("multiple_choice_answer", "") + decoder_tokenizer.eos_token,
            "all_answers": all_answers,  # You should define this variable elsewhere
            "image": image,  # You should define this variable elsewhere
        })

        # Break the loop if the max_images limit is reached
        # Add your max_images-checking logic here

    return data

# Load training and validation VQA data
train_data = load_vqa_data(
    "/content/data/v2_Annotations_Train_mscoco/v2_mscoco_train2014_annotations.json",
    "/content/data/v2_Questions_Train_mscoco/v2_OpenEnded_mscoco_train2014_questions.json",
    "/content/data/train2014"
)

val_data = load_vqa_data(
    "/content/data/v2_Annotations_Val_mscoco/v2_mscoco_val2014_annotations.json",
    "/content/data/v2_Questions_Val_mscoco/v2_OpenEnded_mscoco_val2014_questions.json",
    "/content/data/val2014"
)

# Create Hugging Face datasets
train_dataset = Dataset.from_dict({key: [item[key] for item in train_data] for key in train_data[0].keys()})

# Optionally save the dataset to disk for later retrieval
train_dataset.save_to_disk("vqa_train_dataset")

# Create Hugging Face datasets for validation
val_dataset = Dataset.from_dict({key: [item[key] for item in val_data] for key in val_data[0].keys()})

# Optionally save the dataset to disk for later retrieval
val_dataset.save_to_disk("vqa_val_dataset")

ERROR:

0%|          | 0/443757 [00:00<?, ?it/s]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
[<ipython-input-17-be70899caa09>](https://localhost:8080/#) in <cell line: 49>()
     47 
     48 # Load training and validation VQA data
---> 49 train_data = load_vqa_data(
     50     "/content/data/v2_Annotations_Train_mscoco/v2_mscoco_train2014_annotations.json",
     51     "/content/data/v2_Questions_Train_mscoco/v2_OpenEnded_mscoco_train2014_questions.json",

[<ipython-input-17-be70899caa09>](https://localhost:8080/#) in load_vqa_data(annotations_file, questions_file, images_folder, start_at, end_at, max_images, max_questions)
     36             "question_id": question["question_id"],
     37             "question": question["question"],
---> 38             "answer": decoder_tokenizer.bos_token + ' ' + annotation.get("multiple_choice_answer", "") + decoder_tokenizer.eos_token,
     39             "all_answers": all_answers,  # You should define this variable elsewhere
     40             "image": image,  # You should define this variable elsewhere

NameError: name 'decoder_tokenizer' is not defined
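For anyone hitting this: the cell that defines the decoder tokenizer has to run before `load_vqa_data` is called. A minimal sketch of the interface the function expects (the stand-in token strings below, and the GPT-2 suggestion in the comment, are assumptions, not the book's actual setup):

```python
from types import SimpleNamespace

# Stand-in showing what load_vqa_data expects of decoder_tokenizer: an object
# exposing bos_token and eos_token strings. In the notebook this would be a
# real Hugging Face tokenizer defined in an earlier cell, e.g.:
#   from transformers import AutoTokenizer
#   decoder_tokenizer = AutoTokenizer.from_pretrained("gpt2")
decoder_tokenizer = SimpleNamespace(bos_token="<|bos|>", eos_token="<|eos|>")

# This mirrors how the "answer" field is built inside load_vqa_data:
answer = decoder_tokenizer.bos_token + ' ' + "yes" + decoder_tokenizer.eos_token
print(answer)  # <|bos|> yes<|eos|>
```

The same applies to `all_answers` and `image`: the quoted cell references them without defining them, so whatever earlier cells produce them must run first.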

Chapter 2 Notebook errors

Hi Sinan,

Trying to run your Chapter 2 notebook. When I get to

import pinecone

pinecone.init(api_key=pinecone_key, environment="us-west1-gcp")

it issues an error that "init is no longer a top-level attribute of the pinecone package." Then it gives the following replacement code:

    import os
    from pinecone import Pinecone, ServerlessSpec

    pc = Pinecone(
        api_key=os.environ.get("PINECONE_API_KEY")
    )

    # Now do stuff
    if 'my_index' not in pc.list_indexes().names():
        pc.create_index(
            name='my_index', 
            dimension=1536, 
            metric='euclidean',
            spec=ServerlessSpec(
                cloud='aws',
                region='us-west-2'
            )
        )

When I try the replacement code, I get this error:

PineconeApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'text/plain; charset=utf-8', 'access-control-allow-origin': '*', 'vary': 'origin,access-control-request-method,access-control-request-headers', 'access-control-expose-headers': '*', 'X-Cloud-Trace-Context': 'a4dd30976b4a3a292eb49f325c69da3b', 'Date': 'Sun, 28 Jan 2024 22:21:38 GMT', 'Server': 'Google Frontend', 'Content-Length': '125', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: {"error":{"code":"INVALID_ARGUMENT","message":"Name must consist of lower case alphanumeric characters or '-'"},"status":400}

Do you have an updated working notebook?
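Judging from the 400 response body, the request fails because `'my_index'` contains an underscore, which Pinecone index names do not allow. A small check illustrating the rule (the regex is inferred from the error message above, not taken from Pinecone's docs):

```python
import re

# Per the error above, index names "must consist of lower case alphanumeric
# characters or '-'". 'my_index' violates this; 'my-index' does not.
def is_valid_index_name(name: str) -> bool:
    return re.fullmatch(r"[a-z0-9-]+", name) is not None

print(is_valid_index_name("my_index"))  # False
print(is_valid_index_name("my-index"))  # True
```

So renaming the index to something like `'my-index'` in the replacement snippet should get past the 400.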

Chapter 3: Is "get_best_result_from_pinecone()" implementation correct?

Hi,

This function posts the payload to "https://information-retrieval-hiaa.onrender.com/document/retrieve" and does not interact with Pinecone directly. Figure 3.12 and the relevant sections say the result is retrieved from the vector database, but I'm unable to see that anywhere in the Chapter 3 notebook.

Please help me understand.

def get_best_result_from_pinecone(query, namespace=NAMESPACE):
    payload = json.dumps({
      "num_results": 2,
      "query": query,
      "re_ranking_strategy": "none",
      "namespace": namespace
    })

    response = requests.post(
        "https://information-retrieval-hiaa.onrender.com/document/retrieve",
        data=payload
    )

…

Chapters 3-4 mention FastAPI

Hi Sinan,
Thanks for the amazing book. Any chance you could add the FastAPI code mentioned in Chapters 2-4, and show how it interacts with the rest of the examples?

Looking forward to hearing from you.
Thanks,
Andy

Chapter 2: Semantic Search Pinecone index.upsert() throws 'Unable to prepare type Embedding for serialization' exception

Source: https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/semantic-search-fastapi/api.py

document_ingest()

Line 132: upserted_count = index.upsert(pinecone_request, namespace=request.namespace).get('upserted_count') throws following error:

ApiValueError: Unable to prepare type Embedding for serialization

DEBUG:openai._base_client:Request options: {'method': 'post', 'url': '/embeddings', 'files': None, 'post_parser': <function Embeddings.create.<locals>.parser at 0x79339f965bd0>, 'json_data': {'input': ['hi'], 'model': 'text-embedding-ada-002', 'encoding_format': 'base64'}}
DEBUG:httpcore.connection:close.started
DEBUG:httpcore.connection:close.complete
DEBUG:httpcore.connection:connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=5.0 socket_options=None
DEBUG:httpcore.connection:connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x79339f95d960>
DEBUG:httpcore.connection:start_tls.started ssl_context=<ssl.SSLContext object at 0x7933b233c4c0> server_hostname='api.openai.com' timeout=5.0
DEBUG:httpcore.connection:start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x79339f95f520>
DEBUG:httpcore.http11:send_request_headers.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_headers.complete
DEBUG:httpcore.http11:send_request_body.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_body.complete
DEBUG:httpcore.http11:receive_response_headers.started request=<Request [b'POST']>
['hi']
DEBUG:httpcore.http11:receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Date', b'Tue, 28 Nov 2023 03:52:51 GMT'), (b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'access-control-allow-origin', b'*'), (b'openai-organization', b'user-bfcfnnkwlm7ogyjwm7tngl6y'), (b'openai-processing-ms', b'28'), (b'openai-version', b'2020-10-01'), (b'strict-transport-security', b'max-age=15724800; includeSubDomains'), (b'x-ratelimit-limit-requests', b'3000'), (b'x-ratelimit-remaining-requests', b'2999'), (b'x-ratelimit-reset-requests', b'20ms'), (b'x-request-id', b'e6d41a58a9769205cd239d0eb103d47c'), (b'CF-Cache-Status', b'DYNAMIC'), (b'Server', b'cloudflare'), (b'CF-RAY', b'82cfa917eb7f1125-ORD'), (b'Content-Encoding', b'gzip'), (b'alt-svc', b'h3=":443"; ma=86400')])
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
DEBUG:httpcore.http11:receive_response_body.started request=<Request [b'POST']>
DEBUG:httpcore.http11:receive_response_body.complete
DEBUG:httpcore.http11:response_closed.started
DEBUG:httpcore.http11:response_closed.complete
DEBUG:openai._base_client:HTTP Request: POST https://api.openai.com/v1/embeddings "200 OK"
---------------------------------------------------------------------------
ApiValueError                             Traceback (most recent call last)
[<ipython-input-37-56d43383d63d>](https://d00c0viy2zf-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab_20231122-060141_RC00_584575710#) in <cell line: 2>()


19 frames
[/content/notebooks/pinecone/core/client/api_client.py](https://d00c0viy2zf-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab_20231122-060141_RC00_584575710#) in sanitize_for_serialization(cls, obj)
    290         if isinstance(obj, dict):
    291             return {key: cls.sanitize_for_serialization(val) for key, val in obj.items()}
--> 292         raise ApiValueError('Unable to prepare type {} for serialization'.format(obj.__class__.__name__))
    293 
    294     def deserialize(self, response, response_type, _check_type):

ApiValueError: Unable to prepare type Embedding for serialization
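One likely cause (an assumption based on the traceback, not a confirmed fix): the openai>=1.0 client returns `Embedding` objects, while the Pinecone client can only serialize plain Python types (dicts, lists, floats). Unwrapping the vectors before building the upsert request avoids the error. The pattern, with stand-in objects in place of a live API response:

```python
from types import SimpleNamespace

# Stand-ins for what openai>=1.0 returns from embeddings.create():
# response.data is a list of Embedding objects, each with an .embedding
# attribute holding a plain list of floats.
response = SimpleNamespace(data=[
    SimpleNamespace(embedding=[0.1, 0.2]),
    SimpleNamespace(embedding=[0.3, 0.4]),
])

# Unwrap into plain lists of floats before building the Pinecone request;
# passing the Embedding objects themselves triggers the ApiValueError above.
vectors = [d.embedding for d in response.data]
print(vectors)  # [[0.1, 0.2], [0.3, 0.4]]
```

In api.py, this would mean making sure `pinecone_request` carries those raw float lists rather than the response objects.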

Chapter 3: openai helper function is not fully shown

    response = openai.Completion.create(
      model=model,
      prompt=prompt,
      max_tokens=256,
      **kwargs
    )
    answer = response.choices[0].text
    if not suppress:
        print(f'PROMPT:\n------\n{prompt}\n------\nRESPONSE\n------\n{answer}')
    else:
        return answer
