quick-start-guide-to-llms's Issues
Remove .env file from the semantic-search-fastapi folder
Noticed that the .env file in the semantic-search-fastapi folder was checked into the repository.
You probably need to update your .gitignore file, remove the file from the repository, and regenerate your API keys
to ensure the leaked ones are deactivated.
Not working any more with v>1
Even if I explicitly install openai==0.28.1, I get
APIRemovedInV1:
You tried to access openai.Embedding, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.
You probably want to update the course on O'Reilly as well; otherwise the code does not work.
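For reference, a minimal sketch of the openai>=1.0 call that replaces the removed `openai.Embedding` pattern. The client is passed in as a parameter (my choice, so the sketch runs offline); in a real notebook you would pass `OpenAI()` from the new SDK:

```python
from types import SimpleNamespace

def get_embeddings(client, texts, model="text-embedding-ada-002"):
    # openai>=1.0: the module-level openai.Embedding.create was removed;
    # the equivalent call lives on the client object as client.embeddings.create.
    response = client.embeddings.create(input=texts, model=model)
    return [d.embedding for d in response.data]

# Offline stand-in mimicking the >=1.0 response shape (response.data[i].embedding);
# in real code: from openai import OpenAI; client = OpenAI()
demo_client = SimpleNamespace(
    embeddings=SimpleNamespace(
        create=lambda input, model: SimpleNamespace(
            data=[SimpleNamespace(embedding=[0.0, 0.0, 0.0]) for _ in input]
        )
    )
)
```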
Chapter: 7: name 'decoder_tokenizer' is not defined
Hey Sinan,
I downloaded the data from Kaggle, and while running this code:
import json
from collections import defaultdict
from tqdm import tqdm
from datasets import Dataset

# Function to load VQA data from the given annotation and question files
def load_vqa_data(annotations_file, questions_file, images_folder, start_at=None, end_at=None, max_images=None, max_questions=None):
    # Load the annotations and questions JSON files
    with open(annotations_file, "r") as f:
        annotations_data = json.load(f)
    with open(questions_file, "r") as f:
        questions_data = json.load(f)
    data = []
    images_used = defaultdict(int)
    # Create a dictionary to map question_id to the annotation data
    annotations_dict = {annotation["question_id"]: annotation for annotation in annotations_data["annotations"]}
    # Iterate through questions in the specified range
    for question in tqdm(questions_data["questions"][start_at:end_at]):
        # Assuming that "image_id" is available in the question data
        image_id = question.get("image_id", None)  # Replace with the actual code to get the image_id
        # Check if the image file exists and has not reached the max_questions limit
        # Add your image-checking logic here
        # Assuming that "annotation" is available in the annotation data
        annotation = annotations_dict.get(question["question_id"], {})  # Replace with the actual code to get the annotation
        # Assuming that other variables like "decoder_tokenizer" and "all_answers" are defined elsewhere in your code
        # Add the data as a dictionary
        data.append({
            "image_id": image_id,
            "question_id": question["question_id"],
            "question": question["question"],
            "answer": decoder_tokenizer.bos_token + ' ' + annotation.get("multiple_choice_answer", "") + decoder_tokenizer.eos_token,
            "all_answers": all_answers,  # You should define this variable elsewhere
            "image": image,  # You should define this variable elsewhere
        })
        # Break the loop if the max_images limit is reached
        # Add your max_images-checking logic here
    return data

# Load training and validation VQA data
train_data = load_vqa_data(
    "/content/data/v2_Annotations_Train_mscoco/v2_mscoco_train2014_annotations.json",
    "/content/data/v2_Questions_Train_mscoco/v2_OpenEnded_mscoco_train2014_questions.json",
    "/content/data/train2014"
)
val_data = load_vqa_data(
    "/content/data/v2_Annotations_Val_mscoco/v2_mscoco_val2014_annotations.json",
    "/content/data/v2_Questions_Val_mscoco/v2_OpenEnded_mscoco_val2014_questions.json",
    "/content/data/val2014"
)

# Create Hugging Face datasets
train_dataset = Dataset.from_dict({key: [item[key] for item in train_data] for key in train_data[0].keys()})
# Optionally save the dataset to disk for later retrieval
train_dataset.save_to_disk("vqa_train_dataset")

# Create Hugging Face datasets for validation
val_dataset = Dataset.from_dict({key: [item[key] for item in val_data] for key in val_data[0].keys()})
# Optionally save the dataset to disk for later retrieval
val_dataset.save_to_disk("vqa_val_dataset")
ERROR:
0%| | 0/443757 [00:00<?, ?it/s]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
[<ipython-input-17-be70899caa09>](https://localhost:8080/#) in <cell line: 49>()
47
48 # Load training and validation VQA data
---> 49 train_data = load_vqa_data(
50 "/content/data/v2_Annotations_Train_mscoco/v2_mscoco_train2014_annotations.json",
51 "/content/data/v2_Questions_Train_mscoco/v2_OpenEnded_mscoco_train2014_questions.json",
[<ipython-input-17-be70899caa09>](https://localhost:8080/#) in load_vqa_data(annotations_file, questions_file, images_folder, start_at, end_at, max_images, max_questions)
36 "question_id": question["question_id"],
37 "question": question["question"],
---> 38 "answer": decoder_tokenizer.bos_token + ' ' + annotation.get("multiple_choice_answer", "") + decoder_tokenizer.eos_token,
39 "all_answers": all_answers, # You should define this variable elsewhere
40 "image": image, # You should define this variable elsewhere
NameError: name 'decoder_tokenizer' is not defined
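The traceback confirms the cause: `decoder_tokenizer` (like `all_answers` and `image`) is used inside `load_vqa_data` but never defined in the pasted cell. In the book's pipeline it is the decoder model's tokenizer (e.g. something like `AutoTokenizer.from_pretrained("gpt2")` from `transformers`); the sketch below uses a stand-in object with only the two attributes the function actually reads, just to show what must exist before the call:

```python
from types import SimpleNamespace

# Stand-in with the two attributes load_vqa_data reads; in the real notebook
# this would be a transformers tokenizer defined before calling the function.
decoder_tokenizer = SimpleNamespace(bos_token="<|bos|>", eos_token="<|eos|>")

def wrap_answer(text, tokenizer=decoder_tokenizer):
    # Mirrors the "answer" field built inside load_vqa_data.
    return tokenizer.bos_token + ' ' + text + tokenizer.eos_token
```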
Chapter 2 Notebook errors
Hi Sinan,
Trying to run your Chapter 2 notebook. When I get to
import pinecone
pinecone.init(api_key=pinecone_key, environment="us-west1-gcp")
it issues an error that "init is no longer a top-level attribute of the pinecone package." Then it gives the following replacement code:
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(
    api_key=os.environ.get("PINECONE_API_KEY")
)

# Now do stuff
if 'my_index' not in pc.list_indexes().names():
    pc.create_index(
        name='my_index',
        dimension=1536,
        metric='euclidean',
        spec=ServerlessSpec(
            cloud='aws',
            region='us-west-2'
        )
    )
When I try the replacement code, I get this error:
PineconeApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'text/plain; charset=utf-8', 'access-control-allow-origin': '*', 'vary': 'origin,access-control-request-method,access-control-request-headers', 'access-control-expose-headers': '*', 'X-Cloud-Trace-Context': 'a4dd30976b4a3a292eb49f325c69da3b', 'Date': 'Sun, 28 Jan 2024 22:21:38 GMT', 'Server': 'Google Frontend', 'Content-Length': '125', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: {"error":{"code":"INVALID_ARGUMENT","message":"Name must consist of lower case alphanumeric characters or '-'"},"status":400}
Do you have an updated working notebook?
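Note that the 400 here is separate from the SDK migration: per the response body, index names may contain only lower-case alphanumeric characters or '-', and 'my_index' contains an underscore. A small sketch of that constraint (helper names are mine, not Pinecone's):

```python
import re

def valid_pinecone_index_name(name: str) -> bool:
    # Per the API error: lower-case alphanumeric characters or '-' only.
    return re.fullmatch(r"[a-z0-9-]+", name) is not None

def normalize_index_name(name: str) -> str:
    # Hypothetical helper: lower-case and replace disallowed characters with '-'.
    return re.sub(r"[^a-z0-9-]", "-", name.lower())
```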
Chapter 3: Is "get_best_result_from_pinecone()" implementation correct?
Hi,
This function posts the payload to "https://information-retrieval-hiaa.onrender.com/document/retrieve" and does not interact with Pinecone at all. Per Figure 3.12 and the relevant sections, the result is supposed to be retrieved from the vector database. I'm unable to find anything about this in the Chapter 3 notebook.
Please help me understand.
def get_best_result_from_pinecone(query, namespace=NAMESPACE):
    payload = json.dumps({
        "num_results": 2,
        "query": query,
        "re_ranking_strategy": "none",
        "namespace": namespace
    })
    response = requests.post(
        "https://information-retrieval-hiaa.onrender.com/document/retrieve",
        data=payload
    )
.
.
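For comparison, a version that actually queries Pinecone might look like the sketch below (hypothetical names; the index client is passed in so the sketch runs without credentials, and the query would use an embedding of the user's question):

```python
def get_best_result_from_index(index, query_embedding, namespace, num_results=2):
    # Query the vector database directly instead of the hosted
    # /document/retrieve endpoint; `index` is a Pinecone Index client.
    results = index.query(
        vector=query_embedding,
        top_k=num_results,
        namespace=namespace,
        include_metadata=True,
    )
    return results.matches[0] if results.matches else None
```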
Replace deprecated `text-davinci-003`
I need to replace text-davinci-003 with gpt-3.5-turbo-instruct and re-run the code to make sure everything still works as expected.
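Since gpt-3.5-turbo-instruct uses the same Completions endpoint as text-davinci-003, the swap is usually just the model string; a small mapping keeps the replacement in one place (a sketch, not the book's code):

```python
# Deprecated completion models and their drop-in replacements.
DEPRECATED_MODELS = {
    "text-davinci-003": "gpt-3.5-turbo-instruct",
}

def resolve_model(model: str) -> str:
    # Swap in the replacement when a deprecated model is requested;
    # pass everything else through unchanged.
    return DEPRECATED_MODELS.get(model, model)
```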
Dataset "amazon_reviews_multi" is defunct and no longer accessible due to the decision of data providers
Hi @sinanuozdemir,
This is just to notify you that the dataset "amazon_reviews_multi" has been disabled by its author: https://huggingface.co/datasets/amazon_reviews_multi. Looking forward to your recommended alternative.
Regards,
Sheik
Chapters 3-4 mention Fast openai
Deprecated amazon_reviews_multi dataset
Chapter 4 covers Optimizing LLMs with Customized Fine-Tuning using the amazon_reviews_multi dataset; however, this dataset is deprecated. @sinanuozdemir
Chapter 2: Semantic Search Pinecone index.upsert() throws 'Unable to prepare type Embedding for serialization' exception
Source: https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/semantic-search-fastapi/api.py
document_ingest()
Line 132: upserted_count = index.upsert(pinecone_request, namespace=request.namespace).get('upserted_count') throws the following error:
ApiValueError: Unable to prepare type Embedding for serialization
DEBUG:openai._base_client:Request options: {'method': 'post', 'url': '/embeddings', 'files': None, 'post_parser': <function Embeddings.create.<locals>.parser at 0x79339f965bd0>, 'json_data': {'input': ['hi'], 'model': 'text-embedding-ada-002', 'encoding_format': 'base64'}}
DEBUG:httpcore.connection:close.started
DEBUG:httpcore.connection:close.complete
DEBUG:httpcore.connection:connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=5.0 socket_options=None
DEBUG:httpcore.connection:connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x79339f95d960>
DEBUG:httpcore.connection:start_tls.started ssl_context=<ssl.SSLContext object at 0x7933b233c4c0> server_hostname='api.openai.com' timeout=5.0
DEBUG:httpcore.connection:start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x79339f95f520>
DEBUG:httpcore.http11:send_request_headers.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_headers.complete
DEBUG:httpcore.http11:send_request_body.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_body.complete
DEBUG:httpcore.http11:receive_response_headers.started request=<Request [b'POST']>
['hi']
DEBUG:httpcore.http11:receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Date', b'Tue, 28 Nov 2023 03:52:51 GMT'), (b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'access-control-allow-origin', b'*'), (b'openai-organization', b'user-bfcfnnkwlm7ogyjwm7tngl6y'), (b'openai-processing-ms', b'28'), (b'openai-version', b'2020-10-01'), (b'strict-transport-security', b'max-age=15724800; includeSubDomains'), (b'x-ratelimit-limit-requests', b'3000'), (b'x-ratelimit-remaining-requests', b'2999'), (b'x-ratelimit-reset-requests', b'20ms'), (b'x-request-id', b'e6d41a58a9769205cd239d0eb103d47c'), (b'CF-Cache-Status', b'DYNAMIC'), (b'Server', b'cloudflare'), (b'CF-RAY', b'82cfa917eb7f1125-ORD'), (b'Content-Encoding', b'gzip'), (b'alt-svc', b'h3=":443"; ma=86400')])
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
DEBUG:httpcore.http11:receive_response_body.started request=<Request [b'POST']>
DEBUG:httpcore.http11:receive_response_body.complete
DEBUG:httpcore.http11:response_closed.started
DEBUG:httpcore.http11:response_closed.complete
DEBUG:openai._base_client:HTTP Request: POST https://api.openai.com/v1/embeddings "200 OK"
---------------------------------------------------------------------------
ApiValueError Traceback (most recent call last)
[<ipython-input-37-56d43383d63d>](https://d00c0viy2zf-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab_20231122-060141_RC00_584575710#) in <cell line: 2>()
19 frames
[/content/notebooks/pinecone/core/client/api_client.py](https://d00c0viy2zf-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab_20231122-060141_RC00_584575710#) in sanitize_for_serialization(cls, obj)
290 if isinstance(obj, dict):
291 return {key: cls.sanitize_for_serialization(val) for key, val in obj.items()}
--> 292 raise ApiValueError('Unable to prepare type {} for serialization'.format(obj.__class__.__name__))
293
294 def deserialize(self, response, response_type, _check_type):
ApiValueError: Unable to prepare type Embedding for serialization
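One likely cause (hedged, inferred from the error rather than confirmed in the repo): with openai>=1.0 the embeddings response holds `Embedding` objects, and passing those objects straight into the upsert payload fails the Pinecone client's serialization; unwrapping the raw float lists first avoids it. A sketch with a stand-in for the SDK type:

```python
from dataclasses import dataclass

# Stand-in for openai>=1.0's Embedding object (real code: response.data items).
@dataclass
class Embedding:
    embedding: list

def to_pinecone_vectors(ids, embedding_objects):
    # Pinecone's client cannot serialize Embedding objects; unwrap the raw
    # float lists before building the upsert request.
    return [(i, e.embedding) for i, e in zip(ids, embedding_objects)]
```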
chapter 3: openai function is not fully mentioned
response = openai.Completion.create(
    model=model,
    prompt=prompt,
    max_tokens=256,
    **kwargs
)
answer = response.choices[0].text
if not suppress:
    print(f'PROMPT:\n------\n{prompt}\n------\nRESPONSE\n------\n{answer}')
else:
    return answer
Where is code repo for Quick guide to large language models?
Hello ,
Greetings!!
While reading your book, I was looking for the FastAPI code for document chunking but was not able to find the repo. Could you please share it?
Thanks,
Ankush Singal
Update Chapter 6 Anime Recommendation Engine with updated metrics and more fine-tuning
I've been working on an updated version of the Chapter 6 recommendation engine case study and will be posting a new notebook soon!