quick-start-guide-to-llms's Issues
Remove .env file from the semantic-search-fastapi folder
Noticed that the .env file in the semantic-search-fastapi folder was checked into the repository.
You probably need to update your .gitignore file, remove the file from the repository, and regenerate your API keys
to ensure the leaked ones are deactivated.
Not working any more with v>1
Even if I explicitly install openai==0.28.1, I get
APIRemovedInV1:
You tried to access openai.Embedding, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.
You probably want to update the course on O'Reilly as well; otherwise the code does not work.
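For reference, a minimal sketch of the openai>=1.0 call that replaces the removed `openai.Embedding` pattern. The client is passed in as a parameter (my choice, so the sketch runs offline); in a real notebook you would pass `OpenAI()` from the new SDK:

```python
from types import SimpleNamespace

def get_embeddings(client, texts, model="text-embedding-ada-002"):
    # openai>=1.0: the module-level openai.Embedding.create was removed;
    # the equivalent call lives on the client object as client.embeddings.create.
    response = client.embeddings.create(input=texts, model=model)
    return [d.embedding for d in response.data]

# Offline stand-in mimicking the >=1.0 response shape (response.data[i].embedding);
# in real code: from openai import OpenAI; client = OpenAI()
demo_client = SimpleNamespace(
    embeddings=SimpleNamespace(
        create=lambda input, model: SimpleNamespace(
            data=[SimpleNamespace(embedding=[0.0, 0.0, 0.0]) for _ in input]
        )
    )
)
```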
Chapter: 7: name 'decoder_tokenizer' is not defined
Hey Sinan,
I downloaded the data from Kaggle, and while running this code:
import json
from collections import defaultdict
from tqdm import tqdm
from datasets import Dataset

# Function to load VQA data from the given annotation and question files
def load_vqa_data(annotations_file, questions_file, images_folder, start_at=None, end_at=None, max_images=None, max_questions=None):
    # Load the annotations and questions JSON files
    with open(annotations_file, "r") as f:
        annotations_data = json.load(f)
    with open(questions_file, "r") as f:
        questions_data = json.load(f)
    data = []
    images_used = defaultdict(int)
    # Create a dictionary to map question_id to the annotation data
    annotations_dict = {annotation["question_id"]: annotation for annotation in annotations_data["annotations"]}
    # Iterate through questions in the specified range
    for question in tqdm(questions_data["questions"][start_at:end_at]):
        # Assuming that "image_id" is available in the question data
        image_id = question.get("image_id", None)  # Replace with the actual code to get the image_id
        # Check if the image file exists and has not reached the max_questions limit
        # Add your image-checking logic here
        # Assuming that "annotation" is available in the annotation data
        annotation = annotations_dict.get(question["question_id"], {})  # Replace with the actual code to get the annotation
        # Assuming that other variables like "decoder_tokenizer" and "all_answers" are defined elsewhere in your code
        # Add the data as a dictionary
        data.append({
            "image_id": image_id,
            "question_id": question["question_id"],
            "question": question["question"],
            "answer": decoder_tokenizer.bos_token + ' ' + annotation.get("multiple_choice_answer", "") + decoder_tokenizer.eos_token,
            "all_answers": all_answers,  # You should define this variable elsewhere
            "image": image,  # You should define this variable elsewhere
        })
        # Break the loop if the max_images limit is reached
        # Add your max_images-checking logic here
    return data

# Load training and validation VQA data
train_data = load_vqa_data(
    "/content/data/v2_Annotations_Train_mscoco/v2_mscoco_train2014_annotations.json",
    "/content/data/v2_Questions_Train_mscoco/v2_OpenEnded_mscoco_train2014_questions.json",
    "/content/data/train2014"
)
val_data = load_vqa_data(
    "/content/data/v2_Annotations_Val_mscoco/v2_mscoco_val2014_annotations.json",
    "/content/data/v2_Questions_Val_mscoco/v2_OpenEnded_mscoco_val2014_questions.json",
    "/content/data/val2014"
)

# Create Hugging Face datasets
train_dataset = Dataset.from_dict({key: [item[key] for item in train_data] for key in train_data[0].keys()})
# Optionally save the dataset to disk for later retrieval
train_dataset.save_to_disk("vqa_train_dataset")

# Create Hugging Face datasets for validation
val_dataset = Dataset.from_dict({key: [item[key] for item in val_data] for key in val_data[0].keys()})
# Optionally save the dataset to disk for later retrieval
val_dataset.save_to_disk("vqa_val_dataset")
ERROR:
0%| | 0/443757 [00:00<?, ?it/s]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
[<ipython-input-17-be70899caa09>](https://localhost:8080/#) in <cell line: 49>()
47
48 # Load training and validation VQA data
---> 49 train_data = load_vqa_data(
50 "/content/data/v2_Annotations_Train_mscoco/v2_mscoco_train2014_annotations.json",
51 "/content/data/v2_Questions_Train_mscoco/v2_OpenEnded_mscoco_train2014_questions.json",
[<ipython-input-17-be70899caa09>](https://localhost:8080/#) in load_vqa_data(annotations_file, questions_file, images_folder, start_at, end_at, max_images, max_questions)
36 "question_id": question["question_id"],
37 "question": question["question"],
---> 38 "answer": decoder_tokenizer.bos_token + ' ' + annotation.get("multiple_choice_answer", "") + decoder_tokenizer.eos_token,
39 "all_answers": all_answers, # You should define this variable elsewhere
40 "image": image, # You should define this variable elsewhere
NameError: name 'decoder_tokenizer' is not defined
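The traceback confirms the cause: `decoder_tokenizer` (like `all_answers` and `image`) is used inside `load_vqa_data` but never defined in the pasted cell. In the book's pipeline it is the decoder model's tokenizer (e.g. something like `AutoTokenizer.from_pretrained("gpt2")` from `transformers`); the sketch below uses a stand-in object with only the two attributes the function actually reads, just to show what must exist before the call:

```python
from types import SimpleNamespace

# Stand-in with the two attributes load_vqa_data reads; in the real notebook
# this would be a transformers tokenizer defined before calling the function.
decoder_tokenizer = SimpleNamespace(bos_token="<|bos|>", eos_token="<|eos|>")

def wrap_answer(text, tokenizer=decoder_tokenizer):
    # Mirrors the "answer" field built inside load_vqa_data.
    return tokenizer.bos_token + ' ' + text + tokenizer.eos_token
```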
Chapter 2 Notebook errors
Hi Sinan,
Trying to run your Chapter 2 notebook. When I get to
import pinecone
pinecone.init(api_key=pinecone_key, environment="us-west1-gcp")
it issues an error that "init is no longer a top-level attribute of the pinecone package." Then it gives the following replacement code:
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(
    api_key=os.environ.get("PINECONE_API_KEY")
)

# Now do stuff
if 'my_index' not in pc.list_indexes().names():
    pc.create_index(
        name='my_index',
        dimension=1536,
        metric='euclidean',
        spec=ServerlessSpec(
            cloud='aws',
            region='us-west-2'
        )
    )
When I try the replacement code, I get this error:
PineconeApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'text/plain; charset=utf-8', 'access-control-allow-origin': '*', 'vary': 'origin,access-control-request-method,access-control-request-headers', 'access-control-expose-headers': '*', 'X-Cloud-Trace-Context': 'a4dd30976b4a3a292eb49f325c69da3b', 'Date': 'Sun, 28 Jan 2024 22:21:38 GMT', 'Server': 'Google Frontend', 'Content-Length': '125', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: {"error":{"code":"INVALID_ARGUMENT","message":"Name must consist of lower case alphanumeric characters or '-'"},"status":400}
Do you have an updated working notebook?
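Note that the 400 here is separate from the SDK migration: per the response body, index names may contain only lower-case alphanumeric characters or '-', and 'my_index' contains an underscore. A small sketch of that constraint (helper names are mine, not Pinecone's):

```python
import re

def valid_pinecone_index_name(name: str) -> bool:
    # Per the API error: lower-case alphanumeric characters or '-' only.
    return re.fullmatch(r"[a-z0-9-]+", name) is not None

def normalize_index_name(name: str) -> str:
    # Hypothetical helper: lower-case and replace disallowed characters with '-'.
    return re.sub(r"[^a-z0-9-]", "-", name.lower())
```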
Chapter 3: Is "get_best_result_from_pinecone()" implementation correct?
Hi,
This function posts the payload to "https://information-retrieval-hiaa.onrender.com/document/retrieve" and does not interact with Pinecone at all. Per Figure 3.12 and the relevant sections, the result is supposed to be retrieved from the vector database. I'm unable to find anything about this in the Chapter 3 notebook.
Please help me understand.
def get_best_result_from_pinecone(query, namespace=NAMESPACE):
    payload = json.dumps({
        "num_results": 2,
        "query": query,
        "re_ranking_strategy": "none",
        "namespace": namespace
    })
    response = requests.post(
        "https://information-retrieval-hiaa.onrender.com/document/retrieve",
        data=payload
    )
.
.
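For comparison, a version that actually queries Pinecone might look like the sketch below (hypothetical names; the index client is passed in so the sketch runs without credentials, and the query would use an embedding of the user's question):

```python
def get_best_result_from_index(index, query_embedding, namespace, num_results=2):
    # Query the vector database directly instead of the hosted
    # /document/retrieve endpoint; `index` is a Pinecone Index client.
    results = index.query(
        vector=query_embedding,
        top_k=num_results,
        namespace=namespace,
        include_metadata=True,
    )
    return results.matches[0] if results.matches else None
```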
Replace deprecated `text-davinci-003`
I need to replace text-davinci-003 with gpt-3.5-turbo-instruct and re-run the code to make sure everything still works as expected.
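Since gpt-3.5-turbo-instruct uses the same Completions endpoint as text-davinci-003, the swap is usually just the model string; a small mapping keeps the replacement in one place (a sketch, not the book's code):

```python
# Deprecated completion models and their drop-in replacements.
DEPRECATED_MODELS = {
    "text-davinci-003": "gpt-3.5-turbo-instruct",
}

def resolve_model(model: str) -> str:
    # Swap in the replacement when a deprecated model is requested;
    # pass everything else through unchanged.
    return DEPRECATED_MODELS.get(model, model)
```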
Dataset "amazon_reviews_multi" is defunct and no longer accessible due to the decision of data providers
Hi @sinanuozdemir,
This is just to notify you that the dataset "amazon_reviews_multi" has been disabled by its author: https://huggingface.co/datasets/amazon_reviews_multi. Looking forward to your recommended alternative.
Regards,
Sheik
Chapters 3-4 mention Fast openai
Deprecated amazon_reviews_multi dataset
Chapter 4 covers Optimizing LLMs with Customized Fine-Tuning using the amazon_reviews_multi dataset; however, this dataset is deprecated. @sinanuozdemir
Chapter 2: Semantic Search Pinecone index.upsert() throws 'Unable to prepare type Embedding for serialization' exception
Source: https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/semantic-search-fastapi/api.py
document_ingest()
Line 132: upserted_count = index.upsert(pinecone_request, namespace=request.namespace).get('upserted_count') throws the following error:
ApiValueError: Unable to prepare type Embedding for serialization
DEBUG:openai._base_client:Request options: {'method': 'post', 'url': '/embeddings', 'files': None, 'post_parser': <function Embeddings.create.<locals>.parser at 0x79339f965bd0>, 'json_data': {'input': ['hi'], 'model': 'text-embedding-ada-002', 'encoding_format': 'base64'}}
DEBUG:httpcore.connection:close.started
DEBUG:httpcore.connection:close.complete
DEBUG:httpcore.connection:connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=5.0 socket_options=None
DEBUG:httpcore.connection:connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x79339f95d960>
DEBUG:httpcore.connection:start_tls.started ssl_context=<ssl.SSLContext object at 0x7933b233c4c0> server_hostname='api.openai.com' timeout=5.0
DEBUG:httpcore.connection:start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x79339f95f520>
DEBUG:httpcore.http11:send_request_headers.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_headers.complete
DEBUG:httpcore.http11:send_request_body.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_body.complete
DEBUG:httpcore.http11:receive_response_headers.started request=<Request [b'POST']>
['hi']
DEBUG:httpcore.http11:receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Date', b'Tue, 28 Nov 2023 03:52:51 GMT'), (b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'access-control-allow-origin', b'*'), (b'openai-organization', b'user-bfcfnnkwlm7ogyjwm7tngl6y'), (b'openai-processing-ms', b'28'), (b'openai-version', b'2020-10-01'), (b'strict-transport-security', b'max-age=15724800; includeSubDomains'), (b'x-ratelimit-limit-requests', b'3000'), (b'x-ratelimit-remaining-requests', b'2999'), (b'x-ratelimit-reset-requests', b'20ms'), (b'x-request-id', b'e6d41a58a9769205cd239d0eb103d47c'), (b'CF-Cache-Status', b'DYNAMIC'), (b'Server', b'cloudflare'), (b'CF-RAY', b'82cfa917eb7f1125-ORD'), (b'Content-Encoding', b'gzip'), (b'alt-svc', b'h3=":443"; ma=86400')])
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
DEBUG:httpcore.http11:receive_response_body.started request=<Request [b'POST']>
DEBUG:httpcore.http11:receive_response_body.complete
DEBUG:httpcore.http11:response_closed.started
DEBUG:httpcore.http11:response_closed.complete
DEBUG:openai._base_client:HTTP Request: POST https://api.openai.com/v1/embeddings "200 OK"
---------------------------------------------------------------------------
ApiValueError Traceback (most recent call last)
[<ipython-input-37-56d43383d63d>](https://d00c0viy2zf-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab_20231122-060141_RC00_584575710#) in <cell line: 2>()
19 frames
[/content/notebooks/pinecone/core/client/api_client.py](https://d00c0viy2zf-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab_20231122-060141_RC00_584575710#) in sanitize_for_serialization(cls, obj)
290 if isinstance(obj, dict):
291 return {key: cls.sanitize_for_serialization(val) for key, val in obj.items()}
--> 292 raise ApiValueError('Unable to prepare type {} for serialization'.format(obj.__class__.__name__))
293
294 def deserialize(self, response, response_type, _check_type):
ApiValueError: Unable to prepare type Embedding for serialization
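One likely cause (hedged, inferred from the error rather than confirmed in the repo): with openai>=1.0 the embeddings response holds `Embedding` objects, and passing those objects straight into the upsert payload fails the Pinecone client's serialization; unwrapping the raw float lists first avoids it. A sketch with a stand-in for the SDK type:

```python
from dataclasses import dataclass

# Stand-in for openai>=1.0's Embedding object (real code: response.data items).
@dataclass
class Embedding:
    embedding: list

def to_pinecone_vectors(ids, embedding_objects):
    # Pinecone's client cannot serialize Embedding objects; unwrap the raw
    # float lists before building the upsert request.
    return [(i, e.embedding) for i, e in zip(ids, embedding_objects)]
```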
chapter 3: openai function is not fully mentioned
response = openai.Completion.create(
    model=model,
    prompt=prompt,
    max_tokens=256,
    **kwargs
)
answer = response.choices[0].text
if not suppress:
    print(f'PROMPT:\n------\n{prompt}\n------\nRESPONSE\n------\n{answer}')
else:
    return answer
Where is code repo for Quick guide to large language models?
Hello ,
Greetings!!
While reading your book, I was looking for the FastAPI code for document chunking but was not able to find the repo. Could you please share it?
Thanks,
Ankush Singal
Update Chapter 6 Anime Recommendation Engine with updated metrics and more fine-tuning
I've been working on an updated version of the Chapter 6 recommendation engine case study and will be posting a new notebook soon!