
Quick Start Guide to Large Language Models

Get your copy today and please leave a rating/review to tell me what you thought! ⭐⭐⭐⭐⭐

Welcome to the GitHub repository for the "Quick Start Guide to Large Language Models" book. This repository contains the code snippets and notebooks used in the book, demonstrating various applications of Transformer models.

Repository Structure

Directories

  • notebooks: This directory contains Jupyter notebooks for each chapter in the book.
  • data: Contains the datasets used in the notebooks.
  • images: Contains images and graphs used in the notebooks.

Notebooks

Here are some of the notebooks included in the notebooks directory:

Part I - Introduction to Large Language Models

  • 2_semantic_search.ipynb: An introduction to semantic search using OpenAI and open-source models.
    • I have an updated version here with the updated OpenAI client usage plus the latest V3 OpenAI embeddings. Spoiler alert: the open-source embedder plus a fine-tuned cross-encoder beat even the largest OpenAI embedder :)
  • 3_prompt_engineering.ipynb: A guide to effective prompt engineering for instruction-aligned LLMs.
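The updated client usage mentioned above can be sketched roughly as follows. This is a hedged illustration of the OpenAI >= 1.0 embeddings call, not the notebook's exact code; the `embed` helper and the model name are assumptions.

```python
def embed(texts, model="text-embedding-3-large", client=None):
    """Return one embedding vector (list of floats) per input string."""
    if client is None:
        from openai import OpenAI  # pip install "openai>=1.0"
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]
```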

Part II - Getting the Most Out of LLMs

Part III - Advanced LLM Usage

We will continue to add more notebooks exploring topics like fine-tuning, advanced prompt engineering, combining transformers, and various use cases. Stay tuned!

How to Use

To use this repository, clone it to your local machine, navigate to the notebooks directory, and open the Jupyter notebook of your choice. Note that some notebooks may require specific datasets, which can be found in the data directory.

Please ensure that you have the necessary libraries installed and that they are up to date. This can usually be done by running pip install -r requirements.txt in the terminal.
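Concretely, the steps above look roughly like this (the repository URL comes from this page; the final notebook command is one common way to launch Jupyter, not a prescribed one):

```shell
git clone https://github.com/sinanuozdemir/quick-start-guide-to-llms.git
cd quick-start-guide-to-llms
pip install -r requirements.txt   # install/refresh the required libraries
jupyter notebook notebooks/       # then open the notebook of your choice
```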

Contributing

Contributions are welcome! Feel free to submit a pull request if you have any additions, corrections, or enhancements.

Disclaimer

This repository is for educational purposes and is meant to accompany the "Quick Start Guide to Large Language Models" book. Please refer to the book for in-depth explanations and discussions of the topics covered in the notebooks.

More From Sinan

  1. Check out Sinan's Newsletter AI Office Hours for more AI/LLM content!
  2. Sinan has a podcast called Practically Intelligent where he chats about the latest and greatest in AI!
  3. Follow the Getting Started with Data, LLMs and ChatGPT Playlist on O'Reilly for a curated list of Sinan's work!


quick-start-guide-to-llms's Issues

Chapter 3: Is "get_best_result_from_pinecone()" implementation correct?

Hi,

This function posts the payload to "https://information-retrieval-hiaa.onrender.com/document/retrieve" and does not interact with Pinecone at all. Figure 3.12 and the relevant sections say the result is retrieved from the vector database, but I can't see anything like that in the Chapter 3 notebook.

Please help me understand.

import json
import requests

def get_best_result_from_pinecone(query, namespace=NAMESPACE):
    payload = json.dumps({
      "num_results": 2,
      "query": query,
      "re_ranking_strategy": "none",
      "namespace": namespace
    })

    response = requests.post(
        "https://information-retrieval-hiaa.onrender.com/document/retrieve",
        data=payload
    )
    # note: as written in the notebook, the function never returns the response
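For comparison, a direct Pinecone lookup with the v3 client would look roughly like this. The index name, helper name, and `top_k` are assumptions for illustration, not the book's code:

```python
import os

def query_pinecone(query_embedding, namespace, index=None, top_k=2):
    """Retrieve the closest matches for an embedding from a Pinecone index."""
    if index is None:
        from pinecone import Pinecone  # pip install pinecone-client
        pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
        index = pc.Index("semantic-search")  # assumed index name
    result = index.query(vector=query_embedding, top_k=top_k,
                         namespace=namespace, include_metadata=True)
    return result["matches"]
```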

Chapters 3-4 mention FastAPI

Hi Sinan,
Thanks for the amazing book! Any chance you could add the FastAPI code that is mentioned, and show how it all interacts, as described in Chapters 2-4?
[Screenshot 2023-09-14 at 10:07 PM]

Looking forward to hearing from you.
Thanks,
Andy

Chapter 2: Semantic Search Pinecone index.upsert() throws 'Unable to prepare type Embedding for serialization' exception

Source: https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/semantic-search-fastapi/api.py

document_ingest()

Line 132: upserted_count = index.upsert(pinecone_request, namespace=request.namespace).get('upserted_count') throws following error:

ApiValueError: Unable to prepare type Embedding for serialization

DEBUG:openai._base_client:Request options: {'method': 'post', 'url': '/embeddings', 'files': None, 'post_parser': <function Embeddings.create.<locals>.parser at 0x79339f965bd0>, 'json_data': {'input': ['hi'], 'model': 'text-embedding-ada-002', 'encoding_format': 'base64'}}
DEBUG:httpcore.connection:close.started
DEBUG:httpcore.connection:close.complete
DEBUG:httpcore.connection:connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=5.0 socket_options=None
DEBUG:httpcore.connection:connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x79339f95d960>
DEBUG:httpcore.connection:start_tls.started ssl_context=<ssl.SSLContext object at 0x7933b233c4c0> server_hostname='api.openai.com' timeout=5.0
DEBUG:httpcore.connection:start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x79339f95f520>
DEBUG:httpcore.http11:send_request_headers.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_headers.complete
DEBUG:httpcore.http11:send_request_body.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_body.complete
DEBUG:httpcore.http11:receive_response_headers.started request=<Request [b'POST']>
['hi']
DEBUG:httpcore.http11:receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Date', b'Tue, 28 Nov 2023 03:52:51 GMT'), (b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'access-control-allow-origin', b'*'), (b'openai-organization', b'user-bfcfnnkwlm7ogyjwm7tngl6y'), (b'openai-processing-ms', b'28'), (b'openai-version', b'2020-10-01'), (b'strict-transport-security', b'max-age=15724800; includeSubDomains'), (b'x-ratelimit-limit-requests', b'3000'), (b'x-ratelimit-remaining-requests', b'2999'), (b'x-ratelimit-reset-requests', b'20ms'), (b'x-request-id', b'e6d41a58a9769205cd239d0eb103d47c'), (b'CF-Cache-Status', b'DYNAMIC'), (b'Server', b'cloudflare'), (b'CF-RAY', b'82cfa917eb7f1125-ORD'), (b'Content-Encoding', b'gzip'), (b'alt-svc', b'h3=":443"; ma=86400')])
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
DEBUG:httpcore.http11:receive_response_body.started request=<Request [b'POST']>
DEBUG:httpcore.http11:receive_response_body.complete
DEBUG:httpcore.http11:response_closed.started
DEBUG:httpcore.http11:response_closed.complete
DEBUG:openai._base_client:HTTP Request: POST https://api.openai.com/v1/embeddings "200 OK"
---------------------------------------------------------------------------
ApiValueError                             Traceback (most recent call last)
[<ipython-input-37-56d43383d63d>](https://d00c0viy2zf-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab_20231122-060141_RC00_584575710#) in <cell line: 2>()


19 frames
[/content/notebooks/pinecone/core/client/api_client.py](https://d00c0viy2zf-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab_20231122-060141_RC00_584575710#) in sanitize_for_serialization(cls, obj)
    290         if isinstance(obj, dict):
    291             return {key: cls.sanitize_for_serialization(val) for key, val in obj.items()}
--> 292         raise ApiValueError('Unable to prepare type {} for serialization'.format(obj.__class__.__name__))
    293 
    294     def deserialize(self, response, response_type, _check_type):

ApiValueError: Unable to prepare type Embedding for serialization
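The traceback suggests the openai >= 1.0 `Embedding` response objects were passed straight to `index.upsert()`, which the Pinecone client cannot JSON-serialize. Converting them to plain float lists first is the usual fix; this is a sketch (the helper name is illustrative, the field names follow the openai v1 response shape):

```python
def to_pinecone_vectors(openai_response, ids, metadatas):
    """Turn an openai>=1.0 embeddings response into JSON-serializable dicts
    that index.upsert() accepts, instead of raw Embedding objects."""
    return [
        {"id": _id, "values": list(item.embedding), "metadata": meta}
        for _id, item, meta in zip(ids, openai_response.data, metadatas)
    ]
```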

Chapter 3: openai function is not fully shown

import openai  # legacy openai<1.0 client

# The book omits the enclosing function definition; a plausible signature
# (inferred, not verbatim from the book) would be:
def complete(prompt, model='text-davinci-003', suppress=False, **kwargs):
    response = openai.Completion.create(
      model=model,
      prompt=prompt,
      max_tokens=256,
      **kwargs
    )
    answer = response.choices[0].text
    if not suppress:
        print(f'PROMPT:\n------\n{prompt}\n------\nRESPONSE\n------\n{answer}')
    else:
        return answer

Chapter 7: name 'decoder_tokenizer' is not defined

Hey Sinan,
I downloaded the data from Kaggle, and while running the code:

import json
from collections import defaultdict
from tqdm import tqdm
from datasets import Dataset

# Function to load VQA data from the given annotation and question files
def load_vqa_data(annotations_file, questions_file, images_folder, start_at=None, end_at=None, max_images=None, max_questions=None):
    # Load the annotations and questions JSON files
    with open(annotations_file, "r") as f:
        annotations_data = json.load(f)
    with open(questions_file, "r") as f:
        questions_data = json.load(f)

    data = []
    images_used = defaultdict(int)

    # Create a dictionary to map question_id to the annotation data
    annotations_dict = {annotation["question_id"]: annotation for annotation in annotations_data["annotations"]}

    # Iterate through questions in the specified range
    for question in tqdm(questions_data["questions"][start_at:end_at]):
        # Assuming that "image_id" is available in the question data
        image_id = question.get("image_id", None)  # Replace with the actual code to get the image_id

        # Check if the image file exists and has not reached the max_questions limit
        # Add your image-checking logic here

        # Assuming that "annotation" is available in the annotation data
        annotation = annotations_dict.get(question["question_id"], {})  # Replace with the actual code to get the annotation

        # Assuming that other variables like "decoder_tokenizer" and "all_answers" are defined elsewhere in your code

        # Add the data as a dictionary
        data.append({
            "image_id": image_id,
            "question_id": question["question_id"],
            "question": question["question"],
            "answer": decoder_tokenizer.bos_token + ' ' + annotation.get("multiple_choice_answer", "") + decoder_tokenizer.eos_token,
            "all_answers": all_answers,  # You should define this variable elsewhere
            "image": image,  # You should define this variable elsewhere
        })

        # Break the loop if the max_images limit is reached
        # Add your max_images-checking logic here

    return data

# Load training and validation VQA data
train_data = load_vqa_data(
    "/content/data/v2_Annotations_Train_mscoco/v2_mscoco_train2014_annotations.json",
    "/content/data/v2_Questions_Train_mscoco/v2_OpenEnded_mscoco_train2014_questions.json",
    "/content/data/train2014"
)

val_data = load_vqa_data(
    "/content/data/v2_Annotations_Val_mscoco/v2_mscoco_val2014_annotations.json",
    "/content/data/v2_Questions_Val_mscoco/v2_OpenEnded_mscoco_val2014_questions.json",
    "/content/data/val2014"
)

# Create Hugging Face datasets
train_dataset = Dataset.from_dict({key: [item[key] for item in train_data] for key in train_data[0].keys()})

# Optionally save the dataset to disk for later retrieval
train_dataset.save_to_disk("vqa_train_dataset")

# Create Hugging Face datasets for validation
val_dataset = Dataset.from_dict({key: [item[key] for item in val_data] for key in val_data[0].keys()})

# Optionally save the dataset to disk for later retrieval
val_dataset.save_to_disk("vqa_val_dataset")

ERROR:

0%|          | 0/443757 [00:00<?, ?it/s]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
[<ipython-input-17-be70899caa09>](https://localhost:8080/#) in <cell line: 49>()
     47 
     48 # Load training and validation VQA data
---> 49 train_data = load_vqa_data(
     50     "/content/data/v2_Annotations_Train_mscoco/v2_mscoco_train2014_annotations.json",
     51     "/content/data/v2_Questions_Train_mscoco/v2_OpenEnded_mscoco_train2014_questions.json",

[<ipython-input-17-be70899caa09>](https://localhost:8080/#) in load_vqa_data(annotations_file, questions_file, images_folder, start_at, end_at, max_images, max_questions)
     36             "question_id": question["question_id"],
     37             "question": question["question"],
---> 38             "answer": decoder_tokenizer.bos_token + ' ' + annotation.get("multiple_choice_answer", "") + decoder_tokenizer.eos_token,
     39             "all_answers": all_answers,  # You should define this variable elsewhere
     40             "image": image,  # You should define this variable elsewhere

NameError: name 'decoder_tokenizer' is not defined

Remove env file from semantic-search-fastapi folder.

Noticed that the .env file in the semantic-search-fastapi folder was checked into the repository.
You probably need to update your .gitignore file, remove the file, and regenerate your API keys
to ensure the leaked ones are deactivated.

Chapter 2 Notebook errors

Hi Sinan,

Trying to run your Chapter 2 notebook. When I get to

import pinecone

pinecone.init(api_key=pinecone_key, environment="us-west1-gcp")

it issues an error that "init is no longer a top-level attribute of the pinecone package." Then it gives the following replacement code:

    import os
    from pinecone import Pinecone, ServerlessSpec

    pc = Pinecone(
        api_key=os.environ.get("PINECONE_API_KEY")
    )

    # Now do stuff
    if 'my_index' not in pc.list_indexes().names():
        pc.create_index(
            name='my_index', 
            dimension=1536, 
            metric='euclidean',
            spec=ServerlessSpec(
                cloud='aws',
                region='us-west-2'
            )
        )

When I try the replacement code, I get this error:

PineconeApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'text/plain; charset=utf-8', 'access-control-allow-origin': '*', 'vary': 'origin,access-control-request-method,access-control-request-headers', 'access-control-expose-headers': '*', 'X-Cloud-Trace-Context': 'a4dd30976b4a3a292eb49f325c69da3b', 'Date': 'Sun, 28 Jan 2024 22:21:38 GMT', 'Server': 'Google Frontend', 'Content-Length': '125', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: {"error":{"code":"INVALID_ARGUMENT","message":"Name must consist of lower case alphanumeric characters or '-'"},"status":400}

Do you have an updated working notebook?
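For reference, the 400 INVALID_ARGUMENT here is about the index name itself: per the error message, Pinecone only accepts lowercase alphanumeric characters and hyphens, so 'my_index' (with an underscore) is rejected while 'my-index' would pass. A quick check of that rule (the helper name is illustrative):

```python
import re

def is_valid_pinecone_index_name(name: str) -> bool:
    """Index names must consist of lowercase alphanumerics or '-'."""
    return re.fullmatch(r"[a-z0-9-]+", name) is not None
```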

Not working anymore with openai v>1

Even if I explicitly install openai-0.28.1, I get:

APIRemovedInV1:

You tried to access openai.Embedding, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.

You probably want to update the course on O'Reilly as well; otherwise the code does not work.
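For anyone hitting this: in openai >= 1.0 the module-level openai.Embedding API was removed in favor of a client object. A minimal sketch of the replacement call (the helper name is illustrative; the model name follows the older notebooks):

```python
def embed_v1(text, model="text-embedding-ada-002", client=None):
    """openai>=1.0 replacement for the removed openai.Embedding.create call."""
    if client is None:
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding
```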
