
Quick Start Guide to Large Language Models

Get your copy today and please leave a rating/review to tell me what you thought! ⭐⭐⭐⭐⭐

Welcome to the GitHub repository for the "Quick Start Guide to Large Language Models" book. This repository contains the code snippets and notebooks used in the book, demonstrating various applications of Transformer models.

Repository Structure

Directories

  • notebooks: This directory contains Jupyter notebooks for each chapter in the book.
  • data: Contains the datasets used in the notebooks.
  • images: Contains images and graphs used in the notebooks.

Notebooks

Here are some of the notebooks included in the notebooks directory:

Part I - Introduction to Large Language Models

  • 2_semantic_search.ipynb: An introduction to semantic search using OpenAI and open-source models.
    • I have an updated version here with the updated OpenAI client usage plus the latest V3 OpenAI embeddings. Spoiler alert: the open-source embedder plus a fine-tuned cross-encoder beat even the largest OpenAI embedder :)
  • 3_prompt_engineering.ipynb: A guide to effective prompt engineering for instruction-aligned LLMs.
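The updated client usage mentioned above can be sketched roughly as follows. This is a hedged illustration of the OpenAI >= 1.0 embeddings call, not the notebook's exact code; the `embed` helper and the model name are assumptions.

```python
def embed(texts, model="text-embedding-3-large", client=None):
    """Return one embedding vector (list of floats) per input string."""
    if client is None:
        from openai import OpenAI  # pip install "openai>=1.0"
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]
```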

Part II - Getting the Most Out of LLMs

Part III - Advanced LLM Usage

We will continue to add more notebooks exploring topics like fine-tuning, advanced prompt engineering, combining transformers, and various use cases. Stay tuned!

How to Use

To use this repository, clone it to your local machine, navigate to the notebooks directory, and open the Jupyter notebook of your choice. Note that some notebooks may require specific datasets, which can be found in the data directory.

Please ensure that you have the necessary libraries installed and that they are up to date. This can usually be done by running pip install -r requirements.txt in the terminal.
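Concretely, the steps above look roughly like this (the repository URL comes from this page; the final notebook command is one common way to launch Jupyter, not a prescribed one):

```shell
git clone https://github.com/sinanuozdemir/quick-start-guide-to-llms.git
cd quick-start-guide-to-llms
pip install -r requirements.txt   # install/refresh the required libraries
jupyter notebook notebooks/       # then open the notebook of your choice
```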

Contributing

Contributions are welcome! Feel free to submit a pull request if you have any additions, corrections, or enhancements.

Disclaimer

This repository is for educational purposes and is meant to accompany the "Quick Start Guide to Large Language Models" book. Please refer to the book for in-depth explanations and discussions of the topics covered in the notebooks.

More From Sinan

  1. Check out Sinan's Newsletter AI Office Hours for more AI/LLM content!
  2. Sinan has a podcast called Practically Intelligent where he chats about the latest and greatest in AI!
  3. Follow the Getting Started with Data, LLMs and ChatGPT Playlist on O'Reilly for a curated list of Sinan's work!


quick-start-guide-to-llms's Issues

Chapter 3: Is "get_best_result_from_pinecone()" implementation correct?

Hi,

This function posts the payload to "https://information-retrieval-hiaa.onrender.com/document/retrieve" and does not interact with Pinecone at all. Figure 3.12 and the relevant sections say the result is retrieved from the vector database, but I can't see anything like that in the Chapter 3 notebook.

Please help me understand.

import json
import requests

def get_best_result_from_pinecone(query, namespace=NAMESPACE):
    payload = json.dumps({
      "num_results": 2,
      "query": query,
      "re_ranking_strategy": "none",
      "namespace": namespace
    })

    response = requests.post(
        "https://information-retrieval-hiaa.onrender.com/document/retrieve",
        data=payload
    )
    # note: as written in the notebook, the function never returns the response
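For comparison, a direct Pinecone lookup with the v3 client would look roughly like this. The index name, helper name, and `top_k` are assumptions for illustration, not the book's code:

```python
import os

def query_pinecone(query_embedding, namespace, index=None, top_k=2):
    """Retrieve the closest matches for an embedding from a Pinecone index."""
    if index is None:
        from pinecone import Pinecone  # pip install pinecone-client
        pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
        index = pc.Index("semantic-search")  # assumed index name
    result = index.query(vector=query_embedding, top_k=top_k,
                         namespace=namespace, include_metadata=True)
    return result["matches"]
```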

Chapters 3-4 mention FastAPI

Hi Sinan,
Thanks for the amazing book! Any chance you could add the FastAPI code that is mentioned, and show how it all interacts, as described in Chapters 2-4?
[Screenshot 2023-09-14 at 10:07 PM]

Looking forward to hearing from you.
Thanks,
Andy

Chapter 2: Semantic Search Pinecone index.upsert() throws 'Unable to prepare type Embedding for serialization' exception

Source: https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/semantic-search-fastapi/api.py

document_ingest()

Line 132: upserted_count = index.upsert(pinecone_request, namespace=request.namespace).get('upserted_count') throws following error:

ApiValueError: Unable to prepare type Embedding for serialization

DEBUG:openai._base_client:Request options: {'method': 'post', 'url': '/embeddings', 'files': None, 'post_parser': <function Embeddings.create.<locals>.parser at 0x79339f965bd0>, 'json_data': {'input': ['hi'], 'model': 'text-embedding-ada-002', 'encoding_format': 'base64'}}
DEBUG:httpcore.connection:close.started
DEBUG:httpcore.connection:close.complete
DEBUG:httpcore.connection:connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=5.0 socket_options=None
DEBUG:httpcore.connection:connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x79339f95d960>
DEBUG:httpcore.connection:start_tls.started ssl_context=<ssl.SSLContext object at 0x7933b233c4c0> server_hostname='api.openai.com' timeout=5.0
DEBUG:httpcore.connection:start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x79339f95f520>
DEBUG:httpcore.http11:send_request_headers.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_headers.complete
DEBUG:httpcore.http11:send_request_body.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_body.complete
DEBUG:httpcore.http11:receive_response_headers.started request=<Request [b'POST']>
['hi']
DEBUG:httpcore.http11:receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Date', b'Tue, 28 Nov 2023 03:52:51 GMT'), (b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'access-control-allow-origin', b'*'), (b'openai-organization', b'user-bfcfnnkwlm7ogyjwm7tngl6y'), (b'openai-processing-ms', b'28'), (b'openai-version', b'2020-10-01'), (b'strict-transport-security', b'max-age=15724800; includeSubDomains'), (b'x-ratelimit-limit-requests', b'3000'), (b'x-ratelimit-remaining-requests', b'2999'), (b'x-ratelimit-reset-requests', b'20ms'), (b'x-request-id', b'e6d41a58a9769205cd239d0eb103d47c'), (b'CF-Cache-Status', b'DYNAMIC'), (b'Server', b'cloudflare'), (b'CF-RAY', b'82cfa917eb7f1125-ORD'), (b'Content-Encoding', b'gzip'), (b'alt-svc', b'h3=":443"; ma=86400')])
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
DEBUG:httpcore.http11:receive_response_body.started request=<Request [b'POST']>
DEBUG:httpcore.http11:receive_response_body.complete
DEBUG:httpcore.http11:response_closed.started
DEBUG:httpcore.http11:response_closed.complete
DEBUG:openai._base_client:HTTP Request: POST https://api.openai.com/v1/embeddings "200 OK"
---------------------------------------------------------------------------
ApiValueError                             Traceback (most recent call last)
[<ipython-input-37-56d43383d63d>](https://d00c0viy2zf-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab_20231122-060141_RC00_584575710#) in <cell line: 2>()


19 frames
[/content/notebooks/pinecone/core/client/api_client.py](https://d00c0viy2zf-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab_20231122-060141_RC00_584575710#) in sanitize_for_serialization(cls, obj)
    290         if isinstance(obj, dict):
    291             return {key: cls.sanitize_for_serialization(val) for key, val in obj.items()}
--> 292         raise ApiValueError('Unable to prepare type {} for serialization'.format(obj.__class__.__name__))
    293 
    294     def deserialize(self, response, response_type, _check_type):

ApiValueError: Unable to prepare type Embedding for serialization
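The traceback suggests the openai >= 1.0 `Embedding` response objects were passed straight to `index.upsert()`, which the Pinecone client cannot JSON-serialize. Converting them to plain float lists first is the usual fix; this is a sketch (the helper name is illustrative, the field names follow the openai v1 response shape):

```python
def to_pinecone_vectors(openai_response, ids, metadatas):
    """Turn an openai>=1.0 embeddings response into JSON-serializable dicts
    that index.upsert() accepts, instead of raw Embedding objects."""
    return [
        {"id": _id, "values": list(item.embedding), "metadata": meta}
        for _id, item, meta in zip(ids, openai_response.data, metadatas)
    ]
```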

Chapter 3: openai function is not fully shown

import openai  # legacy openai<1.0 client

# The book omits the enclosing function definition; a plausible signature
# (inferred, not verbatim from the book) would be:
def complete(prompt, model='text-davinci-003', suppress=False, **kwargs):
    response = openai.Completion.create(
      model=model,
      prompt=prompt,
      max_tokens=256,
      **kwargs
    )
    answer = response.choices[0].text
    if not suppress:
        print(f'PROMPT:\n------\n{prompt}\n------\nRESPONSE\n------\n{answer}')
    else:
        return answer

Chapter 7: name 'decoder_tokenizer' is not defined

Hey Sinan,
I downloaded the data from Kaggle, and while running the code:

import json
from collections import defaultdict
from tqdm import tqdm
from datasets import Dataset

# Function to load VQA data from the given annotation and question files
def load_vqa_data(annotations_file, questions_file, images_folder, start_at=None, end_at=None, max_images=None, max_questions=None):
    # Load the annotations and questions JSON files
    with open(annotations_file, "r") as f:
        annotations_data = json.load(f)
    with open(questions_file, "r") as f:
        questions_data = json.load(f)

    data = []
    images_used = defaultdict(int)

    # Create a dictionary to map question_id to the annotation data
    annotations_dict = {annotation["question_id"]: annotation for annotation in annotations_data["annotations"]}

    # Iterate through questions in the specified range
    for question in tqdm(questions_data["questions"][start_at:end_at]):
        # Assuming that "image_id" is available in the question data
        image_id = question.get("image_id", None)  # Replace with the actual code to get the image_id

        # Check if the image file exists and has not reached the max_questions limit
        # Add your image-checking logic here

        # Assuming that "annotation" is available in the annotation data
        annotation = annotations_dict.get(question["question_id"], {})  # Replace with the actual code to get the annotation

        # Assuming that other variables like "decoder_tokenizer" and "all_answers" are defined elsewhere in your code

        # Add the data as a dictionary
        data.append({
            "image_id": image_id,
            "question_id": question["question_id"],
            "question": question["question"],
            "answer": decoder_tokenizer.bos_token + ' ' + annotation.get("multiple_choice_answer", "") + decoder_tokenizer.eos_token,
            "all_answers": all_answers,  # You should define this variable elsewhere
            "image": image,  # You should define this variable elsewhere
        })

        # Break the loop if the max_images limit is reached
        # Add your max_images-checking logic here

    return data

# Load training and validation VQA data
train_data = load_vqa_data(
    "/content/data/v2_Annotations_Train_mscoco/v2_mscoco_train2014_annotations.json",
    "/content/data/v2_Questions_Train_mscoco/v2_OpenEnded_mscoco_train2014_questions.json",
    "/content/data/train2014"
)

val_data = load_vqa_data(
    "/content/data/v2_Annotations_Val_mscoco/v2_mscoco_val2014_annotations.json",
    "/content/data/v2_Questions_Val_mscoco/v2_OpenEnded_mscoco_val2014_questions.json",
    "/content/data/val2014"
)

# Create Hugging Face datasets
train_dataset = Dataset.from_dict({key: [item[key] for item in train_data] for key in train_data[0].keys()})

# Optionally save the dataset to disk for later retrieval
train_dataset.save_to_disk("vqa_train_dataset")

# Create Hugging Face datasets for validation
val_dataset = Dataset.from_dict({key: [item[key] for item in val_data] for key in val_data[0].keys()})

# Optionally save the dataset to disk for later retrieval
val_dataset.save_to_disk("vqa_val_dataset")

ERROR:

0%|          | 0/443757 [00:00<?, ?it/s]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
[<ipython-input-17-be70899caa09>](https://localhost:8080/#) in <cell line: 49>()
     47 
     48 # Load training and validation VQA data
---> 49 train_data = load_vqa_data(
     50     "/content/data/v2_Annotations_Train_mscoco/v2_mscoco_train2014_annotations.json",
     51     "/content/data/v2_Questions_Train_mscoco/v2_OpenEnded_mscoco_train2014_questions.json",

[<ipython-input-17-be70899caa09>](https://localhost:8080/#) in load_vqa_data(annotations_file, questions_file, images_folder, start_at, end_at, max_images, max_questions)
     36             "question_id": question["question_id"],
     37             "question": question["question"],
---> 38             "answer": decoder_tokenizer.bos_token + ' ' + annotation.get("multiple_choice_answer", "") + decoder_tokenizer.eos_token,
     39             "all_answers": all_answers,  # You should define this variable elsewhere
     40             "image": image,  # You should define this variable elsewhere

NameError: name 'decoder_tokenizer' is not defined

Remove env file from semantic-search-fastapi folder.

Noticed that the .env file in the semantic-search-fastapi folder was checked into the repository.
You probably need to update your .gitignore file, remove the file, and regenerate your API keys
to ensure the leaked ones are deactivated.

Chapter 2 Notebook errors

Hi Sinan,

Trying to run your Chapter 2 notebook. When I get to

import pinecone

pinecone.init(api_key=pinecone_key, environment="us-west1-gcp")

it issues an error that "init is no longer a top-level attribute of the pinecone package." Then it gives the following replacement code:

    import os
    from pinecone import Pinecone, ServerlessSpec

    pc = Pinecone(
        api_key=os.environ.get("PINECONE_API_KEY")
    )

    # Now do stuff
    if 'my_index' not in pc.list_indexes().names():
        pc.create_index(
            name='my_index', 
            dimension=1536, 
            metric='euclidean',
            spec=ServerlessSpec(
                cloud='aws',
                region='us-west-2'
            )
        )

When I try the replacement code, I get this error:

PineconeApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'text/plain; charset=utf-8', 'access-control-allow-origin': '*', 'vary': 'origin,access-control-request-method,access-control-request-headers', 'access-control-expose-headers': '*', 'X-Cloud-Trace-Context': 'a4dd30976b4a3a292eb49f325c69da3b', 'Date': 'Sun, 28 Jan 2024 22:21:38 GMT', 'Server': 'Google Frontend', 'Content-Length': '125', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: {"error":{"code":"INVALID_ARGUMENT","message":"Name must consist of lower case alphanumeric characters or '-'"},"status":400}

Do you have an updated working notebook?
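For reference, the 400 INVALID_ARGUMENT here is about the index name itself: per the error message, Pinecone only accepts lowercase alphanumeric characters and hyphens, so 'my_index' (with an underscore) is rejected while 'my-index' would pass. A quick check of that rule (the helper name is illustrative):

```python
import re

def is_valid_pinecone_index_name(name: str) -> bool:
    """Index names must consist of lowercase alphanumerics or '-'."""
    return re.fullmatch(r"[a-z0-9-]+", name) is not None
```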

Not working anymore with openai v>1

Even if I explicitly install openai-0.28.1, I get:

APIRemovedInV1:

You tried to access openai.Embedding, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.

You probably want to update the course on O'Reilly as well; otherwise the code does not work.
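For anyone hitting this: in openai >= 1.0 the module-level openai.Embedding API was removed in favor of a client object. A minimal sketch of the replacement call (the helper name is illustrative; the model name follows the older notebooks):

```python
def embed_v1(text, model="text-embedding-ada-002", client=None):
    """openai>=1.0 replacement for the removed openai.Embedding.create call."""
    if client is None:
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding
```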
