A wise brachiosaurus (GPT-2 fine-tuned on quotes)

Home Page: https://myrtle.jstet.net

Myrtle: a wise brachiosaurus

Just a fun project to teach myself a bit about training and deploying LLMs. Myrtle is a GPT-2 model fine-tuned on quotes by notable non-male people. Find Myrtle here. She only speaks once every 5 seconds.

/----------------------------------------------------------------------------\
| The truth is that we are all flawed, and it takes courage to be ourselves. |
\----------------------------------------------------------------------------/
      \
       \
        \

    
         _
       .~O`,
      {__,  \
          \' \
           \  \
            \  \
             \  `._            __.__
              \    ~-._  _.==~~     ~~--.._
               \        '                  ~-.
                \      _-   -_                `.
                 \    /       }        .-    .  \
                  `. |      /  }      (       ;  \
                    `|     /  /       (       :   '\
                     \    |  /        |      /       \
                      |     /`-.______.\     |~-.      \
                      |   |/           (     |   `.      \_
                      |   ||            ~\   \      '._    `-.._____..----..___
                      |   |/             _\   \         ~-.__________.-~~~~~~~~~'''
                    .o'___/            .o______}
    

Background

This project originated in the CorrelAid Slack workspace. A colleague, Frie, started a thread recommending terminal tools that are useless but spark joy, for example lolcat combined with cowsay. Cowsay draws a cow in your terminal that says whatever text you give it. From that I got the idea of randomizing what the cow says and found fortune, a program that prints random quotes from a local database.

However, when trying fortune, I found some quotes to be pretty weird and even sexist. People have observed this before (read this thread). My next idea was to use an API to retrieve a random quote by a famous person from the internet. While working on a bash script to combine this API with cowsay, I researched cowsay flags and cow alternatives and again found some weird stuff (take a look yourself if you are interested). That's why I decided to use a wholesome, child-friendly Python package called dinosay instead. The result was a bash script that uses a random quote from the internet as input for dinosay.

Not satisfied with this, I thought it would be funny to train an LLM to generate quotes that sound realistic. I wanted to learn how to do this anyway, so that's what I did. As training data, I used a large dataset of quotes that I originally found on GitHub and that was built for a paper (see below). As there are enough quotes by men circulating on the internet and LLMs tend to be gender biased, I tried to only use quotes by non-male people (I know that it's not that easy). I also like to imagine Myrtle as a wise old female dinosaur. As you can't detect a person's gender from their name, I generated a list of names of notable non-male people and only used quotes by authors whose names appear in that list. I used this data to fine-tune a GPT-2 model, because it's small enough to not cost that much to run (no GPU necessary for inference) but good enough to sound realistic. Processing and training were done remotely with Modal.
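The name-based filtering step could look roughly like the sketch below. This is not the code used in this project, just a minimal illustration assuming the quotes are dicts with `author` and `text` keys. The function names, the record layout, and the placeholder authors are all made up for the example.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so name comparisons are less brittle."""
    return " ".join(name.lower().split())


def filter_quotes(quotes, allowed_authors):
    """Keep only the quotes whose author appears in the allow-list of names."""
    allowed = {normalize(name) for name in allowed_authors}
    return [q for q in quotes if normalize(q["author"]) in allowed]


# Placeholder records (hypothetical, not taken from the real dataset).
quotes = [
    {"author": "Author A", "text": "First placeholder quote."},
    {"author": "author  b", "text": "Second placeholder quote."},
    {"author": "Author C", "text": "Third placeholder quote."},
]
allowed_authors = ["Author A", "Author B"]

kept = filter_quotes(quotes, allowed_authors)
for quote in kept:
    print(quote["author"], "-", quote["text"])
```

Exact matching on normalized names is the simplest option. It misses spelling variants and cannot tell apart two people with the same name, which is one reason the filtering is only an approximation.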

This worked relatively well, and I was surprised and had to laugh about how pseudo-wise some quotes sound. Try it yourself :)

Usage in a terminal

I wrote a simple bash script to bring Myrtle's wisdom to your terminal.

Disclaimer

As Myrtle is a fine-tuned GPT-2 model, it is similarly biased. Information about this here. Please bear in mind that I am not the one writing the quotes. While I tried to moderate the model output somewhat using various tools, including a list of forbidden words, I can't guarantee non-offensive text output.
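A forbidden-word check of the kind mentioned above might be sketched as follows. This is only an illustration and not the moderation code used here: the word list contains placeholders, and `contains_forbidden` and `first_acceptable` are hypothetical names.

```python
import re

# Hypothetical placeholder entries; a real list would be much longer.
FORBIDDEN_WORDS = {"badword", "worseword"}


def contains_forbidden(text, forbidden=FORBIDDEN_WORDS):
    """Return True if any whole word in the text is on the forbidden list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(token in forbidden for token in tokens)


def first_acceptable(candidates, forbidden=FORBIDDEN_WORDS):
    """Return the first candidate without a forbidden word, or None if all fail."""
    for candidate in candidates:
        if not contains_forbidden(candidate, forbidden):
            return candidate
    return None


candidates = [
    "This one contains a BADWORD in it.",
    "This one is harmless.",
]
print(first_acceptable(candidates))
```

Whole-word matching avoids flagging harmless words that merely contain a forbidden substring, but it also lets through inflected or deliberately misspelled forms. That is one reason a word list alone cannot guarantee inoffensive output.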

Data Sources

  • Laouenan, M., Bhargava, P., Eyméoud, J.B., Gergaud, O., Plique, G., & Wasmer, E. (2022). A cross-verified database of notable people, 3500BC-2018AD. Scientific Data, 9(1), 290.
  • Goel, S., Madhok, R., & Garg, S. (2018). Proposing Contextually Relevant Quotes for Images. Advances in Information Retrieval. Springer. doi: 10.1007/978-3-319-76941-7_49
