
PostgresML

Generative AI and Simple ML with PostgreSQL



Introduction

PostgresML is a machine learning extension for PostgreSQL that enables you to perform training and inference on text and tabular data using SQL queries. With PostgresML, you can seamlessly integrate machine learning models into your PostgreSQL database and harness the power of cutting-edge algorithms to process data efficiently.

Text Data

  • Perform natural language processing (NLP) tasks like sentiment analysis, question answering, translation, summarization and text generation
  • Access thousands of state-of-the-art language models like GPT-2, GPT-J and GPT-Neo from the 🤗 Hugging Face model hub
  • Fine-tune large language models (LLMs) on your own text data for different tasks
  • Use your existing PostgreSQL database as a vector database by generating embeddings from text stored in the database

Translation

SQL query

SELECT pgml.transform(
    'translation_en_to_fr',
    inputs => ARRAY[
        'Welcome to the future!',
        'Where have you been all this time?'
    ]
) AS french;

Result

                         french                                 
------------------------------------------------------------

[
    {"translation_text": "Bienvenue Γ  l'avenir!"},
    {"translation_text": "OΓΉ Γͺtes-vous allΓ© tout ce temps?"}
]

Sentiment Analysis

SQL query

SELECT pgml.transform(
    task   => 'text-classification',
    inputs => ARRAY[
        'I love how amazingly simple ML has become!', 
        'I hate doing mundane and thankless tasks. ☹️'
    ]
) AS positivity;

Result

                    positivity
------------------------------------------------------
[
    {"label": "POSITIVE", "score": 0.9995759129524232}, 
    {"label": "NEGATIVE", "score": 0.9903519749641418}
]

Tabular data

Training a classification model

Training

SELECT * FROM pgml.train(
    'My Classification Project',
    task => 'classification',
    relation_name => 'pgml.digits',
    y_column_name => 'target',
    algorithm => 'xgboost'
);

Inference

SELECT pgml.predict(
    'My Classification Project', 
    ARRAY[0.1, 2.0, 5.0]
) AS prediction;
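
pgml.predict also works in batch over the rows of a relation. Here is a quick sketch, assuming the pgml.digits demo table referenced during training exposes image (the feature vector) and target columns:

-- Batch inference over the training relation (column names assumed from the demo dataset)
SELECT target, pgml.predict('My Classification Project', image) AS prediction
FROM pgml.digits
LIMIT 5;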

Installation

PostgresML installation consists of three parts: a PostgreSQL database, a Postgres extension for machine learning, and a dashboard app. The extension provides all the machine learning functionality and can be used independently from any SQL IDE. The dashboard app provides an easy-to-use interface for writing SQL notebooks, and for running and tracking ML experiments and models.

Serverless Cloud

If you want to check out the functionality without the hassle of Docker, sign up for a free PostgresML account. You'll get a free database in seconds, with access to GPUs and state-of-the-art LLMs.

Docker

docker run \
    -it \
    -v postgresml_data:/var/lib/postgresql \
    -p 5433:5432 \
    -p 8000:8000 \
    ghcr.io/postgresml/postgresml:2.7.12 \
    sudo -u postgresml psql -d postgresml

For more details, take a look at our Quick Start with Docker documentation.

Getting Started

  • On the cloud console, click on the Dashboard button to connect to your instance with a SQL notebook, or connect directly with any standard PostgreSQL client.
  • On a local installation, go to the dashboard app at http://localhost:8000/ to use SQL notebooks.

NLP Tasks

PostgresML integrates 🤗 Hugging Face Transformers to bring state-of-the-art NLP models into the data layer. There are tens of thousands of pre-trained models with pipelines to turn raw text in your database into useful results. Many state-of-the-art deep learning architectures have been published and made available on the Hugging Face model hub.

You can call different NLP tasks and customize them using the following SQL query.

SELECT pgml.transform(
    task   => TEXT OR JSONB,     -- Pipeline initializer arguments
    inputs => TEXT[] OR BYTEA[], -- inputs for inference
    args   => JSONB              -- (optional) arguments to the pipeline.
)

Text Classification

Text classification involves assigning a label or category to a given text. Common use cases include sentiment analysis, natural language inference, and the assessment of grammatical correctness.


Sentiment Analysis

Sentiment analysis is a type of natural language processing technique that involves analyzing a piece of text to determine the sentiment or emotion expressed within it. It can be used to classify a text as positive, negative, or neutral, and has a wide range of applications in fields such as marketing, customer service, and political analysis.

Basic usage

SELECT pgml.transform(
    task   => 'text-classification',
    inputs => ARRAY[
        'I love how amazingly simple ML has become!', 
        'I hate doing mundane and thankless tasks. ☹️'
    ]
) AS positivity;

Result

[
    {"label": "POSITIVE", "score": 0.9995759129524232}, 
    {"label": "NEGATIVE", "score": 0.9903519749641418}
]

The default model used for text classification is a fine-tuned version of DistilBERT-base-uncased that has been specifically optimized for the Stanford Sentiment Treebank dataset (sst2).

Using a specific model

To use one of the over 19,000 models available on Hugging Face, include the name of the desired model and the text-classification task as a JSONB object in the SQL query. For example, if you want to use a RoBERTa model trained on around 40,000 English tweets, with POS (positive), NEG (negative), and NEU (neutral) labels for its classes, include this information in the JSONB object when making your query.

SELECT pgml.transform(
    inputs => ARRAY[
        'I love how amazingly simple ML has become!', 
        'I hate doing mundane and thankless tasks. ☹️'
    ],
    task  => '{"task": "text-classification", 
              "model": "finiteautomata/bertweet-base-sentiment-analysis"
             }'::JSONB
) AS positivity;

Result

[
    {"label": "POS", "score": 0.992932200431826}, 
    {"label": "NEG", "score": 0.975599765777588}
]

Using an industry-specific model

By selecting a model that has been specifically designed for a particular industry, you can achieve more accurate and relevant text classification. An example of such a model is FinBERT, a pre-trained NLP model that has been optimized for analyzing sentiment in financial text. FinBERT was created by training the BERT language model on a large financial corpus, and fine-tuning it to specifically classify financial sentiment. When using FinBERT, the model will provide softmax outputs for three different labels: positive, negative, or neutral.

SELECT pgml.transform(
    inputs => ARRAY[
        'Stocks rallied and the British pound gained.', 
        'Stocks making the biggest moves midday: Nvidia, Palantir and more'
    ],
    task => '{"task": "text-classification", 
              "model": "ProsusAI/finbert"
             }'::JSONB
) AS market_sentiment;

Result

[
    {"label": "positive", "score": 0.8983612656593323}, 
    {"label": "neutral", "score": 0.8062630891799927}
]

Natural Language Inference (NLI)

NLI, or Natural Language Inference, is the task of determining the relationship between two texts. The model takes a premise and a hypothesis as inputs and returns a class, which can be one of three types:

  • Entailment: This means that the hypothesis is true based on the premise.
  • Contradiction: This means that the hypothesis is false based on the premise.
  • Neutral: This means that there is no relationship between the hypothesis and the premise.

The GLUE dataset is the benchmark dataset for evaluating NLI models. There are different variants of NLI models, such as Multi-Genre NLI, Question NLI, and Winograd NLI.

If you want to use an NLI model, you can find them on the 🤗 Hugging Face model hub. Look for models with "mnli".

SELECT pgml.transform(
    inputs => ARRAY[
        'A soccer game with multiple males playing. Some men are playing a sport.'
    ],
    task => '{"task": "text-classification", 
              "model": "roberta-large-mnli"
             }'::JSONB
) AS nli;

Result

[
    {"label": "ENTAILMENT", "score": 0.98837411403656}
]

Question Natural Language Inference (QNLI)

The QNLI task involves determining whether a given question can be answered by the information in a provided document. If the answer can be found in the document, the label assigned is "entailment". Conversely, if the answer cannot be found in the document, the label assigned is "not entailment".

If you want to use a QNLI model, you can find them on the 🤗 Hugging Face model hub. Look for models with "qnli".

SELECT pgml.transform(
    inputs => ARRAY[
        'Where is the capital of France?, Paris is the capital of France.'
    ],
    task => '{"task": "text-classification", 
              "model": "cross-encoder/qnli-electra-base"
             }'::JSONB
) AS qnli;

Result

[
    {"label": "LABEL_0", "score": 0.9978110194206238}
]

Quora Question Pairs (QQP)

The Quora Question Pairs model is designed to evaluate whether two given questions are paraphrases of each other. This model takes the two questions and assigns a binary value as output. LABEL_0 indicates that the questions are paraphrases of each other and LABEL_1 indicates that the questions are not paraphrases. The benchmark dataset used for this task is the Quora Question Pairs dataset within the GLUE benchmark, which contains a collection of question pairs and their corresponding labels.

If you want to use a QQP model, you can find them on the 🤗 Hugging Face model hub. Look for models with "qqp".

SELECT pgml.transform(
    inputs => ARRAY[
        'Which city is the capital of France?, Where is the capital of France?'
    ],
    task => '{"task": "text-classification", 
              "model": "textattack/bert-base-uncased-QQP"
             }'::JSONB
) AS qqp;

Result

[
    {"label": "LABEL_0", "score": 0.9988721013069152}
]

Grammatical Correctness

Linguistic Acceptability is a task that involves evaluating the grammatical correctness of a sentence. The model used for this task assigns one of two classes to the sentence, either "acceptable" or "unacceptable". LABEL_0 indicates acceptable and LABEL_1 indicates unacceptable. The benchmark dataset used for training and evaluating models for this task is the Corpus of Linguistic Acceptability (CoLA), which consists of a collection of texts along with their corresponding labels.

If you want to use a grammatical correctness model, you can find them on the 🤗 Hugging Face model hub. Look for models with "cola".

SELECT pgml.transform(
    inputs => ARRAY[
        'I will walk to home when I went through the bus.'
    ],
    task => '{"task": "text-classification", 
              "model": "textattack/distilbert-base-uncased-CoLA"
             }'::JSONB
) AS grammatical_correctness;

Result

[
    {"label": "LABEL_1", "score": 0.9576480388641356}
]

Zero-Shot Classification

Zero Shot Classification is a task where the model predicts a class that it hasn't seen during the training phase. This task leverages a pre-trained language model and is a type of transfer learning. Transfer learning involves using a model that was initially trained for one task in a different application. Zero Shot Classification is especially helpful when there is a scarcity of labeled data available for the specific task at hand.


In the example provided below, we will demonstrate how to classify a given sentence into a class that the model has not encountered before. To achieve this, we make use of args in the SQL query, which allows us to provide candidate_labels. You can customize these labels to suit the context of your task. We will use the facebook/bart-large-mnli model.

Look for models with "mnli" to use a zero-shot classification model on the 🤗 Hugging Face model hub.

SELECT pgml.transform(
    inputs => ARRAY[
        'I have a problem with my iphone that needs to be resolved asap!!'
    ],
    task => '{
                "task": "zero-shot-classification", 
                "model": "facebook/bart-large-mnli"
             }'::JSONB,
    args => '{
                "candidate_labels": ["urgent", "not urgent", "phone", "tablet", "computer"]
             }'::JSONB
) AS zero_shot;

Result

[
    {
        "labels": ["urgent", "phone", "computer", "not urgent", "tablet"], 
        "scores": [0.503635, 0.47879, 0.012600, 0.002655, 0.002308], 
        "sequence": "I have a problem with my iphone that needs to be resolved asap!!"
    }
]

Token Classification

Token classification is a task in natural language understanding, where labels are assigned to certain tokens in a text. Some popular subtasks of token classification include Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. NER models can be trained to identify specific entities in a text, such as individuals, places, and dates. PoS tagging, on the other hand, is used to identify the different parts of speech in a text, such as nouns, verbs, and punctuation marks.


Named Entity Recognition

Named Entity Recognition (NER) is a task that involves identifying named entities in a text. These entities can include the names of people, locations, or organizations. The task is completed by labeling each token with a class for each named entity and a class named "O" for tokens that don't contain any entities. In this task, the input is text, and the output is the annotated text with named entities.

SELECT pgml.transform(
    inputs => ARRAY[
        'I am Omar and I live in New York City.'
    ],
    task => 'token-classification'
) as ner;

Result

[[
    {"end": 9,  "word": "Omar", "index": 3,  "score": 0.997110, "start": 5,  "entity": "I-PER"}, 
    {"end": 27, "word": "New",  "index": 8,  "score": 0.999372, "start": 24, "entity": "I-LOC"}, 
    {"end": 32, "word": "York", "index": 9,  "score": 0.999355, "start": 28, "entity": "I-LOC"}, 
    {"end": 37, "word": "City", "index": 10, "score": 0.999431, "start": 33, "entity": "I-LOC"}
]]

Part-of-Speech (PoS) Tagging

PoS tagging is a task that involves identifying the parts of speech, such as nouns, pronouns, adjectives, or verbs, in a given text. In this task, the model labels each word with a specific part of speech.

Look for models with "pos" to use a part-of-speech tagging model on the 🤗 Hugging Face model hub.

SELECT pgml.transform(
    inputs => ARRAY[
        'I live in Amsterdam.'
    ],
    task => '{"task": "token-classification", 
              "model": "vblagoje/bert-english-uncased-finetuned-pos"
    }'::JSONB
) AS pos;

Result

[[
    {"end": 1,  "word": "i",         "index": 1, "score": 0.999, "start": 0,  "entity": "PRON"},
    {"end": 6,  "word": "live",      "index": 2, "score": 0.998, "start": 2,  "entity": "VERB"},
    {"end": 9,  "word": "in",        "index": 3, "score": 0.999, "start": 7,  "entity": "ADP"},
    {"end": 19, "word": "amsterdam", "index": 4, "score": 0.998, "start": 10, "entity": "PROPN"}, 
    {"end": 20, "word": ".",         "index": 5, "score": 0.999, "start": 19, "entity": "PUNCT"}
]]

Translation

Translation is the task of converting text written in one language into another language.


You have the option to select from over 2000 models available on the Hugging Face hub for translation.

SELECT pgml.transform(
    inputs => ARRAY[
        'How are you?'
    ],
    task => '{"task": "translation", 
              "model": "Helsinki-NLP/opus-mt-en-fr"
    }'::JSONB
);

Result

[
    {"translation_text": "Comment allez-vous ?"}
]

Summarization

Summarization involves creating a condensed version of a document that includes the important information while reducing its length. Different models can be used for this task, with some models extracting the most relevant text from the original document, while other models generate completely new text that captures the essence of the original content.


SELECT pgml.transform(
    task => '{"task": "summarization", 
              "model": "sshleifer/distilbart-cnn-12-6"
    }'::JSONB,
    inputs => ARRAY[
        'Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles). The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.'
    ]
);

Result

[
    {"summary_text": " Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018 . The city is the centre and seat of government of the region and province of Île-de-France, or Paris Region . Paris Region has an estimated 18 percent of the population of France as of 2017 ."}
]

You can control the length of summary_text by passing min_length and max_length as arguments to the SQL query.

SELECT pgml.transform(
    task => '{"task": "summarization", 
              "model": "sshleifer/distilbart-cnn-12-6"
    }'::JSONB,
    inputs => ARRAY[
        'Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles). The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.'
    ],
    args => '{
        "min_length" : 20,
        "max_length" : 70
    }'::JSONB
);

Result

[
    {"summary_text": " Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018 . City of Paris is centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated 12,174,880, or about 18 percent"}
]

Question Answering

Question Answering models are designed to retrieve the answer to a question from a given text, which can be particularly useful for searching for information within a document. It's worth noting that some question answering models are capable of generating answers even without any contextual information.


SELECT pgml.transform(
    'question-answering',
    inputs => ARRAY[
        '{
            "question": "Where do I live?",
            "context": "My name is Merve and I live in Δ°stanbul."
        }'
    ]
) AS answer;

Result

{
    "end"   :  39, 
    "score" :  0.9538117051124572, 
    "start" :  31, 
    "answer": "Δ°stanbul"
}

Text Generation

Text generation is the task of producing new text, such as filling in incomplete sentences or paraphrasing existing text. It has various use cases, including code generation and story generation. Completion generation models can predict the next word in a text sequence, while text-to-text generation models are trained to learn the mapping between pairs of texts, such as translating between languages. Popular models for text generation include GPT-based models, T5, T0, and BART. These models can be trained to accomplish a wide range of tasks, including text classification, summarization, and translation.


SELECT pgml.transform(
    task => 'text-generation',
    inputs => ARRAY[
        'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone'
    ]
) AS answer;

Result

[
    [
        {"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, and eight for the Dragon-lords in their halls of blood.\n\nEach of the guild-building systems is one-man"}
    ]
]

To use a specific model from the 🤗 model hub, pass the model name along with the task name in task.

SELECT pgml.transform(
    task => '{
        "task" : "text-generation",
        "model" : "gpt2-medium"
    }'::JSONB,
    inputs => ARRAY[
        'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone'
    ]
) AS answer;

Result

[
    [{"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone.\n\nThis place has a deep connection to the lore of ancient Elven civilization. It is home to the most ancient of artifacts,"}]
]

To make the generated text longer, you can include the argument max_length and specify the desired maximum length of the text.

SELECT pgml.transform(
    task => '{
        "task" : "text-generation",
        "model" : "gpt2-medium"
    }'::JSONB,
    inputs => ARRAY[
        'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone'
    ],
    args => '{
        "max_length" : 200
    }'::JSONB
) AS answer;

Result

[
    [{"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, Three for the Dwarfs and the Elves, One for the Gnomes of the Mines, and Two for the Elves of Dross.\"\n\nHobbits: The Fellowship is the first book of J.R.R. Tolkien's story-cycle, and began with his second novel - The Two Towers - and ends in The Lord of the Rings.\n\n\nIt is a non-fiction novel, so there is no copyright claim on some parts of the story but the actual text of the book is copyrighted by author J.R.R. Tolkien.\n\n\nThe book has been classified into two types: fantasy novels and children's books\n\nHobbits: The Fellowship is the first book of J.R.R. Tolkien's story-cycle, and began with his second novel - The Two Towers - and ends in The Lord of the Rings.It"}]
]

If you want the model to generate more than one output, you can specify the number of desired output sequences by including the argument num_return_sequences in the arguments.

SELECT pgml.transform(
    task => '{
        "task" : "text-generation",
        "model" : "gpt2-medium"
    }'::JSONB,
    inputs => ARRAY[
        'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone'
    ],
    args => '{
        "num_return_sequences" : 3
    }'::JSONB
) AS answer;

Result

[
    [
        {"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, and Thirteen for the human-men in their hall of fire.\n\nAll of us, our families, and our people"}, 
        {"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, and the tenth for a King! As each of these has its own special story, so I have written them into the game."}, 
        {"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone… What's left in the end is your heart's desire after all!\n\nHans: (Trying to be brave)"}
    ]
]

Text generation typically utilizes a greedy search algorithm that selects the word with the highest probability as the next word in the sequence. However, an alternative method called beam search can be used, which aims to minimize the possibility of overlooking hidden high-probability word combinations. Beam search achieves this by retaining the num_beams most likely hypotheses at each step and ultimately selecting the hypothesis with the highest overall probability. We set num_beams > 1 and early_stopping = true so that generation finishes when all beam hypotheses reach the EOS token.

SELECT pgml.transform(
    task => '{
        "task" : "text-generation",
        "model" : "gpt2-medium"
    }'::JSONB,
    inputs => ARRAY[
        'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone'
    ],
    args => '{
        "num_beams" : 5,
        "early_stopping" : true
    }'::JSONB
) AS answer;

Result

[[
    {"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, Nine for the Dwarves in their caverns of ice, Ten for the Elves in their caverns of fire, Eleven for the"}
]]

Sampling methods involve selecting the next word or sequence of words at random from the set of possible candidates, weighted by their probabilities according to the language model. This can result in more diverse and creative text, as well as avoiding repetitive patterns. In its most basic form, sampling means randomly picking the next word $w_t$ according to its conditional probability distribution: $$w_t \sim P(w_t \mid w_{1:t-1})$$

However, the randomness of the sampling method can also result in less coherent or inconsistent text, depending on the quality of the model and the chosen sampling parameters such as temperature, top-k, or top-p. Therefore, choosing an appropriate sampling method and parameters is crucial for achieving the desired balance between creativity and coherence in generated text.

You can pass do_sample = true in the arguments to use sampling methods. It is recommended to alter temperature or top_p but not both.

Temperature

SELECT pgml.transform(
    task => '{
        "task" : "text-generation",
        "model" : "gpt2-medium"
    }'::JSONB,
    inputs => ARRAY[
        'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone'
    ],
    args => '{
        "do_sample" : true,
        "temperature" : 0.9
    }'::JSONB
) AS answer;

Result

[[{"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, and Thirteen for the Giants and Men of S.A.\n\nThe First Seven-Year Time-Traveling Trilogy is"}]]

Top p

SELECT pgml.transform(
    task => '{
        "task" : "text-generation",
        "model" : "gpt2-medium"
    }'::JSONB,
    inputs => ARRAY[
        'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone'
    ],
    args => '{
        "do_sample" : true,
        "top_p" : 0.8
    }'::JSONB
) AS answer;

Result

[[{"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, Four for the Elves of the forests and fields, and Three for the Dwarfs and their warriors.\" ―Lord Rohan [src"}]]

Text-to-Text Generation

Text-to-text generation methods, such as T5, are neural network architectures designed to perform various natural language processing tasks, including summarization, translation, and question answering. T5 is a transformer-based architecture pre-trained on a large corpus of text data using denoising autoencoding. This pre-training process enables the model to learn general language patterns and relationships between different tasks, which can be fine-tuned for specific downstream tasks. During fine-tuning, the T5 model is trained on a task-specific dataset to learn how to perform the specific task.

Translation

SELECT pgml.transform(
    task => '{
        "task" : "text2text-generation"
    }'::JSONB,
    inputs => ARRAY[
        'translate from English to French: I''m very happy'
    ]
) AS answer;

Result

[
    {"generated_text": "Je suis très heureux"}
]

Similar to other tasks, we can specify a model for text-to-text generation.

SELECT pgml.transform(
    task => '{
        "task" : "text2text-generation",
        "model" : "bigscience/T0"
    }'::JSONB,
    inputs => ARRAY[
        'Is the word ''table'' used in the same meaning in the two previous sentences? Sentence A: you can leave the books on the table over there. Sentence B: the tables in this book are very hard to read.'
    ]
) AS answer;

Fill-Mask

Fill-mask refers to a task where certain words in a sentence are hidden or "masked", and the objective is to predict what words should fill in those masked positions. Such models are valuable when we want to gain statistical insights about the language used to train the model.

SELECT pgml.transform(
    task => '{
        "task" : "fill-mask"
    }'::JSONB,
    inputs => ARRAY[
        'Paris is the <mask> of France.'
    ]
) AS answer;

Result

[
    {"score": 0.679, "token": 812,   "sequence": "Paris is the capital of France.",    "token_str": " capital"}, 
    {"score": 0.051, "token": 32357, "sequence": "Paris is the birthplace of France.", "token_str": " birthplace"}, 
    {"score": 0.038, "token": 1144,  "sequence": "Paris is the heart of France.",      "token_str": " heart"}, 
    {"score": 0.024, "token": 29778, "sequence": "Paris is the envy of France.",       "token_str": " envy"}, 
    {"score": 0.022, "token": 1867,  "sequence": "Paris is the Capital of France.",    "token_str": " Capital"}]

Vector Database

A vector database is a type of database that stores and manages vectors, which are mathematical representations of data points in a multi-dimensional space. Vectors can be used to represent a wide range of data types, including images, text, audio, and numerical data. It is designed to support efficient searching and retrieval of vectors, using methods such as nearest neighbor search, clustering, and indexing. These methods enable applications to find vectors that are similar to a given query vector, which is useful for tasks such as image search, recommendation systems, and natural language processing.

PostgresML enhances your existing PostgreSQL database to be used as a vector database by generating embeddings from text stored in your tables. To generate embeddings, you can use the pgml.embed function, which takes a transformer name and a text value as input. This function automatically downloads and caches the transformer for future reuse, which saves time and resources.
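
As a minimal illustration of the call shape described above:

SELECT pgml.embed('distilbert-base-uncased', 'Generating embeddings in SQL is easy!') AS embedding;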

Using a vector database involves three key steps: creating embeddings, indexing your embeddings using different algorithms, and querying the index using embeddings for your queries. Let's break down each step in more detail.

Step 1: Creating embeddings using transformers

To create embeddings for your data, you first need to choose a transformer that can generate embeddings from your input data. Some popular transformer options include BERT, GPT-2, and T5. Once you've selected a transformer, you can use it to generate embeddings for your data.

In the following section, we will demonstrate how to use PostgresML to generate embeddings for a dataset of tweets commonly used in sentiment analysis. To generate the embeddings, we will use the pgml.embed function, which will generate an embedding for each tweet in the dataset. These embeddings will then be inserted into a table called tweet_embeddings.

SELECT pgml.load_dataset('tweet_eval', 'sentiment');

SELECT * 
FROM pgml.tweet_eval
LIMIT 10;

CREATE TABLE tweet_embeddings AS
SELECT text, pgml.embed('distilbert-base-uncased', text) AS embedding 
FROM pgml.tweet_eval;

SELECT * FROM tweet_embeddings LIMIT 2;

Result

                                 text                                 |                  embedding
----------------------------------------------------------------------+-----------------------------------------------
 QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin | {-0.1567948312,-0.3149209619,0.2163394839,..}
 Ben Smith / Smith (concussion) remains out of the lineup Thursday, Curtis #NHL #SJ | {-0.0701668188,-0.012231146,0.1304316372,..}

Step 2: Indexing your embeddings using different algorithms

After you've created embeddings for your data, you need to index them using one or more indexing algorithms. There are several different types of indexing algorithms available, including B-trees, k-nearest neighbors (KNN), and approximate nearest neighbors (ANN). The specific type of indexing algorithm you choose will depend on your use case and performance requirements. For example, B-trees are a good choice for range queries, while KNN and ANN algorithms are more efficient for similarity searches.

On small datasets (<100k rows), a linear search that compares every row to the query will give sub-second results, which may be fast enough for your use case. For larger datasets, you may want to consider various indexing strategies offered by additional extensions.

  • Cube is a built-in extension that provides a fast indexing strategy for finding similar vectors. By default it has an arbitrary limit of 100 dimensions, unless Postgres is compiled with a larger size.
  • PgVector supports embeddings up to 2000 dimensions out of the box, and provides a fast indexing strategy for finding similar vectors.

When indexing your embeddings, it's important to consider the trade-offs between accuracy and speed. Exact indexing algorithms like B-trees can provide precise results, but may not be as fast as approximate indexing algorithms like KNN and ANN. Similarly, some indexing algorithms may require more memory or disk space than others.

In the following, we are creating an index on the tweet_embeddings table using the ivfflat algorithm for indexing. The ivfflat algorithm is a type of hybrid index that combines an Inverted File (IVF) index with a Flat (FLAT) index.

The index is being created on the embedding column in the tweet_embeddings table, which contains vector embeddings generated from the original tweet dataset. The vector_cosine_ops argument specifies the indexing operation to use for the embeddings. In this case, it's using the cosine similarity operation, which is a common method for measuring similarity between vectors.

By creating an index on the embedding column, the database can quickly search for and retrieve records that are similar to a given query vector. This can be useful for a variety of machine learning applications, such as similarity search or recommendation systems.

CREATE INDEX ON tweet_embeddings USING ivfflat (embedding vector_cosine_ops);
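
The ivfflat index type is provided by the pgvector extension, so it must be enabled before the index can be built. pgvector also accepts a lists parameter that trades recall for speed; the rows/1000 rule of thumb below is pgvector's suggested starting point, not a PostgresML requirement:

CREATE EXTENSION IF NOT EXISTS vector;

-- More lists = faster scans at the cost of recall; a common starting point is rows / 1000.
CREATE INDEX ON tweet_embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);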

Step 3: Querying the index using embeddings for your queries

Once your embeddings have been indexed, you can use them to perform queries against your database. To do this, you'll need to provide a query embedding that represents the query you want to perform. The index will then return the closest matching embeddings from your database, based on the similarity between the query embedding and the stored embeddings.

WITH query AS (
    SELECT pgml.embed('distilbert-base-uncased', 'Star Wars christmas special is on Disney')::vector AS embedding
)
SELECT text FROM tweet_embeddings, query ORDER BY tweet_embeddings.embedding <=> query.embedding LIMIT 5;

Result

text
Happy Friday with Batman animated Series 90S forever!
"Fri Oct 17, Sonic Highways is on HBO tonight, Also new episode of Girl Meets World on Disney"
tfw the 2nd The Hunger Games movie is on Amazon Prime but not the 1st one I didn't watch
5 RT's if you want the next episode of twilight princess tomorrow
Jurassic Park is BACK! New Trailer for the 4th Movie, Jurassic World -

LLM Fine-tuning

In this section, we will provide a step-by-step walkthrough for fine-tuning a large language model (LLM) for different tasks.

Prerequisites

  1. Ensure you have the PostgresML extension installed and configured in your PostgreSQL database. You can find installation instructions for PostgresML in the official documentation.

  2. Obtain a Hugging Face API token to push the fine-tuned model to the Hugging Face Model Hub. Follow the instructions on the Hugging Face website to get your API token.

Text Classification 2 Classes

1. Loading the Dataset

To begin, create a table to store your dataset. In this example, we use the 'imdb' dataset from Hugging Face. The IMDB dataset contains three splits: train (25K rows), test (25K rows) and unsupervised (50K rows). In the train and test splits, the negative class has label 0 and the positive class has label 1. All rows in the unsupervised split have a label of -1.

SELECT pgml.load_dataset('imdb');

2. Prepare dataset for fine-tuning

We will create a view of the dataset by performing the following operations:

  • Add a new text column named "class" that has positive and negative classes.
  • Shuffle the dataset to ensure randomness in the distribution of data.
  • Remove all rows from the unsupervised split that have label = -1.

CREATE VIEW pgml.imdb_shuffled_view AS
SELECT
    label,
    CASE WHEN label = 0 THEN 'negative'
         WHEN label = 1 THEN 'positive'
         ELSE 'neutral'
    END AS class,
    text
FROM pgml.imdb
WHERE label != -1
ORDER BY RANDOM();

3. Exploratory Data Analysis (EDA) on Shuffled Data

Before splitting the data into training and test sets, it's essential to perform exploratory data analysis (EDA) to understand the distribution of labels and other characteristics of the dataset. In this section, we'll use the pgml.imdb_shuffled_view to explore the shuffled data.

3.1 Distribution of Labels

To analyze the distribution of labels in the shuffled dataset, you can use the following SQL query:

-- Count the occurrences of each label in the shuffled dataset
pgml=# SELECT
    class,
    COUNT(*) AS label_count
FROM pgml.imdb_shuffled_view
GROUP BY class
ORDER BY class;

  class   | label_count
----------+-------------
 negative |       25000
 positive |       25000
(2 rows)

This query provides insights into the distribution of labels, helping you understand the balance or imbalance of classes in your dataset.

3.2 Sample Records

To get a glimpse of the data, you can retrieve a sample of records from the shuffled dataset:

-- Retrieve a sample of records from the shuffled dataset
pgml=# SELECT LEFT(text,100) AS text, class
FROM pgml.imdb_shuffled_view
LIMIT 5;
                                                 text                                                 |  class
------------------------------------------------------------------------------------------------------+----------
 This is a VERY entertaining movie. A few of the reviews that I have read on this forum have been wri | positive
 This is one of those movies where I wish I had just stayed in the bar.<br /><br />The film is quite  | negative
 Barbershop 2: Back in Business wasn't as good as it's original but was just as funny. The movie itse | negative
 Umberto Lenzi hits new lows with this recycled trash. Janet Agren plays a lady who is looking for he | negative
 I saw this movie last night at the Phila. Film festival. It was an interesting and funny movie that  | positive
(5 rows)

Time: 101.985 ms

This query allows you to inspect a few records to understand the structure and content of the shuffled data.

3.3 Additional Exploratory Analysis

Feel free to explore other aspects of the data, such as the distribution of text lengths, word frequencies, or any other features relevant to your analysis. Performing EDA is crucial for gaining insights into your dataset and making informed decisions during subsequent steps of the workflow.
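
For example, here is a quick look at the distribution of review lengths, using only the shuffled view defined above:

-- Bucket reviews by character length to inspect the length distribution
SELECT WIDTH_BUCKET(LENGTH(text), 0, 5000, 10) AS length_bucket,
       COUNT(*) AS reviews
FROM pgml.imdb_shuffled_view
GROUP BY length_bucket
ORDER BY length_bucket;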

4. Splitting Data into Training and Test Sets

Create views for training and test data by splitting the shuffled dataset. In this example, 80% is allocated for training, and 20% for testing. We will use pgml.imdb_test_view in section 6 for batch predictions using the fine-tuned model.

-- Create a view for training data
CREATE VIEW pgml.imdb_train_view AS
SELECT *
FROM pgml.imdb_shuffled_view
LIMIT (SELECT COUNT(*) * 0.8 FROM pgml.imdb_shuffled_view);

-- Create a view for test data
CREATE VIEW pgml.imdb_test_view AS
SELECT *
FROM pgml.imdb_shuffled_view
OFFSET (SELECT COUNT(*) * 0.8 FROM pgml.imdb_shuffled_view);
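
A quick sanity check that the 80/20 split came out as expected:

SELECT (SELECT COUNT(*) FROM pgml.imdb_train_view) AS train_rows,
       (SELECT COUNT(*) FROM pgml.imdb_test_view)  AS test_rows;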

5. Fine-Tuning the Language Model

Now, fine-tune the language model for text classification using the created training view. In the following sections, you will see a detailed explanation of the different parameters used during fine-tuning. The fine-tuned model is periodically pushed to your public Hugging Face Hub. A new repository will be created under your username using your project name (imdb_review_sentiement in this case). You can also choose to push the model to a private repository by setting hub_private_repo: true in the training arguments.

SELECT pgml.tune(
    'imdb_review_sentiement',
    task => 'text-classification',
    relation_name => 'pgml.imdb_train_view',
    model_name => 'distilbert-base-uncased',
    test_size => 0.2,
    test_sampling => 'last',
    hyperparams => '{
        "training_args" : {
            "learning_rate": 2e-5,
            "per_device_train_batch_size": 16,
            "per_device_eval_batch_size": 16,
            "num_train_epochs": 20,
            "weight_decay": 0.01,
            "hub_token" : "YOUR_HUB_TOKEN", 
            "push_to_hub" : true
        },
        "dataset_args" : { "text_column" : "text", "class_column" : "class" }
    }'
);

  • project_name ('imdb_review_sentiement'): The project_name parameter specifies a unique name for your fine-tuning project. It helps identify and organize different fine-tuning tasks within the PostgreSQL database. In this example, the project is named 'imdb_review_sentiement', reflecting the sentiment analysis task on the IMDb dataset. You can check pgml.projects for a list of projects (see the query after this list).

  • task ('text-classification'): The task parameter defines the nature of the machine learning task to be performed. In this case, it's set to 'text-classification,' indicating that the fine-tuning is geared towards training a model for text classification.

  • relation_name ('pgml.imdb_train_view'): The relation_name parameter identifies the training dataset to be used for fine-tuning. It specifies the view or table containing the training data. In this example, 'pgml.imdb_train_view' is the view created from the shuffled IMDb dataset, and it serves as the source for model training.

  • model_name ('distilbert-base-uncased'): The model_name parameter denotes the pre-trained language model architecture to be fine-tuned. In this case, 'distilbert-base-uncased' is selected. DistilBERT is a distilled version of BERT, and the 'uncased' variant indicates that the model does not differentiate between uppercase and lowercase letters.

  • test_size (0.2): The test_size parameter determines the proportion of the dataset reserved for testing during fine-tuning. In this example, 20% of the dataset is set aside for evaluation, helping assess the model's performance on unseen data.

  • test_sampling ('last'): The test_sampling parameter defines the strategy for sampling test data from the dataset. In this case, 'last' indicates that the most recent portion of the data, following the specified test size, is used for testing. Adjusting this parameter might be necessary based on your specific requirements and dataset characteristics.
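
To list the projects recorded so far, as referenced in the first bullet above:

SELECT * FROM pgml.projects;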

5.1 Dataset Arguments (dataset_args)

The dataset_args section allows you to specify critical parameters related to your dataset for language model fine-tuning.

  • text_column: The name of the column containing the text data in your dataset. In this example, it's set to "text."
  • class_column: The name of the column containing the class labels in your dataset. In this example, it's set to "class."

5.2 Training Arguments (training_args)

Fine-tuning a language model requires careful consideration of training parameters in the training_args section. Below is a subset of training args that you can pass to fine-tuning. You can find an exhaustive list of parameters in Hugging Face documentation on TrainingArguments.

  • learning_rate: The learning rate for the training. It controls the step size during the optimization process. Adjust based on your model's convergence behavior.
  • per_device_train_batch_size: The batch size per GPU for training. This parameter controls the number of training samples utilized in one iteration. Adjust based on your available GPU memory.
  • per_device_eval_batch_size: The batch size per GPU for evaluation. Similar to per_device_train_batch_size, but used during model evaluation.
  • num_train_epochs: The number of training epochs. An epoch is one complete pass through the entire training dataset. Adjust based on the model's convergence and your dataset size.
  • weight_decay: L2 regularization term for weight decay. It helps prevent overfitting. Adjust based on the complexity of your model.
  • hub_token: Your Hugging Face API token to push the fine-tuned model to the Hugging Face Model Hub. Replace "YOUR_HUB_TOKEN" with the actual token.
  • push_to_hub: A boolean flag indicating whether to push the model to the Hugging Face Model Hub after fine-tuning.

5.3 Monitoring

During training, metrics like loss and gradient norm are printed as INFO messages and also logged in the pgml.logs table. Below is a snapshot of such output.

INFO:  {
    "loss": 0.3453,
    "grad_norm": 5.230295181274414,
    "learning_rate": 1.9e-05,
    "epoch": 0.25,
    "step": 500,
    "max_steps": 10000,
    "timestamp": "2024-03-07 01:59:15.090612"
}
INFO:  {
    "loss": 0.2479,
    "grad_norm": 2.7754225730895996,
    "learning_rate": 1.8e-05,
    "epoch": 0.5,
    "step": 1000,
    "max_steps": 10000,
    "timestamp": "2024-03-07 02:01:12.064098"
}
INFO:  {
    "loss": 0.223,
    "learning_rate": 1.6000000000000003e-05,
    "epoch": 1.0,
    "step": 2000,
    "max_steps": 10000,
    "timestamp": "2024-03-07 02:05:08.141220"
}

Once the training is completed, the model is evaluated against the validation dataset. You will see the output below in the client terminal. Accuracy on the evaluation dataset is 0.934 and the F1-score is 0.93.

INFO:  {
    "train_runtime": 2359.5335,
    "train_samples_per_second": 67.81,
    "train_steps_per_second": 4.238,
    "train_loss": 0.11267969808578492,
    "epoch": 5.0,
    "step": 10000,
    "max_steps": 10000,
    "timestamp": "2024-03-07 02:36:38.783279"
}
INFO:  {
    "eval_loss": 0.3691485524177551,
    "eval_f1": 0.9343711842996372,
    "eval_accuracy": 0.934375,
    "eval_runtime": 41.6167,
    "eval_samples_per_second": 192.23,
    "eval_steps_per_second": 12.014,
    "epoch": 5.0,
    "step": 10000,
    "max_steps": 10000,
    "timestamp": "2024-03-07 02:37:31.762917"
}

Once the training is completed, you can query the pgml.logs table using the model_id or by finding the latest model on the project.

pgml=# SELECT logs->>'epoch' AS epoch, logs->>'step' AS step, logs->>'loss' AS loss FROM pgml.logs WHERE model_id = 993 AND jsonb_exists(logs, 'loss');
 epoch | step  |  loss
-------+-------+--------
 0.25  | 500   | 0.3453
 0.5   | 1000  | 0.2479
 0.75  | 1500  | 0.223
 1.0   | 2000  | 0.2165
 1.25  | 2500  | 0.1485
 1.5   | 3000  | 0.1563
 1.75  | 3500  | 0.1559
 2.0   | 4000  | 0.142
 2.25  | 4500  | 0.0816
 2.5   | 5000  | 0.0942
 2.75  | 5500  | 0.075
 3.0   | 6000  | 0.0883
 3.25  | 6500  | 0.0432
 3.5   | 7000  | 0.0426
 3.75  | 7500  | 0.0444
 4.0   | 8000  | 0.0504
 4.25  | 8500  | 0.0186
 4.5   | 9000  | 0.0265
 4.75  | 9500  | 0.0248
 5.0   | 10000 | 0.0284
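
The evaluation metrics land in the same table. Here is a sketch pulling them for the same model, assuming the eval_* keys shown in the INFO output above are logged as-is:

SELECT logs->>'eval_accuracy' AS accuracy,
       logs->>'eval_f1' AS f1
FROM pgml.logs
WHERE model_id = 993 AND jsonb_exists(logs, 'eval_accuracy');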

During training, the model is periodically uploaded to the Hugging Face Hub. You will find the model at https://huggingface.co/<username>/<project_name>. An example model that was automatically pushed to the Hugging Face Hub is here.

6. Inference using the fine-tuned model

Now that we have a fine-tuned model on the Hugging Face Hub, we can use pgml.transform to perform real-time predictions as well as batch predictions.

Real-time predictions

Here is an example pgml.transform call for real-time predictions using the newly minted LLM fine-tuned on the IMDB review dataset.

 SELECT pgml.transform(
  task   => '{
    "task": "text-classification",
    "model": "santiadavani/imdb_review_sentiement"
  }'::JSONB,
  inputs => ARRAY[
    'I would not give this movie a rating, its not worthy. I watched it only because I am a Pfieffer fan. ',
    'This movie was sooooooo good! It was hilarious! There are so many jokes that you can just watch the'
  ]
);
                                               transform
--------------------------------------------------------------------------------------------------------
 [{"label": "negative", "score": 0.999561846256256}, {"label": "positive", "score": 0.986771047115326}]
(1 row)

Time: 175.264 ms

Batch predictions

pgml=# SELECT
    LEFT(text, 100) AS truncated_text,
    class,
    predicted_class[0]->>'label' AS predicted_class,
    (predicted_class[0]->>'score')::float AS score
FROM (
    SELECT
        LEFT(text, 100) AS text,
        class,
        pgml.transform(
            task => '{
                "task": "text-classification",
                "model": "santiadavani/imdb_review_sentiement"
            }'::JSONB,
            inputs => ARRAY[text]
        ) AS predicted_class
    FROM pgml.imdb_test_view
    LIMIT 2
) AS subquery;
                                            truncated_text                                            |  class   | predicted_class |       score
------------------------------------------------------------------------------------------------------+----------+-----------------+--------------------
 I wouldn't give this movie a rating, it's not worthy. I watched it only because I'm a Pfieffer fan.  | negative | negative        | 0.9996490478515624
 This movie was sooooooo good! It was hilarious! There are so many jokes that you can just watch the  | positive | positive        | 0.9972313046455384

 Time: 1337.290 ms (00:01.337)
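
The same pattern aggregates cleanly into an overall quality number. Here is a sketch computing accuracy over a small sample of the test view; the LIMIT keeps inference time manageable:

SELECT AVG((class = (predicted[0]->>'label'))::int) AS accuracy
FROM (
    SELECT
        class,
        pgml.transform(
            task => '{
                "task": "text-classification",
                "model": "santiadavani/imdb_review_sentiement"
            }'::JSONB,
            inputs => ARRAY[text]
        ) AS predicted
    FROM pgml.imdb_test_view
    LIMIT 100
) AS sample;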

7. Restarting Training from a Previously Trained Model

Sometimes, it's necessary to restart the training process from a previously trained model. This can be advantageous for various reasons, such as model fine-tuning, hyperparameter adjustments, or addressing interruptions in the training process. pgml.tune provides a seamless way to restart training while leveraging the progress made in the existing model. Below is a guide on how to restart training using a previous model as a starting point:

Define the Previous Model

Specify the name of the existing model you want to use as a starting point. This is achieved by setting the model_name parameter in the pgml.tune function. In the example below, it is set to 'santiadavani/imdb_review_sentiement'.

model_name => 'santiadavani/imdb_review_sentiement',

Adjust Hyperparameters

Fine-tune hyperparameters as needed for the restarted training process. This might include modifying learning rates, batch sizes, or training epochs. In the example below, hyperparameters such as learning rate, batch sizes, and epochs are adjusted.

"training_args": {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 1,
    "weight_decay": 0.01,
    "hub_token": "",
    "push_to_hub": true
},

Ensure Consistent Dataset Configuration

Confirm that the dataset configuration remains consistent, including specifying the same text and class columns as in the previous training. This ensures compatibility between the existing model and the restarted training process.

"dataset_args": {
    "text_column": "text",
    "class_column": "class"
},

Run the pgml.tune Function

Execute the pgml.tune function with the updated parameters to initiate the training restart. The function will leverage the existing model and adapt it based on the adjusted hyperparameters and dataset configuration.

SELECT pgml.tune(
    'imdb_review_sentiement',
    task => 'text-classification',
    relation_name => 'pgml.imdb_train_view',
    model_name => 'santiadavani/imdb_review_sentiement',
    test_size => 0.2,
    test_sampling => 'last',
    hyperparams => '{
        "training_args": {
            "learning_rate": 2e-5,
            "per_device_train_batch_size": 16,
            "per_device_eval_batch_size": 16,
            "num_train_epochs": 1,
            "weight_decay": 0.01,
            "hub_token": "YOUR_HUB_TOKEN",
            "push_to_hub": true
        },
        "dataset_args": { "text_column": "text", "class_column": "class" }
    }'
);

By following these steps, you can effectively restart training from a previously trained model, allowing for further refinement and adaptation of the model based on new requirements or insights. Adjust parameters as needed for your specific use case and dataset.

8. Hugging Face Hub vs. PostgresML as Model Repository

We utilize the Hugging Face Hub as the primary repository for fine-tuning Large Language Models (LLMs). Leveraging the HF hub offers several advantages:

  • The HF repository serves as the platform for pushing incremental updates to the model during the training process. In the event of any disruptions in the database connection, you have the flexibility to resume training from where it was left off.
  • If you prefer to keep the model private, you can push it to a private repository within the Hugging Face Hub by setting the parameter hub_private_repo to true. This ensures that the model is not publicly accessible.
  • The pgml.transform function, designed around utilizing models from the Hugging Face Hub, can be reused without any modifications.

However, in certain scenarios, pushing the model to a central repository and pulling it for inference may not be the most suitable approach. To address this situation, we save all the model weights and additional artifacts, such as tokenizer configurations and vocabulary, in the pgml.files table at the end of the training process. It's important to note that as of the current writing, hooks to use models directly from pgml.files in the pgml.transform function have not been implemented. We welcome Pull Requests (PRs) from the community to enhance this functionality.

Text Classification 9 Classes

1. Load and Shuffle the Dataset

In this section, we begin by loading the FinGPT sentiment analysis dataset using the pgml.load_dataset function. The dataset is then processed and organized into a shuffled view (pgml.fingpt_sentiment_shuffled_view), ensuring a randomized order of records. This step is crucial for preventing biases introduced by the original data ordering and enhancing the training process.

-- Load the dataset
SELECT pgml.load_dataset('FinGPT/fingpt-sentiment-train');

-- Create a shuffled view
CREATE VIEW pgml.fingpt_sentiment_shuffled_view AS
SELECT * FROM pgml."FinGPT/fingpt-sentiment-train" ORDER BY RANDOM();

2. Explore Class Distribution

Once the dataset is loaded and shuffled, we delve into understanding the distribution of sentiment classes within the data. By querying the shuffled view, we obtain valuable insights into the number of instances for each sentiment class. This exploration is essential for gaining a comprehensive understanding of the dataset and its inherent class imbalances.

-- Explore class distribution
pgml=# SELECT
    output,
    COUNT(*) AS class_count
FROM pgml.fingpt_sentiment_shuffled_view
GROUP BY output
ORDER BY output;

       output        | class_count
---------------------+-------------
 mildly negative     |        2108
 mildly positive     |        2548
 moderately negative |        2972
 moderately positive |        6163
 negative            |       11749
 neutral             |       29215
 positive            |       21588
 strong negative     |         218
 strong positive     |         211

3. Create Training and Test Views

To facilitate the training process, we create distinct views for training and testing purposes. The training view (pgml.fingpt_sentiment_train_view) contains 80% of the shuffled dataset, enabling the model to learn patterns and associations. Simultaneously, the test view (pgml.fingpt_sentiment_test_view) encompasses the remaining 20% of the data, providing a reliable evaluation set to assess the model's performance.

-- Create a view for training data (e.g., 80% of the shuffled records)
CREATE VIEW pgml.fingpt_sentiment_train_view AS
SELECT *
FROM pgml.fingpt_sentiment_shuffled_view
LIMIT (SELECT COUNT(*) * 0.8 FROM pgml.fingpt_sentiment_shuffled_view);

-- Create a view for test data (remaining 20% of the shuffled records)
CREATE VIEW pgml.fingpt_sentiment_test_view AS
SELECT *
FROM pgml.fingpt_sentiment_shuffled_view
OFFSET (SELECT COUNT(*) * 0.8 FROM pgml.fingpt_sentiment_shuffled_view);

4. Fine-Tune the Model for 9 Classes

In the final section, we kick off the fine-tuning process using the pgml.tune function. The model will be internally configured for sentiment analysis with 9 classes. Training is executed on 80% of the training view and evaluated on the remaining 20% of the training view. The test view is reserved for evaluating the model's accuracy after training is completed. Please note that the option hub_private_repo: true is used to push the model to a private Hugging Face repository.

-- Fine-tune the model for 9 classes
SELECT pgml.tune(
    'fingpt_sentiement',
    task => 'text-classification',
    relation_name => 'pgml.fingpt_sentiment_train_view',
    model_name => 'distilbert-base-uncased',
    test_size => 0.2,
    test_sampling => 'last',
    hyperparams => '{
        "training_args": {
            "learning_rate": 2e-5,
            "per_device_train_batch_size": 16,
            "per_device_eval_batch_size": 16,
            "num_train_epochs": 5,
            "weight_decay": 0.01,
            "hub_token" : "YOUR_HUB_TOKEN",
            "push_to_hub": true,
            "hub_private_repo": true
        },
        "dataset_args": { "text_column": "input", "class_column": "output" }
    }'
);
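Once tuning completes and the model has been pushed to the Hub, it can be pulled back for inference with pgml.transform by naming the repository in the task JSON. A minimal sketch, assuming a hypothetical repository name and input sentence (a private repo would additionally require an access token):

-- Hypothetical inference against the tuned model pulled from the Hub
SELECT pgml.transform(
    task   => '{"task": "text-classification", "model": "YOUR_USERNAME/fingpt_sentiment"}'::JSONB,
    inputs => ARRAY['Shares rallied after the company beat earnings expectations.']
) AS sentiment;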

Conversation

In this section, we will discuss the conversation task using state-of-the-art NLP techniques. Conversational AI has garnered immense interest and significance in recent years due to its wide range of applications, from virtual assistants to customer service chatbots and beyond.

Understanding the Conversation Task

At the core of conversational AI lies the conversation task, a fundamental NLP problem that involves processing and generating human-like text-based interactions. Let's break down this task into its key components:

  • Input: The input to the conversation task typically consists of a sequence of conversational turns, often represented as text. These turns can encompass a dialogue between two or more speakers, capturing the flow of communication over time.

  • Model: Central to the conversation task is the NLP model, which is trained to understand the nuances of human conversation and generate appropriate responses. These models leverage sophisticated transformer-based architectures such as Llama2, Mistral, and GPT, empowered by large-scale datasets and advanced training techniques.

  • Output: The ultimate output of the conversation task is the model's response to the input conversation. This response aims to be contextually relevant, coherent, and engaging, reflecting a natural human-like interaction.

Versatility of the Conversation Task

What makes the conversation task truly remarkable is its versatility. Beyond its traditional application in dialogue systems, the conversation task can be adapted to solve several NLP problems by tweaking the input representation or task formulation.

  • Text Classification: By providing individual utterances with corresponding labels, the conversation task can be repurposed for tasks such as sentiment analysis, intent detection, or topic classification (a concrete SQL sketch follows this list).

    Input:

    • System: Chatbot: "Hello! How can I assist you today?"
    • User: "I'm having trouble connecting to the internet."

    Model Output (Text Classification):

    • Predicted Label: Technical Support
    • Confidence Score: 0.85
  • Token Classification: Annotating the conversation with labels for specific tokens or phrases enables applications like named entity recognition within conversational text.

    Input:

    • System: Chatbot: "Please describe the issue you're facing in detail."
    • User: "I can't access any websites, and the Wi-Fi indicator on my router is blinking."

    Model Output (Token Classification):

    • User's Description: "I can't access any websites, and the Wi-Fi indicator on my router is blinking."
    • Token Labels:
    • "access" - Action
    • "websites" - Entity (Location)
    • "Wi-Fi" - Entity (Technology)
    • "indicator" - Entity (Device Component)
    • "blinking" - State
  • Question Answering: Transforming conversational exchanges into a question-answering format enables extracting relevant information and providing concise answers, akin to human comprehension and response.

    Input:

    • System: Chatbot: "How can I help you today?"
    • User: "What are the symptoms of COVID-19?"

    Model Output (Question Answering):

    • Answer: "Common symptoms of COVID-19 include fever, cough, fatigue, shortness of breath, loss of taste or smell, and body aches."
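To make the text-classification repurposing above concrete, here is a minimal SQL sketch that classifies a single conversational utterance with pgml.transform; the default task model and its label set are assumptions, not something configured earlier in this guide:

-- Hypothetical classification of one conversational turn
SELECT pgml.transform(
    task   => 'text-classification',
    inputs => ARRAY['I''m having trouble connecting to the internet.']
) AS classification;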

Fine-tuning Llama2-7b model using LoRA

In this section, we will explore how to fine-tune the Llama2-7b-chat large language model on the financial sentiment data discussed in the previous section, using the pgml.tune function with the LoRA approach. LoRA (Low-Rank Adaptation) enables efficient fine-tuning of large language models by freezing the original weights and training only small low-rank adapter matrices injected into the model. This approach can significantly reduce the computational requirements and memory footprint compared to traditional full-model fine-tuning.

SELECT pgml.tune(
    'fingpt-llama2-7b-chat',
    task => 'conversation',
    relation_name => 'pgml.fingpt_sentiment_train_view',
    model_name => 'meta-llama/Llama-2-7b-chat-hf',
    test_size => 0.8,
    test_sampling => 'last',
    hyperparams => '{
        "training_args" : {
            "learning_rate": 2e-5,
            "per_device_train_batch_size": 4,
            "per_device_eval_batch_size": 4,
            "num_train_epochs": 1,
            "weight_decay": 0.01,
            "hub_token" : "HF_TOKEN", 
            "push_to_hub" : true,
            "optim" : "adamw_bnb_8bit",
            "gradient_accumulation_steps" : 4,
            "gradient_checkpointing" : true
        },
        "dataset_args" : { "system_column" : "instruction", "user_column" : "input", "assistant_column" : "output" },
        "lora_config" : {"r": 2, "lora_alpha" : 4, "lora_dropout" : 0.05, "bias": "none", "task_type": "CAUSAL_LM"},
        "load_in_8bit" : false,
        "token" : "HF_TOKEN"
    }'
);

Let's break down each argument and its significance:

  1. Model Name (model_name):

    • This argument specifies the name or identifier of the base model that will be fine-tuned. In the context of the provided query, it refers to the pre-trained model "meta-llama/Llama-2-7b-chat-hf."
  2. Task (task):

    • Indicates the specific task for which the model is being fine-tuned. In this case, it's set to "conversation," signifying that the model will be adapted to process conversational data.
  3. Relation Name (relation_name):

    • Refers to the name of the dataset or database relation containing the training data used for fine-tuning. In the provided query, it's set to "pgml.fingpt_sentiment_train_view."
  4. Test Size (test_size):

    • Specifies the proportion of the dataset reserved for testing, expressed as a fraction. In the example, it's set to 0.8, indicating that 80% of the data will be held out for testing and the remaining 20% will be used for training.
  5. Test Sampling (test_sampling):

    • Determines the strategy for sampling the test data. In the provided query, it's set to "last," indicating that the last portion of the dataset will be used for testing.
  6. Hyperparameters (hyperparams):

    • This argument encapsulates a JSON object containing various hyperparameters essential for the fine-tuning process. Let's break down its subcomponents:
      • Training Args (training_args): Specifies parameters related to the training process, including learning rate, batch size, number of epochs, weight decay, optimizer settings, and other training configurations.
      • Dataset Args (dataset_args): Provides arguments related to dataset processing, such as the column names for system prompts, user inputs, and assistant responses.
      • LoRA Config (lora_config): Defines settings for LoRA (Low-Rank Adaptation), including the rank of the adapter matrices (r), the scaling factor (lora_alpha), the dropout rate (lora_dropout), bias handling, and the task type.
      • Load in 8-bit (load_in_8bit): Determines whether to load the base model's weights in 8-bit precision, which can reduce memory usage at some cost in numerical precision.
      • Token (token): Specifies the Hugging Face token required for accessing private repositories and pushing the fine-tuned model to the Hugging Face Hub.
  7. Hub Private Repo (hub_private_repo):

    • This optional parameter indicates whether the fine-tuned model should be pushed to a private repository on the Hugging Face Hub. It is not set in the query above; in the earlier text-classification example it was set to true, storing the model in a private repository.

Training Args:

Expanding on the training_args within the hyperparams argument provides insight into the specific parameters governing the training process of the model. Here's a breakdown of the individual training arguments and their significance:

  • Learning Rate (learning_rate):

    • Determines the step size at which the model parameters are updated during training. A higher learning rate may lead to faster convergence but risks overshooting optimal solutions, while a lower learning rate gives more stable training but can take longer to converge.
  • Per-device Train Batch Size (per_device_train_batch_size):

    • Specifies the number of training samples processed in each batch per device during training. Adjusting this parameter can impact memory usage and training speed, with larger batch sizes potentially accelerating training but requiring more memory.
  • Per-device Eval Batch Size (per_device_eval_batch_size):

    • Similar to per_device_train_batch_size, this parameter determines the batch size used for evaluation (validation) during training. It allows for efficient evaluation of the model's performance on validation data.
  • Number of Train Epochs (num_train_epochs):

    • Defines the number of times the entire training dataset is passed through the model during training. Increasing the number of epochs can improve model performance up to a certain point, after which it may lead to overfitting.
  • Weight Decay (weight_decay):

    • Introduces regularization by penalizing large weights in the model, thereby preventing overfitting. It helps to control the complexity of the model and improve generalization to unseen data.
  • Hub Token (hub_token):

    • A token required for authentication when pushing the fine-tuned model to the Hugging Face Hub or accessing private repositories. It ensures secure communication with the Hub platform.
  • Push to Hub (push_to_hub):

    • A boolean flag indicating whether the fine-tuned model should be uploaded to the Hugging Face Hub after training. Setting this parameter to true facilitates sharing and deployment of the model for wider usage.
  • Optimizer (optim):

    • Specifies the optimization algorithm used during training. In the provided query, it's set to "adamw_bnb_8bit", the AdamW optimizer from the bitsandbytes library, which keeps optimizer state in 8-bit precision to reduce memory usage.
  • Gradient Accumulation Steps (gradient_accumulation_steps):

    • Controls the accumulation of gradients over multiple batches before updating the model's parameters. It can help mitigate memory constraints and stabilize training, especially with large batch sizes.
  • Gradient Checkpointing (gradient_checkpointing):

    • Enables gradient checkpointing, a memory-saving technique that trades off compute for memory during backpropagation. It allows training of larger models or with larger batch sizes without running out of memory.

Each of these training arguments plays a crucial role in shaping the training process, ensuring efficient convergence, regularization, and optimization of the model for the specific task at hand. Adjusting these parameters appropriately is essential for achieving optimal model performance.
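As a worked example using the values from the query above: the effective training batch size per device is per_device_train_batch_size × gradient_accumulation_steps = 4 × 4 = 16 samples between parameter updates, while only 4 samples ever occupy GPU memory at once.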

LoRA Args:

Expanding on the lora_config within the hyperparams argument provides clarity on its role in configuring LoRA (Low-Rank Adaptation):

  • Rank (r):

    • Specifies the rank of the low-rank adapter matrices that LoRA injects into the model. A smaller rank yields fewer trainable parameters and a smaller saved adapter, at the cost of less capacity to adapt the base model.
  • LoRA Alpha (lora_alpha):

    • A scaling factor applied to the adapter output before it is added to the output of the frozen weights. Together with r, it controls how strongly the learned adapters influence the model; a common heuristic is to set lora_alpha to a small multiple of r.
  • LoRA Dropout (lora_dropout):

    • Defines the dropout rate applied within the LoRA adapter layers during training. Dropout introduces noise to prevent overfitting of the small set of trainable adapter weights and to improve generalization.
  • Bias (bias):

    • Determines which bias terms, if any, are trained alongside the adapters. Setting it to "none", as in the query above, keeps all bias terms frozen.
  • Task Type (task_type):

    • Specifies the type of task for which the LORA algorithm is applied. In this context, it's set to "CAUSAL_LM" for causal language modeling, indicating that the model predicts the next token based on the previous tokens in the sequence.

Configuring these LoRA arguments appropriately balances adapter size against adaptation capacity, allowing the model to capture the target domain effectively while training only a tiny fraction of its weights.
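As a rough illustration of why r dominates adapter size: for a single d × k weight matrix, LoRA trains two matrices of shapes d × r and r × k, i.e. r × (d + k) parameters. With r = 2 and a 4096 × 4096 projection (4096 is Llama2-7b's hidden size), that is 2 × 8192 = 16,384 trainable parameters per adapted matrix, versus roughly 16.8 million frozen ones, which is why the saved adapter mentioned below is only about 2MB.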

Dataset Args:

Expanding on the dataset_args within the hyperparams argument provides insight into its role in processing the dataset:

  • System Column (system_column):

    • Specifies the name or identifier of the column containing the system prompts (e.g., instructions) within the dataset. This column is crucial for distinguishing between different types of conversational turns and facilitating model training.
  • User Column (user_column):

    • Indicates the column containing user inputs or queries within the dataset. These inputs form the basis for the model's understanding of user intentions, sentiments, or requests during training and inference.
  • Assistant Column (assistant_column):

    • Refers to the column containing the assistant responses, i.e., the target outputs the model learns to produce. These serve as supervision targets during training and are compared against the model's generated responses during evaluation to assess performance.

Configuring these dataset arguments ensures that the model is trained on the appropriate input-output pairs, enabling it to learn from the conversational data and generate contextually relevant responses.
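To see how these columns map onto conversational roles, a quick peek at one row of the training view (a sanity check, not part of the original walkthrough) shows the fields consumed as the system, user, and assistant turns:

-- Inspect one training example and its role mapping
SELECT instruction AS system_turn,
       input       AS user_turn,
       output      AS assistant_turn
FROM pgml.fingpt_sentiment_train_view
LIMIT 1;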

Once the fine-tuning is completed, you will see the model in your Hugging Face repository (example: https://huggingface.co/santiadavani/fingpt-llama2-7b-chat). Since we are using LoRA to fine-tune the model, we only save the adapter weights (~2MB) instead of all 7B weights (~14GB) of the Llama2-7b model.

Inference

For inference, we will be utilizing the OpenSourceAI class from the pgml SDK. Here's an example code snippet:

import pgml

database_url = "DATABASE_URL"

client = pgml.OpenSourceAI(database_url)

results = client.chat_completions_create(
    {
        "model" : "santiadavani/fingpt-llama2-7b-chat",
        "token" : "TOKEN",
        "load_in_8bit": "true",
        "temperature" : 0.1,
        "repetition_penalty" : 1.5,
    },
    [
        {
            "role" : "system",
            "content" : "What is the sentiment of this news? Please choose an answer from {strong negative/moderately negative/mildly negative/neutral/mildly positive/moderately positive/strong positive}.",
        },
        {
            "role": "user",
            "content": "Starbucks says the workers violated safety policies while workers said they'd never heard of the policy before and are alleging retaliation.",
        },
    ]
)

print(results)

In this code snippet, we first import the pgml module and create an instance of the OpenSourceAI class, providing the necessary database URL. We then call the chat_completions_create method, specifying the model we want to use (in this case, "santiadavani/fingpt-llama2-7b-chat"), along with other parameters such as the token, whether to load the model in 8-bit precision, the temperature for sampling, and the repetition penalty.

The chat_completions_create method takes two arguments: a dictionary containing the model configuration and a list of dictionaries representing the chat conversation. In this example, the conversation consists of a system prompt asking for the sentiment of a given news snippet, and a user message containing the news text.

The results are:

{
    "choices": [
        {
            "index": 0,
            "message": {
                "content": " Moderately negative ",
                "role": "assistant"
            }
        }
    ],
    "created": 1711144872,
    "id": "b663f701-db97-491f-b186-cae1086f7b79",
    "model": "santiadavani/fingpt-llama2-7b-chat",
    "object": "chat.completion",
    "system_fingerprint": "e36f4fa5-3d0b-e354-ea4f-950cd1d10787",
    "usage": {
        "completion_tokens": 0,
        "prompt_tokens": 0,
        "total_tokens": 0
    }
}

This dictionary contains the response from the language model, santiadavani/fingpt-llama2-7b-chat, for the given news text.

The key information in the response is:

  1. choices: A list containing the model's response. In this case, there is only one choice.
  2. message.content: The actual response from the model, which is " Moderately negative".
  3. model: The name of the model used, "santiadavani/fingpt-llama2-7b-chat".
  4. created: A timestamp indicating when the response was generated.
  5. id: A unique identifier for this response.
  6. object: Indicates that this is a "chat.completion" object.
  7. usage: Information about the token usage for this response, although all values are 0 in this case.

In short, the language model analyzed the news text "Starbucks says the workers violated safety policies while workers said they'd never heard of the policy before and are alleging retaliation." and determined that the sentiment expressed in it is moderately negative.

postgresml's People

Contributors

aplchian, chillenberger, chuckhend, dependabot[bot], elvizlai, f-prime, garrrikkotua, higuoxing, jhydra12, jonatas, jsaied99, kczimm, kianmeng, levkk, moloejoe, montanalow, nickcanz, ns1000, ole-gi, rahul721999, samdobson, santiatpml, sasasu, silasmarvin, solidsnack, tanruixiang, thomaskluiters, tigitz, workingjubilee, zebehringer


postgresml's Issues

Pull Request Preview Environments for increasing maintainer productivity

I would like to make life easier for PostgresML maintainers by implementing Uffizzi preview environments.
Disclaimer: I work on Uffizzi.

Uffizzi is an open-source full-stack previews engine, and our platform is available completely free for PostgresML (and all open source projects). This will provide maintainers with preview environments of their PRs in the cloud, allowing them to iterate faster and reduce time to merge.

Uffizzi is purpose-built for the task of previewing PRs and it integrates with your workflow to deploy preview environments in the background without any manual steps for maintainers or contributors.

TODO:

  • Initial PoC

Switch to unix line endings?

GitHub recommends always using \n as the newline character in git-managed repos.

Windows line endings make it difficult to collaborate with folks using unix, and result in the docker-compose entrypoint scripts being parsed incorrectly on linux:

postgresml-admin-1     | /app/docker/entrypoint.sh: line 8: $'\r': command not found
postgresml-admin-1     | /app/docker/entrypoint.sh: line 23: syntax error: unexpected end of file

This should be an easy fix. I haven't created a PR, because it would involve changing a large number of files...

pgml.embed does not work

I use the hosted version of PGML.

select pgml.version() returns 2.1.1.

As a test, I have a database

create table european_anthems (
    country varchar(20),
    anthem_first_line varchar(255)
);

I have inserted 20 lines into this table OK and can query them as normal.

I now want to create embedding vectors for anthem_first_line, and according to the manual this can be done using:

SELECT pgml.embed('distilbert-base-uncased', anthem_first_line)
FROM european_anthems;

However this gives the error message

[42883] ERROR: function pgml.embed(unknown, character varying) does not exist Hint: No function matches the given name and argument types. You might need to add explicit type casts. Position: 8

Has the method been changed? Do I need to enable something extra on the hosted version of PGML to make this work?

Can you please update the documentation accordingly?

Iris models always predict the same class

Following the iris classification example, you end up with a model that classifies every flower as Iris-virginica.

pgml_development=# SELECT pgml.predict('Iris Classifier', ARRAY[sepal_length, sepal_width, petal_length, petal_width]) AS prediction, count(target) FROM pgml.iris GROUP BY prediction;
 prediction | count
------------+-------
          2 |   150
(1 row)

This is the case for at least the following algorithms:

  • linear
  • ridge
  • xgboost
  • random_forest

When training a classification model, I receive an error

Data is a two-column view:

  1. vector - Integer[]
  2. result - Integer

Running the following:

SELECT * FROM pgml.train(
  'commits:category:build',
  'classification',
  'commits_build',
  'result'
);

This results in the following error:

Query 1 ERROR: ERROR:  ValueError: y should be a 1d array, got an array of shape (325, 2) instead.
CONTEXT:  Traceback (most recent call last):
  PL/Python function "train", line 4, in <module>
    status = train(
  PL/Python function "train", line 839, in train
  PL/Python function "train", line 718, in fit
  PL/Python function "train", line 568, in roc_auc_score
  PL/Python function "train", line 74, in _average_binary_score
  PL/Python function "train", line 342, in _binary_roc_auc_score
  PL/Python function "train", line 977, in roc_curve
  PL/Python function "train", line 741, in _binary_clf_curve
  PL/Python function "train", line 1151, in column_or_1d
PL/Python function "train"

[BUG] Unable to see snapshots generated from views

I created a view on some data, such as:

CREATE VIEW iris_view AS SELECT * FROM pgml.iris ORDER BY random() LIMIT 100;

and then trained a model with the following command:

SELECT * FROM pgml.train('Iris Classifier', 'classification', 'iris_view', 'target');

When trying to access the UI /snapshots, I got a ProgrammingError:

relation "iris_view" does not exist
LINE 1: SELECT pg_size_pretty(pg_total_relation_size('iris_view'))

(but when I execute the size query in psql it doesn't throw an error; it just returns nothing.) I have attached a screenshot of the error.

Screen Shot 2022-05-05 at 12 16 38 PM

Broken link ( https://postgresml.org/projects/ ) in Tutorial 1: ⏱️ Real Time Fraud Detection

Hi, I'm following the tutorials, and they are super helpful.

IDK if this is exactly the right place to report.

On Tutorial 1: ⏱️ Real Time Fraud Detection , step 7, there is a broken link.

We'll organize our work on this task under the project name "Breast Cancer Detection", which you can now see it in your list of projects.

And

You can pop over to the projects tab for a visualization

Edit: the broken links are through multiple tutorials, and I realise now the links should point to https://postgresml.org/dashboard/projects

Streaming responses from LLMs

It'd be nice to have an API over a server side cursor that returned individual tokens as rows from the model to stream responses back to the end user.

Unable to create new project

I encountered a "TypeError at /api/tables/columns/" error while setting up a new project. I've included a screenshot of the entire error; kindly have a look.

image

I connected the pgml dashboard to my local Postgres database and changed the allowed-hosts entry in the .env and settings.py files under the pgml_dashboard directory.

Kindly inform me of the best way to handle this error.

version 2.2.0 gives pgml.predict is not unique

The new version installed from the apt repo gives errors such as:
error returned from database: function pgml.predict(unknown, smallint[]) is not unique

Caused by:
function pgml.predict(unknown, smallint[]) is not unique

even for most of the examples in the notebooks section.

docker-compose up fails - Dockerfile.local scikit-learn

docker-compose up fails because line 14 in pgml-extension's Dockerfile.local tries to pip install sklearn, which is deprecated. It should be changed to:
RUN pip3 install xgboost scikit-learn diptest torch lightgbm transformers datasets sentencepiece sacremoses sacrebleu rouge

Python version support for 3.6

Hi,

Curious if Python 3.7 is a hard requirement or if it could possibly run on an earlier 3.x version. In particular, I'm curious about Python 3.6, as we already run it and would love to give this a try.

Unable to add new project

While trying to add a new project we are getting a 403 error. Please find the screenshot below.

image

We have also changed the allowed-hosts entry in the .env.TEMPLATE, .env.docker, and settings.py files under the pgml_dashboard directory. Please let us know whether these changes are appropriate.

Functions using `blas` cause a segfault (SIGSEV)

After working with @levkk and @montanalow to install PostgresML (as of master: 63ebce3) on my linux box, I discovered that functions such as pgml.cosine_similarity and pgml.norm_l1 cause Postgres to segfault.

As an example:

[v15.1][5126] pgml=# select pgml.norm_l1(ARRAY[1,2,3]::real[]);
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
Time: 188.973 ms
[v][] ?!> 

Postgres logs leading up to a crash against pgml.cosine_similarity() are:

/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/utils/logging.py:65: RuntimeWarning: Error deriving logger module name, using <None>. Exception: <module '' from '/home/pg/15/data'> is a built-in module
  warnings.warn(
No sentence-transformers model found with name /home/zombodb/.cache/torch/sentence_transformers/intfloat_e5-large. Creating a new one with MEAN pooling.
2023-05-04 18:35:48.950 UTC [20973] LOG:  server process (PID 21218) was terminated by signal 11: Segmentation fault
2023-05-04 18:35:48.950 UTC [20973] DETAIL:  Failed process was running: select *, pgml.cosine_similarity(embed, pgml.embed('intfloat/e5-large', 'meetings with beer or wine and cheese')) from embeddings_e5large_100k limit 10;
2023-05-04 18:35:48.950 UTC [20973] LOG:  terminating any other active server processes
2023-05-04 18:35:48.953 UTC [20973] LOG:  all server processes terminated; reinitializing
2023-05-04 18:35:48.979 UTC [20973] FATAL:  Can't attach, lock is not in an empty state: PgLwLockInner
2023-05-04 18:35:48.980 UTC [20973] LOG:  database system is shut down

The backtrace from a --debug build of pgml is:

Thread 1 "postgres" received signal SIGSEGV, Segmentation fault.
0x00007ff52bc65d76 in sdot_ () from /home/pg/15/lib/postgresql/pgml.so
(gdb) bt
#0  0x00007ff52bc65d76 in sdot_ () from /home/pg/15/lib/postgresql/pgml.so
#1  0x00007ff52b8b363a in blas::sdot (n=1024, x=..., incx=1, y=..., incy=1)
    at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/blas-0.22.0/src/lib.rs:109
#2  0x00007ff52b7c6aa6 in pgml::vectors::cosine_similarity_s (vector=..., other=...) at src/vectors.rs:304
#3  0x00007ff52b7c6d9a in pgml::vectors::cosine_similarity_s_wrapper::cosine_similarity_s_wrapper_inner (_fcinfo=0x55fa5656e560) at src/vectors.rs:302
#4  0x00007ff52b4ae1c1 in pgml::vectors::cosine_similarity_s_wrapper::{closure#0} () at src/vectors.rs:302
#5  0x00007ff52b6edb8c in std::panicking::try::do_call<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (
    data=0x7ffe798f2828) at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:483
#6  0x00007ff52b6f0f6b in __rust_try.llvm.11079318101650794703 () from /home/pg/15/lib/postgresql/pgml.so
#7  0x00007ff52b6ea049 in std::panicking::try<pgrx_pg_sys::submodules::datum::Datum, pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}> (f=...)
    at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:447
#8  0x00007ff52b75a0f6 in std::panic::catch_unwind<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...)
    at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panic.rs:137
#9  0x00007ff52b765983 in pgrx_pg_sys::submodules::panic::run_guarded<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...) at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/pgrx-pg-sys-0.8.3/src/submodules/panic.rs:403
#10 0x00007ff52b77111c in pgrx_pg_sys::submodules::panic::pgrx_extern_c_guard<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...) at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/pgrx-pg-sys-0.8.3/src/submodules/panic.rs:380
#11 0x00007ff52b7c6c9d in pgml::vectors::cosine_similarity_s_wrapper (_fcinfo=0x55fa5656e560) at src/vectors.rs:302
#12 0x000055fa54ce4b43 in ExecInterpExpr ()
#13 0x000055fa54cf15a2 in ExecScan ()
#14 0x000055fa54d0c368 in ExecLimit ()
#15 0x000055fa54ce88a2 in standard_ExecutorRun ()

My box is a (humblebrag):

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen Threadripper 3970X 32-Core Processor
    CPU family:          23
    Model:               49
    Thread(s) per core:  2
    Core(s) per socket:  32
    Socket(s):           1
    Stepping:            0
    Frequency boost:     enabled
    CPU max MHz:         3700.0000
    CPU min MHz:         2200.0000
    BogoMIPS:            7386.30
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb 
                         rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 mo
                         vbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt t
                         ce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 s
                         mep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_loc
                         al clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter p
                         fthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization features: 
  Virtualization:        AMD-V
Caches (sum of all):     
  L1d:                   1 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    16 MiB (32 instances)
  L3:                    128 MiB (8 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-63
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Vulnerable
  Spec store bypass:     Vulnerable
  Spectre v1:            Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
  Spectre v2:            Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

With an nvidia RTX 4080:

  nvidia-debugdump -l
Found 1 NVIDIA devices
   Device ID:              0
   Device name:            NVIDIA GeForce RTX 4080   (*PrimaryCard)
   GPU internal ID:        GPU-b772ddf7-d413-e1bb-d1e1-8e7022c59343

Lev helped me discover that by commenting out this line in the build script:

println!("cargo:rustc-link-lib=static=openblas");

everything works:

[v15.1][8595] pgml=# select pgml.norm_l1(ARRAY[1,2,3]::real[]);
 norm_l1 
---------
       6
(1 row)

Time: 0.620 ms

This crash seems to be isolated to blas as I created 100k embeddings with pgml.embed() in a mere 7m 50s, using 4 parallel workers, even. So that part is good.

I had a thought that rebooting the computer might help since I had just stressed the GPU making all those embeddings, but naw, that didn't change anything.

A theory: since pgml links to so many libraries (directly and indirectly), maybe there's some kind of symbol-resolution problem and the wrong symbols are being called? Just a theory.

@thomcc might be able to offer some help with this if it's some kind of linking problem? Offering up his services as PostgresML's success is pgrx's success!

.env.docker overrides settings of .env

Hi,

Thank you for the project and your effort.

Issue: I was experimenting with deploying PostgresML on a workstation and accessing the dashboard from another machine. I bumped into the issue where I couldn't override DJANGO_ALLOWED_HOSTS. I tried to set environment variables through docker-compose.yml (see example below) and through .env file, but according to Django's debug/error page, DJANGO_ALLOWED_HOSTS (ALLOWED_HOSTS after parsing) was unchanged.

  dashboard:
    depends_on:
      - postgres
    build:
      context: ./pgml-dashboard/
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    environment:
      - DJANGO_ALLOWED_HOSTS
      - DJANGO_CSRF_TRUSTED_ORIGINS
    command:
      - python3
      - manage.py
      - runserver
      - 0.0.0.0:8000

Workaround: After digging into the code, I noticed that the file ./pgml-dashboard/docker/.env.docker overrides everything I define in the environment (in docker-compose.yaml) and in .env at the project root or in ./pgml-dashboard. Commenting out DJANGO_ALLOWED_HOSTS there solves the issue.

I see two solutions to the problem:

  1. Remove the ./pgml-dashboard/docker/.env.docker file and introduce a "global" dotenv at the root of the project. This way, environment variables would be passed from .env or the compose file into the containers. In docker-compose you may define fallback values if those are not defined anywhere else.
  2. Add documentation to README describing where to change variables for the dashboard.

Kind regards,
Gregor

refit_final_model_on_all_data param for cross validation

We already have k-fold cross validation (although it's not well documented; it's the folds argument to train). We could add another param, refit_final_model_on_all_data, and have it default to true, since if you are cross-validating, doing one more training run is only incrementally more expensive. I'm open to a more concise name for that param.

"could not find native static library dmlc" error on Centos8

Hi,

I am trying to install postgresML from source on my CentOS 8 box and am having a problem at the cargo pgx package step.
I have already installed Python xgboost and the other packages, but I am getting the "error: could not find native static library dmlc" error.
Can you please help me with this issue?

$ rustc --version
rustc 1.68.2 (9eb3afe9e 2023-03-27)
$ pip3 install xgboost lightgbm scikit-learn
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: xgboost in /usr/local/lib64/python3.9/site-packages (1.7.5)
Requirement already satisfied: lightgbm in /usr/local/lib/python3.9/site-packages (3.3.5)
Requirement already satisfied: scikit-learn in /usr/local/lib64/python3.9/site-packages (1.2.2)
Requirement already satisfied: numpy in /usr/local/lib64/python3.9/site-packages (from xgboost) (1.24.2)
Requirement already satisfied: scipy in /usr/local/lib64/python3.9/site-packages (from xgboost) (1.10.1)
Requirement already satisfied: wheel in /usr/local/lib/python3.9/site-packages (from lightgbm) (0.40.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.9/site-packages (from scikit-learn) (3.1.0)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.9/site-packages (from scikit-learn) (1.2.0)
$    

Error:

$ cargo pgx package
       Using PgConfig("pg15") and `pg_config` from /usr/pgsql-15/bin/pg_config
    Building extension with features python pg15
     Running command "cargo" "build" "--release" "--features" "python pg15" "--no-default-features" "--message-format=json-render-diagnostics"
   Compiling pyo3-build-config v0.17.3
   Compiling xgboost-sys v0.2.0 (https://github.com/postgresml/rust-xgboost.git?branch=master#8f3a8fb7)
   Compiling xgboost v0.2.0 (https://github.com/postgresml/rust-xgboost.git?branch=master#8f3a8fb7)
error: could not find native static library `dmlc`, perhaps an -L flag is missing?

error: could not compile `xgboost-sys` due to previous error
warning: build failed, waiting for other jobs to finish...
$ 

Verbose output:

   ...
   Compiling typetag v0.2.7
   Compiling linfa-linear v0.6.0 (/var/lib/pgsql/postgresml/pgml-extension/deps/linfa/algorithms/linfa-linear)
   Compiling rmp-serde v1.1.1
   Compiling blas-src v0.8.0
   Compiling csv v1.2.1
   Compiling linfa-svm v0.6.0 (/var/lib/pgsql/postgresml/pgml-extension/deps/linfa/algorithms/linfa-svm)
   Compiling linfa-logistic v0.6.0 (/var/lib/pgsql/postgresml/pgml-extension/deps/linfa/algorithms/linfa-logistic)
   Compiling pgx v0.7.4
   Compiling lightgbm v0.2.3 (https://github.com/postgresml/lightgbm-rs?branch=main#ab547686)
   Compiling xgboost v0.2.0 (https://github.com/postgresml/rust-xgboost.git?branch=master#8f3a8fb7)
error: could not find native static library `dmlc`, perhaps an -L flag is missing?

error: could not compile `xgboost-sys` due to previous error
warning: build failed, waiting for other jobs to finish...
$

ERROR: Service 'postgres' failed to build: Unknown flag: chown

Hi,
After cloning the repo, I get an error during docker-compose

cd postgresml && docker-compose up

Building postgres
Step 1/14 : FROM debian:bullseye-slim
---> c9cb6c086ef7
Step 2/14 : MAINTAINER [email protected]
---> Using cache
---> 3664ba4873e1
Step 3/14 : RUN apt-get update
---> Using cache
---> d923784e5c89
Step 4/14 : ARG DEBIAN_FRONTEND=noninteractive
---> Using cache
---> daef8f796ade
Step 5/14 : ENV TZ Etc/UTC
---> Using cache
---> 3af4c9b887c7
Step 6/14 : RUN apt-get install -y postgresql-plpython3-13 python3 python3-pip postgresql-13 tzdata sudo cmake
---> Using cache
---> 2e573258b766
Step 7/14 : RUN pip3 install xgboost sklearn diptest
---> Using cache
---> 3f93c7617c69
Step 8/14 : COPY --chown=postgres:postgres . /app
ERROR: Service 'postgres' failed to build: Unknown flag: chown

My version of Docker on my CentOS box:
docker version
Client:
Version: 1.13.1
API version: 1.26
Package version: docker-1.13.1-209.git7d71120.el7.centos.x86_64
Go version: go1.10.3
Git commit: 7d71120/1.13.1
Built: Wed Mar 2 15:25:43 2022
OS/Arch: linux/amd64

Server:
Version: 1.13.1
API version: 1.26 (minimum version 1.12)
Package version: docker-1.13.1-209.git7d71120.el7.centos.x86_64
Go version: go1.10.3
Git commit: 7d71120/1.13.1
Built: Wed Mar 2 15:25:43 2022
OS/Arch: linux/amd64
Experimental: false

signal 11: Segmentation fault

I've pulled a fresh version of postgresml from GitHub instead of using the Ubuntu package (from what I saw, it still doesn't have the preprocessing features) and managed to compile it and replace the .so library with the new one. But now it gives me a segfault even on the basic example:

2023-02-03 16:34:32.815 UTC [125591] LOG: database system is ready to accept connections
2023-02-03 16:34:44.698 UTC [125591] LOG: server process (PID 125611) was terminated by signal 11: Segmentation fault
2023-02-03 16:34:44.698 UTC [125591] DETAIL: Failed process was running: SELECT * FROM pgml.train(
project_name => 'Breast Cancer Detection',
task => 'classification',
relation_name => 'pgml.breast_cancer',
y_column_name => 'malignant'
)

ubuntu 22.04, postgresql 14

failed to get `linfa` as a dependency of package `pgml v2.1.2 (/app)`

Hi,

I am trying to build a Docker image just for pgml-extension. I am getting an error at 'RUN cargo pgx package'; I am trying to build this on Postgres 13 and Debian 10.

Step 30/35 : RUN cargo pgx package
 ---> Running in 48fc7645f156
Error:
   0: couldn't get cargo metadata
   1: `cargo metadata` exited with an error:     Updating crates.io index
          Updating git repository `https://github.com/postgresml/lightgbm-rs`
          Updating git submodule `https://github.com/microsoft/LightGBM/`
          Updating git submodule `https://gitlab.com/libeigen/eigen.git`
          Updating git submodule `https://github.com/lemire/fast_double_parser.git`
          Updating git submodule `https://github.com/abseil/abseil-cpp.git`
          Updating git submodule `https://github.com/google/double-conversion.git`
          Updating git submodule `https://github.com/fmtlib/fmt.git`
          Updating git submodule `https://github.com/boostorg/compute`
      error: failed to get `linfa` as a dependency of package `pgml v2.1.2 (/app)`

      Caused by:
        failed to load source for dependency `linfa`

      Caused by:
        Unable to update /app/deps/linfa

      Caused by:
        failed to read `/app/deps/linfa/Cargo.toml`

      Caused by:
        No such file or directory (os error 2)


Location:
   /var/lib/postgresql/.cargo/registry/src/github.com-1ecc6299db9ec823/cargo-pgx-0.6.0/src/metadata.rs:23

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ SPANTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

   0: cargo_pgx::command::package::execute
      at /var/lib/postgresql/.cargo/registry/src/github.com-1ecc6299db9ec823/cargo-pgx-0.6.0/src/command/package.rs:50

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

Input data structure and recursion error

First of all, I would like to preface that I have absolutely no experience, so my apologies in advance for the naivete.

I am seeing an error when attempting to train a model:

Query 1 ERROR: ERROR:  RecursionError: maximum recursion depth exceeded in comparison
CONTEXT:  Traceback (most recent call last):
  PL/Python function "train", line 4, in <module>
    status = train(
  PL/Python function "train", line 794, in train
  PL/Python function "train", line 650, in fit
  PL/Python function "train", line 363, in data
  PL/Python function "train", line 32, in flatten
  PL/Python function "train", line 33, in flatten
  PL/Python function "train", line 33, in flatten
  PL/Python function "train", line 33, in flatten
  PL/Python function "train", line 33, in flatten
  PL/Python function "train", line 33, in flatten
  PL/Python function "train", line 33, in flatten

Data

My database view returns two columns

"{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}",1

Where vector is an Integer[] column containing 1200 values and result is an integer.

I am seeing the error when I call the following:

SELECT * FROM pgml.train(
  'Project - category:build',
  'classification',
  'category_build',
  'result'
)

Am I structuring my data incorrectly? Can I use an integer array to provide the training data?

Thank you very much for your patience and assistance!

XGBoost & LightGBM on Apple M1 & M2

XGBoost and LightGBM won't compile (or segfault) on Apple M1 and M2 aarch64. The issue comes from OpenMP, which appears to break and requires special workarounds that seem unreliable.

XGBoost issue: dmlc/xgboost#7039

Train/Inference between unevenly matched replicas

Hello,

I'm Ilya from The Stone Cross Foundation of Ukraine, a data-driven humanitarian volunteer cooperative tasked with solving some of the issues charities in Ukraine are having during these tumultuous times. We discovered this project via HN a couple of weeks back and have been evaluating it ever since. We're using a high-availability variant of TimescaleDB, another Postgres extension, which provides time-series and some advanced capabilities, the so-called hypertables and hyperfunctions. This is where we keep all our data, including but not limited to geospatial data, analytics, and everything related to the day-to-day operations of our charity, the pleas and the volunteers we're working with.

For us, the primary area of interest is the following: How does postgresml store the trained models, and can these models be streaming-replicated to the other replicas in order to perform inference afterwards? Our database instances are not particularly beefy on the CPU side, and it wouldn't be cost-effective to have them so, but we would really love to perform some regressions on our data in the following way:

  1. Provision a training replica on a highly-performant VM from the latest backup.
  2. (Inquiry: Does pgml's backend implementation in Python support GPU devices?)
  3. Give it time to get up to speed with the latest changes, WAL, etc.
  4. Construct a materialised corpus of our data to be used in training.
  5. pgml.train()
  6. Have other replicas pick it up and perform inference on it.
  7. The training replica can be terminated until the next time we would want to re-train the model.

I'm wondering whether this approach is even possible, or if it's too far-out?

This could potentially significantly simplify our data analysis pipeline. Instead of having to worry about ingest, provisioning of the dedicated machine learning services, their inference APIs exposure, and many other things required to do basic machine learning, we could just play around with our Postgres-derived replicas and perform these scheduled training sessions without having to do any of that!

In our case this could translate to relief of actual human suffering.

Best regards,
Ilya

consider a client connection as an alternative to being an extension

Being a Python-linking extension introduces some interesting problems, such as Python versions changing with OS upgrades.

Perhaps allowing for operation as a client -- but with a suggestion to run adjacent to the database instance, via container -- would improve adoption.

how can I set some filters on the dataset table to use for training

We are using SQL tables to fetch data for training. Is there support for training on only the last two days of data and ignoring older data?
One way I could imagine doing this is with a SQL view (a sketch follows below), but that needs some extra lines; having this support in the train call would be great.
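A minimal sketch of the view-based workaround described above; the table and column names are hypothetical:

-- Hypothetical view restricting training data to the last two days
CREATE VIEW recent_training_data AS
SELECT *
FROM my_training_table
WHERE created_at > NOW() - INTERVAL '2 days';

-- Train against the view as usual
SELECT * FROM pgml.train(
    project_name => 'My Project',
    task => 'classification',
    relation_name => 'recent_training_data',
    y_column_name => 'label'
);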

Support using materialized views

Materialized views are recorded in a separate metadata table and not in information_schema.tables, so when one tries to use them in pgml.train(), they would get a "table not found" error.

A quick workaround is to create a regular view around the materialized view:

CREATE VIEW my_view AS SELECT * FROM my_mat_view;

Unable to use Categorical columns as target or input fields for classification and regression

For classification, if the target is categorical, the model does not get trained and throws an assertion error. It expects the categories to be numbers such as 0, 1, 2, etc.
For regression, the model does get trained with categorical input fields, but when predicting it does not accept categorical values.
How can categorical columns with string values be incorporated as input or target? (A possible encoding workaround is sketched below.)
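One common workaround, sketched here under the assumption that the category set is known in advance (all names hypothetical), is to encode string labels as integers in a view before training:

-- Hypothetical integer encoding of a string-valued target
CREATE VIEW encoded_training_data AS
SELECT
    feature_1,
    feature_2,
    CASE category
        WHEN 'red'   THEN 0
        WHEN 'green' THEN 1
        WHEN 'blue'  THEN 2
    END AS target
FROM raw_training_data;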

What is happening?

SELECT 'id_inferencia', pgml.predict( 2, ARRAY [ 1.84,0.05565,133,133,16739,1.38,0.006346 ]) AS prediccion;
->
 ?column? (text) | prediccion (real)
------------------+-------------------
 id_inferencia    |                 1

SELECT 'id_inferencia', pgml.predict( 2, ARRAY[ 'var0', 'var1', 'var2', 'var3', 'var4', 'var5', 'var6' ]) AS prediccion
FROM pgml.vwdata;

ERROR: function pgml.predict(integer, text[]) does not exist
LINE 1: SELECT id_inferencia, pgml.predict( 2, ARRAY[ 'var0', 'var1'...
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
SQL state: 42883
Character: 23

I'm a bit lost ...

TNX !!

Add authentication support to the dashboard

Since the application is open, we tried to deploy it behind Nginx and Traefik using basic authorization. After setting up the authorization, the Next button doesn't work when we try to create a new project. But when basic authorization is disabled in both Nginx and Traefik, the Next button works while creating a project. Attaching the screenshots for your reference.

  1. Failed to load resource: the server responded with a status of 403 () new-project.js:83
  2. Uncaught (in promise) TypeError: Cannot read properties of undefined (reading 'length') at new-project.js:83:26

auth

error posgresml

It would be great if you could advise on how we can secure postgresml with an authentication service.

unable to locate postgresml-13

When I run apt-get update && apt-get install -y postgresql-pgml-13, I get the error E: Unable to locate package postgresql-pgml-13

Could you please check on this?

xgboost gpu work ?

My attempt:

SELECT * FROM pgml.train(
project_name => 'totalvalor',
task => 'classification',
relation_name => 'vtotalvalor',
y_column_name => 'id_inferencia',
algorithm => 'xgboost',
hyperparams => '{"tree_method" : "gpu_hist"}'
);

Results:

INFO: Snapshotting table "vtotal", this may take a little while...
INFO: Validating relation: vtotal
INFO: Validating relation: vtotal
INFO: Snapshot of table "vtotal" created and saved in "pgml"."snapshot_1"
INFO: Dataset { num_features: 6, num_labels: 1, num_distinct_labels: 3, num_rows: 1947801, num_train_rows: 1460851, num_test_rows: 486950 }
INFO: Training Model { id: 1, algorithm: xgboost, runtime: rust }
INFO: Hyperparameter searches: 1, cross validation folds: 1
INFO: Hyperparams: {
"tree_method": "gpu_hist"
}
ERROR: called Result::unwrap() on an Err value: XGBError { desc: "[07:58:02] /app/target/release/build/xgboost-sys-8117c1510d0b1933/out/xgboost/src/gbm/../common/common.h:239: XGBoost version not compiled with GPU support.\nStack trace:\n [bt] (0) /usr/lib/postgresql/12/lib/pgml.so(+0x1af9c68) [0x7fbe2f48ac68]\n [bt] (1) /usr/lib/postgresql/12/lib/pgml.so(+0x1af9d01) [0x7fbe2f48ad01]\n [bt] (2) /usr/lib/postgresql/12/lib/pgml.so(+0x1af9dbd) [0x7fbe2f48adbd]\n [bt] (3) /usr/lib/postgresql/12/lib/pgml.so(+0x1b10460) [0x7fbe2f4a1460]\n [bt] (4) /usr/lib/postgresql/12/lib/pgml.so(+0x19c665c) [0x7fbe2f35765c]\n [bt] (5) /usr/lib/postgresql/12/lib/pgml.so(+0x19b4868) [0x7fbe2f345868]\n [bt] (6) /usr/lib/postgresql/12/lib/pgml.so(+0x19460ad) [0x7fbe2f2d70ad]\n [bt] (7) /usr/lib/postgresql/12/lib/pgml.so(+0x2e594c) [0x7fbe2dc7694c]\n [bt] (8) /usr/lib/postgresql/12/lib/pgml.so(+0x150101) [0x7fbe2dae1101]\n\n" }
CONTEXT: src/bindings/xgboost.rs:211:43
SQL state: XX000

... But Python xgboost works perfectly with GPU on another test model ...

postgresql-pgml-12 version 2.0.2

TNX !!

docker-compose up fails with 11 errors

Tested on master and v2.2.0.

$ docker-compose up
Sending build context to Docker daemon  339.4kB
Step 1/4 : FROM rust:1
 ---> d4572ea67e7e
Step 2/4 : COPY . /app
 ---> Using cache
 ---> d97b737197ef
Step 3/4 : WORKDIR /app
 ---> Using cache
 ---> 07ba01bbe2ce
Step 4/4 : RUN cargo build
 ---> Running in 9707e2dc17f0
    Updating crates.io index
 Downloading crates ...
  Downloaded adler v1.0.2
.
.
.
Compiling pgml-dashboard v2.2.0 (/app)
error: failed to find data for query SELECT * FROM pgml.notebooks WHERE id = $1
  --> src/models.rs:89:13
   |
89 |             sqlx::query_as!(Notebook, "SELECT * FROM pgml.notebooks WHERE id = $1", id,)
   |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query_as` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query INSERT INTO pgml.notebooks (name) VALUES ($1) RETURNING *
   --> src/models.rs:96:12
    |
96  |           Ok(sqlx::query_as!(
    |  ____________^
97  | |             Notebook,
98  | |             "INSERT INTO pgml.notebooks (name) VALUES ($1) RETURNING *",
99  | |             name,
100 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query_as` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query SELECT * FROM pgml.notebooks
   --> src/models.rs:106:12
    |
106 |         Ok(sqlx::query_as!(Notebook, "SELECT * FROM pgml.notebooks")
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query_as` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query SELECT * FROM pgml.notebook_cells
                       WHERE notebook_id = $1
                       AND deleted_at IS NULL
                   ORDER BY cell_number
   --> src/models.rs:112:12
    |
112 |           Ok(sqlx::query_as!(
    |  ____________^
113 | |             Cell,
114 | |             "SELECT * FROM pgml.notebook_cells
115 | |                 WHERE notebook_id = $1
...   |
118 | |             self.id,
119 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query_as` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query UPDATE pgml.notebook_cells
                       SET
                       execution_time = NULL,
                       rendering = NULL
                   WHERE notebook_id = $1
                   AND cell_type = $2
   --> src/models.rs:125:17
    |
125 |           let _ = sqlx::query!(
    |  _________________^
126 | |             "UPDATE pgml.notebook_cells
127 | |                 SET
128 | |                 execution_time = NULL,
...   |
133 | |             CellType::Sql as i32,
134 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query
                   WITH
                       lock AS (
                           SELECT * FROM pgml.notebooks WHERE id = $1 FOR UPDATE
                       ),
                       max_cell AS (
                           SELECT COALESCE(MAX(cell_number), 0) AS cell_number
                           FROM pgml.notebook_cells
                           WHERE notebook_id = $1
                           AND deleted_at IS NULL
                       )
                   INSERT INTO pgml.notebook_cells
                       (notebook_id, cell_type, contents, cell_number, version)
                   VALUES
                       ($1, $2, $3, (SELECT cell_number + 1 FROM max_cell), 1)
                   RETURNING id,
                           notebook_id,
                           cell_type,
                           contents,
                           rendering,
                           execution_time,
                           cell_number,
                           version,
                           deleted_at
   --> src/models.rs:187:12
    |
187 |           Ok(sqlx::query_as!(
    |  ____________^
188 | |             Cell,
189 | |             "
190 | |             WITH
...   |
215 | |             contents,
216 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query_as` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query SELECT
                           id,
                           notebook_id,
                           cell_type,
                           contents,
                           rendering,
                           execution_time,
                           cell_number,
                           version,
                           deleted_at
                       FROM pgml.notebook_cells
                       WHERE id = $1

   --> src/models.rs:222:12
    |
222 |           Ok(sqlx::query_as!(
    |  ____________^
223 | |             Cell,
224 | |             "SELECT
225 | |                     id,
...   |
237 | |             id,
238 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query_as` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query UPDATE pgml.notebook_cells
                   SET
                       cell_type = $1,
                       contents = $2,
                       version = version + 1
                   WHERE id = $3
   --> src/models.rs:252:17
    |
252 |           let _ = sqlx::query!(
    |  _________________^
253 | |             "UPDATE pgml.notebook_cells
254 | |             SET
255 | |                 cell_type = $1,
...   |
261 | |             self.id,
262 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query UPDATE pgml.notebook_cells
                   SET deleted_at = NOW()
                   WHERE id = $1
                   RETURNING id,
                           notebook_id,
                           cell_type,
                           contents,
                           rendering,
                           execution_time,
                           cell_number,
                           version,
                           deleted_at
   --> src/models.rs:270:12
    |
270 |           Ok(sqlx::query_as!(
    |  ____________^
271 | |             Cell,
272 | |             "UPDATE pgml.notebook_cells
273 | |             SET deleted_at = NOW()
...   |
284 | |             self.id
285 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query_as` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query UPDATE pgml.notebook_cells SET rendering = $1 WHERE id = $2
   --> src/models.rs:339:9
    |
339 | /         sqlx::query!(
340 | |             "UPDATE pgml.notebook_cells SET rendering = $1 WHERE id = $2",
341 | |             rendering,
342 | |             self.id
343 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query INSERT INTO pgml.uploaded_files (id, created_at) VALUES (DEFAULT, DEFAULT)
                       RETURNING id, created_at
   --> src/models.rs:798:12
    |
798 |           Ok(sqlx::query_as!(
    |  ____________^
799 | |             UploadedFile,
800 | |             "INSERT INTO pgml.uploaded_files (id, created_at) VALUES (DEFAULT, DEFAULT)
801 | |                 RETURNING id, created_at"
802 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query_as` (in Nightly builds, run with -Z macro-backtrace for more info)

error: could not compile `pgml-dashboard` due to 11 previous errors
warning: build failed, waiting for other jobs to finish...
1 error occurred:
	* Status: The command '/bin/sh -c cargo build' returned a non-zero code: 101, Code: 101
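
These "failed to find data for query ..." errors come from sqlx's compile-time query checking: the query_as! and query! macros validate each query either against a live database reachable via DATABASE_URL, or against a cached offline query file when SQLX_OFFLINE is set, and this particular message means the query is missing from that cache. A minimal sketch of regenerating the cache with sqlx-cli, assuming a locally running pgml database (the connection string below is hypothetical):

# install sqlx-cli, which provides the `cargo sqlx` subcommand
cargo install sqlx-cli
# point the macros at a live database (hypothetical connection string)
export DATABASE_URL=postgres://postgres@localhost:5432/pgml
# regenerate the offline query cache, then rebuild
cargo sqlx prepare
cargo build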

Test scripts missing?

docker-compose up results in:

...
postgres_1   | psql:tests/test.sql:17: error:  target | prediction 
postgres_1   | --------+------------
postgres_1   |       0 |          0
postgres_1   |       1 |          1
postgres_1   |       2 |          2
postgres_1   |       3 |          3
postgres_1   |       4 |          4
postgres_1   |       5 |          5
postgres_1   |       6 |          6
postgres_1   |       7 |          7
postgres_1   |       8 |          8
postgres_1   |       9 |          9
postgres_1   | (10 rows)
postgres_1   | 
postgres_1   | Time: 2.516 ms
postgres_1   | tests/joint_regression.sql: No such file or directory
postgresml_postgres_1 exited with code 3

Dashboard snapshot analysis is not static

The snapshot samples, correlations, and data size shown in the dashboard are computed from the live table, not from a point-in-time snapshot taken at model training. This also manifests as an error when viewing a snapshot in the dashboard after the original table has been dropped, as in the sketch below.
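
A hypothetical repro sketch (table, project, and column names invented): train on a small table, drop it, then open the model's snapshot page in the dashboard.

-- create a toy training table (volatile random() is evaluated per row)
CREATE TABLE demo AS
SELECT generate_series(1, 100)::REAL AS feature,
       (random() > 0.5)::INTEGER AS target;

-- train a classifier against it
SELECT * FROM pgml.train('Snapshot Demo', 'classification', 'demo', 'target');

-- drop the source table; the snapshot page for this model now errors,
-- because its samples and correlations are recomputed from the live table
DROP TABLE demo;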

Handling categorical variables

SELECT * FROM pgml.train(
    'pk20102022',
    'classification',
    'viewdata',
    'inferencia',
    'xgboost'
);

INFO: Snapshotting table "viewdata", this may take a little while...
ERROR: unhandled type: text for inferencia
CONTEXT: src/orm/snapshot.rs:288:25
SQL state: XX000

The 'inferencia' column can take one of three values: reliable, neutral, or possible.

Any ideas? Thanks!
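
A hypothetical workaround sketch until text labels are handled natively: encode the categorical column as an integer in a view and train against that (the feature column names below are invented):

CREATE VIEW viewdata_encoded AS
SELECT
    feature_1,  -- stand-ins for the real feature columns in viewdata
    feature_2,
    CASE inferencia
        WHEN 'reliable' THEN 0
        WHEN 'neutral'  THEN 1
        WHEN 'possible' THEN 2
    END AS inferencia
FROM viewdata;

SELECT * FROM pgml.train(
    'pk20102022',
    'classification',
    'viewdata_encoded',
    'inferencia',
    'xgboost'
);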

Friendly error for HuggingFace models that don't exist

select pgml.embed('intfloat/e5_large', 'this is a test');

INFO:  Cache miss for "intfloat/e5_large", loading transformer, please wait
ERROR:  called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'huggingface_hub.utils._errors.RepositoryNotFoundError'>, value: RepositoryNotFoundError('401 Client Error. (Request ID: Root=1-6453dc33-0113b3d85bfa81fa2711c0f2)\n\nRepository Not Found for url: https://huggingface.co/api/models/intfloat/e5_large.\nPlease make sure you specified the correct `repo_id` and `repo_type`.\nIf you are trying to access a private or gated repo, make sure you are authenticated.\nInvalid username or password.'), traceback: Some(<traceback object at 0x7f72f59c4ec0>) }

What actually happened here is that the model name is spelled with a dash, not an underscore (intfloat/e5-large, not intfloat/e5_large). We should return a friendlier error that tells the user the repository was not found, instead of surfacing a raw Python exception.
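
For reference, the corrected call with the dash-spelled model id from the Hub should load the model (assuming network access to Hugging Face):

SELECT pgml.embed('intfloat/e5-large', 'this is a test');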

Models view in dashboard throwing error

When visiting http://localhost:8000/models/81 I receive the following error. This happens on every model's show page.

KeyError at /models/81
'image_p50'

Request URL:        http://localhost:8000/models/81
Django version:     4.0.4
Exception type:     KeyError
Exception value:    'image_p50'
Exception location: /app/app/views/models.py, line 35, in <dictcomp>
Python executable:  /usr/local/bin/python3
Python version:     3.10.4
Python path:        ['/app', '/usr/local/lib/python310.zip', '/usr/local/lib/python3.10', '/usr/local/lib/python3.10/lib-dynload', '/usr/local/lib/python3.10/site-packages', '/app/..']
