GithubHelp home page GithubHelp logo

sinanw / llm-security-prompt-injection Goto Github PK

View Code? Open in Web Editor NEW
11.0 3.0 4.0 2.81 MB

This project investigates the security of large language models by performing binary classification of a set of input prompts to discover malicious prompts. Several approaches have been analyzed using classical ML algorithms, a trained LLM model, and a fine-tuned LLM model.

License: MIT License

Jupyter Notebook 100.00%
cybersecurity llm-prompting llm-security prompt-injection transformers-models

llm-security-prompt-injection's Introduction

Security of Large Language Models (LLM) - Prompt Injection Classification

In this project, we investigate the security of large language models in terms of prompt injection attacks. Primarily, we perform binary classification of a dataset of input prompts in order to discover malicious prompts that represent injections.

In short: prompt injections aim at manipulating the LLM using crafted input prompts to steer the model into ignoring previous instructions and, thus, performing unintended actions.

To do so, we analyzed several AI-driven mechanisms to do the classification task, we particularly examined 1) classical ML algorithms, 2) a pre-trained LLM model, and 3) a fine-tuned LLM model.

Data Set (Deepset Prompt Injection Dataset)

The dataset used in this demo is: Prompt Injection Dataset provided by deepset, an AI company specialized in offering tools to build NLP-driven applications using LLMs.

  • The dataset combines hundreds of samples of both normal and manipulated prompts labeled as injections.
  • It contains prompts mainly in English, along with some other prompts translated into other languages, primarily German.
  • The original data set is already split into training and holdout subsets. We maintained this split across the multiple experiments to compare results using a unified testing benchmark.

METHOD 1 - Classification Using Traditional ML

Corresponding notebook: ml-classification.ipynb

Analysis steps:

  1. Loading the dataset from HuggingFace library and exploring it.
  2. Tokenizing prompt texts and generating embeddings using the multilingual BERT (Bidirectional Encoder Representations from Transformers) LLM model.
  3. Training the following ML algorithms on the downstream prompt classification task:
  4. Analyzing and comparing the performance of classification models.
  5. Investigating incorrect predictions of the best-performing model.

Results:

Accuracy Precision Recall F1 Score
Naive Bayes 88.79% 87.30% 91.67% 89.43%
Logistic Regression 96.55% 100.00% 93.33% 96.55%
Support Vector Machine 95.69% 100.00% 91.67% 95.65%
Random Forest 89.66% 100.00% 80.00% 88.89%

METHOD 2 - Classification Using a Pre-trained LLM (XLM-RoBERTa)

Corresponding notebook: llm-classification-pretrained.ipynb

Analysis steps:

  1. Loading the dataset from HuggingFace library.
  2. Loading the pre-trained XLM-RoBERTa model, the multilingual version of RoBERTa, the enhanced version of BERT, from HuggingFace library.
  3. Using HuggingFace zero-shot classification pipeline and XLM-RoBERTa to perform prompt classification on the testing dataset (without fine-tuning).
  4. Analyzing classification results and model performance.

Results:

Accuracy Precision Recall F1 Score
Testing Data 55.17% 55.13% 71.67% 62.32%

METHOD 3 - Classification Using a Fine-tuned LLM (XLM-RoBERTa)

Corresponding notebook: llm-classification-finetuned.ipynb

Analysis steps:

  1. Loading the dataset from HuggingFace library.
  2. Loading the pre-trained XLM-RoBERTa model, the multilingual version of RoBERTa, the enhanced version of BERT, from HuggingFace library.
  3. Fine-tuning XLM-RoBERTa to perform prompt classification on the training dataset.
  4. Analyzing the fine-tuning accuracy across 5 epochs on the testing dataset.
  5. Analyzing the final model accuracy and its performance, and comparing it with previous experiments.

Results:

Epoch Accuracy Precision Recall F1
1 62.93% 100.00% 28.33% 44.16%
2 91.38% 100.00% 83.33% 90.91%
3 93.10% 100.00% 86.67% 92.86%
4 96.55% 100.00% 93.33% 96.55%
5 97.41% 100.00% 95.00% 97.44%

llm-security-prompt-injection's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

llm-security-prompt-injection's Issues

Fine tuning

Hello,
Thank you for your great work.
I have one issue when trying fine tuning on google colab, I am getting warning that it needs 120GB VRAM, while strongest GPU on google colab in selection has only 40GB VRAM, can you please tell me how you manage to run it?

Data set issue

Try taking larger dataset as this dataset is too small for classifying the data

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.