GithubHelp home page GithubHelp logo

pdf-keyword-extraction-with-mistral's Introduction

PDF-Keyword-Extraction-with-Mistral

To extract the keyword from PDF, I use KeyLLM, an extension to KeyBERT for extracting keywords with Large Language Models like Mistral.

1. Data Preprocessing

First use PyPDF2 to extract text contents and the page number from pdf, then do the cleaning to remove characters like specific quotation marks.

2. Keyword Extraction with Mistral

2.1 Load the model

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=200,
    device_map="auto",
    hf=True,
    config=config
)

Here I load mistral 7B. It is a small language model but really accurate and instruct is a chat based model. gpu_layers=200 offload some of its layers to be used on the GPU.

config.config.context_length = 4096 Because the text contents in pdf are usually very long, so make sure to set this value very large.

login(token="---")

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Pipeline
generator = pipeline(
    model=model.to('cuda'), tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1
)

For security, I hide the login token of my huggingface. Tokenizer is also from Mistral 7B and max_new_tokens is set to 50 to tell the model to only generate 50 new tokens at a max because we just want a number of keywords.

2.2 Prompt template

The prompt template is as follows (https://newsletter.maartengrootendorst.com/p/introducing-keyllm-keyword-extraction):

img

  • Example prompt - This will be used to show the LLM what a “good” output looks like
  • Keyword prompt - This will be used to ask the LLM to extract the keywords

We can show the large language model an example of how it should extract keywords and then it repeats that. Example prompt shows how a conversation should go. After that, Keyword prompt essentially is the samething but this time we can plug in any document that we want and our documents pdf for which we wnat our keywords to be extracted.

2.3 KeyLLM

With the document and prompt, load them and model into KeyLLM

# Load it in KeyLLM
llm = TextGeneration(generator, prompt=prompt)
kw_model = KeyLLM(llm)

for grouped_text in grouped_texts_list:
    keywords = kw_model.extract_keywords(grouped_text); 
    keywords_list = keywords_list + keywords

When the amount of text is particularly large, inputting the entire text directly into the model will cause the processing to be very slow, so here the text is input into the model in sections and a keywords list is obtained.

One problem with LLM extract keywords is that the keywords it outputs may not exist in the original text, so the last step is to remove these words and output the page number where each keyword is located.

pdf-keyword-extraction-with-mistral's People

Contributors

stanford997 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.