navy10021 / krlawgpt

KRLawGPT : Generative Pre-trained Transformer for producing Korean Legal Text

Python 100.00%
gpt nlp-machine-learning large-language-models legal transformers text-embeddings text-generator

Generative Pre-trained Transformer for producing Korean Legal Text

Abstract:

In this work, we introduce KRLawGPT, a Generative Pre-trained Transformer (GPT) tailored for producing Korean legal text. KRLawGPT is a neural language model built on a decoder-only transformer and designed to generate expressive, relevant Korean legal text. It is pre-trained on a comprehensive Korean legal dataset, CKLC (Clean Korean Legal Corpus), and can be applied to both natural language generation and natural language processing tasks. This work also describes how the model can be trained on user-supplied text data, which extends its use beyond legal text.

1. Model Description

1.1. Generative Pre-trained Transformer (GPT) for Legal Texts

KRLawGPT is a language model built specifically for generating Korean legal text. It uses a decoder-only transformer trained on the large-scale legal dataset CKLC, which enables it to generate human-like, sophisticated legal text. Given input text, KRLawGPT can perform both natural language generation and natural language processing tasks.
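
To make "decoder-only" concrete, here is a minimal sketch of causal (masked) attention weights in plain Python. It is not taken from the KRLawGPT code; the function names and the toy scores are assumptions chosen for illustration. The point is that each position can attend only to itself and earlier positions, which is what lets the model generate text left to right.

```python
import math

def causal_mask(n):
    """Return an n x n mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out (future) positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        # Subtract the max of the allowed scores for numerical stability.
        top = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy attention scores for a 3-token sequence (hypothetical values).
scores = [[0.0, 1.0, 2.0],
          [1.0, 1.0, 1.0],
          [0.5, 0.5, 0.5]]
attn = masked_softmax(scores, causal_mask(3))
```

In this toy case the first token attends only to itself, the second splits its weight over the first two tokens, and no row assigns weight to a later position.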

1.2. Model Flexibility and Integration

The model is built with flexibility in mind: users can either pre-train KRLawGPT with its own GPT model or reuse the tokenizers and parameters of other GPT-based Pre-trained Language Models (PLMs) such as GPT-2/3 or KoGPT. KRLawGPT also supports training and optimization on user-provided text data, which extends its use beyond the legal domain.

2. Model Usage

STEP 1. Loading Text Data and Building Vocabulary

The first step creates a split dataset (train.bin and val.bin) in the 'data' directory and builds a vocabulary. Users can also set an option to use other GPT-based tokenizers instead.

$ python model/vocab.py
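
The internals of model/vocab.py are not shown here, so the following is only a sketch of what building a vocabulary and writing train.bin and val.bin could look like. It assumes a character-level vocabulary, a 90/10 split, and token IDs stored as unsigned 16-bit integers; all function names are hypothetical.

```python
from array import array
from pathlib import Path
import tempfile

def build_vocab(text):
    """Map each distinct character to an integer ID (sorted for determinism)."""
    chars = sorted(set(text))
    return {ch: i for i, ch in enumerate(chars)}

def encode(text, stoi):
    """Convert text to a list of token IDs using the vocabulary."""
    return [stoi[ch] for ch in text]

def write_split(text, out_dir, train_frac=0.9):
    """Encode text and write train.bin / val.bin (assumes fewer than 65536 tokens in the vocab)."""
    stoi = build_vocab(text)
    ids = encode(text, stoi)
    cut = int(len(ids) * train_frac)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, chunk in (("train.bin", ids[:cut]), ("val.bin", ids[cut:])):
        with open(out_dir / name, "wb") as f:
            array("H", chunk).tofile(f)  # "H" = unsigned short
    return stoi, cut

# Tiny stand-in corpus; a real run would read the legal text files instead.
sample = "법원은 원고의 청구를 기각한다. " * 10
data_dir = Path(tempfile.mkdtemp()) / "data"
stoi, cut = write_split(sample, data_dir)
```

Storing IDs in a flat binary file keeps loading cheap during training, at the cost of needing the same vocabulary at generation time to decode the IDs back into text.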

To use another GPT-based tokenizer, set --using_LLMs=True.

$ python model/vocab.py --using_LLMs=True

STEP 2. Pre-training KRLawGPT on Specific Text Data

This step trains the KRLawGPT model, saves the model that performs best on the validation dataset, and writes KRLawGPT.pt and KRLawGPT_state_dict.pt to the 'output' directory. Users can also start from pre-trained models on Hugging Face by setting specific parameters.

$ python model/train.py
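
The "save the best-performing model" behavior can be sketched as follows. This is an illustration of the keep-best checkpointing pattern, not the code in model/train.py: the loss values, the file name best.pkl, and the use of pickle are assumptions. In a PyTorch script the write would typically be a call such as torch.save(model.state_dict(), path).

```python
import pickle
import tempfile
from pathlib import Path

def train_with_best_checkpoint(val_losses, state_for_epoch, out_path):
    """Save a checkpoint only when validation loss improves; return the best loss."""
    best = float("inf")
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best:
            best = val_loss
            # Overwrite the previous checkpoint with the improved one.
            with open(out_path, "wb") as f:
                pickle.dump({"epoch": epoch, "val_loss": val_loss,
                             "state": state_for_epoch(epoch)}, f)
    return best

# Hypothetical per-epoch validation losses standing in for a real training loop.
losses = [2.9, 2.1, 2.4, 1.8, 1.9]
out_dir = Path(tempfile.mkdtemp()) / "output"
out_dir.mkdir()
ckpt_path = out_dir / "best.pkl"
best = train_with_best_checkpoint(losses, lambda e: {"weights": [e]}, ckpt_path)

# Reload the checkpoint to confirm which epoch was kept.
with open(ckpt_path, "rb") as f:
    saved = pickle.load(f)
```

Because the file is rewritten only on improvement, the checkpoint on disk always corresponds to the lowest validation loss seen so far, even if later epochs get worse.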

To reuse the parameters and weights of an already trained GPT model from Hugging Face, set --using_LLMs=True and pass the name of a GPT-based pre-trained model with --model_type (for example, --model_type='kogpt'). The default is kogpt-2.

$ python model/train.py --using_LLMs=True --model_type='kogpt'
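
How the scripts parse these flags is not shown in this README. As a sketch only, a parser for the two options could look like the one below. The flag names and the kogpt-2 default come from the text above; the str2bool helper and build_parser are hypothetical. The helper is there because argparse's type=bool treats any non-empty string, including "False", as True.

```python
import argparse

def str2bool(value):
    """Parse common textual booleans explicitly instead of relying on bool()."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")

def build_parser():
    # Hypothetical parser; the real scripts may define these options differently.
    parser = argparse.ArgumentParser()
    parser.add_argument("--using_LLMs", type=str2bool, default=False)
    parser.add_argument("--model_type", default="kogpt-2")
    return parser

# Equivalent to: python model/train.py --using_LLMs=True --model_type=kogpt
args = build_parser().parse_args(["--using_LLMs=True", "--model_type=kogpt"])
```

With a parser like this, both --using_LLMs=True and --using_LLMs True are accepted, while a form with spaces around the equals sign would be split by the shell into separate arguments and rejected.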

STEP 3. Generate Legal Text

Users can enter a short word or sentence as a prompt, and the pre-trained KRLawGPT model generates large volumes of relevant, sophisticated, judge-like Korean legal text from it.

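# Import the generation entry point from the repository's model package.
from model.generate_legal_text import legal_text_generator

# Read a start prompt from the user and generate legal text from it.
input_text = input(">> Enter your start prompt :")
legal_text_generator(input_text)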
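
What legal_text_generator does internally is not shown here. GPT-style models usually generate text autoregressively, so the sketch below shows that general loop with temperature scaling and optional top-k filtering. The toy_logits function is a stand-in for a trained model, and every name in the block is an assumption.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token ID from logits after temperature scaling and optional top-k filtering."""
    # Rank token IDs from highest to lowest logit, then keep the top k if requested.
    ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        ids = ids[:top_k]
    scaled = [logits[i] / temperature for i in ids]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]  # unnormalized softmax
    return rng.choices(ids, weights=weights, k=1)[0]

def generate(prompt_ids, next_logits, max_new_tokens, **kwargs):
    """Repeatedly append one sampled token; next_logits maps a sequence to logits."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        ids.append(sample_next(next_logits(ids), **kwargs))
    return ids

# Toy stand-in for a trained model: always prefers (last token + 1) mod 5.
def toy_logits(ids):
    return [5.0 if t == (ids[-1] + 1) % 5 else 0.0 for t in range(5)]

out = generate([0], toy_logits, max_new_tokens=4, top_k=1)
```

With top_k=1 the loop is greedy and deterministic; a larger top_k or a higher temperature trades predictability for variety in the generated text.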

3. Sample visualization

[Figure: sample generation output]

4. Development

  • Seoul National University NLP Labs
  • Under the guidance of Navy Lee
