LegalQA using SentenceKoBART

An implementation of a legal QA system based on SentenceKoBART.

How to train SentenceKoBART
Based on the neural search engine Jina v2.0
Provides Korean legal QA data (1,830 pairs)
Applies approximate KNN search with Faiss, Annoy, and Hnswlib

Setup

# Install Git LFS: https://github.com/git-lfs/git-lfs/wiki/Installation
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt install git-lfs
git clone https://github.com/haven-jeon/LegalQA.git
cd LegalQA
git lfs pull
# If the LFS quota is exceeded, download the model with the commands below.
# wget http://gogamza.ipdisk.co.kr:80/gogamzapubs/VOL1/URLs/models/SentenceKoBART.bin
# mv SentenceKoBART.bin model/
pip install -r requirements.txt

Index

python app.py -t index

GPU-based indexing is available as an option. To enable it, set device: cuda in pods/encode.yml.

Train

SentenceKoBART is not fine-tuned on the legal task, so it gives good recall but needs help with precision. Re-ranking the top-k results with a cross-encoder makes up for that loss in precision.

Model : Ranking for general purpose
Learn to Rank : Ranking for task specific purpose
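The two-stage idea above (retrieve broadly, then re-rank the top-k) can be sketched in a few lines. This is only an illustration, not the code used in this repository: the function names are hypothetical, the embeddings are hand-written toy vectors, and toy_pair_score is a token-overlap stand-in for a real cross-encoder such as the KoBERT re-ranker.

```python
from typing import Callable, List, Sequence


def retrieve_top_k(query_vec: Sequence[float],
                   doc_vecs: Sequence[Sequence[float]],
                   k: int) -> List[int]:
    """Stage 1: rank every document by dot product with the query embedding."""
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def rerank(query: str,
           candidates: Sequence[int],
           titles: Sequence[str],
           score_pair: Callable[[str, str], float]) -> List[int]:
    """Stage 2: re-order only the retrieved candidates with a pairwise scorer."""
    return sorted(candidates,
                  key=lambda i: score_pair(titles[i], query),
                  reverse=True)


def toy_pair_score(title: str, query: str) -> float:
    """Hypothetical stand-in for a cross-encoder: share of query tokens found in the title."""
    q_tokens = set(query.lower().split())
    t_tokens = set(title.lower().split())
    return len(q_tokens & t_tokens) / max(len(q_tokens), 1)


if __name__ == "__main__":
    titles = ["inheritance tax filing",
              "traffic accident liability",
              "inheritance dispute between heirs"]
    doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]  # toy embeddings
    candidates = retrieve_top_k([1.0, 0.0], doc_vecs, k=2)
    print(rerank("dispute between heirs", candidates, titles, toy_pair_score))
```

The pairwise scorer only sees the k candidates from stage 1, so its cost stays bounded no matter how large the index grows.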

Learn to Rank with KoBERT

Initial training is a binary classification task: given a title and a question from the dataset, predict whether they form a related pair, as in the examples below.

Why BERT?

To take advantage of BERT's next sentence prediction (NSP) pre-training. The input pair is formatted as:

[CLS] title [SEP] question [SEP]

title	question	label
오토바이의 고속도로 주행금지가 행복추구권 등을 침해한 것은 아닌지 여부	甲은 평소 오토바이를 좋아하여 주말, 휴일이면 오토바이로 전국을 여행하였습니다. 그런데 ...	positive
피해자과실로 인한 교통사고로 개인택시사업면허가 취소된 경우	甲은 평소 오토바이를 좋아하여 주말, 휴일이면 오토바이로 전국을 여행하였습니다. 그런데 ...	negative
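Rows like the ones above could be generated from (title, question) pairs roughly as follows. This is a sketch under assumptions, not the repository's actual data pipeline: build_pairs and format_pair are hypothetical names, and the one-negative-per-positive random sampling is an assumed strategy. The field names title and question follow the table header.

```python
import random
from typing import Dict, List, Tuple


def build_pairs(examples: List[Dict[str, str]],
                seed: int = 0) -> List[Tuple[str, str, str]]:
    """Emit one positive row per example plus one negative row with a mismatched title."""
    rng = random.Random(seed)  # fixed seed keeps the sampling reproducible
    rows: List[Tuple[str, str, str]] = []
    for i, ex in enumerate(examples):
        rows.append((ex["title"], ex["question"], "positive"))
        others = [j for j in range(len(examples)) if j != i]
        if others:  # a single-example dataset has no title to mismatch with
            j = rng.choice(others)
            rows.append((examples[j]["title"], ex["question"], "negative"))
    return rows


def format_pair(title: str, question: str) -> str:
    """Show the sentence-pair layout; a real tokenizer adds these special tokens itself."""
    return f"[CLS] {title} [SEP] {question} [SEP]"


if __name__ == "__main__":
    data = [
        {"title": "title A", "question": "question A"},
        {"title": "title B", "question": "question B"},
        {"title": "title C", "question": "question C"},
    ]
    for title, question, label in build_pairs(data):
        print(format_pair(title, question), label)
```

If two examples share the same title, random sampling can produce a mislabeled negative, so a real pipeline would need to filter those cases.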

python app.py -t train

The trained model is saved in the rerank_model directory.

A KoBERT model fine-tuned on LegalQA is available as gogamza/kobert-legalqa-v1.

Search

With REST API

To start the Jina server for REST API:

# python app.py -t query_restful --query_flow flows/query_numpy_rerank.yml
python app.py -t query_restful

Then use a client to query:

curl --request POST -d '{"parameters": {"top_k": 1},  "data": ["상속 관련 문의"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:1234/search'

Or use Jinabox with endpoint http://127.0.0.1:1234/search
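The same request can be sent from Python with only the standard library. This is a minimal sketch: build_search_request and search are hypothetical helper names, the payload mirrors the curl example, and the response is returned as parsed JSON without assuming anything about its structure.

```python
import json
import urllib.request

SEARCH_URL = "http://127.0.0.1:1234/search"  # endpoint from the examples above


def build_search_request(query: str, top_k: int = 1,
                         url: str = SEARCH_URL) -> urllib.request.Request:
    """Build a POST request with the same JSON body as the curl example."""
    payload = {"parameters": {"top_k": top_k}, "data": [query]}
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, top_k: int = 1, url: str = SEARCH_URL,
           timeout: float = 30.0) -> dict:
    """Send the query to a running server and return the decoded JSON response."""
    request = build_search_request(query, top_k, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the server started with: python app.py -t query_restful
    print(search("상속 관련 문의", top_k=1))
```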

From the terminal

# python app.py -t query --query_flow flows/query_numpy_rerank.yml
python app.py -t query

Approximate KNN Search

python app.py -t query_restful --query_flow flows/query_hnswlib_rerank.yml

python app.py -t query_restful --query_flow flows/query_faiss_rerank.yml

python app.py -t query_restful --query_flow flows/query_annoy_rerank.yml

Retrieval time(sec.)
- AMD Ryzen 5 PRO 4650U, 16 GB Memory
- Average of 100 searches
- Excluding BertReRanker

top-k	Numpy	Hnswlib	Faiss	Annoy
10	1.433	0.101	0.131	0.118
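For context on what such a comparison measures, here is a small sketch of an exact (brute-force) top-k search together with a helper that averages latency over repeated calls. It is written in pure Python for illustration and is not the benchmark script behind the table. Both the cosine metric and the reading of the Numpy column as exact search are assumptions, and exact_top_k and average_seconds are hypothetical names.

```python
import heapq
import math
import time
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def exact_top_k(query: Sequence[float],
                docs: Sequence[Sequence[float]],
                k: int) -> List[int]:
    """Score every document against the query and keep the k best indices."""
    return heapq.nlargest(k, range(len(docs)),
                          key=lambda i: cosine(query, docs[i]))


def average_seconds(search: Callable[[], object], repeats: int = 100) -> float:
    """Average wall-clock time of one call to search over the given number of runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        search()
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    query = [1.0, 0.0]
    print(exact_top_k(query, docs, k=2))
    print(average_seconds(lambda: exact_top_k(query, docs, k=2)))
```

Exact search compares the query with every indexed vector, so its cost grows linearly with the index size. Approximate indexes such as Hnswlib, Faiss, and Annoy avoid that full scan at the price of occasionally missing a true neighbor.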

Presentation

Neural IR 101

Demo

To Demo

FAQ

Why this dataset?

Legal text is full of technical terms, so searching it is difficult for anyone unfamiliar with them. That made it a good example for showing the effectiveness of neural IR.

LFS quota is exceeded

You can download SentenceKoBART.bin from one of the two links below.

Citation

Model training, data crawling, and demo system were all supported by the AWS Hero program.

@misc{heewon2021,
author = {Heewon Jeon},
title = {LegalQA using SentenceKoBART},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/haven-jeon/LegalQA}}
}

License

The QA data in data/legalqa.jsonlines was crawled from www.freelawfirm.co.kr in accordance with its robots.txt. Commercial use is prohibited; the data is for academic use only.
We are not responsible for any legal decisions you make based on the resources provided here.
