Implementation of a legal QA system based on SentenceKoBART
- How to train SentenceKoBART
- Based on Neural Search Engine Jina v2.0
- Provide Korean legal QA data (1,830 pairs)
- Apply approximate KNN search with Faiss, Annoy, Hnswlib.
# install git lfs , https://github.com/git-lfs/git-lfs/wiki/Installation
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt install git-lfs
git clone https://github.com/haven-jeon/LegalQA.git
cd LegalQA
git lfs pull
# If the LFS quota is exceeded, download the model with the commands below.
# wget http://gogamza.ipdisk.co.kr:80/gogamzapubs/VOL1/URLs/models/SentenceKoBART.bin
# mv SentenceKoBART.bin model/
pip install -r requirements.txt
python app.py -t index
GPU-based indexing is available as an option: in pods/encode.yml, set device: cuda.
SentenceKoBART is not tuned for the legal task, so it retrieves with good recall but needs help with precision. Re-ranking the top-k results with a cross-encoder improves precision.
- Model : Ranking for general purpose
- Learn to Rank : Ranking for task specific purpose
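The retrieve-then-re-rank idea described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: `rerank` and `toy_overlap_score` are hypothetical names, and the toy scorer only stands in for a cross-encoder, which would score each (query, candidate) pair jointly.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Re-order first-stage candidates by a pairwise relevance score."""
    scored = [(cand, score_pair(query, cand)) for cand in candidates]
    # Higher score means more relevant. The sort is stable, so ties keep
    # the order produced by the first-stage retriever.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query: str, candidate: str) -> float:
    """Stand-in scorer (assumption): share of query tokens found in the candidate."""
    query_tokens = set(query.split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & set(candidate.split())) / len(query_tokens)


if __name__ == "__main__":
    first_stage = [
        "taxi license revoked after accident",
        "inheritance tax on real estate",
        "inheritance dispute between siblings",
    ]
    results = rerank("inheritance dispute", first_stage, toy_overlap_score, top_k=2)
    for text, score in results:
        print(f"{score:.2f}  {text}")
```

Keeping the scorer as a plain callable means a trained model could be swapped in without changing the ranking logic.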
Learn to Rank with KoBERT
Initial training is a binary classification task: the model predicts whether a dataset title and a question form a related pair, as in the examples below.
Why BERT?
- To take advantage of BERT's next sentence prediction (NSP) pre-training.
[CLS] title [SEP] question [SEP]
title | question | label |
---|---|---|
오토바이의 고속도로 주행금지가 행복추구권 등을 침해한 것은 아닌지 여부 | 甲은 평소 오토바이를 좋아하여 주말, 휴일이면 오토바이로 전국을 여행하였습니다. 그런데 ... | positive |
피해자과실로 인한 교통사고로 개인택시사업면허가 취소된 경우 | 甲은 평소 오토바이를 좋아하여 주말, 휴일이면 오토바이로 전국을 여행하였습니다. 그런데 ... | negative |
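One way such labeled pairs could be produced is sketched below. It assumes one randomly sampled negative title per question and unique titles across records; the source does not say how negatives are actually chosen. `build_pairs` and `to_bert_input` are hypothetical helpers.

```python
import random
from typing import List, Tuple


def build_pairs(
    records: List[Tuple[str, str]], seed: int = 0
) -> List[Tuple[str, str, str]]:
    """Turn (title, question) records into (title, question, label) rows."""
    if len(records) < 2:
        raise ValueError("need at least two records to sample a negative title")
    rng = random.Random(seed)
    rows = []
    for index, (title, question) in enumerate(records):
        # The original pairing is the positive example.
        rows.append((title, question, "positive"))
        # Assumption: pair the same question with a title from another record.
        other = rng.choice([i for i in range(len(records)) if i != index])
        rows.append((records[other][0], question, "negative"))
    return rows


def to_bert_input(title: str, question: str) -> str:
    """Render a pair in the [CLS] title [SEP] question [SEP] layout."""
    return f"[CLS] {title} [SEP] {question} [SEP]"


if __name__ == "__main__":
    demo = [("title A", "question A"), ("title B", "question B")]
    for title, question, label in build_pairs(demo):
        print(label, "|", to_bert_input(title, question))
```

In practice a BERT tokenizer inserts the special tokens itself when given a sentence pair, so `to_bert_input` is only there to make the layout visible.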
python app.py -t train
The trained model is saved in the rerank_model directory.
We provide a KoBERT model fine-tuned on LegalQA (gogamza/kobert-legalqa-v1).
To start the Jina server for REST API:
# python app.py -t query_restful --query_flow flows/query_numpy_rerank.yml
python app.py -t query_restful
Then use a client to query:
curl --request POST -d '{"parameters": {"top_k": 1}, "data": ["상속 관련 문의"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:1234/search'
Or use Jinabox with endpoint http://127.0.0.1:1234/search
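The same request can be sent from Python using only the standard library. This sketch mirrors the curl command's URL and payload. It assumes the server answers with JSON and makes no assumption about the response schema; `build_search_request` and `search` are hypothetical helper names.

```python
import json
import urllib.request


def build_search_request(
    query: str, top_k: int = 1, url: str = "http://127.0.0.1:1234/search"
) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    payload = {"parameters": {"top_k": top_k}, "data": [query]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, top_k: int = 1) -> dict:
    """Send the query and return the decoded JSON response unchanged."""
    request = build_search_request(query, top_k)
    with urllib.request.urlopen(request) as response:
        # Assumption: the response body is JSON.
        return json.load(response)
```

With the server started through `python app.py -t query_restful`, calling `search("상속 관련 문의", top_k=1)` should return the parsed response.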
# python app.py -t query --query_flow flows/query_numpy_rerank.yml
python app.py -t query
python app.py -t query_restful --query_flow flows/query_hnswlib_rerank.yml
python app.py -t query_restful --query_flow flows/query_faiss_rerank.yml
python app.py -t query_restful --query_flow flows/query_annoy_rerank.yml
- Retrieval time(sec.)
- AMD Ryzen 5 PRO 4650U, 16 GB Memory
- Average of 100 searches
- Excluding BertReRanker
top-k | Numpy | Hnswlib | Faiss | Annoy |
---|---|---|---|---|
10 | 1.433 | 0.101 | 0.131 | 0.118 |
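To make the comparison concrete, the sketch below shows what an exhaustive (exact) search does and how an average latency over many queries could be measured. It is not the benchmark code behind the table. It uses pure Python rather than NumPy to stay dependency-free, and `exact_top_k` and `mean_latency` are hypothetical names. Approximate indexes such as Hnswlib, Faiss, and Annoy avoid scoring every vector, which is where their speed advantage comes from.

```python
import math
import random
import time
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query: Vector, index: List[Vector], k: int) -> List[Tuple[int, float]]:
    """Exhaustive search: score every indexed vector, then keep the k best."""
    scores = [(doc_id, cosine(query, vec)) for doc_id, vec in enumerate(index)]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]


def mean_latency(search: Callable[[Vector], object], queries: List[Vector]) -> float:
    """Average wall-clock seconds per search call over the given queries."""
    start = time.perf_counter()
    for query in queries:
        search(query)
    return (time.perf_counter() - start) / len(queries)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic vectors for illustration only, not the LegalQA embeddings.
    index = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(1000)]
    queries = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(100)]
    seconds = mean_latency(lambda q: exact_top_k(q, index, 10), queries)
    print(f"mean latency over {len(queries)} searches: {seconds:.4f} s")
```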
Legal text is full of technical terms, which makes searching difficult for anyone unfamiliar with them. For this reason, I thought it was a good domain for showing the effectiveness of neural IR.
You can download SentenceKoBART.bin from one of the two links below.
- http://gogamza.ipdisk.co.kr:80/gogamzapubs/VOL1/URLs/models/SentenceKoBART.bin
- https://komodels.s3.ap-northeast-2.amazonaws.com/models/SentenceKoBART.bin
Model training, data crawling, and demo system were all supported by the AWS Hero program.
@misc{heewon2021,
author = {Heewon Jeon},
title = {LegalQA using SentenceKoBART},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/haven-jeon/LegalQA}}
}
- The QA data in data/legalqa.jsonlines was crawled from www.freelawfirm.co.kr in compliance with its robots.txt. It is provided for academic use; commercial use is prohibited.
- We are not responsible for any legal decisions made based on the resources provided here.