theatticusproject / cuad Goto Github PK

CUAD (NeurIPS 2021)

Home Page: https://www.atticusprojectai.org/cuad

Python 99.10% Shell 0.90%

bert legal-nlp

cuad's Issues

Further fine-tuning

Can you share code to further fine-tune the already fine-tuned RoBERTa model? A simple example containing one contract with all the required features would be enough.

Provide the precision numbers for best CUAD model from paper

In the CUAD paper you show the chart below. Can you share the precision numbers that correspond to each clause? That would be useful for comparing against other models.

Checkpoints location

I couldn't locate the checkpoints in the provided documentation. Would you mind linking to them in the README?

We provide checkpoints for three of the best models fine-tuned on CUAD: RoBERTa-base (~100M parameters), RoBERTa-large (~300M parameters), and DeBERTa-xlarge (~900M parameters).

[Feature Request] Publish Dataset to Hugging Face Datasets?

Thanks so much for open-sourcing this dataset, looking forward to using it! I would love it if you added it to Hugging Face's Datasets to make it even more accessible and discoverable for folks!

Why is test dataset (test.json) labeled?

The "--predict_file ./data/test.json" file is labeled with questions and answers, and it's passed directly into predictions = compute_predictions_logits() for predictions in train.py.

If I want to use your model to run predictions on my own dataset, do I also need to label it in the same JSON format? Doesn't that defeat the purpose? Let me know if I am misunderstanding, but shouldn't the model predict on an unlabeled, raw text file?

Thanks!
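
To illustrate the question above, here is a minimal sketch of wrapping raw, unlabeled contract text in a SQuAD-v2-style JSON structure with empty answer lists. The function name `build_prediction_file`, the id scheme, and the `version` string are hypothetical, and it is an assumption (not confirmed by the repository) that train.py accepts entries with empty `answers` when only predicting. Any evaluation metrics computed against such a file would be meaningless.

```python
import json


def build_prediction_file(title, contract_text, questions):
    """Wrap raw contract text in a SQuAD-v2-style dict with no labeled answers.

    Hypothetical helper: the field names follow the public SQuAD v2 layout,
    but whether the CUAD scripts accept empty answers is an assumption.
    """
    qas = [
        {
            "id": f"{title}__{i}",   # hypothetical id scheme
            "question": question,
            "answers": [],           # nothing is labeled
            "is_impossible": True,   # placeholder value, assumed unused at prediction time
        }
        for i, question in enumerate(questions)
    ]
    return {
        "version": "unlabeled-sketch",  # hypothetical version tag
        "data": [
            {
                "title": title,
                "paragraphs": [{"context": contract_text, "qas": qas}],
            }
        ],
    }


example = build_prediction_file(
    "my_contract",
    "This Agreement shall be governed by the laws of the State of New York.",
    ["Which state's or country's law governs the contract?"],
)
serialized = json.dumps(example, indent=2)
```

The resulting string could be written to a file and passed where test.json is expected, provided the assumption about empty answers holds.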

Predictions take a lot of time.

When using the DeBERTa model for predictions, it takes more than an hour for a single 85-page document. Is there any way to reduce the time taken? Please advise.
Thanks in advance.
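
One way to see why long documents are slow is to count the forward passes. This is a rough sketch, assuming SQuAD-style feature conversion in which each question is paired with overlapping windows of the document. The token count, window size, and stride below are hypothetical values, not measurements from the repository. CUAD defines 41 label categories, so each document is queried 41 times.

```python
def count_windows(n_doc_tokens, max_ctx_tokens, doc_stride):
    """Count the sliding windows needed to cover n_doc_tokens tokens.

    Assumes windows of at most max_ctx_tokens tokens whose start
    advances by doc_stride until the end of the document is covered.
    """
    if n_doc_tokens <= 0:
        return 0
    windows = 0
    start = 0
    while start < n_doc_tokens:
        length = min(n_doc_tokens - start, max_ctx_tokens)
        windows += 1
        if start + length == n_doc_tokens:
            break  # the last window reached the end of the document
        start += min(length, doc_stride)
    return windows


# Hypothetical numbers for a long contract.
n_questions = 41  # CUAD label categories
per_question = count_windows(n_doc_tokens=40_000, max_ctx_tokens=480, doc_stride=256)
total_forward_passes = n_questions * per_question
```

Under these assumptions, a larger stride or fewer questions reduces the number of forward passes roughly proportionally, at the cost of less overlap between windows or fewer categories checked.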

NCCL Error 1: unhandled cuda error

When I run the training script (./run.sh), it fails with an instance of 'std::runtime_error':
what(): NCCL Error 1: unhandled cuda error

This happens every time in the evaluation step of the train.py script, after the 'convert squad examples to features' step completes successfully and right after 'Evaluating: 0%' is printed.

I have made sure torch can pick up the cuda info:

print(torch.cuda.is_available())
True

convert squad examples to features very slow

Hello,
The step that converts SQuAD examples to features is very slow on my machine (48 cores + GPU). tqdm estimates about 24 hours to finish. Is this normal? Thanks!

convert squad examples to features: 4%|██ | 865/22450 [20:16<24:52:42, 4.15s/it]

LICENSE

Could you tell me which license applies to your code, and add a license file to the repository?
Thank you.

Consumes too much memory

At the "Creating features from dataset file at ." step, this code consumes too much memory (I have a 48 GB machine).

As a result, I cannot run the code. (I guess it needs a 128 GB machine.)

Is it possible to fix this problem? Thanks.

Inference pipeline for sample question and answering

Please share code, if available, for running inference on a plain paragraph and question.

[Feature request] Upload data conversion script

Could you upload the script that was used to generate the train_separate_questions.json and test.json files?

Could you please create a Google Colab notebook to run the model?

Could you please create a Google Colab notebook to run the model?
Thanks @TheAtticusProject

Not getting start and end index for answer

Right now I am able to get the start logits and end logits from the model output, but these logits are tied to features and examples rather than to positions in the text.
How can I get the start and end index of each answer, so that I can mark it on the context in the form of a bounding box?

@TheAtticusProject
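
As an illustration of the general technique (not the repository's own post-processing), here is a minimal sketch that picks the highest-scoring span from start and end logits and maps it to character positions in the context. It assumes a per-token list of character offsets is available, for example from a Hugging Face fast tokenizer called with `return_offsets_mapping=True`. The function `best_span`, the `max_answer_tokens` limit, and the toy logits are hypothetical.

```python
def best_span(start_logits, end_logits, offsets, max_answer_tokens=30):
    """Return (char_start, char_end, score) for the best valid span, or None.

    offsets[i] is a (char_start, char_end) pair for context tokens and
    None for tokens outside the context (question and special tokens).
    """
    best = None
    for i, start_score in enumerate(start_logits):
        if offsets[i] is None:
            continue
        last = min(i + max_answer_tokens, len(end_logits))
        for j in range(i, last):  # the end token may not precede the start token
            if offsets[j] is None:
                continue
            score = start_score + end_logits[j]
            if best is None or score > best[2]:
                best = (offsets[i][0], offsets[j][1], score)
    return best


# Toy example with hand-written logits and offsets.
context = "Governing law: New York."
offsets = [None, (0, 9), (10, 13), (13, 14), (15, 18), (19, 23), (23, 24), None]
start_logits = [0.1, 0.2, 0.1, 0.0, 5.0, 1.0, 0.0, 0.1]
end_logits = [0.1, 0.0, 0.3, 0.1, 1.0, 6.0, 0.5, 0.1]

char_start, char_end, score = best_span(start_logits, end_logits, offsets)
answer_text = context[char_start:char_end]
```

The character indices can then be used to highlight the answer in the original text. With windowed features, the best span would need to be chosen across all windows of the same example.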

Consumes too much memory, training failing

At the "Creating features from dataset file at ." step, this code consumes too much memory (I have a 110 GB machine).

As a result, I cannot run the code.
Is it possible to fix this problem?
Can you please advise which GPU specification I should use for training? @TheAtticusProject @IsCoelacanth @wangdsh

Could you push your models to Huggingface hub?

First of all, thanks for this amazing dataset and for the pretrained models. Would it be possible to push your models to the Hugging Face model hub at https://huggingface.co/models ? I can do it too, but I think it would give the models more legitimacy if they came from your account.
