
birch's People

Contributors

dependabot[bot], emmileaf, hatianzhang, infinitecold, victor0118, zeynepakkalyoncu


birch's Issues

BERT inference latency on CPU

I'd like to get some performance figures on latency on a CPU - queries per second, latency for each individual BERT inference, etc.
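For reference, one minimal way to collect such numbers is to time repeated forward passes of a BERT sequence-classification model on CPU. The sketch below uses the Hugging Face transformers library with a stock checkpoint as a placeholder, not one of birch's released models:

```python
# Rough CPU latency sketch (not birch's evaluation code): time repeated
# forward passes of a BERT sequence-classification model on a single
# query-sentence pair. "bert-large-uncased" is a placeholder checkpoint.
import time

import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertForSequenceClassification.from_pretrained("bert-large-uncased")
model.eval()

query = "black bear attacks"
sentence = "A bear attacked a hiker on the trail yesterday."
inputs = tokenizer(query, sentence, return_tensors="pt", truncation=True, max_length=128)

n = 100
with torch.no_grad():
    model(**inputs)                      # warm-up pass
    start = time.perf_counter()
    for _ in range(n):
        model(**inputs)
    elapsed = time.perf_counter() - start

print(f"{1000 * elapsed / n:.1f} ms per inference, {n / elapsed:.1f} inferences/sec")
```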

Add BERT inference code

We should be able to run ranking end-to-end, so we should fold BERT inference code into this repo.

More explanation in the README

Hi, congrats on this great work!

As a new user without much experience in the IR research field, please bear with me if I ask some naive questions. For example:
python src/utils/split_docs.py --collection <robust04, core17, core18> \ --index <path/to/index> --data_path data --anserini_path <path/to/anserini/root>

  1. What does index mean here, a document index? If I want to use it on my own dataset, what kind of value should I put here?

  2. For the 'data' path, if I want to use my own dataset, what format should it be in? What should the data look like?

  3. anserini_path should be the path to the anserini folder after I execute
    !git clone https://github.com/castorini/anserini.git and !cd anserini && mvn clean package appassembler:assemble, right?

Thanks for answering my questions!

index_path

What is the index_path?
Which script generates it?
Could it be generated by Pyserini?
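Not an authoritative answer, but index_path generally points at a Lucene index built with Anserini (its IndexCollection step); Pyserini can also open or download such indexes. A minimal sketch, assuming a reasonably recent Pyserini release and illustrative index names:

```python
# Minimal sketch, assuming Pyserini is installed: open a locally built
# Anserini/Lucene index, or let Pyserini download a prebuilt one.
# The index path and prebuilt name below are illustrative.
from pyserini.search.lucene import LuceneSearcher

# Option 1: a Lucene index directory built with Anserini's IndexCollection.
searcher = LuceneSearcher("indexes/lucene-index.robust04")

# Option 2: a prebuilt index that Pyserini downloads for you.
# searcher = LuceneSearcher.from_prebuilt_index("robust04")

for hit in searcher.search("black bear attacks", k=10):
    print(hit.docid, hit.score)
```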

Snippet for interactive querying of BERT model

In the README, can I have a snippet for playing with BERT interactively? I.e., fire up a Python interpreter, load the model, and issue a query and a sentence. It should be just a few lines, right?
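Something along these lines would work for a Hugging Face-style checkpoint; the checkpoint path below is hypothetical, and birch's own saved models may need the repo's loading code instead:

```python
# Generic interactive sketch with Hugging Face transformers. The checkpoint
# path is hypothetical; birch's saved models may need the repo's own loading code.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertForSequenceClassification.from_pretrained("path/to/fine-tuned-checkpoint")
model.eval()

query = "black bear attacks"
sentence = "A bear attacked a hiker on the trail yesterday."
inputs = tokenizer(query, sentence, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1)[0, 1].item())   # probability of "relevant"
```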

Prediction score

Hi thanks for sharing birch!

I am trying to predict relevant sentences using the 'saved.msmarco_mb_1' model. One thing I am curious about is the prediction score I get from 'predictions = model(tokens_tensor, segments_tensor, mask_tensor)'. The values in each tuple of 'predictions' do not sum to 1. Is it supposed to be a binary classification score?
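A hedged guess at what is going on: BertForSequenceClassification-style models return raw logits, i.e. unnormalized class scores, so the two values per example will not sum to 1 until a softmax is applied. Continuing from the variables in the snippet above:

```python
# Continuing from the poster's variables (model, tokens_tensor, segments_tensor,
# mask_tensor): the raw outputs are logits, so apply a softmax to get class
# probabilities that sum to 1 (assuming a two-class relevant/non-relevant head).
import torch
import torch.nn.functional as F

with torch.no_grad():
    predictions = model(tokens_tensor, segments_tensor, mask_tensor)  # logits, shape [batch, 2]

probs = F.softmax(predictions, dim=-1)   # each row now sums to 1
relevance_scores = probs[:, 1]           # probability of the "relevant" class
```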

list index out of range

I followed the README exactly, but there was an error... Could you tell me how to fix it? Thanks!

`Running eval/trec_eval.9.0.4/trec_eval data/qrels/qrels.mb.txt data/predictions/predict.tmp -m map -m P.20 -m ndcg_cut.20

Traceback (most recent call last):
File "src/main.py", line 92, in <module>
main()
File "src/main.py", line 32, in main
train(args)
File "/home/castil/xueee/birch-master/src/model/train.py", line 53, in train
best_score = eval_select(args, model, tokenizer, validate_dataset, args.model_path, best_score, epoch)
File "/home/castil/xueee/birch-master/src/model/test.py", line 9, in eval_select
scores_dev = test(args, split='dev', model=model, test_dataset=validate_dataset)
File "/home/castil/xueee/birch-master/src/model/test.py", line 87, in test
qrels_file=os.path.join(args.data_path, 'qrels', 'qrels.{}.txt'.format(args.collection)))
File "/home/castil/xueee/birch-master/src/model/eval.py", line 14, in evaluate
map = float(lines[0].strip().split()[-1])
IndexError: list index out of range`
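For what it is worth, the IndexError comes from eval.py reading lines[0] of trec_eval's output; if the trec_eval binary was never built or the command fails, that output is empty and the parse blows up. A small diagnostic sketch, using the same command shown in the log above:

```python
# Diagnostic sketch: run the same trec_eval command and fail loudly if it
# produces no output, instead of letting the empty parse raise IndexError.
import subprocess

cmd = [
    "eval/trec_eval.9.0.4/trec_eval",
    "data/qrels/qrels.mb.txt",
    "data/predictions/predict.tmp",
    "-m", "map", "-m", "P.20", "-m", "ndcg_cut.20",
]
result = subprocess.run(cmd, capture_output=True, text=True)

if result.returncode != 0 or not result.stdout.strip():
    raise RuntimeError(f"trec_eval produced no output:\n{result.stderr}")

lines = result.stdout.strip().split("\n")
print(float(lines[0].strip().split()[-1]))   # MAP, as parsed in eval.py
```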

TREC Microblog Tracks Data

Hi,

Thank you for your nice work!

I took a look at the project but did not find test collections from the TREC Microblog Tracks (Lin et al., 2014) from 2011 to 2014, which were used to fine-tune BERT as described in your paper.

Could you please kindly let me know where I could find the collections?

Best,
Yumo

Is the BERT training process sentence-level or document-level?

When training BERT to produce a query-document score, do you train at the sentence level or the document level? If sentence-level, what is the label for each example, and how do you choose the BERT model on the dev set?
Looking forward to your reply!
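For context (this is not an answer from the authors), sentence-level training usually means each training example packs the query and one sentence into a single BERT input with a binary relevance label. A generic sketch of what one example looks like:

```python
# Generic sketch of one sentence-level training example (not birch's exact code):
# the query and a single sentence form one BERT input, and the label is a binary
# relevance judgment for that (query, sentence) pair.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

query = "black bear attacks"
sentence = "A bear attacked a hiker on the trail yesterday."
label = 1   # 1 = relevant, 0 = not relevant; how labels are assigned is the question here

encoded = tokenizer(query, sentence, truncation=True, max_length=128,
                    padding="max_length", return_tensors="pt")
# encoded["input_ids"], encoded["token_type_ids"], encoded["attention_mask"],
# together with `label`, form one example for BertForSequenceClassification.
```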

License

Thanks a lot for open sourcing birch. What is the license for the code in this repository?

I can't find the file named robust04_rm3_5cv_sent_fields.txt

I put Anserini and Birch in the same directory and ran "./train.sh mb 5" in the shell. However, it returned the error "FileNotFoundError: [Errno 2] No such file or directory: 'robust04_rm3_5cv_sent_fields.txt'", and I can't find robust04_rm3_5cv_sent_fields.txt anywhere either.

How do I use birch on my own dataset?

Hi, thanks for this awesome work.

The first question is about document retrieval.
I'd like to use birch as a tool to retrieve relevant documents given a query.
After reading the README, I am not clear how to use birch to achieve this.
Can I have more instructions? Thank you!

The second is about sentence retrieval.
How do I use birch for sentence selection? Like Figure 2 in 'Applying BERT to Document Retrieval with Birch' describes, can I get the most relevant sentences in a document, given a query?
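Not an official answer, but the recipe described in the paper is roughly: retrieve candidate documents with BM25 (via Anserini/Pyserini), split each document into sentences, score every (query, sentence) pair with the fine-tuned BERT model, and rank documents by combining the BM25 score with the top few sentence scores. A rough sketch of that loop on a custom index (all names and paths are placeholders):

```python
# Rough sketch of the birch-style recipe on custom data (all names are
# placeholders): BM25 retrieval with Pyserini, sentence splitting, and
# (query, sentence) scoring with a fine-tuned BERT classifier.
import torch
from nltk.tokenize import sent_tokenize          # requires nltk.download('punkt') once
from pyserini.search.lucene import LuceneSearcher
from transformers import BertForSequenceClassification, BertTokenizer

searcher = LuceneSearcher("indexes/my-collection")            # your own index
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertForSequenceClassification.from_pretrained("path/to/fine-tuned-checkpoint")
model.eval()

query = "black bear attacks"
for hit in searcher.search(query, k=10):
    doc_text = searcher.doc(hit.docid).raw()                  # assumes raw text was stored
    sentence_scores = []
    for sentence in sent_tokenize(doc_text):
        inputs = tokenizer(query, sentence, return_tensors="pt",
                           truncation=True, max_length=128)
        with torch.no_grad():
            logits = model(**inputs).logits
        sentence_scores.append(torch.softmax(logits, dim=-1)[0, 1].item())
    top = sorted(sentence_scores, reverse=True)[:3]
    print(hit.docid, hit.score, top)                          # combine as in the paper
```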

embedding of long text

Hi, thanks for your effort in providing this code.
I couldn't figure out how to use your code to get the embedding of a textual document (with thousands of words). Is it possible to do this with your framework?
Thanks
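Birch itself scores (query, sentence) pairs rather than producing a single document embedding, so this is outside its intended use; that said, a common workaround for long texts is to chunk the document to fit BERT's 512-token limit and pool the chunk embeddings. A generic sketch, not birch code:

```python
# Not part of birch: split a long document into chunks that fit BERT's
# 512-token limit, take each chunk's [CLS] vector, and average them.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_long_document(text: str, chunk_tokens: int = 510) -> torch.Tensor:
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
    chunk_embeddings = []
    for start in range(0, len(token_ids), chunk_tokens):
        chunk = [cls_id] + token_ids[start:start + chunk_tokens] + [sep_id]
        with torch.no_grad():
            outputs = model(input_ids=torch.tensor([chunk]))
        chunk_embeddings.append(outputs.last_hidden_state[:, 0, :])   # [CLS] vector
    return torch.cat(chunk_embeddings).mean(dim=0)

# vec = embed_long_document(open("long_doc.txt").read())   # 768-dim vector
```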

Training QA model with WikiQA and TrecQA

Hi,

Thanks again for your nice work!

I am quite interested in the QA model which was trained on the data described in your ArXiv paper as follows,

the union of the TrecQA (Yao et al., 2013) and WikiQA (Yang et al., 2015) datasets.

Since I am now also trying to train a similar model and have several minor questions, I would really appreciate it if you could kindly clarify them for me:

  1. Did you use the union of the training sets from TrecQA and WikiQA as the training set, and the union of their development sets for development? What about the test sets, e.g., were they left unused? Or did you combine all the samples from TrecQA and WikiQA and then split them into train/dev manually?
  2. In terms of TrecQA, did you use TRAIN (which contains only 94 questions) or TRAIN-ALL (which contains 1229 questions)?
  3. In terms of WikiQA, did you truncate answer sentences to 40 tokens as one of the preprocessing steps (as introduced here: https://github.com/castorini/data/tree/master/WikiQA)?
  4. I found the following default configurations in args.py; were they also used for training the QA model? (A generic sketch of these settings appears after this list.)
    • epochs: 3
    • learning rate: 3e-6
    • batch_size: 32
    • warmup_proportion: 0.1
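For reference, those defaults map onto a standard transformers training setup roughly as follows; this is a generic illustration, not birch's actual training loop, and the dataset size is a placeholder:

```python
# Generic illustration (not birch's code) of how these defaults correspond to a
# standard transformers setup: AdamW at lr=3e-6, batches of 32, 3 epochs, and a
# linear schedule with 10% warmup.
import torch
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

model = BertForSequenceClassification.from_pretrained("bert-large-uncased")

epochs, batch_size, lr, warmup_proportion = 3, 32, 3e-6, 0.1
num_training_examples = 50_000                       # placeholder dataset size
total_steps = epochs * (num_training_examples // batch_size)

optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(warmup_proportion * total_steps),
    num_training_steps=total_steps,
)
```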

Best,
Yumo
