fani-lab / RePair
Extensible and Configurable Toolkit for Query Refinement Gold Standard Generation Using Transformers
This is the issue where we log the updates on replacing Anserini with Pyserini, focusing specifically on the RelevanceFeedback refiner. Since we've shifted from Anserini to Pyserini, one possible solution is to use Pyserini's SimpleSearcher. However, there are challenges with multiprocessing (mp): SimpleSearcher is deprecated, and the library suggests using the Lucene-based searcher instead. Unfortunately, I haven't found a similar method there.
Here is the link to the issue with the most recent information on this matter: link
We need a function that accepts the *.gold.tsv file and outputs some stats on it as (1) a dictionary of key-values (point-wise) and (2) histograms. For the first step, we can compute (1) for:
Pointwise stats
# of original queries
# of original queries that have a refined query
# of original queries with no refined query
Max, Min, Avg of length for the original queries
Max, Min, Avg of the scores for the original queries
# of refined queries
Max, Min, Avg of the # refined queries per original query
# of original queries that have a refined query with score 1.0
Max, Min, Avg of delta scores between original query and the best refined query
Max, Min, Avg of delta of lengths between original query and the best refined query
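A minimal sketch of such a stats function, assuming each row of *.gold.tsv carries (qid, query, score, refined_query, refined_score); the real column names and layout may differ:

```python
import statistics

def gold_stats(rows):
    """Point-wise stats for a *.gold.tsv-like table.

    `rows`: list of dicts with assumed keys qid, query, score,
    refined_query, refined_score (one row per refinement attempt).
    """
    by_qid = {}
    for r in rows:
        q = by_qid.setdefault(r['qid'], {'query': r['query'], 'score': r['score'], 'refined': []})
        if r.get('refined_query'):  # empty string means no refinement for this row
            q['refined'].append((r['refined_query'], r['refined_score']))

    lengths = [len(q['query'].split()) for q in by_qid.values()]
    scores = [q['score'] for q in by_qid.values()]
    per_q = [len(q['refined']) for q in by_qid.values() if q['refined']]
    return {
        '#q': len(by_qid),
        '#q_with_refinement': sum(1 for q in by_qid.values() if q['refined']),
        '#q_without_refinement': sum(1 for q in by_qid.values() if not q['refined']),
        'q_len': (max(lengths), min(lengths), statistics.mean(lengths)),
        'q_score': (max(scores), min(scores), statistics.mean(scores)),
        '#refined_q': sum(per_q),
        'refined_per_q': (max(per_q), min(per_q), statistics.mean(per_q)) if per_q else (0, 0, 0),
        '#q_with_perfect_refinement': sum(1 for q in by_qid.values()
                                          if any(s == 1.0 for _, s in q['refined'])),
    }
```

The delta stats (score and length gaps to the best refinement) and the histograms can later be derived from the same per-query lists.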
Histograms (future)
Hello Dr. @hosseinfani,
I have been trying to work on the code all day to generate different queries for validation, and I wasn't able to make it work. Could you help me figure out where I am going wrong?
Here is the latest commit that I pushed:
86f506f
We need something like this to present our pipeline: https://www.linkedin.com/feed/update/urn:li:activity:7092542279298527232/
Title: Another Look at Information Retrieval as Statistical Translation
Year: 2022
venue: SIGIR (Reproducibility paper track)
link to paper
Main problem:
The authors argue that most of our problems can be solved with a large availability of data. When the noisy channel model was introduced over two decades ago, in 1999, it was developed with synthetic data.
Output:
They successfully reproduce IRST (information retrieval as statistical translation) as a reranker after first-stage ranking with BM25.
**Contribution and Motivation:**
The motivation of this paper is to show that models proposed decades ago can perform well if we just have a larger dataset.
The contributions:
1. IRST (sum of translations) reranking after BM25 first-stage retrieval.
2. IRST (MaxSim instead of the sum of translations) reranking after BM25 first-stage retrieval.
Proposed method:
The 1999 paper by Berger and Lafferty looks at using statistical translation for information retrieval. The authors of the paper reproduce this method.
How?
Gaps in the work:
The paper is reproducible, and there are no gaps, as the paper argues. In fact, while neural models (particularly pretrained transformers) have indeed led to great advances in retrieval effectiveness, the IRST model proposed decades ago is quite effective if provided with sufficient training data.
code:
Pyserini implementation
https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-irst.md
Title: DORA THE EXPLORER: Exploring Very Large Data with Interactive Deep Reinforcement Learning
Year: 2021
venue: CIKM (Demo paper track) link to paper
Main problem:
Many current systems fail to address how to guide users in finding items of interest in a large data set. This has been addressed in this paper, where users can find data in a diverse set of scenarios while also accommodating user intervention
Contribution and Motivation:
The paper contributes an interactive Deep reinforcement learning approach that combines
(1) intrinsic reward for capturing curiosity
(2) extrinsic reward for capturing data familiarity
(3) user interventions to focus and refine the exploration process.
Proposed method:
The proposed method is demonstrated in all three scenarios on SDSS, a large sky-survey data set of 2.6 million galaxies.
The two modes are the guided mode and the unguided mode.
They call these exploration steps zones.
Gaps in the work:
The paper only considered one specific dataset and did not explore others. Dora the Explorer relies on deep reinforcement learning, which can be computationally expensive and may not be suitable for all applications.
code:
https://github.com/apersonnaz/rl-guided-galaxy-exploration
Hello Dr. @hosseinfani,
I have created the qrels and queries file.
Here is something new I found out: the qrels file is larger than the queries file.
The queries file has a total of 9,966,939 instances.
The qrels file has a total of 19,442,629 instances.
The reason: many queries are repeated in the qrels, since queries in AOL are basically search terms and the associated document varies over time.
Also, from what I see, there are 3 empty queries (NaN) in the queries file which are referenced in the qrels.
My main question is: should I drop these qids from both qrels and queries, or go along with them?
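If we do decide to drop them, a minimal sketch, assuming in-memory structures (a qid-to-text dict for queries and (qid, did, rel) tuples for qrels; the real files are tsv):

```python
def drop_empty_queries(queries, qrels):
    # queries: {qid: text}; qrels: list of (qid, did, rel) tuples
    bad = {q for q, t in queries.items()
           if t is None or t != t or not str(t).strip()}  # None, NaN, or blank text
    clean_q = {q: t for q, t in queries.items() if q not in bad}
    clean_r = [row for row in qrels if row[0] not in bad]
    return clean_q, clean_r
```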
Thanks,
Yogesh
Dear @hosseinfani,
While creating the context-free files for training, I came across this issue, which I believe requires your attention!
There are passage ids in qrels that do not have a valid document in our indexed collection. I searched for those passage ids throughout the document collection: they fail to be indexed, and hence fail to be retrieved when we try to create the doccols (I am using the same code as for msmarco).
This throws a 'NoneType' error. As a workaround, I currently use an if statement to return an empty string. I would like your suggestion on what we can do here.
I will push my code by morning, so it would be great if you could have a look. I will be training with this dataset over the weekend.
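A sketch of the workaround I currently use, assuming the searcher's `doc(pid)` returns None for ids missing from the index (method names reflect my current usage, not a fixed API):

```python
def doc_text(searcher, pid):
    # searcher.doc() may return None for ids missing from the index,
    # which was the source of the 'NoneType' error
    d = searcher.doc(pid)
    return d.raw() if d is not None else ''
```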
Thanks
Yogeswar
This is the issue where I log all my processes while adding ReQue's expanders to RePair.
Dear @hosseinfani ,
Here are my findings regarding the AOL dataset.
I tried downloading aol-ia (the internet-archived version of the AOL search log) from ir_datasets. ir_datasets stores only the qrels and queries of this dataset; the documents (relevant docs) have to be downloaded manually using these steps.
The AOL dataset in [ir-datasets](https://ir-datasets.com/aol-ia) has a lot of failures when downloaded using these particular steps.
But I have not given up: I have let it run since last week, and only a mere 3% has downloaded so far.
The only reason we need ir_datasets is its integration of documents with their respective url and query using a hash, so it can use the iterator to fetch each query and map it to its respective qrel URL.
I believe we can do this ourselves by writing a custom scraper to fetch from the Internet Archive. If you believe we should use only ir_datasets, I will wait for the download to finish and continue working on the code, as I will have to write separate code to create the doc-query pairs. If not, we can implement our custom method before the end of this weekend.
Please let me know your opinion.
So the ColBERT code is hardcoded in such a way that only integers can be used as ids, but our aol-ia passage ids are alphanumeric. I don't know how to convert this, and even if I do, I am unsure how to convert it back to the same alphanumeric id. Have you done this before, @hosseinfani?
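One possible fix, sketched below (names are mine, not ColBERT's): build a deterministic bidirectional mapping, persist the reverse map, and translate back after retrieval.

```python
def build_id_maps(pids):
    # deterministic alphanumeric-pid <-> int mapping; sorting makes the
    # assignment reproducible across runs. Persist to_pid (e.g., as a tsv)
    # so integer ids in the run files can be mapped back afterwards.
    to_int = {p: i for i, p in enumerate(sorted(set(pids)))}
    to_pid = {i: p for p, i in to_int.items()}
    return to_int, to_pid
```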
Hello @hosseinfani ,
After an immense struggle figuring out how a transformer and T5 work, I have found this link where we can train and infer:
https://github.com/castorini/docTTTTTquery#Predicting-Queries-from-Passages-T5-Inference-with-PyTorch
Can you please explain the transformation step again? I got very confused by it and was training the msmarco dataset on something I already have as a pretrained tokenizer here.
Also, if possible, can you give me a full review of what we should do for AOL? I have experimented with question generation using predefined models from the Hugging Face library.
Currently, I am reading the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", for which I will provide a summary very soon.
Sorry for asking this question this late.
Thanks in advance,
Yogeswar
Hello Dr. Fani,
I have been doing an exploratory analysis of the Yandex dataset to figure out a way to convert the numeric ids into something that can be fed into T5. Unfortunately, what I have done over the last 2 days suggests otherwise.
As we know, T5 is a text-based model that requires queries and documents in text format. From the sample document I posted, there is no way to convert them into queries (text) or documents (text). We also need relevance judgements, which are not available. This dataset seems too old to support our work in any way (at least from what I have seen after extracting all the gzips). Do you have any file structure that contains queries and documents as text?
Thank you.
Hello Dr. @hosseinfani,
I have the initial code pushed.
I am trying to understand the qrels.train.tsv file.
As our idea goes, the relevance judgement file is arranged as follows; I have attached an example:
qid did pid relevancy
80385 0 723 1
Where qid is the query id, and pid is the passage id
now from the queries.train.tsv file we have the format: qid, query
for the example query 80385 we have
qid query
80385 can you use a calculator on the compass test
Now, when I use Pyserini to fetch the document with the relevant passage id 723, I get a completely different passage that is not relevant at all.
Example of fetching with Pyserini for qid 80385 (Pyserini should fetch passage id 723):
Bisexual identity more so establishes attraction to both genders, whereas pansexual identity more so recognizes the existence of other genders (third genders) and the capacity to be sexually attracted to individuals identifying as these various genders.
This issue is where I will record any problems or issues I come across during the installation process of RePair project.
Dear @hosseinfani,
From the looks of it, pytrec_eval is not giving the right results. Even Negar's work uses trec_eval, I believe; those results have a map score of 1, which is precisely what I am looking for. Hence, I will rewrite my metric calculation to use trec_eval instead of pytrec_eval.
Below I post the results I have noticed from the toy dataset as support for my argument.
pytrec_eval
predicted_queries00.metrics.txt
trec_eval
pq00.txt
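As an independent sanity check on the toy dataset, average precision can be computed by hand and compared against both tools; a minimal sketch:

```python
def average_precision(ranked, relevant):
    # ranked: doc ids in retrieval order; relevant: set of relevant doc ids
    hits, total = 0, 0.0
    for rank, did in enumerate(ranked, start=1):
        if did in relevant:
            hits += 1
            total += hits / rank  # precision at each relevant hit
    return total / len(relevant) if relevant else 0.0
```

MAP is then the mean of this value over all queries; whichever tool disagrees with the hand computation on the toy set is the one misconfigured.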
Good evening Dr. @hosseinfani ,
I am facing an issue with running the msmarco BM25 parallel search on Compute Canada (Graham).
I was wondering if you could help me with this issue.
A little background as to why I am doing this:
I did a little test run on the diamond dataset I created (all queries should have map 1) to check whether the MAP comes out to 1 for all queries. As my luck would have it, a few queries had 0.2 or 0.5, which killed our reproduction; I found out a conflicting Lucene index caused this issue. So now I want to do a fast run on Compute Canada: I did the whole setup, transferred all the files into my folder, and set up the virtual env as well.
I am doing this because my workstation is busy running trec_eval on AOL.
The main issue: BM25 retrieval runs perfectly for the original queries, but when I do parallel processing, it only creates the bm25 file and does not write to it. If you get any time, please have a look over the weekend; if you are in the lab anytime over the weekend, I would like to sit with you and fix this issue!
Please help me, you are my last hope!
Warmest regards,
Yogeswar
The solution is
Hello @hosseinfani ,
I am unable to get a proper idea of how to solve the split/merge. We can do the split before search and merge after trec_eval with the use of a variable.
Here are the issues I guess we will face if we split during evaluation.
The main problem we are trying to solve: get the search to split properly into multiple search files and merge them after evaluation.
The issues that come with it:
My best suggestion is to split before search and pair everything back together after evaluation.
Please let me know if I can go ahead with this method testing
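A minimal sketch of the split-before-search / merge-after-evaluation idea (function names are hypothetical):

```python
def split_queries(qids, n):
    # contiguous, nearly equal chunks: one per worker / search file
    k, m = divmod(len(qids), n)
    return [qids[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

def merge_runs(chunks):
    # concatenate per-chunk result rows back into one run, preserving order
    return [row for chunk in chunks for row in chunk]
```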
Thanks,
Yogesh
Title: Can Users Predict Relative Query Effectiveness?
Year: 2022
venue: SIGIR ( Short paper track)
Link to paper
Main problem:
Often users' needs will not be met if their query is not formulated in the right way to retrieve a web page
Output:
They compare results generated by a SERP (Search engine results page) to that of a query selected by the user and compare its nDCG@10, showing both have similar end results.
Contribution and Motivation:
They used crowd-sourced workers from Amazon's Mechanical Turk for 2 tasks,
The motivation of this paper is to explore whether users can predict the effectiveness of alternative queries for the same retrieval need and hence provide guidance (or even training data) to automatic techniques for query performance prediction.
Proposed method:
Given 2 HITs (Human Intelligence Tasks), ranking and rating, each worker has to rank/rate a given query for its effectiveness on a scale of 1-5. This was later compared with the retrieval from the SERP, and the findings are explained from the comparison of the results.
Gaps in the work:
This study was done with the help of crowd workers, which may not generalize to the broader user community.
Also, they only focused on a limited set of queries from the [UQV100 dataset](https://www.microsoft.com/en-us/research/publication/uqv100-a-test-collection-with-query-variability/).
code:
Analyses of the results are available in the repository:
https://github.com/Zendelo/cs-qpp
Hello Dr. @hosseinfani,
I am in the process of writing the parser for Yahoo Q & A.
I have the base idea written out and it works well on the sample.
Before I run it on the whole dataset, I would like to clarify a few things about choosing the right content and numbers.
Here is a typical example of a document instance:
<document type="wisdom">
<uri>432470</uri>
<subject>Why are yawns contagious?</subject>
<content>When people yawn, you see that other people in the room yawn, too. Why is that?</content>
<bestanswer>When your body need more oxygen, you yawn. <br />
When you yawn, you take more oxygen in the air. <br />
If the density of the oxygen in the air becomes lower, other people (their bodies) can feel that and start to yawn to get more oxygen.<br />
That's why yawns are contagious. Yawning is extremely contagious -- 55% of people who witness someone yawn will yawn within five minutes. If a visually impaired person hears a tape of someone yawning, he or she is likely to yawn as well. Face it, the likelihood of you making it to the end of this answer without looking like one of these gaping maws is unlikely. <br />
Although the contagious nature of yawning is well established, we know less about why this is so. Researchers are currently giving the topic some serious attention. One theory suggests it's a holdover from a period in evolutionary history when yawning served to coordinate the social behavior of a group of animals. A recent study postulates that contagious yawning could be part of the "neural network involved in empathy." <br />
<br />
While the mystery of contagious yawning has yet to be solved, perhaps researchers are closing in on an answer. On the other hand, given the subject matter, we wouldn't blame them for falling asleep at the wheel. In the meantime, give the "yawn challenge" a try -- it's tougher than it looks.</bestanswer>
<nbestanswers><answer_item>Yawning is extremely contagious -- 55% of people who witness someone yawn will yawn within five minutes. If a visually impaired person hears a tape of someone yawning, he or she is likely to yawn as well. Face it, the likelihood of you making it to the end of this answer without looking like one of these gaping maws is unlikely. <br />
Although the contagious nature of yawning is well established, we know less about why this is so. Researchers are currently giving the topic some serious attention. One theory suggests it's a holdover from a period in evolutionary history when yawning served to coordinate the social behavior of a group of animals. A recent study postulates that contagious yawning could be part of the "neural network involved in empathy." <br />
<br />
While the mystery of contagious yawning has yet to be solved, perhaps researchers are closing in on an answer. On the other hand, given the subject matter, we wouldn't blame them for falling asleep at the wheel. In the meantime, give the "yawn challenge" a try -- it's tougher than it looks.</answer_item>
<answer_item>When your body need more oxygen, you yawn. <br />
When you yawn, you take more oxygen in the air. <br />
If the density of the oxygen in the air becomes lower, other people (their bodies) can feel that and start to yawn to get more oxygen.<br />
That's why yawns are contagious.</answer_item>
</nbestanswers>
<cat>Trivia</cat>
<maincat>Education & Reference</maincat>
<subcat>Trivia</subcat>
<date>1127631600</date>
<res_date>1120237002</res_date>
<vot_date>1127977920</vot_date>
<lastanswerts>1161068399</lastanswerts>
<qlang>en</qlang>
<qintl>us</qintl>
<language>en-us</language>
<id>u1254780</id>
<best_id>u1305446</best_id>
</document>
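A minimal parsing sketch for one such document (a sketch of my base idea, not the final parser; it uses itertext() because <bestanswer> contains <br/> children, and the real corpus would be streamed with iterparse over many documents):

```python
import xml.etree.ElementTree as ET

def text_of(root, tag):
    # gather all nested text: findtext() alone would stop at the first <br/>
    el = root.find(tag)
    return ''.join(el.itertext()).strip() if el is not None else ''

def parse_doc(xml_str):
    root = ET.fromstring(xml_str)
    return {'qid': text_of(root, 'uri'),
            'subject': text_of(root, 'subject'),
            'content': text_of(root, 'content'),
            'bestanswer': text_of(root, 'bestanswer')}
```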
Questions:
- Should the query come from the <subject> tag or the <content> tag?
- <bestanswer> is mostly 1 item from <nbestanswers>, but sometimes it's a combination of all of them. What should I do here?
- <best_id> and <id> would be unique, but some users have answered multiple questions, so what could possibly be considered as the DID? Also, I don't understand what the difference between these 2 ids could possibly be.
Confirmed items:
- The <uri> tag is going to be the query id.
Hello @hosseinfani, I have an issue where TCT-ColBERT works when running in a single process (this was tested on the original queries), but when I multiprocess with the predicted queries, it runs into a memory error, and I cannot fix it by any means with the debugger. Please let me know if you will be available in the lab this weekend so we can sort out this issue with your guidance.
Thanks
@hosseinfani.
This is what I have as an idea. Please let me know if this sounds like a good short-paper idea.
Previously we had AOL title, url where we trained the dataset on the whole document discarding user info.
In this idea. We will be doing it in a slightly different way.
We currently have the qrels file as qid user did relevance.
We will be merging the qid and user.
So the new qrels will become: qid_user 1 did relevance
We will be fetching all user queries and making pairs of their document id. This will mean only documents fetched for that user will be used for refinement and not other relevant documents for other users.
We will have the new doc query pair as user queries-> documents
for that user only. We also want to try adding the document text, to make sure we get more content for our T5 to produce much more relevant queries. I will do some initial experiments and provide you with an update soon.
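A sketch of the qrels rewrite, assuming the current rows carry (qid, user, did, relevance):

```python
def personalize_qrels(qrels):
    # (qid, user, did, rel) -> ('qid_user', 1, did, rel): one qrels row per
    # user-query pair, so refinement is scoped to that user's own documents
    return [(f'{qid}_{user}', 1, did, rel) for qid, user, did, rel in qrels]
```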
@hosseinfani
This is the issue where I am going to log my findings.
@hosseinfani, I need your help with this issue. I am running HredQS and end up with this error. I haven't changed anything in your code. Any idea how to fix it?
Title: Transformer Models for Recommending Related Questions in Web Search
Year: 2020
venue: CIKM (Short paper track) link to paper
Main problem:
Much work has been done on query reformulation and related searches, but very little recommends related questions for a given user query in web search.
Contribution and Motivation:
The contribution is recommending related questions relevant to the original query asked by the user. They use several ways to fetch different sets of queries, then compare and annotate them for the PAA (People Also Ask) section of a search engine.
Proposed method:
For a query q, the standard session-log co-occurrence-based related-search method is used to get a set of related queries Q.
Given a query question (q, q') pair, they retrieve the top 10 results from SERP(search engine results pages). The similarity is established from the results of these returned SERPs, particularly the title and snippets shown on the search results.
The query and question are concatenated into a single sequence, separated by a special "SEP" token.
Gaps in the work:
The work was done on a private dataset created from anonymized search queries.
code:
No code is available!
We used T5 to generate refined queries for the benchmark datasets. However, we can also evaluate the T5 refiner on a held-out test set to see its performance as a firsthand refiner method.
Keeping track of the F1 score computation for T5 as a query suggestion model.
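A simple bag-of-words F1 between a predicted and a gold query could look like the following (one possible definition; a different tokenization may be preferable):

```python
from collections import Counter

def token_f1(pred, gold):
    # bag-of-words F1: overlap counts repeated tokens via Counter intersection
    p, g = Counter(pred.lower().split()), Counter(gold.lower().split())
    overlap = sum((p & g).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / sum(p.values()), overlap / sum(g.values())
    return 2 * prec * rec / (prec + rec)
```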
Dear @hosseinfani,
I have been trying to index the documents for AOL, but I hit this strange error. Could you please help me fix it?
If not, could you give me a workaround?
Thanks,
Hello @hosseinfani @DelaramRajaei
It appears that the CLEF-IP track is not accessible through ir_datasets. It's possible to still build a loader, but it would need to be markedly different from the other loaders in the repo. Would you like me to explore this option?
Alternatively, I could start building loaders for other datasets that are present in ir_datasets if we want to stay within it. One option would be to start building one for NFCorpus (medical) while I look for a legal dataset within ir_datasets.
Brown University at TREC Deep Learning 2019:
To my understanding, they make an inter-query association based on the overlaps in relevant documents; e.g., q and p are two queries, and they are related if R_q and R_p intersect.
I'm thinking of developing a Jaccard distance between two queries based on the sets of their relevant docs. If it's more than a threshold, we consider them refined versions of each other.
How to calculate the pairwise Jaccard? I think it's similar to calculating the number of pairwise collaborations, like here.
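A sketch of the pairwise Jaccard over relevant-doc sets (the threshold value is a placeholder, not a tuned choice):

```python
def jaccard(rel_q, rel_p):
    # similarity of two queries via their sets of relevant docs
    a, b = set(rel_q), set(rel_p)
    return len(a & b) / len(a | b) if a | b else 0.0

def refined_pairs(rels, threshold=0.5):
    # rels: {qid: iterable of relevant dids}; returns query pairs whose
    # relevant docs overlap enough to be treated as refinements of each other
    qids = sorted(rels)
    return [(q, p) for i, q in enumerate(qids) for p in qids[i + 1:]
            if jaccard(rels[q], rels[p]) >= threshold]
```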
This is the issue where I will keep a record of all my findings as I work on refining all aspects of the retrieval system on different datasets using dense retrieval.
Dear Dr @hosseinfani,
I was finally able to successfully fine-tune the T5 model on the msmarco doc-query pairs. Now I will be inferring predictions from the model. I will upload the generated predictions in our Teams chat once I am done predicting.
Thanks,
Yogesh
This idea involves asking ChatGPT to generate relevant queries based on the documents we feed it.
We will sample about 10,000 documents that meet the following criteria from the refined queries.
Based on this data, we will compare how well ChatGPT performs at suggesting documents for these queries.
It's good to study the effect of rankers in finding refined queries.
As seen in the toy results, in sparse retrieval, BM25 yields more relevant docs than QLD, and hence better IR metrics and more refined queries in the final datasets.
This is the issue where I log the progress of reg fusion in RePair project.
We can follow https://github.com/google/gin-config for all configuration in our pipeline, instead of param.py and settings[] everywhere :)
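As a sketch, bindings in a gin file could replace the scattered settings (the parameter names here are hypothetical, not our actual ones):

```
# repair.gin — hypothetical bindings replacing param.py / settings[] lookups
search.ranker = 'bm25'
search.topk = 100
train.epochs = 5
```

These would be picked up by decorating the corresponding functions with `@gin.configurable` and calling `gin.parse_config_file('repair.gin')` at startup.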
Hey @michelecatani
This is the issue where you can log your findings about datasets in specific domains like law and medicine, which include query and relevant document pairs.
Additionally, please check the availability of temporal data and user-specific information for each query within these datasets.