
fani-lab / repair


Extensible and Configurable Toolkit for Query Refinement Gold Standard Generation Using Transformers

Python 98.81% Batchfile 0.39% Shell 0.80%
information-retrieval query-refinement query-reformulation query-suggestions

repair's People

Contributors

delaramrajaei, hosseinfani, michelecatani, reddymukesh44, yogeswarl


repair's Issues

Removing Anserini

This is the issue where we log the updates on removing Anserini and replacing it with Pyserini, focusing specifically on the RelevanceFeedback refiner. Since we've shifted from Anserini to Pyserini, one possible solution is to use the SimpleSearcher from Pyserini. However, there are challenges with multiprocessing (mp), since SimpleSearcher is deprecated and the library suggests using the Lucene searcher instead; unfortunately, I haven't found a similar method there.

Here is the link to the issue with the most recent information on this matter: link
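As a starting point, here is a minimal sketch (not the project's actual code) of searching and fetching documents with Pyserini's Lucene searcher; the prebuilt index name below is only illustrative:

from pyserini.search.lucene import LuceneSearcher

# build from a prebuilt index or a local index directory
searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')  # or LuceneSearcher('/path/to/index')
hits = searcher.search('compass test calculator', k=10)

for hit in hits:
    doc = searcher.doc(hit.docid)                 # fetch the stored document
    print(hit.docid, hit.score, doc.raw()[:80])   # docid, BM25 score, raw contents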

Stats on gold standard dataset

We need a function that accepts the *.gold.tsv file and outputs some stats on it as (1) a dictionary of key-values (point-wise) and also (2) histograms. For the first step, we can do (1) on the following (a minimal stats sketch follows the list):

  • Pointwise stats

  • # of original queries

  • # of original queries that have a refined query

  • # of original queries with no refined query

  • Max, Min, Avg of length for the original queries

  • Max, Min, Avg of the scores for the original queries

  • # of refined queries

  • Max, Min, Avg of the # refined queries per original query

  • # of original queries that have a refined query with score 1.0

  • Max, Min, Avg of delta scores between original query and the best refined query

  • Max, Min, Avg of delta of lengths between original query and the best refined query

  • Histograms (future)
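A minimal sketch of the point-wise stats, assuming *.gold.tsv has columns qid, original query, refined query, and score (the real column layout in RePair may differ):

import pandas as pd

def gold_stats(path):
    df = pd.read_csv(path, sep='\t', names=['qid', 'original', 'refined', 'score'])
    refined_counts = df.groupby('qid')['refined'].apply(lambda s: s.notna().sum())
    originals = df.drop_duplicates('qid')
    lens = originals['original'].str.split().str.len()
    return {
        '#original': df['qid'].nunique(),
        '#original_with_refined': int((refined_counts > 0).sum()),
        '#original_without_refined': int((refined_counts == 0).sum()),
        '#refined': int(df['refined'].notna().sum()),
        'avg_refined_per_original': float(refined_counts.mean()),
        'max_original_len': int(lens.max()),
        'min_original_len': int(lens.min()),
        'avg_original_len': float(lens.mean()),
    }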

2022 - SIGIR - Another Look at Information Retrieval as Statistical Translation


Title: Another Look at Information Retrieval as Statistical Translation

Year: 2022

venue: SIGIR (Reproducibility paper track)
link to paper
Main problem:

The authors argue that most of our problems can be solved given a large enough amount of data. When the noisy channel model was introduced over two decades ago, in 1999, it was worked on with synthetic data.


Output:
They successfully reproduce IRST (Information Retrieval as Statistical Translation) as a reranker after first-stage ranking with BM25.


Contribution and Motivation:

The motivation of this paper is to show that models proposed decades ago can be effective if we just have a larger dataset.

The contributions:

1. IRST (sum of translation probabilities) reranking after BM25 first-stage retrieval.
2. IRST (MaxSim instead of the sum of translation probabilities) reranking after BM25 first-stage retrieval.

Proposed method:

The 1999 paper by Berger and Lafferty looks at using statistical translation for information retrieval. The authors of the paper reproduce this method. 
How?

  1. Noisy channel model: The noisy channel model is a method of ad hoc retrieval proposed by Berger and Lafferty in their paper "Information Retrieval as Statistical Translation." It draws an analogy between machine translation and information retrieval, using IBM Model 1 to learn translation probabilities that relate query terms and document terms based on a hidden alignment between the words in the two texts. The model is based on the concept of a noisy channel, where the information transmitted from the source (query) to the destination (document) is corrupted by noise, and the goal is to recover the original message.
  2. MaxSim of ColBERT: The MaxSim operator is the part of ColBERT's architecture that calculates the similarity between two pieces of text. For a given query and a set of documents, the MaxSim operator finds the most similar document to the query by calculating the maximum similarity score between the query and each document. This score is based on the similarity of the embeddings (numerical representations of the text) of the query and the document (see the sketch below this list).
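A minimal sketch of ColBERT-style MaxSim scoring (illustrative, not the paper's code): for each query token embedding, take its maximum similarity over the document token embeddings, then sum over query tokens.

import torch

def maxsim(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    # q_emb: (q_len, dim), d_emb: (d_len, dim); both assumed L2-normalized
    sim = q_emb @ d_emb.T               # (q_len, d_len) token-level similarities
    return sim.max(dim=1).values.sum()  # max over doc tokens, summed over query tokens

q = torch.nn.functional.normalize(torch.randn(5, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(40, 128), dim=-1)
print(maxsim(q, d))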

Gaps in the work:

The paper is reproducible. There are no gaps, as the paper argues. In fact, it can be said that while neural models (particularly pretrained transformers) have indeed led to great advances in retrieval effectiveness, the IRST model proposed decades ago is quite effective if provided with sufficient training data.

code:

Pyserini implementation
https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-irst.md

2021 - CIKM - DORA THE EXPLORER : Exploring Very Large Data with Interactive Deep Reinforcement Learning

Title: DORA THE EXPLORER: Exploring Very Large Data with Interactive Deep Reinforcement Learning

Year: 2021

venue: CIKM (Demo paper track) 
link to paper

Main problem:

Many current systems fail to address how to guide users in finding items of interest in a large data set. This paper addresses that gap: users can find data in a diverse set of scenarios while the system also accommodates user intervention.

Contribution and Motivation:
The paper contributes an interactive Deep reinforcement learning approach that combines
(1) intrinsic reward for capturing curiosity
(2) extrinsic reward for capturing data familiarity
(3) user interventions to focus and refine the exploration process.

Proposed method:

The proposed method is demonstrated in all three scenarios on SDSS, a large sky-survey data set with 2.6 million galaxies.
There are two modes, a guided mode and an unguided mode.
The interface is organized into zones (steps):

  • Zone A: a view of the pipeline
  • Zone B: Current results from the search
  • Zone C: An exploration mode to move forward with
  • Zone D: Allows the users to intervene and make changes to the exploration
  • Zone E: Save a pipeline, load a saved pipeline or restart the exploration

Gaps in the work:

The paper only evaluated a specific dataset and did not explore others. Dora the Explorer relies on deep reinforcement learning, which can be computationally expensive and may not be suitable for all applications.

code:
https://github.com/apersonnaz/rl-guided-galaxy-exploration

Edge cases when I try to index

Hello Dr. @hosseinfani,
I have created the qrels and queries file.
Here is something new I found out: the qrels file is larger than the queries file.
The queries file has a total of 9,966,939 instances.
The qrels file has a total of 19,442,629 instances.

The reason behind this: many queries are repeated in the qrels, since the queries in AOL are basically search terms and the matched document varies each time.

Also, from what I see, there are 3 empty queries (NaN) in the queries file which are still referenced.
My main question is: should I drop these qids from qrels and queries, or keep them? (A minimal filtering sketch is below.)
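A minimal sketch of the filtering option (file and column names are assumptions, not the project's actual loader code):

import pandas as pd

queries = pd.read_csv('queries.tsv', sep='\t', names=['qid', 'query'])
qrels = pd.read_csv('qrels.tsv', sep='\t', names=['qid', 'did', 'pid', 'relevancy'])

bad_qids = set(queries.loc[queries['query'].isna(), 'qid'])   # the 3 empty (NaN) queries
queries = queries.dropna(subset=['query'])
qrels = qrels[~qrels['qid'].isin(bad_qids)]                   # drop qrels rows that reference them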

Thanks,
Yogesh

Issues with AOL

Dear @hosseinfani,
While creating the context-free files for training, I came across this issue, which I believe requires your attention!

There are passage IDs in qrels that do not have a valid document available in our indexed collection. I tried to search for those passage IDs throughout the document collection; they fail to be indexed and hence fail to be retrieved when we try to create the doccols (I am using the same code as msmarco).
This throws a 'NoneType' error, so as a workaround I have currently used an if statement to return an empty string (a minimal sketch is below). I would like your suggestion as to what we can do here.
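A minimal sketch of that guard (illustrative, not the exact code in the repo):

def fetch_passage(searcher, pid):
    doc = searcher.doc(pid)                      # returns None when pid is not in the index
    return doc.raw() if doc is not None else ''  # workaround: empty string for missing passages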

I will push my code by morning, so it would be great if you could have a look. I will be training with this dataset over the weekend.

Thanks

Yogeswar

Update for AOL log

Dear @hosseinfani ,
Here are my findings regarding the AOL dataset.
I tried downloading aol-ia (the Internet Archive version of the AOL search log) from ir_datasets. ir_datasets stores only the qrels and queries of this dataset; the documents (relevant docs) have to be downloaded manually using these steps.

image

The AOL dataset in [ir-datasets](https://ir-datasets.com/aol-ia) has a lot of failures when downloaded using these particular steps.

But I have not given up. I have let it run since last week and have only had a mere 3% downloaded so far.
image

The only reason we need to use ir_datasets is its integration of documents with their respective URL and query using a hash, so it can use the iterator to fetch each query and map it to its respective qrel URL.

I believe we can do this ourselves by writing a custom scraper to fetch from the Internet Archive. If you believe we should use only ir_datasets, I will wait for the download to finish and continue working on the code, as I will have to write a separate script to create the doc-query pairs. If not, we can implement our custom method before the end of this weekend.

Please let me know your opinion.

New unsolvable problem!!

So the ColBERT code is hardcoded in such a way that only integer IDs can be used, but our aol-ia passage IDs are alphanumeric. I don't know how to convert this, and even if I do, I am unsure how to convert it back to the same alphanumeric ID (a minimal mapping sketch is below). Have you done this before @hosseinfani?
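One possible workaround (an assumption, not existing project code): keep a bidirectional map between alphanumeric passage IDs and integers, and use the reverse map after ColBERT returns results.

pid2int, int2pid = {}, {}

def to_int(pid: str) -> int:
    if pid not in pid2int:
        idx = len(pid2int)
        pid2int[pid], int2pid[idx] = idx, pid
    return pid2int[pid]

def to_pid(idx: int) -> str:
    return int2pid[idx]   # convert back to the original alphanumeric id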

Doubt regarding Transformer training

Hello @hosseinfani ,
After an immense struggle figuring out how a transformer and T5 work, I have found this link where we can train and infer:
https://github.com/castorini/docTTTTTquery#Predicting-Queries-from-Passages-T5-Inference-with-PyTorch
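For reference, a minimal inference sketch with the HuggingFace doc2query-T5 checkpoint; the checkpoint name and generation settings here are illustrative, not necessarily what we will end up using:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('castorini/doc2query-t5-base-msmarco')
model = T5ForConditionalGeneration.from_pretrained('castorini/doc2query-t5-base-msmarco')

passage = "A passage of text from the collection."
inputs = tokenizer(passage, return_tensors='pt', truncation=True, max_length=512)
outputs = model.generate(**inputs, max_length=64, do_sample=True, top_k=10, num_return_sequences=3)
for o in outputs:
    print(tokenizer.decode(o, skip_special_tokens=True))   # predicted queries for the passage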

Can you please explain the transformation step again? I got very confused by it and was training the msmarco dataset on something that I already have as a pretrained tokenizer here.
Also, if possible, can you give me an overall review of what we should do for AOL? I have experimented with question generation using predefined models from the Hugging Face library.

Currently, I am reading this paper, for which I will provide a summary very soon: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.

Sorry about asking this question this late.

Thanks in advance,
Yogeswar

Yandex Dataset issues

Hello Dr. Fani,
I have been doing an exploratory analysis on the Yandex dataset to figure out a way to convert the numeric IDs into something that can be fed into T5. Unfortunately, what I have done over the last 2 days suggests otherwise.
As we know, T5 is a text-based model that requires queries and documents in text format. From the sample document I posted, there is no means to convert them into queries (text) or documents (text). We also need relevance judgements, which are not available. This dataset seems to be too old to support our work in any way (at least from what I have seen after extracting all the gzips). Do you have any file structure that contains queries and documents as text?

Thank you.

Help Wanted with pyserini passage fetching

Hello Dr. @hosseinfani,
I have the initial code pushed.
I am trying to understand the qrels.train.tsv file

As our idea goes, the relevance judgement file is arranged, for which I have attached an example.

qid	did	pid	relevancy
80385	0	723	1

Where qid is the query id, and pid is the passage id
now from the queries.train.tsv file we have the format: qid, query
for the example query 80385 we have

qid    query
80385	can you use a calculator on the compass test

Now when I use Pyserini to fetch the document with the relevant passage id 723, I get a completely different answer which is not relevant at all.
For example, fetching with Pyserini for qid 80385 should return passage id 723, but the passage text I get is (a minimal fetch sketch follows the example):

Bisexual identity more so establishes attraction to both genders, whereas pansexual identity more so recognizes the existence of other genders (third genders) and the capacity to be sexually attracted to individuals identifying as these various genders.
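A minimal sketch of the lookup I am doing (the index name and ids are only illustrative of the example above):

from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')
doc = searcher.doc('723')                                  # pid from qrels.train.tsv for qid 80385
print(doc.raw() if doc is not None else 'pid not found')   # prints the unrelated passage above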

Installation and Setup

This issue is where I will record any problems I come across during the installation and setup of the RePair project.

Update on pytrec Eval usage

Dear @hosseinfani,
From the looks of it, pytrec_eval is not giving the right results. Even Negar's work uses trec_eval, I believe, and those results have a MAP score of 1, which is precisely what I am looking for. Hence I will rewrite my metric calculation to use trec_eval instead of pytrec_eval.

Below I post the results I have noticed from the toy dataset as support for my argument.
pytrec_eval
predicted_queries00.metrics.txt

trec_eval
pq00.txt

MSmarco passage rerun

Good evening Dr. @hosseinfani ,
I am facing an issue with running the msmarco BM25 parallel search on Compute Canada (Graham).
I was wondering if you could help me with this issue.

A little background as to why I am doing this:
I did a little test run on the diamond dataset I created (all queries have MAP 1) to check if the MAP comes out to 1 for all the queries. As my luck would have it, a few queries had 0.2 or 0.5, which killed our reproduction. I found out we had a conflicting Lucene index that caused this issue. So now I want to do a fast run on Compute Canada. I did the whole setup, transferred all the files into my folder, and set up the virtual env as well.
I am doing this because my workstation is busy running the trec eval on AOL.

The main issue is: the BM25 retrieval runs perfectly for the original queries, but when I do parallel processing, it only creates the bm25 file and does not write to it. If you get any time, please have a look over the weekend, or if you are in the lab anytime over the weekend I would like to sit with you and fix this issue!

Please help me, you are my last hope!

Warmest regards,
Yogeswar

Help required with split merge logic

Hello @hosseinfani ,
I am unable to get a proper idea of how to solve the split/merge. We can do the split before search and merge after trec_eval with the use of a variable.

Here are the issues I guess we will face if we do it with a split during evaluation.
The main problem we are trying to solve: get the search to split properly into multiple search files and merge after evaluation
The issues that come with it

  1. The current method will not cut off exactly after a qid's search results end.
  2. Adding the topK parameter also will not work, as we might have duplicate docs that need to be removed, and this will cause discrepancies.
  3. If we use the group-by method, we don't have enough RAM to split the dataframe after loading.

My best suggestion is to split before search or merge, and pair them all after evaluation (a qid-aligned split sketch is below).
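A minimal sketch of a qid-aligned split and merge, assuming the search output is a dataframe with a 'qid' column (names are assumptions, not existing code):

import numpy as np
import pandas as pd

def split_by_qid(run: pd.DataFrame, n_chunks: int):
    # chunk on qid boundaries so one qid's results never straddle two files
    for part in np.array_split(run['qid'].unique(), n_chunks):
        yield run[run['qid'].isin(part)]

def merge_chunks(paths):
    # pair everything back together after evaluation
    return pd.concat((pd.read_csv(p, sep='\t') for p in paths), ignore_index=True)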

Please let me know if I can go ahead with testing this method.

Thanks,
Yogesh

2022 - SIGIR - Can Users Predict Relative Query Effectiveness?

Title: Can Users Predict Relative Query Effectiveness?
Year: 2022
venue: SIGIR ( Short paper track)
Link to paper

Main problem:
Often users' needs will not be met if their query is not formulated in the right way to retrieve a relevant web page.
Output:
They compare results generated by a SERP (Search engine results page) to that of a query selected by the user and compare its nDCG@10, showing both have similar end results.
Contribution and Motivation:
They used crowd-sourced workers from Amazon's Mechanical Turk for 2 tasks,

  • A query ranking task
  • A query rating task

The motivation of this paper is to explore whether users can predict the effectiveness of alternative queries for the same retrieval need and hence provide guidance (or even training data) to automatic techniques for query performance prediction.

Proposed method:
Given two HITs (Human Intelligence Tasks), ranking and rating, each worker has to rank/rate a given query for its effectiveness on a scale of 1-5. This is later compared with the retrieval results from the SERP, and the findings are drawn from that comparison.

Gaps in the work:
This study was done with the help of crowd workers, whose behavior may not match that of the general user population.

Also, they only focused on a limited set of queries from the [UQV100 dataset](https://www.microsoft.com/en-us/research/publication/uqv100-a-test-collection-with-query-variability/).
code:
Analysis of the results is available in the repository:
https://github.com/Zendelo/cs-qpp

Overall Tasks on Repair

No context

  1. Transformer Fine-Tuning
    • #24
    • Bert (or other one)
  2. Choice of pairing
    • query.docs
    • query.doc
    • docs.query
    • doc.query
  3. Query Set
    • msmarco.passage
    • msmarco.document
    • Aol-title
    • Aol-title-url
    • Aol-text
    • Yahoo Q & A

Information Retrieval

  1. Sparse Retrieval
    • BM25
    • qld
  2. Dense Retrieval
  3. Hybrid Retrieval

Evaluation

  • MAP
  • MRR
  • Success
  • nDCG

Stats

Supervised Query Refinement

  • Acg
  • Anmt
  • Hredqs
  • transformer

Context

  • #23
  • Time
  • QueryType

Paper Writeup

  • Related Work
  • Benchmark on RePair
  • ...

Yahoo Q & A answering parsing

Hello Dr.@hosseinfani ,
I am in the process of writing the parser for Yahoo Q & A.
I have the base idea written out and it works well on the sample.
Before I run it on the whole dataset, I would like to clarify a few things about choosing the right content and numbers.

Here is a typical example of a document instance:

<document type="wisdom">
<uri>432470</uri>
<subject>Why are yawns contagious?</subject>
<content>When people yawn, you see that other people in the room yawn, too.  Why is that?</content>
<bestanswer>When your body need more oxygen, you yawn.  &lt;br /&gt;&#xa;When you yawn, you take more oxygen in the air.  &lt;br /&gt;&#xa;If the density of the oxygen in the air becomes lower, other people (their bodies) can feel that and start to yawn to get more oxygen.&lt;br /&gt;&#xa;That's why yawns are contagious. Yawning is extremely contagious -- 55% of people who witness someone yawn will yawn within five minutes. If a visually impaired person hears a tape of someone yawning, he or she is likely to yawn as well. Face it, the likelihood of you making it to the end of this answer without looking like one of these gaping maws is unlikely. &lt;br /&gt;&#xa;Although the contagious nature of yawning is well established, we know less about why this is so. Researchers are currently giving the topic some serious attention. One theory suggests it's a holdover from a period in evolutionary history when yawning served to coordinate the social behavior of a group of animals. A recent study postulates that contagious yawning could be part of the "neural network involved in empathy." &lt;br /&gt;&#xa;&lt;br /&gt;&#xa;While the mystery of contagious yawning has yet to be solved, perhaps researchers are closing in on an answer. On the other hand, given the subject matter, we wouldn't blame them for falling asleep at the wheel. In the meantime, give the "yawn challenge" a try -- it's tougher than it looks.</bestanswer>
<nbestanswers><answer_item>Yawning is extremely contagious -- 55% of people who witness someone yawn will yawn within five minutes. If a visually impaired person hears a tape of someone yawning, he or she is likely to yawn as well. Face it, the likelihood of you making it to the end of this answer without looking like one of these gaping maws is unlikely. &lt;br /&gt;&#xa;Although the contagious nature of yawning is well established, we know less about why this is so. Researchers are currently giving the topic some serious attention. One theory suggests it's a holdover from a period in evolutionary history when yawning served to coordinate the social behavior of a group of animals. A recent study postulates that contagious yawning could be part of the "neural network involved in empathy." &lt;br /&gt;&#xa;&lt;br /&gt;&#xa;While the mystery of contagious yawning has yet to be solved, perhaps researchers are closing in on an answer. On the other hand, given the subject matter, we wouldn't blame them for falling asleep at the wheel. In the meantime, give the "yawn challenge" a try -- it's tougher than it looks.</answer_item>
<answer_item>When your body need more oxygen, you yawn.  &lt;br /&gt;&#xa;When you yawn, you take more oxygen in the air.  &lt;br /&gt;&#xa;If the density of the oxygen in the air becomes lower, other people (their bodies) can feel that and start to yawn to get more oxygen.&lt;br /&gt;&#xa;That's why yawns are contagious.</answer_item>
</nbestanswers>
<cat>Trivia</cat>
<maincat>Education &amp; Reference</maincat>
<subcat>Trivia</subcat>
<date>1127631600</date>
<res_date>1120237002</res_date>
<vot_date>1127977920</vot_date>
<lastanswerts>1161068399</lastanswerts>
<qlang>en</qlang>
<qintl>us</qintl>
<language>en-us</language>
<id>u1254780</id>
<best_id>u1305446</best_id>
</document>

Questions:

  1. Which is the better choice for the query: the subject tag or the content tag?
  2. From the looks of it, <bestanswer> is mostly one item from <nbestanswers>, but sometimes it's a combination of all of them. What should I do here?
  3. For other works that are extensions of RePair (category, time, etc.), what would you like me to extract and store using this parser?
  4. My first assumption was that <best_id> and <id> would be unique, but some users have answered multiple questions. So what could be considered as the DID? Also, I don't understand the difference between these two IDs.
  5. Final and important: for user context, what should I consider as the userID?

Confirmed items:

  1. the uri tag is going to be the query id (used in the parsing sketch below)
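A minimal sketch of the parser idea; the tag-to-field mapping below is an assumption pending the answers to the questions above (only the uri mapping is confirmed):

import xml.etree.ElementTree as ET

def parse_document(xml_str: str) -> dict:
    doc = ET.fromstring(xml_str)
    get = lambda tag: (doc.findtext(tag) or '').strip()
    return {
        'qid': get('uri'),          # confirmed: uri is the query id
        'query': get('subject'),    # or 'content' -- see question 1
        'doc': get('bestanswer'),   # or a join of <answer_item> -- see question 2
        'cat': get('cat'),          # for category/time extensions -- see question 3
        'date': get('date'),
        'uid': get('id'),           # candidate userID -- see questions 4 and 5
    }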

Colbert -- > Next step

Hello @hosseinfani, I have an issue: TCT-ColBERT works when running it in a single process, since that was tested on the original queries. But when I multiprocess with the predicted queries, it runs into a memory error, and I cannot fix it by any means with the debugger. Please let me know if you will be available in the lab this weekend so we can sort out this issue with your guidance.

Thanks

Context aware creation of large scale datasets with user as context

@hosseinfani.
This is what I have as an idea. Please let me know if this sounds like a good short-paper idea for the user context.
Previously we had AOL title and URL, where we trained the dataset on the whole document, discarding user info.
In this idea, we will do it slightly differently.
We currently have the qrels file as qid, user, did, relevance.
We will merge the qid and the user.
So the new qrels will become: qid_user 1 did relevance

We will fetch all of a user's queries and pair them with their document ids. This means only documents fetched for that user will be used for refinement, and not relevant documents belonging to other users.
The new doc-query pair will be: user queries -> documents for that user only. We also want to add the text to make sure we get more content for our T5 to produce more relevant queries (a minimal sketch of the qrels merge is below). I will do some initial experiments and provide you with an update soon.
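A minimal sketch of the qid_user merge (file and column names are assumptions, not existing code):

import pandas as pd

qrels = pd.read_csv('qrels.tsv', sep='\t', names=['qid', 'user', 'did', 'relevance'])
qrels['qid_user'] = qrels['qid'].astype(str) + '_' + qrels['user'].astype(str)
new_qrels = qrels.assign(iteration=1)[['qid_user', 'iteration', 'did', 'relevance']]
new_qrels.to_csv('qrels.user.tsv', sep='\t', index=False, header=False)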

2020 - CIKM - Transformer Models for Recommending Related Questions in Web Search

Title: Transformer Models for Recommending Related Questions in Web Search

Year: 2020

venue: CIKM (Short paper track) 
link to paper

Main problem:
Much work has been done on query reformulation and related searches, but very little on recommending related questions for a given user query in web search.

Contribution and Motivation:
The contribution is recommending related questions that are relevant to the original query asked by the user. They use several ways to fetch different sets of queries, which they compare and annotate for the PAA (People Also Ask) section of a search engine.

Proposed method:
For a query q, the standard session-log co-occurrence-based related-search method is used to get a set of related queries Q.
Given a (query, question) pair (q, q'), they retrieve the top 10 results from the SERP (search engine results pages). Similarity is established from the results of these returned SERPs, particularly the titles and snippets shown in the search results.
The query and the question are concatenated into a single sequence, separated by a special "SEP" token, and fed to the transformer model.
Gaps in the work:
The work was done on a private dataset created from anonymized search queries.
code:
No code is available!

T5 as refiner methods

We used T5 to generate refined queries for the benchmark datasets. However, we can also evaluate the T5 refiner on a test set and measure its performance as a refiner method in its own right.

Experiment on AOL

Dear @hosseinfani,
I have been trying to index the documents for AOL, but I am getting this strange error. I would like your help with this; could you please help me fix it?
If not, could you suggest a workaround?
Screen Shot 2023-01-07 at 5 57 19 PM

Thanks,

Data Loader for Domain Specific Datasets

Hello @hosseinfani @DelaramRajaei

It appears that the CLEF-IP track is not accessible through ir_datasets. It's possible to still build a loader but it would need to be markedly different from the other loaders in the repo. Would you like me to explore this option?

Alternatively, I could start building loaders for other datasets that are present in ir_datasets, if we want to stay within that. One option would be to start building a loader for nfCorpus (medical) while I look for a legal dataset within ir_datasets.
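For reference, a minimal sketch of pulling a dataset through ir_datasets (the dataset id here is only an example; query field names differ per dataset, which is why the query class is inspected rather than assumed):

import ir_datasets

dataset = ir_datasets.load('nfcorpus/train')
print(dataset.queries_cls()._fields)          # which query fields this dataset provides
for qrel in dataset.qrels_iter():             # qrels expose query_id, doc_id, relevance
    print(qrel.query_id, qrel.doc_id, qrel.relevance)
    break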

Zerveas et al. idea on query refinement

Brown University at TREC Deep Learning 2019:

To my understanding, they make an inter-query association based on the overlaps in relevant documents, e.g., q and p are two queries, and they are related if R_q and R_p intersect.

I'm thinking of computing a Jaccard score for two queries based on the sets of their relevant docs. If it's more than a threshold, we consider them as refined versions of each other.

How do we calculate the pairwise Jaccard? I think it's close to calculating the number of pairwise collaborations, like here (a minimal sketch is below).
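A minimal sketch of the pairwise Jaccard over relevant-doc sets (an assumption about the data layout, not existing code):

from itertools import combinations

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def refined_pairs(rel_docs: dict, threshold: float = 0.5):
    # rel_docs: {qid: set of relevant doc ids}; pairs above the threshold are
    # treated as refined versions of each other
    return [(q, p) for q, p in combinations(rel_docs, 2)
            if jaccard(rel_docs[q], rel_docs[p]) >= threshold]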

Implementation of Dense Retrievals

This is the issue where I will keep a record of all my findings as I work on refining all aspects of the retrieval system on different datasets using dense retrieval.

Update on T5 fine tuning with Ms marco

Dear Dr @hosseinfani,
I was finally able to successfully fine-tune the T5 model on the msmarco doc-query pairs. Now I will infer the predictions from the model. I will upload the generated predictions in our Teams chat once I am done predicting.

Thanks,
Yogesh

Chat GPT as a model to reformulate queries.

This idea involves asking ChatGPT to generate relevant queries based on the documents we feed it.

We will sample about 10,000 documents that meet the following criteria from the refined queries:

  • Highly relevant (MAP score 0.75-1)
  • Relevant (0.5-0.74)
  • Somewhat relevant (0.25-0.49)
  • Irrelevant (0-0.24)

Based on this data, we will compare how well ChatGPT performs at suggesting queries for these documents (a minimal sampling sketch is below).
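A minimal sketch of the stratified sampling step; the file layout, column names, and per-bucket size are assumptions:

import pandas as pd

df = pd.read_csv('refined.queries.tsv', sep='\t')    # assumed columns: qid, doc, map
bins = [0, 0.25, 0.5, 0.75, 1.0]
labels = ['irrelevant', 'somewhat', 'relevant', 'highly']
df['bucket'] = pd.cut(df['map'], bins=bins, labels=labels, include_lowest=True)
sample = df.groupby('bucket', group_keys=False).apply(
    lambda g: g.sample(min(2500, len(g)), random_state=42))   # ~10,000 docs over 4 buckets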

Addition of dense and hybrid retrieval methods (rankers)

It's good to study the effect of rankers in finding refined queries.
As seen in the toy results, among sparse retrievers, bm25 yields more relevant docs than qld and hence better IR metrics and more refined queries in the final datasets.

Rag Fusion

This is the issue where I log the progress of RAG fusion in the RePair project.

Domain-Specific Datasets

Hey @michelecatani
This is the issue where you can log your findings about datasets in specific domains like law and medicine, which include query and relevant document pairs.

Additionally, please check the availability of temporal data and user-specific information for each query within these datasets.
