
stonybrooknlp / ircot

125 stars · 21 watchers · 16 forks · 2.06 MB

Repository for "Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions" (ACL 2023)

Home Page: https://arxiv.org/abs/2212.10509

License: Apache License 2.0

Languages: Jsonnet 46.28%, Python 32.02%, Shell 21.61%, Dockerfile 0.10%
Topics: chain-of-thought, large-language-models, multi-step-reasoning, question-answering, multi-step-retrieval, retrieval-augmented-qa

ircot's Issues

Question about the figure demonstration

Hi, thank you for the great work! I have a question about the figure in this repo (Figure 2 in the paper). The right-hand side "Reason" step takes in the triplet (Q, yellow documents retrieved with Q, T1). However, if I understand your approach correctly, shouldn't the input actually be (Q, yellow documents retrieved with Q plus blue documents retrieved with T1, T1)? My mental model of the interleaved loop is sketched below.
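(A toy sketch of how I understand the loop — retrieve and reason are stand-in stubs of mine, not your code:)

    # My mental model of the IRCoT interleaving loop (toy stubs so it runs):
    def retrieve(query):
        return [f"doc-for:{query}"]

    def reason(question, docs, cot):
        return f"CoT-sentence-{len(cot) + 1}"

    Q = "example question"
    docs = retrieve(Q)            # "yellow" docs retrieved with the question
    cot = []
    for _ in range(3):            # toy termination after 3 steps
        t = reason(Q, docs, cot)  # next CoT sentence sees ALL docs so far
        cot.append(t)
        docs += retrieve(t)       # "blue" docs retrieved with the new sentence
    print(cot, docs)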

Data and trained model

Hi,

I have several questions regarding your work!

  1. It seems 2wikimultihopqa is not downloaded properly by raw_data.sh.
  2. In the code, are you saving the model trained with the best hyperparameters?
  3. What are the base_configs and instantiated_configs folders used for?

Thank you in advance.

Where and how is the reason-step implemented?

Hi,

I really appreciate your work and the delicately structured code!

In the paper, you mentioned that the reason-step generates the next CoT sentence based on

  1. the question,
  2. the paragraphs retrieved so far, and
  3. the CoT sentences generated so far.

I wonder how the three components are combined. Did you simply concatenate them, i.e., something like concat(question, paragraph_1, paragraph_2, ..., CoT_sent_1, CoT_sent_2, ...)? Where is this part located in the code? The sketch below shows the kind of naive concatenation I have in mind.
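(Purely illustrative on my part — build_prompt is my own name, not from your code:)

    def build_prompt(question, paragraphs, cot_sentences):
        # The naive concatenation I have in mind: retrieved paragraphs first,
        # then the question, then the CoT sentences generated so far.
        context = "\n\n".join(paragraphs)
        cot = " ".join(cot_sentences)
        return f"{context}\n\nQ: {question}\nA: {cot}"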

I tried to look it up, and it seems that you fetch the retrieved paragraphs in read_examples() in dataset_readers.py, where output_instance is returned as a list of dictionaries containing all the relevant information for each paragraph.
And in inference_mode in configurable_inference.py, somehow the whole reasoning and answering is completed. What happens there?

Also, I want to make sure that in this implementation the unit of indexing/retrieval is the whole paragraph of a document, right? That means for each Wikipedia article we have only one entry in the database, instead of splitting it into smaller chunks/passages.

Please feel free to correct me on any misunderstanding of mine. Thanks again for your effort 😊

Dataset encoding format

What encoding is used for the dataset you provided? I opened it as UTF-8: English characters display correctly, but Russian and other non-Latin text does not.
[Screenshot 2024-04-03 205121]
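Am I perhaps just seeing JSON \uXXXX escapes? This is how I'm checking (the exact path is my local guess, and I print the keys rather than assume field names):

    import json
    # JSONL dumped with ensure_ascii=True shows non-Latin text as \uXXXX escapes
    # in a plain text editor, but json.loads() decodes it back to Unicode.
    with open("processed_data/2wikimultihopqa/test_subsampled.jsonl", encoding="utf-8") as f:
        record = json.loads(next(f))
    print(list(record.keys()))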

Contexts in processed_data

Hi, first of all, thank you for the great work!

I really enjoyed reading the paper, and the proposed idea and its promising results were really interesting.

Now, I am trying to use this codebase for my own project and have a question about the processed_data.

In the processed_jsonl file (e.g., test_subsampled.jsonl), the contexts are already included for all datasets.

Are these contexts the result of a single BM25 retrieval? If not, how were they obtained? (The snippet below is how I'm looking at them.)
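For reference, this is how I'm inspecting them (the dataset path is just my example, and I'm assuming the field is literally named contexts):

    import json
    # Peek at the pre-included contexts of the first subsampled test example.
    with open("processed_data/hotpotqa/test_subsampled.jsonl", encoding="utf-8") as f:
        example = json.loads(next(f))
    print(len(example["contexts"]))  # number of contexts attached to this question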

If you can provide the answer to this question, it would be really useful.

Thank you so much!

How much does it cost to solve this problem

For GPT-3, I wonder about the monetary cost and the time cost on the 4 datasets.
For Flan-T5, I wonder about the time cost on the 4 datasets at the different model sizes.
Can you provide the actual numbers?

`2wikimultihopqa` Raw Data

Hi Harsh, the download/raw_data.sh script does not download (or extract?) the raw data of 2wikimultihopqa correctly, as I found an empty folder in raw_data/2wikimultihopqa. Could you please update the script? Thanks!

address.jsonnet file format and CUDA error

Hi,

I'm trying to reproduce the results, and I found that llm_server_address.jsonnet and retriever_address.jsonnet are necessary.
Can you provide an example of these files? (Is something like the sketch below the expected shape?)
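My guess at the shape, based only on run.py reading retriever_address["host"] and retriever_address["port"] — please correct me if the actual format differs:

    // retriever_address.jsonnet — my guess, not taken from the repo
    {
      host: "localhost",
      port: 8000,
    }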

Also, I'm getting a torch.cuda.OutOfMemoryError: CUDA out of memory error message. If you could give me some tips for avoiding the CUDA error (e.g., where to reduce the batch size), that would be appreciated.

Thank you in advance :)

EXTREME WARNING: Not enough space to even fit in even the test example

Hi, when I was running ./reproduce.sh ircot flan-t5-base hotpotqa, I got this warning:

Token indices sequence length is longer than the specified maximum sequence length for this model (555 > 512). Running this sequence through the model will result in indexing errors
Running inference on examples
0it [00:00, ?it/s]EXTREME WARNING: Not enough space to even fit in even the test example.
EXTREME WARNING: Not enough space to even fit in even the test example.
EXTREME WARNING: Not enough space to even fit in even the test example.
EXTREME WARNING: Not enough space to even fit in even the test example.
EXTREME WARNING: Not enough space to even fit in even the test example.
EXTREME WARNING: Not enough space to even fit in even the test example.
1it [03:27, 207.97s/it]
...

I am not sure if this is right, please let me know if there's anything I need to fix.

This information might be relevant, so I'm including it here:

  1. I changed the retriever_server port.
    Instead of uvicorn serve:app --port 8000 --app-dir retriever_server, I used port 9201, since port 8000 was already in use.
    I ran: uvicorn serve:app --port 9201 --app-dir retriever_server
    Also, I made these changes: in predict.py and run.py, I set env_variables["RETRIEVER_PORT"] to 9201, since str(retriever_address["port"]) couldn't get the right port:

retriever_address = get_retriever_address()
print("[here]retriever_address",retriever_address)
env_variables["RETRIEVER_HOST"] = str(retriever_address["host"])
# env_variables["RETRIEVER_PORT"] = str(retriever_address["port"])
env_variables["RETRIEVER_PORT"] = str("9201")
print("[here][env_variables['RETRIEVER_PORT']]",env_variables["RETRIEVER_PORT"])

  2. I was using bf16.
    Since I got CUDA out of memory, I ran MODEL_NAME=flan-t5-base-bf16 RETRIEVER_PORT=9201 /mnt/.conda/envs/ircot/bin/uvicorn serve:app --port 8010 --app-dir llm_server.
    I also changed base_configs/ircot_flan_t5_base_hotpotqa.jsonnet:

"model_tokens_limit": 1000,

  3. About the localhost responses
    (I feel like something is wrong with the outputs, but I'm not sure.)
    First, I started elasticsearch, and http://127.0.0.1:9200 returned:

{
  "name" : "dell-PowerEdge-T640",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "i1NX0dODQ3qWEUBxhfl9Ig",
  "version" : {
    "number" : "7.10.2",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "747e1cc71def077253878a59143c1f785afa92b9",
    "build_date" : "2021-01-13T00:42:12.435326Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

Second, I started the retriever_server, and http://127.0.0.1:9201 returned

{"message":"Hello! This is a retriever server."}
and on http://127.0.0.1:9201/retrieve/
{"detail":"Method Not Allowed"}

Third, I started MODEL_NAME=flan-t5-base-bf16 RETRIEVER_PORT=9201 /mnt/.conda/envs/ircot/bin/uvicorn serve:app --port 8010 --app-dir llm_server, and http://127.0.0.1:8010/ returned

{"message":"Hello! This is a server for flan-t5-base-bf16. Go to /generate/ for generation requests."}
and on http://127.0.0.1:8010/generate/
{"detail":[{"type":"missing","loc":["query","prompt"],"msg":"Field required","input":null,"url":"https://errors.pydantic.dev/2.7/v/missing"}]}
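For completeness, here is how I probed /generate/ myself (the prompt query parameter is just my inference from the pydantic "missing" error above, not from any docs):

    import requests
    # Probe the LLM server with the query parameter the error message asks for.
    resp = requests.get(
        "http://127.0.0.1:8010/generate/",
        params={"prompt": "Who wrote Hamlet?"},
    )
    print(resp.status_code, resp.json())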

Thank you in advance!

When indexing, Elasticsearch instance fails to connect to localhost

Hi,

Thank you for your awesome work and for kindly sharing the code. I really love the idea of querying the database with the LLM's generated output. It's very inspiring! :)

However, when following README.md I ran into a little trouble with Elasticsearch, since it's my first time using it and I'm somewhat confused.
I successfully started the Elasticsearch server on port 9200 and the retriever server on port 8000, but got stuck at indexing. When I run python retriever_server/build_index.py hotpotqa, it gets to this line

    es.indices.create(index=elasticsearch_index, ignore=400, body=paragraphs_index_settings)

It first shows this error:

Traceback (most recent call last):                                                                                                        
  File "/home/guest/r11944026/anaconda3/envs/ircot/lib/python3.8/site-packages/urllib3/connectionpool.py", line 791, in urlopen           
    response = self._make_request(                                                                                                        
  File "/home/guest/r11944026/anaconda3/envs/ircot/lib/python3.8/site-packages/urllib3/connectionpool.py", line 537, in _make_request     
    response = conn.getresponse()                                                                                                         
  File "/home/guest/r11944026/anaconda3/envs/ircot/lib/python3.8/site-packages/urllib3/connection.py", line 461, in getresponse           
    httplib_response = super().getresponse()                                                                                              
  File "/home/guest/r11944026/anaconda3/envs/ircot/lib/python3.8/http/client.py", line 1322, in getresponse                               
    response.begin()                                                                                                                      
  File "/home/guest/r11944026/anaconda3/envs/ircot/lib/python3.8/http/client.py", line 303, in begin                                      
    version, status, reason = self._read_status()                                                                                         
  File "/home/guest/r11944026/anaconda3/envs/ircot/lib/python3.8/http/client.py", line 272, in _read_status                               
    raise RemoteDisconnected("Remote end closed connection without"                                                                       
http.client.RemoteDisconnected: Remote end closed connection without response    
During handling of the above exception, another exception occurred:                                                                       
...

followed by more errors.

At the same time, the elasticsearch server log shows

[2023-10-24T15:14:41,827][WARN ][o.e.h.n.Netty4HttpServerTransport] [cuda8] received plaintext http traffic on an https channel, closing connection Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:50652}

It seems to be an HTTP vs. HTTPS problem. Therefore, I tried bluntly changing this line in build_index.py

    elastic_host = "localhost"

to

    elastic_host = "https://localhost"

but it still doesn't work.

Could you please give me a hand? I'll really appreciate it.

Note: I followed the official installation guide and I'm using Elasticsearch 8.10, which is different from your version. Could that possibly be the reason? (The workaround I'm planning to try is sketched below.)
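In case it matters, here is what I plan to try next: per the Elasticsearch 8.x docs, security/TLS is enabled by default in 8.x (unlike 7.10.2), so I would disable it in config/elasticsearch.yml:

    # Disable the 8.x security defaults so plain-HTTP clients work (dev only):
    xpack.security.enabled: false
    xpack.security.http.ssl.enabled: false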

The hyperparameters

Hi Harsh,
I am wondering, for the 4 datasets, what are K (the number of paragraphs to retrieve at each step) and M (the number of distractor paragraphs) for IRCoT? Could you please provide the details? Thanks!

Ongoing Maintenance and Setup Queries for ircot Project

Hey Harsh Trivedi,

I've been trying to get my local setup aligned with the ircot project, specifically the state of the repo at this commit: 8637316e5e94ba0a2493e5df7846f2f23f46eaef.

I'm running into a few hiccups trying to replicate the environment on my end. Have there been any updates to requirements.txt, or are there particular package versions I should use for a smooth setup?

Thanks a lot for your help, and for all the awesome work you're putting out there!

Cheers,
Hippoley

How do I know the call flow?

How can I tell how each function gets called? It all looks like Jsonnet operations. Do these Jsonnet documents reflect the ircot execution flow? (Below is how I've been poking at the configs.)
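For context, this is how I've been inspecting the configs so far (a sketch assuming the standard jsonnet Python bindings; the repo may instantiate its configs differently):

    import json
    import _jsonnet  # pip install jsonnet

    # Evaluate a base config down to plain JSON and inspect its top-level keys.
    config = json.loads(
        _jsonnet.evaluate_file("base_configs/ircot_flan_t5_base_hotpotqa.jsonnet")
    )
    print(list(config.keys()))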
