This is a multi-container Docker application; Docker Desktop is required to run Macaw. At the moment, this application does not start Macaw automatically. It installs all the dependencies needed to run Macaw and creates an interactive environment in which you can run all scripts pertaining to Macaw.
The application is built on the Docker image for Pyndri, which is a Python interface to the Indri search engine.
At the moment, this image contains:
- Ubuntu 16.04 LTS
- Python 3.6.9
- Retrieval back-end systems:
  - Indri 5.11
  - Pyndri 0.4
- Java 11.0.6
- Stanford Core NLP
- Macaw
The first three items are built in the container for Pyndri. We build the last three on top of that in this Macaw Docker application.
First Time:
1. Open your terminal and clone this repository: `git clone https://github.com/roynirmal/macaw_docker`.
2. Run `cd macaw_docker`.
3. Run `docker-compose build --no-cache`. This will retrieve the necessary images for Ubuntu, Python3, Indri, Pyndri, Stanford Core NLP and finally Macaw itself, and build the containers for mongo and macaw.
4. Run the following command: `docker-compose up -d` (don't close this terminal).
5. Get the `CONTAINER_ID` for `macaw_docker_app` by running `docker ps -a`.
6. Get into the interactive mode for executing Macaw by running `docker-compose run app /bin/bash`.
7. Execute `python3 /macaw/macaw/live_main.py` to start the basic Macaw STDIO interface.
8. To close the application, press `Ctrl+A+D` on the terminal.
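The clone, build and start-up commands above can also be driven from a small Python helper. The following is only a sketch, assuming `git` and `docker-compose` are available on your `PATH`; the function names `setup_commands` and `run_setup` are hypothetical and not part of this repository. It defaults to a dry run that only prints the commands, and it leaves out the interactive steps (opening a shell in the container and starting Macaw).

```python
import subprocess
from typing import List, Tuple

REPO_URL = "https://github.com/roynirmal/macaw_docker"
REPO_DIR = "macaw_docker"  # directory created by `git clone`


def setup_commands() -> List[Tuple[List[str], str]]:
    """Return (command, working directory) pairs for the non-interactive steps."""
    return [
        (["git", "clone", REPO_URL], "."),
        (["docker-compose", "build", "--no-cache"], REPO_DIR),
        (["docker-compose", "up", "-d"], REPO_DIR),
    ]


def run_setup(dry_run: bool = True) -> List[str]:
    """Print each command; execute it only when dry_run is False."""
    shown = []
    for cmd, cwd in setup_commands():
        line = "(in {}) {}".format(cwd, " ".join(cmd))
        print(line)
        shown.append(line)
        if not dry_run:
            # check=True raises CalledProcessError at the first failing step.
            subprocess.run(cmd, cwd=cwd, check=True)
    return shown


if __name__ == "__main__":
    # Hypothetical usage: switch to dry_run=False to actually run the commands.
    run_setup(dry_run=True)
```

Running the commands by hand, as described in the steps, remains the reference procedure; the helper only saves retyping when you rebuild often.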
In this Docker image we start by running a bare-bones version of Macaw using the WikiPassageQA dataset and an Indri index. To set that up, we perform the following steps (you do not need to repeat them; they are already done while building the container in steps 1-8 above):
- Get the dataset from here.
- Extract it.
- Convert the data into the `trectext` format required by `Indri` to index the data. For that we use our script. The script is a bit hacky for now, but we will update it soon ;)
- And then finally build the index using `Indri`.
- Index the data using `Indri`: `/indri-5.11/buildindex/IndriBuildIndex -corpus.path=./input_file.txt -corpus.class=trectext -index=/data/index`. The `index` argument specifies the path where you want to store the index.
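To make the conversion step more concrete, here is a minimal sketch of writing documents in `trectext` markup, where each document is wrapped in `<DOC>`, `<DOCNO>` and `<TEXT>` tags. This is not the script used in this repository: the function names `to_trectext` and `write_trectext` are hypothetical, and the input is assumed to be a plain mapping from document ID to text, which may differ from how the dataset is actually laid out.

```python
from typing import Dict


def to_trectext(doc_id: str, text: str) -> str:
    """Wrap one document in TREC text markup (DOC / DOCNO / TEXT)."""
    return (
        "<DOC>\n"
        "<DOCNO>{}</DOCNO>\n"
        "<TEXT>\n{}\n</TEXT>\n"
        "</DOC>\n"
    ).format(doc_id, text.strip())


def write_trectext(docs: Dict[str, str], out_path: str) -> int:
    """Write all documents to out_path and return how many were written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for doc_id, text in docs.items():
            out.write(to_trectext(doc_id, text))
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical sample data; replace with passages loaded from your dataset.
    sample = {"doc-1": "First passage.", "doc-2": "Second passage."}
    print(write_trectext(sample, "input_file.txt"), "documents written")
```

A file written this way is the kind of input that `-corpus.path` expects when `-corpus.class=trectext` is set in the `IndriBuildIndex` command.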
Different Macaw implementation: For this tutorial, I have forked the original Macaw repository and changed `./macaw/live_main.py` to run the bare-bones Macaw, which retrieves documents for the query we put in the standard input. To have your own implementation (e.g. to activate the voice search feature of Macaw using a Telegram token, or to use `bing` instead of `indri`), you first need to create a new branch and update `live_main.py` as required. Then you need to change the branch in the `./docker-images/app/Dockerfile/` image and build the container as in step 3 of 'First Time'.
Different dataset: To download your own dataset (and convert it to the required format), change the code in `./docker-images/app/Dockerfile/` where I download the dataset using `wget`.