This project implements an Information Retrieval System (IRS) that indexes documents and matches queries using various retrieval models. It aims to apply concepts learned in information retrieval courses, utilizing the LISA dataset for testing.
- Indexing: Implement algorithms for extracting terms, removing stopwords, and normalizing terms in documents using NLTK. Create descriptor and inverse (inverted index) files to facilitate retrieval.
- Query Matching: Implement retrieval models such as the scalar product, cosine measure, Jaccard measure, Boolean model (AND, OR, NOT), and the BM25 probabilistic model.
- Evaluation: Compare retrieval models based on average precision, P@5, P@10, recall, and F-measure, and plot precision-recall curves.
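
As a rough illustration of the query-matching measures listed above, the sketch below scores a query against a document, both represented as sparse term-weight dictionaries. It is not the project's actual code: the function names, the dictionary representation, the weighted form of the Jaccard measure, and the BM25 defaults (`k1=1.2`, `b=0.75`, with a `+1` inside the logarithm to keep the IDF non-negative) are all assumptions made for this example.

```python
import math


def scalar_product(q, d):
    """Inner product of two sparse term-weight vectors stored as dicts."""
    return sum(w * d[t] for t, w in q.items() if t in d)


def _squared_norm(v):
    return sum(w * w for w in v.values())


def cosine(q, d):
    """Scalar product normalized by the lengths of both vectors."""
    norm = math.sqrt(_squared_norm(q)) * math.sqrt(_squared_norm(d))
    return scalar_product(q, d) / norm if norm else 0.0


def jaccard(q, d):
    """Weighted Jaccard measure: dot / (|q|^2 + |d|^2 - dot)."""
    dot = scalar_product(q, d)
    denom = _squared_norm(q) + _squared_norm(d) - dot
    return dot / denom if denom else 0.0


def bm25(query_terms, doc_tf, doc_len, avg_len, df, n_docs, k1=1.2, b=0.75):
    """BM25 score of one document (assumed variant and parameter defaults).

    doc_tf: term -> frequency in the document
    df:     term -> number of documents containing the term
    """
    score = 0.0
    for term in query_terms:
        freq = doc_tf.get(term, 0)
        if freq == 0:
            continue  # terms absent from the document contribute nothing
        n_t = df.get(term, 0)
        idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
        length_norm = k1 * (1 - b + b * doc_len / avg_len)
        score += idf * freq * (k1 + 1) / (freq + length_norm)
    return score
```

For example, `cosine({"a": 1, "b": 1}, {"b": 1, "c": 1})` evaluates to 0.5 (up to floating-point rounding), since the two vectors share one of their two terms.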
- Clone Repository: Clone the repository to your local machine.
- Install Dependencies: Install required dependencies using `pip install -r requirements.txt`.
- Prepare Data: Obtain the LISA dataset from the University of Glasgow website and concatenate the files.
- Execute Application: Launch the application by running `python main.py`.
- Interact with GUI: Use the graphical user interface to perform indexing, query search, query matching, and evaluation.
- View Results: Evaluate the performance of different retrieval models and visualize precision-recall curves.
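
The evaluation measures reported by the application can be illustrated with a small, self-contained sketch. It assumes a ranked list of document IDs returned for one query and a set of IDs judged relevant; the function names and the convention of dividing P@k by `k` even when fewer than `k` documents are returned are assumptions for this example, not necessarily what the project does.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall(ranked, relevant):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(ranked) & set(relevant)) / len(relevant)


def f_measure(precision, recall_value):
    """Harmonic mean of precision and recall."""
    total = precision + recall_value
    return 2 * precision * recall_value / total if total else 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


# Hypothetical ranking and relevance judgments for one query.
ranked = [3, 1, 7, 5, 2]
relevant = {1, 2, 9}
p5 = precision_at_k(ranked, relevant, 5)
r = recall(ranked, relevant)
print(p5, r, f_measure(p5, r), average_precision(ranked, relevant))
```

Plotting precision against recall at each rank of the list gives the points of a precision-recall curve for that query.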