This project covers several areas of information retrieval. It examines the functional components of an information retrieval system, including document and query representation, indexing techniques, similarity and matching, retrieval models, evaluation techniques, implementation issues, query reformulation (relevance feedback), space models, and space reduction.
- Read data from files. Supported file types:
  - Text files
  - PDF files (via the PDF Reader Library)
  - Images (all image types supported by Tesseract-OCR, an optical character recognition engine)
- Remove stopwords and handle three symbols { @ . , }
- Write the results to the STP directory as text files
- Read data from the STP directory
- Remove suffixes using the Porter Stemmer algorithm
- Write the results to the SFX directory as text files
- Read data from the SFX directory
- Initialize and fill the inverted file matrix
- Detect the new stopwords
- Receive a query from the user
- Initialize and fill the weighted query list
- Rank the documents
- Calculate precision, recall, and F-measure
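
The indexing, ranking, and evaluation steps listed above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the class `IrSketch`, its method names, the simple tokenizer, and the tf-idf weighting (raw term frequency multiplied by `log(N / df)`) are all assumptions made for the example. Stopword removal and stemming are assumed to have happened already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and weighting scheme are assumptions, not the project's code.
class IrSketch {

    // Splits text into lowercase alphanumeric terms.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Builds an inverted index: term -> (document id -> term frequency).
    static Map<String, Map<Integer, Integer>> buildIndex(List<String> docs) {
        Map<String, Map<Integer, Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : tokenize(docs.get(id))) {
                index.computeIfAbsent(term, t -> new HashMap<>()).merge(id, 1, Integer::sum);
            }
        }
        return index;
    }

    // Scores each document by summing tf * idf over the distinct query terms,
    // then returns the matching document ids, best score first (ties by id).
    static List<Integer> rank(Map<String, Map<Integer, Integer>> index, int docCount, String query) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : new HashSet<>(tokenize(query))) {
            Map<Integer, Integer> postings = index.get(term);
            if (postings == null) {
                continue; // term does not occur in the collection
            }
            // A term that occurs in every document gets idf = 0 and adds nothing.
            double idf = Math.log((double) docCount / postings.size());
            for (Map.Entry<Integer, Integer> e : postings.entrySet()) {
                scores.merge(e.getKey(), e.getValue() * idf, Double::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return ranked;
    }

    // Returns {precision, recall, F-measure} for a retrieved set against a relevant set.
    static double[] evaluate(Set<Integer> retrieved, Set<Integer> relevant) {
        Set<Integer> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);
        double p = retrieved.isEmpty() ? 0.0 : (double) hits.size() / retrieved.size();
        double r = relevant.isEmpty() ? 0.0 : (double) hits.size() / relevant.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f};
    }

    public static void main(String[] args) {
        // Toy collection standing in for the stemmed files.
        List<String> docs = Arrays.asList("apple banana apple", "banana cherry", "cherry durian");
        Map<String, Map<Integer, Integer>> index = buildIndex(docs);
        List<Integer> ranked = rank(index, docs.size(), "apple banana");
        double[] m = evaluate(new HashSet<>(ranked), new HashSet<>(Arrays.asList(0)));
        System.out.println("Ranked documents: " + ranked);
        System.out.println("Precision=" + m[0] + " Recall=" + m[1] + " F=" + m[2]);
    }
}
```

The nested map keeps the sketch short. A real collection might instead use a term-by-document matrix, as the "inverted file matrix" step suggests, and normalize scores by document length.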
- Java 13, available from https://www.oracle.com/java/technologies/javase/jdk13-archive-downloads.html
- PDF Reader Library, available from https://downloads.apache.org/pdfbox/2.0.24/
- Tesseract-OCR, available from https://tesseract-ocr.github.io/tessdoc/Downloads.html
The main path is "C:\Users\LENOVO\Desktop\IR".
contains test collection
contains files with stop words removed
contains files after stemming
contains all the .exe files of the search engine
contains one text file, "imageOutText.txt", that holds the engine's result