singhpratyush / index-search-query

Index Search Query

Inverted Index, Query Formulation and Ranking from Scratch in Python.

Part of Information Retrieval Lab (Autumn 2017-18)

Part 1: The Inverted Index

Dataset

The dataset is taken from the FIRE 2011 corpus. It can be downloaded from here. It contains articles from two different magazines. The methods for handling these files are in the magazine_index package.

Usage

To index all the files in a directory recursively, use the following command -

$ python lab_1.py path/to/files

This creates an inverted index and saves it to a file called index.bin. If that file already exists, you can reuse it by running the script without any arguments -

$ python lab_1.py
Loading index from "index.bin"
<Index documents=392577 words=105314026>
...
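
As a rough illustration of what building an index from a directory and reusing index.bin could look like, here is a minimal sketch. It is not the project's code: the function names (tokenize, build_index, save_index, load_index), the regex tokenizer, and the use of pickle as the on-disk format are all assumptions made for this example.

```python
import os
import pickle
import re
from collections import Counter, defaultdict

# Assumed tokenization scheme: lowercase runs of letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_index(root):
    """Walk `root` recursively and map each word to {file name: count}."""
    index = defaultdict(dict)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as handle:
                counts = Counter(tokenize(handle.read()))
            for word, count in counts.items():
                # Keyed by bare file name, so equal names in different
                # directories would overwrite each other in this sketch.
                index[word][name] = count
    return dict(index)


def save_index(index, path="index.bin"):
    """Serialize the index with pickle (assumed format, not the project's)."""
    with open(path, "wb") as handle:
        pickle.dump(index, handle)


def load_index(path="index.bin"):
    """Load an index previously written by save_index."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

With these helpers, save_index(build_index("path/to/files")) would write the file once, and load_index() would read it back on later runs. Because pickle can execute code while loading, only load index files from a source you trust.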

Using a pre-built index

Indexing can take a long time, so here are some pre-built index files. Rename one of them to index.bin to use it directly -

Name            Link  Size    Comments
index.bin       LINK  478 MB  Full index, 392k documents
index.bin.bak1  LINK  374 MB  303k documents
index.bin.bak   LINK  36 MB   25.8k documents

Example

$ python lab_1.py
Loading index from "index.bin"
<Index documents=303290 words=83225120>
Please start entering words to get top 5 documents containing them (CTRL+C to exit) -
Enter word: market
[('1100110_calcutta_story_11965855.utf8', 58), ('1070603_calcutta_story_7858507.utf8', 31), ('1100326_opinion_story_12251777.utf8', 30), ('1050912_frontpage_story_5227346.utf8', 30), ('1040406_opinion_story_2948544.utf8', 29)]
Enter word: delhi
[('1080422_sports_ipl.utf8', 30), ('1031223_opinion_story_2710457.utf8', 28), ('1090225_sports_story_10587273.utf8', 22), ('1090812_sports_story_11351508.utf8', 21), ('1100223_sports_story_12140507.utf8', 21)]
Enter word: messi
[('1100612_sports_story_12557276.utf8', 27), ('1100527_sports_story_12492679.utf8', 17), ('1100619_sports_story_12582889.utf8', 17), ('1090529_calcutta_story_11031479.utf8', 16), ('1100613_frontpage_story_12560387.utf8', 12)]
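
The output above is a list of (file name, count) pairs with the largest counts first. A minimal sketch of such a lookup follows, assuming the index is a plain dict that maps each word to a dict of {document: count}. The function name top_documents and the toy data are hypothetical and do not come from the project.

```python
import heapq


def top_documents(index, word, k=5):
    """Return up to `k` (document, count) pairs for `word`, highest count first."""
    postings = index.get(word.lower(), {})
    return heapq.nlargest(k, postings.items(), key=lambda item: item[1])


# Tiny hand-made index; the names and counts are invented for illustration.
toy_index = {
    "market": {"doc_a": 3, "doc_b": 7, "doc_c": 1},
}
print(top_documents(toy_index, "market", k=2))  # [('doc_b', 7), ('doc_a', 3)]
```

heapq.nlargest avoids sorting the whole posting list when only a handful of results is needed, which matters for words that occur in many documents.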

Part 2: Ranking of Documents

Usage

You can use the pre-built index here.

$ python lab_2.py index.bin
Loading index from index.bin
Enter query: programming
en.15.66.21.2008.5.9 : 97.3490637742472
en.3.347.409.2010.2.2 : 79.64923399711134
en.3.373.142.2007.6.8 : 53.09948933140756
en.3.406.410.2007.11.18 : 48.6745318871236
en.2.296.350.2010.1.20 : 48.6745318871236
en.3.393.372.2007.9.23 : 44.24957444283963
en.3.321.344.2009.7.31 : 44.24957444283963
en.3.373.299.2007.6.11 : 44.24957444283963
en.15.109.486.2009.4.1 : 44.24957444283963
en.3.393.75.2007.9.24 : 44.24957444283963
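
The README does not state the scoring formula. The scores shown above are all integer multiples of roughly 4.425 (for example, 44.2496 is about 10 x 4.42496 and 97.3491 is about 22 x 4.42496), which is consistent with a term frequency multiplied by a per-term weight such as an inverse document frequency. The sketch below shows one common scheme of that kind, TF-IDF with a natural logarithm. The function name rank, the total_documents parameter, and the index layout (word to {document: term frequency}) are assumptions for illustration, not the project's implementation.

```python
import math
from collections import defaultdict


def rank(index, total_documents, query, k=10):
    """Score documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # Terms missing from the index contribute nothing.
        # idf = ln(N / df): rarer terms receive a larger weight.
        idf = math.log(total_documents / len(postings))
        for document, term_frequency in postings.items():
            scores[document] += term_frequency * idf
    # Highest score first, truncated to the top k results.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy data, invented for illustration only.
toy_index = {"programming": {"d1": 22, "d2": 10}}
for document, score in rank(toy_index, total_documents=100, query="programming"):
    print(document, ":", score)
```

For a single-term query every document shares the same idf, so the ranking reduces to ordering by term frequency, which matches the pattern in the example output.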

Development

This project uses pipenv. To install it -

$ sudo -H pip install pipenv

To install the dependencies -

$ pipenv install

To enter a virtualenv shell -

$ pipenv shell

This spawns a new shell in which all dependencies are available.

index-search-query's Issues

Can I use text files for indexing?

I would like to ask whether I can use my own text files for indexing. I followed your README and tried the command you suggest for loading all the files in a directory, but it doesn't work. Could you please reply soon? Thanks a lot.
