The song-search-engine-csf469 from him1411

Song Search Engine

A TF-IDF based search engine for searching about 50000 songs. The main purpose of this project is to understand how vector space based retrieval models work. Install all the dependencies using pip3.

The application can be broken down into the following parts (actual file names included):

store_document_tokens_list.py: Stores the tokenized words of each document as a list, then saves the resulting lists in a JSON file.
store_vocabulary.py: Stores all the unique words present in the corpus.
store_megadict.py: Creates a nested dictionary keyed by each word in the vocabulary. Each value is a dictionary keyed by document, whose value is in turn a dictionary holding the TF, IDF and TF-IDF values of that word for that document.
store_scores_gui.py: Takes a query as input and calculates the score of each document.
final_gui.py: Contains the GUI program, written in the Flask framework for Python, which accepts a query and returns the names of the top 10 documents with the highest scores.

Order of executing the files.

$ sudo python3 store_document_tokens_list.py
$ sudo python3 store_vocabulary.py
$ sudo python3 store_megadict.py
$ sudo python3 store_scores_gui.py
$ sudo python3 final_gui.py

Installation:

Run the following in a terminal.

$ sudo pip install -r requirements.txt

If you face any problem, install nltk separately.

Installing `nltk`

$ pip3 install nltk
$ python3
>>> import nltk
>>> nltk.download()
	Packages: all

DATA STRUCTURES USED:

Document_tokens_list

A list of lists. Each inner list holds the stemmed tokens of one file in the corpus, and the inner lists are appended in order to form the outer list. Example:

[['i', 'play', 'cricket'], ['sachin', 'tendulkar'], ['india', 'is', 'best']]
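
The structure above can be sketched as follows. This is only a minimal illustration: it assumes a plain regex tokenizer in place of the project's nltk-based tokenizing and stemming, and the function names and the JSON helper are hypothetical, not taken from store_document_tokens_list.py.

```python
import json
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens.

    Stand-in (assumption) for the project's nltk tokenizing/stemming step.
    """
    return re.findall(r"[a-z]+", text.lower())


def build_document_tokens_list(documents):
    """Return one token list per document, in corpus order."""
    return [tokenize(doc) for doc in documents]


def save_as_json(obj, path):
    """Write the structure to a JSON file (the path is chosen by the caller)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(obj, handle)


corpus = ["I play cricket", "Sachin Tendulkar", "India is best"]
document_tokens_list = build_document_tokens_list(corpus)
print(document_tokens_list)
```

Keeping the inner lists in corpus order means a document's position in the outer list can double as its document id in the later structures.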

Vocabulary

A dictionary of all the unique words in the corpus, each mapped to a number. Example:

{'i': 1, 'play': 2, 'cricket': 3, 'sachin': 4, 'tendulkar': 5, 'india': 6, 'is': 7, 'best': 8}
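
One way such a vocabulary could be derived from the token lists is sketched below. The numbering rule (ids start at 1 and follow order of first appearance) is an assumption inferred from the example above, and `build_vocabulary` is a hypothetical name.

```python
def build_vocabulary(document_tokens_list):
    """Map each unique word to an id, numbered from 1 in order of first appearance."""
    vocabulary = {}
    for tokens in document_tokens_list:
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary) + 1
    return vocabulary


document_tokens_list = [
    ["i", "play", "cricket"],
    ["sachin", "tendulkar"],
    ["india", "is", "best"],
]
vocabulary = build_vocabulary(document_tokens_list)
print(vocabulary)
```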

Prime Dictionary

A nested dictionary with the structure shown in the following example (the numbers are only illustrative):

{'i': {'0': {'TF': 1, 'IDF': 0.8, 'TF-IDF': 0.8}, '1': {'TF': 2, 'IDF': 0.4, 'TF-IDF': 0.8}, '2': {'TF': 0, 'IDF': 0.3, 'TF-IDF': 0}},
 'cricket': {'0': {'TF': 2, 'IDF': 0.6, 'TF-IDF': 1.2}, '1': {'TF': 0, 'IDF': 0.4, 'TF-IDF': 0}, '2': {'TF': 1, 'IDF': 0.4, 'TF-IDF': 0.4}}}
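
A minimal sketch of building a dictionary with this shape is given below. The README does not state which TF and IDF variants the project uses, so the sketch assumes a raw term count for TF and log10(N / document frequency) for IDF; `build_prime_dictionary` is a hypothetical name.

```python
import math


def build_prime_dictionary(document_tokens_list):
    """Build {word: {doc_id: {"TF": ..., "IDF": ..., "TF-IDF": ...}}}.

    Assumed weighting: TF is the raw count of the word in the document,
    IDF is log10(number of documents / documents containing the word).
    """
    n_docs = len(document_tokens_list)
    vocabulary = {tok for tokens in document_tokens_list for tok in tokens}
    prime = {}
    for word in vocabulary:
        # Document frequency: number of documents that contain the word.
        df = sum(1 for tokens in document_tokens_list if word in tokens)
        idf = math.log10(n_docs / df)
        prime[word] = {}
        # Every document gets an entry, including those with TF 0,
        # matching the shape of the example above.
        for doc_id, tokens in enumerate(document_tokens_list):
            tf = tokens.count(word)
            prime[word][str(doc_id)] = {"TF": tf, "IDF": idf, "TF-IDF": tf * idf}
    return prime


docs = [
    ["i", "play", "cricket", "cricket"],
    ["i", "like", "tennis"],
    ["cricket", "is", "fun"],
]
prime_dictionary = build_prime_dictionary(docs)
print(prime_dictionary["cricket"])
```

Storing an entry for every (word, document) pair keeps lookups simple but grows with vocabulary size times corpus size, which may explain part of the memory usage reported in the Results section.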

Scores

A dictionary containing the score of each document, computed by running the cosine similarity algorithm between the query and the document. Example:

{'0': 0.2323, '1': 0.3125, '2': 0.467}
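
The scoring step can be sketched as below: the query and each document are turned into TF-IDF vectors and compared with cosine similarity. This is an illustration under the same assumed weighting as before (raw counts and log10 IDF), not the code in store_scores_gui.py; all function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Weight each known term in `tokens` by its raw count times its IDF."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


def score_documents(query_tokens, document_tokens_list):
    """Return {doc_id: cosine score} for the query against every document."""
    n_docs = len(document_tokens_list)
    df = Counter(term for tokens in document_tokens_list for term in set(tokens))
    idf = {term: math.log10(n_docs / count) for term, count in df.items()}
    query_vec = tfidf_vector(query_tokens, idf)
    return {
        str(doc_id): cosine(query_vec, tfidf_vector(tokens, idf))
        for doc_id, tokens in enumerate(document_tokens_list)
    }


def top_k(scores, k=10):
    """Return the k best (doc_id, score) pairs, highest score first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


docs = [["i", "play", "cricket"], ["sachin", "tendulkar"], ["india", "is", "best"]]
scores = score_documents(["play", "cricket"], docs)
print(top_k(scores, k=3))
```

Because cosine similarity divides by the vector lengths, a long song lyric is not favoured over a short one merely for containing more words.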

Creating The GUI

The GUI has been created using the Flask framework for Python, and the front-end web pages have been designed using HTML, CSS and Bootstrap. We have also provided multilingual query support using the Google API. Details about each song are obtained using the iTunes API.
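
As an illustration of the song-details lookup, the sketch below builds a request URL for the public iTunes Search API and picks a few fields out of a response. The README does not say which endpoint or fields the project actually uses, so the endpoint choice, the parameters and both function names are assumptions; the sample response is made up and no network request is sent.

```python
from urllib.parse import urlencode

# Public iTunes Search API endpoint (assumed; not confirmed by the project).
ITUNES_SEARCH_URL = "https://itunes.apple.com/search"


def build_itunes_query(song_title, limit=1):
    """Build a Search API URL that looks up songs matching the title."""
    params = {"term": song_title, "media": "music", "entity": "song", "limit": limit}
    return ITUNES_SEARCH_URL + "?" + urlencode(params)


def extract_song_details(response_json):
    """Pick title, artist and album from the first result, or None if empty."""
    results = response_json.get("results", [])
    if not results:
        return None
    first = results[0]
    return {
        "title": first.get("trackName"),
        "artist": first.get("artistName"),
        "album": first.get("collectionName"),
    }


url = build_itunes_query("Hey Jude")
print(url)

# Made-up response with the shape the parser expects.
sample = {"resultCount": 1, "results": [
    {"trackName": "Hey Jude", "artistName": "The Beatles", "collectionName": "1"}
]}
print(extract_song_details(sample))
```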

The Search Engine Home page.

The Result page

The result page, query in Chinese (Traditional)

Machine specs:

Processor: i7 4700HQ
RAM: 24 GB DDR3
OS: Ubuntu 16.04 LTS

Results

Index building time:

No stemming/lemmatization - 41.67 s
Stemmed text + stopword removal - 146.13 s

Memory usage (RAM) while building the index: around 8 GB for 3000 documents, 1.3 GB for 800 files.

Members

Shubadeep Jana

Shardul Parab

Himanshu Gupta
