mdineskumar / sinhalasongsearch

This project forked from itharindu/sinhalasongsearch

Sinhala song search app created using elasticsearch and by scraping the web

Sinhala Song Search App

This repository contains the source code for a Sinhala song search engine built with Python and Elasticsearch.

Directory Structure

The important files and directories of the repository are shown below.

├── song-corpus : data scraped from the [website](http://sinhalasongbook.com/)
    ├── songs.json : contains the final song set
    ├── songs_01.json to songs_010.json : original songs scraped from the website
    ├── songs_meta_all.json : contains all metadata related to the songs
    └── song_link.csv : contains links to the song URLs
├── templates : UI-related files
├── app.py : backend of the web app, created using Flask
├── data_upload.py : script to upload data to the Elasticsearch cluster
├── scraper.py : source code for the data scraper
├── search.py : search functions that classify user search phrases, plus the Elasticsearch queries
└── queries.txt : example queries

Starting the web app

Spinning up the Elasticsearch cluster

You can install Elasticsearch locally or otherwise and spin up an Elasticsearch cluster. For more details, visit the Elasticsearch website.

Once Elasticsearch is installed, start the Elasticsearch cluster on port 9200.
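Before starting the web app, it can help to confirm that the cluster is actually answering on port 9200. The following is a minimal sketch using only the Python standard library. It assumes an unsecured local cluster at `http://localhost:9200`; the function name `cluster_info` is hypothetical and not part of this repository.

```python
import json
import urllib.request


def cluster_info(url="http://localhost:9200", timeout=2.0):
    """Return the JSON document served at the cluster root, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts and HTTP error statuses
        # (for example 401 on a cluster with security enabled).
        # ValueError covers a response body that is not valid JSON.
        return None


if __name__ == "__main__":
    info = cluster_info()
    if info is None:
        print("Elasticsearch is not reachable on port 9200")
    else:
        print("Connected to cluster:", info.get("cluster_name"))
```

If the check reports that the cluster is not reachable, start Elasticsearch first and then run the app.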

Getting started with the web app

git clone https://github.com/iTharindu/SinhalaSongSearch.git

cd SinhalaSongSearch

virtualenv -p python3 env

source env/bin/activate

pip3 install -r requirements.txt

python app.py

To run the web scraper

Follow the steps above, but run python scraper.py instead of python app.py.

Data fields

Each song contains a subset of the following data fields:

  1. Title (both in Sinhala and English letters)
  2. Artist - English
  3. Artist - Sinhala
  4. Composer - English
  5. Composer - Sinhala
  6. Lyricist - English
  7. Lyricist - Sinhala
  8. Genre - English
  9. Genre - Sinhala
  10. Number of views
  11. Guitar keys
  12. Name of the movie (if the song is based on a movie)
  13. Lyrics
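As an illustration of how such a record might look in code, here is a small sketch. The key names (`artist_en`, `artist_si`, and so on) are hypothetical and may not match the keys actually used in songs.json. Treating the title and lyrics as always present is also an assumption.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Song:
    # Assumed to be always present (an assumption, not confirmed by the data set).
    title: str
    lyrics: str
    # Every other field is optional because each song holds only a subset.
    artist_en: Optional[str] = None
    artist_si: Optional[str] = None
    composer_en: Optional[str] = None
    composer_si: Optional[str] = None
    lyricist_en: Optional[str] = None
    lyricist_si: Optional[str] = None
    genre_en: Optional[str] = None
    genre_si: Optional[str] = None
    views: Optional[int] = None
    guitar_key: Optional[str] = None
    movie: Optional[str] = None


def to_document(song):
    """Drop unset fields so the document holds only the fields that are present."""
    return {key: value for key, value in asdict(song).items() if value is not None}


if __name__ == "__main__":
    # Placeholder values, not real data from the corpus.
    print(to_document(Song(title="example title", lyrics="example lyrics", views=0)))
```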

Data Scraping process

The process starts with scraping data from the site; the HTML/XML parsing library BeautifulSoup was used to parse the web pages. The extracted text is then passed through a text processing unit, which uses both simple techniques like string replacement and more complex methods like regex. This unit generates cleaned text data, which is then passed to the translator to translate to Sinhala; both translation and transliteration take place here. The translated data is then sent for post-processing, and the final data set is generated together with an aggregated dataset containing information about the fields.
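The text processing step could look roughly like the sketch below, which combines plain string replacement with regex substitution. The specific rules here (the replacement table, stripping leftover tags, collapsing spaces) are assumptions chosen for illustration; the actual rules in scraper.py may differ.

```python
import re

# Hypothetical cleaning rules, for illustration only.
SIMPLE_REPLACEMENTS = {
    "\u00a0": " ",  # non-breaking space -> plain space
    "\t": " ",      # tab -> plain space
}
TAG_PATTERN = re.compile(r"<[^>]+>")    # leftover HTML tags
SPACES_PATTERN = re.compile(r" {2,}")   # runs of two or more spaces


def clean_text(raw):
    """Apply simple replacements first, then regex rules, then drop empty lines."""
    text = raw
    for old, new in SIMPLE_REPLACEMENTS.items():
        text = text.replace(old, new)
    text = TAG_PATTERN.sub("", text)
    text = SPACES_PATTERN.sub(" ", text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(clean_text("<p>Line  one</p>\n\n\tLine\u00a0two "))
```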

[Figure: scraping process]

Search Process

Indexing and querying

Elasticsearch is used for indexing and querying the data, and I have used the standard indexing methods, mapping, and analyzer provided by Elasticsearch. When a user enters a query, the related intent is identified and the search query for that intent is executed. Searches can target any field in the index, and a predefined result size is used, which users can override in their search query. Filtered queries are also provided so that users can filter the search results.
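To make the idea concrete, here is a sketch of a request body with an overridable result size and optional filters, written as a plain Python dictionary in the Elasticsearch query DSL. The field names, the default size of 10, and the helper `build_query` are assumptions for illustration and are not taken from search.py.

```python
DEFAULT_SIZE = 10  # hypothetical predefined result size
SEARCH_FIELDS = ["title", "artist_en", "artist_si", "lyrics"]  # hypothetical field names


def build_query(text, fields=None, size=DEFAULT_SIZE, filters=None):
    """Build a search body: full-text match on several fields plus optional filters."""
    body = {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {"multi_match": {"query": text, "fields": fields or SEARCH_FIELDS}}
                ]
            }
        },
    }
    if filters:
        # Filter clauses narrow the result set without affecting the relevance score.
        body["query"]["bool"]["filter"] = [
            {"match": {field: value}} for field, value in filters.items()
        ]
    return body


if __name__ == "__main__":
    print(build_query("example phrase", size=5, filters={"genre_en": "example genre"}))
```

The resulting dictionary is the kind of body the Elasticsearch `_search` endpoint accepts. How it is passed to the Python client depends on the client version.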

Advanced Features

  • Text mining and text preprocessing
    • Search queries are processed before intent classification: spelling errors are corrected and the query is cleaned. Extracted data is also cleaned and processed before it is displayed in the web application.
  • Intent Classification
    • Once a query is entered, the intent behind it is found by intent classification. The intent could be a simple text search, a "select top" type, etc. The intent classifier uses word tokenization, text vectorization, and cosine distance to classify intents.
  • Faceted Search
    • The search engine supports faceted search on Genre, Artist, Composer, and Lyricist.
  • Bilingual support
    • The search engine supports queries in both Sinhala and English. Users can type queries like top 10 songs or හොඳම ගීත 10 and expect the same result.
  • Synonyms support
    • The search engine also supports synonyms, in either Sinhala or English. Users can type best, popular, or good instead of top, or හොඳ, ජනප්‍රිය, or ප්‍රසිද්ධ instead of හොඳම.
  • Resistant to simple spelling errors
    • Because of the vectorization and distance calculation, the search engine is resistant to small spelling errors: they are corrected automatically and the related search results are generated.
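The intent classification, synonym, and spelling-tolerance features can be sketched together as follows. This is one possible reading of "tokenization, vectorization, and cosine distance", not the code in search.py: each word is turned into a vector of character-bigram counts and compared with a keyword list by cosine similarity. The keyword list is taken from the examples above, while the threshold of 0.6, the default size of 10, and all function names are assumptions.

```python
import math
from collections import Counter

# Keywords that signal a "top N" query, taken from the examples above.
TOP_KEYWORDS = ["top", "best", "popular", "good", "හොඳම", "හොඳ", "ජනප්‍රිය", "ප්‍රසිද්ධ"]
THRESHOLD = 0.6    # hypothetical similarity cut-off
DEFAULT_SIZE = 10  # hypothetical predefined result size


def bigram_vector(token):
    """Vectorize a word as counts of its character bigrams, padded with ^ and $."""
    padded = f"^{token}$"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine_similarity(a, b):
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


KEYWORD_VECTORS = [bigram_vector(keyword) for keyword in TOP_KEYWORDS]


def classify(query):
    """Return the detected intent and result size for a raw query string."""
    # Whitespace tokenization keeps Sinhala words, including vowel signs, intact.
    tokens = query.lower().split()
    # A number in the query overrides the default result size.
    size = next((int(t) for t in tokens if t.isdecimal()), DEFAULT_SIZE)
    is_top = any(
        cosine_similarity(bigram_vector(token), keyword_vector) >= THRESHOLD
        for token in tokens
        if not token.isdecimal()
        for keyword_vector in KEYWORD_VECTORS
    )
    return {"intent": "top" if is_top else "search", "size": size}


if __name__ == "__main__":
    for example in ["top 10 songs", "හොඳම ගීත 10", "populer songs"]:
        print(example, "->", classify(example))
```

Because similar spellings share most of their bigrams, a misspelling such as populer still scores above the threshold against popular, which is how a vector-based approach can tolerate small spelling errors.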

[Figure: search process]

Contributors

itharindu
