madurai_spider's Introduction

Scraping Project Madurai

Prerequisites

Scrapy
Beautiful Soup

Steps

Get the [title, author, genre, link] of each HTML document by parsing the index table

scrapy crawl madurai_spider -o index.json
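As a rough illustration of the table-parsing step, the sketch below pulls one record per table row using only the standard library's `html.parser`. It is a dependency-free stand-in, not the spider's actual code, which presumably relies on Scrapy selectors or Beautiful Soup. The column order (title, author, genre), the sample markup, and the names `IndexTableParser` and `parse_index` are all assumptions made for the example.

```python
from html.parser import HTMLParser


class IndexTableParser(HTMLParser):
    """Collect one record per table row (column order is an assumption)."""

    FIELDS = ("title", "author", "genre")

    def __init__(self):
        super().__init__()
        self.records = []
        self._cells = None      # cell texts of the row being read
        self._link = None       # first href found in the row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._link = [], None
        elif tag == "td" and self._cells is not None:
            self._cells.append("")
            self._in_cell = True
        elif tag == "a" and self._in_cell and self._link is None:
            self._link = dict(attrs).get("href")

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            # Skip header rows and rows without a document link.
            if len(self._cells) >= len(self.FIELDS) and self._link:
                record = {k: v.strip() for k, v in zip(self.FIELDS, self._cells)}
                record["link"] = self._link
                self.records.append(record)
            self._cells = None


def parse_index(html):
    parser = IndexTableParser()
    parser.feed(html)
    parser.close()
    return parser.records


# Hypothetical markup; the real index page may be laid out differently.
SAMPLE = """
<table>
  <tr><th>Title</th><th>Author</th><th>Genre</th></tr>
  <tr><td><a href="docs/0001.html">Sample Title</a></td><td>Sample Author</td><td>Poetry</td></tr>
</table>
"""
records = parse_index(SAMPLE)
```

Each record has the same four keys that the first step writes to index.json.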

Crawl the links from index.json and write each document to html/

scrapy crawl mad_doc_spider
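To make the second step concrete, here is a minimal sketch of how the records in index.json could be turned into a download plan: each link is resolved against a base URL and mapped to a file under html/. The base URL, the file-naming scheme, and the helper name `plan_downloads` are assumptions for illustration; the actual spider may resolve and name files differently.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse


def plan_downloads(index_json, base_url, out_dir="html"):
    """Return (absolute_url, output_path) pairs for every record in the index."""
    plan = []
    for record in json.loads(index_json):
        url = urljoin(base_url, record["link"])
        # Assumption: reuse the last path segment of the URL as the file name.
        name = PurePosixPath(urlparse(url).path).name
        plan.append((url, f"{out_dir}/{name}"))
    return plan


# Hypothetical index content and base URL.
INDEX_JSON = (
    '[{"title": "Sample Title", "author": "Sample Author", '
    '"genre": "Poetry", "link": "docs/0001.html"}]'
)
plan = plan_downloads(INDEX_JSON, "https://example.org/library/")
```

In a Scrapy spider, each pair would correspond to one request and one file written from its response body.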

What next?

I plan to build a toy semantic search engine. The first step is to construct Tamil word embeddings and use them to encode every document as a fixed-length vector in a high-dimensional space. These encodings form an index of shape m × n (m documents, n dimensions each). When the user enters a query in Tamil, it is encoded into an n-dimensional vector in the same way, and a beam search identifies the neighboring vectors, each of which represents an HTML document. The results should therefore be semantically close to the query.
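The encode-then-search idea can be sketched in a few lines. The example below is only a toy under stated assumptions: documents and queries are encoded by averaging word vectors, and neighbors are found by a brute-force cosine-similarity scan rather than the beam search described above. The embeddings are made-up three-dimensional vectors with English placeholder tokens standing in for Tamil words, and the names `encode`, `cosine`, and `search` are hypothetical.

```python
import math


def encode(text, embeddings, dim):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [embeddings[w] for w in text.split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def search(query, index, embeddings, dim, k=2):
    """Return the k document names whose vectors are closest to the query."""
    q = encode(query, embeddings, dim)
    ranked = sorted(index.items(), key=lambda item: cosine(q, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Made-up embeddings; real ones would be trained on the scraped Tamil text.
EMBEDDINGS = {
    "river": [1.0, 0.0, 0.0],
    "water": [0.9, 0.1, 0.0],
    "king": [0.0, 1.0, 0.0],
    "queen": [0.0, 0.9, 0.1],
}
DOCS = {
    "html/a.html": "river water",
    "html/b.html": "king queen",
}
# The index: one n-dimensional vector per document (m x n overall).
INDEX = {name: encode(text, EMBEDDINGS, 3) for name, text in DOCS.items()}
results = search("water", INDEX, EMBEDDINGS, 3, k=1)
```

A brute-force scan is fine for a toy index; an approximate search would only become relevant once m grows large.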


suriyadeepan / madurai_spider
