The idea is to analyze the structure of web pages in order to build a better crawler that extracts content more intelligently.
This module crawls sites with annotated groups using a simple BFS strategy. crawl_data is the folder for raw HTML files; test_data is the folder that contains the files used for experiments. The other folders can be ignored.
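The sketch below is a minimal illustration of such a breadth-first crawl, not the project's actual crawler. It assumes the third-party packages requests and beautifulsoup4; the function name bfs_crawl and the seed URL are hypothetical.

```python
# Minimal BFS crawl sketch: fetch pages level by level from a seed URL,
# save the raw HTML, and only follow links on the seed's domain.
import os
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def bfs_crawl(seed_url, out_dir="crawl_data", max_pages=100):
    os.makedirs(out_dir, exist_ok=True)
    domain = urlparse(seed_url).netloc
    queue, seen = deque([seed_url]), {seed_url}
    count = 0
    while queue and count < max_pages:
        url = queue.popleft()
        try:
            page = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        # Save the raw HTML, mirroring the role of the crawl_data folder.
        path = os.path.join(out_dir, f"page_{count}.html")
        with open(path, "w", encoding="utf-8") as f:
            f.write(page)
        count += 1
        # Enqueue unseen same-domain links for the next BFS level.
        for a in BeautifulSoup(page, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
```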
This module extracts XPaths and computes features for clustering and classification. The HITS idea is also implemented in this module.
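As a rough illustration of XPath-based features (not necessarily how this module does it), the sketch below uses lxml, an assumed dependency, to collect the XPath of every element in a page and count how often each path occurs; the function name xpath_counts is hypothetical.

```python
# Count XPath occurrences in one HTML document; the resulting Counter can be
# turned into a tf-style or binary feature vector for a page.
import re
from collections import Counter
from lxml import html

def xpath_counts(html_string):
    root = html.fromstring(html_string)
    tree = root.getroottree()
    counts = Counter()
    for element in root.iter():
        if isinstance(element.tag, str):  # skip comments / processing instructions
            # getpath() returns an absolute XPath such as /html/body/div[2]/a;
            # drop positional predicates so identical structures map to one key.
            path = re.sub(r"\[\d+\]", "", tree.getpath(element))
            counts[path] += 1
    return counts
```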
- For the clustering algorithm, the following six Python files are the most important:
  - page.py: data structure for a single web page.
  - pages.py: data structure for the page collection of one website.
  - kmeans.py & wkmeans.py: both take a feature matrix as input and generate clusters as output; wkmeans is weighted k-means (see the sketch after this list).
  - pageCluster.py: the main entry point; the number of clusters is assigned heuristically in its main function. To run the clustering, use:
    python pageCluster.py dataset algo feature train(cv)
    - dataset: select from [zhihu, stackexchange, rottentomatoes, medhelp, asp]
    - algo: select from [kmeans, wkmeans]
    - feature: select from [tf-idf, log-tf-idf, binary]
    - train(cv): select from [train, cv]
    For example: python pageCluster.py zhihu wkmeans log-tf-idf cv
    - output: evaluation metrics and a visualization for the train setting
  - cv_results.sh & train_results.sh: batch scripts that try all possible parameter settings and write the results to files.
  - visualization.py: uses t-SNE to reduce high-dimensional vectors to two dimensions for visualization.
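The following is a compact, illustrative sketch of weighted k-means; it assumes per-feature weights in the distance metric (wkmeans.py may weight samples or features differently) and uses only NumPy. kmeans.py corresponds to the case where all weights are one.

```python
import numpy as np

def weighted_kmeans(X, k, w=None, n_iter=100, seed=0):
    """Cluster the rows of X into k groups using a per-feature weighted distance."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    w = np.ones(d) if w is None else np.asarray(w, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(n, size=k, replace=False)].copy()  # random initial centers
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Weighted squared Euclidean distance from every point to every center.
        diff = X[:, None, :] - centers[None, :, :]             # shape (n, k, d)
        dist = (diff ** 2 * w).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):                 # assignments stable: converged
            break
        labels = new_labels
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers
```

For the t-SNE projection used by visualization.py, scikit-learn offers an implementation that can serve as a stand-in (assuming that library is available):

```python
from sklearn.manifold import TSNE

# X is the (n_pages, n_features) matrix produced by the feature-extraction step.
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)  # (n_pages, 2) points to plot
```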
Tests the Python library used for XPath extraction.