crawl2

Crawl a list of seed URLs and score sites based on positive and negative phrases.
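As a rough illustration of phrase-based scoring, here is a minimal Python sketch. It is not the project's Clojure code. The `Keyword` class, the `score_text` function, and the bias values "positive" and "negative" are all assumptions made for this example; the README does not say which values the bias column takes or how matching is done.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Keyword:
    phrase: str
    weight: int
    bias: str  # assumed values: "positive" or "negative"


def score_text(text: str, keywords: List[Keyword]) -> int:
    """Add the weight of each positive phrase found in the text and
    subtract the weight of each negative phrase found."""
    lowered = text.lower()
    total = 0
    for kw in keywords:
        # Case-insensitive substring match; each phrase counts once.
        if kw.phrase.lower() in lowered:
            total += kw.weight if kw.bias == "positive" else -kw.weight
    return total
```

Under these assumptions, a page that matches one positive phrase of weight 5 and one negative phrase of weight 3 would score 2.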

Usage

Set up your criteria in resources/ and then run lein run.

The table schema is as follows:

CREATE TABLE `crawl` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `domain` varchar(255) DEFAULT NULL,
  `url` varchar(255) DEFAULT NULL,
  `score` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `domain` (`domain`,`url`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `domain` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `domain` varchar(255) DEFAULT NULL,
  `processed` tinyint(1) DEFAULT '0',
  PRIMARY KEY (`id`),
  UNIQUE KEY `domain` (`domain`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `domainScores` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `domain` varchar(255) DEFAULT NULL,
  `score` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `domain` (`domain`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `keyword` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `keyword` varchar(255) DEFAULT NULL,
  `weight` int(11) DEFAULT '1',
  `bias` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `keyword` (`keyword`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `matches` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `domain` varchar(255) DEFAULT NULL,
  `url` varchar(255) DEFAULT NULL,
  `term` varchar(255) DEFAULT NULL,
  `type` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Load data as follows:

LOAD DATA LOCAL INFILE
  '/tmp/keywords.txt'
INTO TABLE
  keyword
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(keyword, weight, bias);
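The statement above expects one keyword per line, with comma-separated fields in the order keyword, weight, bias. Because it has no ENCLOSED BY clause, a phrase that contains a comma would be split into extra fields. A small Python sketch for producing such a file follows; `render_keywords` and the sample rows are hypothetical, and the bias values shown are assumptions.

```python
import csv
import io


def render_keywords(rows):
    """Render (keyword, weight, bias) tuples as comma-separated lines
    ending in '\n', matching the LOAD DATA statement above."""
    for keyword, _weight, _bias in rows:
        # No ENCLOSED BY clause is used, so embedded commas cannot be escaped.
        if "," in keyword:
            raise ValueError(f"keyword contains a comma: {keyword!r}")
    buf = io.StringIO()
    # csv.writer defaults to '\r\n'; the loader expects '\n'.
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    sample = [("free shipping", 5, "positive"), ("out of stock", 3, "negative")]
    with open("/tmp/keywords.txt", "w", newline="") as fh:
        fh.write(render_keywords(sample))
```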

Report data as follows:

SELECT
  domain, term
FROM
  matches
INTO OUTFILE
  '/tmp/rawMatch.csv'
FIELDS TERMINATED BY
  ','
LINES TERMINATED BY '\n';

SELECT
  keyword, weight
FROM
  keyword
INTO OUTFILE
  '/tmp/keywordScore.csv'
FIELDS TERMINATED BY
  ','
LINES TERMINATED BY
  '\n';

$ bin/term_mapper < /tmp/rawMatch.csv > /tmp/domainMatches.csv
$ bin/domain_score < /tmp/domainMatches.csv > /tmp/domainScores.csv
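The README does not describe what bin/term_mapper and bin/domain_score do internally, or the format of the intermediate file. As a guess at the kind of aggregation involved, the following Python sketch joins the exported (domain, term) rows with the exported (keyword, weight) rows and sums the weights per domain. The function name `domain_scores` and the treatment of unknown terms as weight 0 are assumptions, not the scripts' actual behavior.

```python
import csv
import io
from collections import defaultdict


def domain_scores(raw_match_csv: str, keyword_score_csv: str) -> dict:
    """Sum keyword weights per domain from two comma-separated exports."""
    # Map each keyword to its integer weight.
    weights = {}
    for row in csv.reader(io.StringIO(keyword_score_csv)):
        if len(row) == 2:
            weights[row[0]] = int(row[1])

    # Add the weight of every matched term to its domain's total.
    totals = defaultdict(int)
    for row in csv.reader(io.StringIO(raw_match_csv)):
        if len(row) == 2:
            domain, term = row
            totals[domain] += weights.get(term, 0)  # unknown terms count as 0
    return dict(totals)
```

The result maps each domain to a single integer, which is the shape the domainScores table expects (domain, score).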

LOAD DATA LOCAL INFILE
  '/tmp/domainScores.csv'
INTO TABLE
  domainScores
FIELDS TERMINATED BY
  ','
LINES TERMINATED BY
  '\n'
(domain, score);

SELECT
  domain, score
FROM
  domainScores
WHERE
  score > 100
INTO OUTFILE
  '/tmp/qualifiedDomains'
FIELDS TERMINATED BY
  ','
LINES TERMINATED BY
  '\n';

Reinitialize as follows:

UPDATE domain SET processed = 0;
DELETE FROM domainScores;
DELETE FROM matches;
DELETE FROM crawl;

License

Copyright © 2013 FIXME

Distributed under the Eclipse Public License, the same as Clojure.
