letractively / abot
Automatically exported from code.google.com/p/abot
License: Apache License 2.0
Add crawl recovery that reloads pages that were crawled, pages to crawl, and
other context, so the crawl can pick up where it left off. May also need to
add a stop for this to work properly
Original issue reported on code.google.com by [email protected]
on 16 Nov 2012 at 5:15
Add abot version dynamically to user agent string
Original issue reported on code.google.com by [email protected]
on 25 Nov 2012 at 3:21
Add integration tests for at least the following sites...
sitesimulator
wvtesting.com
sethgodin.com
Original issue reported on code.google.com by [email protected]
on 27 Oct 2012 at 10:14
Use VS Fakes to raise code coverage on otherwise untestable code
Original issue reported on code.google.com by [email protected]
on 19 Nov 2012 at 3:24
Add license text from http://www.apache.org/licenses/LICENSE-2.0
Original issue reported on code.google.com by [email protected]
on 3 Dec 2012 at 8:03
Add a crawl timeout that ends the crawl once the configured time has elapsed.
Original issue reported on code.google.com by [email protected]
on 15 Nov 2012 at 3:21
[deleted issue]
Hook up google analytics
Original issue reported on code.google.com by [email protected]
on 26 Sep 2012 at 4:33
-Think about moving the unique-URI crawling check/logic into IScheduler, so
that a DistributedScheduler would be the only change needed for distributed scheduling.
Original issue reported on code.google.com by [email protected]
on 23 Nov 2012 at 10:14
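A minimal sketch of what such a scheduler seam might look like, assuming the issue's intent; the interface members and the InMemoryScheduler class are illustrative assumptions, not Abot's actual API:

```csharp
using System.Collections.Generic;

// Hypothetical interface: the unique-URI check lives inside the scheduler,
// so a distributed implementation only replaces this one component.
public interface IScheduler
{
    int Count { get; }
    void Add(PageToCrawl page);   // silently ignores URIs already seen
    PageToCrawl GetNext();        // null when nothing is left to crawl
}

public class InMemoryScheduler : IScheduler
{
    private readonly Queue<PageToCrawl> _pagesToCrawl = new Queue<PageToCrawl>();
    private readonly HashSet<string> _seenUris = new HashSet<string>();

    public int Count { get { return _pagesToCrawl.Count; } }

    public void Add(PageToCrawl page)
    {
        // Uniqueness check happens here rather than in WebCrawler, so a
        // DistributedScheduler could back this set with a shared store.
        if (_seenUris.Add(page.Uri.AbsoluteUri))
            _pagesToCrawl.Enqueue(page);
    }

    public PageToCrawl GetNext()
    {
        return _pagesToCrawl.Count > 0 ? _pagesToCrawl.Dequeue() : null;
    }
}
```

The design point is that WebCrawler then never needs to know how uniqueness is tracked, only that Add() de-duplicates.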
Add a page offering custom crawler work by the hour
Original issue reported on code.google.com by [email protected]
on 27 Nov 2012 at 9:21
Implement manual crawl delay
Original issue reported on code.google.com by [email protected]
on 21 Nov 2012 at 11:15
Add MaxPagesToCrawl check in CrawlDecisionMaker
Original issue reported on code.google.com by [email protected]
on 15 Nov 2012 at 7:45
Currently ignoring test...
Crawl_PageLinksCrawlDisallowedSubscriberThrowsExceptions_DoesNotCrash
Search for "//TODO This test only fails when run under NCOVER"
Original issue reported on code.google.com by [email protected]
on 14 Oct 2012 at 12:37
Change Abot.Console to Abot.Demo
Original issue reported on code.google.com by [email protected]
on 27 Sep 2012 at 11:54
Create monitoring plugin that leverages abot
Original issue reported on code.google.com by [email protected]
on 29 Oct 2012 at 12:44
Create NuGet installer
Original issue reported on code.google.com by [email protected]
on 18 Nov 2012 at 11:48
Add fatal crawl errors to CrawlResult
Original issue reported on code.google.com by [email protected]
on 13 Oct 2012 at 10:36
Update all assemblies to 4.5
Original issue reported on code.google.com by [email protected]
on 21 Nov 2012 at 2:51
Verify can run on .net 3.5, 4.0, 4.5
Original issue reported on code.google.com by [email protected]
on 26 Sep 2012 at 12:27
Spread the word
Original issue reported on code.google.com by [email protected]
on 19 Nov 2012 at 3:20
Implement use of downloadableContentTypes config value
Original issue reported on code.google.com by [email protected]
on 21 Nov 2012 at 11:15
CrawlContext should be reset on every crawl() call
Original issue reported on code.google.com by [email protected]
on 28 Oct 2012 at 12:07
Console fails when crawling site wvtesting2.com
Original issue reported on code.google.com by [email protected]
on 14 Oct 2012 at 10:35
-Link to the latest stable instead of making them go to the downloads tab
-Add fiddler .saz file to replay
-Add Abot vs Arachnode vs NCrawler section
-Add faqs page
-Split up quickstart onto its own page
-Add more detail on running the tests with Fiddler, etc.
Original issue reported on code.google.com by [email protected]
on 19 Nov 2012 at 3:30
Create a PoliteWebCrawler.
-Add throttling
-Add manual crawl delay
-Add respect robots crawl delay
-Add respect robots disallow directive
-Add respect meta robots no index no follow
Original issue reported on code.google.com by [email protected]
on 27 Sep 2012 at 11:47
Use ILMerge to create a single Abot.dll with all dependent dlls
Original issue reported on code.google.com by [email protected]
on 18 Nov 2012 at 11:49
Consider using AutoResetEvent instead of a busy wait
Original issue reported on code.google.com by [email protected]
on 26 Sep 2012 at 5:34
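A sketch of the difference the issue is pointing at; the variable names are assumptions, not Abot code:

```csharp
using System.Threading;

// Busy wait (the pattern being replaced): polls and burns CPU.
//   while (!workAvailable) { Thread.Sleep(100); }

// AutoResetEvent alternative: the waiting thread blocks at no CPU cost
// until signaled, and the event resets automatically after releasing
// exactly one waiter.
AutoResetEvent workSignal = new AutoResetEvent(false);

// Consumer thread:
//   workSignal.WaitOne();   // blocks until Set() is called

// Producer thread, after enqueuing work:
//   workSignal.Set();       // wakes one waiting consumer
```

Besides saving CPU, the event-based version reacts immediately instead of waiting out the remainder of a sleep interval.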
Create WebCrawler that uses a list of rules for its crawl decisions
Original issue reported on code.google.com by [email protected]
on 27 Sep 2012 at 11:45
Handle relative link parsing when html Base tag present
Original issue reported on code.google.com by [email protected]
on 26 Sep 2012 at 12:04
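A hedged sketch of the resolution logic the issue asks for, using System.Uri; the helper name and signature are assumptions for illustration:

```csharp
using System;

// When a page declares <base href="...">, relative links must be resolved
// against that value instead of the URI the page was crawled from.
static Uri ResolveLink(Uri pageUri, string baseHref, string relativeLink)
{
    Uri baseUri;
    // Use the base tag's href when present and absolute; otherwise
    // fall back to the page's own URI.
    if (string.IsNullOrEmpty(baseHref) ||
        !Uri.TryCreate(baseHref, UriKind.Absolute, out baseUri))
        baseUri = pageUri;

    return new Uri(baseUri, relativeLink);
}

// ResolveLink(new Uri("http://a.com/blog/post"), "http://a.com/img/", "pic.png")
// yields http://a.com/img/pic.png rather than http://a.com/blog/pic.png
```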
-Add Func<PageToCrawl, CrawlDecision> for ShouldCrawlPage
-Add Func<CrawledPage, CrawlDecision> for ShouldCrawlPageLinks
-Add Func<CrawledPage, CrawlDecision> for ShouldDownloadPageContent
Original issue reported on code.google.com by [email protected]
on 7 Nov 2012 at 5:19
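A sketch of how such a delegate hook might be used once added, following the signatures stated in the issue; the property-style assignment and the CrawlDecision members shown are assumptions:

```csharp
// Hypothetical usage: supply crawl behavior inline via a lambda instead
// of implementing an interface.
crawler.ShouldCrawlPage = (PageToCrawl pageToCrawl) =>
{
    if (pageToCrawl.Uri.AbsoluteUri.Contains("/admin/"))
        return new CrawlDecision { Allow = false, Reason = "Admin page" };

    return new CrawlDecision { Allow = true };
};
```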
Implement crawl depth
Original issue reported on code.google.com by [email protected]
on 21 Nov 2012 at 11:13
Make configuration like MaxThreads and UserAgentString easy to set from the
crawler object. Maybe create a BasicWebCrawler : WebCrawler that provides this functionality.
Original issue reported on code.google.com by [email protected]
on 14 Oct 2012 at 9:29
Add PageCrawlDisallowed & PageLinksCrawlDisallowed events so the user can
detect when a page, or its links, were not crawled due to a crawl decision
returning false.
Original issue reported on code.google.com by [email protected]
on 27 Sep 2012 at 11:44
Implement crawl timeout where the crawler will stop after x seconds
Original issue reported on code.google.com by [email protected]
on 21 Nov 2012 at 11:14
Consider using CsQuery as the parser. It boasts speeds x times faster than
HAP (Html Agility Pack).
https://github.com/jamietre/CsQuery
Original issue reported on code.google.com by [email protected]
on 19 Nov 2012 at 3:23
Add constructors to WebCrawler: one that takes only an ICrawlDecisionMaker,
and one that takes both an ICrawlDecisionMaker and a CrawlConfiguration
Original issue reported on code.google.com by [email protected]
on 26 Nov 2012 at 10:11
Use concurrent collections for Scheduler and CrawlContext.CrawledUris now that
abot is targeting 4.5.
Original issue reported on code.google.com by [email protected]
on 22 Nov 2012 at 11:06
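A sketch of the swap the issue describes; using ConcurrentDictionary as a thread-safe set is a common idiom, though the value type and member name here are assumptions taken from the issue text:

```csharp
using System.Collections.Concurrent;

// Replace a lock-guarded Dictionary/HashSet with a ConcurrentDictionary
// for CrawlContext.CrawledUris.
ConcurrentDictionary<string, byte> crawledUris =
    new ConcurrentDictionary<string, byte>();

// TryAdd is atomic: it returns false if another crawl thread already
// recorded this URI, so no explicit lock is needed.
bool isFirstTimeSeen = crawledUris.TryAdd("http://a.com/page1", 0);
```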
Add disallow reason when firing PageCrawlDisallowed and
PageLinksCrawlDisallowed events
Original issue reported on code.google.com by [email protected]
on 14 Oct 2012 at 12:38
Verify site crawls are returning expected number of pages...
1: sitesimulator
2: wvtesting2.com
3: sethgodin.com
Original issue reported on code.google.com by [email protected]
on 15 Oct 2012 at 12:28
Setup Paypal donate button, and possibly a consulting block purchase page.
Original issue reported on code.google.com by [email protected]
on 28 Oct 2012 at 9:12
Verify Abot is able to run on Mono by using MoMA
Original issue reported on code.google.com by [email protected]
on 26 Sep 2012 at 12:17
Create crawl configuration object and a generic provider that can be overridden.
Original issue reported on code.google.com by [email protected]
on 17 Oct 2012 at 4:50
Verify NuGet solution restore
Original issue reported on code.google.com by [email protected]
on 26 Sep 2012 at 5:35
Merge branch 1.0 into trunk, create bin zip for downloads section
Original issue reported on code.google.com by [email protected]
on 15 Oct 2012 at 12:29
Contact graphic designer to have logo created
Original issue reported on code.google.com by [email protected]
on 28 Oct 2012 at 9:09
Add MaxTimeToCrawl configuration. Crawl times out after this limit is reached.
Original issue reported on code.google.com by [email protected]
on 7 Nov 2012 at 3:31
Implement use of isUriRecrawlingEnabled config value.
Original issue reported on code.google.com by [email protected]
on 21 Nov 2012 at 11:16
Consider accepting lambda expressions for crawl decisions.
Pros:
-Allows users to determine crawl behavior on the fly
-No classes or interfaces to implement and plugin
Cons:
-Not easy to test compound crawl behaviors
-Users must set these values on every instance (lots of copy and paste)
-Hard to group related behaviors together
Original issue reported on code.google.com by [email protected]
on 29 Oct 2012 at 4:57
Make Abot check its version and, if it is less than the latest "featured"
version, log a message suggesting an update
Original issue reported on code.google.com by [email protected]
on 3 Dec 2012 at 4:31
Set up Google Groups and add a link to the homepage.
Original issue reported on code.google.com by [email protected]
on 26 Nov 2012 at 11:22