letractively / abot
Automatically exported from code.google.com/p/abot
License: Apache License 2.0
Add crawl recovery that reloads pages that were crawled, pages to crawl, and
other context, so the crawl can pick up where it left off. May also need to
add a stop for this to work properly
Original issue reported on code.google.com by [email protected]
on 16 Nov 2012 at 5:15
Add abot version dynamically to user agent string
Original issue reported on code.google.com by [email protected]
on 25 Nov 2012 at 3:21
Add integration tests for at least the following sites...
sitesimulator
wvtesting.com
sethgodin.com
Original issue reported on code.google.com by [email protected]
on 27 Oct 2012 at 10:14
Use VS Fakes to raise code coverage on otherwise untestable code
Original issue reported on code.google.com by [email protected]
on 19 Nov 2012 at 3:24
Add license text from http://www.apache.org/licenses/LICENSE-2.0
Original issue reported on code.google.com by [email protected]
on 3 Dec 2012 at 8:03
Add a crawl timeout that ends the crawl once the configured time has elapsed.
Original issue reported on code.google.com by [email protected]
on 15 Nov 2012 at 3:21
[deleted issue]
Hook up google analytics
Original issue reported on code.google.com by [email protected]
on 26 Sep 2012 at 4:33
-Think about moving the unique-URI crawling check/logic into IScheduler, so
that a DistributedScheduler would be the only change needed for distributed scheduling.
Original issue reported on code.google.com by [email protected]
on 23 Nov 2012 at 10:14
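A minimal sketch of what such a scheduler seam might look like, assuming the issue's intent; the interface members and the InMemoryScheduler class are illustrative assumptions, not Abot's actual API:

```csharp
using System.Collections.Generic;

// Hypothetical interface: the unique-URI check lives inside the scheduler,
// so a distributed implementation only replaces this one component.
public interface IScheduler
{
    int Count { get; }
    void Add(PageToCrawl page);   // silently ignores URIs already seen
    PageToCrawl GetNext();        // null when nothing is left to crawl
}

public class InMemoryScheduler : IScheduler
{
    private readonly Queue<PageToCrawl> _pagesToCrawl = new Queue<PageToCrawl>();
    private readonly HashSet<string> _seenUris = new HashSet<string>();

    public int Count { get { return _pagesToCrawl.Count; } }

    public void Add(PageToCrawl page)
    {
        // Uniqueness check happens here rather than in WebCrawler, so a
        // DistributedScheduler could back this set with a shared store.
        if (_seenUris.Add(page.Uri.AbsoluteUri))
            _pagesToCrawl.Enqueue(page);
    }

    public PageToCrawl GetNext()
    {
        return _pagesToCrawl.Count > 0 ? _pagesToCrawl.Dequeue() : null;
    }
}
```

The design point is that WebCrawler then never needs to know how uniqueness is tracked, only that Add() de-duplicates.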
Add a page offering custom crawler work by the hour
Original issue reported on code.google.com by [email protected]
on 27 Nov 2012 at 9:21
Implement manual crawl delay
Original issue reported on code.google.com by [email protected]
on 21 Nov 2012 at 11:15
Add MaxPagesToCrawl check in CrawlDecisionMaker
Original issue reported on code.google.com by [email protected]
on 15 Nov 2012 at 7:45
Currently ignoring test...
Crawl_PageLinksCrawlDisallowedSubscriberThrowsExceptions_DoesNotCrash
Search for "//TODO This test only fails when run under NCOVER"
Original issue reported on code.google.com by [email protected]
on 14 Oct 2012 at 12:37
Change Abot.Console to Abot.Demo
Original issue reported on code.google.com by [email protected]
on 27 Sep 2012 at 11:54
Create monitoring plugin that leverages abot
Original issue reported on code.google.com by [email protected]
on 29 Oct 2012 at 12:44
Create NuGet installer
Original issue reported on code.google.com by [email protected]
on 18 Nov 2012 at 11:48
Add fatal crawl errors to CrawlResult
Original issue reported on code.google.com by [email protected]
on 13 Oct 2012 at 10:36
Update all assemblies to 4.5
Original issue reported on code.google.com by [email protected]
on 21 Nov 2012 at 2:51
Verify can run on .net 3.5, 4.0, 4.5
Original issue reported on code.google.com by [email protected]
on 26 Sep 2012 at 12:27
Spread the word
Original issue reported on code.google.com by [email protected]
on 19 Nov 2012 at 3:20
Implement use of downloadableContentTypes config value
Original issue reported on code.google.com by [email protected]
on 21 Nov 2012 at 11:15
CrawlContext should be reset on every crawl() call
Original issue reported on code.google.com by [email protected]
on 28 Oct 2012 at 12:07
Console fails when crawling site wvtesting2.com
Original issue reported on code.google.com by [email protected]
on 14 Oct 2012 at 10:35
-Link to the latest stable instead of making them go to the downloads tab
-Add fiddler .saz file to replay
-Add Abot vs Arachnode vs NCrawler section
-Add faqs page
-Split up quickstart onto its own page
-Add more detail on running the tests with Fiddler, etc.
Original issue reported on code.google.com by [email protected]
on 19 Nov 2012 at 3:30
Create a PoliteWebCrawler.
-Add throttling
-Add manual crawl delay
-Add respect robots crawl delay
-Add respect robots disallow directive
-Add respect meta robots no index no follow
Original issue reported on code.google.com by [email protected]
on 27 Sep 2012 at 11:47
Use ILMerge to create a single Abot.dll with all dependent dlls
Original issue reported on code.google.com by [email protected]
on 18 Nov 2012 at 11:49
Consider using AutoResetEvent instead of a busy wait
Original issue reported on code.google.com by [email protected]
on 26 Sep 2012 at 5:34
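A sketch of the difference the issue is pointing at; the variable names are assumptions, not Abot code:

```csharp
using System.Threading;

// Busy wait (the pattern being replaced): polls and burns CPU.
//   while (!workAvailable) { Thread.Sleep(100); }

// AutoResetEvent alternative: the waiting thread blocks at no CPU cost
// until signaled, and the event resets automatically after releasing
// exactly one waiter.
AutoResetEvent workSignal = new AutoResetEvent(false);

// Consumer thread:
//   workSignal.WaitOne();   // blocks until Set() is called

// Producer thread, after enqueuing work:
//   workSignal.Set();       // wakes one waiting consumer
```

Besides saving CPU, the event-based version reacts immediately instead of waiting out the remainder of a sleep interval.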
Create WebCrawler that uses a list of rules for its crawl decisions
Original issue reported on code.google.com by [email protected]
on 27 Sep 2012 at 11:45
Handle relative link parsing when html Base tag present
Original issue reported on code.google.com by [email protected]
on 26 Sep 2012 at 12:04
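A hedged sketch of the resolution logic the issue asks for, using System.Uri; the helper name and signature are assumptions for illustration:

```csharp
using System;

// When a page declares <base href="...">, relative links must be resolved
// against that value instead of the URI the page was crawled from.
static Uri ResolveLink(Uri pageUri, string baseHref, string relativeLink)
{
    Uri baseUri;
    // Use the base tag's href when present and absolute; otherwise
    // fall back to the page's own URI.
    if (string.IsNullOrEmpty(baseHref) ||
        !Uri.TryCreate(baseHref, UriKind.Absolute, out baseUri))
        baseUri = pageUri;

    return new Uri(baseUri, relativeLink);
}

// ResolveLink(new Uri("http://a.com/blog/post"), "http://a.com/img/", "pic.png")
// yields http://a.com/img/pic.png rather than http://a.com/blog/pic.png
```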
-Add Func<PageToCrawl, CrawlDecision> for ShouldCrawlPage
-Add Func<CrawledPage, CrawlDecision> for ShouldCrawlPageLinks
-Add Func<CrawledPage, CrawlDecision> for ShouldDownloadPageContent
Original issue reported on code.google.com by [email protected]
on 7 Nov 2012 at 5:19
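A sketch of how such a delegate hook might be used once added, following the signatures stated in the issue; the property-style assignment and the CrawlDecision members shown are assumptions:

```csharp
// Hypothetical usage: supply crawl behavior inline via a lambda instead
// of implementing an interface.
crawler.ShouldCrawlPage = (PageToCrawl pageToCrawl) =>
{
    if (pageToCrawl.Uri.AbsoluteUri.Contains("/admin/"))
        return new CrawlDecision { Allow = false, Reason = "Admin page" };

    return new CrawlDecision { Allow = true };
};
```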
Implement crawl depth
Original issue reported on code.google.com by [email protected]
on 21 Nov 2012 at 11:13
Make configuration like MaxThreads and UserAgentString easy to set from the
crawler object. Maybe create a BasicWebCrawler : WebCrawler that provides this functionality.
Original issue reported on code.google.com by [email protected]
on 14 Oct 2012 at 9:29
Add PageCrawlDisallowed & PageLinksCrawlDisallowed events so the user can
detect when a page, or its links, were not crawled due to a crawl decision
returning false.
Original issue reported on code.google.com by [email protected]
on 27 Sep 2012 at 11:44
Implement crawl timeout where the crawler will stop after x seconds
Original issue reported on code.google.com by [email protected]
on 21 Nov 2012 at 11:14
Consider using CsQuery as the parser. It boasts speeds x times faster than
HAP (Html Agility Pack).
https://github.com/jamietre/CsQuery
Original issue reported on code.google.com by [email protected]
on 19 Nov 2012 at 3:23
Add constructors to WebCrawler: one that takes only an ICrawlDecisionMaker,
and one that takes both an ICrawlDecisionMaker and a CrawlConfiguration
Original issue reported on code.google.com by [email protected]
on 26 Nov 2012 at 10:11
Use concurrent collections for Scheduler and CrawlContext.CrawledUris now that
abot is targeting 4.5.
Original issue reported on code.google.com by [email protected]
on 22 Nov 2012 at 11:06
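A sketch of the swap the issue describes; using ConcurrentDictionary as a thread-safe set is a common idiom, though the value type and member name here are assumptions taken from the issue text:

```csharp
using System.Collections.Concurrent;

// Replace a lock-guarded Dictionary/HashSet with a ConcurrentDictionary
// for CrawlContext.CrawledUris.
ConcurrentDictionary<string, byte> crawledUris =
    new ConcurrentDictionary<string, byte>();

// TryAdd is atomic: it returns false if another crawl thread already
// recorded this URI, so no explicit lock is needed.
bool isFirstTimeSeen = crawledUris.TryAdd("http://a.com/page1", 0);
```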
Add disallow reason when firing PageCrawlDisallowed and
PageLinksCrawlDisallowed events
Original issue reported on code.google.com by [email protected]
on 14 Oct 2012 at 12:38
Verify site crawls are returning expected number of pages...
1: sitesimulator
2: wvtesting2.com
3: sethgodin.com
Original issue reported on code.google.com by [email protected]
on 15 Oct 2012 at 12:28
Setup Paypal donate button, and possibly a consulting block purchase page.
Original issue reported on code.google.com by [email protected]
on 28 Oct 2012 at 9:12
Verify Abot is able to run on Mono by using MoMA
Original issue reported on code.google.com by [email protected]
on 26 Sep 2012 at 12:17
Create crawl configuration object and a generic provider that can be overridden.
Original issue reported on code.google.com by [email protected]
on 17 Oct 2012 at 4:50
Verify NuGet solution restore
Original issue reported on code.google.com by [email protected]
on 26 Sep 2012 at 5:35
Merge branch 1.0 into trunk, create bin zip for downloads section
Original issue reported on code.google.com by [email protected]
on 15 Oct 2012 at 12:29
Contact graphic designer to have logo created
Original issue reported on code.google.com by [email protected]
on 28 Oct 2012 at 9:09
Add MaxTimeToCrawl configuration. Crawl times out after this limit is reached.
Original issue reported on code.google.com by [email protected]
on 7 Nov 2012 at 3:31
Implement use of isUriRecrawlingEnabled config value.
Original issue reported on code.google.com by [email protected]
on 21 Nov 2012 at 11:16
Consider accepting lambda expressions for crawl decisions.
Pros:
-Allows users to determine crawl behavior on the fly
-No classes or interfaces to implement and plugin
Cons:
-Not easy to test compound crawl behaviors
-Users must set these values on every instance (lots of copy and paste)
-Hard to group related behaviors together
Original issue reported on code.google.com by [email protected]
on 29 Oct 2012 at 4:57
Make Abot check its version and, if it is less than the latest "featured"
version, log a message suggesting an update
Original issue reported on code.google.com by [email protected]
on 3 Dec 2012 at 4:31
Set up Google Groups and add a link to the homepage.
Original issue reported on code.google.com by [email protected]
on 26 Nov 2012 at 11:22