Figured it out by looking at the Scheduler class.
To help anyone who wants to do the same, see the walkthrough below. It isn't the most elegant code, but it works!
I hardcoded my list of pages. In practice you would have some dynamic way of populating your page list (like a DB, or even parsing the paging info from the landing/first page).
- Create a custom IScheduler where you populate your pages. The custom code is commented; the rest is as-is in the Scheduler class.
public class MyScheduler : IScheduler
{
    ICrawledUrlRepository _crawledUrlRepo;
    IPagesToCrawlRepository _pagesToCrawlRepo;
    bool _allowUriRecrawling;
    // flag to indicate my list of pages has been loaded
    bool _pageListLoaded;

    public MyScheduler()
        : this(false, null, null)
    {
    }

    public MyScheduler(bool allowUriRecrawling, ICrawledUrlRepository crawledUrlRepo, IPagesToCrawlRepository pagesToCrawlRepo)
    {
        _allowUriRecrawling = allowUriRecrawling;
        _crawledUrlRepo = crawledUrlRepo ?? new CompactCrawledUrlRepository();
        _pagesToCrawlRepo = pagesToCrawlRepo ?? new FifoPagesToCrawlRepository();

        // custom code, different from the default Scheduler:
        // this is where you populate your page list by creating a List of PageToCrawl
        var pagesToCrawl = new List<PageToCrawl>();
        var listingUrl = "https://www.mysite.com/puppies";
        for (var i = 1; i <= 5; i++)
        {
            pagesToCrawl.Add(new PageToCrawl(new Uri(listingUrl + "?" + i)));
        }

        // add your list
        Add(pagesToCrawl);

        // set flag so no further pages can be added
        _pageListLoaded = true;
    }

    public int Count => _pagesToCrawlRepo.Count();

    public void Add(PageToCrawl page)
    {
        // once your page list has been loaded, don't allow adding another page
        if (_pageListLoaded) return;

        if (page == null)
            throw new ArgumentNullException("page");

        if (_allowUriRecrawling || page.IsRetry)
        {
            _pagesToCrawlRepo.Add(page);
        }
        else
        {
            if (_crawledUrlRepo.AddIfNew(page.Uri))
                _pagesToCrawlRepo.Add(page);
        }
    }

    public void Add(IEnumerable<PageToCrawl> pages)
    {
        // same guard as above
        if (_pageListLoaded) return;

        if (pages == null)
            throw new ArgumentNullException("pages");

        foreach (var page in pages)
            Add(page);
    }

    public void AddKnownUri(Uri uri)
    {
        _crawledUrlRepo.AddIfNew(uri);
    }

    public void Clear()
    {
        _pagesToCrawlRepo.Clear();
    }

    public void Dispose()
    {
        if (_crawledUrlRepo != null)
        {
            _crawledUrlRepo.Dispose();
        }
        if (_pagesToCrawlRepo != null)
        {
            _pagesToCrawlRepo.Dispose();
        }
    }

    public PageToCrawl GetNext()
    {
        return _pagesToCrawlRepo.GetNext();
    }

    public bool IsUriKnown(Uri uri)
    {
        return _crawledUrlRepo.Contains(uri);
    }
}
- Then, when crawling (based on the QuickStart example), note that the custom MyScheduler is used to instantiate PoliteWebCrawler:
var config = new CrawlConfiguration
{
    MaxPagesToCrawl = 10, // only crawl 10 pages
    MinCrawlDelayPerDomainMilliSeconds = 3000 // wait this many milliseconds between requests
};
var scheduler = new MyScheduler();
var crawler = new PoliteWebCrawler(config, null, null, scheduler, null, null, null, null, null);
crawler.PageCrawlCompleted += PageCrawlCompleted; // several events available...
var crawlResult = await crawler.CrawlAsync(new Uri("https://www.mysite.com/"));
- That's it! Have fun!
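As noted at the top, hardcoding the page list is just for illustration. A minimal, self-contained sketch of building the paging URLs as a reusable helper might look like the following; the `listingUrl` value and the `?<n>` query format are assumptions carried over from the example above, so adjust them to whatever your site actually uses:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PageListBuilder
{
    // Builds the Uri for each page of a paged listing, using the same
    // "listingUrl + "?" + pageNumber" format as the hardcoded loop above.
    public static List<Uri> BuildPagedUris(string listingUrl, int pageCount)
    {
        if (pageCount < 1)
            throw new ArgumentOutOfRangeException(nameof(pageCount));

        return Enumerable.Range(1, pageCount)
            .Select(i => new Uri(listingUrl + "?" + i))
            .ToList();
    }

    public static void Main()
    {
        foreach (var uri in BuildPagedUris("https://www.mysite.com/puppies", 3))
            Console.WriteLine(uri);
    }
}
```

You could then wrap each Uri in a PageToCrawl inside the MyScheduler constructor, or replace the fixed page count with one read from a DB or parsed out of the landing page's paging controls.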