A stealthy asynchronous spider capable of running Chrome, Headless Chrome, Chromium, and Headless Chromium.
Spiders can optionally be ran with multiple browsers and/or multiple browser tabs. All request will be cyclically distributed among browsers/tabs.
Automation detection prevention scripts will be automatically invoked every time a page is navigated. If a browser is running in headless mode, headless mode detection preventions scripts will also be invoked. All detection prevention scripts are largely adapted from puppeteer-extra and Eval Sangaline's blog post.
This package is built using Pyppeteer which is the Python bindings for Puppeteer.
pip install pyppeteer_spider
Spiders can be created through the PyppeteerSpider class, which has the signature:
PyppeteerSpider(
browser_page_count: int = 1, # Number of tabs per browser.
browser_count: int = 1, # Number of browsers.
default_nav_timeout: int = 30000, # Default maximum navigation timeout. Units: ms
max_consec_browser_errors: int = 4, # Max allowable consecutive browser errors before browser will be replaced.
incognito: bool = False, # Run browser in incognito mode.
headless: bool = False, # Run browser in headless mode.
delete_cookies: bool = False, # Clear all cookies before each request.
disable_cache: bool = False, # Disable cache for each request.
disable_images: bool = False, # Load pages without images.
browser_memory_limit: Optional[int] = None, # Max memory browser can use. Units: mb
default_viewport: Optional[Dict[str, int]] = None, # Change default viewport size. Example: {width: 1280, height: 800}. Default is full page.
js_injection_scripts: Optional[List[str]] = None, # JavaScript functions that will be invoked on every page navigation.
request_abort_types: Optional[List[str]] = None, # Content types of requests that should be aborted. Example: 'image', 'font', 'stylesheet', 'script'
blocked_urls: Optional[List[str]] = None, # URL patterns to block. Wildcards ('*') are allowed.
proxy_addr: Optional[Union[List[str],str]] = None, # Address of proxy server.
user_data_dir: Optional[Union[List[str],str]] = None, # Path to Chrome profile directory. Default will use temp directory.
browser_executable: Optional[str] = None, # Path to Chrome or Chromium executable. If None, Chromium will be downloaded.
user_agent_type: Union['Linux', 'Darwin', 'Windows'] = platform.system(), # Select a user agent type. Default will be current system.
log_level: int = logging.INFO,
log_file_path: Optional[Union[str,pathlib.Path]] = None)
URLs should be navigated to via the spider's get function. get will return the navigated page and optionally a response object. Pages returned by get will not be used again by the spider until spider.set_idle(page) is called. This is so the spider does not use the page for another navigation task while you are still processing the page in your scraping script.
from pyppeteer_spider.spider import PyppeteerSpider
spider = await PyppeteerSpider().launch()
page = await spider.get('https://www.amazon.com/')
urls = [await page.evaluate("(ele) => ele.getAttribute('href')",ele)
for ele in await page.xpath("//a[@href]")]
html = await page.content()
await spider.set_idle(page)
await spider.shutdown()
from pyppeteer_spider.spider import PyppeteerSpider
spider = await PyppeteerSpider().launch()
page = await spider.get('https://www.linkedin.com/search/results/people/?keywords=Software%20Engineer&origin=SUGGESTION')
# scroll to the page so all content loads.
await spider.scroll_page(page)
# hover all profile elements.
profile_xpath = '//li[contains(@class,"search-result")]'
await spider.hover_elements(ele_xpath=profile_xpath) # Note: hovering elements is very rarely necessary.
# extract data of from all profiles.
for profile_ele in await page.xpath(profile_xpath):
person_name_ele = await profile_ele.xpath('.//span[@class="name actor-name"]')
if person_name_ele:
person_name = await page.evaluate("(ele) => ele.innerText",person_name_ele[0])
profile_url_ele = await profile_ele.xpath('.//a[contains(@class,"search-result")]')
if profile_url_ele:
profile_url = await page.evaluate("(ele) => ele.getAttribute('href')",profile_url_ele[0])
await spider.set_idle(page)
await spider.shutdown()
from pyppeteer_spider.spider import PyppeteerSpider
from pathlib import Path
async def do_scrape(url, spider):
page = await spider.get(url)
urls = [await page.evaluate("(ele) => ele.getAttribute('href')",ele)
for ele in await page.xpath("//a[@href]")]
html = await page.content()
await spider.set_idle(page)
with Path("urls.txt").open(mode='r') as infile:
urls = set([line.strip() for line in infile])
spider = await PyppeteerSpider(browser_page_count=4,
browser_count=3).launch()
await asyncio.gather(*[asyncio.create_task(do_scrape(url, spider))
for url in urls])
await spider.shutdown()