
dsmith47 / job_finder

Tool to automatically organize and manage job postings from companies by accessing their career sites

License: GNU General Public License v3.0

Python 100.00%

job_finder's Issues

Retries on Failed Queries

At the time of writing, the script isn't very robust. A single error during any crawl operation will crash the whole run, except when using the Selenium driver, where the current logic causes page visits to be cancelled early.

This is already creating visible problems. Requiring everything to line up causes transient failures a noticeable fraction of the time, and that probably won't scale much further. We could make script executions more fruitful by:

1. Catching exceptions - don't let a single failure immediately escalate into a runtime crash
2. Retrying failed queries - reduces the likelihood of a failure big enough to wreck the script output
3. Caching successful request bodies - as long as a request succeeds once, the total work of the script on reruns goes down. This is somewhat in tension with keeping the script output up to date, but it shouldn't be too hard to parameterize and add script options for working around that
4. Parallelizing script execution contexts - more of a performance enhancement, but a parallel architecture could help keep isolated failures from breaking things, and it definitely makes it easier to spend extra time retrying queries without fully extending the wall-clock time

These fixes are all of varying ease/payoff: 1 and 2 are pretty easy, 3 can probably be done with a sufficiently clever naming scheme for files (and adding a file-IO step after every net-IO step), and 4 can probably be done via multiprocessing.Process (although I'm not sure how Selenium will handle sharing resources across parallel calls).
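
A minimal sketch of what fixes 1-3 could look like, assuming a requests-based fetch path; the fetch_with_retries name, cache layout, and thresholds are all hypothetical illustrations, not code from this repo:

```python
import hashlib
import time
from pathlib import Path

import requests

CACHE_DIR = Path(".crawl_cache")  # assumed cache location

def fetch_with_retries(url, max_retries=3, backoff_s=2.0, use_cache=True):
    """Fetch a URL, catching errors and retrying instead of crashing the run."""
    cache_file = CACHE_DIR / hashlib.sha1(url.encode()).hexdigest()
    if use_cache and cache_file.exists():
        return cache_file.read_text()  # fix 3: reuse a previously cached body
    for attempt in range(1, max_retries + 1):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
        except requests.RequestException as exc:
            # Fix 1: catch instead of escalating. Fix 2: retry with backoff.
            if attempt == max_retries:
                print(f"giving up on {url}: {exc}")
                return None  # caller can skip this site instead of crashing
            time.sleep(backoff_s * attempt)
        else:
            if use_cache:
                CACHE_DIR.mkdir(exist_ok=True)
                cache_file.write_text(resp.text)
            return resp.text
```

Parameterizing use_cache (or adding a max cache age) is one way to handle the staleness tension noted in point 3.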

Searching for script elements is brittle

Just fixed a bug in the GoogleCrawler where, due to a change in the title element, the crawl was returning zero elements.

Small win: all the abstraction made it very easy to update the Google-specific selectors and get things working again

Bigger problem: one of the Crawlers failed silently for an unknown amount of time, which erodes the usefulness of the script

Scraping is more useful if it's low-maintenance and just works, so this kind of undermines the purpose. Would like to explore some solutions

Potential Solution 0: Do Nothing, Maintain classes as they break

  • Checking the script for errors in execution becomes a clean-up task on each run
    • Could probably implement a Crawler-level alert when records are suspiciously low (see the sketch after the cons list below)

Pros

  • Requires no engineering, just script-hacking exactly as often as the parser breaks
  • Fixing the query just now still took less time than copying the relevant records over by hand, so this method will still be efficient

Cons

  • Cedes control over when your script stops informing you
  • A bad alerting scheme could cause this to introduce large holes in the information
  • Script goes from a passive labor saver to a chore (albeit better than some alternative chores); changes that seem as simple as an incidental framework compiler choice (or that could be made dynamically to defeat scrapers) will constantly tick forward and incur a constant amount of labor to keep using the script
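
A minimal sketch of the crawler-level alert mentioned above, assuming each crawler's output lands in a dict of record lists; the names and thresholds are hypothetical:

```python
# Assumed per-crawler floors for "expected" record counts.
MIN_EXPECTED_RECORDS = {"GoogleCrawler": 5}

def check_record_counts(results):
    """Warn when a crawler returns suspiciously few records, which usually
    means its selectors have silently broken."""
    for crawler_name, records in results.items():
        floor = MIN_EXPECTED_RECORDS.get(crawler_name, 1)
        if len(records) < floor:
            print(f"WARNING: {crawler_name} returned {len(records)} records "
                  f"(expected at least {floor}); selectors may be stale.")
```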

2.0 Crawler: Google

The current Crawler doesn't appear to be working; it's also our oldest crawl, so it's time for an update.

  • Use the new AbstractCrawler base class structure
  • Optimize for minimized query times
    • Test whether BeautifulSoup works, to save on Selenium overhead
    • Trim Selenium waits as short as possible
    • Trim system wait times as low as possible (sleep -> sleep_ms)
  • Implement support for the location field on job ads
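
A rough sketch of what the BeautifulSoup experiment could look like; the class name, URL, and selectors below are illustrative assumptions, not the repo's actual AbstractCrawler interface or the live page's markup:

```python
import requests
from bs4 import BeautifulSoup

class GoogleCrawler2:  # assumed name for the 2.0 rewrite
    CAREERS_URL = "https://www.google.com/about/careers/applications/jobs/results"

    def crawl(self):
        """Try a plain HTTP fetch + BeautifulSoup first; fall back to
        Selenium only if the listings require JavaScript rendering."""
        resp = requests.get(self.CAREERS_URL, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        for card in soup.select("li.job-card"):  # placeholder selector
            title = card.select_one("h3")
            location = card.select_one(".location")  # new location field
            yield {
                "title": title.get_text(strip=True) if title else None,
                "location": location.get_text(strip=True) if location else None,
            }
```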

Optimizing Crawl Speed Thread

Experienced a big win when implementing multiprocessing, but script execution still takes some time, and I can foresee a 10x-20x increase in website visits. Keeping a thread to track optimizations and experimental records as I continue to try to speed up this crawl
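
For reference, a minimal sketch of the multiprocessing layout being described, assuming each crawler class exposes a run() method; the names are hypothetical. Instantiating the crawler inside the worker keeps non-picklable resources (like a Selenium driver) from crossing the process boundary:

```python
from multiprocessing import Pool

def run_crawler(crawler_cls):
    """Instantiate and run one crawler inside a worker process, isolating
    its failures so one bad site can't take down the whole run."""
    try:
        return crawler_cls().run()
    except Exception as exc:
        print(f"{crawler_cls.__name__} failed: {exc}")
        return []

def crawl_all(crawler_classes, workers=4):
    # Per-site waits overlap across workers instead of adding up, so extra
    # retry time doesn't fully extend wall-clock time.
    with Pool(processes=workers) as pool:
        return pool.map(run_crawler, crawler_classes)
```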
