
dsmith47 / job_finder

Tool to automatically organize and manage job postings from companies by accessing their career sites

License: GNU General Public License v3.0

Python 100.00%

job_finder's Issues

Retries on Failed Queries

At the time of writing, the script isn't very robust. A single error during any crawl operation will crash the whole run, except when using the Selenium driver, where the current logic causes page visits to be cancelled early.

This is already creating visible problems. Requiring everything to line up causes transient failures a noticeable fraction of the time, and that probably won't scale much further. We could make script executions more fruitful by:

1. Catching exceptions - don't let a single failure immediately escalate into a runtime crash
2. Retrying failed queries - reduces the likelihood of a failure big enough to wreck the script output
3. Caching successful request bodies - as long as a request succeeds once, the total work of the script on reruns goes down. This is somewhat in tension with keeping the script output up to date, but it shouldn't be too hard to parameterize and add script options for working around that
4. Parallelizing script execution contexts - more of a performance enhancement, but a parallel architecture could help keep isolated failures from breaking things, and it definitely makes it easier to spend extra time retrying queries without fully extending the wall-clock time

These fixes are all of varying ease/payoff: 1 and 2 are pretty easy, 3 can probably be done with a sufficiently clever naming scheme for files (and adding a file-IO step after every net-IO step), and 4 can probably be done via multiprocessing.Process (although I'm not sure how Selenium will handle sharing resources across parallel calls).
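
A minimal sketch of what fixes 1-3 could look like, assuming a requests-based fetch path; the fetch_with_retries name, cache layout, and thresholds are all hypothetical illustrations, not code from this repo:

```python
import hashlib
import time
from pathlib import Path

import requests

CACHE_DIR = Path(".crawl_cache")  # assumed cache location

def fetch_with_retries(url, max_retries=3, backoff_s=2.0, use_cache=True):
    """Fetch a URL, catching errors and retrying instead of crashing the run."""
    cache_file = CACHE_DIR / hashlib.sha1(url.encode()).hexdigest()
    if use_cache and cache_file.exists():
        return cache_file.read_text()  # fix 3: reuse a previously cached body
    for attempt in range(1, max_retries + 1):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
        except requests.RequestException as exc:
            # Fix 1: catch instead of escalating. Fix 2: retry with backoff.
            if attempt == max_retries:
                print(f"giving up on {url}: {exc}")
                return None  # caller can skip this site instead of crashing
            time.sleep(backoff_s * attempt)
        else:
            if use_cache:
                CACHE_DIR.mkdir(exist_ok=True)
                cache_file.write_text(resp.text)
            return resp.text
```

Parameterizing use_cache (or adding a max cache age) is one way to handle the staleness tension noted in point 3.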

Searching for script elements is brittle

Just fixed a bug in the GoogleCrawler where, due to a change in the title element, the crawl was returning zero elements.

Small win: all the abstraction made it very easy to update the Google-specific selectors and get things working again

Bigger problem: one of the Crawlers failed silently for an unknown amount of time, which erodes the usefulness of the script

Scraping is more useful if it's low-maintenance and just works, so this kind of undermines the purpose. Would like to explore some solutions

Potential Solution 0: Do Nothing, Maintain classes as they break

  • Checking the script for errors in execution becomes a clean-up task on each run
    • Could probably implement a Crawler-level alert when records are suspiciously low (see the sketch after the cons list below)

Pros

  • Requires no engineering, just script-hacking exactly as often as the parser breaks
  • Fixing the query just now still took less time than copying the relevant records over by hand, so this method will still be efficient

Cons

  • Cedes control over when your script stops informing you
  • A bad alerting scheme could cause this to introduce large holes in the information
  • Script goes from a passive labor saver to a chore (albeit better than some alternative chores); changes that seem as simple as an incidental framework compiler choice (or that could be made dynamically to defeat scrapers) will constantly tick forward and incur a constant amount of labor to keep using the script
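
A minimal sketch of the crawler-level alert mentioned above, assuming each crawler's output lands in a dict of record lists; the names and thresholds are hypothetical:

```python
# Assumed per-crawler floors for "expected" record counts.
MIN_EXPECTED_RECORDS = {"GoogleCrawler": 5}

def check_record_counts(results):
    """Warn when a crawler returns suspiciously few records, which usually
    means its selectors have silently broken."""
    for crawler_name, records in results.items():
        floor = MIN_EXPECTED_RECORDS.get(crawler_name, 1)
        if len(records) < floor:
            print(f"WARNING: {crawler_name} returned {len(records)} records "
                  f"(expected at least {floor}); selectors may be stale.")
```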

2.0 Crawler: Google

The current Crawler doesn't appear to be working; it's also our oldest crawl, so it's time for an update.

  • Use the new AbstractCrawler base class structure
  • Optimize for minimized query times
    • Test whether BeautifulSoup works, to save on Selenium overhead
    • Trim Selenium waits as short as possible
    • Trim system wait times as low as possible (sleep -> sleep_ms)
  • Implement support for the location field on job ads
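
A rough sketch of what the BeautifulSoup experiment could look like; the class name, URL, and selectors below are illustrative assumptions, not the repo's actual AbstractCrawler interface or the live page's markup:

```python
import requests
from bs4 import BeautifulSoup

class GoogleCrawler2:  # assumed name for the 2.0 rewrite
    CAREERS_URL = "https://www.google.com/about/careers/applications/jobs/results"

    def crawl(self):
        """Try a plain HTTP fetch + BeautifulSoup first; fall back to
        Selenium only if the listings require JavaScript rendering."""
        resp = requests.get(self.CAREERS_URL, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        for card in soup.select("li.job-card"):  # placeholder selector
            title = card.select_one("h3")
            location = card.select_one(".location")  # new location field
            yield {
                "title": title.get_text(strip=True) if title else None,
                "location": location.get_text(strip=True) if location else None,
            }
```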

Optimizing Crawl Speed Thread

Experienced a big win when implementing multiprocessing, but script execution still takes some time, and I can foresee a 10x-20x increase in website visits. Keeping a thread to track optimizations and experimental records as I continue to try to speed up this crawl
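
For reference, a minimal sketch of the multiprocessing layout being described, assuming each crawler class exposes a run() method; the names are hypothetical. Instantiating the crawler inside the worker keeps non-picklable resources (like a Selenium driver) from crossing the process boundary:

```python
from multiprocessing import Pool

def run_crawler(crawler_cls):
    """Instantiate and run one crawler inside a worker process, isolating
    its failures so one bad site can't take down the whole run."""
    try:
        return crawler_cls().run()
    except Exception as exc:
        print(f"{crawler_cls.__name__} failed: {exc}")
        return []

def crawl_all(crawler_classes, workers=4):
    # Per-site waits overlap across workers instead of adding up, so extra
    # retry time doesn't fully extend wall-clock time.
    with Pool(processes=workers) as pool:
        return pool.map(run_crawler, crawler_classes)
```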
