dsmith47 / job_finder

Tool to automatically organize and manage job postings from companies by automatically accessing career sites.

License: GNU General Public License v3.0
At time of writing the script isn't very robust. A single error during any crawl operation crashes the whole run, except when using the Selenium driver, where the current logic causes page visits to be cancelled early.

This is already creating visible problems. Because everything has to line up, transient failures occur a noticeable fraction of the time, and that probably won't scale as the number of sites grows. We could make script executions more fruitful by:
1. Catching exceptions - don't immediately escalate to runtime exceptions
2. Retrying failed queries - reduces the likelihood of a failure big enough to wreck the script output
3. Caching successful request bodies - as long as one request succeeds, the total work of the script on reruns goes down. This is a bit in tension with keeping the script output up-to-date, but it shouldn't be too hard to parameterize and add script options for working around that
4. Parallelizing script execution contexts - more of a performance enhancement, but a parallel architecture could help keep isolated failures from breaking things, and it definitely makes it easier to spend extra time retrying queries when that doesn't fully extend the wall-clock time
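Items 1 and 2 could be sketched roughly as below. This is a minimal illustration, not the script's actual code; `fetch_with_retry` and the `fetch` callable are hypothetical names, and the backoff policy is just one reasonable choice:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)

def fetch_with_retry(fetch, url, attempts=3, backoff=2.0):
    """Try a crawl operation several times; log failures instead of crashing."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:  # item 1: catch instead of escalating
            logging.warning("attempt %d/%d failed for %s: %s",
                            attempt, attempts, url, exc)
            if attempt < attempts:
                time.sleep(backoff * attempt)  # item 2: retry with linear backoff
    return None  # caller skips this page; the rest of the crawl continues
```

A `None` return marks the page as skipped, so one bad URL no longer takes down the whole run.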
These fixes are all of varying ease/payoff: 1 and 2 are pretty easy, 3 can probably be done with a sufficiently clever naming scheme for files (adding a fileIO step after every netIO step), and 4 can probably be done via multiprocessing.Process (although I'm not sure how Selenium will handle sharing resources for parallel calls).
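Items 3 and 4 might look something like the sketch below, under stated assumptions: cache files named by a hash of the URL (the "clever naming scheme"), a hypothetical `fetch` callable per site, and one `multiprocessing.Process` per career site. None of these names come from the actual repo:

```python
import hashlib
import multiprocessing
from pathlib import Path

CACHE_DIR = Path("crawl_cache")  # hypothetical cache location

def cached_fetch(fetch, url, cache_dir=CACHE_DIR):
    """Item 3: reuse a saved response body if one exists, else fetch and save it.
    Cache files are named by a hash of the URL, so the scheme needs no index."""
    cache_dir.mkdir(exist_ok=True)
    path = cache_dir / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if path.exists():
        return path.read_text()
    body = fetch(url)
    path.write_text(body)  # the fileIO step right after the netIO step
    return body

def crawl_site(fetch, urls):
    for url in urls:
        cached_fetch(fetch, url)

def run_parallel(site_jobs):
    """Item 4: one Process per site keeps an isolated failure from
    breaking the other crawls."""
    procs = [multiprocessing.Process(target=crawl_site, args=args)
             for args in site_jobs]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Deleting the cache directory (or parameterizing a `--fresh` style option) would restore fully up-to-date output.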
Just fixed a bug in the GoogleCrawler where, due to a change in the title element, the crawl was returning zero elements.

Small win: all the abstraction made it very easy to update the Google-specific selectors.

Bigger problem: one of the crawlers failed silently for an unknown amount of time, which erodes the usefulness of the script.

Scraping is more useful if it's low-maintenance and just works, so this kind of silent failure undermines the purpose. Would like to explore some solutions.
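One cheap guard against this kind of silent failure: treat an empty result set as a likely selector break rather than a quiet success. A minimal sketch, assuming each crawler returns a list of postings (the function name here is hypothetical):

```python
import logging

logging.basicConfig(level=logging.WARNING)

def check_crawl_results(crawler_name, postings):
    """Flag a zero-result crawl loudly so selector drift is noticed next run."""
    if not postings:
        logging.warning(
            "%s returned zero postings; its selectors may have gone stale",
            crawler_name,
        )
        return False
    return True
```

A zero-result GoogleCrawler run would then surface in the logs immediately instead of going unnoticed for weeks.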
The current crawler doesn't appear to be working. It's also our oldest crawler, so it's time for an update too.
Experienced a big win when implementing multiprocessing, but script execution still takes some time, and I can foresee a 10x-20x increase in website visits. Keeping a thread to track optimizations and experimental records as I continue trying to speed up this crawl.
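For tracking those optimization experiments, a tiny timing helper makes per-phase wall-clock numbers easy to record. This is a generic sketch, not code from the repo:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, records):
    """Record the wall-clock duration of a crawl phase under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        records[label] = time.perf_counter() - start
```

Wrapping each crawler in `with timed("GoogleCrawler", records): ...` gives a per-run dict of durations to log alongside the experimental notes.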