Comments (11)
I've been running into the same problem with another app that I'm building, which is kind of big problem.
I've added captcha detection, and instead of just throwing an error, I've added an interface to deathbycaptcha for getting the captcha solved in a relatively short amount of time (15-20 seconds). All of this seems to work pretty good in my initial testing.
Hopefully I'll be able to make a pull request within the next few days.
from amazon_scraper.
This has been tested a fair amount with success and is ready to go.
@adamlwgriffiths Can I have your permission to create a new branch?
from amazon_scraper.
Awesome! I'll try and get some time to take a look. Hopefully within 24 hours.
Why a new branch? If it works we can just merge to master and up the major or minor version - depending on if there's any function changes / new exceptions.
On 01/07/2016, at 10:21 PM, Benjamin [email protected] wrote:
This has been tested a fair amount with success and is ready to go.
@adamlwgriffiths Can I have your permission to create a new branch?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
from amazon_scraper.
Ah sorry I assumed code was already included. I'm currently mobile so not really able to interact fully.
I guess it would be a major version since its new dependencies which could break unsuspecting users.
Re: new branch.
Sure. Once you're happy with it well pull it to master and package it up for pypi.
I'm busy atm and not using this lib currently so I can't provide much / any testing.
On 01/07/2016, at 10:21 PM, Benjamin [email protected] wrote:
This has been tested a fair amount with success and is ready to go.
@adamlwgriffiths Can I have your permission to create a new branch?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
from amazon_scraper.
Hmm, deathbycaptcha isn't a free service.
If you want to do this, then it should be optional.
I would add a 'captcha handler' as part of the API.
If the API init function takes a new parameter of captcha_handler=None
.
When a captcha is detected, the default None
handler would throw a 'CaptchaRequiredException' or something similar.
You could then include a 'DeathByCaptcha' handler which could be passed to the API init function to handle this your way.
I don't want to enforce a paid-for service in the API itself.
Being a non-free service also implies there are alternatives, which means flexibility here is a good thing.
from amazon_scraper.
Agreed - captcha handling should be optional, as other alternatives may exist for dealing with them, or simply throwing an exception may be adequate.
The code that I was testing was filtering request responses in the same script, but this doesn't make sense for the codebase overall.
I'll take your comment into consideration, Adam, as I start examining the API more closely.
from amazon_scraper.
The codebase isn't exactly anything to write home about.
I wouldn't go overboard engineering a great solution, just something simple, logical, and flexible.
I think if each of the soup
@property
methods was changed it would be easier.
At the moment they call to a global get
function which does rate limiting.
I think the entirety of the soup
functions should be promoted to the core __init__
file.
After the BeautifulSoup code is loaded, a check can be put in place for a captcha page.
If detected, the API.captcha_handler
or what-have-you is then called to see if it can be handled. appriately. If handled, the function can have another go at downloading the HTML and reparsing it.
This way, all classes can seemlessly get the captcha detection/handling.
from amazon_scraper.
I've made some good progress within the last day. All requests run through the API and are filtered (via both regex and the bs4 interface). I've also added a hook for whatever action/service people may wish to use in order to deal with captchas whenever detected, otherwise defaulting to writing a log message.
One tiny bug that I've encountered - I'm seeing loop behavior in logic where I would not expect. Should be quick to get that ironed out though.
Can I have permission to create a branch in order to save my work?
from amazon_scraper.
Ahhh, if you just fork the repo you can commit all you want.
from amazon_scraper.
I was finally was able to make a commit on the captcha plugin.
from amazon_scraper.
Hey Ben,
I added some comments inline in that commit.
I think the coupling to DBC is too tight.
As I said above, there should just be a basic captcha handler which just detects the captcha and throws an exception.
The user should pass the DBC handler in at API construction time (if not passed in, it should fall back to the default mentioned above).
The amazon core API should have no knowledge of DBC, only that there are potential captchas and that there are methods to handle them.
If you want I can try and find some time to add a basic framework for this that you can then plug your DBC into. That said, I'm pretty busy at the moment.
from amazon_scraper.
Related Issues (20)
- extract_asin doesn't work with all Amazon's links HOT 1
- Reviews not getting after review page HOT 5
- Only getting the last 10 reviews HOT 3
- Problem installing amazon_scraper HOT 3
- Average Review Rating HOT 1
- Page sometimes not loading? HOT 10
- help HOT 1
- How to get offer listings(all offer price by all merchants for single product) HOT 1
- Random stopping on multiples of 10 HOT 9
- Get Product Price HOT 1
- ImportError: No module named tests HOT 1
- GUI? HOT 1
- Problem with BeautifulSoup import HOT 1
- Add ability to set amazon_base HOT 2
- Can't parse review date for foreign Amazon regions
- Problem with InsecureRequest HOT 1
- AWS Accout HOT 1
- Problems with .text command HOT 3
- Install requirement contains invalid library name HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from amazon_scraper.