
Comments (11)

benperove avatar benperove commented on June 5, 2024

I've been running into the same problem with another app that I'm building, which is kind of a big problem.

I've added captcha detection, and instead of just throwing an error, I've added an interface to deathbycaptcha for getting the captcha solved in a relatively short amount of time (15-20 seconds). All of this seems to work pretty well in my initial testing.

Hopefully I'll be able to make a pull request within the next few days.

from amazon_scraper.

benperove avatar benperove commented on June 5, 2024

This has been tested a fair amount with success and is ready to go.

@adamlwgriffiths Can I have your permission to create a new branch?


adamlwgriffiths avatar adamlwgriffiths commented on June 5, 2024

Awesome! I'll try and get some time to take a look. Hopefully within 24 hours.

Why a new branch? If it works we can just merge to master and bump the major or minor version, depending on whether there are any function changes / new exceptions.

On 01/07/2016, at 10:21 PM, Benjamin [email protected] wrote:

This has been tested a fair amount with success and is ready to go.

@adamlwgriffiths Can I have your permission to create a new branch?



adamlwgriffiths avatar adamlwgriffiths commented on June 5, 2024

Ah sorry, I assumed the code was already included. I'm currently mobile so not really able to interact fully.

I guess it would be a major version, since it adds new dependencies which could break unsuspecting users.

Re: new branch.
Sure. Once you're happy with it we'll pull it to master and package it up for PyPI.
I'm busy atm and not using this lib currently so I can't provide much / any testing.


adamlwgriffiths avatar adamlwgriffiths commented on June 5, 2024

Hmm, deathbycaptcha isn't a free service.

If you want to do this, then it should be optional.
I would add a 'captcha handler' as part of the API.
The API init function would take a new parameter, captcha_handler=None.
When a captcha is detected, the default None handler would throw a 'CaptchaRequiredException' or something similar.
You could then include a 'DeathByCaptcha' handler which could be passed to the API init function to handle this your way.

I don't want to enforce a paid-for service in the API itself.
Being a non-free service also implies there are alternatives, which means flexibility here is a good thing.
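
A minimal sketch of the optional handler described above, assuming a simplified `Api` class. Only `captcha_handler=None` and the `CaptchaRequiredException` name come from the comment; the `on_captcha` method and the handler's `(url, html)` signature are hypothetical, chosen for illustration, and are not the library's actual API.

```python
class CaptchaRequiredException(Exception):
    """Raised when a captcha page is detected and nothing can handle it."""


class Api:
    # Hypothetical stand-in for the library's API class; only the
    # captcha-related constructor argument is sketched here.
    def __init__(self, captcha_handler=None):
        # Any callable taking (url, html); None means "just raise".
        self.captcha_handler = captcha_handler

    def on_captcha(self, url, html):
        """Hypothetical hook, called when a captcha page is detected."""
        if self.captcha_handler is None:
            # Default behaviour: no paid service involved, surface the problem.
            raise CaptchaRequiredException(url)
        # A user-supplied handler (e.g. one backed by a solving service)
        # decides what to do and reports whether it succeeded.
        return self.captcha_handler(url, html)
```

With this shape, a paid solver is just one possible handler that a user opts into at construction time, for example `Api(captcha_handler=my_handler)`, while the default stays free of extra dependencies.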


benperove avatar benperove commented on June 5, 2024

Agreed - captcha handling should be optional, as other alternatives may exist for dealing with them, or simply throwing an exception may be adequate.

The code that I was testing was filtering request responses in the same script, but this doesn't make sense for the codebase overall.

I'll take your comment into consideration, Adam, as I start examining the API more closely.


adamlwgriffiths avatar adamlwgriffiths commented on June 5, 2024

The codebase isn't exactly anything to write home about.
I wouldn't go overboard engineering a great solution, just something simple, logical, and flexible.

I think if each of the soup @property methods was changed it would be easier.
At the moment they call to a global get function which does rate limiting.
I think the entirety of the soup functions should be promoted to the core __init__ file.
After the BeautifulSoup code is loaded, a check can be put in place for a captcha page.
If detected, the API.captcha_handler or what-have-you is then called to see if it can be handled appropriately. If handled, the function can have another go at downloading the HTML and reparsing it.

This way, all classes can seamlessly get the captcha detection/handling.
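
The fetch, detect, handle, retry flow described above could look roughly like the sketch below. Everything here is an assumption made for illustration: `get_html`, `looks_like_captcha`, the injected `fetch` callable, and the `max_retries` bound are hypothetical names, and the captcha check is a placeholder string test rather than a real BeautifulSoup inspection of the parsed page.

```python
class CaptchaRequiredException(Exception):
    """Raised when a captcha page is detected and cannot be handled."""


def looks_like_captcha(html):
    # Placeholder detection (assumption): a real check would inspect the
    # parsed page for whatever marks the captcha form.
    return "captcha" in html.lower()


def get_html(url, fetch, captcha_handler=None, max_retries=1):
    """Download `url` via `fetch`, giving the handler a chance on captchas.

    `fetch` is any callable mapping a URL to HTML text; in the library this
    would be the shared, rate-limited download function.
    """
    html = fetch(url)
    for _ in range(max_retries):
        if not looks_like_captcha(html):
            return html
        # No handler, or the handler gave up: surface the captcha to the caller.
        if captcha_handler is None or not captcha_handler(url, html):
            raise CaptchaRequiredException(url)
        # Handler reported success: have another go at downloading.
        html = fetch(url)
    if looks_like_captcha(html):
        # Retries are bounded so a persistent captcha cannot loop forever.
        raise CaptchaRequiredException(url)
    return html
```

Routing every soup property through one function like this is what would give all classes the same detection and handling, and the retry bound keeps a handler that never really succeeds from causing an endless loop.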


benperove avatar benperove commented on June 5, 2024

I've made some good progress within the last day. All requests run through the API and are filtered (via both regex and the bs4 interface). I've also added a hook for whatever action/service people may wish to use in order to deal with captchas whenever detected, otherwise defaulting to writing a log message.

One tiny bug that I've encountered - I'm seeing looping behavior in logic where I would not expect it. That should be quick to iron out though.

Can I have permission to create a branch in order to save my work?


adamlwgriffiths avatar adamlwgriffiths commented on June 5, 2024

Ahhh, if you just fork the repo you can commit all you want.


benperove avatar benperove commented on June 5, 2024

I was finally able to make a commit on the captcha plugin.

benperove@df2752d


adamlwgriffiths avatar adamlwgriffiths commented on June 5, 2024

Hey Ben,
I added some comments inline in that commit.
I think the coupling to DBC is too tight.
As I said above, there should just be a basic captcha handler which just detects the captcha and throws an exception.
The user should pass the DBC handler in at API construction time (if not passed in, it should fall back to the default mentioned above).
The amazon core API should have no knowledge of DBC, only that there are potential captchas and that there are methods to handle them.

If you want I can try and find some time to add a basic framework for this that you can then plug your DBC into. That said, I'm pretty busy at the moment.
