Comments (9)

adamlwgriffiths commented on May 23, 2024

I'm quite busy so can't directly check.
You are correct about the multiples: there are N preview reviews per page. If you always get stopped on a multiple of N, that suggests the next-page code is failing.
It's very possible there are issues, given the nature of scraping and Amazon constantly changing its HTML layouts.

The way to check this is to keep the rs object when iteration fails on a multiple of N, then print out the soup.
Dump it here or in a Gist and we can look further into it.
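
For a Gist, it is usually easier to write the soup to a file than to copy it out of a terminal. A minimal sketch: the helper name dump_soup and the file name are hypothetical, and it only assumes that the soup converts to text with str(), which holds for BeautifulSoup objects.

```python
import io


def dump_soup(soup, path):
    """Write the page HTML to `path` and return the number of characters written.

    `soup` can be a BeautifulSoup object or a plain string; both work with str().
    """
    html = str(soup)
    with io.open(path, "w", encoding="utf-8") as fh:
        fh.write(html)
    return len(html)
```

Typical use would be something like dump_soup(rs.soup, "failed_page.html") on the rs object kept from the failed iteration.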

The 'next page' code always tries to find the URL from an anchor tag for the next page along.
I would hazard a guess that the next-page BeautifulSoup calls aren't working with a new form of HTML layout. It's also possible you're scraping a category I haven't tried, as different parts of the Amazon website use different HTML layouts. It's a bit of a nightmare.

As an aside, are you doing a lot of scraping before this happens? Is it possible it's a robot / captcha check kicking in? See this issue for some information.
#25

If you've done a lot in the past it can take a while to 'cool down'.
If it were captcha, I would expect errors in other areas though, not just review iteration.

from amazon_scraper.

mattrocklage commented on May 23, 2024

I'd be happy to check the rs.soup. How can I get the soup for each new page it obtains?

print rs.soup gives me the very first page, but how do I get the subsequent soups for each next page?

adamlwgriffiths commented on May 23, 2024

Ah, good point.
Instead of iterating over rs, check the length of rs.brief_reviews. It's a generator, so you'll need to turn it into a list first: len(list(rs.brief_reviews)).
The next page can be retrieved manually with rs = Reviews(api, URL=rs.next_page_url)
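
Put together, the manual paging could be sketched roughly as follows. This is an illustration, not code from amazon_scraper: count_reviews_per_page, fetch_page and max_pages are hypothetical names, and the only things assumed about a page object are the url, brief_reviews and next_page_url attributes mentioned in this thread.

```python
def count_reviews_per_page(first_page, fetch_page, max_pages=100):
    """Walk review pages by hand and record how many brief reviews each one has.

    first_page: an object with .url, .brief_reviews and .next_page_url
    fetch_page: a callable that takes a URL and returns the next page object
                (hypothetical; supplied by the caller)
    Returns (counts, last_page), where counts is a list of (url, n) tuples and
    last_page is the page the walk stopped on, kept for inspection.
    """
    counts = []
    page = first_page
    for _ in range(max_pages):
        # brief_reviews is a generator, so materialise it before counting
        counts.append((page.url, len(list(page.brief_reviews))))
        if not page.next_page_url:
            break
        page = fetch_page(page.next_page_url)
    return counts, page
```

The page the walk stops on is returned as well, so its soup can be printed afterwards. A page with a count of zero, or one that has no next-page URL earlier than expected, is the one to look at.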

mattrocklage commented on May 23, 2024

I'm sorry, that's not clear to me. I get the error "NameError: name 'api' is not defined" with the following code:

p = amzn.lookup(ItemId='B008LX6OC6')
rs = p.reviews(api, URL=rs.next_page_url)
print len(list(rs.brief_reviews))

adamlwgriffiths commented on May 23, 2024

api will be the amzn object you instantiated initially.

mattrocklage commented on May 23, 2024

I get the error "NameError: name 'rs' is not defined" with the following code:

from amazon_scraper import AmazonScraper
from amazon_scraper import reviews

amzn = AmazonScraper("XXXX", "XXXX", "XXXX")
p = amzn.lookup(ItemId='B008LX6OC6')
rs = p.reviews(api=amzn, URL=rs.next_page_url)
print len(list(rs.brief_reviews))

Will this code work correctly even if rs is defined? I'm not sure I understand how this iterates across pages and displays the soup.

adamlwgriffiths commented on May 23, 2024

You cannot pass rs to itself; it's not defined yet =P
If you manually create a Reviews object (amazon_scraper.Reviews), then you need to pass in the api object (amzn). If you call reviews() from the amzn object itself, it will pass itself in for you.

This should do it.

from amazon_scraper import AmazonScraper

amzn = AmazonScraper("XXXX", "XXXX", "XXXX")
# get the product
p = amzn.lookup(ItemId='B008LX6OC6')
# get the first reviews page
rs = p.reviews()
# keep following the 'next page' link for as long as there is one
while rs.next_page_url:
    # get the next reviews page
    rs = amzn.reviews(URL=rs.next_page_url)

# rs is now the first page without a 'next page' link:
# either the real last page, or a page that failed to parse
print(rs.url)
print(rs.next_page_url)
print(rs.soup)

mattrocklage commented on May 23, 2024

Yep, it's a CAPTCHA once again:

<!DOCTYPE html>

<!--[if lt IE 7]> <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->
<!--[if IE 7]>    <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->
<!--[if IE 8]>    <html lang="en-us" class="a-no-js a-lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="a-no-js" lang="en-us"><!--<![endif]--><head>
<meta content="text/html; charset=utf-8" http-equiv="content-type">
<meta charset="utf-8">
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible">
<title dir="ltr">Robot Check</title>
<meta content="width=device-width" name="viewport">
<link href="http://z-ecx.images-amazon.com/images/G/01/AUIClients/AmazonUI-3c39b52ef832b0823a6dc102407707c29d14c9a1.min._V1_.css" rel="stylesheet">
<script>

if (true === true) {
    var ue_t0 = (+ new Date()),
        ue_csm = window,
        ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },
        ue_furl = "fls-na.amazon.com",
        ue_mid = "ATVPDKIKX0DER",
        ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],
        ue_sn = "opfcaptcha.amazon.com",
        ue_id = '0V5H1SRSJ8XH9MV6959W';
}
</script>
</link></meta></meta></meta></meta></head>
<body>
<!--
        To discuss automated access to Amazon data please contact [email protected].
        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
-->
<!--
Correios.DoNotSend
-->
<div class="a-container a-padding-double-large" style="min-width:350px;padding:44px 0 !important">
<div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto">
<div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div>
<div class="a-box a-alert a-alert-info a-spacing-base">
<div class="a-box-inner">
<i class="a-icon a-icon-alert"></i>
<h4>Enter the characters you see below</h4>
<p class="a-last">Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.</p>
</div>
</div>
<div class="a-section">
<div class="a-box a-color-offset-background">
<div class="a-box-inner a-padding-extra-large">
<form action="/errors/validateCaptcha" method="get" name="">
<input name="amzn" type="hidden" value="/hwieKbRvn0ObRRJqdNi+g=="/><input name="amzn-r" type="hidden" value="/Dirt-Devil-Dynamite-Bagless-M084650RED/product-reviews/B000F8EUFI/ref=cm_cr_arp_d_paging_btm_9?ie=UTF8&amp;pageNumber=9&amp;sortBy=bySubmissionDateDescending"/><input name="amzn-pt" type="hidden" value="NoPageType"/>
<div class="a-row a-spacing-large">
<div class="a-box">
<div class="a-box-inner">
<h4>Type the characters you see in this image:</h4>
<div class="a-row a-text-center">
<img src="http://ecx.images-amazon.com/captcha/qujzzelu/Captcha_xsukjijfmx.jpg">
</img></div>
<div class="a-row a-spacing-base">
<div class="a-row">
<div class="a-column a-span6">
</div>
<div class="a-column a-span6 a-span-last a-text-right">
<a onclick="window.location.reload()">Try different image</a>
</div>
</div>
<input autocapitalize="off" autocomplete="off" autocorrect="off" class="a-span12" id="captchacharacters" name="field-keywords" placeholder="Type characters" spellcheck="false" type="text">
</input></div>
</div>
</div>
</div>
<div class="a-section a-spacing-extra-large">
<div class="a-row">
<span class="a-button a-button-primary a-span12">
<span class="a-button-inner">
<button class="a-button-text" type="submit">Continue shopping</button>
</span>
</span>
</div>
</div>
</form>
</div>
</div>
</div>
</div>
<div class="a-divider a-divider-section"><div class="a-divider-inner"></div></div>
<div class="a-text-center a-spacing-small a-size-mini">
<a href="http://www.amazon.com/gp/help/customer/display.html/ref=footer_cou?ie=UTF8&amp;nodeId=508088">Conditions of Use</a>
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
<a href="http://www.amazon.com/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&amp;nodeId=468496">Privacy Policy</a>
</div>
<div class="a-text-center a-size-mini a-color-secondary">
          © 1996-2014, Amazon.com, Inc. or its affiliates
          <script>
           if (true === true) {
             document.write('<img src="http://fls-na.amaz'+'on.com/'+'1/oc-csi/1/OP/requestId=0V5H1SRSJ8XH9MV6959W&js=1" />');
           };
          </script>
<noscript>
<img src="http://fls-na.amazon.com/1/oc-csi/1/OP/requestId=0V5H1SRSJ8XH9MV6959W&amp;js=0"/>
</noscript>
</div>
</div>
<script>
    if (true === true) {
        var elem = document.createElement("script");
        elem.src = "https://images-na.ssl-images-amazon.com/images/G/01/csminstrumentation/csm-captcha-instrumentation.min._V" + (+ new Date()) + "_.js";
        document.getElementsByTagName('head')[0].appendChild(elem);
    }
    </script>
</body></html>

adamlwgriffiths commented on May 23, 2024

Yeah I need to add a check for the captcha page.
I just don't have time to play around with this library at the moment.

If you find the issue isn't captcha related, please re-open.
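
A check for the captcha page could look something like the sketch below. It is not part of amazon_scraper: looks_like_captcha is a hypothetical helper, and the two markers it relies on (a "Robot Check" title and a form that posts to /errors/validateCaptcha) are taken from the dump earlier in this thread, so they are assumptions that may stop holding whenever Amazon changes the page. It uses only the standard library and takes the page HTML as a string.

```python
from html.parser import HTMLParser


class _CaptchaProbe(HTMLParser):
    """Collects the <title> text and the action of every <form> on the page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.form_actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "form":
            # attrs is a list of (name, value) pairs; value may be None
            self.form_actions.append(dict(attrs).get("action") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_like_captcha(html):
    """Return True if the HTML matches either captcha marker seen in this thread."""
    probe = _CaptchaProbe()
    probe.feed(html)
    return (
        probe.title.strip() == "Robot Check"
        or any("validateCaptcha" in action for action in probe.form_actions)
    )
```

Calling looks_like_captcha(str(rs.soup)) after each page fetch would let the iteration raise a clear error instead of stopping silently.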
