Comments (9)
I'm quite busy so can't directly check.
You are correct about the multiples, there are N preview reviews per page. If you're getting stopped always on a multiple, it would suggest the next page code is failing.
It's very possible there are issues, given the nature of scraping and Amazon constantly changing its HTML layouts.
The way to check this is to keep the rs object when iteration fails on a multiple of N, then print out the soup.
Dump it here or in a Gist and we can look further into it.
The 'next page' code always tries to find the URL from an anchor tag for the next page along.
I would hazard a guess that the next page BeautifulSoup calls aren't working with a new form of HTML layout. It's also possible you're scraping a category I haven't tried, as different parts of the Amazon website use different HTML layouts. It's a bit of a nightmare.
As an aside, are you doing a lot of scraping before this happens? Is it possible it's a robot / captcha check kicking in? See issue #25 for some information.
If you've done a lot in the past it can take a while to 'cool down'.
If it were captcha, I would expect errors in other areas though, not just review iteration.
from amazon_scraper.
I'd be happy to check the rs.soup. How can I get the soup for each new page it obtains?
print rs.soup gives me the very first page, but how do I get the subsequent soups for each next page?
Ah, good point.
Instead of iterating over rs, check rs.brief_reviews for its length. It's a generator, so you'll need to make it into a list: len(list(rs.brief_reviews)).
Next page can be retrieved manually with rs = Reviews(api, URL=rs.next_page_url)
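The manual page-by-page check described above could be sketched roughly as follows. This is only an illustration, not the library's own code: count_reviews_per_page and fetch_page are hypothetical names, and the sketch assumes nothing beyond each page object exposing url, brief_reviews and next_page_url as mentioned in this thread. With the real library, fetch_page would presumably wrap whatever call builds a Reviews object from a URL.

```python
def count_reviews_per_page(first_page, fetch_page, max_pages=50):
    """Walk review pages manually and record how many brief reviews each holds.

    first_page -- page object with .url, .brief_reviews and .next_page_url
    fetch_page -- hypothetical callable: takes a URL, returns the next page object
    max_pages  -- safety cap so a bad "next" link cannot loop forever
    """
    counts = []
    page = first_page
    while page is not None and len(counts) < max_pages:
        # brief_reviews is a generator, so turn it into a list before len()
        counts.append((page.url, len(list(page.brief_reviews))))
        next_url = page.next_page_url
        # Stop when there is no anchor for a following page
        page = fetch_page(next_url) if next_url else None
    return counts
```

A page that reports zero brief reviews, or that unexpectedly has no next page URL, is the one whose soup is worth printing and posting.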
I'm sorry, that's not clear to me. I get the error "NameError: name 'api' is not defined" with the following code:
p = amzn.lookup(ItemId='B008LX6OC6')
rs = p.reviews(api, URL=rs.next_page_url)
print len(list(rs.brief_reviews))
api will be the amzn object you instantiated initially.
I get the error "NameError: name 'rs' is not defined" with the following code:
from amazon_scraper import AmazonScraper
from amazon_scraper import reviews
amzn = AmazonScraper("XXXX", "XXXX", "XXXX")
p = amzn.lookup(ItemId='B008LX6OC6')
rs = p.reviews(api=amzn, URL=rs.next_page_url)
print len(list(rs.brief_reviews))
Will this code work correctly even if rs is defined? I'm not sure I understand how this iterates across pages and displays the soup.
You cannot pass rs to itself, it's not defined yet =P
If you manually create a Reviews object (amazon_scraper.Reviews), then you need to pass in the api object (amzn), but if you call it from the amzn object itself, it will pass itself in for you.
This should do it.
from amazon_scraper import AmazonScraper
from amazon_scraper import reviews
amzn = AmazonScraper("XXXX", "XXXX", "XXXX")
# get the product
p = amzn.lookup(ItemId='B008LX6OC6')
# get the reviews page
rs = p.reviews()
# begin scanning review page
while rs.next_page_url:
    # get next review page
    rs = amzn.reviews(URL=rs.next_page_url)
# this review doesn't have any more pages
print(rs.url)
print(rs.next_page_url)
print(rs.soup)
Yep, it's a CAPTCHA once again:
<!DOCTYPE html>
<!--[if lt IE 7]> <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->
<!--[if IE 7]> <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->
<!--[if IE 8]> <html lang="en-us" class="a-no-js a-lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="a-no-js" lang="en-us"><!--<![endif]--><head>
<meta content="text/html; charset=utf-8" http-equiv="content-type">
<meta charset="utf-8">
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible">
<title dir="ltr">Robot Check</title>
<meta content="width=device-width" name="viewport">
<link href="http://z-ecx.images-amazon.com/images/G/01/AUIClients/AmazonUI-3c39b52ef832b0823a6dc102407707c29d14c9a1.min._V1_.css" rel="stylesheet">
<script>
if (true === true) {
var ue_t0 = (+ new Date()),
ue_csm = window,
ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },
ue_furl = "fls-na.amazon.com",
ue_mid = "ATVPDKIKX0DER",
ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],
ue_sn = "opfcaptcha.amazon.com",
ue_id = '0V5H1SRSJ8XH9MV6959W';
}
</script>
</link></meta></meta></meta></meta></head>
<body>
<!--
To discuss automated access to Amazon data please contact [email protected].
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
-->
<!--
Correios.DoNotSend
-->
<div class="a-container a-padding-double-large" style="min-width:350px;padding:44px 0 !important">
<div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto">
<div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div>
<div class="a-box a-alert a-alert-info a-spacing-base">
<div class="a-box-inner">
<i class="a-icon a-icon-alert"></i>
<h4>Enter the characters you see below</h4>
<p class="a-last">Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.</p>
</div>
</div>
<div class="a-section">
<div class="a-box a-color-offset-background">
<div class="a-box-inner a-padding-extra-large">
<form action="/errors/validateCaptcha" method="get" name="">
<input name="amzn" type="hidden" value="/hwieKbRvn0ObRRJqdNi+g=="/><input name="amzn-r" type="hidden" value="/Dirt-Devil-Dynamite-Bagless-M084650RED/product-reviews/B000F8EUFI/ref=cm_cr_arp_d_paging_btm_9?ie=UTF8&pageNumber=9&sortBy=bySubmissionDateDescending"/><input name="amzn-pt" type="hidden" value="NoPageType"/>
<div class="a-row a-spacing-large">
<div class="a-box">
<div class="a-box-inner">
<h4>Type the characters you see in this image:</h4>
<div class="a-row a-text-center">
<img src="http://ecx.images-amazon.com/captcha/qujzzelu/Captcha_xsukjijfmx.jpg">
</img></div>
<div class="a-row a-spacing-base">
<div class="a-row">
<div class="a-column a-span6">
</div>
<div class="a-column a-span6 a-span-last a-text-right">
<a onclick="window.location.reload()">Try different image</a>
</div>
</div>
<input autocapitalize="off" autocomplete="off" autocorrect="off" class="a-span12" id="captchacharacters" name="field-keywords" placeholder="Type characters" spellcheck="false" type="text">
</input></div>
</div>
</div>
</div>
<div class="a-section a-spacing-extra-large">
<div class="a-row">
<span class="a-button a-button-primary a-span12">
<span class="a-button-inner">
<button class="a-button-text" type="submit">Continue shopping</button>
</span>
</span>
</div>
</div>
</form>
</div>
</div>
</div>
</div>
<div class="a-divider a-divider-section"><div class="a-divider-inner"></div></div>
<div class="a-text-center a-spacing-small a-size-mini">
<a href="http://www.amazon.com/gp/help/customer/display.html/ref=footer_cou?ie=UTF8&nodeId=508088">Conditions of Use</a>
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
<a href="http://www.amazon.com/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=468496">Privacy Policy</a>
</div>
<div class="a-text-center a-size-mini a-color-secondary">
© 1996-2014, Amazon.com, Inc. or its affiliates
<script>
if (true === true) {
document.write('<img src="http://fls-na.amaz'+'on.com/'+'1/oc-csi/1/OP/requestId=0V5H1SRSJ8XH9MV6959W&js=1" />');
};
</script>
<noscript>
<img src="http://fls-na.amazon.com/1/oc-csi/1/OP/requestId=0V5H1SRSJ8XH9MV6959W&js=0"/>
</noscript>
</div>
</div>
<script>
if (true === true) {
var elem = document.createElement("script");
elem.src = "https://images-na.ssl-images-amazon.com/images/G/01/csminstrumentation/csm-captcha-instrumentation.min._V" + (+ new Date()) + "_.js";
document.getElementsByTagName('head')[0].appendChild(elem);
}
</script>
</body></html>
Yeah I need to add a check for the captcha page.
I just don't have time to play around with this library at the moment.
If you find the issue isn't captcha related, please re-open.
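A check for the robot page could look roughly like the sketch below. It is not part of amazon_scraper: looks_like_captcha is a hypothetical helper, and the two markers it relies on (the "Robot Check" title and the /errors/validateCaptcha form action) are taken only from the page dumped earlier in this thread, so Amazon may well change them. It uses only the standard library and takes the page HTML as a string.

```python
from html.parser import HTMLParser


class _RobotCheckParser(HTMLParser):
    """Collects the <title> text and the action of every <form>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.form_actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "form":
            # attrs is a list of (name, value) pairs; value may be None
            self.form_actions.append(dict(attrs).get("action") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_like_captcha(html):
    """Return True if the HTML resembles the robot-check page seen in this thread."""
    parser = _RobotCheckParser()
    parser.feed(html)
    if parser.title.strip() == "Robot Check":
        return True
    return any("validateCaptcha" in action for action in parser.form_actions)
```

Something like looks_like_captcha(str(rs.soup)) could then be called after each page fetch, so that iteration stops with a clear message instead of silently ending on a page boundary.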