Comments (11)
Could you provide the review ID (and preferably the ASIN too, if you have the product object)?
print rs.url
print rs.id
I'll take a look after I catch some sleep =)
from amazon_scraper.
I had a quick look: the asin property works on both the reviews and review objects, at least for the products I'm testing against.
Product page HTML changes drastically (and continually evolves), but I've never seen any variation in the review HTML format. So I'm curious to see which product you're getting reviews from.
Here's a log from my interpreter (I get the same result from any product I try):
>>> amzn = AmazonScraper ('access', 'secret', 'associate_tag')
>>> item = amzn.lookup(ItemId='B00008MOQA')
>>> item.title
'Swiffer WetJet Spray, Mop Floor Cleaner Starter Kit (Packaging May Vary)'
>>> item.url
'http://www.amazon.com/dp/B00008MOQA'
>>> item.reviews_url
'http://www.amazon.com/product-reviews/B00008MOQA/ref=cm_cr_pr_top_sort_recent?&sortBy=bySubmissionDateDescending'
>>> item_reviews = amzn.lookup(ItemId='B00008MOQA')
>>> item_reviews = amzn.reviews(URL=item.reviews_url)
>>> item_reviews.asin
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pbeeson/.virtualenvs/custom_data_pulls/lib/python2.7/site-packages/amazon_scraper/reviews.py", line 41, in asin
    return unicode(span['name'])
TypeError: 'NoneType' object has no attribute '__getitem__'
>>> item_reviews.ids
[]
I can't get item_reviews.id or item_reviews.url without getting the same ASIN error. The ASIN for this product should be B00008MOQA.
Ok I can reproduce that. I'll get to work on it now.
OK, the logic was fine; it's just that the Python HTML parsers are inconsistent with one another.
Using BeautifulSoup4 with 'html.parser' errored only on that one page (in my tests, anyway).
I've changed BeautifulSoup to use the default parser for reviews and left the others as they were. It still passes all tests.
I've also added some fixes. I'll push this release up in a tick.
I've pushed version 0.1.17 to PyPI.
This should resolve the issue; if not, please re-open.
Great! I'll check it out this morning.
Just confirmed things are working as expected with the reviews I'm seeking to pull. Thanks!
Can anyone help a noob (me) set up a scraping program for warehousedeals.com?
You can use libraries like https://github.com/scrapy/scrapy to trawl sites.
It has some easy-to-use scraping routines based on XPath, which are nice and concise, but I found the documentation lacking in the areas I needed (controlling the way it spiders websites).
Learning BeautifulSoup4 is a good start.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
But it is heavily reliant on which parser you choose ('html.parser', 'html5lib', 'lxml', etc.), which is what caused this issue: the value was in the HTML, but the HTML parsed differently for this product because the Python HTML parsers are all deficient in different ways.
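To see how much the parser choice matters, a minimal sketch along these lines parses the same malformed snippet with each parser that happens to be installed. The sample markup and the parse_with_each helper are made up for illustration and are not part of amazon_scraper; lxml and html5lib are optional installs, so missing ones are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed HTML (hypothetical sample, not a real Amazon page).
BROKEN_HTML = "<div><span name='B00008MOQA'>review</p></div>"


def parse_with_each(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: serialized tree} for every installed parser."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This optional parser is not installed; skip it.
            continue
        results[name] = str(soup)
    return results


if __name__ == "__main__":
    for name, tree in parse_with_each(BROKEN_HTML).items():
        print(name, "->", tree)
```

Comparing the printed trees side by side is a quick way to check whether a missing element is a logic bug or a parser difference.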
XPath is also very handy because it's concise, and having the path not exist results in None, no matter where in the query it failed.
But XPath is quite hard to understand; it's like regexp or Perl. You won't remember what a line of code does a month later.
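The "missing path gives None" behaviour can be sketched with the standard library's xml.etree.ElementTree, which supports a limited XPath subset. This is only an illustration under assumptions: ElementTree needs well-formed markup, so it is not a substitute for a real HTML parser, and the sample markup and the first_text helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup (hypothetical, not a real Amazon page).
PAGE = (
    "<html><body>"
    "<div class='main'><span name='B00008MOQA'>Great mop</span></div>"
    "</body></html>"
)


def first_text(root, path):
    """Return the text of the first node matching path, or None if no match."""
    node = root.find(path)
    return None if node is None else node.text


if __name__ == "__main__":
    root = ET.fromstring(PAGE)
    # The path exists, so the span text is returned.
    print(first_text(root, "./body/div[@class='main']/span"))
    # The path fails at the div step: find() gives None, no exception.
    print(first_text(root, "./body/div[@class='missing']/span"))
    # The path fails at the very first step: still None.
    print(first_text(root, "./nosuch/div/span"))
```

One check on the final result covers a failure at any step of the query.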
BeautifulSoup4, on the other hand, is long-winded and prone to errors, e.g.
tag = soup.find('div', class_='main')
span = tag.find('span')
If the div isn't found, tag is None and the span = line will throw an exception.
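A guarded version of that lookup might look like the sketch below. The find_span_name helper and the sample markup are hypothetical, and 'html.parser' is used only because it ships with Python; it is not a claim about how amazon_scraper does it.

```python
from bs4 import BeautifulSoup


def find_span_name(html):
    """Return the 'name' attribute of the first span in div.main, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("div", class_="main")
    if tag is None:
        # No matching div: stop here instead of calling .find() on None.
        return None
    span = tag.find("span")
    if span is None:
        return None
    # .get() returns None rather than raising if the attribute is absent.
    return span.get("name")


if __name__ == "__main__":
    good = "<div class='main'><span name='B00008MOQA'>x</span></div>"
    bad = "<div class='other'></div>"
    print(find_span_name(good))
    print(find_span_name(bad))
```

Every step needs its own None check, which is the long-winded part.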
But there are no good XPath libs for Python (lxml has one, but I don't like lxml; using an XML parser for HTML is a bad idea).
In short, try scrapy.
If that fails, try BeautifulSoup4 and lxml's XPath.