Comments (11)

adamlwgriffiths commented on May 25, 2024

Could you provide the review ID (and preferably the ASIN too, if you have the product object)?

print rs.url
print rs.id

I'll take a look after I catch some sleep =)

from amazon_scraper.

adamlwgriffiths commented on May 25, 2024

I had a quick look; the asin property works for both the Reviews and Review objects, albeit only for the products I'm testing against.

Product page HTML changes drastically (and continually evolves), but I've never seen any variation in the review HTML format, so I'm curious to see which product you're getting reviews from.

patrickbeeson commented on May 25, 2024

Here's a log from my interpreter (I get the same result from any product I try):

>>> amzn = AmazonScraper('access', 'secret', 'associate_tag')
>>> item = amzn.lookup(ItemId='B00008MOQA')
>>> item.title
'Swiffer WetJet Spray, Mop Floor Cleaner Starter Kit (Packaging May Vary)'
>>> item.url
'http://www.amazon.com/dp/B00008MOQA'
>>> item.reviews_url
'http://www.amazon.com/product-reviews/B00008MOQA/ref=cm_cr_pr_top_sort_recent?&sortBy=bySubmissionDateDescending'
>>> item_reviews = amzn.lookup(ItemId='B00008MOQA')
>>> item_reviews = amzn.reviews(URL=item.reviews_url)
>>> item_reviews.asin
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pbeeson/.virtualenvs/custom_data_pulls/lib/python2.7/site-packages/amazon_scraper/reviews.py", line 41, in asin
    return unicode(span['name'])
TypeError: 'NoneType' object has no attribute '__getitem__'
>>> item_reviews.ids
[]
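The traceback above is a symptom of subscripting the result of a failed lookup: BeautifulSoup's find() returns None when no matching tag exists, and None is not subscriptable. A minimal, library-free reproduction of that failure mode and a guarded alternative:

```python
# find() returns None when no matching tag exists; subscripting None
# raises the TypeError shown in the traceback above.
span = None  # stand-in for a failed soup.find('span', ...) result

try:
    asin = span['name']
except TypeError:
    asin = None  # a guarded lookup degrades gracefully instead of crashing

print(asin)  # None
```

The guard is the same shape as a fix inside the library would be: check the find() result before indexing into it.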

patrickbeeson commented on May 25, 2024

I can't get item_reviews.id or item_reviews.url without getting the same ASIN error. The ASIN for this product should be B00008MOQA.

adamlwgriffiths commented on May 25, 2024

OK, I can reproduce that. I'll get to work on it now.

adamlwgriffiths commented on May 25, 2024

OK, the logic was fine; it's just that the Python HTML parsers are inconsistent with one another.
Using BeautifulSoup 4 with 'html.parser' errored only on that one page (in my tests, anyway).
I've changed BeautifulSoup to use the default parser for reviews and left the others as they were. All tests still pass.
I've also added some fixes. I'll push this release up in a tick.
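The fix described above amounts to choosing which parser BeautifulSoup uses. A minimal sketch (assuming bs4 is installed): the parser is selected per BeautifulSoup() call, and omitting the argument lets bs4 pick the best one installed (lxml, then html5lib, then the stdlib 'html.parser').

```python
from bs4 import BeautifulSoup

html = "<div class='main'><span name='B00008MOQA'></span></div>"

# The parser is chosen per BeautifulSoup() call: 'html.parser' is the
# pure-Python stdlib parser. Different parsers can recover malformed
# markup differently, which is why switching parsers fixed this page.
soup = BeautifulSoup(html, 'html.parser')

span = soup.find('span')
print(span['name'])  # B00008MOQA
```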

adamlwgriffiths commented on May 25, 2024

I've pushed version 0.1.17 to pypi.
This should resolve the issue; if not, please re-open.

patrickbeeson commented on May 25, 2024

Great! I'll check it out this morning.

patrickbeeson commented on May 25, 2024

Just confirmed things are working as expected with the reviews I'm seeking to pull. Thanks!

igor555 commented on May 25, 2024

Can anyone help a noob (me) set up a scraping program for warehousedeals.com?

adamlwgriffiths commented on May 25, 2024

You can use libraries like https://github.com/scrapy/scrapy to crawl sites.
It has some easy-to-use scraping routines based on XPath, which are nice and concise, but I found the documentation lacking in the areas I needed (controlling the way it spiders websites).

Learning BeautifulSoup 4 is a good start.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
But it is heavily reliant on which parser you choose ('html.parser', 'html5lib', 'lxml', etc.), which is what caused this issue: the value was in the HTML, but the HTML parsed differently for this product because the Python HTML parsers are all deficient in different ways.

XPath is also very handy because it's concise, and a path that doesn't exist simply yields no result, no matter where in the query it failed.
But XPath is quite hard to read; it's like regex or Perl, and you won't remember what a line of code does a month later.
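That "missing path yields no result instead of an exception" behaviour can be seen even with the stdlib's limited XPath support in xml.etree (a sketch; lxml's XPath engine is far more complete):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<div><span>text</span></div>")

# An existing path returns the element; a missing path returns None
# (find) or an empty list (findall) rather than raising, no matter how
# deep in the query the match failed.
print(root.find("./span").text)    # text
print(root.find("./p/em/strong"))  # None
print(root.findall("./table/tr"))  # []
```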

BeautifulSoup 4, on the other hand, is long-winded and prone to errors, e.g.

tag = soup.find('div', class_='main')
span = tag.find('span')

If the div doesn't exist, find returns None and the span = line will throw an AttributeError.
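A guarded version of that two-line pattern avoids the crash (assuming bs4 is installed; the markup here is just illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>no main div here</p>", "html.parser")

# find() returns None when the div is missing, so chaining another
# find() onto the result would raise AttributeError.
tag = soup.find('div', class_='main')

# Guarding each step degrades to None instead of crashing:
span = tag.find('span') if tag is not None else None
print(span)  # None
```

This is the verbosity trade-off: you get precise control, but every intermediate lookup needs its own None check, unlike an XPath query that fails quietly as a whole.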

But there are no good XPath libraries for Python (lxml has one, but I don't like lxml; using an XML parser for HTML is a bad idea).

In short, try Scrapy.
If that fails, try BeautifulSoup 4, or lxml's XPath.
