Comments (11)

adamlwgriffiths commented on May 25, 2024

Could you provide the review ID (and preferably the ASIN too, if you have the product object)?

print rs.url
print rs.id

I'll take a look after I catch some sleep =)

from amazon_scraper.

adamlwgriffiths commented on May 25, 2024

I had a quick look; the asin property works for both the Reviews and Review objects, albeit only for the products I'm testing against.

Product page HTML changes drastically (and continually evolves), but I've never seen any variation in the review HTML format, so I'm curious to see which product you're getting reviews from.

patrickbeeson commented on May 25, 2024

Here's a log from my interpreter (I get the same result from any product I try):

>>> amzn = AmazonScraper('access', 'secret', 'associate_tag')
>>> item = amzn.lookup(ItemId='B00008MOQA')
>>> item.title
'Swiffer WetJet Spray, Mop Floor Cleaner Starter Kit (Packaging May Vary)'
>>> item.url
'http://www.amazon.com/dp/B00008MOQA'
>>> item.reviews_url
'http://www.amazon.com/product-reviews/B00008MOQA/ref=cm_cr_pr_top_sort_recent?&sortBy=bySubmissionDateDescending'
>>> item_reviews = amzn.lookup(ItemId='B00008MOQA')
>>> item_reviews = amzn.reviews(URL=item.reviews_url)
>>> item_reviews.asin
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pbeeson/.virtualenvs/custom_data_pulls/lib/python2.7/site-packages/amazon_scraper/reviews.py", line 41, in asin
    return unicode(span['name'])
TypeError: 'NoneType' object has no attribute '__getitem__'
>>> item_reviews.ids
[]
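The traceback above is a symptom of subscripting the result of a failed lookup: BeautifulSoup's find() returns None when no matching tag exists, and None is not subscriptable. A minimal, library-free reproduction of that failure mode and a guarded alternative:

```python
# find() returns None when no matching tag exists; subscripting None
# raises the TypeError shown in the traceback above.
span = None  # stand-in for a failed soup.find('span', ...) result

try:
    asin = span['name']
except TypeError:
    asin = None  # a guarded lookup degrades gracefully instead of crashing

print(asin)  # None
```

The guard is the same shape as a fix inside the library would be: check the find() result before indexing into it.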

patrickbeeson commented on May 25, 2024

I can't get item_reviews.id or item_reviews.url without getting the same ASIN error. The ASIN for this product should be B00008MOQA.

adamlwgriffiths commented on May 25, 2024

OK, I can reproduce that. I'll get to work on it now.

adamlwgriffiths commented on May 25, 2024

OK, the logic was fine; it's just that the Python HTML parsers are inconsistent with one another.
Using BeautifulSoup 4 with 'html.parser' errored only on that one page (in my tests, anyway).
I've changed BeautifulSoup to use the default parser for reviews and left the others as they were. All tests still pass.
I've also added some fixes. I'll push this release up in a tick.
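The fix described above amounts to choosing which parser BeautifulSoup uses. A minimal sketch (assuming bs4 is installed): the parser is selected per BeautifulSoup() call, and omitting the argument lets bs4 pick the best one installed (lxml, then html5lib, then the stdlib 'html.parser').

```python
from bs4 import BeautifulSoup

html = "<div class='main'><span name='B00008MOQA'></span></div>"

# The parser is chosen per BeautifulSoup() call: 'html.parser' is the
# pure-Python stdlib parser. Different parsers can recover malformed
# markup differently, which is why switching parsers fixed this page.
soup = BeautifulSoup(html, 'html.parser')

span = soup.find('span')
print(span['name'])  # B00008MOQA
```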

adamlwgriffiths commented on May 25, 2024

I've pushed version 0.1.17 to pypi.
This should resolve the issue; if not, please re-open.

patrickbeeson commented on May 25, 2024

Great! I'll check it out this morning.

patrickbeeson commented on May 25, 2024

Just confirmed things are working as expected with the reviews I'm seeking to pull. Thanks!

igor555 commented on May 25, 2024

Can anyone help a noob (me) set up a scraping program for warehousedeals.com?

adamlwgriffiths commented on May 25, 2024

You can use libraries like https://github.com/scrapy/scrapy to crawl sites.
It has some easy-to-use scraping routines based on XPath, which are nice and concise, but I found the documentation lacking in the areas I needed (controlling the way it spiders websites).

Learning BeautifulSoup 4 is a good start.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
But it is heavily reliant on which parser you choose ('html.parser', 'html5lib', 'lxml', etc.), which is what caused this issue: the value was in the HTML, but the HTML parsed differently for this product because the Python HTML parsers are all deficient in different ways.

XPath is also very handy because it's concise, and a path that doesn't exist simply yields no result, no matter where in the query it failed.
But XPath is quite hard to read; it's like regex or Perl, and you won't remember what a line of code does a month later.
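That "missing path yields no result instead of an exception" behaviour can be seen even with the stdlib's limited XPath support in xml.etree (a sketch; lxml's XPath engine is far more complete):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<div><span>text</span></div>")

# An existing path returns the element; a missing path returns None
# (find) or an empty list (findall) rather than raising, no matter how
# deep in the query the match failed.
print(root.find("./span").text)    # text
print(root.find("./p/em/strong"))  # None
print(root.findall("./table/tr"))  # []
```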

BeautifulSoup 4, on the other hand, is long-winded and prone to errors, e.g.

tag = soup.find('div', class_='main')
span = tag.find('span')

If the div doesn't exist, find returns None and the span = line will throw an AttributeError.
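A guarded version of that two-line pattern avoids the crash (assuming bs4 is installed; the markup here is just illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>no main div here</p>", "html.parser")

# find() returns None when the div is missing, so chaining another
# find() onto the result would raise AttributeError.
tag = soup.find('div', class_='main')

# Guarding each step degrades to None instead of crashing:
span = tag.find('span') if tag is not None else None
print(span)  # None
```

This is the verbosity trade-off: you get precise control, but every intermediate lookup needs its own None check, unlike an XPath query that fails quietly as a whole.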

But there are no good XPath libraries for Python (lxml has one, but I don't like lxml; using an XML parser for HTML is a bad idea).

In short, try Scrapy.
If that fails, try BeautifulSoup 4, or lxml's XPath.
