
yashtotale / goodreads-user-scraper

PyPI Package (25,000+ downloads): scrapes Goodreads user data (profile, bookshelves, books, authors, etc.)

Home Page: https://tinyurl.com/goodreads-user-scraper

License: MIT License

Languages: Python 96.60%, Shell 3.40%

Topics: python, pypi-package, web-scraper, goodreads, beautifulsoup, cli

goodreads-user-scraper's Introduction

Goodreads User Scraper

Scrape Goodreads User Data: Profile, Book Shelves, Books, Authors

Usage

Using pip:

pip install goodreads-user-scraper
goodreads-user-scraper --user_id <your id> --output_dir goodreads-data

Using pipx:

pipx run goodreads-user-scraper --user_id <your id> --output_dir goodreads-data

Arguments

--user_id

  • Description: The user whose data should be scraped. Find your user id using these directions.
  • Required: Yes

--output_dir

  • Description: The directory where all scraped data will be output.
  • Required: No
  • Default: goodreads-data

--skip_user_info

  • Description: Whether the script should skip scraping user information.
  • Required: No
  • Default: False

--skip_shelves

  • Description: Whether the script should skip scraping shelves.
  • Required: No
  • Default: False

--skip_authors

  • Description: Whether the script should skip scraping authors.
  • Required: No
  • Default: False
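
The arguments above could be modeled with Python's standard argparse module. This is only a minimal sketch and not the package's actual parser: the build_parser name is hypothetical, and treating the three skip options as bare switches (rather than options that take a value) is an assumption based on their documented default of False.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(prog="goodreads-user-scraper")
    parser.add_argument(
        "--user_id",
        required=True,
        help="The user whose data should be scraped.",
    )
    parser.add_argument(
        "--output_dir",
        default="goodreads-data",
        help="The directory where all scraped data will be output.",
    )
    # Assumption: each skip option is a bare switch that defaults to False.
    for flag, target in (
        ("--skip_user_info", "user information"),
        ("--skip_shelves", "shelves"),
        ("--skip_authors", "authors"),
    ):
        parser.add_argument(flag, action="store_true", help=f"Skip scraping {target}.")
    return parser


if __name__ == "__main__":
    # "12345" is a placeholder user id, not a real account.
    args = build_parser().parse_args(["--user_id", "12345", "--skip_authors"])
    print(args.output_dir, args.skip_shelves, args.skip_authors)
```

With only --user_id supplied, every other option falls back to the default listed above.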

Troubleshooting

Ensure that your profile is viewable by anyone:

  1. Navigate to the Goodreads Account Settings page
  2. Click on the Settings tab
  3. In the Privacy section, under the Who Can View My Profile question, select "anyone"

Development

  1. Clone the GitHub repository

    git clone https://github.com/YashTotale/goodreads-user-scraper.git

  2. Run the install script

    sh scripts/install.sh

  3. Make changes

  4. Run the test script

    sh scripts/test.sh

Publishing

  1. Create .env

    TWINE_USERNAME=<foo>
    TWINE_PASSWORD=<bar>
    
  2. Run the publish script

    sh scripts/publish.sh <patch|minor|major>

goodreads-user-scraper's Issues

Crashes when id="description" is not found while scraping the 'read' shelf

Describe the bug
Scraping fails when a book description is not found. Some book pages do not have a <div id="description"> element. This causes the scrape_book method to raise an error, which crashes goodreads-user-scraper.

To Reproduce
Steps to reproduce the behavior:

  1. Run goodreads-user-scraper --user_id 149832357
  2. Once scraping reaches the 'read' shelf, it crashes on "Le Reveil" (this also occurs on "Bounty In Brimstone")

Error in terminal:

Traceback (most recent call last):
  File "/Users/nick/Documents/_environments/goodreads/bin/goodreads-user-scraper", line 33, in <module>
    sys.exit(load_entry_point('goodreads-user-scraper', 'console_scripts', 'goodreads-user-scraper')())
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/__main__.py", line 28, in main
    scrape_user(args)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/__main__.py", line 10, in scrape_user
    shelves.get_all_shelves(args)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/shelves.py", line 131, in get_all_shelves
    get_shelf(args, shelf)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/shelves.py", line 95, in get_shelf
    book = books.scrape_book(book_id, args)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/books.py", line 109, in scrape_book
    "book_description": get_description(soup),
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/books.py", line 85, in get_description
    book_description = soup.find("div", {"id": "description"}).findAll("span")[-1].text
AttributeError: 'NoneType' object has no attribute 'findAll'

Expected behavior
I think a good enough solution would be to return "No description found" for book description and continue scraping.
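
A minimal sketch of that fallback is shown below. It is not the maintainer's fix: it assumes soup is the BeautifulSoup object that scrape_book already passes to get_description (per the traceback), and the default parameter is a hypothetical addition. The function only relies on the find and find_all methods that BeautifulSoup objects provide.

```python
def get_description(soup, default="No description found"):
    """Return the book description, or `default` when the page has none.

    `soup` is assumed to be a BeautifulSoup object for a book page.
    """
    container = soup.find("div", {"id": "description"})
    if container is None:
        # The page has no <div id="description">: fall back instead of crashing.
        return default

    spans = container.find_all("span")
    if not spans:
        # The div exists but holds no <span>: indexing [-1] would raise IndexError.
        return default

    # As in the original line, the last <span> is taken as the description text.
    return spans[-1].text
```

Returning a placeholder string keeps the output shape unchanged for downstream consumers; returning None instead would be an equally reasonable choice if an explicit "missing" value is preferred.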

Crashing when no cover image is found in book

Describe the bug
Scraping fails when a book cover image is not found. Some book pages do not have an <img id="coverImage"> element. This causes the scrape_book method to raise an error, which crashes goodreads-user-scraper.

This happens with the test script. I have not encountered the problem in any of my own scraping runs.

To Reproduce
Steps to reproduce the behavior:

  1. Clone the GitHub repository
    git clone https://github.com/YashTotale/goodreads-user-scraper.git

  2. Run the install script
    sh scripts/install.sh

  3. Run the test script
    sh scripts/test.sh

Error in terminal:

$ sh scripts/test.sh
Scraping user...
👤 Scraped user

Scraping 'read' shelf...
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/__main__.py", line 32, in <module>
    main()
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/__main__.py", line 28, in main
    scrape_user(args)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/__main__.py", line 10, in scrape_user
    shelves.get_all_shelves(args)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/shelves.py", line 131, in get_all_shelves
    get_shelf(args, shelf)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/shelves.py", line 95, in get_shelf
    book = books.scrape_book(book_id, args)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/books.py", line 111, in scrape_book
    "book_image": soup.find("img", {"id": "coverImage"}).attrs.get("src"),
AttributeError: 'NoneType' object has no attribute 'attrs'

Expected behavior
I think a good enough solution would be to return "No image found" for book cover image and continue scraping.
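
One way to sketch that fallback is to move the inline lookup into a small helper. This is not the maintainer's fix: the get_cover_image name and the default parameter are hypothetical, and soup is assumed to be the BeautifulSoup object already used inside scrape_book.

```python
def get_cover_image(soup, default="No image found"):
    """Return the cover image URL, or `default` when the page has none.

    `soup` is assumed to be a BeautifulSoup object for a book page.
    """
    image = soup.find("img", {"id": "coverImage"})
    if image is None:
        # The page has no <img id="coverImage">: fall back instead of crashing.
        return default

    # Assumption: an <img> without a src attribute is also treated as missing.
    return image.attrs.get("src", default)
```

Inside scrape_book, the "book_image" entry would then call get_cover_image(soup) instead of chaining .attrs directly on the result of soup.find.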
