yashtotale / goodreads-user-scraper

PyPI Package (25,000+ downloads): scrapes Goodreads user data (profile, bookshelves, books, authors, etc.)

Home Page: https://tinyurl.com/goodreads-user-scraper

License: MIT License

Python 96.60% Shell 3.40%

python pypi-package web-scraper goodreads beautifulsoup cli

goodreads-user-scraper's Introduction

Goodreads User Scraper

Scrape Goodreads User Data: Profile, Book Shelves, Books, Authors

Usage
Arguments
Troubleshooting
Development
Publishing

Usage

Using pip:

pip install goodreads-user-scraper
goodreads-user-scraper --user_id <your id> --output_dir goodreads-data

Using pipx:

pipx run goodreads-user-scraper --user_id <your id> --output_dir goodreads-data

Arguments

`--user_id`

Description: The ID of the Goodreads user whose data should be scraped. Find your user ID using these directions.
Required: Yes

`--output_dir`

Description: The directory where all scraped data will be written.
Required: No
Default: goodreads-data

`--skip_user_info`

Description: Whether the script should skip scraping user information.
Required: No
Default: False

`--skip_shelves`

Description: Whether the script should skip scraping shelves.
Required: No
Default: False

`--skip_authors`

Description: Whether the script should skip scraping authors.
Required: No
Default: False
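
The arguments above map naturally onto Python's argparse module. The following is a minimal sketch of how such a parser could be declared, not the package's actual source: the function name `build_parser` is hypothetical, and treating the skip flags as `store_true` switches is an assumption based on their documented default of False.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(prog="goodreads-user-scraper")
    # Required: the Goodreads user whose data should be scraped.
    parser.add_argument("--user_id", required=True)
    # Optional: where scraped data is written; defaults to "goodreads-data".
    parser.add_argument("--output_dir", default="goodreads-data")
    # Optional skip flags; assumed here to be boolean switches defaulting to False.
    for flag in ("--skip_user_info", "--skip_shelves", "--skip_authors"):
        parser.add_argument(flag, action="store_true")
    return parser


if __name__ == "__main__":
    # Example invocation with a placeholder user id.
    args = build_parser().parse_args(["--user_id", "12345", "--skip_authors"])
    print(args.output_dir, args.skip_shelves, args.skip_authors)
```

Under this assumption, passing a skip flag on its own (for example `--skip_authors`) turns it on, and leaving it out keeps the default.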

Troubleshooting

Ensure that your profile is viewable by anyone:

Navigate to the Goodreads Account Settings page
Click on the Settings tab
In the Privacy section, under the Who Can View My Profile question, select "anyone"

Development

Clone the GitHub repository

git clone https://github.com/YashTotale/goodreads-user-scraper.git

Run the install script
```
sh scripts/install.sh
```
Make changes
Run the test script
```
sh scripts/test.sh
```

Publishing

Create .env

TWINE_USERNAME=<foo>
TWINE_PASSWORD=<bar>

Run the publish script

sh scripts/publish.sh <patch|minor|major>

goodreads-user-scraper's Issues

Crashes on the 'read' shelf when id="description" is not found

Describe the bug
Scraping fails when a book description is not found. Some books do not have a <div id="description"> element. This causes the scrape_book method to raise an error, which crashes goodreads-user-scraper.

To Reproduce
Steps to reproduce the behavior:

Run goodreads-user-scraper --user_id 149832357
Once scraping reaches the 'read' shelf, it crashes on "Le Reveil" (this also occurs on "Bounty In Brimstone")

Error in terminal:

Traceback (most recent call last):
  File "/Users/nick/Documents/_environments/goodreads/bin/goodreads-user-scraper", line 33, in <module>
    sys.exit(load_entry_point('goodreads-user-scraper', 'console_scripts', 'goodreads-user-scraper')())
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/__main__.py", line 28, in main
    scrape_user(args)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/__main__.py", line 10, in scrape_user
    shelves.get_all_shelves(args)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/shelves.py", line 131, in get_all_shelves
    get_shelf(args, shelf)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/shelves.py", line 95, in get_shelf
    book = books.scrape_book(book_id, args)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/books.py", line 109, in scrape_book
    "book_description": get_description(soup),
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/books.py", line 85, in get_description
    book_description = soup.find("div", {"id": "description"}).findAll("span")[-1].text
AttributeError: 'NoneType' object has no attribute 'findAll'

Expected behavior
I think a good enough solution would be to return "No description found" for book description and continue scraping.
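
A minimal sketch of that fallback, assuming `soup` is the BeautifulSoup object passed to `get_description` in the traceback above. The `default` parameter is a hypothetical addition, and this is one possible guard rather than the project's actual fix.

```python
def get_description(soup, default="No description found"):
    """Return the book description, or `default` when the page has none.

    `soup` is expected to behave like a bs4.BeautifulSoup object.
    """
    # find() returns None when no <div id="description"> exists on the page.
    container = soup.find("div", {"id": "description"})
    if container is None:
        return default
    # The original code takes the last <span>; guard against there being none.
    spans = container.find_all("span")
    if not spans:
        return default
    return spans[-1].text
```

Checking for None before chaining further calls avoids the AttributeError, so the scraper can continue with the remaining books on the shelf.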

Crashing when no cover image is found in book

Describe the bug
Scraping fails when a book cover image is not found. Some books do not have an <img id="coverImage"> element. This causes the scrape_book method to raise an error, which crashes goodreads-user-scraper.

This happens with the test script. I have not encountered the problem in any of my own scrapes.

To Reproduce
Steps to reproduce the behavior:

Clone the GitHub repository
git clone https://github.com/YashTotale/goodreads-user-scraper.git
Run the install script
sh scripts/install.sh
Run the test script
sh scripts/test.sh

error in terminal:

~/Documents/goodreads-user-scraper (main*) » sh scripts/test.sh
Scraping user...
👤 Scraped user

Scraping 'read' shelf...
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/__main__.py", line 32, in <module>
    main()
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/__main__.py", line 28, in main
    scrape_user(args)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/__main__.py", line 10, in scrape_user
    shelves.get_all_shelves(args)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/shelves.py", line 131, in get_all_shelves
    get_shelf(args, shelf)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/shelves.py", line 95, in get_shelf
    book = books.scrape_book(book_id, args)
  File "/Users/nick/Documents/goodreads-user-scraper/scraper/books.py", line 111, in scrape_book
    "book_image": soup.find("img", {"id": "coverImage"}).attrs.get("src"),
AttributeError: 'NoneType' object has no attribute 'attrs'

Expected behavior
I think a good enough solution would be to return "No image found" for book cover image and continue scraping.
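
A minimal sketch of that fallback, assuming `soup` is the BeautifulSoup object used inside `scrape_book` in the traceback above. The helper name `get_cover_image` and its `default` parameter are hypothetical, and this is one possible guard rather than the project's actual fix.

```python
def get_cover_image(soup, default="No image found"):
    """Return the cover image URL, or `default` when it cannot be found.

    `soup` is expected to behave like a bs4.BeautifulSoup object.
    """
    # find() returns None when no <img id="coverImage"> exists on the page.
    image = soup.find("img", {"id": "coverImage"})
    if image is None:
        return default
    # An <img> without a src attribute also falls back to the default.
    return image.attrs.get("src", default)
```

In `scrape_book`, the `"book_image"` entry could then call this helper instead of chaining `.attrs` directly onto the result of `soup.find(...)`.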

output_dir - invalid syntax error message

Describe the bug
An "invalid syntax" message appears when running with the --output_dir argument.

Using:
Visual Studio Code
