aliparlakci / bulk-downloader-for-reddit

Downloads and archives content from reddit

Home Page: https://pypi.org/project/bdfr

License: GNU General Public License v3.0

Python 96.11% PowerShell 1.57% Shell 2.29% Ruby 0.04%
reddit imgur gfycat scraper python downloader archive

bulk-downloader-for-reddit's Introduction

Bulk Downloader for Reddit

This is a tool to download submissions or submission data from Reddit. It can be used to archive data or even crawl Reddit to gather research data. The BDFR is flexible and can be used in scripts if needed, through an extensive command-line interface. A list of currently supported sources is included below.

If you wish to open an issue, please read the guide on opening issues to ensure that your issue is clear and contains everything it needs to for the developers to investigate.

Included in this README are a few example Bash tricks to get certain behaviour. For that, see Common Command Tricks.

Installation

Bulk Downloader for Reddit needs Python version 3.9 or above. Please update Python before installation to meet the requirement.

Then, you can install it via pip with:

python3 -m pip install bdfr --upgrade

or via pipx with:

python3 -m pipx install bdfr

To update the BDFR, run the above pip command again, or run pipx upgrade bdfr for pipx installations.

To check your version of BDFR, run bdfr --version

To install shell completions, run bdfr completions

AUR Package

If on Arch Linux or derivative operating systems such as Manjaro, the BDFR can be installed through the AUR.

Source code

If you want to use the source code or make contributions, refer to CONTRIBUTING

Usage

The BDFR works by taking submissions from a variety of "sources" from Reddit and then parsing them to download. These sources might be a subreddit, multireddit, a user list, or individual links. These sources are combined and downloaded to disk, according to a naming and organisational scheme defined by the user.

There are three modes to the BDFR: download, archive, and clone. Each one has a command that performs similar but distinct functions. The download command will download the resource linked in the Reddit submission, such as images, videos, etc. The archive command will download the submission data itself and store it, such as the submission details, upvotes, text, and statistics, as well as all the comments on that submission. These can then be saved in a data markup language, such as JSON, XML, or YAML. Lastly, the clone command will perform both functions of the previous commands at once and is more efficient than running those commands sequentially.

Note that the clone command is not a true, faithful clone of Reddit. It simply retrieves much of the raw data that Reddit provides. To get a true clone of Reddit, another tool such as HTTrack should be used.

After installation, run the program from any directory as shown below:

bdfr download
bdfr archive
bdfr clone

On their own, these commands are not enough; you should chain the parameters listed in Options according to your use case. Don't forget that some parameters can be provided multiple times. Some quick reference commands are:

bdfr download ./path/to/output --subreddit Python -L 10
bdfr download ./path/to/output --user reddituser --submitted -L 100
bdfr download ./path/to/output --user me --saved --authenticate -L 25 --file-scheme '{POSTID}'
bdfr download ./path/to/output --subreddit 'Python, all, mindustry' -L 10 --make-hard-links
bdfr archive ./path/to/output --user reddituser --submitted --all-comments --comment-context
bdfr archive ./path/to/output --subreddit all --format yaml -L 500 --folder-scheme ''

Alternatively, you can pass options through a YAML file.

bdfr download ./path/to/output --opts my_opts.yaml

For example, running it with the following file

skip: [mp4, avi]
file_scheme: "{UPVOTES}_{REDDITOR}_{POSTID}_{DATE}"
limit: 10
sort: top
subreddit:
  - EarthPorn
  - CityPorn

would be equivalent to (note that in the YAML file the key is file_scheme rather than file-scheme):

bdfr download ./path/to/output --skip mp4 --skip avi --file-scheme "{UPVOTES}_{REDDITOR}_{POSTID}_{DATE}" -L 10 -S top --subreddit EarthPorn --subreddit CityPorn

Any option that can be specified multiple times should be formatted like subreddit is above.

If the same option is specified both in the YAML file and as a command-line argument, the command-line argument takes priority.
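
For instance, with the my_opts.yaml file above, the command-line limit in the run below overrides the limit: 10 set in the file, so only 5 submissions are retrieved per subreddit:

bdfr download ./path/to/output --opts my_opts.yaml -L 5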

Options

The following options are common between both the archive and download commands of the BDFR. A combined example command follows the list.

  • directory
    • This is the directory to which the BDFR will download and place all files
  • --authenticate
    • This flag will make the BDFR attempt to use an authenticated Reddit session
    • See Authentication for more details
  • --config
    • If the path to a configuration file is supplied with this option, the BDFR will use the specified config
    • See Configuration Files for more details
  • --opts
    • Load options from a YAML file.
    • Has higher priority than the global config file but lower than command-line arguments.
    • See opts_example.yaml for an example file.
  • --disable-module
    • Can be specified multiple times
    • Disables certain modules from being used
    • See Disabling Modules for more information and a list of module names
  • --filename-restriction-scheme
    • Can be: windows, linux
    • Turns off the OS detection and specifies which system to use when making filenames
    • See Filesystem Restrictions
  • --ignore-user
    • This will add a user to ignore
    • Can be specified multiple times
  • --include-id-file
    • This will add any submission with the IDs in the files provided
    • Can be specified multiple times
    • Format is one ID per line
  • --log
    • This allows one to specify the location of the logfile
    • This must be done when running multiple instances of the BDFR, see Multiple Instances below
  • --saved
    • This option will make the BDFR use the supplied user's saved posts list as a download source
    • This requires an authenticated Reddit instance, using the --authenticate flag, as well as --user set to me
  • --search
    • This will apply the input search term to specific lists when scraping submissions
    • A search term can only be applied when using the --subreddit and --multireddit flags
  • --submitted
    • This will use a user's submissions as a source
    • A user must be specified with --user
  • --upvoted
    • This will use a user's upvoted posts as a source of posts to scrape
    • This requires an authenticated Reddit instance, using the --authenticate flag, as well as --user set to me
  • -L, --limit
    • This is the limit on the number of submissions retrieved
    • Default is max possible
    • Note that this limit applies to each source individually e.g. if a --limit of 10 and three subreddits are provided, then 30 total submissions will be scraped
    • If it is not supplied, then the BDFR will default to the maximum allowed by Reddit, roughly 1000 posts. We cannot bypass this.
  • -S, --sort
    • This is the sort type for each applicable submission source supplied to the BDFR
    • This option does not apply to upvoted or saved posts when scraping from these sources
    • The following options are available:
      • controversial
      • hot (default)
      • new
      • relevance (only available when using --search)
      • rising
      • top
  • -l, --link
    • This is a direct link to a submission to download, either as a URL or an ID
    • Can be specified multiple times
  • -m, --multireddit
    • This is the name of a multireddit to add as a source
    • Can be specified multiple times
      • This can be done by using -m multiple times
      • Multireddits can also be supplied as a CSV list, e.g. -m 'chess, favourites'
    • The specified multireddits must all belong to the user specified with the --user option
  • -s, --subreddit
    • This adds a subreddit as a source
    • Can be used multiple times
      • This can be done by using -s multiple times
      • Subreddits can also be supplied as a CSV list, e.g. -s 'all, python, mindustry'
  • -t, --time
    • This is the time filter that will be applied to all applicable sources
    • This option does not apply to upvoted or saved posts when scraping from these sources
    • This option only applies if sorting by top or controversial. See --sort for more detail.
    • The following options are available:
      • all (default)
      • hour
      • day
      • week
      • month
      • year
  • --time-format
    • This specifies the format of the datetime string that replaces {DATE} in file and folder naming schemes
    • See Time Formatting Customisation for more details, and for the formatting scheme
  • -u, --user
    • This specifies the user to scrape in concert with other options
    • When using --authenticate, --user me can be used to refer to the authenticated user
    • Can be specified multiple times for multiple users
      • If downloading a multireddit, only one user can be specified
  • -v, --verbose
    • Increases the verbosity of the program
    • Can be specified multiple times
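
As a combined example of the options above (the subreddit, search term, limit, and paths are placeholders), the following run searches a subreddit, sorts by top posts of the past year, limits the count, and writes the log to a custom location:

bdfr download ./path/to/output --subreddit wallpapers --search 'mountains' -S top -t year -L 25 --log ./mountains_run_log.txt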

Downloader Options

The following options apply only to the download command. This command downloads the files and resources linked in the submission, or the text of the submission itself, to disk in the specified directory. An example command follows the list.

  • --make-hard-links
    • This flag will create hard links to an existing file when a duplicate is downloaded in the current run
    • This will make the file appear in multiple directories while only taking the space of a single instance
  • --max-wait-time
    • This option specifies the maximum wait time for downloading a resource
    • The default is 120 seconds
    • See Rate Limiting for details
  • --no-dupes
    • This flag will not redownload files if they were already downloaded in the current run
    • This is calculated by MD5 hash
  • --search-existing
    • This will make the BDFR compile the hashes for every file in the directory
    • The hashes are used to remove duplicates if --no-dupes is supplied or make hard links if --make-hard-links is supplied
  • --file-scheme
    • Sets the naming scheme for downloaded files
    • See Folder and File Name Schemes for more details
  • --folder-scheme
    • Sets the scheme for the folders into which downloaded files are sorted
    • Can be an empty string, in which case files are not separated into subfolders
    • See Folder and File Name Schemes for more details
  • --exclude-id
    • This will skip the download of any submission with the ID provided
    • Can be specified multiple times
  • --exclude-id-file
    • This will skip the download of any submission with any of the IDs in the files provided
    • Can be specified multiple times
    • Format is one ID per line
  • --skip-domain
    • This adds domains to the download filter i.e. submissions coming from these domains will not be downloaded
    • Can be specified multiple times
    • Domains must be supplied in the form example.com or img.example.com
  • --skip
    • This adds file types to the download filter i.e. submissions with one of the supplied file extensions will not be downloaded
    • Can be specified multiple times
  • --skip-subreddit
    • This skips all submissions from the specified subreddit
    • Can be specified multiple times
    • Also accepts CSV subreddit names
  • --min-score
    • This skips all submissions with fewer upvotes than the specified value
  • --max-score
    • This skips all submissions with more upvotes than the specified value
  • --min-score-ratio
    • This skips all submissions with an upvote ratio lower than the specified value
  • --max-score-ratio
    • This skips all submissions with an upvote ratio higher than the specified value
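
As an illustrative example (the subreddit, domain, and values are placeholders), the following run combines several of these filters, skipping mp4 files and a domain, ignoring low-scoring posts, and de-duplicating against files already on disk:

bdfr download ./path/to/output --subreddit pics -L 50 --skip mp4 --skip-domain example.com --min-score 100 --search-existing --no-dupes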

Archiver Options

The following options are for the archive command specifically. Example commands follow the list.

  • --all-comments
    • When combined with the --user option, this will download all the user's comments
  • -f, --format
    • This specifies the format of the data file saved to disk
    • The following formats are available:
      • json (default)
      • xml
      • yaml
  • --comment-context
    • This option will, instead of downloading an individual comment, download the submission that comment is a part of
    • May result in a longer run time as it retrieves much more data
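
For example (the user and subreddit names are placeholders), the following commands archive a user's comments as JSON and a subreddit's submissions as YAML:

bdfr archive ./path/to/output --user reddituser --all-comments
bdfr archive ./path/to/output --subreddit AskHistorians -L 100 -f yaml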

Cloner Options

The clone command can take all the options listed above for both the archive and download commands since it performs the functions of both.
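
For example (placeholder path and subreddit), a clone run can mix options from both commands, downloading linked resources while saving submission data as YAML:

bdfr clone ./path/to/output --subreddit Python -L 25 --format yaml --skip mp4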

Common Command Tricks

A common use case is for subreddits/users to be loaded from a file. The BDFR supports this via YAML file options (--opts my_opts.yaml).

Alternatively, you can use the command-line xargs utility. For a list of users in a file users.txt (one user per line), type:

cat users.txt | xargs -L 1 echo --user | xargs -L 50 bdfr download <ARGS>

The -L 50 part ensures that the character limit for a single command line isn't exceeded, but may not be necessary. The same approach can be used to load subreddits from a file; simply exchange --user with --subreddit, as shown below.
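
Assuming a file subreddits.txt with one subreddit per line:

cat subreddits.txt | xargs -L 1 echo --subreddit | xargs -L 50 bdfr download ./path/to/output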

Authentication and Security

The BDFR uses OAuth2 authentication to connect to Reddit if authentication is required. This means that it is a secure, token-based system for making requests. This also means that the BDFR only has access to specific parts of the account authenticated, by default only saved posts, upvoted posts, and the identity of the authenticated account. Note that authentication is not required unless accessing private things like upvoted posts, saved posts, and private multireddits.

To authenticate, the BDFR will first look for a token in the configuration file that signals that there's been a previous authentication. If this is not there, then the BDFR will attempt to register itself with your account. This is normal, and if you run the program, it will pause and show a Reddit URL. Click on this URL and it will take you to Reddit, where the permissions being requested will be shown. Read this and confirm that there are no more permissions than needed to run the program. You should not grant unneeded permissions; by default, the BDFR only requests permission to read your saved or upvoted submissions and identify as you.

If the permissions look safe, confirm it, and the BDFR will save a token that will allow it to authenticate with Reddit from then on.

Changing Permissions

Most users will not need to do anything extra to use any of the current features. However, if additional features such as scraping messages, PMs, etc are added in the future, these will require additional scopes. Additionally, advanced users may wish to use the BDFR with their own API key and secret. There is normally no need to do this, but it is allowed by the BDFR.

The configuration file for the BDFR contains the API secret and key, as well as the scopes that the BDFR will request when registering itself to a Reddit account via OAuth2. These can all be changed if the user wishes, however do not do so if you don't know what you are doing. The defaults are specifically chosen to have a very low security risk if your token were to be compromised, however unlikely that actually is. Never grant more permissions than you absolutely need.

For more details on the configuration file and the values therein, see Configuration Files.

Folder and File Name Schemes

The naming and folder schemes for the BDFR are both completely customisable. A number of different fields can be given which will be replaced with properties from a submission when downloading it. The scheme format takes the form of {KEY}, where KEY is a string from the below list.

  • DATE
  • FLAIR
  • POSTID
  • REDDITOR
  • SUBREDDIT
  • TITLE
  • UPVOTES

Each of these can be enclosed in curly brackets, {}, and included in the name. For example, to title every downloaded post with just the unique submission ID, you can use {POSTID}. Static strings can also be included, such as download_{POSTID}, which will not change from submission to submission. For example, the previous string will result in the following submission file names:

  • download_aaaaaa.png
  • download_bbbbbb.png

At least one key must be included in the file scheme, otherwise an error will be thrown. The folder scheme however, can be null or a simple static string. In the former case, all files will be placed in the folder specified with the directory argument. If the folder scheme is a static string, then all submissions will be placed in a folder of that name. In both cases, there will be no separation between all submissions.

It is highly recommended that the file name scheme contain the parameter {POSTID} as this is the only parameter guaranteed to be unique. No combination of other keys will necessarily be unique and may result in posts being skipped as the BDFR will see files by the same name and skip the download, assuming that they are already downloaded.
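
For example (the output path and subreddit are placeholders), the following run names each file with its date, score, and post ID, and sorts files into per-subreddit folders:

bdfr download ./path/to/output --subreddit EarthPorn --file-scheme '{DATE}_{UPVOTES}_{POSTID}' --folder-scheme '{SUBREDDIT}'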

Configuration

The configuration files are, by default, stored in the configuration directory for the user. This differs depending on the OS that the BDFR is being run on. For Windows, this will be:

  • C:\Users\<User>\AppData\Local\BDFR\bdfr

If Python has been installed through the Windows Store, the folder will appear in a different place. Note that the hash included in the file path may change from installation to installation.

  • C:\Users\<User>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\Local\BDFR\bdfr

On Mac OSX, this will be:

  • ~/Library/Application Support/bdfr

Lastly, on a Linux system, this will be:

  • ~/.config/bdfr/

The logging output for each run of the BDFR will be saved to this directory in the file log_output.txt. If you need to submit a bug, it is this file that you will need to submit with the report.

Configuration File

The config.cfg is the file that supplies the BDFR with the configuration to use. At the moment, the following keys must be included in the configuration file supplied.

  • client_id
  • client_secret
  • scopes

The following keys are optional, and defaults will be used if they cannot be found.

  • backup_log_count
  • max_wait_time
  • time_format
  • disabled_modules
  • filename_restriction_scheme

None of these should be modified unless you know what you're doing, as the default values will enable the BDFR to function just fine. A default configuration is included with the BDFR when it is installed, and it will be placed in the configuration directory.

Most of these values have to do with OAuth2 configuration and authorisation. The key backup_log_count however has to do with the log rollover. The logs in the configuration directory can be verbose and for long runs of the BDFR, can grow quite large. To combat this, the BDFR will overwrite previous logs. This value determines how many previous run logs will be kept. The default is 3, which means that the BDFR will keep at most three past logs plus the current one. Any runs past this will overwrite the oldest log file, called "rolling over". If you want more records of past runs, increase this number.

Time Formatting Customisation

The option time_format will specify the format of the timestamp that replaces {DATE} in filename and folder name schemes. By default, this is the ISO 8601 format, which is highly recommended due to its standardised nature. If you don't need to change it, it is recommended that you do not. However, you can set it to any format required with this option. The --time-format command-line option supersedes any specification in the configuration file.

The format can be specified through the format codes that are standard in the Python datetime library.
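
For example (the path and subreddit are placeholders), the following uses Python strftime codes to produce dates such as 2023-01-31_13-59 in filenames:

bdfr download ./path/to/output --subreddit EarthPorn --time-format '%Y-%m-%d_%H-%M' --file-scheme '{DATE}_{POSTID}'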

Disabling Modules

The individual modules of the BDFR, used to download submissions from websites, can be disabled. This is helpful especially in the case of the fallback downloaders, since the --skip-domain option cannot be effectively used in these cases. For example, the Youtube-DL downloader can retrieve data from hundreds of websites and domains; thus the only way to fully disable it is via the --disable-module option.

Modules can be disabled through the command-line interface for the BDFR or, more permanently, in the configuration file via the disabled_modules option. The list of downloaders that can be disabled is as follows; an example command follows the list. Note that the names are case-insensitive.

  • Direct
  • DelayForReddit
  • Erome
  • Gallery (Reddit Image Galleries)
  • Gfycat
  • Imgur
  • PornHub
  • Redgifs
  • SelfPost (Reddit Text Post)
  • Vidble
  • VReddit (Reddit Video Post)
  • Youtube
  • YoutubeDlFallback
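
For instance (the output path and subreddit are placeholders), to skip everything handled by the YouTube downloaders during a single run:

bdfr download ./path/to/output --subreddit videos --disable-module Youtube --disable-module YoutubeDlFallback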

Rate Limiting

The option max_wait_time has to do with retrying downloads. There are certain HTTP errors that mean that no amount of requests will return the wanted data, but some errors are from rate-limiting. This is when a single client is making so many requests that the remote website cuts the client off to preserve the function of the site. This is a common situation when downloading many resources from the same site. It is polite and best practice to obey the website's wishes in these cases.

To this end, the BDFR will sleep for a time before retrying the download, giving the remote server time to "rest". This is done in 60 second increments. For example, if a rate-limiting-related error is given, the BDFR will sleep for 60 seconds before retrying. Then, if the same type of error occurs, it will sleep for another 120 seconds, then 180 seconds, and so on.

The option --max-wait-time and the configuration option max_wait_time both specify the maximum time the BDFR will wait. If both are present, the command-line option takes precedence. For instance, the default is 120, so the BDFR will wait for 60 seconds, then 120 seconds, and then move on. Note that this results in a total of 180 seconds trying the same download. If you wish to try to bypass the rate-limiting system on the remote site, increasing the maximum wait time may help. However, note that the actual wait times accumulate if the resource is not downloaded, i.e. specifying a max value of 300 (5 minutes) can make the BDFR pause for 15 minutes on one submission, not 5, in the worst case.
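
For example (the path and subreddit are placeholders), to allow the BDFR to back off for up to five minutes per retry:

bdfr download ./path/to/output --subreddit wallpapers --max-wait-time 300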

Multiple Instances

The BDFR can be run in multiple instances with multiple configurations, either concurrently or consecutively. Scripting files make this easiest, using PowerShell on Windows operating systems or Bash elsewhere. This allows multiple scenarios to be run with data being scraped from different sources, as some combinations of sources are mutually exclusive, i.e. not every combination of data can be downloaded in a single run of the BDFR. To download from multiple users, for example, multiple runs of the BDFR are required.

Running these scenarios consecutively is done easily, like any single run. Configuration files that differ may be specified with the --config option to switch between tokens, for example. Otherwise, almost all configuration for data sources can be specified per-run through the command line.

Running scenarios concurrently (at the same time) however, is more complicated. The BDFR will look to a single, static place to put the detailed log files, in a directory with the configuration file specified above. If there are multiple instances, or processes, of the BDFR running at the same time, they will all be trying to write to a single file. On Linux and other UNIX based operating systems, this will succeed, though there is a substantial risk that the logfile will be useless due to garbled and jumbled data. On Windows however, attempting this will raise an error that crashes the program as Windows forbids multiple processes from accessing the same file.

The way to fix this is to use the --log option to manually specify where the logfile is to be stored. If the given location is unique to each instance of the BDFR, then it will run fine.
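
A minimal Bash sketch (the paths, subreddits, and log locations are placeholders) of two instances running concurrently, each with its own log file:

bdfr download ./output/python --subreddit Python --log ./logs/python_run.txt &
bdfr download ./output/linux --subreddit linux --log ./logs/linux_run.txt &
wait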

Filesystem Restrictions

Different filesystems have different restrictions on what files and directories can be named. These are separated into two broad categories: Linux-based filesystems, which have very few restrictions; and Windows-based filesystems, which are much more restrictive in terms of forbidden characters and the length of paths.

During the normal course of operation, the BDFR detects what filesystem it is running on and formats any filenames and directories to conform to the rules that are expected of it. However, there are cases where this will fail. When running on a Linux-based machine, or another system where the home filesystem is permissive, and accessing a share or drive with a less permissive system, the BDFR will assume that the home filesystem's rules apply. For example, when downloading to a SAMBA share from Ubuntu, there will be errors as SAMBA is more restrictive than Ubuntu.

The best option is to always download to a filesystem that is as permissive as possible, such as an NFS share or ext4 drive. However, when this is not possible, the BDFR allows the restriction scheme to be specified manually, either at the command line or in the configuration file. At the command line, this is done with --filename-restriction-scheme windows; alternatively, set the option of the same name in the configuration file.

Manipulating Logfiles

The logfiles that the BDFR outputs are consistent, quite detailed, and in a format that is amenable to regex. To this end, a number of bash scripts have been included here. They show examples of how to extract successfully downloaded IDs, failed IDs, and more besides.
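
As a rough, hypothetical sketch (the exact wording of log lines may differ between versions, so check your own log_output.txt and adjust the patterns before relying on them), successful and failed downloads could be pulled out with standard text tools:

grep 'Downloaded submission' log_output.txt > succeeded.txt    # pattern is illustrative, not guaranteed
grep -i 'failed' log_output.txt > failed.txt                   # likewise illustrative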

Unsaving posts

Back in v1 there was an option to unsave posts from your account when downloading, but it was removed from the core BDFR in v2 as it is considered a read-only tool. However, for those missing this functionality, a script was created that uses the log files to achieve this. There is info on how to use this in the README.md file in the scripts subdirectory.

List of currently supported sources

  • Direct links (links leading to a file)
  • Delay for Reddit
  • Erome
  • Gfycat
  • Gif Delivery Network
  • Imgur
  • Reddit Galleries
  • Reddit Text Posts
  • Reddit Videos
  • Redgifs
  • Vidble
  • YouTube
    • Any source supported by YT-DLP should be compatible

Contributing

If you wish to contribute, see Contributing for more information.

When reporting any issues or interacting with the developers, please follow the Code of Conduct.

bulk-downloader-for-reddit's People

Contributors

abgd1234, ailothaen, aliparlakci, blipranger, boo1098, botts85, chapmanjacobd, comradeecho, creepler13, danclowry, dbanon87, deepsourcebot, dunefox, ekriirke, elipsitz, jrwren, omegarazer, r-pufky, serene-arc, shinji257, sinclairkosh, soulsuck24, st-korn, stared, thayol, vladdoster, zapperdj


bulk-downloader-for-reddit's Issues

Gfycat NSFW download error: NotADownloadableLinkError: Could not read the page source. Gfycat now using redgifs for NSFW submissions?

I've updated my script to the latest pull that fixes the previous Gfycat error:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I now see a new exception with Gfycat posts.

System

Mac OS 10.14.6 & Ubuntu Server 16.04
Both Python 3.6.5

Issue

Most NSFW Gfycat posts return the following new error:

NotADownloadableLinkError: Could not read the page source
or
HTTPError: HTTP Error 404: Not Found

Maybe they are just both similar results of Gfycat removing NSFW posts from their main domain. I guess it's an issue now that gfycat is migrating NSFW content to redgifs? Seems they're blocking NSFW submissions to the gfycat domain.

I've tested gfycat downloads on non-NSFW subs, and they seem to download fine. Also, some NSFW Gfycat posts seem to download fine (maybe older ones which haven't been purged yet?).

redgifs.com submissions do show up in log file POSTS.json (not FAILED.json, as the mentioned NSFW Gfycat submissions do), but their PostType shows as null, and upon checking they don't actually download or reach the console at all.

Example

Here's a censored pastebin (SFW) I made with an example of what I'm talking about. Includes CONSOLE_LOG.txt, FAILED.json, and POSTS.json. Offending errors are pointed out.

Any plans to add redgifs.com support? It seems to be crawled/indexed and shows up in POSTS.json, but not downloaded. I guess it just needs some entries in src/. I know NSFW is controversial at times, it's all legal NSFW stuff we're talking about here, but I imagine it's one of the main ways this script is used.

Pull Request

Made this pull request after digging around myself. Not too familiar with Python, but my changes work fine for me and add redgifs support. Important to note that most NSFW Gfycat posts will still throw the above exceptions, as they are no longer allowed and are being removed from Gfycat's domain. As users switch over to redgifs, these errors will go away.

Login without web browser on headless Linux

I am trying to use your library on a headless Linux server in Docker. Because the Reddit API authorization requires logging in through a web browser, it cannot be completed.

Can you add the option to enter the Reddit API details in a txt file?

Unlimited is limited to 976

When downloading a subreddit without a limit, the maximum amount I get is 976. Even when setting --limit 1000 or --limit 10000, 976 is the maximum. Is there any way to increase this number?

Couple feature requests.

After using this program for a while, there are a couple of features I feel could be beneficial.

  1. Have the program change the date-created meta tag to the time and date of the original post it downloads. This will give users more of a timeline of their upvoted posts when browsing through folders, and might help them find a specific one more easily.

  2. Change the "upvoted posts" option to be able to select certain subreddits you want this to apply to, rather than just having it download everything.

Authentication Problem

I'm using Termux on Android. I'm trying to authenticate Reddit by pasting the link into a browser, but authentication failed, maybe because I'm on Android rather than a PC. Is there a solution to transfer the token without using that automatic authentication?

Use xdg specification

Really great software. Thanks for this. There's one thing bugging me. The Bulk Downloader for Reddit directory in my home folder looks cluttered to me. Can you move it to ~/.config so that it becomes $HOME/.config/Bulk Downloader for Reddit/config.json?

Edit: Also I think it's good idea to drop space.

'client' is not defined

Seeing the following error when running the latest (1.6.4.2) exe release:

Bulk Downloader for Reddit v1.6.4.1

Go to this URL and login to reddit:

<url snipped>
ERROR:root:NameError
Traceback (most recent call last):
  File "C:\Users\Ali\AppData\Local\Programs\Python\Python36\lib\site-packages\cx_Freeze\initscripts\__startup__.py", line 14, in run
  File "C:\Users\Ali\AppData\Local\Programs\Python\Python36\lib\site-packages\cx_Freeze\initscripts\Console.py", line 26, in run
  File "script.py", line 713, in <module>
  File "script.py", line 686, in main
  File "C:\Users\Ali\git-repositories\bulk-downloader-for-reddit\src\searcher.py", line 118, in getPosts
  File "C:\Users\Ali\git-repositories\bulk-downloader-for-reddit\src\searcher.py", line 103, in beginPraw
  File "C:\Users\Ali\git-repositories\bulk-downloader-for-reddit\src\searcher.py", line 59, in getRefreshToken
NameError: name 'client' is not defined

name 'client' is not defined

I tried to copy and paste the url but I get a 'localhost' error (assuming because the app has already closed).

EDIT: please support read only mode

Unraid User Scripts Errors

Hi, I use the script on unraid with the userscripts plugin and there are some issues I've had to work around to get it to work.

  • The automatic config generation doesn't really work on a remote client as the redirect is coded to localhost, so I've had to change the default config location and manually copy the reddit auth code from the end of the redirect URL.

defaultConfigDirectory = Path("/mnt/user/***/Reddit/") / "config"

  • The userscripts output doesn't do UTF-8, so I've had to change your BAD_CHARS stripping to a GOOD_CHARS whitelist for filenames:
GOOD_CHARS = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']

# replace any character not in the whitelist with an underscore
if not all(char in GOOD_CHARS for char in string):
    for char in string:
        if char not in GOOD_CHARS:
            string = string.replace(char, "_")

Also had to replace the • (https://www.compart.com/en/unicode/U+2022) with a *
And the – (https://www.fileformat.info/info/unicode/char/2013/index.htm) with a -

Just letting you know; no idea if it's worth doing anything about, but I thought I'd mention it.

Thanks for the script anyway.

File names are formatted wrong

They are not following the file name format from the documentation and look like random gibberish. Maybe a back end API update from Reddit's side messed it up.

Problems when feeding a list of subreddits

Hi, I don't know if this helps, but I am having some issues when I try to feed a list of subreddits via cat to your software. I tried it with your latest version, but I couldn't fully run the "script", as the "press enter to quit" message makes the script stop (I don't know how to send an "enter" into a command-line script). I had to go back to your Jul 22 version, which didn't have this issue and allowed me to download hundreds of subreddits unattended (mostly). Great script anyway!

input via textfile?

Would be nice if I could download a list of users or a list of subreddits instead of manually doing each one.

gfycat error

Whenever I try to download a gif with a gfycat link, I get this error. I have used this downloader before and didn't have an issue. I tried a fresh install and it still doesn't work. Is there something else I should try?

Here is an example post which produces an error:
{
"HEADER": "C:\Bulkdownloader\bulk-downloader-for-reddit.exe",
"1": [
"NotADownloadableLinkError: Could not read the page source",
{
"postId": "9at3ik",
"postTitle": "A master hunter in action",
"postSubmitter": "RespectMyAuthoriteh",
"postType": "gfycat",
"postURL": "https://gfycat.com/BoldSlightGibbon",
"postSubreddit": "funny"
}
]
}

here's the message in the console:

(1/1) – r/funny – GFYCAT
NotADownloadableLinkError: Could not read the page source
Nothing downloaded :(

Thanks!

I'm a dumbass, send help

I used this previously on another account for a long time, but now, since I changed accounts, I want the downloader to be on that one. The problem is, I don't remember how the hell I downloaded it, right now the problem is that I have no idea where to redirect the app, or am I even supposed to make an app myself? I want to use the .exe cuz it feels nicer and I'm more used to it, so any help what so ever would be welcome. I'm probably being an idiot and missing something important so don't yell at me thanks.

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Ubuntu Server 16.04 with Python 3.6.5 running python script. All requirements seemed to install successfully.

Some posts (through subreddit and user submissions) get this error.

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I can't seem to figure out what the conditions for this error are. I imagine it's a Python error?
I saw somewhere it may be an API limit (maybe imgur api?). I changed IPs via VPN and the error continued.

Any ideas?

EDIT - OOPS

Saw a pull was made for this in the post before mine. My bad.

Connections seem to time out and downloads don't work

It seems that it isn't working properly anymore. I've tested it multiple times on different machines with the same result. After usually a few successfully downloaded files it just freezes. I assume there is some error handling missing when the connection is refused or timed out.
There is no information in the log files.

Documentation and config.json

I like this project. It works like a charm. Though, it took me some time to run it.

  • Does not automatically create config.json if it does not exist; instead, it produces an obscure error.
  • The main readme page does not contain direct commands - had to manually look in docs.

Possible to skip video files?

I'm only looking for pictures, and videos are taking up a lot of bandwidth. Is it possible to skip video/mp4 files? I can use the download-later option and sort them, but it's tedious. Thanks a ton!

version `GLIBC_2.25' not found

Hi there,

i get

./bulk-downloader-for-reddit: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.25' not found (required by /app/blkreddit/libpython3.6m.so.1.0)

when running the script. Running Debian Stretch and libc6 is installed.

NO AUDIO ON CLIPS?

This program works perfectly except there is NO AUDIO on the clips, so all clips are useless without audio. Any chance this could be fixed? Thanks!

can I download all the comments of a post?

I have seen that it only downloads the first post and not all the comments - is it possible to download all the comments of a post? NOT by a user - but all comments from a post

ps: it would be convenient for me to download all the comments from my SAVED posts
thanks for your work.

PRAW Authentication Error when account has 2FA.

I use 2FA on my reddit account and so should you 😄 .
However, this script will not work with 2FA because PRAW requires the 6-digit code, which simply isn't provided, as referenced here. The way of getting around this currently is that I made a second reddit account without 2FA and am using that instead.

The issue with this script is that you would need to program it so that it asks for the 2FA on run every time because as is the nature of 2FA it changes.

Avoid duplication

Hi,

nice tool! Is it possible to avoid duplications by either checking if a file is already in folder X or maybe make the user select the time frame to download posts from (like last 7 days or May 24th to May 31st).

EDIT: I see it skips existing files. But it's kinda hard to know when I need to let the software rerun to have all posts from, say, an entire week, because it's based on how many posts people make a day and that's not always the same. So I guess a time frame would be better.

First try and got 403 error forbidden xD

urllib.error.HTTPError: HTTP Error 403: Forbidden

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "script.py", line 402, in <module>
    main()
  File "script.py", line 396, in main
    downloader(getPosts())
  File "script.py", line 348, in downloader
    updateBackup(BACKUP_FILE,str(i+1),"add")
  File "script.py", line 204, in updateBackup
    getattr(FILE,mode) (key)
  File "/root/bulk-downloader-for-reddit/src/tools.py", line 44, in add
    data = {**data, **toBeAdded}
TypeError: 'str' object is not a mapping

Feature request: Save file directory, post location and amount

Hey there. Just had a small suggestion. I am woefully ignorant about coding, so if there's a way of achieving what I'm requesting here without adding a feature, by all means let me know.

I usually always run the code with the same parameters. Always the same directory, always download 100 posts from the Saved list. I'd like to create a scheduled task so this happens daily, but I'd rather not type everything in every time. Again, I suspect this can be achieved without adding a feature, which I imagine is much more practical, so I'm willing to try anything to achieve this.

Configure sorting of downloaded files.

Hey there, just had a couple suggestions/feature requests.

I'd like to be able define the directory structure for new downloads, i.e. perhaps based on file type. And also some sort of option to sort download files which retains the order in which they were saved, either by date saved (if that's something reddit exposes) at the start of the filename ( DD-MM-YYYY format) or just modifying the system modified date/created date to match that as well. If that's not possible, perhaps some sort of numeric ID, separate from the post ID that can be added to the start of the filename, that when sorted by will put the files in saved order.

I've been using some powershell scripts run afterwards as part of a batch script, but it's not working too well, in that it needs to copy everything to a new set of directories which doubles the amount of space used, and it appears the script only takes into account the contents of the destination folder to determine what has and hasn't already been downloaded. I tried using the --log option , thinking that might keep a log of what's been retrieved already, but I'm not sure what it's looking for in terms of arguments. I tried putting a directory path, the path with a filename with an without an extension (.json, and .txt) (tried both path options both enclosed in quotes as well), and a filename with and without ext without the path, and it doesn't seem to like any of those options; didn't have any luck finding any clarification in the docs either.

Cheers.

Command not working

I am using the Python source code on Windows with this command: python script.py --directory .\data --subreddit AnimalsBeingDerps --sort hot --limit 0

GETTING POSTS No matching submission was found

Saved Posts Limit

I have been using the program for a month or two now and have noticed that the number of saved posts it can load is between 920 and 1000. I have been adding new saved posts during those months, but the counter has never passed 1000.

getting stuck

It seems to be getting stuck at different posts for some reason; I have tried restarts and it still happens.

(Debian) (binary) GLIBC_2.25 not found

sudo apt-get install libc6
...
libc6 is already the newest version (2.24-11+deb9u4).
cd aliparlakci_bulk-downloader-for-reddit/bulk_downloader_for_reddit-1.6.4.2-linux
./bulk-downloader-for-reddit 
./bulk-downloader-for-reddit: 
/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.25' not found 
(required by aliparlakci_bulk-downloader-for-reddit/bulk_downloader_for_reddit-1.6.4.2-linux/
libpython3.6m.so.1.0)

Proxy requests

It would be nice if there were an option to proxy your connections to download content through a proxy to avoid rate limiting.

Make internal port number configurable

I kept running into the following error when trying to get the Reddit authorization:

[WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions

I saw that it was using port number 1337, and incidentally the Razer Chroma SDK also uses that same port. I was able to get the authorization to work after stopping the server that the Razer software starts.

It would be great to have a way for us to configure which port number we want it to use without needing to fork your project.

line 571 SyntaxError: invalid syntax

python script.py 
  File "script.py", line 571
    print(f"\n({i+1}/{subsLenght}) – r/{submissions[i]['postSubreddit']}",
                                                                        ^
SyntaxError: invalid syntax

I had done this:

git clone https://github.com/aliparlakci/bulk-downloader-for-reddit aliparlakci_bulk-downloader-for-reddit
cd aliparlakci_bulk-downloader-for-reddit/
virtualenv -p python3 env
source env/bin/activate
python -m pip install -r requirements.txt

which python
.../aliparlakci_bulk-downloader-for-reddit/env/bin/python

python --version
Python 3.5.3

python script.py 

with this last commit

git log
commit 15a91e578496e1c07b5302606fab6061733eb1a1
Author: Ali <[email protected]>
Date:   Sun Feb 24 12:28:40 2019 +0300
    Fixed saving auth info problem

No sound in video

There is no audio in downloaded videos.
On some post I found the response that Reddit serves the video and audio as separate tracks, so the app is unable to download a stream containing both unless it re-encodes the video or something, which is much more work than downloading a file and requires significantly more time/CPU.

Gfycat fail to download (exception thrown)

System

Mac OS
Python 3.7.7
Script version 1.6.5

Issue

Gfycat links fail to download. The script throws an exception and then continues until it finishes.

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Example

I want to download the top 20 from a subreddit. There are 10 imgur links, 8 reddit image links, and 2 gfycat links. There are no non-supported links. Only the 2 gfycat links fail to download.

some question

Is there a way to save the file without weird characters like "😋😏😂💦"?
Also, when I've downloaded files, I sometimes delete some of them, but when I re-download, they get downloaded again. Is there a way to skip them (i.e. just download new ones)?

SSL Error

When I run the script I get the following error for the majority of the entries when running the following command:

Error:
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:748)>

Command:
py -3 script.py .\NEW_FOLDER\ANOTHER_FOLDER --saved --limit 1000

I was able to authenticate Imgur and Reddit successfully.

Wrong Parameter

Tried to run the software a second time and save to the same folder as before (same subreddit).

Here's what happened:

EERROR:root:OSError
Traceback (most recent call last):
File "C:\Users\Ali\AppData\Roaming\Python\Python37\site-packages\cx_Freeze\initscripts_startup_.py", line 40, in run
File "C:\Users\Ali\AppData\Roaming\Python\Python37\site-packages\cx_Freeze\initscripts\Console.py", line 37, in run
File "script.py", line 725, in
File "script.py", line 714, in main
File "script.py", line 583, in download
File "D:\projects\bulk-downloader-for-reddit\src\utils.py", line 107, in printToFile
OSError: [WinError 87] Falscher Parameter

filenames too long

I saw errors about files not being created due to filename length, do you know of any workaround?
I am just running normal btrfs, so the filename is I think limited to 255 characters.

Does not run on fresh system without locale set

I've updated this a few times as I finally figured out the problem. On a fresh installation on Linux, the locale isn't set to UTF8 which the script assumes.

python3 ./script.py

ERROR:root:UnicodeEncodeError
Traceback (most recent call last):
File "./script.py", line 718, in
main()
File "./script.py", line 655, in main
f"\nBulk Downloader for Reddit v{version}\n"
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 59: ordinal not in range(128)

Traceback (most recent call last):
File "./script.py", line 731, in
if not GLOBAL.arguments.quit: input("\nPress enter to quit\n")
AttributeError: 'NoneType' object has no attribute 'quit'

LC_ALL=C.UTF-8 python3 ./script.py

Bulk Downloader for Reddit v1.6.5
Written by Ali PARLAKCI – [email protected]

https://github.com/aliparlakci/bulk-downloader-for-reddit/

download directory:

That's probably an understood thing but I've spent no time with python3 having last done python 2.4. It may be worth something in the README.md for people coming into the system fresh or maybe it's just assumed knowledge.

Feature Request: add auto-wait when API gets a temporary lock

When downloading quite a lot of images, after some time there is a message saying that the limit will be reset (right now after 20 minutes):
==> Client: 11573 - User: 0 - Reset after 20 Minutes 26 Seconds
After every following request one can see how the countdown decreases. It would be nice if the script could auto-detect this waiting time and sleep until the message disappears. After that it can continue the downloading (until the next waiting time appears and so on..)
