
The subreddit archiver

License: BSD 3-Clause "New" or "Revised" License



timesearch

NEWS (2023 06 25):

Pushshift's API is currently offline. Without the timestamp search parameter or Pushshift access, timesearch is not able to get historical data. You can continue to use the livestream module to collect new posts and comments as they are made.

You can still download the Pushshift archives, though. https://the-eye.eu/redarcs/ is one source.

I have added a module for ingesting these JSON files into a timesearch database, so that you can continue to use offline_reading, or in case you just prefer the SQLite format. You need to extract the .zst file with an archive tool like 7-Zip before giving it to timesearch.

python timesearch.py ingest_jsonfile subredditname_submissions -r subredditname

python timesearch.py ingest_jsonfile subredditname_comments -r subredditname

NEWS (2023 05 01):

Reddit has revoked Pushshift's API access, so pushshift.io may not be able to continue ingesting reddit content.

NEWS (2018 04 09):

Reddit has removed the timestamp search feature which timesearch was built on (original). Please message the admins by sending a PM to /r/reddit.com. Let them know that this feature is important to you and that you would like them to restore it on the new search stack.

Thankfully, Jason Baumgartner aka /u/Stuck_in_the_Matrix, owner of Pushshift.io, has made it easy to interact with his dataset. Timesearch now queries his API to get post data, and then uses reddit's /api/info to get up-to-date information about those posts (scores, edited text bodies, ...). While we're at it, this also gives us the ability to speed up get_comments. In addition, we can get all of a user's comments which was not possible through reddit alone.

NOTE: Because Pushshift is an independent dataset run by a regular person, it does not contain posts from private subreddits. Without the timestamp search parameter, scanning private subreddits is now impossible. I urge once again that you contact ~~your senator~~ the admins to have this feature restored.


I don't have a test suite. You're my test suite! Messages go to /u/GoldenSights.

Timesearch is a collection of utilities for archiving subreddits.

Make sure you have:

  • Downloaded this project using the green "Clone or Download" button in the upper right.
  • Installed Python. I use Python 3.7.
  • Installed PRAW >= 4, as well as the other modules in requirements.txt. Try pip install -r requirements.txt to get them all.
  • Created an OAuth app at https://old.reddit.com/prefs/apps. Make it script type, and set the redirect URI to http://localhost:8080. The title and description can be anything you want, and the about URL is not required.
  • Used this PRAW script to generate a refresh token. Just save it as a .py file somewhere and run it through your terminal / command line. For simplicity's sake, I just choose all for the scopes.
    • The instructions mention export praw_client_id=.... This creates environment variables on Linux. If you are on Windows, or simply don't want to create environment variables, you can alternatively add client_id='...' and client_secret='...' to the praw.Reddit instance on line 40, alongside the redirect_uri and user_agent arguments.
  • Downloaded a copy of this file and saved it as bot.py. Fill out the variables using your OAuth information, and read the instructions to see where to put it. The simplest way is to save it in the same folder as this README file (a sketch of such a file appears after this list).
    • The USERAGENT is a description of your API usage. Typically "/u/username's praw client" is sufficient.
    • The CONTACT_INFO is sent when downloading from Pushshift, as encouraged by Stuck_in_the_Matrix. It could just be your email address or reddit username.
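
For illustration, here is a minimal sketch of what bot.py might contain, assuming the variable names mentioned above (USERAGENT, CONTACT_INFO) plus a module-level praw.Reddit instance; the template linked above defines the exact fields it expects, and every value here is a placeholder:

    import praw

    USERAGENT = "/u/yourusername's praw client"
    CONTACT_INFO = 'yourname@example.com'

    # Placeholders; use your OAuth app's values and the refresh token
    # generated by the PRAW script.
    r = praw.Reddit(
        client_id='your_app_id',
        client_secret='your_app_secret',
        refresh_token='your_refresh_token',
        user_agent=USERAGENT,
    )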

This package consists of:

  • get_submissions: If you try to page through /new on a subreddit, you'll hit a limit at or before 1,000 posts. Timesearch uses the pushshift.io dataset to get information about very old posts, and then queries the reddit api to update their information. Previously, we used the timestamp cloudsearch query parameter on reddit's own API, but reddit has removed that feature and pushshift is now the only viable source for initial data.
    python timesearch.py get_submissions -r subredditname <flags>
    python timesearch.py get_submissions -u username <flags>

  • get_comments: Similar to get_submissions, this tool queries pushshift for comment data and updates it from reddit.
    python timesearch.py get_comments -r subredditname <flags>
    python timesearch.py get_comments -u username <flags>

  • livestream: get_submissions+get_comments is great for starting your database and getting the historical posts, but it's not the best for staying up-to-date. Instead, livestream monitors /new and /comments to continuously ingest data.
    python timesearch.py livestream -r subredditname <flags>
    python timesearch.py livestream -u username <flags>

  • get_styles: Downloads the stylesheet and CSS images.
    python timesearch.py get_styles -r subredditname

  • get_wiki: Downloads the wiki pages, sidebar, etc. from /wiki/pages.
    python timesearch.py get_wiki -r subredditname

  • offline_reading: Renders comment threads into HTML via markdown.
    Note: I'm currently using the markdown library from pypi, and it doesn't do reddit's custom markdown like /r/ or /u/, obviously. So far I don't think anybody really uses o_r so I haven't invested much time into improving it.
    python timesearch.py offline_reading -r subredditname <flags>
    python timesearch.py offline_reading -u username <flags>

  • index: Generates plaintext or HTML lists of submissions, sorted by a property of your choosing. You can order by date, author, flair, etc. With the --offline parameter, you can make all the links point to the files you generated with offline_reading.
    python timesearch.py index -r subredditname <flags>
    python timesearch.py index -u username <flags>

  • breakdown: Produces a JSON file indicating which users make the most posts in a subreddit, or which subreddits a user posts in.
    python timesearch.py breakdown -r subredditname
    python timesearch.py breakdown -u username

  • merge_db: Copy all new data from one timesearch database into another. Useful for syncing or merging two scans of the same subreddit.
    python timesearch.py merge_db --from filepath/database1.db --to filepath/database2.db

To use it

When you download this project, the main file that you will execute is timesearch.py here in the root directory. It will load the appropriate module to run your command from the modules folder.

You can view a summarized version of all the help text by running timesearch.py, and you can view a specific help text by running a command with no arguments, like timesearch.py livestream, etc.

I recommend sqlitebrowser if you want to inspect the database yourself.
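
If you would rather inspect it from Python, here is a quick sketch using the standard sqlite3 module; the database path is illustrative, and the table names follow the submissions/comments split described in the changelog below:

    import sqlite3

    # Point this at your actual timesearch database file.
    db = sqlite3.connect('subreddits/subredditname/subredditname.db')
    submissions = db.execute('SELECT COUNT(*) FROM submissions').fetchone()[0]
    comments = db.execute('SELECT COUNT(*) FROM comments').fetchone()[0]
    print(f'{submissions} submissions, {comments} comments archived')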

Changelog

  • 2020 01 27

    • When I first created Timesearch, it was simply a collection of all the random scripts I had written to archive various things, and they tended to have wacky names like commentaugment and redmash. Well, since the timesearch toolkit is meant to be a singular, cohesive package now, I decided to finally rename everything. I believe I have aliased everything properly so the old names still work for backwards compatibility, except that the modules folder is now called timesearch_modules, which may break your import statements if you ever imported it on your own.
  • 2018 04 09

    • Integrated with Pushshift to restore timesearch functionality, speed up commentaugment, and get user comments.
  • 2017 11 13

    • Gave timesearch its own Github repository so that (1) it will be easier for people to download it and (2) it has a cleaner, more independent URL. voussoir/timesearch
  • 2017 11 05

    • Added a try-except inside livestream helper to prevent generator from terminating.
  • 2017 11 04

    • For timesearch, I switched from using my custom cloudsearch iterator to the one that comes with PRAW4+.
  • 2017 10 12

    • Added the mergedb utility for combining databases.
  • 2017 06 02

    • You can use commentaugment -s abcdef to get a particular thread even if you haven't scraped anything else from that subreddit. Previously -s only worked if the database already existed and you specified it via -r. Now it is inferred from the submission itself.
  • 2017 04 28

    • Complete restructure into package, started using PRAW4.
  • 2016 08 10

    • Started merging redmash and wrote its argparser
  • 2016 07 03

    • Improved docstring clarity.
  • 2016 07 02

    • Added livestream argparse
  • 2016 06 07

    • Offline_reading has been merged with the main timesearch file
    • get_all_posts renamed to timesearch
    • Timesearch parameter usermode renamed to username; maxupper renamed to upper.
    • Everything now accessible via commandline arguments. Read the docstring at the top of the file.
  • 2016 06 05

    • NEW DATABASE SCHEMA. Submissions and comments now live in different tables, as they should have all along. The submission table has two new columns for a little bit of commentaugment metadata. This allows commentaugment to scan only threads that are new.
    • You can use the migrate_20160605.py script to convert old databases into new ones.
  • 2015 11 11

    • created offline_reading.py which converts a timesearch database into a comment tree that can be rendered into HTML
  • 2015 09 07

    • fixed bug which allowed livestream to crash because bot.refresh() was outside of the try-except.
  • 2015 08 19

    • fixed bug in which updatescores stopped iterating early if you had more than 100 comments in a row in the db
    • commentaugment has been completely merged into the timesearch.py file. you can use commentaugment_prompt() to input the parameters, or use the commentaugment() function directly.

I want to live in a future where everyone uses UTC and agrees on daylight savings.

Timesearch

Mirrors

https://git.voussoir.net/voussoir/timesearch

https://github.com/voussoir/timesearch

https://gitlab.com/voussoir/timesearch

https://codeberg.org/voussoir/timesearch


timesearch's Issues

import timesearch_modules error

D:\software\timesearch-master>python timesearch.py get_submissions -r gifs
Traceback (most recent call last):
  File "D:\software\timesearch-master\timesearch.py", line 16, in <module>
    import timesearch_modules
  File "D:\software\timesearch-master\timesearch_modules\__init__.py", line 282, in <module>
    DOCSTRING = betterhelp.add_previews(DOCSTRING, MODULE_DOCSTRINGS)
AttributeError: module 'voussoirkit.betterhelp' has no attribute 'add_previews'

This is what I get after upgrading from Python 3.8 to 3.10.

Offline Reading <> Invalidates HTML

Hey there, it's me again. When I compile a subreddit using the offline_reading functionality, there always seem to be errors whenever I merge all the files into one (using something like HTML Merge). Specifically, these errors involve the < and > signs, which HTML uses quite heavily.

Would it be possible for offline_reading to replace < and > with &lt; and &gt;? These codes work correctly without errors and would fix a lot of the problems I've been having, while keeping it WYSIWYG. (It would also help in certain cases where, for example, someone puts <i into their comment/submission and the HTML renderer renders the rest of the entire document italicized.)

Thanks!
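
For what it's worth, Python's standard library already provides this transformation; a minimal sketch of the requested escaping (illustration only, not what offline_reading currently does):

    import html

    # html.escape converts &, <, and > into their entity forms.
    body = 'someone puts <i into their comment'
    print(html.escape(body))  # someone puts &lt;i into their comment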

parent_id attribute for comments from pushshift API

When requesting comments from the Pushshift API, parent_id is coming back as null for root comments, and as an integer for reply comments.

Another user reported this issue on reddit: https://www.reddit.com/r/pushshift/comments/ujwdyt/parent_id_is_being_returned_as_integer_bug/

For timesearch, we'll solve it by transforming this data in the dummy object so that by the time it reaches TSDB, it will be normal.

     def __init__(self, **attributes):
         for (key, val) in attributes.items():
             if key == 'author':
                 val = DummyObject(name=val)
             elif key == 'subreddit':
                 val = DummyObject(display_name=val)
             elif key in ['body', 'selftext']:
                 val = html.unescape(val)
+            elif key == 'parent_id':
+                if val is None:
+                    val = attributes['link_id']
+                elif isinstance(val, int):
+                    val = 't1_' + common.b36(val)

If you have a timesearch database and you need to repair the parent_id data, you do not need to re-download any of those comments. We would just take all rows with a null parent_id and copy in the submission ID, and run all rows with an integer through the b36 function.
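
Here is a sketch of that repair as a standalone script; the comments table and its idstr, parent_id, and link_id columns are assumptions about the schema, and the b36 helper mirrors what common.b36 presumably does (int to base36 string):

    import sqlite3

    def b36(i, alphabet='0123456789abcdefghijklmnopqrstuvwxyz'):
        result = ''
        while i:
            (i, r) = divmod(i, 36)
            result = alphabet[r] + result
        return result or '0'

    db = sqlite3.connect('subredditname.db')
    # Root comments: a null parent_id becomes the submission's own fullname.
    db.execute('UPDATE comments SET parent_id = link_id WHERE parent_id IS NULL')
    # Reply comments: an integer parent_id becomes a t1_ base36 fullname.
    rows = db.execute('SELECT idstr, parent_id FROM comments').fetchall()
    for (idstr, parent_id) in rows:
        if isinstance(parent_id, int):
            db.execute(
                'UPDATE comments SET parent_id = ? WHERE idstr = ?',
                ('t1_' + b36(parent_id), idstr),
            )
    db.commit()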

Offline reading

After getting all the contents of a subreddit
(timesearch.py get_submissions -r subredditname,
timesearch.py get_comments -r subredditname,
timesearch.py get_wiki -r subredditname),
I try to render it:

Directory\timesearch>timesearch.py offline_reading -r Subredditofchoice

Directory\timesearch>
[main 2020-04-23T00:48:09.223Z] update#setState idle
(node:25576) Electron: Loading non context-aware native modules in the renderer process is deprecated and will stop working at some point in the future, please see electron/electron#18397 for more information
(node:25576) Electron: Loading non context-aware native modules in the renderer process is deprecated and will stop working at some point in the future, please see electron/electron#18397 for more information
(node:25576) Electron: Loading non context-aware native modules in the renderer process is deprecated and will stop working at some point in the future, please see electron/electron#18397 for more information
(node:25576) Electron: Loading non context-aware native modules in the renderer process is deprecated and will stop working at some point in the future, please see electron/electron#18397 for more information
(node:25576) Electron: Loading non context-aware native modules in the renderer process is deprecated and will stop working at some point in the future, please see electron/electron#18397 for more information
(node:25576) Electron: Loading non context-aware native modules in the renderer process is deprecated and will stop working at some point in the future, please see electron/electron#18397 for more information
(node:28272) Electron: Loading non context-aware native modules in the renderer process is deprecated and will stop working at some point in the future, please see electron/electron#18397 for more information
(node:28272) Electron: Loading non context-aware native modules in the renderer process is deprecated and will stop working at some point in the future, please see electron/electron#18397 for more information
[main 2020-04-23T00:48:39.225Z] update#setState checking for updates
[main 2020-04-23T00:48:39.504Z] update#setState idle

Nothing happens past this point; I have to end it manually using Ctrl+C, and no files are output.

get_submissions doesn't work for some subreddits

I'm trying to run get_submissions for /r/insaneparents and it doesn't download anything past September 7, 2022, but it works for several other subreddits such as /r/UnresolvedMysteries. The last Unix timestamp in the submissions table for /r/insaneparents is 1662563491. I was able to get all the comments using get_comments. This is the error I get from the latest version.

python.exe timesearch.py get_submissions -r insaneparents
Thank you Jason Baumgartner of Pushshift.io!
Traceback (most recent call last):
  File "C:\Users\777\Downloads\timesearch-master2\timesearch.py", line 554, in <module>
    raise SystemExit(main(sys.argv[1:]))
  File "C:\Users\777\AppData\Roaming\Python\Python39\site-packages\voussoirkit\vlogging.py", line 218, in wrapped
    return main(argv, *args, **kwargs)
  File "C:\Users\777\Downloads\timesearch-master2\timesearch.py", line 546, in main
    return betterhelp.go(parser, argv)
  File "C:\Users\777\AppData\Roaming\Python\Python39\site-packages\voussoirkit\betterhelp.py", line 620, in go
    return _go_multi(parser, argv, args_postprocessor=args_postprocessor)
  File "C:\Users\777\AppData\Roaming\Python\Python39\site-packages\voussoirkit\betterhelp.py", line 616, in _go_multi
    return main(argv)
  File "C:\Users\777\AppData\Roaming\Python\Python39\site-packages\voussoirkit\betterhelp.py", line 578, in main
    return args.func(args)
  File "C:\Users\777\Downloads\timesearch-master2\timesearch.py", line 56, in get_submissions_gateway
    get_submissions.get_submissions_argparse(args)
  File "C:\Users\777\Downloads\timesearch-master2\timesearch_modules\get_submissions.py", line 96, in get_submissions_argparse
    return get_submissions(
  File "C:\Users\777\Downloads\timesearch-master2\timesearch_modules\get_submissions.py", line 75, in get_submissions
    step = database.insert(chunk)
  File "C:\Users\777\Downloads\timesearch-master2\timesearch_modules\tsdb.py", line 347, in insert
    status = method(obj)
  File "C:\Users\777\Downloads\timesearch-master2\timesearch_modules\tsdb.py", line 427, in insert_submission
    (qmarks, bindings) = sqlhelpers.insert_filler(postdata)
TypeError: insert_filler() missing 1 required positional argument: 'values'

This is the error I get from a slightly older version.

python.exe timesearch.py get_submissions -r insaneparents
Thank you Jason Baumgartner of Pushshift.io!
Traceback (most recent call last):
  File "C:\Users\777\Downloads\timesearch-master\timesearch.py", line 553, in <module>
    raise SystemExit(main(sys.argv[1:]))
  File "C:\Users\777\AppData\Roaming\Python\Python39\site-packages\voussoirkit\vlogging.py", line 218, in wrapped
    return main(argv, *args, **kwargs)
  File "C:\Users\777\Downloads\timesearch-master\timesearch.py", line 545, in main
    return betterhelp.go(parser, argv)
  File "C:\Users\777\AppData\Roaming\Python\Python39\site-packages\voussoirkit\betterhelp.py", line 620, in go
    return _go_multi(parser, argv, args_postprocessor=args_postprocessor)
  File "C:\Users\777\AppData\Roaming\Python\Python39\site-packages\voussoirkit\betterhelp.py", line 616, in _go_multi
    return main(argv)
  File "C:\Users\777\AppData\Roaming\Python\Python39\site-packages\voussoirkit\betterhelp.py", line 578, in main
    return args.func(args)
  File "C:\Users\777\Downloads\timesearch-master\timesearch.py", line 56, in get_submissions_gateway
    get_submissions.get_submissions_argparse(args)
  File "C:\Users\777\Downloads\timesearch-master\timesearch_modules\get_submissions.py", line 96, in get_submissions_argparse
    return get_submissions(
  File "C:\Users\777\Downloads\timesearch-master\timesearch_modules\get_submissions.py", line 75, in get_submissions
    step = database.insert(chunk)
  File "C:\Users\777\Downloads\timesearch-master\timesearch_modules\tsdb.py", line 347, in insert
    status = method(obj)
  File "C:\Users\777\Downloads\timesearch-master\timesearch_modules\tsdb.py", line 398, in insert_submission
    url = submission.url
AttributeError: 'DummySubmission' object has no attribute 'url'

Recovering from exception

After running timesearch through a huge sub, the bot exited with an exception.

Is there a way to resume progress from where it exited? Running timesearch again only grabs the most recent threads from the top (it does not attempt to continue where it left off).

Also: any idea what might be the cause? (I'm running 2 instances with different apps configured on macOS.)

Jul 14 2015 13:28:52 - Jul 14 2015 12:37:17 +100
Jul 14 2015 12:36:47 - Jul 14 2015 11:16:45 +100
Jul 14 2015 11:16:19 - Jul 14 2015 09:50:38 +100
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1331, in getresponse
    response.begin()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 1009, in recv_into
    return self.read(nbytes, buffer)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 871, in read
    return self._sslobj.read(len, buffer)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/util/retry.py", line 357, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/packages/six.py", line 686, in reraise
    raise value
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 389, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 309, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='oauth.reddit.com', port=443): Read timed out. (read timeout=16.0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/prawcore/requestor.py", line 47, in request
    return self._http.request(*args, timeout=TIMEOUT, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/adapters.py", line 521, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='oauth.reddit.com', port=443): Read timed out. (read timeout=16.0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "timesearch.py", line 11, in <module>
    status_code = timesearch.main(sys.argv[1:])
  File "/Users/1/ts/timesearch/__init__.py", line 425, in main
    args.func(args)
  File "/Users/1/ts/timesearch/__init__.py", line 329, in timesearch_gateway
    timesearch.timesearch_argparse(args)
  File "/Users/1/ts/timesearch/timesearch.py", line 152, in timesearch_argparse
    interval=common.int_none(args.interval),
  File "/Users/1/ts/timesearch/timesearch.py", line 78, in timesearch
    for chunk in submissions:
  File "/Users/1/ts/timesearch/common.py", line 66, in generator_chunker
    for item in generator:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/praw/models/reddit/subreddit.py", line 451, in submissions
    sort='new', syntax='cloudsearch'):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/praw/models/listing/generator.py", line 52, in __next__
    self._next_batch()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/praw/models/listing/generator.py", line 62, in _next_batch
    self._listing = self._reddit.get(self.url, params=self.params)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/praw/reddit.py", line 367, in get
    data = self.request('GET', path, params=params)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/praw/reddit.py", line 472, in request
    params=params)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/prawcore/sessions.py", line 181, in request
    params=params, url=url)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/prawcore/sessions.py", line 112, in _request_with_retries
    data, files, json, method, params, retries, url)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/prawcore/sessions.py", line 97, in _make_request
    params=params)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/prawcore/rate_limit.py", line 33, in call
    response = request_function(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/prawcore/requestor.py", line 49, in request
    raise RequestException(exc, args, kwargs)
prawcore.exceptions.RequestException: error with request HTTPSConnectionPool(host='oauth.reddit.com', port=443): Read timed out. (read timeout=16.0)


received 400 HTTP response

Sorry if this isn't detailed enough; I'm new to writing Issues. When I try to run the code, I get this error. I have tried reinstalling, checked the bot.py info, checked Google, and tried to read the code, and I still can't figure out what is wrong. Can someone please help?
(venv) C:\Users\richa\PycharmProjects\timesearch>python timesearch.py get_submissions -r aww
Thank you Jason Baumgartner of Pushshift.io!
Traceback (most recent call last):
  File "timesearch.py", line 424, in <module>
    raise SystemExit(main(sys.argv[1:]))
  File "timesearch.py", line 411, in main
    return betterhelp.subparser_main(
  File "C:\Users\richa\PycharmProjects\timesearch\venv\lib\site-packages\voussoirkit\betterhelp.py", line 204, in subparser_main
    return subparser_betterhelp(parser, main_docstring, sub_docstrings)(main)(argv)
  File "C:\Users\richa\PycharmProjects\timesearch\venv\lib\site-packages\voussoirkit\betterhelp.py", line 184, in wrapped
    return main(argv)
  File "C:\Users\richa\PycharmProjects\timesearch\venv\lib\site-packages\voussoirkit\betterhelp.py", line 203, in main
    return args.func(args)
  File "timesearch.py", line 334, in get_submissions_gateway
    get_submissions.get_submissions_argparse(args)
  File "C:\Users\richa\PycharmProjects\timesearch\timesearch_modules\get_submissions.py", line 99, in get_submissions_argparse
    return get_submissions(
  File "C:\Users\richa\PycharmProjects\timesearch\timesearch_modules\get_submissions.py", line 73, in get_submissions
    for chunk in submissions:
  File "C:\Users\richa\PycharmProjects\timesearch\timesearch_modules\common.py", line 78, in generator_chunker
    for item in generator:
  File "C:\Users\richa\PycharmProjects\timesearch\timesearch_modules\pushshift.py", line 243, in supplement_reddit_data
    live_copies = list(common.r.info(ids))
  File "C:\Users\richa\PycharmProjects\timesearch\venv\lib\site-packages\praw\reddit.py", line 631, in generator
    for result in self.get(API_PATH["info"], params=params):
  File "C:\Users\richa\PycharmProjects\timesearch\venv\lib\site-packages\praw\reddit.py", line 566, in get
    return self._objectify_request(method="GET", params=params, path=path)
  File "C:\Users\richa\PycharmProjects\timesearch\venv\lib\site-packages\praw\reddit.py", line 667, in _objectify_request
    self.request(
  File "C:\Users\richa\PycharmProjects\timesearch\venv\lib\site-packages\praw\reddit.py", line 849, in request
    return self._core.request(
  File "C:\Users\richa\PycharmProjects\timesearch\venv\lib\site-packages\prawcore\sessions.py", line 328, in request
    return self._request_with_retries(
  File "C:\Users\richa\PycharmProjects\timesearch\venv\lib\site-packages\prawcore\sessions.py", line 226, in _request_with_retries
    response, saved_exception = self._make_request(
  File "C:\Users\richa\PycharmProjects\timesearch\venv\lib\site-packages\prawcore\sessions.py", line 183, in _make_request
    response = self._rate_limiter.call(
  File "C:\Users\richa\PycharmProjects\timesearch\venv\lib\site-packages\prawcore\rate_limit.py", line 33, in call
    kwargs["headers"] = set_header_callback()
  File "C:\Users\richa\PycharmProjects\timesearch\venv\lib\site-packages\prawcore\sessions.py", line 281, in _set_header_callback
    self._authorizer.refresh()
  File "C:\Users\richa\PycharmProjects\timesearch\venv\lib\site-packages\prawcore\auth.py", line 254, in refresh
    self._request_token(
  File "C:\Users\richa\PycharmProjects\timesearch\venv\lib\site-packages\prawcore\auth.py", line 155, in _request_token
    response = self._authenticator._post(url, **data)
  File "C:\Users\richa\PycharmProjects\timesearch\venv\lib\site-packages\prawcore\auth.py", line 38, in _post
    raise ResponseException(response)
prawcore.exceptions.ResponseException: received 400 HTTP response

New issue with praw

I recently upgraded PRAW and now I am getting this error:

prawcore:Retrying due to ConnectionError(ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))) status: GET https://oauth.reddit.com/api/info/

AttributeError: 'DummySubmission' object has no attribute 'url' on some subreddits

Hello Ethan,

Thank you so much for this great tool. I'd like to report an error I've been receiving while scraping some subreddits.

Traceback (most recent call last):
  File "/opt/reddit/timesearch.py", line 553, in <module>
    raise SystemExit(main(sys.argv[1:]))
  File "/home/arrakis/.local/lib/python3.10/site-packages/voussoirkit/vlogging.py", line 221, in wrapped
    return main(argv, *args, **kwargs)
  File "/opt/reddit/timesearch.py", line 545, in main
    return betterhelp.go(parser, argv)
  File "/home/arrakis/.local/lib/python3.10/site-packages/voussoirkit/betterhelp.py", line 621, in go
    return _go_multi(parser, argv, args_postprocessor=args_postprocessor)
  File "/home/arrakis/.local/lib/python3.10/site-packages/voussoirkit/betterhelp.py", line 617, in _go_multi
    return main(argv)
  File "/home/arrakis/.local/lib/python3.10/site-packages/voussoirkit/betterhelp.py", line 579, in main
    return args.func(args)
  File "/opt/reddit/timesearch.py", line 56, in get_submissions_gateway
    get_submissions.get_submissions_argparse(args)
  File "/opt/reddit/timesearch_modules/get_submissions.py", line 96, in get_submissions_argparse
    return get_submissions(
  File "/opt/reddit/timesearch_modules/get_submissions.py", line 75, in get_submissions
    step = database.insert(chunk)
  File "/opt/reddit/timesearch_modules/tsdb.py", line 347, in insert
    status = method(obj)
  File "/opt/reddit/timesearch_modules/tsdb.py", line 400, in insert_submission
    url = submission.url
AttributeError: 'DummySubmission' object has no attribute 'url'

This only happens on some subreddits (e.g. /r/unixporn), and only on submissions after September 06, 2022. I have upgraded voussoirkit to 0.0.74 and have the latest git updates. Thank you once again for this fantastic tool.

Proxy

Can http/https proxies be enabled (like in PRAW)?
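
Not currently a timesearch option, but PRAW performs its HTTP through the requests library, which honors the standard proxy environment variables; setting them before timesearch creates its PRAW session may be enough (the address below is a placeholder):

    import os

    # requests (and therefore PRAW) respects these standard variables.
    os.environ['HTTP_PROXY'] = 'http://127.0.0.1:8080'
    os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:8080'

The same effect can be had by exporting those variables in your shell before running timesearch.py.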

No module named 'bot4'

When trying to use timesearch on the latest source with up-to-date pip packages, I get the error in the title:

  File "timesearch.py", line 11, in <module>
    status_code = timesearch_modules.main(sys.argv[1:])
  File "C:\ProgramData\Anaconda3\envs\py36\lib\site-packages\voussoirkit\betterhelp.py", line 124, in wrapped
    return main(argv)
  File "D:\timesearch\timesearch_modules\__init__.py", line 423, in main
    args.func(args)
  File "D:\timesearch\timesearch_modules\__init__.py", line 342, in get_submissions_gateway
    from . import get_submissions
  File "D:\timesearch\timesearch_modules\get_submissions.py", line 4, in <module>
    from . import common
  File "D:\timesearch\timesearch_modules\common.py", line 22, in <module>
    import bot4
ModuleNotFoundError: No module named 'bot4'

Any way to fix this? I'm using Anaconda for env management.

insert_filler missing 1 required positional argument

First-time user here, so I might be making a simple mistake. It looks like a great utility and I can't wait to get it up and running.

I get the following error when running the 'get_submissions' command:

timesearch-master/timesearch_modules/tsdb.py", line 420, in insert_submission
    (qmarks, bindings) = sqlhelpers.insert_filler(postdata)
TypeError: insert_filler() missing 1 required positional argument: 'values'

The function in question from sqlhelpers.py:

def insert_filler(column_names, values, require_all=True):

And the calling line from tsdb.py > insert_submission

(qmarks, bindings) = sqlhelpers.insert_filler(postdata)
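
Judging from the two snippets above, the call site passes a single argument where the installed sqlhelpers expects two, which suggests a version mismatch between this timesearch checkout and the installed voussoirkit; upgrading voussoirkit (pip install --upgrade voussoirkit) may resolve it. Alternatively, if postdata is a dict of column names to values (an assumption), a call matching the quoted two-argument signature would look like:

    # Hypothetical fix matching the signature quoted above; assumes
    # postdata is a dict of column names to values.
    (qmarks, bindings) = sqlhelpers.insert_filler(list(postdata.keys()), list(postdata.values()))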

Getting deleted posts/comments?

Hi, I'm wondering if there's a way to get deleted posts and/or comments in a subreddit via timesearch? pushshift.io archives them, but any database downloads using default settings still display removed comments/posts.

SyntaxError: invalid syntax

Pretty sure I'm doing this wrong, but here's how it goes:

  1. python

  2. timesearch.py get_submissions -r subredditofchoice

Output:
File "", line 1
timesearch.py get_submissions -r subredditofchoice
===================^
SyntaxError: invalid syntax

(Ignore the =====================s they're just there as spacers since GitHub doesn't like spaces}.
