A parser for the Reddit data dump, which can be found here: reddit
- Edit the config so that it includes the path to the directory containing the uncompressed dump files.
- Run:
python3 run.py
- Wait for it to complete (it took a few hours).
- The sqlite3 database file is now ready to be queried!
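
Since the database schema is not described here, a reasonable first step is to list the tables and look at a few rows. The following is a minimal sketch using Python's standard `sqlite3` module; the helper names `list_tables` and `peek` are hypothetical and not part of this project.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def peek(conn, table, limit=5):
    """Return up to `limit` rows from `table`.

    Table names cannot be passed as SQL parameters, so `table` is
    interpolated directly; only pass names obtained from list_tables().
    """
    return conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,)).fetchall()
```

For example, `list_tables(sqlite3.connect("reddit.db"))` would show which tables the parser created, where `reddit.db` is a placeholder for whatever database path your config specifies.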
- Create a task for each batch of 10000 comments for preprocessing.
- Preprocess each comment string using the following steps (tunable via config.json):
  - Replace names with the tag <name>
  - Replace numbers with the tag <number>
  - Remove punctuation
  - Replace words that are not in the provided dictionary with the tag <unk>
- Save the resulting text as sanitized_body in the sqlite3 db
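
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the table name `comments`, the columns `id` and `body`, the tiny `NAMES` and `DICTIONARY` sets, and all function names are hypothetical. Batches are processed sequentially here, whereas the parser creates a task per batch. Punctuation is stripped before the tag checks so that the angle brackets of the inserted tags survive.

```python
import sqlite3
import string

BATCH_SIZE = 10000  # batch size mentioned in the description above

# Hypothetical stand-ins for the name list and dictionary supplied via config.
NAMES = {"alice", "bob"}
DICTIONARY = {"hello", "i", "have", "cats", "and", "dogs"}

# Translation table that deletes ASCII punctuation characters.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def sanitize(body, names=NAMES, dictionary=DICTIONARY):
    """Lowercase, strip punctuation, and tag names, numbers and unknown words."""
    tokens = []
    for raw in body.lower().split():
        word = raw.translate(PUNCT_TABLE)
        if not word:
            continue  # the token consisted of punctuation only
        if word.isdigit():
            tokens.append("<number>")  # "2.5" became "25" after stripping
        elif word in names:
            tokens.append("<name>")
        elif word not in dictionary:
            tokens.append("<unk>")
        else:
            tokens.append(word)
    return " ".join(tokens)


def batches(rows, size=BATCH_SIZE):
    """Yield consecutive slices of `rows` containing at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def sanitize_table(conn, table="comments", batch_size=BATCH_SIZE):
    """Write sanitize(body) into the sanitized_body column, one batch at a time."""
    rows = conn.execute(f'SELECT id, body FROM "{table}"').fetchall()
    for batch in batches(rows, batch_size):
        conn.executemany(
            f'UPDATE "{table}" SET sanitized_body = ? WHERE id = ?',
            [(sanitize(body), row_id) for row_id, body in batch],
        )
    conn.commit()
```

With these stand-in sets, `sanitize("Hello Bob, I have 3 cats and 2.5 dogs!!! lol")` returns `hello <name> i have <number> cats and <number> dogs <unk>`. Note that `string.punctuation` covers ASCII punctuation only, so a fuller version would need to handle Unicode punctuation as well.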