Scrape J! Archive for an extensive archive of Jeopardy!
-
Install Python 3.7 and the Psycopg2 and BeautifulSoup packages.
-
Set up a credentials.json file in the /credentials directory with your PostgreSQL connection details. Below is an example for a local database.
{ "host": "localhost", "database": "jeopardy", "user": "postgres", "password": "password" }
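As a rough illustration of how a script might consume this file, the sketch below loads it and opens a connection with Psycopg2. The function names and the relative path are assumptions made for this example, not taken from the project's code.

```python
import json
from pathlib import Path


def load_credentials(path="credentials/credentials.json"):
    """Read the PostgreSQL settings and fail early if a key is missing.

    The default path is an assumption; adjust it to where the file lives.
    """
    creds = json.loads(Path(path).read_text(encoding="utf-8"))
    required = {"host", "database", "user", "password"}
    missing = required - creds.keys()
    if missing:
        raise KeyError(f"credentials file is missing: {sorted(missing)}")
    return creds


def connect(creds):
    """Open a database connection (hypothetical helper).

    psycopg2 is imported here so load_credentials works without it installed.
    """
    import psycopg2

    return psycopg2.connect(
        host=creds["host"],
        dbname=creds["database"],
        user=creds["user"],
        password=creds["password"],
    )
```

Calling `connect(load_credentials())` from the project root would then return a connection to the `jeopardy` database in the example above.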
-
To scrape the entire archive of the site, run the following command. This will create a collection in your database with all questions leading up to the most recent episode. It will take some time (between one and two hours).
cd scrape
python scrape.py -a
-
You can also scrape only specific seasons by running the commands below. The example shown covers seasons 1 through 10, inclusive.
cd scrape
python scrape.py -s 1 10
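The two invocations above could be handled by an argument parser along the following lines. This is only a sketch of the `-a` and `-s FIRST LAST` behavior described here. The long option names, the helper names, and the `latest_season` parameter are hypothetical, and scrape.py may be written differently.

```python
import argparse


def build_parser():
    """Parser accepting either -a (everything) or -s FIRST LAST (a range)."""
    parser = argparse.ArgumentParser(description="Scrape J! Archive (illustrative sketch)")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-a", "--all", action="store_true", help="scrape every season")
    group.add_argument("-s", "--seasons", nargs=2, type=int,
                       metavar=("FIRST", "LAST"), help="inclusive season range")
    return parser


def seasons_to_scrape(args, latest_season):
    """Return the season numbers to fetch; the upper bound is included."""
    if args.all:
        return list(range(1, latest_season + 1))
    first, last = args.seasons
    if first > last:
        raise ValueError("FIRST must not be greater than LAST")
    return list(range(first, last + 1))
```

With this sketch, `-s 1 10` yields seasons 1 to 10 including season 10, matching the inclusive range described above.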
-
Set up a keys.json file in the /credentials directory with your API keys for Genderize.io and the Google Maps API.
{ "gmaps_api_key": "key_here", "genderize_api_key": "key_here" }
-
Run the commands below to send each contestant's first name to Genderize.io, a third-party API that predicts gender from a name, and update the database accordingly.
cd addon
python gender.py
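For orientation, a single Genderize.io lookup can be sketched as below. The helper names are made up for this example and gender.py may batch or structure its requests differently. The `name` and `apikey` query parameters and the `gender` and `probability` response fields follow the public Genderize.io API as I understand it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GENDERIZE_URL = "https://api.genderize.io/"


def build_genderize_url(first_name, api_key):
    """Build the request URL for one first name."""
    return GENDERIZE_URL + "?" + urlencode({"name": first_name, "apikey": api_key})


def parse_gender(payload):
    """Return (gender, probability); gender is None when the API has no guess."""
    return payload.get("gender"), payload.get("probability")


def lookup_gender(first_name, api_key):
    """Perform the HTTP request (needs network access and a valid key)."""
    with urlopen(build_genderize_url(first_name, api_key), timeout=10) as response:
        return parse_gender(json.load(response))
```

Keeping URL building and response parsing separate from the network call makes both easy to check without spending API quota.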
-
Run the commands below to update the database with the latitude and longitude for each contestant with a valid location.
cd addon
python geocode.py
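The geocoding step can be pictured with the sketch below, which targets the Google Maps Geocoding web service directly. The function names are hypothetical and geocode.py may use a client library instead. A location that cannot be resolved yields `None`, which is one way to skip contestants without a valid location.

```python
from urllib.parse import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(location, api_key):
    """Build a Geocoding API request for a free-text location."""
    return GEOCODE_URL + "?" + urlencode({"address": location, "key": api_key})


def extract_lat_lng(payload):
    """Return (lat, lng) from a decoded response, or None if nothing matched."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    point = payload["results"][0]["geometry"]["location"]
    return point["lat"], point["lng"]
```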
This project uses the topic modeling software MALLET to build a topic model for clues.
-
Run the commands below to create a separate text file for each clue in the database, from the first specified season through the last.
cd topics
python create_clues_data.py -s 1 35
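MALLET's import-dir treats every file in a directory as one document, which is why each clue gets its own file. A minimal sketch of that export step follows. The function name, the `(clue_id, text)` input shape, and the file naming scheme are assumptions for illustration, not the behavior of create_clues_data.py.

```python
from pathlib import Path


def write_clue_files(clues, out_dir):
    """Write one UTF-8 text file per clue and return how many were written.

    `clues` is an iterable of (clue_id, text) pairs, e.g. rows from a query.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for clue_id, text in clues:
        # Naming the file after the clue id lets topic output be joined back later.
        (out / f"{clue_id}.txt").write_text(text, encoding="utf-8")
        count += 1
    return count
```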
-
Change to the MALLET directory on your local machine and import these files by pointing --input at the directory that holds them.
bin\mallet import-dir --input {wherever the project lives}\jeopardy-scrape\topics\mallet_files\data\clues --output jeopardy_clues.mallet --keep-sequence --remove-stopwords
-
Train the model on the file produced by the import step and create the output files.
bin\mallet train-topics --input jeopardy_clues.mallet --num-topics 25 --optimize-interval 20 --output-state jeopardy_prof_topic-state.gz --output-topic-keys jeopardy_prof_keys.txt --output-doc-topics jeopardy_prof_composition.txt
-
Move jeopardy_prof_keys.txt and jeopardy_prof_composition.txt to topics/mallet_files/output/clues. These files contain the top words for each topic and the topic proportions for each clue. Run the commands below to update the database with that data.
cd topics
python parse_clue_topic_keys.py
python parse_clue_topic_composition.py
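To give a sense of what the two MALLET output files look like when read, here is a hedged parsing sketch. It assumes the tab-separated layout written by recent MALLET releases: the keys file has topic id, weight, then space-separated words, and the composition file has document index, document name, then one proportion per topic. Older releases write topic and proportion pairs instead, so check your files first. The function names are hypothetical and the project's own parse scripts may differ.

```python
def parse_topic_keys(lines):
    """Map topic id -> {'weight': float, 'words': [str, ...]}."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        topic_id, weight, words = line.split("\t", 2)
        topics[int(topic_id)] = {"weight": float(weight), "words": words.split()}
    return topics


def parse_composition(lines):
    """Map document name -> list of topic proportions, indexed by topic id."""
    docs = {}
    for line in lines:
        # Skip blank lines and the '#' header that some MALLET versions emit.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        docs[fields[1]] = [float(value) for value in fields[2:] if value]
    return docs


def dominant_topic(proportions):
    """Return the index of the topic with the largest proportion."""
    return max(range(len(proportions)), key=proportions.__getitem__)
```

Both parsers accept any iterable of lines, so an open file handle can be passed directly.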