A tool for extracting datasets from Kaggle and downloading them to a local machine
This project extracts datasets from Kaggle and downloads them to a local machine. It consists of three main scripts:
- links.py: collects links to Kaggle datasets that are available in CSV format
- extractor.py: extracts the table description, associated column descriptions, dataset name, and table URL from each dataset page and writes them to a CSV file
- downloader.py: downloads each table using the KaggleAPI and saves it along with its data dictionary, table description, and table URL. Each downloaded table gets its own parent directory containing 4 files
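
The download step could look roughly like the sketch below. It is not the code in downloader.py: the function names, the directory naming scheme, and the assumed URL shape (https://www.kaggle.com/datasets/owner/dataset) are illustrative assumptions. Only KaggleApi, authenticate, and dataset_download_files come from the official kaggle package.

```python
from pathlib import Path
from urllib.parse import urlparse


def dataset_slug_from_url(url: str) -> str:
    """Return 'owner/dataset' from a URL shaped like
    https://www.kaggle.com/datasets/owner/dataset (assumed URL layout)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Kaggle dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def download_dataset(url: str, out_root: Path) -> Path:
    """Download one dataset into its own directory under out_root."""
    # Imported lazily because the kaggle package may look for
    # credentials as soon as it is imported.
    from kaggle.api.kaggle_api_extended import KaggleApi

    slug = dataset_slug_from_url(url)
    # Hypothetical naming scheme: one directory per dataset.
    target = out_root / slug.replace("/", "__")
    target.mkdir(parents=True, exist_ok=True)

    api = KaggleApi()
    api.authenticate()  # reads the kaggle.json credentials file
    api.dataset_download_files(slug, path=str(target), unzip=True)
    return target
```

Calling download_dataset requires the kaggle package and valid credentials; dataset_slug_from_url can be used on its own.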
Other important files include count_columns.py, which counts the total number of columns collected so far, and filter_empty_dirs.py, which post-processes the scraped data: it deletes empty folders, limits the number of rows in each CSV file, converts JSON strings into actual JSON files, and copies the folders from the test directory to the processed directory.
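
The post-processing steps can be sketched as follows. This is an illustration under assumptions, not the contents of filter_empty_dirs.py: the function names are hypothetical, and the standard library csv module is used here to keep the example self-contained, whereas the project itself relies on pandas (where read_csv with the nrows argument would serve the same purpose).

```python
import csv
import json
import os
from pathlib import Path


def remove_empty_dirs(root: Path) -> list[Path]:
    """Delete empty directories under root and return the ones removed."""
    removed = []
    # Walk bottom-up so a parent emptied by removing its children is caught too.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        path = Path(dirpath)
        if path != root and not any(path.iterdir()):
            path.rmdir()
            removed.append(path)
    return removed


def truncate_csv(path: Path, max_rows: int) -> None:
    """Keep the header row plus at most max_rows data rows."""
    rows = []
    with path.open(newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if i > max_rows:  # row 0 is the header
                break
            rows.append(row)
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)


def json_string_to_file(text: str, dest: Path) -> None:
    """Parse a JSON string and write it out as a formatted JSON file."""
    data = json.loads(text)  # raises ValueError on invalid JSON
    dest.write_text(json.dumps(data, indent=2), encoding="utf-8")
```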
This project uses selenium-python, KaggleAPI, and pandas. To use the KaggleAPI, you need to have your kaggle.json file in the ~/.kaggle folder.
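
A quick way to confirm the credentials file is in place before running the scripts is sketched below. The helper name load_kaggle_credentials is hypothetical; the only facts it relies on are the ~/.kaggle/kaggle.json location mentioned above and the "username" and "key" fields that Kaggle writes into that file.

```python
import json
from pathlib import Path
from typing import Optional


def load_kaggle_credentials(config_dir: Optional[Path] = None) -> dict:
    """Read kaggle.json and check that it holds a username and a key."""
    if config_dir is None:
        config_dir = Path.home() / ".kaggle"
    cred_file = config_dir / "kaggle.json"
    if not cred_file.is_file():
        raise FileNotFoundError(f"missing credentials file: {cred_file}")

    creds = json.loads(cred_file.read_text(encoding="utf-8"))
    missing = {"username", "key"} - creds.keys()
    if missing:
        raise KeyError(f"kaggle.json lacks fields: {sorted(missing)}")
    return creds
```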
Possible improvements include using multi-threading to speed up the extraction process and adding more robust exception handling.
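
One possible shape for both improvements is sketched below, using a thread pool from the standard library and recording failures per URL instead of stopping at the first error. The names run_in_threads and worker are hypothetical; worker stands in for whatever function processes a single dataset page.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_threads(urls, worker, max_workers=4):
    """Run worker(url) for every URL in parallel.

    Returns (results, failures): two dicts keyed by URL, holding the
    return value of each success and the exception of each failure.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going; report the failure later
                failures[url] = exc
    return results, failures
```

If the worker drives a browser through Selenium, each thread would most likely need its own WebDriver instance, since a single driver is not meant to be shared across threads.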