Project-Euler-Scrape
A complete web scrape of every Project Euler programming challenge, including metadata about each problem, the problem descriptions themselves, and all associated files and images.
What information was scraped?
I extracted as much information as I could find that was useful, including:
- Problem number (purple)
- Problem title (blue)
- Problem information (green)
- Publish date/time
- Number of solvers
- Difficulty rating
- Problem description (orange)
- Raw HTML from the page
- Plain text
- Any images in the problem description (red)
- Any files in the problem description (yellow)
Most of the data I scraped is in the file 1_631.json. The structure of the data is:
```
{
    "<problem number>": {
        "number": 1,
        "url": "<Project Euler problem URL>",
        "title": "<title of problem>",
        "info": {
            "difficulty": <problem difficulty level in %>,
            "published": "<publish date/time>",
            "solved": <number of solvers>
        },
        "content": {
            "images": <list of images>,
            "html": <raw HTML text>,
            "files": <list of files>
        }
    },
    ...
}
```
The images and files from the problems are found in the images/ and files/ directories respectively.
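As a quick illustration, a minimal Python sketch like the one below could load the scraped JSON and read the fields shown above. The field names come from the structure above; the snippet itself is not part of this repository.

```python
import json

# Load the scraped data from the file described above.
with open("1_631.json") as f:
    problems = json.load(f)

# Keys are problem numbers as strings; look up problem 1.
problem = problems["1"]
print(problem["title"])
print(problem["info"]["published"])
print(f'{problem["info"]["difficulty"]}% difficulty, solved by {problem["info"]["solved"]}')
print(problem["content"]["files"])  # list of files referenced by the problem
```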
How was the information scraped?
In previous commits I used a program called ParseHub to do the scraping, as I was fairly new to the concept and didn't think about doing it in a programming language. However, I have since redone everything in Python, using requests to fetch the web pages and BeautifulSoup to parse the HTML, extracting the information I wanted with regular expressions. All of the code is in pe_scrape.py.
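As a rough sketch of that approach (not the actual contents of pe_scrape.py), fetching one problem page with requests and pulling a couple of fields out with BeautifulSoup and a regular expression might look like this. The selectors and the "Solved by" pattern are assumptions about the page markup rather than guaranteed details:

```python
import re
import requests
from bs4 import BeautifulSoup

# Fetch a single problem page.
url = "https://projecteuler.net/problem=1"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# The problem title is assumed to sit in the first <h2> on the page.
title_tag = soup.find("h2")
title = title_tag.get_text(strip=True) if title_tag else None

# Pull the solver count out of the page text with a regular expression
# (the pattern is an assumption about how the metadata is worded).
page_text = soup.get_text().replace(",", "")
match = re.search(r"Solved by\s+(\d+)", page_text)
solved = int(match.group(1)) if match else None

print(title, solved)
```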
Why?
I am in the process of making a portfolio of all of the programming projects I have done. Naturally, I have solved a few of the Project Euler problems and wanted to include their descriptions, titles, etc. on my website without manually entering it all. So I decided to have the web pages dynamically filled with PHP using a JSON file containing all the necessary information, hence this project!
Feel free to use the data I scraped, or modify my code to suit your needs!