This repository contains two scripts for processing the McGill eScholarship data.
The first script extracts metadata from the XML documents. You can modify it to extract whichever metadata fields you need.
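As a rough illustration of this kind of extraction, and not the repository's actual script, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The field names (`title`, `creator`, `date`), the sample record, and the helper names `local_name` and `extract_metadata` are assumptions made for the example; the real eScholarship schema may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names; the real eScholarship schema may differ.
FIELDS = ("title", "creator", "date")


def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def extract_metadata(xml_text, fields=FIELDS):
    """Return one dictionary mapping each requested field to its values."""
    root = ET.fromstring(xml_text)
    record = {}
    for element in root.iter():
        name = local_name(element.tag)
        if name in fields and element.text:
            # Keep every occurrence, since a field such as creator can repeat.
            record.setdefault(name, []).append(element.text.strip())
    return record


# Made-up sample record, for illustration only.
sample = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample thesis</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:date>2010</dc:date>
</record>"""

# One dictionary per document, collected into a list.
records = [extract_metadata(sample)]
```

To extract different metadata in this sketch, change the names passed in `fields`.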
The second script parses the filename and content of the HTML files and also cleans the data.
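A possible shape for this step, sketched with Python's standard `html.parser` module. This is not the repository's actual script: the cleaning rules (dropping script and style blocks, collapsing whitespace), the dictionary keys, the sample path, and the names `TextExtractor` and `parse_html` are all assumptions made for the example.

```python
import os
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def parse_html(path, html_text):
    """Return a dictionary with the file's base name and its cleaned text."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Assumed cleaning rule: collapse every run of whitespace to one space.
    text = re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
    # Assumed filename handling: keep the base name without its extension.
    stem = os.path.splitext(os.path.basename(path))[0]
    return {"filename": stem, "content": text}


# Made-up sample input, for illustration only.
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello,\n  world!</p></body></html>"
)
rows = [parse_html("data/12345.html", sample_html)]
```

In practice the HTML text would be read from each file on disk; the sketch takes it as a string so that it runs on its own.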
The output of both scripts is a list of dictionaries.
The parsed XML dataset can be accessed here: https://drive.google.com/file/d/0B_4_ObSAWJETbFl5dERidnNfbnM/view?usp=sharing
For more information about the dataset, you can send an email to [email protected]