yalies / api


👥 The best directory of Yale personnel, with a clean API to match. Used by 70% of undergrads!

Home Page: https://yalies.io


👥 Yalies

A website and API for getting information on students at Yale College.

Initial Setup

To develop changes to the application, you'll need to run it locally for testing.

This guide assumes that you are running macOS or a Linux-based OS. If you use Windows, you can still follow along, but some commands may be different.

Install Homebrew

If you're on a Mac, first install Homebrew.
If you're on Linux, follow along with your distro's package manager (pacman, apt-get, yum, etc).

Install Python

This project requires Python version 3.10.7. You may already have Python on your machine—it comes preinstalled with macOS—but it may be the wrong version.

To check your python version, run:

python3 --version

If it says anything other than 3.10.7, keep following along.
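
The same check can be made from inside Python, which may be handy in a setup script. This is a minimal sketch; the helper name `is_required_version` is hypothetical and not part of the repository.

```python
import sys

REQUIRED = (3, 10, 7)  # the version this guide asks for


def is_required_version(version_info=sys.version_info):
    """Return True when major.minor.patch matches the required version."""
    return tuple(version_info[:3]) == REQUIRED


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if is_required_version() else "wrong version"
    print(f"Python {found}: {status}")
```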

We are going to use pyenv to install multiple versions of Python on our machine, and pyenv-virtualenv to manage dependencies in our package. Run:

brew install pyenv
brew install pyenv-virtualenv

Now, we must add pyenv to our PATH. PATH is an environment variable that tells the shell which directories to search for executables.

If you're using zsh (the default on macOS nowadays), open ~/.zprofile in your favorite text editor. If you're still using bash, edit ~/.bash_profile. Add the following lines at the bottom:

export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init --path)"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

Note: for fish shell, add this to ~/.config/fish/config.fish instead. If you don't know what fish is, ignore this.
set PATH $PATH "$HOME/.pyenv/bin"
eval "$(pyenv init --path)"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

Finally, install the correct Python version and tell pyenv to use it by default. Run:

pyenv install 3.10.7
pyenv global 3.10.7

Install PostgreSQL

Run:

brew install postgresql

Clone the repository

If you need a refresher on Git, Eric made a Git tutorial video. The video describes the workflow the development team will be using.

Clone the repo using

git clone https://github.com/Yalies/api Yalies
cd Yalies

Note: If you are not a member of the Y/CS Yalies Dev Team but want to contribute, make a fork of the repo and set upstream.

Create your `venv`

Normally, when you install Python packages with pip3, the packages are installed in your user directory. But what if two projects use different versions of the same package? To avoid conflicts, instead of installing packages globally, we'll install them in a virtual environment localized to our project.

Create a virtual environment (venv) inside your project directory with:

python3 -m venv .venv

Optional: Install the VSCode extension to work with venv.

Now, activate the venv:

source .venv/bin/activate

Or, for fish:

source .venv/bin/activate.fish

Or, for powershell:

Set-ExecutionPolicy Unrestricted -Scope Process
.venv\Scripts\activate

Now that you've activated your venv, the commands python3 and pip3 point to the copies inside your project's .venv directory instead of the system-wide ones.
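
To confirm that the venv is really active, you can ask the interpreter itself. A minimal sketch (the function name `in_virtualenv` is made up for illustration): inside a venv, `sys.prefix` points at the environment while `sys.base_prefix` still points at the base installation.

```python
import sys


def in_virtualenv():
    """True inside a venv, where sys.prefix differs from sys.base_prefix."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("venv active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

If this prints False after activation, the shell is probably still resolving python3 to a different interpreter.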

Install dependencies

pip3 install -r requirements.txt
pip3 install -r requirements-test.txt

Run migrations

Finally, run the database migrations to get the local SQLite database configured:

python3 -m flask db upgrade

Running

To locally launch the application:

FLASK_APP=app.py FLASK_ENV=development flask run

Or, for powershell:

$env:FLASK_APP="app.py"
$env:FLASK_ENV="development"
flask run

The app will then be available at localhost:5000.

When running locally, the app uses a local SQLite database, so all database contents are stored in app.db. If you wish to run SQL queries on this database, install sqlite (best obtained through Homebrew or another package manager) and run:

sqlite3 app.db
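
The database can also be inspected from Python with the standard library's sqlite3 module, with no extra install. The sketch below only lists table names; the helper `list_tables` is hypothetical, and the actual table names depend on the migrations.

```python
import sqlite3


def list_tables(db_path="app.db"):
    """Return the names of all tables in the SQLite database at db_path."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Usage, run from the repository root after the migrations:
#   print(list_tables())
```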

Scraper

Our scraper crawls Yale's websites in order to obtain the data we provide. See documentation here.

Submitting changes

Switch to a new branch to hold your changes:

git checkout -b changes_description

The name of the branch should be short and refer to what you’re planning to change.

Next, make your code changes! Be sure to test them and make sure the app runs as you expect. Next, commit your code:

# To tell git to track all the files you changed
git add -A
# To label this set of changes:
git commit -m "Describe your changes here"
# Make sure to title your commit in the imperative mood, for example "Add new features" instead of "Added…", "Adding…", etc.

Next, upload your code to the repo. This won't affect master, only your feature branch.

git push -u origin your_branch_name

NOTE: If you make another change on this branch, you can just do git push (without additional flags) and it will automatically push to the last remote/branch you specified.

Before you make a pull request, please make sure you have all the latest changes from master. Resolve all conflicts accordingly.

git merge master

Next, create a pull request (a request to merge your changes into the main repository) by going to the repository page on GitHub and clicking the green "Compare & pull request" button that appears.

Title the pull request with a description of all included changes.

In the description, write "Fixes #X" or "Resolves #X", with X being an issue number, for each issue you're fixing in this PR. This will save time by telling GitHub to automatically close those issues once your changes are merged.

On the right side under Reviewers, select one other team member, your team lead (Eric Yoon), and Erik Boesen. You'll need all three people to approve your changes before they can be merged.

Congratulations! Your changes will be up for review. After they are merged, you'll need to switch back to master.

git checkout master

Repeat until all features are implemented and all bugs fixed! 🙂

License

Licensed under the MIT license.

Author

Built by Erik Boesen. Maintained by the Yale Computer Society.


api's Issues

Add admin and banned columns to users table

Currently, when checking that a user is permitted to do certain privileged operations (e.g. running the scraper), we just check if the user's CAS NetID is equal to my NetID (ekb33). We should add a boolean admin column to the users table that would allow users to be set as administrators, and then check if the current user is an admin when attempting to perform privileged operations, rather than checking against my hardcoded NetID. If you really want to be fancy, you could try to figure out how to add a decorator for this (like @admin_required, similar to how flask-cas and flask-login implement @login_required).

For banned, it would be good to be able to ban individual users who we don't want using the site. Just in case.
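
One possible shape for such a decorator is sketched below. It is framework-agnostic on purpose: the `User` class, the `get_user` callable and the use of `PermissionError` are assumptions made for illustration, not the project's actual code. In the Flask app, the check would more likely read the current user from the request context and abort with a 403 response.

```python
from dataclasses import dataclass
from functools import wraps


@dataclass
class User:
    # Hypothetical stand-in for the users table with the two proposed columns.
    netid: str
    admin: bool = False
    banned: bool = False


def admin_required(get_user):
    """Build a decorator that only lets non-banned admins call the function."""
    def decorator(view):
        @wraps(view)
        def wrapper(*args, **kwargs):
            user = get_user()
            if user is None or user.banned or not user.admin:
                raise PermissionError("admin privileges required")
            return view(*args, **kwargs)
        return wrapper
    return decorator
```

A privileged function would then be declared as `@admin_required(current_user_lookup)`, where `current_user_lookup` is whatever returns the logged-in user.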

If no people are found on the Face Book page and we abort scraping, delete the saved page file

This failure occurs when authentication has failed. Currently, if we change the passed token so that it's valid, and then immediately rerun the scraper, it'll use the existing page.html file from when the request failed, and the problem won't be fixed until we restart the Heroku dyno (which resets the ephemeral filesystem).

Add code to the failing case to delete the page.html file.
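
A rough sketch of that failing case follows. The function name `finish_face_book_scrape` and its arguments are hypothetical; only the page.html filename comes from the description above.

```python
from pathlib import Path


def finish_face_book_scrape(people, page_path="page.html"):
    """Delete the cached page when nobody was parsed; say whether to continue."""
    if people:
        return True
    # An empty result most likely means the saved page came from a failed login,
    # so remove it to force a fresh download on the next run.
    Path(page_path).unlink(missing_ok=True)
    return False
```

`missing_ok=True` keeps the cleanup from raising if the file was never written.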

Fix database lockup

Make scraper asynchronous

If we multithread the process, we could probably finish a lot faster.
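
Since scraping spends most of its time waiting on the network, a thread pool is one low-effort way to try this. The sketch below assumes a hypothetical `fetch` function that downloads and parses a single page; none of these names exist in the scraper today.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_concurrently(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Keeping the database writes in the main thread, after the results come back, avoids several threads competing for the same SQLite file.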

Add information about YCS partnership on about and splash pages

Since this isn't solely my project anymore, it would be nice to update the About page to explain that this is a YCS project. Maybe add a logo too. And then consider adding similar content to the pre-login homepage (splash.html) as well.

Give clearer error when a user has been banned, isn't eligible to view website, etc.

Currently we just abort their request with a vague error message. We should give an explanation of why they can't access the information, so it doesn't just look like the website broke.

Scrape more faculty information from department websites

Lots of academic departments at Yale have People pages that list (in an apparently somewhat consistent format) all the people (grad students, faculty, staff, etc.) in the department.

These websites have lots of extra information, such as:

Suffixes (M.D., Ph.D., etc.)
Links to personal or lab webpages
Full professorship titles (for example "Sterling Prof of Sociology, Director, Urban Ethnography Project; Prof African American Studies")
Pictures

Examples:
https://ling.yale.edu/people
https://cpsc.yale.edu/people
https://afamstudies.yale.edu/people
https://math.yale.edu/people
https://mcdb.yale.edu/people
https://medicine.yale.edu/anesthesiology/people/

Many more... full list here: https://www.yale.edu/academics/departments-programs

Fix 'Law School' and 'School of Law' being separate schools/organizations

Support filtering/searching by birthdays

ElasticSearch can return different results each search, causing different ordering for each page of results

Pretty much the title. If you do a broad search that returns many results ("Hopper", for instance), you may notice that some people are duplicated across pages, or possibly omitted. This is only an issue for very large searches (which few people do, apparently preferring to use filters), but it's a very obvious problem once you notice it. One solution might be to use ES's scroll feature, which keeps a consistent result set across pages.
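
A different approach from scrolling would be to make the ordering fully deterministic by adding a unique tiebreaker to the sort. This assumes the duplicates come from hits with equal relevance scores being returned in a different order on each request, which has not been confirmed. The sketch shows the idea on plain Python dicts rather than on a real ElasticSearch query; the `score` and `id` keys are hypothetical.

```python
def paginate(hits, page, page_size=20):
    """Return one zero-indexed page, ordered by descending score, then by id."""
    ordered = sorted(hits, key=lambda hit: (-hit["score"], hit["id"]))
    start = page * page_size
    return ordered[start:start + page_size]
```

Because every hit has a distinct position in this ordering, the same person cannot land on two pages, whatever order the hits arrive in.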

Consider scraping social media to find people's IG/Twitter profiles

Support sorting in request to API

/auth generated tokens will be rejected because they aren't added as keys

Make sure we can properly scrape people with the same name

Re-show residence filters once room numbers return to Face Book

The Face Book has removed all room numbers this semester. This may be because of us. It also may be because of the irregularity of COVID. For this reason, I hid the filters that use room numbers (building code, entryway, floor, etc.) in app/templates/index.html. If/when room numbers are put back, we should show these filters again.

Tag Eli Whitney students

In filters endpoint, support listing filters already applied and return results based on what other options would be supported

So, for example, you could pass {'school_code': ['YC']} and it would first filter down to just the options found on the resulting rows.

This will allow us to move towards building the filters call in JS from the front end rather than through jinja.
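
The behavior described above can be sketched on in-memory rows as follows. The function `remaining_options`, the `college` field and the sample values are made up for illustration; the real endpoint would run an equivalent query against the database.

```python
def remaining_options(rows, applied, fields):
    """Apply the given filters, then list the values still present per field."""
    matching = [
        row for row in rows
        if all(row.get(field) in allowed for field, allowed in applied.items())
    ]
    return {
        field: sorted({row[field] for row in matching if row.get(field) is not None})
        for field in fields
    }


rows = [
    {"school_code": "YC", "college": "Hopper"},
    {"school_code": "YC", "college": "Saybrook"},
    {"school_code": "XX", "college": None},  # hypothetical non-YC row
]
options = remaining_options(rows, {"school_code": ["YC"]}, ["college"])
```

Here `options` contains only the colleges that appear on the rows matching the YC filter.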

Separate web interface into a different repository

Use people endpoint from front end

Jesus, we should at least follow our own advice

Stop scraper if there aren't any students found to prevent a bad page load from emptying database

List properties that should be included in JSON serialization of object

Rather than excluding certain properties. This way we can make it ordered also.
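
A minimal sketch of the allowlist idea, assuming hypothetical field names and a plain object standing in for the model. Python dicts preserve insertion order, so the tuple also fixes the order of keys in the JSON output.

```python
import json
from types import SimpleNamespace

# Hypothetical allowlist; the real list would name the model's public columns.
SERIALIZED_FIELDS = ("netid", "first_name", "last_name", "email")


def to_ordered_dict(obj, fields=SERIALIZED_FIELDS):
    """Build a dict with only the listed attributes, in the listed order."""
    return {name: getattr(obj, name, None) for name in fields}


person = SimpleNamespace(
    netid="abc123", first_name="Ada", last_name="Lovelace",
    email="ada@example.edu", internal_note="never serialized",
)
payload = json.dumps(to_ordered_dict(person))
```

A newly added column then stays out of the API until someone lists it explicitly.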

Manage API tokens and keep track of who is using them for what applications

Allow copying email list with button

Support page_size option in API requests

To allow fetching a page of size other than the currently hardcoded 20
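
Whatever form the option takes, the raw value will need validating. The sketch below keeps 20 as the default and adds an upper bound so one request cannot ask for the whole database; `MAX_PAGE_SIZE` and the function name are assumptions.

```python
DEFAULT_PAGE_SIZE = 20  # the currently hardcoded value
MAX_PAGE_SIZE = 100     # hypothetical upper bound


def parse_page_size(raw):
    """Turn a raw request value into a page size between 1 and MAX_PAGE_SIZE."""
    if raw is None:
        return DEFAULT_PAGE_SIZE
    try:
        size = int(raw)
    except (TypeError, ValueError):
        raise ValueError("page_size must be an integer") from None
    if size < 1:
        raise ValueError("page_size must be at least 1")
    return min(size, MAX_PAGE_SIZE)
```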

Automatically put current user's profile first

Oftentimes people just want to see what their own profile does, so why not sort it to the top automatically? Or at least provide the option?

Use Directory email if not included in Face Book

Allow expanding room numbers for more explanation

Block non-undergrads from viewing website

Some people have school but not school_code, or organization but not organization_id

Create automatic tool to extract keys for scraper, such as a Chrome extension

Use more secure filenames for S3 images

Right now, someone could theoretically iterate through every number from 1-100,000 and get all the user images. Rather than using Yale's naming scheme for the files, we should generate a securely random name for the image file based on some properties of the user that aren't likely to change. For example, we can append UPI, image ID, netid, etc. together and then hash that somehow and name the file thus.
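
One way to do the hashing is sketched below. It goes slightly beyond the description above: a plain hash of guessable values could be recomputed by anyone who learns the scheme, so this version uses a keyed hash (HMAC-SHA256) with a server-side secret. The secret, the function name and the jpg extension are all assumptions.

```python
import hashlib
import hmac


def image_filename(secret_key, upi, image_id, netid, extension="jpg"):
    """Derive a stable, hard-to-guess filename from rarely changing properties."""
    message = "|".join(str(part) for part in (upi, image_id, netid)).encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    return f"{digest}.{extension}"
```

The same inputs always give the same name, so a rescrape overwrites the existing image rather than creating a duplicate.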

Refactor scraper into multiple files

It's huge, and once we implement #44, it's only gonna get huger. We should split different components into multiple files. Some ideas for divisions:

face_book
directory
department_websites
util (for example clean_* functions)

Some emeritus professors have longer, alphabetical-only netids

They usually get caught coincidentally right now, but theoretically if they started with a common prefix they could get missed.

Automate CAS login in scraper to remove need for manually providing cookies

Allow passing include parameter listing fields to include in response

When 'Other' is selected, it sometimes causes SQL error

Capitalized 'True' used in JSON on API docs page

Don't send query and filters props unless they're occupied

Fix inconsistency of g.me and g.user

Use fetch API for token request

Require API users to apply for access to certain fields

Some fields like address, residence, etc. are somewhat private. It could be nice to support a review system where people can be approved for access to those fields, but don't get them by default.
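
The serving side of such a review system could be as small as the sketch below, which drops restricted fields unless the caller's token has been approved for them. The set of restricted fields is taken from the examples above; the function name and the way approvals are stored are hypothetical.

```python
RESTRICTED_FIELDS = {"address", "residence"}  # examples named in this issue


def visible_fields(person, approved_fields=frozenset()):
    """Drop restricted fields unless the API token was approved for them."""
    return {
        key: value
        for key, value in person.items()
        if key not in RESTRICTED_FIELDS or key in approved_fields
    }
```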

Executing a search twice in sequence may result in duplicate students showing up

Add 'repeat search without filters' button

Clean up major names

Write more complete API documentation

I think it would be really cool to have a Swagger docs system like this, where you can test the API in-browser and see what the responses are like. At minimum we should document the filters endpoint and add a list of fields Person has.

Raise error when invalid filter passed

Disable clear filters button if no filters are selected

Currently you can just click it repeatedly and it'll just refresh over and over. Seems like a clumsy behavior, would be better if it just did nothing.

Come up with a new way to tell if people are on leave

Currently we just check if the graduation year of each student has increased since the saved copy we have from last year. Once this semester ends, we'll no longer have a reliable way to tell whether people are still on leave or if they only took one semester off. We'll either need to find another way to get leave data, or change the labeling to signify that this student HAS taken a leave but may not necessarily still be on it.
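
The comparison and the proposed relabeling can be sketched as follows. The function name and the label text are hypothetical; the point is that a later graduation year only shows that a leave happened, not that it is still ongoing.

```python
def leave_label(previous_year, current_year):
    """Compare last year's saved graduation year with the current one."""
    if previous_year is None or current_year is None:
        return None  # not enough data to compare
    if current_year > previous_year:
        return "has taken a leave"
    return None
```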

Split office column into office_building and office_room

Persist query in URL parameters

One kind of nice (although very bugged) thing that the Yale Face Book does is that when you run a search, the search information is stored in the URL. That way, if you want to send someone the results of your search, you can just copy and paste the URL, which would be something like:

yalies.io/?query=Some+name&filters=...
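
The round trip between search state and URL is sketched below in Python for consistency with the rest of the codebase, although the front end would do the equivalent in JavaScript. How the filters are encoded is not specified above; serializing them as JSON is purely an assumption of this sketch, as are both function names.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(query, filters, base="https://yalies.io/"):
    """Encode the search state as URL parameters (filters as JSON, an assumption)."""
    params = {"query": query}
    if filters:
        params["filters"] = json.dumps(filters, sort_keys=True)
    return base + "?" + urlencode(params)


def parse_search_url(url):
    """Recover (query, filters) from a URL produced by build_search_url."""
    params = parse_qs(urlsplit(url).query)
    query = params.get("query", [""])[0]
    filters = json.loads(params["filters"][0]) if "filters" in params else {}
    return query, filters
```

On page load, the front end would parse the URL and rerun the search, so a copied link reproduces the same results.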