
courseraresearchexports's Introduction

courseraresearchexports

This project is a library consisting of a command line interface and a client for interacting with Coursera's research exports. Up-to-date documentation of the data provided by Coursera for research purposes is available in the Partner Resource Center, in the Coursera Data Exports Guide.

Installation

To install this package, execute:

pip install courseraresearchexports

pip is a python package manager.

If you do not have pip installed on your machine, please follow the installation instructions for your platform.

If you experience issues installing with pip, we recommend using the Python 2.7 distribution of Anaconda and trying the above command again, or installing inside a virtualenv:

virtualenv venv -p python2.7
source venv/bin/activate
pip install courseraresearchexports

Note: the containers subcommand requires docker to already be installed on your machine. Please see the docker installation instructions for platform specific information.

Refer to Issues section for additional debugging around installation.

autocomplete

To enable tab autocomplete, please install argcomplete using pip install argcomplete and execute activate-global-python-argcomplete. Open a new shell and press tab for autocomplete functionality.

See the argcomplete documentation for more details.

Setup

Authorize your application using courseraoauth2client:

courseraoauth2client config authorize --app manage_research_exports

To use the containers functionality, a docker instance must be running. Please see the docker getting started guide for installation instructions for your platform.

Upgrade

If you have a previously installed version of courseraresearchexports, execute:

pip install courseraresearchexports --upgrade

This will upgrade your installation to the newest version.

Command Line Interface

The project includes a command line tool. Run:

courseraresearchexports -h

for a complete list of features, flags, and documentation. Similarly, documentation for the subcommands listed below is also available (e.g. for jobs) by running:

courseraresearchexports jobs -h

jobs

Submit a research export request or retrieve the status of pending and completed export jobs.

request

Creates a data export job request and returns the export request id. To create a data export request for all available tables in a course:

courseraresearchexports jobs request tables --course_id $COURSE_ID \
    --purpose "testing data export"

To find your course_id, you can use our Course API with the appropriate course_slug.

For example, if the course_slug is developer-iot, you can look up the course_id by making this request in a logged-in browser session:

https://api.coursera.org/api/onDemandCourses.v1?q=slug&slug=developer-iot

The response will be a JSON object containing an id field with the value:

iRl53_BWEeW4_wr--Yv6Aw

Note: the course slug is the part after /learn in your course URL. For https://www.coursera.org/learn/machine-learning, the slug is machine-learning.
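
The lookup above can be sketched in Python 3; the helper names parse_course_id and course_id_from_slug are illustrative, not part of this library, and the fetch requires network access:

```python
import json
import urllib.request

COURSE_API = "https://api.coursera.org/api/onDemandCourses.v1?q=slug&slug="

def parse_course_id(payload):
    """Pull the id field out of an onDemandCourses.v1 response body."""
    elements = payload.get("elements", [])
    return elements[0]["id"] if elements else None

def course_id_from_slug(slug):
    """Query the course API and return the course_id (needs network access)."""
    with urllib.request.urlopen(COURSE_API + slug) as resp:
        return parse_course_id(json.load(resp))
```

Note that for closed courses this unauthenticated endpoint returns an error, so make the request in a logged-in browser session instead.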

If you have a publicly available course, you can request the export using:

courseraresearchexports jobs request tables --course_slug $COURSE_SLUG \
    --purpose "testing data export"

Replace $COURSE_SLUG with your course slug (the part after /learn in the URL; for https://www.coursera.org/learn/machine-learning, the slug is machine-learning).

If a more limited set of data is required, you can specify which schemas to include in the export (e.g. the demographics and notebooks tables):

courseraresearchexports jobs request tables --course_id $COURSE_ID \
    --schemas demographics notebooks --purpose "testing data export"

You can see all the available export options using:

courseraresearchexports jobs request tables -h

Recommendations

1. Always request only the specific schemas you need by specifying them when requesting the export. For more information on the available tables/schemas, please refer to the Coursera Data Exports Guide.

2. When requesting exports for all courses in your institution, use a partner-level export rather than requesting individual course-level exports:

courseraresearchexports jobs request tables --partner_short_name $PARTNER_SHORT_NAME \
    --schemas demographics notebooks --purpose "testing data export"

Your partner_short_name can be found in the University Assets section of your institution's settings.

Note that the above command works only for publicly available partners. If you have your partner_id, you can request the export using:

courseraresearchexports jobs request tables --partner_id $PARTNER_ID \
    --schemas demographics notebooks --purpose "testing data export"

You can find your partner_id by querying the following API in a logged-in browser session:

https://www.coursera.org/api/partners.v1?q=shortName&shortName=$PARTNER_SHORT_NAME

If you are a data coordinator, you can request that user ids be linked across domains of the data export:

courseraresearchexports jobs request tables --course_id $COURSE_ID \
    --purpose "testing data export" --user_id_hashing linked

Data coordinators can also request clickstream exports:

courseraresearchexports jobs request clickstream --course_id $COURSE_ID \
    --interval 2016-09-01 2016-09-02 --purpose "testing data export"

By default, clickstream exports will cache results for days already exported. To ignore the cache and request exports for the entire date range, pass in the flag --ignore_existing.

Rate limits

We enforce rate limits on the number of exports that can be performed. The underlying export API returns a rate limit error message, which is printed when the command fails and explains why you were rate limited.

get_all

Lists the details and status of all data export requests that you have made:

courseraresearchexports jobs get_all

get

Retrieve the details and status of an export request:

courseraresearchexports jobs get $EXPORT_REQUEST_ID

download

Download a completed table or clickstream to your local destination:

courseraresearchexports jobs download $EXPORT_REQUEST_ID

clickstream_download_links

Due to the size of clickstream exports, we persist download links for completed clickstream export requests on Amazon S3. The clickstream data for each day is saved into a separate file and download links to these files can be retrieved by running:

courseraresearchexports jobs clickstream_download_links --course_id $COURSE_ID
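
A minimal sketch of fetching each daily file from the returned links, assuming one S3 URL per day; download_all and filename_for_link are illustrative names, not part of the CLI:

```python
import os
import urllib.parse
import urllib.request

def filename_for_link(url):
    """Name the local file after the last path segment of the S3 URL."""
    return os.path.basename(urllib.parse.urlparse(url).path)

def download_all(links, dest_dir):
    """Download every daily clickstream file into dest_dir."""
    for url in links:
        dest = os.path.join(dest_dir, filename_for_link(url))
        urllib.request.urlretrieve(url, dest)
```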

containers

create

Creates a docker container using the postgres image and loads export data into a postgres database on the container. To create a docker container from an export, first request an export using the jobs command. Then, using the $EXPORT_REQUEST_ID, create a docker container with:

courseraresearchexports containers create --export_request_id $EXPORT_REQUEST_ID

This will download the data export and load all the data into the database running on the container. This may take some time depending on the size of your export. To create a docker container with an already downloaded export (please decompress the archive first):

courseraresearchexports containers create --export_data_folder /path/to/data_export/

After creation, use the list command to check the status of the container and to view the container name, database name, address, and port for connecting to the database. Use db connect $CONTAINER_NAME to open a psql shell.

list

Lists the details of all the containers created by courseraresearchexports:

courseraresearchexports containers list

start

Start a container:

courseraresearchexports containers start $CONTAINER_NAME

stop

Stop a container:

courseraresearchexports containers stop $CONTAINER_NAME

remove

Remove a container:

courseraresearchexports containers remove $CONTAINER_NAME

db

connect

Open a shell to a postgres database:

courseraresearchexports db connect $CONTAINER_NAME

create_view

Create a view in the postgres database. We are planning to include commonly used denormalized views as part of this project. To create one of these views (e.g. the demographic_survey view):

courseraresearchexports db create_view $CONTAINER_NAME --view_name demographic_survey

If you have your own SQL script that you'd like to create as a view, run:

courseraresearchexports db create_view $CONTAINER_NAME --sql_file /path/to/sql/file/new_view.sql

This will create a view using the name of the file as the name of the view, in this case "new_view".
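
The naming rule above can be sketched as follows (view_name_from_sql_file is an illustrative helper, not part of the CLI):

```python
import os

def view_name_from_sql_file(path):
    """Derive the view name the CLI uses from a --sql_file path."""
    return os.path.splitext(os.path.basename(path))[0]
```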

Note: as user_id columns vary with partner and user id hashing, please refer to the exports guide for SQL formatting guidelines.

unload_to_csv

Export a table or view to a csv file. For example, if the demographic_survey view was created as in the section above, use this command to create a csv:

courseraresearchexports db unload_to_csv $CONTAINER_NAME --relation demographic_survey --dest /path/to/dest/
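
Once unloaded, the file can be read with the standard csv module. This sketch assumes the output file is named after the relation and includes a header row; read_unloaded_csv is an illustrative helper:

```python
import csv
import os

def read_unloaded_csv(dest_dir, relation):
    """Read <relation>.csv from dest_dir into a list of row dicts."""
    path = os.path.join(dest_dir, relation + ".csv")
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```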

list_tables

List all the tables present inside a dockerized database:

courseraresearchexports db list_tables $CONTAINER_NAME

list_views

List all the views present inside a dockerized database:

courseraresearchexports db list_views $CONTAINER_NAME

Using courseraresearchexports on a machine without a browser

Sometimes a browser is not available, which makes the OAuth flow impossible. This commonly occurs when users want to automate the data export process from an external machine.

To work around this, generate the access token initially on a machine with browser access (e.g., your laptop). The access token is serialized to your local file system at ~/.coursera/manage_research_exports_oauth2_cache.pickle.

Requests after the first use the refresh token flow, which does not require a browser. By copying the initial pickled access token to a remote machine, that machine can continue to request updated data.
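
For example, the cached token can be staged for transfer like this (stage_token_cache is an illustrative helper; the cache path is the one given above, and the actual copy to the remote machine would typically be done with scp):

```python
import os
import shutil

TOKEN_CACHE = os.path.expanduser(
    "~/.coursera/manage_research_exports_oauth2_cache.pickle")

def stage_token_cache(dest_dir, src=TOKEN_CACHE):
    """Copy the pickled token cache into dest_dir; returns the new path."""
    return shutil.copy(src, dest_dir)
```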

Bugs / Issues / Feature Requests

Please use the GitHub issue tracker to document any bugs or other issues you encounter while using this tool.

Developing / Contributing

We recommend developing courseraresearchexports within a python virtualenv. To get your environment set up properly, do the following:

virtualenv venv
source venv/bin/activate
python setup.py develop
pip install -r test_requirements.txt

Tests

To run tests, simply run nosetests or tox.

Code Style

Code should conform to pep8 style requirements. To check, simply run:

pep8 courseraresearchexports tests

Issues

If you face the following error when installing the psycopg2 package on macOS:

ld: library not found for -lssl
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'gcc' failed with exit status 1

Install the openssl package if it is not already installed, and point the linker at it:

brew install openssl
export LDFLAGS="-L/usr/local/opt/openssl/lib"

or, for OpenSSL 3 installed via Homebrew:

export LDFLAGS="-L/usr/local/opt/openssl@3/lib"


courseraresearchexports's Issues

DockerException: Error while fetching server API version

Hi, when I run the courseraresearchexports containers create in both the specified ways, I am getting the same error:

ERROR:root:Problem when running command. Sorry!

Traceback (most recent call last):
  File "c:\python27\lib\site-packages\courseraresearchexports\main.py", line 86, in main
    return args.func(args)
  File "c:\python27\lib\site-packages\courseraresearchexports\commands\containers.py", line 31, in create_container
    d = utils.docker_client(args.docker_url, args.timeout)
  File "c:\python27\lib\site-packages\courseraresearchexports\containers\utils.py", line 121, in docker_client
    version='auto')
  File "c:\python27\lib\site-packages\docker\client.py", line 99, in __init__
    self._version = self._retrieve_server_version()
  File "c:\python27\lib\site-packages\docker\client.py", line 124, in _retrieve_server_version
    'Error while fetching server API version: {0}'.format(e)
DockerException: Error while fetching server API version: (2, 'WaitNamedPipe', 'The system cannot find the file specified.')

Please advise. Thank you.

Hi, I was trying to run docker-compose build and get the following error. Can anyone help?

Traceback (most recent call last):
File "urllib3/connection.py", line 160, in _new_conn
File "urllib3/util/connection.py", line 84, in create_connection
File "urllib3/util/connection.py", line 74, in create_connection
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "urllib3/connectionpool.py", line 677, in urlopen
File "urllib3/connectionpool.py", line 392, in _make_request
File "http/client.py", line 1277, in request
File "http/client.py", line 1323, in _send_request
File "http/client.py", line 1272, in endheaders
File "http/client.py", line 1032, in _send_output
File "http/client.py", line 972, in send
File "urllib3/connection.py", line 187, in connect
File "urllib3/connection.py", line 172, in _new_conn
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f7d4c147d10>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "requests/adapters.py", line 449, in send
File "urllib3/connectionpool.py", line 727, in urlopen
File "urllib3/util/retry.py", line 446, in increment
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='0.0.0.0', port=2375): Max retries exceeded with url: /version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7d4c147d10>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "docker/api/client.py", line 214, in _retrieve_server_version
File "docker/api/daemon.py", line 181, in version
File "docker/utils/decorators.py", line 46, in inner
File "docker/api/client.py", line 237, in _get
File "requests/sessions.py", line 543, in get
File "requests/sessions.py", line 530, in request
File "requests/sessions.py", line 643, in send
File "requests/adapters.py", line 516, in send
requests.exceptions.ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=2375): Max retries exceeded with url: /version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7d4c147d10>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "docker-compose", line 3, in
File "compose/cli/main.py", line 81, in main
File "compose/cli/main.py", line 200, in perform_command
File "compose/cli/command.py", line 70, in project_from_options
File "compose/cli/command.py", line 153, in get_project
File "compose/cli/docker_client.py", line 43, in get_client
File "compose/cli/docker_client.py", line 170, in docker_client
File "docker/api/client.py", line 197, in init
File "docker/api/client.py", line 222, in _retrieve_server_version
docker.errors.DockerException: Error while fetching server API version: HTTPConnectionPool(host='0.0.0.0', port=2375): Max retries exceeded with url: /version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7d4c147d10>: Failed to establish a new connection: [Errno 111] Connection refused'))
[218106] Failed to execute script docker-compose

Return export type (valid and invalid) when creating a container with an incorrect export type.

Using a GRADEBOOK export type, the error message is uninformative.

ERROR:root:Sorry, container creation is only available with tables data exports.
ERROR:root:Problem when running command. Sorry!
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/courseraresearchexports/main.py", line 76, in main
    return args.func(args)
  File "/usr/local/lib/python2.7/site-packages/courseraresearchexports/commands/containers.py", line 40, in create_container
    args.export_request_id, docker_client=d, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/courseraresearchexports/containers/client.py", line 202, in create_from_export_request_id
    raise ValueError('Invalid Export Type.')
ValueError: Invalid Export Type.

Not Authorized

When I run the instruction " courseraresearchexports jobs request tables --course_slug machine-learning --purpose "testing data export" ", an error occurs with code "Not Authorized". Before this I have executed "courseraoauth2client config authorize --app manage_research_exports" and checked my authority. How should I solve this issue?

DockerException: Error while fetching server API version

Hi, when I run the command in the specified ways, I get the same error:

[screenshot]

I am using Windows 10 Home, so I installed Docker Toolbox instead of Docker. When I ran the above command, Docker Toolbox was running.
Anyone has some idea how to solve this? Thanks a lot.

setup.py psycopg2 requirement update

Hello,
I believe in your setup.py install_requires variable, 'psycopg2<=2.6.2' (line 46) is throwing an error when running on Anaconda Python 2.7 (which is recommended for this library). I changed it locally to 'psycopg2-binary' and am now able to run the setup.py file without error. Thanks!

"Coursera Data Exports Guide" Gitbook link is broken 404 not found

Coursera Data Exports Guide Gitbook link in the readme is broken (404 not found).
https://coursera.gitbooks.io/data-exports/content/introduction/programmatic_access.html

=============
Readme context:

Up to date documentation of the data provided by Coursera for research purposes is available on gitbooks , Coursera Data Exports Guide.

For more information on the available tables/schemas, please refer to the Coursera Data Exports Guide/

Display more informative error message for closed courses

When users attempt to use clickstream export with the command
courseraresearchexports jobs request tables --course_slug COURSE_SLUG ... when COURSE_SLUG is a closed course, we try to translate the course slug to course id using https://www.coursera.org/api/onDemandCourses.v1/?q=slug&slug=COURSE_SLUG.

We use the unauthenticated version of this API because we do not have access to the user's Coursera cookies when making this request. For closed courses, this request returns an error.

We should document this potential issue and provide an alternative (hit the API in your browser, and request with courseraresearchexports jobs request tables --course_id ...).

Ideally we will be able to open the above API with OAuth2 as well and not need this workaround.

Documentation for sudo/python packaging/docker nastiness

In our testing we've found some python environments to be ill behaved and requiring sudo to install with pip. We've also found Docker to require sudo by default. We should document these behaviors and note more sane alternatives.

More informative message when no clickstream export data exists

Even if no clickstream data for a particular scope and interval exists, a clickstream export will be reported as successful. However, clickstream_download_links will respond with an empty response if no data has ever been successfully exported. Throw an exception that no clickstream data is found with an informative message.

"courseraresearchexports jobs request tables" HTTP error 403

Running courseraresearchexports jobs request tables --course_slug machine-learning --purpose=research returns the following error message:

ERROR:root:Request to https://www.coursera.org/api/onDemandExports.v2/ with body:
        {"exportType": "RESEARCH_WITH_SCHEMAS", "scope": {"typeName": "courseContext", "definition": {"courseId": "Gtv4Xb1-EeS-ViIACwYKVQ"}}, "statementOfPurpose": "research", "anonymityLevel": "HASHED_IDS_WITH_ISOLATED_UGC_NO_PII", "schemaNames": ["demographics", "users", "course_membership", "course_progress", "feedback", "assessments", "course_grades", "peer_assignments", "discussions", "programming_assignments", "course_content", "ecb", "notebooks", "transactions"]}
received response:
        {"errorCode":"Not Authorized","message":null,"details":null}
Please contact [email protected] or #data-exports on Slack for assistance
ERROR:root:Problem when running command. Sorry!
Traceback (most recent call last):
  File "c:\dev\python27\lib\site-packages\courseraresearchexports\main.py", line 86, in main
    return args.func(args)
  File "c:\dev\python27\lib\site-packages\courseraresearchexports\commands\jobs.py", line 74, in request_tables
    export_request_with_metadata = api.post(export_request)[0]
  File "c:\dev\python27\lib\site-packages\courseraresearchexports\models\utils.py", line 40, in response_transformer_wrapper
    response.raise_for_status()
  File "c:\dev\python27\lib\site-packages\requests\models.py", line 844, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
HTTPError: 403 Client Error: Forbidden for url: https://www.coursera.org/api/onDemandExports.v2/

Tooling for clickstream data

Clickstream data can be forced into Postgres for exploratory purposes; this requires some thinking on how the CLI will interact with existing containers, and perhaps the ability to sample the data.

Also, ideally we can provide an interface to data processing tooling that handles clickstream data (Apache Spark?)

Problems authorizing courseraoauth2client

Hello, I am trying to authorize with oauth2 using the command:

courseraoauth2client config authorize --app manager_research_exports

but I get this message
ERROR:root:Problem when running command. Sorry!
Traceback (most recent call last):
File "/home/ronald/anaconda2/envs/snowflakes/lib/python2.7/site-packages/courseraoauth2client/main.py", line 64, in main
return args.func(args)
File "/home/ronald/anaconda2/envs/snowflakes/lib/python2.7/site-packages/courseraoauth2client/commands/config.py", line 36, in authorize
oauth2_instance = oauth2.build_oauth2(args.app, args)
File "/home/ronald/anaconda2/envs/snowflakes/lib/python2.7/site-packages/courseraoauth2client/oauth2.py", line 472, in build_oauth2
configure_app(app, cfg)
File "/home/ronald/anaconda2/envs/snowflakes/lib/python2.7/site-packages/courseraoauth2client/oauth2.py", line 517, in configure_app
with open(cfg_path, 'wb') as configfile:
IOError: [Errno 2] No such file or directory: '/home/ronald/.coursera/courseraoauth2client.cfg'

Thanks

What's the "parter_short_name"?

I had created my data export job request by command like this:
courseraresearchexports jobs request tables --course_slug $COURSE_SLUG --purpose "testing data export"
Then I downloaded the data and created a docker container. But when I wanted to create a view in the postgres database, I was always requested to provide a argument parter_short_name. I even don't know what it is. Anyone can help me?

400 error on jobs request tables

$ courseraresearchexports jobs request tables --purpose=research --partner_short_name=umich

ERROR:root:Request to https://www.coursera.org/api/onDemandExports.v2/ with body:
	{"exportType": "RESEARCH_WITH_SCHEMAS", "scope": {"typeName": "partnerContext", "definition": {"partnerId": {"maestroId": "3"}}}, "statementOfPurpose": "research", "anonymityLevel": "HASHED_IDS_WITH_ISOLATED_UGC_NO_PII", "schemaNames": ["demographics", "users", "course_membership", "course_progress", "feedback", "assessments", "course_grades", "peer_assignments", "discussions", "programming_assignments", "course_content"]}
received response:
	{"errorCode":null,"message":"JSON didn't validate","details":{"/scope/partnerId/maestroId":[{"message":"error.expected.jsnumber","args":[]}],"/scope/typeName":[],"/scope":[{"message":"Invoked `unimplementedReads` for org.coursera.export.research.ExportScope","args":[]}]}}
Please contact [email protected] or #data-exports on Slack for assistance
ERROR:root:Problem when running command. Sorry!
Traceback (most recent call last):
  File "/home/brooksch/.virtualenv/py2-dsmooc/local/lib/python2.7/site-packages/courseraresearchexports/main.py", line 86, in main
    return args.func(args)
  File "/home/brooksch/.virtualenv/py2-dsmooc/local/lib/python2.7/site-packages/courseraresearchexports/commands/jobs.py", line 74, in request_tables
    export_request_with_metadata = api.post(export_request)[0]
  File "/home/brooksch/.virtualenv/py2-dsmooc/local/lib/python2.7/site-packages/courseraresearchexports/models/utils.py", line 40, in response_transformer_wrapper
    response.raise_for_status()
  File "/home/brooksch/.virtualenv/py2-dsmooc/local/lib/python2.7/site-packages/requests/models.py", line 844, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
HTTPError: 400 Client Error: Bad Request for url: https://www.coursera.org/api/onDemandExports.v2/

python --version
Python 2.7.6
