shriram / gradescope-racket

Infrastructure to autograde Racket code on Gradescope

License: MIT License

Auto-Grading Racket Code in Gradescope

This repository enables educators to autograde student submissions written in Racket on Gradescope.

At least in its current structure, you will want to make a different copy of this code for each assignment. Since this codebase is very lightweight, I don't anticipate changing that in the near future.

Docker

Gradescope relies on Docker. You don't strictly have to know how to use Docker yourself, but it certainly helps.

If you're not comfortable with Docker, you can simply create the test suite (see below), upload the files to Gradescope, let them build the image, and test on their machines. Be aware that this can be quite slow, and if you make a mistake, you'll have to repeat the whole process.

Assuming you will use Docker locally:

  1. Make a local image for testing:
make base-image
make grader-image a=<tag>

There are two images due to staging. The first simply installs Racket, which is unlikely to change, but the second installs the assignment-specific grader content, which is likely to change quite a bit as you're developing it. We don't have to name both images, but it might make it a bit clearer to navigate (and it's sometimes also useful to go into a pristine base-image to test some things).

  2. When you're ready to test, run make grade a=<tag> s=<dir>, where <dir> is the sub-directory of tests that houses the (mock) student submission, and <tag> is the assignment tag on the grader scripts. See examples below.

Creating a Test Suite

The library currently depends on rackunit. If you're only used to the student language levels, don't worry, it's pretty similar. Stick to the test-* forms (as opposed to the check-* ones). You can read up more in the documentation.

Warning: Use only the test- forms, not the check- forms! The former work well in the presence of run-time errors caused by a check. The latter do not, so one erroneous test can truncate grading.

Create a file named grade-<tag>.rkt. (Leave all the other files alone unless you really know what you're doing.) File names of this format are handled automatically: given the a=<tag> input, the Makefile installs the matching grader into the Docker container and the Gradescope setup script.

Several files help you create your own grader:

  • a template file, which doesn't itself work: grade-template.rkt

  • three working example files: grade-sq.rkt, grade-two-funs.rkt, and grade-macros.rkt

If you use the template, be sure to edit only the UPPER-CASE parts of it.

You can have only one test suite per file/grading run, but test suites can be nested so this does not impose much of a limitation.
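To make this concrete, here is a hypothetical sketch of what a grader file might contain, combining the forms described in this document (define-var is explained below); consult grade-template.rkt and the working examples for the authoritative structure:

```racket
;; Hypothetical sketch of a grade-<tag>.rkt; the real skeleton lives in
;; grade-template.rkt. define-var (explained below) pulls the student's
;; sq out of their submission file.
(define-var sq from "code.rkt")

;; A single top-level test suite, with nesting used for organization.
(test-suite
 "sq"
 (test-equal? "a positive number" (sq 3) 9)
 (test-suite
  "Tricky Tests"
  (test-equal? "a negative number" (sq -1) 1)))
```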

Once you've created your test suite, you will probably want to test it locally first. This requires you to have Docker installed (but not necessarily know much about how to use it). See the instructions above. If you're skipping local testing, you can move on to deployment.

Naming Tests

Note that all the test-* forms take a name for the test. (This is a difference from the student languages.) To name it wisely, it is important to understand how this name is used.

When a test fails (either the answer is incorrect, or it has an error), students get a report of the failure. In the report, they see the name of the test, rather than the actual test or test output. This is because both of those can leak the content of the test. A cunning student can write a bunch of failing tests, and use those to exfiltrate your test suite, which you may have wanted to keep private.

Many course staff therefore like to choose names that help the student fix their error but do not give away the whole test. If, on the other hand, you want students to know the full test, you can simply use the test itself as the name. For instance, assuming you are testing a sq function that squares its input, you could write any of these, from least to most informative:

  (test-equal? "" (sq -1) 1)
  (test-equal? "a negative number" (sq -1) 1)
  (test-equal? "-1" (sq -1) 1)

Test suites can also have names, and be nested. For example:

(test-suite
  "Test suite 1"
  (test-equal? "" (sq -1) 1)
  (test-suite
    "Tricky Tests"
    (test-equal? "-1" (sq -1) 1)))

When test suites have names, the names of the suites and their tests are joined hierarchically when failures and errors are reported to Gradescope. In the example above, if the tricky test fails, the failure will be reported as Test suite 1:Tricky Tests:-1.

Why define-var

You might wonder why you have to use define-var to name the variables you want from the module, especially if the module already exports them. (If you don't use define-var you'll get an unbound identifier error.) There are two reasons:

  1. Some languages, like Beginning Student Language, do not export names. Therefore, the autograder has to “extract” them from the module, and wouldn't know which ones to extract.

  2. Simply accepting all the names from a module may be dangerous: a malicious student could export names that override the autograder's functionality (since this code is public, after all), thereby giving themselves the grade they want, not the one they deserve (unless it's a course on malicious behavior, in which case they've earned whatever they award themselves).

define-var lets you carefully limit which names the student provided you actually end up with. It will first look for the exported name and, only if it isn't found, extract from the module. This avoids the slight nuisance of having to return a student's assignment for having forgotten to export a name. (If this behavior isn't desired, talk to me and I can help you edit the source.)
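For instance (hypothetical names and file; see the example graders for the exact form), a grader might request only the one name it tests:

```racket
;; Only names listed via define-var become visible to the grader.
;; define-var first tries the module's exports and, only if the name
;; is not exported, extracts it from inside the module (which is what
;; makes BSL submissions work). Anything the student defined but you
;; did not request stays invisible, so a malicious provide cannot
;; shadow the grader's internals.
(define-var sq from "code.rkt")
```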

Why mirror-macro

Macros are not values. Therefore, we can't simply extract the macro as a value from the student's module. Instead, the library sets up a mirror of that macro in the testing module. Note that it currently has the following consequences (some are weaknesses, others aren't as clear):

  • All evaluation happens in the student's module. Thus, all references to names are resolved in that module. This means a local helper function cannot be referenced directly.

  • Because of the way mirror-macro is written, the mirroring step takes place each time a test uses a macro. This does not seem to be prohibitively costly, but it is at least annoying. If this proves to be a performance problem, let me know.

  • Because this code goes into the student's module, it uses the namespace local to that module, not the names exported. Practically speaking, this means it disregards rename-out in the student's module. Since we don't expect students will be using this feature if they are still at the level where this library makes sense, this should not be much of a problem.

Deploying to Gradescope

Run make zip a=<tag> to generate the Zip file that you upload to Gradescope. If you have broken your grader into multiple files, be sure to edit the Makefile to add those other files to the archive as well. (And don't forget to add them to the repository, too!)

Following Gradescope's instructions (see below), upload the Zip.

Gradescope will build a Docker image for you.

When it's ready, you can upload sample submissions and see the output from the autograder.

Examples

The directory tests/sq/ contains mock submissions of a sq function that squares its argument, while grade-sq.rkt is a test suite for it. Install that test suite, then check the various mock submissions:

make grader-image a=sq
make a=sq s=sq/s1
make a=sq s=sq/s2
...

where s1 is the first student submission, s2 is the second, etc. Focus on the JSON output at the end of these runs. See tests/sq/README.txt to understand how the submissions differ.

The default Makefile target (grade) automatically rebuilds the grader-image when needed, so running make grader-image yourself is usually unnecessary.

The directory tests/two-funs/ illustrates that we can test more than one thing from a program; grade-two-funs.rkt is its test suite:

make a=two-funs s=two-funs/s1

The directory tests/macros/ illustrates that we can also test for macros; grade-macros.rkt is its test suite:

make a=macros s=macros/s1

(In this directory, student programs are purposely called student-code.rkt to show that you can choose whatever names you want; they don't have to be code.rkt.)

Scoring

Gradescope demands that every submission be given a numeric score: either one in aggregate or one per test. I find this system quite frustrating; it imposes a numerics-driven pedagogy that I don't embrace on many assignments. Therefore, for now, the score given for the assignment is simple: the number of passing tests divided by the total number of tests. That's it. Someday I may look into adding weights for tests, etc., but I'm really not excited about any of it. Either way, please give your students instructions on how to interpret the numbers.
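The rule amounts to a one-line computation. A sketch (not the library's actual code), for a submission passing 27 of 28 tests:

```racket
;; Sketch of the scoring rule described above: passing tests divided
;; by total tests, scaled to a percentage. Not the library's code.
(define (score passing total)
  (* 100 (exact->inexact (/ passing total))))

(score 27 28)  ; roughly 96.43
```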

Open Issues

See https://github.com/shriram/gradescope-racket/issues/.

Is This Current?

This code was tested with Gradescope on May 23, 2020, using Racket 7.7. Since Gradescope's APIs are in some flux, you may encounter errors with newer versions; contact me and I'll try to help. In addition, Gradescope intends to make it easier to deploy autograders using Docker; when that happens, it would be nice to upgrade this system.

This code does not have any non-vanilla-Racket dependencies.

Gradescope Specs

Gradescope's "specs" are currently at:

https://gradescope-autograders.readthedocs.io/en/latest/getting_started/

https://gradescope-autograders.readthedocs.io/en/latest/specs/

https://gradescope-autograders.readthedocs.io/en/latest/manual_docker/

https://gradescope-autograders.readthedocs.io/en/latest/git_pull/

Acknowledgments

Thanks to Matthew Flatt, Alex Harsanyi, David Storrs, Alexis King, Matthias Felleisen, Joe Politz, and James Tompkin.

gradescope-racket's Issues

add an Examplar interface

It would be nice to reverse the logic and have students get automated feedback on their tests as opposed to their code. This requires a somewhat different setup.

support module exports

Because BSL does not export names, we currently use namespace tools to extract the name from inside the module. This works across many languages, but it has a subtle flaw: it does not respect rename-out. For instance, in

(provide (rename-out [three four]))

(define three 3)
(define four 4)

the value of four obtained by requiring the module is 3, but when obtained from the module's namespace, it is 4.

Therefore, either the name extraction should respect rename-out, or we should first try to obtain the name through regular means, and resort to namespace inspection only as a last resort. (A language without provide has no need for rename-out, so this should be safe.)
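The second strategy could be sketched as follows (hypothetical code, not what the library currently does), using dynamic-require's failure thunk to fall back to namespace inspection:

```racket
#lang racket

;; Hypothetical sketch of "regular means first, namespace as a last
;; resort": try the module's provided binding; only if the name is not
;; exported, reach inside the module's namespace.
(define (fetch-binding mod name)
  (dynamic-require
   mod name
   (λ ()                       ; name is not provided (e.g., BSL)
     (dynamic-require mod #f)  ; instantiate the module
     (parameterize ([current-namespace (module->namespace mod)])
       (namespace-variable-value name)))))
```

With this lookup order, the rename-out example above would yield 3 for four, matching what require produces.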

Discussion: What's the right way to add timeouts for testing?

My students will sometimes accidentally write infinite loops. As written, the entire test suite times out when I test such code, which isn't as helpful as it could be.

  1. What's the right way to flexibly provide good testing of students' code that may infinite loop? Here's what I'm currently using:
(require racket/engine)

(define timeout-symbol (gensym 'timeout))

(define-syntax-rule (run-w/timeout e)
  (let ([w/timeout (engine (λ (f) e))])
    (if (engine-run *TIMEOUT-MS* w/timeout)
        (engine-result w/timeout)
        ;; use list, not a quoted literal, so the gensym defined above
        ;; actually ends up in the result
        (list timeout-symbol "test timed out"))))

I currently need to wrap this around almost all my tests, which seems gross and wrong. In principle I'll want a timeout on almost any test, because students could accidentally loop in any program they write. It's a shame to have to set one single time-out.

@wilbowma, are you currently using https://gist.github.com/wilbowma/79330280f474ecc456916787028206cc ? Matthias suggests: https://www.mail-archive.com/[email protected]/msg32199.html

So what's the right API and a good standard implementation to support it?

  1. This seems like something that should be a part of or an extension to rackunit, rather than hand-rolled each time. I assume there's an obvious reason rackunit doesn't include some solution for this already: what is that reason?

better support for multi-function assignments

Some assignments (e.g., cs019 summer placement, or equivalent of cs019 data scripting) have lots of separate problems bundled into one homework. Students are likely to develop these incrementally. Should they get output for each function?

Right now the autograder halts when it can't find a definition of any of the required functions. This means they would get no feedback at all even if they're done with some problems. They can manually work around it with stub functions, but that would create busywork and produce irritating output, for no good use. (And may reveal something of the intended tests before they've even tried anything.)

There seem to be two alternatives:

  1. Break down the homework into several individual assignments. This seems quite annoying.

  2. Add support to the auto-grader to just skip tests associated with a name. This requires some redesign of the infrastructure, because name-extraction and testing are currently disjoint.

Nevertheless, the second option above seems to be the best way to go.

use better base image name

The Docker image name base-image is too generic and may clash with other images (or at the very least is not memorable six months hence). Give it a better name. Update Dockerfiles, Makefile, and readme.

collapse multiple `define-var`s

When obtaining multiple names, it's a nuisance to write define-var over and over again. Instead, provide a define-vars that enables extracting multiple names.
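A minimal sketch of such a form (assuming the (define-var name from "file.rkt") shape mentioned elsewhere in these issues; not part of the library today):

```racket
;; Hypothetical define-vars: expands into one define-var per name.
(define-syntax-rule (define-vars file name ...)
  (begin (define-var name from file) ...))

;; (define-vars "code.rkt" sq cube)
;; expands to
;; (begin (define-var sq from "code.rkt")
;;        (define-var cube from "code.rkt"))
```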

add git-based pull to speed up repo

To avoid Gradescope build times, there's a recommendation (also from Joe Politz) to pull the autograder from a git repository.

https://gradescope-autograders.readthedocs.io/en/latest/git_pull/

Then there's only one build, and every update is automatically seen. However, this seems extremely wasteful (a git pull for each student). Furthermore, it's unclear that it's easy to set up for this project, because it's meant to be generic: the people customizing it would have to be able to set up their own public git repositories, etc.

The much better solution is to upload an image to Gradescope instead, and let Docker's layer caching, etc., do its job. In the meantime, a user comfortable enough with following Gradescope's instructions above can probably figure out how to set this up for themselves.

Standalone autograders, autodetecting file names, package, etc

For whatever reason, I've found it difficult to wrap my head around making autograders with this: I think it's the copy-the-repo, edit-files workflow that is confusing me.

Working backwards, I started from wanting a single file to be an autograder for an assignment. They live in the repo for the course, and so should be self-contained, depending on the gradescope code in a library (and depending on shared pre-built docker images seems as bad / worse, given how poorly they seem to be maintained as infrastructure).

This is easy --- if you move the lib-grade.rkt code into a package (which I tentatively named autogradescope, though haven't published, as I wanted to open this issue first):

#!/usr/bin/env racket
#lang racket
(require autogradescope)

(require rackunit) 
...
;; (define-var ...
;; (define-test-suite ...

The next thing that caused trouble for me was the hardcoded file names -- students would often submit slight variations (capitalized, etc), and while one could say this is a learning opportunity, it doesn't strike me as an important one. So, rather than (define-var foo from "file.rkt"), I changed it so that it finds whatever file in the submission directory has the right extension (which is configurable, but defaults to .rkt). If students are submitting multiple files, obviously this doesn't work (and perhaps that was the original motivation for the current design?).

(set-submission-extension! ".rkt") ; this is the default, so not actually needed.
(define-var my-function) ; these are found in the first file in the submission with the correct extension. 
(define-var other-function)    
...

The last thing I changed was how testing of the autograders is done (i.e., on our own computers, before we send them to Gradescope). There is certainly value in running on the same image (or, hopefully, the same image) that runs on Gradescope. But I suspect most of us rely on the fact that Racket 8.9 is Racket 8.9 across platforms (as, indeed, we need our testing to match what our students are doing), so running natively should be "good enough" and can be a lot lighter weight: an environment variable can hardcode the path to the file where definitions are loaded from (and, when it is present, the results are printed to stdout):

$ SUBMISSION=./reference-soln.rkt ./run_autograder
{"score":"96.42857142857143","tests":[{"output":"Execution error in test named «blah»"}]}

Obviously, these are a bunch of changes, and some actively conflict with how this library currently works. At the same time, they aren't mutually exclusive (e.g., there can easily be two forms of define-var), and if you were interested in turning this into a library (to support my first goal: having the autograders be single files, not depending on prebuilt docker images), I could certainly do a more careful job merging. On the other hand, I'm also fine with having a fork (which is why I haven't called the library gradescope-racket). One thing though: could you stick a license on this code? :)
