Simulating an alternative grading system
1. What is reasonable?
Imagine that you are done with assigning points and “computing” grades from weighted averages.^{[1]} The points approach needs too many tweaks and modifications to produce a reasonable mapping between a single mangled number and a subject-matter expert's qualitative evaluation of student learning and performance.
Enter the different ways of grading a course that fall under the umbrella term of alternative grading.
2. The System
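| Grade     | DBM | ACDC | HWSW      | Xfactor |
|-----------|-----|------|-----------|---------|
| A         | ☐   | ☐    | ☐ ☐       | ☐ ☐ ☐   |
| B         | ☐ ☐ | ☐    | ☐ ☐       | ☐ ☐ ☐   |
| C         | ☐ ☐ | ☐ ☐  | ☐ ☐ ☐ ☐   |         |
| D         | ☐ ☐ | ☐ ☐  | ☐ ☐ ☐ ☐ ☐ |         |
| Available | 7   | 7    | 15        | 7       |
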
Notice that the number of checkboxes in each column is sometimes less than the Available number. 

To earn a given base letter, every item in that row and in all rows below it must be checked. Not completing the D row earns an F.

By default, items with deadlines (only HWSW here) are not eligible for a checkoff after the deadline.

Students are limited to N checkoff attempts per week.

Each item may be reattempted once with no penalty.

Every student starts the semester with 2 tokens. Tokens may be “spent” by a student for the following purposes:

Submit something past its deadline (if relevant).

Submit an extra item for evaluation in a given week.

Submit a second revision of an item. (← see later section)
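
The row rule can be sketched in a few lines of Python. This is an illustration only, under two assumptions that are not stated in the rules: the checkboxes in each row are additional to the rows below it (so requirements accumulate from D upward), and items within a category are interchangeable. The names `ROWS`, `CATEGORIES`, and `base_letter` are hypothetical.

```python
CATEGORIES = ("DBM", "ACDC", "HWSW", "Xfactor")

# Checkboxes per grade row, transcribed from the table above.
ROWS = {
    "D": {"DBM": 2, "ACDC": 2, "HWSW": 5, "Xfactor": 0},
    "C": {"DBM": 2, "ACDC": 2, "HWSW": 4, "Xfactor": 0},
    "B": {"DBM": 2, "ACDC": 1, "HWSW": 2, "Xfactor": 3},
    "A": {"DBM": 1, "ACDC": 1, "HWSW": 2, "Xfactor": 3},
}


def base_letter(completed):
    """Return the base letter for a dict of completed-item counts per category."""
    needed = dict.fromkeys(CATEGORIES, 0)
    letter = "F"  # default when the D row is incomplete
    for grade in ("D", "C", "B", "A"):
        # Assumption: each row adds to the requirements of the rows below it.
        for cat in CATEGORIES:
            needed[cat] += ROWS[grade][cat]
        if all(completed.get(cat, 0) >= needed[cat] for cat in CATEGORIES):
            letter = grade
        else:
            break  # a row is incomplete, so no higher letter is possible
    return letter


print(base_letter({"DBM": 7, "ACDC": 6, "HWSW": 13, "Xfactor": 6}))
```

Under this reading, a student who has completed everything except one Xfactor item needed for the A row would land on a B, since the B row and all rows below it are complete.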

3. Centering the course design
It is possible to create all assignments (or specifications for same) for all categories for the entire course, and then make the rules. This is an existing course that is on its 10th iteration and already has a library of labs, assignments, and other activities, so it is not necessary to start from scratch.
But still, it is confidence-inspiring to start from a first-principles estimate of the numbers.
 How many checkoff items should there reasonably be?


Too many items add logistics unrelated to the course content, and more (?) work for me per week to get feedback back quickly.

Too few make it more difficult to create a set of items that do a good job of student evaluation, and reduce granularity.

Come up with some Fermi estimates:

14 full weeks of regular meetings

DBMs require significant work across several domains and involve lab time. Every other week seems reasonable and also syncs with the scheduled laboratory sessions. → 7 items

ACDCs need more time to complete than a typical homework set and involve higher levels of cognitive processing and knowledge, so they benefit from a few opportunities to “sleep on it.” → 7 items

HWSWs are not large time investments, but come at a regular pace. One per week seems reasonable, and there are 15 calendar weeks. → 15 items

Xfactors are eXtras, but a 2-week interval seems fine. → 6 items
→ so we have 35 items.
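
The arithmetic behind that total, and the per-week pace it implies, can be checked with a short sketch. The dictionary simply restates the four estimates above; the variable names are illustrative.

```python
# Item counts per category, as estimated in the list above.
items = {"DBM": 7, "ACDC": 7, "HWSW": 15, "Xfactor": 6}
weeks = 14  # full weeks of regular meetings

total = sum(items.values())
per_week = total / weeks  # average number of new items released per week

print(f"{total} items over {weeks} weeks = {per_week} new items per week")
```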
Each attempt is either checked off or not, and is marked with a 4-level scheme. An E or M earns a completion. An R or N is deemed not completed, and the student can reattempt the item. An R comes with appropriate and helpful feedback on how to improve. Some reattempts are fresh problems on the same topics and others are revisions of the prior submission, as appropriate for the item.
 What is “good enough”?

I am satisfied that it shows evidence of correct understanding, or otherwise professional-grade work, and is not missing something important. This is a judgement call by an experienced subject-matter expert and educator.
 What are the typical expectations for this?


Reasonable workload per week.

We want to target an 85% success rate to maximize the rate of learning / time.^{[3]}

 Does this quantity and ruleset behave well for most students?


If so, then the instructor’s task is simply to flesh out this number of items spread across time.

Workload can be gauged by simply asking students.

Success rate is ideally calibrated per individual, but will be assessed across the total.

4. Simulator
This is Engineering, so we can take action to get some approximate answers to this question. So, let’s build a simulator to see how a typical semester would play out.
Details:

2.5 items newly available per week for 14 weeks

our 35 items total


3 submissions max per person per week (new and/or revisions)

Cap the grading load for me.

Prevent turning in everything at the end of the semester.


85% likelihood of success for an attempt

3 tokens, all used for extra (re)submissions on finals week
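
A minimal sketch of how these details could fit together follows. It is not the code in checkoffsim.py; it is a simplified stand-in built on several assumptions: items are released in an alternating 2, 3, 2, 3 pattern, retries are attempted before new items, each item allows one retry, and "finals week" is modeled as extra capacity in week 14. All function and constant names are hypothetical.

```python
import random
from collections import Counter

N_ITEMS = 35        # total checkoff items
N_WEEKS = 14        # weeks of regular meetings
WEEKLY_CAP = 3      # submissions allowed per week (new and/or revisions)
P_SUCCESS = 0.85    # chance that any single attempt is checked off
TOKENS = 3          # assumed: all spent as extra submissions in the last week


def release_schedule(n_items=N_ITEMS, n_weeks=N_WEEKS):
    """New items per week (2, 3, 2, 3, ...), averaging n_items / n_weeks."""
    counts, released = [], 0
    for week in range(1, n_weeks + 1):
        target = n_items * week // n_weeks  # cumulative items out by this week
        counts.append(target - released)
        released = target
    return counts


def simulate_student(revise, rng):
    """Return (items checked off, total attempts) for one simulated semester."""
    attempts = [0] * N_ITEMS
    done = [False] * N_ITEMS
    available = 0
    for week, new_items in enumerate(release_schedule(), start=1):
        available += new_items
        budget = WEEKLY_CAP + (TOKENS if week == N_WEEKS else 0)
        queue = []
        if revise:
            # Assumed priority: retry missed items before starting new ones.
            queue += [i for i in range(available)
                      if attempts[i] == 1 and not done[i]]
        queue += [i for i in range(available) if attempts[i] == 0]
        for i in queue[:budget]:
            attempts[i] += 1
            done[i] = rng.random() < P_SUCCESS
    return sum(done), sum(attempts)


def run(revise, trials=1000, seed=0):
    """Tally the number of checked-off items over many simulated students."""
    rng = random.Random(seed)
    return Counter(simulate_student(revise, rng)[0] for _ in range(trials))


for checked, count in sorted(run(revise=True).items(), reverse=True):
    print(f"{100 * checked / N_ITEMS:5.1f} % {count}")
```

Calling `run(revise=False)` models the student who never revises. Because a seeded `random.Random` is used, repeated runs give the same tallies, though they will not match the author's numbers exactly.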
4.1. Objectives
I am interested in two student behaviors and one situation:

Student submits everything assigned, and always revises if possible.
This person is doing all that is asked of them and actively works to make improvements when needed. Combined with the high standard for a successful submission, this person should receive an A. 
Student submits everything that is assigned but never makes revisions.
This person is not blowing off work, but also has areas to improve. With an 85% yes rate, this would correspond to an 85% score in the traditional points system. The interesting part here is the distribution of outcomes for this behavior. It also serves as a reference check that the simulator works correctly. 
What happens at the end of the semester when there are 2.5 more new items and students may be using their 3/week attempts and tokens for a backlog of retries?
This is a recipe for unnecessary stress, and possibly revolt and bad feelings that compel a last-minute policy change.
Not turning in assigned items, especially if frequent, is a behavior that isn’t compatible with professional engineering success, so a lower grade is appropriate. Conversely, it would be regrettable if the system had an edge case (e.g. extra credit) that yields an inflated B or even A.
Selecting the number of items required at each grade level allows tuning the lower grade levels, and seems, a priori, the part most likely to be tweaked after reflection on a full course of experience.

See the code at checkoffsim.py
4.2. Results
4.2.1. Always revise
| ✅ %  | Count |
|-------|-------|
| 100   | 438   |
| 97.1  | 341   |
| 94.3  | 144   |
| 91.4  | 51    |
| 88.6  | 20    |
| 85.7  | 5     |
| 82.9  | 1     |
From Table 2, 97% of students receive some flavor of A in the course and everyone receives at least a B.
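
The shares quoted from Table 2 can be reproduced with a short tally. The letter cutoffs used here (90 / 80 / 70 / 60 percent of items checked off) are an assumption made for illustration; the post does not state how the percentage maps to a letter in the simulation.

```python
from collections import Counter

# Percent of items checked off -> number of simulated students (Table 2).
table2 = {100.0: 438, 97.1: 341, 94.3: 144, 91.4: 51, 88.6: 20, 85.7: 5, 82.9: 1}


def letter(pct):
    """Map a completion percentage to a letter using assumed cutoffs."""
    for cutoff, grade in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if pct >= cutoff:
            return grade
    return "F"


shares = Counter()
for pct, count in table2.items():
    shares[letter(pct)] += count

total = sum(table2.values())
for grade, count in shares.items():
    print(f"{grade}: {100 * count / total:.1f} %")
```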
The simulator outputs the specific results for each random trial.

x - checked off
r - missed and eligible for retry
. - missed twice and locked out
_ - no attempt
The number of attempts is below each of the 35 items, followed by the total attempts during the course. This is at most 44 because the first week only has 2 items to attempt.
xxxxxxxxxxxxxxxxxxx.xxxxxxxxxxxxxxx > x:34 .:1
21211111111211211112111111111111111 (40) 1:30 2:5 97.1 %

xxxxxxxxxxxx.x.xxxxxxxxxxxxxxxxxxxr > x:32 r:1 .:2
12112211111121211111112111211211211 (44) 1:26 2:9 91.4 %

*********************************__ > *:33 _:2
11112122211211111111111221121122200 (44) 0:2 1:22 2:11 94.3 %
The first is a student that needs to retry several items and struggles with one topic especially. They receive an A by demonstrating achievement of essentially all of the course’s objectives.
The second and third students require twice as many reattempts as the first student, demonstrating achievement but at a slower pace. The third student did not have the opportunity to attempt the last two items of the course, spending the final normal attempts and tokens on retries.
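
For readers who want to post-process such records, here is a small sketch of how the summary figures can be derived from the two strings in a record. The function name `summarize` is hypothetical, and the short strings in the demo are made up rather than taken from a real run.

```python
from collections import Counter


def summarize(status, attempts):
    """Summarize one trial from its per-item status and attempt-count strings."""
    if len(status) != len(attempts):
        raise ValueError("status and attempts must cover the same items")
    outcomes = Counter(status)      # how many items ended in each state
    tries = Counter(attempts)       # how many items took 0, 1, or 2 attempts
    total_attempts = sum(int(digit) for digit in attempts)
    percent = round(100 * outcomes["x"] / len(status), 1)  # share checked off
    return outcomes, tries, total_attempts, percent


# Made-up five-item example: two checked off, one locked out,
# one awaiting a retry, one never attempted.
outcomes, tries, total_attempts, percent = summarize("xx.r_", "12210")
print(dict(outcomes), dict(tries), total_attempts, percent)
```

Applied to the first record above, the same function should recover the counts printed after the `>` marker and the total in parentheses.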
4.2.2. Never revise
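| ✅ %  | Count |
|-------|-------|
| 100.0 | 1     |
| 97.1  | 16    |
| 94.3  | 73    |
| 91.4  | 119   |
| 88.6  | 175   |
| 85.7  | 188   |
| 82.9  | 171   |
| 80.0  | 124   |
| 77.1  | 67    |
| 74.3  | 39    |
| 71.4  | 20    |
| 68.6  | 4     |
| 65.7  | 1     |
| 62.9  | 1     |
| 60.0  | 1     |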
From Table 3:

21% receive an A

66% receive a B

13% receive a C

All pass the course

The distribution is centered around 85%, as expected.
xxxxxxxxrxxxxxrxxxxxrrxxxxxxxrxxxxx > x:30 r:5 11111111111111111111111111111111111 (35) 1:35 85.7 %
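The analytic distribution of scores for this behavior is simply a binomial distribution with probability mass function

$$P(k) = \binom{n}{k}\, p^{k} (1-p)^{n-k},$$

where n = 35 is the number of items, p = 0.85 is the probability that an attempt succeeds, and k is the number of items checked off.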
It is technically redundant to simulate this behavior.^{[4]} But, that is the point of running this scenario: to compare the simulator against input parameters where the answer is known by alternate means.
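
That comparison can be sketched directly. The snippet below evaluates the standard binomial probability mass function for n = 35 items and p = 0.85, scales it to 1000 trials, and prints it next to the counts from Table 3. Converting each percentage in the table to a whole number of checked-off items (for example, 85.7 % to 30 of 35) is an inference from the table, and the names used are illustrative.

```python
from math import comb

N_ITEMS = 35      # items, each attempted exactly once
P_SUCCESS = 0.85  # per-attempt success probability
TRIALS = 1000     # total of the Count column in Table 3


def binom_pmf(k, n=N_ITEMS, p=P_SUCCESS):
    """Probability of exactly k successes in n independent attempts."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Table 3 counts, keyed by number of items checked off (inferred from the %).
observed = {35: 1, 34: 16, 33: 73, 32: 119, 31: 175, 30: 188, 29: 171,
            28: 124, 27: 67, 26: 39, 25: 20, 24: 4, 23: 1, 22: 1, 21: 1}

print("items  percent  observed  expected")
for k in sorted(observed, reverse=True):
    expected = TRIALS * binom_pmf(k)
    print(f"{k:5d}  {100 * k / N_ITEMS:7.1f}  {observed[k]:8d}  {expected:8.1f}")
```

If the simulator is working as intended, the observed column should track the expected column to within ordinary sampling noise.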