Grading check-off simulator

Simulating an alternative grading system

1. What is reasonable?

Imagine that you are done with assigning points and “computing” grades from weighted averages.^[1] A points system needs too many tweaks and modifications before a single mangled number maps reasonably onto a subject-matter expert’s qualitative evaluation of student learning and performance.

Enter the different ways of grading a course that fall under the umbrella term of alternative grading.

2. The System

Table 1. Letter grading scheme
Grade	DBM	ACDC	HWSW	X-factor
A	☐	☐	☐ ☐	☐ ☐ ☐
B	☐ ☐	☐	☐ ☐	☐ ☐ ☐
C	☐ ☐	☐ ☐	☐ ☐ ☐ ☐	-
D	☐ ☐	☐ ☐	☐ ☐ ☐ ☐ ☐	-
Available	7	7	15	7

Notice that the number of checkboxes in each column is sometimes less than the Available number.

To earn a given base letter, all items must be checked for that row and for all rows below, with an F for not completing the D row completely.
By default, items with deadlines (only HWSW here) are not eligible for a check-off after the deadline.
Students are limited to N check-off attempts per week.
Each item may be re-attempted once with no penalty.
Every student starts the semester with 2 tokens. Tokens may be “spent” by a student for the following purposes:
- Submit something past its deadline (if relevant).
- Submit an extra item for evaluation in a given week.
- Submit a second revision of an item. (← see later section)
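
The "that row and all rows below" rule means the requirements accumulate from D upward. As a rough sketch, the per-category totals implied by Table 1 can be tabulated as follows. The box counts are read off the table; the names `ROWS` and `cumulative_requirements` are illustrative only and not taken from any course material.

```python
# Boxes per row of Table 1, in column order DBM, ACDC, HWSW, X-factor.
# A "-" in the table is entered as 0.
ROWS = {
    "A": (1, 1, 2, 3),
    "B": (2, 1, 2, 3),
    "C": (2, 2, 4, 0),
    "D": (2, 2, 5, 0),
}


def cumulative_requirements(rows):
    """Total check-offs needed per category for each base letter.

    A letter requires its own row plus every row below it, so the
    totals accumulate from D upward.
    """
    totals = {}
    running = (0, 0, 0, 0)
    for grade in reversed(list(rows)):  # D, C, B, A
        running = tuple(r + n for r, n in zip(running, rows[grade]))
        totals[grade] = running
    return totals


if __name__ == "__main__":
    for grade, need in cumulative_requirements(ROWS).items():
        print(grade, need, "total:", sum(need))
```

Under this reading an A needs 7 DBM, 6 ACDC, 13 HWSW, and 6 X-factor check-offs (32 in all), and a D needs 9.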

This is for an electronics course. The categories are:

DBM (design, build, measure): replaces traditional lab “experiments” and requires a three-way match between hand calculations, computer simulations, and measurements of a physical prototype’s performance, within an appropriate tolerance.
ACDC (analysis challenge or design challenge): large derivations of major concepts, or designs that meet challenging specifications (without the full end-to-end process of a DBM).
HWSW (homework and skills work): regular practice work using the lower cognitive process levels (remember, understand, and apply).
X-factor: individual-specific and course-related readings, side projects, etc. You should demonstrate some level of curiosity-driven action-taking for an “A”-level grade to have appropriate meaning.

3. Centering the course design

It is possible to create all assignments (or specifications for same) for all categories for the entire course, and then make the rules. This is an existing course that is on its 10th iteration and already has a library of labs, assignments, and other activities, so it is not necessary to start from scratch.

But still, it is confidence-inspiring to start from a first-principles estimate of the numbers.

How many check-off items should there reasonably be?

Too many items add more logistics unrelated to the course content, and more (?) work for me per week to return feedback quickly.
Too few make it more difficult to create a set of items that evaluates students well, and reduce granularity.

Come up with some Fermi estimates:

14 full weeks of regular meetings
DBMs require significant work across several domains and involve lab time. Every other week seems reasonable, and also syncs with the scheduled laboratory sessions. 7 items
ACDCs involve higher levels of cognitive processing and knowledge, so they need more time to complete than a typical homework set and would benefit from a few opportunities to “sleep on it.” 7 items
HWSWs are not large time investments, but are at a regular pace. One per week seems reasonable and there are 15 calendar weeks. 15 items
X-factors are eXtras, but a 2-week interval seems fine. 6 items

→ so we have 35 items.
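
The arithmetic behind that total, and the release rate it implies, is small enough to check directly. This is only a sketch of the sum; the dictionary name `items` is illustrative.

```python
WEEKS = 14  # full weeks of regular meetings

# Item counts from the estimates above.
items = {"DBM": 7, "ACDC": 7, "HWSW": 15, "X-factor": 6}

total = sum(items.values())
per_week = total / WEEKS

print(f"total items: {total}")              # total items: 35
print(f"new items per week: {per_week:.1f}")  # new items per week: 2.5
```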

Figure 1. The EMRN Rubric ^[2]

Each attempt is marked on a 4-level scheme and is either checked off or not. An E or M earns a completion. An R or N counts as not completed, and the student can re-attempt the item. An R comes with appropriate and helpful feedback on how to improve. Some re-attempts are fresh problems on the same topics and others are revisions of the prior submission, as appropriate for the item.

What is “good enough”? I am satisfied that the work shows evidence of correct understanding, or is otherwise professional-grade, and is not missing something important. This is a judgement call by an experienced subject-matter expert and educator.

What are the typical expectations for this?

Reasonable workload per week.
We want to target an 85% success rate to maximize learning per unit time.^[3]

Does this quantity and ruleset behave well for most students?

If so, then the instructor’s task is simply to flesh out this number of items spread across the semester.
Workload can be gauged by simply asking students.
Success rate is ideally calibrated per individual, but will be assessed across the total.

4. Simulator

This is Engineering, so we can take action to get some approximate answers to this question. So, let’s build a simulator to see how a typical semester would play out.

Details:

2.5 items newly available per week for 14 weeks
- our 35 items total
3 submissions max per person per week (new and/or revisions)
- Cap the grading load for me.
- Prevent turning in everything at the end of the semester.
85% likelihood of success for an attempt
3 tokens, all used for extra (re)submissions on finals week
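
To make these details concrete, here is a minimal sketch of such a simulator. It is not the author's script, and several choices are assumptions made for illustration: releases alternate 2, 3, 2, 3, ... items per week; a student works oldest item first; a missed item can only be retried in a later week; each item allows one re-attempt; and the three tokens simply add three submissions to the final week. All function and constant names are hypothetical.

```python
import random
from collections import Counter

N_ITEMS = 35        # total check-off items
WEEKS = 14          # full weeks of regular meetings
WEEKLY_CAP = 3      # submissions allowed per week
P_SUCCESS = 0.85    # chance that any single attempt is checked off
TOKENS = 3          # extra submissions, all spent in the final week
MAX_TRIES = 2       # first attempt plus one free re-attempt


def released_by(week):
    """Items available by the end of `week` (1-based), at 2.5 per week.

    Assumption: releases alternate 2, 3, 2, 3, ... so week 1 has 2 items.
    """
    return min(N_ITEMS, (5 * week) // 2)


def simulate_student(revise, rng):
    """Return (checked, tries) lists for one simulated semester."""
    checked = [False] * N_ITEMS
    tries = [0] * N_ITEMS
    limit = MAX_TRIES if revise else 1
    for week in range(1, WEEKS + 1):
        budget = WEEKLY_CAP + (TOKENS if week == WEEKS else 0)
        # The work list is fixed at the start of the week, oldest item
        # first, so a miss cannot be retried until a later week.
        todo = [i for i in range(released_by(week))
                if not checked[i] and tries[i] < limit]
        for i in todo[:budget]:
            tries[i] += 1
            checked[i] = rng.random() < P_SUCCESS
    return checked, tries


def run_trials(revise, n_trials=1000, seed=0):
    """Histogram of check-off counts over many simulated students."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_trials):
        checked, _tries = simulate_student(revise, rng)
        counts[sum(checked)] += 1
    return counts


if __name__ == "__main__":
    for revise in (True, False):
        print("always revise" if revise else "never revise")
        hist = run_trials(revise)
        for k in sorted(hist, reverse=True):
            print(f"{100 * k / N_ITEMS:5.1f}  {hist[k]}")
```

With these assumptions a student can make at most 2 + 12 × 3 + 6 = 44 attempts, and a student who never revises attempts every item exactly once. Exact counts depend on the seed and on the ordering policy, so they should not be expected to match any particular published run.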

4.1. Objectives

I am interested in two student behaviors and one situation:

Student submits everything assigned, and always revises if possible.
This person is doing all that is asked of them and actively works to make improvements when needed. Combined with the high standard for a successful submission, this person should receive an A.
Student submits everything that is assigned but never makes revisions.
This person is not blowing off work, but also has areas to improve. With an 85% yes rate, this would correspond to an 85% score in the traditional points system. Interesting here is the distribution of outcomes for this behavior. It is also the reference case for checking that the simulator works correctly.
What happens at the end of the semester when there are 2.5 more new items and students may be using their 3/week attempts and tokens for a backlog of retries?
This is a recipe for unnecessary stress, and possibly for enough revolt and bad feelings to compel a last-minute policy change.

Not turning in assigned items, especially if frequent, is a behavior that isn’t compatible with professional engineering success, so a lower grade is appropriate. Conversely, it would be regrettable if the system had an edge case (e.g. extra credit) that yields an inflated B or even A.

The number of items required at each grade level is the knob for tuning the lower grade levels, and a priori it seems the setting most likely to be tweaked after reflecting on a full course of experience.

See the code at checkoff-sim.py

4.2. Results

4.2.1. Always revise

Table 2. Always revise
✅ %	Count
100	438
97.1	341
94.3	144
91.4	51
88.6	20
85.7	5
82.9	1

From Table 2, 97% of students receive some flavor of A in the course and everyone receives at least a B.

The simulator outputs the specific results for each random trial.

x — checked-off
r — missed and eligible for retry
. — missed twice and locked-out
_ — no attempt

The number of attempts is below each of the 35 items, followed by the total attempts during the course. The total is at most 44: 3 submissions in each of 14 weeks plus 3 tokens gives 45, less one because the first week only has 2 items to attempt.

Diligent and good

xxxxxxxxxxxxxxxxxxx.xxxxxxxxxxxxxxx  --> x:34 .:1
21211111111211211112111111111111111 (40) 1:30 2:5
97.1 %

Diligent and eventually-gets-it

xxxxxxxxxxxx.x.xxxxxxxxxxxxxxxxxxxr  --> x:32 r:1 .:2
12112211111121211111112111211211211 (44) 1:26 2:9
91.4 %

Diligent and exhausted attempts

*********************************__  --> *:33 _:2
11112122211211111111111221121122200 (44) 0:2 1:22 2:11
94.3 %

The first is a student that needs to retry several items and struggles with one topic especially. They receive an A by demonstrating achievement of essentially all of the course’s objectives.

The second and third students require about twice as many re-attempts as the first student (9 and 11 versus 5), demonstrating achievement but at a slower pace. The third student did not have the opportunity to attempt the last two items of the course, spending the final normal attempts and tokens on retries.
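
The two-line trial format is easy to tally by machine. The sketch below takes the status line and the attempts line of one trial and recomputes the summary figures. The function name `summarize` is hypothetical, and the input is the "eventually-gets-it" trial shown above.

```python
from collections import Counter


def summarize(status, attempts):
    """Summarize one trial from its two output lines.

    status:   one character per item (x, r, ., _)
    attempts: one digit per item, the number of attempts made
    """
    if len(status) != len(attempts):
        raise ValueError("status and attempts must cover the same items")
    marks = Counter(status)
    tries = Counter(int(d) for d in attempts)
    return {
        "marks": dict(marks),
        "tries": dict(tries),
        "total_attempts": sum(int(d) for d in attempts),
        "percent": round(100 * marks["x"] / len(status), 1),
    }


example = summarize(
    "xxxxxxxxxxxx.x.xxxxxxxxxxxxxxxxxxxr",
    "12112211111121211111112111211211211",
)
print(example["total_attempts"], example["percent"])
```

For that trial the tally agrees with the printed summary: 44 attempts and 91.4 % checked off.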

4.2.2. Never revise

Table 3. Never revise
✅ %	Count
100.0	1
97.1	16
94.3	73
91.4	119
88.6	175
85.7	188
82.9	171
80.0	124
77.1	67
74.3	39
71.4	20
68.6	4
65.7	1
62.9	1
60.0	1

From Table 3:

21% receive an A
66% receive a B
13% receive a C
All pass the course
The distribution is centered around 85%, as expected.
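
The letter shares quoted for both tables can be recomputed from the counts. The text does not state which cutoffs were used, so the sketch below assumes the conventional 90/80/70/60 percent boundaries applied to the overall check-off percentage; that assumption happens to reproduce the quoted shares. The names `CUTOFFS`, `letter`, and `grade_shares` are illustrative.

```python
# Counts copied from Tables 2 and 3, keyed by items checked off (of 35).
ALWAYS_REVISE = {35: 438, 34: 341, 33: 144, 32: 51, 31: 20, 30: 5, 29: 1}
NEVER_REVISE = {35: 1, 34: 16, 33: 73, 32: 119, 31: 175, 30: 188, 29: 171,
                28: 124, 27: 67, 26: 39, 25: 20, 24: 4, 23: 1, 22: 1, 21: 1}

# Assumed cutoffs in percent; the text does not state them.
CUTOFFS = (("A", 90), ("B", 80), ("C", 70), ("D", 60))


def letter(k, n_items=35):
    """Letter for k check-offs out of n_items under the assumed cutoffs."""
    pct = 100 * k / n_items
    for grade, floor in CUTOFFS:
        if pct >= floor:
            return grade
    return "F"


def grade_shares(hist):
    """Percentage of trials at each letter, rounded to whole percent."""
    total = sum(hist.values())
    counts = {}
    for k, count in hist.items():
        g = letter(k)
        counts[g] = counts.get(g, 0) + count
    return {g: round(100 * c / total) for g, c in counts.items()}


if __name__ == "__main__":
    print("always revise:", grade_shares(ALWAYS_REVISE))
    print("never revise: ", grade_shares(NEVER_REVISE))
```

Under these cutoffs the never-revise table gives 21% A, 66% B, and 13% C, with the remaining 7 of 1000 trials landing at D, and the always-revise table gives 97% A with the rest at B.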

Typical result

xxxxxxxxrxxxxxrxxxxxrrxxxxxxxrxxxxx  --> x:30 r:5
11111111111111111111111111111111111 (35) 1:35
85.7 %

The analytic distribution of scores for this behavior is simply a binomial distribution with probability mass function

\[P(X=k) = {35 \choose k} 0.85^k \left(1 - 0.85\right)^{35-k}\]

It is technically redundant to simulate this behavior.^[4] But, that is the point of running this scenario: to compare the simulator against input parameters where the answer is known by alternate means.
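
For the comparison, the probability mass function above can be evaluated directly and scaled to the number of trials. This is a small sketch using only the standard library; `pmf` and `expected_counts` are illustrative names, and the 1000-trial scale is taken from the totals of the result tables.

```python
from math import comb

N = 35      # items, each attempted exactly once
P = 0.85    # chance that an attempt is checked off


def pmf(k, n=N, p=P):
    """Binomial probability of exactly k check-offs out of n."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def expected_counts(trials=1000):
    """(k, percent, expected count) for each score, highest first."""
    return [(k, 100 * k / N, trials * pmf(k)) for k in range(N, -1, -1)]


if __name__ == "__main__":
    for k, pct, count in expected_counts():
        if count >= 0.5:  # skip scores expected less than once in 1000
            print(f"{pct:5.1f}  {count:6.1f}")
```

The expected counts can then be set side by side with the simulated counts in Table 3. The mean of this distribution is 35 × 0.85 = 29.75 items, or 85%.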

5. Conclusions

The simulation results and distributions seem … reasonable.

1. It feels like trying to get something reasonable out of trying to average ZIP codes.

2. https://rtalbert.org/emrn/

3. Wilson, R.C., Shenhav, A., Straccia, M. _et al._ The Eighty Five Percent Rule for optimal learning. Nature Communications 10, 4646 (2019). https://doi.org/10.1038/s41467-019-12552-4

4. So, in full disclosure, I didn’t think about this too much ahead of time, especially the distribution’s shape/width. Finding the analytic distribution for the always-retry case seemed like much more work and time than writing less than 100 lines of code on a Thursday afternoon.