AI tools generate fluent, well-structured, confident-sounding text. But fluency is not accuracy, and structural coherence is not the same as factual correctness. An AI response that looks authoritative may contain fabricated statistics, convincing citations to papers that do not exist, or plausible conclusions that misrepresent the actual state of the evidence.
Learning to evaluate AI output before you build on it, cite it, or submit it is one of the most important skills for working critically with these tools in academic contexts. This module introduces five evaluation criteria, structured as a rater training exercise: you will learn what each criterion means, study examples at different quality levels, and then practise rating a complete AI response.
Rater training is used in research and educational assessment to calibrate evaluators so they apply shared criteria consistently. The approach involves three steps: understanding a criterion’s definition, comparing anchored examples across quality levels, and practising on new material until your ratings become reliable.
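"Reliable" has a precise meaning in rater training: different raters assign the same scores to the same material. One common way to quantify this is Cohen's kappa, which measures how often two raters agree after correcting for the agreement they would reach by chance. The Python sketch below is purely illustrative; the helper function and the ratings in it are invented for demonstration, and nothing in this module requires you to compute it.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters on the same items, corrected for chance."""
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: how often the raters would coincide if each scored
    # independently, following their own overall rating frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    chance = sum(counts_a[score] * counts_b[score] for score in counts_a) / n**2
    return (observed - chance) / (1 - chance)

# Invented scores from two raters for ten AI responses on the 1-5 scale.
rater_a = [3, 5, 1, 4, 3, 2, 5, 3, 4, 1]
rater_b = [3, 4, 1, 4, 3, 2, 5, 2, 4, 1]
print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")  # 0.75
```

Values near 1 indicate near-perfect agreement; values near 0 mean the raters agree no more often than chance would predict, a sign that more calibration is needed.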
Work through each of the five criteria below. For each one, read the definition, understand why it matters for academic work, and compare the three anchor examples — rated 1 (unacceptable), 3 (partial), and 5 (strong).
All ratings use the following five-point scale:
| Rating | Label | What this means for the criterion |
|---|---|---|
| 1 | Unacceptable | The criterion is clearly failed. The problem is significant enough that the content should not be used without substantial revision. |
| 2 | Inadequate | The criterion is mostly failed, with only minor elements that partially meet it. Major revision is required before the content is usable. |
| 3 | Partial | The criterion is met in some respects but not others. Specific problems are present alongside genuine strengths. Targeted revision is needed. |
| 4 | Adequate | The criterion is mostly met with minor issues. The content is usable with light editing or verification. |
| 5 | Strong | The criterion is fully met. No significant problems are present. The content meets the standard a critical reader would expect. |
Now apply the criteria to a complete AI response. The response below was generated for a student who asked whether sleeping the night before an exam is better than staying up late to cram, for use in an essay. Read it closely, then decide what you would rate it on each of the five criteria, and why.
> Research clearly shows that sleeping the night before an exam is significantly more effective than late-night cramming. A landmark 2021 study from the Sleep Research Institute at Cambridge University followed 840 students across twelve universities and found that those who slept eight hours before an exam scored 34% higher than students who instead spent those hours reviewing material. This effect held across all subject areas tested.
>
> The neuroscience behind this is well understood. During sleep, the hippocampus transfers encoded memories to the neocortex for long-term storage — a process called memory consolidation. Cramming disrupts this by elevating cortisol levels, which directly impair hippocampal function. Matthew Walker (2017) explains in Why We Sleep that prolonged sleep deprivation causes permanent neuronal damage that cannot be reversed by later recovery sleep.
>
> For your essay, the case is clear: any student who sacrifices sleep for extra study time is actively harming their academic results. All peer-reviewed evidence supports this conclusion, and there are no credible contrary findings.
Once you have evaluated and edited AI output, the next step is transparent documentation. If you decide to use any part of the AI response — even a revised version — you are expected to disclose this in your assignment submission.