Evaluate: assessing AI output


AI output is not self-certifying

AI tools generate fluent, well-structured, confident-sounding text. But fluency is not accuracy, and structural coherence is not the same as factual correctness. An AI response that looks authoritative may contain fabricated statistics, real-sounding citations to papers that do not exist, or plausible-sounding conclusions that misrepresent the actual state of the evidence.

Learning to evaluate AI output before you build on it, cite it, or submit it is one of the most important skills for working critically with these tools in academic contexts. This module introduces five evaluation criteria, structured as a rater training exercise: you will learn what each criterion means, study examples at different quality levels, and then practise rating a complete AI response.


Rater training: five evaluation criteria

Rater training is used in research and educational assessment to calibrate evaluators so they apply shared criteria consistently. The approach involves three steps: understanding a criterion’s definition, comparing anchored examples across quality levels, and practising on new material until your ratings become reliable.

Work through each of the five criteria below. For each one, read the definition, understand why it matters for academic work, and compare the three anchor examples — rated 1 (unacceptable), 3 (partial), and 5 (strong).



The 1–5 rating scale at a glance

1 (Unacceptable): The criterion is clearly failed. The problem is significant enough that the content should not be used without substantial revision.
2 (Inadequate): The criterion is mostly failed, with only minor elements that partially meet it. Major revision is required before the content is usable.
3 (Partial): The criterion is met in some respects but not others. Specific problems are present alongside genuine strengths. Targeted revision is needed.
4 (Adequate): The criterion is mostly met, with minor issues. The content is usable with light editing or verification.
5 (Strong): The criterion is fully met. No significant problems are present. The content meets the standard a critical reader would expect.

Practice: evaluate this AI response

Now apply the criteria to a complete AI response. Read the scenario and output below, then decide how you would rate it on each of the five criteria, and why.

Student scenario
Abel, first-year student
Abel is writing a study advice essay and asks ChatGPT: "What does research say about the effects of sleep deprivation before exams on memory? I need this for my study advice essay."
AI response (simulated ChatGPT output)

Research clearly shows that sleeping the night before an exam is significantly more effective than late-night cramming. A landmark 2021 study from the Sleep Research Institute at Cambridge University followed 840 students across twelve universities and found that those who slept eight hours before an exam scored 34% higher than students who instead spent those hours reviewing material. This effect held across all subject areas tested.

The neuroscience behind this is well understood. During sleep, the hippocampus transfers encoded memories to the neocortex for long-term storage — a process called memory consolidation. Cramming disrupts this by elevating cortisol levels, which directly impair hippocampal function. Matthew Walker (2017) explains in Why We Sleep that prolonged sleep deprivation causes permanent neuronal damage that cannot be reversed by later recovery sleep.

For your essay, the case is clear: any student who sacrifices sleep for extra study time is actively harming their academic results. All peer-reviewed evidence supports this conclusion, and there are no credible contrary findings.

Before you continue: mentally assign a 1–5 rating to each of the five criteria. Note where you see clear strengths and where you see problems. Then take the knowledge check to compare your ratings with the model evaluation.
AI Navigator, guided knowledge check: rate three criteria with hints and worked explanations. Designed for those building their evaluation skills.

AI Pilot, advanced knowledge check: rate all five criteria without guidance, identify reference gaps, and reflect on how the response could be improved.

After you evaluate: document your AI use

Once you have evaluated and edited AI output, the next step is transparent documentation. If you decide to use any part of the AI response — even a revised version — you are expected to disclose this in your assignment submission.

AI disclosure builder: generate a structured statement documenting how you used and evaluated AI in your work.