Values Through Academics: What an Honest AI Literacy Program Looks Like
Three levers — assessment design, classroom incentives, teacher modelling — for middle and senior leaders at international schools in Asia who are about to redesign their assessments and don't yet have a frame for how to do it.
1. A WeCom message I couldn't read
A student messaged me on WeCom last semester with an economics question. The message was in English. It was also polished, wordy, and hard to understand.
He had written it in Chinese first and run it through an AI to translate.
I asked him, kindly, to drop the AI and rephrase it in his own words. He apologised. He said he was worried I wouldn't understand his English.
I told him two things. First: I'd rather ask you questions until I understand than communicate through an AI. Second, and probably the more important of the two: the AI message was unintelligible anyway.
That second line matters. The student's whole assumption — that the AI would represent him better than he could represent himself — collapsed on first contact. The polished version was less legible than the real one. The premise refuted itself before I had to argue against it.
But the motive is the part I keep returning to.
He wasn't trying to cheat. He wasn't lazy. He was worried about being received. He had absorbed, from somewhere, that the polished version of him was more legitimate than the real one — that he should send his proxy into the conversation rather than himself.
This is not an argument against translation support, and not an argument against students using AI to bridge a language gap when the goal is to be understood. It is an argument against letting the tool replace the student's own act of meaning-making — the moment where the student does the thinking, in their own voice, and learns by doing it.
That is the hidden curriculum a school teaches when it doesn't pay attention.
So the question that runs through the rest of this piece:
What had the school taught this student, by accident, that made the polished version of him feel safer than the real one?
2. AI didn't end grades. It ended product-only grading.
Anyone can produce a polished output now. This means a polished output, on its own, is worth less as evidence of learning than it used to be.
Corbin, Dawson and Liu put this most cleanly in their 2025 paper Talk is cheap: why structural assessment changes are needed for a time of GenAI. They draw a hard line between discursive responses — rules, honour statements, traffic-light permission systems, AI policy documents — and structural responses that change the mechanics of assessment so AI cannot substitute for the student.
Most institutions, they argue, are stuck in the discursive lane. What that produces is an enforcement illusion.
The University of Sydney is one of the named examples of moving into the structural lane. Their two-lane approach to assessment splits every task into Lane 1 (secure, in-person, supervised, AI use controllable) and Lane 2 (open, AI assumed present, the task designed around scaffolded use). They explicitly reject a middle "amber" lane on the grounds that it is unenforceable.
Inside Higher Ed put it more bluntly in their December 2025 headline: you can't AI-proof the classroom. Get creative instead.
These are early, named examples of universities restructuring assessment around the assumption that AI is present. Not yet a global shift.
But enough to tell us where the assessment architecture our students will sit inside is heading.
3. But "values education" has a credibility problem
Before going further, the objections have to be named. My arguments read as naive without them.
First, graded character is a contradiction in terms. Marilyn Strathern's 1997 essay 'Improving ratings': audit in the British University System is where most people first encountered the education-applied version of Goodhart's law: when a measure becomes a target, it ceases to be a good measure.
Try to grade integrity and you teach the performance of integrity.
Second, "AI literacy" is still a contested term. The 2025 integrative review by Gu and Ericson documents that the same label gets attached to kindergarten social robots, secondary digital citizenship, and undergraduate prompt design — with no consensus on what should be taught, at what age, or how to assess it.
The field is also being circled by assessment vendors looking to own the measurement frame.
Third, schools that proclaim values and grade the opposite end up teaching cynicism. Students notice the gap.
None of these objections is fatal. Each kills a lazy version of the argument.
4. Values through academics, not alongside
The thesis: we don't teach values by bolting an AI ethics module onto Tuesday afternoon. We teach values through the academic work the school already does — by changing how the work is assessed, how the classroom incentivises engagement with it, and what teachers model in the room.
The values at stake here are not generic niceness. They are authorship, intellectual honesty, responsibility, the courage to think in public, and judgment under uncertainty. Each of the three levers below carries a distinct strand of that work: assessment design carries authorship and honesty; classroom incentives carry intellectual courage and judgment; teacher modelling carries responsibility and integrity. They are not separate moves — they're one redesign.
This is the structural answer Corbin, Dawson and Liu called for, brought down to classroom level. Most leadership teams drafting AI policy are stuck in the discursive lane: define permitted uses, write honour codes, demand attribution.
The structural lane is where assessment redesign actually happens.
Three levers, presented as one redesign.
Assessment design
Process artefacts — outlines, drafts, reflection logs, chat transcripts — are work-trail evidence, not the grade. They earn completion credit. The assessed weight lives at the link: does the chain of process artefacts plausibly produce the final product?
This is the move that makes Goodhart's law harder to exploit. The high-stakes grade is the coherence check between artefact and product, not the artefact itself.
If process artefacts were the high-stakes target, students would optimise the artefact for the grade and stop tracking the underlying thinking — Strathern's exact warning. Moving the grade to the link does not eliminate gaming. A student can still try to manufacture a chat history that looks plausibly connected to a finished essay. But the target has shifted to harder-to-fake territory, and to the place where teacher judgment matters most: the relationship between the thinking and the product.
Classroom incentives
Learning happens in the process, not in the final result. Friction is where learning lives — wrestling with a question, abandoning a bad argument, defending a fragile one.
The goal is not to preserve all friction. Some friction is waste — language anxiety, working-memory overload, typing speed, access barriers, the strain of an unfamiliar academic genre. The goal is to preserve the friction where thinking happens, and to let AI take the friction that doesn't teach.
AI has removed the friction of production. Anyone can produce a final result now. What's gone is the work that used to happen between the question and the result.
The teacher's job, then, is to use AI strategically to preserve the high-value friction points and rebuild others where AI has eroded them. In subject work, the high-value friction looks like:
• evaluating ideas against each other,
• generating rebuttals and counter-arguments with an AI and then defending or abandoning them, and
• building the writing skills that let a student tell a good argument from a polished bad one.
Friction is not a bug to engineer out. The point is not to make tasks more painful for their own sake. It is to make sure the friction that teaches is still in the room after AI has removed everything else.
Teacher modelling
Work with the student's actual voice, not the proxy. The WeCom move from earlier in this piece wasn't pedagogy and wasn't policy. It was modelling — saying, in real time, I'd rather hear the real you than a polished version of you.
Modelling cuts both ways.
The same expectations schools set on students around AI use should apply to teachers. If we expect students to disclose meaningful AI use on a task, teachers should do the same when AI materially shapes feedback, resources, reports, or communication. The disclosure can be as simple as: I drafted this with an AI assistant; here are the moves that are mine, and here are the ones I revised.
Anything less teaches students that the disclosure rule is for kids, and that the polished proxy is allowed if you have the rank to use it.
Sarah Eaton's six tenets of postplagiarism put the load-bearing version of this in one line: we can outsource control but not responsibility. The student remains the author. The teacher's job is to keep insisting on that — including for the teacher's own work.
5. The worked example: A Level Economics, a Socratic chatbot, and one honest caveat
Here is what the three levers look like in one of my own classrooms.
For A Level Economics essay tasks, I have built a custom Socratic chatbot using Reedlet.
Disclosure: I'm the maker of Reedlet. What follows is practitioner reflection on using my own tool in my own teaching, not an evaluation. Treat it as a worked example of the workflow, not as evidence that Reedlet specifically improves outcomes.
Students converse with the bot as part of the essay-creation task. The bot doesn't write for them — it organises their thinking. It asks them questions, identifies weaknesses in their argument, presses on assumptions, refuses to do their job.
I see all the messages. It is a teacher-visible environment, rather than a private student–AI exchange.
The workflow has three measures, deliberately at different levels of stakes.
The bot scores the interaction — quantity and quality of the conversation. Low-stakes, formative input.
I score the final essay. The essay still receives the subject mark. That hasn't changed.
I check the link — does this essay plausibly come from this chat? Does the thinking in the artefact match the thinking in the conversation? If the essay is disconnected from the trail, the grade cannot stand as submitted; the conversation moves to the student, not to the rubric. The link-check is an authenticity-and-reasoning layer on top of the subject mark, not a separate grade.
This is the Goodhart problem — grade the artefact and it becomes the target — addressed structurally rather than argued away. The bot's interaction score is low-stakes process input; the subject mark is what the student earns on the essay; the link-check is what the teacher uses to decide whether the mark stands. If a student games the chat — says the right things to score interaction points — but produces an essay disconnected from that thinking, the link breaks. The grade reflects the break.
The move does not eliminate gaming. It pushes the target into territory that is harder to fake cheaply.
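To make the stakes layering concrete, here is a minimal sketch of the decision logic in Python. It is an illustration only: the names (EssaySubmission, resolve_grade), the score scale, and the boolean link flag are assumptions for clarity, not Reedlet's data model or any part of my actual marking tooling. In practice the link-check is teacher judgment, not a field in a system.

```python
from dataclasses import dataclass

# Hypothetical illustration of the three-measure workflow described above.
# Field names and scales are assumptions, not Reedlet's actual API.

@dataclass
class EssaySubmission:
    bot_interaction_score: float  # low-stakes, formative: quantity and quality of the chat
    teacher_essay_mark: int       # the subject mark, awarded on the essay itself
    link_plausible: bool          # teacher judgment: does the essay plausibly come from the chat?

def resolve_grade(sub: EssaySubmission) -> dict:
    """Layer the three measures without letting the process artefact become the target."""
    result = {
        # The interaction score informs feedback only; it never feeds the grade.
        "formative_feedback": f"Chat engagement: {sub.bot_interaction_score:.1f} (not graded)",
        "subject_mark": None,
        "action": None,
    }
    if sub.link_plausible:
        # Link holds: the subject mark stands as submitted.
        result["subject_mark"] = sub.teacher_essay_mark
        result["action"] = "mark stands"
    else:
        # Link broken: the mark cannot stand as submitted; the next step is a
        # conversation with the student, not an adjustment on the rubric.
        result["action"] = "conversation with student"
    return result

# A gamed chat paired with a disconnected essay: the link breaks, the mark does not stand.
print(resolve_grade(EssaySubmission(bot_interaction_score=8.5,
                                    teacher_essay_mark=18,
                                    link_plausible=False)))
```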
What's actually shifted in my classroom:
Students actually find it more tedious to write an essay now. That is the design working. It adds friction and forces the student through a process before the essay is produced. When there is no AI shortcut, the work is the work. Some students hate it, but that is also the design working.
Honest caveat, which matters more than anything else in this issue:
I have a strong impression that the students who engage more with the bot are writing better essays. But I can't tell whether that is because they were good students to begin with. Selection bias. I haven't separated the bot's effect from the underlying ability of the students who chose to use it well.
The workflow is also iterating. I'm moving toward more discrete milestones — submit an outline, get focused feedback, continue the conversation — rather than one long continuous dialogue.
Part of the reason is technical: the bot can hallucinate inside long conversations, and students don't always catch it.
The bigger reason is pedagogical. Discrete milestones let me rejoin the feedback loop. Outsourcing feedback to an AI is a surefire way of losing trust and rapport with students. Giving feedback is a trust-building endeavour. If I outsource it to an AI, I am conditioning students to trust an AI and not me.
The leadership implication: workflow design is the experimentation loop. You don't ship a finished AI-integrated assessment. You ship version 1, you watch what it does, and you redesign — particularly to put the teacher back into the parts that build trust.
Treat the iteration as the work, not as a sign you got it wrong.
6. Three objections, honestly handled
Objection 1: You've made the process the target — Goodhart wins.
Partly. The answer is the link-check. The process artefacts are completion-credit only. The subject mark stays on the essay. The link-check sits on top as the authenticity-and-reasoning layer that decides whether the mark stands.
The artefact-as-grade move loses to Goodhart. The link-as-grade move does not eliminate gaming, but it shifts the target back to the relationship between thinking and product — which is where the teacher's judgment matters most, and which is harder to fake cheaply.
Objection 2: Exam pressure will crush this in any Asian context.
Honest concession. Three decades of washback research — mapped in Xie and Jia's 2025 bibliometric study across 243 studies and 40 contexts, with Asia as the leading region in the field — gives us the pattern to watch for. High-stakes exams tend to pull classroom practice toward what is tested. Instruction narrows. Formative routines become vulnerable. Process work survives only when it is embedded inside exam-facing practice.
The pull is real. It is especially real in IB, AP and CIE schools in East and Southeast Asia.
The counter is not denial. The move is to embed process work inside exam-review lessons, rather than running it as a separate add-on. Project Zero's Visible Thinking Routines — See-Think-Wonder, Claim-Support-Question, and the rest — are already familiar in most IB-aligned schools. Make them the default language for exam review, so that process visibility lives where the high-stakes pressure is, not in a parallel curriculum the exam season will crush.
Objection 3: Disclosure rules alone won't carry it.
Correct. The 2022 meta-analysis by Zhao and colleagues — r ≈ 0.37 across 40 educational contexts — found that what students believe their peers are doing is one of the strongest known predictors of whether they cheat.
Stronger than most contextual interventions schools deploy. And the effect is more pronounced in cultures high in collectivism, power distance, and long-term orientation, which describes most international-school-in-Asia contexts.
"Tell students to disclose" alone won't shift behaviour. The third lever — teacher modelling, visible peer exemplars of honest AI use — is what shifts the norm.
Disclosure rules without norm-shifting work are wallpaper.
7. The leadership move — redesign a category, not an assessment
For HoDs running IB programmes, the architecture is already moving. The IB's 2023 statement on AI in assessment is explicit: AI will not be banned, and AI-generated text and images must be credited in the body and bibliography. The IB's emerging direction — visible in its 2026 draft AI Design Principles — is toward AI that deepens learning, keeps students active in their own education, and reserves high-stakes decisions about grading and progression for human judgment.
So the question for your department isn't whether the assessment architecture is shifting. It is. The question is whether you're making that shift visible at classroom level now, or waiting for an external mandate that arrives too late to redesign around.
The instinctive move is to pick one upcoming assessment and bolt process-artefacts onto it. That doesn't scale. The move that scales is to redesign a category of assessment.
The highest-leverage category to start with is homework.
Homework is where AI ate first, where it eats most, and where the redesign work touches the largest number of student-hours per week. A redesigned one-off project changes a few hundred student-hours per term. A redesigned homework category changes thousands.
What does homework look like, in a subject, when AI has ended the worksheet-grinding pattern? A few moves that work without adding hours of new design work for a busy teacher:
Replace completion-only worksheets with thinking-trail worksheets. Same set of practice questions. The worksheet now has two columns: the answer, and a short note on where the student got stuck, what they asked the AI, what they took from the response, and what they rejected. The AI is allowed. The thinking-trail is what's marked. Five extra minutes per homework set for the teacher to scan; structurally different work for the student.
Make the submission a short oral explanation, not the worksheet. Students record themselves explaining how they solved one of the questions — what they tried first, what didn't work, why the chosen method holds. The worksheet becomes the scaffolding; the oral artefact is the deliverable. A voice note on WeChat or Teams. The teacher listens at 1.5× speed.
Use AI as the sparring partner, not the solver. "Find three errors in the AI's reasoning on this problem." "Argue against the AI's first answer." Homework becomes a structured disagreement with an AI, not a worksheet the AI can do. The student practises evaluation and rebuttal — the high-value friction that actually teaches.
Build the link-check in from the start. When a homework task has both a private working step (chat, notes, scratch work) and a public submitted step, the teacher's marking move is the same as the Reedlet workflow: does the submitted work plausibly come from the trail? If not, the conversation is with the student, not the rubric.
A worked example for one category travels further across a department than a worked example for one task. Pick homework first because the leverage is highest. The redesign moves above are deliberately small. None of them requires the teacher to rebuild the curriculum.
For senior leaders: create space for AI literacy as an assessment-architecture conversation, not an AI-policy conversation. Push the structural lane, not the discursive one. Frame the next leadership-team agenda item as "what categories of assessment are we redesigning?", not "what's our AI policy?"
Policy will follow. Policy that arrives before classroom evidence will be wrong in ways nobody catches until students have already adapted around it.
I do not claim that this shift is easy. The redesign work is large, the exam-pressure pull is real, and the political room for assessment redesign in international schools in Asia is bounded by parent expectations that have not moved as fast as the technology.
I just think the change is worth it for the students we teach. Education is a long-horizon investment. Some of the most important things schools do will not pay off within the semester, the exam cycle, or even the school year. That does not make them less necessary.
8. From inside the building, still figuring it out
I want to close where I started.
I am just beginning to realise how much work I have to do: redesigning my assessments, rethinking how I am using grades as an incentive, and working out what I need to be modelling. That is all I think about these days.
What I am focused on right now: how to put the teacher back into the parts of the workflow that build trust, while keeping the AI in the parts where its friction earns its keep. How to redesign homework — the category, not the task — so the work survives an AI-present classroom. And how the modelling cuts both ways: I cannot ask students to disclose AI use that I am not disclosing myself.
The WeCom conversation with the student from the opening of this article is the work this is all in service of. That slow, broken-English, back-and-forth conversation — where I sit with the real student rather than the polished proxy — is what I am trying to design into the assessment, not engineer out of it.
I'd rather ask you questions until I understand than communicate through an AI.
The values aren't a separate curriculum. They are the medium through which the academic work happens.
So the question for you, and for your department:
Which friction in your school is doing the learning work — and which has AI quietly removed?
A note on method: this article, which argues for preserving human judgment in the AI era, was itself produced through the kind of co-creation workflow I'm advocating. The idea, the angle, the practitioner observations, the curated sources, and the final wording are mine. An AI assistant calibrated to my voice (through a guide of phrases I've approved and rejected) did the research legwork on sources I then selected and drafted from an outline we agreed on together. I revised the whole piece before publishing. The references are verified.
References
1. Corbin, T., Dawson, P., & Liu, D. Talk is cheap: why structural assessment changes are needed for a time of GenAI. Assessment & Evaluation in Higher Education. 2025.
2. Liu, D. & Bridgeman, A. Frequently asked questions about the two-lane approach to assessment in the age of AI. Teaching@Sydney, University of Sydney. 2024–2025.
3. Whitford, E. You Can't AI-Proof the Classroom, Experts Say. Get Creative Instead. Inside Higher Ed. 2025.
4. Eaton, S. E. Six Tenets of Postplagiarism: Writing in the Age of Artificial Intelligence. Learning, Teaching and Leadership. 2023.
5. International Baccalaureate Organization. Artificial intelligence in IB assessment and education: a crisis or an opportunity? International Baccalaureate. 2023.
6. International Baccalaureate Organization. Shaping our approach to AI as a community — draft AI Design Principles. International Baccalaureate. 2026 (draft).
7. Gu, X. & Ericson, B. J. AI Literacy in K-12 and Higher Education in the Wake of Generative AI: An Integrative Review. In Proceedings of the 2025 ACM Conference on International Computing Education Research. 2025.
8. Strathern, M. 'Improving ratings': audit in the British University System. European Review. 1997.
9. Xie, Q. & Jia, J. Three decades of research on washback (1993–2023): a bibliometric study. Language Testing in Asia. 2025.
10. Zhao, L. et al. Academic dishonesty and its relations to peer cheating and culture: A meta-analysis of the perceived peer cheating effect. Educational Research Review. 2022.