I became a good doctor by doing it badly first
What happens when AI agents "perform at a PGY1 level"
This piece is about a realization I had this week - that I have already sort of seen the post-agentic world in medicine. The world where AI agents are perfect at execution.
But first I need to tell you about Peter Singer.
He was my boss for a month in 2002. Peter Singer — the University of Toronto internist, bioethicist, Officer of the Order of Canada, CEO of Grand Challenges Canada, Special Advisor to the WHO Director-General. I was an intern on his clinical teaching unit at Toronto Western Hospital.
He is among the most accomplished physician-ethicists in Canada. And he is, without question, one of the finest clinical teachers I've ever worked with, not because he lectured, but because of what he didn't do.
Peter was all-world at permitting us to do the work.
He had perfected the art of ensuring that the team around him executed while he was still available for supervision. We took the histories. We wrote the notes. We generated the differentials. We presented the plans. We placed the orders. We communicated with patients and families. Peter listened, questioned, redirected, caught errors, and - at the precise moment I was about to make a mistake that mattered - only then intervened. The rest of the time, he supervised.
It was, in retrospect, a masterclass in what the academic physician has always been: “post-execution”.
I became the doctor I am today because Peter Singer let me fumble through the doing. His job was not to do. His job was to ensure that I did — and in the process, I learned. The fumbling was the point. And he was always there to help at times of fumbling.
Now imagine that Peter’s intern wasn’t me… imagine it was an AI agent.
The service-education bargain
The discourse about AI replacing physicians has missed the target on one thing: it assumes the physician is the one executing. At least in the Canadian academic environments where I trained in internal medicine and gastroenterology, the attending hasn’t been the primary executor.
The fellow, the resident, the intern, the medical student, the PA, the NP: they take the histories, spend the most time with the patients[1], write the notes, generate the differentials, place the orders, coordinate the discharges, make the thousands of small decisions that constitute daily patient care. The attending oversees, teaches, catches errors, and makes the high-stakes judgment calls forged over years of doing it badly first.
This isn’t a flaw. It’s a design feature. Medical training is built on the service-education bargain: the trainee simultaneously provides clinical service and learns how to become a doctor. The hospital gets labour. The trainee gets education. The patient gets a team. The attending ensures it all works.
We talk about service-education balance all the time in #MedEd - even as the balances shift, this fundamental bargain has held for over a century.
The agent outperforms the intern (PGY1)
This is the big change. AI agents can perform at a PGY1 level. Easily. Right now.
Even last year, we already had the licensing exam data on LLMs to support this. GPT-4 exceeded the USMLE passing threshold by over 20 points; GPT-4o reached 90.4% accuracy on licensing questions. A meta-analysis in BJOG found GPT-4 passed 26 of 29 medical licensing exams worldwide and outperformed the average medical student in 13 of 17 comparisons. On specialty board exams, GPT-4 scored at or above the median resident in internal medicine, psychiatry, and three other disciplines (NEJM AI, 2024). In emergency medicine, it scored 88.7% on the Canadian In-Training Exam, surpassing trainees at every postgraduate level, including PGY-5s who averaged 70.1% (CJEM, 2025).
But exams test knowledge. What about clinical reasoning — the messy, integrative thinking that defines good doctoring? Rodman et al. in JAMA Internal Medicine found that GPT-4 scored 10 out of 10 on a validated clinical reasoning assessment across 20 diagnostic cases. Residents scored 8. Attendings scored 9. In a randomized trial of 50 physicians in JAMA Network Open, GPT-4 alone hit 92% on complex diagnostic cases; physicians with conventional resources scored 74%. The most troubling finding: physicians given access to GPT-4 scored only 76% — they couldn’t effectively use a tool that was outperforming them solo. A follow-up in Nature Medicine found no significant difference between LLM-augmented physicians and the LLM alone on management reasoning. The human added nothing. Meanwhile, AI-generated discharge letters matched junior clinician quality with zero hallucinations, AI triage equalled junior EM doctors on the Manchester system, and when blinded professionals compared chatbot and physician responses to real patient questions, they preferred the chatbot 79% of the time — rating it higher on both quality and empathy (JAMA Internal Medicine, 2023).
Then came the agents — autonomous systems executing multi-step clinical workflows. Ferber et al. in Nature Cancer demonstrated an AI agent for oncology decision-making that selected appropriate clinical tools with 87.5% accuracy and reached correct conclusions in 91% of cases. Google’s AMIE system outperformed primary care physicians on 30 of 32 axes in a randomized, double-blind study. The Stanford-Harvard ARISE Report 2026 documented something perhaps more important than any performance metric: clinicians following incorrect AI recommendations even when the errors were detectable. And the definitive meta-analysis — Hager et al. in npj Digital Medicine, 83 studies — found no significant difference between AI and non-expert physicians (p=0.93), but AI performed significantly worse than experts (p=0.007).
So AI is not better than the attending. But RIGHT NOW -- it is better than the intern.
And the AI agent doesn’t get tired at 3 AM. It doesn’t forget to check the potassium. It doesn’t have a learning curve on its first night of call. It doesn’t need Peter Singer to buy it breakfast. It is, by almost every measurable metric, a better executor than the PGY-1. And that’s the problem. Because the intern’s inefficiency was never a bug. The inefficiency is part and parcel of the training.
(shown - me in my “fumbling” attire)
We need to protect the fumble
So if AI agents execute at PGY-1 level, why not deploy them across teaching hospitals (as soon as we get commercial products that meet regulatory, privacy and other requirements)? The efficiency argument writes itself: faster notes, fewer errors, shorter lengths of stay, better throughput.
I argue we have to protect the fumble. Teaching hospitals have never optimized solely for this year’s patients. The entire structure exists because of a bet on the future: that the slow, inefficient, error-prone process of letting trainees do the work produces the physicians who will care for the next four or five decades of patients.
Deploying agents for efficiency in institutions whose primary mission is education needs to be done in a carefully planned manner, such that house-staff still learn core skills. Turner et al. defined the “alignment paradox”: the tension between AI systems sophisticated enough for complex clinical scenarios and the educational values those systems can silently undermine. Saadeh et al. in AI and Ethics distinguished “decision support” from “decision substitution” (clinicians entirely deferring judgment) and documented what follows: eroded vigilance, impoverished therapeutic relationships, and poorer outcomes. When an AI algorithm recommended discharging a wheelchair-dependent patient to a fifth-story walk-up, no human caught it. The contextual reasoning that would have flagged it, the kind Peter Singer and numerous other clinician-teachers built in us through years of supervised fumbling, simply wasn’t there.
It isn’t just a matter of ensuring that AI agents don’t replace house-staff. Educators urgently need to understand the interplay between agents and house-staff, and to craft schemas that let education occur in the appropriate domains. This is not a theoretical risk. It is happening.
The Lancet Digital Health warned in August 2025 that rapid generative AI rollout without educational safeguards actively harms clinical competence — citing the colonoscopy deskilling data from my own field, where endoscopists’ adenoma detection rates dropped 20% after routine AI exposure. Natali et al. in Artificial Intelligence Review documented both “AI-induced deskilling” — erosion of existing expertise — and “upskilling inhibition,” where trainees simply never acquire the skills in the first place. My friend Tyler Berzin and Eric Topol in The Lancet named the triad plainly: deskilling, mis-skilling, and never-skilling. El Tarhouny and Farghaly in Frontiers in Medicine went further — when clinicians repeatedly offload cognition to AI, the prefrontal cortex becomes less active during clinical tasks, diminishing engagement in planning and problem-solving. The brain literally stops doing the work.
And there are domains where the fumbling trainee absolutely needs to learn. Taking a history from a frightened patient. Sitting with a family during a goals-of-care conversation. Generating a differential from scratch for the edge cases when the answer isn’t obvious. I’d argue that, even with AI scribes, this includes writing the note: not because the note is paperwork, but because the note is where the thinking happens. Performing a physical exam that changes your pre-test probability. These are the acts that build a physician. Agents cannot learn them for us, and we cannot learn them by watching agents do them. I’m working on a schema for what must remain human: the protected core of physicianship that no efficiency argument should touch (coming soon). But the principle is simple: if it builds clinical judgment, the trainee does it.
The “new bargain”
I’m not arguing against agents. I’m arguing against sleepwalking into a world where agents replace the apprenticeship without anyone noticing until it’s too late — which is exactly what’s happening in every other knowledge profession. Brynjolfsson’s Stanford data showed a 13% decline in employment for early-career workers in AI-exposed occupations. The Dallas Fed confirmed it: fewer young people are entering these jobs. Edmondson and Chamorro-Premuzic in HBR called it strategically catastrophic. The irony? Molly Kinder at Brookings proposed the medical residency model as the solution for other professions — structured mentored learning where the learning is the job. Other fields are trying to build what we already have and we should ABSOLUTELY not just dismantle it.
What we need is a renegotiated service-education bargain — one that’s explicit about where agents augment and where trainees must still do the work themselves. This is possible if we design it deliberately. Bastani et al. in PNAS ran a field experiment with a thousand high-school students: unguardrailed AI access improved practice performance by 48%, but when the AI was removed, those students scored 17% worse than controls. A guardrailed version — one that provided hints rather than answers — largely mitigated the harm. The principle translates directly: AI that does the clinical work for the trainee degrades learning; AI designed to scaffold the trainee’s own reasoning preserves it. Izquierdo-Condoy et al. in JMIR found the same pattern — structured AI use improved critical thinking scores, but excessive dependence reduced problem-solving ability in 78% of studies analyzed. The determining factor was pedagogical scaffolding. Abdulnour, Gin, and Boscardin in the NEJM proposed a practical framework: catch trainees using AI, and turn it into a teachable moment. Don’t prohibit. Supervise the supervision.
The AAMC’s principles on AI in medical education say the right things — human judgment must remain essential, curricula should be developed through interdisciplinary collaboration, access should be equitable. The ACGME is mid-revision of the Common Program Requirements with AI explicitly on the agenda for the first time. These are necessary steps but they are also years behind the technology. The agents are already in the hands of trainees. The question is whether we’ll shape how they interact — or discover the consequences after a generation of physicians has been trained without ever fumbling through the work that made them doctors.
Peter Singer didn’t know he was preparing me for the post-agentic world. But that’s exactly what he did. He showed me what it looks like when the attending is post-execution and the team does the doing — and he showed me that the doing was the learning. The agent can do it better. The agent can do it faster. But the agent will never become Peter, and never even become me who was trained by Peter.
And neither will the trainee who never gets to fumble.
[1] One of my favourite things to do as an attending, though, was “social rounds” to see the inpatients without the house staff. It was the face time with families, answering questions, but mainly it was to show that the boss was around, was reachable, and cared. So this is a little bit ersatz.


