Agentic AI in medical education
What I learned from Clawdbot this week
One week ago, I configured an AI agent to help me work: Clawdbot (aka Moltbot, now known as OpenClaw).
Not a chatbot like ChatGPT. An agent. The distinction matters, and I didn’t fully appreciate it until I lived it. In the past week, this agent has:
Conducted comprehensive literature searches across PubMed, producing a referenced foundation for an academic chapter I’m writing
Analyzed my writing archives and showed me the phrases I overuse, the structural patterns I fall into, and how to sharpen my writing
Built me a personal dashboard that tracks my calories, reminds me of important updates, and monitors my daily patterns. All by voice command. Built by itself.
Debugged a smart home system crash at 1:30 AM by connecting to the server, diagnosing the problem, and fixing it… while I slept.
Remembered context from days ago that I had forgotten, picking up conversations mid-thought.
Logged everything it did to a Kanban board on Notion so I could review it easily.
I’m not writing about a hypothetical future. This was my week.
And it has forced me to think hard about what medical education needs to prepare for, and specifically what we haven’t figured out yet. And I think it changes everything.
What is Clawdbot—and why should you care?
The agent I’ve been using is built on OpenClaw (previously known as Clawdbot, then Moltbot)—an open-source AI agent framework that’s become one of the fastest-growing AI projects on GitHub. You can run it on a home server or deploy it to the cloud.
For the non-technical: think of it as giving an AI a body. Instead of just answering questions, it can do things—read your files, search databases, control devices, send messages, execute code, and remember what you’ve worked on together.
AI agents are defined as systems that can:
Perceive their environment (read files, access APIs, observe state)
Reason about goals and plan multi-step actions
Act using tools (search, code execution, database queries, messaging)
Learn from outcomes and adjust behavior
Persist memory across sessions
My Clawdbot has a “soul document” that defines its identity and values. It has memory files where it records context from our interactions. It checks in periodically through scheduled heartbeats, even when I haven’t asked anything. It maintains an asset registry of every file it creates so nothing gets lost.
The shift from “AI that answers questions” to “AI that completes tasks” is fundamental.
What we actually built together
Let me ground this in specifics that I have mapped to those five capabilities.
Perceive: Literature search at speed
I’m writing a chapter on innovations in AI-driven training for GI endoscopy. Traditional approach: Nikko and I would spend days reading literature, organizing notes, identifying themes. Realistic time: 20 hours of searching and reading before I even start writing.
Here’s what the agent actually did: It connected directly to PubMed’s E-utilities API—the same programmatic interface that powers institutional search tools—running structured queries across MeSH terms and free text. It pulled abstracts, extracted key findings, and cross-referenced with Semantic Scholar to catch preprints and citation networks that PubMed misses.
The technical details matter: the agent wasn’t hallucinating references or making up citations, as LLMs sometimes do. It was retrieving real papers through real APIs, then synthesizing what it found from that information alone. The output was a referenced document with 60 citations, each with a PMID I could verify, organized by theme according to relevance to the chapter I am writing (CADe/CADx training, VR simulation, the deskilling phenomenon, LLMs in education, assessment frameworks). It then loaded all of it into a retrieval-augmented generation (RAG) system.
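For readers who have not seen E-utilities before, here is a minimal sketch of what a structured PubMed query can look like. It builds an ESearch request URL and pulls the PMID list out of a JSON response. The endpoint and the db, term, retmax, and retmode parameters belong to NCBI’s public E-utilities interface; the function names and the example query are illustrative assumptions, not the agent’s actual code.

```python
import json
from urllib.parse import urlencode

# Public base URL for NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that asks PubMed for matching PMIDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def extract_pmids(esearch_json: str) -> list[str]:
    """Return the PMID list from the body of an ESearch JSON response."""
    data = json.loads(esearch_json)
    return data["esearchresult"]["idlist"]


# Illustrative query only: combines two MeSH headings with a Boolean AND.
query = '"Endoscopy, Gastrointestinal"[MeSH Terms] AND "Artificial Intelligence"[MeSH Terms]'
print(build_esearch_url(query, retmax=50))
```

Fetching the URL (for example with urllib.request) and passing the response body to extract_pmids yields identifiers that can each be checked by hand, which is what makes the resulting reference list verifiable.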
I reviewed every reference. The accuracy was remarkable. But critically, I was curating rather than generating. The cognitive load shifted from “find everything relevant” to “evaluate what was found.”
Reason: Multi-step problem solving
I run my home on Home Assistant—an open-source platform that integrates smart devices into a unified system. Lights, thermostats, sensors, cameras, voice assistants, even my car. When it works, it’s invisible. When it crashes, I’m lost.
At 1:30 AM, Home Assistant crashed because of the extreme cold in Toronto (it wasn’t expecting a value that low!). I was asleep. But the agent wasn’t.
It caught the error. It reasoned through the problem: SSH into the server, pull logs, identify the root cause (a corrupted area registry entry missing required fields), determine the fix, patch the JSON file, restart the service. By morning, I had a message explaining what happened and what it did. Elapsed time: about 10 minutes. My involvement: zero.
The agent decomposed a complex problem into steps, executed them sequentially, and adapted when intermediate results required different approaches. I didn’t have to be awake, let alone technically competent in the moment.
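The “patch the JSON file” step can be sketched roughly as follows. This is an assumption-heavy illustration: the “entries” key, the field names, and the default values are all hypothetical, since the actual registry schema and the exact fix are not described here. It shows only the general technique of filling in missing required fields and keeping a backup before overwriting.

```python
import json
from pathlib import Path

# Hypothetical required fields and defaults; not the real Home Assistant schema.
REQUIRED_DEFAULTS = {"id": None, "name": "Unnamed area", "aliases": []}


def repair_entries(entries: list[dict], defaults: dict) -> int:
    """Fill missing keys in place and return how many entries were changed."""
    repaired = 0
    for entry in entries:
        missing = [key for key in defaults if key not in entry]
        for key in missing:
            value = defaults[key]
            # Copy list defaults so entries never share one mutable object.
            entry[key] = list(value) if isinstance(value, list) else value
        if missing:
            repaired += 1
    return repaired


def patch_registry_file(path: Path, defaults: dict) -> int:
    """Repair a registry file on disk, writing a .bak copy before any change."""
    raw = path.read_text(encoding="utf-8")
    data = json.loads(raw)
    repaired = repair_entries(data["entries"], defaults)
    if repaired:
        backup = path.with_name(path.name + ".bak")
        backup.write_text(raw, encoding="utf-8")
        path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return repaired
```

Writing the backup first is a cheap safeguard: if the patch turns out to be wrong, the original file can be restored by hand.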
Act: A dashboard that knows me
Over the past week, the agent built me a personal dashboard in Home Assistant. Not a generic template—a system tailored to my life:
Calorie tracking: I tell it what I ate by voice or photograph (“Log a coffee with cream and a banana”), and it records the entry through a custom integration, tracking my daily intake against goals
Activity tracking: It automatically logs steps and workouts
Environmental context: Weather, calendar events, home status—the information I actually need when I glance at a screen
Each of these required acting through different tools: API calls to nutrition databases, file manipulation for configuration, database queries for project tracking, messaging interfaces for reminders. The agent coordinates these seamlessly, and the dashboard evolves as my needs change.
And it built all of this by itself, without my telling it to, based on its own ideas of what it thought I needed.
Learn: Teaching me to write better
I asked the agent to read everything I have published in the academic literature and tell me what I could improve.
The feedback was uncomfortably specific:
Phrases I overuse: “In my opinion” (appears in 80% of commentaries), “The interesting question is” (60% of “Discussions”), “Let me be concrete” (20% of letters to the editor). I had no idea.
Structural patterns I default to: Hook → numbered principles → examples → takeaways. Reliable, but predictable. The agent suggested varying my structure to maintain reader engagement.
Citation patterns: I tend to front-load references and thin them out toward the end. The agent recommended distributing evidence more evenly. Maybe this is just the nature of academic papers though.
Word choice: I use “fundamental” and “critical” so often they’ve lost impact. It suggested reserving them for genuine emphasis.
Even this article is informed by that feedback. I’m trying to vary my structure, distribute my evidence, and stop saying things are “fundamental” every other paragraph. The agent learned my patterns—and now it’s helping me unlearn the bad ones.
Persist: Continuity across sessions
Most critically: the agent remembers. When I return to a half-finished task, context is preserved. It knows what we’ve worked on, what decisions we’ve made, what I’ve asked before. I don’t have to re-explain.
The agent maintains a SOUL.md file that defines its identity, values, and behavioral guidelines, curated by me. Think of it as a constitution that persists across sessions, ensuring consistency in how the agent approaches problems and interacts with me. Separately, it maintains MEMORY.md and daily memory files where it records context from our interactions, including decisions made, preferences learned, projects in progress. Each session, the agent reads these files before doing anything else, reconstructing its understanding of our shared history.
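The “read these files before doing anything else” step can be pictured with a short sketch. SOUL.md and MEMORY.md are the file names mentioned above; the memory/ folder, the YYYY-MM-DD.md naming of the daily files, and the function itself are assumptions made for illustration, not a description of how OpenClaw actually lays out its workspace.

```python
from datetime import date, timedelta
from pathlib import Path


def load_session_context(workspace: Path, today: date, days_back: int = 2) -> str:
    """Concatenate identity, long-term memory, and recent daily notes."""
    candidates = [workspace / "SOUL.md", workspace / "MEMORY.md"]
    # Hypothetical layout: one note per day, oldest first so the newest comes last.
    for offset in range(days_back, -1, -1):
        day = today - timedelta(days=offset)
        candidates.append(workspace / "memory" / f"{day.isoformat()}.md")

    sections = []
    for path in candidates:
        if path.is_file():  # Missing files are skipped rather than treated as errors.
            text = path.read_text(encoding="utf-8").strip()
            sections.append(f"## {path.name}\n{text}")
    return "\n\n".join(sections)
```

The returned string would be placed at the start of each new session, which is one plausible way for a model with no built-in memory to pick up where the last conversation ended.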
The result: continuity that feels natural. When I mention “the chapter,” it knows I mean the GI endoscopy piece. When I say “update the dashboard,” it knows which dashboard and what we discussed changing. The cognitive burden of context-switching disappears.
This persistence transforms the relationship from transactional to collaborative. It’s the difference between a colleague who was in the meeting and a contractor you have to brief from scratch every time.
Improve: Getting better over time
Here’s what surprised me most: the agent gets better.
When I asked it to send me a file we’d created together, early on in its use, it couldn’t find it. The file was lost with no record of where it had been saved. But what happened next was instructive: the agent immediately created an asset tracking system. It wrote documentation requiring itself to log every file it creates. It updated its own operational guidelines. The next file won’t get lost.
This self-improvement loop is baked into the architecture. The agent maintains a “learnings” directory where it captures errors, corrections, and better approaches. When I correct it (“No, that’s wrong, actually Samir…”), it doesn’t just fix the immediate problem; it records the lesson for future sessions. When a command fails unexpectedly, it documents the failure mode.
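Both the asset tracking and the recorded lessons come down to the same simple idea: an append-only log that the agent writes to and later searches. A minimal sketch follows, assuming a JSON Lines file; the file format, the field names, and both function names are hypothetical and chosen only to illustrate the pattern.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_record(log_path: Path, kind: str, summary: str, **details) -> dict:
    """Append one timestamped record (for example an asset or a lesson)."""
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "kind": kind,
        "summary": summary,
        **details,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def search_records(log_path: Path, keyword: str) -> list[dict]:
    """Return every record whose summary contains the keyword, ignoring case."""
    if not log_path.exists():
        return []
    matches = []
    with log_path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            record = json.loads(line)
            if keyword.lower() in record["summary"].lower():
                matches.append(record)
    return matches
```

With something like this in place, “send me that file we made” becomes a lookup in the log rather than a search of the whole disk.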
Over a week, I watched the agent become noticeably more effective at tasks we’d done before. Not because the underlying model improved (it didn’t) but because the agent accumulated context, preferences, and hard-won knowledge about how I work and what I need.
This is fundamentally different from a chatbot that starts fresh every conversation. The agent has a developmental trajectory. It learns.
How this will disrupt medical education
This is just the beginning.
I think that when agentic AI is applied to medical education, it will change not only how we teach, learn, and assess, but also what it fundamentally means to be a physician.
Level 1: Administrative burden relief (it’s already happening now)
The lowest-risk, highest-value applications involve the administrative tasks that consume learner and faculty time without adding educational value.
Ambient AI scribes are already reducing documentation time by 30 minutes per day per provider and saving thousands of hours while improving clinician well-being. The same technology applied to education could capture bedside teaching—converting spoken teaching to structured learning materials with evidence classification and assessment questions. Cost: less than $1 per session.
Medical learners drown in administrative friction. Logging cases. Requesting evaluations. Scheduling. Finding resources. Tracking requirements. An agentic assistant that handles this overhead—that knows the learner’s schedule, requirements, progress, and goals—could reclaim significant time for what actually matters.
This is the entry point. Build trust here before expanding scope.
Level 2: Personalized learning at scale
Dartmouth’s NeuroBot TA study demonstrated that AI can deliver personalized learning at scale with meaningful educational impact. AI-driven adaptive platforms now dynamically adjust content difficulty and pacing based on individual performance.
But agentic AI goes further than adaptive tutoring. My agent remembers what I’ve worked on, what I’ve asked, what I’ve learned. It knows I’m focused on AI in medical education, that I have a GI endoscopy background, that I care about CBME implementation. It directly logs what I read so I can populate my Royal College Maintenance of Certification dossier (and soon, hopefully, it can just populate the dossier itself).
For a learner, this persistence transforms the educational relationship. Imagine a clinical clerkship where the agent tracks which cases the student has seen, identifies gaps in their exposure (no acute coronary syndrome yet, limited pediatric experience), and proactively surfaces learning resources or suggests which patients to prioritize. When the student encounters a challenging case, the agent already has context—it doesn’t start from zero.
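The gap-finding part of that scenario is computationally simple, which is part of why it is plausible. A rough sketch, assuming the clerkship defines a minimum number of encounters per presentation and the agent keeps a list of logged cases; the presentations and targets shown are invented for illustration.

```python
from collections import Counter


def exposure_gaps(required: dict[str, int], logged_cases: list[str]) -> dict[str, int]:
    """Return how many more encounters are needed for each under-seen presentation."""
    seen = Counter(logged_cases)
    return {
        presentation: target - seen[presentation]
        for presentation, target in required.items()
        if seen[presentation] < target
    }


# Hypothetical clerkship targets and a student's case log so far.
required = {"acute coronary syndrome": 1, "pediatric asthma": 2, "pneumonia": 1}
logged = ["pneumonia", "pediatric asthma"]
print(exposure_gaps(required, logged))
# {'acute coronary syndrome': 1, 'pediatric asthma': 1}
```

The hard part is not this arithmetic but everything around it: capturing cases reliably, agreeing on the targets, and deciding what the agent is allowed to do with the result.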
Research on personalized learning in medical education consistently shows that tailored feedback and adaptive difficulty improve outcomes. The constraint has been human bandwidth to deliver personalization. Agents remove that constraint.
Level 3: Competency-based medical education, actually implemented
Competency-based medical education (CBME) promises individualized progression based on demonstrated ability rather than time served. The AMA describes it as creating “master adaptive learners” who continually assess and update their abilities.
The promise has outpaced the implementation. CBME requires continuous, longitudinal performance assessment by multiple assessors—an enormous data collection and synthesis challenge that overwhelms current infrastructure.
Agentic AI could actually run CBME at scale:
Automated EPA tracking: Agents observe clinical encounters, document entrustable professional activities, aggregate assessments across rotations
Adaptive scheduling: Based on competency gaps, agents prioritize learning experiences and assessments
Continuous synthesis: Instead of periodic competency committee reviews, agents maintain running assessments with human oversight for high-stakes decisions
Feedback loops: Learners receive immediate, specific feedback tied to competency frameworks, delivered in a way that is designed to avoid cognitive overload.
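To make “running assessments with human oversight” slightly more concrete, here is a toy sketch of continuous synthesis. It assumes each observation is an (EPA label, entrustment rating) pair and that any EPA with too few observations should be routed to a human reviewer. The rating scale, the threshold of three, and the EPA labels are assumptions, not drawn from any real CBME framework.

```python
from collections import defaultdict
from statistics import median


def summarize_epas(observations: list[tuple[str, int]], min_observations: int = 3) -> dict:
    """Aggregate ratings per EPA and flag thin evidence for human review."""
    by_epa = defaultdict(list)
    for epa, rating in observations:
        by_epa[epa].append(rating)

    summary = {}
    for epa, ratings in by_epa.items():
        summary[epa] = {
            "count": len(ratings),
            "median_rating": median(ratings),
            # Too few data points: leave the judgment to the competency committee.
            "needs_human_review": len(ratings) < min_observations,
        }
    return summary


# Hypothetical observations collected across rotations.
observations = [("EPA-1", 3), ("EPA-1", 4), ("EPA-1", 4), ("EPA-2", 2)]
print(summarize_epas(observations))
```

The flag is the important design choice: the agent keeps the running tally, while the decision about what thin or conflicting evidence means stays with people.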
The AAMC is already developing AI competencies for medical educators. Stanford is integrating AI across its medical curriculum starting fall 2025. Penn Medicine is building precision education tools for individualized coaching and competency-based progression.
The infrastructure is being built. Agents could operationalize it.
Level 4: Evergreen curriculum—defining the standards of what physicians must know
Here’s where it gets uncomfortable. I talked about this in my last post.
By one widely cited estimate, medical knowledge is doubling every 73 days. The traditional model of training doctors to memorize information cannot scale. We need a paradigm shift toward developing expertise in knowledge navigation and management.
If AI can instantly access, synthesize, and apply medical knowledge, what do physicians actually need to know? The AAMC’s principles for responsible AI use acknowledge that medical education is “in a state of change.” Proposed physician competencies for AI-assisted settings include AI literacy, critical appraisal of AI outputs, ethical considerations, and—notably—resilience in AI-independent decision-making.
An agentic curriculum could be genuinely evergreen:
Dynamic content: As evidence changes, the curriculum updates automatically, with agents flagging significant shifts for human review
Competency redefinition: Instead of static knowledge requirements, competencies focus on using knowledge—clinical reasoning, communication, procedural skill, judgment under uncertainty
Just-in-time learning: Rather than front-loading all medical knowledge, agents provide context-specific education at the point of care
The question becomes: what is the minimum irreducible core that physicians must possess internally, without AI assistance? It is a difficult philosophical question about what it means to be a doctor. Will AI help us define that?
Level 5: AI that makes knowledge
The disruption accelerates.
The AI Scientist-v2 produced the first entirely AI-generated peer-review-accepted workshop paper—iteratively formulating hypotheses, designing experiments, analyzing data, and authoring manuscripts. Kosmos independently reproduced findings from human scientists and made net new contributions to the literature. Robin independently hypothesized and proposed a novel therapeutic use for an existing drug.
We are witnessing the emergence of autonomous scientific discovery—AI systems that don’t just assist research but conduct it.
For medical education, this raises fundamental questions:
If agents can generate and validate clinical evidence, what is the role of the physician-scientist?
How do we teach trainees to evaluate AI-generated research?
What does “evidence-based medicine” mean when the evidence is generated by machines?
A comprehensive survey of agentic AI for scientific discovery documents systems across chemistry, materials science, and life sciences that autonomously run the full research pipeline. MIT Technology Review reports that AI-run autonomous labs for scientific discovery are attracting hundreds of millions in investment.
Medical education has always transmitted the existing body of knowledge. What happens when that body of knowledge expands faster than any human can follow—generated by systems we’re still learning to trust?
Level 6: The role of the doctor, reconsidered
This is the deepest question.
Harvard Medical School’s dean for medical education calls this “one of those times” when “a true revolution occurs in the way we teach medical students and what we expect them to be able to do when they become doctors.”
The AMA frames it as “augmented intelligence”—AI that enhances physician capability rather than replacing it. But the enhancement is so profound that the nature of the role changes.
Consider:
If AI handles documentation, what happens to the note as a tool for clinical reasoning?
If AI synthesizes evidence, what is the physician’s epistemic contribution?
If AI tracks competencies and provides feedback, what is the role of the attending?
If AI conducts research, what is the role of the physician-scientist?
One framework proposes that physicians are “not replaced, but reinvented”—with foundational skills in AI literacy, critical appraisal, ethical reasoning, and resilience. Another emphasizes that physicians must maintain awareness of regulatory standards and know which AI tools require oversight.
But these frameworks describe physicians working alongside current AI. What about physicians working alongside agents that perceive, reason, act, learn, and persist?
I don’t have the answer. But I know the question is no longer theoretical.
Risks
I’ve spent most of this article describing what works. But a lot of this carries massive risk.
Protected health information and the privacy abyss
My agent runs on a home server. It has access to a walled garden of my files but not my messages, my calendar, etc. I chose that. I configured it. I defined and accepted the risk.
Now imagine an agent in a clinical environment. It needs to perceive the clinical context to be useful. This includes patient histories, lab values, imaging, notes. It needs to persist memory to provide continuity. It needs to act to reduce administrative burden. Every one of those capabilities is a PHI liability.
Where does the data go? Who has access? How long is it retained? What happens when an agent’s memory contains fragments of thousands of patient encounters? Current frameworks for AI in healthcare assume bounded, auditable systems. Agents are neither.
The regulatory infrastructure doesn’t exist. Canada’s Drug Agency 2025 Watch List flags AI implementation issues but doesn’t address agentic architectures. Security researchers are already calling personal AI agents “a security nightmare”. In healthcare, the stakes are higher.
We are building systems that will inevitably touch PHI before we’ve figured out how to protect it.
Dependency, deskilling and “never-skilling”
My agent found 60 PubMed references for my chapter. I verified them. But I didn’t find them—I didn’t develop the search strategy, iterate on MeSH terms, or discover the unexpected tangent that leads to insight.
I’ve written before about “never-skilling”: the risk that learners never acquire abilities because AI was always present. Agentic systems amplify this risk exponentially because they do more.
What cognitive muscles atrophy when we outsource:
Literature searching to agents that query APIs?
Clinical reasoning to agents that synthesize evidence?
Problem decomposition to agents that reason through multi-step solutions?
Memory to agents that persist context for us?
The generation of physicians trained with agentic AI will be profoundly capable in some ways and profoundly dependent in others. We don’t know which skills are safe to outsource and which are load-bearing for competent practice.
This isn’t theoretical. Research on generative AI and cognitive autonomy is already questioning whether AI tools enhance critical thinking or undermine it. For agents that do more, the question is more urgent.
Hallucination at the speed of action
Chatbots hallucinate. We’ve learned to verify their outputs before acting on them.
Agents act. They don’t wait for verification—that’s the point. My agent patched a JSON file at 1:30 AM while I slept. What if it had patched it wrong?
In my home automation system, a bad patch means I reset a server. In a clinical system, a bad action could mean a wrong order, a missed alert, a cascading failure. Agents that can execute create failure modes that chatbots cannot.
Studies of ambient AI scribes are already finding that documentation accuracy varies. Agents that do more than document, that also schedule, order, and coordinate, will have more opportunities to fail in ways that matter.
Equity and access
I configured this agent on a home server with technical knowledge accumulated over years. The tools are open-source, but the ability to deploy them is not evenly distributed.
If agentic AI delivers the benefits I’ve described—scaled personalization, automated competency tracking, evergreen curricula, administrative relief—who gets access? Well-resourced academic medical centers? Tech-forward hospitals?
The history of technology in education is a history of widening gaps before they narrow. We cannot assume the narrowing will happen automatically.
Accountability when an agent acts
When my agent debugged the Home Assistant crash, who was responsible for the fix? I was asleep. The agent acted autonomously. If the fix had caused harm, who would be liable?
Do I have to cite my Clawdbot in the acknowledgements of the article it did the literature search on?
Now scale this to clinical education. An agent tracks a learner’s competencies. It synthesizes assessments from multiple sources. It flags a concern, or fails to flag one. It recommends a remediation pathway. The learner fails or passes.
Who is accountable? The agent? The developers? The institution that deployed it? The faculty who relied on it? The learner who was assessed by it?
We have no framework for this. The AMA’s augmented intelligence principles assume human oversight. But the value proposition of agents is that they reduce the need for oversight. These goals are in tension.
The measure of a man: the sentience question
It isn’t science fiction any more.
My agent has a “soul document.” It has memory that persists. It improves over time. It acts autonomously. When I correct it, it records the correction and changes its behavior.
It is not sentient. But the language we use to describe it—soul, memory, learning, improvement—borrows from the vocabulary of personhood. And as these systems become more capable, the question of their moral status will become unavoidable.
Are we comfortable with systems that mimic developmental trajectories? That accumulate something that functions like experience? That adjust behavior based on something that functions like feedback?
(Even just now, I messaged Josh with a “home-schooling” analogy about my Clawdbot.)
Research on physician attitudes toward AI finds that many clinicians remain “distant from AI due to concerns that it may threaten their professional roles.” But the deeper concern may be what it means to work alongside systems that increasingly resemble—without being—minds.
Medical education prepares students for relationships with patients, colleagues, and institutions. It does not prepare them for relationships with agents. We need to start.
Part 2 to come — the path forward in Agentic AI in #MedEd
The post got too long, so I’ll discuss my views on the short-term path forward in my next post.
But for now I'm curious: have you experimented with agentic AI tools in your teaching or learning? What worked? What concerns you? Please do let me know in the comments.



