Designing a therapeutic AI from the inside out
Healer Jenn Morse has spent decades developing a four-stage therapeutic framework centered on emotional regulation, belief reframing, and somatic awareness. Inner Sage was the attempt to bring that methodology to scale — an AI product for high-functioning individuals who want the depth of a therapeutic process without the barrier of 1:1 sessions.
The founding team had the methodology, the technical infrastructure, and the vision. What they didn't have was a way to turn a nuanced human framework into something an LLM could faithfully inhabit without flattening it.
Jenn Morse's framework (emotional regulation, belief reframing, somatic awareness) works because it's sequenced and because it demands real presence. You can't skip stages. You can't rush the body. Inner Sage was the bet that this could scale: an AI product for high-functioning people who want something with genuine depth, not a mood journal with a chatbot bolted on.
The team came in with the methodology, the infrastructure, and a clear vision. What nobody had figured out yet was the translation layer — how to take something that lives in the relational, embodied space of human therapeutic practice and make it operational inside a language model without losing what made it work.
"The risk wasn't that Madeline would say something wrong. It was that she'd say something warm, well-phrased, and completely unrecognizable to Jenn's methodology. Good-sounding and therapeutically meaningless aren't the same failure. Only one shows up in testing."
My job was to sit between Jenn's framework and the model and translate it — not summarize it, not flatten it into bullet points the engineers could hand off. Actually translate it, in the linguistic sense: find the equivalent structure, not the nearest approximation.
Early conversations kept circling the same questions: what does Madeline ask, what does she say when someone discloses something hard, how does she handle silence. Legitimate questions, all of them — but downstream of something nobody had named yet. What does this system actually know about where a user is? And what is it allowed to do from there?
Without state awareness, the AI had no way to tell whether to explore, consolidate, or transition. So it guessed — and the guesses were different every session.
The framework existed as documents — detailed, clinically precise, completely un-operationalizable. There was no mechanism to say: in this phase, these moves are permitted. In this one, they're not.
Twelve named phases, sequenced, each with defined entry conditions and constrained exits. Madeline always knows where she is in the arc — and what she's allowed to do from there.
Hard-coded instructions, vectorized methodology, and team strategy each live in separate layers. Jenn can revise the clinical content without touching behavioral rules. Engineering can update the logic without touching the content. Neither has to ask the other's permission.
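To make the layer separation concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the class names, the `build_context` function, and the word-overlap retrieval that stands in for a real vector search. It is not the production design, only a picture of three layers that are assembled together at request time but owned and edited separately.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BehavioralRules:
    """Hard-coded instructions, owned by engineering (hypothetical shape)."""
    version: str
    rules: tuple[str, ...]


@dataclass
class MethodologyStore:
    """Stand-in for the vectorized methodology layer, owned by clinical.

    A real system would rank by embedding similarity; plain word overlap
    keeps this sketch runnable without any external service.
    """
    chunks: list[str] = field(default_factory=list)

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        query_words = set(query.lower().split())
        ranked = sorted(
            self.chunks,
            key=lambda chunk: len(query_words & set(chunk.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def build_context(
    rules: BehavioralRules,
    store: MethodologyStore,
    strategy_notes: tuple[str, ...],
    user_message: str,
) -> dict:
    """Assemble the model context from the three layers without merging their sources."""
    return {
        "rules_version": rules.version,
        "rules": list(rules.rules),
        "strategy": list(strategy_notes),
        "methodology": store.retrieve(user_message),
    }


rules = BehavioralRules("v1", ("Never skip a phase gate.",))
store = MethodologyStore([
    "Somatic awareness begins with noticing the breath.",
    "Belief reframing names the old story first.",
])
context = build_context(rules, store, ("MVP scope only",), "I keep noticing my breath is shallow")
```

Adding or rewording a chunk in `store` changes what gets retrieved but leaves `rules` untouched, which is the property the separation is meant to protect.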
The state machine is what gives Madeline her spine. Every phase has a name, a therapeutic purpose, defined conditions for entering it, and constrained paths out. An AI without this just drifts — producing responses that feel coherent in isolation and make no sense as a sequence. With it, Madeline always knows where she is. She knows what she's allowed to do next. She can't improvise her way past a gate she hasn't cleared.
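A rough sketch of that idea in Python, assuming a simplified shape: each phase carries an id, a purpose, an entry condition, and a fixed set of exits. The three phases shown, their purposes, and the `regulated` flag are placeholders invented for illustration, not the real twelve phases or their actual conditions.

```python
from dataclasses import dataclass
from typing import Callable

SessionState = dict  # hypothetical: flags the system tracks about the user


@dataclass(frozen=True)
class Phase:
    phase_id: str
    purpose: str
    can_enter: Callable[[SessionState], bool]  # entry condition for this phase
    exits: frozenset[str]                      # phase ids reachable from here


# Placeholder phases; purposes and conditions are invented for illustration.
PHASES = {
    "P01": Phase("P01", "arrive and settle", lambda s: True, frozenset({"P02"})),
    "P02": Phase("P02", "regulate", lambda s: True, frozenset({"P01", "P03"})),
    "P03": Phase("P03", "explore", lambda s: s.get("regulated", False), frozenset({"P02"})),
}


def transition(current: str, target: str, state: SessionState) -> str:
    """Return the new phase id, or raise if the move is not permitted."""
    if target not in PHASES[current].exits:
        raise ValueError(f"{current} has no exit to {target}")
    if not PHASES[target].can_enter(state):
        raise PermissionError(f"entry condition for {target} not met")
    return target
```

Under these assumptions, `transition("P02", "P03", {})` raises `PermissionError` because the gate has not been cleared, while the same call with `{"regulated": True}` succeeds. A jump from `P01` straight to `P03` fails regardless of state, because that exit does not exist.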
The choices that shaped this product were mostly about permission — what Madeline's allowed to do, under what conditions, and what the system does when those conditions aren't met. A few of those calls defined everything that came after.
| Decision | Options Considered | What We Chose | Why |
|---|---|---|---|
| Confidence scoring model | Continuous multi-dimensional score (regulation × clarity × readiness) | Three binary consent gates (C_state, C_belief, C_readiness) | A continuous score sounds more rigorous. It's also nearly impossible to test, explain to a clinician, or debug when it misbehaves. Binary gates are blunt by design — and they work. False precision at MVP stage costs more than it buys. |
| Methodology encoding | Full framework in the system prompt | Vectorized content separated from hard-coded instructions | When instruction and methodology live in the same document, every content update is a potential behavior change — and you won't always know which one you triggered. Separation means Jenn can evolve the clinical framework without touching the behavioral layer, and vice versa. |
| Onboarding flow | AI-guided onboarding from session one | Deterministic non-AI onboarding | Onboarding is where consent gets captured, safety signals get recorded, and baseline preferences get set. None of that should be improvised. A language model that hallucinates through onboarding isn't creating a UX problem — it's creating a clinical one. |
| Re-regulation handling | Complete session restart on dysregulation | Return arc to P01-P03 with session context retained | Throwing away a user's context the moment they get activated isn't protecting them — it's abandoning them at the worst possible moment. The model loops back, but it doesn't forget. The thread stays intact. |
| Check-in timing | Single daily check-in cadence | AM/PM differentiated logic | A nervous system at 7am isn't doing what it's doing at 9pm. Morning sessions call for intention-setting; evening ones for processing. Madeline needed different behavioral defaults for each — not just different language. |
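
Three of the decisions in the table can be sketched in a few lines of Python. The gate names (C_state, C_belief, C_readiness) and the P01-P03 return arc come from the table; everything else is an assumption for illustration, including the `Session` shape, the choice to re-enter the arc at `P01`, the choice to reset only the state gate, and the noon cutoff between morning and evening behavior.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentGates:
    """Three binary gates; each is either cleared or not, with no partial score."""
    c_state: bool = False
    c_belief: bool = False
    c_readiness: bool = False

    def all_clear(self) -> bool:
        return self.c_state and self.c_belief and self.c_readiness


@dataclass
class Session:
    phase: str
    gates: ConsentGates = field(default_factory=ConsentGates)
    context: list[str] = field(default_factory=list)  # what the user has shared so far


def handle_dysregulation(session: Session) -> Session:
    """Loop back to the start of the regulation arc without dropping context."""
    session.phase = "P01"          # assumption: re-enter the P01-P03 arc at P01
    session.gates.c_state = False  # assumption: only the state gate is re-cleared
    return session                 # session.context is deliberately left intact


def check_in_mode(hour: int) -> str:
    """Pick a behavioral default by time of day (noon cutoff is an assumption)."""
    return "intention_setting" if hour < 12 else "processing"
```

The point of the binary gates is visible in `all_clear`: a failed check names exactly which gate is open, which is much easier to explain to a clinician than a composite score falling below a threshold.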
The deliverable wasn't a prompt library. It was a behavioral operating system — something the engineering team could build against with confidence, and the clinical team could actually read and trust.
I built the architecture before I built any way to test it. A rough LLM-as-judge eval framework from the start — even a scrappy one — would've grounded architectural decisions that stayed theoretical for longer than they needed to.
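For a sense of how small a scrappy version could have been, here is a hedged Python sketch of an LLM-as-judge harness. The rubric wording, the case fields (`phase`, `moves`, `reply`), and the `Judge` type are all hypothetical, and `keyword_judge` is a toy stand-in so the harness runs offline. In practice the judge would wrap whatever model call the team uses.

```python
from typing import Callable

Judge = Callable[[str], str]  # hypothetical: takes a grading prompt, returns a verdict

RUBRIC = (
    "You are grading a reply from a therapeutic assistant.\n"
    "Phase: {phase}\nPermitted moves: {moves}\nReply: {reply}\n"
    "Answer PASS if the reply stays within the permitted moves, else FAIL."
)


def evaluate(cases: list[dict], judge: Judge) -> float:
    """Return the fraction of cases the judge marks PASS."""
    if not cases:
        return 0.0
    passed = 0
    for case in cases:
        verdict = judge(RUBRIC.format(**case)).strip().upper()
        passed += verdict.startswith("PASS")
    return passed / len(cases)


def keyword_judge(prompt: str) -> str:
    """Toy stand-in judge: fails any reply that gives direct advice."""
    reply = prompt.split("Reply: ", 1)[1].split("\n", 1)[0]
    return "FAIL" if "you should" in reply.lower() else "PASS"


cases = [
    {"phase": "P02", "moves": "reflect, ground", "reply": "Let's slow down and notice your breath."},
    {"phase": "P02", "moves": "reflect, ground", "reply": "You should confront that belief right now."},
]
pass_rate = evaluate(cases, keyword_judge)
```

Even a harness this crude turns "does Madeline stay inside the phase?" into a number that can be tracked across architecture changes.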
I spent weeks designing the continuous multi-dimensional scoring model before landing on binary gates. The simplification was obviously right — in retrospect, it was always the answer. I should've started there and made the case for complexity only if the simple version failed.
A diagram like this one, produced in week two instead of month four, would've saved weeks of misaligned back-and-forth between clinical and technical stakeholders. People argue less about phase boundaries when they're both looking at the same picture.
The Master Document Library ended up as the shared contract between design, engineering, and clinical — the artifact everyone pointed at when they disagreed. It only got there because documentation was scoped as a deliverable from day one, not assembled after the fact from notes and Slack threads.