Building an AI meeting SaaS end-to-end while bots get banned
What end-to-end AI product engineering actually looks like, in 2026, when the category is being reshaped by lawsuits and admin policies. The Insight Draft architecture — what’s hard, what shipped, what I’d do differently.
The AI meeting category has spent the last twelve months getting reshaped. In August 2025, Otter was hit with a CIPA (California Invasion of Privacy Act) class action over notetaker recordings. Microsoft is rolling out admin policies in May–June 2026 that let Teams tenants block third-party recording bots by default. Universities (UW, Chapman, UC Riverside) have banned non-native AI bots from their meeting estates. And in March 2026, Granola raised a $125M Series C at a $1.5B valuation on the bet that botless recording is the future. Granola did it for one platform; doing it across Google Meet, Microsoft Teams, Zoom, and Slack Huddles is the work I’m about to describe.
Meanwhile, every business meeting still needs an AI summary. So buyers are stuck. The bot-based incumbents (Otter, Fireflies, Chorus, Read.ai) are being procurement-blocked. The platform-native tools (Microsoft Copilot, Gemini for Meet, Zoom AI Companion) only cover the platform you’re inside, which is useless if your team spans Meet, Teams, Zoom, and Slack Huddles. The Granola-style botless newcomers cover one or two platforms each.
Insight Draft is a two-person founding team. My co-founder Francesco and I built it end-to-end. I owned the architecture and the systems described in this case study: the extension, the API, the LLM orchestration, the billing lifecycle, and the infrastructure. Francesco contributed engineering on the CMS and analytics paths in addition to product and business. What we shipped: bot-free recording across Google Meet, Microsoft Teams, and Zoom via a Manifest V3 (MV3) Chrome extension I’ve maintained for two years; a separate Slack Huddle bot for the case where there’s no browser tab to capture; Deepgram speaker-attributed transcription enriched with live caption scraping; six LLM call types per meeting producing summaries, chapters, highlights, and behaviour mentions; RAG (retrieval-augmented generation) over transcripts with verifiable citations; Stripe billing with proper subscription lifecycle; observability and CI/CD.
Why this is interesting now
Buyers are squeezed from two sides. On one side: the regulatory and platform pushback on bots. On the other: the feature bar set by venture-backed incumbents that users now expect, namely speaker-attributed transcripts, topic chapters, decision/action extraction, conversation analytics, multi-language support, and grounded Q&A with citations.
The architectural question is straightforward: can a small team ship a product with the depth of a $30M-funded competitor, by leaning hard on opinionated platform choices and avoiding work that doesn’t differentiate? The answer that’s in the code: yes, by buying the things that aren’t differentiating (OpenAI-hosted vector store, Deepgram for transcription, Stripe for billing, Postmark for email) and building the things that are (recording capture across platforms, speaker resolution, the orchestration around the LLM calls).
What I built
The monorepo holds twelve services. Eight of them run together in production under one Docker Compose plus Traefik 2.5 (the proxy itself, the Angular SPA, the marketing site, the .NET API, the Node LLM service, Postgres, the DB-setup container, and the encrypted backup runner). The Chrome extension and the Slack-Huddle bot run on their own; the internal CMS and the E2E suite are tooling. The five services that do the work day-to-day:
- Chrome extension (Manifest V3, multi-package monorepo): records Meet/Teams/Zoom via tabCapture for browser meetings and desktopCapture for desktop apps. Has a content script running in MAIN world on Google Meet that scrapes the platform’s own caption stream for live speaker attribution. Bidirectional messaging with the web app via externally_connectable.
- Slack Huddle bot (meeting-bot/): Playwright with stealth plugin, joins Slack huddles where there’s no browser tab to capture. Multiple flow variants (auth-cookie, invite, magic-link). Janus client for direct connection to Slack’s WebRTC SFU as a fallback. Necessary because the extension can’t reach Slack’s desktop UI.
- .NET 8 API (insight-draft-api/): ASP.NET Core, EF Core, Hangfire (background jobs), MediatR (in-process CQRS), Identity + JWT auth, two PostgreSQL databases (main app + a separate transcript DB to keep high-write Word/Caption tables off the metadata DB). Stripe.net 48, Deepgram SDK 6, AWS SDK, FFmpeg via CliWrap. This is where the recording pipeline, billing lifecycle, and tenant model live.
- Node.js LLM service (insight-draft-api-llm/): Express + tsoa-generated routes, kept deliberately thin. Proxies and orchestrates the OpenAI Responses API with strict structured outputs, moderation passthrough, conversation persistence, and the assistant RAG path backed by OpenAI-hosted file_search vector stores. The .NET API calls this; the SPA never does.
- Angular 17 SPA (insight-draft-ui/): NgRx (Redux for Angular), SignalR (.NET’s real-time hub framework) client for status updates, vidstack/media-chrome video player consuming WebVTT files for subtitles/speakers/chapters/highlights, Toast UI editor for in-app notes, ngx-translate across six languages (en, it, fr, es, nl, bg).
The other seven services round it out:
- Marketing site (Angular SSR), internal CMS (React 19 + Vite + shadcn), CMS API (clean-architecture .NET skeleton), Playwright E2E suite (extension + meetings projects), DB-setup container (one-shot bootstrap), deployment (Docker Compose + Traefik 2.5 with Let’s Encrypt + AWS Secrets Manager), cookie-consent kit.
The hardest engineering parts
1. Statistical-voting speaker mapping
Deepgram returns numeric diarization IDs — Speaker 0, 1, 2 — with no relation to actual participants. The mapper’s job is to bind those numeric IDs to known users and external participant identifiers from the meeting’s participant list. Naive approaches (count consecutive segments, longest-talker heuristic) fail on real meetings with overlap and short utterances.
StatisticalVotingSpeakerMapper does it differently. It iterates over every caption emitted by Deepgram, votes for the participant whose known speaking-time window overlaps it, weights each vote by overlap duration, and then derives a confidence threshold per mapping. The threshold itself adapts to coverage (relaxed for short sample windows, stricter for longer ones) via a separate ThresholdPolicy. When confidence is below the threshold, the mapping falls back to the simpler TimestampOnlySpeakerMapper rather than making a high-confidence claim it can’t back up.
Around it: VoteCollector, MappingBuilder, CoverageAnalyzer, SpeakerTimeLookup. The whole stack is pluggable behind ISpeakerMappingService with separate strategies for different meeting providers (SimulatedDiarizationMapper, ManualRecordingDiarizationStrategy, StatisticalVotingSpeakerMapper).
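The voting idea can be sketched in TypeScript (used here for brevity; the real mapper is C#). This is a minimal illustration under stated assumptions, not the production StatisticalVotingSpeakerMapper: the type shapes, the 30-second cut-off, and the 0.6/0.75 thresholds are all invented for the example.

```typescript
// Hypothetical shapes for illustration; the real C# types are not shown here.
interface Interval { start: number; end: number }             // seconds
interface Caption extends Interval { diarizationId: number }  // Deepgram "Speaker N"
interface SpeakingWindow extends Interval { participantId: string }

// Seconds for which two intervals overlap (0 when disjoint).
function overlapSeconds(a: Interval, b: Interval): number {
  return Math.max(0, Math.min(a.end, b.end) - Math.max(a.start, b.start));
}

// Assumed policy: relaxed for short samples, stricter for long ones.
function thresholdFor(votedSeconds: number): number {
  return votedSeconds < 30 ? 0.6 : 0.75;
}

// Maps each diarization ID to a participant, or to null when the vote is not
// decisive enough (the caller would then fall back to a simpler mapper).
function mapSpeakers(
  captions: Caption[],
  windows: SpeakingWindow[],
): Map<number, string | null> {
  const votes = new Map<number, Map<string, number>>();
  for (const caption of captions) {
    const tally = votes.get(caption.diarizationId) ?? new Map<string, number>();
    for (const slot of windows) {
      const seconds = overlapSeconds(caption, slot);
      if (seconds > 0) {
        // Weight the vote by how long the caption and window overlap.
        tally.set(slot.participantId, (tally.get(slot.participantId) ?? 0) + seconds);
      }
    }
    votes.set(caption.diarizationId, tally);
  }

  const mapping = new Map<number, string | null>();
  votes.forEach((tally, diarizationId) => {
    let total = 0;
    let best: string | null = null;
    let bestSeconds = 0;
    tally.forEach((seconds, participantId) => {
      total += seconds;
      if (seconds > bestSeconds) {
        best = participantId;
        bestSeconds = seconds;
      }
    });
    const confidence = total > 0 ? bestSeconds / total : 0;
    mapping.set(diarizationId, confidence >= thresholdFor(total) ? best : null);
  });
  return mapping;
}
```

With these assumed thresholds, a 65% share of the overlap is accepted on a 20-second sample but rejected on a 200-second one, which is the "stricter with more coverage" behaviour described above.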
2. The host-inference heuristic for Google Meet
For Google Meet specifically, the extension scrapes the platform’s own caption stream from the page’s JS context (MAIN-world content script), and the API matches caption deviceId to participant ExternalUserId. There’s a real-world quirk: the meeting host never appears in their own RTC participants feed. Google’s client-side state doesn’t list the local user. So when the matcher runs after the meeting ends, the host’s captions have no participant to bind to.
The fix is inline in RecordingCompletionService.CreateTranscriptWithLLM (around lines 465–486): if exactly one deviceId remains unmatched after the regular pass, that deviceId is the host. A hand-engineered correction for a real Google API gap. The kind of fix you only build after watching real meeting traces fail and figuring out why.
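A minimal TypeScript sketch of that rule, assuming that "matched" simply means the caption's deviceId equals some participant's ExternalUserId. The function name and types are hypothetical; the real logic is C# inside RecordingCompletionService.

```typescript
// Hypothetical caption shape; only deviceId matters for the heuristic.
interface ScrapedCaption { deviceId: string; text: string }

// Returns the deviceId to treat as the host, or null when the heuristic does
// not apply (zero or several unmatched deviceIds would make the guess unsafe).
function inferHostDeviceId(
  captions: ScrapedCaption[],
  participantIds: string[],
): string | null {
  const known = new Set<string>(participantIds);
  const unmatched: string[] = [];
  for (const caption of captions) {
    if (!known.has(caption.deviceId) && unmatched.indexOf(caption.deviceId) === -1) {
      unmatched.push(caption.deviceId);
    }
  }
  return unmatched.length === 1 ? unmatched[0] : null;
}
```

The "exactly one" condition is what keeps the correction conservative: with two or more leftovers there is no way to tell which one is the host, so the sketch refuses to guess.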
3. Custom Hangfire fan-out/fan-in via Postgres atomic UPDATE
Hangfire ships single-job continuations out of the box. Batches (fan-out from N jobs to a single continuation when all complete) is a paid Hangfire Pro feature. So I built it. The three core files (PgBatchCoordinator, BatchContinuationFilter, PgJobResultStorage) are about 280 lines together; the whole batching project including builders and DI plumbing is around 530.
The mechanism is a Postgres job_batch row holding a remaining_slots counter. Each child job calls SignalAsync, which executes:
UPDATE hangfire.job_batch
SET remaining_slots = remaining_slots - 1
WHERE batch_id = $1 AND remaining_slots > 0
RETURNING remaining_slots
When the returned value hits 0, the global Hangfire filter fires the continuation. PgJobResultStorage lets the continuation consume typed results from earlier jobs in the batch (GetBatchResultAsync<VideoProcessingResult>).
This is the spine of the recording-completion pipeline. Video processing, transcription, thumbnail generation, sprite-sheet generation, and speaker-attribution all run in parallel; VideoCleanupJob only fires after every one finishes, with access to all their typed outputs. The trade-off is real: we own the failure surface. The two risks are double-decrement on retry (mitigated by the atomic WHERE remaining_slots > 0 guard so the same retry can’t take the counter below zero) and batches stuck above zero if a child job is permanently dropped (handled by Hangfire’s standard failure callbacks). Worth knowing if you ever go this route: $500/mo of Hangfire Pro would have been a rational call too; the build-vs-buy was decided more on “we already have Postgres, atomic SQL is the smaller new thing” than on dollar savings.
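The counter semantics of the guarded UPDATE can be modelled in a few lines of TypeScript. This is an in-memory stand-in written for illustration only: BatchCounter and its methods are hypothetical names, and the atomicity that Postgres provides is simply assumed here because the sketch runs on a single thread.

```typescript
// In-memory stand-in for the job_batch row. The real coordinator gets its
// atomicity from Postgres; this only models the counter semantics.
class BatchCounter {
  private remaining = new Map<string, number>();
  private results = new Map<string, unknown[]>();
  private onComplete: (batchId: string, results: unknown[]) => void;

  constructor(onComplete: (batchId: string, results: unknown[]) => void) {
    this.onComplete = onComplete;
  }

  create(batchId: string, slots: number): void {
    this.remaining.set(batchId, slots);
    this.results.set(batchId, []);
  }

  // Mirrors: UPDATE ... SET remaining_slots = remaining_slots - 1
  //          WHERE batch_id = $1 AND remaining_slots > 0
  //          RETURNING remaining_slots
  // Returns the new counter, or null when the guard matched no row.
  signal(batchId: string, result: unknown): number | null {
    const current = this.remaining.get(batchId);
    if (current === undefined || current <= 0) return null;
    const next = current - 1;
    this.remaining.set(batchId, next);
    const collected = this.results.get(batchId) ?? [];
    collected.push(result);
    this.results.set(batchId, collected);
    if (next === 0) this.onComplete(batchId, collected); // fan-in: fire the continuation once
    return next;
  }
}
```

A signal that arrives after the counter has reached zero matches no row, so it returns null and cannot fire the continuation a second time.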
4. PostMeetingAnalysisBackgroundJob — six parallel LLM call types per meeting
One file, 970 lines, 49KB. PostMeetingAnalysisBackgroundJob fans out across two parallel waves of Task.WhenAll covering six distinct LLM call types, plus a sequential keywords follow-up if the behaviour-mentions wave returned content:
- Highlights extraction
- Behaviour-mentions extraction (against the org’s declared values)
- Summary
- Tags
- Chapters
- Meeting classification
Each call has its own strongly-typed C# IPrompt class extending BasePrompt, its own structured-output JSON schema, a 180-second timeout, and explicit useStructuredOutputs: true, reasoningEffort: "minimal" parameters tuned to gpt-5-mini. The job degrades gracefully: if the transcript refinement step fails, the orchestrator logs and continues with the un-refined text rather than failing the whole pipeline. If the org has no declared values, the behaviour-mentions call is skipped. If the meeting is under 120 seconds, VTT generation for chapters and highlights is skipped entirely.
An internal README of the prompt-system refactor reports 25–30% fewer tokens, 20–25% better accuracy, and 75% fewer JSON parsing errors against the prior ad-hoc baseline. I’d treat those as engineering-team estimates rather than benchmarked numbers — what actually matters for maintainability is the structural win (one consistent prompt class shape, strict-mode JSON schema everywhere, the model can’t emit malformed JSON).
The architecture also supports multi-provider routing — the LLM service has an enum slot for Anthropic and a Strategy interface for the model client — but currently routes everything through OpenAI. The plug point is there for when cost or latency makes a Claude or Gemini swap worth it.
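The two-wave shape translates naturally to Promise.all, the TypeScript analogue of Task.WhenAll. The sketch below is an assumption-heavy illustration rather than the real job: the AnalysisLlm interface is hypothetical, and which calls share a wave is a guess, since the text only says there are two waves.

```typescript
// Hypothetical client surface; each method stands in for one LLM call type.
interface AnalysisLlm {
  refine(transcript: string): Promise<string>;
  highlights(text: string): Promise<string[]>;
  behaviourMentions(text: string, orgValues: string[]): Promise<string[]>;
  summary(text: string): Promise<string>;
  tags(text: string): Promise<string[]>;
  chapters(text: string): Promise<string[]>;
  classification(text: string): Promise<string>;
  keywords(mentions: string[]): Promise<string[]>;
}

async function analyseMeeting(transcript: string, orgValues: string[], llm: AnalysisLlm) {
  // Graceful degradation: keep the raw transcript if refinement fails.
  let text = transcript;
  try {
    text = await llm.refine(transcript);
  } catch {
    // A real job would log here and continue.
  }

  // Wave 1 (which calls share a wave is an assumption).
  const [highlights, mentions] = await Promise.all([
    llm.highlights(text),
    // Skip the behaviour-mentions call when the org has declared no values.
    orgValues.length > 0
      ? llm.behaviourMentions(text, orgValues)
      : Promise.resolve<string[]>([]),
  ]);

  // Wave 2.
  const [summary, tags, chapters, classification] = await Promise.all([
    llm.summary(text),
    llm.tags(text),
    llm.chapters(text),
    llm.classification(text),
  ]);

  // Sequential follow-up, only when the behaviour-mentions call returned content.
  const keywords = mentions.length > 0 ? await llm.keywords(mentions) : [];

  return { text, highlights, mentions, summary, tags, chapters, classification, keywords };
}
```

Per-call timeouts and structured-output parameters are left out to keep the control flow visible.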
5. Stripe webhook lifecycle with synthetic ClaimsPrincipal
Stripe webhooks arrive unauthenticated. But everything downstream (EF tenant-stamping, which reads organization_id from HttpContext.User claims, plus audit logging and subscription update logic) assumes an authenticated user is in scope.
BillingController bridges the gap. The handler validates the Stripe signature, checks idempotency against a StripeWebhookEvent table with a unique constraint on the event id, resolves the org from the event payload (which lives in different fields per event type), then mounts a synthetic ClaimsPrincipal with the resolved organization id onto HttpContext.User. Now the existing tenant-stamping handler in SaveChangesAsync works as if a real user had made the request.
From there: MediatR dispatches a typed command (SubscriptionCreatedCommand, SubscriptionUpdatedCommand, etc.); a Strategy factory picks the right subscription-change strategy (ImmediateUpgradeStrategy, ScheduledDowngradeStrategy, PaymentModeUpgradeStrategy, ZeroCostUpgradeStrategy, SetupPaymentMethodStrategy) based on the diff between current and target subscription state; the result reconciles to a SubscriptionStateHash on the org so the SPA can detect drift and force a token refresh.
Subscription permissions are baked into the JWT itself. The auth refresh recomputes them from the current subscription plan, so per-request authorization checks read JWT claims — zero Stripe API calls on the read path. Stripe is only called when the user actively manages their subscription or when a Stripe webhook fires.
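The webhook bridge (idempotency check, org resolution, synthetic principal) can be sketched as follows. This is a simplified TypeScript model, not the BillingController code: signature validation is omitted, the Set stands in for the unique constraint on StripeWebhookEvent, and the choice of metadata versus client_reference_id as the field carrying the org id is purely an assumption for the example.

```typescript
// Hypothetical, trimmed-down event shape; real Stripe events carry far more.
interface WebhookEvent {
  id: string;
  type: string;
  data: { object: { metadata?: Record<string, string>; client_reference_id?: string } };
}
interface Principal { claims: Record<string, string> }

// Assumption: which field carries the org id per event type is invented here.
function resolveOrgId(event: WebhookEvent): string | null {
  const obj = event.data.object;
  if (event.type === "checkout.session.completed") return obj.client_reference_id ?? null;
  return obj.metadata?.["organization_id"] ?? null;
}

class WebhookProcessor {
  // Stands in for the unique constraint on the StripeWebhookEvent table.
  private seenEventIds = new Set<string>();

  handle(
    event: WebhookEvent,
    dispatch: (user: Principal, event: WebhookEvent) => void,
  ): "processed" | "duplicate" | "unresolved" {
    if (this.seenEventIds.has(event.id)) return "duplicate";
    this.seenEventIds.add(event.id);

    const orgId = resolveOrgId(event);
    if (orgId === null) return "unresolved";

    // Synthetic principal: downstream tenant stamping reads organization_id from it.
    const user: Principal = { claims: { organization_id: orgId } };
    dispatch(user, event);
    return "processed";
  }
}
```

The point of the synthetic principal is that dispatch receives the same shape of user it would get from a real authenticated request, so nothing downstream needs a webhook-specific code path.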
Engineering choices worth calling out
Two databases, on purpose
The transcript schema (Transcript, Word, LiveCaption) sits in its own physical Postgres database with its own EF context. Hot writes during transcription — potentially millions of words per long recording — don’t compete with the metadata DB for connections, locks, or vacuum. Migrations are split too: --context TranscriptDbContext for one, ApplicationDbContext for the other. Costs slightly more in operational complexity (two backups, two connection strings), pays for itself when transcript volume grows.
OpenAI-hosted vector store, no self-hosted RAG
The assistant uses OpenAI’s file_search tool against two vector store IDs: one general knowledge base, one app-specific. The citations the system returns are not free-form text references: they’re a strict-typed JSON schema with seven discriminated action types (navigate, open_video, open_settings, copy_text, open_external_link, open_meeting, contact_support). Each action is bound to real navigable state, not text that could be hallucinated.
The trade-off is honest: I don’t run pgvector or qdrant or pinecone. I don’t maintain an embedding pipeline or version embeddings on model changes. I get OpenAI’s search quality and OpenAI’s pricing, both good enough for this product. If the cost or quality calculus changes, the Strategy interface for retrieval is in place.
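The seven discriminated action types map directly onto a TypeScript union. In the sketch below the action names come from the text, while every payload field and every route string is hypothetical.

```typescript
// The seven action names come from the text; every payload field is hypothetical.
type CitationAction =
  | { type: "navigate"; route: string }
  | { type: "open_video"; meetingId: string; startSeconds: number }
  | { type: "open_settings"; section: string }
  | { type: "copy_text"; text: string }
  | { type: "open_external_link"; url: string }
  | { type: "open_meeting"; meetingId: string }
  | { type: "contact_support" };

const ACTION_TYPES: string[] = [
  "navigate", "open_video", "open_settings", "copy_text",
  "open_external_link", "open_meeting", "contact_support",
];

// Rejects any discriminator the schema does not define.
function isKnownActionType(value: unknown): boolean {
  return typeof value === "string" && ACTION_TYPES.indexOf(value) !== -1;
}

// Resolves an action to an in-app target. The switch is exhaustive: adding an
// eighth variant without handling it becomes a compile error.
function toTarget(action: CitationAction): string {
  switch (action.type) {
    case "navigate": return action.route;
    case "open_video": return `/meetings/${action.meetingId}?t=${action.startSeconds}`;
    case "open_settings": return `/settings/${action.section}`;
    case "copy_text": return `clipboard:${action.text}`;
    case "open_external_link": return action.url;
    case "open_meeting": return `/meetings/${action.meetingId}`;
    case "contact_support": return "/support";
    default: {
      const unreachable: never = action;
      return unreachable;
    }
  }
}
```

Because the model can only pick a discriminator and fill typed fields, a citation either resolves to a concrete target or fails validation; there is no free-text middle ground.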
MediatR pre-save events for quota enforcement
ApplicationDbContext.DispatchBeforeSaveEventsAsync publishes domain notifications (MeetingRecordingBeforeSaveEvent, PromptBeforeSaveEvent, RecordingDurationBeforeSaveEvent) before the EF transaction commits. A subscription-quota handler can throw and roll back the entire transaction. This means quota enforcement isn’t a decorator on every controller — it’s a single integration point at the persistence layer that catches every code path including jobs, webhooks, and future GraphQL routes you haven’t written yet.
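The pattern, stripped of EF and MediatR, looks roughly like this in TypeScript. UnitOfWork, quotaHandler, and the flat per-org minute cap are all hypothetical stand-ins chosen to show one thing: a handler that throws before commit vetoes the whole save.

```typescript
interface RecordingDurationEvent { orgId: string; minutes: number }
type BeforeSaveHandler = (event: RecordingDurationEvent) => void; // throws to veto

class UnitOfWork {
  private pending: RecordingDurationEvent[] = [];
  private committed: RecordingDurationEvent[] = [];
  private handlers: BeforeSaveHandler[] = [];

  onBeforeSave(handler: BeforeSaveHandler): void { this.handlers.push(handler); }
  add(event: RecordingDurationEvent): void { this.pending.push(event); }

  committedMinutes(orgId: string): number {
    return this.committed
      .filter((e) => e.orgId === orgId)
      .reduce((sum, e) => sum + e.minutes, 0);
  }

  // Publishes every pending event before committing anything. If a handler
  // throws, nothing is committed and the pending changes are discarded.
  saveChanges(): void {
    const batch = this.pending;
    this.pending = [];
    for (const event of batch) {
      for (const handler of this.handlers) handler(event);
    }
    this.committed = this.committed.concat(batch);
  }
}

// Hypothetical quota rule: a flat per-org cap on recorded minutes.
function quotaHandler(uow: UnitOfWork, limitMinutes: number): BeforeSaveHandler {
  return (event) => {
    if (uow.committedMinutes(event.orgId) + event.minutes > limitMinutes) {
      throw new Error(`Quota exceeded for ${event.orgId}`);
    }
  };
}
```

Any caller that goes through saveChanges, whether a controller, a job, or a webhook, hits the same check without knowing it exists.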
Two-pass language alignment
The assistant’s alignment prompt is wired for seven languages; the SPA ships translations for six today. The system prompt instructs the model to detect the user’s question language and respond in it. In practice, models drift: a Bulgarian question gets an English answer because the UI lang is English and the system prompt is English.
The fix is a second LLM call. After the first response, a LanguageAlignmentPrompt runs that detects the actual language of the question and the actual language of the response, and re-translates if they don’t match. Costs an extra call. Eliminates an entire class of “answered in the wrong language” bugs. The model’s self-declared detected_question_language and detected_response_language are part of the strict response schema, so they’re queryable as telemetry.
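The two-pass flow is small enough to sketch end to end. The two detected_* field names come from the schema described above; aligned_response and the AssistantModel interface are hypothetical, introduced so the example runs without a real API.

```typescript
// detected_* field names come from the text; aligned_response is hypothetical.
interface AlignmentCheck {
  detected_question_language: string;
  detected_response_language: string;
  aligned_response: string; // the response re-translated into the question's language
}

// Hypothetical model surface so the flow can run without a real API.
interface AssistantModel {
  answer(question: string): Promise<string>;
  checkAlignment(question: string, response: string): Promise<AlignmentCheck>;
}

async function answerInQuestionLanguage(question: string, model: AssistantModel) {
  const draft = await model.answer(question);                // pass 1: answer
  const check = await model.checkAlignment(question, draft); // pass 2: detect and align
  const drifted = check.detected_question_language !== check.detected_response_language;
  return {
    text: drifted ? check.aligned_response : draft,
    drifted,
    questionLanguage: check.detected_question_language,
    responseLanguage: check.detected_response_language,
  };
}
```

Returning the detected languages alongside the text is what makes drift measurable: the share of responses with drifted set to true can be tracked per language over time.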
JWT-baked subscription permissions
Around 80 base permissions of the form Resource.Action.Entity plus 7 space-scoped variants of the form Spaces.{spaceId}.Action.Entity, roughly 87 in total across plans and roles. The authorisation check is a JWT claim lookup. When a webhook updates a subscription, the auth refresh recomputes permissions and a SubscriptionStateHash on the org gets bumped; the SPA compares the hash returned in API response headers against the one in its current token and triggers a silent re-auth on drift. Saves real Stripe API calls and real latency on the hot path.
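A sketch of the claim lookup and the drift check, in TypeScript as the SPA side would see it. The claim names and the example permission strings are hypothetical; only the two permission formats and the hash comparison come from the text.

```typescript
// Claim names are hypothetical; the permission formats follow the text.
interface TokenClaims {
  permissions: string[];
  subscriptionStateHash: string;
}

// Base permission check: a plain lookup, no network call.
function can(claims: TokenClaims, permission: string): boolean {
  return claims.permissions.indexOf(permission) !== -1;
}

// Space-scoped variant: Spaces.{spaceId}.Action.Entity
function canInSpace(
  claims: TokenClaims,
  spaceId: string,
  action: string,
  entity: string,
): boolean {
  return can(claims, `Spaces.${spaceId}.${action}.${entity}`);
}

// True when the hash from an API response header no longer matches the token,
// meaning the subscription changed and a silent re-auth is needed.
function needsReauth(claims: TokenClaims, responseHash: string | null): boolean {
  return responseHash !== null && responseHash !== claims.subscriptionStateHash;
}
```

A missing header is treated as "no information" rather than drift, so responses that do not carry the hash never trigger a re-auth loop.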
WebSocket / track-element JWT-via-query-string allowlist
Browsers can’t put custom headers on WebSocket handshakes or HTML <track> requests. So the JWT is allowed via ?access_token= for exactly two paths: /hubs/transcript-status and /api/v1/recordings/media. Anywhere else, query-string JWTs are ignored. Practical accommodation for a real browser limitation, narrowly scoped.
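The allowlist logic can be expressed as a single token-extraction function. This TypeScript sketch assumes prefix matching on the two paths (so that sub-paths under the hub route are covered), which is a guess; the real middleware is in the .NET API and may match differently.

```typescript
// The two paths come from the text; prefix matching on sub-paths is an assumption.
const QUERY_TOKEN_PATHS = ["/hubs/transcript-status", "/api/v1/recordings/media"];

function extractToken(
  path: string,
  authorizationHeader: string | undefined,
  query: Record<string, string | undefined>,
): string | null {
  // The Authorization header is always honoured.
  const prefix = "Bearer ";
  if (authorizationHeader !== undefined && authorizationHeader.indexOf(prefix) === 0) {
    return authorizationHeader.slice(prefix.length);
  }
  // The query-string token is honoured only on allowlisted paths.
  const allowed = QUERY_TOKEN_PATHS.some(
    (p) => path === p || path.indexOf(p + "/") === 0,
  );
  if (!allowed) return null;
  return query["access_token"] ?? null;
}
```

Matching on the path plus a trailing slash, rather than a bare prefix, avoids accidentally allowlisting a sibling route that merely starts with the same characters.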
Strict schemas + structured outputs everywhere
All gpt-5-mini calls in the post-meeting analysis use useStructuredOutputs: true with the OpenAI Responses API’s strict-mode JSON schema. The conversation controller fails fast on parse errors when structured outputs are on, rather than swallowing a malformed response. The non-GPT-5 fallback path uses loose json_object mode for backward compatibility. The README claims this combination cut JSON parsing errors by 75%.
What I’d do differently
Live transcription. Today, Deepgram is called post-call with the full recording. Live captions exist (Google Meet only, scraped by the extension) but feed only speaker attribution, not the displayed transcript. A WebSocket Deepgram stream during the meeting is the obvious upgrade: real-time transcript visible mid-meeting, more useful for live collaboration. I deferred it because post-call simplifies the failure model.
Multi-provider for real. The Strategy is in place for Anthropic and Gemini. The plumbing isn’t. Adding Claude as a fallback (when OpenAI rate-limits) and Gemini for cheap classification work would cut costs and improve resilience. Cost-benefit hasn’t hit the threshold yet.
Token tracking. The LLMService.TrackTokenUsageAsync method currently only logs usage. I’d graduate it to enforced quotas per plan tier: the right shape would be a budget per organization per billing period, with a soft warning at 80% and a hard cutoff at 100%. The Strategy is in place; the implementation isn’t.
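The budget shape described above fits in a few lines. This is a sketch of what such enforcement could look like, not existing code: TokenLedger, the key format, and the period string are hypothetical, and only the 80% and 100% thresholds come from the text.

```typescript
type BudgetStatus = "ok" | "warning" | "blocked";

function budgetStatus(usedTokens: number, budgetTokens: number): BudgetStatus {
  if (usedTokens >= budgetTokens) return "blocked";          // hard cutoff at 100%
  // 80% check written as used/budget >= 4/5 to stay in integer arithmetic.
  if (usedTokens * 5 >= budgetTokens * 4) return "warning";  // soft warning at 80%
  return "ok";
}

class TokenLedger {
  private used = new Map<string, number>();
  private budgetTokens: number;

  constructor(budgetTokens: number) {
    this.budgetTokens = budgetTokens;
  }

  // Records usage for one org and billing period, e.g. ("org_42", "2026-05"),
  // and returns the status after this usage is counted.
  record(orgId: string, period: string, tokens: number): BudgetStatus {
    const key = `${orgId}:${period}`;
    const total = (this.used.get(key) ?? 0) + tokens;
    this.used.set(key, total);
    return budgetStatus(total, this.budgetTokens);
  }
}
```

Keying by period means a new billing period starts from zero without any reset job; old keys can be archived or dropped on whatever schedule suits the telemetry.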
Observability. Serilog to Postgres + Slack works. For a SaaS at scale I’d want OpenTelemetry traces flowing into Honeycomb or Tempo so the multi-service spans (extension → API → LLM service → Deepgram → Hangfire jobs → Stripe webhook → SPA) can be inspected end-to-end without correlating log lines by hand.
Staging. Currently disabled to save costs. The right move is to spin it up only on PR merges, not 24/7.
What’s live
- Public Chrome Web Store extension — install link — Manifest V3, two years on the store, multi-package monorepo with E2E test suite
- Production SaaS at app.insightdraft.com with Stripe billing, multi-environment Jenkins CI/CD, AWS for compute and Hetzner for backups
- Eight production services orchestrated under one Docker Compose plus Traefik 2.5 with Let’s Encrypt (twelve in the monorepo total, including the standalone Chrome extension and Slack-Huddle bot)
- Custom Hangfire fan-out/fan-in primitive that gives us Batches without paying for Hangfire Pro
- Six LLM call types per meeting across two parallel Task.WhenAll waves, with strict JSON schemas and graceful degradation
- End-to-end Playwright suite running against simulated meetings with real Deepgram callbacks
What this is not
- Not real-time transcription via WebSocket. Deepgram is post-call. The “live” experience for users is status updates over SignalR plus extension-scraped Google Meet captions for speaker attribution.
- Not multi-provider LLM yet. The Strategy and enum scaffold for Anthropic exist; the wiring doesn’t. Today, every call is to OpenAI.
- Not self-hosted RAG. Vector storage is OpenAI-hosted via file_search. Right call for now; will need to reconsider if cost or quality changes.
- Not solo across the whole company. Insight Draft is co-founded. I owned the architecture and the systems described in this case study; my co-founder Francesco contributed engineering on the CMS and analytics paths in addition to product and business.
- Not a finished product. Active development, real bug backlog, real shipping cadence. The Chrome extension is the oldest piece (two years on the Web Store); the rest of the platform is roughly eighteen months of focused work on top of it.
If you’re hiring this kind of work
What this case study demonstrates: I’ve owned the architecture and the full system surface for a meeting-AI product — the extension, the API, the LLM orchestration, the infrastructure, the subscription lifecycle — and watched it run for two years. Few engineers have done this combination end-to-end at a depth that survives operating it. Granola raised $125M to do botless capture for one platform; we built it for three plus Slack Huddles with a co-founding pair, on AWS plus Hetzner.
I take on a small number of engagements per year for founders building AI products and companies that need a senior IC who can ship end-to-end. A typical engagement looks like:
- Weeks 1–2: I trace your existing capture/transcribe/summarise path end-to-end, identify the three most-likely failure modes under load, and write up the architecture recommendations
- Weeks 3–6: spike the most-uncertain piece (extension capture, LLM orchestration, RAG pipeline, payments lifecycle) end-to-end against your real stack
- Weeks 7–10: production hardening, observability, CI/CD, threat model. Pair with one or two of your senior engineers throughout so the codebase transfers
- Week 11+: handoff with documented runbooks; optional retainer for follow-up questions
On conflict of interest: Insight Draft is an active SaaS in this category. I won’t take engagements that overlap directly with its product surface (no work that competes with our roadmap, no IP from your project flows backwards). I’m happy to scope this explicitly upfront so you know exactly what’s in and out before we start.
If that fits what you’re scoping, the booking link below skips slides and goes straight to a 60-minute technical call.
Building something AI-shaped end-to-end?
60-min technical call — no slides, no pitch. Architecture, trade-offs, what would actually work for your stack.