Building an AI meeting SaaS end-to-end while bots get banned

The AI meeting category has spent the last twelve months getting reshaped. In August 2025, Otter was hit with a CIPA (California Invasion of Privacy Act) class action over notetaker recordings. Microsoft is rolling out admin policies in May–June 2026 that let Teams tenants block third-party recording bots by default. Universities (UW, Chapman, UC Riverside) have banned non-native AI bots from their meeting estates. And in March 2026, Granola raised a $125M Series C at a $1.5B valuation on the bet that botless recording is the future. Granola did it for one platform; doing it across Google Meet, Microsoft Teams, Zoom, and Slack Huddles is the work I’m about to describe.

Meanwhile, every business meeting still needs an AI summary. So buyers are stuck. The bot-based incumbents (Otter, Fireflies, Chorus, Read.ai) are being procurement-blocked. The platform-native tools (Microsoft Copilot, Gemini for Meet, Zoom AI Companion) only cover the platform you’re inside, which is useless if your team spans Meet, Teams, Zoom, and Slack Huddles. The Granola-style botless newcomers cover one or two platforms each.

Insight Draft is a two-person founding team. My co-founder Francesco and I built it end-to-end. I owned the architecture and the systems described in this case study: the extension, the API, the LLM orchestration, the billing lifecycle, and the infrastructure. Francesco contributed engineering on the CMS and analytics paths in addition to product and business. What we shipped: bot-free recording across Google Meet, Microsoft Teams, and Zoom via a Manifest V3 (MV3) Chrome extension I’ve maintained for two years; a separate Slack Huddle bot for the case where there’s no browser tab to capture; Deepgram speaker-attributed transcription enriched with live caption scraping; six LLM call types per meeting, including summaries, chapters, highlights, and behaviour mentions; RAG (retrieval-augmented generation) over transcripts with verifiable citations; Stripe billing with proper subscription lifecycle; observability and CI/CD.

Why this is interesting now

The buyer pressure comes from two directions. On one side: the regulatory and platform pushback on bots. On the other: the feature bar set by venture-backed incumbents, which users now expect to include speaker-attributed transcripts, topic chapters, decision/action extraction, conversation analytics, multi-language support, and grounded Q&A with citations.

The architectural question is straightforward: can a small team ship a product with the depth of a $30M-funded competitor, by leaning hard on opinionated platform choices and avoiding work that doesn’t differentiate? The answer that’s in the code: yes, by buying the things that aren’t differentiating (OpenAI-hosted vector store, Deepgram for transcription, Stripe for billing, Postmark for email) and building the things that are (recording capture across platforms, speaker resolution, the orchestration around the LLM calls).

What I built

The monorepo holds twelve services. Eight of them run together in production under one Docker Compose plus Traefik 2.5 (the proxy itself, the Angular SPA, the marketing site, the .NET API, the Node LLM service, Postgres, the DB-setup container, and the encrypted backup runner). The Chrome extension and the Slack-Huddle bot run on their own; the internal CMS and the E2E suite are tooling. The five services that do the work day-to-day:

Chrome extension (Manifest V3, multi-package monorepo): records Meet/Teams/Zoom via tabCapture for browser meetings and desktopCapture for desktop apps. Has a content script running in MAIN world on Google Meet that scrapes the platform’s own caption stream for live speaker attribution. Bidirectional messaging with the web app via externally_connectable.
Slack Huddle bot (meeting-bot/): Playwright with stealth plugin, joins Slack huddles where there’s no browser tab to capture. Multiple flow variants (auth-cookie, invite, magic-link). Janus client for direct connection to Slack’s WebRTC SFU as a fallback. Necessary because the extension can’t reach Slack’s desktop UI.
.NET 8 API (insight-draft-api/): ASP.NET Core, EF Core, Hangfire (background jobs), MediatR (in-process CQRS), Identity + JWT auth, two PostgreSQL databases (main app + a separate transcript DB to keep high-write Word/Caption tables off the metadata DB). Stripe.net 48, Deepgram SDK 6, AWS SDK, FFmpeg via CliWrap. This is where the recording pipeline, billing lifecycle, and tenant model live.
Node.js LLM service (insight-draft-api-llm/): Express + tsoa-generated routes, kept deliberately thin. Proxies and orchestrates the OpenAI Responses API with strict structured outputs, moderation passthrough, conversation persistence, and the assistant RAG path backed by OpenAI-hosted file_search vector stores. The .NET API calls this; the SPA never does.
Angular 17 SPA (insight-draft-ui/): NgRx (Redux for Angular), SignalR (.NET’s real-time hub framework) client for status updates, vidstack/media-chrome video player consuming WebVTT files for subtitles/speakers/chapters/highlights, Toast UI editor for in-app notes, ngx-translate across six languages (en, it, fr, es, nl, bg).

The other seven services round it out:

Marketing site (Angular SSR), internal CMS (React 19 + Vite + shadcn), CMS API (clean-architecture .NET skeleton), Playwright E2E suite (extension + meetings projects), DB-setup container (one-shot bootstrap), deployment (Docker Compose + Traefik 2.5 with Let’s Encrypt + AWS Secrets Manager), cookie-consent kit.

The hardest engineering parts

1. Statistical-voting speaker mapping

Deepgram returns numeric diarization IDs — Speaker 0, 1, 2 — with no relation to actual participants. The mapper’s job is to bind those numeric IDs to known users and external participant identifiers from the meeting’s participant list. Naive approaches (count consecutive segments, longest-talker heuristic) fail on real meetings with overlap and short utterances.

StatisticalVotingSpeakerMapper does it differently. It iterates every caption emitted by Deepgram, votes for the participant whose known speaking-time window overlaps it, weights by overlap duration, and then derives a confidence threshold per mapping. The threshold itself adapts to coverage (relaxed for short sample windows, stricter for longer ones) via a separate ThresholdPolicy. When confidence is below the threshold, the mapping falls back to the simpler TimestampOnlySpeakerMapper rather than making a high-confidence claim it can’t back up.

Around it: VoteCollector, MappingBuilder, CoverageAnalyzer, SpeakerTimeLookup. The whole stack is pluggable behind ISpeakerMappingService with separate strategies for different meeting providers (SimulatedDiarizationMapper, ManualRecordingDiarizationStrategy, StatisticalVotingSpeakerMapper).
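The voting idea can be sketched in a few lines. This is a minimal illustration in TypeScript, not the C# of the real mapper: the type names, the 30-second cut-off, and the 0.5 / 0.7 thresholds are all assumptions chosen for the example, and a `null` participant stands in for "fall back to the simpler timestamp-only mapper".

```typescript
// Hypothetical shapes; the real mapper's types are not shown in the write-up.
interface Segment { speakerId: number; start: number; end: number }
interface SpeakingWindow { participantId: string; start: number; end: number }
interface Mapping { speakerId: number; participantId: string | null; confidence: number }

// Overlap between two intervals in seconds (0 when they do not intersect).
function overlapSeconds(
  a: { start: number; end: number },
  b: { start: number; end: number },
): number {
  return Math.max(0, Math.min(a.end, b.end) - Math.max(a.start, b.start));
}

// Assumed policy values: relaxed for short samples, stricter for long ones.
function thresholdFor(votedSeconds: number): number {
  return votedSeconds < 30 ? 0.5 : 0.7;
}

function mapSpeakers(segments: Segment[], windows: SpeakingWindow[]): Mapping[] {
  // votes: diarization ID -> (participant -> total overlapping seconds)
  const votes = new Map<number, Map<string, number>>();
  for (const seg of segments) {
    let tally = votes.get(seg.speakerId);
    if (tally === undefined) {
      tally = new Map<string, number>();
      votes.set(seg.speakerId, tally);
    }
    for (const w of windows) {
      const weight = overlapSeconds(seg, w);
      if (weight > 0) tally.set(w.participantId, (tally.get(w.participantId) ?? 0) + weight);
    }
  }

  const mappings: Mapping[] = [];
  for (const [speakerId, tally] of Array.from(votes)) {
    let total = 0;
    let best: string | null = null;
    let bestWeight = 0;
    for (const [participantId, weight] of Array.from(tally)) {
      total += weight;
      if (weight > bestWeight) { best = participantId; bestWeight = weight; }
    }
    // Confidence is the winner's share of all overlap-weighted votes.
    const confidence = total > 0 ? bestWeight / total : 0;
    const accepted = confidence >= thresholdFor(total);
    // null means: do not claim this mapping, use the fallback mapper instead.
    mappings.push({ speakerId, participantId: accepted ? best : null, confidence });
  }
  return mappings;
}
```

In this sketch a speaker with 10 seconds of votes split 9 to 1 is accepted, while a speaker with 40 seconds split 24 to 16 is rejected, because the longer sample is held to the stricter threshold.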

2. The host-inference heuristic for Google Meet

For Google Meet specifically, the extension scrapes the platform’s own caption stream from the page’s JS context (MAIN-world content script), and the API matches caption deviceId to participant ExternalUserId. There’s a real-world quirk: the meeting host never appears in their own RTC participants feed. Google’s client-side state doesn’t list the local user. So when the matcher runs after the meeting ends, the host’s captions have no participant to bind to.

The fix is inline in RecordingCompletionService.CreateTranscriptWithLLM (around lines 465–486): if exactly one deviceId remains unmatched after the regular pass, that deviceId is the host. A hand-engineered correction for a real Google API gap. The kind of fix you only build after watching real meeting traces fail and figuring out why.
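The rule itself is small. Here is a sketch of it in TypeScript (the real fix is C# inside the API); the `Participant` shape, the function name, and the idea of passing the host name in as a parameter are assumptions made for illustration.

```typescript
// Hypothetical shape: externalUserId is what caption deviceIds are matched against.
interface Participant { externalUserId: string; name: string }

// Returns deviceId -> participant name for every device that could be bound.
function bindCaptionDevices(
  captionDeviceIds: string[],
  participants: Participant[],
  hostName: string,
): Record<string, string> {
  const byExternalId = new Map<string, string>();
  for (const p of participants) byExternalId.set(p.externalUserId, p.name);

  const bound: Record<string, string> = {};
  const unmatched: string[] = [];
  // Regular pass: bind each distinct deviceId to a listed participant.
  for (const deviceId of Array.from(new Set(captionDeviceIds))) {
    const name = byExternalId.get(deviceId);
    if (name !== undefined) bound[deviceId] = name;
    else unmatched.push(deviceId);
  }

  // Host inference: the host is missing from the participant feed, so a single
  // leftover deviceId is attributed to the host. With zero or several leftovers
  // the heuristic does not apply and nothing is guessed.
  if (unmatched.length === 1) bound[unmatched[0]] = hostName;
  return bound;
}
```

The "exactly one" condition is what keeps the heuristic safe: with two unmatched devices there is no way to tell which one is the host, so the sketch leaves both unbound.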

3. Custom Hangfire fan-out/fan-in via Postgres atomic UPDATE

Hangfire ships single-job continuations out of the box. Batches (fan-out from N jobs to a single continuation when all complete) is a paid Hangfire Pro feature. So I built it. The three core files (PgBatchCoordinator, BatchContinuationFilter, PgJobResultStorage) are about 280 lines together; the whole batching project including builders and DI plumbing is around 530.

The mechanism is a Postgres job_batch row holding a remaining_slots counter. Each child job calls SignalAsync, which executes:

PgBatchCoordinator.SignalAsync (SQL):

```sql
UPDATE hangfire.job_batch
SET remaining_slots = remaining_slots - 1
WHERE batch_id = $1 AND remaining_slots > 0
RETURNING remaining_slots
```

When the returned value hits 0, the global Hangfire filter fires the continuation. PgJobResultStorage lets the continuation consume typed results from earlier jobs in the batch (GetBatchResultAsync<VideoProcessingResult>).

This is the spine of the recording-completion pipeline. Video processing, transcription, thumbnail generation, sprite-sheet generation, and speaker-attribution all run in parallel; VideoCleanupJob only fires after every one finishes, with access to all their typed outputs. The trade-off is real: we own the failure surface. The two risks are double-decrement on retry (mitigated by the atomic WHERE remaining_slots > 0 guard so the same retry can’t take the counter below zero) and batches stuck above zero if a child job is permanently dropped (handled by Hangfire’s standard failure callbacks). Worth knowing if you ever go this route: $500/mo of Hangfire Pro would have been a rational call too. The build-vs-buy was decided more on “we already have Postgres, atomic SQL is the smaller new thing” than on dollar savings.
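To make the fan-in semantics concrete, here is a small TypeScript model of the counter. It is an in-memory stand-in, not the real coordinator: `FakeBatchTable` and `signal` are hypothetical names, and the class only mimics what the guarded `UPDATE ... RETURNING` statement returns. In Postgres the same guarantee comes from the single-row UPDATE being atomic, with concurrent signals serialising on the row lock.

```typescript
// In-memory stand-in for the job_batch table (illustration only).
class FakeBatchTable {
  private rows = new Map<string, number>();

  create(batchId: string, slots: number): void {
    this.rows.set(batchId, slots);
  }

  // Mirrors the guarded statement: decrement only while remaining_slots > 0.
  // Returns the new value, or null when no row matched (unknown batch, or the
  // counter is already at 0), which is what an empty RETURNING set would mean.
  decrement(batchId: string): number | null {
    const current = this.rows.get(batchId);
    if (current === undefined || current <= 0) return null;
    this.rows.set(batchId, current - 1);
    return current - 1;
  }
}

// Each child job calls this once when it finishes.
function signal(table: FakeBatchTable, batchId: string, onComplete: () => void): void {
  const remaining = table.decrement(batchId);
  // Only the one decrement that lands exactly on 0 fires the continuation.
  if (remaining === 0) onComplete();
}
```

With a batch of three, a fourth (stray) signal matches no row and returns `null`, so the continuation fires exactly once. The model covers only the counter guard; telling a retried child apart from a genuinely new signal would need separate bookkeeping that this sketch does not attempt.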

4. PostMeetingAnalysisBackgroundJob — six parallel LLM call types per meeting

One file, 970 lines, 49KB. PostMeetingAnalysisBackgroundJob fans out across two parallel waves of Task.WhenAll covering six distinct LLM call types, plus a sequential keywords follow-up if the behaviour-mentions wave returned content:

Highlights extraction
Behaviour-mentions extraction (against the org’s declared values)
Summary
Tags
Chapters
Meeting classification

Each call has its own strongly-typed C# IPrompt class extending BasePrompt, its own structured-output JSON schema, a 180-second timeout, and explicit useStructuredOutputs: true, reasoningEffort: "minimal" parameters tuned to gpt-5-mini. The job degrades gracefully: if the transcript refinement step fails, the orchestrator logs and continues with the un-refined text rather than failing the whole pipeline. If the org has no declared values, the behaviour-mentions call is skipped. If the meeting is under 120 seconds, VTT generation for chapters and highlights is skipped entirely.
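The shape of the orchestration can be sketched with `Promise.all` standing in for `Task.WhenAll`. This is a TypeScript illustration rather than the C# job: which call types sit in which wave is an assumption (the write-up only says there are two waves), `LlmClient` and `refine` are hypothetical interfaces, and the 180-second timeout is left out for brevity.

```typescript
// Hypothetical inputs and client; the real prompt classes are not shown here.
interface AnalysisInput { transcript: string; orgValues: string[] }
interface LlmClient { run(callType: string, transcript: string): Promise<string> }

async function analyseMeeting(
  input: AnalysisInput,
  llm: LlmClient,
  refine: (transcript: string) => Promise<string>,
) {
  // Graceful degradation: if refinement fails, continue with the raw transcript.
  let text = input.transcript;
  try {
    text = await refine(input.transcript);
  } catch {
    // A real job would log here; the pipeline carries on with unrefined text.
  }

  const run = (callType: string) => llm.run(callType, text);
  const skipped = Promise.resolve<string | null>(null);

  // Wave 1 (assumed grouping). Behaviour mentions are skipped when the org
  // has declared no values.
  const [highlights, behaviourMentions, summary] = await Promise.all([
    run("highlights"),
    input.orgValues.length > 0 ? run("behaviourMentions") : skipped,
    run("summary"),
  ]);

  // Wave 2 (assumed grouping).
  const [tags, chapters, classification] = await Promise.all([
    run("tags"),
    run("chapters"),
    run("classification"),
  ]);

  // Sequential follow-up, only when behaviour mentions returned content.
  const keywords = behaviourMentions ? await run("keywords") : null;

  return { highlights, behaviourMentions, summary, tags, chapters, classification, keywords };
}
```

Within a wave the calls run concurrently, so the wave takes about as long as its slowest call; the keywords step stays sequential because it depends on the behaviour-mentions result.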

An internal README of the prompt-system refactor reports 25–30% fewer tokens, 20–25% better accuracy, and 75% fewer JSON parsing errors against the prior ad-hoc baseline. I’d treat those as engineering-team estimates rather than benchmarked numbers — what actually matters for maintainability is the structural win (one consistent prompt class shape, strict-mode JSON schema everywhere, the model can’t emit malformed JSON).

The architecture also supports multi-provider routing — the LLM service has an enum slot for Anthropic and a Strategy interface for the model client — but currently routes everything through OpenAI. The plug point is there for when cost or latency makes a Claude or Gemini swap worth it.

5. Stripe webhook lifecycle with synthetic ClaimsPrincipal

Stripe webhooks arrive unauthenticated. But everything downstream (EF tenant-stamping, which reads organization_id from HttpContext.User claims; audit logging; subscription update logic) assumes an authenticated user is in scope.

BillingController bridges the gap. The handler validates the Stripe signature, checks idempotency against a StripeWebhookEvent table with a unique constraint on the event id, resolves the org from the event payload (which lives in different fields per event type), then mounts a synthetic ClaimsPrincipal with the resolved organization id onto HttpContext.User. Now the existing tenant-stamping handler in SaveChangesAsync works as if a real user had made the request.
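The flow after signature validation can be sketched as follows. This is a TypeScript illustration, not the C# controller: the in-memory `Set` stands in for the unique-constrained StripeWebhookEvent table, the `Principal` shape is hypothetical, and the per-event-type lookup (reading `client_reference_id` on a checkout session and `metadata.organization_id` elsewhere) is an assumed convention, not something the write-up specifies.

```typescript
// Minimal event shape: Stripe events carry an id, a type, and data.object.
interface StripeEvent { id: string; type: string; data: { object: Record<string, unknown> } }
// Hypothetical stand-in for a ClaimsPrincipal.
interface Principal { claims: Record<string, string>; synthetic: boolean }

// Stands in for the table with a unique constraint on the event id.
const processedEventIds = new Set<string>();

// The org id lives in different fields depending on the event type (assumed mapping).
function resolveOrganizationId(event: StripeEvent): string | null {
  const obj = event.data.object;
  if (event.type === "checkout.session.completed") {
    const ref = obj["client_reference_id"];
    return typeof ref === "string" ? ref : null;
  }
  const metadata = obj["metadata"] as Record<string, unknown> | undefined;
  const orgId = metadata ? metadata["organization_id"] : undefined;
  return typeof orgId === "string" ? orgId : null;
}

// Assumes the Stripe signature has already been verified by the caller.
function handleWebhook(event: StripeEvent): Principal | "duplicate" | "unresolved" {
  if (processedEventIds.has(event.id)) return "duplicate"; // idempotency check
  const organizationId = resolveOrganizationId(event);
  if (organizationId === null) return "unresolved";
  processedEventIds.add(event.id);
  // Synthetic principal: downstream tenant-stamping reads this claim as if a
  // real user had made the request.
  return { claims: { organization_id: organizationId }, synthetic: true };
}
```

Marking the principal as synthetic is a choice made in this sketch so audit logs could distinguish webhook-driven writes from user-driven ones; whether the real handler does the same is not stated.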

From there: MediatR dispatches a typed command (SubscriptionCreatedCommand, SubscriptionUpdatedCommand, etc.); a Strategy factory picks the right subscription-change strategy (ImmediateUpgradeStrategy, ScheduledDowngradeStrategy, PaymentModeUpgradeStrategy, ZeroCostUpgradeStrategy, SetupPaymentMethodStrategy) based on the diff between current and target subscription state; the result reconciles to a SubscriptionStateHash on the org so the SPA can detect drift and force a token refresh.

Subscription permissions are baked into the JWT itself. The auth refresh recomputes them from the current subscription plan, so per-request authorization checks read JWT claims — zero Stripe API calls on the read path. Stripe is only called when the user actively manages their subscription or when a Stripe webhook fires.

Engineering choices worth calling out

Two databases, on purpose

The transcript schema (Transcript, Word, LiveCaption) sits in its own physical Postgres database with its own EF context. Hot writes during transcription — potentially millions of words per long recording — don’t compete with the metadata DB for connections, locks, or vacuum. Migrations are split too: --context TranscriptDbContext for one, ApplicationDbContext for the other. Costs slightly more in operational complexity (two backups, two connection strings), pays for itself when transcript volume grows.

OpenAI-hosted vector store, no self-hosted RAG

The assistant uses OpenAI’s file_search tool against two vector store IDs: one general knowledge base, one app-specific. The citations the system returns are not free-form text references: they’re a strict-typed JSON schema with seven discriminated action types (navigate, open_video, open_settings, copy_text, open_external_link, open_meeting, contact_support). Each action is bound to real navigable state, not text that could be hallucinated.
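On the consuming side, seven discriminated action types map naturally onto a TypeScript discriminated union. The sketch below uses the seven type names listed above; every payload field (`route`, `meetingId`, `startSeconds`, and so on) and every route string is an assumption made for illustration, since the actual schema is not reproduced here.

```typescript
// The `type` values come from the text; all payload fields are hypothetical.
type CitationAction =
  | { type: "navigate"; route: string }
  | { type: "open_video"; meetingId: string; startSeconds: number }
  | { type: "open_settings"; section: string }
  | { type: "copy_text"; text: string }
  | { type: "open_external_link"; url: string }
  | { type: "open_meeting"; meetingId: string }
  | { type: "contact_support" };

const ACTION_TYPES: string[] = [
  "navigate", "open_video", "open_settings", "copy_text",
  "open_external_link", "open_meeting", "contact_support",
];

// Runtime guard for the discriminator of an untrusted payload.
function isKnownActionType(value: string): boolean {
  return ACTION_TYPES.indexOf(value) !== -1;
}

function assertNever(value: never): never {
  throw new Error("Unhandled citation action: " + JSON.stringify(value));
}

// Resolve an action to an app target. The switch is exhaustive: adding an
// eighth action type to the union makes this function fail to compile.
function resolveTarget(action: CitationAction): string {
  switch (action.type) {
    case "navigate": return action.route;
    case "open_video": return `/meetings/${action.meetingId}?t=${action.startSeconds}`;
    case "open_settings": return `/settings/${action.section}`;
    case "copy_text": return `clipboard:${action.text}`;
    case "open_external_link": return action.url;
    case "open_meeting": return `/meetings/${action.meetingId}`;
    case "contact_support": return "/support";
    default: return assertNever(action);
  }
}
```

The point of the union is that a citation can only ever be one of a closed set of actions with typed payloads, so the UI never has to interpret free text to decide where a citation leads.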

The trade-off is honest: I don’t run pgvector or qdrant or pinecone. I don’t maintain an embedding pipeline or version embeddings on model changes. I get OpenAI’s search quality and OpenAI’s pricing, both good enough for this product. If the cost or quality calculus changes, the Strategy interface for retrieval is in place.

MediatR pre-save events for quota enforcement

ApplicationDbContext.DispatchBeforeSaveEventsAsync publishes domain notifications (MeetingRecordingBeforeSaveEvent, PromptBeforeSaveEvent, RecordingDurationBeforeSaveEvent) before the EF transaction commits. A subscription-quota handler can throw and roll back the entire transaction. This means quota enforcement isn’t a decorator on every controller — it’s a single integration point at the persistence layer that catches every code path including jobs, webhooks, and future GraphQL routes you haven’t written yet.
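The pattern is easier to see stripped of EF and MediatR. Below is a TypeScript sketch under stated assumptions: `UnitOfWork`, `BeforeSaveEvent`, and `quotaHandler` are hypothetical names, events are dispatched synchronously, and "commit" is just a push onto an array. The one property it is meant to show is that a handler which throws prevents anything from being committed.

```typescript
// Hypothetical event shape, loosely modelled on RecordingDurationBeforeSaveEvent.
interface BeforeSaveEvent { kind: string; organizationId: string; recordingMinutes: number }
// A handler vetoes the save by throwing.
type BeforeSaveHandler = (event: BeforeSaveEvent) => void;

class UnitOfWork {
  private pending: BeforeSaveEvent[] = [];
  private handlers: BeforeSaveHandler[];
  readonly committed: BeforeSaveEvent[] = [];

  constructor(handlers: BeforeSaveHandler[]) {
    this.handlers = handlers;
  }

  track(event: BeforeSaveEvent): void {
    this.pending.push(event);
  }

  // Dispatch every pending event to every handler BEFORE committing.
  // If any handler throws, nothing is committed and the error propagates.
  saveChanges(): void {
    try {
      for (const event of this.pending) {
        for (const handler of this.handlers) handler(event);
      }
      this.committed.push(...this.pending);
    } finally {
      this.pending = []; // on failure this discards the work, like a rollback
    }
  }
}

// Quota check that applies to every caller of saveChanges, whatever the entry point.
function quotaHandler(
  limitMinutes: Record<string, number>,
  usedMinutes: Record<string, number>,
): BeforeSaveHandler {
  return (event) => {
    const limit = limitMinutes[event.organizationId] ?? 0;
    const next = (usedMinutes[event.organizationId] ?? 0) + event.recordingMinutes;
    if (next > limit) throw new Error("quota exceeded for " + event.organizationId);
  };
}
```

Because the check hangs off the save path itself, a background job or webhook handler that writes a recording gets the same enforcement as a controller, without any of them knowing the quota exists.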

Two-pass language alignment

The assistant’s alignment prompt is wired for seven languages; the SPA ships translations for six today. The system prompt instructs the model to detect the user’s question language and respond in it. In practice, models drift: a Bulgarian question gets an English answer because the UI lang is English and the system prompt is English.

The fix is a second LLM call. After the first response, a LanguageAlignmentPrompt runs that detects the actual language of the question and the actual language of the response, and re-translates if they don’t match. Costs an extra call. Eliminates an entire class of “answered in the wrong language” bugs. The model’s self-declared detected_question_language and detected_response_language are part of the strict response schema, so they’re queryable as telemetry.
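The two-pass flow can be sketched with both model calls injected as functions. The two `detected_*` field names come from the schema described above; `aligned_response`, the function names, and the returned shape are assumptions for this TypeScript illustration.

```typescript
interface AlignmentResult {
  detected_question_language: string;  // from the strict response schema
  detected_response_language: string;  // from the strict response schema
  aligned_response: string;            // hypothetical: the re-translated answer
}

// Both model calls are injected so the flow can be exercised without a network.
type AnswerFn = (question: string) => Promise<string>;
type AlignFn = (question: string, draft: string) => Promise<AlignmentResult>;

async function answerInQuestionLanguage(question: string, answer: AnswerFn, align: AlignFn) {
  // Pass 1: the normal assistant response, which may drift to the wrong language.
  const draft = await answer(question);

  // Pass 2: detect both languages; use the re-translation only on a mismatch.
  const alignment = await align(question, draft);
  const matched =
    alignment.detected_question_language === alignment.detected_response_language;

  return {
    text: matched ? draft : alignment.aligned_response,
    realigned: !matched,
    // Returned so the detected languages can be recorded as telemetry.
    alignment,
  };
}
```

Keeping the draft when the languages already match means the second pass can only change an answer that was in the wrong language in the first place.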

JWT-baked subscription permissions

Around 80 base permissions of the form Resource.Action.Entity plus 7 space-scoped variants of the form Spaces.{spaceId}.Action.Entity, roughly 87 in total across plans and roles. The authorisation check is a JWT claim lookup. When a webhook updates a subscription, the auth refresh recomputes permissions and a SubscriptionStateHash on the org gets bumped; the SPA compares the hash returned in API response headers against the one in its current token and triggers a silent re-auth on drift. Saves real Stripe API calls and real latency on the hot path.
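Both halves of this, the claim lookup and the drift check, fit in a few lines. The TypeScript sketch below assumes a decoded token with a `permissions` array and a `subscriptionStateHash` field; those names, the example permission strings, and how the hash reaches the client are illustrative guesses, not the real token layout.

```typescript
// Hypothetical decoded-token shape.
interface TokenClaims { permissions: string[]; subscriptionStateHash: string }

// Space-scoped permissions follow Spaces.{spaceId}.Action.Entity.
function spacePermission(spaceId: string, action: string, entity: string): string {
  return `Spaces.${spaceId}.${action}.${entity}`;
}

// Authorisation is a lookup in the token: no database or Stripe call.
function hasPermission(claims: TokenClaims, permission: string): boolean {
  return claims.permissions.indexOf(permission) !== -1;
}

// Drift check on the client: compare the hash the API returned (read from a
// response header) with the one baked into the current token.
function needsSilentReauth(claims: TokenClaims, hashFromResponse: string | null): boolean {
  return hashFromResponse !== null && hashFromResponse !== claims.subscriptionStateHash;
}
```

A response without the header is treated as "no information" rather than drift, so endpoints that do not send the hash never trigger a refresh in this sketch.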

WebSocket / track-element JWT-via-query-string allowlist

Browsers can’t put custom headers on WebSocket handshakes or HTML <track> requests. So the JWT is allowed via ?access_token= for exactly two paths: /hubs/transcript-status and /api/v1/recordings/media. Anywhere else, query-string JWTs are ignored. Practical accommodation for a real browser limitation, narrowly scoped.
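A sketch of the allowlist check in TypeScript (the real check lives in the .NET API). The two paths are the ones named above; the function name, the preference for the Authorization header when both are present, and treating sub-paths of an allowed path as allowed are assumptions.

```typescript
// Only these two paths may carry the JWT in ?access_token=.
const QUERY_TOKEN_PATHS = ["/hubs/transcript-status", "/api/v1/recordings/media"];

function extractToken(
  path: string,
  authorizationHeader: string | undefined,
  queryAccessToken: string | undefined,
): string | null {
  // The normal route: a bearer token in the Authorization header.
  if (authorizationHeader !== undefined && authorizationHeader.startsWith("Bearer ")) {
    return authorizationHeader.slice("Bearer ".length);
  }
  // Query-string tokens are honoured only on allowlisted paths. Matching on
  // "path/" (an assumption) avoids accepting look-alike prefixes.
  const allowed = QUERY_TOKEN_PATHS.some((p) => path === p || path.startsWith(p + "/"));
  if (allowed && queryAccessToken) return queryAccessToken;
  // Anywhere else the query-string token is ignored.
  return null;
}
```

Keeping the list to exact paths matters because query strings end up in access logs and browser history, so the fewer endpoints that accept a token there, the smaller the exposure.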

Strict schemas + structured outputs everywhere

All gpt-5-mini calls in the post-meeting analysis use useStructuredOutputs: true with the OpenAI Responses API’s strict-mode JSON schema. The conversation controller fails fast on parse errors when structured outputs are on, rather than swallowing a malformed response. The non-GPT-5 fallback path uses loose json_object mode for backward compatibility. The README claims this combination cut JSON parsing errors by 75%.

What I’d do differently

Live transcription. Today, Deepgram is called post-call with the full recording. Live captions exist (Google Meet only, scraped by the extension) but feed only speaker attribution, not the displayed transcript. A WebSocket Deepgram stream during the meeting is the obvious upgrade: real-time transcript visible mid-meeting, more useful for live collaboration. I deferred it because post-call simplifies the failure model.

Multi-provider for real. The Strategy is in place for Anthropic and Gemini. The plumbing isn’t. Adding Claude as a fallback (when OpenAI rate-limits) and Gemini for cheap classification work would cut costs and improve resilience. Cost-benefit hasn’t hit the threshold yet.

Token tracking. The LLMService.TrackTokenUsageAsync method is currently logged-only. It would graduate to enforced quotas per plan tier — the right shape would be a budget per organization per billing period, with a soft warning at 80% and hard cutoff at 100%. The Strategy is in place; the implementation isn’t.
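The threshold logic for such a budget is simple enough to sketch. This TypeScript fragment is purely illustrative of the proposed shape, since no enforcement exists yet: the function name, the state names, and treating a zero budget as blocked are all assumptions.

```typescript
type BudgetState = "ok" | "warn" | "blocked";

// usedTokens and budgetTokens are totals for one organization in one billing period.
function budgetState(usedTokens: number, budgetTokens: number): BudgetState {
  if (budgetTokens <= 0) return "blocked"; // no budget configured (assumed behaviour)
  if (usedTokens >= budgetTokens) return "blocked"; // hard cutoff at 100%
  // Soft warning at 80%, compared as integers to avoid floating-point edge cases.
  if (usedTokens * 100 >= budgetTokens * 80) return "warn";
  return "ok";
}
```

A caller would check the state before dispatching an LLM call: proceed on "ok", proceed and notify on "warn", refuse on "blocked".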

Observability. Serilog to Postgres + Slack works. For a SaaS at scale I’d want OpenTelemetry traces flowing into Honeycomb or Tempo so the multi-service spans (extension → API → LLM service → Deepgram → Hangfire jobs → Stripe webhook → SPA) can be inspected end-to-end without correlating log lines by hand.

Staging. Currently disabled to save costs. The right move is to spin it up only on PR merges, not 24/7.

What’s live

Public Chrome Web Store extension — install link — Manifest V3, two years on the store, multi-package monorepo with E2E test suite
Production SaaS at app.insightdraft.com with Stripe billing, multi-environment Jenkins CI/CD, AWS for compute and Hetzner for backups
Eight production services orchestrated under one Docker Compose plus Traefik 2.5 with Let’s Encrypt (twelve in the monorepo total, including the standalone Chrome extension and Slack-Huddle bot)
Custom Hangfire fan-out/fan-in primitive that gives us Batches without paying for Hangfire Pro
Six LLM call types per meeting across two parallel Task.WhenAll waves, with strict JSON schemas and graceful degradation
End-to-end Playwright suite running against simulated meetings with real Deepgram callbacks

What this is not

Not real-time transcription via WebSocket. Deepgram is post-call. The “live” experience for users is status updates over SignalR plus extension-scraped Google Meet captions for speaker attribution.
Not multi-provider LLM yet. The Strategy and enum scaffold for Anthropic exist; the wiring doesn’t. Today, every call is to OpenAI.
Not self-hosted RAG. Vector storage is OpenAI-hosted via file_search. Right call for now; will need to reconsider if cost or quality changes.
Not solo across the whole company. Insight Draft is co-founded. I owned the architecture and the systems described in this case study; my co-founder Francesco contributed engineering on the CMS and analytics paths in addition to product and business.
Not a finished product. Active development, real bug backlog, real shipping cadence. The Chrome extension is the oldest piece (two years on the Web Store); the rest of the platform is roughly eighteen months of focused work on top of it.

If you’re hiring this kind of work

What this case study demonstrates: I’ve owned the architecture and the full system surface for a meeting-AI product — the extension, the API, the LLM orchestration, the infrastructure, the subscription lifecycle — and watched it run for two years. Few engineers have done this combination end-to-end at a depth that survives operating it. Granola raised $125M to do botless capture for one platform; we built it for three plus Slack Huddles with a co-founding pair, on AWS plus Hetzner.

I take on a small number of engagements per year for founders building AI products and companies that need a senior IC who can ship end-to-end. A typical engagement looks like:

Weeks 1–2: I trace your existing capture/transcribe/summarise path end-to-end, identify the three most-likely failure modes under load, and write up the architecture recommendations
Weeks 3–6: spike the most-uncertain piece (extension capture, LLM orchestration, RAG pipeline, payments lifecycle) end-to-end against your real stack
Weeks 7–10: production hardening, observability, CI/CD, threat model. Pair with one or two of your senior engineers throughout so the codebase transfers
Week 11+: handoff with documented runbooks; optional retainer for follow-up questions

On conflict of interest: Insight Draft is an active SaaS in this category. I won’t take engagements that overlap directly with its product surface (no work that competes with our roadmap, no IP from your project flows backwards). I’m happy to scope this explicitly upfront so you know exactly what’s in and out before we start.

If that fits what you’re scoping, the booking link below skips slides and goes straight to a 60-minute technical call.