Legal CLE AI Platform — System & Infrastructure Design
An evidence-grounded question-answering and search layer over a large library of legal-education video, embedded in the members’ website.
Guiding principle. Every substantive answer is anchored to the source video — the member can watch the attorney say it, word for word. In law the exact wording carries the meaning, so accuracy is the product and grounding is how we guarantee it. Every design decision below serves that: retrieval that finds the right passage, timecodes that play the exact moment, and attribution that names the right attorney.
V1 (current scope): the cheapest credible live demo on real content. V2 (future scope): expansion and scale once V1 proves out. The Roadmap breaks this down; the pills on later section headings mark which phase each part belongs to.
For the client
Product & Scope
1.1 What the product is
Two member-facing capabilities over one indexed content library:
- Question answering with a four-panel answer. A member asks a question (e.g. “how does a 1031 exchange defer capital gains?”) and receives four linked panels: (1) an AI reasoning answer, (2) the transcript excerpt it rests on, (3) the video clip of the attorney saying it, played at the exact moment, and (4) that attorney’s credentials. When the answer draws on more than one program, it returns several clips, not just one.
- Hybrid search. Fast, exact browsing over a structured catalog (practice area, sub-practice area, CLE state approvals, format, date, and a popularity score) with an AI semantic layer for the open-ended, long-tail queries that facets can’t express.
The two share the same corpus and index, so a search result and a cited answer point at the same programs, timecodes, and attorneys.
1.2 The corpus we are designing for
The numbers drive the architecture, so they are stated up front:
- ~800 programs today, each ~2 hours — on the order of 1,600 hours of video — already transcribed by the prior system.
- Growing at ~50–60 programs/month now, ramping to ~25 per week, toward a corpus of 2,000–3,000+ hours.
- New or edited programs must appear in the AI within a ~48-hour turnaround.
- Expected usage is modest: tens of concurrent users at launch, low hundreds at maturity, a few thousand queries per day. This is a “correctness and polish” problem, not a high-concurrency one — which keeps the infrastructure lean.
Roadmap · Current Scope (V1) vs Future Scope (V2)
The work ships in two versions. V1 is the cheapest credible live demo on real content, built in three pieces — Setup & Infrastructure, then UI Integration, then Testing & Improvement. V2 expands on it with scale and richer features once V1 has proven its value. Keeping V1 lean is deliberate: it controls the initial spend and lets real usage decide what we invest in next. Throughout the rest of this document, V1 / V2 pills on the section headings mark which phase each part belongs to.
Reading the diagram: V1 is built in three sequential pieces to reach a working demo; once it proves out, V2 layers on scale, search/answer improvements, deeper training, and reach. Green = V1, amber = V2.
V1 — current scope, in detail V1
Every box below is in the first version. The three pieces build on one another.
Reading the diagram: the three V1 pieces run left to right (Setup & Infrastructure → UI Integration → Testing & Improvement); under each are its component workstreams. The detail is below.
Piece 1 · Setup, Infrastructure & Migration
- Access, scope lock & handoff — access to the current system (repo, app, Ragie, WordPress/WP Engine, the S3 videos, designs, content sample); lock the V1 win condition, deadline, and the demo content slice; a knowledge-transfer with the departing developer.
- Ragie migration — export the transcripts, chunks, and metatags before Ragie goes offline; stand up our own store so nothing is lost at the cutover.
- Single secure Azure environment — Container Apps, Azure Database for PostgreSQL (catalog + pgvector), Azure Managed Redis, Azure Storage Queue, Blob, Key Vault, Container Registry, private networking, CI/CD, and basic monitoring + budget alerts.
- Content pipeline over a slice + RAG backbone — extract → transcribe with diarization → resolve speakers → chunk (timecodes) → tag → embed → index a representative set of programs; the retrieval service (embeddings, hybrid search, re-rank, grounded reasoning, SSE streaming); and the operator console for the human-in-the-loop steps.
- Auth + 3-tier gating foundation — the WordPress→iframe token handshake, entitlement checks, signed video URLs, and the public / teaser / member boundary enforced server-side.
Piece 2 · UI Integration
- Embed + the 4-panel answer — iframe embed with the token handshake; the four-panel answer streaming token-by-token; single and multiple video clips; core states (loading/streaming, teaser, no-source, error); and a link out to program materials.
- Hybrid search UI — faceted browse (practice area, sub-area, state, format, date, popularity) with counts and subcategorization; keyword and semantic search; live/replay featured first.
- Video playback + the upgrade moment — timecoded clip playback (seek-and-play a range) via signed URLs; the teaser-to-upgrade paywall moment.
Piece 3 · Testing & Improvement
- Eval + the refinement loop — build an initial eval set (real attorney queries + expected results); score retrieval hit-rate + grounding, tune prompts/chunking/tags/re-rank, repeat.
- Grounding + quality bar — citation check, no-source fallback, not-legal-advice notice; agree and hit the demo acceptance criteria.
- QA + telemetry — QA the three-tier gating and the streaming/resume behavior; basic telemetry (queries, clip plays, teaser hits, conversions).
What V1 deliberately excludes
To keep the first version cheap and fast, these are explicitly out of V1 and live in V2:
- Full-library ingestion (V1 indexes a representative slice) and the automated, event-triggered pipeline.
- PDF / written-materials indexing (the member answer still links to materials in V1; full-text search of them is V2).
- Autoscaling, high-availability replicas, multi-region.
- Cross-source answer synthesis, conversational follow-ups, saved history.
- Multi-tenant / partner bar associations, the paid add-on tier, lead capture, deep cost-optimization, A/B testing, analytics dashboards.
V2 — future scope V2
Once V1 proves members use it, V2 expands on every axis. The diagram shows each expansion area and the work it contains.
Reading the diagram: four expansion areas, each broken into the work it contains. These layer on after V1 has proven out. The detail is below.
Scale & hardening
- Full 2,000–3,000+ hour ingestion with an automated pipeline tied to the content workflow.
- Autoscaling, HA Postgres + read replica, a caching tier, and full monitoring / alerting / budgets.
Search & answer improvements
- Cross-source multi-answer synthesis + de-duplication, conversational follow-ups, saved history.
- PDF / written-materials indexing; deeper chunking/tagging, richer facets, autocomplete and legal-term synonyms.
Advanced training
- Larger eval sets from real member queries, continuous tuning, hallucination-reduction measurement, feedback-driven refinement.
Monetization & reach
- The paid add-on entitlement (~+$100/yr), single-class buyers, lead capture, anti-scraping.
- Partner bar associations (multi-tenant: branded front-ends, per-tenant content and entitlements).
V1 vs V2 at a glance
| Area | V1 | V2 |
|---|---|---|
| Content | Representative slice, video/transcript | Full library, automated pipeline, PDFs/materials |
| Retrieval | Single + multi-clip grounded answer | Cross-source synthesis, conversation |
| Search | Facets + keyword + semantic + popularity | Autocomplete, synonyms, richer facets |
| Infrastructure | Single secure Azure environment | Autoscaling, HA replica, caching |
| Auth / monetization | 3-tier gating + signed video | Paid add-on tier, lead capture, anti-abuse |
| Quality | Initial eval set + loop | Continuous training, feedback, A/B |
| Reach | Single brand | Partner bar associations (multi-tenant) |
How It Connects
The application is an independent web service embedded in the members’ WordPress site as an iframe. It is not a WordPress plugin; it runs on its own infrastructure and talks to WordPress only to learn who a visitor is. Access is organized into three levels so the product can be an acquisition surface and an upgrade lever at once.
Reading the diagram: a visitor loads the members’ site; the AI app is embedded in the page and calls its own backend. The backend reads the catalog and vector index, streams answers, checks membership against WordPress, and plays video from the existing media store. The AI models sit behind a service interface so they can be swapped.
2.1 The three access levels
| Level | Who | What they get |
|---|---|---|
| Catalog search | Anyone (public) | Full keyword/faceted search — discover which programs exist on a topic. The acquisition surface. |
| Answer preview (teaser) | Anyone (public) | A lightweight preview that shows the value without the grounded depth — deliberately cheap to produce (see Section 8.2). Exactly how much the preview reveals is a product decision to confirm with the client (Section 17). The upgrade trigger. |
| Full answer | Members | The complete four-panel answer: grounded reasoning, transcript, the exact video clip, and attorney credentials. |
The boundary is enforced on the server (Section 12), never merely hidden in the interface, and the video itself is released only through short-lived signed links (Section 10).
Query, Retrieval & the 4-Panel Answer · V1 · V2: conversation
A question runs through embed → hybrid retrieve → re-rank → grounded reasoning → assemble panels, streamed token-by-token so the answer begins appearing immediately. The same flow produces a teaser for non-members and one or several clips for members.
Reading the diagram: the route resolves entitlement first. A non-member gets a teaser payload; a member gets the full pipeline — retrieve, re-rank, reason, and push events (reasoning tokens, then one or more four-panel citations) onto a Redis stream that the subscriber relays to the browser.
Reading the diagram: structured filters narrow the space, vector search pulls 10–15 candidates, and the model re-ranks to the best 5–6. Only if grounded evidence clears a threshold does it answer; otherwise it returns an honest “no source” rather than fabricating. When strong chunks come from more than one program, several clips are surfaced.
8.1 How each panel is assembled
The four panels are not four separate generations — they are four views of the same retrieved chunk. This is what guarantees they always agree with one another:
Reading the diagram: one ranked chunk fans out into all four panels — its text becomes the transcript panel, its timecodes become the video clip, its speaker becomes the attorney panel, and the reasoning panel cites its id. Because everything derives from the same chunk, the clip, the quote, and the attorney can never drift out of sync.
- The citation contract. The reasoning model is required to attach, to each claim, the id of the chunk that supports it; it may only assert what a retrieved chunk backs. The assembly layer then resolves each cited id into its transcript text (panel 2), its `[start, end]` → a signed clip URL (panel 3), and its `speaker_id` → the attorney record (panel 4).
- Single vs. multiple results. When the strongest chunks come from one program, the answer shows one clip. When they legitimately span programs, citations are grouped by program, ranked, and shown as several clips (the “multiple video results” case), capped to a small number to stay readable.
- Link to materials. Alongside each cited program, the member answer links out to that program’s existing written materials (handouts, slides). Full-text search of those PDFs is a later phase, but the link itself is available now.
Reading the diagram: an idempotency claim makes retries safe; the generation task writes events with monotonic ids to a Redis stream; the subscriber relays them as SSE. A dropped connection resumes from the last seen id, so a flaky network never loses or duplicates an answer.
8.2 The teaser and the no-source fallback
For a non-member, the answer endpoint returns a lightweight preview plus an upgrade prompt — deliberately cheap to produce (built from catalog / retrieval metadata rather than a full grounded generation), so an unauthenticated question never triggers an expensive per-request model call. It withholds the gated depth (the grounded reasoning, the transcript, and the video). Exactly how much the preview reveals is a product decision still to confirm with the client (Section 17). For a member whose question has no strong source, the system says so plainly; it never invents a citation. A claim with no video behind it does not get made.
8.3 Why it is built this way
- Same-chunk assembly is the structural guarantee against a mismatched quote, clip, or attorney.
- Stream early. Reasoning appears while citations resolve; perceived latency stays low, and resumable streams survive flaky networks.
- Re-rank, don’t over-retrieve. Pulling 10–15 and re-ranking to 5–6 is cheaper and more precise than asking the model to read everything.
- Popular queries can be cached to trim cost and latency, since the corpus changes slowly relative to query volume.
- Long answers survive the stream. The hosting platform is configured with a raised request/idle timeout, and the server sends periodic keep-alive events, so a longer multi-clip answer is never cut off mid-stream (Section 15).
8.4 Hybrid retrieval — one query path
Retrieval is a single round-trip against Postgres that blends three signals and returns 10–15 candidate `chunks` rows. There is no separate keyword database and no second network hop: pgvector, full-text search, and the catalog filters all live in the same store.
- Three signals, one fused rank. (1) Vector: the query embedding is compared against `chunk.embedding` with pgvector’s cosine-distance operator `<=>`, accelerated by an HNSW index so it never scans the whole table. (2) Full-text: a Postgres `tsvector` / `ts_rank` match over `chunk.content` catches exact terms, statute names, and acronyms that embeddings blur. (3) Metadata filters: structured predicates on `program_id`, practice-area `tags`, jurisdiction (`cle_approvals.state`), and format narrow the space before ranking, as hard `WHERE` clauses.
- Fusion by Reciprocal Rank Fusion (RRF). The vector list and the full-text list are each ranked independently, then combined by $\text{score} = \sum_i \frac{w_i}{k + \text{rank}_i}$, where $i$ ranges over the two lists and $k$ is a small constant (e.g. 60). RRF needs no score normalization between the two very different scales and is robust to outliers; the weights $w_{\text{vector}}$ / $w_{\text{text}}$ are tunable and start vector-leaning. The metadata filters are not fused; they gate the candidate pool.
- What comes back. The top 10–15 rows, each carrying `chunk_id`, `content`, `start_ms` / `end_ms`, `speaker_id`, `program_id`, and the fused score. This is deliberate over-retrieval: recall now, precision next.
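The fusion step can be illustrated with a minimal sketch. The function name `rrf_fuse` and the starting weights (0.6 vector, 0.4 text) are assumptions chosen only to reflect "vector-leaning"; in the design the ranking would happen inside the Postgres query, so this is the arithmetic rather than the implementation.

```python
from collections import defaultdict

def rrf_fuse(vector_ids, text_ids, w_vector=0.6, w_text=0.4, k=60, top_n=15):
    """Weighted Reciprocal Rank Fusion over two ranked lists of chunk ids.

    The weights and top_n are illustrative assumptions, not tuned values.
    """
    scores = defaultdict(float)
    for weight, ranked in ((w_vector, vector_ids), (w_text, text_ids)):
        for rank, chunk_id in enumerate(ranked, start=1):  # ranks are 1-based
            scores[chunk_id] += weight / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:top_n]

# "c3" appears in both lists, so it outranks chunks that appear in only one.
fused = rrf_fuse(["c1", "c2", "c3"], ["c3", "c4"])
fused_ids = [chunk_id for chunk_id, _ in fused]
```

Because only ranks enter the formula, the cosine distances and `ts_rank` values never need to be put on a common scale.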
8.5 Re-ranking — trim 15 to the best 5–6
- A cross-encoder reranker, not the reasoning model. The 10–15 candidates are scored by a dedicated reranker that reads (query, chunk.content) as a pair and returns a relevance score. Unlike the bi-encoder embeddings used for retrieval, a cross-encoder attends to query and chunk jointly, so it is far more precise at ordering near-ties — but too expensive to run over the whole corpus, which is why it runs only on the shortlist.
- Trim to 5–6. The top 5–6 reranked chunks become the grounding set — a small reasoning prompt (lower token cost, faster first token) and higher precision, since the model reads only strong evidence.
- Cost / latency. One reranker pass over ~15 short passages takes tens of milliseconds and costs a fraction of the reasoning call. If no reranker endpoint is available in V1, a single small-model LLM scoring call is the fallback, at higher latency.
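A minimal sketch of the trim step, assuming the reranker is reachable as a function that scores one (query, passage) pair. `rerank_and_trim` and `toy_overlap_score` are hypothetical names, and the toy scorer is a word-overlap stand-in used only so the example runs; it is not a cross-encoder.

```python
def rerank_and_trim(query, candidates, score_pair, keep=6):
    """Score each candidate against the query and keep the best `keep` rows."""
    scored = [(score_pair(query, row["content"]), row) for row in candidates]
    # Stable sort: candidates with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:keep]]

def toy_overlap_score(query, passage):
    """Stand-in scorer (shared-word count); a real reranker would replace this."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

candidates = [
    {"chunk_id": "a", "content": "like-kind exchange defers capital gains"},
    {"chunk_id": "b", "content": "deposition strategy basics"},
    {"chunk_id": "c", "content": "capital gains on a 1031 exchange"},
]
grounding_set = rerank_and_trim("1031 exchange capital gains", candidates,
                                toy_overlap_score, keep=2)
grounding_ids = [row["chunk_id"] for row in grounding_set]
```

Passing the scorer in as a parameter keeps the V1 fallback (a small-model LLM scoring call) a drop-in swap.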
8.6 Grounded reasoning & the citation contract
The reasoning model (Claude on Azure AI Foundry, or Azure OpenAI GPT) never sees the corpus — only the 5–6 reranked rows, packed as an evidence block, and it is contractually bound to cite them.
- How the rows are packed. Each chunk is serialized into the context with its id as a stable handle, e.g. `[chunk_id: 8f2a] (program: “Deposition Strategy 2024”, speaker: Jane Roe) <content>`. The ids are opaque tokens the model must echo, not invent.
- System-prompt shape. Role (a CLE research assistant), a hard rule (“assert only what a provided chunk supports; attach the supporting `chunk_id`(s) to every claim; if the evidence is insufficient, say so and cite nothing”), and the required output schema. No outside knowledge, no paraphrase beyond the chunks.
- Structured output, not prose. The model returns JSON: an ordered list of `claims`, each `{ text, chunk_ids: […] }`, plus the set of `cited_chunk_ids`. Structured output makes the citation machine-checkable and lets the assembly layer resolve ids deterministically; the Panel-1 reasoning text is reconstructed from the ordered claims with inline citation markers.
- Grounding gate. If the model reports insufficient evidence (or cites nothing), the pipeline emits the honest no-source answer rather than a fabricated one. This is the gate shown in the retrieval loop.
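The packing format and the grounding gate might look roughly like this. The function names and the `no_source` flag are assumptions for illustration; the row fields `program` and `speaker` stand in for whatever display names the catalog join provides.

```python
import json

def pack_evidence(chunks):
    """Serialize the reranked rows into an evidence block, one handle per line."""
    return "\n".join(
        f'[chunk_id: {c["chunk_id"]}] (program: "{c["program"]}", '
        f'speaker: {c["speaker"]}) {c["content"]}'
        for c in chunks
    )

def parse_answer(raw_json):
    """Parse the model's structured output and apply the grounding gate."""
    data = json.loads(raw_json)
    claims = [
        {"text": c["text"], "chunk_ids": list(c.get("chunk_ids", []))}
        for c in data.get("claims", [])
    ]
    cited = sorted({cid for c in claims for cid in c["chunk_ids"]})
    # Nothing cited means the pipeline should emit the no-source answer.
    return {"claims": claims, "cited_chunk_ids": cited, "no_source": not cited}

evidence = pack_evidence([{
    "chunk_id": "8f2a", "program": "Deposition Strategy 2024",
    "speaker": "Jane Roe", "content": "Always prepare the witness.",
}])
parsed = parse_answer(
    '{"claims": [{"text": "Prepare the witness.", "chunk_ids": ["8f2a"]}]}'
)
empty = parse_answer('{"claims": []}')
```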
8.7 From DB rows to the four panels — the assembly mapping
This is the step the whole system turns on: we have `chunks` rows and video files, and here they become the rendered answer. Assembly is pure resolution, with no further generation. For each `chunk_id` the model cited, the assembly layer reads that one row and fans its fields into the four panels, which is the structural reason the quote, clip, and attorney can never disagree.
Reading the diagram: the query is embedded, hybrid-retrieved against the `chunks` rows, reranked to 5–6, and reasoned over so the model returns claims tagged with `chunk_ids`. For each cited id the assembly layer re-reads that row: `content` fills the transcript panel, `start_ms` / `end_ms` mint a signed clip URL for the video panel, `speaker_id` joins the `speakers` row for the attorney panel, and the ordered claims form the reasoning panel. The whole thing is then serialized to a payload and pushed as SSE events. Green = Postgres rows, blue = process steps, amber = LLM / signed URL / Redis.
- Field-level mapping (per cited chunk). `chunk.content` → Panel 2 (verbatim transcript). `chunk.start_ms` / `chunk.end_ms` + `program_id` → a short-lived signed URL with a range → Panel 3 (seek-and-play clip; Section 10). `chunk.speaker_id` → a join to the `speakers` catalog row (name, firm, bio, photo) → Panel 4 (attorney). The model’s ordered claims, with inline citation markers pointing at the same ids → Panel 1 (reasoning).
- Grouping into clips. Cited chunks are grouped by `program_id`; each group becomes one clip whose range spans (or stitches) its chunks, ordered by rank. A single-program answer shows one clip; a genuinely cross-program answer shows several, capped at ~4 to stay readable. Panels 2 and 4 are keyed per clip so the transcript and attorney track whichever clip the reader is on.
- The payload. Assembly emits one object: `{ answer_markdown, citations: [{ chunk_id, program_id, clip: {url, start_ms, end_ms}, transcript, speaker: {…} }], programs: [{ id, materials_url }] }`. Panel 1 streams first; each citation is a discrete event the client slots into panels 2–4 as it arrives.
- Signed URLs are minted late. The clip URL is generated only after entitlement passes and only at assembly time, so it expires in minutes and never leaks into the reasoning context or the query log.
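The grouping rule can be sketched as follows, under the assumption that cited rows arrive best-ranked first and that a clip simply spans its chunks (the stitching variant is not shown). `assemble_clips` is a hypothetical name.

```python
def assemble_clips(cited_rows, max_clips=4):
    """Group cited chunk rows by program; each group becomes one spanning clip.

    Rows are assumed to arrive in rank order, so programs come out ordered
    by their best-ranked chunk (dicts preserve insertion order).
    """
    clips = {}
    for row in cited_rows:
        clip = clips.setdefault(row["program_id"], {
            "program_id": row["program_id"],
            "start_ms": row["start_ms"],
            "end_ms": row["end_ms"],
            "chunk_ids": [],
        })
        clip["start_ms"] = min(clip["start_ms"], row["start_ms"])
        clip["end_ms"] = max(clip["end_ms"], row["end_ms"])
        clip["chunk_ids"].append(row["chunk_id"])
    return list(clips.values())[:max_clips]

rows = [
    {"chunk_id": "c1", "program_id": "p1", "start_ms": 903000, "end_ms": 950000},
    {"chunk_id": "c2", "program_id": "p2", "start_ms": 10000, "end_ms": 40000},
    {"chunk_id": "c3", "program_id": "p1", "start_ms": 960000, "end_ms": 1187000},
]
clips = assemble_clips(rows)
```

No URL appears at this stage; the signed link would be attached per clip afterwards, once entitlement has passed.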
8.8 Grounding verification — enforcing “every claim maps to a chunk”
- Deterministic check first. Because the model returns structured claims, verification is mechanical: every `chunk_id` the model cites must exist in the set that was actually sent to it, and every claim must carry at least one id. An id the model invented (not in the sent set) or a claim with no id is dropped, or the answer is failed; no LLM is needed to catch fabricated references.
- Optional semantic check. Whether each claim’s text is genuinely supported by its cited chunk is a stronger, non-deterministic property. A lightweight second LLM pass (a cheap entailment check: “does this chunk support this claim?”) can gate it, but adds latency and cost, so V1 relies on the deterministic id check plus the strict prompt contract, with the entailment pass as a later hardening step and an offline eval metric (Section 11).
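The deterministic check is small enough to show in full. This sketch takes the "drop the claim" option; failing the whole answer instead would be a one-line policy change. `verify_grounding` is a hypothetical name.

```python
def verify_grounding(claims, sent_chunk_ids):
    """Keep a claim only if it cites at least one id and every id was sent."""
    sent = set(sent_chunk_ids)
    kept, dropped = [], []
    for claim in claims:
        ids = claim.get("chunk_ids", [])
        if ids and all(chunk_id in sent for chunk_id in ids):
            kept.append(claim)
        else:
            dropped.append(claim)  # uncited claim, or an id the model invented
    return kept, dropped

claims = [
    {"text": "Gain is deferred.", "chunk_ids": ["8f2a"]},
    {"text": "Invented reference.", "chunk_ids": ["zzzz"]},
    {"text": "No citation at all.", "chunk_ids": []},
]
kept, dropped = verify_grounding(claims, sent_chunk_ids=["8f2a", "17bc"])
```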
8.9 The teaser path — cheap, no grounded generation
- Metadata, not a model call. For an anonymous caller the endpoint runs only the cheap front of the pipeline — embed + hybrid retrieve to learn whether and where coverage exists — then builds the preview from catalog / retrieval metadata: matching program titles, speaker names, tags, clip counts. It skips reranking, the reasoning LLM, transcript text, and any signed video URL, so an unauthenticated question can never trigger an expensive per-request generation or leak gated content.
- Shape. The teaser returns a short “we have N programs covering this” summary plus an upgrade prompt. Exactly how much it reveals is the product decision flagged in Section 17.
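A minimal sketch of a teaser built purely from retrieval metadata. The hit field names (`program_title`, `speaker_name`) and the summary wording are assumptions; how much the preview actually reveals remains the open product decision.

```python
def build_teaser(hits):
    """Build the preview from metadata only: no transcript text, no video URL."""
    programs = {}
    for hit in hits:
        entry = programs.setdefault(hit["program_id"], {
            "title": hit["program_title"], "speakers": set(), "clip_count": 0,
        })
        entry["speakers"].add(hit["speaker_name"])
        entry["clip_count"] += 1
    n = len(programs)
    return {
        "summary": f"We have {n} program{'' if n == 1 else 's'} covering this.",
        "programs": [
            {"title": p["title"], "speakers": sorted(p["speakers"]),
             "clip_count": p["clip_count"]}
            for p in programs.values()
        ],
        "upgrade_prompt": True,
    }

hits = [
    {"program_id": "p1", "program_title": "1031 Exchanges", "speaker_name": "Jane Roe",
     "content": "GATED TRANSCRIPT TEXT"},
    {"program_id": "p1", "program_title": "1031 Exchanges", "speaker_name": "Jane Roe",
     "content": "MORE GATED TEXT"},
    {"program_id": "p2", "program_title": "Capital Gains Basics", "speaker_name": "John Doe",
     "content": "ALSO GATED"},
]
teaser = build_teaser(hits)
```

The builder never reads `content`, so gated text cannot reach the payload even if the retrieval rows carry it.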
8.10 SSE event protocol
- Event types. `token` (a reasoning-text delta for Panel 1), `citation` (one fully-resolved four-panel payload for a clip), `done` (terminal, carries the final answer id), and `error` (terminal, carries a code). Every event has a monotonic `id` (the Redis stream entry id).
- How the client assembles. The browser appends `token` deltas into the reasoning panel as they arrive, and on each `citation` slots its transcript / clip / attorney into panels 2–4 (creating a new clip tab when a new `program_id` appears). `done` unlocks the finished state; `error` shows a graceful failure.
- Resumability. The client tracks the last event id; on a dropped connection it reconnects with `Last-Event-ID` and the subscriber replays from that offset in the Redis stream, so no tokens are lost or duplicated on a flaky network.
- Idempotency. The `POST /ask` carries an `Idempotency-Key`; the route claims it with a unique-key insert in Postgres, so a retry attaches to the existing generation task and stream instead of spawning a second (and double-billing the model).
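The wire format and the resume rule can be sketched without Redis. Here the stream is an in-memory list standing in for the Redis stream, and ids follow the Redis `<ms>-<seq>` entry-id shape; `format_sse` and `replay_after` are hypothetical names.

```python
import json

def format_sse(event_id, event, data):
    """Render one Server-Sent Event frame (id, event, data, blank line)."""
    return f"id: {event_id}\nevent: {event}\ndata: {json.dumps(data)}\n\n"

def _stream_key(entry_id):
    """Split a '<ms>-<seq>' id into integers so ids compare numerically."""
    ms, seq = entry_id.split("-")
    return int(ms), int(seq)

def replay_after(entries, last_event_id=None):
    """Return the entries strictly after the client's Last-Event-ID."""
    if last_event_id is None:
        return list(entries)
    cutoff = _stream_key(last_event_id)
    return [e for e in entries if _stream_key(e[0]) > cutoff]

stream = [
    ("1700000000000-0", "token", {"text": "A 1031 "}),
    ("1700000000000-1", "token", {"text": "exchange"}),
    ("1700000000005-0", "citation", {"chunk_id": "8f2a"}),
    ("1700000000009-0", "done", {"answer_id": "ans_1"}),
]
# The client saw the second token, dropped, and reconnects with its last id.
resumed = replay_after(stream, "1700000000000-1")
frame = format_sse(*stream[0])
```

Because the cutoff is exclusive, the reconnecting client receives neither a gap nor a repeat of the token it already rendered.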
8.11 Caching, latency budget & mid-stream failure
- Caching. A semantic / hot-query cache short-circuits repeats: near-duplicate queries (by embedding similarity) or exact-hash hits return the stored answer + citations, re-minting only the short-lived signed URLs. The corpus changes slowly relative to query volume, so cache hits are common and cheap; entries are invalidated when their cited programs change.
- Rough latency budget. embed ~50–100 ms · hybrid retrieve ~30–80 ms · rerank ~30–60 ms · then time-to-first-token from the reasoning model roughly 0.5–1.5 s. Summed, that is roughly 0.6–1.75 s, so reasoning tokens start streaming in under ~2 s, while citations resolve and stream a moment behind. A cache hit collapses this to the URL-minting cost.
- Mid-stream failure. The reasoning call runs under a timeout; a failure or timeout after tokens have streamed emits a terminal `error` event (not a silent hang) and marks the idempotency record failed, so a retry re-runs cleanly. The hosting platform’s raised idle timeout plus periodic keep-alive events (Section 15) keep a long multi-clip answer from being cut off.
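The claim, fail, and retry cycle of the idempotency record can be sketched with a unique-key insert. SQLite (in memory) stands in for Postgres here so the example is self-contained; the table and column names are assumptions.

```python
import sqlite3

def open_store():
    """In-memory stand-in for the Postgres idempotency table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE idempotency ("
        "key TEXT PRIMARY KEY, task_id TEXT NOT NULL, status TEXT NOT NULL)"
    )
    return db

def claim(db, key, new_task_id):
    """Return (task_id, created). A retry attaches to the existing task
    unless that task failed, in which case the retry takes the key over."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute("INSERT INTO idempotency VALUES (?, ?, 'running')",
                       (key, new_task_id))
        return new_task_id, True
    except sqlite3.IntegrityError:
        with db:
            cur = db.execute(
                "UPDATE idempotency SET task_id = ?, status = 'running' "
                "WHERE key = ? AND status = 'failed'", (new_task_id, key))
        if cur.rowcount == 1:
            return new_task_id, True
        row = db.execute("SELECT task_id FROM idempotency WHERE key = ?",
                         (key,)).fetchone()
        return row[0], False

def mark_failed(db, key):
    with db:
        db.execute("UPDATE idempotency SET status = 'failed' WHERE key = ?", (key,))

db = open_store()
first = claim(db, "idem-123", "task-a")    # new generation starts
retry = claim(db, "idem-123", "task-b")    # attaches to task-a, no second run
mark_failed(db, "idem-123")                # mid-stream failure
after_failure = claim(db, "idem-123", "task-c")  # retry re-runs cleanly
```

The conditional `UPDATE ... WHERE status = 'failed'` lets exactly one retry take over a failed key, which matters if two retries race.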
Hybrid Search · V1: core · V2: autocomplete
The structured path is primary and fast; the semantic path handles the long tail. A relational catalog does the everyday queries — browse a practice area, filter by state approval — instantly and exactly, and the AI layer catches the open-ended questions that facets can’t express.
Reading the diagram: known/faceted queries go straight to the catalog with counts and subcategorization; open-ended queries fall through to semantic search and are blended back in. Either way, results are ordered live/replay first, then by popularity.
9.1 Facets, counts & narrowing
Facets: practice area, sub-practice area, CLE state approval, credit hours, format (live/replay vs. on-demand), date, speaker, and popularity. Broad results subcategorize automatically — litigation (268) → trucking, personal injury, employment, trial skills — and show accurate counts (“300 in estate planning”), because the catalog can compute them exactly.
9.2 Popularity ordering
Each program carries a popularity score — a small denormalized value recomputed on a schedule, not a live aggregate — used as a secondary sort after the live/replay-first rule. It can be seeded from existing platform view or enrollment counts and refined over time by in-app engagement (clip plays, click-throughs) recorded in the query log, so the catalog surfaces what members actually watch.
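The two-level ordering (live/replay first, then popularity) reduces to a compound sort key. The field names and the `"live_replay"` / `"on_demand"` values are assumptions for illustration.

```python
def order_results(programs):
    """Live/replay programs first, then by descending popularity score."""
    return sorted(
        programs,
        key=lambda p: (0 if p["format"] == "live_replay" else 1, -p["popularity"]),
    )

catalog = [
    {"id": "p1", "format": "on_demand", "popularity": 0.9},
    {"id": "p2", "format": "live_replay", "popularity": 0.2},
    {"id": "p3", "format": "on_demand", "popularity": 0.5},
    {"id": "p4", "format": "live_replay", "popularity": 0.7},
]
ordered_ids = [p["id"] for p in order_results(catalog)]
```

Since popularity is a stored, periodically recomputed value, this sort can equally be an `ORDER BY` on two indexed columns.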
9.3 Keyword search
Between exact facets and semantic search sits a keyword / full-text tier over program titles and descriptions, run in the same Postgres database (its built-in full-text / trigram search). This handles the everyday “find classes about trucking” term query — a core requirement — quickly and precisely, without falling through to the fuzzier semantic layer.
9.4 Semantic fallback
When neither facets nor keywords resolve a query, it is embedded and matched against program and chunk descriptions; results from all three paths are then fused and de-duplicated — so the common case is fast and exact, and the open-ended case still returns something useful.
Video Delivery & Timecoded Playback V1
The marquee feature depends on dropping the member at the exact minute of a two-hour program and playing just that stretch — without downloading the whole file. This is a solved, standard capability given a small condition on the file format.
Reading the diagram: after the entitlement check passes, the API mints a short-lived signed URL for the program. Our player seeks to the chunk’s start time, which triggers an HTTP Range request; the store returns only those bytes (206 Partial Content) through the CDN; a small script pauses at the end time. Nothing extra is downloaded.
10.1 Playing a range without downloading the whole file
The player asks for only the bytes it needs via an HTTP Range request, and the media store answers with just that slice (206 Partial Content). Playback starts mid-file after transferring a small amount of data. The one condition depends on format:
- Progressive MP4: the file’s index (the moov atom) must be at the front (“faststart”); otherwise the browser must pull most of the file before it can seek. The fix is a one-time, lossless re-mux that relocates the index — no re-encode.
- HLS / DASH: the video is pre-split into short segments, so the player fetches only the segments covering the clip. Inherently seek-friendly, no faststart concern.
10.2 Format support
| Format | Precise range-play? | Notes |
|---|---|---|
| MP4 (H.264/H.265) — faststart on | ✅ Yes | The simple recommended case. One-time +faststart re-mux, no re-encode. |
| MP4 — faststart off | ⚠️ Not effectively | Browser downloads most of the file first. Fix = re-mux to faststart. |
| HLS / DASH (segments) | ✅ Yes | Best practice for 2-hour content; fetches only the covering segments. |
| WebM (index at front) | ✅ Yes | Seekable if cues are written at the front; uncommon here. |
| MOV / AVI / WMV / raw masters | ❌ Not for browser delivery | Transcode to MP4 or HLS as an ingestion step. |
Placing a CDN in front of the store does not break this — a properly configured CDN forwards Range headers and caches parts, so popular clips serve fast from the edge.
10.3 Approach, player & access control
- Seek-and-play a range of the full program. We set the start time (padded ~1s so we never begin mid-word) and enforce the stop in our own player code rather than relying on a URL parameter, which is inconsistent across browsers. Cutting a separate physical clip file per citation is unnecessary and is not part of the design.
- Player: a single open-source player that handles both progressive MP4 and segmented streaming with the same code and exposes straightforward programmatic control (seek, play, pause). We build and control it; we keep it simple, without DRM.
- Access control: after entitlement passes, the API mints a short-lived, expiring signed URL for that video; the link expires in minutes so it can’t be shared or scraped. This slots directly into the auth handshake (Section 12): membership verified → signed URL minted → player loads it.
10.4 From citation to clip: the concrete lookup
Sections 10.1–10.3 established that Range playback works; this is the exact chain that turns a retrieved chunk into a playing clip. Retrieval hands the API a chunk row, and three joins resolve it to a byte-range fetch:
- Chunk → program. The `chunk` carries `program_id`, `start_ms`, `end_ms`. We load the parent `program` and read its `video_asset_id`, the stable handle to the physical asset, decoupled from the DB row so the storage layout can change without touching chunks.
- Program → S3 key / CloudFront path. `video_asset_id` maps to an S3 object key (e.g. `programs/<asset>/master.mp4` for MP4, or `programs/<asset>/index.m3u8` for HLS), the same path CloudFront serves. The mapping is a deterministic template, not a per-clip stored URL; no URL is persisted, so none can leak or go stale.
- API response to the player. After minting (10.5) the API returns a compact `{ "url": "<signed>", "start_ms": 903000, "end_ms": 1187000 }`. The `url` points at the full program, never a pre-cut clip; the range lives only in the two integers.
- Player behaviour. On load the player sets `video.currentTime = start_ms / 1000` (padded ~1s). Setting `currentTime` triggers the browser to issue a `Range: bytes=…` request for the byte offset the moov index maps that timestamp to; a `timeupdate` listener calls `pause()` once `currentTime ≥ end_ms / 1000`. Start and stop are both enforced in our code, with no reliance on URL fragments.
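The lookup chain can be sketched on the API side as below. The signer is injected as a callable and faked in the example; the `cdn.example.com` host is a placeholder. `player_window` only mirrors, in Python, the arithmetic the browser player would do, and it assumes the ~1s pad moves the start earlier.

```python
KEY_TEMPLATES = {
    "mp4": "programs/{asset}/master.mp4",
    "hls": "programs/{asset}/index.m3u8",
}

def object_key(program):
    """Deterministic template from video_asset_id to the S3 / CloudFront key."""
    return KEY_TEMPLATES[program["format"]].format(asset=program["video_asset_id"])

def clip_response(chunk, program, sign):
    """Payload for the player: a URL to the full program plus two integers."""
    return {
        "url": sign(object_key(program)),
        "start_ms": chunk["start_ms"],
        "end_ms": chunk["end_ms"],
    }

def player_window(response, pad_ms=1000):
    """Seek and stop times in seconds (start padded earlier, never below 0)."""
    seek_s = max(0, response["start_ms"] - pad_ms) / 1000
    stop_s = response["end_ms"] / 1000
    return seek_s, stop_s

chunk = {"program_id": "p1", "start_ms": 903000, "end_ms": 1187000}
program = {"id": "p1", "video_asset_id": "a1", "format": "mp4"}
fake_sign = lambda key: f"https://cdn.example.com/{key}?sig=demo"  # placeholder signer

response = clip_response(chunk, program, fake_sign)
window = player_window(response)
```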
Reading the diagram: a member clicks a cited clip; the API runs the entitlement check (Section 12), resolves the chunk to its program’s `video_asset_id` and S3 key, reads AWS credentials from Key Vault, and mints a short-lived signed CloudFront URL. It returns `{url, start_ms, end_ms}`; the player’s seek fires an HTTP Range request that CloudFront serves from the edge (or fetches from S3 as `206 Partial Content`), then stops at `end_ms`. Purple = secrets/signing, amber = CDN, green = S3, red = the entitlement guard.
10.5 Minting the signed URL across two clouds
The app runs on Azure Container Apps but the video lives on AWS S3 + CloudFront, so the signer holds AWS trust material inside Azure. That material sits in Azure Key Vault, never in code, image, or build-baked env:
- What Key Vault holds: the CloudFront key-pair ID and its private key (PEM) used to sign, plus — if any S3-direct or MediaConvert calls are needed — an AWS IAM access key/secret (or federated credentials). The signing private key is the sensitive item; it is pulled via managed identity at runtime and cached in memory only.
- How the URL is built: the API composes a CloudFront canned policy (the resource path plus a `DateLessThan` expiry a few minutes out), then RSA-SHA1 signs the policy with the private key and appends `Expires`, `Signature`, and `Key-Pair-Id` query parameters. A custom policy (adding an IP or path-prefix constraint) is used when one signature must cover many objects (HLS, 10.7).
- Order of operations: the entitlement check (Section 12: verified token → tier check) happens first; only an entitled request reaches the signer. Signing is cheap and local, with no round-trip to AWS, so it adds negligible latency.
- TTL discipline: expiry is minutes, sized to comfortably outlast one clip play but not survive sharing or scraping.
10.6 Keeping the CDN fast: caching vs. per-viewer signatures
A signed URL carries a unique Signature query string per viewer and per mint. If CloudFront keyed its cache on the full query string, every viewer would miss the edge cache and re-pull from S3 — defeating the CDN for exactly the popular clips we most want cached. Two standard fixes:
- Signed cookies instead of signed URLs. Authorize a path (e.g. /programs/<asset>/*) once via CloudFront-Policy / CloudFront-Signature / CloudFront-Key-Pair-Id cookies; media requests then carry a clean, cache-friendly URL with no per-request signature, so the edge object is shared across viewers. This is also the natural fit for HLS, where one page view pulls a manifest plus many segments under the same path.
- Cache policy that ignores the signature params. Keep signed URLs but configure the CloudFront cache policy to exclude Signature / Expires / Key-Pair-Id from the cache key (while origin-request / trusted-key-group validation still reads them). Popular byte ranges then serve from one shared edge object regardless of who signed.
- Range interplay: whichever path is chosen, byte-range requests must keep working end to end. CloudFront handles the Range header natively (it forwards range requests to the origin and caches the ranges it receives), and the AWS managed CachingOptimized policy, which keeps query strings and cookies out of the cache key, leaves that behavior intact, so partial-content responses still cache correctly per byte range.
We default to signed cookies scoped to the program path — it keeps clip URLs clean, caches well at the edge, and covers both MP4 and the multi-file HLS case with one authorization.
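The cache-key effect behind this choice can be shown with a toy model. The sketch below is not CloudFront's implementation; `cache_key` and the example URLs are hypothetical and only demonstrate why per-viewer signature parameters fragment the cache when they are part of the key.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Query parameters that differ per viewer and per mint.
SIGNATURE_PARAMS = {"Signature", "Expires", "Key-Pair-Id", "Policy"}


def cache_key(url: str, ignore_signature_params: bool) -> str:
    """Toy cache key: the path plus whichever query parameters are kept."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    if ignore_signature_params:
        params = [(k, v) for k, v in params if k not in SIGNATURE_PARAMS]
    query = urlencode(sorted(params))
    return f"{parts.path}?{query}" if query else parts.path


viewer_a = "https://cdn.example/programs/42/clip.mp4?Expires=1300&Signature=AAA&Key-Pair-Id=K1"
viewer_b = "https://cdn.example/programs/42/clip.mp4?Expires=1900&Signature=BBB&Key-Pair-Id=K1"

# Keyed on the full query string: two viewers, two cache entries.
distinct_entries = cache_key(viewer_a, False) != cache_key(viewer_b, False)
# Signature params excluded (or moved into cookies): one shared entry.
shared_entry = cache_key(viewer_a, True) == cache_key(viewer_b, True)
```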
10.7 MP4 vs. HLS: the signing branch
- MP4 (faststart): a single object. One signed URL (or one path cookie) plus the player’s Range request is the whole mechanism: the simplest case, and what 10.4’s response shape assumes.
- HLS: a manifest (.m3u8) plus many segment files. Per-object signed URLs would mean rewriting every segment URI in the manifest at request time, which is brittle. The clean answer is signed cookies scoped to the program path (10.6): one authorization covers the manifest and every segment, so the player fetches segments with plain URLs and the edge caches them normally.
Because the video_asset_id → key template already encodes which asset is MP4 vs. HLS, the API picks the branch from the program’s format without the player caring — it always receives {url, start_ms, end_ms} and seeks the same way.
10.8 Failure & expiry handling
- Expired URL / cookie. With a minutes-long TTL, a link may expire before a paused or re-shared play resumes; CloudFront then returns 403. The player treats a mid-playback 403 as a signal to call the API for a fresh mint (re-running the entitlement check) and resumes at the same currentTime, transparent to the member.
- Not entitled. If the entitlement check fails (anonymous or wrong tier), the API never mints a URL; it returns the teaser / upgrade prompt from Section 12’s gating rather than a video, so no signed link is ever issued.
- Signer / Key Vault unavailable. A failure to read the signing key surfaces as a clean “video temporarily unavailable” error, not a broken player; the rest of the 4-panel answer (text, citation, attorney) still renders.
Grounding, Safety & the Quality Loop V1 · V2 advanced training
Every answer passes layered checks; accuracy is then driven up over time by a programmatic refinement loop. The two work together — the checks stop a bad answer today, the loop makes fewer of them tomorrow.
Reading the diagram: a pre-answer guard declines out-of-scope requests (advice on a member’s actual case); a post-answer check verifies every claim maps to a cited chunk; an optional judge approves or triggers a single stricter rewrite. Red = guard/decline paths.
Reading the diagram: a set of real attorney queries with expected results runs through the pipeline in bulk; results are scored on retrieval hit-rate and grounding; prompts, chunking/tagging, and re-rank weights are tuned and re-run until the error converges, then the winning configuration is promoted.
11.1 What the checks look for
- Grounding: does every assertion trace to a retrieved chunk (and therefore a video)?
- Scope: general legal education, not advice on a specific matter; the disclaimer is present.
- Citation accuracy: the transcript, clip, and attorney shown actually support the claim — including the right attorney from the speaker mapping.
11.2 Why it is built this way
- Quality is a process, not a one-time prompt. The loop drives error toward zero — run many queries, score, adjust prompts/chunking/re-rank, repeat — and the same harness measures whether a change actually helped before it ships.
- Cheap to start, compounding to improve. A small evaluation set proves the system now; it grows from real member queries over time.
11.3 The evaluation harness — how we measure answer quality V1
The refinement loop in 11.2 is only as trustworthy as the ruler it uses. That ruler is a frozen ground-truth set plus a scorer that runs the whole pipeline offline and produces a single pass/fail gate. It turns “the answer felt better” into a number we can regress against.
Reading the diagram: a curated ground-truth set (green) runs in batch through the exact production pipeline; each answer is scored on four axes; the scores are checked against an acceptance bar (red gate). A config that clears the bar is promoted; one that regresses sends the operator back to the tuning levers. The V2 branch (dashed) grows the set from real member queries and feeds continuous online scoring.
The ground-truth set
- What it is. A versioned list of real attorney queries, each paired with the expected outcome: which program(s) and passage(s)/timecodes should be cited, the key facts a correct answer must contain, and the correct attribution. Stored as a checked-in fixture (rows keyed to chunk_id / program_id), not free text, so scoring is deterministic.
- Who authors it. Client-side SMEs (the CLE attorneys/content team) write the queries and mark the expected sources; they are the only authority on what “correct” means for a legal topic. The operator console (Section 7.4) is where they confirm the expected clip matches. We seed it; they ratify it.
- Size. V1 starts small and deliberate — tens of curated cases spanning the topics members actually ask about — enough to prove the system and catch regressions, cheap enough for a domain expert to hand-verify every row.
The offline batch run
- The harness replays every query through the exact production retrieval + reasoning path (same chunking, re-rank, prompts, and grounding checks from 11.1) under a named config — a pinned bundle of prompt versions, chunk parameters, tag weights, re-rank weights, and model id. Runs are reproducible: same config + same set = same score.
- It runs as a job on the same worker plane as ingestion (Section 7), so a full sweep is one enqueued batch, not a manual sit-and-watch.
The metrics
- Retrieval recall@k / hit-rate. Of the passages the SME marked expected, how many appear in the top-k retrieved chunks? Purely deterministic (set membership on chunk_id). This isolates retrieval quality from generation: if the right chunk never surfaces, no prompt tuning can save the answer.
- Grounding / faithfulness. Does every assertion trace to a cited chunk (no fabricated claims)? Scored two ways: a deterministic check that each sentence carries a citation, plus an LLM-judge that reads claim-vs-source and rules whether the source actually entails the claim. This is the metric the whole product promise rests on.
- Citation accuracy. Does the shown transcript/clip/attorney genuinely support the claim, and is it the right attorney? Deterministic where possible (cited chunk_id resolves to the expected speaker_id and timecode range), judge-assisted for “does this passage really back the sentence.” A confident answer citing the wrong speaker is a hard failure.
- Answer relevance. Does the answer address what was asked (not merely cite something grounded but off-topic)? LLM-judged against the query and the SME’s expected key facts.
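The deterministic halves of these metrics are small enough to sketch. The function names and the `(sentence, cited_chunk_ids)` answer shape below are assumptions for illustration; the LLM-judged parts (entailment, relevance) are out of scope here.

```python
from typing import Iterable, List, Sequence, Tuple


def recall_at_k(expected: Iterable[str], retrieved: Sequence[str], k: int) -> float:
    """Share of SME-expected chunk ids that appear in the top-k retrieved ids."""
    expected_ids = set(expected)
    if not expected_ids:
        raise ValueError("a ground-truth case needs at least one expected chunk")
    top_k = set(retrieved[:k])
    return len(expected_ids & top_k) / len(expected_ids)


def hit_at_k(expected: Iterable[str], retrieved: Sequence[str], k: int) -> bool:
    """True when at least one expected chunk surfaces in the top k."""
    return recall_at_k(expected, retrieved, k) > 0.0


def ungrounded_sentences(
    answer: Sequence[Tuple[str, Sequence[str]]],
    retrieved: Sequence[str],
) -> List[str]:
    """Sentences with no citation, or citing a chunk that was never retrieved."""
    known = set(retrieved)
    return [
        sentence
        for sentence, cited in answer
        if not cited or not set(cited) <= known
    ]
```

For example, with expected chunks `c1` and `c2` and a ranking of `c9, c1, c3, c2`, recall is 0.5 at k=3 and 1.0 at k=4; a non-empty `ungrounded_sentences` result is a deterministic grounding failure before any judge runs.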
The acceptance bar (the gate)
- A config is promotable only if it clears a threshold on each axis — e.g. recall@k and citation accuracy above a set floor, and zero tolerance for the failure classes that break trust (a fabricated claim, a wrong-attorney citation). The bar is a threshold, not a demand for a perfect score — we ship when it is good enough and provably not worse, not when it is flawless.
- No regressions. A change that lifts one metric while dropping another below its floor does not promote. The gate is the objective referee that keeps the loop honest.
- Judge scores are calibrated against a slice of SME-labelled examples so “the judge says grounded” tracks “a lawyer agrees it’s grounded” before we trust the judge to gate.
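A gate of this shape can be sketched as a pure function over one run's scores. Everything named here is hypothetical: the `RunScores` fields mirror the four axes and two zero-tolerance failure classes described above, and the floor values are placeholders, since the real thresholds would be agreed with the client's SMEs.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class RunScores:
    recall_at_k: float
    faithfulness: float
    citation_accuracy: float
    answer_relevance: float
    fabricated_claims: int          # zero-tolerance failure class
    wrong_attorney_citations: int   # zero-tolerance failure class


# Placeholder floors, for illustration only.
FLOORS: Mapping[str, float] = {
    "recall_at_k": 0.85,
    "faithfulness": 0.95,
    "citation_accuracy": 0.95,
    "answer_relevance": 0.85,
}


def gate_failures(scores: RunScores, floors: Mapping[str, float] = FLOORS) -> List[str]:
    """Return the reasons a config cannot be promoted; empty means promotable."""
    failures = [
        f"{axis} {getattr(scores, axis):.2f} is below floor {floor:.2f}"
        for axis, floor in floors.items()
        if getattr(scores, axis) < floor
    ]
    if scores.fabricated_claims > 0:
        failures.append(f"{scores.fabricated_claims} fabricated claim(s)")
    if scores.wrong_attorney_citations > 0:
        failures.append(f"{scores.wrong_attorney_citations} wrong-attorney citation(s)")
    return failures
```

Checking every axis against its own floor is what enforces the no-regressions rule: a config that lifts recall while pushing relevance under its floor still returns a failure.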
11.4 Testing that the data is correct — DB rows and clips V1
Answer quality is downstream of data quality: a perfect prompt over a mis-timed clip or a mis-mapped speaker still produces a wrong, confident citation. So the rows in Postgres/pgvector and the clips on storage are tested in their own right, independent of any answer.
- Transcript↔video timeline fidelity. For a sampled set of chunks, verify the stored [start, end] actually lands where that text is spoken in the video, catching trimmed masters, intro offsets, or drift between the transcript clock and the stored file. This is the concrete test behind the Section 17.1 open question; one sample program calibrates the offset, then spot-checks confirm it holds across the library.
- Speaker attribution. Confirm each chunk’s speaker_id resolves to the right attorney against the per-program roster, the backbone of “who said it.” Sampled human verification in the operator console (Section 7.4), with the diarization mis-attribution rate tracked as its own quality number.
- CLE-approval data is authoritative, never model-guessed. Jurisdiction/credit/approval facts come from the system of record and are treated as ground truth; the model may surface them but must never infer or paraphrase them. Tested by asserting these fields are populated from the source feed, not generated.
- Where it runs. These checks live over the same catalog and job tables as ingestion and surface in the operator console, so transcripts, speakers, and clips are verified before a member ever sees them.
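The timeline-fidelity check can be sketched as an offset calibration followed by a drift scan. This is one possible approach under stated assumptions: each sample is a hypothetical `(chunk_id, stored_start_ms, observed_start_ms)` triple where the observed time comes from a human or tool locating the spoken text in the video, the median is used as a robust offset estimate, and the 500 ms tolerance is a placeholder.

```python
from statistics import median
from typing import List, Sequence, Tuple

Sample = Tuple[str, int, int]  # (chunk_id, stored_start_ms, observed_start_ms)


def calibrate_offset_ms(samples: Sequence[Sample]) -> int:
    """Median gap between where text is stored and where it is actually spoken."""
    return int(median(observed - stored for _, stored, observed in samples))


def drifting_chunks(
    samples: Sequence[Sample],
    offset_ms: int,
    tolerance_ms: int = 500,  # placeholder tolerance
) -> List[str]:
    """Chunk ids whose gap differs from the calibrated offset by more than the tolerance."""
    return [
        chunk_id
        for chunk_id, stored, observed in samples
        if abs((observed - stored) - offset_ms) > tolerance_ms
    ]
```

A constant offset (an intro bumper, say) shows up as a clean median with no drifting chunks; a trimmed master shows up as chunks after the cut all landing outside the tolerance.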
11.5 Evaluation at scale V2
- The set grows from real usage. The curated V1 set is extended with real member queries mined from the query log and with cases flagged by in-product feedback (thumbs down, an explicit “wrong attorney” report). SMEs label the expected result; the case joins the frozen set. The ruler gets longer and more representative over time.
- Continuous / online evaluation. A sampled slice of live traffic is scored continuously by the same faithfulness/citation judges, so quality is a monitored dashboard metric — not just a pre-ship check — and drift (from a model update, a content batch, or a query-mix shift) is caught in production.
- Regression testing per change. Every config or model-version change is run against the frozen set before promotion; the gate from 11.3 blocks anything that regresses — making a model upgrade a measured decision, not a leap of faith.
- A/B testing. Two promotable configs run against live traffic (or a held-out split) and are compared on the same metrics plus real member signals, so a genuinely better config wins on evidence.
- Hallucination reduction over time. Faithfulness and citation-accuracy scores are tracked as a trend line across releases; the explicit goal is a monotonically falling ungrounded-claim rate, reported per release.
11.6 What feeds the harness
- The query log: every member question, its retrieved chunk ids, the emitted citations, and the config that produced them. This is the raw material for growing the eval set and for online scoring.
- The citation trail: the chunk_id / speaker_id / timecode each answer cited, which the scorer compares against ground truth and which the data-quality checks (11.4) validate independently.
- A feedback signal (V2): in-product thumbs and structured reports (“wrong attorney,” “not what I asked”) that both surface candidate eval cases and act as a real-world check on the offline scores.
Authentication & Membership Gating V1 · V2 add-on tier
The app is embedded in the members’ site but runs on its own origin, so it cannot rely on the site’s session cookie: browsers increasingly block or partition third-party cookies inside a cross-origin iframe. Instead, the site (which does know the logged-in member) mints a short-lived signed token and hands it to the embedded app, which presents it to its own API as a bearer credential. No shared cookies, and the boundary is always enforced on the server.
Reading the diagram: the site reads the member and tier from the membership system, mints a short-lived asymmetrically-signed token, and passes it into the iframe. The app sends it as a bearer header; the API verifies the signature against the site’s public key and checks entitlement before returning a gated answer or minting a signed video URL. Nothing depends on third-party cookies.
Reading the diagram: catalog search is always public; a question from a non-member returns a teaser plus an upgrade prompt; a question from an entitled member returns the full four-panel answer and a signed video URL. Enforced server-side; video released only via a signed, expiring URL.
12.1 The token handshake
- Minted server-side by the site from the member’s tier, asymmetrically signed (the site holds the private key; the API verifies with the public key, so no secret is shared between them), with claims for the member, the tier, the intended audience, and a short expiry (~15 minutes).
- Passed into the iframe explicitly (not via the URL, which leaks into logs and history); the app validates the sender origin, holds the token in memory, and the parent page silently re-mints it before expiry.
- Verified on every request — signature, expiry, audience, and issuer — before any gated payload is produced.
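The per-request verification can be sketched as an ordered series of checks. This is an illustrative outline, not a JWT library: `verify_token` and `TokenRejected` are hypothetical names, the header and claims are assumed to be already decoded, and the cryptographic signature check is injected as a callable (in practice it would be delegated to a maintained JWT library against the site’s public key). The RS256 pin and the iss / aud / exp / tier claim names follow Section 12.4.

```python
import time
from typing import Any, Callable, Mapping, Optional

PINNED_ALG = "RS256"  # the verifier decides the algorithm, never the token


class TokenRejected(Exception):
    """Any failed check; the API would answer 401."""


def verify_token(
    header: Mapping[str, Any],
    claims: Mapping[str, Any],
    signature_ok: Callable[[], bool],  # assumption: wraps the real RS256 check
    *,
    issuer: str,
    audience: str,
    now: Optional[float] = None,
) -> str:
    """Run signature, expiry, issuer, and audience checks; return the tier."""
    if header.get("alg") != PINNED_ALG:
        raise TokenRejected("unexpected alg")
    if not signature_ok():
        raise TokenRejected("bad signature")
    current = time.time() if now is None else now
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or current >= exp:
        raise TokenRejected("expired or missing exp")
    if claims.get("iss") != issuer:
        raise TokenRejected("wrong issuer")
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if audience not in audiences:
        raise TokenRejected("wrong audience")
    return str(claims.get("tier", "public"))
```

Returning the tier rather than a boolean keeps the entitlement decision separate from authentication: the gate reads the tier only after every check above has passed.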
12.2 Entitlement & source of truth
Payments and subscriptions are the authority for who is a member; the site’s membership system is kept in sync with them and is the operational source the token is minted from. The token is a short-lived snapshot of that tier, so the API can decide access without calling the billing system on every request — a cancelled member loses access within one token lifetime, which is appropriate for this content. The entitlement check returns a structured tier rather than a yes/no, so a future paid add-on slots in as another tier value without reworking the gate.
12.3 Why it is built this way
- Explicit token beats shared cookies — it is immune to third-party-cookie restrictions and works the same in every browser.
- Server-side enforcement — the interface may hide gated content, but access is decided on the server, and video is protected with signed, expiring URLs so it can’t be hot-linked.
- A first-party subdomain is the fallback if deeper session integration is ever needed; the same token approach still applies, so we start with the iframe.
12.4 Token format & lifecycle
The token is a JWT signed with RS256 (asymmetric): the site holds the private key and signs, the API verifies with the public key only — so a compromise of the API can’t forge site tokens, and the two systems never share a secret. The verifier pins the expected algorithm and ignores the token’s own alg header, closing the classic “alg = none” and RS256→HS256 confusion attacks.
- Claims: iss (the WordPress site), aud (the app API), sub (member id), tier (public / teaser / member, with room for a future add-on value), iat, exp (~15-minute TTL), and jti (unique id, available for replay tracking / revocation). The API validates signature, exp, iss, and aud on every request; the aud / iss checks stop a token minted for one purpose being replayed at another.
- Minting: a small WordPress mu-plugin reads the already-logged-in session and its plugin-derived tier at page render and issues the JWT. This is deliberately not the generic “JWT Authentication for WP REST API” plugin (built around a username/password /token login); here there is no separate login, because the WordPress session already establishes identity.
- Key management & rotation: the private signing key lives in Key Vault. The public key is published at a JWKS endpoint with a kid so keys rotate without redeploying the verifier: the API fetches and caches JWKS, selecting by kid, so overlapping old/new keys both verify during a rotation.
- Where the token lives: in memory in the iframe only, never localStorage / sessionStorage (reduces XSS theft) and never in the URL (leaks into history, referrer, logs).
- Silent refresh: the parent page holds the live WordPress session, so it re-mints before expiry (a timer firing a couple of minutes ahead of exp) and pushes the fresh token into the iframe via postMessage, a clean silent refresh with no third-party-cookie dependency.
- Mid-session 401 recovery: if a request races an expiry and gets a 401, the iframe posts a refresh request to the parent, receives a new token, and retries once; a second failure surfaces as a real auth error rather than an infinite loop.
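The retry-once rule is the part most easily gotten wrong, so a small model of it may help. The real logic would live in the iframe’s JavaScript; it is modelled here in Python with hypothetical names (`call_with_one_refresh`, `SessionExpired`), where `send` performs the request with a given token and `request_fresh_token` stands in for the postMessage round-trip to the parent.

```python
from typing import Any, Callable, Tuple


class SessionExpired(Exception):
    """Raised when a freshly minted token is also rejected."""


def call_with_one_refresh(
    send: Callable[[str], Tuple[int, Any]],
    request_fresh_token: Callable[[], str],
    token: str,
) -> Tuple[int, Any]:
    """Send once; on 401, refresh the token and retry exactly one time."""
    status, body = send(token)
    if status != 401:
        return status, body
    status, body = send(request_fresh_token())
    if status == 401:
        # A second 401 is a real auth failure, not an expiry race.
        raise SessionExpired("session expired, reload")
    return status, body
```

Bounding the retry at one attempt is what prevents a revoked or mis-issued token from turning into an infinite refresh loop.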
Reading the diagram: the mu-plugin signs a short-lived JWT with the Key Vault private key and hands it into the iframe by postMessage; the app sends it as a bearer credential and the API verifies it (algorithm pinned, signature checked against the JWKS public key, plus iss / aud / exp) before the entitlement gate runs. Two paths keep the session alive without cookies: a refresh timer re-mints ahead of expiry, and a mid-session 401 triggers a re-mint and a single retry.
12.5 The postMessage handshake
- Who initiates: the embedded app, once loaded, posts a ready message to its parent; the parent responds by minting and posting the token (and may push tokens unprompted on refresh). This ordering avoids a race where the parent posts before the iframe is listening.
- Message schema: a small, versioned envelope, e.g. { type: "lawcle.auth.token", v: 1, token: "<jwt>" } for delivery and { type: "lawcle.auth.ready" } / { type: "lawcle.auth.refresh-request" } from the iframe. Typed messages let the iframe ignore unrelated postMessage traffic on the page.
- Origin allowlist (both directions): the parent posts with an explicit targetOrigin (the app’s origin, never "*"); the iframe checks event.origin against the known WordPress origin before touching the payload and drops anything else.
- What shows before the token arrives: the public surface renders immediately (catalog search needs no token), so first paint is never blocked on auth; gated affordances show a neutral loading/locked state until the token lands, then resolve to member or upgrade UI.
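The iframe-side acceptance rule (origin first, then envelope) can be modelled as a pure function. The real handler would be a browser `message` event listener in JavaScript; this Python sketch only mirrors its decision logic. `PARENT_ORIGIN` is a made-up example origin and `token_from_message` is a hypothetical name; the envelope fields follow the schema above.

```python
from typing import Any, Optional

PARENT_ORIGIN = "https://members.example.com"  # hypothetical WordPress origin
TOKEN_TYPE = "lawcle.auth.token"
SCHEMA_VERSION = 1


def token_from_message(origin: str, data: Any) -> Optional[str]:
    """Return the JWT from a valid delivery message, or None to ignore it."""
    # 1. Origin check comes first, before the payload is inspected at all.
    if origin != PARENT_ORIGIN:
        return None
    # 2. Only typed, versioned envelopes are considered.
    if not isinstance(data, dict):
        return None
    if data.get("type") != TOKEN_TYPE or data.get("v") != SCHEMA_VERSION:
        return None
    # 3. The token must be a non-empty string.
    token = data.get("token")
    return token if isinstance(token, str) and token else None
```

Returning `None` rather than raising keeps unrelated postMessage traffic on the page (analytics widgets, other embeds) silent instead of noisy.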
12.6 Browser-boundary hardening
- No CSRF surface: auth is a bearer token in an explicit header, not an ambient cookie, so the browser never auto-attaches credentials to cross-site requests; there is nothing for a forged cross-site request to ride on, and no CSRF-token machinery is required.
- Clickjacking / embedding control: the app sets Content-Security-Policy: frame-ancestors to the WordPress origin only (superseding X-Frame-Options), so no other site can frame it. The parent embeds with a restrictive iframe sandbox.
- CORS: the API’s allowed-origins list contains the app origin (and the WordPress origin for JWKS/embed as applicable); because auth rides in a header rather than cookies, CORS runs in non-credentialed mode (Access-Control-Allow-Credentials stays off) and Authorization is listed in Access-Control-Allow-Headers.
- Transport & enforcement: everything over HTTPS; gating is decided server-side in the auth/entitlement middleware (order CORS → RequestId → Auth/entitlement); the UI hiding gated content is never the control.
12.7 Subscription-status endpoint contract
The membership plugin exposes a server-to-server status endpoint (e.g. GET /wp-json/<plugin>/v1/subscription-status?member=<id>; the exact path depends on the plugin — MemberPress, PMPro, WooCommerce Subscriptions, or a small custom mu-plugin route). It is authenticated with a server-side API key (never called from the browser) and returns a compact projection:
- Response schema: { member_id, tier, status }, where status is active / cancelled / expired / pending and tier is the entitlement level the gate reads.
- Source-of-truth layering: Stripe is the authority for subscription state (verify the Stripe-Signature on webhooks, dedupe on event id, refetch the live object rather than trusting event order); the WordPress plugin DB is a projection kept in sync; the JWT tier claim is a ~15-minute snapshot of that projection.
- When the API calls it: not on the per-request hot path. The token snapshot decides ordinary requests, so a cancelled member loses access within one token lifetime, which is acceptable for this content. The API optionally makes a live status check only on high-value gated actions or at refresh time.
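The webhook hygiene named above (verify the Stripe-Signature, dedupe on event id) can be sketched with the standard library. This assumes Stripe’s documented v1 scheme, an HMAC-SHA256 over `"{timestamp}.{payload}"` keyed with the endpoint secret; in practice Stripe’s official SDK would normally do this check. The function names, the in-memory `seen` set, and the 300-second tolerance are illustrative assumptions, and a real deployment would persist seen event ids.

```python
import hashlib
import hmac
import time
from typing import Optional, Set


def verify_stripe_signature(
    payload: bytes,
    header: str,
    secret: str,
    tolerance_s: int = 300,  # assumed replay window
    now: Optional[float] = None,
) -> bool:
    """Check a Stripe-Signature header of the form 't=...,v1=...'."""
    timestamp, candidates = None, []
    for item in header.split(","):
        key, _, value = item.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_s:
        return False  # too old (or too far in the future) to trust
    signed = timestamp.encode("ascii") + b"." + payload
    expected = hmac.new(secret.encode("utf-8"), signed, hashlib.sha256).hexdigest()
    return any(
        hmac.compare_digest(expected.encode("ascii"), c.encode("utf-8"))
        for c in candidates
    )


def first_delivery(event_id: str, seen: Set[str]) -> bool:
    """True the first time an event id is seen; False for redeliveries."""
    if event_id in seen:
        return False
    seen.add(event_id)
    return True
```

After both checks pass, the handler would refetch the live subscription object rather than trust the event body, since events can arrive out of order.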
12.8 Error semantics
- 401 Unauthorized: missing, malformed, expired, or signature-invalid token on a gated endpoint. The iframe treats this as recoverable: request a fresh token and retry once (12.4); only a repeat failure surfaces as “session expired — reload.”
- 402 Payment Required: a valid token whose tier is below what the action needs (anonymous / teaser asking for a full answer). The response carries the teaser payload plus upgrade metadata; the UI renders the preview and an upgrade prompt rather than an error. This is the normal non-member path, not a fault.
- 403 Forbidden: a well-formed, correctly-audienced token that is nonetheless not permitted (a suspended member, or a request the tier can never satisfy). The UI shows a definitive “not available on your account” state, with no retry and no upgrade CTA.
- Anonymous is not an error: a request with no token to a public endpoint (catalog search) is served normally; only gated endpoints distinguish 401 vs 402 vs 403.
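These rules reduce to a small decision function. The sketch below is one reading of them, with hypothetical names; in particular, treating an unrecognized tier value as 403 is an assumption standing in for “a request the tier can never satisfy.”

```python
from typing import Optional

# Tier values from the token's tier claim, lowest to highest.
TIER_RANK = {"public": 0, "teaser": 1, "member": 2}


def gate_status(
    endpoint_gated: bool,
    token_valid: bool,
    tier: Optional[str] = None,
    required_tier: str = "member",
    suspended: bool = False,
) -> int:
    """Map a request's auth state to the HTTP status the API would return."""
    if not endpoint_gated:
        return 200  # public endpoints never need a token
    if not token_valid:
        return 401  # recoverable: refresh the token and retry once
    if suspended or tier not in TIER_RANK:
        return 403  # definitive: no retry, no upgrade prompt
    if TIER_RANK[tier] < TIER_RANK[required_tier]:
        return 402  # normal non-member path: teaser plus upgrade prompt
    return 200
```

The order of the checks matters: authentication (401) is settled before permission (403), and permission before the commercial upsell (402), so a suspended member is never shown an upgrade prompt.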
Open Questions & Access
These are the genuine unknowns that shape the build, plus the access needed to resolve them. They are the fastest path from this design to a firm plan.
17.1 Open questions that affect the design
- Video format & timeline fidelity. Are the programs stored as progressive MP4 or as segmented streams, and (for MP4) is faststart set? And — the real precision risk — do the transcript timestamps line up with the actual stored files (watch for trimmed masters or intro offsets)? One sample video answers most of this.
- Prior-vendor export. What exactly can be exported (transcripts, chunks, metadata), and by when must the cutover happen?
- Speaker rosters. Is the per-program attorney roster structured and reliably complete? It is the backbone of attribution. Are any recordings multichannel (separate mic per speaker), which would make attribution near-exact?
- Diarization accuracy bar. What mis-attribution rate is acceptable on real two-hour panels? This decides the speech-service choice.
- Membership & plugin. Which WordPress membership plugin backs Stripe (it decides how we read subscription status)? How many tiers exist today, and do the AI features differ across them beyond member vs. non-member?
- Region & data residency. Which region should host the models and data (US-based CLE audience), and is the reasoning model available with adequate quota there?
- Win condition. The single most important goal for the first release (demo, member usage, or revenue) and any date it is anchored to.
- Teaser depth. For a non-member, how much should the answer preview reveal — just that coverage exists, a short summary, or which programs it would cite? This sets where the free/paid line sits (Section 8.2).
- Reasoning model preference. Both Claude (via Azure AI Foundry) and GPT (via Azure OpenAI) run on Azure; the design defaults to the latest Claude, but confirm the client’s preference and that the Azure subscription can access it.
17.2 Access to confirm and build
- The code repository and the running app (read access is fine).
- The prior RAG vendor account and an export of transcripts/chunks before cutover.
- WordPress admin / staging and how membership and payments are wired, so the token handshake can read tier.
- The video store — one sample program URL answers most of the video questions.
- The cloud subscription to confirm region, identity, and model access.
- Design files and a small content sample (a few programs with transcripts and the catalog metadata).
- A short handoff with the current developer before they roll off.
How it works under the hood
Architecture Overview V1
The system separates into two planes. The request plane is fast and member-facing: search, the streamed answer, and video. The processing plane is slow and asynchronous: turning raw video into an indexed, citable corpus, and running the evaluation loop. They share the datastores but never block each other — a two-hour transcription job can never slow a member’s query.
Reading the diagram: the WordPress site embeds the app, which calls the API. The API reads the catalog and vector store, streams via Redis, verifies membership, and releases video through signed URLs. Background workers (the processing plane) share the datastores and the AI services but run off the request path. Green = datastores, amber = cache/queue/external, purple = secrets/auth.
3.1 The components at a glance
| Component | Responsibility |
|---|---|
| API + middleware | The single entry point. Applies CORS, request-id, and the auth/entitlement check in order, then routes to search, ask (streaming), program, and health. |
| Retrieval / RAG engine | Turns a question into a grounded, cited answer: embed → hybrid retrieve → re-rank → reason → assemble panels (Section 8). |
| Search engine | Structured faceted browse with a semantic fallback and popularity ordering (Section 9). |
| Streaming subsystem | Delivers the answer token-by-token over SSE, resumable across dropped connections (Section 8). |
| Ingestion workers | The processing plane: transcribe, diarize, map speakers, chunk, tag, embed, index (Sections 4–7). |
| Catalog (Postgres) | Programs, speakers, CLE approvals, tags, popularity, and the query log — the structured source of truth. |
| Vector index (pgvector) | Transcript-chunk embeddings with their timecodes, co-located with the catalog in the same database. |
| Cache / stream (Redis) | Resumable SSE streams, idempotency records, and hot-query caching. |
| Media store | The source video, served to members through a CDN with signed URLs (Section 10). |
| AI services | A reasoning LLM, an embeddings model, and a speech-to-text (transcription + diarization) service — each behind a swappable interface (Section 14). |
Reading the diagram: on boot the service health-checks each dependency in order before accepting traffic; on shutdown it disposes pools and clients cleanly. A failed check fails the boot loudly rather than serving a half-wired app.
3.2 Why it is built this way
- Structured catalog behind the search. A relational catalog answers “show me litigation” or “what’s approved in Florida” instantly and exactly — the everyday queries a pure semantic index cannot do reliably.
- One database for facets and vectors. Keeping embeddings next to the metadata we filter on makes hybrid retrieval a single query path, with no second datastore to operate at this scale.
- Two planes, one index. Heavy ingestion runs as background jobs so it never touches request latency, yet writes to the same catalog the request plane reads.
Content Ingestion V1 slice · V2 full pipeline
Ingestion is the processing plane’s core job: turn a raw two-hour recording into a set of short, searchable, citable passages, each tied to an exact video moment and the attorney who spoke it. It runs as a sequence of stages, each an independent background job (Section 7), so a failure or a re-run affects only one program and one stage.
Reading the diagram: audio is extracted from the source video, transcribed with speaker diarization, speaker labels are resolved to named attorneys, then the transcript is chunked (carrying timecodes and speaker), tagged, and embedded. Vectors land in the index, metadata in the catalog, and the video stays in place.
4.1 The stages
- Extract audio from the source video for speech processing.
- Transcribe + diarize: produce the transcript with word-level timestamps and speaker turns in one pass (Section 5).
- Resolve speakers: bind each anonymous speaker to a named attorney from the program roster (Section 5).
- Chunk: split the transcript into overlapping passages, each carrying its program, speaker, and [start, end] timecodes (Section 6).
- Tag / categorize: derive practice area, sub-practice area, and topic from the content (this is the layer above raw retrieval that powers faceted browse).
- Embed + index: embed each passage and write vectors to the index and metadata to the catalog.
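The chunk stage is the one whose output shape everything downstream depends on, so a minimal sketch may help. It assumes word-level input from the transcription stage; the `Word` and `Chunk` types, the window and overlap sizes, and the choice to label a chunk with its dominant speaker are illustrative assumptions, with the actual chunking parameters defined in Section 6.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Word:
    text: str
    start_ms: int
    end_ms: int
    speaker_id: str


@dataclass(frozen=True)
class Chunk:
    program_id: str
    speaker_id: str
    start_ms: int
    end_ms: int
    text: str


def chunk_words(
    program_id: str,
    words: Sequence[Word],
    size: int = 200,    # words per chunk (placeholder)
    overlap: int = 40,  # words shared with the previous chunk (placeholder)
) -> List[Chunk]:
    """Slide a fixed-size window over the words, carrying timecodes and speaker."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: List[Chunk] = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        # Assumption: a chunk spanning a speaker change takes the majority speaker.
        dominant = Counter(w.speaker_id for w in window).most_common(1)[0][0]
        chunks.append(
            Chunk(
                program_id=program_id,
                speaker_id=dominant,
                start_ms=window[0].start_ms,
                end_ms=window[-1].end_ms,
                text=" ".join(w.text for w in window),
            )
        )
        if start + size >= len(words):
            break  # the last window already reached the end of the transcript
    return chunks
```

Because each chunk's timecodes come straight from its first and last word, the clip a member plays is exactly the span the cited text covers.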
Program-level metadata (title, dates, live/replay format, and the speaker roster) is imported from the existing content system — the pipeline enriches it (tags, chunks, timecodes); it does not originate it.
The inherited transcripts and chunks from the prior vendor lack speaker attribution and reliable timecodes (Sections 5–6), so the plan is to re-process from audio to regenerate both correctly. Whether the entire ~800-program library is re-processed or only a representative first-cut slice is contingent on the video/timestamp findings (Section 17): if the existing timecodes prove accurate, less re-processing is needed. The prior vendor’s exported metatags are not discarded — they seed the tagging stage, and the model fills gaps and normalizes them rather than re-deriving tags from scratch.
4.2 Re-runnable by design
Tagging and chunking are tuned repeatedly during the quality loop, and programs get edited, so each stage is idempotent and replayable per program. Re-tagging the whole library, or re-chunking a single edited program, is a matter of re-enqueuing jobs — not a migration. CLE state-approval data is compliance-grade and enters the catalog from an authoritative source; it is never inferred by a model.
External Services V1
Four capabilities are provided by external services, each behind a swappable interface so any one can be re-decided during the quality loop without re-architecting. Each has a recommended default and viable alternatives, so a different vendor can be chosen without changing the design.
Reading the diagram: the four external capabilities (reasoning LLM, embeddings, transcription + diarization, video delivery) each sit behind our interface, with a recommended service and alternatives listed underneath.
14.1 Reasoning LLM
Recommended: Claude on Microsoft Foundry — a frontier reasoning model callable on an Azure endpoint with the organization’s existing Microsoft identity, so it stays inside the Azure boundary. Alternatives: Azure OpenAI GPT models (single Microsoft/OpenAI stack); the Anthropic API directly; or Claude via AWS (where the video already lives).
14.2 Embeddings
Recommended: the text-embedding-3-large model via Azure OpenAI, keeping embeddings first-party in the same tenant. Alternatives: the same model via the OpenAI API, or a self-hosted open-source embedding model if cost or data-residency ever demands it.
14.3 Transcription + diarization
Recommended: Azure AI Speech batch transcription with diarization — it returns speaker turns and word-level timestamps in one job, runs in-tenant, and is inexpensive at this scale (the one-time backfill is on the order of a few hundred dollars; ongoing cost is tens of dollars a month). Alternatives: a self-hosted WhisperX + pyannote pipeline (best timestamp precision, cheapest at volume, needs operational ownership); or specialist APIs such as AssemblyAI or Deepgram. The choice is settled by a diarization-accuracy benchmark on a handful of real programs, since attribution accuracy is exactly what the inherited system lacked.
14.4 Video delivery
Recommended: keep the video where it already lives (Amazon S3) and serve it through its native CDN (CloudFront) with signed URLs — no migration, and the cross-cloud boundary is reduced to tiny signed-URL calls rather than video bytes (Section 15). Alternative: migrate the media into Azure storage behind Azure’s CDN for a single-cloud footprint — a later option that the cost math does not force now.
Engineering deep dive
Speaker Diarization & Attribution V1
Panels 3 and 4 — the clip and the attorney’s credentials — are only trustworthy if the system knows which attorney spoke the retrieved passage. The inherited transcripts do not carry that: speech was never matched to speakers. So attribution is a first-class ingestion step, not an afterthought.
Reading the diagram: the audio is transcribed and diarized into anonymous speaker turns with timestamps; those turns are matched to the program’s named roster using the introductions and speaking patterns; high-confidence bindings are accepted automatically, low-confidence ones are queued for a quick human confirm before the chunks inherit a speaker_id.
5.1 Why re-processing from audio is required
Diarization is an acoustic task — it distinguishes voices from the waveform — so it cannot be layered onto a plain text transcript that has no audio signal. Re-processing from the source audio is therefore necessary, and it pays a second dividend: it produces fresh, reliable word-level timestamps, which the video-clip feature depends on and the inherited transcripts may not carry accurately.
5.2 From anonymous speakers to named attorneys
Diarization returns anonymous labels (“Speaker A / B / C”), stable within a recording but nameless. Binding a label to a real attorney is a separate step, solved without biometric voice enrollment (which does not scale to a rotating cast of hundreds of guest attorneys):
- Most programs have a small cast (a moderator plus one to three presenters) and ship a speaker roster (the names and bios that fill panel 4).
- An LLM pass over the first minutes, where the moderator introduces each speaker in turn, plus speaking-time patterns proposes each binding (Speaker A = Jane Smith) with a confidence score.
- High-confidence bindings are accepted; low-confidence ones surface for a quick human confirm. At ~25 programs/week this is cheap, and it protects the promise — showing the wrong attorney’s face is worse than a wording slip.
- The binding is stored per program, so the same diarized label maps consistently across every chunk of that recording.
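The accept-or-review split can be sketched as follows. This is a minimal illustration, not the production logic: the `Binding` type, the helper names, and the 0.85 threshold are assumptions (the design only distinguishes "high" from "low" confidence).

```python
from dataclasses import dataclass

# Hypothetical cut-off; the real value would be tuned on benchmark programs.
AUTO_ACCEPT_THRESHOLD = 0.85


@dataclass(frozen=True)
class Binding:
    program_id: str
    label: str         # anonymous diarization label, e.g. "Speaker A"
    attorney: str      # roster name proposed by the LLM pass
    confidence: float  # 0.0 to 1.0


def triage_bindings(bindings, threshold=AUTO_ACCEPT_THRESHOLD):
    """Split proposed bindings into auto-accepted and needs-human-review."""
    accepted, review = [], []
    for binding in bindings:
        target = accepted if binding.confidence >= threshold else review
        target.append(binding)
    return accepted, review


def speaker_map(accepted):
    """Per-program lookup from diarized label to attorney name."""
    return {(b.program_id, b.label): b.attorney for b in accepted}
```

Keying the map on `(program_id, label)` reflects that a label such as "Speaker A" is only stable within one recording.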
5.3 At scale
The one-time backfill is ~1,600 hours; the steady state is ~50 hours/week. There is no real-time requirement, so transcription runs as batch/async jobs (Section 7) at a controlled concurrency that respects the speech service’s throughput and cost. If diarization accuracy on long multi-attorney panels proves marginal, the speech service is swappable (Section 14) and a benchmark on a handful of representative programs settles the choice.
Chunking Strategy V1
A two-hour transcript is far too large to retrieve or cite as one unit. Chunking splits it into passages that are (a) semantically self-contained, (b) small enough to embed and re-rank precisely, and (c) tied to an exact video range and a single speaker. Chunk quality is the single biggest lever on answer quality: chunks that are too big blur retrieval, and chunks that are too small lose context.
Reading the diagram: the diarized transcript is scanned for boundaries — speaker changes and topic shifts — then grouped into passages within a target token window with a small overlap so context is never cut mid-thought. Each passage carries its program, speaker, and start/end times before it is embedded and indexed.
6.1 How passages are formed
- Boundaries first. A chunk never spans a speaker change (so attribution stays clean) and prefers to break at topic shifts rather than mid-argument.
- Windowed with overlap. Within those boundaries, text is grouped to a target size with a small overlap between neighbours, so a passage retrieved in isolation still reads coherently and a point split across the seam isn’t lost.
- Metadata travels with the chunk. Every chunk stores `program_id`, `speaker_id`, and millisecond `start`/`end`: the exact data panels 2, 3, and 4 need.
6.2 New and edited content
New programs run the same path automatically as they arrive. An edited program is simply re-chunked in place: because the operation is idempotent and keyed per program, re-running it replaces that program’s chunks and vectors without touching the rest of the corpus. The inherited chunks are not reused — they predate diarization and dependable timecodes — so chunks are regenerated from audio (across the full library or a first-cut slice, per Section 17).
6.3 Tuned, not fixed
Target size, overlap, and boundary rules are parameters, not hard-coded constants. The quality loop (Section 11) evaluates retrieval on a fixed question set and adjusts them; because re-chunking is a background job, sweeping a new setting across the corpus is routine.
Background Processing & Jobs V1 / V2 event-triggered
Transcription, diarization, chunking, embedding, and evaluation are slow, bursty, rate-limited, and occasionally failing operations. Running any of them inline would make a member wait on a two-hour transcription. So the whole processing plane is a durable job system: work is enqueued on an Azure Storage Queue, a pool of our own worker containers drains it, and progress and failures are tracked per program. Each program is one job that runs its stages in sequence (extract → transcribe/diarize → resolve speakers → chunk → embed/index), with the current stage recorded in ingest_jobs so a re-run resumes cleanly. The queue is only the broker; all orchestration logic lives in our own codebase, not in serverless functions.
Reading the diagram: three sources feed one queue — the one-time backfill, newly arriving programs, and a scheduler that enforces the freshness target. A horizontally-scaling worker pool runs each stage; every job is idempotent and its status is tracked, so retries are safe and failures divert to a review lane instead of blocking the line.
7.1 Handling volume
- Backfill vs. steady state. The initial ~800 programs are enqueued once and drained at a controlled concurrency — high enough to finish the backfill in a reasonable window, capped to respect the speech and embedding services’ rate limits and to keep spend predictable. The ongoing ~25 programs/week is a trickle the same pool absorbs without noticing.
- Scale by adding workers. Throughput is a function of worker count; because each program is independent, the pool scales horizontally with no coordination.
- Cost control. Concurrency caps and batch tiers keep the external-service bill bounded; the heavy spend is the one-time backfill, after which it settles to the weekly trickle.
7.2 Reliability
- Idempotent per (program, stage). Re-running a stage is always safe — it replaces that stage’s output — which is what makes retries, edits, and parameter sweeps trivial.
- Isolated failures. A bad recording fails its own job only; it never stalls the queue. Repeatedly failing jobs divert to a dead-letter / review lane for a human to inspect.
- Observable. Each program’s stage and status live in the catalog (an `ingest_jobs` record, Section 13), so “where is program X” and “what’s stuck” are simple queries.
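The "re-running a stage replaces that stage's output" property can be illustrated with a delete-then-write inside one transaction. This is a sketch only: SQLite stands in for the Postgres catalog, and the `chunks` table layout and function names are hypothetical.

```python
import sqlite3


def open_store():
    """In-memory stand-in for the catalog, with a hypothetical chunks table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chunks ("
        " program_id TEXT, seq INTEGER, text TEXT,"
        " PRIMARY KEY (program_id, seq))"
    )
    return conn


def write_chunks(conn, program_id, chunks):
    """Replace every chunk of one program atomically (delete-then-write)."""
    with conn:  # single transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM chunks WHERE program_id = ?", (program_id,))
        conn.executemany(
            "INSERT INTO chunks (program_id, seq, text) VALUES (?, ?, ?)",
            [(program_id, seq, text) for seq, text in enumerate(chunks)],
        )
```

Running `write_chunks` twice for the same program leaves only the second run's rows, and rows belonging to other programs are never touched, which is what makes retries and parameter sweeps safe.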
7.3 Meeting the 48-hour freshness target
A scheduled trigger (running on a short interval) picks up new or edited programs and enqueues them — comfortably inside the 48-hour window with no event plumbing to build. Wiring ingestion directly to the content-production workflow so new programs trigger instantly is a natural later enhancement, but is not needed to hit the target.
7.4 The operator console
Several steps in the pipeline are deliberately human-in-the-loop, so the platform needs a small internal admin view for the team that runs content — not a member-facing screen. It sits over the same catalog and job tables and lets an operator see and act on what is happening behind the scenes, so transcripts, speakers, and clips can be verified before members ever see them.
Reading the diagram: the operator console reads ingestion job status and the catalog, and lets a person confirm low-confidence speaker matches, review and retry failed jobs, spot-check a program’s transcript and clips against the video, and re-run tagging or chunking for a program.
- Ingestion status: where each program is in the pipeline (from `ingest_jobs`) and what is stuck.
- Speaker confirmations: the queue of low-confidence label→attorney matches awaiting a quick human yes/no (Section 5).
- Failed-job review: the dead-letter lane, with one-click retry after a fix.
- Verify & re-run: spot-check a transcript / clip against the source video, and re-tag or re-chunk a program on demand.
7.5 The stage state machine
Each program is one row in ingest_jobs that walks a fixed sequence of states. A worker leases the job for exactly one stage, does that stage’s work, writes the result, advances the state, and re-enqueues the job for the next stage — so a long pipeline never occupies a single worker for two hours, and any stage can be retried or resumed independently.
Reading the diagram: a job moves `queued → extracting → transcribing → mapping-speakers → chunking → embedding → indexed`, one stage per lease. Any stage can throw; a failure increments a retry counter and re-enqueues with backoff, and once retries are exhausted the job lands in `dead-letter` for a human. `indexed` and `dead-letter` are the terminal states.
- One stage per lease. A queue message carries `{program_id, stage, attempt}`. The worker reads the job’s current stage from `ingest_jobs`, runs only that stage, and on success writes `stage = <next>` and posts a fresh message for the next stage in the same sequence (write result → advance row → enqueue next → delete current message). This keeps each unit of work short and makes the queue depth a live picture of remaining work.
- Visibility timeout. When a worker dequeues a message it becomes invisible for a bounded lease (sized per stage: short for `chunk`/`embed`, generous for `transcribe`, which mostly polls an external job rather than holding the CPU). If the worker crashes mid-stage, the message reappears after the timeout and another worker retries, which is safe only because every stage is idempotent (below). Long-running stages extend the lease (an update-visibility heartbeat) rather than picking one timeout large enough to hide a real crash.
- Idempotency keyed by `(program_id, stage)`. Each stage is delete-then-write: before writing, it removes any prior output for that `(program_id, stage)` (the program’s chunks, its vectors, its speaker map), then writes fresh. A message redelivered after a crash, a manual re-run, or a parameter sweep therefore produces exactly one clean copy, never duplicates. This is the same property Sections 4.2 and 6.2 rely on for re-tagging and re-chunking.
- Retries with exponential backoff. A stage that throws increments `attempt` and re-enqueues with a visibility delay of roughly `base · 2^attempt` (e.g. 30s, 1m, 2m, 4m, capped ~15m), with jitter so a provider blip doesn’t resynchronize the whole pool into a thundering retry. Transient classes (a `429`, a 5xx, a timeout) are retried; deterministic ones (a corrupt media file, a missing roster) fail fast toward dead-letter.
- Poison messages & dead-letter. After `max_attempts` (~5) the job is written `stage = failed` with the last error and diverted to a dead-letter lane: a separate queue plus the `ingest_jobs` row flagged for review. Azure Storage Queue also exposes a per-message `dequeueCount` as a backstop poison-detector. The dead-letter lane is exactly what the operator console’s failed-job review (7.4) reads, and a one-click retry resets `attempt` and re-enqueues at the failed stage.
- Resume, don’t restart. Because the current stage is persisted, re-running a job re-enters at its recorded stage rather than re-transcribing a two-hour file to fix a chunking tweak; the expensive stages (`extract`, `transcribe`) are done once and their outputs (staged audio in Blob, the diarized transcript JSON) are reused by later re-runs.
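The stage sequence and the retry rule can be sketched in a few lines. This is an illustration under stated assumptions: the 30-second base, 15-minute cap, and five attempts come from the examples above, while the function names, the message shape beyond `{program_id, stage, attempt}`, and the specific jitter formula (a uniform draw over the upper half of the window) are hypothetical choices.

```python
import random

STAGES = ["queued", "extracting", "transcribing", "mapping-speakers",
          "chunking", "embedding", "indexed"]
TERMINAL = {"indexed", "dead-letter"}
MAX_ATTEMPTS = 5
BASE_DELAY_S = 30
MAX_DELAY_S = 15 * 60


def next_stage(stage):
    """Return the state that follows `stage` in the fixed sequence."""
    if stage in TERMINAL:
        raise ValueError(f"{stage} is terminal")
    return STAGES[STAGES.index(stage) + 1]


def backoff_delay(attempt, rng=random):
    """Delay in seconds before retrying after failure number `attempt` (0-based)."""
    ceiling = min(BASE_DELAY_S * 2 ** attempt, MAX_DELAY_S)
    # Jitter spreads retries so workers do not all wake at the same instant.
    return rng.uniform(ceiling / 2, ceiling)


def on_failure(message, transient):
    """Decide what to do with a message whose stage just failed."""
    attempt = message["attempt"] + 1
    if not transient or attempt >= MAX_ATTEMPTS:
        return {**message, "attempt": attempt, "action": "dead-letter"}
    return {**message, "attempt": attempt, "action": "retry",
            "delay_s": backoff_delay(message["attempt"])}
```

A worker would pass `transient=True` for a 429, a 5xx, or a timeout, and `transient=False` for a corrupt media file or a missing roster, so deterministic failures skip the retry budget entirely.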
7.6 ASR batch mechanics
The transcribe stage is the pipeline’s long pole and the only one that hands work to a third party, so it is modelled as submit-then-poll rather than a blocking call.
- Audio staged in Blob. The prior `extract` stage pulls the source video (on AWS/S3), runs `ffmpeg` to strip a mono 16 kHz PCM/WAV track, and writes it to Azure Blob under a deterministic key (`audio/{program_id}.wav`). Azure AI Speech reads the audio by SAS URL, so the ~2-hr file is never streamed through our workers.
- Async submit → poll. The worker calls Azure AI Speech batch transcription (`POST` a job over the blob SAS, with `diarization` and `wordLevelTimestamps` enabled), stores the returned transcription id on the job, and releases the lease. The job re-enqueues itself as a lightweight `transcribe-poll` tick that checks status on a slow cadence (every 1–2 min), so a worker is never blocked for the 5–20 min a 2-hr file takes. (A completion webhook can later replace polling; polling is enough for a 48-hr SLA.)
- Parse the diarized result. On `Succeeded`, the worker fetches the result JSON and parses per-word entries, each carrying `text`, `offset`/`duration` (giving `start_ms`/`end_ms`), and a `speaker` label. These feed `mapping-speakers` (Section 5) and `chunking` (Section 6); the raw JSON is retained in Blob as the QA/replay source.
- Failure handling. A `Failed` transcription or a malformed result routes through the same retry/backoff path; a persistently failing file (bad codec, corrupt upload) dead-letters for the operator to re-point at a clean source.
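The parse step might look like the sketch below. The input shape is a simplified assumption: each entry is taken to be a dict with `text`, `offset_ms`, `duration_ms`, and `speaker`. The speech service's actual result schema (field names and time units) differs and must be mapped onto this shape first.

```python
from itertools import groupby


def parse_words(entries):
    """Normalize raw word entries into text/start_ms/end_ms/speaker records."""
    words = []
    for entry in entries:
        start_ms = int(entry["offset_ms"])
        words.append({
            "text": entry["text"],
            "start_ms": start_ms,
            "end_ms": start_ms + int(entry["duration_ms"]),
            "speaker": entry["speaker"],
        })
    # Sort by start time so turns come out in playback order.
    return sorted(words, key=lambda w: w["start_ms"])


def speaker_turns(words):
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        group = list(group)
        turns.append({
            "speaker": speaker,
            "start_ms": group[0]["start_ms"],
            "end_ms": group[-1]["end_ms"],
            "text": " ".join(w["text"] for w in group),
        })
    return turns
```

The turns are what the speaker-mapping stage would bind to roster names, while the word-level records keep the millisecond timing the chunker needs.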
7.7 Chunking execution
Section 6 sets the strategy; the mechanics are deliberately thin. Token counts use the embedding model’s own tokenizer (the cl100k-family encoding behind text-embedding-3-large) so a chunk’s measured size matches what the embedding call bills and truncates at. Boundary detection walks the diarized word stream: a hard break at every speaker change (so a chunk never straddles two attorneys), and a preferred break at topic/pause cues within a speaker’s turn. Passages are packed to a target window (a few hundred tokens) with a small token overlap carried from the prior chunk, and each emitted chunk is stamped with program_id, speaker_id, and millisecond start/end from its first and last words. Target size, overlap, and boundary rules are the parameters the quality loop sweeps (Section 6.3).
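A minimal sketch of the packing step, under simplifying assumptions: one word counts as one token (the real pipeline would measure with the embedding model's tokenizer), topic/pause cues are omitted so only the hard speaker-change break is shown, and the `Word`, `Chunk`, and `chunk_words` names are hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Word:
    text: str
    start_ms: int
    end_ms: int
    speaker_id: str


@dataclass(frozen=True)
class Chunk:
    program_id: str
    speaker_id: str
    start_ms: int
    end_ms: int
    text: str


def chunk_words(program_id, words, target=300, overlap=30):
    """Pack a diarized word stream into overlapping single-speaker chunks."""
    if not 0 <= overlap < target:
        raise ValueError("overlap must be non-negative and smaller than target")
    step = target - overlap
    chunks = []
    # Hard boundary: group consecutive words by speaker so no chunk spans two.
    for speaker_id, turn in groupby(words, key=lambda w: w.speaker_id):
        turn = list(turn)
        for i in range(0, len(turn), step):
            window = turn[i:i + target]
            chunks.append(Chunk(
                program_id, speaker_id,
                window[0].start_ms, window[-1].end_ms,
                " ".join(w.text for w in window),
            ))
            if i + target >= len(turn):
                break  # this window already reached the end of the turn
    return chunks
```

Because `target` and `overlap` are plain arguments, a quality-loop sweep is just a re-run of this stage with different values.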
7.8 Embedding at volume
- Batched calls. The `embedding` stage sends chunks to Azure OpenAI `text-embedding-3-large` in batches (dozens to low-hundreds of chunks per request, bounded by the per-request token ceiling) rather than one call per chunk, cutting overhead and staying inside throughput limits.
- Corpus size. At a few-hundred-token target with overlap, a 2-hr program yields on the rough order of 1,000–1,500 chunks, so the ~1,600-hr backfill is on the order of ~1–1.5 million chunks / embeddings: large enough to batch and rate-limit deliberately, small enough to fit comfortably in one Postgres/pgvector instance.
- Rate limits (`429`). Azure OpenAI enforces tokens-per-minute quotas; on a `429` the stage honours `Retry-After` and otherwise falls back to the same exponential backoff+jitter as the state machine (7.5). A modest client-side concurrency cap keeps the pool under quota by construction rather than relying on retries to absorb overshoot.
- pgvector upsert + HNSW. Vectors (3,072-dim) are written to the chunk table’s `vector` column and served by an HNSW index (e.g. `m = 16`, `ef_construction = 64`, cosine ops), with `ef_search` tuned at query time for the recall/latency trade-off (Section 8). Note that pgvector’s HNSW index covers at most 2,000 dimensions on the `vector` type, so full 3,072-dim vectors need an indexable representation: `halfvec` storage (indexable up to 4,000 dimensions) or embeddings shortened through the model’s `dimensions` parameter. The write is an upsert keyed on `chunk_id`; with delete-then-write idempotency (7.5) a re-embed leaves exactly one vector per chunk.
- Re-embedding on model change. If the embedding model changes, embeddings are regenerated by re-enqueuing the `embedding` stage across the corpus: a background sweep, not a migration. Because dimensionality/distribution may shift, the strategy is to rewrite the vector column (populate a new column/index, then cut queries over) so retrieval never mixes vectors from two models. Query vectors must always come from the same model as the stored vectors.
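Batching under a per-request token ceiling can be sketched as a small generator. The function name and the two limits passed in are illustrative assumptions; the actual ceilings should be taken from the embedding deployment's documented limits.

```python
def batch_by_tokens(chunks, max_tokens, max_items):
    """Group (chunk_id, token_count) pairs into request-sized batches.

    Yields lists of chunk ids whose summed token counts stay within
    max_tokens and whose length stays within max_items.
    """
    batch, used = [], 0
    for chunk_id, tokens in chunks:
        if tokens > max_tokens:
            raise ValueError(f"chunk {chunk_id} exceeds the per-request ceiling")
        # Flush the current batch before it would overflow either limit.
        if batch and (used + tokens > max_tokens or len(batch) >= max_items):
            yield batch
            batch, used = [], 0
        batch.append(chunk_id)
        used += tokens
    if batch:
        yield batch
```

Each yielded batch maps to one embedding request, so the client-side concurrency cap can be expressed as "at most N batches in flight".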
7.9 Backfill throughput & cost
The ~800-program / ~1,600-hr backfill is enqueued once and drained at a controlled worker concurrency chosen to sit under the Azure AI Speech and Azure OpenAI limits rather than to minimize wall-clock time: the constraint is provider quota and predictable spend, not our compute. Because transcribe is submit-then-poll (7.6), a handful of workers keep dozens of transcription jobs in flight at the provider, so throughput is gated by the Speech service’s batch capacity, not worker count. At Azure batch rates (~$0.36/hr audio) the transcription bill for the full backfill is on the order of a few hundred dollars (~1,600 hr × ~$0.36 ≈ $580), one-time; embedding ~1–1.5M chunks is a comparably modest one-time figure. The backfill realistically drains over days, not weeks, at a deliberately capped rate. After cutover the steady ~25 programs/week (~50 hr/week) is a trickle the same pool absorbs at roughly $20 a week in transcription (~50 hr × ~$0.36), comfortably inside the 48-hr freshness target.
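The transcription figures follow directly from the corpus numbers and the quoted batch rate. The short calculation below only restates that arithmetic; the rate is the document's approximate figure, not a quoted price, and the helper name is hypothetical.

```python
PROGRAMS = 800
HOURS_PER_PROGRAM = 2
WEEKLY_PROGRAMS = 25
RATE_USD_PER_AUDIO_HOUR = 0.36  # approximate batch rate cited above


def transcription_cost(audio_hours, rate=RATE_USD_PER_AUDIO_HOUR):
    """Transcription spend in USD for a number of audio hours."""
    return round(audio_hours * rate, 2)


backfill_hours = PROGRAMS * HOURS_PER_PROGRAM        # 1,600 hours
weekly_hours = WEEKLY_PROGRAMS * HOURS_PER_PROGRAM   # 50 hours

backfill_usd = transcription_cost(backfill_hours)    # one-time
weekly_usd = transcription_cost(weekly_hours)        # steady state

print(f"backfill: ${backfill_usd:,.2f}  weekly: ${weekly_usd:,.2f}")
```

Embedding spend is left out because it depends on the final chunk count and token sizes, which the quality loop is still free to change.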
7.10 Ragie migration ETL
Before the prior vendor (Ragie) goes offline, its transcripts, chunks, and metatags are exported and loaded into our own store as a one-time seed — so nothing is lost at cutover even though the pipeline regenerates most of it. The exported metatags seed the tagging stage (Section 4.1): the model normalizes and fills gaps rather than re-deriving tags from scratch. The exported transcripts/chunks load as a fallback/QA reference, not as retrieval input — they predate diarization and dependable timecodes (Sections 5–6), so the authoritative chunks and vectors come from re-processing audio. In state-machine terms the ETL is just another work source feeding ingest_jobs: it lands program rows and seed metadata, and the normal extract → … → indexed sequence then supersedes the seed per program.
Data Model V1
The catalog is small and clean: programs and their chunks, speakers with their diarized-label mapping, CLE approvals, tags, popularity, the ingestion-job tracker, and the query/citation log that also feeds evaluation.
Reading the diagram: a program is segmented into timecoded chunks (each carrying an embedding and a diarized speaker), features speakers, is approved in states, is tagged, and carries a popularity score. A speaker-map records how each diarized label was bound to a named attorney; an ingest-jobs row tracks each program through the pipeline; queries log their citations back to programs and chunks.
Deployment & Infrastructure on Azure V1 single env / V2 scale & HA
The platform runs on Microsoft Azure — the organization standardizes on Microsoft 365, so Azure keeps identity, billing, and operations in one plane. The app is a streaming container service in front of managed Postgres, managed Redis, and object storage, with secrets in a managed vault, private networking, and CI/CD from Git. Security is designed in: no public database, managed identity instead of stored credentials, and a hard scaling ceiling as the cost and blast-radius guardrail.
15.1 The building blocks
| Role | Azure service | Why |
|---|---|---|
| Containers (API + workers) | Azure Container Apps | Streaming-capable, scales to the modest load easily, cleanly separates the API from background workers. The streaming endpoint runs with a raised request/idle timeout (Premium ingress) so long answers aren’t cut — see Section 8. |
| Catalog + vectors | Azure Database for PostgreSQL Flexible Server | pgvector supported (HNSW/IVF, plus an Azure-only high-recall index) — one database for facets and vectors. |
| Cache / SSE stream | Azure Managed Redis | Resumable streaming, idempotency, hot-query cache. |
| Job queue (broker) | Azure Storage Queue | Feeds the ingestion worker containers; the orchestration logic stays in our code (Section 7). |
| Working storage | Azure Blob Storage | Extracted audio and (later) PDFs, inside the tenant. |
| Secrets | Azure Key Vault | Connection strings, signing keys, API keys — referenced at start, never in the image. |
| Container images | Azure Container Registry | Immutable, commit-tagged images the runtime pulls via managed identity. |
| Edge / CDN / WAF | Azure Front Door | The single public entry point, with a web application firewall and anonymous rate-limiting. |
| Monitoring | Azure Monitor + Application Insights | Traces, SSE latency, dashboards, alerts, and budget guardrails. |
| Identity | Microsoft Entra ID + managed identities | One identity plane across the app, datastores, CI/CD, and the reasoning model. |
| Video (kept in place) | Amazon S3 + CloudFront | Video already lives here; served via signed URLs (Section 15.4). |
Reading the diagram: CI pushes an image to the registry, which deploys an immutable revision of the API on Container Apps. The API reaches managed datastores over private networking, verifies membership against the site, mints signed URLs for video, and calls the AI services with a managed identity. Secrets come from Key Vault; logs and metrics flow to monitoring with budget alerts.
15.2 Secure networking
The security posture is part of the architecture, shown below.
Reading the diagram: public traffic enters only through the edge (with the firewall) into the container environment in a private network. The database, cache, storage, and vault have no public endpoint — they are reached over private endpoints inside the network. The app authenticates to every resource with a managed identity; no keys are stored in the app. Purple = identity/secrets, red = the guarded public edge.
- Private network + subnets. Containers run in their own subnet; datastores live in a separate subnet reachable only via private endpoints. Nothing sensitive is exposed to the public internet.
- Firewall at the edge. The CDN/WAF is the only public entry point; it fronts the app and can rate-limit anonymous traffic — useful for the public teaser tier.
- Managed identity, not credentials. The app uses its platform identity to pull images, read secrets, reach the database and cache, and call the models — so there are no long-lived keys in the app to leak.
- Secrets in the vault. Everything sensitive lives in Key Vault and is referenced at container start; nothing sensitive is in the image or the repository.
- Encryption & least privilege. TLS in transit, encryption at rest by default, and access scoped narrowly per resource.
15.3 CI/CD — push, build, deploy
Every change flows through an automated pipeline: a push to a branch runs tests, builds and scans a Docker image, pushes it to the registry, and deploys an immutable revision — staging automatically, production behind a reviewed merge.
Reading the diagram: a push to the main branch runs tests, builds and scans the image, pushes it to the registry, and auto-deploys a revision to staging; production is reached by a reviewed pull request into the production branch. Same workflow shape, different config — no drift, full audit trail.
- Docker build on push — the pipeline builds the image, runs a vulnerability scan, and tags it by commit, so every deploy is a frozen, reproducible artifact.
- Immutable revisions — each deploy is a new revision; rollback is instant by shifting traffic to the prior one.
- Per-environment isolation — staging and production have separate datastores, secrets, and identities; staging can never touch production data.
- Secretless pipeline — CI authenticates to the cloud via federated identity, not stored cloud credentials.
15.4 Video stays where it is
The video already lives in Amazon S3. At this scale it is not migrated: S3 stays the origin behind its native CDN (so there is no cross-cloud charge on the video bytes), and the Azure app makes only small signed-URL calls across to it. This isolates the cross-cloud boundary to tiny API calls rather than gigabytes of video. A full migration into Azure storage is a later option if a single-cloud footprint becomes worth the one-time move; the cost math does not force it now.
15.5 How the pieces connect — service-to-service auth
Everything inside Azure authenticates with Microsoft Entra ID, not stored credentials. The API app and the worker app each run under a user-assigned managed identity (id-lawcle-api, id-lawcle-worker) — user-assigned so the identity and its role grants survive revision churn and can be pre-provisioned before the first deploy. Each identity is granted only the RBAC roles it needs, scoped to a single resource; there is no shared admin key in the running system.
Reading the diagram: dotted edges are the identity/trust plane — the managed identity presenting a short-lived Entra token to each resource (no stored keys); solid edges are the network data path. Public traffic can enter only through the WAF (red); the datastores (green) sit behind private endpoints with no public address; all outbound calls to the model endpoints and AWS CloudFront funnel through one NAT gateway (amber). The single stored credential in the system is the set of AWS keys in Key Vault (purple), used to sign video URLs.
- Pull images: `AcrPull` on the Container Registry. Container Apps authenticates the identity to ACR at revision start and pulls the commit-tagged image; no registry username/password.
- Read secrets: `Key Vault Secrets User` on the vault (in RBAC authorization mode, not legacy access policies). Container Apps secret references name the Key Vault secret URL and the identity that reads it (`keyVaultUrl` plus `identity`; the `@Microsoft.KeyVault(...)` syntax is the App Service form) and are resolved through that identity at start.
- Postgres: Entra authentication is enabled on the Flexible Server; the managed identity is mapped to a Postgres role and the app fetches a token (scope `https://ossrdbms-aad.database.windows.net/.default`) in place of a password. No database password exists to leak or rotate.
- Redis: Azure Managed Redis uses Entra token auth. An access-policy assignment binds the identity’s object id to a Redis ACL, so the app connects with a token instead of the primary key.
- Blob + Storage Queue: `Storage Blob Data Contributor` (worker writes audio/PDFs), `Storage Queue Data Message Sender` (API enqueues) and `… Processor` (worker dequeues). These are data-plane roles, not the account key.
- Models: `Cognitive Services User` on the Foundry / Azure OpenAI resource; the app requests a token (scope `https://cognitiveservices.azure.com/.default`) and calls Claude-on-Foundry, embeddings, and Speech keyless. This is the Entra-native path that fits the Microsoft 365 identity plane.
- Monitoring: App Insights ingestion via the identity (`Monitoring Metrics Publisher`); no instrumentation key in the image.
- The one exception: AWS. CloudFront/S3 signing needs an AWS credential, which Entra cannot mint. An IAM access-key pair and the CloudFront signing private key live in Key Vault; the app reads them via its identity and signs URLs locally. (A later hardening step could replace the static keys with AWS OIDC federation of the Entra identity, removing even this one stored credential.)
15.6 Network paths — private by default, one public door
- VNet integration. The Container Apps environment is deployed into a delegated infrastructure subnet with `internal: true`, so it is fronted only by an internal load balancer (no public IP). The workload subnet and a separate data subnet live in the same VNet.
- WAF is the only public ingress. Azure Front Door (Premium) reaches the internal environment over Private Link; its WAF (managed rule sets + a custom anonymous rate-limit rule for the teaser tier) is the sole public entry. The app validates the Front Door `X-Azure-FDID` header so the internal load balancer accepts nothing but Front Door.
- Datastores via private endpoints. Postgres, Redis, Key Vault, Blob, and the Storage Queue all have public network access disabled and expose a private endpoint in the data subnet. The corresponding private DNS zones (`privatelink.postgres.database.azure.com`, the Managed Redis zone, `privatelink.vaultcore.azure.net`, `privatelink.blob.core.windows.net`, `privatelink.queue.core.windows.net`) are linked to the VNet, so the app resolves each service to its private IP and never traverses the public internet.
- Controlled egress. Outbound calls (Claude-on-Foundry, Azure OpenAI embeddings, Azure AI Speech, AWS S3/CloudFront) leave through a NAT gateway on the workload subnet, giving a small set of stable egress IPs. That is what lets an AWS bucket policy or partner allowlist the platform, and it keeps outbound identity auditable.
- Model endpoints. V1 calls the Azure AI endpoints over their public FQDN but with Entra tokens (still via the NAT); if data-residency review requires it, the Foundry/OpenAI resource moves behind its own private endpoint with no application change.
15.7 Secrets inventory & rotation
Everything sensitive is in Key Vault and referenced at container start — nothing sensitive is baked into the image or the repository. Because most Azure resources use Entra tokens, the vault holds far fewer secrets than a password-based design would.
- Postgres connection parameters: host, database, `sslmode=require` (the credential is an Entra token, so there is no stored password).
- AWS access-key pair: IAM key id + secret, scoped to CloudFront/S3 read + signing only.
- CloudFront signing key: the private key (PEM) and its key-pair/key-group id used to mint signed, timestamped deep-link URLs.
- JWT key material: the private signing key for member-session tokens and the published JWKS (each key carries a `kid` for overlap-based rotation).
- Model / API keys (fallback): kept only as a fallback to the keyless Entra path; plus any specialist-ASR key (Deepgram/AssemblyAI) if that alternative is chosen.
- Session / idempotency secrets: the HMAC secret backing SSE resume tokens and request-idempotency keys.
- Rotation. Managed-identity tokens are short-lived and auto-refreshed, so there is nothing to rotate for Azure resources. AWS keys use overlapping dual-key rotation; the CloudFront key is rotated by publishing a new key to the key group before retiring the old; the JWT signing key rotates with overlapping validity so JWKS serves both `kid`s during cutover. Because Key Vault references resolve at start, rotating a secret is picked up on the next revision.
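Overlap-based `kid` rotation can be pictured with a small model of the key set. This is a sketch only: the `SigningKey` fields, the helper names, and the use of plain Unix timestamps are assumptions, and no real JWT library is involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SigningKey:
    kid: str
    not_before: int  # unix seconds: when new tokens may be signed with it
    retire_at: int   # unix seconds: when it leaves the published JWKS


def active_signing_key(keys, now):
    """Newest key whose signing window has opened and not yet closed."""
    live = [k for k in keys if k.not_before <= now < k.retire_at]
    if not live:
        raise LookupError("no signing key is currently valid")
    return max(live, key=lambda k: k.not_before)


def published_kids(keys, now):
    """Kids the JWKS should serve: every key that has not been retired."""
    return sorted(k.kid for k in keys if now < k.retire_at)
```

During the overlap window both kids are published, so tokens signed with the old key keep verifying until they expire, while new tokens are already signed with the new key.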
15.8 External quotas & resilience
The reasoning model, embeddings, and transcription are metered external dependencies, so each is treated as a fault domain with its own limits, timeouts, and isolation — one slow provider must never cascade into the interactive answer path.
- Azure OpenAI embeddings: the `text-embedding-3-large` deployment has a per-region tokens-per-minute (TPM) quota; backfill embedding runs with bounded concurrency sized to stay under TPM, honouring `Retry-After` with backoff + jitter on `429`.
- Claude on Foundry: quota can default to 0 TPM until requested, so launch concurrency needs an explicit quota grant per model/region. Streaming answers respect the same 429/backoff discipline; a region or model outage falls back to the configured alternate (Azure OpenAI GPT).
- Azure AI Speech: batch transcription is bounded by concurrent-job limits per resource; the worker submits and polls off the Storage Queue rather than blocking, so a large backfill drains within the ceiling instead of throttling.
- Per-dependency policy: every external client has an explicit connect + overall timeout, bounded retries (idempotent / 429 / 503 only, with jitter), and a circuit breaker so a stalled provider fails fast instead of exhausting the request pool and starving live SSE streams.
- Isolation & recovery: ingestion runs in a separate worker app (a bulkhead) so backfill can never starve interactive `/ask`; SSE streams are resumable from the Redis stream via `Last-Event-ID`, so a dropped connection re-attaches instead of restarting the answer.
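The circuit breaker in the per-dependency policy could take roughly the following shape. It is a deliberately small sketch: the class name, the thresholds, and the simple "allow a trial call once the cool-down has passed" half-open rule are assumptions, not the platform's chosen library or settings.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated failures; probe again after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow(self):
        """Return True if a call to the provider may be attempted now."""
        if self._opened_at is None:
            return True
        # Half-open: once the cool-down has elapsed, let a trial call through.
        return self._clock() - self._opened_at >= self._reset_after_s

    def record_success(self):
        """Close the circuit and forget earlier failures."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        """Count a failure; open (or re-open) the circuit at the threshold."""
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()
```

A client wrapper would check `allow()` before each provider call and return an immediate error when it is `False`, which is what keeps a stalled provider from tying up request slots.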
15.9 Internal API surface
The app exposes a small, versioned HTTP surface, consumed by the members’ website iframe and reached only through Front Door. Endpoints are namespaced under /v1; additive changes stay in /v1, breaking changes would open /v2 alongside it.
- `GET /v1/search`: hybrid keyword + vector search. Params `q`, facet filters (`speaker`, `topic`, `date`, `program`), pagination; returns ranked results as JSON, each with program metadata and a timestamped deep-link.
- `POST /v1/ask`: the four-panel answer, streamed as SSE (`text/event-stream`). Body `{ question, filters, conversationId }`; the stream carries token deltas, citation payloads, and periodic keep-alive heartbeats; resumable via `Last-Event-ID` backed by a Redis stream.
- `GET /v1/program/{id}`: program metadata, transcript segments and speaker turns, and a signed video URL minted on demand (short TTL, timestamped).
- `GET /health` (plus `/health/live` and `/health/ready`): the liveness/readiness probes Container Apps uses; readiness confirms Postgres and Redis reachability before a revision takes traffic.
- Auth. Member requests carry the signed session JWT from the iframe host (Section 12); the anonymous teaser tier is unauthenticated and rate-limited at the WAF.
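The SSE framing and `Last-Event-ID` resume for `/v1/ask` can be sketched with plain strings. The event names (`token`, `citation`), the payload fields, and the in-memory event list standing in for the Redis stream are all hypothetical; only the wire format (`id:`, `event:`, `data:` lines, a blank-line terminator, and `:` comment lines) is the standard one.

```python
import json


def sse_frame(event_id, event, data):
    """Serialize one event in text/event-stream wire format."""
    payload = json.dumps(data, separators=(",", ":"))  # always a single line
    return f"id: {event_id}\nevent: {event}\ndata: {payload}\n\n"


def heartbeat():
    """Comment line that keeps idle connections open; clients ignore it."""
    return ": keep-alive\n\n"


def replay_after(events, last_event_id):
    """Frames a reconnecting client still needs.

    `events` is an ordered list of (event_id, event, data) tuples. An unknown
    or missing id replays the stream from the start.
    """
    ids = [event_id for event_id, _, _ in events]
    start = ids.index(last_event_id) + 1 if last_event_id in ids else 0
    return [sse_frame(*event) for event in events[start:]]
```

On reconnect the browser's `EventSource` sends the last id it saw in the `Last-Event-ID` request header, so the server only has to look that id up and stream what follows.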
Scaling & Operations V1 / V2
The same architecture carries the system from launch to growth by configuration, not redesign. Given the modest expected load, launch needs very little; the headroom is there when partner associations or higher volume arrive.
Reading the diagram: scaling is a settings change — raise the container replica ceiling, then add a database read replica and a cache tier only when measurements call for it. Nothing about the shape of the system changes.
16.1 Operating guardrails
- A hard replica ceiling is the primary cost and blast-radius guardrail — a traffic spike or a runaway loop can never fan out without limit.
- Budget alerts on the subscription and on the metered AI/speech services catch cost surprises early.
- Dashboards and alerts cover request latency, SSE health, error rates, and the ingestion queue depth — so a stuck backfill or a failing external service is visible immediately.
- Backups of the catalog and vector index; the video and its store remain the system of record for media.
Beyond V1 & V2 - later phases
Called out here for completeness. None of these are in the current build. Each is noted in the design wherever a downstream decision depends on whether it is eventually adopted.
- Indexing the written materials that ship with programs (handouts, slide decks, case excerpts - mostly PDFs).
- Conversational follow-ups, saved history, and cross-source synthesis that merges near-identical clips.
- A paid add-on tier and single-class buyers; lead capture and anti-scraping.
- Partner bar associations (multi-tenant, branded front-ends with per-tenant content and entitlements).