Legal CLE AI Platform — System & Infrastructure Design

An evidence-grounded question-answering and search layer over a large library of legal-education video, embedded in the members’ website.

Guiding principle. Every substantive answer is anchored to the source video — the member can watch the attorney say it, word for word. In law the exact wording carries the meaning, so accuracy is the product and grounding is how we guarantee it. Every design decision below serves that: retrieval that finds the right passage, timecodes that play the exact moment, and attribution that names the right attorney.

V1 (current scope): the cheapest credible live demo on real content. V2 (future scope): expansion and scale once V1 proves out. The Roadmap breaks this down; the pills on later section headings mark which phase each part belongs to.

For the client

Product & Scope

1.1 What the product is

Two member-facing capabilities over one indexed content library: catalog search and grounded, cited answers.

The two share the same corpus and index, so a search result and a cited answer point at the same programs, timecodes, and attorneys.

1.2 The corpus we are designing for

The numbers drive the architecture, so they are stated up front:

Roadmap · Current Scope (V1) vs Future Scope (V2)

The work ships in two versions. V1 is the cheapest credible live demo on real content, built in three pieces — Setup & Infrastructure, then UI Integration, then Testing & Improvement. V2 expands on it with scale and richer features once V1 has proven its value. Keeping V1 lean is deliberate: it controls the initial spend and lets real usage decide what we invest in next. Throughout the rest of this document, V1 / V2 pills on the section headings mark which phase each part belongs to.

Reading the diagram: V1 is built in three sequential pieces to reach a working demo; once it proves out, V2 layers on scale, search/answer improvements, deeper training, and reach. Green = V1, amber = V2.

V1 — current scope, in detail V1

Every box below is in the first version. The three pieces build on one another.

Reading the diagram: the three V1 pieces run left to right (Setup & Infrastructure → UI Integration → Testing & Improvement); under each are its component workstreams. The detail is below.

Piece 1 · Setup, Infrastructure & Migration

Piece 2 · UI Integration

Piece 3 · Testing & Improvement

What V1 deliberately excludes

To keep the first version cheap and fast, these are explicitly out of V1 and live in V2:

V2 — future scope V2

Once V1 proves members use it, V2 expands on every axis. The diagram shows each expansion area and the work it contains.

Reading the diagram: four expansion areas, each broken into the work it contains. These layer on after V1 has proven out. The detail is below.

Scale & hardening

Search & answer improvements

Advanced training

Monetization & reach

V1 vs V2 at a glance

Area | V1 | V2
Content | Representative slice, video/transcript | Full library, automated pipeline, PDFs/materials
Retrieval | Single + multi-clip grounded answer | Cross-source synthesis, conversation
Search | Facets + keyword + semantic + popularity | Autocomplete, synonyms, richer facets
Infrastructure | Single secure Azure environment | Autoscaling, HA replica, caching
Auth / monetization | 3-tier gating + signed video | Paid add-on tier, lead capture, anti-abuse
Quality | Initial eval set + loop | Continuous training, feedback, A/B
Reach | Single brand | Partner bar associations (multi-tenant)

How It Connects

The application is an independent web service embedded in the members’ WordPress site as an iframe. It is not a WordPress plugin; it runs on its own infrastructure and talks to WordPress only to learn who a visitor is. Access is organized into three levels so the product can be an acquisition surface and an upgrade lever at once.

Reading the diagram: a visitor loads the members’ site; the AI app is embedded in the page and calls its own backend. The backend reads the catalog and vector index, streams answers, checks membership against WordPress, and plays video from the existing media store. The AI models sit behind a service interface so they can be swapped.

2.1 The three access levels

Level | Who | What they get
Catalog search | Anyone (public) | Full keyword/faceted search: discover which programs exist on a topic. The acquisition surface.
Answer preview (teaser) | Anyone (public) | A lightweight preview that shows the value without the grounded depth, and is deliberately cheap to produce (see Section 8.2). Exactly how much the preview reveals is a product decision to confirm with the client (Section 17). The upgrade trigger.
Full answer | Members | The complete four-panel answer: grounded reasoning, transcript, the exact video clip, and attorney credentials.

The boundary is enforced on the server (Section 12), never merely hidden in the interface, and the video itself is released only through short-lived signed links (Section 10).

Query, Retrieval & the 4-Panel Answer V1 · V2 conversation

A question runs through embed → hybrid retrieve → re-rank → grounded reasoning → assemble panels, streamed token-by-token so the answer begins appearing immediately. The same flow produces a teaser for non-members and one or several clips for members.

Reading the diagram: the route resolves entitlement first. A non-member gets a teaser payload; a member gets the full pipeline — retrieve, re-rank, reason, and push events (reasoning tokens, then one or more four-panel citations) onto a Redis stream that the subscriber relays to the browser.
Reading the diagram: structured filters narrow the space, vector search pulls 10–15 candidates, and the model re-ranks to the best 5–6. Only if grounded evidence clears a threshold does it answer; otherwise it returns an honest “no source” rather than fabricating. When strong chunks come from more than one program, several clips are surfaced.
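
The trim-and-threshold step in that flow can be sketched as follows. This is a minimal illustration, not the production re-ranker: the `Candidate` shape, the 0-to-1 score scale, and the `min_score` value are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    program_id: str
    score: float  # re-rank relevance; the 0..1 scale is an assumption


def select_evidence(candidates, keep=6, min_score=0.5):
    """Keep the best `keep` candidates, then drop any below the bar.

    An empty result is the signal to return the honest "no source" reply.
    """
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)[:keep]
    return [c for c in ranked if c.score >= min_score]


def programs_cited(evidence):
    """Distinct programs behind the evidence, in rank order (one clip each)."""
    return list(dict.fromkeys(c.program_id for c in evidence))
```

When `programs_cited` returns more than one program, the member sees several clips; when `select_evidence` returns nothing, no answer is generated.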

8.1 How each panel is assembled

The four panels are not four separate generations — they are four views of the same retrieved chunk. This is what guarantees they always agree with one another:

Reading the diagram: one ranked chunk fans out into all four panels — its text becomes the transcript panel, its timecodes become the video clip, its speaker becomes the attorney panel, and the reasoning panel cites its id. Because everything derives from the same chunk, the clip, the quote, and the attorney can never drift out of sync.
Reading the diagram: an idempotency claim makes retries safe; the generation task writes events with monotonic ids to a Redis stream; the subscriber relays them as SSE. A dropped connection resumes from the last seen id, so a flaky network never loses or duplicates an answer.
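
A small in-memory model of that behavior is sketched below. It stands in for Redis purely to show the mechanics; the class names and the idempotency-key format are hypothetical, and a real deployment would use Redis stream ids rather than a list index.

```python
class AnswerStream:
    """In-memory stand-in for one Redis stream of answer events."""

    def __init__(self):
        self._events = []  # list of (event_id, payload)

    def append(self, payload):
        event_id = len(self._events) + 1  # monotonic, starts at 1
        self._events.append((event_id, payload))
        return event_id

    def read_after(self, last_seen_id=0):
        """Everything the client has not yet seen, in order."""
        return [(i, p) for i, p in self._events if i > last_seen_id]


class IdempotencyStore:
    """Maps a request key to its stream so a retry never starts a second generation."""

    def __init__(self):
        self._claims = {}

    def claim(self, key):
        if key in self._claims:
            return self._claims[key], False  # retry: reuse the existing stream
        stream = AnswerStream()
        self._claims[key] = stream
        return stream, True  # first claim: caller starts the generation task


def to_sse(event_id, payload):
    """Frame one event in the SSE wire format (payload assumed single-line)."""
    return f"id: {event_id}\ndata: {payload}\n\n"
```

Because each SSE frame carries its id, a browser `EventSource` that reconnects sends the last id it saw in the `Last-Event-ID` header, and the subscriber can call `read_after` with it.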

8.2 The teaser and the no-source fallback

For a non-member, the answer endpoint returns a lightweight preview plus an upgrade prompt — deliberately cheap to produce (built from catalog / retrieval metadata rather than a full grounded generation), so an unauthenticated question never triggers an expensive per-request model call. It withholds the gated depth (the grounded reasoning, the transcript, and the video). Exactly how much the preview reveals is a product decision still to confirm with the client (Section 17). For a member whose question has no strong source, the system says so plainly; it never invents a citation. A claim with no video behind it does not get made.

8.3 Why it is built this way

8.4 Hybrid retrieval — one query path

Retrieval is a single round-trip against Postgres that blends three signals and returns 10–15 candidate rows from the chunks table. There is no separate keyword database and no second network hop: pgvector, full-text search, and the catalog filters all live in the same store.

8.5 Re-ranking — trim 15 to the best 5–6

8.6 Grounded reasoning & the citation contract

The reasoning model (Claude on Azure AI Foundry, or Azure OpenAI GPT) never sees the corpus — only the 5–6 reranked rows, packed as an evidence block, and it is contractually bound to cite them.
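
One way to hold the model to that contract is a mechanical check after generation. The sketch below assumes each claim arrives as a dict with `text` and `chunk_ids` keys; that shape is an illustration, not a confirmed schema.

```python
def ungrounded_claims(claims, evidence_ids):
    """Return claims that cite nothing, or cite a chunk outside the evidence block."""
    allowed = set(evidence_ids)
    bad = []
    for claim in claims:
        cited = claim.get("chunk_ids", [])
        if not cited or not set(cited) <= allowed:
            bad.append(claim)
    return bad
```

An empty return value means every claim maps to a chunk the model was actually shown; anything else is a reason to reject or rewrite the answer.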

8.7 From DB rows to the four panels — the assembly mapping

This is the step the whole system turns on: we have rows in the chunks table and video files, and here they become the rendered answer. Assembly is pure resolution, with no further generation. For each chunk_id the model cited, the assembly layer reads that one row and fans its fields into the four panels, which is the structural reason the quote, clip, and attorney can never disagree.

Reading the diagram: the query is embedded, hybrid-retrieved against the chunks table, reranked to 5–6, and reasoned over so the model returns claims tagged with chunk_ids. For each cited id the assembly layer re-reads that row: content fills the transcript panel, start_ms/end_ms mint a signed clip URL for the video panel, speaker_id joins the speakers row for the attorney panel, and the ordered claims form the reasoning panel. The whole thing is then serialized to a payload and pushed as SSE events. Green = Postgres rows, blue = process steps, amber = LLM / signed URL / Redis.
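
The fan-out from one row to four panels can be sketched as a pure function. The field names `content`, `start_ms`, `end_ms`, `speaker_id`, and `chunk_id` come from the description above; the `ChunkRow` class, the claim and speaker shapes, and the `sign_url` callable are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRow:
    chunk_id: str
    program_id: str
    speaker_id: str
    content: str
    start_ms: int
    end_ms: int


def assemble_citation(chunk, speakers, claims, sign_url):
    """Resolve one cited chunk into the four panels. No generation happens here."""
    return {
        # Reasoning panel: the ordered claims that cite this chunk.
        "reasoning": [c["text"] for c in claims if chunk.chunk_id in c["chunk_ids"]],
        # Transcript panel: the chunk text, verbatim.
        "transcript": chunk.content,
        # Video panel: a signed URL plus the range the player should play.
        "video": {
            "url": sign_url(chunk.program_id),
            "start_ms": chunk.start_ms,
            "end_ms": chunk.end_ms,
        },
        # Attorney panel: the speakers row joined on speaker_id.
        "attorney": speakers[chunk.speaker_id],
    }
```

Every panel reads from the same `chunk` argument, which is the property the section relies on.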

8.8 Grounding verification — enforcing “every claim maps to a chunk”

8.9 The teaser path — cheap, no grounded generation

8.10 SSE event protocol

8.11 Caching, latency budget & mid-stream failure

Hybrid Search V1 core · V2 autocomplete

The structured path is primary and fast; the semantic path handles the long tail. A relational catalog does the everyday queries — browse a practice area, filter by state approval — instantly and exactly, and the AI layer catches the open-ended questions that facets can’t express.

Reading the diagram: known/faceted queries go straight to the catalog with counts and subcategorization; open-ended queries fall through to semantic search and are blended back in. Either way, results are ordered live/replay first, then by popularity.

9.1 Facets, counts & narrowing

Facets: practice area, sub-practice area, CLE state approval, credit hours, format (live/replay vs. on-demand), date, speaker, and popularity. Broad results subcategorize automatically — litigation (268) → trucking, personal injury, employment, trial skills — and show accurate counts (“300 in estate planning”), because the catalog can compute them exactly.

9.2 Popularity ordering

Each program carries a popularity score — a small denormalized value recomputed on a schedule, not a live aggregate — used as a secondary sort after the live/replay-first rule. It can be seeded from existing platform view or enrollment counts and refined over time by in-app engagement (clip plays, click-throughs) recorded in the query log, so the catalog surfaces what members actually watch.
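
As a rough sketch of the two pieces described here, the scheduled recompute and the sort order might look like this. The weights and the dict keys (`live_or_replay`, `popularity`) are placeholder assumptions; the real formula would be tuned against usage.

```python
def popularity(seed_views, clip_plays, click_throughs, w_play=2.0, w_click=1.0):
    """Scheduled recompute: a seed from existing platform counts plus weighted engagement."""
    return seed_views + w_play * clip_plays + w_click * click_throughs


def order_results(programs):
    """Live/replay programs first, then higher popularity first."""
    return sorted(programs, key=lambda p: (not p["live_or_replay"], -p["popularity"]))
```

Because the score is stored on the program row, the sort is a plain ordering at query time rather than a live aggregate.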

9.3 Keyword search

Between exact facets and semantic search sits a keyword / full-text tier over program titles and descriptions, run in the same Postgres database (its built-in full-text / trigram search). This handles the everyday “find classes about trucking” term query — a core requirement — quickly and precisely, without falling through to the fuzzier semantic layer.

9.4 Semantic fallback

When neither facets nor keywords resolve a query, it is embedded and matched against program and chunk descriptions; results from all three paths are then fused and de-duplicated — so the common case is fast and exact, and the open-ended case still returns something useful.
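
The section does not say how the three result lists are fused. One common option is reciprocal rank fusion, shown here as an assumed technique rather than the chosen design; the constant `k=60` is the conventional default, not a tuned value.

```python
def fuse(ranked_lists, k=60):
    """Reciprocal rank fusion: merge several ranked id lists into one, de-duplicated."""
    scores = {}
    for results in ranked_lists:
        for rank, program_id in enumerate(results, start=1):
            scores[program_id] = scores.get(program_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))
```

A program that appears in several paths accumulates score from each, so it rises above one that only a single path found.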

Video Delivery & Timecoded Playback V1

The marquee feature depends on dropping the member at the exact minute of a two-hour program and playing just that stretch — without downloading the whole file. This is a solved, standard capability given a small condition on the file format.

Reading the diagram: after the entitlement check passes, the API mints a short-lived signed URL for the program. Our player seeks to the chunk’s start time, which triggers an HTTP Range request; the store returns only those bytes (206 Partial Content) through the CDN; a small script pauses at the end time. Nothing extra is downloaded.

10.1 Playing a range without downloading the whole file

The player asks for only the bytes it needs via an HTTP Range request, and the media store answers with just that slice (206 Partial Content). Playback starts mid-file after transferring a small amount of data. The one condition depends on format:

10.2 Format support

Format | Precise range-play? | Notes
MP4 (H.264/H.265), faststart on | ✅ Yes | The simple recommended case. One-time +faststart re-mux, no re-encode.
MP4, faststart off | ⚠️ Not effectively | Browser downloads most of the file first. Fix = re-mux to faststart.
HLS / DASH (segments) | ✅ Yes | Best practice for 2-hour content; fetches only the covering segments.
WebM (index at front) | ✅ Yes | Seekable if cues are written at the front; uncommon here.
MOV / AVI / WMV / raw masters | ❌ Not for browser delivery | Transcode to MP4 or HLS as an ingestion step.

Placing a CDN in front of the store does not break this — a properly configured CDN forwards Range headers and caches parts, so popular clips serve fast from the edge.
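
For the MP4 rows in the table, "faststart" means the moov index box sits before the mdat media box. The sketch below checks that on raw bytes and builds the ffmpeg re-mux command. The function names are hypothetical, and a real check would read box headers from the file with seeks rather than load a two-hour video into memory.

```python
import struct


def top_level_boxes(data):
    """Yield the four-character type of each top-level MP4 box, in file order."""
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack(">I4s", data[offset:offset + 8])
        if size == 1:  # a 64-bit size follows the type field
            if offset + 16 > len(data):
                break
            size = struct.unpack(">Q", data[offset + 8:offset + 16])[0]
        elif size == 0:  # box runs to the end of the file
            size = len(data) - offset
        yield box_type.decode("ascii", errors="replace")
        if size < 8:
            break  # malformed box; stop rather than loop forever
        offset += size


def is_faststart(data):
    """True when the moov index precedes the mdat media data."""
    order = list(top_level_boxes(data))
    return "moov" in order and "mdat" in order and order.index("moov") < order.index("mdat")


def faststart_remux_cmd(src, dst):
    """ffmpeg arguments for a stream-copy re-mux that moves moov to the front."""
    return ["ffmpeg", "-i", src, "-c", "copy", "-movflags", "+faststart", dst]
```

The command list can be handed to `subprocess.run(cmd, check=True)` in an ingestion job; `-c copy` is what makes this a re-mux rather than a re-encode.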

10.3 Approach, player & access control

10.4 From citation to clip: the concrete lookup

Sections 10.1–10.3 established that Range playback works; this is the exact chain that turns a retrieved chunk into a playing clip. Retrieval hands the API a chunk row, and three joins resolve it to a byte-range fetch:

Reading the diagram: a member clicks a cited clip; the API runs the entitlement check (Section 12), resolves the chunk to its program’s video_asset_id and S3 key, reads AWS credentials from Key Vault, and mints a short-lived signed CloudFront URL. It returns {url, start_ms, end_ms}; the player’s seek fires an HTTP Range request that CloudFront serves from the edge (or fetches from S3 as 206 Partial Content), then stops at end_ms. Purple = secrets/signing, amber = CDN, green = S3, red = the entitlement guard.

10.5 Minting the signed URL across two clouds

The app runs on Azure Container Apps but the video lives on AWS S3 + CloudFront, so the signer holds AWS trust material inside Azure. That material sits in Azure Key Vault, never in code, image, or build-baked env:

10.6 Keeping the CDN fast: caching vs. per-viewer signatures

A signed URL carries a unique Signature query string per viewer and per mint. If CloudFront keyed its cache on the full query string, every viewer would miss the edge cache and re-pull from S3 — defeating the CDN for exactly the popular clips we most want cached. Two standard fixes:

We default to signed cookies scoped to the program path — it keeps clip URLs clean, caches well at the edge, and covers both MP4 and the multi-file HLS case with one authorization.
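
A signed cookie scoped to a program path is driven by a CloudFront custom policy. The sketch below builds only that policy document and the CloudFront-safe base64 encoding; the RSA signing step with the key held in Key Vault is omitted, and the host and path prefix are hypothetical.

```python
import base64
import json
import time


def program_cookie_policy(cdn_host, program_prefix, ttl_seconds, now=None):
    """Custom policy allowing every object under one program path until expiry."""
    now = int(time.time()) if now is None else int(now)
    policy = {
        "Statement": [
            {
                "Resource": f"https://{cdn_host}/{program_prefix}/*",
                "Condition": {"DateLessThan": {"AWS:EpochTime": now + ttl_seconds}},
            }
        ]
    }
    # CloudFront expects the policy JSON without whitespace.
    return json.dumps(policy, separators=(",", ":"))


def cloudfront_b64(raw):
    """Base64 with CloudFront's substitutions for characters unsafe in URLs and cookies."""
    encoded = base64.b64encode(raw).decode("ascii")
    return encoded.replace("+", "-").replace("=", "_").replace("/", "~")
```

The encoded policy becomes the `CloudFront-Policy` cookie, sent alongside `CloudFront-Signature` and `CloudFront-Key-Pair-Id`. The wildcard resource is what lets one authorization cover an MP4 file or every segment of an HLS rendition.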

10.7 MP4 vs. HLS: the signing branch

Because the video_asset_id → key template already encodes which asset is MP4 vs. HLS, the API picks the branch from the program’s format without the player caring — it always receives {url, start_ms, end_ms} and seeks the same way.

10.8 Failure & expiry handling

Grounding, Safety & the Quality Loop V1 · V2 advanced training

Every answer passes layered checks; accuracy is then driven up over time by a programmatic refinement loop. The two work together — the checks stop a bad answer today, the loop makes fewer of them tomorrow.

Reading the diagram: a pre-answer guard declines out-of-scope requests (advice on a member’s actual case); a post-answer check verifies every claim maps to a cited chunk; an optional judge approves or triggers a single stricter rewrite. Red = guard/decline paths.
Reading the diagram: a set of real attorney queries with expected results runs through the pipeline in bulk; results are scored on retrieval hit-rate and grounding; prompts, chunking/tagging, and re-rank weights are tuned and re-run until the error converges, then the winning configuration is promoted.

11.1 What the checks look for

11.2 Why it is built this way

11.3 The evaluation harness — how we measure answer quality V1

The refinement loop in 11.2 is only as trustworthy as the ruler it uses. That ruler is a frozen ground-truth set plus a scorer that runs the whole pipeline offline and produces a single pass/fail gate. It turns “the answer felt better” into a number we can regress against.

Reading the diagram: a curated ground-truth set (green) runs in batch through the exact production pipeline; each answer is scored on four axes; the scores are checked against an acceptance bar (red gate). A config that clears the bar is promoted; one that regresses sends the operator back to the tuning levers. The V2 branch (dashed) grows the set from real member queries and feeds continuous online scoring.
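
The gate can be sketched with the two measures named earlier in this section, retrieval hit-rate and grounding. The four scoring axes themselves are not listed in this text, so the metric names, the result-record shape, and the bar values below are all assumptions.

```python
def score_run(results):
    """Aggregate one offline batch run into per-metric rates."""
    if not results:
        raise ValueError("empty evaluation run")
    n = len(results)
    hits = sum(1 for r in results if r["expected_chunk"] in r["retrieved_chunks"])
    grounded = sum(1 for r in results if r["all_claims_cited"])
    return {"hit_rate": hits / n, "grounding_rate": grounded / n}


def passes_gate(scores, bar):
    """A config is promoted only if every metric meets its acceptance threshold."""
    return all(scores[metric] >= threshold for metric, threshold in bar.items())
```

For example, a run scoring `{"hit_rate": 0.75, "grounding_rate": 1.0}` clears a bar of `{"hit_rate": 0.7, "grounding_rate": 1.0}` but is sent back for tuning if the hit-rate bar is raised to 0.8.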

The ground-truth set

The offline batch run

The metrics

The acceptance bar (the gate)

11.4 Testing that the data is correct — DB rows and clips V1

Answer quality is downstream of data quality: a perfect prompt over a mis-timed clip or a mis-mapped speaker still produces a wrong, confident citation. So the rows in Postgres/pgvector and the clips on storage are tested in their own right, independent of any answer.

11.5 Evaluation at scale V2

11.6 What feeds the harness

Authentication & Membership Gating V1 · V2 add-on tier

The app is embedded in the members’ site but runs on its own origin, so it cannot rely on the site’s session cookie: inside a cross-origin iframe that would be a third-party cookie, which browsers block or restrict. Instead, the site (which does know the logged-in member) mints a short-lived signed token and hands it to the embedded app, which presents it to its own API as a bearer credential. No shared cookies, and the boundary is always enforced on the server.

Reading the diagram: the site reads the member and tier from the membership system, mints a short-lived asymmetrically-signed token, and passes it into the iframe. The app sends it as a bearer header; the API verifies the signature against the site’s public key and checks entitlement before returning a gated answer or minting a signed video URL. Nothing depends on third-party cookies.
Reading the diagram: catalog search is always public; a question from a non-member returns a teaser plus an upgrade prompt; a question from an entitled member returns the full four-panel answer and a signed video URL. Enforced server-side; video released only via a signed, expiring URL.

12.1 The token handshake

12.2 Entitlement & source of truth

Payments and subscriptions are the authority for who is a member; the site’s membership system is kept in sync with them and is the operational source the token is minted from. The token is a short-lived snapshot of that tier, so the API can decide access without calling the billing system on every request — a cancelled member loses access within one token lifetime, which is appropriate for this content. The entitlement check returns a structured tier rather than a yes/no, so a future paid add-on slots in as another tier value without reworking the gate.
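
A structured tier, as opposed to a boolean, might be modeled like this. The tier values, the `tier` claim name, and the capability names are illustrative assumptions, not the agreed token schema.

```python
from enum import Enum


class Tier(Enum):
    PUBLIC = "public"
    MEMBER = "member"
    # A future paid add-on would be one more member here.


CAPABILITIES = {
    Tier.PUBLIC: {"catalog_search", "answer_preview"},
    Tier.MEMBER: {"catalog_search", "answer_preview", "full_answer", "signed_video"},
}


def resolve_tier(claims):
    """Map verified token claims to a tier; missing or unknown values fall back to PUBLIC."""
    if not claims:
        return Tier.PUBLIC
    try:
        return Tier(claims.get("tier"))
    except ValueError:
        return Tier.PUBLIC


def allowed(tier, capability):
    return capability in CAPABILITIES[tier]
```

Adding a tier is then a new enum member and a new row in `CAPABILITIES`, with no change to the code that calls `allowed`.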

12.3 Why it is built this way

12.4 Token format & lifecycle

The token is a JWT signed with RS256 (asymmetric): the site holds the private key and signs, the API verifies with the public key only — so a compromise of the API can’t forge site tokens, and the two systems never share a secret. The verifier pins the expected algorithm and ignores the token’s own alg header, closing the classic “alg = none” and RS256→HS256 confusion attacks.

Reading the diagram: the mu-plugin signs a short-lived JWT with the Key Vault private key and hands it into the iframe by postMessage; the app sends it as a bearer credential and the API verifies it (algorithm pinned, signature checked against the JWKS public key, plus iss/aud/exp) before the entitlement gate runs. Two paths keep the session alive without cookies — a refresh timer re-mints ahead of expiry, and a mid-session 401 triggers a re-mint and a single retry.
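
The non-cryptographic half of that verification can be sketched with the standard library. This covers algorithm pinning and the iss/aud/exp checks only; the signature check against the public key is deliberately left to a maintained JWT library, so this function is not sufficient on its own. The function name and the single-string `aud` are assumptions.

```python
import base64
import json

EXPECTED_ALG = "RS256"  # pinned by the verifier, never taken from the token


def _b64url_json(segment):
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def check_header_and_claims(token, issuer, audience, now):
    """Reject tokens with the wrong alg, iss, aud, or an expired exp.

    Raises ValueError on any failure. Signature verification is NOT done here.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header, claims = _b64url_json(parts[0]), _b64url_json(parts[1])
    if header.get("alg") != EXPECTED_ALG:
        raise ValueError("unexpected alg")  # blocks "none" and HS256 downgrades
    # A real verifier checks the RS256 signature with the site's public key here.
    if claims.get("iss") != issuer or claims.get("aud") != audience:
        raise ValueError("wrong issuer or audience")
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        raise ValueError("token expired")
    return claims
```

An `exp` failure is the case the diagram's 401 path handles: the app asks the site for a fresh token and retries once.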

12.5 The postMessage handshake

12.6 Browser-boundary hardening

12.7 Subscription-status endpoint contract

The membership plugin exposes a server-to-server status endpoint (e.g. GET /wp-json/<plugin>/v1/subscription-status?member=<id>; the exact path depends on the plugin — MemberPress, PMPro, WooCommerce Subscriptions, or a small custom mu-plugin route). It is authenticated with a server-side API key (never called from the browser) and returns a compact projection:

12.8 Error semantics

Open Questions & Access

These are the genuine unknowns that shape the build, plus the access needed to resolve them. They are the fastest path from this design to a firm plan.

17.1 Open questions that affect the design

  1. Video format & timeline fidelity. Are the programs stored as progressive MP4 or as segmented streams, and (for MP4) is faststart set? And — the real precision risk — do the transcript timestamps line up with the actual stored files (watch for trimmed masters or intro offsets)? One sample video answers most of this.
  2. Prior-vendor export. What exactly can be exported (transcripts, chunks, metadata), and by when must the cutover happen?
  3. Speaker rosters. Is the per-program attorney roster structured and reliably complete? It is the backbone of attribution. Are any recordings multichannel (separate mic per speaker), which would make attribution near-exact?
  4. Diarization accuracy bar. What mis-attribution rate is acceptable on real two-hour panels? This decides the speech-service choice.
  5. Membership & plugin. Which WordPress membership plugin sits in front of Stripe (it decides how we read subscription status)? How many tiers exist today, and do the AI features differ across them beyond member vs. non-member?
  6. Region & data residency. Which region should host the models and data (US-based CLE audience), and is the reasoning model available with adequate quota there?
  7. Win condition. The single most important goal for the first release (demo, member usage, or revenue) and any date it is anchored to.
  8. Teaser depth. For a non-member, how much should the answer preview reveal — just that coverage exists, a short summary, or which programs it would cite? This sets where the free/paid line sits (Section 8.2).
  9. Reasoning model preference. Both Claude (via Azure AI Foundry) and GPT (via Azure OpenAI) run on Azure; the design defaults to the latest Claude, but confirm the client’s preference and that the Azure subscription can access it.

17.2 Access to confirm and build

How it works under the hood

Architecture Overview V1

The system separates into two planes. The request plane is fast and member-facing: search, the streamed answer, and video. The processing plane is slow and asynchronous: turning raw video into an indexed, citable corpus, and running the evaluation loop. They share the datastores but never block each other — a two-hour transcription job can never slow a member’s query.

Reading the diagram: the WordPress site embeds the app, which calls the API. The API reads the catalog and vector store, streams via Redis, verifies membership, and releases video through signed URLs. Background workers (the processing plane) share the datastores and the AI services but run off the request path. Green = datastores, amber = cache/queue/external, purple = secrets/auth.

3.1 The components at a glance

Component | Responsibility
API + middleware | The single entry point. Applies CORS, request-id, and the auth/entitlement check in order, then routes to search, ask (streaming), program, and health.
Retrieval / RAG engine | Turns a question into a grounded, cited answer: embed → hybrid retrieve → re-rank → reason → assemble panels (Section 8).
Search engine | Structured faceted browse with a semantic fallback and popularity ordering (Section 9).
Streaming subsystem | Delivers the answer token-by-token over SSE, resumable across dropped connections (Section 8).
Ingestion workers | The processing plane: transcribe, diarize, map speakers, chunk, tag, embed, index (Sections 4–7).
Catalog (Postgres) | Programs, speakers, CLE approvals, tags, popularity, and the query log: the structured source of truth.
Vector index (pgvector) | Transcript-chunk embeddings with their timecodes, co-located with the catalog in the same database.
Cache / stream (Redis) | Resumable SSE streams, idempotency records, and hot-query caching.
Media store | The source video, served to members through a CDN with signed URLs (Section 10).
AI services | A reasoning LLM, an embeddings model, and a speech-to-text (transcription + diarization) service, each behind a swappable interface (Section 14).
Reading the diagram: on boot the service health-checks each dependency in order before accepting traffic; on shutdown it disposes pools and clients cleanly. A failed check fails the boot loudly rather than serving a half-wired app.

3.2 Why it is built this way

Content Ingestion V1 slice · V2 full pipeline

Ingestion is the processing plane’s core job: turn a raw two-hour recording into a set of short, searchable, citable passages, each tied to an exact video moment and the attorney who spoke it. It runs as a sequence of stages, each an independent background job (Section 7), so a failure or a re-run affects only one program and one stage.

Reading the diagram: audio is extracted from the source video, transcribed with speaker diarization, speaker labels are resolved to named attorneys, then the transcript is chunked (carrying timecodes and speaker), tagged, and embedded. Vectors land in the index, metadata in the catalog, and the video stays in place.

4.1 The stages

  1. Extract audio from the source video for speech processing.
  2. Transcribe + diarize — produce the transcript with word-level timestamps and speaker turns in one pass (Section 5).
  3. Resolve speakers — bind each anonymous speaker to a named attorney from the program roster (Section 5).
  4. Chunk — split the transcript into overlapping passages, each carrying its program, speaker, and [start, end] timecodes (Section 6).
  5. Tag / categorize — derive practice area, sub-practice area, and topic from the content (this is the layer above raw retrieval that powers faceted browse).
  6. Embed + index — embed each passage and write vectors to the index and metadata to the catalog.

Program-level metadata (title, dates, live/replay format, and the speaker roster) is imported from the existing content system — the pipeline enriches it (tags, chunks, timecodes); it does not originate it.

The inherited transcripts and chunks from the prior vendor lack speaker attribution and reliable timecodes (Sections 5–6), so the plan is to re-process from audio to regenerate both correctly. Whether the entire ~800-program library is re-processed or only a representative first-cut slice is contingent on the video/timestamp findings (Section 17): if the existing timecodes prove accurate, less re-processing is needed. The prior vendor’s exported metatags are not discarded — they seed the tagging stage, and the model fills gaps and normalizes them rather than re-deriving tags from scratch.

4.2 Re-runnable by design

Tagging and chunking are tuned repeatedly during the quality loop, and programs get edited, so each stage is idempotent and replayable per program. Re-tagging the whole library, or re-chunking a single edited program, is a matter of re-enqueuing jobs — not a migration. CLE state-approval data is compliance-grade and enters the catalog from an authoritative source; it is never inferred by a model.

External Services V1

Four capabilities are provided by external services. Each sits behind a swappable interface and has a recommended default plus viable alternatives, so any one can be re-decided during the quality loop, or moved to a different vendor, without re-architecting.

Reading the diagram: the four external capabilities (reasoning LLM, embeddings, transcription + diarization, video delivery) each sit behind our interface, with a recommended service and alternatives listed underneath.

14.1 Reasoning LLM

Recommended: Claude on Azure AI Foundry, a frontier reasoning model callable on an Azure endpoint with the organization’s existing Microsoft identity, so it stays inside the Azure boundary. Alternatives: Azure OpenAI GPT models (single Microsoft/OpenAI stack); the Anthropic API directly; or Claude via AWS (where the video already lives).

14.2 Embeddings

Recommended: the text-embedding-3-large model via Azure OpenAI, keeping embeddings first-party in the same tenant. Alternatives: the same model via the OpenAI API, or a self-hosted open-source embedding model if cost or data-residency ever demands it.

14.3 Transcription + diarization

Recommended: Azure AI Speech batch transcription with diarization — it returns speaker turns and word-level timestamps in one job, runs in-tenant, and is inexpensive at this scale (the one-time backfill is on the order of a few hundred dollars; ongoing cost is tens of dollars a month). Alternatives: a self-hosted WhisperX + pyannote pipeline (best timestamp precision, cheapest at volume, needs operational ownership); or specialist APIs such as AssemblyAI or Deepgram. The choice is settled by a diarization-accuracy benchmark on a handful of real programs, since attribution accuracy is exactly what the inherited system lacked.

14.4 Video delivery

Recommended: keep the video where it already lives (Amazon S3) and serve it through its native CDN (CloudFront) with signed URLs — no migration, and the cross-cloud boundary is reduced to tiny signed-URL calls rather than video bytes (Section 15). Alternative: migrate the media into Azure storage behind Azure’s CDN for a single-cloud footprint — a later option that the cost math does not force now.

Engineering deep dive

Speaker Diarization & Attribution V1

Panels 3 and 4 — the clip and the attorney’s credentials — are only trustworthy if the system knows which attorney spoke the retrieved passage. The inherited transcripts do not carry that: speech was never matched to speakers. So attribution is a first-class ingestion step, not an afterthought.

Reading the diagram: the audio is transcribed and diarized into anonymous speaker turns with timestamps; those turns are matched to the program’s named roster using the introductions and speaking patterns; high-confidence bindings are accepted automatically, low-confidence ones are queued for a quick human confirm before the chunks inherit a speaker_id.
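
The accept-or-review split can be sketched as below. The binding record shape and the 0.85 cut-off are assumptions for illustration; the real threshold would come out of the diarization benchmark.

```python
def route_bindings(bindings, auto_accept=0.85):
    """Split proposed label-to-attorney bindings into accepted and needs-review."""
    accepted, review = {}, []
    for binding in bindings:
        if binding["confidence"] >= auto_accept:
            accepted[binding["label"]] = binding["attorney_id"]
        else:
            review.append(binding)  # held for a human confirm before chunks inherit it
    return accepted, review
```

Only labels in `accepted` are written onto chunks as `speaker_id`; the rest wait in the review queue.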

5.1 Why re-processing from audio is required

Diarization is an acoustic task — it distinguishes voices from the waveform — so it cannot be layered onto a plain text transcript that has no audio signal. Re-processing from the source audio is therefore necessary, and it pays a second dividend: it produces fresh, reliable word-level timestamps, which the video-clip feature depends on and the inherited transcripts may not carry accurately.

5.2 From anonymous speakers to named attorneys

Diarization returns anonymous labels (“Speaker A / B / C”), stable within a recording but nameless. Binding a label to a real attorney is a separate step, solved without biometric voice enrollment (which does not scale to a rotating cast of hundreds of guest attorneys):

5.3 At scale

The one-time backfill is ~1,600 hours; the steady state is ~50 hours/week. There is no real-time requirement, so transcription runs as batch/async jobs (Section 7) at a controlled concurrency that respects the speech service’s throughput and cost. If diarization accuracy on long multi-attorney panels proves marginal, the speech service is swappable (Section 14) and a benchmark on a handful of representative programs settles the choice.

Chunking Strategy V1

A two-hour transcript is far too large to retrieve or cite as one unit. Chunking splits it into passages that are (a) semantically self-contained, (b) small enough to embed and re-rank precisely, and (c) tied to an exact video range and a single speaker. Chunk quality is the single biggest lever on answer quality: chunks that are too big blur retrieval; chunks that are too small lose context.

Reading the diagram: the diarized transcript is scanned for boundaries — speaker changes and topic shifts — then grouped into passages within a target token window with a small overlap so context is never cut mid-thought. Each passage carries its program, speaker, and start/end times before it is embedded and indexed.

6.1 How passages are formed

6.2 New and edited content

New programs run the same path automatically as they arrive. An edited program is simply re-chunked in place: because the operation is idempotent and keyed per program, re-running it replaces that program’s chunks and vectors without touching the rest of the corpus. The inherited chunks are not reused — they predate diarization and dependable timecodes — so chunks are regenerated from audio (across the full library or a first-cut slice, per Section 17).
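A minimal sketch of the idempotent, per-program replace, assuming a `chunks` table keyed by program and sequence number. The table layout and function name are hypothetical, and sqlite3 stands in for Postgres so the example is self-contained.

```python
import sqlite3


def replace_program_chunks(conn, program_id, chunks):
    """Atomically replace all chunks for one program; other programs are untouched."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM chunks WHERE program_id = ?", (program_id,))
        conn.executemany(
            "INSERT INTO chunks (program_id, seq, start_ms, end_ms, text) "
            "VALUES (?, ?, ?, ?, ?)",
            [
                (program_id, i, c["start_ms"], c["end_ms"], c["text"])
                for i, c in enumerate(chunks)
            ],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (program_id TEXT, seq INTEGER, start_ms INTEGER, "
    "end_ms INTEGER, text TEXT, PRIMARY KEY (program_id, seq))"
)

chunks_v1 = [
    {"start_ms": 0, "end_ms": 4000, "text": "first passage"},
    {"start_ms": 4000, "end_ms": 9000, "text": "second passage"},
]
replace_program_chunks(conn, "p1", chunks_v1)
replace_program_chunks(conn, "p2", chunks_v1[:1])
replace_program_chunks(conn, "p1", chunks_v1)  # re-run: same rows, no duplicates
```

Because the delete and insert share one transaction, a crash mid-run leaves the previous chunks in place rather than a half-written program.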

6.3 Tuned, not fixed

Target size, overlap, and boundary rules are parameters, not hard-coded constants. The quality loop (Section 11) evaluates retrieval on a fixed question set and adjusts them; because re-chunking is a background job, sweeping a new setting across the corpus is routine.
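One way such a sweep could look, as a sketch under stated assumptions: the `ChunkParams` fields, the choice of recall@k as the metric, and the injected `evaluate` callable (which would re-chunk, re-index, and score the fixed question set) are illustrative, not the project's actual evaluation harness.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ChunkParams:
    target_tokens: int
    overlap_tokens: int


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunk ids that appear in the top-k retrieved ids."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def sweep(targets, overlaps, evaluate):
    """Score every valid (target, overlap) pair; return (score, params) best-first."""
    results = [
        (evaluate(ChunkParams(t, o)), ChunkParams(t, o))
        for t, o in product(targets, overlaps)
        if o < t  # an overlap as large as the window is never useful
    ]
    return sorted(results, key=lambda r: r[0], reverse=True)
```

In use, `evaluate` would average `recall_at_k` over the fixed question set for one setting, and the top entry of `sweep(...)` becomes the candidate configuration.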

Background Processing & Jobs V1 · V2 event-triggered

Transcription, diarization, chunking, embedding, and evaluation are slow, bursty, rate-limited, and occasionally failing operations. Running any of them inline would make a member wait on a two-hour transcription. So the whole processing plane is a durable job system: work is enqueued on an Azure Storage Queue, a pool of our own worker containers drains it, and progress and failures are tracked per program. Each program is one job that runs its stages in sequence (extract → transcribe/diarize → resolve speakers → chunk → embed/index), with the current stage recorded in ingest_jobs so a re-run resumes cleanly. The queue is only the broker; all orchestration logic lives in our own codebase, not in serverless functions.

Reading the diagram: three sources feed one queue — the one-time backfill, newly arriving programs, and a scheduler that enforces the freshness target. A horizontally-scaling worker pool runs each stage; every job is idempotent and its status is tracked, so retries are safe and failures divert to a review lane instead of blocking the line.

7.1 Handling volume

7.2 Reliability

7.3 Meeting the 48-hour freshness target

A scheduled trigger (running on a short interval) picks up new or edited programs and enqueues them — comfortably inside the 48-hour window with no event plumbing to build. Wiring ingestion directly to the content-production workflow so new programs trigger instantly is a natural later enhancement, but is not needed to hit the target.
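The scheduled pickup reduces to a simple comparison per program. The sketch below assumes each catalog row exposes `modified_at` and `indexed_at` timestamps; both field names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_TARGET = timedelta(hours=48)


def is_pending(program):
    """A program needs (re)ingestion if never indexed or edited since its last index."""
    return program["indexed_at"] is None or program["modified_at"] > program["indexed_at"]


def programs_to_enqueue(programs):
    """Ids the scheduler should enqueue on this tick."""
    return [p["id"] for p in programs if is_pending(p)]


def breaches_target(program, now):
    """True if a pending change has already waited longer than the freshness target."""
    return is_pending(program) and now - program["modified_at"] > FRESHNESS_TARGET


now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
programs = [
    {"id": "p1", "modified_at": now - timedelta(hours=1), "indexed_at": None},
    {"id": "p2", "modified_at": now - timedelta(days=30), "indexed_at": now - timedelta(days=29)},
    {"id": "p3", "modified_at": now - timedelta(hours=60), "indexed_at": now - timedelta(days=90)},
]
due = programs_to_enqueue(programs)
```

Here `due` contains the new program and the edited one, and `breaches_target` is the kind of check an alert could run to flag anything that has slipped past 48 hours.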

7.4 The operator console

Several steps in the pipeline are deliberately human-in-the-loop, so the platform needs a small internal admin view for the team that runs content — not a member-facing screen. It sits over the same catalog and job tables and lets an operator see and act on what is happening behind the scenes, so transcripts, speakers, and clips can be verified before members ever see them.

Reading the diagram: the operator console reads ingestion job status and the catalog, and lets a person confirm low-confidence speaker matches, review and retry failed jobs, spot-check a program’s transcript and clips against the video, and re-run tagging or chunking for a program.

7.5 The stage state machine

Each program is one row in ingest_jobs that walks a fixed sequence of states. A worker leases the job for exactly one stage, does that stage’s work, writes the result, advances the state, and re-enqueues the job for the next stage — so a long pipeline never occupies a single worker for two hours, and any stage can be retried or resumed independently.

Reading the diagram: a job moves queued → extracting → transcribing → mapping-speakers → chunking → embedding → indexed, one stage per lease. Any stage can throw; a failure increments a retry counter and re-enqueues with backoff, and once retries are exhausted the job lands in dead-letter for a human. indexed and dead-letter are the terminal states.
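The transitions above can be sketched as two pure functions over a job row. The state names come from the sequence described here; the retry limit, the backoff base, the dict-shaped job, and resetting the counter on each new stage are assumptions for illustration.

```python
STAGES = [
    "queued", "extracting", "transcribing", "mapping-speakers",
    "chunking", "embedding", "indexed",
]
TERMINAL = {"indexed", "dead-letter"}
MAX_RETRIES = 3       # assumed limit
BASE_BACKOFF_S = 30   # assumed base delay


def advance(job):
    """After a stage succeeds, move to the next state and reset the retry counter."""
    if job["state"] in TERMINAL:
        return job
    next_state = STAGES[STAGES.index(job["state"]) + 1]
    return {**job, "state": next_state, "retries": 0, "delay_s": 0}


def fail(job):
    """After a stage throws: re-enqueue with exponential backoff, or dead-letter."""
    if job["state"] in TERMINAL:
        return job
    retries = job["retries"] + 1
    if retries > MAX_RETRIES:
        return {**job, "state": "dead-letter", "retries": retries, "delay_s": None}
    return {**job, "retries": retries, "delay_s": BASE_BACKOFF_S * 2 ** (retries - 1)}


job = {"program_id": "p1", "state": "queued", "retries": 0, "delay_s": 0}
job = advance(job)  # now "extracting"
```

Keeping the transitions free of side effects makes them easy to unit-test separately from the queue and the database write that persists the new row.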

7.6 ASR batch mechanics

The transcribe stage is the pipeline’s long pole and the only one that hands a long-running job to an external service, so it is modelled as submit-then-poll rather than a blocking call.
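A minimal sketch of submit-then-poll as a single-lease stage handler. The client interface (`submit`, `status`, `result`) and the status strings are hypothetical stand-ins, not the Azure AI Speech SDK; the fake client exists only to make the example runnable.

```python
def run_transcribe_stage(job, client):
    """One lease of the transcribe stage: submit if new, otherwise poll once."""
    if job.get("asr_job_id") is None:
        # First lease: hand the audio to the provider and remember its job id.
        return {**job, "asr_job_id": client.submit(job["audio_url"])}, "requeue"
    status = client.status(job["asr_job_id"])
    if status == "running":
        return job, "requeue"   # check again on a later lease
    if status == "succeeded":
        return {**job, "transcript": client.result(job["asr_job_id"])}, "advance"
    return job, "fail"          # any other status is treated as a failure


class FakeSpeechClient:
    """In-memory stand-in that reports 'running' for a fixed number of polls."""

    def __init__(self, polls_until_done=2):
        self.polls_left = polls_until_done

    def submit(self, audio_url):
        return "asr-1"

    def status(self, job_id):
        if self.polls_left > 0:
            self.polls_left -= 1
            return "running"
        return "succeeded"

    def result(self, job_id):
        return "transcript text"


job = {"program_id": "p1", "audio_url": "blob://audio/p1.wav", "asr_job_id": None}
client = FakeSpeechClient()
actions = []
while True:
    job, action = run_transcribe_stage(job, client)
    actions.append(action)
    if action != "requeue":
        break
```

Each call returns quickly, so a worker is never held for the duration of a transcription; the stored `asr_job_id` is what lets a re-run resume polling instead of resubmitting.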

7.7 Chunking execution

Section 6 sets the strategy; the mechanics are deliberately thin. Token counts use the embedding model’s own tokenizer (the cl100k-family encoding behind text-embedding-3-large) so a chunk’s measured size matches what the embedding call bills and truncates at. Boundary detection walks the diarized word stream: a hard break at every speaker change (so a chunk never straddles two attorneys), and a preferred break at topic/pause cues within a speaker’s turn. Passages are packed to a target window (a few hundred tokens) with a small token overlap carried from the prior chunk, and each emitted chunk is stamped with program_id, speaker_id, and millisecond start/end from its first and last words. Target size, overlap, and boundary rules are the parameters the quality loop sweeps (Section 6.3).
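The packing rule can be sketched as follows. This is a simplified illustration under stated assumptions: `Word` and `pack_chunks` are hypothetical names, token counting is injected (the default of one token per word is a placeholder for the embedding model's tokenizer), topic and pause cues are omitted, and overlap is carried only within a single speaker's turn so that no chunk mixes attorneys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    speaker_id: str
    start_ms: int
    end_ms: int
    text: str


def pack_chunks(words, program_id, target_tokens=300, overlap_tokens=30,
                count_tokens=lambda w: 1):
    """Pack a diarized word stream into single-speaker, timecoded chunks."""
    if overlap_tokens >= target_tokens:
        raise ValueError("overlap must be smaller than the target window")
    chunks, current, size = [], [], 0

    def emit():
        chunks.append({
            "program_id": program_id,
            "speaker_id": current[0].speaker_id,
            "start_ms": current[0].start_ms,   # from the first word
            "end_ms": current[-1].end_ms,      # from the last word
            "text": " ".join(w.text for w in current),
        })

    for w in words:
        if current and w.speaker_id != current[0].speaker_id:
            emit()                     # hard break at every speaker change
            current, size = [], 0
        elif size >= target_tokens:
            emit()                     # window full: start a new chunk
            tail, tail_size = [], 0
            for prev in reversed(current):   # carry a small overlap forward
                if tail_size + count_tokens(prev) > overlap_tokens:
                    break
                tail.insert(0, prev)
                tail_size += count_tokens(prev)
            current, size = tail, tail_size
        current.append(w)
        size += count_tokens(w)
    if current:
        emit()
    return chunks
```

With a real tokenizer, `count_tokens` would return the token count of `w.text`, so the measured window matches what the embedding call sees.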

7.8 Embedding at volume

7.9 Backfill throughput & cost

The ~800-program / ~1,600-hr backfill is enqueued once and drained at a controlled worker concurrency chosen to sit under the Azure AI Speech and Azure OpenAI limits rather than to minimize wall-clock time: the constraint is provider quota and predictable spend, not our compute. Because transcribe is submit-then-poll (7.6), a handful of workers keep dozens of transcription jobs in flight at the provider, so throughput is gated by the Speech service’s batch capacity, not worker count. At Azure batch rates (~$0.36/hr audio) the transcription bill for the full backfill is roughly $580 (1,600 hr × $0.36), one-time; embedding ~1–1.5M chunks is a comparably modest one-time figure. The backfill realistically drains over days, not weeks, at a deliberately capped rate. After cutover the steady ~25 programs/week (~50 hr/week) is a trickle the same pool absorbs for under $20 a week in transcription at the same rate, comfortably inside the 48-hr freshness target.
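The transcription arithmetic is easy to reproduce. The sketch below uses only the figures quoted above (the ~$0.36 per audio hour batch rate, ~1,600 backfill hours, ~50 hours per week) and covers transcription only; embedding and compute are not modelled.

```python
RATE_CENTS_PER_AUDIO_HOUR = 36  # the ~$0.36/hr batch rate quoted above
BACKFILL_HOURS = 1600
WEEKLY_HOURS = 50


def transcription_cost_usd(hours, rate_cents=RATE_CENTS_PER_AUDIO_HOUR):
    """Transcription spend in dollars; cents are kept as integers to avoid float drift."""
    return hours * rate_cents / 100


backfill_usd = transcription_cost_usd(BACKFILL_HOURS)  # 576.0
weekly_usd = transcription_cost_usd(WEEKLY_HOURS)      # 18.0
```

If the provider's rate changes, only `RATE_CENTS_PER_AUDIO_HOUR` needs updating to re-derive both figures.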

7.10 Ragie migration ETL

Before the prior vendor (Ragie) goes offline, its transcripts, chunks, and metatags are exported and loaded into our own store as a one-time seed — so nothing is lost at cutover even though the pipeline regenerates most of it. The exported metatags seed the tagging stage (Section 4.1): the model normalizes and fills gaps rather than re-deriving tags from scratch. The exported transcripts/chunks load as a fallback/QA reference, not as retrieval input — they predate diarization and dependable timecodes (Sections 5–6), so the authoritative chunks and vectors come from re-processing audio. In state-machine terms the ETL is just another work source feeding ingest_jobs: it lands program rows and seed metadata, and the normal extract → … → indexed sequence then supersedes the seed per program.

Data Model V1

The catalog is small and clean: programs and their chunks, speakers with their diarized-label mapping, CLE approvals, tags, popularity, the ingestion-job tracker, and the query/citation log that also feeds evaluation.

Reading the diagram: a program is segmented into timecoded chunks (each carrying an embedding and a diarized speaker), features speakers, is approved in states, is tagged, and carries a popularity score. A speaker-map records how each diarized label was bound to a named attorney; an ingest-jobs row tracks each program through the pipeline; queries log their citations back to programs and chunks.
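For orientation, the core entities can be sketched as plain record types. Every field name here is an assumption chosen for illustration, and CLE approvals and tags are left out for brevity; the actual schema lives in Postgres.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Program:
    program_id: str
    title: str
    popularity: float = 0.0


@dataclass
class SpeakerMap:
    program_id: str
    diarized_label: str          # e.g. "Speaker A"
    speaker_id: str              # the named attorney it was bound to
    confidence: float
    confirmed_by_human: bool = False


@dataclass
class Chunk:
    chunk_id: str
    program_id: str
    speaker_id: str
    start_ms: int
    end_ms: int
    text: str
    embedding: Optional[List[float]] = None


@dataclass
class IngestJob:
    program_id: str
    state: str = "queued"
    retries: int = 0


@dataclass
class Citation:
    query_id: str
    program_id: str
    chunk_id: str


def clip_range(citation, chunks_by_id):
    """Resolve a logged citation to the (program, start_ms, end_ms) clip it points at."""
    chunk = chunks_by_id[citation.chunk_id]
    return (chunk.program_id, chunk.start_ms, chunk.end_ms)
```

The `clip_range` helper shows why citations log the chunk id: the chunk carries everything needed to play the exact moment and name the speaker.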

Deployment & Infrastructure on Azure V1 single env · V2 scale & HA

The platform runs on Microsoft Azure — the organization standardizes on Microsoft 365, so Azure keeps identity, billing, and operations in one plane. The app is a streaming container service in front of managed Postgres, managed Redis, and object storage, with secrets in a managed vault, private networking, and CI/CD from Git. Security is designed in: no public database, managed identity instead of stored credentials, and a hard scaling ceiling as the cost and blast-radius guardrail.

15.1 The building blocks

| Role | Azure service | Why |
| --- | --- | --- |
| Containers (API + workers) | Azure Container Apps | Streaming-capable, scales to the modest load easily, cleanly separates the API from background workers. The streaming endpoint runs with a raised request/idle timeout (Premium ingress) so long answers aren’t cut (see Section 8). |
| Catalog + vectors | Azure Database for PostgreSQL Flexible Server | pgvector supported (HNSW/IVFFlat, plus an Azure-only high-recall index); one database for facets and vectors. |
| Cache / SSE stream | Azure Managed Redis | Resumable streaming, idempotency, hot-query cache. |
| Job queue (broker) | Azure Storage Queue | Feeds the ingestion worker containers; the orchestration logic stays in our code (Section 7). |
| Working storage | Azure Blob Storage | Extracted audio and (later) PDFs, inside the tenant. |
| Secrets | Azure Key Vault | Connection strings, signing keys, API keys; referenced at start, never in the image. |
| Container images | Azure Container Registry | Immutable, commit-tagged images the runtime pulls via managed identity. |
| Edge / CDN / WAF | Azure Front Door | The single public entry point, with a web application firewall and anonymous rate-limiting. |
| Monitoring | Azure Monitor + Application Insights | Traces, SSE latency, dashboards, alerts, and budget guardrails. |
| Identity | Microsoft Entra ID + managed identities | One identity plane across the app, datastores, CI/CD, and the reasoning model. |
| Video (kept in place) | Amazon S3 + CloudFront | Video already lives here; served via signed URLs (Section 15.4). |
Reading the diagram: CI pushes an image to the registry, which deploys an immutable revision of the API on Container Apps. The API reaches managed datastores over private networking, verifies membership against the site, mints signed URLs for video, and calls the AI services with a managed identity. Secrets come from Key Vault; logs and metrics flow to monitoring with budget alerts.

15.2 Secure networking

The security posture is part of the architecture, shown below.

Reading the diagram: public traffic enters only through the edge (with the firewall) into the container environment in a private network. The database, cache, storage, and vault have no public endpoint — they are reached over private endpoints inside the network. The app authenticates to every resource with a managed identity; no keys are stored in the app. Purple = identity/secrets, red = the guarded public edge.

15.3 CI/CD — push, build, deploy

Every change flows through an automated pipeline: a push to a branch runs tests, builds and scans a Docker image, pushes it to the registry, and deploys an immutable revision — staging automatically, production behind a reviewed merge.

Reading the diagram: a push to the main branch runs tests, builds and scans the image, pushes it to the registry, and auto-deploys a revision to staging; production is reached by a reviewed pull request into the production branch. Same workflow shape, different config — no drift, full audit trail.

15.4 Video stays where it is

The video already lives in Amazon S3. At this scale it is not migrated: S3 stays the origin behind its native CDN (so there is no cross-cloud charge on the video bytes), and the Azure app makes only small signed-URL calls across to it. This isolates the cross-cloud boundary to tiny API calls rather than gigabytes of video. A full migration into Azure storage is a later option if a single-cloud footprint becomes worth the one-time move; the cost math does not force it now.

15.5 How the pieces connect — service-to-service auth

Everything inside Azure authenticates with Microsoft Entra ID, not stored credentials. The API app and the worker app each run under a user-assigned managed identity (id-lawcle-api, id-lawcle-worker) — user-assigned so the identity and its role grants survive revision churn and can be pre-provisioned before the first deploy. Each identity is granted only the RBAC roles it needs, scoped to a single resource; there is no shared admin key in the running system.

Reading the diagram: dotted edges are the identity/trust plane — the managed identity presenting a short-lived Entra token to each resource (no stored keys); solid edges are the network data path. Public traffic can enter only through the WAF (red); the datastores (green) sit behind private endpoints with no public address; all outbound calls to the model endpoints and AWS CloudFront funnel through one NAT gateway (amber). The single stored credential in the system is the set of AWS keys in Key Vault (purple), used to sign video URLs.

15.6 Network paths — private by default, one public door

15.7 Secrets inventory & rotation

Everything sensitive is in Key Vault and referenced at container start — nothing sensitive is baked into the image or the repository. Because most Azure resources use Entra tokens, the vault holds far fewer secrets than a password-based design would.

15.8 External quotas & resilience

The reasoning model, embeddings, and transcription are metered external dependencies, so each is treated as a fault domain with its own limits, timeouts, and isolation — one slow provider must never cascade into the interactive answer path.
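One common way to keep a slow provider from cascading is a circuit breaker per dependency. The sketch below is a generic illustration of that pattern, not the platform's chosen mechanism; the class name, thresholds, and injected clock are assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when calls are short-circuited because the dependency is unhealthy."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency temporarily disabled")
            # Reset window elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result


# One breaker per fault domain, so a failing provider only trips its own circuit.
breakers = {name: CircuitBreaker() for name in ("reasoning", "embeddings", "transcription")}
```

Each outbound call would still carry its own timeout; the breaker only decides whether the call is attempted at all.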

15.9 Internal API surface

The app exposes a small, versioned HTTP surface, consumed by the members’ website iframe and reached only through Front Door. Endpoints are namespaced under /v1; additive changes stay in /v1, breaking changes would open /v2 alongside it.

Scaling & Operations V1 · V2

The same architecture carries the system from launch to growth by configuration, not redesign. Given the modest expected load, launch needs very little; the headroom is there when partner associations or higher volume arrive.

Reading the diagram: scaling is a settings change — raise the container replica ceiling, then add a database read replica and a cache tier only when measurements call for it. Nothing about the shape of the system changes.

16.1 Operating guardrails

Beyond V1 & V2 - later phases

Called out here for completeness. None of these are in the current build. Each is noted in the design where a decision downstream might be affected by whether we eventually pull one of these in.