An AI-Native Operations Architecture for Knowledge-Intensive Firms

We built this to understand whether an entire knowledge-intensive operation could be redesigned around AI — not as a feature, but as the architecture underneath. The system runs. The thesis that came out of it runs deeper.

INITIATIVE

R&D Initiative · Domain Partnership

DOMAIN

Knowledge-Intensive Operations · Management Consulting

YEAR

2026

Key Insights

THE QUESTION

Knowledge-intensive firms run on human judgment at every operational layer — interpreting client communications, tracking scope, producing deliverables, retaining institutional knowledge across engagements. Each layer depends on experienced people carrying context. Each layer is a bottleneck with a cost that doesn't appear cleanly on a P&L. The question wasn't whether AI could help. It was whether you could redesign the operation around it — not as a feature bolted on top, but as the architecture underneath.

THE ANSWER

A working product POC covering the full consulting lifecycle — autonomous email classification, real-time scope detection, AI-driven deliverable generation, persistent organisational memory, and agent-based task execution. The system keeps humans at the decision layer while AI handles interpretation and production. The architecture is the product.

How This Started

Next Halo operates as a vertical AI studio. The model is deliberate: identify a domain where operational complexity is high, find the practitioner who has lived that problem longest, and co-own the problem definition before writing a line of code. Not as a client relationship. As a partnership — shared conviction that the problem is worth solving, shared stake in whether the solution is real.

For this initiative, that partner was a strategy consultant with 15+ years inside tier-one and boutique firms — partner-level, with direct exposure to the operational failures this system was built to address. They didn't hand us a brief. They challenged every assumption about how consulting operations actually break, which layers AI could own reliably, and where human judgment still couldn't be replaced. That collaboration shaped the architecture at the problem level, before any technical decisions were made. It is ongoing.

The Approach

From Business Problem to System Architecture

The starting point was not technology — it was understanding how a knowledge-intensive operation actually runs. A consulting firm was the test case: complex enough to stress every assumption, structured enough to map completely. The domain knowledge that made this possible didn't come from observation — it came from a practitioner who had run these engagements from the inside, for years, at senior level. Every engagement follows a recognisable pattern: an RFP arrives, a brief is drafted, deliverables are scoped, and then the real work begins — managing client communications, detecting scope changes before they become budget overruns, producing deliverables that meet the original intent, and carrying knowledge across engagements.

Each of these functions traditionally requires experienced human judgment at every step. The architectural question: which of these layers can AI handle reliably, and where must humans remain?

The answer shaped a two-tier system:

  • Desktop client (Electron) — the consultant's personal AI workspace. Email ingestion, local intelligence, agent execution, document generation, and private project memory
  • Nexus API (FastAPI + PostgreSQL) — the company's system of record. We call it the Nexus because all persistent organisational data converges here — shared project state, team-wide semantic search (Qdrant), real-time event distribution (Redis PUB/SUB + WebSocket), and persistent organisational knowledge

Key architectural decisions:

  • AI positioned at the interpretation and production layers, not the reporting layer
  • Event-driven service orchestration with single-responsibility AI services
  • Local-first vector search (sqlite-vec in Electron) with cloud sync
  • Dynamic skill generation — agents that build their own instructions from project context
  • Dual-layer memory architecture eliminating cold-start degradation
  • Conservative AI design — precision over recall, human checkpoints at decision points
Deployment Topology — Desktop Apps + Nexus: Knowledge-intensive organisation with project team roles (Partner, Manager, Consultant), each with a Desktop App containing AI Runtime, Email, Agents, Docs, and Private Memory. All sync to a central Nexus system of record.

Deployment Topology: Each team member runs a local desktop app with AI runtime, syncing to a central Nexus system of record

System Architecture — Desktop Client (Electron) with Email Service, Agent Service, Orchestration, Memory Service, Document Agent, Deliverable Agent, Local Storage, and Workspace. Connected via sync to Nexus (FastAPI) with REST API, WebSocket, Auth, Scope Changes, PostgreSQL, Qdrant, and Redis.

System Architecture: Electron desktop client with local services connected via sync to the FastAPI-based Nexus server

What We Built

The platform's capability is delivered through five core systems, each addressing a critical workflow layer.

1. Autonomous Email Intelligence Pipeline

The system reads client emails with the same context an engagement manager would carry. Each classification runs against the project's full brief, RFP documents, deliverable structure, and communication history — not keyword matching on subject lines.

The classifier uses tool-based LLM classification. The model receives structured tools (match_and_push, suggest_new_project, skip_email) and is forced to choose one, returning structured JSON with project matching and explicit reasoning.
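The forced tool choice can be sketched as a discriminated union. The three tool names come from the text; the payload fields and dispatch logic are assumptions for illustration, not the production contract:

```typescript
// Tool names from the text; fields and dispatch are illustrative assumptions.
type ToolCall =
  | { tool: "match_and_push"; projectId: string; reasoning: string }
  | { tool: "suggest_new_project"; name: string; reasoning: string }
  | { tool: "skip_email"; reasoning: string };

// The model must return exactly one tool call; downstream code can then
// switch exhaustively, with the compiler guaranteeing no case is missed.
function handleClassification(call: ToolCall): string {
  switch (call.tool) {
    case "match_and_push":
      return `routed to ${call.projectId}`;
    case "suggest_new_project":
      return `proposed project ${call.name}`;
    case "skip_email":
      return "skipped";
  }
}
```

Forcing a tool choice turns free-form model output into a closed set of structured outcomes, which is what makes the downstream pipeline testable.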

Scope detection operates on intent, not keywords. When a client email says “could we also include a competitor analysis in the final deck,” the system reasons about whether that deliverable exists in the current scope, whether it's implied by adjacent work, or whether it represents genuine scope expansion. Output is structured: in_scope, scope_creep, or uncertain — with a summary, affected tasks, and reasoning.

The design is conservative by default. False negatives are recoverable. False positives destroy trust. The system tunes for precision over recall — and the difference in user adoption is immediate.
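A minimal sketch of what "precision over recall" can look like at the code level. The field names mirror the structured output described above; the confidence field, threshold, and function are hypothetical:

```typescript
// Status values from the text; confidence field and threshold are assumptions.
type ScopeStatus = "in_scope" | "scope_creep" | "uncertain";

interface ScopeAnalysis {
  status: ScopeStatus;
  confidence: number; // model-reported confidence, 0..1 (hypothetical)
  summary: string;
  affectedTasks: string[];
  reasoning: string;
}

// Precision over recall: only surface a scope-creep alert when confidence
// clears the bar; otherwise downgrade to "uncertain" for human review.
function finalizeScopeStatus(raw: ScopeAnalysis, threshold = 0.8): ScopeStatus {
  if (raw.status === "scope_creep" && raw.confidence < threshold) {
    return "uncertain"; // a recoverable miss beats a trust-destroying false alarm
  }
  return raw.status;
}
```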

The POC uses email and local files as its primary signal sources, but the event-driven architecture generalises to any communication channel — Slack, WhatsApp, Microsoft Teams, and similar. The classification and orchestration layers are channel-agnostic by design; adding a new input source means writing an adapter, not rearchitecting the pipeline.

2. Event-Driven Service Orchestration

The processing pipeline is fully autonomous from ingestion to Manager notification. An OrchestrationService coordinates specialised services in sequence: Email arrives → Classification → RFP Detector → Delta Extractor → Brief Updater → Manager Notification.

Each service has a single responsibility — microservices thinking applied to AI. The RFPDetector determines if a communication contains an RFP. The DeltaExtractor compares it against the existing brief and scope. The BriefService generates or appends updated content. The DeliverableAgentService handles section-level drafting and assembly.

Clear boundaries make each service independently testable, debuggable, and improvable. When classification accuracy dips, the problem is isolated to one service. When delta extraction needs refinement, one service changes — nothing else is touched.

The human checkpoint sits at the decision point — approve or reject a proposed scope change — not at every detection step. The system detects and prepares; the Manager decides and approves. Real-time coordination runs on Redis PUB/SUB with WebSocket connections for live updates on per-project and per-user channels.
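The sequencing pattern can be sketched as a chain of single-responsibility steps. The service names come from the text; the interfaces, toy detection logic, and wiring below are assumptions:

```typescript
// Service names from the text; interfaces and toy logic are illustrative.
interface Email { subject: string; body: string; }
interface PipelineContext { email: Email; isRfp?: boolean; delta?: string; briefUpdated?: boolean; }

type Step = (ctx: PipelineContext) => Promise<PipelineContext>;

// Each step owns exactly one decision and returns an enriched context.
const rfpDetector: Step = async (ctx) =>
  ({ ...ctx, isRfp: /request for proposal|rfp/i.test(ctx.email.subject) });
const deltaExtractor: Step = async (ctx) =>
  ({ ...ctx, delta: ctx.isRfp ? "new RFP content" : "" });
const briefUpdater: Step = async (ctx) =>
  ({ ...ctx, briefUpdated: Boolean(ctx.delta) });

// The orchestrator only sequences; it contains no domain logic itself,
// so swapping or refining one service never touches the others.
async function runPipeline(email: Email, steps: Step[]): Promise<PipelineContext> {
  let ctx: PipelineContext = { email };
  for (const step of steps) ctx = await step(ctx);
  return ctx;
}
```

The isolation property described above falls out of this shape: each step is a pure async function that can be unit-tested against a fixed context.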

Email Intelligence and Scope Decision Flow: Incoming Email flows through EmailService, RFPDetector, and DeltaExtractor. Scope Status decision branches to in_scope (log and continue), scope_creep (scope change detected), or uncertain (flag for review). Scope changes route to Manager Decision for approve/reject, then to BriefService and DeliverableAgent.

Email Intelligence Pipeline: Incoming emails flow through RFP detection and scope analysis before routing to manager decisions

3. Dynamic Skill Agent System

Most AI integrations ship with static prompt templates. When an unfamiliar task arrives, the system has no answer. This architecture takes a different approach: every task execution dynamically generates its own skill — a custom instruction set built from the project's actual context.

The system uses the PI Coding Agent SDK (pi-mono) — an autonomous agent runtime with a full tool loop — in a two-phase pipeline:

Phase 1 — Exploration and Skill Generation

The agent enters read-only mode with file system tools (Read, Grep, Glob). It reads the project brief, RFP documents, communications, existing deliverables, and progress notes from prior sessions. From this exploration, it generates a structured execution plan — effectively a custom skill tailored to this specific task in this specific project context. The user reviews which files the agent read, can reject irrelevant context, and approves or revises the plan before execution begins.

Phase 2 — Autonomous Execution

The same agent session unlocks write tools (Write, Edit, Bash) and executes the plan autonomously — creating documents, modifying files, running commands. The agent retains its full exploration context, eliminating information loss between planning and execution. Timeouts prevent runaway execution. Context overflow triggers automatic retries.

A task labelled “prepare financial proposal” gets a completely different generated skill than “competitive landscape analysis” — different source files, different output structure, different execution steps — without either being hardcoded.
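The two-phase tool gating can be sketched as follows. The PI SDK's actual API differs; this only illustrates the pattern of read-only exploration followed by a write unlock within a single session, so exploration context survives into execution:

```typescript
// Tool names from the text; the session class itself is an assumption.
const READ_TOOLS = ["Read", "Grep", "Glob"] as const;
const WRITE_TOOLS = ["Write", "Edit", "Bash"] as const;

class AgentSession {
  private allowed = new Set<string>(READ_TOOLS); // Phase 1: read-only
  readonly context: string[] = []; // exploration findings, retained across phases

  explore(finding: string) {
    this.context.push(finding);
  }

  // Phase 2: the *same* session unlocks write tools, so nothing the agent
  // learned during exploration is lost at the planning/execution boundary.
  approvePlan() {
    for (const t of WRITE_TOOLS) this.allowed.add(t);
  }

  use(tool: string): boolean {
    return this.allowed.has(tool);
  }
}
```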

Dynamic Skill Generation — Plan and Execute Pipeline: Task Received triggers Phase 1 Read-Only Exploration (Read, Grep, Glob tools gathering BRIEF.md, RFP docs, comms, deliverables, progress notes). Generates a custom Skill, reviewed by user, then Phase 2 Autonomous Execution (Write, Edit, Bash, Read tools) with full context retained.

Dynamic Skill Generation: Two-phase pipeline from read-only exploration to autonomous execution with full context

4. Local-First Intelligence with Cloud Sync

The desktop application runs sqlite-vec for local vector embeddings inside Electron. Once content is indexed, semantic search queries execute locally in milliseconds. Embedding generation uses OpenAI's text-embedding-3-small (1536 dimensions) by default; the provider is replaceable at the configuration level, since the vector store assumes only 1536-dimensional vectors, not a specific model. When offline, the system falls back gracefully to FTS5 keyword search.

The backend runs Qdrant with PostgreSQL for team-wide vector search. Cross-project queries hit the cloud layer. Single-project work stays local.

A sync layer bridges the two. The desktop holds emails, memory, and chat in local SQLite (better-sqlite3, WAL mode); business data lives in the FastAPI backend and syncs via REST.
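The local-first fallback behaviour can be sketched in plain TypeScript. The real client sits on sqlite-vec and SQLite FTS5; the in-memory cosine ranking and keyword containment below are stand-ins for those layers:

```typescript
// Stand-in for sqlite-vec (semantic path) and FTS5 (offline keyword path).
interface Doc { id: string; text: string; vector?: number[]; }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// With a query embedding available, rank semantically; without one
// (offline), fall back to keyword matching in place of FTS5.
function search(docs: Doc[], query: string, queryVector?: number[]): Doc[] {
  if (queryVector) {
    return docs
      .filter((d) => d.vector)
      .sort((x, y) => cosine(y.vector!, queryVector) - cosine(x.vector!, queryVector));
  }
  const terms = query.toLowerCase().split(/\s+/);
  return docs.filter((d) => terms.some((t) => d.text.toLowerCase().includes(t)));
}
```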

5. Dual-Layer Organisational Memory

Agents without memory are agents that forget. This architecture eliminates cold-start degradation entirely.

Layer 1 — Curated Knowledge (never deleted)

  • MEMORY.md — persistent agent memory, updated across sessions
  • Per-project BRIEF.md — living project context
  • Identity files (COMPANY.md, IDENTITY.md, AGENTS.md) — agent persona and role definitions

Layer 2 — Append-Only Logs (daily)

  • Daily conversation logs (YYYY-MM-DD.md) capturing every exchange
  • Per-project communication summaries in structured directories
  • Progress notes per task — completed steps, missing items, next actions, output files with byte sizes

All memory files are indexed into SQLite FTS5 for keyword search, with sqlite-vec providing semantic similarity. When an agent resumes a task, readProgressNotes() loads prior context and injects it into the execution prompt. The agent picks up where it left off — no repeated work, no lost context.
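A hypothetical sketch of the resumption step: `readProgressNotes()` is named in the text, but the note shape and prompt assembly here are assumptions made for illustration:

```typescript
// Note shape and prompt format are assumptions; only the function name
// readProgressNotes comes from the text.
interface ProgressNote { task: string; completed: string[]; nextActions: string[]; }

function readProgressNotes(notes: ProgressNote[], task: string): ProgressNote | undefined {
  return notes.find((n) => n.task === task);
}

// Injecting prior progress into the execution prompt is what lets an
// agent resume instead of starting cold.
function buildExecutionPrompt(task: string, note?: ProgressNote): string {
  const resume = note
    ? `Already done: ${note.completed.join(", ")}. Next: ${note.nextActions.join(", ")}.`
    : "No prior progress; start from the plan.";
  return `Task: ${task}\n${resume}`;
}
```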

Dual-Layer Memory Architecture: Layer 1 Curated Knowledge (MEMORY.md, BRIEF.md, COMPANY.md, IDENTITY.md, AGENTS.md — never deleted, loaded at session start). Layer 2 Append-Only Logs (daily logs, communications, progress notes — date-stamped, searchable). Both feed into Search and Index Layer (SQLite FTS5 + sqlite-vec) which provides context to Agent Sessions via readProgressNotes().

Dual-Layer Memory Architecture: Curated knowledge files and append-only logs feeding into a local search and index layer

Three Constraints That Shaped Everything

Before any architecture decision was made, three constraints were set as non-negotiable. Not as compliance requirements — as the conditions under which a knowledge-intensive organisation would actually trust and run a system like this.

Data never leaves the organisation's control

Client communications, engagement documents, institutional knowledge — this is the most sensitive data a knowledge-intensive organisation holds. Any architecture that requires uploading it to a third-party service for processing is a non-starter. The system was designed so that interpretation and intelligence happen locally, on the organisation's own hardware. Data flows up by design, not by default, and only what needs to be shared ever leaves the device.

AI inference costs are contained by design

Token costs are a real operational constraint today and a declining one over time — but the architecture shouldn't assume either. The system was built to minimise API calls structurally: local vector search runs before cloud search, processed inputs are tracked so nothing is classified twice, and keyword matching handles what doesn't need semantic reasoning. The result is a system that gets cheaper as model pricing drops without requiring architectural changes to capture that benefit.

No single model provider owns the stack

The service architecture — discrete, single-responsibility services for classification, delta extraction, brief management, deliverable generation — was designed so that each service is independently swappable. The system uses specific models today. The design principle is that it doesn't have to. No organisation should be locked into a provider relationship at the infrastructure level of their operations.

How the Architecture Enforces These Constraints

Each of the three constraints above maps directly to architectural decisions in the build. The desktop application is not a technology preference — it's the enforcement mechanism.

Local processing enforces data sovereignty

Emails are parsed locally — never uploaded to a cloud service for processing. Identity files, workspace documents, and progress notes live on the user's machine. Only project metadata and shared knowledge sync to the Nexus. Data flows up by design, not by default.

Local-first search contains token spend

Processed emails are tracked locally in SQLite — no re-classifying messages the system has already seen. FTS5 keyword search runs free on disk, only falling back to OpenAI embeddings when semantic matching is needed. Local project files enrich every classification prompt without additional retrieval API calls.
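The "classify once" guard can be sketched as a small ledger; the class and method names are illustrative, and the real system persists this in SQLite rather than in memory:

```typescript
// Illustrative in-memory version of the processed-email ledger the text
// describes; the production store is SQLite on disk.
class ProcessedLedger {
  private seen = new Set<string>();

  // Returns true only the first time a message id is presented, so the
  // classifier (and its token spend) runs at most once per email.
  shouldClassify(messageId: string): boolean {
    if (this.seen.has(messageId)) return false;
    this.seen.add(messageId);
    return true;
  }
}
```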

Full tool access without cloud sandboxing

The PI agent runs locally with full tool access: Bash, Read, Write, Edit, Glob, Grep — on the user's machine, not in a cloud sandbox. IMAP email access via ImapFlow reads Gmail directly. Document parsing happens in-process: pdf-parse for PDFs, mammoth for DOCX, xlsx for spreadsheets, adm-zip for PPTX.

Security implementation

PKCE OAuth2 flow (RFC 7636) with no client secret in the binary. Context isolation and disabled Node.js integration in the renderer. A controlled contextBridge API surface. Tokens stored in encrypted Electron Store, not browser localStorage.
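The PKCE pair itself is straightforward to sketch with Node's crypto module. This shows the RFC 7636 S256 method in isolation, not the app's actual auth code:

```typescript
import { createHash, randomBytes } from "node:crypto";

// RFC 7636 S256: a random code verifier and its SHA-256 challenge.
// Because the server later checks the verifier against the challenge,
// no client secret ever needs to ship inside the desktop binary.
function makePkcePair() {
  const verifier = randomBytes(32).toString("base64url"); // 43 url-safe chars
  const challenge = createHash("sha256").update(verifier).digest("base64url");
  return { verifier, challenge };
}
```

The authorization request carries the challenge; the token exchange carries the verifier, which the server hashes and compares.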

Production Engineering

Surgical deliverable updates

A scope change affects one section of a deliverable out of twelve — only that section regenerates. The DeliverableSection model tracks each section independently. Unchanged sections stay byte-for-byte identical.
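A sketch of the surgical-update rule, with hypothetical types: sections outside the change set are returned as the same objects, which is what makes them byte-for-byte identical:

```typescript
// Types and function are illustrative; the real model is DeliverableSection.
interface Section { id: string; content: string; }

function applyScopeChange(
  sections: Section[],
  affectedIds: Set<string>,
  regenerate: (s: Section) => string,
): Section[] {
  return sections.map((s) =>
    // Only affected sections are rewritten; the rest pass through untouched.
    affectedIds.has(s.id) ? { ...s, content: regenerate(s) } : s,
  );
}
```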

Task knowledge graph

Tasks aren't flat lists. The system generates structured dependency links — depends_on, shares_context, follows — with confidence scoring and reasoning capture. When scope changes, the knowledge graph surfaces impacted tasks automatically.
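The impact traversal can be sketched as a fixed-point walk over typed edges. The relation names come from the text; the edge shape and confidence cutoff are assumptions:

```typescript
// Relation names from the text; edge shape and cutoff are assumptions.
type Relation = "depends_on" | "shares_context" | "follows";
interface Edge { from: string; to: string; relation: Relation; confidence: number; }

// Anything that (transitively) points at a changed task is impacted,
// provided the link's confidence clears the cutoff.
function impactedTasks(edges: Edge[], changed: string, minConfidence = 0.6): Set<string> {
  const impacted = new Set<string>([changed]);
  let grew = true;
  while (grew) {
    grew = false;
    for (const e of edges) {
      if (e.confidence >= minConfidence && impacted.has(e.to) && !impacted.has(e.from)) {
        impacted.add(e.from);
        grew = true;
      }
    }
  }
  impacted.delete(changed); // report only the downstream tasks
  return impacted;
}
```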

Deliverable pipeline

Extract requirements from RFP → draft per-section → polish assembled document → export as DOCX. DeliverableVersion snapshots track output at each iteration.

Enterprise auth

Keycloak OIDC with PKCE, role-based access (Partner, Manager, Consultant), and dual role resolution — merging JWT realm roles with database grants.

Project lifecycle

Projects have distinct modes — proposal and delivery — linked so an engagement transitions naturally from pitch to execution.

What This Demonstrates

This started as a research question, shaped by someone who understood the domain, and produced a working system. The architecture is sound — AI can sit at the interpretation and production layers of a knowledge-intensive operation while humans remain at the decision layer. That question now has a definitive answer, backed by a system that runs.

Architecture Validation

The engineering patterns hold under real-world complexity: event-driven AI orchestration, local-first intelligence, dynamic skill generation, structured memory. These aren't prototype shortcuts — they're production patterns with real trade-offs and real solutions.

Engineering Lessons

Classification context is everything

Accuracy improved dramatically when the classifier received each project's full brief and deliverable structure — the same context a human manager would have.

Surgical updates beat full rewrites

Regenerating entire documents on every change destroys trust. Modify only what the change actually touches.

Specialised services beat monolithic prompts

Dedicated services — RFPDetector, DeltaExtractor, BriefService, DeliverableAgentService — are independently testable and improvable.

Session continuity changes agent quality

Agents that resume with progress notes produce significantly better output than agents starting cold. This is the difference between a system people use and a system people abandon.

What This Opened

R&D rarely ends where it starts. Mapping a domain completely — its workflows, its failure modes, where context gets lost between humans, where structured AI creates an asymmetric advantage — produces more than a working system. It produces a thesis about that domain.

The domain partnership that shaped this research didn't end when the system was built. It accelerated. What began as problem definition has become something more deliberate: a vertical product, co-owned with someone who understands the domain at the level required to build something that actually holds.

We've already started. More when there's something to show.

Technology Stack

electron · react · sqlite-vec · better-sqlite3 · fastapi · python · typescript · node.js · postgresql · qdrant · redis · keycloak · pi-coding-agent-sdk · imapflow · llm-agnostic
