Flagship Project · AI Workflow

Active Dev

AI-Native Analytics Platform

A governed data platform designed for AI reasoning — Apache Iceberg lakehouse, dbt semantic layer, and a Claude-powered analyst that answers business questions with explainable, trusted metrics.

Apache Iceberg · dbt · Semantic Layer · Claude · S3

Overview

AI analysts that read raw CSV files are not AI analysts. They are sophisticated pattern-matchers operating on inconsistent, ungoverned, untrustworthy data. The results they produce reflect that — hallucinated metrics, contradictory answers, business context the model cannot possibly have.

This project demonstrates the alternative: an AI-native analytics platform built on a governed data foundation. Apache Iceberg for the lakehouse layer, dbt for transformation and metric governance, a semantic layer as the contract between data and AI, and Claude as the reasoning engine. The result is an AI analyst that gives explainable, consistent, trustworthy answers — because the data beneath it deserves that trust.

Problem & Challenges

The core problem is not AI capability. Modern LLMs can reason over data fluently. The problem is data quality, consistency, and context. Without a governed data foundation, AI analytics produces confident-sounding answers that are wrong in ways that are hard to detect.

AI reading CSV files directly has no business context — it does not know that 'revenue' means net collected, not gross billed, or that sessions shorter than 5 minutes are test records
Inconsistent source data produces inconsistent answers — the same question asked twice returns different numbers because the underlying data has no single source of truth
Metric definitions live in analyst heads, not in code — making it impossible for an AI to apply them consistently
Without a semantic layer, AI must guess at column meanings, table relationships, and business logic — and guesses compound into hallucinations
Trust collapses quickly — one wrong AI-generated number that reaches a business decision destroys confidence in the entire system

Architecture

A modern AI-native analytics stack built in layers, each layer adding governance, trust, and context that makes AI reasoning reliable.

SOURCES

Raw Data

Operational Systems · Event Streams · APIs

INGEST

S3 / Object Store

Scalable storage · decoupled from compute

WAREHOUSE

Apache Iceberg

ACID transactions · schema evolution · time travel

TRANSFORM

dbt

Medallion layers · metric contracts · tests

SEMANTIC

Semantic Layer

Business context · governed metrics · NL mappings

REPORTING

AI Analyst

Claude · governed queries · explainable answers

Data Flow

Ingestion

Raw operational data lands in S3 via streaming or batch pipelines. Apache Iceberg provides ACID semantics, schema evolution, and time-travel queries over the raw layer.
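
To illustrate what time travel over the raw layer means, here is a toy model in plain Python. It does not use Iceberg or its API: `SnapshotTable`, `commit`, and `as_of` are hypothetical names, and a real Iceberg table tracks snapshots through metadata files rather than in-memory copies.

```python
from dataclasses import dataclass, field


@dataclass
class SnapshotTable:
    """Toy append-only table that keeps every committed version (hypothetical)."""

    snapshots: list[tuple[int, tuple[dict, ...]]] = field(default_factory=list)

    def commit(self, ts: int, new_rows: list[dict]) -> None:
        # Each commit produces a new immutable snapshot: previous rows plus new ones.
        previous = self.snapshots[-1][1] if self.snapshots else ()
        self.snapshots.append((ts, previous + tuple(new_rows)))

    def as_of(self, ts: int) -> tuple[dict, ...]:
        # Time travel: return the latest snapshot committed at or before ts.
        visible = [rows for commit_ts, rows in self.snapshots if commit_ts <= ts]
        return visible[-1] if visible else ()


table = SnapshotTable()
table.commit(ts=100, new_rows=[{"order_id": 1, "amount": 20.0}])
table.commit(ts=200, new_rows=[{"order_id": 2, "amount": 35.0}])

rows_then = table.as_of(150)  # only the first commit is visible
rows_now = table.as_of(250)   # both commits are visible
```

Because old snapshots are never mutated, a question answered against `as_of(150)` can be re-run later and will see the same rows.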

Transformation

dbt models transform raw Iceberg tables through staging (cleaning, typing) to mart (business logic, aggregations). Every transformation is version-controlled and testable.
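
The staging-to-mart split can be sketched in plain Python. In the platform itself these steps would be dbt SQL models; the row shape, the `stage_orders` and `mart_daily_revenue` functions, and the "paid orders only" rule are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw rows as they might land from a source system: all strings.
raw_orders = [
    {"order_id": "1", "order_date": "2024-03-01", "amount": " 20.50 ", "status": "PAID"},
    {"order_id": "2", "order_date": "2024-03-01", "amount": "35.00", "status": "paid"},
    {"order_id": "3", "order_date": "2024-03-02", "amount": "", "status": "paid"},
    {"order_id": "4", "order_date": "2024-03-02", "amount": "10.00", "status": "cancelled"},
]


def stage_orders(rows):
    """Staging: trim, type, and normalise; drop rows that cannot be typed."""
    staged = []
    for row in rows:
        amount_text = row["amount"].strip()
        if not amount_text:
            continue  # missing amount: excluded here rather than guessed
        staged.append({
            "order_id": int(row["order_id"]),
            "order_date": date.fromisoformat(row["order_date"]),
            "amount": float(amount_text),
            "status": row["status"].strip().lower(),
        })
    return staged


def mart_daily_revenue(staged):
    """Mart: apply business logic (paid orders only) and aggregate per day."""
    totals = defaultdict(float)
    for row in staged:
        if row["status"] == "paid":
            totals[row["order_date"]] += row["amount"]
    return dict(totals)


daily_revenue = mart_daily_revenue(stage_orders(raw_orders))
```

Keeping cleaning in one function and business rules in another mirrors the layering: each step can be tested on its own.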

Metric Governance

dbt metric definitions codify business rules as data contracts. 'Revenue' has exactly one definition, tested on every run, consumed by every downstream system identically.
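
A minimal sketch of "one definition, tested on every run", written in plain Python rather than dbt's own metric syntax. The `Metric` class, the field names in `payments`, and the non-negativity check are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Metric:
    """One governed metric: a single definition plus the checks that guard it."""

    name: str
    description: str
    compute: Callable[[list[dict]], float]
    checks: tuple[Callable[[float], bool], ...] = ()

    def evaluate(self, rows: list[dict]) -> float:
        value = self.compute(rows)
        # Run every contract check on each evaluation; fail loudly on violation.
        for check in self.checks:
            if not check(value):
                raise ValueError(f"contract violated for metric {self.name!r}")
        return value


# Assumed rule, following the text: revenue means net collected, not gross billed.
net_revenue = Metric(
    name="net_revenue",
    description="Sum of collected amounts minus refunds",
    compute=lambda rows: sum(r["collected"] - r["refunded"] for r in rows),
    checks=(lambda v: v >= 0,),
)

payments = [
    {"billed": 100.0, "collected": 100.0, "refunded": 10.0},
    {"billed": 50.0, "collected": 40.0, "refunded": 0.0},
]

# Dashboard and AI analyst call the same object, so they get the same number.
dashboard_value = net_revenue.evaluate(payments)
analyst_value = net_revenue.evaluate(payments)
```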

Semantic Mapping

The semantic layer maps business vocabulary to governed data assets. 'Last month's active customers' resolves to a specific mart table, specific columns, specific filter logic — not a best guess.
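
As a rough sketch of that resolution step: the table name, measure, and filter below are invented for illustration, and `resolve` is a hypothetical helper, not part of any semantic layer product. For simplicity it always applies a "last month" window.

```python
from datetime import date, timedelta

# Hypothetical semantic-layer entry; table, column, and filter names are illustrative.
SEMANTIC_LAYER = {
    "active customers": {
        "table": "mart_customer_activity",
        "measure": "COUNT(DISTINCT customer_id)",
        "date_column": "activity_date",
        "filters": ["is_test_account = FALSE"],
    },
}


def last_month(today: date) -> tuple[date, date]:
    """Return the first and last day of the calendar month before `today`."""
    first_of_this_month = today.replace(day=1)
    last_day = first_of_this_month - timedelta(days=1)
    return last_day.replace(day=1), last_day


def resolve(concept: str, today: date) -> dict:
    """Resolve a business phrase to a governed asset, or fail instead of guessing."""
    key = concept.lower()
    if key not in SEMANTIC_LAYER:
        raise KeyError(f"no governed definition for {concept!r}")
    entry = SEMANTIC_LAYER[key]
    start, end = last_month(today)
    return {**entry, "date_range": (start, end)}


resolved = resolve("Active customers", today=date(2024, 3, 15))
```

The important design choice is the `KeyError`: an unknown phrase stops the request rather than falling back to a plausible-looking table.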

AI Reasoning

Claude queries the semantic layer, not raw tables. Every answer is grounded in a governed metric definition. Reasoning steps are visible — the AI explains what data it used and why.

Semantic Layer

The semantic layer is the architectural contract between raw data and AI reasoning. It translates physical table structures into business concepts that both humans and AI understand and trust.

Business Concept Mapping

Physical columns like 'trx_amt_net_usd' become governed concepts like 'Net Revenue'. The AI queries business concepts, not raw columns, which removes a frequent source of LLM hallucination in analytics: guessing what a cryptically named column means.
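
A small sketch of that mapping, assuming a simple lookup table. Only 'trx_amt_net_usd' and 'Net Revenue' come from the text; the other names and the `to_physical` helper are hypothetical.

```python
# Hypothetical mapping from governed business concepts to physical columns.
# 'trx_amt_net_usd' comes from the text; the other names are illustrative.
CONCEPTS = {
    "Net Revenue": "trx_amt_net_usd",
    "Transaction Date": "trx_dt",
    "Customer": "cust_id",
}


def to_physical(concepts: list[str]) -> list[str]:
    """Translate governed concept names to physical columns; reject anything else."""
    unknown = [c for c in concepts if c not in CONCEPTS]
    if unknown:
        raise KeyError(f"not a governed concept: {unknown}")
    return [CONCEPTS[c] for c in concepts]


columns = to_physical(["Net Revenue", "Customer"])
```

A request that names a raw column directly, such as `to_physical(["trx_amt_net_usd"])`, raises instead of passing through.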

Metric Governance

Every metric has exactly one definition, tested continuously. 'Monthly Active Users' means the same thing to the dashboard, the AI analyst, and the executive report — because it is defined once as code and consumed everywhere identically.

Natural Language Grounding

When a user asks 'how did revenue perform last quarter?', the AI resolves 'revenue' to a specific, tested dbt metric with known caveats rather than guessing from column names.

Context Injection

Before the AI answers any question, it receives the relevant semantic layer definitions: business rules, filter logic, known data quality caveats. This context constrains the model to governed definitions, so answers are repeatable and explainable rather than opaque guesses.
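
One way context injection could look, sketched without any model call. The definitions, the substring matching in `build_context`, and the prompt wording are assumptions; the resulting string is what would be sent to the model alongside the question.

```python
# Hypothetical semantic-layer definitions; the wording is illustrative.
DEFINITIONS = {
    "revenue": {
        "rule": "Net collected amount, not gross billed.",
        "filters": ["exclude cancelled transactions"],
        "caveats": ["Figures for the current month are incomplete."],
    },
    "active users": {
        "rule": "Distinct users with at least one session in the period.",
        "filters": ["exclude sessions shorter than 5 minutes (test records)"],
        "caveats": [],
    },
}


def build_context(question: str) -> str:
    """Collect definitions for every governed term mentioned in the question."""
    lowered = question.lower()
    sections = []
    for term, spec in DEFINITIONS.items():
        if term not in lowered:
            continue
        lines = [f"Metric: {term}", f"Rule: {spec['rule']}"]
        lines += [f"Filter: {f}" for f in spec["filters"]]
        lines += [f"Caveat: {c}" for c in spec["caveats"]]
        sections.append("\n".join(lines))
    return "\n\n".join(sections)


def build_prompt(question: str) -> str:
    """Prepend governed definitions so the model answers from them, not from guesses."""
    context = build_context(question)
    if not context:
        return f"No governed definition matches this question. Say so.\n\nQuestion: {question}"
    return f"Use only these definitions:\n\n{context}\n\nQuestion: {question}"


prompt = build_prompt("How did revenue perform last quarter?")
```

Only the definitions relevant to the question are injected, which keeps the prompt short and makes it clear afterwards which rules the answer relied on.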

AI Layer

The AI layer is the interface between business questions and governed data. It does not replace analysts — it makes their work accessible to anyone in the organisation.

Semantic Context Injection

Before answering any question, Claude receives the semantic layer definitions for the relevant metrics: business rules, column meanings, known caveats. With these in the prompt, the model no longer has to infer what a metric or column means, which removes a common cause of hallucinated answers.

Governed Query Generation

SQL is generated against the semantic layer, not raw tables. Metric definitions are enforced automatically, so no analyst intervention is required to ensure consistency.
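
A minimal sketch of one possible design, in which the model only chooses a metric name and a date range, and the SQL is compiled from the governed definition. The `METRICS` entry, its table and column names, and `compile_query` are hypothetical.

```python
from datetime import date

# Hypothetical governed metric; table, column, and filter names are illustrative.
METRICS = {
    "net_revenue": {
        "table": "mart_revenue",
        "expression": "SUM(net_revenue_usd)",
        "date_column": "revenue_date",
        "filters": ["is_test_record = FALSE"],
    },
}


def compile_query(metric: str, start: date, end: date) -> str:
    """Build SQL from a governed definition; callers never write table or filter logic."""
    if metric not in METRICS:
        raise KeyError(f"unknown metric: {metric}")
    spec = METRICS[metric]
    date_filter = (
        f"{spec['date_column']} BETWEEN DATE '{start.isoformat()}' "
        f"AND DATE '{end.isoformat()}'"
    )
    conditions = list(spec["filters"]) + [date_filter]
    return (
        f"SELECT {spec['expression']} AS {metric}\n"
        f"FROM {spec['table']}\n"
        "WHERE " + " AND ".join(conditions)
    )


sql = compile_query("net_revenue", date(2024, 1, 1), date(2024, 3, 31))
```

Because the filters live in the definition, the test-record exclusion is applied on every query whether or not the question mentions it.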

Explainable Answers

Every AI-generated answer includes the metric definition used, the data range queried, and any caveats. Business users understand not just the answer but how it was produced.
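
The shape of such an answer might look like the following sketch. `ExplainedAnswer` is a hypothetical structure, and the value and caveat shown are made-up sample data, not project results.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ExplainedAnswer:
    """An answer bundled with the evidence a reader needs to verify it."""

    value: float
    metric_name: str
    metric_definition: str
    date_range: tuple[date, date]
    caveats: tuple[str, ...] = ()

    def render(self) -> str:
        start, end = self.date_range
        lines = [
            f"{self.metric_name}: {self.value:,.2f}",
            f"Definition: {self.metric_definition}",
            f"Data range: {start.isoformat()} to {end.isoformat()}",
        ]
        lines += [f"Caveat: {c}" for c in self.caveats]
        return "\n".join(lines)


# Sample values for illustration only.
answer = ExplainedAnswer(
    value=130000.0,
    metric_name="Net Revenue",
    metric_definition="Net collected amount, not gross billed",
    date_range=(date(2024, 1, 1), date(2024, 3, 31)),
    caveats=("Test records excluded",),
)
report = answer.render()
```

Making the definition and date range required fields means an answer cannot be constructed without them.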

Anomaly Detection

The AI layer monitors KPIs on a schedule — detecting deviations, generating hypotheses for root causes, and escalating to humans only when confidence is low.
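
The detection and escalation steps could be sketched with a simple z-score test. This is a swapped-in, basic technique rather than the project's actual method; the threshold of 3 standard deviations and the "borderline means low confidence" policy are assumptions, and root-cause hypothesis generation is not covered.

```python
from statistics import mean, stdev


def check_kpi(history: list[float], latest: float, threshold: float = 3.0) -> dict:
    """Flag `latest` when it sits more than `threshold` standard deviations from history."""
    if len(history) < 2:
        # Too little history to estimate spread: treat as low confidence.
        return {"anomaly": False, "z_score": None, "escalate": True}
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        z = 0.0 if latest == mu else float("inf")
    else:
        z = (latest - mu) / sigma
    anomaly = abs(z) > threshold
    # Assumed policy: deviations within 1 sigma of the threshold count as
    # low confidence and are escalated to a human.
    borderline = abs(abs(z) - threshold) < 1.0
    return {"anomaly": anomaly, "z_score": z, "escalate": borderline}


daily_orders = [100.0, 102.0, 98.0, 101.0, 99.0]
normal = check_kpi(daily_orders, latest=100.5)
spike = check_kpi(daily_orders, latest=140.0)
```

Clear cases on either side are handled automatically; only readings near the threshold, or KPIs with almost no history, reach a person.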

Without a Data Warehouse (DWH) vs With One

Without DWH — AI Guesses

AI reads raw files directly. No context, no governance, no trust.

Inconsistent answers — the same metric returns different values on different days
Duplicated logic — revenue calculation differs between CSV files with no reconciliation
Hallucinations — AI infers column meanings that don't match business definitions
Missing business context — test records, cancelled transactions, and edge cases are not filtered
No explainability — the AI cannot show its working because the working is guesswork

With DWH — AI Reasons

AI operates on a governed semantic layer. Every answer is grounded, explainable, and consistent.

Trusted metrics — every number comes from a single canonical definition enforced as code
Reusable definitions — business logic is written once, tested continuously, used everywhere
Explainable answers — the AI shows which metric definition it used and why
Scalable analytics — adding a new data source extends the platform without rebuilding the AI layer
Consistent reasoning — the same question always produces the same answer from the same governed data

Technologies

Apache Iceberg · dbt · Semantic Layer · Claude · S3

Results

AI analyst answers business questions with consistent, explainable, governed metrics
Zero hallucinated metric definitions — business logic enforced by dbt contracts
Natural language interface reduces time-to-insight from hours to seconds
Semantic layer serves as a single source of truth for both AI and human analysts
Platform architecture is extensible — new data sources add to the semantic layer without disrupting existing AI queries

Lessons Learned

The AI is only as good as the data

The most impactful improvements to AI answer quality came from improving data governance, not from prompt engineering. Clean, governed data with clear business context produces better AI reasoning than any prompt trick.

The semantic layer is the AI's operating manual

Without a semantic layer, AI analytics is a probabilistic exercise. With one, resolving a metric becomes a deterministic lookup of governed business logic. The investment in semantic layer design pays back on every query.

Explainability is not optional

Business users will not trust AI answers they cannot verify. Designing explainability into the AI layer from the start — showing sources, metric definitions, and reasoning steps — was the difference between adoption and rejection.

Governance scales the platform

Every metric definition added to the semantic layer is immediately available to the AI analyst. The governance investment compounds — the platform becomes smarter as the semantic layer grows, with no additional AI work required.

Future Vision

Autonomous analytics agents that proactively surface insights without being asked — monitoring KPIs, detecting anomalies, and generating executive summaries on a schedule
Multi-agent architectures where specialised AI analysts handle different business domains — revenue, operations, customer — and collaborate to answer cross-domain questions
Real-time AI analytics over streaming data — extending the governed semantic layer to Kafka event streams for sub-second AI-powered operational intelligence
Self-updating semantic layers where the AI proposes new metric definitions based on observed query patterns, subject to human review and governance approval