AI-powered content quality analysis with eval-driven development and unified observability via Base14 Scout.
Stack: Python 3.14 · FastAPI · LlamaIndex · Promptfoo · OpenTelemetry · Base14 Scout
AI applications suffer from "prompt roulette" — teams iterate on prompts by gut feel, ship without quality gates, and have no visibility into LLM behavior in production. This project demonstrates an eval-driven workflow: systematic prompt evaluation with Promptfoo before deploy, CI quality gates blocking regressions, and full production observability through OpenTelemetry with traces spanning HTTP through LLM calls.
# Install dependencies
make dev
# Run locally
make run
# Run checks (lint + typecheck + tests)
make check

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check with component status |
| `/review` | POST | Review content for quality issues (hyperbole, bias, unsourced claims) |
| `/improve` | POST | Suggest specific text improvements with before/after |
| `/score` | POST | Score content 0-100 with clarity/accuracy/engagement/originality breakdown |
All analysis endpoints accept a JSON body with `content` (string, max 10,000 chars) and optional `content_type` (one of `general`, `marketing`, `technical`, `blog`).
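The repository's `ContentRequest` model (`models/requests.py`) enforces this shape; the sketch below is only an approximation built from the description above, so the exact constraints and defaults may differ from the real implementation.

```python
from typing import Literal

from pydantic import BaseModel, Field


class ContentRequest(BaseModel):
    """Body accepted by /review, /improve, and /score (illustrative sketch)."""

    # Text to analyze; anything over 10,000 characters fails validation.
    content: str = Field(..., min_length=1, max_length=10_000)
    # Optional hint that selects prompt wording for the given style of content.
    content_type: Literal["general", "marketing", "technical", "blog"] = "general"
```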
# Health check
curl http://localhost:8000/health
# Review content for quality issues
curl -X POST http://localhost:8000/review \
-H "Content-Type: application/json" \
-d '{"content": "This revolutionary product is the absolute best!", "content_type": "marketing"}'
# Get improvement suggestions
curl -X POST http://localhost:8000/improve \
-H "Content-Type: application/json" \
-d '{"content": "The thing is really good and stuff.", "content_type": "blog"}'
# Score content quality (0-100)
curl -X POST http://localhost:8000/score \
-H "Content-Type: application/json" \
-d '{"content": "Kubernetes orchestrates containerized workloads across clusters.", "content_type": "technical"}'Prompts are evaluated offline using Promptfoo before they reach production. The pipeline includes 22 test cases across marketing, technical, and blog content, plus adversarial inputs (prompt injection, whitespace, non-English, mixed HTML/markdown).
# Run full eval suite (no cache, forces fresh LLM calls)
make eval
# View results in browser with side-by-side comparison
make eval-view

CI gate: The GitHub Actions workflow (`.github/workflows/eval.yml`) runs `promptfoo eval --no-cache` on every PR and blocks merge if the pass rate drops below 95%. Concurrency controls prevent parallel PRs from racing on the LLM API.
Side-by-side comparison: Multiple prompt versions (e.g., review_v1 vs review_v2) run against the same test cases, letting you compare output quality before switching production prompts.
Prompt files live in prompts/ as YAML with separate system and user templates. Test datasets and custom assertion functions live in evals/.
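As a rough sketch (assuming each prompt file exposes top-level `system` and `user` keys, which may not match the exact schema used in `prompts/`), the loader in `prompts.py` could be approximated as:

```python
from pathlib import Path

import yaml

PROMPTS_DIR = Path("prompts")


def load_prompt(name: str, version: str) -> dict[str, str]:
    """Load a versioned prompt, e.g. load_prompt("review", "v2") -> prompts/review_v2.yaml."""
    data = yaml.safe_load((PROMPTS_DIR / f"{name}_{version}.yaml").read_text())
    # Assumed schema: separate system and user templates per file.
    return {"system": data["system"], "user": data["user"]}
```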
Every request produces a unified trace spanning HTTP and LLM calls — all visible in Base14 Scout.
GenAI telemetry (spans, metrics, events) is handled by custom instrumentation in llm.py following OTel GenAI semantic conventions. We intentionally do NOT use OpenInference/LlamaIndex auto-instrumentation to avoid non-standard attributes (llm.*, input.*, output.*) that pollute telemetry with framework-specific data outside the OTel GenAI semconv.
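In outline, the custom instrumentation wraps each LLM call in a `chat {model}` span and sets only semconv attributes; the helper below is an illustrative sketch, not the actual code in `llm.py`.

```python
from opentelemetry import trace

tracer = trace.get_tracer("content_quality.llm")


def traced_chat(provider: str, model: str, call_llm):
    """Run one chat completion inside an OTel GenAI span (sketch)."""
    with tracer.start_as_current_span(f"chat {model}") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.provider.name", provider)
        span.set_attribute("gen_ai.request.model", model)
        response = call_llm()  # the underlying LlamaIndex completion call
        # Usage, cost, and response metadata are attached after the call returns.
        span.set_attribute("gen_ai.response.model", getattr(response, "model", model))
        return response
```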
| Layer | Instrumentation | Type | What You Get |
|---|---|---|---|
| HTTP server | `FastAPIInstrumentor` | Auto | Request spans with method, path, status, duration (excludes `/health`, suppresses ASGI sub-spans) |
| HTTP server | `MetricsMiddleware` | Custom | `http.server.request.count`, `http.server.request.duration`, `http.server.active_requests` (excludes `/health`) |
| Logging | `LoggingInstrumentor` | Auto | Trace-correlated log records with `trace_id` and `span_id` |
| GenAI spans | Custom OTel spans | Custom | `chat {model}` spans with OTel GenAI semconv attributes |
| GenAI metrics | Custom OTel meters | Custom | `gen_ai.client.token.usage`, `gen_ai.client.cost`, `gen_ai.client.operation.duration` |
| GenAI errors | Custom OTel counters | Custom | `gen_ai.client.error.count`, `gen_ai.client.retry.count` |
| Evaluations | Custom OTel events + metrics | Custom | `gen_ai.evaluation.result` events, `gen_ai.evaluation.score` histogram |
| PII scrubbing | Custom event processing | Custom | Emails, phone numbers, SSNs scrubbed from span events before export (see the sketch after this table) |
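The scrubbing referenced in the last row lives in `pii.py`; the patterns below only illustrate the approach and are not the module's exact regexes.

```python
import re

# Illustrative patterns; the real module may cover more formats.
_PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scrub(text: str) -> str:
    """Redact PII from span event payloads before they are exported."""
    for label, pattern in _PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text
```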
Each chat {model} span carries these attributes:
| Attribute | Source | Example |
|---|---|---|
| `gen_ai.operation.name` | Custom | `chat` |
| `gen_ai.request.model` | Custom | `claude-3-5-haiku-20241022` |
| `gen_ai.response.model` | Custom | `claude-3-5-haiku-20241022` |
| `gen_ai.provider.name` | Custom | `anthropic` |
| `gen_ai.request.temperature` | Custom | `0.3` |
| `gen_ai.output.type` | Custom | `json` |
| `gen_ai.usage.input_tokens` | Custom | `923` |
| `gen_ai.usage.output_tokens` | Custom | `150` |
| `gen_ai.usage.cost_usd` | Custom | `0.001666` |
| `gen_ai.response.id` | Custom | `msg_abc123` |
| `gen_ai.response.finish_reasons` | Custom | `["end_turn"]` |
| `server.address` | Custom | `api.anthropic.com` |
| `server.port` | Custom | `443` |
| `content.type` | Custom | `marketing` |
| `content.length` | Custom | `42` |
| `endpoint` | Custom | `/review` |
Token usage and cost are extracted from each LLM response across all supported providers:
| Provider | Token Source | Pricing |
|---|---|---|
| OpenAI | `additional_kwargs["prompt_tokens"]` / `["completion_tokens"]` | Per-model lookup table |
| Google Gemini | `additional_kwargs["prompt_tokens"]` / `["completion_tokens"]` | Per-model lookup table |
| Anthropic | `raw["usage"]["input_tokens"]` / `["output_tokens"]` | Per-model lookup table |
Cost is calculated using a built-in pricing table (`PRICING` in `llm.py`) and recorded as both a span attribute (`gen_ai.usage.cost_usd`) and a counter metric (`gen_ai.client.cost`).
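As a sketch of the calculation (the prices shown are placeholders, not the repo's actual `PRICING` entries):

```python
# Placeholder USD prices per 1M tokens; the real table is PRICING in llm.py.
PRICING = {
    "claude-3-5-haiku-20241022": {"input": 0.80, "output": 4.00},
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40},
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the request cost in USD; unknown models cost 0.0 rather than raising."""
    prices = PRICING.get(model)
    if prices is None:
        return 0.0
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000
```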
Copy `.env.example` to `.env` and configure the variables below; a loading sketch follows the table.
| Variable | Default | Description |
|---|---|---|
| `LLM_PROVIDER` | `openai` | LLM provider (`openai`, `google`, `anthropic`) |
| `LLM_MODEL` | `gpt-4.1-nano` | Model name for the selected provider |
| `LLM_TEMPERATURE` | `0.3` | LLM temperature |
| `OPENAI_API_KEY` | — | OpenAI API key (when provider is `openai`) |
| `GOOGLE_API_KEY` | — | Google API key (when provider is `google`) |
| `ANTHROPIC_API_KEY` | — | Anthropic API key (when provider is `anthropic`) |
| `OTEL_SDK_DISABLED` | `false` | Disable telemetry (`true` to disable) |
| `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT` | `false` | Capture prompt/completion content on span events |
| `SCOUT_ENVIRONMENT` | `development` | Deployment environment tag |
| `SCOUT_CLIENT_ID` | — | Base14 Scout OAuth client ID |
| `SCOUT_CLIENT_SECRET` | — | Base14 Scout OAuth client secret |
| `REQUEST_TIMEOUT` | `60.0` | Request timeout (seconds) for LLM analysis endpoints |
| `REVIEW_PROMPT_VERSION` | `v1` | Prompt version for `/review` |
| `IMPROVE_PROMPT_VERSION` | `v1` | Prompt version for `/improve` |
| `SCORE_PROMPT_VERSION` | `v1` | Prompt version for `/score` |
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `8000` | Server port |
cp .env.example .env
# Edit .env with your API keys and LLM provider
# Start full stack (app + OTel Collector)
docker compose up -d
# Run API smoke tests
./scripts/test-api.sh
# Verify telemetry pipeline
./scripts/verify-scout.sh
# Tear down
docker compose down -v

The OTel Collector (`otel-collector-config.yaml`) is configured with `memory_limiter`, batch processing, and `otlp_http` export to Base14 Scout with OAuth2 authentication, retry, and gzip compression.
Three Base14 Scout dashboards provide production visibility:
Tracks content analysis quality and evaluation scores.
| Panel | Metric / Query | Description |
|---|---|---|
| Avg Quality Score | `avg(gen_ai.evaluation.score)` | 24h average with day-over-day comparison |
| Score Distribution | `histogram(gen_ai.evaluation.score)` | Bucketed distribution (90-100, 80-89, etc.) |
| Quality Over Time | `gen_ai.evaluation.score` time series | Weekly trend with pass threshold line at 60 |
| Issues by Type | count by `content_issue.type` | Breakdown: hyperbole, unsourced, unclear, bias, grammar |
| Quality by Content Type | `avg(gen_ai.evaluation.score)` by `content.type` | Comparison across technical, blog, marketing |
Tracks Promptfoo eval results and prompt version performance.
| Panel | Metric / Query | Description |
|---|---|---|
| Current Pass Rate | CI eval pass/total ratio | Current rate with delta vs last run |
| CI Gate Threshold | Static: `95.0%` | Visual threshold indicator |
| Pass Rate by Prompt Version | Pass rate per `prompt.version` | Side-by-side: `review_v1` vs `review_v2`, `improve_v1`, `score_v1` |
| Failed Assertions | Failed test case + assertion detail | Table of test case, assertion type, expected vs actual |
Tracks LLM costs and token usage.
| Panel | Metric / Query | Description |
|---|---|---|
| Total Cost (24h) | `sum(gen_ai.client.cost)` | Daily total with day-over-day delta |
| Cost per Request | `avg(gen_ai.client.cost)` | Average cost per LLM call |
| Token Usage Over Time | `gen_ai.client.token.usage` time series | Input vs output token trend |
| Cost by Endpoint | `sum(gen_ai.client.cost)` by `endpoint` | Breakdown: `/review`, `/improve`, `/score` |
| Input vs Output Tokens | `sum(gen_ai.client.token.usage)` by `gen_ai.token.type` | Ratio of input to output tokens |
Recommended alert rules for production monitoring:
| Alert | Condition | Severity | Action |
|---|---|---|---|
| Quality Drop | `avg(gen_ai.evaluation.score)` < 70 | Warning | Review prompts, run eval suite |
| Eval Pass Rate Drop | CI pass rate < 90% | Critical | Block deploy, investigate failures |
| High Latency | `p95(gen_ai.client.operation.duration)` > 5s | Warning | Check provider status, consider model |
| Cost Spike | `rate(gen_ai.client.cost)` > 2x baseline | Warning | Review request volume, model selection |
| High Daily Cost | `sum(gen_ai.client.cost)` > $10 | Warning | Review usage patterns |
| Error Rate | `sum(gen_ai.client.error.count)` / total > 5% | Critical | Check logs, investigate trace |
| Retry Storm | `rate(gen_ai.client.retry.count)` > 10/min | Warning | Possible upstream degradation |
| Token Anomaly | Tokens > 3x baseline | Warning | Possible prompt injection |
ai-content-quality/
├── src/content_quality/
│ ├── main.py # FastAPI app, routes, lifespan
│ ├── config.py # Settings from environment
│ ├── pii.py # PII scrubbing for span events
│ ├── telemetry.py # OTel SDK setup (traces, metrics, logs)
│ ├── middleware/
│ │ └── metrics.py # HTTP request metrics middleware
│ ├── models/
│ │ ├── requests.py # ContentRequest with validation
│ │ └── responses.py # Pydantic response models (ReviewResult, ImproveResult, ScoreResult)
│ └── services/
│ ├── analyzer.py # ContentAnalyzer with eval event recording
│ ├── llm.py # LLM call with OTel GenAI semconv, retry, cost tracking, PII scrubbing
│ └── prompts.py # YAML prompt loader
├── prompts/ # Versioned prompt templates (YAML)
│ ├── review_v1.yaml
│ ├── review_v2.yaml
│ ├── improve_v1.yaml
│ └── score_v1.yaml
├── evals/ # Promptfoo eval pipeline
│ ├── assertions/ # Custom JS assertion functions
│ │ ├── review.js
│ │ ├── improve.js
│ │ └── score.js
│ └── datasets/ # Test case data
│ ├── review_cases.json
│ ├── improve_cases.json
│ └── score_cases.json
├── tests/ # Unit tests (137 tests)
│ ├── conftest.py
│ ├── test_analyzer.py
│ ├── test_api.py
│ ├── test_llm.py
│ ├── test_middleware.py
│ ├── test_models.py
│ ├── test_pii.py
│ ├── test_prompts.py
│ └── test_telemetry.py
├── scripts/
│ ├── test-api.sh # API smoke test script
│ └── verify-scout.sh # Telemetry verification script
├── promptfooconfig.yaml # Eval pipeline configuration
├── compose.yml # Docker Compose (app + OTel Collector)
├── otel-collector-config.yaml # Collector pipeline config
├── Dockerfile
├── Makefile
└── pyproject.toml