Overview

Keywords AI’s observability is built around one core concept: different views of the same underlying data. All views present the same log data, just organized differently for different use cases:

  • Logs — Plain view: individual LLM requests as they happen
  • Traces — Hierarchical view: multi-step workflows and AI agent operations
  • Threads — Conversational view: a linear chat interface for dialogue systems
  • Scores — Evaluation view: quality assessments and performance metrics

All four views show the same underlying data; they just organize and present it differently. The core data structure remains consistent across all views.

Logs

A log represents a single LLM request and contains all the information about that interaction. This is the foundational data that powers all other views.

Core fields

  • start_time — Request start time (RFC3339)
  • timestamp — Span end time (RFC3339)
  • prompt_tokens — Tokens in the prompt/input
  • completion_tokens — Tokens in the model output
  • prompt_cache_hit_tokens — Tokens served from cache
  • prompt_cache_creation_tokens — Tokens added to cache
  • total_request_tokens — Sum of prompt and completion tokens
  • latency — Total request latency in seconds
  • time_to_first_token — Time to first token (seconds)
  • tokens_per_second — Output token throughput (TPS)
  • routing_time — Deprecated: time spent deciding the model/route
  • cost — Total request cost (USD)
  • unique_id — Request unique identifier
  • UNIQUE_ORGANIZATION_ID_KEY — Organization unique identifier
  • organization_key_id — API key identifier
  • environment — Runtime environment (e.g., test, prod)
  • customer_identifier — User/customer-level identifier
  • evaluation_identifier — Evaluator run identifier
  • prompt_id — Prompt identifier
  • prompt_version_number — Prompt version number
  • custom_identifier — Custom identifier provided by client
  • thread_identifier — Logical thread identifier
  • thread_unique_id — Unique thread identifier
  • trace_unique_id — Trace identifier
  • span_unique_id — Span identifier
  • span_name — Span name
  • span_parent_id — Parent span identifier
  • span_workflow_name — Workflow name associated with span
  • trace_group_identifier — Group identifier for related traces
  • deployment_name — Deployment name
  • provider_id — Provider identifier
  • model — Model name
  • status_code — HTTP-like status code
  • status — Status string
  • tool_calls — Tool calls recorded
  • stream — Whether the response was streamed
  • stream_options — Streaming configuration options
  • temperature — Sampling temperature
  • max_tokens — Maximum number of output tokens
  • logit_bias — Token-level bias applied during sampling
  • logprobs — Whether log probabilities were requested
  • top_logprobs — Number of top log probabilities returned per token
  • frequency_penalty — Frequency penalty applied during sampling
  • presence_penalty — Presence penalty applied during sampling
  • stop — Stop sequence(s) for generation
  • n — Number of completions requested
  • response_format — Requested response format (e.g., JSON)
  • verbosity — Requested response verbosity level
  • tools — Tool definitions available to the model
  • input — Raw request input (e.g., prompt or structured messages)
  • output — Raw model output
  • prompt_messages — Messages sent to the model
  • completion_message — Final assistant message from the model

Log types

All logs are categorized by type to help with filtering and organization. Types fall into three groups: LLM inference, Workflow & agent, and Advanced.
  • text: Basic text completion requests
  • chat: Conversational chat completions (most common)
  • completion: Legacy completion format
  • response: Response API calls
  • embedding: Vector embedding generation
  • transcription: Speech-to-text conversion
  • speech: Text-to-speech generation

Example log structure

{
  "id": "log_123",
  "log_type": "chat",
  "prompt": "What is the capital of France?",
  "response": "The capital of France is Paris.",
  "model": "gpt-4",
  "tokens": {
    "input": 8,
    "output": 7,
    "total": 15
  },
  "cost": 0.0003,
  "latency": 1200,
  "timestamp": "2024-01-15T10:30:00Z",
  "status": "success"
}

Traces

Traces organize the same log data into hierarchical workflows, ideal for complex AI agent operations and multi-step processes.
Traces group related logs using spans to show the execution flow: each span corresponds to a log entry but adds hierarchical context.

Trace structure

Key trace fields

  • trace_unique_id: Groups all spans in the same workflow
  • span_unique_id: Individual span identifier (maps to log ID)
  • span_parent_id: Creates the hierarchical structure
  • span_name: Descriptive name for the operation
  • span_workflow_name: The nearest workflow this span belongs to
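
A minimal sketch of how these fields fit together, assuming two log entries in the same workflow (all identifiers below are hypothetical): the child span points at its parent through span_parent_id, and both share a trace_unique_id.

{
  "trace_unique_id": "trace_abc",
  "span_unique_id": "span_001",
  "span_parent_id": null,
  "span_name": "handle_user_request",
  "span_workflow_name": "support_agent"
}
{
  "trace_unique_id": "trace_abc",
  "span_unique_id": "span_002",
  "span_parent_id": "span_001",
  "span_name": "retrieve_documents",
  "span_workflow_name": "support_agent"
}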

Multi-trace grouping

Complex workflows can span multiple traces using trace_group_identifier, as sketched below.
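
Here, two otherwise independent traces share a trace_group_identifier (the identifier values are illustrative, not real):

{
  "trace_group_identifier": "onboarding_run_42",
  "trace_unique_id": "trace_ingest_docs"
}
{
  "trace_group_identifier": "onboarding_run_42",
  "trace_unique_id": "trace_generate_summary"
}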

Threads

Threads organize the same log data in a conversational format, ideal for chat applications and dialogue systems.

Thread structure

  • Thread ID: Unique identifier for the conversation
  • Messages: Ordered sequence of user and assistant messages (each message maps to log entries)
Each message in a thread references a log_id, which shows that threads are just a different presentation of the same underlying log data.
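
An illustrative sketch of a thread, reusing the example log above (the thread identifier is hypothetical and the exact serialized shape may differ):

{
  "thread_identifier": "support_chat_7",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?",
      "log_id": "log_123"
    },
    {
      "role": "assistant",
      "content": "The capital of France is Paris.",
      "log_id": "log_123"
    }
  ]
}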

Scores

Scores organize the same log data with evaluation metrics and quality assessments, perfect for monitoring LLM performance and conducting evaluations.
Scores are evaluation results that are attached to specific logs. Each score represents a quality assessment or performance metric for a particular LLM request.

Score structure

Scores are linked to logs through the log_id field and can be created by two types of evaluators:
  • Platform evaluators: Use evaluator_id (UUID from Keywords AI platform)
  • Custom evaluators: Use evaluator_slug (your custom string identifier)

Key score fields

  • id: Unique score identifier
  • log_id: Links the score to its corresponding log entry
  • evaluator_id: UUID of Keywords AI platform evaluator (optional)
  • evaluator_slug: Custom evaluator identifier (optional)
  • is_passed: Whether the evaluation passed defined criteria
  • cost: Cost of running the evaluation
  • created_at: When the score was created
Each evaluator can only have one score per log, ensuring data integrity and preventing duplicate evaluations.
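
Putting the key fields together, a score attached to the example log above might look like this (a sketch with illustrative values, not a canonical API response):

{
  "id": "score_456",
  "log_id": "log_123",
  "evaluator_slug": "response-quality",
  "numerical_value": 4.5,
  "is_passed": true,
  "cost": 0.0001,
  "created_at": "2024-01-15T10:31:00Z"
}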

Score value types

Critical: use the value field that matches the evaluator’s score_value_type. Scores support four value types:

  • numerical — set numerical_value (number), e.g. 4.5
  • boolean — set boolean_value (boolean), e.g. true
  • categorical — set categorical_value (array of strings), e.g. ["excellent", "coherent"]
  • comment — set string_value (string), e.g. "Good response quality"
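
For example, a categorical score created by a hypothetical evaluator whose score_value_type is categorical would carry its value in categorical_value:

{
  "log_id": "log_123",
  "evaluator_slug": "tone-classifier",
  "categorical_value": ["relevant", "helpful"]
}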

Detailed descriptions

Numerical Scores
  • Use case: Ratings, confidence scores, quality metrics
  • Range: Defined by evaluator’s min_score and max_score
  • Example: Rating response quality from 1-5
Boolean Scores
  • Use case: Pass/fail evaluations, binary classifications
  • Values: true or false
  • Example: Content safety check
Categorical Scores
  • Use case: Multi-choice classifications
  • Values: Array of predefined choices from evaluator’s categorical_choices
  • Example: ["relevant", "accurate", "helpful"]
Comment Scores
  • Use case: Qualitative feedback, explanations
  • Values: Free-form text
  • Example: Detailed evaluation reasoning

Evaluator types

Scores are created by evaluators, which come in three types:

LLM Evaluators (type: "llm")
  • AI-powered evaluation using language models
  • Requires an evaluator_definition prompt
  • Supports all score value types

Human Evaluators (type: "human")
  • Manual evaluation by human reviewers
  • Often used with categorical or comment scores
  • Requires predefined choices for categorical scores

Code Evaluators (type: "code")
  • Programmatic evaluation using custom code
  • Requires an eval_code_snippet
  • Most flexible for complex logic
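
As a sketch, assuming a minimal configuration shape (only type, score_value_type, min_score/max_score, evaluator_definition, and eval_code_snippet come from the descriptions above; everything else is hypothetical), an LLM evaluator and a code evaluator might look like:

{
  "evaluator_slug": "helpfulness-judge",
  "type": "llm",
  "score_value_type": "numerical",
  "min_score": 1,
  "max_score": 5,
  "evaluator_definition": "Rate how helpful the assistant response is, from 1 to 5."
}
{
  "evaluator_slug": "length-check",
  "type": "code",
  "score_value_type": "boolean",
  "eval_code_snippet": "return len(output) < 2000"
}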
Note: Legacy fields (llm_input, llm_output) are normalized to input/output when the inputs field is read.