This tutorial walks you through running experiments with the Keywords AI API so you can programmatically evaluate your LLM outputs, from log ingestion to experiment execution.

Prerequisites

  • A Keywords AI API Key.
  • Python with the requests library installed (the code examples use it).

Step 1: Ingest Logs

First, you need logs in the system to create a dataset from. Use the Request Logging (log ingestion) endpoint to send your LLM request data to Keywords AI.
import requests

url = "https://api.keywordsai.co/api/request-logs/create/"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}
# A single log pairs the prompt messages with the completion the model returned.
payload = {
    "model": "gpt-4",
    "prompt_messages": [{"role": "user", "content": "Hello world"}],
    "completion_message": {"role": "assistant", "content": "Hello! How can I help?"},
    # ... other fields
}
response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()  # fail fast if the log was rejected

Step 2: Retrieve Logs for Dataset

Once you have logs, filter them down to the ones you want in your dataset. Use the Retrieve log list endpoint to find specific logs (e.g., by time range or metadata) and collect their IDs.
url = "https://api.keywordsai.co/api/request-logs/list/"
# ... fetch logs to get their IDs (full sketch below)
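A minimal sketch of this step, assuming the list endpoint accepts GET requests with filter query parameters and returns paginated JSON under a results key; verify both against the Retrieve log list reference:

import requests

url = "https://api.keywordsai.co/api/request-logs/list/"
headers = {"Authorization": "Bearer YOUR_API_KEY"}
params = {"page_size": 50}  # hypothetical filter/pagination parameter; adjust per the reference
response = requests.get(url, headers=headers, params=params)
logs = response.json().get("results", [])  # assumed response shape
log_ids = [log["id"] for log in logs]  # IDs to pass to Step 3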

Step 3: Create a Dataset

Create a dataset from your logs. You can filter logs by ID, time range, or sample a percentage of them. Refer to the Create dataset endpoint for more details.
import requests

url = "https://api.keywordsai.co/api/datasets/"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}
payload = {
    "name": "My Evaluation Dataset",
    "description": "Dataset created from production logs",
    "type": "sampling",
    "sampling": 50,  # sample 50 logs
    "start_time": "2024-01-01T00:00:00Z",
    "end_time": "2024-01-31T23:59:59Z",
    "initial_log_filters": {
        "id": {
            "operator": "in",
            "value": [
                "log_id_1", "log_id_2"  # IDs obtained in Step 2
            ]
        }
    }
}

response = requests.post(url, headers=headers, json=payload)
print(response.text)
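Step 5 needs the new dataset's ID. A minimal follow-up, assuming the create response returns the dataset object with an id field (check the actual response payload):

dataset_id = response.json()["id"]  # assumed response shape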

Step 4: Create an Evaluator

Define how you want to evaluate your experiments. You can create an LLM-based evaluator, a code-based evaluator, or a human evaluator. See Create Evaluator for options.
url = "https://api.keywordsai.co/api/evaluators/"
payload = {
    "name": "Response Quality",
    "type": "llm",
    "score_value_type": "numerical",
    "configurations": {
        "evaluator_definition": "Rate the quality from 1-5.\nInput: {{llm_input}}\nOutput: {{llm_output}}",
        "scoring_rubric": "1=Poor, 5=Excellent",
        "llm_engine": "gpt-4o",
        "min_score": 1,
        "max_score": 5
    }
}
# ... POST request
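Step 5 references the evaluator by its slug ("response_quality" below). Assuming the create response returns the evaluator object with a slug field, you can capture it directly:

evaluator_slug = response.json()["slug"]  # assumed response shape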

Step 5: Create and Run Experiment

Now, create an experiment using your dataset and evaluator. If you are running a Custom Workflow (where you process inputs yourself and submit results), follow the Experiment V2 API.
url = "https://api.keywordsai.co/api/v2/experiments/"
payload = {
    "name": "My Experiment",
    "dataset_id": "DATASET_ID_FROM_STEP_3",
    "workflows": [
        {
            "type": "custom", 
            "config": {"name": "My Custom Processing"}
        }
    ],
    "evaluator_slugs": ["response_quality"] # Slug from Step 4
}
# ... POST request
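The calls that follow need the new experiment's ID. Assuming the create response returns the experiment object with an id field:

experiment_id = response.json()["id"]  # assumed response shape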
For custom workflows, the system creates placeholder traces. You then:
  1. List the logs to get the inputs (see the sketch after this list).
  2. Process the inputs with your model/logic.
  3. Submit the results back to the experiment (see the example that follows).
  4. The evaluators run automatically on the submitted results.
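A sketch of step 1, listing the experiment's placeholder logs to get their inputs and trace IDs. The collection endpoint here is an assumption inferred from the per-log URL used below; verify it against the Experiment V2 API reference:

url = f"https://api.keywordsai.co/api/v2/experiments/{experiment_id}/logs/"  # assumed list endpoint
response = requests.get(url, headers=headers)  # reuses the headers from Step 3
for log in response.json().get("results", []):  # assumed response shape
    print(log["id"], log.get("input"))  # trace IDs and inputs to process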
# Example: submitting a result for one placeholder trace
trace_id = "..."  # a trace ID from the experiment's log list
url = f"https://api.keywordsai.co/api/v2/experiments/{experiment_id}/logs/{trace_id}/"
payload = {
    "output": "My model generated response..."
}
requests.patch(url, headers=headers, json=payload)  # evaluators run once results are in