FutureAGI provides automated evaluation, tracing, and quality assessment for LLM applications. Combined with Portkey, it gives you comprehensive observability covering both operational performance and response quality.
Portkey handles “what happened, how fast, and how much?” while FutureAGI answers “how good was the response?”

Quick Start

pip install portkey-ai fi-instrumentation traceai-portkey
from portkey_ai import Portkey
from traceai_portkey import PortkeyInstrumentor
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType, EvalTag, EvalTagType,
    EvalSpanKind, EvalName, ModelChoices
)

# Setup FutureAGI tracing
tracer_provider = register(
    project_name="Model-Benchmarking",
    project_type=ProjectType.EXPERIMENT,
    project_version_name="gpt-4.1-test",
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.IS_CONCISE,
            custom_eval_name="Is_Concise",
            mapping={"input": "llm.output_messages.0.message.content"},
            model=ModelChoices.TURING_LARGE
        ),
    ]
)
PortkeyInstrumentor().instrument(tracer_provider=tracer_provider)

# Use Portkey gateway with provider slug
client = Portkey(api_key="YOUR_PORTKEY_API_KEY")

response = client.chat.completions.create(
    model="@openai-prod/gpt-4.1",  # Provider slug from Model Catalog
    messages=[{"role": "user", "content": "Explain quantum computing in 3 sentences."}],
    max_tokens=1024
)

print(response.choices[0].message.content)

Setup

  1. Add your providers in Portkey's Model Catalog
  2. Get your Portkey API key
  3. Get your FutureAGI API key
  4. Use model="@provider-slug/model-name" in requests (see the sketch below)
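
Step 4 in practice, as a minimal sketch: switching providers only means switching the slug in model. The slugs below are examples and must match the names you gave the providers in your Model Catalog.

from portkey_ai import Portkey

client = Portkey(api_key="YOUR_PORTKEY_API_KEY")

# Example slugs; replace with the slugs from your own Model Catalog.
for model in ("@openai-prod/gpt-4.1", "@anthropic-prod/claude-sonnet-4"):
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say hello in one word."}],
        max_tokens=16,
    )
    print(model, "->", reply.choices[0].message.content)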

Multi-Model Benchmarking

Compare models across providers. Each model run is registered under its own project_version_name so the results can be compared side by side in FutureAGI:
models = [
    {"name": "GPT-4.1", "model": "@openai-prod/gpt-4.1"},
    {"name": "Claude Sonnet", "model": "@anthropic-prod/claude-sonnet-4"},
    {"name": "Llama-3-70b", "model": "@groq-prod/llama3-70b-8192"},
]

scenarios = {
    "reasoning": "A farmer has 17 sheep. All but 9 die. How many are left?",
    "creative": "Write a 6-word story about a robot who discovers music.",
    "code": "Write a Python function to find the nth Fibonacci number.",
}

client = Portkey(api_key="YOUR_PORTKEY_API_KEY")

for test_name, prompt in scenarios.items():
    for model in models:
        tracer_provider = register(
            project_name="Model-Benchmarking",
            project_type=ProjectType.EXPERIMENT,
            project_version_name=model["name"]
        )
        PortkeyInstrumentor().instrument(tracer_provider=tracer_provider)
        
        response = client.chat.completions.create(
            model=model["model"],
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1024
        )
        print(f"{model['name']}: {response.choices[0].message.content[:100]}...")
        
        PortkeyInstrumentor().uninstrument()

Evaluation Tags

Configure automatic quality assessment by passing eval_tags to register(), as in the Quick Start:
eval_tags=[
    # Response conciseness
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        eval_name=EvalName.IS_CONCISE,
        mapping={"input": "llm.output_messages.0.message.content"},
        model=ModelChoices.TURING_LARGE
    ),
    # Context adherence
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        eval_name=EvalName.CONTEXT_ADHERENCE,
        mapping={
            "context": "llm.input_messages.0.message.content",
            "output": "llm.output_messages.0.message.content",
        },
        model=ModelChoices.TURING_LARGE
    ),
    # Task completion
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        eval_name=EvalName.TASK_COMPLETION,
        mapping={
            "input": "llm.input_messages.0.message.content",
            "output": "llm.output_messages.0.message.content",
        },
        model=ModelChoices.TURING_LARGE
    ),
]

Advanced Use Cases

Complex Agentic Workflows

The integration supports tracing complex workflows with multiple LLM calls:
async def ecommerce_assistant_workflow(user_query):
    # classify_intent, search_products, and generate_response are your own
    # helpers; any Portkey calls they make are traced and evaluated automatically.
    intent = await classify_intent(user_query)
    products = await search_products(intent)
    response = await generate_response(products, user_query)
    return response
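
A minimal sketch of one such step, assuming an async Portkey client (AsyncPortkey from the Portkey SDK); the helper name and classification prompt are illustrative, not part of either SDK:

from portkey_ai import AsyncPortkey

async_client = AsyncPortkey(api_key="YOUR_PORTKEY_API_KEY")

async def classify_intent(user_query: str) -> str:
    # Each Portkey call inside a workflow step becomes its own LLM span,
    # so FutureAGI can evaluate every stage of the workflow separately.
    result = await async_client.chat.completions.create(
        model="@openai-prod/gpt-4.1",
        messages=[
            {"role": "system",
             "content": "Classify the user's intent as: search, support, or other."},
            {"role": "user", "content": user_query},
        ],
        max_tokens=16,
    )
    return result.choices[0].message.content.strip()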

CI/CD Integration

Use this integration in your CI/CD pipelines for:
  • Automated Model Testing: Run evaluation suites on new model versions
  • Quality Gates: Set thresholds for evaluation scores before deployment (see the sketch below)
  • Performance Monitoring: Track degradation in model quality over time
  • Cost Optimization: Monitor and alert on cost spikes
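
As a quality gate, a small check like the sketch below can fail a pipeline when evaluation scores drop below a threshold. fetch_eval_scores is a placeholder for however you retrieve evaluation results from FutureAGI (export or API); it is not part of fi_instrumentation, and the eval names and threshold values are examples.

import sys

# Example thresholds keyed by eval name; use whatever names you configured.
THRESHOLDS = {"Is_Concise": 0.8, "Task_Completion": 0.9}

def fetch_eval_scores(project_name: str) -> dict:
    """Placeholder: replace with your own FutureAGI export or API call."""
    raise NotImplementedError

def main() -> int:
    scores = fetch_eval_scores("Model-Benchmarking")
    failed = False
    for name, minimum in THRESHOLDS.items():
        score = scores.get(name, 0.0)
        if score < minimum:
            print(f"FAIL {name}: {score:.2f} < {minimum:.2f}")
            failed = True
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(main())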

View Results

FutureAGI Dashboard

Navigate to the Prototype Tab → “Model-Benchmarking” project to see:
  • Automated evaluation scores
  • Quality metrics per response
  • Model comparison views

Portkey Dashboard

The Portkey dashboard provides:
  • Unified logs across providers
  • Cost tracking per request
  • Latency comparisons
  • Token usage analytics

Next Steps