FutureAGI provides automated evaluation, tracing, and quality assessment for LLM applications. Combined with Portkey, you get comprehensive observability covering both operational performance and response quality.
Portkey handles “what happened, how fast, and how much?” while FutureAGI answers “how good was the response?”
Quick Start
```sh
pip install portkey-ai fi-instrumentation traceai-portkey
```
```python
from portkey_ai import Portkey
from traceai_portkey import PortkeyInstrumentor
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType, EvalTag, EvalTagType,
    EvalSpanKind, EvalName, ModelChoices
)

# Set up FutureAGI tracing
tracer_provider = register(
    project_name="Model-Benchmarking",
    project_type=ProjectType.EXPERIMENT,
    project_version_name="gpt-4.1-test",
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.IS_CONCISE,
            custom_eval_name="Is_Concise",
            mapping={"input": "llm.output_messages.0.message.content"},
            model=ModelChoices.TURING_LARGE
        ),
    ]
)
PortkeyInstrumentor().instrument(tracer_provider=tracer_provider)

# Use the Portkey gateway with a provider slug
client = Portkey(api_key="YOUR_PORTKEY_API_KEY")
response = client.chat.completions.create(
    model="@openai-prod/gpt-4.1",  # Provider slug from the Model Catalog
    messages=[{"role": "user", "content": "Explain quantum computing in 3 sentences."}],
    max_tokens=1024
)
print(response.choices[0].message.content)
```
Setup
- Add providers in Model Catalog
- Get your Portkey API key
- Get your FutureAGI API key (see the sketch after this list for wiring the credentials)
- Use model="@provider-slug/model-name" in your requests
Multi-Model Benchmarking
Compare models across providers:
```python
models = [
    {"name": "GPT-4.1", "model": "@openai-prod/gpt-4.1"},
    {"name": "Claude Sonnet", "model": "@anthropic-prod/claude-sonnet-4"},
    {"name": "Llama-3-70b", "model": "@groq-prod/llama3-70b-8192"},
]

scenarios = {
    "reasoning": "A farmer has 17 sheep. All but 9 die. How many are left?",
    "creative": "Write a 6-word story about a robot who discovers music.",
    "code": "Write a Python function to find the nth Fibonacci number.",
}

client = Portkey(api_key="YOUR_PORTKEY_API_KEY")

for test_name, prompt in scenarios.items():
    for model in models:
        # Register a fresh tracer session so runs are grouped by model name
        tracer_provider = register(
            project_name="Model-Benchmarking",
            project_type=ProjectType.EXPERIMENT,
            project_version_name=model["name"]
        )
        PortkeyInstrumentor().instrument(tracer_provider=tracer_provider)

        response = client.chat.completions.create(
            model=model["model"],
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1024
        )
        print(f"{model['name']}: {response.choices[0].message.content[:100]}...")

        # Detach instrumentation before re-registering for the next model
        PortkeyInstrumentor().uninstrument()
```
Configure automatic quality assessment by passing eval tags to register():
```python
eval_tags=[
    # Response conciseness
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        eval_name=EvalName.IS_CONCISE,
        mapping={"input": "llm.output_messages.0.message.content"},
        model=ModelChoices.TURING_LARGE
    ),
    # Context adherence
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        eval_name=EvalName.CONTEXT_ADHERENCE,
        mapping={
            "context": "llm.input_messages.0.message.content",
            "output": "llm.output_messages.0.message.content",
        },
        model=ModelChoices.TURING_LARGE
    ),
    # Task completion
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        eval_name=EvalName.TASK_COMPLETION,
        mapping={
            "input": "llm.input_messages.0.message.content",
            "output": "llm.output_messages.0.message.content",
        },
        model=ModelChoices.TURING_LARGE
    ),
]
```
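These tags slot directly into the register() call from the Quick Start; a sketch with the same project settings:

```python
tracer_provider = register(
    project_name="Model-Benchmarking",
    project_type=ProjectType.EXPERIMENT,
    project_version_name="gpt-4.1-test",
    eval_tags=[
        # ... the three EvalTag entries shown above ...
    ],
)
PortkeyInstrumentor().instrument(tracer_provider=tracer_provider)
```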
Advanced Use Cases
Complex Agentic Workflows
The integration supports tracing complex workflows with multiple LLM calls:
```python
async def ecommerce_assistant_workflow(user_query):
    # classify_intent, search_products, and generate_response are your own
    # application functions; every Portkey call they make is traced automatically.
    intent = await classify_intent(user_query)
    products = await search_products(intent)
    response = await generate_response(products, user_query)
    # All steps are automatically traced and evaluated
    return response
```
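As a sketch, one of those steps might itself call the gateway. The helper below is hypothetical (it is not part of the integration) and only illustrates that each nested gateway call shows up as its own LLM span; it assumes an AsyncPortkey client and the same @provider-slug/model-name convention used above:

```python
from portkey_ai import AsyncPortkey

# Hypothetical helper, shown only to illustrate nested, individually traced LLM calls.
async_client = AsyncPortkey(api_key="YOUR_PORTKEY_API_KEY")

async def classify_intent(user_query: str) -> str:
    response = await async_client.chat.completions.create(
        model="@openai-prod/gpt-4.1",
        messages=[
            {"role": "system", "content": "Classify the shopper's intent as one of: browse, compare, buy, support."},
            {"role": "user", "content": user_query},
        ],
        max_tokens=16,
    )
    return response.choices[0].message.content.strip()
```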
CI/CD Integration
Use this integration in your CI/CD pipelines for:
- Automated Model Testing: Run evaluation suites on new model versions
- Quality Gates: Set thresholds for evaluation scores before deployment (see the sketch after this list)
- Performance Monitoring: Track degradation in model quality over time
- Cost Optimization: Monitor and alert on cost spikes
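A minimal quality-gate sketch. The threshold check is plain Python; get_eval_scores is a hypothetical helper standing in for however you export evaluation scores from your FutureAGI project (API, CSV export, etc.), which this guide does not cover, and the score keys mirror the eval names used above:

```python
import sys

PROJECT = "Model-Benchmarking"
# Illustrative thresholds keyed by eval name; adjust to your own export format.
THRESHOLDS = {"Is_Concise": 0.8, "Context_Adherence": 0.7, "Task_Completion": 0.75}

def get_eval_scores(project_name: str) -> dict:
    """Hypothetical helper: return average eval scores for the latest run.

    Replace with your own FutureAGI export mechanism.
    """
    raise NotImplementedError

def main() -> int:
    scores = get_eval_scores(PROJECT)
    failures = {
        name: (scores.get(name, 0.0), minimum)
        for name, minimum in THRESHOLDS.items()
        if scores.get(name, 0.0) < minimum
    }
    for name, (score, minimum) in failures.items():
        print(f"FAIL {name}: {score:.2f} < {minimum:.2f}")
    return 1 if failures else 0  # non-zero exit fails the CI job

if __name__ == "__main__":
    sys.exit(main())
```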
View Results
FutureAGI Dashboard
Navigate to Prototype Tab → “Model-Benchmarking” project:
- Automated evaluation scores
- Quality metrics per response
- Model comparison views
Portkey Dashboard
- Unified logs across providers
- Cost tracking per request
- Latency comparisons
- Token usage analytics
Next Steps