Overview
The Agent Traffic API lets you send web traffic logs to Scrunch from any platform or hosting environment — including CDNs and setups that don’t have a native Scrunch integration. Once data is flowing, Scrunch automatically classifies each request by bot type (retrieval, training, indexer) and agent source (GPTBot, ClaudeBot, and others).
This guide covers:
- Setting up a site with the API platform in the dashboard
- Sending single events and batches
- Backfilling historical log data
- Managing multiple sites (for agencies and multi-brand setups)
- Retry logic and error handling
Prerequisites
- A Scrunch account with Agent Traffic access
- Access to your web server or CDN access logs
- Your site’s domain (e.g., example.com)
Step 1: Add your site in the dashboard
1. In the Scrunch dashboard, open the Agent Traffic page.
2. Click Add website.
3. Enter your domain and select API as the platform.
4. Copy the Site ID (ULID format) and API key (JWT token) that appear after saving.
You will need both values for every request. Each site has its own Site ID and API key — if you are managing multiple domains, repeat this step for each one.
Your site will show a pending status until the first valid request is received. It transitions to active automatically within 5 minutes.
Step 2: Send your first event
Endpoint
POST https://webhooks.scrunchai.com/v1/sites/{site_id}/platforms/custom/web-traffic
Authentication
Include the API key in the X-Api-Key header:
X-Api-Key: <your-jwt-token>
Required fields
| Field | Type | Description |
|---|---|---|
| domain | string | The domain of the site (e.g. example.com) |
| user_agent | string | The full, original User-Agent string from the request |
| url | string | Full URL (e.g. https://example.com/blog/post) |
| path | string | URL path only (e.g. /blog/post) |
| method | string | HTTP method (e.g. GET) |
| status_code | integer | HTTP response status code (e.g. 200) |
| timestamp | integer | Unix epoch in seconds (e.g. 1700000000) |
Optional fields
| Field | Type | Description |
|---|---|---|
| response_time | integer | Response time in milliseconds |
| ip | string | IP address of the requesting client |
Always pass the original, unmodified user_agent string from the incoming request. Scrunch’s bot classification runs entirely off this field. Truncating or transforming it will result in incorrect or missing bot detection.
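To make the "unmodified user_agent" rule concrete, here is a minimal sketch of building an event payload directly from a WSGI environ, where the header is available verbatim. The function name and the WSGI setting are illustrative assumptions, not part of the Scrunch API:

```python
import time

def event_from_wsgi(environ: dict, status_code: int, response_time_ms: int) -> dict:
    """Build an Agent Traffic payload from a WSGI environ dict.

    The User-Agent is copied verbatim from HTTP_USER_AGENT -- no trimming
    or normalization -- so Scrunch's bot classification stays accurate.
    """
    host = environ.get("HTTP_HOST", "")
    path = environ.get("PATH_INFO", "/")
    return {
        "domain": host,
        "user_agent": environ.get("HTTP_USER_AGENT", ""),  # raw, unmodified
        "url": f"https://{host}{path}",
        "path": path,
        "method": environ.get("REQUEST_METHOD", "GET"),
        "status_code": status_code,
        "timestamp": int(time.time()),
        "response_time": response_time_ms,
    }
```

The same principle applies to any framework: read the header from the request object, never from a log field that may have been normalized downstream.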
Single event (cURL)
curl -X POST "https://webhooks.scrunchai.com/v1/sites/{site_id}/platforms/custom/web-traffic" \
-H "Content-Type: application/json" \
-H "X-Api-Key: YOUR_API_KEY" \
-d '{
"domain": "example.com",
"user_agent": "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
"url": "https://example.com/blog/post",
"path": "/blog/post",
"method": "GET",
"status_code": 200,
"timestamp": 1700000000,
"response_time": 120,
"ip": "203.0.113.1"
}'
A successful response returns HTTP 200, confirming the event was accepted and queued for processing.
Step 3: Send batches with NDJSON
For production use, send multiple events per request using newline-delimited JSON (NDJSON). Each line is a complete JSON object. This reduces request overhead and is the recommended approach for any significant traffic volume.
Set Content-Type: application/x-ndjson:
curl -X POST "https://webhooks.scrunchai.com/v1/sites/{site_id}/platforms/custom/web-traffic" \
-H "Content-Type: application/x-ndjson" \
-H "X-Api-Key: YOUR_API_KEY" \
-d '{"domain":"example.com","user_agent":"Mozilla/5.0 (compatible; GPTBot/1.0)","url":"https://example.com/page-1","path":"/page-1","method":"GET","status_code":200,"timestamp":1700000000}
{"domain":"example.com","user_agent":"Mozilla/5.0 (compatible; ClaudeBot/1.0)","url":"https://example.com/page-2","path":"/page-2","method":"GET","status_code":200,"timestamp":1700000060,"response_time":95}'
Keep each batch under 1 MB uncompressed. Split larger payloads into multiple requests.
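If you build NDJSON bodies yourself, it is worth enforcing the one-object-per-line format and the 1 MB limit at serialization time. A minimal sketch (the helper name is illustrative):

```python
import json

MAX_BATCH_BYTES = 1_000_000  # 1 MB uncompressed limit per request

def to_ndjson(events: list[dict]) -> bytes:
    """Serialize events as newline-delimited JSON, one object per line."""
    body = "\n".join(json.dumps(e) for e in events) + "\n"
    encoded = body.encode("utf-8")
    if len(encoded) > MAX_BATCH_BYTES:
        raise ValueError("batch exceeds 1 MB uncompressed; split into smaller requests")
    return encoded
```

Because `json.dumps` never emits literal newlines, each event is guaranteed to occupy exactly one line of the body.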
Step 4: Backfill historical data with Python
If you have existing access logs, use this script to send them in batches. It reads a CSV of log entries, maps fields to the API schema, and sends NDJSON batches with retry handling for rate limits.
Your CSV should have columns matching the required and optional fields. At minimum:
timestamp,domain,user_agent,url,path,method,status_code,response_time_ms,ip_address
1700000000,example.com,"Mozilla/5.0 (compatible; GPTBot/1.0)",https://example.com/page,/page,GET,200,120,203.0.113.1
Backfill script
import csv
import json
import time

import requests

API_KEY = "your-jwt-token"
SITE_ID = "your-site-id"
ENDPOINT = f"https://webhooks.scrunchai.com/v1/sites/{SITE_ID}/platforms/custom/web-traffic"
BATCH_SIZE_BYTES = 1_000_000  # 1 MB per batch


def load_payloads(csv_path: str) -> list[dict]:
    """Read a CSV of access log rows and map to API payload format."""
    payloads = []
    with open(csv_path, encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            domain = row.get("domain", "")
            path = row.get("path", "/") or "/"
            payload = {
                "domain": domain,
                "user_agent": row.get("user_agent", ""),
                "url": row.get("url", "") or f"https://{domain}{path}",
                "path": path,
                "method": row.get("method", "GET") or "GET",
                "status_code": int(row.get("status_code", "200") or "200"),
                "timestamp": int(row.get("timestamp", "0") or "0"),
                "response_time": int(row.get("response_time_ms", "0") or "0"),
                "ip": row.get("ip_address") or None,
            }
            # Drop empty optional fields rather than sending null values.
            payloads.append({k: v for k, v in payload.items() if v is not None})
    return payloads


def build_batches(payloads: list[dict], max_bytes: int = BATCH_SIZE_BYTES) -> list[list[dict]]:
    """Split payloads into batches that fit within max_bytes uncompressed."""
    batches, current, current_size = [], [], 0
    for p in payloads:
        size = len(json.dumps(p).encode()) + 1  # +1 for newline
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(p)
        current_size += size
    if current:
        batches.append(current)
    return batches


def send_batch(batch: list[dict], retries: int = 3) -> None:
    """Send a single NDJSON batch with retry logic for rate limits."""
    ndjson = "\n".join(json.dumps(p) for p in batch) + "\n"
    for attempt in range(retries):
        response = requests.post(
            ENDPOINT,
            data=ndjson.encode("utf-8"),
            headers={
                "Content-Type": "application/x-ndjson",
                "X-Api-Key": API_KEY,
            },
            timeout=60,
        )
        if response.status_code == 200:
            return
        if response.status_code == 429:
            wait = int(response.headers.get("Retry-After", 5))
            print(f"Rate limited. Retrying in {wait}s...")
            time.sleep(wait)
        else:
            response.raise_for_status()
    raise RuntimeError(f"Failed to send batch after {retries} attempts")


def main(csv_path: str) -> None:
    payloads = load_payloads(csv_path)
    batches = build_batches(payloads)
    print(f"Loaded {len(payloads)} events across {len(batches)} batch(es)")
    for i, batch in enumerate(batches, 1):
        print(f"Sending batch {i}/{len(batches)} ({len(batch)} events)...")
        send_batch(batch)
        print(f"  Batch {i} sent successfully")
    print("Done.")


if __name__ == "__main__":
    import sys

    main(sys.argv[1])
Run it:
python backfill.py your_logs.csv
Managing multiple sites
If you are an agency or managing multiple brands, each domain requires its own site entry in the dashboard with its own Site ID and API key. The sending logic is identical across all sites — only the site_id in the URL and the X-Api-Key header change per site.
A common pattern for multi-site setups:
SITES = [
    {"site_id": "01ABC...", "api_key": "token-for-site-a", "domain": "brand-a.com"},
    {"site_id": "01DEF...", "api_key": "token-for-site-b", "domain": "brand-b.com"},
]

for site in SITES:
    # Filter payloads for this domain, then send.
    site_payloads = [p for p in all_payloads if p["domain"] == site["domain"]]
    # ... send using site["site_id"] and site["api_key"]
This approach scales well when onboarding many brands: provision each site in the dashboard, collect credentials, and run the same pipeline with different configuration per site.
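The filtering step in the pattern above can be factored into a reusable helper. This is a sketch, not Scrunch-provided tooling; the function name is illustrative:

```python
def group_by_site(all_payloads: list[dict], sites: list[dict]) -> dict[str, list[dict]]:
    """Map each site_id to the payloads whose domain belongs to that site.

    Payloads for domains with no configured site are dropped, so a stray
    domain in your logs cannot be sent with the wrong credentials.
    """
    by_domain = {site["domain"]: site["site_id"] for site in sites}
    grouped: dict[str, list[dict]] = {site["site_id"]: [] for site in sites}
    for p in all_payloads:
        site_id = by_domain.get(p["domain"])
        if site_id is not None:
            grouped[site_id].append(p)
    return grouped
```

Each resulting group is then sent to `.../v1/sites/{site_id}/platforms/custom/web-traffic` with that site's own `X-Api-Key`.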
Error handling
| Status | Meaning | Action |
|---|---|---|
| 200 | Accepted and queued | No action needed |
| 401 | Invalid or missing API key | Verify the X-Api-Key value and header name |
| 422 | Validation error | Check that all required fields are present and correctly typed |
| 429 | Rate limited | Wait and retry; respect the Retry-After response header |
| 500 | Server error | Retry with exponential backoff; contact support if persistent |
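The "exponential backoff" recommended for 500s can be wrapped around any send function. A minimal sketch (the helper name and defaults are illustrative):

```python
import time

def send_with_backoff(send, max_attempts: int = 5, base_delay: float = 1.0) -> None:
    """Retry a transient failure with exponentially growing delays.

    `send` is any zero-argument callable that raises on a retryable error
    (e.g. a wrapper that calls response.raise_for_status() on a 500).
    """
    for attempt in range(max_attempts):
        try:
            send()
            return
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted retries; surface the error
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, 8s, ...
            time.sleep(delay)
```

In practice you would also cap the delay and add jitter so many clients do not retry in lockstep.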
Troubleshooting
Site is stuck in pending status
The site activates within 5 minutes of the first valid request. If it remains pending, confirm a request was actually sent (not a dry run), check that the Site ID in the URL matches the one in the dashboard, and verify the API key is correct.
Bot traffic is not being classified
Bot classification is derived entirely from the user_agent field. Confirm you are passing the raw, original user-agent string from the incoming request without modification. Check your log format — some CDNs normalize or truncate user-agent strings before writing them to logs. If so, use a logging integration that captures the original header.
Getting 422 errors
The most common cause is a missing required field or an incorrect type. Check that timestamp is a Unix epoch integer (not ISO 8601), status_code is an integer (not a string), and path starts with a /.
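These checks can be run client-side before sending, which turns opaque 422s into actionable messages. A sketch based on the required-fields table above (the function itself is illustrative, not part of the API):

```python
# Required fields and their expected JSON types, per the API schema.
REQUIRED = {
    "domain": str,
    "user_agent": str,
    "url": str,
    "path": str,
    "method": str,
    "status_code": int,
    "timestamp": int,
}

def validation_errors(payload: dict) -> list[str]:
    """Return a list of schema problems likely to trigger a 422."""
    errors = []
    for field, expected in REQUIRED.items():
        value = payload.get(field)
        if value is None:
            errors.append(f"missing required field: {field}")
        elif not isinstance(value, expected):
            errors.append(
                f"{field} should be {expected.__name__}, got {type(value).__name__}"
            )
    path = payload.get("path")
    if isinstance(path, str) and not path.startswith("/"):
        errors.append("path must start with /")
    return errors
```

Running this over a sample of your payloads before a large backfill catches type mistakes (string status codes, ISO 8601 timestamps) cheaply.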
NDJSON batches are being rejected
Each line must be a complete, valid JSON object with no embedded newlines. The Content-Type header must be exactly application/x-ndjson. Keep batch size under 1 MB uncompressed.