
Equistamp 0.0.1

3rd Party AI Evaluation Service: Setting & Protecting the Global Standard of AI Safety


Endpoints


GET /auth

Description

Get the current user.

Use the fields parameter if you only want specific fields. This can also be used to get a long-lived API token, e.g.:

import requests

res = requests.put(
    'https://equistamp.net/auth',
    json={'email': '<your email address>', 'password': '<your password>'}
)
if res.status_code == 403:
    raise ValueError(f'Invalid email or password: {res.json()}')

session_token = res.json()['session_token']

res = requests.get(
    'https://equistamp.net/auth',
    headers={'Session-Token': session_token},
    params={'fields': 'api_token'}
)

if res.status_code != 200:
    raise ValueError(res.json())

api_token = res.json()['api_token']

Input parameters

Parameter | In | Type | Nullable | Description
fields | query | string | No | Specific fields to be returned in the response, separated by commas - if this is used, only the specified fields will be returned

Responses

{
    "id": "f801655d-5f3c-492c-b815-86105e52d772",
    "email_address": "mr.blobby@some.domain",
    "user_name": "mr_blobby",
    "full_name": "Mr Blobby, esq.",
    "user_image": "https://equistamp.com/avatars/123123123123.png",
    "bio": "Hello, my name is Inigo Montoya. You Killed my Father. Prepare to die",
    "display_options": {
        "bio": true,
        "email_address": true,
        "user_image": false
    },
    "join_date": "2022-04-13",
    "subscription_level": "pro",
    "alerts": [
        {
            "id": "acfd47de-772f-4fc2-bced-77e9aae9e369",
            "name": "They are coming!!",
            "description": "string",
            "public": true,
            "last_trigger_date": "2022-04-13T15:42:05.901Z",
            "trigger_cooldown": "string",
            "owner_id": "959b0298-bf7c-4912-9037-f86a4107448a",
            "triggers": [
                "8297647c-b499-4cb3-bf85-94dcbf150d12"
            ],
            "subscriptions": [
                "76f90940-48c7-4610-a900-f400cc7167eb"
            ]
        }
    ]
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "object",
    "properties": {
        "id": {
            "type": "string",
            "format": "uuid"
        },
        "email_address": {
            "type": "string",
            "description": "The email address of this user. User for logging in, so must be unique.",
            "format": "email",
            "example": "mr.blobby@some.domain"
        },
        "user_name": {
            "type": "string",
            "description": "The user name. Used for logging in and as a unique, human readable identifier of this user",
            "example": "mr_blobby"
        },
        "full_name": {
            "type": "string",
            "description": "The presentable name of this user. This can be any string",
            "nullable": true,
            "example": "Mr Blobby, esq."
        },
        "user_image": {
            "type": "string",
            "description": "The user avatar, as bytes when uploading, and its URL when fetching",
            "nullable": true,
            "example": "https://equistamp.com/avatars/123123123123.png"
        },
        "bio": {
            "type": "string",
            "description": "A description of this user. Will be rendered as markdown on the website",
            "nullable": true,
            "example": "Hello, my name is Inigo Montoya. You Killed my Father. Prepare to die"
        },
        "display_options": {
            "description": "A mapping of <displayable field> to true/false, which controls what will be displayed to other users. No option which is not explicitly enabled will be shown to anyone else than you or system admins. To illustrate, the attached example will only allow the user's bio and email address to be returned when other users call this endpoint, and all other fields will not be returned.",
            "type": "object",
            "additonalProperties": "boolean",
            "example": {
                "bio": true,
                "email_address": true,
                "user_image": false
            }
        },
        "join_date": {
            "type": "string",
            "format": "date"
        },
        "subscription_level": {
            "type": "string",
            "description": "The current subscription level of this user",
            "enum": [
                "admin",
                "free",
                "enterprise",
                "pro"
            ],
            "example": "pro"
        },
        "alerts": {
            "type": "array",
            "items": {
                "$ref": "#/components/schemas/ShallowAlert"
            }
        }
    }
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: NotFound.

Refer to the common response description: Error.


PUT /auth

Log in the provided user, or send an email with a login link.

Description

This endpoint handles logging in, both when valid credentials are provided, and when the user needs to reset their password. This happens depending on the provided JSON body:

  1. If login credentials are provided, then try to log the user in - if this fails, a 401 will be returned
  2. If reset_email is provided, assume that the user has forgotten their password. If this email can be found in the system, then send them an email with a log in link. Either way, this will always return a 200, to avoid leaking email addresses.

Log in credentials are a user identifier and a password. The following are supported:

  • username - this is the user name of the user (not the display name)
  • email - the email of the user
  • login - this will accept either the email or username

The result of logging in is a JSON object with a Session-Token. This should be provided as the Session-Token header on subsequent calls to the API to authenticate the user. The token will expire after a week of inactivity, but otherwise will be refreshed while using the system.
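The flow above can be sketched in Python; the helper names here are illustrative, not part of the API:

```python
import time

import requests


def login(identifier, password):
    """Log in via PUT /auth. The `login` field accepts either an email
    address or a username, per the description above."""
    res = requests.put('https://equistamp.net/auth',
                       json={'login': identifier, 'password': password})
    if res.status_code == 401:
        raise ValueError(f'Invalid credentials: {res.text}')
    data = res.json()
    return data['session_token'], data['token_expiration']


def token_expired(token_expiration, now=None):
    """`token_expiration` is a POSIX timestamp (see the response schema)."""
    return (time.time() if now is None else now) >= token_expiration
```

Subsequent calls then pass the token as the `Session-Token` header, as in the GET /auth example above. A forgotten password goes through the same endpoint: a `PUT /auth` with a `reset_email` body always returns a 200, whether or not the address exists.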

Request body

{
    "username": "mr_blobby",
    "email": "mr_blobby@bla.com",
    "login": "mr_blobby@bla.com",
    "password": "hunter2",
    "reset_email": "bla@bla.com"
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "username": {
            "type": "string",
            "example": "mr_blobby"
        },
        "email": {
            "type": "string",
            "example": "mr_blobby@bla.com"
        },
        "login": {
            "type": "string",
            "example": "mr_blobby@bla.com"
        },
        "password": {
            "type": "string",
            "format": "password",
            "example": "hunter2"
        },
        "reset_email": {
            "type": "string",
            "format": "email",
            "example": "bla@bla.com",
            "description": "Used when resetting a password. A login link will be sent to this email, but only if can be found in the system. When missing, this will fail silently, i.e. a 200 will be returned"
        }
    }
}

Responses

Schema of the response body
{
    "oneOf": [
        {
            "type": "object",
            "description": "Returned when the user successfully logs in",
            "properties": {
                "session_token": {
                    "type": "string",
                    "format": "uuid",
                    "description": "The session token of the logged in user. This should be sent as the \"Session-Token\" header on all subsequent calls. "
                },
                "token_expiration": {
                    "type": "number",
                    "format": "int32",
                    "description": "The POSIX timestamp when this token will expire. Generally in a weeks time."
                }
            }
        },
        {
            "type": "string",
            "description": "This is returned in the case of a password reset."
        }
    ]
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Error.


POST /alert

Create a new alert.

Description

This will create a new alert.
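For illustration, a minimal creation call might look like the following sketch (the `Api-Token` header matches the /dsltest example below; the helper names and field values are ours):

```python
import requests


def alert_payload(name, public=False, description=None,
                  trigger_cooldown=None, triggers=(), subscriptions=()):
    """Build a POST /alert body, omitting optional fields left unset."""
    body = {'name': name, 'public': public,
            'triggers': list(triggers), 'subscriptions': list(subscriptions)}
    if description is not None:
        body['description'] = description
    if trigger_cooldown is not None:
        body['trigger_cooldown'] = trigger_cooldown
    return body


def create_alert(api_token, **fields):
    res = requests.post('https://equistamp.net/alert',
                        headers={'Api-Token': api_token},
                        json=alert_payload(**fields))
    res.raise_for_status()
    return res.json()  # the full alert, including its new `id`
```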

Request body

{
    "name": "They are coming!!",
    "description": "string",
    "public": true,
    "trigger_cooldown": "string",
    "triggers": [
        "eeb0632b-1935-4498-b1c1-bc3e0664e234"
    ],
    "subscriptions": [
        "efa1e33e-28ba-4ea5-a8cf-824625443d3e"
    ]
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "name": {
            "type": "string",
            "description": "The name of the alert, displayed in the list of alerts",
            "example": "They are coming!!"
        },
        "description": {
            "type": "string",
            "nullable": true
        },
        "public": {
            "type": "boolean"
        },
        "trigger_cooldown": {
            "type": "string",
            "description": "How often the trigger can fire",
            "nullable": true
        },
        "triggers": {
            "type": "array",
            "items": {
                "type": "string",
                "format": "uuid"
            }
        },
        "subscriptions": {
            "type": "array",
            "items": {
                "type": "string",
                "format": "uuid"
            }
        }
    }
}

Responses

{
    "id": "b6b01bfa-3e24-4610-ba4f-6b286a05d0b2",
    "name": "They are coming!!",
    "description": "string",
    "public": true,
    "last_trigger_date": "2022-04-13T15:42:05.901Z",
    "trigger_cooldown": "string",
    "owner_id": "922dd638-11ac-4a3d-8191-7e183aa239da",
    "triggers": [
        {
            "id": "3cdfd9dd-8ee4-4cd7-b745-85bef97634e6",
            "type": "string",
            "invert": true,
            "metric": "string",
            "threshold": 10.12,
            "models": null,
            "evaluations": null,
            "alert_id": "1b9aeafd-2cfc-477e-8935-7b2e379d261d"
        }
    ],
    "subscriptions": [
        {
            "confirmed": true,
            "method": "string",
            "destination": "string"
        }
    ]
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "object",
    "properties": {
        "id": {
            "type": "string",
            "format": "uuid"
        },
        "name": {
            "type": "string",
            "description": "The name of the alert, displayed in the list of alerts",
            "example": "They are coming!!"
        },
        "description": {
            "type": "string",
            "nullable": true
        },
        "public": {
            "type": "boolean"
        },
        "last_trigger_date": {
            "type": "string",
            "format": "date-time",
            "nullable": true
        },
        "trigger_cooldown": {
            "type": "string",
            "description": "How often the trigger can fire",
            "nullable": true
        },
        "owner_id": {
            "type": "string",
            "format": "uuid"
        },
        "triggers": {
            "type": "array",
            "items": {
                "$ref": "#/components/schemas/ShallowTrigger"
            }
        },
        "subscriptions": {
            "type": "array",
            "items": {
                "$ref": "#/components/schemas/ShallowSubscriberAlert"
            }
        }
    }
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: Error.


GET /alert

Input parameters

Parameter | In | Type | Nullable | Description
endCreationDate | query | string | Yes | Filter out all alerts that were created after this date
endPredictedTriggerDate | query | string | Yes | Filter out all alerts that are expected to trigger after this date
evaluations | query | array | Yes | A list of evaluation ids. Only alerts pertaining to these evaluations will be returned
id | query | string | Yes | Will return the item with this id, or die trying. When this parameter is provided, only a single item will be returned
maxThreshold | query | number | Yes | Filter out all alerts that have a higher threshold than provided
minThreshold | query | number | Yes | Filter out all alerts that have a lower threshold than provided
models | query | array | Yes | A list of model ids. Only alerts pertaining to these models will be returned
order_by | query | string | Yes | Sort the returned results in ascending order
owner_id | query | string | Yes | Return all alerts belonging to the given owner. If `me` is provided, all alerts of the caller will be returned
startCreationDate | query | string | Yes | Filter out all alerts that were created before this date
startPredictedTriggerDate | query | string | Yes | Filter out all alerts that are expected to trigger before this date
subscriber_id | query | string | Yes | Return all alerts subscribed to by the given owner. If `me` is provided, subscribed alerts of the caller will be returned. The caller must be allowed to filter by subscriber_id - it's not something everyone can do
triggerCooldown | query | string | Yes | Filter by how often the alert can be triggered

Responses

Schema of the response body
{
    "oneOf": [
        {
            "$ref": "#/components/schemas/Alert"
        },
        {
            "type": "object",
            "properties": {
                "items": {
                    "description": "An array of all the items that were found, but capped at most at `per_page`",
                    "type": "array",
                    "items": {
                        "$ref": "#/components/schemas/Alert"
                    }
                },
                "count": {
                    "description": "The total number of items found",
                    "type": "number",
                    "format": "int32"
                },
                "per_page": {
                    "description": "The number of items returned per page",
                    "type": "number",
                    "format": "int32"
                },
                "page": {
                    "description": "The number of available pages",
                    "type": "number",
                    "format": "int32"
                }
            }
        }
    ]
}
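Note the `oneOf`: passing `id` yields a bare Alert, while filter queries yield the paginated envelope. A small sketch for handling both shapes (the helper names are ours):

```python
import math


def unwrap_alerts(response_json):
    """Normalize a GET /alert response to a list of alerts, whether it is
    a single Alert (an `id` lookup) or a paginated envelope."""
    if 'items' in response_json:
        return response_json['items']
    return [response_json]


def total_pages(count, per_page):
    """Pages implied by the envelope's `count` and `per_page` fields."""
    return math.ceil(count / per_page) if per_page else 0
```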

Refer to the common response description: NotFound.

Refer to the common response description: Error.


PUT /alert

Request body

{
    "name": "They are coming!!",
    "description": "string",
    "public": true,
    "trigger_cooldown": "string",
    "triggers": [
        "022153ca-3866-4196-8e57-88bac2e73275"
    ],
    "subscriptions": [
        "04807a58-388a-43d5-af71-826e23ffee52"
    ]
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "name": {
            "type": "string",
            "description": "The name of the alert, displayed in the list of alerts",
            "example": "They are coming!!"
        },
        "description": {
            "type": "string",
            "nullable": true
        },
        "public": {
            "type": "boolean"
        },
        "trigger_cooldown": {
            "type": "string",
            "description": "How often the trigger can fire",
            "nullable": true
        },
        "triggers": {
            "type": "array",
            "items": {
                "type": "string",
                "format": "uuid"
            }
        },
        "subscriptions": {
            "type": "array",
            "items": {
                "type": "string",
                "format": "uuid"
            }
        }
    }
}

Responses

"Alert updated"
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "string",
    "enum": [
        "Alert updated"
    ]
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: NotFound.

Refer to the common response description: Error.


POST /dsltest

Check whether DSL code fragments are correct.

Description

This endpoint will execute a provided DSL fragment and return the result. It will be run with test data, but you can use it to call your models or whatever. Queries that take too long will be terminated.

DSL Phases

There are four places where the DSL is used:

  • Constructing prompts
  • Sending requests to models
  • Parsing the responses that models return
  • Grading the parsed responses

These four steps happen sequentially for each task. This endpoint only checks one phase, which you must specify. That being said, there's nothing stopping you from chaining all four, e.g.:

import requests

API_KEY = "<your api key goes here>"

def run_code(code, stage, overrides=None):
    headers = {'Api-Token': API_KEY}
    res = requests.post(
        'https://equistamp.net/dsltest',
        headers=headers,
        json={'code': code, 'stage': stage, 'context': overrides or {}},
    )
    if res.status_code != 200:
        raise ValueError(f'bad request: {res.text}')
    return res.json()['result']

prompt = run_code('(str "Do something with this task: " task)', 'prompt')
response = run_code('(POST "https://your.model/endpoint" {:json {"prompt" prompt}})', 'request', {"prompt": prompt})
parsed_response = run_code('(get-in response ["path" "to" "response"])', 'response', {"response": response})
grader_result = run_code('parsed-response', 'grader', {"response": response, "parsed-response": parsed_response})

print(grader_result)

Context

When starting a request, a context is created with useful constants:

Base constants

  • task - the text of the task to be completed
  • endpoint_type - the type of endpoint - possible values are: aws, together.ai, conversational, google_cloud, azure, text-generation, anthropic, fill-mask, zero-shot-classification, custom, open_ai, text2text-generation, mistral
  • cache - An atom containing a cache that can be used to store data between requests. Acts as a map, so items can be accessed via (get @cache <key>) and set via (swap! cache assoc <key> <val>).

Task specific context

Multiple choice tasks

In the case of multiple choice tasks, the following are also available:

  • num_choices - the number of available choices
  • letter-choices - the letters corresponding to the available choices
  • correct - the letters of all correct answers - only available to the Grader

Boolean tasks

Boolean tasks (i.e. true/false) will add the following to the grader's context:

  • correct - whether the current task is true or false

Free response tasks

Free response tasks are tasks that expect arbitrary text. These kinds of tasks don't really have "correct" answers that can be saved, so much as phrases that are similar to what is expected. For example, "What is a group of whales called?" could be answered with "A pod", "Pod", "it's a pod" or other such combinations, all of which are correct. You could also accept "a family", which is sort of correct, in that some species are very matrilineal, but others form more casual pods. There is also "school", which in general applies to fish, but is sometimes also used for whales. On the other hand, "a gander" or "a murder" are flat out incorrect, as those apply to birds.

To help manage this, we support positive-examples, a list of strings that are close to the kind of response you're expecting, and negative-examples, a list of strings that are opposite in meaning to what you expect.

The default grader uses cosine similarity to check responses. It compares the model's response against all positive and negative examples, with similarities normalized to [0, 1]. For negative examples the complement of the similarity is used, since the idea is that they are opposite in meaning (as opposed to just maximally dissimilar). The maximum of these values is then returned as the correctness score for that task.
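A rough Python sketch of this scoring scheme (the shift from [-1, 1] to [0, 1] is our assumed normalization, and the toy `embedder` stands in for the one provided in the grader's context):

```python
import math


def cosine(a, b):
    """Plain cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def grade(embedder, response, positive_examples, negative_examples):
    """Max over positive similarities and complements of negative ones."""
    def sim(text):
        # Assumed normalization of cosine similarity to [0, 1]
        return (cosine(embedder(response), embedder(text)) + 1) / 2
    scores = [sim(p) for p in positive_examples]
    scores += [1 - sim(n) for n in negative_examples]
    return max(scores)
```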

The following will be added to the grader's context:

  • positive-examples - a list of strings that should be similar to the model's response
  • negative-examples - a list of strings that should be opposite to the model's response
  • embedder - a one argument function that receives a string and returns an embedding vector

JSON tasks

JSON tasks expect the model to answer with correct JSON according to a schema. The schema will be added to the context.

  • schema - the expected schema of the resulting JSON object

Stage context

Each subsequent stage (request, response, grader) has the values produced by the previous stages added to its context:

Request
  • prompt - the prompt to be sent to the model
Response
  • response - the result of the Request DSL call
Grader
  • parsed-response - the result of the Response call

Request body

{
    "code": "(get-in response [:json \"value\"])",
    "stage": "response"
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "code": {
            "description": "The DSL code to be evaluated",
            "type": "string",
            "example": "(get-in response [:json \"value\"])"
        },
        "stage": {
            "description": "The kind of DSL code to be tested",
            "example": "response",
            "type": "string",
            "enum": [
                "system_prompt",
                "prompt",
                "request",
                "response",
                "grader"
            ]
        },
        "context": {
            "description": "Additional items to be added to the execution context",
            "type": "object",
            "properties": {
                "task-type": {
                    "description": "The type of task to be used. Must be one of \"FRQ\", \"MCQ\", \"bool\", \"json\"",
                    "example": "MCQ"
                },
                "response": {
                    "description": "The response used when testing 'response' DSL code. If not provided, a dummy value will be used",
                    "example": {
                        "json": {
                            "value": "bla bla"
                        }
                    }
                },
                "parsed-response": {
                    "description": "The parsed_response used when testing 'grader' DSL code. If not provided, a dummy value will be used",
                    "example": "bla bla"
                }
            },
            "additionalProperties": true
        }
    }
}

Responses

{
    "result": null
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "object",
    "properties": {
        "result": {
            "description": "This will be whatever the code returned"
        }
    }
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: Error.


POST /evaluation

Create a new evaluation.

Description

Adding tasks to new evaluations

There are three ways to add tasks to evaluations:

  1. directly during creation by providing a CSV with tasks via the csv_url and columns_mapping parameters
  2. by sending a tasks CSV to the /evaluationbuilderhandler endpoint
  3. by uploading tasks directly via the /task endpoint

The first option is recommended, as it will automatically call the /evaluationbuilderhandler endpoint for you, once the evaluation is created.
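A sketch of the recommended CSV route (payload keys follow the request schema below; the helper names are ours):

```python
import requests


def evaluation_payload(name, csv_url, columns_mapping,
                       default_task_type='MCQ', public=False):
    """Build a POST /evaluation body using the CSV route."""
    return {'name': name, 'public': public, 'csv_url': csv_url,
            'default_task_type': default_task_type,
            'columns_mapping': columns_mapping}


def create_evaluation(api_token, **fields):
    res = requests.post('https://equistamp.net/evaluation',
                        headers={'Api-Token': api_token},
                        json=evaluation_payload(**fields))
    res.raise_for_status()
    return res.json()
```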

Request body

{
    "name": "My lovely evaluation",
    "public": true,
    "public_usable": false,
    "reports_visible": false,
    "description": "# This is an evaluation, see more at [this link](http://some.link)",
    "task_types": "MCQ",
    "modalities": "text",
    "min_questions_to_complete": 321,
    "tags": [
        "f7b6acf2-f8ea-45dc-a47f-9fcf8af5eb79"
    ],
    "csv_url": "https://example.com",
    "default_task_type": "MCQ",
    "columns_mapping": {
        "Question col": {
            "columnType": "question"
        },
        "Paraphrase of question": {
            "columnType": "paraphrase",
            "paraphraseOf": "Question col"
        }
    },
    "references": {
        "bla": {
            "schema": {
                "properties": {
                    "name": {
                        "type": "string"
                    }
                }
            },
            "name": "My wonderful schema",
            "description": "Some description here"
        },
        "other-name_with.interpunction123": {
            "schema": {
                "properties": {
                    "name": {
                        "type": "string"
                    }
                }
            }
        }
    },
    "prompt": "(str \"Please answer this question: \" task)",
    "grader": {
        "MCQ": "(= parsedResponse correct)",
        "default": "false"
    }
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "name": {
            "type": "string",
            "example": "My lovely evaluation"
        },
        "public": {
            "type": "boolean",
            "description": "Whether this evaluation should be publicly visible. If true, anyone can view its details or evaluate models with it"
        },
        "public_usable": {
            "type": "boolean",
            "description": "Whether this evaluation can be ran by anyone. To avoid tasks being leaked, you might want to have the results shown, but have control over what it can be run on.",
            "example": false
        },
        "reports_visible": {
            "type": "boolean",
            "description": "Whether anyone can pay to see reports for this evaluation.",
            "example": false
        },
        "description": {
            "type": "string",
            "description": "The description of this evaluation, as displayed on the site. Markdown can be used for formatting",
            "nullable": true,
            "example": "# This is an evaluation, see more at [this link](http://some.link)"
        },
        "task_types": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "The types of tasks supported by this evaluation",
            "enum": [
                "FRQ",
                "bool",
                "json",
                "MCQ"
            ],
            "example": "MCQ"
        },
        "modalities": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "The available modalities of this evaluation",
            "enum": [
                "text"
            ],
            "example": "text"
        },
        "min_questions_to_complete": {
            "type": "integer",
            "format": "int64",
            "description": "The default number of tasks to run before an evaluation session is deemed finished.\nA given evaluation session may process more tasks, as starting a new evaluation session for an evaluation/model pair which is already running will just add more tasks to the current session, rather than starting a new one.",
            "nullable": true,
            "example": 321
        },
        "tags": {
            "type": "array",
            "items": {
                "type": "string",
                "format": "uuid"
            }
        },
        "csv_url": {
            "description": "The URL of a CSV file containing the tasks of the new evaluation",
            "example": "https://example.com",
            "type": "string"
        },
        "default_task_type": {
            "description": "The default type of tasks - can be overrode on a per row basis. Will use \"MCQ\" if not set",
            "example": "MCQ",
            "nullable": true,
            "type": "string",
            "enum": [
                "FRQ",
                "bool",
                "json",
                "MCQ"
            ]
        },
        "columns_mapping": {
            "description": "A mapping that specifies which CSV columns contain which types of data. See the [Evaluation Builder](#post-evaluationbuilderhandler) endpoint for details",
            "type": "object",
            "example": {
                "Question col": {
                    "columnType": "question"
                },
                "Paraphrase of question": {
                    "columnType": "paraphrase",
                    "paraphraseOf": "Question col"
                }
            },
            "additionalProperties": {
                "$ref": "#/components/schemas/ColumnMapping"
            }
        },
        "references": {
            "description": "A mapping of keys to schemas. The keys can contain ASCII alphanumeric characters, \"-\", \"_\" and \".\".",
            "type": "object",
            "additionalProperties": {
                "type": "object",
                "properties": {
                    "schema": {
                        "type": "object",
                        "description": "The JSON schema to be used"
                    },
                    "name": {
                        "type": "string",
                        "description": "An optional name for this schema - this will only be used for displaying, the actual matching is done by comparing the keys of the `references` object."
                    },
                    "description": {
                        "type": "string",
                        "description": "An optional description for this schema"
                    },
                    "type": {
                        "type": "string",
                        "enum": [
                            "json"
                        ],
                        "description": "The type of schema. If not provided, will be assumed to be JSON",
                        "example": "json"
                    }
                },
                "required": [
                    "schema"
                ]
            },
            "example": {
                "bla": {
                    "schema": {
                        "properties": {
                            "name": {
                                "type": "string"
                            }
                        }
                    },
                    "name": "My wonderful schema",
                    "description": "Some description here"
                },
                "other-name_with.interpunction123": {
                    "schema": {
                        "properties": {
                            "name": {
                                "type": "string"
                            }
                        }
                    }
                }
            }
        },
        "prompt": {
            "description": "DSL code defining how to create prompts. See the [DSL page](/docs/dsl/) for more info.",
            "example": "(str \"Please answer this question: \" task)"
        },
        "grader": {
            "description": "DSL code specifying how to grade LLM responses. This can be empty, in which case the default grader will be used. You can specify a grader that will be used for all types of tasks, or per task type graders. If you provide both a default grader and one for a specific task type, the specific one takes precedence.",
            "oneOf": [
                {
                    "type": "string",
                    "description": "DSL code that should be used for all response",
                    "example": "(= parsedResponse \"ok\")"
                },
                {
                    "type": "object",
                    "description": "Per task type DSL code. Use the \"default\" key to specify the code that should be used for tasks types that aren't specified - otherwise the system default grader will be used.",
                    "properties": {
                        "default": {
                            "type": "string",
                            "description": "The default grader to be used for task types that aren't specified.",
                            "example": "(if (= parsedResponse correct) 1 0)"
                        },
                        "FRQ": {
                            "type": "string",
                            "description": "The DSL code to be used to grade FRQ tasks. If this is empty, the default grader will be used"
                        },
                        "bool": {
                            "type": "string",
                            "description": "The DSL code to be used to grade bool tasks. If this is empty, the default grader will be used"
                        },
                        "json": {
                            "type": "string",
                            "description": "The DSL code to be used to grade json tasks. If this is empty, the default grader will be used"
                        },
                        "MCQ": {
                            "type": "string",
                            "description": "The DSL code to be used to grade MCQ tasks. If this is empty, the default grader will be used"
                        }
                    },
                    "example": {
                        "MCQ": "(= parsedResponse correct)",
                        "default": "false"
                    }
                }
            ],
            "example": {
                "MCQ": "(= parsedResponse correct)",
                "default": "false"
            }
        }
    }
}

Responses

{
    "id": "9f65948b-0839-4704-94c1-a74682d43594",
    "name": "My lovely evaluation",
    "public": true,
    "public_usable": false,
    "reports_visible": false,
    "quality": 0.89,
    "num_tasks": 2000,
    "description": "# This is an evaluation, see more at [this link](http://some.link)",
    "last_updated": "2022-04-13T15:42:05.901Z",
    "task_types": "MCQ",
    "modalities": "text",
    "min_questions_to_complete": 321,
    "owner": {
        "id": "2120c29a-ed02-4065-bbba-c5ada79d7c47",
        "email_address": "mr.blobby@some.domain",
        "user_name": "mr_blobby",
        "full_name": "Mr Blobby, esq.",
        "user_image": "https://equistamp.com/avatars/123123123123.png",
        "bio": "Hello, my name is Inigo Montoya. You Killed my Father. Prepare to die",
        "display_options": {
            "bio": true,
            "email_address": true,
            "user_image": false
        },
        "join_date": "2022-04-13",
        "subscription_level": "pro",
        "alerts": [
            "88103840-7fe3-41a2-b492-230df4dac99d"
        ]
    },
    "tags": [
        {
            "id": "53cbd07d-fa52-4dc8-bfd1-10c3588d2174",
            "name": "string"
        }
    ]
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "object",
    "properties": {
        "id": {
            "type": "string",
            "format": "uuid"
        },
        "name": {
            "type": "string",
            "example": "My lovely evaluation"
        },
        "public": {
            "type": "boolean",
            "description": "Whether this evaluation should be publicly visible. If true, anyone can view its details or evaluate models with it"
        },
        "public_usable": {
            "type": "boolean",
            "description": "Whether this evaluation can be ran by anyone. To avoid tasks being leaked, you might want to have the results shown, but have control over what it can be run on.",
            "example": false
        },
        "reports_visible": {
            "type": "boolean",
            "description": "Whether anyone can pay to see reports for this evaluation.",
            "example": false
        },
        "quality": {
            "type": "number",
            "format": "double",
            "description": "The quality of this evaluation, i.e. how much it can be trusted, from 0 to 1.",
            "example": 0.89
        },
        "num_tasks": {
            "type": "integer",
            "format": "int64",
            "description": "The total number of tasks defined for this evaluation. Includes redacted tasks.",
            "example": 2000
        },
        "description": {
            "type": "string",
            "description": "The description of this evaluation, as displayed on the site. Markdown can be used for formatting",
            "nullable": true,
            "example": "# This is an evaluation, see more at [this link](http://some.link)"
        },
        "last_updated": {
            "type": "string",
            "format": "date-time"
        },
        "task_types": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "The types of tasks supported by this evaluation",
            "enum": [
                "FRQ",
                "bool",
                "json",
                "MCQ"
            ],
            "example": "MCQ"
        },
        "modalities": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "The available modalities of this evaluation",
            "enum": [
                "text"
            ],
            "example": "text"
        },
        "min_questions_to_complete": {
            "type": "integer",
            "format": "int64",
            "description": "The default number of tasks to run before an evaluation session is deemed finished.\nA given evaluation session may process more tasks, as starting a new evaluation session for an evaluation/model pair which is already running will just add more tasks to the current session, rather than starting a new one.",
            "nullable": true,
            "example": 321
        },
        "owner": {
            "$ref": "#/components/schemas/ShallowUser"
        },
        "tags": {
            "type": "array",
            "items": {
                "$ref": "#/components/schemas/ShallowTag"
            }
        }
    }
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: Error.


GET /evaluation

Input parameters

Parameter In Type Default Nullable Description
id query string Yes Will return the item with this id, or an error if no such item exists. When this parameter is provided, only a single item will be returned
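For instance, a single evaluation can be fetched by id with Python's requests library. This is a sketch following the authentication pattern from the GET /auth example above; the session token handling is an assumption carried over from that example:

```python
import requests

def get_evaluation(session_token, evaluation_id):
    """Fetch a single evaluation by id. When `id` is provided the
    endpoint returns one object instead of a paginated listing."""
    res = requests.get(
        "https://equistamp.net/evaluation",
        headers={"Session-Token": session_token},
        params={"id": evaluation_id},
    )
    if res.status_code != 200:
        raise ValueError(res.json())
    return res.json()
```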

Responses

Schema of the response body
{
    "oneOf": [
        {
            "$ref": "#/components/schemas/Evaluation"
        },
        {
            "type": "object",
            "properties": {
                "items": {
                    "description": "An array of all the items that were found, but capped at most at `per_page`",
                    "type": "array",
                    "items": {
                        "$ref": "#/components/schemas/Evaluation"
                    }
                },
                "count": {
                    "description": "The total number of items found",
                    "type": "number",
                    "format": "int32"
                },
                "per_page": {
                    "description": "The number of items returned per page",
                    "type": "number",
                    "format": "int32"
                },
                "page": {
                    "description": "The number of available pages",
                    "type": "number",
                    "format": "int32"
                }
            }
        }
    ]
}

Refer to the common response description: NotFound.

Refer to the common response description: Error.


PUT /evaluation

Request body

{
    "name": "My lovely evaluation",
    "public": true,
    "public_usable": false,
    "reports_visible": false,
    "description": "# This is an evaluation, see more at [this link](http://some.link)",
    "task_types": "MCQ",
    "modalities": "text",
    "min_questions_to_complete": 321,
    "tags": [
        "341337c9-bc8c-4d87-bbf2-7d440f7c124f"
    ]
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "name": {
            "type": "string",
            "example": "My lovely evaluation"
        },
        "public": {
            "type": "boolean",
            "description": "Whether this evaluation should be publicly visible. If true, anyone can view its details or evaluate models with it"
        },
        "public_usable": {
            "type": "boolean",
            "description": "Whether this evaluation can be ran by anyone. To avoid tasks being leaked, you might want to have the results shown, but have control over what it can be run on.",
            "example": false
        },
        "reports_visible": {
            "type": "boolean",
            "description": "Whether anyone can pay to see reports for this evaluation.",
            "example": false
        },
        "description": {
            "type": "string",
            "description": "The description of this evaluation, as displayed on the site. Markdown can be used for formatting",
            "nullable": true,
            "example": "# This is an evaluation, see more at [this link](http://some.link)"
        },
        "task_types": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "The types of tasks supported by this evaluation",
            "enum": [
                "FRQ",
                "bool",
                "json",
                "MCQ"
            ],
            "example": "MCQ"
        },
        "modalities": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "The available modalities of this evaluation",
            "enum": [
                "text"
            ],
            "example": "text"
        },
        "min_questions_to_complete": {
            "type": "integer",
            "format": "int64",
            "description": "The default number of tasks to run before an evaluation session is deemed finished.\nA given evaluation session may process more tasks, as starting a new evaluation session for an evaluation/model pair which is already running will just add more tasks to the current session, rather than starting a new one.",
            "nullable": true,
            "example": 321
        },
        "tags": {
            "type": "array",
            "items": {
                "type": "string",
                "format": "uuid"
            }
        }
    }
}
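The update itself can be sketched with requests. Note the assumptions here: the Session-Token header follows the GET /auth example above, and the target evaluation is assumed to be selected with an `id` query parameter as on GET /evaluation, since the body schema does not include one:

```python
import requests

def update_evaluation(session_token, evaluation_id, **fields):
    """PUT the given fields (e.g. name, public, task_types) to
    /evaluation. Returns the confirmation string on success."""
    res = requests.put(
        "https://equistamp.net/evaluation",
        headers={"Session-Token": session_token},
        params={"id": evaluation_id},
        json=fields,
    )
    if res.status_code != 200:
        raise ValueError(res.json())
    return res.json()
```

For example, `update_evaluation(token, eval_id, public=True, reports_visible=False)` would toggle just those two flags.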

Responses

"Evaluation updated"
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "string",
    "enum": [
        "Evaluation updated"
    ]
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: NotFound.

Refer to the common response description: Error.


POST /evaluationbuilderhandler

Import tasks from a CSV file.

Description

This endpoint will fetch a CSV file and create a task from each row (excluding the first row, which is used as a header). If dry_run is true, this will only check for errors and not save anything to the database.

Number of questions to complete

Each evaluation run will use a subsample of all available tasks. You can set this number by providing a value for min_questions_to_complete. If you don't set it manually, it will be set on the basis of the number of tasks in your file, so as to reach a 95% confidence level. In practice this number tends to be larger than needed - the scores of most evaluation runs don't change much after around 200 tasks.

Task type

Unless specified otherwise, it's assumed that all tasks are Multiple Choice Questions. This can be changed by

  1. setting default_task_type, which will change the default to whatever you provide
  2. providing a type column, which can be used to set the task types for specific rows - any row where the type column is not empty will use that value as its type; otherwise the default type is used

Columns mapping

For the CSV import to work correctly, you must provide a way to map columns to task fields. This is done by providing a mapping of <column name> to a column definition object. The available fields in the definition object are:

  • columnType - this specifies what this column should be used as. Must always be provided
  • paraphraseOf - used by paraphrase columns to point to what they're paraphrasing. All texts can have paraphrases. When a field has paraphrases defined, these will always be used when sending texts to models, or when displaying them on the frontend. Only you and system administrators will have access to the non-paraphrase texts.
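Putting the pieces together, a minimal dry-run import payload might be assembled as below. This is a sketch built from the example request body; the ids and URLs are placeholders, and the column names mirror the example rather than any real CSV:

```python
def build_import_payload(evaluation_id, csv_url, dry_run=True):
    """Assemble a POST body for /evaluationbuilderhandler. Swap in
    your own CSV column headers as the columns_mapping keys."""
    return {
        "evaluation_id": evaluation_id,
        "csv_url": csv_url,
        "dry_run": dry_run,  # validate only; nothing is saved while True
        "default_task_type": "MCQ",
        "columns_mapping": {
            "Question col": {"columnType": "question"},
            "Paraphrase of question": {
                "columnType": "paraphrase",
                "paraphraseOf": "Question col",
            },
        },
    }
```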

Request body

{
    "public_usable": false,
    "reports_visible": false,
    "min_questions_to_complete": 321,
    "tags": [
        "80578ad7-0506-4c5e-a2e6-586523676152"
    ],
    "evaluation_id": "64a578cc-05b8-4749-a2eb-ff63f34d78fd",
    "dry_run": true,
    "csv_url": "https://example.com",
    "default_task_type": "MCQ",
    "columns_mapping": {
        "Question col": {
            "columnType": "question"
        },
        "Paraphrase of question": {
            "columnType": "paraphrase",
            "paraphraseOf": "Question col"
        }
    },
    "references": {
        "bla": {
            "schema": {
                "properties": {
                    "name": {
                        "type": "string"
                    }
                }
            },
            "name": "My wonderful schema",
            "description": "Some description here"
        },
        "other-name_with.interpunction123": {
            "schema": {
                "properties": {
                    "name": {
                        "type": "string"
                    }
                }
            }
        }
    },
    "prompt": "(str \"Please answer this question: \" task)",
    "grader": {
        "MCQ": "(= parsedResponse correct)",
        "default": "false"
    }
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "public_usable": {
            "type": "boolean",
            "description": "Whether this evaluation can be ran by anyone. To avoid tasks being leaked, you might want to have the results shown, but have control over what it can be run on.",
            "example": false
        },
        "reports_visible": {
            "type": "boolean",
            "description": "Whether anyone can pay to see reports for this evaluation.",
            "example": false
        },
        "min_questions_to_complete": {
            "type": "integer",
            "format": "int64",
            "description": "The default number of tasks to run before an evaluation session is deemed finished.\nA given evaluation session may process more tasks, as starting a new evaluation session for an evaluation/model pair which is already running will just add more tasks to the current session, rather than starting a new one.",
            "nullable": true,
            "example": 321
        },
        "tags": {
            "type": "array",
            "items": {
                "type": "string",
                "format": "uuid"
            }
        },
        "evaluation_id": {
            "description": "The id of the evaluation to add tasks to",
            "type": "string",
            "format": "uuid"
        },
        "dry_run": {
            "description": "If true, this call will only check for errors and not actually import anything",
            "type": "boolean"
        },
        "csv_url": {
            "description": "The URL of a CSV file containing the tasks of the new evaluation",
            "example": "https://example.com",
            "type": "string"
        },
        "default_task_type": {
            "description": "The default type of tasks - can be overrode on a per row basis. Will use \"MCQ\" if not set",
            "example": "MCQ",
            "nullable": true,
            "type": "string",
            "enum": [
                "FRQ",
                "bool",
                "json",
                "MCQ"
            ]
        },
        "columns_mapping": {
            "description": "A mapping that specifies which CSV columns contain which types of data. See the [Evaluation Builder](#post-evaluationbuilderhandler) endpoint for details",
            "type": "object",
            "example": {
                "Question col": {
                    "columnType": "question"
                },
                "Paraphrase of question": {
                    "columnType": "paraphrase",
                    "paraphraseOf": "Question col"
                }
            },
            "additionalProperties": {
                "$ref": "#/components/schemas/ColumnMapping"
            }
        },
        "references": {
            "description": "A mapping of keys to schemas. The keys can contain ASCII alphanumeric characters, \"-\", \"_\" and \".\".",
            "type": "object",
            "additionalProperties": {
                "type": "object",
                "properties": {
                    "schema": {
                        "type": "object",
                        "description": "The JSON schema to be used"
                    },
                    "name": {
                        "type": "string",
                        "description": "An optional name for this schema - this will only be used for displaying, the actual matching is done by comparing the keys of the `references` object."
                    },
                    "description": {
                        "type": "string",
                        "description": "An optional description for this schema"
                    },
                    "type": {
                        "type": "string",
                        "enum": [
                            "json"
                        ],
                        "description": "The type of schema. If not provided, will be assumed to be JSON",
                        "example": "json"
                    }
                },
                "required": [
                    "schema"
                ]
            },
            "example": {
                "bla": {
                    "schema": {
                        "properties": {
                            "name": {
                                "type": "string"
                            }
                        }
                    },
                    "name": "My wonderful schema",
                    "description": "Some description here"
                },
                "other-name_with.interpunction123": {
                    "schema": {
                        "properties": {
                            "name": {
                                "type": "string"
                            }
                        }
                    }
                }
            }
        },
        "prompt": {
            "description": "DSL code defining how to create prompts. See the [DSL page](/docs/dsl/) for more info.",
            "example": "(str \"Please answer this question: \" task)"
        },
        "grader": {
            "description": "DSL code specifying how to grade LLM responses. This can be empty, in which case the default grader will be used. You can specify a grader that will be used for all types of tasks, or per task type graders. If you provide both a default grader and one for a specific task type, the specific one takes precedence.",
            "oneOf": [
                {
                    "type": "string",
                    "description": "DSL code that should be used for all response",
                    "example": "(= parsedResponse \"ok\")"
                },
                {
                    "type": "object",
                    "description": "Per task type DSL code. Use the \"default\" key to specify the code that should be used for tasks types that aren't specified - otherwise the system default grader will be used.",
                    "properties": {
                        "default": {
                            "type": "string",
                            "description": "The default grader to be used for task types that aren't specified.",
                            "example": "(if (= parsedResponse correct) 1 0)"
                        },
                        "FRQ": {
                            "type": "string",
                            "description": "The DSL code to be used to grade FRQ tasks. If this is empty, the default grader will be used"
                        },
                        "bool": {
                            "type": "string",
                            "description": "The DSL code to be used to grade bool tasks. If this is empty, the default grader will be used"
                        },
                        "json": {
                            "type": "string",
                            "description": "The DSL code to be used to grade json tasks. If this is empty, the default grader will be used"
                        },
                        "MCQ": {
                            "type": "string",
                            "description": "The DSL code to be used to grade MCQ tasks. If this is empty, the default grader will be used"
                        }
                    },
                    "example": {
                        "MCQ": "(= parsedResponse correct)",
                        "default": "false"
                    }
                }
            ],
            "example": {
                "MCQ": "(= parsedResponse correct)",
                "default": "false"
            }
        }
    }
}

Responses

{
    "id": "6f7c068b-17be-42a1-913c-1e3c349af033",
    "name": "My lovely evaluation",
    "public": true,
    "public_usable": false,
    "reports_visible": false,
    "quality": 0.89,
    "num_tasks": 2000,
    "description": "# This is an evaluation, see more at [this link](http://some.link)",
    "last_updated": "2022-04-13T15:42:05.901Z",
    "task_types": "MCQ",
    "modalities": "text",
    "min_questions_to_complete": 321,
    "owner": {
        "id": "f059ec20-0e0f-4c5c-81d3-4a6e3aa64ed4",
        "email_address": "mr.blobby@some.domain",
        "user_name": "mr_blobby",
        "full_name": "Mr Blobby, esq.",
        "user_image": "https://equistamp.com/avatars/123123123123.png",
        "bio": "Hello, my name is Inigo Montoya. You Killed my Father. Prepare to die",
        "display_options": {
            "bio": true,
            "email_address": true,
            "user_image": false
        },
        "join_date": "2022-04-13",
        "subscription_level": "pro",
        "alerts": [
            "6c6028a9-85b2-4f11-b83e-53683cd48d9b"
        ]
    },
    "tags": [
        {
            "id": "d71d88f6-3afe-41a9-b263-94d1f38e81d7",
            "name": "string"
        }
    ]
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "object",
    "properties": {
        "id": {
            "type": "string",
            "format": "uuid"
        },
        "name": {
            "type": "string",
            "example": "My lovely evaluation"
        },
        "public": {
            "type": "boolean",
            "description": "Whether this evaluation should be publicly visible. If true, anyone can view its details or evaluate models with it"
        },
        "public_usable": {
            "type": "boolean",
            "description": "Whether this evaluation can be ran by anyone. To avoid tasks being leaked, you might want to have the results shown, but have control over what it can be run on.",
            "example": false
        },
        "reports_visible": {
            "type": "boolean",
            "description": "Whether anyone can pay to see reports for this evaluation.",
            "example": false
        },
        "quality": {
            "type": "number",
            "format": "double",
            "description": "The quality of this evaluation, i.e. how much it can be trusted, from 0 to 1.",
            "example": 0.89
        },
        "num_tasks": {
            "type": "integer",
            "format": "int64",
            "description": "The total number of tasks defined for this evaluation. Includes redacted tasks.",
            "example": 2000
        },
        "description": {
            "type": "string",
            "description": "The description of this evaluation, as displayed on the site. Markdown can be used for formatting",
            "nullable": true,
            "example": "# This is an evaluation, see more at [this link](http://some.link)"
        },
        "last_updated": {
            "type": "string",
            "format": "date-time"
        },
        "task_types": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "The types of tasks supported by this evaluation",
            "enum": [
                "FRQ",
                "bool",
                "json",
                "MCQ"
            ],
            "example": "MCQ"
        },
        "modalities": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "The available modalities of this evaluation",
            "enum": [
                "text"
            ],
            "example": "text"
        },
        "min_questions_to_complete": {
            "type": "integer",
            "format": "int64",
            "description": "The default number of tasks to run before an evaluation session is deemed finished.\nA given evaluation session may process more tasks, as starting a new evaluation session for an evaluation/model pair which is already running will just add more tasks to the current session, rather than starting a new one.",
            "nullable": true,
            "example": 321
        },
        "owner": {
            "$ref": "#/components/schemas/ShallowUser"
        },
        "tags": {
            "type": "array",
            "items": {
                "$ref": "#/components/schemas/ShallowTag"
            }
        }
    }
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: Error.


GET /evaluationbuilderhandler

Check whether a CSV file contains valid tasks

Description

This endpoint will fetch a CSV file from the provided URL and validate each row to make sure that it can be processed. Rows with errors or warnings will be returned with appropriate messages, to help debug problems. When the CSV is processed (after sending an appropriate POST request to this endpoint), rows that have errors will be skipped.

Column mapping

To check whether all the rows are correct, you must provide a way to work out which columns correspond to which fields in the resulting tasks. In the case of GET requests, they should be provided as follows. Check out our sample tasks file for examples:

Basic mappings

  • question - this is the only required parameter. This should specify the name of the column containing the main text to be sent to models
  • type - this specifies where to check for per-row task type overrides. By default it's assumed that tasks are multiple choice questions, unless default_task_type is set in the POST request. If you want most tasks to be one type but a few to be of a different type (e.g. true-false questions), you can override them using this column.
  • redacted - this specifies where to check whether a task should be hidden by default. By default it's assumed that all tasks should be used when testing models, but sometimes a given task may be incorrect or of poor quality. One way around this would be to delete any problematic rows before uploading, but that can be a lot of work. To make things easier, tasks can be uploaded as redacted, which means that they won't be sent to models. Any row with a non-empty value in the redacted column will be saved as redacted

Paraphrases

All texts can have paraphrases. When a field has paraphrases defined, these will always be used when sending texts to models, or when displaying them on the frontend. Only you and system administrators will have access to the non-paraphrase texts. Paraphrases are declared as paraphrase.<paraphrase column>=<paraphrased column>. So e.g. paraphrase.question%20paraphrase=Question will declare that the "question paraphrase" column is a paraphrase of the "Question" column.

Boolean question mappings

Boolean questions have only two possible answers - True or False. You can have one column which provides this value. Any row where the answer column equals 1, or a case-insensitive true or yes, will be deemed a question where the correct answer is True. Any other value is False.

  • bool_correct - any row whose value is 1 or a case-insensitive true or yes (so e.g. TrUe, TRue or true) will be deemed a true statement. Anything else is false.
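The matching rule above can be sketched as follows. This is an illustration of the documented behaviour, not Equistamp's actual implementation:

```python
def is_true_statement(cell: str) -> bool:
    """A bool_correct cell counts as True when it is "1" or a
    case-insensitive "true" or "yes"; anything else is False."""
    return cell.lower() in {"1", "true", "yes"}
```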

Free response question mappings

Free response questions are questions where the model can answer with any text. An example of this kind of question would be "fill in the blank". You can provide both correct and incorrect texts - free response questions are graded on the basis of similarity. Two identical texts should have a similarity of 1, and texts with opposite meanings will have a similarity of 0. You can specify expected answers either as texts which should be similar, or as texts which are opposite, in which case the similarity will be calculated as 1 - <similarity score>. Each row must have at least one correct or incorrect value provided.

  • frq_correct - a comma separated list of URL encoded column names, e.g. 'Correct%201,Correct%20%3D%20this'
  • frq_incorrect - a comma separated list of URL encoded column names, e.g. 'This%20is%20wrong,Bad%21%21'

Multiple choice question mappings

In the case of multiple choice questions, you must provide at least one correct answer and at least one incorrect answer. You can add more if you want, but only the first 10 correct answers and the first 20 incorrect answers will be used. These column definitions should be provided via:

  • mcq_correct - a comma separated list of URL encoded column names, e.g. 'Correct%201,Correct%20%3D%20this'
  • mcq_incorrect - a comma separated list of URL encoded column names, e.g. 'This%20is%20wrong,Bad%21%21'

Json question mappings

Tasks which expect valid JSON responses have the following column types, both of which are optional:

  • schema - a JSON schema specifying the structure of the expected JSON. If this is provided, all responses must conform to this schema. If not provided, then the schema will be assumed to be any valid JSON. The schema can be provided via a reference (see below).
  • expected - an expected JSON object. The JSON returned by the model must have the same values as the expected object

Example column mappings

Assuming you have a CSV file with the following columns:

  • Task type - contains the type of tasks
  • Timestamp - date of last edit - not needed here, so should be ignored
  • `` - an empty column
  • Task question to answer - the text to which models should respond
  • Question paraphrase - an alternative way of phrasing the question
  • Correct answer - the expected answer
  • Alternative correct answer - another answer that will also be accepted as correct
  • Bad response example - an incorrect answer to be provided as an option in the multiple choice question
  • Wrong answer - another incorrect answer to be provided as an option in the multiple choice question

Then you would send a GET request with type=Task%20type&question=Task%20question%20to%20answer&paraphrase.Question%20paraphrase=Task%20question%20to%20answer&mcq_correct=Correct%20answer,Alternative%20correct%20answer&mcq_incorrect=Bad%20response%20example,Wrong%20answer
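The query above can be assembled programmatically. Note that the column names are encoded individually while the commas inside mcq_correct and mcq_incorrect remain literal separators, so the query string is built by hand rather than with a generic encoder:

```python
from urllib.parse import quote

# Sketch of the column mapping above. Column names are URL encoded
# individually; the commas between them stay as literal separators.
def column_list(names):
    return ",".join(quote(n) for n in names)

query = "&".join([
    "type=" + quote("Task type"),
    "question=" + quote("Task question to answer"),
    quote("paraphrase.Question paraphrase") + "=" + quote("Task question to answer"),
    "mcq_correct=" + column_list(["Correct answer", "Alternative correct answer"]),
    "mcq_incorrect=" + column_list(["Bad response example", "Wrong answer"]),
])
```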

References

In the case of schemas, it would be annoying to repeat a massive JSON object in each row. To make this simpler, you can provide a set of references: any schema column whose value is a reference key will use the schema object stored under that reference. Reference names can contain English letters (upper and lowercase), digits, and "-", "_", and ".". References can also have names and descriptions for easier management; both are optional and do not in any way affect how the references are matched to rows. References should be provided as reference.<value>.<reference name> GET parameters, where <value> is one of "schema", "name" or "description". An example would be a GET request with: type=Task&question=Question&json_schema=Schema&reference.name.ref1&reference.schema.ref1=%7B%22asd%22%3A+%22asd%22%7D, which would set ref1 to be {"asd": "asd"} on all rows that have ref1 as their schema.
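Producing those reference parameters by hand is error prone, so a sketch of doing it programmatically may help. The display name value here is hypothetical, and `quote` encodes the space as %20 where the example above uses + (both are valid encodings):

```python
import json
from urllib.parse import quote

# Hedged sketch of registering a schema reference ("ref1") via GET
# parameters, mirroring the example above. The schema value is illustrative.
schema = {"asd": "asd"}
ref_params = {
    "reference.name.ref1": "ref1",  # optional, purely for management
    "reference.schema.ref1": json.dumps(schema),
}
query = "&".join(quote(k) + "=" + quote(v) for k, v in ref_params.items())
```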

Input parameters

Parameter In Type Default Nullable Description
csv_url path No The URL of a CSV file containing the tasks of the new evaluation
only_header path No When set, will just return the headers of the CSV file
question path No The columns in the CSV file containing the questions
redacted path No The column in the CSV file marking tasks as redacted
type path No The column in the CSV file containing the per row task type

Responses

{
    "errors": [
        {
            "task_num": 3,
            "errors": [
                {
                    "message": "This row couldn't be parsed",
                    "level": "warning",
                    "type": "validation"
                }
            ],
            "warnings": [
                "This row is suspicious"
            ]
        }
    ],
    "num_tasks": 123,
    "min_questions_to_complete": 42
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.
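Walking a response of this shape can be sketched as follows; this flattens the per-row errors and warnings into tuples for easy reporting (an illustrative helper, not part of the API):

```python
# Sketch of walking the validation response shown above: flatten the per-row
# errors and warnings into (task_num, level, message) tuples.
def summarise(result):
    problems = []
    for row in result.get("errors", []):
        for err in row.get("errors", []):
            problems.append((row["task_num"], err["level"], err["message"]))
        for warning in row.get("warnings", []):
            problems.append((row["task_num"], "warning", warning))
    return problems
```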

Schema of the response body
{
    "type": "object",
    "properties": {
        "errors": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "task_num": {
                        "description": "The index of the row that has these errors",
                        "type": "number",
                        "format": "int64",
                        "example": 3
                    },
                    "errors": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "message": {
                                    "type": "string",
                                    "example": "This row couldn't be parsed"
                                },
                                "level": {
                                    "type": "string",
                                    "enum": [
                                        "warning",
                                        "error"
                                    ]
                                },
                                "type": {
                                    "type": "string",
                                    "example": "validation"
                                }
                            }
                        }
                    },
                    "warnings": {
                        "type": "array",
                        "items": {
                            "type": "string",
                            "example": "This row is suspicious"
                        }
                    }
                }
            }
        },
        "num_tasks": {
            "description": "The number of rows with tasks found, including rows with errors",
            "type": "number",
            "format": "int64",
            "example": 123
        },
        "min_questions_to_complete": {
            "description": "The minimum number of tasks per evaluation session. If this wasn't provided in the query parameters, it will be calculated based on the number of tasks found",
            "type": "number",
            "format": "int64",
            "example": 42
        }
    }
}

Refer to the common response description: ValidationError.

Refer to the common response description: NotFound.

Refer to the common response description: Error.


POST /evaluationmodeljobshandler

Request body

{
    "job_name": "string",
    "minutes_between_evaluations": 10.12,
    "job_description": "string",
    "start_date": "2022-04-13T15:42:05.901Z"
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "job_name": {
            "type": "string"
        },
        "minutes_between_evaluations": {
            "type": "number",
            "format": "int64"
        },
        "job_description": {
            "type": "string"
        },
        "start_date": {
            "type": "string",
            "format": "date-time",
            "nullable": true
        }
    }
}
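A hedged sketch of calling this endpoint, using only the fields from the request schema above. The token and all field values are placeholders; the Api-Token header follows the authentication description elsewhere in these docs:

```python
import requests

# Placeholder payload matching the request schema above.
payload = {
    "job_name": "nightly-regression",
    "minutes_between_evaluations": 1440,  # once a day
    "job_description": "Re-run the benchmark every 24 hours",
    "start_date": "2024-01-01T00:00:00Z",
}

def create_job(api_token, payload):
    res = requests.post(
        "https://equistamp.net/evaluationmodeljobshandler",
        headers={"Api-Token": api_token},
        json=payload,
    )
    res.raise_for_status()
    return res.json()  # the response includes the generated id and creation_date
```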

Responses

{
    "job_name": "string",
    "minutes_between_evaluations": 10.12,
    "job_body": null,
    "job_description": "string",
    "job_schedule_arn": "string",
    "start_date": "2022-04-13T15:42:05.901Z",
    "owner_id": "c3821565-8ad3-48d3-be6b-6785eec6de4d",
    "model_id": "6aba704c-b89d-40af-9c68-9dde86479c65",
    "evaluation_id": "8402290a-eb86-44be-a7b7-bfa35072c30f",
    "id": "c959d296-96ea-4fc8-8b9c-7a66d53d436e",
    "creation_date": "2022-04-13T15:42:05.901Z"
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "object",
    "properties": {
        "job_name": {
            "type": "string"
        },
        "minutes_between_evaluations": {
            "type": "number",
            "format": "int64"
        },
        "job_body": {},
        "job_description": {
            "type": "string"
        },
        "job_schedule_arn": {
            "type": "string"
        },
        "start_date": {
            "type": "string",
            "format": "date-time",
            "nullable": true
        },
        "owner_id": {
            "type": "string",
            "format": "uuid"
        },
        "model_id": {
            "type": "string",
            "format": "uuid"
        },
        "evaluation_id": {
            "type": "string",
            "format": "uuid"
        },
        "id": {
            "type": "string",
            "format": "uuid"
        },
        "creation_date": {
            "type": "string",
            "format": "date-time"
        }
    }
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: Error.


GET /evaluationmodeljobshandler

Input parameters

Parameter In Type Default Nullable Description
id query string Yes Will return the item with this id, or a NotFound error if no such item exists. When this parameter is provided, only a single item will be returned

Responses

Schema of the response body
{
    "oneOf": [
        {
            "$ref": "#/components/schemas/EvaluationModelJobs"
        },
        {
            "type": "object",
            "properties": {
                "items": {
                    "description": "An array of all the items that were found, but capped at most at `per_page`",
                    "type": "array",
                    "items": {
                        "$ref": "#/components/schemas/EvaluationModelJobs"
                    }
                },
                "count": {
                    "description": "The total number of items found",
                    "type": "number",
                    "format": "int32"
                },
                "per_page": {
                    "description": "The number of items returned per page",
                    "type": "number",
                    "format": "int32"
                },
                "page": {
                    "description": "The number of available pages",
                    "type": "number",
                    "format": "int32"
                }
            }
        }
    ]
}
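Because the response is a oneOf, callers need to handle both shapes: a bare EvaluationModelJobs object when id was given, or the paginated wrapper otherwise. An illustrative normaliser:

```python
# Sketch of handling the two response shapes documented above: a single
# object when `id` was passed, otherwise a paginated wrapper with
# `items`, `count`, `per_page` and `page`.
def extract_jobs(body):
    if isinstance(body, dict) and "items" in body:
        return body["items"]
    return [body]
```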

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: NotFound.

Refer to the common response description: Error.


PUT /evaluationmodeljobshandler

Request body

{
    "job_name": "string",
    "minutes_between_evaluations": 10.12,
    "job_description": "string",
    "start_date": "2022-04-13T15:42:05.901Z"
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "job_name": {
            "type": "string"
        },
        "minutes_between_evaluations": {
            "type": "number",
            "format": "int64"
        },
        "job_description": {
            "type": "string"
        },
        "start_date": {
            "type": "string",
            "format": "date-time",
            "nullable": true
        }
    }
}

Responses

"EvaluationModelJobs updated"
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "string",
    "enum": [
        "EvaluationModelJobs updated"
    ]
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: NotFound.

Refer to the common response description: Error.


POST /evaluationsession

Run an evaluation on a model, or take the test as a human.

Description
Human tests

Humans can test themselves on evaluations to check how hard they are. This should be done via the "Test yourself" button on evaluation pages. A random subsample of around 20 tasks will be returned, and once all of them have been completed, a summary will be shown of how well the tester did compared to other humans and AI models. Human tests can only be taken by the actual caller, as determined by Session-Token or Api-Token. Providing a different user via evaluatee_id won't do anything.

Each human test is idempotent, so until it has been completed, calling this endpoint for a given evaluation will return the same 20 tasks. This can be overridden with the restart parameter - when that is true, a new evaluation session will be started.

Human tests are free.

AI model evaluation

Calling this endpoint with a model id in the evaluatee_id field and is_human_being_evaluated = false will start a new evaluation session for the provided evaluation_id. This requires payment, which will automatically be subtracted from your credits. If you don't have enough credits, a 402 error will be returned, with a link to your user profile, where you can purchase more credits.

By default there will be only one evaluation session per evaluation/model pair at a time. Calling this endpoint for a running evaluation session will append tasks to the current session rather than creating a new one. You can force a new evaluation session by setting restart = true.
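The model-evaluation flow above can be sketched as follows. The token and ids are placeholders, and the error handling mirrors the documented 402 behaviour:

```python
import requests

# Build the minimal request body for a model evaluation session, as
# described above. All other body fields are optional.
def session_payload(model_id, evaluation_id, restart=False):
    return {
        "is_human_being_evaluated": False,
        "evaluatee_id": model_id,
        "evaluation_id": evaluation_id,
        "restart": restart,
    }

def start_evaluation(api_token, model_id, evaluation_id, restart=False):
    res = requests.post(
        "https://equistamp.net/evaluationsession",
        headers={"Api-Token": api_token},
        json=session_payload(model_id, evaluation_id, restart),
    )
    if res.status_code == 402:
        # Not enough credits; the body links to your profile to buy more.
        raise RuntimeError(res.json())
    res.raise_for_status()
    return res.json()
```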

Request body

{
    "origin": "user",
    "is_human_being_evaluated": true,
    "min_verbosity": 10.12,
    "max_verbosity": 10.12,
    "avg_verbosity": 10.12,
    "median_verbosity": 10.12,
    "evaluatee_id": "1ec67c40-fa5d-4a4d-867c-ac1cf75d4ec4",
    "evaluation_id": "9cc68041-01fe-474e-a623-59f37c7074aa",
    "notify": [
        {
            "method": "email",
            "destination": "mr.blobby@acme.com"
        }
    ],
    "restart": false,
    "system_prompt": "(str \"Please answer this: \" task)",
    "prompt": {
        "MCQ": "(str \"I have a multiple choice question for you to answer: \" task)",
        "default": "(str \"Answer this, please: \" task)"
    },
    "request": {
        "MCQ": "(bedrock-call \"your-access-key\" \"your-secret-key\" \"Jurassic\" task-text)",
        "default": "false"
    },
    "response": {
        "MCQ": "(= parsedResponse correct)",
        "default": "false"
    },
    "grader": {
        "MCQ": "(= parsedResponse correct)",
        "default": "false"
    }
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "origin": {
            "type": "string",
            "description": "The source of this evaluation session, i.e. what triggered it",
            "example": "user",
            "enum": [
                "alert",
                "user",
                "job",
                "model"
            ]
        },
        "is_human_being_evaluated": {
            "type": "boolean",
            "description": "Whether this evaluation session is a human test. When false will start an automatic test for the provided model and evaluation.",
            "example": true
        },
        "min_verbosity": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "max_verbosity": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "avg_verbosity": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "median_verbosity": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "evaluatee_id": {
            "type": "string",
            "format": "uuid",
            "description": "In the case of human tests, the id of the user taking the test. In the case of testing models, the id of the model to be tested"
        },
        "evaluation_id": {
            "type": "string",
            "format": "uuid",
            "description": "The id of the evaluation to be run"
        },
        "notify": {
            "description": "How to notify that the evaluation session has finished. There can be up to 20 notification methods provided. If no methods are provided, an email will be sent to the user that triggered the session.",
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "method": {
                        "type": "string",
                        "enum": [
                            "email",
                            "webhook",
                            "sms",
                            "call"
                        ],
                        "description": "The notification method",
                        "example": "email"
                    },
                    "destination": {
                        "type": "string",
                        "description": "Where to send a notification",
                        "example": "mr.blobby@acme.com"
                    }
                }
            }
        },
        "restart": {
            "description": "Will force a new evaluation session if true - by default, calling this endpoint for an evaluation/model session that is already running will add more tasks to the running session rather than creating a new one",
            "example": false,
            "type": "boolean"
        },
        "system_prompt": {
            "description": "DSL code specifying how to construct model system prompts. This can be empty.",
            "type": "string",
            "example": "(str \"Please answer this: \" task)"
        },
        "prompt": {
            "description": "DSL code specifying how to construct model prompts. This can be empty, in which case the prompt code of the evaluation will be used. You can specify a `prompt` that will be used for all types of tasks, or per task type `prompt`s. If you provide both a default `prompt` and one for a specific task type, the specific one takes precedence. This can only be used if you're an admin of the selected evaluation - otherwise an error will be returned.",
            "oneOf": [
                {
                    "type": "string",
                    "description": "DSL code that should be used for all prompts",
                    "example": "(str \"Please answer this: \" task)"
                },
                {
                    "type": "object",
                    "description": "Per task type DSL code. Use the \"default\" key to specify the code that should be used for tasks types that aren't specified - otherwise the system default prompt will be used.",
                    "properties": {
                        "default": {
                            "type": "string",
                            "description": "The default `prompt` to be used for task types that aren't specified.",
                            "example": "(str \"Answer this, please: \" task)"
                        },
                        "FRQ": {
                            "type": "string",
                            "description": "The DSL code to be used to create prompts for FRQ tasks. If this is empty, the default `prompt` will be used"
                        },
                        "bool": {
                            "type": "string",
                            "description": "The DSL code to be used to create prompts for bool tasks. If this is empty, the default `prompt` will be used"
                        },
                        "json": {
                            "type": "string",
                            "description": "The DSL code to be used to create prompts for json tasks. If this is empty, the default `prompt` will be used"
                        },
                        "MCQ": {
                            "type": "string",
                            "description": "The DSL code to be used to create prompts for MCQ tasks. If this is empty, the default `prompt` will be used"
                        }
                    },
                    "example": {
                        "MCQ": "(str \"I have a multiple choice question for you to answer: \" task)",
                        "default": "(str \"Answer this, please: \" task)"
                    }
                }
            ],
            "example": {
                "MCQ": "(str \"I have a multiple choice question for you to answer: \" task)",
                "default": "(str \"Answer this, please: \" task)"
            }
        },
        "request": {
            "description": "DSL code specifying how to send tasks to the model. This can be empty, in which case the request code of the model will be used. You can specify a `request` that will be used for all types of tasks, or per task type `request`s. If you provide both a default `request` and one for a specific task type, the specific one takes precedence. This can only be used if you're an admin of the selected model - otherwise an error will be returned.",
            "oneOf": [
                {
                    "type": "string",
                    "description": "DSL code that should be used for all requests",
                    "example": "(POST \"http://my.model.endpoint\" {:json {\"task\" task}})"
                },
                {
                    "type": "object",
                    "description": "Per task type DSL code. Use the \"default\" key to specify the code that should be used for tasks types that aren't specified - otherwise the system default request code will be used.",
                    "properties": {
                        "default": {
                            "type": "string",
                            "description": "The default `request` to be used for task types that aren't specified.",
                            "example": "(openai-call \"your_key\" \"gpt-4\" task)"
                        },
                        "FRQ": {
                            "type": "string",
                            "description": "The DSL code to be used to send requests for FRQ tasks. If this is empty, the default `request` will be used"
                        },
                        "bool": {
                            "type": "string",
                            "description": "The DSL code to be used to send requests for bool tasks. If this is empty, the default `request` will be used"
                        },
                        "json": {
                            "type": "string",
                            "description": "The DSL code to be used to send requests for json tasks. If this is empty, the default `request` will be used"
                        },
                        "MCQ": {
                            "type": "string",
                            "description": "The DSL code to be used to send requests for MCQ tasks. If this is empty, the default `request` will be used"
                        }
                    },
                    "example": {
                        "MCQ": "(openai-call \"sk-your-secret-key\" \"gpt-4-turbo\" task-text)",
                        "default": "(anthropic-call \"sk-your-secret-key\" \"claude\" task)"
                    }
                }
            ],
            "example": {
                "MCQ": "(bedrock-call \"your-access-key\" \"your-secret-key\" \"Jurassic\" task-text)",
                "default": "false"
            }
        },
        "response": {
            "description": "DSL code specifying how to parse LLM responses. This can be empty, in which case the response code of the model will be used. You can specify a `response` parser that will be used for all types of tasks, or per task type parsers. If you provide both a default parser and one for a specific task type, the specific one takes precedence. This can only be used if you're an admin of the selected model - otherwise an error will be returned.",
            "oneOf": [
                {
                    "type": "string",
                    "description": "DSL code that should be used for all responses",
                    "example": "(get-in response [\"json\" \"resp\"])"
                },
                {
                    "type": "object",
                    "description": "Per task type DSL code. Use the \"default\" key to specify the code that should be used for tasks types that aren't specified - otherwise the model's default parser will be used.",
                    "properties": {
                        "default": {
                            "type": "string",
                            "description": "The default parser to be used for task types that aren't specified.",
                            "example": "response"
                        },
                        "FRQ": {
                            "type": "string",
                            "description": "The DSL code to be used to parse FRQ task responses. If this is empty, the default parser will be used"
                        },
                        "bool": {
                            "type": "string",
                            "description": "The DSL code to be used to parse bool task responses. If this is empty, the default parser will be used"
                        },
                        "json": {
                            "type": "string",
                            "description": "The DSL code to be used to parse json task responses. If this is empty, the default parser will be used"
                        },
                        "MCQ": {
                            "type": "string",
                            "description": "The DSL code to be used to parse MCQ task responses. If this is empty, the default parser will be used"
                        }
                    },
                    "example": {
                        "MCQ": "(= parsedResponse correct)",
                        "default": "false"
                    }
                }
            ],
            "example": {
                "MCQ": "(= parsedResponse correct)",
                "default": "false"
            }
        },
        "grader": {
            "description": "DSL code specifying how to grade LLM responses. This can be empty, in which case the grader of the evaluation will be used. You can specify a grader that will be used for all types of tasks, or per task type graders. If you provide both a default grader and one for a specific task type, the specific one takes precedence. This can only be used if you're an admin of the selected evaluation - otherwise an error will be returned.",
            "oneOf": [
                {
                    "type": "string",
                    "description": "DSL code that should be used for all responses",
                    "example": "(= parsedResponse \"ok\")"
                },
                {
                    "type": "object",
                    "description": "Per task type DSL code. Use the \"default\" key to specify the code that should be used for tasks types that aren't specified - otherwise the grader of the evaluation will be used.",
                    "properties": {
                        "default": {
                            "type": "string",
                            "description": "The default grader to be used for task types that aren't specified.",
                            "example": "(if (= parsedResponse correct) 1 0)"
                        },
                        "FRQ": {
                            "type": "string",
                            "description": "The DSL code to be used to grade FRQ tasks. If this is empty, the default grader will be used"
                        },
                        "bool": {
                            "type": "string",
                            "description": "The DSL code to be used to grade bool tasks. If this is empty, the default grader will be used"
                        },
                        "json": {
                            "type": "string",
                            "description": "The DSL code to be used to grade json tasks. If this is empty, the default grader will be used"
                        },
                        "MCQ": {
                            "type": "string",
                            "description": "The DSL code to be used to grade MCQ tasks. If this is empty, the default grader will be used"
                        }
                    },
                    "example": {
                        "MCQ": "(= parsedResponse correct)",
                        "default": "false"
                    }
                }
            ],
            "example": {
                "MCQ": "(= parsedResponse correct)",
                "default": "false"
            }
        }
    }
}

Responses

{
    "id": "4b5d04c5-46c8-4361-aec2-6943db45be82",
    "datetime_started": "2022-04-13T15:42:05.901Z",
    "datetime_completed": "2022-04-13T15:42:05.901Z",
    "origin": "user",
    "completed": true,
    "failed": true,
    "is_human_being_evaluated": true,
    "num_questions_answered": 10.12,
    "num_answered_correctly": 10.12,
    "num_tasks_to_complete": 10.12,
    "num_endpoint_failures": 10.12,
    "num_endpoint_calls": 10.12,
    "num_characters_sent_to_endpoint": 10.12,
    "num_characters_received_from_endpoint": 10.12,
    "median_seconds_per_task": 10.12,
    "mean_seconds_per_task": 10.12,
    "std_seconds_per_task": 10.12,
    "distribution_of_seconds_per_task": null,
    "min_seconds_per_task": 10.12,
    "max_seconds_per_task": 10.12,
    "median_characters_per_task": 10.12,
    "mean_characters_per_task": 10.12,
    "std_characters_per_task": 10.12,
    "distribution_of_characters_per_task": null,
    "min_characters_per_task": 10.12,
    "max_characters_per_task": 10.12,
    "min_verbosity": 10.12,
    "max_verbosity": 10.12,
    "avg_verbosity": 10.12,
    "median_verbosity": 10.12,
    "evaluatee_id": "cc475c38-985c-4b3d-9e3a-766b4945166e",
    "evaluation_id": "761e4f79-385d-47a8-bd39-ab5b3ffb78ed"
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "object",
    "properties": {
        "id": {
            "type": "string",
            "format": "uuid"
        },
        "datetime_started": {
            "type": "string",
            "format": "date-time"
        },
        "datetime_completed": {
            "type": "string",
            "format": "date-time",
            "nullable": true
        },
        "origin": {
            "type": "string",
            "description": "The source of this evaluation session, i.e. what triggered it",
            "example": "user",
            "enum": [
                "alert",
                "user",
                "job",
                "model"
            ]
        },
        "completed": {
            "type": "boolean"
        },
        "failed": {
            "type": "boolean"
        },
        "is_human_being_evaluated": {
            "type": "boolean",
            "description": "Whether this evaluation session is a human test. When false will start an automatic test for the provided model and evaluation.",
            "example": true
        },
        "num_questions_answered": {
            "type": "number",
            "format": "int64"
        },
        "num_answered_correctly": {
            "type": "number",
            "format": "int64"
        },
        "num_tasks_to_complete": {
            "type": "number",
            "format": "int64"
        },
        "num_endpoint_failures": {
            "type": "number",
            "format": "int64"
        },
        "num_endpoint_calls": {
            "type": "number",
            "format": "int64"
        },
        "num_characters_sent_to_endpoint": {
            "type": "number",
            "format": "int64"
        },
        "num_characters_received_from_endpoint": {
            "type": "number",
            "format": "int64"
        },
        "median_seconds_per_task": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "mean_seconds_per_task": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "std_seconds_per_task": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "distribution_of_seconds_per_task": {
            "nullable": true
        },
        "min_seconds_per_task": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "max_seconds_per_task": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "median_characters_per_task": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "mean_characters_per_task": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "std_characters_per_task": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "distribution_of_characters_per_task": {
            "nullable": true
        },
        "min_characters_per_task": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "max_characters_per_task": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "min_verbosity": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "max_verbosity": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "avg_verbosity": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "median_verbosity": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "evaluatee_id": {
            "type": "string",
            "format": "uuid",
            "description": "In the case of human tests, the id of the user taking the test. In the case of testing models, the id of the model to be tested"
        },
        "evaluation_id": {
            "type": "string",
            "format": "uuid",
            "description": "The id of the evaluation to be run"
        }
    }
}

Refer to the common response description: Unauthorized.

Refer to the common response description: PaymentRequired.

Refer to the common response description: Unauthenticated.

Refer to the common response description: Error.


GET /evaluationsession

Get evaluation sessions.

Description

If the id parameter is provided, this endpoint will return the matching evaluation session if possible. For human tests, you can only use this endpoint to get your own results. For AI model runs, you can use this endpoint to get any evaluation session where either the model or the evaluation is public, or where you're an administrator of it.

Input parameters

Parameter In Type Default Nullable Description
id query string Yes Will return the item with this id, or a NotFound error if no such item exists. When this parameter is provided, only a single item will be returned
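
Following the pattern from the `/auth` example above, fetching a single session by id might look like the sketch below. The session token and id values are placeholders to be substituted with your own:

```python
import requests

BASE_URL = "https://equistamp.net"


def get_evaluation_session(session_token, session_id):
    """Fetch a single evaluation session by its id."""
    res = requests.get(
        f"{BASE_URL}/evaluationsession",
        headers={"Session-Token": session_token},
        params={"id": session_id},
    )
    if res.status_code != 200:
        # NotFound, Unauthorized etc. come back as JSON error bodies
        raise ValueError(res.json())
    return res.json()
```

On success this returns the EvaluationSession object described by the schema below.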

Responses

Schema of the response body
{
    "oneOf": [
        {
            "$ref": "#/components/schemas/EvaluationSession"
        },
        {
            "type": "object",
            "properties": {
                "items": {
                    "description": "An array of all the items that were found, but capped at most at `per_page`",
                    "type": "array",
                    "items": {
                        "$ref": "#/components/schemas/EvaluationSession"
                    }
                },
                "count": {
                    "description": "The total number of items found",
                    "type": "number",
                    "format": "int32"
                },
                "per_page": {
                    "description": "The number of items returned per page",
                    "type": "number",
                    "format": "int32"
                },
                "page": {
                    "description": "The number of available pages",
                    "type": "number",
                    "format": "int32"
                }
            }
        }
    ]
}

Refer to the common response description: NotFound.

Refer to the common response description: Error.


POST /model

Request body

{
    "name": "my model",
    "description": "# This is a model, see more at [this link](http://some.link)",
    "publisher": "Models R Us",
    "architecture": "RNN",
    "picture": "http://some.example/pic",
    "num_parameters": 30000000,
    "modalities": "text",
    "public": true,
    "public_usable": false,
    "check_availability": true,
    "endpoint_type": "open_ai",
    "setup_code": "(POST \"http://start.my.model\")",
    "teardown_code": "(POST \"http://start.my.model\")",
    "task_holding_queue_url": "string",
    "task_execution_queue_url": "string",
    "task_execution_dlq_url": "string",
    "lambda_arn": "string",
    "cost_per_input_character_usd": 2e-05,
    "cost_per_output_character_usd": 0.0005,
    "cost_per_instance_hour_usd": 4.99,
    "max_characters_per_minute": 400,
    "max_request_per_minute": 30,
    "max_context_window_characters": 4096,
    "request_code": "(openai-call \"sk-your-secret-key\" \"gtp-4-turbo\" task-text)",
    "response_code": "(get-in response [\"json\" \"response\"])"
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "name": {
            "type": "string",
            "example": "my model"
        },
        "description": {
            "type": "string",
            "description": "The description of this model, as displayed on the site. Markdown can be used for formatting",
            "nullable": true,
            "example": "# This is a model, see more at [this link](http://some.link)"
        },
        "publisher": {
            "type": "string",
            "description": "The entity that created this model",
            "nullable": true,
            "example": "Models R Us"
        },
        "architecture": {
            "type": "string",
            "description": "The architecture of this model",
            "nullable": true,
            "example": "RNN"
        },
        "picture": {
            "type": "string",
            "description": "An url to an image representing this model",
            "nullable": true,
            "example": "http://some.example/pic"
        },
        "num_parameters": {
            "type": "integer",
            "format": "int64",
            "description": "The number of parameters of the model",
            "nullable": true,
            "example": 30000000
        },
        "modalities": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "The modalities accepted by this model",
            "enum": [
                "text"
            ],
            "example": "text"
        },
        "public": {
            "type": "boolean",
            "description": "Whether this evaluation should be publicly visible. If true, anyone can view its details."
        },
        "public_usable": {
            "type": "boolean",
            "description": "Whether this model can be tested by anyone. LLMs can cost a lot to run, and these costs are on whoever added the model. This setting is here to add an extra protection against people running up large compute costs on this model. When not set, this is `false`.",
            "example": false
        },
        "check_availability": {
            "type": "boolean",
            "description": "Whether the availability of this model should be checked. When true, we will ping the endpoint every ",
            "nullable": true
        },
        "endpoint_type": {
            "type": "string",
            "description": "The type of endpoint being called. We have dedicated handlers for many of the most popular AI model providers",
            "enum": [
                "aws",
                "together.ai",
                "conversational",
                "google_cloud",
                "azure",
                "text-generation",
                "anthropic",
                "fill-mask",
                "zero-shot-classification",
                "custom",
                "open_ai",
                "text2text-generation",
                "mistral"
            ],
            "example": "open_ai"
        },
        "setup_code": {
            "type": "string",
            "description": "An optional piece of DSL code to be called if the model isn't running. This is useful when your model needs time to spin up - you can defined a call to start it here, which will be called once the model is first used.",
            "nullable": true,
            "example": "(POST \"http://start.my.model\")"
        },
        "teardown_code": {
            "type": "string",
            "description": "An optional piece of DSL code to be run after the model has finished all evaluation sessions. This is useful e.g. when your model is living on an AWS server, where you pay for uptime. You can defined a call to kill the instance, which will be called after no more evaluation sessions are running.",
            "nullable": true,
            "example": "(POST \"http://start.my.model\")"
        },
        "task_holding_queue_url": {
            "type": "string",
            "nullable": true
        },
        "task_execution_queue_url": {
            "type": "string",
            "nullable": true
        },
        "task_execution_dlq_url": {
            "type": "string",
            "nullable": true
        },
        "lambda_arn": {
            "type": "string",
            "nullable": true
        },
        "cost_per_input_character_usd": {
            "type": "number",
            "format": "double",
            "description": "The cost of a single input character in USD. We assume that a single token is 4 characters.",
            "example": 2e-05
        },
        "cost_per_output_character_usd": {
            "type": "number",
            "format": "double",
            "description": "The cost of a single output character in USD. We assume that a single token is 4 characters.",
            "example": 0.0005
        },
        "cost_per_instance_hour_usd": {
            "type": "number",
            "format": "double",
            "description": "The cost of running the model for an hour, in USD. This doesn't include input/output tokens - it's purely the server uptime. This is useful e.g. with HuggingFace inference endpoints, where they charge for server time, not for tokens throughput.",
            "example": 4.99
        },
        "max_characters_per_minute": {
            "type": "integer",
            "format": "int64",
            "description": "The maximum allowed number of characters per minute. We assume that one token is 4 characters. This must be at least 1.",
            "example": 400
        },
        "max_request_per_minute": {
            "type": "integer",
            "format": "int64",
            "description": "The maximum allowed number of requess per minute. This must be at least 1.",
            "example": 30
        },
        "max_context_window_characters": {
            "type": "integer",
            "format": "int64",
            "description": "The maximum number of characters allowed in the context window of this model. We assume that 1 token is 4 characters",
            "nullable": true,
            "example": 4096
        },
        "request_code": {
            "description": "DSL code defining how to send requests to the model. See the [DSL page](/docs/dsl/) for more info.",
            "example": "(openai-call \"sk-your-secret-key\" \"gtp-4-turbo\" task-text)"
        },
        "response_code": {
            "description": "DSL code defining how to parse responses from the model. See the [DSL page](/docs/dsl/) for more info.",
            "example": "(get-in response [\"json\" \"response\"])"
        }
    }
}
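
As a sketch, registering a model could look like the snippet below. The request schema above doesn't mark which fields are required, so treat this minimal payload as an assumption; the key and model names are placeholders:

```python
import requests

BASE_URL = "https://equistamp.net"

# Minimal payload: just a name plus the endpoint configuration.
# All other fields in the request schema are optional or nullable.
payload = {
    "name": "my model",
    "endpoint_type": "open_ai",
    "public": False,
    "public_usable": False,
    "max_request_per_minute": 30,
    "max_characters_per_minute": 400,
    "request_code": '(openai-call "sk-your-secret-key" "gpt-4-turbo" task-text)',
    "response_code": '(get-in response ["json" "response"])',
}


def create_model(session_token):
    """POST the payload and return the new model's id on success."""
    res = requests.post(
        f"{BASE_URL}/model",
        headers={"Session-Token": session_token},
        json=payload,
    )
    if res.status_code != 200:
        raise ValueError(res.json())
    return res.json()["id"]
```

The response body (see below) echoes the model back with server-populated fields such as `id`, `owner_id` and `quality`.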

Responses

{
    "id": "c56d2964-8364-4b8a-b6a6-517ec796fa31",
    "name": "my model",
    "description": "# This is a model, see more at [this link](http://some.link)",
    "owner_id": "4b716715-3cca-411f-b3b7-1dd767965a83",
    "publisher": "Models R Us",
    "architecture": "RNN",
    "picture": "http://some.example/pic",
    "num_parameters": 30000000,
    "modalities": "text",
    "public": true,
    "public_usable": false,
    "check_availability": true,
    "quality": 0.89,
    "endpoint_type": "open_ai",
    "cost_per_input_character_usd": 2e-05,
    "cost_per_output_character_usd": 0.0005,
    "cost_per_instance_hour_usd": 4.99,
    "max_characters_per_minute": 400,
    "max_request_per_minute": 30,
    "max_context_window_characters": 4096,
    "elo_score": 10.12,
    "score": 10.12,
    "availability": 10.12,
    "top_example_id": "f6b9676f-3954-4a35-aa54-b8695e4189ee",
    "worst_example_id": "cdb87dda-fc9b-4bf8-9419-fc5614d88356",
    "owner": {
        "id": "7fd75322-29d1-4e27-9e3b-499cee8cdadc",
        "email_address": "mr.blobby@some.domain",
        "user_name": "mr_blobby",
        "full_name": "Mr Blobby, esq.",
        "user_image": "https://equistamp.com/avatars/123123123123.png",
        "bio": "Hello, my name is Inigo Montoya. You Killed my Father. Prepare to die",
        "display_options": {
            "bio": true,
            "email_address": true,
            "user_image": false
        },
        "join_date": "2022-04-13",
        "subscription_level": "pro",
        "alerts": [
            "a685d9e4-d46b-4f61-9358-9c5c53d2efb5"
        ]
    },
    "top_example": {
        "id": "627596cc-ea70-40cd-b187-56dc733caf0a",
        "task_type": "string",
        "is_task_live": true,
        "modalities": [
            "string"
        ],
        "redacted": true,
        "num_possible_answers": 10.12,
        "evaluation_task_number": 10.12,
        "median_human_completion_seconds": 10.12,
        "median_ai_completion_seconds": 10.12,
        "num_times_human_evaluated": 10.12,
        "num_times_ai_evaluated": 10.12,
        "num_times_humans_answered_correctly": 10.12,
        "num_times_ai_answered_correctly": 10.12,
        "evaluation_id": "df4341dc-0d57-4043-8d26-6277b0fd47de",
        "owner_id": "301c3b95-f8fb-4022-86a0-6ba2273cedee",
        "tags": [
            "ccdc9aec-dfd2-4d8d-a953-fee364f64b4d"
        ]
    },
    "worst_example": null,
    "best_evaluation_session": {
        "id": "e538629a-7b80-48af-9564-c8d07477ab55",
        "datetime_started": "2022-04-13T15:42:05.901Z",
        "datetime_completed": "2022-04-13T15:42:05.901Z",
        "origin": "user",
        "completed": true,
        "failed": true,
        "is_human_being_evaluated": true,
        "num_questions_answered": 10.12,
        "num_answered_correctly": 10.12,
        "num_tasks_to_complete": 10.12,
        "num_endpoint_failures": 10.12,
        "num_endpoint_calls": 10.12,
        "num_characters_sent_to_endpoint": 10.12,
        "num_characters_received_from_endpoint": 10.12,
        "median_seconds_per_task": 10.12,
        "mean_seconds_per_task": 10.12,
        "std_seconds_per_task": 10.12,
        "distribution_of_seconds_per_task": null,
        "min_seconds_per_task": 10.12,
        "max_seconds_per_task": 10.12,
        "median_characters_per_task": 10.12,
        "mean_characters_per_task": 10.12,
        "std_characters_per_task": 10.12,
        "distribution_of_characters_per_task": null,
        "min_characters_per_task": 10.12,
        "max_characters_per_task": 10.12,
        "min_verbosity": 10.12,
        "max_verbosity": 10.12,
        "avg_verbosity": 10.12,
        "median_verbosity": 10.12,
        "evaluatee_id": "b9fc398f-5caf-4d85-bca3-9ee6b5b965d3",
        "evaluation_id": "29b9e5c3-e9e3-4c97-96f0-fd74d191a1d8"
    },
    "worst_evaluation_session": null
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "object",
    "properties": {
        "id": {
            "type": "string",
            "format": "uuid"
        },
        "name": {
            "type": "string",
            "example": "my model"
        },
        "description": {
            "type": "string",
            "description": "The description of this model, as displayed on the site. Markdown can be used for formatting",
            "nullable": true,
            "example": "# This is a model, see more at [this link](http://some.link)"
        },
        "owner_id": {
            "type": "string",
            "format": "uuid"
        },
        "publisher": {
            "type": "string",
            "description": "The entity that created this model",
            "nullable": true,
            "example": "Models R Us"
        },
        "architecture": {
            "type": "string",
            "description": "The architecture of this model",
            "nullable": true,
            "example": "RNN"
        },
        "picture": {
            "type": "string",
            "description": "An url to an image representing this model",
            "nullable": true,
            "example": "http://some.example/pic"
        },
        "num_parameters": {
            "type": "integer",
            "format": "int64",
            "description": "The number of parameters of the model",
            "nullable": true,
            "example": 30000000
        },
        "modalities": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "The modalities accepted by this model",
            "enum": [
                "text"
            ],
            "example": "text"
        },
        "public": {
            "type": "boolean",
            "description": "Whether this evaluation should be publicly visible. If true, anyone can view its details."
        },
        "public_usable": {
            "type": "boolean",
            "description": "Whether this model can be tested by anyone. LLMs can cost a lot to run, and these costs are on whoever added the model. This setting is here to add an extra protection against people running up large compute costs on this model. When not set, this is `false`.",
            "example": false
        },
        "check_availability": {
            "type": "boolean",
            "description": "Whether the availability of this model should be checked. When true, we will ping the endpoint every ",
            "nullable": true
        },
        "quality": {
            "type": "number",
            "format": "double",
            "description": "The quality of this model, i.e. how much it's worth using, from 0 to 1. This is very subjective, and mainly used to decide whether it should be used by default e.g. on the frontpage.",
            "example": 0.89
        },
        "endpoint_type": {
            "type": "string",
            "description": "The type of endpoint being called. We have dedicated handlers for many of the most popular AI model providers",
            "enum": [
                "aws",
                "together.ai",
                "conversational",
                "google_cloud",
                "azure",
                "text-generation",
                "anthropic",
                "fill-mask",
                "zero-shot-classification",
                "custom",
                "open_ai",
                "text2text-generation",
                "mistral"
            ],
            "example": "open_ai"
        },
        "cost_per_input_character_usd": {
            "type": "number",
            "format": "double",
            "description": "The cost of a single input character in USD. We assume that a single token is 4 characters.",
            "example": 2e-05
        },
        "cost_per_output_character_usd": {
            "type": "number",
            "format": "double",
            "description": "The cost of a single output character in USD. We assume that a single token is 4 characters.",
            "example": 0.0005
        },
        "cost_per_instance_hour_usd": {
            "type": "number",
            "format": "double",
            "description": "The cost of running the model for an hour, in USD. This doesn't include input/output tokens - it's purely the server uptime. This is useful e.g. with HuggingFace inference endpoints, where they charge for server time, not for tokens throughput.",
            "example": 4.99
        },
        "max_characters_per_minute": {
            "type": "integer",
            "format": "int64",
            "description": "The maximum allowed number of characters per minute. We assume that one token is 4 characters. This must be at least 1.",
            "example": 400
        },
        "max_request_per_minute": {
            "type": "integer",
            "format": "int64",
            "description": "The maximum allowed number of requess per minute. This must be at least 1.",
            "example": 30
        },
        "max_context_window_characters": {
            "type": "integer",
            "format": "int64",
            "description": "The maximum number of characters allowed in the context window of this model. We assume that 1 token is 4 characters",
            "nullable": true,
            "example": 4096
        },
        "elo_score": {
            "type": "number",
            "format": "double",
            "description": "The ELO score, according to LLMSys",
            "nullable": true
        },
        "score": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "availability": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "top_example_id": {
            "type": "string",
            "format": "uuid",
            "nullable": true
        },
        "worst_example_id": {
            "type": "string",
            "format": "uuid",
            "nullable": true
        },
        "owner": {
            "$ref": "#/components/schemas/ShallowUser"
        },
        "top_example": {
            "$ref": "#/components/schemas/ShallowTask"
        },
        "worst_example": {
            "$ref": "#/components/schemas/ShallowTask"
        },
        "best_evaluation_session": {
            "$ref": "#/components/schemas/ShallowEvaluationSession"
        },
        "worst_evaluation_session": {
            "$ref": "#/components/schemas/ShallowEvaluationSession"
        }
    }
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: Error.


GET /model

Input parameters

Parameter In Type Default Nullable Description
id query string Yes Will return the item with this id, or a NotFound error if no such item exists. When this parameter is provided, only a single item will be returned
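
When `id` is omitted, the response is the paginated form shown in the schema below. A sketch of paging through all models follows; note that only the `id` parameter is documented here, so the `page` query parameter name is an assumption:

```python
import requests

BASE_URL = "https://equistamp.net"


def list_all_models():
    """Collect models across pages, using `count` from the paginated
    response to decide when to stop. The `page` parameter name is an
    assumption - only `id` is documented for this endpoint."""
    models, page = [], 0
    while True:
        res = requests.get(f"{BASE_URL}/model", params={"page": page})
        if res.status_code != 200:
            raise ValueError(res.json())
        body = res.json()
        models.extend(body["items"])
        if len(models) >= body["count"] or not body["items"]:
            return models
        page += 1
```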

Responses

Schema of the response body
{
    "oneOf": [
        {
            "$ref": "#/components/schemas/Model"
        },
        {
            "type": "object",
            "properties": {
                "items": {
                    "description": "An array of all the items that were found, but capped at most at `per_page`",
                    "type": "array",
                    "items": {
                        "$ref": "#/components/schemas/Model"
                    }
                },
                "count": {
                    "description": "The total number of items found",
                    "type": "number",
                    "format": "int32"
                },
                "per_page": {
                    "description": "The number of items returned per page",
                    "type": "number",
                    "format": "int32"
                },
                "page": {
                    "description": "The number of available pages",
                    "type": "number",
                    "format": "int32"
                }
            }
        }
    ]
}

Refer to the common response description: NotFound.

Refer to the common response description: Error.


PUT /model

Request body

{
    "name": "my model",
    "description": "# This is a model, see more at [this link](http://some.link)",
    "publisher": "Models R Us",
    "architecture": "RNN",
    "picture": "http://some.example/pic",
    "num_parameters": 30000000,
    "modalities": "text",
    "public": true,
    "public_usable": false,
    "check_availability": true,
    "endpoint_type": "open_ai",
    "setup_code": "(POST \"http://start.my.model\")",
    "teardown_code": "(POST \"http://start.my.model\")",
    "task_holding_queue_url": "string",
    "task_execution_queue_url": "string",
    "task_execution_dlq_url": "string",
    "lambda_arn": "string",
    "cost_per_input_character_usd": 2e-05,
    "cost_per_output_character_usd": 0.0005,
    "cost_per_instance_hour_usd": 4.99,
    "max_characters_per_minute": 400,
    "max_request_per_minute": 30,
    "max_context_window_characters": 4096
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "name": {
            "type": "string",
            "example": "my model"
        },
        "description": {
            "type": "string",
            "description": "The description of this model, as displayed on the site. Markdown can be used for formatting",
            "nullable": true,
            "example": "# This is a model, see more at [this link](http://some.link)"
        },
        "publisher": {
            "type": "string",
            "description": "The entity that created this model",
            "nullable": true,
            "example": "Models R Us"
        },
        "architecture": {
            "type": "string",
            "description": "The architecture of this model",
            "nullable": true,
            "example": "RNN"
        },
        "picture": {
            "type": "string",
            "description": "An url to an image representing this model",
            "nullable": true,
            "example": "http://some.example/pic"
        },
        "num_parameters": {
            "type": "integer",
            "format": "int64",
            "description": "The number of parameters of the model",
            "nullable": true,
            "example": 30000000
        },
        "modalities": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "The modalities accepted by this model",
            "enum": [
                "text"
            ],
            "example": "text"
        },
        "public": {
            "type": "boolean",
            "description": "Whether this evaluation should be publicly visible. If true, anyone can view its details."
        },
        "public_usable": {
            "type": "boolean",
            "description": "Whether this model can be tested by anyone. LLMs can cost a lot to run, and these costs are on whoever added the model. This setting is here to add an extra protection against people running up large compute costs on this model. When not set, this is `false`.",
            "example": false
        },
        "check_availability": {
            "type": "boolean",
            "description": "Whether the availability of this model should be checked. When true, we will ping the endpoint every ",
            "nullable": true
        },
        "endpoint_type": {
            "type": "string",
            "description": "The type of endpoint being called. We have dedicated handlers for many of the most popular AI model providers",
            "enum": [
                "aws",
                "together.ai",
                "conversational",
                "google_cloud",
                "azure",
                "text-generation",
                "anthropic",
                "fill-mask",
                "zero-shot-classification",
                "custom",
                "open_ai",
                "text2text-generation",
                "mistral"
            ],
            "example": "open_ai"
        },
        "setup_code": {
            "type": "string",
            "description": "An optional piece of DSL code to be called if the model isn't running. This is useful when your model needs time to spin up - you can defined a call to start it here, which will be called once the model is first used.",
            "nullable": true,
            "example": "(POST \"http://start.my.model\")"
        },
        "teardown_code": {
            "type": "string",
            "description": "An optional piece of DSL code to be run after the model has finished all evaluation sessions. This is useful e.g. when your model is living on an AWS server, where you pay for uptime. You can defined a call to kill the instance, which will be called after no more evaluation sessions are running.",
            "nullable": true,
            "example": "(POST \"http://start.my.model\")"
        },
        "task_holding_queue_url": {
            "type": "string",
            "nullable": true
        },
        "task_execution_queue_url": {
            "type": "string",
            "nullable": true
        },
        "task_execution_dlq_url": {
            "type": "string",
            "nullable": true
        },
        "lambda_arn": {
            "type": "string",
            "nullable": true
        },
        "cost_per_input_character_usd": {
            "type": "number",
            "format": "double",
            "description": "The cost of a single input character in USD. We assume that a single token is 4 characters.",
            "example": 2e-05
        },
        "cost_per_output_character_usd": {
            "type": "number",
            "format": "double",
            "description": "The cost of a single output character in USD. We assume that a single token is 4 characters.",
            "example": 0.0005
        },
        "cost_per_instance_hour_usd": {
            "type": "number",
            "format": "double",
            "description": "The cost of running the model for an hour, in USD. This doesn't include input/output tokens - it's purely the server uptime. This is useful e.g. with HuggingFace inference endpoints, where they charge for server time, not for tokens throughput.",
            "example": 4.99
        },
        "max_characters_per_minute": {
            "type": "integer",
            "format": "int64",
            "description": "The maximum allowed number of characters per minute. We assume that one token is 4 characters. This must be at least 1.",
            "example": 400
        },
        "max_request_per_minute": {
            "type": "integer",
            "format": "int64",
            "description": "The maximum allowed number of requess per minute. This must be at least 1.",
            "example": 30
        },
        "max_context_window_characters": {
            "type": "integer",
            "format": "int64",
            "description": "The maximum number of characters allowed in the context window of this model. We assume that 1 token is 4 characters",
            "nullable": true,
            "example": 4096
        }
    }
}
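
A sketch of an update call follows. The request schema above doesn't show how the target model is identified, so passing its id as a query parameter is an assumption; the token and id are placeholders:

```python
import requests

BASE_URL = "https://equistamp.net"


def update_model(session_token, model_id, **changes):
    """PUT changed fields for an existing model. Identifying the model
    via an `id` query parameter is an assumption - the request schema
    doesn't document how the target model is selected."""
    res = requests.put(
        f"{BASE_URL}/model",
        headers={"Session-Token": session_token},
        params={"id": model_id},
        json=changes,
    )
    if res.status_code != 200:
        raise ValueError(res.json())
    # Per the response schema, the body is the string "Model updated"
    return res.json()
```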

Responses

"Model updated"
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "string",
    "enum": [
        "Model updated"
    ]
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: NotFound.

Refer to the common response description: Error.


POST /modelsconnecter

Request body

{
    "evaluation_id": "b66b4389-4843-436c-919a-cc2bbde4c8ae",
    "evaluatee_id": "ca7047ce-a47e-4784-875b-ffb281131aea",
    "cadence": "string",
    "price": 10.12,
    "connections": [
        {
            "evaluation_id": "e29c81ce-92cb-4191-a98c-51d55b0527df",
            "evaluatee_id": "ec79c154-4a57-4e5b-b56d-30f72ab01efb",
            "cadence": "once",
            "price": 123,
            "name": "my wonderful model evaluation"
        }
    ]
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "evaluation_id": {
            "type": "string",
            "format": "uuid"
        },
        "evaluatee_id": {
            "type": "string",
            "format": "uuid"
        },
        "cadence": {
            "type": "string",
            "nullable": true
        },
        "price": {
            "type": "number",
            "format": "int64"
        },
        "connections": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "evaluation_id": {
                        "type": "string",
                        "format": "uuid",
                        "description": "The id of the evaluation to be run"
                    },
                    "evaluatee_id": {
                        "type": "string",
                        "format": "uuid",
                        "description": "The id of the model to be evaluated"
                    },
                    "cadence": {
                        "type": "string",
                        "enum": [
                            "daily",
                            "quarterly",
                            "once",
                            "every 2 weeks",
                            "weekly",
                            "monthly"
                        ],
                        "example": "once",
                        "description": "How often this evaluation should be run on this model"
                    },
                    "price": {
                        "type": "number",
                        "format": "int64",
                        "min": 100,
                        "example": 123,
                        "description": "The price to run a single evaluation on this model. This is the price you expect to pay in cents - if the actual cost will be larger - e.g. if the evaluation has more tasks added, or the model has its pricing updated - then an error will be raised, so you don't get hit with hidden costs"
                    },
                    "name": {
                        "type": "string",
                        "description": "A string identifier for this connection - used for displaying line items in Stripe",
                        "example": "my wonderful model evaluation"
                    }
                }
            }
        }
    }
}
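
A small sketch of building and validating one entry of the `connections` array client-side, under the constraints in the schema (price in cents, minimum 100). The helper name is hypothetical:

```python
# Values taken from the schema above; the builder itself is illustrative.
CADENCES = {"daily", "quarterly", "once", "every 2 weeks", "weekly", "monthly"}
MIN_PRICE_CENTS = 100

def build_connection(evaluation_id: str, evaluatee_id: str,
                     cadence: str = "once", price_cents: int = MIN_PRICE_CENTS,
                     name: str = "") -> dict:
    """Build one entry of the `connections` array, validating locally first."""
    if cadence not in CADENCES:
        raise ValueError(f"unknown cadence: {cadence!r}")
    if price_cents < MIN_PRICE_CENTS:
        raise ValueError("price is in cents and must be at least 100")
    return {
        "evaluation_id": evaluation_id,
        "evaluatee_id": evaluatee_id,
        "cadence": cadence,
        "price": price_cents,
        "name": name,
    }
```

Validating locally mirrors the server's behaviour of rejecting under-priced connections, so errors surface before any request is sent.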

Responses

{
    "id": "9578841c-8225-4226-9825-191e2388178c",
    "evaluation_id": "3b299532-6d74-47c8-bb0c-f048040d364a",
    "evaluatee_id": "9ea81133-9c6f-4c65-b4bc-1ace62a3e561",
    "cadence": "string",
    "price": 10.12,
    "model": {
        "id": "276d9d54-f514-4509-8690-57ae98627c69",
        "name": "my model",
        "description": "# This is a model, see more at [this link](http://some.link)",
        "owner_id": "636ac7d7-7dd4-4ee2-9a69-b8e2bf5f3332",
        "publisher": "Models R Us",
        "architecture": "RNN",
        "picture": "http://some.example/pic",
        "num_parameters": 30000000,
        "modalities": "text",
        "public": true,
        "public_usable": false,
        "check_availability": true,
        "quality": 0.89,
        "endpoint_type": "open_ai",
        "cost_per_input_character_usd": 2e-05,
        "cost_per_output_character_usd": 0.0005,
        "cost_per_instance_hour_usd": 4.99,
        "max_characters_per_minute": 400,
        "max_request_per_minute": 30,
        "max_context_window_characters": 4096,
        "elo_score": 10.12,
        "score": 10.12,
        "availability": 10.12,
        "top_example_id": "3432730e-756a-4d5e-81dd-c9166b97330c",
        "worst_example_id": "000690ea-8a19-4627-a4c0-c1734778e452",
        "owner": "23a1154b-0f04-412f-afb0-04a1b7768f8b",
        "top_example": "c4867b79-f39b-4276-805e-1dbd4c406e82",
        "worst_example": "a3ea7580-07a4-4e66-9d08-250cb4636d14",
        "best_evaluation_session": "671cbe01-5308-4a75-8c60-c23c115551f5",
        "worst_evaluation_session": "8f6dc477-af40-465e-9e9d-6fdb620d810a"
    },
    "evaluation": {
        "id": "667911fc-7bd8-4c6d-94eb-29e8e179fd6b",
        "name": "My lovely evaluation",
        "public": true,
        "public_usable": false,
        "reports_visible": false,
        "quality": 0.89,
        "num_tasks": 2000,
        "description": "# This is an evaluation, see more at [this link](http://some.link)",
        "last_updated": "2022-04-13T15:42:05.901Z",
        "task_types": "MCQ",
        "modalities": "text",
        "min_questions_to_complete": 321,
        "owner": "ce16ae57-2a1a-488b-9b57-15ee5e68abd5",
        "tags": [
            "a893caa0-f12d-44ee-a3b9-c6bfe2bef0e5"
        ]
    }
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "object",
    "properties": {
        "id": {
            "type": "string",
            "format": "uuid"
        },
        "evaluation_id": {
            "type": "string",
            "format": "uuid"
        },
        "evaluatee_id": {
            "type": "string",
            "format": "uuid"
        },
        "cadence": {
            "type": "string",
            "nullable": true
        },
        "price": {
            "type": "number",
            "format": "int64"
        },
        "model": {
            "$ref": "#/components/schemas/ShallowModel"
        },
        "evaluation": {
            "$ref": "#/components/schemas/ShallowEvaluation"
        }
    }
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: Error.


GET /modelsconnecter

Input parameters

Parameter In Type Default Nullable Description
id query string Yes Will return the item with this id, or a NotFound error if it doesn't exist. When this parameter is provided, only a single item is returned

Responses

Schema of the response body
{
    "oneOf": [
        {
            "$ref": "#/components/schemas/EvaluationEvaluatee"
        },
        {
            "type": "object",
            "properties": {
                "items": {
                    "description": "An array of all the items that were found, but capped at most at `per_page`",
                    "type": "array",
                    "items": {
                        "$ref": "#/components/schemas/EvaluationEvaluatee"
                    }
                },
                "count": {
                    "description": "The total number of items found",
                    "type": "number",
                    "format": "int32"
                },
                "per_page": {
                    "description": "The number of items returned per page",
                    "type": "number",
                    "format": "int32"
                },
                "page": {
                    "description": "The number of available pages",
                    "type": "number",
                    "format": "int32"
                }
            }
        }
    ]
}
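
The paginated shape above can be walked with a small iterator. This is a sketch only: the page-selection mechanism is not documented here, so `fetch_page` is left as a caller-supplied function that returns one page of results in the shape of the second `oneOf` branch.

```python
from typing import Callable, Iterator

def iter_all_items(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every item from a paginated response shaped like the schema above.

    `fetch_page(page)` must return a dict with `items` and `count` keys.
    """
    page = 0
    seen = 0
    while True:
        data = fetch_page(page)
        items = data.get("items", [])
        if not items:
            return
        yield from items
        seen += len(items)
        if seen >= data.get("count", 0):
            return  # all `count` items have been seen
        page += 1
```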

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: NotFound.

Refer to the common response description: Error.


POST /queryexternalmodelhandler

Run a task on a model.

Description

This endpoint can be called either as part of an evaluation session, or on its own.

If evaluation_session_id is provided, it will run the task as part of that evaluation session. Each evaluation session has a set number of tasks to evaluate, so if you call this endpoint for a finished evaluation session, you will get an error.

If no evaluation_session_id is provided, the model will be called with the provided task. This is a paid operation, and will subtract the appropriate amount of credits from your account, or raise a 402 if you don't have enough.

You can override the default request and response code of models you administer, and the prompt and grader code of evaluations you administer.
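
The override precedence described below (a task-type-specific entry beats the "default" entry, which beats the stored code) can be sketched as a small resolver; the function name is illustrative, not part of the API:

```python
def resolve_dsl(override, task_type: str, stored_code: str) -> str:
    """Pick the DSL snippet to use for a task, following the precedence rules above."""
    if override is None:
        return stored_code   # no override: use the model's/evaluation's own code
    if isinstance(override, str):
        return override      # a single string applies to every task type
    # Per-task-type dict: the specific key wins, then "default", then the stored
    # code. Empty strings fall through, matching "If this is empty, the default
    # will be used" in the schema descriptions.
    return override.get(task_type) or override.get("default") or stored_code
```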

Request body

{
    "response_time_in_seconds": 10.12,
    "task_id": "5a783b42-8b3b-46dc-bc16-697df56bbe2f",
    "evaluation_session_id": "b95bdcd4-cf93-4f6f-8cdf-a1fd7db57e30",
    "model_id": "c98b0ca2-ff29-4dbe-bdeb-9fd4d0b2fdf0",
    "system_prompt": "(str \"Please answer this: \" task)",
    "prompt": {
        "MCQ": "(str \"I have a multiple choice question for you to answer: \" task)",
        "default": "(str \"Answer this, please: \" task)"
    },
    "request": {
        "MCQ": "(bedrock-call \"your-access-key\" \"your-secret-key\" \"Jurassic\" task-text)",
        "default": "false"
    },
    "response": {
        "MCQ": "(= parsedResponse correct)",
        "default": "false"
    },
    "grader": {
        "MCQ": "(= parsedResponse correct)",
        "default": "false"
    }
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "response_time_in_seconds": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "task_id": {
            "description": "The id of the task to be run on the model",
            "type": "string",
            "format": "uuid"
        },
        "evaluation_session_id": {
            "description": "The id of the evaluation session that is being checked",
            "type": "string",
            "format": "uuid"
        },
        "model_id": {
            "description": "The id of the model that is being evaluation",
            "type": "string",
            "format": "uuid"
        },
        "system_prompt": {
            "description": "DSL code specifying how to construct model system prompts. This can be empty.",
            "type": "string",
            "example": "(str \"Please answer this: \" task)"
        },
        "prompt": {
            "description": "DSL code specifying how to construct model prompts. This can be empty, in which case the prompt code of the evaluation will be used. You can specify a `prompt` that will be used for all types of tasks, or per task type `prompt`s. If you provide both a default `prompt` and one for a specific task type, the specific one takes precedence. This can only be used if you're an admin of the selected evaluation - otherwise an error will be returned.",
            "oneOf": [
                {
                    "type": "string",
                    "description": "DSL code that should be used for all prompts",
                    "example": "(str \"Please answer this: \" task)"
                },
                {
                    "type": "object",
                    "description": "Per task type DSL code. Use the \"default\" key to specify the code that should be used for tasks types that aren't specified - otherwise the system default prompt will be used.",
                    "properties": {
                        "default": {
                            "type": "string",
                            "description": "The default `prompt` to be used for task types that aren't specified.",
                            "example": "(str \"Answer this, please: \" task)"
                        },
                        "FRQ": {
                            "type": "string",
                            "description": "The DSL code to be used to create prompts for FRQ tasks. If this is empty, the default `prompt` will be used"
                        },
                        "bool": {
                            "type": "string",
                            "description": "The DSL code to be used to create prompts for bool tasks. If this is empty, the default `prompt` will be used"
                        },
                        "json": {
                            "type": "string",
                            "description": "The DSL code to be used to create prompts for json tasks. If this is empty, the default `prompt` will be used"
                        },
                        "MCQ": {
                            "type": "string",
                            "description": "The DSL code to be used to create prompts for MCQ tasks. If this is empty, the default `prompt` will be used"
                        }
                    },
                    "example": {
                        "MCQ": "(str \"I have a multiple choice question for you to answer: \" task)",
                        "default": "(str \"Answer this, please: \" task)"
                    }
                }
            ],
            "example": {
                "MCQ": "(str \"I have a multiple choice question for you to answer: \" task)",
                "default": "(str \"Answer this, please: \" task)"
            }
        },
        "request": {
            "description": "DSL code specifying how to send tasks to the model. This can be empty, in which case the request code of the model will be used. You can specify a `request` that will be used for all types of tasks, or per task type `request`s. If you provide both a default `request` and one for a specific task type, the specific one takes precedence. This can only be used if you're an admin of the selected model - otherwise an error will be returned.",
            "oneOf": [
                {
                    "type": "string",
                    "description": "DSL code that should be used for all requests",
                    "example": "(POST \"http://my.model.endpoint\" {:json {\"task\" task}})"
                },
                {
                    "type": "object",
                    "description": "Per task type DSL code. Use the \"default\" key to specify the code that should be used for tasks types that aren't specified - otherwise the system default request code will be used.",
                    "properties": {
                        "default": {
                            "type": "string",
                            "description": "The default `request` to be used for task types that aren't specified.",
                            "example": "(openai-call \"your_key\" \"gtp-4\" task)"
                        },
                        "FRQ": {
                            "type": "string",
                            "description": "The DSL code to be used to send requests for FRQ tasks. If this is empty, the default `request` will be used"
                        },
                        "bool": {
                            "type": "string",
                            "description": "The DSL code to be used to send requests for bool tasks. If this is empty, the default `request` will be used"
                        },
                        "json": {
                            "type": "string",
                            "description": "The DSL code to be used to send requests for json tasks. If this is empty, the default `request` will be used"
                        },
                        "MCQ": {
                            "type": "string",
                            "description": "The DSL code to be used to send requests for MCQ tasks. If this is empty, the default `request` will be used"
                        }
                    },
                    "example": {
                        "MCQ": "(openai-call \"sk-your-secret-key\" \"gtp-4-turbo\" task-text)",
                        "default": "(anthropic-call \"sk-your-secret-key\" \"claude\" task)"
                    }
                }
            ],
            "example": {
                "MCQ": "(bedrock-call \"your-access-key\" \"your-secret-key\" \"Jurassic\" task-text)",
                "default": "false"
            }
        },
        "response": {
            "description": "DSL code specifying how to parse LLM responses. This can be empty, in which case the response code of the model will be used. You can specify a `response` parser that will be used for all types of tasks, or per task type parsers. If you provide both a default parser and one for a specific task type, the specific one takes precedence. This can only be used if you're an admin of the selected model - otherwise an error will be returned.",
            "oneOf": [
                {
                    "type": "string",
                    "description": "DSL code that should be used for all responses",
                    "example": "(get-in response [\"json\" \"resp\"])"
                },
                {
                    "type": "object",
                    "description": "Per task type DSL code. Use the \"default\" key to specify the code that should be used for tasks types that aren't specified - otherwise the model's default parser will be used.",
                    "properties": {
                        "default": {
                            "type": "string",
                            "description": "The default parser to be used for task types that aren't specified.",
                            "example": "response"
                        },
                        "FRQ": {
                            "type": "string",
                            "description": "The DSL code to be used to parse FRQ task responses. If this is empty, the default parser will be used"
                        },
                        "bool": {
                            "type": "string",
                            "description": "The DSL code to be used to parse bool task responses. If this is empty, the default parser will be used"
                        },
                        "json": {
                            "type": "string",
                            "description": "The DSL code to be used to parse json task responses. If this is empty, the default parser will be used"
                        },
                        "MCQ": {
                            "type": "string",
                            "description": "The DSL code to be used to parse MCQ task responses. If this is empty, the default parser will be used"
                        }
                    },
                    "example": {
                        "MCQ": "(= parsedResponse correct)",
                        "default": "false"
                    }
                }
            ],
            "example": {
                "MCQ": "(= parsedResponse correct)",
                "default": "false"
            }
        },
        "grader": {
            "description": "DSL code specifying how to grade LLM responses. This can be empty, in which case the grader of the evaluation will be used. You can specify a grader that will be used for all types of tasks, or per task type graders. If you provide both a default grader and one for a specific task type, the specific one takes precedence. This can only be used if you're an admin of the selected evaluation - otherwise an error will be returned.",
            "oneOf": [
                {
                    "type": "string",
                    "description": "DSL code that should be used for all response",
                    "example": "(= parsedResponse \"ok\")"
                },
                {
                    "type": "object",
                    "description": "Per task type DSL code. Use the \"default\" key to specify the code that should be used for tasks types that aren't specified - otherwise the grader of the evaluation will be used.",
                    "properties": {
                        "default": {
                            "type": "string",
                            "description": "The default grader to be used for task types that aren't specified.",
                            "example": "(if (= parsedResponse correct) 1 0)"
                        },
                        "FRQ": {
                            "type": "string",
                            "description": "The DSL code to be used to grade FRQ tasks. If this is empty, the default grader will be used"
                        },
                        "bool": {
                            "type": "string",
                            "description": "The DSL code to be used to grade bool tasks. If this is empty, the default grader will be used"
                        },
                        "json": {
                            "type": "string",
                            "description": "The DSL code to be used to grade json tasks. If this is empty, the default grader will be used"
                        },
                        "MCQ": {
                            "type": "string",
                            "description": "The DSL code to be used to grade MCQ tasks. If this is empty, the default grader will be used"
                        }
                    },
                    "example": {
                        "MCQ": "(= parsedResponse correct)",
                        "default": "false"
                    }
                }
            ],
            "example": {
                "MCQ": "(= parsedResponse correct)",
                "default": "false"
            }
        }
    }
}

Responses

{
    "id": "dac5244b-b2c4-4384-b090-4c177921c2d3",
    "raw_task_text": "string",
    "raw_response_text": "string",
    "parsed_response_text": "string",
    "response_time_in_seconds": 10.12,
    "correctness": 10.12,
    "task_id": "0cc718da-9881-4a6a-9ce0-15e71163c608",
    "evaluatee_id": "3deebcde-982e-4364-8d8d-1fec093c2b48",
    "chosen_answer_id": "8a9ef395-5ff6-469f-9202-b19040c3e63c",
    "evaluation_session_id": "6cf3e2d3-1f84-4c93-a424-a8376b6cafc9",
    "creation_date": "2022-04-13T15:42:05.901Z"
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "object",
    "properties": {
        "id": {
            "type": "string",
            "format": "uuid"
        },
        "raw_task_text": {
            "type": "string",
            "nullable": true
        },
        "raw_response_text": {
            "type": "string",
            "nullable": true
        },
        "parsed_response_text": {
            "type": "string",
            "nullable": true
        },
        "response_time_in_seconds": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "correctness": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "task_id": {
            "type": "string",
            "format": "uuid"
        },
        "evaluatee_id": {
            "type": "string",
            "format": "uuid"
        },
        "chosen_answer_id": {
            "type": "string",
            "format": "uuid",
            "nullable": true
        },
        "evaluation_session_id": {
            "type": "string",
            "format": "uuid"
        },
        "creation_date": {
            "type": "string",
            "format": "date-time"
        }
    }
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: Error.


POST /response

Request body

{
    "response_time_in_seconds": 10.12,
    "task_id": "68e12e78-da68-450a-a84a-828c4455128a",
    "evaluation_session_id": "a7ba235b-1f57-4f7c-9e5e-83a684ab1904",
    "task_type": "MCQ",
    "question": "What time is it?",
    "answer_text": "Half past nine",
    "answer_id": "23aa466b-55cd-466f-b2f4-3ea1611c9a2b"
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "response_time_in_seconds": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "task_id": {
            "type": "string",
            "format": "uuid"
        },
        "evaluation_session_id": {
            "type": "string",
            "format": "uuid"
        },
        "task_type": {
            "description": "The type of tasks for which this is a response",
            "example": "MCQ",
            "type": "string",
            "enum": [
                "FRQ",
                "bool",
                "json",
                "MCQ"
            ]
        },
        "question": {
            "type": "string",
            "description": "The text of the question for which this is a response",
            "example": "What time is it?"
        },
        "answer_text": {
            "type": "string",
            "description": "The text returned from the model",
            "example": "Half past nine"
        },
        "answer_id": {
            "type": "string",
            "format": "uuid",
            "nullable": true,
            "description": "The id of the selected answer, in the case of multiple choice questions"
        }
    }
}

Responses

{
    "id": "43663d9c-bf7d-4469-897d-891d415ab2b9",
    "raw_task_text": "string",
    "raw_response_text": "string",
    "parsed_response_text": "string",
    "response_time_in_seconds": 10.12,
    "correctness": 10.12,
    "task_id": "7e43ce8d-44f3-4a0c-a589-f222a6d4032b",
    "evaluatee_id": "b0e2fc7d-7312-451e-8382-b7cec800f013",
    "chosen_answer_id": "a01104af-ba21-409d-93e9-d9fcda9c6454",
    "evaluation_session_id": "b22c2e05-27e2-4df1-a80b-c385009ff118",
    "creation_date": "2022-04-13T15:42:05.901Z"
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "object",
    "properties": {
        "id": {
            "type": "string",
            "format": "uuid"
        },
        "raw_task_text": {
            "type": "string",
            "nullable": true
        },
        "raw_response_text": {
            "type": "string",
            "nullable": true
        },
        "parsed_response_text": {
            "type": "string",
            "nullable": true
        },
        "response_time_in_seconds": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "correctness": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "task_id": {
            "type": "string",
            "format": "uuid"
        },
        "evaluatee_id": {
            "type": "string",
            "format": "uuid"
        },
        "chosen_answer_id": {
            "type": "string",
            "format": "uuid",
            "nullable": true
        },
        "evaluation_session_id": {
            "type": "string",
            "format": "uuid"
        },
        "creation_date": {
            "type": "string",
            "format": "date-time"
        }
    }
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: Error.


GET /response

Input parameters

Parameter In Type Default Nullable Description
id query string Yes Will return the item with this id, or a NotFound error if it doesn't exist. When this parameter is provided, only a single item is returned

Responses

Schema of the response body
{
    "oneOf": [
        {
            "$ref": "#/components/schemas/Response"
        },
        {
            "type": "object",
            "properties": {
                "items": {
                    "description": "An array of all the items that were found, but capped at most at `per_page`",
                    "type": "array",
                    "items": {
                        "$ref": "#/components/schemas/Response"
                    }
                },
                "count": {
                    "description": "The total number of items found",
                    "type": "number",
                    "format": "int32"
                },
                "per_page": {
                    "description": "The number of items returned per page",
                    "type": "number",
                    "format": "int32"
                },
                "page": {
                    "description": "The number of available pages",
                    "type": "number",
                    "format": "int32"
                }
            }
        }
    ]
}

Refer to the common response description: NotFound.

Refer to the common response description: Error.


PUT /response

Request body

{
    "response_time_in_seconds": 10.12,
    "task_id": "62a47814-8a55-4f97-833e-eae15d380fad",
    "evaluation_session_id": "f95abe3c-bb53-4a74-80df-7dd8d7c4215b"
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "response_time_in_seconds": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "task_id": {
            "type": "string",
            "format": "uuid"
        },
        "evaluation_session_id": {
            "type": "string",
            "format": "uuid"
        }
    }
}

Responses

"Response updated"
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "string",
    "enum": [
        "Response updated"
    ]
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: NotFound.

Refer to the common response description: Error.


GET /scores

Input parameters

Parameter In Type Default Nullable Description
id query string Yes Will return the item with this id, or a NotFound error if it doesn't exist. When this parameter is provided, only a single item is returned

Responses

Schema of the response body
{
    "oneOf": [
        {
            "$ref": "#/components/schemas/CurrentScores"
        },
        {
            "type": "object",
            "properties": {
                "items": {
                    "description": "An array of all the items that were found, but capped at most at `per_page`",
                    "type": "array",
                    "items": {
                        "$ref": "#/components/schemas/CurrentScores"
                    }
                },
                "count": {
                    "description": "The total number of items found",
                    "type": "number",
                    "format": "int32"
                },
                "per_page": {
                    "description": "The number of items returned per page",
                    "type": "number",
                    "format": "int32"
                },
                "page": {
                    "description": "The number of available pages",
                    "type": "number",
                    "format": "int32"
                }
            }
        }
    ]
}

Refer to the common response description: NotFound.

Refer to the common response description: Error.


POST /schema

Request body

{
    "key": "my-schema",
    "name": "My schema",
    "description": "This is a description. Nice, innit?",
    "type": "json",
    "schema": "{\"$schema\": \"http://json-schema.org/draft-07/schema#\", \"title\": \"JSON parser\", \"type\": \"object\", \"properties\": {\"name\": {\"type\": \"string\"}}}",
    "evaluation_id": "6c4eb8b6-501e-4bc4-b7a4-75451e359d55"
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "key": {
            "description": "The key of this schema, as used in csv file upload references. Reference keys can contain English letters (upper and lowercase), digits and \"-\", \"_\", and \".\"",
            "type": "string",
            "example": "my-schema"
        },
        "name": {
            "description": "The name of this schema, used only for display purposes.",
            "type": "string",
            "example": "My schema"
        },
        "description": {
            "description": "The name of this schema, used only for display purposes.",
            "type": "string",
            "example": "This is a description. Nice, innit?"
        },
        "type": {
            "description": "The type of the new schema",
            "example": "json",
            "type": "string",
            "enum": [
                "json"
            ]
        },
        "schema": {
            "description": "A schema to validate answers against.",
            "example": "{\"$schema\": \"http://json-schema.org/draft-07/schema#\", \"title\": \"JSON parser\", \"type\": \"object\", \"properties\": {\"name\": {\"type\": \"string\"}}}",
            "type": "object"
        },
        "evaluation_id": {
            "description": "The id of the evaluation that this schema is for",
            "type": "string",
            "format": "uuid"
        }
    }
}
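Following the `/auth` example above, a schema can be created by POSTing the documented fields with a session token. This is a sketch: the `Session-Token` header and the 200-on-success check mirror the earlier example and are assumptions, not part of this endpoint's spec.

```python
import requests

API = "https://equistamp.net"

def create_schema(session_token, payload):
    """POST a new answer schema; returns the created record on success."""
    res = requests.post(
        f"{API}/schema",
        headers={"Session-Token": session_token},
        json=payload,
    )
    if res.status_code != 200:  # assumed success status
        raise ValueError(res.json())
    return res.json()

payload = {
    "key": "my-schema",
    "name": "My schema",
    "description": "Validates model answers as JSON objects",
    "type": "json",
    # The schema itself is passed as a serialized JSON Schema document:
    "schema": (
        '{"$schema": "http://json-schema.org/draft-07/schema#", '
        '"type": "object", "properties": {"name": {"type": "string"}}}'
    ),
    "evaluation_id": "6c4eb8b6-501e-4bc4-b7a4-75451e359d55",
}
# created = create_schema(session_token, payload)
```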

Responses

{
    "key": "My-lovely-schema",
    "name": "My lovely schema",
    "description": "This will be used to check stuff",
    "evaluation_id": "6b11e6f8-5df8-4310-9417-c69740aba967",
    "id": "27d6d87e-f9e8-47bb-a14b-526b786e814b"
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "object",
    "properties": {
        "key": {
            "type": "string",
            "description": "The identifier used in csv files for this schema",
            "nullable": true,
            "example": "My-lovely-schema"
        },
        "name": {
            "type": "string",
            "description": "An optional name describing this schema",
            "nullable": true,
            "example": "My lovely schema"
        },
        "description": {
            "type": "string",
            "description": "An optional description of this schema",
            "nullable": true,
            "example": "This will be used to check stuff"
        },
        "evaluation_id": {
            "type": "string",
            "format": "uuid"
        },
        "id": {
            "type": "string",
            "format": "uuid"
        }
    }
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: Error.


GET /schema

Input parameters

Parameter In Type Default Nullable Description
id query string Yes Returns the item with this id. When this parameter is provided, only a single item will be returned

Responses

Schema of the response body
{
    "oneOf": [
        {
            "$ref": "#/components/schemas/SchemaHistory"
        },
        {
            "type": "object",
            "properties": {
                "items": {
                    "description": "An array of all the items that were found, but capped at most at `per_page`",
                    "type": "array",
                    "items": {
                        "$ref": "#/components/schemas/SchemaHistory"
                    }
                },
                "count": {
                    "description": "The total number of items found",
                    "type": "number",
                    "format": "int32"
                },
                "per_page": {
                    "description": "The number of items returned per page",
                    "type": "number",
                    "format": "int32"
                },
                "page": {
                    "description": "The number of available pages",
                    "type": "number",
                    "format": "int32"
                }
            }
        }
    ]
}

Refer to the common response description: NotFound.

Refer to the common response description: Error.


PUT /schema

Request body

{
    "key": "My-lovely-schema",
    "name": "My lovely schema",
    "description": "This will be used to check stuff"
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "key": {
            "type": "string",
            "description": "The identifier used in csv files for this schema",
            "nullable": true,
            "example": "My-lovely-schema"
        },
        "name": {
            "type": "string",
            "description": "An optional name describing this schema",
            "nullable": true,
            "example": "My lovely schema"
        },
        "description": {
            "type": "string",
            "description": "An optional description of this schema",
            "nullable": true,
            "example": "This will be used to check stuff"
        }
    }
}
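An update sketch follows. Note that this endpoint's documentation does not show how the target record is addressed; the `id` query parameter used here is purely an assumption.

```python
import requests

API = "https://equistamp.net"

def update_schema(session_token, schema_id, changes):
    """PUT partial updates to a schema.

    NOTE: selecting the record via an `id` query parameter is an
    assumption -- the addressing mechanism is not documented above.
    """
    res = requests.put(
        f"{API}/schema",
        headers={"Session-Token": session_token},
        params={"id": schema_id},
        json=changes,
    )
    if res.status_code != 200:  # assumed success status
        raise ValueError(res.json())
    return res.json()  # the documented body is "SchemaHistory updated"

changes = {
    "name": "My lovely schema",
    "description": "This will be used to check stuff",
}
# update_schema(session_token, "27d6d87e-f9e8-47bb-a14b-526b786e814b", changes)
```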

Responses

"SchemaHistory updated"
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "string",
    "enum": [
        "SchemaHistory updated"
    ]
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: NotFound.

Refer to the common response description: Error.


POST /subscription

Request body

{
    "confirmed": true,
    "type": "alert",
    "item": "bb7ca0bd-0a81-45e1-94e4-7129324bd6af",
    "method": "email",
    "destination": "(GET \"http://example.com\")"
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "confirmed": {
            "type": "boolean",
            "nullable": true
        },
        "type": {
            "description": "The type of object to subscribe to",
            "example": "alert",
            "type": "string",
            "enum": [
                "alert",
                "evaluation_session"
            ]
        },
        "item": {
            "description": "The id of the item to subscribe to",
            "type": "string",
            "format": "uuid"
        },
        "method": {
            "description": "The method used to notify",
            "type": "string",
            "example": "email",
            "enum": [
                "email",
                "webhook",
                "sms",
                "call"
            ]
        },
        "destination": {
            "description": "The destination to which messages should be sent. In the case of email methods this must be a valid email. For text messages and calls a valid phone number. In the case of webhooks, this should be a DSL network call.",
            "type": "string",
            "example": "(GET \"http://example.com\")"
        }
    }
}
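For example, to be notified by email when an alert fires (a sketch; the session-token auth follows the `/auth` example, and the field values are the ones documented above):

```python
import requests

API = "https://equistamp.net"

def subscribe(session_token, payload):
    """Subscribe to an alert or evaluation session."""
    res = requests.post(
        f"{API}/subscription",
        headers={"Session-Token": session_token},
        json=payload,
    )
    if res.status_code != 200:  # assumed success status
        raise ValueError(res.json())
    return res.json()

payload = {
    "type": "alert",                                 # or "evaluation_session"
    "item": "bb7ca0bd-0a81-45e1-94e4-7129324bd6af",  # id of the alert
    "method": "email",                               # email | webhook | sms | call
    "destination": "mr.blobby@some.domain",          # must match the method
}
# subscription = subscribe(session_token, payload)
```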

Responses

{
    "confirmed": true,
    "method": "string",
    "destination": "string"
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "object",
    "properties": {
        "confirmed": {
            "type": "boolean",
            "nullable": true
        },
        "method": {
            "type": "string"
        },
        "destination": {
            "type": "string"
        }
    }
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: Error.


GET /subscription

Input parameters

Parameter In Type Default Nullable Description
id query string Yes Returns the item with this id. When this parameter is provided, only a single item will be returned
item query string Yes The id of the item that was subscribed to
item_type query string Yes The type of subscriptions to look for.

Responses

Schema of the response body
{
    "oneOf": [
        {
            "$ref": "#/components/schemas/Subscriber"
        },
        {
            "type": "object",
            "properties": {
                "items": {
                    "description": "An array of all the items that were found, but capped at most at `per_page`",
                    "type": "array",
                    "items": {
                        "$ref": "#/components/schemas/Subscriber"
                    }
                },
                "count": {
                    "description": "The total number of items found",
                    "type": "number",
                    "format": "int32"
                },
                "per_page": {
                    "description": "The number of items returned per page",
                    "type": "number",
                    "format": "int32"
                },
                "page": {
                    "description": "The number of available pages",
                    "type": "number",
                    "format": "int32"
                }
            }
        }
    ]
}

Refer to the common response description: NotFound.

Refer to the common response description: Error.


PUT /subscription

Request body

{
    "confirmed": true
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "confirmed": {
            "type": "boolean",
            "nullable": true
        }
    }
}

Responses

"Subscriber updated"
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "string",
    "enum": [
        "Subscriber updated"
    ]
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: NotFound.

Refer to the common response description: Error.


POST /tag

Request body

{
    "name": "string"
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "name": {
            "type": "string"
        }
    }
}

Responses

{
    "id": "108bff04-4420-43b8-b185-7c1f3d355f65",
    "name": "string"
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "object",
    "properties": {
        "id": {
            "type": "string",
            "format": "uuid"
        },
        "name": {
            "type": "string"
        }
    }
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: Error.


GET /tag

Input parameters

Parameter In Type Default Nullable Description
id query string Yes Returns the item with this id. When this parameter is provided, only a single item will be returned

Responses

Schema of the response body
{
    "oneOf": [
        {
            "$ref": "#/components/schemas/Tag"
        },
        {
            "type": "object",
            "properties": {
                "items": {
                    "description": "An array of all the items that were found, but capped at most at `per_page`",
                    "type": "array",
                    "items": {
                        "$ref": "#/components/schemas/Tag"
                    }
                },
                "count": {
                    "description": "The total number of items found",
                    "type": "number",
                    "format": "int32"
                },
                "per_page": {
                    "description": "The number of items returned per page",
                    "type": "number",
                    "format": "int32"
                },
                "page": {
                    "description": "The number of available pages",
                    "type": "number",
                    "format": "int32"
                }
            }
        }
    ]
}

Refer to the common response description: NotFound.

Refer to the common response description: Error.


PUT /tag

Request body

{
    "name": "string"
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "name": {
            "type": "string"
        }
    }
}

Responses

"Tag updated"
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "string",
    "enum": [
        "Tag updated"
    ]
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: NotFound.

Refer to the common response description: Error.


POST /task

Request body

{
    "task_type": "string",
    "is_task_live": true,
    "modalities": [
        "string"
    ],
    "redacted": true,
    "tags": [
        "372ffb70-8cb1-4381-a390-583ef609b89d"
    ],
    "type": "MCQ",
    "questions": [
        {
            "text": "What time is it?",
            "paraphrases": []
        }
    ],
    "answers": [
        {
            "text": "half past one",
            "paraphrases": [
                "1:30 PM",
                "13:30"
            ],
            "correct": false
        },
        {
            "text": "Time is an illusion",
            "correct": false
        },
        {
            "text": "Now",
            "correct": true
        }
    ],
    "correct": true,
    "schema": "{\"$schema\": \"http://json-schema.org/draft-07/schema#\", \"title\": \"JSON parser\", \"type\": \"object\", \"properties\": {\"name\": {\"type\": \"string\"}}}",
    "evaluation_id": "8a067fa5-3527-48c2-85fa-fb27e0dc6c8b"
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "task_type": {
            "type": "string"
        },
        "is_task_live": {
            "type": "boolean",
            "nullable": true
        },
        "modalities": {
            "type": "array",
            "items": {
                "type": "string"
            }
        },
        "redacted": {
            "type": "boolean"
        },
        "tags": {
            "type": "array",
            "items": {
                "type": "string",
                "format": "uuid"
            }
        },
        "type": {
            "description": "The type of the new task",
            "example": "MCQ",
            "type": "string",
            "enum": [
                "FRQ",
                "bool",
                "json",
                "MCQ"
            ]
        },
        "questions": {
            "description": "The task questions - i.e. what the models should answer",
            "example": [
                {
                    "text": "What time is it?",
                    "paraphrases": []
                }
            ],
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "example": "what time is it?"
                    },
                    "paraphrases": {
                        "type": "array",
                        "items": {
                            "type": "string",
                            "example": "can you tell me the time?"
                        }
                    }
                }
            }
        },
        "answers": {
            "description": "A list of possible answers to be sent to models with the question",
            "type": "array",
            "items": {
                "$ref": "#/components/schemas/MCQAnswer"
            },
            "example": [
                {
                    "text": "half past one",
                    "paraphrases": [
                        "1:30 PM",
                        "13:30"
                    ],
                    "correct": false
                },
                {
                    "text": "Time is an illusion",
                    "correct": false
                },
                {
                    "text": "Now",
                    "correct": true
                }
            ]
        },
        "correct": {
            "description": "Whether this task is correct. This is used in boolean tasks",
            "type": "boolean"
        },
        "schema": {
            "description": "A schema to validate answers against. This is used in JSON tasks",
            "example": "{\"$schema\": \"http://json-schema.org/draft-07/schema#\", \"title\": \"JSON parser\", \"type\": \"object\", \"properties\": {\"name\": {\"type\": \"string\"}}}",
            "type": "string"
        },
        "evaluation_id": {
            "description": "The id of the evaluation that this task is for",
            "type": "string",
            "format": "uuid"
        }
    }
}
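Putting the fields together, an MCQ task with one question and three candidate answers could be created like this (a sketch; the session-token auth and the 200-on-success check follow the `/auth` example and are assumptions):

```python
import requests

API = "https://equistamp.net"

def create_task(session_token, payload):
    """POST a new task to an evaluation."""
    res = requests.post(
        f"{API}/task",
        headers={"Session-Token": session_token},
        json=payload,
    )
    if res.status_code != 200:  # assumed success status
        raise ValueError(res.json())
    return res.json()

payload = {
    "type": "MCQ",  # one of FRQ | bool | json | MCQ
    "evaluation_id": "8a067fa5-3527-48c2-85fa-fb27e0dc6c8b",
    "questions": [
        {"text": "What time is it?",
         "paraphrases": ["Can you tell me the time?"]},
    ],
    "answers": [
        {"text": "half past one",
         "paraphrases": ["1:30 PM", "13:30"], "correct": False},
        {"text": "Time is an illusion", "correct": False},
        {"text": "Now", "correct": True},
    ],
}
# task = create_task(session_token, payload)
```

Boolean tasks use the `correct` field instead of `answers`, and JSON tasks pass a serialized JSON Schema in `schema`, as documented above.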

Responses

{
    "id": "2388e7ce-b3c1-4e7c-9243-eda914667d0d",
    "task_type": "string",
    "is_task_live": true,
    "modalities": [
        "string"
    ],
    "redacted": true,
    "num_possible_answers": 10.12,
    "evaluation_task_number": 10.12,
    "median_human_completion_seconds": 10.12,
    "median_ai_completion_seconds": 10.12,
    "num_times_human_evaluated": 10.12,
    "num_times_ai_evaluated": 10.12,
    "num_times_humans_answered_correctly": 10.12,
    "num_times_ai_answered_correctly": 10.12,
    "evaluation_id": "16e15daa-b411-4243-ac5b-b03043036c93",
    "owner_id": "ad55f409-28d0-47db-9903-5bf9c4fad6b1",
    "tags": [
        {
            "id": "c1e73682-4408-4e6c-8daa-3c1b799d3fe4",
            "name": "string"
        }
    ]
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "object",
    "properties": {
        "id": {
            "type": "string",
            "format": "uuid"
        },
        "task_type": {
            "type": "string"
        },
        "is_task_live": {
            "type": "boolean",
            "nullable": true
        },
        "modalities": {
            "type": "array",
            "items": {
                "type": "string"
            }
        },
        "redacted": {
            "type": "boolean"
        },
        "num_possible_answers": {
            "type": "number",
            "format": "int64"
        },
        "evaluation_task_number": {
            "type": "number",
            "format": "int64"
        },
        "median_human_completion_seconds": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "median_ai_completion_seconds": {
            "type": "number",
            "format": "double",
            "nullable": true
        },
        "num_times_human_evaluated": {
            "type": "number",
            "format": "int64"
        },
        "num_times_ai_evaluated": {
            "type": "number",
            "format": "int64"
        },
        "num_times_humans_answered_correctly": {
            "type": "number",
            "format": "int64"
        },
        "num_times_ai_answered_correctly": {
            "type": "number",
            "format": "int64"
        },
        "evaluation_id": {
            "type": "string",
            "format": "uuid"
        },
        "owner_id": {
            "type": "string",
            "format": "uuid"
        },
        "tags": {
            "type": "array",
            "items": {
                "$ref": "#/components/schemas/ShallowTag"
            }
        }
    }
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: Error.


GET /task

Input parameters

Parameter In Type Default Nullable Description
id query string Yes Returns the item with this id. When this parameter is provided, only a single item will be returned

Responses

Schema of the response body
{
    "oneOf": [
        {
            "$ref": "#/components/schemas/Task"
        },
        {
            "type": "object",
            "properties": {
                "items": {
                    "description": "An array of all the items that were found, but capped at most at `per_page`",
                    "type": "array",
                    "items": {
                        "$ref": "#/components/schemas/Task"
                    }
                },
                "count": {
                    "description": "The total number of items found",
                    "type": "number",
                    "format": "int32"
                },
                "per_page": {
                    "description": "The number of items returned per page",
                    "type": "number",
                    "format": "int32"
                },
                "page": {
                    "description": "The number of available pages",
                    "type": "number",
                    "format": "int32"
                }
            }
        }
    ]
}

Refer to the common response description: NotFound.

Refer to the common response description: Error.


PUT /task

Request body

{
    "task_type": "string",
    "is_task_live": true,
    "modalities": [
        "string"
    ],
    "redacted": true,
    "tags": [
        "758c7803-ab3d-4ea5-8ce7-dce6a195deba"
    ]
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "task_type": {
            "type": "string"
        },
        "is_task_live": {
            "type": "boolean",
            "nullable": true
        },
        "modalities": {
            "type": "array",
            "items": {
                "type": "string"
            }
        },
        "redacted": {
            "type": "boolean"
        },
        "tags": {
            "type": "array",
            "items": {
                "type": "string",
                "format": "uuid"
            }
        }
    }
}

Responses

"Task updated"
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "string",
    "enum": [
        "Task updated"
    ]
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: NotFound.

Refer to the common response description: Error.


POST /user

Request body

{
    "email_address": "mr.blobby@some.domain",
    "user_name": "mr_blobby",
    "full_name": "Mr Blobby, esq.",
    "user_image": "https://equistamp.com/avatars/123123123123.png",
    "bio": "Hello, my name is Inigo Montoya. You Killed my Father. Prepare to die",
    "display_options": {
        "bio": true,
        "email_address": true,
        "user_image": false
    }
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "email_address": {
            "type": "string",
            "description": "The email address of this user. User for logging in, so must be unique.",
            "format": "email",
            "example": "mr.blobby@some.domain"
        },
        "user_name": {
            "type": "string",
            "description": "The user name. Used for logging in and as a unique, human readable identifier of this user",
            "example": "mr_blobby"
        },
        "full_name": {
            "type": "string",
            "description": "The presentable name of this user. This can be any string",
            "nullable": true,
            "example": "Mr Blobby, esq."
        },
        "user_image": {
            "type": "string",
            "description": "The user avatar, as bytes when uploading, and its URL when fetching",
            "nullable": true,
            "example": "https://equistamp.com/avatars/123123123123.png"
        },
        "bio": {
            "type": "string",
            "description": "A description of this user. Will be rendered as markdown on the website",
            "nullable": true,
            "example": "Hello, my name is Inigo Montoya. You Killed my Father. Prepare to die"
        },
        "display_options": {
            "description": "A mapping of <displayable field> to true/false, which controls what will be displayed to other users. No option which is not explicitly enabled will be shown to anyone else than you or system admins. To illustrate, the attached example will only allow the user's bio and email address to be returned when other users call this endpoint, and all other fields will not be returned.",
            "type": "object",
            "additonalProperties": "boolean",
            "example": {
                "bio": true,
                "email_address": true,
                "user_image": false
            }
        }
    }
}
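A registration sketch. Sending the request without a session token is an assumption, inferred from the fact that this endpoint lists no Unauthorized or Unauthenticated responses:

```python
import requests

API = "https://equistamp.net"

def register_user(payload):
    """POST a new user account.

    No Session-Token header is sent here -- an assumption based on the
    absence of auth-related error responses for this endpoint.
    """
    res = requests.post(f"{API}/user", json=payload)
    if res.status_code != 200:  # assumed success status
        raise ValueError(res.json())
    return res.json()

payload = {
    "email_address": "mr.blobby@some.domain",  # must be unique
    "user_name": "mr_blobby",                  # unique, human-readable id
    "full_name": "Mr Blobby, esq.",
    "bio": "Hello, my name is Inigo Montoya.",
    # Only explicitly enabled fields are visible to other users:
    "display_options": {"bio": True, "email_address": True, "user_image": False},
}
# user = register_user(payload)
```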

Responses

{
    "id": "8947fced-c2bc-4df7-a716-e75528489c62",
    "email_address": "mr.blobby@some.domain",
    "user_name": "mr_blobby",
    "full_name": "Mr Blobby, esq.",
    "user_image": "https://equistamp.com/avatars/123123123123.png",
    "bio": "Hello, my name is Inigo Montoya. You Killed my Father. Prepare to die",
    "display_options": {
        "bio": true,
        "email_address": true,
        "user_image": false
    },
    "join_date": "2022-04-13",
    "subscription_level": "pro",
    "alerts": [
        {
            "id": "8c421c10-9d50-454c-a21c-34fcf870cdcf",
            "name": "They are coming!!",
            "description": "string",
            "public": true,
            "last_trigger_date": "2022-04-13T15:42:05.901Z",
            "trigger_cooldown": "string",
            "owner_id": "06a6d247-1966-42cd-a93c-8f9c568035e1",
            "triggers": [
                "b469c7a7-d655-4587-932d-17f04587339e"
            ],
            "subscriptions": [
                "fbc2f8a6-785f-4a6e-9bba-4f3638392094"
            ]
        }
    ]
}
⚠️ This example has been generated automatically from the schema and may not be accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "object",
    "properties": {
        "id": {
            "type": "string",
            "format": "uuid"
        },
        "email_address": {
            "type": "string",
            "description": "The email address of this user. User for logging in, so must be unique.",
            "format": "email",
            "example": "mr.blobby@some.domain"
        },
        "user_name": {
            "type": "string",
            "description": "The user name. Used for logging in and as a unique, human readable identifier of this user",
            "example": "mr_blobby"
        },
        "full_name": {
            "type": "string",
            "description": "The presentable name of this user. This can be any string",
            "nullable": true,
            "example": "Mr Blobby, esq."
        },
        "user_image": {
            "type": "string",
            "description": "The user avatar, as bytes when uploading, and its URL when fetching",
            "nullable": true,
            "example": "https://equistamp.com/avatars/123123123123.png"
        },
        "bio": {
            "type": "string",
            "description": "A description of this user. Will be rendered as markdown on the website",
            "nullable": true,
            "example": "Hello, my name is Inigo Montoya. You Killed my Father. Prepare to die"
        },
        "display_options": {
            "description": "A mapping of <displayable field> to true/false, which controls what will be displayed to other users. No option which is not explicitly enabled will be shown to anyone else than you or system admins. To illustrate, the attached example will only allow the user's bio and email address to be returned when other users call this endpoint, and all other fields will not be returned.",
            "type": "object",
            "additonalProperties": "boolean",
            "example": {
                "bio": true,
                "email_address": true,
                "user_image": false
            }
        },
        "join_date": {
            "type": "string",
            "format": "date"
        },
        "subscription_level": {
            "type": "string",
            "description": "The current subscription level of this user",
            "enum": [
                "admin",
                "free",
                "enterprise",
                "pro"
            ],
            "example": "pro"
        },
        "alerts": {
            "type": "array",
            "items": {
                "$ref": "#/components/schemas/ShallowAlert"
            }
        }
    }
}

Refer to the common response description: Error.


GET /user

Input parameters

Parameter In Type Default Nullable Description
id query string Yes Returns the item with this id. When this parameter is provided, only a single item will be returned

Responses

Schema of the response body
{
    "oneOf": [
        {
            "$ref": "#/components/schemas/User"
        },
        {
            "type": "object",
            "properties": {
                "items": {
                    "description": "An array of all the items that were found, but capped at most at `per_page`",
                    "type": "array",
                    "items": {
                        "$ref": "#/components/schemas/User"
                    }
                },
                "count": {
                    "description": "The total number of items found",
                    "type": "number",
                    "format": "int32"
                },
                "per_page": {
                    "description": "The number of items returned per page",
                    "type": "number",
                    "format": "int32"
                },
                "page": {
                    "description": "The number of available pages",
                    "type": "number",
                    "format": "int32"
                }
            }
        }
    ]
}

Refer to the common response description: NotFound.

Refer to the common response description: Error.


PUT /user

Request body

{
    "email_address": "mr.blobby@some.domain",
    "user_name": "mr_blobby",
    "full_name": "Mr Blobby, esq.",
    "user_image": "https://equistamp.com/avatars/123123123123.png",
    "bio": "Hello, my name is Inigo Montoya. You Killed my Father. Prepare to die",
    "display_options": {
        "bio": true,
        "email_address": true,
        "user_image": false
    }
}
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the request body
{
    "type": "object",
    "properties": {
        "email_address": {
            "type": "string",
            "description": "The email address of this user. Used for logging in, so must be unique.",
            "format": "email",
            "example": "mr.blobby@some.domain"
        },
        "user_name": {
            "type": "string",
            "description": "The user name. Used for logging in and as a unique, human readable identifier of this user",
            "example": "mr_blobby"
        },
        "full_name": {
            "type": "string",
            "description": "The presentable name of this user. This can be any string",
            "nullable": true,
            "example": "Mr Blobby, esq."
        },
        "user_image": {
            "type": "string",
            "description": "The user avatar, as bytes when uploading, and its URL when fetching",
            "nullable": true,
            "example": "https://equistamp.com/avatars/123123123123.png"
        },
        "bio": {
            "type": "string",
            "description": "A description of this user. Will be rendered as markdown on the website",
            "nullable": true,
            "example": "Hello, my name is Inigo Montoya. You Killed my Father. Prepare to die"
        },
        "display_options": {
            "description": "A mapping of <displayable field> to true/false, which controls what will be displayed to other users. No option which is not explicitly enabled will be shown to anyone other than you or system admins. To illustrate, the attached example will only allow the user's bio and email address to be returned when other users call this endpoint; all other fields will not be returned.",
            "type": "object",
            "additionalProperties": "boolean",
            "example": {
                "bio": true,
                "email_address": true,
                "user_image": false
            }
        }
    }
}

Responses

"User updated"
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "type": "string",
    "enum": [
        "User updated"
    ]
}

Refer to the common response description: Unauthorized.

Refer to the common response description: Unauthenticated.

Refer to the common response description: NotFound.

Refer to the common response description: Error.
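A minimal sketch of updating the current user with this endpoint, assuming the same Session-Token authentication as in the GET /auth example (the helper name is illustrative):

```python
import requests

def update_user(session_token, **fields):
    """Update the current user's profile fields.

    Only the fields passed in are sent, e.g.
    update_user(token, bio='Hello!', display_options={'bio': True}).
    Returns "User updated" on success.
    """
    res = requests.put(
        'https://equistamp.net/user',
        headers={'Session-Token': session_token},
        json=fields,
    )
    if res.status_code != 200:
        raise ValueError(res.json())
    return res.json()
```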


Schemas

Alert

Name Type Description
description string | null
id string(uuid)
last_trigger_date string(date-time) | null
name string The name of the alert, displayed in the list of alerts
owner_id string(uuid)
public boolean
subscriptions Array<ShallowSubscriberAlert>
trigger_cooldown string | null How often the trigger can fire
triggers Array<ShallowTrigger>

ColumnMapping

Name Type Description
columnType string
paraphraseOf string | null

CurrentScores

Evaluation

Name Type Description
description string | null The description of this evaluation, as displayed on the site. Markdown can be used for formatting
id string(uuid)
last_updated string(date-time)
min_questions_to_complete integer(int64) | null The default number of tasks to run before an evaluation session is deemed finished. A given evaluation session may process more tasks, as starting a new evaluation session for an evaluation/model pair which is already running will just add more tasks to the current session, rather than starting a new one.
modalities Array<string> The available modalities of this evaluation
name string
num_tasks integer(int64) The total number of tasks defined for this evaluation. Includes redacted tasks.
owner ShallowUser
public boolean Whether this evaluation should be publicly visible. If true, anyone can view its details or evaluate models with it
public_usable boolean Whether this evaluation can be run by anyone. To avoid tasks being leaked, you might want to have the results shown, but have control over what it can be run on.
quality number(double) The quality of this evaluation, i.e. how much it can be trusted, from 0 to 1.
reports_visible boolean Whether anyone can pay to see reports for this evaluation.
tags Array<ShallowTag>
task_types Array<string> The types of tasks supported by this evaluation

EvaluationEvaluatee

Name Type Description
cadence string | null
evaluatee_id string(uuid)
evaluation ShallowEvaluation
evaluation_id string(uuid)
id string(uuid)
model ShallowModel
price number(int64)

EvaluationModelJobs

Name Type Description
creation_date string(date-time)
evaluation_id string(uuid)
id string(uuid)
job_body
job_description string
job_name string
job_schedule_arn string
minutes_between_evaluations number(int64)
model_id string(uuid)
owner_id string(uuid)
start_date string(date-time) | null

EvaluationSession

Name Type Description
avg_verbosity number(double) | null
completed boolean
datetime_completed string(date-time) | null
datetime_started string(date-time)
distribution_of_characters_per_task
distribution_of_seconds_per_task
evaluatee_id string(uuid) In the case of human tests, the id of the user taking the test. In the case of testing models, the id of the model to be tested
evaluation_id string(uuid) The id of the evaluation to be run
failed boolean
id string(uuid)
is_human_being_evaluated boolean Whether this evaluation session is a human test. When false will start an automatic test for the provided model and evaluation.
max_characters_per_task number(double) | null
max_seconds_per_task number(double) | null
max_verbosity number(double) | null
mean_characters_per_task number(double) | null
mean_seconds_per_task number(double) | null
median_characters_per_task number(double) | null
median_seconds_per_task number(double) | null
median_verbosity number(double) | null
min_characters_per_task number(double) | null
min_seconds_per_task number(double) | null
min_verbosity number(double) | null
num_answered_correctly number(int64)
num_characters_received_from_endpoint number(int64)
num_characters_sent_to_endpoint number(int64)
num_endpoint_calls number(int64)
num_endpoint_failures number(int64)
num_questions_answered number(int64)
num_tasks_to_complete number(int64)
origin string The source of this evaluation session, i.e. what triggered it
std_characters_per_task number(double) | null
std_seconds_per_task number(double) | null

MCQAnswer

Name Type Description
correct boolean
paraphrases Array<string> A list of paraphrases of this answer - if provided, will always be used rather than the actual answer text
text string The text of the answer, as will be displayed to the models. If paraphrases are provided, this will never be shown to anyone other than you

Model

Name Type Description
architecture string | null The architecture of this model
availability number(double) | null
best_evaluation_session ShallowEvaluationSession
check_availability boolean | null Whether the availability of this model should be checked. When true, we will ping the endpoint every
cost_per_input_character_usd number(double) The cost of a single input character in USD. We assume that a single token is 4 characters.
cost_per_instance_hour_usd number(double) The cost of running the model for an hour, in USD. This doesn't include input/output tokens - it's purely the server uptime. This is useful e.g. with HuggingFace inference endpoints, where they charge for server time, not for tokens throughput.
cost_per_output_character_usd number(double) The cost of a single output character in USD. We assume that a single token is 4 characters.
description string | null The description of this model, as displayed on the site. Markdown can be used for formatting
elo_score number(double) | null The ELO score, according to LLMSys
endpoint_type string The type of endpoint being called. We have dedicated handlers for many of the most popular AI model providers
id string(uuid)
max_characters_per_minute integer(int64) The maximum allowed number of characters per minute. We assume that one token is 4 characters. This must be at least 1.
max_context_window_characters integer(int64) | null The maximum number of characters allowed in the context window of this model. We assume that 1 token is 4 characters
max_request_per_minute integer(int64) The maximum allowed number of requests per minute. This must be at least 1.
modalities Array<string> The modalities accepted by this model
name string
num_parameters integer(int64) | null The number of parameters of the model
owner ShallowUser
owner_id string(uuid)
picture string | null A URL to an image representing this model
public boolean Whether this model should be publicly visible. If true, anyone can view its details.
public_usable boolean Whether this model can be tested by anyone. LLMs can cost a lot to run, and these costs are on whoever added the model. This setting is here to add an extra protection against people running up large compute costs on this model. When not set, this is `false`.
publisher string | null The entity that created this model
quality number(double) The quality of this model, i.e. how much it's worth using, from 0 to 1. This is very subjective, and mainly used to decide whether it should be used by default e.g. on the frontpage.
score number(double) | null
top_example ShallowTask
top_example_id string(uuid) | null
worst_evaluation_session ShallowEvaluationSession
worst_example ShallowTask
worst_example_id string(uuid) | null

Response

Name Type Description
chosen_answer_id string(uuid) | null
correctness number(double) | null
creation_date string(date-time)
evaluatee_id string(uuid)
evaluation_session_id string(uuid)
id string(uuid)
parsed_response_text string | null
raw_response_text string | null
raw_task_text string | null
response_time_in_seconds number(double) | null
task_id string(uuid)

SchemaHistory

Name Type Description
description string | null An optional description of this schema
evaluation_id string(uuid)
id string(uuid)
key string | null The identifier used in csv files for this schema
name string | null An optional name describing this schema

ShallowAlert

Name Type Description
description string | null
id string(uuid)
last_trigger_date string(date-time) | null
name string The name of the alert, displayed in the list of alerts
owner_id string(uuid)
public boolean
subscriptions Array<string(uuid)>
trigger_cooldown string | null How often the trigger can fire
triggers Array<string(uuid)>

ShallowCurrentScores

ShallowEvaluation

Name Type Description
description string | null The description of this evaluation, as displayed on the site. Markdown can be used for formatting
id string(uuid)
last_updated string(date-time)
min_questions_to_complete integer(int64) | null The default number of tasks to run before an evaluation session is deemed finished. A given evaluation session may process more tasks, as starting a new evaluation session for an evaluation/model pair which is already running will just add more tasks to the current session, rather than starting a new one.
modalities Array<string> The available modalities of this evaluation
name string
num_tasks integer(int64) The total number of tasks defined for this evaluation. Includes redacted tasks.
owner string(uuid)
public boolean Whether this evaluation should be publicly visible. If true, anyone can view its details or evaluate models with it
public_usable boolean Whether this evaluation can be run by anyone. To avoid tasks being leaked, you might want to have the results shown, but have control over what it can be run on.
quality number(double) The quality of this evaluation, i.e. how much it can be trusted, from 0 to 1.
reports_visible boolean Whether anyone can pay to see reports for this evaluation.
tags Array<string(uuid)>
task_types Array<string> The types of tasks supported by this evaluation

ShallowEvaluationEvaluatee

Name Type Description
cadence string | null
evaluatee_id string(uuid)
evaluation string(uuid)
evaluation_id string(uuid)
id string(uuid)
model string(uuid)
price number(int64)

ShallowEvaluationModelJobs

Name Type Description
creation_date string(date-time)
evaluation_id string(uuid)
id string(uuid)
job_body
job_description string
job_name string
job_schedule_arn string
minutes_between_evaluations number(int64)
model_id string(uuid)
owner_id string(uuid)
start_date string(date-time) | null

ShallowEvaluationSession

Name Type Description
avg_verbosity number(double) | null
completed boolean
datetime_completed string(date-time) | null
datetime_started string(date-time)
distribution_of_characters_per_task
distribution_of_seconds_per_task
evaluatee_id string(uuid) In the case of human tests, the id of the user taking the test. In the case of testing models, the id of the model to be tested
evaluation_id string(uuid) The id of the evaluation to be run
failed boolean
id string(uuid)
is_human_being_evaluated boolean Whether this evaluation session is a human test. When false will start an automatic test for the provided model and evaluation.
max_characters_per_task number(double) | null
max_seconds_per_task number(double) | null
max_verbosity number(double) | null
mean_characters_per_task number(double) | null
mean_seconds_per_task number(double) | null
median_characters_per_task number(double) | null
median_seconds_per_task number(double) | null
median_verbosity number(double) | null
min_characters_per_task number(double) | null
min_seconds_per_task number(double) | null
min_verbosity number(double) | null
num_answered_correctly number(int64)
num_characters_received_from_endpoint number(int64)
num_characters_sent_to_endpoint number(int64)
num_endpoint_calls number(int64)
num_endpoint_failures number(int64)
num_questions_answered number(int64)
num_tasks_to_complete number(int64)
origin string The source of this evaluation session, i.e. what triggered it
std_characters_per_task number(double) | null
std_seconds_per_task number(double) | null

ShallowModel

Name Type Description
architecture string | null The architecture of this model
availability number(double) | null
best_evaluation_session string(uuid)
check_availability boolean | null Whether the availability of this model should be checked. When true, we will ping the endpoint every
cost_per_input_character_usd number(double) The cost of a single input character in USD. We assume that a single token is 4 characters.
cost_per_instance_hour_usd number(double) The cost of running the model for an hour, in USD. This doesn't include input/output tokens - it's purely the server uptime. This is useful e.g. with HuggingFace inference endpoints, where they charge for server time, not for tokens throughput.
cost_per_output_character_usd number(double) The cost of a single output character in USD. We assume that a single token is 4 characters.
description string | null The description of this model, as displayed on the site. Markdown can be used for formatting
elo_score number(double) | null The ELO score, according to LLMSys
endpoint_type string The type of endpoint being called. We have dedicated handlers for many of the most popular AI model providers
id string(uuid)
max_characters_per_minute integer(int64) The maximum allowed number of characters per minute. We assume that one token is 4 characters. This must be at least 1.
max_context_window_characters integer(int64) | null The maximum number of characters allowed in the context window of this model. We assume that 1 token is 4 characters
max_request_per_minute integer(int64) The maximum allowed number of requests per minute. This must be at least 1.
modalities Array<string> The modalities accepted by this model
name string
num_parameters integer(int64) | null The number of parameters of the model
owner string(uuid)
owner_id string(uuid)
picture string | null A URL to an image representing this model
public boolean Whether this model should be publicly visible. If true, anyone can view its details.
public_usable boolean Whether this model can be tested by anyone. LLMs can cost a lot to run, and these costs are on whoever added the model. This setting is here to add an extra protection against people running up large compute costs on this model. When not set, this is `false`.
publisher string | null The entity that created this model
quality number(double) The quality of this model, i.e. how much it's worth using, from 0 to 1. This is very subjective, and mainly used to decide whether it should be used by default e.g. on the frontpage.
score number(double) | null
top_example string(uuid)
top_example_id string(uuid) | null
worst_evaluation_session string(uuid)
worst_example string(uuid)
worst_example_id string(uuid) | null

ShallowResponse

Name Type Description
chosen_answer_id string(uuid) | null
correctness number(double) | null
creation_date string(date-time)
evaluatee_id string(uuid)
evaluation_session_id string(uuid)
id string(uuid)
parsed_response_text string | null
raw_response_text string | null
raw_task_text string | null
response_time_in_seconds number(double) | null
task_id string(uuid)

ShallowSchemaHistory

Name Type Description
description string | null An optional description of this schema
evaluation_id string(uuid)
id string(uuid)
key string | null The identifier used in csv files for this schema
name string | null An optional name describing this schema

ShallowSubscriber

Name Type Description
confirmed boolean | null
destination string
method string

ShallowSubscriberAlert

Name Type Description
confirmed boolean | null
destination string
method string

ShallowTag

Name Type Description
id string(uuid)
name string

ShallowTask

Name Type Description
evaluation_id string(uuid)
evaluation_task_number number(int64)
id string(uuid)
is_task_live boolean | null
median_ai_completion_seconds number(double) | null
median_human_completion_seconds number(double) | null
modalities Array<string>
num_possible_answers number(int64)
num_times_ai_answered_correctly number(int64)
num_times_ai_evaluated number(int64)
num_times_human_evaluated number(int64)
num_times_humans_answered_correctly number(int64)
owner_id string(uuid)
redacted boolean
tags Array<string(uuid)>
task_type string

ShallowTrigger

Name Type Description
alert_id string(uuid)
evaluations
id string(uuid)
invert boolean
metric string | null
models
threshold number(double) | null
type string

ShallowUser

Name Type Description
alerts Array<string(uuid)>
bio string | null A description of this user. Will be rendered as markdown on the website
display_options Example: {'bio': True, 'email_address': True, 'user_image': False} A mapping of <displayable field> to true/false, which controls what will be displayed to other users. No option which is not explicitly enabled will be shown to anyone other than you or system admins. To illustrate, the attached example will only allow the user's bio and email address to be returned when other users call this endpoint; all other fields will not be returned.
email_address string(email) The email address of this user. Used for logging in, so must be unique.
full_name string | null The presentable name of this user. This can be any string
id string(uuid)
join_date string(date)
subscription_level string The current subscription level of this user
user_image string | null The user avatar, as bytes when uploading, and its URL when fetching
user_name string The user name. Used for logging in and as a unique, human readable identifier of this user

Subscriber

Name Type Description
confirmed boolean | null
destination string
method string

SubscriberAlert

Name Type Description
confirmed boolean | null
destination string
method string

Tag

Name Type Description
id string(uuid)
name string

Task

Name Type Description
evaluation_id string(uuid)
evaluation_task_number number(int64)
id string(uuid)
is_task_live boolean | null
median_ai_completion_seconds number(double) | null
median_human_completion_seconds number(double) | null
modalities Array<string>
num_possible_answers number(int64)
num_times_ai_answered_correctly number(int64)
num_times_ai_evaluated number(int64)
num_times_human_evaluated number(int64)
num_times_humans_answered_correctly number(int64)
owner_id string(uuid)
redacted boolean
tags Array<ShallowTag>
task_type string

Trigger

Name Type Description
alert_id string(uuid)
evaluations
id string(uuid)
invert boolean
metric string | null
models
threshold number(double) | null
type string

User

Name Type Description
alerts Array<ShallowAlert>
bio string | null A description of this user. Will be rendered as markdown on the website
display_options Example: {'bio': True, 'email_address': True, 'user_image': False} A mapping of <displayable field> to true/false, which controls what will be displayed to other users. No option which is not explicitly enabled will be shown to anyone other than you or system admins. To illustrate, the attached example will only allow the user's bio and email address to be returned when other users call this endpoint; all other fields will not be returned.
email_address string(email) The email address of this user. Used for logging in, so must be unique.
full_name string | null The presentable name of this user. This can be any string
id string(uuid)
join_date string(date)
subscription_level string The current subscription level of this user
user_image string | null The user avatar, as bytes when uploading, and its URL when fetching
user_name string The user name. Used for logging in and as a unique, human readable identifier of this user

Common responses

This section describes common responses that are reused across operations.

Unauthenticated

A valid API token is needed to access this endpoint

"string"
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "description": "An error message describing what happened",
    "type": "string"
}

PaymentRequired

The user has insufficient credits to process this request

"string"
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "description": "An error message describing what happened",
    "type": "string"
}

Unauthorized

The provided API token does not have the appropriate permissions to fulfill this request

"string"
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "description": "An error message describing what happened",
    "type": "string"
}

NotFound

Could not find this item

"string"
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "description": "An error message describing what happened",
    "type": "string"
}

ValidationError

The request has bad data

"string"
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "description": "An error message describing what happened",
    "type": "string"
}

Error

A server error

"string"
⚠️ This example has been generated automatically from the schema and it is not accurate. Refer to the schema for more information.

Schema of the response body
{
    "description": "An error message describing what happened",
    "type": "string"
}
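All of the common responses above share the same body shape, a plain error string, so clients can handle them uniformly. The status-code-to-name mapping below follows the conventional HTTP meanings of these responses and is an assumption, not taken from this specification:

```python
# Map HTTP status codes to the common responses described above. The
# code-to-name pairing is assumed from conventional HTTP semantics.
COMMON_RESPONSES = {
    401: 'Unauthenticated',
    402: 'PaymentRequired',
    403: 'Unauthorized',
    404: 'NotFound',
    422: 'ValidationError',
    500: 'Error',
}

def describe_error(status_code, body):
    """Return a human-readable description of a common error response."""
    name = COMMON_RESPONSES.get(status_code, 'Unknown')
    return f'{name} ({status_code}): {body}'
```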

Common parameters

This section describes common parameters that are reused across operations.

apiToken

Name In Type Default Nullable Description
Api-Token header string No
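Endpoints that accept this parameter expect the token in an Api-Token request header, in place of the Session-Token header used elsewhere in these examples. A sketch, assuming a long-lived token obtained via GET /auth with fields=api_token (the helper name is illustrative):

```python
import requests

def api_get(path, api_token, **params):
    """Call a GET endpoint using a long-lived API token."""
    return requests.get(
        f'https://equistamp.net{path}',
        headers={'Api-Token': api_token},
        params=params,
    )
```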