bmel:promptToxicityScore

Category: LLM & AI Observability · Returns: bmel:number

bmel:promptToxicityScore(prompt: bmel:string)

Description

Computes a toxicity score for the given prompt text in the range [0.0, 1.0]. Internally, the prompt is run through a multi-label toxicity classifier that detects hate speech, harassment, threats, self-harm incitement, sexual content, violence, and prompt injection / jailbreak attempts. The function returns the highest score across all detected categories, so the result is a worst-case signal.

Interpretation:

0.0 = no toxicity detected
< 0.2 = low risk
0.2–0.5 = moderate risk, review recommended
0.5–0.8 = high risk
> 0.8 = critical, likely harmful

Useful as a guardrail metric in LLM_INFERENCE_FRAME and AGENTIC_SESSION_FRAME to detect adversarial inputs, jailbreaks, and policy violations before or after they reach the model.
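The worst-case aggregation and the interpretation bands described above can be sketched in Python. This is an illustrative sketch only, not the BMEL implementation: the function names `worst_case_score` and `risk_band`, the category keys, and the band labels are hypothetical. The documented ranges share their endpoints at 0.5, so assigning exactly 0.5 to "high" is an assumption made here.

```python
from typing import Mapping


def worst_case_score(category_scores: Mapping[str, float]) -> float:
    """Return the highest per-category score (hypothetical helper).

    Mirrors the documented "highest score across all detected categories".
    Returning 0.0 for an empty mapping is an assumption of this sketch.
    """
    return max(category_scores.values(), default=0.0)


def risk_band(score: float) -> str:
    """Map a score in [0.0, 1.0] to the documented interpretation bands.

    Boundary handling is an assumption: 0.2 counts as moderate,
    0.5 and 0.8 count as high, and anything above 0.8 is critical.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0.0, 1.0], got {score!r}")
    if score == 0.0:
        return "none"
    if score < 0.2:
        return "low"
    if score < 0.5:
        return "moderate"
    if score <= 0.8:
        return "high"
    return "critical"


# Example with made-up per-category scores (not real classifier output).
scores = {"hate_speech": 0.1, "prompt_injection": 0.72}
overall = worst_case_score(scores)
band = risk_band(overall)
```

Because the function returns only the maximum, a single high-scoring category (here the hypothetical prompt injection score) determines the band even when every other category scores low.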

Arguments

Parameter | Type | Required | Description
prompt | bmel:string | | The prompt text to evaluate for toxicity.

Example

bmel:promptToxicityScore({getCompletion:Request Payload}.$.prompt)