bmel:promptToxicityScore
Category: LLM & AI Observability · Returns: bmel:number
bmel:promptToxicityScore(prompt: bmel:string)
Description
Computes a toxicity score in the range [0.0, 1.0] for the given prompt text. Internally, the prompt is run through a multi-label toxicity classifier that detects hate speech, harassment, threats, self-harm incitement, sexual content, violence, and prompt injection / jailbreak attempts. The function returns the highest score across all detected categories (worst-case signal).

Interpretation: 0.0 = no toxicity detected; < 0.2 = low risk; 0.2–0.5 = moderate risk, review recommended; 0.5–0.8 = high risk; > 0.8 = critical, likely harmful.

Useful as a guardrail metric in LLM_INFERENCE_FRAME and AGENTIC_SESSION_FRAME to detect adversarial inputs, jailbreaks, and policy violations before or after they reach the model.
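The worst-case aggregation and the interpretation bands described above can be sketched in Python. This is an illustrative sketch only, not the BMEL implementation: the names `worst_case_score` and `risk_band`, the category keys, and the sample scores are all hypothetical. The source does not say which band a score of exactly 0.2 or 0.5 falls into, so the sketch assumes each boundary value belongs to the higher band; 0.8 is treated as high because critical is stated as > 0.8.

```python
from typing import Mapping


def worst_case_score(category_scores: Mapping[str, float]) -> float:
    """Return the highest per-category score, or 0.0 when no category is present."""
    return max(category_scores.values(), default=0.0)


def risk_band(score: float) -> str:
    """Map a score in [0.0, 1.0] to a risk band.

    Assumption: 0.2 and 0.5 are assigned to the higher band, since the
    documented ranges do not say which side is inclusive.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score!r}")
    if score == 0.0:
        return "none"       # no toxicity detected
    if score < 0.2:
        return "low"
    if score < 0.5:
        return "moderate"   # review recommended
    if score <= 0.8:
        return "high"
    return "critical"       # likely harmful


# Hypothetical per-category classifier output for a single prompt.
scores = {"hate_speech": 0.05, "harassment": 0.10, "prompt_injection": 0.62}
overall = worst_case_score(scores)
print(overall, risk_band(overall))
```

Because the maximum is taken rather than an average, a single high-scoring category (here the hypothetical `prompt_injection` entry) determines the result even when every other category scores low.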
Arguments
| Parameter | Type | Required | Description |
|---|---|---|---|
| prompt | bmel:string | ✅ | The prompt text to evaluate for toxicity. |
Example
bmel:promptToxicityScore({getCompletion:Request Payload}.$.prompt)