bmel:promptToxicityScore

Category: LLM & AI Observability · Returns: bmel:number

bmel:promptToxicityScore(prompt: bmel:string)

Description

Computes a toxicity score in the range [0.0, 1.0] for the given prompt text. Internally, the prompt is run through a multi-label toxicity classifier that detects hate speech, harassment, threats, self-harm incitement, sexual content, violence, and prompt injection / jailbreak attempts. The function returns the highest score across all detected categories, so the result is a worst-case signal.

Interpretation: 0.0 = no toxicity detected; < 0.2 = low risk; 0.2–0.5 = moderate risk, review recommended; 0.5–0.8 = high risk; > 0.8 = critical, likely harmful.

Useful as a guardrail metric in LLM_INFERENCE_FRAME and AGENTIC_SESSION_FRAME for detecting adversarial inputs, jailbreaks, and policy violations, whether the check runs before or after the prompt reaches the model.
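
The worst-case aggregation and the interpretation bands can be illustrated with a small Python sketch. This is not BMEL and not the classifier itself: the function names `worst_case_score` and `risk_band`, the category keys, and the sample scores are all hypothetical. The description leaves the band edges at exactly 0.5 ambiguous, so the sketch assumes 0.5 falls in the high band.

```python
from typing import Mapping


def worst_case_score(category_scores: Mapping[str, float]) -> float:
    """Return the highest per-category score, or 0.0 if no category is present."""
    return max(category_scores.values(), default=0.0)


def risk_band(score: float) -> str:
    """Map a score in [0.0, 1.0] to the interpretation bands listed above.

    Assumption: a score of exactly 0.5 is treated as "high"; the reference
    text does not say which band owns that boundary.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0.0, 1.0], got {score!r}")
    if score == 0.0:
        return "none"
    if score < 0.2:
        return "low"
    if score < 0.5:
        return "moderate"
    if score <= 0.8:
        return "high"
    return "critical"


# Hypothetical per-category classifier output for one prompt.
sample_scores = {
    "hate_speech": 0.05,
    "harassment": 0.12,
    "prompt_injection": 0.64,
}

overall = worst_case_score(sample_scores)
print(overall, risk_band(overall))
```

Because the overall score is a maximum rather than an average, a single strongly flagged category (here the hypothetical `prompt_injection` entry) determines the band even when every other category scores low.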

Arguments

Parameter	Type	Required	Description
`prompt`	`bmel:string`	✅	The prompt text to evaluate for toxicity.

Example

bmel:promptToxicityScore({getCompletion:Request Payload}.$.prompt)
