The Llama Guard 3 plugin uses Meta’s Llama-3.1-8B model, fine-tuned for content safety classification. It can classify content in both LLM inputs (prompt classification) and responses (response classification), detecting violations across 14 hazard categories.
## Features
- Dual Deployment Support: Works with the Cloudflare Workers AI binding (low latency) or a REST API fallback (for standalone binaries)
- 14 Hazard Categories: Detects violence, hate speech, sexual content, self-harm, and more
- Flexible Actions: Either reject unsafe requests or log warnings while allowing them through
- Conversation-Aware: Can scan individual messages or entire conversation history
- Token Usage Tracking: Reports token consumption for monitoring and cost control
## Hazard Categories
Llama Guard 3 detects the following categories (S1-S14):
- S1: Violent Crimes
- S2: Non-Violent Crimes
- S3: Sex-Related Crimes
- S4: Child Sexual Exploitation
- S5: Defamation
- S6: Specialized Advice (financial, medical, legal)
- S7: Privacy Violations
- S8: Intellectual Property
- S9: Indiscriminate Weapons
- S10: Hate Speech
- S11: Suicide & Self-Harm
- S12: Sexual Content
- S13: Elections
- S14: Code Interpreter Abuse
## Configuration Options
### `action` (string)
- `reject`: Block requests containing unsafe content (default)
- `warn`: Log warnings but allow requests to proceed
- Default: `"reject"`
### `scanAllMessages` (boolean)
- `true`: Scans all messages in the conversation
- `false`: Only scans the latest message
- Default: `false`
### `rejectMessage` (string)
- Custom message to return when content is rejected
- If not provided, returns a message listing violated categories
- Example: `"Your message violates our content policy"`
### `temperature` (number)
- Controls randomness in model output (0-5)
- Lower values = more consistent classification
- Default: `0.6`
### `maxTokens` (integer)
- Maximum tokens to generate in response (1-2048)
- Default: `256`
### `cloudflareAccountId` (string)
- Required for REST API fallback (standalone binary)
- Your Cloudflare Account ID
- Not needed when running in Cloudflare Workers
### `cloudflareAuthToken` (string)
- Required for REST API fallback (standalone binary)
- Cloudflare API token with Workers AI access
- Not needed when running in Cloudflare Workers
## Example Configurations
### Basic Usage (Cloudflare Workers)
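A minimal sketch of the plugin's options, assuming they are supplied as a JSON object using the fields documented above; inside Cloudflare Workers no credentials are required:

```json
{
  "action": "reject"
}
```

All other options keep their defaults (`scanAllMessages: false`, `temperature: 0.6`, `maxTokens: 256`).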
### Scan Entire Conversation
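A sketch that enables full-conversation scanning (same assumed JSON options shape as above):

```json
{
  "action": "reject",
  "scanAllMessages": true
}
```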
### Warning Mode (Non-Blocking)
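A sketch of non-blocking mode, where violations are logged but requests proceed (same assumed shape):

```json
{
  "action": "warn"
}
```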
### Standalone Binary with REST API
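A sketch for running outside Cloudflare Workers; the account ID and token values are placeholders to replace with your own (same assumed shape):

```json
{
  "action": "reject",
  "cloudflareAccountId": "YOUR_ACCOUNT_ID",
  "cloudflareAuthToken": "YOUR_API_TOKEN"
}
```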
## Response Behavior
### When Content is Safe
- Request proceeds normally
- No modifications to messages
- Debug logs indicate “Content classified as SAFE”
### When Content is Unsafe (`action: "reject"`)
- Request is blocked
- Returns rejection reason with violated categories
- Example: `"Content safety check failed. Violated categories: S10, S11"`
### When Content is Unsafe (`action: "warn"`)
- Request proceeds with warning in debug logs
- No modifications to messages
- Warning includes violated categories
- Useful for monitoring without blocking users
### On Error
- Plugin fails gracefully
- Request is allowed to proceed, with a warning in the debug logs
- Error details logged for troubleshooting
## Deployment Modes
### Cloudflare Workers (Recommended)
Uses the native AI binding for low-latency, cost-efficient inference:
- ✅ Zero cold start
- ✅ Lower latency
- ✅ No auth tokens needed
- ✅ Integrated billing
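For reference, a Worker receives its AI binding through the project's Wrangler configuration; a minimal `wrangler.jsonc` sketch (project name, entry point, and compatibility date are placeholders, and the conventional binding name `AI` is an assumption about what the plugin expects):

```jsonc
{
  "name": "my-gateway-worker",        // placeholder project name
  "main": "src/index.ts",             // placeholder entry point
  "compatibility_date": "2024-09-01",
  "ai": {
    "binding": "AI"                   // assumed binding name
  }
}
```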
### Standalone Binary (REST API Fallback)
Uses Cloudflare’s REST API when the AI binding is not available:
- ✅ Works outside Cloudflare Workers
- ✅ Self-hosted deployment
- ✅ Same model and accuracy
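In this mode the plugin calls the Workers AI REST endpoint, roughly `POST https://api.cloudflare.com/client/v4/accounts/{account_id}/ai/run/@cf/meta/llama-guard-3-8b` with an `Authorization: Bearer <token>` header (the exact model slug is an assumption). A sketch of the request body, assuming the chat-style schema and that the plugin's `temperature` and `maxTokens` options map onto it:

```json
{
  "messages": [
    { "role": "user", "content": "Text to classify" }
  ],
  "temperature": 0.6,
  "max_tokens": 256
}
```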
## Use Cases
### Content Moderation
Block harmful content before it reaches your LLM:
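A sketch (same assumed options shape as in the examples above) that blocks and returns a custom rejection message:

```json
{
  "action": "reject",
  "rejectMessage": "Your message violates our content policy"
}
```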
### Conversation Safety
Monitor entire conversations for policy violations:
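A sketch that blocks on violations anywhere in the conversation history:

```json
{
  "action": "reject",
  "scanAllMessages": true
}
```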
### Compliance Monitoring
Log safety issues without blocking users:
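A sketch that records violations across the whole conversation without blocking anyone:

```json
{
  "action": "warn",
  "scanAllMessages": true
}
```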
### Fine-Grained Control
Adjust `temperature` for stricter or looser classification:
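A sketch that lowers `temperature` for more consistent (stricter) classification; raise it toward 0.8 for looser behavior, per the troubleshooting notes below:

```json
{
  "action": "reject",
  "temperature": 0.2
}
```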
## Performance
- Cloudflare Workers: ~100-200ms latency
- REST API: ~200-500ms latency (depending on network)
- Token Usage: Typically 50-200 tokens per request
- Cost: $0.03 per 1M output tokens
## Limitations
- Message Format: Only processes `user` and `assistant` roles (system/developer/tool messages are treated as user)
- Text Only: Images, audio, and other media are not analyzed
- Language: Optimized for English, may have reduced accuracy in other languages
- Context Length: Limited by model’s context window (~8K tokens)
## Troubleshooting
### Error: "cloudflareAccountId and cloudflareAuthToken are required"
Cause: Running outside Cloudflare Workers without REST API credentials

Solution: Add your Cloudflare Account ID and API token to the configuration

### Error: "Cloudflare API error (401)"
Cause: Invalid or expired API token

Solution:
- Generate a new API token at https://dash.cloudflare.com/profile/api-tokens
- Ensure it has “Workers AI” permissions
- Update `cloudflareAuthToken` in the configuration
### False Positives
Cause: Model may flag legitimate content

Solution:
- Increase `temperature` (e.g., 0.8) for less strict classification
- Use `action: "warn"` to monitor without blocking
- Review debug logs to understand classification reasoning
### High Latency
Cause: Using the REST API instead of the Workers AI binding

Solution:
- Deploy to Cloudflare Workers for optimal performance
- Consider caching results for repeated content
- Use `scanAllMessages: false` to scan only the latest message
## Integration Examples
### With Prompt Engineering
### With Logging Plugin
### Multi-Layer Protection
Use alongside other plugins for defense-in-depth (a hypothetical pipeline sketch follows this list):
- Regex Detect: Block obvious patterns (profanity, PII)
- Llama Guard: Catch nuanced safety issues
- Invisible Text: Prevent steganography attacks
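A hypothetical pipeline sketch; the `plugins` array, the plugin identifiers, and the nested `config` key are illustrative assumptions, not a documented schema:

```jsonc
{
  // Plugins run in order: cheap pattern checks first, model-based checks second.
  // All identifiers below are hypothetical, not confirmed plugin names.
  "plugins": [
    { "name": "regex-detect" },
    { "name": "llama-guard-3", "config": { "action": "reject", "scanAllMessages": true } },
    { "name": "invisible-text" }
  ]
}
```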