Leena AI uses a guardrail model to identify harmful content in LLM responses: when such content is detected, the system handles it by withholding the response rather than returning it to the user.
- Purpose and Functionality
WorkLM Guard is specifically trained for content safety classification. Its primary functions include:
- Classifying content in LLM responses (response classification)
- Generating text outputs indicating whether a given response is safe or unsafe
- Identifying the specific content categories that were violated when content is classified as unsafe (a sketch of consuming such a verdict follows this list)
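The snippet below is a minimal, illustrative sketch of how such a verdict could be consumed downstream. The `guard_classify` callable, the `parse_guard_verdict` helper, and the assumed output format (a "safe"/"unsafe" line optionally followed by category codes) are hypothetical and do not represent the actual WorkLM Guard interface.

```python
def parse_guard_verdict(guard_output: str) -> dict:
    """Parse a hypothetical guard-model text verdict into a structured result."""
    lines = [line.strip() for line in guard_output.strip().splitlines() if line.strip()]
    verdict = lines[0].lower() if lines else "unsafe"  # fail closed on empty output
    categories = lines[1].split(",") if verdict == "unsafe" and len(lines) > 1 else []
    return {"safe": verdict == "safe", "violated_categories": categories}


def handle_llm_response(llm_response: str, guard_classify) -> str:
    """Return the LLM response only if the guard model classifies it as safe.

    `guard_classify` is a placeholder for whatever callable invokes WorkLM Guard.
    """
    result = parse_guard_verdict(guard_classify(llm_response))
    if result["safe"]:
        return llm_response
    # Unsafe content is withheld and a refusal is returned instead.
    return "I'm sorry, but I can't help with that."


# Example usage with a stub classifier that flags a single category.
if __name__ == "__main__":
    print(handle_llm_response("some model output", guard_classify=lambda text: "unsafe\nS1"))
```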
- Mitigation of Harmful Content
WorkLM Guard addresses various types of harmful content, including illegal, defamatory, and fake information, through its comprehensive classification system:
- Hazard Taxonomy: The model is trained on the MLCommons taxonomy, which covers 13 hazard categories. This standardized approach ensures a broad coverage of potential harmful content types.
- Additional Category: An extra category for “Code Interpreter Abuse” has been added, specifically targeting potential misuse in tool call scenarios.
- Binary Decision Making: The model can produce classifier scores based on token probabilities, allowing score thresholding to be applied to reach binary decisions on content safety (a minimal thresholding sketch follows this list).
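As a rough illustration of how such thresholding might work, the sketch below converts the log-probabilities of the model's first output token ("safe" vs. "unsafe") into a normalized score and compares it against a threshold. The function names, the way the log-probabilities are obtained, and the 0.5 default threshold are assumptions for illustration, not values specified for WorkLM Guard.

```python
import math


def unsafe_probability(logprob_safe: float, logprob_unsafe: float) -> float:
    """Normalize the 'safe'/'unsafe' token log-probabilities into an unsafe score."""
    p_safe = math.exp(logprob_safe)
    p_unsafe = math.exp(logprob_unsafe)
    return p_unsafe / (p_safe + p_unsafe)


def is_unsafe(logprob_safe: float, logprob_unsafe: float, threshold: float = 0.5) -> bool:
    """Apply score thresholding to reach a binary content-safety decision."""
    return unsafe_probability(logprob_safe, logprob_unsafe) >= threshold


# Example: the model strongly favours the 'safe' token, so the response passes.
print(is_unsafe(logprob_safe=-0.05, logprob_unsafe=-3.2))  # False
```

Lowering the threshold trades more false positives for higher recall of harmful content; the appropriate operating point depends on the deployment's tolerance for each kind of error.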
- Bias Mitigation
While the provided information doesn’t explicitly mention bias mitigation, several aspects of WorkLM Guard suggest potential bias reduction:
- Standardized Taxonomy: By using the MLCommons taxonomy, the model likely benefits from a well-defined and potentially less biased classification system.
- Fine-tuning Process: The model has been fine-tuned specifically for content safety, which may include measures to reduce biases present in the original WorkLM model.
Preventing Harmful Content in Model Training
Leena AI also prevents harmful content from being included in model training through the following measures:
- Base Model Training
Leena AI adopts a cautious approach to base model training:
- The company does not engage in pretraining or training of base models.
- This strategy significantly reduces the risk of inadvertently incorporating harmful content during the foundational stages of model development.
- Focused Fine-tuning
Leena AI’s model development process centers on targeted fine-tuning:
- Fine-tuning is performed exclusively on data curated for specific use cases.
- This focused approach minimizes the likelihood of introducing harmful content during the fine-tuning phase.
- Rigorous Data Validation
To further ensure the integrity of their training data, Leena AI employs a comprehensive validation process:
- All training data is screened using the WorkLM Guard system.
- WorkLM Guard is designed to detect and identify potentially harmful content.
- If harmful content is detected, it is promptly removed from the training dataset (a minimal filtering sketch follows this list).
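A minimal sketch of this filtering step is shown below. The `guard_classify` callable and the assumed text verdict format ("safe", or "unsafe" plus category codes) are placeholders for however WorkLM Guard is actually invoked; this is not Leena AI's implementation.

```python
def filter_training_data(records: list[str], guard_classify) -> list[str]:
    """Return only the training records that the guard model classifies as safe."""
    clean_records = []
    for record in records:
        verdict = guard_classify(record).strip().lower()
        if verdict.startswith("safe"):
            clean_records.append(record)
        # Records flagged as unsafe are dropped from the training dataset.
    return clean_records


# Example usage with a stub classifier that flags nothing.
if __name__ == "__main__":
    sample = ["How do I reset my password?", "What is the leave policy?"]
    print(filter_training_data(sample, guard_classify=lambda text: "safe"))
```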
- Advanced Content Filtering
For a detailed explanation of the WorkLM Guard system and its role in mitigating harmful content in Large Language Models (LLMs), please refer to the separate document titled “Mitigating harmful content in LLM.”
By implementing these measures, Leena AI demonstrates a strong commitment to maintaining the safety and integrity of its AI models throughout the training and fine-tuning processes.