# LLM-Based Query Classification
## Overview
This document describes the implementation of LLM-based query domain classification in the sim-search project, replacing the previous keyword-based approach.
## Motivation
The previous keyword-based classification had several limitations:
- Relied on static keyword lists that could not keep pace with new terminology
- Could not capture the semantic meaning of queries
- Misclassified ambiguous or novel queries
- Required ongoing manual maintenance of the keyword lists
## Implementation
### New Components
1. **LLM Interface Extension**:
   - Added a `classify_query_domain()` method to the `LLMInterface` class (see the interface sketch after this list)
   - Added a `_classify_query_domain_impl()` private implementation method
   - Configured to use the fast Llama-3.1-8b-instant model by default
2. **Query Processor Updates**:
   - Added a `_structure_query_with_llm()` method that uses the LLM classification results
   - Updated `process_query()` to use both query type and domain classification
   - Retained the keyword-based method as a fallback in case of LLM API failures
3. **Structured Query Enhancements**:
   - Added new fields to the structured query:
     - `domain`: Primary domain type (academic, code, current_events, general)
     - `domain_confidence`: Confidence score for the primary domain
     - `secondary_domains`: Array of secondary domains with confidence scores
     - `classification_reasoning`: Explanation of the classification
4. **Configuration Updates**:
   - Added `classify_query_domain` to the module-specific model assignments (see the configuration sketch after this list)
   - Uses the same Llama-3.1-8b-instant model for domain classification as for other query processing tasks
5. **Logging and Monitoring**:
   - Added detailed logging of domain classification results
   - Logs secondary domains with confidence scores
   - Logs the reasoning behind classifications
6. **Error Handling**:
   - Added a fallback to keyword-based classification if LLM-based classification fails
   - Implemented robust JSON parsing with fallbacks to default values
   - Added explicit error messages for troubleshooting
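The sketch below illustrates how the interface, the robust JSON parsing, and the fallback defaults from items 1 and 6 might fit together. The method names `classify_query_domain()` and `_classify_query_domain_impl()` and the model name come from this document; the prompt text, the `_call_llm()` helper, and the default values are illustrative assumptions, not the actual sim-search implementation.
```python
import json
import logging

logger = logging.getLogger(__name__)

# Defaults used when the LLM response cannot be parsed; the exact values are assumptions.
DEFAULT_CLASSIFICATION = {
    "primary_type": "general",
    "confidence": 0.0,
    "secondary_types": [],
    "reasoning": "Fallback default: LLM response could not be parsed.",
}


class LLMInterface:
    def classify_query_domain(self, query: str) -> dict:
        """Public wrapper; delegates to the private implementation method."""
        return self._classify_query_domain_impl(query)

    def _classify_query_domain_impl(self, query: str) -> dict:
        # Illustrative prompt; the real prompt text is not reproduced in this document.
        prompt = (
            "Classify the following query into one primary domain "
            "(academic, code, current_events, general) and any secondary domains. "
            "Respond with JSON containing primary_type, confidence, "
            "secondary_types, and reasoning.\n\nQuery: " + query
        )
        raw = self._call_llm(prompt, model="llama-3.1-8b-instant")
        try:
            result = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            logger.warning("Domain classification returned unparseable JSON; using defaults")
            return dict(DEFAULT_CLASSIFICATION)
        if not isinstance(result, dict):
            return dict(DEFAULT_CLASSIFICATION)
        # Fill any missing keys with safe defaults rather than failing outright.
        return {**DEFAULT_CLASSIFICATION, **result}

    def _call_llm(self, prompt: str, model: str) -> str:
        # Placeholder: the real LLMInterface sends the prompt to the configured
        # provider and returns the raw text completion.
        raise NotImplementedError
```
The parsed result is then merged into the structured query as the `domain`, `domain_confidence`, `secondary_domains`, and `classification_reasoning` fields listed in item 3.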
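The configuration change can be pictured as one additional entry in the module-to-model mapping. The mapping name and the other key shown here are assumptions; only the `classify_query_domain` key and the model name come from this document.
```python
# Hypothetical module-specific model assignments; the actual config structure may differ.
MODULE_MODEL_ASSIGNMENTS = {
    "structure_query": "llama-3.1-8b-instant",
    "classify_query_domain": "llama-3.1-8b-instant",  # new entry for domain classification
}
```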
### Classification Process
The query domain classification process works as follows:
1. The query is sent to the LLM with a prompt specifying the four domain types
2. The LLM returns a JSON response containing:
   - Primary domain type with confidence score
   - Array of secondary domain types with confidence scores
   - Reasoning for the classification
3. The response is parsed and integrated into the structured query
4. The `is_academic`, `is_code`, and `is_current_events` flags are set based on (see the sketch below):
   - The primary domain matching the type
   - Any secondary domain matching the type with confidence above 0.3
5. The structured query is then used by downstream components such as the search executor
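A minimal sketch of the flag-setting logic in step 4, assuming the classification result shape shown in the examples below; the function and variable names are illustrative, not the exact sim-search code.
```python
# Derive the boolean domain flags from a parsed classification result.
# The 0.3 threshold and field names follow this document; the function name is an assumption.
SECONDARY_CONFIDENCE_THRESHOLD = 0.3


def set_domain_flags(structured_query: dict, classification: dict) -> dict:
    primary = classification.get("primary_type", "general")
    secondary = classification.get("secondary_types", [])

    def matches(domain: str) -> bool:
        # True if the primary domain matches, or any secondary domain matches
        # with confidence above the threshold.
        if primary == domain:
            return True
        return any(
            s.get("type") == domain
            and s.get("confidence", 0.0) > SECONDARY_CONFIDENCE_THRESHOLD
            for s in secondary
        )

    structured_query["is_academic"] = matches("academic")
    structured_query["is_code"] = matches("code")
    structured_query["is_current_events"] = matches("current_events")
    return structured_query
```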
## Benefits
The new approach offers several advantages:
1. **Semantic Understanding**: Captures the meaning and intent of queries rather than just keyword matching
2. **Multi-Domain Recognition**: Recognizes when queries span multiple domains with confidence scores
3. **Self-Explaining**: Provides reasoning for classifications, aiding debugging and transparency
4. **Adaptability**: Automatically adapts to new topics and terminology without code changes
5. **Confidence Scoring**: Indicates how confident the system is in its classification
## Testing and Validation
A comprehensive test script (`test_domain_classification.py`) was created to:
1. Test the raw domain classification function with a variety of queries
2. Test the query processor's integration with domain classification
3. Compare the LLM-based approach with the previous keyword-based approach
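As an illustration, a spot-check in the spirit of that script might look roughly like the following; the query strings, expected labels, and printed fields are examples, not the actual test cases.
```python
# Hypothetical spot-check; assumes LLMInterface is importable from the project's
# query-processing module.
test_queries = [
    "What are the social implications of large language models?",  # expected: academic
    "How do I implement a transformer model in PyTorch?",          # expected: code
]

llm = LLMInterface()
for query in test_queries:
    result = llm.classify_query_domain(query)
    print(f"{query!r} -> {result['primary_type']} ({result['confidence']:.2f})")
```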
## Examples
### Academic Query Example
**Query**: "What are the technological, economic, and social implications of large language models in today's society?"
**LLM Classification**:
```json
{
  "primary_type": "academic",
  "confidence": 0.9,
  "secondary_types": [
    {"type": "general", "confidence": 0.4}
  ],
  "reasoning": "This query is asking about implications of LLMs across multiple domains (technological, economic, and social) which is a scholarly research topic that would be well-addressed by academic sources."
}
```
### Code Query Example
**Query**: "How do I implement a transformer model in PyTorch for text classification?"
**LLM Classification**:
```json
{
  "primary_type": "code",
  "confidence": 0.95,
  "secondary_types": [
    {"type": "academic", "confidence": 0.4}
  ],
  "reasoning": "This is primarily a programming question about implementing a specific model in PyTorch, which is a coding framework. It has academic aspects since it relates to machine learning models, but the focus is on implementation."
}
```
## Future Improvements
Potential enhancements for the future:
1. **Caching**: Add caching for frequently asked or similar queries to reduce API calls
2. **Few-Shot Learning**: Add examples in the prompt to improve classification accuracy
3. **Expanded Domains**: Consider additional domain categories beyond the current four
4. **UI Integration**: Expose classification reasoning in the UI for advanced users
5. **Classification Feedback Loop**: Allow users to correct misclassifications to improve the system over time