# LLM-Based Query Classification

## Overview

This document describes the implementation of LLM-based query domain classification in the sim-search project, replacing the previous keyword-based approach.

## Motivation

The previous keyword-based classification had several limitations:

- Relied on static lists of keywords that needed constant updating
- Could not capture the semantic meaning of queries
- Misclassified ambiguous or novel queries
- Required significant maintenance to keep keyword lists updated

## Implementation

### New Components

1. **LLM Interface Extension**:
   - Added `classify_query_domain()` method to `LLMInterface` class
   - Added `_classify_query_domain_impl()` private implementation method
   - Configured to use the fast Llama-3.1-8b-instant model by default

2. **Query Processor Updates**:
   - Added `_structure_query_with_llm()` method that uses the LLM classification results
   - Updated `process_query()` to use both query type and domain classification
   - Retained keyword-based method as a fallback in case of LLM API failures

3. **Structured Query Enhancements**:
   - Added new fields to the structured query:
     - `domain`: Primary domain type (academic, code, current_events, general)
     - `domain_confidence`: Confidence score for the primary domain
     - `secondary_domains`: Array of secondary domains with confidence scores
     - `classification_reasoning`: Explanation of the classification

4. **Configuration Updates**:
   - Added `classify_query_domain` to the module-specific model assignments
   - Uses the same Llama-3.1-8b-instant model for domain classification as for other query processing tasks

5. **Logging and Monitoring**:
   - Added detailed logging of domain classification results
   - Logs secondary domains with confidence scores
   - Logs the reasoning behind classifications

6. **Error Handling**:
   - Added fallback to keyword-based classification if LLM-based classification fails
   - Implemented robust JSON parsing with fallbacks to default values
   - Added explicit error messages for troubleshooting

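
The error-handling behaviour described above (tolerant JSON parsing, default values, and a keyword fallback) could look roughly like the sketch below. It is not the project's actual code: `parse_classification`, `classify_with_fallback`, `default_classification`, and the specific default values are hypothetical names and choices made for illustration. Only the four domain names and the response keys come from this document.

```python
import json
import logging
import re
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)

# The four domain types named in this document.
VALID_DOMAINS = {"academic", "code", "current_events", "general"}


def default_classification() -> Dict[str, Any]:
    """Hypothetical defaults used when the LLM response is unusable."""
    return {
        "primary_type": "general",
        "confidence": 0.5,
        "secondary_types": [],
        "reasoning": "Default classification (LLM response could not be parsed)",
    }


def _is_number(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_classification(raw: str) -> Dict[str, Any]:
    """Parse the LLM reply, keeping the default for any field that is invalid."""
    result = default_classification()
    # Models sometimes wrap JSON in prose, so take the outermost {...} span.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return result
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        logger.warning("Could not parse classification JSON: %r", raw[:200])
        return result

    primary = data.get("primary_type")
    if isinstance(primary, str) and primary in VALID_DOMAINS:
        result["primary_type"] = primary
    if _is_number(data.get("confidence")):
        result["confidence"] = float(data["confidence"])
    secondary = data.get("secondary_types")
    if isinstance(secondary, list):
        # Drop malformed entries instead of failing the whole classification.
        result["secondary_types"] = [
            {"type": item["type"], "confidence": float(item["confidence"])}
            for item in secondary
            if isinstance(item, dict)
            and isinstance(item.get("type"), str)
            and item["type"] in VALID_DOMAINS
            and _is_number(item.get("confidence"))
        ]
    if isinstance(data.get("reasoning"), str):
        result["reasoning"] = data["reasoning"]
    return result


def classify_with_fallback(
    query: str,
    llm_call: Callable[[str], str],
    keyword_classify: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Use the LLM when it responds; otherwise fall back to keyword rules."""
    try:
        raw = llm_call(query)
    except Exception as exc:  # e.g. network or API errors
        logger.error("LLM classification failed, using keyword fallback: %s", exc)
        return keyword_classify(query)
    return parse_classification(raw)
```

Validating each field separately means one malformed value (for example, an unknown secondary domain) does not discard an otherwise usable response.
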
### Classification Process

The query domain classification process works as follows:

1. The query is sent to the LLM with a prompt specifying the four domain types
2. The LLM returns a JSON response containing:
   - Primary domain type with confidence score
   - Array of secondary domain types with confidence scores
   - Reasoning for the classification
3. The response is parsed and integrated into the structured query
4. Each of the `is_academic`, `is_code`, and `is_current_events` flags is set when either condition holds:
   - The primary domain matches the type
   - Any secondary domain matches the type with confidence above 0.3
5. The structured query is then used by downstream components like the search executor

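
Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the project's code: the function name `apply_domain_flags` and the mapping from response keys (`primary_type`, `secondary_types`) to structured-query fields (`domain`, `secondary_domains`) are assumptions inferred from the field lists in this document. The 0.3 threshold is taken from step 4.

```python
from typing import Any, Dict

# Threshold from the process description: a secondary domain only sets a
# flag when its confidence is strictly above this value.
SECONDARY_THRESHOLD = 0.3

# Domains that have a dedicated boolean flag on the structured query.
FLAG_FOR_DOMAIN = {
    "academic": "is_academic",
    "code": "is_code",
    "current_events": "is_current_events",
}


def apply_domain_flags(
    structured_query: Dict[str, Any], classification: Dict[str, Any]
) -> Dict[str, Any]:
    """Copy classification results into the structured query and set flags."""
    primary = classification.get("primary_type", "general")
    secondary = classification.get("secondary_types", [])

    # Assumed mapping from LLM response keys to structured-query fields.
    structured_query["domain"] = primary
    structured_query["domain_confidence"] = classification.get("confidence", 0.0)
    structured_query["secondary_domains"] = secondary
    structured_query["classification_reasoning"] = classification.get("reasoning", "")

    for domain, flag in FLAG_FOR_DOMAIN.items():
        structured_query[flag] = primary == domain or any(
            item.get("type") == domain
            and item.get("confidence", 0.0) > SECONDARY_THRESHOLD
            for item in secondary
        )
    return structured_query
```

Under these assumptions, a response with primary type `code` and a secondary `academic` entry at confidence 0.4 sets both `is_code` and `is_academic`, while `is_current_events` stays false.
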
## Benefits

The new approach offers several advantages:

1. **Semantic Understanding**: Captures the meaning and intent of queries rather than just keyword matching
2. **Multi-Domain Recognition**: Recognizes when queries span multiple domains with confidence scores
3. **Self-Explaining**: Provides reasoning for classifications, aiding debugging and transparency
4. **Adaptability**: Automatically adapts to new topics and terminology without code changes
5. **Confidence Scoring**: Indicates how confident the system is in its classification

## Testing and Validation

A comprehensive test script (`test_domain_classification.py`) was created to:

1. Test the raw domain classification function with a variety of queries
2. Test the query processor's integration with domain classification
3. Compare the LLM-based approach with the previous keyword-based approach

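
The third item, comparing the two approaches, might be structured along these lines. This sketch does not reproduce `test_domain_classification.py`: `compare_classifiers` and `agreement_rate` are hypothetical helpers, and both classifiers are passed in as plain callables that return a domain name so the harness runs without any API access.

```python
from typing import Callable, Dict, Iterable, List, Union

Row = Dict[str, Union[str, bool]]


def compare_classifiers(
    queries: Iterable[str],
    llm_classify: Callable[[str], str],
    keyword_classify: Callable[[str], str],
) -> List[Row]:
    """Run both classifiers on each query and record whether they agree."""
    rows: List[Row] = []
    for query in queries:
        llm_domain = llm_classify(query)
        keyword_domain = keyword_classify(query)
        rows.append(
            {
                "query": query,
                "llm": llm_domain,
                "keyword": keyword_domain,
                "agree": llm_domain == keyword_domain,
            }
        )
    return rows


def agreement_rate(rows: List[Row]) -> float:
    """Fraction of queries on which both classifiers chose the same domain."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row["agree"]) / len(rows)
```

Rows where `agree` is false are the most useful ones to review by hand, since they show where the two approaches diverge.
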
## Examples

### Academic Query Example

**Query**: "What are the technological, economic, and social implications of large language models in today's society?"

**LLM Classification**:

```json
{
  "primary_type": "academic",
  "confidence": 0.9,
  "secondary_types": [
    {"type": "general", "confidence": 0.4}
  ],
  "reasoning": "This query is asking about implications of LLMs across multiple domains (technological, economic, and social) which is a scholarly research topic that would be well-addressed by academic sources."
}
```

### Code Query Example

**Query**: "How do I implement a transformer model in PyTorch for text classification?"

**LLM Classification**:

```json
{
  "primary_type": "code",
  "confidence": 0.95,
  "secondary_types": [
    {"type": "academic", "confidence": 0.4}
  ],
  "reasoning": "This is primarily a programming question about implementing a specific model in PyTorch, which is a coding framework. It has academic aspects since it relates to machine learning models, but the focus is on implementation."
}
```

## Future Improvements

Potential enhancements for the future:

1. **Caching**: Add caching for frequently asked or similar queries to reduce API calls
2. **Few-Shot Learning**: Add examples in the prompt to improve classification accuracy
3. **Expanded Domains**: Consider additional domain categories beyond the current four
4. **UI Integration**: Expose classification reasoning in the UI for advanced users
5. **Classification Feedback Loop**: Allow users to correct misclassifications to improve the system over time

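
As one possible starting point for the caching idea, the sketch below wraps a classifier with an in-memory LRU cache from Python's standard library. `make_cached_classifier` and `normalize_query` are hypothetical names, and the wrapped callable is assumed to return the raw JSON string so that cached values are immutable. The sketch only catches repeats that are identical after normalization; matching merely similar queries would need a different technique.

```python
from functools import lru_cache
from typing import Callable


def normalize_query(query: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share a key."""
    return " ".join(query.lower().split())


def make_cached_classifier(
    classify: Callable[[str], str], maxsize: int = 1024
) -> Callable[[str], str]:
    """Wrap a classifier so repeated queries skip the underlying API call."""

    @lru_cache(maxsize=maxsize)
    def _cached(normalized: str) -> str:
        # The classifier receives the normalized query text.
        return classify(normalized)

    def cached_classify(query: str) -> str:
        return _cached(normalize_query(query))

    # Expose hit/miss statistics for logging and monitoring.
    cached_classify.cache_info = _cached.cache_info
    return cached_classify
```

Since the cache lives in process memory, it is cleared on restart and is not shared between workers.
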