2025-03-18: LLM-Based Query Classification Implementation
Context
The project was using a keyword-based approach to classify queries into different domains (academic, code, current events). This approach had several limitations:
- Reliance on static keyword lists that needed constant maintenance
- Inability to understand the semantic meaning of queries
- Misclassification of ambiguous queries and of queries containing keywords with multiple meanings
- Difficulty handling emerging topics without updating keyword lists
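The limitations above can be illustrated with a minimal sketch of a keyword matcher. The keyword lists and the function name `classify_by_keywords` are hypothetical, since the project's actual lists are not reproduced in this entry.

```python
from __future__ import annotations

# Hypothetical keyword lists; the project's real lists are not shown in this log.
DOMAIN_KEYWORDS = {
    "academic": {"paper", "study", "journal", "research"},
    "code": {"python", "function", "api", "bug"},
    "current_events": {"latest", "news", "today", "election"},
}


def classify_by_keywords(query: str) -> set[str]:
    """Return every domain whose keyword list shares a word with the query."""
    words = set(query.lower().split())
    return {
        domain
        for domain, keywords in DOMAIN_KEYWORDS.items()
        if words & keywords
    }


# "python" refers to the snake here, but a keyword match cannot tell.
ambiguous = classify_by_keywords("python habitat in southeast asia")

# A topic missing from every list matches nothing until the lists are updated.
unseen = classify_by_keywords("retrieval augmented generation benchmarks")
```

In this sketch the first query is tagged as code purely because of the word "python", and the second query matches no domain at all.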
Decision
- Replace the keyword-based query classification with an LLM-based approach:
  - Implement a new classify_query_domain method in the LLMInterface class
  - Create a new query structuring method that uses the LLM classification results
  - Retain the keyword-based method as a fallback
  - Add confidence scores and reasoning to the classification results
- Enhance the structured query format:
  - Add primary domain and confidence
  - Include secondary domains with confidence scores
  - Add classification reasoning
  - Maintain backward compatibility with the existing search executor
- Use a 0.3 confidence threshold for secondary domains:
  - Set domain flags (is_academic, is_code, is_current_events) based on primary domain
  - Also set flags for secondary domains with confidence scores above 0.3
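The flag rules in this decision can be sketched as follows. Only the three flag names and the 0.3 threshold come from this entry; the `DomainClassification` container, the `build_structured_query` function, and all other field names are assumptions made for illustration and may not match the project's actual structures.

```python
from __future__ import annotations

from dataclasses import dataclass, field

SECONDARY_CONFIDENCE_THRESHOLD = 0.3  # threshold stated in the decision


@dataclass
class DomainClassification:
    """Hypothetical container for one LLM classification result."""
    primary_type: str
    confidence: float
    # Each entry is assumed to look like {"type": "code", "confidence": 0.45}.
    secondary_types: list[dict] = field(default_factory=list)
    reasoning: str = ""


def build_structured_query(query: str, result: DomainClassification) -> dict:
    """Derive the legacy boolean flags from the richer classification."""
    # The primary domain always sets its flag.
    flagged = {result.primary_type}
    # Secondary domains set a flag only when strictly above the threshold.
    for secondary in result.secondary_types:
        if secondary["confidence"] > SECONDARY_CONFIDENCE_THRESHOLD:
            flagged.add(secondary["type"])
    return {
        "original_query": query,
        "domain": result.primary_type,
        "domain_confidence": result.confidence,
        "secondary_domains": result.secondary_types,
        "classification_reasoning": result.reasoning,
        # Flags kept so that a consumer reading only booleans still works.
        "is_academic": "academic" in flagged,
        "is_code": "code" in flagged,
        "is_current_events": "current_events" in flagged,
    }


example = build_structured_query(
    "recent papers on transformer inference kernels",
    DomainClassification(
        primary_type="academic",
        confidence=0.8,
        secondary_types=[
            {"type": "code", "confidence": 0.45},
            {"type": "current_events", "confidence": 0.2},
        ],
        reasoning="Asks for papers; mentions implementation-level kernels.",
    ),
)
```

Here the query is flagged as academic (primary) and code (secondary at 0.45), while current events at 0.2 stays below the threshold and is not flagged. Keeping the boolean flags alongside the new fields is one way to let an existing consumer keep reading the same keys.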
Rationale
- The LLM-based approach provides better semantic understanding of queries
- Multi-domain classification with confidence scores handles complex queries better
- Classifications that include their own reasoning aid debugging and transparency
- The approach handles new topics without keyword-list or code changes
- Retaining the keyword-based fallback keeps the system resilient when LLM classification is unavailable
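The fallback behaviour can be sketched as a thin wrapper. This is a minimal illustration, not the project's implementation: the two callables stand in for LLMInterface.classify_query_domain and the retained keyword method, whose real signatures and return shapes are not given in this entry, and the result keys are assumptions.

```python
from typing import Callable

# The three domains named in this entry.
VALID_DOMAINS = {"academic", "code", "current_events"}


def classify_with_fallback(
    query: str,
    llm_classify: Callable[[str], dict],
    keyword_classify: Callable[[str], dict],
) -> dict:
    """Try the LLM classifier first; use the keyword classifier on any failure."""
    try:
        result = llm_classify(query)
        # Treat an unknown or missing domain as a failed classification.
        if result.get("primary_type") not in VALID_DOMAINS:
            raise ValueError(f"unexpected domain: {result.get('primary_type')!r}")
        return {**result, "source": "llm"}
    except Exception as exc:  # timeout, malformed response, unknown domain
        fallback = keyword_classify(query)
        return {
            **fallback,
            "source": "keyword_fallback",
            "reasoning": f"LLM classification failed: {exc}",
        }


# Stubs used only to exercise the wrapper.
def failing_llm(query: str) -> dict:
    raise TimeoutError("LLM request timed out")


def working_llm(query: str) -> dict:
    return {"primary_type": "code", "confidence": 0.9, "reasoning": "Mentions a segfault."}


def keyword_stub(query: str) -> dict:
    return {"primary_type": "code", "confidence": 0.5}


degraded = classify_with_fallback("fix segfault in C extension", failing_llm, keyword_stub)
normal = classify_with_fallback("fix segfault in C extension", working_llm, keyword_stub)
```

Recording a `source` field in the result is a design choice of this sketch; it makes it visible in logs when a classification came from the fallback path.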
Alternatives Considered
- Expanding the keyword lists:
  - Would still lack semantic understanding
  - Would increase the maintenance burden
  - False positives would still occur
- Using embedding similarity to predefined domain descriptions:
  - Potentially more computationally expensive
  - Less explainable than the LLM's reasoning
  - Would require managing embedding models
- Creating a custom classifier:
  - Would require labeled training data
  - More development effort
  - Less flexible than the LLM approach
Impact
- More accurate query classification, especially for ambiguous or multi-domain queries
- Reduction in maintenance overhead for keyword lists
- Better search engine selection based on query domains
- Improved report generation due to more accurate query understanding
- Enhanced debugging capabilities with classification reasoning