## 2025-03-18: LLM-Based Query Classification Implementation

### Context

The project was using a keyword-based approach to classify queries into different domains (academic, code, current events). This approach had several limitations:

- Reliance on static keyword lists that needed constant maintenance
- Inability to understand the semantic meaning of queries
- False classifications for ambiguous queries or those containing keywords with multiple meanings
- Difficulty handling emerging topics without updating keyword lists

### Decision

1. Replace the keyword-based query classification with an LLM-based approach:
   - Implement a new `classify_query_domain` method in the `LLMInterface` class
   - Create a new query structuring method that uses the LLM classification results
   - Retain the keyword-based method as a fallback
   - Add confidence scores and reasoning to the classification results
2. Enhance the structured query format:
   - Add the primary domain and its confidence
   - Include secondary domains with confidence scores
   - Add classification reasoning
   - Maintain backward compatibility with the existing search executor
3. Use a 0.3 confidence threshold for secondary domains (a sketch of this mapping appears at the end of this entry):
   - Set domain flags (`is_academic`, `is_code`, `is_current_events`) based on the primary domain
   - Also set flags for secondary domains with confidence scores above 0.3

### Rationale

- The LLM-based approach provides better semantic understanding of queries
- Multi-domain classification with confidence scores handles complex queries better
- Self-explaining classifications with reasoning aid debugging and transparency
- The approach adapts to new topics automatically, without code changes
- Retaining the keyword-based fallback ensures system resilience

### Alternatives Considered

1. Expanding the keyword lists:
   - Would still lack semantic understanding
   - Would increase the maintenance burden
   - False positives would still occur
2. Using embedding similarity to predefined domain descriptions:
   - Potentially more computationally expensive
   - Less explainable than the LLM's reasoning
   - Would require managing embedding models
3. Creating a custom classifier:
   - Would require labeled training data
   - More development effort
   - Less flexible than the LLM approach

### Impact

- More accurate query classification, especially for ambiguous or multi-domain queries
- Reduced maintenance overhead for keyword lists
- Better search engine selection based on query domains
- Improved report generation due to more accurate query understanding
- Enhanced debugging capabilities thanks to classification reasoning
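
To make the threshold rule from the Decision section concrete, the sketch below shows one way the classification result could be mapped onto the existing domain flags. The `DomainClassification` dataclass, its field names, and the `domain_flags` helper are illustrative assumptions rather than the project's actual `LLMInterface.classify_query_domain` interface; only the three flags and the 0.3 secondary-domain threshold come from the decision itself.

```python
from dataclasses import dataclass, field
from typing import Dict

# Confidence threshold above which a secondary domain also sets its flag.
SECONDARY_DOMAIN_THRESHOLD = 0.3


@dataclass
class DomainClassification:
    """Assumed shape of a classify_query_domain result (field names are illustrative)."""
    primary_domain: str        # e.g. "academic", "code", "current_events"
    primary_confidence: float  # 0.0 - 1.0
    secondary_domains: Dict[str, float] = field(default_factory=dict)  # domain -> confidence
    reasoning: str = ""        # the LLM's explanation of its classification


def domain_flags(classification: DomainClassification) -> Dict[str, bool]:
    """Map a classification onto the boolean domain flags used downstream.

    The primary domain always sets its flag; a secondary domain only does so
    when its confidence score is above the 0.3 threshold.
    """
    flags = {"is_academic": False, "is_code": False, "is_current_events": False}

    def set_flag(domain: str) -> None:
        key = f"is_{domain}"
        if key in flags:
            flags[key] = True

    set_flag(classification.primary_domain)
    for domain, confidence in classification.secondary_domains.items():
        if confidence > SECONDARY_DOMAIN_THRESHOLD:
            set_flag(domain)
    return flags


# Example: a query that is primarily about code but also touches academic material.
result = DomainClassification(
    primary_domain="code",
    primary_confidence=0.8,
    secondary_domains={"academic": 0.45, "current_events": 0.1},
    reasoning="The query asks for an implementation of a method from a recent paper.",
)
print(domain_flags(result))
# {'is_academic': True, 'is_code': True, 'is_current_events': False}
```

Setting the primary domain's flag unconditionally means at least one domain is always selected, even when the overall classification confidence is low, while the threshold keeps weak secondary signals from triggering unnecessary search engines.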