# LLM-Based Query Classification

## Overview

This document describes the implementation of LLM-based query domain classification in the sim-search project, replacing the previous keyword-based approach.

## Motivation

The previous keyword-based classification had several limitations:

- Relied on static lists of keywords that needed constant updating
- Could not capture the semantic meaning of queries
- Generated false classifications for ambiguous or novel queries
- Required significant maintenance to keep keyword lists updated

## Implementation

### New Components

1. **LLM Interface Extension**:
   - Added a `classify_query_domain()` method to the `LLMInterface` class
   - Added a `_classify_query_domain_impl()` private implementation method
   - Configured to use the fast Llama-3.1-8b-instant model by default

2. **Query Processor Updates**:
   - Added a `_structure_query_with_llm()` method that uses the LLM classification results
   - Updated `process_query()` to use both query type and domain classification
   - Retained the keyword-based method as a fallback in case of LLM API failures

3. **Structured Query Enhancements**:
   - Added new fields to the structured query:
     - `domain`: Primary domain type (academic, code, current_events, general)
     - `domain_confidence`: Confidence score for the primary domain
     - `secondary_domains`: Array of secondary domains with confidence scores
     - `classification_reasoning`: Explanation of the classification

4. **Configuration Updates**:
   - Added `classify_query_domain` to the module-specific model assignments
   - Uses the same Llama-3.1-8b-instant model for domain classification as for other query processing tasks

5. **Logging and Monitoring**:
   - Added detailed logging of domain classification results
   - Logs secondary domains with confidence scores
   - Logs the reasoning behind classifications

6. **Error Handling**:
   - Added a fallback to keyword-based classification if LLM-based classification fails
   - Implemented robust JSON parsing with fallbacks to default values
   - Added explicit error messages for troubleshooting

### Classification Process

The query domain classification process works as follows:

1. The query is sent to the LLM with a prompt specifying the four domain types
2. The LLM returns a JSON response containing:
   - The primary domain type with a confidence score
   - An array of secondary domain types with confidence scores
   - The reasoning for the classification
3. The response is parsed and integrated into the structured query
4. The `is_academic`, `is_code`, and `is_current_events` flags are set based on:
   - The primary domain matching the type, or
   - Any secondary domain matching the type with a confidence above 0.3
5. The structured query is then used by downstream components such as the search executor

A minimal sketch of this flow, including the keyword-based fallback, appears after the testing checklist below.

## Benefits

The new approach offers several advantages:

1. **Semantic Understanding**: Captures the meaning and intent of queries rather than just matching keywords
2. **Multi-Domain Recognition**: Recognizes when queries span multiple domains, with confidence scores
3. **Self-Explaining**: Provides reasoning for classifications, aiding debugging and transparency
4. **Adaptability**: Automatically adapts to new topics and terminology without code changes
5. **Confidence Scoring**: Indicates how confident the system is in its classification

## Testing and Validation

A comprehensive test script (`test_domain_classification.py`) was created to:

1. Test the raw domain classification function with a variety of queries
2. Test the query processor's integration with domain classification
3. Compare the LLM-based approach with the previous keyword-based approach
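The following is a minimal sketch of the classification flow described above: parsing the LLM's JSON response into the structured query, deriving the domain flags with the 0.3 threshold, and falling back to the keyword-based classifier on failure. The function and parameter names other than `classify_query_domain()` are illustrative assumptions, not the actual method names in the codebase.

```python
# Illustrative sketch only: names other than classify_query_domain(), and the
# exact response shape, are assumptions based on this document.

DOMAIN_FLAG_THRESHOLD = 0.3  # secondary domains above this confidence also set flags


def structure_query_with_llm(query: str, llm, keyword_classifier) -> dict:
    """Build a structured query from the LLM's domain classification,
    falling back to keyword-based classification on failure."""
    try:
        # Expected to return JSON like:
        # {"primary_type": "code", "confidence": 0.95,
        #  "secondary_types": [{"type": "academic", "confidence": 0.4}],
        #  "reasoning": "..."}
        result = llm.classify_query_domain(query)
    except Exception:
        # Fall back to the retained keyword-based classifier on any LLM/API error
        return keyword_classifier(query)

    primary = result.get("primary_type", "general")
    secondary = result.get("secondary_types", [])

    structured = {
        "query": query,
        "domain": primary,
        "domain_confidence": result.get("confidence", 0.0),
        "secondary_domains": secondary,
        "classification_reasoning": result.get("reasoning", ""),
    }

    # A flag is set when the domain is the primary type, or appears as a
    # secondary domain with confidence above the 0.3 threshold.
    for flag, domain in (
        ("is_academic", "academic"),
        ("is_code", "code"),
        ("is_current_events", "current_events"),
    ):
        structured[flag] = primary == domain or any(
            s.get("type") == domain and s.get("confidence", 0.0) > DOMAIN_FLAG_THRESHOLD
            for s in secondary
        )

    return structured
```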
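For reference, the kind of check the test script performs might look roughly like the snippet below. The import paths, class names, and sample queries are assumptions for illustration and do not reproduce the actual contents of `test_domain_classification.py`.

```python
# Rough illustration of what test_domain_classification.py exercises;
# the import paths and helper names below are hypothetical.
from query.llm_interface import LLMInterface      # hypothetical import path
from query.query_processor import QueryProcessor  # hypothetical import path

SAMPLE_QUERIES = [
    "What are the social implications of large language models?",  # academic
    "How do I implement a transformer in PyTorch?",                 # code
    "What happened at the latest UN climate summit?",               # current_events
]

llm = LLMInterface()
processor = QueryProcessor()

for q in SAMPLE_QUERIES:
    raw = llm.classify_query_domain(q)       # 1. raw classification
    structured = processor.process_query(q)  # 2. query processor integration
    print(q)
    print("  primary:", raw.get("primary_type"), raw.get("confidence"))
    print("  flags:", {k: structured.get(k)
                       for k in ("is_academic", "is_code", "is_current_events")})
```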
## Examples

### Academic Query Example

**Query**: "What are the technological, economic, and social implications of large language models in today's society?"

**LLM Classification**:

```json
{
  "primary_type": "academic",
  "confidence": 0.9,
  "secondary_types": [
    {"type": "general", "confidence": 0.4}
  ],
  "reasoning": "This query is asking about implications of LLMs across multiple domains (technological, economic, and social) which is a scholarly research topic that would be well-addressed by academic sources."
}
```

### Code Query Example

**Query**: "How do I implement a transformer model in PyTorch for text classification?"

**LLM Classification**:

```json
{
  "primary_type": "code",
  "confidence": 0.95,
  "secondary_types": [
    {"type": "academic", "confidence": 0.4}
  ],
  "reasoning": "This is primarily a programming question about implementing a specific model in PyTorch, which is a coding framework. It has academic aspects since it relates to machine learning models, but the focus is on implementation."
}
```

## Future Improvements

Potential enhancements for the future:

1. **Caching**: Add caching for frequently asked or similar queries to reduce API calls (a minimal sketch of this idea follows this list)
2. **Few-Shot Learning**: Add examples to the prompt to improve classification accuracy
3. **Expanded Domains**: Consider additional domain categories beyond the current four
4. **UI Integration**: Expose classification reasoning in the UI for advanced users
5. **Classification Feedback Loop**: Allow users to correct misclassifications to improve the system over time
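As a sketch of the first improvement, a simple exact-match cache in front of `classify_query_domain()` could look like the following. Nothing below exists in the codebase yet; the class name and normalization strategy are assumptions, and catching semantically similar (rather than identical) queries would additionally require an embedding-based lookup.

```python
# Hypothetical sketch of future improvement 1 (caching); not existing code.
from functools import lru_cache


def _normalize(query: str) -> str:
    # Simple normalization so trivially different phrasings share a cache entry
    return " ".join(query.lower().split())


class CachedDomainClassifier:
    def __init__(self, llm_interface, max_size: int = 1024):
        self._llm = llm_interface
        # lru_cache keeps at most max_size recent classifications in memory,
        # avoiding repeat API calls for repeated or near-identical queries
        self._classify = lru_cache(maxsize=max_size)(self._classify_uncached)

    def _classify_uncached(self, normalized_query: str) -> dict:
        return self._llm.classify_query_domain(normalized_query)

    def classify(self, query: str) -> dict:
        return self._classify(_normalize(query))
```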