LLM-Based Query Classification Implementation Plan

Overview

This document outlines a plan to replace the current keyword-based query classification system with an LLM-based approach. The current system uses predefined keyword lists to determine if a query is academic, code-related, or about current events. This approach is limited by the static nature of the keywords and doesn't capture the semantic meaning of queries. Switching to an LLM-based classification will provide more accurate and adaptable query typing.

Current Limitations

  1. Keyword Dependency:

    • The system relies on static lists of keywords that need constant updating
    • Many relevant terms are likely to be missing, especially for emerging topics
    • Some words have different meanings in different contexts (e.g., "model" can refer to code or academic concepts)
  2. False Classifications:

    • Queries about LLMs being incorrectly classified as code-related instead of academic
    • General queries potentially being misclassified if they happen to contain certain keywords
    • No way to handle queries that span multiple categories
  3. Maintenance Burden:

    • Need to regularly update keyword lists for each category
    • Complex if/then logic to determine query types
    • Hard to adapt to new research domains or technologies

Proposed Solution

Replace the keyword-based classification with an LLM-based classification that:

  1. Uses semantic understanding to determine query intent and domain
  2. Can classify queries into multiple categories with confidence scores
  3. Provides reasoning for the classification
  4. Can adapt to new topics without code changes

Technical Implementation

1. Extend LLM Interface with Domain Classification

Add a new method to the LLMInterface class in query/llm_interface.py:

async def classify_query_domain(self, query: str) -> Dict[str, Any]:
    """
    Classify a query's domain type (academic, code, current_events, general).
    
    Args:
        query: The query to classify
        
    Returns:
        Dictionary with query domain type and confidence scores
    """
    # Get the model assigned to this function
    model_name = self.config.get_module_model('query_processing', 'classify_query_domain')
    
    # Create a new interface with the assigned model if different from current
    if model_name != self.model_name:
        interface = LLMInterface(model_name)
        return await interface._classify_query_domain_impl(query)
    
    return await self._classify_query_domain_impl(query)

async def _classify_query_domain_impl(self, query: str) -> Dict[str, Any]:
    """Implementation of query domain classification."""
    messages = [
        {"role": "system", "content": """You are an expert query classifier. 
        Analyze the given query and classify it into the following domain types:
        - academic: Related to scholarly research, scientific studies, academic papers, formal theories, university-level research topics, or scholarly fields of study
        - code: Related to programming, software development, technical implementation, coding languages, frameworks, or technology implementation questions
        - current_events: Related to recent news, ongoing developments, time-sensitive information, current politics, breaking stories, or real-time events
        - general: General information seeking that doesn't fit the above categories
        
        You may assign multiple types if the query spans several domains.
        
        Respond with a JSON object containing:
        {
            "primary_type": "the most appropriate type",
            "confidence": 0.X,
            "secondary_types": [{"type": "another_applicable_type", "confidence": 0.X}, ...],
            "reasoning": "brief explanation of your classification"
        }
        """},
        {"role": "user", "content": query}
    ]
    
    # Generate classification
    response = await self.generate_completion(messages)
    
    # Parse JSON response
    try:
        classification = json.loads(response)
        return classification
    except json.JSONDecodeError:
        # Fallback to default classification if parsing fails
        logger.warning(f"Error parsing domain classification response: {response}")
        return {
            "primary_type": "general", 
            "confidence": 0.5, 
            "secondary_types": [],
            "reasoning": "Failed to parse classification response"
        }
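In practice, LLMs often wrap JSON in markdown code fences or surround it with prose, so the bare json.loads above will frequently hit the fallback path. A more forgiving extraction step could look like the following sketch (the helper name extract_json_object is hypothetical, not part of the existing codebase):

```python
import json
import re

def extract_json_object(response: str) -> dict:
    """Best-effort extraction of the first JSON object from an LLM response.

    Handles responses wrapped in markdown code fences or surrounded by prose.
    Raises ValueError if no parseable object is found.
    """
    # Strip a markdown code fence if present (```json ... ```)
    fenced = re.search(r"`{3}(?:json)?\s*(\{.*?\})\s*`{3}", response, re.DOTALL)
    candidate = fenced.group(1) if fenced else response
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass
    # Fall back to the first {...} span anywhere in the text
    match = re.search(r"\{.*\}", candidate, re.DOTALL)
    if match:
        return json.loads(match.group(0))
    raise ValueError(f"No JSON object found in response: {response[:100]}")
```

_classify_query_domain_impl could call this helper before resorting to the hard-coded default classification, which keeps the fallback for truly unparseable output only.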

2. Update QueryProcessor Class

Modify the QueryProcessor class in query/query_processor.py to use the new LLM-based classification:

async def process_query(self, query: str) -> Dict[str, Any]:
    """
    Process a user query.
    
    Args:
        query: The raw user query
        
    Returns:
        Dictionary containing the processed query information
    """
    logger.info(f"Processing query: {query}")
    
    # Enhance the query
    enhanced_query = await self.llm_interface.enhance_query(query)
    logger.info(f"Enhanced query: {enhanced_query}")
    
    # Classify the query type (factual, exploratory, comparative)
    query_type_classification = await self.llm_interface.classify_query(query)
    logger.info(f"Query type classification: {query_type_classification}")
    
    # Classify the query domain (academic, code, current_events, general)
    domain_classification = await self.llm_interface.classify_query_domain(query)
    logger.info(f"Query domain classification: {domain_classification}")
    
    # Structure the query using the new classification approach
    structured_query = self._structure_query_with_llm(
        query, 
        enhanced_query, 
        query_type_classification,
        domain_classification
    )
    
    # Decompose the query into sub-questions (if complex enough)
    structured_query = await self.query_decomposer.decompose_query(query, structured_query)
    
    # Log the number of sub-questions if any
    if 'sub_questions' in structured_query and structured_query['sub_questions']:
        logger.info(f"Decomposed into {len(structured_query['sub_questions'])} sub-questions")
    else:
        logger.info("Query was not decomposed into sub-questions")
    
    return structured_query

def _structure_query_with_llm(self, original_query: str, enhanced_query: str,
                         type_classification: Dict[str, Any],
                         domain_classification: Dict[str, Any]) -> Dict[str, Any]:
    """
    Structure a query using LLM classification results.
    
    Args:
        original_query: The original user query
        enhanced_query: The enhanced query
        type_classification: Classification of query type (factual, exploratory, comparative)
        domain_classification: Classification of query domain (academic, code, current_events)
        
    Returns:
        Dictionary containing the structured query
    """
    # Get primary domain and confidence
    primary_domain = domain_classification.get('primary_type', 'general')
    primary_confidence = domain_classification.get('confidence', 0.5)
    
    # Get secondary domains
    secondary_domains = domain_classification.get('secondary_types', [])
    
    # Determine domain flags: set a flag when the domain is the primary type,
    # or when it appears as a secondary type with sufficient confidence
    # (the threshold guards against false positives from weak secondary matches)
    secondary_threshold = 0.3
    is_academic = primary_domain == 'academic' or any(
        d['type'] == 'academic' and d['confidence'] >= secondary_threshold
        for d in secondary_domains)
    is_code = primary_domain == 'code' or any(
        d['type'] == 'code' and d['confidence'] >= secondary_threshold
        for d in secondary_domains)
    is_current_events = primary_domain == 'current_events' or any(
        d['type'] == 'current_events' and d['confidence'] >= secondary_threshold
        for d in secondary_domains)
    
    return {
        'original_query': original_query,
        'enhanced_query': enhanced_query,
        'type': type_classification.get('type', 'unknown'),
        'intent': type_classification.get('intent', 'research'),
        'entities': type_classification.get('entities', []),
        'domain': primary_domain,
        'domain_confidence': primary_confidence,
        'secondary_domains': secondary_domains,
        'classification_reasoning': domain_classification.get('reasoning', ''),
        'timestamp': None,  # Will be filled in by the caller
        'is_current_events': is_current_events,
        'is_academic': is_academic,
        'is_code': is_code,
        'metadata': {
            'type_classification': type_classification,
            'domain_classification': domain_classification
        }
    }
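The flag-derivation logic above can be exercised in isolation. The sketch below restates it as a standalone function and runs it against the mixed-query classification from the Examples section (the input dict is hand-written, not real LLM output):

```python
from typing import Any, Dict, List

SECONDARY_THRESHOLD = 0.3  # assumed cutoff for secondary domains

def domain_flags(classification: Dict[str, Any]) -> Dict[str, bool]:
    """Derive boolean domain flags from a domain classification result."""
    primary = classification.get("primary_type", "general")
    secondary: List[Dict[str, Any]] = classification.get("secondary_types", [])

    def flag(domain: str) -> bool:
        # Flag is set if the domain is primary, or a confident-enough secondary
        if primary == domain:
            return True
        return any(d["type"] == domain and d["confidence"] >= SECONDARY_THRESHOLD
                   for d in secondary)

    return {
        "is_academic": flag("academic"),
        "is_code": flag("code"),
        "is_current_events": flag("current_events"),
    }

# The mixed cybersecurity query from Example 4
classification = {
    "primary_type": "academic",
    "confidence": 0.7,
    "secondary_types": [
        {"type": "code", "confidence": 0.6},
        {"type": "current_events", "confidence": 0.3},
    ],
}
print(domain_flags(classification))
# {'is_academic': True, 'is_code': True, 'is_current_events': True}
```

All three flags end up set here because the secondary confidences sit at or above the 0.3 threshold; lowering a secondary confidence to, say, 0.2 would drop that flag.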

3. Remove Legacy Keyword-Based Classification Methods

Once the new LLM-based classification is working correctly, remove or deprecate the old keyword-based methods:

  • _is_current_events_query
  • _is_academic_query
  • _is_code_query

And, eventually, the original _structure_query method. Note that the fallback mechanism in section 8 still calls _structure_query, so it should only be removed once that fallback is retired.

4. Update Search Executor Integration

The SearchExecutor class already looks for the flags in the structured query:

  • is_academic
  • is_code
  • is_current_events

So no changes are needed to the execute_search method. The improved classification will simply provide more accurate flags.
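For illustration, the engine-selection step inside SearchExecutor presumably maps these flags to engine lists along the following lines (a hypothetical sketch; the actual SearchExecutor code is not shown here, and the engine names are assumptions):

```python
from typing import Any, Dict, List

def select_engines(structured_query: Dict[str, Any]) -> List[str]:
    """Map domain flags to search engines (engine names are illustrative)."""
    engines = ["google"]  # general-purpose baseline, always included
    if structured_query.get("is_academic"):
        engines += ["arxiv", "semantic_scholar"]
    if structured_query.get("is_code"):
        engines += ["github", "stackexchange"]
    if structured_query.get("is_current_events"):
        engines += ["news"]
    return engines

print(select_engines({"is_academic": True, "is_code": True}))
# ['google', 'arxiv', 'semantic_scholar', 'github', 'stackexchange']
```

Since the LLM classification only changes how accurately the flags are set, this mapping is untouched by the migration.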

5. Update Configuration

Add the new classify_query_domain function to the module model configuration to allow different models to be assigned to this function:

module_models:
  query_processing:
    enhance_query: llama-3.1-8b-instant  # Fast model for query enhancement
    classify_query: llama-3.1-8b-instant  # Fast model for query type classification
    classify_query_domain: llama-3.1-8b-instant  # Fast model for domain classification
    generate_search_queries: llama-3.1-8b-instant  # Fast model for search query generation
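The get_module_model lookup used in section 1 would then resolve against this mapping. A minimal sketch of that resolution, assuming a per-function mapping with a global default model (the loader details and the default are assumptions, since the real Config class is not shown):

```python
from typing import Dict

# The module_models mapping above, expressed as a plain dict
MODULE_MODELS: Dict[str, Dict[str, str]] = {
    "query_processing": {
        "enhance_query": "llama-3.1-8b-instant",
        "classify_query": "llama-3.1-8b-instant",
        "classify_query_domain": "llama-3.1-8b-instant",
        "generate_search_queries": "llama-3.1-8b-instant",
    }
}

DEFAULT_MODEL = "llama-3.1-8b-instant"  # assumed global fallback

def get_module_model(module: str, function: str) -> str:
    """Resolve the model assigned to a module function, with a default."""
    return MODULE_MODELS.get(module, {}).get(function, DEFAULT_MODEL)

print(get_module_model("query_processing", "classify_query_domain"))
# llama-3.1-8b-instant
```

Adding classify_query_domain as its own key means a larger, slower model can later be assigned to domain classification alone without touching the other functions.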

6. Testing Plan

  1. Unit Tests:

    • Create test cases for classify_query_domain with various query types
    • Verify correct classification of academic, code, and current events queries
    • Test edge cases and queries that span multiple domains
  2. Integration Tests:

    • Test the full query processing pipeline with the new classification
    • Verify that the correct search engines are selected based on the classification
    • Compare results with the old keyword-based approach
  3. Regression Testing:

    • Ensure that all existing functionality works with the new classification
    • Verify that no existing test cases fail
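The unit tests can stub the completion call so both the parsing path and the fallback path are exercised without network access. A sketch (StubLLMInterface is a test double invented here; the method names follow section 1):

```python
import asyncio
import json

class StubLLMInterface:
    """Minimal stand-in for LLMInterface that returns a canned completion."""
    def __init__(self, canned_response: str):
        self.canned_response = canned_response

    async def generate_completion(self, messages):
        return self.canned_response

    async def classify_query_domain(self, query: str) -> dict:
        # Mirrors _classify_query_domain_impl's parse-with-fallback behavior
        response = await self.generate_completion(
            [{"role": "user", "content": query}])
        try:
            return json.loads(response)
        except json.JSONDecodeError:
            return {"primary_type": "general", "confidence": 0.5,
                    "secondary_types": [], "reasoning": "parse failure"}

# Happy path: valid JSON comes back and is returned as-is
good = StubLLMInterface('{"primary_type": "code", "confidence": 0.9, '
                        '"secondary_types": [], "reasoning": "PyTorch question"}')
result = asyncio.run(good.classify_query_domain("How do I use PyTorch?"))
assert result["primary_type"] == "code"

# Failure path: malformed output falls back to the default classification
bad = StubLLMInterface("Sorry, I cannot help with that.")
result = asyncio.run(bad.classify_query_domain("anything"))
assert result["primary_type"] == "general"
```

Real classification accuracy (the academic/code/current-events cases above) still needs tests against the live model, but the structural behavior is covered cheaply this way.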

7. Logging and Monitoring

Add detailed logging to monitor the performance of the new classification:

logger.info(f"Query domain classification: primary={domain_classification.get('primary_type')} confidence={domain_classification.get('confidence')}")
if domain_classification.get('secondary_types'):
    for sec_type in domain_classification.get('secondary_types'):
        logger.info(f"Secondary domain: {sec_type['type']} confidence={sec_type['confidence']}")
logger.info(f"Classification reasoning: {domain_classification.get('reasoning', 'None provided')}")

8. Fallback Mechanism

Implement a fallback to the keyword-based approach if the LLM classification fails:

try:
    domain_classification = await self.llm_interface.classify_query_domain(query)
    structured_query = self._structure_query_with_llm(query, enhanced_query, query_type_classification, domain_classification)
except Exception as e:
    logger.error(f"LLM domain classification failed: {e}. Falling back to keyword-based classification.")
    # Fallback to keyword-based approach
    structured_query = self._structure_query(query, enhanced_query, query_type_classification)

Timeline and Resources

Phase 1: Development (2-3 days)

  • Implement the new classify_query_domain method in LLMInterface
  • Create the new _structure_query_with_llm method in QueryProcessor
  • Update the process_query method to use the new approach
  • Add configuration for the new function

Phase 2: Testing (1-2 days)

  • Create test cases for the new classification
  • Test with various query types
  • Compare with the old approach

Phase 3: Deployment and Monitoring (1 day)

  • Deploy the new version
  • Monitor logs for classification issues
  • Adjust prompts and thresholds as needed

Phase 4: Cleanup (1 day)

  • Remove the old keyword-based methods
  • Update documentation

Expected Outcomes

  1. Improved Classification Accuracy:

    • More accurate identification of academic, code, and current events queries
    • Better handling of queries that span multiple domains
    • Proper classification of queries about emerging topics (like LLMs)
  2. Reduced Maintenance:

    • No need to update keyword lists
    • Adaptability to new domains without code changes
  3. Enhanced User Experience:

    • More relevant search results
    • Better report generation due to proper query classification
  4. System Robustness:

    • Graceful handling of edge cases
    • Better explanation of classification decisions
    • Proper confidence scoring for ambiguous queries

Examples

To illustrate how the new approach would work, here are some examples:

Example 1: Academic Query

Query: "What are the technological, economic, and social implications of large language models in today's society?"

Current Classification: Might be misclassified as code-related due to "models"

LLM Classification:

{
  "primary_type": "academic",
  "confidence": 0.9,
  "secondary_types": [
    {"type": "general", "confidence": 0.4}
  ],
  "reasoning": "This query is asking about implications of LLMs across multiple domains (technological, economic, and social) which is a scholarly research topic that would be well-addressed by academic sources."
}

Example 2: Code Query

Query: "How do I implement a transformer model in PyTorch for text classification?"

Current Classification: Might be correctly classified as code-related due to "implement", "model", and "PyTorch"

LLM Classification:

{
  "primary_type": "code",
  "confidence": 0.95,
  "secondary_types": [
    {"type": "academic", "confidence": 0.4}
  ],
  "reasoning": "This is primarily a programming question about implementing a specific model in PyTorch, which is a coding framework. It has academic aspects since it relates to machine learning models, but the focus is on implementation."
}

Example 3: Current Events Query

Query: "What are the latest developments in the Ukraine conflict?"

Current Classification: Likely correct if "Ukraine" is in the current events entities list

LLM Classification:

{
  "primary_type": "current_events",
  "confidence": 0.95,
  "secondary_types": [],
  "reasoning": "This query is asking about 'latest developments' in an ongoing conflict, which clearly indicates a focus on recent news and time-sensitive information."
}

Example 4: Mixed Query

Query: "How are LLMs being used to detect and prevent cyber attacks?"

Current Classification: Might have mixed signals from both academic and code keywords

LLM Classification:

{
  "primary_type": "academic",
  "confidence": 0.7,
  "secondary_types": [
    {"type": "code", "confidence": 0.6},
    {"type": "current_events", "confidence": 0.3}
  ],
  "reasoning": "This query relates to research on LLM applications in cybersecurity (academic), has technical implementation aspects (code), and could relate to recent developments in the field (current events). The primary focus appears to be on research and study of this application."
}

Conclusion

Replacing the keyword-based classification with an LLM-based approach will significantly improve the accuracy and adaptability of the query classification system. This will lead to better search results and report generation, particularly for complex or multi-domain queries like those about large language models. The implementation can be completed in 5-7 days and will reduce ongoing maintenance work by eliminating the need to update keyword lists.