# LLM-Based Query Classification
## Overview
This document describes the implementation of LLM-based query domain classification in the sim-search project, replacing the previous keyword-based approach.
## Motivation
The previous keyword-based classification had several limitations:
- Relied on static keyword lists that could not keep pace with new terminology
- Could not capture the semantic meaning of queries
- Misclassified ambiguous or novel queries
- Required ongoing manual maintenance of the keyword lists
## Implementation
### New Components
1. **LLM Interface Extension**:
   - Added a `classify_query_domain()` method to the `LLMInterface` class (see the interface sketch after this list)
   - Added a `_classify_query_domain_impl()` private implementation method
   - Configured to use the fast Llama-3.1-8b-instant model by default
2. **Query Processor Updates**:
   - Added a `_structure_query_with_llm()` method that uses the LLM classification results
   - Updated `process_query()` to use both query type and domain classification
   - Retained the keyword-based method as a fallback in case of LLM API failures
3. **Structured Query Enhancements**:
   - Added new fields to the structured query:
     - `domain`: Primary domain type (academic, code, current_events, general)
     - `domain_confidence`: Confidence score for the primary domain
     - `secondary_domains`: Array of secondary domains with confidence scores
     - `classification_reasoning`: Explanation of the classification
4. **Configuration Updates**:
   - Added `classify_query_domain` to the module-specific model assignments (see the configuration sketch after this list)
   - Uses the same Llama-3.1-8b-instant model for domain classification as for other query processing tasks
5. **Logging and Monitoring**:
   - Added detailed logging of domain classification results
   - Logs secondary domains with confidence scores
   - Logs the reasoning behind classifications
6. **Error Handling**:
   - Added a fallback to keyword-based classification if LLM-based classification fails
   - Implemented robust JSON parsing with fallbacks to default values
   - Added explicit error messages for troubleshooting
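The sketch below illustrates how the interface, the robust JSON parsing, and the fallback defaults from items 1 and 6 might fit together. The method names `classify_query_domain()` and `_classify_query_domain_impl()` and the model name come from this document; the prompt text, the `_call_llm()` helper, and the default values are illustrative assumptions, not the actual sim-search implementation.
```python
import json
import logging

logger = logging.getLogger(__name__)

# Defaults used when the LLM response cannot be parsed; the exact values are assumptions.
DEFAULT_CLASSIFICATION = {
    "primary_type": "general",
    "confidence": 0.0,
    "secondary_types": [],
    "reasoning": "Fallback default: LLM response could not be parsed.",
}


class LLMInterface:
    def classify_query_domain(self, query: str) -> dict:
        """Public wrapper; delegates to the private implementation method."""
        return self._classify_query_domain_impl(query)

    def _classify_query_domain_impl(self, query: str) -> dict:
        # Illustrative prompt; the real prompt text is not reproduced in this document.
        prompt = (
            "Classify the following query into one primary domain "
            "(academic, code, current_events, general) and any secondary domains. "
            "Respond with JSON containing primary_type, confidence, "
            "secondary_types, and reasoning.\n\nQuery: " + query
        )
        raw = self._call_llm(prompt, model="llama-3.1-8b-instant")
        try:
            result = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            logger.warning("Domain classification returned unparseable JSON; using defaults")
            return dict(DEFAULT_CLASSIFICATION)
        if not isinstance(result, dict):
            return dict(DEFAULT_CLASSIFICATION)
        # Fill any missing keys with safe defaults rather than failing outright.
        return {**DEFAULT_CLASSIFICATION, **result}

    def _call_llm(self, prompt: str, model: str) -> str:
        # Placeholder: the real LLMInterface sends the prompt to the configured
        # provider and returns the raw text completion.
        raise NotImplementedError
```
The parsed result is then merged into the structured query as the `domain`, `domain_confidence`, `secondary_domains`, and `classification_reasoning` fields listed in item 3.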
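The configuration change can be pictured as one additional entry in the module-to-model mapping. The mapping name and the other key shown here are assumptions; only the `classify_query_domain` key and the model name come from this document.
```python
# Hypothetical module-specific model assignments; the actual config structure may differ.
MODULE_MODEL_ASSIGNMENTS = {
    "structure_query": "llama-3.1-8b-instant",
    "classify_query_domain": "llama-3.1-8b-instant",  # new entry for domain classification
}
```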
### Classification Process
The query domain classification process works as follows:
1. The query is sent to the LLM with a prompt specifying the four domain types
2. The LLM returns a JSON response containing:
   - Primary domain type with confidence score
   - Array of secondary domain types with confidence scores
   - Reasoning for the classification
3. The response is parsed and integrated into the structured query
4. The `is_academic`, `is_code`, and `is_current_events` flags are set based on (see the sketch below):
   - The primary domain matching the type
   - Any secondary domain matching the type with confidence above 0.3
5. The structured query is then used by downstream components such as the search executor
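A minimal sketch of the flag-setting logic in step 4, assuming the classification result shape shown in the examples below; the function and variable names are illustrative, not the exact sim-search code.
```python
# Derive the boolean domain flags from a parsed classification result.
# The 0.3 threshold and field names follow this document; the function name is an assumption.
SECONDARY_CONFIDENCE_THRESHOLD = 0.3


def set_domain_flags(structured_query: dict, classification: dict) -> dict:
    primary = classification.get("primary_type", "general")
    secondary = classification.get("secondary_types", [])

    def matches(domain: str) -> bool:
        # True if the primary domain matches, or any secondary domain matches
        # with confidence above the threshold.
        if primary == domain:
            return True
        return any(
            s.get("type") == domain
            and s.get("confidence", 0.0) > SECONDARY_CONFIDENCE_THRESHOLD
            for s in secondary
        )

    structured_query["is_academic"] = matches("academic")
    structured_query["is_code"] = matches("code")
    structured_query["is_current_events"] = matches("current_events")
    return structured_query
```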
## Benefits
The new approach offers several advantages:
1. **Semantic Understanding**: Captures the meaning and intent of queries rather than just keyword matching
2. **Multi-Domain Recognition**: Recognizes when queries span multiple domains with confidence scores
3. **Self-Explaining**: Provides reasoning for classifications, aiding debugging and transparency
4. **Adaptability**: Automatically adapts to new topics and terminology without code changes
5. **Confidence Scoring**: Indicates how confident the system is in its classification
## Testing and Validation
A comprehensive test script (`test_domain_classification.py`) was created to:
1. Test the raw domain classification function with a variety of queries
2. Test the query processor's integration with domain classification
3. Compare the LLM-based approach with the previous keyword-based approach
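As an illustration, a spot-check in the spirit of that script might look roughly like the following; the query strings, expected labels, and printed fields are examples, not the actual test cases.
```python
# Hypothetical spot-check; assumes LLMInterface is importable from the project's
# query-processing module.
test_queries = [
    "What are the social implications of large language models?",  # expected: academic
    "How do I implement a transformer model in PyTorch?",          # expected: code
]

llm = LLMInterface()
for query in test_queries:
    result = llm.classify_query_domain(query)
    print(f"{query!r} -> {result['primary_type']} ({result['confidence']:.2f})")
```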
## Examples
### Academic Query Example
**Query**: "What are the technological, economic, and social implications of large language models in today's society?"
**LLM Classification**:
```json
{
  "primary_type": "academic",
  "confidence": 0.9,
  "secondary_types": [
    {"type": "general", "confidence": 0.4}
  ],
  "reasoning": "This query is asking about implications of LLMs across multiple domains (technological, economic, and social) which is a scholarly research topic that would be well-addressed by academic sources."
}
```
### Code Query Example
**Query**: "How do I implement a transformer model in PyTorch for text classification?"
**LLM Classification**:
```json
{
  "primary_type": "code",
  "confidence": 0.95,
  "secondary_types": [
    {"type": "academic", "confidence": 0.4}
  ],
  "reasoning": "This is primarily a programming question about implementing a specific model in PyTorch, which is a coding framework. It has academic aspects since it relates to machine learning models, but the focus is on implementation."
}
```
## Future Improvements
Potential enhancements for the future:
1. **Caching**: Add caching for frequently asked or similar queries to reduce API calls
2. **Few-Shot Learning**: Add examples in the prompt to improve classification accuracy
3. **Expanded Domains**: Consider additional domain categories beyond the current four
4. **UI Integration**: Expose classification reasoning in the UI for advanced users
5. **Classification Feedback Loop**: Allow users to correct misclassifications to improve the system over time