LLM-Based Query Classification
Overview
This document describes the implementation of LLM-based query domain classification in the sim-search project, replacing the previous keyword-based approach.
Motivation
The previous keyword-based classification had several limitations:
- Relied on static keyword lists that required constant maintenance
- Could not capture the semantic meaning of queries
- Misclassified ambiguous or novel queries
Implementation
New Components
- LLM Interface Extension:
  - Added classify_query_domain() method to the LLMInterface class
  - Added _classify_query_domain_impl() private implementation method
  - Configured to use the fast Llama-3.1-8b-instant model by default
- Query Processor Updates:
  - Added _structure_query_with_llm() method that uses the LLM classification results
  - Updated process_query() to use both query type and domain classification
  - Retained keyword-based method as a fallback in case of LLM API failures
- Structured Query Enhancements:
  - Added new fields to the structured query:
    - domain: Primary domain type (academic, code, current_events, general)
    - domain_confidence: Confidence score for the primary domain
    - secondary_domains: Array of secondary domains with confidence scores
    - classification_reasoning: Explanation of the classification
- Configuration Updates:
  - Added classify_query_domain to the module-specific model assignments
  - Using the same Llama-3.1-8b-instant model for domain classification as for other query processing tasks
- Logging and Monitoring:
  - Added detailed logging of domain classification results
  - Log secondary domains with confidence scores
  - Log the reasoning behind classifications
- Error Handling:
  - Added fallback to keyword-based classification if LLM-based classification fails
  - Implemented robust JSON parsing with fallbacks to default values
  - Added explicit error messages for troubleshooting
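The error-handling behaviour described above (defaults for malformed JSON, keyword fallback when the LLM call fails) might look roughly like the sketch below. It is not the project's code: parse_classification, classify_with_fallback, the default values, and the two classifier callables are hypothetical names chosen for illustration.

```python
import json
import logging

logger = logging.getLogger(__name__)

VALID_DOMAINS = ("academic", "code", "current_events", "general")


def default_classification():
    # Assumed defaults; the real default values are not stated in this document.
    return {"primary_type": "general", "confidence": 0.5,
            "secondary_types": [], "reasoning": ""}


def _is_score(value):
    # A usable confidence is a real number in [0, 1] (bool is excluded).
    return (isinstance(value, (int, float)) and not isinstance(value, bool)
            and 0.0 <= value <= 1.0)


def parse_classification(raw):
    """Parse the LLM reply, keeping defaults for anything missing or invalid."""
    result = default_classification()
    if not isinstance(raw, str):
        return result
    # Tolerate extra text around the JSON object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        return result
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return result
    if not isinstance(data, dict):
        return result
    if data.get("primary_type") in VALID_DOMAINS:
        result["primary_type"] = data["primary_type"]
    if _is_score(data.get("confidence")):
        result["confidence"] = float(data["confidence"])
    secondary = data.get("secondary_types")
    if isinstance(secondary, list):
        # Drop entries with unknown domains or unusable confidence values.
        result["secondary_types"] = [
            {"type": s["type"], "confidence": float(s["confidence"])}
            for s in secondary
            if isinstance(s, dict)
            and s.get("type") in VALID_DOMAINS
            and _is_score(s.get("confidence"))
        ]
    if isinstance(data.get("reasoning"), str):
        result["reasoning"] = data["reasoning"]
    return result


def classify_with_fallback(query, llm_classify, keyword_classify):
    """Use the LLM classifier; fall back to keywords if the call raises."""
    try:
        return parse_classification(llm_classify(query))
    except Exception as exc:  # e.g. network or API errors
        logger.warning("LLM domain classification failed (%s); "
                       "using keyword fallback", exc)
        return keyword_classify(query)
```

In this sketch a malformed reply yields default values rather than triggering the keyword fallback, which is reserved for failures of the LLM call itself. The actual implementation may split these cases differently.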
Classification Process
The query domain classification process works as follows:
- The query is sent to the LLM with a prompt specifying the four domain types
- The LLM returns a JSON response containing:
  - Primary domain type with confidence score
  - Array of secondary domain types with confidence scores
  - Reasoning for the classification
- The response is parsed and integrated into the structured query
- The is_academic, is_code, and is_current_events flags are set based on:
  - Primary domain matching the type
  - Any secondary domain matching the type with confidence above 0.3
- The structured query is then used by downstream components like the search executor
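The flag rule in the steps above can be illustrated with a short sketch. The function name domain_flags and the constant name are hypothetical; the input follows the JSON shape shown in the Examples section, and the 0.3 threshold comes from the description above.

```python
SECONDARY_THRESHOLD = 0.3  # a secondary domain counts only above this value


def domain_flags(classification):
    """Derive is_academic / is_code / is_current_events from a classification."""
    primary = classification.get("primary_type")
    # Secondary domains that are confident enough to set a flag.
    strong_secondary = {
        s.get("type")
        for s in classification.get("secondary_types", [])
        if s.get("confidence", 0.0) > SECONDARY_THRESHOLD
    }

    def matches(domain):
        return primary == domain or domain in strong_secondary

    return {
        "is_academic": matches("academic"),
        "is_code": matches("code"),
        "is_current_events": matches("current_events"),
    }


flags = domain_flags({
    "primary_type": "code",
    "confidence": 0.95,
    "secondary_types": [{"type": "academic", "confidence": 0.4}],
})
# is_code is set by the primary domain, is_academic by the 0.4 secondary.
```

Because the comparison is strictly greater than the threshold, a secondary domain with confidence exactly 0.3 would not set its flag in this sketch.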
Benefits
The new approach offers several advantages:
- Semantic Understanding: Captures the meaning and intent of queries rather than just keyword matching
- Multi-Domain Recognition: Recognizes queries that span multiple domains and assigns a confidence score to each
- Self-Explaining: Provides reasoning for classifications, aiding debugging and transparency
- Adaptability: Automatically adapts to new topics and terminology without code changes
- Confidence Scoring: Indicates how confident the system is in its classification
Testing and Validation
A comprehensive test script (test_domain_classification.py) was created to:
- Test the raw domain classification function with a variety of queries
- Test the query processor's integration with domain classification
- Compare the LLM-based approach with the previous keyword-based approach
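The comparison step could be structured along the following lines. This is only a sketch and does not reproduce test_domain_classification.py: compare_classifiers is a hypothetical helper, and both classifiers are toy stand-ins (a canned lookup instead of a real LLM call, and an illustrative keyword list).

```python
# Illustrative keyword list; the project's real lists are not shown here.
CODE_KEYWORDS = ("python", "pytorch", "implement", "function")


def toy_keyword_classify(query):
    """Toy stand-in for the previous keyword-based approach."""
    text = query.lower()
    return "code" if any(k in text for k in CODE_KEYWORDS) else "general"


def compare_classifiers(queries, llm_classify, keyword_classify):
    """Run both classifiers on each query and record whether they agree."""
    rows = []
    for query in queries:
        llm_domain = llm_classify(query)
        keyword_domain = keyword_classify(query)
        rows.append({
            "query": query,
            "llm": llm_domain,
            "keyword": keyword_domain,
            "agree": llm_domain == keyword_domain,
        })
    return rows


# Canned answers standing in for real LLM responses.
CANNED_LLM = {
    "What are the social implications of large language models?": "academic",
    "How do I implement a transformer model in PyTorch?": "code",
}

report = compare_classifiers(list(CANNED_LLM), CANNED_LLM.get, toy_keyword_classify)
for row in report:
    print(f"{row['llm']:>10} vs {row['keyword']:<10} agree={row['agree']}")
```

Disagreements in such a report are the interesting rows to inspect, since they show where semantic classification diverges from keyword matching.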
Examples
Academic Query Example
Query: "What are the technological, economic, and social implications of large language models in today's society?"
LLM Classification:
{
  "primary_type": "academic",
  "confidence": 0.9,
  "secondary_types": [
    {"type": "general", "confidence": 0.4}
  ],
  "reasoning": "This query is asking about implications of LLMs across multiple domains (technological, economic, and social) which is a scholarly research topic that would be well-addressed by academic sources."
}
Code Query Example
Query: "How do I implement a transformer model in PyTorch for text classification?"
LLM Classification:
{
  "primary_type": "code",
  "confidence": 0.95,
  "secondary_types": [
    {"type": "academic", "confidence": 0.4}
  ],
  "reasoning": "This is primarily a programming question about implementing a specific model in PyTorch, which is a coding framework. It has academic aspects since it relates to machine learning models, but the focus is on implementation."
}
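To show how a response like the ones above might end up in the structured query, here is a minimal sketch. The helper apply_classification is hypothetical, and the key mapping (primary_type to domain, confidence to domain_confidence, and so on) is an assumption inferred from the field descriptions, not confirmed project code.

```python
def apply_classification(structured_query, classification):
    """Return a copy of the structured query with the domain fields filled in."""
    updated = dict(structured_query)  # leave the caller's dict untouched
    # Assumed mapping from LLM response keys to structured query fields.
    updated["domain"] = classification.get("primary_type", "general")
    updated["domain_confidence"] = classification.get("confidence", 0.0)
    updated["secondary_domains"] = list(classification.get("secondary_types", []))
    updated["classification_reasoning"] = classification.get("reasoning", "")
    return updated


code_response = {
    "primary_type": "code",
    "confidence": 0.95,
    "secondary_types": [{"type": "academic", "confidence": 0.4}],
    "reasoning": "Primarily a programming question.",
}
structured = apply_classification(
    {"original_query": "How do I implement a transformer model in PyTorch "
                       "for text classification?"},
    code_response,
)
```

Here original_query is only a placeholder key for the example; the structured query's other fields are not described in this document.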
Future Improvements
Potential enhancements for the future:
- Caching: Add caching for frequently asked or similar queries to reduce API calls
- Few-Shot Learning: Add examples in the prompt to improve classification accuracy
- Expanded Domains: Consider additional domain categories beyond the current four
- UI Integration: Expose classification reasoning in the UI for advanced users
- Classification Feedback Loop: Allow users to correct misclassifications to improve the system over time
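As an illustration of the caching idea listed above, one possible shape is a small in-memory LRU cache keyed on a normalized query string. This is a sketch of a proposed enhancement, not existing project code; CachedClassifier and its parameters are hypothetical.

```python
from collections import OrderedDict


class CachedClassifier:
    """Wrap a classifier callable with a small in-memory LRU cache."""

    def __init__(self, classify, max_entries=1024):
        self._classify = classify
        self._max_entries = max_entries
        self._cache = OrderedDict()

    @staticmethod
    def _key(query):
        # Normalize case and whitespace so trivially different queries share an entry.
        return " ".join(query.lower().split())

    def __call__(self, query):
        key = self._key(query)
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        result = self._classify(query)  # cache miss: one real classification
        self._cache[key] = result
        if len(self._cache) > self._max_entries:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return result
```

This only catches repeated queries that are identical after normalization. Matching merely similar queries would need something more, such as embedding-based lookup, which is outside this sketch.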