Implement Phase 2 of Report Generation module: document prioritization and chunking strategies
This commit is contained in:
parent
60f78dab9c
commit
695e4b7ecd
|
@@ -1,79 +1,85 @@
|
|||
# Current Focus: Intelligent Research System Development
|
||||
# Current Focus: Report Generation Module Implementation (Phase 2)
|
||||
|
||||
## Latest Update (2025-02-27)
|
||||
|
||||
We are currently developing an intelligent research system that automates the process of finding, filtering, and synthesizing information from various sources. The system is designed to be modular, allowing different components to utilize specific LLM models and endpoints based on their requirements.
|
||||
We have successfully implemented Phase 1 of the Report Generation module, which includes document scraping and SQLite storage. The next focus is on Phase 2: Document Prioritization and Chunking, followed by integration with the search execution pipeline.
|
||||
|
||||
### Recent Progress
|
||||
|
||||
1. **Configuration Enhancements**:
|
||||
1. **Report Generation Module Phase 1 Implementation**:
|
||||
- Created a SQLite database manager with tables for documents and metadata
|
||||
- Implemented a document scraper with Jina Reader API integration and fallback mechanisms
|
||||
- Developed the basic report generator structure
|
||||
- Added URL retention, metadata storage, and content deduplication
|
||||
- Created comprehensive test scripts to verify functionality
|
||||
- Successfully tested document scraping, storage, and retrieval
|
||||
|
||||
2. **Configuration Enhancements**:
|
||||
- Implemented module-specific model assignments in the configuration
|
||||
- Added support for different LLM providers and endpoints
|
||||
- Added configuration for Jina AI's reranker
|
||||
- Added support for OpenRouter and Groq as LLM providers
|
||||
- Configured the system to use Groq's Llama 3.1 and 3.3 models for testing
|
||||
|
||||
2. **LLM Interface Updates**:
|
||||
3. **LLM Interface Updates**:
|
||||
- Enhanced the LLMInterface to support different models for different modules
|
||||
- Implemented dynamic model switching based on the module and function
|
||||
- Added support for Groq and OpenRouter providers
|
||||
- Added special handling for provider-specific requirements
|
||||
- Modified the query enhancement prompt to return only the enhanced query text without explanations
|
||||
- Optimized prompt templates for different LLM models
|
||||
|
||||
3. **Document Ranking Module**:
|
||||
- Created a new JinaReranker class that uses Jina AI's Reranker API
|
||||
- Implemented document reranking with metadata support
|
||||
- Configured to use the "jina-reranker-v2-base-multilingual" model
|
||||
4. **Search Execution Updates**:
|
||||
- Fixed issues with the Serper API integration
|
||||
- Updated the search handler interface for better error handling
|
||||
- Implemented parallel search execution using thread pools
|
||||
- Enhanced the result collector to properly process and deduplicate results
|
||||
|
||||
4. **Search Execution Module**:
|
||||
- Fixed the Serper API integration for both regular search and Scholar search
|
||||
- Streamlined the search execution process by removing redundant Google search handler
|
||||
- Added query truncation to handle long queries (the Serper API has a 2,048-character limit)
|
||||
- Enhanced error handling for API requests
|
||||
- Improved result processing and deduplication
|
||||
- Created comprehensive test scripts for all search handlers
|
||||
|
||||
5. **UI Development**:
|
||||
- Created a Gradio web interface for the research system
|
||||
- Implemented query input and result display components
|
||||
- Added support for configuring the number of results
|
||||
- Included example queries for easy testing
|
||||
- Created a results directory for saving search results
|
||||
5. **Jina Reranker Integration**:
|
||||
- Successfully integrated the Jina AI Reranker API to improve search result relevance
|
||||
- Fixed issues with API request and response format compatibility
|
||||
- Updated the reranker to handle different response structures
|
||||
- Improved error handling for a more robust integration
|
||||
|
||||
### Current Tasks
|
||||
|
||||
1. **Report Generation Module Development**:
|
||||
- Designing the report synthesis pipeline
|
||||
- Implementing result summarization using Groq's Llama 3.3 70B Versatile model
|
||||
- Creating formatting and export options
|
||||
1. **Report Generation Module Implementation (Phase 2)**:
|
||||
- Implementing document prioritization based on relevance scores
|
||||
- Developing chunking strategies for long documents
|
||||
- Creating token budget management system
|
||||
- Designing document selection algorithm
|
||||
|
||||
2. **UI Enhancement**:
|
||||
- Adding more configuration options to the UI
|
||||
- Implementing report generation in the UI
|
||||
2. **Integration with Search Execution**:
|
||||
- Connecting the report generation module to the search execution pipeline
|
||||
- Implementing automatic processing of search results
|
||||
- Creating end-to-end test cases for the integrated pipeline
|
||||
|
||||
3. **UI Enhancement**:
|
||||
- Adding report generation options to the UI
|
||||
- Implementing progress indicators for document scraping and report generation
|
||||
- Creating visualization components for search results
|
||||
|
||||
### Next Steps
|
||||
|
||||
1. **Integrate Search Execution with Query Processor**:
|
||||
- Ensure seamless flow from query processing to search execution
|
||||
- Test end-to-end pipeline with various query types
|
||||
- Fine-tune result scoring and filtering
|
||||
1. **Complete Phase 2 of Report Generation Module**:
|
||||
- Implement relevance-based document prioritization
|
||||
- Develop section-based and fixed-size chunking strategies
|
||||
- Create token budget management system
|
||||
- Design and implement document selection algorithm
|
||||
|
||||
2. **Build the Report Generation Module**:
|
||||
- Implement report synthesis using Groq's Llama 3.3 70B Versatile model
|
||||
- Create formatting and export options
|
||||
- Develop citation and reference management
|
||||
2. **Begin Phase 3 of Report Generation Module**:
|
||||
- Integrate with Groq's Llama 3.3 70B Versatile model for report synthesis
|
||||
- Implement map-reduce approach for processing documents
|
||||
- Create report templates for different query types
|
||||
- Add citation generation and reference management
|
||||
|
||||
3. **Comprehensive System Testing**:
|
||||
- Test the complete pipeline from query to report
|
||||
- Evaluate performance with different query types and domains
|
||||
- Optimize for speed and accuracy
|
||||
3. **Comprehensive Testing**:
|
||||
- Create end-to-end tests for the complete pipeline
|
||||
- Test with various document types and sizes
|
||||
- Evaluate performance and optimize as needed
|
||||
|
||||
### Technical Notes
|
||||
|
||||
- Using LiteLLM for a unified LLM interface across different providers (see the sketch after this list)
|
||||
- Implementing a modular architecture for flexibility and maintainability
|
||||
- Using Jina AI's reranker for improved document ranking
|
||||
- Using Groq's Llama 3.1 and 3.3 models for fast inference during testing
|
||||
- Using Jina Reader API for web scraping with BeautifulSoup as fallback
|
||||
- Implemented SQLite database for document storage with proper schema
|
||||
- Using asynchronous processing for improved performance in web scraping
|
||||
- Managing API keys securely through environment variables and configuration files
|
||||
- Using Gradio for the web interface to provide an easy-to-use frontend
|
||||
- Planning to use Groq's Llama 3.3 70B Versatile model for report synthesis
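
As a rough illustration of the LiteLLM-based routing noted above, per-module model selection can look like the following. This is a minimal sketch, not the project's actual `LLMInterface`; the `MODULE_MODELS` mapping and the exact Groq model identifiers are assumptions.

```python
import litellm

# Hypothetical per-module model assignments; the real mapping lives in the project config.
MODULE_MODELS = {
    "query_processing": "groq/llama-3.1-8b-instant",
    "report_synthesis": "groq/llama-3.3-70b-versatile",
}

def complete(module: str, prompt: str) -> str:
    """Send a prompt to the model assigned to the given module via LiteLLM."""
    response = litellm.completion(
        model=MODULE_MODELS[module],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```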
|
||||
|
|
|
@@ -221,3 +221,108 @@ After integrating Groq and OpenRouter as additional LLM providers, we needed to
|
|||
- Verified that the query processor works correctly with Groq models
|
||||
- Established a testing approach that can be used for other modules
|
||||
- Created reusable test scripts for future development
|
||||
|
||||
## 2025-02-27: Report Generation Module Implementation
|
||||
|
||||
### Decision: Use Jina Reader for Web Scraping and SQLite for Document Storage
|
||||
- **Context**: Need to implement document scraping and storage for the Report Generation module
|
||||
- **Options Considered**:
|
||||
1. In-memory document storage with custom web scraping
|
||||
2. SQLite database with Jina Reader for web scraping
|
||||
3. NoSQL database (e.g., MongoDB) with BeautifulSoup for web scraping
|
||||
4. Cloud-based document storage with third-party scraping service
|
||||
- **Decision**: Use Jina Reader for web scraping and SQLite for document storage
|
||||
- **Rationale**:
|
||||
- Jina Reader provides clean content extraction from web pages
|
||||
- Integration with existing Jina components (embeddings, reranker) for a consistent approach
|
||||
- SQLite offers persistence without the complexity of a full database server
|
||||
- SQLite's transactional nature ensures data integrity
|
||||
- Local storage reduces latency and eliminates cloud dependencies
|
||||
- Ability to store metadata alongside documents for better filtering and selection
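
A rough sketch of what the two tables could look like is below. The exact column set is an assumption inferred from the fields used elsewhere in this commit (url, title, content, content hash, scrape date, token count), not the authoritative schema.

```python
import aiosqlite

async def create_tables(db_path: str = "report.db") -> None:
    """Create the documents and metadata tables (illustrative schema only)."""
    async with aiosqlite.connect(db_path) as db:
        await db.execute("""
            CREATE TABLE IF NOT EXISTS documents (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                url TEXT UNIQUE NOT NULL,
                title TEXT,
                content TEXT,
                content_hash TEXT,      -- used for deduplication
                scrape_date TEXT,       -- ISO timestamp, used for recency scoring
                token_count INTEGER
            )
        """)
        await db.execute("""
            CREATE TABLE IF NOT EXISTS metadata (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                document_id INTEGER REFERENCES documents(id),
                key TEXT,
                value TEXT
            )
        """)
        await db.commit()
```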
|
||||
|
||||
### Decision: Implement Phased Approach for Report Generation
|
||||
- **Context**: Need to handle potentially large numbers of documents within LLM context window limitations
|
||||
- **Options Considered**:
|
||||
1. Single-pass approach with document truncation
|
||||
2. Use of a model with larger context window
|
||||
3. Phased approach with document prioritization and chunking
|
||||
4. Outsourcing document synthesis to a specialized service
|
||||
- **Decision**: Implement a phased approach with document prioritization and chunking
|
||||
- **Rationale**:
|
||||
- Allows handling of large document collections despite context window limitations
|
||||
- Prioritization ensures the most relevant content is included
|
||||
- Chunking strategies can preserve document structure and context
|
||||
- Map-reduce pattern enables processing of unlimited document collections
|
||||
- Flexible architecture can accommodate different models as needed
|
||||
- Progressive implementation allows for iterative testing and refinement
|
||||
|
||||
## 2025-02-27: Document Prioritization and Chunking Strategies
|
||||
|
||||
### Decision
|
||||
|
||||
Implemented document prioritization and chunking strategies for the Report Generation module (Phase 2) to extract the most relevant portions of scraped documents and prepare them for LLM processing.
|
||||
|
||||
### Context
|
||||
|
||||
After implementing the document scraping and storage components (Phase 1), we needed to develop strategies for prioritizing documents based on relevance and chunking them to fit within the LLM's context window limits. This is crucial for ensuring that the most important information is included in the final report.
|
||||
|
||||
### Options Considered
|
||||
|
||||
1. **Document Prioritization:**
|
||||
- Option A: Use only relevance scores from search results
|
||||
- Option B: Combine relevance scores with document metadata (recency, token count)
|
||||
- Option C: Use a machine learning model to score documents
|
||||
|
||||
2. **Chunking Strategies:**
|
||||
- Option A: Fixed-size chunking with overlap
|
||||
- Option B: Section-based chunking using Markdown headers
|
||||
- Option C: Hierarchical chunking for very large documents
|
||||
- Option D: Semantic chunking based on content similarity
|
||||
|
||||
### Decision and Rationale
|
||||
|
||||
For document prioritization, we chose Option B: a weighted scoring system that combines:
|
||||
- Relevance scores from search results (primary factor)
|
||||
- Document recency (secondary factor)
|
||||
- Document token count (tertiary factor)
|
||||
|
||||
This approach allows us to prioritize documents that are both relevant to the query and recent, while also considering the information density of the document.
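
As a concrete illustration, the combined score is computed roughly as follows. This is a minimal sketch of the weighting implemented in `document_processor.py`; the weights shown mirror the defaults used there (relevance takes the remainder after recency and token count).

```python
from datetime import datetime

def priority_score(relevance: float, scrape_date: str, token_count: int,
                   recency_weight: float = 0.3, token_count_weight: float = 0.2) -> float:
    """Combine relevance, recency, and token count into a single 0-1 priority score."""
    # Recency: newer documents approach 1.0, older ones decay toward 0.
    age_days = (datetime.now() - datetime.fromisoformat(scrape_date)).total_seconds() / 86400
    recency = 1.0 / (1.0 + age_days)
    # Token count: favor denser documents, capped at 5,000 tokens.
    density = min(token_count / 5000, 1.0)
    # Relevance takes whatever weight remains so the three weights sum to 1.0.
    relevance_weight = 1.0 - recency_weight - token_count_weight
    return relevance_weight * relevance + recency_weight * recency + token_count_weight * density
```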
|
||||
|
||||
For chunking strategies, we implemented a hybrid approach:
|
||||
- Section-based chunking (Option B) as the primary strategy, which preserves the logical structure of documents
|
||||
- Fixed-size chunking (Option A) as a fallback for documents without clear section headers
|
||||
- Hierarchical chunking (Option C) for very large documents, which creates a summary chunk and preserves important sections
|
||||
|
||||
We decided against semantic chunking (Option D) for now due to the additional computational overhead and complexity, but may consider it for future enhancements.
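
The hybrid selection logic amounts to the dispatch below. This is a simplified sketch of what `process_documents_for_report` does; the thresholds match the ones used in this commit.

```python
def choose_chunking_strategy(doc: dict, processor, chunk_size: int = 1000, overlap: int = 100) -> list:
    """Pick a chunking strategy based on document size (sketch of the hybrid approach)."""
    token_count = doc.get('token_count', 0)
    if token_count > chunk_size * 10:
        # Very large document: hierarchical chunking (summary chunk + prioritized sections).
        return processor.chunk_document_hierarchical(doc, chunk_size, overlap)
    elif token_count > chunk_size:
        # Medium document: section-based chunking, which itself falls back to
        # fixed-size chunking when no Markdown headers are found.
        return processor.chunk_document_by_sections(doc, chunk_size, overlap)
    # Small document: keep it whole as a single chunk.
    return [{**doc, 'chunk_type': 'full'}]
```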
|
||||
|
||||
### Implementation Details
|
||||
|
||||
1. **Document Prioritization:**
|
||||
- Created a scoring formula that weights relevance (50-60%), recency (30%), and token count (10-20%), with the relevance weight taking whatever remains so the three weights sum to 1.0
|
||||
- Normalized all scores to a 0-1 range for consistent weighting
|
||||
- Added the priority score to each document for use in chunk selection
|
||||
|
||||
2. **Chunking Strategies:**
|
||||
- Implemented section-based chunking using regex to identify Markdown headers
|
||||
- Added fixed-size chunking with configurable chunk size and overlap
|
||||
- Created hierarchical chunking for very large documents
|
||||
- Preserved document metadata in all chunks for traceability
|
||||
|
||||
3. **Chunk Selection:**
|
||||
- Implemented a token budget management system to stay within context limits
|
||||
- Created an algorithm to select chunks based on priority while ensuring representation from multiple documents
|
||||
- Added minimum chunks per document to prevent over-representation of a single source
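
In outline, the chunk selection described above works like this. It is a simplified sketch of `select_chunks_for_context` from this commit, not a drop-in replacement.

```python
def select_chunks(chunks: list, token_budget: int, min_chunks_per_doc: int = 1) -> list:
    """Greedy chunk selection under a token budget (simplified sketch)."""
    # Group chunks by source document so every document gets some representation.
    by_doc: dict = {}
    for chunk in chunks:
        by_doc.setdefault(chunk['document_id'], []).append(chunk)

    selected, budget = [], token_budget
    # First pass: the highest-priority chunk(s) from each document.
    for doc_chunks in by_doc.values():
        doc_chunks.sort(key=lambda c: c.get('priority_score', 0.0), reverse=True)
        for chunk in doc_chunks[:min_chunks_per_doc]:
            selected.append(chunk)
            budget -= chunk.get('token_count', 0)

    # Second pass: fill the remaining budget purely by priority.
    remaining = [c for doc_chunks in by_doc.values() for c in doc_chunks[min_chunks_per_doc:]]
    for chunk in sorted(remaining, key=lambda c: c.get('priority_score', 0.0), reverse=True):
        if chunk.get('token_count', 0) <= budget:
            selected.append(chunk)
            budget -= chunk.get('token_count', 0)
    return selected
```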
|
||||
|
||||
### Impact and Next Steps
|
||||
|
||||
This implementation allows us to:
|
||||
- Prioritize the most relevant and recent information
|
||||
- Preserve the logical structure of documents
|
||||
- Efficiently manage token budgets for different LLM models
|
||||
- Balance information from multiple sources
|
||||
|
||||
Next steps include:
|
||||
- Integrating with the LLM interface for report synthesis (Phase 3)
|
||||
- Implementing the map-reduce approach for processing document chunks
|
||||
- Creating report templates for different query types
|
||||
- Adding citation generation and reference management
|
||||
|
|
|
@@ -198,6 +198,7 @@ Added support for OpenRouter and Groq as LLM providers and configured the system
|
|||
1. Test the system with Groq's models to evaluate performance
|
||||
2. Implement the remaining query processing components
|
||||
3. Create the Gradio UI for user interaction
|
||||
4. Test the full system with end-to-end workflows
|
||||
|
||||
## Session: 2025-02-27 (Update 6)
|
||||
|
||||
|
@@ -393,3 +394,166 @@ Implemented a Gradio web interface for the intelligent research system, providin
|
|||
2. Implement report generation in the UI
|
||||
3. Add visualization components for search results
|
||||
4. Test the UI with various query types and search engines
|
||||
|
||||
## Session: 2025-02-27 (Afternoon)
|
||||
|
||||
### Overview
|
||||
In this session, we focused on debugging and fixing the Jina Reranker API integration to ensure it correctly processes queries and documents, enhancing the relevance of search results.
|
||||
|
||||
### Key Activities
|
||||
1. **Jina Reranker API Integration**:
|
||||
- Updated the `rerank` method in the JinaReranker class to match the expected API request format
|
||||
- Modified the request payload to send an array of plain string documents instead of objects
|
||||
- Enhanced response processing to handle both current and older API response formats
|
||||
- Added detailed logging for API requests and responses for better debugging
|
||||
|
||||
2. **Testing Improvements**:
|
||||
- Created a simplified test script (`test_simple_reranker.py`) to isolate and test the reranker functionality
|
||||
- Updated the main test script to focus on core functionality without complex dependencies
|
||||
- Implemented JSON result saving for better analysis of reranker output
|
||||
- Added proper error handling in tests to provide clear feedback on issues
|
||||
|
||||
3. **Code Quality Enhancements**:
|
||||
- Improved error handling throughout the reranker implementation
|
||||
- Added informative debug messages at key points in the execution flow
|
||||
- Ensured backward compatibility with previous API response formats
|
||||
- Documented the expected request and response structures
|
||||
|
||||
### Insights and Learnings
|
||||
- The Jina Reranker API expects documents as an array of plain strings, not objects with a "text" field
|
||||
- The reranker response format includes a "document" field in the results, which may contain either the text directly or an object with a "text" field (see the sketch after this list)
|
||||
- Proper error handling and debug output are crucial for diagnosing issues with external API integrations
|
||||
- Isolating components for testing makes debugging much more efficient
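
A minimal sketch of the response handling described above follows. Only the two shapes of the "document" field mentioned here are considered; the request side and any score fields are omitted rather than assumed.

```python
def extract_reranked_documents(response_json: dict) -> list:
    """Pull document texts out of a reranker response, tolerating both formats."""
    texts = []
    for result in response_json.get("results", []):
        document = result.get("document")
        # Current responses may return the text directly; older ones wrap it in an object.
        if isinstance(document, dict):
            texts.append(document.get("text", ""))
        else:
            texts.append(document or "")
    return texts
```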
|
||||
|
||||
### Challenges
|
||||
- Adapting to changes in the Jina Reranker API response format
|
||||
- Ensuring backward compatibility with older response formats
|
||||
- Debugging nested API response structures
|
||||
- Managing environment variables and configuration consistently across test scripts
|
||||
|
||||
### Next Steps
|
||||
1. **Expand Testing**: Develop more comprehensive test cases for the reranker with diverse document types
|
||||
2. **Integration**: Ensure the reranker is properly integrated with the result collector for end-to-end functionality
|
||||
3. **Documentation**: Update API documentation to reflect the latest changes to the reranker implementation
|
||||
4. **UI Integration**: Add reranker configuration options to the Gradio interface
|
||||
|
||||
## Session: 2025-02-27 - Report Generation Module Planning
|
||||
|
||||
### Overview
|
||||
In this session, we focused on planning the Report Generation module, designing a comprehensive implementation approach, and making key decisions about document scraping, storage, and processing.
|
||||
|
||||
### Key Activities
|
||||
1. **Designed a Phased Implementation Plan**:
|
||||
- Created a four-phase implementation plan for the Report Generation module
|
||||
- Phase 1: Document Scraping and Storage
|
||||
- Phase 2: Document Prioritization and Chunking
|
||||
- Phase 3: Report Generation
|
||||
- Phase 4: Advanced Features
|
||||
- Documented the plan in the memory bank for future reference
|
||||
|
||||
2. **Made Key Design Decisions**:
|
||||
- Decided to use Jina Reader for web scraping due to its clean content extraction capabilities
|
||||
- Chose SQLite for document storage to ensure persistence and efficient querying
|
||||
- Designed a database schema with Documents and Metadata tables
|
||||
- Planned a token budget management system to handle context window limitations
|
||||
- Decided on a map-reduce approach for processing large document collections
|
||||
|
||||
3. **Addressed Context Window Limitations**:
|
||||
- Evaluated Groq's Llama 3.3 70B Versatile model's 128K context window
|
||||
- Designed document prioritization strategies based on relevance scores
|
||||
- Planned chunking strategies for handling long documents
|
||||
- Considered alternative models with larger context windows for future implementation
|
||||
|
||||
4. **Updated Documentation**:
|
||||
- Added the implementation plan to the memory bank
|
||||
- Updated the decision log with rationale for key decisions
|
||||
- Revised the current focus to reflect the new implementation priorities
|
||||
- Added a new session log entry to document the planning process
|
||||
|
||||
### Insights
|
||||
- A phased implementation approach allows for incremental development and testing
|
||||
- SQLite provides a good balance of simplicity and functionality for document storage
|
||||
- Jina Reader integrates well with our existing Jina components (embeddings, reranker)
|
||||
- The map-reduce pattern enables processing of arbitrarily large document collections despite context window limitations (see the sketch after this list)
|
||||
- Document prioritization is crucial for ensuring the most relevant content is included in reports
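
As a rough illustration of the planned map-reduce pattern: each chunk is summarized independently (map) and the summaries are then synthesized into a single report (reduce). This is purely a sketch of the plan, not implemented code; `summarize_chunk` and `combine_summaries` are hypothetical placeholders for the LLM calls that Phase 3 will add.

```python
async def map_reduce_report(chunks: list, query: str) -> str:
    """Planned flow (sketch): summarize each chunk (map), then synthesize one report (reduce)."""
    # summarize_chunk and combine_summaries are placeholders for Phase 3 LLM calls.
    summaries = [await summarize_chunk(chunk, query) for chunk in chunks]  # map step
    return await combine_summaries(summaries, query)                       # reduce step
```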
|
||||
|
||||
### Challenges
|
||||
- Managing the 128K context window limitation with potentially large document collections
|
||||
- Balancing between document coverage and report quality
|
||||
- Ensuring efficient web scraping without overwhelming target websites
|
||||
- Designing a flexible architecture that can accommodate different models and approaches
|
||||
|
||||
### Next Steps
|
||||
1. Begin implementing Phase 1 of the Report Generation module:
|
||||
- Set up the SQLite database with the designed schema
|
||||
- Implement the Jina Reader integration for web scraping
|
||||
- Create the document processing pipeline
|
||||
- Develop URL validation and normalization functionality
|
||||
- Add caching and deduplication for scraped content
|
||||
|
||||
2. Plan for Phase 2 implementation:
|
||||
- Design the token budget management system
|
||||
- Develop document prioritization algorithms
|
||||
- Create chunking strategies for long documents
|
||||
|
||||
## Session: 2025-02-27 - Report Generation Module Implementation (Phase 1)
|
||||
|
||||
### Overview
|
||||
In this session, we implemented Phase 1 of the Report Generation module, focusing on document scraping and SQLite storage. We created the necessary components for scraping web pages, storing their content in a SQLite database, and retrieving documents for report generation.
|
||||
|
||||
### Key Activities
|
||||
1. **Created Database Manager**:
|
||||
- Implemented a SQLite database manager with tables for documents and metadata
|
||||
- Added full CRUD operations for documents
|
||||
- Implemented transaction handling for data integrity
|
||||
- Created methods for document search and retrieval
|
||||
- Used aiosqlite for asynchronous database operations
|
||||
|
||||
2. **Implemented Document Scraper**:
|
||||
- Created a document scraper with Jina Reader API integration
|
||||
- Added fallback mechanism using BeautifulSoup for when Jina API fails
|
||||
- Implemented URL validation and normalization
|
||||
- Added content conversion to Markdown format
|
||||
- Implemented token counting using tiktoken
|
||||
- Created metadata extraction from HTML content
|
||||
- Added document deduplication using content hashing (see the sketch after this list)
|
||||
|
||||
3. **Developed Report Generator Base**:
|
||||
- Created the basic structure for the report generation process
|
||||
- Implemented methods to process search results by scraping URLs
|
||||
- Integrated with the database manager and document scraper
|
||||
- Set up the foundation for future phases
|
||||
|
||||
4. **Created Test Script**:
|
||||
- Developed a test script to verify functionality
|
||||
- Tested document scraping, storage, and retrieval
|
||||
- Verified search functionality within the database
|
||||
- Ensured proper error handling and fallback mechanisms
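
Two of the smaller pieces mentioned above, token counting and hash-based deduplication, amount to roughly the following. This is a minimal sketch: the helper names are illustrative rather than the scraper's actual method names, and the specific hash algorithm is an assumption; the tokenizer matches the one used in `document_processor.py`.

```python
import hashlib
import tiktoken

_tokenizer = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Token count used for budgeting and prioritization."""
    return len(_tokenizer.encode(text))

def content_hash(text: str) -> str:
    """Stable hash of the scraped content, used to skip duplicate documents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```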
|
||||
|
||||
### Insights
|
||||
- The fallback mechanism for document scraping is crucial, as the Jina Reader API may not always be available or may fail for certain URLs
|
||||
- Asynchronous processing significantly improves performance when scraping multiple URLs
|
||||
- Content hashing is an effective way to prevent duplicate documents in the database
|
||||
- Storing metadata separately from document content provides flexibility for future enhancements
|
||||
- The SQLite database provides a good balance of simplicity and functionality for document storage
|
||||
|
||||
### Challenges
|
||||
- Handling different HTML structures across websites for metadata extraction
|
||||
- Managing asynchronous operations and error handling
|
||||
- Ensuring proper transaction handling for database operations
|
||||
- Balancing between clean content extraction and preserving important information
|
||||
|
||||
### Next Steps
|
||||
1. **Integration with Search Execution**:
|
||||
- Connect the report generation module to the search execution pipeline
|
||||
- Implement automatic processing of search results
|
||||
|
||||
2. **Begin Phase 2 Implementation**:
|
||||
- Develop document prioritization based on relevance scores
|
||||
- Implement chunking strategies for long documents
|
||||
- Create token budget management system
|
||||
|
||||
3. **Testing and Refinement**:
|
||||
- Create more comprehensive tests for edge cases
|
||||
- Refine error handling and logging
|
||||
- Optimize performance for large numbers of documents
|
||||
|
|
|
@@ -0,0 +1,493 @@
|
|||
"""
|
||||
Document processor module for the report generation module.
|
||||
|
||||
This module provides functionality to prioritize documents based on relevance scores,
|
||||
chunk long documents into manageable pieces, and select the most relevant chunks
|
||||
to stay within token budget limits.
|
||||
"""
|
||||
|
||||
import re
|
||||
import math
|
||||
import logging
|
||||
import tiktoken
|
||||
from typing import Dict, List, Any, Optional, Tuple, Union, Set
|
||||
from datetime import datetime
|
||||
|
||||
from report.database.db_manager import get_db_manager
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class DocumentProcessor:
|
||||
"""
|
||||
Document processor for the report generation module.
|
||||
|
||||
This class provides methods to prioritize documents based on relevance scores,
|
||||
chunk long documents into manageable pieces, and select the most relevant chunks
|
||||
to stay within token budget limits.
|
||||
"""
|
||||
|
||||
def __init__(self, default_token_limit: int = 120000):
|
||||
"""
|
||||
Initialize the document processor.
|
||||
|
||||
Args:
|
||||
default_token_limit: Default token limit for the context window
|
||||
"""
|
||||
self.db_manager = get_db_manager()
|
||||
self.default_token_limit = default_token_limit
|
||||
self.tokenizer = tiktoken.get_encoding("cl100k_base") # Using OpenAI's tokenizer
|
||||
|
||||
def _count_tokens(self, text: str) -> int:
|
||||
"""
|
||||
Count the number of tokens in a text.
|
||||
|
||||
Args:
|
||||
text: The text to count tokens for
|
||||
|
||||
Returns:
|
||||
Number of tokens in the text
|
||||
"""
|
||||
return len(self.tokenizer.encode(text))
|
||||
|
||||
def prioritize_documents(self, documents: List[Dict[str, Any]],
|
||||
relevance_scores: Optional[Dict[str, float]] = None,
|
||||
recency_weight: float = 0.3,
|
||||
token_count_weight: float = 0.2) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Prioritize documents based on relevance scores, recency, and token count.
|
||||
|
||||
Args:
|
||||
documents: List of documents to prioritize
|
||||
relevance_scores: Dictionary mapping document URLs to relevance scores
|
||||
recency_weight: Weight for recency in the prioritization score
|
||||
token_count_weight: Weight for token count in the prioritization score
|
||||
|
||||
Returns:
|
||||
List of documents sorted by priority score
|
||||
"""
|
||||
# If no relevance scores provided, use equal scores for all documents
|
||||
if relevance_scores is None:
|
||||
relevance_scores = {doc['url']: 1.0 for doc in documents}
|
||||
|
||||
# Get current time for recency calculation
|
||||
current_time = datetime.now()
|
||||
|
||||
# Calculate priority scores
|
||||
for doc in documents:
|
||||
# Relevance score (normalized to 0-1)
|
||||
relevance_score = relevance_scores.get(doc['url'], 0.0)
|
||||
|
||||
# Recency score (normalized to 0-1)
|
||||
try:
|
||||
doc_time = datetime.fromisoformat(doc['scrape_date'])
|
||||
time_diff = (current_time - doc_time).total_seconds() / 86400 # Convert to days
|
||||
recency_score = 1.0 / (1.0 + time_diff) # Newer documents get higher scores
|
||||
except (KeyError, ValueError):
|
||||
recency_score = 0.5 # Default if scrape_date is missing or invalid
|
||||
|
||||
# Token count score (normalized to 0-1)
|
||||
# Prefer documents with more tokens, but not too many
|
||||
token_count = doc.get('token_count', 0)
|
||||
token_count_score = min(token_count / 5000, 1.0) # Normalize to 0-1
|
||||
|
||||
# Calculate final priority score
|
||||
relevance_weight = 1.0 - recency_weight - token_count_weight
|
||||
priority_score = (
|
||||
relevance_weight * relevance_score +
|
||||
recency_weight * recency_score +
|
||||
token_count_weight * token_count_score
|
||||
)
|
||||
|
||||
# Add priority score to document
|
||||
doc['priority_score'] = priority_score
|
||||
|
||||
# Sort documents by priority score (descending)
|
||||
return sorted(documents, key=lambda x: x.get('priority_score', 0.0), reverse=True)
|
||||
|
||||
def chunk_document_by_sections(self, document: Dict[str, Any],
|
||||
max_chunk_tokens: int = 1000,
|
||||
overlap_tokens: int = 100) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Chunk a document by sections based on Markdown headers.
|
||||
|
||||
Args:
|
||||
document: Document to chunk
|
||||
max_chunk_tokens: Maximum number of tokens per chunk
|
||||
overlap_tokens: Number of tokens to overlap between chunks
|
||||
|
||||
Returns:
|
||||
List of document chunks
|
||||
"""
|
||||
content = document.get('content', '')
|
||||
|
||||
# If content is empty, return empty list
|
||||
if not content.strip():
|
||||
return []
|
||||
|
||||
# Find all headers in the content
|
||||
header_pattern = re.compile(r'^(#{1,6})\s+(.+)$', re.MULTILINE)
|
||||
headers = list(header_pattern.finditer(content))
|
||||
|
||||
# If no headers found, use fixed-size chunking
|
||||
if not headers:
|
||||
return self.chunk_document_fixed_size(document, max_chunk_tokens, overlap_tokens)
|
||||
|
||||
chunks = []
|
||||
|
||||
# Process each section (from one header to the next)
|
||||
for i in range(len(headers)):
|
||||
start_pos = headers[i].start()
|
||||
|
||||
# Determine end position (next header or end of content)
|
||||
if i < len(headers) - 1:
|
||||
end_pos = headers[i + 1].start()
|
||||
else:
|
||||
end_pos = len(content)
|
||||
|
||||
section_content = content[start_pos:end_pos]
|
||||
section_tokens = self._count_tokens(section_content)
|
||||
|
||||
# If section is small enough, add it as a single chunk
|
||||
if section_tokens <= max_chunk_tokens:
|
||||
chunks.append({
|
||||
'document_id': document.get('id'),
|
||||
'url': document.get('url'),
|
||||
'title': document.get('title'),
|
||||
'content': section_content,
|
||||
'token_count': section_tokens,
|
||||
'chunk_type': 'section',
|
||||
'section_title': headers[i].group(2),
|
||||
'section_level': len(headers[i].group(1)),
|
||||
'priority_score': document.get('priority_score', 0.0)
|
||||
})
|
||||
else:
|
||||
# If section is too large, split it into fixed-size chunks
|
||||
section_chunks = self._split_text_fixed_size(
|
||||
section_content,
|
||||
max_chunk_tokens,
|
||||
overlap_tokens
|
||||
)
|
||||
|
||||
for j, chunk_content in enumerate(section_chunks):
|
||||
chunk_tokens = self._count_tokens(chunk_content)
|
||||
chunks.append({
|
||||
'document_id': document.get('id'),
|
||||
'url': document.get('url'),
|
||||
'title': document.get('title'),
|
||||
'content': chunk_content,
|
||||
'token_count': chunk_tokens,
|
||||
'chunk_type': 'section_part',
|
||||
'section_title': headers[i].group(2),
|
||||
'section_level': len(headers[i].group(1)),
|
||||
'section_part': j + 1,
|
||||
'total_parts': len(section_chunks),
|
||||
'priority_score': document.get('priority_score', 0.0)
|
||||
})
|
||||
|
||||
return chunks
|
||||
|
||||
def chunk_document_fixed_size(self, document: Dict[str, Any],
|
||||
max_chunk_tokens: int = 1000,
|
||||
overlap_tokens: int = 100) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Chunk a document into fixed-size chunks with overlap.
|
||||
|
||||
Args:
|
||||
document: Document to chunk
|
||||
max_chunk_tokens: Maximum number of tokens per chunk
|
||||
overlap_tokens: Number of tokens to overlap between chunks
|
||||
|
||||
Returns:
|
||||
List of document chunks
|
||||
"""
|
||||
content = document.get('content', '')
|
||||
|
||||
# If content is empty, return empty list
|
||||
if not content.strip():
|
||||
return []
|
||||
|
||||
# Split content into fixed-size chunks
|
||||
content_chunks = self._split_text_fixed_size(content, max_chunk_tokens, overlap_tokens)
|
||||
|
||||
chunks = []
|
||||
|
||||
# Create chunk objects
|
||||
for i, chunk_content in enumerate(content_chunks):
|
||||
chunk_tokens = self._count_tokens(chunk_content)
|
||||
chunks.append({
|
||||
'document_id': document.get('id'),
|
||||
'url': document.get('url'),
|
||||
'title': document.get('title'),
|
||||
'content': chunk_content,
|
||||
'token_count': chunk_tokens,
|
||||
'chunk_type': 'fixed',
|
||||
'chunk_index': i + 1,
|
||||
'total_chunks': len(content_chunks),
|
||||
'priority_score': document.get('priority_score', 0.0)
|
||||
})
|
||||
|
||||
return chunks
|
||||
|
||||
def _split_text_fixed_size(self, text: str,
|
||||
max_chunk_tokens: int = 1000,
|
||||
overlap_tokens: int = 100) -> List[str]:
|
||||
"""
|
||||
Split text into fixed-size chunks with overlap.
|
||||
|
||||
Args:
|
||||
text: Text to split
|
||||
max_chunk_tokens: Maximum number of tokens per chunk
|
||||
overlap_tokens: Number of tokens to overlap between chunks
|
||||
|
||||
Returns:
|
||||
List of text chunks
|
||||
"""
|
||||
# Encode text into tokens
|
||||
tokens = self.tokenizer.encode(text)
|
||||
|
||||
# If text is small enough, return as a single chunk
|
||||
if len(tokens) <= max_chunk_tokens:
|
||||
return [text]
|
||||
|
||||
# Calculate number of chunks needed
|
||||
num_chunks = math.ceil((len(tokens) - overlap_tokens) / (max_chunk_tokens - overlap_tokens))
|
||||
|
||||
chunks = []
|
||||
|
||||
# Split tokens into chunks
|
||||
for i in range(num_chunks):
|
||||
# Calculate start and end positions
|
||||
start_pos = i * (max_chunk_tokens - overlap_tokens)
|
||||
end_pos = min(start_pos + max_chunk_tokens, len(tokens))
|
||||
|
||||
# Extract chunk tokens
|
||||
chunk_tokens = tokens[start_pos:end_pos]
|
||||
|
||||
# Decode chunk tokens back to text
|
||||
chunk_text = self.tokenizer.decode(chunk_tokens)
|
||||
|
||||
chunks.append(chunk_text)
|
||||
|
||||
return chunks
|
||||
|
||||
def chunk_document_hierarchical(self, document: Dict[str, Any],
|
||||
max_chunk_tokens: int = 1000,
|
||||
overlap_tokens: int = 100) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Chunk a very large document using a hierarchical approach.
|
||||
|
||||
This method first chunks the document by sections, then further chunks
|
||||
large sections into smaller pieces.
|
||||
|
||||
Args:
|
||||
document: Document to chunk
|
||||
max_chunk_tokens: Maximum number of tokens per chunk
|
||||
overlap_tokens: Number of tokens to overlap between chunks
|
||||
|
||||
Returns:
|
||||
List of document chunks
|
||||
"""
|
||||
# First, chunk by sections
|
||||
section_chunks = self.chunk_document_by_sections(document, max_chunk_tokens, overlap_tokens)
|
||||
|
||||
# If the document is small enough, return section chunks
|
||||
if sum(chunk.get('token_count', 0) for chunk in section_chunks) <= max_chunk_tokens * 3:
|
||||
return section_chunks
|
||||
|
||||
# Otherwise, create a summary chunk and keep the most important sections
|
||||
content = document.get('content', '')
|
||||
title = document.get('title', '')
|
||||
|
||||
# Extract first paragraph as summary
|
||||
first_para_match = re.search(r'^(.*?)\n\n', content, re.DOTALL)
|
||||
summary = first_para_match.group(1) if first_para_match else content[:500]
|
||||
|
||||
# Create summary chunk
|
||||
summary_chunk = {
|
||||
'document_id': document.get('id'),
|
||||
'url': document.get('url'),
|
||||
'title': title,
|
||||
'content': f"# {title}\n\n{summary}\n\n(This is a summary of a large document)",
|
||||
'token_count': self._count_tokens(f"# {title}\n\n{summary}\n\n(This is a summary of a large document)"),
|
||||
'chunk_type': 'summary',
|
||||
'priority_score': document.get('priority_score', 0.0) * 1.2 # Boost summary priority
|
||||
}
|
||||
|
||||
# Sort section chunks by priority (section level and position)
|
||||
def section_priority(chunk):
|
||||
# Prioritize by section level (lower is more important)
|
||||
level_score = 6 - chunk.get('section_level', 3)
|
||||
# Prioritize earlier sections
|
||||
position_score = 1.0 / (1.0 + chunk.get('chunk_index', 0) + chunk.get('section_part', 0))
|
||||
return level_score * position_score
|
||||
|
||||
sorted_sections = sorted(section_chunks, key=section_priority, reverse=True)
|
||||
|
||||
# Return the summary chunk followed by sections sorted by importance (final trimming happens during chunk selection)
|
||||
return [summary_chunk] + sorted_sections
|
||||
|
||||
def select_chunks_for_context(self, chunks: List[Dict[str, Any]],
|
||||
token_budget: int,
|
||||
min_chunks_per_doc: int = 1) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Select chunks to include in the context window based on token budget.
|
||||
|
||||
Args:
|
||||
chunks: List of document chunks
|
||||
token_budget: Maximum number of tokens to use
|
||||
min_chunks_per_doc: Minimum number of chunks to include per document
|
||||
|
||||
Returns:
|
||||
List of selected chunks
|
||||
"""
|
||||
# Group chunks by document
|
||||
doc_chunks = {}
|
||||
for chunk in chunks:
|
||||
doc_id = chunk.get('document_id')
|
||||
if doc_id not in doc_chunks:
|
||||
doc_chunks[doc_id] = []
|
||||
doc_chunks[doc_id].append(chunk)
|
||||
|
||||
# Sort chunks within each document by priority
|
||||
for doc_id in doc_chunks:
|
||||
doc_chunks[doc_id] = sorted(
|
||||
doc_chunks[doc_id],
|
||||
key=lambda x: x.get('priority_score', 0.0),
|
||||
reverse=True
|
||||
)
|
||||
|
||||
# Select at least min_chunks_per_doc from each document
|
||||
selected_chunks = []
|
||||
remaining_budget = token_budget
|
||||
|
||||
# First pass: select minimum chunks from each document
|
||||
for doc_id, doc_chunk_list in doc_chunks.items():
|
||||
for i in range(min(min_chunks_per_doc, len(doc_chunk_list))):
|
||||
chunk = doc_chunk_list[i]
|
||||
selected_chunks.append(chunk)
|
||||
remaining_budget -= chunk.get('token_count', 0)
|
||||
|
||||
# If we've exceeded the budget, sort selected chunks and trim
|
||||
if remaining_budget <= 0:
|
||||
selected_chunks = sorted(
|
||||
selected_chunks,
|
||||
key=lambda x: x.get('priority_score', 0.0),
|
||||
reverse=True
|
||||
)
|
||||
|
||||
# Keep adding chunks until we exceed the budget
|
||||
current_budget = 0
|
||||
for i, chunk in enumerate(selected_chunks):
|
||||
current_budget += chunk.get('token_count', 0)
|
||||
if current_budget > token_budget:
|
||||
selected_chunks = selected_chunks[:i]
|
||||
break
|
||||
|
||||
return selected_chunks
|
||||
|
||||
# Second pass: add more chunks based on priority until budget is exhausted
|
||||
# Flatten remaining chunks from all documents
|
||||
remaining_chunks = []
|
||||
for doc_id, doc_chunk_list in doc_chunks.items():
|
||||
if len(doc_chunk_list) > min_chunks_per_doc:
|
||||
remaining_chunks.extend(doc_chunk_list[min_chunks_per_doc:])
|
||||
|
||||
# Sort remaining chunks by priority
|
||||
remaining_chunks = sorted(
|
||||
remaining_chunks,
|
||||
key=lambda x: x.get('priority_score', 0.0),
|
||||
reverse=True
|
||||
)
|
||||
|
||||
# Add chunks until budget is exhausted
|
||||
for chunk in remaining_chunks:
|
||||
if chunk.get('token_count', 0) <= remaining_budget:
|
||||
selected_chunks.append(chunk)
|
||||
remaining_budget -= chunk.get('token_count', 0)
|
||||
|
||||
if remaining_budget <= 0:
|
||||
break
|
||||
|
||||
return selected_chunks
|
||||
|
||||
def process_documents_for_report(self, documents: List[Dict[str, Any]],
|
||||
relevance_scores: Optional[Dict[str, float]] = None,
|
||||
token_budget: Optional[int] = None,
|
||||
chunk_size: int = 1000,
|
||||
overlap_size: int = 100) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Process documents for report generation.
|
||||
|
||||
This method prioritizes documents, chunks them, and selects the most
|
||||
relevant chunks to stay within the token budget.
|
||||
|
||||
Args:
|
||||
documents: List of documents to process
|
||||
relevance_scores: Dictionary mapping document URLs to relevance scores
|
||||
token_budget: Maximum number of tokens to use (default: self.default_token_limit)
|
||||
chunk_size: Maximum number of tokens per chunk
|
||||
overlap_size: Number of tokens to overlap between chunks
|
||||
|
||||
Returns:
|
||||
List of selected document chunks
|
||||
"""
|
||||
if token_budget is None:
|
||||
token_budget = self.default_token_limit
|
||||
|
||||
# Prioritize documents
|
||||
prioritized_docs = self.prioritize_documents(documents, relevance_scores)
|
||||
|
||||
# Chunk documents
|
||||
all_chunks = []
|
||||
for doc in prioritized_docs:
|
||||
# Choose chunking strategy based on document size
|
||||
token_count = doc.get('token_count', 0)
|
||||
|
||||
if token_count > chunk_size * 10:
|
||||
# Very large document: use hierarchical chunking
|
||||
chunks = self.chunk_document_hierarchical(doc, chunk_size, overlap_size)
|
||||
elif token_count > chunk_size:
|
||||
# Medium document: use section-based chunking
|
||||
chunks = self.chunk_document_by_sections(doc, chunk_size, overlap_size)
|
||||
else:
|
||||
# Small document: keep as a single chunk
|
||||
chunks = [{
|
||||
'document_id': doc.get('id'),
|
||||
'url': doc.get('url'),
|
||||
'title': doc.get('title'),
|
||||
'content': doc.get('content', ''),
|
||||
'token_count': token_count,
|
||||
'chunk_type': 'full',
|
||||
'priority_score': doc.get('priority_score', 0.0)
|
||||
}]
|
||||
|
||||
all_chunks.extend(chunks)
|
||||
|
||||
# Select chunks based on token budget
|
||||
selected_chunks = self.select_chunks_for_context(all_chunks, token_budget)
|
||||
|
||||
# Log statistics
|
||||
total_docs = len(documents)
|
||||
total_chunks = len(all_chunks)
|
||||
selected_chunk_count = len(selected_chunks)
|
||||
selected_token_count = sum(chunk.get('token_count', 0) for chunk in selected_chunks)
|
||||
|
||||
logger.info(f"Processed {total_docs} documents into {total_chunks} chunks")
|
||||
logger.info(f"Selected {selected_chunk_count} chunks with {selected_token_count} tokens")
|
||||
|
||||
return selected_chunks
|
||||
|
||||
|
||||
# Create a singleton instance for global use
|
||||
document_processor = DocumentProcessor()
|
||||
|
||||
def get_document_processor() -> DocumentProcessor:
|
||||
"""
|
||||
Get the global document processor instance.
|
||||
|
||||
Returns:
|
||||
DocumentProcessor instance
|
||||
"""
|
||||
return document_processor
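# Example usage (illustrative sketch only; the document dicts and relevance scores
# below are hypothetical, and in practice documents come from the SQLite store):
#
#     processor = get_document_processor()
#     chunks = processor.process_documents_for_report(
#         documents,           # list of document dicts with 'url', 'content', 'token_count', etc.
#         relevance_scores,    # {url: score} taken from the search results
#         token_budget=20000,
#         chunk_size=1000,
#         overlap_size=100,
#     )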
|
|
@@ -13,6 +13,7 @@ from typing import Dict, List, Any, Optional, Tuple, Union
|
|||
|
||||
from report.database.db_manager import get_db_manager, initialize_database
|
||||
from report.document_scraper import get_document_scraper
|
||||
from report.document_processor import get_document_processor
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
|
||||
|
@@ -31,6 +32,7 @@ class ReportGenerator:
|
|||
"""Initialize the report generator."""
|
||||
self.db_manager = get_db_manager()
|
||||
self.document_scraper = get_document_scraper()
|
||||
self.document_processor = get_document_processor()
|
||||
|
||||
async def initialize(self):
|
||||
"""Initialize the report generator by setting up the database."""
|
||||
|
@@ -50,13 +52,19 @@ class ReportGenerator:
|
|||
# Extract URLs from search results
|
||||
urls = [result.get('url') for result in search_results if result.get('url')]
|
||||
|
||||
# Extract relevance scores if available
|
||||
relevance_scores = {}
|
||||
for result in search_results:
|
||||
if result.get('url') and result.get('score') is not None:
|
||||
relevance_scores[result.get('url')] = result.get('score')
|
||||
|
||||
# Scrape URLs and store in database
|
||||
documents = await self.document_scraper.scrape_urls(urls)
|
||||
|
||||
# Log results
|
||||
logger.info(f"Processed {len(documents)} documents out of {len(urls)} URLs")
|
||||
|
||||
return documents
|
||||
return documents, relevance_scores
|
||||
|
||||
async def get_document_by_url(self, url: str) -> Optional[Dict[str, Any]]:
|
||||
"""
|
||||
|
@@ -83,6 +91,84 @@ class ReportGenerator:
|
|||
"""
|
||||
return await self.db_manager.search_documents(query, limit)
|
||||
|
||||
async def prepare_documents_for_report(self,
|
||||
search_results: List[Dict[str, Any]],
|
||||
token_budget: Optional[int] = None,
|
||||
chunk_size: int = 1000,
|
||||
overlap_size: int = 100) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Prepare documents for report generation by processing search results,
|
||||
prioritizing documents, and chunking them to fit within token budget.
|
||||
|
||||
Args:
|
||||
search_results: List of search results
|
||||
token_budget: Maximum number of tokens to use
|
||||
chunk_size: Maximum number of tokens per chunk
|
||||
overlap_size: Number of tokens to overlap between chunks
|
||||
|
||||
Returns:
|
||||
List of selected document chunks
|
||||
"""
|
||||
# Process search results to get documents and relevance scores
|
||||
documents, relevance_scores = await self.process_search_results(search_results)
|
||||
|
||||
# Prioritize and chunk documents
|
||||
selected_chunks = self.document_processor.process_documents_for_report(
|
||||
documents,
|
||||
relevance_scores,
|
||||
token_budget,
|
||||
chunk_size,
|
||||
overlap_size
|
||||
)
|
||||
|
||||
return selected_chunks
|
||||
|
||||
async def generate_report(self,
|
||||
search_results: List[Dict[str, Any]],
|
||||
query: str,
|
||||
token_budget: Optional[int] = None,
|
||||
chunk_size: int = 1000,
|
||||
overlap_size: int = 100) -> str:
|
||||
"""
|
||||
Generate a report from search results.
|
||||
|
||||
Args:
|
||||
search_results: List of search results
|
||||
query: Original search query
|
||||
token_budget: Maximum number of tokens to use
|
||||
chunk_size: Maximum number of tokens per chunk
|
||||
overlap_size: Number of tokens to overlap between chunks
|
||||
|
||||
Returns:
|
||||
Generated report as a string
|
||||
"""
|
||||
# Prepare documents for report
|
||||
selected_chunks = await self.prepare_documents_for_report(
|
||||
search_results,
|
||||
token_budget,
|
||||
chunk_size,
|
||||
overlap_size
|
||||
)
|
||||
|
||||
# TODO: Implement report synthesis using LLM
|
||||
# For now, just return a placeholder report
|
||||
report = f"# Report for: {query}\n\n"
|
||||
report += f"Based on {len(selected_chunks)} document chunks\n\n"
|
||||
|
||||
# Add document summaries
|
||||
for i, chunk in enumerate(selected_chunks[:5]): # Show first 5 chunks
|
||||
report += f"## Document {i+1}: {chunk.get('title', 'Untitled')}\n"
|
||||
report += f"Source: {chunk.get('url', 'Unknown')}\n"
|
||||
report += f"Chunk type: {chunk.get('chunk_type', 'Unknown')}\n"
|
||||
report += f"Priority score: {chunk.get('priority_score', 0.0):.2f}\n\n"
|
||||
|
||||
# Add a snippet of the content
|
||||
content = chunk.get('content', '')
|
||||
snippet = content[:200] + "..." if len(content) > 200 else content
|
||||
report += f"{snippet}\n\n"
|
||||
|
||||
return report
|
||||
|
||||
|
||||
# Create a singleton instance for global use
|
||||
report_generator = ReportGenerator()
|
||||
|
@@ -100,30 +186,50 @@ def get_report_generator() -> ReportGenerator:
|
|||
"""
|
||||
return report_generator
|
||||
|
||||
# Example usage
|
||||
async def test_report_generator():
|
||||
"""Test the report generator with sample search results."""
|
||||
# Initialize report generator
|
||||
# Initialize the report generator
|
||||
await initialize_report_generator()
|
||||
|
||||
# Sample search results
|
||||
search_results = [
|
||||
{"url": "https://en.wikipedia.org/wiki/Web_scraping", "title": "Web scraping - Wikipedia"},
|
||||
{"url": "https://en.wikipedia.org/wiki/Natural_language_processing", "title": "Natural language processing - Wikipedia"}
|
||||
{
|
||||
'title': 'Example Document 1',
|
||||
'url': 'https://example.com/doc1',
|
||||
'snippet': 'This is an example document.',
|
||||
'score': 0.95
|
||||
},
|
||||
{
|
||||
'title': 'Example Document 2',
|
||||
'url': 'https://example.com/doc2',
|
||||
'snippet': 'This is another example document.',
|
||||
'score': 0.85
|
||||
},
|
||||
{
|
||||
'title': 'Python Documentation',
|
||||
'url': 'https://docs.python.org/3/',
|
||||
'snippet': 'Official Python documentation.',
|
||||
'score': 0.75
|
||||
}
|
||||
]
|
||||
|
||||
# Process search results
|
||||
generator = get_report_generator()
|
||||
documents = await generator.process_search_results(search_results)
|
||||
documents, relevance_scores = await report_generator.process_search_results(search_results)
|
||||
|
||||
# Print results
|
||||
# Print documents
|
||||
print(f"Processed {len(documents)} documents")
|
||||
for doc in documents:
|
||||
print(f"Title: {doc['title']}")
|
||||
print(f"URL: {doc['url']}")
|
||||
print(f"Token count: {doc['token_count']}")
|
||||
print(f"Content preview: {doc['content'][:200]}...")
|
||||
print("-" * 80)
|
||||
print(f"Document: {doc.get('title')} ({doc.get('url')})")
|
||||
print(f"Token count: {doc.get('token_count')}")
|
||||
print(f"Content snippet: {doc.get('content')[:100]}...")
|
||||
print()
|
||||
|
||||
# Generate report
|
||||
report = await report_generator.generate_report(search_results, "Python programming")
|
||||
|
||||
# Print report
|
||||
print("Generated Report:")
|
||||
print(report)
|
||||
|
||||
# Run test if this module is executed directly
|
||||
if __name__ == "__main__":
|
||||
|
|
|
@@ -0,0 +1,156 @@
|
|||
"""
|
||||
Test script for the document processor module.
|
||||
|
||||
This script tests the document prioritization and chunking functionality
|
||||
of the document processor module.
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import asyncio
|
||||
import json
|
||||
from datetime import datetime
|
||||
from typing import Dict, List, Any, Optional
|
||||
|
||||
# Add the project root directory to the Python path
|
||||
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||
|
||||
from report.document_processor import get_document_processor
|
||||
from report.database.db_manager import get_db_manager, initialize_database
|
||||
from report.document_scraper import get_document_scraper
|
||||
|
||||
async def test_document_processor():
|
||||
"""Test the document processor with sample documents."""
|
||||
# Initialize the database
|
||||
await initialize_database()
|
||||
|
||||
# Get the document processor and scraper
|
||||
document_processor = get_document_processor()
|
||||
document_scraper = get_document_scraper()
|
||||
db_manager = get_db_manager()
|
||||
|
||||
# Sample URLs to test with
|
||||
test_urls = [
|
||||
"https://en.wikipedia.org/wiki/Python_(programming_language)",
|
||||
"https://en.wikipedia.org/wiki/Natural_language_processing",
|
||||
"https://docs.python.org/3/tutorial/index.html",
|
||||
"https://en.wikipedia.org/wiki/Machine_learning"
|
||||
]
|
||||
|
||||
# Scrape the URLs
|
||||
print(f"Scraping {len(test_urls)} URLs...")
|
||||
documents = await document_scraper.scrape_urls(test_urls)
|
||||
print(f"Scraped {len(documents)} documents")
|
||||
|
||||
# Sample relevance scores
|
||||
relevance_scores = {
|
||||
"https://en.wikipedia.org/wiki/Python_(programming_language)": 0.95,
|
||||
"https://en.wikipedia.org/wiki/Natural_language_processing": 0.85,
|
||||
"https://docs.python.org/3/tutorial/index.html": 0.75,
|
||||
"https://en.wikipedia.org/wiki/Machine_learning": 0.65
|
||||
}
|
||||
|
||||
# Test document prioritization
|
||||
print("\nTesting document prioritization...")
|
||||
prioritized_docs = document_processor.prioritize_documents(documents, relevance_scores)
|
||||
|
||||
print("Prioritized documents:")
|
||||
for i, doc in enumerate(prioritized_docs):
|
||||
print(f"{i+1}. {doc.get('title')} - Score: {doc.get('priority_score', 0.0):.2f}")
|
||||
|
||||
# Test document chunking
|
||||
print("\nTesting document chunking...")
|
||||
|
||||
# Test section-based chunking
|
||||
print("\nSection-based chunking:")
|
||||
if documents:
|
||||
section_chunks = document_processor.chunk_document_by_sections(documents[0], 1000, 100)
|
||||
print(f"Created {len(section_chunks)} section-based chunks")
|
||||
|
||||
for i, chunk in enumerate(section_chunks[:3]): # Show first 3 chunks
|
||||
print(f"Chunk {i+1}:")
|
||||
print(f" Type: {chunk.get('chunk_type')}")
|
||||
print(f" Section: {chunk.get('section_title', 'N/A')}")
|
||||
print(f" Tokens: {chunk.get('token_count')}")
|
||||
content = chunk.get('content', '')
|
||||
print(f" Content preview: {content[:100]}...")
|
||||
|
||||
# Test fixed-size chunking
|
||||
print("\nFixed-size chunking:")
|
||||
if documents:
|
||||
fixed_chunks = document_processor.chunk_document_fixed_size(documents[0], 1000, 100)
|
||||
print(f"Created {len(fixed_chunks)} fixed-size chunks")
|
||||
|
||||
for i, chunk in enumerate(fixed_chunks[:3]): # Show first 3 chunks
|
||||
print(f"Chunk {i+1}:")
|
||||
print(f" Type: {chunk.get('chunk_type')}")
|
||||
print(f" Index: {chunk.get('chunk_index')}/{chunk.get('total_chunks')}")
|
||||
print(f" Tokens: {chunk.get('token_count')}")
|
||||
content = chunk.get('content', '')
|
||||
print(f" Content preview: {content[:100]}...")
|
||||
|
||||
# Test hierarchical chunking
|
||||
print("\nHierarchical chunking:")
|
||||
if documents:
|
||||
hierarchical_chunks = document_processor.chunk_document_hierarchical(documents[0], 1000, 100)
|
||||
print(f"Created {len(hierarchical_chunks)} hierarchical chunks")
|
||||
|
||||
for i, chunk in enumerate(hierarchical_chunks[:3]): # Show first 3 chunks
|
||||
print(f"Chunk {i+1}:")
|
||||
print(f" Type: {chunk.get('chunk_type')}")
|
||||
if chunk.get('chunk_type') == 'summary':
|
||||
print(f" Summary chunk")
|
||||
else:
|
||||
print(f" Section: {chunk.get('section_title', 'N/A')}")
|
||||
print(f" Tokens: {chunk.get('token_count')}")
|
||||
content = chunk.get('content', '')
|
||||
print(f" Content preview: {content[:100]}...")
|
||||
|
||||
# Test chunk selection
|
||||
print("\nTesting chunk selection...")
|
||||
|
||||
# Create a mix of chunks from all documents
|
||||
all_chunks = []
|
||||
for doc in documents:
|
||||
chunks = document_processor.chunk_document_by_sections(doc, 1000, 100)
|
||||
all_chunks.extend(chunks)
|
||||
|
||||
print(f"Total chunks: {len(all_chunks)}")
|
||||
|
||||
# Select chunks based on token budget
|
||||
token_budget = 10000
|
||||
selected_chunks = document_processor.select_chunks_for_context(all_chunks, token_budget)
|
||||
|
||||
total_tokens = sum(chunk.get('token_count', 0) for chunk in selected_chunks)
|
||||
print(f"Selected {len(selected_chunks)} chunks with {total_tokens} tokens (budget: {token_budget})")
|
||||
|
||||
# Test full document processing
|
||||
print("\nTesting full document processing...")
|
||||
processed_chunks = document_processor.process_documents_for_report(
|
||||
documents,
|
||||
relevance_scores,
|
||||
token_budget=20000,
|
||||
chunk_size=1000,
|
||||
overlap_size=100
|
||||
)
|
||||
|
||||
total_processed_tokens = sum(chunk.get('token_count', 0) for chunk in processed_chunks)
|
||||
print(f"Processed {len(processed_chunks)} chunks with {total_processed_tokens} tokens")
|
||||
|
||||
# Show the top 5 chunks
|
||||
print("\nTop 5 chunks:")
|
||||
for i, chunk in enumerate(processed_chunks[:5]):
|
||||
print(f"Chunk {i+1}:")
|
||||
print(f" Document: {chunk.get('title')}")
|
||||
print(f" Type: {chunk.get('chunk_type')}")
|
||||
print(f" Priority: {chunk.get('priority_score', 0.0):.2f}")
|
||||
print(f" Tokens: {chunk.get('token_count')}")
|
||||
content = chunk.get('content', '')
|
||||
print(f" Content preview: {content[:100]}...")
|
||||
|
||||
async def main():
|
||||
"""Main function to run the tests."""
|
||||
await test_document_processor()
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|