Implement Phase 2 of Report Generation module: document prioritization and chunking strategies

Steve White 2025-02-27 17:47:02 -06:00
parent 60f78dab9c
commit 695e4b7ecd
6 changed files with 1091 additions and 61 deletions

View File

@@ -1,79 +1,85 @@
# Current Focus: Intelligent Research System Development
# Current Focus: Report Generation Module Implementation (Phase 2)
## Latest Update (2025-02-27)
We are currently developing an intelligent research system that automates the process of finding, filtering, and synthesizing information from various sources. The system is designed to be modular, allowing different components to utilize specific LLM models and endpoints based on their requirements.
We have successfully implemented Phase 1 of the Report Generation module, which includes document scraping and SQLite storage. The next focus is on Phase 2: Document Prioritization and Chunking, followed by integration with the search execution pipeline.
### Recent Progress
1. **Configuration Enhancements**:
1. **Report Generation Module Phase 1 Implementation**:
- Created a SQLite database manager with tables for documents and metadata
- Implemented a document scraper with Jina Reader API integration and fallback mechanisms
- Developed the basic report generator structure
- Added URL retention, metadata storage, and content deduplication
- Created comprehensive test scripts to verify functionality
- Successfully tested document scraping, storage, and retrieval
2. **Configuration Enhancements**:
- Implemented module-specific model assignments in the configuration
- Added support for different LLM providers and endpoints
- Added configuration for Jina AI's reranker
- Added support for OpenRouter and Groq as LLM providers
- Configured the system to use Groq's Llama 3.1 and 3.3 models for testing
2. **LLM Interface Updates**:
3. **LLM Interface Updates**:
- Enhanced the LLMInterface to support different models for different modules
- Implemented dynamic model switching based on the module and function
- Added support for Groq and OpenRouter providers
- Added special handling for provider-specific requirements
- Modified the query enhancement prompt to return only the enhanced query text without explanations
- Optimized prompt templates for different LLM models
3. **Document Ranking Module**:
- Created a new JinaReranker class that uses Jina AI's Reranker API
- Implemented document reranking with metadata support
- Configured to use the "jina-reranker-v2-base-multilingual" model
4. **Search Execution Updates**:
- Fixed issues with the Serper API integration
- Updated the search handler interface for better error handling
- Implemented parallel search execution using thread pools
- Enhanced the result collector to properly process and deduplicate results
4. **Search Execution Module**:
- Fixed the Serper API integration for both regular search and Scholar search
- Streamlined the search execution process by removing redundant Google search handler
- Added query truncation to handle long queries (Serper API has a 2048 character limit)
- Enhanced error handling for API requests
- Improved result processing and deduplication
- Created comprehensive test scripts for all search handlers
5. **UI Development**:
- Created a Gradio web interface for the research system
- Implemented query input and result display components
- Added support for configuring the number of results
- Included example queries for easy testing
- Created a results directory for saving search results
5. **Jina Reranker Integration**:
- Successfully integrated the Jina AI Reranker API to improve search result relevance
- Fixed issues with API request and response format compatibility
- Updated the reranker to handle different response structures
- Improved error handling for a more robust integration
### Current Tasks
1. **Report Generation Module Development**:
- Designing the report synthesis pipeline
- Implementing result summarization using Groq's Llama 3.3 70B Versatile model
- Creating formatting and export options
1. **Report Generation Module Implementation (Phase 2)**:
- Implementing document prioritization based on relevance scores
- Developing chunking strategies for long documents
- Creating token budget management system
- Designing document selection algorithm
2. **UI Enhancement**:
- Adding more configuration options to the UI
- Implementing report generation in the UI
2. **Integration with Search Execution**:
- Connecting the report generation module to the search execution pipeline
- Implementing automatic processing of search results
- Creating end-to-end test cases for the integrated pipeline
3. **UI Enhancement**:
- Adding report generation options to the UI
- Implementing progress indicators for document scraping and report generation
- Creating visualization components for search results
### Next Steps
1. **Integrate Search Execution with Query Processor**:
- Ensure seamless flow from query processing to search execution
- Test end-to-end pipeline with various query types
- Fine-tune result scoring and filtering
1. **Complete Phase 2 of Report Generation Module**:
- Implement relevance-based document prioritization
- Develop section-based and fixed-size chunking strategies
- Create token budget management system
- Design and implement document selection algorithm
2. **Build the Report Generation Module**:
- Implement report synthesis using Groq's Llama 3.3 70B Versatile model
- Create formatting and export options
- Develop citation and reference management
2. **Begin Phase 3 of Report Generation Module**:
- Integrate with Groq's Llama 3.3 70B Versatile model for report synthesis
- Implement map-reduce approach for processing documents
- Create report templates for different query types
- Add citation generation and reference management
3. **Comprehensive System Testing**:
- Test the complete pipeline from query to report
- Evaluate performance with different query types and domains
- Optimize for speed and accuracy
3. **Comprehensive Testing**:
- Create end-to-end tests for the complete pipeline
- Test with various document types and sizes
- Evaluate performance and optimize as needed
### Technical Notes
- Using LiteLLM for a unified LLM interface across different providers (see the sketch after these notes)
- Implementing a modular architecture for flexibility and maintainability
- Using Jina AI's reranker for improved document ranking
- Using Groq's Llama 3.1 and 3.3 models for fast inference during testing
- Using Jina Reader API for web scraping with BeautifulSoup as fallback
- Implemented SQLite database for document storage with proper schema
- Using asynchronous processing for improved performance in web scraping
- Managing API keys securely through environment variables and configuration files
- Using Gradio for the web interface to provide an easy-to-use frontend
- Planning to use Groq's Llama 3.3 70B Versatile model for report synthesis
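The module-specific model assignment noted above can be summarized in a short sketch. This is not the project's actual configuration code: the module names and model identifiers below are illustrative, and LiteLLM is assumed to pick up the provider API keys (e.g. GROQ_API_KEY, OPENROUTER_API_KEY) from environment variables.

```python
# Illustrative only: module names and model identifiers are assumptions,
# not the project's actual configuration.
import litellm

MODULE_MODELS = {
    "query_processing": "groq/llama-3.1-8b-instant",
    "report_synthesis": "groq/llama-3.3-70b-versatile",
    "fallback": "openrouter/meta-llama/llama-3.3-70b-instruct",
}

def complete_for_module(module: str, prompt: str) -> str:
    """Send a prompt to the model assigned to the given module."""
    model = MODULE_MODELS.get(module, MODULE_MODELS["fallback"])
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```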

View File

@@ -221,3 +221,108 @@ After integrating Groq and OpenRouter as additional LLM providers, we needed to
- Verified that the query processor works correctly with Groq models
- Established a testing approach that can be used for other modules
- Created reusable test scripts for future development
## 2025-02-27: Report Generation Module Implementation
### Decision: Use Jina Reader for Web Scraping and SQLite for Document Storage
- **Context**: Need to implement document scraping and storage for the Report Generation module
- **Options Considered**:
1. In-memory document storage with custom web scraping
2. SQLite database with Jina Reader for web scraping
3. NoSQL database (e.g., MongoDB) with BeautifulSoup for web scraping
4. Cloud-based document storage with third-party scraping service
- **Decision**: Use Jina Reader for web scraping and SQLite for document storage
- **Rationale**:
- Jina Reader provides clean content extraction from web pages
- Integration with existing Jina components (embeddings, reranker) for a consistent approach
- SQLite offers persistence without the complexity of a full database server
- SQLite's transactional nature ensures data integrity
- Local storage reduces latency and eliminates cloud dependencies
- Ability to store metadata alongside documents for better filtering and selection
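For reference, a minimal sketch of the kind of schema this decision implies is shown below. The column names are assumptions based on the fields described elsewhere in this log (URL, title, Markdown content, content hash, token count, scrape date); the actual implementation uses aiosqlite, while plain sqlite3 is used here for brevity.

```python
# Minimal schema sketch; column names are assumptions, not the exact schema.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    url           TEXT UNIQUE NOT NULL,
    title         TEXT,
    content       TEXT,              -- cleaned Markdown from Jina Reader
    content_hash  TEXT,              -- used for deduplication
    token_count   INTEGER,
    scrape_date   TEXT               -- ISO-8601 timestamp
);

CREATE TABLE IF NOT EXISTS metadata (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id  INTEGER NOT NULL REFERENCES documents(id),
    key          TEXT NOT NULL,
    value        TEXT
);
"""

with sqlite3.connect("report_documents.db") as conn:
    conn.executescript(SCHEMA)
```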
### Decision: Implement Phased Approach for Report Generation
- **Context**: Need to handle potentially large numbers of documents within LLM context window limitations
- **Options Considered**:
1. Single-pass approach with document truncation
2. Use of a model with larger context window
3. Phased approach with document prioritization and chunking
4. Outsourcing document synthesis to a specialized service
- **Decision**: Implement a phased approach with document prioritization and chunking
- **Rationale**:
- Allows handling of large document collections despite context window limitations
- Prioritization ensures the most relevant content is included
- Chunking strategies can preserve document structure and context
- Map-reduce pattern enables processing of unlimited document collections
- Flexible architecture can accommodate different models as needed
- Progressive implementation allows for iterative testing and refinement
## 2025-02-27: Document Prioritization and Chunking Strategies
### Decision
Implemented document prioritization and chunking strategies for the Report Generation module (Phase 2) to extract the most relevant portions of scraped documents and prepare them for LLM processing.
### Context
After implementing the document scraping and storage components (Phase 1), we needed to develop strategies for prioritizing documents based on relevance and chunking them to fit within the LLM's context window limits. This is crucial for ensuring that the most important information is included in the final report.
### Options Considered
1. **Document Prioritization:**
- Option A: Use only relevance scores from search results
- Option B: Combine relevance scores with document metadata (recency, token count)
- Option C: Use a machine learning model to score documents
2. **Chunking Strategies:**
- Option A: Fixed-size chunking with overlap
- Option B: Section-based chunking using Markdown headers
- Option C: Hierarchical chunking for very large documents
- Option D: Semantic chunking based on content similarity
### Decision and Rationale
For document prioritization, we chose Option B: a weighted scoring system that combines:
- Relevance scores from search results (primary factor)
- Document recency (secondary factor)
- Document token count (tertiary factor)
This approach allows us to prioritize documents that are both relevant to the query and recent, while also considering the information density of the document.
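As a concrete illustration, the weighted score works out as in the sketch below; the weights mirror the defaults in the new document processor (relevance 0.5, recency 0.3, token count 0.2), and the normalization formulas match its implementation.

```python
# Worked example of the weighted priority score (mirrors document_processor.py defaults).
def priority_score(relevance: float, age_days: float, token_count: int,
                   recency_weight: float = 0.3, token_weight: float = 0.2) -> float:
    relevance_weight = 1.0 - recency_weight - token_weight   # 0.5 by default
    recency_score = 1.0 / (1.0 + age_days)                   # newer documents score higher
    token_score = min(token_count / 5000, 1.0)               # information-density proxy, capped
    return (relevance_weight * relevance
            + recency_weight * recency_score
            + token_weight * token_score)

# A highly relevant document scraped one day ago with 4,000 tokens:
# 0.5 * 0.9 + 0.3 * 0.5 + 0.2 * 0.8 = 0.76
print(round(priority_score(0.9, 1.0, 4000), 2))  # 0.76
```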
For chunking strategies, we implemented a hybrid approach:
- Section-based chunking (Option B) as the primary strategy, which preserves the logical structure of documents
- Fixed-size chunking (Option A) as a fallback for documents without clear section headers
- Hierarchical chunking (Option C) for very large documents, which creates a summary chunk and preserves important sections
We decided against semantic chunking (Option D) for now due to the additional computational overhead and complexity, but may consider it for future enhancements.
### Implementation Details
1. **Document Prioritization:**
- Created a scoring formula that weights relevance (50-60%), recency (30%), and token count (10-20%)
- Normalized all scores to a 0-1 range for consistent weighting
- Added the priority score to each document for use in chunk selection
2. **Chunking Strategies:**
- Implemented section-based chunking using regex to identify Markdown headers
- Added fixed-size chunking with configurable chunk size and overlap
- Created hierarchical chunking for very large documents
- Preserved document metadata in all chunks for traceability
3. **Chunk Selection:**
- Implemented a token budget management system to stay within context limits
- Created an algorithm to select chunks based on priority while ensuring representation from multiple documents
- Added a minimum number of chunks per document to prevent over-representation of a single source (see the usage sketch below)
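These three steps are exposed through a single entry point on the new DocumentProcessor. A condensed usage sketch follows; the sample document and relevance score are made up for illustration.

```python
# Condensed usage of the new document processor; the sample document and
# relevance score below are made up for illustration.
from report.document_processor import get_document_processor

processor = get_document_processor()

documents = [{
    "id": 1,
    "url": "https://example.com/doc1",
    "title": "Example Document",
    "content": "# Example Document\n\nSome scraped Markdown content...",
    "token_count": 1200,
    "scrape_date": "2025-02-27T12:00:00",
}]
relevance_scores = {"https://example.com/doc1": 0.9}

chunks = processor.process_documents_for_report(
    documents,
    relevance_scores,
    token_budget=20000,   # stay well inside the model's context window
    chunk_size=1000,
    overlap_size=100,
)
for chunk in chunks:
    print(chunk["chunk_type"], chunk["token_count"], round(chunk["priority_score"], 2))
```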
### Impact and Next Steps
This implementation allows us to:
- Prioritize the most relevant and recent information
- Preserve the logical structure of documents
- Efficiently manage token budgets for different LLM models
- Balance information from multiple sources
Next steps include:
- Integrating with the LLM interface for report synthesis (Phase 3)
- Implementing the map-reduce approach for processing document chunks (sketched below)
- Creating report templates for different query types
- Adding citation generation and reference management
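A rough sketch of the planned map-reduce flow is shown below; `llm_summarize` is a placeholder for the Phase 3 call through the LLM interface (Groq's Llama 3.3 70B Versatile), and the prompts are illustrative only.

```python
# Rough sketch of the planned map-reduce synthesis (Phase 3). llm_summarize is a
# placeholder; the real call will go through the project's LLM interface.
from typing import Any, Dict, List

def llm_summarize(prompt: str) -> str:
    raise NotImplementedError("placeholder for the Phase 3 LLM call")

def map_reduce_report(query: str, chunks: List[Dict[str, Any]]) -> str:
    # Map step: summarize each selected chunk with respect to the query
    partial_summaries = [
        llm_summarize(f"Summarize the following for the query '{query}':\n\n{chunk['content']}")
        for chunk in chunks
    ]
    # Reduce step: merge the partial summaries into a single report
    notes = "\n\n".join(partial_summaries)
    return llm_summarize(f"Write a report answering '{query}' based on these notes:\n\n{notes}")
```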

View File

@@ -198,6 +198,7 @@ Added support for OpenRouter and Groq as LLM providers and configured the system
1. Test the system with Groq's models to evaluate performance
2. Implement the remaining query processing components
3. Create the Gradio UI for user interaction
4. Test the full system with end-to-end workflows
## Session: 2025-02-27 (Update 6)
@@ -393,3 +394,166 @@ Implemented a Gradio web interface for the intelligent research system, providin
2. Implement report generation in the UI
3. Add visualization components for search results
4. Test the UI with various query types and search engines
## Session: 2025-02-27 (Afternoon)
### Overview
In this session, we focused on debugging and fixing the Jina Reranker API integration to ensure it correctly processes queries and documents, enhancing the relevance of search results.
### Key Activities
1. **Jina Reranker API Integration**:
- Updated the `rerank` method in the JinaReranker class to match the expected API request format
- Modified the request payload to send an array of plain string documents instead of objects
- Enhanced response processing to handle both current and older API response formats
- Added detailed logging for API requests and responses for better debugging
2. **Testing Improvements**:
- Created a simplified test script (`test_simple_reranker.py`) to isolate and test the reranker functionality
- Updated the main test script to focus on core functionality without complex dependencies
- Implemented JSON result saving for better analysis of reranker output
- Added proper error handling in tests to provide clear feedback on issues
3. **Code Quality Enhancements**:
- Improved error handling throughout the reranker implementation
- Added informative debug messages at key points in the execution flow
- Ensured backward compatibility with previous API response formats
- Documented the expected request and response structures
### Insights and Learnings
- The Jina Reranker API expects documents as an array of plain strings, not objects with a "text" field (see the sketch after this list)
- The reranker response format includes a "document" field in the results which may contain either the text directly or an object with a "text" field
- Proper error handling and debug output are crucial for diagnosing issues with external API integrations
- Isolating components for testing makes debugging much more efficient
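A minimal sketch of the request/response handling described above follows. The endpoint URL, auth header, and the `relevance_score` field name are assumptions; what this session confirmed is that `documents` must be plain strings and that each result's `document` field may be either a string or an object with a `text` key.

```python
# Sketch only: the endpoint URL, auth header, and "relevance_score" field are
# assumptions; documents must be plain strings, and each result's "document"
# may be a string or an object with a "text" key.
import os
import requests

def rerank(query: str, documents: list[str], top_n: int = 5) -> list[dict]:
    response = requests.post(
        "https://api.jina.ai/v1/rerank",  # assumed endpoint
        headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
        json={
            "model": "jina-reranker-v2-base-multilingual",
            "query": query,
            "documents": documents,  # plain strings, not {"text": ...} objects
            "top_n": top_n,
        },
        timeout=30,
    )
    response.raise_for_status()
    reranked = []
    for item in response.json().get("results", []):
        doc = item.get("document")
        text = doc.get("text") if isinstance(doc, dict) else doc  # handle both response formats
        reranked.append({"text": text, "score": item.get("relevance_score")})
    return reranked
```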
### Challenges
- Adapting to changes in the Jina Reranker API response format
- Ensuring backward compatibility with older response formats
- Debugging nested API response structures
- Managing environment variables and configuration consistently across test scripts
### Next Steps
1. **Expand Testing**: Develop more comprehensive test cases for the reranker with diverse document types
2. **Integration**: Ensure the reranker is properly integrated with the result collector for end-to-end functionality
3. **Documentation**: Update API documentation to reflect the latest changes to the reranker implementation
4. **UI Integration**: Add reranker configuration options to the Gradio interface
## Session: 2025-02-27 - Report Generation Module Planning
### Overview
In this session, we focused on planning the Report Generation module, designing a comprehensive implementation approach, and making key decisions about document scraping, storage, and processing.
### Key Activities
1. **Designed a Phased Implementation Plan**:
- Created a four-phase implementation plan for the Report Generation module
- Phase 1: Document Scraping and Storage
- Phase 2: Document Prioritization and Chunking
- Phase 3: Report Generation
- Phase 4: Advanced Features
- Documented the plan in the memory bank for future reference
2. **Made Key Design Decisions**:
- Decided to use Jina Reader for web scraping due to its clean content extraction capabilities
- Chose SQLite for document storage to ensure persistence and efficient querying
- Designed a database schema with Documents and Metadata tables
- Planned a token budget management system to handle context window limitations
- Decided on a map-reduce approach for processing large document collections
3. **Addressed Context Window Limitations**:
- Evaluated Groq's Llama 3.3 70B Versatile model's 128K context window
- Designed document prioritization strategies based on relevance scores
- Planned chunking strategies for handling long documents
- Considered alternative models with larger context windows for future implementation
4. **Updated Documentation**:
- Added the implementation plan to the memory bank
- Updated the decision log with rationale for key decisions
- Revised the current focus to reflect the new implementation priorities
- Added a new session log entry to document the planning process
### Insights
- A phased implementation approach allows for incremental development and testing
- SQLite provides a good balance of simplicity and functionality for document storage
- Jina Reader integrates well with our existing Jina components (embeddings, reranker)
- The map-reduce pattern enables processing of unlimited document collections despite context window limitations
- Document prioritization is crucial for ensuring the most relevant content is included in reports
### Challenges
- Managing the 128K context window limitation with potentially large document collections
- Balancing document coverage against report quality
- Ensuring efficient web scraping without overwhelming target websites
- Designing a flexible architecture that can accommodate different models and approaches
### Next Steps
1. Begin implementing Phase 1 of the Report Generation module:
- Set up the SQLite database with the designed schema
- Implement the Jina Reader integration for web scraping
- Create the document processing pipeline
- Develop URL validation and normalization functionality
- Add caching and deduplication for scraped content
2. Plan for Phase 2 implementation:
- Design the token budget management system
- Develop document prioritization algorithms
- Create chunking strategies for long documents
## Session: 2025-02-27 - Report Generation Module Implementation (Phase 1)
### Overview
In this session, we implemented Phase 1 of the Report Generation module, focusing on document scraping and SQLite storage. We created the necessary components for scraping web pages, storing their content in a SQLite database, and retrieving documents for report generation.
### Key Activities
1. **Created Database Manager**:
- Implemented a SQLite database manager with tables for documents and metadata
- Added full CRUD operations for documents
- Implemented transaction handling for data integrity
- Created methods for document search and retrieval
- Used aiosqlite for asynchronous database operations
2. **Implemented Document Scraper**:
- Created a document scraper with Jina Reader API integration
- Added fallback mechanism using BeautifulSoup for when Jina API fails
- Implemented URL validation and normalization
- Added content conversion to Markdown format
- Implemented token counting using tiktoken
- Created metadata extraction from HTML content
- Added document deduplication using content hashing (see the sketch after this list)
3. **Developed Report Generator Base**:
- Created the basic structure for the report generation process
- Implemented methods to process search results by scraping URLs
- Integrated with the database manager and document scraper
- Set up the foundation for future phases
4. **Created Test Script**:
- Developed a test script to verify functionality
- Tested document scraping, storage, and retrieval
- Verified search functionality within the database
- Ensured proper error handling and fallback mechanisms
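A small sketch of the deduplication and token-counting steps is shown below; the choice of SHA-256 is an assumption (any stable content hash would do), while the `cl100k_base` encoding matches the one used by the document processor.

```python
# Sketch: SHA-256 is an assumed hash choice; cl100k_base matches the encoding
# used by the document processor.
import hashlib
import tiktoken

_encoding = tiktoken.get_encoding("cl100k_base")

def content_hash(markdown: str) -> str:
    """Stable fingerprint of scraped content, used to skip duplicate documents."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

def count_tokens(markdown: str) -> int:
    """Token count stored with each document for later budget management."""
    return len(_encoding.encode(markdown))
```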
### Insights
- The fallback mechanism for document scraping is crucial, as the Jina Reader API may not always be available or may fail for certain URLs
- Asynchronous processing significantly improves performance when scraping multiple URLs
- Content hashing is an effective way to prevent duplicate documents in the database
- Storing metadata separately from document content provides flexibility for future enhancements
- The SQLite database provides a good balance of simplicity and functionality for document storage
### Challenges
- Handling different HTML structures across websites for metadata extraction
- Managing asynchronous operations and error handling
- Ensuring proper transaction handling for database operations
- Balancing clean content extraction against preserving important information
### Next Steps
1. **Integration with Search Execution**:
- Connect the report generation module to the search execution pipeline
- Implement automatic processing of search results
2. **Begin Phase 2 Implementation**:
- Develop document prioritization based on relevance scores
- Implement chunking strategies for long documents
- Create token budget management system
3. **Testing and Refinement**:
- Create more comprehensive tests for edge cases
- Refine error handling and logging
- Optimize performance for large numbers of documents

View File

@@ -0,0 +1,493 @@
"""
Document processor module for the report generation module.
This module provides functionality to prioritize documents based on relevance scores,
chunk long documents into manageable pieces, and select the most relevant chunks
to stay within token budget limits.
"""
import re
import math
import logging
import tiktoken
from typing import Dict, List, Any, Optional, Tuple, Union, Set
from datetime import datetime
from report.database.db_manager import get_db_manager
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class DocumentProcessor:
"""
Document processor for the report generation module.
This class provides methods to prioritize documents based on relevance scores,
chunk long documents into manageable pieces, and select the most relevant chunks
to stay within token budget limits.
"""
def __init__(self, default_token_limit: int = 120000):
"""
Initialize the document processor.
Args:
default_token_limit: Default token limit for the context window
"""
self.db_manager = get_db_manager()
self.default_token_limit = default_token_limit
self.tokenizer = tiktoken.get_encoding("cl100k_base") # Using OpenAI's tokenizer
def _count_tokens(self, text: str) -> int:
"""
Count the number of tokens in a text.
Args:
text: The text to count tokens for
Returns:
Number of tokens in the text
"""
return len(self.tokenizer.encode(text))
def prioritize_documents(self, documents: List[Dict[str, Any]],
relevance_scores: Optional[Dict[str, float]] = None,
recency_weight: float = 0.3,
token_count_weight: float = 0.2) -> List[Dict[str, Any]]:
"""
Prioritize documents based on relevance scores, recency, and token count.
Args:
documents: List of documents to prioritize
relevance_scores: Dictionary mapping document URLs to relevance scores
recency_weight: Weight for recency in the prioritization score
token_count_weight: Weight for token count in the prioritization score
Returns:
List of documents sorted by priority score
"""
# If no relevance scores provided, use equal scores for all documents
if relevance_scores is None:
relevance_scores = {doc['url']: 1.0 for doc in documents}
# Get current time for recency calculation
current_time = datetime.now()
# Calculate priority scores
for doc in documents:
# Relevance score (normalized to 0-1)
relevance_score = relevance_scores.get(doc['url'], 0.0)
# Recency score (normalized to 0-1)
try:
doc_time = datetime.fromisoformat(doc['scrape_date'])
time_diff = (current_time - doc_time).total_seconds() / 86400 # Convert to days
recency_score = 1.0 / (1.0 + time_diff) # Newer documents get higher scores
except (KeyError, ValueError):
recency_score = 0.5 # Default if scrape_date is missing or invalid
# Token count score (normalized to 0-1)
# Prefer documents with more tokens, but not too many
token_count = doc.get('token_count', 0)
token_count_score = min(token_count / 5000, 1.0) # Normalize to 0-1
# Calculate final priority score
relevance_weight = 1.0 - recency_weight - token_count_weight
priority_score = (
relevance_weight * relevance_score +
recency_weight * recency_score +
token_count_weight * token_count_score
)
# Add priority score to document
doc['priority_score'] = priority_score
# Sort documents by priority score (descending)
return sorted(documents, key=lambda x: x.get('priority_score', 0.0), reverse=True)
def chunk_document_by_sections(self, document: Dict[str, Any],
max_chunk_tokens: int = 1000,
overlap_tokens: int = 100) -> List[Dict[str, Any]]:
"""
Chunk a document by sections based on Markdown headers.
Args:
document: Document to chunk
max_chunk_tokens: Maximum number of tokens per chunk
overlap_tokens: Number of tokens to overlap between chunks
Returns:
List of document chunks
"""
content = document.get('content', '')
# If content is empty, return empty list
if not content.strip():
return []
# Find all headers in the content
header_pattern = re.compile(r'^(#{1,6})\s+(.+)$', re.MULTILINE)
headers = list(header_pattern.finditer(content))
# If no headers found, use fixed-size chunking
if not headers:
return self.chunk_document_fixed_size(document, max_chunk_tokens, overlap_tokens)
chunks = []
# Process each section (from one header to the next)
for i in range(len(headers)):
start_pos = headers[i].start()
# Determine end position (next header or end of content)
if i < len(headers) - 1:
end_pos = headers[i + 1].start()
else:
end_pos = len(content)
section_content = content[start_pos:end_pos]
section_tokens = self._count_tokens(section_content)
# If section is small enough, add it as a single chunk
if section_tokens <= max_chunk_tokens:
chunks.append({
'document_id': document.get('id'),
'url': document.get('url'),
'title': document.get('title'),
'content': section_content,
'token_count': section_tokens,
'chunk_type': 'section',
'section_title': headers[i].group(2),
'section_level': len(headers[i].group(1)),
'priority_score': document.get('priority_score', 0.0)
})
else:
# If section is too large, split it into fixed-size chunks
section_chunks = self._split_text_fixed_size(
section_content,
max_chunk_tokens,
overlap_tokens
)
for j, chunk_content in enumerate(section_chunks):
chunk_tokens = self._count_tokens(chunk_content)
chunks.append({
'document_id': document.get('id'),
'url': document.get('url'),
'title': document.get('title'),
'content': chunk_content,
'token_count': chunk_tokens,
'chunk_type': 'section_part',
'section_title': headers[i].group(2),
'section_level': len(headers[i].group(1)),
'section_part': j + 1,
'total_parts': len(section_chunks),
'priority_score': document.get('priority_score', 0.0)
})
return chunks
def chunk_document_fixed_size(self, document: Dict[str, Any],
max_chunk_tokens: int = 1000,
overlap_tokens: int = 100) -> List[Dict[str, Any]]:
"""
Chunk a document into fixed-size chunks with overlap.
Args:
document: Document to chunk
max_chunk_tokens: Maximum number of tokens per chunk
overlap_tokens: Number of tokens to overlap between chunks
Returns:
List of document chunks
"""
content = document.get('content', '')
# If content is empty, return empty list
if not content.strip():
return []
# Split content into fixed-size chunks
content_chunks = self._split_text_fixed_size(content, max_chunk_tokens, overlap_tokens)
chunks = []
# Create chunk objects
for i, chunk_content in enumerate(content_chunks):
chunk_tokens = self._count_tokens(chunk_content)
chunks.append({
'document_id': document.get('id'),
'url': document.get('url'),
'title': document.get('title'),
'content': chunk_content,
'token_count': chunk_tokens,
'chunk_type': 'fixed',
'chunk_index': i + 1,
'total_chunks': len(content_chunks),
'priority_score': document.get('priority_score', 0.0)
})
return chunks
def _split_text_fixed_size(self, text: str,
max_chunk_tokens: int = 1000,
overlap_tokens: int = 100) -> List[str]:
"""
Split text into fixed-size chunks with overlap.
Args:
text: Text to split
max_chunk_tokens: Maximum number of tokens per chunk
overlap_tokens: Number of tokens to overlap between chunks
Returns:
List of text chunks
"""
# Encode text into tokens
tokens = self.tokenizer.encode(text)
# If text is small enough, return as a single chunk
if len(tokens) <= max_chunk_tokens:
return [text]
# Calculate number of chunks needed
num_chunks = math.ceil((len(tokens) - overlap_tokens) / (max_chunk_tokens - overlap_tokens))
chunks = []
# Split tokens into chunks
for i in range(num_chunks):
# Calculate start and end positions
start_pos = i * (max_chunk_tokens - overlap_tokens)
end_pos = min(start_pos + max_chunk_tokens, len(tokens))
# Extract chunk tokens
chunk_tokens = tokens[start_pos:end_pos]
# Decode chunk tokens back to text
chunk_text = self.tokenizer.decode(chunk_tokens)
chunks.append(chunk_text)
return chunks
def chunk_document_hierarchical(self, document: Dict[str, Any],
max_chunk_tokens: int = 1000,
overlap_tokens: int = 100) -> List[Dict[str, Any]]:
"""
Chunk a very large document using a hierarchical approach.
This method first chunks the document by sections, then further chunks
large sections into smaller pieces.
Args:
document: Document to chunk
max_chunk_tokens: Maximum number of tokens per chunk
overlap_tokens: Number of tokens to overlap between chunks
Returns:
List of document chunks
"""
# First, chunk by sections
section_chunks = self.chunk_document_by_sections(document, max_chunk_tokens, overlap_tokens)
# If the document is small enough, return section chunks
if sum(chunk.get('token_count', 0) for chunk in section_chunks) <= max_chunk_tokens * 3:
return section_chunks
# Otherwise, create a summary chunk and keep the most important sections
content = document.get('content', '')
title = document.get('title', '')
# Extract first paragraph as summary
first_para_match = re.search(r'^(.*?)\n\n', content, re.DOTALL)
summary = first_para_match.group(1) if first_para_match else content[:500]
# Create summary chunk
summary_chunk = {
'document_id': document.get('id'),
'url': document.get('url'),
'title': title,
'content': f"# {title}\n\n{summary}\n\n(This is a summary of a large document)",
'token_count': self._count_tokens(f"# {title}\n\n{summary}\n\n(This is a summary of a large document)"),
'chunk_type': 'summary',
'priority_score': document.get('priority_score', 0.0) * 1.2 # Boost summary priority
}
# Sort section chunks by priority (section level and position)
def section_priority(chunk):
# Prioritize by section level (lower is more important)
level_score = 6 - chunk.get('section_level', 3)
# Prioritize earlier sections
position_score = 1.0 / (1.0 + chunk.get('chunk_index', 0) + chunk.get('section_part', 0))
return level_score * position_score
sorted_sections = sorted(section_chunks, key=section_priority, reverse=True)
# Return summary chunk and top sections
return [summary_chunk] + sorted_sections
def select_chunks_for_context(self, chunks: List[Dict[str, Any]],
token_budget: int,
min_chunks_per_doc: int = 1) -> List[Dict[str, Any]]:
"""
Select chunks to include in the context window based on token budget.
Args:
chunks: List of document chunks
token_budget: Maximum number of tokens to use
min_chunks_per_doc: Minimum number of chunks to include per document
Returns:
List of selected chunks
"""
# Group chunks by document
doc_chunks = {}
for chunk in chunks:
doc_id = chunk.get('document_id')
if doc_id not in doc_chunks:
doc_chunks[doc_id] = []
doc_chunks[doc_id].append(chunk)
# Sort chunks within each document by priority
for doc_id in doc_chunks:
doc_chunks[doc_id] = sorted(
doc_chunks[doc_id],
key=lambda x: x.get('priority_score', 0.0),
reverse=True
)
# Select at least min_chunks_per_doc from each document
selected_chunks = []
remaining_budget = token_budget
# First pass: select minimum chunks from each document
for doc_id, chunks in doc_chunks.items():
for i in range(min(min_chunks_per_doc, len(chunks))):
chunk = chunks[i]
selected_chunks.append(chunk)
remaining_budget -= chunk.get('token_count', 0)
# If we've exceeded the budget, sort selected chunks and trim
if remaining_budget <= 0:
selected_chunks = sorted(
selected_chunks,
key=lambda x: x.get('priority_score', 0.0),
reverse=True
)
# Keep adding chunks until we exceed the budget
current_budget = 0
for i, chunk in enumerate(selected_chunks):
current_budget += chunk.get('token_count', 0)
if current_budget > token_budget:
selected_chunks = selected_chunks[:i]
break
return selected_chunks
# Second pass: add more chunks based on priority until budget is exhausted
# Flatten remaining chunks from all documents
remaining_chunks = []
for doc_id, chunks in doc_chunks.items():
if len(chunks) > min_chunks_per_doc:
remaining_chunks.extend(chunks[min_chunks_per_doc:])
# Sort remaining chunks by priority
remaining_chunks = sorted(
remaining_chunks,
key=lambda x: x.get('priority_score', 0.0),
reverse=True
)
# Add chunks until budget is exhausted
for chunk in remaining_chunks:
if chunk.get('token_count', 0) <= remaining_budget:
selected_chunks.append(chunk)
remaining_budget -= chunk.get('token_count', 0)
if remaining_budget <= 0:
break
return selected_chunks
def process_documents_for_report(self, documents: List[Dict[str, Any]],
relevance_scores: Optional[Dict[str, float]] = None,
token_budget: Optional[int] = None,
chunk_size: int = 1000,
overlap_size: int = 100) -> List[Dict[str, Any]]:
"""
Process documents for report generation.
This method prioritizes documents, chunks them, and selects the most
relevant chunks to stay within the token budget.
Args:
documents: List of documents to process
relevance_scores: Dictionary mapping document URLs to relevance scores
token_budget: Maximum number of tokens to use (default: self.default_token_limit)
chunk_size: Maximum number of tokens per chunk
overlap_size: Number of tokens to overlap between chunks
Returns:
List of selected document chunks
"""
if token_budget is None:
token_budget = self.default_token_limit
# Prioritize documents
prioritized_docs = self.prioritize_documents(documents, relevance_scores)
# Chunk documents
all_chunks = []
for doc in prioritized_docs:
# Choose chunking strategy based on document size
token_count = doc.get('token_count', 0)
if token_count > chunk_size * 10:
# Very large document: use hierarchical chunking
chunks = self.chunk_document_hierarchical(doc, chunk_size, overlap_size)
elif token_count > chunk_size:
# Medium document: use section-based chunking
chunks = self.chunk_document_by_sections(doc, chunk_size, overlap_size)
else:
# Small document: keep as a single chunk
chunks = [{
'document_id': doc.get('id'),
'url': doc.get('url'),
'title': doc.get('title'),
'content': doc.get('content', ''),
'token_count': token_count,
'chunk_type': 'full',
'priority_score': doc.get('priority_score', 0.0)
}]
all_chunks.extend(chunks)
# Select chunks based on token budget
selected_chunks = self.select_chunks_for_context(all_chunks, token_budget)
# Log statistics
total_docs = len(documents)
total_chunks = len(all_chunks)
selected_chunk_count = len(selected_chunks)
selected_token_count = sum(chunk.get('token_count', 0) for chunk in selected_chunks)
logger.info(f"Processed {total_docs} documents into {total_chunks} chunks")
logger.info(f"Selected {selected_chunk_count} chunks with {selected_token_count} tokens")
return selected_chunks
# Create a singleton instance for global use
document_processor = DocumentProcessor()
def get_document_processor() -> DocumentProcessor:
"""
Get the global document processor instance.
Returns:
DocumentProcessor instance
"""
return document_processor

View File

@@ -13,6 +13,7 @@ from typing import Dict, List, Any, Optional, Tuple, Union
from report.database.db_manager import get_db_manager, initialize_database
from report.document_scraper import get_document_scraper
from report.document_processor import get_document_processor
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
@@ -31,6 +32,7 @@ class ReportGenerator:
"""Initialize the report generator."""
self.db_manager = get_db_manager()
self.document_scraper = get_document_scraper()
self.document_processor = get_document_processor()
async def initialize(self):
"""Initialize the report generator by setting up the database."""
@@ -50,13 +52,19 @@ class ReportGenerator:
# Extract URLs from search results
urls = [result.get('url') for result in search_results if result.get('url')]
# Extract relevance scores if available
relevance_scores = {}
for result in search_results:
if result.get('url') and result.get('score') is not None:
relevance_scores[result.get('url')] = result.get('score')
# Scrape URLs and store in database
documents = await self.document_scraper.scrape_urls(urls)
# Log results
logger.info(f"Processed {len(documents)} documents out of {len(urls)} URLs")
return documents
return documents, relevance_scores
async def get_document_by_url(self, url: str) -> Optional[Dict[str, Any]]:
"""
@@ -83,6 +91,84 @@ class ReportGenerator:
"""
return await self.db_manager.search_documents(query, limit)
async def prepare_documents_for_report(self,
search_results: List[Dict[str, Any]],
token_budget: Optional[int] = None,
chunk_size: int = 1000,
overlap_size: int = 100) -> List[Dict[str, Any]]:
"""
Prepare documents for report generation by processing search results,
prioritizing documents, and chunking them to fit within token budget.
Args:
search_results: List of search results
token_budget: Maximum number of tokens to use
chunk_size: Maximum number of tokens per chunk
overlap_size: Number of tokens to overlap between chunks
Returns:
List of selected document chunks
"""
# Process search results to get documents and relevance scores
documents, relevance_scores = await self.process_search_results(search_results)
# Prioritize and chunk documents
selected_chunks = self.document_processor.process_documents_for_report(
documents,
relevance_scores,
token_budget,
chunk_size,
overlap_size
)
return selected_chunks
async def generate_report(self,
search_results: List[Dict[str, Any]],
query: str,
token_budget: Optional[int] = None,
chunk_size: int = 1000,
overlap_size: int = 100) -> str:
"""
Generate a report from search results.
Args:
search_results: List of search results
query: Original search query
token_budget: Maximum number of tokens to use
chunk_size: Maximum number of tokens per chunk
overlap_size: Number of tokens to overlap between chunks
Returns:
Generated report as a string
"""
# Prepare documents for report
selected_chunks = await self.prepare_documents_for_report(
search_results,
token_budget,
chunk_size,
overlap_size
)
# TODO: Implement report synthesis using LLM
# For now, just return a placeholder report
report = f"# Report for: {query}\n\n"
report += f"Based on {len(selected_chunks)} document chunks\n\n"
# Add document summaries
for i, chunk in enumerate(selected_chunks[:5]): # Show first 5 chunks
report += f"## Document {i+1}: {chunk.get('title', 'Untitled')}\n"
report += f"Source: {chunk.get('url', 'Unknown')}\n"
report += f"Chunk type: {chunk.get('chunk_type', 'Unknown')}\n"
report += f"Priority score: {chunk.get('priority_score', 0.0):.2f}\n\n"
# Add a snippet of the content
content = chunk.get('content', '')
snippet = content[:200] + "..." if len(content) > 200 else content
report += f"{snippet}\n\n"
return report
# Create a singleton instance for global use
report_generator = ReportGenerator()
@@ -100,30 +186,50 @@ def get_report_generator() -> ReportGenerator:
"""
return report_generator
# Example usage
async def test_report_generator():
"""Test the report generator with sample search results."""
# Initialize report generator
# Initialize the report generator
await initialize_report_generator()
# Sample search results
search_results = [
{"url": "https://en.wikipedia.org/wiki/Web_scraping", "title": "Web scraping - Wikipedia"},
{"url": "https://en.wikipedia.org/wiki/Natural_language_processing", "title": "Natural language processing - Wikipedia"}
{
'title': 'Example Document 1',
'url': 'https://example.com/doc1',
'snippet': 'This is an example document.',
'score': 0.95
},
{
'title': 'Example Document 2',
'url': 'https://example.com/doc2',
'snippet': 'This is another example document.',
'score': 0.85
},
{
'title': 'Python Documentation',
'url': 'https://docs.python.org/3/',
'snippet': 'Official Python documentation.',
'score': 0.75
}
]
# Process search results
generator = get_report_generator()
documents = await generator.process_search_results(search_results)
documents, relevance_scores = await report_generator.process_search_results(search_results)
# Print results
# Print documents
print(f"Processed {len(documents)} documents")
for doc in documents:
print(f"Title: {doc['title']}")
print(f"URL: {doc['url']}")
print(f"Token count: {doc['token_count']}")
print(f"Content preview: {doc['content'][:200]}...")
print("-" * 80)
print(f"Document: {doc.get('title')} ({doc.get('url')})")
print(f"Token count: {doc.get('token_count')}")
print(f"Content snippet: {doc.get('content')[:100]}...")
print()
# Generate report
report = await report_generator.generate_report(search_results, "Python programming")
# Print report
print("Generated Report:")
print(report)
# Run test if this module is executed directly
if __name__ == "__main__":

View File

@@ -0,0 +1,156 @@
"""
Test script for the document processor module.
This script tests the document prioritization and chunking functionality
of the document processor module.
"""
import os
import sys
import asyncio
import json
from datetime import datetime
from typing import Dict, List, Any, Optional
# Add the project root directory to the Python path
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from report.document_processor import get_document_processor
from report.database.db_manager import get_db_manager, initialize_database
from report.document_scraper import get_document_scraper
async def test_document_processor():
"""Test the document processor with sample documents."""
# Initialize the database
await initialize_database()
# Get the document processor and scraper
document_processor = get_document_processor()
document_scraper = get_document_scraper()
db_manager = get_db_manager()
# Sample URLs to test with
test_urls = [
"https://en.wikipedia.org/wiki/Python_(programming_language)",
"https://en.wikipedia.org/wiki/Natural_language_processing",
"https://docs.python.org/3/tutorial/index.html",
"https://en.wikipedia.org/wiki/Machine_learning"
]
# Scrape the URLs
print(f"Scraping {len(test_urls)} URLs...")
documents = await document_scraper.scrape_urls(test_urls)
print(f"Scraped {len(documents)} documents")
# Sample relevance scores
relevance_scores = {
"https://en.wikipedia.org/wiki/Python_(programming_language)": 0.95,
"https://en.wikipedia.org/wiki/Natural_language_processing": 0.85,
"https://docs.python.org/3/tutorial/index.html": 0.75,
"https://en.wikipedia.org/wiki/Machine_learning": 0.65
}
# Test document prioritization
print("\nTesting document prioritization...")
prioritized_docs = document_processor.prioritize_documents(documents, relevance_scores)
print("Prioritized documents:")
for i, doc in enumerate(prioritized_docs):
print(f"{i+1}. {doc.get('title')} - Score: {doc.get('priority_score', 0.0):.2f}")
# Test document chunking
print("\nTesting document chunking...")
# Test section-based chunking
print("\nSection-based chunking:")
if documents:
section_chunks = document_processor.chunk_document_by_sections(documents[0], 1000, 100)
print(f"Created {len(section_chunks)} section-based chunks")
for i, chunk in enumerate(section_chunks[:3]): # Show first 3 chunks
print(f"Chunk {i+1}:")
print(f" Type: {chunk.get('chunk_type')}")
print(f" Section: {chunk.get('section_title', 'N/A')}")
print(f" Tokens: {chunk.get('token_count')}")
content = chunk.get('content', '')
print(f" Content preview: {content[:100]}...")
# Test fixed-size chunking
print("\nFixed-size chunking:")
if documents:
fixed_chunks = document_processor.chunk_document_fixed_size(documents[0], 1000, 100)
print(f"Created {len(fixed_chunks)} fixed-size chunks")
for i, chunk in enumerate(fixed_chunks[:3]): # Show first 3 chunks
print(f"Chunk {i+1}:")
print(f" Type: {chunk.get('chunk_type')}")
print(f" Index: {chunk.get('chunk_index')}/{chunk.get('total_chunks')}")
print(f" Tokens: {chunk.get('token_count')}")
content = chunk.get('content', '')
print(f" Content preview: {content[:100]}...")
# Test hierarchical chunking
print("\nHierarchical chunking:")
if documents:
hierarchical_chunks = document_processor.chunk_document_hierarchical(documents[0], 1000, 100)
print(f"Created {len(hierarchical_chunks)} hierarchical chunks")
for i, chunk in enumerate(hierarchical_chunks[:3]): # Show first 3 chunks
print(f"Chunk {i+1}:")
print(f" Type: {chunk.get('chunk_type')}")
if chunk.get('chunk_type') == 'summary':
print(f" Summary chunk")
else:
print(f" Section: {chunk.get('section_title', 'N/A')}")
print(f" Tokens: {chunk.get('token_count')}")
content = chunk.get('content', '')
print(f" Content preview: {content[:100]}...")
# Test chunk selection
print("\nTesting chunk selection...")
# Create a mix of chunks from all documents
all_chunks = []
for doc in documents:
chunks = document_processor.chunk_document_by_sections(doc, 1000, 100)
all_chunks.extend(chunks)
print(f"Total chunks: {len(all_chunks)}")
# Select chunks based on token budget
token_budget = 10000
selected_chunks = document_processor.select_chunks_for_context(all_chunks, token_budget)
total_tokens = sum(chunk.get('token_count', 0) for chunk in selected_chunks)
print(f"Selected {len(selected_chunks)} chunks with {total_tokens} tokens (budget: {token_budget})")
# Test full document processing
print("\nTesting full document processing...")
processed_chunks = document_processor.process_documents_for_report(
documents,
relevance_scores,
token_budget=20000,
chunk_size=1000,
overlap_size=100
)
total_processed_tokens = sum(chunk.get('token_count', 0) for chunk in processed_chunks)
print(f"Processed {len(processed_chunks)} chunks with {total_processed_tokens} tokens")
# Show the top 5 chunks
print("\nTop 5 chunks:")
for i, chunk in enumerate(processed_chunks[:5]):
print(f"Chunk {i+1}:")
print(f" Document: {chunk.get('title')}")
print(f" Type: {chunk.get('chunk_type')}")
print(f" Priority: {chunk.get('priority_score', 0.0):.2f}")
print(f" Tokens: {chunk.get('token_count')}")
content = chunk.get('content', '')
print(f" Content preview: {content[:100]}...")
async def main():
"""Main function to run the tests."""
await test_document_processor()
if __name__ == "__main__":
asyncio.run(main())