# Code Structure
## Current Project Organization
```
project/
├── examples/  # Sample data and query examples
├── report/  # Report generation module
│   ├── __init__.py
│   ├── report_generator.py  # Module for generating reports
│   ├── report_synthesis.py  # Module for synthesizing reports
│   ├── progressive_report_synthesis.py  # Module for progressive report generation
│   ├── document_processor.py  # Module for processing documents
│   ├── document_scraper.py  # Module for scraping documents
│   ├── report_detail_levels.py  # Module for managing report detail levels
│   ├── report_templates.py  # Module for managing report templates
│   └── database/  # Database for storing reports
│       ├── __init__.py
│       └── db_manager.py  # Module for managing the database
├── tests/  # Test suite
│   ├── __init__.py
│   ├── execution/  # Search execution tests
│   │   ├── __init__.py
│   │   ├── test_search.py
│   │   ├── test_search_execution.py
│   │   └── test_all_handlers.py
│   ├── integration/  # Integration tests
│   │   ├── __init__.py
│   │   ├── test_ev_query.py
│   │   └── test_query_to_report.py
│   ├── query/  # Query processing tests
│   │   ├── __init__.py
│   │   ├── test_query_processor.py
│   │   ├── test_query_processor_comprehensive.py
│   │   └── test_llm_interface.py
│   ├── ranking/  # Ranking algorithm tests
│   │   ├── __init__.py
│   │   ├── test_reranker.py
│   │   ├── test_similarity.py
│   │   └── test_simple_reranker.py
│   ├── report/  # Report generation tests
│   │   ├── __init__.py
│   │   ├── test_custom_model.py
│   │   ├── test_detail_levels.py
│   │   ├── test_brief_report.py
│   │   └── test_report_templates.py
│   ├── ui/  # UI component tests
│   │   ├── __init__.py
│   │   └── test_ui_search.py
│   ├── test_document_processor.py
│   ├── test_document_scraper.py
│   └── test_report_synthesis.py
├── utils/  # Utility scripts and shared functions
│   ├── __init__.py
│   ├── jina_similarity.py  # Module for computing text similarity
│   └── markdown_segmenter.py  # Module for segmenting markdown documents
├── config/  # Configuration management
│   ├── __init__.py
│   ├── config.py  # Configuration management class
│   └── config.yaml  # YAML configuration file with settings for different components
├── query/  # Query processing module
│   ├── __init__.py
│   ├── query_processor.py  # Module for processing user queries
│   └── llm_interface.py  # Module for interacting with LLM providers
├── execution/  # Search execution module
│   ├── __init__.py
│   ├── search_executor.py  # Module for executing search queries
│   ├── result_collector.py  # Module for collecting search results
│   └── api_handlers/  # Handlers for different search APIs
│       ├── __init__.py
│       ├── base_handler.py  # Base class for search handlers
│       ├── serper_handler.py  # Handler for Serper API (Google search)
│       ├── scholar_handler.py  # Handler for Google Scholar via Serper
│       ├── google_handler.py  # Handler for Google search
│       └── arxiv_handler.py  # Handler for arXiv API
├── ranking/  # Ranking module
│   ├── __init__.py
│   └── jina_reranker.py  # Module for reranking documents using Jina AI
├── ui/  # UI module
│   ├── __init__.py
│   └── gradio_interface.py  # Gradio-based web interface
├── scripts/  # Scripts
│   └── query_to_report.py  # Script for generating reports from queries
├── run_ui.py  # Script to run the UI
└── requirements.txt  # Project dependencies
```
## Module Details
### Config Module
The `config` module manages configuration settings for the entire system, including API keys, model selections, and other parameters.
### Files
- `__init__.py`: Package initialization file
- `config.py`: Configuration management class
- `config.yaml`: YAML configuration file with settings for different components
### Classes
- `Config`: Singleton class for loading and accessing configuration settings
  - `load_config(config_path)`: Loads configuration from a YAML file
  - `get(key, default=None)`: Gets a configuration value by key
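
As a rough illustration of the singleton pattern described above, the sketch below shows one way `Config` could be shaped. The dotted-key lookup in `get` and the private `_data` attribute are assumptions made for the example, not details taken from `config.py`; loading relies on PyYAML's `yaml.safe_load`.

```python
from typing import Any, Optional


class Config:
    """Singleton holding settings loaded from a YAML file (illustrative sketch)."""

    _instance: Optional["Config"] = None

    def __new__(cls) -> "Config":
        # Every construction returns the same shared instance.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._data = {}
        return cls._instance

    def load_config(self, config_path: str) -> None:
        # PyYAML is imported lazily so the class can be exercised without it
        # when settings are injected directly (for example in tests).
        import yaml

        with open(config_path, "r", encoding="utf-8") as fh:
            self._data = yaml.safe_load(fh) or {}

    def get(self, key: str, default: Any = None) -> Any:
        # Assumption: dotted keys such as "llm.model" walk nested mappings.
        node: Any = self._data
        for part in key.split("."):
            if not isinstance(node, dict) or part not in node:
                return default
            node = node[part]
        return node
```

With this shape, `Config().get("llm.model", "fallback")` returns the fallback until a file containing that key has been loaded.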
### Query Module
The `query` module handles the processing and enhancement of user queries, including classification and optimization for search.
### Files
- `__init__.py`: Package initialization file
- `query_processor.py`: Main module for processing user queries
- `query_classifier.py`: Module for classifying query types
- `llm_interface.py`: Interface for interacting with LLM providers
### Classes
- `QueryProcessor`: Main class for processing user queries
  - `process_query(query)`: Processes a user query and returns enhanced results
  - `classify_query(query)`: Classifies a query by type and intent
  - `generate_search_queries(query, classification)`: Generates optimized search queries
- `QueryClassifier`: Class for classifying queries
  - `classify(query)`: Classifies a query by type, intent, and entities
- `LLMInterface`: Interface for interacting with LLM providers
  - `get_completion(prompt, model=None)`: Gets a completion from an LLM
  - `enhance_query(query)`: Enhances a query with additional context
  - `classify_query(query)`: Uses an LLM to classify a query
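
The flow through `process_query` can be sketched as follows. This is a minimal illustration rather than the project's code: `StubLLMInterface` is a hypothetical stand-in that returns canned values instead of calling a provider, the keys of the returned dictionary are assumed, and the methods are written as coroutines because the Recent Updates section notes they were converted to async.

```python
import asyncio
from typing import Any, Dict, List


class StubLLMInterface:
    """Hypothetical stand-in for LLMInterface; no provider is contacted."""

    async def enhance_query(self, query: str) -> str:
        # A real implementation would ask an LLM to add context.
        return query.strip()

    async def classify_query(self, query: str) -> Dict[str, str]:
        # Toy rule in place of an LLM call.
        qtype = "comparative" if " vs " in query else "exploratory"
        return {"type": qtype}


class QueryProcessor:
    def __init__(self, llm: StubLLMInterface) -> None:
        self.llm = llm

    async def classify_query(self, query: str) -> Dict[str, str]:
        return await self.llm.classify_query(query)

    async def generate_search_queries(
        self, query: str, classification: Dict[str, str]
    ) -> List[str]:
        # Hypothetical rule: comparative queries get one extra variant.
        queries = [query]
        if classification.get("type") == "comparative":
            queries.append(f"{query} comparison")
        return queries

    async def process_query(self, query: str) -> Dict[str, Any]:
        # Enhance, classify, then derive search queries from both results.
        enhanced = await self.llm.enhance_query(query)
        classification = await self.classify_query(query)
        search_queries = await self.generate_search_queries(enhanced, classification)
        return {
            "original_query": query,
            "enhanced_query": enhanced,
            "classification": classification,
            "search_queries": search_queries,
        }
```

A caller outside an event loop would run it with `asyncio.run(QueryProcessor(StubLLMInterface()).process_query("..."))`.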
### Execution Module
The `execution` module handles the execution of search queries across multiple search engines and the collection of results.
### Files
- `__init__.py`: Package initialization file
- `search_executor.py`: Module for executing search queries
- `result_collector.py`: Module for collecting and processing search results
- `api_handlers/`: Directory containing handlers for different search APIs
  - `__init__.py`: Package initialization file
  - `base_handler.py`: Base class for search handlers
  - `serper_handler.py`: Handler for Serper API (Google search)
  - `scholar_handler.py`: Handler for Google Scholar via Serper
  - `arxiv_handler.py`: Handler for arXiv API
### Classes
- `SearchExecutor`: Class for executing search queries
  - `execute_search(query_data)`: Executes a search across multiple engines
  - `_execute_search_async(query, engines)`: Executes a search asynchronously
  - `_execute_search_sync(query, engines)`: Executes a search synchronously
- `ResultCollector`: Class for collecting and processing search results
  - `process_results(search_results)`: Processes search results from multiple engines
  - `deduplicate_results(results)`: Deduplicates results based on URL
  - `save_results(results, file_path)`: Saves results to a file
- `BaseSearchHandler`: Base class for search handlers
  - `search(query, num_results)`: Abstract method for searching
  - `_process_response(response)`: Processes the API response
- `SerperSearchHandler`: Handler for Serper API
  - `search(query, num_results)`: Searches using Serper API
  - `_process_response(response)`: Processes the Serper API response
- `ScholarSearchHandler`: Handler for Google Scholar via Serper
  - `search(query, num_results)`: Searches Google Scholar
  - `_process_response(response)`: Processes the Scholar API response
- `ArxivSearchHandler`: Handler for arXiv API
  - `search(query, num_results)`: Searches arXiv
  - `_process_response(response)`: Processes the arXiv API response
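
The handler hierarchy and URL-based deduplication listed above might look roughly like the sketch below. `InMemorySearchHandler` is a hypothetical handler that serves canned items so the example runs without network access, and the result fields (`title`, `url`, `source`) are assumed for illustration; the real handlers' response formats are not described in this document.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseSearchHandler(ABC):
    """Common interface every search handler implements."""

    @abstractmethod
    def search(self, query: str, num_results: int = 10) -> List[Dict[str, Any]]:
        ...

    @abstractmethod
    def _process_response(self, response: Dict[str, Any]) -> List[Dict[str, Any]]:
        ...


class InMemorySearchHandler(BaseSearchHandler):
    """Hypothetical handler that returns canned items instead of calling an API."""

    def __init__(self, source: str, items: List[Dict[str, str]]) -> None:
        self._source = source
        self._items = items

    def search(self, query: str, num_results: int = 10) -> List[Dict[str, Any]]:
        # A real handler would send `query` to its API at this point.
        response = {"items": self._items}
        return self._process_response(response)[:num_results]

    def _process_response(self, response: Dict[str, Any]) -> List[Dict[str, Any]]:
        # Normalize every item into one shared result shape.
        return [
            {"title": item["title"], "url": item["url"], "source": self._source}
            for item in response["items"]
        ]


def deduplicate_results(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first result seen for each URL, preserving order."""
    seen = set()
    unique = []
    for result in results:
        if result["url"] not in seen:
            seen.add(result["url"])
            unique.append(result)
    return unique
```

Keeping the first occurrence means the engine queried first wins when two engines return the same URL; a different tie-break (for example, preferring the higher-ranked source) would be an equally plausible choice.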
### Ranking Module
The `ranking` module provides functionality for reranking and prioritizing documents based on their relevance to the user's query.
### Files
- `__init__.py`: Package initialization file
- `jina_reranker.py`: Module for reranking documents using Jina AI
- `filter_manager.py`: Module for filtering documents
### Classes
- `JinaReranker`: Class for reranking documents
  - `rerank(documents, query)`: Reranks documents based on relevance to query
  - `_prepare_inputs(documents, query)`: Prepares inputs for the reranker
- `FilterManager`: Class for filtering documents
  - `filter_by_date(documents, start_date, end_date)`: Filters by date
  - `filter_by_source(documents, sources)`: Filters by source
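
The two `FilterManager` methods can be illustrated with a short sketch. It assumes each document is a dictionary carrying a `date` field (a `datetime.date`) and a `source` field; those field names, the inclusive date range, and the choice to drop undated documents are assumptions made for the example.

```python
from datetime import date
from typing import Any, Dict, Iterable, List


class FilterManager:
    """Illustrative filters over a list of document dictionaries."""

    def filter_by_date(
        self, documents: List[Dict[str, Any]], start_date: date, end_date: date
    ) -> List[Dict[str, Any]]:
        # Inclusive range; documents without a date are dropped (assumed policy).
        return [
            doc for doc in documents
            if doc.get("date") is not None and start_date <= doc["date"] <= end_date
        ]

    def filter_by_source(
        self, documents: List[Dict[str, Any]], sources: Iterable[str]
    ) -> List[Dict[str, Any]]:
        # A set makes each membership check constant time.
        allowed = set(sources)
        return [doc for doc in documents if doc.get("source") in allowed]
```

Both methods return new lists and leave the input untouched, so filters can be chained in any order.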
### Report Templates Module
The `report_templates` module provides a template system for generating reports with different detail levels and query types.
### Files
- `__init__.py`: Package initialization file
- `report_templates.py`: Module for managing report templates
### Classes
- `QueryType` (Enum): Defines the types of queries supported by the system
  - `FACTUAL`: For factual queries seeking specific information
  - `EXPLORATORY`: For exploratory queries investigating a topic
  - `COMPARATIVE`: For comparative queries comparing multiple items
- `DetailLevel` (Enum): Defines the levels of detail for generated reports
  - `BRIEF`: Short summary with key findings
  - `STANDARD`: Standard report with introduction, key findings, and analysis
  - `DETAILED`: Detailed report with methodology and more in-depth analysis
  - `COMPREHENSIVE`: Comprehensive report with executive summary, literature review, and appendices
- `ReportTemplate`: Class representing a report template
  - `template` (str): The template string with placeholders
  - `detail_level` (DetailLevel): The detail level of the template
  - `query_type` (QueryType): The query type the template is designed for
  - `model` (Optional[str]): The LLM model recommended for this template
  - `required_sections` (Optional[List[str]]): Required sections in the template
  - `validate()`: Validates that the template contains all required sections
- `ReportTemplateManager`: Class for managing report templates
  - `add_template(template)`: Adds a template to the manager
  - `get_template(query_type, detail_level)`: Gets a template for a specific query type and detail level
  - `get_available_templates()`: Gets a list of available templates
  - `initialize_default_templates()`: Initializes the default templates for all combinations of query types and detail levels
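
To make the relationships between these classes concrete, here is a minimal sketch of the template system. The enum string values, the `{placeholder}` convention used by `validate()`, and the single generic default template are assumptions for illustration; the actual templates in `report_templates.py` differ per query type and detail level.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class QueryType(Enum):
    FACTUAL = "factual"
    EXPLORATORY = "exploratory"
    COMPARATIVE = "comparative"


class DetailLevel(Enum):
    BRIEF = "brief"
    STANDARD = "standard"
    DETAILED = "detailed"
    COMPREHENSIVE = "comprehensive"


@dataclass
class ReportTemplate:
    template: str
    detail_level: DetailLevel
    query_type: QueryType
    model: Optional[str] = None
    required_sections: Optional[List[str]] = None

    def validate(self) -> bool:
        # Assumption: a section is present when "{section}" appears in the template.
        sections = self.required_sections or []
        return all("{" + name + "}" in self.template for name in sections)


class ReportTemplateManager:
    def __init__(self) -> None:
        self._templates: Dict[Tuple[QueryType, DetailLevel], ReportTemplate] = {}

    def add_template(self, template: ReportTemplate) -> None:
        self._templates[(template.query_type, template.detail_level)] = template

    def get_template(
        self, query_type: QueryType, detail_level: DetailLevel
    ) -> Optional[ReportTemplate]:
        return self._templates.get((query_type, detail_level))

    def get_available_templates(self) -> List[Tuple[QueryType, DetailLevel]]:
        return list(self._templates)

    def initialize_default_templates(self) -> None:
        # One entry per (query type, detail level) pair: 3 x 4 = 12 templates.
        for query_type in QueryType:
            for detail_level in DetailLevel:
                self.add_template(ReportTemplate(
                    template="# {title}\n\n## Key Findings\n{key_findings}\n",
                    detail_level=detail_level,
                    query_type=query_type,
                    required_sections=["title", "key_findings"],
                ))
```

Keying the store on the `(query_type, detail_level)` pair keeps lookup trivial and makes a missing combination easy to detect, since `get_template` simply returns `None`.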
### Progressive Report Synthesis Module
The `progressive_report_synthesis` module provides functionality to synthesize reports from document chunks using a progressive approach, where chunks are processed iteratively and the report is refined over time.
### Files
- `__init__.py`: Package initialization file
- `progressive_report_synthesis.py`: Module for progressive report generation
### Classes
- `ReportState`: Class to track the state of a progressive report
  - `current_report` (str): The current version of the report
  - `processed_chunks` (Set[str]): Set of document IDs that have been processed
  - `version` (int): Current version number of the report
  - `last_update_time` (float): Timestamp of the last update
  - `improvement_scores` (List[float]): List of improvement scores for each iteration
  - `is_complete` (bool): Whether the report generation is complete
  - `termination_reason` (Optional[str]): Reason for termination if complete
- `ProgressiveReportSynthesizer`: Class for progressive report synthesis
  - Extends `ReportSynthesizer` to implement a progressive approach
  - `set_progress_callback(callback)`: Sets a callback function to report progress
  - `prioritize_chunks(chunks, query)`: Prioritizes chunks based on relevance
  - `extract_information_from_chunk(chunk, query, detail_level)`: Extracts key information from a chunk
  - `refine_report(current_report, new_information, query, query_type, detail_level)`: Refines the report with new information
  - `initialize_report(initial_chunks, query, query_type, detail_level)`: Initializes the report with the first batch of chunks
  - `should_terminate(improvement_score)`: Determines if the process should terminate
  - `synthesize_report_progressively(chunks, query, query_type, detail_level)`: Main method for progressive report generation
  - `synthesize_report(chunks, query, query_type, detail_level)`: Override of parent method to use progressive approach for comprehensive detail level
- `get_progressive_report_synthesizer(model_name)`: Factory function to get a singleton instance
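
The interplay between `ReportState` and the termination check can be sketched as below. This is a toy model under stated assumptions: refinement is simulated by appending chunk text, the improvement score is the share of previously unseen words, and the thresholds `min_improvement` and `max_iterations` are hypothetical. The real module refines the report with LLM prompts, and its `should_terminate` is a method taking only `improvement_score`.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ReportState:
    current_report: str = ""
    processed_chunks: Set[str] = field(default_factory=set)
    version: int = 0
    last_update_time: float = field(default_factory=time.time)
    improvement_scores: List[float] = field(default_factory=list)
    is_complete: bool = False
    termination_reason: Optional[str] = None


def should_terminate(
    state: ReportState,
    total_chunks: int,
    improvement_score: float,
    min_improvement: float = 0.2,  # hypothetical threshold
    max_iterations: int = 10,      # hypothetical cap
) -> bool:
    # Three stop conditions: nothing left, iteration cap, diminishing returns.
    if len(state.processed_chunks) >= total_chunks:
        state.termination_reason = "all_chunks_processed"
    elif state.version >= max_iterations:
        state.termination_reason = "max_iterations"
    elif improvement_score < min_improvement:
        state.termination_reason = "diminishing_returns"
    state.is_complete = state.termination_reason is not None
    return state.is_complete


def synthesize_progressively(chunks: List[Dict[str, str]]) -> ReportState:
    """Process chunks (assumed already prioritized) until a stop condition fires."""
    state = ReportState()
    for chunk in chunks:
        chunk_words = set(chunk["content"].split())
        new_words = chunk_words - set(state.current_report.split())
        # Toy improvement score: fraction of the chunk's words that are new.
        score = len(new_words) / max(len(chunk_words), 1)
        if new_words:
            state.current_report = f"{state.current_report} {chunk['content']}".strip()
        state.processed_chunks.add(chunk["id"])
        state.version += 1
        state.improvement_scores.append(score)
        state.last_update_time = time.time()
        if should_terminate(state, len(chunks), score):
            break
    return state
```

Recording `termination_reason` on the state, rather than only returning a boolean, lets a caller distinguish a report that consumed every chunk from one that stopped early.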
## Recent Updates
### 2025-03-12: Progressive Report Generation Implementation
1. **Progressive Report Synthesis Module**:
   - Created a new module `progressive_report_synthesis.py` for progressive report generation
   - Implemented `ReportState` class to track the state of a progressive report
   - Created `ProgressiveReportSynthesizer` class extending from `ReportSynthesizer`
   - Implemented chunk prioritization algorithm based on relevance scores
   - Developed iterative refinement process with specialized prompts
   - Added state management to track report versions and processed chunks
   - Implemented termination conditions (all chunks processed, diminishing returns, max iterations)
   - Added support for different models with adaptive batch sizing
   - Implemented progress tracking and callback mechanism
2. **Report Generator Integration**:
   - Modified `report_generator.py` to use the progressive report synthesizer for comprehensive detail level
   - Created a hybrid system that uses standard map-reduce for brief/standard/detailed levels
   - Added proper model selection and configuration for both synthesizers
3. **Testing**:
   - Created `test_progressive_report.py` to test progressive report generation
   - Implemented comparison functionality between progressive and standard approaches
   - Added test cases for different query types and document collections
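
The hybrid selection described in the Report Generator Integration item can be pictured with the sketch below. The `select_synthesizer` helper and the `strategy` attribute are hypothetical names introduced for this example, and both synthesizer classes are reduced to empty shells; how `report_generator.py` actually wires the choice is not shown in this document.

```python
from enum import Enum
from typing import Optional


class DetailLevel(Enum):
    BRIEF = "brief"
    STANDARD = "standard"
    DETAILED = "detailed"
    COMPREHENSIVE = "comprehensive"


class ReportSynthesizer:
    """Shell for the standard map-reduce synthesizer."""

    strategy = "map_reduce"

    def __init__(self, model_name: Optional[str] = None) -> None:
        self.model_name = model_name


class ProgressiveReportSynthesizer(ReportSynthesizer):
    """Shell for the progressive synthesizer."""

    strategy = "progressive"


def select_synthesizer(
    detail_level: DetailLevel, model_name: Optional[str] = None
) -> ReportSynthesizer:
    # Only the comprehensive level takes the progressive path;
    # brief, standard, and detailed stay on map-reduce.
    if detail_level is DetailLevel.COMPREHENSIVE:
        return ProgressiveReportSynthesizer(model_name)
    return ReportSynthesizer(model_name)
```

Because the progressive class extends the standard one, callers can treat the returned object uniformly regardless of which path was chosen.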
### 2025-03-11: Report Templates Implementation
1. **Report Templates Module**:
   - Created a new module `report_templates.py` for managing report templates
   - Implemented enums for query types (FACTUAL, EXPLORATORY, COMPARATIVE) and detail levels (BRIEF, STANDARD, DETAILED, COMPREHENSIVE)
   - Created a template system with placeholders for different report sections
   - Implemented 12 different templates (3 query types × 4 detail levels)
   - Added validation to ensure templates contain all required sections
2. **Report Synthesis Integration**:
   - Updated the report synthesis module to use the new template system
   - Added support for different templates based on query type and detail level
   - Implemented fallback to standard templates when specific templates are not found
   - Added better logging for template retrieval process
3. **Testing**:
   - Created `test_report_templates.py` to test template retrieval and validation
   - Implemented `test_brief_report.py` to test the brief report generation
   - Successfully tested all combinations of detail levels and query types
### 2025-02-28: Async Implementation and Reference Formatting
1. **LLM Interface Updates**:
   - Converted key methods to async:
     - `generate_completion`
     - `classify_query`
     - `enhance_query`
     - `generate_search_queries`
   - Added special handling for Gemini models
   - Improved reference formatting instructions
2. **Query Processor Updates**:
   - Updated `process_query` to be async
   - Made `generate_search_queries` async
   - Fixed async/await patterns throughout
3. **Gradio Interface Updates**:
   - Modified `generate_report` to handle async operations
   - Updated report button click handler
   - Improved error handling
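
One common way for a synchronous UI callback to drive async code is to wrap the coroutine in `asyncio.run`, as sketched below. The function bodies are placeholders and `generate_report_async` is a hypothetical name; the sketch only illustrates the pattern and is not the code in `gradio_interface.py`.

```python
import asyncio
from typing import Dict


async def process_query(query: str) -> Dict[str, str]:
    # Placeholder for the awaited LLM calls made during query processing.
    await asyncio.sleep(0)
    return {"enhanced_query": query.strip()}


async def generate_report_async(query: str) -> str:
    processed = await process_query(query)
    return f"# Report\n\nQuery: {processed['enhanced_query']}"


def generate_report(query: str) -> str:
    """Synchronous entry point, e.g. for a button click handler."""
    try:
        # asyncio.run creates an event loop, runs the coroutine, and closes the loop.
        return asyncio.run(generate_report_async(query))
    except Exception as exc:
        # Surface failures to the UI as text instead of raising.
        return f"Error generating report: {exc}"
```

`asyncio.run` raises `RuntimeError` when called from a thread that already has a running event loop, so this wrapper only suits callbacks invoked from plain synchronous code. If the UI framework accepts coroutine functions as handlers, the wrapper is unnecessary.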