# Code Structure

## Current Project Organization

```
project/
│
├── examples/  # Sample data and query examples
├── report/  # Report generation module
│   ├── __init__.py
│   ├── report_generator.py  # Module for generating reports
│   ├── report_synthesis.py  # Module for synthesizing reports
│   ├── progressive_report_synthesis.py  # Module for progressive report generation
│   ├── document_processor.py  # Module for processing documents
│   ├── document_scraper.py  # Module for scraping documents
│   ├── report_detail_levels.py  # Module for managing report detail levels
│   ├── report_templates.py  # Module for managing report templates
│   └── database/  # Database for storing reports
│       ├── __init__.py
│       └── db_manager.py  # Module for managing the database
├── tests/  # Test suite
│   ├── __init__.py
│   ├── execution/  # Search execution tests
│   │   ├── __init__.py
│   │   ├── test_search.py
│   │   ├── test_search_execution.py
│   │   └── test_all_handlers.py
│   ├── integration/  # Integration tests
│   │   ├── __init__.py
│   │   ├── test_ev_query.py
│   │   └── test_query_to_report.py
│   ├── query/  # Query processing tests
│   │   ├── __init__.py
│   │   ├── test_query_processor.py
│   │   ├── test_query_processor_comprehensive.py
│   │   └── test_llm_interface.py
│   ├── ranking/  # Ranking algorithm tests
│   │   ├── __init__.py
│   │   ├── test_reranker.py
│   │   ├── test_similarity.py
│   │   └── test_simple_reranker.py
│   ├── report/  # Report generation tests
│   │   ├── __init__.py
│   │   ├── test_custom_model.py
│   │   ├── test_detail_levels.py
│   │   ├── test_brief_report.py
│   │   └── test_report_templates.py
│   ├── ui/  # UI component tests
│   │   ├── __init__.py
│   │   └── test_ui_search.py
│   ├── test_document_processor.py
│   ├── test_document_scraper.py
│   └── test_report_synthesis.py
├── utils/  # Utility scripts and shared functions
│   ├── __init__.py
│   ├── jina_similarity.py  # Module for computing text similarity
│   └── markdown_segmenter.py  # Module for segmenting markdown documents
├── config/  # Configuration management
│   ├── __init__.py
│   ├── config.py  # Configuration management class
│   └── config.yaml  # YAML configuration file with settings for different components
├── query/  # Query processing module
│   ├── __init__.py
│   ├── query_processor.py  # Module for processing user queries
│   └── llm_interface.py  # Module for interacting with LLM providers
├── execution/  # Search execution module
│   ├── __init__.py
│   ├── search_executor.py  # Module for executing search queries
│   ├── result_collector.py  # Module for collecting search results
│   └── api_handlers/  # Handlers for different search APIs
│       ├── __init__.py
│       ├── base_handler.py  # Base class for search handlers
│       ├── serper_handler.py  # Handler for Serper API (Google search)
│       ├── scholar_handler.py  # Handler for Google Scholar via Serper
│       ├── google_handler.py  # Handler for Google search
│       └── arxiv_handler.py  # Handler for arXiv API
├── ranking/  # Ranking module
│   ├── __init__.py
│   └── jina_reranker.py  # Module for reranking documents using Jina AI
├── ui/  # UI module
│   ├── __init__.py
│   └── gradio_interface.py  # Gradio-based web interface
├── scripts/  # Scripts
│   └── query_to_report.py  # Script for generating reports from queries
├── run_ui.py  # Script to run the UI
└── requirements.txt  # Project dependencies
```

## Module Details

### Config Module

The `config` module manages configuration settings for the entire system, including API keys, model selections, and other parameters.

#### Files

- `__init__.py`: Package initialization file
- `config.py`: Configuration management class
- `config.yaml`: YAML configuration file with settings for different components

#### Classes

- `Config`: Singleton class for loading and accessing configuration settings
  - `load_config(config_path)`: Loads configuration from a YAML file
  - `get(key, default=None)`: Gets a configuration value by key

### Query Module

The `query` module handles the processing and enhancement of user queries, including classification and optimization for search.

#### Files

- `__init__.py`: Package initialization file
- `query_processor.py`: Main module for processing user queries
- `query_classifier.py`: Module for classifying query types
- `llm_interface.py`: Interface for interacting with LLM providers

#### Classes

- `QueryProcessor`: Main class for processing user queries
  - `process_query(query)`: Processes a user query and returns enhanced results
  - `classify_query(query)`: Classifies a query by type and intent
  - `generate_search_queries(query, classification)`: Generates optimized search queries
- `QueryClassifier`: Class for classifying queries
  - `classify(query)`: Classifies a query by type, intent, and entities
- `LLMInterface`: Interface for interacting with LLM providers
  - `get_completion(prompt, model=None)`: Gets a completion from an LLM
  - `enhance_query(query)`: Enhances a query with additional context
  - `classify_query(query)`: Uses an LLM to classify a query

### Execution Module

The `execution` module handles the execution of search queries across multiple search engines and the collection of results.

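
To make "across multiple search engines" concrete, the following is a minimal sketch of fanning one query out to several handlers concurrently and then merging the results with URL-based deduplication. It is not the project's implementation: the two `fake_*` handlers, the `Handler` type alias, the result dictionary shape, and the simplified function signatures are all assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Result = Dict[str, str]
# Hypothetical handler type: takes (query, num_results), returns a list of results.
Handler = Callable[[str, int], Awaitable[List[Result]]]


async def fake_serper(query: str, num_results: int) -> List[Result]:
    """Stand-in for a real API handler; returns canned results, no network."""
    results = [
        {"title": "A", "url": "https://example.com/a", "source": "serper"},
        {"title": "B", "url": "https://example.com/b", "source": "serper"},
    ]
    return results[:num_results]


async def fake_arxiv(query: str, num_results: int) -> List[Result]:
    """Second stand-in handler; one URL overlaps with fake_serper."""
    results = [
        {"title": "B (duplicate)", "url": "https://example.com/b", "source": "arxiv"},
        {"title": "C", "url": "https://example.com/c", "source": "arxiv"},
    ]
    return results[:num_results]


async def execute_search(
    query: str, handlers: Dict[str, Handler], num_results: int = 10
) -> Dict[str, List[Result]]:
    """Run every handler concurrently and return results keyed by engine name."""
    names = list(handlers)
    batches = await asyncio.gather(
        *(handlers[name](query, num_results) for name in names)
    )
    return dict(zip(names, batches))


def deduplicate_results(results: List[Result]) -> List[Result]:
    """Keep the first result seen for each URL, preserving order."""
    seen = set()
    unique = []
    for result in results:
        if result["url"] not in seen:
            seen.add(result["url"])
            unique.append(result)
    return unique


per_engine = asyncio.run(
    execute_search("ev batteries", {"serper": fake_serper, "arxiv": fake_arxiv})
)
merged = [result for batch in per_engine.values() for result in batch]
unique = deduplicate_results(merged)
print([result["url"] for result in unique])
```

Keeping the first occurrence of each URL means the engine listed earliest wins when two engines return the same page; a real collector might instead merge metadata from both.
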
#### Files

- `__init__.py`: Package initialization file
- `search_executor.py`: Module for executing search queries
- `result_collector.py`: Module for collecting and processing search results
- `api_handlers/`: Directory containing handlers for different search APIs
  - `__init__.py`: Package initialization file
  - `base_handler.py`: Base class for search handlers
  - `serper_handler.py`: Handler for Serper API (Google search)
  - `scholar_handler.py`: Handler for Google Scholar via Serper
  - `google_handler.py`: Handler for Google search
  - `arxiv_handler.py`: Handler for arXiv API

#### Classes

- `SearchExecutor`: Class for executing search queries
  - `execute_search(query_data)`: Executes a search across multiple engines
  - `_execute_search_async(query, engines)`: Executes a search asynchronously
  - `_execute_search_sync(query, engines)`: Executes a search synchronously
- `ResultCollector`: Class for collecting and processing search results
  - `process_results(search_results)`: Processes search results from multiple engines
  - `deduplicate_results(results)`: Deduplicates results based on URL
  - `save_results(results, file_path)`: Saves results to a file
- `BaseSearchHandler`: Base class for search handlers
  - `search(query, num_results)`: Abstract method for searching
  - `_process_response(response)`: Processes the API response
- `SerperSearchHandler`: Handler for Serper API
  - `search(query, num_results)`: Searches using Serper API
  - `_process_response(response)`: Processes the Serper API response
- `ScholarSearchHandler`: Handler for Google Scholar via Serper
  - `search(query, num_results)`: Searches Google Scholar
  - `_process_response(response)`: Processes the Scholar API response
- `ArxivSearchHandler`: Handler for arXiv API
  - `search(query, num_results)`: Searches arXiv
  - `_process_response(response)`: Processes the arXiv API response

### Ranking Module

The `ranking` module provides functionality for reranking and prioritizing documents based on their relevance to the user's query.

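
As an illustration of what reranking by relevance means here, the sketch below scores each document against the query and returns the top results in descending score order. It deliberately swaps in a toy token-overlap score in place of the Jina AI model, so the scoring function, the `top_n` parameter, and the returned dictionary shape are assumptions for illustration only.

```python
from typing import Dict, List


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens)


def rerank(documents: List[str], query: str, top_n: int = 3) -> List[Dict]:
    """Score every document against the query and return the top_n, best first."""
    scored = [
        {"index": i, "score": overlap_score(query, doc), "document": doc}
        for i, doc in enumerate(documents)
    ]
    # sort() is stable, so documents with equal scores keep their input order.
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored[:top_n]


docs = [
    "Solid state batteries promise higher energy density",
    "A history of the bicycle",
    "Electric vehicle batteries and charging networks",
]
ranked = rerank(docs, "electric vehicle batteries", top_n=2)
for item in ranked:
    print(item["index"], round(item["score"], 2), item["document"])
```

Returning the original index alongside the score lets a caller map reranked entries back to their source metadata (URL, title) without carrying it through the scorer.
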
#### Files

- `__init__.py`: Package initialization file
- `jina_reranker.py`: Module for reranking documents using Jina AI
- `filter_manager.py`: Module for filtering documents

#### Classes

- `JinaReranker`: Class for reranking documents
  - `rerank(documents, query)`: Reranks documents based on relevance to query
  - `_prepare_inputs(documents, query)`: Prepares inputs for the reranker
- `FilterManager`: Class for filtering documents
  - `filter_by_date(documents, start_date, end_date)`: Filters by date
  - `filter_by_source(documents, sources)`: Filters by source

### Report Templates Module

The `report_templates` module provides a template system for generating reports with different detail levels and query types.

#### Files

- `__init__.py`: Package initialization file
- `report_templates.py`: Module for managing report templates

#### Classes

- `QueryType` (Enum): Defines the types of queries supported by the system
  - `FACTUAL`: For factual queries seeking specific information
  - `EXPLORATORY`: For exploratory queries investigating a topic
  - `COMPARATIVE`: For comparative queries comparing multiple items
- `DetailLevel` (Enum): Defines the levels of detail for generated reports
  - `BRIEF`: Short summary with key findings
  - `STANDARD`: Standard report with introduction, key findings, and analysis
  - `DETAILED`: Detailed report with methodology and more in-depth analysis
  - `COMPREHENSIVE`: Comprehensive report with executive summary, literature review, and appendices
- `ReportTemplate`: Class representing a report template
  - `template` (str): The template string with placeholders
  - `detail_level` (DetailLevel): The detail level of the template
  - `query_type` (QueryType): The query type the template is designed for
  - `model` (Optional[str]): The LLM model recommended for this template
  - `required_sections` (Optional[List[str]]): Required sections in the template
  - `validate()`: Validates that the template contains all required sections
- `ReportTemplateManager`: Class for managing report templates
  - `add_template(template)`: Adds a template to the manager
  - `get_template(query_type, detail_level)`: Gets a template for a specific query type and detail level
  - `get_available_templates()`: Gets a list of available templates
  - `initialize_default_templates()`: Initializes the default templates for all combinations of query types and detail levels

### Progressive Report Synthesis Module

The `progressive_report_synthesis` module provides functionality to synthesize reports from document chunks using a progressive approach, where chunks are processed iteratively and the report is refined over time.

#### Files

- `__init__.py`: Package initialization file
- `progressive_report_synthesis.py`: Module for progressive report generation

#### Classes

- `ReportState`: Class to track the state of a progressive report
  - `current_report` (str): The current version of the report
  - `processed_chunks` (Set[str]): Set of document IDs that have been processed
  - `version` (int): Current version number of the report
  - `last_update_time` (float): Timestamp of the last update
  - `improvement_scores` (List[float]): List of improvement scores for each iteration
  - `is_complete` (bool): Whether the report generation is complete
  - `termination_reason` (Optional[str]): Reason for termination if complete
- `ProgressiveReportSynthesizer`: Class for progressive report synthesis
  - Extends `ReportSynthesizer` to implement a progressive approach
  - `set_progress_callback(callback)`: Sets a callback function to report progress
  - `prioritize_chunks(chunks, query)`: Prioritizes chunks based on relevance
  - `extract_information_from_chunk(chunk, query, detail_level)`: Extracts key information from a chunk
  - `refine_report(current_report, new_information, query, query_type, detail_level)`: Refines the report with new information
  - `initialize_report(initial_chunks, query, query_type, detail_level)`: Initializes the report with the first batch of chunks
  - `should_terminate(improvement_score)`: Determines if the process should terminate
  - `synthesize_report_progressively(chunks, query, query_type, detail_level)`: Main method for progressive report generation
  - `synthesize_report(chunks, query, query_type, detail_level)`: Override of parent method to use progressive approach for comprehensive detail level
- `get_progressive_report_synthesizer(model_name)`: Factory function to get a singleton instance

## Recent Updates

### 2025-03-12: Progressive Report Generation Implementation

1. **Progressive Report Synthesis Module**:
   - Created a new module `progressive_report_synthesis.py` for progressive report generation
   - Implemented `ReportState` class to track the state of a progressive report
   - Created `ProgressiveReportSynthesizer` class extending from `ReportSynthesizer`
   - Implemented chunk prioritization algorithm based on relevance scores
   - Developed iterative refinement process with specialized prompts
   - Added state management to track report versions and processed chunks
   - Implemented termination conditions (all chunks processed, diminishing returns, max iterations)
   - Added support for different models with adaptive batch sizing
   - Implemented progress tracking and callback mechanism

2. **Report Generator Integration**:
   - Modified `report_generator.py` to use the progressive report synthesizer for comprehensive detail level
   - Created a hybrid system that uses standard map-reduce for brief/standard/detailed levels
   - Added proper model selection and configuration for both synthesizers

3. **Testing**:
   - Created `test_progressive_report.py` to test progressive report generation
   - Implemented comparison functionality between progressive and standard approaches
   - Added test cases for different query types and document collections

### 2025-03-11: Report Templates Implementation

1. **Report Templates Module**:
   - Created a new module `report_templates.py` for managing report templates
   - Implemented enums for query types (FACTUAL, EXPLORATORY, COMPARATIVE) and detail levels (BRIEF, STANDARD, DETAILED, COMPREHENSIVE)
   - Created a template system with placeholders for different report sections
   - Implemented 12 different templates (3 query types × 4 detail levels)
   - Added validation to ensure templates contain all required sections

2. **Report Synthesis Integration**:
   - Updated the report synthesis module to use the new template system
   - Added support for different templates based on query type and detail level
   - Implemented fallback to standard templates when specific templates are not found
   - Added better logging for template retrieval process

3. **Testing**:
   - Created `test_report_templates.py` to test template retrieval and validation
   - Implemented `test_brief_report.py` to test the brief report generation
   - Successfully tested all combinations of detail levels and query types

### 2025-02-28: Async Implementation and Reference Formatting

1. **LLM Interface Updates**:
   - Converted key methods to async:
     - `generate_completion`
     - `classify_query`
     - `enhance_query`
     - `generate_search_queries`
   - Added special handling for Gemini models
   - Improved reference formatting instructions

2. **Query Processor Updates**:
   - Updated `process_query` to be async
   - Made `generate_search_queries` async
   - Fixed async/await patterns throughout

3. **Gradio Interface Updates**:
   - Modified `generate_report` to handle async operations
   - Updated report button click handler
   - Improved error handling
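
The async conversion listed in the 2025-02-28 update can be sketched roughly as follows. This is a minimal illustration rather than the project's code: `StubLLMInterface`, `QueryProcessorSketch`, and `process_query_sync` are hypothetical names, the stub returns canned values instead of calling an LLM, and running enhancement and classification concurrently is an assumption about how the awaits might be arranged.

```python
import asyncio
from typing import Dict, List


class StubLLMInterface:
    """Hypothetical stand-in for the LLM interface; makes no network calls."""

    async def enhance_query(self, query: str) -> str:
        await asyncio.sleep(0)  # yield control, as a real API call would
        return f"{query} (enhanced)"

    async def classify_query(self, query: str) -> Dict[str, str]:
        await asyncio.sleep(0)
        return {"type": "exploratory"}

    async def generate_search_queries(self, query: str) -> List[str]:
        await asyncio.sleep(0)
        return [query, f"{query} review"]


class QueryProcessorSketch:
    """Illustrative async query processor built on the stub above."""

    def __init__(self, llm: StubLLMInterface) -> None:
        self.llm = llm

    async def process_query(self, query: str) -> Dict:
        # Enhancement and classification do not depend on each other,
        # so this sketch awaits them concurrently.
        enhanced, classification = await asyncio.gather(
            self.llm.enhance_query(query),
            self.llm.classify_query(query),
        )
        # Search-query generation needs the enhanced query, so it runs after.
        search_queries = await self.llm.generate_search_queries(enhanced)
        return {
            "original_query": query,
            "enhanced_query": enhanced,
            "classification": classification,
            "search_queries": search_queries,
        }


def process_query_sync(query: str) -> Dict:
    """Synchronous entry point for callers that are not already in an event loop."""
    processor = QueryProcessorSketch(StubLLMInterface())
    return asyncio.run(processor.process_query(query))


result = process_query_sync("solid state batteries")
print(result["search_queries"])
```

A caller that already runs inside an event loop should `await processor.process_query(...)` directly, since `asyncio.run` raises a `RuntimeError` when invoked from a running loop.
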