# Code Structure

## Current Project Organization

```
project/
│
├── examples/  # Sample data and query examples
├── report/  # Report generation module
│   ├── __init__.py
│   ├── report_generator.py  # Module for generating reports
│   ├── report_synthesis.py  # Module for synthesizing reports
│   ├── progressive_report_synthesis.py  # Module for progressive report generation
│   ├── document_processor.py  # Module for processing documents
│   ├── document_scraper.py  # Module for scraping documents
│   ├── report_detail_levels.py  # Module for managing report detail levels
│   ├── report_templates.py  # Module for managing report templates
│   └── database/  # Database for storing reports
│       ├── __init__.py
│       └── db_manager.py  # Module for managing the database
├── tests/  # Test suite
│   ├── __init__.py
│   ├── execution/  # Search execution tests
│   │   ├── __init__.py
│   │   ├── test_search.py
│   │   ├── test_search_execution.py
│   │   └── test_all_handlers.py
│   ├── integration/  # Integration tests
│   │   ├── __init__.py
│   │   ├── test_ev_query.py
│   │   └── test_query_to_report.py
│   ├── query/  # Query processing tests
│   │   ├── __init__.py
│   │   ├── test_query_processor.py
│   │   ├── test_query_processor_comprehensive.py
│   │   └── test_llm_interface.py
│   ├── ranking/  # Ranking algorithm tests
│   │   ├── __init__.py
│   │   ├── test_reranker.py
│   │   ├── test_similarity.py
│   │   └── test_simple_reranker.py
│   ├── report/  # Report generation tests
│   │   ├── __init__.py
│   │   ├── test_custom_model.py
│   │   ├── test_detail_levels.py
│   │   ├── test_brief_report.py
│   │   └── test_report_templates.py
│   ├── ui/  # UI component tests
│   │   ├── __init__.py
│   │   └── test_ui_search.py
│   ├── test_document_processor.py
│   ├── test_document_scraper.py
│   └── test_report_synthesis.py
├── utils/  # Utility scripts and shared functions
│   ├── __init__.py
│   ├── jina_similarity.py  # Module for computing text similarity
│   └── markdown_segmenter.py  # Module for segmenting markdown documents
├── config/  # Configuration management
│   ├── __init__.py
│   ├── config.py  # Configuration management class
│   └── config.yaml  # YAML configuration file with settings for different components
├── query/  # Query processing module
│   ├── __init__.py
│   ├── query_processor.py  # Module for processing user queries
│   └── llm_interface.py  # Module for interacting with LLM providers
├── execution/  # Search execution module
│   ├── __init__.py
│   ├── search_executor.py  # Module for executing search queries
│   ├── result_collector.py  # Module for collecting search results
│   └── api_handlers/  # Handlers for different search APIs
│       ├── __init__.py
│       ├── base_handler.py  # Base class for search handlers
│       ├── serper_handler.py  # Handler for Serper API (Google search)
│       ├── scholar_handler.py  # Handler for Google Scholar via Serper
│       ├── google_handler.py  # Handler for Google search
│       └── arxiv_handler.py  # Handler for arXiv API
├── ranking/  # Ranking module
│   ├── __init__.py
│   └── jina_reranker.py  # Module for reranking documents using Jina AI
├── ui/  # UI module
│   ├── __init__.py
│   └── gradio_interface.py  # Gradio-based web interface
├── scripts/  # Scripts
│   └── query_to_report.py  # Script for generating reports from queries
├── sim-search-api/  # FastAPI backend
│   ├── app/
│   │   ├── api/
│   │   │   ├── routes/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── auth.py  # Authentication routes
│   │   │   │   ├── query.py  # Query processing routes
│   │   │   │   ├── search.py  # Search execution routes
│   │   │   │   └── report.py  # Report generation routes
│   │   │   ├── __init__.py
│   │   │   └── dependencies.py  # API dependencies (auth, rate limiting)
│   │   ├── core/
│   │   │   ├── __init__.py
│   │   │   ├── config.py  # API configuration
│   │   │   └── security.py  # Security utilities
│   │   ├── db/
│   │   │   ├── __init__.py
│   │   │   ├── session.py  # Database session
│   │   │   └── models.py  # Database models for reports, searches
│   │   ├── schemas/
│   │   │   ├── __init__.py
│   │   │   ├── token.py  # Token schemas
│   │   │   ├── user.py  # User schemas
│   │   │   ├── query.py  # Query schemas
│   │   │   ├── search.py  # Search result schemas
│   │   │   └── report.py  # Report schemas
│   │   ├── services/
│   │   │   ├── __init__.py
│   │   │   ├── query_service.py  # Query processing service
│   │   │   ├── search_service.py  # Search execution service
│   │   │   └── report_service.py  # Report generation service
│   │   └── main.py  # FastAPI application
│   ├── alembic/  # Database migrations
│   │   ├── versions/
│   │   │   └── 001_initial_migration.py  # Initial migration
│   │   ├── env.py  # Alembic environment
│   │   └── script.py.mako  # Alembic script template
│   ├── .env.example  # Environment variables template
│   ├── alembic.ini  # Alembic configuration
│   ├── requirements.txt  # API dependencies
│   ├── run.py  # Script to run the API
│   └── README.md  # API documentation
├── run_ui.py  # Script to run the UI
└── requirements.txt  # Project dependencies
```

## Module Details

### Config Module

The `config` module manages configuration settings for the entire system, including API keys, model selections, and other parameters.

#### Files

- `__init__.py`: Package initialization file
- `config.py`: Configuration management class
- `config.yaml`: YAML configuration file with settings for different components

#### Classes

- `Config`: Singleton class for loading and accessing configuration settings
  - `load_config(config_path)`: Loads configuration from a YAML file
  - `get(key, default=None)`: Gets a configuration value by key

### Query Module

The `query` module handles the processing and enhancement of user queries, including classification and optimization for search.
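
To make the module's role concrete, here is a minimal sketch of a classify-then-expand flow. It is not the project's implementation: the keyword heuristic stands in for the LLM-backed classification, and `ProcessedQuery`, `classify_by_keywords`, `expand_query`, and `process` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessedQuery:
    """Hypothetical container for the outcome of query processing."""
    original: str
    query_type: str
    search_queries: List[str] = field(default_factory=list)


def classify_by_keywords(query: str) -> str:
    """Toy heuristic; the real module is described as using an LLM for this."""
    lowered = query.lower()
    if any(m in lowered for m in (" vs ", " versus ", "compare", "difference between")):
        return "comparative"
    if lowered.startswith(("what is", "who ", "when ", "where ", "how many")):
        return "factual"
    return "exploratory"


def expand_query(query: str, query_type: str) -> List[str]:
    """Derive search-oriented variants of the query from its type."""
    queries = [query]
    if query_type == "comparative":
        queries.append(f"{query} comparison")
    elif query_type == "exploratory":
        queries.append(f"{query} overview")
    return queries


def process(query: str) -> ProcessedQuery:
    """Normalize whitespace, classify, then generate search queries."""
    cleaned = " ".join(query.split())
    query_type = classify_by_keywords(cleaned)
    return ProcessedQuery(cleaned, query_type, expand_query(cleaned, query_type))
```

Keeping classification and expansion as separate steps mirrors the split between classifying a query and generating search queries from that classification.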
#### Files

- `__init__.py`: Package initialization file
- `query_processor.py`: Main module for processing user queries
- `query_classifier.py`: Module for classifying query types
- `llm_interface.py`: Interface for interacting with LLM providers

#### Classes

- `QueryProcessor`: Main class for processing user queries
  - `process_query(query)`: Processes a user query and returns enhanced results
  - `classify_query(query)`: Classifies a query by type and intent
  - `generate_search_queries(query, classification)`: Generates optimized search queries
- `QueryClassifier`: Class for classifying queries
  - `classify(query)`: Classifies a query by type, intent, and entities
- `LLMInterface`: Interface for interacting with LLM providers
  - `get_completion(prompt, model=None)`: Gets a completion from an LLM
  - `enhance_query(query)`: Enhances a query with additional context
  - `classify_query(query)`: Uses an LLM to classify a query

### Execution Module

The `execution` module handles the execution of search queries across multiple search engines and the collection of results.
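
As an illustration of the fan-out this module performs, the sketch below queries several engines concurrently and merges the results, dropping repeated URLs. It makes assumptions throughout: `fake_web_search`, `fake_paper_search`, `run_engines`, and `merge_unique` are hypothetical names, and the fake handlers return canned data instead of calling any real search API.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Result = Dict[str, str]
Handler = Callable[[str, int], Awaitable[List[Result]]]


async def fake_web_search(query: str, num_results: int) -> List[Result]:
    """Hypothetical handler; a real one would call a search API here."""
    await asyncio.sleep(0)
    return [
        {"url": f"https://example.com/{i}", "title": f"{query} result {i}"}
        for i in range(num_results)
    ]


async def fake_paper_search(query: str, num_results: int) -> List[Result]:
    """Hypothetical handler whose first hit duplicates a web result."""
    await asyncio.sleep(0)
    hits = [
        {"url": "https://example.com/0", "title": f"{query} (duplicate hit)"},
        {"url": "https://example.org/paper", "title": f"{query} paper"},
    ]
    return hits[:num_results]


async def run_engines(
    query: str, handlers: Dict[str, Handler], num_results: int = 3
) -> Dict[str, List[Result]]:
    """Run every handler concurrently; a failing engine yields an empty list."""
    names = list(handlers)
    batches = await asyncio.gather(
        *(handlers[name](query, num_results) for name in names),
        return_exceptions=True,
    )
    return {
        name: [] if isinstance(batch, BaseException) else batch
        for name, batch in zip(names, batches)
    }


def merge_unique(results_by_engine: Dict[str, List[Result]]) -> List[Result]:
    """Flatten per-engine results, keeping the first occurrence of each URL."""
    seen = set()
    unique = []
    for batch in results_by_engine.values():
        for item in batch:
            if item["url"] not in seen:
                seen.add(item["url"])
                unique.append(item)
    return unique
```

Passing `return_exceptions=True` to `asyncio.gather` keeps one failing engine from discarding the results of the others, which is one reasonable policy for a multi-engine search; the project may handle failures differently.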
#### Files

- `__init__.py`: Package initialization file
- `search_executor.py`: Module for executing search queries
- `result_collector.py`: Module for collecting and processing search results
- `api_handlers/`: Directory containing handlers for different search APIs
  - `__init__.py`: Package initialization file
  - `base_handler.py`: Base class for search handlers
  - `serper_handler.py`: Handler for Serper API (Google search)
  - `scholar_handler.py`: Handler for Google Scholar via Serper
  - `arxiv_handler.py`: Handler for arXiv API

#### Classes

- `SearchExecutor`: Class for executing search queries
  - `execute_search(query_data)`: Executes a search across multiple engines
  - `_execute_search_async(query, engines)`: Executes a search asynchronously
  - `_execute_search_sync(query, engines)`: Executes a search synchronously
- `ResultCollector`: Class for collecting and processing search results
  - `process_results(search_results)`: Processes search results from multiple engines
  - `deduplicate_results(results)`: Deduplicates results based on URL
  - `save_results(results, file_path)`: Saves results to a file
- `BaseSearchHandler`: Base class for search handlers
  - `search(query, num_results)`: Abstract method for searching
  - `_process_response(response)`: Processes the API response
- `SerperSearchHandler`: Handler for Serper API
  - `search(query, num_results)`: Searches using Serper API
  - `_process_response(response)`: Processes the Serper API response
- `ScholarSearchHandler`: Handler for Google Scholar via Serper
  - `search(query, num_results)`: Searches Google Scholar
  - `_process_response(response)`: Processes the Scholar API response
- `ArxivSearchHandler`: Handler for arXiv API
  - `search(query, num_results)`: Searches arXiv
  - `_process_response(response)`: Processes the arXiv API response

### Ranking Module

The `ranking` module provides functionality for reranking and prioritizing documents based on their relevance to the user's query.
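
The general shape of a reranking step can be sketched without any external service. In the example below, a simple word-overlap (Jaccard) score is swapped in for the Jina AI relevance model purely for illustration; `overlap_score` and `rerank_by_overlap` are hypothetical names, and the `text` and `score` keys are assumed document fields.

```python
from typing import Dict, List, Optional


def overlap_score(query: str, text: str) -> float:
    """Jaccard overlap between word sets; a stand-in for a learned relevance score."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words or not text_words:
        return 0.0
    return len(query_words & text_words) / len(query_words | text_words)


def rerank_by_overlap(
    documents: List[Dict[str, str]], query: str, top_n: Optional[int] = None
) -> List[Dict[str, object]]:
    """Attach a score to each document and return them best-first."""
    scored = [{**doc, "score": overlap_score(query, doc["text"])} for doc in documents]
    # sorted() is stable, so documents with equal scores keep their input order.
    ranked = sorted(scored, key=lambda doc: doc["score"], reverse=True)
    return ranked if top_n is None else ranked[:top_n]
```

Replacing `overlap_score` with a call to a hosted reranker would leave the surrounding sort-and-truncate logic unchanged, which is why the scoring function is kept separate here.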
#### Files

- `__init__.py`: Package initialization file
- `jina_reranker.py`: Module for reranking documents using Jina AI
- `filter_manager.py`: Module for filtering documents

#### Classes

- `JinaReranker`: Class for reranking documents
  - `rerank(documents, query)`: Reranks documents based on relevance to query
  - `_prepare_inputs(documents, query)`: Prepares inputs for the reranker
- `FilterManager`: Class for filtering documents
  - `filter_by_date(documents, start_date, end_date)`: Filters by date
  - `filter_by_source(documents, sources)`: Filters by source

### Report Templates Module

The `report_templates` module provides a template system for generating reports with different detail levels and query types.

#### Files

- `__init__.py`: Package initialization file
- `report_templates.py`: Module for managing report templates

#### Classes

- `QueryType` (Enum): Defines the types of queries supported by the system
  - `FACTUAL`: For factual queries seeking specific information
  - `EXPLORATORY`: For exploratory queries investigating a topic
  - `COMPARATIVE`: For comparative queries comparing multiple items
- `DetailLevel` (Enum): Defines the levels of detail for generated reports
  - `BRIEF`: Short summary with key findings
  - `STANDARD`: Standard report with introduction, key findings, and analysis
  - `DETAILED`: Detailed report with methodology and more in-depth analysis
  - `COMPREHENSIVE`: Comprehensive report with executive summary, literature review, and appendices
- `ReportTemplate`: Class representing a report template
  - `template` (str): The template string with placeholders
  - `detail_level` (DetailLevel): The detail level of the template
  - `query_type` (QueryType): The query type the template is designed for
  - `model` (Optional[str]): The LLM model recommended for this template
  - `required_sections` (Optional[List[str]]): Required sections in the template
  - `validate()`: Validates that the template contains all required sections
- `ReportTemplateManager`: Class for managing report templates
  - `add_template(template)`: Adds a template to the manager
  - `get_template(query_type, detail_level)`: Gets a template for a specific query type and detail level
  - `get_available_templates()`: Gets a list of available templates
  - `initialize_default_templates()`: Initializes the default templates for all combinations of query types and detail levels

### Progressive Report Synthesis Module

The `progressive_report_synthesis` module provides functionality to synthesize reports from document chunks using a progressive approach, where chunks are processed iteratively and the report is refined over time.

#### Files

- `__init__.py`: Package initialization file
- `progressive_report_synthesis.py`: Module for progressive report generation

#### Classes

- `ReportState`: Class to track the state of a progressive report
  - `current_report` (str): The current version of the report
  - `processed_chunks` (Set[str]): Set of document IDs that have been processed
  - `version` (int): Current version number of the report
  - `last_update_time` (float): Timestamp of the last update
  - `improvement_scores` (List[float]): List of improvement scores for each iteration
  - `is_complete` (bool): Whether the report generation is complete
  - `termination_reason` (Optional[str]): Reason for termination if complete
- `ProgressiveReportSynthesizer`: Class for progressive report synthesis
  - Extends `ReportSynthesizer` to implement a progressive approach
  - `set_progress_callback(callback)`: Sets a callback function to report progress
  - `prioritize_chunks(chunks, query)`: Prioritizes chunks based on relevance
  - `extract_information_from_chunk(chunk, query, detail_level)`: Extracts key information from a chunk
  - `refine_report(current_report, new_information, query, query_type, detail_level)`: Refines the report with new information
  - `initialize_report(initial_chunks, query, query_type, detail_level)`: Initializes the report with the first batch of chunks
  - `should_terminate(improvement_score)`: Determines if the process should terminate
  - `synthesize_report_progressively(chunks, query, query_type, detail_level)`: Main method for progressive report generation
  - `synthesize_report(chunks, query, query_type, detail_level)`: Override of parent method to use progressive approach for comprehensive detail level
- `get_progressive_report_synthesizer(model_name)`: Factory function to get a singleton instance

### FastAPI Backend Module

The `sim-search-api` module provides a RESTful API for the sim-search system, allowing for query processing, search execution, and report generation through HTTP endpoints.

#### Files

- `app/`: Main application directory
  - `api/`: API routes and dependencies
    - `routes/`: API route handlers
      - `auth.py`: Authentication routes
      - `query.py`: Query processing routes
      - `search.py`: Search execution routes
      - `report.py`: Report generation routes
    - `dependencies.py`: API dependencies (auth, rate limiting)
  - `core/`: Core functionality
    - `config.py`: API configuration
    - `security.py`: Security utilities
  - `db/`: Database models and session management
    - `models.py`: Database models for users, searches, and reports
    - `session.py`: Database session management
  - `schemas/`: Pydantic schemas for request/response validation
    - `token.py`: Token schemas
    - `user.py`: User schemas
    - `query.py`: Query schemas
    - `search.py`: Search result schemas
    - `report.py`: Report schemas
  - `services/`: Service layer for business logic
    - `query_service.py`: Query processing service
    - `search_service.py`: Search execution service
    - `report_service.py`: Report generation service
  - `main.py`: FastAPI application entry point
- `alembic/`: Database migrations
  - `versions/`: Migration versions
    - `001_initial_migration.py`: Initial migration
  - `env.py`: Alembic environment
  - `script.py.mako`: Alembic script template
- `alembic.ini`: Alembic configuration
- `requirements.txt`: API dependencies
- `run.py`: Script to run the API
- `.env.example`: Environment variables template
- `README.md`: API documentation

#### Classes

- `app.db.models.User`: User model for authentication
  - `id` (str): User ID
  - `email` (str): User email
  - `hashed_password` (str): Hashed password
  - `full_name` (str): User's full name
  - `is_active` (bool): Whether the user is active
  - `is_superuser` (bool): Whether the user is a superuser
- `app.db.models.Search`: Search model for storing search results
  - `id` (str): Search ID
  - `user_id` (str): User ID
  - `query` (str): Original query
  - `enhanced_query` (str): Enhanced query
  - `query_type` (str): Query type
  - `engines` (str): Search engines used
  - `results_count` (int): Number of results
  - `results` (JSON): Search results
  - `created_at` (datetime): Creation timestamp
- `app.db.models.Report`: Report model for storing generated reports
  - `id` (str): Report ID
  - `user_id` (str): User ID
  - `search_id` (str): Search ID
  - `title` (str): Report title
  - `content` (str): Report content
  - `detail_level` (str): Detail level
  - `query_type` (str): Query type
  - `model_used` (str): Model used for generation
  - `created_at` (datetime): Creation timestamp
  - `updated_at` (datetime): Update timestamp
- `app.services.QueryService`: Service for query processing
  - `process_query(query)`: Processes a query
  - `classify_query(query)`: Classifies a query
- `app.services.SearchService`: Service for search execution
  - `execute_search(structured_query, search_engines, num_results, timeout, user_id, db)`: Executes a search
  - `get_available_search_engines()`: Gets available search engines
  - `get_search_results(search)`: Gets results for a specific search
- `app.services.ReportService`: Service for report generation
  - `generate_report_background(report_id, report_in, search, db, progress_dict)`: Generates a report in the background
  - `generate_report_file(report, format)`: Generates a report file in the specified format

## Recent Updates

### 2025-03-20: FastAPI Backend Implementation

1. **FastAPI Application Structure**:
   - Created a new directory `sim-search-api` for the FastAPI application
   - Set up project structure with API routes, core functionality, database models, schemas, and services
   - Implemented a layered architecture with API, service, and data layers
   - Added proper `__init__.py` files to make all directories proper Python packages

2. **API Routes Implementation**:
   - Created authentication routes for user registration and token generation
   - Implemented query processing routes for query enhancement and classification
   - Added search execution routes for executing searches and managing search history
   - Created report generation routes for generating and managing reports
   - Implemented proper error handling and validation for all routes

3. **Service Layer Implementation**:
   - Created `QueryService` to bridge between API and existing query processing functionality
   - Implemented `SearchService` for search execution and result management
   - Added `ReportService` for report generation and management
   - Ensured proper integration with existing sim-search functionality
   - Implemented asynchronous operation for all services

4. **Database Setup**:
   - Created SQLAlchemy models for users, searches, and reports
   - Implemented database session management
   - Set up Alembic for database migrations
   - Created initial migration script to create all tables
   - Added proper relationships between models

5. **Authentication and Security**:
   - Implemented JWT-based authentication
   - Added password hashing and verification
   - Created token generation and validation
   - Implemented user registration and login
   - Added proper authorization for protected routes

6. **Documentation and Configuration**:
   - Created comprehensive API documentation
   - Added OpenAPI documentation endpoints
   - Implemented environment variable configuration
   - Created a README with setup and usage instructions
   - Added example environment variables file

### 2025-03-12: Progressive Report Generation Implementation

1. **Progressive Report Synthesis Module**:
   - Created a new module `progressive_report_synthesis.py` for progressive report generation
   - Implemented `ReportState` class to track the state of a progressive report
   - Created `ProgressiveReportSynthesizer` class extending from `ReportSynthesizer`
   - Implemented chunk prioritization algorithm based on relevance scores
   - Developed iterative refinement process with specialized prompts
   - Added state management to track report versions and processed chunks
   - Implemented termination conditions (all chunks processed, diminishing returns, max iterations)
   - Added support for different models with adaptive batch sizing
   - Implemented progress tracking and callback mechanism

2. **Report Generator Integration**:
   - Modified `report_generator.py` to use the progressive report synthesizer for comprehensive detail level
   - Created a hybrid system that uses standard map-reduce for brief/standard/detailed levels
   - Added proper model selection and configuration for both synthesizers

3. **Testing**:
   - Created `test_progressive_report.py` to test progressive report generation
   - Implemented comparison functionality between progressive and standard approaches
   - Added test cases for different query types and document collections

### 2025-03-11: Report Templates Implementation

1. **Report Templates Module**:
   - Created a new module `report_templates.py` for managing report templates
   - Implemented enums for query types (FACTUAL, EXPLORATORY, COMPARATIVE) and detail levels (BRIEF, STANDARD, DETAILED, COMPREHENSIVE)
   - Created a template system with placeholders for different report sections
   - Implemented 12 different templates (3 query types × 4 detail levels)
   - Added validation to ensure templates contain all required sections

2. **Report Synthesis Integration**:
   - Updated the report synthesis module to use the new template system
   - Added support for different templates based on query type and detail level
   - Implemented fallback to standard templates when specific templates are not found
   - Added better logging for template retrieval process

3. **Testing**:
   - Created `test_report_templates.py` to test template retrieval and validation
   - Implemented `test_brief_report.py` to test the brief report generation
   - Successfully tested all combinations of detail levels and query types

### 2025-02-28: Async Implementation and Reference Formatting

1. **LLM Interface Updates**:
   - Converted key methods to async:
     - `generate_completion`
     - `classify_query`
     - `enhance_query`
     - `generate_search_queries`
   - Added special handling for Gemini models
   - Improved reference formatting instructions

2. **Query Processor Updates**:
   - Updated `process_query` to be async
   - Made `generate_search_queries` async
   - Fixed async/await patterns throughout

3. **Gradio Interface Updates**:
   - Modified `generate_report` to handle async operations
   - Updated report button click handler
   - Improved error handling
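
The async conversion summarized in this update follows a common pattern: once the lower-level calls become coroutines, a synchronous entry point such as a button callback needs a way to drive them. The sketch below illustrates that pattern with the standard library only. It is an assumption-laden example, not the project's code: `generate_report_async` and `on_generate_click` are hypothetical names, and `asyncio.sleep` stands in for awaiting LLM and search calls.

```python
import asyncio


async def generate_report_async(query: str) -> str:
    """Hypothetical coroutine; the sleep stands in for awaited LLM/search work."""
    if not query.strip():
        raise ValueError("query must not be empty")
    await asyncio.sleep(0)
    return f"Report for: {query.strip()}"


def on_generate_click(query: str) -> str:
    """Synchronous callback that runs the coroutine and reports errors as text."""
    try:
        # asyncio.run creates a fresh event loop; it cannot be used from code
        # that is already running inside an event loop.
        return asyncio.run(generate_report_async(query))
    except ValueError as exc:
        return f"Error generating report: {exc}"
```

If the UI framework accepts coroutine functions as handlers directly, the wrapper is unnecessary and the coroutine can be registered as the callback instead.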