
# Code Structure

## Current Project Organization

```
project/
│
├── examples/          # Sample data and query examples
├── report/            # Report generation module
│   ├── __init__.py
│   ├── report_generator.py    # Module for generating reports
│   ├── report_synthesis.py    # Module for synthesizing reports
│   ├── progressive_report_synthesis.py # Module for progressive report generation
│   ├── document_processor.py  # Module for processing documents
│   ├── document_scraper.py    # Module for scraping documents
│   ├── report_detail_levels.py # Module for managing report detail levels
│   ├── report_templates.py    # Module for managing report templates
│   └── database/              # Database for storing reports
│       ├── __init__.py
│       └── db_manager.py      # Module for managing the database
├── tests/             # Test suite
│   ├── __init__.py
│   ├── execution/     # Search execution tests
│   │   ├── __init__.py
│   │   ├── test_search.py
│   │   ├── test_search_execution.py
│   │   └── test_all_handlers.py
│   ├── integration/   # Integration tests
│   │   ├── __init__.py
│   │   ├── test_ev_query.py
│   │   └── test_query_to_report.py
│   ├── query/         # Query processing tests
│   │   ├── __init__.py
│   │   ├── test_query_processor.py
│   │   ├── test_query_processor_comprehensive.py
│   │   └── test_llm_interface.py
│   ├── ranking/       # Ranking algorithm tests
│   │   ├── __init__.py
│   │   ├── test_reranker.py
│   │   ├── test_similarity.py
│   │   └── test_simple_reranker.py
│   ├── report/        # Report generation tests
│   │   ├── __init__.py
│   │   ├── test_custom_model.py
│   │   ├── test_detail_levels.py
│   │   ├── test_brief_report.py
│   │   └── test_report_templates.py
│   ├── ui/            # UI component tests
│   │   ├── __init__.py
│   │   └── test_ui_search.py
│   ├── test_document_processor.py
│   ├── test_document_scraper.py
│   └── test_report_synthesis.py
├── utils/             # Utility scripts and shared functions
│   ├── __init__.py
│   ├── jina_similarity.py     # Module for computing text similarity
│   └── markdown_segmenter.py  # Module for segmenting markdown documents
├── config/            # Configuration management
│   ├── __init__.py
│   ├── config.py              # Configuration management class
│   └── config.yaml            # YAML configuration file with settings for different components
├── query/            # Query processing module
│   ├── __init__.py
│   ├── query_processor.py     # Module for processing user queries
│   └── llm_interface.py       # Module for interacting with LLM providers
├── execution/        # Search execution module
│   ├── __init__.py
│   ├── search_executor.py     # Module for executing search queries
│   ├── result_collector.py    # Module for collecting search results
│   └── api_handlers/          # Handlers for different search APIs
│       ├── __init__.py
│       ├── base_handler.py    # Base class for search handlers
│       ├── serper_handler.py  # Handler for Serper API (Google search)
│       ├── scholar_handler.py # Handler for Google Scholar via Serper
│       ├── google_handler.py  # Handler for Google search
│       └── arxiv_handler.py   # Handler for arXiv API
├── ranking/          # Ranking module
│   ├── __init__.py
│   └── jina_reranker.py       # Module for reranking documents using Jina AI
├── ui/              # UI module
│   ├── __init__.py
│   └── gradio_interface.py    # Gradio-based web interface
├── scripts/         # Scripts
│   └── query_to_report.py     # Script for generating reports from queries
├── sim-search-api/   # FastAPI backend
│   ├── app/
│   │   ├── api/
│   │   │   ├── routes/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── auth.py           # Authentication routes
│   │   │   │   ├── query.py          # Query processing routes
│   │   │   │   ├── search.py         # Search execution routes
│   │   │   │   └── report.py         # Report generation routes
│   │   │   ├── __init__.py
│   │   │   └── dependencies.py       # API dependencies (auth, rate limiting)
│   │   ├── core/
│   │   │   ├── __init__.py
│   │   │   ├── config.py             # API configuration
│   │   │   └── security.py           # Security utilities
│   │   ├── db/
│   │   │   ├── __init__.py
│   │   │   ├── session.py            # Database session
│   │   │   └── models.py             # Database models for users, searches, and reports
│   │   ├── schemas/
│   │   │   ├── __init__.py
│   │   │   ├── token.py              # Token schemas
│   │   │   ├── user.py               # User schemas
│   │   │   ├── query.py              # Query schemas
│   │   │   ├── search.py             # Search result schemas
│   │   │   └── report.py             # Report schemas
│   │   ├── services/
│   │   │   ├── __init__.py
│   │   │   ├── query_service.py      # Query processing service
│   │   │   ├── search_service.py     # Search execution service
│   │   │   └── report_service.py     # Report generation service
│   │   └── main.py                   # FastAPI application
│   ├── alembic/                      # Database migrations
│   │   ├── versions/
│   │   │   └── 001_initial_migration.py  # Initial migration
│   │   ├── env.py                    # Alembic environment
│   │   └── script.py.mako            # Alembic script template
│   ├── .env.example                  # Environment variables template
│   ├── alembic.ini                   # Alembic configuration
│   ├── requirements.txt              # API dependencies
│   ├── run.py                        # Script to run the API
│   └── README.md                     # API documentation
├── run_ui.py         # Script to run the UI
└── requirements.txt  # Project dependencies
```

## Module Details

### Config Module

The `config` module manages configuration settings for the entire system, including API keys, model selections, and other parameters.

#### Files

- `__init__.py`: Package initialization file
- `config.py`: Configuration management class
- `config.yaml`: YAML configuration file with settings for different components

#### Classes

- `Config`: Singleton class for loading and accessing configuration settings
  - `load_config(config_path)`: Loads configuration from a YAML file
  - `get(key, default=None)`: Gets a configuration value by key
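
As an illustration of the singleton pattern described above, here is a minimal sketch. It is not the project's actual `config.py`: the dotted-key lookup in `get` and the private `_settings` attribute are assumptions made for the example, and `load_config` assumes PyYAML is installed.

```python
from typing import Any, Optional


class Config:
    """Sketch of a singleton configuration holder (hypothetical internals)."""

    _instance: Optional["Config"] = None

    def __new__(cls) -> "Config":
        # Create the instance once and hand the same object back afterwards.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._settings = {}
        return cls._instance

    def load_config(self, config_path: str) -> None:
        # PyYAML is imported lazily so the class can be used without it.
        import yaml

        with open(config_path, "r", encoding="utf-8") as handle:
            self._settings = yaml.safe_load(handle) or {}

    def get(self, key: str, default: Any = None) -> Any:
        # Assumption: nested settings are addressed as "section.option".
        node: Any = self._settings
        for part in key.split("."):
            if not isinstance(node, dict) or part not in node:
                return default
            node = node[part]
        return node
```

Because every call to `Config()` returns the same object, settings loaded once at startup are visible to every module that asks for the configuration later.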

### Query Module

The `query` module handles the processing and enhancement of user queries, including classification and optimization for search.

#### Files

- `__init__.py`: Package initialization file
- `query_processor.py`: Main module for processing user queries
- `query_classifier.py`: Module for classifying query types
- `llm_interface.py`: Interface for interacting with LLM providers

#### Classes

- `QueryProcessor`: Main class for processing user queries
  - `process_query(query)`: Processes a user query and returns enhanced results
  - `classify_query(query)`: Classifies a query by type and intent
  - `generate_search_queries(query, classification)`: Generates optimized search queries

- `QueryClassifier`: Class for classifying queries
  - `classify(query)`: Classifies a query by type, intent, and entities

- `LLMInterface`: Interface for interacting with LLM providers
  - `get_completion(prompt, model=None)`: Gets a completion from an LLM
  - `enhance_query(query)`: Enhances a query with additional context
  - `classify_query(query)`: Uses an LLM to classify a query
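
The sketch below shows how these classes might fit together. It uses async methods, in line with the 2025-02-28 update noted later in this document. `StubLLMInterface`, its canned behavior, and the keys of the returned dictionary are assumptions for illustration; the real `LLMInterface` calls an LLM provider.

```python
import asyncio
from typing import Any, Dict, List


class StubLLMInterface:
    """Hypothetical stand-in for LLMInterface that never calls a provider."""

    async def enhance_query(self, query: str) -> str:
        return f"{query} (latest research and data)"

    async def classify_query(self, query: str) -> Dict[str, Any]:
        # Toy heuristic in place of an LLM-based classification.
        is_comparison = " vs " in f" {query.lower()} "
        return {"type": "comparative" if is_comparison else "factual"}


class QueryProcessor:
    def __init__(self, llm: StubLLMInterface) -> None:
        self.llm = llm

    async def classify_query(self, query: str) -> Dict[str, Any]:
        return await self.llm.classify_query(query)

    async def generate_search_queries(
        self, query: str, classification: Dict[str, Any]
    ) -> List[str]:
        queries = [query]
        if classification.get("type") == "comparative":
            queries.append(f"{query} comparison")
        return queries

    async def process_query(self, query: str) -> Dict[str, Any]:
        # Enhance, classify, then derive search queries from both results.
        enhanced = await self.llm.enhance_query(query)
        classification = await self.classify_query(query)
        search_queries = await self.generate_search_queries(enhanced, classification)
        return {
            "original_query": query,
            "enhanced_query": enhanced,
            "classification": classification,
            "search_queries": search_queries,
        }


if __name__ == "__main__":
    processor = QueryProcessor(StubLLMInterface())
    print(asyncio.run(processor.process_query("solar vs wind power")))
```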

### Execution Module

The `execution` module handles the execution of search queries across multiple search engines and the collection of results.

#### Files

- `__init__.py`: Package initialization file
- `search_executor.py`: Module for executing search queries
- `result_collector.py`: Module for collecting and processing search results
- `api_handlers/`: Directory containing handlers for different search APIs
  - `__init__.py`: Package initialization file
  - `base_handler.py`: Base class for search handlers
  - `serper_handler.py`: Handler for Serper API (Google search)
  - `scholar_handler.py`: Handler for Google Scholar via Serper
  - `google_handler.py`: Handler for Google search
  - `arxiv_handler.py`: Handler for arXiv API

#### Classes

- `SearchExecutor`: Class for executing search queries
  - `execute_search(query_data)`: Executes a search across multiple engines
  - `_execute_search_async(query, engines)`: Executes a search asynchronously
  - `_execute_search_sync(query, engines)`: Executes a search synchronously

- `ResultCollector`: Class for collecting and processing search results
  - `process_results(search_results)`: Processes search results from multiple engines
  - `deduplicate_results(results)`: Deduplicates results based on URL
  - `save_results(results, file_path)`: Saves results to a file

- `BaseSearchHandler`: Base class for search handlers
  - `search(query, num_results)`: Abstract method for searching
  - `_process_response(response)`: Processes the API response

- `SerperSearchHandler`: Handler for Serper API
  - `search(query, num_results)`: Searches using Serper API
  - `_process_response(response)`: Processes the Serper API response

- `ScholarSearchHandler`: Handler for Google Scholar via Serper
  - `search(query, num_results)`: Searches Google Scholar
  - `_process_response(response)`: Processes the Scholar API response

- `ArxivSearchHandler`: Handler for arXiv API
  - `search(query, num_results)`: Searches arXiv
  - `_process_response(response)`: Processes the arXiv API response
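
The handler hierarchy and URL-based deduplication can be sketched as follows. This is an illustration under stated assumptions, not the project's code: `FakeSearchHandler`, the `items`/`link` response shape, the result keys, and the trailing-slash normalization are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Result = Dict[str, Any]


class BaseSearchHandler(ABC):
    """Interface that every search handler implements."""

    @abstractmethod
    def search(self, query: str, num_results: int = 10) -> List[Result]:
        ...

    @abstractmethod
    def _process_response(self, response: Dict[str, Any]) -> List[Result]:
        ...


class FakeSearchHandler(BaseSearchHandler):
    """Hypothetical handler that serves canned data instead of calling an API."""

    def __init__(self, source: str, canned_response: Dict[str, Any]) -> None:
        self.source = source
        self._canned = canned_response

    def search(self, query: str, num_results: int = 10) -> List[Result]:
        return self._process_response(self._canned)[:num_results]

    def _process_response(self, response: Dict[str, Any]) -> List[Result]:
        # Map the raw payload onto one common result shape.
        return [
            {"title": item["title"], "url": item["link"], "source": self.source}
            for item in response.get("items", [])
        ]


def deduplicate_results(results: List[Result]) -> List[Result]:
    """Keep the first result seen for each URL (trailing slash ignored)."""
    seen = set()
    unique: List[Result] = []
    for result in results:
        key = str(result.get("url", "")).rstrip("/")
        if key and key in seen:
            continue
        if key:
            seen.add(key)
        unique.append(result)
    return unique
```

Normalizing every handler's output to one shape in `_process_response` is what lets a collector deduplicate across engines without knowing which API produced a result.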

### Ranking Module

The `ranking` module provides functionality for reranking and prioritizing documents based on their relevance to the user's query.

#### Files

- `__init__.py`: Package initialization file
- `jina_reranker.py`: Module for reranking documents using Jina AI
- `filter_manager.py`: Module for filtering documents

#### Classes

- `JinaReranker`: Class for reranking documents
  - `rerank(documents, query)`: Reranks documents based on relevance to query
  - `_prepare_inputs(documents, query)`: Prepares inputs for the reranker

- `FilterManager`: Class for filtering documents
  - `filter_by_date(documents, start_date, end_date)`: Filters by date
  - `filter_by_source(documents, sources)`: Filters by source
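
A minimal sketch of the two filter methods, assuming each document is a dictionary with a `published` date and a `source` string. Both field names are hypothetical, as is the choice to drop undated documents from date-filtered results.

```python
from datetime import date
from typing import Any, Dict, Iterable, List, Optional

Document = Dict[str, Any]


class FilterManager:
    def filter_by_date(
        self,
        documents: List[Document],
        start_date: Optional[date] = None,
        end_date: Optional[date] = None,
    ) -> List[Document]:
        kept: List[Document] = []
        for doc in documents:
            published = doc.get("published")
            if published is None:
                continue  # assumption: undated documents are excluded
            if start_date is not None and published < start_date:
                continue
            if end_date is not None and published > end_date:
                continue
            kept.append(doc)
        return kept

    def filter_by_source(
        self, documents: List[Document], sources: Iterable[str]
    ) -> List[Document]:
        # Compare source names case-insensitively.
        allowed = {source.lower() for source in sources}
        return [
            doc for doc in documents
            if str(doc.get("source", "")).lower() in allowed
        ]
```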

### Report Templates Module

The `report_templates` module provides a template system for generating reports with different detail levels and query types.

#### Files

- `__init__.py`: Package initialization file
- `report_templates.py`: Module for managing report templates

#### Classes

- `QueryType` (Enum): Defines the types of queries supported by the system
  - `FACTUAL`: For factual queries seeking specific information
  - `EXPLORATORY`: For exploratory queries investigating a topic
  - `COMPARATIVE`: For comparative queries comparing multiple items

- `DetailLevel` (Enum): Defines the levels of detail for generated reports
  - `BRIEF`: Short summary with key findings
  - `STANDARD`: Standard report with introduction, key findings, and analysis
  - `DETAILED`: Detailed report with methodology and more in-depth analysis
  - `COMPREHENSIVE`: Comprehensive report with executive summary, literature review, and appendices

- `ReportTemplate`: Class representing a report template
  - `template` (str): The template string with placeholders
  - `detail_level` (DetailLevel): The detail level of the template
  - `query_type` (QueryType): The query type the template is designed for
  - `model` (Optional[str]): The LLM model recommended for this template
  - `required_sections` (Optional[List[str]]): Required sections in the template
  - `validate()`: Validates that the template contains all required sections

- `ReportTemplateManager`: Class for managing report templates
  - `add_template(template)`: Adds a template to the manager
  - `get_template(query_type, detail_level)`: Gets a template for a specific query type and detail level
  - `get_available_templates()`: Gets a list of available templates
  - `initialize_default_templates()`: Initializes the default templates for all combinations of query types and detail levels
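
The template system above can be sketched roughly as follows. The enum string values, the placeholder names (`{title}`, `{key_findings}`), and the single shared template body are assumptions for the example; the real module defines a distinct template per combination.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class QueryType(Enum):
    FACTUAL = "factual"
    EXPLORATORY = "exploratory"
    COMPARATIVE = "comparative"


class DetailLevel(Enum):
    BRIEF = "brief"
    STANDARD = "standard"
    DETAILED = "detailed"
    COMPREHENSIVE = "comprehensive"


@dataclass
class ReportTemplate:
    template: str
    detail_level: DetailLevel
    query_type: QueryType
    model: Optional[str] = None
    required_sections: Optional[List[str]] = None

    def validate(self) -> bool:
        # Valid when every required section marker appears in the template.
        return all(s in self.template for s in (self.required_sections or []))


class ReportTemplateManager:
    def __init__(self) -> None:
        self._templates: Dict[Tuple[QueryType, DetailLevel], ReportTemplate] = {}

    def add_template(self, template: ReportTemplate) -> None:
        self._templates[(template.query_type, template.detail_level)] = template

    def get_template(
        self, query_type: QueryType, detail_level: DetailLevel
    ) -> Optional[ReportTemplate]:
        return self._templates.get((query_type, detail_level))

    def get_available_templates(self) -> List[ReportTemplate]:
        return list(self._templates.values())

    def initialize_default_templates(self) -> None:
        # Placeholder body; one entry per (query type, detail level) pair.
        body = "# {title}\n\n## Key Findings\n\n{key_findings}\n"
        for query_type in QueryType:
            for detail_level in DetailLevel:
                self.add_template(ReportTemplate(
                    template=body,
                    detail_level=detail_level,
                    query_type=query_type,
                    required_sections=["{title}", "{key_findings}"],
                ))
```

Keying the registry on the `(query_type, detail_level)` pair makes lookup a single dictionary access and yields one template for each of the 3 × 4 combinations.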

### Progressive Report Synthesis Module

The `progressive_report_synthesis` module provides functionality to synthesize reports from document chunks using a progressive approach, where chunks are processed iteratively and the report is refined over time.

#### Files

- `__init__.py`: Package initialization file
- `progressive_report_synthesis.py`: Module for progressive report generation

#### Classes

- `ReportState`: Class to track the state of a progressive report
  - `current_report` (str): The current version of the report
  - `processed_chunks` (Set[str]): Set of document IDs that have been processed
  - `version` (int): Current version number of the report
  - `last_update_time` (float): Timestamp of the last update
  - `improvement_scores` (List[float]): List of improvement scores for each iteration
  - `is_complete` (bool): Whether the report generation is complete
  - `termination_reason` (Optional[str]): Reason for termination if complete

- `ProgressiveReportSynthesizer`: Class for progressive report synthesis
  - Extends `ReportSynthesizer` to implement a progressive approach
  - `set_progress_callback(callback)`: Sets a callback function to report progress
  - `prioritize_chunks(chunks, query)`: Prioritizes chunks based on relevance
  - `extract_information_from_chunk(chunk, query, detail_level)`: Extracts key information from a chunk
  - `refine_report(current_report, new_information, query, query_type, detail_level)`: Refines the report with new information
  - `initialize_report(initial_chunks, query, query_type, detail_level)`: Initializes the report with the first batch of chunks
  - `should_terminate(improvement_score)`: Determines if the process should terminate
  - `synthesize_report_progressively(chunks, query, query_type, detail_level)`: Main method for progressive report generation
  - `synthesize_report(chunks, query, query_type, detail_level)`: Overrides the parent method to use the progressive approach at the comprehensive detail level

- `get_progressive_report_synthesizer(model_name)`: Factory function that returns a singleton instance of `ProgressiveReportSynthesizer`
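
The progressive loop can be sketched in a simplified, synchronous form. `ReportState` mirrors the fields listed above. The `synthesize_progressively` function, the injected `refine` callable, the chunk keys (`id`, `content`, `score`), the thresholds, and the length-based improvement score are all assumptions for illustration; the real synthesizer uses LLM calls.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class ReportState:
    current_report: str = ""
    processed_chunks: Set[str] = field(default_factory=set)
    version: int = 0
    last_update_time: float = field(default_factory=time.time)
    improvement_scores: List[float] = field(default_factory=list)
    is_complete: bool = False
    termination_reason: Optional[str] = None


def synthesize_progressively(
    chunks: List[Dict],
    refine: Callable[[str, str], str],
    min_improvement: float = 0.05,
    max_iterations: int = 10,
) -> ReportState:
    state = ReportState()
    # Process the most relevant chunks first.
    ordered = sorted(chunks, key=lambda c: c.get("score", 0.0), reverse=True)
    for chunk in ordered:
        if state.version >= max_iterations:
            state.termination_reason = "max_iterations"
            break
        new_report = refine(state.current_report, chunk["content"])
        # Hypothetical score: share of the new report that was just added.
        growth = len(new_report) - len(state.current_report)
        improvement = growth / max(len(new_report), 1)
        state.current_report = new_report
        state.processed_chunks.add(chunk["id"])
        state.version += 1
        state.last_update_time = time.time()
        state.improvement_scores.append(improvement)
        if state.version > 1 and improvement < min_improvement:
            state.termination_reason = "diminishing_returns"
            break
    else:
        # The loop ran to the end without hitting a stop condition.
        state.termination_reason = "all_chunks_processed"
    state.is_complete = True
    return state
```

Each pass yields a complete report version, so the caller can show intermediate results or stop early without losing work.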

### FastAPI Backend Module

The `sim-search-api` module provides a RESTful API for the sim-search system, allowing for query processing, search execution, and report generation through HTTP endpoints.

#### Files

- `app/`: Main application directory
  - `api/`: API routes and dependencies
    - `routes/`: API route handlers
      - `auth.py`: Authentication routes
      - `query.py`: Query processing routes
      - `search.py`: Search execution routes
      - `report.py`: Report generation routes
    - `dependencies.py`: API dependencies (auth, rate limiting)
  - `core/`: Core functionality
    - `config.py`: API configuration
    - `security.py`: Security utilities
  - `db/`: Database models and session management
    - `models.py`: Database models for users, searches, and reports
    - `session.py`: Database session management
  - `schemas/`: Pydantic schemas for request/response validation
    - `token.py`: Token schemas
    - `user.py`: User schemas
    - `query.py`: Query schemas
    - `search.py`: Search result schemas
    - `report.py`: Report schemas
  - `services/`: Service layer for business logic
    - `query_service.py`: Query processing service
    - `search_service.py`: Search execution service
    - `report_service.py`: Report generation service
  - `main.py`: FastAPI application entry point
- `alembic/`: Database migrations
  - `versions/`: Migration versions
    - `001_initial_migration.py`: Initial migration
  - `env.py`: Alembic environment
  - `script.py.mako`: Alembic script template
- `alembic.ini`: Alembic configuration
- `requirements.txt`: API dependencies
- `run.py`: Script to run the API
- `.env.example`: Environment variables template
- `README.md`: API documentation

#### Classes

- `app.db.models.User`: User model for authentication
  - `id` (str): User ID
  - `email` (str): User email
  - `hashed_password` (str): Hashed password
  - `full_name` (str): User's full name
  - `is_active` (bool): Whether the user is active
  - `is_superuser` (bool): Whether the user is a superuser

- `app.db.models.Search`: Search model for storing search results
  - `id` (str): Search ID
  - `user_id` (str): User ID
  - `query` (str): Original query
  - `enhanced_query` (str): Enhanced query
  - `query_type` (str): Query type
  - `engines` (str): Search engines used
  - `results_count` (int): Number of results
  - `results` (JSON): Search results
  - `created_at` (datetime): Creation timestamp

- `app.db.models.Report`: Report model for storing generated reports
  - `id` (str): Report ID
  - `user_id` (str): User ID
  - `search_id` (str): Search ID
  - `title` (str): Report title
  - `content` (str): Report content
  - `detail_level` (str): Detail level
  - `query_type` (str): Query type
  - `model_used` (str): Model used for generation
  - `created_at` (datetime): Creation timestamp
  - `updated_at` (datetime): Update timestamp

- `app.services.QueryService`: Service for query processing
  - `process_query(query)`: Processes a query
  - `classify_query(query)`: Classifies a query

- `app.services.SearchService`: Service for search execution
  - `execute_search(structured_query, search_engines, num_results, timeout, user_id, db)`: Executes a search
  - `get_available_search_engines()`: Gets available search engines
  - `get_search_results(search)`: Gets results for a specific search

- `app.services.ReportService`: Service for report generation
  - `generate_report_background(report_id, report_in, search, db, progress_dict)`: Generates a report in the background
  - `generate_report_file(report, format)`: Generates a report file in the specified format
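
To illustrate how a background task can report progress through a shared `progress_dict`, here is a reduced sketch using only `asyncio`. The signature is deliberately simplified compared with the one listed above (no `report_in`, `search`, or `db`). The status strings and the `chunks` argument are assumptions, and this note does not state how the real service schedules the task.

```python
import asyncio
from typing import Any, Dict, List


async def generate_report_background(
    report_id: str,
    chunks: List[str],
    progress_dict: Dict[str, Dict[str, Any]],
) -> str:
    """Build a report step by step while publishing progress for pollers."""
    progress_dict[report_id] = {"progress": 0.0, "status": "running"}
    parts: List[str] = []
    try:
        for index, chunk in enumerate(chunks, start=1):
            # Yield to the event loop; stands in for an awaited LLM call.
            await asyncio.sleep(0)
            parts.append(chunk.strip())
            progress_dict[report_id]["progress"] = index / len(chunks)
    except Exception as exc:
        # Record the failure so a status endpoint can surface it.
        progress_dict[report_id].update(status="failed", error=str(exc))
        raise
    progress_dict[report_id].update(status="completed", progress=1.0)
    return "\n\n".join(parts)


if __name__ == "__main__":
    progress: Dict[str, Dict[str, Any]] = {}
    report = asyncio.run(
        generate_report_background("r1", ["Intro", "Findings"], progress)
    )
    print(progress["r1"], report, sep="\n")
```

A status route could read `progress_dict[report_id]` on each poll. Since the dictionary lives in process memory, this approach assumes a single worker process.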

## Recent Updates

### 2025-03-20: FastAPI Backend Implementation

1. **FastAPI Application Structure**:
   - Created a new directory `sim-search-api` for the FastAPI application
   - Set up project structure with API routes, core functionality, database models, schemas, and services
   - Implemented a layered architecture with API, service, and data layers
   - Added `__init__.py` files so that all directories are Python packages

2. **API Routes Implementation**:
   - Created authentication routes for user registration and token generation
   - Implemented query processing routes for query enhancement and classification
   - Added search execution routes for executing searches and managing search history
   - Created report generation routes for generating and managing reports
   - Implemented proper error handling and validation for all routes

3. **Service Layer Implementation**:
   - Created `QueryService` to bridge between API and existing query processing functionality
   - Implemented `SearchService` for search execution and result management
   - Added `ReportService` for report generation and management
   - Ensured proper integration with existing sim-search functionality
   - Implemented asynchronous operation for all services

4. **Database Setup**:
   - Created SQLAlchemy models for users, searches, and reports
   - Implemented database session management
   - Set up Alembic for database migrations
   - Created initial migration script to create all tables
   - Added proper relationships between models

5. **Authentication and Security**:
   - Implemented JWT-based authentication
   - Added password hashing and verification
   - Created token generation and validation
   - Implemented user registration and login
   - Added proper authorization for protected routes

6. **Documentation and Configuration**:
   - Created comprehensive API documentation
   - Added OpenAPI documentation endpoints
   - Implemented environment variable configuration
   - Created a README with setup and usage instructions
   - Added example environment variables file

### 2025-03-12: Progressive Report Generation Implementation

1. **Progressive Report Synthesis Module**:
   - Created a new module `progressive_report_synthesis.py` for progressive report generation
   - Implemented `ReportState` class to track the state of a progressive report
   - Created `ProgressiveReportSynthesizer` class extending `ReportSynthesizer`
   - Implemented chunk prioritization algorithm based on relevance scores
   - Developed iterative refinement process with specialized prompts
   - Added state management to track report versions and processed chunks
   - Implemented termination conditions (all chunks processed, diminishing returns, max iterations)
   - Added support for different models with adaptive batch sizing
   - Implemented progress tracking and callback mechanism

2. **Report Generator Integration**:
   - Modified `report_generator.py` to use the progressive report synthesizer for comprehensive detail level
   - Created a hybrid system that uses standard map-reduce for the brief, standard, and detailed levels and the progressive synthesizer for the comprehensive level
   - Added proper model selection and configuration for both synthesizers

3. **Testing**:
   - Created `test_progressive_report.py` to test progressive report generation
   - Implemented comparison functionality between progressive and standard approaches
   - Added test cases for different query types and document collections

### 2025-03-11: Report Templates Implementation

1. **Report Templates Module**:
   - Created a new module `report_templates.py` for managing report templates
   - Implemented enums for query types (FACTUAL, EXPLORATORY, COMPARATIVE) and detail levels (BRIEF, STANDARD, DETAILED, COMPREHENSIVE)
   - Created a template system with placeholders for different report sections
   - Implemented 12 different templates (3 query types × 4 detail levels)
   - Added validation to ensure templates contain all required sections

2. **Report Synthesis Integration**:
   - Updated the report synthesis module to use the new template system
   - Added support for different templates based on query type and detail level
   - Implemented fallback to standard templates when specific templates are not found
   - Added better logging for template retrieval process

3. **Testing**:
   - Created `test_report_templates.py` to test template retrieval and validation
   - Implemented `test_brief_report.py` to test brief report generation
   - Successfully tested all combinations of detail levels and query types

### 2025-02-28: Async Implementation and Reference Formatting

1. **LLM Interface Updates**:
   - Converted key methods to async:
     - `generate_completion`
     - `classify_query`
     - `enhance_query`
     - `generate_search_queries`
   - Added special handling for Gemini models
   - Improved reference formatting instructions

2. **Query Processor Updates**:
   - Updated `process_query` to be async
   - Made `generate_search_queries` async
   - Fixed async/await patterns throughout

3. **Gradio Interface Updates**:
   - Modified `generate_report` to handle async operations
   - Updated report button click handler
   - Improved error handling