Code Structure

Current Project Organization

project/
│
├── examples/          # Sample data and query examples
├── report/            # Report generation module
│   ├── __init__.py
│   ├── report_generator.py    # Module for generating reports
│   ├── report_synthesis.py    # Module for synthesizing reports
│   ├── progressive_report_synthesis.py # Module for progressive report generation
│   ├── document_processor.py  # Module for processing documents
│   ├── document_scraper.py    # Module for scraping documents
│   ├── report_detail_levels.py # Module for managing report detail levels
│   ├── report_templates.py    # Module for managing report templates
│   └── database/              # Database for storing reports
│       ├── __init__.py
│       └── db_manager.py      # Module for managing the database
├── tests/             # Test suite
│   ├── __init__.py
│   ├── execution/     # Search execution tests
│   │   ├── __init__.py
│   │   ├── test_search.py
│   │   ├── test_search_execution.py
│   │   └── test_all_handlers.py
│   ├── integration/   # Integration tests
│   │   ├── __init__.py
│   │   ├── test_ev_query.py
│   │   └── test_query_to_report.py
│   ├── query/         # Query processing tests
│   │   ├── __init__.py
│   │   ├── test_query_processor.py
│   │   ├── test_query_processor_comprehensive.py
│   │   └── test_llm_interface.py
│   ├── ranking/       # Ranking algorithm tests
│   │   ├── __init__.py
│   │   ├── test_reranker.py
│   │   ├── test_similarity.py
│   │   └── test_simple_reranker.py
│   ├── report/        # Report generation tests
│   │   ├── __init__.py
│   │   ├── test_custom_model.py
│   │   ├── test_detail_levels.py
│   │   ├── test_brief_report.py
│   │   └── test_report_templates.py
│   ├── ui/            # UI component tests
│   │   ├── __init__.py
│   │   └── test_ui_search.py
│   ├── test_document_processor.py
│   ├── test_document_scraper.py
│   └── test_report_synthesis.py
├── utils/             # Utility scripts and shared functions
│   ├── __init__.py
│   ├── jina_similarity.py     # Module for computing text similarity
│   └── markdown_segmenter.py  # Module for segmenting markdown documents
├── config/            # Configuration management
│   ├── __init__.py
│   ├── config.py              # Configuration management class
│   └── config.yaml            # YAML configuration file with settings for different components
├── query/            # Query processing module
│   ├── __init__.py
│   ├── query_processor.py     # Module for processing user queries
│   └── llm_interface.py       # Module for interacting with LLM providers
├── execution/        # Search execution module
│   ├── __init__.py
│   ├── search_executor.py     # Module for executing search queries
│   ├── result_collector.py    # Module for collecting search results
│   └── api_handlers/          # Handlers for different search APIs
│       ├── __init__.py
│       ├── base_handler.py    # Base class for search handlers
│       ├── serper_handler.py  # Handler for Serper API (Google search)
│       ├── scholar_handler.py # Handler for Google Scholar via Serper
│       ├── google_handler.py  # Handler for Google search
│       └── arxiv_handler.py   # Handler for arXiv API
├── ranking/          # Ranking module
│   ├── __init__.py
│   └── jina_reranker.py       # Module for reranking documents using Jina AI
├── ui/              # UI module
│   ├── __init__.py
│   └── gradio_interface.py    # Gradio-based web interface
├── scripts/         # Scripts
│   └── query_to_report.py     # Script for generating reports from queries
├── sim-search-api/   # FastAPI backend
│   ├── app/
│   │   ├── api/
│   │   │   ├── routes/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── auth.py           # Authentication routes
│   │   │   │   ├── query.py          # Query processing routes
│   │   │   │   ├── search.py         # Search execution routes
│   │   │   │   └── report.py         # Report generation routes
│   │   │   ├── __init__.py
│   │   │   └── dependencies.py       # API dependencies (auth, rate limiting)
│   │   ├── core/
│   │   │   ├── __init__.py
│   │   │   ├── config.py             # API configuration
│   │   │   └── security.py           # Security utilities
│   │   ├── db/
│   │   │   ├── __init__.py
│   │   │   ├── session.py            # Database session
│   │   │   └── models.py             # Database models for users, searches, and reports
│   │   ├── schemas/
│   │   │   ├── __init__.py
│   │   │   ├── token.py              # Token schemas
│   │   │   ├── user.py               # User schemas
│   │   │   ├── query.py              # Query schemas
│   │   │   ├── search.py             # Search result schemas
│   │   │   └── report.py             # Report schemas
│   │   ├── services/
│   │   │   ├── __init__.py
│   │   │   ├── query_service.py      # Query processing service
│   │   │   ├── search_service.py     # Search execution service
│   │   │   └── report_service.py     # Report generation service
│   │   └── main.py                   # FastAPI application
│   ├── alembic/                      # Database migrations
│   │   ├── versions/
│   │   │   └── 001_initial_migration.py  # Initial migration
│   │   ├── env.py                    # Alembic environment
│   │   └── script.py.mako            # Alembic script template
│   ├── .env.example                  # Environment variables template
│   ├── alembic.ini                   # Alembic configuration
│   ├── requirements.txt              # API dependencies
│   ├── run.py                        # Script to run the API
│   └── README.md                     # API documentation
├── run_ui.py         # Script to run the UI
└── requirements.txt  # Project dependencies

Module Details

Config Module

The config module manages configuration settings for the entire system, including API keys, model selections, and other parameters.

Files

  • __init__.py: Package initialization file
  • config.py: Configuration management class
  • config.yaml: YAML configuration file with settings for different components

Classes

  • Config: Singleton class for loading and accessing configuration settings
    • load_config(config_path): Loads configuration from a YAML file
    • get(key, default=None): Gets a configuration value by key
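
A minimal sketch of how a singleton with `load_config` and `get` could behave. It is illustrative only: the real module reads `config.yaml`, whereas this sketch parses JSON so that it runs with the standard library alone, and the dotted-key lookup is an assumption rather than documented behavior.

```python
import json
from typing import Any, Optional


class Config:
    """Singleton holding settings loaded from a file (illustrative sketch)."""

    _instance: Optional["Config"] = None

    def __new__(cls) -> "Config":
        # Create the shared instance on first use, then always return it.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._data = {}
        return cls._instance

    def load_config(self, config_path: str) -> None:
        # Stand-in for YAML parsing: the actual project stores settings in config.yaml.
        with open(config_path, encoding="utf-8") as fh:
            self._data = json.load(fh)

    def get(self, key: str, default: Any = None) -> Any:
        # Assumption: nested settings are addressed with dotted keys, e.g. "models.default".
        node: Any = self._data
        for part in key.split("."):
            if not isinstance(node, dict) or part not in node:
                return default
            node = node[part]
        return node
```

Because every `Config()` call returns the same object, a module that loads the file once makes the settings visible to all other modules without passing the instance around.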

Query Module

The query module handles the processing and enhancement of user queries, including classification and optimization for search.

Files

  • __init__.py: Package initialization file
  • query_processor.py: Main module for processing user queries
  • query_classifier.py: Module for classifying query types
  • llm_interface.py: Interface for interacting with LLM providers

Classes

  • QueryProcessor: Main class for processing user queries

    • process_query(query): Processes a user query and returns enhanced results
    • classify_query(query): Classifies a query by type and intent
    • generate_search_queries(query, classification): Generates optimized search queries
  • QueryClassifier: Class for classifying queries

    • classify(query): Classifies a query by type, intent, and entities
  • LLMInterface: Interface for interacting with LLM providers

    • get_completion(prompt, model=None): Gets a completion from an LLM
    • enhance_query(query): Enhances a query with additional context
    • classify_query(query): Uses an LLM to classify a query
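
The way these classes might fit together can be sketched as follows. `StubLLM` is a hypothetical stand-in that makes no network calls, the methods are written as coroutines on the assumption that they follow the async conversion recorded under Recent Updates, and the shape of the returned dictionary is invented for illustration.

```python
import asyncio
from typing import Any, Dict, List


class StubLLM:
    """Hypothetical stand-in for LLMInterface; returns canned answers."""

    async def enhance_query(self, query: str) -> str:
        return f"{query} overview"

    async def classify_query(self, query: str) -> Dict[str, str]:
        kind = "comparative" if " vs " in query.lower() else "factual"
        return {"type": kind}


class QueryProcessor:
    def __init__(self, llm: StubLLM) -> None:
        self.llm = llm

    async def classify_query(self, query: str) -> Dict[str, str]:
        return await self.llm.classify_query(query)

    async def generate_search_queries(
        self, query: str, classification: Dict[str, str]
    ) -> List[str]:
        # Assumption: comparative queries get one extra search variant.
        queries = [query]
        if classification["type"] == "comparative":
            queries.append(f"{query} comparison")
        return queries

    async def process_query(self, query: str) -> Dict[str, Any]:
        # Enhance, classify, then derive the search queries from both.
        enhanced = await self.llm.enhance_query(query)
        classification = await self.classify_query(query)
        search_queries = await self.generate_search_queries(enhanced, classification)
        return {
            "original_query": query,
            "enhanced_query": enhanced,
            "classification": classification,
            "search_queries": search_queries,
        }


if __name__ == "__main__":
    print(asyncio.run(QueryProcessor(StubLLM()).process_query("LFP vs NMC")))
```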

Execution Module

The execution module handles the execution of search queries across multiple search engines and the collection of results.

Files

  • __init__.py: Package initialization file
  • search_executor.py: Module for executing search queries
  • result_collector.py: Module for collecting and processing search results
  • api_handlers/: Directory containing handlers for different search APIs
    • __init__.py: Package initialization file
    • base_handler.py: Base class for search handlers
    • serper_handler.py: Handler for Serper API (Google search)
    • scholar_handler.py: Handler for Google Scholar via Serper
    • google_handler.py: Handler for Google search
    • arxiv_handler.py: Handler for arXiv API

Classes

  • SearchExecutor: Class for executing search queries

    • execute_search(query_data): Executes a search across multiple engines
    • _execute_search_async(query, engines): Executes a search asynchronously
    • _execute_search_sync(query, engines): Executes a search synchronously
  • ResultCollector: Class for collecting and processing search results

    • process_results(search_results): Processes search results from multiple engines
    • deduplicate_results(results): Deduplicates results based on URL
    • save_results(results, file_path): Saves results to a file
  • BaseSearchHandler: Base class for search handlers

    • search(query, num_results): Abstract method for searching
    • _process_response(response): Processes the API response
  • SerperSearchHandler: Handler for Serper API

    • search(query, num_results): Searches using Serper API
    • _process_response(response): Processes the Serper API response
  • ScholarSearchHandler: Handler for Google Scholar via Serper

    • search(query, num_results): Searches Google Scholar
    • _process_response(response): Processes the Scholar API response
  • ArxivSearchHandler: Handler for arXiv API

    • search(query, num_results): Searches arXiv
    • _process_response(response): Processes the arXiv API response
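
The handler pattern and URL-based deduplication described above can be sketched roughly as follows. `StubHandler` is hypothetical and returns canned data instead of calling an API, the result field names (`url`, `title`, `source`) are assumptions, and making `search` a coroutine is one possible reading of the asynchronous execution path.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Result = Dict[str, Any]


class BaseSearchHandler(ABC):
    """Common interface that every search handler implements."""

    @abstractmethod
    async def search(self, query: str, num_results: int = 10) -> List[Result]:
        """Return up to num_results normalized results for the query."""


class StubHandler(BaseSearchHandler):
    """Hypothetical handler that serves canned results."""

    def __init__(self, name: str, urls: List[str]) -> None:
        self.name = name
        self.urls = urls

    async def search(self, query: str, num_results: int = 10) -> List[Result]:
        raw = [{"link": url, "title": f"{query} ({self.name})"} for url in self.urls]
        return self._process_response(raw)[:num_results]

    def _process_response(self, response: List[Dict[str, str]]) -> List[Result]:
        # Map the engine-specific payload onto shared field names.
        return [
            {"url": item["link"], "title": item["title"], "source": self.name}
            for item in response
        ]


async def execute_search(
    query: str, handlers: Dict[str, BaseSearchHandler], num_results: int = 10
) -> Dict[str, List[Result]]:
    # Query every engine concurrently and key the results by engine name.
    names = list(handlers)
    batches = await asyncio.gather(
        *(handlers[name].search(query, num_results) for name in names)
    )
    return dict(zip(names, batches))


def deduplicate_results(results: List[Result]) -> List[Result]:
    # Keep the first result seen for each URL and preserve the input order.
    seen = set()
    unique = []
    for result in results:
        if result["url"] not in seen:
            seen.add(result["url"])
            unique.append(result)
    return unique
```

Keeping the response normalization inside each handler means the collector only has to deal with one result shape, whichever engine produced it.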

Ranking Module

The ranking module provides functionality for reranking and prioritizing documents based on their relevance to the user's query.

Files

  • __init__.py: Package initialization file
  • jina_reranker.py: Module for reranking documents using Jina AI
  • filter_manager.py: Module for filtering documents

Classes

  • JinaReranker: Class for reranking documents

    • rerank(documents, query): Reranks documents based on relevance to query
    • _prepare_inputs(documents, query): Prepares inputs for the reranker
  • FilterManager: Class for filtering documents

    • filter_by_date(documents, start_date, end_date): Filters by date
    • filter_by_source(documents, sources): Filters by source
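
As a rough illustration of the two filters, assuming each document is a dictionary with a `date` (`datetime.date`) and a `source` key; neither key name is confirmed by the project, and dropping undated documents is a design choice made only for this sketch:

```python
from datetime import date
from typing import Any, Dict, Iterable, List

Document = Dict[str, Any]


class FilterManager:
    def filter_by_date(
        self, documents: List[Document], start_date: date, end_date: date
    ) -> List[Document]:
        # Inclusive range; documents without a date are dropped.
        return [
            doc
            for doc in documents
            if doc.get("date") is not None and start_date <= doc["date"] <= end_date
        ]

    def filter_by_source(
        self, documents: List[Document], sources: Iterable[str]
    ) -> List[Document]:
        # Keep only documents whose source is in the allowed set.
        allowed = set(sources)
        return [doc for doc in documents if doc.get("source") in allowed]
```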

Report Templates Module

The report_templates module provides a template system for generating reports with different detail levels and query types.

Files

  • __init__.py: Package initialization file
  • report_templates.py: Module for managing report templates

Classes

  • QueryType (Enum): Defines the types of queries supported by the system

    • FACTUAL: For factual queries seeking specific information
    • EXPLORATORY: For exploratory queries investigating a topic
    • COMPARATIVE: For comparative queries comparing multiple items
  • DetailLevel (Enum): Defines the levels of detail for generated reports

    • BRIEF: Short summary with key findings
    • STANDARD: Standard report with introduction, key findings, and analysis
    • DETAILED: Detailed report with methodology and more in-depth analysis
    • COMPREHENSIVE: Comprehensive report with executive summary, literature review, and appendices
  • ReportTemplate: Class representing a report template

    • template (str): The template string with placeholders
    • detail_level (DetailLevel): The detail level of the template
    • query_type (QueryType): The query type the template is designed for
    • model (Optional[str]): The LLM model recommended for this template
    • required_sections (Optional[List[str]]): Required sections in the template
    • validate(): Validates that the template contains all required sections
  • ReportTemplateManager: Class for managing report templates

    • add_template(template): Adds a template to the manager
    • get_template(query_type, detail_level): Gets a template for a specific query type and detail level
    • get_available_templates(): Gets a list of available templates
    • initialize_default_templates(): Initializes the default templates for all combinations of query types and detail levels
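
The classes above can be sketched in a few lines. The enum values, the placeholder names, and the rule that a section counts as present when its placeholder appears in the template text are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class QueryType(Enum):
    FACTUAL = "factual"
    EXPLORATORY = "exploratory"
    COMPARATIVE = "comparative"


class DetailLevel(Enum):
    BRIEF = "brief"
    STANDARD = "standard"
    DETAILED = "detailed"
    COMPREHENSIVE = "comprehensive"


@dataclass
class ReportTemplate:
    template: str
    detail_level: DetailLevel
    query_type: QueryType
    model: Optional[str] = None
    required_sections: Optional[List[str]] = None

    def validate(self) -> bool:
        # Assumption: a required section is present if its text occurs in the template.
        return all(section in self.template for section in self.required_sections or [])


class ReportTemplateManager:
    def __init__(self) -> None:
        self.templates: Dict[Tuple[QueryType, DetailLevel], ReportTemplate] = {}

    def add_template(self, template: ReportTemplate) -> None:
        self.templates[(template.query_type, template.detail_level)] = template

    def get_template(
        self, query_type: QueryType, detail_level: DetailLevel
    ) -> Optional[ReportTemplate]:
        return self.templates.get((query_type, detail_level))

    def get_available_templates(self) -> List[str]:
        return [f"{q.value}_{d.value}" for q, d in self.templates]

    def initialize_default_templates(self) -> None:
        # One template per combination: 3 query types x 4 detail levels = 12.
        for query_type in QueryType:
            for detail_level in DetailLevel:
                self.add_template(
                    ReportTemplate(
                        template="# {title}\n\n## Key Findings\n{key_findings}",
                        detail_level=detail_level,
                        query_type=query_type,
                        required_sections=["{key_findings}"],
                    )
                )
```

Keying the store on the `(query_type, detail_level)` pair makes lookup a single dictionary access and guarantees at most one template per combination.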

Progressive Report Synthesis Module

The progressive_report_synthesis module synthesizes reports from document chunks progressively: chunks are processed in batches, and the report is refined after each batch until a termination condition is met.

Files

  • __init__.py: Package initialization file
  • progressive_report_synthesis.py: Module for progressive report generation

Classes

  • ReportState: Class to track the state of a progressive report

    • current_report (str): The current version of the report
    • processed_chunks (Set[str]): Set of document IDs that have been processed
    • version (int): Current version number of the report
    • last_update_time (float): Timestamp of the last update
    • improvement_scores (List[float]): List of improvement scores for each iteration
    • is_complete (bool): Whether the report generation is complete
    • termination_reason (Optional[str]): Reason for termination if complete
  • ProgressiveReportSynthesizer: Class for progressive report synthesis

    • Extends ReportSynthesizer to implement a progressive approach
    • set_progress_callback(callback): Sets a callback function to report progress
    • prioritize_chunks(chunks, query): Prioritizes chunks based on relevance
    • extract_information_from_chunk(chunk, query, detail_level): Extracts key information from a chunk
    • refine_report(current_report, new_information, query, query_type, detail_level): Refines the report with new information
    • initialize_report(initial_chunks, query, query_type, detail_level): Initializes the report with the first batch of chunks
    • should_terminate(improvement_score): Determines if the process should terminate
    • synthesize_report_progressively(chunks, query, query_type, detail_level): Main method for progressive report generation
    • synthesize_report(chunks, query, query_type, detail_level): Overrides the parent method to use the progressive approach for the comprehensive detail level
  • get_progressive_report_synthesizer(model_name): Factory function to get a singleton instance
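
A simplified sketch of the progressive loop, written with free functions instead of methods. The `refine` callback stands in for the LLM-backed extraction and refinement steps; the `id` and `score` chunk keys, the thresholds, the batch size, and the reason strings are all assumptions. The three termination conditions follow the ones listed under Recent Updates.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

Chunk = Dict[str, object]


@dataclass
class ReportState:
    current_report: str = ""
    processed_chunks: Set[str] = field(default_factory=set)
    version: int = 0
    last_update_time: float = field(default_factory=time.time)
    improvement_scores: List[float] = field(default_factory=list)
    is_complete: bool = False
    termination_reason: Optional[str] = None


def prioritize_chunks(chunks: List[Chunk]) -> List[Chunk]:
    # Most relevant chunks first; a missing score sorts last.
    return sorted(chunks, key=lambda c: c.get("score", 0.0), reverse=True)


def should_terminate(
    state: ReportState,
    total_chunks: int,
    min_improvement: float = 0.2,
    max_iterations: int = 20,
) -> bool:
    recent = state.improvement_scores[-2:]
    if len(state.processed_chunks) >= total_chunks:
        reason = "all_chunks_processed"
    elif state.version >= max_iterations:
        reason = "max_iterations_reached"
    elif len(recent) == 2 and max(recent) < min_improvement:
        # Two consecutive weak refinements count as diminishing returns.
        reason = "diminishing_returns"
    else:
        return False
    state.is_complete = True
    state.termination_reason = reason
    return True


def synthesize_progressively(
    chunks: List[Chunk],
    refine: Callable[[str, List[Chunk]], Tuple[str, float]],
    batch_size: int = 2,
) -> ReportState:
    state = ReportState()
    ordered = prioritize_chunks(chunks)
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        # refine returns the updated report and an improvement score.
        state.current_report, score = refine(state.current_report, batch)
        state.processed_chunks.update(str(c["id"]) for c in batch)
        state.version += 1
        state.last_update_time = time.time()
        state.improvement_scores.append(score)
        if should_terminate(state, len(ordered)):
            break
    return state
```

Processing the highest-scoring chunks first means that an early stop on diminishing returns discards only the least relevant material.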

FastAPI Backend Module

The sim-search-api module provides a RESTful API for the sim-search system, allowing for query processing, search execution, and report generation through HTTP endpoints.

Files

  • app/: Main application directory
    • api/: API routes and dependencies
      • routes/: API route handlers
        • auth.py: Authentication routes
        • query.py: Query processing routes
        • search.py: Search execution routes
        • report.py: Report generation routes
      • dependencies.py: API dependencies (auth, rate limiting)
    • core/: Core functionality
      • config.py: API configuration
      • security.py: Security utilities
    • db/: Database models and session management
      • models.py: Database models for users, searches, and reports
      • session.py: Database session management
    • schemas/: Pydantic schemas for request/response validation
      • token.py: Token schemas
      • user.py: User schemas
      • query.py: Query schemas
      • search.py: Search result schemas
      • report.py: Report schemas
    • services/: Service layer for business logic
      • query_service.py: Query processing service
      • search_service.py: Search execution service
      • report_service.py: Report generation service
    • main.py: FastAPI application entry point
  • alembic/: Database migrations
    • versions/: Migration versions
      • 001_initial_migration.py: Initial migration
    • env.py: Alembic environment
    • script.py.mako: Alembic script template
  • alembic.ini: Alembic configuration
  • requirements.txt: API dependencies
  • run.py: Script to run the API
  • .env.example: Environment variables template
  • README.md: API documentation

Classes

  • app.db.models.User: User model for authentication

    • id (str): User ID
    • email (str): User email
    • hashed_password (str): Hashed password
    • full_name (str): User's full name
    • is_active (bool): Whether the user is active
    • is_superuser (bool): Whether the user is a superuser
  • app.db.models.Search: Search model for storing search results

    • id (str): Search ID
    • user_id (str): User ID
    • query (str): Original query
    • enhanced_query (str): Enhanced query
    • query_type (str): Query type
    • engines (str): Search engines used
    • results_count (int): Number of results
    • results (JSON): Search results
    • created_at (datetime): Creation timestamp
  • app.db.models.Report: Report model for storing generated reports

    • id (str): Report ID
    • user_id (str): User ID
    • search_id (str): Search ID
    • title (str): Report title
    • content (str): Report content
    • detail_level (str): Detail level
    • query_type (str): Query type
    • model_used (str): Model used for generation
    • created_at (datetime): Creation timestamp
    • updated_at (datetime): Update timestamp
  • app.services.QueryService: Service for query processing

    • process_query(query): Processes a query
    • classify_query(query): Classifies a query
  • app.services.SearchService: Service for search execution

    • execute_search(structured_query, search_engines, num_results, timeout, user_id, db): Executes a search
    • get_available_search_engines(): Gets available search engines
    • get_search_results(search): Gets results for a specific search
  • app.services.ReportService: Service for report generation

    • generate_report_background(report_id, report_in, search, db, progress_dict): Generates a report in the background
    • generate_report_file(report, format): Generates a report file in the specified format
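
The background-generation pattern with a shared progress dictionary might look roughly like this. The signature is deliberately simplified relative to `generate_report_background`: the `chunks` and `store` parameters are hypothetical stand-ins for the search results and the database session, and the join step stands in for the actual synthesis.

```python
import asyncio
from typing import Dict, List


async def generate_report_background(
    report_id: str,
    chunks: List[str],
    progress_dict: Dict[str, float],
    store: Dict[str, str],
) -> None:
    """Build a report piece by piece while publishing progress (sketch)."""
    progress_dict[report_id] = 0.0
    parts: List[str] = []
    for index, chunk in enumerate(chunks, start=1):
        # Yield to the event loop; stands in for an awaited LLM call.
        await asyncio.sleep(0)
        parts.append(chunk.strip())
        progress_dict[report_id] = index / len(chunks)
    store[report_id] = "\n\n".join(parts)
    progress_dict[report_id] = 1.0


async def main() -> None:
    progress: Dict[str, float] = {}
    store: Dict[str, str] = {}
    # Schedule the work without blocking; a status route could read `progress`.
    task = asyncio.create_task(
        generate_report_background("r1", ["Intro ", " Findings"], progress, store)
    )
    await task
    print(progress["r1"], store["r1"])


if __name__ == "__main__":
    asyncio.run(main())
```

Because the progress dictionary is keyed by report ID, a separate status endpoint can poll it while generation is still running.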

Recent Updates

2025-03-20: FastAPI Backend Implementation

  1. FastAPI Application Structure:

    • Created a new directory sim-search-api for the FastAPI application
    • Set up project structure with API routes, core functionality, database models, schemas, and services
    • Implemented a layered architecture with API, service, and data layers
    • Added __init__.py files so that every directory is a Python package
  2. API Routes Implementation:

    • Created authentication routes for user registration and token generation
    • Implemented query processing routes for query enhancement and classification
    • Added search execution routes for executing searches and managing search history
    • Created report generation routes for generating and managing reports
    • Implemented proper error handling and validation for all routes
  3. Service Layer Implementation:

    • Created QueryService to bridge between API and existing query processing functionality
    • Implemented SearchService for search execution and result management
    • Added ReportService for report generation and management
    • Ensured proper integration with existing sim-search functionality
    • Implemented asynchronous operation for all services
  4. Database Setup:

    • Created SQLAlchemy models for users, searches, and reports
    • Implemented database session management
    • Set up Alembic for database migrations
    • Created initial migration script to create all tables
    • Added proper relationships between models
  5. Authentication and Security:

    • Implemented JWT-based authentication
    • Added password hashing and verification
    • Created token generation and validation
    • Implemented user registration and login
    • Added proper authorization for protected routes
  6. Documentation and Configuration:

    • Created comprehensive API documentation
    • Added OpenAPI documentation endpoints
    • Implemented environment variable configuration
    • Created a README with setup and usage instructions
    • Added example environment variables file

2025-03-12: Progressive Report Generation Implementation

  1. Progressive Report Synthesis Module:

    • Created a new module progressive_report_synthesis.py for progressive report generation
    • Implemented ReportState class to track the state of a progressive report
    • Created ProgressiveReportSynthesizer class extending ReportSynthesizer
    • Implemented chunk prioritization algorithm based on relevance scores
    • Developed iterative refinement process with specialized prompts
    • Added state management to track report versions and processed chunks
    • Implemented termination conditions (all chunks processed, diminishing returns, max iterations)
    • Added support for different models with adaptive batch sizing
    • Implemented progress tracking and callback mechanism
  2. Report Generator Integration:

    • Modified report_generator.py to use the progressive report synthesizer for comprehensive detail level
    • Created a hybrid system that uses standard map-reduce for the brief, standard, and detailed levels and the progressive synthesizer for the comprehensive level
    • Added proper model selection and configuration for both synthesizers
  3. Testing:

    • Created test_progressive_report.py to test progressive report generation
    • Implemented comparison functionality between progressive and standard approaches
    • Added test cases for different query types and document collections

2025-03-11: Report Templates Implementation

  1. Report Templates Module:

    • Created a new module report_templates.py for managing report templates
    • Implemented enums for query types (FACTUAL, EXPLORATORY, COMPARATIVE) and detail levels (BRIEF, STANDARD, DETAILED, COMPREHENSIVE)
    • Created a template system with placeholders for different report sections
    • Implemented 12 different templates (3 query types × 4 detail levels)
    • Added validation to ensure templates contain all required sections
  2. Report Synthesis Integration:

    • Updated the report synthesis module to use the new template system
    • Added support for different templates based on query type and detail level
    • Implemented fallback to standard templates when specific templates are not found
    • Improved logging for the template retrieval process
  3. Testing:

    • Created test_report_templates.py to test template retrieval and validation
    • Created test_brief_report.py to test brief report generation
    • Successfully tested all combinations of detail levels and query types
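
The fallback behavior mentioned under Report Synthesis Integration could be sketched as below. It assumes that "standard" means the standard detail level for the same query type, which the notes do not spell out, and it uses plain string keys and a hypothetical `select_template` helper to stay self-contained.

```python
import logging
from typing import Dict, Optional, Tuple

logger = logging.getLogger(__name__)

# Key is (query_type, detail_level); plain strings keep the sketch self-contained.
TemplateKey = Tuple[str, str]


def select_template(
    templates: Dict[TemplateKey, str], query_type: str, detail_level: str
) -> Optional[str]:
    exact = templates.get((query_type, detail_level))
    if exact is not None:
        logger.info("Using template for %s/%s", query_type, detail_level)
        return exact
    # Assumption: fall back to the standard detail level of the same query type.
    fallback = templates.get((query_type, "standard"))
    if fallback is not None:
        logger.warning(
            "No template for %s/%s; falling back to standard", query_type, detail_level
        )
    else:
        logger.error("No template available for query type %s", query_type)
    return fallback
```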

2025-02-28: Async Implementation and Reference Formatting

  1. LLM Interface Updates:

    • Converted key methods to async:
      • generate_completion
      • classify_query
      • enhance_query
      • generate_search_queries
    • Added special handling for Gemini models
    • Improved reference formatting instructions
  2. Query Processor Updates:

    • Updated process_query to be async
    • Made generate_search_queries async
    • Fixed async/await patterns throughout
  3. Gradio Interface Updates:

    • Modified generate_report to handle async operations
    • Updated report button click handler
    • Improved error handling
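
One way an async report handler with error handling could be structured is sketched below. `build_report` is a hypothetical placeholder for the real pipeline, and the synchronous wrapper is included only to show how a non-async caller, such as a command-line script, might reuse the same coroutine.

```python
import asyncio


async def build_report(query: str) -> str:
    """Hypothetical pipeline step; the real one awaits search and LLM calls."""
    if not query.strip():
        raise ValueError("query must not be empty")
    await asyncio.sleep(0)  # placeholder for awaited work
    return f"# Report\n\nQuery: {query}"


async def generate_report(query: str) -> str:
    """Async UI handler: returns Markdown, or a readable error message."""
    try:
        return await build_report(query)
    except Exception as exc:
        # Surface the failure in the interface instead of raising.
        return f"Error generating report: {exc}"


def generate_report_sync(query: str) -> str:
    """Entry point for synchronous callers."""
    return asyncio.run(generate_report(query))
```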