Code Structure

Current Project Organization

sim-search/
├── config/
│   ├── __init__.py
│   ├── config.py              # Configuration management
│   └── config.yaml            # Configuration file
├── query/
│   ├── __init__.py
│   ├── query_processor.py     # Module for processing user queries
│   └── llm_interface.py       # Module for interacting with LLM providers
├── execution/
│   ├── __init__.py
│   ├── search_executor.py     # Module for executing search queries
│   ├── result_collector.py    # Module for collecting search results
│   └── api_handlers/          # Handlers for different search APIs
│       ├── __init__.py
│       ├── base_handler.py    # Base class for search handlers
│       ├── serper_handler.py  # Handler for Serper API (Google search)
│       ├── scholar_handler.py # Handler for Google Scholar via Serper
│       ├── google_handler.py  # Handler for Google search
│       └── arxiv_handler.py   # Handler for arXiv API
├── ranking/
│   ├── __init__.py
│   └── jina_reranker.py       # Module for reranking documents using Jina AI
├── report/
│   ├── __init__.py
│   ├── report_generator.py    # Module for generating reports
│   ├── report_synthesis.py    # Module for synthesizing reports
│   ├── document_processor.py  # Module for processing documents
│   ├── document_scraper.py    # Module for scraping documents
│   ├── report_detail_levels.py # Module for managing report detail levels
│   └── database/              # Database for storing reports
│       ├── __init__.py
│       └── db_manager.py      # Module for managing the database
├── ui/
│   ├── __init__.py
│   └── gradio_interface.py    # Gradio-based web interface
├── utils/
│   ├── __init__.py
│   ├── jina_similarity.py     # Module for computing text similarity
│   └── markdown_segmenter.py  # Module for segmenting markdown documents
├── scripts/
│   └── query_to_report.py     # Script for generating reports from queries
├── tests/
│   ├── __init__.py
│   ├── query/                 # Tests for query module
│   │   ├── __init__.py
│   │   ├── test_query_processor.py
│   │   ├── test_query_processor_comprehensive.py
│   │   └── test_llm_interface.py
│   ├── execution/             # Tests for execution module
│   │   ├── __init__.py
│   │   ├── test_search.py
│   │   ├── test_search_execution.py
│   │   └── test_all_handlers.py
│   ├── ranking/               # Tests for ranking module
│   │   ├── __init__.py
│   │   ├── test_reranker.py
│   │   ├── test_similarity.py
│   │   └── test_simple_reranker.py
│   ├── report/                # Tests for report module
│   │   ├── __init__.py
│   │   ├── test_custom_model.py
│   │   └── test_detail_levels.py
│   ├── ui/                    # Tests for UI module
│   │   ├── __init__.py
│   │   └── test_ui_search.py
│   ├── integration/           # Integration tests
│   │   ├── __init__.py
│   │   ├── test_ev_query.py
│   │   └── test_query_to_report.py
│   ├── test_document_processor.py
│   ├── test_document_scraper.py
│   └── test_report_synthesis.py
├── examples/
│   ├── __init__.py
│   ├── data/                  # Example data files
│   └── scripts/               # Example scripts
│       └── __init__.py
├── run_ui.py                  # Script to run the UI
└── requirements.txt           # Project dependencies

Module Details

Config Module

The config module manages configuration settings for the entire system, including API keys, model selections, and other parameters.

Files

  • __init__.py: Package initialization file
  • config.py: Configuration management class
  • config.yaml: YAML configuration file with settings for different components

Classes

  • Config: Singleton class for loading and accessing configuration settings
    • load_config(config_path): Loads configuration from a YAML file
    • get(key, default=None): Gets a configuration value by key
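
The singleton behaviour described above can be sketched as follows. This is a minimal illustration, not the project's actual `config.py`: the `_settings` attribute and the dotted-key lookup in `get` are assumptions made for the example, and `yaml.safe_load` comes from PyYAML.

```python
class Config:
    """Illustrative singleton; the real Config class may be implemented differently."""

    _instance = None

    def __new__(cls):
        # Create the shared instance once, then hand it back on every call.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._settings = {}  # assumed internal storage
        return cls._instance

    def load_config(self, config_path):
        # PyYAML is imported lazily so the class can be used without it.
        import yaml

        with open(config_path, "r", encoding="utf-8") as fh:
            self._settings = yaml.safe_load(fh) or {}

    def get(self, key, default=None):
        # Assumption: nested values are addressed with dotted keys ("a.b.c").
        node = self._settings
        for part in key.split("."):
            if not isinstance(node, dict) or part not in node:
                return default
            node = node[part]
        return node
```

Because every `Config()` call returns the same object, a module that loads the YAML file once makes the settings visible to all other modules.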

Query Module

The query module processes and enhances user queries, including classifying them and optimizing them for search.

Files

  • __init__.py: Package initialization file
  • query_processor.py: Main module for processing user queries
  • query_classifier.py: Module for classifying query types
  • llm_interface.py: Interface for interacting with LLM providers

Classes

  • QueryProcessor: Main class for processing user queries

    • process_query(query): Processes a user query and returns enhanced results
    • classify_query(query): Classifies a query by type and intent
    • generate_search_queries(query, classification): Generates optimized search queries
  • QueryClassifier: Class for classifying queries

    • classify(query): Classifies a query by type, intent, and entities
  • LLMInterface: Interface for interacting with LLM providers

    • get_completion(prompt, model=None): Gets a completion from an LLM
    • enhance_query(query): Enhances a query with additional context
    • classify_query(query): Uses an LLM to classify a query
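
How these classes fit together might look like the sketch below. It is only an illustration: `StubLLMInterface` is a hypothetical stand-in that returns canned values instead of calling a provider, and the return shapes (a classification dict, a mapping from engine name to query) are assumptions, not the project's actual data structures.

```python
class StubLLMInterface:
    """Hypothetical stand-in for LLMInterface; no provider is contacted."""

    def enhance_query(self, query):
        return query.strip() + " (enhanced)"

    def classify_query(self, query):
        # Assumed rule for the demo: queries mentioning papers are academic.
        kind = "academic" if "paper" in query.lower() else "general"
        return {"type": kind}


class QueryProcessor:
    def __init__(self, llm):
        self.llm = llm

    def classify_query(self, query):
        return self.llm.classify_query(query)

    def generate_search_queries(self, query, classification):
        # Assumed routing: academic queries also go to Scholar and arXiv.
        engines = ["serper"]
        if classification.get("type") == "academic":
            engines += ["scholar", "arxiv"]
        return {engine: query for engine in engines}

    def process_query(self, query):
        enhanced = self.llm.enhance_query(query)
        classification = self.classify_query(query)
        return {
            "original_query": query,
            "enhanced_query": enhanced,
            "classification": classification,
            "search_queries": self.generate_search_queries(enhanced, classification),
        }


result = QueryProcessor(StubLLMInterface()).process_query("papers on solid-state batteries")
print(sorted(result["search_queries"]))
```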

Execution Module

The execution module executes search queries across multiple search engines and collects the results.

Files

  • __init__.py: Package initialization file
  • search_executor.py: Module for executing search queries
  • result_collector.py: Module for collecting and processing search results
  • api_handlers/: Directory containing handlers for different search APIs
    • __init__.py: Package initialization file
    • base_handler.py: Base class for search handlers
    • serper_handler.py: Handler for Serper API (Google search)
    • scholar_handler.py: Handler for Google Scholar via Serper
    • google_handler.py: Handler for Google search
    • arxiv_handler.py: Handler for arXiv API

Classes

  • SearchExecutor: Class for executing search queries

    • execute_search(query_data): Executes a search across multiple engines
    • _execute_search_async(query, engines): Executes a search asynchronously
    • _execute_search_sync(query, engines): Executes a search synchronously
  • ResultCollector: Class for collecting and processing search results

    • process_results(search_results): Processes search results from multiple engines
    • deduplicate_results(results): Deduplicates results based on URL
    • save_results(results, file_path): Saves results to a file
  • BaseSearchHandler: Base class for search handlers

    • search(query, num_results): Abstract method for searching
    • _process_response(response): Processes the API response
  • SerperSearchHandler: Handler for Serper API

    • search(query, num_results): Searches using Serper API
    • _process_response(response): Processes the Serper API response
  • ScholarSearchHandler: Handler for Google Scholar via Serper

    • search(query, num_results): Searches Google Scholar
    • _process_response(response): Processes the Scholar API response
  • ArxivSearchHandler: Handler for arXiv API

    • search(query, num_results): Searches arXiv
    • _process_response(response): Processes the arXiv API response
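
The handler hierarchy and URL-based deduplication described above could be sketched like this. `FakeSearchHandler`, the `"url"` and `"title"` result keys, and the choice to make `_process_response` abstract are assumptions for illustration; the real handlers call external APIs and may normalize results differently.

```python
from abc import ABC, abstractmethod


class BaseSearchHandler(ABC):
    @abstractmethod
    def search(self, query, num_results):
        """Run a search and return a list of normalized result dicts."""

    @abstractmethod
    def _process_response(self, response):
        """Convert a raw API response into normalized result dicts."""


class FakeSearchHandler(BaseSearchHandler):
    """Hypothetical handler that fabricates a response instead of calling an API."""

    def search(self, query, num_results):
        response = {
            "items": [
                {"link": f"https://example.com/{i}", "title": f"{query} {i}"}
                for i in range(num_results)
            ]
        }
        return self._process_response(response)

    def _process_response(self, response):
        return [{"url": item["link"], "title": item["title"]} for item in response["items"]]


def deduplicate_results(results):
    """Keep the first result seen for each URL, preserving order."""
    seen = set()
    unique = []
    for result in results:
        url = result.get("url")
        if url in seen:
            continue
        seen.add(url)
        unique.append(result)
    return unique


handler = FakeSearchHandler()
merged = deduplicate_results(handler.search("ev range", 2) + handler.search("ev range", 3))
print(len(merged))
```

Keeping `search` abstract in the base class means a new engine only has to supply its request logic and response mapping; the executor can treat every handler the same way.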

Ranking Module

The ranking module reranks and prioritizes documents by their relevance to the user's query.

Files

  • __init__.py: Package initialization file
  • jina_reranker.py: Module for reranking documents using Jina AI
  • filter_manager.py: Module for filtering documents

Classes

  • JinaReranker: Class for reranking documents

    • rerank(documents, query): Reranks documents based on relevance to query
    • _prepare_inputs(documents, query): Prepares inputs for the reranker
  • FilterManager: Class for filtering documents

    • filter_by_date(documents, start_date, end_date): Filters by date
    • filter_by_source(documents, sources): Filters by source
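
A minimal sketch of the two filters follows. It assumes each document is a dict with a `"date"` key holding a `datetime.date` and a `"source"` key holding a string, and that the date range is inclusive at both ends; none of these details are confirmed by the project code.

```python
from datetime import date


class FilterManager:
    def filter_by_date(self, documents, start_date, end_date):
        # Assumption: documents without a date are dropped; bounds are inclusive.
        return [
            doc
            for doc in documents
            if doc.get("date") is not None and start_date <= doc["date"] <= end_date
        ]

    def filter_by_source(self, documents, sources):
        allowed = set(sources)
        return [doc for doc in documents if doc.get("source") in allowed]


docs = [
    {"title": "A", "date": date(2024, 5, 1), "source": "arxiv"},
    {"title": "B", "date": date(2025, 1, 15), "source": "serper"},
    {"title": "C", "source": "arxiv"},  # no date recorded
]
manager = FilterManager()
recent = manager.filter_by_date(docs, date(2025, 1, 1), date(2025, 12, 31))
print([doc["title"] for doc in recent])
```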

Recent Updates

2025-02-28: Async Implementation and Reference Formatting

  1. LLM Interface Updates:

    • Converted key methods to async:
      • generate_completion
      • classify_query
      • enhance_query
      • generate_search_queries
    • Added special handling for Gemini models
    • Improved reference formatting instructions
  2. Query Processor Updates:

    • Updated process_query to be async
    • Made generate_search_queries async
    • Fixed async/await patterns throughout
  3. Gradio Interface Updates:

    • Modified generate_report to handle async operations
    • Updated report button click handler
    • Improved error handling
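
The async conversion listed above might follow a pattern like the one below. This is a sketch under assumptions: `StubAsyncLLM` is a hypothetical stand-in for the LLM interface, `generate_report_sync` is an illustrative name for a synchronous entry point, and running the enhancement and classification calls concurrently is a design choice made for the example rather than something the notes confirm.

```python
import asyncio


class StubAsyncLLM:
    """Hypothetical async stand-in; awaits briefly instead of calling a provider."""

    async def enhance_query(self, query):
        await asyncio.sleep(0)
        return query + " (enhanced)"

    async def classify_query(self, query):
        await asyncio.sleep(0)
        return {"type": "factual"}

    async def generate_search_queries(self, query, classification):
        await asyncio.sleep(0)
        return [query, f"{query} {classification['type']}"]


class QueryProcessor:
    def __init__(self, llm):
        self.llm = llm

    async def process_query(self, query):
        # The two calls do not depend on each other, so await them together.
        enhanced, classification = await asyncio.gather(
            self.llm.enhance_query(query),
            self.llm.classify_query(query),
        )
        queries = await self.llm.generate_search_queries(enhanced, classification)
        return {
            "enhanced_query": enhanced,
            "classification": classification,
            "search_queries": queries,
        }


def generate_report_sync(query):
    """Bridge from synchronous code (for example a button callback) to the async pipeline."""
    processor = QueryProcessor(StubAsyncLLM())
    return asyncio.run(processor.process_query(query))


print(generate_report_sync("ev range")["search_queries"])
```

Note that `asyncio.run` raises a `RuntimeError` when called from inside an already running event loop, so a wrapper like this only suits callers that are themselves synchronous.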