Decision Log

2025-02-27: Initial Project Setup

  • Context: Need for semantic search capabilities that understand context beyond keywords
  • Options Considered:
    1. Build custom embedding solution
    2. Use open-source models locally
    3. Use Jina AI's APIs
  • Decision: Use Jina AI's APIs for embedding generation and similarity computation
  • Rationale:
    • High-quality embeddings with state-of-the-art models
    • No need to manage model deployment and infrastructure
    • Simple API integration with reasonable pricing
    • Support for long texts through segmentation

Decision: Separate Markdown Segmentation from Similarity Computation

  • Context: Need to handle potentially long markdown documents
  • Options Considered:
    1. Integrate segmentation directly into the similarity module
    2. Create a separate module for segmentation
  • Decision: Create a separate module (markdown_segmenter.py) for document segmentation
  • Rationale:
    • Better separation of concerns
    • More modular design allows for independent use of components
    • Easier to maintain and extend each component separately
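
As a rough illustration of the split, the segmenter can be used on its own and its output handed to the similarity module only when needed. The function names below (segment_markdown, compute_similarity) and the similarity module name are assumptions for illustration, not the project's actual API.

```python
# Hypothetical usage; segment_markdown and compute_similarity are illustrative
# names, not the project's actual signatures.
from markdown_segmenter import segment_markdown
from similarity import compute_similarity

with open("notes.md", encoding="utf-8") as f:
    document = f.read()

# Segmentation is usable on its own...
segments = segment_markdown(document)

# ...and can feed the similarity module when semantic comparison is needed.
scores = [compute_similarity("vector databases", segment) for segment in segments]
```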

Decision: Use Environment Variables for API Keys

  • Context: Need to securely manage API credentials
  • Options Considered:
    1. Configuration files
    2. Environment variables
    3. Secret management service
  • Decision: Use environment variables (JINA_API_KEY)
  • Rationale:
    • Simple to implement
    • Standard practice for managing secrets
    • Works well across different environments
    • Prevents accidental commit of credentials to version control
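
A minimal sketch of the pattern, assuming the key is exported in the shell (e.g. `export JINA_API_KEY=...`); the explicit error handling is illustrative.

```python
import os

# Read the key from the environment rather than from a checked-in config file.
JINA_API_KEY = os.environ.get("JINA_API_KEY")
if not JINA_API_KEY:
    raise RuntimeError("JINA_API_KEY environment variable is not set")
```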

Decision: Use Cosine Similarity with Normalized Vectors

  • Context: Need a metric for comparing semantic similarity between text embeddings
  • Options Considered:
    1. Euclidean distance
    2. Cosine similarity
    3. Dot product
  • Decision: Use cosine similarity with normalized vectors
  • Rationale:
    • Standard approach for semantic similarity
    • Normalized vectors simplify computation (dot product equals cosine similarity)
    • Less sensitive to embedding magnitude, focusing on direction (meaning)
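
A NumPy sketch of the rationale (not the project's actual code): once both vectors are scaled to unit length, cosine similarity reduces to a plain dot product.

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # For unit-length vectors the dot product equals cos(theta), which depends
    # only on direction (meaning), not on embedding magnitude.
    return float(np.dot(normalize(a), normalize(b)))

print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))  # ~1.0 (same direction)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))            # 0.0 (orthogonal)
```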

2025-02-27: Research System Architecture

Decision: Implement a Multi-Stage Research Pipeline

  • Context: Need to define the overall architecture for the intelligent research system
  • Options Considered:
    1. Monolithic application with tightly coupled components
    2. Microservices architecture with independent services
    3. Pipeline architecture with distinct processing stages
  • Decision: Implement an 8-stage pipeline architecture
  • Rationale:
    • Clear separation of concerns with each stage having a specific responsibility
    • Easier to develop and test individual components
    • Flexibility to swap or enhance specific stages without affecting others
    • Natural flow of data through the system matches the research process
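
This log does not enumerate the eight stages here, so the sketch below only illustrates the pipeline pattern itself; the stage names in the trailing comment are placeholders, not the system's actual stage list.

```python
from typing import Any, Callable, Dict, List

Stage = Callable[[Dict[str, Any]], Dict[str, Any]]

def run_pipeline(stages: List[Stage], state: Dict[str, Any]) -> Dict[str, Any]:
    # Each stage takes the accumulated state and returns an updated version,
    # so individual stages can be swapped or enhanced without touching the rest.
    for stage in stages:
        state = stage(state)
    return state

# e.g. run_pipeline([enhance_query, classify_query, generate_search_queries,
#                    execute_searches, collect_results, rerank, ...],
#                   {"query": "original user question"})
```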

Decision: Use Multiple Search Sources

  • Context: Need to gather comprehensive information from various sources
  • Options Considered:
    1. Use a single search API for simplicity
    2. Implement custom web scraping for all sources
    3. Use multiple specialized search APIs
  • Decision: Integrate multiple search sources (Google, Serper, Jina Search, Google Scholar, arXiv)
  • Rationale:
    • Different sources provide different types of information (academic, general, etc.)
    • Increases the breadth and diversity of search results
    • Specialized APIs like arXiv provide domain-specific information
    • Redundancy ensures more comprehensive coverage

Decision: Use Jina AI for Semantic Processing

  • Context: Need for advanced semantic understanding in document processing
  • Options Considered:
    1. Use simple keyword matching
    2. Implement custom embedding models
    3. Use Jina AI's suite of APIs
  • Decision: Use Jina AI's APIs for embedding generation, similarity computation, and reranking
  • Rationale:
    • High-quality embeddings with state-of-the-art models
    • Comprehensive API suite covering multiple needs (embeddings, segmentation, reranking)
    • Simple integration with reasonable pricing
    • Consistent approach across different semantic processing tasks
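
A hedged sketch of calling the embeddings API with the key from the environment. It assumes the v1 embeddings endpoint with an OpenAI-style request/response shape, and the model name is a placeholder; both should be checked against the current Jina documentation.

```python
import os
import requests

JINA_API_KEY = os.environ["JINA_API_KEY"]

def embed(texts: list[str], model: str = "jina-embeddings-v3") -> list[list[float]]:
    # Assumes https://api.jina.ai/v1/embeddings with an OpenAI-compatible
    # payload; verify the model name and response fields against current docs.
    response = requests.post(
        "https://api.jina.ai/v1/embeddings",
        headers={"Authorization": f"Bearer {JINA_API_KEY}",
                 "Content-Type": "application/json"},
        json={"model": model, "input": texts},
        timeout=30,
    )
    response.raise_for_status()
    return [item["embedding"] for item in response.json()["data"]]
```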

2025-02-27: Search Execution Architecture

Decision: Search Execution Architecture

  • Context: We needed a search execution module that runs queries across multiple search engines and processes the results in a standardized way.

  • Decision:

    1. Create a modular search execution architecture (a sketch follows this entry):
      • Implement a base handler interface (BaseSearchHandler) for all search API handlers
      • Create specific handlers for each search engine (Google, Serper, Scholar, arXiv)
      • Develop a central SearchExecutor class to manage execution across multiple engines
      • Implement a ResultCollector class for processing and organizing results
    2. Use parallel execution for search queries:
      • Implement thread-based parallelism using concurrent.futures
      • Add support for both synchronous and asynchronous execution
      • Include timeout management and error handling
    3. Standardize search results:
      • Define a common result format across all search engines
      • Include metadata specific to each search engine in a standardized way
      • Implement deduplication and scoring for result ranking
  • Rationale:

    • A modular architecture allows for easy addition of new search engines
    • Parallel execution significantly improves search performance
    • Standardized result format simplifies downstream processing
    • Separation of concerns between execution and result processing
  • Alternatives Considered:

    1. Sequential execution of search queries:
      • Simpler implementation
      • Much slower performance
      • Would not scale well with additional search engines
    2. Separate modules for each search engine:
      • Would lead to code duplication
      • More difficult to maintain
      • Less consistent result format
    3. Using a third-party search aggregation service:
      • Would introduce additional dependencies
      • Less control over the search process
      • Potential cost implications
  • Impact:

    • Efficient execution of search queries across multiple engines
    • Consistent result format for downstream processing
    • Flexible architecture that can be extended with new search engines
    • Clear separation of concerns between different components
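
A condensed sketch of the architecture described above. The class names come from this entry; the method signatures and the standardized result fields are assumptions.

```python
import concurrent.futures
from abc import ABC, abstractmethod
from typing import Any

class BaseSearchHandler(ABC):
    """Interface every engine-specific handler (Serper, Scholar, arXiv, ...) implements."""

    @abstractmethod
    def search(self, query: str, num_results: int = 10) -> list[dict[str, Any]]:
        """Return results in a common shape, e.g. {title, url, snippet, source, metadata}."""

class SearchExecutor:
    def __init__(self, handlers: dict[str, BaseSearchHandler], timeout: float = 30.0):
        self.handlers = handlers
        self.timeout = timeout

    def execute(self, query: str) -> dict[str, list[dict[str, Any]]]:
        # Thread-based parallelism via concurrent.futures: a slow or failing
        # engine should neither block nor break the others.
        results: dict[str, list[dict[str, Any]]] = {}
        with concurrent.futures.ThreadPoolExecutor() as pool:
            futures = {pool.submit(h.search, query): name for name, h in self.handlers.items()}
            for future in concurrent.futures.as_completed(futures, timeout=self.timeout):
                name = futures[future]
                try:
                    results[name] = future.result()
                except Exception:
                    results[name] = []  # the real module would log the error
        return results
```

A ResultCollector would then merge the per-engine lists, deduplicate them (for example by URL), and apply scoring for ranking, as described above.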

2025-02-27: Search Execution Module Refinements

Decision: Remove Google Search Handler

  • Context: Both Google and Serper handlers were implemented, but Serper is essentially a front-end for Google search
  • Options Considered:
    1. Keep both handlers for redundancy
    2. Remove the Google handler and only use Serper
  • Decision: Remove the Google search handler
  • Rationale:
    • Redundant functionality as Serper provides the same results
    • Simplifies the codebase and reduces maintenance
    • Reduces API costs by avoiding duplicate searches
    • Serper provides a more reliable and consistent API for Google search

Decision: Modify LLM Query Enhancement Prompt

  • Context: The LLM was returning enhanced queries with explanations, which caused issues with search APIs
  • Options Considered:
    1. Post-process the LLM output to extract just the query
    2. Modify the prompt to request only the enhanced query
  • Decision: Modify the LLM prompt to request only the enhanced query without explanations
  • Rationale:
    • More reliable than post-processing, which could be error-prone
    • Cleaner implementation that addresses the root cause
    • Ensures consistent output format for downstream processing
    • Reduces the risk of exceeding API character limits
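
The exact wording is not recorded in this log; the prompt below only illustrates the "query-only" instruction that addresses the root cause.

```python
# Illustrative prompt; the production wording may differ.
ENHANCE_PROMPT = (
    "Rewrite the following search query to be more specific and comprehensive. "
    "Return ONLY the rewritten query on a single line, with no explanation, "
    "preamble, or quotation marks.\n\n"
    "Query: {query}"
)

def build_enhancement_prompt(query: str) -> str:
    return ENHANCE_PROMPT.format(query=query)
```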

Decision: Implement Query Truncation

  • Context: Enhanced queries could exceed the Serper API's 2048 character limit
  • Options Considered:
    1. Limit the LLM's output length
    2. Truncate queries before sending to the API
    3. Split long queries into multiple searches
  • Decision: Implement query truncation in the search executor
  • Rationale:
    • Simple and effective solution
    • Preserves as much of the enhanced query as possible
    • Ensures API requests don't fail due to length constraints
    • Can be easily adjusted if API limits change
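
A minimal sketch of the truncation step, assuming the 2048-character limit applies to the raw query string; the word-boundary handling is illustrative.

```python
SERPER_MAX_QUERY_CHARS = 2048  # limit referenced in the context above

def truncate_query(query: str, limit: int = SERPER_MAX_QUERY_CHARS) -> str:
    # Keep as much of the enhanced query as possible while guaranteeing the
    # request stays within the API's character limit.
    if len(query) <= limit:
        return query
    truncated = query[:limit]
    # Avoid cutting mid-word when a space is available.
    return truncated.rsplit(" ", 1)[0] if " " in truncated else truncated
```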

2025-02-27: Testing Strategy for Query Processor

Context

After integrating Groq and OpenRouter as additional LLM providers, we needed to verify that the query processor module functions correctly with these new providers.

Decision

  1. Create dedicated test scripts to validate the query processor functionality:

    • A basic test script for the core processing pipeline
    • A comprehensive test script for detailed component testing
  2. Use monkey patching to ensure tests consistently use the Groq model (see the sketch after this list):

    • Create a global LLM interface with the Groq model
    • Override the get_llm_interface function to always return this interface
    • This approach allows testing without modifying the core code
  3. Test all key functionality of the query processor:

    • Query enhancement
    • Query classification
    • Search query generation
    • End-to-end processing pipeline
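
A sketch of the monkey-patching approach. The get_llm_interface function is named in this entry, but the module names, the LLMInterface constructor, and the Groq model id are assumptions for illustration.

```python
import query_processor                   # assumed module name
from llm_interface import LLMInterface   # assumed module/class name

# One shared interface backed by a Groq model, used for every test run.
groq_interface = LLMInterface(provider="groq", model="llama3-70b-8192")  # placeholder model id

def fake_get_llm_interface(*args, **kwargs):
    return groq_interface

# Override the factory so the processor always receives the Groq-backed
# interface, without modifying the production code.
query_processor.get_llm_interface = fake_get_llm_interface
```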

Rationale

  • Dedicated test scripts provide a repeatable way to verify functionality
  • Monkey patching allows testing with specific models without changing the core code
  • Comprehensive testing ensures all components work correctly with the new providers
  • Saving test results to a JSON file provides a reference for future development

Alternatives Considered

  1. Modifying the query processor to accept a model parameter:

    • Would require changing the core code
    • Could introduce bugs in the production code
  2. Using environment variables to control model selection:

    • Less precise control over which model is used
    • Could interfere with other tests or production use

Impact

  • Verified that the query processor works correctly with Groq models
  • Established a testing approach that can be used for other modules
  • Created reusable test scripts for future development