# Decision Log
## 2025-02-27: Initial Project Setup
### Decision: Use Jina AI APIs for Semantic Search
- **Context**: Need for semantic search capabilities that understand context beyond keywords
- **Options Considered**:
  1. Build custom embedding solution
  2. Use open-source models locally
  3. Use Jina AI's APIs
- **Decision**: Use Jina AI's APIs for embedding generation and similarity computation
- **Rationale**:
  - High-quality embeddings with state-of-the-art models
  - No need to manage model deployment and infrastructure
  - Simple API integration with reasonable pricing
  - Support for long texts through segmentation

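As a rough illustration of the integration, the sketch below requests embeddings from Jina's REST endpoint with `requests`; the endpoint URL, model name, and response shape are assumptions based on Jina's public API and should be checked against the current documentation.

```python
import os
import requests

JINA_EMBEDDINGS_URL = "https://api.jina.ai/v1/embeddings"  # assumed endpoint

def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Request embeddings for a batch of texts from the Jina API (sketch)."""
    response = requests.post(
        JINA_EMBEDDINGS_URL,
        headers={
            "Authorization": f"Bearer {os.environ['JINA_API_KEY']}",
            "Content-Type": "application/json",
        },
        json={"model": "jina-embeddings-v3", "input": texts},  # model name illustrative
        timeout=30,
    )
    response.raise_for_status()
    # Assumed response shape: {"data": [{"embedding": [...]}, ...]}
    return [item["embedding"] for item in response.json()["data"]]
```
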
### Decision: Separate Markdown Segmentation from Similarity Computation
- **Context**: Need to handle potentially long markdown documents
- **Options Considered**:
  1. Integrate segmentation directly into the similarity module
  2. Create a separate module for segmentation
- **Decision**: Create a separate module (markdown_segmenter.py) for document segmentation
- **Rationale**:
  - Better separation of concerns
  - More modular design allows for independent use of components
  - Easier to maintain and extend each component separately

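To make the module boundary concrete, a hypothetical interface for `markdown_segmenter.py` is sketched below; the function name, parameters, and splitting rule are illustrative, not the actual implementation.

```python
# markdown_segmenter.py -- hypothetical interface sketch, not the actual implementation
def segment_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split a markdown document into segments of roughly max_chars,
    breaking on blank lines so paragraphs stay intact."""
    segments: list[str] = []
    current = ""
    for block in text.split("\n\n"):
        if current and len(current) + len(block) > max_chars:
            segments.append(current.strip())
            current = ""
        current += block + "\n\n"
    if current.strip():
        segments.append(current.strip())
    return segments
```

The similarity module can then consume these segments (embed each and compare) without knowing how they were produced, which is the independence this decision aims for.
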
### Decision: Use Environment Variables for API Keys
- **Context**: Need to securely manage API credentials
- **Options Considered**:
  1. Configuration files
  2. Environment variables
  3. Secret management service
- **Decision**: Use environment variables (JINA_API_KEY)
- **Rationale**:
  - Simple to implement
  - Standard practice for managing secrets
  - Works well across different environments
  - Prevents accidental commit of credentials to version control

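A minimal sketch of reading the key at runtime, failing fast with a clear message when it is missing:

```python
import os

# Read the Jina API key from the environment rather than from source or config files.
JINA_API_KEY = os.environ.get("JINA_API_KEY")
if not JINA_API_KEY:
    raise RuntimeError("JINA_API_KEY is not set; export it before running the application.")
```
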
### Decision: Use Cosine Similarity with Normalized Vectors
- **Context**: Need a metric for comparing semantic similarity between text embeddings
- **Options Considered**:
  1. Euclidean distance
  2. Cosine similarity
  3. Dot product
- **Decision**: Use cosine similarity with normalized vectors
- **Rationale**:
  - Standard approach for semantic similarity
  - Normalized vectors simplify computation, since the dot product then equals the cosine similarity (see the sketch below)
  - Less sensitive to embedding magnitude, focusing on direction (meaning)

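A small NumPy sketch of the computation: once both embeddings are L2-normalized, the cosine similarity reduces to a plain dot product.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    a_norm = a / np.linalg.norm(a)
    b_norm = b / np.linalg.norm(b)
    # For unit-length vectors, the dot product is exactly the cosine of the angle between them.
    return float(np.dot(a_norm, b_norm))
```
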
## 2025-02-27: Research System Architecture
### Decision: Implement a Multi-Stage Research Pipeline
- **Context**: Need to define the overall architecture for the intelligent research system
- **Options Considered**:
  1. Monolithic application with tightly coupled components
  2. Microservices architecture with independent services
  3. Pipeline architecture with distinct processing stages
- **Decision**: Implement an 8-stage pipeline architecture
- **Rationale**:
  - Clear separation of concerns with each stage having a specific responsibility
  - Easier to develop and test individual components
  - Flexibility to swap or enhance specific stages without affecting others
  - Natural flow of data through the system matches the research process

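As a schematic sketch of the idea, each stage can be modeled as a callable that receives the previous stage's output; the stage names in the comment are illustrative placeholders, not the actual eight stages.

```python
from typing import Any, Callable

Stage = Callable[[Any], Any]

def run_pipeline(stages: list[Stage], initial_input: Any) -> Any:
    """Pass data through each stage in order (illustrative sketch)."""
    data = initial_input
    for stage in stages:
        data = stage(data)
    return data

# Illustrative ordering only -- the real system defines its own eight stages, e.g.:
# run_pipeline([process_query, execute_searches, rank_results, scrape_documents,
#               prioritize_chunks, synthesize_report, format_report, deliver_report], query)
```
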
### Decision: Use Multiple Search Sources
- **Context**: Need to gather comprehensive information from various sources
- **Options Considered**:
  1. Use a single search API for simplicity
  2. Implement custom web scraping for all sources
  3. Use multiple specialized search APIs
- **Decision**: Integrate multiple search sources (Google, Serper, Jina Search, Google Scholar, arXiv)
- **Rationale**:
  - Different sources provide different types of information (academic, general, etc.)
  - Increases the breadth and diversity of search results
  - Specialized APIs like arXiv provide domain-specific information
  - Redundancy ensures more comprehensive coverage

### Decision: Use Jina AI for Semantic Processing
- **Context**: Need for advanced semantic understanding in document processing
- **Options Considered**:
  1. Use simple keyword matching
  2. Implement custom embedding models
  3. Use Jina AI's suite of APIs
- **Decision**: Use Jina AI's APIs for embedding generation, similarity computation, and reranking
- **Rationale**:
  - High-quality embeddings with state-of-the-art models
  - Comprehensive API suite covering multiple needs (embeddings, segmentation, reranking)
  - Simple integration with reasonable pricing
  - Consistent approach across different semantic processing tasks

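For the reranking step, a sketch of a call to Jina's rerank endpoint is shown below; the URL, model name, and response fields are assumptions based on Jina's public API and should be verified against the current documentation.

```python
import os
import requests

def rerank(query: str, documents: list[str], top_n: int = 5) -> list[dict]:
    """Rerank candidate documents against a query via the Jina rerank API (sketch)."""
    response = requests.post(
        "https://api.jina.ai/v1/rerank",  # assumed endpoint
        headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
        json={
            "model": "jina-reranker-v2-base-multilingual",  # model name illustrative
            "query": query,
            "documents": documents,
            "top_n": top_n,
        },
        timeout=30,
    )
    response.raise_for_status()
    # Assumed response shape: {"results": [{"index": ..., "relevance_score": ...}, ...]}
    return response.json()["results"]
```
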
## 2025-02-27: Search Execution Architecture
### Decision: Search Execution Architecture
- **Context**: We needed a search execution module that runs queries across multiple search engines and processes the results in a standardized way.
- **Decision**:
  1. Create a modular search execution architecture:
     - Implement a base handler interface (`BaseSearchHandler`) for all search API handlers
     - Create specific handlers for each search engine (Google, Serper, Scholar, arXiv)
     - Develop a central `SearchExecutor` class to manage execution across multiple engines
     - Implement a `ResultCollector` class for processing and organizing results
  2. Use parallel execution for search queries (see the sketch after this list):
     - Implement thread-based parallelism using `concurrent.futures`
     - Add support for both synchronous and asynchronous execution
     - Include timeout management and error handling
  3. Standardize search results:
     - Define a common result format across all search engines
     - Include metadata specific to each search engine in a standardized way
     - Implement deduplication and scoring for result ranking

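A condensed sketch of the handler interface and parallel execution described above; the common result fields and constructor details are assumptions for illustration.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor

class BaseSearchHandler(ABC):
    """Interface every search engine handler implements."""

    @abstractmethod
    def search(self, query: str, num_results: int = 10) -> list[dict]:
        """Return results in a common format, e.g. {'title', 'url', 'snippet', 'source', 'metadata'}."""

class SearchExecutor:
    """Runs a query against several handlers in parallel with a per-handler timeout."""

    def __init__(self, handlers: dict[str, BaseSearchHandler], timeout: float = 15.0):
        self.handlers = handlers
        self.timeout = timeout

    def execute(self, query: str) -> dict[str, list[dict]]:
        results: dict[str, list[dict]] = {}
        with ThreadPoolExecutor(max_workers=len(self.handlers)) as pool:
            futures = {name: pool.submit(handler.search, query) for name, handler in self.handlers.items()}
            for name, future in futures.items():
                try:
                    results[name] = future.result(timeout=self.timeout)
                except Exception:
                    results[name] = []  # a timeout or failure in one engine must not block the others
        return results
```
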
- **Rationale**:
  - A modular architecture allows for easy addition of new search engines
  - Parallel execution significantly improves search performance
  - Standardized result format simplifies downstream processing
  - Separation of concerns between execution and result processing
- **Alternatives Considered**:
  1. Sequential execution of search queries:
     - Simpler implementation
     - Much slower performance
     - Would not scale well with additional search engines
  2. Separate modules for each search engine:
     - Would lead to code duplication
     - More difficult to maintain
     - Less consistent result format
  3. Using a third-party search aggregation service:
     - Would introduce additional dependencies
     - Less control over the search process
     - Potential cost implications
- **Impact**:
  - Efficient execution of search queries across multiple engines
  - Consistent result format for downstream processing
  - Flexible architecture that can be extended with new search engines
  - Clear separation of concerns between different components

## 2025-02-27: Search Execution Module Refinements
### Decision: Remove Google Search Handler
- **Context**: Both Google and Serper handlers were implemented, but Serper is essentially a front-end for Google search
- **Options Considered**:
  1. Keep both handlers for redundancy
  2. Remove the Google handler and only use Serper
- **Decision**: Remove the Google search handler
- **Rationale**:
  - Redundant functionality, as Serper provides the same results
  - Simplifies the codebase and reduces maintenance
  - Reduces API costs by avoiding duplicate searches
  - Serper provides a more reliable and consistent API for Google search

### Decision: Modify LLM Query Enhancement Prompt
- **Context**: The LLM was returning enhanced queries with explanations, which caused issues with search APIs
- **Options Considered**:
  1. Post-process the LLM output to extract just the query
  2. Modify the prompt to request only the enhanced query
- **Decision**: Modify the LLM prompt to request only the enhanced query without explanations
- **Rationale**:
  - More reliable than post-processing, which could be error-prone
  - Cleaner implementation that addresses the root cause
  - Ensures consistent output format for downstream processing
  - Reduces the risk of exceeding API character limits

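An illustrative version of the tightened prompt wording (the exact text in the codebase may differ):

```python
# Illustrative wording only -- the production prompt may differ.
ENHANCE_QUERY_PROMPT = (
    "Rewrite the user's question as a single, more specific web search query. "
    "Return only the enhanced query text, with no explanation, preamble, or quotation marks."
)
```
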
### Decision: Implement Query Truncation
- **Context**: Enhanced queries could exceed the Serper API's 2048-character limit
- **Options Considered**:
  1. Limit the LLM's output length
  2. Truncate queries before sending to the API
  3. Split long queries into multiple searches
- **Decision**: Implement query truncation in the search executor
- **Rationale**:
  - Simple and effective solution
  - Preserves as much of the enhanced query as possible
  - Ensures API requests don't fail due to length constraints
  - Can be easily adjusted if API limits change

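A minimal sketch of the truncation guard in the search executor, using the 2048-character Serper limit noted above (names are illustrative):

```python
SERPER_MAX_QUERY_CHARS = 2048  # limit noted in the context above

def truncate_query(query: str, limit: int = SERPER_MAX_QUERY_CHARS) -> str:
    """Clip an enhanced query so the search API request cannot fail on length."""
    return query if len(query) <= limit else query[:limit]
```
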
## 2025-02-27: Testing Strategy for Query Processor
### Context
After integrating Groq and OpenRouter as additional LLM providers, we needed to verify that the query processor module functions correctly with these new providers.

### Decision
1. Create dedicated test scripts to validate the query processor functionality:
   - A basic test script for the core processing pipeline
   - A comprehensive test script for detailed component testing
2. Use monkey patching to ensure tests consistently use the Groq model (see the sketch after this list):
   - Create a global LLM interface with the Groq model
   - Override the `get_llm_interface` function to always return this interface
   - This approach allows testing without modifying the core code
3. Test all key functionality of the query processor:
   - Query enhancement
   - Query classification
   - Search query generation
   - End-to-end processing pipeline

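A sketch of the monkey-patching approach used by the test scripts; import paths, constructor arguments, method names, and the model name are assumptions for illustration.

```python
# test_query_processor.py -- illustrative sketch; import paths and arguments are assumed.
import query_processor
from llm_interface import LLMInterface  # hypothetical module path

# One global interface bound to a Groq model for the whole test run.
groq_interface = LLMInterface(provider="groq", model="llama3-70b-8192")  # model name illustrative

# Override the factory so every code path in the query processor uses the Groq interface,
# without touching the production code.
query_processor.get_llm_interface = lambda *args, **kwargs: groq_interface

processor = query_processor.QueryProcessor()
result = processor.process_query("What are the latest advances in semantic search?")
print(result)
```
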
### Rationale
- Dedicated test scripts provide a repeatable way to verify functionality
- Monkey patching allows testing with specific models without changing the core code
- Comprehensive testing ensures all components work correctly with the new providers
- Saving test results to a JSON file provides a reference for future development

### Alternatives Considered
1. Modifying the query processor to accept a model parameter:
   - Would require changing the core code
   - Could introduce bugs in the production code
2. Using environment variables to control model selection:
   - Less precise control over which model is used
   - Could interfere with other tests or production use

### Impact
- Verified that the query processor works correctly with Groq models
- Established a testing approach that can be used for other modules
- Created reusable test scripts for future development

## 2025-02-27: Report Generation Module Implementation
### Decision: Use Jina Reader for Web Scraping and SQLite for Document Storage
- **Context**: Need to implement document scraping and storage for the Report Generation module
- **Options Considered**:
  1. In-memory document storage with custom web scraping
  2. SQLite database with Jina Reader for web scraping
  3. NoSQL database (e.g., MongoDB) with BeautifulSoup for web scraping
  4. Cloud-based document storage with third-party scraping service
- **Decision**: Use Jina Reader for web scraping and SQLite for document storage
- **Rationale**:
  - Jina Reader provides clean content extraction from web pages
  - Integration with existing Jina components (embeddings, reranker) for a consistent approach
  - SQLite offers persistence without the complexity of a full database server
  - SQLite's transactional nature ensures data integrity
  - Local storage reduces latency and eliminates cloud dependencies
  - Ability to store metadata alongside documents for better filtering and selection

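A sketch of the storage layer under these choices: a hypothetical `documents` table plus a helper that fetches clean markdown through Jina Reader by prefixing the target URL with `https://r.jina.ai/`; the schema and field names are assumptions.

```python
import sqlite3
import requests

def init_db(path: str = "documents.db") -> sqlite3.Connection:
    """Create a hypothetical documents table for scraped content and metadata."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               url TEXT PRIMARY KEY,
               content TEXT,
               scraped_at TEXT DEFAULT CURRENT_TIMESTAMP,
               token_count INTEGER
           )"""
    )
    return conn

def scrape_and_store(conn: sqlite3.Connection, url: str) -> None:
    """Fetch the page as clean markdown via Jina Reader and persist it transactionally."""
    content = requests.get(f"https://r.jina.ai/{url}", timeout=60).text
    with conn:  # the connection context manager wraps the insert in a transaction
        conn.execute(
            "INSERT OR REPLACE INTO documents (url, content, token_count) VALUES (?, ?, ?)",
            (url, content, len(content.split())),  # whitespace word count as a rough token proxy
        )
```
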
### Decision: Implement Phased Approach for Report Generation
- **Context**: Need to handle potentially large numbers of documents within LLM context window limitations
- **Options Considered**:
  1. Single-pass approach with document truncation
  2. Use of a model with a larger context window
  3. Phased approach with document prioritization and chunking
  4. Outsourcing document synthesis to a specialized service
- **Decision**: Implement a phased approach with document prioritization and chunking
- **Rationale**:
  - Allows handling of large document collections despite context window limitations
  - Prioritization ensures the most relevant content is included
  - Chunking strategies can preserve document structure and context
  - Map-reduce pattern enables processing of unlimited document collections
  - Flexible architecture can accommodate different models as needed
  - Progressive implementation allows for iterative testing and refinement

## 2025-02-27: Document Prioritization and Chunking Strategies
### Decision
Implemented document prioritization and chunking strategies for the Report Generation module (Phase 2) to extract the most relevant portions of scraped documents and prepare them for LLM processing.

### Context
After implementing the document scraping and storage components (Phase 1), we needed to develop strategies for prioritizing documents based on relevance and chunking them to fit within the LLM's context window limits. This is crucial for ensuring that the most important information is included in the final report.

### Options Considered
1. **Document Prioritization:**
   - Option A: Use only relevance scores from search results
   - Option B: Combine relevance scores with document metadata (recency, token count)
   - Option C: Use a machine learning model to score documents
2. **Chunking Strategies:**
   - Option A: Fixed-size chunking with overlap
   - Option B: Section-based chunking using Markdown headers
   - Option C: Hierarchical chunking for very large documents
   - Option D: Semantic chunking based on content similarity

### Decision and Rationale
For document prioritization, we chose Option B: a weighted scoring system that combines:

- Relevance scores from search results (primary factor)
- Document recency (secondary factor)
- Document token count (tertiary factor)

This approach allows us to prioritize documents that are both relevant to the query and recent, while also considering the information density of the document.

For chunking strategies, we implemented a hybrid approach:

- Section-based chunking (Option B) as the primary strategy, which preserves the logical structure of documents
- Fixed-size chunking (Option A) as a fallback for documents without clear section headers
- Hierarchical chunking (Option C) for very large documents, which creates a summary chunk and preserves important sections

We decided against semantic chunking (Option D) for now due to the additional computational overhead and complexity, but may consider it for future enhancements.

### Implementation Details
1. **Document Prioritization:**
   - Created a scoring formula that weights relevance (50-60%), recency (30%), and token count (10-20%); see the sketch after this list
   - Normalized all scores to a 0-1 range for consistent weighting
   - Added the priority score to each document for use in chunk selection
2. **Chunking Strategies:**
   - Implemented section-based chunking using regex to identify Markdown headers
   - Added fixed-size chunking with configurable chunk size and overlap
   - Created hierarchical chunking for very large documents
   - Preserved document metadata in all chunks for traceability
3. **Chunk Selection:**
   - Implemented a token budget management system to stay within context limits
   - Created an algorithm to select chunks based on priority while ensuring representation from multiple documents
   - Added minimum chunks per document to prevent over-representation of a single source

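Two small sketches of the mechanics in items 1 and 2: the weighted priority score (weights picked from the stated ranges, inputs assumed normalized to 0-1) and a regex-based section splitter for Markdown headers; both are illustrative rather than the exact implementation.

```python
import re

def priority_score(relevance: float, recency: float, token_score: float) -> float:
    """Weighted priority; weights chosen from the stated ranges (relevance 0.5, recency 0.3, tokens 0.2).
    All inputs are assumed to be pre-normalized to the 0-1 range."""
    return 0.5 * relevance + 0.3 * recency + 0.2 * token_score

def split_by_sections(markdown: str) -> list[str]:
    """Section-based chunking sketch: split before each Markdown header so the header stays with its body."""
    parts = re.split(r"(?m)^(?=#{1,6}\s)", markdown)
    return [part.strip() for part in parts if part.strip()]
```
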
### Impact and Next Steps
This implementation allows us to:

- Prioritize the most relevant and recent information
- Preserve the logical structure of documents
- Efficiently manage token budgets for different LLM models
- Balance information from multiple sources

Next steps include:

- Integrating with the LLM interface for report synthesis (Phase 3)
- Implementing the map-reduce approach for processing document chunks
- Creating report templates for different query types
- Adding citation generation and reference management

## 2025-02-27: Map-Reduce Approach for Report Synthesis
### Context
For Phase 3 of the Report Generation module, we needed to implement a method to synthesize comprehensive reports from multiple document chunks. The challenge was to effectively process potentially large amounts of information while maintaining coherence and staying within the token limits of LLM models.

### Options Considered
1. **Single-Pass Approach**: Send all document chunks to the LLM at once for processing.
   - Pros: Simpler implementation, LLM has full context at once
   - Cons: Limited by context window size, may exceed token limits for large documents
2. **Sequential Summarization**: Process each document sequentially, building up a summary incrementally.
   - Pros: Can handle unlimited documents, maintains some context
   - Cons: Risk of information loss, earlier documents may have undue influence
3. **Map-Reduce Approach**: Process individual chunks first (map), then combine the extracted information (reduce).
   - Pros: Can handle large numbers of documents, preserves key information, more efficient token usage
   - Cons: More complex implementation, requires two LLM passes

### Decision
We chose the **Map-Reduce Approach** for report synthesis because:

1. It allows us to process a large number of document chunks efficiently
2. It preserves key information from each document by extracting it in the map phase
3. It produces more coherent reports by synthesizing the extracted information in the reduce phase
4. It makes better use of token limits by focusing on relevant information

### Implementation Details
- **Map Phase**: Each document chunk is processed individually to extract key information relevant to the query
- **Reduce Phase**: The extracted information is synthesized into a coherent report
- **Query Type Templates**: Different report templates are used based on the query type (factual, exploratory, comparative)
- **Citation Management**: Citations are included in the report with a references section at the end

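A condensed async sketch of the two phases; the `llm.generate_completion` coroutine and the prompt wording are assumptions for illustration.

```python
import asyncio

async def map_chunk(llm, query: str, chunk: str) -> str:
    """Map phase: extract the information in one chunk that is relevant to the query."""
    prompt = f"Extract the key facts relevant to '{query}' from the text below.\n\n{chunk}"
    return await llm.generate_completion(prompt)  # assumed async LLM interface method

async def synthesize_report(llm, query: str, chunks: list[str]) -> str:
    """Run the map phase over all chunks concurrently, then reduce the extractions into one report."""
    extractions = await asyncio.gather(*(map_chunk(llm, query, chunk) for chunk in chunks))
    reduce_prompt = (
        f"Using only the extracted notes below, write a coherent, well-cited report answering '{query}'.\n\n"
        + "\n\n".join(extractions)
    )
    return await llm.generate_completion(reduce_prompt)
```
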
### Success Metrics
- Ability to process more documents than a single-pass approach
- Higher quality reports with better organization and coherence
- Proper attribution of information to sources
- Efficient token usage

### Status
Implemented and tested successfully with both sample data and real URLs.

## 2025-02-27: Report Generation Enhancements
### Decision: Implement Customizable Report Detail Levels
- **Context**: Need to provide flexibility in report generation to accommodate different use cases and detail requirements
- **Options Considered**:
  1. Fixed report format with predetermined detail level
  2. Simple toggle between "brief" and "detailed" reports
  3. Comprehensive configuration system with multiple adjustable parameters
- **Decision**: Implement a comprehensive configuration system with multiple adjustable parameters
- **Rationale**:
  - Different research tasks require different levels of detail
  - Users have varying needs for report comprehensiveness
  - A flexible system allows for fine-tuning based on specific use cases
  - Multiple configuration options provide more control over the output

### Implementation Details
1. **Configurable Parameters**:
   - Number of search results per engine
   - Token budget for report generation
   - Synthesis prompts for the LLM
   - Report style templates
   - Chunking parameters (size and overlap)
   - Model selection options
2. **Integration Points**:
   - Command-line arguments for scripts
   - Configuration file options
   - API parameters for programmatic use
   - UI controls for user-facing applications
3. **Default Configurations**:
   - Create preset configurations for common use cases:
     - Brief overview (fewer results, smaller token budget)
     - Standard report (balanced approach)
     - Comprehensive analysis (more results, larger token budget)
     - Technical deep-dive (specialized prompts, larger context)

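A sketch of what the presets could look like as plain dictionaries; the parameter names echo the list above and every value is illustrative rather than the shipped default.

```python
# Illustrative preset values only -- the real defaults live in the configuration system.
REPORT_PRESETS = {
    "brief":               {"results_per_engine": 3,  "token_budget": 4_000,  "chunk_size": 800,   "chunk_overlap": 80},
    "standard":            {"results_per_engine": 5,  "token_budget": 8_000,  "chunk_size": 1_000, "chunk_overlap": 100},
    "comprehensive":       {"results_per_engine": 10, "token_budget": 16_000, "chunk_size": 1_200, "chunk_overlap": 120},
    "technical_deep_dive": {"results_per_engine": 10, "token_budget": 24_000, "chunk_size": 1_500, "chunk_overlap": 150},
}
```
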
## 2025-02-28: Async Implementation and Reference Formatting
### Decision: Convert LLM Interface Methods to Async
**Context**: The codebase was experiencing runtime errors related to coroutine handling, particularly with the LLM interface methods.

**Decision**: Convert all LLM interface methods to async and update dependent code to properly await these methods.

**Rationale**:

- LLM API calls are I/O-bound operations that benefit from async handling
- Consistent async/await patterns throughout the codebase improve reliability
- Proper async implementation prevents runtime errors related to coroutine handling

**Implementation**:

- Converted `generate_completion`, `classify_query`, `enhance_query`, and `generate_search_queries` methods to async
- Updated QueryProcessor methods to be async
- Modified query_to_report.py to correctly await async methods
- Updated the Gradio interface to handle async operations

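A minimal sketch of the pattern: interface methods are declared `async` and every caller awaits them; the method names follow the list above, while the bodies and client details are illustrative.

```python
import asyncio

class LLMInterface:
    async def generate_completion(self, prompt: str) -> str:
        """I/O-bound LLM call; a real implementation would await an async HTTP client here."""
        await asyncio.sleep(0)  # placeholder for the awaited API call
        return f"completion for: {prompt[:40]}"

    async def enhance_query(self, query: str) -> str:
        return await self.generate_completion(f"Enhance this search query: {query}")

async def main() -> None:
    llm = LLMInterface()
    enhanced = await llm.enhance_query("semantic search with Jina")  # callers must await, not call directly
    print(enhanced)

asyncio.run(main())
```
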
### Decision: Enhance Reference Formatting Instructions
**Context**: References in generated reports were missing URLs and sometimes using generic placeholders like "Document 1".

**Decision**: Enhance the reference formatting instructions to emphasize including URLs and improve context preparation.

**Rationale**:

- Proper references with URLs are essential for academic and professional reports
- Clear instructions to the LLM improve the quality of generated references
- Duplicate URL fields in the context ensure URLs are captured

**Implementation**:

- Improved instructions to emphasize including URLs for each reference
- Added duplicate URL fields in the context to ensure URLs are captured
- Updated the reference generation prompt to explicitly request URLs
- Added a separate reference generation step to handle truncated references