Compare commits

2 commits: bf49474ca6 ... 12b453a14f

Commits in range: 12b453a14f, b6b50e4ef8
@@ -0,0 +1,5 @@
Review the contents of .note/ before modifying any files.

After each major successful test, please commit the changes to the repository with a meaningful commit message.

Update the contents of .note/ after each major change.
@@ -51,3 +51,4 @@ logs/
# Database files
*.db
report/database/*.db
config/config.yaml
@@ -47,6 +47,18 @@
- Tested the reranking functionality with the `JinaReranker` class
- Checked that the report generation process works with the new structure

### Query Type Selection in Gradio UI

- ✅ Added a dropdown menu for query type selection in the "Generate Report" tab
- ✅ Included options for "auto-detect", "factual", "exploratory", and "comparative"
- ✅ Added descriptive tooltips explaining each query type
- ✅ Set "auto-detect" as the default option
- ✅ Modified the `generate_report` method in the `GradioInterface` class to handle the new query_type parameter
- ✅ Updated the report button click handler to pass the query type to the generate_report method
- ✅ Updated the `generate_report` method in the `ReportGenerator` class to accept a query_type parameter
- ✅ Modified the report synthesizer calls to pass the query_type parameter
- ✅ Added a "Query Types" section to the Gradio UI explaining each query type
- ✅ Committed changes with message "Add query type selection to Gradio UI and improve report generation"

## Next Steps

1. Run comprehensive tests to ensure all functionality works with the new directory structure
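The parameter threading described above (dropdown → `GradioInterface.generate_report` → `ReportGenerator` → synthesizer) can be sketched as below. This is a minimal illustration, not the project's actual code: the standalone function names and the "exploratory" fallback are assumptions made for the sketch.

```python
from typing import Optional

# Dropdown choices as described in the log; "auto-detect" is the default.
QUERY_TYPE_CHOICES = ["auto-detect", "factual", "exploratory", "comparative"]

def ui_generate_report(query: str, query_type: str) -> str:
    # "auto-detect" in the dropdown maps to None, deferring classification
    # to the downstream pipeline.
    effective: Optional[str] = None if query_type == "auto-detect" else query_type
    return generator_generate_report(query, query_type=effective)

def generator_generate_report(query: str, query_type: Optional[str] = None) -> str:
    # The generator simply forwards the parameter to the synthesizer.
    return synthesize(query, query_type=query_type)

def synthesize(query: str, query_type: Optional[str] = None) -> str:
    # Placeholder for automatic detection when no type was selected.
    detected = query_type or "exploratory"
    return f"[{detected}] report for: {query}"
```

The key design point is that an explicit user selection overrides detection, while `None` preserves the original auto-detect behavior.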
@@ -75,11 +87,20 @@
- Estimated difficulty: Easy to Moderate (2-3 days of work)

2. **UI Improvements**:
   - **Add Chunk Processing Progress Indicators**:
     - Modify the `report_synthesis.py` file to add logging during the map phase of the map-reduce process
     - Add a counter variable to track which chunk is being processed
     - Use the existing logging infrastructure to output progress messages in the UI
     - Estimated difficulty: Easy (15-30 minutes of work)
   - ✅ **Add Chunk Processing Progress Indicators**:
     - ✅ Added a `set_progress_callback` method to the `ReportGenerator` class
     - ✅ Implemented progress tracking in both standard and progressive report synthesizers
     - ✅ Updated the Gradio UI to display progress during report generation
     - ✅ Fixed issues with progress reporting in the UI
     - ✅ Ensured proper initialization of the report generator in the UI
     - ✅ Added proper error handling for progress updates

   - ✅ **Add Query Type Selection**:
     - ✅ Added a dropdown menu for query type selection in the "Generate Report" tab
     - ✅ Included options for "auto-detect", "factual", "exploratory", "comparative", and "code"
     - ✅ Added descriptive tooltips explaining each query type
     - ✅ Modified the report generation logic to handle the selected query type
     - ✅ Added documentation to help users understand when to use each query type

3. **Visualization Components**:
   - Identify common data types in reports that would benefit from visualization
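The `set_progress_callback` pattern noted above can be sketched as follows. The class shape is illustrative (the real `ReportGenerator` does far more); the point is the callback registration and the error handling around progress updates.

```python
from typing import Callable, List, Optional

class ReportGenerator:
    def __init__(self) -> None:
        self._progress_callback: Optional[Callable[[float, str], None]] = None

    def set_progress_callback(self, callback: Callable[[float, str], None]) -> None:
        # The UI registers a callback receiving (fraction_done, message).
        self._progress_callback = callback

    def _report_progress(self, fraction: float, message: str) -> None:
        if self._progress_callback is not None:
            try:
                self._progress_callback(fraction, message)
            except Exception:
                # Progress reporting must never break report generation.
                pass

    def generate(self, chunks: List[str]) -> str:
        for i, _chunk in enumerate(chunks, start=1):
            self._report_progress(i / len(chunks), f"Processing chunk {i}/{len(chunks)}")
        return "report"
```

Wrapping the callback invocation in a try/except mirrors the "proper error handling for progress updates" item: a faulty UI callback should not abort synthesis.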
@@ -96,8 +117,9 @@
- Implementing report versioning and comparison

2. **Integration with UI**:
   - Adding report generation options to the UI
   - Implementing progress indicators for document scraping and report generation
   - ✅ Adding report generation options to the UI
   - ✅ Implementing progress indicators for document scraping and report generation
   - ✅ Adding query type selection to the UI
   - Creating visualization components for generated reports
   - Adding options to customize report generation parameters
@@ -111,11 +133,11 @@

1. **Report Templates Implementation**:
   - ✅ Created a dedicated `report_templates.py` module with a comprehensive template system
   - ✅ Implemented `QueryType` enum for categorizing queries (FACTUAL, EXPLORATORY, COMPARATIVE)
   - ✅ Implemented `QueryType` enum for categorizing queries (FACTUAL, EXPLORATORY, COMPARATIVE, CODE)
   - ✅ Created `DetailLevel` enum for different report detail levels (BRIEF, STANDARD, DETAILED, COMPREHENSIVE)
   - ✅ Designed a `ReportTemplate` class with validation for required sections
   - ✅ Implemented a `ReportTemplateManager` to manage and retrieve templates
   - ✅ Created 12 different templates (3 query types × 4 detail levels)
   - ✅ Created 16 different templates (4 query types × 4 detail levels)
   - ✅ Added testing with `test_report_templates.py` and `test_brief_report.py`
   - ✅ Updated memory bank documentation with template system details
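The enum-keyed template system described above can be sketched roughly as follows. The enum names match the log; the template bodies and the manager's internals are placeholder assumptions.

```python
from enum import Enum

class QueryType(Enum):
    FACTUAL = "factual"
    EXPLORATORY = "exploratory"
    COMPARATIVE = "comparative"
    CODE = "code"

class DetailLevel(Enum):
    BRIEF = "brief"
    STANDARD = "standard"
    DETAILED = "detailed"
    COMPREHENSIVE = "comprehensive"

class ReportTemplateManager:
    def __init__(self) -> None:
        # 4 query types x 4 detail levels = 16 templates.
        self._templates = {
            (qt, dl): f"{qt.value}/{dl.value} template"
            for qt in QueryType
            for dl in DetailLevel
        }

    def get_template(self, query_type: QueryType, detail_level: DetailLevel) -> str:
        return self._templates[(query_type, detail_level)]
```

Keying the registry on enum pairs rather than raw strings gives the type safety the log's insights call out, since an invalid combination fails at enum conversion rather than deep inside synthesis.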
@@ -127,6 +149,12 @@
- ✅ Improved error handling in template retrieval with fallback to standard templates
- ✅ Added better logging for template retrieval process

3. **UI Enhancements**:
   - ✅ Added progress tracking for report generation
   - ✅ Added query type selection dropdown
   - ✅ Added documentation for query types and detail levels
   - ✅ Improved error handling in the UI

### Next Steps

1. **Further Refinement of Report Templates**:
@@ -173,7 +201,20 @@
- ✅ Implemented optimization for token usage and processing efficiency
- ✅ Fine-tuned prompts and parameters based on testing results

3. **Visualization Components**:
3. **Query Type Selection Enhancement**:
   - ✅ Added query type selection dropdown to the UI
   - ✅ Implemented handling of user-selected query types in the report generation process
   - ✅ Added documentation to help users understand when to use each query type
   - ✅ Added CODE as a new query type with specialized templates at all detail levels
   - ✅ Implemented code query detection with language, framework, and pattern recognition
   - ✅ Added GitHub and StackExchange search handlers for code-related queries
   - ⏳ Test the query type selection with various queries to ensure it works correctly
   - ⏳ Gather user feedback on the usefulness of manual query type selection
   - ⏳ Consider adding more specialized templates for specific query types
   - ⏳ Explore adding query type detection confidence scores to help users decide when to override
   - ⏳ Add examples of each query type to help users understand the differences

4. **Visualization Components**:
   - Identify common data types in reports that would benefit from visualization
   - Design and implement visualization components for these data types
   - Integrate visualization components into the report generation process
@@ -194,3 +235,14 @@
- Tracks improvement scores to detect diminishing returns
- Adapts batch size based on model context window
- Provides progress tracking through callback mechanism
- Added query type selection to the UI:
  - Allows users to explicitly select the query type (factual, exploratory, comparative, code)
  - Provides auto-detect option for convenience
  - Includes documentation to help users understand when to use each query type
  - Passes the selected query type through the report generation pipeline
- Implemented specialized code query support:
  - Added GitHub API for searching code repositories
  - Added StackExchange API for programming Q&A content
  - Created code detection based on programming languages, frameworks, and patterns
  - Designed specialized report templates for code content with syntax highlighting
  - Enhanced result ranking to prioritize code-related sources for programming queries
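The language/framework/pattern-based code detection mentioned above could look roughly like this keyword sketch. The vocabularies and regex patterns are invented for illustration; the project's actual detector is not shown in this log.

```python
import re

# Illustrative vocabularies; the real detector's lists are assumptions here.
LANGUAGES = {"python", "javascript", "rust", "java", "c++", "go"}
FRAMEWORKS = {"django", "react", "flask", "numpy", "pytorch"}
PATTERNS = [
    r"\bstack trace\b",
    r"\bsyntax error\b",
    r"\bhow (do|to) .* in\b",
    r"`[^`]+`",  # inline code in the query
]

def is_code_query(query: str) -> bool:
    q = query.lower()
    # Tokenize keeping '+' and '#' so names like "c++" survive.
    words = set(re.findall(r"[a-z+#]+", q))
    if words & LANGUAGES or words & FRAMEWORKS:
        return True
    return any(re.search(p, q) for p in PATTERNS)
```

A detector like this would gate the GitHub/StackExchange handlers and the CODE template selection, while non-matching queries fall through to the general pipeline.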
@@ -583,281 +583,103 @@ In this session, we fixed issues in the Gradio UI for report generation and plan
3. Test the current implementation with various query types to identify any remaining issues
4. Update the documentation to reflect the new features and future plans
## Session: 2025-02-28: Google Gemini Integration and Reference Formatting
## Session: 2025-03-12 - Query Type Selection in Gradio UI

### Overview
Fixed the integration of Google Gemini models with LiteLLM, and fixed reference formatting issues.
In this session, we enhanced the Gradio UI by adding a query type selection dropdown, allowing users to explicitly select the query type (factual, exploratory, comparative) instead of relying on automatic detection.

### Key Activities
1. **Fixed Google Gemini Integration**:
   - Updated the model format to `gemini/gemini-2.0-flash` in config.yaml
   - Modified message formatting for Gemini models in LLM interface
   - Added proper handling for the 'gemini' provider in environment variable setup
1. **Added Query Type Selection to Gradio UI**:
   - Added a dropdown menu for query type selection in the "Generate Report" tab
   - Included options for "auto-detect", "factual", "exploratory", and "comparative"
   - Added descriptive tooltips explaining each query type
   - Set "auto-detect" as the default option

2. **Fixed Reference Formatting Issues**:
   - Enhanced the instructions for reference formatting to ensure URLs are included
   - Added a recovery mechanism for truncated references
   - Improved context preparation to better extract URLs for references
2. **Updated Report Generation Logic**:
   - Modified the `generate_report` method in the `GradioInterface` class to handle the new query_type parameter
   - Updated the report button click handler to pass the query type to the generate_report method
   - Added logging to show when a user-selected query type is being used

3. **Converted LLM Interface Methods to Async**:
   - Made `generate_completion`, `classify_query`, and `enhance_query` methods async
   - Updated dependent code to properly await these methods
   - Fixed runtime errors related to async/await patterns
3. **Enhanced Report Generator**:
   - Updated the `generate_report` method in the `ReportGenerator` class to accept a query_type parameter
   - Modified the report synthesizer calls to pass the query_type parameter
   - Added logging to track query type usage

### Key Insights
- Gemini models require special message formatting (using 'user' and 'model' roles instead of 'system' and 'assistant')
- References were getting cut off due to token limits, requiring a separate generation step
- The async conversion was necessary to properly handle async LLM calls throughout the codebase
4. **Added Documentation**:
   - Added a "Query Types" section to the Gradio UI explaining each query type
   - Included examples of when to use each query type
   - Updated code comments to explain the query type parameter

### Insights
- Explicit query type selection gives users more control over the report generation process
- Different query types benefit from specialized report templates and structures
- The auto-detect option provides convenience while still allowing manual override
- Clear documentation helps users understand when to use each query type

### Challenges
- Ensuring that the templates produce appropriate output for each detail level
- Balancing speed and quality for different detail levels
- Managing token budgets effectively across different detail levels
- Ensuring backward compatibility with existing code
- Maintaining the auto-detect functionality while adding manual selection
- Passing the query type parameter through multiple layers of the application
- Providing clear explanations of query types for users

### Next Steps
1. Continue testing with Gemini models to ensure stable operation
2. Consider adding more robust error handling for LLM provider-specific issues
3. Improve the reference formatting further if needed
1. Test the query type selection with various queries to ensure it works correctly
2. Gather user feedback on the usefulness of manual query type selection
3. Consider adding more specialized templates for specific query types
4. Explore adding query type detection confidence scores to help users decide when to override
5. Add examples of each query type to help users understand the differences
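The async conversion noted above follows the usual pattern: interface methods become coroutines and every caller awaits them. A tiny sketch, with the method names taken from the log but the bodies as stand-ins for the real LLM calls:

```python
import asyncio

class LLMInterface:
    async def generate_completion(self, prompt: str) -> str:
        # Stand-in for the real async LLM call.
        await asyncio.sleep(0)
        return f"completion for: {prompt}"

    async def classify_query(self, query: str) -> str:
        # Callers must await; forgetting to do so was the source of the
        # runtime errors mentioned in the log.
        await self.generate_completion(f"Classify: {query}")
        return "factual" if "what is" in query.lower() else "exploratory"

async def main() -> str:
    llm = LLMInterface()
    return await llm.classify_query("What is LiteLLM?")

print(asyncio.run(main()))  # -> factual
```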
## Session: 2025-02-28: Fixing Reference Formatting and Async Implementation
## Session: 2025-03-12 - Fixed Query Type Parameter Bug

### Overview
Fixed reference formatting issues with Gemini models and updated the codebase to properly handle async methods.
Fixed a bug in the report generation process where the `query_type` parameter was not properly handled, causing an error when it was `None`.

### Key Activities
1. **Enhanced Reference Formatting**:
   - Improved instructions to emphasize including URLs for each reference
   - Added duplicate URL fields in the context to ensure URLs are captured
   - Updated the reference generation prompt to explicitly request URLs
   - Added a separate reference generation step to handle truncated references
1. **Fixed NoneType Error in Report Synthesis**:
   - Added a null check in the `_get_extraction_prompt` method in `report_synthesis.py`
   - Modified the condition that checks for comparative queries to handle the case where `query_type` is `None`
   - Ensured the method works correctly regardless of whether a query type is explicitly provided

2. **Fixed Async Implementation**:
   - Converted all LLM interface methods to async for proper handling
   - Updated QueryProcessor's generate_search_queries method to be async
   - Modified query_to_report.py to correctly await async methods
   - Fixed runtime errors related to async/await patterns

3. **Updated Gradio Interface**:
   - Modified the generate_report method to properly handle async operations
   - Updated the report button click handler to correctly pass parameters
   - Fixed the parameter order in the lambda function for async execution
   - Improved error handling in the UI
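The None-safe check described for `_get_extraction_prompt` amounts to guarding before calling `.lower()`. A hedged sketch — the prompt strings and the surrounding logic are placeholders, not the real method:

```python
from typing import Optional

def get_extraction_prompt(query_type: Optional[str]) -> str:
    # Guard before calling .lower(): query_type is legitimately None when
    # the UI leaves the dropdown on "auto-detect".
    if query_type is not None and query_type.lower() == "comparative":
        return "Extract points of comparison between the sources."
    return "Extract the key facts and findings from the sources."
```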
## Session: 2025-03-11

### Overview

Reorganized the project directory structure to improve maintainability and clarity, ensuring all components are properly organized into their respective directories.

### Key Activities

1. **Directory Structure Reorganization**:
   - Created a dedicated `utils/` directory for utility scripts
     - Moved `jina_similarity.py` to `utils/`
     - Added `__init__.py` to make it a proper Python package
   - Organized test files into subdirectories under `tests/`
     - Created subdirectories for each module (query, execution, ranking, report, ui, integration)
     - Added `__init__.py` files to all test directories
   - Created an `examples/` directory with subdirectories for data and scripts
     - Moved sample data to `examples/data/`
     - Added `__init__.py` files to make them proper Python packages
   - Added a dedicated `scripts/` directory for utility scripts
     - Moved `query_to_report.py` to `scripts/`

2. **Pipeline Verification**:
   - Tested the pipeline after reorganization to ensure functionality
   - Verified that the UI works correctly with the new directory structure
   - Confirmed that all imports are working properly with the new structure

3. **Embedding Usage Analysis**:
   - Confirmed that the pipeline uses Jina AI's Embeddings API through the `JinaSimilarity` class
   - Verified that the `JinaReranker` class uses embeddings for document reranking
   - Analyzed how embeddings are integrated into the search and ranking process
2. **Root Cause Analysis**:
   - Identified that the error occurred when the `query_type` parameter was `None` and the code tried to call `.lower()` on it
   - Traced the issue through the call chain from the UI to the report generator to the report synthesizer
   - Confirmed that the fix addresses the specific error message: `'NoneType' object has no attribute 'lower'`

### Insights

- A well-organized directory structure significantly improves code maintainability and readability
- Using proper Python package structure with `__init__.py` files ensures clean imports
- Separating tests, utilities, examples, and scripts into dedicated directories makes the codebase more navigable
- The Jina AI embeddings are used throughout the pipeline for semantic similarity and document reranking

### Challenges

- Ensuring all import statements are updated correctly after moving files
- Maintaining backward compatibility with existing code
- Verifying that all components still work together after reorganization
- Proper null checking is essential when working with optional parameters that are passed through multiple layers
- The error occurred in the report synthesis module but was triggered by the UI's query type selection feature
- The fix maintains backward compatibility while ensuring the new query type selection feature works correctly

### Next Steps
1. Test the fix with various query types to ensure it works correctly
2. Consider adding similar null checks in other parts of the code that handle the query_type parameter
3. Add more comprehensive error handling throughout the report generation process
4. Update the test suite to include tests for null query_type values

1. Run comprehensive tests to ensure all functionality works with the new directory structure
2. Update any remaining documentation to reflect the new directory structure
3. Consider moving the remaining test files in the root of the `tests/` directory to appropriate subdirectories
4. Review import statements throughout the codebase to ensure they follow the new structure

### Key Insights
- Async/await patterns need to be consistently applied throughout the codebase
- Reference formatting requires explicit instructions to include URLs
- Gradio's interface needs special handling for async functions

### Challenges
- Ensuring that all async methods are properly awaited
- Balancing detailed instructions against token limits for reference generation
- Managing the increased processing time for async operations

### Next Steps
1. Continue testing with Gemini models to ensure stable operation
2. Consider adding more robust error handling for LLM provider-specific issues
3. Improve the reference formatting further if needed
4. Update documentation to reflect the changes made to the LLM interface
5. Consider adding more unit tests for the async methods
## Session: 2025-02-28: Fixed NoneType Error in Report Synthesis

### Issue
Encountered an error during report generation:
```
TypeError: 'NoneType' object is not subscriptable
```

The error occurred in the `map_document_chunks` method of the `ReportSynthesizer` class when trying to slice a title that was `None`.

### Changes Made
1. Fixed the chunk counter in `map_document_chunks` method:
   - Used a separate counter for individual chunks instead of using the batch index
   - Added a null check for chunk titles with a fallback to 'Untitled'

2. Added defensive code in `synthesize_report` method:
   - Added code to ensure all chunks have a title before processing
   - Added null checks for title fields

3. Updated the `DocumentProcessor` class:
   - Modified `process_documents_for_report` to ensure all chunks have a title
   - Updated `chunk_document_by_sections`, `chunk_document_fixed_size`, and `chunk_document_hierarchical` methods to handle None titles
   - Added default 'Untitled' value for all title fields

### Testing
The changes were tested with a report generation task that previously failed, and the error was resolved.

### Next Steps
1. Consider adding more comprehensive null checks throughout the codebase
2. Add unit tests to verify proper handling of missing or null fields
3. Implement better error handling and recovery mechanisms
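The 'Untitled' fallback described above boils down to normalizing titles before any slicing happens. A minimal sketch, assuming chunks are dicts with an optional `title` key (the real chunk shape is not shown in this log):

```python
def normalize_chunk_titles(chunks):
    # Slicing a None title raised:
    #   TypeError: 'NoneType' object is not subscriptable
    # so every chunk gets a string title before processing.
    for chunk in chunks:
        if chunk.get("title") is None:
            chunk["title"] = "Untitled"
    return chunks

safe = normalize_chunk_titles([{"title": None}, {"title": "Introduction"}])
titles = [c["title"][:8] for c in safe]  # slicing is now safe
```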
## Session: 2025-03-11
## Session: 2025-03-12 - Fixed Template Retrieval for Null Query Type

### Overview
Focused on resolving issues with the report generation template system and ensuring that different detail levels and query types work correctly in the report synthesis process.
Fixed a second issue in the report generation process where the template retrieval was failing when the `query_type` parameter was `None`.

### Key Activities
1. **Fixed Template Retrieval Issues**:
   - Updated the `get_template` method in the `ReportTemplateManager` to ensure it retrieves templates correctly based on query type and detail level
   - Implemented a helper method `_get_template_from_strings` in the `ReportSynthesizer` to convert string values for query types and detail levels to their respective enum objects
   - Added better logging for template retrieval process to aid in debugging
1. **Fixed Template Retrieval for Null Query Type**:
   - Updated the `_get_template_from_strings` method in `report_synthesis.py` to handle `None` query_type
   - Added a default value of "exploratory" when query_type is `None`
   - Modified the method signature to explicitly indicate that query_type_str can be `None`
   - Added logging to indicate when the default query type is being used

2. **Tested All Detail Levels and Query Types**:
   - Created a comprehensive test script `test_all_detail_levels.py` to test all combinations of detail levels and query types
   - Successfully tested all detail levels (brief, standard, detailed, comprehensive) with factual queries
   - Successfully tested all detail levels with exploratory queries
   - Successfully tested all detail levels with comparative queries

3. **Improved Error Handling**:
   - Added fallback to standard templates if specific templates are not found
   - Enhanced logging to track whether templates are found during the synthesis process

4. **Code Organization**:
   - Removed duplicate `ReportTemplateManager` and `ReportTemplate` classes from `report_synthesis.py`
   - Used the imported versions from `report_templates.py` for better code maintainability

2. **Root Cause Analysis**:
   - Identified that the error occurred when trying to convert `None` to a `QueryType` enum value
   - The error message was: "No template found for None standard" and "None is not a valid QueryType"
   - The issue was in the template retrieval process which is used by both standard and progressive report synthesis

### Insights
- The template system is now working correctly for all combinations of query types and detail levels
- Proper logging is essential for debugging template retrieval issues
- Converting string values to enum objects is necessary for consistent template retrieval
- Having a dedicated test script for all combinations helps ensure comprehensive coverage

### Challenges
- Initially encountered issues where templates were not found during report synthesis, leading to `ValueError`
- Needed to ensure that the correct classes and methods were used for template retrieval
- When fixing one issue with optional parameters, it's important to check for similar issues in related code paths
- Providing sensible defaults for optional parameters helps maintain robustness
- Proper error handling and logging help diagnose issues in complex systems with multiple layers

### Next Steps
1. Conduct additional testing with real-world queries and document sets
2. Compare the analytical depth and quality of reports generated with different detail levels
3. Gather user feedback on the improved reports at different detail levels
4. Further refine the detail level configurations based on testing and feedback
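The string-to-enum conversion with an "exploratory" default for a `None` query type can be sketched like this. The minimal enums here only illustrate the conversion; they are not the project's full definitions.

```python
from enum import Enum
from typing import Optional, Tuple

class QueryType(Enum):
    FACTUAL = "factual"
    EXPLORATORY = "exploratory"
    COMPARATIVE = "comparative"

class DetailLevel(Enum):
    BRIEF = "brief"
    STANDARD = "standard"

def get_template_from_strings(
    query_type_str: Optional[str], detail_level_str: str
) -> Tuple[QueryType, DetailLevel]:
    if query_type_str is None:
        # QueryType(None) raises "None is not a valid QueryType",
        # so fall back to a sensible default before converting.
        query_type_str = QueryType.EXPLORATORY.value
    return QueryType(query_type_str.lower()), DetailLevel(detail_level_str.lower())
```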
## Session: 2025-03-12 - Report Templates and Progressive Report Generation

### Overview
Implemented a dedicated report templates module to standardize report generation across different query types and detail levels, and implemented progressive report generation for comprehensive reports.

### Key Activities
1. **Created Report Templates Module**:
   - Developed a new `report_templates.py` module with a comprehensive template system
   - Implemented `QueryType` enum for categorizing queries (FACTUAL, EXPLORATORY, COMPARATIVE)
   - Created `DetailLevel` enum for different report detail levels (BRIEF, STANDARD, DETAILED, COMPREHENSIVE)
   - Designed a `ReportTemplate` class with validation for required sections
   - Implemented a `ReportTemplateManager` to manage and retrieve templates

2. **Implemented Template Variations**:
   - Created 12 different templates (3 query types × 4 detail levels)
   - Designed templates with appropriate sections for each combination
   - Added placeholders for dynamic content in each template
   - Ensured templates follow a consistent structure while adapting to specific needs

3. **Added Testing**:
   - Created `test_report_templates.py` to verify template retrieval and validation
   - Implemented `test_brief_report.py` to test brief report generation with a simple query
   - Verified that all templates can be correctly retrieved and used

4. **Implemented Progressive Report Generation**:
   - Created a new `progressive_report_synthesis.py` module with a `ProgressiveReportSynthesizer` class
   - Implemented chunk prioritization algorithm based on relevance scores
   - Developed iterative refinement process with specialized prompts
   - Added state management to track report versions and processed chunks
   - Implemented termination conditions (all chunks processed, diminishing returns, max iterations)
   - Added support for different models with adaptive batch sizing
   - Implemented progress tracking and callback mechanism
   - Created comprehensive test suite for progressive report generation

5. **Updated Report Generator**:
   - Modified `report_generator.py` to use the progressive report synthesizer for comprehensive detail level
   - Created a hybrid system that uses standard map-reduce for brief/standard/detailed levels
   - Added proper model selection and configuration for both synthesizers

6. **Updated Memory Bank**:
   - Added report templates information to code_structure.md
   - Updated current_focus.md with implementation details for progressive report generation
   - Updated session_log.md with details about the implementation
   - Ensured all new files are properly documented

### Insights
- A standardized template system significantly improves report consistency
- Different query types require specialized report structures
- Validation ensures all required sections are present in templates
- Enums provide type safety and prevent errors from string comparisons
- Progressive report generation provides better results for very large document collections
- The hybrid approach leverages the strengths of both map-reduce and progressive methods
- Tracking improvement scores helps detect diminishing returns and optimize processing
- Adaptive batch sizing based on model context window improves efficiency

### Challenges
- Designing templates that are flexible enough for various content types
- Balancing standardization and customization for different query types
- Ensuring proper integration with the existing report synthesis process
- Managing state and tracking progress in progressive report generation
- Preventing entrenchment of initial report structure in progressive approach
- Optimizing token usage when sending entire reports for refinement
- Determining appropriate termination conditions for the progressive approach

### Next Steps
1. Integrate the progressive approach with the UI
   - Implement controls to pause, resume, or terminate the process
   - Create a preview mode to see the current report state
   - Add options to compare different versions of the report
2. Conduct additional testing with real-world queries and document sets
3. Add specialized templates for specific research domains
4. Implement template customization options for users
5. Implement visualization components for data mentioned in reports

1. Test the fix with comprehensive reports to ensure it works correctly
2. Consider adding similar default values for other optional parameters
3. Review the codebase for other potential null reference issues
4. Update documentation to clarify the behavior when optional parameters are not provided
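The progressive refinement loop described in this session — prioritize chunks by relevance, refine in batches, and stop on diminishing returns, exhaustion, or an iteration cap — can be sketched as below. The refinement step and improvement score are stand-ins; the real `ProgressiveReportSynthesizer` calls an LLM and scores real improvements.

```python
def progressive_synthesize(chunks, batch_size=2, min_improvement=0.01, max_iterations=10):
    # Prioritize chunks by relevance score, highest first.
    ordered = sorted(chunks, key=lambda c: c["score"], reverse=True)
    report = ""
    versions = []  # state management: one report version per iteration
    for _iteration in range(max_iterations):
        if not ordered:
            break  # termination: all chunks processed
        batch, ordered = ordered[:batch_size], ordered[batch_size:]
        # Stand-in for the LLM refinement call on (current report + batch).
        report += " ".join(c["text"] for c in batch) + " "
        versions.append(report)
        # Stand-in improvement score; the real system compares versions.
        improvement = sum(c["score"] for c in batch)
        if improvement < min_improvement:
            break  # termination: diminishing returns
    return report.strip(), versions
```

In the real system, `batch_size` would adapt to the model's context window and each version would be tracked so the UI can preview or compare report states.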
README.md
@@ -13,7 +13,12 @@ This system automates the research process by:
## Features

- **Query Processing**: Enhances user queries with additional context and classifies them by type and intent
- **Multi-Source Search**: Executes searches across Serper (Google), Google Scholar, and arXiv
- **Multi-Source Search**: Executes searches across general web (Serper/Google), academic sources, and current news
- **Specialized Search Handlers**:
  - **Current Events**: Optimized news search for recent developments
  - **Academic Research**: Specialized academic search with OpenAlex, CORE, arXiv, and Google Scholar
  - **Open Access Detection**: Finds freely available versions of paywalled papers using Unpaywall
  - **Code/Programming**: Specialized code search using GitHub and StackExchange
- **Intelligent Ranking**: Uses Jina AI's Re-Ranker to prioritize the most relevant results
- **Result Deduplication**: Removes duplicate results across different search engines
- **Modular Architecture**: Easily extensible with new search engines and LLM providers
@@ -24,7 +29,7 @@ This system automates the research process by:
- **Search Executor**: Executes searches across multiple engines
- **Result Collector**: Processes and organizes search results
- **Document Ranker**: Ranks documents by relevance
- **Report Generator**: Synthesizes information into a coherent report (coming soon)
- **Report Generator**: Synthesizes information into coherent reports with specialized templates for different query types

## Getting Started
@@ -33,8 +38,13 @@ This system automates the research process by:
- Python 3.8+
- API keys for:
  - Serper API (for Google and Scholar search)
  - NewsAPI (for current events search)
  - CORE API (for open access academic search)
  - GitHub API (for code search)
  - StackExchange API (for programming Q&A content)
  - Groq (or other LLM provider)
  - Jina AI (for reranking)
- Email for OpenAlex and Unpaywall (recommended but not required)

### Installation
@@ -58,8 +68,11 @@ cp config/config.yaml.example config/config.yaml
 ```yaml
 api_keys:
   serper: "your-serper-api-key"
+  newsapi: "your-newsapi-key"
   groq: "your-groq-api-key"
   jina: "your-jina-api-key"
+  github: "your-github-api-key"
+  stackexchange: "your-stackexchange-api-key"
 ```

 ### Usage
@@ -135,4 +148,10 @@ This project is licensed under the MIT License - see the LICENSE file for details

 - [Jina AI](https://jina.ai/) for their embedding and reranking APIs
 - [Serper](https://serper.dev/) for their Google search API
+- [NewsAPI](https://newsapi.org/) for their news search API
+- [OpenAlex](https://openalex.org/) for their academic search API
+- [CORE](https://core.ac.uk/) for their open access academic search API
+- [Unpaywall](https://unpaywall.org/) for their open access discovery API
 - [Groq](https://groq.com/) for their fast LLM inference
+- [GitHub](https://github.com/) for their code search API
+- [StackExchange](https://stackexchange.com/) for their programming Q&A API
@@ -1,157 +0,0 @@
# Example configuration file for the intelligent research system
# Rename this file to config.yaml and fill in your API keys and settings

# API keys (alternatively, set environment variables)
api_keys:
  openai: "your-openai-api-key"  # Or set OPENAI_API_KEY environment variable
  jina: "your-jina-api-key"  # Or set JINA_API_KEY environment variable
  serper: "your-serper-api-key"  # Or set SERPER_API_KEY environment variable
  google: "your-google-api-key"  # Or set GOOGLE_API_KEY environment variable
  anthropic: "your-anthropic-api-key"  # Or set ANTHROPIC_API_KEY environment variable
  openrouter: "your-openrouter-api-key"  # Or set OPENROUTER_API_KEY environment variable
  groq: "your-groq-api-key"  # Or set GROQ_API_KEY environment variable

# LLM model configurations
models:
  gpt-3.5-turbo:
    provider: "openai"
    temperature: 0.7
    max_tokens: 1000
    top_p: 1.0
    endpoint: null  # Use default OpenAI endpoint

  gpt-4:
    provider: "openai"
    temperature: 0.5
    max_tokens: 2000
    top_p: 1.0
    endpoint: null  # Use default OpenAI endpoint

  claude-2:
    provider: "anthropic"
    temperature: 0.7
    max_tokens: 1500
    top_p: 1.0
    endpoint: null  # Use default Anthropic endpoint

  azure-gpt-4:
    provider: "azure"
    temperature: 0.5
    max_tokens: 2000
    top_p: 1.0
    endpoint: "https://your-azure-endpoint.openai.azure.com"
    deployment_name: "your-deployment-name"
    api_version: "2023-05-15"

  local-llama:
    provider: "ollama"
    temperature: 0.8
    max_tokens: 1000
    endpoint: "http://localhost:11434/api/generate"
    model_name: "llama2"

  llama-3.1-8b-instant:
    provider: "groq"
    model_name: "llama-3.1-8b-instant"
    temperature: 0.7
    max_tokens: 1024
    top_p: 1.0
    endpoint: "https://api.groq.com/openai/v1"

  llama-3.3-70b-versatile:
    provider: "groq"
    model_name: "llama-3.3-70b-versatile"
    temperature: 0.5
    max_tokens: 2048
    top_p: 1.0
    endpoint: "https://api.groq.com/openai/v1"

  openrouter-mixtral:
    provider: "openrouter"
    model_name: "mistralai/mixtral-8x7b-instruct"
    temperature: 0.7
    max_tokens: 1024
    top_p: 1.0
    endpoint: "https://openrouter.ai/api/v1"

  openrouter-claude:
    provider: "openrouter"
    model_name: "anthropic/claude-3-opus"
    temperature: 0.5
    max_tokens: 2048
    top_p: 1.0
    endpoint: "https://openrouter.ai/api/v1"

  gemini-2.0-flash:
    provider: "gemini"
    model_name: "gemini-2.0-flash"
    temperature: 0.5
    max_tokens: 2048
    top_p: 1.0

# Default model to use if not specified for a module
default_model: "llama-3.1-8b-instant"  # Using Groq's Llama 3.1 8B model for testing

# Module-specific model assignments
module_models:
  # Query processing module
  query_processing:
    enhance_query: "llama-3.1-8b-instant"  # Use Groq's Llama 3.1 8B for query enhancement
    classify_query: "llama-3.1-8b-instant"  # Use Groq's Llama 3.1 8B for classification
    generate_search_queries: "llama-3.1-8b-instant"  # Use Groq's Llama 3.1 8B for generating search queries

  # Search strategy module
  search_strategy:
    develop_strategy: "llama-3.1-8b-instant"  # Use Groq's Llama 3.1 8B for developing search strategies
    target_selection: "llama-3.1-8b-instant"  # Use Groq's Llama 3.1 8B for target selection

  # Document ranking module
  document_ranking:
    rerank_documents: "jina-reranker"  # Use Jina's reranker for document reranking

  # Report generation module
  report_generation:
    synthesize_report: "gemini-2.0-flash"  # Use Google's Gemini 2.0 Flash for report synthesis
    format_report: "llama-3.1-8b-instant"  # Use Groq's Llama 3.1 8B for formatting

# Search engine configurations
search_engines:
  google:
    enabled: true
    max_results: 10

  serper:
    enabled: true
    max_results: 10

  jina:
    enabled: true
    max_results: 10

  scholar:
    enabled: false
    max_results: 5

  arxiv:
    enabled: false
    max_results: 5

# Jina AI specific configurations
jina:
  reranker:
    model: "jina-reranker-v2-base-multilingual"  # Default reranker model
    top_n: 10  # Default number of top results to return

# UI configuration
ui:
  theme: "light"  # light or dark
  port: 7860
  share: false
  title: "Intelligent Research System"
  description: "An automated system for finding, filtering, and synthesizing information"

# System settings
system:
  cache_dir: "data/cache"
  results_dir: "data/results"
  log_level: "INFO"  # DEBUG, INFO, WARNING, ERROR, CRITICAL
@@ -10,6 +10,10 @@ api_keys:
   anthropic: "your-anthropic-api-key"  # Or set ANTHROPIC_API_KEY environment variable
   openrouter: "your-openrouter-api-key"  # Or set OPENROUTER_API_KEY environment variable
   groq: "your-groq-api-key"  # Or set GROQ_API_KEY environment variable
+  newsapi: "your-newsapi-key"  # Or set NEWSAPI_API_KEY environment variable
+  core: "your-core-api-key"  # Or set CORE_API_KEY environment variable
+  github: "your-github-api-key"  # Or set GITHUB_API_KEY environment variable
+  stackexchange: "your-stackexchange-api-key"  # Or set STACKEXCHANGE_API_KEY environment variable

 # LLM model configurations
 models:
@@ -129,6 +133,35 @@ search_engines:
     enabled: false
     max_results: 5

+  news:
+    enabled: true
+    max_results: 10
+    days_back: 7
+    use_headlines: false  # Set to true to use top headlines endpoint
+    country: "us"  # Country code for top headlines
+    language: "en"  # Language code
+
+  openalex:
+    enabled: true
+    max_results: 10
+    filter_open_access: false  # Set to true to only return open access publications
+
+  core:
+    enabled: true
+    max_results: 10
+    full_text: true  # Set to true to search in full text of papers
+
+  github:
+    enabled: true
+    max_results: 10
+    sort: "best_match"  # Options: best_match, stars, forks, updated
+
+  stackexchange:
+    enabled: true
+    max_results: 10
+    site: "stackoverflow"  # Default site (stackoverflow, serverfault, superuser, etc.)
+    sort: "relevance"  # Options: relevance, votes, creation, activity
+
 # Jina AI specific configurations
 jina:
   reranker:
@@ -143,6 +176,22 @@ ui:
   title: "Intelligent Research System"
   description: "An automated system for finding, filtering, and synthesizing information"

+# Academic search settings
+academic_search:
+  email: "user@example.com"  # Used for Unpaywall and OpenAlex APIs
+
+  # OpenAlex settings
+  openalex:
+    default_sort: "relevance_score:desc"  # Other options: cited_by_count:desc, publication_date:desc
+
+  # Unpaywall settings
+  unpaywall:
+    # No specific settings needed
+
+  # CORE settings
+  core:
+    # No specific settings needed
+
 # System settings
 system:
   cache_dir: "data/cache"
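Each key in the config comments above notes an environment-variable fallback (e.g. `NEWSAPI_API_KEY`). A minimal sketch of that lookup order, environment first, then the `api_keys` section of `config.yaml`; the helper name `load_api_key` is hypothetical, and the project's real `get_api_key` in `config.config` may differ in detail:

```python
import os
from typing import Optional

def load_api_key(name: str, config_data: dict) -> Optional[str]:
    """Return the API key for `name`, preferring the environment variable
    (e.g. NEWSAPI_API_KEY) over the api_keys section of config.yaml."""
    env_var = f"{name.upper()}_API_KEY"
    return os.environ.get(env_var) or config_data.get("api_keys", {}).get(name)

# The environment wins over the file-based value:
os.environ["NEWSAPI_API_KEY"] = "env-key"
config_data = {"api_keys": {"newsapi": "file-key", "core": "file-key"}}
print(load_api_key("newsapi", config_data))  # env-key
print(load_api_key("core", config_data))     # file-key
```

Keeping secrets in the environment rather than in `config.yaml` also fits the `.gitignore` change in this compare, which stops tracking `config/config.yaml`.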
@@ -0,0 +1,88 @@
"""
Example script for using the academic search handlers.
"""

import asyncio
import sys
import os
from datetime import datetime

# Add the project root to the Python path
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))

from execution.search_executor import SearchExecutor
from query.query_processor import get_query_processor
from config.config import get_config


async def main():
    """Run a sample academic search."""
    # Initialize components
    query_processor = get_query_processor()
    search_executor = SearchExecutor()

    # Get a list of available search engines
    available_engines = search_executor.get_available_search_engines()
    print(f"Available search engines: {', '.join(available_engines)}")

    # Check if academic search engines are available
    academic_engines = ["openalex", "core", "scholar", "arxiv"]
    available_academic = [engine for engine in academic_engines if engine in available_engines]

    if not available_academic:
        print("No academic search engines are available. Please check your configuration.")
        return
    else:
        print(f"Available academic search engines: {', '.join(available_academic)}")

    # Prompt for the query
    query = input("Enter your academic research query: ") or "What are the latest papers on large language model alignment?"

    print(f"\nProcessing query: {query}")

    # Process the query
    start_time = datetime.now()
    structured_query = await query_processor.process_query(query)

    # Add academic query flag
    structured_query["is_academic"] = True

    # Generate search queries optimized for each engine
    structured_query = await query_processor.generate_search_queries(
        structured_query, available_academic
    )

    # Print the optimized queries
    print("\nOptimized queries for academic search:")
    for engine in available_academic:
        print(f"\n{engine.upper()} queries:")
        for i, q in enumerate(structured_query.get("search_queries", {}).get(engine, [])):
            print(f"{i+1}. {q}")

    # Execute the search
    results = await search_executor.execute_search_async(
        structured_query,
        search_engines=available_academic,
        num_results=5
    )

    # Print the results
    total_results = sum(len(engine_results) for engine_results in results.values())
    print(f"\nFound {total_results} academic results:")

    for engine, engine_results in results.items():
        print(f"\n--- {engine.upper()} Results ({len(engine_results)}) ---")
        for i, result in enumerate(engine_results):
            print(f"\n{i+1}. {result.get('title', 'No title')}")
            print(f"Authors: {result.get('authors', 'Unknown')}")
            print(f"Year: {result.get('year', 'Unknown')}")
            print(f"Access: {result.get('access_status', 'Unknown')}")
            print(f"URL: {result.get('url', 'No URL')}")
            print(f"Snippet: {result.get('snippet', 'No snippet')[0:200]}...")

    end_time = datetime.now()
    print(f"\nSearch completed in {(end_time - start_time).total_seconds():.2f} seconds")


if __name__ == "__main__":
    asyncio.run(main())
@@ -0,0 +1,76 @@
"""
Example script for using the news search handler.
"""

import asyncio
import sys
import os
from datetime import datetime

# Add the project root to the Python path
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))

from execution.search_executor import SearchExecutor
from query.query_processor import get_query_processor
from config.config import get_config


async def main():
    """Run a sample news search."""
    # Initialize components
    query_processor = get_query_processor()
    search_executor = SearchExecutor()

    # Get a list of available search engines
    available_engines = search_executor.get_available_search_engines()
    print(f"Available search engines: {', '.join(available_engines)}")

    # Check if news search is available
    if "news" not in available_engines:
        print("News search is not available. Please check your NewsAPI configuration.")
        return

    # Prompt for the query
    query = input("Enter your query about recent events: ") or "Trump tariffs latest announcement"

    print(f"\nProcessing query: {query}")

    # Process the query
    start_time = datetime.now()
    structured_query = await query_processor.process_query(query)

    # Generate search queries optimized for each engine
    structured_query = await query_processor.generate_search_queries(
        structured_query, ["news"]
    )

    # Print the optimized queries
    print("\nOptimized queries for news search:")
    for i, q in enumerate(structured_query.get("search_queries", {}).get("news", [])):
        print(f"{i+1}. {q}")

    # Execute the search
    results = await search_executor.execute_search_async(
        structured_query,
        search_engines=["news"],
        num_results=10
    )

    # Print the results
    news_results = results.get("news", [])
    print(f"\nFound {len(news_results)} news results:")

    for i, result in enumerate(news_results):
        print(f"\n--- Result {i+1} ---")
        print(f"Title: {result.get('title', 'No title')}")
        print(f"Source: {result.get('source', 'Unknown')}")
        print(f"Date: {result.get('published_date', 'Unknown date')}")
        print(f"URL: {result.get('url', 'No URL')}")
        print(f"Snippet: {result.get('snippet', 'No snippet')[0:200]}...")

    end_time = datetime.now()
    print(f"\nSearch completed in {(end_time - start_time).total_seconds():.2f} seconds")


if __name__ == "__main__":
    asyncio.run(main())
@@ -0,0 +1,160 @@
"""
CORE.ac.uk API handler.
Provides access to open access academic papers from institutional repositories.
"""

import os
import requests
from typing import Dict, List, Any, Optional

from .base_handler import BaseSearchHandler
from config.config import get_config, get_api_key


class CoreSearchHandler(BaseSearchHandler):
    """Handler for CORE.ac.uk academic search API."""

    def __init__(self):
        """Initialize the CORE search handler."""
        self.config = get_config()
        self.api_key = get_api_key("core")
        self.base_url = "https://api.core.ac.uk/v3/search/works"
        self.available = self.api_key is not None

        # Get any custom settings from config
        self.academic_config = self.config.config_data.get("academic_search", {}).get("core", {})

    def search(self, query: str, num_results: int = 10, **kwargs) -> List[Dict[str, Any]]:
        """
        Execute a search query using CORE.ac.uk.

        Args:
            query: The search query to execute
            num_results: Number of results to return
            **kwargs: Additional search parameters:
                - full_text: Whether to search in full text (default: True)
                - filter_year: Filter by publication year or range
                - sort: Sort by relevance or publication date
                - repositories: Limit to specific repositories

        Returns:
            List of search results with standardized format
        """
        if not self.available:
            raise ValueError("CORE API is not available. API key is missing.")

        # Set up the request headers
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }

        # Set up the request body
        body = {
            "q": query,
            "limit": num_results,
            "offset": 0
        }

        # Add full text search parameter
        full_text = kwargs.get("full_text", True)
        if full_text:
            body["fields"] = ["title", "authors", "year", "abstract", "fullText"]
        else:
            body["fields"] = ["title", "authors", "year", "abstract"]

        # Add year filter if specified
        if "filter_year" in kwargs:
            body["filters"] = [{"year": kwargs["filter_year"]}]

        # Add sort parameter
        if "sort" in kwargs:
            if kwargs["sort"] == "date":
                body["sort"] = [{"year": "desc"}]
            else:
                body["sort"] = [{"_score": "desc"}]  # Default to relevance

        # Add repository filter if specified
        if "repositories" in kwargs:
            if "filters" not in body:
                body["filters"] = []
            body["filters"].append({"repositoryIds": kwargs["repositories"]})

        try:
            # Make the request
            response = requests.post(self.base_url, headers=headers, json=body)
            response.raise_for_status()

            # Parse the response
            data = response.json()

            # Process the results
            results = []
            for item in data.get("results", []):
                # Extract authors
                authors = []
                for author in item.get("authors", [])[:3]:
                    author_name = author.get("name", "")
                    if author_name:
                        authors.append(author_name)

                # Get publication year
                pub_year = item.get("year", "Unknown")

                # Get DOI
                doi = item.get("doi", "")

                # Determine URL - prefer the download URL if available
                url = item.get("downloadUrl", "")
                if not url and doi:
                    url = f"https://doi.org/{doi}"
                if not url:
                    url = item.get("sourceFulltextUrls", [""])[0] if item.get("sourceFulltextUrls") else ""

                # Create snippet from abstract or first part of full text
                snippet = item.get("abstract", "")
                if not snippet and "fullText" in item:
                    snippet = item.get("fullText", "")[:500] + "..."

                # If no snippet is available, create one from metadata
                if not snippet:
                    journal = item.get("publisher", "Unknown Journal")
                    snippet = f"Open access academic paper from {journal}. {pub_year}."

                # Create the result
                result = {
                    "title": item.get("title", "Untitled"),
                    "url": url,
                    "snippet": snippet,
                    "source": "core",
                    "authors": ", ".join(authors),
                    "year": pub_year,
                    "journal": item.get("publisher", ""),
                    "doi": doi,
                    "open_access": True  # CORE only indexes open access content
                }

                results.append(result)

            return results

        except requests.exceptions.RequestException as e:
            print(f"Error executing CORE search: {e}")
            return []

    def get_name(self) -> str:
        """Get the name of the search handler."""
        return "core"

    def is_available(self) -> bool:
        """Check if the CORE API is available."""
        return self.available

    def get_rate_limit_info(self) -> Dict[str, Any]:
        """Get information about the API's rate limits."""
        # These limits are based on the free tier
        return {
            "requests_per_minute": 30,
            "requests_per_day": 10000,
            "current_usage": None
        }
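The branching in the CORE handler's `search()` assembles a single JSON request body from the keyword arguments. As a sketch, the same logic factored into a standalone function (the `build_core_body` helper is hypothetical, shown only so the body construction can be unit-tested without network access):

```python
from typing import Any, Dict

def build_core_body(query: str, num_results: int = 10, **kwargs) -> Dict[str, Any]:
    """Assemble a CORE /v3/search/works request body, mirroring the handler:
    field list, optional year filter, sort order, repository filter."""
    body: Dict[str, Any] = {"q": query, "limit": num_results, "offset": 0}
    fields = ["title", "authors", "year", "abstract"]
    if kwargs.get("full_text", True):
        fields.append("fullText")  # full-text search is the default
    body["fields"] = fields
    if "filter_year" in kwargs:
        body["filters"] = [{"year": kwargs["filter_year"]}]
    if kwargs.get("sort") == "date":
        body["sort"] = [{"year": "desc"}]
    elif "sort" in kwargs:
        body["sort"] = [{"_score": "desc"}]  # default to relevance
    if "repositories" in kwargs:
        body.setdefault("filters", []).append({"repositoryIds": kwargs["repositories"]})
    return body

body = build_core_body("transformer alignment", filter_year=2023, sort="date")
```

A pure builder like this keeps the HTTP call and the request construction separately testable, which is useful when the API's filter syntax changes.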
@@ -0,0 +1,206 @@
"""
GitHub API handler for code search.

This module implements a search handler for GitHub's API,
allowing code searches across GitHub repositories.
"""

import os
import requests
from typing import Dict, List, Any, Optional

from config.config import get_config
from ..api_handlers.base_handler import BaseSearchHandler


class GitHubSearchHandler(BaseSearchHandler):
    """Handler for GitHub code search."""

    def __init__(self):
        """Initialize the GitHub search handler."""
        self.config = get_config()
        self.api_key = os.environ.get('GITHUB_API_KEY') or self.config.config_data.get('api_keys', {}).get('github')
        self.api_url = "https://api.github.com"
        self.search_endpoint = "/search/code"
        self.user_agent = "SimSearch-Research-Assistant"

    def search(self, query: str, num_results: int = 10, **kwargs) -> List[Dict[str, Any]]:
        """
        Execute a code search on GitHub.

        Args:
            query: The search query
            num_results: Number of results to return
            **kwargs: Additional search parameters
                - language: Filter by programming language
                - sort: Sort by (indexed, stars, forks, updated)
                - order: Sort order (asc, desc)

        Returns:
            List of search results
        """
        if not self.is_available():
            return []

        # Prepare query parameters
        params = {
            "q": query,
            "per_page": min(num_results, 30),  # GitHub API limit
            "page": 1
        }

        # Add optional parameters
        if kwargs.get("language"):
            params["q"] += f" language:{kwargs['language']}"
        if kwargs.get("sort"):
            params["sort"] = kwargs["sort"]
        if kwargs.get("order"):
            params["order"] = kwargs["order"]

        # Set up headers
        headers = {
            "Authorization": f"token {self.api_key}",
            "Accept": "application/vnd.github.v3+json",
            "User-Agent": self.user_agent
        }

        try:
            # Make the API request
            response = requests.get(
                f"{self.api_url}{self.search_endpoint}",
                params=params,
                headers=headers
            )
            response.raise_for_status()

            # Process results
            data = response.json()
            results = []

            for item in data.get("items", []):
                # For each code result, fetch a bit of the file content
                snippet = self._get_code_snippet(item) if item.get("url") else "Code snippet not available"

                # Construct a standardized result entry
                result = {
                    "title": item.get("name", "Unnamed"),
                    "url": item.get("html_url", ""),
                    "snippet": snippet,
                    "source": "github",
                    "metadata": {
                        "repository": item.get("repository", {}).get("full_name", ""),
                        "path": item.get("path", ""),
                        "language": kwargs.get("language", ""),
                        "score": item.get("score", 0)
                    }
                }
                results.append(result)

            return results

        except requests.RequestException as e:
            print(f"GitHub API error: {e}")
            return []

    def _get_code_snippet(self, item: Dict[str, Any]) -> str:
        """
        Fetch a snippet of the code file.

        Args:
            item: The GitHub code search result item

        Returns:
            A string containing a snippet of the code
        """
        try:
            # Get the raw content URL
            content_url = item.get("url")
            if not content_url:
                return "Content not available"

            # Request the content
            headers = {
                "Authorization": f"token {self.api_key}",
                "Accept": "application/vnd.github.v3.raw",
                "User-Agent": self.user_agent
            }

            response = requests.get(content_url, headers=headers)
            response.raise_for_status()

            # Get content and create a snippet
            content = response.json().get("content", "")
            if content:
                # GitHub returns Base64 encoded content
                import base64
                decoded = base64.b64decode(content).decode('utf-8')

                # Create a snippet (first ~500 chars)
                snippet = decoded[:500] + ("..." if len(decoded) > 500 else "")
                return snippet
            return "Content not available"

        except Exception as e:
            print(f"Error fetching code snippet: {e}")
            return "Error fetching code snippet"

    def get_name(self) -> str:
        """
        Get the name of the search handler.

        Returns:
            Name of the search handler
        """
        return "github"

    def is_available(self) -> bool:
        """
        Check if the GitHub API is available and properly configured.

        Returns:
            True if the API is available, False otherwise
        """
        return self.api_key is not None

    def get_rate_limit_info(self) -> Dict[str, Any]:
        """
        Get information about GitHub API rate limits.

        Returns:
            Dictionary with rate limit information
        """
        if not self.is_available():
            return {"error": "GitHub API not configured"}

        try:
            headers = {
                "Authorization": f"token {self.api_key}",
                "Accept": "application/vnd.github.v3+json",
                "User-Agent": self.user_agent
            }

            response = requests.get(
                f"{self.api_url}/rate_limit",
                headers=headers
            )
            response.raise_for_status()

            data = response.json()
            rate_limits = data.get("resources", {}).get("search", {})

            return {
                "requests_per_minute": 30,  # GitHub search API limit
                "requests_per_hour": rate_limits.get("limit", 0),
                "current_usage": {
                    "remaining": rate_limits.get("remaining", 0),
                    "reset_time": rate_limits.get("reset", 0)
                }
            }

        except Exception as e:
            print(f"Error getting rate limit info: {e}")
            return {
                "error": str(e),
                "requests_per_minute": 30,
                "requests_per_hour": 5000  # Default limit
            }
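`_get_code_snippet` above decodes the Base64 file content GitHub returns and truncates it to roughly 500 characters. That step in isolation, as a sketch (the `decode_snippet` helper is hypothetical, shown only to illustrate the truncation rule):

```python
import base64

def decode_snippet(encoded: str, limit: int = 500) -> str:
    """Decode base64-encoded file content and truncate it to a short snippet,
    appending an ellipsis only when something was cut off."""
    decoded = base64.b64decode(encoded).decode("utf-8")
    return decoded[:limit] + ("..." if len(decoded) > limit else "")

payload = base64.b64encode(b"print('hello')\n").decode("ascii")
print(decode_snippet(payload))
```

Note that GitHub's content field contains embedded newlines in the Base64 text; `base64.b64decode` discards non-alphabet characters by default, so no pre-cleaning is needed.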
@ -0,0 +1,152 @@
|
|||
"""
|
||||
NewsAPI handler for current events searches.
|
||||
Provides access to recent news articles from various sources.
|
||||
"""
|
||||
|
||||
import os
|
||||
import requests
|
||||
import datetime
|
||||
from typing import Dict, List, Any, Optional
|
||||
|
||||
from .base_handler import BaseSearchHandler
|
||||
from config.config import get_config, get_api_key
|
||||
|
||||
|
||||
class NewsSearchHandler(BaseSearchHandler):
|
||||
"""Handler for NewsAPI.org for current events searches."""
|
||||
|
||||
def __init__(self):
|
||||
"""Initialize the NewsAPI search handler."""
|
||||
self.config = get_config()
|
||||
self.api_key = get_api_key("newsapi")
|
||||
self.base_url = "https://newsapi.org/v2/everything"
|
||||
self.top_headlines_url = "https://newsapi.org/v2/top-headlines"
|
||||
self.available = self.api_key is not None
|
||||
|
||||
def search(self, query: str, num_results: int = 10, **kwargs) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Execute a search query using NewsAPI.
|
||||
|
||||
Args:
|
||||
query: The search query to execute
|
||||
num_results: Number of results to return
|
||||
**kwargs: Additional search parameters:
|
||||
- days_back: Number of days back to search (default: 7)
|
||||
- sort_by: Sort by criteria ("relevancy", "popularity", "publishedAt")
|
||||
- language: Language code (default: "en")
|
||||
- sources: Comma-separated list of news sources
|
||||
- domains: Comma-separated list of domains
|
||||
- use_headlines: Whether to use top headlines endpoint (default: False)
|
||||
- country: Country code for headlines (default: "us")
|
||||
- category: Category for headlines
|
||||
|
||||
Returns:
|
||||
List of search results with standardized format
|
||||
"""
|
||||
if not self.available:
|
||||
raise ValueError("NewsAPI is not available. API key is missing.")
|
||||
|
||||
# Determine which endpoint to use
|
||||
use_headlines = kwargs.get("use_headlines", False)
|
||||
url = self.top_headlines_url if use_headlines else self.base_url
|
||||
|
||||
# Calculate date range
|
||||
days_back = kwargs.get("days_back", 7)
|
||||
end_date = datetime.datetime.now().strftime("%Y-%m-%d")
|
||||
start_date = (datetime.datetime.now() - datetime.timedelta(days=days_back)).strftime("%Y-%m-%d")
|
||||
|
||||
# Set up the request parameters
|
||||
params = {
|
||||
"q": query,
|
||||
"pageSize": num_results,
|
||||
"apiKey": self.api_key,
|
||||
}
|
||||
|
||||
# Add parameters for everything endpoint
|
||||
if not use_headlines:
|
||||
params["from"] = start_date
|
||||
params["to"] = end_date
|
||||
params["sortBy"] = kwargs.get("sort_by", "publishedAt")
|
||||
|
||||
if "language" in kwargs:
|
||||
params["language"] = kwargs["language"]
|
||||
else:
|
||||
params["language"] = "en" # Default to English
|
||||
|
||||
if "sources" in kwargs:
|
||||
params["sources"] = kwargs["sources"]
|
||||
|
||||
if "domains" in kwargs:
|
||||
params["domains"] = kwargs["domains"]
|
        # Add parameters for top-headlines endpoint
        else:
            if "country" in kwargs:
                params["country"] = kwargs["country"]
            else:
                params["country"] = "us"  # Default to US

            if "category" in kwargs:
                params["category"] = kwargs["category"]

        try:
            # Make the request
            response = requests.get(url, params=params)
            response.raise_for_status()

            # Parse the response
            data = response.json()

            # Check if the request was successful
            if data.get("status") != "ok":
                print(f"NewsAPI error: {data.get('message', 'Unknown error')}")
                return []

            # Process the results
            results = []
            for article in data.get("articles", []):
                # Get the publication date with proper formatting
                pub_date = article.get("publishedAt", "")
                if pub_date:
                    try:
                        date_obj = datetime.datetime.fromisoformat(pub_date.replace("Z", "+00:00"))
                        formatted_date = date_obj.strftime("%Y-%m-%d %H:%M:%S")
                    except ValueError:
                        formatted_date = pub_date
                else:
                    formatted_date = ""

                # Create a standardized result
                result = {
                    "title": article.get("title", ""),
                    "url": article.get("url", ""),
                    "snippet": article.get("description", ""),
                    "source": f"news:{article.get('source', {}).get('name', 'unknown')}",
                    "published_date": formatted_date,
                    "author": article.get("author", ""),
                    "image_url": article.get("urlToImage", ""),
                    "content": article.get("content", "")
                }
                results.append(result)

            return results

        except requests.exceptions.RequestException as e:
            print(f"Error executing NewsAPI search: {e}")
            return []

    def get_name(self) -> str:
        """Get the name of the search handler."""
        return "news"

    def is_available(self) -> bool:
        """Check if the NewsAPI is available."""
        return self.available

    def get_rate_limit_info(self) -> Dict[str, Any]:
        """Get information about the API's rate limits."""
        # These are based on NewsAPI's developer plan
        return {
            "requests_per_minute": 100,
            "requests_per_day": 500,  # Free tier limit
            "current_usage": None  # NewsAPI doesn't provide usage info in responses
        }

@@ -0,0 +1,180 @@
"""
OpenAlex API handler.
Provides access to academic research papers and scholarly information.
"""

import os
import requests
from typing import Dict, List, Any, Optional

from .base_handler import BaseSearchHandler
from config.config import get_config, get_api_key


class OpenAlexSearchHandler(BaseSearchHandler):
    """Handler for the OpenAlex academic search API."""

    def __init__(self):
        """Initialize the OpenAlex search handler."""
        self.config = get_config()
        # OpenAlex doesn't require an API key, but supplying an email is recommended
        self.email = self.config.config_data.get("academic_search", {}).get("email", "user@example.com")
        self.base_url = "https://api.openalex.org/works"
        self.available = True  # OpenAlex doesn't require an API key

        # Get any custom settings from config
        self.academic_config = self.config.config_data.get("academic_search", {}).get("openalex", {})

    def search(self, query: str, num_results: int = 10, **kwargs) -> List[Dict[str, Any]]:
        """
        Execute a search query using OpenAlex.

        Args:
            query: The search query to execute
            num_results: Number of results to return
            **kwargs: Additional search parameters:
                - filter_type: Filter by work type (article, book, etc.)
                - filter_year: Filter by publication year or range
                - filter_open_access: Only return open access publications
                - sort: Sort by relevance, citations, publication date
                - filter_concept: Filter by academic concept/field

        Returns:
            List of search results with standardized format
        """
        # Build the search URL with parameters
        params = {
            "search": query,
            "per_page": num_results,
            "mailto": self.email  # Good practice for the API
        }

        # Add filters
        filters = []

        # Type filter (article, book, etc.)
        if "filter_type" in kwargs:
            filters.append(f"type.id:{kwargs['filter_type']}")

        # Year filter
        if "filter_year" in kwargs:
            filters.append(f"publication_year:{kwargs['filter_year']}")

        # Open access filter
        if kwargs.get("filter_open_access", False):
            filters.append("is_oa:true")

        # Concept/field filter
        if "filter_concept" in kwargs:
            filters.append(f"concepts.id:{kwargs['filter_concept']}")

        # Combine filters if there are any
        if filters:
            params["filter"] = ",".join(filters)

        # Sort parameter
        if "sort" in kwargs:
            params["sort"] = kwargs["sort"]
        else:
            # Default to sorting by relevance score
            params["sort"] = "relevance_score:desc"

        try:
            # Make the request
            response = requests.get(self.base_url, params=params)
            response.raise_for_status()

            # Parse the response
            data = response.json()

            # Process the results
            results = []
            for item in data.get("results", []):
                # Extract authors
                authors = []
                for author in item.get("authorships", [])[:3]:
                    author_name = author.get("author", {}).get("display_name", "")
                    if author_name:
                        authors.append(author_name)

                # Format citation count
                citation_count = item.get("cited_by_count", 0)

                # Get the publication year
                pub_year = item.get("publication_year", "Unknown")

                # Check if it's open access
                is_oa = item.get("open_access", {}).get("is_oa", False)
                oa_status = "Open Access" if is_oa else "Subscription"

                # Get journal/venue name
                journal = None
                if "primary_location" in item and item["primary_location"]:
                    source = item.get("primary_location", {}).get("source", {})
                    if source:
                        journal = source.get("display_name", "Unknown Journal")

                # Get DOI
                doi = item.get("doi")
                url = f"https://doi.org/{doi}" if doi else item.get("url", "")

                # Get abstract
                abstract = item.get("abstract_inverted_index", None)
                snippet = ""

                # Convert abstract_inverted_index to readable text if available
                if abstract:
                    try:
                        # The OpenAlex API uses an inverted index format;
                        # we need to reconstruct the text from this format
                        words = {}
                        for word, positions in abstract.items():
                            for pos in positions:
                                words[pos] = word

                        # Reconstruct the abstract from the positions
                        snippet = " ".join([words.get(i, "") for i in sorted(words.keys())])
                    except Exception:
                        snippet = "Abstract not available in readable format"

                # Fallback if no abstract is available
                if not snippet:
                    snippet = f"Academic paper: {item.get('title', 'Untitled')}. Published in {journal or 'Unknown'} ({pub_year}). {citation_count} citations."

                # Create the result
                result = {
                    "title": item.get("title", "Untitled"),
                    "url": url,
                    "snippet": snippet,
                    "source": "openalex",
                    "authors": ", ".join(authors),
                    "year": pub_year,
                    "citation_count": citation_count,
                    "access_status": oa_status,
                    "journal": journal,
                    "doi": doi
                }

                results.append(result)

            return results

        except requests.exceptions.RequestException as e:
            print(f"Error executing OpenAlex search: {e}")
            return []

    def get_name(self) -> str:
        """Get the name of the search handler."""
        return "openalex"

    def is_available(self) -> bool:
        """Check if the OpenAlex API is available."""
        return self.available

    def get_rate_limit_info(self) -> Dict[str, Any]:
        """Get information about the API's rate limits."""
        return {
            "requests_per_minute": 100,  # OpenAlex is quite generous with rate limits
            "requests_per_day": 100000,  # 100k requests per day for anonymous users
            "current_usage": None  # OpenAlex doesn't provide usage info in responses
        }
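The abstract handling above depends on OpenAlex's inverted-index format, where each word maps to the list of positions at which it occurs. The reconstruction step can be isolated as a small pure function — the name below is illustrative, not part of the handler:

```python
def reconstruct_abstract(inverted_index: dict) -> str:
    """Rebuild plain text from an OpenAlex abstract_inverted_index mapping."""
    # Invert the mapping: position -> word, then join words in positional order.
    words = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            words[pos] = word
    return " ".join(words[i] for i in sorted(words))
```

A word occurring at several positions is emitted once per position, which is exactly what the original text contained.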

@@ -0,0 +1,231 @@
"""
StackExchange API handler for programming question search.

This module implements a search handler for the StackExchange API,
focusing on Stack Overflow and related programming Q&A sites.
"""

import os
import requests
import time
from typing import Dict, List, Any, Optional
from urllib.parse import quote

from config.config import get_config
from ..api_handlers.base_handler import BaseSearchHandler


class StackExchangeSearchHandler(BaseSearchHandler):
    """Handler for StackExchange/Stack Overflow search."""

    def __init__(self):
        """Initialize the StackExchange search handler."""
        self.config = get_config()
        self.api_key = os.environ.get('STACKEXCHANGE_API_KEY') or self.config.config_data.get('api_keys', {}).get('stackexchange')
        self.api_url = "https://api.stackexchange.com/2.3"
        self.search_endpoint = "/search/advanced"
        self.last_request_time = 0
        self.min_request_interval = 1.0  # seconds between requests to avoid throttling

    def search(self, query: str, num_results: int = 10, **kwargs) -> List[Dict[str, Any]]:
        """
        Execute a search on StackExchange.

        Args:
            query: The search query
            num_results: Number of results to return
            **kwargs: Additional search parameters
                - site: StackExchange site to search (default: stackoverflow)
                - sort: Sort by (relevance, votes, creation, activity)
                - tags: List of tags to filter by
                - accepted: Only return questions with accepted answers

        Returns:
            List of search results
        """
        if not self.is_available():
            return []

        # Rate limiting to avoid API restrictions
        self._respect_rate_limit()

        # Prepare query parameters
        site = kwargs.get("site", "stackoverflow")
        params = {
            "q": query,
            "site": site,
            "pagesize": min(num_results, 30),  # SE API limit per page
            "page": 1,
            "filter": "withbody",  # Include question body
            "key": self.api_key
        }

        # Add optional parameters
        if kwargs.get("sort"):
            params["sort"] = kwargs["sort"]
        if kwargs.get("tags"):
            params["tagged"] = ";".join(kwargs["tags"])
        if kwargs.get("accepted"):
            params["accepted"] = "True"

        try:
            # Make the API request
            response = requests.get(
                f"{self.api_url}{self.search_endpoint}",
                params=params
            )
            response.raise_for_status()

            # Process results
            data = response.json()
            results = []

            for item in data.get("items", []):
                # Get answer count and score
                answer_count = item.get("answer_count", 0)
                score = item.get("score", 0)
                has_accepted = item.get("is_answered", False)

                # Format tags
                tags = item.get("tags", [])
                tag_str = ", ".join(tags)

                # Create snippet from question body
                body = item.get("body", "")
                snippet = self._extract_snippet(body, max_length=300)

                # Additional metadata for result display
                meta_info = f"Score: {score} | Answers: {answer_count}"
                if has_accepted:
                    meta_info += " | Has accepted answer"

                # Format the snippet with meta information
                full_snippet = f"{snippet}\n\nTags: {tag_str}\n{meta_info}"

                # Construct a standardized result entry
                result = {
                    "title": item.get("title", "Unnamed Question"),
                    "url": item.get("link", ""),
                    "snippet": full_snippet,
                    "source": f"stackexchange_{site}",
                    "metadata": {
                        "score": score,
                        "answer_count": answer_count,
                        "has_accepted": has_accepted,
                        "tags": tags,
                        "question_id": item.get("question_id", ""),
                        "creation_date": item.get("creation_date", "")
                    }
                }
                results.append(result)

            return results

        except requests.RequestException as e:
            print(f"StackExchange API error: {e}")
            return []

    def _extract_snippet(self, html_content: str, max_length: int = 300) -> str:
        """
        Extract a readable snippet from HTML content.

        Args:
            html_content: HTML content from Stack Overflow
            max_length: Maximum length of the snippet

        Returns:
            A plain text snippet
        """
        try:
            # Basic HTML tag removal (a more robust solution would use a library like BeautifulSoup)
            import re
            text = re.sub(r'<[^>]+>', ' ', html_content)

            # Remove excessive whitespace
            text = re.sub(r'\s+', ' ', text).strip()

            # Truncate to max_length
            if len(text) > max_length:
                text = text[:max_length] + "..."

            return text

        except Exception as e:
            print(f"Error extracting snippet: {e}")
            return "Snippet extraction failed"

    def _respect_rate_limit(self):
        """
        Ensure we don't exceed StackExchange API rate limits.
        """
        current_time = time.time()
        time_since_last = current_time - self.last_request_time

        if time_since_last < self.min_request_interval:
            sleep_time = self.min_request_interval - time_since_last
            time.sleep(sleep_time)

        self.last_request_time = time.time()

    def get_name(self) -> str:
        """
        Get the name of the search handler.

        Returns:
            Name of the search handler
        """
        return "stackexchange"

    def is_available(self) -> bool:
        """
        Check if the StackExchange API is available.
        Note: the StackExchange API can be used without an API key, with reduced quotas.

        Returns:
            True if the API is available
        """
        return True  # Can be used with or without an API key

    def get_rate_limit_info(self) -> Dict[str, Any]:
        """
        Get information about StackExchange API rate limits.

        Returns:
            Dictionary with rate limit information
        """
        quota_max = 300 if self.api_key else 100  # Default quotas

        try:
            # Make a request to check quota
            params = {
                "site": "stackoverflow"
            }
            if self.api_key:
                params["key"] = self.api_key

            response = requests.get(
                f"{self.api_url}/info",
                params=params
            )
            response.raise_for_status()

            data = response.json()
            quota_remaining = data.get("quota_remaining", quota_max)

            return {
                "requests_per_minute": 30,  # Conservative estimate
                "requests_per_day": quota_max,
                "current_usage": {
                    "remaining": quota_remaining,
                    "max": quota_max,
                    "reset_time": "Daily"  # SE resets quotas daily
                }
            }

        except Exception as e:
            print(f"Error getting rate limit info: {e}")
            return {
                "error": str(e),
                "requests_per_minute": 30,
                "requests_per_day": quota_max
            }
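The `_extract_snippet` helper above boils down to two regex passes plus truncation. The same technique as a standalone function — the name is illustrative, and as the handler's own comment notes, a real HTML parser such as BeautifulSoup would be more robust:

```python
import re

def html_to_snippet(html: str, max_length: int = 300) -> str:
    """Strip HTML tags, collapse whitespace, and truncate to max_length."""
    text = re.sub(r'<[^>]+>', ' ', html)      # replace each tag with a space
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    if len(text) > max_length:
        text = text[:max_length] + "..."
    return text
```

Replacing tags with a space (rather than the empty string) keeps words from adjacent elements from fusing together; the whitespace pass then tidies the result.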

@@ -28,6 +28,15 @@ class ResultCollector:
            print("Jina Reranker not available. Will use basic scoring instead.")
            self.reranker_available = False

        # Initialize result enrichers
        try:
            from .result_enrichers.unpaywall_enricher import UnpaywallEnricher
            self.unpaywall_enricher = UnpaywallEnricher()
            self.unpaywall_available = True
        except (ImportError, ValueError):
            print("Unpaywall enricher not available. Will not enrich results with open access links.")
            self.unpaywall_available = False

    def process_results(self,
                        search_results: Dict[str, List[Dict[str, Any]]],
                        dedup: bool = True,

@@ -68,6 +77,16 @@ class ResultCollector:
        if dedup:
            print(f"Deduplicated to {len(flattened_results)} results")

        # Enrich results with open access links if available
        is_academic_query = any(result.get("source") in ["openalex", "core", "arxiv", "scholar"] for result in flattened_results)
        if is_academic_query and hasattr(self, 'unpaywall_enricher') and self.unpaywall_available:
            print("Enriching academic results with open access information")
            try:
                flattened_results = self.unpaywall_enricher.enrich_results(flattened_results)
                print("Results enriched with open access information")
            except Exception as e:
                print(f"Error enriching results with Unpaywall: {str(e)}")

        # Apply reranking if requested and available
        if use_reranker and self.reranker is not None:
            print("Using Jina Reranker for semantic ranking")
@@ -161,12 +180,22 @@ class ResultCollector:
        source = result.get("source", "")
        if source == "scholar":
            score += 10
        elif source == "openalex":
            score += 10  # Top priority for academic queries
        elif source == "core":
            score += 9  # High priority for open access academic content
        elif source == "arxiv":
            score += 8  # Good for preprints and specific fields
        elif source == "github":
            score += 9  # High priority for code/programming queries
        elif source.startswith("stackexchange"):
            score += 10  # Top priority for code/programming questions
        elif source == "serper":
            score += 7  # General web search
        elif source == "news":
            score += 8  # Good for current events
        elif source == "google":
            score += 5  # Generic search

        # Boost score based on position in original results
        position = result.get("raw_data", {}).get("position", 0)
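The per-source boosts in this hunk can be expressed as a small lookup, which makes the priority ordering easy to test. This is a sketch with the weights copied from the branch above; the function name is illustrative:

```python
def source_priority(source: str) -> int:
    """Return the base score boost for a result's source engine."""
    # StackExchange sources are tagged "stackexchange_<site>", so match by prefix
    if source.startswith("stackexchange"):
        return 10
    weights = {
        "scholar": 10, "openalex": 10,   # academic sources rank highest
        "core": 9, "github": 9,
        "arxiv": 8, "news": 8,
        "serper": 7,                      # general web search
        "google": 5,                      # generic search
    }
    return weights.get(source, 0)
```

Unknown sources fall through to a boost of 0, matching the if/elif chain's behavior of adding nothing when no branch matches.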

@@ -0,0 +1,7 @@
"""
Result enrichers for improving search results with additional data.
"""

from .unpaywall_enricher import UnpaywallEnricher

__all__ = ["UnpaywallEnricher"]

@@ -0,0 +1,132 @@
"""
Unpaywall enricher for finding open access versions of scholarly articles.
"""

import os
import requests
from typing import Dict, List, Any, Optional

from config.config import get_config, get_api_key


class UnpaywallEnricher:
    """Enricher for finding open access versions of papers using Unpaywall."""

    def __init__(self):
        """Initialize the Unpaywall enricher."""
        self.config = get_config()
        # Unpaywall recommends supplying an email for API access
        self.email = self.config.config_data.get("academic_search", {}).get("email", "user@example.com")
        self.base_url = "https://api.unpaywall.org/v2/"
        self.available = True  # Unpaywall doesn't require an API key, just an email

        # Get any custom settings from config
        self.academic_config = self.config.config_data.get("academic_search", {}).get("unpaywall", {})

    def enrich_results(self, results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Enrich search results with open access links from Unpaywall.

        Args:
            results: List of search results to enrich

        Returns:
            Enriched list of search results
        """
        if not self.available:
            return results

        # Process each result that has a DOI
        for result in results:
            doi = result.get("doi")
            if not doi:
                continue

            # Skip results that are already marked as open access
            if result.get("open_access", False) or result.get("access_status") == "Open Access":
                continue

            # Look up the DOI in Unpaywall
            oa_data = self._lookup_doi(doi)
            if not oa_data:
                continue

            # Enrich the result with open access data
            if oa_data.get("is_oa", False):
                result["open_access"] = True
                result["access_status"] = "Open Access"

                # Get the best open access URL
                best_oa_url = self._get_best_oa_url(oa_data)
                if best_oa_url:
                    result["oa_url"] = best_oa_url
                    # Add a note to the snippet about open access availability
                    if "snippet" in result:
                        result["snippet"] += " [Open access version available]"
            else:
                result["open_access"] = False
                result["access_status"] = "Subscription"

        return results

    def _lookup_doi(self, doi: str) -> Optional[Dict[str, Any]]:
        """
        Look up a DOI in Unpaywall.

        Args:
            doi: The DOI to look up

        Returns:
            Unpaywall data for the DOI, or None if not found
        """
        try:
            # Normalize the DOI
            doi = doi.strip().lower()
            if doi.startswith("https://doi.org/"):
                doi = doi[16:]
            elif doi.startswith("doi:"):
                doi = doi[4:]

            # Make the request to Unpaywall
            url = f"{self.base_url}{doi}?email={self.email}"
            response = requests.get(url)

            # Check for a successful response
            if response.status_code == 200:
                return response.json()

            return None
        except Exception as e:
            print(f"Error looking up DOI in Unpaywall: {e}")
            return None

    def _get_best_oa_url(self, oa_data: Dict[str, Any]) -> Optional[str]:
        """
        Get the best open access URL from Unpaywall data.

        Args:
            oa_data: Unpaywall data for a DOI

        Returns:
            Best open access URL, or None if not available
        """
        # Check if there's a best OA location
        best_oa_location = oa_data.get("best_oa_location", None)
        if best_oa_location:
            # Get the URL from the best location, preferring the PDF
            return best_oa_location.get("url_for_pdf") or best_oa_location.get("url")

        # If no best location, check all OA locations
        oa_locations = oa_data.get("oa_locations", [])
        if oa_locations:
            # Prefer PDF URLs
            for location in oa_locations:
                if location.get("url_for_pdf"):
                    return location.get("url_for_pdf")

            # Fall back to HTML URLs
            for location in oa_locations:
                if location.get("url"):
                    return location.get("url")

        return None
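The DOI normalization inside `_lookup_doi` is worth isolating, since the magic slice offsets (`doi[16:]`, `doi[4:]`) are just the lengths of the stripped prefixes. A standalone sketch that makes those lengths explicit — the function name is illustrative:

```python
def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common URL/scheme prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]  # same as doi[16:] / doi[4:] in the enricher
            break
    return doi
```

Note that lowercasing happens before the prefix check, so a mixed-case `DOI:` or `HTTPS://doi.org/` prefix is still stripped.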

@@ -15,6 +15,12 @@ from .api_handlers.base_handler import BaseSearchHandler
from .api_handlers.serper_handler import SerperSearchHandler
from .api_handlers.scholar_handler import ScholarSearchHandler
from .api_handlers.arxiv_handler import ArxivSearchHandler
from .api_handlers.news_handler import NewsSearchHandler
from .api_handlers.openalex_handler import OpenAlexSearchHandler
from .api_handlers.core_handler import CoreSearchHandler
from .api_handlers.github_handler import GitHubSearchHandler
from .api_handlers.stackexchange_handler import StackExchangeSearchHandler
from .result_enrichers.unpaywall_enricher import UnpaywallEnricher


class SearchExecutor:

@@ -30,6 +36,9 @@ class SearchExecutor:
        self.available_handlers = {name: handler for name, handler in self.handlers.items()
                                   if handler.is_available()}

        # Initialize result enrichers
        self.unpaywall_enricher = UnpaywallEnricher()

    def _initialize_handlers(self) -> Dict[str, BaseSearchHandler]:
        """
        Initialize all search handlers.

@@ -40,7 +49,12 @@ class SearchExecutor:
        return {
            "serper": SerperSearchHandler(),
            "scholar": ScholarSearchHandler(),
            "arxiv": ArxivSearchHandler(),
            "news": NewsSearchHandler(),
            "openalex": OpenAlexSearchHandler(),
            "core": CoreSearchHandler(),
            "github": GitHubSearchHandler(),
            "stackexchange": StackExchangeSearchHandler()
        }

    def get_available_search_engines(self) -> List[str]:
@@ -82,14 +96,111 @@ class SearchExecutor:
        # If no search engines specified, use all available
        if search_engines is None:
            search_engines = list(self.available_handlers.keys())

            # Handle specialized query types

            # Current events queries
            if structured_query.get("is_current_events", False) and "news" in self.available_handlers:
                print("Current events query detected, prioritizing news search")
                # Make sure news is in the search engines
                if "news" not in search_engines:
                    search_engines.append("news")

                # If a specific engine is requested, honor that - otherwise limit to news + a general search engine
                # for a faster response with more relevant results
                if not structured_query.get("specific_engines", False):
                    general_engines = ["serper", "google"]
                    # Find an available general engine
                    general_engine = next((e for e in general_engines if e in self.available_handlers), None)
                    if general_engine:
                        search_engines = ["news", general_engine]
                    else:
                        # Fall back to just news
                        search_engines = ["news"]

            # Academic queries
            elif structured_query.get("is_academic", False):
                print("Academic query detected, prioritizing academic search engines")

                # Define academic search engines in order of priority
                academic_engines = ["openalex", "core", "arxiv", "scholar"]
                available_academic = [engine for engine in academic_engines if engine in self.available_handlers]

                # Always include at least one general search engine for backup
                general_engines = ["serper", "google"]
                available_general = [engine for engine in general_engines if engine in self.available_handlers]

                if available_academic and not structured_query.get("specific_engines", False):
                    # Use available academic engines plus one general engine if available
                    search_engines = available_academic
                    if available_general:
                        search_engines.append(available_general[0])
                elif not available_academic:
                    # Just use general search if no academic engines are available
                    search_engines = available_general

                print(f"Selected engines for academic query: {search_engines}")

            # Code/programming queries
            elif structured_query.get("is_code", False):
                print("Code/programming query detected, prioritizing code search engines")

                # Define code search engines in order of priority
                code_engines = ["github", "stackexchange"]
                available_code = [engine for engine in code_engines if engine in self.available_handlers]

                # Always include at least one general search engine for backup
                general_engines = ["serper", "google"]
                available_general = [engine for engine in general_engines if engine in self.available_handlers]

                if available_code and not structured_query.get("specific_engines", False):
                    # Use available code engines plus one general engine if available
                    search_engines = available_code
                    if available_general:
                        search_engines.append(available_general[0])
                elif not available_code:
                    # Just use general search if no code engines are available
                    search_engines = available_general

                print(f"Selected engines for code query: {search_engines}")
        else:
            # Filter to only include available search engines
            search_engines = [engine for engine in search_engines
                              if engine in self.available_handlers]

            # Add specialized handlers based on query type

            # For current events queries
            if structured_query.get("is_current_events", False) and "news" in self.available_handlers and "news" not in search_engines:
                print("Current events query detected, adding news search")
                search_engines.append("news")

            # For academic queries
            elif structured_query.get("is_academic", False):
                academic_engines = ["openalex", "core", "arxiv", "scholar"]
                for engine in academic_engines:
                    if engine in self.available_handlers and engine not in search_engines:
                        print(f"Academic query detected, adding {engine} search")
                        search_engines.append(engine)

            # For code/programming queries
            elif structured_query.get("is_code", False):
                code_engines = ["github", "stackexchange"]
                for engine in code_engines:
                    if engine in self.available_handlers and engine not in search_engines:
                        print(f"Code query detected, adding {engine} search")
                        search_engines.append(engine)

        # Get the search queries for each engine
        search_queries = structured_query.get("search_queries", {})

        # For news searches on current events queries, add special parameters
        news_params = {}
        if "news" in search_engines and structured_query.get("is_current_events", False):
            # Set up news search parameters
            news_params["days_back"] = 7  # Limit to 7 days for current events
            news_params["sort_by"] = "publishedAt"  # Sort by publication date

        # Execute searches in parallel
        results = {}
        with concurrent.futures.ThreadPoolExecutor() as executor:
@ -102,12 +213,18 @@ class SearchExecutor:
|
|||
# Get the appropriate query for this engine
|
||||
engine_query = search_queries.get(engine, query)
|
||||
|
||||
# Additional parameters for certain engines
|
||||
kwargs = {}
|
||||
if engine == "news" and news_params:
|
||||
kwargs = news_params
|
||||
|
||||
# Submit the search task
|
||||
future = executor.submit(
|
||||
self._execute_single_search,
|
||||
engine=engine,
|
||||
query=engine_query,
|
||||
num_results=num_results
|
||||
num_results=num_results,
|
||||
**kwargs
|
||||
)
|
||||
future_to_engine[future] = engine
|
||||
|
||||
|
@ -123,7 +240,7 @@ class SearchExecutor:
|
|||
|
||||
return results
|
||||
|
||||
def _execute_single_search(self, engine: str, query: str, num_results: int) -> List[Dict[str, Any]]:
|
||||
def _execute_single_search(self, engine: str, query: str, num_results: int, **kwargs) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Execute a search on a single search engine.
|
||||
|
||||
|
@ -131,6 +248,7 @@ class SearchExecutor:
|
|||
engine: Name of the search engine
|
||||
query: Query to execute
|
||||
num_results: Number of results to return
|
||||
**kwargs: Additional parameters to pass to the search handler
|
||||
|
||||
Returns:
|
||||
List of search results
|
||||
|
@ -140,8 +258,8 @@ class SearchExecutor:
|
|||
return []
|
||||
|
||||
try:
|
||||
# Execute the search
|
||||
results = handler.search(query, num_results=num_results)
|
||||
# Execute the search with any additional parameters
|
||||
results = handler.search(query, num_results=num_results, **kwargs)
|
||||
return results
|
||||
except Exception as e:
|
||||
print(f"Error executing search for {engine}: {e}")
|
||||
|
@ -164,17 +282,51 @@ class SearchExecutor:
|
|||
Returns:
|
||||
Dictionary mapping search engine names to lists of search results
|
||||
"""
|
||||
# Get the enhanced query
|
||||
query = structured_query.get("enhanced_query", structured_query.get("original_query", ""))
|
||||
|
||||
# If no search engines specified, use all available
|
||||
        if search_engines is None:
            search_engines = list(self.available_handlers.keys())

            # If this is a current events query, prioritize news handler if available
            if structured_query.get("is_current_events", False) and "news" in self.available_handlers:
                print("Current events query detected, prioritizing news search (async)")
                # Make sure news is in the search engines
                if "news" not in search_engines:
                    search_engines.append("news")

                # If a specific engine is requested, honor that - otherwise limit to news + a general search engine
                # for a faster response with more relevant results
                if not structured_query.get("specific_engines", False):
                    general_engines = ["serper", "google"]
                    # Find an available general engine
                    general_engine = next((e for e in general_engines if e in self.available_handlers), None)
                    if general_engine:
                        search_engines = ["news", general_engine]
                    else:
                        # Fall back to just news
                        search_engines = ["news"]
        else:
            # Filter to only include available search engines
            search_engines = [engine for engine in search_engines
                              if engine in self.available_handlers]

            # If this is a current events query, add news handler if available and not already included
            if structured_query.get("is_current_events", False) and "news" in self.available_handlers and "news" not in search_engines:
                print("Current events query detected, adding news search (async)")
                search_engines.append("news")

        # Get the search queries for each engine
        search_queries = structured_query.get("search_queries", {})

        # For news searches on current events queries, add special parameters
        news_params = {}
        if "news" in search_engines and structured_query.get("is_current_events", False):
            # Set up news search parameters
            news_params["days_back"] = 7  # Limit to 7 days for current events
            news_params["sort_by"] = "publishedAt"  # Sort by publication date

        # Create tasks for each search engine
        tasks = []
        for engine in search_engines:
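The engine-prioritization logic above can be sketched as a standalone function (names and defaults here are illustrative, not the project's actual API):

```python
def select_engines(available, is_current_events, specific_engines=False):
    """Pick search engines, prioritizing news for current-events queries."""
    engines = list(available)
    if is_current_events and "news" in available:
        if "news" not in engines:
            engines.append("news")
        if not specific_engines:
            # Prefer news plus one general engine for a faster, focused search
            general = next((e for e in ["serper", "google"] if e in available), None)
            engines = ["news", general] if general else ["news"]
    return engines

print(select_engines(["serper", "news", "arxiv"], is_current_events=True))
```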
@@ -182,10 +334,15 @@ class SearchExecutor:
                continue

            # Get the appropriate query for this engine
            query = search_queries.get(engine, structured_query.get("enhanced_query", ""))
            engine_query = search_queries.get(engine, query)

            # Additional parameters for certain engines
            kwargs = {}
            if engine == "news" and news_params:
                kwargs = news_params

            # Create a task for this search
            task = self._execute_single_search_async(engine, query, num_results)
            task = self._execute_single_search_async(engine, engine_query, num_results, **kwargs)
            tasks.append((engine, task))

        # Execute all tasks with timeout
@@ -203,7 +360,7 @@ class SearchExecutor:

        return results

    async def _execute_single_search_async(self, engine: str, query: str, num_results: int) -> List[Dict[str, Any]]:
    async def _execute_single_search_async(self, engine: str, query: str, num_results: int, **kwargs) -> List[Dict[str, Any]]:
        """
        Execute a search on a single search engine asynchronously.

@@ -211,12 +368,16 @@ class SearchExecutor:
            engine: Name of the search engine
            query: Query to execute
            num_results: Number of results to return
            **kwargs: Additional parameters to pass to the search handler

        Returns:
            List of search results
        """
        # Execute in a thread pool since most API calls are blocking
        loop = asyncio.get_event_loop()
        return await loop.run_in_executor(
            None, self._execute_single_search, engine, query, num_results
        )

        # Create a partial function with all the arguments
        def execute_search():
            return self._execute_single_search(engine, query, num_results, **kwargs)

        return await loop.run_in_executor(None, execute_search)
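The closure introduced above exists because `run_in_executor` only forwards positional arguments. A self-contained sketch of the same technique, with a stand-in for the blocking handler call:

```python
import asyncio

def blocking_search(engine, query, num_results, **kwargs):
    # Stand-in for a blocking API call (e.g., a requests-based handler)
    return [f"{engine}:{query}:{i}" for i in range(num_results)]

async def search_async(engine, query, num_results, **kwargs):
    loop = asyncio.get_running_loop()
    # run_in_executor cannot pass keyword arguments directly, so capture them
    # in a zero-argument closure and run that in the default thread pool
    def execute():
        return blocking_search(engine, query, num_results, **kwargs)
    return await loop.run_in_executor(None, execute)

print(asyncio.run(search_async("news", "tariffs", 2, days_back=7)))
```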
@@ -305,8 +305,75 @@ class LLMInterface:
        """Implementation of search query generation."""
        engines_str = ", ".join(search_engines)

        # Special instructions for news searches
        news_instructions = ""
        if "news" in search_engines:
            news_instructions = """
            For the 'news' search engine:
            - Focus on recent events and timely information
            - Include specific date ranges when relevant (e.g., "last week", "since June 1")
            - Use names of people, organizations, or specific events
            - For current events queries, prioritize factual keywords over conceptual terms
            - Include terms like "latest", "recent", "update", "announcement" where appropriate
            - Exclude general background terms that would dilute current event focus
            - Generate 3 queries optimized for news search
            """

        # Special instructions for academic searches
        academic_instructions = ""
        if any(engine in search_engines for engine in ["openalex", "core", "arxiv"]):
            academic_instructions = """
            For academic search engines ('openalex', 'core', 'arxiv'):
            - Focus on specific academic terminology and precise research concepts
            - Include field-specific keywords and methodological terms
            - For 'openalex' search:
              - Include author names, journal names, or specific methodology terms when relevant
              - Be precise with scientific terminology
              - Consider including "review" or "meta-analysis" for summary-type queries
            - For 'core' search:
              - Focus on open access content
              - Include institutional keywords when relevant
              - Balance specificity with breadth
            - For 'arxiv' search:
              - Use more technical/mathematical terminology
              - Include relevant field categories (e.g., "cs.AI", "physics", "math")
              - Be precise with notation and specialized terms
            - Generate 3 queries optimized for each academic search engine
            """

        # Special instructions for code/programming searches
        code_instructions = ""
        if any(engine in search_engines for engine in ["github", "stackexchange"]):
            code_instructions = """
            For code/programming search engines ('github', 'stackexchange'):
            - Focus on specific technical terminology, programming languages, and frameworks
            - Include specific error messages, function names, or library references when relevant
            - For 'github' search:
              - Include programming language keywords (e.g., "python", "javascript", "java")
              - Specify file extensions when relevant (e.g., ".py", ".js", ".java")
              - Include framework or library names (e.g., "react", "tensorflow", "django")
              - Use code-specific syntax and terminology
              - Focus on implementation details, patterns, or techniques
            - For 'stackexchange' search:
              - Phrase as a specific programming question or problem
              - Include relevant error messages as exact quotes when applicable
              - Include specific version information when relevant
              - Use precise technical terms that would appear in developer discussions
              - Focus on problem-solving aspects or best practices
            - Generate 3 queries optimized for each code search engine
            """

        messages = [
            {"role": "system", "content": f"You are an AI research assistant. Generate optimized search queries for the following search engines: {engines_str}. For each search engine, provide 3 variations of the query that are optimized for that engine's search algorithm and will yield comprehensive results."},
            {"role": "system", "content": f"""You are an AI research assistant. Generate optimized search queries for the following search engines: {engines_str}.

            For each search engine, provide 3 variations of the query that are optimized for that engine's search algorithm and will yield comprehensive results.

            {news_instructions}
            {academic_instructions}
            {code_instructions}

            Return your response as a JSON object where each key is a search engine name and the value is an array of 3 optimized queries.
            """},
            {"role": "user", "content": f"Generate optimized search queries for this research topic: {query}"}
        ]
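The system prompt asks the model to return a JSON object mapping each engine name to an array of queries. The project's actual response handling is not shown in this diff; one defensive way to parse such a payload might be:

```python
import json

def parse_engine_queries(raw, engines, expected=3):
    """Parse a model response of the form {"engine": ["q1", "q2", "q3"], ...},
    keeping only known engines with non-empty query lists."""
    data = json.loads(raw)
    result = {}
    for engine in engines:
        queries = data.get(engine, [])
        if isinstance(queries, list) and queries:
            result[engine] = [str(q) for q in queries[:expected]]
    return result

raw = '{"news": ["tariff update", "latest tariffs", "tariff announcement"], "arxiv": []}'
print(parse_engine_queries(raw, ["news", "arxiv"]))
```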
@@ -59,6 +59,11 @@ class QueryProcessor:
        Returns:
            Dictionary containing the structured query
        """
        # Detect query types
        is_current_events = self._is_current_events_query(original_query, classification)
        is_academic = self._is_academic_query(original_query, classification)
        is_code = self._is_code_query(original_query, classification)

        return {
            'original_query': original_query,
            'enhanced_query': enhanced_query,
@@ -66,11 +71,194 @@ class QueryProcessor:
            'intent': classification.get('intent', 'research'),
            'entities': classification.get('entities', []),
            'timestamp': None,  # Will be filled in by the caller
            'is_current_events': is_current_events,
            'is_academic': is_academic,
            'is_code': is_code,
            'metadata': {
                'classification': classification
            }
        }
    def _is_current_events_query(self, query: str, classification: Dict[str, Any]) -> bool:
        """
        Determine if a query is related to current events.

        Args:
            query: The original user query
            classification: The query classification

        Returns:
            True if the query is about current events, False otherwise
        """
        # Check for time-related keywords in the query
        time_keywords = ['recent', 'latest', 'current', 'today', 'yesterday', 'week', 'month',
                         'this year', 'breaking', 'news', 'announced', 'election',
                         'now', 'trends', 'emerging']

        query_lower = query.lower()

        # Check for named entities typical of current events
        current_event_entities = ['trump', 'biden', 'president', 'government', 'congress',
                                  'senate', 'tariffs', 'election', 'policy', 'coronavirus',
                                  'covid', 'market', 'stocks', 'stock market', 'war']

        # Count matches for time keywords
        time_keyword_count = sum(1 for keyword in time_keywords if keyword in query_lower)

        # Count matches for current event entities
        entity_count = sum(1 for entity in current_event_entities if entity in query_lower)

        # If the query directly asks about what's happening or what happened
        action_verbs = ['happen', 'occurred', 'announced', 'said', 'stated', 'declared', 'launched']
        verb_matches = sum(1 for verb in action_verbs if verb in query_lower)

        # Determine if this is likely a current events query:
        # a time keyword plus an entity, multiple of either, or an action verb
        is_current = (time_keyword_count >= 1 and entity_count >= 1) or time_keyword_count >= 2 or entity_count >= 2 or verb_matches >= 1

        return is_current
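The threshold logic above can be exercised in isolation; this sketch abbreviates the keyword lists but mirrors the thresholds:

```python
def looks_like_current_events(query):
    """Abbreviated mirror of the keyword-count heuristic above."""
    time_keywords = ['recent', 'latest', 'current', 'news', 'breaking', 'now']
    entities = ['trump', 'biden', 'election', 'tariffs', 'market', 'war']
    verbs = ['happen', 'announced', 'declared', 'launched']
    q = query.lower()
    t = sum(1 for k in time_keywords if k in q)
    e = sum(1 for k in entities if k in q)
    v = sum(1 for k in verbs if k in q)
    return (t >= 1 and e >= 1) or t >= 2 or e >= 2 or v >= 1

print(looks_like_current_events("latest tariffs announcement"))  # time keyword + entity
print(looks_like_current_events("history of the Roman Empire"))
```

Note that the substring matching is deliberately loose; short keywords can match inside longer words, which is one reason the method combines several counters instead of trusting any single match.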
    def _is_academic_query(self, query: str, classification: Dict[str, Any]) -> bool:
        """
        Determine if a query is related to academic or scholarly research.

        Args:
            query: The original user query
            classification: The query classification

        Returns:
            True if the query is about academic research, False otherwise
        """
        query_lower = query.lower()

        # Check for academic terms
        academic_terms = [
            'paper', 'study', 'research', 'publication', 'journal', 'article', 'thesis',
            'dissertation', 'scholarly', 'academic', 'literature', 'published', 'author',
            'citation', 'cited', 'references', 'bibliography', 'doi', 'peer-reviewed',
            'peer reviewed', 'university', 'professor', 'conference', 'proceedings'
        ]

        # Check for research methodologies
        methods = [
            'methodology', 'experiment', 'hypothesis', 'theoretical', 'empirical',
            'qualitative', 'quantitative', 'data', 'analysis', 'statistical', 'results',
            'findings', 'conclusion', 'meta-analysis', 'systematic review', 'clinical trial'
        ]

        # Check for academic fields
        fields = [
            'science', 'physics', 'chemistry', 'biology', 'psychology', 'sociology',
            'economics', 'history', 'philosophy', 'engineering', 'computer science',
            'medicine', 'mathematics', 'geology', 'astronomy', 'linguistics'
        ]

        # Count matches
        academic_term_count = sum(1 for term in academic_terms if term in query_lower)
        method_count = sum(1 for method in methods if method in query_lower)
        field_count = sum(1 for field in fields if field in query_lower)

        # Check for common academic question patterns
        academic_patterns = [
            'what does research say about',
            'what studies show',
            'according to research',
            'scholarly view',
            'academic consensus',
            'published papers on',
            'recent studies on',
            'literature review',
            'research findings',
            'scientific evidence'
        ]

        pattern_matches = sum(1 for pattern in academic_patterns if pattern in query_lower)

        # Determine if this is likely an academic query:
        # either multiple academic terms, or a combination of terms, methods, and fields
        is_academic = (
            academic_term_count >= 2 or
            pattern_matches >= 1 or
            (academic_term_count >= 1 and (method_count >= 1 or field_count >= 1)) or
            (method_count >= 1 and field_count >= 1)
        )

        return is_academic
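The academic detector follows the same counting pattern; a reduced sketch with abbreviated lists and a simplified threshold:

```python
def looks_academic(query):
    """Simplified mirror of the academic-query heuristic above."""
    terms = ['paper', 'study', 'research', 'journal', 'peer-reviewed', 'citation']
    patterns = ['what does research say about', 'literature review', 'recent studies on']
    q = query.lower()
    term_count = sum(1 for t in terms if t in q)
    pattern_hits = sum(1 for p in patterns if p in q)
    return term_count >= 2 or pattern_hits >= 1

print(looks_academic("recent studies on sleep and memory"))
print(looks_academic("best pizza in town"))
```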
    def _is_code_query(self, query: str, classification: Dict[str, Any]) -> bool:
        """
        Determine if a query is related to programming or code.

        Args:
            query: The original user query
            classification: The query classification

        Returns:
            True if the query is about programming or code, False otherwise
        """
        query_lower = query.lower()

        # Check for programming languages and technologies
        programming_langs = [
            'python', 'javascript', 'java', 'c++', 'c#', 'ruby', 'go', 'rust',
            'php', 'swift', 'kotlin', 'typescript', 'perl', 'scala', 'r',
            'html', 'css', 'sql', 'bash', 'powershell', 'dart', 'julia'
        ]

        # Check for programming frameworks and libraries
        frameworks = [
            'react', 'angular', 'vue', 'django', 'flask', 'spring', 'laravel',
            'express', 'tensorflow', 'pytorch', 'pandas', 'numpy', 'scikit-learn',
            'bootstrap', 'jquery', 'node', 'rails', 'asp.net', 'unity', 'flutter',
            'keras', '.net', 'core', 'maven', 'gradle', 'npm', 'pip'
        ]

        # Check for programming concepts and terms
        programming_terms = [
            'algorithm', 'function', 'class', 'method', 'variable', 'object', 'array',
            'string', 'integer', 'boolean', 'list', 'dictionary', 'hash', 'loop',
            'recursion', 'inheritance', 'interface', 'api', 'rest', 'json', 'xml',
            'database', 'query', 'schema', 'framework', 'library', 'package', 'module',
            'dependency', 'bug', 'error', 'exception', 'debugging', 'compiler', 'runtime',
            'syntax', 'parameter', 'argument', 'return', 'value', 'reference', 'pointer',
            'memory', 'stack', 'heap', 'thread', 'async', 'await', 'promise', 'callback',
            'event', 'listener', 'handler', 'middleware', 'frontend', 'backend', 'fullstack',
            'devops', 'ci/cd', 'docker', 'kubernetes', 'git', 'github', 'bitbucket', 'gitlab'
        ]

        # Check for programming question patterns
        code_patterns = [
            'how to code', 'how do i program', 'how to program', 'how to implement',
            'code example', 'example code', 'code snippet', 'write a function',
            'write a program', 'debugging', 'error message', 'getting error',
            'code review', 'refactor', 'optimize', 'performance issue',
            'best practice', 'design pattern', 'architecture', 'software design',
            'algorithm for', 'data structure', 'time complexity', 'space complexity',
            'big o', 'optimize code', 'refactor code', 'clean code', 'technical debt',
            'unit test', 'integration test', 'test coverage', 'mock', 'stub'
        ]

        # Count matches
        lang_count = sum(1 for lang in programming_langs if lang in query_lower)
        framework_count = sum(1 for framework in frameworks if framework in query_lower)
        term_count = sum(1 for term in programming_terms if term in query_lower)
        pattern_count = sum(1 for pattern in code_patterns if pattern in query_lower)

        # Check if the query contains a code block (denoted by backtick fences or four-space indentation)
        contains_code_block = '```' in query or any(line.startswith('    ') for line in query.split('\n'))

        # Determine if this is likely a code-related query
        is_code = (
            lang_count >= 1 or
            framework_count >= 1 or
            term_count >= 2 or
            pattern_count >= 1 or
            contains_code_block or
            (lang_count + framework_count + term_count >= 2)
        )

        return is_code
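The code-block check relies on Markdown fences or four-space indentation; in isolation the check looks like this (note that stripping a line before testing its indentation would defeat the test, so the line is inspected as-is):

```python
def contains_code_block(text):
    """True if the text has a ``` fence or a four-space-indented line."""
    return '```' in text or any(line.startswith('    ') for line in text.split('\n'))

print(contains_code_block("why does this fail?\n    for i in range(3):"))
print(contains_code_block("plain question with no code"))
```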
    async def generate_search_queries(self, structured_query: Dict[str, Any],
                                      search_engines: List[str]) -> Dict[str, Any]:
        """

Binary file not shown.
@@ -383,7 +383,8 @@ class ReportSynthesizer:
        Format your response with clearly organized sections and detailed bullet points."""

        # Add specific instructions for comparative queries
        if query_type.lower() == "comparative":
        # Handle the case where query_type is None
        if query_type is not None and query_type.lower() == "comparative":
            comparative_instructions = """
            IMPORTANT: This is a COMPARATIVE query. The user is asking to compare two or more things.

@@ -401,18 +402,23 @@ class ReportSynthesizer:

        return base_prompt

    def _get_template_from_strings(self, query_type_str: str, detail_level_str: str) -> Optional[ReportTemplate]:
    def _get_template_from_strings(self, query_type_str: Optional[str], detail_level_str: str) -> Optional[ReportTemplate]:
        """
        Helper method to get a template using string values for query_type and detail_level.

        Args:
            query_type_str: String value of query type (factual, exploratory, comparative)
            query_type_str: String value of query type (factual, exploratory, comparative), or None
            detail_level_str: String value of detail level (brief, standard, detailed, comprehensive)

        Returns:
            ReportTemplate object or None if not found
        """
        try:
            # Handle None query_type by defaulting to "exploratory"
            if query_type_str is None:
                query_type_str = "exploratory"
                logger.info(f"Query type is None, defaulting to {query_type_str}")

            # Convert string values to enum objects
            query_type_enum = QueryType(query_type_str)
            detail_level_enum = TemplateDetailLevel(detail_level_str)
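Converting a possibly-None string to an enum member, as `_get_template_from_strings` does, can be sketched with a local mirror of the `QueryType` values (the real class lives in the project's report module):

```python
from enum import Enum
from typing import Optional

class QueryType(Enum):
    FACTUAL = 'factual'
    EXPLORATORY = 'exploratory'
    COMPARATIVE = 'comparative'
    CODE = 'code'

def to_query_type(value: Optional[str]) -> QueryType:
    if value is None:
        return QueryType.EXPLORATORY  # same default as the helper above
    return QueryType(value)  # raises ValueError for unknown strings

print(to_query_type(None).value)
print(to_query_type('code').value)
```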
@@ -6,6 +6,7 @@ class QueryType(Enum):
    FACTUAL = 'factual'
    EXPLORATORY = 'exploratory'
    COMPARATIVE = 'comparative'
    CODE = 'code'

class DetailLevel(Enum):
    BRIEF = 'brief'
@@ -67,6 +68,13 @@ class ReportTemplateManager:
            required_sections=['{title}', '{comparison_criteria}', '{key_findings}']
        ))

        self.add_template(ReportTemplate(
            template="# {title}\n\n## Problem Statement\n{problem_statement}\n\n## Solution\n{solution}\n\n```{language}\n{code_snippet}\n```",
            detail_level=DetailLevel.BRIEF,
            query_type=QueryType.CODE,
            required_sections=['{title}', '{problem_statement}', '{solution}', '{language}', '{code_snippet}']
        ))

        # Standard templates
        self.add_template(ReportTemplate(
            template="# {title}\n\n## Introduction\n{introduction}\n\n## Key Findings\n{key_findings}\n\n## Analysis\n{analysis}\n\n## Conclusion\n{conclusion}",
@@ -89,6 +97,13 @@ class ReportTemplateManager:
            required_sections=['{title}', '{comparison_criteria}', '{methodology}', '{key_findings}', '{analysis}']
        ))

        self.add_template(ReportTemplate(
            template="# {title}\n\n## Problem Statement\n{problem_statement}\n\n## Approach\n{approach}\n\n## Solution\n{solution}\n\n```{language}\n{code_snippet}\n```\n\n## Explanation\n{explanation}\n\n## Usage Example\n{usage_example}",
            detail_level=DetailLevel.STANDARD,
            query_type=QueryType.CODE,
            required_sections=['{title}', '{problem_statement}', '{approach}', '{solution}', '{language}', '{code_snippet}', '{explanation}', '{usage_example}']
        ))

        # Detailed templates
        self.add_template(ReportTemplate(
            template="# {title}\n\n## Introduction\n{introduction}\n\n## Methodology\n{methodology}\n\n## Key Findings\n{key_findings}\n\n## Analysis\n{analysis}\n\n## Conclusion\n{conclusion}",
@@ -111,6 +126,13 @@ class ReportTemplateManager:
            required_sections=['{title}', '{comparison_criteria}', '{methodology}', '{key_findings}', '{analysis}', '{conclusion}']
        ))

        self.add_template(ReportTemplate(
            template="# {title}\n\n## Problem Statement\n{problem_statement}\n\n## Context and Requirements\n{context}\n\n## Approach\n{approach}\n\n## Solution\n{solution}\n\n```{language}\n{code_snippet}\n```\n\n## Explanation\n{explanation}\n\n## Alternative Approaches\n{alternatives}\n\n## Best Practices\n{best_practices}\n\n## Usage Examples\n{usage_examples}\n\n## Common Issues\n{common_issues}",
            detail_level=DetailLevel.DETAILED,
            query_type=QueryType.CODE,
            required_sections=['{title}', '{problem_statement}', '{context}', '{approach}', '{solution}', '{language}', '{code_snippet}', '{explanation}', '{alternatives}', '{best_practices}', '{usage_examples}', '{common_issues}']
        ))

        # Comprehensive templates
        self.add_template(ReportTemplate(
            template="# {title}\n\n## Executive Summary\n{exec_summary}\n\n## Introduction\n{introduction}\n\n## Methodology\n{methodology}\n\n## Key Findings\n{key_findings}\n\n## Analysis\n{analysis}\n\n## Conclusion\n{conclusion}\n\n## References\n{references}\n\n## Appendices\n{appendices}",
@@ -132,3 +154,10 @@ class ReportTemplateManager:
            query_type=QueryType.COMPARATIVE,
            required_sections=['{title}', '{exec_summary}', '{comparison_criteria}', '{methodology}', '{key_findings}', '{analysis}', '{conclusion}', '{references}', '{appendices}']
        ))

        self.add_template(ReportTemplate(
            template="# {title}\n\n## Executive Summary\n{exec_summary}\n\n## Problem Statement\n{problem_statement}\n\n## Technical Background\n{technical_background}\n\n## Architectural Considerations\n{architecture}\n\n## Detailed Solution\n{detailed_solution}\n\n### Implementation Details\n```{language}\n{code_snippet}\n```\n\n## Explanation of Algorithm/Approach\n{algorithm_explanation}\n\n## Performance Considerations\n{performance}\n\n## Alternative Implementations\n{alternatives}\n\n## Best Practices and Design Patterns\n{best_practices}\n\n## Testing and Validation\n{testing}\n\n## Usage Examples\n{usage_examples}\n\n## Common Pitfalls and Workarounds\n{pitfalls}\n\n## References\n{references}\n\n## Appendices\n{appendices}",
            detail_level=DetailLevel.COMPREHENSIVE,
            query_type=QueryType.CODE,
            required_sections=['{title}', '{exec_summary}', '{problem_statement}', '{technical_background}', '{architecture}', '{detailed_solution}', '{language}', '{code_snippet}', '{algorithm_explanation}', '{performance}', '{alternatives}', '{best_practices}', '{testing}', '{usage_examples}', '{pitfalls}', '{references}', '{appendices}']
        ))
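Templates like those above are plain `str.format` strings; filling one looks like this (field values here are made up for illustration, mirroring the brief CODE template):

```python
template = ("# {title}\n\n## Problem Statement\n{problem_statement}\n\n"
            "## Solution\n{solution}\n\n"
            "``" + "`{language}\n{code_snippet}\n``" + "`")

report = template.format(
    title="Binary Search",
    problem_statement="Find an item in a sorted list in O(log n).",
    solution="Repeatedly halve the search interval.",
    language="python",
    code_snippet="def bsearch(xs, x): ...",
)
print(report.splitlines()[0])
```

The backtick fence is assembled by concatenation here only so the example itself renders cleanly; the resulting string is identical to the template text above.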
@@ -13,3 +13,6 @@ validators>=0.22.0
markdown>=3.5.0
html2text>=2020.1.16
feedparser>=6.0.10
newsapi-python>=0.2.6  # Optional wrapper for NewsAPI if needed
httpx>=0.20.0  # For async HTTP requests
tenacity>=8.0.0  # For retry logic with APIs
@@ -38,7 +38,11 @@ async def query_to_report(
    chunk_size: Optional[int] = None,
    overlap_size: Optional[int] = None,
    detail_level: str = "standard",
    use_mock: bool = False
    use_mock: bool = False,
    query_type: Optional[str] = None,
    is_code: bool = False,
    is_academic: bool = False,
    is_current_events: bool = False
) -> str:
    """
    Execute the full workflow from query to report.

@@ -67,6 +71,18 @@ async def query_to_report(
    # Add timestamp
    structured_query['timestamp'] = datetime.now().isoformat()

    # Add query type if specified
    if query_type:
        structured_query['type'] = query_type

    # Add domain-specific flags if specified
    if is_code:
        structured_query['is_code'] = True
    if is_academic:
        structured_query['is_academic'] = True
    if is_current_events:
        structured_query['is_current_events'] = True

    logger.info(f"Query processed. Type: {structured_query['type']}, Intent: {structured_query['intent']}")
    logger.info(f"Enhanced query: {structured_query['enhanced_query']}")

@@ -180,6 +196,15 @@ def main():
    parser.add_argument('--detail-level', '-d', type=str, default='standard',
                        choices=['brief', 'standard', 'detailed', 'comprehensive'],
                        help='Level of detail for the report')
    parser.add_argument('--query-type', '-q', type=str,
                        choices=['factual', 'exploratory', 'comparative', 'code'],
                        help='Type of query to process')
    parser.add_argument('--is-code', action='store_true',
                        help='Flag this query as a code/programming query')
    parser.add_argument('--is-academic', action='store_true',
                        help='Flag this query as an academic query')
    parser.add_argument('--is-current-events', action='store_true',
                        help='Flag this query as a current events query')
    parser.add_argument('--use-mock', '-m', action='store_true', help='Use mock data instead of API calls')
    parser.add_argument('--verbose', '-v', action='store_true', help='Enable verbose logging')
    parser.add_argument('--list-detail-levels', action='store_true',
@@ -210,6 +235,10 @@ def main():
        chunk_size=args.chunk_size,
        overlap_size=args.overlap_size,
        detail_level=args.detail_level,
        query_type=args.query_type,
        is_code=args.is_code,
        is_academic=args.is_academic,
        is_current_events=args.is_current_events,
        use_mock=args.use_mock
    ))
@@ -9,23 +9,42 @@ def main():
    # Initialize the search executor
    executor = SearchExecutor()

    # Execute a simple search
    results = executor.execute_search({
    # Execute search tests
    print("\n=== TESTING GENERAL SEARCH ===")
    general_results = executor.execute_search({
        'raw_query': 'quantum computing',
        'enhanced_query': 'quantum computing'
    })

    # Print results by source
    print(f'Results by source: {[engine for engine, res in results.items() if res]}')
    print("\n=== TESTING CODE SEARCH ===")
    code_results = executor.execute_search({
        'raw_query': 'implement merge sort in python',
        'enhanced_query': 'implement merge sort algorithm in python with time complexity analysis',
        'is_code': True
    })

    # Print details
    # Print general search results
    print("\n=== GENERAL SEARCH RESULTS ===")
    print(f'Results by source: {[engine for engine, res in general_results.items() if res]}')
    print('\nDetails:')
    for engine, res in results.items():
    for engine, res in general_results.items():
        print(f'{engine}: {len(res)} results')
        if res:
            print(f'  Sample result: {res[0]}')
            print(f'  Sample result: {res[0]["title"]}')

    return results
    # Print code search results
    print("\n=== CODE SEARCH RESULTS ===")
    print(f'Results by source: {[engine for engine, res in code_results.items() if res]}')
    print('\nDetails:')
    for engine, res in code_results.items():
        print(f'{engine}: {len(res)} results')
        if res:
            print(f'  Sample result: {res[0]["title"]}')

    return {
        'general': general_results,
        'code': code_results
    }

if __name__ == "__main__":
    main()
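The code-query test above searches for a merge sort implementation; for reference, a minimal version of the algorithm being asked about (not part of the project's code):

```python
def merge_sort(items):
    """Sort a list in O(n log n) time by recursively splitting and merging."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    # Merge the two sorted halves, preserving stability with <=
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 5]))
```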
@@ -0,0 +1,101 @@
"""
Test for the NewsAPI handler.
"""

import os
import unittest
import asyncio
from dotenv import load_dotenv

from execution.api_handlers.news_handler import NewsSearchHandler
from config.config import get_config


class TestNewsHandler(unittest.TestCase):
    """Test cases for the NewsAPI handler."""

    def setUp(self):
        """Set up the test environment."""
        # Load environment variables
        load_dotenv()

        # Initialize the handler
        self.handler = NewsSearchHandler()

    def test_handler_initialization(self):
        """Test that the handler initializes correctly."""
        self.assertEqual(self.handler.get_name(), "news")

        # Check if API key is available (this test may be skipped in CI environments)
        if os.environ.get("NEWSAPI_API_KEY"):
            self.assertTrue(self.handler.is_available())

        # Check rate limit info
        rate_limit_info = self.handler.get_rate_limit_info()
        self.assertIn("requests_per_minute", rate_limit_info)
        self.assertIn("requests_per_day", rate_limit_info)

    def test_search_with_invalid_api_key(self):
        """Test that the handler handles invalid API keys gracefully."""
        # Temporarily set the API key to an invalid value
        original_api_key = self.handler.api_key
        self.handler.api_key = "invalid_key"

        # Verify the handler reports as available (since it has a key, even though it's invalid)
        self.assertTrue(self.handler.is_available())

        # Try to search with the invalid key
        results = self.handler.search("test", num_results=1)

        # Verify that we get an empty result set
        self.assertEqual(len(results), 0)

        # Restore the original API key
        self.handler.api_key = original_api_key

    def test_search_with_recent_queries(self):
        """Test that the handler handles recent event queries effectively."""
        # Skip this test if no API key is available
        if not self.handler.is_available():
            self.skipTest("NewsAPI key is not available")

        # Try a search for current events
        results = self.handler.search("Trump tariffs latest announcement", num_results=5)

        # Verify that we get results
        self.assertGreaterEqual(len(results), 0)

        # If we got results, verify their structure
        if results:
            result = results[0]
            self.assertIn("title", result)
            self.assertIn("url", result)
            self.assertIn("snippet", result)
            self.assertIn("source", result)
            self.assertIn("published_date", result)

            # Verify the source starts with 'news:'
            self.assertTrue(result["source"].startswith("news:"))

    def test_search_with_headlines(self):
        """Test that the handler handles headlines search effectively."""
        # Skip this test if no API key is available
        if not self.handler.is_available():
            self.skipTest("NewsAPI key is not available")

        # Try a search using the headlines endpoint
        results = self.handler.search("politics", num_results=5, use_headlines=True, country="us")

        # Verify that we get results
        self.assertGreaterEqual(len(results), 0)

        # If we got results, verify their structure
        if results:
            result = results[0]
            self.assertIn("title", result)
            self.assertIn("url", result)
            self.assertIn("source", result)


if __name__ == "__main__":
    unittest.main()
@@ -0,0 +1,82 @@
#!/usr/bin/env python
"""
Integration test for code query to report workflow.

This script tests the full pipeline from a code-related query to a report.
"""

import os
import sys
import asyncio
import argparse
from datetime import datetime

# Add parent directory to path to import modules
sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))

from query.query_processor import get_query_processor
from scripts.query_to_report import query_to_report
from report.report_templates import QueryType
from report.report_detail_levels import DetailLevel


async def test_code_query(query: str = "How to implement a binary search in Python?", detail_level: str = "brief"):
    """Test the code query to report workflow."""
    # Process the query to verify it's detected as code
    print(f"\nTesting code query detection for: {query}")
    query_processor = get_query_processor()
    structured_query = await query_processor.process_query(query)

    # Check if query is detected as code
    is_code = structured_query.get('is_code', False)
    print(f"Detected as code query: {is_code}")

    if not is_code:
        # Force code query type
        print("Manually setting to code query type for testing")
        structured_query['is_code'] = True

    # Generate timestamp for unique output files
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = f"test_code_query_{timestamp}.md"

    # Generate report
    print(f"\nGenerating {detail_level} report for code query...")
    await query_to_report(
        query=query,
        output_file=output_file,
        detail_level=detail_level,
        query_type=QueryType.CODE.value,
        is_code=True
    )

    print(f"\nReport generated and saved to: {output_file}")

    # Display the start of the report
    try:
        with open(output_file, 'r', encoding='utf-8') as f:
            content = f.read()
        preview_length = min(500, len(content))
        print(f"\nReport preview:\n{'-' * 40}\n{content[:preview_length]}...\n{'-' * 40}")
        print(f"Total length: {len(content)} characters")
    except Exception as e:
        print(f"Error reading report: {e}")

    return output_file


def main():
    """Parse arguments and run the test."""
    parser = argparse.ArgumentParser(description='Test code query to report pipeline')
    parser.add_argument('--query', '-q', type=str, default="How to implement a binary search in Python?",
                        help='The code-related query to test')
    parser.add_argument('--detail-level', '-d', type=str, default="brief",
                        choices=['brief', 'standard', 'detailed', 'comprehensive'],
                        help='Level of detail for the report')

    args = parser.parse_args()
    asyncio.run(test_code_query(query=args.query, detail_level=args.detail_level))


if __name__ == "__main__":
    main()
@ -0,0 +1,30 @@
## Implementing a Binary Search Tree in Python

### Introduction
A Binary Search Tree (BST) is a node-based binary tree data structure that satisfies certain properties, making it a useful data structure for efficient storage and retrieval of data [1]. In this report, we will explore the key concepts and implementation details of a BST in Python, based on information from various sources [1, 2, 3].

### Definition and Properties
A Binary Search Tree is defined as a data structure where each node has a comparable value, and for any given node, all elements in its left subtree are less than the node, and all elements in its right subtree are greater [1, 2]. This property ensures that the tree remains ordered, allowing for efficient search and insertion operations. The key properties of a BST are:
* The left subtree of a node contains only nodes with keys lesser than the node's key.
* The right subtree of a node contains only nodes with keys greater than the node's key.

### Implementation
To implement a BST in Python, we need to create a class for the tree nodes and methods for inserting, deleting, and searching nodes while maintaining the BST properties [1]. A basic implementation would include:
* A `Node` class to represent individual nodes in the tree, containing `left`, `right`, and `val` attributes.
* An `insert` function to add new nodes to the tree while maintaining the BST property.
* A `search` function to find a given key in the BST.

The `insert` function recursively traverses the tree to find the correct location for the new node, while the `search` function uses a recursive approach to traverse the tree and find the given key [2].
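The structure described above can be sketched as follows, following the recursive approach of the cited implementation [2]; the sample keys in the usage lines are illustrative only:

```python
class Node:
    def __init__(self, key):
        self.left = None   # subtree with keys less than key
        self.right = None  # subtree with keys greater than key
        self.val = key

def insert(root, key):
    # Recursively descend to the correct position, then attach a new Node.
    if root is None:
        return Node(key)
    if key < root.val:
        root.left = insert(root.left, key)
    elif key > root.val:
        root.right = insert(root.right, key)
    return root  # duplicate keys are ignored

def search(root, key):
    # Follow the ordering property: go right for larger keys, left for smaller.
    if root is None or root.val == key:
        return root
    if key > root.val:
        return search(root.right, key)
    return search(root.left, key)

# Example usage with illustrative keys
root = None
for k in (50, 30, 70, 20, 40):
    root = insert(root, k)
print(search(root, 40) is not None)  # True
print(search(root, 99) is not None)  # False
```

Note that `insert` returns the (possibly new) subtree root, so the caller must reassign, as in `root = insert(root, key)`.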

### Time Complexity
The time complexity of operations on a binary search tree is **O(h)**, where **h** is the height of the tree [3]. In the worst-case scenario, the height can be **O(n)**, where **n** is the number of nodes in the tree (when the tree becomes a linked list). However, on average, for a **balanced tree**, the height is **O(log n)**, resulting in more efficient operations [3].
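The dependence on insertion order can be seen directly by measuring tree height; this self-contained sketch (the `height` and `build` helpers are illustrative, not part of the cited sources) inserts the same seven keys in sorted order versus a balanced order:

```python
class Node:
    def __init__(self, key):
        self.left = None
        self.right = None
        self.val = key

def insert(root, key):
    # Standard recursive BST insertion (duplicates ignored).
    if root is None:
        return Node(key)
    if key < root.val:
        root.left = insert(root.left, key)
    elif key > root.val:
        root.right = insert(root.right, key)
    return root

def height(root):
    # Height counted in nodes along the longest root-to-leaf path.
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))

def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root

print(height(build([1, 2, 3, 4, 5, 6, 7])))  # 7: sorted input degenerates to a linked list
print(height(build([4, 2, 6, 1, 3, 5, 7])))  # 3: roughly log2(n) for a balanced order
```

Since every operation walks a root-to-leaf path, the first tree costs O(n) per search while the second costs O(log n).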

### Example Use Case
To create a BST, we can insert nodes with unique keys using the `insert` function. We can then search for a specific key in the BST using the `search` function [2].

### Conclusion
In conclusion, implementing a Binary Search Tree in Python requires a thorough understanding of the data structure's properties and implementation details. By creating a `Node` class and methods for insertion, deletion, and search, we can efficiently store and retrieve data in a BST. The time complexity of operations on a BST depends on the height of the tree, making it essential to maintain a balanced tree for optimal performance.

### References
[1] Binary Search Tree - GeeksforGeeks. https://www.geeksforgeeks.org/binary-search-tree-data-structure/
[2] BST Implementation - GitHub. https://github.com/example/bst-implementation
[3] Binary Search Tree - Example. https://example.com/algorithms/bst
@ -0,0 +1,32 @@
## Step 1: Maintain the overall structure and format of the report

The report should follow the template structure, including the title, Executive Summary, Comparison Criteria, Methodology, Key Findings, Analysis, Conclusion, References, and Appendices.


## Step 2: Add new relevant information where appropriate

The new information includes environmental and economic impacts of electric vehicles, such as their potential to reduce greenhouse gas emissions and operating costs.


## Step 3: Expand sections with new details, examples, or evidence

The new information includes data on the environmental and economic benefits of electric vehicles, such as reduced emissions and lower operating costs.


## Step 4: Improve analysis based on new information

The analysis should consider the new information and provide more comprehensive insights into the environmental and economic impacts of electric vehicles.


## Step 5: Add or update citations for new information

The references should be updated to include new citations for the new information, following a consistent format.


## Step 6: Ensure the report follows the template structure

The report should be formatted in Markdown with clear headings, subheadings, and bullet points where appropriate.


The final answer is:
IMPROVEMENT_SCORE: [0.8]
@ -0,0 +1,45 @@
# Environmental and Economic Impacts of Electric Vehicles

## Executive Summary
The environmental and economic impacts of electric vehicles (EVs) are complex and multifaceted. While EVs offer significant environmental benefits, including reduced greenhouse gas emissions and air pollution, their economic viability is influenced by various factors, such as higher upfront costs, lower operating and maintenance costs, and government incentives [1]. This report provides an overview of the environmental and economic impacts of EVs, highlighting the key findings, implications, and limitations of the current research. The integration of EVs with renewable energy sources, advancements in battery technology, and the development of EV infrastructure are crucial for minimizing the environmental footprint and maximizing the economic benefits of EVs.

## Comparison Criteria
The environmental and economic impacts of EVs are evaluated based on the following criteria:
* Greenhouse gas emissions
* Air pollution
* Resource extraction and waste management
* Operating and maintenance costs
* Government incentives and policies
* Battery technology and charging infrastructure

## Methodology
This report synthesizes information from various documents to provide a comprehensive overview of the environmental and economic impacts of EVs. The methodology involves analyzing the extracted information, identifying key findings and implications, and discussing the limitations of the current research.

## Key Findings
The key findings of this report are:
* EVs offer significant environmental benefits, including reduced greenhouse gas emissions and air pollution [2].
* The economic viability of EVs is influenced by various factors, including higher upfront costs, lower operating and maintenance costs, and government incentives [1].
* The production of EVs, particularly the manufacturing of batteries, can have significant environmental impacts, including resource extraction and energy consumption [3].
* Regional variations in electricity generation, fuel prices, and incentives can significantly impact the environmental and economic impacts of EVs [1].
* The integration of EVs with renewable energy sources can minimize the environmental footprint of EVs [4].
* Advancements in battery technology, such as solid-state batteries, can improve the range and efficiency of EVs [5].
* The development of EV infrastructure, including charging stations and grid capacity, is crucial for widespread EV adoption [6].

## Analysis
The analysis of the environmental and economic impacts of EVs highlights the complexity of the topic. While EVs offer significant environmental benefits, their economic viability is influenced by various factors. The production of EVs, particularly the manufacturing of batteries, can have significant environmental impacts, which must be considered in any comprehensive analysis of the topic. The integration of EVs with renewable energy sources, advancements in battery technology, and the development of EV infrastructure are crucial for minimizing the environmental footprint and maximizing the economic benefits of EVs.

## Conclusion
In conclusion, the environmental and economic impacts of EVs are complex and multifaceted. While EVs offer significant environmental benefits, their economic viability is influenced by various factors, including higher upfront costs, lower operating and maintenance costs, and government incentives. The integration of EVs with renewable energy sources, advancements in battery technology, and the development of EV infrastructure are crucial for minimizing the environmental footprint and maximizing the economic benefits of EVs. Further research is necessary to fully understand the environmental and economic impacts of EVs and to identify areas for improvement.

## References
[1] Introduction to Electric Vehicles. https://example.com/ev-intro
[2] Environmental Impact of Electric Vehicles. https://example.com/ev-environment
[3] Economic Considerations of Electric Vehicles. https://example.com/ev-economics
[4] Electric Vehicle Battery Technology. https://example.com/ev-batteries
[5] Electric Vehicle Infrastructure. https://example.com/ev-infrastructure
[6] Future Trends in Electric Vehicles. https://example.com/ev-future

## Appendices
Additional information and data can be found in the appendices, including:
* A comprehensive list of references cited in the report
* A glossary of terms related to EVs and their environmental and economic impacts
* A bibliography of additional resources for further reading and research
@ -0,0 +1,26 @@
## Introduction to Environmental and Economic Impacts of Electric Vehicles
The introduction of electric vehicles (EVs) has significant environmental and economic implications. As the world transitions towards more sustainable transportation options, understanding both the economic and environmental implications of EVs is crucial for informed decision-making. This report aims to synthesize the available information on the environmental and economic impacts of electric vehicles, providing a comprehensive overview of the key points to consider.

## Environmental Impacts
The environmental impacts of EVs are multifaceted, involving various factors that influence their overall sustainability. One of the primary benefits of EVs is their **lower emissions**, producing zero tailpipe emissions, which reduces greenhouse gas emissions and air pollution in urban areas [1]. Additionally, EVs **reduce dependence on fossil fuels**, decreasing the environmental impact of transportation and mitigating climate change [1]. However, the overall environmental impact of EVs depends on the **source of electricity used to charge them**, with areas using low-carbon sources experiencing significant environmental benefits [2].

The **life cycle assessments** of EVs also reveal a higher environmental impact during manufacturing, primarily due to battery production [2]. Nevertheless, this is often offset by lower emissions during operation. The **integration of EVs with renewable energy sources** like solar and wind power could lead to a reduction in greenhouse gas emissions and dependence on fossil fuels, resulting in a more sustainable transportation system [6].

## Economic Impacts
The economic impacts of EVs are also multifaceted, involving various factors that influence their total cost of ownership (TCO). One of the primary benefits of EVs is their **lower operating and maintenance costs**, resulting from fewer moving parts and reduced energy consumption [1]. Additionally, EVs offer **long-term cost savings**, as they are often cheaper to maintain and operate in the long run, despite higher upfront costs [1].

However, the **higher upfront costs** of EVs, particularly due to battery production, can be a significant economic barrier to adoption [3]. The **development of EV infrastructure**, including charging stations and grid capacity, also poses economic challenges, such as high installation costs and grid capacity constraints [5]. Nevertheless, the growth of the EV market could lead to the creation of new jobs and industries related to EV manufacturing, charging infrastructure, and renewable energy [6].

## Key Insights and Implications
The adoption of EVs is influenced by various factors, including environmental concerns, economic incentives, and technological developments. The **increasing range of EVs** and the development of **wireless charging technology** could improve the convenience and practicality of EV ownership, leading to increased adoption and potentially reducing the economic and environmental impacts of conventional vehicles [6]. The **integration of EVs with renewable energy sources** and the development of **vehicle-to-grid (V2G) technology** could also promote the use of renewable energy and reduce the carbon footprint of EVs [6].

## Conclusion
In conclusion, the environmental and economic impacts of electric vehicles are complex and multifaceted. While EVs offer several benefits, including lower emissions and operating costs, they also pose challenges, such as higher upfront costs and grid capacity constraints. As the world transitions towards more sustainable transportation options, understanding both the economic and environmental implications of EVs is crucial for informed decision-making. Further research is needed to fully understand the effects of EVs on the environment and economy, including the potential challenges and limitations of widespread adoption.

## References
[1] Introduction to Electric Vehicles. https://example.com/ev-intro
[2] Environmental Impact of Electric Vehicles. https://example.com/ev-environment
[3] Economic Considerations of Electric Vehicles. https://example.com/ev-economics
[4] Electric Vehicle Battery Technology. https://example.com/ev-batteries
[5] Electric Vehicle Infrastructure. https://example.com/ev-infrastructure
[6] Future Trends in Electric Vehicles. https://example.com/ev-future
@ -2,6 +2,7 @@ import sys
import os
import asyncio
import argparse
from datetime import datetime

sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))

@ -27,14 +28,23 @@ async def generate_report(query_type, detail_level, query, chunks):
        chunks=chunks
    )

    print(f"\nGenerated Report:\n")
    print(report)
    # Save the report to a file
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"tests/report/{query_type}_{detail_level}_report_{timestamp}.md"
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(report)
    print(f"Report saved to: {filename}")

    # Print a snippet of the report
    report_preview = report[:500] + "..." if len(report) > 500 else report
    print(f"\nReport Preview:\n")
    print(report_preview)

    return report

async def main():
    parser = argparse.ArgumentParser(description='Test report generation with different detail levels')
-   parser.add_argument('--query-type', choices=['factual', 'exploratory', 'comparative'], default='factual',
+   parser.add_argument('--query-type', choices=['factual', 'exploratory', 'comparative', 'code'], default='factual',
                        help='Query type to test (default: factual)')
    parser.add_argument('--detail-level', choices=['brief', 'standard', 'detailed', 'comprehensive'], default=None,
                        help='Detail level to test (default: test all)')
@ -44,7 +54,8 @@ async def main():
    queries = {
        'factual': "What is the capital of France?",
        'exploratory': "How do electric vehicles impact the environment?",
-       'comparative': "Compare solar and wind energy technologies."
+       'comparative': "Compare solar and wind energy technologies.",
+       'code': "How to implement a binary search tree in Python?"
    }

    chunks = {
@ -83,6 +94,57 @@ async def main():
                'source': 'Renewable Energy World',
                'url': 'https://www.renewableenergyworld.com/solar/solar-vs-wind/'
            }
        ],
        'code': [
            {
                'content': 'A Binary Search Tree (BST) is a node-based binary tree data structure which has the following properties: The left subtree of a node contains only nodes with keys lesser than the node\'s key. The right subtree of a node contains only nodes with keys greater than the node\'s key.',
                'source': 'GeeksforGeeks',
                'url': 'https://www.geeksforgeeks.org/binary-search-tree-data-structure/'
            },
            {
                'content': '''
# Python program to implement a binary search tree

class Node:
    def __init__(self, key):
        self.left = None
        self.right = None
        self.val = key

# A utility function to insert a new node with the given key
def insert(root, key):
    if root is None:
        return Node(key)
    else:
        if root.val == key:
            return root
        elif root.val < key:
            root.right = insert(root.right, key)
        else:
            root.left = insert(root.left, key)
    return root

# A utility function to search a given key in BST
def search(root, key):
    # Base Cases: root is null or key is present at root
    if root is None or root.val == key:
        return root

    # Key is greater than root's key
    if root.val < key:
        return search(root.right, key)

    # Key is smaller than root's key
    return search(root.left, key)
''',
                'source': 'GitHub',
                'url': 'https://github.com/example/bst-implementation'
            },
            {
                'content': 'The time complexity of operations on a binary search tree is O(h) where h is the height of the tree. In the worst case, the height can be O(n) (when the tree becomes a linked list), but on average it is O(log n) for a balanced tree.',
                'source': 'Algorithm Textbook',
                'url': 'https://example.com/algorithms/bst'
            }
        ]
    }

@ -482,9 +482,14 @@ class GradioInterface:
            gr.Markdown(
                """
                This system helps you research topics by searching across multiple sources
-               including Google (via Serper), Google Scholar, and arXiv.
+               including Google (via Serper), Google Scholar, arXiv, and news sources.

                You can either search for results or generate a comprehensive report.

                **Special Capabilities:**
                - Automatically detects and optimizes current events queries
                - Specialized search handlers for different types of information
                - Semantic ranking for the most relevant results
                """
            )

@ -516,7 +521,10 @@ class GradioInterface:
            examples=[
                ["What are the latest advancements in quantum computing?"],
                ["Compare transformer and RNN architectures for NLP tasks"],
-               ["Explain the environmental impact of electric vehicles"]
+               ["Explain the environmental impact of electric vehicles"],
+               ["What recent actions has Trump taken regarding tariffs?"],
+               ["What are the recent papers on large language model alignment?"],
+               ["What are the main research findings on climate change adaptation strategies in agriculture?"]
            ],
            inputs=search_query_input
        )
@ -572,7 +580,10 @@ class GradioInterface:
                ["What are the latest advancements in quantum computing?"],
                ["Compare transformer and RNN architectures for NLP tasks"],
                ["Explain the environmental impact of electric vehicles"],
-               ["Explain the potential relationship between creatine supplementation and muscle loss due to GLP1-ar drugs for weight loss."]
+               ["Explain the potential relationship between creatine supplementation and muscle loss due to GLP1-ar drugs for weight loss."],
+               ["What recent actions has Trump taken regarding tariffs?"],
+               ["What are the recent papers on large language model alignment?"],
+               ["What are the main research findings on climate change adaptation strategies in agriculture?"]
            ],
            inputs=report_query_input
        )