# Paper System
A tool to fetch, filter, and process arXiv papers using LLM-based criteria. The system consists of three main components, which run as a pipeline (sketched after this list):
1. **arxiv-processor**: Fetches papers from arXiv based on category and date range
2. **llm_processor**: Evaluates each paper against user-specified criteria using an LLM
3. **json2md**: Generates formatted markdown output of accepted/rejected papers
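For orientation, the sketch below shows how the three stages chain together via intermediate JSON files. The function names are hypothetical stand-ins, not the actual APIs of the subpackages:
```go
package main

// Hypothetical stand-ins for the three components; the real subpackages
// expose their own APIs. This only illustrates how the stages chain
// together via intermediate JSON files.
func fetchToJSON(category, start, end, outPath string) error  { return nil } // arxiv-processor
func evaluateJSON(inPath, criteriaPath, outPath string) error { return nil } // llm_processor
func renderMarkdown(decisionsPath, mdPath string) error       { return nil } // json2md

// runPipeline wires the three stages together.
func runPipeline(category, start, end, criteriaPath, outputPath string) error {
	if err := fetchToJSON(category, start, end, "papers.json"); err != nil {
		return err
	}
	if err := evaluateJSON("papers.json", criteriaPath, "decisions.json"); err != nil {
		return err
	}
	return renderMarkdown("decisions.json", outputPath)
}

func main() {
	if err := runPipeline("cs.AI", "20240101", "20240131", "criteria.txt", "papers.md"); err != nil {
		panic(err)
	}
}
```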
## Installation
1. Ensure you have Go installed (1.20 or later)
2. Clone this repository
3. Build the system:
```bash
go build -o paper-system
```
## Configuration
The system requires an OpenRouter API key for LLM processing. Set it as an environment variable:
```bash
export OPENROUTER_API_KEY=your-api-key
```
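Internally the key is presumably read from the environment and checked before any LLM calls are made. A minimal Go sketch of that check, assuming the system exits early when the key is missing:
```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Assumed behaviour: abort with a clear message if the key is not set.
	key := os.Getenv("OPENROUTER_API_KEY")
	if key == "" {
		fmt.Fprintln(os.Stderr, "OPENROUTER_API_KEY is not set")
		os.Exit(1)
	}
	fmt.Println("API key found, length:", len(key))
}
```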
## Usage
The system can operate in two modes:
### 1. ArXiv Fetch Mode
Fetches papers from arXiv, processes them with LLM, and generates markdown output:
```bash
./paper-system \
-start 20240101 \
-end 20240131 \
-search cs.AI \
-criteria criteria.txt \
-output papers.md
```
Required flags:
- `-start`: Start date in YYYYMMDD format
- `-end`: End date in YYYYMMDD format
- `-search`: arXiv category/search query (e.g., 'cs.AI', 'physics.comp-ph')
- `-criteria`: Path to filter criteria file
Optional flags:
- `-output`: Output markdown file path (default: papers.md)
- `-model`: LLM model to use (default: nvidia/llama-3.1-nemotron-70b-instruct)
- `-max-results`: Maximum number of papers to retrieve (default: 100, max: 2000)
### 2. Input JSON Mode
Process an existing JSON file of papers (useful for running different criteria against the same dataset):
```bash
./paper-system \
-input-json papers.json \
-criteria new-criteria.txt \
-output results.md
```
Required flags:
- `-input-json`: Path to input JSON file
- `-criteria`: Path to filter criteria file
Optional flags:
- `-output`: Output markdown file path (default: papers.md)
- `-model`: LLM model to use (default: nvidia/llama-3.1-nemotron-70b-instruct)
## Input/Output Files
### Criteria File Format
Create a text file with your evaluation criteria. Example:
```
Please evaluate this paper based on the following criteria:
1. Practical Applications: Does the paper demonstrate clear real-world applications?
2. Experimental Results: Are there quantitative metrics and thorough evaluations?
3. Technical Innovation: Does the paper present novel techniques or improvements?
Respond with a JSON object containing:
{
  "decision": "ACCEPT" or "REJECT",
  "explanation": "Detailed reasoning for the decision"
}
```
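The JSON object the LLM is asked to return maps naturally onto a small struct. The sketch below shows how such a response could be decoded in Go; the type name is an assumption based on the format above, not the project's actual code:
```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Decision mirrors the JSON object the criteria file asks the LLM to return.
type Decision struct {
	Decision    string `json:"decision"`    // "ACCEPT" or "REJECT"
	Explanation string `json:"explanation"` // reasoning for the decision
}

func main() {
	raw := `{"decision": "ACCEPT", "explanation": "Strong experimental results."}`

	var d Decision
	if err := json.Unmarshal([]byte(raw), &d); err != nil {
		fmt.Println("could not parse LLM response:", err)
		return
	}
	accepted := strings.EqualFold(d.Decision, "ACCEPT")
	fmt.Printf("accepted=%v explanation=%q\n", accepted, d.Explanation)
}
```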
### Output Format
The system generates two types of output (a Go sketch covering both follows the examples below):
1. **papers.json**: Raw paper data in JSON format (when fetching from arXiv)
```json
[
  {
    "title": "Paper Title",
    "abstract": "Paper abstract...",
    "arxiv_id": "2401.12345",
    "authors": ["Author 1", "Author 2"]
  }
]
```
2. **papers.md**: Formatted markdown with accepted/rejected papers
```markdown
# Accepted Papers
## [Paper Title](https://arxiv.org/abs/2401.12345)
**arXiv ID:** 2401.12345
**Abstract:**
> Paper abstract...
**Decision Explanation:** Meets criteria for practical applications...
---
# Rejected Papers
...
```
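As a rough illustration of both formats, the sketch below loads papers.json into a struct inferred from the documented fields and renders one papers.md-style entry. The struct and helper names are assumptions, not the project's internal types:
```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// Paper mirrors one entry of papers.json as documented above.
type Paper struct {
	Title    string   `json:"title"`
	Abstract string   `json:"abstract"`
	ArxivID  string   `json:"arxiv_id"`
	Authors  []string `json:"authors"`
}

// renderEntry produces a papers.md-style section for one accepted paper.
func renderEntry(p Paper, explanation string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "## [%s](https://arxiv.org/abs/%s)\n", p.Title, p.ArxivID)
	fmt.Fprintf(&b, "**arXiv ID:** %s\n", p.ArxivID)
	fmt.Fprintf(&b, "**Abstract:**\n> %s\n", p.Abstract)
	fmt.Fprintf(&b, "**Decision Explanation:** %s\n---\n", explanation)
	return b.String()
}

func main() {
	data, err := os.ReadFile("papers.json")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	var papers []Paper
	if err := json.Unmarshal(data, &papers); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, p := range papers {
		fmt.Print(renderEntry(p, "Meets criteria for practical applications..."))
	}
}
```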
## Workflow Examples
### Basic Usage
```bash
# Fetch and evaluate papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt
# This creates:
# - papers.json (raw paper data)
# - papers.md (evaluation results)
```
### Multiple Evaluations
```bash
# 1. First fetch papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria1.txt -output results1.md
# 2. Run different criteria on the same papers
./paper-system -input-json papers.json -criteria criteria2.txt -output results2.md
```
### Fetching More Papers
```bash
# Fetch up to 2000 papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt -max-results 2000
```
## Error Handling
The system includes several safeguards (a sketch of the validation logic follows this list):
- Validates all required parameters
- Ensures max-results is between 1 and 2000
- Prevents mixing of arXiv and input JSON modes
- Retries LLM processing on failure
- Maintains temporary files for debugging
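A rough sketch of what that validation could look like using Go's standard `flag` package. The flag names and limits follow this README; the function itself is a hypothetical illustration, not the project's actual code:
```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

func main() {
	// Flag names and defaults mirror the documented CLI.
	start := flag.String("start", "", "start date YYYYMMDD")
	end := flag.String("end", "", "end date YYYYMMDD")
	search := flag.String("search", "", "arXiv category/search query")
	criteria := flag.String("criteria", "", "path to criteria file")
	inputJSON := flag.String("input-json", "", "path to input JSON file")
	maxResults := flag.Int("max-results", 100, "maximum papers to retrieve")
	flag.Parse()

	if err := validate(*start, *end, *search, *criteria, *inputJSON, *maxResults); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}

// validate enforces the safeguards listed above.
func validate(start, end, search, criteria, inputJSON string, maxResults int) error {
	if criteria == "" {
		return errors.New("-criteria is required")
	}
	if maxResults < 1 || maxResults > 2000 {
		return errors.New("-max-results must be between 1 and 2000")
	}
	fetchMode := start != "" || end != "" || search != ""
	if inputJSON != "" && fetchMode {
		return errors.New("-input-json cannot be combined with -start/-end/-search")
	}
	if inputJSON == "" && (start == "" || end == "" || search == "") {
		return errors.New("-start, -end, and -search are required when not using -input-json")
	}
	return nil
}
```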
## Notes
- The system preserves papers.json when fetching from arXiv, allowing for future reuse
- Temporary files (temp_input.json, temp_output.json) are automatically cleaned up
- The LLM processor uses a batch size of 32 papers for efficient processing (a sketch of this chunking follows)
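As a rough illustration of what batching by 32 means in practice, the helper below splits a list of papers into consecutive chunks; it is a sketch, not the project's code:
```go
package main

import "fmt"

// batch splits items into consecutive chunks of at most size elements,
// mirroring the 32-paper batches mentioned above.
func batch(items []string, size int) [][]string {
	var out [][]string
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

func main() {
	papers := make([]string, 100)
	for i := range papers {
		papers[i] = fmt.Sprintf("paper-%03d", i)
	}
	for i, b := range batch(papers, 32) {
		fmt.Printf("batch %d: %d papers\n", i+1, len(b))
	}
}
```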