# Paper System

A tool to fetch, filter, and process arXiv papers using LLM-based criteria. The system consists of three main components:

1. **arxiv-processor**: Fetches papers from arXiv based on category and date range
2. **llm_processor**: Evaluates papers using specified criteria through an LLM
3. **json2md**: Generates formatted markdown output of accepted/rejected papers

## Installation

1. Ensure you have Go installed (1.20 or later)
2. Clone this repository
3. Build the system:
```bash
go build -o paper-system
```

## Configuration

The system requires an OpenRouter API key for LLM processing. Set it as an environment variable:

```bash
export OPENROUTER_API_KEY=your-api-key
```

## Usage

The system can operate in two modes:

### 1. ArXiv Fetch Mode

Fetches papers from arXiv, evaluates them with the LLM, and generates markdown output:

```bash
./paper-system \
  -start 20240101 \
  -end 20240131 \
  -search cs.AI \
  -criteria criteria.txt \
  -output papers.md
```

Required flags:
- `-start`: Start date in YYYYMMDD format
- `-end`: End date in YYYYMMDD format
- `-search`: arXiv category/search query (e.g., `cs.AI`, `physics.comp-ph`)
- `-criteria`: Path to filter criteria file

Optional flags:
- `-output`: Output markdown file path (default: papers.md)
- `-model`: LLM model to use (default: nvidia/llama-3.1-nemotron-70b-instruct)
- `-max-results`: Maximum number of papers to retrieve (default: 100, max: 2000)
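
The constraints on these flags can be checked before any network call. The following is a minimal sketch, not the tool's actual code: `validateFetchArgs` is a hypothetical helper, and rejecting an end date earlier than the start date is an assumption that the flag descriptions above do not state.

```go
package main

import (
	"fmt"
	"time"
)

// dateLayout is Go's reference-time layout for the YYYYMMDD format.
const dateLayout = "20060102"

// validateFetchArgs is a hypothetical helper (not taken from paper-system).
// It checks that both dates parse as YYYYMMDD and that maxResults is in 1..2000.
// Rejecting end < start is an extra assumption.
func validateFetchArgs(start, end string, maxResults int) error {
	s, err := time.Parse(dateLayout, start)
	if err != nil {
		return fmt.Errorf("invalid -start %q: %w", start, err)
	}
	e, err := time.Parse(dateLayout, end)
	if err != nil {
		return fmt.Errorf("invalid -end %q: %w", end, err)
	}
	if e.Before(s) {
		return fmt.Errorf("-end %s is before -start %s", end, start)
	}
	if maxResults < 1 || maxResults > 2000 {
		return fmt.Errorf("-max-results must be between 1 and 2000, got %d", maxResults)
	}
	return nil
}

func main() {
	// Valid arguments produce a nil error.
	fmt.Println(validateFetchArgs("20240101", "20240131", 100))
	// An out-of-range -max-results value is rejected.
	fmt.Println(validateFetchArgs("20240101", "20240131", 5000))
}
```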

### 2. Input JSON Mode

Processes an existing JSON file of papers (useful for running different criteria against the same dataset):

```bash
./paper-system \
  -input-json papers.json \
  -criteria new-criteria.txt \
  -output results.md
```

Required flags:
- `-input-json`: Path to input JSON file
- `-criteria`: Path to filter criteria file

Optional flags:
- `-output`: Output markdown file path (default: papers.md)
- `-model`: LLM model to use (default: nvidia/llama-3.1-nemotron-70b-instruct)

## Input/Output Files

### Criteria File Format

Create a text file with your evaluation criteria. Example:
```
Please evaluate this paper based on the following criteria:

1. Practical Applications: Does the paper demonstrate clear real-world applications?
2. Experimental Results: Are there quantitative metrics and thorough evaluations?
3. Technical Innovation: Does the paper present novel techniques or improvements?

Respond with a JSON object containing:
{
  "decision": "ACCEPT" or "REJECT",
  "explanation": "Detailed reasoning for the decision"
}
```
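
A reply in the shape requested above could be decoded as follows. This is a sketch under assumptions: the `Decision` type and `parseDecision` function are illustrative names, not the tool's real ones, and normalizing the decision to upper case is a choice made for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Decision mirrors the JSON object the example criteria file asks the LLM to return.
type Decision struct {
	Decision    string `json:"decision"`
	Explanation string `json:"explanation"`
}

// parseDecision is a hypothetical helper: it decodes the reply and
// accepts only ACCEPT or REJECT (case-insensitively) as the decision.
func parseDecision(reply string) (Decision, error) {
	var d Decision
	if err := json.Unmarshal([]byte(reply), &d); err != nil {
		return Decision{}, fmt.Errorf("decode reply: %w", err)
	}
	d.Decision = strings.ToUpper(strings.TrimSpace(d.Decision))
	if d.Decision != "ACCEPT" && d.Decision != "REJECT" {
		return Decision{}, fmt.Errorf("unexpected decision %q", d.Decision)
	}
	return d, nil
}

func main() {
	d, err := parseDecision(`{"decision": "ACCEPT", "explanation": "Clear real-world application."}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(d.Decision, "-", d.Explanation)
}
```

Rejecting anything other than the two expected values makes a malformed reply fail loudly instead of being silently filed under one heading.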

### Output Format

The system generates dated output files when using arXiv queries:

1. **`YYYYMMDD-YYYYMMDD-CATEGORY-papers.json`**: Raw arXiv results  
2. **`YYYYMMDD-YYYYMMDD-CATEGORY-papers.md`**: Final filtered results

When using `-input-json`, specify the output file name with `-output`.
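
The dated naming pattern above can be expressed as a small function. This is only a sketch: `outputNames` is a hypothetical name, and it assumes the category string is inserted verbatim, which may differ from how the tool handles queries that contain spaces or other special characters.

```go
package main

import "fmt"

// outputNames is a hypothetical helper that builds the two dated file names
// from the pattern YYYYMMDD-YYYYMMDD-CATEGORY-papers.{json,md}.
// Assumption: the category is used as-is, with no escaping.
func outputNames(start, end, category string) (jsonPath, mdPath string) {
	base := fmt.Sprintf("%s-%s-%s-papers", start, end, category)
	return base + ".json", base + ".md"
}

func main() {
	j, m := outputNames("20240101", "20240131", "cs.AI")
	fmt.Println(j) // 20240101-20240131-cs.AI-papers.json
	fmt.Println(m) // 20240101-20240131-cs.AI-papers.md
}
```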

Example files:
```json
[
  {
    "title": "Paper Title",
    "abstract": "Paper abstract...",
    "arxiv_id": "2401.12345",
    "authors": ["Author 1", "Author 2"]
  }
]
```

```markdown
# Accepted Papers

## [Paper Title](https://arxiv.org/abs/2401.12345)
**arXiv ID:** 2401.12345

**Abstract:**
> Paper abstract...

**Decision Explanation:** Meets criteria for practical applications...

---

# Rejected Papers
...
```
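
One way to produce an entry in the markdown layout shown above from a record in the papers JSON is sketched below. The `Paper` struct follows the fields of the example JSON; `renderEntry` is a hypothetical function, not the actual json2md code.

```go
package main

import (
	"fmt"
	"strings"
)

// Paper matches the fields of the example papers JSON.
type Paper struct {
	Title    string   `json:"title"`
	Abstract string   `json:"abstract"`
	ArxivID  string   `json:"arxiv_id"`
	Authors  []string `json:"authors"`
}

// renderEntry is a hypothetical helper that formats one paper in the
// layout of the example markdown output: linked title, arXiv ID,
// quoted abstract, decision explanation, and a horizontal rule.
func renderEntry(p Paper, explanation string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "## [%s](https://arxiv.org/abs/%s)\n", p.Title, p.ArxivID)
	fmt.Fprintf(&b, "**arXiv ID:** %s\n\n", p.ArxivID)
	fmt.Fprintf(&b, "**Abstract:**\n> %s\n\n", p.Abstract)
	fmt.Fprintf(&b, "**Decision Explanation:** %s\n\n---\n", explanation)
	return b.String()
}

func main() {
	p := Paper{
		Title:    "Paper Title",
		Abstract: "Paper abstract...",
		ArxivID:  "2401.12345",
		Authors:  []string{"Author 1", "Author 2"},
	}
	fmt.Print(renderEntry(p, "Meets criteria for practical applications..."))
}
```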

## Workflow Examples

### Basic Usage
```bash
# Fetch and evaluate papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt

# This creates:
# - papers.json (raw paper data)
# - papers.md (evaluation results)
```

### Multiple Evaluations
```bash
# 1. First fetch papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria1.txt -output results1.md

# 2. Run different criteria on the same papers
./paper-system -input-json papers.json -criteria criteria2.txt -output results2.md
```

### Fetching More Papers
```bash
# Fetch up to 2000 papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt -max-results 2000
```

## Error Handling

The system includes several safeguards:

- Validates all required parameters
- Ensures `-max-results` is between 1 and 2000
- Prevents mixing of arXiv and input JSON modes
- Retries LLM processing on failure
- Maintains temporary files for debugging
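
The retry safeguard in the list above can be illustrated with a generic helper. This is a sketch only: the attempt count, the fixed delay, and the names `retry` and `errTransient` are assumptions, since this README does not specify how the tool retries.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a temporary LLM failure in this example.
var errTransient = errors.New("transient failure")

// retry is a hypothetical helper: it calls fn up to attempts times,
// sleeping for delay between failed attempts, and returns the last
// error if every attempt fails.
func retry(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 1; i <= attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retry(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient // fail the first two calls
		}
		return nil
	})
	fmt.Println("calls:", calls, "error:", err)
}
```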

## Notes

- The system preserves papers.json when fetching from arXiv, allowing for future reuse
- Temporary files (temp_input.json, temp_output.json) are automatically cleaned up
- The LLM processor uses a batch size of 32 papers for efficient processing
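
Batching as described in the last note can be sketched as splitting a slice into groups of at most 32. The `chunk` function is a hypothetical illustration, not the llm_processor implementation.

```go
package main

import "fmt"

// batchSize is the batch size stated in the notes above.
const batchSize = 32

// chunk is a hypothetical helper that splits items into consecutive
// batches of at most size elements. The final batch may be smaller.
func chunk[T any](items []T, size int) [][]T {
	if size < 1 {
		size = 1
	}
	var out [][]T
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

func main() {
	papers := make([]int, 100) // stand-ins for 100 fetched papers
	batches := chunk(papers, batchSize)
	// 100 papers split as 32 + 32 + 32 + 4.
	fmt.Println(len(batches), "batches; last has", len(batches[len(batches)-1]), "papers")
}
```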