# Paper System
A tool to fetch, filter, and process arXiv papers using LLM-based criteria. The system consists of three main components, which run as a pipeline (sketched after this list):
1. **arxiv-processor**: Fetches papers from arXiv based on category and date range
2. **llm_processor**: Evaluates each paper against user-specified criteria using an LLM
3. **json2md**: Generates formatted markdown output of accepted/rejected papers
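For orientation, the sketch below shows how the three stages chain together via intermediate JSON files. The function names are hypothetical stand-ins, not the actual APIs of the subpackages:
```go
package main

// Hypothetical stand-ins for the three components; the real subpackages
// expose their own APIs. This only illustrates how the stages chain
// together via intermediate JSON files.
func fetchToJSON(category, start, end, outPath string) error  { return nil } // arxiv-processor
func evaluateJSON(inPath, criteriaPath, outPath string) error { return nil } // llm_processor
func renderMarkdown(decisionsPath, mdPath string) error       { return nil } // json2md

// runPipeline wires the three stages together.
func runPipeline(category, start, end, criteriaPath, outputPath string) error {
	if err := fetchToJSON(category, start, end, "papers.json"); err != nil {
		return err
	}
	if err := evaluateJSON("papers.json", criteriaPath, "decisions.json"); err != nil {
		return err
	}
	return renderMarkdown("decisions.json", outputPath)
}

func main() {
	if err := runPipeline("cs.AI", "20240101", "20240131", "criteria.txt", "papers.md"); err != nil {
		panic(err)
	}
}
```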
## Installation
1. Ensure you have Go installed (1.20 or later)
2. Clone this repository
3. Build the system:
```bash
go build -o paper-system
```
## Configuration
The system requires an OpenRouter API key for LLM processing. Set it as an environment variable:
```bash
export OPENROUTER_API_KEY=your-api-key
```
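Internally the key is presumably read from the environment and checked before any LLM calls are made. A minimal Go sketch of that check, assuming the system exits early when the key is missing:
```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Assumed behaviour: abort with a clear message if the key is not set.
	key := os.Getenv("OPENROUTER_API_KEY")
	if key == "" {
		fmt.Fprintln(os.Stderr, "OPENROUTER_API_KEY is not set")
		os.Exit(1)
	}
	fmt.Println("API key found, length:", len(key))
}
```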
## Usage
The system can operate in two modes:
### 1. ArXiv Fetch Mode
Fetches papers from arXiv, processes them with LLM, and generates markdown output:
```bash
./paper-system \
-start 20240101 \
-end 20240131 \
-search cs.AI \
-criteria criteria.txt \
-output papers.md
```
Required flags:
- `-start`: Start date in YYYYMMDD format
- `-end`: End date in YYYYMMDD format
- `-search`: arXiv category/search query (e.g., 'cs.AI', 'physics.comp-ph')
- `-criteria`: Path to filter criteria file
Optional flags:
- `-output`: Output markdown file path (default: papers.md)
- `-model`: LLM model to use (default: nvidia/llama-3.1-nemotron-70b-instruct)
- `-max-results`: Maximum number of papers to retrieve (default: 100, max: 2000)
### 2. Input JSON Mode
Process an existing JSON file of papers (useful for running different criteria against the same dataset):
```bash
./paper-system \
-input-json papers.json \
-criteria new-criteria.txt \
-output results.md
```
Required flags:
- `-input-json`: Path to input JSON file
- `-criteria`: Path to filter criteria file
Optional flags:
- `-output`: Output markdown file path (default: papers.md)
- `-model`: LLM model to use (default: nvidia/llama-3.1-nemotron-70b-instruct)
## Input/Output Files
### Criteria File Format
Create a text file with your evaluation criteria. Example:
```
Please evaluate this paper based on the following criteria:
1. Practical Applications: Does the paper demonstrate clear real-world applications?
2. Experimental Results: Are there quantitative metrics and thorough evaluations?
3. Technical Innovation: Does the paper present novel techniques or improvements?
Respond with a JSON object containing:
{
  "decision": "ACCEPT" or "REJECT",
  "explanation": "Detailed reasoning for the decision"
}
```
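The JSON object the LLM is asked to return maps naturally onto a small struct. The sketch below shows how such a response could be decoded in Go; the type name is an assumption based on the format above, not the project's actual code:
```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Decision mirrors the JSON object the criteria file asks the LLM to return.
type Decision struct {
	Decision    string `json:"decision"`    // "ACCEPT" or "REJECT"
	Explanation string `json:"explanation"` // reasoning for the decision
}

func main() {
	raw := `{"decision": "ACCEPT", "explanation": "Strong experimental results."}`

	var d Decision
	if err := json.Unmarshal([]byte(raw), &d); err != nil {
		fmt.Println("could not parse LLM response:", err)
		return
	}
	accepted := strings.EqualFold(d.Decision, "ACCEPT")
	fmt.Printf("accepted=%v explanation=%q\n", accepted, d.Explanation)
}
```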
### Output Format
The system generates two types of output (a Go sketch covering both follows the examples below):
1. **papers.json**: Raw paper data in JSON format (when fetching from arXiv)
```json
[
  {
    "title": "Paper Title",
    "abstract": "Paper abstract...",
    "arxiv_id": "2401.12345",
    "authors": ["Author 1", "Author 2"]
  }
]
```
2. **papers.md**: Formatted markdown with accepted/rejected papers
```markdown
# Accepted Papers
## [Paper Title](https://arxiv.org/abs/2401.12345)
**arXiv ID:** 2401.12345
**Abstract:**
> Paper abstract...
**Decision Explanation:** Meets criteria for practical applications...
---
# Rejected Papers
...
```
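As a rough illustration of both formats, the sketch below loads papers.json into a struct inferred from the documented fields and renders one papers.md-style entry. The struct and helper names are assumptions, not the project's internal types:
```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// Paper mirrors one entry of papers.json as documented above.
type Paper struct {
	Title    string   `json:"title"`
	Abstract string   `json:"abstract"`
	ArxivID  string   `json:"arxiv_id"`
	Authors  []string `json:"authors"`
}

// renderEntry produces a papers.md-style section for one accepted paper.
func renderEntry(p Paper, explanation string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "## [%s](https://arxiv.org/abs/%s)\n", p.Title, p.ArxivID)
	fmt.Fprintf(&b, "**arXiv ID:** %s\n", p.ArxivID)
	fmt.Fprintf(&b, "**Abstract:**\n> %s\n", p.Abstract)
	fmt.Fprintf(&b, "**Decision Explanation:** %s\n---\n", explanation)
	return b.String()
}

func main() {
	data, err := os.ReadFile("papers.json")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	var papers []Paper
	if err := json.Unmarshal(data, &papers); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, p := range papers {
		fmt.Print(renderEntry(p, "Meets criteria for practical applications..."))
	}
}
```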
## Workflow Examples
### Basic Usage
```bash
# Fetch and evaluate papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt
# This creates:
# - papers.json (raw paper data)
# - papers.md (evaluation results)
```
### Multiple Evaluations
```bash
# 1. First fetch papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria1.txt -output results1.md
# 2. Run different criteria on the same papers
./paper-system -input-json papers.json -criteria criteria2.txt -output results2.md
```
### Fetching More Papers
```bash
# Fetch up to 2000 papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt -max-results 2000
```
## Error Handling
The system includes several safeguards (a sketch of the validation logic follows this list):
- Validates all required parameters
- Ensures max-results is between 1 and 2000
- Prevents mixing of arXiv and input JSON modes
- Retries LLM processing on failure
- Maintains temporary files for debugging
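A rough sketch of what that validation could look like using Go's standard `flag` package. The flag names and limits follow this README; the function itself is a hypothetical illustration, not the project's actual code:
```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

func main() {
	// Flag names and defaults mirror the documented CLI.
	start := flag.String("start", "", "start date YYYYMMDD")
	end := flag.String("end", "", "end date YYYYMMDD")
	search := flag.String("search", "", "arXiv category/search query")
	criteria := flag.String("criteria", "", "path to criteria file")
	inputJSON := flag.String("input-json", "", "path to input JSON file")
	maxResults := flag.Int("max-results", 100, "maximum papers to retrieve")
	flag.Parse()

	if err := validate(*start, *end, *search, *criteria, *inputJSON, *maxResults); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}

// validate enforces the safeguards listed above.
func validate(start, end, search, criteria, inputJSON string, maxResults int) error {
	if criteria == "" {
		return errors.New("-criteria is required")
	}
	if maxResults < 1 || maxResults > 2000 {
		return errors.New("-max-results must be between 1 and 2000")
	}
	fetchMode := start != "" || end != "" || search != ""
	if inputJSON != "" && fetchMode {
		return errors.New("-input-json cannot be combined with -start/-end/-search")
	}
	if inputJSON == "" && (start == "" || end == "" || search == "") {
		return errors.New("-start, -end, and -search are required when not using -input-json")
	}
	return nil
}
```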
## Notes
- The system preserves papers.json when fetching from arXiv, allowing for future reuse
- Temporary files (temp_input.json, temp_output.json) are automatically cleaned up
- The LLM processor uses a batch size of 32 papers for efficient processing (a sketch of this chunking follows)
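As a rough illustration of what batching by 32 means in practice, the helper below splits a list of papers into consecutive chunks; it is a sketch, not the project's code:
```go
package main

import "fmt"

// batch splits items into consecutive chunks of at most size elements,
// mirroring the 32-paper batches mentioned above.
func batch(items []string, size int) [][]string {
	var out [][]string
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

func main() {
	papers := make([]string, 100)
	for i := range papers {
		papers[i] = fmt.Sprintf("paper-%03d", i)
	}
	for i, b := range batch(papers, 32) {
		fmt.Printf("batch %d: %d papers\n", i+1, len(b))
	}
}
```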