Steve White af4a0f8fe8 Updated README.md		2025-01-29 09:35:53 -06:00

Papers

A Go CLI tool for fetching, processing, and analyzing academic papers from arXiv using LLM-based evaluation.

The primary goal here was to do something more flexible than keyword search. So many papers are published that I don't have time to read them all, so I ask an LLM to help. I provide a criteria file containing a natural-language description of what I care about, and the tool filters the abstracts against those criteria.

Example criteria:

Accepted papers MUST:

* primarily address LLMs (Large Language Models)

Accepted papers MUST NOT:

* primarily address legal, social, or ethical subjects
* primarily address medical applications

REJECT explanations can be very brief, less than 30 tokens.

This is hard to pull off with keyword searches. You might exclude every paper that contains the word "ethical", only to find that several of them mention it in a notes or limitations section that has nothing to do with the paper's actual subject.
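As a rough illustration of the idea, the criteria text and a paper's abstract could be combined into a single prompt along these lines. This is a minimal sketch, not the tool's actual prompt: the `Paper` struct, the `buildPrompt` helper, and the ACCEPT/REJECT reply format are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Paper holds the fields the evaluator needs. The struct and its field
// names are hypothetical; the tool's real data model may differ.
type Paper struct {
	Title    string
	Abstract string
}

// buildPrompt combines the criteria text with one paper's title and abstract.
// The wording and the ACCEPT/REJECT reply format are assumptions.
func buildPrompt(criteria string, p Paper) string {
	var b strings.Builder
	b.WriteString("Evaluate the paper below against these criteria.\n\n")
	b.WriteString("Criteria:\n")
	b.WriteString(strings.TrimSpace(criteria))
	b.WriteString("\n\nTitle: ")
	b.WriteString(p.Title)
	b.WriteString("\nAbstract: ")
	b.WriteString(p.Abstract)
	b.WriteString("\n\nAnswer ACCEPT or REJECT, followed by a brief explanation.")
	return b.String()
}

func main() {
	criteria := "Accepted papers MUST:\n* primarily address LLMs (Large Language Models)\n"
	p := Paper{Title: "An example title", Abstract: "An example abstract."}
	fmt.Println(buildPrompt(criteria, p))
}
```

Because the model reads the whole abstract against the whole criteria text, a passing mention of a word like "ethical" does not by itself decide the outcome.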

Features

Fetch papers from arXiv API based on date range and search query
Process papers using configurable LLM models (default: phi-4)
Generate both JSON and Markdown outputs
Customizable evaluation criteria
Rate-limited API requests (2-second delay between requests)
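
The fixed delay between requests can be pictured with a small helper like the one below. It is only a sketch of the general technique (sleep between consecutive calls); `forEachWithDelay` and the page identifiers are hypothetical and are not taken from the tool's source.

```go
package main

import (
	"fmt"
	"time"
)

// forEachWithDelay calls fn for every item, sleeping for delay between
// consecutive calls (not before the first call and not after the last).
func forEachWithDelay(items []string, delay time.Duration, fn func(string) error) error {
	for i, item := range items {
		if i > 0 {
			time.Sleep(delay)
		}
		if err := fn(item); err != nil {
			return fmt.Errorf("item %d (%q): %w", i, item, err)
		}
	}
	return nil
}

func main() {
	pages := []string{"page-1", "page-2", "page-3"} // hypothetical request identifiers
	start := time.Now()
	err := forEachWithDelay(pages, 2*time.Second, func(p string) error {
		fmt.Printf("%5.1fs  requesting %s\n", time.Since(start).Seconds(), p)
		return nil // a real caller would issue the HTTP request here
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```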

Installation

go install gitea.r8z.us/stwhite/papers@latest

Usage

Basic usage:

papers -start 20240101 -end 20240131 -query "machine learning" -api-key "your-key"

With custom model and output paths:

papers -start 20240101 -end 20240131 -query "machine learning" -api-key "your-key" \
  -model "gpt-4" -json-output "results.json" -md-output "summary.md"

Fetch papers without processing:

papers -search-only -start 20240101 -end 20240131 -query "machine learning"

Use input file:

papers -input papers.json -api-key "your-key" -criteria criteria.md

Required Flags

-start: Start date (YYYYMMDD format)
-end: End date (YYYYMMDD format)
-query: Search query

Optional Flags

-search-only: Fetch papers from arXiv and save to JSON file without processing
-input: Input JSON file containing previously fetched papers to process
-maxResults: Maximum number of results to fetch (1-2000, default: 100)
-model: LLM model to use for processing (default: "phi-4")
-api-endpoint: API endpoint URL (default: "http://localhost:1234/v1/chat/completions")
-criteria: Path to evaluation criteria markdown file (default: "criteria.md")
-json-output: Custom JSON output file path (default: YYYYMMDD-YYYYMMDD-query.json)
-md-output: Custom Markdown output file path (default: YYYYMMDD-YYYYMMDD-query.md)
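
For readers who want to see how these flags and defaults could be wired up with Go's standard `flag` package, here is a minimal sketch. The `Config` struct and `parseFlags` function are hypothetical, the `-api-key` default is assumed to be empty because it is not documented above, and papers.go may be organized quite differently.

```go
package main

import (
	"flag"
	"fmt"
)

// Config collects the documented flags. The struct itself is hypothetical.
type Config struct {
	Start, End, Query    string
	SearchOnly           bool
	Input                string
	MaxResults           int
	Model, APIEndpoint   string
	APIKey               string // used in the usage examples; default assumed empty
	Criteria             string
	JSONOutput, MDOutput string
}

// parseFlags registers the flags with the defaults listed above and parses args.
func parseFlags(args []string) (Config, error) {
	var c Config
	fs := flag.NewFlagSet("papers", flag.ContinueOnError)
	fs.StringVar(&c.Start, "start", "", "start date (YYYYMMDD)")
	fs.StringVar(&c.End, "end", "", "end date (YYYYMMDD)")
	fs.StringVar(&c.Query, "query", "", "search query")
	fs.BoolVar(&c.SearchOnly, "search-only", false, "fetch papers without processing")
	fs.StringVar(&c.Input, "input", "", "input JSON file containing papers")
	fs.IntVar(&c.MaxResults, "maxResults", 100, "maximum number of results (1-2000)")
	fs.StringVar(&c.Model, "model", "phi-4", "LLM model to use")
	fs.StringVar(&c.APIEndpoint, "api-endpoint", "http://localhost:1234/v1/chat/completions", "API endpoint URL")
	fs.StringVar(&c.APIKey, "api-key", "", "API key for the LLM endpoint")
	fs.StringVar(&c.Criteria, "criteria", "criteria.md", "evaluation criteria file")
	fs.StringVar(&c.JSONOutput, "json-output", "", "JSON output path (empty means the default name)")
	fs.StringVar(&c.MDOutput, "md-output", "", "Markdown output path (empty means the default name)")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	return c, nil
}

func main() {
	cfg, err := parseFlags([]string{"-start", "20240101", "-end", "20240131", "-query", "machine learning"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```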

NB: The default API endpoint is a local LM Studio server, and Phi-4 does a great job of filtering papers.
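
The default endpoint path (`/v1/chat/completions`) follows the OpenAI-style chat completions convention, so a request to it might be built roughly as follows. This is a sketch under that assumption: the `chatRequest` and `chatMessage` types and the `newChatRequest` helper are hypothetical, and the request the tool actually sends may carry additional fields.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// chatMessage and chatRequest model the minimal OpenAI-style request body.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

// newChatRequest builds (but does not send) a POST request for the endpoint.
func newChatRequest(endpoint, apiKey, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, fmt.Errorf("encode request: %w", err)
	}
	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(body))
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Content-Type", "application/json")
	if apiKey != "" {
		// Bearer authentication is the usual convention for such endpoints.
		req.Header.Set("Authorization", "Bearer "+apiKey)
	}
	return req, nil
}

func main() {
	req, err := newChatRequest("http://localhost:1234/v1/chat/completions",
		"your-key", "phi-4", "Is this paper primarily about LLMs?")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL)
}
```

Sending the request is then a matter of passing it to `http.DefaultClient.Do` (or a client with a timeout) and decoding the JSON response.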

Pipeline

Fetch: Retrieves papers from arXiv based on specified date range and query
Save: Stores raw paper data in JSON format
Process: Evaluates papers using the specified LLM model according to criteria
Format: Generates both JSON and Markdown outputs of the processed results

Output Files

The tool generates two types of output files:

- JSON Output: Contains the raw processing results
  - Default name format: YYYYMMDD-YYYYMMDD-query.json
  - Can be customized with -json-output flag
- Markdown Output: Human-readable formatted results
  - Default name format: YYYYMMDD-YYYYMMDD-query.md
  - Can be customized with -md-output flag
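
A default name in the YYYYMMDD-YYYYMMDD-query pattern could be derived as in the sketch below. How the tool actually sanitizes the query (for example, what it does with spaces) is not documented here, so replacing whitespace with underscores is purely an assumption, as is the `defaultOutputName` helper itself.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultOutputName builds "<start>-<end>-<query>.<ext>".
// Assumption: runs of whitespace in the query become a single underscore.
func defaultOutputName(start, end, query, ext string) string {
	slug := strings.Join(strings.Fields(query), "_")
	return fmt.Sprintf("%s-%s-%s.%s", start, end, slug, ext)
}

func main() {
	fmt.Println(defaultOutputName("20240101", "20240131", "machine learning", "json"))
	fmt.Println(defaultOutputName("20240101", "20240131", "machine learning", "md"))
}
```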

Dependencies

arxiva: Paper fetching from arXiv
paperprocessor: LLM-based paper processing
paperformatter: Output formatting

Error Handling

The tool performs the following checks:

Date format validation (YYYYMMDD)
Required flag validation
Maximum results range validation (1-2000)
File system operations verification
API request error handling
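
The first three checks can be expressed compactly with the standard library. The sketch below is one possible way to write them, not a copy of the tool's code: the `validate` function and its error messages are hypothetical, while the YYYYMMDD format and the 1-2000 range come from the flag descriptions above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// dateLayout is Go's reference-time layout for YYYYMMDD.
const dateLayout = "20060102"

// validate checks required values, date formats, and the maxResults range.
func validate(start, end, query string, maxResults int) error {
	if start == "" || end == "" || query == "" {
		return errors.New("-start, -end, and -query are required")
	}
	if _, err := time.Parse(dateLayout, start); err != nil {
		return fmt.Errorf("invalid -start %q: want YYYYMMDD", start)
	}
	if _, err := time.Parse(dateLayout, end); err != nil {
		return fmt.Errorf("invalid -end %q: want YYYYMMDD", end)
	}
	if maxResults < 1 || maxResults > 2000 {
		return fmt.Errorf("invalid -maxResults %d: want 1-2000", maxResults)
	}
	return nil
}

func main() {
	fmt.Println(validate("20240101", "20240131", "machine learning", 100))
	fmt.Println(validate("2024-01-01", "20240131", "machine learning", 100))
	fmt.Println(validate("20240101", "20240131", "machine learning", 5000))
}
```

Using `time.Parse` rather than a length or digit check also rejects impossible dates such as month 13.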

License

MIT License