Steve White af4a0f8fe8
Papers
A Go CLI tool for fetching, processing, and analyzing academic papers from arXiv using LLM-based evaluation.
The primary goal here was to do something more flexible than keyword search. So many papers are published that I don't have time to read them all, so I ask an LLM to help. I provide a file (the criteria file) with a natural-language description of what I care about, and the abstracts are filtered against those criteria.
Example criteria:
Accepted papers MUST:
* primarily address LLMs (Large Language Models)
Accepted papers MUST NOT:
* primarily address legal, social, or ethical subjects
* primarily address medical applications
REJECT explanations can be very brief, less than 30 tokens.
This is hard to pull off with keyword searches. You might exclude every paper that includes the word "ethical", only to find that several papers mention it in their notes or limitations section even though it has nothing to do with the paper itself.
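To make the failure mode concrete, here is a minimal sketch of a naive keyword filter. It is not part of this tool; the function name `keywordReject` and the sample abstract are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// keywordReject reports whether text contains any banned keyword,
// using a case-insensitive substring match. Hypothetical helper,
// shown only to illustrate why keyword filtering is too blunt.
func keywordReject(text string, banned []string) bool {
	lower := strings.ToLower(text)
	for _, kw := range banned {
		if strings.Contains(lower, strings.ToLower(kw)) {
			return true
		}
	}
	return false
}

func main() {
	// Made-up abstract: the paper is about LLM decoding and mentions
	// ethics only in passing.
	abstract := "We propose a faster decoding method for large language models. " +
		"Ethical considerations are discussed in the limitations section."

	// The keyword filter rejects the paper even though it is on topic.
	fmt.Println(keywordReject(abstract, []string{"ethical"})) // prints true
}
```

An LLM given the criteria above can instead judge what the paper primarily addresses, which is the distinction a substring match cannot make.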
Features
- Fetch papers from arXiv API based on date range and search query
- Process papers using configurable LLM models (default: phi-4)
- Generate both JSON and Markdown outputs
- Customizable evaluation criteria
- Rate-limited API requests (2-second delay between requests)
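The rate limiting mentioned in the list above could look roughly like the following. This is a sketch under assumptions, not the tool's actual code: `fetchPages` and its callback are hypothetical names, and the real fetching is done by the arxiva dependency.

```go
package main

import (
	"fmt"
	"time"
)

// fetchPages calls fetch once per page and sleeps for delay between
// consecutive calls, so requests are never sent back to back.
// Hypothetical helper; the tool's real request loop may differ.
func fetchPages(n int, delay time.Duration, fetch func(page int) error) error {
	for page := 0; page < n; page++ {
		if page > 0 {
			time.Sleep(delay) // wait before every request except the first
		}
		if err := fetch(page); err != nil {
			return fmt.Errorf("page %d: %w", page, err)
		}
	}
	return nil
}

func main() {
	// 2-second spacing, matching the delay stated in the feature list.
	err := fetchPages(3, 2*time.Second, func(page int) error {
		fmt.Println("fetching page", page)
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```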
Installation
go install gitea.r8z.us/stwhite/papers@latest
Usage
Basic usage:
papers -start 20240101 -end 20240131 -query "machine learning" -api-key "your-key"
With custom model and output paths:
papers -start 20240101 -end 20240131 -query "machine learning" -api-key "your-key" \
-model "gpt-4" -json-output "results.json" -md-output "summary.md"
Fetch papers without processing:
papers -search-only -start 20240101 -end 20240131 -query "machine learning"
Use input file:
papers -input papers.json -api-key "your-key" -criteria criteria.md
Required Flags
-start: Start date (YYYYMMDD format)
-end: End date (YYYYMMDD format)
-query: Search query
Optional Flags
-search-only: Fetch papers from arXiv and save to JSON file without processing
-input: Input JSON file containing papers (optional)
-maxResults: Maximum number of results to fetch (1-2000, default: 100)
-model: LLM model to use for processing (default: "phi-4")
-api-endpoint: API endpoint URL (default: "http://localhost:1234/v1/chat/completions")
-criteria: Path to evaluation criteria markdown file (default: "criteria.md")
-json-output: Custom JSON output file path (default: YYYYMMDD-YYYYMMDD-query.json)
-md-output: Custom Markdown output file path (default: YYYYMMDD-YYYYMMDD-query.md)
NB: the default API endpoint is the local LM Studio server, and Phi-4 does a great job filtering papers.
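The default endpoint path (/v1/chat/completions) follows the OpenAI-style chat completions convention, so a request body presumably carries a model name and a list of messages. The sketch below shows one plausible shape. How paperprocessor actually splits the criteria and the abstract across messages is not documented here, so the system/user split and the names `chatRequest` and `buildRequest` are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// chatMessage and chatRequest mirror the minimal fields of an
// OpenAI-style chat completions request.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

// buildRequest puts the criteria in the system message and the abstract
// in the user message. This split is an assumption for illustration.
func buildRequest(model, criteria, abstract string) ([]byte, error) {
	req := chatRequest{
		Model: model,
		Messages: []chatMessage{
			{Role: "system", Content: criteria},
			{Role: "user", Content: abstract},
		},
	}
	return json.Marshal(req)
}

func main() {
	body, err := buildRequest("phi-4", "contents of criteria.md", "abstract of one paper")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// This body would be POSTed to the -api-endpoint URL.
	fmt.Println(string(body))
}
```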
Pipeline
- Fetch: Retrieves papers from arXiv based on specified date range and query
- Save: Stores raw paper data in JSON format
- Process: Evaluates papers using the specified LLM model according to criteria
- Format: Generates both JSON and Markdown outputs of the processed results
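For the Fetch step, the public arXiv API accepts a search_query parameter that can include a submittedDate range. The sketch below shows how the -start, -end, -query, and -maxResults values might map onto such a request. The exact expression built by arxiva is not stated in this README, so the `all:` field prefix, the 0000/2359 time bounds, and the function names are assumptions.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// searchExpr combines a free-text query with a submission date range.
// Dates are YYYYMMDD; the time-of-day suffixes are an assumption that
// makes the range cover both days in full.
func searchExpr(start, end, query string) string {
	return fmt.Sprintf("all:\"%s\" AND submittedDate:[%s0000 TO %s2359]", query, start, end)
}

// arxivQueryURL builds a query URL for the arXiv API endpoint.
func arxivQueryURL(start, end, query string, maxResults int) string {
	v := url.Values{}
	v.Set("search_query", searchExpr(start, end, query))
	v.Set("max_results", strconv.Itoa(maxResults))
	return "http://export.arxiv.org/api/query?" + v.Encode()
}

func main() {
	fmt.Println(arxivQueryURL("20240101", "20240131", "machine learning", 100))
}
```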
Output Files
The tool generates two types of output files:
- JSON Output: Contains the raw processing results
  - Default name format: YYYYMMDD-YYYYMMDD-query.json
  - Can be customized with the -json-output flag
- Markdown Output: Human-readable formatted results
  - Default name format: YYYYMMDD-YYYYMMDD-query.md
  - Can be customized with the -md-output flag
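The default name format can be sketched as a small helper. The README only gives the pattern YYYYMMDD-YYYYMMDD-query plus an extension; how the tool handles spaces or other special characters in the query is not documented, so the underscore replacement below is an assumption, as is the name `defaultOutputName`.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultOutputName builds "<start>-<end>-<query>.<ext>".
// Replacing spaces with underscores is an assumption made so the
// result is a convenient file name; the tool may do something else.
func defaultOutputName(start, end, query, ext string) string {
	safe := strings.ReplaceAll(strings.TrimSpace(query), " ", "_")
	return fmt.Sprintf("%s-%s-%s.%s", start, end, safe, ext)
}

func main() {
	fmt.Println(defaultOutputName("20240101", "20240131", "machine learning", "json"))
	fmt.Println(defaultOutputName("20240101", "20240131", "machine learning", "md"))
}
```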
Dependencies
- arxiva: Paper fetching from arXiv
- paperprocessor: LLM-based paper processing
- paperformatter: Output formatting
Error Handling
The tool includes various error checks:
- Date format validation (YYYYMMDD)
- Required flag validation
- Maximum results range validation (1-2000)
- File system operations verification
- API request error handling
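Two of the checks above, date format and maximum results range, are easy to sketch with the standard library. The function names are hypothetical and the tool's own error messages may differ; the YYYYMMDD layout and the 1-2000 range come from this README.

```go
package main

import (
	"fmt"
	"time"
)

// validateDate checks that s is a real calendar date in YYYYMMDD form.
// "20060102" is Go's reference layout for that format.
func validateDate(s string) error {
	if len(s) != 8 {
		return fmt.Errorf("date %q must be 8 digits (YYYYMMDD)", s)
	}
	if _, err := time.Parse("20060102", s); err != nil {
		return fmt.Errorf("date %q is not a valid YYYYMMDD date: %w", s, err)
	}
	return nil
}

// validateMaxResults enforces the documented 1-2000 range.
func validateMaxResults(n int) error {
	if n < 1 || n > 2000 {
		return fmt.Errorf("maxResults must be between 1 and 2000, got %d", n)
	}
	return nil
}

func main() {
	fmt.Println(validateDate("20240131"))   // valid date
	fmt.Println(validateDate("2024-01-31")) // wrong format
	fmt.Println(validateMaxResults(5000))   // out of range
}
```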
License
- MIT license