paper-system/arxiv-processor
Steve White 9396e2da3a Initial Commit; working system 2025-01-24 09:26:47 -06:00

README.md

ArXiv Processor

A Go package for fetching and processing papers from arXiv.

Installation

  1. Clone the repository:
git clone https://github.com/yourusername/arxiv-processor.git
cd arxiv-processor
  2. Download the dependencies (the repository already ships a go.mod, so go mod init is not needed and would fail):
go mod tidy

Usage

As a Library

To use the package in your Go application:

package main

import (
    "fmt"
    "log"
    "time"

    // The client lives in the arxiv/ subdirectory. Adjust the prefix so it
    // matches the module path declared in go.mod.
    "github.com/yourusername/arxiv-processor/arxiv"
)

func main() {
    // Create client
    client := arxiv.NewClient()

    // Define search parameters
    // The layout "20060102" is built from Go's reference time:
    // 2006 = year
    // 01 = month
    // 02 = day
    // Note: The arXiv API returns full timestamps including time of day,
    // but the search API only uses the date portion for filtering
    startDate, err := time.Parse("20060102", "20240101")
    if err != nil {
        log.Fatal(err)
    }
    endDate, err := time.Parse("20060102", "20240131")
    if err != nil {
        log.Fatal(err)
    }

    // Fetch papers
    // FetchPapers returns all papers at once, after the API request and
    // any necessary pagination have completed
    papers, err := client.FetchPapers("cat:cs.AI", startDate, endDate)
    if err != nil {
        log.Fatal(err)
    }

    // Use papers directly (in-memory)
    for _, paper := range papers {
        fmt.Printf("Title: %s\n", paper.Title)
        fmt.Printf("Abstract: %s\n", paper.Abstract)
    }

    // Optionally save papers to file
    if err := arxiv.SavePapers("papers.json", papers); err != nil {
        log.Fatal(err)
    }
}

Note: As shown above, the results are also written to a file. To work with them in memory only:

  1. Remove the SavePapers call
  2. Use the returned papers slice directly; it holds all paper data as Go structs
  3. If you need JSON, marshal the slice with json.Marshal(papers)

Command Line Interface

To use the CLI:

go run main.go --search "cat:cs.AI" --date-range "YYYYMMDD-YYYYMMDD"

Command Line Options

  • --search: Search query (e.g., "cat:cs.AI" for AI papers)
  • --date-range: Date range in YYYYMMDD-YYYYMMDD format
  • --output: Output file path (default: papers_data.json)
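
The --date-range format can be split into two dates as sketched below. This is an illustration only: parseDateRange is a hypothetical helper, and the CLI's actual parsing code is not shown in this README.

```go
package main

import (
	"fmt"
	"log"
	"strings"
	"time"
)

// dateLayout is the YYYYMMDD layout in Go's reference-time notation.
const dateLayout = "20060102"

// parseDateRange is a hypothetical helper that splits a
// "YYYYMMDD-YYYYMMDD" string into start and end dates.
func parseDateRange(s string) (start, end time.Time, err error) {
	from, to, ok := strings.Cut(s, "-")
	if !ok {
		return start, end, fmt.Errorf("date range %q: want YYYYMMDD-YYYYMMDD", s)
	}
	if start, err = time.Parse(dateLayout, from); err != nil {
		return start, end, fmt.Errorf("start date: %w", err)
	}
	if end, err = time.Parse(dateLayout, to); err != nil {
		return start, end, fmt.Errorf("end date: %w", err)
	}
	if end.Before(start) {
		return start, end, fmt.Errorf("end date %s is before start date %s", to, from)
	}
	return start, end, nil
}

func main() {
	start, end, err := parseDateRange("20250115-20250118")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(start.Format("2006-01-02"), "to", end.Format("2006-01-02"))
}
```

Rejecting a reversed range up front gives a clearer error than an empty result set from the API.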

Example: Fetch AI Papers

go run main.go --search "cat:cs.AI" --date-range "20250115-20250118"

Program Output

  • Fetched papers are saved to papers_data.json by default; pass --output to write to a different path
  • Example JSON structure:
[
  {
    "title": "Sample Paper Title",
    "abstract": "This is a sample abstract...",
    "arxiv_id": "2501.08565v1"
  }
]
  • The JSON file contains paper metadata including:
    • Title
    • Abstract
    • arXiv ID

Configuration

Environment Variables

  • ARXIV_MAX_RESULTS: Maximum number of results to fetch (default: 100)
  • ARXIV_START_INDEX: Start index for pagination (default: 0)

Package Structure

arxiv-processor/
├── arxiv/          # arXiv API client
├── storage/        # Data storage handlers
├── llm/            # LLM integration (TODO)
├── main.go         # Main entry point
└── README.md       # This file

Contributing

  1. Fork the repository
  2. Create a new branch
  3. Make your changes
  4. Submit a pull request

License

MIT License