Initial Commit; working system

Steve White 2025-01-24 09:26:47 -06:00
commit 9396e2da3a
35 changed files with 33086 additions and 0 deletions

36
.gitignore vendored Normal file

@@ -0,0 +1,36 @@
# OS generated files
.DS_Store
# Editor directories and files
.vscode/
# Go binaries and build artifacts
/bin/
/arxiv-processor/arxiv-processor
/json2md/json2md
/llm_processor/llm_processor
# Dependency directories
vendor/
# Log files
*.log
# Test files
*_test.go
# Temporary files
temp_*.json
*.md.gz
# Configuration files that might contain sensitive data
.env
# Generated output files
papers.json
newpapers.md.gz
arxiv-processor/newpapers.json
arxiv-processor/papers_data.json
llm_processor/*-filt.*
llm_processor/results.json
llm_processor/oldresults.json

File diff suppressed because one or more lines are too long

167
README.md Normal file

@@ -0,0 +1,167 @@
# Paper System
A tool to fetch, filter, and process arXiv papers using LLM-based criteria. The system consists of three main components:
1. **arxiv-processor**: Fetches papers from arXiv based on category and date range
2. **llm_processor**: Evaluates papers against the specified criteria using an LLM
3. **json2md**: Generates formatted markdown output of accepted/rejected papers
## Installation
1. Ensure you have Go installed (1.20 or later)
2. Clone this repository
3. Build the system:
```bash
go build -o paper-system
```
## Configuration
The system requires an OpenRouter API key for LLM processing. Set it as an environment variable:
```bash
export OPENROUTER_API_KEY=your-api-key
```
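For reference, here is a minimal Go sketch of how the binary might check for this key at startup. This is an illustration only, not the actual implementation:
```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Fail fast if the OpenRouter key is missing from the environment.
	key := os.Getenv("OPENROUTER_API_KEY")
	if key == "" {
		fmt.Fprintln(os.Stderr, "error: OPENROUTER_API_KEY is not set")
		os.Exit(1)
	}
	fmt.Println("API key loaded")
}
```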
## Usage
The system can operate in two modes:
### 1. ArXiv Fetch Mode
Fetches papers from arXiv, processes them with an LLM, and generates markdown output:
```bash
./paper-system \
-start 20240101 \
-end 20240131 \
-search cs.AI \
-criteria criteria.txt \
-output papers.md
```
Required flags:
- `-start`: Start date in YYYYMMDD format
- `-end`: End date in YYYYMMDD format
- `-search`: arXiv category/search query (e.g., 'cs.AI', 'physics.comp-ph')
- `-criteria`: Path to filter criteria file
Optional flags:
- `-output`: Output markdown file path (default: papers.md)
- `-model`: LLM model to use (default: nvidia/llama-3.1-nemotron-70b-instruct)
- `-max-results`: Maximum number of papers to retrieve (default: 100, max: 2000)
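The flags listed above might be declared with Go's standard `flag` package roughly as follows. Names and defaults are taken from this README; the real binary may wire them differently:
```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Flag names and defaults mirror the usage documented above.
	start := flag.String("start", "", "start date in YYYYMMDD format")
	end := flag.String("end", "", "end date in YYYYMMDD format")
	search := flag.String("search", "", "arXiv category/search query, e.g. cs.AI")
	criteria := flag.String("criteria", "", "path to filter criteria file")
	output := flag.String("output", "papers.md", "output markdown file path")
	model := flag.String("model", "nvidia/llama-3.1-nemotron-70b-instruct", "LLM model to use")
	maxResults := flag.Int("max-results", 100, "maximum number of papers to retrieve (1-2000)")
	flag.Parse()

	// Basic validation mirroring the safeguards described later in this README.
	if *start == "" || *end == "" || *search == "" || *criteria == "" {
		fmt.Println("missing required flags; see README usage")
		return
	}
	if *maxResults < 1 || *maxResults > 2000 {
		fmt.Println("max-results must be between 1 and 2000")
		return
	}
	fmt.Println("fetching", *search, "from", *start, "to", *end, "into", *output, "using", *model)
}
```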
### 2. Input JSON Mode
Process an existing JSON file of papers (useful for running different criteria against the same dataset):
```bash
./paper-system \
-input-json papers.json \
-criteria new-criteria.txt \
-output results.md
```
Required flags:
- `-input-json`: Path to input JSON file
- `-criteria`: Path to filter criteria file
Optional flags:
- `-output`: Output markdown file path (default: papers.md)
- `-model`: LLM model to use (default: nvidia/llama-3.1-nemotron-70b-instruct)
## Input/Output Files
### Criteria File Format
Create a text file with your evaluation criteria. Example:
```
Please evaluate this paper based on the following criteria:
1. Practical Applications: Does the paper demonstrate clear real-world applications?
2. Experimental Results: Are there quantitative metrics and thorough evaluations?
3. Technical Innovation: Does the paper present novel techniques or improvements?
Respond with a JSON object containing:
{
"decision": "ACCEPT" or "REJECT",
"explanation": "Detailed reasoning for the decision"
}
```
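A minimal Go sketch of parsing the expected LLM response is shown below. Field names follow the JSON object above; the actual types used by llm_processor may differ:
```go
package main

import (
	"encoding/json"
	"fmt"
)

// Decision mirrors the JSON object the criteria file asks the LLM to return.
type Decision struct {
	Decision    string `json:"decision"`    // "ACCEPT" or "REJECT"
	Explanation string `json:"explanation"` // reasoning for the decision
}

func main() {
	raw := `{"decision": "ACCEPT", "explanation": "Clear real-world applications with quantitative results."}`

	var d Decision
	if err := json.Unmarshal([]byte(raw), &d); err != nil {
		fmt.Println("failed to parse LLM response:", err)
		return
	}
	fmt.Printf("%s: %s\n", d.Decision, d.Explanation)
}
```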
### Output Format
The system generates two types of output:
1. **papers.json**: Raw paper data in JSON format (when fetching from arXiv)
```json
[
{
"title": "Paper Title",
"abstract": "Paper abstract...",
"arxiv_id": "2401.12345",
"authors": ["Author 1", "Author 2"]
}
]
```
2. **papers.md**: Formatted markdown with accepted/rejected papers
```markdown
# Accepted Papers
## [Paper Title](https://arxiv.org/abs/2401.12345)
**arXiv ID:** 2401.12345
**Abstract:**
> Paper abstract...
**Decision Explanation:** Meets criteria for practical applications...
---
# Rejected Papers
...
```
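To tie the two formats together, here is a minimal Go sketch that loads records shaped like papers.json and renders the per-paper markdown block shown above. The struct and function names are illustrative, not the actual json2md code:
```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Paper mirrors one entry in papers.json.
type Paper struct {
	Title    string   `json:"title"`
	Abstract string   `json:"abstract"`
	ArxivID  string   `json:"arxiv_id"`
	Authors  []string `json:"authors"`
}

// toMarkdown renders a single entry in the papers.md style shown above.
func toMarkdown(p Paper, explanation string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "## [%s](https://arxiv.org/abs/%s)\n", p.Title, p.ArxivID)
	fmt.Fprintf(&b, "**arXiv ID:** %s\n", p.ArxivID)
	fmt.Fprintf(&b, "**Abstract:**\n> %s\n", p.Abstract)
	fmt.Fprintf(&b, "**Decision Explanation:** %s\n---\n", explanation)
	return b.String()
}

func main() {
	data := `[{"title": "Paper Title", "abstract": "Paper abstract...", "arxiv_id": "2401.12345", "authors": ["Author 1", "Author 2"]}]`

	var papers []Paper
	if err := json.Unmarshal([]byte(data), &papers); err != nil {
		fmt.Println("failed to parse papers.json:", err)
		return
	}
	fmt.Print(toMarkdown(papers[0], "Meets criteria for practical applications..."))
}
```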
## Workflow Examples
### Basic Usage
```bash
# Fetch and evaluate papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt
# This creates:
# - papers.json (raw paper data)
# - papers.md (evaluation results)
```
### Multiple Evaluations
```bash
# 1. First fetch papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria1.txt -output results1.md
# 2. Run different criteria on the same papers
./paper-system -input-json papers.json -criteria criteria2.txt -output results2.md
```
### Fetching More Papers
```bash
# Fetch up to 2000 papers
./paper-system -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt -max-results 2000
```
## Error Handling
The system includes several safeguards:
- Validates all required parameters
- Ensures max-results is between 1 and 2000
- Prevents mixing of arXiv and input JSON modes
- Retries LLM processing on failure
- Maintains temporary files for debugging
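As an illustration of the retry safeguard above, a minimal Go sketch of a generic retry wrapper follows. The helper name, attempt count, and backoff are assumptions, not the llm_processor's actual code:
```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// withRetry calls fn up to attempts times, sleeping between failures.
func withRetry(attempts int, delay time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		fmt.Printf("attempt %d failed: %v\n", i+1, err)
		time.Sleep(delay)
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, err)
}

func main() {
	calls := 0
	err := withRetry(3, 100*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("transient LLM error")
		}
		return nil
	})
	fmt.Println("result:", err)
}
```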
## Notes
- The system preserves papers.json when fetching from arXiv, allowing for future reuse
- Temporary files (temp_input.json, temp_output.json) are automatically cleaned up
- The LLM processor uses a batch size of 32 papers for efficient processing
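A minimal sketch of splitting papers into groups of 32 before sending them to the LLM is shown below. The batch size comes from the note above; the slicing helper itself is illustrative:
```go
package main

import "fmt"

// batches splits items into consecutive chunks of at most size elements.
func batches(items []string, size int) [][]string {
	var out [][]string
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

func main() {
	// 100 placeholder paper IDs, batched 32 at a time.
	ids := make([]string, 100)
	for i := range ids {
		ids[i] = fmt.Sprintf("paper-%03d", i)
	}
	for i, b := range batches(ids, 32) {
		fmt.Printf("batch %d: %d papers\n", i+1, len(b))
	}
}
```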


@@ -0,0 +1,277 @@
[
{
"title": "Offline Critic-Guided Diffusion Policy for Multi-User Delay-Constrained\n Scheduling",
"abstract": "Effective multi-user delay-constrained scheduling is crucial in various\nreal-world applications, such as instant messaging, live streaming, and data\ncenter management. In these scenarios, schedulers must make real-time decisions\nto satisfy both delay and resource constraints without prior knowledge of\nsystem dynamics, which are often time-varying and challenging to estimate.\nCurrent learning-based methods typically require interactions with actual\nsystems during the training stage, which can be difficult or impractical, as it\nis capable of significantly degrading system performance and incurring\nsubstantial service costs. To address these challenges, we propose a novel\noffline reinforcement learning-based algorithm, named \\underline{S}cheduling By\n\\underline{O}ffline Learning with \\underline{C}ritic Guidance and\n\\underline{D}iffusion Generation (SOCD), to learn efficient scheduling policies\npurely from pre-collected \\emph{offline data}. SOCD innovatively employs a\ndiffusion-based policy network, complemented by a sampling-free critic network\nfor policy guidance. By integrating the Lagrangian multiplier optimization into\nthe offline reinforcement learning, SOCD effectively trains high-quality\nconstraint-aware policies exclusively from available datasets, eliminating the\nneed for online interactions with the system. Experimental results demonstrate\nthat SOCD is resilient to various system dynamics, including partially\nobservable and large-scale environments, and delivers superior performance\ncompared to existing methods.",
"arxiv_id": "2501.12942v1"
},
{
"title": "Evolution and The Knightian Blindspot of Machine Learning",
"abstract": "This paper claims that machine learning (ML) largely overlooks an important\nfacet of general intelligence: robustness to a qualitatively unknown future in\nan open world. Such robustness relates to Knightian uncertainty (KU) in\neconomics, i.e. uncertainty that cannot be quantified, which is excluded from\nconsideration in ML's key formalisms. This paper aims to identify this blind\nspot, argue its importance, and catalyze research into addressing it, which we\nbelieve is necessary to create truly robust open-world AI. To help illuminate\nthe blind spot, we contrast one area of ML, reinforcement learning (RL), with\nthe process of biological evolution. Despite staggering ongoing progress, RL\nstill struggles in open-world situations, often failing under unforeseen\nsituations. For example, the idea of zero-shot transferring a self-driving car\npolicy trained only in the US to the UK currently seems exceedingly ambitious.\nIn dramatic contrast, biological evolution routinely produces agents that\nthrive within an open world, sometimes even to situations that are remarkably\nout-of-distribution (e.g. invasive species; or humans, who do undertake such\nzero-shot international driving). Interestingly, evolution achieves such\nrobustness without explicit theory, formalisms, or mathematical gradients. We\nexplore the assumptions underlying RL's typical formalisms, showing how they\nlimit RL's engagement with the unknown unknowns characteristic of an\never-changing complex world. Further, we identify mechanisms through which\nevolutionary processes foster robustness to novel and unpredictable challenges,\nand discuss potential pathways to algorithmically embody them. The conclusion\nis that the intriguing remaining fragility of ML may result from blind spots in\nits formalisms, and that significant gains may result from direct confrontation\nwith the challenge of KU.",
"arxiv_id": "2501.13075v1"
},
{
"title": "Boosting MCTS with Free Energy Minimization",
"abstract": "Active Inference, grounded in the Free Energy Principle, provides a powerful\nlens for understanding how agents balance exploration and goal-directed\nbehavior in uncertain environments. Here, we propose a new planning framework,\nthat integrates Monte Carlo Tree Search (MCTS) with active inference objectives\nto systematically reduce epistemic uncertainty while pursuing extrinsic\nrewards. Our key insight is that MCTS already renowned for its search\nefficiency can be naturally extended to incorporate free energy minimization by\nblending expected rewards with information gain. Concretely, the Cross-Entropy\nMethod (CEM) is used to optimize action proposals at the root node, while tree\nexpansions leverage reward modeling alongside intrinsic exploration bonuses.\nThis synergy allows our planner to maintain coherent estimates of value and\nuncertainty throughout planning, without sacrificing computational\ntractability. Empirically, we benchmark our planner on a diverse set of\ncontinuous control tasks, where it demonstrates performance gains over both\nstandalone CEM and MCTS with random rollouts.",
"arxiv_id": "2501.13083v1"
},
{
"title": "A Unified Invariant Learning Framework for Graph Classification",
"abstract": "Invariant learning demonstrates substantial potential for enhancing the\ngeneralization of graph neural networks (GNNs) with out-of-distribution (OOD)\ndata. It aims to recognize stable features in graph data for classification,\nbased on the premise that these features causally determine the target label,\nand their influence is invariant to changes in distribution. Along this line,\nmost studies have attempted to pinpoint these stable features by emphasizing\nexplicit substructures in the graph, such as masked or attentive subgraphs, and\nprimarily enforcing the invariance principle in the semantic space, i.e., graph\nrepresentations. However, we argue that focusing only on the semantic space may\nnot accurately identify these stable features. To address this, we introduce\nthe Unified Invariant Learning (UIL) framework for graph classification. It\nprovides a unified perspective on invariant graph learning, emphasizing both\nstructural and semantic invariance principles to identify more robust stable\nfeatures. In the graph space, UIL adheres to the structural invariance\nprinciple by reducing the distance between graphons over a set of stable\nfeatures across different environments. Simultaneously, to confirm semantic\ninvariance, UIL underscores that the acquired graph representations should\ndemonstrate exemplary performance across diverse environments. We present both\ntheoretical and empirical evidence to confirm our method's ability to recognize\nsuperior stable features. Moreover, through a series of comprehensive\nexperiments complemented by in-depth analyses, we demonstrate that UIL\nconsiderably enhances OOD generalization, surpassing the performance of leading\nbaseline methods. Our codes are available at https://github.com/yongduosui/UIL.",
"arxiv_id": "2501.12595v1"
},
{
"title": "Kimi k1.5: Scaling Reinforcement Learning with LLMs",
"abstract": "Language model pretraining with next token prediction has proved effective\nfor scaling compute but is limited to the amount of available training data.\nScaling reinforcement learning (RL) unlocks a new axis for the continued\nimprovement of artificial intelligence, with the promise that large language\nmodels (LLMs) can scale their training data by learning to explore with\nrewards. However, prior published work has not produced competitive results. In\nlight of this, we report on the training practice of Kimi k1.5, our latest\nmulti-modal LLM trained with RL, including its RL training techniques,\nmulti-modal data recipes, and infrastructure optimization. Long context scaling\nand improved policy optimization methods are key ingredients of our approach,\nwhich establishes a simplistic, effective RL framework without relying on more\ncomplex techniques such as Monte Carlo tree search, value functions, and\nprocess reward models. Notably, our system achieves state-of-the-art reasoning\nperformance across multiple benchmarks and modalities -- e.g., 77.5 on AIME,\n96.2 on MATH 500, 94-th percentile on Codeforces, 74.9 on MathVista -- matching\nOpenAI's o1. Moreover, we present effective long2short methods that use\nlong-CoT techniques to improve short-CoT models, yielding state-of-the-art\nshort-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH500, 47.3 on\nLiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and\nClaude Sonnet 3.5 by a large margin (up to +550%).",
"arxiv_id": "2501.12599v1"
},
{
"title": "GATE: Adaptive Learning with Working Memory by Information Gating in\n Multi-lamellar Hippocampal Formation",
"abstract": "Hippocampal formation (HF) can rapidly adapt to varied environments and build\nflexible working memory (WM). To mirror the HF's mechanism on generalization\nand WM, we propose a model named Generalization and Associative Temporary\nEncoding (GATE), which deploys a 3-D multi-lamellar dorsoventral (DV)\narchitecture, and learns to build up internally representation from externally\ndriven information layer-wisely. In each lamella, regions of HF:\nEC3-CA1-EC5-EC3 forms a re-entrant loop that discriminately maintains\ninformation by EC3 persistent activity, and selectively readouts the retained\ninformation by CA1 neurons. CA3 and EC5 further provides gating function that\ncontrols these processes. After learning complex WM tasks, GATE forms neuron\nrepresentations that align with experimental records, including splitter, lap,\nevidence, trace, delay-active cells, as well as conventional place cells.\nCrucially, DV architecture in GATE also captures information, range from\ndetailed to abstract, which enables a rapid generalization ability when cue,\nenvironment or task changes, with learned representations inherited. GATE\npromises a viable framework for understanding the HF's flexible memory\nmechanisms and for progressively developing brain-inspired intelligent systems.",
"arxiv_id": "2501.12615v1"
},
{
"title": "Deep Learning-Based Identification of Inconsistent Method Names: How Far\n Are We?",
"abstract": "Concise and meaningful method names are crucial for program comprehension and\nmaintenance. However, method names may become inconsistent with their\ncorresponding implementations, causing confusion and errors. Several deep\nlearning (DL)-based approaches have been proposed to identify such\ninconsistencies, with initial evaluations showing promising results. However,\nthese evaluations typically use a balanced dataset, where the number of\ninconsistent and consistent names are equal. This setup, along with flawed\ndataset construction, leads to false positives, making reported performance\nless reliable in real-world scenarios, where most method names are consistent.\nIn this paper, we present an empirical study that evaluates state-of-the-art\nDL-based methods for identifying inconsistent method names. We create a new\nbenchmark by combining automatic identification from commit histories and\nmanual developer inspections, reducing false positives. We evaluate five\nrepresentative DL approaches (one retrieval-based and four generation-based) on\nthis benchmark. Our results show that performance drops substantially when\nmoving from the balanced dataset to the new benchmark. We further conduct\nquantitative and qualitative analyses to understand the strengths and\nweaknesses of the approaches. Retrieval-based methods perform well on simple\nmethods and those with popular name sub-tokens but fail due to inefficient\nrepresentation techniques. Generation-based methods struggle with inaccurate\nsimilarity calculations and immature name generation. Based on these findings,\nwe propose improvements using contrastive learning and large language models\n(LLMs). Our study suggests that significant improvements are needed before\nthese DL approaches can be effectively applied to real-world software systems.",
"arxiv_id": "2501.12617v1"
},
{
"title": "Adaptive Data Exploitation in Deep Reinforcement Learning",
"abstract": "We introduce ADEPT: Adaptive Data ExPloiTation, a simple yet powerful\nframework to enhance the **data efficiency** and **generalization** in deep\nreinforcement learning (RL). Specifically, ADEPT adaptively manages the use of\nsampled data across different learning stages via multi-armed bandit (MAB)\nalgorithms, optimizing data utilization while mitigating overfitting. Moreover,\nADEPT can significantly reduce the computational overhead and accelerate a wide\nrange of RL algorithms. We test ADEPT on benchmarks including Procgen,\nMiniGrid, and PyBullet. Extensive simulation demonstrates that ADEPT can\nachieve superior performance with remarkable computational efficiency, offering\na practical solution to data-efficient RL. Our code is available at\nhttps://github.com/yuanmingqi/ADEPT.",
"arxiv_id": "2501.12620v1"
},
{
"title": "Towards Robust Multi-tab Website Fingerprinting",
"abstract": "Website fingerprinting enables an eavesdropper to determine which websites a\nuser is visiting over an encrypted connection. State-of-the-art website\nfingerprinting (WF) attacks have demonstrated effectiveness even against\nTor-protected network traffic. However, existing WF attacks have critical\nlimitations on accurately identifying websites in multi-tab browsing sessions,\nwhere the holistic pattern of individual websites is no longer preserved, and\nthe number of tabs opened by a client is unknown a priori. In this paper, we\npropose ARES, a novel WF framework natively designed for multi-tab WF attacks.\nARES formulates the multi-tab attack as a multi-label classification problem\nand solves it using the novel Transformer-based models. Specifically, ARES\nextracts local patterns based on multi-level traffic aggregation features and\nutilizes the improved self-attention mechanism to analyze the correlations\nbetween these local patterns, effectively identifying websites. We implement a\nprototype of ARES and extensively evaluate its effectiveness using our\nlarge-scale datasets collected over multiple months. The experimental results\nillustrate that ARES achieves optimal performance in several realistic\nscenarios. Further, ARES remains robust even against various WF defenses.",
"arxiv_id": "2501.12622v1"
},
{
"title": "Inverse Reinforcement Learning with Switching Rewards and History\n Dependency for Characterizing Animal Behaviors",
"abstract": "Traditional approaches to studying decision-making in neuroscience focus on\nsimplified behavioral tasks where animals perform repetitive, stereotyped\nactions to receive explicit rewards. While informative, these methods constrain\nour understanding of decision-making to short timescale behaviors driven by\nexplicit goals. In natural environments, animals exhibit more complex,\nlong-term behaviors driven by intrinsic motivations that are often\nunobservable. Recent works in time-varying inverse reinforcement learning (IRL)\naim to capture shifting motivations in long-term, freely moving behaviors.\nHowever, a crucial challenge remains: animals make decisions based on their\nhistory, not just their current state. To address this, we introduce SWIRL\n(SWitching IRL), a novel framework that extends traditional IRL by\nincorporating time-varying, history-dependent reward functions. SWIRL models\nlong behavioral sequences as transitions between short-term decision-making\nprocesses, each governed by a unique reward function. SWIRL incorporates\nbiologically plausible history dependency to capture how past decisions and\nenvironmental contexts shape behavior, offering a more accurate description of\nanimal decision-making. We apply SWIRL to simulated and real-world animal\nbehavior datasets and show that it outperforms models lacking history\ndependency, both quantitatively and qualitatively. This work presents the first\nIRL model to incorporate history-dependent policies and rewards to advance our\nunderstanding of complex, naturalistic decision-making in animals.",
"arxiv_id": "2501.12633v1"
},
{
"title": "Dynamics of Toxicity in Political Podcasts",
"abstract": "Toxicity in digital media poses significant challenges, yet little attention\nhas been given to its dynamics within the rapidly growing medium of podcasts.\nThis paper addresses this gap by analyzing political podcast data to study the\nemergence and propagation of toxicity, focusing on conversation\nchains-structured reply patterns within podcast transcripts. Leveraging\nstate-of-the-art transcription models and advanced conversational analysis\ntechniques, we systematically examine toxic discourse in over 30 popular\npolitical podcasts in the United States. Our key contributions include: (1)\ncreating a comprehensive dataset of transcribed and diarized political\npodcasts, identifying thousands of toxic instances using Google's Perspective\nAPI, (2) uncovering concerning trends where a majority of episodes contain at\nleast one toxic instance, (3) introducing toxic conversation chains and\nanalyzing their structural and linguistic properties, revealing characteristics\nsuch as longer durations, repetitive patterns, figurative language, and\nemotional cues tied to anger and annoyance, (4) identifying demand-related\nwords like 'want', 'like', and 'know' as precursors to toxicity, and (5)\ndeveloping predictive models to anticipate toxicity shifts based on annotated\nchange points. Our findings provide critical insights into podcast toxicity and\nestablish a foundation for future research on real-time monitoring and\nintervention mechanisms to foster healthier discourse in this influential\nmedium.",
"arxiv_id": "2501.12640v1"
},
{
"title": "The potential -- and the pitfalls -- of using pre-trained language\n models as cognitive science theories",
"abstract": "Many studies have evaluated the cognitive alignment of Pre-trained Language\nModels (PLMs), i.e., their correspondence to adult performance across a range\nof cognitive domains. Recently, the focus has expanded to the developmental\nalignment of these models: identifying phases during training where\nimprovements in model performance track improvements in children's thinking\nover development. However, there are many challenges to the use of PLMs as\ncognitive science theories, including different architectures, different\ntraining data modalities and scales, and limited model interpretability. In\nthis paper, we distill lessons learned from treating PLMs, not as engineering\nartifacts but as cognitive science and developmental science models. We review\nassumptions used by researchers to map measures of PLM performance to measures\nof human performance. We identify potential pitfalls of this approach to\nunderstanding human thinking, and we end by enumerating criteria for using PLMs\nas credible accounts of cognition and cognitive development.",
"arxiv_id": "2501.12651v1"
},
{
"title": "NBDI: A Simple and Efficient Termination Condition for Skill Extraction\n from Task-Agnostic Demonstrations",
"abstract": "Intelligent agents are able to make decisions based on different levels of\ngranularity and duration. Recent advances in skill learning enabled the agent\nto solve complex, long-horizon tasks by effectively guiding the agent in\nchoosing appropriate skills. However, the practice of using fixed-length skills\ncan easily result in skipping valuable decision points, which ultimately limits\nthe potential for further exploration and faster policy learning. In this work,\nwe propose to learn a simple and efficient termination condition that\nidentifies decision points through a state-action novelty module that leverages\nagent experience data. Our approach, Novelty-based Decision Point\nIdentification (NBDI), outperforms previous baselines in complex, long-horizon\ntasks, and remains effective even in the presence of significant variations in\nthe environment configurations of downstream tasks, highlighting the importance\nof decision point identification in skill learning.",
"arxiv_id": "2501.12668v1"
},
{
"title": "Growth strategies for arbitrary DAG neural architectures",
"abstract": "Deep learning has shown impressive results obtained at the cost of training\nhuge neural networks. However, the larger the architecture, the higher the\ncomputational, financial, and environmental costs during training and\ninference. We aim at reducing both training and inference durations. We focus\non Neural Architecture Growth, which can increase the size of a small model\nwhen needed, directly during training using information from the\nbackpropagation. We expand existing work and freely grow neural networks in the\nform of any Directed Acyclic Graph by reducing expressivity bottlenecks in the\narchitecture. We explore strategies to reduce excessive computations and steer\nnetwork growth toward more parameter-efficient architectures.",
"arxiv_id": "2501.12690v1"
},
{
"title": "EvidenceMap: Unleashing the Power of Small Language Models with Evidence\n Analysis for Biomedical Question Answering",
"abstract": "Current LLM-based approaches improve question answering performance by\nleveraging the internal reasoning abilities of models or incorporating external\nknowledge. However, when humans address professional problems, it is essential\nto explicitly analyze the multifaceted relationships from multiple pieces and\ndiverse sources of evidence to achieve better answers. In this study, we\npropose a novel generative question answering framework for the biomedical\ndomain, named EvidenceMap, which explicitly learns and incorporates evidence\nanalysis with small language models (SLMs). The framework describes an evidence\nmap for each question and fully utilizes an SLM to derive the representation of\nthe supportive evaluation, the logical correlation, and the summarization of\nthe related evidence, which facilitates an analysis-augmented generation with\nanother SLM in an autoregressive way. Extensive experiments have shown that\nintroducing an evidence analysis learning process can significantly outperform\nlarger models and popular LLM reasoning methods.",
"arxiv_id": "2501.12746v1"
},
{
"title": "NExtLong: Toward Effective Long-Context Training without Long Documents",
"abstract": "Large language models (LLMs) with extended context windows have made\nsignificant strides yet remain a challenge due to the scarcity of long\ndocuments. Existing methods tend to synthesize long-context data but lack a\nclear mechanism to reinforce the long-range dependency modeling. To address\nthis limitation, we propose NExtLong, a novel framework for synthesizing\nlong-context data through Negative document Extension. NExtLong decomposes a\ndocument into multiple meta-chunks and extends the context by interleaving hard\nnegative distractors retrieved from pretraining corpora. This approach compels\nthe model to discriminate long-range dependent context from distracting\ncontent, enhancing its ability to model long-range dependencies. Extensive\nexperiments demonstrate that NExtLong achieves significant performance\nimprovements on the HELMET and RULER benchmarks compared to existing\nlong-context synthesis approaches and leading models, which are trained on\nnon-synthetic long documents. These findings highlight NExtLong's ability to\nreduce reliance on non-synthetic long documents, making it an effective\nframework for developing advanced long-context LLMs.",
"arxiv_id": "2501.12766v1"
},
{
"title": "Revisit Self-Debugging with Self-Generated Tests for Code Generation",
"abstract": "Large language models (LLMs) have shown significant advancements in code\ngeneration, but still face challenges on tasks beyond their basic capabilities.\nRecently, the notion of self-debugging has been proposed to boost the\nperformance of code generation by leveraging execution feedback from tests.\nDespite its promise, the availability of high-quality tests in real-world\nscenarios is limited. In this context, self-debugging with self-generated tests\nis a promising solution but lacks a full exploration of its limitations and\npractical potential. Therefore, we investigate its efficacy on diverse\nprogramming problems. To deepen our understanding, we propose two distinct\nparadigms for the process: post-execution and in-execution self-debugging.\nWithin the scope of self-contained Python programming tasks, we find that\npost-execution self-debugging struggles on basic problems but shows potential\nfor improvement on competitive ones, due to the bias introduced by\nself-generated tests. On the other hand, in-execution self-debugging enables\nLLMs to mitigate the bias by solely leveraging intermediate states during\nexecution, thereby enhancing code generation.",
"arxiv_id": "2501.12793v1"
},
{
"title": "Unveiling Zero-Space Detection: A Novel Framework for Autonomous\n Ransomware Identification in High-Velocity Environments",
"abstract": "Modern cybersecurity landscapes increasingly demand sophisticated detection\nframeworks capable of identifying evolving threats with precision and\nadaptability. The proposed Zero-Space Detection framework introduces a novel\napproach that dynamically identifies latent behavioral patterns through\nunsupervised clustering and advanced deep learning techniques. Designed to\naddress the limitations of signature-based and heuristic methods, it operates\neffectively in high-velocity environments by integrating multi-phase filtering\nand ensemble learning for refined decision-making. Experimental evaluation\nreveals high detection rates across diverse ransomware families, including\nLockBit, Conti, REvil, and BlackMatter, while maintaining low false positive\nrates and scalable performance. Computational overhead remains minimal, with\naverage processing times ensuring compatibility with real-time systems even\nunder peak operational loads. The framework demonstrates resilience against\nadversarial strategies such as obfuscation and encryption speed variability,\nwhich frequently challenge conventional detection systems. Analysis across\nmultiple data sources highlights its versatility in handling diverse file types\nand operational contexts. Comprehensive metrics, including detection\nprobability, latency, and resource efficiency, validate its efficacy under\nreal-world conditions. Through its modular architecture, the framework achieves\nseamless integration with existing cybersecurity infrastructures without\nsignificant reconfiguration. The results demonstrate its robustness and\nscalability, offering a transformative paradigm for ransomware identification\nin dynamic and resource-constrained environments.",
"arxiv_id": "2501.12811v1"
},
{
"title": "To Measure or Not: A Cost-Sensitive, Selective Measuring Environment for\n Agricultural Management Decisions with Reinforcement Learning",
"abstract": "Farmers rely on in-field observations to make well-informed crop management\ndecisions to maximize profit and minimize adverse environmental impact.\nHowever, obtaining real-world crop state measurements is labor-intensive,\ntime-consuming and expensive. In most cases, it is not feasible to gather crop\nstate measurements before every decision moment. Moreover, in previous research\npertaining to farm management optimization, these observations are often\nassumed to be readily available without any cost, which is unrealistic. Hence,\nenabling optimization without the need to have temporally complete crop state\nobservations is important. An approach to that problem is to include measuring\nas part of decision making. As a solution, we apply reinforcement learning (RL)\nto recommend opportune moments to simultaneously measure crop features and\napply nitrogen fertilizer. With realistic considerations, we design an RL\nenvironment with explicit crop feature measuring costs. While balancing costs,\nwe find that an RL agent, trained with recurrent PPO, discovers adaptive\nmeasuring policies that follow critical crop development stages, with results\naligned by what domain experts would consider a sensible approach. Our results\nhighlight the importance of measuring when crop feature measurements are not\nreadily available.",
"arxiv_id": "2501.12823v1"
},
{
"title": "GAMED-Snake: Gradient-aware Adaptive Momentum Evolution Deep Snake Model\n for Multi-organ Segmentation",
"abstract": "Multi-organ segmentation is a critical yet challenging task due to complex\nanatomical backgrounds, blurred boundaries, and diverse morphologies. This\nstudy introduces the Gradient-aware Adaptive Momentum Evolution Deep Snake\n(GAMED-Snake) model, which establishes a novel paradigm for contour-based\nsegmentation by integrating gradient-based learning with adaptive momentum\nevolution mechanisms. The GAMED-Snake model incorporates three major\ninnovations: First, the Distance Energy Map Prior (DEMP) generates a\npixel-level force field that effectively attracts contour points towards the\ntrue boundaries, even in scenarios with complex backgrounds and blurred edges.\nSecond, the Differential Convolution Inception Module (DCIM) precisely extracts\ncomprehensive energy gradients, significantly enhancing segmentation accuracy.\nThird, the Adaptive Momentum Evolution Mechanism (AMEM) employs cross-attention\nto establish dynamic features across different iterations of evolution,\nenabling precise boundary alignment for diverse morphologies. Experimental\nresults on four challenging multi-organ segmentation datasets demonstrate that\nGAMED-Snake improves the mDice metric by approximately 2% compared to\nstate-of-the-art methods. Code will be available at\nhttps://github.com/SYSUzrc/GAMED-Snake.",
"arxiv_id": "2501.12844v1"
},
{
"title": "As Confidence Aligns: Exploring the Effect of AI Confidence on Human\n Self-confidence in Human-AI Decision Making",
"abstract": "Complementary collaboration between humans and AI is essential for human-AI\ndecision making. One feasible approach to achieving it involves accounting for\nthe calibrated confidence levels of both AI and users. However, this process\nwould likely be made more difficult by the fact that AI confidence may\ninfluence users' self-confidence and its calibration. To explore these\ndynamics, we conducted a randomized behavioral experiment. Our results indicate\nthat in human-AI decision-making, users' self-confidence aligns with AI\nconfidence and such alignment can persist even after AI ceases to be involved.\nThis alignment then affects users' self-confidence calibration. We also found\nthe presence of real-time correctness feedback of decisions reduced the degree\nof alignment. These findings suggest that users' self-confidence is not\nindependent of AI confidence, which practitioners aiming to achieve better\nhuman-AI collaboration need to be aware of. We call for research focusing on\nthe alignment of human cognition and behavior with AI.",
"arxiv_id": "2501.12868v1"
},
{
"title": "Drone Carrier: An Integrated Unmanned Surface Vehicle for Autonomous\n Inspection and Intervention in GNSS-Denied Maritime Environment",
"abstract": "This paper introduces an innovative drone carrier concept that is applied in\nmaritime port security or offshore rescue. This system works with a\nheterogeneous system consisting of multiple Unmanned Aerial Vehicles (UAVs) and\nUnmanned Surface Vehicles (USVs) to perform inspection and intervention tasks\nin GNSS-denied or interrupted environments. The carrier, an electric catamaran\nmeasuring 4m by 7m, features a 4m by 6m deck supporting automated takeoff and\nlanding for four DJI M300 drones, along with a 10kg-payload manipulator\noperable in up to level 3 sea conditions. Utilizing an offshore gimbal camera\nfor navigation, the carrier can autonomously navigate, approach and dock with\nnon-cooperative vessels, guided by an onboard camera, LiDAR, and Doppler\nVelocity Log (DVL) over a 3 km$^2$ area. UAVs equipped with onboard\nUltra-Wideband (UWB) technology execute mapping, detection, and manipulation\ntasks using a versatile gripper designed for wet, saline conditions.\nAdditionally, two UAVs can coordinate to transport large objects to the\nmanipulator or interact directly with them. These procedures are fully\nautomated and were successfully demonstrated at the Mohammed Bin Zayed\nInternational Robotic Competition (MBZIRC2024), where the drone carrier\nequipped with four UAVS and one manipulator, automatically accomplished the\nintervention tasks in sea-level-3 (wave height 1.25m) based on the rough target\ninformation.",
"arxiv_id": "2501.12869v1"
},
{
"title": "Reinforcement learning Based Automated Design of Differential Evolution\n Algorithm for Black-box Optimization",
"abstract": "Differential evolution (DE) algorithm is recognized as one of the most\neffective evolutionary algorithms, demonstrating remarkable efficacy in\nblack-box optimization due to its derivative-free nature. Numerous enhancements\nto the fundamental DE have been proposed, incorporating innovative mutation\nstrategies and sophisticated parameter tuning techniques to improve\nperformance. However, no single variant has proven universally superior across\nall problems. To address this challenge, we introduce a novel framework that\nemploys reinforcement learning (RL) to automatically design DE for black-box\noptimization through meta-learning. RL acts as an advanced meta-optimizer,\ngenerating a customized DE configuration that includes an optimal\ninitialization strategy, update rule, and hyperparameters tailored to a\nspecific black-box optimization problem. This process is informed by a detailed\nanalysis of the problem characteristics. In this proof-of-concept study, we\nutilize a double deep Q-network for implementation, considering a subset of 40\npossible strategy combinations and parameter optimizations simultaneously. The\nframework's performance is evaluated against black-box optimization benchmarks\nand compared with state-of-the-art algorithms. The experimental results\nhighlight the promising potential of our proposed framework.",
"arxiv_id": "2501.12881v1"
},
{
"title": "Learning Graph Node Embeddings by Smooth Pair Sampling",
"abstract": "Random walk-based node embedding algorithms have attracted a lot of attention\ndue to their scalability and ease of implementation. Previous research has\nfocused on different walk strategies, optimization objectives, and embedding\nlearning models. Inspired by observations on real data, we take a different\napproach and propose a new regularization technique. More precisely, the\nfrequencies of node pairs generated by the skip-gram model on random walk node\nsequences follow a highly skewed distribution which causes learning to be\ndominated by a fraction of the pairs. We address the issue by designing an\nefficient sampling procedure that generates node pairs according to their {\\em\nsmoothed frequency}. Theoretical and experimental results demonstrate the\nadvantages of our approach.",
"arxiv_id": "2501.12884v1"
},
{
"title": "Architectural Fusion Through Contextual Partitioning in Large Language\n Models: A Novel Approach to Parameterized Knowledge Integration",
"abstract": "Contextual Partitioning introduces an innovative approach to enhancing the\narchitectural design of large-scale computational models through the dynamic\nsegmentation of parameters into context-aware regions. This methodology\nemphasizes the importance of task-specific specialization, achieved through\nadaptive parameter allocation mechanisms that align with the linguistic\nfeatures of input data. Experimental evaluations demonstrated substantial\nimprovements in accuracy, perplexity, and contextual coherence across a variety\nof linguistic tasks, highlighting the adaptability and scalability of the\nproposed framework. By reducing redundancy and enhancing computational\nefficiency, Contextual Partitioning not only streamlines model operations but\nalso expands the scope of applications for advanced language processing\nsystems. The approach operates autonomously, requiring no external fine-tuning,\nthereby addressing a significant limitation in conventional parameter\noptimization techniques. Empirical results demonstrate the effectiveness of\ngradient-driven segmentation, enabling models to dynamically recalibrate and\nspecialize in response to task-specific demands. Furthermore, resource\nutilization metrics reveal notable reductions in memory usage and training\ntimes, confirming the efficiency of the approach. Observations from qualitative\nanalyses illustrate improved contextual coherence and logical flow in generated\noutputs, reinforcing the practical value of this technique. The findings\ncollectively demonstrate the potential for Contextual Partitioning to redefine\nthe scalability and adaptability of computational language architectures in\ndiverse and complex domains.",
"arxiv_id": "2501.12901v1"
},
{
"title": "A Novel Tracking Framework for Devices in X-ray Leveraging Supplementary\n Cue-Driven Self-Supervised Features",
"abstract": "To restore proper blood flow in blocked coronary arteries via angioplasty\nprocedure, accurate placement of devices such as catheters, balloons, and\nstents under live fluoroscopy or diagnostic angiography is crucial. Identified\nballoon markers help in enhancing stent visibility in X-ray sequences, while\nthe catheter tip aids in precise navigation and co-registering vessel\nstructures, reducing the need for contrast in angiography. However, accurate\ndetection of these devices in interventional X-ray sequences faces significant\nchallenges, particularly due to occlusions from contrasted vessels and other\ndevices and distractions from surrounding, resulting in the failure to track\nsuch small objects. While most tracking methods rely on spatial correlation of\npast and current appearance, they often lack strong motion comprehension\nessential for navigating through these challenging conditions, and fail to\neffectively detect multiple instances in the scene. To overcome these\nlimitations, we propose a self-supervised learning approach that enhances its\nspatio-temporal understanding by incorporating supplementary cues and learning\nacross multiple representation spaces on a large dataset. Followed by that, we\nintroduce a generic real-time tracking framework that effectively leverages the\npretrained spatio-temporal network and also takes the historical appearance and\ntrajectory data into account. This results in enhanced localization of multiple\ninstances of device landmarks. Our method outperforms state-of-the-art methods\nin interventional X-ray device tracking, especially stability and robustness,\nachieving an 87% reduction in max error for balloon marker detection and a 61%\nreduction in max error for catheter tip detection.",
"arxiv_id": "2501.12958v1"
},
{
"title": "Accessible Smart Contracts Verification: Synthesizing Formal Models with\n Tamed LLMs",
"abstract": "When blockchain systems are said to be trustless, what this really means is\nthat all the trust is put into software. Thus, there are strong incentives to\nensure blockchain software is correct -- vulnerabilities here cost millions and\nbreak businesses. One of the most powerful ways of establishing software\ncorrectness is by using formal methods. Approaches based on formal methods,\nhowever, induce a significant overhead in terms of time and expertise required\nto successfully employ them. Our work addresses this critical disadvantage by\nautomating the creation of a formal model -- a mathematical abstraction of the\nsoftware system -- which is often a core task when employing formal methods. We\nperform model synthesis in three phases: we first transpile the code into model\nstubs; then we \"fill in the blanks\" using a large language model (LLM);\nfinally, we iteratively repair the generated model, on both syntactical and\nsemantical level. In this way, we significantly reduce the amount of time\nnecessary to create formal models and increase accessibility of valuable\nsoftware verification methods that rely on them. The practical context of our\nwork was reducing the time-to-value of using formal models for correctness\naudits of smart contracts.",
"arxiv_id": "2501.12972v1"
},
{
"title": "Ehrenfeucht-Haussler Rank and Chain of Thought",
"abstract": "The notion of rank of a Boolean function has been a cornerstone in the theory\nof PAC learning, enabling quasipolynomial-time learning algorithms for\npolynomial-size decision trees. We present a novel characterization of rank,\ngrounded in the well-known Transformer architecture. We show that the rank of a\nfunction $f$ corresponds to the minimum number of Chain of Thought (CoT) steps\nrequired by a single-layer transformer decoder with hard attention to compute\n$f$. Based on this characterization we establish tight bounds on the number of\nCoT steps required for specific problems, showing that $\\ell$-fold function\ncomposition necessitates exactly $\\ell$ CoT steps. Furthermore, we analyze the\nproblem of identifying the position of the $k$-th occurrence of 1 in a Boolean\nsequence, proving that it requires $k$ CoT steps.",
"arxiv_id": "2501.12997v1"
},
{
"title": "MONA: Myopic Optimization with Non-myopic Approval Can Mitigate\n Multi-step Reward Hacking",
"abstract": "Future advanced AI systems may learn sophisticated strategies through\nreinforcement learning (RL) that humans cannot understand well enough to safely\nevaluate. We propose a training method which avoids agents learning undesired\nmulti-step plans that receive high reward (multi-step \"reward hacks\") even if\nhumans are not able to detect that the behaviour is undesired. The method,\nMyopic Optimization with Non-myopic Approval (MONA), works by combining\nshort-sighted optimization with far-sighted reward. We demonstrate that MONA\ncan prevent multi-step reward hacking that ordinary RL causes, even without\nbeing able to detect the reward hacking and without any extra information that\nordinary RL does not get access to. We study MONA empirically in three settings\nwhich model different misalignment failure modes including 2-step environments\nwith LLMs representing delegated oversight and encoded reasoning and\nlonger-horizon gridworld environments representing sensor tampering.",
"arxiv_id": "2501.13011v1"
},
{
"title": "Provably-Safe Neural Network Training Using Hybrid Zonotope Reachability\n Analysis",
"abstract": "Even though neural networks are being increasingly deployed in\nsafety-critical applications, it remains difficult to enforce constraints on\ntheir output, meaning that it is hard to guarantee safety in such settings.\nTowards addressing this, many existing methods seek to verify a neural\nnetwork's satisfaction of safety constraints, but do not address how to correct\nan \"unsafe\" network. On the other hand, the few works that extract a training\nsignal from verification cannot handle non-convex sets, and are either\nconservative or slow. To address these challenges, this work proposes a neural\nnetwork training method that can encourage the exact reachable set of a\nnon-convex input set through a neural network with rectified linear unit (ReLU)\nnonlinearities to avoid a non-convex unsafe region, using recent results in\nnon-convex set representation with hybrid zonotopes and extracting gradient\ninformation from mixed-integer linear programs (MILPs). The proposed method is\nfast, with the computational complexity of each training iteration comparable\nto that of solving a linear program (LP) with number of dimensions and\nconstraints linear to the number of neurons and complexity of input and unsafe\nsets. For a neural network with three hidden layers of width 30, the method was\nable to drive the reachable set of a non-convex input set with 55 generators\nand 26 constraints out of a non-convex unsafe region with 21 generators and 11\nconstraints in 490 seconds.",
"arxiv_id": "2501.13023v1"
},
{
"title": "AdaWM: Adaptive World Model based Planning for Autonomous Driving",
"abstract": "World model based reinforcement learning (RL) has emerged as a promising\napproach for autonomous driving, which learns a latent dynamics model and uses\nit to train a planning policy. To speed up the learning process, the\npretrain-finetune paradigm is often used, where online RL is initialized by a\npretrained model and a policy learned offline. However, naively performing such\ninitialization in RL may result in dramatic performance degradation during the\nonline interactions in the new task. To tackle this challenge, we first analyze\nthe performance degradation and identify two primary root causes therein: the\nmismatch of the planning policy and the mismatch of the dynamics model, due to\ndistribution shift. We further analyze the effects of these factors on\nperformance degradation during finetuning, and our findings reveal that the\nchoice of finetuning strategies plays a pivotal role in mitigating these\neffects. We then introduce AdaWM, an Adaptive World Model based planning\nmethod, featuring two key steps: (a) mismatch identification, which quantifies\nthe mismatches and informs the finetuning strategy, and (b) alignment-driven\nfinetuning, which selectively updates either the policy or the model as needed\nusing efficient low-rank updates. Extensive experiments on the challenging\nCARLA driving tasks demonstrate that AdaWM significantly improves the\nfinetuning process, resulting in more robust and efficient performance in\nautonomous driving systems.",
"arxiv_id": "2501.13072v1"
},
{
"title": "Attention-Driven Hierarchical Reinforcement Learning with Particle\n Filtering for Source Localization in Dynamic Fields",
"abstract": "In many real-world scenarios, such as gas leak detection or environmental\npollutant tracking, solving the Inverse Source Localization and\nCharacterization problem involves navigating complex, dynamic fields with\nsparse and noisy observations. Traditional methods face significant challenges,\nincluding partial observability, temporal and spatial dynamics,\nout-of-distribution generalization, and reward sparsity. To address these\nissues, we propose a hierarchical framework that integrates Bayesian inference\nand reinforcement learning. The framework leverages an attention-enhanced\nparticle filtering mechanism for efficient and accurate belief updates, and\nincorporates two complementary execution strategies: Attention Particle\nFiltering Planning and Attention Particle Filtering Reinforcement Learning.\nThese approaches optimize exploration and adaptation under uncertainty.\nTheoretical analysis proves the convergence of the attention-enhanced particle\nfilter, while extensive experiments across diverse scenarios validate the\nframework's superior accuracy, adaptability, and computational efficiency. Our\nresults highlight the framework's potential for broad applications in dynamic\nfield estimation tasks.",
"arxiv_id": "2501.13084v1"
},
{
"title": "Understanding the LLM-ification of CHI: Unpacking the Impact of LLMs at\n CHI through a Systematic Literature Review",
"abstract": "Large language models (LLMs) have been positioned to revolutionize HCI, by\nreshaping not only the interfaces, design patterns, and sociotechnical systems\nthat we study, but also the research practices we use. To-date, however, there\nhas been little understanding of LLMs' uptake in HCI. We address this gap via a\nsystematic literature review of 153 CHI papers from 2020-24 that engage with\nLLMs. We taxonomize: (1) domains where LLMs are applied; (2) roles of LLMs in\nHCI projects; (3) contribution types; and (4) acknowledged limitations and\nrisks. We find LLM work in 10 diverse domains, primarily via empirical and\nartifact contributions. Authors use LLMs in five distinct roles, including as\nresearch tools or simulated users. Still, authors often raise validity and\nreproducibility concerns, and overwhelmingly study closed models. We outline\nopportunities to improve HCI research with and on LLMs, and provide guiding\nquestions for researchers to consider the validity and appropriateness of\nLLM-related work.",
"arxiv_id": "2501.12557v1"
},
{
"title": "FedGrAINS: Personalized SubGraph Federated Learning with Adaptive\n Neighbor Sampling",
"abstract": "Graphs are crucial for modeling relational and biological data. As datasets\ngrow larger in real-world scenarios, the risk of exposing sensitive information\nincreases, making privacy-preserving training methods like federated learning\n(FL) essential to ensure data security and compliance with privacy regulations.\nRecently proposed personalized subgraph FL methods have become the de-facto\nstandard for training personalized Graph Neural Networks (GNNs) in a federated\nmanner while dealing with the missing links across clients' subgraphs due to\nprivacy restrictions. However, personalized subgraph FL faces significant\nchallenges due to the heterogeneity in client subgraphs, such as degree\ndistributions among the nodes, which complicate federated training of graph\nmodels. To address these challenges, we propose \\textit{FedGrAINS}, a novel\ndata-adaptive and sampling-based regularization method for subgraph FL.\nFedGrAINS leverages generative flow networks (GFlowNets) to evaluate node\nimportance concerning clients' tasks, dynamically adjusting the message-passing\nstep in clients' GNNs. This adaptation reflects task-optimized sampling aligned\nwith a trajectory balance objective. Experimental results demonstrate that the\ninclusion of \\textit{FedGrAINS} as a regularizer consistently improves the FL\nperformance compared to baselines that do not leverage such regularization.",
"arxiv_id": "2501.12592v1"
},
{
"title": "HEPPO: Hardware-Efficient Proximal Policy Optimization -- A Universal\n Pipelined Architecture for Generalized Advantage Estimation",
"abstract": "This paper introduces HEPPO, an FPGA-based accelerator designed to optimize\nthe Generalized Advantage Estimation (GAE) stage in Proximal Policy\nOptimization (PPO). Unlike previous approaches that focused on trajectory\ncollection and actor-critic updates, HEPPO addresses GAE's computational\ndemands with a parallel, pipelined architecture implemented on a single\nSystem-on-Chip (SoC). This design allows for the adaptation of various hardware\naccelerators tailored for different PPO phases. A key innovation is our\nstrategic standardization technique, which combines dynamic reward\nstandardization and block standardization for values, followed by 8-bit uniform\nquantization. This method stabilizes learning, enhances performance, and\nmanages memory bottlenecks, achieving a 4x reduction in memory usage and a 1.5x\nincrease in cumulative rewards. We propose a solution on a single SoC device\nwith programmable logic and embedded processors, delivering throughput orders\nof magnitude higher than traditional CPU-GPU systems. Our single-chip solution\nminimizes communication latency and throughput bottlenecks, significantly\nboosting PPO training efficiency. Experimental results show a 30% increase in\nPPO speed and a substantial reduction in memory access time, underscoring\nHEPPO's potential for broad applicability in hardware-efficient reinforcement\nlearning algorithms.",
"arxiv_id": "2501.12703v1"
},
{
"title": "Practical quantum federated learning and its experimental demonstration",
"abstract": "Federated learning is essential for decentralized, privacy-preserving model\ntraining in the data-driven era. Quantum-enhanced federated learning leverages\nquantum resources to address privacy and scalability challenges, offering\nsecurity and efficiency advantages beyond classical methods. However, practical\nand scalable frameworks addressing privacy concerns in the quantum computing\nera remain undeveloped. Here, we propose a practical quantum federated learning\nframework on quantum networks, utilizing distributed quantum secret keys to\nprotect local model updates and enable secure aggregation with\ninformation-theoretic security. We experimentally validate our framework on a\n4-client quantum network with a scalable structure. Extensive numerical\nexperiments on both quantum and classical datasets show that adding a quantum\nclient significantly enhances the trained global model's ability to classify\nmultipartite entangled and non-stabilizer quantum datasets. Simulations further\ndemonstrate scalability to 200 clients with classical models trained on the\nMNIST dataset, reducing communication costs by $75\\%$ through advanced model\ncompression techniques and achieving rapid training convergence. Our work\nprovides critical insights for building scalable, efficient, and quantum-secure\nmachine learning systems for the coming quantum internet era.",
"arxiv_id": "2501.12709v1"
},
{
"title": "A Call for Critically Rethinking and Reforming Data Analysis in\n Empirical Software Engineering",
"abstract": "Context: Empirical Software Engineering (ESE) drives innovation in SE through\nqualitative and quantitative studies. However, concerns about the correct\napplication of empirical methodologies have existed since the 2006 Dagstuhl\nseminar on SE. Objective: To analyze three decades of SE research, identify\nmistakes in statistical methods, and evaluate experts' ability to detect and\naddress these issues. Methods: We conducted a literature survey of ~27,000\nempirical studies, using LLMs to classify statistical methodologies as adequate\nor inadequate. Additionally, we selected 30 primary studies and held a workshop\nwith 33 ESE experts to assess their ability to identify and resolve statistical\nissues. Results: Significant statistical issues were found in the primary\nstudies, and experts showed limited ability to detect and correct these\nmethodological problems, raising concerns about the broader ESE community's\nproficiency in this area. Conclusions. Despite our study's eventual\nlimitations, its results shed light on recurring issues from promoting\ninformation copy-and-paste from past authors' works and the continuous\npublication of inadequate approaches that promote dubious results and\njeopardize the spread of the correct statistical strategies among researchers.\nBesides, it justifies further investigation into empirical rigor in software\nengineering to expose these recurring issues and establish a framework for\nreassessing our field's foundation of statistical methodology application.\nTherefore, this work calls for critically rethinking and reforming data\nanalysis in empirical software engineering, paving the way for our work soon.",
"arxiv_id": "2501.12728v1"
},
{
"title": "Estimating the Conformal Prediction Threshold from Noisy Labels",
"abstract": "Conformal Prediction (CP) is a method to control prediction uncertainty by\nproducing a small prediction set, ensuring a predetermined probability that the\ntrue class lies within this set. This is commonly done by defining a score,\nbased on the model predictions, and setting a threshold on this score using a\nvalidation set. In this study, we address the problem of CP calibration when we\nonly have access to a validation set with noisy labels. We show how we can\nestimate the noise-free conformal threshold based on the noisy labeled data.\nOur solution is flexible and can accommodate various modeling assumptions\nregarding the label contamination process, without needing any information\nabout the underlying data distribution or the internal mechanisms of the\nmachine learning classifier. We develop a coverage guarantee for uniform noise\nthat is effective even in tasks with a large number of classes. We dub our\napproach Noise-Aware Conformal Prediction (NACP) and show on several natural\nand medical image classification datasets, including ImageNet, that it\nsignificantly outperforms current noisy label methods and achieves results\ncomparable to those obtained with a clean validation set.",
"arxiv_id": "2501.12749v1"
},
{
"title": "On Tradeoffs in Learning-Augmented Algorithms",
"abstract": "The field of learning-augmented algorithms has gained significant attention\nin recent years. These algorithms, using potentially inaccurate predictions,\nmust exhibit three key properties: consistency, robustness, and smoothness. In\nscenarios where distributional information about predictions is available, a\nstrong expected performance is required. Typically, the design of these\nalgorithms involves a natural tradeoff between consistency and robustness, and\nprevious works aimed to achieve Pareto-optimal tradeoffs for specific problems.\nHowever, in some settings, this comes at the expense of smoothness. This paper\ndemonstrates that certain problems involve multiple tradeoffs between\nconsistency, robustness, smoothness, and average performance.",
"arxiv_id": "2501.12770v1"
},
{
"title": "Data re-uploading in Quantum Machine Learning for time series:\n application to traffic forecasting",
"abstract": "Accurate traffic forecasting plays a crucial role in modern Intelligent\nTransportation Systems (ITS), as it enables real-time traffic flow management,\nreduces congestion, and improves the overall efficiency of urban transportation\nnetworks. With the rise of Quantum Machine Learning (QML), it has emerged a new\nparadigm possessing the potential to enhance predictive capabilities beyond\nwhat classical machine learning models can achieve. In the present work we\npursue a heuristic approach to explore the potential of QML, and focus on a\nspecific transport issue. In particular, as a case study we investigate a\ntraffic forecast task for a major urban area in Athens (Greece), for which we\npossess high-resolution data. In this endeavor we explore the application of\nQuantum Neural Networks (QNN), and, notably, we present the first application\nof quantum data re-uploading in the context of transport forecasting. This\ntechnique allows quantum models to better capture complex patterns, such as\ntraffic dynamics, by repeatedly encoding classical data into a quantum state.\nAside from providing a prediction model, we spend considerable effort in\ncomparing the performance of our hybrid quantum-classical neural networks with\nclassical deep learning approaches. Our results show that hybrid models achieve\ncompetitive accuracy with state-of-the-art classical methods, especially when\nthe number of qubits and re-uploading blocks is increased. While the classical\nmodels demonstrate lower computational demands, we provide evidence that\nincreasing the complexity of the quantum model improves predictive accuracy.\nThese findings indicate that QML techniques, and specifically the data\nre-uploading approach, hold promise for advancing traffic forecasting models\nand could be instrumental in addressing challenges inherent in ITS\nenvironments.",
"arxiv_id": "2501.12776v1"
},
{
"title": "Machine Learning Modeling for Multi-order Human Visual Motion Processing",
"abstract": "Our research aims to develop machines that learn to perceive visual motion as\ndo humans. While recent advances in computer vision (CV) have enabled DNN-based\nmodels to accurately estimate optical flow in naturalistic images, a\nsignificant disparity remains between CV models and the biological visual\nsystem in both architecture and behavior. This disparity includes humans'\nability to perceive the motion of higher-order image features (second-order\nmotion), which many CV models fail to capture because of their reliance on the\nintensity conservation law. Our model architecture mimics the cortical V1-MT\nmotion processing pathway, utilizing a trainable motion energy sensor bank and\na recurrent graph network. Supervised learning employing diverse naturalistic\nvideos allows the model to replicate psychophysical and physiological findings\nabout first-order (luminance-based) motion perception. For second-order motion,\ninspired by neuroscientific findings, the model includes an additional sensing\npathway with nonlinear preprocessing before motion energy sensing, implemented\nusing a simple multilayer 3D CNN block. When exploring how the brain acquired\nthe ability to perceive second-order motion in natural environments, in which\npure second-order signals are rare, we hypothesized that second-order\nmechanisms were critical when estimating robust object motion amidst optical\nfluctuations, such as highlights on glossy surfaces. We trained our\ndual-pathway model on novel motion datasets with varying material properties of\nmoving objects. We found that training to estimate object motion from\nnon-Lambertian materials naturally endowed the model with the capacity to\nperceive second-order motion, as can humans. The resulting model effectively\naligns with biological systems while generalizing to both first- and\nsecond-order motion phenomena in natural scenes.",
"arxiv_id": "2501.12810v1"
},
{
"title": "Open or Closed LLM for Lesser-Resourced Languages? Lessons from Greek",
"abstract": "Natural Language Processing (NLP) for lesser-resourced languages faces\npersistent challenges, including limited datasets, inherited biases from\nhigh-resource languages, and the need for domain-specific solutions. This study\naddresses these gaps for Modern Greek through three key contributions. First,\nwe evaluate the performance of open-source (Llama-70b) and closed-source\n(GPT-4o mini) large language models (LLMs) on seven core NLP tasks with dataset\navailability, revealing task-specific strengths, weaknesses, and parity in\ntheir performance. Second, we expand the scope of Greek NLP by reframing\nAuthorship Attribution as a tool to assess potential data usage by LLMs in\npre-training, with high 0-shot accuracy suggesting ethical implications for\ndata provenance. Third, we showcase a legal NLP case study, where a Summarize,\nTranslate, and Embed (STE) methodology outperforms the traditional TF-IDF\napproach for clustering \\emph{long} legal texts. Together, these contributions\nprovide a roadmap to advance NLP in lesser-resourced languages, bridging gaps\nin model evaluation, task innovation, and real-world impact.",
"arxiv_id": "2501.12826v1"
},
{
"title": "Mutation-Guided LLM-based Test Generation at Meta",
"abstract": "This paper describes Meta's ACH system for mutation-guided LLM-based test\ngeneration. ACH generates relatively few mutants (aka simulated faults),\ncompared to traditional mutation testing. Instead, it focuses on generating\ncurrently undetected faults that are specific to an issue of concern. From\nthese currently uncaught faults, ACH generates tests that can catch them,\nthereby `killing' the mutants and consequently hardening the platform against\nregressions. We use privacy concerns to illustrate our approach, but ACH can\nharden code against {\\em any} type of regression. In total, ACH was applied to\n10,795 Android Kotlin classes in 7 software platforms deployed by Meta, from\nwhich it generated 9,095 mutants and 571 privacy-hardening test cases. ACH also\ndeploys an LLM-based equivalent mutant detection agent that achieves a\nprecision of 0.79 and a recall of 0.47 (rising to 0.95 and 0.96 with simple\npre-processing). ACH was used by Messenger and WhatsApp test-a-thons where\nengineers accepted 73% of its tests, judging 36% to privacy relevant. We\nconclude that ACH hardens code against specific concerns and that, even when\nits tests do not directly tackle the specific concern, engineers find them\nuseful for their other benefits.",
"arxiv_id": "2501.12862v1"
},
{
"title": "PreciseCam: Precise Camera Control for Text-to-Image Generation",
"abstract": "Images as an artistic medium often rely on specific camera angles and lens\ndistortions to convey ideas or emotions; however, such precise control is\nmissing in current text-to-image models. We propose an efficient and general\nsolution that allows precise control over the camera when generating both\nphotographic and artistic images. Unlike prior methods that rely on predefined\nshots, we rely solely on four simple extrinsic and intrinsic camera parameters,\nremoving the need for pre-existing geometry, reference 3D objects, and\nmulti-view data. We also present a novel dataset with more than 57,000 images,\nalong with their text prompts and ground-truth camera parameters. Our\nevaluation shows precise camera control in text-to-image generation, surpassing\ntraditional prompt engineering approaches. Our data, model, and code are\npublicly available at https://graphics.unizar.es/projects/PreciseCam2024.",
"arxiv_id": "2501.12910v1"
},
{
"title": "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via\n Reinforcement Learning",
"abstract": "We introduce our first-generation reasoning models, DeepSeek-R1-Zero and\nDeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement\nlearning (RL) without supervised fine-tuning (SFT) as a preliminary step,\ndemonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero\nnaturally emerges with numerous powerful and intriguing reasoning behaviors.\nHowever, it encounters challenges such as poor readability, and language\nmixing. To address these issues and further enhance reasoning performance, we\nintroduce DeepSeek-R1, which incorporates multi-stage training and cold-start\ndata before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217\non reasoning tasks. To support the research community, we open-source\nDeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B,\n70B) distilled from DeepSeek-R1 based on Qwen and Llama.",
"arxiv_id": "2501.12948v1"
},
{
"title": "GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models",
"abstract": "Large Language Models (LLMs) face significant deployment challenges due to\ntheir substantial resource requirements. While low-bit quantized weights can\nreduce memory usage and improve inference efficiency, current hardware lacks\nnative support for mixed-precision General Matrix Multiplication (mpGEMM),\nresulting in inefficient dequantization-based implementations. Moreover,\nuniform quantization methods often fail to capture weight distributions\nadequately, leading to performance degradation. We propose GANQ (GPU-Adaptive\nNon-Uniform Quantization), a layer-wise post-training non-uniform quantization\nframework optimized for hardware-efficient lookup table-based mpGEMM. GANQ\nachieves superior quantization performance by utilizing a training-free,\nGPU-adaptive optimization algorithm to efficiently reduce layer-wise\nquantization errors. Extensive experiments demonstrate GANQ's ability to reduce\nthe perplexity gap from the FP16 baseline compared to state-of-the-art methods\nfor both 3-bit and 4-bit quantization. Furthermore, when deployed on a single\nNVIDIA RTX 4090 GPU, GANQ's quantized models achieve up to 2.57$\\times$ speedup\nover the baseline, advancing memory and inference efficiency in LLM deployment.",
"arxiv_id": "2501.12956v1"
},
{
"title": "It's complicated. The relationship of algorithmic fairness and\n non-discrimination regulations in the EU AI Act",
"abstract": "What constitutes a fair decision? This question is not only difficult for\nhumans but becomes more challenging when Artificial Intelligence (AI) models\nare used. In light of discriminatory algorithmic behaviors, the EU has recently\npassed the AI Act, which mandates specific rules for AI models, incorporating\nboth traditional legal non-discrimination regulations and machine learning\nbased algorithmic fairness concepts. This paper aims to bridge these two\ndifferent concepts in the AI Act through: First a high-level introduction of\nboth concepts targeting legal and computer science-oriented scholars, and\nsecond an in-depth analysis of the AI Act's relationship between legal\nnon-discrimination regulations and algorithmic fairness. Our analysis reveals\nthree key findings: (1.), most non-discrimination regulations target only\nhigh-risk AI systems. (2.), the regulation of high-risk systems encompasses\nboth data input requirements and output monitoring, though these regulations\nare often inconsistent and raise questions of computational feasibility. (3.)\nRegulations for General Purpose AI Models, such as Large Language Models that\nare not simultaneously classified as high-risk systems, currently lack\nspecificity compared to other regulations. Based on these findings, we\nrecommend developing more specific auditing and testing methodologies for AI\nsystems. This paper aims to serve as a foundation for future interdisciplinary\ncollaboration between legal scholars and computer science-oriented machine\nlearning researchers studying discrimination in AI systems.",
"arxiv_id": "2501.12962v1"
},
{
"title": "Galois groups of polynomials and neurosymbolic networks",
"abstract": "This paper introduces a novel approach to understanding Galois theory, one of\nthe foundational areas of algebra, through the lens of machine learning. By\nanalyzing polynomial equations with machine learning techniques, we aim to\nstreamline the process of determining solvability by radicals and explore\nbroader applications within Galois theory. This summary encapsulates the\nbackground, methodology, potential applications, and challenges of using data\nscience in Galois theory.\n More specifically, we design a neurosymbolic network to classify Galois\ngroups and show how this is more efficient than usual neural networks. We\ndiscover some very interesting distribution of polynomials for groups not\nisomorphic to the symmetric groups and alternating groups.",
"arxiv_id": "2501.12978v1"
},
{
"title": "FlanEC: Exploring Flan-T5 for Post-ASR Error Correction",
"abstract": "In this paper, we present an encoder-decoder model leveraging Flan-T5 for\npost-Automatic Speech Recognition (ASR) Generative Speech Error Correction\n(GenSEC), and we refer to it as FlanEC. We explore its application within the\nGenSEC framework to enhance ASR outputs by mapping n-best hypotheses into a\nsingle output sentence. By utilizing n-best lists from ASR models, we aim to\nimprove the linguistic correctness, accuracy, and grammaticality of final ASR\ntranscriptions. Specifically, we investigate whether scaling the training data\nand incorporating diverse datasets can lead to significant improvements in\npost-ASR error correction. We evaluate FlanEC using the HyPoradise dataset,\nproviding a comprehensive analysis of the model's effectiveness in this domain.\nFurthermore, we assess the proposed approach under different settings to\nevaluate model scalability and efficiency, offering valuable insights into the\npotential of instruction-tuned encoder-decoder models for this task.",
"arxiv_id": "2501.12979v1"
},
{
"title": "Paper Quality Assessment based on Individual Wisdom Metrics from Open\n Peer Review",
"abstract": "This study proposes a data-driven framework for enhancing the accuracy and\nefficiency of scientific peer review through an open, bottom-up process that\nestimates reviewer quality. Traditional closed peer review systems, while\nessential for quality control, are often slow, costly, and subject to biases\nthat can impede scientific progress. Here, we introduce a method that evaluates\nindividual reviewer reliability by quantifying agreement with community\nconsensus scores and applying Bayesian weighting to refine paper quality\nassessments. We analyze open peer review data from two major scientific\nconferences, and demonstrate that reviewer-specific quality scores\nsignificantly improve the reliability of paper quality estimation. Perhaps\nsurprisingly, we find that reviewer quality scores are unrelated to authorship\nquality. Our model incorporates incentive structures to recognize high-quality\nreviewers and encourage broader coverage of submitted papers, thereby\nmitigating the common \"rich-get-richer\" pitfall of social media. These findings\nsuggest that open peer review, with mechanisms for estimating and incentivizing\nreviewer quality, offers a scalable and equitable alternative for scientific\npublishing, with potential to enhance the speed, fairness, and transparency of\nthe peer review process.",
"arxiv_id": "2501.13014v1"
},
{
"title": "Optimizing Return Distributions with Distributional Dynamic Programming",
"abstract": "We introduce distributional dynamic programming (DP) methods for optimizing\nstatistical functionals of the return distribution, with standard reinforcement\nlearning as a special case. Previous distributional DP methods could optimize\nthe same class of expected utilities as classic DP. To go beyond expected\nutilities, we combine distributional DP with stock augmentation, a technique\npreviously introduced for classic DP in the context of risk-sensitive RL, where\nthe MDP state is augmented with a statistic of the rewards obtained so far\n(since the first time step). We find that a number of recently studied problems\ncan be formulated as stock-augmented return distribution optimization, and we\nshow that we can use distributional DP to solve them. We analyze distributional\nvalue and policy iteration, with bounds and a study of what objectives these\ndistributional DP methods can or cannot optimize. We describe a number of\napplications outlining how to use distributional DP to solve different\nstock-augmented return distribution optimization problems, for example\nmaximizing conditional value-at-risk, and homeostatic regulation. To highlight\nthe practical potential of stock-augmented return distribution optimization and\ndistributional DP, we combine the core ideas of distributional value iteration\nwith the deep RL agent DQN, and empirically evaluate it for solving instances\nof the applications discussed.",
"arxiv_id": "2501.13028v1"
},
{
"title": "Autonomy-of-Experts Models",
"abstract": "Mixture-of-Experts (MoE) models mostly use a router to assign tokens to\nspecific expert modules, activating only partial parameters and often\noutperforming dense models. We argue that the separation between the router's\ndecision-making and the experts' execution is a critical yet overlooked issue,\nleading to suboptimal expert selection and ineffective learning. To address\nthis, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which\nexperts autonomously select themselves to process inputs. AoE is based on the\ninsight that an expert is aware of its own capacity to effectively process a\ntoken, an awareness reflected in the scale of its internal activations. In AoE,\nrouters are removed; instead, experts pre-compute internal activations for\ninputs and are ranked based on their activation norms. Only the top-ranking\nexperts proceed with the forward pass, while the others abort. The overhead of\npre-computing activations is reduced through a low-rank weight factorization.\nThis self-evaluating-then-partner-comparing approach ensures improved expert\nselection and effective learning. We pre-train language models having 700M up\nto 4B parameters, demonstrating that AoE outperforms traditional MoE models\nwith comparable efficiency.",
"arxiv_id": "2501.13074v1"
},
{
"title": "Robust Representation Consistency Model via Contrastive Denoising",
"abstract": "Robustness is essential for deep neural networks, especially in\nsecurity-sensitive applications. To this end, randomized smoothing provides\ntheoretical guarantees for certifying robustness against adversarial\nperturbations. Recently, diffusion models have been successfully employed for\nrandomized smoothing to purify noise-perturbed samples before making\npredictions with a standard classifier. While these methods excel at small\nperturbation radii, they struggle with larger perturbations and incur a\nsignificant computational overhead during inference compared to classical\nmethods. To address this, we reformulate the generative modeling task along the\ndiffusion trajectories in pixel space as a discriminative task in the latent\nspace. Specifically, we use instance discrimination to achieve consistent\nrepresentations along the trajectories by aligning temporally adjacent points.\nAfter fine-tuning based on the learned representations, our model enables\nimplicit denoising-then-classification via a single prediction, substantially\nreducing inference costs. We conduct extensive experiments on various datasets\nand achieve state-of-the-art performance with minimal computation budget during\ninference. For example, our method outperforms the certified accuracy of\ndiffusion-based methods on ImageNet across all perturbation radii by 5.3% on\naverage, with up to 11.6% at larger radii, while reducing inference costs by\n85$\\times$ on average. Codes are available at:\nhttps://github.com/jiachenlei/rRCM.",
"arxiv_id": "2501.13094v1"
},
{
"title": "Leveraging LLMs to Create a Haptic Devices' Recommendation System",
"abstract": "Haptic technology has seen significant growth, yet a lack of awareness of\nexisting haptic device design knowledge hinders development. This paper\naddresses these limitations by leveraging advancements in Large Language Models\n(LLMs) to develop a haptic agent, focusing specifically on Grounded Force\nFeedback (GFF) devices recommendation. Our approach involves automating the\ncreation of a structured haptic device database using information from research\npapers and product specifications. This database enables the recommendation of\nrelevant GFF devices based on user queries. To ensure precise and contextually\nrelevant recommendations, the system employs a dynamic retrieval method that\ncombines both conditional and semantic searches. Benchmarking against the\nestablished UEQ and existing haptic device searching tools, the proposed haptic\nrecommendation agent ranks in the top 10\\% across all UEQ categories with mean\ndifferences favoring the agent in nearly all subscales, and maintains no\nsignificant performance bias across different user groups, showcasing superior\nusability and user satisfaction.",
"arxiv_id": "2501.12573v1"
},
{
"title": "Guaranteed Recovery of Unambiguous Clusters",
"abstract": "Clustering is often a challenging problem because of the inherent ambiguity\nin what the \"correct\" clustering should be. Even when the number of clusters\n$K$ is known, this ambiguity often still exists, particularly when there is\nvariation in density among different clusters, and clusters have multiple\nrelatively separated regions of high density. In this paper we propose an\ninformation-theoretic characterization of when a $K$-clustering is ambiguous,\nand design an algorithm that recovers the clustering whenever it is\nunambiguous. This characterization formalizes the situation when two high\ndensity regions within a cluster are separable enough that they look more like\ntwo distinct clusters than two truly distinct clusters in the clustering. The\nalgorithm first identifies $K$ partial clusters (or \"seeds\") using a\ndensity-based approach, and then adds unclustered points to the initial $K$\npartial clusters in a greedy manner to form a complete clustering. We implement\nand test a version of the algorithm that is modified to effectively handle\noverlapping clusters, and observe that it requires little parameter selection\nand displays improved performance on many datasets compared to widely used\nalgorithms for non-convex cluster recovery.",
"arxiv_id": "2501.13093v1"
}
]

130
arxiv-processor/README.md Normal file
View File

@ -0,0 +1,130 @@
# ArXiv Processor
A Go package for fetching and processing papers from arXiv.
## Installation
1. Clone the repository:
```bash
git clone https://github.com/yourusername/arxiv-processor.git
cd arxiv-processor
```
2. Initialize the Go module:
```bash
go mod init arxiv-processor
go mod tidy
```
## Usage
### As a Library
To use the package in your Go application:
```go
import "github.com/yourusername/arxiv-processor"
func main() {
// Create client
client := arxiv.NewClient()
// Define search parameters
// The format "20060102" is Go's reference time format:
// 2006 = year
// 01 = month
// 02 = day
// Note: The arXiv API returns full timestamps including time of day,
// but the search API only uses the date portion for filtering
startDate, _ := time.Parse("20060102", "20240101")
endDate, _ := time.Parse("20060102", "20240131")
// Fetch papers
// The FetchPapers method returns all papers at once after completion
// of the API request and any necessary pagination
papers, err := client.FetchPapers("cat:cs.AI", startDate, endDate)
if err != nil {
log.Fatal(err)
}
// Use papers directly (in-memory)
// The papers slice contains all results after completion
for _, paper := range papers {
fmt.Printf("Title: %s\n", paper.Title)
fmt.Printf("Abstract: %s\n", paper.Abstract)
}
// Optionally save papers to file
err = arxiv.SavePapers("papers.json", papers)
if err != nil {
log.Fatal(err)
}
}
```
Note: The package currently writes results to a file by default. To return JSON objects without writing a file:
1. Remove the SavePapers call
2. Use the returned papers slice directly
3. The papers slice contains all paper data as Go structs
4. Marshal to JSON with json.Marshal(papers) if needed (see the sketch below)
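A minimal sketch of that in-memory variant, assuming the client API shown above (the `arxiv` import alias and the choice to print to stdout are illustrative, not part of the package):
```go
// Sketch only: fetch papers and emit JSON to stdout instead of calling SavePapers.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"time"

	arxiv "github.com/yourusername/arxiv-processor" // assumed package alias
)

func main() {
	client := arxiv.NewClient()
	start, _ := time.Parse("20060102", "20240101")
	end, _ := time.Parse("20060102", "20240131")

	papers, err := client.FetchPapers("cat:cs.AI", start, end)
	if err != nil {
		log.Fatal(err)
	}

	// Marshal the in-memory slice instead of writing papers.json.
	data, err := json.MarshalIndent(papers, "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(data))
}
```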
### Command Line Interface
To use the CLI:
```bash
go run main.go --search "cat:cs.AI" --date-range "YYYYMMDD-YYYYMMDD"
```
#### Command Line Options
- `--search`: Search query (e.g., "cat:cs.AI" for AI papers)
- `--date-range`: Date range in YYYYMMDD-YYYYMMDD format
- `--output`: Output file path (default: papers_data.json)
### Example: Fetch AI Papers
```bash
go run main.go --search "cat:cs.AI" --date-range "20250115-20250118"
```
### Program Output
- Fetched papers are saved to `papers_data.json`
- Example JSON structure:
```json
[
{
"title": "Sample Paper Title",
"abstract": "This is a sample abstract...",
"arxiv_id": "2501.08565v1"
}
]
```
- The JSON file contains paper metadata including the following fields (a matching Go struct is sketched after this list):
- Title
- Abstract
- arXiv ID
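For downstream tooling, a struct along the following lines could unmarshal this file; the JSON tags mirror the keys shown above, while the type name and file path are assumptions rather than the package's actual definitions:
```go
// Sketch: load papers_data.json into structs whose JSON tags mirror the
// keys above. The type name and file path are illustrative assumptions.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

type Paper struct {
	Title    string `json:"title"`
	Abstract string `json:"abstract"`
	ArxivID  string `json:"arxiv_id"`
}

func main() {
	raw, err := os.ReadFile("papers_data.json")
	if err != nil {
		log.Fatal(err)
	}
	var papers []Paper
	if err := json.Unmarshal(raw, &papers); err != nil {
		log.Fatal(err)
	}
	if len(papers) > 0 {
		fmt.Printf("loaded %d papers; first: %s\n", len(papers), papers[0].ArxivID)
	}
}
```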
## Configuration
### Environment Variables
- `ARXIV_MAX_RESULTS`: Maximum number of results to fetch (default: 100)
- `ARXIV_START_INDEX`: Start index for pagination (default: 0); a sketch of reading these variables follows
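This README does not show how these variables are consumed; the sketch below is one plausible way to read them with the documented defaults (the `envInt` helper is an assumption, not the package's actual code):
```go
// Sketch: read ARXIV_MAX_RESULTS and ARXIV_START_INDEX with the defaults
// documented above. envInt is an illustrative helper, not part of the package.
package main

import (
	"fmt"
	"os"
	"strconv"
)

func envInt(name string, fallback int) int {
	if v := os.Getenv(name); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return fallback
}

func main() {
	maxResults := envInt("ARXIV_MAX_RESULTS", 100)
	startIndex := envInt("ARXIV_START_INDEX", 0)
	fmt.Printf("max_results=%d start_index=%d\n", maxResults, startIndex)
}
```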
## Package Structure
```
arxiv-processor/
├── arxiv/ # arXiv API client
├── storage/ # Data storage handlers
├── llm/ # LLM integration (TODO)
├── main.go # Main entry point
└── README.md # This file
```
## Contributing
1. Fork the repository
2. Create a new branch
3. Make your changes
4. Submit a pull request
## License
MIT License

View File

@ -0,0 +1,156 @@
Title: SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks
URL Source: https://arxiv.org/html/2501.11599v1
Markdown Content:
Wentao Wan1, Zhuojie Yang1, Yongcan Chen3, Chenglin Luo1, Ruilin Wang1, Kehao Cai1, Nan Kang1, Liang Lin1, Keze Wang1,2
###### Abstract
Deductive reasoning is a crucial logical capability that assists us in solving complex problems based on existing knowledge. Although augmented by Chain-of-Thought prompts, Large Language Models (LLMs) might not follow the correct reasoning paths. Enhancing the deductive reasoning abilities of LLMs, and leveraging their extensive built-in knowledge for various reasoning tasks, remains an open question. Attempting to mimic the human deductive reasoning paradigm, we propose a multi-stage Syllogistic-Reasoning Framework of Thought (SR-FoT) that enables LLMs to perform syllogistic deductive reasoning to handle complex knowledge-based reasoning tasks. Our SR-FoT begins by interpreting the question and then uses the interpretation and the original question to propose a suitable major premise. It proceeds by generating and answering minor premise questions in two stages to match the minor premises. Finally, it guides LLMs to use the previously generated major and minor premises to perform syllogistic deductive reasoning to derive the answer to the original question. Extensive and thorough experiments on knowledge-based reasoning tasks have demonstrated the effectiveness and advantages of our SR-FoT.
Code — https://github.com/RodeWayne/SR-FoT
Introduction
------------
Deductive reasoning (Johnson-Laird [1999](https://arxiv.org/html/2501.11599v1#bib.bib19)) is the process of drawing valid inferences. Deductive reasoning is a powerful human capability, where rigorous deductive reasoning helps us use existing knowledge as premises to derive correct subsequent conclusions, enabling us to tackle various complex tasks in the real world.
Automated deductive reasoning has long been a pursuit in the field of Natural Language Processing (NLP) (Chowdhary and Chowdhary [2020](https://arxiv.org/html/2501.11599v1#bib.bib9); Bharadiya [2023](https://arxiv.org/html/2501.11599v1#bib.bib3); Khurana et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib21)). Works on automated rigorous reasoning include reasoning engines and Automated Theorem Proving (ATP) (Bibel [2013](https://arxiv.org/html/2501.11599v1#bib.bib4)), which often provide methods for automatically checking the rigor of reasoning. However, these engines require the use of formal languages, which limits their applicability in knowledge-based reasoning scenarios: formal language-based reasoning requires a predefined library of formalized premises, whereas many knowledge-based reasoning tasks, including common-sense question answering, involve a diverse array of premises, and it is difficult to prepare and rigorously formalize such a large library of premises in advance. Therefore, performing correct deductive reasoning in natural language holds significant importance.
Large Language Models (LLMs) (Chang et al. [2024](https://arxiv.org/html/2501.11599v1#bib.bib7); Floridi and Chiriatti [2020](https://arxiv.org/html/2501.11599v1#bib.bib15); Touvron et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib29); Chiang et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib8); Huang and Chang [2022](https://arxiv.org/html/2501.11599v1#bib.bib18); DeepSeek-AI [2024](https://arxiv.org/html/2501.11599v1#bib.bib11)) pre-trained on extensive corpora possess inherent soft deductive reasoning capabilities (Seals and Shalin [2023](https://arxiv.org/html/2501.11599v1#bib.bib27)). With the aid of the Chain-of-Thought prompt (CoT) (Lyu et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib25); Wei et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib33); Zhang et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib39); Turpin et al. [2024](https://arxiv.org/html/2501.11599v1#bib.bib30); Lee et al. [2024](https://arxiv.org/html/2501.11599v1#bib.bib22); Liu, Pang, and Fan [2023](https://arxiv.org/html/2501.11599v1#bib.bib23)), the cognitive abilities of LLMs are further enhanced. However, reasoning with CoT often does not constitute strict deductive reasoning and thus can lack rigor. Fig. [4](https://arxiv.org/html/2501.11599v1#Sx3.F4 "Figure 4 ‣ Procedure of Our SR-FoT ‣ Methodology ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks") illustrates the different processes of handling the same problem using CoT and classic syllogistic deductive reasoning, clearly showing that the syllogistic deductive approach is more rigorous. We believe that guiding large language models to perform deductive reasoning, rather than merely multi-step reasoning, can enhance the rigor of the reasoning process, reduce illusions, and subsequently improve performance on complex tasks.
Inspired by the most fundamental human deductive reasoning paradigm, syllogistic reasoning (Bucciarelli and Johnson-Laird [1999](https://arxiv.org/html/2501.11599v1#bib.bib6); Khemlani and Johnson-Laird [2012](https://arxiv.org/html/2501.11599v1#bib.bib20); Bara, Bucciarelli, and Johnson-Laird [1995](https://arxiv.org/html/2501.11599v1#bib.bib2)), we propose a multi-stage reasoning framework for large language models to guide them in using syllogistic reasoning to solve specific problems. In contrast to existing works in the community (Wu et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib34); Ye et al. [2023b](https://arxiv.org/html/2501.11599v1#bib.bib37); Deng et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib12)), we do not solely rely on simplistic processes or create targeted benchmarks to evaluate the LLMs' capabilities in performing syllogistic reasoning. Instead, we propose a reasoning framework based on the syllogistic thinking paradigm to handle complex knowledge-based reasoning tasks. Our SR-FoT goes further by guiding LLMs to perform syllogistic deductive reasoning, thereby achieving improved performance on these tasks and enhancing the rigor and reliability of the reasoning process.
Our SR-FoT consists of five stages. Initially, it involves interpreting the question. Subsequently, SR-FoT guides the Large Language Model (LLM) in proposing a major premise suited to the question. This major premise typically encompasses the built-in knowledge of the LLM, which serves as a universal rule that aids in addressing the original question. The next stage involves obtaining a minor premise, which acts as the bridge linking the major premise to the original problem and is crucial for applying the major premise to the current issue. We first let the LLM formulate minor premise questions based on the original question, major premise, and contextual information, and then answer these to obtain an appropriate minor premise. Finally, with both the major and minor premises established, we enable the LLM to perform syllogistic reasoning based on the original question and these premises to derive the answer to the original question. Furthermore, to minimize the interference caused by excessive information during the reasoning process (Dong et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib14)), we restrict each stage to access only the content from its necessary preceding stages. For example, during the final syllogistic reasoning, only the original problem and the previously established major and minor premises are visible, without the need to reference the problem interpretation and minor premise question stages.
Our main contributions can be summarized in three points: i) We propose a multi-step thinking framework that guides LLMs in using syllogistic deductive reasoning to solve knowledge-based reasoning tasks. Specifically, to enhance the ability of LLM to leverage its built-in knowledge for solving diverse tasks, we introduce a problem interpretation stage when acquiring the major premise and improve the quality of both premises as well as their logical connection to the original problem by adopting an autonomous question-answering approach during the acquisition of the minor premise (Bubeck et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib5)); ii) To facilitate more rigorous reasoning, we have designed our thinking framework so that each step only accesses the information necessary for that stage, thereby reducing the illusions and error accumulation that can come from overly long reasoning steps; iii) Our SR-FoT achieves superior performance over the existing chain of thought-related methods on various knowledge-based reasoning QA datasets such as ScienceQA (Lu et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib24)), StrategyQA (Geva et al. [2021](https://arxiv.org/html/2501.11599v1#bib.bib17)), and BoolQ (Clark et al. [2019](https://arxiv.org/html/2501.11599v1#bib.bib10)), demonstrating the superiority of our SR-FoT.
Related Work
------------
### Chain-of-Thought
Chain-of-Thought (Wei et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib33)) has been demonstrated to enhance performance in reasoning tasks by fully utilizing the in-context learning capability of the large language model to stimulate its multi-step reasoning ability. Self-consistency CoT (SC-CoT) (Wang et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib32)) further improves the performance of CoT by utilizing the consistency of multiple sampled reasoning chains. Complexity-based CoT (C-CoT) (Fu et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib16)) further discovers that the consistency of complex reasoning chains is even more vital for the reasoning performance of language models. In addition, some efforts have also been made to further stimulate the reasoning ability of language models by focusing on the structure of the reasoning chain and the levels of reasoning, such as Least-to-Most (Zhou et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib40)) and Tree-of-Thought (Yao et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib35)). However, these works have not considered how to stimulate the reasoning abilities of LLMs from the perspective of logical reasoning.
### Logical Reasoning Ability of LLMs
There has been considerable research within the community on the logical reasoning capabilities of LLMs, broadly categorized into two directions: one focuses on logic reasoning based on formal languages, and the other on natural language logic reasoning. Research related to formal language-based logic reasoning primarily concentrates on the field of Automated Theorem Proving (ATP) (Bibel [2013](https://arxiv.org/html/2501.11599v1#bib.bib4)), utilizing the built-in mathematical priors of LLMs to accelerate the search process in theorem proving or to construct a growing library of mathematical theorems to aid new proofs (Wang et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib31)). This work typically operates within interactive theorem-proving platforms like the Lean system, which restricts its application in daily question-answering scenarios. Logic reasoning on natural language with LLMs generally involves soft reasoning (Yu et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib38)), which does not provide rigorous guarantees. For instance, the Chain-of-Thought (CoT) enhances the general explicit reasoning abilities of LLMs, and there are exploratory studies demonstrating to what extent LLMs can perform in logical reasoning, or how segment checking might reduce soft deductive reasoning illusions and error accumulation (Ye et al. [2023a](https://arxiv.org/html/2501.11599v1#bib.bib36); Dhuliawala et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib13)). Recently, several studies on syllogistic reasoning with LLMs have been proposed. However, these primarily create benchmarks (Ye et al. [2023b](https://arxiv.org/html/2501.11599v1#bib.bib37)), evaluating the capability of LLMs to perform syllogistic reasoning on datasets with given premises. Unlike previous works, our research investigates how to guide LLMs through a multi-stage process that involves autonomously generating minor and major premises and performing syllogistic deductive reasoning to answer a variety of knowledge-based reasoning questions.
Methodology
-----------
We have designed a reasoning framework that guides large language models to perform syllogistic deductive reasoning for addressing various knowledge-based reasoning question-answer tasks. Next, we present syllogistic reasoning as background knowledge, followed by a detailed description of our SR-FoT framework.
### Background: Syllogism
In traditional logic, syllogism (Smiley [1973](https://arxiv.org/html/2501.11599v1#bib.bib28)) is a form of reasoning where one proposition (the conclusion) necessarily follows from two other propositions (known as premises). As shown in Fig. [1](https://arxiv.org/html/2501.11599v1#Sx3.F1 "Figure 1 ‣ Background: Syllogism ‣ Methodology ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"), a syllogism consists of three parts: a major premise, a minor premise, and a conclusion. Logically, the conclusion is derived by applying the major premise to the minor premise. The major premise represents a general principle, while the minor premise is a specific statement. Syllogistic reasoning is a type of deductive reasoning; rigorous deductive reasoning ensures that if the premises are correct, the conclusion must also be correct.
![Figure 1](https://arxiv.org/html/2501.11599v1/x1.png)
Figure 1: A syllogism example.
### Procedure of Our SR-FoT
![Figure 2](https://arxiv.org/html/2501.11599v1/x2.png)
Figure 2: Procedure of our SR-FoT. Question_ori: Original Question; Context: context provided for the Original Question; Answer_ori: Answer for the Original Question; Question_mP: Question for the Minor Premise; Prompt_CoT: Guide Prompt for CoT; Prompt_QE: Guide Prompt for Question Explanation; Prompt_MP: Guide Prompt for Major Premise Production; Prompt_QmP: Guide Prompt for Posing the Minor Premise Question; Prompt_mP: Guide Prompt for Minor Premise Production; Prompt_SR: Guide Prompt for Final Syllogistic Reasoning; and so on.
![Figure 3](https://arxiv.org/html/2501.11599v1/x3.png)
Figure 3: Prompts for each stage of our SR-FoT.
While our proposed SR-FoT does not guarantee rigorous syllogistic reasoning 100% of the time, it aims to guide the reasoning paradigm of the LLM through carefully designed prompts and sub-tasks at each stage. By strategically controlling the input visible at each stage, we strive to ensure that the LLM conducts rigorous syllogistic reasoning and minimizes the occurrence of reasoning fallacies. Specifically, as shown in Fig. [2](https://arxiv.org/html/2501.11599v1#Sx3.F2 "Figure 2 ‣ Procedure of Our SR-FoT ‣ Methodology ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"), our SR-FoT is divided into five stages. The prompts for each stage of our SR-FoT are shown in Fig. [3](https://arxiv.org/html/2501.11599v1#Sx3.F3 "Figure 3 ‣ Procedure of Our SR-FoT ‣ Methodology ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks").
![Figure 4](https://arxiv.org/html/2501.11599v1/x4.png)
Figure 4: A case of using CoT and SR-FoT to answer a question in the ScienceQA dataset respectively. The highlighted red parts indicate the incorrect or misleading content, while the highlighted green parts indicate the content that helps correct reasoning.
Stage 1: Question Explanation. The key to utilizing syllogistic reasoning to solve various complex knowledge-based reasoning tasks lies in formulating appropriate major and minor premises that fit the current problem. Accordingly, the first stage of our SR-FoT involves using a prompt with examples to guide the LLM in interpreting the original task question and proposing a solution approach. This guidance helps direct the LLM to formulate suitable major premises and then appropriate minor ones. In this stage, besides the guidance and example prompts, the only information available to the LLM is the "original question" and the "context" provided by the task, which also includes "options" information for multiple-choice questions.
Stage 2: Major Premise Production. After acquiring the "question explanation", we gain a deeper understanding of the original question and develop an approach to solve it. This solution approach often includes guidance on what further information is needed. In this stage, based on these guidelines, we propose an appropriate major premise, which is derived from the task "context" or the built-in knowledge of the LLM. In Stage 2, besides the guidance and example prompts, the information accessible to the LLM includes the "original question", the "context", and the "question explanation" generated in the first stage.
Stage 3: Posing the Minor Premise Question. After establishing the major premise, to effectively engage in syllogistic reasoning, we need a minor premise. In syllogistic reasoning, the minor premise is a specific statement that describes the relationship between a particular instance and the category mentioned in the major premise. Through the minor premise, the universal characteristics of the major premise can be applied to the specific instance in the minor premise, which is a crucial step in using syllogistic reasoning to solve specific problems. Given the diverse and often complex nature of the knowledge-based reasoning tasks we need to address, it is challenging to provide a matched and correct minor premise in one step. Our SR-FoT divides the step of proposing a minor premise into two stages: posing the minor premise question (Stage 3) and answering the minor premise question (Stage 4). The task of the "posing the minor premise question" stage is to determine what information about the specific instance in the original question the LLM should acquire in order to utilize the major premise in answering the "original question". Therefore, in the "posing the minor premise question" stage, besides the guidance and example prompts, the LLM needs access to the "original question", the "context", and the "major premise" generated in Stage 2.
Stage 4: Minor Premise Production. The task of Stage 4 is to utilize the "context" information provided by the original task, along with the built-in knowledge of the LLM, to answer the minor premise question posed in Stage 3. This stage aims to obtain the correct information about a specific aspect of the particular instance in the original question, leading to the formulation of an accurate and matching minor premise. Given the potential complexity of the minor premise questions, we guide the LLM to employ the Chain-of-Thought (CoT) technique to answer the minor premise question and to organize and obtain the minor premise. Furthermore, to avoid interference caused by excessive information, in this stage, besides the guidance and example prompts, the LLM has access only to the "minor premise question" and the "context", without needing to see the "original question" again. The "minor premise question" and "context" already contain all the information necessary for the LLM's task at this stage; viewing additional information such as the "original question" could instead lead to distraction and affect performance.
Stage 5: Final Syllogistic Reasoning. After the aforementioned stages, complex original knowledge-based reasoning questions can now be answered using syllogistic reasoning. The specific approach involves designing appropriate task instructions and example prompts, allowing the LLM to engage in syllogistic reasoning based on the major and minor premises generated in earlier stages to arrive at the answer to the original question. Therefore, in Stage 5, we design the LLM to have access, in addition to the guidance and example prompts, to the "major premise" generated in Stage 2, the "minor premise" generated in Stage 4, and the "original question".
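To make the stage-wise information flow concrete, the following is a schematic sketch in Go (not the authors' implementation; their code is linked above) of how the five stages could be chained so that each stage sees only the inputs described. The `askLLM` function and the prompt strings are placeholders.
```go
// Schematic sketch of the five-stage SR-FoT pipeline described above.
// askLLM stands in for any chat-completion call; the prompts are placeholders,
// not the authors' actual prompt templates.
package main

import "fmt"

// askLLM is a stand-in for a real LLM API call.
func askLLM(prompt string) string {
	return "<LLM output for: " + prompt + ">"
}

func srFoT(question, context string) string {
	// Stage 1: Question Explanation (sees only question + context).
	explanation := askLLM("Explain the question and outline a solution approach.\n" +
		"Question: " + question + "\nContext: " + context)

	// Stage 2: Major Premise Production (sees question, context, explanation).
	major := askLLM("Propose a general major premise for this question.\n" +
		"Question: " + question + "\nContext: " + context + "\nExplanation: " + explanation)

	// Stage 3: Pose the Minor Premise Question (sees question, context, major premise).
	minorQ := askLLM("What must we know about this specific instance to apply the major premise?\n" +
		"Question: " + question + "\nContext: " + context + "\nMajor premise: " + major)

	// Stage 4: Minor Premise Production (sees only the minor premise question + context).
	minor := askLLM("Answer step by step, then state the minor premise.\n" +
		"Minor premise question: " + minorQ + "\nContext: " + context)

	// Stage 5: Final Syllogistic Reasoning (sees question + both premises only).
	return askLLM("Derive the answer by syllogism.\n" +
		"Major premise: " + major + "\nMinor premise: " + minor + "\nQuestion: " + question)
}

func main() {
	fmt.Println(srFoT("Is a whale a fish?", ""))
}
```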
Table 1: Comparison with the state-of-the-art methods on the ScienceQA, StrategyQA, and BoolQ datasets.
Experiments
-----------
To evaluate the effectiveness of our SR-FoT, we conducted a series of experiments using both open-source and closed-source LLMs on several common knowledge-based reasoning question-answer datasets.
### Experiment Setup
#### Datasets.
To fully demonstrate the effectiveness and generalization of our SR-FoT, we conduct a series of experiments on three datasets from different fields.
ScienceQA (Lu et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib24)) is a scientific question-answering dataset containing 21,208 multimodal multiple-choice science questions. It can be divided into three subjects: natural science, language science, and social science. It requires the language model to select one answer from multiple options, usually requiring multi-step reasoning. In our experiment, we use the test-set samples that have only a text context, 2,224 in total. We report the accuracy of our SR-FoT and comparison methods on this set.
StrategyQA (Geva et al. [2021](https://arxiv.org/html/2501.11599v1#bib.bib17)) is a question-answering dataset focusing on open-domain questions. Its questions involve multiple reasoning steps, and a strategy is needed to obtain the answers. In our experiment, we evaluate the methods by accuracy on the train set, which includes 2,290 samples.
BoolQ (Clark et al. [2019](https://arxiv.org/html/2501.11599v1#bib.bib10)) is a reading comprehension dataset consisting of 16k samples. Its questions often query for complex, non-factoid information and require difficult entailment-like inference to solve. In our experiment, we compare the accuracy of our SR-FoT with that of other methods on the dev set, which contains 3,270 samples.
#### Experimental Setting.
Our experiments are performed using API calls to the proprietary model GPT-3.5-turbo (Ouyang et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib26)), the open-source model DeepSeek-v2 (DeepSeek-AI [2024](https://arxiv.org/html/2501.11599v1#bib.bib11)) with 236B parameters, and Qwen1.5-32B-Chat (Bai et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib1)) with 32B parameters. The control-group methods include the Base method, Chain of Thought (CoT) (Wei et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib33)), Self-consistency CoT (SC-CoT) (Wang et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib32)), and Complexity-based CoT (C-CoT) (Fu et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib16)). Our own approaches include SR-FoT and Self-consistency SR-FoT (SC-SR-FoT), which is our SR-FoT following the self-consistency sampling and aggregation settings of SC-CoT. In the single-round sampling methods, which include Base, CoT, and SR-FoT, the hyperparameters on GPT-3.5-turbo and Qwen1.5-32B-Chat are set to top\_p=0.3 and temperature=0.2, while the temperature on DeepSeek-v2 is set to the default recommended value of 1 (DeepSeek only allows the temperature hyperparameter to be adjusted). In the multi-round sampling methods, which include SC-CoT, C-CoT, and SC-SR-FoT, we perform 10 samplings each. To enhance the diversity of sampling outcomes, the top\_p and temperature hyperparameters on GPT-3.5-turbo and Qwen1.5-32B-Chat are adjusted to 0.7 and 0.9, respectively. The temperature hyperparameter on DeepSeek remains at the default recommended value of 1. The number of in-context example prompts used in all methods on the ScienceQA, StrategyQA, and BoolQ datasets is 5, 2, and 2, respectively.
### Experimental Results and Analyses
#### Performance on ScienceQA.
Scientific question answering is a task scenario that often requires deductive reasoning. As seen in Tab. [1](https://arxiv.org/html/2501.11599v1#Sx3.T1 "Table 1 ‣ Procedure of Our SR-FoT ‣ Methodology ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"), under GPT-3.5-turbo, in the comparison of single-round sampling methods, our SR-FoT outperforms the Base and CoT methods by 1.5% and 0.5%, respectively, and is on par with multi-round sampling methods like SC-CoT and C-CoT. In the comparison of multi-round sampling methods, our SC-SR-FoT exceeds the SC-CoT and C-CoT methods by 1.5%. Under the open-source model DeepSeek-V2, SR-FoT outperforms Base and CoT by 4.8% and 3.2%, respectively, even surpassing multi-round sampling methods. What's more, our SC-SR-FoT further increases the accuracy to 93.0%. Under Qwen1.5-32B-Chat, compared to the Base and CoT methods, our SR-FoT shows an improvement of 1.3% and 1.6%, respectively. Compared to SC-CoT and C-CoT, our SC-SR-FoT also performs better, surpassing them by 2.8% and 3.2%, respectively. These results indicate that our methods achieve superior performance on the ScienceQA dataset under multiple models.
#### Performance on StrategyQA and BoolQ.
StrategyQA and BoolQ are two other knowledge-based reasoning question-answer datasets that require a true or false judgment based on context or common sense knowledge. From Table 1, for StrategyQA under GPT-3.5-turbo, in the comparison of single-round sampling methods, our SR-FoT outperforms Base and CoT by 7.3% and 0.8% respectively; in the comparison of multi-round sampling methods, our SC-SR-FoT exceeds SC-CoT and C-CoT by 1.9% and 3.2% respectively. Similar trends are observed under DeepSeek-V2 and Qwen1.5-32B-Chat. In addition, our SR-FoT and SC-SR-FoT also perform the best in both single-round sampling methods and multi-round sampling methods on BoolQ under the three models.
Overall, whether under the closed-source large language model GPT-3.5-turbo or the open-source large language models DeepSeek-V2 and Qwen1.5-32B-Chat, our SR-FoT achieves superior accuracy compared with the other methods on the ScienceQA, StrategyQA, and BoolQ datasets. This demonstrates the effectiveness of our SR-FoT.
It is worth noting that under DeepSeek-V2 and Qwen1.5-32B-Chat, the Base method achieves relatively high results across all three datasets, while the benefits of the CoT method show signs of saturation, and it at times performs worse than the Base method. However, our methods, whether under single-round sampling settings (SR-FoT) or multi-round aggregated sampling settings (SC-SR-FoT), are still able to further enhance performance, demonstrating greater potential for performance gains. We believe this is because our SR-FoT employs a syllogistic deductive reasoning framework, allowing LLMs to address these knowledge-based reasoning tasks through a more rigorous reasoning process, thereby achieving better overall performance.
Table 2: Effectiveness comparisons for subcategories on the ScienceQA dataset with DeepSeek-v2.
### Ablation Study
#### Effectiveness Comparisons for Subcategories.
As shown in Table [2](https://arxiv.org/html/2501.11599v1#Sx4.T2 "Table 2 ‣ Performance on StrategyQA and BoolQ. ‣ Experimental Results and Analyses ‣ Experiments ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"), we conduct experiments on the ScienceQA dataset with DeepSeek-v2. The results demonstrate that, compared with the state-of-the-art methods, our approach enhances the reasoning performance of the language model across questions of different difficulty levels and various subjects. When self-consistency sampling is added to the proposed method, the beneficial effects become more significant.
#### Ablation of Stages.
As shown in Table [3](https://arxiv.org/html/2501.11599v1#Sx4.T3 "Table 3 ‣ Ablation of Stages. ‣ Ablation Study ‣ Experiments ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"), we conduct experiments on ScienceQA under DeepSeek-V2 to verify the effectiveness of each stage in our method. Specifically, "all in one stage" denotes using instructions and examples to let the LLM provide the premises based on the question and options and then directly give the answer, while "w/o stage 3" denotes providing the minor premise directly instead of first posing it as a question and then answering it. The results demonstrate that the completeness of each stage is important: discarding either the question explanation or the major and minor premises decreases performance, and letting the language model provide the major and minor premises directly reduces performance significantly, demonstrating the necessity of the multi-stage thinking framework in our SR-FoT.
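To make the staged structure above concrete, the following is a minimal Go sketch of such a pipeline; it is not the authors' implementation. `askLLM` is a hypothetical single-call helper, and the prompt strings only indicate what each stage asks for, with the actual prompts and the information visible at each stage following the paper.

```go
package srfot

// srFoT sketches the five-stage flow: question explanation, major premise,
// minor premise question, minor premise, and final syllogistic reasoning.
// askLLM is a hypothetical helper that sends one prompt and returns the reply.
func srFoT(askLLM func(prompt string) string, question, options, context string) string {
    // Stage 1: explain the question to orient the later premises
    explanation := askLLM("Explain what this question is asking: " + question + " " + options)

    // Stage 2: state a general rule (major premise) relevant to the question
    major := askLLM("Give a general rule useful for answering the question.\n" + question + "\n" + explanation)

    // Stage 3: pose the minor premise as a question, generated from the context
    minorQ := askLLM("Ask the factual question whose answer will serve as the minor premise.\nContext: " + context + "\nMajor premise: " + major)

    // Stage 4: answer the minor premise question to obtain the minor premise
    minor := askLLM("Answer using the context.\nContext: " + context + "\nQuestion: " + minorQ)

    // Stage 5: final deduction from the major and minor premises
    return askLLM("Major premise: " + major + "\nMinor premise: " + minor + "\nUsing syllogistic deduction, answer: " + question + " " + options)
}
```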
Table 3: Ablation study of stages in our proposed methods on the ScienceQA dataset under DeepSeek-V2.
Table 4: Ablation study of visible information in various stages on the StrategyQA dataset.
Table 5: Rigor analysis of CoT and SR-FoT on 50 cases randomly selected from each of the three datasets under GPT-3.5-turbo; the rigor rate of each method is reported.
Table 6: Error sources for a random sample of 50 incorrect SR-FoT examples from each of the three datasets using the GPT-3.5-turbo model. MaPE denotes major premise error, MiPQE denotes minor premise question error, MiPE denotes minor premise error, and FRPE denotes final reasoning process error.
#### Impact of Visible Information in Various Stages.
As shown in Table [4](https://arxiv.org/html/2501.11599v1#Sx4.T4 "Table 4 ‣ Ablation of Stages. ‣ Ablation Study ‣ Experiments ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"), we conduct experiments on the StrategyQA dataset with DeepSeek-V2. "w/o context in stage 3" denotes that the minor premise question is generated without considering the context, and "add Qori in stage 4" denotes providing the original question, the minor premise question, and the context all to the LLM when answering the minor premise question. The results demonstrate that both reducing and expanding the content of the input prompts adversely affect performance. This underlines the appropriateness of the visible information designed for each stage of our SR-FoT.
### Rigor Analysis
To more directly analyze whether our SR-FoT improves the rigor of the reasoning process compared to CoT, we randomly select 50 cases from each of the three datasets under GPT-3.5-turbo for manual evaluation. For both CoT and SR-FoT, if all intermediate steps from the first reasoning step to the final answer are correct and logically progressive, without any factual inconsistencies or self-inconsistencies, we consider the case rigorous; otherwise, it is not rigorous. The results are shown in Table [5](https://arxiv.org/html/2501.11599v1#Sx4.T5 "Table 5 ‣ Ablation of Stages. ‣ Ablation Study ‣ Experiments ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"). From the table, it can be seen that our SR-FoT has a higher rigor rate than CoT on all three datasets, indicating that our SR-FoT enhances the rigor and interpretability of LLM reasoning. For specific comparison cases regarding rigor, please refer to the supplementary materials.
### Error Analysis
We randomly select 50 error cases from each of the three datasets under GPT-3.5-turbo to perform an error analysis of our SR-FoT. The sources of errors and their respective proportions are shown in Table [6](https://arxiv.org/html/2501.11599v1#Sx4.T6 "Table 6 ‣ Ablation of Stages. ‣ Ablation Study ‣ Experiments ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"). The analysis shows that the proportions of the different error types vary across datasets. On ScienceQA, most errors stem from the step of extracting a suitable minor premise from the question information. On StrategyQA, the main errors stem from the final reasoning process and from mistakes in presenting the major premise. On BoolQ, the primary errors originate from the final reasoning process and from errors in formulating the minor premise.
### Case Study
We present a case from the ScienceQA dataset to show how CoT and SR-FoT work. As shown in Fig. [4](https://arxiv.org/html/2501.11599v1#Sx3.F4 "Figure 4 ‣ Procedure of Our SR-FoT ‣ Methodology ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"), in the fourth and fifth reasoning steps of CoT the model misunderstands the rhyme condition and thus infers incorrect information, resulting in a wrong final answer. In SR-FoT, the question explanation points out a reasonable direction for the major premise, the major premise then supplies more complete rhyme conditions, and the minor premise correctly distinguishes the different ending sounds; with their joint help, the model gives the correct final answer. More cases can be found in the supplementary materials.
Conclusion
----------
In this paper, we have developed a multi-stage syllogistic-reasoning framework of thought (SR-FoT) to guide LLMs in solving complex and diverse knowledge-based reasoning question-answering tasks via syllogistic deductive reasoning. Experiments on multiple knowledge-based reasoning datasets with various LLMs demonstrate the effectiveness and advantages of our method.

View File

@ -0,0 +1,78 @@
package arxiv
import (
"context"
"encoding/xml"
"fmt"
"io"
"net/http"
"net/url"
"time"
)
// Client represents an arXiv API client
type Client struct {
baseURL string
httpClient *http.Client
}
// NewClient creates a new arXiv API client
func NewClient() *Client {
return &Client{
baseURL: "http://export.arxiv.org/api/query",
httpClient: &http.Client{Timeout: 30 * time.Second},
}
}
// Query represents search parameters for arXiv API
type Query struct {
Category string
DateRange string
MaxResults int
StartOffset int
}
// Paper represents a single arXiv paper
type Paper struct {
ID string `xml:"id"`
Title string `xml:"title"`
Summary string `xml:"summary"`
Published time.Time `xml:"published"`
Updated time.Time `xml:"updated"`
Authors []Author `xml:"author"`
}
// Author represents a paper author
type Author struct {
Name string `xml:"name"`
}
// FetchPapers retrieves papers from arXiv API
func (c *Client) FetchPapers(ctx context.Context, query Query) ([]Paper, error) {
params := url.Values{}
params.Add("search_query", fmt.Sprintf("%s AND submittedDate:[%s]", query.Category, query.DateRange))
params.Add("max_results", fmt.Sprintf("%d", query.MaxResults))
params.Add("start", fmt.Sprintf("%d", query.StartOffset))
// Use the caller's context so cancellation and deadlines are honored
req, err := http.NewRequestWithContext(ctx, http.MethodGet, fmt.Sprintf("%s?%s", c.baseURL, params.Encode()), nil)
if err != nil {
return nil, fmt.Errorf("failed to create request: %w", err)
}
resp, err := c.httpClient.Do(req)
if err != nil {
return nil, fmt.Errorf("failed to fetch papers: %w", err)
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
return nil, fmt.Errorf("unexpected status code: %d", resp.StatusCode)
}
return parseResponse(resp.Body)
}
func parseResponse(r io.Reader) ([]Paper, error) {
var feed struct {
Entries []Paper `xml:"entry"`
}
if err := xml.NewDecoder(r).Decode(&feed); err != nil {
return nil, fmt.Errorf("failed to parse response: %w", err)
}
return feed.Entries, nil
}
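A minimal usage sketch for this client, mirroring how the top-level main.go drives it (the date-range string is interpolated into the submittedDate filter built in FetchPapers above):

```go
package main

import (
    "context"
    "fmt"
    "log"

    "arxiv-processor/arxiv"
)

func main() {
    client := arxiv.NewClient()
    query := arxiv.Query{
        Category:   "cs.AI",
        DateRange:  "20240101 TO 20240131", // becomes submittedDate:[20240101 TO 20240131]
        MaxResults: 100,
    }
    papers, err := client.FetchPapers(context.Background(), query)
    if err != nil {
        log.Fatalf("fetch failed: %v", err)
    }
    fmt.Printf("fetched %d papers\n", len(papers))
}
```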

3
arxiv-processor/go.mod Normal file
View File

@ -0,0 +1,3 @@
module arxiv-processor
go 1.21

214
arxiv-processor/main.go Normal file
View File

@ -0,0 +1,214 @@
package main
import (
"bytes"
"encoding/json"
"encoding/xml"
"flag"
"fmt"
"io"
"log"
"net/http"
"net/url"
"os"
"strings"
"time"
)
func main() {
searchQuery := flag.String("search", "", "Search query")
dateRange := flag.String("date-range", "", "Date range in YYYYMMDD-YYYYMMDD format")
outputFile := flag.String("output", "papers_data.json", "Output file path")
flag.Parse()
if *searchQuery == "" || *dateRange == "" {
log.Fatal("Both --search and --date-range are required")
}
// Parse date range
dates := strings.Split(*dateRange, "-")
if len(dates) != 2 {
log.Fatal("Invalid date range format. Use YYYYMMDD-YYYYMMDD")
}
startDate, err := time.Parse("20060102", dates[0])
if err != nil {
log.Fatalf("Invalid start date: %v", err)
}
endDate, err := time.Parse("20060102", dates[1])
if err != nil {
log.Fatalf("Invalid end date: %v", err)
}
// Create arXiv client
client := NewClient()
// Fetch papers
fmt.Println("Fetching papers...")
papers, err := client.FetchPapers(*searchQuery, startDate, endDate)
if err != nil {
log.Fatalf("Failed to fetch papers: %v", err)
}
fmt.Printf("Fetched %d papers\n", len(papers))
// Save papers to JSON
err = SavePapers(*outputFile, papers)
if err != nil {
log.Fatalf("Failed to save papers: %v", err)
}
fmt.Printf("Saved paper data to %s\n", *outputFile)
// LLM processing placeholder
fmt.Println("\nTo process these papers with an LLM:")
fmt.Println("1. Choose an LLM API (e.g., OpenAI, Anthropic, local model)")
fmt.Println("2. Implement the LLM integration in llm package")
fmt.Println("3. Run the program again with your criteria")
}
// Paper represents a single arXiv paper
type Paper struct {
Title string `json:"title"`
Abstract string `json:"abstract"`
ID string `json:"arxiv_id"`
}
// Client handles arXiv API interactions
type Client struct {
httpClient *http.Client
baseURL string
}
// NewClient creates a new arXiv client
func NewClient() *Client {
return &Client{
httpClient: &http.Client{
Timeout: 30 * time.Second,
},
baseURL: "http://export.arxiv.org/api/query",
}
}
// FetchPapers retrieves papers matching the search query within the date range
func (c *Client) FetchPapers(searchQuery string, startDate, endDate time.Time) ([]Paper, error) {
var papers []Paper
start := 0
batchSize := 100
delay := 3 * time.Second
for {
// Construct query with date range and pagination
query := fmt.Sprintf("%s AND submittedDate:[%s TO %s]",
searchQuery,
startDate.Format("20060102"),
endDate.Format("20060102"),
)
// Build URL with query parameters
params := url.Values{}
params.Add("search_query", query)
params.Add("start", fmt.Sprintf("%d", start))
params.Add("max_results", fmt.Sprintf("%d", batchSize))
reqURL := fmt.Sprintf("%s?%s", c.baseURL, params.Encode())
// Make HTTP request
fmt.Printf("Making request to: %s\n", reqURL)
resp, err := c.httpClient.Get(reqURL)
if err != nil {
return nil, fmt.Errorf("failed to fetch papers: %w", err)
}
// Read and log raw response for debugging, then close the body promptly since we are inside a loop
bodyBytes, err := io.ReadAll(resp.Body)
resp.Body.Close()
if err != nil {
return nil, fmt.Errorf("failed to read response body: %w", err)
}
fmt.Printf("Response status: %s\n", resp.Status)
fmt.Printf("Raw API response (first 1000 chars):\n%s\n", string(bodyBytes[:min(len(bodyBytes), 1000)]))
// Parse XML response
batch, totalResults, err := c.parseResponse(bytes.NewReader(bodyBytes))
if err != nil {
return nil, fmt.Errorf("failed to parse response: %w", err)
}
papers = append(papers, batch...)
start += len(batch)
// Check if we've fetched all papers
if start >= totalResults || len(batch) == 0 {
break
}
// Respect arXiv's rate limits
time.Sleep(delay)
}
return papers, nil
}
// parseResponse handles XML parsing of the arXiv API response
func (c *Client) parseResponse(r io.Reader) ([]Paper, int, error) {
type atomEntry struct {
Title string `xml:"title"`
Summary string `xml:"summary"`
ID string `xml:"id"`
}
type atomFeed struct {
XMLName xml.Name `xml:"feed"`
TotalResults int `xml:"totalResults"`
Entries []atomEntry `xml:"entry"`
}
var feed atomFeed
if err := xml.NewDecoder(r).Decode(&feed); err != nil {
return nil, 0, fmt.Errorf("failed to decode XML: %w", err)
}
var papers []Paper
for _, entry := range feed.Entries {
// Extract just the ID part from the full URL
idParts := strings.Split(entry.ID, "/abs/")
if len(idParts) < 2 {
continue
}
papers = append(papers, Paper{
Title: strings.TrimSpace(entry.Title),
Abstract: strings.TrimSpace(entry.Summary),
ID: idParts[1],
})
}
return papers, feed.TotalResults, nil
}
// SavePapers saves the papers to a JSON file
func SavePapers(filename string, papers []Paper) error {
file, err := os.Create(filename)
if err != nil {
return fmt.Errorf("failed to create file: %w", err)
}
defer file.Close()
encoder := json.NewEncoder(file)
encoder.SetIndent("", " ")
if err := encoder.Encode(papers); err != nil {
return fmt.Errorf("failed to encode JSON: %w", err)
}
return nil
}
func min(a, b int) int {
if a < b {
return a
}
return b
}

View File

@ -0,0 +1,45 @@
package storage
import (
"encoding/json"
"fmt"
"os"
"time"
"arxiv-processor/arxiv"
)
// JSONStorage handles saving papers to JSON files
type JSONStorage struct {
FilePath string
}
// NewJSONStorage creates a new JSON storage handler
func NewJSONStorage(filePath string) *JSONStorage {
return &JSONStorage{FilePath: filePath}
}
// SavePapers saves papers to a JSON file
func (s *JSONStorage) SavePapers(papers []arxiv.Paper) error {
file, err := os.Create(s.FilePath)
if err != nil {
return fmt.Errorf("failed to create file: %w", err)
}
defer file.Close()
encoder := json.NewEncoder(file)
encoder.SetIndent("", " ")
if err := encoder.Encode(papers); err != nil {
return fmt.Errorf("failed to encode papers: %w", err)
}
return nil
}
// Paper represents a simplified paper structure for JSON storage
type Paper struct {
ID string `json:"id"`
Title string `json:"title"`
Published time.Time `json:"published"`
Authors []string `json:"authors"`
}

43
criteria.txt Normal file
View File

@ -0,0 +1,43 @@
Criteria for Paper Selection:
When evaluating paper titles and abstracts, please consider the following criteria. Prioritize inclusiveness to reduce the risk of missing potentially interesting research.
Papers must meet ALL of these criteria:
1. Practical Applications: Does the paper focus on real-world or practical applications of Large Language Models (LLMs), particularly in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI?
2. Experimental Results and Quantitative Metrics: Does the paper include experimental results with quantitative metrics that demonstrate performance improvements or innovative applications? Look for concrete results related to areas like prompt engineering.
3. Comparison with State-of-the-Art: Does the paper compare its results or methodologies with existing state-of-the-art techniques, and does it demonstrate any advancements in terms of performance, scalability, or efficiency?
Papers should meet at least one of these criteria:
4. Methodology and Implementation Details: Does the paper clearly describe its methodology and implementation, particularly how LLMs are utilized in practical or real-world tasks?
5. Real-world Applications and Challenges: Does the paper discuss real-world applications or address limitations and challenges involving LLMs, especially in autonomous or agentic AI scenarios?
Additional Considerations:
- Novelty: Does the paper introduce novel approaches or techniques, especially those extending the application or integration of LLMs with technologies like knowledge graphs?
- Is the paper's approach implementable with current standard tools?
- Agentic AI: Does the paper describe applications where LLMs enable AI to autonomously perform complex, real-world tasks beyond sandboxed interactions?
- Reproducibility and Documentation: Are there sufficient details to support reproducibility and transparent documentation of the methodologies and results?
- Impact through Experimental Validation: Does the paper show robust experimental validation, with datasets and methods that closely reflect real-world scenarios?
The paper should be REJECTED if it:
- Primarily focuses on medical applications of AI
- Primarily focuses on social applications of AI in regard to Diversity, Social harm, or similar issues.
- Primarily focuses on video processing
- Primarily focuses on responsible AI application or AI ethics
- Primarily focuses on law, either with AI as subject or participant.
Instructions:
- Analyze each paper's title and abstract to determine how many criteria are met.
- Prioritize inclusiveness: Favor selection over rejection if a paper seems marginally relevant.
- Use a simple scoring system: A paper meeting at least two or three criteria should be accepted for further review.
Please make a best-guess decision based on the information provided in the titles and abstracts. Err on the side of potential inclusion to capture a broad spectrum of relevant research.

15
go.mod Normal file
View File

@ -0,0 +1,15 @@
module paper-system
go 1.23.4
replace arxiv-processor => ./arxiv-processor
replace llm_processor => ./llm_processor
replace json2md => ./json2md
require (
arxiv-processor v0.0.0-00010101000000-000000000000
json2md v0.0.0-00010101000000-000000000000
llm_processor v0.0.0-00010101000000-000000000000
)

83
json2md/README.md Normal file
View File

@ -0,0 +1,83 @@
# json2md - JSON to Markdown Converter
A Go module for converting academic paper review data from JSON to formatted Markdown reports.
## Features
- CLI interface for easy conversion
- Reusable library package
- Supports accepted/rejected paper categorization
- Preserves abstract formatting
- Generates arXiv links automatically
## Installation
```bash
git clone https://github.com/yourusername/json2md.git
cd json2md
go build
```
## Command Line Usage
```bash
./json2md --input papers.json --output results.md
# Or directly via Go:
go run main.go --input ../papers.json --output ../results.md
```
## Library Usage
```go
package main
import (
"fmt"
"json2md/lib"
)
func main() {
// Process JSON file
data, err := lib.ProcessJSONFile("input.json")
if err != nil {
panic(err)
}
// Generate Markdown
if err := lib.GenerateMarkdown(data, "output.md"); err != nil {
panic(err)
}
fmt.Println("Conversion successful!")
}
```
## Input JSON Format
Expected structure:
```json
{
"accepted": [
{
"paper": {
"title": "Paper Title",
"arxiv_id": "1234.56789",
"abstract": "Paper abstract..."
},
"explanation": "Acceptance reason"
}
],
"rejected": [
{
"paper": { /* same structure */ },
"explanation": "Rejection reason"
}
]
}
```
## Project Structure
```
json2md/
├── lib/ # Core conversion library
│ └── lib.go
├── main.go # CLI interface
├── go.mod # Module configuration
└── README.md # This file
```

3
json2md/go.mod Normal file
View File

@ -0,0 +1,3 @@
module json2md
go 1.21

136
json2md/lib/lib.go Normal file
View File

@ -0,0 +1,136 @@
package lib
import (
"encoding/json"
"fmt"
"os"
"strings"
)
type Paper struct {
Title string `json:"title"`
ArxivID string `json:"arxiv_id"`
Abstract string `json:"abstract"`
}
type PaperDecision struct {
Paper Paper `json:"paper"`
Explanation string `json:"explanation"`
}
type DecisionData struct {
Accepted []PaperDecision `json:"accepted"`
Rejected []PaperDecision `json:"rejected"`
}
func ProcessJSONFile(inputPath string) (DecisionData, error) {
var data DecisionData
jsonFile, err := os.Open(inputPath)
if err != nil {
return data, fmt.Errorf("error opening input file: %w", err)
}
defer jsonFile.Close()
// First try decoding as DecisionData format
decoder := json.NewDecoder(jsonFile)
if err := decoder.Decode(&data); err == nil {
return data, nil
}
// Try alternative formats
jsonData, err := os.ReadFile(inputPath)
if err != nil {
return data, fmt.Errorf("error reading input file: %w", err)
}
// Try array format with status field
var papers []struct {
Paper Paper `json:"paper"`
Explanation string `json:"explanation"`
Status string `json:"status"`
}
if err := json.Unmarshal(jsonData, &papers); err == nil {
for _, p := range papers {
switch p.Status {
case "accepted":
data.Accepted = append(data.Accepted, PaperDecision{p.Paper, p.Explanation})
case "rejected":
data.Rejected = append(data.Rejected, PaperDecision{p.Paper, p.Explanation})
default:
fmt.Printf("Warning: Paper '%s' has unknown status '%s'\n", p.Paper.Title, p.Status)
}
}
return data, nil
}
// Try original object format
var objData struct {
Accepted []PaperDecision `json:"accepted"`
Rejected []PaperDecision `json:"rejected"`
}
if err := json.Unmarshal(jsonData, &objData); err != nil {
return data, fmt.Errorf("error decoding JSON: %w", err)
}
data.Accepted = objData.Accepted
data.Rejected = objData.Rejected
return data, nil
}
func GenerateMarkdown(data DecisionData, outputPath string) error {
file, err := os.Create(outputPath)
if err != nil {
return fmt.Errorf("error creating output file: %w", err)
}
defer file.Close()
// Accepted Papers section
fmt.Fprintln(file, "# Accepted Papers\n")
for _, paper := range data.Accepted {
writePaperSection(file, paper, "accepted")
}
// Rejected Papers section
fmt.Fprintln(file, "# Rejected Papers\n")
for _, paper := range data.Rejected {
writePaperSection(file, paper, "rejected")
}
return nil
}
func writePaperSection(file *os.File, paper PaperDecision, decisionType string) {
// Escape special characters in title
title := strings.ReplaceAll(paper.Paper.Title, "[", "\\[")
title = strings.ReplaceAll(title, "]", "\\]")
fmt.Fprintf(file, "## [%s](https://arxiv.org/abs/%s)\n", title, paper.Paper.ArxivID)
fmt.Fprintf(file, "**arXiv ID:** %s\n\n", paper.Paper.ArxivID)
fmt.Fprintln(file, "**Abstract:**")
// Format abstract as blockquote, handling multiple paragraphs
abstract := strings.TrimSpace(paper.Paper.Abstract)
paragraphs := strings.Split(abstract, "\n\n")
for i, p := range paragraphs {
// Ensure each line starts with >
lines := strings.Split(strings.TrimSpace(p), "\n")
for _, line := range lines {
fmt.Fprintf(file, "> %s\n", strings.TrimSpace(line))
}
if i < len(paragraphs)-1 {
fmt.Fprintln(file, ">")
}
}
fmt.Fprintln(file)
// Format explanation, ensuring it's properly escaped and formatted
fmt.Fprintln(file, "**Decision Explanation:**")
explanation := strings.TrimSpace(paper.Explanation)
// If the explanation is already properly formatted (no raw JSON or code blocks),
// write it directly
fmt.Fprintf(file, "%s\n\n", explanation)
fmt.Fprintln(file, "---\n")
}

35
json2md/main.go Normal file
View File

@ -0,0 +1,35 @@
package main
import (
"flag"
"fmt"
"os"
"json2md/lib"
)
func main() {
inputPath := flag.String("input", "", "Input JSON file path")
outputPath := flag.String("output", "decisions.md", "Output Markdown file path")
flag.Parse()
if *inputPath == "" {
fmt.Println("Error: --input parameter is required")
os.Exit(1)
}
// Process JSON file using library
data, err := lib.ProcessJSONFile(*inputPath)
if err != nil {
fmt.Printf("Error processing JSON file: %v\n", err)
os.Exit(1)
}
// Generate Markdown using library
if err := lib.GenerateMarkdown(data, *outputPath); err != nil {
fmt.Printf("Error generating Markdown: %v\n", err)
os.Exit(1)
}
fmt.Printf("Successfully generated Markdown file: %s\n", *outputPath)
}

3997
json2md/papers-output.md Normal file

File diff suppressed because it is too large Load Diff

82
llm_processor/README.md Normal file
View File

@ -0,0 +1,82 @@
# LLM Processor Module
A Go module for processing papers through language models with configurable criteria.
## Installation
This module lives inside the paper-system repository and is consumed as a local module: the top-level `go.mod` points at it with `replace llm_processor => ./llm_processor`, so no separate installation step is required.
## Command Line Usage
The module provides a CLI interface for processing papers:
```bash
go run main.go --input papers.json --output results.json --criteria criteria.txt
```
## API Usage
Import the module in your Go code:
```go
import "llm_processor"
```
### Example: Processing Papers
```go
package main

import (
    "context"
    "fmt"
    "os"
    "time"

    "llm_processor/processor"
)

func main() {
    // The OpenRouter API key is read from the environment
    apiKey := os.Getenv("OPENROUTER_API_KEY")

    // Initialize processor with model name, batch size, and API key
    proc := processor.NewProcessor("nvidia/llama-3.1-nemotron-70b-instruct", 32, apiKey)
    proc.SetTimeout(1 * time.Hour)

    // Process papers; results are written incrementally to results.json
    err := proc.ProcessPapers(context.Background(), "papers.json", "results.json", "criteria.txt", 1*time.Second)
    if err != nil {
        panic(err)
    }

    fmt.Println("Processing complete!")
}
```
## Configuration
The processor supports the following configuration:
- Model selection (any OpenRouter model ID, e.g. `nvidia/llama-3.1-nemotron-70b-instruct`)
- Batch size
- Custom criteria files
- Input/output file paths
- Delay between papers and an overall processing timeout
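For example, with `OPENROUTER_API_KEY` set in the environment, an invocation that sets each of these options explicitly (long flag names as registered in `main.go`) might look like:

```bash
go run main.go \
  --input papers.json \
  --output results.json \
  --criteria criteria.txt \
  --model nvidia/llama-3.1-nemotron-70b-instruct \
  --batch-size 32 \
  --delay 1 \
  --timeout 3600
```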
## Dependencies
- Go 1.21+
- OpenRouter API key (set as the environment variable OPENROUTER_API_KEY)
## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
## License
[MIT](https://choosealicense.com/licenses/mit/)

View File

@ -0,0 +1,326 @@
package client
import (
"bytes"
"context"
"encoding/json"
"errors"
"fmt"
"io"
"log"
"math"
"net"
"net/http"
"os"
"strings"
"time"
"llm_processor/models"
)
const (
openRouterURL = "https://openrouter.ai/api/v1/chat/completions"
maxRetries = 5
initialDelay = 1 * time.Second
evaluationTimeout = 15 * time.Minute
requestTimeout = 5 * time.Minute
connectionTimeout = 2 * time.Minute
)
type OpenRouterClient struct {
apiKey string
httpClient *http.Client
logger *log.Logger
createClient func() *http.Client
}
func NewOpenRouterClient(apiKey string) *OpenRouterClient {
logFile, err := os.OpenFile("debug.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
if err != nil {
log.Fatalf("Failed to open debug.log: %v", err)
}
logger := log.New(io.MultiWriter(os.Stdout, logFile), "", log.LstdFlags|log.Lshortfile)
logger.Println("Initializing OpenRouter client")
createClient := func() *http.Client {
transport := &http.Transport{
MaxIdleConns: 100,
MaxIdleConnsPerHost: 100,
IdleConnTimeout: 30 * time.Second,
DialContext: (&net.Dialer{
Timeout: connectionTimeout,
KeepAlive: 30 * time.Second,
}).DialContext,
TLSHandshakeTimeout: 10 * time.Second,
}
return &http.Client{
Timeout: requestTimeout,
Transport: transport,
}
}
client := &OpenRouterClient{
apiKey: apiKey,
httpClient: createClient(),
logger: logger,
createClient: createClient,
}
return client
}
type ChatCompletionRequest struct {
Model string `json:"model"`
Messages []ChatMessage `json:"messages"`
}
type ChatMessage struct {
Role string `json:"role"`
Content string `json:"content"`
}
type ChatCompletionResponse struct {
Choices []struct {
Message ChatMessage `json:"message"`
} `json:"choices"`
}
func (c *OpenRouterClient) EvaluatePaper(ctx context.Context, paper models.Paper, criteria string, model string) (string, error) {
// Create a new context with evaluation timeout
evalCtx, cancel := context.WithTimeout(ctx, evaluationTimeout)
defer cancel()
startTime := time.Now()
c.logger.Printf("Starting evaluation for paper: %s\n", paper.Title)
c.logger.Printf("Evaluation timeout: %s\n", evaluationTimeout)
fmt.Printf("Starting evaluation for paper: %s at %s\n", paper.Title, startTime.Format(time.RFC3339))
prompt := fmt.Sprintf(`Evaluate this paper based on the following criteria:
%s
Paper Title: %s
Abstract: %s
Respond ONLY with a JSON object in this exact format:
{
"decision": "ACCEPT or REJECT",
"explanation": "Your explanation here"
}
Do not include any other information in your response.
IMPORTANT:
1. The decision MUST be either "ACCEPT" or "REJECT" (uppercase)
2. The explanation should be a clear, concise reason for your decision
3. Do not include any text outside the JSON object
4. Ensure the response is valid JSON (proper quotes and escaping)
5. Do not include any markdown or formatting`, criteria, paper.Title, paper.Abstract)
reqBody := ChatCompletionRequest{
Model: model,
Messages: []ChatMessage{
{
Role: "system",
Content: "You are a research paper evaluator. Respond only with the requested JSON format.",
},
{
Role: "user",
Content: prompt,
},
},
}
var lastErr error
for attempt := 0; attempt < maxRetries; attempt++ {
attemptStart := time.Now()
c.logger.Printf("Attempt %d started at %s\n", attempt+1, attemptStart.Format(time.RFC3339))
if attempt > 0 {
delay := time.Duration(math.Pow(2, float64(attempt))) * initialDelay
select {
case <-time.After(delay):
case <-ctx.Done():
return "", ctx.Err()
}
}
jsonBody, err := json.Marshal(reqBody)
if err != nil {
return "", fmt.Errorf("failed to marshal request body: %w", err)
}
req, err := http.NewRequestWithContext(evalCtx, "POST", openRouterURL, bytes.NewBuffer(jsonBody))
if err != nil {
return "", fmt.Errorf("failed to create request: %w", err)
}
req.Header.Set("Authorization", "Bearer "+c.apiKey)
req.Header.Set("Content-Type", "application/json")
req.Header.Set("HTTP-Referer", "https://github.com/stwhite/arvix")
req.Header.Set("X-Title", "ArXiv Paper Processor")
resp, err := c.httpClient.Do(req)
if err != nil {
// Log the specific error type
c.logger.Printf("Attempt %d error: %v\n", attempt+1, err)
// Handle context cancellation/timeout
if errors.Is(err, context.DeadlineExceeded) {
c.logger.Printf("Context deadline exceeded, retrying...\n")
lastErr = fmt.Errorf("context deadline exceeded")
continue
}
// On timeout errors, create a new client
if strings.Contains(err.Error(), "timeout") {
c.logger.Printf("Timeout detected, recreating HTTP client...\n")
c.httpClient = c.createClient()
}
lastErr = fmt.Errorf("attempt %d: %w", attempt+1, err)
continue
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(resp.Body)
lastErr = fmt.Errorf("attempt %d: openrouter request failed: %s - %s", attempt+1, resp.Status, string(body))
continue
}
var completionResp ChatCompletionResponse
if err := json.NewDecoder(resp.Body).Decode(&completionResp); err != nil {
lastErr = fmt.Errorf("attempt %d: failed to decode response: %w", attempt+1, err)
continue
}
if len(completionResp.Choices) == 0 {
lastErr = fmt.Errorf("attempt %d: no choices in response", attempt+1)
continue
}
rawContent := completionResp.Choices[0].Message.Content
// Try to parse as JSON first
var jsonResponse map[string]interface{}
err = json.Unmarshal([]byte(rawContent), &jsonResponse)
if err != nil {
// If direct JSON parsing fails, try extracting from markdown code block
startIdx := bytes.Index([]byte(rawContent), []byte("```json"))
if startIdx >= 0 {
startIdx += len("```json")
endIdx := bytes.Index([]byte(rawContent[startIdx:]), []byte("```"))
if endIdx >= 0 {
jsonContent := rawContent[startIdx : startIdx+endIdx]
err = json.Unmarshal([]byte(jsonContent), &jsonResponse)
if err != nil {
// If still failing, try to parse as raw JSON without code block
err = json.Unmarshal([]byte(rawContent), &jsonResponse)
}
}
}
}
if err == nil {
// Validate and normalize decision
if decision, ok := jsonResponse["decision"].(string); ok {
// Normalize decision value
normalizedDecision := strings.ToUpper(strings.TrimSpace(decision))
if strings.Contains(normalizedDecision, "ACCEPT") {
normalizedDecision = "ACCEPT"
} else if strings.Contains(normalizedDecision, "REJECT") {
normalizedDecision = "REJECT"
}
if normalizedDecision == "ACCEPT" || normalizedDecision == "REJECT" {
// Preserve original decision in explanation
if explanation, ok := jsonResponse["explanation"]; !ok {
jsonResponse["explanation"] = fmt.Sprintf("Original decision: %s\n", decision)
} else {
jsonResponse["explanation"] = fmt.Sprintf("Original decision: %s\n%s", decision, explanation)
}
// Parse nested JSON in explanation if present
if explanation, ok := jsonResponse["explanation"].(string); ok {
var nested map[string]interface{}
if err := json.Unmarshal([]byte(explanation), &nested); err == nil {
jsonResponse["explanation"] = nested
}
}
// Ensure consistent response format
response := map[string]interface{}{
"paper": map[string]interface{}{
"title": paper.Title,
"abstract": paper.Abstract,
"arxiv_id": paper.ArxivID,
},
"decision": normalizedDecision,
"explanation": jsonResponse["explanation"],
}
responseJSON, err := json.Marshal(response)
if err != nil {
return "", fmt.Errorf("failed to marshal response: %w", err)
}
duration := time.Since(startTime)
c.logger.Printf("Successfully evaluated paper: %s\n", paper.Title)
c.logger.Printf("Total time: %s\n", duration)
c.logger.Printf("Attempts: %d\n", attempt+1)
return string(responseJSON), nil
}
}
}
// If direct JSON parsing fails, try extracting from markdown code block
startIdx := bytes.Index([]byte(rawContent), []byte("```json"))
if startIdx >= 0 {
startIdx += len("```json")
endIdx := bytes.Index([]byte(rawContent[startIdx:]), []byte("```"))
if endIdx >= 0 {
jsonContent := rawContent[startIdx : startIdx+endIdx]
err = json.Unmarshal([]byte(jsonContent), &jsonResponse)
if err == nil {
if decision, ok := jsonResponse["decision"].(string); ok {
if decision == "ACCEPT" || decision == "REJECT" {
duration := time.Since(startTime)
c.logger.Printf("Successfully evaluated paper: %s\n", paper.Title)
c.logger.Printf("Total time: %s\n", duration)
c.logger.Printf("Attempts: %d\n", attempt+1)
return jsonContent, nil
}
}
}
}
}
// Fallback parsing if JSON is still invalid
decision := "ERROR"
if strings.Contains(rawContent, "ACCEPT") {
decision = "ACCEPT"
} else if strings.Contains(rawContent, "REJECT") {
decision = "REJECT"
}
// Create fallback response
fallbackResponse := map[string]interface{}{
"decision": decision,
"explanation": fmt.Sprintf("Original response: %s", rawContent),
}
fallbackJSON, _ := json.Marshal(fallbackResponse)
duration := time.Since(startTime)
c.logger.Printf("Fallback parsing used for paper: %s\n", paper.Title)
c.logger.Printf("Total time: %s\n", duration)
c.logger.Printf("Attempts: %d\n", attempt+1)
return string(fallbackJSON), nil
}
duration := time.Since(startTime)
c.logger.Printf("Failed to evaluate paper: %s\n", paper.Title)
c.logger.Printf("Total time: %s\n", duration)
c.logger.Printf("Attempts: %d\n", maxRetries)
return "", fmt.Errorf("max retries (%d) exceeded: %w", maxRetries, lastErr)
}
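For callers that want to use this client directly rather than through the processor package, a minimal sketch is shown below; the model name and paper fields are illustrative placeholders.

```go
package main

import (
    "context"
    "fmt"
    "log"
    "os"

    "llm_processor/client"
    "llm_processor/models"
)

func main() {
    c := client.NewOpenRouterClient(os.Getenv("OPENROUTER_API_KEY"))
    paper := models.Paper{
        Title:    "Example Title",
        Abstract: "Example abstract.",
        ArxivID:  "0000.00000", // placeholder ID
    }
    criteria := "Accept papers about prompt engineering for LLMs; reject everything else."

    // EvaluatePaper returns a JSON string containing paper, decision, and explanation fields
    result, err := c.EvaluatePaper(context.Background(), paper, criteria, "nvidia/llama-3.1-nemotron-70b-instruct")
    if err != nil {
        log.Fatalf("evaluation failed: %v", err)
    }
    fmt.Println(result)
}
```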

10
llm_processor/go.mod Normal file
View File

@ -0,0 +1,10 @@
module llm_processor
go 1.21
require github.com/spf13/cobra v1.8.0
require (
github.com/inconshreveable/mousetrap v1.1.0 // indirect
github.com/spf13/pflag v1.0.5 // indirect
)

10
llm_processor/go.sum Normal file
View File

@ -0,0 +1,10 @@
github.com/cpuguy83/go-md2man/v2 v2.0.3/go.mod h1:tgQtvFlXSQOSOSIRvRPT7W67SCa46tRHOmNcaadrF8o=
github.com/inconshreveable/mousetrap v1.1.0 h1:wN+x4NVGpMsO7ErUn/mUI3vEoE6Jt13X2s0bqwp9tc8=
github.com/inconshreveable/mousetrap v1.1.0/go.mod h1:vpF70FUmC8bwa3OWnCshd2FqLfsEA9PFc4w1p2J65bw=
github.com/russross/blackfriday/v2 v2.1.0/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=
github.com/spf13/cobra v1.8.0 h1:7aJaZx1B85qltLMc546zn58BxxfZdR/W22ej9CFoEf0=
github.com/spf13/cobra v1.8.0/go.mod h1:WXLWApfZ71AjXPya3WOlMsY9yMs7YeiHhFVlvLyhcho=
github.com/spf13/pflag v1.0.5 h1:iy+VFUOCP1a+8yFto/drg2CJ5u0yRoB7fZw3DKv/JXA=
github.com/spf13/pflag v1.0.5/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=

77
llm_processor/main.go Normal file
View File

@ -0,0 +1,77 @@
package main
import (
"context"
"fmt"
"log"
"os"
"time"
"llm_processor/processor"
"llm_processor/storage"
"github.com/spf13/cobra"
)
var (
inputFile string
outputFile string
criteriaFile string
modelName string
batchSize int
delaySeconds int
timeoutSeconds int
)
var rootCmd = &cobra.Command{
Use: "llm-processor",
Short: "Process papers using LLM with configurable criteria",
Long: `A command line tool for processing academic papers through
language models using customizable evaluation criteria.`,
Run: func(cmd *cobra.Command, args []string) {
// Get OpenRouter API key from environment
apiKey := os.Getenv("OPENROUTER_API_KEY")
if apiKey == "" {
log.Fatal("OPENROUTER_API_KEY environment variable is required")
}
// Load papers
papers, err := storage.LoadPapers(inputFile)
if err != nil {
log.Fatalf("Failed to load papers: %v", err)
}
// Initialize processor
proc := processor.NewProcessor(modelName, batchSize, apiKey)
proc.SetTimeout(time.Duration(timeoutSeconds) * time.Second)
// Process papers
ctx := context.Background()
if err := proc.ProcessPapers(ctx, inputFile, outputFile, criteriaFile, time.Duration(delaySeconds)*time.Second); err != nil {
log.Fatalf("Failed to save results: %v", err)
}
fmt.Printf("Successfully processed %d papers\n", len(papers))
},
}
func init() {
rootCmd.PersistentFlags().StringVarP(&inputFile, "input", "i", "", "Input JSON file containing papers")
rootCmd.PersistentFlags().StringVarP(&outputFile, "output", "o", "results.json", "Output JSON file for results")
rootCmd.PersistentFlags().StringVarP(&criteriaFile, "criteria", "c", "", "Text file containing evaluation criteria")
rootCmd.PersistentFlags().StringVarP(&modelName, "model", "m", "gpt-4", "LLM model to use")
rootCmd.PersistentFlags().IntVarP(&batchSize, "batch-size", "b", 32, "Batch size for processing")
rootCmd.PersistentFlags().IntVarP(&delaySeconds, "delay", "d", 0, "Delay in seconds between paper submissions")
rootCmd.PersistentFlags().IntVarP(&timeoutSeconds, "timeout", "t", 3600, "Timeout in seconds for processing (default 1h)")
// Mark required flags
rootCmd.MarkPersistentFlagRequired("input")
rootCmd.MarkPersistentFlagRequired("criteria")
}
func main() {
if err := rootCmd.Execute(); err != nil {
fmt.Println(err)
os.Exit(1)
}
}

View File

@ -0,0 +1,34 @@
package models
import (
"fmt"
)
// Paper represents a research paper
type Paper struct {
Title string `json:"title"`
Abstract string `json:"abstract"`
ArxivID string `json:"arxiv_id"`
}
// Result represents the evaluation result from the LLM
type Result struct {
Paper Paper `json:"paper"`
Decision string `json:"decision"` // ACCEPT, REJECT, or ERROR
Explanation string `json:"explanation"`
}
// OrganizedResults contains categorized results
type OrganizedResults struct {
Accepted []Result `json:"accepted"`
Rejected []Result `json:"rejected"`
Errors []Result `json:"errors"`
}
// Validate checks if a Result has valid decision value
func (r *Result) Validate() error {
if r.Decision != "ACCEPT" && r.Decision != "REJECT" && r.Decision != "ERROR" {
return fmt.Errorf("invalid decision value: %s", r.Decision)
}
return nil
}

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

4196
llm_processor/papers.md Normal file

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,190 @@
package processor
import (
"context"
"encoding/json"
"fmt"
"strings"
"time"
"llm_processor/client"
"llm_processor/models"
"llm_processor/storage"
)
type Processor struct {
modelName string
batchSize int
timeout time.Duration
client *client.OpenRouterClient
}
// SetTimeout allows changing the processor timeout
func (p *Processor) SetTimeout(timeout time.Duration) {
p.timeout = timeout
}
func NewProcessor(modelName string, batchSize int, apiKey string) *Processor {
return &Processor{
modelName: modelName,
batchSize: batchSize,
timeout: 3600 * time.Second, // Default 1 hour timeout
client: client.NewOpenRouterClient(apiKey),
}
}
func (p *Processor) ProcessPapers(parentCtx context.Context, inputPath, outputPath, criteriaPath string, delay time.Duration) error {
startTime := time.Now()
ctx, cancel := context.WithTimeout(parentCtx, p.timeout)
defer cancel()
// Load papers from input file
papers, err := storage.LoadPapers(inputPath)
if err != nil {
return fmt.Errorf("failed to load papers: %w", err)
}
// Set criteria path in storage
storage.SetCriteriaPath(criteriaPath)
// Initialize results file
if err := storage.InitializeResultsFile(outputPath); err != nil {
return fmt.Errorf("failed to initialize results file: %w", err)
}
// Process papers in batches
for i := 0; i < len(papers); i += p.batchSize {
end := i + p.batchSize
if end > len(papers) {
end = len(papers)
}
batch := papers[i:end]
if err := p.processBatch(ctx, batch, delay, startTime, outputPath); err != nil {
return fmt.Errorf("failed to process batch: %w", err)
}
}
return nil
}
func (p *Processor) processBatch(ctx context.Context, papers []models.Paper, delay time.Duration, startTime time.Time, outputPath string) error {
ctx, cancel := context.WithTimeout(ctx, p.timeout)
defer cancel()
// Load criteria
criteria, err := storage.LoadCriteria()
if err != nil {
return fmt.Errorf("failed to load criteria: %w", err)
}
for i, paper := range papers {
var evaluation string
var lastErr error
// Retry up to 3 times with a linearly increasing delay between attempts
for attempt := 0; attempt < 3; attempt++ {
evaluation, lastErr = p.client.EvaluatePaper(ctx, paper, criteria, p.modelName)
if lastErr == nil {
break
}
time.Sleep(time.Duration(attempt+1) * time.Second) // linear backoff: 1s, 2s, 3s between retries
}
if lastErr != nil {
// Log error but continue with next paper
evaluation = fmt.Sprintf(`{
"decision": "ERROR",
"explanation": "Failed to evaluate paper after 3 attempts: %v"
}`, lastErr)
}
// Parse and validate evaluation response
var evalResponse struct {
Decision string `json:"decision"`
Explanation string `json:"explanation"`
}
if err := json.Unmarshal([]byte(evaluation), &evalResponse); err != nil {
// Try to extract decision and explanation using regex
decision := "ERROR"
explanation := evaluation
if strings.Contains(evaluation, `"decision": "ACCEPT"`) || strings.Contains(evaluation, `"decision":"ACCEPT"`) {
decision = "ACCEPT"
} else if strings.Contains(evaluation, `"decision": "REJECT"`) || strings.Contains(evaluation, `"decision":"REJECT"`) {
decision = "REJECT"
}
// Try to extract just the explanation if it's in JSON format
if strings.Contains(evaluation, `"explanation"`) {
parts := strings.Split(evaluation, `"explanation"`)
if len(parts) > 1 {
// Find the content between the first : and the next "
expl := parts[1]
start := strings.Index(expl, ":")
if start != -1 {
expl = expl[start+1:]
// Remove leading/trailing whitespace and quotes
expl = strings.Trim(expl, " \t\n\r\"")
// Remove trailing JSON syntax
expl = strings.TrimRight(expl, "}")
expl = strings.TrimRight(expl, ",")
explanation = expl
}
}
}
evalResponse = struct {
Decision string `json:"decision"`
Explanation string `json:"explanation"`
}{
Decision: decision,
Explanation: explanation,
}
}
// Sanitize the explanation
explanation := evalResponse.Explanation
// Remove any markdown code block syntax
explanation = strings.ReplaceAll(explanation, "```", "")
// Remove any JSON formatting if the explanation is a raw JSON string
if strings.HasPrefix(strings.TrimSpace(explanation), "{") {
var jsonExpl struct {
Explanation string `json:"explanation"`
}
if err := json.Unmarshal([]byte(explanation), &jsonExpl); err == nil && jsonExpl.Explanation != "" {
explanation = jsonExpl.Explanation
}
}
// Escape any remaining special markdown characters
explanation = strings.ReplaceAll(explanation, "*", "\\*")
explanation = strings.ReplaceAll(explanation, "_", "\\_")
explanation = strings.ReplaceAll(explanation, "`", "\\`")
result := models.Result{
Paper: paper,
Decision: evalResponse.Decision,
Explanation: explanation,
}
// Save result with detailed logging
fmt.Printf("Saving result for paper %q to %s\n", paper.Title, outputPath)
if err := storage.SaveResult(result, outputPath); err != nil {
fmt.Printf("Failed to save result for paper %q: %v\n", paper.Title, err)
return fmt.Errorf("failed to save result: %w", err)
}
fmt.Printf("Successfully saved result for paper %q\n", paper.Title)
// Print progress
elapsed := time.Since(startTime).Seconds()
fmt.Printf("Processed paper %d/%d (%s) - Total runtime: %.2f seconds\n",
i+1, len(papers), paper.Title, elapsed)
// Apply delay between papers if specified
if delay > 0 {
time.Sleep(delay)
}
}
return nil
}

View File

@ -0,0 +1,143 @@
package storage
import (
"encoding/json"
"fmt"
"os"
"path/filepath"
"sync"
"llm_processor/models"
)
var (
criteriaPath = "criteria.txt"
criteriaMux sync.RWMutex
)
// SetCriteriaPath sets the path for the criteria file
func SetCriteriaPath(path string) {
criteriaMux.Lock()
defer criteriaMux.Unlock()
criteriaPath = path
}
// LoadPapers loads papers from a JSON file
func LoadPapers(filePath string) ([]models.Paper, error) {
absPath, err := filepath.Abs(filePath)
if err != nil {
return nil, fmt.Errorf("failed to resolve absolute path: %w", err)
}
file, err := os.ReadFile(absPath)
if err != nil {
return nil, fmt.Errorf("failed to read papers file: %w", err)
}
var papers []models.Paper
if err := json.Unmarshal(file, &papers); err != nil {
return nil, fmt.Errorf("failed to parse papers JSON: %w", err)
}
return papers, nil
}
// LoadCriteria loads evaluation criteria from a text file
func LoadCriteria() (string, error) {
criteriaMux.RLock()
filePath := criteriaPath
criteriaMux.RUnlock()
content, err := os.ReadFile(filePath)
if err != nil {
return "", fmt.Errorf("failed to read criteria file: %w", err)
}
return string(content), nil
}
// InitializeResultsFile creates a new results file with empty categories
func InitializeResultsFile(filePath string) error {
results := models.OrganizedResults{
Accepted: []models.Result{},
Rejected: []models.Result{},
Errors: []models.Result{},
}
return saveToFile(filePath, results)
}
// SaveResult appends a result to the appropriate category in the results file
func SaveResult(result models.Result, filePath string) error {
// Read existing results
results, err := loadResults(filePath)
if err != nil {
return fmt.Errorf("failed to load results: %w", err)
}
// Append to appropriate category
switch result.Decision {
case "ACCEPT":
results.Accepted = append(results.Accepted, result)
case "REJECT":
results.Rejected = append(results.Rejected, result)
default:
results.Errors = append(results.Errors, result)
}
// Save updated results
if err := saveToFile(filePath, results); err != nil {
// Try fallback save
if fallbackErr := saveFallback(result, filePath); fallbackErr != nil {
return fmt.Errorf("failed to save results: %w (fallback also failed: %v)", err, fallbackErr)
}
return nil
}
return nil
}
func loadResults(filePath string) (*models.OrganizedResults, error) {
file, err := os.ReadFile(filePath)
if err != nil {
return nil, fmt.Errorf("failed to read results file: %w", err)
}
var results models.OrganizedResults
if err := json.Unmarshal(file, &results); err != nil {
return nil, fmt.Errorf("failed to parse results JSON: %w", err)
}
return &results, nil
}
func saveToFile(filePath string, data interface{}) error {
file, err := os.Create(filePath)
if err != nil {
return fmt.Errorf("failed to create file: %w", err)
}
defer file.Close()
encoder := json.NewEncoder(file)
encoder.SetIndent("", " ")
if err := encoder.Encode(data); err != nil {
return fmt.Errorf("failed to encode JSON: %w", err)
}
return nil
}
func saveFallback(result models.Result, originalPath string) error {
fallbackPath := originalPath + ".fallback"
file, err := os.OpenFile(fallbackPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
if err != nil {
return fmt.Errorf("failed to open fallback file: %w", err)
}
defer file.Close()
encoder := json.NewEncoder(file)
if err := encoder.Encode(result); err != nil {
return fmt.Errorf("failed to encode result to fallback file: %w", err)
}
return nil
}

238
main.go Normal file
View File

@ -0,0 +1,238 @@
package main
import (
"context"
"encoding/json"
"flag"
"fmt"
"log"
"os"
"time"
"arxiv-processor/arxiv"
"json2md/lib"
"llm_processor/processor"
)
func main() {
// Set custom usage before defining flags
flag.Usage = func() {
fmt.Fprintf(os.Stderr, "Usage: %s [options]\n\n", os.Args[0])
fmt.Fprintf(os.Stderr, "A tool to fetch, filter, and process arXiv papers using LLM.\n\n")
fmt.Fprintf(os.Stderr, "Required flags:\n")
fmt.Fprintf(os.Stderr, " -criteria string\n\tPath to filter criteria file\n\n")
fmt.Fprintf(os.Stderr, "Source flags (must use either arXiv query OR input JSON):\n")
fmt.Fprintf(os.Stderr, " ArXiv query flags:\n")
fmt.Fprintf(os.Stderr, " -start string\n\tStart date in YYYYMMDD format\n")
fmt.Fprintf(os.Stderr, " -end string\n\tEnd date in YYYYMMDD format\n")
fmt.Fprintf(os.Stderr, " -search string\n\tarXiv category/search query (e.g., 'cs.AI', 'physics.comp-ph')\n")
fmt.Fprintf(os.Stderr, " -max-results int\n\tMaximum number of papers to retrieve (default: 100, max: 2000)\n\n")
fmt.Fprintf(os.Stderr, " OR\n\n")
fmt.Fprintf(os.Stderr, " Input JSON flag:\n")
fmt.Fprintf(os.Stderr, " -input-json string\n\tPath to input JSON file (bypasses arXiv fetch)\n\n")
fmt.Fprintf(os.Stderr, "Optional flags:\n")
fmt.Fprintf(os.Stderr, " -output string\n\tOutput markdown file path (default: papers.md)\n")
fmt.Fprintf(os.Stderr, " -model string\n\tLLM model to use (default: nvidia/llama-3.1-nemotron-70b-instruct)\n\n")
fmt.Fprintf(os.Stderr, "Environment variables:\n")
fmt.Fprintf(os.Stderr, " OPENROUTER_API_KEY\tRequired for LLM processing\n\n")
fmt.Fprintf(os.Stderr, "Examples:\n")
fmt.Fprintf(os.Stderr, " Fetch from arXiv:\n")
fmt.Fprintf(os.Stderr, " %s -start 20240101 -end 20240131 -search cs.AI -criteria criteria.txt -output papers.md\n\n", os.Args[0])
fmt.Fprintf(os.Stderr, " Use existing JSON:\n")
fmt.Fprintf(os.Stderr, " %s -input-json papers.json -criteria new-criteria.txt -output results.md\n", os.Args[0])
}
// CLI flags
start := flag.String("start", "", "Start date (YYYYMMDD)")
end := flag.String("end", "", "End date (YYYYMMDD)")
search := flag.String("search", "", "arXiv search query")
criteriaFile := flag.String("criteria", "", "Path to filter criteria file")
output := flag.String("output", "papers.md", "Output file path")
model := flag.String("model", "nvidia/llama-3.1-nemotron-70b-instruct", "LLM model to use")
maxResults := flag.Int("max-results", 100, "Maximum number of papers to retrieve (up to 2000)")
inputJSON := flag.String("input-json", "", "Path to input JSON file (bypasses arXiv fetch)")
flag.Parse()
// Validate flags
if *criteriaFile == "" {
fmt.Fprintf(os.Stderr, "Error: Missing required parameter: -criteria\n\n")
flag.Usage()
os.Exit(1)
}
// Validate either input-json is provided OR all arxiv flags are provided
usingInputJSON := *inputJSON != ""
usingArxiv := *start != "" || *end != "" || *search != ""
if usingInputJSON && usingArxiv {
fmt.Fprintf(os.Stderr, "Error: Cannot use both --input-json and arXiv query flags\n\n")
flag.Usage()
os.Exit(1)
}
if !usingInputJSON && !usingArxiv {
fmt.Fprintf(os.Stderr, "Error: Must provide either --input-json or arXiv query flags\n\n")
flag.Usage()
os.Exit(1)
}
if usingArxiv {
if *start == "" || *end == "" || *search == "" {
fmt.Fprintf(os.Stderr, "Error: Missing required arXiv parameters\n\n")
flag.Usage()
os.Exit(1)
}
if *maxResults <= 0 || *maxResults > 2000 {
fmt.Fprintf(os.Stderr, "Error: max-results must be between 1 and 2000\n\n")
flag.Usage()
os.Exit(1)
}
}
// Configure logging
log.SetPrefix("[paper-system] ")
log.SetFlags(log.Ltime | log.Lmsgprefix)
ctx := context.Background()
// Paper type used for JSON operations
type LLMPaper struct {
Title string `json:"title"`
Abstract string `json:"abstract"`
ArxivID string `json:"arxiv_id"`
Authors []string `json:"authors"`
}
var llmPapers []LLMPaper
if usingInputJSON {
// Load papers from input JSON
log.Printf("Loading papers from %s", *inputJSON)
paperData, err := os.ReadFile(*inputJSON)
if err != nil {
log.Fatalf("Failed to read input JSON: %v", err)
}
if err := json.Unmarshal(paperData, &llmPapers); err != nil {
log.Fatalf("Failed to parse input JSON: %v", err)
}
log.Printf("Loaded %d papers from JSON", len(llmPapers))
} else {
// Fetch papers from arXiv
log.Printf("Fetching papers from arXiv for category %q between %s and %s", *search, *start, *end)
arxivClient := arxiv.NewClient()
startDate := parseDate(*start)
endDate := parseDate(*end)
query := arxiv.Query{
Category: *search,
DateRange: fmt.Sprintf("%s TO %s", startDate.Format("20060102"), endDate.Format("20060102")),
MaxResults: *maxResults,
StartOffset: 0,
}
log.Printf("Executing arXiv query: %+v", query)
papers, err := arxivClient.FetchPapers(ctx, query)
if err != nil {
log.Fatalf("arXiv fetch failed: %v", err)
}
log.Printf("Retrieved %d papers from arXiv", len(papers))
if len(papers) >= *maxResults {
log.Printf("WARNING: Retrieved maximum number of papers (%d). There may be more papers available.", *maxResults)
log.Printf("Use --max-results flag to retrieve more papers (up to 2000)")
}
// Convert arXiv papers to LLM format
llmPapers = make([]LLMPaper, len(papers))
for i, p := range papers {
// Convert author structs to string array
authors := make([]string, len(p.Authors))
for j, a := range p.Authors {
authors[j] = a.Name
}
llmPapers[i] = LLMPaper{
Title: p.Title,
Abstract: p.Summary,
ArxivID: p.ID,
Authors: authors,
}
}
// Save papers to JSON for future use
log.Printf("Saving papers to papers.json")
papersJSON, err := json.Marshal(llmPapers)
if err != nil {
log.Fatalf("Failed to marshal papers: %v", err)
}
if err := os.WriteFile("papers.json", papersJSON, 0644); err != nil {
log.Fatalf("Failed to save papers JSON: %v", err)
}
log.Printf("Successfully saved papers to papers.json")
}
// Print paper titles for verification
log.Printf("Processing papers:")
for i, paper := range llmPapers {
log.Printf(" %d. %s", i+1, paper.Title)
}
// Save papers to temporary file for LLM processing
tempInput := "temp_input.json"
tempJSON, err := json.Marshal(llmPapers)
if err != nil {
log.Fatalf("Failed to marshal papers for LLM: %v", err)
}
if err := os.WriteFile(tempInput, tempJSON, 0644); err != nil {
log.Fatalf("Failed to save temp input JSON: %v", err)
}
// Filter papers with LLM
log.Printf("Starting LLM processing")
apiKey := os.Getenv("OPENROUTER_API_KEY")
if apiKey == "" {
log.Fatal("OPENROUTER_API_KEY environment variable is required")
}
llmProcessor := processor.NewProcessor(*model, 32, apiKey) // 32 = batch size from README
log.Printf("Initialized LLM processor with model %s", *model)
tempOutput := "temp_output.json"
log.Printf("Processing papers with criteria from %s", *criteriaFile)
if err := llmProcessor.ProcessPapers(ctx, tempInput, tempOutput, *criteriaFile, 1*time.Second); err != nil {
log.Fatalf("LLM processing failed: %v", err)
}
log.Printf("LLM processing complete, results saved to %s", tempOutput)
// Generate markdown
log.Printf("Generating markdown output")
decisions, err := lib.ProcessJSONFile(tempOutput)
if err != nil {
log.Fatalf("Failed to process JSON: %v", err)
}
log.Printf("Processed decisions: %d accepted, %d rejected", len(decisions.Accepted), len(decisions.Rejected))
if err := lib.GenerateMarkdown(decisions, *output); err != nil {
log.Fatalf("Markdown generation failed: %v", err)
}
log.Printf("Generated markdown output at %s", *output)
// Cleanup temp files
os.Remove(tempInput)
os.Remove(tempOutput)
log.Printf("Cleaned up temporary files")
log.Printf("Process complete. Results saved to %s", *output)
}
func parseDate(s string) time.Time {
t, err := time.Parse("20060102", s)
if err != nil {
log.Fatalf("Invalid date %q: %v", s, err)
}
return t
}

BIN
paper-system Executable file

Binary file not shown.

29
prompt-criteria.txt Normal file
View File

@ -0,0 +1,29 @@
**Papers MUST:**
1. **Focus primarily on the engineering, design, or optimization of prompts *specifically* for Large Language Models (LLMs).** (This clarifies that the focus is prompt engineering *for LLMs*, not just any AI system.)
2. **Investigate, analyze, or propose methods for improving LLM performance *through the manipulation of textual input prompts*.** (This emphasizes the "how" of prompt engineering and excludes papers that might just mention prompts in passing.)
3. **Provide concrete examples of prompts and demonstrate their impact on LLM output, replicable with publicly available LLMs.** (This maintains the practical, replicable aspect while emphasizing the prompt-output relationship.)
**Papers MUST NOT:**
1. **Focus primarily on the development of new LLM architectures or training methods.** (This explicitly excludes papers about building or training LLMs, focusing the criteria on using them.)
2. **Be primarily concerned with applications of generative AI *other than text generation driven by LLMs*, such as image, video, or audio generation.** (This clearly differentiates LLMs from other generative models like text-to-image or text-to-video.)
3. **Be primarily concerned with medical, automotive (self-driving), or ethical subjects.** (This exclusion remains, but is lower priority given the more specific focus.)
**Additional Instructions:**
* **The core subject of the paper must be prompt engineering for text-based interactions with LLMs. Papers that mention prompts but do not make them the central focus should be rejected.** (This emphasizes the primary importance of prompt engineering.)
* **Reject papers that focus on using LLMs as components within larger systems where prompt engineering is not the primary concern (e.g., using an LLM as part of a no-code platform or within a multi-agent system).** (This addresses the issue encountered with the example abstract.)
* **Favor papers that explore novel prompt engineering techniques, provide comparative analyses of different prompting strategies, or offer frameworks for systematic prompt development.** (This provides guidance on what constitutes a *strong* accept.)
* **Analyze each paper's title and abstract carefully to determine how many criteria are met. A paper must meet all of the "MUST" criteria to be considered for acceptance.** (This further clarifies the requirements for acceptance, as the previous requirement of "two or three" is no longer applicable.)
* **Err on the side of caution. When in doubt, reject.** (This rule makes the selection more exclusive, suitable for cases where there might be too many submissions to process effectively.)
**Example of Clear Rejection:**
* A paper about a new No-Code platform that utilizes LLMs as one of its components would be rejected, even if it mentions prompts, because the primary focus is the platform, not prompt engineering.
* A paper that details a new method for fine-tuning LLMs would be rejected as the criteria is focused on prompt engineering and not new training methods.
**Example of Clear Acceptance:**
* A paper that presents a new technique for automatically generating prompts that elicit specific types of responses from an LLM, providing detailed examples and comparisons, would be accepted.
* A paper comparing the effectiveness of various prompt structures (e.g., zero-shot, few-shot, chain-of-thought) for a particular task, offering insights into optimal prompt design, would be accepted.

7421
prompt-papers-old.md Normal file

File diff suppressed because it is too large Load Diff

5903
prompt-papers.md Normal file

File diff suppressed because it is too large Load Diff