
# Accepted Papers

## [Monte Carlo Tree Search for Comprehensive Exploration in LLM-Based Automatic Heuristic Design](https://arxiv.org/abs/2501.08603v2)
**arXiv ID:** 2501.08603v2

**Abstract:**
> Handcrafting heuristics for solving complex planning tasks (e.g., NP-hard
combinatorial optimization (CO) problems) is a common practice but requires
extensive domain knowledge. Recently, Large Language Model (LLM)-based
automatic heuristics design (AHD) methods have shown promise in generating
high-quality heuristics without manual intervention. Existing LLM-based AHD
methods employ a population to maintain a fixed number of top-performing
LLM-generated heuristics and introduce evolutionary computation (EC) to enhance
the population iteratively. However, the population-based procedure brings
greedy properties, often resulting in convergence to local optima. Instead, to
more comprehensively explore the space of heuristics, we propose using Monte
Carlo Tree Search (MCTS) for LLM-based heuristic evolution while preserving all
LLM-generated heuristics in a tree structure. With a novel thought-alignment
process and an exploration-decay technique, the proposed MCTS-AHD method
delivers significantly higher-quality heuristics on various complex tasks. Our
code is available at https://github.com/zz1358m/MCTS-AHD-master.
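
As a rough illustration of what an exploration-decay schedule could look like inside MCTS selection, the sketch below scores children with the standard UCT formula and shrinks the exploration constant linearly over the evaluation budget. The `Node` class, the linear schedule, and the constant `c0 = 1.4` are assumptions made for illustration; the abstract does not specify the paper's actual decay rule or its thought-alignment step.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One heuristic kept in the search tree (hypothetical structure)."""
    name: str
    visits: int = 0
    total_reward: float = 0.0
    children: list["Node"] = field(default_factory=list)


def exploration_weight(step: int, budget: int, c0: float = 1.4) -> float:
    """Assumed linear decay: c0 at step 0, falling to 0 at the budget."""
    return c0 * max(0.0, 1.0 - step / budget)


def uct_score(parent: Node, child: Node, c: float) -> float:
    """Standard UCT: mean reward plus a visit-count exploration bonus."""
    if child.visits == 0:
        return math.inf  # always try unvisited children first
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent.visits) / child.visits)


def select_child(parent: Node, step: int, budget: int) -> Node:
    """Pick the child with the highest UCT score at the current step."""
    c = exploration_weight(step, budget)
    return max(parent.children, key=lambda ch: uct_score(parent, ch, c))


# Toy tree: "a" has the better mean reward, "b" has been visited less.
root = Node("root", visits=20, children=[
    Node("a", visits=15, total_reward=9.0),
    Node("b", visits=5, total_reward=2.5),
])
early = select_child(root, step=0, budget=100)
late = select_child(root, step=100, budget=100)
print(early.name, late.name)
```

In this toy tree, the less-visited child `b` is selected at step 0 and the higher-mean child `a` at the final step, which is the qualitative shift from exploration to exploitation that a decay schedule is meant to produce.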

**Decision Explanation:** Original response:
```
{
    "decision": "ACCEPT",
    "explanation": "Meets all required criteria: (1) Practical Applications (automatic heuristic design for complex planning tasks), (2) Experimental Results and Quantitative Metrics (delivers significantly higher-quality heuristics), and (3) Comparison with State-of-the-Art (improves upon existing LLM-based AHD methods). Additionally, meets criteria (4) Methodology and Implementation Details (novel thought-alignment process and exploration-decay technique, with available code)."
}
```

---

## [Leveraging Large Language Models as Knowledge-Driven Agents for Reliable Retrosynthesis Planning](https://arxiv.org/abs/2501.08897v1)
**arXiv ID:** 2501.08897v1

**Abstract:**
> Identifying reliable synthesis pathways in materials chemistry is a complex
task, particularly in polymer science, due to the intricate and often
non-unique nomenclature of macromolecules. To address this challenge, we
propose an agent system that integrates large language models (LLMs) and
knowledge graphs (KGs). By leveraging LLMs' powerful capabilities for
extracting and recognizing chemical substance names, and storing the extracted
data in a structured knowledge graph, our system fully automates the retrieval
of relevant literature, extraction of reaction data, database querying,
construction of retrosynthetic pathway trees, further expansion through the
retrieval of additional literature and recommendation of optimal reaction
pathways. A novel Multi-branched Reaction Pathway Search (MBRPS) algorithm
enables the exploration of all pathways, with a particular focus on
multi-branched ones, helping LLMs overcome weak reasoning in multi-branched
paths. This work represents the first attempt to develop a fully automated
retrosynthesis planning agent tailored specially for macromolecules powered by
LLMs. Applied to polyimide synthesis, our new approach constructs a
retrosynthetic pathway tree with hundreds of pathways and recommends optimized
routes, including both known and novel pathways, demonstrating its
effectiveness and potential for broader applications.

**Decision Explanation:** Original response:
```
{
    "decision": "ACCEPT",
    "explanation": "Meets all mandatory criteria: focuses on practical application of LLMs in knowledge graphs and retrieval (1), includes quantitative results through pathway tree construction (2), and implicitly compares to state-of-the-art by introducing a novel MBRPS algorithm (3). Additionally, meets optional criteria for methodology description (4), novelty in integrating LLMs with knowledge graphs for retrosynthesis, and demonstrates potential for real-world impact in materials chemistry."
}
```

---

## [Text Semantics to Flexible Design: A Residential Layout Generation Method Based on Stable Diffusion Model](https://arxiv.org/abs/2501.09279v1)
**arXiv ID:** 2501.09279v1

**Abstract:**
> Flexibility in the AI-based residential layout design remains a significant
challenge, as traditional methods like rule-based heuristics and graph-based
generation often lack flexibility and require substantial design knowledge from
users. To address these limitations, we propose a cross-modal design approach
based on the Stable Diffusion model for generating flexible residential
layouts. The method offers multiple input types for learning objectives,
allowing users to specify both boundaries and layouts. It incorporates natural
language as design constraints and introduces ControlNet to enable stable
layout generation through two distinct pathways. We also present a scheme that
encapsulates design expertise within a knowledge graph and translates it into
natural language, providing an interpretable representation of design
knowledge. This comprehensibility and diversity of input options enable
professionals and non-professionals to directly express design requirements,
enhancing flexibility and controllability. Finally, experiments verify the
flexibility of the proposed methods under multimodal constraints better than
state-of-the-art models, even when specific semantic information about room
areas or connections is incomplete.

**Decision Explanation:** Original decision: ACCEPT
Meets all required criteria: (1) Practical Application (residential layout generation), (2) Experimental Results with Quantitative Metrics (comparison with state-of-the-art models), and (3) Comparison with State-of-the-Art. Additionally, it meets optional criteria (4) Methodology and Implementation Details (clear description of Stable Diffusion model and ControlNet usage) and shows Novelty in integrating LLMs with knowledge graphs for design expertise representation.

---

## [SOP-Agent: Empower General Purpose AI Agent with Domain-Specific SOPs](https://arxiv.org/abs/2501.09316v1)
**arXiv ID:** 2501.09316v1

**Abstract:**
> Despite significant advancements in general-purpose AI agents, several
challenges still hinder their practical application in real-world scenarios.
First, the limited planning capabilities of Large Language Models (LLM)
restrict AI agents from effectively solving complex tasks that require
long-horizon planning. Second, general-purpose AI agents struggle to
efficiently utilize domain-specific knowledge and human expertise. In this
paper, we introduce the Standard Operational Procedure-guided Agent
(SOP-agent), a novel framework for constructing domain-specific agents through
pseudocode-style Standard Operational Procedures (SOPs) written in natural
language. Formally, we represent a SOP as a decision graph, which is traversed
to guide the agent in completing tasks specified by the SOP. We conduct
extensive experiments across tasks in multiple domains, including
decision-making, search and reasoning, code generation, data cleaning, and
grounded customer service. The SOP-agent demonstrates excellent versatility,
achieving performance superior to general-purpose agent frameworks and
comparable to domain-specific agent systems. Additionally, we introduce the
Grounded Customer Service Benchmark, the first benchmark designed to evaluate
the grounded decision-making capabilities of AI agents in customer service
scenarios based on SOPs.
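
The decision-graph view of an SOP can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `traverse_sop` function, the node names, and the refund procedure are hypothetical, and plain Python predicates stand in for the branch decisions that an LLM-driven agent would make.

```python
from typing import Callable

# Each node maps to an ordered list of (condition, next_node) edges.
# A node with no outgoing edges is a terminal step of the procedure.
SopGraph = dict[str, list[tuple[Callable[[dict], bool], str]]]


def traverse_sop(graph: SopGraph, start: str, state: dict,
                 max_steps: int = 20) -> list[str]:
    """Walk the graph from `start`, following the first edge whose
    condition holds for `state`; return the sequence of visited steps."""
    path = [start]
    current = start
    for _ in range(max_steps):  # guard against accidental cycles
        next_node = None
        for condition, target in graph.get(current, []):
            if condition(state):
                next_node = target
                break
        if next_node is None:  # terminal node or no applicable branch
            break
        path.append(next_node)
        current = next_node
    return path


# Hypothetical customer-service SOP for refund requests.
refund_sop: SopGraph = {
    "check_order": [
        (lambda s: s["delivered"], "check_return_window"),
        (lambda s: not s["delivered"], "offer_cancellation"),
    ],
    "check_return_window": [
        (lambda s: s["days_since_delivery"] <= 30, "issue_refund"),
        (lambda s: s["days_since_delivery"] > 30, "deny_refund"),
    ],
}

print(traverse_sop(refund_sop, "check_order",
                   {"delivered": True, "days_since_delivery": 10}))
```

Keeping the procedure as data rather than as free-form prompt text means each step the agent takes can be checked against an explicit edge in the graph.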

**Decision Explanation:** Original decision: ACCEPT
Meets all required criteria (1, 2, and 3): focuses on practical application of LLMs in real-world tasks (1), includes experimental results with quantitative metrics showing performance improvements (2), and compares results with state-of-the-art techniques (3). Additionally, meets optional criteria 4 (clear methodology and implementation details) and demonstrates novelty in integrating LLMs with SOPs for agentic AI, enhancing its potential impact.

---

## [Aligning Instruction Tuning with Pre-training](https://arxiv.org/abs/2501.09368v1)
**arXiv ID:** 2501.09368v1

**Abstract:**
> Instruction tuning enhances large language models (LLMs) to follow human
instructions across diverse tasks, relying on high-quality datasets to guide
behavior. However, these datasets, whether manually curated or synthetically
generated, are often narrowly focused and misaligned with the broad
distributions captured during pre-training, limiting LLM generalization and
effective use of pre-trained knowledge. We propose *Aligning Instruction Tuning
with Pre-training* (AITP), a method that bridges this gap by identifying
coverage shortfalls in instruction-tuning datasets and rewriting
underrepresented pre-training data into high-quality instruction-response
pairs. This approach enriches dataset diversity while preserving task-specific
objectives. Evaluations on three fully open LLMs across eight benchmarks
demonstrate consistent performance improvements with AITP. Ablations highlight
the benefits of adaptive data selection, controlled rewriting, and balanced
integration, emphasizing the importance of aligning instruction tuning with
pre-training distributions to unlock the full potential of LLMs.

**Decision Explanation:** Original decision: ACCEPT
Meets all required criteria (1, 2, 3): focuses on practical application of LLMs (instruction tuning), includes experimental results with quantitative metrics (evaluations on three LLMs across eight benchmarks), and compares with state-of-the-art techniques. Additionally, meets optional criteria for methodology and implementation details (describes AITP method), novelty (introduces a new approach for aligning instruction tuning with pre-training), and robust experimental validation.

---

## [Doc-Guided Sent2Sent++: A Sent2Sent++ Agent with Doc-Guided memory for Document-level Machine Translation](https://arxiv.org/abs/2501.08523v1)
**arXiv ID:** 2501.08523v1

**Abstract:**
> The field of artificial intelligence has witnessed significant advancements
in natural language processing, largely attributed to the capabilities of Large
Language Models (LLMs). These models form the backbone of Agents designed to
address long-context dependencies, particularly in Document-level Machine
Translation (DocMT). DocMT presents unique challenges, with quality,
consistency, and fluency being the key metrics for evaluation. Existing
approaches, such as Doc2Doc and Doc2Sent, either omit sentences or compromise
fluency. This paper introduces Doc-Guided Sent2Sent++, an Agent that employs an
incremental sentence-level forced decoding strategy **to ensure every
sentence is translated while enhancing the fluency of adjacent sentences.** Our
Agent leverages a Doc-Guided Memory, focusing solely on the summary and its
translation, which we find to be an efficient approach to maintaining
consistency. Through extensive testing across multiple languages and domains,
we demonstrate that Sent2Sent++ outperforms other methods in terms of quality,
consistency, and fluency. The results indicate that our approach has achieved
significant improvements in metrics such as s-COMET, d-COMET, LTCR-$1_f$, and
document-level perplexity (d-ppl). The contributions of this paper include a
detailed analysis of current DocMT research, the introduction of the
Sent2Sent++ decoding method, the Doc-Guided Memory mechanism, and validation of
its effectiveness across languages and domains.

**Decision Explanation:** Original decision: ACCEPT
Meets all required criteria (1: Practical Applications in DocMT with LLMs, 2: Experimental Results with quantitative metrics like s-COMET and d-ppl, 3: Comparison with State-of-the-Art methods). Additionally meets criteria 4 (Methodology and Implementation Details) and shows Novelty in approach. Does not fall under any rejection categories.

---

## [LlamaRestTest: Effective REST API Testing with Small Language Models](https://arxiv.org/abs/2501.08598v1)
**arXiv ID:** 2501.08598v1

**Abstract:**
> Modern web services rely heavily on REST APIs, typically documented using the
OpenAPI specification. The widespread adoption of this standard has resulted in
the development of many black-box testing tools that generate tests based on
these specifications. Recent advancements in Natural Language Processing (NLP),
particularly with Large Language Models (LLMs), have enhanced REST API testing
by extracting actionable rules and generating input values from the
human-readable portions of the specification. However, these advancements
overlook the potential of continuously refining the identified rules and test
inputs based on server responses. To address this limitation, we present
LlamaRestTest, a novel approach that employs two custom LLMs to generate
realistic test inputs and uncover parameter dependencies during the testing
process by incorporating server responses. These LLMs are created by
fine-tuning the Llama3-8b model, using mined datasets of REST API example
values and inter-parameter dependencies. We evaluated LlamaRestTest on 12
real-world services (including popular services such as Spotify), comparing it
against RESTGPT, a GPT-powered specification-enhancement tool, as well as
several state-of-the-art REST API testing tools, including RESTler, MoRest,
EvoMaster, and ARAT-RL. Our results show that fine-tuning enables smaller LLMs
to outperform larger models in detecting actionable rules and generating inputs
for REST API testing. We evaluated configurations from the base Llama3-8B to
fine-tuned versions and explored 2-bit, 4-bit, and 8-bit quantization for
efficiency. LlamaRestTest surpasses state-of-the-art tools in code coverage and
error detection, even with RESTGPT-enhanced specifications, and an ablation
study highlights the impact of its novel components.

**Decision Explanation:** Original decision: ACCEPT
Meets all required criteria (1, 2, and 3) by focusing on a practical application of LLMs in REST API testing, presenting experimental results with quantitative metrics, and comparing its approach to state-of-the-art techniques. Additionally, it meets optional criteria 4 (clear methodology and implementation details) and shows novelty in fine-tuning smaller LLMs for improved performance.

---

## [AutoRestTest: A Tool for Automated REST API Testing Using LLMs and MARL](https://arxiv.org/abs/2501.08600v1)
**arXiv ID:** 2501.08600v1

**Abstract:**
> As REST APIs have become widespread in modern web services, comprehensive
testing of these APIs has become increasingly crucial. Due to the vast search
space consisting of operations, parameters, and parameter values along with
their complex dependencies and constraints, current testing tools suffer from
low code coverage, leading to suboptimal fault detection. To address this
limitation, we present a novel tool, AutoRestTest, which integrates the
Semantic Operation Dependency Graph (SODG) with Multi-Agent Reinforcement
Learning (MARL) and large language models (LLMs) for effective REST API
testing. AutoRestTest determines operation-dependent parameters using the SODG
and employs five specialized agents (operation, parameter, value, dependency,
and header) to identify dependencies of operations and generate operation
sequences, parameter combinations, and values. AutoRestTest provides a
command-line interface and continuous telemetry on successful operation count,
unique server errors detected, and time elapsed. Upon completion, AutoRestTest
generates a detailed report highlighting errors detected and operations
exercised. In this paper, we introduce our tool and present preliminary
results.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications: REST API testing), 2 (Experimental Results: preliminary results with potential for quantitative metrics), and 4 (Methodology and Implementation Details: clear description of using LLMs with MARL for API testing). Additionally, shows novelty in integrating LLMs with MARL for a practical application, making it a strong candidate for further review.

---

## [MAGNET: Augmenting Generative Decoders with Representation Learning and Infilling Capabilities](https://arxiv.org/abs/2501.08648v1)
**arXiv ID:** 2501.08648v1

**Abstract:**
> While originally designed for unidirectional generative modeling,
decoder-only large language models (LLMs) are increasingly being adapted for
bidirectional modeling. However, unidirectional and bidirectional models are
typically trained separately with distinct objectives (generation and
representation learning, respectively). This separation overlooks the
opportunity for developing a more versatile language model and for these
objectives to complement each other. In this work, we introduce MAGNET, an
adaptation of decoder-only LLMs that enhances their ability to generate robust
representations and infill missing text spans, while preserving their knowledge
and text generation capabilities. MAGNET employs three self-supervised training
objectives and introduces an attention mechanism that combines bidirectional
and causal attention, enabling unified training across all objectives. Our
results demonstrate that LLMs adapted with MAGNET (1) surpass strong text
encoders on token-level and sentence-level representation learning tasks, (2)
generate contextually appropriate text infills by leveraging future context,
(3) retain the ability for open-ended text generation without exhibiting
repetition problems, and (4) preserve the knowledge gained by the LLM during
pretraining.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications: enhances LLMs for robust representations and text infilling), 2 (Experimental Results and Quantitative Metrics: demonstrates performance improvements in representation learning and text generation tasks), and 3 (Comparison with State-of-the-Art: surpasses strong text encoders). Additionally, shows Novelty in combining bidirectional and causal attention, and its approach is implementable with current standard tools, supporting further review.

---

## [ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind](https://arxiv.org/abs/2501.08838v1)
**arXiv ID:** 2501.08838v1

**Abstract:**
> Existing Theory of Mind (ToM) benchmarks diverge from real-world scenarios in
three aspects: 1) they assess a limited range of mental states such as beliefs,
2) false beliefs are not comprehensively explored, and 3) the diverse
personality traits of characters are overlooked. To address these challenges,
we introduce ToMATO, a new ToM benchmark formulated as multiple-choice QA over
conversations. ToMATO is generated via LLM-LLM conversations featuring
information asymmetry. By employing a prompting method that requires
role-playing LLMs to verbalize their thoughts before each utterance, we capture
both first- and second-order mental states across five categories: belief,
intention, desire, emotion, and knowledge. These verbalized thoughts serve as
answers to questions designed to assess the mental states of characters within
conversations. Furthermore, the information asymmetry introduced by hiding
thoughts from others induces the generation of false beliefs about various
mental states. Assigning distinct personality traits to LLMs further
diversifies both utterances and thoughts. ToMATO consists of 5.4k questions,
753 conversations, and 15 personality trait patterns. Our analysis shows that
this dataset construction approach frequently generates false beliefs due to
the information asymmetry between role-playing LLMs, and effectively reflects
diverse personalities. We evaluate nine LLMs on ToMATO and find that even
GPT-4o mini lags behind human performance, especially in understanding false
beliefs, and lacks robustness to various personality traits.

**Decision Explanation:** Original decision: ACCEPT
Meets all mandatory criteria: (1) Practical Applications (benchmarking Theory of Mind in LLMs), (2) Experimental Results and Quantitative Metrics (evaluation of 9 LLMs on ToMATO with performance metrics), and (3) Comparison with State-of-the-Art (comparison to human performance). Additionally, meets criterion (4) Methodology and Implementation Details (clear description of LLM-LLM conversations and prompting method). Novelty is also present in the approach to capture mental states and false beliefs.

---

## [Exploring Task-Level Optimal Prompts for Visual In-Context Learning](https://arxiv.org/abs/2501.08841v1)
**arXiv ID:** 2501.08841v1

**Abstract:**
> With the development of Vision Foundation Models (VFMs) in recent years,
Visual In-Context Learning (VICL) has become a better choice compared to
modifying models in most scenarios. Unlike retraining or fine-tuning a
model, VICL does not require modifications to the model's weights or
architecture, and only needs a prompt with demonstrations to teach the VFM how to
solve tasks. Currently, the significant computational cost of finding optimal
prompts for every test sample hinders the deployment of VICL, as determining
which demonstrations to use for constructing prompts is very costly. In this
paper, however, we find a counterintuitive phenomenon that most test samples
actually achieve optimal performance under the same prompts, and searching for
sample-level prompts only costs more time but results in completely identical
prompts. Therefore, we propose task-level prompting to reduce the cost of
searching for prompts during the inference stage and introduce two time-saving
yet effective task-level prompt search strategies. Extensive experimental
results show that our proposed method can identify near-optimal prompts and
reach the best VICL performance with a minimal cost that prior work has never
achieved.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications: optimizing prompts for Visual In-Context Learning), 2 (Experimental Results and Quantitative Metrics: extensive results showing near-optimal prompts with minimal cost), and 4 (Methodology and Implementation Details: clearly described task-level prompt search strategies). Additionally, shows novelty in introducing time-saving yet effective prompt search strategies, making it a strong candidate for further review.

---

## [Incrementally Learning Multiple Diverse Data Domains via Multi-Source Dynamic Expansion Model](https://arxiv.org/abs/2501.08878v1)
**arXiv ID:** 2501.08878v1

**Abstract:**
> Continual Learning seeks to develop a model capable of incrementally
assimilating new information while retaining prior knowledge. However, current
research predominantly addresses a straightforward learning context, wherein
all data samples originate from a singular data domain. This paper shifts focus
to a more complex and realistic learning environment, characterized by data
samples sourced from multiple distinct domains. We tackle this intricate
learning challenge by introducing a novel methodology, termed the Multi-Source
Dynamic Expansion Model (MSDEM), which leverages various pre-trained models as
backbones and progressively establishes new experts based on them to adapt to
emerging tasks. Additionally, we propose an innovative dynamic expandable
attention mechanism designed to selectively harness knowledge from multiple
backbones, thereby accelerating the new task learning. Moreover, we introduce a
dynamic graph weight router that strategically reuses all previously acquired
parameters and representations for new task learning, maximizing the positive
knowledge transfer effect, which further improves generalization performance.
We conduct a comprehensive series of experiments, and the empirical findings
indicate that our proposed approach achieves state-of-the-art performance.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications for complex learning environments), 2 (Experimental Results with state-of-the-art performance), and 3 (Comparison with State-of-the-Art). Also, shows novelty in methodology and approach, aligning with Additional Considerations for novelty and potential impact through experimental validation.

---

## [Decompose-ToM: Enhancing Theory of Mind Reasoning in Large Language Models through Simulation and Task Decomposition](https://arxiv.org/abs/2501.09056v1)
**arXiv ID:** 2501.09056v1

**Abstract:**
> Theory of Mind (ToM) is the ability to understand and reflect on the mental
states of others. Although this capability is crucial for human interaction,
testing on Large Language Models (LLMs) reveals that they possess only a
rudimentary understanding of it. Although the most capable closed-source LLMs
have come close to human performance on some ToM tasks, they still perform
poorly on complex variations of the task that involve more structured
reasoning. In this work, we utilize the concept of "pretend-play", or
"Simulation Theory" from cognitive psychology to propose "Decompose-ToM":
an LLM-based inference algorithm that improves model performance on complex ToM
tasks. We recursively simulate user perspectives and decompose the ToM task
into a simpler set of functions: subject identification, question-reframing,
world model updation, and knowledge availability. We test the algorithm on
higher-order ToM tasks and a task testing for ToM capabilities in a
conversational setting, demonstrating that our approach shows significant
improvement across models compared to baseline methods while requiring minimal
prompt tuning across tasks and no additional model training.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications in enhancing Theory of Mind with LLMs), 2 (Experimental Results with quantitative metrics showing performance improvement), and 4 (Clearly describes methodology and implementation). Additionally, shows novelty in approach (Simulation Theory for ToM) and has robust experimental validation, making it a strong candidate for further review.

---

## [Towards Multilingual LLM Evaluation for Baltic and Nordic languages: A study on Lithuanian History](https://arxiv.org/abs/2501.09154v1)
**arXiv ID:** 2501.09154v1

**Abstract:**
> In this work, we evaluated Lithuanian and general history knowledge of
multilingual Large Language Models (LLMs) on a multiple-choice
question-answering task. The models were tested on a dataset of Lithuanian
national and general history questions translated into Baltic, Nordic, and
other languages (English, Ukrainian, Arabic) to assess the knowledge sharing
from culturally and historically connected groups. We evaluated GPT-4o,
LLaMa3.1 8b and 70b, QWEN2.5 7b and 72b, Mistral Nemo 12b, LLaMa3 8b, Mistral
7b, LLaMa3.2 3b, and Nordic fine-tuned models (GPT-SW3 and LLaMa3 8b).
Our results show that GPT-4o consistently outperformed all other models
across language groups, with slightly better results for Baltic and Nordic
languages. Larger open-source models like QWEN2.5 72b and LLaMa3.1 70b
performed well but showed weaker alignment with Baltic languages. Smaller
models (Mistral Nemo 12b, LLaMa3.2 3b, QWEN 7B, LLaMa3.1 8B, and LLaMa3 8b)
demonstrated gaps with LT-related alignment with Baltic languages while
performing better on Nordic and other languages. The Nordic fine-tuned models
did not surpass multilingual models, indicating that shared cultural or
historical context alone does not guarantee better performance.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 2 (Experimental Results and Quantitative Metrics) by presenting comparative performance results of various LLMs on a specific task, and marginally meets criterion 1 (Practical Applications) due to its focus on evaluating LLMs' knowledge of Baltic and Nordic languages, which could imply potential for real-world applications in those regions. Prioritizing inclusiveness, this aligns with the goal of capturing a broad spectrum of relevant research.

---

## [The Veln(ia)s is in the Details: Evaluating LLM Judgment on Latvian and Lithuanian Short Answer Matching](https://arxiv.org/abs/2501.09164v1)
**arXiv ID:** 2501.09164v1

**Abstract:**
> In this work, we address the challenge of evaluating large language models
(LLMs) on the short answer matching task for Latvian and Lithuanian languages.
We introduce novel datasets consisting of 502 Latvian and 690 Lithuanian
question-answer pairs. For each question-answer pair, we generated matched and
non-matched answers using a set of alteration rules specifically designed to
introduce small but meaningful changes in the text. These generated answers
serve as test cases to assess the ability of LLMs to detect subtle differences
in matching of the original answers. A subset of the datasets was manually
verified for quality and accuracy. Our results show that while larger LLMs,
such as QWEN2.5 72b and LLaMa3.1 70b, demonstrate near-perfect performance in
distinguishing matched and non-matched answers, smaller models show more
variance. For instance, LLaMa3.1 8b and EuroLLM 9b benefited from few-shot
examples, while Mistral Nemo 12b underperformed on detection of subtle text
alteration, particularly in Lithuanian, even with additional examples. QWEN2.5
7b and Mistral 7b were able to obtain a strong and comparable performance to
the larger 70b models in zero and few shot experiments. Moreover, the
performance of Mistral 7b was weaker in few shot experiments.

**Decision Explanation:** Original decision: ACCEPT
Meets all required criteria (Practical Applications, Experimental Results and Quantitative Metrics, Comparison with State-of-the-Art) and one additional consideration (Novelty, through introduction of novel datasets and evaluation of LLMs on specific languages). Does not fall under rejection categories.

---

## [Attention is All You Need Until You Need Retention](https://arxiv.org/abs/2501.09166v1)
**arXiv ID:** 2501.09166v1

**Abstract:**
> This work introduces a novel Retention Layer mechanism for Transformer based
architectures, addressing their inherent lack of intrinsic retention
capabilities. Unlike human cognition, which can encode and dynamically recall
symbolic templates, Generative Pretrained Transformers rely solely on fixed
pretrained weights and ephemeral context windows, limiting their adaptability.
The proposed Retention Layer incorporates a persistent memory module capable of
real time data population, dynamic recall, and guided output generation. This
enhancement allows models to store, update, and reuse observed patterns across
sessions, enabling incremental learning and bridging the gap between static
pretraining and dynamic, context sensitive adaptation. The Retention Layer
design parallels social learning processes, encompassing attention, retention,
reproduction, and motivation stages. Technically, it integrates a memory
attention mechanism and episodic buffers to manage memory scalability, mitigate
overfitting, and ensure efficient recall. Applications span adaptive personal
assistants, real time fraud detection, autonomous robotics, content moderation,
and healthcare diagnostics. In each domain, the retention mechanism enables
systems to learn incrementally, personalize outputs, and respond to evolving
real world challenges effectively. By emulating key aspects of human learning,
this retention enhanced architecture fosters a more fluid and responsive AI
paradigm, paving the way for dynamic, session aware models that extend the
capabilities of traditional Transformers into domains requiring continual
adaptation.

**Decision Explanation:** Original decision: ACCEPT
Meets all required criteria (1, 2, 3) and additional considerations: focused on practical applications of LLMs (1), implied experimental results for performance improvements (2), compares with existing state-of-the-art Transformers (3); also meets optional criteria for novelty (introduces Retention Layer), methodology description, and potential for real-world impact, while avoiding rejected topic areas.

---

## [Guiding Retrieval using LLM-based Listwise Rankers](https://arxiv.org/abs/2501.09186v1)
**arXiv ID:** 2501.09186v1

**Abstract:**
> Large Language Models (LLMs) have shown strong promise as rerankers,
especially in "listwise" settings where an LLM is prompted to rerank several
search results at once. However, this "cascading" retrieve-and-rerank
approach is limited by the bounded recall problem: relevant documents not
retrieved initially are permanently excluded from the final ranking. Adaptive
retrieval techniques address this problem, but do not work with listwise
rerankers because they assume a document's score is computed independently from
other documents. In this paper, we propose an adaptation of an existing
adaptive retrieval method that supports the listwise setting and helps guide
the retrieval process itself (thereby overcoming the bounded recall problem for
LLM rerankers). Specifically, our proposed algorithm merges results both from
the initial ranking and feedback documents provided by the most relevant
documents seen up to that point. Through extensive experiments across diverse
LLM rerankers, first stage retrievers, and feedback sources, we demonstrate
that our method can improve nDCG@10 by up to 13.23% and recall by 28.02%, all
while keeping the total number of LLM inferences constant and overheads due to
the adaptive process minimal. The work opens the door to leveraging LLM-based
search in settings where the initial pool of results is limited, e.g., by
legacy systems, or by the cost of deploying a semantic first-stage.

**Decision Explanation:** Original decision: ACCEPT
Meets all required criteria: (1) focuses on practical application of LLMs in retrieval-augmented settings, (2) includes experimental results with quantitative metrics (nDCG@10, recall improvements), and (3) implicitly compares with state-of-the-art by addressing the bounded recall problem. Additionally, it meets extra criteria for (4) methodology description, (5) real-world application discussion, and shows novelty in adapting adaptive retrieval for listwise LLM rerankers.
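
The nDCG@10 figure quoted in the abstract above can be made concrete with a small sketch. This is the textbook linear-gain form of nDCG, not code from the paper; the function names and the `judged` parameter are assumptions chosen for illustration. Normalizing against all judged documents, rather than only the retrieved ones, is what makes the bounded recall problem visible in the score.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain of the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked: Sequence[float], judged: Sequence[float], k: int = 10) -> float:
    """nDCG@k: DCG of the returned ranking divided by the DCG of the ideal ranking.

    ranked: relevance labels of the returned documents, in rank order.
    judged: relevance labels of ALL judged documents for the query, so that
            relevant documents missing from the ranking still lower the score.
    """
    ideal_dcg = dcg_at_k(sorted(judged, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# A ranking that never retrieved the most relevant document (label 3)
# cannot reach 1.0, however well the reranker orders what it was given.
print(ndcg_at_k([2, 0], judged=[3, 2, 0]))
```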

---

## [Perspective Transition of Large Language Models for Solving Subjective
  Tasks](https://arxiv.org/abs/2501.09265v1)
**arXiv ID:** 2501.09265v1

**Abstract:**
> Large language models (LLMs) have revolutionized the field of natural
language processing, enabling remarkable progress in various tasks. Different
from objective tasks such as commonsense reasoning and arithmetic
question-answering, the performance of LLMs on subjective tasks is still
limited, where the perspective on the specific problem plays a crucial role in
interpreting the context and giving a proper response. For example, in
certain scenarios, LLMs may perform better when answering from an expert role
perspective, potentially eliciting their relevant domain knowledge. In
contrast, in some scenarios, LLMs may provide more accurate responses when
answering from a third-person standpoint, enabling a more comprehensive
understanding of the problem and potentially mitigating inherent biases. In
this paper, we propose Reasoning through Perspective Transition (RPT), a method
based on in-context learning that enables LLMs to dynamically select among
direct, role, and third-person perspectives to find the best way to solve the
corresponding subjective problem. In extensive experiments on a total of 12
subjective tasks using both closed-source and open-source LLMs, including
GPT-4, GPT-3.5, Llama-3, and Qwen-2, our method outperforms widely used
single-fixed-perspective methods such as chain-of-thought prompting and expert
prompting, highlighting the intricate ways that LLMs can adapt their perspectives
to provide nuanced and contextually appropriate responses for different
problems.

**Decision Explanation:** Original decision: ACCEPT
Meets criterion 1 (Practical Applications) through improved performance on subjective tasks, criterion 2 (Experimental Results and Quantitative Metrics) with extensive experiments on multiple LLMs, and partially meets criterion 5 (Real-world Applications and Challenges) by addressing LLM limitations. Additionally, shows novelty in perspective transition approach and has potential for implementability with current standard tools.

---

## [Style4Rec: Enhancing Transformer-based E-commerce Recommendation Systems
  with Style and Shopping Cart Information](https://arxiv.org/abs/2501.09354v1)
**arXiv ID:** 2501.09354v1

**Abstract:**
> Understanding users' product preferences is essential to the efficacy of a
recommendation system. Precision marketing leverages users' historical data to
discern these preferences and recommends products that align with them.
However, recent browsing and purchase records might better reflect current
purchasing inclinations. Transformer-based recommendation systems have made
strides in sequential recommendation tasks, but they often fall short in
utilizing product image style information and shopping cart data effectively.
In light of this, we propose Style4Rec, a transformer-based e-commerce
recommendation system that harnesses style and shopping cart information to
enhance existing transformer-based sequential product recommendation systems.
Style4Rec represents a significant step forward in personalized e-commerce
recommendations, outperforming benchmarks across various evaluation metrics.
Style4Rec resulted in notable improvements: HR@5 increased from 0.681 to 0.735,
NDCG@5 increased from 0.594 to 0.674, and MRR@5 increased from 0.559 to 0.654.
We tested our model using an e-commerce dataset from our partnering company and
found that it exceeded established transformer-based sequential recommendation
benchmarks across various evaluation metrics. Thus, Style4Rec presents a
significant step forward in personalized e-commerce recommendation systems.

**Decision Explanation:** Original decision: ACCEPT
Meets all required criteria: (1) Practical Applications in e-commerce, (2) Experimental Results with quantitative metrics (HR@5, NDCG@5, MRR@5), and (3) Comparison with State-of-the-Art (outperforming established transformer-based sequential recommendation benchmarks). Additionally, it meets extra criteria for Methodology and Implementation Details (described utilization of LLMs in practical tasks) and shows Novelty in approach.
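
For readers unfamiliar with the HR@5 and MRR@5 metrics reported in the abstract above, the following is a minimal sketch of their standard definitions for next-item recommendation with one ground-truth item per user. The function names and input layout are assumptions for illustration and are not taken from the Style4Rec code.

```python
from typing import Hashable, Sequence


def hit_rate_at_k(ranked_lists: Sequence[Sequence[Hashable]],
                  targets: Sequence[Hashable], k: int = 5) -> float:
    """Fraction of users whose held-out item appears in their top-k list."""
    if not targets:
        raise ValueError("need at least one user")
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / len(targets)


def mrr_at_k(ranked_lists: Sequence[Sequence[Hashable]],
             targets: Sequence[Hashable], k: int = 5) -> float:
    """Mean reciprocal rank of the held-out item; 0 if it is outside the top k."""
    if not targets:
        raise ValueError("need at least one user")
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top_k = list(ranked[:k])
        if target in top_k:
            total += 1.0 / (top_k.index(target) + 1)  # ranks are 1-based
    return total / len(targets)


# Three hypothetical users: a hit at rank 2, a hit at rank 3, and a miss.
recommendations = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"]]
purchased = ["b", "z", "n"]
print(hit_rate_at_k(recommendations, purchased, k=3))
print(mrr_at_k(recommendations, purchased, k=3))
```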

---

## [Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment](https://arxiv.org/abs/2501.09620v1)
**arXiv ID:** 2501.09620v1

**Abstract:**
> Recent advances in large language models (LLMs) have demonstrated significant
progress in performing complex tasks. While Reinforcement Learning from Human
Feedback (RLHF) has been effective in aligning LLMs with human preferences, it
is susceptible to spurious correlations in reward modeling. Consequently, it
often introduces biases, such as length bias, sycophancy, conceptual bias, and
discrimination, that hinder the model's ability to capture true causal
relationships. To address this, we propose a novel causal reward modeling
approach that integrates causal inference to mitigate these spurious
correlations. Our method enforces counterfactual invariance, ensuring reward
predictions remain consistent when irrelevant variables are altered. Through
experiments on both synthetic and real-world datasets, we show that our
approach mitigates various types of spurious correlations effectively,
resulting in more reliable and fair alignment of LLMs with human preferences.
As a drop-in enhancement to the existing RLHF workflow, our causal reward
modeling provides a practical way to improve the trustworthiness and fairness
of LLM finetuning.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications: LLM alignment), 2 (Experimental Results and Quantitative Metrics: experiments on synthetic and real-world datasets), and 3 (Comparison with State-of-the-Art: enhances existing RLHF workflow). Additionally, showcases Novelty (causal reward modeling approach) and has potential for Reproducibility and Documentation (describes a drop-in enhancement with experimental validation).
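
The counterfactual invariance idea in the abstract above (reward predictions should not change when an irrelevant variable is altered) can be illustrated with a simple diagnostic. This is only a sketch under stated assumptions: the abstract does not specify the paper's training objective, and `invariance_gap`, the two toy reward functions, and the example pair are all hypothetical.

```python
from typing import Callable, Sequence, Tuple


def invariance_gap(reward_fn: Callable[[str], float],
                   pairs: Sequence[Tuple[str, str]]) -> float:
    """Mean absolute reward difference across counterfactual pairs.

    Each pair holds two responses that should deserve the same reward and
    differ only in an irrelevant attribute (here: verbosity). A gap of 0
    means the reward function is invariant on these pairs.
    """
    if not pairs:
        raise ValueError("need at least one counterfactual pair")
    return sum(abs(reward_fn(a) - reward_fn(b)) for a, b in pairs) / len(pairs)


def length_biased_reward(text: str) -> float:
    """Toy reward with a spurious correlation: longer answers score higher."""
    return float(len(text.split()))


def content_reward(text: str) -> float:
    """Toy reward that looks only at whether the correct answer is present."""
    return 1.0 if "paris" in text.lower() else 0.0


pairs = [("Paris.", "The answer, to be clear and thorough, is Paris.")]
print(invariance_gap(length_biased_reward, pairs))  # positive: length leaks in
print(invariance_gap(content_reward, pairs))        # zero: invariant to length
```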

---

## [The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating
  Large Language Models](https://arxiv.org/abs/2501.09653v1)
**arXiv ID:** 2501.09653v1

**Abstract:**
> The recent rise in the popularity of large language models has spurred the
development of extensive code datasets needed to train them. This has left
limited code available for collection and use in the downstream investigation
of specific behaviors, or evaluation of large language models without suffering
from data contamination. To address this problem, we release The Heap, a large
multilingual dataset covering 57 programming languages that has been
deduplicated with respect to other open datasets of code, enabling researchers
to conduct fair evaluations of large language models without significant data
cleaning overhead.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 2 (Experimental Results and Quantitative Metrics implied through dataset deduplication for fair LLM evaluation), 4 (Methodology and Implementation Details for dataset creation), and shows potential for Impact through Experimental Validation. Novelty is also present in the creation of a contamination-free multilingual code dataset, making it a valuable resource for LLM evaluation.
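
Deduplicating a new dataset against existing ones, as the abstract above describes, can be sketched with exact matching on content hashes. The abstract does not say how The Heap performs deduplication, so the normalization rule and the helper names below are assumptions for illustration only.

```python
import hashlib
from typing import Iterable, Iterator


def normalize(source: str) -> str:
    """Drop blank lines and trailing whitespace so trivial edits still match."""
    return "\n".join(line.rstrip() for line in source.splitlines() if line.strip())


def fingerprint(source: str) -> str:
    """SHA-256 hex digest of the normalized file contents."""
    return hashlib.sha256(normalize(source).encode("utf-8")).hexdigest()


def filter_unseen(candidates: Iterable[str],
                  reference_hashes: Iterable[str]) -> Iterator[str]:
    """Yield candidate files whose fingerprint is absent from the reference
    datasets and has not already been yielded (internal deduplication)."""
    seen = set(reference_hashes)
    for source in candidates:
        digest = fingerprint(source)
        if digest not in seen:
            seen.add(digest)
            yield source


# Hypothetical data: one file already present in a reference dataset.
reference = {fingerprint("print('hi')\n")}
new_files = ["print('hi')   \n\n", "print('bye')\n", "print('bye')\n"]
print(list(filter_unseen(new_files, reference)))
```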

---

## [Towards Large Reasoning Models: A Survey of Reinforced Reasoning with
  Large Language Models](https://arxiv.org/abs/2501.09686v1)
**arXiv ID:** 2501.09686v1

**Abstract:**
> Language has long been conceived as an essential tool for human reasoning.
The breakthrough of Large Language Models (LLMs) has sparked significant
research interest in leveraging these models to tackle complex reasoning tasks.
Researchers have moved beyond simple autoregressive token generation by
introducing the concept of "thought" -- a sequence of tokens representing
intermediate steps in the reasoning process. This innovative paradigm enables
LLMs to mimic complex human reasoning processes, such as tree search and
reflective thinking. Recently, an emerging trend of learning to reason has
applied reinforcement learning (RL) to train LLMs to master reasoning
processes. This approach enables the automatic generation of high-quality
reasoning trajectories through trial-and-error search algorithms, significantly
expanding LLMs' reasoning capacity by providing substantially more training
data. Furthermore, recent studies demonstrate that encouraging LLMs to "think"
with more tokens during test-time inference can further significantly boost
reasoning accuracy. Together, train-time and test-time scaling therefore point to
a new research frontier: a path toward Large Reasoning Models. The
introduction of OpenAI's o1 series marks a significant milestone in this
research direction. In this survey, we present a comprehensive review of recent
progress in LLM reasoning. We begin by introducing the foundational background
of LLMs and then explore the key technical components driving the development
of large reasoning models, with a focus on automated data construction,
learning-to-reason techniques, and test-time scaling. We also analyze popular
open-source projects for building large reasoning models, and conclude with open
challenges and future research directions.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications: leveraging LLMs for complex reasoning tasks), 3 (Comparison with State-of-the-Art: implied through discussion of recent progress and open-source projects), and 4 (Methodology and Implementation Details: provides overview of key technical components). Also shows novelty in exploring Large Reasoning Models, making it a strong candidate for further review.

---

## [CyberMentor: AI Powered Learning Tool Platform to Address Diverse
  Student Needs in Cybersecurity Education](https://arxiv.org/abs/2501.09709v1)
**arXiv ID:** 2501.09709v1

**Abstract:**
> Many non-traditional students in cybersecurity programs often lack access to
advice from peers, family members and professors, which can hinder their
educational experiences. Additionally, these students may not fully benefit
from various LLM-powered AI assistants due to issues like content relevance,
locality of advice, minimum expertise, and timing. This paper addresses these
challenges by introducing an application designed to provide comprehensive
support by answering questions related to knowledge, skills, and career
preparation advice tailored to the needs of these students. We developed a
learning tool platform, CyberMentor, to address the diverse needs and pain
points of students majoring in cybersecurity. Powered by agentic workflow and
Generative Large Language Models (LLMs), the platform leverages
Retrieval-Augmented Generation (RAG) for accurate and contextually relevant
information retrieval to achieve accessibility and personalization. We
demonstrated its value in addressing knowledge requirements for cybersecurity
education and for career marketability, in tackling skill requirements for
analytical and programming assignments, and in delivering real-time, on-demand
learning support. Using three use scenarios, we showcased CyberMentor in
facilitating knowledge acquisition and career preparation and providing
seamless skill-based guidance and support. We also employed the LangChain
prompt-based evaluation methodology to evaluate the platform's impact,
confirming its strong performance in helpfulness, correctness, and
completeness. These results underscore the system's ability to support students
in developing practical cybersecurity skills while improving equity and
sustainability within higher education. Furthermore, CyberMentor's open-source
design allows for adaptation across other disciplines, fostering educational
innovation and broadening its potential impact.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications in cybersecurity education), 2 (Experimental Results with quantitative metrics through LangChain prompt-based evaluation), 4 (Methodology and Implementation Details with RAG and agentic workflow), and shows novelty in applying LLMs to address educational challenges, with potential for broad impact across disciplines.

---

## [A Simple Aerial Detection Baseline of Multimodal Language Models](https://arxiv.org/abs/2501.09720v1)
**arXiv ID:** 2501.09720v1

**Abstract:**
> The multimodal language models (MLMs) based on generative pre-trained
Transformers are considered powerful candidates for unifying various domains and
tasks. MLMs developed for remote sensing (RS) have demonstrated outstanding
performance in multiple tasks, such as visual question answering and visual
grounding. In addition to visual grounding, which detects specific objects
corresponding to a given instruction, aerial detection, which detects all objects
of multiple categories, is also a valuable and challenging task for RS
foundation models. However, aerial detection has not been explored by existing
RS MLMs because the autoregressive prediction mechanism of MLMs differs
significantly from the detection outputs. In this paper, we present a simple
baseline for applying MLMs to aerial detection for the first time, named
LMMRotate. Specifically, we first introduce a normalization method to transform
detection outputs into textual outputs to be compatible with the MLM framework.
Then, we propose an evaluation method, which ensures a fair comparison between
MLMs and conventional object detection models. We construct the baseline by
fine-tuning open-source general-purpose MLMs and achieve impressive detection
performance comparable to conventional detectors. We hope that this baseline
will serve as a reference for future MLM development, enabling more
comprehensive capabilities for understanding RS images. Code is available at
https://github.com/Li-Qingyun/mllm-mmrotate.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications in aerial detection, a real-world task), 2 (Experimental Results with quantitative metrics for detection performance), and 4 (Methodology and Implementation Details, including a novel normalization method for MLMs). Additionally, it shows Novelty in applying MLMs to aerial detection and provides Reproducibility with available code, outweighing minor uncertainties about comparison to State-of-the-Art and Agentic AI aspects.
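
To give a feel for what "transforming detection outputs into textual outputs" in the abstract above might involve, here is a rough sketch that serializes rotated boxes as integer tokens. The text format, the 1000-bin quantization, and the function name are assumptions made for illustration; the abstract does not describe LMMRotate's actual normalization scheme.

```python
from typing import Sequence, Tuple

# (label, center_x, center_y, width, height, angle_in_degrees), in pixels
RotatedBox = Tuple[str, float, float, float, float, float]


def boxes_to_text(boxes: Sequence[RotatedBox], img_w: int, img_h: int,
                  bins: int = 1000) -> str:
    """Serialize rotated boxes as plain text a language model could emit.

    All lengths are divided by the longer image side, so the box geometry is
    preserved on non-square images, then quantized to integers in [0, bins].
    Angles are rounded to whole degrees and wrapped into [0, 180).
    """
    scale = max(img_w, img_h)
    parts = []
    for label, cx, cy, w, h, angle in boxes:
        values = [round(v / scale * bins) for v in (cx, cy, w, h)]
        values.append(round(angle) % 180)
        parts.append(f"{label}: " + " ".join(str(v) for v in values))
    return "; ".join(parts)


# Hypothetical detection on a 1024x1024 image.
print(boxes_to_text([("plane", 512.0, 256.0, 102.4, 51.2, 30.0)], 1024, 1024))
```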

---

## [Knowledge prompt chaining for semantic modeling](https://arxiv.org/abs/2501.08540v1)
**arXiv ID:** 2501.08540v1

**Abstract:**
> The task of building semantics for structured data such as CSV, JSON, and XML
files is highly relevant in the knowledge representation field. Even though a
vast amount of structured data is available on the internet, mapping it to domain
ontologies to build semantics for it is still very challenging, as it requires
the construction model to understand and learn graph-structured knowledge.
Otherwise, the task requires costly human effort. In this paper,
we propose a novel automatic semantic modeling framework: Knowledge Prompt
Chaining. It can serialize the graph-structured knowledge and inject it into
the LLMs properly in a Prompt Chaining architecture. Through this knowledge
injection and prompt chaining, the model in our framework can learn the
structural information and latent space of the graph and generate the semantic
labels and semantic graphs by naturally following the chain's instructions. Based
on experimental results, our method achieves better performance than existing
leading techniques, despite using reduced structured input data.

**Decision Explanation:** Original decision: ACCEPT
Meets all required criteria: (1) Practical Applications in knowledge graphs, (2) Experimental Results with quantitative metrics showing performance improvements, and (3) Comparison with State-of-the-Art techniques. Additionally, meets optional criteria for Methodology and Implementation Details (4) and Novelty in applying LLMs to semantic modeling.

---

## [LAMS: LLM-Driven Automatic Mode Switching for Assistive Teleoperation](https://arxiv.org/abs/2501.08558v1)
**arXiv ID:** 2501.08558v1

**Abstract:**
> Teleoperating high degrees-of-freedom (DoF) robotic manipulators via low-DoF
controllers like joysticks often requires frequent switching between control
modes, where each mode maps controller movements to specific robot actions.
Manually performing this frequent switching can make teleoperation cumbersome
and inefficient. On the other hand, existing automatic mode-switching
solutions, such as heuristic-based or learning-based methods, are often
task-specific and lack generalizability. In this paper, we introduce LLM-Driven
Automatic Mode Switching (LAMS), a novel approach that leverages Large Language
Models (LLMs) to automatically switch control modes based on task context.
Unlike existing methods, LAMS requires no prior task demonstrations and
incrementally improves by integrating user-generated mode-switching examples.
We validate LAMS through an ablation study and a user study with 10
participants on complex, long-horizon tasks, demonstrating that LAMS
effectively reduces manual mode switches, is preferred over alternative
methods, and improves performance over time. The project website with
supplementary materials is at https://lams-assistance.github.io/.

**Decision Explanation:** Original decision: ACCEPT
Meets all required criteria: Practical Applications (teleoperation with LLMs), Experimental Results and Quantitative Metrics (ablation and user studies with concrete results), and Comparison with State-of-the-Art (improvements over alternative methods). Additionally, meets extra criteria for Novelty (novel approach leveraging LLMs), Reproducibility and Documentation (supplementary materials provided), and Impact through Experimental Validation (robust user study with real-world task implications).

---

## [ANSR-DT: An Adaptive Neuro-Symbolic Learning and Reasoning Framework for
  Digital Twins](https://arxiv.org/abs/2501.08561v1)
**arXiv ID:** 2501.08561v1

**Abstract:**
> In this paper, we propose an Adaptive Neuro-Symbolic Learning Framework for
digital twin technology called "ANSR-DT." Our approach combines pattern
recognition algorithms with reinforcement learning and symbolic reasoning to
enable real-time learning and adaptive intelligence. This integration enhances
the understanding of the environment and promotes continuous learning, leading
to better and more effective decision-making in real-time for applications that
require human-machine collaboration. We evaluated the ANSR-DT
framework for its ability to learn and adapt to dynamic patterns, observing
significant improvements in decision accuracy, reliability, and
interpretability when compared to existing state-of-the-art methods. However,
challenges still exist in extracting and integrating symbolic rules in complex
environments, which limits the full potential of our framework in heterogeneous
settings. Moreover, our ongoing research aims to address this issue in the
future by ensuring seamless integration of neural models at large. In addition,
our open-source implementation promotes reproducibility and encourages future
research to build on our foundational work.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications in human-machine collaboration), 2 (Experimental Results with quantitative metrics for decision accuracy), 3 (Comparison with State-of-the-Art showing performance improvements), and 4 (Methodology and Implementation Details with open-source implementation for reproducibility). Additionally, it touches on limitations and future work, indicating potential for further innovative applications with LLMs. Although LLMs are not explicitly mentioned, the neuro-symbolic learning framework might imply potential integration or relevance.

---

## [RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation](https://arxiv.org/abs/2501.08617v1)
**arXiv ID:** 2501.08617v1

**Abstract:**
> Generative AI systems like foundation models (FMs) must align well with human
values to ensure their behavior is helpful and trustworthy. While Reinforcement
Learning from Human Feedback (RLHF) has shown promise for optimizing model
performance using human judgments, existing RLHF pipelines predominantly rely
on immediate feedback, which can fail to accurately reflect the downstream
impact of an interaction on users' utility. We demonstrate that feedback based
on evaluators' foresight estimates of downstream consequences systematically
induces Goodhart's Law dynamics, incentivizing misaligned behaviors like
sycophancy and deception and ultimately degrading user outcomes. To alleviate
this, we propose decoupling evaluation from prediction by refocusing RLHF on
hindsight feedback. Our theoretical analysis reveals that conditioning
evaluator feedback on downstream observations mitigates misalignment and
improves expected human utility, even when these observations are simulated by
the AI system itself. To leverage this insight in a practical alignment
algorithm, we introduce Reinforcement Learning from Hindsight Simulation
(RLHS), which first simulates plausible consequences and then elicits feedback
to assess what behaviors were genuinely beneficial in hindsight. We apply RLHS
to two widely-employed online and offline preference optimization methods --
Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) --
and show empirically that misalignment is significantly reduced with both
methods. Through an online human user study, we show that RLHS consistently
outperforms RLHF in helping users achieve their goals and earns higher
satisfaction ratings, despite being trained solely with simulated hindsight
feedback. These results underscore the importance of focusing on long-term
consequences, even simulated ones, to mitigate misalignment in RLHF.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications) through its focus on improving alignment in generative AI systems, 2 (Experimental Results and Quantitative Metrics) with empirical results from online human user studies, and 3 (Comparison with State-of-the-Art) by comparing RLHS to existing RLHF techniques, showing performance improvements. Additionally, it touches on Novelty and Reproducibility with its introduction of RLHS and documented methodologies.

---

## [Leveraging LLM Agents for Translating Network Configurations](https://arxiv.org/abs/2501.08760v1)
**arXiv ID:** 2501.08760v1

**Abstract:**
> Configuration translation is a critical and frequent task in network
operations. When a network device is damaged or outdated, administrators need
to replace it to maintain service continuity. The replacement devices may
originate from different vendors, necessitating configuration translation to
ensure seamless network operation. However, translating configurations manually
is a labor-intensive and error-prone process. In this paper, we propose an
intent-based framework for translating network configuration with Large
Language Model (LLM) Agents. The core of our approach is an Intent-based
Retrieval Augmented Generation (IRAG) module that systematically splits a
configuration file into fragments, extracts intents, and generates accurate
translations. We also design a two-stage verification method to validate the
syntax and semantics correctness of the translated configurations. We implement
and evaluate the proposed method on real-world network configurations.
Experimental results show that our method achieves 97.74% syntax correctness,
outperforming state-of-the-art methods in translation accuracy.

**Decision Explanation:** Original decision: ACCEPT
Meets criterion 1 (Practical Applications: network operations), criterion 2 (Experimental Results and Quantitative Metrics: 97.74% syntax correctness outperforming state-of-the-art), and criterion 3 (Comparison with State-of-the-Art). Additionally, exhibits novelty in applying LLMs with Retrieval Augmented Generation (RAG) for network configuration translation.
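
The first step the abstract above attributes to the IRAG module, splitting a configuration file into fragments, might look roughly like the sketch below. It assumes an indentation-based, Cisco-style syntax in which a non-indented line opens a new stanza and `!` is a separator; the function name and sample configuration are hypothetical, and the paper's own splitting rules are not given in the abstract.

```python
from typing import List


def split_into_fragments(config_text: str) -> List[str]:
    """Group a device configuration into top-level stanzas.

    A non-indented line starts a new fragment; indented lines belong to the
    fragment opened by the most recent non-indented line. Blank lines and
    bare "!" separator lines are skipped.
    """
    fragments: List[str] = []
    current: List[str] = []
    for line in config_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped == "!":
            continue
        if not line[0].isspace() and current:
            fragments.append("\n".join(current))
            current = []
        current.append(line.rstrip())
    if current:
        fragments.append("\n".join(current))
    return fragments


sample = """hostname r1
!
interface Gi0/0
 ip address 10.0.0.1 255.255.255.0
 no shutdown
!
router ospf 1
 network 10.0.0.0 0.0.0.255 area 0
"""
for fragment in split_into_fragments(sample):
    print(repr(fragment))
```

In a pipeline like the one described, each fragment would then be passed on for intent extraction and translation; those steps depend on an LLM and are omitted here.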

---

## [How Developers Interact with AI: A Taxonomy of Human-AI Collaboration in
  Software Engineering](https://arxiv.org/abs/2501.08774v1)
**arXiv ID:** 2501.08774v1

**Abstract:**
> Artificial intelligence (AI), including large language models and generative
AI, is emerging as a significant force in software development, offering
developers powerful tools that span the entire development lifecycle. Although
software engineering research has extensively studied AI tools in software
development, the specific types of interactions between developers and these
AI-powered tools have only recently begun to receive attention. Understanding
and improving these interactions has the potential to improve productivity,
trust, and efficiency in AI-driven workflows. In this paper, we propose a
taxonomy of interaction types between developers and AI tools, identifying
eleven distinct interaction types, such as auto-complete code suggestions,
command-driven actions, and conversational assistance. Building on this
taxonomy, we outline a research agenda focused on optimizing AI interactions,
improving developer control, and addressing trust and usability challenges in
AI-assisted development. By establishing a structured foundation for studying
developer-AI interactions, this paper aims to stimulate research on creating
more effective, adaptive AI tools for software development.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications in software development), 4 (Methodology and Implementation Details through proposed taxonomy), and potentially 2 (though no explicit experimental results are mentioned in the abstract, the outlined research agenda implies future quantitative metrics). Novel approach to studying developer-AI interactions in software engineering also suggests some degree of novelty.

---

## [IDEA: Image Description Enhanced CLIP-Adapter](https://arxiv.org/abs/2501.08816v1)
**arXiv ID:** 2501.08816v1

**Abstract:**
> CLIP (Contrastive Language-Image Pre-training) has attained great success in
pattern recognition and computer vision. Transferring CLIP to downstream tasks
(e.g. zero- or few-shot classification) is a hot topic in multimodal learning.
However, current studies primarily focus on either prompt learning for text or
adapter tuning for vision, without fully exploiting the complementary
information and correlations among image-text pairs. In this paper, we propose
an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to
few-shot image classification tasks. This method captures fine-grained features
by leveraging both visual features and textual descriptions of images. IDEA is
a training-free method for CLIP, and it is comparable to or even exceeds
state-of-the-art models on multiple tasks. Furthermore, we introduce
Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable
components (i.e., a projector and a learnable latent space), further enhancing
the model's performance and achieving SOTA results on 11 datasets. As one
important contribution, we employ the Llama model and design a comprehensive
pipeline to generate textual descriptions for images of 11 datasets, resulting
in a total of 1,637,795 image-text pairs, named "IMD-11". Our code and data are
released at https://github.com/FourierAI/IDEA.

**Decision Explanation:** Original decision: ACCEPT
Meets all required criteria: Practical Applications (adaptation for few-shot image classification), Experimental Results and Quantitative Metrics (comparable to or exceeding state-of-the-art models), and Comparison with State-of-the-Art (achieving SOTA results on 11 datasets). Additionally, meets optional criteria for Methodology and Implementation Details (clear description of IDEA and T-IDEA) and Novelty (introducing a new pipeline with LLM for generating image-text pairs).

---

## [MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents](https://arxiv.org/abs/2501.08828v1)
**arXiv ID:** 2501.08828v1

**Abstract:**
> Multi-modal document retrieval is designed to identify and retrieve various
forms of multi-modal content, such as figures, tables, charts, and layout
information from extensive documents. Despite its significance, there is a
notable lack of a robust benchmark to effectively evaluate the performance of
systems in multi-modal document retrieval. To address this gap, this work
introduces a new benchmark, named MMDocIR, encompassing two distinct tasks:
page-level and layout-level retrieval. The former focuses on localizing the
most relevant pages within a long document, while the latter targets the
detection of specific layouts, offering a more fine-grained granularity than
whole-page analysis. A layout can refer to a variety of elements such as
textual paragraphs, equations, figures, tables, or charts. The MMDocIR
benchmark comprises a rich dataset featuring expertly annotated labels for
1,685 questions and bootstrapped labels for 173,843 questions, making it a
pivotal resource for advancing multi-modal document retrieval for both training
and evaluation. Through rigorous experiments, we reveal that (i) visual
retrievers significantly outperform their text counterparts, (ii) MMDocIR train
set can effectively benefit the training process of multi-modal document
retrieval and (iii) text retrievers leveraging VLM-text perform much better
than those using OCR-text. These findings underscore the potential advantages
of integrating visual elements for multi-modal document retrieval.

**Decision Explanation:** Original decision: ACCEPT
Meets criterion 1 (Practical Applications) through multi-modal document retrieval, criterion 2 (Experimental Results and Quantitative Metrics) with rigorous experiments showing comparative performance, and criterion 3 (Comparison with State-of-the-Art) by evaluating the effectiveness of different retrieval approaches. Additionally, it aligns with the area of retrieval-augmented generation (RAG) and demonstrates novelty in integrating visual elements for document retrieval.

---

## [Disentangling Exploration of Large Language Models by Optimal
  Exploitation](https://arxiv.org/abs/2501.08925v1)
**arXiv ID:** 2501.08925v1

**Abstract:**
> Exploration is a crucial skill for self-improvement and open-ended
problem-solving. However, it remains uncertain whether large language models
can effectively explore the state-space. Existing evaluations predominantly
focus on the trade-off between exploration and exploitation, often assessed in
multi-armed bandit problems. In contrast, this work isolates exploration as the
sole objective, tasking the agent with delivering information that enhances
future returns. For the evaluation, we propose to decompose missing rewards
into exploration and exploitation components by measuring the optimal
achievable return for the states already explored. Our experiments with various
LLMs reveal that most models struggle to sufficiently explore the state-space
and that weak exploration is insufficient. We observe a positive correlation
between model size and exploration performance, with larger models
demonstrating superior capabilities. Furthermore, we show that our
decomposition provides insights into differences in behaviors driven by agent
instructions during prompt engineering, offering a valuable tool for refining
LLM performance in exploratory tasks.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications with prompt engineering), 2 (Experimental Results with quantitative metrics on exploration performance), and 3 (Comparison with State-of-the-Art implicit in evaluating various LLMs). Additionally, shows Novelty in approach and has potential for Impact through Experimental Validation, warranting further review.

---

## [SteLLA: A Structured Grading System Using LLMs with RAG](https://arxiv.org/abs/2501.09092v1)
**arXiv ID:** 2501.09092v1

**Abstract:**
> Large Language Models (LLMs) have shown strong general capabilities in many
applications. However, how to make them reliable tools for some specific tasks
such as automated short answer grading (ASAG) remains a challenge. We present
SteLLA (Structured Grading System Using LLMs with RAG) in which a) a Retrieval
Augmented Generation (RAG) approach is used to empower LLMs specifically on the
ASAG task by extracting structured information from the highly relevant and
reliable external knowledge based on the instructor-provided reference answer
and rubric, and b) an LLM performs a structured and question-answering-based
evaluation of student answers to provide analytical grades and feedback. A
real-world dataset that contains students' answers in an exam was collected
from a college-level Biology course. Experiments show that our proposed system
can achieve substantial agreement with the human grader while providing
break-down grades and feedback on all the knowledge points examined in the
problem. A qualitative and error analysis of the feedback generated by GPT4
shows that GPT4 is good at capturing facts but may be prone to inferring too
much implication from the given text in the grading task, which provides
insights into the usage of LLMs in the ASAG system.

**Decision Explanation:** Original decision: ACCEPT
Meets all mandatory criteria (1: Practical Applications in automated short answer grading, 2: Experimental Results with quantitative metrics on agreement with human grader, 3: Comparison with State-of-the-Art through evaluation of GPT4). Additionally, meets optional criteria 4 (clear methodology and implementation details) and has potential for novelty in applying LLMs with RAG for structured grading. Does not fall under rejection categories.

---

## [Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG](https://arxiv.org/abs/2501.09136v1)
**arXiv ID:** 2501.09136v1

**Abstract:**
> Large Language Models (LLMs) have revolutionized artificial intelligence (AI)
by enabling human-like text generation and natural language understanding.
However, their reliance on static training data limits their ability to respond
to dynamic, real-time queries, resulting in outdated or inaccurate outputs.
Retrieval Augmented Generation (RAG) has emerged as a solution, enhancing LLMs
by integrating real-time data retrieval to provide contextually relevant and
up-to-date responses. Despite its promise, traditional RAG systems are
constrained by static workflows and lack the adaptability required for
multistep reasoning and complex task management.
  Agentic Retrieval-Augmented Generation (Agentic RAG) transcends these
limitations by embedding autonomous AI agents into the RAG pipeline. These
agents leverage agentic design patterns (reflection, planning, tool use, and
multiagent collaboration) to dynamically manage retrieval strategies,
iteratively refine contextual understanding, and adapt workflows to meet
complex task requirements. This integration enables Agentic RAG systems to
deliver unparalleled flexibility, scalability, and context awareness across
diverse applications.
  This survey provides a comprehensive exploration of Agentic RAG, beginning
with its foundational principles and the evolution of RAG paradigms. It
presents a detailed taxonomy of Agentic RAG architectures, highlights key
applications in industries such as healthcare, finance, and education, and
examines practical implementation strategies. Additionally, it addresses
challenges in scaling these systems, ensuring ethical decision making, and
optimizing performance for real-world applications, while providing detailed
insights into frameworks and tools for implementing Agentic RAG.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications in RAG and Agentic AI), 3 (Comparison implied through taxonomy and evolution of RAG paradigms), and 5 (Real-world Applications and Challenges in industries like finance and education). Also scores high on Additional Considerations for Agentic AI, Novelty, and Reproducibility (through detailed implementation strategies and frameworks).

---

## [Clone-Robust AI Alignment](https://arxiv.org/abs/2501.09254v1)
**arXiv ID:** 2501.09254v1

**Abstract:**
> A key challenge in training Large Language Models (LLMs) is properly aligning
them with human preferences. Reinforcement Learning with Human Feedback (RLHF)
uses pairwise comparisons from human annotators to train reward functions and
has emerged as a popular alignment method. However, input datasets in RLHF are
not necessarily balanced in the types of questions and answers that are
included. Therefore, we want RLHF algorithms to perform well even when the set
of alternatives is not uniformly distributed. Drawing on insights from social
choice theory, we introduce robustness to approximate clones, a desirable
property of RLHF algorithms which requires that adding near-duplicate
alternatives does not significantly change the learned reward function. We
first demonstrate that the standard RLHF algorithm based on regularized maximum
likelihood estimation (MLE) fails to satisfy this property. We then propose the
weighted MLE, a new RLHF algorithm that modifies the standard regularized MLE
by weighting alternatives based on their similarity to other alternatives. This
new algorithm guarantees robustness to approximate clones while preserving
desirable theoretical properties.

**Decision Explanation:** Original decision: ACCEPT
Meets all required criteria (1, 2, 3) with practical applications in AI alignment, experimental results with quantitative metrics implied, and comparison with existing state-of-the-art techniques. Additionally, it introduces novel approaches (novelty), with potential implementability and impact through experimental validation.

---

## [To Retrieve or Not to Retrieve? Uncertainty Detection for Dynamic
  Retrieval Augmented Generation](https://arxiv.org/abs/2501.09292v1)
**arXiv ID:** 2501.09292v1

**Abstract:**
> Retrieval-Augmented Generation equips large language models with the
capability to retrieve external knowledge, thereby mitigating hallucinations by
incorporating information beyond the model's intrinsic abilities. However, most
prior works have focused on invoking retrieval deterministically, which makes
it unsuitable for tasks such as long-form question answering. Instead,
dynamically performing retrieval by invoking it only when the underlying LLM
lacks the required knowledge can be more efficient. In this context, we delve
deeper into the question, "To Retrieve or Not to Retrieve?" by exploring
multiple uncertainty detection methods. We evaluate these methods for the task
of long-form question answering, employing dynamic retrieval, and present our
comparisons. Our findings suggest that uncertainty detection metrics, such as
Degree Matrix Jaccard and Eccentricity, can reduce the number of retrieval
calls by almost half, with only a slight reduction in question-answering
accuracy.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications in Retrieval-Augmented Generation), 2 (Experimental Results with quantitative metrics on reducing retrieval calls and question-answering accuracy), and 3 (Comparison with State-of-the-Art methodologies in uncertainty detection for dynamic retrieval). Additionally, shows novelty in applying uncertainty detection metrics to optimize LLM usage.

---

## [A Study of In-Context-Learning-Based Text-to-SQL Errors](https://arxiv.org/abs/2501.09310v1)
**arXiv ID:** 2501.09310v1

**Abstract:**
> Large language models (LLMs) have been adopted to perform text-to-SQL tasks,
utilizing their in-context learning (ICL) capability to translate natural
language questions into structured query language (SQL). However, such a
technique faces correctness problems and requires efficient repairing
solutions. In this paper, we conduct the first comprehensive study of
text-to-SQL errors. Our study covers four representative ICL-based techniques,
five basic repairing methods, two benchmarks, and two LLM settings. We find
that text-to-SQL errors are widespread and summarize 29 error types of 7
categories. We also find that existing repairing attempts have limited
correctness improvement at the cost of high computational overhead with many
mis-repairs. Based on the findings, we propose MapleRepair, a novel text-to-SQL
error detection and repairing framework. The evaluation demonstrates that
MapleRepair outperforms existing solutions by repairing 13.8% more queries with
negligible mis-repairs and 67.4% less overhead.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications: text-to-SQL tasks with LLMs), 2 (Experimental Results: quantitative metrics demonstrating MapleRepair's performance improvements), 3 (Comparison with State-of-the-Art: outperforms existing solutions), and 4 (Methodology and Implementation Details: clearly describes novel MapleRepair framework). Additionally, shows Novelty and robust Experimental Validation.

---

## [Rational Tuning of LLM Cascades via Probabilistic Modeling](https://arxiv.org/abs/2501.09345v1)
**arXiv ID:** 2501.09345v1

**Abstract:**
> Understanding the reliability of large language models (LLMs) has recently
garnered significant attention. Given LLMs' propensity to hallucinate, as well
as their high sensitivity to prompt design, it is already challenging to
predict the performance of an individual LLM. However, the problem becomes more
complex for compound LLM systems such as cascades, where in addition to each
model's standalone performance, we must understand how the error rates of
different models interact. In this paper, we present a probabilistic model for
the joint performance distribution of a sequence of LLMs, which enables a
framework for rationally tuning the confidence thresholds of an LLM cascade
using continuous optimization. Compared to selecting confidence thresholds
using grid search, our parametric Markov-copula model significantly improves
runtime scaling with respect to the length of the cascade and the desired
resolution of the cost-error curve, turning them from intractable into
low-order polynomial. In addition, the optimal thresholds computed using our
continuous optimization-based algorithm increasingly outperform those found via
grid search as cascade length grows, improving the area under the cost-error
curve by 1.9% on average for cascades consisting of at least three models.
Overall, our Markov-copula model provides a rational basis for tuning LLM
cascade performance and points to the potential of probabilistic methods in
analyzing LLM systems.

**Decision Explanation:** Original decision: ACCEPT
Meets all required criteria: (1) Practical Applications - tuning LLM cascades, (2) Experimental Results and Quantitative Metrics - demonstrated performance improvements with specific metrics, and (3) Comparison with State-of-the-Art - compared with grid search and showed advancements. Additionally, meets criteria (4) Methodology and Implementation Details, and shows Novelty in utilizing a probabilistic model for LLM cascade tuning.

---

## [YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents
  in Augmented Reality Tasks](https://arxiv.org/abs/2501.09355v1)
**arXiv ID:** 2501.09355v1

**Abstract:**
> Multimodal AI Agents are AI models that have the capability of interactively
and cooperatively assisting human users to solve day-to-day tasks. Augmented
Reality (AR) head worn devices can uniquely improve the user experience of
solving procedural day-to-day tasks by providing egocentric multimodal (audio
and video) observational capabilities to AI Agents. Such AR capabilities can
help AI Agents see and listen to actions that users take which can relate to
multimodal capabilities of human users. Existing AI Agents, either Large
Language Models (LLMs) or Multimodal Vision-Language Models (VLMs), are reactive
in nature, which means that models cannot take an action without reading or
listening to the human user's prompts. Proactivity of AI Agents, on the other
hand, can help the human user detect and correct any mistakes in agent-observed
tasks, encourage users when they do tasks correctly, or simply engage in
conversation with the user, akin to a human teaching or assisting a user. Our
proposed YET to Intervene (YETI) multimodal agent focuses on the research
question of identifying circumstances that may require the agent to intervene
proactively. This allows the agent to understand when it can intervene in a
conversation with human users that can help the user correct mistakes on tasks,
like cooking, using AR. Our YETI Agent learns scene understanding signals based
on interpretable notions of Structural Similarity (SSIM) on consecutive video
frames. We also define the alignment signal which the AI Agent can learn to
identify if the video frames corresponding to the user's actions on the task
are consistent with expected actions. These signals are used by our AI Agent to
determine when it should proactively intervene. We compare our results on the
instances of proactive intervention in the HoloAssist multimodal benchmark for
an expert agent guiding a user to complete procedural tasks.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications in AR tasks), 2 (Experimental Results with quantitative metrics implied by benchmark comparison), and 3 (Comparison with State-of-the-Art via HoloAssist multimodal benchmark). Also aligns with Agentic AI consideration by enabling autonomous, proactive interventions. While it involves video processing, its primary focus is on multimodal AI agent interactions, making it an acceptable edge case.

---

## [MoE$^2$: Optimizing Collaborative Inference for Edge Large Language
  Models](https://arxiv.org/abs/2501.09410v1)
**arXiv ID:** 2501.09410v1

**Abstract:**
> Large language models (LLMs) have demonstrated remarkable capabilities across
a wide range of natural language processing tasks. Exploiting the heterogeneous
capabilities of edge LLMs is crucial for diverse emerging applications, as it
enables greater cost-effectiveness and reduced latency. In this work, we
introduce *Mixture-of-Edge-Experts (MoE$^2$)*, a novel collaborative
inference framework for edge LLMs. We formulate the joint gating and expert
selection problem to optimize inference performance under energy and latency
constraints. Unlike conventional MoE problems, LLM expert selection is
significantly more challenging due to the combinatorial nature and the
heterogeneity of edge LLMs across various attributes. To this end, we propose a
two-level expert selection mechanism through which we uncover an
optimality-preserving property of gating parameters across expert selections.
This property enables the decomposition of the training and selection
processes, significantly reducing complexity. Furthermore, we leverage the
objective's monotonicity and design a discrete monotonic optimization algorithm
for optimal expert selection. We implement edge servers with NVIDIA Jetson AGX
Orins and NVIDIA RTX 4090 GPUs, and perform extensive experiments. Our results
validate the performance improvements of various LLM models and show that our
MoE$^2$ method can achieve optimal trade-offs among different delay and energy
budgets, and outperforms baselines under various system resource constraints.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications: optimizing edge LLMs for cost-effectiveness and reduced latency), 2 (Experimental Results and Quantitative Metrics: extensive experiments with performance improvements), and 3 (Comparison with State-of-the-Art: outperforms baselines under various constraints). Additionally, shows novelty in introducing a collaborative inference framework (MoE$^2$) and has potential for implementability with current standard tools.

---

## [CarMem: Enhancing Long-Term Memory in LLM Voice Assistants through
  Category-Bounding](https://arxiv.org/abs/2501.09645v1)
**arXiv ID:** 2501.09645v1

**Abstract:**
> In today's assistant landscape, personalisation enhances interactions,
fosters long-term relationships, and deepens engagement. However, many systems
struggle with retaining user preferences, leading to repetitive user requests
and disengagement. Furthermore, the unregulated and opaque extraction of user
preferences in industry applications raises significant concerns about privacy
and trust, especially in regions with stringent regulations like Europe. In
response to these challenges, we propose a long-term memory system for voice
assistants, structured around predefined categories. This approach leverages
Large Language Models to efficiently extract, store, and retrieve preferences
within these categories, ensuring both personalisation and transparency. We
also introduce a synthetic multi-turn, multi-session conversation dataset
(CarMem), grounded in real industry data, tailored to an in-car voice assistant
setting. Benchmarked on the dataset, our system achieves an F1-score of .78 to
.95 in preference extraction, depending on category granularity. Our
maintenance strategy reduces redundant preferences by 95% and contradictory
ones by 92%, while the accuracy of optimal retrieval is at .87. Collectively,
the results demonstrate the system's suitability for industrial applications.

**Decision Explanation:** Original decision: ACCEPT
Meets all required criteria (1, 2, & 3): focuses on practical application of LLM in voice assistants, includes experimental results with quantitative metrics, and implicitly compares with existing state-of-the-art by highlighting improvements. Additionally meets optional criteria for methodology and implementation details (4), real-world applications (5), and novelty, with a clear approach to enhancing LLMs in a specific use case.

---

## [Authenticated Delegation and Authorized AI Agents](https://arxiv.org/abs/2501.09674v1)
**arXiv ID:** 2501.09674v1

**Abstract:**
> The rapid deployment of autonomous AI agents creates urgent challenges around
authorization, accountability, and access control in digital spaces. New
standards are needed to know whom AI agents act on behalf of and guide their
use appropriately, protecting online spaces while unlocking the value of task
delegation to autonomous agents. We introduce a novel framework for
authenticated, authorized, and auditable delegation of authority to AI agents,
where human users can securely delegate and restrict the permissions and scope
of agents while maintaining clear chains of accountability. This framework
builds on existing identification and access management protocols, extending
OAuth 2.0 and OpenID Connect with agent-specific credentials and metadata,
maintaining compatibility with established authentication and web
infrastructure. Further, we propose a framework for translating flexible,
natural language permissions into auditable access control configurations,
enabling robust scoping of AI agent capabilities across diverse interaction
modalities. Taken together, this practical approach facilitates immediate
deployment of AI agents while addressing key security and accountability
concerns, working toward ensuring agentic AI systems perform only appropriate
actions and providing a tool for digital service providers to enable AI agent
interactions without risking harm from scalable interaction.

**Decision Explanation:** Original decision: ACCEPT
Meets all required criteria: (1) Practical Applications (agentic AI, task delegation), (2) implied Experimental Results (though not detailed in abstract), and (3) Comparison with State-of-the-Art (extends OAuth 2.0 and OpenID Connect). Additionally, meets optional criteria (4) Methodology and Implementation Details, (5) Real-world Applications and Challenges, and shows Novelty in authenticated delegation for AI agents, making it a strong candidate for further review.

---

## [ADAGE: A generic two-layer framework for adaptive agent based modelling](https://arxiv.org/abs/2501.09429v1)
**arXiv ID:** 2501.09429v1

**Abstract:**
> Agent-based models (ABMs) are valuable for modelling complex, potentially
out-of-equilibria scenarios. However, ABMs have long suffered from the Lucas
critique, stating that agent behaviour should adapt to environmental changes.
Furthermore, the environment itself often adapts to these behavioural changes,
creating a complex bi-level adaptation problem. Recent progress integrating
multi-agent reinforcement learning into ABMs introduces adaptive agent
behaviour, beginning to address the first part of this critique; however, the
approaches are still relatively ad hoc, lacking a general formulation, and
furthermore, do not tackle the second aspect of simultaneously adapting
environmental level characteristics in addition to the agent behaviours. In
this work, we develop a generic two-layer framework for ADaptive AGEnt based
modelling (ADAGE) for addressing these problems. This framework formalises the
bi-level problem as a Stackelberg game with conditional behavioural policies,
providing a consolidated framework for adaptive agent-based modelling based on
solving a coupled set of non-linear equations. We demonstrate how this generic
approach encapsulates several common (previously viewed as distinct) ABM tasks,
such as policy design, calibration, scenario generation, and robust behavioural
learning under one unified framework. We provide example simulations on
multiple complex economic and financial environments, showing the strength of
the novel framework under these canonical settings, addressing long-standing
critiques of traditional ABMs.

**Decision Explanation:** Original decision: ACCEPT
Meets criteria 1 (Practical Applications in agentic AI), 2 (Experimental Results, with simulations on complex environments), and 3 (Comparison with State-of-the-Art, addressing long-standing critiques of traditional ABMs). Also shows novelty in framework development and potential for real-world applications in economic and financial environments, aligning with additional considerations for impact and implementability.

---

# Rejected Papers

## [DualOpt: A Dual Divide-and-Optimize Algorithm for the Large-scale
  Traveling Salesman Problem](https://arxiv.org/abs/2501.08565v1)
**arXiv ID:** 2501.08565v1

**Abstract:**
> This paper proposes a dual divide-and-optimize algorithm (DualOpt) for
solving the large-scale traveling salesman problem (TSP). DualOpt combines two
complementary strategies to improve both solution quality and computational
efficiency. The first strategy is a grid-based divide-and-conquer procedure
that partitions the TSP into smaller sub-problems, solving them in parallel and
iteratively refining the solution by merging nodes and partial routes. The
process continues until only one grid remains, yielding a high-quality initial
solution. The second strategy involves a path-based divide-and-optimize
procedure that further optimizes the solution by dividing it into sub-paths,
optimizing each using a neural solver, and merging them back to progressively
improve the overall solution. Extensive experiments conducted on two groups of
TSP benchmark instances, including randomly generated instances with up to
100,000 nodes and real-world datasets from TSPLIB, demonstrate the
effectiveness of DualOpt. The proposed DualOpt achieves highly competitive
results compared to 10 state-of-the-art algorithms in the literature. In
particular, DualOpt achieves an improvement gap up to 1.40% for the largest
instance TSP100K with a remarkable 104x speed-up over the leading heuristic
solver LKH3. Additionally, DualOpt demonstrates strong generalization on TSPLIB
benchmarks, confirming its capability to tackle diverse real-world TSP
applications.

**Decision Explanation:** Original decision: REJECT
The paper does not meet any of the required criteria, as it focuses on solving the Traveling Salesman Problem using a divide-and-optimize algorithm, with no apparent connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation, or agentic AI. Its novelty and achievements are within the optimization domain, not relevant to the specified LLM-related criteria.

---

## [Development and Validation of the Provider Documentation Summarization
  Quality Instrument for Large Language Models](https://arxiv.org/abs/2501.08977v1)
**arXiv ID:** 2501.08977v1

**Abstract:**
> As Large Language Models (LLMs) are integrated into electronic health record
(EHR) workflows, validated instruments are essential to evaluate their
performance before implementation. Existing instruments for provider
documentation quality are often unsuitable for the complexities of
LLM-generated text and lack validation on real-world data. The Provider
Documentation Summarization Quality Instrument (PDSQI-9) was developed to
evaluate LLM-generated clinical summaries. Multi-document summaries were
generated from real-world EHR data across multiple specialties using several
LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b). Validation included Pearson
correlation for substantive validity, factor analysis and Cronbach's alpha for
structural validity, inter-rater reliability (ICC and Krippendorff's alpha) for
generalizability, a semi-Delphi process for content validity, and comparisons
of high- versus low-quality summaries for discriminant validity. Seven
physician raters evaluated 779 summaries and answered 8,329 questions,
achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated
strong internal consistency (Cronbach's alpha = 0.879; 95% CI: 0.867-0.891) and
high inter-rater reliability (ICC = 0.867; 95% CI: 0.867-0.868), supporting
structural validity and generalizability. Factor analysis identified a 4-factor
model explaining 58% of the variance, representing organization, clarity,
accuracy, and utility. Substantive validity was supported by correlations
between note length and scores for Succinct (rho = -0.200, p = 0.029) and
Organized (rho = -0.190, p = 0.037). Discriminant validity distinguished high-
from low-quality summaries (p < 0.001). The PDSQI-9 demonstrates robust
construct validity, supporting its use in clinical practice to evaluate
LLM-generated summaries and facilitate safer integration of LLMs into
healthcare workflows.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on medical applications of AI, specifically the integration of Large Language Models into electronic health record workflows, which is explicitly listed as a rejection criterion.

---

## [AI-based Identity Fraud Detection: A Systematic Review](https://arxiv.org/abs/2501.09239v1)
**arXiv ID:** 2501.09239v1

**Abstract:**
> With the rapid development of digital services, a large volume of personally
identifiable information (PII) is stored online and is subject to cyberattacks
such as identity fraud. Most recently, the use of Artificial Intelligence (AI)
enabled deep fake technologies has significantly increased the complexity of
identity fraud. Fraudsters may use these technologies to create highly
sophisticated counterfeit personal identification documents, photos and videos.
These advancements in the identity fraud landscape pose challenges for identity
fraud detection and society at large. There is a pressing need to review and
understand identity fraud detection methods, their limitations and potential
solutions. This research aims to address this important need by using the
well-known systematic literature review method. This paper reviewed a selected
set of 43 papers across 4 major academic literature databases. In particular,
the review results highlight the two types of identity fraud prevention and
detection methods, in-depth and open challenges. The results were also
consolidated into a taxonomy of AI-based identity fraud detection and
prevention methods including key insights and trends. Overall, this paper
provides a foundational knowledge base to researchers and practitioners for
further research and development in this important area of digital identity
fraud.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on AI applications in law and security (identity fraud detection), which is excluded according to the criteria, and does not explicitly mention Large Language Models (LLMs) or related areas like knowledge graphs, RAG, or agentic AI.

---

## [AI in Support of Diversity and Inclusion](https://arxiv.org/abs/2501.09534v1)
**arXiv ID:** 2501.09534v1

**Abstract:**
> In this paper, we elaborate on how AI can support diversity and inclusion and
exemplify research projects conducted in that direction. We start by looking at
the challenges and progress in making large language models (LLMs) more
transparent, inclusive, and aware of social biases. Even though LLMs like
ChatGPT have impressive abilities, they struggle to understand different
cultural contexts and engage in meaningful, human-like conversations. A key
issue is that biases in language processing, especially in machine translation,
can reinforce inequality. Tackling these biases requires a multidisciplinary
approach to ensure AI promotes diversity, fairness, and inclusion. We also
highlight AI's role in identifying biased content in media, which is important
for improving representation. By detecting unequal portrayals of social groups,
AI can help challenge stereotypes and create more inclusive technologies.
Transparent AI algorithms, which clearly explain their decisions, are essential
for building trust and reducing bias in AI systems. We also stress AI systems
need diverse and inclusive training data. Projects like the Child Growth
Monitor show how using a wide range of data can help address real-world
problems like malnutrition and poverty. We present a project that demonstrates
how AI can be applied to monitor the role of search engines in spreading
disinformation about the LGBTQ+ community. Moreover, we discuss the SignON
project as an example of how technology can bridge communication gaps between
hearing and deaf people, emphasizing the importance of collaboration and mutual
trust in developing inclusive AI. Overall, with this paper, we advocate for AI
systems that are not only effective but also socially responsible, promoting
fair and inclusive interactions between humans and machines.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on social applications of AI in regard to Diversity, Social harm, and similar issues, which is explicitly listed as a criterion for rejection.

---

## [Artificial Intelligence-Driven Clinical Decision Support Systems](https://arxiv.org/abs/2501.09628v1)
**arXiv ID:** 2501.09628v1

**Abstract:**
> As artificial intelligence (AI) becomes increasingly embedded in healthcare
delivery, this chapter explores the critical aspects of developing reliable and
ethical Clinical Decision Support Systems (CDSS). Beginning with the
fundamental transition from traditional statistical models to sophisticated
machine learning approaches, this work examines rigorous validation strategies
and performance assessment methods, including the crucial role of model
calibration and decision curve analysis. The chapter emphasizes that creating
trustworthy AI systems in healthcare requires more than just technical
accuracy; it demands careful consideration of fairness, explainability, and
privacy. The challenge of ensuring equitable healthcare delivery through AI is
stressed, discussing methods to identify and mitigate bias in clinical
predictive models. The chapter then delves into explainability as a cornerstone
of human-centered CDSS. This focus reflects the understanding that healthcare
professionals must not only trust AI recommendations but also comprehend their
underlying reasoning. The discussion advances to an analysis of privacy
vulnerabilities in medical AI systems, from data leakage in deep learning
models to sophisticated attacks against model explanations. The text explores
privacy-preservation strategies such as differential privacy and federated
learning, while acknowledging the inherent trade-offs between privacy
protection and model performance. This progression, from technical validation
to ethical considerations, reflects the multifaceted challenges of developing
AI systems that can be seamlessly and reliably integrated into daily clinical
practice while maintaining the highest standards of patient care and data
protection.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on medical applications of AI, specifically Clinical Decision Support Systems, which falls under the excluded categories as per the evaluation criteria.

---

## [Platform-Aware Mission Planning](https://arxiv.org/abs/2501.09632v1)
**arXiv ID:** 2501.09632v1

**Abstract:**
> Planning for autonomous systems typically requires reasoning with models at
different levels of abstraction, and the harmonization of two competing sets of
objectives: high-level mission goals that refer to an interaction of the system
with the external environment, and low-level platform constraints that aim to
preserve the integrity and the correct interaction of the subsystems. The
complicated interplay between these two models makes it very hard to reason on
the system as a whole, especially when the objective is to find plans with
robustness guarantees, considering the non-deterministic behavior of the lower
layers of the system.
  In this paper, we introduce the problem of Platform-Aware Mission Planning
(PAMP), addressing it in the setting of temporal durative actions. The PAMP
problem differs from standard temporal planning for its exists-forall nature:
the high-level plan dealing with mission goals is required to satisfy safety
and executability constraints, for all the possible non-deterministic
executions of the low-level model of the platform and the environment. We
propose two approaches for solving PAMP. The first baseline approach
amalgamates the mission and platform levels, while the second is based on an
abstraction-refinement loop that leverages the combination of a planner and a
verification engine. We prove the soundness and completeness of the proposed
approaches and validate them experimentally, demonstrating the importance of
heterogeneous modeling and the superiority of the technique based on
abstraction-refinement.

**Decision Explanation:** Original decision: REJECT
The paper does not meet the required criteria as it focuses on autonomous systems and mission planning without mentioning Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, which are the primary areas of interest. It lacks direct relevance to the specified LLM applications and related technologies.

---

## [Electronic Health Records: Towards Digital Twins in Healthcare](https://arxiv.org/abs/2501.09640v1)
**arXiv ID:** 2501.09640v1

**Abstract:**
> The pivotal shift from traditional paper-based records to sophisticated
Electronic Health Records (EHR) enabled systematic collection and analysis of
patient data through descriptive statistics, providing insight into patterns
and trends across patient populations. This evolution continued toward
predictive analytics, allowing healthcare providers to anticipate patient
outcomes and potential complications before they occur. This progression from
basic digital record-keeping to sophisticated predictive modelling and digital
twins reflects healthcare's broader evolution toward more integrated,
patient-centred approaches that combine data-driven insights with personalized
care delivery. This chapter explores the evolution and significance of
healthcare information systems, beginning with an examination of the
implementation of EHR in the UK and the USA. It provides a comprehensive
overview of the International Classification of Diseases (ICD) system, tracing
its development from ICD-9 to ICD-10. Central to this discussion is the
MIMIC-III database, a landmark achievement in healthcare data sharing and
arguably the most comprehensive critical care database freely available to
researchers worldwide. MIMIC-III has democratized access to high-quality
healthcare data, enabling unprecedented opportunities for research and
analysis. The chapter examines its structure, clinical outcome analysis
capabilities, and practical applications through case studies, with a
particular focus on mortality and length of stay metrics, vital signs
extraction, and ICD coding. Through detailed entity-relationship diagrams and
practical examples, the text illustrates MIMIC's complex data structure and
demonstrates how different querying approaches can lead to subtly different
results, emphasizing the critical importance of understanding the database's
architecture for accurate data extraction.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on medical applications of AI, specifically Electronic Health Records and healthcare data analysis, which is explicitly listed as a reason for rejection.

---

## [NS-Gym: Open-Source Simulation Environments and Benchmarks for
  Non-Stationary Markov Decision Processes](https://arxiv.org/abs/2501.09646v1)
**arXiv ID:** 2501.09646v1

**Abstract:**
> In many real-world applications, agents must make sequential decisions in
environments where conditions are subject to change due to various exogenous
factors. These non-stationary environments pose significant challenges to
traditional decision-making models, which typically assume stationary dynamics.
Non-stationary Markov decision processes (NS-MDPs) offer a framework to model
and solve decision problems under such changing conditions. However, the lack
of standardized benchmarks and simulation tools has hindered systematic
evaluation and advance in this field. We present NS-Gym, the first simulation
toolkit designed explicitly for NS-MDPs, integrated within the popular
Gymnasium framework. In NS-Gym, we segregate the evolution of the environmental
parameters that characterize non-stationarity from the agent's decision-making
module, allowing for modular and flexible adaptations to dynamic environments.
We review prior work in this domain and present a toolkit encapsulating key
problem characteristics and types in NS-MDPs. This toolkit is the first effort
to develop a set of standardized interfaces and benchmark problems to enable
consistent and reproducible evaluation of algorithms under non-stationary
conditions. We also benchmark six algorithmic approaches from prior work on
NS-MDPs using NS-Gym. Our vision is that NS-Gym will enable researchers to
assess the adaptability and robustness of their decision-making algorithms to
non-stationary conditions.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on simulation environments for Non-Stationary Markov Decision Processes (NS-MDPs) without clear connections to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, failing to meet the foundational criteria related to LLM applications.

---

## [The Goofus & Gallant Story Corpus for Practical Value Alignment](https://arxiv.org/abs/2501.09707v1)
**arXiv ID:** 2501.09707v1

**Abstract:**
> Values or principles are key elements of human society that influence people
to behave and function according to an accepted standard set of social rules to
maintain social order. As AI systems are becoming ubiquitous in human society,
it is a major concern that they could violate these norms or values and
potentially cause harm. Thus, to prevent intentional or unintentional harm, AI
systems are expected to take actions that align with these principles. Training
systems to exhibit this type of behavior is difficult and often requires a
specialized dataset. This work presents a multi-modal dataset illustrating
normative and non-normative behavior in real-life situations described through
natural language and artistic images. This training set contains curated sets
of images that are designed to teach young children about social principles. We
argue that this is an ideal dataset to use for training socially normative
agents given this fact.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on value alignment and social norms in AI, which aligns with 'responsible AI application or AI ethics', a criterion for rejection. It also lacks clear connections to Large Language Models (LLMs), practical applications in knowledge graphs, RAG, or agentic AI, as emphasized in the selection criteria.

---

## [KU AIGEN ICL EDI@BC8 Track 3: Advancing Phenotype Named Entity
  Recognition and Normalization for Dysmorphology Physical Examination Reports](https://arxiv.org/abs/2501.09744v1)
**arXiv ID:** 2501.09744v1

**Abstract:**
> The objective of BioCreative8 Track 3 is to extract phenotypic key medical
findings embedded within EHR texts and subsequently normalize these findings to
their Human Phenotype Ontology (HPO) terms. However, the presence of diverse
surface forms in phenotypic findings makes it challenging to accurately
normalize them to the correct HPO terms. To address this challenge, we explored
various models for named entity recognition and implemented data augmentation
techniques such as synonym marginalization to enhance the normalization step.
Our pipeline resulted in an exact extraction and normalization F1 score 2.6\%
higher than the mean score of all submissions received in response to the
challenge. Furthermore, in terms of the normalization F1 score, our approach
surpassed the average performance by 1.9\%. These findings contribute to the
advancement of automated medical data extraction and normalization techniques,
showcasing potential pathways for future research and application in the
biomedical domain.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on medical applications of AI, specifically biomedical data extraction and normalization for Dysmorphology Physical Examination Reports, which is explicitly listed as a rejection criterion.

---

## [Adapting Whisper for Regional Dialects: Enhancing Public Services for
  Vulnerable Populations in the United Kingdom](https://arxiv.org/abs/2501.08502v1)
**arXiv ID:** 2501.08502v1

**Abstract:**
> We collect novel data in the public service domain to evaluate the capability
of the state-of-the-art automatic speech recognition (ASR) models in capturing
regional differences in accents in the United Kingdom (UK), specifically
focusing on two accents from Scotland with distinct dialects. This study
addresses real-world problems where biased ASR models can lead to
miscommunication in public services, disadvantaging individuals with regional
accents, particularly those in vulnerable populations. We first examine the
out-of-the-box performance of the Whisper large-v3 model on a baseline dataset
and our data. We then explore the impact of fine-tuning Whisper on the
performance in the two UK regions and investigate the effectiveness of existing
model evaluation techniques for our real-world application through manual
inspection of model errors. We observe that the Whisper model has a higher word
error rate (WER) on our test datasets compared to the baseline data and
fine-tuning on a given data improves performance on the test dataset with the
same domain and accent. The fine-tuned models also appear to show improved
performance when applied to the test data outside of the region it was trained
on suggesting that fine-tuned models may be transferable within parts of the
UK. Our manual analysis of model outputs reveals the benefits and drawbacks of
using WER as an evaluation metric and fine-tuning to adapt to regional
dialects.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on automatic speech recognition (ASR) and regional dialects, with no apparent connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, failing to meet the core criteria for Practical Applications and Experimental Results in the context of LLMs.

---

## [Mitigating Domain Shift in Federated Learning via Intra- and
  Inter-Domain Prototypes](https://arxiv.org/abs/2501.08521v1)
**arXiv ID:** 2501.08521v1

**Abstract:**
> Federated Learning (FL) has emerged as a decentralized machine learning
technique, allowing clients to train a global model collaboratively without
sharing private data. However, most FL studies ignore the crucial challenge of
heterogeneous domains where each client has a distinct feature distribution,
which is common in real-world scenarios. Prototype learning, which leverages
the mean feature vectors within the same classes, has become a prominent
solution for federated learning under domain skew. However, existing federated
prototype learning methods only consider inter-domain prototypes on the server
and overlook intra-domain characteristics. In this work, we introduce a novel
federated prototype learning method, namely I$^2$PFL, which incorporates
$\textbf{I}$ntra-domain and $\textbf{I}$nter-domain $\textbf{P}$rototypes, to
mitigate domain shifts and learn a generalized global model across multiple
domains in federated learning. To construct intra-domain prototypes, we propose
feature alignment with MixUp-based augmented prototypes to capture the
diversity of local domains and enhance the generalization of local features.
Additionally, we introduce a reweighting mechanism for inter-domain prototypes
to generate generalized prototypes to provide inter-domain knowledge and reduce
domain skew across multiple clients. Extensive experiments on the Digits,
Office-10, and PACS datasets illustrate the superior performance of our method
compared to other baselines.

**Decision Explanation:** Original decision: REJECT
The paper does not meet the primary criteria as it focuses on Federated Learning (FL) and prototype learning, without mentioning Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, which are the core areas of interest.

---

## [Dynamic Portfolio Optimization via Augmented DDPG with Quantum Price
  Levels-Based Trading Strategy](https://arxiv.org/abs/2501.08528v1)
**arXiv ID:** 2501.08528v1

**Abstract:**
> With the development of deep learning, the Dynamic Portfolio Optimization (DPO)
problem has received a lot of attention in recent years, not only in the field
of finance but also in the field of deep learning. Some advanced research in
recent years has proposed the application of Deep Reinforcement Learning (DRL)
to the DPO problem, which has been demonstrated to be more advantageous than supervised
learning in solving the DPO problem. However, there are still certain unsolved
issues: 1) DRL algorithms usually have the problems of slow learning speed and
high sample complexity, which is especially problematic when dealing with
complex financial data. 2) Researchers use DRL simply for the purpose of
obtaining high returns, but pay little attention to the problem of risk control
and trading strategy, which will affect the stability of model returns. In
order to address these issues, in this study we revamped the intrinsic
structure of the model based on the Deep Deterministic Policy Gradient (DDPG)
and proposed the Augmented DDPG model. Besides, we also proposed an innovative
risk control strategy based on Quantum Price Levels (QPLs) derived from Quantum
Finance Theory (QFT). Our experimental results revealed that our model has
better profitability as well as risk control ability with less sample
complexity in the DPO problem compared to the baseline models.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on financial applications using Deep Reinforcement Learning and Quantum Finance Theory, with no apparent involvement of Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, thus failing to meet the mandatory criteria related to LLMs.

---

## [The Devil is in Temporal Token: High Quality Video Reasoning
  Segmentation](https://arxiv.org/abs/2501.08549v1)
**arXiv ID:** 2501.08549v1

**Abstract:**
> Existing methods for Video Reasoning Segmentation rely heavily on a single
special token to represent the object in the keyframe or the entire video,
inadequately capturing spatial complexity and inter-frame motion. To overcome
these challenges, we propose VRS-HQ, an end-to-end video reasoning segmentation
approach that leverages Multimodal Large Language Models (MLLMs) to inject rich
spatiotemporal features into hierarchical tokens. Our key innovations include a
Temporal Dynamic Aggregation (TDA) and a Token-driven Keyframe Selection (TKS).
Specifically, we design frame-level `<SEG>` and temporal-level `<TAK>` tokens that
utilize MLLM's autoregressive learning to effectively capture both local and
global information. Subsequently, we apply a similarity-based weighted fusion
and frame selection strategy, then utilize SAM2 to perform keyframe
segmentation and propagation. To enhance keyframe localization accuracy, the
TKS filters keyframes based on SAM2's occlusion scores during inference. VRS-HQ
achieves state-of-the-art performance on ReVOS, surpassing VISA by
5.9%/12.5%/9.1% in J&F scores across the three subsets. These results highlight
the strong temporal reasoning and segmentation capabilities of our method. Code
and model weights will be released at VRS-HQ.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on video processing (video reasoning segmentation), which is explicitly listed as a rejection criterion. Although it utilizes Multimodal Large Language Models (MLLMs), the primary application is in video processing, which outweighs its relevance to LLMs.

---

## [Evaluating SAT and SMT Solvers on Large-Scale Sudoku Puzzles](https://arxiv.org/abs/2501.08569v1)
**arXiv ID:** 2501.08569v1

**Abstract:**
> Modern SMT solvers have revolutionized the approach to constraint
satisfaction problems by integrating advanced theory reasoning and encoding
techniques. In this work, we evaluate the performance of modern SMT solvers in
Z3, CVC5 and DPLL(T) against a standard SAT solver in DPLL. By benchmarking
these solvers on novel, diverse 25x25 Sudoku puzzles of various difficulty
levels created by our improved Sudoku generator, we examine the impact of
advanced theory reasoning and encoding techniques. Our findings demonstrate
that modern SMT solvers significantly outperform classical SAT solvers. This
work highlights the evolution of logical solvers and exemplifies the utility of
SMT solvers in addressing large-scale constraint satisfaction problems.

**Decision Explanation:** Original decision: REJECT
The paper does not meet any of the specified criteria, as it focuses on the evaluation of SAT and SMT solvers for constraint satisfaction problems (Sudoku puzzles), without any apparent connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Disjoint Processing Mechanisms of Hierarchical and Linear Grammars in
  Large Language Models](https://arxiv.org/abs/2501.08618v1)
**arXiv ID:** 2501.08618v1

**Abstract:**
> All natural languages are structured hierarchically. In humans, this
structural restriction is neurologically coded: when two grammars are presented
with identical vocabularies, brain areas responsible for language processing
are only sensitive to hierarchical grammars. Using large language models
(LLMs), we investigate whether such functionally distinct hierarchical
processing regions can arise solely from exposure to large-scale language
distributions. We generate inputs using English, Italian, Japanese, or nonce
words, varying the underlying grammars to conform to either hierarchical or
linear/positional rules. Using these grammars, we first observe that language
models show distinct behaviors on hierarchical versus linearly structured
inputs. Then, we find that the components responsible for processing
hierarchical grammars are distinct from those that process linear grammars; we
causally verify this in ablation experiments. Finally, we observe that
hierarchy-selective components are also active on nonce grammars; this suggests
that hierarchy sensitivity is not tied to meaning, nor in-distribution inputs.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on the internal processing mechanisms of Large Language Models (LLMs) in response to different grammatical structures, without explicitly addressing practical applications, experimental results with quantitative metrics, comparison with state-of-the-art, or real-world challenges, thus not meeting the required criteria.

---

## [ViBidirectionMT-Eval: Machine Translation for Vietnamese-Chinese and
  Vietnamese-Lao language pair](https://arxiv.org/abs/2501.08621v1)
**arXiv ID:** 2501.08621v1

**Abstract:**
> This paper presents the results of the VLSP 2022-2023 Machine Translation
Shared Tasks, focusing on Vietnamese-Chinese and Vietnamese-Lao machine
translation. The tasks were organized as part of the 9th and 10th annual workshops
on Vietnamese Language and Speech Processing (VLSP 2022, VLSP 2023). The
objective of the shared task was to build machine translation systems,
specifically targeting Vietnamese-Chinese and Vietnamese-Lao translation
(corresponding to 4 translation directions). The submissions were evaluated on
1,000 pairs for testing (news and general domains) using established metrics
like BLEU [11] and SacreBLEU [12]. Additionally, system outputs were also
evaluated with human judgment provided by experts in Chinese and Lao languages.
These human assessments played a crucial role in ranking the performance of the
machine translation models, ensuring a more comprehensive evaluation.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on Machine Translation, without clear connections to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, failing to meet the mandatory criteria (1, 2, and 3) and not strongly aligning with any of the additional preferred criteria.

---

## [Reassessing the Role of Chain-of-Thought in Sentiment Analysis: Insights
  and Limitations](https://arxiv.org/abs/2501.08641v1)
**arXiv ID:** 2501.08641v1

**Abstract:**
> The relationship between language and thought remains an unresolved
philosophical issue. Existing viewpoints can be broadly categorized into two
schools: one asserting their independence, and another arguing that language
constrains thought. In the context of large language models, this debate raises
a crucial question: Does a language model's grasp of semantic meaning depend on
thought processes? To explore this issue, we investigate whether reasoning
techniques can facilitate semantic understanding. Specifically, we
conceptualize thought as reasoning, employ chain-of-thought prompting as a
reasoning technique, and examine its impact on sentiment analysis tasks. The
experiments show that chain-of-thought has a minimal impact on sentiment
analysis tasks. Both the standard and chain-of-thought prompts focus on aspect
terms rather than sentiment in the generated content. Furthermore,
counterfactual experiments reveal that the model's handling of sentiment tasks
primarily depends on information from demonstrations. The experimental results
support the first viewpoint.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on the philosophical relationship between language and thought, with experiments centered on sentiment analysis, which does not clearly demonstrate practical applications of Large Language Models (LLMs) in areas like knowledge graphs, retrieval-augmented generation, or agentic AI, nor does it explicitly address state-of-the-art comparisons or innovative methodological approaches as required by the main criteria.

---

## [Application of Deep Reinforcement Learning to UAV Swarming for Ground
  Surveillance](https://arxiv.org/abs/2501.08655v1)
**arXiv ID:** 2501.08655v1

**Abstract:**
> This paper summarizes in depth the state of the art of aerial swarms,
covering both classical and new reinforcement-learning-based approaches for
their management. Then, it proposes a hybrid AI system, integrating deep
reinforcement learning in a multi-agent centralized swarm architecture. The
proposed system is tailored to perform surveillance of a specific area,
searching and tracking ground targets, for security and law enforcement
applications. The swarm is governed by a central swarm controller responsible
for distributing different search and tracking tasks among the cooperating
UAVs. Each UAV agent is then controlled by a collection of cooperative
sub-agents, whose behaviors have been trained using different deep
reinforcement learning models, tailored for the different task types proposed
by the swarm controller. More specifically, proximal policy optimization (PPO)
algorithms were used to train the agents' behavior. In addition, several
metrics to assess the performance of the swarm in this application were
defined. The results obtained through simulation show that our system searches
the operation area effectively, acquires the targets in a reasonable time, and
is capable of tracking them continuously and consistently.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on the application of Deep Reinforcement Learning to UAV Swarming for Ground Surveillance, with a specific mention of law enforcement applications, which aligns with criteria for rejection (primarily focuses on law, either with AI as subject or participant). Additionally, there is no apparent connection to Large Language Models (LLMs), a key requirement for consideration.

---

## [SPEQ: Stabilization Phases for Efficient Q-Learning in High
  Update-To-Data Ratio Reinforcement Learning](https://arxiv.org/abs/2501.08669v1)
**arXiv ID:** 2501.08669v1

**Abstract:**
> A key challenge in Deep Reinforcement Learning is sample efficiency,
especially in real-world applications where collecting environment interactions
is expensive or risky. Recent off-policy algorithms improve sample efficiency
by increasing the Update-To-Data (UTD) ratio and performing more gradient
updates per environment interaction. While this improves sample efficiency, it
significantly increases computational cost due to the higher number of gradient
updates required. In this paper we propose a sample-efficient method to improve
computational efficiency by separating training into distinct learning phases
in order to exploit gradient updates more effectively. Our approach builds on
top of the Dropout Q-Functions (DroQ) algorithm and alternates between an
online, low UTD ratio training phase, and an offline stabilization phase.
During the stabilization phase, we fine-tune the Q-functions without collecting
new environment interactions. This process improves the effectiveness of the
replay buffer and reduces computational overhead. Our experimental results on
continuous control problems show that our method achieves results comparable to
state-of-the-art, high UTD ratio algorithms while requiring 56\% fewer gradient
updates and 50\% less training time than DroQ. Our approach offers an effective
and computationally economical solution while maintaining the same sample
efficiency as the more costly, high UTD ratio state-of-the-art.

**Decision Explanation:** Original decision: REJECT
The paper does not meet the primary criteria as it focuses on Deep Reinforcement Learning (Q-Learning) rather than Large Language Models (LLMs), and does not mention applications in knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, which are the specified areas of interest.

---

## [Digital Phenotyping for Adolescent Mental Health: A Feasibility Study
  Employing Machine Learning to Predict Mental Health Risk From Active and
  Passive Smartphone Data](https://arxiv.org/abs/2501.08851v1)
**arXiv ID:** 2501.08851v1

**Abstract:**
> Background: Adolescents are particularly vulnerable to mental disorders, with
over 75% of cases manifesting before the age of 25. Research indicates that
only 18 to 34% of young people experiencing high levels of depression or
anxiety symptoms seek support. Digital tools leveraging smartphones offer
scalable and early intervention opportunities. Objective: Using a novel machine
learning framework, this study evaluated the feasibility of integrating active
and passive smartphone data to predict mental disorders in non-clinical
adolescents. Specifically, we investigated the utility of the Mindcraft app in
predicting risks for internalising and externalising disorders, eating
disorders, insomnia and suicidal ideation. Methods: Participants (N=103; mean
age 16.1 years) were recruited from three London schools. Participants
completed the Strengths and Difficulties Questionnaire, the Eating Disorders-15
Questionnaire, Sleep Condition Indicator Questionnaire and indicated the
presence/absence of suicidal ideation. They used the Mindcraft app for 14 days,
contributing active data via self-reports and passive data from smartphone
sensors. A contrastive pretraining phase was applied to enhance user-specific
feature stability, followed by supervised fine-tuning. The model evaluation
employed leave-one-subject-out cross-validation using balanced accuracy as the
primary metric. Results: The integration of active and passive data achieved
superior performance compared to individual data sources, with mean balanced
accuracies of 0.71 for SDQ-High risk, 0.67 for insomnia, 0.77 for suicidal
ideation and 0.70 for eating disorders. The contrastive learning framework
stabilised daily behavioural representations, enhancing predictive robustness.
This study demonstrates the potential of integrating active and passive
smartphone data with advanced machine-learning techniques for predicting mental
health risks.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on medical applications of AI (mental health diagnosis and risk prediction), which is explicitly listed as a rejection criterion.
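
The evaluation protocol named in the abstract above (leave-one-subject-out cross-validation scored by balanced accuracy) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names `balanced_accuracy` and `leave_one_subject_out` are hypothetical, and pooling predictions across folds before scoring is an assumption.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx), holding out one subject per fold."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test
```

Because every sample from a subject lands in the same fold, the held-out predictions never come from a model that saw that subject's data. Balanced accuracy then weights each class equally regardless of how rare it is.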

---

## [Silent Abandonment in Text-Based Contact Centers: Identifying,
  Quantifying, and Mitigating its Operational Impacts](https://arxiv.org/abs/2501.08869v2)
**arXiv ID:** 2501.08869v2

**Abstract:**
> In the quest to improve services, companies offer customers the option to
interact with agents via texting. Such contact centers face unique challenges
compared to traditional call centers, as measuring customer experience proxies
like abandonment and patience involves uncertainty. A key source of this
uncertainty is silent abandonment, where customers leave without notifying the
system, wasting agent time and leaving their status unclear. Silent abandonment
also obscures whether a customer was served or left. Our goals are to measure
the magnitude of silent abandonment and mitigate its effects. Classification
models show that 3%-70% of customers across 17 companies abandon silently. In
one study, 71.3% of abandoning customers did so silently, reducing agent
efficiency by 3.2% and system capacity by 15.3%, incurring $5,457 in annual
costs per agent. We develop an expectation-maximization (EM) algorithm to
estimate customer patience under uncertainty and identify influencing
covariates. We find that companies should use classification models to estimate
abandonment scope and our EM algorithm to assess patience. We suggest
strategies to operationally mitigate the impact of silent abandonment by
predicting suspected silent-abandonment behavior or changing service design.
Specifically, we show that while allowing customers to write while waiting in
the queue creates a missing data challenge, it also significantly increases
patience and reduces service time, leading to reduced abandonment and lower
staffing requirements.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on operational improvements in text-based contact centers, without clear involvement of Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation, or agentic AI, failing to meet the required criteria related to LLM applications.

---

## [Projection Implicit Q-Learning with Support Constraint for Offline
  Reinforcement Learning](https://arxiv.org/abs/2501.08907v1)
**arXiv ID:** 2501.08907v1

**Abstract:**
> Offline Reinforcement Learning (RL) faces a critical challenge of
extrapolation errors caused by out-of-distribution (OOD) actions. Implicit
Q-Learning (IQL) algorithm employs expectile regression to achieve in-sample
learning, effectively mitigating the risks associated with OOD actions.
However, the fixed hyperparameter in policy evaluation and density-based policy
improvement method limit its overall efficiency. In this paper, we propose
Proj-IQL, a projective IQL algorithm enhanced with the support constraint. In
the policy evaluation phase, Proj-IQL generalizes the one-step approach to a
multi-step approach through vector projection, while maintaining in-sample
learning and expectile regression framework. In the policy improvement phase,
Proj-IQL introduces a support constraint that is more aligned with the policy
evaluation approach. Furthermore, we theoretically demonstrate that Proj-IQL
guarantees monotonic policy improvement and enjoys a progressively more
rigorous criterion for superior actions. Empirical results demonstrate that
Proj-IQL achieves state-of-the-art performance on D4RL benchmarks, especially
in challenging navigation domains.

**Decision Explanation:** Original response: ```
{
    "decision": "REJECT",
    "explanation": "The paper primarily focuses on Offline Reinforcement Learning (RL) with Q-Learning, without mentioning Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, thus failing to meet the essential criteria (1, 2, and 3) and not aligning with any of the secondary criteria (4, 5) or additional considerations."
}
```
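
The expectile regression that the abstract above attributes to IQL rests on an asymmetric squared loss. The sketch below shows that loss and a toy expectile fit. It illustrates standard expectile regression, not Proj-IQL itself: the function names, the default `tau=0.7`, and the plain gradient-descent fit are assumptions made for illustration.

```python
def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss: weight tau on positive residuals, 1 - tau otherwise."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def fit_expectile(samples, tau=0.7, steps=500, lr=0.1):
    """Estimate the tau-expectile of `samples` by gradient descent on the mean loss."""
    v = sum(samples) / len(samples)  # start from the mean (the 0.5-expectile)
    for _ in range(steps):
        grad = 0.0
        for x in samples:
            diff = x - v
            weight = tau if diff > 0 else 1.0 - tau
            grad += -2.0 * weight * diff  # d/dv of weight * (x - v)^2
        v -= lr * grad / len(samples)
    return v
```

With `tau` above 0.5 the fitted value moves from the sample mean toward the larger samples. In IQL this lets the value estimate approximate the value of the best in-sample actions without querying out-of-distribution ones.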

---

## [Modeling Melt Pool Features and Spatter Using Symbolic Regression and
  Machine Learning](https://arxiv.org/abs/2501.08922v1)
**arXiv ID:** 2501.08922v1

**Abstract:**
> Additive manufacturing (AM) is a rapidly evolving technology that has
attracted applications across a wide range of fields due to its ability to
fabricate complex geometries. However, one of the key challenges in AM is
achieving consistent print quality. This inconsistency is often attributed to
uncontrolled melt pool dynamics, partly caused by spatter which can lead to
defects. Therefore, capturing and controlling the evolution of the melt pool is
crucial for enhancing process stability and part quality. In this study, we
developed a framework to support decision-making in AM operations, facilitating
quality control and minimizing defects via machine learning (ML) and polynomial
symbolic regression models. We implemented experimentally validated
computational tools as a cost-effective approach to collect large datasets from
laser powder bed fusion (LPBF) processes. For a dataset consisting of 281
process conditions, parameters such as melt pool dimensions (length, width,
depth), melt pool geometry (area, volume), and volume indicated as spatter were
extracted. Using machine learning (ML) and polynomial symbolic regression
models, a high R² of over 95% was achieved in predicting the melt pool
dimensions and geometry features for both the training and testing datasets,
with either process conditions (power and velocity) or melt pool dimensions as
the model inputs. In the case of volume indicated as spatter, R² improved after
logarithmically transforming the model inputs, which were either the process
conditions or the melt pool dimensions. Among the investigated ML models, the
ExtraTree model achieved the highest R² values of 96.7% and 87.5%.

**Decision Explanation:** Original decision: REJECT
The paper does not meet any of the required criteria related to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation, or agentic AI. It focuses on machine learning applications in additive manufacturing, which is outside the specified scope.

---

## [Visual WetlandBirds Dataset: Bird Species Identification and Behavior
  Recognition in Videos](https://arxiv.org/abs/2501.08931v1)
**arXiv ID:** 2501.08931v1

**Abstract:**
> The current biodiversity loss crisis makes animal monitoring a relevant field
of study. In light of this, data collected through monitoring can provide
essential insights, and information for decision-making aimed at preserving
global biodiversity. Despite the importance of such data, there is a notable
scarcity of datasets featuring videos of birds, and none of the existing
datasets offer detailed annotations of bird behaviors in video format. In
response to this gap, our study introduces the first fine-grained video dataset
specifically designed for bird behavior detection and species classification.
This dataset addresses the need for comprehensive bird video datasets and
provides detailed data on bird actions, facilitating the development of deep
learning models to recognize these, similar to the advancements made in human
action recognition. The proposed dataset comprises 178 videos recorded in
Spanish wetlands, capturing 13 different bird species performing 7 distinct
behavior classes. In addition, we also present baseline results using state of
the art models on two tasks: bird behavior recognition and species
classification.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on video processing (bird species identification and behavior recognition in videos), which is one of the explicitly stated rejection criteria. Additionally, it does not demonstrate a clear connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, as required by the practical applications criteria.

---

## [Analyzing the Ethical Logic of Six Large Language Models](https://arxiv.org/abs/2501.08951v1)
**arXiv ID:** 2501.08951v1

**Abstract:**
> This study examines the ethical reasoning of six prominent generative large
language models: OpenAI GPT-4o, Meta LLaMA 3.1, Perplexity, Anthropic Claude
3.5 Sonnet, Google Gemini, and Mistral 7B. The research explores how these
models articulate and apply ethical logic, particularly in response to moral
dilemmas such as the Trolley Problem and the Heinz Dilemma. Departing from
traditional alignment studies, the study adopts an explainability-transparency
framework, prompting models to explain their ethical reasoning. This approach
is analyzed through three established ethical typologies: the
consequentialist-deontological analytic, Moral Foundations Theory, and the
Kohlberg Stages of Moral Development Model. Findings reveal that LLMs exhibit
largely convergent ethical logic, marked by a rationalist, consequentialist
emphasis, with decisions often prioritizing harm minimization and fairness.
Despite similarities in pre-training and model architecture, a mixture of
nuanced and significant differences in ethical reasoning emerge across models,
reflecting variations in fine-tuning and post-training processes. The models
consistently display erudition, caution, and self-awareness, presenting ethical
reasoning akin to a graduate-level discourse in moral philosophy. In striking
uniformity these systems all describe their ethical reasoning as more
sophisticated than what is characteristic of typical human moral logic.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on AI ethics (ethical logic and moral dilemmas) which is explicitly listed as a rejection criterion, outweighing any potential marginal relevance to Large Language Models' practical applications or technical advancements.

---

## [Kolmogorov-Arnold Networks for Time Series Granger Causality Inference](https://arxiv.org/abs/2501.08958v1)
**arXiv ID:** 2501.08958v1

**Abstract:**
> We introduce Granger Causality Kolmogorov-Arnold Networks (GCKAN), an
innovative architecture that extends the recently proposed Kolmogorov-Arnold
Networks (KAN) to the domain of causal inference. By extracting base weights
from KAN layers and incorporating the sparsity-inducing penalty along with
ridge regularization, GCKAN infers the Granger causality from time series while
enabling automatic time lag selection. Additionally, we propose an algorithm
leveraging time-reversed Granger causality to enhance inference accuracy. The
algorithm compares prediction and sparsity-inducing losses derived from the
original and time-reversed series, automatically selecting the causal
relationship with the higher score or integrating the results to mitigate
spurious connectivities. Comprehensive experiments conducted on Lorenz-96, gene
regulatory networks, fMRI BOLD signals, and VAR datasets demonstrate that the
proposed model achieves competitive performance to state-of-the-art methods in
inferring Granger causality from nonlinear, high-dimensional, and
limited-sample time series.

**Decision Explanation:** Original decision: REJECT
The paper does not meet any of the required criteria (1-3) as it does not focus on Large Language Models (LLMs), lacks comparison with state-of-the-art LLM techniques, and does not provide experimental results related to LLMs. Additionally, the paper's topic (time series Granger causality inference) does not align with the specified areas of interest (knowledge graphs, retrieval-augmented generation, agentic AI, etc.).

---

## [An analysis of data variation and bias in image-based dermatological
  datasets for machine learning classification](https://arxiv.org/abs/2501.08962v1)
**arXiv ID:** 2501.08962v1

**Abstract:**
> AI algorithms have become valuable in aiding professionals in healthcare. The
increasing confidence obtained by these models is helpful in critical decision
demands. In clinical dermatology, classification models can detect malignant
lesions on patients' skin using only RGB images as input. However, most
learning-based methods employ data acquired from dermoscopic datasets on
training, which are large and validated by a gold standard. Clinical models aim
to deal with classification on users' smartphone cameras that do not contain
the corresponding resolution provided by dermoscopy. Also, clinical
applications bring new challenges. They can contain captures from uncontrolled
environments, skin tone variations, viewpoint changes, noises in data and
labels, and unbalanced classes. A possible alternative would be to use transfer
learning to deal with the clinical images. However, as the number of samples is
low, it can cause degradations on the model's performance; the source
distribution used in training differs from the test set. This work aims to
evaluate the gap between dermoscopic and clinical samples and understand how
the dataset variations impact training. It assesses the main differences
between distributions that disturb the model's prediction. Finally, from
experiments on different architectures, we argue how to combine the data from
divergent distributions, decreasing the impact on the model's final accuracy.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on medical applications of AI (dermatological datasets for machine learning classification), which is explicitly listed as a criterion for rejection, outweighing any potential marginal relevance to Large Language Models (LLMs) or related areas.

---

## [How Do Generative Models Draw a Software Engineer? A Case Study on
  Stable Diffusion Bias](https://arxiv.org/abs/2501.09014v1)
**arXiv ID:** 2501.09014v1

**Abstract:**
> Generative models are nowadays widely used to generate graphical content used
for multiple purposes, e.g. web, art, advertisement. However, it has been shown
that the images generated by these models could reinforce societal biases
already existing in specific contexts. In this paper, we focus on understanding
if this is the case when one generates images related to various software
engineering tasks. In fact, the Software Engineering (SE) community is not
immune from gender and ethnicity disparities, which could be amplified by the
use of these models. Hence, if used without consciousness, artificially
generated images could reinforce these biases in the SE domain. Specifically,
we perform an extensive empirical evaluation of the gender and ethnicity bias
exposed by three versions of the Stable Diffusion (SD) model (a very popular
open-source text-to-image model) - SD 2, SD XL, and SD 3 - towards SE tasks. We
obtain 6,720 images by feeding each model with two sets of prompts describing
different software-related tasks: one set includes the Software Engineer
keyword, and one set does not include any specification of the person
performing the task. Next, we evaluate the gender and ethnicity disparities in
the generated images. Results show how all models are significantly biased
towards male figures when representing software engineers. On the contrary,
while SD 2 and SD XL are strongly biased towards White figures, SD 3 is
slightly more biased towards Asian figures. Nevertheless, all models
significantly under-represent Black and Arab figures, regardless of the prompt
style used. The results of our analysis highlight severe concerns about
adopting those models to generate content for SE tasks and open the field for
future research on bias mitigation in this context.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on social applications of AI in regard to social harm (bias and diversity issues), which is one of the specified rejection criteria. Additionally, it does not explicitly meet the required criteria for practical applications of Large Language Models (LLMs) in areas like knowledge graphs, RAG, or agentic AI.

---

## [TCMM: Token Constraint and Multi-Scale Memory Bank of Contrastive
  Learning for Unsupervised Person Re-identification](https://arxiv.org/abs/2501.09044v1)
**arXiv ID:** 2501.09044v1

**Abstract:**
> This paper proposes the ViT Token Constraint and Multi-scale Memory bank
(TCMM) method to address the patch noises and feature inconsistency in
unsupervised person re-identification works. Many excellent methods use ViT
features to obtain pseudo labels and clustering prototypes, then train the
model with contrastive learning. However, ViT processes images by performing
patch embedding, which inevitably introduces noise in patches and may
compromise the performance of the re-identification model. On the other hand,
previous memory bank based contrastive methods may lead to data inconsistency due
to the limitation of batch size. Furthermore, existing pseudo label methods
often discard outlier samples that are difficult to cluster. It sacrifices the
potential value of outlier samples, leading to limited model diversity and
robustness. This paper introduces the ViT Token Constraint to mitigate the
damage caused by patch noises to the ViT architecture. The proposed Multi-scale
Memory enhances the exploration of outlier samples and maintains feature
consistency. Experimental results demonstrate that our system achieves
state-of-the-art performance on common benchmarks. The project is available at
https://github.com/andy412510/TCMM.

**Decision Explanation:** Original decision: REJECT
The paper does not meet the required criteria as it primarily focuses on computer vision (person re-identification) and image processing (via Vision Transformer (ViT) features and patch embedding), rather than Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation, or agentic AI. It also lacks clear connections to the specified application areas.

---

## [Generating Realistic Synthetic Head Rotation Data for Extended Reality
  using Deep Learning](https://arxiv.org/abs/2501.09050v1)
**arXiv ID:** 2501.09050v1

**Abstract:**
> Extended Reality is a revolutionary method of delivering multimedia content
to users. A large contributor to its popularity is the sense of immersion and
interactivity enabled by having real-world motion reflected in the virtual
experience accurately and immediately. This user motion, mainly caused by head
rotations, induces several technical challenges. For instance, which content is
generated and transmitted depends heavily on where the user is looking.
Seamless systems, taking user motion into account proactively, will therefore
require accurate predictions of upcoming rotations. Training and evaluating
such predictors requires vast amounts of orientational input data, which is
expensive to gather, as it requires human test subjects. A more feasible
approach is to gather a modest dataset through test subjects, and then extend
it to a more sizeable set using synthetic data generation methods. In this
work, we present a head rotation time series generator based on TimeGAN, an
extension of the well-known Generative Adversarial Network, designed
specifically for generating time series. This approach is able to extend a
dataset of head rotations with new samples closely matching the distribution of
the measured time series.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on generating synthetic data for Extended Reality using Deep Learning, without apparent connection to Large Language Models (LLMs), practical applications of LLMs, or prescribed areas like knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Polyp detection in colonoscopy images using YOLOv11](https://arxiv.org/abs/2501.09051v1)
**arXiv ID:** 2501.09051v1

**Abstract:**
> Colorectal cancer (CRC) is one of the most commonly diagnosed cancers all
over the world. It starts as a polyp in the inner lining of the colon. To
prevent CRC, early polyp detection is required. Colonoscopy is used for the
inspection of the colon. Generally, the images taken by the camera placed at
the tip of the endoscope are analyzed by the experts manually. Various
traditional machine learning models have been used with the rise of machine
learning. Recently, deep learning models have shown more effectiveness in polyp
detection due to their superiority in generalizing and learning small features.
These deep learning models for object detection can be segregated into two
different types: single-stage and two-stage. Generally, two-stage models have
higher accuracy than single-stage ones, but single-stage models have lower
inference time. Hence, single-stage models are easy to use for quick object
detection. YOLO is one of the single-stage models used successfully for polyp
detection. It has drawn the attention of researchers because of its lower
inference time. Researchers have used different versions of YOLO so far,
and with each newer version, the accuracy of the model has increased. This
paper aims to assess the effectiveness of the recently released YOLOv11 in
detecting polyps. We analyzed the performance of all five YOLOv11 models (YOLO11n,
YOLO11s, YOLO11m, YOLO11l, YOLO11x) with Kvasir dataset for the training and
testing. Two different versions of the dataset were used. The first consisted
of the original dataset, and the other was created using augmentation
techniques. The performance of all the models with these two versions of the
dataset has been analysed.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on medical applications of AI (polyp detection in colonoscopy images) which is explicitly mentioned as a rejection criterion, and does not meet any of the specified criteria related to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Average-Reward Reinforcement Learning with Entropy Regularization](https://arxiv.org/abs/2501.09080v1)
**arXiv ID:** 2501.09080v1

**Abstract:**
> The average-reward formulation of reinforcement learning (RL) has drawn
increased interest in recent years due to its ability to solve
temporally-extended problems without discounting. Independently, RL algorithms
have benefited from entropy-regularization: an approach used to make the
optimal policy stochastic, thereby more robust to noise. Despite the distinct
benefits of the two approaches, the combination of entropy regularization with
an average-reward objective is not well-studied in the literature and there has
been limited development of algorithms for this setting. To address this gap in
the field, we develop algorithms for solving entropy-regularized average-reward
RL problems with function approximation. We experimentally validate our method,
comparing it with existing algorithms on standard benchmarks for RL.

**Decision Explanation:** Original decision: REJECT
The paper focuses on Average-Reward Reinforcement Learning with Entropy Regularization, which does not explicitly mention Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, failing to meet the primary criteria (1, 2, and 3) and not addressing any of the secondary criteria (4, 5) in the context of LLMs.

---

## [Inferring Transition Dynamics from Value Functions](https://arxiv.org/abs/2501.09081v1)
**arXiv ID:** 2501.09081v1

**Abstract:**
> In reinforcement learning, the value function is typically trained to solve
the Bellman equation, which connects the current value to future values. This
temporal dependency hints that the value function may contain implicit
information about the environment's transition dynamics. By rearranging the
Bellman equation, we show that a converged value function encodes a model of
the underlying dynamics of the environment. We build on this insight to propose
a simple method for inferring dynamics models directly from the value function,
potentially mitigating the need for explicit model learning. Furthermore, we
explore the challenges of next-state identifiability, discussing conditions
under which the inferred dynamics model is well-defined. Our work provides a
theoretical foundation for leveraging value functions in dynamics modeling and
opens a new avenue for bridging model-free and model-based reinforcement
learning.

**Decision Explanation:** Original decision: REJECT
The paper does not meet the required criteria as it focuses on reinforcement learning (value functions and transition dynamics) without mentioning Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, which are the primary areas of interest.
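
As a hedged illustration of the rearrangement mentioned in the abstract above, assume deterministic transitions $s' = f(s, a)$, a fixed policy $\pi$, a reward $r(s, a)$, and a discount factor $\gamma \in (0, 1)$. These symbols are assumptions introduced here, and the paper's own derivation may differ.

$$
V^{\pi}(s) = r(s, \pi(s)) + \gamma V^{\pi}(s')
\quad\Longrightarrow\quad
V^{\pi}(s') = \frac{V^{\pi}(s) - r(s, \pi(s))}{\gamma}
$$

Under these assumptions a converged value function fixes the value of the next state. The state $s'$ itself is recoverable only if no other candidate state shares that value, which is one way to read the identifiability question the abstract raises.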

---

## [Mantis Shrimp: Exploring Photometric Band Utilization in Computer Vision
  Networks for Photometric Redshift Estimation](https://arxiv.org/abs/2501.09112v1)
**arXiv ID:** 2501.09112v1

**Abstract:**
> We present Mantis Shrimp, a multi-survey deep learning model for photometric
redshift estimation that fuses ultra-violet (GALEX), optical (PanSTARRS), and
infrared (UnWISE) imagery. Machine learning is now an established approach for
photometric redshift estimation, with generally acknowledged higher performance
in areas with a high density of spectroscopically identified galaxies over
template-based methods. Multiple works have shown that image-based
convolutional neural networks can outperform tabular-based color/magnitude
models. In comparison to tabular models, image models have additional design
complexities: it is largely unknown how to fuse inputs from different
instruments which have different resolutions or noise properties. The Mantis
Shrimp model estimates the conditional density of redshift using
cutout images. The density estimates are well calibrated and the point
estimates perform well in the distribution of available spectroscopically
confirmed galaxies with bias (1e-2), scatter (NMAD = 2.44e-2), and
catastrophic outlier rate ($\eta$ = 17.53%). We find that early fusion
approaches (e.g., resampling and stacking images from different instruments)
match the performance of late fusion approaches (e.g., concatenating latent
space representations), so that the design choice ultimately is left to the
user. Finally, we study how the models learn to use information across bands,
finding evidence that our models successfully incorporate information from all
surveys. The applicability of our model to the analysis of large populations of
galaxies is limited by the speed of downloading cutouts from external servers;
however, our model could be useful in smaller studies such as generating priors
over redshift for stellar population synthesis.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on computer vision and photometric redshift estimation in astronomy, with no apparent connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation, or agentic AI, thus failing to meet the core criteria.
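
The point-estimate metrics quoted in the abstract above (bias, NMAD scatter, catastrophic outlier rate) are commonly computed from the scaled residual (z_phot - z_spec) / (1 + z_spec). The sketch below uses those common conventions. The function name, the use of the mean for bias, and the 0.15 outlier threshold are assumptions, and the paper may define each metric differently.

```python
from statistics import median


def photoz_metrics(z_phot, z_spec, outlier_threshold=0.15):
    """Return bias, NMAD scatter, and outlier fraction of scaled redshift residuals."""
    # Scaled residual per galaxy: (z_phot - z_spec) / (1 + z_spec)
    dz = [(p - s) / (1.0 + s) for p, s in zip(z_phot, z_spec)]
    bias = sum(dz) / len(dz)  # assumption: bias taken as the mean residual
    center = median(dz)
    # Normalized median absolute deviation; 1.4826 matches a Gaussian sigma
    nmad = 1.4826 * median(abs(d - center) for d in dz)
    # Fraction of galaxies whose residual exceeds the (assumed) threshold
    outliers = sum(1 for d in dz if abs(d) > outlier_threshold) / len(dz)
    return {"bias": bias, "nmad": nmad, "outlier_rate": outliers}
```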

---

## [Generative Medical Image Anonymization Based on Latent Code Projection
  and Optimization](https://arxiv.org/abs/2501.09114v1)
**arXiv ID:** 2501.09114v1

**Abstract:**
> Medical image anonymization aims to protect patient privacy by removing
identifying information, while preserving the data utility to solve downstream
tasks. In this paper, we address the medical image anonymization problem with a
two-stage solution: latent code projection and optimization. In the projection
stage, we design a streamlined encoder to project input images into a latent
space and propose a co-training scheme to enhance the projection process. In
the optimization stage, we refine the latent code using two deep loss functions
designed to address the trade-off between identity protection and data utility
dedicated to medical images. Through a comprehensive set of qualitative and
quantitative experiments, we showcase the effectiveness of our approach on the
MIMIC-CXR chest X-ray dataset by generating anonymized synthetic images that
can serve as a training set for detecting lung pathologies. Source code is
available at https://github.com/Huiyu-Li/GMIA.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on medical applications of AI, specifically medical image anonymization, which is explicitly listed as a rejection criterion.

---

## [Grounding Text-To-Image Diffusion Models For Controlled High-Quality
  Image Generation](https://arxiv.org/abs/2501.09194v1)
**arXiv ID:** 2501.09194v1

**Abstract:**
> Large-scale text-to-image (T2I) diffusion models have demonstrated an
outstanding performance in synthesizing diverse high-quality visuals from
natural language text captions. Multiple layout-to-image models have been
developed to control the generation process by utilizing a broad array of
layouts such as segmentation maps, edges, and human keypoints. In this work, we
present ObjectDiffusion, a model that takes inspiration from the top
cutting-edge image generative frameworks to seamlessly condition T2I models
with new bounding-box capabilities. Specifically, we make substantial
modifications to the network architecture introduced in ControlNet to integrate
it with the condition processing and injection techniques proposed in GLIGEN.
ObjectDiffusion is initialized with pretraining parameters to leverage the
generation knowledge obtained from training on large-scale datasets. We
fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on
the COCO2017 validation dataset. Our model achieves an AP$_{50}$ of 46.6, an AR
of 44.5, and an FID of 19.8, outperforming the current SOTA model trained on
open-source datasets in all three metrics. ObjectDiffusion demonstrates
a distinctive capability in synthesizing diverse, high-quality, high-fidelity
images that seamlessly conform to the semantic and spatial control layout.
Evaluated in qualitative and quantitative tests, ObjectDiffusion exhibits
remarkable grounding abilities on closed-set and open-set settings across a
wide variety of contexts. The qualitative assessment verifies the ability of
ObjectDiffusion to generate multiple objects of different sizes and locations.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on image generation (text-to-image diffusion models) without clear connections to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, failing to meet the mandatory criteria (1, 2, and 3) due to lack of direct relevance to LLMs and specified applications.

---

## [Interpretable Droplet Digital PCR Assay for Trustworthy Molecular
  Diagnostics](https://arxiv.org/abs/2501.09218v1)
**arXiv ID:** 2501.09218v1

**Abstract:**
> Accurate molecular quantification is essential for advancing research and
diagnostics in fields such as infectious diseases, cancer biology, and genetic
disorders. Droplet digital PCR (ddPCR) has emerged as a gold standard for
achieving absolute quantification. While computational ddPCR technologies have
advanced significantly, achieving automatic interpretation and consistent
adaptability across diverse operational environments remains a challenge. To
address these limitations, we introduce the intelligent interpretable droplet
digital PCR (I2ddPCR) assay, a comprehensive framework integrating front-end
predictive models (for droplet segmentation and classification) with GPT-4o
multimodal large language model (MLLM, for context-aware explanations and
recommendations) to automate and enhance ddPCR image analysis. This approach
surpasses the state-of-the-art models, affording 99.05% accuracy in processing
complex ddPCR images containing over 300 droplets per image with varying
signal-to-noise ratios (SNRs). By combining specialized neural networks and
large language models, the I2ddPCR assay offers a robust and adaptable solution
for absolute molecular quantification, achieving a sensitivity capable of
detecting low-abundance targets as low as 90.32 copies/μL. Furthermore, it
improves the model's transparency through detailed explanation and troubleshooting
guidance, empowering users to make informed decisions. This innovative
framework has the potential to benefit molecular diagnostics, disease research,
and clinical applications, especially in resource-constrained settings.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on medical applications of AI (molecular diagnostics, infectious diseases, cancer biology, and genetic disorders), which is explicitly listed as a rejection criterion.

---

## [SEAL: Entangled White-box Watermarks on Low-Rank Adaptation](https://arxiv.org/abs/2501.09284v1)
**arXiv ID:** 2501.09284v1

**Abstract:**
> Recently, LoRA and its variants have become the de facto strategy for
training and sharing task-specific versions of large pretrained models, thanks
to their efficiency and simplicity. However, the issue of copyright protection
for LoRA weights, especially through watermark-based techniques, remains
underexplored. To address this gap, we propose SEAL (SEcure wAtermarking on
LoRA weights), a universal white-box watermarking scheme for LoRA. SEAL embeds a
secret, non-trainable matrix between trainable LoRA weights, serving as a
passport to claim ownership. SEAL then entangles the passport with the LoRA
weights through training, without extra loss for entanglement, and distributes
the finetuned weights after hiding the passport. When applying SEAL, we
observed no performance degradation across commonsense reasoning,
textual/visual instruction tuning, and text-to-image synthesis tasks. We
demonstrate that SEAL is robust against a variety of known attacks: removal,
obfuscation, and ambiguity attacks.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on copyright protection and watermarking for Low-Rank Adaptation (LoRA) weights, which does not meet the required criteria for Large Language Models (LLMs) applications in knowledge graphs, retrieval-augmented generation, or agentic AI, and does not demonstrate practical applications, experimental results, or comparisons to state-of-the-art techniques in the specified areas.

---

## [On Learning Informative Trajectory Embeddings for Imitation,
  Classification and Regression](https://arxiv.org/abs/2501.09327v1)
**arXiv ID:** 2501.09327v1

**Abstract:**
> In real-world sequential decision making tasks like autonomous driving,
robotics, and healthcare, learning from observed state-action trajectories is
critical for tasks like imitation, classification, and clustering. For example,
self-driving cars must replicate human driving behaviors, while robots and
healthcare systems benefit from modeling decision sequences, whether or not
they come from expert data. Existing trajectory encoding methods often focus on
specific tasks or rely on reward signals, limiting their ability to generalize
across domains and tasks. Inspired by the success of embedding models like CLIP
and BERT in static domains, we propose a novel method for embedding
state-action trajectories into a latent space that captures the skills and
competencies in the dynamic underlying decision-making processes. This method
operates without the need for reward labels, enabling better generalization
across diverse domains and tasks. Our contributions are threefold: (1) We
introduce a trajectory embedding approach that captures multiple abilities from
state-action data. (2) The learned embeddings exhibit strong representational
power across downstream tasks, including imitation, classification, clustering,
and regression. (3) The embeddings demonstrate unique properties, such as
controlling agent behaviors in IQ-Learn and an additive structure in the latent
space. Experimental results confirm that our method outperforms traditional
approaches, offering more flexible and powerful trajectory representations for
various applications. Our code is available at
https://github.com/Erasmo1015/vte.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on sequential decision making tasks in autonomous driving, robotics, and healthcare, without clear involvement of Large Language Models (LLMs) or their applications in knowledge graphs, retrieval-augmented generation, or agentic AI, failing to meet the core criteria.

---

## [Neural Honeytrace: A Robust Plug-and-Play Watermarking Framework against
  Model Extraction Attacks](https://arxiv.org/abs/2501.09328v1)
**arXiv ID:** 2501.09328v1

**Abstract:**
> Developing high-performance deep learning models is resource-intensive,
leading model owners to utilize Machine Learning as a Service (MLaaS) platforms
instead of publicly releasing their models. However, malicious users may
exploit query interfaces to execute model extraction attacks, reconstructing
the target model's functionality locally. While prior research has investigated
triggerable watermarking techniques for asserting ownership, existing methods
face significant challenges: (1) most approaches require additional training,
resulting in high overhead and limited flexibility, and (2) they often fail to
account for advanced attackers, leaving them vulnerable to adaptive attacks.
  In this paper, we propose Neural Honeytrace, a robust plug-and-play
watermarking framework against model extraction attacks. We first formulate a
watermark transmission model from an information-theoretic perspective,
providing an interpretable account of the principles and limitations of
existing triggerable watermarking. Guided by the model, we further introduce:
(1) a similarity-based training-free watermarking method for plug-and-play and
flexible watermarking, and (2) a distribution-based multi-step watermark
information transmission strategy for robust watermarking. Comprehensive
experiments on four datasets demonstrate that Neural Honeytrace outperforms
previous methods in efficiency and resisting adaptive attacks. Neural
Honeytrace reduces the average number of samples required for a worst-case
t-Test-based copyright claim from 12,000 to 200 with zero training cost.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on model protection and copyright infringement (watermarking against model extraction attacks), which does not align with the specified areas of interest (Large Language Models, knowledge graphs, retrieval-augmented generation, agentic AI, etc.). It does not meet the practical applications criteria related to LLMs, nor does it align with the specified inclusions, making it outside the scope of the evaluation criteria.

---

## [Prompt-CAM: A Simpler Interpretable Transformer for Fine-Grained
  Analysis](https://arxiv.org/abs/2501.09333v1)
**arXiv ID:** 2501.09333v1

**Abstract:**
> We present a simple usage of pre-trained Vision Transformers (ViTs) for
fine-grained analysis, aiming to identify and localize the traits that
distinguish visually similar categories, such as different bird species or dog
breeds. Pre-trained ViTs such as DINO have shown remarkable capabilities to
extract localized, informative features. However, using saliency maps like
Grad-CAM can hardly point out the traits: they often locate the whole object by
a blurred, coarse heatmap, not traits. We propose a novel approach Prompt Class
Attention Map (Prompt-CAM) to the rescue. Prompt-CAM learns class-specific
prompts to a pre-trained ViT and uses the corresponding outputs for
classification. To classify an image correctly, the true-class prompt must
attend to the unique image patches not seen in other classes' images, i.e.,
traits. As such, the true class's multi-head attention maps reveal traits and
their locations. Implementation-wise, Prompt-CAM is almost a free lunch by
simply modifying the prediction head of Visual Prompt Tuning (VPT). This makes
Prompt-CAM fairly easy to train and apply, sharply contrasting other
interpretable methods that design specific models and training processes. It is
even simpler than the recently published INterpretable TRansformer (INTR),
whose encoder-decoder architecture prevents it from leveraging pre-trained
ViTs. Extensive empirical studies on a dozen datasets from various domains
(e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate
Prompt-CAM's superior interpretation capability.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on computer vision applications using Vision Transformers (ViTs) for fine-grained image analysis, with no clear connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation, or agentic AI, thus not meeting the core criteria.

---

## [Predicting Air Temperature from Volumetric Urban Morphology with Machine
  Learning](https://arxiv.org/abs/2501.09469v1)
**arXiv ID:** 2501.09469v1

**Abstract:**
> In this study, we first introduce a method that converts CityGML data into
voxels efficiently and quickly at high resolution for large-scale datasets such
as cities, at the cost of some building detail. This overcomes the limitations
of previous voxelization methodologies, which have been computationally
intensive and inefficient at transforming large-scale urban areas into
high-resolution voxel representations. The voxelized 3D city
data from multiple cities and corresponding air temperature data are used to
develop a machine learning model. Before model training, Gaussian blurring
is applied to the input data to account for spatial relationships; as a result,
the correlation between air temperature and volumetric building morphology
also increases. After model training, the
prediction results are evaluated not only with Mean Square Error (MSE) but also
with image similarity metrics such as Structural Similarity Index Measure (SSIM)
and Learned Perceptual Image Patch Similarity (LPIPS), which can detect and
consider spatial relations during evaluation. The trained model is
capable of predicting the spatial distribution of air temperature by using
building volume information of the corresponding pixel as input. By doing so, this
research aims to assist urban planners in incorporating environmental
parameters into their planning strategies, thereby facilitating more
sustainable and inhabitable urban environments.

**Decision Explanation:** Original decision: REJECT
The paper does not meet the required criteria as it primarily focuses on a machine learning application (predicting air temperature from urban morphology) that does not involve Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, and does not mention any of the specified areas of interest.

---

## [Class Incremental Fault Diagnosis under Limited Fault Data via
  Supervised Contrastive Knowledge Distillation](https://arxiv.org/abs/2501.09525v1)
**arXiv ID:** 2501.09525v1

**Abstract:**
> Class-incremental fault diagnosis requires a model to adapt to new fault
classes while retaining previous knowledge. However, limited research exists
for imbalanced and long-tailed data. Extracting discriminative features from
few-shot fault data is challenging, and adding new fault classes often demands
costly model retraining. Moreover, incremental training of existing methods
risks catastrophic forgetting, and severe class imbalance can bias the model's
decisions toward normal classes. To tackle these issues, we introduce a
Supervised Contrastive knowledge distiLlation for class Incremental Fault
Diagnosis (SCLIFD) framework proposing supervised contrastive knowledge
distillation for improved representation learning capability and less
forgetting, a novel prioritized exemplar selection method for sample replay to
alleviate catastrophic forgetting, and the Random Forest Classifier to address
the class imbalance. Extensive experimentation on simulated and real-world
industrial datasets across various imbalance ratios demonstrates the
superiority of SCLIFD over existing approaches. Our code can be found at
https://github.com/Zhang-Henry/SCLIFD_TII.

**Decision Explanation:** Original response:
```
{
    "decision": "REJECT",
    "explanation": "The paper primarily focuses on fault diagnosis, which does not explicitly involve Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, failing to meet the core criteria of practical applications of LLMs."
}
```

---

## [Text-driven Adaptation of Foundation Models for Few-shot Surgical
  Workflow Analysis](https://arxiv.org/abs/2501.09555v1)
**arXiv ID:** 2501.09555v1

**Abstract:**
> Purpose: Surgical workflow analysis is crucial for improving surgical
efficiency and safety. However, previous studies rely heavily on large-scale
annotated datasets, posing challenges in cost, scalability, and reliance on
expert annotations. To address this, we propose Surg-FTDA (Few-shot Text-driven
Adaptation), designed to handle various surgical workflow analysis tasks with
minimal paired image-label data.
  Methods: Our approach has two key components. First, Few-shot selection-based
modality alignment selects a small subset of images and aligns their embeddings
with text embeddings from the downstream task, bridging the modality gap.
Second, Text-driven adaptation leverages only text data to train a decoder,
eliminating the need for paired image-text data. This decoder is then applied
to aligned image embeddings, enabling image-related tasks without explicit
image-text pairs.
  Results: We evaluate our approach on generative tasks (image captioning) and
discriminative tasks (triplet recognition and phase recognition). Results show
that Surg-FTDA outperforms baselines and generalizes well across downstream
tasks.
  Conclusion: We propose a text-driven adaptation approach that mitigates the
modality gap and handles multiple downstream tasks in surgical workflow
analysis, with minimal reliance on large annotated datasets. The code and
dataset will be released at https://github.com/TingxuanSix/Surg-FTDA.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on medical applications of AI (surgical workflow analysis), which is one of the specified rejection criteria, outweighing its potential alignment with criteria like practical applications, experimental results, and novelty.

---

## [IFRA: a machine learning-based Instrumented Fall Risk Assessment Scale
  derived from Instrumented Timed Up and Go test in stroke patients](https://arxiv.org/abs/2501.09595v1)
**arXiv ID:** 2501.09595v1

**Abstract:**
> Effective fall risk assessment is critical for post-stroke patients. The
present study proposes a novel, data-informed fall risk assessment method based
on the instrumented Timed Up and Go (ITUG) test data, bringing in many mobility
measures that traditional clinical scales fail to capture. IFRA, which stands
for Instrumented Fall Risk Assessment, has been developed using a two-step
process: first, features with the highest predictive power among those
collected in an ITUG test have been identified using machine learning
techniques; then, a strategy is proposed to stratify patients into low, medium,
or high-risk strata. The dataset used in our analysis consists of 142
participants, out of which 93 were used for training (15 synthetically
generated), 17 for validation and 32 to test the resulting IFRA scale (22
non-fallers and 10 fallers). Features considered in the IFRA scale include gait
speed, vertical acceleration during sit-to-walk transition, and turning angular
velocity, which align well with established literature on the risk of fall in
neurological patients. In a comparison with traditional clinical scales such as
the traditional Timed Up & Go and the Mini-BESTest, IFRA demonstrates
competitive performance, being the only scale to correctly assign more than
half of the fallers to the high-risk stratum (Fisher's exact test p = 0.004).
Despite the dataset's limited size, this is the first proof-of-concept study to
pave the way for future evidence regarding the use of IFRA tool for continuous
patient monitoring and fall prevention both in clinical stroke rehabilitation
and at home post-discharge.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on medical applications of AI (stroke patient fall risk assessment), which is one of the specified rejection criteria, outweighing its potential marginal relevance to the broader criteria of practical applications or methodology.

---

## [Reducing the Sensitivity of Neural Physics Simulators to Mesh Topology
  via Pretraining](https://arxiv.org/abs/2501.09597v1)
**arXiv ID:** 2501.09597v1

**Abstract:**
> Meshes are used to represent complex objects in high fidelity physics
simulators across a variety of domains, such as radar sensing and aerodynamics.
There is growing interest in using neural networks to accelerate physics
simulations, and also a growing body of work on applying neural networks
directly to irregular mesh data. Since multiple mesh topologies can represent
the same object, mesh augmentation is typically required to handle topological
variation when training neural networks. Due to the sensitivity of physics
simulators to small changes in mesh shape, it is challenging to use these
augmentations when training neural network-based physics simulators. In this
work, we show that variations in mesh topology can significantly reduce the
performance of neural network simulators. We evaluate whether pretraining can
be used to address this issue, and find that employing an established
autoencoder pretraining technique with graph embedding models reduces the
sensitivity of neural network simulators to variations in mesh topology.
Finally, we highlight future research directions that may further reduce neural
simulator sensitivity to mesh topology.

**Decision Explanation:** Original response:
```
{
    "decision": "REJECT",
    "explanation": "The paper primarily focuses on neural physics simulators and mesh topology in the context of physics simulations (e.g., radar sensing, aerodynamics), with no apparent direct connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation, or agentic AI, failing to meet the core criteria."
}
```

---

## [Monte Carlo Tree Search with Velocity Obstacles for safe and efficient
  motion planning in dynamic environments](https://arxiv.org/abs/2501.09649v1)
**arXiv ID:** 2501.09649v1

**Abstract:**
> Online motion planning is a challenging problem for intelligent robots moving
in dense environments with dynamic obstacles, e.g., crowds. In this work, we
propose a novel approach for optimal and safe online motion planning with
minimal information about dynamic obstacles. Specifically, our approach
requires only the current position of the obstacles and their maximum speed,
but it does not need any information about their exact trajectories or dynamic
model. The proposed methodology combines Monte Carlo Tree Search (MCTS), for
online optimal planning via model simulations, with Velocity Obstacles (VO),
for obstacle avoidance. We perform experiments in a cluttered simulated
environment with walls, and up to 40 dynamic obstacles moving with random
velocities and directions. With an ablation study, we show the key contribution
of VO in scaling up the efficiency of MCTS, selecting the safest and most
rewarding actions in the tree of simulations. Moreover, we show the superiority
of our methodology with respect to state-of-the-art planners, including
Non-linear Model Predictive Control (NMPC), in terms of improved collision
rate, computational and task performance.

**Decision Explanation:** Original decision: REJECT
The paper does not meet any of the required criteria, as it focuses on motion planning for robots in dynamic environments, utilizing Monte Carlo Tree Search and Velocity Obstacles, without any apparent connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Robin: a Suite of Multi-Scale Vision-Language Models and the CHIRP
  Evaluation Benchmark](https://arxiv.org/abs/2501.09672v1)
**arXiv ID:** 2501.09672v1

**Abstract:**
> The proliferation of Vision-Language Models (VLMs) in the past several years
calls for rigorous and comprehensive evaluation methods and benchmarks. This
work analyzes existing VLM evaluation techniques, including automated metrics,
AI-based assessments, and human evaluations across diverse tasks. We first
introduce Robin - a novel suite of VLMs that we built by combining Large
Language Models (LLMs) and Vision Encoders (VEs) at multiple scales, and use
Robin to identify shortcomings of current evaluation approaches across scales.
Next, to overcome the identified limitations, we introduce CHIRP - a new
long-form response benchmark we developed for more robust and complete VLM
evaluation. We provide open access to the Robin training code, model suite, and
CHIRP benchmark to promote reproducibility and advance VLM research.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on Vision-Language Models (VLMs) and their evaluation, with Large Language Models (LLMs) being a component rather than the main focus, and does not clearly meet the practical application criteria in areas specified (knowledge graphs, RAG, or agentic AI), nor does it explicitly address comparison with state-of-the-art in LLMs.

---

## [Cueless EEG imagined speech for subject identification: dataset and
  benchmarks](https://arxiv.org/abs/2501.09700v1)
**arXiv ID:** 2501.09700v1

**Abstract:**
> Electroencephalogram (EEG) signals have emerged as a promising modality for
biometric identification. While previous studies have explored the use of
imagined speech with semantically meaningful words for subject identification,
most have relied on additional visual or auditory cues. In this study, we
introduce a cueless EEG-based imagined speech paradigm, where subjects imagine
the pronunciation of semantically meaningful words without any external cues.
This innovative approach addresses the limitations of prior methods by
requiring subjects to select and imagine words from a predefined list
naturally. The dataset comprises over 4,350 trials from 11 subjects across five
sessions. We assess a variety of classification methods, including traditional
machine learning techniques such as Support Vector Machines (SVM) and XGBoost,
as well as time-series foundation models and deep learning architectures
specifically designed for EEG classification, such as EEG Conformer and Shallow
ConvNet. A session-based hold-out validation strategy was employed to ensure
reliable evaluation and prevent data leakage. Our results demonstrate
outstanding classification accuracy, reaching 97.93%. These findings highlight
the potential of cueless EEG paradigms for secure and reliable subject
identification in real-world applications, such as brain-computer interfaces
(BCIs).

**Decision Explanation:** Original decision: REJECT
The paper does not meet the required criteria as it primarily focuses on biometric identification using EEG signals, which is unrelated to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, and does not mention any practical applications of LLMs.

---

## [Learnings from Scaling Visual Tokenizers for Reconstruction and
  Generation](https://arxiv.org/abs/2501.09755v1)
**arXiv ID:** 2501.09755v1

**Abstract:**
> Visual tokenization via auto-encoding empowers state-of-the-art image and
video generative models by compressing pixels into a latent space. Although
scaling Transformer-based generators has been central to recent advances, the
tokenizer component itself is rarely scaled, leaving open questions about how
auto-encoder design choices influence both its objective of reconstruction and
downstream generative performance. Our work aims to conduct an exploration of
scaling in auto-encoders to fill in this blank. To facilitate this exploration,
we replace the typical convolutional backbone with an enhanced Vision
Transformer architecture for Tokenization (ViTok). We train ViTok on
large-scale image and video datasets far exceeding ImageNet-1K, removing data
constraints on tokenizer scaling. We first study how scaling the auto-encoder
bottleneck affects both reconstruction and generation -- and find that while it
is highly correlated with reconstruction, its relationship with generation is
more complex. We next explore the effect of separately scaling the
auto-encoders' encoder and decoder on reconstruction and generation
performance. Crucially, we find that scaling the encoder yields minimal gains
for either reconstruction or generation, while scaling the decoder boosts
reconstruction but the benefits for generation are mixed. Building on our
exploration, we design ViTok as a lightweight auto-encoder that achieves
competitive performance with state-of-the-art auto-encoders on ImageNet-1K and
COCO reconstruction tasks (256p and 512p) while outperforming existing
auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x
fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates
competitive performance on image generation for ImageNet-1K and sets new
state-of-the-art benchmarks for class-conditional video generation on UCF-101.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on video processing and image generation, with an emphasis on scaling visual tokenizers for reconstruction and generation, which falls under the rejection criteria of primarily focusing on video processing.

---

## [Exploring the Efficacy of Meta-Learning: Unveiling Superior Data
  Diversity Utilization of MAML Over Pre-training](https://arxiv.org/abs/2501.08506v1)
**arXiv ID:** 2501.08506v1

**Abstract:**
> Currently, data and model size dominate the narrative in the training of
super-large, powerful models. However, there has been a lack of exploration on
the effect of other attributes of the training dataset on model performance. We
hypothesize that dataset diversity can impact the performance of vision models.
Our study shows positive correlations between test set accuracy and data
diversity, providing an argument for furthering the research of dataset
attributes beyond size. We analyzed pre-training and model-agnostic
meta-learning methods on twelve popular visual datasets (e.g., Omniglot,
CIFAR-FS, Aircraft) and five model configurations, including MAML variants with
different numbers of inner gradient steps and supervised learning. We show
moderate to strong positive correlations (R-squared: 0.15-0.42) between
accuracy and data diversity and weaker but significant correlations (R-squared:
~0.2) between loss and diversity. These findings support our hypothesis and
demonstrate a promising way for a deeper exploration of how formal data
diversity influences model performance. This initial study highlights the
potential of (Task2Vec) data diversity as a valuable measure in the rapidly
evolving field of large-scale learning and emphasizes that understanding the
dataset is key to building more powerful and generalizable models.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on computer vision (visual datasets) and the impact of dataset diversity on vision model performance, which aligns with 'Primarily focuses on video processing' (or closely related fields), and does not explicitly involve Large Language Models (LLMs) or their applications as required by the main criteria.

---

## [Easing Seasickness through Attention Redirection with a
  Mindfulness-Based Brain--Computer Interface](https://arxiv.org/abs/2501.08518v1)
**arXiv ID:** 2501.08518v1

**Abstract:**
> Seasickness is a prevalent issue that adversely impacts both passenger
experiences and the operational efficiency of maritime crews. While techniques
that redirect attention have proven effective in alleviating motion sickness
symptoms in terrestrial environments, applying similar strategies to manage
seasickness poses unique challenges due to the prolonged and intense motion
environment associated with maritime travel. In this study, we propose a
mindfulness brain-computer interface (BCI), specifically designed to redirect
attention with the aim of mitigating seasickness symptoms in real-world
settings. Our system utilizes a single-channel headband to capture prefrontal
EEG signals, which are then wirelessly transmitted to computing devices for the
assessment of mindfulness states. The results are transferred into real-time
feedback as mindfulness scores and audiovisual stimuli, facilitating a shift in
attentional focus from physiological discomfort to mindfulness practices. A
total of 43 individuals participated in a real-world maritime experiment
consisting of three sessions: a real-feedback mindfulness session, a resting
session, and a pseudofeedback mindfulness session. Notably, 81.39% of
participants reported that the mindfulness BCI intervention was effective, and
there was a significant reduction in the severity of seasickness, as measured
by the Misery Scale (MISC). Furthermore, EEG analysis revealed a decrease in
the theta/beta ratio, corresponding with the alleviation of seasickness
symptoms. A decrease in overall EEG band power during the real-feedback
mindfulness session suggests that the mindfulness BCI fosters a more tranquil
and downregulated state of brain activity. Together, this study presents a
novel nonpharmacological, portable, and effective approach for seasickness
intervention, with the potential to enhance the cruising experience for both
passengers and crews.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on a medical application (seasickness alleviation) and brain-computer interface technology, with no apparent connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation, or agentic AI, thus failing to meet the fundamental criteria.

---

## [Reinforcement Learning-Enhanced Procedural Generation for Dynamic
  Narrative-Driven AR Experiences](https://arxiv.org/abs/2501.08552v1)
**arXiv ID:** 2501.08552v1

**Abstract:**
> Procedural Content Generation (PCG) is widely used to create scalable and
diverse environments in games. However, existing methods, such as the Wave
Function Collapse (WFC) algorithm, are often limited to static scenarios and
lack the adaptability required for dynamic, narrative-driven applications,
particularly in augmented reality (AR) games. This paper presents a
reinforcement learning-enhanced WFC framework designed for mobile AR
environments. By integrating environment-specific rules and dynamic tile weight
adjustments informed by reinforcement learning (RL), the proposed method
generates maps that are both contextually coherent and responsive to gameplay
needs. Comparative evaluations and user studies demonstrate that the framework
achieves superior map quality and delivers immersive experiences, making it
well-suited for narrative-driven AR games. Additionally, the method holds
promise for broader applications in education, simulation training, and
immersive extended reality (XR) experiences, where dynamic and adaptive
environments are critical.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on procedural content generation for gaming and augmented reality (AR) experiences, with broader applications in education and simulation training, but does not explicitly meet any of the required criteria related to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Towards Lightweight and Stable Zero-shot TTS with Self-distilled
  Representation Disentanglement](https://arxiv.org/abs/2501.08566v1)
**arXiv ID:** 2501.08566v1

**Abstract:**
> Zero-shot Text-To-Speech (TTS) synthesis shows great promise for personalized
voice customization through voice cloning. However, current methods for
achieving zero-shot TTS heavily rely on large model scales and extensive
training datasets to ensure satisfactory performance and generalizability
across various speakers. This raises concerns regarding both deployment costs
and data security. In this paper, we present a lightweight and stable zero-shot
TTS system. We introduce a novel TTS architecture designed to effectively model
linguistic content and various speaker attributes from source speech and prompt
speech, respectively. Furthermore, we present a two-stage self-distillation
framework that constructs parallel data pairs for effectively disentangling
linguistic content and speakers from the perspective of training data.
Extensive experiments show that our system exhibits excellent performance and
superior stability on the zero-shot TTS tasks. Moreover, it shows markedly
superior computational efficiency, with RTFs of 0.13 and 0.012 on the CPU and
GPU, respectively.

**Decision Explanation:** Original decision: REJECT
The paper does not meet the required criteria as it primarily focuses on Text-To-Speech (TTS) synthesis, which is not explicitly mentioned as a relevant application area in the provided criteria, and does not clearly relate to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---

## [Sound Scene Synthesis at the DCASE 2024 Challenge](https://arxiv.org/abs/2501.08587v1)
**arXiv ID:** 2501.08587v1

**Abstract:**
> This paper presents Task 7 at the DCASE 2024 Challenge: sound scene
synthesis. Recent advances in sound synthesis and generative models have
enabled the creation of realistic and diverse audio content. We introduce a
standardized evaluation framework for comparing different sound scene synthesis
systems, incorporating both objective and subjective metrics. The challenge
attracted four submissions, which are evaluated using the Fréchet Audio
Distance (FAD) and human perceptual ratings. Our analysis reveals significant
insights into the current capabilities and limitations of sound scene synthesis
systems, while also highlighting areas for future improvement in this rapidly
evolving field.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on sound scene synthesis, which relates to audio processing and does not explicitly involve Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, thus not meeting any of the required criteria.

---

## [OpenMLDB: A Real-Time Relational Data Feature Computation System for
  Online ML](https://arxiv.org/abs/2501.08591v1)
**arXiv ID:** 2501.08591v1

**Abstract:**
> Efficient and consistent feature computation is crucial for a wide range of
online ML applications. Typically, feature computation is divided into two
distinct phases, i.e., offline stage for model training and online stage for
model serving. These phases often rely on execution engines with different
interface languages and function implementations, causing significant
inconsistencies. Moreover, many online ML features involve complex time-series
computations (e.g., functions over varied-length table windows) that differ
from standard streaming and analytical queries. Existing data processing
systems (e.g., Spark, Flink, DuckDB) often incur multi-second latencies for
these computations, making them unsuitable for real-time online ML applications
that demand timely feature updates.
  This paper presents OpenMLDB, a feature computation system deployed in
4Paradigm's SageOne platform and over 100 real scenarios. Technically, OpenMLDB
first employs a unified query plan generator for consistent computation results
across the offline and online stages, significantly reducing feature deployment
overhead. Second, OpenMLDB provides an online execution engine that resolves
performance bottlenecks caused by long window computations (via
pre-aggregation) and multi-table window unions (via data self-adjusting). It
also provides a high-performance offline execution engine with window parallel
optimization and time-aware data skew resolving. Third, OpenMLDB features a
compact data format and stream-focused indexing to maximize memory usage and
accelerate data access. Evaluations in testing and real workloads reveal
significant performance improvements and resource savings compared to the
baseline systems. The open community of OpenMLDB now has over 150 contributors
and gained 1.6k stars on GitHub.

**Decision Explanation:** Original response: ```
{
    "decision": "REJECT",
    "explanation": "The paper primarily focuses on optimizing feature computation for online Machine Learning (ML) applications using a relational data system, without explicitly mentioning Large Language Models (LLMs), their applications, or related challenges, thereby not meeting the core criteria."
}
```

---

## [Fine-grained Spatio-temporal Event Prediction with Self-adaptive Anchor
  Graph](https://arxiv.org/abs/2501.08653v1)
**arXiv ID:** 2501.08653v1

**Abstract:**
> Event prediction tasks often handle spatio-temporal data distributed in a
large spatial area. Different regions in the area exhibit different
characteristics while having latent correlations. This spatial heterogeneity
and correlations greatly affect the spatio-temporal distributions of event
occurrences, which has not been addressed by state-of-the-art models. Learning
spatial dependencies of events in a continuous space is challenging due to its
fine granularity and a lack of prior knowledge. In this work, we propose a
novel Graph Spatio-Temporal Point Process (GSTPP) model for fine-grained event
prediction. It adopts an encoder-decoder architecture that jointly models the
state dynamics of spatially localized regions using neural Ordinary
Differential Equations (ODEs). The state evolution is built on the foundation
of a novel Self-Adaptive Anchor Graph (SAAG) that captures spatial
dependencies. By adaptively localizing the anchor nodes in the space and
jointly constructing the correlation edges between them, the SAAG enhances the
model's ability of learning complex spatial event patterns. The proposed GSTPP
model greatly improves the accuracy of fine-grained event prediction. Extensive
experimental results show that our method greatly improves the prediction
accuracy over existing spatio-temporal event prediction approaches.

**Decision Explanation:** Original decision: REJECT
The paper does not meet the primary criteria for Practical Applications of Large Language Models (LLMs), as it focuses on Spatio-temporal Event Prediction using Graph Neural Networks and Ordinary Differential Equations, without any apparent connection to LLMs, knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Self-supervised Transformation Learning for Equivariant Representations](https://arxiv.org/abs/2501.08712v1)
**arXiv ID:** 2501.08712v1

**Abstract:**
> Unsupervised representation learning has significantly advanced various
machine learning tasks. In the computer vision domain, state-of-the-art
approaches utilize transformations like random crop and color jitter to achieve
invariant representations, embedding semantically the same inputs despite
transformations. However, this can degrade performance in tasks requiring
precise features, such as localization or flower classification. To address
this, recent research incorporates equivariant representation learning, which
captures transformation-sensitive information. However, current methods depend
on transformation labels and thus struggle with interdependency and complex
transformations. We propose Self-supervised Transformation Learning (STL),
replacing transformation labels with transformation representations derived
from image pairs. The proposed method ensures transformation representation is
image-invariant and learns corresponding equivariant transformations, enhancing
performance without increased batch complexity. We demonstrate the approach's
effectiveness across diverse classification and detection tasks, outperforming
existing methods in 7 out of 11 benchmarks and excelling in detection. By
integrating complex transformations like AugMix, unusable by prior equivariant
methods, this approach enhances performance across tasks, underscoring its
adaptability and resilience. Additionally, its compatibility with various base
models highlights its flexibility and broad applicability. The code is
available at https://github.com/jaemyung-u/stl.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on computer vision and image processing (e.g., transformation learning, image classification, detection tasks), which does not align with the specified areas of interest (Large Language Models, knowledge graphs, retrieval-augmented generation, agentic AI). None of the mandatory criteria (1-3) are met, and the paper's topic falls outside the scope of the evaluation criteria.

---

## [Networked Agents in the Dark: Team Value Learning under Partial
  Observability](https://arxiv.org/abs/2501.08778v1)
**arXiv ID:** 2501.08778v1

**Abstract:**
> We propose a novel cooperative multi-agent reinforcement learning (MARL)
approach for networked agents. In contrast to previous methods that rely on
complete state information or joint observations, our agents must learn how to
reach shared objectives under partial observability. During training, they
collect individual rewards and approximate a team value function through local
communication, resulting in cooperative behavior. To describe our problem, we
introduce the networked dynamic partially observable Markov game framework,
where agents communicate over a switching topology communication network. Our
distributed method, DNA-MARL, uses a consensus mechanism for local
communication and gradient descent for local computation. DNA-MARL increases
the range of the possible applications of networked agents, being well-suited
for real world domains that impose privacy and where the messages may not reach
their recipients. We evaluate DNA-MARL across benchmark MARL scenarios. Our
results highlight the superior performance of DNA-MARL over previous methods.

**Decision Explanation:** Original response: ```
{
    "decision": "REJECT",
    "explanation": "The paper does not meet the core criteria related to Large Language Models (LLMs), instead focusing on multi-agent reinforcement learning (MARL) and networked agents, with no apparent connection to LLMs, knowledge graphs, retrieval-augmented generation, or agentic AI as specified in the selection criteria."
}
```

---

## [XMusic: Towards a Generalized and Controllable Symbolic Music Generation
  Framework](https://arxiv.org/abs/2501.08809v1)
**arXiv ID:** 2501.08809v1

**Abstract:**
> In recent years, remarkable advancements in artificial intelligence-generated
content (AIGC) have been achieved in the fields of image synthesis and text
generation, generating content comparable to that produced by humans. However,
the quality of AI-generated music has not yet reached this standard, primarily
due to the challenge of effectively controlling musical emotions and ensuring
high-quality outputs. This paper presents a generalized symbolic music
generation framework, XMusic, which supports flexible prompts (i.e., images,
videos, texts, tags, and humming) to generate emotionally controllable and
high-quality symbolic music. XMusic consists of two core components, XProjector
and XComposer. XProjector parses the prompts of various modalities into
symbolic music elements (i.e., emotions, genres, rhythms and notes) within the
projection space to generate matching music. XComposer contains a Generator and
a Selector. The Generator generates emotionally controllable and melodious
music based on our innovative symbolic music representation, whereas the
Selector identifies high-quality symbolic music by constructing a multi-task
learning scheme involving quality assessment, emotion recognition, and genre
recognition tasks. In addition, we build XMIDI, a large-scale symbolic music
dataset that contains 108,023 MIDI files annotated with precise emotion and
genre labels. Objective and subjective evaluations show that XMusic
significantly outperforms the current state-of-the-art methods with impressive
music quality. Our XMusic has been awarded as one of the nine Highlights of
Collectibles at WAIC 2023. The project homepage of XMusic is
https://xmusic-project.github.io.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on music generation, which is not explicitly mentioned in the criteria for acceptance, and does not clearly meet any of the required criteria (1-3) related to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation, or agentic AI, despite its innovative approach and state-of-the-art results in the music domain.

---

## [SAIF: A Comprehensive Framework for Evaluating the Risks of Generative
  AI in the Public Sector](https://arxiv.org/abs/2501.08814v1)
**arXiv ID:** 2501.08814v1

**Abstract:**
> The rapid adoption of generative AI in the public sector, encompassing
diverse applications ranging from automated public assistance to welfare
services and immigration processes, highlights its transformative potential
while underscoring the pressing need for thorough risk assessments. Despite its
growing presence, evaluations of risks associated with AI-driven systems in the
public sector remain insufficiently explored. Building upon an established
taxonomy of AI risks derived from diverse government policies and corporate
guidelines, we investigate the critical risks posed by generative AI in the
public sector while extending the scope to account for its multimodal
capabilities. In addition, we propose a Systematic dAta generatIon Framework
for evaluating the risks of generative AI (SAIF). SAIF involves four key
stages: breaking down risks, designing scenarios, applying jailbreak methods,
and exploring prompt types. It ensures the systematic and consistent generation
of prompt data, facilitating a comprehensive evaluation while providing a solid
foundation for mitigating the risks. Furthermore, SAIF is designed to
accommodate emerging jailbreak methods and evolving prompt types, thereby
enabling effective responses to unforeseen risk scenarios. We believe that this
study can play a crucial role in fostering the safe and responsible integration
of generative AI into the public sector.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on responsible AI application and AI ethics (risk assessment and mitigation), which is explicitly mentioned as a rejection criterion. Additionally, while it mentions practical applications in the public sector, the core contribution (SAIF framework) is geared towards evaluating and mitigating risks, rather than demonstrating practical applications or experimental results with Large Language Models (LLMs).

---

## [Automatic tuning of communication protocols for vehicular ad hoc
  networks using metaheuristics](https://arxiv.org/abs/2501.08847v1)
**arXiv ID:** 2501.08847v1

**Abstract:**
> The emerging field of vehicular ad hoc networks (VANETs) deals with a set of
communicating vehicles which are able to spontaneously interconnect without any
pre-existing infrastructure. In such kind of networks, it is crucial to make an
optimal configuration of the communication protocols previously to the final
network deployment. This way, a human designer can obtain an optimal QoS of the
network beforehand. The problem we consider in this work lies in configuring
the File Transfer protocol Configuration (FTC) with the aim of optimizing the
transmission time, the number of lost packets, and the amount of data
transferred in realistic VANET scenarios. We face the FTC with five
representative state-of-the-art optimization techniques and compare their
performance. These algorithms are: Particle Swarm Optimization (PSO),
Differential Evolution (DE), Genetic Algorithm (GA), Evolutionary Strategy
(ES), and Simulated Annealing (SA). For our tests, two typical environment
instances of VANETs for Urban and Highway scenarios have been defined. The
experiments using ns-2 (a well-known realistic VANET simulator) reveal that
PSO outperforms all the compared algorithms for both studied VANET instances.

**Decision Explanation:** Original response: ```
{
    "decision": "REJECT",
    "explanation": "The paper does not meet any of the required criteria, focusing on vehicular ad hoc networks and metaheuristics for optimizing communication protocols, with no apparent connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation, or agentic AI."
}
```

---

## [RouteNet-Gauss: Hardware-Enhanced Network Modeling with Machine Learning](https://arxiv.org/abs/2501.08848v1)
**arXiv ID:** 2501.08848v1

**Abstract:**
> Network simulation is pivotal in network modeling, assisting with tasks
ranging from capacity planning to performance estimation. Traditional
approaches such as Discrete Event Simulation (DES) face limitations in terms of
computational cost and accuracy. This paper introduces RouteNet-Gauss, a novel
integration of a testbed network with a Machine Learning (ML) model to address
these challenges. By using the testbed as a hardware accelerator,
RouteNet-Gauss generates training datasets rapidly and simulates network
scenarios with high fidelity to real-world conditions. Experimental results
show that RouteNet-Gauss significantly reduces prediction errors by up to 95%
and achieves a 488x speedup in inference time compared to state-of-the-art
DES-based methods. RouteNet-Gauss's modular architecture is dynamically
constructed based on the specific characteristics of the network scenario, such
as topology and routing. This enables it to understand and generalize to
different network configurations beyond those seen during training, including
networks up to 10x larger. Additionally, it supports Temporal Aggregated
Performance Estimation (TAPE), providing configurable temporal granularity and
maintaining high accuracy in flow performance metrics. This approach shows
promise in improving both simulation efficiency and accuracy, offering a
valuable tool for network operators.

**Decision Explanation:** Original decision: REJECT
The paper does not meet any of the primary criteria related to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI. It focuses on network modeling using Machine Learning (ML) with a hardware accelerator, which falls outside the specified scope.

---

## [Graph Counterfactual Explainable AI via Latent Space Traversal](https://arxiv.org/abs/2501.08850v1)
**arXiv ID:** 2501.08850v1

**Abstract:**
> Explaining the predictions of a deep neural network is a nontrivial task, yet
high-quality explanations for predictions are often a prerequisite for
practitioners to trust these models. Counterfactual explanations aim to explain
predictions by finding the "nearest" in-distribution alternative input whose
prediction changes in a pre-specified way. However, it remains an open question
how to define this nearest alternative input, whose solution depends on both
the domain (e.g. images, graphs, tabular data, etc.) and the specific
application considered. For graphs, this problem is complicated i) by their
discrete nature, as opposed to the continuous nature of state-of-the-art graph
classifiers; and ii) by the node permutation group acting on the graphs. We
propose a method to generate counterfactual explanations for any differentiable
black-box graph classifier, utilizing a case-specific permutation equivariant
graph variational autoencoder. We generate counterfactual explanations in a
continuous fashion by traversing the latent space of the autoencoder across the
classification boundary of the classifier, allowing for seamless integration of
discrete graph structure and continuous graph attributes. We empirically
validate the approach on three graph datasets, showing that our model is
consistently high-performing and more robust than the baselines.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on explainable AI for graph neural networks, without clear evidence of integrating Large Language Models (LLMs) or addressing the specified application areas (knowledge graphs, retrieval-augmented generation, or agentic AI), thus not meeting the required criteria.

---

## [ARMOR: Shielding Unlearnable Examples against Data Augmentation](https://arxiv.org/abs/2501.08862v1)
**arXiv ID:** 2501.08862v1

**Abstract:**
> Private data, when published online, may be collected by unauthorized parties
to train deep neural networks (DNNs). To protect privacy, defensive noises can
be added to original samples to degrade their learnability by DNNs. Recently,
unlearnable examples are proposed to minimize the training loss such that the
model learns almost nothing. However, raw data are often pre-processed before
being used for training, which may restore the private information of protected
data. In this paper, we reveal the data privacy violation induced by data
augmentation, a commonly used data pre-processing technique to improve model
generalization capability, which is the first of its kind as far as we are
concerned. We demonstrate that data augmentation can significantly raise the
accuracy of the model trained on unlearnable examples from 21.3% to 66.1%. To
address this issue, we propose a defense framework, dubbed ARMOR, to protect
data privacy from potential breaches of data augmentation. To overcome the
difficulty of having no access to the model training process, we design a
non-local module-assisted surrogate model that better captures the effect of
data augmentation. In addition, we design a surrogate augmentation selection
strategy that maximizes distribution alignment between augmented and
non-augmented samples, to choose the optimal augmentation strategy for each
class. We also use a dynamic step size adjustment algorithm to enhance the
defensive noise generation process. Extensive experiments are conducted on 4
datasets and 5 data augmentation methods to verify the performance of ARMOR.
Comparisons with 6 state-of-the-art defense methods have demonstrated that
ARMOR can preserve the unlearnability of protected private data under data
augmentation. ARMOR reduces the test accuracy of the model trained on augmented
protected samples by as much as 60% more than baselines.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on protecting data privacy from breaches, which aligns with 'responsible AI application or AI ethics', a criterion for rejection. Additionally, it does not clearly meet the required criteria for Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation, or agentic AI, as the main topic revolves around deep neural networks (DNNs) and data augmentation techniques for privacy preservation.

---

## [Karatsuba Matrix Multiplication and its Efficient Custom Hardware
  Implementations](https://arxiv.org/abs/2501.08889v1)
**arXiv ID:** 2501.08889v1

**Abstract:**
> While the Karatsuba algorithm reduces the complexity of large integer
multiplication, the extra additions required minimize its benefits for smaller
integers of more commonly-used bitwidths. In this work, we propose the
extension of the scalar Karatsuba multiplication algorithm to matrix
multiplication, showing how this maintains the reduction in multiplication
complexity of the original Karatsuba algorithm while reducing the complexity of
the extra additions. Furthermore, we propose new matrix multiplication hardware
architectures for efficiently exploiting this extension of the Karatsuba
algorithm in custom hardware. We show that the proposed algorithm and hardware
architectures can provide real area or execution time improvements for integer
matrix multiplication compared to scalar Karatsuba or conventional matrix
multiplication algorithms, while also supporting implementation through proven
systolic array and conventional multiplier architectures at the core. We
provide a complexity analysis of the algorithm and architectures and evaluate
the proposed designs both in isolation and in an end-to-end deep learning
accelerator system compared to baseline designs and prior state-of-the-art
works implemented on the same type of compute platform, demonstrating their
ability to increase the performance-per-area of matrix multiplication hardware.

**Decision Explanation:** Original decision: REJECT
The paper does not meet any of the required criteria as it focuses on matrix multiplication algorithms and custom hardware implementations for deep learning accelerators, without mentioning Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, and falls outside the specified areas of interest.

---

## [Computing Game Symmetries and Equilibria That Respect Them](https://arxiv.org/abs/2501.08905v1)
**arXiv ID:** 2501.08905v1

**Abstract:**
> Strategic interactions can be represented more concisely, and analyzed and
solved more efficiently, if we are aware of the symmetries within the
multiagent system. Symmetries also have conceptual implications, for example
for equilibrium selection. We study the computational complexity of identifying
and using symmetries. Using the classical framework of normal-form games, we
consider game symmetries that can be across some or all players and/or actions.
We find a strong connection between game symmetries and graph automorphisms,
yielding graph automorphism and graph isomorphism completeness results for
characterizing the symmetries present in a game. On the other hand, we also
show that the problem becomes polynomial-time solvable when we restrict the
consideration of actions in one of two ways.
  Next, we investigate when exactly game symmetries can be successfully
leveraged for Nash equilibrium computation. We show that finding a Nash
equilibrium that respects a given set of symmetries is PPAD- and CLS-complete
in general-sum and team games respectively -- that is, exactly as hard as
Brouwer fixed point and gradient descent problems. Finally, we present
polynomial-time methods for the special cases where we are aware of a vast
number of symmetries, or where the game is two-player zero-sum and we do not
even know the symmetries.

**Decision Explanation:** Original decision: REJECT
The paper does not meet the mandatory criteria as it lacks focus on Large Language Models (LLMs), practical applications in areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI. Instead, it focuses on computational complexity and game theory, which falls outside the specified scope.

---

## [Trusted Machine Learning Models Unlock Private Inference for Problems
  Currently Infeasible with Cryptography](https://arxiv.org/abs/2501.08970v1)
**arXiv ID:** 2501.08970v1

**Abstract:**
> We often interact with untrusted parties. Prioritization of privacy can limit
the effectiveness of these interactions, as achieving certain goals
necessitates sharing private data. Traditionally, addressing this challenge has
involved either seeking trusted intermediaries or constructing cryptographic
protocols that restrict how much data is revealed, such as multi-party
computations or zero-knowledge proofs. While significant advances have been
made in scaling cryptographic approaches, they remain limited in terms of the
size and complexity of applications they can be used for. In this paper, we
argue that capable machine learning models can fulfill the role of a trusted
third party, thus enabling secure computations for applications that were
previously infeasible. In particular, we describe Trusted Capable Model
Environments (TCMEs) as an alternative approach for scaling secure computation,
where capable machine learning model(s) interact under input/output
constraints, with explicit information flow control and explicit statelessness.
This approach aims to achieve a balance between privacy and computational
efficiency, enabling private inference where classical cryptographic solutions
are currently infeasible. We describe a number of use cases that are enabled by
TCME, and show that even some simple classic cryptographic problems can already
be solved with TCME. Finally, we outline current limitations and discuss the
path forward in implementing them.

**Decision Explanation:** Original response: ```
{
    "decision": "REJECT",
    "explanation": "The paper primarily focuses on privacy and security in machine learning, using cryptography as a point of comparison, without clear connections to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation, or agentic AI, failing to meet the mandatory criteria."
}
```

---

## [Personality Modeling for Persuasion of Misinformation using AI Agent](https://arxiv.org/abs/2501.08985v1)
**arXiv ID:** 2501.08985v1

**Abstract:**
> The proliferation of misinformation on social media platforms has highlighted
the need to understand how individual personality traits influence
susceptibility to and propagation of misinformation. This study employs an
innovative agent-based modeling approach to investigate the relationship
between personality traits and misinformation dynamics. Using six AI agents
embodying different dimensions of the Big Five personality traits
(Extraversion, Agreeableness, and Neuroticism), we simulated interactions
across six diverse misinformation topics. The experiment, implemented through
the AgentScope framework using the GLM-4-Flash model, generated 90 unique
interactions, revealing complex patterns in how personality combinations affect
persuasion and resistance to misinformation. Our findings demonstrate that
analytical and critical personality traits enhance effectiveness in
evidence-based discussions, while non-aggressive persuasion strategies show
unexpected success in misinformation correction. Notably, agents with critical
traits achieved a 59.4% success rate in HIV-related misinformation discussions,
while those employing non-aggressive approaches maintained consistent
persuasion rates above 40% across different personality combinations. The study
also revealed a non-transitive pattern in persuasion effectiveness, challenging
conventional assumptions about personality-based influence. These results
provide crucial insights for developing personality-aware interventions in
digital environments and suggest that effective misinformation countermeasures
should prioritize emotional connection and trust-building over confrontational
approaches. The findings contribute to both theoretical understanding of
personality-misinformation dynamics and practical strategies for combating
misinformation in social media contexts.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on social applications of AI regarding misinformation and social influence, aligning with rejected topics (Social harm or similar issues), and lacks clear connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI in a non-sandboxed, complex, real-world task context.

---

## [AI-RAN: Transforming RAN with AI-driven Computing Infrastructure](https://arxiv.org/abs/2501.09007v1)
**arXiv ID:** 2501.09007v1

**Abstract:**
> The radio access network (RAN) landscape is undergoing a transformative shift
from traditional, communication-centric infrastructures towards converged
compute-communication platforms. This article introduces AI-RAN which
integrates both RAN and artificial intelligence (AI) workloads on the same
infrastructure. By doing so, AI-RAN not only meets the performance demands of
future networks but also improves asset utilization. We begin by examining how
RANs have evolved beyond mobile broadband towards AI-RAN and articulating
manifestations of AI-RAN into three forms: AI-for-RAN, AI-on-RAN, and
AI-and-RAN. Next, we identify the key requirements and enablers for the
convergence of communication and computing in AI-RAN. We then provide a
reference architecture for advancing AI-RAN from concept to practice. To
illustrate the practical potential of AI-RAN, we present a proof-of-concept
that concurrently processes RAN and AI workloads utilizing NVIDIA Grace-Hopper
GH200 servers. Finally, we conclude the article by outlining future work
directions to guide further developments of AI-RAN.

**Decision Explanation:** Original decision: REJECT
The paper does not mention Large Language Models (LLMs) or related areas like knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, focusing instead on integrating AI with radio access network (RAN) infrastructure, which is outside the specified scope.

---

## [Multimodal LLMs Can Reason about Aesthetics in Zero-Shot](https://arxiv.org/abs/2501.09012v1)
**arXiv ID:** 2501.09012v1

**Abstract:**
> We present the first study on how Multimodal LLMs' (MLLMs) reasoning ability
shall be elicited to evaluate the aesthetics of artworks. To facilitate this
investigation, we construct MM-StyleBench, a novel high-quality dataset for
benchmarking artistic stylization. We then develop a principled method for
human preference modeling and perform a systematic correlation analysis between
MLLMs' responses and human preference. Our experiments reveal an inherent
hallucination issue of MLLMs in art evaluation, associated with response
subjectivity. ArtCoT is proposed, demonstrating that art-specific task
decomposition and the use of concrete language boost MLLMs' reasoning ability
for aesthetics. Our findings offer valuable insights into MLLMs for art and can
benefit a wide range of downstream applications, such as style transfer and
artistic image generation. Code available at
https://github.com/songrise/MLLM4Art.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on the application of Multimodal LLMs in evaluating aesthetics of artworks, which falls under a creative/social application rather than the specified areas of interest (knowledge graphs, retrieval-augmented generation, agentic AI, etc.). It does not clearly align with the preferred criteria, especially given its lack of direct relevance to practical applications in areas like knowledge graphs, RAG, or agentic AI.

---

## [Spatio-Temporal Foundation Models: Vision, Challenges, and Opportunities](https://arxiv.org/abs/2501.09045v1)
**arXiv ID:** 2501.09045v1

**Abstract:**
> Foundation models have revolutionized artificial intelligence, setting new
benchmarks in performance and enabling transformative capabilities across a
wide range of vision and language tasks. However, despite the prevalence of
spatio-temporal data in critical domains such as transportation, public health,
and environmental monitoring, spatio-temporal foundation models (STFMs) have
not yet achieved comparable success. In this paper, we articulate a vision for
the future of STFMs, outlining their essential characteristics and the
generalization capabilities necessary for broad applicability. We critically
assess the current state of research, identifying gaps relative to these ideal
traits, and highlight key challenges that impede their progress. Finally, we
explore potential opportunities and directions to advance research towards the
aim of effective and broadly applicable STFMs.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on spatio-temporal foundation models (e.g., for transportation, public health, and environmental monitoring) with minimal indication of direct involvement with Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, failing to meet the mandatory criteria (1, 2, and 3) and not strongly aligning with any of the secondary criteria related to LLMs.

---

## [Dynamic-Aware Spatio-temporal Representation Learning for Dynamic MRI
  Reconstruction](https://arxiv.org/abs/2501.09049v1)
**arXiv ID:** 2501.09049v1

**Abstract:**
> Dynamic MRI reconstruction, an inverse problem, has seen a surge in the
use of deep learning techniques. In particular, the practical difficulty of
obtaining ground truth data has led to the emergence of unsupervised learning
approaches. A recent promising method among them is implicit neural
representation (INR), which defines the data as a continuous function that maps
coordinate values to the corresponding signal values. This allows for filling
in missing information only with incomplete measurements and solving the
inverse problem effectively. Nevertheless, previous works incorporating this
method have faced drawbacks such as long optimization time and the need for
extensive hyperparameter tuning. To address these issues, we propose
Dynamic-Aware INR (DA-INR), an INR-based model for dynamic MRI reconstruction
that captures the spatial and temporal continuity of dynamic MRI data in the
image domain and explicitly incorporates the temporal redundancy of the data
into the model structure. As a result, DA-INR outperforms other models in
reconstruction quality even at extreme undersampling ratios while significantly
reducing optimization time and requiring minimal hyperparameter tuning.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on medical applications of AI (Dynamic MRI Reconstruction), which is a rejection criterion, and does not mention Large Language Models (LLMs) or related areas like knowledge graphs, RAG, or agentic AI, failing to meet the mandatory criteria.

---

## [Tracking the Takes and Trajectories of English-Language News Narratives
  across Trustworthy and Worrisome Websites](https://arxiv.org/abs/2501.09102v1)
**arXiv ID:** 2501.09102v1

**Abstract:**
> Understanding how misleading and outright false information enters news
ecosystems remains a difficult challenge that requires tracking how narratives
spread across thousands of fringe and mainstream news websites. To do this, we
introduce a system that utilizes encoder-based large language models and
zero-shot stance detection to scalably identify and track news narratives and
their attitudes across over 4,000 factually unreliable, mixed-reliability, and
factually reliable English-language news websites. Running our system over an
18 month period, we track the spread of 146K news stories. Using network-based
inference via the NETINF algorithm, we show that the paths of news
narratives and the stances of websites toward particular entities can be used
to uncover slanted propaganda networks (e.g., anti-vaccine and anti-Ukraine)
and to identify the most influential websites in spreading these attitudes in
the broader news ecosystem. We hope that increased visibility into our
distributed news ecosystem can help with the reporting and fact-checking of
propaganda and disinformation.

**Decision Explanation:** Original response: ```
{
    "decision": "REJECT",
    "explanation": "The paper primarily focuses on social applications of AI to combat disinformation and propaganda, aligning with 'social applications of AI in regard to...Social harm, or similar issues', a criterion for rejection."
}
```

---

## [A Non-autoregressive Model for Joint STT and TTS](https://arxiv.org/abs/2501.09104v1)
**arXiv ID:** 2501.09104v1

**Abstract:**
> In this paper, we take a step towards jointly modeling automatic speech
recognition (STT) and speech synthesis (TTS) in a fully non-autoregressive way.
We develop a novel multimodal framework capable of handling the speech and text
modalities as input either individually or together. The proposed model can
also be trained with unpaired speech or text data owing to its multimodal
nature. We further propose an iterative refinement strategy to improve the STT
and TTS performance of our model such that the partial hypothesis at the output
can be fed back to the input of our model, thus iteratively improving both STT
and TTS predictions. We show that our joint model can effectively perform both
STT and TTS tasks, outperforming the STT-specific baseline in all tasks and
performing competitively with the TTS-specific baseline across a wide range of
evaluation metrics.

**Decision Explanation:** Original decision: REJECT
The paper focuses on Speech-to-Text (STT) and Text-to-Speech (TTS) applications, which, while related to language models, do not explicitly involve Large Language Models (LLMs) in areas like knowledge graphs, retrieval-augmented generation, or agentic AI as required by the primary criteria. Additionally, there is no clear indication of addressing the secondary criteria that would justify inclusion under the prioritization for inclusiveness.

---

## [AutoLoop: Fast Visual SLAM Fine-tuning through Agentic Curriculum
  Learning](https://arxiv.org/abs/2501.09160v1)
**arXiv ID:** 2501.09160v1

**Abstract:**
> Current visual SLAM systems face significant challenges in balancing
computational efficiency with robust loop closure handling. Traditional
approaches require careful manual tuning and incur substantial computational
overhead, while learning-based methods either lack explicit loop closure
capabilities or implement them through computationally expensive methods. We
present AutoLoop, a novel approach that combines automated curriculum learning
with efficient fine-tuning for visual SLAM systems. Our method employs a DDPG
(Deep Deterministic Policy Gradient) agent to dynamically adjust loop closure
weights during training, eliminating the need for manual hyperparameter search
while significantly reducing the required training steps. The approach
pre-computes potential loop closure pairs offline and leverages them through an
agent-guided curriculum, allowing the model to adapt efficiently to new
scenarios. Experiments conducted on TartanAir for training and validated across
multiple benchmarks including KITTI, EuRoC, ICL-NUIM and TUM RGB-D demonstrate
that AutoLoop achieves comparable or superior performance while reducing
training time by an order of magnitude compared to traditional approaches.
AutoLoop provides a practical solution for rapid adaptation of visual SLAM
systems, automating the weight tuning process that traditionally requires
multiple manual iterations. Our results show that this automated curriculum
strategy not only accelerates training but also maintains or improves the
model's performance across diverse environmental conditions.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on visual SLAM (Simultaneous Localization and Mapping), a computer vision and robotics topic, with no apparent direct connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI applications as specified in the criteria. The mention of 'Agentic' in the title refers to the use of an agent in a robotics context, not the specified AI application area.

---

## [Towards Understanding Extrapolation: a Causal Lens](https://arxiv.org/abs/2501.09163v1)
**arXiv ID:** 2501.09163v1

**Abstract:**
> Canonical work handling distribution shifts typically necessitates an entire
target distribution that lands inside the training distribution. However,
practical scenarios often involve only a handful of target samples, potentially
lying outside the training support, which requires the capability of
extrapolation. In this work, we aim to provide a theoretical understanding of
when extrapolation is possible and offer principled methods to achieve it
without requiring an on-support target distribution. To this end, we formulate
the extrapolation problem with a latent-variable model that embodies the
minimal change principle in causal mechanisms. Under this formulation, we cast
the extrapolation problem into a latent-variable identification problem. We
provide realistic conditions on shift properties and the estimation objectives
that lead to identification even when only one off-support target sample is
available, tackling the most challenging scenarios. Our theory reveals the
intricate interplay between the underlying manifold's smoothness and the shift
properties. We showcase how our theoretical results inform the design of
practical adaptation algorithms. Through experiments on both synthetic and
real-world data, we validate our theoretical findings and their practical
implications.

**Decision Explanation:** Original decision: REJECT
The paper does not meet the required criteria as it lacks focus on Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, instead concentrating on theoretical understanding of extrapolation with a causal lens, making it outside the specified scope.

---

## [A Blockchain-Enabled Approach to Cross-Border Compliance and Trust](https://arxiv.org/abs/2501.09182v1)
**arXiv ID:** 2501.09182v1

**Abstract:**
> As artificial intelligence (AI) systems become increasingly integral to
critical infrastructure and global operations, the need for a unified,
trustworthy governance framework is more urgent than ever. This paper proposes
a novel approach to AI governance, utilizing blockchain and distributed ledger
technologies (DLT) to establish a decentralized, globally recognized framework
that ensures security, privacy, and trustworthiness of AI systems across
borders. The paper presents specific implementation scenarios within the
financial sector, outlines a phased deployment timeline over the next decade,
and addresses potential challenges with solutions grounded in current research.
By synthesizing advancements in blockchain, AI ethics, and cybersecurity, this
paper offers a comprehensive roadmap for a decentralized AI governance
framework capable of adapting to the complex and evolving landscape of global
AI regulation.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on AI governance, ethics, and regulation, with an emphasis on blockchain-enabled compliance and trust, which aligns with rejected criteria (responsible AI application or AI ethics). It lacks clear connections to Large Language Models (LLMs) and their practical applications as specified in the selection criteria.

---

## [Patch-aware Vector Quantized Codebook Learning for Unsupervised Visual
  Defect Detection](https://arxiv.org/abs/2501.09187v1)
**arXiv ID:** 2501.09187v1

**Abstract:**
> Unsupervised visual defect detection is critical in industrial applications,
requiring a representation space that captures normal data features while
detecting deviations. Achieving a balance between expressiveness and
compactness is challenging; an overly expressive space risks inefficiency and
mode collapse, impairing detection accuracy. We propose a novel approach using
an enhanced VQ-VAE framework optimized for unsupervised defect detection. Our
model introduces a patch-aware dynamic code assignment scheme, enabling
context-sensitive code allocation to optimize spatial representation. This
strategy enhances normal-defect distinction and improves detection accuracy
during inference. Experiments on MVTecAD, BTAD, and MTSD datasets show our
method achieves state-of-the-art performance.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on unsupervised visual defect detection, which falls under 'video processing' (analyzing images for defects), meeting none of the required criteria for Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---

## [Adaptive Law-Based Transformation (ALT): A Lightweight Feature
  Representation for Time Series Classification](https://arxiv.org/abs/2501.09217v1)
**arXiv ID:** 2501.09217v1

**Abstract:**
> Time series classification (TSC) is fundamental in numerous domains,
including finance, healthcare, and environmental monitoring. However,
traditional TSC methods often struggle with the inherent complexity and
variability of time series data. Building on our previous work with the linear
law-based transformation (LLT) - which improved classification accuracy by
transforming the feature space based on key data patterns - we introduce
adaptive law-based transformation (ALT). ALT enhances LLT by incorporating
variable-length shifted time windows, enabling it to capture distinguishing
patterns of various lengths and thereby handle complex time series more
effectively. By mapping features into a linearly separable space, ALT provides
a fast, robust, and transparent solution that achieves state-of-the-art
performance with only a few hyperparameters.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on time series classification in domains like finance, healthcare, and environmental monitoring, without any evident connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, thus failing to meet the fundamental criteria.

---

## [Foundations of Large Language Models](https://arxiv.org/abs/2501.09223v1)
**arXiv ID:** 2501.09223v1

**Abstract:**
> This is a book about large language models. As indicated by the title, it
primarily focuses on foundational concepts rather than comprehensive coverage
of all cutting-edge technologies. The book is structured into four main
chapters, each exploring a key area: pre-training, generative models, prompting
techniques, and alignment methods. It is intended for college students,
professionals, and practitioners in natural language processing and related
fields, and can serve as a reference for anyone interested in large language
models.

**Decision Explanation:** Original decision: REJECT
The paper does not meet the required criteria: it primarily focuses on foundational concepts rather than practical applications, experimental results, or comparisons with state-of-the-art techniques in areas like knowledge graphs, RAG, or agentic AI, and it mentions no real-world applications, experimental results, or novelty in LLM applications.

---

## [Large Language Model is Secretly a Protein Sequence Optimizer](https://arxiv.org/abs/2501.09274v1)
**arXiv ID:** 2501.09274v1

**Abstract:**
> We consider the protein sequence engineering problem, which aims to find
protein sequences with high fitness levels, starting from a given wild-type
sequence. Directed evolution has been a dominating paradigm in this field which
has an iterative process to generate variants and select via experimental
feedback. We demonstrate large language models (LLMs), despite being trained on
massive texts, are secretly protein sequence optimizers. With a directed
evolutionary method, LLM can perform protein engineering through Pareto and
experiment-budget constrained optimization, demonstrating success on both
synthetic and experimental fitness landscapes.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on medical applications of AI, specifically protein sequence engineering, which falls under the rejection criteria. Despite meeting potential criteria for practical applications, experimental results, and novelty, its primary focus on a medical application takes precedence.

---

## [LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport](https://arxiv.org/abs/2501.09291v1)
**arXiv ID:** 2501.09291v1

**Abstract:**
> Automated audio captioning is a task that generates textual descriptions for
audio content, and recent studies have explored using visual information to
enhance captioning quality. However, current methods often fail to effectively
fuse audio and visual data, missing important semantic cues from each modality.
To address this, we introduce LAVCap, a large language model (LLM)-based
audio-visual captioning framework that effectively integrates visual
information with audio to improve audio captioning performance. LAVCap employs
an optimal transport-based alignment loss to bridge the modality gap between
audio and visual features, enabling more effective semantic extraction.
Additionally, we propose an optimal transport attention module that enhances
audio-visual fusion using an optimal transport assignment map. Combined with
the optimal training strategy, experimental results demonstrate that each
component of our framework is effective. LAVCap outperforms existing
state-of-the-art methods on the AudioCaps dataset, without relying on large
datasets or post-processing. Code is available at
https://github.com/NAVER-INTEL-Co-Lab/gaudi-lavcap.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on audio-visual captioning with LLMs, which falls under multimedia processing (similar to video processing), and does not align with the specified areas of interest (knowledge graphs, retrieval-augmented generation, agentic AI, etc.).

---

## [Understanding Mental Health Content on Social Media and Its Effect
  Towards Suicidal Ideation](https://arxiv.org/abs/2501.09309v1)
**arXiv ID:** 2501.09309v1

**Abstract:**
> This review underscores the critical need for effective strategies to
identify and support individuals with suicidal ideation, exploiting
technological innovations in ML and DL to further suicide prevention efforts.
The study details the application of these technologies in analyzing vast
amounts of unstructured social media data to detect linguistic patterns,
keywords, phrases, tones, and contextual cues associated with suicidal
thoughts. It explores various ML and DL models like SVMs, CNNs, LSTM, neural
networks, and their effectiveness in interpreting complex data patterns and
emotional nuances within text data. The review discusses the potential of these
technologies to serve as a life-saving tool by identifying at-risk individuals
through their digital traces. Furthermore, it evaluates the real-world
effectiveness, limitations, and ethical considerations of employing these
technologies for suicide prevention, stressing the importance of responsible
development and usage. The study aims to fill critical knowledge gaps by
analyzing recent studies, methodologies, tools, and techniques in this field.
It highlights the importance of synthesizing current literature to inform
practical tools and suicide prevention efforts, guiding innovation in reliable,
ethical systems for early intervention. This research synthesis evaluates the
intersection of technology and mental health, advocating for the ethical and
responsible application of ML, DL, and NLP to offer life-saving potential
worldwide while addressing challenges like generalizability, biases, privacy,
and the need for further research to ensure these technologies do not
exacerbate existing inequities and harms.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on social applications of AI in regard to social harm (suicidal ideation) and responsible AI application or AI ethics, which are explicitly listed as rejection criteria.

---

## [Shape-Based Single Object Classification Using Ensemble Method
  Classifiers](https://arxiv.org/abs/2501.09311v1)
**arXiv ID:** 2501.09311v1

**Abstract:**
> Nowadays, more and more images are available. Annotation and retrieval of the
images pose classification problems, where each class is defined as the group
of database images labelled with a common semantic label. Various systems have
been proposed for content-based retrieval, as well as for image classification
and indexing. In this paper, a hierarchical classification framework has been
proposed for bridging the semantic gap effectively and achieving multi-category
image classification. A well-known pre-processing and post-processing method
was used and applied to three problems: image segmentation, object
identification and image classification. The method was applied to classify
single object images from Amazon and Google datasets. The classification was
tested for four different classifiers: BayesNetwork (BN), Random Forest (RF),
Bagging and Vote. The estimated classification accuracies ranged from 20% to
99% (using 10-fold cross validation). The Bagging classifier presents the best
performance, followed by the Random Forest classifier.

**Decision Explanation:** Original decision: REJECT
The paper does not meet any of the required criteria related to Large Language Models (LLMs), focusing instead on image classification, object identification, and traditional machine learning classifiers, with no mention of LLMs, knowledge graphs, RAG, or agentic AI.

---

## [ELM-DeepONets: Backpropagation-Free Training of Deep Operator Networks
  via Extreme Learning Machines](https://arxiv.org/abs/2501.09395v1)
**arXiv ID:** 2501.09395v1

**Abstract:**
> Deep Operator Networks (DeepONets) are among the most prominent frameworks
for operator learning, grounded in the universal approximation theorem for
operators. However, training DeepONets typically requires significant
computational resources. To address this limitation, we propose ELM-DeepONets,
an Extreme Learning Machine (ELM) framework for DeepONets that leverages the
backpropagation-free nature of ELM. By reformulating DeepONet training as a
least-squares problem for newly introduced parameters, the ELM-DeepONet
approach significantly reduces training complexity. Validation on benchmark
problems, including nonlinear ODEs and PDEs, demonstrates that the proposed
method not only achieves superior accuracy but also drastically reduces
computational costs. This work offers a scalable and efficient alternative for
operator learning in scientific computing.

**Decision Explanation:** Original decision: REJECT
The paper does not meet the required criteria as it focuses on Deep Operator Networks for scientific computing (non-Large Language Model application), lacks comparison with state-of-the-art LLM techniques, and does not demonstrate practical applications in knowledge graphs, RAG, or agentic AI.

---

## [Dynamic Neural Style Transfer for Artistic Image Generation using VGG19](https://arxiv.org/abs/2501.09420v1)
**arXiv ID:** 2501.09420v1

**Abstract:**
> Throughout history, humans have created remarkable works of art, but
artificial intelligence has only recently started to make strides in generating
visually compelling art. Breakthroughs in the past few years have focused on
using convolutional neural networks (CNNs) to separate and manipulate the
content and style of images, applying texture synthesis techniques.
Nevertheless, a number of current techniques continue to encounter obstacles,
including lengthy processing times, restricted choices of style images, and the
inability to modify the weight ratio of styles. To address these constraints,
we propose a neural style transfer system that can add various artistic styles
to a desired image, allowing flexible adjustments to style weight ratios and
reducing processing time. The system uses the VGG19 model for feature
extraction, ensuring high-quality, flexible stylization without compromising
content integrity.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on image processing and generation using CNNs (VGG19), with no apparent connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, failing to meet the fundamental criteria.

---

## [A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and
  Mitigation Strategy](https://arxiv.org/abs/2501.09431v1)
**arXiv ID:** 2501.09431v1

**Abstract:**
> While large language models (LLMs) present significant potential for
supporting numerous real-world applications and delivering positive social
impacts, they still face significant challenges in terms of the inherent risk
of privacy leakage, hallucinated outputs, and value misalignment, and can be
maliciously used for generating toxic content and unethical purposes after being
jailbroken. Therefore, in this survey, we present a comprehensive review of
recent advancements aimed at mitigating these issues, organized across the four
phases of LLM development and usage: data collecting and pre-training,
fine-tuning and alignment, prompting and reasoning, and post-processing and
auditing. We elaborate on the recent advances for enhancing the performance of
LLMs in terms of privacy protection, hallucination reduction, value alignment,
toxicity elimination, and jailbreak defenses. In contrast to previous surveys
that focus on a single dimension of responsible LLMs, this survey presents a
unified framework that encompasses these diverse dimensions, providing a
comprehensive view of enhancing LLMs to better serve real-world applications.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on responsible AI application and AI ethics, specifically addressing inherent risks, malicious use, and mitigation strategies, which is explicitly listed as a rejection criterion.

---

## [Solving the unsolvable: Translating case law in Hong Kong](https://arxiv.org/abs/2501.09444v1)
**arXiv ID:** 2501.09444v1

**Abstract:**
> This paper addresses the challenges of translating case law under Hong Kong's
bilingual legal system. It highlights the initial success of translating all
written statutes into Chinese before the 1997 handover, a task mandated by the
Basic Law. The effort involved significant collaboration among legal,
linguistic, and translation experts, resulting in a comprehensive and
culturally appropriate bilingual legal system. However, translating case law
remains a significant challenge due to the sheer volume and continuous growth
of judicial decisions. The paper critiques the government's and judiciary's
sporadic and uncoordinated efforts to translate case law, contrasting it with
the thorough approach previously taken for statute translation. Although the
government acknowledges the importance of legal bilingualism, it lacks a
sustainable strategy for translating case law. The Judiciary's position that
translating all judgments is unnecessary, unrealistic, and not cost-effective is
analyzed and critiqued for its impact on legal transparency and public trust. A
proposed solution involves leveraging machine translation technology through a
human-machine interactive translation platform, which undergoes two major
transitions. Initially based on a neural model, the platform transitions to
using a large language model for improved translation accuracy. Furthermore, it
evolves from a single-agent system to a multi-agent system, incorporating
Translator, Annotator, and Proofreader agents. This multi-agent approach,
supported by a grant, aims to facilitate efficient, high-quality translation of
judicial judgments by integrating advanced artificial intelligence and
continuous feedback mechanisms, thus better meeting the needs of a bilingual
legal system.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on law, specifically the application of Large Language Models for translating case law in Hong Kong's legal system, which meets the rejection criteria (Primarily focuses on law, either with AI as subject or participant).

---

## [RE-POSE: Synergizing Reinforcement Learning-Based Partitioning and
  Offloading for Edge Object Detection](https://arxiv.org/abs/2501.09465v1)
**arXiv ID:** 2501.09465v1

**Abstract:**
> Object detection plays a crucial role in smart video analysis, with
applications ranging from autonomous driving and security to smart cities.
However, achieving real-time object detection on edge devices presents
significant challenges due to their limited computational resources and the
high demands of deep neural network (DNN)-based detection models, particularly
when processing high-resolution video. Conventional strategies, such as input
down-sampling and network up-scaling, often compromise detection accuracy for
faster performance or lead to higher inference latency. To address these
issues, this paper introduces RE-POSE, a Reinforcement Learning (RL)-Driven
Partitioning and Edge Offloading framework designed to optimize the
accuracy-latency trade-off in resource-constrained edge environments. Our
approach features an RL-Based Dynamic Clustering Algorithm (RL-DCA) that
partitions video frames into non-uniform blocks based on object distribution
and the computational characteristics of DNNs. Furthermore, a parallel edge
offloading scheme is implemented to distribute these blocks across multiple
edge servers for concurrent processing. Experimental evaluations show that
RE-POSE significantly enhances detection accuracy and reduces inference
latency, surpassing existing methods.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on video processing (object detection in high-resolution video) and edge computing optimization, which is explicitly listed as a rejection criterion. While it demonstrates practical applications, experimental results, and comparison with state-of-the-art, its core focus aligns with a rejected topic area.

---

## [MonoSOWA: Scalable monocular 3D Object detector Without human
  Annotations](https://arxiv.org/abs/2501.09481v1)
**arXiv ID:** 2501.09481v1

**Abstract:**
> Detecting the three-dimensional position and orientation of objects using a
single RGB camera is a foundational task in computer vision with many important
applications. Traditionally, 3D object detection methods are trained in a
fully-supervised setup, requiring vast amounts of human annotations, which are
laborious, costly, and do not scale well with the ever-increasing amounts of
data being captured.
  In this paper, we present the first method to train 3D object detectors for
monocular RGB cameras without domain-specific human annotations, thus making
orders of magnitude more data available for training. Thanks to newly proposed
Canonical Object Space, the method can not only exploit data across a variety
of datasets and camera setups to train a single 3D detector, but unlike
previous work it also works out of the box in previously unseen camera setups.
All this is crucial for practical applications, where the data and cameras are
extremely heterogeneous.
  The method is evaluated on two standard autonomous driving datasets, where it
outperforms previous works, which, unlike our method, still rely on 2D human
annotations.

**Decision Explanation:** Original decision: REJECT
The paper focuses on computer vision (3D object detection using a monocular RGB camera) and autonomous driving, with no apparent connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, thus failing to meet the primary criteria.

---

## [MatrixNet: Learning over symmetry groups using learned group
  representations](https://arxiv.org/abs/2501.09571v1)
**arXiv ID:** 2501.09571v1

**Abstract:**
> Group theory has been used in machine learning to provide a theoretically
grounded approach for incorporating known symmetry transformations in tasks
from robotics to protein modeling. In these applications, equivariant neural
networks use known symmetry groups with predefined representations to learn
over geometric input data. We propose MatrixNet, a neural network architecture
that learns matrix representations of group element inputs instead of using
predefined representations. MatrixNet achieves higher sample efficiency and
generalization over several standard baselines in prediction tasks over
several finite groups and the Artin braid group. We also show that MatrixNet
respects group relations allowing generalization to group elements of greater
word length than in the training set.

**Decision Explanation:** Original decision: REJECT
The paper does not meet the practical applications criteria for Large Language Models (LLMs) as it focuses on group theory, symmetry transformations, and equivariant neural networks without mentioning LLMs, knowledge graphs, retrieval-augmented generation, or agentic AI.

---

## [Managed-Retention Memory: A New Class of Memory for the AI Era](https://arxiv.org/abs/2501.09605v1)
**arXiv ID:** 2501.09605v1

**Abstract:**
> AI clusters today are one of the major uses of High Bandwidth Memory (HBM).
However, HBM is suboptimal for AI workloads for several reasons. Analysis shows
HBM is overprovisioned on write performance, but underprovisioned on density
and read bandwidth, and also has significant energy per bit overheads. It is
also expensive, with lower yield than DRAM due to manufacturing complexity. We
propose a new memory class: Managed-Retention Memory (MRM), which is more
optimized to store key data structures for AI inference workloads. We believe
that MRM may finally provide a path to viability for technologies that were
originally proposed to support Storage Class Memory (SCM). These technologies
traditionally offered long-term persistence (10+ years) but provided poor IO
performance and/or endurance. MRM makes different trade-offs, and by
understanding the workload IO patterns, MRM foregoes long-term data retention
and write performance for better potential performance on the metrics important
for these workloads.

**Decision Explanation:** Original decision: REJECT
The paper focuses on developing a new memory class (Managed-Retention Memory) for optimizing AI inference workloads, but does not explicitly mention Large Language Models (LLMs), their applications, or related criteria such as knowledge graphs, RAG, or agentic AI, failing to meet the primary criteria.

---

## [Incorporating Quantum Advantage in Quantum Circuit Generation through
  Genetic Programming](https://arxiv.org/abs/2501.09682v1)
**arXiv ID:** 2501.09682v1

**Abstract:**
> Designing efficient quantum circuits that leverage quantum advantage compared
to classical computing has become increasingly critical. Genetic algorithms
have shown potential in generating such circuits through artificial evolution.
However, integrating quantum advantage into the fitness function of these
algorithms remains unexplored. In this paper, we aim to enhance the efficiency
of quantum circuit design by proposing two novel approaches for incorporating
quantum advantage metrics into the fitness function of genetic algorithms. We
evaluate our approaches based on the Bernstein-Vazirani Problem and the
Unstructured Database Search Problem as test cases. The results demonstrate
that our approaches not only improve the convergence speed of the genetic
algorithm but also produce circuits comparable to expert-designed solutions.
Our findings suggest that automated quantum circuit design using genetic
algorithms that incorporate a measure of quantum advantage is a promising
approach to accelerating the development of quantum algorithms.

**Decision Explanation:** Original decision: REJECT
The paper does not meet the required criteria as it focuses on Quantum Circuit Generation using Genetic Programming, with no apparent connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, thus failing to meet the primary criteria (1, 2, and 3) and all secondary criteria related to LLMs.

---

## [Reward-Guided Controlled Generation for Inference-Time Alignment in
  Diffusion Models: Tutorial and Review](https://arxiv.org/abs/2501.09685v1)
**arXiv ID:** 2501.09685v1

**Abstract:**
> This tutorial provides an in-depth guide on inference-time guidance and
alignment methods for optimizing downstream reward functions in diffusion
models. While diffusion models are renowned for their generative modeling
capabilities, practical applications in fields such as biology often require
sample generation that maximizes specific metrics (e.g., stability, affinity in
proteins, closeness to target structures). In these scenarios, diffusion models
can be adapted not only to generate realistic samples but also to explicitly
maximize desired measures at inference time without fine-tuning. This tutorial
explores the foundational aspects of such inference-time algorithms. We review
these methods from a unified perspective, demonstrating that current techniques
-- such as Sequential Monte Carlo (SMC)-based guidance, value-based sampling,
and classifier guidance -- aim to approximate soft optimal denoising processes
(a.k.a. policies in RL) that combine pre-trained denoising processes with value
functions serving as look-ahead functions that predict from intermediate states
to terminal rewards. Within this framework, we present several novel algorithms
not yet covered in the literature. Furthermore, we discuss (1) fine-tuning
methods combined with inference-time techniques, (2) inference-time algorithms
based on search algorithms such as Monte Carlo tree search, which have received
limited attention in current research, and (3) connections between
inference-time algorithms in language models and diffusion models. The code of
this tutorial on protein design is available at
https://github.com/masa-ue/AlignInversePro

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on diffusion models, biology (protein design), and generative modeling, with no clear connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, thus not meeting the mandatory criteria.

---

## [Practical Continual Forgetting for Pre-trained Vision Models](https://arxiv.org/abs/2501.09705v1)
**arXiv ID:** 2501.09705v1

**Abstract:**
> For privacy and security concerns, the need to erase unwanted information
from pre-trained vision models is becoming evident nowadays. In real-world
scenarios, erasure requests originate at any time from both users and model
owners, and these requests usually form a sequence. Therefore, under such a
setting, selective information is expected to be continuously removed from a
pre-trained model while maintaining the rest. We define this problem as
continual forgetting and identify three key challenges. (i) For unwanted
knowledge, efficient and effective deleting is crucial. (ii) For remaining
knowledge, the impact brought by the forgetting procedure should be minimal.
(iii) In real-world scenarios, the training samples may be scarce or partially
missing during the process of forgetting. To address them, we first propose
Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we introduce LoRA
modules to fine-tune the FFN layers in Transformer blocks for each forgetting
task independently, and towards (ii), a simple group sparse regularization is
adopted, enabling automatic selection of specific LoRA groups and zeroing out
the others. To further extend GS-LoRA to more practical scenarios, we
incorporate prototype information as additional supervision and introduce a
more practical approach, GS-LoRA++. For each forgotten class, we move the
logits away from its original prototype. For the remaining classes, we pull the
logits closer to their respective prototypes. We conduct extensive experiments
on face recognition, object detection and image classification and demonstrate
that our method manages to forget specific classes with minimal impact on other
classes. Codes have been released on https://github.com/bjzhb666/GS-LoRA.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on pre-trained vision models, continual forgetting, and applications in face recognition, object detection, and image classification, with no apparent connection to Large Language Models (LLMs), which is the core requirement for acceptance.

---

## [Parallel multi-objective metaheuristics for smart communications in
  vehicular networks](https://arxiv.org/abs/2501.09725v1)
**arXiv ID:** 2501.09725v1

**Abstract:**
> This article analyzes the use of two parallel multi-objective soft computing
algorithms to automatically search for high-quality settings of the Ad hoc On
Demand Vector routing protocol for vehicular networks. These methods are based
on an evolutionary algorithm and on a swarm intelligence approach. The
experimental analysis demonstrates that the configurations computed by our
optimization algorithms outperform other state-of-the-art optimized ones. In
turn, the computational efficiency achieved by all the parallel versions is
greater than 87%. Therefore, the line of work presented in this article
represents an efficient framework to improve vehicular communications.

**Decision Explanation:** Original decision: REJECT
The paper does not meet the primary criteria as it does not focus on Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI; instead, it concentrates on optimizing vehicular network communications using evolutionary and swarm intelligence algorithms.

---

## [Benchmarking Robustness of Contrastive Learning Models for Medical
  Image-Report Retrieval](https://arxiv.org/abs/2501.09134v1)
**arXiv ID:** 2501.09134v1

**Abstract:**
> Medical images and reports offer invaluable insights into patient health. The
heterogeneity and complexity of these data hinder effective analysis. To bridge
this gap, we investigate contrastive learning models for cross-domain
retrieval, which associates medical images with their corresponding clinical
reports. This study benchmarks the robustness of four state-of-the-art
contrastive learning models: CLIP, CXR-RePaiR, MedCLIP, and CXR-CLIP. We
introduce an occlusion retrieval task to evaluate model performance under
varying levels of image corruption. Our findings reveal that all evaluated
models are highly sensitive to out-of-distribution data, as evidenced by the
proportional decrease in performance with increasing occlusion levels. While
MedCLIP exhibits slightly more robustness, its overall performance remains
significantly behind CXR-CLIP and CXR-RePaiR. CLIP, trained on a
general-purpose dataset, struggles with medical image-report retrieval,
highlighting the importance of domain-specific training data. The evaluation of
this work suggests that more effort needs to be spent on improving the
robustness of these models. By addressing these limitations, we can develop
more reliable cross-domain retrieval models for medical applications.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on medical applications of AI, which is explicitly listed as a criterion for rejection, outweighing any potential alignment with other criteria such as experimental results, comparison with state-of-the-art, and methodology details.

---

## [Quantum-Enhanced Transformers for Robust Acoustic Scene Classification
  in IoT Environments](https://arxiv.org/abs/2501.09394v1)
**arXiv ID:** 2501.09394v1

**Abstract:**
> The proliferation of Internet of Things (IoT) devices equipped with acoustic
sensors necessitates robust acoustic scene classification (ASC) capabilities,
even in noisy and data-limited environments. Traditional machine learning
methods often struggle to generalize effectively under such conditions. To
address this, we introduce Q-ASC, a novel Quantum-Inspired Acoustic Scene
Classifier that leverages the power of quantum-inspired transformers. By
integrating quantum concepts like superposition and entanglement, Q-ASC
achieves superior feature learning and enhanced noise resilience compared to
classical models. Furthermore, we introduce a Quantum Variational Autoencoder
(QVAE) based data augmentation technique to mitigate the challenge of limited
labeled data in IoT deployments. Extensive evaluations on the Tampere
University of Technology (TUT) Acoustic Scenes 2016 benchmark dataset
demonstrate that Q-ASC achieves remarkable accuracy between 68.3% and 88.5%
under challenging conditions, outperforming state-of-the-art methods by over 5%
in the best case. This research paves the way for deploying intelligent
acoustic sensing in IoT networks, with potential applications in smart homes,
industrial monitoring, and environmental surveillance, even in adverse acoustic
environments.

**Decision Explanation:** Original decision: REJECT
The paper primarily focuses on acoustic scene classification in IoT environments using quantum-enhanced transformers, with no apparent connection to Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI, thus failing to meet the mandatory criteria (1, 2, and 3).

---

## [Metric Learning with Progressive Self-Distillation for Audio-Visual
  Embedding Learning](https://arxiv.org/abs/2501.09608v1)
**arXiv ID:** 2501.09608v1

**Abstract:**
> Metric learning projects samples into an embedded space, where similarities
and dissimilarities are quantified based on their learned representations.
However, existing methods often rely on label-guided representation learning,
where representations of different modalities, such as audio and visual data,
are aligned based on annotated labels. This approach tends to underutilize
latent complex features and potential relationships inherent in the
distributions of audio and visual data that are not directly tied to the
labels, resulting in suboptimal performance in audio-visual embedding learning.
To address this issue, we propose a novel architecture that integrates
cross-modal triplet loss with progressive self-distillation. Our method
enhances representation learning by leveraging inherent distributions and
dynamically refining soft audio-visual alignments -- probabilistic alignments
between audio and visual data that capture the inherent relationships beyond
explicit labels. Specifically, the model distills audio-visual
distribution-based knowledge from annotated labels in a subset of each batch.
This self-distilled knowledge is used t

**Decision Explanation:** Original decision: REJECT
The paper does not meet the required criteria as it focuses on audio-visual embedding learning with metric learning and self-distillation, without any indication of practical applications or involvement of Large Language Models (LLMs), knowledge graphs, retrieval-augmented generation (RAG), or agentic AI.

---