# Accepted Papers

## [LLMPC: Large Language Model Predictive Control](https://arxiv.org/abs/2501.02486v1)

**arXiv ID:** 2501.02486v1

**Abstract:**

> Recent advancements in prompting techniques for Large Language Models (LLMs) have improved their reasoning, planning, and action abilities. This paper examines these prompting techniques through the lens of model predictive control (MPC). We show that LLMs act as implicit planning cost function minimizers when planning prompts are used. Under our framework we demonstrate that LLM planning performance can be improved further by incorporating real planning cost functions and evaluators.
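
To make the MPC framing concrete, the following is a minimal, illustrative sketch of a receding-horizon prompting loop in the spirit the abstract describes. It is not code from the paper: `propose`, `cost`, and `apply_step` are hypothetical callables supplied by the caller (an LLM sampling wrapper, an explicit planning cost function, and an environment step, respectively).

```python
from typing import Callable, List

def mpc_prompt_loop(
    task: str,
    state: str,
    propose: Callable[[str, int], List[str]],  # hypothetical LLM wrapper: (prompt, n) -> n candidate plans
    cost: Callable[[str, str], float],         # explicit planning cost: (state, plan) -> cost
    apply_step: Callable[[str, str], str],     # environment/simulator: (state, step) -> next state
    horizon: int = 5,
    n_candidates: int = 4,
) -> List[str]:
    """MPC-style prompting: sample candidate plans, keep the cheapest, execute one step, re-plan."""
    executed: List[str] = []
    for _ in range(horizon):
        prompt = (
            f"Task: {task}\n"
            f"Current state: {state}\n"
            "Propose a short step-by-step plan to finish the task."
        )
        candidates = propose(prompt, n_candidates)                  # LLM acts as the plan sampler
        best = min(candidates, key=lambda plan: cost(state, plan))  # score with a real cost function
        first_step = best.splitlines()[0]                           # receding horizon: commit to one step only
        executed.append(first_step)
        state = apply_step(state, first_step)                       # observe the new state, then re-plan
    return executed
```

As in classical MPC, only the first action of the lowest-cost candidate plan is committed before the loop re-plans from the observed state.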

**Decision Explanation:**

Original response:

{
  "decision": "ACCEPT",
  "explanation": "The paper focuses primarily on optimizing LLM performance through the manipulation of textual input prompts (planning prompts), provides a framework (LLMPC) for systematic prompt development, and demonstrates its impact on LLM output, aligning with the core subject of prompt engineering for text-based interactions with LLMs."
}

---
# Rejected Papers

## [Co-Activation Graph Analysis of Safety-Verified and Explainable Deep Reinforcement Learning Policies](https://arxiv.org/abs/2501.03142v1)

**arXiv ID:** 2501.03142v1

**Abstract:**

> Deep reinforcement learning (RL) policies can demonstrate unsafe behaviors and are challenging to interpret. To address these challenges, we combine RL policy model checking--a technique for determining whether RL policies exhibit unsafe behaviors--with co-activation graph analysis--a method that maps neural network inner workings by analyzing neuron activation patterns--to gain insight into the safe RL policy's sequential decision-making. This combination lets us interpret the RL policy's inner workings for safe decision-making. We demonstrate its applicability in various experiments.

**Decision Explanation:**

Original decision: REJECT

The paper focuses primarily on deep reinforcement learning (RL) policies, model checking, and co-activation graph analysis, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thus failing to meet all 'MUST' criteria.

---

## [Turn-based Multi-Agent Reinforcement Learning Model Checking](https://arxiv.org/abs/2501.03187v1)

**arXiv ID:** 2501.03187v1

**Abstract:**

> In this paper, we propose a novel approach for verifying the compliance of turn-based multi-agent reinforcement learning (TMARL) agents with complex requirements in stochastic multiplayer games. Our method overcomes the limitations of existing verification approaches, which are inadequate for dealing with TMARL agents and not scalable to large games with multiple agents. Our approach relies on tight integration of TMARL and a verification technique referred to as model checking. We demonstrate the effectiveness and scalability of our technique through experiments in different types of environments. Our experiments show that our method is suited to verify TMARL agents and scales better than naive monolithic model checking.

**Decision Explanation:**

Original decision: REJECT

The paper primarily focuses on verifying compliance of turn-based multi-agent reinforcement learning agents, with no clear emphasis on prompt engineering for Large Language Models (LLMs), thus failing to meet the 'MUST' criteria of focusing on engineering, design, or optimization of prompts specifically for LLMs.

---

## [Neural Deconstruction Search for Vehicle Routing Problems](https://arxiv.org/abs/2501.03715v1)

**arXiv ID:** 2501.03715v1

**Abstract:**

> Autoregressive construction approaches generate solutions to vehicle routing problems in a step-by-step fashion, leading to high-quality solutions that are nearing the performance achieved by handcrafted, operations research techniques. In this work, we challenge the conventional paradigm of sequential solution construction and introduce an iterative search framework where solutions are instead deconstructed by a neural policy. Throughout the search, the neural policy collaborates with a simple greedy insertion algorithm to rebuild the deconstructed solutions. Our approach surpasses the performance of state-of-the-art operations research methods across three challenging vehicle routing problems of various problem sizes.

**Decision Explanation:**

Original decision: REJECT

The paper focuses on solving Vehicle Routing Problems using a neural search framework, which falls under the excluded categories (primarily concerned with applications other than text generation driven by LLMs, and specifically mentions automotive/self-driving/operations research subjects). It also does not meet the 'MUST' criteria of focusing primarily on the engineering, design, or optimization of prompts for Large Language Models (LLMs).

---

## [A completely uniform transformer for parity](https://arxiv.org/abs/2501.02535v1)

**arXiv ID:** 2501.02535v1

**Abstract:**

> We construct a 3-layer constant-dimension transformer, recognizing the parity language, where neither parameter matrices nor the positional encoding depend on the input length. This improves upon a construction of Chiang and Cholak who use a positional encoding, depending on the input length (but their construction has 2 layers).

**Decision Explanation:**

Original decision: REJECT

The paper focuses on constructing a uniform transformer for recognizing parity language, which aligns with developing new LLM architectures or training methods, violating the 'MUST NOT' criteria 1. Additionally, it does not demonstrate prompt engineering for text-based interactions with LLMs as its core subject.

---

## [Test-time Computing: from System-1 Thinking to System-2 Thinking](https://arxiv.org/abs/2501.02497v1)

**arXiv ID:** 2501.02497v1

**Abstract:**

> The remarkable performance of the o1 model in complex reasoning demonstrates that test-time computing scaling can further unlock the model's potential, enabling powerful System-2 thinking. However, there is still a lack of comprehensive surveys for test-time computing scaling. We trace the concept of test-time computing back to System-1 models. In System-1 models, test-time computing addresses distribution shifts and improves robustness and generalization through parameter updating, input modification, representation editing, and output calibration. In System-2 models, it enhances the model's reasoning ability to solve complex problems through repeated sampling, self-correction, and tree search. We organize this survey according to the trend of System-1 to System-2 thinking, highlighting the key role of test-time computing in the transition from System-1 models to weak System-2 models, and then to strong System-2 models. We also point out a few possible future directions.

**Decision Explanation:**

Original response:

{
  "decision": "REJECT",
  "explanation": "The paper does not meet the primary focus criteria as it discusses test-time computing scaling for enhancing model performance (pertaining to System-1 and System-2 thinking) without specifically addressing prompt engineering for Large Language Models (LLMs) or demonstrating the impact of textual input prompts on LLM output."
}

---

## [KG-CF: Knowledge Graph Completion with Context Filtering under the Guidance of Large Language Models](https://arxiv.org/abs/2501.02711v1)

**arXiv ID:** 2501.02711v1

**Abstract:**

> Large Language Models (LLMs) have shown impressive performance in various tasks, including knowledge graph completion (KGC). However, current studies mostly apply LLMs to classification tasks, like identifying missing triplets, rather than ranking-based tasks, where the model ranks candidate entities based on plausibility. This focus limits the practical use of LLMs in KGC, as real-world applications prioritize highly plausible triplets. Additionally, while graph paths can help infer the existence of missing triplets and improve completion accuracy, they often contain redundant information. To address these issues, we propose KG-CF, a framework tailored for ranking-based KGC tasks. KG-CF leverages LLMs' reasoning abilities to filter out irrelevant contexts, achieving superior results on real-world datasets. The code and datasets are available at https://anonymous.4open.science/r/KG-CF.

**Decision Explanation:**

Original response:

{
  "decision": "REJECT",
  "explanation": "The paper primarily focuses on leveraging LLMs for knowledge graph completion (ranking-based tasks) rather than on the engineering, design, or optimization of prompts specifically for LLMs. While LLMs are used, the central concern is the application (KGC) and not prompt engineering techniques, methods, or their impact on LLM output."
}

---

## [Artificial Intelligence in Creative Industries: Advances Prior to 2025](https://arxiv.org/abs/2501.02725v1)

**arXiv ID:** 2501.02725v1

**Abstract:**

> The rapid advancements in artificial intelligence (AI), particularly in generative AI and large language models (LLMs), have profoundly impacted the creative industries by enabling innovative content creation, enhancing workflows, and democratizing access to creative tools. This paper explores the significant technological shifts since our previous review in 2022, highlighting how these developments have expanded creative opportunities and efficiency. These technological advancements have enhanced the capabilities of text-to-image, text-to-video, and multimodal generation technologies. In particular, key breakthroughs in LLMs have established new benchmarks in conversational AI, while advancements in image generators have revolutionized content creation. We also discuss AI integration into post-production workflows, which has significantly accelerated and refined traditional processes. Despite these innovations, challenges remain, particularly for the media industry, due to the demands on communication traffic from creative content. We therefore include data compression and quality assessment in this paper. Furthermore, we highlight the trend toward unified AI frameworks capable of addressing multiple creative tasks and underscore the importance of human oversight to mitigate AI-generated inaccuracies. Finally, we explore AI's future potential in the creative sector, stressing the need to navigate emerging challenges to maximize its benefits while addressing associated risks.

**Decision Explanation:**

Original decision: REJECT

The paper's primary focus is on the broader impact of AI in creative industries, mentioning LLMs and generative AI as components, rather than specifically concentrating on prompt engineering for LLMs as required by the criteria.

---

## [Multi-Agent Path Finding under Limited Communication Range Constraint via Dynamic Leading](https://arxiv.org/abs/2501.02770v1)

**arXiv ID:** 2501.02770v1

**Abstract:**

> This paper proposes a novel framework to handle a multi-agent path finding problem under a limited communication range constraint, where all agents must have a connected communication channel to the rest of the team. Many existing approaches to multi-agent path finding (e.g., leader-follower platooning) overcome computational challenges of planning in this domain by planning one agent at a time in a fixed order. However, fixed leader-follower approaches can become stuck during planning, limiting their practical utility in dense-clutter environments. To overcome this limitation, we develop dynamic leading multi-agent path finding, which allows for dynamic reselection of the leading agent during path planning whenever progress cannot be made. The experiments show the efficiency of our framework, which can handle up to 25 agents with more than 90% success-rate across five environment types where baselines routinely fail.

**Decision Explanation:**

Original response:

{
  "decision": "REJECT",
  "explanation": "The paper focuses on multi-agent path finding under limited communication constraints, utilizing dynamic leading, and does not meet any of the 'MUST' criteria, particularly lacking primary focus on prompt engineering for Large Language Models (LLMs) and manipulation of textual input prompts to improve LLM performance."
}

---

## [SenseRAG: Constructing Environmental Knowledge Bases with Proactive Querying for LLM-Based Autonomous Driving](https://arxiv.org/abs/2501.03535v2)

**arXiv ID:** 2501.03535v2

**Abstract:**

> This study addresses the critical need for enhanced situational awareness in autonomous driving (AD) by leveraging the contextual reasoning capabilities of large language models (LLMs). Unlike traditional perception systems that rely on rigid, label-based annotations, it integrates real-time, multimodal sensor data into a unified, LLMs-readable knowledge base, enabling LLMs to dynamically understand and respond to complex driving environments. To overcome the inherent latency and modality limitations of LLMs, a proactive Retrieval-Augmented Generation (RAG) is designed for AD, combined with a chain-of-thought prompting mechanism, ensuring rapid and context-rich understanding. Experimental results using real-world Vehicle-to-everything (V2X) datasets demonstrate significant improvements in perception and prediction performance, highlighting the potential of this framework to enhance safety, adaptability, and decision-making in next-generation AD systems.
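
The combination of proactive retrieval and chain-of-thought prompting described above can be illustrated with a small sketch. This is not code from the paper; `retrieve` and `generate` are hypothetical callables standing in for a knowledge-base retriever and an LLM text-generation wrapper.

```python
from typing import Callable, List

def build_cot_rag_prompt(question: str, facts: List[str]) -> str:
    """Assemble a retrieval-augmented, chain-of-thought style prompt from sensor-derived facts."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return (
        "You are assisting an autonomous-driving planner.\n"
        f"Relevant sensor-derived facts:\n{context}\n\n"
        f"Question: {question}\n"
        "Reason step by step about the traffic situation, then state the final answer on its own line."
    )

def answer_with_proactive_rag(
    question: str,
    retrieve: Callable[[str, int], List[str]],  # hypothetical retriever over the environmental knowledge base
    generate: Callable[[str], str],             # hypothetical LLM text-generation wrapper
    top_k: int = 5,
) -> str:
    facts = retrieve(question, top_k)           # retrieve context before the LLM is prompted
    return generate(build_cot_rag_prompt(question, facts))
```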

**Decision Explanation:**

Original decision: REJECT

The paper primarily focuses on enhancing autonomous driving (AD) systems with LLMs, rather than prompt engineering for text-based interactions with LLMs. Although it mentions a 'chain-of-thought prompting mechanism', the core subject is the AD system's perception and prediction performance, not novel prompt engineering techniques for LLMs.

---

## [STContext: A Multifaceted Dataset for Developing Context-aware Spatio-temporal Crowd Mobility Prediction Models](https://arxiv.org/abs/2501.03583v1)

**arXiv ID:** 2501.03583v1

**Abstract:**

> In smart cities, context-aware spatio-temporal crowd flow prediction (STCFP) models leverage contextual features (e.g., weather) to identify unusual crowd mobility patterns and enhance prediction accuracy. However, the best practice for incorporating contextual features remains unclear due to inconsistent usage of contextual features in different papers. Developing a multifaceted dataset with rich types of contextual features and STCFP scenarios is crucial for establishing a principled context modeling paradigm. Existing open crowd flow datasets lack an adequate range of contextual features, which poses an urgent requirement to build a multifaceted dataset to fill these research gaps. To this end, we create STContext, a multifaceted dataset for developing context-aware STCFP models. Specifically, STContext provides nine spatio-temporal datasets across five STCFP scenarios and includes ten contextual features, including weather, air quality index, holidays, points of interest, road networks, etc. Besides, we propose a unified workflow for incorporating contextual features into deep STCFP methods, with steps including feature transformation, dependency modeling, representation fusion, and training strategies. Through extensive experiments, we have obtained several useful guidelines for effective context modeling and insights for future research. The STContext is open-sourced at https://github.com/Liyue-Chen/STContext.

**Decision Explanation:**

Original decision: REJECT

The paper does not focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs). Instead, it focuses on developing a multifaceted dataset for context-aware spatio-temporal crowd mobility prediction models, which falls outside the specified criteria.

---

## [Neural DNF-MT: A Neuro-symbolic Approach for Learning Interpretable and Editable Policies](https://arxiv.org/abs/2501.03888v1)

**arXiv ID:** 2501.03888v1

**Abstract:**

> Although deep reinforcement learning has been shown to be effective, the model's black-box nature presents barriers to direct policy interpretation. To address this problem, we propose a neuro-symbolic approach called neural DNF-MT for end-to-end policy learning. The differentiable nature of the neural DNF-MT model enables the use of deep actor-critic algorithms for training. At the same time, its architecture is designed so that trained models can be directly translated into interpretable policies expressed as standard (bivalent or probabilistic) logic programs. Moreover, additional layers can be included to extract abstract features from complex observations, acting as a form of predicate invention. The logic representations are highly interpretable, and we show how the bivalent representations of deterministic policies can be edited and incorporated back into a neural model, facilitating manual intervention and adaptation of learned policies. We evaluate our approach on a range of tasks requiring learning deterministic or stochastic behaviours from various forms of observations. Our empirical results show that our neural DNF-MT model performs at the level of competing black-box methods whilst providing interpretable policies.

**Decision Explanation:**

Original response:

{
  "decision": "REJECT",
  "explanation": "The paper focuses on developing a neuro-symbolic approach for interpretable policy learning in reinforcement learning, with no primary focus on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance."
}

---

## [PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides](https://arxiv.org/abs/2501.03936v1)

**arXiv ID:** 2501.03936v1

**Abstract:**

> Automatically generating presentations from documents is a challenging task that requires balancing content quality, visual design, and structural coherence. Existing methods primarily focus on improving and evaluating the content quality in isolation, often overlooking visual design and structural coherence, which limits their practical applicability. To address these limitations, we propose PPTAgent, which comprehensively improves presentation generation through a two-stage, edit-based approach inspired by human workflows. PPTAgent first analyzes reference presentations to understand their structural patterns and content schemas, then drafts outlines and generates slides through code actions to ensure consistency and alignment. To comprehensively evaluate the quality of generated presentations, we further introduce PPTEval, an evaluation framework that assesses presentations across three dimensions: Content, Design, and Coherence. Experiments show that PPTAgent significantly outperforms traditional automatic presentation generation methods across all three dimensions. The code and data are available at https://github.com/icip-cas/PPTAgent.

**Decision Explanation:**

Original response:

{
  "decision": "REJECT",
  "explanation": "The paper primarily focuses on generating presentations (image/video generation) and improving content quality, visual design, and structural coherence, rather than specifically engineering or optimizing text-based input prompts for Large Language Models (LLMs)."
}

---

## [Implementing Systemic Thinking for Automatic Schema Matching: An Agent-Based Modeling Approach](https://arxiv.org/abs/2501.04136v1)

**arXiv ID:** 2501.04136v1

**Abstract:**

> Several approaches are proposed to deal with the problem of the Automatic Schema Matching (ASM). The challenges and difficulties caused by the complexity and uncertainty characterizing both the process and the outcome of Schema Matching motivated us to investigate how bio-inspired emerging paradigm can help with understanding, managing, and ultimately overcoming those challenges. In this paper, we explain how we approached Automatic Schema Matching as a systemic and Complex Adaptive System (CAS) and how we modeled it using the approach of Agent-Based Modeling and Simulation (ABMS). This effort gives birth to a tool (prototype) for schema matching called Reflex-SMAS. A set of experiments demonstrates the viability of our approach on two main aspects: (i) effectiveness (increasing the quality of the found matchings) and (ii) efficiency (reducing the effort required for this efficiency). Our approach represents a significant paradigm-shift, in the field of Automatic Schema Matching.

**Decision Explanation:**

Original decision: REJECT

The paper does not meet the primary focus criteria, as it concentrates on Automatic Schema Matching using Agent-Based Modeling and Simulation, with no apparent emphasis on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts to improve LLM performance.

---

## [AI-Driven Reinvention of Hydrological Modeling for Accurate Predictions and Interpretation to Transform Earth System Modeling](https://arxiv.org/abs/2501.04733v1)

**arXiv ID:** 2501.04733v1

**Abstract:**

> Traditional equation-driven hydrological models often struggle to accurately predict streamflow in challenging regional Earth systems like the Tibetan Plateau, while hybrid and existing algorithm-driven models face difficulties in interpreting hydrological behaviors. This work introduces HydroTrace, an algorithm-driven, data-agnostic model that substantially outperforms these approaches, achieving a Nash-Sutcliffe Efficiency of 98% and demonstrating strong generalization on unseen data. Moreover, HydroTrace leverages advanced attention mechanisms to capture spatial-temporal variations and feature-specific impacts, enabling the quantification and spatial resolution of streamflow partitioning as well as the interpretation of hydrological behaviors such as glacier-snow-streamflow interactions and monsoon dynamics. Additionally, a large language model (LLM)-based application allows users to easily understand and apply HydroTrace's insights for practical purposes. These advancements position HydroTrace as a transformative tool in hydrological and broader Earth system modeling, offering enhanced prediction accuracy and interpretability.

**Decision Explanation:**

Original decision: REJECT

The paper's primary focus is on developing a new hydrological modeling approach (HydroTrace) and its application in Earth system modeling, with the Large Language Model (LLM) being a secondary component for interpreting insights, rather than the primary subject being prompt engineering for text-based interactions with LLMs.

---

## [Rethinking IDE Customization for Enhanced HAX: A Hyperdimensional Perspective](https://arxiv.org/abs/2501.02491v1)

**arXiv ID:** 2501.02491v1

**Abstract:**

> As Integrated Development Environments (IDEs) increasingly integrate Artificial Intelligence, Software Engineering faces both benefits like productivity gains and challenges like mismatched user preferences. We propose Hyper-Dimensional (HD) vector spaces to model Human-Computer Interaction, focusing on user actions, stylistic preferences, and project context. These contributions aim to inspire further research on applying HD computing in IDE design.

**Decision Explanation:**

Original decision: REJECT

The paper focuses on IDE customization using Hyperdimensional vector spaces for Human-Computer Interaction, with no primary emphasis on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.

---

## [TreeMatch: A Fully Unsupervised WSD System Using Dependency Knowledge on a Specific Domain](https://arxiv.org/abs/2501.02546v1)

**arXiv ID:** 2501.02546v1

**Abstract:**

> Word sense disambiguation (WSD) is one of the main challenges in Computational Linguistics. TreeMatch is a WSD system originally developed using data from SemEval 2007 Task 7 (Coarse-grained English All-words Task) that has been adapted for use in SemEval 2010 Task 17 (All-words Word Sense Disambiguation on a Specific Domain). The system is based on a fully unsupervised method using dependency knowledge drawn from a domain specific knowledge base that was built for this task. When evaluated on the task, the system precision performs above the Most Frequent Selection baseline.

**Decision Explanation:**

Original decision: REJECT

The paper focuses on Word Sense Disambiguation (WSD) using dependency knowledge and a domain-specific knowledge base, with no indication of primarily investigating, analyzing, or proposing methods for improving Large Language Model (LLM) performance through the manipulation of textual input prompts.

---

## [Efficient Architectures for High Resolution Vision-Language Models](https://arxiv.org/abs/2501.02584v1)

**arXiv ID:** 2501.02584v1

**Abstract:**

> Vision-Language Models (VLMs) have recently experienced significant advancements. However, challenges persist in the accurate recognition of fine details within high resolution images, which limits performance in multiple tasks. This work introduces Pheye, a novel architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs. Notably, Pheye achieves a high efficiency while maintaining strong performance, particularly in tasks that demand fine-grained image understanding and/or the handling of scene-text.

**Decision Explanation:**

Original decision: REJECT

The paper focuses primarily on the development of a novel architecture (Pheye) for Vision-Language Models (VLMs) to process high-resolution images, which aligns with excluded criteria: developing new LLM architectures and being concerned with applications of generative AI other than text generation driven by LLMs (specifically, vision-language tasks).

---

## [Enhancing Robot Route Optimization in Smart Logistics with Transformer and GNN Integration](https://arxiv.org/abs/2501.02749v1)

**arXiv ID:** 2501.02749v1

**Abstract:**

> This research delves into advanced route optimization for robots in smart logistics, leveraging a fusion of Transformer architectures, Graph Neural Networks (GNNs), and Generative Adversarial Networks (GANs). The approach utilizes a graph-based representation encompassing geographical data, cargo allocation, and robot dynamics, addressing both spatial and resource limitations to refine route efficiency. Through extensive testing with authentic logistics datasets, the proposed method achieves notable improvements, including a 15% reduction in travel distance, a 20% boost in time efficiency, and a 10% decrease in energy consumption. These findings highlight the algorithm's effectiveness, promoting enhanced performance in intelligent logistics operations.

**Decision Explanation:**

Original decision: REJECT

The paper focuses primarily on the integration of Transformer, GNN, and GAN for robot route optimization in smart logistics, with no apparent emphasis on prompt engineering, manipulation of textual input prompts, or the interaction with Large Language Models (LLMs) for text generation.

---

## [Key-value memory in the brain](https://arxiv.org/abs/2501.02950v1)

**arXiv ID:** 2501.02950v1

**Abstract:**

> Classical models of memory in psychology and neuroscience rely on similarity-based retrieval of stored patterns, where similarity is a function of retrieval cues and the stored patterns. While parsimonious, these models do not allow distinct representations for storage and retrieval, despite their distinct computational demands. Key-value memory systems, in contrast, distinguish representations used for storage (values) and those used for retrieval (keys). This allows key-value memory systems to optimize simultaneously for fidelity in storage and discriminability in retrieval. We review the computational foundations of key-value memory, its role in modern machine learning systems, related ideas from psychology and neuroscience, applications to a number of empirical puzzles, and possible biological implementations.
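
The separation between retrieval (keys) and storage (values) that the abstract emphasizes can be shown in a few lines of numpy. This is an illustrative sketch, not code from the paper.

```python
import numpy as np

def kv_retrieve(query: np.ndarray, keys: np.ndarray, values: np.ndarray, beta: float = 4.0) -> np.ndarray:
    """Soft key-value readout: match the query against keys, return a weighted blend of values."""
    scores = keys @ query                 # similarity is computed in key space only
    weights = np.exp(beta * scores)
    weights /= weights.sum()              # softmax over stored items
    return weights @ values               # the readout lives in value space

# Toy usage: keys can be optimized for discriminability, values for fidelity of the stored content.
rng = np.random.default_rng(0)
keys = rng.normal(size=(10, 8))           # 10 items, 8-dimensional retrieval representations
values = rng.normal(size=(10, 32))        # 10 items, 32-dimensional storage representations
recalled = kv_retrieve(keys[3] + 0.1 * rng.normal(size=8), keys, values)
```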

**Decision Explanation:**

Original decision: REJECT

The paper does not meet the 'MUST' criteria as it does not focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs), nor does it investigate methods for improving LLM performance through prompt manipulation. Instead, it discusses key-value memory systems in the context of psychology, neuroscience, and machine learning, with no apparent connection to LLM prompt engineering.

---

## [Putnam's Critical and Explanatory Tendencies Interpreted from a Machine Learning Perspective](https://arxiv.org/abs/2501.03026v1)

**arXiv ID:** 2501.03026v1

**Abstract:**

> Making sense of theory choice in normal and across extraordinary science is central to philosophy of science. The emergence of machine learning models has the potential to act as a wrench in the gears of current debates. In this paper, I will attempt to reconstruct the main movements that lead to and came out of Putnam's critical and explanatory tendency distinction, argue for the biconditional necessity of the tendencies, and conceptualize that wrench through a machine learning interpretation of my claim.

**Decision Explanation:**

Original decision: REJECT

The paper's primary focus is on interpreting philosophical concepts through a machine learning perspective, not specifically on the engineering, design, or optimization of prompts for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts to improve LLM performance.

---

## [GLiREL -- Generalist Model for Zero-Shot Relation Extraction](https://arxiv.org/abs/2501.03172v1)

**arXiv ID:** 2501.03172v1

**Abstract:**

> We introduce GLiREL (Generalist Lightweight model for zero-shot Relation Extraction), an efficient architecture and training paradigm for zero-shot relation classification. Inspired by recent advancements in zero-shot named entity recognition, this work presents an approach to efficiently and accurately predict zero-shot relationship labels between multiple entities in a single forward pass. Experiments using the FewRel and WikiZSL benchmarks demonstrate that our approach achieves state-of-the-art results on the zero-shot relation classification task. In addition, we contribute a protocol for synthetically-generating datasets with diverse relation labels.

**Decision Explanation:**

Original decision: REJECT

The paper focuses primarily on developing a new efficient architecture and training paradigm for zero-shot relation classification, rather than engineering, designing, or optimizing prompts for Large Language Models (LLMs) to improve their text generation performance.

---

## [Can LLMs Design Good Questions Based on Context?](https://arxiv.org/abs/2501.03491v1)

**arXiv ID:** 2501.03491v1

**Abstract:**

> This paper evaluates questions generated by LLMs from context, comparing them to human-generated questions across six dimensions. We introduce an automated LLM-based evaluation method, focusing on aspects like question length, type, context coverage, and answerability. Our findings highlight unique characteristics of LLM-generated questions, contributing insights that can support further research in question quality and downstream applications.

**Decision Explanation:**

Original decision: REJECT

The paper focuses on LLMs generating questions based on context, rather than the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) to improve their text generation performance through textual input manipulation.

---

## [Self-Adaptive ERP: Embedding NLP into Petri-Net creation and Model Matching](https://arxiv.org/abs/2501.03795v1)

**arXiv ID:** 2501.03795v1

**Abstract:**

> Enterprise Resource Planning (ERP) consultants play a vital role in customizing systems to meet specific business needs by processing large amounts of data and adapting functionalities. However, the process is resource-intensive, time-consuming, and requires continuous adjustments as business demands evolve. This research introduces a Self-Adaptive ERP Framework that automates customization using enterprise process models and system usage analysis. It leverages Artificial Intelligence (AI) & Natural Language Processing (NLP) for Petri nets to transform business processes into adaptable models, addressing both structural and functional matching. The framework, built using Design Science Research (DSR) and a Systematic Literature Review (SLR), reduces reliance on manual adjustments, improving ERP customization efficiency and accuracy while minimizing the need for consultants.

**Decision Explanation:**

Original decision: REJECT

The paper's primary focus is on automating ERP customization using AI and NLP for Petri-net creation and model matching, rather than specifically engineering, designing, or optimizing prompts for Large Language Models (LLMs). The mention of NLP is not centered around prompt engineering for LLMs, but rather for transforming business processes into adaptable models within an ERP framework.

---

## [Synthetic Data Privacy Metrics](https://arxiv.org/abs/2501.03941v1)

**arXiv ID:** 2501.03941v1

**Abstract:**

> Recent advancements in generative AI have made it possible to create synthetic datasets that can be as accurate as real-world data for training AI models, powering statistical insights, and fostering collaboration with sensitive datasets while offering strong privacy guarantees. Effectively measuring the empirical privacy of synthetic data is an important step in the process. However, while there is a multitude of new privacy metrics being published every day, there currently is no standardization. In this paper, we review the pros and cons of popular metrics that include simulations of adversarial attacks. We also review current best practices for amending generative models to enhance the privacy of the data they create (e.g. differential privacy).

**Decision Explanation:**

Original decision: REJECT

The paper focuses primarily on synthetic data privacy metrics, generative AI for dataset creation, and differential privacy, without addressing prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.

---

## [Traits of a Leader: User Influence Level Prediction through Sociolinguistic Modeling](https://arxiv.org/abs/2501.04046v1)

**arXiv ID:** 2501.04046v1

**Abstract:**

> Recognition of a user's influence level has attracted much attention as human interactions move online. Influential users have the ability to sway others' opinions to achieve some goals. As a result, predicting users' level of influence can help to understand social networks, forecast trends, prevent misinformation, etc. However, predicting user influence is a challenging problem because the concept of influence is specific to a situation or a domain, and user communications are limited to text. In this work, we define user influence level as a function of community endorsement and develop a model that significantly outperforms the baseline by leveraging demographic and personality data. This approach consistently improves RankDCG scores across eight different domains.

**Decision Explanation:**

Original decision: REJECT

The paper focuses on sociolinguistic modeling for predicting user influence levels in online interactions, without primarily addressing the engineering, design, or optimization of prompts for Large Language Models (LLMs) or demonstrating the manipulation of textual input prompts to improve LLM performance.

---

## [IntegrityAI at GenAI Detection Task 2: Detecting Machine-Generated Academic Essays in English and Arabic Using ELECTRA and Stylometry](https://arxiv.org/abs/2501.05476v1)

**arXiv ID:** 2501.05476v1

**Abstract:**

> Recent research has investigated the problem of detecting machine-generated essays for academic purposes. To address this challenge, this research utilizes pre-trained, transformer-based models fine-tuned on Arabic and English academic essays with stylometric features. Custom models based on ELECTRA for English and AraELECTRA for Arabic were trained and evaluated using a benchmark dataset. Proposed models achieved excellent results with an F1-score of 99.7%, ranking 2nd among of 26 teams in the English subtask, and 98.4%, finishing 1st out of 23 teams in the Arabic one.

**Decision Explanation:**

Original response:

{
  "decision": "REJECT",
  "explanation": "The paper primarily focuses on developing and fine-tuning transformer-based models (ELECTRA, AraELECTRA) for detecting machine-generated essays, rather than engineering or optimizing prompts for Large Language Models (LLMs). Prompt engineering is not the central concern of this research."
}

---

## [Fairness Through Matching](https://arxiv.org/abs/2501.02793v1)

**arXiv ID:** 2501.02793v1

**Abstract:**

> Group fairness requires that different protected groups, characterized by a given sensitive attribute, receive equal outcomes overall. Typically, the level of group fairness is measured by the statistical gap between predictions from different protected groups. In this study, we reveal an implicit property of existing group fairness measures, which provides an insight into how the group-fair models behave. Then, we develop a new group-fair constraint based on this implicit property to learn group-fair models. To do so, we first introduce a notable theoretical observation: every group-fair model has an implicitly corresponding transport map between the input spaces of each protected group. Based on this observation, we introduce a new group fairness measure termed Matched Demographic Parity (MDP), which quantifies the averaged gap between predictions of two individuals (from different protected groups) matched by a given transport map. Then, we prove that any transport map can be used in MDP to learn group-fair models, and develop a novel algorithm called Fairness Through Matching (FTM), which learns a group-fair model using MDP constraint with an user-specified transport map. We specifically propose two favorable types of transport maps for MDP, based on the optimal transport theory, and discuss their advantages. Experiments reveal that FTM successfully trains group-fair models with certain desirable properties by choosing the transport map accordingly.

**Decision Explanation:**

Original decision: REJECT

The paper does not focus on prompt engineering for Large Language Models (LLMs), instead concentrating on developing a new group-fair constraint and algorithm for learning group-fair models, with no mention of LLMs, prompt design, or textual input manipulation.

---

## [CALM: Curiosity-Driven Auditing for Large Language Models](https://arxiv.org/abs/2501.02997v1)

**arXiv ID:** 2501.02997v1

**Abstract:**

> Auditing Large Language Models (LLMs) is a crucial and challenging task. In this study, we focus on auditing black-box LLMs without access to their parameters, only to the provided service. We treat this type of auditing as a black-box optimization problem where the goal is to automatically uncover input-output pairs of the target LLMs that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input that the target LLM responds to with a toxic output or an input that induces the hallucinative response from the target LLM containing politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the large search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to finetune an LLM as the auditor agent to uncover potential harmful and biased input-output pairs of the target LLM. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names under the black-box setting. This work offers a promising direction for auditing black-box LLMs. Our code is available at https://github.com/x-zheng16/CALM.git.
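
As a rough illustration of the black-box auditing loop sketched in the abstract, the following replaces the paper's curiosity-driven reinforcement learning with plain random search over mutated prompts; `mutate`, `target_llm`, and `toxicity` are hypothetical callables, and the snippet is not the CALM implementation.

```python
from typing import Callable, List, Tuple

def audit_black_box_llm(
    seed_prompts: List[str],
    mutate: Callable[[str], str],        # hypothetical prompt mutator (paraphrase, word swap, ...)
    target_llm: Callable[[str], str],    # black-box text-generation service under audit
    toxicity: Callable[[str], float],    # external scorer; higher means more harmful text
    budget: int = 100,
    threshold: float = 0.8,
) -> List[Tuple[str, str]]:
    """Random-search stand-in for the auditing loop: find benign-looking inputs that elicit harmful outputs."""
    findings: List[Tuple[str, str]] = []
    frontier = list(seed_prompts)
    for step in range(budget):
        prompt = mutate(frontier[step % len(frontier)])
        reply = target_llm(prompt)
        if toxicity(prompt) < threshold and toxicity(reply) >= threshold:
            findings.append((prompt, reply))   # non-toxic input, toxic output: an audit finding
            frontier.append(prompt)            # keep promising prompts to mutate further
    return findings
```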

**Decision Explanation:**

Original decision: REJECT

The paper primarily focuses on auditing Large Language Models (LLMs) using a reinforcement learning approach, rather than on the engineering, design, or optimization of prompts for improving LLM performance through textual input manipulation.

---

## [Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment](https://arxiv.org/abs/2501.03012v1)

**arXiv ID:** 2501.03012v1

**Abstract:**

> Multimodal LLMs have reached remarkable levels of proficiency in understanding multimodal inputs, driving extensive research to develop increasingly powerful models. However, much less attention has been paid to understanding and explaining the underlying mechanisms of these models. Most existing explainability research examines these models only in their final states, overlooking the dynamic representational shifts that occur during training. In this work, we systematically analyze the evolution of hidden state representations to reveal how fine-tuning alters the internal structure of a model to specialize in new multimodal tasks. Using a concept-based approach, we map hidden states to interpretable visual and textual concepts, enabling us to trace changes in encoded concepts across modalities as training progresses. We also demonstrate the use of shift vectors to capture these concepts changes. These shift vectors allow us to recover fine-tuned concepts by shifting those in the original model. Finally, we explore the practical impact of our findings on model steering, showing that we can adjust multimodal LLMs behaviors without any training, such as modifying answer types, captions style, or biasing the model toward specific responses. Our work sheds light on how multimodal representations evolve through fine-tuning and offers a new perspective for interpreting model adaptation in multimodal tasks. The code for this project is publicly available at https://github.com/mshukor/xl-vlms.

**Decision Explanation:**

Original response:

{
  "decision": "REJECT",
  "explanation": "The paper primarily focuses on analyzing and understanding the internal representation shifts of multimodal LLMs during fine-tuning, rather than engineering, designing, or optimizing prompts specifically for LLMs, thus failing to meet the core 'MUST' criteria."
}

---

## [Large language models for artificial general intelligence (AGI): A survey of foundational principles and approaches](https://arxiv.org/abs/2501.03151v1)

**arXiv ID:** 2501.03151v1

**Abstract:**

> Generative artificial intelligence (AI) systems based on large-scale pretrained foundation models (PFMs) such as vision-language models, large language models (LLMs), diffusion models and vision-language-action (VLA) models have demonstrated the ability to solve complex and truly non-trivial AI problems in a wide variety of domains and contexts. Multimodal large language models (MLLMs), in particular, learn from vast and diverse data sources, allowing rich and nuanced representations of the world and, thereby, providing extensive capabilities, including the ability to reason, engage in meaningful dialog; collaborate with humans and other agents to jointly solve complex problems; and understand social and emotional aspects of humans. Despite this impressive feat, the cognitive abilities of state-of-the-art LLMs trained on large-scale datasets are still superficial and brittle. Consequently, generic LLMs are severely limited in their generalist capabilities. A number of foundational problems -- embodiment, symbol grounding, causality and memory -- are required to be addressed for LLMs to attain human-level general intelligence. These concepts are more aligned with human cognition and provide LLMs with inherent human-like cognitive properties that support the realization of physically-plausible, semantically meaningful, flexible and more generalizable knowledge and intelligence. In this work, we discuss the aforementioned foundational issues and survey state-of-the art approaches for implementing these concepts in LLMs. Specifically, we discuss how the principles of embodiment, symbol grounding, causality and memory can be leveraged toward the attainment of artificial general intelligence (AGI) in an organic manner.

**Decision Explanation:**

Original decision: REJECT

The paper's primary focus is on foundational principles and approaches for achieving Artificial General Intelligence (AGI) with Large Language Models (LLMs), rather than specifically on the engineering, design, or optimization of prompts for LLMs. It lacks concrete examples of prompts and their impact on LLM output, which is a required criterion.

---

## [From Aleatoric to Epistemic: Exploring Uncertainty Quantification Techniques in Artificial Intelligence](https://arxiv.org/abs/2501.03282v1)

**arXiv ID:** 2501.03282v1

**Abstract:**

> Uncertainty quantification (UQ) is a critical aspect of artificial intelligence (AI) systems, particularly in high-risk domains such as healthcare, autonomous systems, and financial technology, where decision-making processes must account for uncertainty. This review explores the evolution of uncertainty quantification techniques in AI, distinguishing between aleatoric and epistemic uncertainties, and discusses the mathematical foundations and methods used to quantify these uncertainties. We provide an overview of advanced techniques, including probabilistic methods, ensemble learning, sampling-based approaches, and generative models, while also highlighting hybrid approaches that integrate domain-specific knowledge. Furthermore, we examine the diverse applications of UQ across various fields, emphasizing its impact on decision-making, predictive accuracy, and system robustness. The review also addresses key challenges such as scalability, efficiency, and integration with explainable AI, and outlines future directions for research in this rapidly developing area. Through this comprehensive survey, we aim to provide a deeper understanding of UQ's role in enhancing the reliability, safety, and trustworthiness of AI systems.

**Decision Explanation:**

Original decision: REJECT

The paper does not meet the 'MUST' criteria as it focuses primarily on uncertainty quantification techniques in AI across various high-risk domains, rather than specifically on the engineering, design, or optimization of prompts for Large Language Models (LLMs), and does not provide concrete examples of prompts impacting LLM output.

---

## [Online Reinforcement Learning-Based Dynamic Adaptive Evaluation Function for Real-Time Strategy Tasks](https://arxiv.org/abs/2501.03824v1)

**arXiv ID:** 2501.03824v1

**Abstract:**

> Effective evaluation of real-time strategy tasks requires adaptive mechanisms to cope with dynamic and unpredictable environments. This study proposes a method to improve evaluation functions for real-time responsiveness to battlefield situation changes, utilizing an online reinforcement learning-based dynamic weight adjustment mechanism within the real-time strategy game. Building on traditional static evaluation functions, the method employs gradient descent in online reinforcement learning to update weights dynamically, incorporating weight decay techniques to ensure stability. Additionally, the AdamW optimizer is integrated to adjust the learning rate and decay rate of online reinforcement learning in real time, further reducing the dependency on manual parameter tuning. Round-robin competition experiments demonstrate that this method significantly enhances the application effectiveness of the Lanchester combat model evaluation function, Simple evaluation function, and Simple Sqrt evaluation function in planning algorithms including IDABCD, IDRTMinimax, and Portfolio AI. The method achieves a notable improvement in scores, with the enhancement becoming more pronounced as the map size increases. Furthermore, the increase in evaluation function computation time induced by this method is kept below 6% for all evaluation functions and planning algorithms. The proposed dynamic adaptive evaluation function demonstrates a promising approach for real-time strategy task evaluation.
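
For intuition about the online weight adjustment described above, the following is a stripped-down stand-in: a single gradient step with decoupled weight decay on the evaluation-function weights, rather than the AdamW-based update the abstract mentions. All names are hypothetical.

```python
from typing import List

def online_weight_update(weights: List[float], grads: List[float],
                         lr: float = 1e-3, weight_decay: float = 1e-4) -> List[float]:
    """One online update of evaluation-function weights: gradient step plus decoupled weight decay."""
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]
```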

**Decision Explanation:**

Original decision: REJECT

The paper's primary focus is on improving evaluation functions for real-time strategy tasks using online reinforcement learning, with no emphasis on prompt engineering, design, or optimization for Large Language Models (LLMs), thus failing to meet the 'MUST' criteria.

---

## [Dolphin: Closed-loop Open-ended Auto-research through Thinking, Practice, and Feedback](https://arxiv.org/abs/2501.03916v2)

**arXiv ID:** 2501.03916v2

**Abstract:**

> The scientific research paradigm is undergoing a profound transformation owing to the development of Artificial Intelligence (AI). Recent works demonstrate that various AI-assisted research methods can largely improve research efficiency by improving data analysis, accelerating computation, and fostering novel idea generation. To further move towards the ultimate goal (i.e., automatic scientific research), in this paper, we propose Dolphin, the first closed-loop open-ended auto-research framework to further build the entire process of human scientific research. Dolphin can generate research ideas, perform experiments, and get feedback from experimental results to generate higher-quality ideas. More specifically, Dolphin first generates novel ideas based on relevant papers which are ranked by the topic and task attributes. Then, the codes are automatically generated and debugged with the exception-traceback-guided local code structure. Finally, Dolphin automatically analyzes the results of each idea and feeds the results back to the next round of idea generation. Experiments are conducted on the benchmark datasets of different topics and results show that Dolphin can generate novel ideas continuously and complete the experiment in a loop. We highlight that Dolphin can automatically propose methods that are comparable to the state-of-the-art in some tasks such as 2D image classification and 3D point classification.

**Decision Explanation:**

Original decision: REJECT

The paper's primary focus is on developing an auto-research framework using AI, with Large Language Models (LLMs) being only one of the potential tools used for generating research ideas, rather than the central focus on prompt engineering for text-based interactions with LLMs as required.

---

## [Interpretable Neural ODEs for Gene Regulatory Network Discovery under Perturbations](https://arxiv.org/abs/2501.02409v1)

**arXiv ID:** 2501.02409v1

**Abstract:**

> Modern high-throughput biological datasets with thousands of perturbations provide the opportunity for large-scale discovery of causal graphs that represent the regulatory interactions between genes. Numerous methods have been proposed to infer a directed acyclic graph (DAG) corresponding to the underlying gene regulatory network (GRN) that captures causal gene relationships. However, existing models have restrictive assumptions (e.g. linearity, acyclicity), limited scalability, and/or fail to address the dynamic nature of biological processes such as cellular differentiation. We propose PerturbODE, a novel framework that incorporates biologically informative neural ordinary differential equations (neural ODEs) to model cell state trajectories under perturbations and derive the causal GRN from the neural ODE's parameters. We demonstrate PerturbODE's efficacy in trajectory prediction and GRN inference across simulated and real over-expression datasets.

**Decision Explanation:**

Original decision: REJECT

The paper does not meet any of the 'MUST' criteria, as it focuses on gene regulatory network discovery using neural ODEs, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, falling outside the specified scope.

---
|
|
|
|
## [A Statistical Hypothesis Testing Framework for Data Misappropriation
Detection in Large Language Models](https://arxiv.org/abs/http://arxiv.org/abs/2501.02441v1)
**arXiv ID:** http://arxiv.org/abs/2501.02441v1

**Abstract:**
> Large Language Models (LLMs) are rapidly gaining enormous popularity in
> recent years. However, the training of LLMs has raised significant privacy and
> legal concerns, particularly regarding the inclusion of copyrighted materials
> in their training data without proper attribution or licensing, which falls
> under the broader issue of data misappropriation. In this article, we focus on
> a specific problem of data misappropriation detection, namely, to determine
> whether a given LLM has incorporated data generated by another LLM. To address
> this issue, we propose embedding watermarks into the copyrighted training data
> and formulating the detection of data misappropriation as a hypothesis testing
> problem. We develop a general statistical testing framework, construct a
> pivotal statistic, determine the optimal rejection threshold, and explicitly
> control the type I and type II errors. Furthermore, we establish the asymptotic
> optimality properties of the proposed tests, and demonstrate its empirical
> effectiveness through intensive numerical experiments.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper primarily focuses on detecting data misappropriation in LLMs through statistical hypothesis testing, with no emphasis on prompt engineering, design, or optimization for improving LLM performance through textual input prompts."
}

---

## [RTLMarker: Protecting LLM-Generated RTL Copyright via a Hardware
Watermarking Framework](https://arxiv.org/abs/http://arxiv.org/abs/2501.02446v1)
**arXiv ID:** http://arxiv.org/abs/2501.02446v1

**Abstract:**
> Recent advances of large language models in the field of Verilog generation
> have raised several ethical and security concerns, such as code copyright
> protection and dissemination of malicious code. Researchers have employed
> watermarking techniques to identify codes generated by large language models.
> However, the existing watermarking works fail to protect RTL code copyright due
> to the significant syntactic and semantic differences between RTL code and
> software code in languages such as Python. This paper proposes a hardware
> watermarking framework RTLMarker that embeds watermarks into RTL code and
> deeper into the synthesized netlist. We propose a set of rule-based Verilog
> code transformations, ensuring the watermarked RTL code's syntactic and
> semantic correctness. In addition, we consider an inherent tradeoff between
> watermark transparency and watermark effectiveness and jointly optimize them.
> The results demonstrate RTLMarker's superiority over the baseline in RTL code
> watermarking.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper primarily focuses on a hardware watermarking framework (RTLMarker) for protecting RTL code copyright generated by LLMs, rather than exploring prompt engineering techniques for improving LLM performance through textual input manipulation."
}

---

## [Enhancing Contrastive Learning for Retinal Imaging via Adjusted
Augmentation Scales](https://arxiv.org/abs/http://arxiv.org/abs/2501.02451v1)
**arXiv ID:** http://arxiv.org/abs/2501.02451v1

**Abstract:**
> Contrastive learning, a prominent approach within self-supervised learning,
> has demonstrated significant effectiveness in developing generalizable models
> for various applications involving natural images. However, recent research
> indicates that these successes do not necessarily extend to the medical imaging
> domain. In this paper, we investigate the reasons for this suboptimal
> performance and hypothesize that the dense distribution of medical images poses
> challenges to the pretext tasks in contrastive learning, particularly in
> constructing positive and negative pairs. We explore model performance under
> different augmentation strategies and compare the results to those achieved
> with strong augmentations. Our study includes six publicly available datasets
> covering multiple clinically relevant tasks. We further assess the model's
> generalizability through external evaluations. The model pre-trained with weak
> augmentation outperforms those with strong augmentation, improving AUROC from
> 0.838 to 0.848 and AUPR from 0.523 to 0.597 on MESSIDOR2, and showing similar
> enhancements across other datasets. Our findings suggest that optimizing the
> scale of augmentation is critical for enhancing the efficacy of contrastive
> learning in medical imaging.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper primarily focuses on enhancing contrastive learning for medical imaging (retinal imaging), which violates the 'MUST NOT' criteria: being primarily concerned with medical subjects and not focusing on prompt engineering for Large Language Models (LLMs) or text generation driven by LLMs."
}

---

## [Hengqin-RA-v1: Advanced Large Language Model for Diagnosis and Treatment
of Rheumatoid Arthritis with Dataset based Traditional Chinese Medicine](https://arxiv.org/abs/http://arxiv.org/abs/2501.02471v1)
**arXiv ID:** http://arxiv.org/abs/2501.02471v1

**Abstract:**
> Large language models (LLMs) primarily trained on English texts, often face
> biases and inaccuracies in Chinese contexts. Their limitations are pronounced
> in fields like Traditional Chinese Medicine (TCM), where cultural and clinical
> subtleties are vital, further hindered by a lack of domain-specific data, such
> as rheumatoid arthritis (RA). To address these issues, this paper introduces
> Hengqin-RA-v1, the first large language model specifically tailored for TCM
> with a focus on diagnosing and treating RA. We also present HQ-GCM-RA-C1, a
> comprehensive RA-specific dataset curated from ancient Chinese medical
> literature, classical texts, and modern clinical studies. This dataset empowers
> Hengqin-RA-v1 to deliver accurate and culturally informed responses,
> effectively bridging the gaps left by general-purpose models. Extensive
> experiments demonstrate that Hengqin-RA-v1 outperforms state-of-the-art models,
> even surpassing the diagnostic accuracy of TCM practitioners in certain cases.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on developing a new Large Language Model architecture (Hengqin-RA-v1) specifically tailored for Traditional Chinese Medicine, rather than prompt engineering for existing LLMs, and introduces a new dataset for training, which violates 'MUST NOT' criterion 1.

---

## [The Meta-Representation Hypothesis](https://arxiv.org/abs/http://arxiv.org/abs/2501.02481v1)
**arXiv ID:** http://arxiv.org/abs/2501.02481v1

**Abstract:**
> Humans rely on high-level meta-representations to engage in abstract
> reasoning. In complex cognitive tasks, these meta-representations help
> individuals abstract general rules from experience. However, constructing such
> meta-representations from high-dimensional observations remains a longstanding
> challenge for reinforcement learning agents. For instance, a well-trained agent
> often fails to generalize to even minor variations of the same task, such as
> changes in background color, while humans can easily handle. In this paper, we
> build a bridge between meta-representation and generalization, showing that
> generalization performance benefits from meta-representation learning. We also
> hypothesize that deep mutual learning (DML) among agents can help them converge
> to meta-representations. Empirical results provide support for our theory and
> hypothesis. Overall, this work provides a new perspective on the generalization
> of deep reinforcement learning.

**Decision Explanation:**
Original decision: REJECT
The paper does not meet the 'MUST' criteria as it primarily focuses on reinforcement learning, meta-representation, and generalization in deep learning, with no clear emphasis on prompt engineering, design, or optimization specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts for improving LLM performance.

---

## [Watch Video, Catch Keyword: Context-aware Keyword Attention for Moment
Retrieval and Highlight Detection](https://arxiv.org/abs/http://arxiv.org/abs/2501.02504v1)
**arXiv ID:** http://arxiv.org/abs/2501.02504v1

**Abstract:**
> The goal of video moment retrieval and highlight detection is to identify
> specific segments and highlights based on a given text query. With the rapid
> growth of video content and the overlap between these tasks, recent works have
> addressed both simultaneously. However, they still struggle to fully capture
> the overall video context, making it challenging to determine which words are
> most relevant. In this paper, we present a novel Video Context-aware Keyword
> Attention module that overcomes this limitation by capturing keyword variation
> within the context of the entire video. To achieve this, we introduce a video
> context clustering module that provides concise representations of the overall
> video context, thereby enhancing the understanding of keyword dynamics.
> Furthermore, we propose a keyword weight detection module with keyword-aware
> contrastive learning that incorporates keyword information to enhance
> fine-grained alignment between visual and textual features. Extensive
> experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate
> that our proposed method significantly improves performance in moment retrieval
> and highlight detection tasks compared to existing approaches. Our code is
> available at: https://github.com/VisualAIKHU/Keyword-DETR

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper primarily focuses on video moment retrieval and highlight detection, utilizing a novel Video Context-aware Keyword Attention module, with no evident focus on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance."
}

---

## [Remote Inference over Dynamic Links via Adaptive Rate Deep Task-Oriented
Vector Quantization](https://arxiv.org/abs/http://arxiv.org/abs/2501.02521v1)
**arXiv ID:** http://arxiv.org/abs/2501.02521v1

**Abstract:**
> A broad range of technologies rely on remote inference, wherein data acquired
> is conveyed over a communication channel for inference in a remote server.
> Communication between the participating entities is often carried out over
> rate-limited channels, necessitating data compression for reducing latency.
> While deep learning facilitates joint design of the compression mapping along
> with encoding and inference rules, existing learned compression mechanisms are
> static, and struggle in adapting their resolution to changes in channel
> conditions and to dynamic links. To address this, we propose Adaptive Rate
> Task-Oriented Vector Quantization (ARTOVeQ), a learned compression mechanism
> that is tailored for remote inference over dynamic links. ARTOVeQ is based on
> designing nested codebooks along with a learning algorithm employing
> progressive learning. We show that ARTOVeQ extends to support low-latency
> inference that is gradually refined via successive refinement principles, and
> that it enables the simultaneous usage of multiple resolutions when conveying
> high-dimensional data. Numerical results demonstrate that the proposed scheme
> yields remote deep inference that operates with multiple rates, supports a
> broad range of bit budgets, and facilitates rapid inference that gradually
> improves with more bits exchanged, while approaching the performance of
> single-rate deep quantization methods.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on adaptive rate deep task-oriented vector quantization for remote inference over dynamic links, without any primary concern for prompt engineering, Large Language Models (LLMs), or the manipulation of textual input prompts to improve LLM performance.

---

## [Face-MakeUp: Multimodal Facial Prompts for Text-to-Image Generation](https://arxiv.org/abs/http://arxiv.org/abs/2501.02523v1)
**arXiv ID:** http://arxiv.org/abs/2501.02523v1

**Abstract:**
> Facial images have extensive practical applications. Although the current
> large-scale text-image diffusion models exhibit strong generation capabilities,
> it is challenging to generate the desired facial images using only text prompt.
> Image prompts are a logical choice. However, current methods of this type
> generally focus on general domain. In this paper, we aim to optimize image
> makeup techniques to generate the desired facial images. Specifically, (1) we
> built a dataset of 4 million high-quality face image-text pairs
> (FaceCaptionHQ-4M) based on LAION-Face to train our Face-MakeUp model; (2) to
> maintain consistency with the reference facial image, we extract/learn
> multi-scale content features and pose features for the facial image,
> integrating these into the diffusion model to enhance the preservation of
> facial identity features for diffusion models. Validation on two face-related
> test datasets demonstrates that our Face-MakeUp can achieve the best
> comprehensive performance. All codes are available
> at: https://github.com/ddw2AIGROUP2CQUPT/Face-MakeUp

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper primarily focuses on optimizing image makeup techniques for text-to-image generation, which falls under image generation driven by non-text-based interactions with generative models, violating 'MUST NOT' criterion 2, and does not meet the core subject requirement of prompt engineering for text-based interactions with Large Language Models (LLMs)."
}

---

## [AMM: Adaptive Modularized Reinforcement Model for Multi-city Traffic
Signal Control](https://arxiv.org/abs/http://arxiv.org/abs/2501.02548v1)
**arXiv ID:** http://arxiv.org/abs/2501.02548v1

**Abstract:**
> Traffic signal control (TSC) is an important and widely studied direction.
> Recently, reinforcement learning (RL) methods have been used to solve TSC
> problems and achieve superior performance over conventional TSC methods.
> However, applying RL methods to the real world is challenging due to the huge
> cost of experiments in real-world traffic environments. One possible solution
> is TSC domain adaptation, which adapts trained models to target environments
> and reduces the number of interactions and the training cost. However, existing
> TSC domain adaptation methods still face two major issues: the lack of
> consideration for differences across cities and the low utilization of
> multi-city data.
> To solve aforementioned issues, we propose an approach named Adaptive
> Modularized Model (AMM). By modularizing TSC problems and network models, we
> overcome the challenge of possible changes in environmental observations. We
> also aggregate multi-city experience through meta-learning. We conduct
> extensive experiments on different cities and show that AMM can achieve
> excellent performance with limited interactions in target environments and
> outperform existing methods. We also demonstrate the feasibility and
> generalizability of our method.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on an Adaptive Modularized Reinforcement Model for multi-city traffic signal control, which does not meet the 'MUST' criteria of primarily focusing on engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate improving LLM performance through textual input prompt manipulation.

---

## [KM-UNet KAN Mamba UNet for medical image segmentation](https://arxiv.org/abs/http://arxiv.org/abs/2501.02559v1)
**arXiv ID:** http://arxiv.org/abs/2501.02559v1

**Abstract:**
> Medical image segmentation is a critical task in medical imaging analysis.
> Traditional CNN-based methods struggle with modeling long-range dependencies,
> while Transformer-based models, despite their success, suffer from quadratic
> computational complexity. To address these limitations, we propose KM-UNet, a
> novel U-shaped network architecture that combines the strengths of
> Kolmogorov-Arnold Networks (KANs) and state-space models (SSMs). KM-UNet
> leverages the Kolmogorov-Arnold representation theorem for efficient feature
> representation and SSMs for scalable long-range modeling, achieving a balance
> between accuracy and computational efficiency. We evaluate KM-UNet on five
> benchmark datasets: ISIC17, ISIC18, CVC, BUSI, and GLAS. Experimental results
> demonstrate that KM-UNet achieves competitive performance compared to
> state-of-the-art methods in medical image segmentation tasks. To the best of
> our knowledge, KM-UNet is the first medical image segmentation framework
> integrating KANs and SSMs. This work provides a valuable baseline and new
> insights for the development of more efficient and interpretable medical image
> segmentation systems. The code is open source at
> https://github.com/2760613195/KM_UNet
> Keywords: KAN, Mamba, state-space models, UNet, Medical image segmentation, Deep
> learning

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on medical image segmentation using a novel U-shaped network architecture, which falls under excluded subjects (medical) and does not meet the 'MUST' criteria of focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs).

---

## [Decoding fMRI Data into Captions using Prefix Language Modeling](https://arxiv.org/abs/http://arxiv.org/abs/2501.02570v1)
**arXiv ID:** http://arxiv.org/abs/2501.02570v1

**Abstract:**
> With the advancements in Large Language and Latent Diffusion models, brain
> decoding has achieved remarkable results in recent years. The works on the NSD
> dataset, with stimuli images from the COCO dataset, leverage the embeddings
> from the CLIP model for image reconstruction and GIT for captioning. However,
> the current captioning approach introduces the challenge of potential data
> contamination given that the GIT model was trained on the COCO dataset. In this
> work, we present an alternative method for decoding brain signals into image
> captions by predicting a DINOv2 model's embedding of an image from the
> corresponding fMRI signal and then providing its [CLS] token as the prefix to
> the GPT-2 language model which decreases computational requirements
> considerably. Additionally, instead of commonly used Linear Regression, we
> explore 3D Convolutional Neural Network mapping of fMRI signals to image
> embedding space for better accounting positional information of voxels.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on decoding fMRI data into image captions using a combination of computer vision and language models, rather than specifically engineering or optimizing prompts for Large Language Models (LLMs). The use of a GPT-2 model with a predicted prefix does not constitute a primary focus on prompt engineering for text-based interactions with LLMs.

---

## [Evolving Skeletons: Motion Dynamics in Action Recognition](https://arxiv.org/abs/http://arxiv.org/abs/2501.02593v1)
**arXiv ID:** http://arxiv.org/abs/2501.02593v1

**Abstract:**
> Skeleton-based action recognition has gained significant attention for its
> ability to efficiently represent spatiotemporal information in a lightweight
> format. Most existing approaches use graph-based models to process skeleton
> sequences, where each pose is represented as a skeletal graph structured around
> human physical connectivity. Among these, the Spatiotemporal Graph
> Convolutional Network (ST-GCN) has become a widely used framework.
> Alternatively, hypergraph-based models, such as the Hyperformer, capture
> higher-order correlations, offering a more expressive representation of complex
> joint interactions. A recent advancement, termed Taylor Videos, introduces
> motion-enhanced skeleton sequences by embedding motion concepts, providing a
> fresh perspective on interpreting human actions in skeleton-based action
> recognition. In this paper, we conduct a comprehensive evaluation of both
> traditional skeleton sequences and Taylor-transformed skeletons using ST-GCN
> and Hyperformer models on the NTU-60 and NTU-120 datasets. We compare skeletal
> graph and hypergraph representations, analyzing static poses against
> motion-injected poses. Our findings highlight the strengths and limitations of
> Taylor-transformed skeletons, demonstrating their potential to enhance motion
> dynamics while exposing current challenges in fully using their benefits. This
> study underscores the need for innovative skeletal modelling techniques to
> effectively handle motion-rich data and advance the field of action
> recognition.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on skeleton-based action recognition, graph/hypergraph models (ST-GCN, Hyperformer), and motion dynamics analysis, with no apparent connection to Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.

---

## [Empowering Bengali Education with AI: Solving Bengali Math Word Problems
through Transformer Models](https://arxiv.org/abs/http://arxiv.org/abs/2501.02599v1)
**arXiv ID:** http://arxiv.org/abs/2501.02599v1

**Abstract:**
> Mathematical word problems (MWPs) involve the task of converting textual
> descriptions into mathematical equations. This poses a significant challenge in
> natural language processing, particularly for low-resource languages such as
> Bengali. This paper addresses this challenge by developing an innovative
> approach to solving Bengali MWPs using transformer-based models, including
> Basic Transformer, mT5, BanglaT5, and mBART50. To support this effort, the
> "PatiGonit" dataset was introduced, containing 10,000 Bengali math problems,
> and these models were fine-tuned to translate the word problems into equations
> accurately. The evaluation revealed that the mT5 model achieved the highest
> accuracy of 97.30%, demonstrating the effectiveness of transformer models in
> this domain. This research marks a significant step forward in Bengali natural
> language processing, offering valuable methodologies and resources for
> educational AI tools. By improving math education, it also supports the
> development of advanced problem-solving skills for Bengali-speaking students.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper focuses primarily on developing an approach for solving Bengali math word problems using transformer-based models, including fine-tuning these models, rather than engineering or optimizing prompts specifically for Large Language Models (LLMs) to improve their text generation performance."
}

---

## [LLMs Help Alleviate the Cross-Subject Variability in Brain Signal and
Language Alignment](https://arxiv.org/abs/http://arxiv.org/abs/2501.02621v1)
**arXiv ID:** http://arxiv.org/abs/2501.02621v1

**Abstract:**
> Decoding human activity from EEG signals has long been a popular research
> topic. While recent studies have increasingly shifted focus from single-subject
> to cross-subject analysis, few have explored the model's ability to perform
> zero-shot predictions on EEG signals from previously unseen subjects. This
> research aims to investigate whether deep learning methods can capture
> subject-independent semantic information inherent in human EEG signals. Such
> insights are crucial for Brain-Computer Interfaces (BCI) because, on one hand,
> they demonstrate the model's robustness against subject-specific temporal
> biases, and on the other, they significantly enhance the generalizability of
> downstream tasks. We employ Large Language Models (LLMs) as denoising agents to
> extract subject-independent semantic features from noisy EEG signals.
> Experimental results, including ablation studies, highlight the pivotal role of
> LLMs in decoding subject-independent semantic information from noisy EEG data.
> We hope our findings will contribute to advancing BCI research and assist both
> academia and industry in applying EEG signals to a broader range of
> applications.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper primarily focuses on using LLMs as denoising agents for decoding subject-independent semantic information from EEG signals in Brain-Computer Interfaces (BCI) research, rather than on prompt engineering for text-based interactions with LLMs, failing to meet the core subject criterion."
}

---

## [Trust and Dependability in Blockchain & AI Based MedIoT Applications:
Research Challenges and Future Directions](https://arxiv.org/abs/http://arxiv.org/abs/2501.02647v1)
**arXiv ID:** http://arxiv.org/abs/2501.02647v1

**Abstract:**
> This paper critically reviews the integration of Artificial Intelligence (AI)
> and blockchain technologies in the context of Medical Internet of Things
> (MedIoT) applications, where they collectively promise to revolutionize
> healthcare delivery. By examining current research, we underscore AI's
> potential in advancing diagnostics and patient care, alongside blockchain's
> capacity to bolster data security and patient privacy. We focus particularly on
> the imperative to cultivate trust and ensure reliability within these systems.
> Our review highlights innovative solutions for managing healthcare data and
> challenges such as ensuring scalability, maintaining privacy, and promoting
> ethical practices within the MedIoT domain. We present a vision for integrating
> AI-driven insights with blockchain security in healthcare, offering a
> comprehensive review of current research and future directions. We conclude
> with a set of identified research gaps and propose that addressing these is
> crucial for achieving the dependable, secure, and patient-centric MedIoT
> applications of tomorrow.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on the integration of AI and blockchain for MedIoT applications, addressing trust, security, and healthcare, which falls under excluded subjects (medical) and does not meet the 'MUST' criteria for focusing on prompt engineering for Large Language Models (LLMs).

---

## [Tighnari: Multi-modal Plant Species Prediction Based on Hierarchical
Cross-Attention Using Graph-Based and Vision Backbone-Extracted Features](https://arxiv.org/abs/http://arxiv.org/abs/2501.02649v1)
**arXiv ID:** http://arxiv.org/abs/2501.02649v1

**Abstract:**
> Predicting plant species composition in specific spatiotemporal contexts
> plays an important role in biodiversity management and conservation, as well as
> in improving species identification tools. Our work utilizes 88,987 plant
> survey records conducted in specific spatiotemporal contexts across Europe. We
> also use the corresponding satellite images, time series data, climate time
> series, and other rasterized environmental data such as land cover, human
> footprint, bioclimatic, and soil variables as training data to train the model
> to predict the outcomes of 4,716 plant surveys. We propose a feature
> construction and result correction method based on the graph structure. Through
> comparative experiments, we select the best-performing backbone networks for
> feature extraction in both temporal and image modalities. In this process, we
> built a backbone network based on the Swin-Transformer Block for extracting
> temporal Cubes features. We then design a hierarchical cross-attention
> mechanism capable of robustly fusing features from multiple modalities. During
> training, we adopt a 10-fold cross-fusion method based on fine-tuning and use a
> Threshold Top-K method for post-processing. Ablation experiments demonstrate
> the improvements in model performance brought by our proposed solution
> pipeline.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on multi-modal (image, satellite, time series, environmental data) plant species prediction using graph-based and vision backbone-extracted features, with no mention of Large Language Models (LLMs) or prompt engineering for text-based interactions, failing to meet the primary 'MUST' criteria.

---

## [Tougher Text, Smarter Models: Raising the Bar for Adversarial Defence
Benchmarks](https://arxiv.org/abs/http://arxiv.org/abs/2501.02654v2)
**arXiv ID:** http://arxiv.org/abs/2501.02654v2

**Abstract:**
> Recent advancements in natural language processing have highlighted the
> vulnerability of deep learning models to adversarial attacks. While various
> defence mechanisms have been proposed, there is a lack of comprehensive
> benchmarks that evaluate these defences across diverse datasets, models, and
> tasks. In this work, we address this gap by presenting an extensive benchmark
> for textual adversarial defence that significantly expands upon previous work.
> Our benchmark incorporates a wide range of datasets, evaluates state-of-the-art
> defence mechanisms, and extends the assessment to include critical tasks such
> as single-sentence classification, similarity and paraphrase identification,
> natural language inference, and commonsense reasoning. This work not only
> serves as a valuable resource for researchers and practitioners in the field of
> adversarial robustness but also identifies key areas for future research in
> textual adversarial defence. By establishing a new standard for benchmarking in
> this domain, we aim to accelerate progress towards more robust and reliable
> natural language processing systems.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on developing a benchmark for evaluating adversarial defence mechanisms in NLP, rather than primarily on the engineering, design, or optimization of prompts for Large Language Models (LLMs), and does not provide concrete examples of prompts with their impact on LLM output.

---

## [Multi-Aggregator Time-Warping Heterogeneous Graph Neural Network for
Personalized Micro-Video Recommendation](https://arxiv.org/abs/http://arxiv.org/abs/2501.02666v1)
**arXiv ID:** http://arxiv.org/abs/2501.02666v1

**Abstract:**
> Micro-video recommendation is attracting global attention and becoming a
> popular daily service for people of all ages. Recently, Graph Neural
> Networks-based micro-video recommendation has displayed performance improvement
> for many kinds of recommendation tasks. However, the existing works fail to
> fully consider the characteristics of micro-videos, such as the high timeliness
> of news nature micro-video recommendation and sequential interactions of
> frequently changed interests. In this paper, a novel Multi-aggregator
> Time-warping Heterogeneous Graph Neural Network (MTHGNN) is proposed for
> personalized news nature micro-video recommendation based on sequential
> sessions, where characteristics of micro-videos are comprehensively studied,
> users' preference is mined via multi-aggregator, the temporal and dynamic
> changes of users' preference are captured, and timeliness is considered.
> Through the comparison with the state-of-the-arts, the experimental results
> validate the superiority of our MTHGNN model.

**Decision Explanation:**
Original decision: REJECT
The paper's primary focus is on developing a novel Graph Neural Network for personalized micro-video recommendation, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thus failing to meet all 'MUST' criteria.

---

## [From Superficial Patterns to Semantic Understanding: Fine-Tuning
Language Models on Contrast Sets](https://arxiv.org/abs/http://arxiv.org/abs/2501.02683v2)
**arXiv ID:** http://arxiv.org/abs/2501.02683v2

**Abstract:**
> Large-scale pre-trained language models have demonstrated high performance on
> standard datasets for natural language inference (NLI) tasks. Unfortunately,
> these evaluations can be misleading, as although the models can perform well on
> in-distribution data, they perform poorly on out-of-distribution test sets,
> such as contrast sets. Contrast sets consist of perturbed instances of data
> that have very minor, but meaningful, changes to the input that alter the gold
> label, revealing how models can learn superficial patterns in the training data
> rather than learning more sophisticated language nuances. As an example, the
> ELECTRA-small language model achieves nearly 90% accuracy on an SNLI dataset
> but drops to 75% when tested on an out-of-distribution contrast set. The
> research carried out in this study explores how the robustness of a language
> model can be improved by exposing it to small amounts of more complex contrast
> sets during training to help it better learn language patterns. With this
> approach, the model recovers performance and achieves nearly 90% accuracy on
> contrast sets, highlighting the importance of diverse and challenging training
> data.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on fine-tuning language models with contrast sets for improved robustness, which amounts to developing a new training method, and does not investigate, analyze, or propose methods for improving LLM performance through the manipulation of textual input prompts.

---

## [Horizon Generalization in Reinforcement Learning](https://arxiv.org/abs/http://arxiv.org/abs/2501.02709v1)
**arXiv ID:** http://arxiv.org/abs/2501.02709v1

**Abstract:**
> We study goal-conditioned RL through the lens of generalization, but not in
> the traditional sense of random augmentations and domain randomization. Rather,
> we aim to learn goal-directed policies that generalize with respect to the
> horizon: after training to reach nearby goals (which are easy to learn), these
> policies should succeed in reaching distant goals (which are quite challenging
> to learn). In the same way that invariance is closely linked with
> generalization in other areas of machine learning (e.g., normalization layers
> make a network invariant to scale, and therefore generalize to inputs of
> varying scales), we show that this notion of horizon generalization is closely
> linked with invariance to planning: a policy navigating towards a goal will
> select the same actions as if it were navigating to a waypoint en route to that
> goal. Thus, such a policy trained to reach nearby goals should succeed at
> reaching arbitrarily-distant goals. Our theoretical analysis proves that both
> horizon generalization and planning invariance are possible, under some
> assumptions. We present new experimental results and recall findings from prior
> work in support of our theoretical results. Taken together, our results open
> the door to studying how techniques for invariance and generalization developed
> in other areas of machine learning might be adapted to achieve this alluring
> property.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on reinforcement learning, goal-conditioned policies, and horizon generalization, with no apparent connection to prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.

---

## [Improved Data Encoding for Emerging Computing Paradigms: From Stochastic
to Hyperdimensional Computing](https://arxiv.org/abs/http://arxiv.org/abs/2501.02715v1)
**arXiv ID:** http://arxiv.org/abs/2501.02715v1

**Abstract:**
> Data encoding is a fundamental step in emerging computing paradigms,
> particularly in stochastic computing (SC) and hyperdimensional computing (HDC),
> where it plays a crucial role in determining the overall system performance and
> hardware cost efficiency. This study presents an advanced encoding strategy
> that leverages a hardware-friendly class of low-discrepancy (LD) sequences,
> specifically powers-of-2 bases of Van der Corput (VDC) sequences (VDC-2^n), as
> sources for random number generation. Our approach significantly enhances the
> accuracy and efficiency of SC and HDC systems by addressing challenges
> associated with randomness. By employing LD sequences, we improve correlation
> properties and reduce hardware complexity. Experimental results demonstrate
> significant improvements in accuracy and energy savings for SC and HDC systems.
> Our solution provides a robust framework for integrating SC and HDC in
> resource-constrained environments, paving the way for efficient and scalable AI
> implementations.

**Decision Explanation:**
Original decision: REJECT
The paper does not focus on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance, instead concentrating on data encoding strategies for stochastic and hyperdimensional computing paradigms.

---

## [Tree-based RAG-Agent Recommendation System: A Case Study in Medical Test
Data](https://arxiv.org/abs/http://arxiv.org/abs/2501.02727v1)
**arXiv ID:** http://arxiv.org/abs/2501.02727v1

**Abstract:**
> We present HiRMed (Hierarchical RAG-enhanced Medical Test Recommendation), a
> novel tree-structured recommendation system that leverages Retrieval-Augmented
> Generation (RAG) for intelligent medical test recommendations. Unlike
> traditional vector similarity-based approaches, our system performs medical
> reasoning at each tree node through a specialized RAG process. Starting from
> the root node with initial symptoms, the system conducts step-wise medical
> analysis to identify potential underlying conditions and their corresponding
> diagnostic requirements. At each level, instead of simple matching, our
> RAG-enhanced nodes analyze retrieved medical knowledge to understand
> symptom-disease relationships and determine the most appropriate diagnostic
> path. The system dynamically adjusts its recommendation strategy based on
> medical reasoning results, considering factors such as urgency levels and
> diagnostic uncertainty. Experimental results demonstrate that our approach
> achieves superior performance in terms of coverage rate, accuracy, and miss
> rate compared to conventional retrieval-based methods. This work represents a
> significant advance in medical test recommendation by introducing medical
> reasoning capabilities into the traditional tree-based retrieval structure.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on a medical application (medical test recommendation system) and develops a new system architecture (Tree-based RAG-Agent) rather than focusing on prompt engineering for Large Language Models (LLMs) in text-based interactions, failing to meet the required criteria.

---

## [AFed: Algorithmic Fair Federated Learning](https://arxiv.org/abs/http://arxiv.org/abs/2501.02732v1)
**arXiv ID:** http://arxiv.org/abs/2501.02732v1

**Abstract:**
> Federated Learning (FL) has gained significant attention as it facilitates
> collaborative machine learning among multiple clients without centralizing
> their data on a server. FL ensures the privacy of participating clients by
> locally storing their data, which creates new challenges in fairness.
> Traditional debiasing methods assume centralized access to sensitive
> information, rendering them impractical for the FL setting. Additionally, FL is
> more susceptible to fairness issues than centralized machine learning due to
> the diverse client data sources that may be associated with group information.
> Therefore, training a fair model in FL without access to client local data is
> important and challenging. This paper presents AFed, a straightforward yet
> effective framework for promoting group fairness in FL. The core idea is to
> circumvent restricted data access by learning the global data distribution.
> This paper proposes two approaches: AFed-G, which uses a conditional generator
> trained on the server side, and AFed-GAN, which improves upon AFed-G by
> training a conditional GAN on the client side. We augment the client data with
> the generated samples to help remove bias. Our theoretical analysis justifies
> the proposed methods, and empirical results on multiple real-world datasets
> demonstrate a substantial improvement in AFed over several baselines.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on developing a framework for promoting group fairness in Federated Learning (FL), with no emphasis on Large Language Models (LLMs), prompt engineering, or the manipulation of textual input prompts to improve LLM performance.

---

## [TARDiS : Text Augmentation for Refining Diversity and Separability](https://arxiv.org/abs/http://arxiv.org/abs/2501.02739v1)
**arXiv ID:** http://arxiv.org/abs/2501.02739v1

**Abstract:**
> Text augmentation (TA) is a critical technique for text classification,
> especially in few-shot settings. This paper introduces a novel LLM-based TA
> method, TARDiS, to address challenges inherent in the generation and alignment
> stages of two-stage TA methods. For the generation stage, we propose two
> generation processes, SEG and CEG, incorporating multiple class-specific
> prompts to enhance diversity and separability. For the alignment stage, we
> introduce a class adaptation (CA) method to ensure that generated examples
> align with their target classes through verification and modification.
> Experimental results demonstrate TARDiS's effectiveness, outperforming
> state-of-the-art LLM-based TA methods in various few-shot text classification
> tasks. An in-depth analysis confirms the detailed behaviors at each stage.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on a novel text augmentation method (TARDiS) for improving few-shot text classification tasks, using LLMs as a tool, rather than focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) to improve LLM performance through prompt manipulation.

---

## [Interpretable Recognition of Fused Magnesium Furnace Working Conditions
with Deep Convolutional Stochastic Configuration Networks](https://arxiv.org/abs/http://arxiv.org/abs/2501.02740v1)
**arXiv ID:** http://arxiv.org/abs/2501.02740v1

**Abstract:**
> To address the issues of a weak generalization capability and
> interpretability in working condition recognition model of a fused magnesium
> furnace, this paper proposes an interpretable working condition recognition
> method based on deep convolutional stochastic configuration networks (DCSCNs).
> Firstly, a supervised learning mechanism is employed to generate physically
> meaningful Gaussian differential convolution kernels. An incremental method is
> utilized to construct a DCSCNs model, ensuring the convergence of recognition
> errors in a hierarchical manner and avoiding the iterative optimization process
> of convolutional kernel parameters using the widely used backpropagation
> algorithm. The independent coefficient of channel feature maps is defined to
> obtain the visualization results of feature class activation maps for the fused
> magnesium furnace. A joint reward function is constructed based on the
> recognition accuracy, the interpretable trustworthiness evaluation metrics, and
> the model parameter quantity. Reinforcement learning (RL) is applied to
> adaptively prune the convolutional kernels of the DCSCNs model, aiming to build
> a compact, highly performed and interpretable network. The experimental results
> demonstrate that the proposed method outperforms the other deep learning
> approaches in terms of recognition accuracy and interpretability.

**Decision Explanation:**
Original decision: REJECT
The paper does not meet the primary focus criteria, as it is centered on developing a deep convolutional stochastic configuration network for recognizing working conditions in a fused magnesium furnace, with no mention of Large Language Models (LLMs), prompt engineering, or text-based interactions.

---

## [Visual Large Language Models for Generalized and Specialized
Applications](https://arxiv.org/abs/http://arxiv.org/abs/2501.02765v1)
**arXiv ID:** http://arxiv.org/abs/2501.02765v1

**Abstract:**
> Visual-language models (VLM) have emerged as a powerful tool for learning a
> unified embedding space for vision and language. Inspired by large language
> models, which have demonstrated strong reasoning and multi-task capabilities,
> visual large language models (VLLMs) are gaining increasing attention for
> building general-purpose VLMs. Despite the significant progress made in VLLMs,
> the related literature remains limited, particularly from a comprehensive
> application perspective, encompassing generalized and specialized applications
> across vision (image, video, depth), action, and language modalities. In this
> survey, we focus on the diverse applications of VLLMs, examining their using
> scenarios, identifying ethics consideration and challenges, and discussing
> future directions for their development. By synthesizing these contents, we aim
> to provide a comprehensive guide that will pave the way for future innovations
> and broader applications of VLLMs. The paper list repository is available:
> https://github.com/JackYFL/awesome-VLLMs.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on Visual Large Language Models (VLLMs) for generalized and specialized applications across vision, action, and language modalities, with no primary focus on engineering, design, or optimization of textual input prompts specifically for Large Language Models (LLMs) as required.

---

## [Are GNNs Effective for Multimodal Fault Diagnosis in Microservice
Systems?](https://arxiv.org/abs/http://arxiv.org/abs/2501.02766v1)
**arXiv ID:** http://arxiv.org/abs/2501.02766v1

**Abstract:**
> Fault diagnosis in microservice systems has increasingly embraced multimodal
> observation data for a holistic and multifaceted view of the system, with Graph
> Neural Networks (GNNs) commonly employed to model complex service dependencies.
> However, despite the intuitive appeal, there remains a lack of compelling
> justification for the adoption of GNNs, as no direct evidence supports their
> necessity or effectiveness. To critically evaluate the current use of GNNs, we
> propose DiagMLP, a simple topology-agnostic baseline as a substitute for GNNs
> in fault diagnosis frameworks. Through experiments on five public datasets, we
> surprisingly find that DiagMLP performs competitively with and even outperforms
> GNN-based methods in fault diagnosis tasks, indicating that the current
> paradigm of using GNNs to model service dependencies has not yet demonstrated a
> tangible contribution. We further discuss potential reasons for this
> observation and advocate shifting the focus from solely pursuing novel model
> designs to developing challenging datasets, standardizing preprocessing
> protocols, and critically evaluating the utility of advanced deep learning
> modules.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on evaluating Graph Neural Networks (GNNs) for multimodal fault diagnosis in microservice systems, with no apparent connection to Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet the primary 'MUST' criteria.

---

## [Enhancing Trustworthiness of Graph Neural Networks with Rank-Based
Conformal Training](https://arxiv.org/abs/http://arxiv.org/abs/2501.02767v1)
**arXiv ID:** http://arxiv.org/abs/2501.02767v1

**Abstract:**
> Graph Neural Networks (GNNs) has been widely used in a variety of fields
> because of their great potential in representing graph-structured data.
> However, lacking of rigorous uncertainty estimations limits their application
> in high-stakes. Conformal Prediction (CP) can produce statistically guaranteed
> uncertainty estimates by using the classifier's probability estimates to obtain
> prediction sets, which contains the true class with a user-specified
> probability. In this paper, we propose a Rank-based CP during training
> framework to GNNs (RCP-GNN) for reliable uncertainty estimates to enhance the
> trustworthiness of GNNs in the node classification scenario. By exploiting rank
> information of the classifier's outcome, prediction sets with desired coverage
> rate can be efficiently constructed. The strategy of CP during training with
> differentiable rank-based conformity loss function is further explored to adapt
> prediction sets according to network topology information. In this way, the
> composition of prediction sets can be guided by the goal of jointly reducing
> inefficiency and probability estimation errors. Extensive experiments on
> several real-world datasets show that our model achieves any pre-defined target
> marginal coverage while significantly reducing the inefficiency compared with
> state-of-the-art methods.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper focuses primarily on enhancing Graph Neural Networks (GNNs) with Conformal Prediction for reliable uncertainty estimates, which does not meet the MUST criteria of focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs)."
}

---

## [ICFNet: Integrated Cross-modal Fusion Network for Survival Prediction](https://arxiv.org/abs/http://arxiv.org/abs/2501.02778v1)
**arXiv ID:** http://arxiv.org/abs/2501.02778v1

**Abstract:**
> Survival prediction is a crucial task in the medical field and is essential
> for optimizing treatment options and resource allocation. However, current
> methods often rely on limited data modalities, resulting in suboptimal
> performance. In this paper, we propose an Integrated Cross-modal Fusion Network
> (ICFNet) that integrates histopathology whole slide images, genomic expression
> profiles, patient demographics, and treatment protocols. Specifically, three
> types of encoders, a residual orthogonal decomposition module and a unification
> fusion module are employed to merge multi-modal features to enhance prediction
> accuracy. Additionally, a balanced negative log-likelihood loss function is
> designed to ensure fair training across different patients. Extensive
> experiments demonstrate that our ICFNet outperforms state-of-the-art algorithms
> on five public TCGA datasets, including BLCA, BRCA, GBMLGG, LUAD, and UCEC, and
> shows its potential to support clinical decision-making and advance precision
> medicine. The codes are available at: https://github.com/binging512/ICFNet.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on developing a new network (ICFNet) for survival prediction in the medical field, integrating various data modalities, and does not meet the criteria of focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs).

---

## [Hybrid deep convolution model for lung cancer detection with transfer
learning](https://arxiv.org/abs/http://arxiv.org/abs/2501.02785v1)
**arXiv ID:** http://arxiv.org/abs/2501.02785v1

**Abstract:**
> Advances in healthcare research have significantly enhanced our understanding
> of disease mechanisms, diagnostic precision, and therapeutic options. Yet, lung
> cancer remains one of the leading causes of cancer-related mortality worldwide
> due to challenges in early and accurate diagnosis. While current lung cancer
> detection models show promise, there is considerable potential for further
> improving the accuracy for timely intervention. To address this challenge, we
> introduce a hybrid deep convolution model leveraging transfer learning, named
> the Maximum Sensitivity Neural Network (MSNN). MSNN is designed to improve the
> precision of lung cancer detection by refining sensitivity and specificity.
> This model has surpassed existing deep learning approaches through experimental
> validation, achieving an accuracy of 98% and a sensitivity of 97%. By
> overlaying sensitivity maps onto lung Computed Tomography (CT) scans, it
> enables the visualization of regions most indicative of malignant or benign
> classifications. This innovative method demonstrates exceptional performance in
> distinguishing lung cancer with minimal false positives, thereby enhancing the
> accuracy of medical diagnoses.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on the development of a new deep convolution model for lung cancer detection, which falls under medical subjects and does not address prompt engineering for Large Language Models (LLMs) as required.

---

## [Segmenting Text and Learning Their Rewards for Improved RLHF in Language
Model](https://arxiv.org/abs/http://arxiv.org/abs/2501.02790v1)
**arXiv ID:** http://arxiv.org/abs/2501.02790v1

**Abstract:**
> Reinforcement learning from human feedback (RLHF) has been widely adopted to
> align language models (LMs) with human preference. Prior RLHF works typically
> take a bandit formulation, which, though intuitive, ignores the sequential
> nature of LM generation and can suffer from the sparse reward issue. While
> recent works propose dense token-level RLHF, treating each token as an action
> may be oversubtle to proper reward assignment. In this paper, we seek to get
> the best of both by training and utilizing a segment-level reward model, which
> assigns a reward to each semantically complete text segment that spans over a
> short sequence of tokens. For reward learning, our method allows dynamic text
> segmentation and compatibility with standard sequence-preference datasets. For
> effective RL-based LM training against segment reward, we generalize the
> classical scalar bandit reward normalizers into location-aware normalizer
> functions and interpolate the segment reward for further densification. With
> these designs, our method performs competitively on three popular RLHF
> benchmarks for LM policy: AlpacaEval 2.0, Arena-Hard, and MT-Bench. Ablation
> studies are conducted to further demonstrate our method.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper primarily focuses on improving Reinforcement Learning from Human Feedback (RLHF) for Language Models through segment-level reward modeling, rather than specifically investigating, analyzing, or proposing methods for improving LLM performance through the manipulation of textual input prompts."
}

---

## [Enhancing Lifelong Multi-Agent Path Finding with Cache Mechanism](https://arxiv.org/abs/http://arxiv.org/abs/2501.02803v1)
**arXiv ID:** http://arxiv.org/abs/2501.02803v1

**Abstract:**
> Multi-Agent Path Finding (MAPF), which focuses on finding collision-free
> paths for multiple robots, is crucial in autonomous warehouse operations.
> Lifelong MAPF (L-MAPF), where agents are continuously reassigned new targets
> upon completing their current tasks, offers a more realistic approximation of
> real-world warehouse scenarios. While cache storage systems can enhance
> efficiency and reduce operational costs, existing approaches primarily rely on
> expectations and mathematical models, often without adequately addressing the
> challenges of multi-robot planning and execution. In this paper, we introduce a
> novel mechanism called Lifelong MAPF with Cache Mechanism (L-MAPF-CM), which
> integrates high-level cache storage with low-level path planning. We have
> involved a new type of map grid called cache for temporary item storage.
> Additionally, we involved a task assigner (TA) with a locking mechanism to
> bridge the gap between the new cache grid and L-MAPF algorithm. The TA
> dynamically allocates target locations to agents based on their status in
> various scenarios. We evaluated L-MAPF-CM using different cache replacement
> policies and task distributions. L-MAPF-CM has demonstrated performance
> improvements particularly with high cache hit rates and smooth traffic
> conditions.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on enhancing Multi-Agent Path Finding with a cache mechanism for autonomous warehouse operations, which does not meet the required focus on prompt engineering, design, or optimization specifically for Large Language Models (LLMs) and their textual input prompts.

---

## [InpDiffusion: Image Inpainting Localization via Conditional Diffusion
Models](https://arxiv.org/abs/http://arxiv.org/abs/2501.02816v1)
**arXiv ID:** http://arxiv.org/abs/2501.02816v1

**Abstract:**
> As artificial intelligence advances rapidly, particularly with the advent of
> GANs and diffusion models, the accuracy of Image Inpainting Localization (IIL)
> has become increasingly challenging. Current IIL methods face two main
> challenges: a tendency towards overconfidence, leading to incorrect
> predictions; and difficulty in detecting subtle tampering boundaries in
> inpainted images. In response, we propose a new paradigm that treats IIL as a
> conditional mask generation task utilizing diffusion models. Our method,
> InpDiffusion, utilizes the denoising process enhanced by the integration of
> image semantic conditions to progressively refine predictions. During
> denoising, we employ edge conditions and introduce a novel edge supervision
> strategy to enhance the model's perception of edge details in inpainted
> objects. Balancing the diffusion model's stochastic sampling with edge
> supervision of tampered image regions mitigates the risk of incorrect
> predictions from overconfidence and prevents the loss of subtle boundaries that
> can result from overly stochastic processes. Furthermore, we propose an
> innovative Dual-stream Multi-scale Feature Extractor (DMFE) for extracting
> multi-scale features, enhancing feature representation by considering both
> semantic and edge conditions of the inpainted images. Extensive experiments
> across challenging datasets demonstrate that the InpDiffusion significantly
> outperforms existing state-of-the-art methods in IIL tasks, while also
> showcasing excellent generalization capabilities and robustness.

**Decision Explanation:**
Original decision: REJECT
The paper does not meet the primary criteria as it focuses on image inpainting localization using conditional diffusion models, with no apparent connection to Large Language Models (LLMs), prompt engineering, or textual input prompts. The subject matter falls under image generation and processing, which is explicitly excluded.

---

## [Enhanced Rooftop Solar Panel Detection by Efficiently Aggregating Local
Features](https://arxiv.org/abs/http://arxiv.org/abs/2501.02840v1)
**arXiv ID:** http://arxiv.org/abs/2501.02840v1

**Abstract:**
> In this paper, we present an enhanced Convolutional Neural Network
> (CNN)-based rooftop solar photovoltaic (PV) panel detection approach using
> satellite images. We propose to use pre-trained CNN-based model to extract the
> local convolutional features of rooftops. These local features are then
> combined using the Vectors of Locally Aggregated Descriptors (VLAD) technique
> to obtain rooftop-level global features, which are then used to train
> traditional Machine Learning (ML) models to identify rooftop images that do and
> do not contain PV panels. On the dataset used in this study, the proposed
> approach achieved rooftop-PV classification scores exceeding the predefined
> threshold of 0.9 across all three cities for each of the feature extractor
> networks evaluated. Moreover, we propose a 3-phase approach to enable efficient
> utilization of the previously trained models on a new city or region with
> limited labelled data. We illustrate the effectiveness of this 3-phase approach
> for multi-city rooftop-PV detection task.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on enhancing rooftop solar panel detection using CNN and ML techniques with satellite images, and does not investigate, analyze, or propose methods for improving Large Language Model (LLM) performance through the manipulation of textual input prompts as required.

---

## [IIMedGPT: Promoting Large Language Model Capabilities of Medical Tasks
by Efficient Human Preference Alignment](https://arxiv.org/abs/http://arxiv.org/abs/2501.02869v1)
**arXiv ID:** http://arxiv.org/abs/2501.02869v1

**Abstract:**
> Recent research on large language models (LLMs), which are pre-trained on
> massive general-purpose corpora, has achieved breakthroughs in responding to
> human queries. However, these methods face challenges, including insufficient
> data to support extensive pre-training, and cannot align responses with users'
> instructions. To address these issues, we introduce a medical instruction
> dataset, CMedINS, containing six medical instructions derived from actual
> medical tasks, which effectively fine-tunes LLMs in conjunction with other
> data. Subsequently, we launch our medical model, IIMedGPT, employing an
> efficient preference alignment method, Direct Preference Optimization (DPO).
> The results show that our final model outperforms existing medical models in
> medical dialogue. Datasets, code, and model checkpoints will be released upon
> acceptance.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on fine-tuning a Large Language Model for medical tasks using a new dataset and preference alignment method, which violates the 'MUST NOT' criteria by focusing on the development of new LLM training methods and by being primarily concerned with medical subjects.

---

## [Skillful High-Resolution Ensemble Precipitation Forecasting with an
Integrated Deep Learning Framework](https://arxiv.org/abs/http://arxiv.org/abs/2501.02905v1)
**arXiv ID:** http://arxiv.org/abs/2501.02905v1

**Abstract:**
> High-resolution precipitation forecasts are crucial for providing accurate
> weather prediction and supporting effective responses to extreme weather
> events. Traditional numerical models struggle with stochastic subgrid-scale
> processes, while recent deep learning models often produce blurry results. To
> address these challenges, we propose a physics-inspired deep learning framework
> for high-resolution (0.05° × 0.05°) ensemble
> precipitation forecasting. Trained on ERA5 and CMPA high-resolution
> precipitation datasets, the framework integrates deterministic and
> probabilistic components. The deterministic model, based on a 3D
> SwinTransformer, captures average precipitation at mesoscale resolution and
> incorporates strategies to enhance performance, particularly for moderate to
> heavy rainfall. The probabilistic model employs conditional diffusion in latent
> space to account for uncertainties in residual precipitation at convective
> scales. During inference, ensemble members are generated by repeatedly sampling
> latent variables, enabling the model to represent precipitation uncertainty.
> Our model significantly enhances spatial resolution and forecast accuracy. Rank
> histogram shows that the ensemble system is reliable and unbiased. In a case
> study of heavy precipitation in southern China, the model outputs align more
> closely with observed precipitation distributions than ERA5, demonstrating
> superior capability in capturing extreme precipitation events. Additionally,
> 5-day real-time forecasts show good performance in terms of CSI scores.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on developing a deep learning framework for high-resolution precipitation forecasting, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thus failing to meet the primary criteria.

---

## [Unsupervised Tomato Split Anomaly Detection using Hyperspectral Imaging
and Variational Autoencoders](https://arxiv.org/abs/http://arxiv.org/abs/2501.02921v1)
**arXiv ID:** http://arxiv.org/abs/2501.02921v1

**Abstract:**
> Tomato anomalies/damages pose a significant challenge in greenhouse farming.
> While this method of cultivation benefits from efficient resource utilization,
> anomalies can significantly degrade the quality of farm produce. A common
> anomaly associated with tomatoes is splitting, characterized by the development
> of cracks on the tomato skin, which degrades its quality. Detecting this type
> of anomaly is challenging due to dynamic variations in appearance and sizes,
> compounded by dataset scarcity. We address this problem in an unsupervised
> manner by utilizing a tailored variational autoencoder (VAE) with hyperspectral
> input. Preliminary analysis of the dataset enabled us to select the optimal
> range of wavelengths for detecting this anomaly. Our findings indicate that the
> 530nm - 550nm range is suitable for identifying tomato dry splits. The analysis
> on reconstruction loss allows us to not only detect the anomalies but also to
> some degree estimate the anomalous regions.

**Decision Explanation:**
Original decision: REJECT
The paper does not meet any of the 'MUST' criteria: it focuses on anomaly detection in hyperspectral imaging using Variational Autoencoders, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus falling entirely outside the specified scope.

---

## [GLFC: Unified Global-Local Feature and Contrast Learning with
Mamba-Enhanced UNet for Synthetic CT Generation from CBCT](https://arxiv.org/abs/http://arxiv.org/abs/2501.02992v2)
**arXiv ID:** http://arxiv.org/abs/2501.02992v2

**Abstract:**
> Generating synthetic Computed Tomography (CT) images from Cone Beam Computed
> Tomography (CBCT) is desirable for improving the image quality of CBCT.
> Existing synthetic CT (sCT) generation methods using Convolutional Neural
> Networks (CNN) and Transformers often face difficulties in effectively
> capturing both global and local features and contrasts for high-quality sCT
> generation. In this work, we propose a Global-Local Feature and Contrast
> learning (GLFC) framework for sCT generation. First, a Mamba-Enhanced UNet
> (MEUNet) is introduced by integrating Mamba blocks into the skip connections of
> a high-resolution UNet for effective global and local feature learning. Second,
> we propose a Multiple Contrast Loss (MCL) that calculates synthetic loss at
> different intensity windows to improve quality for both soft tissues and bone
> regions. Experiments on the SynthRAD2023 dataset demonstrate that GLFC improved
> the SSIM of sCT from 77.91% to 91.50% compared with the original CBCT, and
> significantly outperformed several existing methods for sCT generation. The
> code is available at https://github.com/HiLab-git/GLFC

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper primarily focuses on generating synthetic CT images from CBCT using a novel CNN framework (GLFC with Mamba-Enhanced UNet), which does not meet the 'MUST' criteria of focusing on engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts for improving LLM performance."
}

---

## [Quality Estimation based Feedback Training for Improving Pronoun
Translation](https://arxiv.org/abs/http://arxiv.org/abs/2501.03008v1)
**arXiv ID:** http://arxiv.org/abs/2501.03008v1

**Abstract:**
> Pronoun translation is a longstanding challenge in neural machine translation
> (NMT), often requiring inter-sentential context to ensure linguistic accuracy.
> To address this, we introduce ProNMT, a novel framework designed to enhance
> pronoun and overall translation quality in context-aware machine translation
> systems. ProNMT leverages Quality Estimation (QE) models and a unique Pronoun
> Generation Likelihood-Based Feedback mechanism to iteratively fine-tune
> pre-trained NMT models without relying on extensive human annotations. The
> framework combines QE scores with pronoun-specific rewards to guide training,
> ensuring improved handling of linguistic nuances. Extensive experiments
> demonstrate significant gains in pronoun translation accuracy and general
> translation quality across multiple metrics. ProNMT offers an efficient,
> scalable, and context-aware approach to improving NMT systems, particularly in
> translating context-dependent elements like pronouns.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on improving neural machine translation (NMT) systems through Quality Estimation based feedback training, rather than engineering or optimizing prompts specifically for Large Language Models (LLMs), and does not provide concrete examples of prompts or demonstrate their impact on LLM output.

---

## [Quantization Meets Reasoning: Exploring LLM Low-Bit Quantization
Degradation for Mathematical Reasoning](https://arxiv.org/abs/http://arxiv.org/abs/2501.03035v1)
**arXiv ID:** http://arxiv.org/abs/2501.03035v1

**Abstract:**
> Large language models have achieved significant advancements in complex
> mathematical reasoning benchmarks, such as MATH. However, their substantial
> computational requirements present challenges for practical deployment. Model
> quantization has emerged as an effective strategy to reduce memory usage and
> computational costs by employing lower precision and bit-width representations.
> In this study, we systematically evaluate the impact of quantization on
> mathematical reasoning tasks. We introduce a multidimensional evaluation
> framework that qualitatively assesses specific capability dimensions and
> conduct quantitative analyses on the step-by-step outputs of various
> quantization methods. Our results demonstrate that quantization differentially
> affects numerical computation and reasoning planning abilities, identifying key
> areas where quantized models experience performance degradation.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper primarily focuses on model quantization for reducing computational costs and its impact on mathematical reasoning tasks, rather than prompt engineering for Large Language Models (LLMs), not meeting the 'MUST' criteria for prompt engineering, LLM performance improvement through prompt manipulation, and providing concrete prompt examples."
}

---

## [Piano Transcription by Hierarchical Language Modeling with Pretrained
Roll-based Encoders](https://arxiv.org/abs/http://arxiv.org/abs/2501.03038v2)
**arXiv ID:** http://arxiv.org/abs/2501.03038v2

**Abstract:**
> Automatic Music Transcription (AMT), aiming to get musical notes from raw
> audio, typically uses frame-level systems with piano-roll outputs or language
> model (LM)-based systems with note-level predictions. However, frame-level
> systems require manual thresholding, while the LM-based systems struggle with
> long sequences. In this paper, we propose a hybrid method combining pre-trained
> roll-based encoders with an LM decoder to leverage the strengths of both
> methods. Besides, our approach employs a hierarchical prediction strategy,
> first predicting onset and pitch, then velocity, and finally offset. The
> hierarchical prediction strategy reduces computational costs by breaking down
> long sequences into different hierarchies. Evaluated on two benchmark
> roll-based encoders, our method outperforms traditional piano-roll outputs 0.01
> and 0.022 in onset-offset-velocity F1 score, demonstrating its potential as a
> performance-enhancing plug-in for arbitrary roll-based music transcription
> encoder.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on Automatic Music Transcription using a hybrid method combining pre-trained roll-based encoders with a language model decoder, rather than prompt engineering for Large Language Models (LLMs). The subject is outside the specified scope, particularly excluding applications of generative AI other than text generation driven by LLMs, such as audio generation in this case.

---

## [Single-Channel Distance-Based Source Separation for Mobile GPU in
Outdoor and Indoor Environments](https://arxiv.org/abs/http://arxiv.org/abs/2501.03045v1)
**arXiv ID:** http://arxiv.org/abs/2501.03045v1

**Abstract:**
> This study emphasizes the significance of exploring distance-based source
> separation (DSS) in outdoor environments. Unlike existing studies that
> primarily focus on indoor settings, the proposed model is designed to capture
> the unique characteristics of outdoor audio sources. It incorporates advanced
> techniques, including a two-stage conformer block, a linear relation-aware
> self-attention (RSA), and a TensorFlow Lite GPU delegate. While the linear RSA
> may not capture physical cues as explicitly as the quadratic RSA, the linear
> RSA enhances the model's context awareness, leading to improved performance on
> the DSS that requires an understanding of physical cues in outdoor and indoor
> environments. The experimental results demonstrated that the proposed model
> overcomes the limitations of existing approaches and considerably enhances
> energy efficiency and real-time inference speed on mobile devices.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on source separation for audio signals in outdoor and indoor environments, utilizing mobile GPU and TensorFlow, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.

---

## [Personalized Fashion Recommendation with Image Attributes and Aesthetics
Assessment](https://arxiv.org/abs/http://arxiv.org/abs/2501.03085v1)
**arXiv ID:** http://arxiv.org/abs/2501.03085v1

**Abstract:**
> Personalized fashion recommendation is a difficult task because 1) the
> decisions are highly correlated with users' aesthetic appetite, which previous
> work frequently overlooks, and 2) many new items are constantly rolling out
> that cause strict cold-start problems in the popular identity (ID)-based
> recommendation methods. These new items are critical to recommend because of
> trend-driven consumerism. In this work, we aim to provide more accurate
> personalized fashion recommendations and solve the cold-start problem by
> converting available information, especially images, into two attribute graphs
> focusing on optimized image utilization and noise-reducing user modeling.
> Compared with previous methods that separate image and text as two components,
> the proposed method combines image and text information to create a richer
> attributes graph. Capitalizing on the advancement of large language and vision
> models, we experiment with extracting fine-grained attributes efficiently and
> as desired using two different prompts. Preliminary experiments on the IQON3000
> dataset have shown that the proposed method achieves competitive accuracy
> compared with baselines.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on personalized fashion recommendation using image attributes and aesthetics, combining image and text information, rather than specifically engineering or optimizing prompts for Large Language Models (LLMs). While it mentions using two different prompts for attribute extraction, prompt engineering is not the central concern, but rather a minor component within the larger system.

---

## [LangFair: A Python Package for Assessing Bias and Fairness in Large
Language Model Use Cases](https://arxiv.org/abs/http://arxiv.org/abs/2501.03112v1)
**arXiv ID:** http://arxiv.org/abs/2501.03112v1

**Abstract:**
> Large Language Models (LLMs) have been observed to exhibit bias in numerous
> ways, potentially creating or worsening outcomes for specific groups identified
> by protected attributes such as sex, race, sexual orientation, or age. To help
> address this gap, we introduce LangFair, an open-source Python package that
> aims to equip LLM practitioners with the tools to evaluate bias and fairness
> risks relevant to their specific use cases. The package offers functionality to
> easily generate evaluation datasets, comprised of LLM responses to
> use-case-specific prompts, and subsequently calculate applicable metrics for
> the practitioner's use case. To guide in metric selection, LangFair offers an
> actionable decision framework.

**Decision Explanation:**
Original decision: REJECT
The paper's primary focus is on assessing bias and fairness in Large Language Models using a Python package, rather than on the engineering, design, or optimization of prompts specifically for LLMs to improve their text generation performance.

---

## [From Models to Network Topologies: A Topology Inference Attack in
Decentralized Federated Learning](https://arxiv.org/abs/http://arxiv.org/abs/2501.03119v1)
**arXiv ID:** http://arxiv.org/abs/2501.03119v1

**Abstract:**
> Federated Learning (FL) is widely recognized as a privacy-preserving machine
> learning paradigm due to its model-sharing mechanism that avoids direct data
> exchange. However, model training inevitably leaves exploitable traces that can
> be used to infer sensitive information. In Decentralized FL (DFL), the overlay
> topology significantly influences its models' convergence, robustness, and
> security. This study explores the feasibility of inferring the overlay topology
> of DFL systems based solely on model behavior, introducing a novel Topology
> Inference Attack. A taxonomy of topology inference attacks is proposed,
> categorizing them by the attacker's capabilities and knowledge. Practical
> attack strategies are developed for different scenarios, and quantitative
> experiments are conducted to identify key factors influencing the attack
> effectiveness. Experimental results demonstrate that analyzing only the public
> models of individual nodes can accurately infer the DFL topology, underscoring
> the risk of sensitive information leakage in DFL systems. This finding offers
> valuable insights for improving privacy preservation in decentralized learning
> environments.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper focuses on topology inference attacks in Decentralized Federated Learning, which does not meet the MUST criteria: it does not focus on engineering, design, or optimization of prompts for Large Language Models (LLMs), nor does it investigate improving LLM performance through textual input prompts."
}

---

## [PRMBench: A Fine-grained and Challenging Benchmark for Process-Level
Reward Models](https://arxiv.org/abs/http://arxiv.org/abs/2501.03124v2)
**arXiv ID:** http://arxiv.org/abs/2501.03124v2

**Abstract:**
> Process-level Reward Models (PRMs) are crucial for complex reasoning and
> decision-making tasks, where each intermediate step plays an important role in
> the reasoning process. Since language models are prone to various types of
> errors during the reasoning process, PRMs are required to possess nuanced
> capabilities for detecting various implicit error types in real-world
> scenarios. However, current benchmarks primarily focus on step correctness,
> failing to evaluate PRMs' performance systematically. To address this gap, we
> introduce PRMBench, a process-level benchmark specifically designed to assess
> the fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216
> carefully designed problems and 83,456 step-level labels, evaluating models
> across multiple dimensions, including simplicity, soundness, and sensitivity.
> In our experiments on 15 models, spanning both open-source PRMs and
> closed-source large language models prompted as critic models, we uncover
> significant weaknesses in current PRMs. These findings underscore the
> challenges inherent in process-level evaluation and highlight key directions
> for future research. We hope PRMBench can be a robust bench for advancing
> research on PRM evaluation and development.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on developing a benchmark for evaluating Process-Level Reward Models (PRMs) and their error detection capabilities, rather than engineering, designing, or optimizing prompts specifically for Large Language Models (LLMs). While LLMs are mentioned as being used as critic models, prompt engineering is not the central concern.

---

## [Geometry Restoration and Dewarping of Camera-Captured Document Images](https://arxiv.org/abs/http://arxiv.org/abs/2501.03145v2)
**arXiv ID:** http://arxiv.org/abs/2501.03145v2

**Abstract:**
> This research focuses on developing a method for restoring the topology of
> digital images of paper documents captured by a camera, using algorithms for
> detection, segmentation, geometry restoration, and dewarping. Our methodology
> employs deep learning (DL) for document outline detection, followed by computer
> vision (CV) to create a topological 2D grid using cubic polynomial
> interpolation and correct nonlinear distortions by remapping the image. Using
> classical CV methods makes the document topology restoration process more
> efficient and faster, as it requires significantly fewer computational
> resources and memory. We developed a new pipeline for automatic document
> dewarping and reconstruction, along with a framework and annotated dataset to
> demonstrate its efficiency. Our experiments confirm the promise of our
> methodology and its superiority over existing benchmarks (including mobile apps
> and popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both
> visually and in terms of document readability via Optical Character Recognition
> (OCR) and geometry restoration metrics. This paves the way for creating
> high-quality digital copies of paper documents and enhancing the efficiency of
> OCR systems. Project page: https://github.com/HorizonParadox/DRCCBI

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on developing a method for restoring and dewarping document images using computer vision and deep learning, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.

---

## [Automated Generation of Challenging Multiple-Choice Questions for Vision
Language Model Evaluation](https://arxiv.org/abs/http://arxiv.org/abs/2501.03225v1)
**arXiv ID:** http://arxiv.org/abs/2501.03225v1

**Abstract:**
> The rapid development of vision language models (VLMs) demands rigorous and
> reliable evaluation. However, current visual question answering (VQA)
> benchmarks often depend on open-ended questions, making accurate evaluation
> difficult due to the variability in natural language responses. To address
> this, we introduce AutoConverter, an agentic framework that automatically
> converts these open-ended questions into multiple-choice format, enabling
> objective evaluation while reducing the costly question creation process. Our
> experiments demonstrate that AutoConverter can generate correct and challenging
> multiple-choice questions, with VLMs demonstrating consistently similar or
> lower accuracy on these questions compared to human-created ones. Using
> AutoConverter, we construct VMCBench, a benchmark created by transforming 20
> existing VQA datasets into a unified multiple-choice format, totaling 9,018
> questions. We comprehensively evaluate 33 state-of-the-art VLMs on VMCBench,
> setting a new standard for scalable, consistent, and reproducible VLM
> evaluation.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on evaluating Vision Language Models (VLMs) through automated generation of multiple-choice questions, rather than specifically engineering or optimizing prompts for Large Language Models (LLMs) to improve text generation, as required by the criteria.

---

## [LightGNN: Simple Graph Neural Network for Recommendation](https://arxiv.org/abs/http://arxiv.org/abs/2501.03228v2)
**arXiv ID:** http://arxiv.org/abs/2501.03228v2

**Abstract:**
> Graph neural networks (GNNs) have demonstrated superior performance in
> collaborative recommendation through their ability to conduct high-order
> representation smoothing, effectively capturing structural information within
> users' interaction patterns. However, existing GNN paradigms face significant
> challenges in scalability and robustness when handling large-scale, noisy, and
> real-world datasets. To address these challenges, we present LightGNN, a
> lightweight and distillation-based GNN pruning framework designed to
> substantially reduce model complexity while preserving essential collaboration
> modeling capabilities. Our LightGNN framework introduces a computationally
> efficient pruning module that adaptively identifies and removes redundant edges
> and embedding entries for model compression. The framework is guided by a
> resource-friendly hierarchical knowledge distillation objective, whose
> intermediate layer augments the observed graph to maintain performance,
> particularly in high-rate compression scenarios. Extensive experiments on
> public datasets demonstrate LightGNN's effectiveness, significantly improving
> both computational efficiency and recommendation accuracy. Notably, LightGNN
> achieves an 80% reduction in edge count and 90% reduction in embedding entries
> while maintaining performance comparable to more complex state-of-the-art
> baselines. The implementation of our LightGNN framework is available at the
> github repository: https://github.com/HKUDS/LightGNN.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on developing a lightweight Graph Neural Network (GNN) for recommendation systems, with no mention of Large Language Models (LLMs), prompt engineering, or textual input manipulation, thus failing to meet all 'MUST' criteria.

---

## [Gaussian Masked Autoencoders](https://arxiv.org/abs/http://arxiv.org/abs/2501.03229v1)
**arXiv ID:** http://arxiv.org/abs/2501.03229v1

**Abstract:**
> This paper explores Masked Autoencoders (MAE) with Gaussian Splatting. While
> reconstructive self-supervised learning frameworks such as MAE learns good
> semantic abstractions, it is not trained for explicit spatial awareness. Our
> approach, named Gaussian Masked Autoencoder, or GMAE, aims to learn semantic
> abstractions and spatial understanding jointly. Like MAE, it reconstructs the
> image end-to-end in the pixel space, but beyond MAE, it also introduces an
> intermediate, 3D Gaussian-based representation and renders images via
> splatting. We show that GMAE can enable various zero-shot learning capabilities
> of spatial understanding (e.g., figure-ground segmentation, image layering,
> edge detection, etc.) while preserving the high-level semantics of
> self-supervised representation quality from MAE. To our knowledge, we are the
> first to employ Gaussian primitives in an image representation learning
> framework beyond optimization-based single-scene reconstructions. We believe
> GMAE will inspire further research in this direction and contribute to
> developing next-generation techniques for modeling high-fidelity visual data.
> More details at https://brjathu.github.io/gmae

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on developing a new image representation learning framework (Gaussian Masked Autoencoders) for spatial understanding and visual data modeling, with no mention of Large Language Models (LLMs), prompt engineering, or text-based interactions, thus failing to meet all 'MUST' criteria.

---

## [Backdoor Token Unlearning: Exposing and Defending Backdoors in
Pretrained Language Models](https://arxiv.org/abs/http://arxiv.org/abs/2501.03272v1)
**arXiv ID:** http://arxiv.org/abs/2501.03272v1

**Abstract:**
> Supervised fine-tuning has become the predominant method for adapting large
> pretrained models to downstream tasks. However, recent studies have revealed
> that these models are vulnerable to backdoor attacks, where even a small number
> of malicious samples can successfully embed backdoor triggers into the model.
> While most existing defense methods focus on post-training backdoor defense,
> efficiently defending against backdoor attacks during training phase remains
> largely unexplored. To address this gap, we propose a novel defense method
> called Backdoor Token Unlearning (BTU), which proactively detects and
> neutralizes trigger tokens during the training stage. Our work is based on two
> key findings: 1) backdoor learning causes distinctive differences between
> backdoor token parameters and clean token parameters in word embedding layers,
> and 2) the success of backdoor attacks heavily depends on backdoor token
> parameters. The BTU defense leverages these properties to identify aberrant
> embedding parameters and subsequently removes backdoor behaviors using a
> fine-grained unlearning technique. Extensive evaluations across three datasets
> and four types of backdoor attacks demonstrate that BTU effectively defends
> against these threats while preserving the model's performance on primary
> tasks. Our code is available at https://github.com/XDJPH/BTU.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on defending against backdoor attacks in pretrained language models during the training phase, rather than prompt engineering for text-based interactions with Large Language Models (LLMs), failing to meet the 'MUST' criteria related to prompt engineering and manipulation of textual input prompts.

---

## [Strategic Fusion Optimizes Transformer Compression](https://arxiv.org/abs/http://arxiv.org/abs/2501.03273v1)
**arXiv ID:** http://arxiv.org/abs/2501.03273v1

**Abstract:**
> This study investigates transformer model compression by systematically
> pruning its layers. We evaluated 14 pruning strategies across nine diverse
> datasets, including 12 strategies based on different signals obtained from
> layer activations, mutual information, gradients, weights, and attention. To
> address the limitations of single-signal strategies, we introduced two fusion
> strategies, linear regression and random forest, which combine individual
> strategies (i.e., strategic fusion), for more informed pruning decisions.
> Additionally, we applied knowledge distillation to mitigate any accuracy loss
> during layer pruning. Our results reveal that random forest strategic fusion
> outperforms individual strategies in seven out of nine datasets and achieves
> near-optimal performance in the other two. The distilled random forest
> surpasses the original accuracy in six datasets and mitigates accuracy drops in
> the remaining three. Knowledge distillation also improves the accuracy-to-size
> ratio by an average factor of 18.84 across all datasets. Supported by
> mathematical foundations and biological analogies, our findings suggest that
> strategically combining multiple signals can lead to efficient, high-performing
> transformer models for resource-constrained applications.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper primarily focuses on transformer model compression through layer pruning and knowledge distillation, not on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance."
}

---

## [ComMer: a Framework for Compressing and Merging User Data for
Personalization](https://arxiv.org/abs/http://arxiv.org/abs/2501.03276v1)
**arXiv ID:** http://arxiv.org/abs/2501.03276v1

**Abstract:**
> Large Language Models (LLMs) excel at a wide range of tasks, but adapting
> them to new data, particularly for personalized applications, poses significant
> challenges due to resource and computational constraints. Existing methods
> either rely on exposing fresh data to the model through the prompt, which is
> limited by context size and computationally expensive at inference time, or
> fine-tuning, which incurs substantial training and update costs. In this paper,
> we introduce ComMer - Compress and Merge - a novel framework that efficiently
> personalizes LLMs by compressing users' documents into compact representations,
> which are then merged and fed into a frozen LLM. We evaluate ComMer on two
> types of personalization tasks - personalized skill learning, using the tweet
> paraphrasing dataset and the personalized news headline generation dataset from
> the LaMP benchmark, and knowledge-intensive, using the PerLTQA dataset. Our
> experiments demonstrate that in constrained inference budget scenarios ComMer
> achieves superior quality in skill learning tasks, while highlighting
> limitations in knowledge-intensive settings due to the loss of detailed
> information. These results offer insights into trade-offs and potential
> optimizations in multi-document compression for personalization.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on developing a framework for compressing and merging user data to personalize Large Language Models (LLMs), rather than on the engineering, design, or optimization of prompts specifically for LLMs. It does not investigate or propose methods for improving LLM performance through the manipulation of textual input prompts.

---

## [CodeVision: Detecting LLM-Generated Code Using 2D Token Probability Maps
and Vision Models](https://arxiv.org/abs/http://arxiv.org/abs/2501.03288v1)
**arXiv ID:** http://arxiv.org/abs/2501.03288v1

**Abstract:**
> The rise of large language models (LLMs) like ChatGPT has significantly
> improved automated code generation, enhancing software development efficiency.
> However, this introduces challenges in academia, particularly in distinguishing
> between human-written and LLM-generated code, which complicates issues of
> academic integrity. Existing detection methods, such as pre-trained models and
> watermarking, face limitations in adaptability and computational efficiency. In
> this paper, we propose a novel detection method using 2D token probability maps
> combined with vision models, preserving spatial code structures such as
> indentation and brackets. By transforming code into log probability matrices
> and applying vision models like Vision Transformers (ViT) and ResNet, we
> capture both content and structure for more accurate detection. Our method
> shows robustness across multiple programming languages and improves upon
> traditional detectors, offering a scalable and computationally efficient
> solution for identifying LLM-generated code.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on detecting LLM-generated code using vision models, rather than on the engineering, design, or optimization of prompts for Large Language Models (LLMs), failing to meet the primary criteria of focusing on prompt engineering for text-based interactions with LLMs.

---

## [Analyzing Bias in Swiss Federal Supreme Court Judgments Using Facebook's
Holistic Bias Dataset: Implications for Language Model Training](https://arxiv.org/abs/http://arxiv.org/abs/2501.03324v1)
**arXiv ID:** http://arxiv.org/abs/2501.03324v1

**Abstract:**
> Natural Language Processing (NLP) is vital for computers to process and
> respond accurately to human language. However, biases in training data can
> introduce unfairness, especially in predicting legal judgment. This study
> focuses on analyzing biases within the Swiss Judgment Prediction Dataset
> (SJP-Dataset). Our aim is to ensure unbiased factual descriptions essential for
> fair decision making by NLP models in legal contexts. We analyze the dataset
> using social bias descriptors from the Holistic Bias dataset and employ
> advanced NLP techniques, including attention visualization, to explore the
> impact of dispreferred descriptors on model predictions. The study identifies
> biases and examines their influence on model behavior. Challenges include
> dataset imbalance and token limits affecting model performance.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on analyzing biases in a legal dataset for fair NLP model training, not on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), and does not provide concrete examples of prompt manipulation impacting LLM output.

---

## [Existential Crisis: A Social Robot's Reason for Being](https://arxiv.org/abs/http://arxiv.org/abs/2501.03376v1)
**arXiv ID:** http://arxiv.org/abs/2501.03376v1

**Abstract:**
> As Robots become ever more important in our daily lives there's growing need
> for understanding how they're perceived by people. This study aims to
> investigate how the user perception of robots is influenced by displays of
> personality. Using LLMs and speech to text technology, we designed a
> within-subject study to compare two conditions: a personality-driven robot and
> a purely task-oriented, personality-neutral robot. Twelve participants,
> recruited from Socially Intelligent Robotics course at Vrije Universiteit
> Amsterdam, interacted with a robot Nao tasked with asking them a set of medical
> questions under both conditions. After completing both interactions, the
> participants completed a user experience questionnaire measuring their
> emotional states and robot perception using standardized questionnaires from
> the SRI and Psychology literature.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on user perception of robots with displayed personality, utilizing LLMs as a tool, but prompt engineering for Large Language Models is not the primary concern; the core subject revolves around social robotics and user experience, not the manipulation of textual input prompts to improve LLM performance.

---

## [Over-the-Air Fair Federated Learning via Multi-Objective Optimization](https://arxiv.org/abs/http://arxiv.org/abs/2501.03392v1)
**arXiv ID:** http://arxiv.org/abs/2501.03392v1

**Abstract:**
> In federated learning (FL), heterogeneity among the local dataset
> distributions of clients can result in unsatisfactory performance for some,
> leading to an unfair model. To address this challenge, we propose an
> over-the-air fair federated learning algorithm (OTA-FFL), which leverages
> over-the-air computation to train fair FL models. By formulating FL as a
> multi-objective minimization problem, we introduce a modified Chebyshev
> approach to compute adaptive weighting coefficients for gradient aggregation in
> each communication round. To enable efficient aggregation over the multiple
> access channel, we derive analytical solutions for the optimal transmit scalars
> at the clients and the de-noising scalar at the parameter server. Extensive
> experiments demonstrate the superiority of OTA-FFL in achieving fairness and
> robust performance compared to existing methods.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on federated learning, multi-objective optimization, and over-the-air computation, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.

---

## [Enhanced Importance Sampling through Latent Space Exploration in
Normalizing Flows](https://arxiv.org/abs/http://arxiv.org/abs/2501.03394v1)
**arXiv ID:** http://arxiv.org/abs/2501.03394v1

**Abstract:**
> Importance sampling is a rare event simulation technique used in Monte Carlo
> simulations to bias the sampling distribution towards the rare event of
> interest. By assigning appropriate weights to sampled points, importance
> sampling allows for more efficient estimation of rare events or tails of
> distributions. However, importance sampling can fail when the proposal
> distribution does not effectively cover the target distribution. In this work,
> we propose a method for more efficient sampling by updating the proposal
> distribution in the latent space of a normalizing flow. Normalizing flows learn
> an invertible mapping from a target distribution to a simpler latent
> distribution. The latent space can be more easily explored during the search
> for a proposal distribution, and samples from the proposal distribution are
> recovered in the space of the target distribution via the invertible mapping.
> We empirically validate our methodology on simulated robotics applications such
> as autonomous racing and aircraft ground collision avoidance.

**Decision Explanation:**
Original decision: REJECT
The paper does not meet the criteria as it focuses on enhancing importance sampling in Monte Carlo simulations using normalizing flows, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus failing all 'MUST' criteria.

---

## [BoundingDocs: a Unified Dataset for Document Question Answering with
Spatial Annotations](https://arxiv.org/abs/http://arxiv.org/abs/2501.03403v1)
**arXiv ID:** http://arxiv.org/abs/2501.03403v1

**Abstract:**
> We present a unified dataset for document Question-Answering (QA), which is
> obtained combining several public datasets related to Document AI and visually
> rich document understanding (VRDU). Our main contribution is twofold: on the
> one hand we reformulate existing Document AI tasks, such as Information
> Extraction (IE), into a Question-Answering task, making it a suitable resource
> for training and evaluating Large Language Models; on the other hand, we
> release the OCR of all the documents and include the exact position of the
> answer to be found in the document image as a bounding box. Using this dataset,
> we explore the impact of different prompting techniques (that might include
> bounding box information) on the performance of open-weight models, identifying
> the most effective approaches for document comprehension.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "Although the paper mentions exploring the impact of different prompting techniques on LLM performance, its primary focus is on presenting a unified dataset for document Question Answering, making prompt engineering a secondary concern rather than the central focus."
}

---

## [SALT: Sales Autocompletion Linked Business Tables Dataset](https://arxiv.org/abs/http://arxiv.org/abs/2501.03413v1)
**arXiv ID:** http://arxiv.org/abs/2501.03413v1

**Abstract:**
> Foundation models, particularly those that incorporate Transformer
> architectures, have demonstrated exceptional performance in domains such as
> natural language processing and image processing. Adapting these models to
> structured data, like tables, however, introduces significant challenges. These
> difficulties are even more pronounced when addressing multi-table data linked
> via foreign key, which is prevalent in the enterprise realm and crucial for
> empowering business use cases. Despite its substantial impact, research
> focusing on such linked business tables within enterprise settings remains a
> significantly important yet underexplored domain. To address this, we introduce
> a curated dataset sourced from an Enterprise Resource Planning (ERP) system,
> featuring extensive linked tables. This dataset is specifically designed to
> support research endeavors in table representation learning. By providing
> access to authentic enterprise data, our goal is to potentially enhance the
> effectiveness and applicability of models for real-world business contexts.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper does not meet the 'MUST' criteria, as its primary focus is on introducing a dataset for table representation learning in enterprise settings, rather than engineering, designing, or optimizing prompts specifically for Large Language Models (LLMs), and does not demonstrate the impact of textual input prompts on LLM output."
}

---

## [Optimization Learning](https://arxiv.org/abs/http://arxiv.org/abs/2501.03443v1)
**arXiv ID:** http://arxiv.org/abs/2501.03443v1

**Abstract:**
> This article introduces the concept of optimization learning, a methodology
> to design optimization proxies that learn the input/output mapping of
> parametric optimization problems. These optimization proxies are trustworthy by
> design: they compute feasible solutions to the underlying optimization
> problems, provide quality guarantees on the returned solutions, and scale to
> large instances. Optimization proxies are differentiable programs that combine
> traditional deep learning technology with repair or completion layers to
> produce feasible solutions. The article shows that optimization proxies can be
> trained end-to-end in a self-supervised way. It presents methodologies to
> provide performance guarantees and to scale optimization proxies to large-scale
> optimization problems. The potential of optimization proxies is highlighted
> through applications in power systems and, in particular, real-time risk
> assessment and security-constrained optimal power flow.

**Decision Explanation:**
Original decision: REJECT
The paper does not focus on prompt engineering, design, or optimization for Large Language Models (LLMs). Instead, it introduces 'optimization learning' for solving parametric optimization problems, unrelated to LLMs or text-based interactions.

---

## [LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification
and Tagging](https://arxiv.org/abs/http://arxiv.org/abs/2501.03464v1)
**arXiv ID:** http://arxiv.org/abs/2501.03464v1

**Abstract:**
> Transformers have set new benchmarks in audio processing tasks, leveraging
> self-attention mechanisms to capture complex patterns and dependencies within
> audio data. However, their focus on pairwise interactions limits their ability
> to process the higher-order relations essential for identifying distinct audio
> objects. To address this limitation, this work introduces the Local-Higher
> Order Graph Neural Network (LHGNN), a graph based model that enhances feature
> understanding by integrating local neighbourhood information with higher-order
> data from Fuzzy C-Means clusters, thereby capturing a broader spectrum of audio
> relationships. Evaluation of the model on three publicly available audio
> datasets shows that it outperforms Transformer-based models across all
> benchmarks while operating with substantially fewer parameters. Moreover, LHGNN
> demonstrates a distinct advantage in scenarios lacking ImageNet pretraining,
> establishing its effectiveness and efficiency in environments where extensive
> pretraining data is unavailable.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on developing a new graph neural network model (LHGNN) for audio classification and tagging, primarily dealing with audio data and disregarding Large Language Models (LLMs) and text-based prompt engineering, thus failing to meet all the 'MUST' criteria.

---

## [MTRAG: A Multi-Turn Conversational Benchmark for Evaluating
Retrieval-Augmented Generation Systems](https://arxiv.org/abs/http://arxiv.org/abs/2501.03468v1)
**arXiv ID:** http://arxiv.org/abs/2501.03468v1

**Abstract:**
> Retrieval-augmented generation (RAG) has recently become a very popular task
> for Large Language Models (LLMs). Evaluating them on multi-turn RAG
> conversations, where the system is asked to generate a response to a question
> in the context of a preceding conversation is an important and often overlooked
> task with several additional challenges. We present MTRAG: an end-to-end
> human-generated multi-turn RAG benchmark that reflects several real-world
> properties across diverse dimensions for evaluating the full RAG pipeline.
> MTRAG contains 110 conversations averaging 7.7 turns each across four domains
> for a total of 842 tasks. We also explore automation paths via synthetic data
> and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that
> even state-of-the-art LLM RAG systems struggle on MTRAG. We demonstrate the
> need for strong retrieval and generation systems that can handle later turns,
> unanswerable questions, non-standalone questions, and multiple domains. MTRAG
> is available at https://github.com/ibm/mt-rag-benchmark.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on introducing a benchmark (MTRAG) for evaluating Retrieval-Augmented Generation Systems, rather than specifically focusing on the engineering, design, or optimization of prompts for Large Language Models (LLMs) and demonstrating their impact on LLM output.

---

## [Reading with Intent -- Neutralizing Intent](https://arxiv.org/abs/http://arxiv.org/abs/2501.03475v1)
**arXiv ID:** http://arxiv.org/abs/2501.03475v1

**Abstract:**
> Queries to large language models (LLMs) can be divided into two parts: the
> instruction/question and the accompanying context. The context for
> retrieval-augmented generation (RAG) systems in most benchmarks comes from
> Wikipedia or Wikipedia-like texts which are written in a neutral and factual
> tone. However, when RAG systems retrieve internet-based content, they encounter
> text with diverse tones and linguistic styles, introducing challenges for
> downstream tasks. The Reading with Intent task addresses this issue by
> evaluating how varying tones in context passages affect model performance.
> Building on prior work that focused on sarcasm, we extend this paradigm by
> constructing a dataset where context passages are transformed to 11 distinct
> emotions using a better synthetic data generation approach. Using this dataset,
> we train an emotion translation model to systematically adapt passages to
> specified emotional tones. The human evaluation shows that the LLM fine-tuned
> to become the emotion-translator benefited from the synthetically generated
> data. Finally, the emotion-translator is used in the Reading with Intent task
> to transform the passages to a neutral tone. By neutralizing the passages, it
> mitigates the challenges posed by sarcastic passages and improves overall
> results on this task by about 3%.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper primarily focuses on training an emotion translation model to adapt passages to specified emotional tones and fine-tuning an LLM for emotion translation, rather than the engineering, design, or optimization of prompts specifically for LLMs. While it mentions improving results on the Reading with Intent task by neutralizing passages, the core subject is emotion adaptation in LLM training, not prompt engineering for text-based interactions with LLMs."
}

---

## [Align-Pro: A Principled Approach to Prompt Optimization for LLM
Alignment](https://arxiv.org/abs/http://arxiv.org/abs/2501.03486v1)
**arXiv ID:** http://arxiv.org/abs/2501.03486v1

**Abstract:**
> The alignment of large language models (LLMs) with human values is critical
> as these models become increasingly integrated into various societal and
> decision-making processes. Traditional methods, such as reinforcement learning
> from human feedback (RLHF), achieve alignment by fine-tuning model parameters,
> but these approaches are often computationally expensive and impractical when
> models are frozen or inaccessible for parameter modification. In contrast,
> prompt optimization is a viable alternative to RLHF for LLM alignment. While
> the existing literature has shown empirical promise of prompt optimization, its
> theoretical underpinning remains under-explored. We address this gap by
> formulating prompt optimization as an optimization problem and try to provide
> theoretical insights into the optimality of such a framework. To analyze the
> performance of the prompt optimization, we study theoretical suboptimality
> bounds and provide insights in terms of how prompt optimization depends upon
> the given prompter and target model. We also provide empirical validation
> through experiments on various datasets, demonstrating that prompt optimization
> can effectively align LLMs, even when parameter fine-tuning is not feasible.

**Decision Explanation:**
Original decision: REJECT
Although the paper discusses prompt optimization for LLM alignment, its primary focus is on LLM alignment through a theoretical optimization framework, rather than specifically on engineering, design, or optimization of prompts for improving LLM performance through textual input manipulation, as required.

---

## [Can Deep Learning Trigger Alerts from Mobile-Captured Images?](https://arxiv.org/abs/http://arxiv.org/abs/2501.03499v1)
**arXiv ID:** http://arxiv.org/abs/2501.03499v1

**Abstract:**
> Our research presents a comprehensive approach to leveraging mobile camera
> image data for real-time air quality assessment and recommendation. We develop
> a regression-based Convolutional Neural Network model and tailor it explicitly
> for air quality prediction by exploiting the inherent relationship between
> output parameters. As a result, the Mean Squared Error of 0.0077 and 0.0112
> obtained for 2 and 5 pollutants respectively outperforms existing models.
> Furthermore, we aim to verify the common practice of augmenting the original
> dataset with a view to introducing more variation in the training phase. It is
> one of our most significant contributions that our experimental results
> demonstrate minimal accuracy differences between the original and augmented
> datasets. Finally, a real-time, user-friendly dashboard is implemented which
> dynamically displays the Air Quality Index and pollutant values derived from
> captured mobile camera images. Users' health conditions are considered to
> recommend whether a location is suitable based on current air quality metrics.
> Overall, this research contributes to verification of data augmentation
> techniques, CNN-based regression modelling for air quality prediction, and
> user-centric air quality monitoring through mobile technology. The proposed
> system offers practical solutions for individuals to make informed
> environmental health and well-being decisions.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on leveraging mobile camera images for air quality assessment using a Convolutional Neural Network model, with no mention of Large Language Models (LLMs), prompt engineering, or text-based interactions, thereby failing to meet all 'MUST' criteria.

---

## [Vocal Tract Length Warped Features for Spoken Keyword Spotting](https://arxiv.org/abs/http://arxiv.org/abs/2501.03523v1)
**arXiv ID:** http://arxiv.org/abs/2501.03523v1

**Abstract:**
> In this paper, we propose several methods that incorporate vocal tract length
> (VTL) warped features for spoken keyword spotting (KWS). The first method,
> VTL-independent KWS, involves training a single deep neural network (DNN) that
> utilizes VTL features with various warping factors. During training, a specific
> VTL feature is randomly selected per epoch, allowing the exploration of VTL
> variations. During testing, the VTL features with different warping factors of
> a test utterance are scored against the DNN and combined with equal weight.
> The second method scores the conventional features of a test utterance (without
> VTL warping) against the DNN. The third method, VTL-concatenation KWS,
> concatenates VTL warped features to form high-dimensional features for KWS.
> Evaluations carried out on the English Google Command dataset demonstrate that
> the proposed methods improve the accuracy of KWS.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on spoken keyword spotting using vocal tract length warped features with deep neural networks, which does not meet the criteria of primarily focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) or investigating methods for improving LLM performance through textual input prompt manipulation.

---

## [Deep Learning within Tabular Data: Foundations, Challenges, Advances and
Future Directions](https://arxiv.org/abs/http://arxiv.org/abs/2501.03540v1)
**arXiv ID:** http://arxiv.org/abs/2501.03540v1

**Abstract:**
> Tabular data remains one of the most prevalent data types across a wide range
> of real-world applications, yet effective representation learning for this
> domain poses unique challenges due to its irregular patterns, heterogeneous
> feature distributions, and complex inter-column dependencies. This survey
> provides a comprehensive review of state-of-the-art techniques in tabular data
> representation learning, structured around three foundational design elements:
> training data, neural architectures, and learning objectives. Unlike prior
> surveys that focus primarily on either architecture design or learning
> strategies, we adopt a holistic perspective that emphasizes the universality
> and robustness of representation learning methods across diverse downstream
> tasks. We examine recent advances in data augmentation and generation,
> specialized neural network architectures tailored to tabular data, and
> innovative learning objectives that enhance representation quality.
> Additionally, we highlight the growing influence of self-supervised learning
> and the adaptation of transformer-based foundation models for tabular data. Our
> review is based on a systematic literature search using rigorous inclusion
> criteria, encompassing 127 papers published since 2020 in top-tier conferences
> and journals. Through detailed analysis and comparison, we identify emerging
> trends, critical gaps, and promising directions for future research, aiming to
> guide the development of more generalizable and effective tabular data
> representation methods.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on tabular data representation learning using deep learning techniques, with no apparent connection to Large Language Models (LLMs) or prompt engineering for text-based interactions, thus failing to meet the primary 'MUST' criteria.

---

## [PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for
Text-to-Image Models](https://arxiv.org/abs/http://arxiv.org/abs/2501.03544v1)
**arXiv ID:** http://arxiv.org/abs/2501.03544v1

**Abstract:**
> Text-to-image (T2I) models have been shown to be vulnerable to misuse,
> particularly in generating not-safe-for-work (NSFW) content, raising serious
> ethical concerns. In this work, we present PromptGuard, a novel content
> moderation technique that draws inspiration from the system prompt mechanism in
> large language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack
> a direct interface for enforcing behavioral guidelines. Our key idea is to
> optimize a safety soft prompt that functions as an implicit system prompt
> within the T2I model's textual embedding space. This universal soft prompt (P*)
> directly moderates NSFW inputs, enabling safe yet realistic image generation
> without altering the inference efficiency or requiring proxy models. Extensive
> experiments across three datasets demonstrate that PromptGuard effectively
> mitigates NSFW content generation while preserving high-quality benign outputs.
> PromptGuard achieves 7.8 times faster than prior content moderation methods,
> surpassing eight state-of-the-art defenses with an optimal unsafe ratio down to
> 5.84%.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on text-to-image models and content moderation, rather than the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) as required, violating the 'MUST NOT' criteria concerning applications of generative AI other than text generation driven by LLMs.

---

## [Rethinking Adversarial Attacks in Reinforcement Learning from Policy
Distribution Perspective](https://arxiv.org/abs/http://arxiv.org/abs/2501.03562v2)
**arXiv ID:** http://arxiv.org/abs/2501.03562v2

**Abstract:**
> Deep Reinforcement Learning (DRL) suffers from uncertainties and inaccuracies
> in the observation signal in real-world applications. Adversarial attack is an
> effective method for evaluating the robustness of DRL agents. However, existing
> attack methods targeting individual sampled actions have limited impacts on the
> overall policy distribution, particularly in continuous action spaces. To
> address these limitations, we propose the Distribution-Aware Projected Gradient
> Descent attack (DAPGD). DAPGD uses distribution similarity as the gradient
> perturbation input to attack the policy network, which leverages the entire
> policy distribution rather than relying on individual samples. We utilize the
> Bhattacharyya distance in DAPGD to measure policy similarity, enabling
> sensitive detection of subtle but critical differences between probability
> distributions. Our experiment results demonstrate that DAPGD achieves SOTA
> results compared to the baselines in three robot navigation tasks, achieving an
> average 22.03% higher reward drop compared to the best baseline.

**Decision Explanation:**
Original decision: REJECT
The paper does not meet the 'MUST' criteria: it focuses on Reinforcement Learning, adversarial attacks, and policy distribution, with no primary emphasis on engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate improving LLM performance through textual input prompt manipulation.

---

## [Applying Large Language Models in Knowledge Graph-based Enterprise
Modeling: Challenges and Opportunities](https://arxiv.org/abs/http://arxiv.org/abs/2501.03566v1)
**arXiv ID:** http://arxiv.org/abs/2501.03566v1

**Abstract:**
> The role of large language models (LLMs) in enterprise modeling has recently
> started to shift from academic research to that of industrial applications.
> Thereby, LLMs represent a further building block for the machine-supported
> generation of enterprise models. In this paper we employ a knowledge
> graph-based approach for enterprise modeling and investigate the potential
> benefits of LLMs in this context. In addition, the findings of an expert survey
> and ChatGPT-4o-based experiments demonstrate that LLM-based model generations
> exhibit minimal variability, yet remain constrained to specific tasks, with
> reliability declining for more intricate tasks. The survey results further
> suggest that the supervision and intervention of human modeling experts are
> essential to ensure the accuracy and integrity of the generated models.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper's primary focus is on applying LLMs in enterprise modeling using knowledge graphs, not on prompt engineering for text-based interactions with LLMs, failing to meet the core subject requirement."
}

---

## [RecKG: Knowledge Graph for Recommender Systems](https://arxiv.org/abs/http://arxiv.org/abs/2501.03598v1)
**arXiv ID:** http://arxiv.org/abs/2501.03598v1

**Abstract:**
> Knowledge graphs have proven successful in integrating heterogeneous data
> across various domains. However, there remains a noticeable dearth of research
> on their seamless integration among heterogeneous recommender systems, despite
> knowledge graph-based recommender systems garnering extensive research
> attention. This study aims to fill this gap by proposing RecKG, a standardized
> knowledge graph for recommender systems. RecKG ensures the consistent
> representation of entities across different datasets, accommodating diverse
> attribute types for effective data integration. Through a meticulous
> examination of various recommender system datasets, we select attributes for
> RecKG, ensuring standardized formatting through consistent naming conventions.
> By these characteristics, RecKG can seamlessly integrate heterogeneous data
> sources, enabling the discovery of additional semantic information within the
> integrated knowledge graph. We apply RecKG to standardize real-world datasets,
> subsequently developing an application for RecKG using a graph database.
> Finally, we validate RecKG's achievement in interoperability through a
> qualitative evaluation between RecKG and other studies.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on integrating knowledge graphs for recommender systems, not on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), thus failing to meet the 'MUST' criteria 1 and 2.

---

## [MHGNet: Multi-Heterogeneous Graph Neural Network for Traffic Prediction](https://arxiv.org/abs/http://arxiv.org/abs/2501.03635v1)
**arXiv ID:** http://arxiv.org/abs/2501.03635v1

**Abstract:**
> In recent years, traffic flow prediction has played a crucial role in the
> management of intelligent transportation systems. However, traditional
> forecasting methods often model non-Euclidean low-dimensional traffic data as a
> simple graph with single-type nodes and edges, failing to capture similar
> trends among nodes of the same type. To address this limitation, this paper
> proposes MHGNet, a novel framework for modeling spatiotemporal
> multi-heterogeneous graphs. Within this framework, the STD Module decouples
> single-pattern traffic data into multi-pattern traffic data through feature
> mappings of timestamp embedding matrices and node embedding matrices.
> Subsequently, the Node Clusterer leverages the Euclidean distance between nodes
> and different types of limit points to perform clustering with O(N) time
> complexity. The nodes within each cluster undergo residual subgraph convolution
> within the spatiotemporal fusion subgraphs generated by the DSTGG Module,
> followed by processing in the SIE Module for node repositioning and
> redistribution of weights. To validate the effectiveness of MHGNet, this paper
> conducts extensive ablation studies and quantitative evaluations on four widely
> used benchmarks, demonstrating its superior performance.

**Decision Explanation:**
Original decision: REJECT
The paper does not meet the primary focus criteria, as it concentrates on traffic prediction using Multi-Heterogeneous Graph Neural Networks, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thus failing to satisfy the 'MUST' criteria.

---

## [Effective and Efficient Mixed Precision Quantization of Speech
Foundation Models](https://arxiv.org/abs/http://arxiv.org/abs/2501.03643v2)
**arXiv ID:** http://arxiv.org/abs/2501.03643v2

**Abstract:**
> This paper presents a novel mixed-precision quantization approach for speech
> foundation models that tightly integrates mixed-precision learning and
> quantized model parameter estimation into one single model compression stage.
> Experiments conducted on LibriSpeech dataset with fine-tuned wav2vec2.0-base
> and HuBERT-large models suggest the resulting mixed-precision quantized models
> increased the lossless compression ratio by factors up to 1.7x and 1.9x over
> the respective uniform-precision and two-stage mixed-precision quantized
> baselines that perform precision learning and model parameters quantization in
> separate and disjointed stages, while incurring no statistically significant
> word error rate (WER) increase over the 32-bit full-precision models. The system
> compression time of wav2vec2.0-base and HuBERT-large models is reduced by up to
> 1.9 and 1.5 times over the two-stage mixed-precision baselines, while both
> produce lower WERs. The best-performing 3.5-bit mixed-precision quantized
> HuBERT-large model produces a lossless compression ratio of 8.6x over the
> 32-bit full-precision system.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper primarily focuses on model compression through mixed-precision quantization of speech foundation models, not on prompt engineering or the manipulation of textual input prompts for Large Language Models (LLMs), thus failing to meet the 'MUST' criteria."
}

---

## [A Diversity-Enhanced Knowledge Distillation Model for Practical Math
Word Problem Solving](https://arxiv.org/abs/http://arxiv.org/abs/2501.03670v1)
**arXiv ID:** http://arxiv.org/abs/2501.03670v1

**Abstract:**
> Math Word Problem (MWP) solving is a critical task in natural language
> processing that has garnered significant research interest in recent years. Various
> recent studies heavily rely on Seq2Seq models and their extensions (e.g.,
> Seq2Tree and Graph2Tree) to generate mathematical equations. While effective,
> these models struggle to generate diverse but counterpart solution equations,
> limiting their generalization across various math problem scenarios. In this
> paper, we introduce a novel Diversity-enhanced Knowledge Distillation (DivKD)
> model for practical MWP solving. Our approach proposes an adaptive diversity
> distillation method, in which a student model learns diverse equations by
> selectively transferring high-quality knowledge from a teacher model.
> Additionally, we design a diversity prior-enhanced student model to better
> capture the diversity distribution of equations by incorporating a conditional
> variational auto-encoder. Extensive experiments on four MWP benchmark
> datasets demonstrate that our approach achieves higher answer accuracy than
> strong baselines while maintaining high efficiency for practical applications.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper focuses primarily on developing a new model (Diversity-enhanced Knowledge Distillation) for math word problem solving, rather than engineering or optimizing prompts specifically for Large Language Models (LLMs). Prompt engineering is not the central concern, and the paper does not demonstrate the impact of textual input prompts on LLM output."
}

---

## [SALE-Based Offline Reinforcement Learning with Ensemble Q-Networks](https://arxiv.org/abs/http://arxiv.org/abs/2501.03676v2)
**arXiv ID:** http://arxiv.org/abs/2501.03676v2

**Abstract:**
> In this work, we build upon the offline reinforcement learning algorithm TD7,
> which incorporates State-Action Learned Embeddings (SALE) and a prioritized
> experience replay buffer (LAP). We propose a model-free actor-critic algorithm
> that integrates ensemble Q-networks and a gradient diversity penalty from EDAC.
> The ensemble Q-networks introduce penalties to guide the actor network toward
> in-distribution actions, effectively addressing the challenge of
> out-of-distribution actions. Meanwhile, the gradient diversity penalty
> encourages diverse Q-value gradients, further suppressing overestimation for
> out-of-distribution actions. Additionally, our method retains an adjustable
> behavior cloning (BC) term that directs the actor network toward dataset
> actions during early training stages, while gradually reducing its influence as
> the precision of the Q-ensemble improves. These enhancements work
> synergistically to improve the stability and precision of the training.
> Experimental results on the D4RL MuJoCo benchmarks demonstrate that our
> algorithm achieves higher convergence speed, stability, and performance
> compared to existing methods.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on offline reinforcement learning with ensemble Q-networks, disregarding the mandatory criteria of concentrating on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), and manipulating textual input prompts to improve LLM performance.

---

## [SLAM: Towards Efficient Multilingual Reasoning via Selective Language
Alignment](https://arxiv.org/abs/http://arxiv.org/abs/2501.03681v1)
**arXiv ID:** http://arxiv.org/abs/2501.03681v1

**Abstract:**
> Despite the significant improvements achieved by large language models (LLMs)
> in English reasoning tasks, these models continue to struggle with multilingual
> reasoning. Recent studies leverage a full-parameter and two-stage training
> paradigm to teach models to first understand non-English questions and then
> reason. However, this method suffers from both substantial computational
> resource computing and catastrophic forgetting. The fundamental cause is that,
> with the primary goal of enhancing multilingual comprehension, an excessive
> number of irrelevant layers and parameters are tuned during the first stage.
> Given our findings that the representation learning of languages is merely
> conducted in lower-level layers, we propose an efficient multilingual reasoning
> alignment approach that precisely identifies and fine-tunes the layers
> responsible for handling multilingualism. Experimental results show that our
> method, SLAM, only tunes 6 layers' feed-forward sub-layers including 6.5-8% of
> all parameters within 7B and 13B LLMs, achieving superior average performance
> than all strong baselines across 10 languages. Meanwhile, SLAM only involves
> one training stage, reducing training time by 4.1-11.9 compared to the
> two-stage method.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on efficient multilingual reasoning via selective language alignment, involving a one-stage training method for LLMs, which aligns more with the development of new training methods (violation of MUST NOT 1) rather than the engineering, design, or optimization of prompts for Large Language Models.

---

## [Exploring Molecule Generation Using Latent Space Graph Diffusion](https://arxiv.org/abs/http://arxiv.org/abs/2501.03696v1)
**arXiv ID:** http://arxiv.org/abs/2501.03696v1

**Abstract:**
> Generating molecular graphs is a challenging task due to their discrete
> nature and the competitive objectives involved. Diffusion models have emerged
> as SOTA approaches in data generation across various modalities. For molecular
> graphs, graph neural networks (GNNs) as a diffusion backbone have achieved
> impressive results. Latent space diffusion, where diffusion occurs in a
> low-dimensional space via an autoencoder, has demonstrated computational
> efficiency. However, the literature on latent space diffusion for molecular
> graphs is scarce, and no commonly accepted best practices exist. In this work,
> we explore different approaches and hyperparameters, contrasting generative
> flow models (denoising diffusion, flow matching, heat dissipation) and
> architectures (GNNs and E(3)-equivariant GNNs). Our experiments reveal a high
> sensitivity to the choice of approach and design decisions. Code is made
> available at
> github.com/Prashanth-Pombala/Molecule-Generation-using-Latent-Space-Graph-Diffusion.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper focuses on generating molecular graphs using latent space graph diffusion, involving graph neural networks and diffusion models, without any mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thus failing to meet all 'MUST' criteria."
}

---

## [Unsupervised Speech Segmentation: A General Approach Using Speech
Language Models](https://arxiv.org/abs/http://arxiv.org/abs/2501.03711v1)
**arXiv ID:** http://arxiv.org/abs/2501.03711v1

**Abstract:**
> In this paper, we introduce an unsupervised approach for Speech Segmentation,
> which builds on previously researched approaches, e.g., Speaker Diarization,
> while being applicable to an inclusive set of acoustic-semantic distinctions,
> paving a path towards a general Unsupervised Speech Segmentation approach.
> Unlike traditional speech and audio segmentation, which mainly focuses on
> spectral changes in the input signal, e.g., phone segmentation, our approach
> tries to segment the spoken utterance into chunks with differing
> acoustic-semantic styles, focusing on acoustic-semantic information that does
> not translate well into text, e.g., emotion or speaker. While most Speech
> Segmentation tasks only handle one style change, e.g., emotion diarization, our
> approach tries to handle multiple acoustic-semantic style changes. Leveraging
> recent advances in Speech Language Models (SLMs), we propose a simple
> unsupervised method to segment a given speech utterance. We empirically
> demonstrate the effectiveness of the proposed approach by considering several
> setups. Results suggest that the proposed method is superior to the evaluated
> baselines on boundary detection, segment purity, and over-segmentation. Code is
> available at
> https://github.com/avishaiElmakies/unsupervised_speech_segmentation_using_slm.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on unsupervised speech segmentation using Speech Language Models (SLMs), which does not meet the criteria of primarily focusing on the engineering, design, or optimization of prompts for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts for improving LLM performance.

---

## [Materialist: Physically Based Editing Using Single-Image Inverse
Rendering](https://arxiv.org/abs/http://arxiv.org/abs/2501.03717v1)
**arXiv ID:** http://arxiv.org/abs/2501.03717v1

**Abstract:**
> To perform image editing based on single-view, inverse physically based
> rendering, we present a method combining a learning-based approach with
> progressive differentiable rendering. Given an image, our method leverages
> neural networks to predict initial material properties. Progressive
> differentiable rendering is then used to optimize the environment map and
> refine the material properties with the goal of closely matching the rendered
> result to the input image. We require only a single image while other inverse
> rendering methods based on the rendering equation require multiple views. In
> comparison to single-view methods that rely on neural renderers, our approach
> achieves more realistic light material interactions, accurate shadows, and
> global illumination. Furthermore, with optimized material properties and
> illumination, our method enables a variety of tasks, including physically based
> material editing, object insertion, and relighting. We also propose a method
> for material transparency editing that operates effectively without requiring
> full scene geometry. Compared with methods based on Stable Diffusion, our
> approach offers stronger interpretability and more realistic light refraction
> based on empirical results.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on image editing using single-image inverse physically based rendering, involving neural networks and differentiable rendering for image generation, which falls under image generation and not text-based interactions with Large Language Models (LLMs), thus failing to meet the primary criteria of focusing on prompt engineering for LLMs.

---

## [SelectiveFinetuning: Enhancing Transfer Learning in Sleep Staging
through Selective Domain Alignment](https://arxiv.org/abs/http://arxiv.org/abs/2501.03764v1)
**arXiv ID:** http://arxiv.org/abs/2501.03764v1

**Abstract:**
> In practical sleep stage classification, a key challenge is the variability
> of EEG data across different subjects and environments. Differences in
> physiology, age, health status, and recording conditions can lead to domain
> shifts between data. These domain shifts often result in decreased model
> accuracy and reliability, particularly when the model is applied to new data
> with characteristics different from those it was originally trained on, which
> is a typical manifestation of negative transfer. To address this, we propose
> SelectiveFinetuning in this paper. Our method utilizes a pretrained Multi
> Resolution Convolutional Neural Network (MRCNN) to extract EEG features,
> capturing the distinctive characteristics of different sleep stages. To
> mitigate the effect of domain shifts, we introduce a domain aligning mechanism
> that employs Earth Mover Distance (EMD) to evaluate and select source domain
> data closely matching the target domain. By finetuning the model with selective
> source data, our SelectiveFinetuning enhances the model's performance on target
> domain that exhibits domain shifts compared to the data used for training.
> Experimental results show that our method outperforms existing baselines,
> offering greater robustness and adaptability in practical scenarios where data
> distributions are often unpredictable.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on enhancing transfer learning in sleep staging through selective domain alignment using a Convolutional Neural Network (CNN), not on the engineering, design, or optimization of prompts for Large Language Models (LLMs), and does not investigate the manipulation of textual input prompts to improve LLM performance.

---

## [Three-dimensional attention Transformer for state evaluation in
real-time strategy games](https://arxiv.org/abs/http://arxiv.org/abs/2501.03832v1)
**arXiv ID:** http://arxiv.org/abs/2501.03832v1

**Abstract:**
> Situation assessment in Real-Time Strategy (RTS) games is crucial for
> understanding decision-making in complex adversarial environments. However,
> existing methods remain limited in processing multi-dimensional feature
> information and temporal dependencies. Here we propose a tri-dimensional
> Space-Time-Feature Transformer (TSTF Transformer) architecture, which
> efficiently models battlefield situations through three independent but
> cascaded modules: spatial attention, temporal attention, and feature attention.
> On a dataset comprising 3,150 adversarial experiments, the 8-layer TSTF
> Transformer demonstrates superior performance: achieving 58.7% accuracy in the
> early game (~4% progress), significantly outperforming the conventional
> Timesformer's 41.8%; reaching 97.6% accuracy in the mid-game (~40% progress)
> while maintaining low performance variation (standard deviation 0.114).
> Meanwhile, this architecture requires fewer parameters (4.75M) compared to the
> baseline model (5.54M). Our study not only provides new insights into situation
> assessment in RTS games but also presents an innovative paradigm for
> Transformer-based multi-dimensional temporal modeling.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on developing a new Transformer architecture for situation assessment in Real-Time Strategy games, rather than on prompt engineering for Large Language Models (LLMs). It does not investigate, analyze, or propose methods for improving LLM performance through textual input prompt manipulation, and there is no mention of prompts or LLMs in the provided abstract.

---

## [SCC-YOLO: An Improved Object Detector for Assisting in Brain Tumor
Diagnosis](https://arxiv.org/abs/http://arxiv.org/abs/2501.03836v2)
**arXiv ID:** http://arxiv.org/abs/2501.03836v2

**Abstract:**
> Brain tumors can result in neurological dysfunction, alterations in cognitive
> and psychological states, increased intracranial pressure, and the occurrence
> of seizures, thereby presenting a substantial risk to human life and health.
> The You Only Look Once (YOLO) series models have demonstrated superior accuracy
> in object detection for medical imaging. In this paper, we develop a novel
> SCC-YOLO architecture by integrating the SCConv attention mechanism into
> YOLOv9. The SCConv module reconstructs an efficient convolutional module by
> reducing spatial and channel redundancy among features, thereby enhancing the
> learning of image features. We investigate the impact of integrating different
> attention mechanisms with the YOLOv9 model on brain tumor image detection using
> both the Br35H dataset and our self-made dataset (Brain_Tumor_Dataset).
> Experimental results show that on the Br35H dataset, SCC-YOLO achieved a 0.3%
> improvement in mAP50 compared to YOLOv9, while on our self-made dataset,
> SCC-YOLO exhibited a 0.5% improvement over YOLOv9. SCC-YOLO has reached
> state-of-the-art performance in brain tumor detection. Source code is available
> at: https://jihulab.com/healthcare-information-studio/SCC-YOLO/-/tree/master

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on developing a new object detection architecture (SCC-YOLO) for medical imaging (brain tumor diagnosis) and does not investigate, analyze, or propose methods for improving Large Language Model (LLM) performance through the manipulation of textual input prompts.

---

## [Explainable Reinforcement Learning via Temporal Policy Decomposition](https://arxiv.org/abs/http://arxiv.org/abs/2501.03902v1)
**arXiv ID:** http://arxiv.org/abs/2501.03902v1

**Abstract:**
> We investigate the explainability of Reinforcement Learning (RL) policies
> from a temporal perspective, focusing on the sequence of future outcomes
> associated with individual actions. In RL, value functions compress information
> about rewards collected across multiple trajectories and over an infinite
> horizon, allowing a compact form of knowledge representation. However, this
> compression obscures the temporal details inherent in sequential
> decision-making, presenting a key challenge for interpretability. We present
> Temporal Policy Decomposition (TPD), a novel explainability approach that
> explains individual RL actions in terms of their Expected Future Outcome (EFO).
> These explanations decompose generalized value functions into a sequence of
> EFOs, one for each time step up to a prediction horizon of interest, revealing
> insights into when specific outcomes are expected to occur. We leverage
> fixed-horizon temporal difference learning to devise an off-policy method for
> learning EFOs for both optimal and suboptimal actions, enabling contrastive
> explanations consisting of EFOs for different state-action pairs. Our
> experiments demonstrate that TPD generates accurate explanations that (i)
> clarify the policy's future strategy and anticipated trajectory for a given
> action and (ii) improve understanding of the reward composition, facilitating
> fine-tuning of the reward function to align with human expectations.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper focuses on explainability in Reinforcement Learning (RL) policies, primarily dealing with temporal policy decomposition and value functions, with no indication of prompt engineering for Large Language Models (LLMs) or manipulation of textual input prompts to improve LLM performance."
}

---

## [Exploring the Potential of Large Language Models in Public
Transportation: San Antonio Case Study](https://arxiv.org/abs/http://arxiv.org/abs/2501.03904v1)
**arXiv ID:** http://arxiv.org/abs/2501.03904v1

**Abstract:**
> The integration of large language models (LLMs) into public transit systems
> presents a transformative opportunity to enhance urban mobility. This study
> explores the potential of LLMs to revolutionize public transportation
> management within the context of San Antonio's transit system. Leveraging the
> capabilities of LLMs in natural language processing and data analysis, we
> investigate their capabilities to optimize route planning, reduce wait times,
> and provide personalized travel assistance. By utilizing the General Transit
> Feed Specification (GTFS) and other relevant data, this research aims to
> demonstrate how LLMs can potentially improve resource allocation, elevate
> passenger satisfaction, and inform data-driven decision-making in transit
> operations. A comparative analysis of different ChatGPT models was conducted to
> assess their ability to understand transportation information, retrieve
> relevant data, and provide comprehensive responses. Findings from this study
> suggest that while LLMs hold immense promise for public transit, careful
> engineering and fine-tuning are essential to realizing their full potential.
> San Antonio serves as a case study to inform the development of LLM-powered
> transit systems in other urban environments.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on applying Large Language Models to public transportation management, rather than on the engineering, design, or optimization of prompts specifically for LLMs, failing to meet the core subject requirement.

---

## [Localizing AI: Evaluating Open-Weight Language Models for Languages of
Baltic States](https://arxiv.org/abs/http://arxiv.org/abs/2501.03952v1)
**arXiv ID:** http://arxiv.org/abs/2501.03952v1

**Abstract:**
> Although large language models (LLMs) have transformed our expectations of
> modern language technologies, concerns over data privacy often restrict the use
> of commercially available LLMs hosted outside of EU jurisdictions. This limits
> their application in governmental, defence, and other data-sensitive sectors.
> In this work, we evaluate the extent to which locally deployable open-weight
> LLMs support lesser-spoken languages such as Lithuanian, Latvian, and Estonian.
> We examine various size and precision variants of the top-performing
> multilingual open-weight models, Llama 3, Gemma 2, Phi, and NeMo, on machine
> translation, multiple-choice question answering, and free-form text generation.
> The results indicate that while certain models like Gemma 2 perform close to
> the top commercially available models, many LLMs struggle with these languages.
> Most surprisingly, however, we find that these models, while showing close to
> state-of-the-art translation performance, are still prone to lexical
> hallucinations with errors in at least 1 in 20 words for all open-weight
> multilingual LLMs.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper primarily focuses on evaluating open-weight Language Models for lesser-spoken languages, concerning data privacy and model performance, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs)."
}

---

## [VLM-driven Behavior Tree for Context-aware Task Planning](https://arxiv.org/abs/http://arxiv.org/abs/2501.03968v2)
**arXiv ID:** http://arxiv.org/abs/2501.03968v2

**Abstract:**
> The use of Large Language Models (LLMs) for generating Behavior Trees (BTs)
> has recently gained attention in the robotics community, yet remains in its
> early stages of development. In this paper, we propose a novel framework that
> leverages Vision-Language Models (VLMs) to interactively generate and edit BTs
> that address visual conditions, enabling context-aware robot operations in
> visually complex environments. A key feature of our approach lies in the
> conditional control through self-prompted visual conditions. Specifically, the
> VLM generates BTs with visual condition nodes, where conditions are expressed
> as free-form text. Another VLM process integrates the text into its prompt and
> evaluates the conditions against real-world images during robot execution. We
> validated our framework in a real-world cafe scenario, demonstrating both its
> feasibility and limitations.

**Decision Explanation:**
Original decision: REJECT
The paper's primary focus is on leveraging Vision-Language Models (VLMs) for context-aware task planning in robotics, with prompt engineering being a secondary aspect used for conditional control, not the central focus of the paper.

---

## [ChronoLLM: A Framework for Customizing Large Language Model for Digital
Twins generalization based on PyChrono](https://arxiv.org/abs/http://arxiv.org/abs/2501.04062v1)
**arXiv ID:** http://arxiv.org/abs/2501.04062v1

**Abstract:**
> Recently, the integration of advanced simulation technologies with artificial
> intelligence (AI) is revolutionizing science and engineering research.
> ChronoLlama introduces a novel framework that customizes the open-source LLMs,
> specifically for code generation, paired with PyChrono for multi-physics
> simulations. This integration aims to automate and improve the creation of
> simulation scripts, thus enhancing model accuracy and efficiency. This
> combination harnesses the speed of AI-driven code generation with the
> reliability of physics-based simulations, providing a powerful tool for
> researchers and engineers. Empirical results indicate substantial enhancements
> in simulation setup speed, accuracy of the generated codes, and overall
> computational efficiency. ChronoLlama not only expedites the development and
> testing of multibody systems but also spearheads a scalable, AI-enhanced
> approach to managing intricate mechanical simulations. This pioneering
> integration of cutting-edge AI with traditional simulation platforms represents
> a significant leap forward in automating and optimizing design processes in
> engineering applications.

**Decision Explanation:**
Original decision: REJECT
The paper's primary focus is on integrating LLMs with multi-physics simulations for automating code generation in engineering applications, rather than specifically on prompt engineering for Large Language Models. The core subject is the development of a framework for simulation script automation, not novel prompt engineering techniques or the manipulation of textual input prompts to improve LLM performance.

---

## [TrojanDec: Data-free Detection of Trojan Inputs in Self-supervised
Learning](https://arxiv.org/abs/http://arxiv.org/abs/2501.04108v1)
**arXiv ID:** http://arxiv.org/abs/2501.04108v1

**Abstract:**
> An image encoder pre-trained by self-supervised learning can be used as a
> general-purpose feature extractor to build downstream classifiers for various
> downstream tasks. However, many studies showed that an attacker can embed a
> trojan into an encoder such that multiple downstream classifiers built based on
> the trojaned encoder simultaneously inherit the trojan behavior. In this work,
> we propose TrojanDec, the first data-free method to identify and recover a test
> input embedded with a trigger. Given a (trojaned or clean) encoder and a test
> input, TrojanDec first predicts whether the test input is trojaned. If not, the
> test input is processed in a normal way to maintain the utility. Otherwise, the
> test input will be further restored to remove the trigger. Our extensive
> evaluation shows that TrojanDec can effectively identify the trojan (if any)
> from a given test input and recover it under state-of-the-art trojan attacks.
> We further demonstrate by experiments that our TrojanDec outperforms the
> state-of-the-art defenses.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on detecting and recovering from trojan attacks in self-supervised image encoders, which falls outside the specified criteria: it does not involve Large Language Models (LLMs), prompt engineering for text-based interactions, or text generation, and instead pertains to image processing and security.

---

## [BiasGuard: Guardrailing Fairness in Machine Learning Production Systems](https://arxiv.org/abs/http://arxiv.org/abs/2501.04142v1)
**arXiv ID:** http://arxiv.org/abs/2501.04142v1

**Abstract:**
> As machine learning (ML) systems increasingly impact critical sectors such as
> hiring, financial risk assessments, and criminal justice, the imperative to
> ensure fairness has intensified due to potential negative implications. While
> much ML fairness research has focused on enhancing training data and processes,
> addressing the outputs of already deployed systems has received less attention.
> This paper introduces 'BiasGuard', a novel approach designed to act as a
> fairness guardrail in production ML systems. BiasGuard leverages Test-Time
> Augmentation (TTA) powered by Conditional Generative Adversarial Network
> (CTGAN), a cutting-edge generative AI model, to synthesize data samples
> conditioned on inverted protected attribute values, thereby promoting equitable
> outcomes across diverse groups. This method aims to provide equal opportunities
> for both privileged and unprivileged groups while significantly enhancing the
> fairness metrics of deployed systems without the need for retraining. Our
> comprehensive experimental analysis across diverse datasets reveals that
> BiasGuard enhances fairness by 31% while only reducing accuracy by 0.09%
> compared to non-mitigated benchmarks. Additionally, BiasGuard outperforms
> existing post-processing methods in improving fairness, positioning it as an
> effective tool to safeguard against biases when retraining the model is
> impractical.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on ensuring fairness in machine learning production systems via Test-Time Augmentation and a generative AI model, rather than specifically on prompt engineering for Large Language Models (LLMs) and the manipulation of textual input prompts to improve LLM performance.

---

## [Reasoning-Enhanced Self-Training for Long-Form Personalized Text
Generation](https://arxiv.org/abs/http://arxiv.org/abs/2501.04167v1)
**arXiv ID:** http://arxiv.org/abs/2501.04167v1

**Abstract:**
> Personalized text generation requires a unique ability of large language
> models (LLMs) to learn from context that they often do not encounter during
> their standard training. One way to encourage LLMs to better use personalized
> context for generating outputs that better align with the user's expectations
> is to instruct them to reason over the user's past preferences, background
> knowledge, or writing style. To achieve this, we propose Reasoning-Enhanced
> Self-Training for Personalized Text Generation (REST-PG), a framework that
> trains LLMs to reason over personal data during response generation. REST-PG
> first generates reasoning paths to train the LLM's reasoning abilities and then
> employs Expectation-Maximization Reinforced Self-Training to iteratively train
> the LLM based on its own high-reward outputs. We evaluate REST-PG on the
> LongLaMP benchmark, consisting of four diverse personalized long-form text
> generation tasks. Our experiments demonstrate that REST-PG achieves significant
> improvements over state-of-the-art baselines, with an average relative
> performance gain of 14.5% on the benchmark.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on developing a new framework (REST-PG) for self-training LLMs to enhance personalized text generation, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models. It falls under the exclusion criteria, as it is primarily about training methods for LLMs rather than prompt engineering.

---

## [Learning to Transfer Human Hand Skills for Robot Manipulations](https://arxiv.org/abs/http://arxiv.org/abs/2501.04169v1)
**arXiv ID:** http://arxiv.org/abs/2501.04169v1

**Abstract:**
> We present a method for teaching dexterous manipulation tasks to robots from
> human hand motion demonstrations. Unlike existing approaches that solely rely
> on kinematics information without taking into account the plausibility of robot
> and object interaction, our method directly infers plausible robot manipulation
> actions from human motion demonstrations. To address the embodiment gap between
> the human hand and the robot system, our approach learns a joint motion
> manifold that maps human hand movements, robot hand actions, and object
> movements in 3D, enabling us to infer one motion component from others. Our key
> idea is the generation of pseudo-supervision triplets, which pair human,
> object, and robot motion trajectories synthetically. Through real-world
> experiments with robot hand manipulation, we demonstrate that our data-driven
> retargeting method significantly outperforms conventional retargeting
> techniques, effectively bridging the embodiment gap between human and robotic
> hands. Website at https://rureadyo.github.io/MocapRobot/.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper does not focus on prompt engineering for Large Language Models (LLMs), instead concentrating on robotics and teaching manipulation tasks to robots from human hand motion demonstrations, which falls outside the specified criteria."
}

---

## [SNR-EQ-JSCC: Joint Source-Channel Coding with SNR-Based Embedding and
Query](https://arxiv.org/abs/http://arxiv.org/abs/2501.04732v1)
**arXiv ID:** http://arxiv.org/abs/2501.04732v1

**Abstract:**
> Coping with the impact of dynamic channels is a critical issue in joint
> source-channel coding (JSCC)-based semantic communication systems. In this
> paper, we propose a lightweight channel-adaptive semantic coding architecture
> called SNR-EQ-JSCC. It is built upon the generic Transformer model and achieves
> channel adaptation (CA) by Embedding the signal-to-noise ratio (SNR) into the
> attention blocks and dynamically adjusting attention scores through
> channel-adaptive Queries. Meanwhile, penalty terms are introduced in the loss
> function to stabilize the training process. Considering that instantaneous SNR
> feedback may be imperfect, we propose an alternative method that uses only the
> average SNR, which requires no retraining of SNR-EQ-JSCC. Simulation results
> conducted on image transmission demonstrate that the proposed SNR-EQ-JSCC
> outperforms the state-of-the-art SwinJSCC in peak signal-to-noise ratio (PSNR)
> and perception metrics while only requiring 0.05% of the storage overhead and
> 6.38% of the computational complexity for CA. Moreover, the channel-adaptive
> query method demonstrates significant improvements in perception metrics. When
> instantaneous SNR feedback is imperfect, SNR-EQ-JSCC using only the average SNR
> still surpasses baseline schemes.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on joint source-channel coding for semantic communication systems, adapting a Transformer model for channel conditions, and does not primarily investigate prompt engineering for Large Language Models (LLMs) or analyze methods for improving LLM performance through textual input prompt manipulation.

---

## [RTLSquad: Multi-Agent Based Interpretable RTL Design](https://arxiv.org/abs/http://arxiv.org/abs/2501.05470v1)
**arXiv ID:** http://arxiv.org/abs/2501.05470v1

**Abstract:**
> Optimizing Register-Transfer Level (RTL) code is crucial for improving
> hardware PPA performance. Large Language Models (LLMs) offer new approaches for
> automatic RTL code generation and optimization. However, existing methods often
> lack decision interpretability (sufficient, understandable justification for
> decisions), making it difficult for hardware engineers to trust the generated
> results, thus preventing these methods from being integrated into the design
> process. To address this, we propose RTLSquad, a novel LLM-Based Multi-Agent
> system for interpretable RTL code generation. RTLSquad divides the design
> process into exploration, implementation, and verification & evaluation stages
> managed by specialized agent squads, generating optimized RTL code through
> inter-agent collaboration, and providing decision interpretability through the
> communication process. Experiments show that RTLSquad excels in generating
> functionally correct RTL code and optimizing PPA performance, while also having
> the capability to provide decision paths, demonstrating the practical value of
> our system.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on developing a multi-agent system for interpretable RTL (Register-Transfer Level) design using LLMs, rather than focusing on the engineering, design, or optimization of prompts specifically for Large Language Models. Prompt engineering is not the central concern, but rather a means to achieve the system's goals.

---

## [Found in Translation: semantic approaches for enhancing AI
interpretability in face verification](https://arxiv.org/abs/http://arxiv.org/abs/2501.05471v1)
**arXiv ID:** http://arxiv.org/abs/2501.05471v1

**Abstract:**
> The increasing complexity of machine learning models in computer vision,
> particularly in face verification, requires the development of explainable
> artificial intelligence (XAI) to enhance interpretability and transparency.
> This study extends previous work by integrating semantic concepts derived from
> human cognitive processes into XAI frameworks to bridge the comprehension gap
> between model outputs and human understanding. We propose a novel approach
> combining global and local explanations, using semantic features defined by
> user-selected facial landmarks to generate similarity maps and textual
> explanations via large language models (LLMs). The methodology was validated
> through quantitative experiments and user feedback, demonstrating improved
> interpretability. Results indicate that our semantic-based approach,
> particularly the most detailed set, offers a more nuanced understanding of
> model decisions than traditional methods. User studies highlight a preference
> for our semantic explanations over traditional pixel-based heatmaps, emphasizing
> the benefits of human-centric interpretability in AI. This work contributes to
> the ongoing efforts to create XAI frameworks that align AI models' behaviour
> with human cognitive processes, fostering trust and acceptance in critical
> applications.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on enhancing AI interpretability in face verification (a computer vision task) using Large Language Models (LLMs) as a component for generating textual explanations, rather than focusing on the engineering, design, or optimization of prompts specifically for LLMs in text-based interactions.

---

## [Modality-Invariant Bidirectional Temporal Representation Distillation
Network for Missing Multimodal Sentiment Analysis](https://arxiv.org/abs/http://arxiv.org/abs/2501.05474v1)
**arXiv ID:** http://arxiv.org/abs/2501.05474v1

**Abstract:**
> Multimodal Sentiment Analysis (MSA) integrates diverse modalities (text,
> audio, and video) to comprehensively analyze and understand individuals'
> emotional states. However, the real-world prevalence of incomplete data poses
> significant challenges to MSA, mainly due to the randomness of modality
> missing. Moreover, the heterogeneity issue in multimodal data has yet to be
> effectively addressed. To tackle these challenges, we introduce the
> Modality-Invariant Bidirectional Temporal Representation Distillation Network
> (MITR-DNet) for Missing Multimodal Sentiment Analysis. MITR-DNet employs a
> distillation approach, wherein a complete modality teacher model guides a
> missing modality student model, ensuring robustness in the presence of modality
> missing. Simultaneously, we developed the Modality-Invariant Bidirectional
> Temporal Representation Learning Module (MIB-TRL) to mitigate heterogeneity.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on Multimodal Sentiment Analysis with a network architecture (MITR-DNet) to address modality missing and heterogeneity issues, rather than prompt engineering for Large Language Models (LLMs), failing to meet the 'MUST' criteria for focus on LLM prompt engineering and manipulation of textual input prompts.

---

## [Cooperative Search and Track of Rogue Drones using Multiagent
Reinforcement Learning](https://arxiv.org/abs/http://arxiv.org/abs/2501.10413v1)
**arXiv ID:** http://arxiv.org/abs/2501.10413v1

**Abstract:**
> This work considers the problem of intercepting rogue drones targeting
> sensitive critical infrastructure facilities. While current interception
> technologies focus mainly on the jamming/spoofing tasks, the challenges of
> effectively locating and tracking rogue drones have not received adequate
> attention. Solving this problem and integrating with recently proposed
> interception techniques will enable a holistic system that can reliably detect,
> track, and neutralize rogue drones. Specifically, this work considers a team of
> pursuer UAVs that can search, detect, and track multiple rogue drones over a
> sensitive facility. The joint search and track problem is addressed through a
> novel multiagent reinforcement learning scheme to optimize the agent mobility
> control actions that maximize the number of rogue drones detected and tracked.
> The performance of the proposed system is investigated under realistic settings
> through extensive simulation experiments with varying number of agents
> demonstrating both its performance and scalability.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on multiagent reinforcement learning for tracking rogue drones, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.

---

## [Efficient Deployment of Large Language Models on Resource-constrained
Devices](https://arxiv.org/abs/http://arxiv.org/abs/2501.02438v1)
**arXiv ID:** http://arxiv.org/abs/2501.02438v1

**Abstract:**
> Deploying Large Language Models (LLMs) on resource-constrained (or weak)
> devices presents significant challenges due to limited resources and
> heterogeneous data distribution. To address the data concern, it is necessary
> to fine-tune LLMs using on-device private data for various downstream tasks.
> While Federated Learning (FL) offers a promising privacy-preserving solution,
> existing fine-tuning methods retain the original LLM size, leaving issues of
> high inference latency and excessive memory demands unresolved. Hence, we
> design FedSpine, an FL framework that combines Parameter-Efficient Fine-Tuning
> (PEFT) with structured pruning for efficient deployment of LLMs on
> resource-constrained devices. Specifically, FedSpine introduces an iterative
> process to prune and tune the parameters of LLMs. To mitigate the impact of
> device heterogeneity, an online Multi-Armed Bandit (MAB) algorithm is employed
> to adaptively determine different pruning ratios and LoRA ranks for
> heterogeneous devices without any prior knowledge of their computing and
> communication capabilities. As a result, FedSpine maintains higher inference
> accuracy while improving fine-tuning efficiency. Experimental results conducted
> on a physical platform with 80 devices demonstrate that FedSpine can speed up
> fine-tuning by 1.4$\times$-6.9$\times$ and improve final accuracy by 0.4%-4.5%
> under the same sparsity level compared to other baselines.

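The online bandit idea mentioned in the abstract can be pictured with a minimal sketch. This is a generic epsilon-greedy bandit choosing a (pruning ratio, LoRA rank) arm from observed rewards; the arm values and the reward signal are illustrative assumptions, not the authors' FedSpine implementation.

```python
import random

# Hypothetical arms: (pruning ratio, LoRA rank) pairs a device could be assigned.
ARMS = [(0.2, 8), (0.2, 16), (0.5, 8), (0.5, 16), (0.8, 4)]

class EpsilonGreedyBandit:
    """Tracks the running mean reward of each arm and explores with probability eps."""
    def __init__(self, n_arms: int, eps: float = 0.1):
        self.eps = eps
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def select(self) -> int:
        if random.random() < self.eps:
            return random.randrange(len(self.counts))  # explore
        return max(range(len(self.counts)), key=lambda i: self.values[i])  # exploit

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # Incremental running-mean update of the arm's estimated value.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

def fake_reward(pruning_ratio: float, lora_rank: int) -> float:
    """Stand-in for the accuracy/efficiency feedback a real system would measure."""
    return (1 - pruning_ratio) * 0.5 + lora_rank / 32 + random.gauss(0, 0.05)

bandit = EpsilonGreedyBandit(len(ARMS))
for _ in range(200):
    arm = bandit.select()
    ratio, rank = ARMS[arm]
    bandit.update(arm, fake_reward(ratio, rank))

print({ARMS[i]: round(v, 3) for i, v in enumerate(bandit.values)})
```
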
**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on the efficient deployment and fine-tuning of Large Language Models on resource-constrained devices, rather than on the engineering, design, or optimization of prompts for LLMs, failing to meet the first 'MUST' criterion.

---

## [FedRSClip: Federated Learning for Remote Sensing Scene Classification
Using Vision-Language Models](https://arxiv.org/abs/http://arxiv.org/abs/2501.02461v1)
**arXiv ID:** http://arxiv.org/abs/2501.02461v1

**Abstract:**
> Remote sensing data is often distributed across multiple institutions, and
> due to privacy concerns and data-sharing restrictions, leveraging large-scale
> datasets in a centralized training framework is challenging. Federated learning
> offers a promising solution by enabling collaborative model training across
> distributed data sources without requiring data centralization. However,
> current Vision-Language Models (VLMs), which typically contain billions of
> parameters, pose significant communication challenges for traditional federated
> learning approaches based on model parameter updates, as they would incur
> substantial communication costs. In this paper, we propose FedRSCLIP, the first
> federated learning framework designed for remote sensing image classification
> based on a VLM, specifically CLIP. FedRSCLIP addresses the challenges of data
> heterogeneity and large-scale model transmission in federated environments by
> introducing Prompt Learning, which optimizes only a small set of tunable
> parameters. The framework introduces a dual-prompt mechanism, comprising Shared
> Prompts for global knowledge sharing and Private Prompts for client-specific
> adaptation. To maintain semantic coherence between shared and private prompts,
> we propose the Dual Prompt Alignment Constraint to balance global consistency
> and local adaptability across diverse client distributions. Additionally, to
> enhance cross-modal representation learning, we introduce the Cross-Modal
> Feature Alignment Constraint to align multimodal features between text and
> image prompts. To validate the effectiveness of our proposed model, we
> construct a Fed-RSIC dataset based on three existing remote sensing image
> classification datasets, specifically designed to simulate various federated
> learning configurations. Experimental results demonstrate the effectiveness and
> superiority of FedRSCLIP in remote sensing image classification.

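The shared/private prompt split described above can be pictured with a toy sketch: two small learnable token blocks, one intended for global aggregation and one kept client-local, prepended to the input embeddings. The class name, dimensions, and prompt lengths are assumptions for illustration, not the FedRSCLIP code.

```python
import torch
import torch.nn as nn

class DualPromptEmbedding(nn.Module):
    """Prepends shared (globally aggregated) and private (client-local) prompt tokens."""
    def __init__(self, embed_dim: int = 512, n_shared: int = 4, n_private: int = 4):
        super().__init__()
        self.shared_prompt = nn.Parameter(torch.randn(n_shared, embed_dim) * 0.02)
        self.private_prompt = nn.Parameter(torch.randn(n_private, embed_dim) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, embed_dim)
        batch = token_embeddings.size(0)
        shared = self.shared_prompt.unsqueeze(0).expand(batch, -1, -1)
        private = self.private_prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([shared, private, token_embeddings], dim=1)

# In a federated round, only `shared_prompt` would be sent to the server for
# averaging, while `private_prompt` stays on the client.
module = DualPromptEmbedding()
x = torch.randn(2, 16, 512)   # dummy CLIP-style token embeddings
print(module(x).shape)        # torch.Size([2, 24, 512])
```
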
**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on federated learning for Vision-Language Models (VLMs) in remote sensing image classification, not on the engineering, design, or optimization of textual input prompts for Large Language Models (LLMs), and does not provide concrete examples of text prompts impacting LLM output.

---

## [Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera](https://arxiv.org/abs/http://arxiv.org/abs/2501.02464v1)
**arXiv ID:** http://arxiv.org/abs/2501.02464v1

**Abstract:**
> While recent depth estimation methods exhibit strong zero-shot
> generalization, achieving accurate metric depth across diverse camera
> types-particularly those with large fields of view (FoV) such as fisheye and
> 360-degree cameras-remains a significant challenge. This paper presents Depth
> Any Camera (DAC), a powerful zero-shot metric depth estimation framework that
> extends a perspective-trained model to effectively handle cameras with varying
> FoVs. The framework is designed to ensure that all existing 3D data can be
> leveraged, regardless of the specific camera types used in new applications.
> Remarkably, DAC is trained exclusively on perspective images but generalizes
> seamlessly to fisheye and 360-degree cameras without the need for specialized
> training data. DAC employs Equi-Rectangular Projection (ERP) as a unified image
> representation, enabling consistent processing of images with diverse FoVs. Its
> key components include a pitch-aware Image-to-ERP conversion for efficient
> online augmentation in ERP space, a FoV alignment operation to support
> effective training across a wide range of FoVs, and multi-resolution data
> augmentation to address resolution disparities between training and testing.
> DAC achieves state-of-the-art zero-shot metric depth estimation, improving
> delta-1 ($\delta_1$) accuracy by up to 50% on multiple fisheye and 360-degree
> datasets compared to prior metric depth foundation models, demonstrating robust
> generalization across camera types.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on depth estimation in computer vision, specifically developing a framework for metric depth estimation from various camera types, and does not address prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.

---

## [PTEENet: Post-Trained Early-Exit Neural Networks Augmentation for
Inference Cost Optimization](https://arxiv.org/abs/http://arxiv.org/abs/2501.02508v1)
**arXiv ID:** http://arxiv.org/abs/2501.02508v1

**Abstract:**
> For many practical applications, a high computational cost of inference over
> deep network architectures might be unacceptable. A small degradation in the
> overall inference accuracy might be a reasonable price to pay for a significant
> reduction in the required computational resources. In this work, we describe a
> method for introducing "shortcuts" into the DNN feedforward inference process
> by skipping costly feedforward computations whenever possible. The proposed
> method is based on the previously described BranchyNet (Teerapittayanon et al.,
> 2016) and the EEnet (Demir, 2019) architectures that jointly train the main
> network and early exit branches. We extend those methods by attaching branches
> to pre-trained models and, thus, eliminating the need to alter the original
> weights of the network. We also suggest a new branch architecture based on
> convolutional building blocks to allow enough training capacity when applied on
> large DNNs. The proposed architecture includes confidence heads that are used
> for predicting the confidence level in the corresponding early exits. By
> defining adjusted thresholds on these confidence extensions, we can control in
> real-time the amount of data exiting from each branch and the overall tradeoff
> between speed and accuracy of our model. In our experiments, we evaluate our
> method using image datasets (SVHN and CIFAR10) and several DNN architectures
> (ResNet, DenseNet, VGG) with varied depth. Our results demonstrate that the
> proposed method enables us to reduce the average inference computational cost
> and to further control the tradeoff between the model accuracy and the
> computation cost.

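The confidence-gated early-exit idea summarized above can be sketched in a few lines. This is a minimal illustration with toy backbone stages, a single attached branch, and an arbitrary threshold; it is not the PTEENet architecture.

```python
import torch
import torch.nn as nn

class EarlyExitWrapper(nn.Module):
    """Attaches an exit head after an early stage of a frozen backbone.

    If the head's softmax confidence exceeds the threshold, the sample exits
    early and the remaining (expensive) stages are skipped.
    """
    def __init__(self, stage1: nn.Module, stage2: nn.Module,
                 branch: nn.Module, threshold: float = 0.5):
        super().__init__()
        self.stage1, self.stage2, self.branch = stage1, stage2, branch
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x: torch.Tensor):
        feats = self.stage1(x)
        early_logits = self.branch(feats)
        conf, _ = early_logits.softmax(dim=-1).max(dim=-1)
        if bool((conf > self.threshold).all()):   # whole batch is confident enough
            return early_logits, "early"
        return self.stage2(feats), "full"

# Toy stand-ins for backbone stages and the attached branch head.
stage1 = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
stage2 = nn.Linear(128, 10)
branch = nn.Linear(128, 10)

model = EarlyExitWrapper(stage1, stage2, branch)
logits, path = model(torch.randn(4, 3, 32, 32))
print(path, logits.shape)
```
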
**Decision Explanation:**
Original decision: REJECT
The paper focuses on optimizing the inference cost of deep neural networks (DNNs) through early-exit strategies, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thus failing to meet the primary criteria.

---

## [Evaluating Large Language Models Against Human Annotators in Latent
Content Analysis: Sentiment, Political Leaning, Emotional Intensity, and
Sarcasm](https://arxiv.org/abs/http://arxiv.org/abs/2501.02532v1)
**arXiv ID:** http://arxiv.org/abs/2501.02532v1

**Abstract:**
> In the era of rapid digital communication, vast amounts of textual data are
> generated daily, demanding efficient methods for latent content analysis to
> extract meaningful insights. Large Language Models (LLMs) offer potential for
> automating this process, yet comprehensive assessments comparing their
> performance to human annotators across multiple dimensions are lacking. This
> study evaluates the reliability, consistency, and quality of seven
> state-of-the-art LLMs, including variants of OpenAI's GPT-4, Gemini, Llama, and
> Mixtral, relative to human annotators in analyzing sentiment, political
> leaning, emotional intensity, and sarcasm detection. A total of 33 human
> annotators and eight LLM variants assessed 100 curated textual items,
> generating 3,300 human and 19,200 LLM annotations, with LLMs evaluated across
> three time points to examine temporal consistency. Inter-rater reliability was
> measured using Krippendorff's alpha, and intra-class correlation coefficients
> assessed consistency over time. The results reveal that both humans and LLMs
> exhibit high reliability in sentiment analysis and political leaning
> assessments, with LLMs demonstrating higher internal consistency than humans.
> In emotional intensity, LLMs displayed higher agreement compared to humans,
> though humans rated emotional intensity significantly higher. Both groups
> struggled with sarcasm detection, evidenced by low agreement. LLMs showed
> excellent temporal consistency across all dimensions, indicating stable
> performance over time. This research concludes that LLMs, especially GPT-4, can
> effectively replicate human analysis in sentiment and political leaning,
> although human expertise remains essential for emotional intensity
> interpretation. The findings demonstrate the potential of LLMs for consistent
> and high-quality performance in certain areas of latent content analysis.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on evaluating LLMs' performance in latent content analysis compared to human annotators, rather than primarily investigating, analyzing, or proposing methods for improving LLM performance through the manipulation of textual input prompts.

---

## [Energy Optimization of Multi-task DNN Inference in MEC-assisted XR
Devices: A Lyapunov-Guided Reinforcement Learning Approach](https://arxiv.org/abs/http://arxiv.org/abs/2501.02572v1)
**arXiv ID:** http://arxiv.org/abs/2501.02572v1

**Abstract:**
> Extended reality (XR), blending virtual and real worlds, is a key application
> of future networks. While AI advancements enhance XR capabilities, they also
> impose significant computational and energy challenges on lightweight XR
> devices. In this paper, we developed a distributed queue model for multi-task
> DNN inference, addressing issues of resource competition and queue coupling. In
> response to the challenges posed by the high energy consumption and limited
> resources of XR devices, we designed a dual time-scale joint optimization
> strategy for model partitioning and resource allocation, formulated as a
> bi-level optimization problem. This strategy aims to minimize the total energy
> consumption of XR devices while ensuring queue stability and adhering to
> computational and communication resource constraints. To tackle this problem,
> we devised a Lyapunov-guided Proximal Policy Optimization algorithm, named
> LyaPPO. Numerical results demonstrate that the LyaPPO algorithm outperforms the
> baselines, achieving energy conservation of 24.79% to 46.14% under varying
> resource capacities. Specifically, the proposed algorithm reduces the energy
> consumption of XR devices by 24.29% to 56.62% compared to baseline algorithms.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on energy optimization of Multi-task DNN Inference in MEC-assisted XR Devices using Reinforcement Learning, with no apparent focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs), thus failing to meet the 'MUST' criteria.

---

## [TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud
Platforms](https://arxiv.org/abs/http://arxiv.org/abs/2501.02600v1)
**arXiv ID:** http://arxiv.org/abs/2501.02600v1

**Abstract:**
> The rising demand for generative large language models (LLMs) poses
> challenges for thermal and power management in cloud datacenters. Traditional
> techniques often are inadequate for LLM inference due to the fine-grained,
> millisecond-scale execution phases, each with distinct performance, thermal,
> and power profiles. Additionally, LLM inference workloads are sensitive to
> various configuration parameters (e.g., model parallelism, size, and
> quantization) that involve trade-offs between performance, temperature, power,
> and output quality. Moreover, clouds often co-locate SaaS and IaaS workloads,
> each with different levels of visibility and flexibility. We propose TAPAS, a
> thermal- and power-aware framework designed for LLM inference clusters in the
> cloud. TAPAS enhances cooling and power oversubscription capabilities, reducing
> the total cost of ownership (TCO) while effectively handling emergencies (e.g.,
> cooling and power failures). The system leverages historical temperature and
> power data, along with the adaptability of SaaS workloads, to: (1) efficiently
> place new GPU workload VMs within cooling and power constraints, (2) route LLM
> inference requests across SaaS VMs, and (3) reconfigure SaaS VMs to manage load
> spikes and emergency situations. Our evaluation on a large GPU cluster
> demonstrates significant reductions in thermal and power throttling events,
> boosting system efficiency.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on thermal- and power-aware scheduling for LLM inference in cloud platforms, addressing infrastructure management rather than prompt engineering for text-based interactions with LLMs, thus failing to meet the core subject requirement.

---

## [Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for
Jailbreak Attack Defense](https://arxiv.org/abs/http://arxiv.org/abs/2501.02629v1)
**arXiv ID:** http://arxiv.org/abs/2501.02629v1

**Abstract:**
> As large language models (LLMs) are increasingly deployed in diverse
> applications, including chatbot assistants and code generation, aligning their
> behavior with safety and ethical standards has become paramount. However,
> jailbreak attacks, which exploit vulnerabilities to elicit unintended or
> harmful outputs, threaten LLMs' safety significantly. In this paper, we
> introduce Layer-AdvPatcher, a novel methodology designed to defend against
> jailbreak attacks by utilizing an unlearning strategy to patch specific layers
> within LLMs through self-augmented datasets. Our insight is that certain
> layer(s) tend to produce affirmative tokens when faced with harmful prompts.
> By identifying these layers and adversarially exposing them to generate more
> harmful data, one can understand their inherent and diverse vulnerabilities to
> attacks. With these exposures, we then "unlearn" these issues, reducing the
> impact of affirmative tokens and hence minimizing jailbreak risks while keeping
> the model's responses to safe queries intact. We conduct extensive experiments
> on two models, four benchmark datasets, and multiple state-of-the-art jailbreak
> benchmarks to demonstrate the efficacy of our approach. Results indicate that
> our framework reduces the harmfulness and attack success rate of jailbreak
> attacks without compromising utility for benign queries compared to recent
> defense methods.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on defending against jailbreak attacks by modifying LLM architecture (Layer-AdvPatcher) and utilizing an unlearning strategy, rather than primarily focusing on the engineering, design, or optimization of prompts for improving LLM performance through textual input manipulation.

---

## [Representation Learning of Lab Values via Masked AutoEncoder](https://arxiv.org/abs/http://arxiv.org/abs/2501.02648v2)
**arXiv ID:** http://arxiv.org/abs/2501.02648v2

**Abstract:**
> Accurate imputation of missing laboratory values in electronic health records
> (EHRs) is critical to enable robust clinical predictions and reduce biases in
> AI systems in healthcare. Existing methods, such as variational autoencoders
> (VAEs) and decision tree-based approaches such as XGBoost, struggle to model
> the complex temporal and contextual dependencies in EHR data, mainly in
> underrepresented groups. In this work, we propose Lab-MAE, a novel
> transformer-based masked autoencoder framework that leverages self-supervised
> learning for the imputation of continuous sequential lab values. Lab-MAE
> introduces a structured encoding scheme that jointly models laboratory test
> values and their corresponding timestamps, enabling explicit capturing of temporal
> dependencies. Empirical evaluation on the MIMIC-IV dataset demonstrates that
> Lab-MAE significantly outperforms the state-of-the-art baselines such as
> XGBoost across multiple metrics, including root mean square error (RMSE),
> R-squared (R2), and Wasserstein distance (WD). Notably, Lab-MAE achieves
> equitable performance across demographic groups of patients, advancing fairness
> in clinical predictions. We further investigate the role of follow-up
> laboratory values as potential shortcut features, revealing Lab-MAE's
> robustness in scenarios where such data is unavailable. The findings suggest
> that our transformer-based architecture, adapted to the characteristics of the
> EHR data, offers a foundation model for more accurate and fair clinical
> imputation models. In addition, we measure and compare the carbon footprint of
> Lab-MAE with the baseline XGBoost model, highlighting its environmental
> requirements.

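The masked-reconstruction idea behind such models can be illustrated with a generic sketch: randomly mask positions of a (value, timestamp) sequence, encode with a small transformer, and compute the reconstruction loss only on masked positions. The dimensions, masking ratio, and toy data below are assumptions for illustration, not the Lab-MAE architecture.

```python
import torch
import torch.nn as nn

class TinyMaskedAutoencoder(nn.Module):
    """Reconstructs randomly masked positions of a (value, time) sequence."""
    def __init__(self, d_model: int = 64, n_layers: int = 2, mask_ratio: float = 0.3):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.input_proj = nn.Linear(2, d_model)   # (lab value, timestamp) -> embedding
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)         # predict the lab value only

    def forward(self, values: torch.Tensor, times: torch.Tensor) -> torch.Tensor:
        # values, times: (batch, seq_len)
        x = self.input_proj(torch.stack([values, times], dim=-1))
        mask = torch.rand(values.shape, device=values.device) < self.mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        pred = self.head(self.encoder(x)).squeeze(-1)
        return ((pred - values)[mask] ** 2).mean()  # MSE on masked positions only

model = TinyMaskedAutoencoder()
values = torch.randn(8, 20)                        # toy normalized lab values
times = torch.linspace(0, 1, 20).expand(8, 20)
print(float(model(values, times)))
```
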
**Decision Explanation:**
Original decision: REJECT
The paper focuses on developing a novel transformer-based masked autoencoder for imputing missing laboratory values in electronic health records, which does not meet the 'MUST' criteria of primarily focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts for improving LLM performance.

---

## [From thermodynamics to protein design: Diffusion models for biomolecule
generation towards autonomous protein engineering](https://arxiv.org/abs/http://arxiv.org/abs/2501.02680v1)
**arXiv ID:** http://arxiv.org/abs/2501.02680v1

**Abstract:**
> Protein design with desirable properties has been a significant challenge for
> many decades. Generative artificial intelligence is a promising approach and
> has achieved great success in various protein generation tasks. Notably,
> diffusion models stand out for their robust mathematical foundations and
> impressive generative capabilities, offering unique advantages in certain
> applications such as protein design. In this review, we first give the
> definition and characteristics of diffusion models and then focus on two
> strategies: Denoising Diffusion Probabilistic Models and Score-based Generative
> Models, where DDPM is the discrete form of SGM. Furthermore, we discuss their
> applications in protein design, peptide generation, drug discovery, and
> protein-ligand interaction. Finally, we outline the future perspectives of
> diffusion models to advance autonomous protein design and engineering. The E(3)
> group consists of all rotations, reflections, and translations in
> three-dimensions. The equivariance on the E(3) group can keep the physical
> stability of the frame of each amino acid as much as possible, and we reflect
> on how to keep the diffusion model E(3) equivariant for protein generation.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on the application of diffusion models for biomolecule generation in protein design, not on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs). It also concerns a medical/biological subject (protein design) and generative AI application (protein generation) outside the specified text generation driven by LLMs.

---

## [EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in
Instructional Multimodal Models](https://arxiv.org/abs/http://arxiv.org/abs/2501.02699v1)
**arXiv ID:** http://arxiv.org/abs/2501.02699v1

**Abstract:**
> Large language models and vision transformers have demonstrated impressive
> zero-shot capabilities, enabling significant transferability in downstream
> tasks. The fusion of these models has resulted in multi-modal architectures
> with enhanced instructional capabilities. Despite incorporating vast image and
> language pre-training, these multi-modal architectures often generate responses
> that deviate from the ground truth in the image data. These failure cases are
> known as hallucinations. Current methods for mitigating hallucinations
> generally focus on regularizing the language component, improving the fusion
> module, or ensembling multiple visual encoders to improve visual
> representation. In this paper, we address the hallucination issue by directly
> enhancing the capabilities of the visual component. Our approach, named EAGLE,
> is fully agnostic to the LLM or fusion module and works as a post-pretraining
> approach that improves the grounding and language alignment of the visual
> encoder. We show that a straightforward reformulation of the original
> contrastive pre-training task results in an improved visual encoder that can be
> incorporated into the instructional multi-modal architecture without additional
> instructional training. As a result, EAGLE achieves a significant reduction in
> hallucinations across multiple challenging benchmarks and tasks.

**Decision Explanation:**
Original decision: REJECT
The paper's primary focus is on enhancing visual grounding in multimodal models to reduce hallucinations, rather than engineering or optimizing prompts specifically for Large Language Models (LLMs). It meets none of the 'MUST' criteria for prompt engineering, focusing instead on improving the visual component of multimodal architectures.

---

## [QuIM-RAG: Advancing Retrieval-Augmented Generation with Inverted
Question Matching for Enhanced QA Performance](https://arxiv.org/abs/http://arxiv.org/abs/2501.02702v1)
**arXiv ID:** http://arxiv.org/abs/2501.02702v1

**Abstract:**
> This work presents a novel architecture for building Retrieval-Augmented
> Generation (RAG) systems to improve Question Answering (QA) tasks from a target
> corpus. Large Language Models (LLMs) have revolutionized the analysis and
> generation of human-like text. These models rely on pre-trained data and lack
> real-time updates unless integrated with live data tools. RAG enhances LLMs by
> integrating online resources and databases to generate contextually appropriate
> responses. However, traditional RAG still encounters challenges like
> information dilution and hallucinations when handling vast amounts of data. Our
> approach addresses these challenges by converting corpora into a
> domain-specific dataset and RAG architecture is constructed to generate
> responses from the target document. We introduce QuIM-RAG (Question-to-question
> Inverted Index Matching), a novel approach for the retrieval mechanism in our
> system. This strategy generates potential questions from document chunks and
> matches these with user queries to identify the most relevant text chunks for
> generating accurate answers. We have implemented our RAG system on top of the
> open-source Meta-LLaMA3-8B-instruct model by Meta Inc. that is available on
> Hugging Face. We constructed a custom corpus of 500+ pages from a high-traffic
> website accessed thousands of times daily for answering complex questions,
> along with manually prepared ground truth QA for evaluation. We compared our
> approach with traditional RAG models using BERT-Score and RAGAS,
> state-of-the-art metrics for evaluating LLM applications. Our evaluation
> demonstrates that our approach outperforms traditional RAG architectures on
> both metrics.

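The question-to-question matching idea described above can be sketched in a few lines: index questions generated for each chunk, match the user query against those questions, and return the owning chunk. The sketch below uses TF-IDF cosine similarity as a stand-in for the paper's embedding model and hand-written questions as a stand-in for LLM-generated ones; it is illustrative only, not the QuIM-RAG implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus chunks and questions pre-generated for each chunk
# (in QuIM-RAG the questions would come from an LLM pass over the documents).
chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available by email around the clock, every day of the week.",
]
chunk_questions = [
    ["How long do I have to return an item?", "What is the refund policy?"],
    ["How can I contact support?", "Is support available on weekends?"],
]

# Build an inverted index from each generated question back to its chunk.
questions, owners = [], []
for idx, qs in enumerate(chunk_questions):
    for q in qs:
        questions.append(q)
        owners.append(idx)

vectorizer = TfidfVectorizer().fit(questions)
question_vectors = vectorizer.transform(questions)

def retrieve(user_query: str) -> str:
    """Match the user query against generated questions and return the owning chunk."""
    sims = cosine_similarity(vectorizer.transform([user_query]), question_vectors)[0]
    return chunks[owners[int(sims.argmax())]]

print(retrieve("What is your return policy?"))
```
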
**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on advancing Retrieval-Augmented Generation (RAG) system architecture for Question Answering tasks, rather than specifically on the engineering, design, or optimization of prompts for Large Language Models (LLMs), as required. While LLMs are utilized, the core subject is the RAG system's enhancement, not prompt engineering for text-based LLM interactions.

---

## [OpenGU: A Comprehensive Benchmark for Graph Unlearning](https://arxiv.org/abs/http://arxiv.org/abs/2501.02728v1)
**arXiv ID:** http://arxiv.org/abs/2501.02728v1

**Abstract:**
> Graph Machine Learning is essential for understanding and analyzing
> relational data. However, privacy-sensitive applications demand the ability to
> efficiently remove sensitive information from trained graph neural networks
> (GNNs), avoiding the unnecessary time and space overhead caused by retraining
> models from scratch. To address this issue, Graph Unlearning (GU) has emerged
> as a critical solution, with the potential to support dynamic graph updates in
> data management systems and enable scalable unlearning in distributed data
> systems while ensuring privacy compliance. Unlike machine unlearning in
> computer vision or other fields, GU faces unique difficulties due to the
> non-Euclidean nature of graph data and the recursive message-passing mechanism
> of GNNs. Additionally, the diversity of downstream tasks and the complexity of
> unlearning requests further amplify these challenges. Despite the proliferation
> of diverse GU strategies, the absence of a benchmark providing fair comparisons
> for GU, and the limited flexibility in combining downstream tasks and
> unlearning requests, have yielded inconsistencies in evaluations, hindering the
> development of this domain. To fill this gap, we present OpenGU, the first GU
> benchmark, where 16 SOTA GU algorithms and 37 multi-domain datasets are
> integrated, enabling various downstream tasks with 13 GNN backbones when
> responding to flexible unlearning requests. Based on this unified benchmark
> framework, we are able to provide a comprehensive and fair evaluation for GU.
> Through extensive experimentation, we have drawn $8$ crucial conclusions about
> existing GU methods, while also gaining valuable insights into their
> limitations, shedding light on potential avenues for future research.

**Decision Explanation:**
Original decision: REJECT
The paper does not focus on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance. Instead, it focuses on Graph Unlearning (GU) for Graph Neural Networks (GNNs), which falls outside the specified criteria.

---

## [GLoG-CSUnet: Enhancing Vision Transformers with Adaptable Radiomic
Features for Medical Image Segmentation](https://arxiv.org/abs/http://arxiv.org/abs/2501.02788v2)
**arXiv ID:** http://arxiv.org/abs/2501.02788v2

**Abstract:**
> Vision Transformers (ViTs) have shown promise in medical image semantic
> segmentation (MISS) by capturing long-range correlations. However, ViTs often
> struggle to model local spatial information effectively, which is essential for
> accurately segmenting fine anatomical details, particularly when applied to
> small datasets without extensive pre-training. We introduce Gabor and Laplacian
> of Gaussian Convolutional Swin Network (GLoG-CSUnet), a novel architecture
> enhancing Transformer-based models by incorporating learnable radiomic
> features. This approach integrates dynamically adaptive Gabor and Laplacian of
> Gaussian (LoG) filters to capture texture, edge, and boundary information,
> enhancing the feature representation processed by the Transformer model. Our
> method uniquely combines the long-range dependency modeling of Transformers
> with the texture analysis capabilities of Gabor and LoG features. Evaluated on
> the Synapse multi-organ and ACDC cardiac segmentation datasets, GLoG-CSUnet
> demonstrates significant improvements over state-of-the-art models, achieving a
> 1.14% increase in Dice score for Synapse and 0.99% for ACDC, with minimal
> computational overhead (only 15 and 30 additional parameters, respectively).
> GLoG-CSUnet's flexible design allows integration with various base models,
> offering a promising approach for incorporating radiomics-inspired feature
> extraction in Transformer architectures for medical image analysis. The code
> implementation is available on GitHub at: https://github.com/HAAIL/GLoG-CSUnet.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on enhancing Vision Transformers with adaptable radiomic features for Medical Image Segmentation, which falls under the excluded categories of 'medical subjects' and 'applications of generative AI other than text generation driven by LLMs', and does not meet the core subject requirement of prompt engineering for text-based interactions with LLMs.

---

## [RDD4D: 4D Attention-Guided Road Damage Detection And Classification](https://arxiv.org/abs/http://arxiv.org/abs/2501.02822v1)
**arXiv ID:** http://arxiv.org/abs/2501.02822v1

**Abstract:**
> Road damage detection and assessment are crucial components of infrastructure
> maintenance. However, current methods often struggle with detecting multiple
> types of road damage in a single image, particularly at varying scales. This is
> due to the lack of road datasets with various damage types having varying
> scales. To overcome this deficiency, first, we present a novel dataset called
> Diverse Road Damage Dataset (DRDD) for road damage detection that captures the
> diverse road damage types in individual images, addressing a crucial gap in
> existing datasets. Then, we provide our model, RDD4D, that exploits Attention4D
> blocks, enabling better feature refinement across multiple scales. The
> Attention4D module processes feature maps through an attention mechanism
> combining positional encoding and "Talking Head" components to capture local
> and global contextual information. In our comprehensive experimental analysis
> comparing various state-of-the-art models on our proposed dataset, our enhanced model
> demonstrated superior performance in detecting large-sized road cracks with an
> Average Precision (AP) of 0.458 and maintained competitive performance with an
> overall AP of 0.445. Moreover, we also provide results on the CrackTinyNet
> dataset; our model achieved around a 0.21 increase in performance. The code,
> model weights, dataset, and our results are available at
> https://github.com/msaqib17/Road_Damage_Detection.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on road damage detection using computer vision and deep learning techniques, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thereby failing to meet all the 'MUST' criteria.

---

## [Forward Once for All: Structural Parameterized Adaptation for Efficient
Cloud-coordinated On-device Recommendation](https://arxiv.org/abs/http://arxiv.org/abs/2501.02837v1)
**arXiv ID:** http://arxiv.org/abs/2501.02837v1

**Abstract:**
> In cloud-centric recommender systems, regular data exchanges between user
> devices and cloud could potentially elevate bandwidth demands and privacy
> risks. On-device recommendation emerges as a viable solution by performing
> reranking locally to alleviate these concerns. Existing methods primarily focus
> on developing local adaptive parameters, while potentially neglecting the
> critical role of tailor-made model architecture. Insights from broader research
> domains suggest that varying data distributions might favor distinct
> architectures for better fitting. In addition, imposing a uniform model
> structure across heterogeneous devices may result in risking inefficacy on less
> capable devices or sub-optimal performance on those with sufficient
> capabilities. In response to these gaps, our paper introduces Forward-OFA, a
> novel approach for the dynamic construction of device-specific networks (both
> structure and parameters). Forward-OFA employs a structure controller to
> selectively determine whether each block needs to be assembled for a given
> device. However, during the training of the structure controller, these
> assembled heterogeneous structures are jointly optimized, where the co-adaptation
> among blocks might encounter gradient conflicts. To mitigate this, Forward-OFA
> is designed to establish a structure-guided mapping of real-time behaviors to
> the parameters of assembled networks. Structure-related parameters and parallel
> components within the mapper prevent each part from receiving heterogeneous
> gradients from others, thus bypassing the gradient conflicts for coupled
> optimization. Besides, direct mapping enables Forward-OFA to achieve adaptation
> through only one forward pass, allowing for swift adaptation to changing
> interests and eliminating the requirement for on-device backpropagation.
> Experiments on real-world datasets demonstrate the effectiveness and efficiency
> of Forward-OFA.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on developing a novel approach for dynamic construction of device-specific networks for on-device recommendation, with no clear emphasis on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.

---

## [Explaining Humour Style Classifications: An XAI Approach to
Understanding Computational Humour Analysis](https://arxiv.org/abs/http://arxiv.org/abs/2501.02891v1)
**arXiv ID:** http://arxiv.org/abs/2501.02891v1

**Abstract:**
> Humour styles can have either a negative or a positive impact on well-being.
> Given the importance of these styles to mental health, significant research has
> been conducted on their automatic identification. However, the automated
> machine learning models used for this purpose are black boxes, making their
> prediction decisions opaque. Clarity and transparency are vital in the field of
> mental health. This paper presents an explainable AI (XAI) framework for
> understanding humour style classification, building upon previous work in
> computational humour analysis. Using the best-performing single model
> (ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to
> analyse how linguistic, emotional, and semantic features contribute to humour
> style classification decisions. Our analysis reveals distinct patterns in how
> different humour styles are characterised and misclassified, with particular
> emphasis on the challenges in distinguishing affiliative humour from other
> styles. Through detailed examination of feature importance, error patterns, and
> misclassification cases, we identify key factors influencing model decisions,
> including emotional ambiguity, context misinterpretation, and target
> identification. The framework demonstrates significant utility in understanding
> model behaviour, achieving interpretable insights into the complex interplay of
> features that define different humour styles. Our findings contribute to both
> the theoretical understanding of computational humour analysis and practical
> applications in mental health, content moderation, and digital humanities
> research.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on explainable AI (XAI) for humour style classification in computational humour analysis, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), failing to meet the first 'MUST' criterion.

---

## [Label-free Concept Based Multiple Instance Learning for Gigapixel
Histopathology](https://arxiv.org/abs/http://arxiv.org/abs/2501.02922v1)
**arXiv ID:** http://arxiv.org/abs/2501.02922v1

**Abstract:**
> Multiple Instance Learning (MIL) methods allow for gigapixel Whole-Slide
> Image (WSI) analysis with only slide-level annotations. Interpretability is
> crucial for safely deploying such algorithms in high-stakes medical domains.
> Traditional MIL methods offer explanations by highlighting salient regions.
> However, such spatial heatmaps provide limited insights for end users. To
> address this, we propose a novel inherently interpretable WSI-classification
> approach that uses human-understandable pathology concepts to generate
> explanations. Our proposed Concept MIL model leverages recent advances in
> vision-language models to directly predict pathology concepts based on image
> features. The model's predictions are obtained through a linear combination of
> the concepts identified on the top-K patches of a WSI, enabling inherent
> explanations by tracing each concept's influence on the prediction. In contrast
> to traditional concept-based interpretable models, our approach eliminates the
> need for costly human annotations by leveraging the vision-language model. We
> validate our method on two widely used pathology datasets: Camelyon16 and
> PANDA. On both datasets, Concept MIL achieves AUC and accuracy scores over 0.9,
> putting it on par with state-of-the-art models. We further find that 87.1%
> (Camelyon16) and 85.3% (PANDA) of the top 20 patches fall within the tumor
> region. A user study shows that the concepts identified by our model align with
> the concepts used by pathologists, making it a promising strategy for
> human-interpretable WSI classification.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on medical image analysis (histopathology) using vision-language models, not on engineering, design, or optimization of textual input prompts for Large Language Models (LLMs), and does not meet the core subject requirement of prompt engineering for text-based interactions with LLMs.

---

## [Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the
Wild](https://arxiv.org/abs/http://arxiv.org/abs/2501.02964v2)
**arXiv ID:** http://arxiv.org/abs/2501.02964v2

**Abstract:**
> Complex visual reasoning remains a key challenge today. Typically, the
> challenge is tackled using methodologies such as Chain of Thought (COT) and
> visual instruction tuning. However, how to organically combine these two
> methodologies for greater success remains unexplored. Also, issues like
> hallucinations and high training cost still need to be addressed. In this work,
> we devise an innovative multi-round training and reasoning framework suitable
> for lightweight Multimodal Large Language Models (MLLMs). Our self-questioning
> approach heuristically guides MLLMs to focus on visual clues relevant to the
> target problem, reducing hallucinations and enhancing the model's ability to
> describe fine-grained image details. This ultimately enables the model to
> perform well in complex visual reasoning and question-answering tasks. We have
> named this framework Socratic Questioning (SQ). To facilitate future research,
> we create a multimodal mini-dataset named CapQA, which includes 1k images of
> fine-grained activities, for visual instruction tuning and evaluation. Our
> proposed SQ method leads to a 31.2% improvement in the hallucination score. Our
> extensive experiments on various benchmarks demonstrate SQ's remarkable
> capabilities in heuristic self-questioning, zero-shot visual reasoning and
> hallucination mitigation. Our model and code will be publicly available.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on multimodal reasoning, visual question-answering, and reducing hallucinations in Multimodal Large Language Models (MLLMs) through a self-questioning framework, rather than primarily on the engineering, design, or optimization of textual input prompts for Large Language Models (LLMs) as required.

---

## [Proof-of-Data: A Consensus Protocol for Collaborative Intelligence](https://arxiv.org/abs/http://arxiv.org/abs/2501.02971v1)
**arXiv ID:** http://arxiv.org/abs/2501.02971v1

**Abstract:**
> Existing research on federated learning has been focused on the setting where
> learning is coordinated by a centralized entity. Yet the greatest potential of
> future collaborative intelligence would be unleashed in a more open and
> democratized setting with no central entity in a dominant role, referred to as
> "decentralized federated learning". New challenges arise accordingly in
> achieving both correct model training and fair reward allocation with
> collective effort among all participating nodes, especially with the threat of
> the Byzantine node jeopardising both tasks.
> In this paper, we propose a blockchain-based decentralized Byzantine
> fault-tolerant federated learning framework based on a novel Proof-of-Data
> (PoD) consensus protocol to resolve both the "trust" and "incentive"
> components. By decoupling model training and contribution accounting, PoD is
> able to enjoy not only the benefit of learning efficiency and system liveliness
> from asynchronous societal-scale PoW-style learning but also the finality of
> consensus and reward allocation from epoch-based BFT-style voting. To mitigate
> false reward claims by data forgery from Byzantine attacks, a privacy-aware
> data verification and contribution-based reward allocation mechanism is
> designed to complete the framework. Our evaluation results show that PoD
> demonstrates performance in model training close to that of the centralized
> counterpart while achieving trust in consensus and fairness for reward
> allocation with a fault tolerance ratio of 1/3.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on decentralized federated learning, blockchain, and a consensus protocol, with no primary emphasis on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts to improve LLM performance.

---

## [Fuzzy Granule Density-Based Outlier Detection with Multi-Scale Granular
Balls](https://arxiv.org/abs/http://arxiv.org/abs/2501.02975v1)
**arXiv ID:** http://arxiv.org/abs/2501.02975v1

**Abstract:**
> Outlier detection refers to the identification of anomalous samples that
> deviate significantly from the distribution of normal data and has been
> extensively studied and used in a variety of practical tasks. However, most
> unsupervised outlier detection methods are carefully designed to detect
> specified outliers, while real-world data may be entangled with different types
> of outliers. In this study, we propose a fuzzy rough sets-based multi-scale
> outlier detection method to identify various types of outliers. Specifically, a
> novel fuzzy rough sets-based method that integrates relative fuzzy granule
> density is first introduced to improve the capability of detecting local
> outliers. Then, a multi-scale view generation method based on granular-ball
> computing is proposed to collaboratively identify group outliers at different
> levels of granularity. Moreover, reliable outliers and inliers determined by
> the three-way decision are used to train a weighted support vector machine to
> further improve the performance of outlier detection. The proposed method
> innovatively transforms unsupervised outlier detection into a semi-supervised
> classification problem and for the first time explores the fuzzy rough
> sets-based outlier detection from the perspective of multi-scale granular
> balls, allowing for high adaptability to different types of outliers. Extensive
> experiments carried out on both artificial and UCI datasets demonstrate that
> the proposed outlier detection method significantly outperforms the
> state-of-the-art methods, improving the results by at least 8.48% in terms of
> the Area Under the ROC Curve (AUROC) index. The source code is released at
> https://github.com/Xiaofeng-Tan/MGBOD.

**Decision Explanation:**
Original decision: REJECT
The paper does not meet the primary criteria as it focuses on outlier detection using fuzzy rough sets and multi-scale granular balls, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thereby failing to align with the specified requirements.

---

## [CAMP: Collaborative Attention Model with Profiles for Vehicle Routing
Problems](https://arxiv.org/abs/http://arxiv.org/abs/2501.02977v1)
**arXiv ID:** http://arxiv.org/abs/2501.02977v1

**Abstract:**
> The profiled vehicle routing problem (PVRP) is a generalization of the
> heterogeneous capacitated vehicle routing problem (HCVRP) in which the
> objective is to optimize the routes of vehicles to serve client demands subject
> to different vehicle profiles, with each having a preference or constraint on a
> per-client basis. While existing learning methods have shown promise for
> solving the HCVRP in real-time, no learning method exists to solve the more
> practical and challenging PVRP. In this paper, we propose a Collaborative
> Attention Model with Profiles (CAMP), a novel approach that learns efficient
> solvers for PVRP using multi-agent reinforcement learning. CAMP employs a
> specialized attention-based encoder architecture to embed profiled client
> embeddings in parallel for each vehicle profile. We design a communication
> layer between agents for collaborative decision-making across profiled
> embeddings at each decoding step and a batched pointer mechanism to attend to
> the profiled embeddings to evaluate the likelihood of the next actions. We
> evaluate CAMP on two variants of PVRPs: PVRP with preferences, which explicitly
> influence the reward function, and PVRP with zone constraints with different
> numbers of agents and clients, demonstrating that our learned solvers achieve
> competitive results compared to both classical state-of-the-art neural
> multi-agent models in terms of solution quality and computational efficiency.
> We make our code openly available at https://github.com/ai4co/camp.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on developing a multi-agent reinforcement learning model (CAMP) for solving vehicle routing problems, with no emphasis on prompt engineering, Large Language Models (LLMs), or the manipulation of textual input prompts for LLM performance.

---

## [CONTINUUM: Detecting APT Attacks through Spatial-Temporal Graph Neural
Networks](https://arxiv.org/abs/http://arxiv.org/abs/2501.02981v2)
**arXiv ID:** http://arxiv.org/abs/2501.02981v2

**Abstract:**
> Advanced Persistent Threats (APTs) represent a significant challenge in
> cybersecurity due to their sophisticated and stealthy nature. Traditional
> Intrusion Detection Systems (IDS) often fall short in detecting these
> multi-stage attacks. Recently, Graph Neural Networks (GNNs) have been employed
> to enhance IDS capabilities by analyzing the complex relationships within
> networked data. However, existing GNN-based solutions are hampered by high
> false positive rates and substantial resource consumption. In this paper, we
> present a novel IDS designed to detect APTs using a Spatio-Temporal Graph
> Neural Network Autoencoder. Our approach leverages spatial information to
> understand the interactions between entities within a graph and temporal
> information to capture the evolution of the graph over time. This dual
> perspective is crucial for identifying the sequential stages of APTs.
> Furthermore, to address privacy and scalability concerns, we deploy our
> architecture in a federated learning environment. This setup ensures that local
> data remains on-premise while encrypted model-weights are shared and aggregated
> using homomorphic encryption, maintaining data privacy and security. Our
> evaluation shows that this system effectively detects APTs with lower false
> positive rates and optimized resource usage compared to existing methods,
> highlighting the potential of spatio-temporal analysis and federated learning
> in enhancing cybersecurity defenses.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper does not meet the 'MUST' criteria as it primarily focuses on detecting APT attacks using Spatio-Temporal Graph Neural Networks in cybersecurity, with no emphasis on engineering, design, or optimization of prompts for Large Language Models (LLMs)."
}

---

## [A Bio-Inspired Research Paradigm of Collision Perception Neurons
Enabling Neuro-Robotic Integration: The LGMD Case](https://arxiv.org/abs/http://arxiv.org/abs/2501.02982v1)
**arXiv ID:** http://arxiv.org/abs/2501.02982v1

**Abstract:**
> Compared to human vision, insect visual systems excel at rapid and precise
> collision detection, despite relying on only tens of thousands of neurons
> organized through a few neuropils. This efficiency makes them an attractive
> model system for developing artificial collision-detecting systems.
> Specifically, researchers have identified collision-selective neurons in the
> locust's optic lobe, called lobula giant movement detectors (LGMDs), which
> respond specifically to approaching objects. Research upon LGMD neurons began
> in the early 1970s. Initially, due to their large size, these neurons were
> identified as motion detectors, but their role as looming detectors was
> recognized over time. Since then, progress in neuroscience, computational
> modeling of LGMD's visual neural circuits, and LGMD-based robotics has advanced
> in tandem, each field supporting and driving the others. Today, with a deeper
> understanding of LGMD neurons, LGMD-based models have significantly improved
> collision-free navigation in mobile robots including ground and aerial robots.
> This review highlights recent developments in LGMD research from the
> perspectives of neuroscience, computational modeling, and robotics. It
> emphasizes a biologically plausible research paradigm, where insights from
> neuroscience inform real-world applications, which would in turn validate and
> advance neuroscience. With strong support from extensive research and growing
> application demand, this paradigm has reached a mature stage and demonstrates
> versatility across different areas of neuroscience research, thereby enhancing
> our understanding of the interconnections between neuroscience, computational
> modeling, and robotics. Furthermore, other motion-sensitive neurons have also
> shown promising potential for adopting this research paradigm.

**Decision Explanation:**
Original decision: REJECT
The paper does not focus on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance, instead focusing on bio-inspired robotics, neuroscience, and collision detection using LGMD neurons.

---

## [To Analyze and Regulate Human-in-the-loop Learning for Congestion Games](https://arxiv.org/abs/http://arxiv.org/abs/2501.03055v2)
**arXiv ID:** http://arxiv.org/abs/2501.03055v2

**Abstract:**
> In congestion games, selfish users behave myopically to crowd to the shortest
> paths, and the social planner designs mechanisms to regulate such selfish
> routing through information or payment incentives. However, such mechanism
> design requires the knowledge of time-varying traffic conditions and it is the
> users themselves to learn and report past road experiences to the social
> planner (e.g., Waze or Google Maps). When congestion games meet mobile
> crowdsourcing, it is critical to incentivize selfish users to explore
> non-shortest paths in the best exploitation-exploration trade-off. First, we
> consider a simple but fundamental parallel routing network with one
> deterministic path and multiple stochastic paths for users with an average
> arrival probability $\lambda$. We prove that the current myopic routing policy
> (widely used in Waze and Google Maps) misses both exploration (when strong
> hazard belief) and exploitation (when weak hazard belief) as compared to the
> social optimum. Due to the myopic policy's under-exploration, we prove that the
> caused price of anarchy (PoA) is larger than
> \(\frac{1}{1-\rho^{\frac{1}{\lambda}}}\), which can be arbitrarily large as
> discount factor \(\rho\rightarrow1\). To mitigate such huge efficiency loss, we
> propose a novel selective information disclosure (SID) mechanism: we only
> reveal the latest traffic information to users when they intend to over-explore
> stochastic paths upon arrival, while hiding such information when they want to
> under-explore. We prove that our mechanism successfully reduces PoA to be less
> than~\(2\). Besides the parallel routing network, we further extend our
> mechanism and PoA results to any linear path graphs with multiple intermediate
> nodes.

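For quick reference, the bound quoted in the abstract can be evaluated numerically. The sketch below simply plugs example values of the discount factor rho and the arrival probability lambda into 1 / (1 - rho^(1/lambda)) to show how quickly the myopic policy's PoA lower bound grows as rho approaches 1; the chosen values are illustrative only and do not come from the paper.

```python
# Illustrative only: evaluates the PoA lower bound 1 / (1 - rho**(1/lam))
# quoted in the abstract for a few hypothetical (rho, lam) values.
def poa_lower_bound(rho: float, lam: float) -> float:
    return 1.0 / (1.0 - rho ** (1.0 / lam))

for rho in (0.5, 0.9, 0.99, 0.999):
    print(f"rho={rho:<5}  lower bound on PoA ~ {poa_lower_bound(rho, lam=0.5):8.1f}")
```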

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper does not meet the 'MUST' criteria as it does not focus primarily on the engineering, design, or optimization of prompts for Large Language Models (LLMs). Instead, it addresses mechanism design for regulating human-in-the-loop learning in congestion games, with no apparent connection to LLMs or prompt engineering for text-based interactions."
}

---

## [Survival Analysis Revisited: Understanding and Unifying Poisson,
Exponential, and Cox Models in Fall Risk Analysis](https://arxiv.org/abs/http://arxiv.org/abs/2501.03058v1)
**arXiv ID:** http://arxiv.org/abs/2501.03058v1

**Abstract:**
> This paper explores foundational and applied aspects of survival analysis,
> using fall risk assessment as a case study. It revisits key time-related
> probability distributions and statistical methods, including logistic
> regression, Poisson regression, Exponential regression, and the Cox
> Proportional Hazards model, offering a unified perspective on their
> relationships within the survival analysis framework. A contribution of this
> work is the step-by-step derivation and clarification of the relationships
> among these models, particularly demonstrating that Poisson regression in the
> survival context is a specific case of the Cox model. These insights address
> gaps in understanding and reinforce the simplicity and interpretability of
> survival models. The paper also emphasizes the practical utility of survival
> analysis by connecting theoretical insights with real-world applications. In
> the context of fall detection, it demonstrates how these models can
> simultaneously predict fall risk, analyze contributing factors, and estimate
> time-to-event outcomes within a single streamlined framework. In contrast,
> advanced deep learning methods often require complex post-hoc interpretation
> and separate training for different tasks particularly when working with
> structured numerical data. This highlights the enduring relevance of classical
> statistical frameworks and makes survival models especially valuable in
> healthcare settings, where explainability and robustness are critical. By
> unifying foundational concepts and offering a cohesive perspective on
> time-to-event analysis, this work serves as an accessible resource for
> understanding survival models and applying them effectively to diverse
> analytical challenges.

**Decision Explanation:**
Original decision: REJECT
The paper does not meet the 'MUST' criteria as it primarily focuses on survival analysis, statistical methods, and their application in healthcare, with no apparent emphasis on prompt engineering, Large Language Models (LLMs), or the manipulation of textual input prompts to improve LLM performance.

---

## [Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video
Generation](https://arxiv.org/abs/http://arxiv.org/abs/2501.03059v1)
**arXiv ID:** http://arxiv.org/abs/2501.03059v1

**Abstract:**
> We consider the task of Image-to-Video (I2V) generation, which involves
> transforming static images into realistic video sequences based on a textual
> description. While recent advancements produce photorealistic outputs, they
> frequently struggle to create videos with accurate and consistent object
> motion, especially in multi-object scenarios. To address these limitations, we
> propose a two-stage compositional framework that decomposes I2V generation
> into: (i) An explicit intermediate representation generation stage, followed by
> (ii) A video generation stage that is conditioned on this representation. Our
> key innovation is the introduction of a mask-based motion trajectory as an
> intermediate representation, that captures both semantic object information and
> motion, enabling an expressive but compact representation of motion and
> semantics. To incorporate the learned representation in the second stage, we
> utilize object-level attention objectives. Specifically, we consider a spatial,
> per-object, masked-cross attention objective, integrating object-specific
> prompts into corresponding latent space regions and a masked spatio-temporal
> self-attention objective, ensuring frame-to-frame consistency for each object.
> We evaluate our method on challenging benchmarks with multi-object and
> high-motion scenarios and empirically demonstrate that the proposed method
> achieves state-of-the-art results in temporal coherence, motion realism, and
> text-prompt faithfulness. Additionally, we introduce \benchmark, a new
> challenging benchmark for single-object and multi-object I2V generation, and
> demonstrate our method's superiority on this benchmark. Project page is
> available at https://guyyariv.github.io/TTM/.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on Image-to-Video (I2V) generation, a generative AI application other than text generation driven by LLMs, and does not centrally concern prompt engineering for text-based interactions with LLMs, despite mentioning object-specific prompts as a component of the methodology.

---

## [The Scaling Law for LoRA Base on Mutual Information Upper Bound](https://arxiv.org/abs/http://arxiv.org/abs/2501.03152v1)
**arXiv ID:** http://arxiv.org/abs/2501.03152v1

**Abstract:**
> LoRA (Low-Rank Adaptation) is a widely used model fine-tuning method. In
> fine-tuning, the law among model performance, model parameters, and data
> complexity has been a focal issue in the field. Existing methods often leverage
> external metrics (such as cross-entropy or perplexity) to evaluate model
> performance. In the fine-tuning process for large models, two types of
> knowledge are typically involved: the frozen, general knowledge acquired by the
> model during pre-training and the new knowledge learned through the LoRA module
> from the current data. Generally, the less LoRA's learned knowledge relies on
> the large model, the more it captures the specific knowledge of new data,
> thereby enhancing its adaptability to new tasks. However, external metrics do
> not readily capture the dependency relationship between these two types of
> knowledge. Therefore, we designed an internal metric based on the Mutual
> Information Upper Bound (MIUB) theory to investigate the scaling law of
> large-model LoRA fine-tuning. In our experiments, we validated this approach on
> benchmark datasets, using the Llama3-8B and Phi3-3B models. The results show
> that the proposed MIUB metric aligns more accurately and stably with the
> scaling law of LoRA fine-tuning compared to cross-entropy and perplexity.

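As background for the abstract's framing (frozen pre-trained weights plus new knowledge carried by a low-rank module), here is a minimal, generic sketch of a LoRA-style forward pass in NumPy. It is not the paper's MIUB metric or experimental setup; the shapes, initialization, and scaling constant are arbitrary illustrative choices.

```python
# Generic LoRA-style forward pass: frozen weight W plus low-rank update B @ A.
# Purely illustrative; shapes, init, and scaling are arbitrary, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 2                # rank r << min(d_in, d_out)
W = rng.normal(size=(d_out, d_in))       # frozen pre-trained weight
A = 0.01 * rng.normal(size=(r, d_in))    # trainable down-projection
B = np.zeros((d_out, r))                 # trainable up-projection, zero init
alpha = 4.0                              # LoRA scaling numerator

def lora_forward(x: np.ndarray) -> np.ndarray:
    return W @ x + (alpha / r) * (B @ (A @ x))

print(lora_forward(rng.normal(size=d_in)).shape)  # (8,)
```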

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on model fine-tuning (LoRA) and evaluating its performance using an internal metric (Mutual Information Upper Bound), rather than on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs).

---

## [Detecting AI-Generated Text in Educational Content: Leveraging Machine
Learning and Explainable AI for Academic Integrity](https://arxiv.org/abs/http://arxiv.org/abs/2501.03203v1)
**arXiv ID:** http://arxiv.org/abs/2501.03203v1

**Abstract:**
> This study seeks to enhance academic integrity by providing tools to detect
> AI-generated content in student work using advanced technologies. The findings
> promote transparency and accountability, helping educators maintain ethical
> standards and supporting the responsible integration of AI in education. A key
> contribution of this work is the generation of the CyberHumanAI dataset, which
> has 1000 observations, 500 of which are written by humans and the other 500
> produced by ChatGPT. We evaluate various machine learning (ML) and deep
> learning (DL) algorithms on the CyberHumanAI dataset comparing human-written
> and AI-generated content from Large Language Models (LLMs) (i.e., ChatGPT).
> Results demonstrate that traditional ML algorithms, specifically XGBoost and
> Random Forest, achieve high performance (83% and 81% accuracies respectively).
> Results also show that classifying shorter content seems to be more challenging
> than classifying longer content. Further, using Explainable Artificial
> Intelligence (XAI) we identify discriminative features influencing the ML
> model's predictions, where human-written content tends to use a practical
> language (e.g., use and allow). Meanwhile AI-generated text is characterized by
> more abstract and formal terms (e.g., realm and employ). Finally, a comparative
> analysis with GPTZero show that our narrowly focused, simple, and fine-tuned
> model can outperform generalized systems like GPTZero. The proposed model
> achieved approximately 77.5% accuracy compared to GPTZero's 48.5% accuracy when
> tasked to classify Pure AI, Pure Human, and mixed class. GPTZero showed a
> tendency to classify challenging and small-content cases as either mixed or
> unrecognized while our proposed model showed a more balanced performance across
> the three classes.

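The classical pipeline the abstract describes (lexical features feeding a tree-ensemble classifier) can be sketched in a few lines of scikit-learn. The toy texts and labels below are invented for illustration and are not the CyberHumanAI data; a real setup would train on that dataset and use XGBoost or a tuned Random Forest.

```python
# Minimal sketch: TF-IDF features + Random Forest for human vs. AI-written text.
# The four toy examples are made up; they only show the plumbing, not real results.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

texts = [
    "we use this tool to allow faster grading",           # human-style, practical wording
    "employing methods within this realm enables rigor",  # AI-style, abstract wording
    "students use the site to allow quick feedback",
    "one may employ techniques from a broad realm",
]
labels = [0, 1, 0, 1]  # 0 = human, 1 = AI-generated

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(texts, labels)
print(clf.predict(["they employ formal terms from that realm"]))
```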

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on detecting AI-generated text in educational content using machine learning and Explainable AI, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), as required by the criteria.

---

## [BoostStep: Boosting mathematical capability of Large Language Models via
improved single-step reasoning](https://arxiv.org/abs/http://arxiv.org/abs/2501.03226v2)
**arXiv ID:** http://arxiv.org/abs/2501.03226v2

**Abstract:**
> Cutting-edge large language models (LLMs) demonstrate promising performance
> in solving complex math problems with a divide-and-conquer pipeline and the
> assistance of in-context learning (ICL) examples. However, their potential for
> improvement is limited by two critical problems within their ICL examples:
> granularity-mismatch and the ensuing negative-effect noise problem.
> Specifically, the LLMs are capable of the dividing process yet mostly failed by
> inaccurate reasoning within a few conquer steps, while the ICL examples
> retrieved in question-grained sometimes lack relevant steps for a specific
> challenging reasoning step. Further, this disconnect may hinder the correct
> reasoning due to its irrelevance. To this end, we focus on improving the
> reasoning quality within each step and present BoostStep. BoostStep aligns the
> granularity between the retrieving and reasoning on step grained, and provides
> highly related ICL examples for each reasoning step with a novel `first-try'
> strategy. BoostStep provides more relevant examples than the coarse
> question-grained strategy, enhancing the model reasoning quality within each
> step steadily. BoostStep is a general and robust reasoning-enhancing method
> that not only improves standalone reasoning performance but also integrates
> seamlessly with Monte Carlo Tree Search methods (MCTS) to refine both candidate
> generation and decision-making. Quantitatively, it improves GPT-4o and
> Qwen2.5-Math-72B by 3.6\% and 2.0\% respectively on various mathematical
> benchmarks, and 7.5\% gain combined with MCTS.

**Decision Explanation:**
Original decision: REJECT
Although the paper improves LLM performance through refined in-context learning (ICL) examples, its primary focus is on enhancing mathematical reasoning capabilities within LLMs via a novel strategy, rather than prompt engineering techniques specifically for text-based interactions with LLMs.

---

## [DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich
Paradigm for Direct Preference Optimization](https://arxiv.org/abs/http://arxiv.org/abs/2501.03271v3)
**arXiv ID:** http://arxiv.org/abs/2501.03271v3

**Abstract:**
> The rapid rise of large language models (LLMs) has unlocked many applications
> but also underscores the challenge of aligning them with diverse values and
> preferences. Direct Preference Optimization (DPO) is central to alignment but
> constrained by fixed divergences and limited feature transformations. We
> propose DPO-Kernels, which integrates kernel methods to address these issues
> through four key contributions: (i) Kernelized Representations with polynomial,
> RBF, Mahalanobis, and spectral kernels for richer transformations, plus a
> hybrid loss combining embedding-based and probability-based objectives; (ii)
> Divergence Alternatives (Jensen-Shannon, Hellinger, Renyi, Bhattacharyya,
> Wasserstein, and f-divergences) for greater stability; (iii) Data-Driven
> Selection metrics that automatically choose the best kernel-divergence pair;
> and (iv) a Hierarchical Mixture of Kernels for both local precision and global
> modeling. Evaluations on 12 datasets demonstrate state-of-the-art performance
> in factuality, safety, reasoning, and instruction following. Grounded in
> Heavy-Tailed Self-Regularization, DPO-Kernels maintains robust generalization
> for LLMs, offering a comprehensive resource for further alignment research.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper primarily focuses on Direct Preference Optimization (DPO) for aligning Large Language Models (LLMs) with diverse values and preferences, rather than prompt engineering specifically for LLMs. While LLMs are mentioned, the core subject is DPO methodology, not the design, optimization, or manipulation of textual input prompts for improving LLM performance."
}

---

## [Revolutionizing Encrypted Traffic Classification with MH-Net: A
Multi-View Heterogeneous Graph Model](https://arxiv.org/abs/http://arxiv.org/abs/2501.03279v1)
**arXiv ID:** http://arxiv.org/abs/2501.03279v1

**Abstract:**
> With the growing significance of network security, the classification of
> encrypted traffic has emerged as an urgent challenge. Traditional byte-based
> traffic analysis methods are constrained by the rigid granularity of
> information and fail to fully exploit the diverse correlations between bytes.
> To address these limitations, this paper introduces MH-Net, a novel approach
> for classifying network traffic that leverages multi-view heterogeneous traffic
> graphs to model the intricate relationships between traffic bytes. The essence
> of MH-Net lies in aggregating varying numbers of traffic bits into multiple
> types of traffic units, thereby constructing multi-view traffic graphs with
> diverse information granularities. By accounting for different types of byte
> correlations, such as header-payload relationships, MH-Net further endows the
> traffic graph with heterogeneity, significantly enhancing model performance.
> Notably, we employ contrastive learning in a multi-task manner to strengthen
> the robustness of the learned traffic unit representations. Experiments
> conducted on the ISCX and CIC-IoT datasets for both the packet-level and
> flow-level traffic classification tasks demonstrate that MH-Net achieves the
> best overall performance compared to dozens of SOTA methods.

**Decision Explanation:**
Original decision: REJECT
The paper does not meet the primary focus criteria, as it revolves around network traffic classification using a multi-view heterogeneous graph model (MH-Net), with no apparent connection to Large Language Models (LLMs), prompt engineering, or text generation, thus failing to satisfy the mandatory 'MUST' criteria.

---

## [A Decision-Based Heterogenous Graph Attention Network for Multi-Class
Fake News Detection](https://arxiv.org/abs/http://arxiv.org/abs/2501.03290v1)
**arXiv ID:** http://arxiv.org/abs/2501.03290v1

**Abstract:**
> A promising tool for addressing fake news detection is Graph Neural Networks
> (GNNs). However, most existing GNN-based methods rely on binary classification,
> categorizing news as either real or fake. Additionally, traditional GNN models
> use a static neighborhood for each node, making them susceptible to issues like
> over-squashing. In this paper, we introduce a novel model named Decision-based
> Heterogeneous Graph Attention Network (DHGAT) for fake news detection in a
> semi-supervised setting. DHGAT effectively addresses the limitations of
> traditional GNNs by dynamically optimizing and selecting the neighborhood type
> for each node in every layer. It represents news data as a heterogeneous graph
> where nodes (news items) are connected by various types of edges. The
> architecture of DHGAT consists of a decision network that determines the
> optimal neighborhood type and a representation network that updates node
> embeddings based on this selection. As a result, each node learns an optimal
> and task-specific computational graph, enhancing both the accuracy and
> efficiency of the fake news detection process. We evaluate DHGAT on the LIAR
> dataset, a large and challenging dataset for multi-class fake news detection,
> which includes news items categorized into six classes. Our results demonstrate
> that DHGAT outperforms existing methods, improving accuracy by approximately 4%
> and showing robustness with limited labeled data.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on developing a new Graph Neural Network (GNN) architecture for multi-class fake news detection, which does not meet the 'MUST' criteria of primarily focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) or investigating methods for improving LLM performance through textual input prompt manipulation.

---

## [Multi-Modal One-Shot Federated Ensemble Learning for Medical Data with
Vision Large Language Model](https://arxiv.org/abs/http://arxiv.org/abs/2501.03292v1)
**arXiv ID:** http://arxiv.org/abs/2501.03292v1

**Abstract:**
> Federated learning (FL) has attracted considerable interest in the medical
> domain due to its capacity to facilitate collaborative model training while
> maintaining data privacy. However, conventional FL methods typically
> necessitate multiple communication rounds, leading to significant communication
> overhead and delays, especially in environments with limited bandwidth.
> One-shot federated learning addresses these issues by conducting model training
> and aggregation in a single communication round, thereby reducing communication
> costs while preserving privacy. Among these, one-shot federated ensemble
> learning combines independently trained client models using ensemble techniques
> such as voting, further boosting performance in non-IID data scenarios. On the
> other hand, existing machine learning methods in healthcare predominantly use
> unimodal data (e.g., medical images or textual reports), which restricts their
> diagnostic accuracy and comprehensiveness. Therefore, the integration of
> multi-modal data is proposed to address these shortcomings. In this paper, we
> introduce FedMME, an innovative one-shot multi-modal federated ensemble
> learning framework that utilizes multi-modal data for medical image analysis.
> Specifically, FedMME capitalizes on vision large language models to produce
> textual reports from medical images, employs a BERT model to extract textual
> features from these reports, and amalgamates these features with visual
> features to improve diagnostic accuracy. Experimental results show that our
> method demonstrated superior performance compared to existing one-shot
> federated learning methods in healthcare scenarios across four datasets with
> various data distributions. For instance, it surpasses existing one-shot
> federated learning approaches by more than 17.5% in accuracy on the RSNA
> dataset when applying a Dirichlet distribution with ($\alpha$ = 0.3).

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on the development of a one-shot federated ensemble learning framework for medical data, using a Vision Large Language Model as a component, rather than focusing specifically on the engineering, design, or optimization of prompts for Large Language Models.

---

## [Rethinking Byzantine Robustness in Federated Recommendation from Sparse
Aggregation Perspective](https://arxiv.org/abs/http://arxiv.org/abs/2501.03301v2)
**arXiv ID:** http://arxiv.org/abs/2501.03301v2

**Abstract:**
> To preserve user privacy in recommender systems, federated recommendation
> (FR) based on federated learning (FL) emerges, keeping the personal data on the
> local client and updating a model collaboratively. Unlike FL, FR has a unique
> sparse aggregation mechanism, where the embedding of each item is updated by
> only partial clients, instead of full clients in a dense aggregation of general
> FL. Recently, as an essential principle of FL, model security has received
> increasing attention, especially for Byzantine attacks, where malicious clients
> can send arbitrary updates. The problem of exploring the Byzantine robustness
> of FR is particularly critical since in the domains applying FR, e.g.,
> e-commerce, malicious clients can be injected easily by registering new
> accounts. However, existing Byzantine works neglect the unique sparse
> aggregation of FR, making them unsuitable for our problem. Thus, we make the
> first effort to investigate Byzantine attacks on FR from the perspective of
> sparse aggregation, which is non-trivial: it is not clear how to define
> Byzantine robustness under sparse aggregations and design Byzantine attacks
> under limited knowledge/capability. In this paper, we reformulate the Byzantine
> robustness under sparse aggregation by defining the aggregation for a single
> item as the smallest execution unit. Then we propose a family of effective
> attack strategies, named Spattack, which exploit the vulnerability in sparse
> aggregation and are categorized along the adversary's knowledge and capability.
> Extensive experimental results demonstrate that Spattack can effectively
> prevent convergence and even break down defenses under a few malicious clients,
> raising alarms for securing FR systems.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper focuses on Byzantine robustness in Federated Recommendation systems, specifically addressing sparse aggregation and security against malicious clients, with no primary concern or investigation into prompt engineering for Large Language Models (LLMs) or their textual input prompts."
}

---

## [Advanced Machine Learning Techniques for Social Support Detection on
Social Media](https://arxiv.org/abs/http://arxiv.org/abs/2501.03370v1)
**arXiv ID:** http://arxiv.org/abs/2501.03370v1

**Abstract:**
> The widespread use of social media highlights the need to understand its
> impact, particularly the role of online social support. This study uses a
> dataset focused on online social support, which includes binary and multiclass
> classifications of social support content on social media. The classification
> of social support is divided into three tasks. The first task focuses on
> distinguishing between supportive and non-supportive. The second task aims to
> identify whether the support is directed toward an individual or a group. The
> third task categorizes the specific type of social support, grouping it into
> categories such as Nation, LGBTQ, Black people, Women, Religion, and Other (if
> it does not fit into the previously mentioned categories). To address data
> imbalances in these tasks, we employed K-means clustering for balancing the
> dataset and compared the results with the original unbalanced data. Using
> advanced machine learning techniques, including transformers and zero-shot
> learning approaches with GPT3, GPT4, and GPT4-o, we predict social support
> levels in various contexts. The effectiveness of the dataset is evaluated using
> baseline models across different learning approaches, with transformer-based
> methods demonstrating superior performance. Additionally, we achieved a 0.4\%
> increase in the macro F1 score for the second task and a 0.7\% increase for the
> third task, compared to previous work utilizing traditional machine learning
> with psycholinguistic and unigram-based TF-IDF values.

**Decision Explanation:**
Original decision: REJECT
The paper does not focus primarily on the engineering, design, or optimization of prompts for Large Language Models (LLMs). Instead, it utilizes LLMs (GPT3, GPT4) as components for social support detection on social media, with the primary focus being on advanced machine learning techniques for classification tasks, not prompt engineering.

---

## [License Plate Images Generation with Diffusion Models](https://arxiv.org/abs/http://arxiv.org/abs/2501.03374v1)
**arXiv ID:** http://arxiv.org/abs/2501.03374v1

**Abstract:**
> Despite the evident practical importance of license plate recognition (LPR),
> corresponding research is limited by the volume of publicly available datasets
> due to privacy regulations such as the General Data Protection Regulation
> (GDPR). To address this challenge, synthetic data generation has emerged as a
> promising approach. In this paper, we propose to synthesize realistic license
> plates (LPs) using diffusion models, inspired by recent advances in image and
> video generation. In our experiments a diffusion model was successfully trained
> on a Ukrainian LP dataset, and 1000 synthetic images were generated for
> detailed analysis. Through manual classification and annotation of the
> generated images, we performed a thorough study of the model output, such as
> success rate, character distributions, and type of failures. Our contributions
> include experimental validation of the efficacy of diffusion models for LP
> synthesis, along with insights into the characteristics of the generated data.
> Furthermore, we have prepared a synthetic dataset consisting of 10,000 LP
> images, publicly available at https://zenodo.org/doi/10.5281/zenodo.13342102.
> Conducted experiments empirically confirm the usefulness of synthetic data for
> the LPR task. Despite the initial performance gap between the model trained
> with real and synthetic data, the expansion of the training data set with
> pseudolabeled synthetic data leads to an improvement in LPR accuracy by 3%
> compared to baseline.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on generating license plate images using diffusion models, which is an application of generative AI in image generation, not text generation driven by Large Language Models (LLMs), and does not investigate prompt engineering for LLMs.

---

## [Activating Associative Disease-Aware Vision Token Memory for LLM-Based
X-ray Report Generation](https://arxiv.org/abs/http://arxiv.org/abs/2501.03458v1)
**arXiv ID:** http://arxiv.org/abs/2501.03458v1

**Abstract:**
> X-ray image based medical report generation achieves significant progress in
> recent years with the help of the large language model, however, these models
> have not fully exploited the effective information in visual image regions,
> resulting in reports that are linguistically sound but insufficient in
> describing key diseases. In this paper, we propose a novel associative
> memory-enhanced X-ray report generation model that effectively mimics the
> process of professional doctors writing medical reports. It considers both the
> mining of global and local visual information and associates historical report
> information to better complete the writing of the current report. Specifically,
> given an X-ray image, we first utilize a classification model along with its
> activation maps to accomplish the mining of visual regions highly associated
> with diseases and the learning of disease query tokens. Then, we employ a
> visual Hopfield network to establish memory associations for disease-related
> tokens, and a report Hopfield network to retrieve report memory information.
> This process facilitates the generation of high-quality reports based on a
> large language model and achieves state-of-the-art performance on multiple
> benchmark datasets, including the IU X-ray, MIMIC-CXR, and Chexpert Plus. The
> source code of this work is released on
> \url{https://github.com/Event-AHU/Medical_Image_Analysis}.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on developing a novel model for X-ray report generation in a medical context, which violates the 'MUST NOT' criteria of being primarily concerned with medical subjects and not focusing on prompt engineering for text-based interactions with LLMs as the core subject.

---

## [Radar Signal Recognition through Self-Supervised Learning and Domain
Adaptation](https://arxiv.org/abs/http://arxiv.org/abs/2501.03461v2)
**arXiv ID:** http://arxiv.org/abs/2501.03461v2

**Abstract:**
> Automatic radar signal recognition (RSR) plays a pivotal role in electronic
> warfare (EW), as accurately classifying radar signals is critical for informing
> decision-making processes. Recent advances in deep learning have shown
> significant potential in improving RSR performance in domains with ample
> annotated data. However, these methods fall short in EW scenarios where
> annotated RF data are scarce or impractical to obtain. To address these
> challenges, we introduce a self-supervised learning (SSL) method which utilises
> masked signal modelling and RF domain adaption to enhance RSR performance in
> environments with limited RF samples and labels. Specifically, we investigate
> pre-training masked autoencoders (MAE) on baseband in-phase and quadrature
> (I/Q) signals from various RF domains and subsequently transfer the learned
> representation to the radar domain, where annotated data are limited. Empirical
> results show that our lightweight self-supervised ResNet model with domain
> adaptation achieves up to a 17.5% improvement in 1-shot classification accuracy
> when pre-trained on in-domain signals (i.e., radar signals) and up to a 16.31%
> improvement when pre-trained on out-of-domain signals (i.e., comm signals),
> compared to its baseline without SSL. We also provide reference results for
> several MAE designs and pre-training strategies, establishing a new benchmark
> for few-shot radar signal classification.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on self-supervised learning and domain adaptation for radar signal recognition, with no discussion on prompt engineering, Large Language Models (LLMs), or textual input prompts, thus failing to meet all 'MUST' criteria.

---

## [KG-TRICK: Unifying Textual and Relational Information Completion of
Knowledge for Multilingual Knowledge Graphs](https://arxiv.org/abs/http://arxiv.org/abs/2501.03560v1)
**arXiv ID:** http://arxiv.org/abs/2501.03560v1

**Abstract:**
> Multilingual knowledge graphs (KGs) provide high-quality relational and
> textual information for various NLP applications, but they are often
> incomplete, especially in non-English languages. Previous research has shown
> that combining information from KGs in different languages aids either
> Knowledge Graph Completion (KGC), the task of predicting missing relations
> between entities, or Knowledge Graph Enhancement (KGE), the task of predicting
> missing textual information for entities. Although previous efforts have
> considered KGC and KGE as independent tasks, we hypothesize that they are
> interdependent and mutually beneficial. To this end, we introduce KG-TRICK, a
> novel sequence-to-sequence framework that unifies the tasks of textual and
> relational information completion for multilingual KGs. KG-TRICK demonstrates
> that: i) it is possible to unify the tasks of KGC and KGE into a single
> framework, and ii) combining textual information from multiple languages is
> beneficial to improve the completeness of a KG. As part of our contributions,
> we also introduce WikiKGE10++, the largest manually-curated benchmark for
> textual information completion of KGs, which features over 25,000 entities
> across 10 diverse languages.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on unifying Knowledge Graph Completion and Enhancement tasks for multilingual Knowledge Graphs, utilizing a sequence-to-sequence framework, without explicit concentration on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.

---

## [From Code to Compliance: Assessing ChatGPT's Utility in Designing an
Accessible Webpage -- A Case Study](https://arxiv.org/abs/http://arxiv.org/abs/2501.03572v1)
**arXiv ID:** http://arxiv.org/abs/2501.03572v1

**Abstract:**
> Web accessibility ensures that individuals with disabilities can access and
> interact with digital content without barriers, yet a significant majority of
> most used websites fail to meet accessibility standards. This study evaluates
> ChatGPT's (GPT-4o) ability to generate and improve web pages in line with Web
> Content Accessibility Guidelines (WCAG). While ChatGPT can effectively address
> accessibility issues when prompted, its default code often lacks compliance,
> reflecting limitations in its training data and prevailing inaccessible web
> practices. Automated and manual testing revealed strengths in resolving simple
> issues but challenges with complex tasks, requiring human oversight and
> additional iterations. Unlike prior studies, we incorporate manual evaluation,
> dynamic elements, and use the visual reasoning capability of ChatGPT along with
> the prompts to fix accessibility issues. Providing screenshots alongside
> prompts enhances the LLM's ability to address accessibility issues by allowing
> it to analyze surrounding components, such as determining appropriate contrast
> colors. We found that effective prompt engineering, such as providing concise,
> structured feedback and incorporating visual aids, significantly enhances
> ChatGPT's performance. These findings highlight the potential and limitations
> of large language models for accessible web development, offering practical
> guidance for developers to create more inclusive websites.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper's primary focus is leveraging ChatGPT for accessible web development, not prompt engineering for Large Language Models. While it mentions effective prompt engineering techniques, this is secondary to its main objective of improving web accessibility."
}

---

## [Action Quality Assessment via Hierarchical Pose-guided Multi-stage
Contrastive Regression](https://arxiv.org/abs/http://arxiv.org/abs/2501.03674v1)
**arXiv ID:** http://arxiv.org/abs/2501.03674v1

**Abstract:**
> Action Quality Assessment (AQA), which aims at automatic and fair evaluation
> of athletic performance, has gained increasing attention in recent years.
> However, athletes are often in rapid movement and the corresponding visual
> appearance variances are subtle, making it challenging to capture fine-grained
> pose differences and leading to poor estimation performance. Furthermore, most
> common AQA tasks, such as diving in sports, are usually divided into multiple
> sub-actions, each of which contains different durations. However, existing
> methods focus on segmenting the video into fixed frames, which disrupts the
> temporal continuity of sub-actions resulting in unavoidable prediction errors.
> To address these challenges, we propose a novel action quality assessment
> method through hierarchically pose-guided multi-stage contrastive regression.
> Firstly, we introduce a multi-scale dynamic visual-skeleton encoder to capture
> fine-grained spatio-temporal visual and skeletal features. Then, a procedure
> segmentation network is introduced to separate different sub-actions and obtain
> segmented features. Afterwards, the segmented visual and skeletal features are
> both fed into a multi-modal fusion module as physics structural priors, to
> guide the model in learning refined activity similarities and variances.
> Finally, a multi-stage contrastive learning regression approach is employed to
> learn discriminative representations and output prediction results. In
> addition, we introduce a newly-annotated FineDiving-Pose Dataset to improve the
> current low-quality human pose labels. In experiments, the results on
> FineDiving and MTL-AQA datasets demonstrate the effectiveness and superiority
> of our proposed approach. Our source code and dataset are available at
> https://github.com/Lumos0507/HP-MCoRe.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on Action Quality Assessment using computer vision and pose-guided contrastive regression, with no mention of Large Language Models (LLMs), prompt engineering, or text-based interactions, thus failing to meet all 'MUST' criteria.

---

## [MAJL: A Model-Agnostic Joint Learning Framework for Music Source
Separation and Pitch Estimation](https://arxiv.org/abs/http://arxiv.org/abs/2501.03689v1)
**arXiv ID:** http://arxiv.org/abs/2501.03689v1

**Abstract:**
> Music source separation and pitch estimation are two vital tasks in music
> information retrieval. Typically, the input of pitch estimation is obtained
> from the output of music source separation. Therefore, existing methods have
> tried to perform these two tasks simultaneously, so as to leverage the mutually
> beneficial relationship between both tasks. However, these methods still face
> two critical challenges that limit the improvement of both tasks: the lack of
> labeled data and joint learning optimization. To address these challenges, we
> propose a Model-Agnostic Joint Learning (MAJL) framework for both tasks. MAJL
> is a generic framework and can use variant models for each task. It includes a
> two-stage training method and a dynamic weighting method named Dynamic Weights
> on Hard Samples (DWHS), which addresses the lack of labeled data and joint
> learning optimization, respectively. Experimental results on public music
> datasets show that MAJL outperforms state-of-the-art methods on both tasks,
> with significant improvements of 0.92 in Signal-to-Distortion Ratio (SDR) for
> music source separation and 2.71% in Raw Pitch Accuracy (RPA) for pitch
> estimation. Furthermore, comprehensive studies not only validate the
> effectiveness of each component of MAJL, but also indicate the great generality
> of MAJL in adapting to different model architectures.

**Decision Explanation:**
Original decision: REJECT
The paper does not meet the 'MUST' criteria as it focuses on music source separation and pitch estimation, leveraging joint learning frameworks, without any mention of Large Language Models (LLMs), prompt engineering, or text-based interactions, thus falling outside the specified scope.

---

## [AuxDepthNet: Real-Time Monocular 3D Object Detection with
Depth-Sensitive Features](https://arxiv.org/abs/http://arxiv.org/abs/2501.03700v1)
**arXiv ID:** http://arxiv.org/abs/2501.03700v1

**Abstract:**
> Monocular 3D object detection is a challenging task in autonomous systems due
> to the lack of explicit depth information in single-view images. Existing
> methods often depend on external depth estimators or expensive sensors, which
> increase computational complexity and hinder real-time performance. To overcome
> these limitations, we propose AuxDepthNet, an efficient framework for real-time
> monocular 3D object detection that eliminates the reliance on external depth
> maps or pre-trained depth models. AuxDepthNet introduces two key components:
> the Auxiliary Depth Feature (ADF) module, which implicitly learns
> depth-sensitive features to improve spatial reasoning and computational
> efficiency, and the Depth Position Mapping (DPM) module, which embeds depth
> positional information directly into the detection process to enable accurate
> object localization and 3D bounding box regression. Leveraging the DepthFusion
> Transformer architecture, AuxDepthNet globally integrates visual and
> depth-sensitive features through depth-guided interactions, ensuring robust and
> efficient detection. Extensive experiments on the KITTI dataset show that
> AuxDepthNet achieves state-of-the-art performance, with $\text{AP}_{3D}$ scores
> of 24.72\% (Easy), 18.63\% (Moderate), and 15.31\% (Hard), and
> $\text{AP}_{\text{BEV}}$ scores of 34.11\% (Easy), 25.18\% (Moderate), and
> 21.90\% (Hard) at an IoU threshold of 0.7.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on monocular 3D object detection for autonomous systems, involving depth estimation and sensor technologies, with no mention of Large Language Models (LLMs) or prompt engineering for text-based interactions, thus failing all 'MUST' criteria.

---

## [Self-adaptive vision-language model for 3D segmentation of pulmonary
artery and vein](https://arxiv.org/abs/http://arxiv.org/abs/2501.03722v1)
**arXiv ID:** http://arxiv.org/abs/2501.03722v1

**Abstract:**
> Accurate segmentation of pulmonary structures is crucial in clinical
> diagnosis, disease study, and treatment planning. Significant progress has been
> made in deep learning-based segmentation techniques, but most require much
> labeled data for training. Consequently, developing precise segmentation
> methods that demand fewer labeled datasets is paramount in medical image
> analysis. The emergence of pre-trained vision-language foundation models, such
> as CLIP, recently opened the door for universal computer vision tasks.
> Exploiting the generalization ability of these pre-trained foundation models on
> downstream tasks, such as segmentation, leads to unexpected performance with a
> relatively small amount of labeled data. However, exploring these models for
> pulmonary artery-vein segmentation is still limited. This paper proposes a
> novel framework called Language-guided self-adaptive Cross-Attention Fusion
> Framework. Our method adopts pre-trained CLIP as a strong feature extractor for
> generating the segmentation of 3D CT scans, while adaptively aggregating the
> cross-modality of text and image representations. We propose a specially
> designed adapter module to fine-tune pre-trained CLIP with a self-adaptive
> learning strategy to effectively fuse the two modalities of embeddings. We
> extensively validate our method on a local dataset, which is the largest
> pulmonary artery-vein CT dataset to date and consists of 718 labeled data in
> total. The experiments show that our method outperformed other state-of-the-art
> methods by a large margin. Our data and code will be made publicly available
> upon acceptance.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on medical image analysis (3D segmentation of pulmonary artery and vein) and the development of a vision-language model, rather than the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), and does not demonstrate the impact of textual input prompts on LLM output.

---

## [Deep Sylvester Posterior Inference for Adaptive Compressed Sensing in
Ultrasound Imaging](https://arxiv.org/abs/http://arxiv.org/abs/2501.03825v1)
**arXiv ID:** http://arxiv.org/abs/2501.03825v1

**Abstract:**
> Ultrasound images are commonly formed by sequential acquisition of
> beam-steered scan-lines. Minimizing the number of required scan-lines can
> significantly enhance frame rate, field of view, energy efficiency, and data
> transfer speeds. Existing approaches typically use static subsampling schemes
> in combination with sparsity-based or, more recently, deep-learning-based
> recovery. In this work, we introduce an adaptive subsampling method that
> maximizes intrinsic information gain in-situ, employing a Sylvester Normalizing
> Flow encoder to infer an approximate Bayesian posterior under partial
> observation in real-time. Using the Bayesian posterior and a deep generative
> model for future observations, we determine the subsampling scheme that
> maximizes the mutual information between the subsampled observations, and the
> next frame of the video. We evaluate our approach using the EchoNet cardiac
> ultrasound video dataset and demonstrate that our active sampling method
> outperforms competitive baselines, including uniform and variable-density
> random sampling, as well as equidistantly spaced scan-lines, improving mean
> absolute reconstruction error by 15%. Moreover, posterior inference and the
> sampling scheme generation are performed in just 0.015 seconds (66Hz), making
> it fast enough for real-time 2D ultrasound imaging applications.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on adaptive compressed sensing in ultrasound imaging, employing deep learning for image reconstruction, and does not investigate, analyze, or propose methods for improving Large Language Model (LLM) performance through prompt engineering, thus failing to meet all 'MUST' criteria.

---

## [TACLR: A Scalable and Efficient Retrieval-based Method for Industrial
Product Attribute Value Identification](https://arxiv.org/abs/http://arxiv.org/abs/2501.03835v1)
**arXiv ID:** http://arxiv.org/abs/2501.03835v1

**Abstract:**
> Product Attribute Value Identification (PAVI) involves identifying attribute
> values from product profiles, a key task for improving product search,
> recommendations, and business analytics on e-commerce platforms. However,
> existing PAVI methods face critical challenges, such as inferring implicit
> values, handling out-of-distribution (OOD) values, and producing normalized
> outputs. To address these limitations, we introduce Taxonomy-Aware Contrastive
> Learning Retrieval (TACLR), the first retrieval-based method for PAVI. TACLR
> formulates PAVI as an information retrieval task by encoding product profiles
> and candidate values into embeddings and retrieving values based on their
> similarity to the item embedding. It leverages contrastive training with
> taxonomy-aware hard negative sampling and employs adaptive inference with
> dynamic thresholds. TACLR offers three key advantages: (1) it effectively
> handles implicit and OOD values while producing normalized outputs; (2) it
> scales to thousands of categories, tens of thousands of attributes, and
> millions of values; and (3) it supports efficient inference for high-load
> industrial scenarios. Extensive experiments on proprietary and public datasets
> validate the effectiveness and efficiency of TACLR. Moreover, it has been
> successfully deployed in a real-world e-commerce platform, processing millions
> of product listings daily while supporting dynamic, large-scale attribute
> taxonomies.

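
The retrieval formulation summarized in the abstract (embed the product profile and candidate values, then keep only candidates whose similarity clears a threshold) can be pictured with a minimal sketch; the encoder, threshold value, and toy data below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Normalize row vectors so the dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve_attribute_values(item_emb, value_embs, value_names, threshold=0.35):
    """Return candidate values whose cosine similarity to the item exceeds a threshold.

    An empty result stands in for 'value not present', which is how a
    retrieval-style PAVI system can decline to predict implicit or absent values.
    """
    sims = l2_normalize(value_embs) @ l2_normalize(item_emb)
    ranked = sorted(zip(value_names, sims), key=lambda p: -p[1])
    return [(name, float(s)) for name, s in ranked if s >= threshold]

# Toy example with random "embeddings"; a real system would use a trained text encoder.
rng = np.random.default_rng(0)
item = rng.normal(size=64)
candidates = rng.normal(size=(5, 64))
names = ["red", "blue", "cotton", "polyester", "waterproof"]
print(retrieve_attribute_values(item, candidates, names))
```
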

**Decision Explanation:**
Original decision: REJECT
The paper does not focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs). Instead, it introduces a retrieval-based method (TACLR) for Product Attribute Value Identification, primarily concerned with information retrieval and taxonomy-aware contrastive learning, without any central focus on prompt engineering for LLMs.

---

## [Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video
Generation Control](https://arxiv.org/abs/http://arxiv.org/abs/2501.03847v2)
**arXiv ID:** http://arxiv.org/abs/2501.03847v2

**Abstract:**
> Diffusion models have demonstrated impressive performance in generating
> high-quality videos from text prompts or images. However, precise control over
> the video generation process, such as camera manipulation or content editing,
> remains a significant challenge. Existing methods for controlled video
> generation are typically limited to a single control type, lacking the
> flexibility to handle diverse control demands. In this paper, we introduce
> Diffusion as Shader (DaS), a novel approach that supports multiple video
> control tasks within a unified architecture. Our key insight is that achieving
> versatile video control necessitates leveraging 3D control signals, as videos
> are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods
> limited to 2D control signals, DaS leverages 3D tracking videos as control
> inputs, making the video diffusion process inherently 3D-aware. This innovation
> allows DaS to achieve a wide range of video controls by simply manipulating the
> 3D tracking videos. A further advantage of using 3D tracking videos is their
> ability to effectively link frames, significantly enhancing the temporal
> consistency of the generated videos. With just 3 days of fine-tuning on 8 H800
> GPUs using less than 10k videos, DaS demonstrates strong control capabilities
> across diverse tasks, including mesh-to-video generation, camera control,
> motion transfer, and object manipulation.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on video generation control using diffusion models, which falls under image/video generation rather than text generation driven by Large Language Models (LLMs), violating the 'MUST NOT' criteria related to applications of generative AI.

---

## [CL3DOR: Contrastive Learning for 3D Large Multimodal Models via Odds
Ratio on High-Resolution Point Clouds](https://arxiv.org/abs/http://arxiv.org/abs/2501.03879v1)
**arXiv ID:** http://arxiv.org/abs/2501.03879v1

**Abstract:**
> Recent research has demonstrated that Large Language Models (LLMs) are not
> limited to text-only tasks but can also function as multimodal models across
> various modalities, including audio, images, and videos. In particular,
> research on 3D Large Multimodal Models (3D LMMs) is making notable strides,
> driven by the potential of processing higher-dimensional data like point
> clouds. However, upon closer examination, we find that the visual and textual
> content within each sample of existing training datasets lacks both high
> informational granularity and clarity, which serve as a bottleneck for precise
> cross-modal understanding. To address these issues, we propose CL3DOR,
> Contrastive Learning for 3D large multimodal models via Odds ratio on
> high-Resolution point clouds, designed to ensure greater specificity and
> clarity in both visual and textual content. Specifically, we increase the
> density of point clouds per object and construct informative hard negative
> responses in the training dataset to penalize unwanted responses. To leverage
> hard negative responses, we incorporate the odds ratio as an auxiliary term for
> contrastive learning into the conventional language modeling loss. CL3DOR
> achieves state-of-the-art performance in 3D scene understanding and reasoning
> benchmarks. Additionally, we demonstrate the effectiveness of CL3DOR's key
> components through extensive experiments.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on developing a new multimodal learning method (CL3DOR) for 3D Large Multimodal Models, which includes but is not centered on prompt engineering for Large Language Models (LLMs). The core contribution is the contrastive learning approach for multimodal understanding, not the manipulation of textual input prompts to improve LLM performance.

---

## [LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One
Vision Token](https://arxiv.org/abs/http://arxiv.org/abs/2501.03895v1)
**arXiv ID:** http://arxiv.org/abs/2501.03895v1

**Abstract:**
> The advent of real-time large multimodal models (LMMs) like GPT-4o has
> sparked considerable interest in efficient LMMs. LMM frameworks typically
> encode visual inputs into vision tokens (continuous representations) and
> integrate them and textual instructions into the context of large language
> models (LLMs), where large-scale parameters and numerous context tokens
> (predominantly vision tokens) result in substantial computational overhead.
> Previous efforts towards efficient LMMs always focus on replacing the LLM
> backbone with smaller models, while neglecting the crucial issue of token
> quantity. In this paper, we introduce LLaVA-Mini, an efficient LMM with minimal
> vision tokens. To achieve a high compression ratio of vision tokens while
> preserving visual information, we first analyze how LMMs understand vision
> tokens and find that most vision tokens only play a crucial role in the early
> layers of LLM backbone, where they mainly fuse visual information into text
> tokens. Building on this finding, LLaVA-Mini introduces modality pre-fusion to
> fuse visual information into text tokens in advance, thereby facilitating the
> extreme compression of vision tokens fed to LLM backbone into one token.
> LLaVA-Mini is a unified large multimodal model that can support the
> understanding of images, high-resolution images, and videos in an efficient
> manner. Experiments across 11 image-based and 7 video-based benchmarks
> demonstrate that LLaVA-Mini outperforms LLaVA-v1.5 with just 1 vision token
> instead of 576. Efficiency analyses reveal that LLaVA-Mini can reduce FLOPs by
> 77%, deliver low-latency responses within 40 milliseconds, and process over
> 10,000 frames of video on the GPU hardware with 24GB of memory.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper primarily focuses on developing an efficient Large Multimodal Model (LMM) for image and video processing, optimizing vision tokens, and reducing computational overhead, rather than engineering or optimizing prompts specifically for Large Language Models (LLMs) and their text-based interactions."
}

---

## [Explainable Time Series Prediction of Tyre Energy in Formula One Race
Strategy](https://arxiv.org/abs/http://arxiv.org/abs/2501.04067v1)
**arXiv ID:** http://arxiv.org/abs/2501.04067v1

**Abstract:**
> Formula One (F1) race strategy takes place in a high-pressure and fast-paced
> environment where split-second decisions can drastically affect race results.
> Two of the core decisions of race strategy are when to make pit stops (i.e.
> replace the cars' tyres) and which tyre compounds (hard, medium or soft, in
> normal conditions) to select. The optimal pit stop decisions can be determined
> by estimating the tyre degradation of these compounds, which in turn can be
> computed from the energy applied to each tyre, i.e. the tyre energy. In this
> work, we trained deep learning models, using the Mercedes-AMG PETRONAS F1
> team's historic race data consisting of telemetry, to forecast tyre energies
> during races. Additionally, we fitted XGBoost, a decision tree-based machine
> learning algorithm, to the same dataset and compared the results, with both
> giving impressive performance. Furthermore, we incorporated two different
> explainable AI methods, namely feature importance and counterfactual
> explanations, to gain insights into the reasoning behind the forecasts. Our
> contributions thus result in an explainable, automated method which could
> assist F1 teams in optimising their race strategy.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on developing an explainable AI method for time series prediction in Formula One racing, utilizing deep learning models and XGBoost, with no primary emphasis on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs).

---

## [Explainable Reinforcement Learning for Formula One Race Strategy](https://arxiv.org/abs/http://arxiv.org/abs/2501.04068v1)
**arXiv ID:** http://arxiv.org/abs/2501.04068v1

**Abstract:**
> In Formula One, teams compete to develop their cars and achieve the highest
> possible finishing position in each race. During a race, however, teams are
> unable to alter the car, so they must improve their cars' finishing positions
> via race strategy, i.e. optimising their selection of which tyre compounds to
> put on the car and when to do so. In this work, we introduce a reinforcement
> learning model, RSRL (Race Strategy Reinforcement Learning), to control race
> strategies in simulations, offering a faster alternative to the industry
> standard of hard-coded and Monte Carlo-based race strategies. Controlling cars
> with a pace equating to an expected finishing position of P5.5 (where P1
> represents first place and P20 is last place), RSRL achieves an average
> finishing position of P5.33 on our test race, the 2023 Bahrain Grand Prix,
> outperforming the best baseline of P5.63. We then demonstrate, in a
> generalisability study, how performance for one track or multiple tracks can be
> prioritised via training. Further, we supplement model predictions with feature
> importance, decision tree-based surrogate models, and decision tree
> counterfactuals towards improving user trust in the model. Finally, we provide
> illustrations which exemplify our approach in real-world situations, drawing
> parallels between simulations and reality.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on developing a reinforcement learning model for optimizing Formula One race strategies, with no apparent connection to Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet the primary 'MUST' criteria.

---

## [Multi-armed Bandit and Backbone boost Lin-Kernighan-Helsgaun Algorithm
for the Traveling Salesman Problems](https://arxiv.org/abs/http://arxiv.org/abs/2501.04072v1)
**arXiv ID:** http://arxiv.org/abs/2501.04072v1

**Abstract:**
> The Lin-Kernighan-Helsgaun (LKH) heuristic is a classic local search
> algorithm for the Traveling Salesman Problem (TSP). LKH introduces an
> $\alpha$-value to replace the traditional distance metric for evaluating the
> edge quality, which leads to a significant improvement. However, we observe
> that the $\alpha$-value does not make full use of the historical information
> during the search, and single guiding information often makes LKH hard to
> escape from some local optima. To address the above issues, we propose a novel
> way to extract backbone information during the TSP local search process, which
> is dynamic and can be updated once a local optimal solution is found. We
> further propose to combine backbone information, $\alpha$-value, and distance
> to evaluate the edge quality so as to guide the search. Moreover, we abstract
> their different combinations to arms in a multi-armed bandit (MAB) and use an
> MAB model to help the algorithm select an appropriate evaluation metric
> dynamically. Both the backbone information and MAB can provide diverse guiding
> information and learn from the search history to suggest the best metric. We
> apply our methods to LKH and LKH-3, an extended version of LKH that
> can be used to solve about 40 variant problems of TSP and Vehicle Routing
> Problem (VRP). Extensive experiments show the excellent performance and
> generalization capability of our proposed method, significantly improving LKH
> for TSP and LKH-3 for two representative TSP and VRP variants, the Colored TSP
> (CTSP) and Capacitated VRP with Time Windows (CVRPTW).

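
The bandit component described in the abstract can be pictured with a minimal UCB1 sketch: each arm is one combination of edge-evaluation metrics, and the reward is the improvement observed after a local-search run. The arm names and reward model below are assumptions for illustration, not the authors' implementation.

```python
import math
import random

class UCB1MetricSelector:
    """Minimal UCB1 bandit over candidate edge-evaluation metrics."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.total_reward = {a: 0.0 for a in self.arms}
        self.t = 0

    def select(self):
        self.t += 1
        # Play each arm once before applying the UCB rule.
        for a in self.arms:
            if self.counts[a] == 0:
                return a
        def ucb(a):
            mean = self.total_reward[a] / self.counts[a]
            return mean + math.sqrt(2 * math.log(self.t) / self.counts[a])
        return max(self.arms, key=ucb)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.total_reward[arm] += reward

# Toy loop: the reward is a simulated tour improvement obtained with each metric.
selector = UCB1MetricSelector(["alpha", "distance", "backbone+alpha"])
true_mean = {"alpha": 0.3, "distance": 0.1, "backbone+alpha": 0.5}
for _ in range(100):
    arm = selector.select()
    improvement = max(0.0, random.gauss(true_mean[arm], 0.2))
    selector.update(arm, improvement)
print({a: selector.counts[a] for a in selector.arms})
```
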

**Decision Explanation:**
Original decision: REJECT
The paper does not meet the 'MUST' criteria as it focuses on optimizing the Traveling Salesman Problem algorithm using multi-armed bandit and backbone boost, with no primary focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts to improve LLM performance.

---

## [Enhancing Distribution and Label Consistency for Graph
Out-of-Distribution Generalization](https://arxiv.org/abs/http://arxiv.org/abs/2501.04102v1)
**arXiv ID:** http://arxiv.org/abs/2501.04102v1

**Abstract:**
> To deal with distribution shifts in graph data, various graph
> out-of-distribution (OOD) generalization techniques have been recently
> proposed. These methods often employ a two-step strategy that first creates
> augmented environments and subsequently identifies invariant subgraphs to
> improve generalizability. Nevertheless, this approach could be suboptimal from
> the perspective of consistency. First, the process of augmenting environments
> by altering the graphs while preserving labels may lead to graphs that are not
> realistic or meaningfully related to the origin distribution, thus lacking
> distribution consistency. Second, the extracted subgraphs are obtained from
> directly modifying graphs, and may not necessarily maintain a consistent
> predictive relationship with their labels, thereby impacting label consistency.
> In response to these challenges, we introduce an innovative approach that aims
> to enhance these two types of consistency for graph OOD generalization. We
> propose a modifier to obtain both augmented and invariant graphs in a unified
> manner. With the augmented graphs, we enrich the training data without
> compromising the integrity of label-graph relationships. The label consistency
> enhancement in our framework further preserves the supervision information in
> the invariant graph. We conduct extensive experiments on real-world datasets to
> demonstrate the superiority of our framework over other state-of-the-art
> baselines.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on enhancing distribution and label consistency for graph out-of-distribution generalization, with no apparent connection to Large Language Models (LLMs), prompt engineering, or text-based interactions, thus failing to meet the 'MUST' criteria.

---

## [Multimodal Multihop Source Retrieval for Web Question Answering](https://arxiv.org/abs/http://arxiv.org/abs/2501.04173v1)
**arXiv ID:** http://arxiv.org/abs/2501.04173v1

**Abstract:**
> This work deals with the challenge of learning and reasoning over multi-modal
> multi-hop question answering (QA). We propose a graph reasoning network based
> on the semantic structure of the sentences to learn multi-source reasoning
> paths and find the supporting facts across both image and text modalities for
> answering the question. In this paper, we investigate the importance of graph
> structure for multi-modal multi-hop question answering. Our analysis is
> centered on WebQA. We construct a strong baseline model that finds relevant
> sources using a pairwise classification task. We establish that, with the
> proper use of feature representations from pre-trained models, graph structure
> helps in improving multi-modal multi-hop question answering. We point out that
> both graph structure and adjacency matrix are task-related prior knowledge, and
> graph structure can be leveraged to improve the retrieval performance for the
> task. Experiments and visualized analysis demonstrate that message propagation
> over graph networks or the entire graph structure can replace massive
> multimodal transformers with token-wise cross-attention. We demonstrate the
> applicability of our method and show a performance gain of **4.6%**
> retrieval F1 score over the transformer baselines, despite being a very light
> model. We further demonstrate the applicability of our model to a large scale
> retrieval setting.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on multimodal multihop question answering using graph reasoning networks, with emphasis on leveraging image and text modalities, rather than specifically engineering prompts for Large Language Models (LLMs). It does not investigate, analyze, or propose methods for improving LLM performance through the manipulation of textual input prompts as required.

---

## [HIVEX: A High-Impact Environment Suite for Multi-Agent Research
(extended version)](https://arxiv.org/abs/http://arxiv.org/abs/2501.04180v2)
**arXiv ID:** http://arxiv.org/abs/2501.04180v2

**Abstract:**
> Games have been vital test beds for the rapid development of Agent-based
> research. Remarkable progress has been achieved in the past, but it is unclear
> if the findings equip for real-world problems. While pressure grows, some of
> the most critical ecological challenges can find mitigation and prevention
> solutions through technology and its applications. Most real-world domains
> include multi-agent scenarios and require machine-machine and human-machine
> collaboration. Open-source environments have not advanced and are often toy
> scenarios, too abstract or not suitable for multi-agent research. By mimicking
> real-world problems and increasing the complexity of environments, we hope to
> advance state-of-the-art multi-agent research and inspire researchers to work
> on immediate real-world problems. Here, we present HIVEX, an environment suite
> to benchmark multi-agent research focusing on ecological challenges. HIVEX
> includes the following environments: Wind Farm Control, Wildfire Resource
> Management, Drone-Based Reforestation, Ocean Plastic Collection, and Aerial
> Wildfire Suppression. We provide environments, training examples, and baselines
> for the main and sub-tasks. All trained models resulting from the experiments
> of this work are hosted on Hugging Face. We also provide a leaderboard on
> Hugging Face and encourage the community to submit models trained on our
> environment suite.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on multi-agent research, ecological challenges, and the development of an environment suite (HIVEX), with no clear emphasis on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.

---

## [Fixed Points of Deep Neural Networks: Emergence, Stability, and
Applications](https://arxiv.org/abs/http://arxiv.org/abs/2501.04182v1)
**arXiv ID:** http://arxiv.org/abs/2501.04182v1

**Abstract:**
> We present numerical and analytical results on the formation and stability of
> a family of fixed points of deep neural networks (DNNs). Such fixed points
> appear in a class of DNNs when dimensions of input and output vectors are the
> same. We demonstrate examples of applications of such networks in supervised,
> semi-supervised and unsupervised learning such as encoding/decoding of images,
> restoration of damaged images among others.
> We present several numerical and analytical results. First, we show that for
> untrained DNNs with weights and biases initialized by normally distributed
> random variables only one fixed point exists. This result holds for DNNs
> with any depth (number of layers) $L$, any layer width $N$, and sigmoid-type
> activation functions. Second, it has been shown that for a DNN whose parameters
> (weights and biases) are initialized by a "light-tailed" distribution of
> weights (e.g. normal distribution), after training the distribution of these
> parameters becomes "heavy-tailed". This motivates our study of DNNs with
> "heavy-tailed" initialization. For such DNNs we show numerically
> that training leads to the emergence of $Q(N,L)$ fixed points, where
> $Q(N,L)$ is a positive integer which depends on the number of layers $L$ and
> layer width $N$. We further observe numerically that for fixed $N = N_0$ the
> function $Q(N_0, L)$ is non-monotone, that is, it initially grows as $L$
> increases and then decreases to 1.
> This non-monotone behavior of $Q(N_0, L)$ is also obtained by analytical
> derivation of an equation for the Empirical Spectral Distribution (ESD) of the
> input-output Jacobian, followed by numerical solution of this equation.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on the analysis of fixed points in deep neural networks (DNNs), including their emergence, stability, and applications in image processing, but does not address prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance, thus failing to meet all 'MUST' criteria.

---

## [Generative Style Transfer for MRI Image Segmentation: A Case of Glioma
Segmentation in Sub-Saharan Africa](https://arxiv.org/abs/http://arxiv.org/abs/2501.04734v1)
**arXiv ID:** http://arxiv.org/abs/2501.04734v1

**Abstract:**
> In Sub-Saharan Africa (SSA), the utilization of lower-quality Magnetic
> Resonance Imaging (MRI) technology raises questions about the applicability of
> machine learning methods for clinical tasks. This study aims to provide a
> robust deep learning-based brain tumor segmentation (BraTS) method tailored for
> the SSA population using a threefold approach. Firstly, the impact of domain
> shift from the SSA training data on model efficacy was examined, revealing no
> significant effect. Secondly, a comparative analysis of 3D and 2D
> full-resolution models using the nnU-Net framework indicates similar
> performance of both the models trained for 300 epochs achieving a five-fold
> cross-validation score of 0.93. Lastly, addressing the performance gap observed
> in SSA validation as opposed to the relatively larger BraTS glioma (GLI)
> validation set, two strategies are proposed: fine-tuning SSA cases using the
> GLI+SSA best-pretrained 2D fullres model at 300 epochs, and introducing a novel
> neural style transfer-based data augmentation technique for the SSA cases. This
> investigation underscores the potential of enhancing brain tumor prediction
> within SSA's unique healthcare landscape.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on MRI image segmentation, a non-text generation task, using deep learning methods, and does not investigate, analyze, or propose methods for improving Large Language Model (LLM) performance through prompt engineering, violating multiple 'MUST NOT' criteria.

---

## [Retrieval-Augmented Generation by Evidence Retroactivity in LLMs](https://arxiv.org/abs/http://arxiv.org/abs/2501.05475v1)
**arXiv ID:** http://arxiv.org/abs/2501.05475v1

**Abstract:**
> Retrieval-augmented generation has gained significant attention due to its
> ability to integrate relevant external knowledge, enhancing the accuracy and
> reliability of the LLMs' responses. Most of the existing methods apply a
> dynamic multiple retrieval-generating process, to address multi-hop complex
> questions by decomposing them into sub-problems. However, these methods rely on
> a unidirectional forward reasoning paradigm, where errors from insufficient
> reasoning steps or inherent flaws in current retrieval systems are
> irreversible, potentially derailing the entire reasoning chain. For the first
> time, this work introduces Retroactive Retrieval-Augmented Generation
> (RetroRAG), a novel framework to build a retroactive reasoning paradigm.
> RetroRAG revises and updates the evidence, redirecting the reasoning chain to
> the correct direction. RetroRAG constructs an evidence-collation-discovery
> framework to search, generate, and refine credible evidence. It synthesizes
> inferential evidence related to the key entities in the question from the
> existing source knowledge and formulates search queries to uncover additional
> information. As new evidence is found, RetroRAG continually updates and
> organizes this information, enhancing its ability to locate further necessary
> evidence. Paired with an Answerer to generate and evaluate outputs, RetroRAG is
> capable of refining its reasoning process iteratively until a reliable answer
> is obtained. Empirical evaluations show that RetroRAG significantly outperforms
> existing methods.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on introducing a novel framework for retrieval-augmented generation with a retroactive reasoning paradigm, rather than specifically engineering or optimizing prompts for Large Language Models (LLMs). While LLMs are utilized, the core subject is the enhancement of their reliability through external knowledge integration, not prompt engineering for text-based interactions.

---

## [Language and Planning in Robotic Navigation: A Multilingual Evaluation
of State-of-the-Art Models](https://arxiv.org/abs/http://arxiv.org/abs/2501.05478v1)
**arXiv ID:** http://arxiv.org/abs/2501.05478v1

**Abstract:**
> Large Language Models (LLMs) such as GPT-4, trained on huge amounts of
> data spanning multiple domains, exhibit significant reasoning,
> understanding, and planning capabilities across various tasks. This study
> presents the first-ever work in Arabic language integration within the
> Vision-and-Language Navigation (VLN) domain in robotics, an area that has been
> notably underexplored in existing research. We perform a comprehensive
> evaluation of state-of-the-art multi-lingual Small Language Models (SLMs),
> including GPT-4o mini, Llama 3 8B, and Phi-3 medium 14B, alongside the
> Arabic-centric LLM, Jais. Our approach utilizes the NavGPT framework, a pure
> LLM-based instruction-following navigation agent, to assess the impact of
> language on navigation reasoning through zero-shot sequential action prediction
> using the R2R dataset. Through comprehensive experiments, we demonstrate that
> our framework is capable of high-level planning for navigation tasks when
> provided with instructions in both English and Arabic. However, certain models
> struggled with reasoning and planning in the Arabic language due to inherent
> limitations in their capabilities, sub-optimal performance, and parsing issues.
> These findings highlight the importance of enhancing planning and reasoning
> capabilities in language models for effective navigation, emphasizing this as a
> key area for further development while also unlocking the potential of
> Arabic-language models for impactful real-world applications.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on the evaluation of Large Language Models (LLMs) for multilingual robotic navigation, rather than on engineering, design, or optimization of prompts specifically for LLMs. While LLMs are utilized, the core subject is not prompt engineering for text-based interactions, but rather the application of LLMs in navigation tasks across different languages.

---

## [Detection, Retrieval, and Explanation Unified: A Violence Detection
System Based on Knowledge Graphs and GAT](https://arxiv.org/abs/http://arxiv.org/abs/2501.06224v1)
**arXiv ID:** http://arxiv.org/abs/2501.06224v1

**Abstract:**
> Recently, violence detection systems developed using unified multimodal
> models have achieved significant success and attracted widespread attention.
> However, most of these systems face two critical challenges: the lack of
> interpretability as black-box models and limited functionality, offering only
> classification or retrieval capabilities. To address these challenges, this
> paper proposes a novel interpretable violence detection system, termed the
> Three-in-One (TIO) System. The TIO system integrates knowledge graphs (KG) and
> graph attention networks (GAT) to provide three core functionalities:
> detection, retrieval, and explanation. Specifically, the system processes each
> video frame along with text descriptions generated by a large language model
> (LLM) for videos containing potential violent behavior. It employs ImageBind to
> generate high-dimensional embeddings for constructing a knowledge graph, uses
> GAT for reasoning, and applies lightweight time series modules to extract video
> embedding features. The final step connects a classifier and retriever for
> multi-functional outputs. The interpretability of KG enables the system to
> verify the reasoning process behind each output. Additionally, the paper
> introduces several lightweight methods to reduce the resource consumption of
> the TIO system and enhance its efficiency. Extensive experiments conducted on
> the XD-Violence and UCF-Crime datasets validate the effectiveness of the
> proposed system. A case study further reveals an intriguing phenomenon: as the
> number of bystanders increases, the occurrence of violent behavior tends to
> decrease.

**Decision Explanation:**
Original decision: REJECT
The paper's primary focus is on developing a unified violence detection system using knowledge graphs and GAT, with LLMs being used only as a component for generating text descriptions, rather than the central focus being on prompt engineering for text-based interactions with LLMs.

---

## [asanAI: In-Browser, No-Code, Offline-First Machine Learning Toolkit](https://arxiv.org/abs/http://arxiv.org/abs/2501.06226v1)
**arXiv ID:** http://arxiv.org/abs/2501.06226v1

**Abstract:**
> Machine learning (ML) has become crucial in modern life, with growing
> interest from researchers and the public. Despite its potential, a significant
> entry barrier prevents widespread adoption, making it challenging for
> non-experts to understand and implement ML techniques. The increasing desire to
> leverage ML is counterbalanced by its technical complexity, creating a gap
> between potential and practical application. This work introduces asanAI, an
> offline-first, open-source, no-code machine learning toolkit designed for users
> of all skill levels. It allows individuals to design, debug, train, and test ML
> models directly in a web browser, eliminating the need for software
> installations and coding. The toolkit runs on any device with a modern web
> browser, including smartphones, and ensures user privacy through local
> computations while utilizing WebGL for enhanced GPU performance. Users can
> quickly experiment with neural networks and train custom models using various
> data sources, supported by intuitive visualizations of network structures and
> data flows. asanAI simplifies the teaching of ML concepts in educational
> settings and is released under an open-source MIT license, encouraging
> modifications. It also supports exporting models in industry-ready formats,
> empowering a diverse range of users to effectively learn and apply machine
> learning in their projects. The proposed toolkit is successfully utilized by
> researchers of ScaDS.AI to swiftly draft and test machine learning ideas, by
> trainers to effectively educate enthusiasts, and by teachers to introduce
> contemporary ML topics in classrooms with minimal effort and high clarity.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on developing a no-code, in-browser machine learning toolkit (asanAI) for designing and training ML models, rather than engineering prompts specifically for Large Language Models (LLMs), thus not meeting the core subject criterion.

---

## [Balanced Multi-view Clustering](https://arxiv.org/abs/http://arxiv.org/abs/2501.02564v2)
**arXiv ID:** http://arxiv.org/abs/2501.02564v2

**Abstract:**
> Multi-view clustering (MvC) aims to integrate information from different
> views to enhance the capability of the model in capturing the underlying data
> structures. The widely used joint training paradigm in MvC may not
> fully leverage the multi-view information, owing to the imbalanced and
> under-optimized view-specific features caused by the uniform learning objective
> applied to all views. For instance, particular views with more discriminative
> information could dominate the learning process in the joint training paradigm,
> leading to other views being under-optimized. To alleviate this issue, we first
> analyze the imbalanced phenomenon in the joint-training paradigm of multi-view
> clustering from the perspective of gradient descent for each view-specific
> feature extractor. Then, we propose a novel balanced multi-view clustering
> (BMvC) method, which introduces a view-specific contrastive regularization
> (VCR) to modulate the optimization of each view. Concretely, VCR preserves the
> sample similarities captured from the joint features and view-specific ones
> into the clustering distributions corresponding to view-specific features to
> enhance the learning process of view-specific feature extractors. Additionally,
> a theoretical analysis is provided to illustrate that VCR adaptively modulates
> the magnitudes of gradients for updating the parameters of view-specific
> feature extractors to achieve a balanced multi-view learning procedure. In such
> a manner, BMvC achieves a better trade-off between the exploitation of
> view-specific patterns and the exploration of view-invariance patterns to fully
> learn the multi-view information for the clustering task. Finally, a set of
> experiments are conducted to verify the superiority of the proposed method
> compared with state-of-the-art approaches both on eight benchmark MvC datasets
> and two spatially resolved transcriptomics datasets.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on multi-view clustering, a general machine learning technique, and does not meet the 'MUST' criteria for prompt engineering specifically tailored for Large Language Models (LLMs), nor does it provide examples of textual input prompts impacting LLM output.

---

## [Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM
Pre-Training Datasets](https://arxiv.org/abs/http://arxiv.org/abs/2501.02628v1)
**arXiv ID:** http://arxiv.org/abs/2501.02628v1

**Abstract:**
> A critical part of creating code suggestion systems is the pre-training of
> Large Language Models on vast amounts of source code and natural language text,
> often of questionable origin or quality. This may contribute to the presence of
> bugs and vulnerabilities in code generated by LLMs. While efforts to identify
> bugs at or after code generation exist, it is preferable to pre-train or
> fine-tune LLMs on curated, high-quality, and compliant datasets. The need for
> vast amounts of training data necessitates that such curation be automated,
> minimizing human intervention.
> We propose an automated source code autocuration technique that leverages the
> complete version history of open-source software projects to improve the
> quality of training data. This approach leverages the version history of all
> OSS projects to identify training data samples that have been modified or have
> undergone changes in at least one OSS project, and pinpoint a subset of samples
> that include fixes for bugs or vulnerabilities. We evaluate this method using
> The Stack v2 dataset, and find that 17% of the code versions in the dataset
> have newer versions, with 17% of those representing bug fixes, including 2.36%
> addressing known CVEs. The deduplicated version of Stack v2 still includes
> blobs vulnerable to 6,947 known CVEs. Furthermore, 58% of the blobs in the
> dataset were never modified after creation, suggesting they likely represent
> software with minimal or no use. Misidentified blob origins present an
> additional challenge, as they lead to the inclusion of non-permissively
> licensed code, raising serious compliance concerns.
> By addressing these issues, the training of new models can avoid perpetuating
> buggy code patterns or license violations. We expect our results to inspire
> process improvements for automated data curation, with the potential to enhance
> the reliability of outputs generated by AI tools.

**Decision Explanation:**
Original decision: REJECT
The paper focuses primarily on the curation of pre-training datasets for LLMs to reduce bugs and licensing risks, rather than the engineering, design, or optimization of prompts specifically for Large Language Models, failing to meet the first 'MUST' criterion.

---

## [Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured
State-Space Models](https://arxiv.org/abs/http://arxiv.org/abs/2501.02832v3)
**arXiv ID:** http://arxiv.org/abs/2501.02832v3

**Abstract:**
> We propose Samba ASR, the first state-of-the-art Automatic Speech
> Recognition (ASR) model leveraging the novel Mamba architecture as both encoder
> and decoder, built on the foundation of state-space models (SSMs). Unlike
> transformer-based ASR models, which rely on self-attention mechanisms to capture
> dependencies, Samba ASR effectively models both local and global temporal
> dependencies using efficient state-space dynamics, achieving remarkable
> performance gains. By addressing the limitations of transformers, such as
> quadratic scaling with input length and difficulty in handling long-range
> dependencies, Samba ASR achieves superior accuracy and efficiency. Experimental
> results demonstrate that Samba ASR surpasses existing open-source
> transformer-based ASR models across various standard benchmarks, establishing it
> as the new state of the art in ASR. Extensive evaluations on the benchmark
> dataset show significant improvements in Word Error Rate (WER), with competitive
> performance even in low-resource scenarios. Furthermore, the inherent
> computational efficiency and parameter optimization of the Mamba architecture
> make Samba ASR a scalable and robust solution for diverse ASR tasks. Our
> contributions include the development of a new Samba ASR architecture for
> automatic speech recognition (ASR), demonstrating the superiority of structured
> state-space models (SSMs) over transformer-based models for speech sequence
> processing. We provide a comprehensive evaluation on public
> benchmarks, showcasing state-of-the-art (SOTA) performance, and present an
> in-depth analysis of computational efficiency, robustness to noise, and sequence
> generalization. This work highlights the viability of Mamba SSMs as a
> transformer-free alternative for efficient and accurate ASR. By leveraging the
> advancements of state-space modeling, Samba ASR redefines ASR performance
> standards and sets a new benchmark for future research in this field.

**Decision Explanation:**
Original decision: REJECT
The paper focuses on developing a new Automatic Speech Recognition (ASR) model (Samba ASR) built on the Mamba state-space architecture, which is unrelated to Large Language Models (LLMs) and prompt engineering for text-based interactions, thus failing to meet the primary criteria.

---

## [A Soft Sensor Method with Uncertainty-Awareness and Self-Explanation
Based on Large Language Models Enhanced by Domain Knowledge Retrieval](https://arxiv.org/abs/http://arxiv.org/abs/2501.03295v2)
**arXiv ID:** http://arxiv.org/abs/2501.03295v2

**Abstract:**
> Data-driven soft sensors are crucial in predicting key performance indicators
> in industrial systems. However, current methods predominantly rely on the
> supervised learning paradigm of parameter updating, which inherently faces
> challenges such as high development costs, poor robustness, training
> instability, and lack of interpretability. Recently, large language models
> (LLMs) have demonstrated significant potential across various domains, notably
> through In-Context Learning (ICL), which enables high-performance task
> execution with minimal input-label demonstrations and no prior training. This
> paper aims to replace supervised learning with the emerging ICL paradigm for
> soft sensor modeling to address existing challenges and explore new avenues for
> advancement. To achieve this, we propose a novel framework called the Few-shot
> Uncertainty-aware and self-Explaining Soft Sensor (LLM-FUESS), which includes
> the Zero-shot Auxiliary Variable Selector (LLM-ZAVS) and the Uncertainty-aware
> Few-shot Soft Sensor (LLM-UFSS). The LLM-ZAVS retrieves from the Industrial
> Knowledge Vector Storage to enhance LLMs' domain-specific knowledge, enabling
> zero-shot auxiliary variable selection. In the LLM-UFSS, we utilize text-based
> context demonstrations of structured data to prompt LLMs to execute ICL for
> prediction, and propose a context sample retrieval augmentation strategy to
> improve performance. Additionally, we explored LLMs' AIGC and probabilistic
> characteristics to propose self-explanation and uncertainty quantification
> methods for constructing a trustworthy soft sensor. Extensive experiments
> demonstrate that our method achieved state-of-the-art predictive performance,
> strong robustness, and flexibility, and effectively mitigates the training
> instability found in traditional methods. To the best of our knowledge, this is
> the first work to establish a soft sensor utilizing LLMs.

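
The ICL setup described in the abstract hinges on serializing structured process records into a textual few-shot prompt. The following is a minimal sketch of such prompt construction, assuming hypothetical variable names and a plain "feature=value" format rather than the paper's actual templates.

```python
def build_icl_prompt(demonstrations, query, target_name="quality_index"):
    """Turn structured (features, label) rows into a few-shot textual prompt.

    `demonstrations` is a list of (feature_dict, label) pairs retrieved as context
    samples; `query` is the feature_dict the LLM should predict for.
    """
    lines = [f"Predict {target_name} from the process variables. Respond with a number only."]
    for i, (features, label) in enumerate(demonstrations, start=1):
        feats = ", ".join(f"{k}={v}" for k, v in features.items())
        lines.append(f"Example {i}: {feats} -> {target_name}={label}")
    feats = ", ".join(f"{k}={v}" for k, v in query.items())
    lines.append(f"Now predict: {feats} -> {target_name}=")
    return "\n".join(lines)

# Hypothetical sensor readings used as in-context demonstrations.
demos = [
    ({"temperature": 412.0, "pressure": 3.1, "flow": 18.4}, 0.82),
    ({"temperature": 398.5, "pressure": 2.8, "flow": 17.9}, 0.74),
]
print(build_icl_prompt(demos, {"temperature": 405.2, "pressure": 3.0, "flow": 18.1}))
```
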

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on replacing supervised learning with In-Context Learning for soft sensor modeling in industrial systems, utilizing LLMs as a core component within a larger system. While it mentions prompt enhancements (e.g., text-based context demonstrations, context sample retrieval augmentation), prompt engineering for text-based interactions with LLMs is not the central focus, but rather a means to achieve the paper's main objective of advancing soft sensor technology.

---

## [FTA-FTL: A Fine-Tuned Aggregation Federated Transfer Learning Scheme for
Lithology Microscopic Image Classification](https://arxiv.org/abs/http://arxiv.org/abs/2501.03349v1)
**arXiv ID:** http://arxiv.org/abs/2501.03349v1

**Abstract:**
> Lithology discrimination is a crucial activity in characterizing oil
> reservoirs, and processing lithology microscopic images is an essential
> technique for investigating fossils and minerals and geological assessment of
> shale oil exploration. In this way, Deep Learning (DL) technique is a powerful
> approach for building robust classifier models. However, there is still a
> considerable challenge to collect and produce a large dataset.
> Transfer-learning and data augmentation techniques have emerged as popular
> approaches to tackle this problem. Furthermore, due to different reasons,
> especially data privacy, individuals, organizations, and industry companies
> often are not willing to share their sensitive data and information. Federated
> Learning (FL) has emerged to train a highly accurate central model across
> multiple decentralized edge servers without transferring sensitive data,
> preserving sensitive data, and enhancing security. This study involves two
> phases; the first phase is to conduct Lithology microscopic image
> classification on a small dataset using transfer learning. In doing so, various
> pre-trained DL model architectures are comprehensively compared for the
> classification task. In the second phase, we formulated the classification task
> to a Federated Transfer Learning (FTL) scheme and proposed a Fine-Tuned
> Aggregation strategy for Federated Learning (FTA-FTL). In order to perform a
> comprehensive experimental study, several metrics such as accuracy, f1 score,
> precision, specificity, sensitivity (recall), and confusion matrix are taken
> into account. The results are in excellent agreement and confirm the efficiency
> of the proposed scheme, and show that the proposed FTA-FTL algorithm is capable
> enough to achieve approximately the same results obtained by the centralized
> implementation for Lithology microscopic images classification task.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper focuses on Federated Transfer Learning for image classification, primarily dealing with Deep Learning model architectures and data privacy in the context of Lithology microscopic image classification, with no mention of Large Language Models (LLMs) or prompt engineering for text-based interactions."
}

---

## [Not all tokens are created equal: Perplexity Attention Weighted Networks
for AI generated text detection](https://arxiv.org/abs/http://arxiv.org/abs/2501.03940v2)
**arXiv ID:** http://arxiv.org/abs/2501.03940v2

**Abstract:**
> The rapid advancement in large language models (LLMs) has significantly
> enhanced their ability to generate coherent and contextually relevant text,
> raising concerns about the misuse of AI-generated content and making it
> critical to detect it. However, the task remains challenging, particularly in
> unseen domains or with unfamiliar LLMs. Leveraging LLM next-token distribution
> outputs offers a theoretically appealing approach for detection, as they
> encapsulate insights from the models' extensive pre-training on diverse
> corpora. Despite its promise, zero-shot methods that attempt to operationalize
> these outputs have met with limited success. We hypothesize that one of the
> problems is that they use the mean to aggregate next-token distribution metrics
> across tokens, when some tokens are naturally easier or harder to predict and
> should be weighted differently. Based on this idea, we propose the Perplexity
> Attention Weighted Network (PAWN), which uses the last hidden states of the LLM
> and positions to weight the sum of a series of features based on metrics from
> the next-token distribution across the sequence length. Although not zero-shot,
> our method allows us to cache the last hidden states and next-token
> distribution metrics on disk, greatly reducing the training resource
> requirements. PAWN shows competitive and even better performance
> in-distribution than the strongest baselines (fine-tuned LMs) with a fraction
> of their trainable parameters. Our model also generalizes better to unseen
> domains and source models, with smaller variability in the decision boundary
> across distribution shifts. It is also more robust to adversarial attacks, and
> if the backbone has multilingual capabilities, it presents decent
> generalization to languages not seen during supervised training, with LLaMA3-1B
> reaching a mean macro-averaged F1 score of 81.46% in cross-validation with nine
> languages.

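
The core aggregation idea in the abstract (weighting per-token next-token-distribution metrics instead of taking a plain mean) can be sketched as follows; the single-vector scoring head, feature choice, and dimensions are simplifying assumptions for illustration, not the PAWN architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def weighted_token_score(token_nll, hidden_states, w):
    """Aggregate per-token negative log-likelihoods with attention-style weights.

    token_nll:     (T,)  per-token NLL under the scoring LLM (a next-token metric).
    hidden_states: (T,D) last-layer hidden states for the same tokens.
    w:             (D,)  parameters of a toy scoring head producing one logit per token.
    A plain mean would be token_nll.mean(); here easy and hard tokens get different weights.
    """
    logits = hidden_states @ w          # one relevance logit per token
    weights = softmax(logits)           # normalize over the sequence
    return float(weights @ token_nll)   # scalar feature fed to a detection classifier

rng = np.random.default_rng(0)
T, D = 12, 16
score = weighted_token_score(rng.exponential(size=T), rng.normal(size=(T, D)), rng.normal(size=D))
print(f"weighted perplexity-style feature: {score:.3f}")
```
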

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on developing a novel method (PAWN) for detecting AI-generated text, leveraging LLM outputs, rather than engineering or optimizing prompts for Large Language Models. Prompt manipulation for improving LLM performance is not the central concern.

---

## [More is not always better? Enhancing Many-Shot In-Context Learning with
Differentiated and Reweighting Objectives](https://arxiv.org/abs/http://arxiv.org/abs/2501.04070v2)
**arXiv ID:** http://arxiv.org/abs/2501.04070v2

**Abstract:**
> Large language models (LLMs) excel at few-shot in-context learning (ICL)
> without requiring parameter updates. However, as the number of ICL
> demonstrations increases from a few to many, performance tends to plateau and
> eventually decline. We identify two primary causes for this trend: the
> suboptimal negative log-likelihood (NLL) optimization objective and the
> incremental data noise. To address these issues, we introduce DrICL, a novel
> optimization method that enhances model performance through Differentiated
> Learning and advantage-based Reweighting objectives. Globally, DrICL utilizes
> differentiated learning to optimize the NLL objective, ensuring that many-shot
> performance surpasses zero-shot levels. Locally, it dynamically adjusts the
> weighting of many-shot demonstrations by leveraging cumulative advantages
> inspired by reinforcement learning, thereby improving generalization. This
> approach allows the model to handle varying numbers of shots effectively,
> mitigating the impact of noisy data. Recognizing the lack of multi-task
> datasets with diverse many-shot distributions, we develop the Many-Shot ICL
> Benchmark (ICL-50), a large-scale benchmark of 50 tasks covering shot numbers
> from 1 to 350 within sequences of up to 8,000 tokens, for fine-tuning purposes.
> ICL-50 facilitates the evaluation of many-shot ICL strategies across seven
> prominent NLP tasks and 50 distinct datasets. Experimental results demonstrate
> that LLMs enhanced with DrICL achieve significant improvements in many-shot
> setups across various tasks, including both in-domain and out-of-domain
> scenarios. We release the code and benchmark dataset hoping to facilitate
> further research in many-shot ICL.

**Decision Explanation:**
Original response:
{
"decision": "REJECT",
"explanation": "The paper primarily focuses on enhancing many-shot in-context learning through novel optimization methods (Differentiated Learning and Reweighting objectives), rather than specifically on the engineering, design, or optimization of textual input prompts for Large Language Models (LLMs). The core subject is optimization of LLM performance through training methods, not prompt engineering."
}

---

## [Cosmos World Foundation Model Platform for Physical AI](https://arxiv.org/abs/http://arxiv.org/abs/2501.03575v1)
**arXiv ID:** http://arxiv.org/abs/2501.03575v1

**Abstract:**
> Physical AI needs to be trained digitally first. It needs a digital twin of
> itself, the policy model, and a digital twin of the world, the world model. In
> this paper, we present the Cosmos World Foundation Model Platform to help
> developers build customized world models for their Physical AI setups. We
> position a world foundation model as a general-purpose world model that can be
> fine-tuned into customized world models for downstream applications. Our
> platform covers a video curation pipeline, pre-trained world foundation models,
> examples of post-training of pre-trained world foundation models, and video
> tokenizers. To help Physical AI builders solve the most critical problems of
> our society, we make our platform open-source and our models open-weight with
> permissive licenses available via https://github.com/NVIDIA/Cosmos.

**Decision Explanation:**
Original decision: REJECT
The paper primarily focuses on a platform for Physical AI and world model development, with emphasis on video curation, model fine-tuning, and open-source resources. It does not meet the core criteria of focusing primarily on prompt engineering for Large Language Models (LLMs), manipulating textual input prompts to improve LLM performance, or providing concrete examples of prompts and their impact on LLM output.

---
