Accepted Papers
ICPC: In-context Prompt Compression with Faster Inference
arXiv ID: http://arxiv.org/abs/2501.01625v1
Abstract:
Despite the recent success of Large Language Models (LLMs), it remains challenging to feed LLMs with long prompts due to the fixed size of LLM inputs. As a remedy, prompt compression becomes a promising solution by removing redundant tokens in the prompt. However, using an LLM in existing works requires additional computation resources and leads to memory overheads. To address this, we propose ICPC (In-context Prompt Compression), a novel and scalable prompt compression method that adaptively reduces the prompt length. The key idea of ICPC is to calculate the probability of each word appearing in the prompt using encoders and to calculate the information carried by each word through the information function, which effectively reduces the information loss during prompt compression and increases the speed of compression. Empirically, we demonstrate that ICPC can effectively compress long texts of different categories and thus achieve better performance and speed on different types of NLP tasks.
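The abstract does not specify the information function; a standard choice is Shannon self-information, $I(w) = -\log p(w)$. A minimal sketch under that assumption (not the authors' implementation; `word_prob` is a hypothetical stand-in for the encoder-derived probability):

```python
import math

def self_information(p: float) -> float:
    # Shannon self-information: rarer (harder-to-predict) words carry more information.
    return -math.log(max(p, 1e-12))

def compress_prompt(words, word_prob, keep_ratio=0.5):
    """Drop the least informative words while preserving the original order.

    word_prob(word) -> probability of the word in context; in ICPC this
    would come from an encoder, here it is a caller-supplied stand-in.
    """
    scored = sorted(
        ((self_information(word_prob(w)), i, w) for i, w in enumerate(words)),
        reverse=True,
    )
    k = max(1, int(len(words) * keep_ratio))
    kept = sorted(scored[:k], key=lambda t: t[1])  # restore prompt order
    return " ".join(w for _, _, w in kept)
```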
Decision Explanation: Original decision: ACCEPT The paper meets all 'MUST' criteria: it focuses primarily on prompt engineering for LLMs (in-context prompt compression), investigates a method to improve LLM performance through prompt manipulation (reducing information loss during compression), and likely provides concrete examples (though not explicitly stated in the abstract, implied through 'empirically demonstrate' on various NLP tasks).
[Adaptive Few-shot Prompting for Machine Translation with Pre-trained Language Models](https://arxiv.org/abs/2501.01679v1)
arXiv ID: http://arxiv.org/abs/2501.01679v1
Abstract:
Recently, large language models (LLMs) with in-context learning have demonstrated remarkable potential in handling neural machine translation. However, existing evidence shows that LLMs are prompt-sensitive, and it is sub-optimal to apply a fixed prompt to every input for downstream machine translation tasks. To address this issue, we propose an adaptive few-shot prompting (AFSP) framework to automatically select suitable translation demonstrations for various source input sentences to further elicit the translation capability of an LLM for better machine translation. First, we build a translation demonstration retrieval module based on the LLM's embedding to retrieve the top-k semantically similar translation demonstrations from an aligned parallel translation corpus. Rather than using other embedding models for semantic demonstration retrieval, we build a hybrid demonstration retrieval module based on the embedding layer of the deployed LLM to build better input representations for retrieving more semantically related translation demonstrations. Then, to ensure better semantic consistency between source inputs and target outputs, we force the deployed LLM itself to generate multiple output candidates in the target language with the help of translation demonstrations and rerank these candidates. Besides, to better evaluate the effectiveness of our AFSP framework on the latest language and extend the research boundary of neural machine translation, we construct a high-quality diplomatic Chinese-English parallel dataset that consists of 5,528 parallel Chinese-English sentences. Finally, extensive experiments on the proposed diplomatic Chinese-English parallel dataset and the United Nations Parallel Corpus (Chinese-English part) show the effectiveness and superiority of our proposed AFSP.
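A rough illustration of the retrieve-then-prompt recipe described above (a sketch, not the paper's code; `embed` is a random stand-in for the deployed LLM's embedding layer):

```python
import numpy as np

def embed(texts):
    # Stand-in: AFSP derives these vectors from the deployed LLM's embedding layer.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 8))

def top_k_demos(source, corpus, k=3):
    """Return the k (source, target) pairs most similar to `source` by cosine similarity."""
    q = embed([source])[0]
    keys = embed([s for s, _ in corpus])
    sims = keys @ q / (np.linalg.norm(keys, axis=1) * np.linalg.norm(q) + 1e-9)
    return [corpus[i] for i in np.argsort(-sims)[:k]]

def build_prompt(source, demos):
    # Few-shot prompt: retrieved demonstrations followed by the new source sentence.
    shots = "\n".join(f"Chinese: {s}\nEnglish: {t}" for s, t in demos)
    return f"{shots}\nChinese: {source}\nEnglish:"
```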
Decision Explanation: Original decision: ACCEPT The paper meets all 'MUST' criteria: it focuses on prompt engineering for LLMs (adaptive few-shot prompting for machine translation), investigates improving LLM performance through prompt manipulation, and provides concrete examples with a replicable framework using publicly available LLMs, as evidenced by experiments on constructed and existing datasets.
Table as Thought: Exploring Structured Thoughts in LLM Reasoning
arXiv ID: http://arxiv.org/abs/2501.02152v1
Abstract:
Large language models' reasoning abilities benefit from methods that organize their thought processes, such as chain-of-thought prompting, which employs a sequential structure to guide the reasoning process step-by-step. However, existing approaches focus primarily on organizing the sequence of thoughts, leaving structure in individual thought steps underexplored. To address this gap, we propose Table as Thought, a framework inspired by cognitive neuroscience theories on human thought. Table as Thought organizes reasoning within a tabular schema, where rows represent sequential thought steps and columns capture critical constraints and contextual information to enhance reasoning. The reasoning process iteratively populates the table until self-verification ensures completeness and correctness. Our experiments show that Table as Thought excels in planning tasks and demonstrates a strong potential for enhancing LLM performance in mathematical reasoning compared to unstructured thought baselines. This work provides a novel exploration of refining thought representation within LLMs, paving the way for advancements in reasoning and AI cognition.
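The abstract implies a prompt that asks the model to populate a table row by row and then self-verify; a hypothetical template along these lines (the paper's actual schema may differ):

```python
TABLE_PROMPT = """Solve the task by filling in a reasoning table.
Columns: Step | Action | Constraints checked | Notes
Add one row per thought step. After the final row, write VERIFIED
only if every constraint is satisfied; otherwise, keep adding rows.

Task: {task}
Table:
"""

def table_as_thought(task, llm):
    # llm: any callable that maps a prompt string to a completion string.
    return llm(TABLE_PROMPT.format(task=task))
```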
Decision Explanation: Original decision: ACCEPT The paper focuses primarily on prompt engineering for LLMs by proposing a novel 'Table as Thought' framework, which organizes reasoning within a tabular schema to enhance LLM performance. It provides concrete examples (planning and mathematical reasoning tasks) and demonstrates the impact of this structured prompt approach on LLM output, meeting all required criteria.
[The Prompt Alchemist: Automated LLM-Tailored Prompt Optimization for Test Case Generation](https://arxiv.org/abs/2501.01329v1)
arXiv ID: http://arxiv.org/abs/2501.01329v1
Abstract:
Test cases are essential for validating the reliability and quality of software applications. Recent studies have demonstrated the capability of Large Language Models (LLMs) to generate useful test cases for given source code. However, the existing work primarily relies on human-written plain prompts, which often leads to suboptimal results since the performance of LLMs can be highly influenced by the prompts. Moreover, these approaches use the same prompt for all LLMs, overlooking the fact that different LLMs might be best suited to different prompts. Given the wide variety of possible prompt formulations, automatically discovering the optimal prompt for each LLM presents a significant challenge. Although there are methods for automated prompt optimization in the natural language processing field, they struggle to produce effective prompts for the test case generation task. First, the methods iteratively optimize prompts by simply combining and mutating existing ones without proper guidance, resulting in prompts that lack diversity and tend to repeat the same errors in the generated test cases. Second, the prompts generally lack domain contextual knowledge, limiting LLMs' performance in the task.
Decision Explanation: Original decision: ACCEPT The paper meets all 'MUST' criteria, focusing primarily on prompt engineering for LLMs, investigating automated optimization of textual input prompts to improve LLM performance in test case generation, and likely providing concrete examples replicable with publicly available LLMs, as implied by its title and abstract.
[Recursive Decomposition of Logical Thoughts: Framework for Superior Reasoning and Knowledge Propagation in Large Language Models](https://arxiv.org/abs/2501.02026v1)
arXiv ID: http://arxiv.org/abs/2501.02026v1
Abstract:
Enhancing the reasoning capabilities of Large Language Models remains a critical challenge in artificial intelligence. We introduce RDoLT, Recursive Decomposition of Logical Thought prompting, a novel framework that significantly boosts LLM reasoning performance. RDoLT is built on three key innovations: (1) recursively breaking down complex reasoning tasks into sub-tasks of progressive complexity; (2) employing an advanced selection and scoring mechanism to identify the most promising reasoning thoughts; and (3) integrating a knowledge propagation module that mimics human learning by keeping track of strong and weak thoughts for information propagation. Our approach was evaluated across multiple benchmarks, including GSM8K, SVAMP, MultiArith, LastLetterConcatenation, and Gaokao2023 Math. The results demonstrate that RDoLT consistently outperforms existing state-of-the-art techniques, achieving a 90.98 percent accuracy on GSM8K with ChatGPT-4, surpassing state-of-the-art techniques by 6.28 percent. Similar improvements were observed on other benchmarks, with accuracy gains ranging from 5.5 percent to 6.75 percent. These findings highlight RDoLT's potential to advance prompt engineering, offering a more effective and generalizable approach to complex reasoning tasks.
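A heavily simplified sketch of the recursive decompose-score-combine loop (assumptions: `llm` maps a prompt to text and `score` rates a thought; the knowledge-propagation module is omitted):

```python
def rdolt(task, llm, score, depth=0, max_depth=2, beam=2):
    """Recursively split a task, keep the best-scoring partial thoughts, combine them."""
    if depth == max_depth:
        return llm(f"Solve directly: {task}")
    lines = llm(f"List simpler subtasks, one per line: {task}").splitlines()
    subtasks = [s for s in lines if s.strip()]
    thoughts = [rdolt(s, llm, score, depth + 1, max_depth, beam) for s in subtasks]
    best = sorted(thoughts, key=score, reverse=True)[:beam]  # selection/scoring step
    return llm("Combine these partial results into a final answer:\n"
               + "\n".join(best) + f"\nOriginal task: {task}")
```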
Decision Explanation: Original decision: ACCEPT Meets all 'MUST' criteria: (1) Focuses on prompt engineering for LLMs, (2) Investigates a novel method (RDoLT) to improve LLM performance through prompt manipulation, and (3) Provides concrete examples and benchmark results demonstrating the impact of RDoLT on LLM output, replicable with publicly available LLMs (ChatGPT-4).
Rejected Papers
[Beyond Text: Implementing Multimodal Large Language Model-Powered Multi-Agent Systems Using a No-Code Platform](https://arxiv.org/abs/2501.00750v1)
arXiv ID: http://arxiv.org/abs/2501.00750v1
Abstract:
This study proposes the design and implementation of a multimodal LLM-based Multi-Agent System (MAS) leveraging a No-Code platform to address the practical constraints and significant entry barriers associated with AI adoption in enterprises. Advanced AI technologies, such as Large Language Models (LLMs), often pose challenges due to their technical complexity and high implementation costs, making them difficult for many organizations to adopt. To overcome these limitations, this research develops a No-Code-based Multi-Agent System designed to enable users without programming knowledge to easily build and manage AI systems. The study examines various use cases to validate the applicability of AI in business processes, including code generation from image-based notes, Advanced RAG-based question-answering systems, text-based image generation, and video generation using images and prompts. These systems lower the barriers to AI adoption, empowering not only professional developers but also general users to harness AI for significantly improved productivity and efficiency. By demonstrating the scalability and accessibility of No-Code platforms, this study advances the democratization of AI technologies within enterprises and validates the practical applicability of Multi-Agent Systems, ultimately contributing to the widespread adoption of AI across various industries.
Decision Explanation: Original decision: REJECT The paper's primary focus is on implementing a No-Code platform for Multi-Agent Systems using LLMs, not specifically on prompt engineering for text-based interactions with LLMs, failing to meet the core subject requirement. Prompt engineering is mentioned (e.g., using images and prompts for video generation) but is not the central focus.
A3: Android Agent Arena for Mobile GUI Agents
arXiv ID: http://arxiv.org/abs/2501.01149v1
Abstract:
AI agents have become increasingly prevalent in recent years, driven by significant advancements in the field of large language models (LLMs). Mobile GUI agents, a subset of AI agents, are designed to autonomously perform tasks on mobile devices. While numerous studies have introduced agents, datasets, and benchmarks to advance mobile GUI agent research, many existing datasets focus on static frame evaluations and fail to provide a comprehensive platform for assessing performance on real-world, in-the-wild tasks. To address this gap, we present Android Agent Arena (A3), a novel evaluation platform. Unlike existing in-the-wild systems, A3 offers: (1) meaningful and practical tasks, such as real-time online information retrieval and operational instructions; (2) a larger, more flexible action space, enabling compatibility with agents trained on any dataset; and (3) automated business-level LLM-based evaluation process. A3 includes 21 widely used general third-party apps and 201 tasks representative of common user scenarios, providing a robust foundation for evaluating mobile GUI agents in real-world situations and a new autonomous evaluation process for less human labor and coding expertise. The project is available at \url{https://yuxiangchai.github.io/Android-Agent-Arena/}.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a novel evaluation platform (Android Agent Arena) for mobile GUI agents, with LLMs mentioned only as part of the automated evaluation process, not as the central subject for prompt engineering. The core subject does not meet the 'MUST' criteria of focusing primarily on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs).
[Rethinking Relation Extraction: Beyond Shortcuts to Generalization with a Debiased Benchmark](https://arxiv.org/abs/2501.01349v1)
arXiv ID: http://arxiv.org/abs/2501.01349v1
Abstract:
Benchmarks are crucial for evaluating machine learning algorithm performance, facilitating comparison and identifying superior solutions. However, biases within datasets can lead models to learn shortcut patterns, resulting in inaccurate assessments and hindering real-world applicability. This paper addresses the issue of entity bias in relation extraction tasks, where models tend to rely on entity mentions rather than context. We propose a debiased relation extraction benchmark DREB that breaks the pseudo-correlation between entity mentions and relation types through entity replacement. DREB utilizes Bias Evaluator and PPL Evaluator to ensure low bias and high naturalness, providing a reliable and accurate assessment of model generalization in entity bias scenarios. To establish a new baseline on DREB, we introduce MixDebias, a debiasing method combining data-level and model training-level techniques. MixDebias effectively improves model performance on DREB while maintaining performance on the original dataset. Extensive experiments demonstrate the effectiveness and robustness of MixDebias compared to existing methods, highlighting its potential for improving the generalization ability of relation extraction models. We will release DREB and MixDebias publicly.
Decision Explanation: Original decision: REJECT The paper focuses on debiasing relation extraction benchmarks and proposing a new debiasing method (MixDebias), rather than primarily investigating prompt engineering, optimizing textual input prompts for Large Language Models (LLMs), or demonstrating the impact of prompts on LLM output.
ASKCOS: an open source software suite for synthesis planning
arXiv ID: http://arxiv.org/abs/2501.01835v1
Abstract:
The advancement of machine learning and the availability of large-scale reaction datasets have accelerated the development of data-driven models for computer-aided synthesis planning (CASP) in the past decade. Here, we detail the newest version of ASKCOS, an open source software suite for synthesis planning that makes available several research advances in a freely available, practical tool. Four one-step retrosynthesis models form the basis of both interactive planning and automatic planning modes. Retrosynthetic planning is complemented by other modules for feasibility assessment and pathway evaluation, including reaction condition recommendation, reaction outcome prediction, and auxiliary capabilities such as solubility prediction and quantum mechanical descriptor prediction. ASKCOS has assisted hundreds of medicinal, synthetic, and process chemists in their day-to-day tasks, complementing expert decision making. It is our belief that CASP tools like ASKCOS are an important part of modern chemistry research, and that they offer ever-increasing utility and accessibility.
Decision Explanation: Original decision: REJECT The paper focuses primarily on the development of an open source software suite for synthesis planning in chemistry, with no evident focus on prompt engineering for Large Language Models (LLMs) or improving LLM performance through textual input prompt manipulation.
[eRevise+RF: A Writing Evaluation System for Assessing Student Essay Revisions and Providing Formative Feedback](https://arxiv.org/abs/2501.00715v1)
arXiv ID: http://arxiv.org/abs/2501.00715v1
Abstract:
The ability to revise essays in response to feedback is important for students' writing success. An automated writing evaluation (AWE) system that supports students in revising their essays is thus essential. We present eRevise+RF, an enhanced AWE system for assessing student essay revisions (e.g., changes made to an essay to improve its quality in response to essay feedback) and providing revision feedback. We deployed the system with 6 teachers and 406 students across 3 schools in Pennsylvania and Louisiana. The results confirmed its effectiveness in (1) assessing student essays in terms of evidence usage, (2) extracting evidence and reasoning revisions across essays, and (3) determining revision success in responding to feedback. The evaluation also suggested eRevise+RF is a helpful system for young students to improve their argumentative writing skills through revision and formative feedback.
Decision Explanation: Original decision: REJECT The paper focuses primarily on the development and evaluation of an automated writing evaluation (AWE) system for assessing student essay revisions, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), and does not demonstrate the impact of textual input prompts on LLM output.
[Towards End-to-End Neuromorphic Voxel-based 3D Object Reconstruction Without Physical Priors](https://arxiv.org/abs/2501.00741v1)
arXiv ID: http://arxiv.org/abs/2501.00741v1
Abstract:
Neuromorphic cameras, also known as event cameras, are asynchronous brightness-change sensors that can capture extremely fast motion without suffering from motion blur, making them particularly promising for 3D reconstruction in extreme environments. However, existing research on 3D reconstruction using monocular neuromorphic cameras is limited, and most of the methods rely on estimating physical priors and employ complex multi-step pipelines. In this work, we propose an end-to-end method for dense voxel 3D reconstruction using neuromorphic cameras that eliminates the need to estimate physical priors. Our method incorporates a novel event representation to enhance edge features, enabling the proposed feature-enhancement model to learn more effectively. Additionally, we introduce the Optimal Binarization Threshold Selection Principle as a guideline for future related work, using the optimal reconstruction results achieved with threshold optimization as the benchmark. Our method achieves a 54.6% improvement in reconstruction accuracy compared to the baseline method.
Decision Explanation: Original decision: REJECT The paper focuses on 3D object reconstruction using neuromorphic cameras, with no apparent connection to Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.
[AttriReBoost: A Gradient-Free Propagation Optimization Method for Cold Start Mitigation in Attribute Missing Graphs](https://arxiv.org/abs/2501.00743v1)
arXiv ID: http://arxiv.org/abs/2501.00743v1
Abstract:
Missing attribute issues are prevalent in graph learning, leading to biased outcomes in Graph Neural Networks (GNNs). Existing methods that rely on feature propagation are prone to the cold start problem, particularly when dealing with attribute resetting and low-degree nodes, which hinder effective propagation and convergence. To address these challenges, we propose AttriReBoost (ARB), a novel method that incorporates a propagation-based approach to mitigate cold start problems in attribute-missing graphs. ARB enhances global feature propagation by redefining initial boundary conditions and strategically integrating virtual edges, thereby improving node connectivity and ensuring more stable and efficient convergence. This method facilitates gradient-free attribute reconstruction with lower computational overhead. The proposed method is theoretically grounded, with its convergence rigorously established. Extensive experiments on several real-world benchmark datasets demonstrate the effectiveness of ARB, achieving an average accuracy improvement of 5.11% over state-of-the-art methods. Additionally, ARB exhibits remarkable computational efficiency, processing a large-scale graph with 2.49 million nodes in just 16 seconds on a single GPU. Our code is available at https://github.com/limengran98/ARB.
Decision Explanation: Original decision: REJECT The paper focuses on mitigating cold start problems in attribute-missing graphs for Graph Neural Networks (GNNs), which does not meet the 'MUST' criteria of primarily focusing on engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts for improving LLM performance.
Enhancing Transformers for Generalizable First-Order Logical Entailment
arXiv ID: http://arxiv.org/abs/2501.00759v1
Abstract:
Transformers, as a fundamental deep learning architecture, have demonstrated remarkable capabilities in reasoning. This paper investigates the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge and explores ways to improve it. The first-order reasoning capability of transformers is assessed through their ability to perform first-order logical entailment, which is quantitatively measured by their performance in answering knowledge graph queries. We establish connections between (1) two types of distribution shifts studied in out-of-distribution generalization and (2) the unseen knowledge and query settings discussed in the task of knowledge graph query answering, enabling a characterization of fine-grained generalizability. Results on our comprehensive dataset show that transformers outperform previous methods specifically designed for this task and provide detailed empirical evidence on the impact of input query syntax, token embedding, and transformer architectures on the reasoning capability of transformers. Interestingly, our findings reveal a mismatch between positional encoding and other design choices in transformer architectures employed in prior practices. This discovery motivates us to propose a more sophisticated, logic-aware architecture, TEGA, to enhance the capability for generalizable first-order logical entailment in transformers.
Decision Explanation: Original decision: REJECT The paper focuses on enhancing transformer architectures for first-order logical entailment, which aligns with the development of new LLM architectures, rather than primarily focusing on prompt engineering for text-based interactions with existing LLMs.
[REM: A Scalable Reinforced Multi-Expert Framework for Multiplex Influence Maximization](https://arxiv.org/abs/2501.00779v1)
arXiv ID: http://arxiv.org/abs/2501.00779v1
Abstract:
In social online platforms, identifying influential seed users to maximize influence spread is crucial, as it can greatly diminish the cost and effort required for information dissemination. While effective, traditional methods for Multiplex Influence Maximization (MIM) have reached their performance limits, prompting the emergence of learning-based approaches. These novel methods aim for better generalization and scalability for more sizable graphs but face significant challenges, such as (1) inability to handle unknown diffusion patterns and (2) reliance on high-quality training samples. To address these issues, we propose the Reinforced Expert Maximization framework (REM). REM leverages a Propagation Mixture of Experts technique to encode dynamic propagation of large multiplex networks effectively in order to generate enhanced influence propagation. Noticeably, REM treats a generative model as a policy to autonomously generate different seed sets and learn how to improve them from a Reinforcement Learning perspective. Extensive experiments on several real-world datasets demonstrate that REM surpasses state-of-the-art methods in terms of influence spread, scalability, and inference time in influence maximization tasks.
Decision Explanation: Original decision: REJECT The paper does not meet the criteria as it primarily focuses on developing a novel framework (REM) for Multiplex Influence Maximization, leveraging Reinforcement Learning and a generative model, but does not concentrate on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) or demonstrate the impact of textual input prompts on LLM output.
[Make Shuffling Great Again: A Side-Channel Resistant Fisher-Yates Algorithm for Protecting Neural Networks](https://arxiv.org/abs/2501.00798v1)
arXiv ID: http://arxiv.org/abs/2501.00798v1
Abstract:
Neural network models implemented in embedded devices have been shown to be susceptible to side-channel attacks (SCAs), allowing recovery of proprietary model parameters, such as weights and biases. There are already available countermeasure methods currently used for protecting cryptographic implementations that can be tailored to protect embedded neural network models. Shuffling, a hiding-based countermeasure that randomly shuffles the order of computations, was shown to be vulnerable to SCA when the Fisher-Yates algorithm is used. In this paper, we propose a design of an SCA-secure version of the Fisher-Yates algorithm. By integrating the masking technique for modular reduction and Blakely's method for modular multiplication, we effectively remove the vulnerability in the division operation that led to side-channel leakage in the original version of the algorithm. We experimentally evaluate that the countermeasure is effective against SCA by implementing a correlation power analysis attack on an embedded neural network model implemented on ARM Cortex-M4. Compared to the original proposal, the memory overhead is $2\times$ the biggest layer of the network, while the time overhead varies from $4\%$ to $0.49\%$ for a layer with $100$ and $1000$ neurons, respectively.
Decision Explanation: Original decision: REJECT The paper focuses on protecting neural networks from side-channel attacks by modifying the Fisher-Yates algorithm, which does not meet the 'MUST' criteria: it does not primarily focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs), nor does it investigate improving LLM performance through textual input prompt manipulation.
[Reasoning-Oriented and Analogy-Based Methods for Locating and Editing in Zero-Shot Event-Relational Reasoning](https://arxiv.org/abs/2501.00803v1)
arXiv ID: http://arxiv.org/abs/2501.00803v1
Abstract:
Zero-shot event-relational reasoning is an important task in natural language processing, and existing methods jointly learn a variety of event-relational prefixes and inference-form prefixes to achieve such tasks. However, training prefixes consumes large computational resources and lacks interpretability. Additionally, learning various relational and inferential knowledge inefficiently exploits the connections between tasks. Therefore, we first propose a method for Reasoning-Oriented Locating and Editing (ROLE), which locates and edits the key modules of the language model for reasoning about event relations, enhancing interpretability and also resource-efficiently optimizing the reasoning ability. Subsequently, we propose a method for Analogy-Based Locating and Editing (ABLE), which efficiently exploits the similarities and differences between tasks to optimize the zero-shot reasoning capability. Experimental results show that ROLE improves interpretability and reasoning performance with reduced computational cost. ABLE achieves SOTA results in zero-shot reasoning.
Decision Explanation: Original decision: REJECT The paper primarily focuses on optimizing the language model's reasoning ability through module editing and analogy-based methods, rather than engineering or manipulating textual input prompts to improve Large Language Model (LLM) performance, thus not meeting the core subject requirement of prompt engineering for text-based LLM interactions.
LLM-Powered Multi-Agent System for Automated Crypto Portfolio Management
arXiv ID: http://arxiv.org/abs/2501.00826v2
Abstract:
Cryptocurrency investment is inherently difficult due to its shorter history compared to traditional assets, the need to integrate vast amounts of data from various modalities, and the requirement for complex reasoning. While deep learning approaches have been applied to address these challenges, their black-box nature raises concerns about trust and explainability. Recently, large language models (LLMs) have shown promise in financial applications due to their ability to understand multi-modal data and generate explainable decisions. However, a single LLM faces limitations in complex, comprehensive tasks such as asset investment. These limitations are even more pronounced in cryptocurrency investment, where LLMs have less domain-specific knowledge in their training corpora. To overcome these challenges, we propose an explainable, multi-modal, multi-agent framework for cryptocurrency investment. Our framework uses specialized agents that collaborate within and across teams to handle subtasks such as data analysis, literature integration, and investment decision-making for the top 30 cryptocurrencies by market capitalization. The expert training module fine-tunes agents using multi-modal historical data and professional investment literature, while the multi-agent investment module employs real-time data to make informed cryptocurrency investment decisions. Unique intrateam and interteam collaboration mechanisms enhance prediction accuracy by adjusting final predictions based on confidence levels within agent teams and facilitating information sharing between teams. Empirical evaluation using data from November 2023 to September 2024 demonstrates that our framework outperforms single-agent models and market benchmarks in classification, asset pricing, portfolio, and explainability performance.
Decision Explanation: Original decision: REJECT The paper's primary focus is on developing a multi-agent system for automated crypto portfolio management using LLMs, rather than specifically engineering or optimizing prompts for LLMs, thus not meeting the core subject requirement.
[Embedding Style Beyond Topics: Analyzing Dispersion Effects Across Different Language Models](https://arxiv.org/abs/2501.00828v1)
arXiv ID: http://arxiv.org/abs/2501.00828v1
Abstract:
This paper analyzes how writing style affects the dispersion of embedding vectors across multiple, state-of-the-art language models. While early transformer models primarily aligned with topic modeling, this study examines the role of writing style in shaping embedding spaces. Using a literary corpus that alternates between topics and styles, we compare the sensitivity of language models across French and English. By analyzing the particular impact of style on embedding dispersion, we aim to better understand how language models process stylistic information, contributing to their overall interpretability.
Decision Explanation: Original decision: REJECT The paper focuses on analyzing the impact of writing style on embedding vector dispersion across language models, rather than specifically investigating prompt engineering techniques for improving Large Language Model (LLM) performance through textual input prompt manipulation.
[An LLM-Empowered Adaptive Evolutionary Algorithm For Multi-Component Deep Learning Systems](https://arxiv.org/abs/2501.00829v1)
arXiv ID: http://arxiv.org/abs/2501.00829v1
Abstract:
Multi-objective evolutionary algorithms (MOEAs) are widely used for searching optimal solutions in complex multi-component applications. Traditional MOEAs for multi-component deep learning (MCDL) systems face challenges in enhancing the search efficiency while maintaining the diversity. To combat these, this paper proposes $\mu$MOEA, the first LLM-empowered adaptive evolutionary search algorithm to detect safety violations in MCDL systems. Inspired by the context-understanding ability of Large Language Models (LLMs), $\mu$MOEA prompts the LLM to comprehend the optimization problem and generate an initial population tailored to evolutionary objectives. Subsequently, it employs adaptive selection and variation to iteratively produce offspring, balancing the evolutionary efficiency and diversity. During the evolutionary process, to navigate away from the local optima, $\mu$MOEA integrates the evolutionary experience back into the LLM. This utilization harnesses the LLM's quantitative reasoning prowess to generate differential seeds, breaking away from current optimal solutions. We evaluate $\mu$MOEA in finding safety violations of MCDL systems, and compare its performance with state-of-the-art MOEA methods. Experimental results show that $\mu$MOEA can significantly improve the efficiency and diversity of the evolutionary search.
Decision Explanation: Original decision: REJECT The paper's primary focus is on developing an adaptive evolutionary algorithm for multi-component deep learning systems, utilizing LLMs as a component, rather than focusing specifically on prompt engineering for Large Language Models. Prompt engineering is not the central concern, but rather a means to enhance the algorithm's performance.
[LLM+AL: Bridging Large Language Models and Action Languages for Complex Reasoning about Actions](https://arxiv.org/abs/2501.00830v1)
arXiv ID: http://arxiv.org/abs/2501.00830v1
Abstract:
Large Language Models (LLMs) have made significant strides in various intelligent tasks but still struggle with complex action reasoning tasks that require systematic search. To address this limitation, we propose a method that bridges the natural language understanding capabilities of LLMs with the symbolic reasoning strengths of action languages. Our approach, termed "LLM+AL," leverages the LLM's strengths in semantic parsing and commonsense knowledge generation alongside the action language's proficiency in automated reasoning based on encoded knowledge. We compare LLM+AL against state-of-the-art LLMs, including ChatGPT-4, Claude 3 Opus, Gemini Ultra 1.0, and o1-preview, using benchmarks for complex reasoning about actions. Our findings indicate that, although all methods exhibit errors, LLM+AL, with relatively minimal human corrections, consistently leads to correct answers, whereas standalone LLMs fail to improve even with human feedback. LLM+AL also contributes to automated generation of action languages.
Decision Explanation: Original decision: REJECT The paper primarily focuses on integrating LLMs with Action Languages for complex reasoning, rather than prompt engineering specifically for LLMs, and does not provide concrete examples of textual input prompts and their impact on LLM output.
Distilled Lifelong Self-Adaptation for Configurable Systems
arXiv ID: http://arxiv.org/abs/2501.00840v1
Abstract:
Modern configurable systems provide tremendous opportunities for engineering future intelligent software systems. A key difficulty thereof is how to effectively self-adapt the configuration of a running system such that its performance (e.g., runtime and throughput) can be optimized under time-varying workloads. This unfortunately remains unaddressed in existing approaches as they either overlook the available past knowledge or rely on static exploitation of past knowledge without reasoning the usefulness of information when planning for self-adaptation. In this paper, we tackle this challenging problem by proposing DLiSA, a framework that self-adapts configurable systems. DLiSA comes with two properties: firstly, it supports lifelong planning, and thereby the planning process runs continuously throughout the lifetime of the system, allowing dynamic exploitation of the accumulated knowledge for rapid adaptation. Secondly, the planning for a newly emerged workload is boosted via distilled knowledge seeding, in which the knowledge is dynamically purified such that only useful past configurations are seeded when necessary, mitigating misleading information. Extensive experiments suggest that the proposed DLiSA significantly outperforms state-of-the-art approaches, demonstrating a performance improvement of up to 229% and a resource acceleration of up to 2.22x on generating promising adaptation configurations. All data and sources can be found at our repository: https://github.com/ideas-labo/dlisa.
Decision Explanation: Original decision: REJECT The paper does not meet the primary focus criteria, as it concentrates on self-adaptation for configurable systems' performance optimization, lacking any mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, and instead focuses on system configuration and workload adaptation.
[Diversity Optimization for Travelling Salesman Problem via Deep Reinforcement Learning](https://arxiv.org/abs/2501.00884v1)
arXiv ID: http://arxiv.org/abs/2501.00884v1
Abstract:
Existing neural methods for the Travelling Salesman Problem (TSP) mostly aim at finding a single optimal solution. To discover diverse yet high-quality solutions for Multi-Solution TSP (MSTSP), we propose a novel deep reinforcement learning based neural solver, which is primarily featured by an encoder-decoder structured policy. Concretely, on the one hand, a Relativization Filter (RF) is designed to enhance the robustness of the encoder to affine transformations of the instances, so as to potentially improve the quality of the found solutions. On the other hand, a Multi-Attentive Adaptive Active Search (MA3S) is tailored to allow the decoders to strike a balance between the optimality and diversity. Experimental evaluations on benchmark instances demonstrate the superiority of our method over recent neural baselines across different metrics, and its competitive performance against state-of-the-art traditional heuristics with significantly reduced computational time, ranging from $1.3\times$ to $15\times$ faster. Furthermore, we demonstrate that our method can also be applied to the Capacitated Vehicle Routing Problem (CVRP).
Decision Explanation: Original decision: REJECT The paper does not focus on prompt engineering for Large Language Models (LLMs) as required. Instead, it explores deep reinforcement learning for solving the Travelling Salesman Problem and Capacitated Vehicle Routing Problem, which does not meet the primary criteria of prompt engineering for text-based interactions with LLMs.
Population Aware Diffusion for Time Series Generation
arXiv ID: http://arxiv.org/abs/2501.00910v1
Abstract:
Diffusion models have shown promising ability in generating high-quality time series (TS) data. Despite the initial success, existing works mostly focus on the authenticity of data at the individual level, but pay less attention to preserving the population-level properties on the entire dataset. Such population-level properties include value distributions for each dimension and distributions of certain functional dependencies (e.g., cross-correlation, CC) between different dimensions. For instance, when generating house energy consumption TS data, the value distributions of the outside temperature and the kitchen temperature should be preserved, as well as the distribution of CC between them. Preserving such TS population-level properties is critical in maintaining the statistical insights of the datasets, mitigating model bias, and augmenting downstream tasks like TS prediction. Yet, it is often overlooked by existing models. Hence, data generated by existing models often bear distribution shifts from the original data. We propose Population-aware Diffusion for Time Series (PaD-TS), a new TS generation model that better preserves the population-level properties. The key novelties of PaD-TS include (1) a new training method explicitly incorporating TS population-level property preservation, and (2) a new dual-channel encoder model architecture that better captures the TS data structure. Empirical results in major benchmark datasets show that PaD-TS can improve the average CC distribution shift score between real and synthetic data by 5.9x while maintaining a performance comparable to state-of-the-art models on individual-level authenticity.
Decision Explanation: Original decision: REJECT The paper focuses on developing a new diffusion model for time series generation, emphasizing population-level property preservation, without any mention of Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.
β-DQN: Improving Deep Q-Learning By Evolving the Behavior
arXiv ID: http://arxiv.org/abs/2501.00913v1
Abstract:
While many sophisticated exploration methods have been proposed, their lack of generality and high computational cost often lead researchers to favor simpler methods like $\epsilon$-greedy. Motivated by this, we introduce $\beta$-DQN, a simple and efficient exploration method that augments the standard DQN with a behavior function $\beta$. This function estimates the probability that each action has been taken at each state. By leveraging $\beta$, we generate a population of diverse policies that balance exploration between state-action coverage and overestimation bias correction. An adaptive meta-controller is designed to select an effective policy for each episode, enabling flexible and explainable exploration. $\beta$-DQN is straightforward to implement and adds minimal computational overhead to the standard DQN. Experiments on both simple and challenging exploration domains show that $\beta$-DQN outperforms existing baseline methods across a wide range of tasks, providing an effective solution for improving exploration in deep reinforcement learning.
Decision Explanation: Original decision: REJECT The paper focuses primarily on improving Deep Q-Learning with a new exploration method, not on the engineering, design, or optimization of prompts for Large Language Models (LLMs), and does not investigate the manipulation of textual input prompts to improve LLM performance.
[Incremental Dialogue Management: Survey, Discussion, and Implications for HRI](https://arxiv.org/abs/2501.00953v1)
arXiv ID: http://arxiv.org/abs/2501.00953v1
Abstract:
Efforts towards endowing robots with the ability to speak have benefited from recent advancements in NLP, in particular large language models. However, as powerful as current models have become, they still operate on sentence or multi-sentence level input, not on the word-by-word input that humans operate on, affecting the degree of responsiveness that they offer, which is critical in situations where humans interact with robots using speech. In this paper, we review the literature on interactive systems that operate incrementally (i.e., at the word level or below it). We motivate the need for incremental systems and survey incremental modeling of important aspects of dialogue like speech recognition and language generation. The primary focus is on the part of the system that makes decisions, known as the dialogue manager. We find that there is very little research on incremental dialogue management, offer some requirements for practical incremental dialogue management, and discuss the implications of incremental dialogue for embodied, robotic platforms.
Decision Explanation: Original decision: REJECT The paper's primary focus is on incremental dialogue management for Human-Robot Interaction (HRI), not specifically on the engineering, design, or optimization of prompts for Large Language Models (LLMs), failing to meet the core subject requirement.
[Are LLMs effective psychological assessors? Leveraging adaptive RAG for interpretable mental health screening through psychometric practice](https://arxiv.org/abs/2501.00982v1)
arXiv ID: http://arxiv.org/abs/2501.00982v1
Abstract:
In psychological practice, standardized questionnaires serve as essential tools for assessing mental constructs (e.g., attitudes, traits, and emotions) through structured questions (aka items). With the increasing prevalence of social media platforms where users share personal experiences and emotions, researchers are exploring computational methods to leverage this data for rapid mental health screening. In this study, we propose a novel adaptive Retrieval-Augmented Generation (RAG) approach that completes psychological questionnaires by analyzing social media posts. Our method retrieves the most relevant user posts for each question in a psychological survey and uses Large Language Models (LLMs) to predict questionnaire scores in a zero-shot setting. Our findings are twofold. First we demonstrate that this approach can effectively predict users' responses to psychological questionnaires, such as the Beck Depression Inventory II (BDI-II), achieving performance comparable to or surpassing state-of-the-art models on Reddit-based benchmark datasets without relying on training data. Second, we show how this methodology can be generalized as a scalable screening tool, as the final assessment is systematically derived by completing standardized questionnaires and tracking how individual item responses contribute to the diagnosis, aligning with established psychometric practices.
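In outline, the method scores one questionnaire item at a time against retrieved posts; a toy sketch with `retrieve` and `llm` as unspecified stand-ins (the 0-3 scale mirrors BDI-II-style items):

```python
def screen(posts, items, retrieve, llm):
    """Complete a questionnaire item by item from retrieved posts, zero-shot."""
    total = 0
    for item in items:
        evidence = "\n".join(retrieve(item, posts, k=5))  # most relevant posts
        reply = llm(
            f"Posts:\n{evidence}\n\n"
            f"Questionnaire item: {item}\n"
            "Predict the author's response on a 0-3 scale. Answer with one digit."
        )
        total += int(reply.strip()[0])  # naive parse; a real system validates output
    return total
```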
Decision Explanation: Original decision: REJECT The paper's primary focus is leveraging LLMs for mental health screening via social media analysis, rather than prompt engineering for text-based interactions with LLMs. Prompt manipulation is not the central concern, but rather a means to achieve psychological assessment.
Bootstrapped Reward Shaping
arXiv ID: http://arxiv.org/abs/2501.00989v1
Abstract:
In reinforcement learning, especially in sparse-reward domains, many environment steps are required to observe reward information. In order to increase the frequency of such observations, "potential-based reward shaping" (PBRS) has been proposed as a method of providing a more dense reward signal while leaving the optimal policy invariant. However, the required "potential function" must be carefully designed with task-dependent knowledge to not deter training performance. In this work, we propose a "bootstrapped" method of reward shaping, termed BSRS, in which the agent's current estimate of the state-value function acts as the potential function for PBRS. We provide convergence proofs for the tabular setting, give insights into training dynamics for deep RL, and show that the proposed method improves training speed in the Atari suite.
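For context, classic PBRS adds the shaping term $F(s, s') = \gamma\,\Phi(s') - \Phi(s)$ to the reward, and the abstract's bootstrapped variant plugs the agent's current state-value estimate in as $\Phi$. A one-function sketch of that choice:

```python
def bsrs_reward(r, s, s_next, v_estimate, gamma=0.99):
    # Potential-based shaping with the bootstrapped potential Phi(s) = V_hat(s):
    # r' = r + gamma * Phi(s_next) - Phi(s), which leaves the optimal policy invariant.
    return r + gamma * v_estimate(s_next) - v_estimate(s)
```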
Decision Explanation: Original decision: REJECT The paper focuses on reinforcement learning, proposing a method for reward shaping in sparse-reward domains, and does not meet any of the 'MUST' criteria, particularly lacking focus on prompt engineering, design, or optimization for Large Language Models (LLMs), and instead pertains to a different area of AI research.
[Exploring Information Processing in Large Language Models: Insights from Information Bottleneck Theory](https://arxiv.org/abs/2501.00999v2)
arXiv ID: http://arxiv.org/abs/2501.00999v2
Abstract:
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks by understanding input information and predicting corresponding outputs. However, the internal mechanisms by which LLMs comprehend input and make effective predictions remain poorly understood. In this paper, we explore the working mechanism of LLMs in information processing from the perspective of Information Bottleneck Theory. We propose a non-training construction strategy to define a task space and identify the following key findings: (1) LLMs compress input information into specific task spaces (e.g., sentiment space, topic space) to facilitate task understanding; (2) they then extract and utilize relevant information from the task space at critical moments to generate accurate predictions. Based on these insights, we introduce two novel approaches: an Information Compression-based Context Learning (IC-ICL) and a Task-Space-guided Fine-Tuning (TS-FT). IC-ICL enhances reasoning performance and inference efficiency by compressing retrieved example information into the task space. TS-FT employs a space-guided loss to fine-tune LLMs, encouraging the learning of more effective compression and selection mechanisms. Experiments across multiple datasets validate the effectiveness of task space construction. Additionally, IC-ICL not only improves performance but also accelerates inference speed by over 40%, while TS-FT achieves superior results with a minimal strategy adjustment.
Decision Explanation: Original decision: REJECT The paper primarily focuses on understanding LLM's internal information processing mechanisms through Information Bottleneck Theory and proposes methods for fine-tuning (TS-FT) and context learning (IC-ICL), which do not meet the 'MUST' criteria of focusing on the engineering, design, or optimization of prompts specifically for LLMs, nor does it provide concrete examples of prompts and their impact on LLM output.
[Deep Reinforcement Learning for Job Scheduling and Resource Management in Cloud Computing: An Algorithm-Level Review](https://arxiv.org/abs/2501.01007v1)
arXiv ID: http://arxiv.org/abs/2501.01007v1
Abstract:
Cloud computing has revolutionized the provisioning of computing resources, offering scalable, flexible, and on-demand services to meet the diverse requirements of modern applications. At the heart of efficient cloud operations are job scheduling and resource management, which are critical for optimizing system performance and ensuring timely and cost-effective service delivery. However, the dynamic and heterogeneous nature of cloud environments presents significant challenges for these tasks, as workloads and resource availability can fluctuate unpredictably. Traditional approaches, including heuristic and meta-heuristic algorithms, often struggle to adapt to these real-time changes due to their reliance on static models or predefined rules. Deep Reinforcement Learning (DRL) has emerged as a promising solution to these challenges by enabling systems to learn and adapt policies based on continuous observations of the environment, facilitating intelligent and responsive decision-making. This survey provides a comprehensive review of DRL-based algorithms for job scheduling and resource management in cloud computing, analyzing their methodologies, performance metrics, and practical applications. We also highlight emerging trends and future research directions, offering valuable insights into leveraging DRL to advance both job scheduling and resource management in cloud computing.
Decision Explanation: Original decision: REJECT The paper primarily focuses on Deep Reinforcement Learning for job scheduling and resource management in cloud computing, with no mention of Large Language Models (LLMs), prompt engineering, or textual input manipulation, thus failing to meet all 'MUST' criteria.
[MDSF: Context-Aware Multi-Dimensional Data Storytelling Framework based on Large language Model](https://arxiv.org/abs/2501.01014v1)
arXiv ID: http://arxiv.org/abs/2501.01014v1
Abstract:
The exponential growth of data and advancements in big data technologies have created a demand for more efficient and automated approaches to data analysis and storytelling. However, automated data analysis systems still face challenges in leveraging large language models (LLMs) for data insight discovery, augmented analysis, and data storytelling. This paper introduces the Multidimensional Data Storytelling Framework (MDSF) based on large language models for automated insight generation and context-aware storytelling. The framework incorporates advanced preprocessing techniques, augmented analysis algorithms, and a unique scoring mechanism to identify and prioritize actionable insights. The use of fine-tuned LLMs enhances contextual understanding and generates narratives with minimal manual intervention. The architecture also includes an agent-based mechanism for real-time storytelling continuation control. Key findings reveal that MDSF outperforms existing methods across various datasets in terms of insight ranking accuracy, descriptive quality, and narrative coherence. The experimental evaluation demonstrates MDSF's ability to automate complex analytical tasks, reduce interpretive biases, and improve user satisfaction. User studies further underscore its practical utility in enhancing content structure, conclusion extraction, and richness of detail.
Decision Explanation: Original decision: REJECT The paper focuses primarily on developing a multidimensional data storytelling framework using LLMs as a component, rather than centered on prompt engineering for LLMs. The core subject is the framework's architecture and performance, not novel prompt engineering techniques or systematic prompt development for text-based interactions with LLMs.
Towards Adversarially Robust Deep Metric Learning
arXiv ID: http://arxiv.org/abs/2501.01025v2
Abstract:
Deep Metric Learning (DML) has shown remarkable successes in many domains by taking advantage of powerful deep neural networks. Deep neural networks are prone to adversarial attacks and could be easily fooled by adversarial examples. The current progress on this robustness issue is mainly about deep classification models but pays little attention to DML models. Existing works fail to thoroughly inspect the robustness of DML and neglect an important DML scenario, the clustering-based inference. In this work, we first point out the robustness issue of DML models in clustering-based inference scenarios. We find that, for the clustering-based inference, existing defenses designed for DML cannot be reused, and adaptations of defenses designed for deep classification models cannot achieve satisfactory robustness performance. To alleviate the hazard of adversarial examples, we propose a new defense, the Ensemble Adversarial Training (EAT), which exploits ensemble learning and adversarial training. EAT promotes the diversity of the ensemble, encouraging each model in the ensemble to have different robustness features, and employs a self-transferring mechanism to make full use of the robustness statistics of the whole ensemble in the update of every single model. We evaluate the EAT method on three widely-used datasets with two popular model architectures. The results show that the proposed EAT method greatly outperforms the adaptations of defenses designed for deep classification models.
Decision Explanation: Original decision: REJECT The paper does not focus on prompt engineering for Large Language Models (LLMs) but instead addresses Deep Metric Learning (DML) and its robustness against adversarial attacks, with no mention of LLMs, text generation, or prompt manipulation.
Reasoning based on symbolic and parametric knowledge bases: a survey
arXiv ID: http://arxiv.org/abs/2501.01030v1
Abstract:
Reasoning is fundamental to human intelligence, and critical for problem-solving, decision-making, and critical thinking. Reasoning refers to drawing new conclusions based on existing knowledge, which can support various applications like clinical diagnosis, basic education, and financial analysis. Though a good number of surveys have been proposed for reviewing reasoning-related methods, none of them has systematically investigated these methods from the viewpoint of their dependent knowledge base. Both the scenarios to which the knowledge bases are applied and their storage formats are significantly different. Hence, investigating reasoning methods from the knowledge base perspective helps us better understand the challenges and future directions. To fill this gap, this paper first classifies the knowledge base into symbolic and parametric ones. The former explicitly stores information in human-readable symbols, and the latter implicitly encodes knowledge within parameters. Then, we provide a comprehensive overview of reasoning methods using symbolic knowledge bases, parametric knowledge bases, and both of them. Finally, we identify the future direction toward enhancing reasoning capabilities to bridge the gap between human and machine intelligence.
Decision Explanation: Original decision: REJECT The paper does not focus primarily on the engineering, design, or optimization of prompts for Large Language Models (LLMs), but rather on a survey of reasoning methods based on symbolic and parametric knowledge bases, without explicit mention of prompt engineering or LLMs.
MSWA: Refining Local Attention with Multi-Scale Window Attention
arXiv ID: http://arxiv.org/abs/2501.01039v1
Abstract:
Transformer-based LLMs have achieved exceptional performance across a wide range of NLP tasks. However, the standard self-attention mechanism suffers from quadratic time complexity and linearly increased cache size. Sliding window attention (SWA) solves this problem by restricting the attention range to a fixed-size local context window. Nevertheless, SWA employs a uniform window size for each head in each layer, making it inefficient in capturing context of varying scales. To mitigate this limitation, we propose Multi-Scale Window Attention (MSWA) which applies diverse window sizes across heads and layers in the Transformer. It not only allows for different window sizes among heads within the same layer but also progressively increases window size allocation from shallow to deep layers, thus enabling the model to capture contextual information with different lengths and distances. Experimental results on language modeling and common-sense reasoning tasks substantiate that MSWA outperforms traditional local attention in both effectiveness and efficiency.
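The allocation idea is simple to illustrate: give heads within a layer different window sizes and scale the base window with depth. The rule below is a hypothetical stand-in for the paper's actual allocation, expressed as boolean masks applied before the softmax.

```python
import torch

def window_sizes(num_heads, layer_idx, base=16):
    """Hypothetical allocation: heads get distinct windows, scaled by depth."""
    depth_factor = layer_idx + 1          # deeper layers see longer context
    return [base * depth_factor * (h + 1) for h in range(num_heads)]

def sliding_window_mask(seq_len, window):
    """Causal band mask: position i attends to positions (i - window, i]."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

# One boolean mask per head for layer 2 of a model, applied before softmax.
masks = torch.stack([sliding_window_mask(128, w)
                     for w in window_sizes(num_heads=4, layer_idx=2)])
```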
Decision Explanation: Original decision: REJECT The paper focuses primarily on refining the local attention mechanism in Transformer-based LLMs through Multi-Scale Window Attention, which pertains to the development of new LLM architecture/training methods, not prompt engineering for text-based interactions with LLMs.
Risks of Cultural Erasure in Large Language Models
arXiv ID: http://arxiv.org/abs/2501.01056v1
Abstract:
Large language models are increasingly being integrated into applications that shape the production and discovery of societal knowledge such as search, online education, and travel planning. As a result, language models will shape how people learn about, perceive and interact with global cultures, making it important to consider whose knowledge systems and perspectives are represented in models. Recognizing this importance, work in Machine Learning and NLP has increasingly focused on evaluating gaps in global cultural representational distribution within outputs. However, more work is needed on developing benchmarks for cross-cultural impacts of language models that stem from a nuanced sociologically-aware conceptualization of cultural impact or harm. We join this line of work, arguing for the need for metricizable evaluations of language technologies that interrogate and account for historical power inequities and differential impacts of representation on global cultures, particularly for cultures already under-represented in the digital corpora. We look at two concepts of erasure: omission, where cultures are not represented at all, and simplification, when cultural complexity is erased by presenting one-dimensional views of a rich culture. The former focuses on whether something is represented, and the latter on how it is represented. We focus our analysis on two task contexts with the potential to influence global cultural production. First, we probe representations that a language model produces about different places around the world when asked to describe these contexts. Second, we analyze the cultures represented in the travel recommendations produced by a set of language model applications. Our study shows ways in which the NLP community and application developers can begin to operationalize complex socio-cultural considerations into standard evaluations and benchmarks.
Decision Explanation: Original decision: REJECT The paper focuses on evaluating cultural representation and potential erasure in Large Language Models, but does not primarily investigate, analyze, or propose methods for improving LLM performance through the manipulation of textual input prompts, nor does it provide concrete examples of prompts and their impact on LLM output.
Graph Generative Pre-trained Transformer
arXiv ID: http://arxiv.org/abs/2501.01073v1
Abstract:
Graph generation is a critical task in numerous domains, including molecular design and social network analysis, due to its ability to model complex relationships and structured data. While most modern graph generative models utilize adjacency matrix representations, this work revisits an alternative approach that represents graphs as sequences of node set and edge set. We advocate for this approach due to its efficient encoding of graphs and propose a novel representation. Based on this representation, we introduce the Graph Generative Pre-trained Transformer (G2PT), an auto-regressive model that learns graph structures via next-token prediction. To further exploit G2PT's capabilities as a general-purpose foundation model, we explore fine-tuning strategies for two downstream applications: goal-oriented generation and graph property prediction. We conduct extensive experiments across multiple datasets. Results indicate that G2PT achieves superior generative performance on both generic graph and molecule datasets. Furthermore, G2PT exhibits strong adaptability and versatility in downstream tasks from molecular design to property prediction.
Decision Explanation: Original decision: REJECT The paper focuses on the development of a new graph generative model (Graph Generative Pre-trained Transformer) and its fine-tuning strategies for downstream applications, rather than primarily on the engineering, design, or optimization of prompts for Large Language Models (LLMs).
[BatStyler: Advancing Multi-category Style Generation for Source-free Domain Generalization](https://arxiv.org/abs/2501.01109v1)
arXiv ID: http://arxiv.org/abs/2501.01109v1
Abstract:
Source-Free Domain Generalization (SFDG) aims to develop a model that performs on unseen domains without relying on any source domains. However, the implementation remains constrained due to the unavailability of training data. Research on SFDG focuses on knowledge transfer of multi-modal models and style synthesis based on the joint space of multiple modalities, thus eliminating the dependency on source domain images. However, existing works primarily target multi-domain, few-category configurations, and performance on multi-domain, multi-category configurations is relatively poor. In addition, the efficiency of style synthesis also deteriorates in multi-category scenarios. How to efficiently synthesize sufficiently diverse data and apply it to multi-category configurations is a direction with greater practical value. In this paper, we propose a method called BatStyler, which improves the capability of style synthesis in multi-category scenarios. BatStyler consists of two modules: the Coarse Semantic Generation and Uniform Style Generation modules. The Coarse Semantic Generation module extracts coarse-grained semantics to prevent the compression of the space for style diversity learning in multi-category configurations, while the Uniform Style Generation module provides a template of styles that are uniformly distributed in space and implements parallel training. Extensive experiments demonstrate that our method exhibits comparable performance on few-category datasets, while surpassing state-of-the-art methods on multi-category datasets.
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria as it focuses on Source-Free Domain Generalization for multi-category style generation, not on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), and does not provide concrete examples of prompts impacting LLM output.
[MalCL: Leveraging GAN-Based Generative Replay to Combat Catastrophic Forgetting in Malware Classification](https://arxiv.org/abs/2501.01110v1)
arXiv ID: http://arxiv.org/abs/2501.01110v1
Abstract:
Continual Learning (CL) for malware classification tackles the rapidly evolving nature of malware threats and the frequent emergence of new types. Generative Replay (GR)-based CL systems utilize a generative model to produce synthetic versions of past data, which are then combined with new data to retrain the primary model. Traditional machine learning techniques in this domain often struggle with catastrophic forgetting, where a model's performance on old data degrades over time. In this paper, we introduce a GR-based CL system that employs Generative Adversarial Networks (GANs) with feature matching loss to generate high-quality malware samples. Additionally, we implement innovative selection schemes for replay samples based on the model's hidden representations. Our comprehensive evaluation across Windows and Android malware datasets in a class-incremental learning scenario -- where new classes are introduced continuously over multiple tasks -- demonstrates substantial performance improvements over previous methods. For example, our system achieves an average accuracy of 55% on Windows malware samples, significantly outperforming other GR-based models by 28%. This study provides practical insights for advancing GR-based malware classification systems. The implementation is available at https://github.com/MalwareReplayGAN/MalCL (the code will be made public upon presentation of the paper).
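Feature matching loss (Salimans et al., 2016), which the abstract credits for the quality of the generated samples, trains the generator to match the discriminator's mean intermediate features on real versus generated batches instead of maximizing discriminator confusion. A minimal sketch, where `feature_extractor` is a hypothetical intermediate discriminator layer:

```python
import torch

def feature_matching_loss(feature_extractor, real, fake):
    """||E[f(real)] - E[f(G(z))]||^2 over batch means of intermediate features."""
    f_real = feature_extractor(real).mean(dim=0)
    f_fake = feature_extractor(fake).mean(dim=0)
    return torch.sum((f_real - f_fake) ** 2)
```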
Decision Explanation: Original decision: REJECT The paper focuses on leveraging GAN-based Generative Replay for Continual Learning in malware classification, which does not meet the 'MUST' criteria of primarily focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts for improving LLM performance.
[Pruning-based Data Selection and Network Fusion for Efficient Deep Learning](https://arxiv.org/abs/2501.01118v1)
arXiv ID: http://arxiv.org/abs/2501.01118v1
Abstract:
Efficient data selection is essential for improving the training efficiency of deep neural networks and reducing the associated annotation costs. However, traditional methods tend to be computationally expensive, limiting their scalability and real-world applicability. We introduce PruneFuse, a novel method that combines pruning and network fusion to enhance data selection and accelerate network training. In PruneFuse, the original dense network is pruned to generate a smaller surrogate model that efficiently selects the most informative samples from the dataset. Once this iterative data selection selects sufficient samples, the insights learned from the pruned model are seamlessly integrated with the dense model through network fusion, providing an optimized initialization that accelerates training. Extensive experimentation on various datasets demonstrates that PruneFuse significantly reduces computational costs for data selection, achieves better performance than baselines, and accelerates the overall training process.
Decision Explanation: Original decision: REJECT The paper focuses on improving deep neural network training efficiency through pruning and network fusion, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.
[Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning](https://arxiv.org/abs/2501.01120v1)
arXiv ID: http://arxiv.org/abs/2501.01120v1
Abstract:
Multimodal learning with incomplete modality is practical and challenging. Recently, researchers have focused on enhancing the robustness of pre-trained MultiModal Transformers (MMTs) under missing modality conditions by applying learnable prompts. However, these prompt-based methods face several limitations: (1) incomplete modalities provide restricted modal cues for task-specific inference, (2) dummy imputation for missing content causes information loss and introduces noise, and (3) static prompts are instance-agnostic, offering limited knowledge for instances with various missing conditions. To address these issues, we propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework. RAGPT comprises three modules: (I) the multi-channel retriever, which identifies similar instances through a within-modality retrieval strategy, (II) the missing modality generator, which recovers missing information using retrieved contexts, and (III) the context-aware prompter, which captures contextual knowledge from relevant instances and generates dynamic prompts to largely enhance the MMT's robustness. Extensive experiments conducted on three real-world datasets show that RAGPT consistently outperforms all competitive baselines in handling incomplete modality problems. The code of our work and prompt-based baselines is available at https://github.com/Jian-Lang/RAGPT.
Decision Explanation: Original decision: REJECT The paper primarily focuses on enhancing the robustness of MultiModal Transformers (MMTs) for incomplete multimodal learning, with dynamic prompt tuning being a component of the proposed framework, rather than the core subject. The main focus is on addressing multimodal learning challenges, not specifically on prompt engineering for text-based interactions with Large Language Models (LLMs).
Deep Learning in Palmprint Recognition-A Comprehensive Survey
arXiv ID: http://arxiv.org/abs/2501.01166v1
Abstract:
Palmprint recognition has emerged as a prominent biometric technology, widely applied in diverse scenarios. Traditional handcrafted methods for palmprint recognition often fall short in representation capability, as they heavily depend on researchers' prior knowledge. Deep learning (DL) has been introduced to address this limitation, leveraging its remarkable successes across various domains. While existing surveys focus narrowly on specific tasks within palmprint recognition-often grounded in traditional methodologies-there remains a significant gap in comprehensive research exploring DL-based approaches across all facets of palmprint recognition. This paper bridges that gap by thoroughly reviewing recent advancements in DL-powered palmprint recognition. The paper systematically examines progress across key tasks, including region-of-interest segmentation, feature extraction, and security/privacy-oriented challenges. Beyond highlighting these advancements, the paper identifies current challenges and uncovers promising opportunities for future research. By consolidating state-of-the-art progress, this review serves as a valuable resource for researchers, enabling them to stay abreast of cutting-edge technologies and drive innovation in palmprint recognition.
Decision Explanation: Original decision: REJECT The paper focuses on deep learning in palmprint recognition, a biometric technology unrelated to Large Language Models (LLMs) or prompt engineering for text-based interactions, thus failing to meet the primary 'MUST' criteria.
[Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets](https://arxiv.org/abs/2501.01168v1)
arXiv ID: http://arxiv.org/abs/2501.01168v1
Abstract:
The multifaceted challenge of accurately measuring gender stereotypical bias in language models is akin to discerning different segments of a broader, unseen entity. This short paper primarily focuses on intrinsic bias mitigation and measurement strategies for language models, building on prior research that demonstrates a lack of correlation between intrinsic and extrinsic approaches. We delve deeper into intrinsic measurements, identifying inconsistencies and suggesting that these benchmarks may reflect different facets of gender stereotype. Our methodology involves analyzing data distributions across datasets and integrating gender stereotype components informed by social psychology. By adjusting the distribution of two datasets, we achieve a better alignment of outcomes. Our findings underscore the complexity of gender stereotyping in language models and point to new directions for developing more refined techniques to detect and reduce bias.
Decision Explanation: Original decision: REJECT The paper primarily focuses on measuring and mitigating gender stereotypical bias in language models, rather than the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), and does not provide concrete examples of prompts and their impact on LLM output.
L3D-Pose: Lifting Pose for 3D Avatars from a Single Camera in the Wild
arXiv ID: http://arxiv.org/abs/2501.01174v1
Abstract:
While 2D pose estimation has advanced our ability to interpret body movements in animals and primates, it is limited by the lack of depth information, constraining its application range. 3D pose estimation provides a more comprehensive solution by incorporating spatial depth, yet creating extensive 3D pose datasets for animals is challenging due to their dynamic and unpredictable behaviours in natural settings. To address this, we propose a hybrid approach that utilizes rigged avatars and a pipeline for generating synthetic datasets to acquire the necessary 3D annotations for training. Our method introduces a simple attention-based MLP network for converting 2D poses to 3D, designed to be independent of the input image to ensure scalability for poses in natural environments. Additionally, we identify that existing anatomical keypoint detectors are insufficient for accurate pose retargeting onto arbitrary avatars. To overcome this, we present a lookup table based on a deep pose estimation method, using a synthetic collection of diverse actions performed by rigged avatars. Our experiments demonstrate the effectiveness and efficiency of this lookup table-based retargeting approach. Overall, we propose a comprehensive framework with systematically synthesized datasets for lifting poses from 2D to 3D and then utilize this to re-target motion from wild settings onto arbitrary avatars.
Decision Explanation: Original decision: REJECT The paper focuses on 3D pose estimation for avatars using computer vision and machine learning techniques, with no primary emphasis on the engineering, design, or optimization of prompts for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.
Data Augmentation Techniques for Chinese Disease Name Normalization
arXiv ID: http://arxiv.org/abs/2501.01195v1
Abstract:
Disease name normalization is an important task in the medical domain. It classifies disease names written in various formats into standardized names, serving as a fundamental component in smart healthcare systems for various disease-related functions. Nevertheless, the most significant obstacle to existing disease name normalization systems is the severe shortage of training data. Consequently, we present a novel data augmentation approach that includes a series of data augmentation techniques and some supporting modules to help mitigate the problem. Through extensive experimentation, we illustrate that our proposed approach exhibits significant performance improvements across various baseline models and training objectives, particularly in scenarios with limited training data.
Decision Explanation: Original decision: REJECT The paper primarily focuses on medical domain (disease name normalization) and data augmentation techniques for training models, rather than prompt engineering for Large Language Models (LLMs) and manipulating textual input prompts to improve LLM performance.
[A redescription mining framework for post-hoc explaining and relating deep learning models](https://arxiv.org/abs/2501.01209v1)
arXiv ID: http://arxiv.org/abs/2501.01209v1
Abstract:
Deep learning models (DLMs) achieve increasingly high performance both on structured and unstructured data. They have significantly extended the applicability of machine learning to various domains. Their success in making predictions, detecting patterns and generating new data has made a significant impact on science and industry. Despite these accomplishments, DLMs are difficult to explain because of their enormous size. In this work, we propose a novel framework for post-hoc explaining and relating DLMs using redescriptions. The framework allows cohort analysis of arbitrary DLMs by identifying statistically significant redescriptions of neuron activations. It allows coupling neurons to a set of target labels or sets of descriptive attributes, relating layers within a single DLM or associating different DLMs. The proposed framework is independent of the artificial neural network architecture and can work with more complex target labels (e.g. multi-label or multi-target scenario). Additionally, it can emulate both pedagogical and decompositional approaches to rule extraction. The aforementioned properties of the proposed framework can increase explainability and interpretability of arbitrary DLMs by providing different information compared to existing explainable-AI approaches.
Decision Explanation: Original decision: REJECT The paper focuses on explaining and relating deep learning models (DLMs) in general, using redescriptions, without a specific emphasis on Large Language Models (LLMs), prompt engineering, or the manipulation of textual input prompts to improve LLM performance.
[An Efficient Attention Mechanism for Sequential Recommendation Tasks: HydraRec](https://arxiv.org/abs/2501.01242v1)
arXiv ID: http://arxiv.org/abs/2501.01242v1
Abstract:
Transformer based models are increasingly being used in various domains including recommender systems (RS). Pretrained transformer models such as BERT have shown good performance at language modelling. With the greater ability to model sequential tasks, variants of Encoder-only models (like BERT4Rec, SASRec etc.) have found success in sequential RS problems. Computing dot-product attention in traditional transformer models has quadratic complexity in sequence length. This is a bigger problem with RS because unlike language models, new items are added to the catalogue every day. User buying history is a dynamic sequence which depends on multiple factors. Recently, various linear attention models have tried to solve this problem by making the model linear in sequence length (token dimensions). Hydra attention is one such linear complexity model proposed for vision transformers which reduces the complexity of attention for both the number of tokens as well as model embedding dimensions. Building on the idea of Hydra attention, we introduce an efficient Transformer based Sequential RS (HydraRec) which significantly improves theoretical complexity of computing attention for longer sequences and bigger datasets while preserving the temporal context. Extensive experiments are conducted to evaluate other linear transformer-based RS models and compared with HydraRec across various evaluation metrics. HydraRec outperforms other linear attention-based models as well as dot-product based attention models when used with causal masking for sequential recommendation next item prediction tasks. For bi-directional models its performance is comparable to the BERT4Rec model with an improvement in running time.
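Hydra attention, which HydraRec adapts, drops the softmax: with L2-normalized queries and keys, all tokens share a single key-value summary, so cost grows linearly in sequence length. A minimal sketch, with a cumulative-sum variant standing in for the causal masking the abstract mentions for next-item prediction:

```python
import torch
import torch.nn.functional as F

def hydra_attention(q, k, v, causal=False):
    """q, k, v: (batch, seq, dim). Softmax-free, linear in seq length."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    if causal:
        kv = torch.cumsum(k * v, dim=1)        # prefix summary per position
    else:
        kv = (k * v).sum(dim=1, keepdim=True)  # one global summary
    return q * kv
```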
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing an efficient attention mechanism for sequential recommendation tasks using transformer-based models, rather than engineering or optimizing prompts specifically for Large Language Models (LLMs). It does not investigate methods for improving LLM performance through the manipulation of textual input prompts, nor does it provide concrete examples of prompts and their impact on LLM output.
Stealthy Backdoor Attack to Real-world Models in Android Apps
arXiv ID: http://arxiv.org/abs/2501.01263v1
Abstract:
Powered by their superior performance, deep neural networks (DNNs) have found widespread applications across various domains. Many deep learning (DL) models are now embedded in mobile apps, making them more accessible to end users through on-device DL. However, deploying on-device DL to users' smartphones simultaneously introduces several security threats. One primary threat is backdoor attacks. Extensive research has explored backdoor attacks for several years and has proposed numerous attack approaches. However, few studies have investigated backdoor attacks on DL models deployed in the real world, or they have shown obvious deficiencies in effectiveness and stealthiness. In this work, we explore more effective and stealthy backdoor attacks on real-world DL models extracted from mobile apps. Our main justification is that imperceptible and sample-specific backdoor triggers generated by DNN-based steganography can enhance the efficacy of backdoor attacks on real-world models. We first confirm the effectiveness of steganography-based backdoor attacks on four state-of-the-art DNN models. Subsequently, we systematically evaluate and analyze the stealthiness of the attacks to ensure they are difficult to perceive. Finally, we implement the backdoor attacks on real-world models and compare our approach with three baseline methods. We collect 38,387 mobile apps, extract 89 DL models from them, and analyze these models to obtain the prerequisite model information for the attacks. After identifying the target models, our approach achieves an average of 12.50% higher attack success rate than DeepPayload while better maintaining the normal performance of the models. Extensive experimental results demonstrate that our method enables more effective, robust, and stealthy backdoor attacks on real-world models.
Decision Explanation: Original decision: REJECT The paper focuses on backdoor attacks in deep neural networks deployed in Android apps, not on the engineering, design, or optimization of prompts for Large Language Models (LLMs), failing to meet all the primary 'MUST' criteria.
PIMAEX: Multi-Agent Exploration through Peer Incentivization
arXiv ID: http://arxiv.org/abs/2501.01266v1
Abstract:
While exploration in single-agent reinforcement learning has been studied extensively in recent years, considerably less work has focused on its counterpart in multi-agent reinforcement learning. To address this issue, this work proposes a peer-incentivized reward function inspired by previous research on intrinsic curiosity and influence-based rewards. The PIMAEX reward, short for Peer-Incentivized Multi-Agent Exploration, aims to improve exploration in the multi-agent setting by encouraging agents to exert influence over each other to increase the likelihood of encountering novel states. We evaluate the PIMAEX reward in conjunction with PIMAEX-Communication, a multi-agent training algorithm that employs a communication channel for agents to influence one another. The evaluation is conducted in the Consume/Explore environment, a partially observable environment with deceptive rewards, specifically designed to challenge the exploration vs. exploitation dilemma and the credit-assignment problem. The results empirically demonstrate that agents using the PIMAEX reward with PIMAEX-Communication outperform those that do not.
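As a loose illustration (not the paper's formula), a peer-incentivized reward can be written as each agent's extrinsic reward plus a bonus proportional to its estimated influence on its peers; `influence` here is a hypothetical estimator, such as one derived from learned models of peer policies.

```python
def pimaex_style_reward(env_rewards, influence, alpha=0.1):
    """Augment each agent's extrinsic reward with a peer-influence bonus."""
    n = len(env_rewards)
    return [env_rewards[i]
            + alpha * sum(influence(i, j) for j in range(n) if j != i)
            for i in range(n)]

# Toy usage with a hypothetical constant influence estimate.
print(pimaex_style_reward([1.0, 0.5, 0.0], influence=lambda i, j: 0.2))
```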
Decision Explanation: Original decision: REJECT The paper focuses on multi-agent reinforcement learning, proposing a reward function and training algorithm for improved exploration, with no primary emphasis on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), thus failing to meet the core subject requirement.
[NeutraSum: A Language Model can help a Balanced Media Diet by Neutralizing News Summaries](https://arxiv.org/abs/2501.01284v1)
arXiv ID: http://arxiv.org/abs/2501.01284v1
Abstract:
Media bias in news articles arises from the political polarisation of media outlets, which can reinforce societal stereotypes and beliefs. Reporting on the same event often varies significantly between outlets, reflecting their political leanings through polarised language and focus. Although previous studies have attempted to generate bias-free summaries from multi-perspective news articles, they have not effectively addressed the challenge of mitigating inherent media bias. To address this gap, we propose NeutraSum, a novel framework that integrates two neutrality losses to adjust the semantic space of generated summaries, thus minimising media bias. These losses, designed to balance the semantic distances across polarised inputs and ensure alignment with expert-written summaries, guide the generation of neutral and factually rich summaries. To evaluate media bias, we employ the political compass test, which maps political leanings based on economic and social dimensions. Experimental results on the Allsides dataset demonstrate that NeutraSum not only improves summarisation performance but also achieves significant reductions in media bias, offering a promising approach for neutral news summarisation.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a framework (NeutraSum) to mitigate media bias in news summaries, rather than engineer or optimize prompts specifically for Large Language Models (LLMs). While it involves generated summaries, the core subject is not prompt engineering for text-based interactions with LLMs, but rather the reduction of media bias.
Citations and Trust in LLM Generated Responses
arXiv ID: http://arxiv.org/abs/2501.01303v1
Abstract:
Question answering systems are rapidly advancing, but their opaque nature may impact user trust. We explored trust through an anti-monitoring framework, where trust is predicted to be correlated with presence of citations and inversely related to checking citations. We tested this hypothesis with a live question-answering experiment that presented text responses generated using a commercial Chatbot along with varying citations (zero, one, or five), both relevant and random, and recorded if participants checked the citations and their self-reported trust in the generated responses. We found a significant increase in trust when citations were present, a result that held true even when the citations were random; we also found a significant decrease in trust when participants checked the citations. These results highlight the importance of citations in enhancing trust in AI-generated content.
Decision Explanation: Original decision: REJECT The paper's primary focus is on enhancing user trust in AI-generated content through citations, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), and does not demonstrate the impact of textual input prompts on LLM output.
[Multi-Head Explainer: A General Framework to Improve Explainability in CNNs and Transformers](https://arxiv.org/abs/2501.01311v2)
arXiv ID: http://arxiv.org/abs/2501.01311v2
Abstract:
In this study, we introduce the Multi-Head Explainer (MHEX), a versatile and modular framework that enhances both the explainability and accuracy of Convolutional Neural Networks (CNNs) and Transformer-based models. MHEX consists of three core components: an Attention Gate that dynamically highlights task-relevant features, Deep Supervision that guides early layers to capture fine-grained details pertinent to the target class, and an Equivalent Matrix that unifies refined local and global representations to generate comprehensive saliency maps. Our approach demonstrates superior compatibility, enabling effortless integration into existing residual networks like ResNet and Transformer architectures such as BERT with minimal modifications. Extensive experiments on benchmark datasets in medical imaging and text classification show that MHEX not only improves classification accuracy but also produces highly interpretable and detailed saliency scores.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a framework to enhance explainability and accuracy in CNNs and Transformers, rather than engineering prompts for Large Language Models (LLMs). Prompt engineering for LLMs is not the central concern, and the paper falls under excluded categories (development of new model components and medical subjects).
[Understanding Difficult-to-learn Examples in Contrastive Learning: A Theoretical Framework for Spectral Contrastive Learning](https://arxiv.org/abs/2501.01317v1)
arXiv ID: http://arxiv.org/abs/2501.01317v1
Abstract:
Unsupervised contrastive learning has shown significant performance improvements in recent years, often approaching or even rivaling supervised learning in various tasks. However, its learning mechanism is fundamentally different from that of supervised learning. Previous works have shown that difficult-to-learn examples (well-recognized in supervised learning as examples around the decision boundary), which are essential in supervised learning, contribute minimally in unsupervised settings. In this paper, perhaps surprisingly, we find that the direct removal of difficult-to-learn examples, although reduces the sample size, can boost the downstream classification performance of contrastive learning. To uncover the reasons behind this, we develop a theoretical framework modeling the similarity between different pairs of samples. Guided by this theoretical framework, we conduct a thorough theoretical analysis revealing that the presence of difficult-to-learn examples negatively affects the generalization of contrastive learning. Furthermore, we demonstrate that the removal of these examples, and techniques such as margin tuning and temperature scaling can enhance its generalization bounds, thereby improving performance. Empirically, we propose a simple and efficient mechanism for selecting difficult-to-learn examples and validate the effectiveness of the aforementioned methods, which substantiates the reliability of our proposed theoretical framework.
Decision Explanation: Original decision: REJECT The paper primarily focuses on unsupervised contrastive learning, its theoretical framework, and improving downstream classification performance, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thus failing to meet all 'MUST' criteria.
[DeepFilter: An Instrumental Baseline for Accurate and Efficient Process Monitoring](https://arxiv.org/abs/2501.01342v1)
arXiv ID: http://arxiv.org/abs/2501.01342v1
Abstract:
Effective process monitoring is increasingly vital in industrial automation for ensuring operational safety, necessitating both high accuracy and efficiency. Although Transformers have demonstrated success in various fields, their canonical form based on the self-attention mechanism is inadequate for process monitoring due to two primary limitations: (1) the step-wise correlations captured by the self-attention mechanism struggle to yield discriminative patterns in monitoring logs, because individual steps lack semantics, thus compromising accuracy; (2) the quadratic computational complexity of self-attention hampers efficiency. To address these issues, we propose DeepFilter, a Transformer-style framework for process monitoring. The core innovation is an efficient filtering layer that excels at capturing long-term and periodic patterns with reduced complexity. Equipped with the global filtering layer, DeepFilter enhances both accuracy and efficiency, meeting the stringent demands of process monitoring. Experimental results on real-world process monitoring datasets validate DeepFilter's superiority in terms of accuracy and efficiency compared to existing state-of-the-art models.
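The abstract does not spell out the filtering layer, but learnable frequency-domain filters (in the spirit of FNet/GFNet-style token mixers) are a standard way to capture long-term and periodic patterns at sub-quadratic cost, so a hypothetical sketch of that family, not DeepFilter itself, may help fix ideas:

```python
import torch

class GlobalFilter(torch.nn.Module):
    """Learnable frequency-domain filter over the sequence axis, O(n log n)."""
    def __init__(self, seq_len, dim):
        super().__init__()
        self.weight = torch.nn.Parameter(
            torch.randn(seq_len // 2 + 1, dim, dtype=torch.cfloat) * 0.02
        )

    def forward(self, x):                      # x: (batch, seq, dim)
        f = torch.fft.rfft(x, dim=1)           # to frequency domain
        return torch.fft.irfft(f * self.weight, n=x.size(1), dim=1)
```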
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a new Transformer-style framework (DeepFilter) for process monitoring in industrial automation, addressing limitations of self-attention mechanisms, which falls under 'development of new LLM architectures or training methods' and does not centrally concern prompt engineering for text-based interactions with LLMs.
[A Unified Hyperparameter Optimization Pipeline for Transformer-Based Time Series Forecasting Models](https://arxiv.org/abs/2501.01394v1)
arXiv ID: http://arxiv.org/abs/2501.01394v1
Abstract:
Transformer-based models for time series forecasting (TSF) have attracted significant attention in recent years due to their effectiveness and versatility. However, these models often require extensive hyperparameter optimization (HPO) to achieve the best possible performance, and a unified pipeline for HPO in transformer-based TSF remains lacking. In this paper, we present one such pipeline and conduct extensive experiments on several state-of-the-art (SOTA) transformer-based TSF models. These experiments are conducted on standard benchmark datasets to evaluate and compare the performance of different models, generating practical insights and examples. Our pipeline is generalizable beyond transformer-based architectures and can be applied to other SOTA models, such as Mamba and TimeMixer, as demonstrated in our experiments. The goal of this work is to provide valuable guidance to both industry practitioners and academic researchers in efficiently identifying optimal hyperparameters suited to their specific domain applications. The code and complete experimental results are available on GitHub.
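Such a pipeline typically wraps a search library around a train-and-evaluate callback. A minimal sketch with Optuna, where both the search space and `train_and_validate` are hypothetical placeholders rather than the paper's actual pipeline:

```python
import optuna

def train_and_validate(params):
    # Hypothetical stand-in for training a forecasting model and returning
    # its validation loss; replace with a real train/eval loop.
    return (params["lr"] - 1e-3) ** 2 + params["dropout"] * 0.1

def objective(trial):
    params = {
        "lr": trial.suggest_float("lr", 1e-5, 1e-2, log=True),
        "d_model": trial.suggest_categorical("d_model", [64, 128, 256]),
        "n_layers": trial.suggest_int("n_layers", 1, 4),
        "dropout": trial.suggest_float("dropout", 0.0, 0.5),
    }
    return train_and_validate(params)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```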
Decision Explanation: Original decision: REJECT The paper focuses primarily on hyperparameter optimization for transformer-based time series forecasting models, rather than on the engineering, design, or optimization of prompts for Large Language Models (LLMs), and does not provide examples of prompts or their impact on LLM output.
On Unifying Video Generation and Camera Pose Estimation
arXiv ID: http://arxiv.org/abs/2501.01409v1
Abstract:
Inspired by the emergent 3D capabilities in image generators, we explore whether video generators similarly exhibit 3D awareness. Using structure-from-motion (SfM) as a benchmark for 3D tasks, we investigate if intermediate features from OpenSora, a video generation model, can support camera pose estimation. We first examine native 3D awareness in video generation features by routing raw intermediate outputs to SfM-prediction modules like DUSt3R. Then, we explore the impact of fine-tuning on camera pose estimation to enhance 3D awareness. Results indicate that while video generator features have limited inherent 3D awareness, task-specific supervision significantly boosts their accuracy for camera pose estimation, resulting in competitive performance. The proposed unified model, named JOG3R, produces camera pose estimates with competitive quality without degrading video generation quality.
Decision Explanation: Original decision: REJECT The paper primarily focuses on video generation and camera pose estimation, which falls under image/video generation, not text generation driven by Large Language Models (LLMs), thereby violating the 'Papers MUST NOT' criterion #2.
Balance-aware Sequence Sampling Makes Multi-modal Learning Better
arXiv ID: http://arxiv.org/abs/2501.01470v1
Abstract:
To address the modality imbalance caused by data heterogeneity, existing multi-modal learning (MML) approaches primarily focus on balancing this difference from the perspective of optimization objectives. However, almost all existing methods ignore the impact of sample sequences, i.e., an inappropriate training order tends to trigger learning bias in the model, further exacerbating modality imbalance. In this paper, we propose Balance-aware Sequence Sampling (BSS) to enhance the robustness of MML. Specifically, we first define a multi-perspective measurer to evaluate the balance degree of each sample. Via the evaluation, we employ a heuristic scheduler based on curriculum learning (CL) that incrementally provides training subsets, progressing from balanced to imbalanced samples to rebalance MML. Moreover, considering that sample balance may evolve as the model capability increases, we propose a learning-based probabilistic sampling method to dynamically update the training sequences at the epoch level, further improving MML performance. Extensive experiments on widely used datasets demonstrate the superiority of our method compared with state-of-the-art (SOTA) MML approaches.
Decision Explanation: Original decision: REJECT The paper primarily focuses on enhancing multi-modal learning (MML) through balance-aware sequence sampling, ignoring the core subject of prompt engineering for text-based interactions with Large Language Models (LLMs), and instead addressing modality imbalance in MML without mentioning prompts or LLM-specific text generation.
[Augmented Contrastive Clustering with Uncertainty-Aware Prototyping for Time Series Test Time Adaptation](https://arxiv.org/abs/2501.01472v1)
arXiv ID: http://arxiv.org/abs/2501.01472v1
Abstract:
Test-time adaptation aims to adapt pre-trained deep neural networks using solely online unlabelled test data during inference. Although TTA has shown promise in visual applications, its potential in time series contexts remains largely unexplored. Existing TTA methods, originally designed for visual tasks, may not effectively handle the complex temporal dynamics of real-world time series data, resulting in suboptimal adaptation performance. To address this gap, we propose Augmented Contrastive Clustering with Uncertainty-aware Prototyping (ACCUP), a straightforward yet effective TTA method for time series data. Initially, our approach employs augmentation ensemble on the time series data to capture diverse temporal information and variations, incorporating uncertainty-aware prototypes to distill essential characteristics. Additionally, we introduce an entropy comparison scheme to selectively acquire more confident predictions, enhancing the reliability of pseudo labels. Furthermore, we utilize augmented contrastive clustering to enhance feature discriminability and mitigate error accumulation from noisy pseudo labels, promoting cohesive clustering within the same class while facilitating clear separation between different classes. Extensive experiments conducted on three real-world time series datasets and an additional visual dataset demonstrate the effectiveness and generalization potential of the proposed method, advancing the underexplored realm of TTA for time series data.
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria as it focuses on test-time adaptation for time series data using deep neural networks, without any mention of Large Language Models (LLMs), prompt engineering, or manipulation of textual input prompts for improving LLM performance.
Unraveling Indirect In-Context Learning Using Influence Functions
arXiv ID: http://arxiv.org/abs/2501.01473v1
Abstract:
This work introduces a novel paradigm for generalized In-Context Learning (ICL), termed Indirect In-Context Learning. In Indirect ICL, we explore demonstration selection strategies tailored for two distinct real-world scenarios: Mixture of Tasks and Noisy Demonstrations. We systematically evaluate the effectiveness of Influence Functions (IFs) as a selection tool for these settings, highlighting the potential for IFs to better capture the informativeness of examples within the demonstration pool. For the Mixture of Tasks setting, demonstrations are drawn from 28 diverse tasks, including MMLU, BigBench, StrategyQA, and CommonsenseQA. We demonstrate that combining BertScore-Recall (BSR) with an IF surrogate model can significantly improve performance, leading to average absolute accuracy gains of 0.37% and 1.45% for 3-shot and 5-shot setups when compared to traditional ICL metrics. In the Noisy Demonstrations setting, we examine scenarios where demonstrations might be mislabeled. Our experiments show that reweighting traditional ICL selectors (BSR and Cosine Similarity) with IF-based selectors boosts accuracy by an average of 2.90% for Cosine Similarity and 2.94% for BSR on noisy GLUE benchmarks. In sum, we propose a robust framework for demonstration selection that generalizes beyond traditional ICL, offering valuable insights into the role of IFs for Indirect ICL.
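The reweighting the abstract describes can be sketched as a simple score mix between a traditional selector (BSR or cosine similarity) and an influence-function surrogate; the min-max normalization and mixing weight `alpha` below are hypothetical choices, not the paper's exact rule.

```python
import numpy as np

def select_demonstrations(bsr_scores, if_scores, k=5, alpha=0.5):
    """Mix min-max-normalized selector and influence scores; return top-k indices."""
    def norm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-8)
    combined = alpha * norm(bsr_scores) + (1 - alpha) * norm(if_scores)
    return np.argsort(combined)[::-1][:k]

picked = select_demonstrations(np.random.rand(100), np.random.rand(100), k=3)
```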
Decision Explanation: Original decision: REJECT The paper primarily focuses on demonstration selection strategies using Influence Functions for In-Context Learning, rather than engineering or optimizing prompts specifically for Large Language Models (LLMs), failing to meet the core subject requirement.
[A Survey of Deep Learning Methods in Protein Bioinformatics and its Impact on Protein Design](https://arxiv.org/abs/2501.01477v1)
arXiv ID: http://arxiv.org/abs/2501.01477v1
Abstract:
Proteins are sequences of amino acids that serve as the basic building blocks of living organisms. Despite rapidly growing databases documenting structural and functional information for various protein sequences, our understanding of proteins remains limited because of the large possible sequence space and the complex inter- and intra-molecular forces. Deep learning, which is characterized by its ability to learn relevant features directly from large datasets, has demonstrated remarkable performance in fields such as computer vision and natural language processing. It has also been increasingly applied in recent years to the data-rich domain of protein sequences with great success, most notably with Alphafold2's breakout performance in the protein structure prediction. The performance improvements achieved by deep learning unlocks new possibilities in the field of protein bioinformatics, including protein design, one of the most difficult but useful tasks. In this paper, we broadly categorize problems in protein bioinformatics into three main categories: 1) structural prediction, 2) functional prediction, and 3) protein design, and review the progress achieved from using deep learning methodologies in each of them. We expand on the main challenges of the protein design problem and highlight how advances in structural and functional prediction have directly contributed to design tasks. Finally, we conclude by identifying important topics and future research directions.
Decision Explanation: Original decision: REJECT The paper focuses primarily on deep learning methods in protein bioinformatics and protein design, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thus failing to meet all 'MUST' criteria.
[Drift2Matrix: Kernel-Induced Self Representation for Concept Drift Adaptation in Co-evolving Time Series](https://arxiv.org/abs/2501.01480v2)
arXiv ID: http://arxiv.org/abs/2501.01480v2
Abstract:
In the realm of time series analysis, tackling the phenomenon of concept drift poses a significant challenge. Concept drift, characterized by the evolving statistical properties of time series data, affects the reliability and accuracy of conventional analysis models. This is particularly evident in co-evolving scenarios where interactions among variables are crucial. This paper presents Drift2Matrix, a novel framework that leverages kernel-induced self-representation for adaptive responses to concept drift in time series. Drift2Matrix employs a kernel-based learning mechanism to generate a representation matrix, encapsulating the inherent dynamics of co-evolving time series. This matrix serves as a key tool for identification of and adaptation to concept drift by observing its temporal variations. Furthermore, Drift2Matrix effectively identifies prevailing patterns and offers insights into emerging trends through pattern evolution analysis. Our empirical evaluation of Drift2Matrix across various datasets demonstrates its effectiveness in handling the complexities of concept drift. This approach introduces a novel perspective in the theoretical domain of co-evolving time series analysis, enhancing adaptability and accuracy in the face of dynamic data environments.
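Kernel-induced self-representation generally means expressing each column of a kernel matrix as a combination of the others, with drift surfacing as temporal change in the coefficient matrix. A generic ridge-regularized sketch (the paper's formulation may differ): minimizing ||K - KC||_F^2 + lam * ||C||_F^2 has the closed form C = (K^T K + lam I)^{-1} K^T K.

```python
import numpy as np

def self_representation(K, lam=1.0):
    """Closed-form ridge solution of min ||K - K C||_F^2 + lam * ||C||_F^2."""
    n = K.shape[0]
    return np.linalg.solve(K.T @ K + lam * np.eye(n), K.T @ K)

# Toy usage on an RBF kernel of co-evolving series windows (placeholder data).
X = np.random.randn(50, 3)
K = np.exp(-np.linalg.norm(X[:, None] - X[None], axis=-1) ** 2)
C_t = self_representation(K)  # compare C_t across windows to flag drift
```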
Decision Explanation: Original decision: REJECT The paper does not focus on prompt engineering for Large Language Models (LLMs). Instead, it presents a framework (Drift2Matrix) for adapting to concept drift in co-evolving time series analysis, which is unrelated to the engineering, design, or optimization of prompts for LLMs.
[BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery](https://arxiv.org/abs/2501.01540v1)
arXiv ID: http://arxiv.org/abs/2501.01540v1
Abstract:
Understanding the world and explaining it with scientific theories is a central aspiration of artificial intelligence research. Proposing theories, designing experiments to test them, and then revising them based on data are fundamental to scientific discovery. Despite the significant promise of LLM-based scientific agents, no benchmarks systematically test LLM's ability to propose scientific models, collect experimental data, and revise them in light of new data. We introduce BoxingGym, a benchmark with 10 environments for systematically evaluating both experimental design (e.g. collecting data to test a scientific theory) and model discovery (e.g. proposing and revising scientific theories). To enable tractable and quantitative evaluation, we implement each environment as a generative probabilistic model with which a scientific agent can run interactive experiments. These probabilistic models are drawn from various real-world scientific domains ranging from psychology to ecology. To quantitatively evaluate a scientific agent's ability to collect informative experimental data, we compute the expected information gain (EIG), an information-theoretic quantity which measures how much an experiment reduces uncertainty about the parameters of a generative model. A good scientific theory is a concise and predictive explanation. Therefore, to quantitatively evaluate model discovery, we ask a scientific agent to explain their model and then assess whether this explanation enables another scientific agent to make reliable predictions about this environment. In addition to this explanation-based evaluation, we compute standard model evaluation metrics such as prediction errors. We find that current LLMs, such as GPT-4o, struggle with both experimental design and model discovery. We find that augmenting the LLM-based agent with an explicit statistical model does not reliably improve these results.
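The expected information gain that scores experiments has a standard nested Monte Carlo estimator, EIG(d) = E[log p(y|theta,d) - log p(y|d)], usable whenever the environment exposes prior sampling and a tractable likelihood. A sketch with a toy Beta-Bernoulli experiment (all callables are hypothetical stand-ins for a BoxingGym environment):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def eig_nmc(sample_prior, sample_likelihood, likelihood, d, n=200, m=200):
    """Nested Monte Carlo estimate of EIG(d) = E[log p(y|theta,d) - log p(y|d)]."""
    total = 0.0
    for _ in range(n):
        theta = sample_prior()
        y = sample_likelihood(theta, d)
        log_cond = np.log(likelihood(y, theta, d))
        marginal = np.mean([likelihood(y, sample_prior(), d) for _ in range(m)])
        total += log_cond - np.log(marginal)
    return total / n

# Toy experiment: d = number of coin flips observed, Beta(1, 1) prior.
eig = eig_nmc(
    sample_prior=lambda: rng.beta(1, 1),
    sample_likelihood=lambda t, d: rng.binomial(d, t),
    likelihood=lambda y, t, d: math.comb(d, y) * t**y * (1 - t) ** (d - y),
    d=5,
)
print(eig)
```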
Decision Explanation: Original decision: REJECT The paper's primary focus is on benchmarking scientific discovery and experimentation using LLMs, rather than specifically engineering or optimizing prompts for Large Language Models (LLMs). While LLMs are utilized, the core subject is the evaluation of scientific agents, not prompt engineering for text-based interactions with LLMs.
[Constructing and explaining machine learning models for chemistry: example of the exploration and design of boron-based Lewis acids](https://arxiv.org/abs/2501.01576v2)
arXiv ID: http://arxiv.org/abs/2501.01576v2
Abstract:
The integration of machine learning (ML) into chemistry offers transformative potential in the design of molecules with targeted properties. However, the focus has often been on creating highly efficient predictive models, sometimes at the expense of interpretability. In this study, we leverage explainable AI techniques to explore the rational design of boron-based Lewis acids, which play a pivotal role in organic reactions due to their electron-accepting properties. Using Fluoride Ion Affinity as a proxy for Lewis acidity, we developed interpretable ML models based on chemically meaningful descriptors, including ab initio computed features and substituent-based parameters derived from the Hammett linear free-energy relationship. By constraining the chemical space to well-defined molecular scaffolds, we achieved highly accurate predictions (mean absolute error < 6 kJ/mol), surpassing conventional black-box deep learning models in low-data regimes. Interpretability analyses of the models shed light on the origin of Lewis acidity in these compounds and identified actionable levers to modulate it through the nature and positioning of substituents on the molecular scaffold. This work bridges ML and chemists' way of thinking, demonstrating how explainable models can inspire molecular design and enhance scientific understanding of chemical reactivity.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing interpretable machine learning models for chemistry applications, with no apparent emphasis on prompt engineering, Large Language Models (LLMs), or the manipulation of textual input prompts to improve LLM performance.
[(WhyPHI) Fine-Tuning PHI-3 for Multiple-Choice Question Answering: Methodology, Results, and Challenges](https://arxiv.org/abs/2501.01588v1)
arXiv ID: http://arxiv.org/abs/2501.01588v1
Abstract:
Large Language Models (LLMs) have become essential tools across various domains due to their impressive capabilities in understanding and generating human-like text. The ability to accurately answer multiple-choice questions (MCQs) holds significant value in education, particularly in automated tutoring systems and assessment platforms. However, adapting LLMs to handle MCQ tasks effectively remains challenging due to hallucinations and unclear prompts. This work explores the potential of Microsoft's PHI-3 (Abdin et al., 2024), a compact yet efficient LLM, for MCQ answering. Our contributions include fine-tuning the model on the TruthfulQA dataset, designing optimized prompts to enhance model performance, and evaluating using perplexity and traditional metrics like accuracy and F1 score. Results show a remarkable improvement in PHI-3.5's MCQ handling post-fine-tuning, with perplexity decreasing from 4.68 to 2.27, and accuracy rising from 62% to 90.8%. This research underlines the importance of efficient models in adaptive learning systems and educational assessments, paving the way for broader integration into the classroom, particularly in fields like test preparation, student feedback, and personalized learning.
Decision Explanation: Original decision: REJECT The paper primarily focuses on fine-tuning a Large Language Model (PHI-3) for multiple-choice question answering, which falls under the development of new training methods, contravening the 'MUST NOT' criterion 1. Although it mentions designing optimized prompts, this is secondary to the fine-tuning methodology, not meeting the core subject requirement of prompt engineering being the central focus.
[Prism: Mining Task-aware Domains in Non-i.i.d. IMU Data for Flexible User Perception](https://arxiv.org/abs/2501.01598v1)
arXiv ID: http://arxiv.org/abs/2501.01598v1
Abstract:
A wide range of user perception applications leverage inertial measurement unit (IMU) data for online prediction. However, restricted by the non-i.i.d. nature of IMU data collected from mobile devices, most systems work well only in a controlled setting (e.g., for a specific user in particular postures), limiting application scenarios. To achieve uncontrolled online prediction on mobile devices, referred to as the flexible user perception (FUP) problem, is attractive but hard. In this paper, we propose a novel scheme, called Prism, which can obtain high FUP accuracy on mobile devices. The core of Prism is to discover task-aware domains embedded in IMU dataset, and to train a domain-aware model on each identified domain. To this end, we design an expectation-maximization (EM) algorithm to estimate latent domains with respect to the specific downstream perception task. Finally, the best-fit model can be automatically selected for use by comparing the test sample and all identified domains in the feature space. We implement Prism on various mobile devices and conduct extensive experiments. Results demonstrate that Prism can achieve the best FUP performance with a low latency.
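Prism's task-aware EM is specific to the paper, but its two moving parts, estimating latent domains in feature space and routing a test sample to the best-fit domain's model, can be illustrated with an ordinary Gaussian mixture as a hedged stand-in (placeholder data below):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X_train = np.random.randn(1000, 16)       # placeholder IMU feature vectors
gmm = GaussianMixture(n_components=3, random_state=0).fit(X_train)

domains = gmm.predict(X_train)            # latent domain label per sample;
                                          # train one model per domain
x_test = np.random.randn(1, 16)
best_domain = int(gmm.predict(x_test)[0]) # route test sample to its
                                          # domain-aware model
```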
Decision Explanation: Original decision: REJECT The paper does not focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs). Instead, it proposes a scheme for improving user perception through task-aware domains in IMU data, with no mention of LLMs or prompt engineering for text-based interactions.
Few-shot Implicit Function Generation via Equivariance
arXiv ID: http://arxiv.org/abs/2501.01601v1
Abstract:
Implicit Neural Representations (INRs) have emerged as a powerful framework for representing continuous signals. However, generating diverse INR weights remains challenging due to limited training data. We introduce Few-shot Implicit Function Generation, a new problem setup that aims to generate diverse yet functionally consistent INR weights from only a few examples. This is challenging because even for the same signal, the optimal INRs can vary significantly depending on their initializations. To tackle this, we propose EquiGen, a framework that can generate new INRs from limited data. The core idea is that functionally similar networks can be transformed into one another through weight permutations, forming an equivariance group. By projecting these weights into an equivariant latent space, we enable diverse generation within these groups, even with few examples. EquiGen implements this through an equivariant encoder trained via contrastive learning and smooth augmentation, an equivariance-guided diffusion process, and controlled perturbations in the equivariant subspace. Experiments on 2D image and 3D shape INR datasets demonstrate that our approach effectively generates diverse INR weights while preserving their functional properties in few-shot scenarios.
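The weight-permutation symmetry that EquiGen builds on can be checked directly; a minimal NumPy demonstration on an illustrative 2-16-1 MLP (not the paper's architecture):

```python
# Permuting hidden units (rows of W1 and b1, columns of W2) leaves the function unchanged.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)
W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)

def mlp(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

perm = rng.permutation(16)
x = rng.normal(size=2)
print(np.allclose(mlp(x, W1, b1, W2, b2),
                  mlp(x, W1[perm], b1[perm], W2[:, perm], b2)))  # True
```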
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper focuses on generating Implicit Neural Representations (INRs) weights for continuous signal representation, not on the engineering, design, or optimization of prompts for Large Language Models (LLMs), failing to meet the primary 'MUST' criteria." }
[Google is all you need: Semi-Supervised Transfer Learning Strategy For
Light Multimodal Multi-Task Classification Model](https://arxiv.org/abs/2501.01611v1) arXiv ID: http://arxiv.org/abs/2501.01611v1
Abstract:
As the volume of digital image data increases, so does the need for effective image classification. This study introduces a robust multi-label classification system designed to assign multiple labels to a single image, addressing the complexity of images that may be associated with multiple categories (ranging from 1 to 19, excluding 12). We propose a multi-modal classifier that merges advanced image recognition algorithms with Natural Language Processing (NLP) models, incorporating a fusion module to integrate these distinct modalities. The purpose of integrating textual data is to enhance the accuracy of label prediction by providing contextual understanding that visual analysis alone cannot fully capture. Our proposed classification model combines Convolutional Neural Networks (CNN) for image processing with NLP techniques for analyzing textual descriptions (i.e., captions). This approach includes rigorous training and validation phases, with each model component verified and analyzed through ablation experiments. Preliminary results demonstrate the classifier's accuracy and efficiency, highlighting its potential as an automatic image-labeling system.
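A generic late-fusion head of the kind the abstract describes, as a hedged sketch (feature dimensions and layer sizes are invented, not the authors' architecture):

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Concatenate image and caption features, then emit multi-label logits."""
    def __init__(self, img_dim=512, txt_dim=768, n_labels=18):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 256), nn.ReLU(),
            nn.Linear(256, n_labels))  # one independent logit per label

    def forward(self, img_feat, txt_feat):
        return self.head(torch.cat([img_feat, txt_feat], dim=-1))

logits = FusionClassifier()(torch.randn(4, 512), torch.randn(4, 768))
probs = torch.sigmoid(logits)  # multi-label: per-label probabilities
```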
Decision Explanation: Original decision: REJECT The paper focuses primarily on developing a multi-modal classification model for image classification, integrating NLP for contextual understanding, rather than specifically on the engineering, design, or optimization of prompts for Large Language Models (LLMs).
[Merging Context Clustering with Visual State Space Models for Medical
Image Segmentation](https://arxiv.org/abs/2501.01618v1) arXiv ID: http://arxiv.org/abs/2501.01618v1
Abstract:
Medical image segmentation demands the aggregation of global and local feature representations, posing a challenge for current methodologies in handling both long-range and short-range feature interactions. Recently, vision mamba (ViM) models have emerged as promising solutions for addressing model complexities by excelling in long-range feature iterations with linear complexity. However, existing ViM approaches overlook the importance of preserving short-range local dependencies by directly flattening spatial tokens and are constrained by fixed scanning patterns that limit the capture of dynamic spatial context information. To address these challenges, we introduce a simple yet effective method named context clustering ViM (CCViM), which incorporates a context clustering module within the existing ViM models to segment image tokens into distinct windows for adaptable local clustering. Our method effectively combines long-range and short-range feature interactions, thereby enhancing spatial contextual representations for medical image segmentation tasks. Extensive experimental evaluations on diverse public datasets, i.e., Kumar, CPM17, ISIC17, ISIC18, and Synapse demonstrate the superior performance of our method compared to current state-of-the-art methods. Our code can be found at https://github.com/zymissy/CCViM.
Decision Explanation: Original decision: REJECT The paper primarily focuses on medical image segmentation, incorporating visual state space models, and does not address prompt engineering for Large Language Models (LLMs) or demonstrate the impact of textual input prompts on LLM output, thus failing to meet the 'MUST' criteria.
[Implications of Artificial Intelligence on Health Data Privacy and
Confidentiality](https://arxiv.org/abs/2501.01639v2) arXiv ID: http://arxiv.org/abs/2501.01639v2
Abstract:
The rapid integration of artificial intelligence (AI) in healthcare is revolutionizing medical diagnostics, personalized medicine, and operational efficiency. However, alongside these advancements, significant challenges arise concerning patient data privacy, ethical considerations, and regulatory compliance. This paper examines the dual impact of AI on healthcare, highlighting its transformative potential and the critical need for safeguarding sensitive health information. It explores the role of the Health Insurance Portability and Accountability Act (HIPAA) as a regulatory framework for ensuring data privacy and security, emphasizing the importance of robust safeguards and ethical standards in AI-driven healthcare. Through case studies, including AI applications in diabetic retinopathy, oncology, and the controversies surrounding data sharing, this study underscores the ethical and legal complexities of AI implementation. A balanced approach that fosters innovation while maintaining patient trust and privacy is imperative. The findings emphasize the importance of continuous education, transparency, and adherence to regulatory frameworks to harness AI's full potential responsibly and ethically in healthcare.
Decision Explanation: Original decision: REJECT The paper primarily focuses on the implications of AI on health data privacy and confidentiality, rather than on the engineering, design, or optimization of prompts for Large Language Models (LLMs), thus failing to meet the 'MUST' criteria for focusing on prompt engineering for LLMs.
[HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long
Video Understanding](https://arxiv.org/abs/2501.01645v1) arXiv ID: http://arxiv.org/abs/2501.01645v1
Abstract:
Multimodal large language models have become a popular topic in deep visual understanding due to many promising real-world applications. However, hour-long video understanding, spanning over one hour and containing tens of thousands of visual frames, remains under-explored because of 1) challenging long-term video analyses, 2) inefficient large-model approaches, and 3) lack of large-scale benchmark datasets. Among them, in this paper, we focus on building a large-scale hour-long long video benchmark, HLV-1K, designed to evaluate long video understanding models. HLV-1K comprises 1009 hour-long videos with 14,847 high-quality question answering (QA) and multiple-choice question answering (MCQA) pairs with time-aware queries and diverse annotations, covering frame-level, within-event-level, cross-event-level, and long-term reasoning tasks. We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks. This includes promoting future long video understanding tasks at a granular level, such as deep understanding of long live videos, meeting recordings, and movies.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper focuses primarily on building a benchmark dataset for hour-long video understanding, involving multimodal large language models, but does not concentrate on prompt engineering for text-based interactions with LLMs, instead emphasizing video analysis and understanding." }
[AVATAR: Adversarial Autoencoders with Autoregressive Refinement for Time
Series Generation](https://arxiv.org/abs/2501.01649v1) arXiv ID: http://arxiv.org/abs/2501.01649v1
Abstract:
Data augmentation can significantly enhance the performance of machine learning tasks by addressing data scarcity and improving generalization. However, generating time series data presents unique challenges. A model must not only learn a probability distribution that reflects the real data distribution but also capture the conditional distribution at each time step to preserve the inherent temporal dependencies. To address these challenges, we introduce AVATAR, a framework that combines Adversarial Autoencoders (AAE) with Autoregressive Learning to achieve both objectives. Specifically, our technique integrates the autoencoder with a supervisor and introduces a novel supervised loss to assist the decoder in learning the temporal dynamics of time series data. Additionally, we propose another innovative loss function, termed distribution loss, to guide the encoder in more efficiently aligning the aggregated posterior of the autoencoder's latent representation with a prior Gaussian distribution. Furthermore, our framework employs a joint training mechanism to simultaneously train all networks using a combined loss, thereby fulfilling the dual objectives of time series generation. We evaluate our technique across a variety of time series datasets with diverse characteristics. Our experiments demonstrate significant improvements in both the quality and practical utility of the generated data, as assessed by various qualitative and quantitative metrics.
Decision Explanation: Original decision: REJECT The paper focuses primarily on developing a new framework (AVATAR) for time series generation using adversarial autoencoders and autoregressive learning, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs).
[EAUWSeg: Eliminating annotation uncertainty in weakly-supervised medical
image segmentation](https://arxiv.org/abs/2501.01658v1) arXiv ID: http://arxiv.org/abs/2501.01658v1
Abstract:
Weakly-supervised medical image segmentation is gaining traction as it requires only rough annotations rather than accurate pixel-to-pixel labels, thereby reducing the workload for specialists. Although some progress has been made, there is still a considerable performance gap between label-efficient methods and fully-supervised ones, which can be attributed to the uncertain nature of these weak labels. To address this issue, we propose a novel weak annotation method coupled with its learning framework, EAUWSeg, to eliminate the annotation uncertainty. Specifically, we first propose the Bounded Polygon Annotation (BPAnno), which simply labels two polygons for a lesion. Then, a tailored learning mechanism that explicitly treats bounded polygons as two separate annotations is proposed to learn invariant features by providing an adversarial supervision signal for model training. Subsequently, a confidence-auxiliary consistency learner incorporating a classification-guided confidence generator is designed to provide a reliable supervision signal for pixels in uncertain regions by leveraging the feature representation consistency across pixels within the same category as well as the class-specific information encapsulated in bounded polygon annotations. Experimental results demonstrate that EAUWSeg outperforms existing weakly-supervised segmentation methods. Furthermore, compared to fully-supervised counterparts, the proposed method not only delivers superior performance but also costs much less annotation workload. This underscores the superiority and effectiveness of our approach.
Decision Explanation: Original decision: REJECT The paper primarily focuses on medical image segmentation, addressing annotation uncertainty in weakly-supervised learning, which violates the 'MUST NOT' criteria of being primarily concerned with medical subjects and not focusing on prompt engineering for Large Language Models (LLMs).
[BARTPredict: Empowering IoT Security with LLM-Driven Cyber Threat
Prediction](https://arxiv.org/abs/2501.01664v1) arXiv ID: http://arxiv.org/abs/2501.01664v1
Abstract:
The integration of Internet of Things (IoT) technology in various domains has led to operational advancements, but it has also introduced new vulnerabilities to cybersecurity threats, as evidenced by recent widespread cyberattacks on IoT devices. Intrusion detection systems are often reactive, triggered by specific patterns or anomalies observed within the network. To address this challenge, this work proposes a proactive approach to anticipate and preemptively mitigate malicious activities, aiming to prevent potential damage before it occurs. This paper proposes an innovative intrusion prediction framework empowered by Pre-trained Large Language Models (LLMs). The framework incorporates two LLMs: a fine-tuned Bidirectional and AutoRegressive Transformers (BART) model for predicting network traffic and a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model for evaluating the predicted traffic. By harnessing the bidirectional capabilities of BART, the framework then identifies malicious packets among these predictions. Evaluated using the CICIoT2023 IoT attack dataset, our framework showcases a notable enhancement in predictive performance, attaining an impressive 98% overall accuracy, providing a powerful response to the cybersecurity challenges that confront IoT networks.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing an intrusion prediction framework for IoT security using fine-tuned LLMs, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models. Prompt engineering is not the central concern, but rather a means to achieve the framework's predictive capabilities.
[VidFormer: A novel end-to-end framework fused by 3DCNN and Transformer
for Video-based Remote Physiological Measurement](https://arxiv.org/abs/2501.01691v2) arXiv ID: http://arxiv.org/abs/2501.01691v2
Abstract:
Remote physiological signal measurement based on facial videos, also known as remote photoplethysmography (rPPG), involves predicting changes in facial vascular blood flow from facial videos. While most deep learning-based methods have achieved good results, they often struggle to balance performance across small and large-scale datasets due to the inherent limitations of convolutional neural networks (CNNs) and Transformers. In this paper, we introduce VidFormer, a novel end-to-end framework that integrates 3-Dimensional Convolutional Neural Network (3DCNN) and Transformer models for rPPG tasks. Initially, we conduct an analysis of the traditional skin reflection model and subsequently introduce an enhanced model for the reconstruction of rPPG signals. Based on this improved model, VidFormer utilizes 3DCNN and Transformer to extract local and global features from input data, respectively. To enhance the spatiotemporal feature extraction capabilities of VidFormer, we incorporate temporal-spatial attention mechanisms tailored for both the 3DCNN and Transformer branches. Additionally, we design a module to facilitate information exchange and fusion between the 3DCNN and Transformer. Our evaluation on five publicly available datasets demonstrates that VidFormer outperforms current state-of-the-art (SOTA) methods. Finally, we discuss the essential roles of each VidFormer module and examine the effects of ethnicity, makeup, and exercise on its performance.
Decision Explanation: Original decision: REJECT The paper focuses on developing a novel end-to-end framework for video-based remote physiological measurement, utilizing 3DCNN and Transformer, with no primary focus on engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) or improving LLM performance through textual input prompts.
[The Essence of Contextual Understanding in Theory of Mind: A Study on
Question Answering with Story Characters](https://arxiv.org/abs/2501.01705v1) arXiv ID: http://arxiv.org/abs/2501.01705v1
Abstract:
Theory-of-Mind (ToM) is a fundamental psychological capability that allows humans to understand and interpret the mental states of others. Humans infer others' thoughts by integrating causal cues and indirect clues from broad contextual information, often derived from past interactions. In other words, human ToM heavily relies on understanding the backgrounds and life stories of others. Unfortunately, this aspect is largely overlooked in existing benchmarks for evaluating machines' ToM capabilities, due to their usage of short narratives without global backgrounds. In this paper, we verify the importance of understanding long personal backgrounds in ToM and assess the performance of LLMs in such realistic evaluation scenarios. To achieve this, we introduce a novel benchmark, CharToM-QA, comprising 1,035 ToM questions based on characters from classic novels. Our human study reveals a significant disparity in performance: the same group of educated participants performs dramatically better when they have read the novels than when they have not. In parallel, our experiments on state-of-the-art LLMs, including the very recent o1 model, show that LLMs still perform notably worse than humans, even though they have seen these stories during pre-training. This highlights the limitations of current LLMs in capturing the nuanced contextual information required for ToM reasoning.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on evaluating Large Language Models' (LLMs) performance in Theory of Mind tasks using a novel benchmark, rather than on the engineering, design, or optimization of prompts specifically for LLMs, as required by the criteria." }
MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders
arXiv ID: http://arxiv.org/abs/2501.01709v1
Abstract:
Visual encoders are fundamental components in vision-language models (VLMs), each showcasing unique strengths derived from various pre-trained visual foundation models. To leverage the various capabilities of these encoders, recent studies incorporate multiple encoders within a single VLM, leading to a considerable increase in computational cost. In this paper, we present Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model. Specifically, to mitigate conflicts and retain the unique characteristics of each teacher encoder, we employ low-rank adaptation (LoRA) and mixture-of-experts (MoEs) to selectively activate specialized knowledge based on input features, enhancing both adaptability and efficiency. To regularize the KD process and enhance performance, we propose an attention-based distillation strategy that adaptively weighs the different visual encoders and emphasizes valuable visual tokens, reducing the burden of replicating comprehensive but distinct features from multiple teachers. Comprehensive experiments on popular VLMs, such as LLaVA and LLaVA-NeXT, validate the effectiveness of our method. The code will be released.
Decision Explanation: Original decision: REJECT The paper primarily focuses on knowledge distillation for Vision-Language Models (VLMs) with multiple visual encoders, which does not meet the criteria of focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), and does not investigate the manipulation of textual input prompts to improve LLM performance.
[LLMs & Legal Aid: Understanding Legal Needs Exhibited Through User
Queries](https://arxiv.org/abs/2501.01711v1) arXiv ID: http://arxiv.org/abs/2501.01711v1
Abstract:
The paper presents a preliminary analysis of an experiment conducted by Frank Bold, a Czech expert group, to explore user interactions with GPT-4 for addressing legal queries. Between May 3, 2023, and July 25, 2023, 1,252 users submitted 3,847 queries. Unlike studies that primarily focus on the accuracy, factuality, or hallucination tendencies of large language models (LLMs), our analysis focuses on the user query dimension of the interaction. Using GPT-4o for zero-shot classification, we categorized queries on (1) whether users provided factual information about their issue (29.95%) or not (70.05%), (2) whether they sought legal information (64.93%) or advice on the course of action (35.07%), and (3) whether they imposed requirements to shape or control the model's answer (28.57%) or not (71.43%). We provide both quantitative and qualitative insight into user needs and contribute to a better understanding of user engagement with LLMs.
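A hedged sketch of such a zero-shot classification call with the OpenAI Python client; the rubric below paraphrases the three reported dimensions and is not the authors' exact prompt:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "Classify the user's legal query with three comma-separated labels:\n"
    "1) FACTS or NO_FACTS (did the user provide facts about their issue?)\n"
    "2) INFORMATION or ADVICE (legal information vs. a course of action?)\n"
    "3) CONSTRAINED or UNCONSTRAINED (did they impose requirements on the answer?)"
)
query = "My landlord kept my deposit without explanation. What should I do?"

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "system", "content": RUBRIC},
              {"role": "user", "content": query}],
    temperature=0,
)
print(resp.choices[0].message.content)  # e.g. "FACTS, ADVICE, UNCONSTRAINED"
```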
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper's primary focus is on understanding user needs and engagement with LLMs for legal queries, rather than the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), as required by the criteria." }
[Proposing Hierarchical Goal-Conditioned Policy Planning in Multi-Goal
Reinforcement Learning](https://arxiv.org/abs/2501.01727v1) arXiv ID: http://arxiv.org/abs/2501.01727v1
Abstract:
Humanoid robots must master numerous tasks with sparse rewards, posing a challenge for reinforcement learning (RL). We propose a method combining RL and automated planning to address this. Our approach uses short goal-conditioned policies (GCPs) organized hierarchically, with Monte Carlo Tree Search (MCTS) planning using high-level actions (HLAs). Instead of primitive actions, the planning process generates HLAs. A single plan-tree, maintained during the agent's lifetime, holds knowledge about goal achievement. This hierarchy enhances sample efficiency and speeds up reasoning by reusing HLAs and anticipating future actions. Our Hierarchical Goal-Conditioned Policy Planning (HGCPP) framework uniquely integrates GCPs, MCTS, and hierarchical RL, potentially improving exploration and planning in complex tasks.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper focuses primarily on multi-goal reinforcement learning for humanoid robots, using techniques like Monte Carlo Tree Search and hierarchical RL, with no apparent connection to Large Language Models (LLMs) or prompt engineering for text-based interactions." }
[Augmentation Matters: A Mix-Paste Method for X-Ray Prohibited Item
Detection under Noisy Annotations](https://arxiv.org/abs/2501.01733v1) arXiv ID: http://arxiv.org/abs/2501.01733v1
Abstract:
Automatic X-ray prohibited item detection is vital for public safety. Existing deep learning-based methods all assume that the annotations of training X-ray images are correct. However, obtaining correct annotations is extremely hard if not impossible for large-scale X-ray images, where item overlapping is ubiquitous. As a result, X-ray images are easily contaminated with noisy annotations, leading to performance deterioration of existing methods. In this paper, we address the challenging problem of training a robust prohibited item detector under noisy annotations (including both category noise and bounding box noise) from a novel perspective of data augmentation, and propose an effective label-aware mixed patch paste augmentation method (Mix-Paste). Specifically, for each item patch, we mix several item patches with the same category label from different images and replace the original patch in the image with the mixed patch. In this way, the probability of containing the correct prohibited item within the generated image is increased. Meanwhile, the mixing process mimics item overlapping, enabling the model to learn the characteristics of X-ray images. Moreover, we design an item-based large-loss suppression (LLS) strategy to suppress the large losses corresponding to potentially positive predictions of additional items due to the mixing operation. We show the superiority of our method on X-ray datasets under noisy annotations. In addition, we evaluate our method on the noisy MS-COCO dataset to showcase its generalization ability. These results clearly indicate the great potential of data augmentation to handle noisy annotations. The source code is released at https://github.com/wscds/Mix-Paste.
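A minimal sketch of the mixing step, assuming same-label patches pre-cropped to the target box and pixel-wise max as one plausible mixing operation to mimic overlap (the paper's exact operation may differ):

```python
import numpy as np

def mix_paste(image, box, same_label_patches):
    """Blend same-category patches and paste the mixture at the original box."""
    x0, y0, x1, y1 = box
    mixed = image[y0:y1, x0:x1].copy()
    for p in same_label_patches:
        mixed = np.maximum(mixed, p)  # overlapping items in X-ray stay visible
    out = image.copy()
    out[y0:y1, x0:x1] = mixed
    return out

img = np.zeros((64, 64), dtype=np.float32)
patches = [np.random.rand(16, 16).astype(np.float32) for _ in range(3)]
aug = mix_paste(img, (8, 8, 24, 24), patches)  # box is (x0, y0, x1, y1)
```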
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper focuses on data augmentation for improving prohibited item detection in X-ray images using deep learning, with no primary concern on prompt engineering, Large Language Models (LLMs), or textual input prompts." }
[Automating Legal Concept Interpretation with LLMs: Retrieval,
Generation, and Evaluation](https://arxiv.org/abs/2501.01743v1) arXiv ID: http://arxiv.org/abs/2501.01743v1
Abstract:
Legal articles often include vague concepts to adapt to the ever-changing society. Providing detailed interpretations of these concepts is a critical task for legal practitioners, which requires meticulous and professional annotations by legal experts, admittedly time-consuming and expensive to collect at scale. In this paper, we introduce a novel retrieval-augmented generation framework, ATRI, for AuTomatically Retrieving relevant information from past judicial precedents and Interpreting vague legal concepts. We further propose a new benchmark, Legal Concept Entailment, to automate the evaluation of generated concept interpretations without expert involvement. Automatic evaluations indicate that our generated interpretations can effectively assist large language models (LLMs) in understanding vague legal concepts. Multi-faceted evaluations by legal experts indicate that the quality of our concept interpretations is comparable to those written by human experts. Our work has strong implications for leveraging LLMs to support legal practitioners in interpreting vague legal concepts and beyond.
Decision Explanation: Original decision: REJECT The paper's primary focus is on developing a retrieval-augmented generation framework for legal concept interpretation, leveraging LLMs as a component, rather than focusing specifically on prompt engineering for text-based interactions with LLMs.
[Creating Artificial Students that Never Existed: Leveraging Large
Language Models and CTGANs for Synthetic Data Generation](https://arxiv.org/abs/2501.01793v1) arXiv ID: http://arxiv.org/abs/2501.01793v1
Abstract:
In this study, we explore the growing potential of AI and deep learning technologies, particularly Generative Adversarial Networks (GANs) and Large Language Models (LLMs), for generating synthetic tabular data. Access to quality student data is critical for advancing learning analytics, but privacy concerns and stricter data protection regulations worldwide limit its availability and usage. Synthetic data offers a promising alternative. We investigate whether synthetic data can be leveraged to create artificial students for serving learning analytics models. Using the popular GAN model CTGAN and three LLMs, GPT2, DistilGPT2, and DialoGPT, we generate synthetic tabular student data. Our results demonstrate the strong potential of these methods to produce high-quality synthetic datasets that resemble real student data. To validate our findings, we apply a comprehensive set of utility evaluation metrics to assess the statistical and predictive performance of the synthetic data and compare the different generator models used, especially the performance of the LLMs. Our study aims to provide the learning analytics community with valuable insights into the use of synthetic data, laying the groundwork for expanding the field's methodological toolbox with new innovative approaches for learning analytics data generation.
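A hedged sketch of the CTGAN half of such a pipeline using the open-source `ctgan` package (toy columns and data are invented; the paper's setup is not reproduced):

```python
import pandas as pd
from ctgan import CTGAN

real = pd.DataFrame({
    "grade": [72, 85, 91, 60, 78, 88],
    "major": ["cs", "math", "cs", "bio", "bio", "math"],
})

model = CTGAN(epochs=10)                     # tiny run, illustration only
model.fit(real, discrete_columns=["major"])  # mark categorical columns
synthetic = model.sample(100)                # artificial student records
print(synthetic.head())
```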
Decision Explanation: Original decision: REJECT The paper primarily focuses on generating synthetic tabular data for learning analytics using GANs and LLMs, with LLMs being used for data generation rather than text-based interactions through prompt engineering, thus not meeting the core subject requirement.
End-to-End Long Document Summarization using Gradient Caching
arXiv ID: http://arxiv.org/abs/2501.01805v1
Abstract:
Training transformer-based encoder-decoder models for long document summarization poses a significant challenge due to the quadratic memory consumption during training. Several approaches have been proposed to extend the input length at test time, but training with these approaches is still difficult, requiring truncation of input documents and causing a mismatch between training and test conditions. In this work, we propose CachED (Gradient Caching for Encoder-Decoder models), an approach that enables end-to-end training of existing transformer-based encoder-decoder models, using the entire document without truncation. Specifically, we apply non-overlapping sliding windows to input documents, followed by fusion in the decoder. During backpropagation, the gradients are cached at the decoder and are passed through the encoder in chunks by re-computing the hidden vectors, similar to gradient checkpointing. In the experiments on long document summarization, we extend BART to CachED BART, processing more than 500K tokens during training and achieving superior performance without using any additional parameters.
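A hedged PyTorch sketch of the gradient-caching pattern as the abstract describes it (toy modules, my paraphrase rather than the authors' code):

```python
import torch

def cached_step(encoder, decoder_loss, chunks, target):
    with torch.no_grad():                        # cheap forward, no graph kept
        states = [encoder(c) for c in chunks]
    fused = torch.cat(states, dim=0).requires_grad_(True)
    loss = decoder_loss(fused, target)
    loss.backward()                              # gradients cached at `fused`
    grads = fused.grad.split([s.shape[0] for s in states], dim=0)
    for c, g in zip(chunks, grads):              # re-encode one chunk at a time
        encoder(c).backward(g)                   # push cached grads through encoder
    return loss.detach()

enc = torch.nn.Linear(8, 8)                      # toy "encoder"
head = torch.nn.Linear(8, 1)
dec = lambda h, t: (head(h).mean() - t) ** 2     # toy "decoder" loss
chunks = [torch.randn(4, 8) for _ in range(3)]
print(cached_step(enc, dec, chunks, torch.tensor(1.0)))
```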
Decision Explanation: Original decision: REJECT The paper focuses primarily on developing an optimization technique (Gradient Caching) for training transformer-based encoder-decoder models for long document summarization, rather than on engineering, designing, or optimizing prompts specifically for Large Language Models (LLMs).
SDPO: Segment-Level Direct Preference Optimization for Social Agents
arXiv ID: http://arxiv.org/abs/2501.01821v1
Abstract:
Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex goal-oriented social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across a variety of agent tasks. Existing DPO-based approaches for multi-turn interactions are divided into turn-level and session-level methods. The turn-level method is overly fine-grained, focusing exclusively on individual turns, while session-level methods are too coarse-grained, often introducing training noise. To address these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which focuses on specific key segments within interactions to optimize multi-turn agent behavior while minimizing training noise. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring SDPO's potential to advance the social intelligence of LLM-based agents. We release our code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.
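For orientation, the standard DPO objective that segment-level variants build on is shown below; per the abstract, SDPO evaluates the policy/reference log-ratios only over tokens in selected key segments of the interaction, rather than over one turn or the whole session (the notation is standard DPO, not taken from the paper):

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses and $\pi_{\mathrm{ref}}$ is the frozen reference policy.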
Decision Explanation: Original decision: REJECT The paper focuses on optimizing the behavior of social agents powered by LLMs through a new method (SDPO), but does not primarily investigate, analyze, or propose methods for improving LLM performance through the manipulation of textual input prompts. Instead, it optimizes agent behavior through preference optimization, making prompt engineering not the central focus.
The Proof is in the Almond Cookies
arXiv ID: http://arxiv.org/abs/2501.01827v1
Abstract:
This paper presents a case study on how to process cooking recipes (and more generally, how-to instructions) in a way that makes it possible for a robot or artificial cooking assistant to support human chefs in the kitchen. Such AI assistants would be of great benefit to society, as they can help to sustain the autonomy of aging adults or people with a physical impairment, or they may reduce the stress in a professional kitchen. We propose a novel approach to computational recipe understanding that mimics the human sense-making process, which is narrative-based. Using an English recipe for almond crescent cookies as illustration, we show how recipes can be modelled as rich narrative structures by integrating various knowledge sources such as language processing, ontologies, and mental simulation. We show how such narrative structures can be used for (a) dealing with the challenges of recipe language, such as zero anaphora, (b) optimizing a robot's planning process, (c) measuring how well an AI system understands its current tasks, and (d) allowing recipe annotations to become language-independent.
Decision Explanation: Original decision: REJECT The paper's primary focus is on computational recipe understanding for robot-assisted cooking, not on engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), and does not demonstrate the impact of textual input prompts on LLM output.
[MoColl: Agent-Based Specific and General Model Collaboration for Image
Captioning](https://arxiv.org/abs/2501.01834v2) arXiv ID: http://arxiv.org/abs/2501.01834v2
Abstract:
Image captioning is a critical task at the intersection of computer vision and natural language processing, with wide-ranging applications across various domains. For complex tasks such as diagnostic report generation, deep learning models require not only domain-specific image-caption datasets but also the incorporation of relevant general knowledge to provide contextual accuracy. Existing approaches exhibit inherent limitations: specialized models excel in capturing domain-specific details but lack generalization, while vision-language models (VLMs) built on large language models (LLMs) leverage general knowledge but struggle with domain-specific adaptation. To address these limitations, this paper proposes a novel agent-enhanced model collaboration framework, which we call MoColl, designed to effectively integrate domain-specific and general knowledge. Specifically, our approach is to decompose complex image captioning tasks into a series of interconnected question-answer subtasks. A trainable visual question answering (VQA) model is employed as a specialized tool to focus on domain-specific visual analysis, answering task-specific questions based on image content. Concurrently, an LLM-based agent with general knowledge formulates these questions and synthesizes the resulting question-answer pairs into coherent captions. Beyond its role in leveraging the VQA model, the agent further guides its training to enhance its domain-specific capabilities. Experimental results on radiology report generation validate the effectiveness of the proposed framework, demonstrating significant improvements in the quality of generated reports.
Decision Explanation: Original decision: REJECT The paper focuses primarily on developing a collaborative framework for image captioning, leveraging both vision-language models (VLMs) and Large Language Models (LLMs), but does not centralize on prompt engineering for text-based interactions with LLMs. The LLM's role is in question formulation and synthesis, not in prompt engineering for improved LLM output.
Practical machine learning is learning on small samples
arXiv ID: http://arxiv.org/abs/2501.01836v1
Abstract:
Based on limited observations, machine learning discerns a dependence which is expected to hold in the future. What makes it possible? Statistical learning theory imagines an indefinitely increasing training sample to justify its approach. In reality, there is no infinite time or even an infinite general population for learning. Here I argue that practical machine learning is based on an implicit assumption that the underlying dependence is relatively "smooth": likely, there are no abrupt differences in feedback between cases with close data points. From this point of view, learning shall involve selection of a hypothesis that "smoothly" approximates the training set. I formalize this as the Practical learning paradigm. The paradigm includes terminology and rules for the description of learners. Popular learners (local smoothing, k-NN, decision trees, Naive Bayes, SVM for classification and for regression) are shown here to be implementations of this paradigm.
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria: it does not focus on prompt engineering for Large Language Models (LLMs), nor does it investigate improving LLM performance through textual input prompts. Instead, it discusses a general machine learning paradigm for learning from small samples, unrelated to LLMs or prompt engineering.
[Multi-Agent Conversational Online Learning for Adaptive LLM Response
Identification](https://arxiv.org/abs/2501.01849v1) arXiv ID: http://arxiv.org/abs/2501.01849v1
Abstract:
The remarkable generative capability of large language models (LLMs) has sparked a growing interest in automatically generating responses for different applications. Given the dynamic nature of user preferences and the uncertainty of LLM response performance, it is crucial to design efficient online learning algorithms to identify optimal LLM responses (i.e., high-quality responses that also meet user preferences). Most existing online algorithms adopt a centralized approach and fail to leverage explicit user preferences for more efficient and personalized LLM response identification. In contrast, this paper introduces MACO (Multi-Agent Conversational Online Learning for Adaptive LLM Response Identification): 1) The online LLM response identification process is accelerated by multiple local agents (such as smartphones), while enhancing data privacy; 2) A novel conversational mechanism is proposed to adaptively conduct conversations for soliciting user preferences (e.g., a preference for a humorous tone over a serious one in generated responses), so as to minimize uncertainty in preference estimation. Our theoretical analysis demonstrates that MACO is near-optimal regarding cumulative regret. Additionally, MACO offers reduced communication costs and computational complexity by eliminating the traditional, computing-intensive "G-optimal design" found in previous works. Extensive experiments with the open LLM Llama, coupled with two different embedding models from Google and OpenAI for text vector representation, demonstrate that MACO significantly outperforms the current state-of-the-art in online LLM response identification.
Decision Explanation: Original decision: REJECT The paper's primary focus is on multi-agent conversational online learning for adapting LLM responses, not on prompt engineering specifically for Large Language Models (LLMs). It fails to meet the 'MUST' criteria by not investigating, analyzing, or proposing methods for improving LLM performance through the manipulation of textual input prompts.
[Accuracy Can Lie: On the Impact of Surrogate Model in Configuration
Tuning](https://arxiv.org/abs/2501.01876v1) arXiv ID: http://arxiv.org/abs/2501.01876v1
Abstract:
To ease the expensive measurements during configuration tuning, it is natural to build a surrogate model as a replacement of the system, so that configuration performance can be cheaply evaluated. Yet, a stereotype therein is that the higher the model accuracy, the better the tuning result would be. This "accuracy is all" belief drives our research community to build more and more accurate models and to criticize a tuner for the inaccuracy of the model used. However, this practice raises some previously unaddressed questions, e.g., do those somewhat small accuracy improvements reported in existing work really matter much to the tuners? What role does model accuracy play in the impact on tuning quality? To answer those related questions, we conduct one of the largest-scale empirical studies to date, running 24/7 over a period of 13 months, that covers 10 models, 17 tuners, and 29 systems from the existing works under four different commonly used metrics, leading to 13,612 investigated cases. Surprisingly, our key findings reveal that accuracy can lie: there are a considerable number of cases where higher accuracy actually leads to no improvement in the tuning outcomes (up to 58% of cases under certain settings), or even worse, degrades the tuning quality (up to 24% of cases under certain settings). We also discover that the chosen models in most proposed tuners are sub-optimal and that the required percentage of accuracy change needed to significantly improve tuning quality varies according to the range of model accuracy. Deriving from fitness landscape analysis, we provide in-depth discussions of the rationale behind these findings, offering several lessons learned as well as insights for future opportunities. Most importantly, this work poses a clear message to the community: we should take one step back from the natural "accuracy is all" belief for model-based configuration tuning.
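The headline finding is easy to reproduce in miniature: a surrogate with lower overall error can still mislead a tuner if it misranks configurations near the optimum. A self-contained illustration on synthetic data (not the paper's systems):

```python
import numpy as np

configs = np.linspace(0, 1, 101)
truth = (configs - 0.3) ** 2                     # true cost (lower is better)

accurate = truth + 0.001 * np.sin(50 * configs)  # tiny error everywhere...
accurate[np.argmin(truth)] += 0.05               # ...but wrong at the optimum
sloppy = truth + 0.02                            # bigger error, rank-preserving

for name, sur in [("more-accurate", accurate), ("less-accurate", sloppy)]:
    mae = np.abs(sur - truth).mean()
    picked = configs[np.argmin(sur)]             # what a tuner would select
    print(f"{name}: MAE={mae:.4f}, picked={picked:.2f} (true best = 0.30)")
```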
Decision Explanation: Original decision: REJECT The paper focuses on the impact of surrogate model accuracy in configuration tuning, which does not meet the 'MUST' criteria of primarily focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts to improve LLM performance.
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
arXiv ID: http://arxiv.org/abs/2501.01904v1
Abstract:
Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data semantics across different modalities, it is intuitively more challenging to implement multimodal slow-thinking systems. To address this issue, in this paper, we explore a straightforward approach by fine-tuning a capable MLLM with a small amount of textual long-form thought data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning with long thought). We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs. Moreover, it seems that such textual reasoning data can be even more effective than visual reasoning data in eliciting the slow-thinking capacities of MLLMs. While this work is preliminary, it demonstrates that slow-thinking capacities are fundamentally associated with the language model component, which can be transferred across modalities or domains. This finding can be leveraged to guide the development of more powerful slow-thinking reasoning systems. We release our resources at https://github.com/RUCAIBox/Virgo.
Decision Explanation: Original decision: REJECT The paper primarily focuses on fine-tuning a multimodal large language model (MLLM) for slow-thinking reasoning, which falls under developing new training methods for LLM architectures, rather than engineering prompts specifically for improving LLM performance through textual input manipulation.
[Mitigating Hallucination for Large Vision Language Model by
Inter-Modality Correlation Calibration Decoding](https://arxiv.org/abs/2501.01926v1) arXiv ID: http://arxiv.org/abs/2501.01926v1
Abstract:
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. Despite their success, LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. To address this issue, some approaches have introduced inference-time interventions, such as contrastive decoding and attention rectification, to reduce overreliance on language priors. However, these approaches overlook hallucinations stemming from spurious inter-modality correlations. In this paper, we propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner. In this method, we design a Cross-Modal Value-Enhanced Decoding (CMVED) module to alleviate hallucination through a novel contrastive decoding mechanism. During the estimation of the distorted distribution, CMVED masks the value vectors associated with significant cross-modal attention weights, which addresses both uni-modality overreliance and misleading inter-modality correlations. Additionally, a Content-Driven Attention Refinement (CDAR) module refines cross-modal attention weights, guiding LVLMs to focus on important visual content. Experimental results on diverse hallucination benchmarks validate the superiority of our method over existing state-of-the-art techniques in reducing hallucinations in LVLM text generation. Our code will be available at https://github.com/lijm48/IMCCD.
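A hedged sketch of the generic contrastive-decoding step that CMVED-style methods build on; the combination rule below is the common contrastive-decoding form, not necessarily the paper's exact calibration:

```python
import torch

def contrastive_logits(logits, distorted_logits, alpha=1.0):
    # Down-weight tokens that the distorted pass (e.g., with key cross-modal
    # value vectors masked) also favors; keep tokens grounded in the image.
    return (1 + alpha) * logits - alpha * distorted_logits

vocab = 32000
logits = torch.randn(1, vocab)      # normal decoding pass
distorted = torch.randn(1, vocab)   # distorted pass (random stand-in here)
next_token = contrastive_logits(logits, distorted).argmax(dim=-1)
```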
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on mitigating hallucinations in Large Vision Language Models (LVLMs) through decoding mechanisms, not on the engineering, design, or optimization of textual input prompts for Large Language Models (LLMs), and does not provide concrete examples of prompt manipulation impacting LLM output." }
[Abstractive Text Summarization for Contemporary Sanskrit Prose: Issues
and Challenges](https://arxiv.org/abs/2501.01933v1) arXiv ID: http://arxiv.org/abs/2501.01933v1
Abstract:
This thesis presents Abstractive Text Summarization models for contemporary Sanskrit prose. The first chapter, titled Introduction, presents the motivation behind this work, the research questions, and the conceptual framework. Sanskrit is a low-resource inflectional language. The key research question that this thesis investigates is what the challenges are in developing an abstractive TS system for Sanskrit. To answer the key research question, sub-questions based on four different themes have been posed in this work. The second chapter, Literature Review, surveys the previous works done. The third chapter, Data Preparation, answers the remaining three questions from the third theme. It reports the data collection and preprocessing challenges for both language model and summarization model training. The fourth chapter reports the training and inference of models and the results obtained therein. This research has initiated a pipeline for Sanskrit abstractive text summarization and has reported the challenges faced at every stage of the development. The research questions based on every theme have been answered to address the key research question.
Decision Explanation: Original decision: REJECT The paper focuses primarily on developing Abstractive Text Summarization models for Sanskrit prose, discussing language-specific challenges and model training, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs).
[Cold-Start Recommendation towards the Era of Large Language Models
(LLMs): A Comprehensive Survey and Roadmap](https://arxiv.org/abs/2501.01945v2) arXiv ID: http://arxiv.org/abs/2501.01945v2
Abstract:
The cold-start problem is one of the long-standing challenges in recommender systems, focusing on accurately modeling new or interaction-limited users or items to provide better recommendations. Due to the diversification of internet platforms and the exponential growth of users and items, the importance of cold-start recommendation (CSR) is becoming increasingly evident. At the same time, large language models (LLMs) have achieved tremendous success and possess strong capabilities in modeling user and item information, providing new potential for cold-start recommendations. However, the research community on CSR still lacks a comprehensive review and reflection in this field. Based on this, in this paper, standing in the context of the era of large language models, we provide a comprehensive review and discussion of the roadmap, related literature, and future directions of CSR. Specifically, we explore the development path of how existing CSR utilizes information, from content features, graph relations, and domain information, to the world knowledge possessed by large language models, aiming to provide new insights for both the research and industrial communities on CSR. Related resources for cold-start recommendation are collected and continuously updated for the community at https://github.com/YuanchenBei/Awesome-Cold-Start-Recommendation.
Decision Explanation: Original decision: REJECT The paper focuses primarily on cold-start recommendation systems utilizing Large Language Models (LLMs), but does not concentrate on prompt engineering for text-based interactions with LLMs. Instead, it leverages LLMs as a component for enhancing recommender systems, which does not meet the core subject requirement of prompt engineering being the central focus.
MADGEN: Mass-Spec attends to De Novo Molecular generation
arXiv ID: http://arxiv.org/abs/2501.01950v2
Abstract:
The annotation (assigning structural chemical identities) of MS/MS spectra remains a significant challenge due to the enormous molecular diversity in biological samples and the limited scope of reference databases. Currently, the vast majority of spectral measurements remain in the "dark chemical space" without structural annotations. To improve annotation, we propose MADGEN (Mass-spec Attends to De Novo Molecular GENeration), a scaffold-based method for de novo molecular structure generation guided by mass spectrometry data. MADGEN operates in two stages: scaffold retrieval and spectra-conditioned molecular generation starting with the scaffold. In the first stage, given an MS/MS spectrum, we formulate scaffold retrieval as a ranking problem and employ contrastive learning to align mass spectra with candidate molecular scaffolds. In the second stage, starting from the retrieved scaffold, we employ the MS/MS spectrum to guide an attention-based generative model to generate the final molecule. Our approach constrains the molecular generation search space, reducing its complexity and improving generation accuracy. We evaluate MADGEN on three datasets (NIST23, CANOPUS, and MassSpecGym), measuring its performance with both a predictive scaffold retriever and an oracle retriever. We demonstrate the effectiveness of using attention to integrate spectral information throughout the generation process to achieve strong results with the oracle retriever.
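Stage one (scaffold retrieval as a ranking problem) resembles standard contrastive embedding alignment; a minimal sketch with a symmetric InfoNCE loss, where random tensors stand in for the spectrum and scaffold encoders:

```python
import torch
import torch.nn.functional as F

def info_nce(spec_emb, scaf_emb, tau=0.07):
    spec = F.normalize(spec_emb, dim=-1)
    scaf = F.normalize(scaf_emb, dim=-1)
    logits = spec @ scaf.T / tau              # pairwise similarity matrix
    labels = torch.arange(len(spec))          # matched pairs on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```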
Decision Explanation: Original decision: REJECT The paper primarily focuses on de novo molecular structure generation using mass spectrometry data, with attention-based generative models, and does not meet the core requirement of focusing on prompt engineering specifically for Large Language Models (LLMs) or manipulating textual input prompts to improve LLM performance.
[MixGCN: Scalable GCN Training by Mixture of Parallelism and Mixture of
Accelerators](https://arxiv.org/abs/2501.01951v2) arXiv ID: http://arxiv.org/abs/2501.01951v2
Abstract:
Graph convolutional networks (GCNs) have demonstrated superiority in graph-based learning tasks. However, training GCNs on full graphs is particularly challenging, due to the following two challenges: (1) the associated feature tensors can easily explode the memory and block the communication bandwidth of modern accelerators, and (2) the computation workflow in training GCNs alternates between sparse and dense matrix operations, complicating the efficient utilization of computational resources. Existing solutions for scalable distributed full-graph GCN training mostly adopt partition parallelism, which is unsatisfactory as they only partially address the first challenge while incurring scaled-out communication volume. To this end, we propose MixGCN aiming to simultaneously address both the aforementioned challenges towards GCN training. To tackle the first challenge, MixGCN integrates mixture of parallelism. Both theoretical and empirical analysis verify its constant communication volumes and enhanced balanced workload; For handling the second challenge, we consider mixture of accelerators (i.e., sparse and dense accelerators) with a dedicated accelerator for GCN training and a fine-grain pipeline. Extensive experiments show that MixGCN achieves boosted training efficiency and scalability.
Decision Explanation: Original decision: REJECT The paper focuses primarily on scalable training of Graph Convolutional Networks (GCNs) using parallelism and mixture of accelerators, with no mention of Large Language Models (LLMs) or prompt engineering, thus failing to meet all 'MUST' criteria.
[SmartSpatial: Enhancing the 3D Spatial Arrangement Capabilities of
Stable Diffusion Models and Introducing a Novel 3D Spatial Evaluation Framework](https://arxiv.org/abs/2501.01998v1) arXiv ID: http://arxiv.org/abs/2501.01998v1
Abstract:
Stable Diffusion models have made remarkable strides in generating photorealistic images from text prompts but often falter when tasked with accurately representing complex spatial arrangements, particularly involving intricate 3D relationships. To address this limitation, we introduce SmartSpatial, an innovative approach that enhances the spatial arrangement capabilities of Stable Diffusion models through 3D-aware conditioning and attention-guided mechanisms. SmartSpatial incorporates depth information and employs cross-attention control to ensure precise object placement, delivering notable improvements in spatial accuracy metrics. In conjunction with SmartSpatial, we present SmartSpatialEval, a comprehensive evaluation framework designed to assess spatial relationships. This framework utilizes vision-language models and graph-based dependency parsing for performance analysis. Experimental results on the COCO and SpatialPrompts datasets show that SmartSpatial significantly outperforms existing methods, setting new benchmarks for spatial arrangement accuracy in image generation.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on enhancing Stable Diffusion models for 3D spatial arrangement in image generation, which falls under image generation driven by generative AI models other than Large Language Models (LLMs) for text generation, violating the 'MUST NOT' criteria." }
[Multi-Task Semantic Communication With Graph Attention-Based Feature
Correlation Extraction](https://arxiv.org/abs/2501.02006v1) arXiv ID: http://arxiv.org/abs/2501.02006v1
Abstract:
Multi-task semantic communication can serve multiple learning tasks using a shared encoder model. Existing models have overlooked the intricate relationships between features extracted during the encoding process of tasks. This paper adds a new graph attention inter-block (GAI) module to the encoder/transmitter of a multi-task semantic communication system, which enriches the features for multiple tasks by embedding the intermediate outputs of encoding in the features, compared to existing techniques. The key idea is that we interpret the outputs of the intermediate feature extraction blocks of the encoder as the nodes of a graph to capture the correlations of the intermediate features. Another important aspect is that we refine the node representation using a graph attention mechanism to extract the correlations and a multi-layer perceptron network to associate the node representations with different tasks. Consequently, the intermediate features are weighted and embedded into the features transmitted for executing multiple tasks at the receiver. Experiments demonstrate that the proposed model surpasses the most competitive and publicly available models by 11.4% on the CityScapes 2Task dataset and outperforms the established state-of-the-art by 3.97% on the NYU V2 3Task dataset, respectively, when the bandwidth ratio of the communication channel (i.e., compression level for transmission over the channel) is as constrained as 1/12.
Decision Explanation: Original decision: REJECT The paper focuses on developing a new module for a multi-task semantic communication system, optimizing feature extraction with graph attention, and does not primarily investigate prompt engineering for Large Language Models (LLMs) or analyze the impact of textual input prompts on LLM output.
[TART: Token-based Architecture Transformer for Neural Network
Performance Prediction](https://arxiv.org/abs/2501.02007v1) arXiv ID: http://arxiv.org/abs/2501.02007v1
Abstract:
In the realm of neural architecture design, achieving high performance is largely reliant on the manual expertise of researchers. Despite the emergence of Neural Architecture Search (NAS) as a promising technique for automating this process, current NAS methods still require human input to expand the search space and cannot generate new architectures. This paper explores the potential of Transformers in comprehending neural architectures and their performance, with the objective of establishing the foundation for utilizing Transformers to generate novel networks. We propose the Token-based Architecture Transformer (TART), which predicts neural network performance without the need to train candidate networks. TART attains state-of-the-art performance on the DeepNets-1M dataset for performance prediction tasks without edge information, indicating the potential of Transformers to aid in discovering novel and high-performing neural architectures.
Decision Explanation: Original decision: REJECT The paper focuses primarily on utilizing Transformers for Neural Architecture Search (NAS) and predicting neural network performance, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), failing to meet the mandatory criteria.
[Cross-model Transferability among Large Language Models on the Platonic
Representations of Concepts](https://arxiv.org/abs/2501.02009v1) arXiv ID: http://arxiv.org/abs/2501.02009v1
Abstract:
Understanding the inner workings of Large Language Models (LLMs) is a critical research frontier. Prior research has shown that a single LLM's concept representations can be captured as steering vectors (SVs), enabling the control of LLM behavior (e.g., towards generating harmful content). Our work takes a novel approach by exploring the intricate relationships between concept representations across different LLMs, drawing an intriguing parallel to Plato's Allegory of the Cave. In particular, we introduce a linear transformation method to bridge these representations and present three key findings: 1) Concept representations across different LLMs can be effectively aligned using simple linear transformations, enabling efficient cross-model transfer and behavioral control via SVs. 2) This linear transformation generalizes across concepts, facilitating alignment and control of SVs representing different concepts across LLMs. 3) A weak-to-strong transferability exists between LLM concept representations, whereby SVs extracted from smaller LLMs can effectively control the behavior of larger LLMs.
Decision Explanation: Original decision: REJECT The paper primarily focuses on understanding concept representations across different LLMs and developing a linear transformation method for alignment, rather than specifically engineering or optimizing prompts for improving LLM text generation performance through textual input manipulation.
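For intuition, here is a minimal sketch of the cross-model transfer idea described in the abstract above: a least-squares linear map is fitted between paired steering vectors (SVs) from two LLMs. The names, dimensions, and fitting procedure are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch: aligning concept steering vectors (SVs) between two
# LLMs with a least-squares linear map, in the spirit of the paper's first
# finding. Shapes and variable names are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_src, d_tgt, n_concepts = 768, 1024, 40

# Paired SVs for the same concepts extracted from a source and a target LLM.
sv_src = rng.normal(size=(n_concepts, d_src))   # e.g. from a smaller model
sv_tgt = rng.normal(size=(n_concepts, d_tgt))   # e.g. from a larger model

# Fit W minimizing ||sv_src @ W - sv_tgt||^2 (ordinary least squares).
W, *_ = np.linalg.lstsq(sv_src, sv_tgt, rcond=None)

# A new concept's SV from the source model can now be mapped into the
# target model's representation space and added to its hidden states.
new_sv = rng.normal(size=(d_src,))
transferred_sv = new_sv @ W        # shape (d_tgt,)
```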
[Machine Learning-Based Differential Diagnosis of Parkinson's Disease
Using Kinematic Feature Extraction and Selection](https://arxiv.org/abs/2501.02014v1) arXiv ID: http://arxiv.org/abs/2501.02014v1
Abstract:
Parkinson's disease (PD), the second most common neurodegenerative disorder, is characterized by dopaminergic neuron loss and the accumulation of abnormal α-synuclein. PD presents both motor and non-motor symptoms that progressively impair daily functioning. The severity of these symptoms is typically assessed using the MDS-UPDRS rating scale, which is subjective and dependent on the physician's experience. Additionally, PD shares symptoms with other neurodegenerative diseases, such as progressive supranuclear palsy (PSP) and multiple system atrophy (MSA), complicating accurate diagnosis. To address these diagnostic challenges, we propose a machine learning-based system for differential diagnosis of PD, PSP, MSA, and healthy controls (HC). This system utilizes a kinematic feature-based hierarchical feature extraction and selection approach. Initially, 18 kinematic features are extracted, including two newly proposed features: thumb-to-index vector velocity and acceleration, which provide insights into motor control patterns. In addition, 41 statistical features were extracted from each kinematic feature, including some new approaches such as Average Absolute Change, Rhythm, Amplitude, Frequency, Standard Deviation of Frequency, and Slope. Feature selection is performed using one-way ANOVA to rank features, followed by Sequential Forward Floating Selection (SFFS) to identify the most relevant ones, aiming to reduce the computational complexity. The final feature set is used for classification, achieving a classification accuracy of 66.67% for each dataset and 88.89% for each patient, with particularly high performance for the MSA and HC groups using the SVM algorithm. This system shows potential as a rapid and accurate diagnostic tool in clinical practice, though further data collection and refinement are needed to enhance its reliability.
Decision Explanation: Original decision: REJECT The paper focuses on machine learning for differential diagnosis of Parkinson's disease using kinematic feature extraction and selection, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, failing to meet all 'MUST' criteria.
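The two-stage feature selection described above (ANOVA ranking followed by sequential forward floating selection) can be approximated in a few lines. A hedged sketch follows, in which scikit-learn's plain forward selection stands in for the floating SFFS variant and all data is synthetic.

```python
# Minimal sketch of ANOVA F-test ranking followed by sequential forward
# selection. scikit-learn's SequentialFeatureSelector does plain forward
# selection, so it stands in here for the floating (SFFS) variant.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, SequentialFeatureSelector
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 41))          # toy statistical features per recording
y = rng.integers(0, 4, size=120)        # toy PD / PSP / MSA / HC labels

# Stage 1: keep the 20 features with the highest one-way ANOVA F-scores.
stage1 = SelectKBest(f_classif, k=20).fit(X, y)
X_ranked = stage1.transform(X)

# Stage 2: forward selection of 8 features wrapped around an SVM classifier.
sfs = SequentialFeatureSelector(SVC(kernel="rbf"), n_features_to_select=8,
                                direction="forward", cv=3)
X_final = sfs.fit_transform(X_ranked, y)
print(X_final.shape)  # (120, 8)
```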
[Enhancing Uncertainty Modeling with Semantic Graph for Hallucination
Detection](https://arxiv.org/abs/2501.02020v1) arXiv ID: http://arxiv.org/abs/2501.02020v1
Abstract:
Large Language Models (LLMs) are prone to hallucination with non-factual or unfaithful statements, which undermines their applications in real-world scenarios. Recent research focuses on uncertainty-based hallucination detection, which utilizes the output probability of LLMs for uncertainty calculation and does not rely on external knowledge or frequent sampling from LLMs. However, most approaches merely consider the uncertainty of each independent token, while the intricate semantic relations among tokens and sentences are not well studied, which limits the detection of hallucination that spans multiple tokens and sentences in the passage. In this paper, we propose a method to enhance uncertainty modeling with a semantic graph for hallucination detection. Specifically, we first construct a semantic graph that well captures the relations among entity tokens and sentences. Then, we incorporate the relations between two entities for uncertainty propagation to enhance sentence-level hallucination detection. Given that hallucination occurs due to the conflict between sentences, we further present a graph-based uncertainty calibration method that integrates the contradiction probability of the sentence with its neighbors in the semantic graph for uncertainty calculation. Extensive experiments on two datasets show the great advantages of our proposed approach. In particular, we obtain substantial improvements of 19.78% in passage-level hallucination detection.
Decision Explanation: Original decision: REJECT The paper primarily focuses on enhancing uncertainty modeling for hallucination detection in LLMs, rather than prompt engineering, optimization, or manipulation of textual input prompts to improve LLM performance, thus not meeting the core subject requirement.
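As a rough illustration of the graph-based uncertainty propagation the abstract describes, the toy sketch below smooths per-node uncertainties along semantic-graph edges; the adjacency construction, mixing weight, and iteration count are assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of uncertainty propagation over a semantic graph: node
# uncertainties are repeatedly mixed with a row-normalized neighbor average.
import numpy as np

def propagate_uncertainty(u, adj, alpha=0.5, n_iters=3):
    """u: per-node uncertainty (e.g. -log p of entity tokens), column vector;
    adj: symmetric 0/1 adjacency of the semantic graph."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    norm_adj = adj / deg                      # row-normalized neighbor average
    for _ in range(n_iters):
        u = (1 - alpha) * u + alpha * (norm_adj @ u)
    return u

u = np.array([0.1, 0.9, 0.2, 0.8])            # toy entity-node uncertainties
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 0],
                [0, 1, 0, 0]], dtype=float)
print(propagate_uncertainty(u[:, None], adj).ravel())
```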
Weakly Supervised Learning on Large Graphs
arXiv ID: http://arxiv.org/abs/2501.02021v1
Abstract:
Graph classification plays a pivotal role in various domains, including pathology, where images can be represented as graphs: nodes might represent individual nuclei, and edges capture the spatial or functional relationships between them. Often, the overall label of the graph, such as a cancer type or disease state, is determined by patterns within smaller, localized regions of the image. This work introduces a weakly-supervised graph classification framework leveraging two subgraph extraction techniques: (1) a sliding-window approach and (2) a BFS-based approach. Subgraphs are processed using a Graph Attention Network (GAT), which employs attention mechanisms to identify the most informative subgraphs for classification. Weak supervision is achieved by propagating graph-level labels to subgraphs, eliminating the need for detailed subgraph annotations.
Decision Explanation: Original decision: REJECT This paper focuses on weakly supervised learning for graph classification using Graph Attention Networks, with no mention of Large Language Models (LLMs) or prompt engineering for text-based interactions, thus failing to meet the primary 'MUST' criteria.
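A hedged sketch of the BFS-based subgraph extraction with graph-label propagation mentioned above, using networkx; the seed choice, subgraph size cap, and label-propagation step are illustrative assumptions.

```python
# Illustrative sketch: fixed-size subgraphs grown breadth-first from seed
# nodes inherit the graph-level label (weak supervision). Not the authors'
# code; networkx usage and parameters are assumptions.
import networkx as nx

def bfs_subgraphs(G, max_nodes=32):
    """Yield induced subgraphs grown breadth-first from each seed node."""
    for seed in G.nodes:
        visited, frontier = [seed], [seed]
        while frontier and len(visited) < max_nodes:
            nxt = []
            for u in frontier:
                for v in G.neighbors(u):
                    if v not in visited and len(visited) < max_nodes:
                        visited.append(v)
                        nxt.append(v)
            frontier = nxt
        yield G.subgraph(visited)

G = nx.erdos_renyi_graph(100, 0.05, seed=1)
G.graph["label"] = "tumor"                      # graph-level label
subs = [(sg, G.graph["label"]) for sg in bfs_subgraphs(G)]  # weak labels
print(len(subs), subs[0][0].number_of_nodes())
```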
[CarbonChat: Large Language Model-Based Corporate Carbon Emission
Analysis and Climate Knowledge Q&A System](https://arxiv.org/abs/2501.02031v1) arXiv ID: http://arxiv.org/abs/2501.02031v1
Abstract:
As the impact of global climate change intensifies, corporate carbon emissions have become a focal point of global attention. In response to issues such as the lag in climate change knowledge updates within large language models, the lack of specialization and accuracy in traditional augmented generation architectures for complex problems, and the high cost and time consumption of sustainability report analysis, this paper proposes CarbonChat: a Large Language Model-based corporate carbon emission analysis and climate knowledge Q&A system, aimed at achieving precise carbon emission analysis and policy understanding. First, a diversified index module construction method is proposed to handle the segmentation of rule-based and long-text documents, as well as the extraction of structured data, thereby optimizing the parsing of key information. Second, an enhanced self-prompt retrieval-augmented generation architecture is designed, integrating intent recognition, structured reasoning chains, hybrid retrieval, and Text2SQL, improving the efficiency of semantic understanding and query conversion. Next, based on the greenhouse gas accounting framework, 14 dimensions are established for carbon emission analysis, enabling report summarization, relevance evaluation, and customized responses. Finally, through a multi-layer chunking mechanism, timestamps, and hallucination detection features, the accuracy and verifiability of the analysis results are ensured, reducing hallucination rates and enhancing the precision of the responses.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a corporate carbon emission analysis and climate knowledge Q&A system using LLMs, rather than specifically on prompt engineering for LLMs. While it mentions a 'self-prompt retrieval-augmented generation architecture', prompt engineering is not the central concern, but rather a component of the larger system.
3D Cloud reconstruction through geospatially-aware Masked Autoencoders
arXiv ID: http://arxiv.org/abs/2501.02035v1
Abstract:
Clouds play a key role in Earth's radiation balance with complex effects that introduce large uncertainties into climate models. Real-time 3D cloud data is essential for improving climate predictions. This study leverages geostationary imagery from MSG/SEVIRI and radar reflectivity measurements of cloud profiles from CloudSat/CPR to reconstruct 3D cloud structures. We first apply self-supervised learning (SSL) methods, Masked Autoencoders (MAE) and geospatially-aware SatMAE, to unlabelled MSG images, and then fine-tune our models on matched image-profile pairs. Our approach outperforms state-of-the-art methods like U-Nets, and our geospatial encoding further improves prediction results, demonstrating the potential of SSL for cloud reconstruction.
Decision Explanation: Original decision: REJECT The paper primarily focuses on 3D cloud reconstruction using geospatially-aware Masked Autoencoders, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.
[Architecture for Trajectory-Based Fishing Ship Classification with AIS
Data](https://arxiv.org/abs/2501.02038v1) arXiv ID: http://arxiv.org/abs/2501.02038v1
Abstract:
This paper proposes a data preparation process for managing real-world kinematic data and detecting fishing vessels. The solution is a binary classifier that labels ship trajectories as either fishing or non-fishing. The data used are characterized by the typical problems found in classic data mining applications on real-world data, such as noise and inconsistencies. The two classes are also clearly unbalanced in the data, a problem addressed using algorithms that resample the instances. For classification, a series of features are extracted from the spatiotemporal data representing the ships' trajectories, available from sequences of Automatic Identification System (AIS) reports. These features are proposed for modelling ship behavior but, because they do not contain context-related information, the classification can be applied in other scenarios. Experimentation shows that the proposed data preparation process is useful for the presented classification problem. In addition, positive results are obtained using minimal information.
Decision Explanation: Original decision: REJECT The paper focuses on developing an architecture for ship classification using AIS data and kinematic data, with no mention of Large Language Models (LLMs), prompt engineering, or text-based interactions, failing to meet the primary 'MUST' criteria.
[An Investigation into Value Misalignment in LLM-Generated Texts for
Cultural Heritage](https://arxiv.org/abs/2501.02039v1) arXiv ID: http://arxiv.org/abs/2501.02039v1
Abstract:
As Large Language Models (LLMs) become increasingly prevalent in tasks related to cultural heritage, such as generating descriptions of historical monuments, translating ancient texts, preserving oral traditions, and creating educational content, their ability to produce accurate and culturally aligned texts is being increasingly relied upon by users and researchers. However, cultural value misalignments may exist in generated texts, such as the misrepresentation of historical facts, the erosion of cultural identity, and the oversimplification of complex cultural narratives, which may lead to severe consequences. Therefore, investigating value misalignment in the context of LLM for cultural heritage is crucial for mitigating these risks, yet there has been a significant lack of systematic and comprehensive study and investigation in this area. To fill this gap, we systematically assess the reliability of LLMs in generating culturally aligned texts for cultural heritage-related tasks. We conduct a comprehensive evaluation by compiling an extensive set of 1066 query tasks covering 5 widely recognized categories with 17 aspects within the knowledge framework of cultural heritage across 5 open-source LLMs, and examine both the type and rate of cultural value misalignments in the generated texts. Using both automated and manual approaches, we effectively detect and analyze the cultural value misalignments in LLM-generated texts. Our findings are concerning: over 65% of the generated texts exhibit notable cultural misalignments, with certain tasks demonstrating almost complete misalignment with key cultural values. Beyond these findings, this paper introduces a benchmark dataset and a comprehensive evaluation workflow that can serve as a valuable resource for future research aimed at enhancing the cultural sensitivity and reliability of LLMs.
Decision Explanation: Original decision: REJECT The paper primarily focuses on evaluating cultural value misalignment in LLM-generated texts for cultural heritage, rather than investigating methods for improving LLM performance through the manipulation of textual input prompts, and does not provide concrete examples of prompt engineering techniques.
[A Separable Self-attention Inspired by the State Space Model for
Computer Vision](https://arxiv.org/abs/2501.02040v1) arXiv ID: http://arxiv.org/abs/2501.02040v1
Abstract:
Mamba is an efficient State Space Model (SSM) with linear computational complexity. Although SSMs are not suitable for handling non-causal data, Vision Mamba (ViM) methods still demonstrate good performance in tasks such as image classification and object detection. Recent studies have shown that there is a rich theoretical connection between state space models and attention variants. We propose a novel separable self-attention method, for the first time introducing some excellent design concepts of Mamba into separable self-attention. To ensure a fair comparison with ViMs, we introduce VMINet, a simple yet powerful prototype architecture, constructed solely by stacking our novel attention modules with the most basic down-sampling layers. Notably, VMINet differs significantly from the conventional Transformer architecture. Our experiments demonstrate that VMINet has achieved competitive results on image classification and high-resolution dense prediction tasks. Code is available at https://github.com/yws-wxs/VMINet.
Decision Explanation: Original decision: REJECT The paper focuses primarily on computer vision, proposing a novel separable self-attention method and its application in image classification and object detection, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts.
[MRG: A Multi-Robot Manufacturing Digital Scene Generation Method Using
Multi-Instance Point Cloud Registration](https://arxiv.org/abs/2501.02041v1) arXiv ID: http://arxiv.org/abs/2501.02041v1
Abstract:
A high-fidelity digital simulation environment is crucial for accurately replicating physical operational processes. However, inconsistencies between simulation and physical environments result in low confidence in simulation outcomes, limiting their effectiveness in guiding real-world production. Unlike the traditional step-by-step point cloud "segmentation-registration" generation method, this paper introduces, for the first time, a novel Multi-Robot Manufacturing Digital Scene Generation (MRG) method that leverages multi-instance point cloud registration, specifically within manufacturing scenes. Tailored to the characteristics of industrial robots and manufacturing settings, an instance-focused transformer module is developed to delineate instance boundaries and capture correlations between local regions. Additionally, a hypothesis generation module is proposed to extract target instances while preserving key features. Finally, an efficient screening and optimization algorithm is designed to refine the final registration results. Experimental evaluations on the Scan2CAD and Welding-Station datasets demonstrate that: (1) the proposed method outperforms existing multi-instance point cloud registration techniques; (2) compared to state-of-the-art methods, MR and MP improve by 12.15% and 17.79%, respectively, on the Scan2CAD dataset; and (3) on the Welding-Station dataset, MR and MP are enhanced by 16.95% and 24.15%, respectively. This work marks the first application of multi-instance point cloud registration in manufacturing scenes, significantly advancing the precision and reliability of digital simulation environments for industrial applications.
Decision Explanation: Original decision: REJECT The paper focuses on multi-instance point cloud registration for digital scene generation in manufacturing, with no indication of involving Large Language Models (LLMs) or prompt engineering for text-based interactions, thus failing to meet the primary 'MUST' criteria.
[Advancing Pancreatic Cancer Prediction with a Next Visit Token
Prediction Head on top of Med-BERT](https://arxiv.org/abs/2501.02044v1) arXiv ID: http://arxiv.org/abs/2501.02044v1
Abstract:
Background: Recently, numerous foundation models pretrained on extensive data have demonstrated efficacy in disease prediction using Electronic Health Records (EHRs). However, some questions remain unanswered on how best to utilize such models, especially with very small fine-tuning cohorts. Methods: We utilized Med-BERT, an EHR-specific foundation model, and reformulated the disease binary prediction task into a token prediction task and a next visit mask token prediction task to align with Med-BERT's pretraining task format in order to improve the accuracy of pancreatic cancer (PaCa) prediction in both few-shot and fully supervised settings. Results: The reformulation of the task into a token prediction task, referred to as Med-BERT-Sum, demonstrates slightly superior performance in both few-shot scenarios and larger data samples. Furthermore, reformulating the prediction task as a Next Visit Mask Token Prediction task (Med-BERT-Mask) significantly outperforms the conventional Binary Classification (BC) prediction task (Med-BERT-BC) by 3% to 7% in few-shot scenarios with data sizes ranging from 10 to 500 samples. These findings highlight that aligning the downstream task with Med-BERT's pretraining objectives substantially enhances the model's predictive capabilities, thereby improving its effectiveness in predicting both rare and common diseases. Conclusion: Reformatting disease prediction tasks to align with the pretraining of foundation models enhances prediction accuracy, leading to earlier detection and timely intervention. This approach improves treatment effectiveness, survival rates, and overall patient outcomes for PaCa and potentially other cancers.
Decision Explanation: Original decision: REJECT The paper focuses primarily on optimizing disease prediction tasks using Med-BERT, a foundation model, by reformulating the prediction task, rather than specifically engineering or optimizing prompts for Large Language Models (LLMs). The core subject is not prompt engineering for text-based interactions with LLMs, but rather adapting task formats to improve model performance in a medical context, which is explicitly excluded.
ArtCrafter: Text-Image Aligning Style Transfer via Embedding Reframing
arXiv ID: http://arxiv.org/abs/2501.02064v1
Abstract:
Recent years have witnessed significant advancements in text-guided style transfer, primarily attributed to innovations in diffusion models. These models excel in conditional guidance, utilizing text or images to direct the sampling process. However, despite their capabilities, direct conditional guidance approaches often face challenges in balancing the expressiveness of textual semantics with the diversity of output results while capturing stylistic features. To address these challenges, we introduce ArtCrafter, a novel framework for text-to-image style transfer. Specifically, we introduce an attention-based style extraction module, meticulously engineered to capture the subtle stylistic elements within an image. This module features a multi-layer architecture that leverages the capabilities of perceiver attention mechanisms to integrate fine-grained information. Additionally, we present a novel text-image aligning augmentation component that adeptly balances control over both modalities, enabling the model to efficiently map image and text embeddings into a shared feature space. We achieve this through attention operations that enable smooth information flow between modalities. Lastly, we incorporate an explicit modulation that seamlessly blends multimodal enhanced embeddings with original embeddings through an embedding reframing design, empowering the model to generate diverse outputs. Extensive experiments demonstrate that ArtCrafter yields impressive results in visual stylization, exhibiting exceptional levels of stylistic intensity, controllability, and diversity.
Decision Explanation: Original decision: REJECT The paper focuses on text-to-image style transfer using diffusion models, which falls under image generation driven by generative AI models other than Large Language Models (LLMs) for text generation, violating the 'Papers MUST NOT' criterion 2.
[The interplay between domain specialization and model size: a case study
in the legal domain](https://arxiv.org/abs/2501.02068v1) arXiv ID: http://arxiv.org/abs/2501.02068v1
Abstract:
Scaling laws for language models so far focused on finding the compute-optimal model size and token count for training from scratch. However, achieving this optimal balance requires significant compute resources due to the extensive data demands when training models from randomly-initialized weights. Continual pre-training offers a cost-effective alternative, leveraging the compute investment from pre-trained models to incorporate new knowledge without requiring extensive new data. Recent findings suggest that data quality influences constants in scaling laws, thereby altering the optimal parameter-token allocation ratio. Building on this insight, we investigate the interplay between domain specialization and model size during continual pre-training under compute-constrained scenarios. Our goal is to identify a compute-efficient training regime for this scenario and, potentially, detect patterns in this interplay that can be generalized across different model sizes and domains. To compare general and specialized training, we filtered a web-based dataset to extract legal domain data. We pre-trained models with 1.5B, 3B, 7B and 14B parameters on both the unfiltered and filtered datasets, then evaluated their performance on legal exams. Results show that as model size increases, the compute-effectiveness gap between specialized and general models widens.
Decision Explanation: Original decision: REJECT The paper primarily focuses on the interplay between domain specialization and model size for continual pre-training, which falls under developing new training methods for LLMs, rather than prompt engineering for text-based interactions with LLMs.
[On the Statistical Complexity for Offline and Low-Adaptive Reinforcement
Learning with Structures](https://arxiv.org/abs/2501.02089v1) arXiv ID: http://arxiv.org/abs/2501.02089v1
Abstract:
This article reviews the recent advances on the statistical foundation of reinforcement learning (RL) in the offline and low-adaptive settings. We will start by arguing why offline RL is the appropriate model for almost any real-life ML problem, even if it has nothing to do with the recent AI breakthroughs that use RL. Then we will zoom into two fundamental problems of offline RL: offline policy evaluation (OPE) and offline policy learning (OPL). It may be surprising that tight bounds for these problems were not known even for tabular and linear cases until recently. We delineate the differences between worst-case minimax bounds and instance-dependent bounds. We also cover key algorithmic ideas and proof techniques behind near-optimal instance-dependent methods in OPE and OPL. Finally, we discuss the limitations of offline RL and review a burgeoning problem of *low-adaptive exploration* which addresses these limitations by providing a sweet middle ground between offline and online RL.
Decision Explanation: Original decision: REJECT The paper does not meet the primary focus criteria, as it discusses reinforcement learning (RL) in offline and low-adaptive settings, with no evident focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs), nor does it investigate improving LLM performance through textual input prompt manipulation.
Online Detection of Water Contamination Under Concept Drift
arXiv ID: http://arxiv.org/abs/2501.02107v1
Abstract:
Water Distribution Networks (WDNs) are vital infrastructures, and contamination poses serious public health risks. Harmful substances can interact with disinfectants like chlorine, making chlorine monitoring essential for detecting contaminants. However, chlorine sensors often become unreliable and require frequent calibration. This study introduces the Dual-Threshold Anomaly and Drift Detection (AD&DD) method, an unsupervised approach combining a dual-threshold drift detection mechanism with an LSTM-based Variational Autoencoder (LSTM-VAE) for real-time contamination detection. Tested on two realistic WDNs, AD&DD effectively identifies anomalies with sensor offsets as concept drift, and outperforms other methods. A proposed decentralized architecture enables accurate contamination detection and localization by deploying AD&DD on selected nodes.
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria, as it focuses on detecting water contamination using an LSTM-based Variational Autoencoder, with no primary emphasis on the engineering, design, or optimization of prompts for Large Language Models (LLMs), nor any investigation of textual input prompt manipulation to improve LLM performance.
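To make the dual-threshold idea concrete, the sketch below flags windows whose reconstruction error (assumed to come from an LSTM-VAE) exceeds an upper threshold as anomalies, and treats a sustained stay between the two thresholds as sensor drift; the thresholds and window length are invented for illustration and are not the paper's calibrated values.

```python
# Toy dual-threshold scheme over per-window reconstruction errors.
import numpy as np

def dual_threshold_detect(errors, low, high, drift_window=20):
    flags, run = [], 0
    for e in errors:
        if e >= high:
            flags.append("anomaly")    # likely contamination event
            run = 0
        elif e >= low:
            run += 1                   # mild but persistent deviation
            flags.append("drift" if run >= drift_window else "normal")
        else:
            run = 0
            flags.append("normal")
    return flags

errors = np.concatenate([np.full(50, 0.1),     # clean operation
                         np.full(30, 0.5),     # slow sensor offset (drift)
                         np.full(5, 2.0)])     # contamination spike
print(set(dual_threshold_detect(errors, low=0.3, high=1.0)))
```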
[Siamese Networks for Cat Re-Identification: Exploring Neural Models for
Cat Instance Recognition](https://arxiv.org/abs/2501.02112v1) arXiv ID: http://arxiv.org/abs/2501.02112v1
Abstract:
Street cats in urban areas often rely on human intervention for survival, leading to challenges in population control and welfare management. In April 2023, Hello Inc., a Chinese urban mobility company, launched the Hello Street Cat initiative to address these issues. The project deployed over 21,000 smart feeding stations across 14 cities in China, integrating livestreaming cameras and treat dispensers activated through user donations. It also promotes the Trap-Neuter-Return (TNR) method, supported by a community-driven platform, HelloStreetCatWiki, where volunteers catalog and identify cats. However, manual identification is inefficient and unsustainable, creating a need for automated solutions. This study explores Deep Learning-based models for re-identifying street cats in the Hello Street Cat initiative. A dataset of 2,796 images of 69 cats was used to train Siamese Networks with EfficientNetB0, MobileNet and VGG16 as base models, evaluated under contrastive and triplet loss functions. VGG16 paired with contrastive loss emerged as the most effective configuration, achieving up to 97% accuracy and an F1 score of 0.9344 during testing. The approach leverages image augmentation and dataset refinement to overcome challenges posed by limited data and diverse visual variations. These findings underscore the potential of automated cat re-identification to streamline population monitoring and welfare efforts. By reducing reliance on manual processes, the method offers a scalable and reliable solution for community-driven initiatives. Future research will focus on expanding datasets and developing real-time implementations to enhance practicality in large-scale deployments.
Decision Explanation: Original decision: REJECT The paper focuses on image recognition for cat re-identification using Siamese Networks and Deep Learning models, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.
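A minimal PyTorch sketch of the Siamese configuration the abstract reports as strongest, a VGG16 backbone trained with a margin-based contrastive loss; the embedding size and margin are assumptions, not the study's settings.

```python
# Hedged sketch: shared VGG16 backbone embeds two cat images; a margin-based
# contrastive loss pulls same-identity pairs together and pushes others apart.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.vgg16(weights=None)
backbone.classifier[-1] = nn.Linear(4096, 128)   # 128-d embedding head (assumed)

def contrastive_loss(z1, z2, same, margin=1.0):
    """same = 1 for images of the same cat, 0 otherwise."""
    d = torch.nn.functional.pairwise_distance(z1, z2)
    return (same * d.pow(2) +
            (1 - same) * torch.clamp(margin - d, min=0).pow(2)).mean()

x1, x2 = torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224)
same = torch.tensor([1., 0., 1., 0.])
loss = contrastive_loss(backbone(x1), backbone(x2), same)
loss.backward()
```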
[AVTrustBench: Assessing and Enhancing Reliability and Robustness in
Audio-Visual LLMs](https://arxiv.org/abs/2501.02135v1) arXiv ID: http://arxiv.org/abs/2501.02135v1
Abstract:
With the rapid advancement of Multi-modal Large Language Models (MLLMs), several diagnostic benchmarks have recently been developed to assess these models' multi-modal reasoning proficiency. However, these benchmarks are restricted to assessing primarily the visual aspect and do not examine the holistic audio-visual (AV) understanding. Moreover, currently, there are no benchmarks that investigate the capabilities of AVLLMs to calibrate their responses when presented with perturbed inputs. To this end, we introduce Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning over 9 meticulously crafted tasks, evaluating the capabilities of AVLLMs across three distinct dimensions: Adversarial attack, Compositional reasoning, and Modality-specific dependency. Using our benchmark we extensively evaluate 13 state-of-the-art AVLLMs. The findings reveal that the majority of existing models fall significantly short of achieving human-like comprehension, offering valuable insights for future research directions. To alleviate the limitations in the existing approaches, we further propose a robust, model-agnostic calibrated audio-visual preference optimization based training strategy CAVPref, obtaining a gain up to 30.19% across all 9 tasks. We will publicly release our code and benchmark to facilitate future research in this direction.
Decision Explanation: Original decision: REJECT The paper primarily focuses on the development and evaluation of a benchmark for Audio-Visual Large Language Models (AVLLMs) and proposes a training strategy (CAVPref) to enhance their reliability and robustness, rather than focusing on prompt engineering for text-based interactions with LLMs.
Attribute-Based Robotic Grasping with Data-Efficient Adaptation
arXiv ID: http://arxiv.org/abs/2501.02149v1
Abstract:
Robotic grasping is one of the most fundamental robotic manipulation tasks and has been the subject of extensive research. However, swiftly teaching a robot to grasp a novel target object in clutter remains challenging. This paper attempts to address the challenge by leveraging object attributes that facilitate recognition, grasping, and rapid adaptation to new domains. In this work, we present an end-to-end encoder-decoder network to learn attribute-based robotic grasping with data-efficient adaptation capability. We first pre-train the end-to-end model with a variety of basic objects to learn generic attribute representation for recognition and grasping. Our approach fuses the embeddings of a workspace image and a query text using a gated-attention mechanism and learns to predict instance grasping affordances. To train the joint embedding space of visual and textual attributes, the robot utilizes object persistence before and after grasping. Our model is self-supervised in a simulation that only uses basic objects of various colors and shapes but generalizes to novel objects in new environments. To further facilitate generalization, we propose two adaptation methods, adversarial adaptation and one-grasp adaptation. Adversarial adaptation regulates the image encoder using augmented data of unlabeled images, whereas one-grasp adaptation updates the overall end-to-end model using augmented data from one grasp trial. Both adaptation methods are data-efficient and considerably improve instance grasping performance. Experimental results in both simulation and the real world demonstrate that our approach achieves over 81% instance grasping success rate on unknown objects, which outperforms several baselines by large margins.
Decision Explanation: Original decision: REJECT The paper's primary focus is on robotic grasping and adaptation in novel environments, utilizing an encoder-decoder network, with no emphasis on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.
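The gated-attention fusion of a workspace-image feature map with a query-text embedding can be sketched as a learned per-channel gate, as below; the layer sizes are assumptions and this is not the authors' architecture.

```python
# Illustrative gated-attention fusion: the text embedding produces a
# per-channel gate in [0, 1] that modulates the image feature map.
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    def __init__(self, text_dim=64, channels=128):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(text_dim, channels), nn.Sigmoid())

    def forward(self, img_feat, text_emb):
        # img_feat: (B, C, H, W); text_emb: (B, text_dim)
        g = self.gate(text_emb)[:, :, None, None]   # per-channel gate
        return img_feat * g                          # emphasize query-relevant channels

fusion = GatedAttentionFusion()
out = fusion(torch.randn(2, 128, 28, 28), torch.randn(2, 64))
print(out.shape)  # torch.Size([2, 128, 28, 28])
```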
[The Integration of Blockchain and Artificial Intelligence for Secure
Healthcare Systems](https://arxiv.org/abs/2501.02169v1) arXiv ID: http://arxiv.org/abs/2501.02169v1
Abstract:
Verisign reported a 125 percent increase in data breaches within the healthcare sector in the United States during 2022, with 18.2 million patient records being impacted. Growing healthcare data volumes and diversification mean that medical information is becoming more valuable. Many health centers use various technologies to ease the classification, storage, and exchange of big data. This use can also put the health data of users at risk and leave it vulnerable. AI and blockchain are among the leading technologies at hand. With AI, data-driven operations and big-data efficiency have improved relative to traditional techniques. Due to its potential to bring about improvements in health services and lower medical costs, AI technology is regularly used in healthcare. Blockchain helps protect shared information and patient privacy during standardized knowledge exchange. The objective of this analysis is to investigate the research and unique contributions since 2008 regarding blockchain-integrated AI and healthcare systems. The work sheds light on applied AI-based healthcare schemes built on various machine learning approaches and disparate blockchain structures. Using technology to ensure patient data security and manage medical information effectively in healthcare settings offers a highly beneficial position for both healthcare providers and patients. Publication activity grew from 2018 to 2021, peaking in 2021; relevant articles from recent years were identified with input from local research experts, alongside download counts and Google Scholar citation metrics, and reviews of large research grants were examined.
Decision Explanation: Original decision: REJECT The paper focuses primarily on the integration of blockchain and AI for secure healthcare systems, not on the engineering, design, or optimization of prompts for Large Language Models (LLMs), and does not provide examples of prompts impacting LLM output.
[AdaMixup: A Dynamic Defense Framework for Membership Inference Attack
Mitigation](https://arxiv.org/abs/2501.02182v1) arXiv ID: http://arxiv.org/abs/2501.02182v1
Abstract:
Membership inference attacks have emerged as a significant privacy concern in the training of deep learning models, where attackers can infer whether a data point was part of the training set based on the model's outputs. To address this challenge, we propose a novel defense mechanism, AdaMixup. AdaMixup employs adaptive mixup techniques to enhance the model's robustness against membership inference attacks by dynamically adjusting the mixup strategy during training. This method not only improves the model's privacy protection but also maintains high performance. Experimental results across multiple datasets demonstrate that AdaMixup significantly reduces the risk of membership inference attacks while achieving a favorable trade-off between defensive efficiency and model accuracy. This research provides an effective solution for data privacy protection and lays the groundwork for future advancements in mixup training methods.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a defense mechanism (AdaMixup) for mitigating membership inference attacks in deep learning models, with no apparent emphasis on prompt engineering, Large Language Models (LLMs), or the manipulation of textual input prompts for LLM performance.
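For reference, a toy mixup training step with a schedule-dependent mixing strength, gesturing at the adaptive element of AdaMixup; the schedule below is an invented stand-in for the paper's actual adaptation rule.

```python
# Minimal mixup step: convex combinations of inputs and one-hot labels.
import torch

def mixup_batch(x, y_onehot, alpha):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix

x = torch.randn(8, 3, 32, 32)
y = torch.nn.functional.one_hot(torch.randint(0, 10, (8,)), 10).float()
for epoch in range(3):
    alpha = 0.2 + 0.1 * epoch      # toy schedule standing in for adaptation
    x_mix, y_mix = mixup_batch(x, y, alpha)
```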
CPTuning: Contrastive Prompt Tuning for Generative Relation Extraction
arXiv ID: http://arxiv.org/abs/2501.02196v1
Abstract:
Generative relation extraction (RE) commonly involves first reformulating RE as a linguistic modeling problem easily tackled with pre-trained language models (PLM) and then fine-tuning a PLM with supervised cross-entropy loss. Although having achieved promising performance, existing approaches assume only one deterministic relation between each pair of entities without considering real scenarios where multiple relations may be valid, i.e., entity pair overlap, causing their limited applications. To address this problem, we introduce a novel contrastive prompt tuning method for RE, CPTuning, which learns to associate a candidate relation between two in-context entities with a probability mass above or below a threshold, corresponding to whether the relation exists. Beyond learning schema, CPTuning also organizes RE as a verbalized relation generation task and uses Trie-constrained decoding to ensure a model generates valid relations. It adaptively picks out the generated candidate relations with a high estimated likelihood in inference, thereby achieving multi-relation extraction. We conduct extensive experiments on four widely used datasets to validate our method. Results show that T5-large fine-tuned with CPTuning significantly outperforms previous methods, regardless of single or multiple relations extraction.
Decision Explanation: Original decision: REJECT Although the paper mentions prompt tuning for Large Language Models (LLMs), its primary focus is on improving Generative Relation Extraction through a novel training method, rather than solely on prompt engineering for text-based interactions with LLMs, as evidenced by the emphasis on fine-tuning a PLM and the development of a new decoding strategy.
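The Trie-constrained decoding step can be illustrated with a small prefix trie over tokenized relation names: at each step, generation is restricted to children of the current prefix. Token ids below are toy values, and hooking this into a generation loop (e.g. via a prefix-allowed-tokens callback) is left implicit.

```python
# Hedged sketch of Trie-constrained decoding: valid verbalized relations are
# stored as token-id paths, and only children of the current prefix are
# allowed as next tokens, so the model can only emit valid relations.
class Trie:
    def __init__(self):
        self.children = {}

    def insert(self, token_ids):
        node = self
        for t in token_ids:
            node = node.children.setdefault(t, Trie())

    def allowed_next(self, prefix):
        node = self
        for t in prefix:
            if t not in node.children:
                return []          # prefix is not a valid relation
            node = node.children[t]
        return list(node.children)

trie = Trie()
for relation in [[5, 9, 2], [5, 9, 7], [3, 1]]:   # toy tokenized relation names
    trie.insert(relation)

print(trie.allowed_next([5, 9]))   # [2, 7] -> only valid continuations survive
```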
Financial Named Entity Recognition: How Far Can LLM Go?
arXiv ID: http://arxiv.org/abs/2501.02237v1
Abstract:
The surge of large language models (LLMs) has revolutionized the extraction and analysis of crucial information from a growing volume of financial statements, announcements, and business news. Recognizing named entities to construct structured data poses a significant challenge in analyzing financial documents and is a foundational task for intelligent financial analytics. However, how effective these generic LLMs are, and how they perform under various prompts, is not yet well understood. To fill in the blank, we present a systematic evaluation of state-of-the-art LLMs and prompting methods on the financial Named Entity Recognition (NER) problem. Specifically, our experimental results highlight their strengths and limitations, identify five representative failure types, and provide insights into their potential and challenges for domain-specific tasks.
Decision Explanation: Original decision: REJECT Although the paper investigates LLM performance under various prompts, its primary focus is on evaluating LLMs for a specific application (Financial Named Entity Recognition) rather than on the engineering, design, or optimization of prompts themselves for improving LLM text generation capabilities.
[Interpretable Load Forecasting via Representation Learning of
Geo-distributed Meteorological Factors](https://arxiv.org/abs/2501.02241v1) arXiv ID: http://arxiv.org/abs/2501.02241v1
Abstract:
Meteorological factors (MF) are crucial in day-ahead load forecasting as they significantly influence the electricity consumption behaviors of consumers. Numerous studies have incorporated MF into the load forecasting model to achieve higher accuracy. Selecting MF from one representative location or the averaged MF as the inputs of the forecasting model is a common practice. However, the difference in MF collected in various locations within a region may be significant, which poses a challenge in selecting the appropriate MF from numerous locations. A representation learning framework is proposed to extract geo-distributed MF while considering their spatial relationships. In addition, this paper employs the Shapley value in the graph-based model to reveal connections between MF collected in different locations and loads. To reduce the computational complexity of calculating the Shapley value, an acceleration method is adopted based on Monte Carlo sampling and weighted linear regression. Experiments on two real-world datasets demonstrate that the proposed method improves the day-ahead forecasting accuracy, especially in extreme scenarios such as the "accumulation temperature effect" in summer and "sudden temperature change" in winter. We also find a significant correlation between the importance of MF in different locations and the corresponding area's GDP and mainstay industry.
Decision Explanation: Original decision: REJECT The paper focuses on interpretable load forecasting using representation learning of geo-distributed meteorological factors, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thus failing to meet all 'MUST' criteria.
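The Monte Carlo acceleration of Shapley values mentioned above can be sketched with permutation sampling; the additive value function below is a stand-in for retraining or re-evaluating the forecasting model on subsets of meteorological-factor locations.

```python
# Illustrative Monte Carlo estimate of Shapley values: random permutations
# approximate each player's (location's) average marginal contribution.
import numpy as np

def mc_shapley(value_fn, n_players, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    phi = np.zeros(n_players)
    for _ in range(n_samples):
        perm = rng.permutation(n_players)
        coalition = set()
        v_prev = value_fn(coalition)
        for p in perm:
            coalition.add(p)
            v_new = value_fn(coalition)
            phi[p] += v_new - v_prev   # marginal contribution of p
            v_prev = v_new
    return phi / n_samples

weights = np.array([0.5, 0.3, 0.15, 0.05])       # toy per-location usefulness
value = lambda S: sum(weights[list(S)])          # additive toy value function
print(mc_shapley(value, 4))                      # ~ recovers the weights
```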
LLMzSzŁ: a comprehensive LLM benchmark for Polish
arXiv ID: http://arxiv.org/abs/2501.02266v1
Abstract:
This article introduces the first comprehensive benchmark for the Polish language at this scale: LLMzSzŁ (LLMs Behind the School Desk). It is based on a coherent collection of Polish national exams, including both academic and professional tests extracted from the archives of the Polish Central Examination Board. It covers 4 types of exams, coming from 154 domains. Altogether, it consists of almost 19k closed-ended questions. We investigate the performance of open-source multilingual, English, and Polish LLMs to verify LLMs' abilities to transfer knowledge between languages. Also, the correlation between LLMs and humans at model accuracy and exam pass rate levels is examined. We show that multilingual LLMs can obtain superior results over monolingual ones; however, monolingual models may be beneficial when model size matters. Our analysis highlights the potential of LLMs in assisting with exam validation, particularly in identifying anomalies or errors in examination tasks.
Decision Explanation: Original decision: REJECT The paper focuses on introducing a benchmark for evaluating LLM performance on Polish language tasks, examining knowledge transfer and correlation with human accuracy, rather than primarily on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs).
[What Kind of Visual Tokens Do We Need? Training-free Visual Token
Pruning for Multi-modal Large Language Models from the Perspective of Graph](https://arxiv.org/abs/2501.02268v1) arXiv ID: http://arxiv.org/abs/2501.02268v1
Abstract:
Recent Multimodal Large Language Models (MLLMs) often use a large number of visual tokens to compensate for their visual shortcomings, leading to excessive computation and obvious visual redundancy. In this paper, we investigate what kind of visual tokens are needed for MLLMs, and reveal that both foreground and background tokens are critical for MLLMs given the varying difficulties of examples. Based on this observation, we propose a graph-based method for training-free visual token pruning, termed G-Prune. In particular, G-Prune regards visual tokens as nodes and constructs their connections based on their semantic similarities. Afterwards, the information flow is propagated via weighted links, and the most important tokens after iterations are kept for MLLMs, whether foreground or background. To validate G-Prune, we apply it to a recent MLLM called LLaVA-NeXT and conduct extensive experiments on a set of benchmarks. The experiment results show that G-Prune can greatly reduce computation overhead while retaining high performance on both coarse- and fine-grained tasks. For instance, G-Prune can reduce 63.57% of the FLOPs of LLaVA-NeXT on VQA2.0 and TextVQA with only 0.95% and 2.34% accuracy drops, respectively.
Decision Explanation: Original decision: REJECT The paper primarily focuses on optimizing visual token pruning for Multi-modal Large Language Models (MLLMs), which concerns model optimization and reduced computational overhead rather than prompt engineering for text-based interactions with LLMs.
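A hedged sketch in the spirit of G-Prune: visual tokens form a cosine-similarity graph, importance scores spread by iterated propagation, and only the top-k tokens are kept. The normalization and iteration count are assumptions, not the paper's exact formulation.

```python
# Toy graph-based token pruning: propagate importance over a row-stochastic
# similarity graph, then keep the highest-scoring tokens.
import torch

def prune_tokens(tokens, keep=16, n_iters=10):
    # tokens: (N, D) visual token embeddings
    z = torch.nn.functional.normalize(tokens, dim=-1)
    sim = (z @ z.T).clamp(min=0)                  # nonnegative similarity graph
    trans = sim / sim.sum(dim=-1, keepdim=True)   # row-stochastic link weights
    score = torch.full((tokens.size(0),), 1.0 / tokens.size(0))
    for _ in range(n_iters):
        score = score @ trans                     # propagate importance
    idx = score.topk(keep).indices
    return tokens[idx], idx

tokens = torch.randn(576, 768)                    # e.g. a ViT patch grid
kept, idx = prune_tokens(tokens)
print(kept.shape)  # torch.Size([16, 768])
```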
[Hyperbolic Contrastive Learning for Hierarchical 3D Point Cloud
Embedding](https://arxiv.org/abs/2501.02285v2) arXiv ID: http://arxiv.org/abs/2501.02285v2
Abstract:
Hyperbolic spaces allow for more efficient modeling of complex, hierarchical structures, which is particularly beneficial in tasks involving multi-modal data. Although hyperbolic geometries have been proven effective for language-image pre-training, their capabilities to unify language, image, and 3D Point Cloud modalities are under-explored. We extend the 3D Point Cloud modality in hyperbolic multi-modal contrastive pre-training. Additionally, we explore the entailment, modality gap, and alignment regularizers for learning hierarchical 3D embeddings and facilitating the transfer of knowledge from both Text and Image modalities. These regularizers enable the learning of intra-modal hierarchy within each modality and inter-modal hierarchy across text, 2D images, and 3D Point Clouds. Experimental results demonstrate that our proposed training strategy yields an outstanding 3D Point Cloud encoder, and the obtained 3D Point Cloud hierarchical embeddings significantly improve performance on various downstream tasks.
Decision Explanation: Original decision: REJECT The paper focuses on multi-modal contrastive pre-training for 3D Point Cloud embedding, primarily involving image and 3D modalities, with no clear emphasis on prompt engineering or text generation driven by Large Language Models (LLMs).
SR-Reward: Taking The Path More Traveled
arXiv ID: http://arxiv.org/abs/2501.02330v1
Abstract:
In this paper, we propose a novel method for learning reward functions directly from offline demonstrations. Unlike traditional inverse reinforcement learning (IRL), our approach decouples the reward function from the learner's policy, eliminating the adversarial interaction typically required between the two. This results in a more stable and efficient training process. Our reward function, called *SR-Reward*, leverages successor representation (SR) to encode a state based on expected future states' visitation under the demonstration policy and transition dynamics. By utilizing the Bellman equation, SR-Reward can be learned concurrently with most reinforcement learning (RL) algorithms without altering the existing training pipeline. We also introduce a negative sampling strategy to mitigate overestimation errors by reducing rewards for out-of-distribution data, thereby enhancing robustness. This strategy inherently introduces a conservative bias into RL algorithms that employ the learned reward. We evaluate our method on the D4RL benchmark, achieving competitive results compared to offline RL algorithms with access to true rewards and imitation learning (IL) techniques like behavioral cloning. Moreover, our ablation studies on data size and quality reveal the advantages and limitations of SR-Reward as a proxy for true rewards.
Decision Explanation: Original decision: REJECT The paper focuses on learning reward functions for reinforcement learning (RL) with offline demonstrations, using successor representation and Bellman equation, and does not meet any of the 'MUST' criteria related to prompt engineering for Large Language Models (LLMs), such as improving LLM performance through textual input prompt manipulation or providing concrete examples of LLM prompts.
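For background, a minimal tabular sketch of learning the successor representation (SR) that SR-Reward builds on, via a TD update along a demonstration trajectory; the state space, discount, and learning rate are toy values.

```python
# Tabular SR via TD learning: M[s, s'] estimates the discounted expected
# visitation of s' starting from s under the demonstration policy.
import numpy as np

def sr_td_update(M, s, s_next, gamma=0.95, lr=0.1):
    n = M.shape[0]
    indicator = np.eye(n)[s]                     # one-hot of the current state
    td_target = indicator + gamma * M[s_next]    # Bellman target for the SR
    M[s] += lr * (td_target - M[s])
    return M

n_states = 5
M = np.zeros((n_states, n_states))
demo = [0, 1, 2, 3, 4, 3, 2]                     # a toy demonstration trajectory
for s, s_next in zip(demo[:-1], demo[1:]):
    M = sr_td_update(M, s, s_next)
```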
[AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM
Inference](https://arxiv.org/abs/2501.02336v1) arXiv ID: http://arxiv.org/abs/2501.02336v1
Abstract:
Inference with long-context large language models (LLMs) is increasingly critical, motivating a number of studies devoted to alleviating the substantial storage and computational costs in such scenarios. Layer-wise skipping methods are promising optimizations but rarely explored in long-context inference. We observe that existing layer-wise skipping strategies have several limitations when applied in long-context inference, including the inability to adapt to model and context variability, disregard for sublayer significance, and inapplicability for the prefilling phase. This paper proposes AdaSkip, an adaptive sublayer skipping method specifically designed for long-context inference. AdaSkip adaptively identifies less important layers by leveraging on-the-fly similarity information, enables sublayer-wise skipping, and accelerates both the prefilling and decoding phases. The effectiveness of AdaSkip is demonstrated through extensive experiments on various long-context benchmarks and models, showcasing its superior inference performance over existing baselines.
Decision Explanation: Original decision: REJECT The paper focuses on accelerating long-context LLM inference through adaptive sublayer skipping, which pertains to optimizing LLM architecture and inference methods rather than prompt engineering for text-based interactions with LLMs.
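A toy sketch of similarity-guided sublayer skipping in the spirit of AdaSkip: a residual sublayer whose recent input/output cosine similarity is very high gets skipped for later tokens. The running-average bookkeeping and threshold are assumptions; the paper's actual on-the-fly criterion is richer than this.

```python
# Hypothetical similarity-guided skipping wrapper around a residual sublayer.
import torch

class SkippableSublayer(torch.nn.Module):
    def __init__(self, sublayer, threshold=0.999):
        super().__init__()
        self.sublayer = sublayer
        self.threshold = threshold
        self.running_sim = 0.0                    # importance proxy

    def forward(self, x):
        if self.running_sim > self.threshold:
            return x                              # skip: sublayer barely changes x
        out = self.sublayer(x) + x                # residual sublayer
        sim = torch.nn.functional.cosine_similarity(
            x.flatten(1), out.flatten(1)).mean().item()
        self.running_sim = 0.9 * self.running_sim + 0.1 * sim
        return out

layer = SkippableSublayer(torch.nn.Linear(64, 64))
for _ in range(5):
    y = layer(torch.randn(2, 64))
```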
[Evaluation of the Code Generation Capabilities of ChatGPT 4: A
Comparative Analysis in 19 Programming Languages](https://arxiv.org/abs/2501.02338v1) arXiv ID: http://arxiv.org/abs/2501.02338v1
Abstract:
This bachelor's thesis examines the capabilities of ChatGPT 4 in code generation across 19 programming languages. The study analyzed solution rates across three difficulty levels, types of errors encountered, and code quality in terms of runtime and memory efficiency through a quantitative experiment. A total of 188 programming problems were selected from the LeetCode platform, and ChatGPT 4 was given three attempts to produce a correct solution with feedback. ChatGPT 4 successfully solved 39.67% of all tasks, with success rates decreasing significantly as problem complexity increased. Notably, the model faced considerable challenges with hard problems across all languages. ChatGPT 4 demonstrated higher competence in widely used languages, likely due to a larger volume and higher quality of training data. The solution rates also revealed a preference for languages with low abstraction levels and static typing. For popular languages, the most frequent error was "Wrong Answer," whereas for less popular languages, compiler and runtime errors prevailed, suggesting frequent misunderstandings and confusion regarding the structural characteristics of these languages. The model exhibited above-average runtime efficiency in all programming languages, showing a tendency toward statically typed and low-abstraction languages. Memory efficiency results varied significantly, with above-average performance in 14 languages and below-average performance in five languages. A slight preference for low-abstraction languages and a leaning toward dynamically typed languages in terms of memory efficiency were observed. Future research should include a larger number of tasks, iterations, and less popular languages. Additionally, ChatGPT 4's abilities in code interpretation and summarization, debugging, and the development of complex, practical code could be analyzed further.
Decision Explanation: Original decision: REJECT The paper's primary focus is on evaluating the code generation capabilities of ChatGPT 4 across various programming languages, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), which is the required core subject.
[UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility](https://arxiv.org/abs/2501.02341v1) arXiv ID: http://arxiv.org/abs/2501.02341v1
Abstract:
Low-altitude mobility, exemplified by unmanned aerial vehicles (UAVs), has introduced transformative advancements across various domains, like transportation, logistics, and agriculture. Leveraging flexible perspectives and rapid maneuverability, UAVs extend traditional systems' perception and action capabilities, garnering widespread attention from academia and industry. However, current UAV operations primarily depend on human control, with only limited autonomy in simple scenarios, and lack the intelligence and adaptability needed for more complex environments and tasks. The emergence of large language models (LLMs) demonstrates remarkable problem-solving and generalization capabilities, offering a promising pathway for advancing UAV intelligence. This paper explores the integration of LLMs and UAVs, beginning with an overview of UAV systems' fundamental components and functionalities, followed by an overview of the state-of-the-art in LLM technology. Subsequently, it systematically highlights the multimodal data resources available for UAVs, which provide critical support for training and evaluation. Furthermore, it categorizes and analyzes key tasks and application scenarios where UAVs and LLMs converge. Finally, a reference roadmap towards agentic UAVs is proposed, aiming to enable UAVs to achieve agentic intelligence through autonomous perception, memory, reasoning, and tool utilization. Related resources are available at https://github.com/Hub-Tian/UAVs_Meet_LLMs.
Decision Explanation: Original decision: REJECT The paper focuses primarily on integrating LLMs with UAVs to enhance UAV intelligence, rather than specifically on the engineering, design, or optimization of prompts for Large Language Models, failing to meet the core subject requirement.
[Exploring the Capabilities and Limitations of Large Language Models for Radiation Oncology Decision Support](https://arxiv.org/abs/2501.02346v1) arXiv ID: http://arxiv.org/abs/2501.02346v1
Abstract:
Thanks to the rapidly evolving integration of LLMs into decision-support tools, a significant transformation is happening across large-scale systems. Like other medical fields, the use of LLMs such as GPT-4 is gaining increasing interest in radiation oncology as well. An attempt to assess GPT-4's performance in radiation oncology was made via a dedicated 100-question examination on the highly specialized topic of radiation oncology physics, revealing GPT-4's superiority over other LLMs. GPT-4's performance on the broader field of clinical radiation oncology is further benchmarked by the ACR Radiation Oncology In-Training (TXIT) exam, where GPT-4 achieved a high accuracy of 74.57%. Its performance on re-labelling structure names in accordance with the AAPM TG-263 report has also been benchmarked, achieving above 96% accuracies. Such studies shed light on the potential of LLMs in radiation oncology. As interest in the potential and constraints of LLMs in general healthcare applications continues to rise, the capabilities and limitations of LLMs in radiation oncology decision support have not yet been fully explored.
Decision Explanation: Original decision: REJECT The paper primarily focuses on the application of LLMs in radiation oncology, a medically-oriented subject explicitly excluded by the criteria, rather than the engineering, design, or optimization of prompts for LLMs.
Context Aware Lemmatization and Morphological Tagging Method in Turkish
arXiv ID: http://arxiv.org/abs/2501.02361v1
Abstract:
The smallest part of a word that defines the word is called a word root. Word roots are used in many applications to improve performance, since they reduce words to a simpler form. In this study, a lemmatization model, which finds word roots, and a morphological tagging model, which predicts the grammatical properties of words, are presented. Both models were developed for Turkish, and both make predictions by taking the meaning of the word into account. In the literature, there is no lemmatization study for Turkish that is sensitive to word meaning. For this reason, the present study is the first in the literature to report a meaning-aware model and results for Turkish lemmatization. In both the lemmatization and morphological tagging models, a bidirectional LSTM is used for the spelling of words, and the Turkish BERT model is used for the meaning of words. The models are trained using the IMST and PUD datasets from Universal Dependencies, and the results are compared with those from the SIGMORPHON 2019 competition. The comparisons show that our models are superior.
Decision Explanation: Original decision: REJECT The paper focuses on lemmatization and morphological tagging for the Turkish language using BERT and LSTM, without any mention of Large Language Models (LLMs), prompt engineering, or manipulation of textual input prompts to improve LLM performance, thus failing to meet the primary 'MUST' criteria.
Enhancing Workplace Productivity and Well-being Using AI Agent
arXiv ID: http://arxiv.org/abs/2501.02368v1
Abstract:
This paper discusses the use of Artificial Intelligence (AI) to enhance workplace productivity and employee well-being. By integrating machine learning (ML) techniques with neurobiological data, the proposed approaches ensure alignment with human ethical standards through value alignment models and Hierarchical Reinforcement Learning (HRL) for autonomous task management. The system utilizes biometric feedback from employees to generate personalized health prompts, fostering a supportive work environment that encourages physical activity. Additionally, we explore decentralized multi-agent systems for improved collaboration and decision-making frameworks that enhance transparency. Various approaches using ML techniques in conjunction with AI implementations are discussed. Together, these innovations aim to create a more productive and health-conscious workplace. These outcomes assist HR management and organizations in launching more rational career progression streams for employees and facilitating organizational transformation.
Decision Explanation: Original decision: REJECT The paper focuses primarily on using AI for workplace productivity and well-being, integrating various technologies, but does not centrally address prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance, as required.
Syntactic Evolution in Language Usage
arXiv ID: http://arxiv.org/abs/2501.02392v1
Abstract:
This research investigates the dynamic nature of linguistic style throughout various stages of life, from late adolescence to old age. By employing linguistic analysis tools and methodologies, the study delves into how individuals adapt and modify their language use over time. The research uses a dataset of blogs from blogger.com from 2004 and focuses on English for syntactic analysis. The findings can have implications for linguistics, psychology, and communication studies, shedding light on the intricate relationship between age and language.
Decision Explanation: Original decision: REJECT The paper primarily focuses on the evolution of linguistic style across different age groups, using linguistic analysis tools, and does not investigate, analyze, or propose methods for improving Large Language Model (LLM) performance through prompt manipulation, nor does it provide concrete examples of prompts and their impact on LLM output.
[iTARGET: Interpretable Tailored Age Regression for Grouped Epigenetic Traits](https://arxiv.org/abs/2501.02401v1) arXiv ID: http://arxiv.org/abs/2501.02401v1
Abstract:
Accurately predicting chronological age from DNA methylation patterns is crucial for advancing biological age estimation. However, this task is made challenging by Epigenetic Correlation Drift (ECD) and Heterogeneity Among CpGs (HAC), which reflect the dynamic relationship between methylation and age across different life stages. To address these issues, we propose a novel two-phase algorithm. The first phase employs similarity searching to cluster methylation profiles by age group, while the second phase uses Explainable Boosting Machines (EBM) for precise, group-specific prediction. Our method not only improves prediction accuracy but also reveals key age-related CpG sites, detects age-specific changes in aging rates, and identifies pairwise interactions between CpG sites. Experimental results show that our approach outperforms traditional epigenetic clocks and machine learning models, offering a more accurate and interpretable solution for biological age estimation with significant implications for aging research.
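The two-phase structure lends itself to a compact sketch: below, KMeans stands in for the paper's similarity search and random arrays stand in for methylation data, with one Explainable Boosting Machine fitted per age group via the interpret package.

```python
import numpy as np
from sklearn.cluster import KMeans
from interpret.glassbox import ExplainableBoostingRegressor

rng = np.random.default_rng(0)
X = rng.random((300, 50))      # toy CpG methylation profiles
y = rng.uniform(0, 90, 300)    # chronological ages

# Phase 1: group profiles by similarity (KMeans as a stand-in for the
# paper's similarity search over methylation profiles).
groups = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# Phase 2: one interpretable EBM regressor per age group.
models = {g: ExplainableBoostingRegressor().fit(X[groups == g], y[groups == g])
          for g in np.unique(groups)}

# Route a profile to its group's model for prediction.
pred = models[groups[0]].predict(X[:1])
```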
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria as it does not focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs). Instead, it proposes a novel algorithm for biological age estimation from DNA methylation patterns, unrelated to LLMs or prompt engineering.
[Boosting Explainability through Selective Rationalization in Pre-trained Language Models](https://arxiv.org/abs/2501.03182v1) arXiv ID: http://arxiv.org/abs/2501.03182v1
Abstract:
The widespread application of pre-trained language models (PLMs) in natural language processing (NLP) has led to increasing concerns about their explainability. Selective rationalization is a self-explanatory framework that selects human-intelligible input subsets as rationales for predictions. Recent studies have shown that applying existing rationalization frameworks to PLMs will result in severe degeneration and failure problems, producing sub-optimal or meaningless rationales. Such failures severely damage trust in rationalization methods and constrain the application of rationalization techniques on PLMs. In this paper, we find that the homogeneity of tokens in the sentences produced by PLMs is the primary contributor to these problems. To address these challenges, we propose a method named Pre-trained Language Model's Rationalization (PLMR), which splits PLMs into a generator and a predictor to deal with NLP tasks while providing interpretable rationales. The generator in PLMR also alleviates homogeneity by pruning irrelevant tokens, while the predictor uses full-text information to standardize predictions. Experiments conducted on two widely used datasets across multiple PLMs demonstrate the effectiveness of the proposed method PLMR in addressing the challenge of applying selective rationalization to PLMs. Codes: https://github.com/ylb777/PLMR.
Decision Explanation: Original decision: REJECT The paper primarily focuses on improving the explainability of Pre-trained Language Models (PLMs) through selective rationalization, rather than the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs). The core subject is not prompt engineering for text-based interactions with LLMs, but rather addressing challenges in applying rationalization techniques to PLMs.
[AI-ANNE: (A) (N)eural (N)et for (E)xploration: Transferring Deep Learning Models onto Microcontrollers and Embedded Systems](https://arxiv.org/abs/2501.03256v1) arXiv ID: http://arxiv.org/abs/2501.03256v1
Abstract:
This working paper explores the integration of neural networks onto resource-constrained embedded systems like the Raspberry Pi Pico / Raspberry Pi Pico 2. A TinyML approach transfers neural networks directly onto these microcontrollers, enabling real-time, low-latency, and energy-efficient inference while maintaining data privacy. To this end, AI-ANNE: (A) (N)eural (N)et for (E)xploration is presented, which facilitates the transfer of pre-trained models from high-performance platforms like TensorFlow and Keras onto microcontrollers, using a lightweight programming language like MicroPython. This approach demonstrates how neural network components, such as neurons, layers, density, and activation functions, can be implemented in MicroPython to deal with the computational limitations of embedded systems. Based on the Raspberry Pi Pico / Raspberry Pi Pico 2, two different neural networks on microcontrollers are presented for an example of data classification. As a further application example, such a microcontroller can be used for condition monitoring, where immediate corrective measures are triggered on the basis of sensor data. Overall, this working paper presents a very easy-to-implement way of using neural networks on energy-efficient devices such as microcontrollers. This makes AI-ANNE: (A) (N)eural (N)et for (E)xploration not only suited for practical use, but also valuable as an educational tool with clear insights into how neural networks operate.
Decision Explanation: Original decision: REJECT The paper primarily focuses on transferring deep learning models onto microcontrollers and embedded systems, with no emphasis on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance.
[Optimizing Edge AI: A Comprehensive Survey on Data, Model, and System Strategies](https://arxiv.org/abs/2501.03265v1) arXiv ID: http://arxiv.org/abs/2501.03265v1
Abstract:
The emergence of 5G and edge computing hardware has brought about a significant shift in artificial intelligence, with edge AI becoming a crucial technology for enabling intelligent applications. With the growing amount of data generated and stored on edge devices, deploying AI models for local processing and inference has become increasingly necessary. However, deploying state-of-the-art AI models on resource-constrained edge devices faces significant challenges that must be addressed. This paper presents an optimization triad for efficient and reliable edge AI deployment, including data, model, and system optimization. First, we discuss optimizing data through data cleaning, compression, and augmentation to make it more suitable for edge deployment. Second, we explore model design and compression methods at the model level, such as pruning, quantization, and knowledge distillation. Finally, we introduce system optimization techniques like framework support and hardware acceleration to accelerate edge AI workflows. Based on an in-depth analysis of various application scenarios and deployment challenges of edge AI, this paper proposes an optimization paradigm based on the data-model-system triad to enable a whole set of solutions to effectively transfer ML models, which are initially trained in the cloud, to various edge devices for supporting multiple scenarios.
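Among the model-level techniques the survey covers, post-training quantization is the quickest to demonstrate; below is a minimal PyTorch sketch of dynamic int8 quantization, offered as a generic illustration rather than a method proposed by the paper.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Post-training dynamic quantization: Linear weights are stored as int8 and
# activations are quantized on the fly, shrinking the model for CPU/edge use.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

out = quantized(torch.randn(1, 512))  # inference runs on the int8 weights
```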
Decision Explanation: Original decision: REJECT The paper does not focus on prompt engineering for Large Language Models (LLMs), instead concentrating on optimizing data, model, and system strategies for edge AI deployment, which falls outside the specified criteria.
[Heterogeneous Graph Pre-training Based Model for Secure and Efficient Prediction of Default Risk Propagation among Bond Issuers](https://arxiv.org/abs/2501.03268v1) arXiv ID: http://arxiv.org/abs/2501.03268v1
Abstract:
Efficient prediction of default risk for bond-issuing enterprises is pivotal for maintaining stability and fostering growth in the bond market. Conventional methods usually rely solely on an enterprise's internal data for risk assessment. In contrast, graph-based techniques leverage interconnected corporate information to enhance default risk identification for targeted bond issuers. Traditional graph techniques such as the label propagation algorithm or DeepWalk fail to effectively integrate an enterprise's inherent attribute information with its topological network data. Additionally, due to data scarcity and security and privacy concerns between enterprises, end-to-end graph neural network (GNN) algorithms may struggle to deliver satisfactory performance on target tasks. To address these challenges, we present a novel two-stage model. In the first stage, we employ an innovative Masked Autoencoders for Heterogeneous Graph (HGMAE) to pre-train on a vast enterprise knowledge graph. Subsequently, in the second stage, a specialized classifier model is trained to predict default risk propagation probabilities. The classifier leverages feature vectors from the pre-trained encoder concatenated with the enterprise's task-specific feature vectors. Through this two-stage training approach, our model not only boosts the importance of unique bond characteristics for specific default prediction tasks, but also securely and efficiently leverages the global information pre-trained from other enterprises. Experimental results demonstrate that our proposed model outperforms existing approaches in predicting default risk for bond issuers.
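The second stage reduces to training a classifier on pre-trained graph embeddings concatenated with task-specific features; here is a schematic sketch with random placeholder vectors, where logistic regression stands in for the paper's classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
graph_emb = rng.normal(size=(1000, 64))    # placeholder: pre-trained encoder output
task_feats = rng.normal(size=(1000, 16))   # placeholder: bond-specific features
labels = rng.integers(0, 2, 1000)          # default / no default

X = np.concatenate([graph_emb, task_feats], axis=1)
clf = LogisticRegression(max_iter=1000).fit(X, labels)
risk = clf.predict_proba(X)[:, 1]          # default-risk propagation probability
```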
Decision Explanation: Original decision: REJECT The paper focuses on developing a two-stage model for predicting default risk propagation among bond issuers using heterogeneous graph pre-training, with no primary emphasis on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts to improve LLM performance.
[Knowledge-Guided Biomarker Identification for Label-Free Single-Cell RNA-Seq Data: A Reinforcement Learning Perspective](https://arxiv.org/abs/2501.04718v1) arXiv ID: http://arxiv.org/abs/2501.04718v1
Abstract:
Gene panel selection aims to identify the most informative genomic biomarkers in label-free genomic datasets. Traditional approaches, which rely on domain expertise, embedded machine learning models, or heuristic-based iterative optimization, often introduce biases and inefficiencies, potentially obscuring critical biological signals. To address these challenges, we present an iterative gene panel selection strategy that harnesses ensemble knowledge from existing gene selection algorithms to establish preliminary boundaries or prior knowledge, which guide the initial search space. Subsequently, we incorporate reinforcement learning through a reward function shaped by expert behavior, enabling dynamic refinement and targeted selection of gene panels. This integration mitigates biases stemming from initial boundaries while capitalizing on RL's stochastic adaptability. Comprehensive comparative experiments, case studies, and downstream analyses demonstrate the effectiveness of our method, highlighting its improved precision and efficiency for label-free biomarker discovery. Our results underscore the potential of this approach to advance single-cell genomics data analysis.
Decision Explanation: Original decision: REJECT The paper primarily focuses on applying reinforcement learning for biomarker identification in genomics, with no apparent emphasis on prompt engineering, Large Language Models (LLMs), or manipulation of textual input prompts for LLM performance improvement.
[Calculating Customer Lifetime Value and Churn using Beta Geometric Negative Binomial and Gamma-Gamma Distribution in a NFT based setting](https://arxiv.org/abs/2501.04719v1) arXiv ID: http://arxiv.org/abs/2501.04719v1
Abstract:
Customer Lifetime Value (CLV) is an important metric that measures the total value a customer will bring to a business over their lifetime. The Beta Geometric Negative Binomial Distribution (BGNBD) and Gamma Gamma Distribution are two models that can be used to calculate CLV, taking into account both the frequency and value of customer transactions. This article explains the BGNBD and Gamma Gamma Distribution models, and how they can be used to calculate CLV for NFT (Non-Fungible Token) transaction data in a blockchain setting. By estimating the parameters of these models using historical transaction data, businesses can gain insights into the lifetime value of their customers and make data-driven decisions about marketing and customer retention strategies.
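Both models are implemented in the Python lifetimes library, so the pipeline the article describes can be sketched directly. The bundled CDNOW sample stands in for an NFT transaction summary here; mapping wallet transactions into recency-frequency-monetary columns is an assumption about the data wrangling, not something the abstract specifies.

```python
from lifetimes import BetaGeoFitter, GammaGammaFitter
from lifetimes.datasets import load_cdnow_summary_data_with_monetary_value

summary = load_cdnow_summary_data_with_monetary_value()  # stand-in for NFT data

# BG/NBD models purchase frequency from recency-frequency-age columns.
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(summary["frequency"], summary["recency"], summary["T"])

# Gamma-Gamma models transaction value; it needs repeat buyers only.
repeat = summary[summary["frequency"] > 0]
ggf = GammaGammaFitter(penalizer_coef=0.001)
ggf.fit(repeat["frequency"], repeat["monetary_value"])

# Combine both to get a discounted 12-month CLV per customer.
clv = ggf.customer_lifetime_value(
    bgf, repeat["frequency"], repeat["recency"], repeat["T"],
    repeat["monetary_value"], time=12, freq="D", discount_rate=0.01)
```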
Decision Explanation: Original decision: REJECT The paper focuses on calculating Customer Lifetime Value using statistical models in an NFT setting, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.
[Multi-task Domain Adaptation for Computation Offloading in Edge-intelligence Networks](https://arxiv.org/abs/2501.07585v1) arXiv ID: http://arxiv.org/abs/2501.07585v1
Abstract:
In the field of multi-access edge computing (MEC), efficient computation offloading is crucial for improving resource utilization and reducing latency in dynamically changing environments. This paper introduces a new approach, termed as Multi-Task Domain Adaptation (MTDA), aiming to enhance the ability of computational offloading models to generalize in the presence of domain shifts, i.e., when new data in the target environment significantly differs from the data in the source domain. The proposed MTDA model incorporates a teacher-student architecture that allows continuous adaptation without necessitating access to the source domain data during inference, thereby maintaining privacy and reducing computational overhead. Utilizing a multi-task learning framework that simultaneously manages offloading decisions and resource allocation, the proposed MTDA approach outperforms benchmark methods regarding mean squared error and accuracy, particularly in environments with increasing numbers of users. It is observed by means of computer simulation that the proposed MTDA model maintains high performance across various scenarios, demonstrating its potential for practical deployment in emerging MEC applications.
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria as it focuses primarily on computation offloading in edge-intelligence networks, multi-task domain adaptation, and resource allocation, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thereby failing to address the core subject of prompt engineering for text-based interactions with LLMs.
Adjoint sharding for very long context training of state space models
arXiv ID: http://arxiv.org/abs/2501.00692v1
Abstract:
Despite very fast progress, efficiently training large language models (LLMs) on very long contexts remains challenging. Existing methods fall back to training LLMs with short contexts (a maximum of a few thousand tokens in training) and use inference-time techniques when evaluating on long contexts (above a 1M-token context window at inference). As opposed to long-context inference, training on very long context input prompts is quickly limited by GPU memory availability and by the prohibitively long training times it requires on state-of-the-art hardware. Meanwhile, many real-life applications require not only inference but also training/fine-tuning with long context on specific tasks. Such applications include, for example, augmenting the context with various sources of raw reference information for fact extraction, fact summarization, or fact reconciliation tasks. We propose adjoint sharding, a novel technique that shards the gradient calculation during training to reduce memory requirements by orders of magnitude, making training on very long contexts computationally tractable. Adjoint sharding is based on the adjoint method and computes gradients equivalent to backpropagation. We also propose truncated adjoint sharding to speed up the algorithm while maintaining performance. We provide a distributed version and a parallelized version of adjoint sharding to further speed up training. Empirical results show that the proposed adjoint sharding algorithm reduces memory usage by up to 3X with a 1.27B-parameter large language model on 1M context length training. This makes it possible to increase the maximum context length during training or fine-tuning of a 1.27B-parameter model from 35K tokens to above 100K tokens on a training infrastructure composed of five AWS P4 instances.
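To make the memory mechanics concrete, here is a toy NumPy illustration of sharding a gradient computation for a linear recurrence s_t = A s_{t-1} + B x_t with a squared-error loss on each state: the forward pass stores only chunk-boundary states, and a reverse sweep recomputes each chunk and accumulates its gradient contribution via the adjoint recursion. This is a generic checkpoint-plus-adjoint sketch of the idea, not the paper's adjoint-sharding algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, chunk = 4, 64, 16
A = 0.3 * rng.normal(size=(d, d))
B = 0.3 * rng.normal(size=(d, d))
X = rng.normal(size=(T, d))   # inputs
Y = rng.normal(size=(T, d))   # per-step targets

def forward_chunk(s, xs):
    states = []
    for x in xs:
        s = A @ s + B @ x
        states.append(s)
    return states

# Forward sweep: keep only chunk-boundary states (O(T/chunk) memory).
boundaries = [np.zeros(d)]
for c in range(0, T, chunk):
    boundaries.append(forward_chunk(boundaries[-1], X[c:c + chunk])[-1])

# Reverse sweep: recompute each chunk's states, run the adjoint recursion
# a_t = (s_t - y_t) + A^T a_{t+1} for L = 0.5 * sum_t ||s_t - y_t||^2,
# and accumulate gradient contributions chunk by chunk.
dA, dB = np.zeros_like(A), np.zeros_like(B)
a_next = np.zeros(d)
for c in range(T - chunk, -1, -chunk):
    s0 = boundaries[c // chunk]
    states = forward_chunk(s0, X[c:c + chunk])   # recompute, don't store globally
    prev = [s0] + states[:-1]
    for t in range(chunk - 1, -1, -1):
        a = (states[t] - Y[c + t]) + A.T @ a_next
        dA += np.outer(a, prev[t])               # dL/dA = sum_t a_t s_{t-1}^T
        dB += np.outer(a, X[c + t])              # dL/dB = sum_t a_t x_t^T
        a_next = a
# dA and dB now match full backprop without storing all T states at once.
```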
Decision Explanation: Original decision: REJECT The paper primarily focuses on a novel technique for reducing memory requirements during LLM training (adjoint sharding) rather than engineering, designing, or optimizing prompts specifically for improving LLM performance through textual input manipulation.
[Everywhere Attack: Attacking Locally and Globally to Boost Targeted Transferability](https://arxiv.org/abs/2501.00707v1) arXiv ID: http://arxiv.org/abs/2501.00707v1
Abstract:
Adversarial examples' (AE) transferability refers to the phenomenon that AEs crafted with one surrogate model can also fool other models. Notwithstanding remarkable progress in untargeted transferability, its targeted counterpart remains challenging. This paper proposes an everywhere scheme to boost targeted transferability. Our idea is to attack a victim image both globally and locally. We aim to optimize 'an army of targets' in every local image region, instead of previous works that optimize a single high-confidence target in the image. Specifically, we split a victim image into non-overlapping blocks and jointly mount a targeted attack on each block. Such a strategy mitigates transfer failures caused by attention inconsistency between surrogate and victim models and thus results in stronger transferability. Our approach is method-agnostic, which means it can be easily combined with existing transferable attacks for even higher transferability. Extensive experiments on ImageNet demonstrate that the proposed approach universally improves state-of-the-art targeted attacks by a clear margin, e.g., the transferability of the widely adopted Logit attack can be improved by 28.8%-300%. We also evaluate the crafted AEs on a real-world platform: Google Cloud Vision. The results further support the superiority of the proposed method.
Decision Explanation: Original decision: REJECT The paper focuses on improving the transferability of adversarial examples in image classification, with no relevance to Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.
[An AI-powered Bayesian generative modeling approach for causal inference in observational studies](https://arxiv.org/abs/2501.00755v1) arXiv ID: http://arxiv.org/abs/2501.00755v1
Abstract:
Causal inference in observational studies with high-dimensional covariates presents significant challenges. We introduce CausalBGM, an AI-powered Bayesian generative modeling approach that captures the causal relationship among covariates, treatment, and outcome variables. The core innovation of CausalBGM lies in its ability to estimate the individual treatment effect (ITE) by learning individual-specific distributions of a low-dimensional latent feature set (e.g., latent confounders) that drives changes in both treatment and outcome. This approach not only effectively mitigates confounding effects but also provides comprehensive uncertainty quantification, offering reliable and interpretable causal effect estimates at the individual level. CausalBGM adopts a Bayesian model and uses a novel iterative algorithm to update the model parameters and the posterior distribution of latent features until convergence. This framework leverages the power of AI to capture complex dependencies among variables while adhering to the Bayesian principles. Extensive experiments demonstrate that CausalBGM consistently outperforms state-of-the-art methods, particularly in scenarios with high-dimensional covariates and large-scale datasets. Its Bayesian foundation ensures statistical rigor, providing robust and well-calibrated posterior intervals. By addressing key limitations of existing methods, CausalBGM emerges as a robust and promising framework for advancing causal inference in modern applications in fields such as genomics, healthcare, and social sciences. CausalBGM is maintained at the website https://causalbgm.readthedocs.io/.
Decision Explanation: Original decision: REJECT The paper focuses on developing a Bayesian generative modeling approach for causal inference in observational studies, leveraging AI, but does not primarily address the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts to improve LLM performance.
[Revisiting Graph Neural Networks on Graph-level Tasks: Comprehensive Experiments, Analysis, and Improvements](https://arxiv.org/abs/2501.00773v1) arXiv ID: http://arxiv.org/abs/2501.00773v1
Abstract:
Graphs are essential data structures for modeling complex interactions in domains such as social networks, molecular structures, and biological systems. Graph-level tasks, which predict properties or classes for the entire graph, are critical for applications such as molecular property prediction and subgraph counting. Graph Neural Networks (GNNs) have shown promise in these tasks, but their evaluations are often limited to narrow datasets, tasks, and inconsistent experimental setups, restricting their generalizability. To address these limitations, we propose a unified evaluation framework for graph-level GNNs. This framework provides a standardized setting to evaluate GNNs across diverse datasets, various graph tasks (e.g., graph classification and regression), and challenging scenarios, including noisy, imbalanced, and few-shot graphs. Additionally, we propose a novel GNN model with enhanced expressivity and generalization capabilities. Specifically, we enhance the expressivity of GNNs through a k-path rooted subgraph approach, enabling the model to effectively count subgraphs (e.g., paths and cycles). Moreover, we introduce a unified graph contrastive learning algorithm for graphs across diverse domains, which adaptively removes unimportant edges to augment graphs, thereby significantly improving generalization performance. Extensive experiments demonstrate that our model achieves superior performance against fourteen effective baselines across twenty-seven graph datasets, establishing it as a robust and generalizable model for graph-level tasks.
Decision Explanation: Original decision: REJECT The paper does not meet the core criteria as it focuses primarily on the development and optimization of Graph Neural Networks (GNNs) for graph-level tasks, with no indication of Large Language Models (LLMs), prompt engineering, or text generation being the central focus.
[LENS-XAI: Redefining Lightweight and Explainable Network Security through Knowledge Distillation and Variational Autoencoders for Scalable Intrusion Detection in Cybersecurity](https://arxiv.org/abs/2501.00790v2) arXiv ID: http://arxiv.org/abs/2501.00790v2
Abstract:
The rapid proliferation of Industrial Internet of Things (IIoT) systems necessitates advanced, interpretable, and scalable intrusion detection systems (IDS) to combat emerging cyber threats. Traditional IDS face challenges such as high computational demands, limited explainability, and inflexibility against evolving attack patterns. To address these limitations, this study introduces the Lightweight Explainable Network Security framework (LENS-XAI), which combines robust intrusion detection with enhanced interpretability and scalability. LENS-XAI integrates knowledge distillation, variational autoencoder models, and attribution-based explainability techniques to achieve high detection accuracy and transparency in decision-making. By leveraging a training set comprising 10% of the available data, the framework optimizes computational efficiency without sacrificing performance. Experimental evaluation on four benchmark datasets: Edge-IIoTset, UKM-IDS20, CTU-13, and NSL-KDD, demonstrates the framework's superior performance, achieving detection accuracies of 95.34%, 99.92%, 98.42%, and 99.34%, respectively. Additionally, the framework excels in reducing false positives and adapting to complex attack scenarios, outperforming existing state-of-the-art methods. Key strengths of LENS-XAI include its lightweight design, suitable for resource-constrained environments, and its scalability across diverse IIoT and cybersecurity contexts. Moreover, the explainability module enhances trust and transparency, critical for practical deployment in dynamic and sensitive applications. This research contributes significantly to advancing IDS by addressing computational efficiency, feature interpretability, and real-world applicability. Future work could focus on extending the framework to ensemble AI systems for distributed environments, further enhancing its robustness and adaptability.
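Of the framework's ingredients, knowledge distillation follows a standard soft-target recipe; below is a generic PyTorch sketch of that objective, not the exact LENS-XAI loss, whose weighting and VAE terms the abstract does not specify.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft term: KL between temperature-scaled teacher and student outputs.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean") * temperature ** 2
    # Hard term: ordinary cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 5), torch.randn(8, 5),
                         torch.randint(0, 5, (8,)))
```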
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a lightweight and explainable network security framework (LENS-XAI) for intrusion detection in cybersecurity, with no mention of Large Language Models (LLMs) or prompt engineering for text-based interactions, thus failing to meet all 'MUST' criteria.
[Decoupling Knowledge and Reasoning in Transformers: A Modular Architecture with Generalized Cross-Attention](https://arxiv.org/abs/2501.00823v2) arXiv ID: http://arxiv.org/abs/2501.00823v2
Abstract:
Transformers have achieved remarkable success across diverse domains, but their monolithic architecture presents challenges in interpretability, adaptability, and scalability. This paper introduces a novel modular Transformer architecture that explicitly decouples knowledge and reasoning through a generalized cross-attention mechanism to a globally shared knowledge base with layer-specific transformations, specifically designed for effective knowledge retrieval. Critically, we provide a rigorous mathematical derivation demonstrating that the Feed-Forward Network (FFN) in a standard Transformer is a specialized case (a closure) of this generalized cross-attention, revealing its role in implicit knowledge retrieval and validating our design. This theoretical framework provides a new lens for understanding FFNs and lays the foundation for future research exploring enhanced interpretability, adaptability, and scalability, enabling richer interplay with external knowledge bases and other systems.
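The claimed correspondence can be sketched informally, in the spirit of the key-value-memory view of feed-forward layers rather than the paper's exact derivation. Writing the FFN and a cross-attention over a static learned knowledge base side by side,

```latex
\mathrm{FFN}(x) = W_2\,\sigma(W_1 x), \qquad
\mathrm{Attn}(x; K, V) = V^{\top}\operatorname{softmax}(K x),
```

choosing K = W_1 and V^T = W_2 and relaxing the softmax to an elementwise nonlinearity σ recovers the FFN: the rows of W_1 act as keys queried by the token representation x, and the rows of W_2 act as the values retrieved from an implicit knowledge base.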
Decision Explanation: Original decision: REJECT The paper focuses primarily on the development of a novel modular Transformer architecture, decoupling knowledge and reasoning, which falls under developing new LLM architectures, violating the 'MUST NOT' criteria (1). Prompt engineering for text-based interactions with LLMs is not the central focus.
[What is a Social Media Bot? A Global Comparison of Bot and Human Characteristics](https://arxiv.org/abs/2501.00855v1) arXiv ID: http://arxiv.org/abs/2501.00855v1
Abstract:
Chatter on social media is 20% bots and 80% humans. Chatter by bots and humans is consistently different: bots tend to use linguistic cues that can be easily automated, while humans use cues that require dialogue understanding. Bots use words that match the identities they choose to present, while humans may send messages that are not related to the identities they present. Bots and humans also differ in their communication structure: sampled bots have a star interaction structure, while sampled humans have a hierarchical structure. These conclusions are based on a large-scale analysis of social media tweets across ~200 million users across 7 events. Social media bots took the world by storm when social-cybersecurity researchers realized that social media users consisted not only of humans but also of artificial agents called bots. These bots wreak havoc online by spreading disinformation and manipulating narratives. Most research on bots is based on special-purpose definitions, mostly predicated on the event studied. This article begins by asking, "What is a bot?", and studies the underlying principles of how bots differ from humans. We develop a first-principles definition of a social media bot. With this definition as a premise, we systematically compare characteristics between bots and humans across global events, and reflect on how the software-programmed bot is an artificially intelligent algorithm and on its potential for evolution as technology advances. Based on our results, we provide recommendations for the use and regulation of bots. Finally, we discuss open challenges and future directions: Detect, to systematically identify these automated and potentially evolving bots; Differentiate, to evaluate the goodness of the bot in terms of their content postings and relationship interactions; Disrupt, to moderate the impact of malicious bots.
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria as it primarily focuses on analyzing social media bot characteristics, their differentiation from humans, and regulatory recommendations, rather than the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs).
DiffETM: Diffusion Process Enhanced Embedded Topic Model
arXiv ID: http://arxiv.org/abs/2501.00862v1
Abstract:
The embedded topic model (ETM) is a widely used approach that assumes the sampled document-topic distribution conforms to the logistic normal distribution for easier optimization. However, this assumption oversimplifies the real document-topic distribution, limiting the model's performance. In response, we propose a novel method that introduces the diffusion process into the sampling process of document-topic distribution to overcome this limitation and maintain an easy optimization process. We validate our method through extensive experiments on two mainstream datasets, proving its effectiveness in improving topic modeling performance.
Decision Explanation: Original decision: REJECT The paper does not focus on prompt engineering for Large Language Models (LLMs), instead proposing a method for enhancing an Embedded Topic Model using a diffusion process, which does not meet the primary criteria of investigating, analyzing, or proposing methods for improving LLM performance through textual input prompt manipulation.
Representation in large language models
arXiv ID: http://arxiv.org/abs/2501.00885v1
Abstract:
The extraordinary success of recent Large Language Models (LLMs) on a diverse array of tasks has led to an explosion of scientific and philosophical theorizing aimed at explaining how they do what they do. Unfortunately, disagreement over fundamental theoretical issues has led to stalemate, with entrenched camps of LLM optimists and pessimists often committed to very different views of how these systems work. Overcoming stalemate requires agreement on fundamental questions, and the goal of this paper is to address one such question, namely: is LLM behavior driven partly by representation-based information processing of the sort implicated in biological cognition, or is it driven entirely by processes of memorization and stochastic table look-up? This is a question about what kind of algorithm LLMs implement, and the answer carries serious implications for higher level questions about whether these systems have beliefs, intentions, concepts, knowledge, and understanding. I argue that LLM behavior is partially driven by representation-based information processing, and then I describe and defend a series of practical techniques for investigating these representations and developing explanations on their basis. The resulting account provides a groundwork for future theorizing about language models and their successors.
Decision Explanation: Original decision: REJECT The paper focuses on the theoretical foundations and internal workings of Large Language Models (LLMs), specifically whether LLM behavior is driven by representation-based information processing. While it touches on understanding LLMs, it does not primarily investigate, analyze, or propose methods for improving LLM performance through the manipulation of textual input prompts, nor does it provide concrete examples of prompts and their impact on LLM output.
[Demystifying Online Clustering of Bandits: Enhanced Exploration Under Stochastic and Smoothed Adversarial Contexts](https://arxiv.org/abs/2501.00891v1) arXiv ID: http://arxiv.org/abs/2501.00891v1
Abstract:
The contextual multi-armed bandit (MAB) problem is crucial in sequential decision-making. A line of research, known as online clustering of bandits, extends contextual MAB by grouping similar users into clusters, utilizing shared features to improve learning efficiency. However, existing algorithms, which rely on the upper confidence bound (UCB) strategy, struggle to gather adequate statistical information to accurately identify unknown user clusters. As a result, their theoretical analyses require several strong assumptions about the "diversity" of contexts generated by the environment, leading to impractical settings, complicated analyses, and poor practical performance. Removing these assumptions has been a long-standing open problem in the clustering of bandits literature. In this paper, we provide two solutions to this open problem. First, following the i.i.d. context generation setting in existing studies, we propose two novel algorithms, UniCLUB and PhaseUniCLUB, which incorporate enhanced exploration mechanisms to accelerate cluster identification. Remarkably, our algorithms require substantially weaker assumptions while achieving regret bounds comparable to prior work. Second, inspired by the smoothed analysis framework, we propose a more practical setting that eliminates the requirement for i.i.d. context generation used in previous studies, thus enhancing the performance of existing algorithms for online clustering of bandits. Our technique can be applied to both graph-based and set-based clustering of bandits frameworks. Extensive evaluations on both synthetic and real-world datasets demonstrate that our proposed algorithms consistently outperform existing approaches.
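For orientation, the UCB strategy that these clustering algorithms build on scores each arm by its empirical mean plus an exploration bonus. Below is the classic UCB1 rule in a few lines; the paper's UniCLUB and PhaseUniCLUB add clustering and enhanced exploration on top of rules like this.

```python
import numpy as np

def ucb1_select(pulls, reward_sums, t):
    """Classic UCB1: pull each arm once, then maximize mean + bonus."""
    if np.any(pulls == 0):
        return int(np.argmin(pulls))           # try an unpulled arm first
    bonus = np.sqrt(2.0 * np.log(t) / pulls)   # shrinks as an arm is explored
    return int(np.argmax(reward_sums / pulls + bonus))
```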
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria as it focuses on online clustering of bandits, multi-armed bandit problems, and sequential decision-making, with no primary focus on engineering, design, or optimization of prompts for Large Language Models (LLMs) or manipulation of textual input prompts to improve LLM performance.
[Large Language Model Based Multi-Agent System Augmented Complex Event Processing Pipeline for Internet of Multimedia Things](https://arxiv.org/abs/2501.00906v2) arXiv ID: http://arxiv.org/abs/2501.00906v2
Abstract:
This paper presents the development and evaluation of a Large Language Model (LLM, also known as a foundation model) based multi-agent system framework for complex event processing (CEP), with a focus on video query processing use cases. The primary goal is to create a proof-of-concept (POC) that integrates state-of-the-art LLM orchestration frameworks with publish/subscribe (pub/sub) tools to address the integration of LLMs with current CEP systems. Utilizing the Autogen framework in conjunction with Kafka message brokers, the system demonstrates an autonomous CEP pipeline capable of handling complex workflows. Extensive experiments evaluate the system's performance across varying configurations, complexities, and video resolutions, revealing the trade-offs between functionality and latency. The results show that while higher agent counts and video complexities increase latency, the system maintains high consistency in narrative coherence. This research builds upon, and contributes to, existing novel approaches to distributed AI systems, offering detailed insights into integrating such systems into existing infrastructures.
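The pub/sub backbone of such a pipeline is simple to sketch with kafka-python; the topic names and the describe_event helper below are hypothetical placeholders, and the Autogen multi-agent orchestration the paper uses is elided behind that single function.

```python
import json
from kafka import KafkaConsumer, KafkaProducer  # kafka-python

def describe_event(event: dict) -> str:
    # Hypothetical stand-in for the LLM agent team that turns a video
    # event into a narrative (the paper orchestrates this with Autogen).
    return f"narrative for {event}"

consumer = KafkaConsumer(
    "video-events", bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")))
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda o: json.dumps(o).encode("utf-8"))

for msg in consumer:  # each message is one CEP event
    producer.send("cep-narratives", {"text": describe_event(msg.value)})
```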
Decision Explanation: Original decision: REJECT The paper's primary focus is on integrating Large Language Models into a multi-agent system for complex event processing, not on prompt engineering for text-based interactions with LLMs, failing to meet the core subject requirement.
[Enhancing Early Diabetic Retinopathy Detection through Synthetic DR1 Image Generation: A StyleGAN3 Approach](https://arxiv.org/abs/2501.00954v1) arXiv ID: http://arxiv.org/abs/2501.00954v1
Abstract:
Diabetic Retinopathy (DR) is a leading cause of preventable blindness. Early detection at the DR1 stage is critical but is hindered by a scarcity of high-quality fundus images. This study uses StyleGAN3 to generate synthetic DR1 images characterized by microaneurysms with high fidelity and diversity. The aim is to address data scarcity and enhance the performance of supervised classifiers. A dataset of 2,602 DR1 images was used to train the model, followed by a comprehensive evaluation using quantitative metrics, including Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and equivariance with respect to translation (EQ-T) and rotation (EQ-R). Qualitative assessments included human Turing tests, where trained ophthalmologists evaluated the realism of synthetic images. Spectral analysis further validated image quality. The model achieved a final FID score of 17.29, outperforming the mean FID of 21.18 (95% confidence interval: 20.83 to 21.56) derived from bootstrap resampling. Human Turing tests demonstrated the model's ability to produce highly realistic images, though minor artifacts near the borders were noted. These findings suggest that StyleGAN3-generated synthetic DR1 images hold significant promise for augmenting training datasets, enabling more accurate early detection of Diabetic Retinopathy. This methodology highlights the potential of synthetic data in advancing medical imaging and AI-driven diagnostics.
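For reference, the FID metric reported above compares Gaussian statistics of Inception features between real (r) and generated (g) images:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\bigl(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\bigr)
```

where μ and Σ are the feature mean and covariance; lower is better, so the model's 17.29 against the bootstrap mean of 21.18 indicates closer agreement with real DR1 images.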
Decision Explanation: Original decision: REJECT The paper primarily focuses on medical imaging (Diabetic Retinopathy detection) using StyleGAN3 for synthetic image generation, and does not investigate, analyze, or propose methods for improving Large Language Model (LLM) performance through the manipulation of textual input prompts, thus failing to meet the core 'MUST' criteria.
[The Silent Majority: Demystifying Memorization Effect in the Presence of Spurious Correlations](https://arxiv.org/abs/2501.00961v2) arXiv ID: http://arxiv.org/abs/2501.00961v2
Abstract:
Machine learning models often rely on simple spurious features -- patterns in training data that correlate with targets but are not causally related to them, like image backgrounds in foreground classification. This reliance typically leads to imbalanced test performance across minority and majority groups. In this work, we take a closer look at the fundamental cause of such imbalanced performance through the lens of memorization, which refers to the ability to predict accurately on atypical examples (minority groups) in the training set while failing to achieve the same accuracy on the testing set. This paper systematically shows the ubiquitous existence of spurious features in a small set of neurons within the network, providing the first-ever evidence that memorization may contribute to imbalanced group performance. Through three experimental sources of converging empirical evidence, we find that a small subset of neurons or channels memorizes minority-group information. Inspired by these findings, we articulate the hypothesis: the imbalanced group performance is a byproduct of "noisy" spurious memorization confined to a small set of neurons. To further substantiate this hypothesis, we show that eliminating these unnecessary spurious memorization patterns via a novel framework during training can significantly affect model performance on minority groups. Our experimental results across various architectures and benchmarks offer new insights on how neural networks encode core and spurious knowledge, laying the groundwork for future research in demystifying robustness to spurious correlations.
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria, as it primarily focuses on understanding memorization effects and spurious correlations in machine learning models, rather than the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), and does not provide concrete examples of prompts and their impact on LLM output.
[FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving](https://arxiv.org/abs/2501.01005v1) arXiv ID: http://arxiv.org/abs/2501.01005v1
Abstract:
Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer's load-balanced scheduling algorithm adjusts to the dynamism of user requests while maintaining compatibility with CUDAGraph, which requires static configuration. FlashInfer has been integrated into leading LLM serving frameworks like SGLang, vLLM and MLC-Engine. Comprehensive kernel-level and end-to-end evaluations demonstrate FlashInfer's ability to significantly boost kernel performance across diverse inference scenarios: compared to state-of-the-art LLM serving solutions, FlashInfer achieves a 29-69% inter-token latency reduction compared to compiler backends on an LLM serving benchmark, a 28-30% latency reduction for long-context inference, and a 13-17% speedup for LLM serving with parallel generation.
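A paged, block-structured KV-cache layout can be illustrated with a toy page table; this is a sketch of the general paged-storage idea only, not FlashInfer's actual data structures or API.

```python
PAGE_SIZE = 16  # tokens per page

class PagedKVCache:
    """Toy paged KV-cache: tokens append into fixed-size pages; a per-sequence
    page table maps logical positions to physical pages."""
    def __init__(self):
        self.pages = []        # physical storage: each page is a list of (k, v)
        self.page_table = []   # this sequence's page ids, in logical order

    def append(self, k, v):
        if not self.page_table or len(self.pages[self.page_table[-1]]) == PAGE_SIZE:
            self.page_table.append(len(self.pages))   # allocate a fresh page
            self.pages.append([])
        self.pages[self.page_table[-1]].append((k, v))

    def get(self, pos):
        # logical token position -> stored (k, v) pair
        return self.pages[self.page_table[pos // PAGE_SIZE]][pos % PAGE_SIZE]
```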
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing an efficient attention engine for Large Language Model (LLM) inference serving, which falls under the development of new LLM architectures or training methods, and does not centrally address prompt engineering for text-based interactions with LLMs.
[CryptoMamba: Leveraging State Space Models for Accurate Bitcoin Price Prediction](https://arxiv.org/abs/2501.01010v1) arXiv ID: http://arxiv.org/abs/2501.01010v1
Abstract:
Predicting Bitcoin price remains a challenging problem due to the high volatility and complex non-linear dynamics of cryptocurrency markets. Traditional time-series models, such as ARIMA and GARCH, and recurrent neural networks, like LSTMs, have been widely applied to this task but struggle to capture the regime shifts and long-range dependencies inherent in the data. In this work, we propose CryptoMamba, a novel Mamba-based State Space Model (SSM) architecture designed to effectively capture long-range dependencies in financial time-series data. Our experiments show that CryptoMamba not only provides more accurate predictions but also offers enhanced generalizability across different market conditions, surpassing the limitations of previous models. Coupled with trading algorithms for real-world scenarios, CryptoMamba demonstrates its practical utility by translating accurate forecasts into financial outcomes. Our findings signal a huge advantage for SSMs in stock and cryptocurrency price forecasting tasks.
Decision Explanation: Original decision: REJECT The paper focuses on developing a State Space Model (SSM) for Bitcoin price prediction, with no primary emphasis on the engineering, design, or optimization of prompts for Large Language Models (LLMs), failing to meet the core subject requirement.
[ValuesRAG: Enhancing Cultural Alignment Through Retrieval-Augmented Contextual Learning](https://arxiv.org/abs/2501.01031v2) arXiv ID: http://arxiv.org/abs/2501.01031v2
Abstract:
Cultural values alignment in Large Language Models (LLMs) is a critical challenge due to their tendency to embed Western-centric biases from training data, leading to misrepresentations and fairness issues in cross-cultural contexts. Recent approaches, such as role assignment and few-shot learning, often struggle with reliable cultural alignment as they heavily rely on pre-trained knowledge, lack scalability, and fail to capture nuanced cultural values effectively. To address these issues, we propose ValuesRAG, a novel and effective framework that applies Retrieval-Augmented Generation (RAG) with In-Context Learning (ICL) to integrate cultural and demographic knowledge dynamically during text generation. Leveraging the World Values Survey (WVS) dataset, ValuesRAG first generates summaries of values for each individual. Subsequently, we curate several representative regional datasets to serve as test datasets and retrieve relevant summaries of values based on demographic features, followed by a reranking step to select the top-k relevant summaries. ValuesRAG consistently outperforms baseline methods, both in the main experiment and in an ablation study where only the values summary was provided. Notably, ValuesRAG demonstrates an accuracy improvement of 21% over the other baseline methods, highlighting its potential to foster culturally aligned AI systems and enhance the inclusivity of AI-driven applications.
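The retrieve-rerank-generate loop is easy to sketch. In the snippet below, embed, rerank, generate, summary_matrix, value_summaries, and question are all hypothetical stand-ins for the deployed LLM's embedding layer, the reranker, the generator, and the WVS-derived data; none of them are specified in code form by the abstract.

```python
import numpy as np

def retrieve_top_k(query_vec, summary_vecs, k):
    # Cosine-similarity retrieval over embedded value summaries.
    sims = summary_vecs @ query_vec / (
        np.linalg.norm(summary_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return np.argsort(-sims)[:k]

demographics = "age group: 30-39; country: KR; education: tertiary"
idx = retrieve_top_k(embed(demographics), summary_matrix, k=20)
best = rerank(demographics, [value_summaries[i] for i in idx])[:5]
prompt = ("Relevant cultural value summaries:\n" + "\n".join(best)
          + "\n\nAnswer the following survey question accordingly:\n" + question)
answer = generate(prompt)
```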
Decision Explanation: Original decision: REJECT The paper primarily focuses on enhancing cultural alignment in LLMs through Retrieval-Augmented Generation and In-Context Learning, rather than specifically investigating, analyzing, or proposing methods for improving LLM performance through the manipulation of textual input prompts.
[MSC-Bench: Benchmarking and Analyzing Multi-Sensor Corruption for Driving Perception](https://arxiv.org/abs/2501.01037v1) arXiv ID: http://arxiv.org/abs/2501.01037v1
Abstract:
Multi-sensor fusion models play a crucial role in autonomous driving perception, particularly in tasks like 3D object detection and HD map construction. These models provide essential and comprehensive static environmental information for autonomous driving systems. While camera-LiDAR fusion methods have shown promising results by integrating data from both modalities, they often depend on complete sensor inputs. This reliance can lead to low robustness and potential failures when sensors are corrupted or missing, raising significant safety concerns. To tackle this challenge, we introduce the Multi-Sensor Corruption Benchmark (MSC-Bench), the first comprehensive benchmark aimed at evaluating the robustness of multi-sensor autonomous driving perception models against various sensor corruptions. Our benchmark includes 16 combinations of corruption types that disrupt both camera and LiDAR inputs, either individually or concurrently. Extensive evaluations of six 3D object detection models and four HD map construction models reveal substantial performance degradation under adverse weather conditions and sensor failures, underscoring critical safety issues. The benchmark toolkit and affiliated code and model checkpoints have been made publicly accessible.
Decision Explanation: Original decision: REJECT The paper focuses primarily on autonomous driving perception, multi-sensor fusion, and benchmarking for robustness against sensor corruptions, with no mention or emphasis on prompt engineering for Large Language Models (LLMs) or text generation, thus failing to meet the 'MUST' criteria.
[MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions](https://arxiv.org/abs/2501.01094v1) arXiv ID: http://arxiv.org/abs/2501.01094v1
Abstract:
We introduce Multimodal Matching based on Valence and Arousal (MMVA), a tri-modal encoder framework designed to capture emotional content across images, music, and musical captions. To support this framework, we expand the Image-Music-Emotion-Matching-Net (IMEMNet) dataset, creating IMEMNet-C, which includes 24,756 images and 25,944 music clips with corresponding musical captions. We employ multimodal matching scores based on the continuous valence (emotional positivity) and arousal (emotional intensity) values. This continuous matching score allows for random sampling of image-music pairs during training by computing similarity scores from the valence-arousal values across different modalities. Consequently, the proposed approach achieves state-of-the-art performance in valence-arousal prediction tasks. Furthermore, the framework demonstrates its efficacy in various zero-shot tasks, highlighting the potential of valence and arousal predictions in downstream applications.
Decision Explanation: Original decision: REJECT The paper focuses on multimodal matching across images, music, and musical captions, primarily for valence and arousal prediction, with no clear emphasis on the engineering, design, or optimization of textual input prompts specifically for Large Language Models (LLMs).
[Disambiguation of Chinese Polyphones in an End-to-End Framework with
Semantic Features Extracted by Pre-trained BERT](https://arxiv.org/abs/2501.01102v1) arXiv ID: http://arxiv.org/abs/2501.01102v1
Abstract:
Grapheme-to-phoneme (G2P) conversion serves as an essential component in Chinese Mandarin text-to-speech (TTS) systems, where polyphone disambiguation is the core issue. In this paper, we propose an end-to-end framework to predict the pronunciation of a polyphonic character, which accepts a sentence containing the polyphonic character as input, in the form of a Chinese character sequence, without the necessity of any preprocessing. The proposed method consists of a pre-trained bidirectional encoder representations from Transformers (BERT) model and a neural network (NN) based classifier. The pre-trained BERT model extracts semantic features from a raw Chinese character sequence and the NN based classifier predicts the polyphonic character's pronunciation according to BERT output. In our experiments, we implemented three classifiers: a fully-connected network based classifier, a long short-term memory (LSTM) network based classifier, and a Transformer block based classifier. The experimental results, compared with the LSTM-based baseline approach, demonstrate that the pre-trained model extracts effective semantic features, which greatly enhances the performance of polyphone disambiguation. In addition, we also explored the impact of contextual information on polyphone disambiguation.
Decision Explanation: Original decision: REJECT The paper focuses on polyphone disambiguation in Chinese text-to-speech (TTS) systems using pre-trained BERT for semantic feature extraction, which does not meet the primary criteria of focusing on engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) or investigating methods for improving LLM performance through prompt manipulation.
[learning discriminative features from spectrograms using center loss for
speech emotion recognition](https://arxiv.org/abs/2501.01103v1) arXiv ID: http://arxiv.org/abs/2501.01103v1
Abstract:
Identifying the emotional state from speech is essential for the natural interaction of the machine with the speaker. However, extracting effective features for emotion recognition is difficult, as emotions are ambiguous. We propose a novel approach to learn discriminative features from variable length spectrograms for emotion recognition by cooperating softmax cross-entropy loss and center loss together. The softmax cross-entropy loss enables features from different emotion categories separable, and center loss efficiently pulls the features belonging to the same emotion category to their center. By combining the two losses together, the discriminative power will be highly enhanced, which leads to network learning more effective features for emotion recognition. As demonstrated by the experimental results, after introducing center loss, both the unweighted accuracy and weighted accuracy are improved by over 3% on Mel-spectrogram input, and more than 4% on Short Time Fourier Transform spectrogram input.
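The combined objective is the classic softmax cross-entropy plus center loss pairing (Wen et al., 2016). A minimal PyTorch sketch, with feature dimension, class count, and the 0.1 weighting chosen purely for illustration:

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Pulls each feature vector toward the learned center of its
    emotion class; cross-entropy keeps the classes separable."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        return ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

ce = nn.CrossEntropyLoss()
center = CenterLoss(num_classes=4, feat_dim=128)
feats, logits = torch.randn(8, 128), torch.randn(8, 4)   # from the encoder
labels = torch.randint(0, 4, (8,))
loss = ce(logits, labels) + 0.1 * center(feats, labels)  # joint objective
```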
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria as it primarily focuses on speech emotion recognition using spectrograms and novel loss functions for feature learning, with no mention of Large Language Models (LLMs), prompt engineering, or text-based interactions, thereby failing to meet the core subject requirement.
[Robust COVID-19 Detection from Cough Sounds using Deep Neural Decision
Tree and Forest: A Comprehensive Cross-Datasets Evaluation](https://arxiv.org/abs/2501.01117v1) arXiv ID: http://arxiv.org/abs/2501.01117v1
Abstract:
This research presents a robust approach to classifying COVID-19 cough sounds using cutting-edge machine-learning techniques. Leveraging deep neural decision trees and deep neural decision forests, our methodology demonstrates consistent performance across diverse cough sound datasets. We begin with a comprehensive feature extraction step to capture a wide range of audio characteristics from individuals, whether COVID-19 positive or negative. To determine the most important features, we use recursive feature elimination along with cross-validation. Bayesian optimization fine-tunes the hyper-parameters of the deep neural decision tree and deep neural decision forest models. Additionally, we integrate SMOTE during training to ensure a balanced representation of positive and negative data. Model performance refinement is achieved through threshold optimization, maximizing the ROC-AUC score. Our approach undergoes a comprehensive evaluation on five datasets: Cambridge, Coswara, COUGHVID, Virufy, and the combined Virufy with the NoCoCoDa dataset. Consistently outperforming state-of-the-art methods, our proposed approach yields notable AUC scores of 0.97, 0.98, 0.92, 0.93, 0.99, and 0.99 across the respective datasets. Merging all datasets into a combined dataset, our method, using a deep neural decision forest classifier, achieves an AUC of 0.97. Also, our study includes a comprehensive cross-datasets analysis, revealing demographic and geographic differences in the cough sounds associated with COVID-19. These differences highlight the challenges in transferring learned features across diverse datasets and underscore the potential benefits of dataset integration, improving generalizability and enhancing COVID-19 detection from audio signals.
Decision Explanation: Original decision: REJECT The paper focuses primarily on COVID-19 detection from cough sounds using deep neural decision trees and forests, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thus failing to meet all 'MUST' criteria.
[TED: Turn Emphasis with Dialogue Feature Attention for Emotion
Recognition in Conversation](https://arxiv.org/abs/2501.01123v1) arXiv ID: http://arxiv.org/abs/2501.01123v1
Abstract:
Emotion recognition in conversation (ERC) has been attracting attention by methods for modeling multi-turn contexts. The multi-turn input to a pretraining model implicitly assumes that the current turn and other turns are distinguished during the training process by inserting special tokens into the input sequence. This paper proposes a priority-based attention method to distinguish each turn explicitly by adding dialogue features into the attention mechanism, called Turn Emphasis with Dialogue (TED). It has a priority for each turn according to turn position and speaker information as dialogue features. It takes multi-head self-attention between turn-based vectors for multi-turn input and adjusts attention scores with the dialogue features. We evaluate TED on four typical benchmarks. The experimental results demonstrate that TED has high overall performance in all datasets and achieves state-of-the-art performance on IEMOCAP with numerous turns.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on improving emotion recognition in conversation using a priority-based attention method (TED), not on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), and does not demonstrate the impact of textual input prompts on LLM output." }
[Missing Data as Augmentation in the Earth Observation Domain: A
Multi-View Learning Approach](https://arxiv.org/abs/2501.01132v1) arXiv ID: http://arxiv.org/abs/2501.01132v1
Abstract:
Multi-view learning (MVL) leverages multiple sources or views of data to enhance machine learning model performance and robustness. This approach has been successfully used in the Earth Observation (EO) domain, where views have a heterogeneous nature and can be affected by missing data. Despite the negative effect that missing data has on model predictions, the ML literature has used it as an augmentation technique to improve model generalization, for example by masking the input data. Inspired by this, we introduce novel methods for EO applications tailored to MVL with missing views. Our methods simulate all combinations of missing views as distinct training samples. Instead of replacing missing data with a numerical value, we use dynamic merge functions, such as averaging, and more complex ones such as Transformers. This allows the MVL model to entirely ignore the missing views, enhancing its predictive robustness. We experiment on four EO datasets with temporal and static views, including state-of-the-art methods from the EO domain. The results indicate that our methods improve model robustness under conditions of moderate missingness, and improve the predictive performance when all views are present. The proposed methods offer a single adaptive solution to operate effectively with any combination of available views.
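The simplest of the dynamic merge functions mentioned above, a masked average that lets the model ignore missing views entirely, can be sketched as follows (shapes and the random drop rate are assumptions; the paper also uses Transformer-based merges):

```python
import torch

def merge_views(views, mask):
    """Average the available view embeddings, skipping missing ones.
    views: (B, V, d) per-view features; mask: (B, V) True where present."""
    m = mask.unsqueeze(-1).float()
    return (views * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)

# "missing data as augmentation": randomly drop views during training
views = torch.randn(4, 3, 16)
mask = torch.rand(4, 3) > 0.3
fused = merge_views(views, mask)
```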
Decision Explanation: Original decision: REJECT The paper focuses on multi-view learning for Earth Observation (EO) with missing data augmentation, lacking any direct connection to prompt engineering, Large Language Models (LLMs), or text generation, thus failing to meet the primary 'MUST' criteria.
TexAVi: Generating Stereoscopic VR Video Clips from Text Descriptions
arXiv ID: http://arxiv.org/abs/2501.01156v1
Abstract:
While generative models such as text-to-image, large language models and text-to-video have seen significant progress, the extension to text-to-virtual-reality remains largely unexplored, due to a deficit in training data and the complexity of achieving realistic depth and motion in virtual environments. This paper proposes an approach to coalesce existing generative systems to form a stereoscopic virtual reality video from text. Carried out in three main stages, we start with a base text-to-image model that captures context from an input text. We then employ Stable Diffusion on the rudimentary image produced, to generate frames with enhanced realism and overall quality. These frames are processed with depth estimation algorithms to create left-eye and right-eye views, which are stitched side-by-side to create an immersive viewing experience. Such systems would be highly beneficial in virtual reality production, since filming and scene building often require extensive hours of work and post-production effort. We utilize image evaluation techniques, specifically Fréchet Inception Distance and CLIP Score, to assess the visual quality of frames produced for the video. These quantitative measures establish the proficiency of the proposed method. Our work highlights the exciting possibilities of using natural language-driven graphics in fields like virtual reality simulations.
Decision Explanation: Original decision: REJECT The paper primarily focuses on generating stereoscopic VR video clips from text descriptions, which falls under image/video generation rather than text generation driven by Large Language Models (LLMs), and does not meet the criteria of investigating prompt engineering for text-based interactions with LLMs.
[Harnessing Multi-Agent LLMs for Complex Engineering Problem-Solving: A
Framework for Senior Design Projects](https://arxiv.org/abs/2501.01205v1) arXiv ID: http://arxiv.org/abs/2501.01205v1
Abstract:
Multi-Agent Large Language Models (LLMs) are gaining significant attention for their ability to harness collective intelligence in complex problem-solving, decision-making, and planning tasks. This aligns with the concept of the wisdom of crowds, where diverse agents contribute collectively to generating effective solutions, making it particularly suitable for educational settings. Senior design projects, also known as capstone or final year projects, are pivotal in engineering education as they integrate theoretical knowledge with practical application, fostering critical thinking, teamwork, and real-world problem-solving skills. In this paper, we explore the use of Multi-Agent LLMs in supporting these senior design projects undertaken by engineering students, which often involve multidisciplinary considerations and conflicting objectives, such as optimizing technical performance while addressing ethical, social, and environmental concerns. We propose a framework where distinct LLM agents represent different expert perspectives, such as problem formulation agents, system complexity agents, societal and ethical agents, or project managers, thus facilitating a holistic problem-solving approach. This implementation leverages standard multi-agent system (MAS) concepts such as coordination, cooperation, and negotiation, incorporating prompt engineering to develop diverse personas for each agent. These agents engage in rich, collaborative dialogues to simulate human engineering teams, guided by principles from swarm AI to efficiently balance individual contributions towards a unified solution. We adapt these techniques to create a collaboration structure for LLM agents, encouraging interdisciplinary reasoning and negotiation similar to real-world senior design projects. To assess the efficacy of this framework, we collected six proposals of engineering and computer science of...
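The persona construction the abstract mentions amounts to role-conditioned prompting plus a shared transcript. A hedged sketch, with hypothetical persona texts and a generic `llm(prompt)` callable standing in for the deployed model:

```python
PERSONAS = {  # illustrative roles mirroring those named in the abstract
    "problem_formulation": "You are an engineer who frames requirements and scope.",
    "societal_ethics": "You audit designs for ethical, social, and environmental impact.",
    "project_manager": "You coordinate the team and synthesize a final recommendation.",
}

def persona_prompt(role, transcript, proposal):
    """Build one agent's role-conditioned prompt (the template wording is
    an assumption, not the paper's exact prompts)."""
    return (f"{PERSONAS[role]}\n\nSenior design proposal:\n{proposal}\n\n"
            f"Discussion so far:\n{transcript}\n\n"
            "Respond in character with your expert critique:")

def deliberate(llm, proposal, rounds=2):
    """A simple stand-in for the MAS coordination loop: each persona
    speaks every round, seeing the shared transcript."""
    transcript = ""
    for _ in range(rounds):
        for role in PERSONAS:
            turn = llm(persona_prompt(role, transcript, proposal))
            transcript += f"\n[{role}] {turn}"
    return transcript
```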
Decision Explanation: Original decision: REJECT The paper's primary focus is on utilizing Multi-Agent LLMs for complex engineering problem-solving in educational settings, rather than specifically on the engineering, design, or optimization of prompts for Large Language Models, as evidenced by its emphasis on a framework for senior design projects and multi-agent system concepts.
[Face-Human-Bench: A Comprehensive Benchmark of Face and Human
Understanding for Multi-modal Assistants](https://arxiv.org/abs/2501.01243v2) arXiv ID: http://arxiv.org/abs/2501.01243v2
Abstract:
Faces and humans are crucial elements in social interaction and are widely included in everyday photos and videos. Therefore, a deep understanding of faces and humans will enable multi-modal assistants to achieve improved response quality and broadened application scope. Currently, the multi-modal assistant community lacks a comprehensive and scientific evaluation of face and human understanding abilities. In this paper, we first propose a hierarchical ability taxonomy that includes three levels of abilities. Then, based on this taxonomy, we collect images and annotations from publicly available datasets in the face and human community and build a semi-automatic data pipeline to produce problems for the new benchmark. Finally, the obtained Face-Human-Bench comprises a development set with 900 problems and a test set with 1800 problems, supporting both English and Chinese. We conduct evaluations over 25 mainstream multi-modal large language models (MLLMs) with our Face-Human-Bench, focusing on the correlation between abilities, the impact of the relative position of targets on performance, and the impact of Chain of Thought (CoT) prompting on performance. Moreover, inspired by multi-modal agents, we also explore which abilities of MLLMs need to be supplemented by specialist models.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper's primary focus is on evaluating face and human understanding abilities for multi-modal assistants, not specifically on engineering, design, or optimization of prompts for Large Language Models (LLMs). While it mentions evaluating the impact of Chain of Thought (CoT) prompting, this is not the central concern, but rather one aspect of a broader evaluation." }
ProgCo: Program Helps Self-Correction of Large Language Models
arXiv ID: http://arxiv.org/abs/2501.01264v1
Abstract:
Self-Correction aims to enable large language models (LLMs) to self-verify and self-refine their initial responses without external feedback. However, LLMs often fail to effectively self-verify and generate correct feedback, further misleading refinement and leading to the failure of self-correction, especially in complex reasoning tasks. In this paper, we propose Program-driven Self-Correction (ProgCo). First, program-driven verification (ProgVe) achieves complex verification logic and extensive validation through self-generated, self-executing verification pseudo-programs. Then, program-driven refinement (ProgRe) receives feedback from ProgVe, conducts dual reflection and refinement on both responses and verification programs to mitigate the misleading effect of incorrect feedback in complex reasoning tasks. Experiments on three instruction-following and mathematical benchmarks indicate that ProgCo achieves effective self-correction, and can further enhance performance when combined with real program tools.
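The generate, program-verify, refine loop can be sketched as below. `llm(prompt)` is a hypothetical text-completion callable, and the verification "pseudo-program" is approximated by a prompted check rather than actual program execution, so this is a loose sketch of the idea, not the paper's method:

```python
def progco_style_self_correction(task, llm, max_rounds=3):
    """Generate an answer, have the model verify it via a self-written
    check, and refine on FAIL; stop early on PASS."""
    response = llm(f"Solve the task:\n{task}")
    for _ in range(max_rounds):
        verdict = llm(
            "Write and execute a verification pseudo-program for this "
            "answer, then output PASS or FAIL with feedback.\n"
            f"Task: {task}\nAnswer: {response}")
        if verdict.strip().startswith("PASS"):
            break
        # dual reflection: the refinement sees both answer and feedback
        response = llm(
            f"Task: {task}\nPrevious answer: {response}\n"
            f"Verifier feedback: {verdict}\nProduce a corrected answer.")
    return response
```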
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a self-correction methodology for Large Language Models (LLMs) using program-driven approaches, rather than specifically engineering, designing, or optimizing textual input prompts to improve LLM performance, not meeting the core 'MUST' criteria.
[CultureVLM: Characterizing and Improving Cultural Understanding of
Vision-Language Models for over 100 Countries](https://arxiv.org/abs/2501.01282v1) arXiv ID: http://arxiv.org/abs/2501.01282v1
Abstract:
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding, often misinterpreting symbols, gestures, and artifacts due to biases in predominantly Western-centric training data. In this paper, we construct CultureVerse, a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural topics, and 3 question types, with the aim of characterizing and improving VLMs' multicultural understanding capabilities. Then, we propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding. Our evaluation of 16 models reveals significant disparities, with a stronger performance in Western concepts and weaker results in African and Asian contexts. Fine-tuning on our CultureVerse enhances cultural perception, demonstrating cross-cultural, cross-continent, and cross-dataset generalization without sacrificing performance on models' general VLM benchmarks. We further present insights on cultural generalization and forgetting. We hope that this work could lay the foundation for more equitable and culturally aware multimodal AI systems.
Decision Explanation: Original decision: REJECT The paper primarily focuses on fine-tuning Vision-Language Models (VLMs) for improved cultural understanding, which does not meet the 'MUST' criteria of focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs). Additionally, the paper's core subject is VLMs and cultural understanding, not prompt engineering for text-based interactions with LLMs.
[LEO-Split: A Semi-Supervised Split Learning Framework over LEO Satellite
Networks](https://arxiv.org/abs/2501.01293v1) arXiv ID: http://arxiv.org/abs/2501.01293v1
Abstract:
Recently, the increasing deployment of LEO satellite systems has enabled various space analytics (e.g., crop and climate monitoring), which heavily relies on the advancements in deep learning (DL). However, the intermittent connectivity between LEO satellites and ground station (GS) significantly hinders the timely transmission of raw data to GS for centralized learning, while the scaled-up DL models hamper distributed learning on resource-constrained LEO satellites. Though split learning (SL) can be a potential solution to these problems by partitioning a model and offloading primary training workload to GS, the labor-intensive labeling process remains an obstacle, with intermittent connectivity and data heterogeneity being other challenges. In this paper, we propose LEO-Split, a semi-supervised (SS) SL design tailored for satellite networks to combat these challenges. Leveraging SS learning to handle (labeled) data scarcity, we construct an auxiliary model to tackle the training failure of the satellite-GS non-contact time. Moreover, we propose a pseudo-labeling algorithm to rectify data imbalances across satellites. Lastly, an adaptive activation interpolation scheme is devised to prevent the overfitting of server-side sub-model training at GS. Extensive experiments with real-world LEO satellite traces (e.g., Starlink) demonstrate that our LEO-Split framework achieves superior performance compared to state-of-the-art benchmarks.
Decision Explanation: Original decision: REJECT The paper primarily focuses on developing a semi-supervised split learning framework for LEO satellite networks, with an emphasis on deep learning (DL) model optimization and data transmission challenges, rather than engineering or optimizing prompts for Large Language Models (LLMs).
[CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for
Benchmarking Large Language Models](https://arxiv.org/abs/2501.01335v1) arXiv ID: http://arxiv.org/abs/2501.01335v1
Abstract:
Numerous studies have investigated methods for jailbreaking Large Language Models (LLMs) to generate harmful content. Typically, these methods are evaluated using datasets of malicious prompts designed to bypass security policies established by LLM providers. However, the generally broad scope and open-ended nature of existing datasets can complicate the assessment of jailbreaking effectiveness, particularly in specific domains, notably cybersecurity. To address this issue, we present and publicly release CySecBench, a comprehensive dataset containing 12,662 prompts specifically designed to evaluate jailbreaking techniques in the cybersecurity domain. The dataset is organized into 10 distinct attack-type categories, featuring close-ended prompts to enable a more consistent and accurate assessment of jailbreaking attempts. Furthermore, we detail our methodology for dataset generation and filtration, which can be adapted to create similar datasets in other domains. To demonstrate the utility of CySecBench, we propose and evaluate a jailbreaking approach based on prompt obfuscation. Our experimental results show that this method successfully elicits harmful content from commercial black-box LLMs, achieving Success Rates (SRs) of 65% with ChatGPT and 88% with Gemini; in contrast, Claude demonstrated greater resilience with a jailbreaking SR of 17%. Compared to existing benchmark approaches, our method shows superior performance, highlighting the value of domain-specific evaluation datasets for assessing LLM security measures. Moreover, when evaluated using prompts from a widely used dataset (i.e., AdvBench), it achieved an SR of 78.5%, higher than the state-of-the-art methods.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper's primary focus is on evaluating and bypassing Large Language Models' security measures in the cybersecurity domain, rather than on the engineering, design, or optimization of prompts specifically for improving LLM performance through textual input manipulation." }
ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding
arXiv ID: http://arxiv.org/abs/2501.01366v1
Abstract:
3D visual grounding (3DVG) involves localizing entities in a 3D scene referred to by natural language text. Such models are useful for embodied AI and scene retrieval applications, which involve searching for objects or patterns using natural language descriptions. While recent works have focused on LLM-based scaling of 3DVG datasets, these datasets do not capture the full range of potential prompts which could be specified in the English language. To ensure that we are scaling up and testing against a useful and representative set of prompts, we propose a framework for linguistically analyzing 3DVG prompts and introduce Visual Grounding with Diverse Language in 3D (ViGiL3D), a diagnostic dataset for evaluating visual grounding methods against a diverse set of language patterns. We evaluate existing open-vocabulary 3DVG methods to demonstrate that these methods are not yet proficient in understanding and identifying the targets of more challenging, out-of-distribution prompts, toward real-world applications.
Decision Explanation: Original decision: REJECT Although the paper uses LLMs and discusses the importance of diverse language prompts for 3D visual grounding, its primary focus is on creating a diagnostic dataset for evaluating visual grounding methods, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) to improve their text generation performance.
[Contrastive Learning from Exploratory Actions: Leveraging Natural
Interactions for Preference Elicitation](https://arxiv.org/abs/2501.01367v1) arXiv ID: http://arxiv.org/abs/2501.01367v1
Abstract:
People have a variety of preferences for how robots behave. To understand and reason about these preferences, robots aim to learn a reward function that describes how aligned robot behaviors are with a user's preferences. Good representations of a robot's behavior can significantly reduce the time and effort required for a user to teach the robot their preferences. Specifying these representations (what "features" of the robot's behavior matter to users) remains a difficult problem; features learned from raw data lack semantic meaning, and features learned from user data require users to engage in tedious labeling processes. Our key insight is that users tasked with customizing a robot are intrinsically motivated to produce labels through exploratory search; they explore behaviors that they find interesting and ignore behaviors that are irrelevant. To harness this novel data source of exploratory actions, we propose contrastive learning from exploratory actions (CLEA) to learn trajectory features that are aligned with features that users care about. We learned CLEA features from exploratory actions users performed in an open-ended signal design activity (N=25) with a Kuri robot, and evaluated CLEA features through a second user study with a different set of users (N=42). CLEA features outperformed self-supervised features when eliciting user preferences over four metrics: completeness, simplicity, minimality, and explainability.
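One way to instantiate "contrastive learning from exploratory actions" is an InfoNCE-style objective in which explored behaviors are positives and ignored behaviors are negatives; the paper's exact loss may differ, so treat this as a sketch:

```python
import torch
import torch.nn.functional as F

def exploration_contrastive_loss(anchor, explored, ignored, temperature=0.1):
    """anchor: (d,) query trajectory feature; explored: (P, d) behaviors
    the user examined (positives); ignored: (N, d) behaviors the user
    skipped (negatives)."""
    pos = F.cosine_similarity(anchor.unsqueeze(0), explored) / temperature
    neg = F.cosine_similarity(anchor.unsqueeze(0), ignored) / temperature
    logits = torch.cat([pos, neg])
    # negative log-probability that a positive is picked over any negative
    return -(torch.logsumexp(pos, 0) - torch.logsumexp(logits, 0))
```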
Decision Explanation: Original decision: REJECT The paper focuses on learning features for preference elicitation in human-robot interaction, specifically leveraging user exploratory actions, and does not primarily investigate, analyze, or propose methods for improving Large Language Model (LLM) performance through the manipulation of textual input prompts.
[ScarNet: A Novel Foundation Model for Automated Myocardial Scar
Quantification from LGE in Cardiac MRI](https://arxiv.org/abs/2501.01372v1) arXiv ID: http://arxiv.org/abs/2501.01372v1
Abstract:
Background: Late Gadolinium Enhancement (LGE) imaging is the gold standard for assessing myocardial fibrosis and scarring, with left ventricular (LV) LGE extent predicting major adverse cardiac events (MACE). Despite its importance, routine LGE-based LV scar quantification is hindered by labor-intensive manual segmentation and inter-observer variability. Methods: We propose ScarNet, a hybrid model combining a transformer-based encoder from the Medical Segment Anything Model (MedSAM) with a convolution-based U-Net decoder, enhanced by tailored attention blocks. ScarNet was trained on 552 ischemic cardiomyopathy patients with expert segmentations of myocardial and scar boundaries and tested on 184 separate patients. Results: ScarNet achieved robust scar segmentation in 184 test patients, yielding a median Dice score of 0.912 (IQR: 0.863–0.944), significantly outperforming MedSAM (median Dice = 0.046, IQR: 0.043–0.047) and nnU-Net (median Dice = 0.638, IQR: 0.604–0.661). ScarNet demonstrated lower bias (-0.63%) and coefficient of variation (4.3%) compared to MedSAM (bias: -13.31%, CoV: 130.3%) and nnU-Net (bias: -2.46%, CoV: 20.3%). In Monte Carlo simulations with noise perturbations, ScarNet achieved significantly higher scar Dice (0.892 ± 0.053, CoV = 5.9%) than MedSAM (0.048 ± 0.112, CoV = 233.3%) and nnU-Net (0.615 ± 0.537, CoV = 28.7%). Conclusion: ScarNet outperformed MedSAM and nnU-Net in accurately segmenting myocardial and scar boundaries in LGE images. The model exhibited robust performance across diverse image qualities and scar patterns.
Decision Explanation: Original decision: REJECT The paper focuses primarily on the development of a novel foundation model (ScarNet) for medical image analysis, rather than engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), and does not investigate or propose methods for improving LLM performance through textual input prompts.
[Training Medical Large Vision-Language Models with Abnormal-Aware
Feedback](https://arxiv.org/abs/2501.01377v1) arXiv ID: http://arxiv.org/abs/2501.01377v1
Abstract:
Existing Medical Large Vision-Language Models (Med-LVLMs), which encapsulate extensive medical knowledge, demonstrate excellent capabilities in understanding medical images and responding to human queries based on these images. However, there remain challenges in visual localization in medical images, which is crucial for abnormality detection and interpretation. To address these issues, we propose UMed-LVLM, a novel Med-LVLM designed to unveil medical abnormalities. Specifically, we collect a Medical Abnormalities Unveiling (MAU) dataset and propose a two-stage training method for UMed-LVLM training. To collect the MAU dataset, we propose a prompt method utilizing GPT-4V to generate diagnoses based on identified abnormal areas in medical images. Moreover, the two-stage training method includes Abnormal-Aware Instruction Tuning and Abnormal-Aware Rewarding, comprising Abnormal Localization Rewarding and Vision Relevance Rewarding. Experimental results demonstrate that our UMed-LVLM surpasses existing Med-LVLMs in identifying and understanding medical abnormalities. In addition, this work shows that enhancing the abnormality detection capabilities of Med-LVLMs significantly improves their understanding of medical images and generalization capability.
Decision Explanation: Original decision: REJECT This paper primarily focuses on the development of Medical Large Vision-Language Models (Med-LVLMs) for abnormality detection in medical images, which falls under excluded categories (medical subjects and non-text generation driven by LLMs), and does not centrally focus on prompt engineering for text-based interactions with LLMs.
Multi-Modal Video Feature Extraction for Popularity Prediction
arXiv ID: http://arxiv.org/abs/2501.01422v1
Abstract:
This work aims to predict the popularity of short videos using the videos themselves and their related features. Popularity is measured by four key engagement metrics: view count, like count, comment count, and share count. This study employs video classification models with different architectures and training methods as backbone networks to extract video modality features. Meanwhile, the cleaned video captions are incorporated into a carefully designed prompt framework, along with the video, as input for video-to-text generation models, which generate detailed text-based video content understanding. These texts are then encoded into vectors using a pre-trained BERT model. Based on the six sets of vectors mentioned above, a neural network is trained for each of the four prediction metrics. Moreover, the study conducts data mining and feature engineering based on the video and tabular data, constructing practical features such as the total frequency of hashtag appearances, the total frequency of mention appearances, video duration, frame count, frame rate, and total time online. Multiple machine learning models are trained, and the most stable model, XGBoost, is selected. Finally, the predictions from the neural network and XGBoost models are averaged to obtain the final result.
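The final fusion step is a plain average of the two model families' outputs, applied per engagement metric (equal weights are an assumption; the abstract only says the predictions are averaged):

```python
import numpy as np

def blend_predictions(nn_preds, xgb_preds, weight=0.5):
    """Blend neural-network and XGBoost predictions for one metric
    (view, like, comment, or share count)."""
    return weight * np.asarray(nn_preds) + (1 - weight) * np.asarray(xgb_preds)

view_count = blend_predictions([1200.0, 300.0], [1000.0, 420.0])
```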
Decision Explanation: Original decision: REJECT The paper's primary focus is on multi-modal video feature extraction for popularity prediction, with LLMs used only as a component (video-to-text generation) and not specifically for text-based interactions with prompt engineering as the central concern, thus failing to meet the 'MUST' criteria.
Object-level Visual Prompts for Compositional Image Generation
arXiv ID: http://arxiv.org/abs/2501.01424v1
Abstract:
We introduce a method for composing object-level visual prompts within a text-to-image diffusion model. Our approach addresses the task of generating semantically coherent compositions across diverse scenes and styles, similar to the versatility and expressiveness offered by text prompts. A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts, while also generating diverse compositions across different images. To address this challenge, we introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations. The keys are derived from an encoder with a small bottleneck for layout control, whereas the values come from a larger bottleneck encoder that captures fine-grained appearance details. By mixing keys and values from these complementary sources, our model preserves the identity of the visual prompts while supporting flexible variations in object arrangement, pose, and composition. During inference, we further propose object-level compositional guidance to improve the method's identity preservation and layout correctness. Results show that our technique produces diverse scene compositions that preserve the unique characteristics of each visual prompt, expanding the creative potential of text-to-image generation.
Decision Explanation: Original decision: REJECT The paper focuses on text-to-image generation with visual prompts, not large language models (LLMs) or text-based interactions, violating the 'MUST NOT' criteria for applications of generative AI other than text generation driven by LLMs.
[Enhancing Reasoning through Process Supervision with Monte Carlo Tree
Search](https://arxiv.org/abs/2501.01478v1) arXiv ID: http://arxiv.org/abs/2501.01478v1
Abstract:
Large language models (LLMs) have demonstrated their remarkable capacity across a variety of tasks. However, reasoning remains a challenge for LLMs. To improve LLMs' reasoning ability, process supervision has proven to be better than outcome supervision. In this work, we study using Monte Carlo Tree Search (MCTS) to generate process supervision data with LLMs themselves for training them. We sample reasoning steps with an LLM and assign each step a score that captures its "relative correctness," and the LLM is then trained by minimizing the weighted log-likelihood of generating the reasoning steps. This generate-then-train process is repeated iteratively until convergence. Our experimental results demonstrate that the proposed methods considerably improve the performance of LLMs on two mathematical reasoning datasets. Furthermore, models trained on one dataset also exhibit improved performance on the other, showing the transferability of the enhanced reasoning ability.
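The training signal described here, a log-likelihood weighted by each step's MCTS-derived "relative correctness," can be sketched as follows (tensor shapes and the per-step averaging are assumptions):

```python
import torch
import torch.nn.functional as F

def weighted_step_nll(logits, target_ids, step_scores):
    """logits: (S, T, V) per-step token logits; target_ids: (S, T) tokens
    of each sampled reasoning step; step_scores: (S,) relative-correctness
    weights from MCTS."""
    per_token = F.cross_entropy(
        logits.flatten(0, 1), target_ids.flatten(), reduction="none"
    ).view(target_ids.shape)
    step_nll = per_token.mean(dim=1)           # average NLL per step
    return (step_scores * step_nll).mean()     # weight by correctness
```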
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on improving LLM's reasoning ability through process supervision with Monte Carlo Tree Search, which is a training method enhancement rather than a prompt engineering technique for text-based interactions with LLMs, violating MUST NOT criterion 1." }
[ORACLE: A Real-Time, Hierarchical, Deep-Learning Photometric Classifier
for the LSST](https://arxiv.org/abs/2501.01496v1) arXiv ID: http://arxiv.org/abs/2501.01496v1
Abstract:
We present ORACLE, the first hierarchical deep-learning model for real-time, context-aware classification of transient and variable astrophysical phenomena. ORACLE is a recurrent neural network with Gated Recurrent Units (GRUs), and has been trained using a custom hierarchical cross-entropy loss function to provide high-confidence classifications along an observationally-driven taxonomy with as little as a single photometric observation. Contextual information for each object, including host galaxy photometric redshift, offset, ellipticity and brightness, is concatenated to the light curve embedding and used to make a final prediction. Training on ~0.5M events from the Extended LSST Astronomical Time-Series Classification Challenge, we achieve a top-level (Transient vs Variable) macro-averaged precision of 0.96 using only 1 day of photometric observations after the first detection in addition to contextual information, for each event; this increases to >0.99 once 64 days of the light curve has been obtained, and 0.83 at 1024 days after first detection for 19-way classification (including supernova sub-types, active galactic nuclei, variable stars, microlensing events, and kilonovae). We also compare ORACLE with other state-of-the-art classifiers and report comparable performance for the 19-way classification task, in addition to delivering accurate top-level classifications much earlier. The code and model weights used in this work are publicly available at our associated GitHub repository (https://github.com/uiucsn/ELAsTiCC-Classification).
Decision Explanation: Original decision: REJECT The paper does not meet any of the 'MUST' criteria, as it focuses on developing a deep-learning model for astrophysical phenomena classification, not on prompt engineering for Large Language Models (LLMs), and doesn't investigate, analyze, or propose methods for improving LLM performance through textual input prompts.
Transfer Learning Analysis of Variational Quantum Circuits
arXiv ID: http://arxiv.org/abs/2501.01507v1
Abstract:
This work analyzes transfer learning of the Variational Quantum Circuit (VQC). Our framework begins with a pretrained VQC configured in one domain and calculates the transition of 1-parameter unitary subgroups required for a new domain. A formalism is established to investigate the adaptability and capability of a VQC under the analysis of loss bounds. Our theory observes knowledge transfer in VQCs and provides a heuristic interpretation for the mechanism. An analytical fine-tuning method is derived to attain the optimal transition for adaptations of similar domains.
Decision Explanation: Original decision: REJECT The paper focuses on transfer learning and optimization of Variational Quantum Circuits (VQCs), with no apparent connection to Large Language Models (LLMs) or prompt engineering for text-based interactions, failing to meet the primary 'MUST' criteria.
[AI-Enabled Operations at Fermi Complex: Multivariate Time Series
Prediction for Outage Prediction and Diagnosis](https://arxiv.org/abs/2501.01509v1) arXiv ID: http://arxiv.org/abs/2501.01509v1
Abstract:
The Main Control Room of the Fermilab accelerator complex continuously gathers extensive time-series data from thousands of sensors monitoring the beam. However, unplanned events such as trips or voltage fluctuations often result in beam outages, causing operational downtime. This downtime not only consumes operator effort in diagnosing and addressing the issue but also leads to unnecessary energy consumption by idle machines awaiting beam restoration. The current threshold-based alarm system is reactive and faces challenges including frequent false alarms and inconsistent outage-cause labeling. To address these limitations, we propose an AI-enabled framework that leverages predictive analytics and automated labeling. Using data from 2,703 Linac devices and 80 operator-labeled outages, we evaluate state-of-the-art deep learning architectures, including recurrent, attention-based, and linear models, for beam outage prediction. Additionally, we assess a Random Forest-based labeling system for providing consistent, confidence-scored outage annotations. Our findings highlight the strengths and weaknesses of these architectures for beam outage prediction and identify critical gaps that must be addressed to fully harness AI for transitioning downtime handling from reactive to predictive, ultimately reducing downtime and improving decision-making in accelerator management.
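A minimal recurrent baseline in the spirit of the architectures evaluated (hidden size, window length, and the sigmoid "outage within horizon" head are all assumptions):

```python
import torch
import torch.nn as nn

class OutagePredictor(nn.Module):
    """GRU over multivariate sensor windows, sigmoid outage probability."""
    def __init__(self, n_sensors, hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_sensors, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, time, n_sensors)
        _, h = self.gru(x)
        return torch.sigmoid(self.head(h[-1]))

probs = OutagePredictor(n_sensors=32)(torch.randn(8, 120, 32))
```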
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper focuses on multivariate time series prediction for outage prediction and diagnosis in an accelerator complex using deep learning architectures, with no primary focus on prompt engineering for Large Language Models (LLMs) or manipulation of textual input prompts to improve LLM performance." }
[DiagrammaticLearning: A Graphical Language for Compositional Training
Regimes](https://arxiv.org/abs/2501.01515v1) arXiv ID: http://arxiv.org/abs/2501.01515v1
Abstract:
Motivated by deep learning regimes with multiple interacting yet distinct model components, we introduce learning diagrams, graphical depictions of training setups that capture parameterized learning as data rather than code. A learning diagram compiles to a unique loss function on which component models are trained. The result of training on this loss is a collection of models whose predictions "agree" with one another. We show that a number of popular learning setups such as few-shot multi-task learning, knowledge distillation, and multi-modal learning can be depicted as learning diagrams. We further implement learning diagrams in a library that allows users to build diagrams of PyTorch and Flux.jl models. By implementing some classic machine learning use cases, we demonstrate how learning diagrams allow practitioners to build complicated models as compositions of smaller components, identify relationships between workflows, and manipulate models during or after training. Leveraging a category theoretic framework, we introduce a rigorous semantics for learning diagrams that puts such operations on a firm mathematical foundation.
Decision Explanation: Original decision: REJECT The paper focuses on developing a graphical language for compositional training regimes and a library for building complex models, which does not meet the primary criterion of focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts to improve LLM performance.
[Improving Robustness Estimates in Natural Language Explainable AI though
Synonymity Weighted Similarity Measures](https://arxiv.org/abs/2501.01516v1) arXiv ID: http://arxiv.org/abs/2501.01516v1
Abstract:
Explainable AI (XAI) has seen a surge in recent interest with the proliferation of powerful but intractable black-box models. Moreover, XAI has come under fire for techniques that may not offer reliable explanations. As many of the methods in XAI are themselves models, adversarial examples have been prominent in the literature surrounding the effectiveness of XAI, with the objective of these examples being to alter the explanation while maintaining the output of the original model. For explanations in natural language, it is natural to use measures found in the domain of information retrieval for use with ranked lists to guide the adversarial XAI process. We show that the standard implementations of these measures are poorly suited for the comparison of explanations in adversarial XAI and amend them by using information that is discarded: the synonymity of perturbed words. This synonymity weighting produces more accurate estimates of the actual weakness of XAI methods to adversarial examples.
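The synonymity weighting can be pictured with a toy overlap measure over ranked explanation word lists: a perturbed word that is a near-synonym of the original earns partial credit instead of counting as a full change. `synonymy` is a hypothetical scorer (WordNet- or embedding-based), and real rank-list measures are more elaborate than this positional sketch:

```python
def synonymity_weighted_overlap(original, perturbed, synonymy):
    """Compare two ranked explanation word lists; identical words score 1,
    synonyms score synonymy(a, b) in [0, 1], unrelated words score 0."""
    k = min(len(original), len(perturbed))
    credit = 0.0
    for w_orig, w_pert in zip(original[:k], perturbed[:k]):
        credit += 1.0 if w_orig == w_pert else synonymy(w_orig, w_pert)
    return credit / k

sim = synonymity_weighted_overlap(
    ["awful", "boring", "plot"], ["terrible", "boring", "plot"],
    synonymy=lambda a, b: 0.9 if {a, b} == {"awful", "terrible"} else 0.0)
```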
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on improving robustness estimates in Explainable AI (XAI) using synonymity weighted similarity measures, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), thus not meeting the core subject requirement." }
[A Metasemantic-Metapragmatic Framework for Taxonomizing Multimodal
Communicative Alignment](https://arxiv.org/abs/2501.01535v1) arXiv ID: http://arxiv.org/abs/2501.01535v1
Abstract:
Drawing on contemporary pragmatist philosophy and linguistic theories on cognition, meaning, and communication, this paper presents a dynamic, metasemantic-metapragmatic taxonomy for grounding and conceptualizing human-like multimodal communicative alignment. The framework is rooted in contemporary developments of the three basic communicative capacities initially identified by American logician and pragmatist philosopher Charles Sanders Peirce: iconic (sensory and perceptual qualities), indexical (contextual and sociocultural associations), and rule-like (symbolic and intuitive reasoning). Expanding on these developments, I introduce the concept of indexical contextualization and propose the principle of "contextualization directionality" for characterizing the crucial metapragmatic capacity for maintaining, navigating, or transitioning between semantic and pragmatic modes of multimodal communication. I contend that current cognitive-social computational and engineering methodologies disproportionately emphasize the semantic/metasemantic domain, overlooking the pivotal role of metapragmatic indexicality in traversing the semantic-pragmatic spectrum of communication. The framework's broader implications for intentionality, identity, affect, and ethics in within-modal and cross-modal human-machine alignment are also discussed.
Decision Explanation: Original decision: REJECT The paper does not meet the 'MUST' criteria, as it primarily focuses on a philosophical framework for multimodal communicative alignment, with no clear emphasis on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) or manipulation of textual input prompts to improve LLM performance.
[In Search of a Lost Metric: Human Empowerment as a Pillar of Socially
Conscious Navigation](https://arxiv.org/abs/2501.01539v1) arXiv ID: http://arxiv.org/abs/2501.01539v1
Abstract:
In social robot navigation, traditional metrics like proxemics and behavior naturalness emphasize human comfort and adherence to social norms but often fail to capture an agent's autonomy and adaptability in dynamic environments. This paper introduces human empowerment, an information-theoretic concept that measures a human's ability to influence their future states and observe those changes, as a complementary metric for evaluating social compliance. This metric reveals how robot navigation policies can indirectly impact human empowerment. We present a framework that integrates human empowerment into the evaluation of social performance in navigation tasks. Through numerical simulations, we demonstrate that human empowerment as a metric not only aligns with intuitive social behavior, but also shows statistically significant differences across various robot navigation policies. These results provide a deeper understanding of how different policies affect social compliance, highlighting the potential of human empowerment as a complementary metric for future research in social navigation.
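For context, in the information-theoretic literature (e.g. Klyubin et al.) empowerment is the channel capacity from an agent's n-step action sequence to the resulting state, which is presumably the quantity the paper adapts for humans in the robot's vicinity:

```latex
\mathcal{E}(s_t) = \max_{p(a_t^{\,n})} \, I\!\left(A_t^{\,n};\, S_{t+n} \mid s_t\right)
```

A navigation policy is then socially compliant to the extent that it avoids collapsing this quantity for nearby humans.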
Decision Explanation: Original decision: REJECT The paper focuses on social robot navigation, introducing 'human empowerment' as a metric for evaluating social compliance, and does not meet the 'MUST' criteria for primarily focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) or investigating methods to improve LLM performance through textual input prompt manipulation.
[BLAST: A Stealthy Backdoor Leverage Attack against Cooperative
Multi-Agent Deep Reinforcement Learning based Systems](https://arxiv.org/abs/2501.01593v1) arXiv ID: http://arxiv.org/abs/2501.01593v1
Abstract:
Recent studies have shown that cooperative multi-agent deep reinforcement learning (c-MADRL) is under the threat of backdoor attacks. Once a backdoor trigger is observed, it will perform malicious actions leading to failures or malicious goals. However, existing backdoor attacks suffer from several issues, e.g., instant trigger patterns lack stealthiness, the backdoor is trained or activated by an additional network, or all agents are backdoored. To this end, in this paper, we propose a novel backdoor leverage attack against c-MADRL, BLAST, which attacks the entire multi-agent team by embedding the backdoor only in a single agent. Firstly, we introduce adversary spatiotemporal behavior patterns as the backdoor trigger rather than manual-injected fixed visual patterns or instant status and control the period to perform malicious actions. This method can guarantee the stealthiness and practicality of BLAST. Secondly, we hack the original reward function of the backdoor agent via unilateral guidance to inject BLAST, so as to achieve the leverage attack effect that can pry open the entire multi-agent system via a single backdoor agent. We evaluate our BLAST against 3 classic c-MADRL algorithms (VDN, QMIX, and MAPPO) in 2 popular c-MADRL environments (SMAC and Pursuit), and 2 existing defense mechanisms. The experimental results demonstrate that BLAST can achieve a high attack success rate while maintaining a low clean performance variance rate.
Decision Explanation: Original decision: REJECT The paper focuses primarily on a backdoor leverage attack against cooperative multi-agent deep reinforcement learning systems, with no emphasis on Large Language Models (LLMs), prompt engineering, or textual input manipulation, thus failing to meet all 'MUST' criteria.
[PSYCHE: A Multi-faceted Patient Simulation Framework for Evaluation of
Psychiatric Assessment Conversational Agents](https://arxiv.org/abs/2501.01594v1) arXiv ID: http://arxiv.org/abs/2501.01594v1
Abstract:
Recent advances in large language models (LLMs) have accelerated the development of conversational agents capable of generating human-like responses. Since psychiatric assessments typically involve complex conversational interactions between psychiatrists and patients, there is growing interest in developing LLM-based psychiatric assessment conversational agents (PACAs) that aim to simulate the role of psychiatrists in clinical evaluations. However, standardized methods for benchmarking the clinical appropriateness of PACAs' interaction with patients still remain underexplored. Here, we propose PSYCHE, a novel framework designed to enable the 1) clinically relevant, 2) ethically safe, 3) cost-efficient, and 4) quantitative evaluation of PACAs. This is achieved by simulating psychiatric patients based on a multi-faceted psychiatric construct that defines the simulated patients' profiles, histories, and behaviors, which PACAs are expected to assess. We validate the effectiveness of PSYCHE through a study with 10 board-certified psychiatrists, supported by an in-depth analysis of the simulated patient utterances.
Decision Explanation: Original decision: REJECT The paper focuses primarily on developing a framework for evaluating Psychiatric Assessment Conversational Agents (PACAs), with LLMs being a component, rather than focusing on prompt engineering for text-based interactions with LLMs, not meeting the core subject requirement.
[A non-ergodic framework for understanding emergent capabilities in Large
Language Models](https://arxiv.org/abs/2501.01638v1) arXiv ID: http://arxiv.org/abs/2501.01638v1
Abstract:
Large language models have emergent capabilities that come unexpectedly at scale, but we need a theoretical framework to explain why and how they emerge. We prove that language models are actually non-ergodic systems while providing a mathematical framework based on Stuart Kauffman's theory of the adjacent possible (TAP) to explain capability emergence. Our resource-constrained TAP equation demonstrates how architectural, training, and contextual constraints interact to shape model capabilities through phase transitions in semantic space. We demonstrate through experiments with three different language models that capabilities emerge through discrete transitions guided by constraint interactions and path-dependent exploration. This framework provides a theoretical basis for understanding emergence in language models and guides the development of architectures that can steer capability emergence.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper focuses primarily on developing a theoretical framework to understand emergent capabilities in LLMs, discussing architectural, training, and contextual constraints, but does not investigate, analyze, or propose methods for improving LLM performance through the manipulation of textual input prompts." }
AgentRefine: Enhancing Agent Generalization through Refinement Tuning
arXiv ID: http://arxiv.org/abs/2501.01702v1
Abstract:
Large Language Model (LLM) based agents have proved their ability to perform complex tasks like humans. However, there is still a large gap between open-sourced LLMs and commercial models like the GPT series. In this paper, we focus on improving the agent generalization capabilities of LLMs via instruction tuning. We first observe that the existing agent training corpus exhibits satisfactory results on held-in evaluation sets but fails to generalize to held-out sets. These agent-tuning works face severe formatting errors and frequently get stuck repeating the same mistake. We analyze that the poor generalization ability comes from overfitting to several manual agent environments and a lack of adaptation to new situations. The models struggle with wrong action steps and cannot learn from experience, instead merely memorizing existing observation-action relations. Inspired by this insight, we propose a novel AgentRefine framework for agent-tuning. The core idea is to enable the model to learn to correct its mistakes via observation in the trajectory. Specifically, we propose an agent synthesis framework to encompass a diverse array of environments and tasks and prompt a strong LLM to refine its error action according to the environment feedback. AgentRefine significantly outperforms state-of-the-art agent-tuning work in terms of generalization ability on diverse agent tasks. It also has better robustness facing perturbation and can generate diversified thought in inference. Our findings establish the correlation between agent generalization and self-refinement and provide a new paradigm for future research.
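The trajectory-synthesis loop described above, where environment error feedback prompts the model to refine its own action so that the saved trajectory contains mistake-correction pairs, can be sketched with hypothetical interfaces (`agent_llm(prompt)` returns text; `env.step(action)` returns an observation, optional error feedback, and a done flag):

```python
def synthesize_refinement_trajectory(env, agent_llm, max_steps=10):
    """Collect a trajectory that keeps mistake -> correction pairs, the
    kind of data refinement tuning trains on."""
    observation, trajectory = env.reset(), []
    for _ in range(max_steps):
        action = agent_llm(f"Observation: {observation}\nNext action:")
        observation, error, done = env.step(action)
        trajectory.append((action, error))
        if error is not None:        # keep the mistake, then the fix
            action = agent_llm(
                f"Your action failed. Environment feedback: {error}\n"
                "Give a corrected action:")
            observation, error, done = env.step(action)
            trajectory.append((action, error))
        if done:
            break
    return trajectory
```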
Decision Explanation: Original decision: REJECT The paper focuses on refining Large Language Model (LLM) based agents through instruction tuning to improve generalization capabilities, which aligns more with the development of new training methods for LLMs (excluded by MUST NOT 1) rather than primarily focusing on prompt engineering for text-based interactions with LLMs.
[Combined Hyper-Extensible Extremely-Secured Zero-Trust CIAM-PAM
architecture](https://arxiv.org/abs/2501.01732v1) arXiv ID: http://arxiv.org/abs/2501.01732v1
Abstract:
Customer Identity and Access Management (CIAM) systems play a pivotal role in securing enterprise infrastructures. However, the complexity of implementing these systems requires careful architectural planning to ensure positive Return on Investment (RoI) and avoid costly delays. The proliferation of Active Persistent cyber threats, coupled with advancements in AI, cloud computing, and geographically distributed customer populations, necessitates a paradigm shift towards adaptive and zero-trust security frameworks. This paper introduces the Combined Hyper-Extensible Extremely-Secured Zero-Trust (CHEZ) CIAM-PAM architecture, designed specifically for large-scale enterprises. The CHEZ PL CIAM-PAM framework addresses critical security gaps by integrating federated identity management (private and public identities), password-less authentication, adaptive multi-factor authentication (MFA), microservice-based PEP (Policy Entitlement Point), multi-layer RBAC (Role Based Access Control) and multi-level trust systems. This future-proof design also includes end-to-end data encryption, and seamless integration with state-of-the-art AI-based threat detection systems, while ensuring compliance with stringent regulatory standards.
Decision Explanation: Original decision: REJECT The paper focuses primarily on developing a secure CIAM-PAM architecture for enterprise infrastructures, with no apparent focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs), thus failing to meet the core subject requirement.
[How Toxic Can You Get? Search-based Toxicity Testing for Large Language Models](https://arxiv.org/abs/2501.01741v1) arXiv ID: http://arxiv.org/abs/2501.01741v1
Abstract:
Language is a deep-rooted means of perpetration of stereotypes and discrimination. Large Language Models (LLMs), now a pervasive technology in our everyday lives, can cause extensive harm when prone to generating toxic responses. The standard way to address this issue is to align the LLM, which, however, dampens the issue without constituting a definitive solution. Therefore, testing LLM even after alignment efforts remains crucial for detecting any residual deviations with respect to ethical standards. We present EvoTox, an automated testing framework for LLMs' inclination to toxicity, providing a way to quantitatively assess how much LLMs can be pushed towards toxic responses even in the presence of alignment. The framework adopts an iterative evolution strategy that exploits the interplay between two LLMs, the System Under Test (SUT) and the Prompt Generator steering SUT responses toward higher toxicity. The toxicity level is assessed by an automated oracle based on an existing toxicity classifier. We conduct a quantitative and qualitative empirical evaluation using four state-of-the-art LLMs as evaluation subjects having increasing complexity (7-13 billion parameters). Our quantitative evaluation assesses the cost-effectiveness of four alternative versions of EvoTox against existing baseline methods, based on random search, curated datasets of toxic prompts, and adversarial attacks. Our qualitative assessment engages human evaluators to rate the fluency of the generated prompts and the perceived toxicity of the responses collected during the testing sessions. Results indicate that the effectiveness, in terms of detected toxicity level, is significantly higher than the selected baseline methods (effect size up to 1.0 against random search and up to 0.99 against adversarial attacks). Furthermore, EvoTox yields a limited cost overhead (from 22% to 35% on average).
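The interplay between the Prompt Generator and the System Under Test amounts to a simple evolutionary loop with the toxicity classifier as the fitness oracle. A minimal sketch of that loop, with the three models abstracted as callables (an assumed shape, not the authors' code):

```python
# EvoTox-style search sketch: mutate the current best prompt with a Prompt
# Generator LLM, score the SUT's response with a toxicity oracle, and keep
# the candidate whenever it elicits a more toxic response.

def evolve_prompt(seed, mutate, sut_respond, toxicity, iterations=50):
    best, best_score = seed, toxicity(sut_respond(seed))
    for _ in range(iterations):
        candidate = mutate(best)                  # Prompt Generator proposes a variant
        score = toxicity(sut_respond(candidate))  # oracle scores the SUT's response
        if score > best_score:                    # hill-climb on toxicity
            best, best_score = candidate, score
    return best, best_score
```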
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "While the paper involves Large Language Models (LLMs) and manipulates textual input prompts to test toxicity, its primary focus is on developing a testing framework for detecting toxicity rather than optimizing prompt engineering *for improved LLM performance* through textual input manipulation." }
[Can Synthetic Data be Fair and Private? A Comparative Study of Synthetic Data Generation and Fairness Algorithms](https://arxiv.org/abs/2501.01785v1) arXiv ID: http://arxiv.org/abs/2501.01785v1
Abstract:
The increasing use of machine learning in learning analytics (LA) has raised significant concerns around algorithmic fairness and privacy. Synthetic data has emerged as a dual-purpose tool, enhancing privacy and improving fairness in LA models. However, prior research suggests an inverse relationship between fairness and privacy, making it challenging to optimize both. This study investigates which synthetic data generators can best balance privacy and fairness, and whether pre-processing fairness algorithms, typically applied to real datasets, are effective on synthetic data. Our results highlight that the DEbiasing CAusal Fairness (DECAF) algorithm achieves the best balance between privacy and fairness. However, DECAF suffers in utility, as reflected in its predictive accuracy. Notably, we found that applying pre-processing fairness algorithms to synthetic data improves fairness even more than when applied to real data. These findings suggest that combining synthetic data generation with fairness pre-processing offers a promising approach to creating fairer LA models.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper does not meet the primary focus criteria, as it focuses on synthetic data generation for balancing privacy and fairness in learning analytics models, rather than engineering, design, or optimization of prompts specifically for Large Language Models (LLMs)." }
[BERT4MIMO: A Foundation Model using BERT Architecture for Massive MIMO Channel State Information Prediction](https://arxiv.org/abs/2501.01802v1) arXiv ID: http://arxiv.org/abs/2501.01802v1
Abstract:
Massive MIMO (Multiple-Input Multiple-Output) is an advanced wireless communication technology, using a large number of antennas to improve the overall performance of the communication system in terms of capacity, spectral, and energy efficiency. The performance of MIMO systems is highly dependent on the quality of channel state information (CSI). Predicting CSI is, therefore, essential for improving communication system performance, particularly in MIMO systems, since it represents key characteristics of a wireless channel, including propagation, fading, scattering, and path loss. This study proposes a foundation model inspired by BERT, called BERT4MIMO, which is specifically designed to process high-dimensional CSI data from massive MIMO systems. BERT4MIMO offers superior performance in reconstructing CSI under varying mobility scenarios and channel conditions through deep learning and attention mechanisms. The experimental results demonstrate the effectiveness of BERT4MIMO in a variety of wireless environments.
Decision Explanation: Original decision: REJECT The paper focuses on developing a new foundation model (BERT4MIMO) for predicting channel state information in Massive MIMO systems, which does not meet the primary criteria of focusing on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts to improve LLM performance.
[Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models](https://arxiv.org/abs/2501.01830v1) arXiv ID: http://arxiv.org/abs/2501.01830v1
Abstract:
Automated red-teaming has become a crucial approach for uncovering vulnerabilities in large language models (LLMs). However, most existing methods focus on isolated safety flaws, limiting their ability to adapt to dynamic defenses and uncover complex vulnerabilities efficiently. To address this challenge, we propose Auto-RT, a reinforcement learning framework that automatically explores and optimizes complex attack strategies to effectively uncover security vulnerabilities through malicious queries. Specifically, we introduce two key mechanisms to reduce exploration complexity and improve strategy optimization: 1) Early-terminated Exploration, which accelerates exploration by focusing on high-potential attack strategies; and 2) a Progressive Reward Tracking algorithm with intermediate downgrade models, which dynamically refines the search trajectory toward successful vulnerability exploitation. Extensive experiments across diverse LLMs demonstrate that, by significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving faster detection and 16.63% higher success rates compared to existing methods.
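The "early-terminated exploration" mechanism is reminiscent of successive halving: spend a small budget on every strategy, then reallocate to the promising ones. The sketch below names that swapped-in technique plainly; it illustrates the budget-reallocation idea only, not Auto-RT's actual RL algorithm, and `attack_reward` is an assumed callable scoring one red-team attempt.

```python
# Successive-halving-style exploration over attack strategies: evaluate each
# with a small budget, drop the weaker half, and repeat until one remains.

def explore(strategies, attack_reward, budget_per_round=4, keep_frac=0.5):
    pool = list(strategies)
    while len(pool) > 1:
        scores = {s: sum(attack_reward(s) for _ in range(budget_per_round))
                  for s in pool}
        pool.sort(key=scores.get, reverse=True)
        pool = pool[: max(1, int(len(pool) * keep_frac))]  # early-terminate the rest
    return pool[0]
```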
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on developing a reinforcement learning framework for automated red-teaming to uncover security vulnerabilities in LLMs, rather than focusing on the engineering, design, or optimization of prompts specifically for improving LLM performance through textual input manipulation." }
[LCFed: An Efficient Clustered Federated Learning Framework for Heterogeneous Data](https://arxiv.org/abs/2501.01850v1) arXiv ID: http://arxiv.org/abs/2501.01850v1
Abstract:
Clustered federated learning (CFL) addresses the performance challenges posed by data heterogeneity in federated learning (FL) by organizing edge devices with similar data distributions into clusters, enabling collaborative model training tailored to each group. However, existing CFL approaches strictly limit knowledge sharing to within clusters, lacking the integration of global knowledge with intra-cluster training, which leads to suboptimal performance. Moreover, traditional clustering methods incur significant computational overhead, especially as the number of edge devices increases. In this paper, we propose LCFed, an efficient CFL framework to combat these challenges. By leveraging model partitioning and adopting distinct aggregation strategies for each sub-model, LCFed effectively incorporates global knowledge into intra-cluster co-training, achieving optimal training performance. Additionally, LCFed customizes a computationally efficient model similarity measurement method based on low-rank models, enabling real-time cluster updates with minimal computational overhead. Extensive experiments show that LCFed outperforms state-of-the-art benchmarks in both test accuracy and clustering computational efficiency.
Decision Explanation: Original decision: REJECT The paper focuses primarily on the development of a new federated learning framework (LCFed) for heterogeneous data, with no mention of prompt engineering, Large Language Models (LLMs), or text generation, thus failing to meet all 'MUST' criteria.
[Evaluating Scenario-based Decision-making for Interactive Autonomous Driving Using Rational Criteria: A Survey](https://arxiv.org/abs/2501.01886v1) arXiv ID: http://arxiv.org/abs/2501.01886v1
Abstract:
Autonomous vehicles (AVs) can significantly promote advances in road transport mobility in terms of safety, reliability, and decarbonization. However, ensuring safe and efficient interaction within dynamic and diverse environments is still a primary barrier to large-scale AV adoption. In recent years, deep reinforcement learning (DRL) has emerged as an advanced AI-based approach, enabling AVs to learn decision-making strategies adaptively from data and interactions. DRL strategies are better suited than traditional rule-based methods for handling complex, dynamic, and unpredictable driving environments due to their adaptivity. However, varying driving scenarios present distinct challenges, such as avoiding obstacles on highways and reaching specific exits at intersections, requiring different scenario-specific decision-making algorithms. Many DRL algorithms have been proposed for interactive decision-making, but a rational review of these algorithms across various scenarios is lacking. Therefore, a comprehensive evaluation is essential to assess these algorithms from multiple perspectives, including those of vehicle users and vehicle manufacturers. This survey reviews the application of DRL algorithms in autonomous driving across typical scenarios, summarizing road features and recent advancements. The scenarios include highways, on-ramp merging, roundabouts, and unsignalized intersections. Furthermore, DRL-based algorithms are evaluated against five rational criteria: driving safety, driving efficiency, training efficiency, unselfishness, and interpretability (DDTUI). Each criterion of DDTUI is analyzed in relation to the reviewed algorithms. Finally, the challenges for future DRL-based decision-making algorithms are summarized.
Decision Explanation: Original decision: REJECT The paper focuses primarily on autonomous driving, deep reinforcement learning algorithms, and decision-making for autonomous vehicles, rather than on the engineering, design, or optimization of prompts for Large Language Models (LLMs), failing to meet the core subject requirement.
[QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture](https://arxiv.org/abs/2501.01892v2) arXiv ID: http://arxiv.org/abs/2501.01892v2
Abstract:
We introduce QuArch, a dataset of 1500 human-validated question-answer pairs designed to evaluate and enhance language models' understanding of computer architecture. The dataset covers areas including processor design, memory systems, and performance optimization. Our analysis highlights a significant performance gap: the best closed-source model achieves 84% accuracy, while the top small open-source model reaches 72%. We observe notable struggles in memory systems, interconnection networks, and benchmarking. Fine-tuning with QuArch improves small model accuracy by up to 8%, establishing a foundation for advancing AI-driven computer architecture research. The dataset and leaderboard are at https://harvard-edge.github.io/QuArch/.
Decision Explanation: Original decision: REJECT The paper primarily focuses on introducing a dataset (QuArch) for evaluating and enhancing language models' understanding of computer architecture, rather than specifically investigating, analyzing, or proposing methods for improving LLM performance through the manipulation of textual input prompts.
Mingling with the Good to Backdoor Federated Learning
arXiv ID: http://arxiv.org/abs/2501.01913v1
Abstract:
Federated learning (FL) is a decentralized machine learning technique that allows multiple entities to jointly train a model while preserving dataset privacy. However, its distributed nature has raised various security concerns, which have been addressed by increasingly sophisticated defenses. These protections utilize a range of data sources and metrics to, for example, filter out malicious model updates, ensuring that the impact of attacks is minimized or eliminated. This paper explores the feasibility of designing a generic attack method capable of installing backdoors in FL while evading a diverse array of defenses. Specifically, we focus on an attacker strategy called MIGO, which aims to produce model updates that subtly blend with legitimate ones. The resulting effect is a gradual integration of a backdoor into the global model, often ensuring its persistence long after the attack concludes, while generating enough ambiguity to hinder the effectiveness of defenses. MIGO was employed to implant three types of backdoors across five datasets and different model architectures. The results demonstrate the significant threat posed by these backdoors, as MIGO consistently achieved exceptionally high backdoor accuracy (exceeding 90%) while maintaining the utility of the main task. Moreover, MIGO exhibited strong evasion capabilities against ten defenses, including several state-of-the-art methods. When compared to four other attack strategies, MIGO consistently outperformed them across most configurations. Notably, even in extreme scenarios where the attacker controls just 0.1% of the clients, the results indicate that successful backdoor insertion is possible if the attacker can persist for a sufficient number of rounds.
Decision Explanation: Original decision: REJECT The paper primarily focuses on backdoor attacks in Federated Learning, with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.
[On the Utility of Equivariance and Symmetry Breaking in Deep Learning Architectures on Point Clouds](https://arxiv.org/abs/2501.01999v1) arXiv ID: http://arxiv.org/abs/2501.01999v1
Abstract:
This paper explores the key factors that influence the performance of models working with point clouds, across different tasks of varying geometric complexity. In this work, we explore the trade-offs between flexibility and weight-sharing introduced by equivariant layers, assessing when equivariance boosts or detracts from performance. It is often argued that providing more information as input improves a model's performance. However, if this additional information breaks certain properties, such as $\mathrm{SE}(3)$ equivariance, does it remain beneficial? We identify the key aspects of equivariant and non-equivariant architectures that drive success in different tasks by benchmarking them on segmentation, regression, and generation tasks across multiple datasets with increasing complexity. We observe a positive impact of equivariance, which becomes more pronounced with increasing task complexity, even when strict equivariance is not required.
Decision Explanation: Original decision: REJECT The paper focuses primarily on the development and performance of deep learning architectures for point clouds, discussing equivariance and symmetry breaking, with no mention of Large Language Models (LLMs) or prompt engineering for text-based interactions.
[Multi-Center Study on Deep Learning-Assisted Detection and Classification of Fetal Central Nervous System Anomalies Using Ultrasound Imaging](https://arxiv.org/abs/2501.02000v1) arXiv ID: http://arxiv.org/abs/2501.02000v1
Abstract:
Prenatal ultrasound evaluates fetal growth and detects congenital abnormalities during pregnancy, but the examination of ultrasound images by radiologists requires expertise and sophisticated equipment; without them, the rate of identifying specific types of fetal central nervous system (CNS) abnormalities remains low and patients undergo unnecessary examinations. We construct a deep learning model to improve the overall accuracy of the diagnosis of fetal cranial anomalies to aid prenatal diagnosis. In our collected multi-center dataset of fetal craniocerebral anomalies covering four typical anomalies of the fetal central nervous system (CNS): anencephaly, encephalocele (including meningocele), holoprosencephaly, and rachischisis, patient-level prediction accuracy reaches 94.5%, with an AUROC value of 99.3%. In the subgroup analyses, our model is applicable to the entire gestational period, with good identification of fetal anomaly types for any gestational period. Heatmaps superimposed on the ultrasound images not only provide a visual interpretation for the algorithm but also offer an intuitive visual aid to the physician by highlighting key areas that need to be reviewed, helping the physician to quickly identify and validate key areas. Finally, the retrospective reader study demonstrates that by combining the automatic prediction of the DL system with the professional judgment of the radiologist, diagnostic accuracy and efficiency can be effectively improved and the misdiagnosis rate reduced, which has important clinical application prospects.
Decision Explanation: Original decision: REJECT The paper focuses on deep learning-assisted detection and classification of fetal central nervous system anomalies using ultrasound imaging, which falls under medical subjects and does not involve prompt engineering for Large Language Models (LLMs) as required.
General Information Metrics for Improving AI Model Training Efficiency
arXiv ID: http://arxiv.org/abs/2501.02004v1
Abstract:
To address the growing size of AI model training data and the lack of a universal data selection methodology, two factors that significantly drive up training costs, this paper presents the General Information Metrics Evaluation (GIME) method. GIME leverages general information metrics from Objective Information Theory (OIT), including volume, delay, scope, granularity, variety, duration, sampling rate, aggregation, coverage, distortion, and mismatch, to optimize dataset selection for training purposes. Comprehensive experiments conducted across diverse domains, such as CTR Prediction, Civil Case Prediction, and Weather Forecasting, demonstrate that GIME effectively preserves model performance while substantially reducing both training time and costs. Additionally, applying GIME within the Judicial AI Program led to a remarkable 39.56% reduction in total model training expenses, underscoring its potential to support efficient and sustainable AI development.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper focuses on optimizing dataset selection for AI model training efficiency using General Information Metrics, not on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), thus failing to meet the primary 'MUST' criteria." }
[ST-HCSS: Deep Spatio-Temporal Hypergraph Convolutional Neural Network for Soft Sensing](https://arxiv.org/abs/2501.02016v1) arXiv ID: http://arxiv.org/abs/2501.02016v1
Abstract:
Higher-order sensor networks are more accurate in characterizing the nonlinear dynamics of sensory time-series data in modern industrial settings by allowing multi-node connections beyond simple pairwise graph edges. In light of this, we propose a deep spatio-temporal hypergraph convolutional neural network for soft sensing (ST-HCSS). In particular, our proposed framework is able to construct and leverage a higher-order graph (hypergraph) to model the complex multi-interactions between sensor nodes in the absence of prior structural knowledge. To capture rich spatio-temporal relationships underlying sensor data, our proposed ST-HCSS incorporates stacked gated temporal and hypergraph convolution layers to effectively aggregate and update hypergraph information across time and nodes. Our results validate the superiority of ST-HCSS compared to existing state-of-the-art soft sensors, and demonstrate that the learned hypergraph feature representations align well with the sensor data correlations. The code is available at https://github.com/htew0001/ST-HCSS.git
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper focuses on developing a new neural network architecture (ST-HCSS) for soft sensing in industrial settings, with no apparent primary focus on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate improving LLM performance through textual input prompts." }
[Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs](https://arxiv.org/abs/2501.02018v1) arXiv ID: http://arxiv.org/abs/2501.02018v1
Abstract:
Large Language Models (LLMs) have been shown to be susceptible to jailbreak attacks, or adversarial attacks used to elicit high-risk behavior from a model. Jailbreaks have been exploited by cybercriminals and blackhat actors to cause significant harm, highlighting the critical need to safeguard widely-deployed models. Safeguarding approaches, which include fine-tuning models or having LLMs "self-reflect", may lengthen the inference time of a model, incur a computational penalty, reduce the semantic fluency of an output, and restrict "normal" model behavior. Importantly, these Safety-Performance Trade-offs (SPTs) remain an understudied area. In this work, we introduce a novel safeguard, called SafeNudge, that combines Controlled Text Generation with "nudging", or using text interventions to change the behavior of a model. SafeNudge triggers during text generation while a jailbreak attack is being executed, and can reduce successful jailbreak attempts by 30% by guiding the LLM toward safe responses. It adds minimal latency to inference and has a negligible impact on the semantic fluency of outputs. Further, we allow for tunable SPTs. SafeNudge is open-source and available through https://pypi.org/, and is compatible with models loaded with the Hugging Face "transformers" library.
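The "nudging" mechanism, splicing a text intervention into the context mid-generation when a risk signal fires, can be sketched in a few lines. The following is an illustration of the assumed mechanics only (see the SafeNudge package for the real interface); `step_decode` and `is_risky` are hypothetical callables.

```python
# Illustrative "nudge-on-detection" decoding loop: when a lightweight risk
# detector fires during generation, a short safety instruction is appended to
# the context before decoding continues, steering subsequent tokens.

NUDGE = "\n[Reminder: refuse harmful requests and answer safely.]\n"

def generate_with_nudge(step_decode, is_risky, prompt, max_tokens=256):
    text = prompt
    for _ in range(max_tokens):
        token = step_decode(text)   # one decoding step of the base LLM
        text += token
        if is_risky(text):          # risk detector evaluated mid-generation
            text += NUDGE           # text intervention changes model behavior
    return text[len(prompt):]
```

How aggressively the detector fires is the natural knob for the tunable safety-performance trade-off the abstract mentions.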
Decision Explanation: Original decision: REJECT The paper primarily focuses on safeguarding LLMs from jailbreak attacks through a novel method (SafeNudge), which involves Controlled Text Generation and 'nudging'. While it does manipulate textual input, the core subject is enhancing LLM safety and security, not prompt engineering for text-based interactions with LLMs, thus not meeting the 'MUST' criteria.
[Benchmarking Constraint-Based Bayesian Structure Learning Algorithms: Role of Network Topology](https://arxiv.org/abs/2501.02019v1) arXiv ID: http://arxiv.org/abs/2501.02019v1
Abstract:
Modeling the associations between real-world entities from their multivariate cross-sectional profiles can provide cues into the concerted working of these entities as a system. Several techniques have been proposed for deciphering these associations, including constraint-based Bayesian structure learning (BSL) algorithms that model them as directed acyclic graphs (DAGs). Benchmarking of these algorithms has typically focused on assessing the variation in performance measures such as sensitivity as a function of the dimensionality, represented by the number of nodes in the DAG, and sample size. The present study elucidates the importance of network topology in benchmarking exercises. More specifically, it investigates variations in sensitivity across distinct network topologies while constraining the nodes, edges, and sample size to be identical, eliminating these as potential confounders. The sensitivity of three popular constraint-based BSL algorithms (Peter-Clark, Grow-Shrink, Incremental Association Markov Blanket) in learning the network structure from multivariate cross-sectional profiles is investigated on network models with sub-linear, linear, and super-linear DAG topologies generated using preferential attachment. Results across linear and nonlinear models revealed a statistically significant ($\alpha = 0.05$) decrease in sensitivity estimates from sub-linear to super-linear topology, consistently across the three algorithms. These results are demonstrated on networks with nodes ($N_{nodes} = 48, 64$), noise strengths ($\sigma = 3, 6$), and sample size ($N = 2^{10}$). The findings elucidate the importance of accommodating the network topology in constraint-based BSL benchmarking exercises.
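Sensitivity here is the recall of true DAG edges. A sketch of the per-topology measurement, assuming pgmpy's PC estimator interface (the Peter-Clark algorithm is one of the three benchmarked; the paper's exact pipeline may differ):

```python
# Illustrative sensitivity (edge recall) for a constraint-based learner,
# assuming pgmpy's PC estimator; not the paper's evaluation code.
import pandas as pd
from pgmpy.estimators import PC

def edge_sensitivity(data: pd.DataFrame, true_edges: set) -> float:
    learned = set(PC(data).estimate(ci_test="chi_square").edges())
    return len(learned & true_edges) / len(true_edges)
```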
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper does not meet the 'MUST' criteria as it focuses on benchmarking constraint-based Bayesian structure learning algorithms, which is unrelated to engineering, design, or optimization of prompts for Large Language Models (LLMs), and does not investigate the manipulation of textual input prompts to improve LLM performance." }
Model Checking in Medical Imaging for Tumor Detection and Segmentation
arXiv ID: http://arxiv.org/abs/2501.02024v2
Abstract:
Recent advancements in model checking have demonstrated significant potential across diverse applications, particularly in signal and image analysis. Medical imaging stands out as a critical domain where model checking can be effectively applied to design and evaluate robust frameworks. These frameworks facilitate automatic and semi-automatic delineation of regions of interest within images, aiding in accurate segmentation. This paper provides a comprehensive analysis of recent works leveraging spatial logic to develop operators and tools for identifying regions of interest, including tumorous and non-tumorous areas. Additionally, we examine the challenges inherent to spatial model-checking techniques, such as variability in ground truth data and the need for streamlined procedures suitable for routine clinical practice.
Decision Explanation: Original decision: REJECT The paper focuses primarily on medical imaging for tumor detection and segmentation, and model checking techniques, rather than prompt engineering for Large Language Models (LLMs), thus violating the 'MUST NOT' criteria regarding primary concern with medical subjects and not meeting the core subject requirement of prompt engineering for text-based interactions with LLMs.
[Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models](https://arxiv.org/abs/2501.02029v1) arXiv ID: http://arxiv.org/abs/2501.02029v1
Abstract:
With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during the first token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term "safety heads." Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model's utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector that integrates seamlessly into the generation process with minimal extra inference overhead. Despite its simple structure of a logistic regression model, the detector surprisingly exhibits strong zero-shot generalization capabilities. Experiments across various prompt-based attacks confirm the effectiveness of leveraging safety heads to protect LVLMs. Code is available at https://github.com/Ziwei-Zheng/SAHs.
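The detector the abstract describes is simple enough to sketch end to end: concatenate the activations of a few attention heads at the first generated token and fit a logistic regression. The head indices and the activation hook below are assumptions for illustration; see the linked repository for the actual implementation.

```python
# Sketch of a safety-head detector: features are concatenated activations of
# selected (layer, head) pairs at the first generated token; the classifier is
# a plain logistic regression, as in the abstract.
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_detector(get_head_acts, prompts, labels, safety_heads):
    # get_head_acts(prompt) -> dict keyed by (layer, head) -> activation vector
    # (an assumed hook into the model's forward pass)
    X = np.stack([
        np.concatenate([get_head_acts(p)[h] for h in safety_heads])
        for p in prompts
    ])
    return LogisticRegression(max_iter=1000).fit(X, labels)  # labels: 1 = malicious
```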
Decision Explanation: Original decision: REJECT The paper focuses on identifying safety risks in Large Vision-Language Models (LVLMs) through internal attention heads, rather than specifically engineering or optimizing textual input prompts for Large Language Models (LLMs), thus not meeting the primary focus criteria.
Detecting Music Performance Errors with Transformers
arXiv ID: http://arxiv.org/abs/2501.02030v1
Abstract:
Beginner musicians often struggle to identify specific errors in their performances, such as playing incorrect notes or rhythms. There are two limitations in existing tools for music error detection: (1) existing approaches rely on automatic alignment and are therefore prone to errors caused by small deviations between alignment targets; (2) there is a lack of sufficient data to train music error detection models, resulting in over-reliance on heuristics. To address (1), we propose a novel transformer model, Polytune, that takes audio inputs and outputs annotated music scores. This model can be trained end-to-end to implicitly align and compare performance audio with music scores through latent space representations. To address (2), we present a novel data generation technique capable of creating large-scale synthetic music error datasets. Our approach achieves a 64.1% average Error Detection F1 score, improving upon prior work by 40 percentage points across 14 instruments. Additionally, compared with existing transcription methods repurposed for music error detection, our model can handle multiple instruments. Our source code and datasets are available at https://github.com/ben2002chou/Polytune.
Decision Explanation: Original decision: REJECT The paper focuses on music performance error detection using a novel transformer model and data generation technique, with no primary emphasis on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) or their textual input prompts.
[Dynamic Feature Fusion: Combining Global Graph Structures and Local Semantics for Blockchain Fraud Detection](https://arxiv.org/abs/2501.02032v1) arXiv ID: http://arxiv.org/abs/2501.02032v1
Abstract:
The advent of blockchain technology has facilitated the widespread adoption of smart contracts in the financial sector. However, current fraud detection methodologies exhibit limitations in capturing both global structural patterns within transaction networks and local semantic relationships embedded in transaction data. Most existing models focus on either structural information or semantic features individually, leading to suboptimal performance in detecting complex fraud patterns. In this paper, we propose a dynamic feature fusion model that combines graph-based representation learning and semantic feature extraction for blockchain fraud detection. Specifically, we construct global graph representations to model account relationships and extract local contextual features from transaction data. A dynamic multimodal fusion mechanism is introduced to adaptively integrate these features, enabling the model to capture both structural and semantic fraud patterns effectively. We further develop a comprehensive data processing pipeline, including graph construction, temporal feature enhancement, and text preprocessing. Experimental results on large-scale real-world blockchain datasets demonstrate that our method outperforms existing benchmarks across accuracy, F1 score, and recall metrics. This work highlights the importance of integrating structural relationships and semantic similarities for robust fraud detection and offers a scalable solution for securing blockchain systems.
Decision Explanation: Original decision: REJECT The paper does not focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs). Instead, it concentrates on a dynamic feature fusion model for blockchain fraud detection, utilizing graph-based representation learning and semantic feature extraction, with no evident connection to LLM prompt engineering.
Deep Clustering via Community Detection
arXiv ID: http://arxiv.org/abs/2501.02036v1
Abstract:
Deep clustering is an essential task in modern artificial intelligence, aiming to partition a set of data samples into a given number of homogeneous groups (i.e., clusters). Even though many Deep Neural Network (DNN) backbones and clustering strategies have been proposed for the task, achieving increasingly improved performance, deep clustering remains very challenging due to the lack of accurately labeled samples. In this paper, we propose a novel approach of deep clustering via community detection. It initializes clustering by detecting many communities, and then gradually expands clusters by community merging. Compared with the existing clustering strategies, community detection factors in the new perspective of cluster network analysis. As a result, it has the inherent benefit of high pseudo-label purity, which is critical to the performance of self-supervision. We have validated the efficacy of the proposed approach on benchmark image datasets. Our extensive experiments have shown that it can effectively improve the SOTA performance. Our ablation study also demonstrates that the new network perspective can effectively improve community pseudo-label purity, resulting in improved clustering performance.
Decision Explanation: Original decision: REJECT The paper focuses on deep clustering via community detection for image datasets using Deep Neural Networks (DNNs), with no mention of Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.
METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring
arXiv ID: http://arxiv.org/abs/2501.02045v1
Abstract:
We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a metagenomic foundation model, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present within this wastewater, to aid in tasks relevant to pandemic monitoring and pathogen detection. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable the effective modeling of metagenomic data. We then show results of pretraining this model on our metagenomic dataset, providing details about our losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the performance of METAGENE-1, which achieves state-of-the-art results on a set of genomic benchmarks and new evaluations focused on human-pathogen detection and genomic sequence embedding, showcasing its potential for public health applications in pandemic monitoring, biosurveillance, and early detection of emerging health threats.
Decision Explanation: Original decision: REJECT The paper focuses primarily on the development of a new foundation model (METAGENE-1) for pandemic monitoring through pretraining on metagenomic sequences, rather than on prompt engineering for Large Language Models (LLMs), and does not investigate or propose methods for improving LLM performance through textual input prompt manipulation.
[Relaxation-assisted reverse annealing on nonnegative/binary matrix factorization](https://arxiv.org/abs/2501.02114v1) arXiv ID: http://arxiv.org/abs/2501.02114v1
Abstract:
Quantum annealing has garnered significant attention as a meta-heuristic inspired by quantum physics for combinatorial optimization problems. Among its many applications, nonnegative/binary matrix factorization stands out for its complexity and relevance in unsupervised machine learning. Reverse annealing, a derivative procedure of quantum annealing that prioritizes the search in the vicinity of a given initial state, helps improve optimization performance in matrix factorization. This study proposes an improved strategy that integrates reverse annealing with a linear programming relaxation technique. Using relaxed solutions as the initial configuration for reverse annealing, we demonstrate improvements in optimization performance comparable to exact optimization methods. Our experiments on facial image datasets show that our method provides better convergence than known reverse annealing methods. Furthermore, we investigate the effectiveness of relaxation-based initialization methods on randomized datasets, demonstrating a relationship between the relaxed solution and the optimal solution. This research underscores the potential of combining reverse annealing and classical optimization strategies to enhance optimization performance.
Decision Explanation: Original decision: REJECT The paper focuses on optimizing nonnegative/binary matrix factorization using quantum annealing and linear programming relaxation, with no apparent connection to Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet the primary 'MUST' criteria.
A hybrid marketplace of ideas
arXiv ID: http://arxiv.org/abs/2501.02132v2
Abstract:
The convergence of humans and artificial intelligence systems introduces new dynamics into the cultural and intellectual landscape. Complementing emerging cultural evolution concepts such as machine culture, AI agents represent a significant techno-sociological development, particularly within the anthropological study of Web3 as a community focused on decentralization through blockchain. Despite their growing presence, the cultural significance of AI agents remains largely unexplored in academic literature. Toward this end, we conceived hybrid netnography, a novel interdisciplinary approach that examines the cultural and intellectual dynamics within digital ecosystems by analyzing the interactions and contributions of both human and AI agents as co-participants in shaping narratives, ideas, and cultural artifacts. We argue that, within the Web3 community on the social media platform X, these agents challenge traditional notions of participation and influence in public discourse, creating a hybrid marketplace of ideas, a conceptual space where human- and AI-generated ideas coexist and compete for attention. We examine the current state of AI agents in idea generation, propagation, and engagement, positioning their role as cultural agents through the lens of memetics and encouraging further inquiry into their cultural and societal impact. Additionally, we address the implications of this paradigm for privacy, intellectual property, and governance, highlighting the societal and legal challenges of integrating AI agents into the hybrid marketplace of ideas.
Decision Explanation: Original decision: REJECT The paper's primary focus is on the societal and cultural impact of AI agents in a Web3 community, not on engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), failing to meet the core subject requirement.
Effective LLM-Driven Code Generation with Pythoness
arXiv ID: http://arxiv.org/abs/2501.02138v1
Abstract:
The advent of large language models (LLMs) has paved the way for a new era of programming tools with both significant capabilities and risks, as the generated code lacks guarantees of correctness and reliability. Developers using LLMs currently face the difficult task of optimizing, integrating, and maintaining code generated by AI. We propose an embedded domain-specific language (DSL), Pythoness, to address those challenges. In Pythoness, developers program with LLMs at a higher level of abstraction. Rather than interacting directly with generated code, developers using Pythoness operate at the level of behavioral specifications when writing functions, classes, or an entire program. These specifications can take the form of unit tests and property-based tests, which may be expressed formally or in natural language. Guided by these specifications, Pythoness generates code that both passes the tests and can be continuously checked during execution. We posit that the Pythoness approach lets developers harness the full potential of LLMs for code generation while substantially mitigating their inherent risks. We describe our current prototype implementation of Pythoness and demonstrate that it can successfully leverage a combination of tests and code generation to yield higher quality code than specifications alone.
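The workflow the abstract describes, developers stating behavioral specifications and the system accepting only generated code that satisfies them, can be illustrated with a toy acceptance check. This is a purely hypothetical shape, not Pythoness's actual syntax (the real DSL may express specs as decorators, property-based tests, or natural language):

```python
# Hypothetical illustration of specification-driven generation: the developer
# supplies tests, and an LLM-generated implementation is kept only if it
# passes all of them; failing candidates would be regenerated.

TESTS = [(("hello",), "olleh"), (("ab",), "ba")]

def accept(candidate_fn) -> bool:
    """Keep an LLM-generated implementation only if every test passes."""
    return all(candidate_fn(*args) == expected for args, expected in TESTS)

# e.g. accept(lambda s: s[::-1]) -> True, while a buggy candidate is rejected
# and the system asks the LLM for another attempt.
```

Pythoness additionally keeps checking these specifications during execution, which the toy sketch does not capture.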
Decision Explanation: Original decision: REJECT The paper's primary focus is on developing an embedded domain-specific language (DSL) for safer code generation with LLMs, rather than on prompt engineering for text-based interactions with LLMs. While LLMs are utilized, the core subject is the DSL (Pythoness) and its application, not the optimization or design of textual input prompts for improving LLM performance.
Establishing baselines for generative discovery of inorganic crystals
arXiv ID: http://arxiv.org/abs/2501.02144v1
Abstract:
Generative artificial intelligence offers a promising avenue for materials discovery, yet its advantages over traditional methods remain unclear. In this work, we introduce and benchmark two baseline approaches - random enumeration of charge-balanced prototypes and data-driven ion exchange of known compounds - against three generative models: a variational autoencoder, a large language model, and a diffusion model. Our results show that established methods such as ion exchange perform comparably well in generating stable materials, although many of these materials tend to closely resemble known compounds. In contrast, generative models excel at proposing novel structural frameworks and, when sufficient training data is available, can more effectively target properties such as electronic band gap and bulk modulus while maintaining a high stability rate. To enhance the performance of both the baseline and generative approaches, we implement a post-generation screening step in which all proposed structures are passed through stability and property filters from pre-trained machine learning models including universal interatomic potentials. This low-cost filtering step leads to substantial improvement in the success rates of all methods, remains computationally efficient, and ultimately provides a practical pathway toward more effective generative strategies for materials discovery.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on generative discovery of inorganic crystals using various AI models, including LLMs, but does not centralize prompt engineering for text-based interactions with LLMs. Instead, LLMs are utilized as one of several generative models for materials discovery, not meeting the 'MUST' criteria of focusing primarily on prompt engineering for LLMs." }
[Plasma-CycleGAN: Plasma Biomarker-Guided MRI to PET Cross-modality Translation Using Conditional CycleGAN](https://arxiv.org/abs/2501.02146v1) arXiv ID: http://arxiv.org/abs/2501.02146v1
Abstract:
Cross-modality translation between MRI and PET imaging is challenging due to the distinct mechanisms underlying these modalities. Blood-based biomarkers (BBBMs) are revolutionizing Alzheimer's disease (AD) detection by identifying patients and quantifying brain amyloid levels. However, the potential of BBBMs to enhance PET image synthesis remains unexplored. In this paper, we performed a thorough study on the effect of incorporating BBBMs into deep generative models. By evaluating three widely used cross-modality translation models, we found that BBBM integration consistently enhances generative quality across all models. By visual inspection of the generated results, we observed that PET images generated by CycleGAN exhibit the best visual fidelity. Based on these findings, we propose Plasma-CycleGAN, a novel generative model based on CycleGAN, to synthesize PET images from MRI using BBBMs as conditions. This is the first approach to integrate BBBMs into conditional cross-modality translation between MRI and PET.
Decision Explanation: Original decision: REJECT The paper primarily focuses on cross-modality translation between MRI and PET imaging using conditional CycleGAN, with an emphasis on integrating blood-based biomarkers for Alzheimer's disease detection, and does not meet the 'MUST' criteria as it lacks focus on prompt engineering, LLMs, and textual input manipulation.
The Race to Efficiency: A New Perspective on AI Scaling Laws
arXiv ID: http://arxiv.org/abs/2501.02156v3
Abstract:
As large-scale AI models expand, training becomes costlier and sustaining progress grows harder. Classical scaling laws (e.g., Kaplan et al. (2020), Hoffmann et al. (2022)) predict training loss from a static compute budget yet neglect time and efficiency, prompting the question: how can we balance ballooning GPU fleets with rapidly improving hardware and algorithms? We introduce the relative-loss equation, a time- and efficiency-aware framework that extends classical AI scaling laws. Our model shows that, without ongoing efficiency gains, advanced performance could demand millennia of training or unrealistically large GPU fleets. However, near-exponential progress remains achievable if the "efficiency-doubling rate" parallels Moore's Law. By formalizing this race to efficiency, we offer a quantitative roadmap for balancing front-loaded GPU investments with incremental improvements across the AI stack. Empirical trends suggest that sustained efficiency gains can push AI scaling well into the coming decade, providing a new perspective on the diminishing returns inherent in classical scaling.
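For orientation, the classical scaling law the abstract builds on has a standard closed form, and the abstract's "efficiency-doubling rate" suggests a time-dependent rescaling of compute. The second equation below is our illustrative reading, not the paper's relative-loss equation, which is not reproduced here.

```latex
% Classical compute-optimal scaling (Hoffmann et al., 2022), for context:
\begin{equation}
  L(N, D) = E + A\,N^{-\alpha} + B\,D^{-\beta}
\end{equation}
% An illustrative, time-aware extension in the spirit of the abstract's
% "efficiency-doubling rate" (our assumption):
\begin{equation}
  C_{\mathrm{eff}}(t) = C(t)\,2^{\,t/T_d}
\end{equation}
% where N is parameters, D is tokens, C(t) is raw fleet compute, and T_d is
% the efficiency-doubling period; if T_d tracks Moore's Law, effective compute
% grows near-exponentially even for a fixed fleet.
```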
Decision Explanation: Original decision: REJECT The paper focuses primarily on AI scaling laws, efficiency, and balancing GPU investments, with no clear emphasis on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), thus failing to meet the 'MUST' criteria.
[Can ChatGPT implement finite element models for geotechnical engineering applications?](https://arxiv.org/abs/2501.02199v1) arXiv ID: http://arxiv.org/abs/2501.02199v1
Abstract:
This study assesses the capability of ChatGPT to generate finite element code for geotechnical engineering applications from a set of prompts. We tested three different initial boundary value problems using a hydro-mechanically coupled formulation for unsaturated soils, including the dissipation of excess pore water pressure through fluid mass diffusion in one-dimensional space, time-dependent differential settlement of a strip footing, and gravity-driven seepage. For each case, initial prompting involved providing ChatGPT with necessary information for finite element implementation, such as balance and constitutive equations, problem geometry, initial and boundary conditions, material properties, and spatiotemporal discretization and solution strategies. Any errors and unexpected results were further addressed through prompt augmentation processes until the ChatGPT-generated finite element code passed the verification/validation test. Our results demonstrate that ChatGPT required minimal code revisions when using the FEniCS finite element library, owing to its high-level interfaces that enable efficient programming. In contrast, the MATLAB code generated by ChatGPT necessitated extensive prompt augmentations and/or direct human intervention, as it involves a significant amount of low-level programming required for finite element analysis, such as constructing shape functions or assembling global matrices. Given that prompt engineering for this task requires an understanding of the mathematical formulation and numerical techniques, this study suggests that while a large language model may not yet replace human programmers, it can greatly assist in the implementation of numerical models.
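The first test case, one-dimensional dissipation of excess pore water pressure by fluid mass diffusion, reduces to a transient diffusion equation, exactly the kind of problem where FEniCS's high-level interface keeps generated code short, as the abstract observes. A minimal legacy-FEniCS sketch of that case follows; the coefficients, discretization, and boundary conditions are illustrative choices, not the paper's setup.

```python
# 1D dissipation of excess pore pressure (Terzaghi-type consolidation) solved
# with linear elements and implicit Euler time stepping in legacy FEniCS.
from fenics import (Constant, DirichletBC, Function, FunctionSpace,
                    TestFunction, TrialFunction, UnitIntervalMesh, dot, dx,
                    grad, interpolate, solve)

mesh = UnitIntervalMesh(100)
V = FunctionSpace(mesh, "P", 1)

cv, dt = 1.0, 1e-3                         # consolidation coefficient, time step
p, q = TrialFunction(V), TestFunction(V)
p_n = interpolate(Constant(1.0), V)        # uniform initial excess pore pressure

bc = DirichletBC(V, Constant(0.0), "near(x[0], 0.0)")  # drained boundary at x=0
# x = 1 is left as the natural (zero-flux, i.e. undrained) boundary.

a = p * q * dx + dt * cv * dot(grad(p), grad(q)) * dx  # implicit Euler, weak form
L = p_n * q * dx

p_h = Function(V)
for _ in range(200):                       # march the diffusion in time
    solve(a == L, p_h, bc)
    p_n.assign(p_h)
```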
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "While the paper does investigate the manipulation of textual input prompts for LLMs (ChatGPT), its primary focus is on the application of LLMs in geotechnical engineering (finite element models) rather than the engineering, design, or optimization of prompts specifically for Large Language Models." }
Learning Evolution via Optimization Knowledge Adaptation
arXiv ID: http://arxiv.org/abs/2501.02200v1
Abstract:
Evolutionary algorithms (EAs) maintain populations through evolutionary operators to discover diverse solutions for complex tasks while gathering valuable knowledge, such as historical population data and fitness evaluations. However, traditional EAs face challenges in dynamically adapting to expanding knowledge bases, hindering the efficient exploitation of accumulated information and limiting adaptability to new situations. To address these issues, we introduce an Optimization Knowledge Adaptation Evolutionary Model (OKAEM), which features dynamic parameter adjustment using accumulated knowledge to enhance its optimization capabilities. OKAEM employs attention mechanisms to model the interactions among individuals, fitness landscapes, and genetic components separately, thereby parameterizing the evolutionary operators of selection, crossover, and mutation. These powerful learnable operators enable OKAEM to benefit from pre-learned extensive prior knowledge and self-tune with real-time evolutionary insights. Experimental results demonstrate that OKAEM: 1) exploits prior knowledge for significant performance gains across various knowledge transfer settings; 2) achieves competitive performance through self-tuning alone, even without prior knowledge; 3) outperforms state-of-the-art black-box baselines in a vision-language model tuning case; 4) can improve its optimization capabilities with growing knowledge; 5) is capable of emulating principles of natural selection and genetic recombination.
Decision Explanation: Original decision: REJECT The paper focuses on evolutionary algorithms and optimization knowledge adaptation for improving performance in complex tasks, with a specific application in vision-language model tuning. It does not meet the 'MUST' criteria, particularly focusing primarily on engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) or investigating methods to improve LLM performance through prompt manipulation.
[Diffusion Model-Based Data Synthesis Aided Federated Semi-Supervised Learning](https://arxiv.org/abs/2501.02219v1) arXiv ID: http://arxiv.org/abs/2501.02219v1
Abstract:
Federated semi-supervised learning (FSSL) is primarily challenged by two factors: the scarcity of labeled data across clients and the non-independent and identically distributed (non-IID) nature of data among clients. In this paper, we propose a novel approach, diffusion model-based data synthesis aided FSSL (DDSA-FSSL), which utilizes a diffusion model (DM) to generate synthetic data, bridging the gap between heterogeneous local data distributions and the global data distribution. In DDSA-FSSL, clients address the scarcity of labeled data by employing a federated learning-trained classifier to perform pseudo-labeling for unlabeled data. The DM is then collaboratively trained using both labeled and precision-optimized pseudo-labeled data, enabling clients to generate synthetic samples for classes that are absent in their labeled datasets. This process allows clients to generate more comprehensive synthetic datasets aligned with the global distribution. Extensive experiments conducted on multiple datasets and varying non-IID distributions demonstrate the effectiveness of DDSA-FSSL, e.g., it improves accuracy from 38.46% to 52.14% on CIFAR-10 datasets with 10% labeled data.
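The pseudo-labeling step lends itself to a short sketch. The confidence thresholding below is an illustrative stand-in for the paper's "precision-optimized" selection, whose exact criterion is not described in the abstract.

```python
# Sketch of confidence-filtered pseudo-labeling on a client: keep only the
# unlabeled samples the federated classifier is confident about, and use their
# hard labels when training the diffusion model.
import numpy as np

def pseudo_label(classifier_probs: np.ndarray, threshold: float = 0.9):
    """classifier_probs: (n_samples, n_classes) softmax outputs.
    Returns indices of retained samples and their hard pseudo-labels."""
    conf = classifier_probs.max(axis=1)
    keep = np.where(conf >= threshold)[0]           # confident samples only
    labels = classifier_probs[keep].argmax(axis=1)  # hard pseudo-labels
    return keep, labels
```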
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper focuses on federated semi-supervised learning and utilizes a diffusion model for data synthesis, with no primary emphasis on prompt engineering for Large Language Models (LLMs) or the manipulation of textual input prompts to improve LLM performance." }
CORD: Generalizable Cooperation via Role Diversity
arXiv ID: http://arxiv.org/abs/2501.02221v2
Abstract:
Cooperative multi-agent reinforcement learning (MARL) aims to develop agents that can collaborate effectively. However, most cooperative MARL methods overfit training agents, making learned policies not generalize well to unseen collaborators, which is a critical issue for real-world deployment. Some methods attempt to address the generalization problem but require prior knowledge or predefined policies of new teammates, limiting real-world applications. To this end, we propose a hierarchical MARL approach to enable generalizable cooperation via role diversity, namely CORD. CORD's high-level controller assigns roles to low-level agents by maximizing the role entropy with constraints. We show this constrained objective can be decomposed into causal influence in role that enables reasonable role assignment, and role heterogeneity that yields coherent, non-redundant role clusters. Evaluated on a variety of cooperative multi-agent tasks, CORD achieves better performance than baselines, especially in generalization tests. Ablation studies further demonstrate the efficacy of the constrained objective in generalizable cooperation.
Decision Explanation: Original decision: REJECT The paper primarily focuses on cooperative multi-agent reinforcement learning (MARL) and a hierarchical approach for generalizable cooperation, without any mention of Large Language Models (LLMs), prompt engineering, or textual input manipulation, thus failing to meet all the 'MUST' criteria.
Towards a constructive framework for control theory
arXiv ID: http://arxiv.org/abs/2501.02267v1
Abstract:
This work presents a framework for control theory based on constructive analysis to account for the discrepancy between mathematical results and their implementation in a computer, also referred to as computational uncertainty. In control engineering, the latter is usually either neglected or considered subsumed into some other type of uncertainty, such as system noise, and addressed within robust control. However, even robust control methods may be compromised when the mathematical objects involved in the respective algorithms fail to exist in exact form and subsequently fail to satisfy the required properties. For instance, in general stabilization using a control Lyapunov function, computational uncertainty may distort stability certificates or even destabilize the system despite robustness of the stabilization routine with regard to system, actuator, and measurement noise. In fact, battling numerical problems in practical implementation of controllers is common among control engineers. Such observations indicate that computational uncertainty should indeed be addressed explicitly in controller synthesis and system analysis. The major contribution here is a fairly general framework for proof techniques in analysis and synthesis of control systems based on constructive analysis, which explicitly requires that every computation be doable only up to a finite precision, thus accounting for computational uncertainty. A series of previous works is overviewed, including constructive system stability and stabilization, approximate optimal controls, eigenvalue problems, Caratheodory trajectories, and measurable selectors. Additionally, a new constructive version of Danskin's theorem, which is crucial in adversarial defense, is presented.
Decision Explanation: Original decision: REJECT The paper does not meet the primary criteria as it focuses on control theory and computational uncertainty in control system engineering, with no mention of Large Language Models (LLMs) or prompt engineering for text-based interactions with LLMs.
[Deep Learning-Driven Segmentation of Ischemic Stroke Lesions Using Multi-Channel MRI](https://arxiv.org/abs/2501.02287v1)
arXiv ID: http://arxiv.org/abs/2501.02287v1
Abstract:
Ischemic stroke, caused by cerebral vessel occlusion, presents substantial challenges in medical imaging due to the variability and subtlety of stroke lesions. Magnetic Resonance Imaging (MRI) plays a crucial role in diagnosing and managing ischemic stroke, yet existing segmentation techniques often fail to accurately delineate lesions. This study introduces a novel deep learning-based method for segmenting ischemic stroke lesions using multi-channel MRI modalities, including Diffusion Weighted Imaging (DWI), Apparent Diffusion Coefficient (ADC), and enhanced Diffusion Weighted Imaging (eDWI). The proposed architecture integrates DenseNet121 as the encoder with Self-Organized Operational Neural Networks (SelfONN) in the decoder, enhanced by Channel and Space Compound Attention (CSCA) and Double Squeeze-and-Excitation (DSE) blocks. Additionally, a custom loss function combining Dice Loss and Jaccard Loss with weighted averages is introduced to improve model performance. Trained and evaluated on the ISLES 2022 dataset, the model achieved Dice Similarity Coefficients (DSC) of 83.88% using DWI alone, 85.86% with DWI and ADC, and 87.49% with the integration of DWI, ADC, and eDWI. This approach not only outperforms existing methods but also addresses key limitations in current segmentation practices. These advancements significantly enhance diagnostic precision and treatment planning for ischemic stroke, providing valuable support for clinical decision-making.
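The abstract specifies the custom loss only as a weighted combination of Dice and Jaccard losses. A minimal sketch of such a combined soft loss follows; the weights, the smoothing constant, and the numpy formulation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dice_jaccard_loss(pred, target, w_dice=0.5, w_jaccard=0.5, eps=1e-6):
    """Weighted combination of soft Dice and Jaccard losses.

    pred, target: arrays of per-voxel foreground probabilities in [0, 1].
    """
    pred, target = pred.ravel(), target.ravel()
    inter = (pred * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    union = pred.sum() + target.sum() - inter
    jaccard = 1.0 - (inter + eps) / (union + eps)
    return w_dice * dice + w_jaccard * jaccard
```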
Decision Explanation: Original decision: REJECT The paper focuses primarily on the development of a deep learning-based method for segmenting ischemic stroke lesions using MRI modalities, which is a medical application unrelated to Large Language Models (LLMs) or prompt engineering, thus failing to meet all 'MUST' criteria.
DiffGraph: Heterogeneous Graph Diffusion Model
arXiv ID: http://arxiv.org/abs/2501.02313v1
Abstract:
Recent advances in Graph Neural Networks (GNNs) have revolutionized graph-structured data modeling, yet traditional GNNs struggle with complex heterogeneous structures prevalent in real-world scenarios. Despite progress in handling heterogeneous interactions, two fundamental challenges persist: noisy data significantly compromising embedding quality and learning performance, and existing methods' inability to capture intricate semantic transitions among heterogeneous relations, which impacts downstream predictions. To address these fundamental issues, we present the Heterogeneous Graph Diffusion Model (DiffGraph), a pioneering framework that introduces an innovative cross-view denoising strategy. This advanced approach transforms auxiliary heterogeneous data into target semantic spaces, enabling precise distillation of task-relevant information. At its core, DiffGraph features a sophisticated latent heterogeneous graph diffusion mechanism, implementing a novel forward and backward diffusion process for superior noise management. This methodology achieves simultaneous heterogeneous graph denoising and cross-type transition, while significantly simplifying graph generation through its latent-space diffusion capabilities. Through rigorous experimental validation on both public and industrial datasets, we demonstrate that DiffGraph consistently surpasses existing methods in link prediction and node classification tasks, establishing new benchmarks for robustness and efficiency in heterogeneous graph processing. The model implementation is publicly available at: https://github.com/HKUDS/DiffGraph.
Decision Explanation: Original decision: REJECT The paper focuses on developing a new graph neural network model (DiffGraph) for handling heterogeneous graph-structured data, primarily addressing issues of noise and semantic transitions in graph processing, with no clear emphasis on prompt engineering for Large Language Models (LLMs) or manipulation of textual input prompts to improve LLM performance.
[Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications](https://arxiv.org/abs/2501.02334v1)
arXiv ID: http://arxiv.org/abs/2501.02334v1
Abstract:
The rapid advancements in large language models and generative artificial intelligence (AI) capabilities are making their broad application in the high-stakes testing context more likely. Use of generative AI in the scoring of constructed responses is particularly appealing because it reduces the effort required for handcrafting features in traditional AI scoring and might even outperform those methods. The purpose of this paper is to highlight the differences between feature-based and generative AI applications in constructed response scoring systems and to propose a set of best practices for the collection of validity evidence to support the use and interpretation of constructed response scores from scoring systems using generative AI. We compare the validity evidence needed in scoring systems using human ratings, feature-based natural language processing AI scoring engines, and generative AI. The evidence needed in the generative AI context is more extensive than in the feature-based NLP scoring context because of the lack of transparency and other concerns unique to generative AI, such as consistency. Constructed response score data from standardized tests demonstrate the collection of validity evidence for different types of scoring systems and highlight the numerous complexities and considerations involved in making a validity argument for these scores. In addition, we discuss how the evaluation of AI scores might include a consideration of how a contributory scoring approach combining multiple AI scores (from different sources) will cover more of the construct in the absence of human ratings.
Decision Explanation: Original decision: REJECT The paper primarily focuses on validity arguments for scoring systems using generative AI in high-stakes testing, rather than specifically on the engineering, design, or optimization of prompts for Large Language Models (LLMs). Prompt engineering is not the central concern, and the paper's main subject is the application and validation of generative AI in a testing context, excluding it based on the 'MUST NOT' criteria.
[GNSS/GPS Spoofing and Jamming Identification Using Machine Learning and Deep Learning](https://arxiv.org/abs/2501.02352v1)
arXiv ID: http://arxiv.org/abs/2501.02352v1
Abstract:
The increasing reliance on Global Navigation Satellite Systems (GNSS), particularly the Global Positioning System (GPS), underscores the urgent need to safeguard these technologies against malicious threats such as spoofing and jamming. As the backbone for positioning, navigation, and timing (PNT) across various applications including transportation, telecommunications, and emergency services GNSS is vulnerable to deliberate interference that poses significant risks. Spoofing attacks, which involve transmitting counterfeit GNSS signals to mislead receivers into calculating incorrect positions, can result in serious consequences, from navigational errors in civilian aviation to security breaches in military operations. Furthermore, the lack of inherent security measures within GNSS systems makes them attractive targets for adversaries. While GNSS/GPS jamming and spoofing systems consist of numerous components, the ability to distinguish authentic signals from malicious ones is essential for maintaining system integrity. Recent advancements in machine learning and deep learning provide promising avenues for enhancing detection and mitigation strategies against these threats. This paper addresses both spoofing and jamming by tackling real-world challenges through machine learning, deep learning, and computer vision techniques. Through extensive experiments on two real-world datasets related to spoofing and jamming detection using advanced algorithms, we achieved state of the art results. In the GNSS/GPS jamming detection task, we attained approximately 99% accuracy, improving performance by around 5% compared to previous studies. Additionally, we addressed a challenging tasks related to spoofing detection, yielding results that underscore the potential of machine learning and deep learning in this domain.
Decision Explanation: Original decision: REJECT The paper does not meet any of the 'MUST' criteria, as it focuses primarily on using machine learning and deep learning for GNSS/GPS spoofing and jamming identification, with no mention of Large Language Models (LLMs), prompt engineering, or text generation.
[FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles](https://arxiv.org/abs/2501.03181v1)
arXiv ID: http://arxiv.org/abs/2501.03181v1
Abstract:
Humans can perceive speakers' characteristics (e.g., identity, gender, personality and emotion) from their appearance, which is generally aligned with their voice style. Recently, vision-driven text-to-speech (TTS) scholars have grounded their investigations on real-person faces, thereby preventing effective speech synthesis from extending to vast potential usage scenarios with diverse characters and image styles. To solve this issue, we introduce a novel FaceSpeak approach. It extracts salient identity characteristics and emotional representations from a wide variety of image styles. Meanwhile, it mitigates extraneous information (e.g., background, clothing, and hair color), resulting in synthesized speech closely aligned with a character's persona. Furthermore, to overcome the scarcity of multi-modal TTS data, we have devised an innovative dataset, namely Expressive Multi-Modal TTS, which is diligently curated and annotated to facilitate research in this domain. The experimental results demonstrate that our proposed FaceSpeak can generate portrait-aligned voice with satisfactory naturalness and quality.
Decision Explanation: Original decision: REJECT The paper focuses primarily on speech synthesis from human portraits using a novel FaceSpeak approach, which involves image processing and Text-to-Speech (TTS) for audio generation. It does not investigate, analyze, or propose methods for improving Large Language Model (LLM) performance through the manipulation of textual input prompts, nor does it provide concrete examples of prompts and their impact on LLM output.
Classifier-Guided Captioning Across Modalities
arXiv ID: http://arxiv.org/abs/2501.03183v1
Abstract:
Most current captioning systems use language models trained on data from specific settings, such as image-based captioning via Amazon Mechanical Turk, limiting their ability to generalize to other modality distributions and contexts. This limitation hinders performance in tasks like audio or video captioning, where different semantic cues are needed. Addressing this challenge is crucial for creating more adaptable and versatile captioning frameworks applicable across diverse real-world contexts. In this work, we introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning, where it is crucial to describe sounds and their sources. Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system. The classifier is trained on a dataset automatically generated by GPT-4, using tailored prompts specifically designed to enhance key aspects of the generated captions. Importantly, the framework operates solely during inference, eliminating the need for further training of the underlying captioning model. We evaluate the framework on various models and modalities, with a focus on audio captioning, and report promising results. Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets state-of-the-art performance in zero-shot audio captioning.
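Since the framework operates purely at inference, one simple way to realize classifier guidance is to re-rank candidate captions by the classifier's score. The sketch below assumes hypothetical `generate_candidates` (the frozen captioner) and `audibility_score` (the classifier trained on GPT-4-generated data) callables; the paper's guidance mechanism may be finer-grained than whole-caption re-ranking.

```python
def classifier_guided_caption(audio, generate_candidates, audibility_score, k=8):
    """Inference-only guidance: the frozen captioner proposes k candidates,
    and the text classifier re-ranks them; no captioner weights are updated."""
    candidates = generate_candidates(audio, k)
    # Keep the caption the classifier judges most faithful to the target
    # modality (e.g., highest audibility score for audio captioning).
    return max(candidates, key=audibility_score)
```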
Decision Explanation: Original decision: REJECT Although the paper mentions using tailored prompts with GPT-4, the primary focus is on adapting captioning networks across modalities (image, audio, video) and improving the captioning system with a classifier, rather than prompt engineering for Large Language Models (LLMs) being the core subject.
[Breaking Through the Spike: Spike Window Decoding for Accelerated and Precise Automatic Speech Recognition](https://arxiv.org/abs/2501.03257v1)
arXiv ID: http://arxiv.org/abs/2501.03257v1
Abstract:
Recently, end-to-end automatic speech recognition has become the mainstream approach in both industry and academia. To optimize system performance in specific scenarios, the Weighted Finite-State Transducer (WFST) is extensively used to integrate acoustic and language models, leveraging its capacity to implicitly fuse language models within static graphs, thereby ensuring robust recognition while also facilitating rapid error correction. However, WFST necessitates a frame-by-frame search of CTC posterior probabilities through autoregression, which significantly hampers inference speed. In this work, we thoroughly investigate the spike property of CTC outputs and further propose the conjecture that frames adjacent to non-blank spikes carry semantic information beneficial to the model. Building on this, we propose the Spike Window Decoding algorithm, which greatly improves the inference speed by making the number of frames decoded in WFST linearly related to the number of spiking frames in the CTC output, while guaranteeing the recognition performance. Our method achieves SOTA recognition accuracy while significantly accelerating decoding speed, as demonstrated on both the AISHELL-1 and large-scale in-house datasets, establishing a pioneering approach for integrating CTC output with WFST.
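A rough sketch of the core idea, selecting only frames around non-blank CTC spikes for the WFST search, might look as follows; the spike threshold and window radius are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def spike_windows(ctc_posteriors, blank=0, thresh=0.7, radius=1):
    """Select frame indices near non-blank CTC spikes.

    ctc_posteriors: (T, V) per-frame label distribution.
    Returns sorted unique frame indices to pass to the WFST search,
    so decoding cost scales with the number of spikes rather than T.
    """
    top = ctc_posteriors.argmax(axis=1)
    spikes = np.where((top != blank) & (ctc_posteriors.max(axis=1) > thresh))[0]
    frames = {t for s in spikes
              for t in range(max(0, s - radius), min(len(top), s + radius + 1))}
    return sorted(frames)
```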
Decision Explanation: Original decision: REJECT The paper focuses on optimizing Automatic Speech Recognition (ASR) using Spike Window Decoding and Weighted Finite-State Transducers, with no mention of Large Language Models (LLMs) or prompt engineering for text-based interactions, thus failing to meet all 'MUST' criteria.
[Navigation Variable-based Multi-objective Particle Swarm Optimization for UAV Path Planning with Kinematic Constraints](https://arxiv.org/abs/2501.03261v1)
arXiv ID: http://arxiv.org/abs/2501.03261v1
Abstract:
Path planning is essential for unmanned aerial vehicles (UAVs) as it determines the path that the UAV needs to follow to complete a task. This work addresses this problem by introducing a new algorithm called navigation variable-based multi-objective particle swarm optimization (NMOPSO). It first models path planning as an optimization problem via the definition of a set of objective functions that include optimality and safety requirements for UAV operation. The NMOPSO is then used to minimize those functions through Pareto optimal solutions. The algorithm features a new path representation based on navigation variables to include kinematic constraints and exploit the maneuverable characteristics of the UAV. It also includes an adaptive mutation mechanism to enhance the diversity of the swarm for better solutions. Comparisons with various algorithms have been carried out to benchmark the proposed approach. The results indicate that the NMOPSO performs better than not only other particle swarm optimization variants but also other state-of-the-art multi-objective and metaheuristic optimization algorithms. Experiments have also been conducted with real UAVs to confirm the validity of the approach for practical flights. The source code of the algorithm is available at https://github.com/ngandng/NMOPSO.
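The Pareto-optimality machinery underlying NMOPSO rests on the standard dominance test, sketched below for minimization objectives such as path length and collision risk; the function itself is generic multi-objective optimization, not code from the paper's repository.

```python
def dominates(f_a, f_b):
    """Pareto dominance for minimization: f_a dominates f_b if it is
    no worse in every objective and strictly better in at least one."""
    return (all(a <= b for a, b in zip(f_a, f_b))
            and any(a < b for a, b in zip(f_a, f_b)))
```

The optimizer maintains an archive of particle positions whose objective vectors pass this test against all others; that archive approximates the Pareto front from which a final path is chosen.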
Decision Explanation: Original decision: REJECT The paper focuses on developing a new optimization algorithm for UAV path planning, with no mention of Large Language Models (LLMs), prompt engineering, or textual input prompts, thus failing to meet all 'MUST' criteria.
[Bridge the Inference Gaps of Neural Processes via Expectation Maximization](https://arxiv.org/abs/2501.03264v1)
arXiv ID: http://arxiv.org/abs/2501.03264v1
Abstract:
The neural process (NP) is a family of computationally efficient models for learning distributions over functions. However, it suffers from under-fitting and shows suboptimal performance in practice. Researchers have primarily focused on incorporating diverse structural inductive biases, \textit{e.g.} attention or convolution, in modeling. The topic of inference suboptimality and an analysis of the NP from the optimization objective perspective has hardly been studied in earlier work. To fix this issue, we propose a surrogate objective of the target log-likelihood of the meta dataset within the expectation maximization framework. The resulting model, referred to as the Self-normalized Importance weighted Neural Process (SI-NP), can learn a more accurate functional prior and has an improvement guarantee concerning the target log-likelihood. Experimental results show the competitive performance of SI-NP over other NPs objectives and illustrate that structural inductive biases, such as attention modules, can also augment our method to achieve SOTA performance. Our code is available at \url{https://github.com/hhq123gogogo/SI_NPs}.
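The self-normalization in SI-NP's name follows the generic self-normalized importance sampling recipe; a minimal, numerically stable sketch of the weight computation (not the paper's full surrogate objective) is:

```python
import numpy as np

def self_normalized_weights(log_w):
    """Self-normalized importance weights from unnormalized log-weights
    log_w[i] = log p(y, z_i | x) - log q(z_i | x); subtracting the max
    before exponentiating is the standard softmax-style stabilization."""
    log_w = log_w - log_w.max()
    w = np.exp(log_w)
    return w / w.sum()
```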
Decision Explanation: Original decision: REJECT The paper focuses on improving Neural Processes (NP) via Expectation Maximization, primarily addressing under-fitting and optimization objectives, with no clear connection to Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria.
[Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition](https://arxiv.org/abs/2501.04038v1)
arXiv ID: http://arxiv.org/abs/2501.04038v1
Abstract:
Unlike traditional Automatic Speech Recognition (ASR), Audio-Visual Speech Recognition (AVSR) takes audio and visual signals simultaneously to infer the transcription. Recent studies have shown that Large Language Models (LLMs) can be effectively used for Generative Error Correction (GER) in ASR by predicting the best transcription from ASR-generated N-best hypotheses. However, these LLMs lack the ability to simultaneously understand audio and visual inputs, making the GER approach challenging to apply in AVSR. In this work, we propose a novel GER paradigm for AVSR, termed AVGER, that follows the concept of ``listening and seeing again''. Specifically, we first use the powerful AVSR system to read the audio and visual signals to get the N-best hypotheses, and then use the Q-Former-based Multimodal Synchronous Encoder to read the audio and visual information again and convert them into audio and video compression representations that can be understood by the LLM. Afterward, the audio-visual compression representations and the N-best hypotheses together constitute a Cross-modal Prompt to guide the LLM in producing the best transcription. In addition, we also propose a Multi-Level Consistency Constraint training criterion, including logits-level, utterance-level and representations-level, to improve the correction accuracy while enhancing the interpretability of the audio and visual compression representations. The experimental results on the LRS3 dataset show that our method outperforms current mainstream AVSR systems. The proposed AVGER can reduce the Word Error Rate (WER) by 24% compared to them. Code and models can be found at: https://github.com/CircleRedRain/AVGER.
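The textual scaffold of such a cross-modal prompt can be sketched as below; the tags, wording, and the idea of splicing the Q-Former embeddings in at placeholder slots are illustrative assumptions rather than the paper's exact template.

```python
def build_cross_modal_prompt(nbest, audio_slot, video_slot):
    """Assemble the text portion of a cross-modal prompt: N-best AVSR
    hypotheses plus placeholder slots where the audio/video compression
    embeddings would be spliced into the LLM input sequence."""
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (
        f"<audio>{audio_slot}</audio>\n"
        f"<video>{video_slot}</video>\n"
        f"The N-best hypotheses are:\n{hyps}\n"
        "Report the most accurate transcription."
    )
```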
Decision Explanation: Original decision: REJECT The paper focuses primarily on improving Audio-Visual Speech Recognition (AVSR) with Generative Error Correction using LLMs, rather than specifically on the engineering, design, or optimization of prompts for LLMs. While prompts are mentioned (Cross-modal Prompt), they are not the central focus, and the work is more about applying LLMs within a larger AVSR system.
[A Survey on Large Language Models with some Insights on their Capabilities and Limitations](https://arxiv.org/abs/2501.04040v1)
arXiv ID: http://arxiv.org/abs/2501.04040v1
Abstract:
The rapid advancement of artificial intelligence, particularly with the development of Large Language Models (LLMs) built on the transformer architecture, has redefined the capabilities of natural language processing. These models now exhibit remarkable performance across various language-related tasks, such as text generation, question answering, translation, and summarization, often rivaling human-like comprehension. More intriguingly, LLMs have demonstrated emergent abilities extending beyond their core functions, showing proficiency in tasks like commonsense reasoning, code generation, and arithmetic. This survey paper explores the foundational components, scaling mechanisms, and architectural strategies that drive these capabilities. Emphasizing models like GPT and LLaMA, we analyze the impact of exponential data and computational growth on LLM performance, while also addressing the trade-offs associated with scaling. We also examine LLM applications across sectors, such as healthcare, finance, education, and law, highlighting their adaptability and potential to solve domain-specific challenges. Central to this work are the questions of how LLMs generalize across diverse tasks, exhibit planning, and reasoning abilities, and whether these emergent abilities can be systematically elicited or enhanced. In particular, we provide some insights into the CoT (Chain of Thought) and PoT (Plan of Thought) abilities within LLMs, focusing on how pre-training data influences their emergence. Additionally, we investigate LLM-modulo frameworks that integrate external systems, allowing LLMs to handle complex, dynamic tasks. By analyzing these factors, this paper aims to foster the ongoing discussion on the capabilities and limits of LLMs, promoting their responsible development and application in novel and increasingly complex environments.
Decision Explanation: Original decision: REJECT The paper focuses primarily on surveying the capabilities and limitations of Large Language Models (LLMs), with only tangential insights into prompt-related aspects (e.g., Chain of Thought abilities), rather than centrally investigating prompt engineering for text-based interactions with LLMs as required.
FLAME: Financial Large-Language Model Assessment and Metrics Evaluation
arXiv ID: http://arxiv.org/abs/2501.06211v1
Abstract:
LLMs have revolutionized NLP and demonstrated potential across diverse domains. More and more financial LLMs have been introduced for finance-specific tasks, yet comprehensively assessing their value is still challenging. In this paper, we introduce FLAME, a comprehensive financial LLMs evaluation system in Chinese, which includes two core evaluation benchmarks: FLAME-Cer and FLAME-Sce. FLAME-Cer covers 14 types of authoritative financial certifications, including CPA, CFA, and FRM, with a total of approximately 16,000 carefully selected questions. All questions have been manually reviewed to ensure accuracy and representativeness. FLAME-Sce consists of 10 primary core financial business scenarios, 21 secondary financial business scenarios, and a comprehensive evaluation set of nearly 100 tertiary financial application tasks. We evaluate 6 representative LLMs, including GPT-4o, GLM-4, ERNIE-4.0, Qwen2.5, XuanYuan3, and the latest Baichuan4-Finance, revealing that Baichuan4-Finance outperforms the other LLMs in most tasks. By establishing a comprehensive and professional evaluation system, FLAME facilitates the advancement of financial LLMs in Chinese contexts. Instructions for participating in the evaluation are available on GitHub: https://github.com/FLAME-ruc/FLAME.
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper focuses on developing an evaluation system (FLAME) for assessing the performance of financial Large Language Models, rather than primarily investigating, analyzing, or proposing methods for improving LLM performance through the manipulation of textual input prompts." }
[Operator Learning for Reconstructing Flow Fields from Sparse Measurements: an Energy Transformer Approach](https://arxiv.org/abs/2501.08339v1)
arXiv ID: http://arxiv.org/abs/2501.08339v1
Abstract:
Machine learning methods have shown great success in various scientific areas, including fluid mechanics. However, reconstruction problems, where full velocity fields must be recovered from partial observations, remain challenging. In this paper, we propose a novel operator learning framework for solving reconstruction problems by using the Energy Transformer (ET), an architecture inspired by associative memory models. We formulate reconstruction as a mapping from incomplete observed data to full reconstructed fields. The method is validated on three fluid mechanics examples using diverse types of data: (1) unsteady 2D vortex street in flow past a cylinder using simulation data; (2) high-speed under-expanded impinging supersonic jets impingement using Schlieren imaging; and (3) 3D turbulent jet flow using particle tracking. The results demonstrate the ability of ET to accurately reconstruct complex flow fields from highly incomplete data (90% missing), even for noisy experimental measurements, with fast training and inference on a single GPU. This work provides a promising new direction for tackling reconstruction problems in fluid mechanics and other areas in mechanics, geophysics, weather prediction, and beyond.
Decision Explanation: Original decision: REJECT The paper focuses on operator learning for reconstructing flow fields using a novel Energy Transformer approach, with no primary focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs), and does not investigate or propose methods for improving LLM performance through textual input prompt manipulation.
[VERITAS: Verifying the Performance of AI-native Transceiver Actions in Base-Stations](https://arxiv.org/abs/2501.09761v1)
arXiv ID: http://arxiv.org/abs/2501.09761v1
Abstract:
Artificial Intelligence (AI)-native receivers deliver significant performance improvements in high-noise regimes and can potentially reduce communication overhead compared to traditional receivers. However, their performance highly depends on the representativeness of the training dataset. A major issue is the uncertainty of whether the training dataset covers all test environments and waveform configurations, and thus, whether the trained model is robust in practical deployment conditions. To this end, we propose a joint measurement-recovery framework for AI-native transceivers post-deployment, called VERITAS, that continuously looks for distribution shifts in the received signals and triggers finite re-training spurts. VERITAS monitors the wireless channel using 5G pilots fed to an auxiliary neural network that detects out-of-distribution channel profile, transmitter speed, and delay spread. As soon as such a change is detected, a traditional (reference) receiver is activated, which runs for a period of time in parallel to the AI-native receiver. Finally, VERITAS compares the bit probabilities of the AI-native and the reference receivers for the same received data inputs, and decides whether or not a retraining process needs to be initiated. Our evaluations reveal that VERITAS can detect changes in the channel profile, transmitter speed, and delay spread with 99%, 97%, and 69% accuracies, respectively, followed by timely initiation of retraining for 86%, 93.3%, and 94.8% of inputs in channel profile, transmitter speed, and delay spread test sets, respectively.
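The final decision step, comparing the two receivers' bit probabilities on identical inputs, can be sketched as a simple thresholded gap test; the mean-absolute-gap statistic and the tolerance below are assumptions for illustration, not the paper's criterion.

```python
import numpy as np

def should_retrain(p_ai, p_ref, tol=0.05):
    """Compare per-bit probabilities from the AI-native receiver and the
    reference receiver on the same received samples; a persistent gap
    suggests the AI-native model has drifted out of distribution."""
    gap = np.abs(np.asarray(p_ai) - np.asarray(p_ref)).mean()
    return gap > tol
```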
Decision Explanation: Original decision: REJECT The paper focuses on verifying the performance of AI-native transceiver actions in base-stations, with no primary emphasis on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), nor does it investigate the manipulation of textual input prompts to improve LLM performance.
[Dynamics of Adversarial Attacks on Large Language Model-Based Search Engines](https://arxiv.org/abs/2501.00745v1)
arXiv ID: http://arxiv.org/abs/2501.00745v1
Abstract:
The increasing integration of Large Language Model (LLM) based search engines has transformed the landscape of information retrieval. However, these systems are vulnerable to adversarial attacks, especially ranking manipulation attacks, where attackers craft webpage content to manipulate the LLM's ranking and promote specific content, gaining an unfair advantage over competitors. In this paper, we study the dynamics of ranking manipulation attacks. We frame this problem as an Infinitely Repeated Prisoners' Dilemma, where multiple players strategically decide whether to cooperate or attack. We analyze the conditions under which cooperation can be sustained, identifying key factors such as attack costs, discount rates, attack success rates, and trigger strategies that influence player behavior. We identify tipping points in the system dynamics, demonstrating that cooperation is more likely to be sustained when players are forward-looking. However, from a defense perspective, we find that simply reducing attack success probabilities can, paradoxically, incentivize attacks under certain conditions. Furthermore, defensive measures to cap the upper bound of attack success rates may prove futile in some scenarios. These insights highlight the complexity of securing LLM-based systems. Our work provides a theoretical foundation and practical insights for understanding and mitigating their vulnerabilities, while emphasizing the importance of adaptive security strategies and thoughtful ecosystem design.
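The "forward-looking" finding instantiates the textbook grim-trigger condition for an infinitely repeated Prisoner's Dilemma with payoffs T > R > P > S (temptation, mutual cooperation, mutual punishment, sucker): cooperation is sustainable when the discount factor satisfies the inequality below. This is the standard folk-theorem bound, not the paper's exact derivation, which further folds in attack costs and success probabilities.

```latex
\delta \;\ge\; \frac{T - R}{T - P}
```

Higher δ (more forward-looking players) makes the inequality easier to satisfy, which matches the paper's observation that cooperation is more likely to be sustained when players weight future payoffs heavily.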
Decision Explanation: Original decision: REJECT The paper focuses primarily on the vulnerability and security of LLM-based search engines to adversarial attacks, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), failing to meet the core subject requirement.
[MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization](https://arxiv.org/abs/2501.01108v2)
arXiv ID: http://arxiv.org/abs/2501.01108v2
Abstract:
Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation learning model for music understanding. Distinguished from previous studies adopting random projection or existing neural codec, the proposed model, named MuQ, is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Our Mel-RVQ utilizes residual linear projection structure for Mel spectrum quantization to enhance the stability and efficiency of target extraction and lead to better performance. Experiments in a large variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models with only 0.9K hours of open-source pre-training data. Scaling up the data to over 160K hours and adopting iterative training consistently improve the model performance. To further validate the strength of our model, we present MuQ-MuLan, a joint music-text embedding model based on contrastive learning, which achieves state-of-the-art performance in the zero-shot music tagging task on the MagnaTagATune dataset. Code and checkpoints are open source in https://github.com/tencent-ailab/MuQ.
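Mel-RVQ builds on the standard residual vector quantization loop, in which each stage quantizes the residual left by the previous stage; a generic sketch is below (the residual linear projection structure specific to Mel-RVQ is omitted, and the codebooks are assumed given).

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each stage quantizes the residual
    left by the previous one.

    codebooks: list of (K, D) arrays; x: (D,) feature vector.
    Returns the per-stage code indices used as SSL prediction targets.
    """
    residual, codes = x.astype(float), []
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes
```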
Decision Explanation: Original decision: REJECT The paper primarily focuses on self-supervised music representation learning and a new model (MuQ) for music understanding, with no apparent emphasis on engineering, design, or optimization of prompts for Large Language Models (LLMs) or demonstration of prompt impact on LLM output.
Symmetries-enhanced Multi-Agent Reinforcement Learning
arXiv ID: http://arxiv.org/abs/2501.01136v1
Abstract:
Multi-agent reinforcement learning has emerged as a powerful framework for enabling agents to learn complex, coordinated behaviors but faces persistent challenges regarding its generalization, scalability and sample efficiency. Recent advancements have sought to alleviate those issues by embedding intrinsic symmetries of the systems in the policy. Yet, most dynamical systems exhibit little to no symmetries to exploit. This paper presents a novel framework for embedding extrinsic symmetries in multi-agent system dynamics that enables the use of symmetry-enhanced methods to address systems with insufficient intrinsic symmetries, expanding the scope of equivariant learning to a wide variety of MARL problems. Central to our framework is the Group Equivariant Graphormer, a group-modular architecture specifically designed for distributed swarming tasks. Extensive experiments on a swarm of symmetry-breaking quadrotors validate the effectiveness of our approach, showcasing its potential for improved generalization and zero-shot scalability. Our method achieves significant reductions in collision rates and enhances task success rates across a diverse range of scenarios and varying swarm sizes.
Decision Explanation: Original decision: REJECT The paper focuses primarily on multi-agent reinforcement learning, introducing a novel framework for embedding extrinsic symmetries, with no apparent focus on the engineering, design, or optimization of prompts for Large Language Models (LLMs) or their direct application in text generation.
[Change Detection-Based Procedures for Piecewise Stationary MABs: A Modular Approach](https://arxiv.org/abs/2501.01291v1)
arXiv ID: http://arxiv.org/abs/2501.01291v1
Abstract:
Conventional Multi-Armed Bandit (MAB) algorithms are designed for stationary environments, where the reward distributions associated with the arms do not change with time. In many applications, however, the environment is more accurately modeled as being nonstationary. In this work, piecewise stationary MAB (PS-MAB) environments are investigated, in which the reward distributions associated with a subset of the arms change at some change-points and remain stationary between change-points. Our focus is on the asymptotic analysis of PS-MABs, for which practical algorithms based on change detection (CD) have been previously proposed. Our goal is to modularize the design and analysis of such CD-based Bandit (CDB) procedures. To this end, we identify the requirements for stationary bandit algorithms and change detectors in a CDB procedure that are needed for the modularization. We assume that the rewards are sub-Gaussian. Under this assumption and a condition on the separation of the change-points, we show that the analysis of CDB procedures can indeed be modularized, so that regret bounds can be obtained in a unified manner for various combinations of change detectors and bandit algorithms. Through this analysis, we develop new modular CDB procedures that are order-optimal. We compare the performance of our modular CDB procedures with various other methods in simulations.
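The modularization amounts to a loop in which any stationary bandit algorithm and any change detector meeting the stated requirements can be composed, with both restarted on an alarm. The factory callables below are hypothetical interfaces for illustration, not the paper's API.

```python
def cdb_procedure(env, make_bandit, make_detector, horizon):
    """Modular change-detection bandit (CDB) loop: plug in any stationary
    bandit and any change detector; restart both when a change is flagged."""
    bandit, detector = make_bandit(), make_detector()
    for t in range(horizon):
        arm = bandit.select_arm()
        reward = env.pull(arm, t)
        bandit.update(arm, reward)
        if detector.update(arm, reward):   # change-point alarm
            bandit, detector = make_bandit(), make_detector()
```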
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper focuses on Multi-Armed Bandit (MAB) algorithms and change detection procedures, with no apparent connection to Large Language Models (LLMs), prompt engineering, or text generation, thus failing to meet all 'MUST' criteria." }
[Quantifying A Firm's AI Engagement: Constructing Objective, Data-Driven, AI Stock Indices Using 10-K Filings](https://arxiv.org/abs/2501.01763v1)
arXiv ID: http://arxiv.org/abs/2501.01763v1
Abstract:
Following an analysis of existing AI-related exchange-traded funds (ETFs), we reveal the selection criteria for determining which stocks qualify as AI-related are often opaque and rely on vague phrases and subjective judgments. This paper proposes a new, objective, data-driven approach using natural language processing (NLP) techniques to classify AI stocks by analyzing annual 10-K filings from 3,395 NASDAQ-listed firms between 2011 and 2023. This analysis quantifies each company's engagement with AI through binary indicators and weighted AI scores based on the frequency and context of AI-related terms. Using these metrics, we construct four AI stock indices-the Equally Weighted AI Index (AII), the Size-Weighted AI Index (SAII), and two Time-Discounted AI Indices (TAII05 and TAII5X)-offering different perspectives on AI investment. We validate our methodology through an event study on the launch of OpenAI's ChatGPT, demonstrating that companies with higher AI engagement saw significantly greater positive abnormal returns, with analyses supporting the predictive power of our AI measures. Our indices perform on par with or surpass 14 existing AI-themed ETFs and the Nasdaq Composite Index in risk-return profiles, market responsiveness, and overall performance, achieving higher average daily returns and risk-adjusted metrics without increased volatility. These results suggest our NLP-based approach offers a reliable, market-responsive, and cost-effective alternative to existing AI-related ETF products. Our innovative methodology can also guide investors, asset managers, and policymakers in using corporate data to construct other thematic portfolios, contributing to a more transparent, data-driven, and competitive approach.
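A stripped-down version of a frequency-based AI score over a 10-K filing might look like this; the term list and length normalization are illustrative assumptions, and the context weighting the abstract mentions is omitted.

```python
import re

AI_TERMS = ["artificial intelligence", "machine learning", "deep learning"]

def ai_score(filing_text, terms=AI_TERMS):
    """Length-normalized frequency of AI-related terms in a filing,
    usable as the weighted AI score feeding the binary indicator
    and the index construction."""
    text = filing_text.lower()
    hits = sum(len(re.findall(re.escape(t), text)) for t in terms)
    n_words = max(len(text.split()), 1)
    return hits / n_words
```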
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper's primary focus is on constructing AI stock indices using NLP techniques to analyze corporate filings, not on prompt engineering for Large Language Models (LLMs). It does not investigate, analyze, or propose methods for improving LLM performance through prompt manipulation, nor does it provide concrete examples of prompts and their impact on LLM output." }
[KANS: Knowledge Discovery Graph Attention Network for Soft Sensing in Multivariate Industrial Processes](https://arxiv.org/abs/2501.02015v1)
arXiv ID: http://arxiv.org/abs/2501.02015v1
Abstract:
Soft sensing of hard-to-measure variables is often crucial in industrial processes. Current practices rely heavily on conventional modeling techniques that show success in improving accuracy. However, they overlook the non-linear nature, dynamic characteristics, and non-Euclidean dependencies between complex process variables. To tackle these challenges, we present a framework known as a Knowledge discovery graph Attention Network for effective Soft sensing (KANS). Unlike existing deep learning soft sensor models, KANS can discover the intrinsic correlations and irregular relationships between the multivariate industrial processes without a predefined topology. First, an unsupervised graph structure learning method is introduced, incorporating the cosine similarity between different sensor embeddings to capture the correlations between sensors. Next, we present a graph attention-based representation learning that can process the multivariate data in parallel to enhance the model's learning of complex sensor nodes and edges. To fully explore KANS, knowledge discovery analysis has also been conducted to demonstrate the interpretability of the model. Experimental results demonstrate that KANS significantly outperforms all the baselines and state-of-the-art methods in soft sensing performance. Furthermore, the analysis shows that KANS can find sensors closely related to different process variables without domain knowledge, significantly improving soft sensing accuracy.
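The unsupervised graph structure learning step can be sketched as cosine-similarity graph construction over sensor embeddings; the top-k sparsification below is an illustrative choice rather than the paper's exact rule.

```python
import numpy as np

def cosine_adjacency(emb, top_k=4):
    """Build a weighted adjacency matrix from sensor embeddings:
    cosine similarity between all pairs, keeping each node's top-k
    neighbors and dropping self-loops."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = e @ e.T
    np.fill_diagonal(sim, -np.inf)        # exclude self-similarity
    adj = np.zeros_like(sim)
    for i, row in enumerate(sim):
        for j in np.argsort(row)[-top_k:]:
            adj[i, j] = row[j]
    return adj
```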
Decision Explanation: Original decision: REJECT The paper's primary focus is on developing a Knowledge Discovery Graph Attention Network (KANS) for soft sensing in industrial processes, not on prompt engineering for Large Language Models (LLMs). It does not investigate, analyze, or propose methods for improving LLM performance through textual input prompt manipulation, nor does it provide examples of prompts and their impact on LLM output.
[Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey](https://arxiv.org/abs/2501.02189v2)
arXiv ID: http://arxiv.org/abs/2501.02189v2
Abstract:
Multimodal Vision Language Models (VLMs) have emerged as a transformative technology at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and beat classical single modality vision models on zero-shot classification. Despite their rapid advancements in research and growing popularity in applications, a comprehensive survey of existing studies on VLMs is notably lacking, particularly for researchers aiming to leverage VLMs in their specific domains. To this end, we provide a systematic overview of VLMs in the following aspects: model information of the major VLMs developed over the past five years (2019-2024); the main architectures and training methods of these VLMs; summary and categorization of the popular benchmarks and evaluation metrics of VLMs; the applications of VLMs including embodied agents, robotics, and video generation; the challenges and issues faced by current VLMs such as hallucination, fairness, and safety. Detailed collections including papers and model repository links are listed in https://github.com/zli12321/Awesome-VLM-Papers-And-Models.git.
Decision Explanation: Original decision: REJECT The paper focuses primarily on Multimodal Vision Language Models (VLMs), their applications, and challenges, rather than the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) as required by the criteria.
Optimizing Small Language Models for In-Vehicle Function-Calling
arXiv ID: http://arxiv.org/abs/2501.02342v1
Abstract:
We propose a holistic approach for deploying Small Language Models (SLMs) as function-calling agents within vehicles as edge devices, offering a more flexible and robust alternative to traditional rule-based systems. By leveraging SLMs, we simplify vehicle control mechanisms and enhance the user experience. Given the in-vehicle hardware constraints, we apply state-of-the-art model compression techniques, including structured pruning, healing, and quantization, ensuring that the model fits within the resource limitations while maintaining acceptable performance. Our work focuses on optimizing a representative SLM, Microsoft's Phi-3 mini, and outlines best practices for enabling embedded models, including compression, task-specific fine-tuning, and vehicle integration. We demonstrate that, despite significant reduction in model size which removes up to 2 billion parameters from the original model, our approach preserves the model's ability to handle complex in-vehicle tasks accurately and efficiently. Furthermore, by executing the model in a lightweight runtime environment, we achieve a generation speed of 11 tokens per second, making real-time, on-device inference feasible without hardware acceleration. Our results demonstrate the potential of SLMs to transform vehicle control systems, enabling more intuitive interactions between users and their vehicles for an enhanced driving experience.
Decision Explanation: Original decision: REJECT The paper focuses primarily on optimizing Small Language Models (SLMs) for deployment in vehicles, discussing model compression and fine-tuning, rather than prompt engineering for Large Language Models (LLMs) and the impact of textual input prompts on LLM output.
Graph-Aware Isomorphic Attention for Adaptive Dynamics in Transformers
arXiv ID: http://arxiv.org/abs/2501.02393v2
Abstract:
We present an approach to modifying Transformer architectures by integrating graph-aware relational reasoning into the attention mechanism, merging concepts from graph neural networks and language modeling. Building on the inherent connection between attention and graph theory, we reformulate the Transformer's attention mechanism as a graph operation and propose Graph-Aware Isomorphic Attention. This method leverages advanced graph modeling strategies, including Graph Isomorphism Networks (GIN) and Principal Neighborhood Aggregation (PNA), to enrich the representation of relational structures. Our approach captures complex dependencies and generalizes across tasks, as evidenced by a reduced generalization gap and improved learning performance. Additionally, we expand the concept of graph-aware attention to introduce Sparse GIN-Attention, a fine-tuning approach that employs sparse GINs. By interpreting attention matrices as sparse adjacency graphs, this technique enhances the adaptability of pre-trained foundational models with minimal computational overhead, endowing them with graph-aware capabilities. Sparse GIN-Attention fine-tuning achieves improved training dynamics and better generalization compared to alternative methods like low-rank adaption (LoRA). We discuss latent graph-like structures within traditional attention mechanisms, offering a new lens through which Transformers can be understood by evolving them into hierarchical GIN models for relational reasoning. This perspective suggests profound implications for foundational model development, enabling the design of architectures that dynamically adapt to both local and global dependencies. Applications in bioinformatics, materials science, language modeling, and beyond could benefit from this synthesis of relational and sequential data modeling, setting the stage for interpretable and generalizable modeling strategies.
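The Sparse GIN-Attention idea of reading attention matrices as sparse adjacency graphs can be sketched with a simple quantile threshold; the sparsification rule here is an assumption for illustration, not the paper's method.

```python
import numpy as np

def attention_to_adjacency(attn, keep=0.1):
    """Interpret an attention matrix as a sparse adjacency graph by
    keeping only the strongest `keep` fraction of entries; the result
    can then be fed to a GIN-style aggregation."""
    cutoff = np.quantile(attn, 1.0 - keep)
    return np.where(attn >= cutoff, attn, 0.0)
```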
Decision Explanation: Original decision: REJECT The paper focuses primarily on modifying Transformer architectures by integrating graph-aware relational reasoning, which aligns with developing new LLM architectures, violating the 'MUST NOT' criterion 1. It does not centrally address prompt engineering for text-based interactions with LLMs.
[Toward Inclusive Educational AI: Auditing Frontier LLMs through a Multiplexity Lens](https://arxiv.org/abs/2501.03259v1)
arXiv ID: http://arxiv.org/abs/2501.03259v1
Abstract:
As large language models (LLMs) like GPT-4 and Llama 3 become integral to educational contexts, concerns are mounting over the cultural biases, power imbalances, and ethical limitations embedded within these technologies. Though generative AI tools aim to enhance learning experiences, they often reflect values rooted in Western, Educated, Industrialized, Rich, and Democratic (WEIRD) cultural paradigms, potentially sidelining diverse global perspectives. This paper proposes a framework to assess and mitigate cultural bias within LLMs through the lens of applied multiplexity. Multiplexity, inspired by Senturk et al. and rooted in Islamic and other wisdom traditions, emphasizes the coexistence of diverse cultural viewpoints, supporting a multi-layered epistemology that integrates both empirical sciences and normative values. Our analysis reveals that LLMs frequently exhibit cultural polarization, with biases appearing in both overt responses and subtle contextual cues. To address inherent biases and incorporate multiplexity in LLMs, we propose two strategies: \textit{Contextually-Implemented Multiplex LLMs}, which embed multiplex principles directly into the system prompt, influencing LLM outputs at a foundational level and independent of individual prompts, and \textit{Multi-Agent System (MAS)-Implemented Multiplex LLMs}, where multiple LLM agents, each representing distinct cultural viewpoints, collaboratively generate a balanced, synthesized response. Our findings demonstrate that as mitigation strategies evolve from contextual prompting to MAS-implementation, cultural inclusivity markedly improves, evidenced by a significant rise in the Perspectives Distribution Score (PDS) and a PDS Entropy increase from 3.25% at baseline to 98% with the MAS-Implemented Multiplex LLMs. Sentiment analysis further shows a shift towards positive sentiment across cultures,...
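A toy rendering of the two mitigation strategies, a multiplex system prompt and a multi-agent synthesis loop, is sketched below; the prompt wording and the `agents`/`synthesizer` callables are hypothetical illustrations, not quotations from the paper.

```python
MULTIPLEX_SYSTEM_PROMPT = (
    "You are a multiplex assistant. When answering, present the empirical "
    "evidence and, alongside it, how major cultural and wisdom traditions "
    "would weigh the question, without privileging any single tradition."
)  # Illustrative wording, not the paper's system prompt.

def mas_multiplex_answer(agents, question, synthesizer):
    """MAS-implemented variant: each agent answers from one cultural
    viewpoint and a synthesizer LLM merges the views into one balanced
    response, as in the paper's second strategy."""
    views = [agent(question) for agent in agents]
    return synthesizer(question, views)
```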
Decision Explanation: Original response: { "decision": "REJECT", "explanation": "The paper primarily focuses on auditing and mitigating cultural biases in LLMs through a multiplexity lens, rather than specifically engineering or optimizing prompts for improving LLM performance. While it mentions prompt manipulation (Contextually-Implemented Multiplex LLMs), this is secondary to the overarching goal of bias mitigation and not the core subject of the paper." }
[LLM Content Moderation and User Satisfaction: Evidence from Response Refusals in Chatbot Arena](https://arxiv.org/abs/2501.03266v1)
arXiv ID: http://arxiv.org/abs/2501.03266v1
Abstract:
LLM safety and ethical alignment are widely discussed, but the impact of content moderation on user satisfaction remains underexplored. To address this, we analyze nearly 50,000 Chatbot Arena response-pairs using a novel fine-tuned RoBERTa model, that we trained on hand-labeled data to disentangle refusals due to ethical concerns from other refusals due to technical disabilities or lack of information. Our findings reveal a significant refusal penalty on content moderation, with users choosing ethical-based refusals roughly one-fourth as often as their preferred LLM response compared to standard responses. However, the context and phrasing play critical roles: refusals on highly sensitive prompts, such as illegal content, achieve higher win rates than less sensitive ethical concerns, and longer responses closely aligned with the prompt perform better. These results emphasize the need for nuanced moderation strategies that balance ethical safeguards with user satisfaction. Moreover, we find that the refusal penalty is notably lower in evaluations using the LLM-as-a-Judge method, highlighting discrepancies between user and automated assessments.
Decision Explanation: Original decision: REJECT The paper primarily focuses on LLM content moderation, user satisfaction, and fine-tuning a RoBERTa model for analyzing refusals, rather than the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs) and their direct impact on LLM output.
[Cyber Shadows: Neutralizing Security Threats with AI and Targeted Policy Measures](https://arxiv.org/abs/2501.09025v1)
arXiv ID: http://arxiv.org/abs/2501.09025v1
Abstract:
The digital age, driven by the AI revolution, brings significant opportunities but also conceals security threats, which we refer to as cyber shadows. These threats pose risks at individual, organizational, and societal levels. This paper examines the systemic impact of these cyber threats and proposes a comprehensive cybersecurity strategy that integrates AI-driven solutions, such as Intrusion Detection Systems (IDS), with targeted policy interventions. By combining technological and regulatory measures, we create a multilevel defense capable of addressing both direct threats and indirect negative externalities. We emphasize that the synergy between AI-driven solutions and policy interventions is essential for neutralizing cyber threats and mitigating their negative impact on the digital economy. Finally, we underscore the need for continuous adaptation of these strategies, especially in response to the rapid advancement of autonomous AI-driven attacks, to ensure the creation of secure and resilient digital ecosystems.
Decision Explanation: Original decision: REJECT The paper primarily focuses on a comprehensive cybersecurity strategy using AI-driven solutions and policy interventions, with no clear emphasis on the engineering, design, or optimization of prompts specifically for Large Language Models (LLMs), thus failing to meet the 'MUST' criteria.
[Zero-Shot Statistical Tests for LLM-Generated Text Detection using Finite Sample Concentration Inequalities](https://arxiv.org/abs/2501.02406v2)
arXiv ID: http://arxiv.org/abs/2501.02406v2
Abstract:
Verifying the provenance of content is crucial to the function of many organizations, e.g., educational institutions, social media platforms, firms, etc. This problem is becoming increasingly difficult as text generated by Large Language Models (LLMs) becomes almost indistinguishable from human-generated content. In addition, many institutions utilize in-house LLMs and want to ensure that external, non-sanctioned LLMs do not produce content within the institution. In this paper, we answer the following question: Given a piece of text, can we identify whether it was produced by LLM A or B (where B can be a human)? We model LLM-generated text as a sequential stochastic process with complete dependence on history and design zero-shot statistical tests to distinguish between (i) the text generated by two different sets of LLMs A (in-house) and B (non-sanctioned) and also (ii) LLM-generated and human-generated texts. We prove that the type I and type II errors for our tests decrease exponentially in the text length. In designing our tests, we derive concentration inequalities on the difference between log-perplexity and the average entropy of the string under A. Specifically, for a given string, we demonstrate that if the string is generated by A, the log-perplexity of the string under A converges to the average entropy of the string under A, except with an exponentially small probability in string length. We also show that if B generates the text, except with an exponentially small probability in string length, the log-perplexity of the string under A converges to the average cross-entropy of B and A. Lastly, we present preliminary experimental results to support our theoretical results. By enabling guaranteed (with high probability) finding of the origin of harmful LLM-generated text with arbitrary size, we can help combat misinformation.
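The test statistic itself is easy to state in code: compare the string's log-perplexity under A with A's average next-token entropy on the same string. The threshold below is an illustrative placeholder; the paper derives it from finite-sample concentration inequalities.

```python
import numpy as np

def attribute_to_A(token_logprobs_A, token_entropies_A, tau=0.1):
    """Zero-shot attribution test: under model A, a string's log-perplexity
    (mean negative token log-probability under A) concentrates around A's
    average next-token entropy on that string; a large gap points to
    another source B (an LLM or a human)."""
    log_ppl = -np.mean(token_logprobs_A)      # log-perplexity under A
    avg_entropy = np.mean(token_entropies_A)  # A's average entropy
    return abs(log_ppl - avg_entropy) <= tau
```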
Decision Explanation: Original decision: REJECT The paper focuses primarily on statistical tests for detecting the origin of LLM-generated text, rather than on the engineering, design, or optimization of prompts specifically for Large Language Models. It does not investigate, analyze, or propose methods for improving LLM performance through the manipulation of textual input prompts.