Title: SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks
URL Source: https://arxiv.org/html/2501.11599v1
Wentao Wan1, Zhuojie Yang1, Yongcan Chen3, Chenglin Luo1, Ruilin Wang1, Kehao Cai1, Nan Kang1, Liang Lin1, Keze Wang1,2
###### Abstract
Deductive reasoning is a crucial logical capability that assists us in solving complex problems based on existing knowledge. Although augmented by Chain-of-Thought prompts, Large Language Models (LLMs) might not follow the correct reasoning paths. Enhancing the deductive reasoning abilities of LLMs, and leveraging their extensive built-in knowledge for various reasoning tasks, remains an open question. Attempting to mimic the human deductive reasoning paradigm, we propose a multi-stage Syllogistic-Reasoning Framework of Thought (SR-FoT) that enables LLMs to perform syllogistic deductive reasoning to handle complex knowledge-based reasoning tasks. Our SR-FoT begins by interpreting the question and then uses the interpretation and the original question to propose a suitable major premise. It proceeds by generating and answering minor premise questions in two stages to match the minor premises. Finally, it guides LLMs to use the previously generated major and minor premises to perform syllogistic deductive reasoning to derive the answer to the original question. Extensive and thorough experiments on knowledge-based reasoning tasks have demonstrated the effectiveness and advantages of our SR-FoT.
Code: https://github.com/RodeWayne/SR-FoT

Introduction
------------
Deductive reasoning (Johnson-Laird [1999](https://arxiv.org/html/2501.11599v1#bib.bib19)) is the process of drawing valid inferences. It is a powerful human capability: rigorous deduction lets us use existing knowledge as premises to derive correct conclusions, enabling us to tackle a variety of complex real-world tasks.
Automated deductive reasoning has long been a pursuit in the field of Natural Language Processing (NLP) (Chowdhary and Chowdhary [2020](https://arxiv.org/html/2501.11599v1#bib.bib9); Bharadiya [2023](https://arxiv.org/html/2501.11599v1#bib.bib3); Khurana et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib21)). Work on automated rigorous reasoning includes reasoning engines and Automated Theorem Proving (ATP) (Bibel [2013](https://arxiv.org/html/2501.11599v1#bib.bib4)), which often provide methods for automatically checking the rigor of reasoning. However, these engines require formal languages, which limits their applicability in knowledge-based reasoning scenarios. Formal language-based reasoning requires a predefined library of formalized premises, whereas many knowledge-based reasoning tasks, including common-sense question answering, involve a diverse array of premises, and it is difficult to prepare and rigorously formalize such a large library in advance. Performing correct deductive reasoning in natural language is therefore important.
Large Language Models (LLMs) (Chang et al. [2024](https://arxiv.org/html/2501.11599v1#bib.bib7); Floridi and Chiriatti [2020](https://arxiv.org/html/2501.11599v1#bib.bib15); Touvron et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib29); Chiang et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib8); Huang and Chang [2022](https://arxiv.org/html/2501.11599v1#bib.bib18); DeepSeek-AI [2024](https://arxiv.org/html/2501.11599v1#bib.bib11)) pre-trained on extensive corpora possess inherent soft deductive reasoning capabilities (Seals and Shalin [2023](https://arxiv.org/html/2501.11599v1#bib.bib27)). With the aid of Chain-of-Thought (CoT) prompts (Lyu et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib25); Wei et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib33); Zhang et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib39); Turpin et al. [2024](https://arxiv.org/html/2501.11599v1#bib.bib30); Lee et al. [2024](https://arxiv.org/html/2501.11599v1#bib.bib22); Liu, Pang, and Fan [2023](https://arxiv.org/html/2501.11599v1#bib.bib23)), the cognitive abilities of LLMs are further enhanced. However, reasoning with CoT often does not constitute strict deductive reasoning and thus can lack rigor. Fig. [4](https://arxiv.org/html/2501.11599v1#Sx3.F4 "Figure 4 ‣ Procedure of Our SR-FoT ‣ Methodology ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks") illustrates the different processes of handling the same problem using CoT and classic syllogistic deductive reasoning, showing that the syllogistic deductive approach is more rigorous. We believe that guiding large language models to perform deductive reasoning, rather than merely multi-step reasoning, can enhance the rigor of the reasoning process, reduce hallucinations, and subsequently improve performance on complex tasks.
Inspired by the most fundamental human deductive reasoning paradigm, syllogistic reasoning (Bucciarelli and Johnson-Laird [1999](https://arxiv.org/html/2501.11599v1#bib.bib6); Khemlani and Johnson-Laird [2012](https://arxiv.org/html/2501.11599v1#bib.bib20); Bara, Bucciarelli, and Johnson-Laird [1995](https://arxiv.org/html/2501.11599v1#bib.bib2)), we propose a multi-stage reasoning framework for large language models to guide them in using syllogistic reasoning to solve specific problems. In contrast to existing works in the community (Wu et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib34); Ye et al. [2023b](https://arxiv.org/html/2501.11599v1#bib.bib37); Deng et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib12)), we do not solely rely on simplistic processes or create targeted benchmarks to evaluate the LLMs’ capabilities in performing syllogistic reasoning. Instead, we propose a reasoning framework based on the syllogistic thinking paradigm to handle complex knowledge-based reasoning tasks. Our SR-FoT advances in guiding LLMs in performing syllogistic deductive reasoning, thereby achieving improved performance on these tasks and enhancing the rigor and reliability of the reasoning process.
Our SR-FoT consists of five stages. First, it interprets the question. Second, SR-FoT guides the LLM to propose a major premise suitable for the question. This major premise typically draws on the built-in knowledge of the LLM and serves as a universal rule that aids in addressing the original question. The next two stages obtain a minor premise, which acts as the bridge linking the major premise to the original problem and is crucial for applying the major premise to the current issue. We first let the LLM formulate minor premise questions based on the original question, major premise, and contextual information, and then answer these to obtain an appropriate minor premise. Finally, with both the major and minor premises established, we enable the LLM to perform syllogistic reasoning based on the original question and these premises to derive the answer to the original question. Furthermore, to minimize the interference caused by excessive information during the reasoning process (Dong et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib14)), we restrict each stage to only access the content from its necessary preceding stages. For example, during the final syllogistic reasoning, only the original problem and the previously established major and minor premises are visible; the problem interpretation and the minor premise question are not.
Our main contributions can be summarized in three points: i) We propose a multi-step thinking framework that guides LLMs in using syllogistic deductive reasoning to solve knowledge-based reasoning tasks. Specifically, to enhance the ability of the LLM to leverage its built-in knowledge for solving diverse tasks, we introduce a problem interpretation stage when acquiring the major premise, and we improve the quality of both premises as well as their logical connection to the original problem by adopting an autonomous question-answering approach during the acquisition of the minor premise (Bubeck et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib5)); ii) To facilitate more rigorous reasoning, we design our thinking framework so that each stage only accesses the information necessary for that stage, thereby reducing the hallucinations and error accumulation that can arise from overly long reasoning steps; iii) Our SR-FoT outperforms existing Chain-of-Thought-related methods on various knowledge-based reasoning QA datasets such as ScienceQA (Lu et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib24)), StrategyQA (Geva et al. [2021](https://arxiv.org/html/2501.11599v1#bib.bib17)), and BoolQ (Clark et al. [2019](https://arxiv.org/html/2501.11599v1#bib.bib10)).

Related Work
------------
### Chain-of-Thought
Chain-of-Thought (Wei et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib33)) has been demonstrated to enhance performance in reasoning tasks by fully utilizing the in-context learning capability of the large language model to stimulate its multi-step reasoning ability. Self-consistency CoT (SC-CoT) (Wang et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib32)) further improves the performance of CoT by utilizing the consistency of multiple sampled reasoning chains. Complexity-based CoT (C-CoT) (Fu et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib16)) further discovers that the consistency of complex reasoning chains is even more vital for the reasoning performance of language models. In addition, some efforts have also been made to further stimulate the reasoning ability of language models by focusing on the structure of the reasoning chain and the levels of reasoning, such as Least-to-Most (Zhou et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib40)) and Tree-of-Thought (Yao et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib35)). However, these works have not considered how to stimulate the reasoning abilities of LLMs from the perspective of logical reasoning.
### Logical Reasoning Ability of LLMs
There has been considerable research within the community on the logical reasoning capabilities of LLMs, broadly categorized into two directions: one focuses on logical reasoning based on formal languages, and the other on logical reasoning in natural language. Research on formal language-based logical reasoning primarily concentrates on the field of Automated Theorem Proving (ATP) (Bibel [2013](https://arxiv.org/html/2501.11599v1#bib.bib4)), utilizing the built-in mathematical priors of LLMs to accelerate the search process in theorem proving or to construct a growing library of mathematical theorems to aid new proofs (Wang et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib31)). This work typically operates within interactive theorem-proving platforms like the Lean system, which restricts its application in everyday question-answering scenarios. Logical reasoning in natural language with LLMs generally involves soft reasoning (Yu et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib38)), which does not provide rigorous guarantees. For instance, Chain-of-Thought (CoT) enhances the general explicit reasoning abilities of LLMs, and exploratory studies examine how well LLMs perform in logical reasoning, or how segment checking might reduce hallucinations and error accumulation in soft deductive reasoning (Ye et al. [2023a](https://arxiv.org/html/2501.11599v1#bib.bib36); Dhuliawala et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib13)). Recently, several studies on syllogistic reasoning with LLMs have been proposed. However, these primarily create benchmarks (Ye et al. [2023b](https://arxiv.org/html/2501.11599v1#bib.bib37)) that evaluate the capability of LLMs to perform syllogistic reasoning on datasets with given premises.
Unlike previous works, our research investigates how to guide LLMs through a multi-stage process that involves autonomously generating major and minor premises and performing syllogistic deductive reasoning to answer a variety of knowledge-based reasoning questions.

Methodology
-----------
We have designed a reasoning framework that guides large language models to perform syllogistic deductive reasoning for addressing various knowledge-based reasoning question-answer tasks. Next, we present syllogistic reasoning as background knowledge, followed by a detailed description of our SR-FoT framework.
### Background: Syllogism
In traditional logic, syllogism (Smiley [1973](https://arxiv.org/html/2501.11599v1#bib.bib28)) is a form of reasoning where one proposition (the conclusion) necessarily follows from two other propositions (known as premises). As shown in Fig. [1](https://arxiv.org/html/2501.11599v1#Sx3.F1 "Figure 1 ‣ Background: Syllogism ‣ Methodology ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"), a syllogism consists of three parts: a major premise, a minor premise, and a conclusion. Logically, the conclusion is derived by applying the major premise to the minor premise. The major premise represents a general principle, while the minor premise is a specific statement. Syllogistic reasoning is a type of deductive reasoning; rigorous deductive reasoning ensures that if the premises are correct, the conclusion must also be correct.
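
For illustration, one common way to write this form in first-order logic is given below. The symbols $M$, $P$, and $s$ are notation assumed here, not taken from Fig. 1: $M$ and $P$ are predicates and $s$ is a specific instance.

```latex
\[
\underbrace{\forall x\,\bigl(M(x) \rightarrow P(x)\bigr)}_{\text{major premise}},\quad
\underbrace{M(s)}_{\text{minor premise}}
\;\vdash\;
\underbrace{P(s)}_{\text{conclusion}}
\]
```

With $M(x)$ read as "x is a human", $P(x)$ as "x is mortal", and $s$ as Socrates, this is the familiar textbook instance: all humans are mortal, Socrates is a human, therefore Socrates is mortal.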
![Image 12: Refer to caption](https://arxiv.org/html/2501.11599v1/x1.png)
Figure 1: A syllogism example.
### Procedure of Our SR-FoT
![Image 13: Refer to caption](https://arxiv.org/html/2501.11599v1/x2.png)
Figure 2: Procedure of our SR-FoT. Question<sub>ori</sub>: Original Question; Context: Context provided for the Original Question; Answer<sub>ori</sub>: Answer to the Original Question; Question<sub>mP</sub>: Question for the Minor Premise; Prompt<sub>CoT</sub>: Guide Prompt for CoT; Prompt<sub>QE</sub>: Guide Prompt for Question Explanation; Prompt<sub>MP</sub>: Guide Prompt for Major Premise Production; Prompt<sub>QmP</sub>: Guide Prompt for Posing the Minor Premise Question; Prompt<sub>mP</sub>: Guide Prompt for Minor Premise Production; Prompt<sub>SR</sub>: Guide Prompt for Final Syllogistic Reasoning.
![Image 14: Refer to caption](https://arxiv.org/html/2501.11599v1/x3.png)
Figure 3: Prompts for each stage of our SR-FoT.
While our proposed SR-FoT does not guarantee rigorous syllogistic reasoning in 100% of cases, it aims to guide the reasoning paradigm of the LLM through carefully designed prompts and sub-tasks at each stage. By strategically controlling the input visible at each stage, we strive to have the LLM conduct rigorous syllogistic reasoning and to minimize reasoning fallacies. Specifically, as shown in Fig. [2](https://arxiv.org/html/2501.11599v1#Sx3.F2 "Figure 2 ‣ Procedure of Our SR-FoT ‣ Methodology ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"), our SR-FoT is divided into five stages. The prompts for each stage of our SR-FoT are shown in Fig. [3](https://arxiv.org/html/2501.11599v1#Sx3.F3 "Figure 3 ‣ Procedure of Our SR-FoT ‣ Methodology ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks").
![Image 15: Refer to caption](https://arxiv.org/html/2501.11599v1/x4.png)
Figure 4: A case of using CoT and SR-FoT to answer a question in the ScienceQA dataset respectively. The highlighted red parts indicate the incorrect or misleading content, while the highlighted green parts indicate the content that helps correct reasoning.
Stage 1: Question Explanation. The key to utilizing syllogistic reasoning to solve various complex knowledge-based reasoning tasks lies in formulating appropriate major and minor premises that fit the current problem. Accordingly, the first stage of our SR-FoT involves using a prompt with examples to guide the LLM in interpreting the original task question and proposing a solution approach. This guidance helps direct the LLM to formulate suitable major premises and then appropriate minor ones. In this stage, besides the guidance and example prompts, the only information available to the LLM is the “original question” and the “context” provided by the task, which also includes “options” information for multiple-choice questions.
Stage 2: Major Premise Production. After acquiring the “question explanation”, we have a deeper understanding of the original question and an approach to solve it. This solution approach often includes guidance on what further information is needed. In this stage, based on this guidance, the LLM proposes an appropriate major premise, which is derived from the task “context” or the built-in knowledge of the LLM. In Stage 2, besides the guidance and example prompts, the information accessible to the LLM includes the “original question”, the “context”, and the “question explanation” generated in the first stage.
Stage 3: Posing the Minor Premise Question. After establishing the major premise, to effectively engage in syllogistic reasoning, we need a minor premise. In syllogistic reasoning, the minor premise is a specific statement that describes the relationship between a particular instance and the category mentioned in the major premise. Through the minor premise, the universal characteristics of the major premise can be applied to the specific instance in the minor premise, which is a crucial step in using syllogistic reasoning to solve specific problems. Given the diverse and often complex nature of the knowledge-based reasoning tasks we need to address, it is challenging to provide a matched and correct minor premise in one step. Our SR-FoT divides the step of proposing a minor premise into two stages: posing the minor premise question (Stage 3) and answering the minor premise question (Stage 4). The task of the “posing the minor premise question” stage is to determine what information about the specific instance in the original question the LLM should acquire to utilize the major premise in answering the “original question”. Therefore, in the “posing the minor premise question” stage, besides the guidance and example prompts, the LLM needs access to the “original question”, the “context”, and the “major premise” generated in Stage 2.
Stage 4: Minor Premise Production. The task of Stage 4 is to utilize the “context” information provided by the original task, along with the built-in knowledge of the LLM, to answer the minor premise question posed in Stage 3. This stage aims to obtain the correct information about a specific aspect of the particular instance in the original question, leading to the formulation of an accurate and matching minor premise. Given the potential complexity of the minor premise questions, we guide the LLM to employ the Chain-of-Thought (CoT) technique to answer the minor premise question and to organize the answer into the minor premise. Furthermore, to avoid the interference caused by excessive information, in this stage, besides the guidance and example prompts, the LLM has access only to the “minor premise question” and the “context”, without seeing the “original question” again. The “minor premise question” and “context” already contain all the information necessary for the task of the LLM at this stage; viewing additional information like the “original question” could instead cause distraction and hurt performance.
Stage 5: Final Syllogistic Reasoning. After the aforementioned stages, complex knowledge-based reasoning questions can now be answered using syllogistic reasoning. The specific approach involves designing the appropriate task instruction and example prompts, allowing the LLM to engage in syllogistic reasoning based on the major and minor premises generated in earlier stages, to arrive at the answer to the original question. Therefore, in Stage 5, in addition to the guidance and example prompts, we give the LLM access to the “major premise” generated in Stage 2, the “minor premise” generated in Stage 4, and the “original question”.
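
The five stages and their visibility rules can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' released implementation: `llm` is a hypothetical callable that maps a prompt string to a completion, and the instruction strings are placeholders standing in for the actual guide prompts and in-context examples of Fig. 3.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class SRFoTTrace:
    explanation: str
    major_premise: str
    minor_question: str
    minor_premise: str
    answer: str


def build_prompt(instruction: str, **visible: str) -> str:
    """Join a stage instruction with only the fields that stage may see."""
    fields = [f"{name}: {value}" for name, value in visible.items()]
    return "\n".join([instruction, *fields])


def sr_fot(llm: LLM, question: str, context: str) -> SRFoTTrace:
    # Stage 1: sees the original question and context only.
    explanation = llm(build_prompt(
        "Explain the question and propose a solution approach.",
        question=question, context=context))
    # Stage 2: sees question, context, and the Stage 1 explanation.
    major = llm(build_prompt(
        "Propose a major premise (a general rule) for the question.",
        question=question, context=context, explanation=explanation))
    # Stage 3: sees question, context, and the major premise.
    minor_q = llm(build_prompt(
        "Pose the question whose answer gives the minor premise.",
        question=question, context=context, major_premise=major))
    # Stage 4: sees only the minor premise question and context.
    minor = llm(build_prompt(
        "Answer step by step, then state the minor premise.",
        minor_premise_question=minor_q, context=context))
    # Stage 5: sees the original question and both premises.
    answer = llm(build_prompt(
        "Apply the major premise to the minor premise to answer.",
        question=question, major_premise=major, minor_premise=minor))
    return SRFoTTrace(explanation, major, minor_q, minor, answer)
```

Passing each stage's inputs as explicit keyword arguments makes the visibility restriction easy to audit: the Stage 4 call, for example, has no way to reach the original question.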
Table 1: Comparison with the state-of-the-art methods on the ScienceQA, StrategyQA, and BoolQ datasets.

Experiments
-----------
To evaluate the effectiveness of our SR-FoT, we conducted a series of experiments using both open-source and closed-source LLMs on several common knowledge-based reasoning question-answer datasets.
### Experiment Setup
#### Datasets.
To fully demonstrate the effectiveness and generalization of our SR-FoT, we conduct a series of experiments on three datasets from different fields.
ScienceQA (Lu et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib24)) is a scientific question-answering dataset and contains 21,208 multimodal multiple-choice science questions. It can be divided into three subjects: natural science, language science, and social science. It requires the language model to select one answer from multiple options, usually requiring multi-step reasoning. In our experiment, we employ the test set samples which only have a text context, with a total of 2224. We report the accuracy of our SR-FoT and comparison methods on this set.
StrategyQA (Geva et al. [2021](https://arxiv.org/html/2501.11599v1#bib.bib17)) is a question-answering dataset focusing on open-domain questions. Its questions contain multiple reasoning steps, and a strategy should be used to obtain the answers. In our experiment, we evaluate the methods with accuracy on the train set, which includes 2290 samples.
BoolQ (Clark et al. [2019](https://arxiv.org/html/2501.11599v1#bib.bib10)) is a reading comprehension dataset consisting of 16k samples. Its questions often query for complex, non-factoid information and require difficult entailment-like inference to solve. In our experiment, we compare the accuracy of our SR-FoT with other methods on the dev set, which contains 3270 samples.
#### Experimental Setting.
Our experiments are performed using API calls to the proprietary model GPT-3.5-turbo (Ouyang et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib26)), the open-source model DeepSeek-V2 (DeepSeek-AI [2024](https://arxiv.org/html/2501.11599v1#bib.bib11)) with 236B parameters, and Qwen1.5-32B-Chat (Bai et al. [2023](https://arxiv.org/html/2501.11599v1#bib.bib1)) with 32B parameters. The baseline methods are the Base method, Chain-of-Thought (CoT) (Wei et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib33)), Self-consistency CoT (SC-CoT) (Wang et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib32)), and Complexity-based CoT (C-CoT) (Fu et al. [2022](https://arxiv.org/html/2501.11599v1#bib.bib16)). Our own approaches are SR-FoT and Self-consistency SR-FoT (SC-SR-FoT), which applies the self-consistency sampling and aggregation settings of SC-CoT to our SR-FoT. For the single-round sampling methods (Base, CoT, and SR-FoT), the hyperparameters on GPT-3.5-turbo and Qwen1.5-32B-Chat are set to top\_p=0.3 and temperature=0.2, while the temperature on DeepSeek-V2 is set to the default recommended value of 1 (DeepSeek only allows the temperature hyperparameter to be adjusted). For the multi-round sampling methods (SC-CoT, C-CoT, and SC-SR-FoT), we draw 10 samples each. To enhance the diversity of sampling outcomes, top\_p and temperature on GPT-3.5-turbo and Qwen1.5-32B-Chat are adjusted to 0.7 and 0.9, respectively. The temperature on DeepSeek-V2 remains at the default recommended value of 1. The numbers of in-context examples used by all methods on the ScienceQA, StrategyQA, and BoolQ datasets are 5, 2, and 2, respectively.
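
The self-consistency variants aggregate several sampled answers into one. Below is a minimal sketch of that aggregation step, assuming simple majority voting over final answers as in SC-CoT. The name `sample_answer` is a hypothetical zero-argument callable that runs one sampled reasoning pass (for example, one full SR-FoT run) and returns its final answer; the default of 10 samples mirrors the setting described above.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the earliest-seen answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def self_consistent_answer(sample_answer: Callable[[], str],
                           n_samples: int = 10) -> str:
    """Draw n_samples independent answers and aggregate by majority vote."""
    return majority_vote([sample_answer() for _ in range(n_samples)])
```

Voting on the final answer only, rather than on intermediate premises, keeps the aggregation independent of how each sampled run phrased its reasoning.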
### Experimental Results and Analyses
#### Performance on ScienceQA.
Scientific question answering is a task scenario that often requires deductive reasoning. As seen in Tab. [1](https://arxiv.org/html/2501.11599v1#Sx3.T1 "Table 1 ‣ Procedure of Our SR-FoT ‣ Methodology ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"), under GPT-3.5-turbo, among single-round sampling methods, our SR-FoT outperforms the Base and CoT methods by 1.5% and 0.5%, respectively, and is on par with multi-round sampling methods like SC-CoT and C-CoT. Among multi-round sampling methods, our SC-SR-FoT exceeds SC-CoT and C-CoT by 1.5%. Under the open-source model DeepSeek-V2, SR-FoT outperforms Base and CoT by 4.8% and 3.2%, respectively, even surpassing the multi-round sampling methods. Moreover, our SC-SR-FoT further increases the accuracy to 93.0%. Under Qwen1.5-32B-Chat, our SR-FoT improves on the Base and CoT methods by 1.3% and 1.6%, respectively. Our SC-SR-FoT also surpasses SC-CoT and C-CoT, by 2.8% and 3.2%, respectively. These results indicate that our methods perform better on the ScienceQA dataset across multiple models.
#### Performance on StrategyQA and BoolQ.
StrategyQA and BoolQ are two other knowledge-based reasoning question-answer datasets that require a true or false judgment based on context or common sense knowledge. From Table 1, for StrategyQA under GPT-3.5-turbo, in the comparison of single-round sampling methods, our SR-FoT outperforms Base and CoT by 7.3% and 0.8% respectively; in the comparison of multi-round sampling methods, our SC-SR-FoT exceeds SC-CoT and C-CoT by 1.9% and 3.2% respectively. Similar trends are observed under DeepSeek-V2 and Qwen1.5-32B-Chat. In addition, our SR-FoT and SC-SR-FoT also perform the best in both single-round sampling methods and multi-round sampling methods on BoolQ under the three models.
Overall, whether under the closed-source large language model GPT-3.5-turbo or the open-source large language models DeepSeek-V2 and Qwen1.5-32B-Chat, our SR-FoT achieves higher accuracy than the compared methods on the ScienceQA, StrategyQA, and BoolQ datasets. This demonstrates the effectiveness of our SR-FoT.
It is worth noting that under DeepSeek-V2 and Qwen1.5-32B-Chat, the Base method achieves relatively high results across all three datasets, while the benefits of the CoT method show signs of saturation, and CoT at times performs worse than the Base method. However, our methods, whether under single-round sampling settings (SR-FoT) or multi-round aggregated sampling settings (SC-SR-FoT), are still able to further enhance performance, demonstrating greater potential for performance gains. We believe this is because our SR-FoT employs a syllogistic deductive reasoning framework, allowing LLMs to address these knowledge-based reasoning tasks based on a more rigorous reasoning process, thereby achieving better overall performance.
Table 2: Effectiveness comparisons for subcategories on the ScienceQA dataset with DeepSeek-v2.
### Ablation Study
#### Effectiveness Comparisons for Subcategories.
As shown in Table [2](https://arxiv.org/html/2501.11599v1#Sx4.T2 "Table 2 ‣ Performance on StrategyQA and BoolQ. ‣ Experimental Results and Analyses ‣ Experiments ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"), we conduct experiments on the ScienceQA dataset with DeepSeek-V2. The results demonstrate that our methods enhance the reasoning performance of the language model across questions of different difficulty levels and various subjects, compared with the state-of-the-art methods. When self-consistency is added to the proposed method (SC-SR-FoT), the beneficial effects become more significant.
#### Ablation of Stages.
As shown in Table [3](https://arxiv.org/html/2501.11599v1#Sx4.T3 "Table 3 ‣ Ablation of Stages. ‣ Ablation Study ‣ Experiments ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"), we conduct experiments on ScienceQA under DeepSeek-V2 to verify the effectiveness of each stage in our method. Specifically, ‘all in one stage’ denotes using instructions and examples to let the LLM provide the premises based on the question and options, and then directly provide the answers. ‘w/o stage 3’ denotes providing the minor premise directly, instead of posing it as a question first and then answering. The results demonstrate that the completeness of each stage is important. In detail, discarding either the problem explanation or the major and minor premises would decrease the performance. Furthermore, allowing the language model to directly provide the major and minor premise would significantly reduce its performance, demonstrating the necessity of the multi-stage thinking framework in our SR-FoT.
Table 3: Ablation study of stages in our proposed methods on the ScienceQA dataset under DeepSeek-V2.
Table 4: Ablation study of visible information in various stages on the StrategyQA dataset.
Table 5: Fifty cases using CoT and SR-FoT respectively, randomly selected from the three datasets under GPT-3.5-turbo, are analyzed to assess their rigor, and their rigor rates are subsequently calculated.
Table 6: Error sources on a random sample of 50 incorrect examples of our SR-FoT from the three datasets using the GPT-3.5-turbo model. ‘MaPE’ denotes major premise error, ‘MiPQE’ denotes minor premise question error, ‘MiPE’ denotes minor premise error and ‘FRPE’ denotes final reasoning process error.
#### Impact of Visible Information in Various Stages.
As shown in Table [4](https://arxiv.org/html/2501.11599v1#Sx4.T4 "Table 4 ‣ Ablation of Stages. ‣ Ablation Study ‣ Experiments ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"), we conduct experiments on the StrategyQA dataset with DeepSeek-V2. ‘w/o context in stage 3’ denotes generating the minor premise question without considering the context. ‘add Qori in stage 4’ denotes providing the original question, the minor premise question, and the context to the LLM when answering the minor premise question. The results demonstrate that either decreasing or increasing the content of the input prompts adversely affects performance. This underlines the appropriateness of the designed visible information at each stage of our SR-FoT.
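
The default visibility rules and the two ablation variants can be written down as a small lookup table. The sketch below is illustrative only: the field names and the `VISIBLE` mapping are hypothetical labels for the stage inputs described in the Methodology section, not identifiers from the authors' code.

```python
from typing import Dict, Tuple

# Fields each stage may read in the default SR-FoT setting (hypothetical names).
VISIBLE: Dict[int, Tuple[str, ...]] = {
    1: ("question", "context"),
    2: ("question", "context", "explanation"),
    3: ("question", "context", "major_premise"),
    4: ("minor_premise_question", "context"),
    5: ("question", "major_premise", "minor_premise"),
}


def ablate(visible, stage, add=(), drop=()):
    """Return a copy of the table with fields added to or dropped from one stage."""
    variant = dict(visible)
    kept = tuple(f for f in visible[stage] if f not in drop)
    variant[stage] = kept + tuple(f for f in add if f not in kept)
    return variant


def stage_input(visible, stage, state):
    """Select from the accumulated state only what the given stage may see."""
    return {name: state[name] for name in visible[stage]}


# The two variants described in the paragraph above.
NO_CONTEXT_STAGE3 = ablate(VISIBLE, 3, drop=("context",))  # 'w/o context in stage 3'
ADD_QORI_STAGE4 = ablate(VISIBLE, 4, add=("question",))    # 'add Qori in stage 4'
```

Keeping the rules in one table means an ablation is a one-line change, and the default setting stays untouched because `ablate` works on a copy.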
### Rigor Analysis
To analyze more directly whether our SR-FoT improves the rigor of the reasoning process compared to CoT, we randomly select 50 cases from each of the three datasets under GPT-3.5-turbo for manual evaluation. For both CoT and SR-FoT, we call a case rigorous if all intermediate steps, from the first step of reasoning to the final answer, are correct and logically progressive, without any factual inconsistencies or self-inconsistencies; otherwise, it is not rigorous. The results are in Table [5](https://arxiv.org/html/2501.11599v1#Sx4.T5 "Table 5 ‣ Ablation of Stages. ‣ Ablation Study ‣ Experiments ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"). The table shows that our SR-FoT has a higher rigor rate than CoT on all three datasets, indicating that our SR-FoT enhances the rigor and interpretability of LLM reasoning. For specific comparison cases about rigor, please refer to the supplementary materials.
### Error Analysis
We randomly selected 50 error cases from each of the three datasets under GPT-3.5-turbo to perform an error analysis of our SR-FoT. The sources of errors and their respective proportions are shown in Table [6](https://arxiv.org/html/2501.11599v1#Sx4.T6 "Table 6 ‣ Ablation of Stages. ‣ Ablation Study ‣ Experiments ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"). The error analysis shows that the proportions of the different error types vary across datasets. In ScienceQA, most errors stem from the step of extracting suitable minor premises from the question information. In StrategyQA, the main errors stem from the final reasoning process and mistakes in presenting the major premise. In BoolQ, the primary errors originate from the final reasoning process and errors in formulating the minor premise.
### Case Study
We present a case from the ScienceQA dataset to show how CoT and SR-FoT work. In Fig. [4](https://arxiv.org/html/2501.11599v1#Sx3.F4 "Figure 4 ‣ Procedure of Our SR-FoT ‣ Methodology ‣ SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks"), it can be seen that in the fourth and fifth reasoning steps of CoT, the model misunderstands the rhyme condition and thus infers wrong information, resulting in an incorrect final answer. In SR-FoT, the question explanation points out a reasonable direction for the major premise, the major premise then gives more complete rhyme conditions, and the minor premise correctly distinguishes the different ending sounds. Together, these lead the model to the correct final answer. More cases can be found in the supplementary materials.

Conclusion
----------
In this paper, we have developed a multi-stage syllogistic-reasoning framework of thought (SR-FoT) to guide LLMs in solving complex and diverse knowledge-based reasoning question-answering tasks using syllogistic deductive reasoning. Experiments across various knowledge-based reasoning datasets under various LLMs demonstrate the effectiveness and advantages of our method.