ComplexWebQuestions
Also known as: CWQ
Facts (40)
Sources
Empowering GraphRAG with Knowledge Filtering and Integration arxiv.org Mar 18, 2025 14 facts
measurement: The study evaluates performance using the F1 score on the WebQSP (Yih et al., 2016) and CWQ (Talmor and Berant, 2018) datasets.
reference: The WebQSP dataset, introduced by Yih et al. (2016), contains 4,737 natural language questions that require reasoning over paths of up to two hops, while the CWQ dataset was introduced by Talmor and Berant (2018).
measurement: The GraphRAG-Filtering component improves the ROG retriever's performance, increasing the F1 score by 4.19% and the Hit score by 3.15% on the CWQ dataset.
claim: The researchers use the WebQSP and CWQ benchmark datasets for Knowledge Graph Question Answering (KGQA) tasks.
measurement: Applying the GraphRAG-FI method results in a 2.23% improvement in Hit and a 3.63% improvement in F1 over the ROG* baseline on the CWQ dataset when noise is present.
measurement: The GraphRAG-FI method yields an average increase of 5.03% in Hit and 3.70% in F1 compared to PageRank-based filtering across both the WebQSP and CWQ datasets.
measurement: The GraphRAG-FI method achieves an average improvement of 4.78% in Hit and 3.95% in F1 compared to similarity-based filtering when used with the ROG retriever across both the WebQSP and CWQ datasets.
measurement: On the CWQ dataset, the ROG-original method achieved a Hit rate of 61.91 and an F1 score of 54.95, while the ROG + GraphRAG-FI method achieved a Hit rate of 64.82 and an F1 score of 55.12.
measurement: Introducing noise paths into the retrieved context causes the ROG* model's Hit score to decrease by 2.29% and its F1 score to drop by 2.23% on the CWQ dataset.
procedure: The framework uses LLaMA2-Chat-7B as the Large Language Model backbone, instruction-finetuned for three epochs on the training splits of WebQSP and CWQ over Freebase.
measurement: The CWQ dataset contains 34,689 complex questions that require multi-hop reasoning over up to four hops.
measurement: Leveraging logits to filter out low-confidence responses improves performance on the WebQSP and CWQ datasets. On WebQSP, the 'LLM with Logits' approach achieved a Hit rate of 84.17 and an F1 score of 76.74, versus 66.15 and 49.97 for the baseline LLM; on CWQ, it achieved a Hit rate of 61.83 and an F1 score of 58.19, versus 40.27 and 34.17 for the baseline.
measurement: The CWQ dataset contains 27,639 training samples and 3,531 testing samples, with a maximum hop count of 4.
measurement: In experiments on the WebQSP and CWQ datasets, the GNN-RAG + GraphRAG-FI method achieved the highest performance, with a Hit rate of 91.89% and an F1 score of 75.98% on WebQSP, and a Hit rate of 71.12% and an F1 score of 60.34% on CWQ.
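The Hit and F1 figures quoted above are standard KGQA answer-set metrics: Hit checks whether at least one gold answer appears among the predictions, while F1 balances precision and recall over the predicted answer set. A minimal sketch, not taken from any of the cited codebases:

```python
def hit(predicted, gold):
    """Hit: 1.0 if any gold answer is among the predictions, else 0.0."""
    return 1.0 if set(predicted) & set(gold) else 0.0

def f1(predicted, gold):
    """Set-level F1 between predicted and gold answer sets."""
    pred, gold = set(predicted), set(gold)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: two of three predictions are correct, and one gold answer is missed.
preds = ["Barack Obama", "Joe Biden", "George Bush"]
golds = ["Barack Obama", "Joe Biden", "Donald Trump"]
print(hit(preds, golds))            # 1.0
print(round(f1(preds, golds), 3))   # 0.667
```

Reported numbers such as "Hit 61.91, F1 54.95" are these per-question scores averaged over the test split and scaled to percentages.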
KG-RAG: Bridging the Gap Between Knowledge and Creativity - arXiv arxiv.org May 20, 2024 12 facts
procedure: The ComplexWebQuestions dataset was originally split into 80% for training, 10% for development, and 10% for testing.
measurement: The ComplexWebQuestions dataset contains 34,689 complex questions, each associated with an average of 367 Google web snippets and corresponding SPARQL queries for the Freebase knowledge graph.
claim: Preliminary experiments using the KG-RAG pipeline on the ComplexWebQuestions dataset demonstrate a reduction in hallucinated content.
claim: The ComplexWebQuestions (CWQ) dataset is designed to test knowledge graph question answering (KGQA) frameworks by including complex queries that require multi-hop reasoning, temporal constraints, and aggregations.
procedure: In the KG-RAG study, researchers randomly sampled 100 questions from the development split of the ComplexWebQuestions dataset, excluding 11 questions where the correct answers were absent from the provided snippets.
claim: The KG-RAG experimental evaluation uses the ComplexWebQuestions (CWQ) version 1.1 dataset.
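The KG-RAG sampling procedure described above (draw 100 dev questions, then discard those whose gold answers never appear in the accompanying snippets) can be sketched as follows. The field names `question`, `answers`, and `snippets` are illustrative assumptions, not the dataset's actual schema:

```python
import random

def sample_then_filter(dev_split, k=100, seed=0):
    """Randomly sample up to k questions, then keep only those whose
    gold answers appear in at least one of their web snippets."""
    rng = random.Random(seed)
    sampled = rng.sample(dev_split, min(k, len(dev_split)))
    return [
        q for q in sampled
        if any(ans.lower() in snip.lower()
               for ans in q["answers"] for snip in q["snippets"])
    ]

# Toy example: two answerable questions and one whose answer is missing.
dev = [
    {"question": "Q1?", "answers": ["Paris"], "snippets": ["... Paris ..."]},
    {"question": "Q2?", "answers": ["1969"], "snippets": ["... 1969 ..."]},
    {"question": "Q3?", "answers": ["Oslo"], "snippets": ["no match here"]},
]
picked = sample_then_filter(dev, k=100)
print(len(picked))  # 2
```

In the study this left 89 of the 100 sampled questions; the substring check here is a stand-in for whatever answer-matching rule the authors actually used.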
Large Language Models Meet Knowledge Graphs for Question ... arxiv.org Sep 22, 2025 11 facts
reference: The KG-Adapter method, proposed by Tian et al. in 2024, utilizes parameter-efficient fine-tuning and joint reasoning with Llama-2-7B-base and Zephyr-7B-alpha language models, incorporating ConceptNet and Freebase knowledge graphs to perform KGQA, MCQA, and OBQA tasks on the OBQA, CSQA, WQSP, and CWQ datasets, evaluated using Acc and Hits@1 metrics.
reference: The KELDaR method, proposed by Li et al. in 2024, employs a question decomposition tree and atomic knowledge graph retrieval using GPT-3.5-Turbo and GPT-4-Turbo models to perform KGQA and multi-hop QA tasks, evaluated using the EM metric on WQSP and CWQ datasets.
reference: The KBIGER method, proposed by Du et al. in 2022, uses iterative instruction reasoning with an LSTM-based pre-trained model and dataset-inherent knowledge graphs or Freebase to perform multi-hop KBQA tasks, evaluated using Hits@1 and F1 metrics on WQSP, CWQ, and GrailQA datasets.
reference: The LEGO-GraphRAG method, proposed by Cao et al. in 2024, utilizes modular graph RAG with Qwen2-72B and Sentence Transformer models, incorporating the Freebase knowledge graph to perform KBQA tasks on the WQSP, CWQ, and GrailQA datasets, evaluated using R, F1, and Hits@1 metrics.
reference: The Oreo method, proposed by Hu et al. in 2022, uses knowledge interaction, injection, and knowledge graph random walks with RoBERTa-base and T5-base models to perform CBQA, OBQA, and multi-hop QA tasks, evaluated using accuracy on NQ, WQ, WQSP, TriviaQA, CWQ, and HotpotQA datasets.
reference: The InteractiveKBQA method, proposed by Xiong et al. in 2024, uses Multi-turn Interaction for Observation and Thinking with GPT-4-Turbo, Mistral-7B, and Llama-2-13B models and Freebase, Wikidata, and Movie KG knowledge graphs for KBQA and domain-specific QA, evaluated using F1, Hits@1, EM, and Acc metrics on the WQSP, CWQ, KQA Pro, and MetaQA datasets.
reference: The KG-CoT method, proposed by Zhao et al. in 2024, uses chain-of-thought-based joint reasoning between knowledge graphs and LLMs (GPT-4, GPT-3.5-Turbo, Llama-7B, Llama-13B) to perform KBQA and multi-hop QA tasks, evaluated using Acc and Hit@K metrics on WQSP, CWQ, SQ, and WQ datasets.
reference: The KG-Agent method, proposed by Jiang et al. in 2024, uses KG-Agent-based instruction tuning with Davinci-003, GPT-4, and Llama-2-7B models to perform KGQA and ODQA tasks, evaluated using Hits@1 and F1 metrics on WQSP, CWQ, and GrailQA datasets.
reference: The GAIL method, proposed by Zhang et al. in 2024, utilizes GAIL fine-tuning with Llama-2-7B and BERTa language models, incorporating the Freebase knowledge graph to perform KGQA tasks on the WQSP, CWQ, and GrailQA datasets, evaluated using EM, F1, and Hits@1 metrics.
reference: The ToG method, proposed by Sun et al. in 2024, uses beam-search-based retrieval and LLM agents with GPT-3.5-Turbo, GPT-4, and Llama-2-70B-Chat models to perform KBQA and open-domain QA tasks, evaluated using Hits@1 on CWQ, WQSP, GrailQA, QALD10-en, and WQ datasets.
reference: The SR method, proposed by Zhang et al. in 2022, utilizes a trainable subgraph retriever and fine-tuning with the RoBERTa-base language model and dataset-inherent knowledge graphs to perform KBQA tasks, evaluated using Hits@1 and F1 metrics on WQSP and CWQ datasets.
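Several of the retrieval-based methods above (e.g. ToG) expand relation paths from a question's topic entity with beam search. A heavily simplified sketch over a toy triple store; a real system would rank candidate paths with an LLM rather than simply truncate them:

```python
def beam_search_paths(kg, start, depth=2, beam=2):
    """Expand relation paths from `start` level by level, keeping at
    most `beam` candidate paths per level (ToG-style, simplified)."""
    frontier = [[start]]
    for _ in range(depth):
        candidates = []
        for path in frontier:
            tail = path[-1]
            # Follow every outgoing relation from the path's tail entity.
            for (subj, rel), objs in kg.items():
                if subj == tail:
                    for obj in objs:
                        candidates.append(path + [rel, obj])
        frontier = candidates[:beam]  # stand-in for LLM-based path scoring
    return frontier

# Toy KG as (subject, relation) -> objects.
toy_kg = {
    ("Nile", "flows_through"): ["Egypt"],
    ("Egypt", "capital"): ["Cairo"],
}
paths = beam_search_paths(toy_kg, "Nile", depth=2, beam=2)
print(paths)  # [['Nile', 'flows_through', 'Egypt', 'capital', 'Cairo']]
```

The depth parameter corresponds to the hop count that makes CWQ hard: its questions require paths of up to four hops, so the candidate space grows quickly without aggressive pruning.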
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... arxiv.org Feb 23, 2026 1 fact
reference: Classic KGQA benchmarks such as ComplexWebQuestions (Talmor and Berant, 2018) and FreebaseQA (Jiang et al., 2019) are static, using fixed Knowledge Graph snapshots.
A survey on augmenting knowledge graphs (KGs) with large ... link.springer.com Nov 4, 2024 1 fact
claim: ComplexWebQuestions is a benchmark for evaluating complex question answering over knowledge graphs by testing a model's ability to handle multi-hop reasoning and compositional questions.
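The compositional, multi-hop questions CWQ tests can be pictured as chained lookups over a knowledge graph: answer a simple sub-question, then feed its answers into the next hop. A toy two-hop example; the triples are invented for illustration:

```python
# Toy knowledge graph as (subject, relation) -> objects.
KG = {
    ("Nile", "flows_through"): ["Egypt", "Sudan"],
    ("Egypt", "capital"): ["Cairo"],
    ("Sudan", "capital"): ["Khartoum"],
}

def hop(entities, relation):
    """Follow one relation from a set of entities."""
    out = []
    for e in entities:
        out.extend(KG.get((e, relation), []))
    return out

# Two-hop compositional query:
# "What are the capitals of the countries the Nile flows through?"
countries = hop(["Nile"], "flows_through")
capitals = hop(countries, "capital")
print(capitals)  # ['Cairo', 'Khartoum']
```

CWQ pairs each such composition with a SPARQL query against Freebase; the dict lookup here stands in for executing that query hop by hop.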