Synthesizing Data for Context Attribution: The Future of Transparent and Trustworthy Question Answering

Large Language Models (LLMs) have rapidly become indispensable to how we search, learn and interact with information. Yet their most common real‑world application – question answering (QA) – still suffers from hallucinations that undermine trust and transparency.

As AI systems move into high‑risk sectors, it becomes critical that their outputs are backed by verifiable evidence. This can be achieved through specialized context attribution models, which are explicitly trained to link AI responses to supporting evidence. However, training data for such models is often scarce. Synthetic data provides a powerful way to bridge this gap.

This landscape sets the stage for our research, On Synthesizing Data for Context Attribution in Question Answering (Radevski et al., 2025). Instead of relying on costly and inconsistent human annotation, NEC proposes a method to produce large‑scale, high‑quality synthetic datasets specifically designed to train context attribution models. These datasets empower smaller, more efficient models to outperform even large LLMs in identifying the precise sentences that justify an answer. By closing the gap between answer generation and evidence grounding, we chart a path toward more transparent, explainable and trustworthy QA systems.
 

Why context attribution matters

In traditional QA systems, a model produces an answer directly from the input context or retrieval mechanism. Yet the reasoning behind a model’s answer – which sentences it used or ignored – remains hidden. This lack of transparency creates challenges:

  • Users cannot quickly verify correctness.
  • Hallucinations remain undetected.
  • Biases may influence answers without clear traceability.
  • High‑risk domains cannot rely on opaque reasoning.
  • Regulatory environments increasingly demand explainability.

Sentence‑level attribution has been shown to be the most effective granularity for the human verification of text, allowing users to confirm correctness efficiently. However, existing attribution approaches often rely on large zero‑shot LLMs – which are costly and inconsistent – or on manually labeled datasets that are slow, expensive and narrow in scope. NEC addresses these limitations with a scalable, synthetic-data‑driven alternative.
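
To make this concrete, here is a minimal Python sketch of what a sentence‑level attribution instance can look like. The schema – class name, fields and the toy example – is illustrative, not the exact format used in the paper: each instance pairs a question and answer with a sentence‑split context and the indices of the gold evidence sentences, which reduces attribution to a per‑sentence binary label.

```python
from dataclasses import dataclass, field

@dataclass
class AttributionExample:
    """One sentence-level attribution instance (illustrative schema)."""
    question: str
    answer: str
    context_sentences: list[str]   # full context, evidence and distractors mixed
    evidence_indices: set[int] = field(default_factory=set)  # gold sentences

    def labels(self) -> list[int]:
        # One binary label per sentence: 1 = supports the answer, 0 = noise.
        return [int(i in self.evidence_indices)
                for i in range(len(self.context_sentences))]

example = AttributionExample(
    question="Which river flows through Vienna?",
    answer="The Danube.",
    context_sentences=[
        "Vienna is the capital of Austria.",
        "The Danube flows through Vienna.",       # the gold evidence
        "Paris lies on the banks of the Seine.",  # a plausible distractor
    ],
    evidence_indices={1},
)
assert example.labels() == [0, 1, 0]
```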


SYNQA: A generative approach to attribution training data

At the heart of NEC Laboratories Europe’s research is SYNQA, a novel method for generating synthetic data explicitly tailored for training context attribution models. These models identify the specific sentences within a given text that directly support the answer to a question, making the model’s reasoning transparent and verifiable.

In the traditional setup, an LLM would be asked to classify evidence. This is a mismatch: LLMs are fundamentally optimized for text generation, not evidence classification.

Therefore, instead of asking LLMs to classify evidence, SYNQA prompts the model to generate question–answer pairs directly from curated evidence sentences. This guarantees perfect alignment between the question, answer and attribution – playing directly to the generative strengths of LLMs.


How SYNQA works

SYNQA operates in four major steps: context collection, evidence selection, QA pair generation and distractor mining (finding and adding misleading but plausible pieces of text to a model’s training context). Together, these form a cohesive pipeline for producing training data for reliable attribution models.

1. Context collection

Researchers extract coherent sets of sentences from Wikipedia consisting of single‑article samples for simple reasoning and linked multi‑article samples for multi‑hop reasoning. This creates a broad training foundation with varied semantic complexity.
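
As a rough sketch of this step – assuming the articles are already available as plain text, and using a deliberately naive sentence splitter – context collection could look like this:

```python
import random
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter for the sketch; a real pipeline would use a proper
    # sentence segmenter (e.g. spaCy or NLTK).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sample_context(article_text: str, window: int = 8) -> list[str]:
    """Take a contiguous window of sentences so the sample stays coherent."""
    sentences = split_sentences(article_text)
    if len(sentences) <= window:
        return sentences
    start = random.randrange(len(sentences) - window + 1)
    return sentences[start:start + window]

def sample_multihop_context(article_a: str, article_b: str,
                            window: int = 4) -> list[str]:
    # Multi-hop samples concatenate windows from two linked articles.
    return sample_context(article_a, window) + sample_context(article_b, window)
```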

2. Evidence selection

A small set of sentences is selected as the true attribution, forming the sole knowledge source for QA generation (in other words, the specific sentences in the source text that directly support or justify the answer to a question).
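
Continuing the sketch, evidence selection can be as simple as marking a few sentences of the collected context as gold. The sampling strategy here is a placeholder for whatever curation the real pipeline applies; the point is that only these sentences are passed on to the generator:

```python
import random

def select_evidence(context: list[str], k: int = 2) -> set[int]:
    """Pick k context sentences as the gold attribution. Only these will be
    shown to the generator LLM, so the labels are correct by construction."""
    k = min(k, len(context))
    return set(random.sample(range(len(context)), k))
```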

3. QA pair generation

A large LLM – such as LLaMA‑70B – generates a question and corresponding answer using only these evidence sentences. Because no distractors have been introduced at this stage, the alignment between question, answer and evidence is exact by construction.
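
A hedged sketch of this step: `complete` stands in for whatever text‑generation endpoint serves the large model (for example, a hosted LLaMA‑70B), and the prompt wording and output parsing are illustrative rather than taken from the paper:

```python
QA_PROMPT = """You are given evidence sentences from an encyclopedia.
Write one question that can be answered using ONLY this evidence,
followed by the answer.

Evidence:
{evidence}

Respond in exactly this format:
Question: <question>
Answer: <answer>"""

def generate_qa(evidence: list[str], complete) -> tuple[str, str]:
    """Generate a QA pair grounded solely in the evidence sentences.
    `complete` is any prompt-in/text-out callable; the parsing below
    assumes the model follows the requested format."""
    prompt = QA_PROMPT.format(evidence="\n".join(f"- {s}" for s in evidence))
    raw = complete(prompt)
    question = raw.split("Question:", 1)[1].split("Answer:", 1)[0].strip()
    answer = raw.split("Answer:", 1)[1].strip()
    return question, answer
```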

4. Distractor mining

We introduce semantically similar but irrelevant distractor passages to mimic real‑world information retrieval scenarios. This teaches attribution models to distinguish essential evidence from noise.
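
One common way to implement such mining is embedding‑based similarity search. The sketch below uses the open‑source sentence-transformers library as a stand‑in for whichever encoder the actual pipeline uses; candidates are drawn from articles unrelated to the evidence:

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def mine_distractors(evidence: list[str], candidate_pool: list[str],
                     top_k: int = 6) -> list[str]:
    """Rank candidate sentences by their maximum cosine similarity to any
    evidence sentence: similar enough to look plausible, but drawn from
    sources that cannot answer the question."""
    ev_emb = encoder.encode(evidence, convert_to_tensor=True)
    cand_emb = encoder.encode(candidate_pool, convert_to_tensor=True)
    scores = util.cos_sim(cand_emb, ev_emb).max(dim=1).values
    ranked = scores.argsort(descending=True).tolist()
    return [candidate_pool[i] for i in ranked[:top_k]]
```

Ranking by maximum (rather than average) similarity favours distractors that closely shadow at least one evidence sentence – exactly the kind of near‑miss an attribution model must learn to reject.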

The result is a richly diverse, high‑quality dataset of questions, answers, true evidence sentences, and distractor context – all produced at scale without manual annotation.
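
Tying the sketches together, one synthetic training instance could be assembled like this (all helper names are the illustrative ones defined above):

```python
def build_example(article_text: str, candidate_pool: list[str],
                  complete) -> AttributionExample:
    """Assemble one synthetic training instance from the four steps above."""
    context = sample_context(article_text)              # 1. context collection
    evidence_idx = select_evidence(context)             # 2. evidence selection
    evidence = [context[i] for i in sorted(evidence_idx)]
    question, answer = generate_qa(evidence, complete)  # 3. QA pair generation
    distractors = mine_distractors(evidence, candidate_pool)  # 4. distractors
    # Appending distractors keeps the gold indices valid; they get label 0.
    return AttributionExample(question, answer, context + distractors,
                              evidence_idx)
```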


Why synthetic data works better than human annotation

The NEC Laboratories Europe team benchmarked SYNQA‑trained models across major QA datasets, including SQuAD, HotpotQA, QuAC, CoQA, OR‑QuAC and DoQA. Their findings highlight several key advantages:

SYNQA‑trained models outperform large zero‑shot LLMs

A 1B‑parameter attribution model trained with SYNQA surpasses models 70 times larger in zero‑shot mode, enabling real‑time attribution and cost‑efficient deployment. It also supports lightweight safety layers in production – small, efficient safeguards that run alongside an AI model in real‑world applications to keep it reliable, safe, and compliant without adding heavy computational load.
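
At inference time, such a compact model can act as a per‑sentence scorer. The sketch below uses the Hugging Face transformers API; the checkpoint name and the input format are assumptions for illustration, not a published model card:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical checkpoint name, standing in for a 1B model fine-tuned on SYNQA.
CHECKPOINT = "synqa-attribution-1b"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

def attribute(question: str, answer: str, sentences: list[str]) -> list[float]:
    """Return P(evidence) for every context sentence, assuming a classifier
    trained on (question+answer, sentence) pairs with two labels."""
    queries = [f"question: {question} answer: {answer}"] * len(sentences)
    batch = tokenizer(queries, sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return torch.softmax(logits, dim=-1)[:, 1].tolist()
```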

Synthetic data generalizes better than human‑annotated datasets

Human‑generated labels are narrow in scope. SYNQA’s structural diversity enables models to generalize effectively – even to conversational QA formats they were never explicitly trained on.

Synthetic data enables multi‑hop reasoning

Manually building multi‑document evidence chains is expensive. SYNQA automates this fully, teaching attribution models to handle complex multi‑hop tasks.

Scaling synthetic data improves performance

Using SYNQA, recall increases as the dataset grows, while precision remains high and robustness improves. This mirrors scaling laws seen in LLM training at a fraction of the cost. Larger synthetic datasets expose attribution models to broader evidence patterns and help them learn subtle distinctions between relevant and irrelevant context, improving reliability even in ambiguous scenarios.


Applications: where attribution becomes essential

In high‑risk domains, knowing why an AI model produces an answer is essential for safety and trust. When LLMs support decisions in areas like healthcare, law or finance, even minor errors can have major consequences. Accurate attribution ensures each answer is backed by verifiable evidence, enabling early error detection, informed decision‑making and compliance with strict transparency requirements. As AI becomes more embedded in critical workflows, attribution provides the visibility and accountability needed to manage risk effectively. Benefits of using SYNQA in different domains include: 

Enterprise knowledge systems
Natural‑language queries over internal documents return not only answers but the precise supporting evidence.

Customer support and chatbots
Attribution reduces hallucinations and improves reliability for automated support.

Legal, healthcare and financial compliance
High‑risk sectors require explainability. Attribution makes AI‑assisted decisions fully auditable.

Fact‑checking and RAG pipelines
Attribution ensures generated answers are grounded in retrieved context.


Human‑centred AI: attribution strengthens trust

User studies show that people verify answers significantly faster and substantially more accurately when given attributions, with SYNQA‑trained attributions proving the most effective. Attribution enhances human judgment, enabling users to validate answers confidently rather than relying blindly on model fluency.


The broader implications for AI safety

By requiring models to ground their outputs in observable evidence, attribution acts as a powerful transparency mechanism. It makes hallucinations visible, ensures unsupported claims can be flagged or rejected, and exposes bias at the sentence level rather than allowing it to remain hidden inside the model’s latent space. This evidence‑linking also helps systems meet regulatory explainability expectations and enables audits at scale – critical capabilities as AI becomes embedded in high‑stakes decision‑making environments.

SYNQA also fits directly within NEC’s broader AI Harness Framework, which emphasizes modular safety layers that enhance transparency, controllability and accountability across the entire AI lifecycle. Within this framework, context attribution serves as a foundational guardrail: it transforms LLM outputs from opaque predictions into verifiable claims, enabling quality checkers, fact‑verification modules and risk‑scoring engines to operate on reliable evidence trails. By supplying high‑fidelity synthetic data at scale, SYNQA strengthens one of the framework’s core pillars: the ability to inspect, verify and evaluate AI behavior consistently across applications.


Conclusion: Synthetic Attribution Is a Breakthrough in AI Transparency

Our research shows the transformative potential of synthetic data for context attribution. SYNQA enables smaller models to outperform massive LLMs while maintaining high precision in evidence identification. Synthetic data generated through this approach surpasses human annotation in both scale and generalization, allowing models to learn from broad, diverse and high‑fidelity examples. As a result, evidence grounding becomes fast, precise and fine‑grained – making transparent QA systems not only possible but practical today. Context attribution will be a cornerstone of safe AI – and SYNQA shows exactly how to make it real.

References

Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., & Manning, C. D. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-Hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W., Choi, Y., Liang, P., & Zettlemoyer, L. (2018). QuAC: Question Answering in Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 2174–2184).

Reddy, S., Chen, D., & Manning, C. D. (2019). CoQA: A Conversational Question Answering Challenge. Transactions of the Association for Computational Linguistics, 7, 249–266.

Qu, C., Yang, L., Chen, C., Qiu, M., Croft, W. B., & Iyyer, M. (2020). Open-Retrieval Conversational Question Answering. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020).

Campos, J. A., Otegi, A., Soroa, A., Deriu, J., Cieliebak, M., & Agirre, E. (2020). DoQA: Accessing Domain-Specific FAQs via Conversational QA. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020).
