Datenschutz Deep Dive
[Illustration: A bulging folder of customer records on a conveyor belt heading into an AI training server; an undersized stamp reading 'Art. 89 GDPR – Research' covers only one corner, leaving most of the records exposed.]

"Analytics Only": Art. 89 GDPR Does Not Cover Your AI Training Set

Companies repurposing customer data for model training routinely invoke the research and statistics privilege under Art. 89(1) GDPR and Art. 31(2)(e) DSG. The privilege lifts only the purpose limitation, not the requirement for a legal basis, and it depends on data minimisation and an aggregated output that commercial training sets rarely deliver — a defence worth documenting, not a safe harbour.

Casimir von Firn, MLaw

“We only use the data for analytics.” The phrase is meant to justify repurposing customer data for model training and invokes, usually without naming it, the research and statistics privilege under Art. 89(1) GDPR and Art. 31(2)(e) DSG. The privilege exists, but it relaxes only the purpose limitation (Art. 5(1)(b) GDPR), not the requirement for a legal basis, and it depends on safeguards that a commercial training set rarely provides. It is not a safe harbour: anyone relying on it must document that position before the first training run.

What the Privilege Covers — and What It Does Not

The mechanism is set out in Art. 5(1)(b) GDPR. Further processing for “scientific or historical research purposes or statistical purposes” shall, “in accordance with Article 89(1), not be considered to be incompatible with the initial purposes.” This is a legal fiction of purpose compatibility: a controller that collected data for contract performance and later uses it for training need not separately assess whether the two purposes are compatible. But the fiction applies only “in accordance with Article 89(1).” If the safeguards required by that provision are absent, so is the fiction.

Art. 89(1) GDPR requires “appropriate safeguards, in accordance with this Regulation, for the rights and freedoms of the data subject,” specifically technical and organisational measures to “ensure respect for the principle of data minimisation,” and pseudonymisation to the extent the purpose allows. Paragraph 2 allows Union or Member State law to derogate from the data subject rights under Art. 15, 16, 18 and 21 GDPR in so far as those rights “are likely to render impossible or seriously impair the achievement of the specific purposes” and the derogation is necessary to fulfil those purposes. That is the full extent of the privilege: it dispenses with the compatibility assessment and attenuates the rights of access, rectification, restriction of processing and objection. It does not supply a legal basis. Any controller processing personal data still requires a lawful basis under Art. 6 GDPR and, for special categories of data, additionally under Art. 9. The EDPB Guidelines 1/2026 (consultation draft of 16 April 2026) make clear that compatibility and lawfulness remain two separate, cumulative requirements: the lawfulness of further processing must be established independently, even where purpose compatibility is presumed by the fiction.

The cost of conflating the two is illustrated by the Italian supervisory authority. The Garante fined OpenAI €15 million in December 2024 because the training of ChatGPT lacked an adequate legal basis under Art. 6 GDPR and transparency obligations had been violated. The problem was not purpose compatibility but the missing legal basis. In its Opinion 28/2024, the European Data Protection Board (EDPB) likewise assesses the lawfulness of AI model training under legitimate interest, Art. 6(1)(f) GDPR. The research privilege under Art. 89 GDPR does not appear in Opinion 28/2024 at all, because the legal basis for training is a question that must be answered independently, without recourse to the compatibility fiction.

“Statistics” Means Aggregated, Not Personalised

The second misunderstanding concerns the word “statistics.” Recital 162 GDPR attaches a condition to the statistical purpose: the result must be “not personal data, but aggregate data,” and neither that result nor the personal data may be “used in support of measures or decisions regarding any particular natural person.” The limit is therefore not the label but the effect of the model.

A model that ranks customers by churn risk, sets prices individually, ranks applicants, or directs fraud suspicion at a particular person produces precisely the kind of decision about a specific individual that the statistical purpose excludes. In that case the processing is personal in nature, and the privilege does not apply. A controller who derives a market analysis from the same data — one whose result no longer identifies anyone — remains within the privilege.

Swiss law draws the same line, but more explicitly. Art. 31(2)(e) DSG justifies processing that would otherwise breach the purpose limitation under Art. 6(3) DSG. It is permitted only “for non-personal purposes, in particular in research, planning and statistics.” The controller must anonymise the data “as soon as the purpose of the processing allows” and publish results in a form that makes data subjects unidentifiable. The phrase “non-personal purposes” already settles the AI question: if the model targets individuals, the purpose is personal in nature, and the justification falls away.
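The difference between an aggregate result and a decision about an individual can be made concrete. The following Python sketch is illustrative only and assumes a hypothetical input of `(segment, churned)` pairs; the function name `aggregate_churn` and the threshold `MIN_GROUP_SIZE` are assumptions, since neither the GDPR nor the DSG fixes a minimum group size. It returns churn rates per segment and suppresses segments too small to hide an individual.

```python
from collections import defaultdict

# Assumed threshold for illustration; no statute prescribes this number.
MIN_GROUP_SIZE = 10


def aggregate_churn(rows, min_group_size=MIN_GROUP_SIZE):
    """Return the churn rate per segment, omitting segments below the threshold.

    rows: iterable of (segment, churned) pairs, where churned is a bool.
    The output contains no customer identifiers and no per-person scores.
    """
    counts = defaultdict(lambda: [0, 0])  # segment -> [customers, churned]
    for segment, churned in rows:
        counts[segment][0] += 1
        counts[segment][1] += int(churned)
    return {
        segment: round(churned / total, 3)
        for segment, (total, churned) in counts.items()
        if total >= min_group_size
    }
```

An output of this kind is a market analysis. A function that instead returned one churn score per customer ID would produce exactly the individual-level result that the statistical purpose excludes. Small-group suppression alone does not establish anonymity; it only illustrates the direction of the requirement.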

“Research” Is Broad, But Not a Label

That leaves the escape route through “scientific research.” Recital 159 GDPR reads the concept broadly by design: it is intended to cover “technological development” and “privately funded research.” That sounds as though any corporate R&D lab is covered.

The EDPB Guidelines 1/2026 confirm the broad approach: research may be commercially organised and profit-oriented. But they impose six requirements — a methodical and systematic approach, ethical standards, independence, verifiability, a research objective, and a contribution to knowledge or societal benefit. They also stress that the concept must not be stretched beyond its ordinary meaning. The Guidelines do not cite AI development anywhere as a settled case of research.

That upends the convenient equation. Product development labelled “research” because the word relaxes the purpose limitation rarely satisfies those six criteria. A controller invoking the privilege bears the burden of demonstrating the research character of the activity: method, protocol, independent or ethics oversight, a verifiable result. Training a recommendation model whose sole purpose is better conversion rates offers none of these.

Data Minimisation Overrides “More Data Is Better”

The final safeguard is the most inconvenient. Art. 89(1) GDPR requires measures for data minimisation and pseudonymisation to the extent the purpose allows; Art. 31(2)(e) DSG requires anonymisation as soon as the purpose allows. Both establish an order of preference: anonymous data first, then pseudonymous data, and identifiable data only as a last resort.

That runs counter to the training logic that more complete data produces a better model. According to the CNIL's recommendations on model training, data minimisation does not prohibit large training datasets, but the data must be selected and cleaned, and unnecessary personal data must stay out. The burden of justification therefore rests with the controller, who must demonstrate for each data field why it belongs in the set.

A training set that retains complete customer profiles “just to be safe” inverts that sequence. It starts with identifiable data and defers minimisation to later, and in live operations “later” never comes. Such a set is precisely what the research privilege does not cover.
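Field-level justification can be enforced at the point where records enter the training set. The sketch below is a minimal Python illustration under stated assumptions: the register `FIELD_JUSTIFICATIONS`, the field names, and the functions `pseudonymise` and `minimise` are all hypothetical. Any field without a documented justification is dropped, and the direct identifier is replaced by a keyed hash.

```python
import hashlib
import hmac

# Hypothetical register: a field enters the training set only if it is listed
# here together with its documented justification.
FIELD_JUSTIFICATIONS = {
    "tenure_months": "predictor examined in the study",
    "plan_tier": "predictor examined in the study",
}


def pseudonymise(customer_id: str, key: bytes) -> str:
    """Replace the identifier with a keyed hash (HMAC-SHA256).

    Without the key, which should be stored separately, the mapping
    cannot be recomputed from the customer ID.
    """
    return hmac.new(key, customer_id.encode("utf-8"), hashlib.sha256).hexdigest()


def minimise(record: dict, key: bytes) -> dict:
    """Keep only justified fields and swap the direct identifier for a pseudonym."""
    kept = {name: value for name, value in record.items()
            if name in FIELD_JUSTIFICATIONS}
    kept["pid"] = pseudonymise(record["customer_id"], key)
    return kept
```

Applied to a record that also contains a name and an e-mail address, `minimise` returns only `tenure_months`, `plan_tier` and `pid`. The result is still personal data, because pseudonymised data remains attributable with the key; the sketch addresses the minimisation step, not anonymisation.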

The Takeaway for Monday

Document four things separately before the next training run. First, the legal basis under Art. 6 GDPR (and Art. 9 for special categories), independently of the compatibility question. Second, the safeguard framework under Art. 89(1): the level of minimisation, pseudonymisation, and the technical and organisational measures in place. Third, evidence that the result remains aggregated and does not support decisions about individuals, or that the model is anonymous. Fourth, if you invoke “research,” the file documenting the research character of the activity.
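The four items can be turned into a simple gate that runs before a training job starts. The Python sketch below is one possible shape, not a compliance tool: the class `TrainingRunFile`, its field names and the function `missing_items` are hypothetical, and the check only verifies that each entry has been filled in, not that its content is legally sufficient.

```python
from dataclasses import dataclass


@dataclass
class TrainingRunFile:
    """Hypothetical record of the four items documented per training run."""
    legal_basis: str = ""           # Art. 6 GDPR, plus Art. 9 where relevant
    safeguards: str = ""            # minimisation, pseudonymisation, other measures
    aggregation_evidence: str = ""  # aggregated result or anonymous model
    invokes_research: bool = False  # set when the research privilege is relied on
    research_file: str = ""         # required only if invokes_research is True


def missing_items(run_file: TrainingRunFile) -> list[str]:
    """Return the names of the items that are still undocumented."""
    missing = []
    if not run_file.legal_basis.strip():
        missing.append("legal_basis")
    if not run_file.safeguards.strip():
        missing.append("safeguards")
    if not run_file.aggregation_evidence.strip():
        missing.append("aggregation_evidence")
    if run_file.invokes_research and not run_file.research_file.strip():
        missing.append("research_file")
    return missing
```

A pipeline could refuse to start while `missing_items` returns a non-empty list. Keeping the legal basis as its own field mirrors the point that it must be established independently of the compatibility question.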

The structural limits are established. The privilege relaxes the purpose limitation but does not replace the legal basis; “statistics” means aggregated, not personalised; minimisation applies even to large datasets. What remains open is whether a supervisory authority will recognise commercial model training as “scientific research” within the meaning of Recital 159, and how strictly it will apply the bar against individual decisions to personalising models. That gap will not be closed by commentary but by the final version of EDPB Guidelines 1/2026 after the public consultation, which runs until 25 June 2026. Anyone whose secondary use depends on the research reading should submit comments before that date. Switzerland has no equivalent of these guidelines; whether the EDÖB, the Swiss federal data protection commissioner, will transpose the European interpretation to Art. 31(2)(e) DSG will be determined by the first published opinion.