OpenAI began as a non-profit that trained open-source AI models on unpublished books. Eight years later, fueled by billions of dollars of investment from Microsoft, the company faces allegations of violating European data protection law—and compliance demands that might be impossible to meet.

‘A Good Outcome for All’ 

OpenAI started in 2015, funded by donations from entrepreneurs including Sam Altman (now the company’s CEO), Elon Musk, and Peter Thiel—plus organisations such as Amazon Web Services (AWS), Infosys, and YC Research.

OpenAI’s stated goal was to “advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return.” 

“It’s hard to predict when human-level AI might come within reach,” an early OpenAI press release states. “When it does, it’ll be important to have a leading research institution which can prioritise a good outcome for all over its own self-interest”.

Semi-Supervised Learning 

OpenAI’s most significant work is its GPT series (short for “Generative Pre-trained Transformer”) of AI models. 

GPT was the first “transformer”-type AI to receive “semi-supervised” training. Whereas earlier transformers required labelled training data, which is costly and time-consuming to produce, GPT could learn from large amounts of raw, unlabelled text.
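To make the distinction concrete, here is a minimal sketch of self-supervised next-token training in Python (using PyTorch): the “labels” are simply the next characters of the raw text itself, so no human annotation is needed. The tiny recurrent model below is an illustrative stand-in, not GPT’s actual Transformer architecture.

```python
# Minimal sketch: "semi-supervised" / self-supervised language-model
# pretraining. The targets are just the input text shifted by one
# character, so raw, unlabelled text is the only ingredient.
# Illustrative only -- GPT is a large Transformer, not this toy model.
import torch
import torch.nn as nn

corpus = "any large body of raw, unlabelled text will do here"
vocab = sorted(set(corpus))
stoi = {ch: i for i, ch in enumerate(vocab)}
data = torch.tensor([stoi[ch] for ch in corpus])

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)  # stand-in for attention blocks
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        hidden, _ = self.rnn(self.embed(x))
        return self.head(hidden)

model = TinyLM(len(vocab))
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    inputs, targets = data[None, :-1], data[None, 1:]  # next-token "labels" for free
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, len(vocab)), targets.reshape(-1))
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```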

The first GPT model was trained on literature—7,000 unpublished books comprising 4.5 GB of text. 

With GPT’s successor, GPT-2, OpenAI began integrating text scraped from the open web. The model’s training set included “all outbound links from Reddit… which received at least three karma”. As a result, GPT-2 produced more convincing, “human-like” outputs.

After some initial reluctance to release the model publicly—supposedly due to concern over its potential to produce disinformation—OpenAI published a limited version of GPT-2, along with its source code, in February 2019, releasing the full model that November.

Common Crawl 

Shortly before GPT-2’s full release, OpenAI announced that it was switching from a non-profit to a “capped-profit” company, in which investors’ returns would be capped at 100 times their original investment.

The following year, OpenAI announced GPT-3—a version of which would later power OpenAI’s leading commercial product, ChatGPT. 

GPT-3 was trained on a much larger corpus of data than previous GPT models. Around 60% of GPT-3’s training set came from Common Crawl, a non-profit that “scrapes” the web each month and provides free access to the resulting dataset. 

Common Crawl has been largely left alone by US authorities and rightsholders, and the organisation has defended the legality of its operations against allegations of copyright abuse.

Under US law, web scraping is generally considered protected by the First Amendment. The legal situation is different in Europe, where a “legal basis” is required for most activities involving personal data (which will inevitably appear in any sufficiently large set of web-scraped data).
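To see why personal data is effectively unavoidable in web-scraped corpora, consider this hedged sketch: even a crude scan of a few crawled pages surfaces identifiers such as email addresses. The sample pages and pattern below are invented for illustration.

```python
# Sketch: naive scan of web-scraped text for one obvious category of
# personal data (email addresses). Real crawls also contain names,
# usernames, phone numbers, and so on; the sample pages are invented.
import re

scraped_pages = [
    "Contact our editor at jane.doe@example.com for corrections.",
    "Posted by u/throwaway123 -- great recipe, thanks!",  # usernames can be personal data too
]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

for page in scraped_pages:
    for match in EMAIL.findall(page):
        print("possible personal data:", match)
```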

The Closing of OpenAI 

Although OpenAI allowed limited third-party access to the GPT-3 API, enabling others to integrate GPT-3 into their products, the company declined to release the model’s source code. 
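In practice, third-party access looked roughly like the sketch below: a plain HTTPS call to OpenAI’s hosted completions endpoint, with the model weights never leaving OpenAI’s servers. The endpoint and request fields follow OpenAI’s public API documentation from that period; the model name and prompt are illustrative.

```python
# Sketch: calling the hosted GPT-3 API rather than running the model
# locally (the source code and weights were never released).
# Assumes an API key is available in the OPENAI_API_KEY variable.
import os
import requests

response = requests.post(
    "https://api.openai.com/v1/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "text-davinci-003",  # illustrative GPT-3-family model name
        "prompt": "Summarise the GDPR in one sentence.",
        "max_tokens": 60,
    },
    timeout=30,
)
print(response.json()["choices"][0]["text"])
```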

“In addition to being a revenue source to help us cover costs in pursuit of our mission, the API has pushed us to sharpen our focus on general-purpose AI technology,” OpenAI said in a June 2020 blog post.

In January, the company received a reported $10 billion funding injection from Microsoft, which subsequently announced it would integrate OpenAI’s models into its Bing search engine.

On releasing its most recent GPT model, GPT-4, this March, OpenAI did not publish any information about the model’s size, architecture, or training data, citing the “competitive landscape” and “safety implications”. 

‘Plausible-Sounding But Incorrect’ 

ChatGPT, the chatbot released by OpenAI last November, runs on GPT-3.5—a fine-tuned version of GPT-3 whose training data includes information published as recently as June 2021.

ChatGPT’s user-friendly design helped it reportedly become history’s fastest-growing app, attracting over 100 million users within a few months of its launch. 

Despite the program’s impressively human-like outputs, OpenAI admitted that ChatGPT would sometimes “respond to harmful instructions”, “exhibit biased behaviour”, and produce “plausible-sounding but incorrect or nonsensical answers”. 

OpenAI’s GDPR compliance efforts have been relatively slow. The company published its data processing agreement, a mandatory contract for companies using “data processors” under the GDPR, on 14 March—some three and a half months after ChatGPT’s launch.

The following week, OpenAI notified users of a security breach that exposed some users’ private chat topics, names, email addresses, billing addresses, and limited payment information. It was this relatively minor incident that led to OpenAI’s first reckoning under the GDPR. 

OpenAI’s GDPR Reckoning 

Italy’s data protection authority (DPA), the Garante, was the first regulator to directly challenge OpenAI, announcing action against the company on 31 March. 

While the Garante’s intervention was triggered by ChatGPT’s security incident, the regulator issued an emergency order addressing a much broader set of issues. The Garante alleged that OpenAI: 

- provided no information to users, or to the people whose personal data it collected, about how that data would be used;
- lacked a legal basis for the mass collection and processing of personal data to train its models;
- processed personal data inaccurately, given that ChatGPT’s outputs about real people are not always correct; and
- failed to verify users’ ages, despite its terms restricting ChatGPT to people aged 13 and over.

The regulator cited violations of Articles 5, 6, 8, 13, and 25 of the GDPR—provisions relating to the GDPR’s principles, its legal bases, the rules on delivering online services to children, transparency obligations, and the concept of “data protection by design”. 

The Garante’s order against OpenAI required the “temporary limitation of the processing of personal data of data subjects established in the Italian territory”. The company had 20 days to explain how it would address the compliance issues alleged by the regulator. 

In response, OpenAI “ceased offering ChatGPT in Italy”. 

Not a Block or a Ban 

Italy’s action against OpenAI provoked headlines such as “Italy blocks ChatGPT over privacy concerns” and “ChatGPT banned in Italy”.  

But rather than having been “blocked” or “banned” by Italy, OpenAI chose to restrict Italian users’ access as a means to comply with the regulator’s order.  

In an interview following OpenAI’s decision, a Garante representative said the company could, in theory, have continued offering ChatGPT—if it could do so without processing any personal data about people in Italy. 

On a technical level, it would be impossible to offer ChatGPT in Italy without processing personal data about Italians. In fact, it’s unclear how OpenAI could have limited all processing of such data—regardless of whether the company blocked ChatGPT. 

“Processing” is defined broadly in the GDPR, covering “any operation” performed on personal data.

As such, whatever OpenAI did in response to the Garante would have constituted “processing”—including deleting personal data in its training set, continuing to store that data, or providing refunds to Italian customers. 

And despite the geo-restriction of ChatGPT, the chatbot would continue to generate inaccurate personal data about people in Italy (which was one of the Garante’s key concerns). There is no clear solution to the “accuracy” problem short of closing down the app altogether. 

Further Investigations 

On 12 April, a week before OpenAI’s original deadline, the Garante announced that OpenAI had a further 18 days to bring its operations into compliance with the GDPR. 

By the end of April, OpenAI must: 

- publish an information notice explaining how it processes personal data to train its algorithms and operate ChatGPT;
- identify a valid legal basis (consent or legitimate interests) for training its models on personal data;
- provide tools allowing users and non-users to object to the processing of their personal data and to have inaccurate personal data corrected or erased; and
- add an age gate to keep out users under 13 (and under-18s without parental consent).

The following day, at the request of the Spanish regulator, the European Data Protection Board (EDPB) announced a new “dedicated task force” to “foster cooperation” and “exchange information on possible enforcement actions” against OpenAI. 

The next OpenAI enforcement action could come from France, where the country’s regulator, the CNIL, is reportedly investigating complaints about the company.

The End of the Beginning 

Even if OpenAI manages to satisfy Italy’s demands, the company’s compliance issues are unlikely to go away. 

Data protection experts have been highlighting conflicts between the GDPR’s requirements and the large-scale processing of data that powers large language models like GPT. 

In a 2018 paper titled “Algorithms that remember”, academics Michael Veale, Lilian Edwards, and Reuben Binns argued that AI models themselves—not only the datasets used to train the models—constitute “personal data” under the GDPR. 

This would mean that AI models are personal data “all the way down”—they are trained on personal data, process personal data as inputs, produce personal data as outputs, and, according to the above interpretation, are themselves personal data. 

A recent Oxford University paper argues that European regulators are “ill-prepared for the emergence of this new generation of AI models”—and will remain so even after the passing of the EU’s upcoming AI Act. 

The GDPR’s principles of fairness, transparency, data minimisation, and accuracy, together with its rules on automated decision-making, all apply to the training and operation of AI models.

But full GDPR compliance could seriously undermine OpenAI’s data-hungry operations, and it might be easier for the company to leave Europe altogether. 
