Researchers jimmy OpenAI’s and Google’s closed models

Boffins have managed to pry open closed AI services from OpenAI and Google with an attack that recovers an otherwise hidden portion of transformer models.

The attack partially illuminates a particular type of so-called “black box” model, revealing the embedding projection layer of a transformer model through API queries. The cost to do so ranges from a few dollars to several thousand, depending upon the size of the model being attacked and the number of queries.

No fewer than 13 computer scientists from Google DeepMind, ETH Zurich, University of Washington, OpenAI, and McGill University have penned a paper describing the attack, which builds upon a model extraction attack technique proposed in 2016.

“For under $20 USD, our attack extracts the entire projection matrix of OpenAI’s ada and babbage language models,” the researchers state in their paper. “We thereby confirm, for the first time, that these black-box models have a hidden dimension of 1024 and 2048, respectively. We also recover the exact hidden dimension size of the gpt-3.5-turbo model, and estimate it would cost under $2,000 in queries to recover the entire projection matrix.”

The researchers have disclosed their findings to OpenAI and Google, both of which are said to have implemented defenses to mitigate the attack. They chose not to publish the size of two OpenAI gpt-3.5-turbo models, which are still in use. The ada and babbage models are both deprecated, so disclosing their respective sizes was deemed harmless.

While the attack does not completely expose a model, the researchers say that it can reveal the model’s final weight matrix – or its width, which is often related to the parameter count – and provides information about the model’s capabilities that could inform further probing. They explain that being able to obtain any parameters from a production model is surprising and undesirable, because the attack technique may be extensible to recover even more information.
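To see why a model's width can leak through its outputs at all, consider a toy simulation. This is a minimal sketch and not the researchers' code. It assumes the attacker can obtain full logit vectors, which production APIs generally do not hand over directly, and it stands in for the remote model with a random matrix `W`. The names `simulated_api_logits` and `estimate_hidden_dim`, and the `rel_tol` threshold, are hypothetical. The idea is that every logit vector is the final projection matrix applied to a hidden state, so a stack of such vectors cannot have rank greater than the hidden dimension.

```python
import numpy as np


def simulated_api_logits(W, rng):
    """Hypothetical stand-in for one API query.

    A real attack would send a prompt to a remote service; here a random
    hidden state is pushed through the final projection matrix W.
    """
    hidden_state = rng.standard_normal(W.shape[1])
    return W @ hidden_state


def estimate_hidden_dim(logit_matrix, rel_tol=1e-6):
    """Estimate the hidden dimension as the numerical rank of the logits.

    rel_tol is an assumed threshold: singular values smaller than
    rel_tol times the largest one are treated as numerical noise.
    """
    singular_values = np.linalg.svd(logit_matrix, compute_uv=False)
    return int(np.sum(singular_values > rel_tol * singular_values[0]))


rng = np.random.default_rng(0)
vocab_size, true_hidden_dim, n_queries = 512, 64, 200

# The "secret" final projection layer of the toy model.
W = rng.standard_normal((vocab_size, true_hidden_dim))

# One column of logits per query, with more queries than hidden dimensions.
Q = np.stack(
    [simulated_api_logits(W, rng) for _ in range(n_queries)], axis=1
)

estimated = estimate_hidden_dim(Q)
print(f"estimated hidden dimension: {estimated}")
```

The number of queries has to exceed the hidden dimension for the rank to be visible, which is one reason the cost climbs with model size.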

“If you have the weights, then you just have the full model,” explained Edouard Harris, CTO at Gladstone AI, in an email to The Register. “What Google [et al.] did was reconstruct some parameters of the full model by querying it, like a user would. They were showing that you can reconstruct important aspects of the model without having access to the weights at all.”

Access to enough information about a proprietary model might allow someone to replicate it – a scenario that Gladstone AI considered in a report commissioned by the US Department of State titled “Defense in Depth: An Action Plan to Increase the Safety and Security of Advanced AI”.

The report, released yesterday, provides analysis and recommendations for how the government should harness AI and guard against the ways in which it poses a potential threat to national security.

One of the recommendations of the report is “that the US government urgently explore approaches to restrict the open-access release or sale of advanced AI models above key thresholds of capability or total training compute.” That includes “[enacting] adequate security measures to protect critical IP including model weights.”

Asked about the Gladstone report’s recommendations in light of Google’s findings, Harris replied, “Basically, in order to execute attacks like these, you need – at least for now – to execute queries in patterns that may be detectable by the company that’s serving the model, which is OpenAI in the case of GPT-4. We recommend tracking high level usage patterns, which should be done in a privacy-preserving way, in order to identify attempts to reconstruct model parameters using these approaches.”

“Of course this kind of first-pass defense might become impractical as well, and we may need to develop more sophisticated countermeasures (e.g., slightly randomizing which models serve which responses at any given time, or other approaches). We don’t get into that level of detail in the plan itself however.” ®