Date: 2025-01-29

OpenAI claims to have found evidence that Chinese AI startup DeepSeek secretly used data produced by OpenAI’s technology to improve its own AI models, according to the Financial Times. If true, DeepSeek would be in violation of OpenAI’s terms of service.

DeepSeek emerged from obscurity earlier this month with the release of R1, a reasoning model capable of thinking through a request in multiple steps. R1 is built on top of DeepSeek’s flagship artificial intelligence model, V3, which the company claims to have developed for just over $5.5 million, a fraction of the tens to hundreds of millions of dollars spent by titans like OpenAI, Meta, and Microsoft to obtain similar results. To some, the tech seemed too good to be true, and rumors began to circulate that DeepSeek may have been hiding a key aspect of its models’ development.

In the fall of 2024, according to reporting from Bloomberg, security researchers at Microsoft, one of OpenAI’s primary partners, noticed a group believed to be connected to DeepSeek transferring large amounts of data using OpenAI’s API, raising concerns that DeepSeek was engaging in a process called “distillation.”


So what’s distillation exactly? Speaking to Fox News, White House AI and crypto czar David Sacks said he’s seen “substantial evidence” that DeepSeek distilled knowledge from OpenAI’s models. Sacks described distillation as a process in which one “student” model learns from a “teacher” or “parent” model. The student model, the one being trained, asks the teacher model millions of questions, and the data obtained from the answers allows the student to mimic the teacher’s reasoning process. Essentially, Sacks said, it’s a method to “suck the knowledge out of the parent model.”

OpenAI told the Financial Times that it had found evidence of distillation that seemed to be linked to DeepSeek, but declined to comment further on the details of that evidence.

Notably, distillation isn’t illegal, and providers of open-source models, such as Meta with Llama, have encouraged developers to distill their models in order to create smaller, better products. But OpenAI’s technology is closed-source, and its terms of service clearly state that developers are not allowed to “automatically or programmatically extract data or output,” or “use output to develop models that compete with OpenAI.”
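For readers who want to see the student-teacher idea in miniature, here is a toy Python sketch. It is an illustration only, under loud assumptions: the `teacher` function, the `distill` helper, and every number in it are hypothetical, the "models" are a single line-fitting rule rather than a language model, and nothing here reflects what DeepSeek or OpenAI actually did. The point is only that the student never sees the teacher’s internals; it sees answers to its own questions and adjusts itself to match them.

```python
import random


def teacher(x: float) -> float:
    """Hypothetical stand-in for a closed 'teacher' model.

    The student can call this function but cannot look inside it.
    """
    return 3.0 * x + 1.0


def distill(num_queries: int = 5000, lr: float = 0.05, seed: int = 0) -> tuple:
    """Train a tiny 'student' (w * x + b) purely from the teacher's answers."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # the student starts out knowing nothing
    for _ in range(num_queries):
        x = rng.uniform(-1.0, 1.0)      # the student asks a question
        answer = teacher(x)             # the teacher responds
        error = (w * x + b) - answer    # how far off the student's own guess is
        # Nudge the student toward the teacher's answer (one gradient step
        # on the squared error).
        w -= lr * error * x
        b -= lr * error
    return w, b


if __name__ == "__main__":
    w, b = distill()
    print(f"student learned w={w:.3f}, b={b:.3f}")
```

After enough queries, the student’s two numbers end up close to the teacher’s hidden rule, even though that rule was never disclosed. Distilling a large language model is far more involved, but the pattern of asking, collecting answers, and training on them is the same one Sacks describes.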

For some, the situation is ironic. OpenAI built powerful models by gathering data across the internet, and multiple lawsuits, such as ones filed by The New York Times and a collection of authors, allege that data included millions of copyright-protected articles and stories that were obtained without asking for permission. Now, OpenAI is the one reportedly accusing an AI company of stealing its work without permission.

In a statement, an OpenAI spokesperson told Inc. that “we know that groups in the PRC [People’s Republic of China] are actively working to use methods, including what’s known as distillation, to try to replicate advanced U.S. AI models. We are aware of and reviewing indications that DeepSeek may have inappropriately distilled our models, and will share information as we know more. We take aggressive, proactive countermeasures to protect our technology and will continue working closely with the U.S. government to protect the most capable models being built here.”

The spokesperson added that OpenAI isn’t entirely against distilling. The company actually launched an API product in October that streamlines the process, but it can only be used to improve small models built for specific tasks, rather than to develop models designed to compete with OpenAI.