Meta’s New Tool Quietly Scrapes Online Data to Train Its AIs
Usually it’s big AI names like OpenAI or Anthropic that are in the news for questionably accessing other people’s data to train their AI systems. Now, Facebook’s parent company takes a turn in the spotlight.
Meta’s been using a new “crawler” robot to roam all over the internet and gather as much data as it can to use to train its AI systems. While many websites are taking action to effectively block similar data-gathering bots from other AI companies and AI data gathering is in the spotlight for all the wrong reasons, it appears very few companies are blocking Meta’s newest robot from harvesting their content.
The new crawler is called Meta External Agent, Fortune reports. The news outlet spoke to Dark Visitors, a company that sells tools for site owners to block crawler bots, which explained that the new Meta bot is similar to OpenAI’s better-known GPTBot. Data reportedly show the bot launched just last month, and though Meta acknowledged its existence when questioned by reporters, it hasn’t publicized the new crawler.
Most critically, data from Dark Visitors suggest that though around a quarter of the world’s best-known websites have chosen to use existing tools to prevent GPTBot from accessing their data, only 2 percent have taken similar measures against Meta’s new one.
Why should you or your company care about this?
The sudden rise of AI data-scraping robots is happening because AI systems like OpenAI’s ChatGPT or Meta’s Llama need huge amounts of data to train them. They need so much data that many AI companies have been accused of using questionable tricks to glean it from wherever they can, leading to accusations of copyright infringement. In some cases sites like Reddit have blocked access to their content archives unless search engine and AI companies sign a licensing deal, similar to the one just announced by OpenAI and Condé Nast.
Last month Anthropic, which created the Claude AI system, was accused of “hammering” popular websites like iFixit and Freelancer.com with repeated AI bot attempts to gain access. In some cases these request storms affected the websites’ functions, and Anthropic was even accused of ignoring website controls that should automatically bar web crawlers from accessing any content. Microsoft’s AI chief has recently talked about the issue of AI crawlers, suggesting that any data that these systems can get hold of is “fair game,” and he implied that banning AI crawlers from access was a legal “gray area.”
If your company shares some of its intellectual property on its website, the implication is that your IT team must keep abreast of the oncoming swarm of AI crawler robots. To protect your material from turning up in a chatbot, you need to take action. This is not a “set it up once and forget about it” problem. Like many cybersecurity issues, it warrants almost constant attention to detail, which can be especially tricky for smaller firms with tiny, or even outsourced, IT teams.
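As a rough illustration of what that ongoing checking can look like, here is a minimal Python sketch that reads a site’s robots.txt rules and reports which AI crawlers they turn away. It assumes the blocking is done through robots.txt, the standard file sites use to tell crawlers what they may fetch, and it assumes the user-agent tokens “GPTBot” and “meta-externalagent”; the Meta token and the sample rules are illustrative assumptions, so check each company’s current crawler documentation before relying on them.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens for the crawlers discussed above.
# Verify these against each vendor's published crawler documentation.
AI_CRAWLERS = ["GPTBot", "meta-externalagent"]


def blocked_crawlers(robots_txt, crawlers=AI_CRAWLERS, path="/"):
    """Return {crawler: True/False}, where True means the rules block it at `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {name: not parser.can_fetch(name, path) for name in crawlers}


# Hypothetical robots.txt that blocks GPTBot but has no rule for Meta's crawler.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

result = blocked_crawlers(SAMPLE_ROBOTS_TXT)
for name, is_blocked in result.items():
    print(f"{name}: {'blocked' if is_blocked else 'allowed'}")
```

In this sample, GPTBot is blocked while Meta’s crawler is still allowed, which mirrors the gap Dark Visitors describes. Adding a matching “User-agent” and “Disallow: /” pair for the Meta token would close it, though robots.txt only works when a crawler chooses to honor it.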
It’s not just AI that wants your data
Meanwhile, in yet more proof that Big Tech is intent on gathering your data wherever it can, a U.S. appeals court ruled that Google must face a lawsuit from users of its Chrome browser. Google was accused of using Chrome to collect “personal” information on users without permission, even after users chose to not “synchronize” their browsers with Google accounts in order to keep their data out of Google’s hands. Since Chrome commands nearly two-thirds of the global browser market by some estimates, this implies that Chrome was also collecting data on the millions of companies, small and large, that rely on it to carry out daily business operations.
Google issued a statement disagreeing with the ruling, Reuters reports. But the case should ring an alarm bell for your company if it’s been a while since you’ve retrained your staff on protecting as much company data as they can when they’re working online.