Date: 2024-03-01

The sources of data used to train AI systems become pivotal in shaping further development and the modern legal landscape.

Google will start training its Gemini chatbot using the expertise of software developers who share tips and tricks on the online community Stack Overflow in a deal announced Thursday.

The deal with Google, the details of which haven’t been revealed, stands out in an era where content creators of all types are wary of giving AI systems access to their data. In stark contrast, a new round of lawsuits filed by news organizations alleges their content was illegally “scraped” to train AI models.

Coding chatbot gets data source

Stack Overflow’s website bills it as the “largest, most trusted online community for developers to learn, share their programming knowledge,” and its question-and-answer content helps developers deal with coding problems. The company decided last year to take a progressive stance on giving AI technology companies access to its vast database, saying that it would charge for access to its data, tech publication Wired reports. Now Google is its first customer, in a deal that appears to have synergies for both parties.

Once the agreement goes into effect, when people ask Google’s Gemini AI coding questions that the AI can’t answer, Stack Overflow’s crowd-sourced lessons may be able to fill the gap. Those lessons draw on a large user base, which a Quora question put at more than 20 million in 2023. Gemini will summarize the responses, but it will make clear that the information is from Stack Overflow, showing the company’s logo, links, and the Stack user who contributed the information. Stack’s brand gets a small boost every time this happens, Google pleases more Gemini users, who’ll stay inside its ecosystem for longer, and Stack earns revenue from the deal. Stack Overflow’s CEO Prashanth Chandrasekar explained that there are no real limits on what Google can do with its data, as long as the results are accurate and properly attribute information back to its source. Google can use the data to train its AI systems, which means that as they learn more from Stack content, they may get better at solving tricky coding problems.

The deal echoes a recent move by Microsoft, in which the tech giant integrated its flagship chatbot system Copilot into GitHub, its own social media-like coding database.

News content from AIs may be legally fraught

That data-sharing relationship stands in sharp contrast to a move by news outlets the Intercept, Raw Story, and AlterNet, which all filed lawsuits in the Southern District of New York alleging that leading AI service OpenAI has infringed their copyright. The Intercept is also suing Microsoft, making similar allegations about Copilot, which leverages OpenAI’s GPT-4 model as a core part of its systems.

The lawsuits fall under the Digital Millennium Copyright Act, a 1998 law designed to protect against technology being used for unfair sharing of copyrighted works. It’s most often invoked when a website like YouTube takes a user’s video down for using an unlicensed piece of music or other copyrighted content. Some details in the suits allege that OpenAI deliberately “trained ChatGPT not to acknowledge or respect copyright.”

Copyright laws ripe for change

The news outlets’ new suits echo several earlier copyright cases over AI training data, including a prominent case filed by the New York Times against OpenAI and Microsoft. Some critics argue that the current AI era requires lawmakers to revisit older copyright laws that may be either defunct or misapplied. The Copyright Office, a part of the Library of Congress, is busy planning a review of copyright laws prompted by the rise of AI. The new copyright cases are important because they could be pivotal for how AI systems are trained and developed, and may have a knock-on effect on third-party apps that leverage AI systems in clever ways.

One such use can be found at buzzy AI search firm Perplexity, which is reportedly raising an undisclosed amount again just weeks after closing a successful $73.6 million round to boost team-building and product expansion. Perplexity builds what Business Insider calls a “search engine on steroids,” using a combination of so-called large language AI models from providers like OpenAI, the very company that news outlets are now suing for copyright infringement.