In this second installment on AI, we address the key issues around the data being used to create and train an AI model, specifically what needs to be considered in obtaining and using it.

There are two main ways that AI developers obtain training data for their models: in-licensing it, or gathering it on their own – whether directly from customers and contacts or by scraping data from the web.

When licensing training data, AI developers are typically looking for volume, and in many cases they want a data set relevant to their particular goals. Data brokers have stepped up to fill this need. Once a developer has found the broker, or other source, of the data set it wants, there are some important topics the developer must focus on in the licensing agreement for the data:

  • the broker's rights in the data subject to the license and whether it was acquired lawfully;
  • if the data can be used/disclosed/sublicensed as the developer believes necessary for their goals;
  • how the broker ensures the data is accurate; and
  • how the data is kept up to date/current.

Whether we are representing the developer or the broker, these are the topics we see addressed time and time again. The length of the negotiation will depend largely on the data hygiene practiced by the broker and the sensitivity of the developer (which may in large part depend on the sophistication of its customers). The more a broker acts like a black hole, pulling in as much data as possible without regard for best practices, and the more sensitive the developer (or its industry/customers), the longer the negotiation will take, and the more each party will dig in its heels.

On the other hand, we have the do-it-yourself developers who will go out and get the data themselves, either through their own business contacts (though this is typically inefficient and does not produce enough data to meaningfully train a model) or by scraping data from the internet.

Of course, scraping data can have its own problems, particularly when the scraper sweeps up data or material that is protected under one or more intellectual property laws and is owned by a person or entity motivated to defend its rights.

A recent example came in December 2023, when the NY Times filed suit against OpenAI and Microsoft for infringing its copyright in content posted on the NY Times website – by scraping that content, using it to train their respective models, and then exploiting it in their products. It should be pointed out that Google (wisely) entered into a $100 million, three-year licensing deal with the NY Times in May 2023 (and deals worth roughly $900 million with other major news outlets globally) to avoid the distraction of lawsuits and settlement discussions.

The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.