When one reads about AI tools such as ChatGPT and Bard, the focus is largely on the outputs. We marvel at how rich the answers to questions can be, and then pour scorn on them when it transpires that the answers are inaccurate (or complete fiction!). There has been far less discussion of the data being used as the input.

This new generation of tools relies on very large datasets from which to learn, and the focus is now turning to where those datasets come from. The creative industries are already making their objections heard: Getty Images has taken legal action over Stability AI's use of its photo database for the image creation tool "Stable Diffusion", and the music industry is raising objections too (with the notable exception of "Grimes"). But what about documents and files that are not normally considered valuable in the copyright sense? Boring everyday business documents?

We recently started investigating the risks to our clients of using common free tools such as Google Translate, Dropbox and WeTransfer. While we were in the middle of doing this, the news that Zoom had changed its terms of business to allow it to use user data for "machine learning, artificial intelligence, training, testing, improvement of the Services, Software, or Zoom's other products, services, and software, or any combination thereof" shone a much bigger spotlight on something we now think is a serious risk issue for businesses.

The boring everyday business documents I mentioned above contain the intellectual property of the people and companies that prepared them. Nearly all business documents contain confidential information. This might be the know-how of the business, its strategies and plans, or financial performance information. And if those documents are contracts, they will also include details of counterparties and commercial arrangements.

Looking closely at the terms of business for Google Translate (for example), Google acknowledges that it needs a licence to use any content that belongs to a user. So far, so good, but the terms then go on to explain that users must license rights to Google if they are to use the tool. The licence allows Google to perform some actions that give the application its function, but it also grants rights to use the content for:

operating and improving the services, which means allowing the services to work as designed and creating new features and functionalities. This includes using automated systems and algorithms to analyze your content:

  • for spam, malware, and illegal content
  • to recognize patterns in data, such as determining when to suggest a new album in Google Photos to keep related photos together
  • to customize our services for you, such as providing recommendations and personalized search results, content, and ads (which you can change or turn off in Ads Settings)

This analysis occurs as the content is sent, received, and when it is stored.

using content you've shared publicly to promote the services. For example, to promote a Google app, we might quote a review you wrote. Or to promote Google Play, we might show a screenshot of the app you offer in the Play Store.

developing new technologies and services for Google consistent with these terms

This means that any documents processed using the Google Translate tool are permanently shared with Google for it to analyse, interpret and use to develop new functionality. For example, as Google goes on to explain in the terms, this means that it may interrogate documents for illegality and even mine them for advertising insights.

Until recently, it was hard to see what damage this could cause to the document owner, because the content of documents would be abstract and without context. Now, however, confidential business documents have the potential to become part of the pool of data for open-ended data mining that will be used to answer any question (by way of a generative AI such as Bard), and it is possible to see that this could be very damaging. It is conceivable that sophisticated researchers who ask pertinent questions of an AI tool could use it to dig into the detailed affairs of people and businesses.

Researchers have already found ways to get ChatGPT to show its hand about its source material, having it generate strange responses to questions that feature strings of text mirroring particular Reddit usernames. It isn't inconceivable that targeted enquiry about a corporate could return similarly anomalous responses that suggest the contents of otherwise confidential documents which had been passed through the AI as training data.

We intend to follow this article up with some more insights arising from this general point, but here is one example to consider.

Imagine you are a medium-sized business with operations in various parts of the world. You are preparing for the interim financial results tour that follows the official release of your interim results in a few days' time. The presentation includes some interesting analysis of the historic sales performance of different product lines. You have prepared a briefing paper for financial journalists and it would be useful to have this translated into other languages for the foreign press. You run the document through Google Translate. Shortly afterwards, the same financial journalists are doing their preparations and ask Bard about your historic sales performance. It is perfectly possible that Bard uses the briefing paper to respond to a journalist's question and presents not only genuinely historic figures but also the very recent and as yet unpublished sales results. The rest is a moral dilemma for the journalist...

The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.