How Apple Plans to Improve Its Synthetic AI Training Data While Safeguarding User Privacy


Apple is working to refine its AI tools by comparing synthetic training data to real-world data, all without compromising user privacy.

Apple intends to improve the functioning of its artificial intelligence (AI) models by comparing the synthetic data it trains its models on to real-life data samples from its users, the company announced in a recent blog post. Synthetic data is data created to mimic the format and important properties of user data, but it does not contain any actual user data.

This comes after Apple introduced Apple Intelligence, an AI-based feature that generates summaries of messages, emails, audio, and webpages, as part of the iOS 18.1.1 update. The feature made a number of mistakes in summarisation; most notably, it produced a fake news headline attributed to the BBC, following which the company suspended the service.



Using the closest synthetic data samples to train models:

Apple explains that, to improve the quality of the synthetic data its models work on, it generates a set of synthetic emails on topics that users commonly mention in their emails. It then derives a representation (or embedding) of each of these emails, which captures key dimensions of the message, such as its language, topic, and length. Apple sends these embeddings to devices that have opted in to device analytics.
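To make this first step concrete, here is a minimal Python sketch. It is not Apple's implementation: embed() below is a stand-in hashed bag-of-words embedding, the example synthetic emails are invented, and the embedding dimension is arbitrary. It only illustrates the shape of what gets sent to opted-in devices, namely embeddings of synthetic messages rather than any real user content.

```python
# Toy sketch (not Apple's actual pipeline): generate synthetic emails on common
# topics and derive a fixed-length embedding for each one.
import hashlib
import numpy as np

EMBED_DIM = 64  # arbitrary dimension for this illustration

def embed(text: str) -> np.ndarray:
    """Map a message to a unit-length vector via a hashed bag-of-words.
    A real system would use a learned embedding model instead."""
    vec = np.zeros(EMBED_DIM)
    for token in text.lower().split():
        idx = int(hashlib.sha256(token.encode()).hexdigest(), 16) % EMBED_DIM
        vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Server side: synthetic emails covering topics users commonly write about.
synthetic_emails = [
    "Would you like to play tennis tomorrow at 11:30?",
    "Reminder: the March invoice is attached.",
    "Can we move our dinner reservation to 8pm?",
]
synthetic_embeddings = np.stack([embed(e) for e in synthetic_emails])
# Only these embeddings are distributed to opted-in devices for comparison.
```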

Users who opt in to share device analytics allow Apple to access details about hardware and operating system specifications, performance statistics, and data about how they use their devices and apps. On these opted-in devices, the synthetic embeddings are compared against embeddings derived locally from a small sample of the user's recent emails, and each device selects which synthetic embeddings are closest to those real email samples.
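Continuing the same toy assumptions, the on-device side might look roughly like the sketch below: the device embeds a small sample of the user's recent emails locally and picks the synthetic embedding nearest to them. The function name and the use of cosine similarity are illustrative choices, not details confirmed by Apple.

```python
# Toy on-device selection (continuing the hashed-embedding assumptions above).
# The user's emails are embedded locally; only the index of the closest
# synthetic embedding is later reported, never the emails or their embeddings.
import numpy as np

def closest_synthetic_index(user_embeddings: np.ndarray,
                            synthetic_embeddings: np.ndarray) -> int:
    """Index of the synthetic embedding most similar, on average, to the
    device's recent user-email embeddings (cosine similarity on unit vectors)."""
    sims = user_embeddings @ synthetic_embeddings.T  # shape: (n_user, n_synth)
    return int(np.argmax(sims.mean(axis=0)))

# Example with random unit vectors standing in for real embeddings.
rng = np.random.default_rng(42)
unit = lambda v: v / np.linalg.norm(v)
user_embs = np.stack([unit(rng.normal(size=64)) for _ in range(5)])
synth_embs = np.stack([unit(rng.normal(size=64)) for _ in range(3)])
local_choice = closest_synthetic_index(user_embs, synth_embs)  # stays on device until the DP step
```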

Ensuring user privacy:

Apple explains that it uses 'differential privacy' to learn which synthetic samples are most often selected as closest to real user samples across many devices, without learning which synthetic embedding was selected on any given device. This means the company cannot tell which synthetic sample corresponds to a specific user's actual emails. Apple says it uses the closest-matching synthetic samples as training or testing data, and can run further curation steps to refine the data.
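To illustrate how such aggregation can be made differentially private, the sketch below uses textbook randomized response: each device reports its chosen index truthfully only with a calibrated probability, and the server debiases the noisy counts to estimate which synthetic samples were selected most often. This is a generic local differential privacy mechanism shown for clarity; Apple's post does not spell out the exact mechanism it deploys.

```python
# Toy local differential privacy via randomized response: the server learns
# aggregate popularity of synthetic samples without learning any single
# device's true selection.
import numpy as np

rng = np.random.default_rng(0)

def randomize(true_index: int, n_options: int, epsilon: float) -> int:
    """Report the true choice with probability e^eps / (e^eps + k - 1),
    otherwise report one of the other options uniformly at random."""
    p_truth = np.exp(epsilon) / (np.exp(epsilon) + n_options - 1)
    if rng.random() < p_truth:
        return true_index
    others = [i for i in range(n_options) if i != true_index]
    return int(rng.choice(others))

def aggregate(reports: list[int], n_options: int, epsilon: float) -> np.ndarray:
    """Debias the noisy reports to estimate how often each synthetic sample
    was genuinely the closest match across devices."""
    p_truth = np.exp(epsilon) / (np.exp(epsilon) + n_options - 1)
    p_lie = (1 - p_truth) / (n_options - 1)
    counts = np.bincount(reports, minlength=n_options).astype(float)
    n = len(reports)
    return (counts - n * p_lie) / (p_truth - p_lie)

# Example: 1,000 devices whose true choices are skewed towards sample 2.
true_choices = rng.choice(3, size=1000, p=[0.2, 0.2, 0.6])
noisy_reports = [randomize(int(c), n_options=3, epsilon=2.0) for c in true_choices]
print(aggregate(noisy_reports, n_options=3, epsilon=2.0))  # roughly [200, 200, 600]
```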

The company specifies that the real user email samples it compares synthetic data against never leave the user's device, and that Apple never gets access to this information.

Why it matters:

As companies race to build the biggest and best-performing AI models, access to readily available training data is becoming a challenge. In November last year, Reuters reported that AI companies are hitting a scaling wall.

This meant that making models bigger and feeding them more data was no longer providing proportional capability improvements, with access to data reportedly being one of developers' key challenges.

While using synthetic data, as Apple is doing, makes sense in this context, some, like OpenAI CEO Sam Altman, find heavy reliance on synthetic data strange. "It's really strange if the best way to train a model was to just generate a quadrillion tokens of synthetic data and feed that back in. You'd say that somehow that seems inefficient, and there ought to be something where you can just learn more from the data as you're training," Altman said in a live interview at the AI for Good Global Summit in June 2024. At the same time, he admitted that OpenAI was experimenting with generating and training on synthetic data. Discussing the quality of output from a model trained on synthetic data, Altman said that what mattered was that the training data was of high quality.

As such, if Apple can improve the quality of its synthetic training data through this comparative exercise, it should be able to improve the quality of Apple Intelligence outputs.

Also read:

CCPA Seeks Response from Apple after iOS Performance Issues. Why Are No Other Countries Probing it?
Apple Integrates ChatGPT into iOS, iPadOS, and macOS
Sam Altman says that OpenAI doesn't fully understand what is going on inside its AI models
