Can AI Training and Global Data Laws Coexist? Experts Debate at #PrivacyNama

Privacy experts discuss the lawfulness of data scraping, the need to revise data regulation and the definition of personal data, and the ambiguities around mixed data sets.


“The dirty little secret of generative AI is that all the data scraping that happens to train the models is in total disrespect of all the 160+ laws that exist and that protect data protection in the world. Because data scraping is at its core incompatible with lawfulness. There is no consent in scraping all personal data, all contents of any publicly available site, because even if it is public, it doesn’t mean that you have given a consent to use it, to collect and use it as you want.

That is at odds with purpose limitation and it is at odds with lawfulness,” said Professor Luca Belli of the Fundação Getulio Vargas (FGV) Law School during MediaNama’s flagship event PrivacyNama on October 3 and 4, 2024. Throughout the event, speakers talked about the tussle between AI training and privacy protection. Belli specifically talked about the inclusion of personal data within publicly available data sets.



Regarding India and its Digital Personal Data Protection Act, 2023, Belli said that there should be a consent mechanism for publicly available data even if India’s data protection law allows for the scraping of such data. He pushed for the Data Protection Board to clarify the data collection and processing structure in the case of publicly available data. “Let’s say that I’m participating in this conference in this moment.

And so my bio is on the website of MediaNama. I’ve not necessarily made it explicitly public and given consent for it to be scraped and utilized to train models, right? So, I think that there is a very important role to be played by [the] regulator to specify, to clarify what elements of the law [apply]. To my knowledge, India is the only [country] that exempts the application of the law when data, personal data are public.

But again, it [the provision in the Indian law] doesn’t mean that if you scrape all the internet and use it as you please, that is legal or you have a legitimate interest for it. So, again, very important role for regulators to be played here,” said Belli.

AI models should be trained on confidentiality

Udbhav Tiwari, Head of Global Product Policy at the Mozilla Foundation, argued that AI models should be trained to identify pieces of personal data and not divulge them.

One can easily find credit card numbers, phone numbers, email IDs, and sometimes even information from digital ID leaks within an AI’s training dataset, which is often scraped from the public web, among other sources. “Unless you tell them [AI models] to not divulge or share this information, it’s quite possible they might start bringing it out. So, I definitely think it’s something that needs to happen right from the beginning, and that it’s not a binary between a harms approach and a principle-based approach,” said Tiwari.

Should data sets be checked by regulators prior to training?

Tiwari called for conversations around the control of datasets used for AI training. He questioned whether companies working on AI models should be obligated to submit their datasets to regulators prior to training. Even if a company resolves to remove particular data, Tiwari argued, doing so is not feasible for openly available model weights that have already been trained and now sit on consumer computers.

For this reason, he also suggested that a content moderation layer, which is usually used for harmful or socially unacceptable content, could be used to make sure that certain kinds of data don’t turn up. “I do think that this is a place where the law will probably have to evolve a little bit, not to say that you cannot exercise these rights, but by being a lot more specific about how these rights apply with regard to AI systems. As far as I understand, it’s really, really [hard], if not borderline impossible, to remove individual pieces of information from a data set before it starts showing up in the world.

To the point of being prohibitively expensive, both technically as well as financially, to market providers, and definitely something that people outside of the biggest model [training] companies that have the money to do that cannot do,” said Tiwari.

Companies refuse to answer data subject enquiries

Tiwari talked about a need for further enforcement against data scraping when asked about mixed data sets and the protection of personal data within such sets. He described the current situation as a “free for all” where entities have “gone around and done what they have.” He gave the example of the Data Subject Access Request (DSAR), which allows a data subject to access their personal data held by an organisation. “Under the GDPR, people have been trying to use them a lot for the last year and a half to try different ways to figure out whether that information even exists in these systems or not. And almost uniformly the standard response that they have gotten is either the trade secret and intellectual property defense saying that we can’t, or that in order to answer your request, we will have to scan all of the content that we already have in order to determine whether your content exists there.

And therefore, like, it’s technically infeasible and we cannot do that. Even in cases of very targeted pieces of information where people have evidence that the model is spewing output that is their information, they have not gotten responses and have now filed complaints with privacy regulators asking companies to figure it out,” said Tiwari, adding that there have been instances where companies have used internal data sets to train models and still refused to comply with DSAR requests.

Data protection is not a top priority for SMEs

Derek Ho, Senior Vice President and Assistant General Counsel, Privacy & Data Protection at Mastercard, highlighted disparities in how different organizations handle data protection and AI governance.

Ho emphasized that organizational size and resources heavily influence privacy implementation capabilities. According to Ho, small and medium enterprises focused on basic operational survival often lack dedicated resources for data protection, with officers frequently juggling multiple roles including technical, legal, and administrative duties. In contrast, Ho claimed that larger, better-resourced organizations have generally implemented privacy-by-design frameworks, providing a foundation for product development involving data and, increasingly, AI and machine learning.

Ho concluded that a one-size-fits-all approach to data protection and AI governance is impractical given the diverse organizational landscape.

Sovereign authority to anonymise data sets may not work

When asked whether the creation of a sovereign data protection authority that ensures anonymized data would resolve the provenance issue, Tiwari said content provenance will remain a problem. Content provenance is an initiative that seeks to ensure that content comes from a trustworthy and reliable source.

Digital rights management (DRM) plays a significant role in such efforts to enforce those properties upon a particular piece of data. However, to do so, the DRM would have to control, measure and mitigate data either from the point at which it is generated or when the data is first stored. He also warned against the idea of allowing companies like Google, Apple or Microsoft to deploy forms of content provenance on user devices that are outside user control.

While this would help tag content, personal or otherwise, as it makes its way through the system, it would end up creating an Aadhaar-like identifier for each piece of data. “It’s sort of like having a universal Aadhaar for every piece of data that you generate, which is like a single identifier that is supposed to be able to track throughout [its virtual journey] knowing everywhere it’s gone and everything that it’s done. That’s a pretty bad idea.

So, I think that provenance in knowing where information comes from and, like, for data management generally, that’s fine... but content provenance in the way that [goes] from the beginning of the place where the content is created to when it becomes the output in an AI system... I think the consequences of that approach are too severe for us to seriously consider them,” said Tiwari.

However, Merl Chandana, LIRNEasia Team Lead for Data, Algorithms and Policy, said that some countries are already working on the responsible curation of data sets for language models. Some of these practices involve working with non-personal data sets to see how the AI models perform, and considering the amount of information required when using personal data sets.

Personal data definition must expand for better enforcement

Speakers talked about navigating compliance and enforcement when dealing with data from a mixed data set. According to Mugjan Caba, Truecaller Senior Compliance Officer, compliance for mixed data sets starts with users’ understanding of their data footprint, which leads to the creation and management of data registers. While some entities bound by the GDPR do carry out such processes to comply with the data protection regulation, the method is yet to expand to the “complex nature of AI.”

“Already today, it’s sometimes not crystal clear whether a data piece is personal data or not... So even today we might find ourselves making hours of analysis just to understand whether a piece of data is personal data. Now add outputs from AI to that... we should expand the definition of data there and [decide] in which processing activities we utilize personal data. But also [consider] data in a gray zone maybe, where you need to discuss or properly analyze and document whether it can be classified as personal data or not,” said Caba.
