Gen AI will fail until we hold it accountable for data integrity breaches
The release of ChatGPT in November 2022 unleashed a renewed wave of interest, innovation and hype around the venerable topic of Artificial Intelligence (AI), specifically a buzz about interaction with generative AI. Such a sudden rush into the technology risks catastrophic failures and serious breaches unless its glaring data-integrity weaknesses are addressed. This post explains where such problems come from and offers some potential solutions meant to be deployed sooner rather than later: an ounce of prevention is worth a pound of cure.
For decades, machines have been applied to human behavioral data. Models have already been used to analyze information for patterns and make consequential decisions in areas such as credit ratings, medical treatments, jobs and even transit safety. Unfortunately, their effects are now widely recognized to carry harms such as hidden agendas, systemic biases and the acceleration of inaccurate decisions.
Instead of learning from such well-documented mistakes of the recent past, we are now witnessing an even more rapid acceleration of compute power applied to an explosion of personal data. Vendors developing Large Language Models (LLMs) and other forms of Generative AI promise to take the rising masses of data spreading around the Web (and even some private silos) and rapidly process it all into clear and helpful answers through an interactive conversation.
In the consumer world, with a nod to Microsoft’s “TFC” branded “Clippy”, multiple companies have rushed “personal AI assistants” to the public. (See Inflection, Leon, Lindy and Parrot for example.) In parallel, enterprises in all industries are considering how Generative AI models should be integrated to provide consumers higher service levels or make workforces more efficient and productive.
It’s clear that Generative AI will be fueled by the data of individual consumers and workers. New safety questions are already coming into the spotlight. What exactly does any AI agent, purportedly working for us while also working for the companies behind it, do with and to the data that we share? On what basis should we trust such a “double agent” not to misrepresent or modify our answers? As an individual user, will my chat history with ChatGPT or a similar tool become inauthentic due to faulty OpenAI infrastructure, mixed with other users’ data across a matrix of hidden companies? If I use these tools as an enterprise data professional, will my company’s valuable data be inaccurately mingled into the next LLM training run, risking a public disclosure of confidential information and driving up costs from inaccurate or stale information in future model outputs?
The Importance of Data Integrity
To answer these questions satisfactorily, Generative AI systems must ensure, and be held accountable for, data integrity. Data integrity is the assurance that data remains accurate, consistent and reliable throughout its lifecycle.
AI systems rely so heavily on accurate and representative data to learn, make predictions and generate outputs that data integrity could be called foundational. Without it, machines operate on biased models, deliver incorrect predictions and produce unreliable insights.
The effectiveness of an AI system depends largely on the integrity of any data it processes. In the context of AI, data integrity can mitigate biases, enable reliable outputs and maintain the trust of users.
The good news is that tools already exist, and more are emerging, for any AI system to safeguard its own foundation and future. We can start immediately implementing real technical measures to control the integrity of the data that the entire foundation of AI trust relies upon. This post offers W3C’s Solid, an open Web standard usable today, as a robust technical solution that expands the AI market through greater trust. Solid empowers organizations with a comprehensive, technology-based solution for safeguarding data integrity at any scale or speed.
Control Data Integrity With Solid
There are four core challenges to data integrity: quality, bias, confidentiality and governance.
Ensuring data quality is a significant, well-known challenge affecting AI performance and cost. Data often has problems that adversely impact an AI system’s ability to deliver effective results: duplication, errors, outliers and missing values are all highly likely. Proper data cleaning, preprocessing and validation techniques are essential to identify and address these issues.
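As a minimal sketch of the kind of cleaning and validation step described above, the following plain-Python pass deduplicates records, drops rows with missing required values and rejects implausible outliers. The field names and thresholds are hypothetical examples, not part of any particular system.

```python
# Hypothetical cleaning pass over a list of record dicts.
# Field names ("email", "age") and the age bounds are illustrative.

REQUIRED = ("email", "age")

def clean(records):
    seen = set()
    cleaned = []
    for rec in records:
        # Drop records missing a required value.
        if any(rec.get(f) in (None, "") for f in REQUIRED):
            continue
        # Deduplicate on a stable key.
        key = (rec["email"], rec["age"])
        if key in seen:
            continue
        seen.add(key)
        # Reject implausible outliers.
        if not (0 < rec["age"] < 130):
            continue
        cleaned.append(rec)
    return cleaned

raw = [
    {"email": "a@example.org", "age": 34},
    {"email": "a@example.org", "age": 34},   # duplicate
    {"email": "b@example.org", "age": None}, # missing value
    {"email": "c@example.org", "age": 999},  # outlier
]
print(clean(raw))  # only the first record survives
```

Each rule is cheap on its own; the point is that some such pass must run before training, or the model inherits every defect in the raw data.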
W3C’s Solid lessens the burden of data preprocessing by delivering Pods (personal data stores) directly into the hands of data owners. Direct access to their own data in Pods empowers and incentivizes users to clean and organize it. Higher-quality data has inherent value to the services they choose, significantly increasing the likelihood that correct data is exposed to applications and services.
In addition, Solid Pods use a structured format called RDF (Resource Description Framework) that simplifies preprocessing tasks. Developers and applications can rely on standard RDF tools and libraries to extract, transform and filter data as needed, making preprocessing more efficient and consistent.
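To illustrate why a structured format helps, here is a rough stdlib-only sketch (a full RDF toolkit such as rdflib would normally be used instead) that filters N-Triples, one of the standard RDF serializations, down to just the statements a preprocessing step cares about. The data and the FOAF predicate chosen are illustrative.

```python
import re

# N-Triples puts one "<subject> <predicate> <object> ." statement per line,
# which makes extract/transform/filter passes straightforward.
TRIPLE = re.compile(r'<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')

data = """\
<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/mbox> <mailto:alice@example.org> .
<http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .
"""

def triples(nt):
    for line in nt.splitlines():
        m = TRIPLE.match(line)
        if m:
            yield m.group(1), m.group(2), m.group(3)

# Keep only foaf:name statements, as a preprocessing filter might.
names = [(s, o) for s, p, o in triples(data)
         if p == "http://xmlns.com/foaf/0.1/name"]
print(names)
```

Because every statement has the same subject–predicate–object shape, the same filter works on any RDF data a Pod exposes, which is exactly the consistency advantage the paragraph above describes.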
Discriminatory predictions and other harmful outcomes are often caused by insufficient attention to bias. Biases can arise from various factors, including historical data collection practices, sampling issues or errors by human annotators. Failure to address biased results can lead to legal consequences and reputational damage. Ensuring responsible AI usage is essential to foster trust and avoid harmful consequences.
W3C’s Solid provides highly scalable methodologies to reduce the cost of achieving high-quality, diverse and representative data collection, directly addressing biases to minimize such risks during training and deployment of AI models.
Data integrity depends heavily on the ability to protect against unauthorized access. Breaches, such as data manipulation by bypassing security controls, can compromise the integrity of AI systems.
Solid's model of placing data back under the control of its owners means that security measures, such as encryption, access controls and privacy-preserving techniques, can be deployed far more appropriately to protect against unauthorized disclosure or tampering.
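One concrete tamper-protection primitive of the kind mentioned above is authenticating stored data with an HMAC, so any out-of-band modification is detectable on read. This is a generic sketch using Python's standard library, not a description of Solid's own mechanisms; the key material shown is a placeholder.

```python
import hashlib
import hmac

# Authenticate stored bytes with an HMAC so modification is detectable.
# In an owner-controlled model, the key would live with the data owner,
# not the storage provider. The key below is a hypothetical placeholder.
KEY = b"owner-held-secret"

def seal(data: bytes) -> bytes:
    return hmac.new(KEY, data, hashlib.sha256).digest()

def verify(data: bytes, tag: bytes) -> bool:
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(seal(data), tag)

record = b'{"name": "Alice"}'
tag = seal(record)

print(verify(record, tag))                  # True
print(verify(b'{"name": "Mallory"}', tag))  # False: tampering detected
```

The design point is separation of trust: whoever holds the key, not whoever holds the storage, decides whether data is still authentic.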
Strong governance is essential to maintain data integrity, through measures such as clearly defined policies and guidelines for collection, storage, and usage. Data should be properly labeled, versioned and documented to ensure its traceability and accountability. W3C Solid offers standardized protocols and guidelines that put data owners in a crucial role of mediating consent. It enables data owners to achieve traceability, accountability and adherence to privacy regulations, thus providing essential data governance to preserve integrity.
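The labeling, versioning and documentation measures described above can be made concrete with a small dataset manifest. The sketch below, with illustrative field names that are not drawn from any Solid specification, records a version, license, checksum and timestamp so that later consumers can verify exactly which data they received.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical dataset manifest supporting traceability and
# accountability: version, license, content hash and creation time.
def manifest(payload: bytes, version: str, license_: str) -> dict:
    return {
        "version": version,
        "license": license_,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "created": datetime.now(timezone.utc).isoformat(),
    }

dataset = b"col_a,col_b\n1,2\n3,4\n"
m = manifest(dataset, version="1.0.0", license_="CC-BY-4.0")
print(json.dumps(m, indent=2))

# On later use, re-hashing the payload and comparing against the
# manifest gives an auditable integrity check.
assert m["sha256"] == hashlib.sha256(dataset).hexdigest()
```

A manifest like this travels with the data, so every downstream training run can state which version it consumed and prove the bytes were unmodified.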
These protections must be put in place now, before such technology is embedded any further into our lives and into products we are meant to trust with even trivial decisions. Anyone implementing AI must prepare now to achieve fairness, accountability and transparency while still respecting individual rights to data being accurate (a right to be understood, above the baseline of being forgotten). “You burn the toast, I’ll scrape” should never have become our preferred model for AI innovation, as Deming famously warned over 50 years ago.