Healthcare AI usually shows its major shortfalls in the field, after the application has left the controlled conditions of the pilot program. AI is most commonly sold on accuracy, so once that accuracy drifts, the question becomes where the data came from. Often nobody can say, and the silence in the room almost always means the data has moved outside the conditions the pilot controlled.
The signal is what everyone presents, and it is what everyone focuses on. Below the signal is the data and, more importantly, the question of who owns it, who controls it, and who even knows its provenance. I wrote a book on that lower layer, and I named it for the habit of focusing only on the signal while neglecting the data buried beneath it.
What people are directed to focus on and what actually matters are two different things. Demos are run on data that has been painstakingly prepared and cleaned. This is what good data science and ML validation look like. The pressure begins when the same model is pointed at the everyday feed, where records are entered under time pressure, fields get repurposed, and a value that meant one thing last quarter means something slightly different today.
I am no longer surprised by models that seem complete and are not. I now look more closely at the data, and at whether anyone in the building could tell me its history without having to guess.
The Layer Beneath the Signal
What do I mean when I say the data layer? A few simple things. Data belongs to someone. That person can say how it came to be and what it is suitable for, and is accountable for it when it is moved or broken. You can call that stewardship. Governance is the related term, but it has been used and abused so much that it tends to refer to a committee rather than an individual.
A model does not know any of this. It takes the field it is given and scores it, regardless of whether that field still means what it meant during training or has quietly changed underneath. People are the only part of the system with the capacity to see this. Remove the data owner, and the drift goes on unnoticed until it lands somewhere costly.
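As a rough illustration of the kind of check an owner might run, here is a minimal Python sketch that compares one categorical field between a training-time snapshot and the live feed. The function names, the field values, and the choice of total variation distance are all my assumptions for the example, not a method taken from any particular tool.

```python
from collections import Counter


def category_shares(values):
    """Return each distinct value's share of the column (shares sum to 1)."""
    if not values:
        raise ValueError("cannot compute shares for an empty column")
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def field_drift(baseline, current):
    """Compare a categorical field between a baseline snapshot and the live feed.

    Returns the total variation distance (0 = identical mix, 1 = no overlap)
    and the set of values that never appeared in the baseline.
    """
    base = category_shares(baseline)
    cur = category_shares(current)
    all_values = set(base) | set(cur)
    distance = 0.5 * sum(
        abs(base.get(v, 0.0) - cur.get(v, 0.0)) for v in all_values
    )
    unseen = set(cur) - set(base)
    return distance, unseen


# Hypothetical data: a visit-type field at training time and in production.
training_codes = ["inpatient"] * 70 + ["outpatient"] * 30
live_codes = ["inpatient"] * 40 + ["outpatient"] * 30 + ["telehealth"] * 30

distance, unseen = field_drift(training_codes, live_codes)
print(f"distance={distance:.2f}, unseen={sorted(unseen)}")
```

A check like this only catches changes that show up in the values themselves. A field whose meaning shifted while its values stayed the same would pass untouched, which is exactly the case where a person who knows the data's history is needed.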
Treating Data as an Asset, Not a Liability
Owning data involves more than the contracts you draw up to limit your legal exposure. Ownership is what makes data usable and lets you build on it. Three characteristics define an asset: ownership, history, and care. Data that has all three can sustain an operational model. Data that lacks them cannot, regardless of how impressive the model is during a demo.
Most budgets invert this ordering. The bulk of the money goes to the tool, while the data the tool depends on is treated as a sunk cost. When the model underperforms, the common response is to purchase a better model. The better response is usually cheaper and comes with less fanfare: pay the person who owns the data and give them permission to improve it.
What Data Ownership Means
An owner has the authority to say no, knows the source of the data, knows the shape of its history, and knows the workflows that produce it. They can therefore spot changes to the data and workflows before those changes show up in the model. The reality is usually messier: multiple owners, multiple datasets, and some datasets left unowned because nobody wants the responsibility. Identifying those gaps and assigning them to someone is the most critical step.
None of this is glamorous, and it does not photograph as well as a model launch. It is, however, the difference I keep seeing between AI that gets deployed and models that are shut down after their pilot.
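Finding the ownership gaps described above can start as something very small. The following Python sketch assumes a hand-maintained inventory of datasets; the `DatasetRecord` fields, the dataset names, and the role labels are hypothetical and only illustrate the idea of listing what nobody has claimed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetRecord:
    """One row in a hypothetical dataset inventory."""
    name: str
    source: str                  # where the data originates
    owner: Optional[str] = None  # a named individual, not a committee


def ownership_gaps(records: List[DatasetRecord]) -> List[str]:
    """Return the names of datasets that have no named owner."""
    return [r.name for r in records if not (r.owner and r.owner.strip())]


# Hypothetical inventory for illustration only.
inventory = [
    DatasetRecord("admissions", source="registration desk", owner="intake lead"),
    DatasetRecord("lab_results", source="lab system export"),
    DatasetRecord("billing_codes", source="revenue cycle team", owner="  "),
]

gaps = ownership_gaps(inventory)
print("Datasets without an owner:", gaps)
```

The list this produces is only the starting point. Each entry still needs a person who agrees to answer for it and has the authority to do so.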
Reasons Behind the Investment
Models are packaged to sell to leadership. In reality, leadership is making a call about whether the people behind the data are worth the investment. There is no way to buy a path to production-level AI. Stewardship is a requirement, and the budget line that best forecasts success is the one that funds the owner.
Every model that has been deployed to production relies on a person who understands the data and has the authority to answer for it. This is what I keep coming back to in the book and in practice. It is the most overlooked signal, and the most important one.