Robert T. Chang, MD, explains how aggregated real-world data will drive practice patterns and algorithms going forward.
Obtaining data sets
The rate-limiting step in deep learning is clean "Big Data": a sufficient variety and quality of training examples without too many artifacts and, ideally, the ability to ensure that the AI can calculate the relative risk of misclassification, essentially knowing when not to produce an answer, according to Dr. Chang.
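The "knowing when not to produce an answer" behavior is often implemented as an abstention rule: if the model's top class probability falls below a confidence threshold, the case is deferred rather than auto-graded. A minimal sketch under that assumption (the threshold value and function names are illustrative, not from any specific screening system):

```python
import numpy as np

def softmax(logits):
    """Convert raw model scores into class probabilities."""
    z = logits - logits.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify_with_abstention(logits, threshold=0.9):
    """Return the predicted class index, or None when the model is not
    confident enough to answer (the 'know when not to answer' case)."""
    probs = softmax(logits)
    top = int(np.argmax(probs))
    if probs[top] < threshold:
        return None  # abstain: defer the image to a human grader
    return top

# A confident prediction vs. an ambiguous one
print(classify_with_abstention(np.array([8.0, 1.0, 0.5])))  # prints 0
print(classify_with_abstention(np.array([1.1, 1.0, 0.9])))  # prints None
```

Raising the threshold trades coverage for safety: more images go to human review, but fewer low-confidence answers are emitted.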
This raises the question: Where does all the data used to train these algorithms come from?
The answer is "de-identified" public data sets, essentially open-sourced by various organizations. Because AI algorithm architectures generally are published and cloud computing power is now affordable, every company is trying to aggregate its own unique or proprietary data set as a competitive advantage.
The real differentiators are training data quality, quantity, and diversity. Dr. Chang recounted the original Kaggle data science competition in 2014, which drew more than 600 teams aiming to train an AI algorithm to screen for diabetic retinopathy (DR). That event kicked off interest in neural networks applied to ophthalmology.
A similar follow-up contest, sponsored by the Fourth Asia Pacific Tele-Ophthalmology Society (APTOS), took place last year. The data set for this competition comprised thousands of images from the Aravind Eye Institute in India, which became public domain on release. More than 3,000 teams set out to train a better algorithm to screen for DR. However, people quickly realized that the disease definitions in this Indian data set differed subtly from those in the earlier set, and models from the prior competition did not achieve similar performance.
Humans may not see the small differences, but machines can.
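One crude way to surface such data-set differences before reusing a model is to compare the label distributions of the two sets. A hypothetical sketch (the grade labels below are invented for illustration and do not come from the actual Kaggle or APTOS data):

```python
from collections import Counter

def grade_distribution(labels):
    """Fraction of images at each DR grade (0 = no DR ... 4 = proliferative)."""
    counts = Counter(labels)
    total = len(labels)
    return {grade: counts[grade] / total for grade in sorted(counts)}

# Invented grade labels standing in for two screening data sets
set_a = [0, 0, 0, 1, 2, 2, 3, 0, 0, 4]
set_b = [0, 1, 1, 2, 2, 2, 3, 3, 4, 4]

print(grade_distribution(set_a))
print(grade_distribution(set_b))
```

A large mismatch in grade frequencies is a warning that a model trained on one set may not transfer to the other, even if individual images look similar to a human grader.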
The biggest problem faced in AI is obtaining sufficient longitudinal, real-time, high-quality, representative data that has been cleaned for the task at hand. The narrower the task, the easier it is to acquire relevant, diverse data. Going forward, benchmarking algorithms against one another is also a problem: once a validation set is used, it is hard to keep that data private and to keep it from being incorporated into training data.
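The leakage concern can at least be checked mechanically, for example by hashing raw image bytes and flagging any validation example whose exact content also appears in the training set. A minimal sketch, with short byte strings standing in for image files:

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Hash raw image bytes so duplicates are caught even if files are renamed."""
    return hashlib.sha256(data).hexdigest()

def find_leaked_examples(train_blobs, val_blobs):
    """Return indices of validation examples whose exact bytes also
    appear in the training set (i.e., validation data has leaked)."""
    train_hashes = {content_hash(b) for b in train_blobs}
    return [i for i, b in enumerate(val_blobs)
            if content_hash(b) in train_hashes]

train = [b"retina-001", b"retina-002", b"retina-003"]
val = [b"retina-004", b"retina-002"]  # second image leaked from training

print(find_leaked_examples(train, val))  # prints [1]
```

Exact-byte hashing only catches verbatim duplicates; near-duplicates (recompressed or cropped copies of the same eye) would need perceptual hashing, which is beyond this sketch.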
Also, data generally are scattered across private data silos. Ideally, AI needs a globally shared data system to be truly generalizable.
Amassing enough data to garner trust in AI raises all kinds of ethical issues and considerations, including who owns an individual’s health data—the patient, the hospital or hospital system, the insurance company, or the government? Where will all the data be stored securely? Who is benefiting from monetizing it?
Robert T. Chang, MD
e: [email protected]
Dr. Chang has previously received AI research funding from Santen and was the recipient of a Stanford Center for Innovation in Global Health grant but has no AI financial disclosures.