Back to the Future: Generative AI may return us to the 1990s
If I told most people that I was building a new system that could change the world, they would likely want to hear more. If I then said that it had a few issues like exponential growth in costs, supply chain limitations, and potential legal problems, the discussion would probably end. Quickly. That’s increasingly where we find ourselves with Generative AI.
Training AI is expensive
Many scientists have observed that we may be approaching the limits of using online data to train Large Language Models (LLMs) with massive numbers of “parameters”, the weights used to establish relationships between bits of words (“tokens”). As parameter counts have grown from millions to hundreds of billions, the processing power required has grown roughly in proportion to both the model size and the amount of training data. Since these models need huge amounts of data to learn from, crawlers try to consume the entire Internet. We have massively complex models ingesting massively larger amounts of data.
The cost, in time and electricity, to train models is becoming unsustainable. Using rented processors, training a modern, general-purpose model from scratch on today’s data sets can cost tens of millions of dollars.
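To see why, a back-of-the-envelope estimate helps. A commonly used rule of thumb puts training compute at roughly 6 FLOPs per parameter per training token; the sketch below applies it with purely illustrative inputs (the parameter count, token count, GPU throughput, and hourly price are assumptions, not figures from any specific model or vendor).

```python
# Back-of-the-envelope training cost estimate (all inputs are illustrative assumptions).
# Rule of thumb: training compute ~= 6 * parameters * tokens (FLOPs).

params = 300e9         # assumed model size: 300 billion parameters
tokens = 5e12          # assumed training set: 5 trillion tokens
flops = 6 * params * tokens

gpu_flops_per_sec = 200e12   # assumed sustained per-GPU throughput (~200 TFLOP/s after utilization losses)
gpu_hours = flops / gpu_flops_per_sec / 3600

price_per_gpu_hour = 2.50    # assumed rental price in dollars
cost = gpu_hours * price_per_gpu_hour

print(f"Compute:   {flops:.2e} FLOPs")
print(f"GPU-hours: {gpu_hours:,.0f}")
print(f"Cost:      ${cost:,.0f}")
```

With these assumed inputs the estimate lands in the tens of millions of dollars; real projects vary widely with model size, data volume, hardware efficiency, and pricing, but the shape of the math is the point.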
We are running out of good training data
In addition to cost issues, there may be limits on how much data we can feed AI. As big as it is, the Internet isn’t infinite. Once your massive models have scraped all available online data, what’s left to train with? An often-cited paper projects that we may start running out of training data as early as 2030 (arXiv:2211.04325 [cs.LG]).
When you are out of new data to ingest, you either stop training or you begin to consume output that was synthesized by other AI. Stopping freezes learning. Training on synthesized data puts you in a sort of echo chamber: your models can quickly spiral into learning from the errors produced by other models, which in turn produce new errors that are used to train the next generation. This is referred to as model collapse (arXiv:2305.17493 [cs.LG]). It’s inbreeding with data, and probably won’t end well.
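A toy simulation can make the echo-chamber effect concrete. The sketch below is a deliberately simplified illustration, not the experiment from the cited paper: a categorical “language model” stands in for an LLM, each generation is re-fit only on text sampled from the previous generation, and the vocabulary size and sample sizes are arbitrary assumptions.

```python
import numpy as np

# Toy illustration of model collapse with a categorical "language model":
# each generation is trained (by counting word frequencies) only on text
# sampled from the previous generation. Rare words that fail to appear in
# the synthetic sample get probability zero and can never come back.
rng = np.random.default_rng(0)

vocab_size = 1_000
probs = rng.dirichlet(np.full(vocab_size, 0.1))   # skewed "real" word distribution

for generation in range(1, 21):
    synthetic = rng.choice(vocab_size, size=5_000, p=probs)   # sample synthetic text
    counts = np.bincount(synthetic, minlength=vocab_size)
    probs = counts / counts.sum()                             # refit on synthetic text only
    surviving = int((probs > 0).sum())
    if generation == 1 or generation % 5 == 0:
        print(f"generation {generation:2d}: {surviving} of {vocab_size} words still possible")

# The vocabulary the model can produce only ever shrinks: the tails of the
# original distribution are forgotten, which is the essence of model collapse.
```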
And “content creators” are increasingly taking steps to limit the use of their work to train models they will compete with (one of the demands behind Hollywood’s writers’ and actors’ strikes is that AI not be used to displace them). This will further limit the amount of training data available.
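One of the simplest such steps is a robots.txt rule telling AI crawlers to stay away. As a quick illustration, the Python standard library can check whether a given crawler user-agent is allowed to fetch a page; the site URL below is a placeholder and the user-agent names are just examples, not a statement about any particular publisher’s policy.

```python
from urllib import robotparser

# Check whether a site's robots.txt permits a given crawler to fetch a page.
# The URL and user-agent names are illustrative placeholders.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # downloads and parses the robots.txt file

for agent in ["GPTBot", "CCBot", "Googlebot"]:
    allowed = rp.can_fetch(agent, "https://example.com/articles/some-story")
    print(f"{agent:10s} allowed: {allowed}")
```

Of course, robots.txt is a polite request rather than an enforcement mechanism, which is part of why creators are also heading to court.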
Training data might be illegal to use
Content creators aren’t just looking to block their work from being scraped. They are suing, claiming that AI trained on their work produces “derivative” works. The courts must decide where the line is between a derivative work and an original work that has been immaterially influenced by exposure. If I paint an original picture, I'm probably influenced by every painting I've ever seen. That isn't considered misuse of all prior painters' IP. The closer I get to copying an individual work, the closer I come to creating a derivative work. Is it different when a machine does the same thing? Probably not, but the lines haven’t been drawn. Yet.
Once the courts and Congress do draw lines, what happens to all the AI models that have been contaminated by training with material that (retroactively) isn’t allowed? It’s a big enough concern that Microsoft announced it will assume copyright infringement risks related to customers’ use of its AI “copilot” service. There is also an emerging area called "machine unlearning", where you try to get a model to forget something without starting over. It’s like taking poison back out of a well. Watch that space closely.
Taken together
Where are we? The cost of consuming massively larger data sets is becoming too high. Running out of “good” data to train with is a real possibility, and training on synthesized data is frightening. And training with data scraped today carries the risk that your models won’t comply with future law.
Back to the Future
Solving these issues may require a trip to the past. In the 1990s we did “data analytics” with much smaller amounts of data and used it to generate predictions like customer buying trends. The algorithms were complex, and we had to make sure the data was highly accurate: garbage in, garbage out. We had small, curated data sets and complex algorithms to tease out value.
Then came the era of “big data”, enabled by social media and automation. As available data grew quickly, we simplified our analytics to look for patterns and clusters and began to personalize predictions. We no longer needed “clean” data; we just needed it “unbiased” and relevant to a specific task.
LLMs began to scrape the Internet to learn everything, unconstrained by a specific task, so the inputs grew from “big data” to “huge data”. There’s a lot of inaccurate data online, but also a lot of counter-data to remove bias thanks to the wisdom of the crowd.
As training costs grow unsustainably, and we exhaust available and usable sources, future LLM training will be constrained to learn from smaller amounts of new data. That will require more sophisticated approaches, such as augmenting LLMs with supervision and labeled data. More complex techniques using smaller data sets will once again require more accurate data to minimize errors.
Data lineage and curation will again grow in importance. I may start with a pre-trained LLM and carefully supplement the training with highly focused additional data through “transfer learning”. I will likely want my new, focused data to have greater significance, increasing the importance of its accuracy.
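As a rough sketch of what that workflow looks like, the snippet below fine-tunes a small pre-trained model on a tiny, hand-curated labeled set using the Hugging Face transformers library. The example texts, labels, and output directory are made up for illustration, and distilbert-base-uncased stands in for a larger LLM just to keep the sketch small; the point is the shape of the approach: start from pre-trained weights, then add a small amount of carefully vetted data.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# A small, carefully curated labeled data set (illustrative examples only).
texts = ["The invoice total does not match the purchase order.",
         "Shipment arrived on time and in good condition."]
labels = [1, 0]  # 1 = needs review, 0 = OK (hypothetical labels)

model_name = "distilbert-base-uncased"  # small pre-trained model used as a stand-in
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

class CuratedDataset(torch.utils.data.Dataset):
    """Wraps the curated texts and labels as tokenized training examples."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

args = TrainingArguments(output_dir="finetuned-model", num_train_epochs=3,
                         per_device_train_batch_size=2, logging_steps=1)
Trainer(model=model, args=args, train_dataset=CuratedDataset(texts, labels)).train()
```

Because the new data set is tiny, every mislabeled example carries outsized weight, which is exactly why accuracy and curation matter more here than they did in the scrape-everything era.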
In the end, we may find that our future is back in 1990s style data management. Our algorithms will be swapped for models, but the importance of clean, curated data will grow. Otherwise, our generative AI may generate costs, errors, and lawsuits…but not value.