Generative AI and the Glue Guns
One of the most iconic scenes in Joseph Heller’s “Catch-22” is that of Yossarian and the Glue Guns. And this is what came to mind when I tried to understand the recent problems with ChatGPT et al.
The LLM Dementia
Have you, too, experienced it in the last few weeks?
At first, I thought it was just some sort of coincidence. I’ve been using ChatGPT for a decent period with great success. Most of its answers were quite relevant, usually accompanied by a list of sources. But then, recently, it began to lie, in a way that made it almost useless.
Bad Coding Skills
It started when I asked it about some C++ libraries. It suggested several of them. The first ones seemed to be real. The last one was a library with a wide variety of solutions, which I had already been using for some other needs. Great! I asked it to demonstrate the use of each library with some code examples, and it provided nice code snippets for each of them.
After showing me the trivial usage, I asked for some more advanced scenarios. At this point, I was a bit disappointed; it provided code that seemed to work, but was missing the important semantics for which the libraries were made in the first place. So I gently asked it to refactor the code, and in a series of small steps I pointed out the problem at each stage. But instead of improving the code, things became more and more meaningless: from working code with bad semantics, to very clumsy code, to code that compiles but has bugs, and then to code that does not even compile. Whenever I pointed out a problem in the code, it tried to fix that specific issue by some unreasonable twist to the rest of the code.
Bad Behavior Skills
Still, its original detailed description of the libraries themselves was so innocent-looking that I tended to believe it. I looked at some of the libraries it suggested, and they made perfect sense. So I even sent the list to one of my employees to analyze them and decide which was the best solution for us. I even suggested that the winner might be the library we were already using. A short time later, I tried to dive deeper into it myself. That’s where the horror began. I went to the library’s documentation. I tried to find what I needed among its very long list of solutions. I didn’t. So I tried its search box. Still nothing. I tried to find the header files included in the example code by searching GitHub, where the library is hosted. Nothing there either.
I then asked ChatGPT whether it was sure that this library really provides the relevant functionality. It was very confident that it does. I asked where exactly it is, and got some diplomatic answers. Even when I suggested that maybe this was a mistake, the bot insisted the functionality exists. Only when I demanded a specific URL for the include file, so I could use it, did it break down and admit it was a mistake.
More than a Coincidence
This was not the only case in the last few weeks. There were more and more false answers, in a variety of domains. At first I thought it was nothing more than a few bad results that I tended to see as a pattern. But not long afterwards, I started seeing many similar complaints in various AI groups. And they were not specific to ChatGPT or Copilot, nor to coding and software issues. It seems that something is going on indeed.
So What is It?
There might be many reasons for this phenomenon. At least two possible explanations suggest that what I described above is not a genuine, unintended process of degradation:
- There is a true degradation, but it is intentional.
New, stronger models are released frequently, often offered to paying subscribers only. Their superiority over the older models is in many cases very clear. It is possible that the providers have an interest in degrading the abilities of older models, as some extra motivation for users to subscribe to a newer one. A bit conspiratorial, I admit. But we have already seen that some models are undoubtedly manipulated.
- It’s all in our head.
When the language models first arrived, we were all doubtful about their abilities. As time passed, we got more and more impressed by them. That changed our expectations. Just one year ago, for example, it was really impressive to see them produce code with “almost no bugs”. But we got used to it very quickly. As their abilities progressed, so did our expectations, and it is possible that the latter grew faster than the former. Like the old “dog playing chess” joke, our expectations simply became unrealistic (well… for a while).
These are two possible explanations. But I think there is another reason, a real one, that might explain it. It is not the result of thorough research, but rather a subjective impression. Still, I think it explains the phenomenon well.
Yossarian and the Glue Guns
In Catch-22, Yossarian decides one day to prank one of his commanders. He tells him that the Germans have developed a new “three-hundred-and-forty-four-millimeter Lepage glue gun” that “glues a whole formation of planes together in mid-air”. A short time later, the intelligence officer describes the new German weapon to Yossarian and his friend, and Yossarian collapses in terror, shrieking “My God, it’s true!”.
I suspected a few months ago that something similar might happen to the emerging LLMs. This prediction was based on a very similar observation I had made a couple of years ago. Reverso is a free web translator. One of its unique strengths is that it provides translation “in context”: when you look up a word or a phrase, it shows many sentences that use it. The sentences are taken from real sources (websites, articles, books, etc.), and each sentence is presented in both languages. At the beginning, it was a great tool, very helpful for figuring out the right word or phrase to use when several translations are possible. However, after a few years of use, I started to see a worrying phenomenon: more and more “sources” seemed to be very badly translated.
The original mechanism was great: it took the same source, in both languages, and showed how the word is used in the same sentence in each of them. When both texts were carefully written and/or translated, it is a great basis for comparison. However, over time, more and more sources were clearly not translated by humans. They were obviously the output of earlier-generation machine translation, the sort where the wrong word is used from time to time… And using Reverso became a kind of detective work, trying to figure out which of the texts is trustworthy. Hebrew readers can look at the attached screenshot and see it for themselves.
The LLM Recursive Food-Chain
The same issue, I suspect, now hits the LLMs. At the beginning, they were trained on lots of human-produced data. The mechanism was imperfect, but its sources were relatively reliable. Over time, the mechanism improved, but the sources degraded: so much of the content on the web today was produced by LLMs! So, if a later-generation LLM is trained on the current data of the WWW, much of its input is wrong.
The less noticeable the error, the higher the chance it is adopted. People are unlikely to publish an AI picture of a three-handed person, for example: all but the most clueless creators will reject such a picture and ask the AI to create another one. That means that, over time, the AI engine is effectively trained that a person should not have more than two hands. But if it produces a sophisticated text, or a piece of code, the human user might not be able to spot the glitches. The engine then assumes it was a good product, strengthening the chance that it will repeat its mistake. And if the user puts that text or code on their website or in a GitHub repository, it will affect all the other LLMs as well.
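To make that feedback loop concrete, here is a minimal toy simulation of the recursive food-chain. Everything in it is hypothetical: the corpus sizes, the error rates and the human-filtering rate are made-up parameters, and the “model” is just a counter, not a real LLM. It only illustrates the argument above: obvious errors get filtered out by humans, while subtle ones are published, re-ingested and amplified from generation to generation.

```python
# Toy simulation of the "recursive food-chain" (all numbers are made up):
# each generation, a model trained on the current corpus publishes new
# documents. Obvious errors are rejected by human reviewers; subtle errors
# pass review and join the next generation's training data.

HUMAN_DOCS = 1_000_000        # the original, human-written corpus
NEW_DOCS_PER_GEN = 300_000    # AI-generated documents added each generation
P_OBVIOUS_ERROR = 0.05        # share of outputs with errors humans do notice
P_SUBTLE_ERROR_BASE = 0.05    # baseline share of errors humans do not notice

total_docs = HUMAN_DOCS
flawed_docs = 0               # documents carrying an unnoticed, subtle error

for gen in range(1, 11):
    # The more polluted the training corpus, the more subtle errors the
    # next model reproduces, on top of its own baseline error rate.
    pollution = flawed_docs / total_docs
    p_subtle = min(1.0, P_SUBTLE_ERROR_BASE + pollution)

    rejected = int(NEW_DOCS_PER_GEN * P_OBVIOUS_ERROR)   # three-handed people
    published = NEW_DOCS_PER_GEN - rejected
    new_flawed = int(published * p_subtle)               # plausible-but-wrong

    total_docs += published
    flawed_docs += new_flawed
    print(f"generation {gen:2d}: corpus={total_docs:,}, "
          f"subtly flawed={flawed_docs / total_docs:.1%}")
```

Under these made-up numbers, the share of subtly flawed documents climbs every generation, because the only filter in the loop, human review, never catches the errors that matter most.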
What about the Future?
At the beginning, the situation reminded me of the great novel “Flowers for Algernon” by Daniel Keyes. However, I don’t think the end is going to be as tragic as in the book (at least for Team AI; not sure about Team Human).
We can see that newer models keep getting better, so there are ways to make sure they can tell right from wrong when trained. It does mean, however, that we must be very suspicious of any AI product. After all, our own sources of verification are just as bad as theirs.
The human body is not expected to evolve a sixth finger in the next century, so people will still be able to detect that kind of error in AI pictures for a long while. But what about more esoteric data? We rely on experts and on web sources. The experts themselves rely on web sources for information outside their own expertise. So, over time, if a subtle mistake in some AI-generated text makes its way to a relatively reliable website, it might propagate and reproduce itself, through AI machines, into other websites, books, blogs, articles and more.
And it is not just about facts. The same is true for subtler things like grammar, phrasing, double meanings, coding techniques and more. Over time, it will be almost impossible to trace such errors back to their source. Let’s hope the machines will help us preserve our knowledge, rather than destroy it.