LLM Data Mining: does your website feed the machine?

LLMs rely on accessible online content for their training. But if a significant share of that content was itself created by LLMs, will the quality and accuracy of their output start to drift? This post examines the impact.

Due to our central role in localization infrastructure, Smartling is well-positioned to do macro-level analysis on usage patterns and general trends in the web content world.

And recently, we found something interesting in that data.

We’ve noticed that LLM bots are scanning localized sites. Presumably, they are mining those sites for content to further improve their own foundation models.

It’s an across-the-board trend, with every type and size of company impacted. Without getting into the legality, ethics, or ownership of that content, we are immediately struck by the potential for creating an internet echo chamber due to these crawls.

Training data contamination and consequences

With more companies using a machine translation (MT)-first or MT-fallback approach to their web content, plus the recent availability of LLMs as translation providers, LLMs may soon find themselves in the position of unwittingly “eating their own dog food.”

What is the impact on the quality and effectiveness of LLMs when their training datasets are interwoven with translated content that originates from LLMs?

LLMs rely on the vast range of freely available digital content on the internet, whether in a newspaper article, academic journal, blog post, or scanned book, to amass enough content to increase the size and complexity of a pre-trained model and thus provide human-like generative capabilities. However, if a significant portion of the content being ingested was created solely by LLMs, with no human review or feedback, will they start to drift in terms of the quality and accuracy of their output? Will the feedback loop create some sort of “AI-ism” that eventually spreads and modifies the structure and tone of language generally?

It is difficult to estimate the impact, but standing as we are at the beginning of this generative AI revolution, we see the potential pitfalls in the data-gathering process used by LLM providers.

Intellectual property and value issues

Identifying all incoming traffic belonging to bots is impossible because we depend on their proper use of User-Agent headers that declare their origin and purpose. Many unscrupulous scraping bots will not only hide their purpose; they will actively try to disguise themselves and blend into the general stream of traffic that any public website sees.
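As a rough illustration of why this identification depends on self-declaration, the sketch below sorts User-Agent strings into three buckets. The function name and the bucket labels are hypothetical, and the token list is a short, non-exhaustive sample of publicly documented crawler names, not a description of how Smartling performs its analysis.

```python
import re

# Illustrative, non-exhaustive sample of crawler tokens that publicly
# identify themselves (assumption: a real list would be longer and would
# need regular maintenance).
LLM_CRAWLER_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "PerplexityBot", "Bytespider")

# Generic hints that a client declares itself as some kind of bot.
GENERIC_BOT_PATTERN = re.compile(r"bot|crawler|spider")


def classify_user_agent(user_agent: str) -> str:
    """Classify a request by what its User-Agent header *claims* to be."""
    ua = user_agent.lower()
    for token in LLM_CRAWLER_TOKENS:
        if token.lower() in ua:
            return "declared-llm-crawler"
    if GENERIC_BOT_PATTERN.search(ua):
        return "declared-other-bot"
    # Either a real browser or a bot disguising itself as one:
    # the header alone cannot tell these apart.
    return "undeclared"
```

The "undeclared" bucket is the limitation described above: a scraper that sends a browser-like User-Agent lands there alongside genuine visitors, so any count based on headers is a lower bound on bot traffic.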

A possible future approach to filtering this “echo chamber” effect is for LLM providers to work with content providers to develop some sort of watermarking that identifies LLM-generated content, so that it can be categorized and treated appropriately. This type of watermarking will likely also be in demand to mitigate the effects of disinformation, IP theft, and other antisocial behavior that bad actors may exhibit.

Additionally, companies that do not mind, or are even interested in, having LLMs crawl their data may one day opt to monetize their content by selling access to LLM crawlers. This could prove to be a lucrative side business that pays a negotiated value for human-generated content. Content producers have already filed lawsuits against LLM providers in an attempt to regain control of their copyrighted material.

What can we do about it?

LLM scraping of websites for content is not a secret. Still, many companies may be surprised to learn that it is happening to them, and they may be unwitting participants in activities that bring them little benefit while generating considerable value for LLM providers.

In the world of machine translation, “using AI to help AI” is not a novel idea. When client-specific, domain, or long-tail language data is scarce, it is not uncommon to resort to data augmentation techniques such as web crawling of similar websites, back translation, or data manufacturing by creating slightly different source and target language variants.
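To make one of these augmentation techniques concrete, here is a minimal sketch of back translation: genuine target-language sentences are run through a reverse (target-to-source) translator, and each machine-generated source sentence is paired with its human-written target. The names `back_translate` and `toy_es_to_en` are hypothetical, and the dictionary lookup is only a stand-in for a real MT model.

```python
from typing import Callable, Iterable, List, Tuple

# A translator is any callable that maps one sentence to another.
Translator = Callable[[str], str]


def back_translate(
    target_sentences: Iterable[str],
    target_to_source: Translator,
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) training pairs.

    The target side is genuine text; the source side is machine-generated.
    """
    pairs = []
    for target in target_sentences:
        synthetic_source = target_to_source(target)
        pairs.append((synthetic_source, target))
    return pairs


# Toy stand-in for a reverse MT model (assumption: a real system would
# call an actual Spanish-to-English model here).
_TOY_LEXICON = {"hola": "hello", "gracias": "thank you"}


def toy_es_to_en(sentence: str) -> str:
    # Unknown sentences pass through unchanged.
    return _TOY_LEXICON.get(sentence, sentence)


synthetic_pairs = back_translate(["hola", "gracias"], toy_es_to_en)
```

Note that half of every pair produced this way is machine output, which is exactly why the quality of such augmented data depends on the quality of the model that generated it.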

Nevertheless, it is vital that anyone relying on the output of the model understand the pros and cons of such approaches. In most cases, these techniques improve model quality only incrementally. Ultimately, they do not replace the underlying principle of machine learning: the need for well-labeled and relevant data.