Remember last year, when we reported that the Red Ventures-owned CNET had been quietly publishing dozens of AI-generated articles that turned out to be filled with errors and plagiarism?
The revelation kicked off a fiery debate about the future of the media in the era of AI — as well as an equally passionate discussion among editors of Wikipedia, who needed to figure out how to treat CNET content going forward.
[…] Gerard's admonition was posted on January 18, 2023, just a few days after our initial story about CNET's use of AI. The comment launched a discussion that would ultimately result in CNET's demotion from its once-strong Wikipedia rating of "generally reliable." It was a grim fall that one former Red Ventures employee told us could "put a huge dent in their SEO efforts," and also a cautionary tale about the wide-ranging reputational effects that publishers should consider before moving into AI-generated content.
Maggie Harrison Dupré
Excellent response by Wikipedia. Any outlet that uses spicy autocomplete to generate content needs to be booted off Wikipedia.
I know “spicy autocomplete” is supposed to be some sort of derogatory term for AI, but I don’t understand what it’s supposed to mean.
It's a reference to the way GPT models generate text one word at a time. However, Thom is under the wrong impression that "spicy autocomplete" is the problem. The problem is not (and has never been) outputting text one word at a time, because a black box can do this intelligently, in a way that is mathematically equivalent to any other way of producing text. The problem is that said black box cannot differentiate between fact and fiction, which has absolutely nothing to do with the order in which words are written.
Nah, it is autocomplete because it simply outputs the parts of its training data that most frequently co-occur with the query in said training set, just like standard autocomplete outputs training strings that begin with something close to the query. And "spicy" because of the hype, I guess. Black box or white box, for a system based on frequency, "truth" (the output fitness it effectively optimizes) depends only on how popular a given statement is in the training data, so it won't be too reliable given what can be found on the Internet (statistically).
Surely OpenAI and the like use numerous tricks and biases to tweak the reliability of the output, but follow some misinformed or bizarre topic and you will get either falsehoods or stochastic noise.
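As a rough illustration of that "autocomplete" framing (a toy bigram sketch, not how GPT actually works; real models use learned representations rather than literal frequency tables), generating text one word at a time from training-set frequencies looks like this:

```python
import random
from collections import defaultdict, Counter

# Toy "training data"; a real model sees billions of tokens.
corpus = "the earth is round . the earth is flat . the moon is round .".split()

# Count how often each word follows each one-word context (a bigram model).
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def next_word(context_word):
    """Sample the next word in proportion to its frequency in the training data."""
    counts = follow_counts[context_word]
    words, weights = zip(*counts.items())
    return random.choices(words, weights=weights)[0]

# Generate text one word at a time, starting from "the".
word, output = "the", ["the"]
for _ in range(6):
    word = next_word(word)
    output.append(word)
print(" ".join(output))

# "is" is followed by "round" twice and "flat" once in the corpus, so the
# model says "round" more often: popularity decides, not truth.
```

The only thing the toy model "knows" is which continuation was most common in its corpus, which is exactly the popularity-versus-truth issue described above.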
mbq,
You’re assuming such a predictive oracle can’t create coherent and intelligent output, but that’s not a logical conclusion. To be genuinely fair, we need to judge these oracles as black boxes without bias towards their implementation.
I hereby declare the following axiom:
An oracle that predicts intelligent output must itself be intelligent.
Assume we have oracle A, which we declare to be intelligent. The implementation of oracle A, even whether or not it's human, is irrelevant. Now oracle B, a predictive model, is trained to reproduce the outputs of oracle A with exactly the same statistical odds. In principle, fairness requires us to be blind to the mechanisms and look only at the inputs and outputs, and since the two produce the same outputs, they are in principle indistinguishable.
Just to be clear, I'm not claiming ChatGPT is a perfect "oracle B", but I am absolutely claiming that your and Thom's justification for dismissing predictive "autocomplete" AI models is baseless and biased. It's perfectly justifiable to generate intelligent output this way.
Alfman,
They can differentiate between levels of factual confidence, but currently they are optimized for "sounding convincing" (just like con men), or, more recently, for heavy bias against certain groups of people, as per the recent backlash.
That being said, academia (and the LLM "manufacturers") are already looking into surfacing truthfulness:
https://web.stanford.edu/class/cs329t/slides/lecture_10.pdf
And ironically, that lecture refers to the very same CNET issues.
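For anyone curious what "surfacing truthfulness" can mean in practice, one simple signal (a toy sketch of the general idea, not necessarily what that lecture proposes) is the model's own per-token probabilities: when the tokens carrying a factual claim get low probability, the model is effectively guessing.

```python
import math

def sequence_confidence(token_logprobs):
    """Geometric-mean probability of the tokens the model actually emitted.

    token_logprobs: log-probabilities of each generated token (most APIs
    and local runtimes can return these alongside the text).
    """
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

# Hypothetical numbers, purely for illustration.
confident_answer = [-0.05, -0.10, -0.02, -0.08]  # model was sure of each token
hedged_answer = [-1.90, -2.30, -0.40, -2.10]     # model was guessing

print(sequence_confidence(confident_answer))  # ~0.94
print(sequence_confidence(hedged_answer))     # ~0.19
```

The obvious caveat is that high probability only means the wording is familiar from training data, not that the claim is true, so this alone doesn't fix the garbage-in problem discussed below.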
sukru,
Of course, an NN can be trained on truths and falsehoods, but the garbage-in, garbage-out problem still exists. We just end up moving it around.
Alfman,
I would normally say "the issue will resolve itself thanks to massive amounts of data", but we are already poisoning that very dataset.
Not only is some online content now AI-generated (as in this case), which could create a counterproductive feedback loop, but we as a society have also done really bad things to truthfulness.
Somehow, giving people freedom of opinion and freedom of speech (which is paramount) has led people to give equal weight to all ideas, regardless of how incorrect they are.
This leads to situations where people believe in a flat Earth but are not called out for it (I would even call for open ridicule), and also to problems in more important areas, like vaccines and public health, where actual snake-oil salesmen are regarded as highly as respectable researchers.
And that leads the AI to give answers like: "the topic of vaccines is complicated and contested…" even though there is actually no contest from any truthful source.
Anyway, yes, we might unfortunately need to interfere with the data to weed out garbage.
sukru,
Years ago, anti-scientific sentiments like "the world is flat" were a joke. Something changed. It's difficult to comprehend how quickly ignorance spreads now.
I personally don't have much faith in the process. There are too many bad actors with their thumbs on the scales, including autocratic governments seeking to oppress & punish those who disagree with their narratives. The internet has dramatically accelerated this while helping to anonymize the puppeteers who are astroturfing to create social divisions and redirect public wrath.
During earlier elections I was bombarded by propaganda channels that YouTube would recommend to me every day. I use YouTube without logging in or cookies, so whatever YouTube was recommending to me was being recommended to the public at large. I don't know whether Google has got a handle on it now, but I sure hope so, because the amount of hate and ignorance that Google was promoting to the very top of YouTube disgusted me. I'm sure they weren't the only platform that got infiltrated, but that's where I noticed it, since I don't use other social media platforms.
Alfman,
You are right about governments wanting to tip the scales in their direction.
That being said, open source once again comes to the rescue.
https://www.reddit.com/r/LocalLLaMA/
I have been using both a ChatGPT subscription and, locally, several models that are comparable to the "3.5" level. They might not be perfect, but they are much less likely to refuse to follow directions.
Joking aside, responses like "killing a Unix process is a violent act; we should not be discussing its semantics" are very unlikely to show up in an uncensored/unaligned local model.
What? You mean words vomited by a stochastic parrot are not facts? Next you'll be telling me I shouldn't use my stochastic parrot to present a legal case to a judge:
https://yro.slashdot.org/story/24/02/29/2124254/bc-lawyer-reprimanded-for-citing-fake-cases-invented-by-chatgpt
PS: Calling things that aren't AGI "AI" (especially without any qualifier as to what kind of non-AGI "AI" it is) has created mass confusion in the general populace, who don't realize that those things can't actually think.
kurkosdr,
I’m guilty of that. People who study AI don’t assume AI means AGI. I think of AI in terms of solving specific problems like chess, jeopardy, handwriting/voice recognition, language models, etc. But you may be right that the general population might not know/understand the distinction.
I no longer consider Wikipedia a "generally reliable" source after a "censorship" issue.
The Catalan Wikipedia blocked me for stating that Étienne Terrus (a painter, friend of Matisse and others), who was born in France, was French rather than a "Catalonian of the North", which is indoctrination about a false "Catalan Countries" (Països Catalans) pushed by the criminal independentists (Article 2 of the Spanish Constitution literally states that Spain is an "indivisible territory").
So, Wikipedia today is nothing more than a source of indoctrination and false information.
franzrogar,
I agree the Wikipedia moderators certainly aren't perfect; they're humans with human biases. Many of them don't like being contested, and they have the power to promote one view over another. I think this is really hard to solve in a systematic way, though. While they are imperfect, I feel they do a better job than Stack Exchange, which can be ruined by overzealous moderators. In that case, I blame a faulty incentive system that rewards doing something over doing nothing.
It would be interesting to use AI for moderation; it could be better than humans at being impartial. However, this trait is also exploitable. A good example is our extremist politics, where politicians want to spread their lies in place of facts. Arguably it's bad to be impartial there. Most of us here on OSNews are better educated and see through at least the most egregious political lies, but for better or worse there are hordes of uneducated voters who are extremely gullible, and exposing them to false information is harmful. Ironically, the exact same thing is true of AI itself. AI is great at using the information it has, but it has no compass for the truth beyond that, due to the old "garbage in, garbage out" problem.
Also, to the human moderator (Thom, probably): I think the fact that this comment ended up in the queue because "AI" filtered it out for the bad words in it pretty much demonstrates my point.
(Come back later to see my original comment if it has been manually approved)
The123king,
WordPress doesn't use AI but rather rule-based "moderation" to stop spam. IMHO it's not that great. Long-term users ought to get the benefit of the doubt and not have our posts blocked so trivially; new accounts are much more likely to be used for spam, and that should be reflected in the rules. My posts used to be regularly blocked by WordPress for including too many links. I assume your post was flagged by the same rule. This is why I remove the "http://" part, so that WordPress doesn't flag it as a spam false positive.
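Purely as an illustration of how crude such rules are (a made-up sketch, not WordPress's or any plugin's actual logic, and MAX_LINKS is a hypothetical threshold), a link-count rule boils down to something like:

```python
import re

MAX_LINKS = 2  # hypothetical threshold; the real value is whatever the blog is configured with

def held_for_moderation(comment: str) -> bool:
    """Hold a comment for moderation if it contains too many URLs."""
    links = re.findall(r"https?://", comment)
    return len(links) > MAX_LINKS

print(held_for_moderation("See https://example.com and https://example.org"))  # False
print(held_for_moderation("https://a.com https://b.com https://c.com https://d.com"))  # True

# Stripping the "http://" prefix defeats the rule entirely, which is exactly
# why it catches long-time commenters while missing determined spammers.
```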
The comment was about AI making moderation decisions, and why it'll suffer from the AI equivalent of the "Sc*nthorpe problem".
The123king,
I wasn’t familiar with that, but I see what you mean.
I agree with you that sometimes dumb filters can fail in very silly ways. In principle, though, you could train an AI model to replicate human moderation much more accurately using large training sets. This would be smart enough to look at context, which many rules engines fail to do. But even then, it's still subject to bad training data and GIGO. Also, a moderation AI could be used to train a combative AI that has the opposite goal and is very proficient at evading the first AI. This could actually make things worse.
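To make the "dumb filters fail in silly ways" point concrete, here is a deliberately tame, made-up sketch of the Sc*nthorpe problem, using "ass" as the stand-in banned string: a bare substring check flags innocent words, while even a simple word-boundary check avoids that particular failure (though, as noted above, smarter filters just trade this for GIGO and adversarial evasion).

```python
import re

BANNED = ["ass"]  # stand-in for a real blocklist entry

def naive_filter(text: str) -> bool:
    """Substring match: the classic Sc*nthorpe-style false positive."""
    lowered = text.lower()
    return any(word in lowered for word in BANNED)

def boundary_filter(text: str) -> bool:
    """Match whole words only; still dumb, but with fewer silly failures."""
    lowered = text.lower()
    return any(re.search(rf"\b{re.escape(word)}\b", lowered) for word in BANNED)

print(naive_filter("A classic assassination of good prose"))     # True (false positive)
print(boundary_filter("A classic assassination of good prose"))  # False
```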
"Catalonian of the North" refers to the Pyrénées-Orientales region of France, historically part of the Crown of Aragon and Catalan-speaking to this day (anecdote: the term began to be used in France). And Article 2 of the Spanish Constitution does not criminalize independentism as you seem to suggest; in fact, Article 1 covers it (freedom of speech).
So you can say Pablo Motos is from Spain, from the Region of Valencia, or from "Castilian Valencia", as the part of Castile that has belonged to Valencia since the 19th century (and where Motos is from) is known. None of these are inventions, false information, or indoctrination. In fact, misinformation, or worse, disinformation, is what you spread when you said that in Spain we do not have freedom of political thought because of Article 2.
Idk, I've long disagreed with Wikipedia on what is or is not a reliable source. I've seen articles rejected for lacking reliable sources because they cited site XYZ, while other articles over there have only sources from XYZ. Sometimes you just can't argue with the powerful mods. So, like the web, Wikipedia has a good amount of good information and a lot of crap. Eh, it's free and volunteer-based, and it does more good than harm.
Indeed, I rarely use it for more than basic fact-checking, and I always follow the source before simply taking an article at face value. It's impossible to separate human-generated content from human bias, especially for any emotionally or politically charged subjects.
My most recent encounter with a grievous censorship edit on Wikipedia involved a complete whitewashing of a potentially malicious software product, likely done by either an employee of the company or an ardent fan of the software. All controversial and potentially negative factual information about the product was scrubbed from the article, and now it reads like an advertisement (which ironically is strictly forbidden by Wikipedia’s rules and guidelines). Any attempt to change it gets reverted and the IP address that restored the factually correct information retrieved from the Internet Archive is banned. Ask me how I know.
Bill Shooter of Bul,
You shot the Bul on the head
Wikipedia has tons of great information, but not always, and sometimes it falls short of its encyclopedic goals. I've had some disagreements with Wikipedia mods, but the mods always win, which is the privilege of being a Wikipedia mod.
Still, it's such a useful reference tool. I'd rather have Wikipedia than not.