Large language models trained on questionable stuff online will produce more of the same. Retrieval augmented generation is one way to get closer to truth.

Even people not in tech seemed to have heard of Sam Altman’s ouster from OpenAI on Friday. I was with two friends the next day (one works in construction and the other in marketing) and both were talking about it. Generative AI (genAI) seems to have finally gone mainstream. What it hasn’t done, however, is escape the gravitational pull of BS, as Alan Blackwell has stressed.

No, I don’t mean that AI is vacuous, long on hype, and short on substance. AI is already delivering for many enterprises across a host of industries. Even genAI, a small subset of the overall AI market, is a game-changer for software development and beyond. And yet Blackwell is correct: “AI literally produces bullshit.” It makes up stuff that sounds good based on its training data. Even so, if we can “box it in,” as MIT professor of AI Rodney Brooks puts it, genAI has the potential to make a big difference in our lives.

‘ChatGPT is a bullshit generator’

Truth is not fundamental to how large language models function. LLMs are “deep learning algorithms that can recognize, summarize, translate, predict, and generate content using very large data sets.” Note that “truth” and “knowledge” have no place in that definition. LLMs aren’t designed to tell you the truth. As detailed in an OpenAI forum, “Large language models are probabilistic in nature and operate by generating likely outputs based on patterns they have observed in the training data. In the case of mathematical and physical problems, there may be only one correct answer, and the likelihood of generating that answer may be very low.” That’s a nice way of saying you might not want to rely on ChatGPT to do basic multiplication problems for you, but it could be great at crafting an answer on the history of algebra.
In fact, channeling Geoff Hinton, Blackwell says, “One of the greatest risks is not that chatbots will become super intelligent, but that they will generate text that is super persuasive without being intelligent.” It’s like “fake news” on steroids. As Blackwell says, “We’ve automated bullshit.” This isn’t surprising, given that the primary sources for the LLMs underlying ChatGPT and other genAI systems are Twitter, Facebook, Reddit, and “other huge archives of bullshit.” However, “there is no algorithm in ChatGPT to check which parts are true,” such that the “output is literally bullshit,” says Blackwell. What to do?

‘You have to box things in carefully’

The key to getting some semblance of useful knowledge out of LLMs, according to Brooks, is “boxing in.” He says, “You have to box [LLMs] in carefully so that the craziness doesn’t come out, and the making stuff up doesn’t come out.” But how does one box an LLM in? One critical way is through retrieval augmented generation (RAG). I love how Zachary Proser characterizes it: “RAG is like holding up a cue card containing the critical points for your LLM to see.” It’s a way to augment an LLM with proprietary data, giving the LLM more context and knowledge to improve its responses.

RAG depends on vectors, a foundational element in a variety of AI use cases. A vector embedding is just a long list of numbers that describes features of a data object, such as a song, an image, a video, or a poem, stored in a vector database. Embeddings capture the semantic meaning of objects in relation to other objects: similar objects are grouped together in the vector space, and the closer two objects are, the more similar they are. (For example, “rugby” and “football” will be closer to each other than “football” and “basketball.”) You can then query for related entities that are similar based on their characteristics, without relying on synonyms or keyword matching.
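To make the vector idea concrete, here is a toy sketch in Python. The numbers are invented for illustration; a real system would get its vectors from a trained embedding model, with hundreds or thousands of dimensions rather than four.

```python
import numpy as np

# Toy 4-dimensional "embeddings" -- the values are made up for this example.
# A real embedding model would produce much longer vectors learned from data.
embeddings = {
    "rugby":      np.array([0.9, 0.8, 0.1, 0.0]),
    "football":   np.array([0.8, 0.9, 0.2, 0.1]),
    "basketball": np.array([0.2, 0.3, 0.9, 0.8]),
}

def cosine_similarity(a, b):
    """Similarity of two vectors: close to 1.0 means nearly the same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "rugby" and "football" sit closer together in this space than
# "football" and "basketball", matching the intuition in the text.
print(cosine_similarity(embeddings["rugby"], embeddings["football"]))
print(cosine_similarity(embeddings["football"], embeddings["basketball"]))
```

Cosine similarity is one common distance measure for comparing embeddings; vector databases typically offer it alongside Euclidean and dot-product metrics.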
As Proser concludes, “Since the LLM now has access to the most pertinent and grounding facts from your vector database, it can provide an accurate answer for your user. RAG reduces the likelihood of hallucination.” Suddenly, your LLM is much more likely to give you a true response, not merely a response that sounds true.

This is the sort of “boxing in” that can make LLMs actually useful and not hype. Otherwise, it’s just automated bullshit.
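The “cue card” mechanics can be sketched end to end in a few lines of Python. Everything here is invented for illustration: the stand-in embedder is a crude bag-of-characters vector rather than a real embedding model, the documents are hypothetical, and a production system would call an actual LLM with the augmented prompt.

```python
import numpy as np

def embed(text):
    # Stand-in embedder: a normalized bag-of-characters vector.
    # Real RAG systems use a trained embedding model; this only
    # illustrates the retrieve-then-augment mechanics.
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    return vec / (np.linalg.norm(vec) or 1.0)

# A tiny "vector database": hypothetical proprietary facts stored
# alongside their embeddings.
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am to 5pm, Monday through Friday.",
    "The enterprise plan includes single sign-on and audit logs.",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query, k=1):
    """Return the k documents whose embeddings are closest to the query's."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: -float(np.dot(q, pair[1])))
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query):
    """Augment the user's question with retrieved facts -- the 'cue card'."""
    context = "\n".join(retrieve(query))
    return f"Answer using only these facts:\n{context}\n\nQuestion: {query}"

# The augmented prompt would then be sent to the LLM.
print(build_prompt("What are your support hours?"))
```

The key design point is that the LLM never has to “remember” your proprietary facts: they are looked up at query time and placed directly in the prompt, which is what narrows the model’s room to make things up.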