So you want to build on GPT-4? Questions to bear in mind when applying it to business use cases - 2023-05-24

5 min read · May 24, 2023

My Twitter "For you" feed is full of users proclaiming that X AI is going to disrupt Y industry, and yet they haven't even tried the product. There is a huge amount of hype right now and very little understanding of how any of this applies to real business use cases.

Alex and I are focused on the real-world application of new tech, and the AI du jour is Large Language Models (LLMs), with ChatGPT being the fastest-growing product of all time.

Given the pace of AI developments, it would be hubris to prescribe what will be disrupted; instead I'm going to break down the current limitations and how they apply to use cases.

The Hallucination Problem

LLMs generate their answer one token at a time, picking whichever word is statistically most likely given everything that came before. There is no notion of truth or facts, just what is most likely to come next. Because LLMs have seen thousands of examples of 2 + 2 = 4, when prompted with "2 + 2" they will answer "4", but only because that is the most common pattern. When you ask something nonsensical, the model will still play along statistically, and that is the hallucination problem.
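To make "statistically most likely next word" concrete, here is a minimal sketch using the OpenAI completions endpoint, which can return log probabilities for candidate next tokens. The model name and output shown are illustrative, not a claim about what any given model will return:

```python
# A minimal sketch (assumes the openai package is installed and an API key is set)
# that inspects the model's top candidate next tokens for the prompt "2 + 2 =".
import openai

response = openai.Completion.create(
    model="text-davinci-003",   # a completions model that exposes logprobs
    prompt="2 + 2 =",
    max_tokens=1,
    temperature=0,
    logprobs=5,                 # return the 5 most likely next tokens
)

# The model isn't calculating anything; it's ranking which token is most likely
# to follow, e.g. something like {" 4": -0.01, " four": -5.2, ...}
print(response["choices"][0]["logprobs"]["top_logprobs"][0])
```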

To understand which use cases are a good fit, you first need to answer two questions:

  • What is the accepted error rate for your use case?
  • How quickly can a user verify the output is correct?

For example, it is very quick to verify whether generative AI art is good or not. The same goes for generated code: it's very easy to run it and see whether it works or throws an error. Conversely, a piece of historical analysis would take time to verify.

Generative content is the obvious fit, such as writing marketing or ad copy, and Jasper is leading the way here. However, it doesn't have to stop there. If it's very easy to correct the output and verification is quicker than writing it yourself, even use cases that demand extremely high accuracy become viable. For example, generating internal company announcements where the user can quickly verify and make small edits before releasing. This is sometimes referred to as "human in the loop".

So bear in mind:

  1. How quickly can the user verify the output is correct?
  2. What is the accepted accuracy of the output?
  3. How easy is it for the user to edit the output?

Prompt size limitation

There is a maximum token limit per request, depending on which model OpenAI has enabled for you. Most people have an 8,000-token limit on GPT-4 (4,000 on GPT-3.5), but there is a restricted 32,000-token model. 32,000 tokens is equivalent to 22,000-odd words, or around 10 to 20 pages of text. It's not enough to upload all your company data.
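If you want to check how much of your data actually fits, OpenAI's tiktoken library counts tokens for you. A small sketch, with the file name and limit as placeholder assumptions:

```python
# A rough sketch using tiktoken to check whether a document fits inside a
# model's token limit before sending it as a prompt.
import tiktoken

TOKEN_LIMIT = 8_000  # varies by model; the 32k GPT-4 variant is restricted

encoding = tiktoken.encoding_for_model("gpt-4")

def fits_in_prompt(text: str, limit: int = TOKEN_LIMIT) -> bool:
    tokens = encoding.encode(text)
    print(f"{len(tokens)} tokens for roughly {len(text.split())} words")
    return len(tokens) <= limit

# "company_handbook.txt" is a made-up example file.
fits_in_prompt(open("company_handbook.txt").read())
```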

Now you're probably wondering how Intercom's Fin and Slite's Ask are able to build ChatGPT products on top of incredibly large data sets. The answer is a hybrid approach:

  1. The core data is uploaded to a vector database, such as Pinecone. In doing so, the text is converted into embeddings (a bunch of numbers that represent what those words mean and how they relate to other words).
  2. When the user asks their query, e.g. "what does FTE mean?", this too is converted to embeddings, and a search based on those numeric representations is run against the vector database. This returns the items in your data that are most relevant to the query, which are then used as the "context" in the prompt.
  3. A combination of the context and the original user query is then sent as the GPT-4 prompt. Given the LLM's general understanding of "the world" plus the relevant context, it's able to work out that FTE means "Full Time Employee".
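Put together, the flow looks roughly like this. This is a sketch, not production code: it assumes a Pinecone index called "company-docs" already populated with embeddings, each stored with its source text under a "text" metadata key, and it uses the openai and pinecone-client Python packages of the time:

```python
# A rough sketch of the hybrid (retrieval + prompt) approach.
import openai
import pinecone

pinecone.init(api_key="...", environment="...")
index = pinecone.Index("company-docs")  # hypothetical, pre-populated index

question = "What does FTE mean?"

# 1. Convert the user's query into an embedding.
query_embedding = openai.Embedding.create(
    model="text-embedding-ada-002", input=question
)["data"][0]["embedding"]

# 2. Find the chunks of company data most relevant to the query.
results = index.query(vector=query_embedding, top_k=3, include_metadata=True)
context = "\n".join(m["metadata"]["text"] for m in results["matches"])

# 3. Send context + original question to GPT-4.
answer = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": f"Answer using this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(answer["choices"][0]["message"]["content"])
```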

This architecture works well when the answer to the question only requires a few references, e.g. "what does X mean?" or "when did Y happen?". When you start to ask questions that aggregate many references, e.g. "what was the revenue last month from the tier 1 customers?", the search against the vector database will return more tokens than the prompt can hold. That means your LLM is answering on partial information and will return incorrect answers.

Alex and I believe this is one of the biggest problems to solve. There are a number of interesting strategies around this using LangChain, and Data Independent on YouTube has a good video on it. There are other options too: instead of a vector DB, you can turn the LLM query into a SQL or API query, and ChatGPT's Code Interpreter is doing the rounds. GPT-3 had fine-tuning, and it's possible OpenAI has something in the works for the latest models.
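As an illustration of the SQL route, here is a hedged sketch: the schema, table and file names are invented, and in practice you would validate the generated SQL before running it:

```python
# Instead of a vector search, ask the model to write SQL over a known schema,
# then let the database do the aggregation.
import sqlite3
import openai

SCHEMA = """
CREATE TABLE invoices (customer_id TEXT, tier INTEGER, amount REAL, paid_at DATE);
"""  # hypothetical schema for illustration

question = "What is the revenue last month from the tier 1 customers?"

sql = openai.ChatCompletion.create(
    model="gpt-4",
    temperature=0,
    messages=[
        {"role": "system", "content": f"Write a single SQLite query for this schema:\n{SCHEMA}\nReturn only SQL."},
        {"role": "user", "content": question},
    ],
)["choices"][0]["message"]["content"]

# The database does the aggregation, so the answer is no longer limited by
# how many records fit in the prompt.
result = sqlite3.connect("company.db").execute(sql).fetchall()
print(result)
```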

So bear in mind:

  1. Does the information needed to answer the query fit in the prompt limit?
  2. OpenAI is increasing the token limit; however, API requests can't be dozens of MBs large. A hybrid architecture will be a must.

Non-deterministic

Deeply embedded in the model there is an element of randomness, which means you can't guarantee that the same input will always produce the same output.

To mitigate the impact you can set the temperature to 0. This is highly effective when the prompt and expected answer are short (make sure to use stop sequences). However, when the use case requires generating longer text, the probability of divergence is higher.
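A minimal sketch of that mitigation, combining temperature 0 with a stop sequence for a short classification-style answer (the prompt and example input are made up):

```python
# Temperature 0 plus a stop sequence keeps short, structured answers as
# repeatable as possible; longer generations can still diverge.
import openai

response = openai.ChatCompletion.create(
    model="gpt-4",
    temperature=0,   # remove as much randomness as possible
    stop=["\n"],     # stop sequence: cut the answer off at one line
    messages=[
        {"role": "system", "content": "Classify the sentiment as positive, negative or neutral. Answer with one word."},
        {"role": "user", "content": "The onboarding was painless and support replied within minutes."},
    ],
)
print(response["choices"][0]["message"]["content"])
```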

Data Drift

I asked a friend who works in PR if and how they use ChatGPT. He said he did, and that it was great. The first thing they do is ask ChatGPT to write a press release. This, he says, is the "default answer", and they use it as a base of what NOT to write.

This is just one example of data drift, where the LLM becomes outdated. The reason is that LLMs have to be trained on a huge corpus of data, an expensive, multi-month process, which means the core LLM is always out of date. Plus, for safety reasons they are not trained beyond a certain date, GPT-4's cut-off being September 2021. It also means that LLMs (or at least the current architecture) don't learn in real time.

This means I can't ask ChatGPT to help me debug LangChain (the project started a year after the training cut-off date). Sure, you could upload the LangChain docs as the prompt and then ask questions. But if the problem space is evolving quickly and requires expertise, much of the heavy lifting has to happen in the prompt context, which is already a constraint.
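A quick sketch of that workaround, with the documentation file name and the crude truncation purely illustrative:

```python
# Paste a documentation excerpt into the prompt to cover knowledge past the
# training cut-off, at the cost of precious context tokens.
import openai

docs_excerpt = open("langchain_docs.md").read()[:6000]  # crude character truncation to fit the limit

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": f"Answer using only this documentation:\n{docs_excerpt}"},
        {"role": "user", "content": "Why does my retrieval chain return empty results?"},
    ],
)
print(response["choices"][0]["message"]["content"])
```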

So bear in mind:

  1. Does your use case require expertise on the problem, and can you fit this context in the prompt?
  2. How mature is the problem space? If it's not mature, expect hallucinations to be a problem and focus on how to avoid them.

Architecture, Business models and costs

Constraints inspire creativity, and necessity is the mother of invention. Traditionally, B2C apps from the last platform wave (mobile) were free and ad-driven. God forbid paying 99¢ a year for instant messaging and phone calls in one app.

The volumes of consumer markets will probably not permit many businesses to charge. Remember, YouTube and podcast apps will interrupt right then and there with an advert. Don't be surprised if, halfway through your AI answer, an advertisement gets printed out.

I think the more interesting evolution will be in B2B applications. SaaS per-seat pricing could be threatened. If you have a highly capable bot, the costs generated by usage will not be evenly distributed: you might have a handful of users running complex queries with multiple GPT-4 round trips (expensive) and dozens of users just retrieving info with single GPT-3 requests (cheaper). Further, the UX of a bot, particularly in Slack, enables the whole company to use it. We wouldn't be surprised to see the pendulum swing towards the transaction-based pricing widely used by developer platforms. Adopting elements of this means that price can scale with computation.
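To see why, here is a back-of-the-envelope sketch. The per-1K-token prices are the published rates at the time of writing and will change; the token counts are invented:

```python
# Cost scales with tokens used, not with the number of seats.
PRICE_PER_1K = {
    "gpt-4": {"prompt": 0.03, "completion": 0.06},
    "gpt-3.5-turbo": {"prompt": 0.002, "completion": 0.002},
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p = PRICE_PER_1K[model]
    return (prompt_tokens / 1000) * p["prompt"] + (completion_tokens / 1000) * p["completion"]

# A power user chaining five large GPT-4 calls vs. a casual single lookup:
heavy_query = 5 * request_cost("gpt-4", 6000, 1000)    # $1.20
light_query = request_cost("gpt-3.5-turbo", 500, 200)  # $0.0014
print(heavy_query, light_query)
```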

Costs are going down. Some use cases, e.g. classification or deduplication of content, can use cheaper models. Open source is becoming increasingly competitive.

So when settling on a business model and use case, bear in mind:

  • Models will get cheaper and more capable. A half-working POC today might be viable in a few months' time.
  • Transactional billing models aren't universally loved; customers do like predictable costs. Consider flat rates, usage calculators and how your customer typically buys software.

Notes:

  • James Briggs is also excellent on YouTube.
  • The blog post art is generated with DALL-E. It's my first time using it, and I found it very difficult to create something that reflected the content of the article. It's a very abstract topic, and I suppose there isn't much training art for how to build software companies.
  • GPT-4 is very good, and finding good examples of hallucinations was hard; every one of the answers it gave was correct.

Hugh Hopkins
CEO