By Martin Heller, Contributor

What is LangSmith? Tracing and debugging for LLMs

feature
Oct 12, 2023 | 9 mins
Generative AI | Machine Learning | Software Development

Use LangSmith to debug, test, evaluate, and monitor chains and intelligent agents in LangChain and other LLM applications.


In my recent introduction to LangChain, I touched briefly on LangSmith. Here, we’ll take a closer look at the platform, which works in tandem with LangChain and can also be used with other LLM frameworks.

My quick take on LangSmith is that you can use it to trace and evaluate LLM applications and intelligent agents and move them from prototype to production. Here’s what the LangSmith documentation says about it:

LangSmith is a platform for building production-grade LLM applications. It lets you debug, test, evaluate, and monitor chains and intelligent agents built on any LLM framework and seamlessly integrates with LangChain, the go-to open-source framework for building with LLMs.

As of this writing, LangChain is officially implemented in two programming languages, Python and JavaScript/TypeScript, along with a community port in Go. We’ll use the Python implementation for our examples.

LangSmith with LangChain

So, basics. I set up my LangSmith account, created my API key, updated my LangChain installation with pip, and set up my shell environment variables.
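
The environment variables are what tell LangChain to send traces to LangSmith. Here’s a minimal sketch of that configuration, set from Python rather than the shell; the variable names follow the LangSmith docs, and the key values and project name are placeholders:

import os

# Send LangChain traces to LangSmith (the key values here are placeholders).
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "default"  # optional; traces land in this project

# The quickstart also needs an OpenAI API key.
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"

With that in place, I tried to run the Python quickstart application: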


from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()
llm.predict("Hello, world!")

Before we discuss the results, take a look at the LangSmith hub:


Figure 1. The LangSmith hub acts as a repository for prompts, models, use cases, and other LLM artifacts.

Moving on to the next tab, here is the trace list from the default project:


Figure 2. The Python project trace list shows logs of my six attempts to run the quickstart. The first five were unsuccessful: the Python output indicated a timeout from OpenAI.

I took the hint from the timeouts, went to my OpenAI account, and upgraded my ChatGPT plan to ChatGPT Plus ($20 per month). That gave me access to GPT-4 and the ChatGPT plugins, but my program still didn’t run. I kept the subscription anyway, as I suspect I’ll need the additional capabilities.

Next, I remembered that the OpenAI API plan is separate from the ChatGPT plan, so I upgraded that as well, adding $10 to the account and setting it up to replenish itself as needed. Now the Python program ran to completion, and I was able to see the successful results in LangSmith:


Figure 3. A successful prediction run, finally. Note the Playground button at the top right of the screen.

Looking at the metadata tab for this run told me that it ran the “Hello, world!” prompt against the gpt-3.5-turbo model at a sampling temperature of 0.7. The OpenAI API accepts temperatures from 0 to 2: higher values make the output more random, while a value of 0 makes the model essentially deterministic, always favoring the most likely next token.
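
If you want something other than the defaults, you can set the model and temperature explicitly when you construct the chat model. A small sketch; the model name and temperature here are just examples:

from langchain.chat_models import ChatOpenAI

# Lower temperature for more deterministic output; the model name is an example.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.2)
llm.predict("Hello, world!")

Whatever you set shows up in the LangSmith metadata for the run, just as the defaults did in Figure 4.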


Figure 4. The metadata for a successful prediction. In addition to the YAML block at the bottom, there’s a JSON block with the same information.

Overview of LangSmith

LangSmith logs all calls to LLMs, chains, agents, tools, and retrievers in a LangChain or other LLM program. It can help you debug an unexpected end result, determine why an agent is looping, figure out why a chain is slower than expected, and tell you how many tokens an agent used.
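
You can also pull that logged data programmatically with the langsmith SDK instead of reading it in the web UI. Here’s a minimal sketch that prints the status and latency of a few runs in a project; it uses the SDK’s Client.list_runs method and assumes the same environment variables set up earlier:

import itertools
from langsmith import Client

client = Client()  # reads LANGCHAIN_API_KEY from the environment

# Print name, type, status, and latency for a handful of runs in a project.
for run in itertools.islice(client.list_runs(project_name="default"), 10):
    latency = (run.end_time - run.start_time).total_seconds() if run.end_time else None
    status = "error" if run.error else "ok"
    print(run.name, run.run_type, status, latency)

This is essentially the same data that backs the trace list shown in Figure 2.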

LangSmith provides a straightforward visualization of the exact inputs and outputs of every LLM call. You might think the input side would be simple, but you’d be wrong: in addition to the input variables (the prompt), an LLM call typically involves a prompt template, a system prompt that sets the context for the LLM, and often auxiliary inputs such as information retrieved from the web or from uploaded files.

In general, you should keep LangSmith turned on for all your work with LangChain; you only have to look at the logs when they matter. One useful thing to try, if an input prompt doesn’t give you the results you need, is to take the prompt to the Playground, shown in Figure 5 below. Use the Playground button at the top right of the LangSmith trace page (called out in Figure 3) to navigate there.


Figure 5. The LangSmith Playground allows you to interactively edit your input, change model and temperature, adjust other parameters, add function calls, add stop sequences, and add human, AI, system, function, and chat messages. This is a time saver compared to editing all of this in a Python program.

Don’t forget to add your API keys to the website using the Secrets & API Keys button. Note that playground runs are stored in a separate LangSmith project.

LangSmith LLMChain example

In my introduction to LangChain, I gave the example of an LLMChain that combines a ChatOpenAI call with a simple comma-separated list parser. Looking at the LangSmith log for this Python code helps us understand what’s happening in the program.

The parser is a subclass of the BaseOutputParser class. The system message template for the ChatOpenAI call is fairly standard prompt engineering.


from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.chains import LLMChain
from langchain.schema import BaseOutputParser

class CommaSeparatedListOutputParser(BaseOutputParser):
    """Parse the output of an LLM call to a comma-separated list."""
    
    def parse(self, text: str):
        """Parse the output of an LLM call."""
        return text.strip().split(", ")

template = """You are a helpful assistant who generates comma separated lists.
A user will pass in a category, and you should generate 5 objects in that category in a comma separated list.
ONLY return a comma separated list, and nothing more."""
system_message_prompt = SystemMessagePromptTemplate.from_template(template)
human_template = "{text}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])
chain = LLMChain(
    llm=ChatOpenAI(),
    prompt=chat_prompt,
    output_parser=CommaSeparatedListOutputParser()
)
chain.run("colors")

Figure 6. The Run tab for the top-level chain shows the human input, the parsed output, the latency (under a second), and the tokens used, as well as the clock time and call status.

Diving down to the ChatOpenAI LLM call provides additional information, shown in Figure 7.


Figure 7. At the LLM level, we see the system input and the output produced by the LLM before parsing.

We can glean even more information from the metadata, shown in Figure 8.


Figure 8. The metadata for the ChatOpenAI call tells us the model used (gpt-3.5-turbo), the sampling temperature (0.7), and the runtime version numbers.

LangSmith evaluation quickstart

This walkthrough evaluates a chain using a dataset of examples. First, it creates a dataset of example inputs; then it defines an LLM, chain, or agent for evaluation. After configuring and running the evaluation, it reviews the traces and feedback within LangSmith. Note that the dataset creation step can only be run as-is once, because the code doesn’t check whether a dataset with the same name already exists.
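
If you plan to rerun the walkthrough, one way to guard against that is to check for the dataset first. This is just a sketch, and it assumes the langsmith Client’s has_dataset and read_dataset methods:

from langsmith import Client

client = Client()
dataset_name = "Rap Battle Dataset"

# Reuse the dataset if it already exists; otherwise create it.
if client.has_dataset(dataset_name=dataset_name):
    dataset = client.read_dataset(dataset_name=dataset_name)
else:
    dataset = client.create_dataset(
        dataset_name=dataset_name, description="Rap battle prompts."
    )

With that caveat noted, here’s the walkthrough code: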


from langsmith import Client

example_inputs = [
  "a rap battle between Atticus Finch and Cicero",
  "a rap battle between Barbie and Oppenheimer",
  "a Pythonic rap battle between two swallows: one European and one African",
  "a rap battle between Aubrey Plaza and Stephen Colbert",
]

client = Client()
dataset_name = "Rap Battle Dataset"

# Storing inputs in a dataset lets us
# run chains and LLMs over a shared set of examples.
dataset = client.create_dataset(
    dataset_name=dataset_name, description="Rap battle prompts.",
)
for input_prompt in example_inputs:
    # Each example must be unique and have inputs defined.
    # Outputs are optional
    client.create_example(
        inputs={"question": input_prompt},
        outputs=None,
        dataset_id=dataset.id,
    )

from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain

# Since chains and agents can be stateful (they can have memory),
# create a constructor to pass in to the run_on_dataset method.
def create_chain():
    llm = ChatOpenAI(temperature=0)
    return LLMChain.from_string(llm, "Spit some bars about {input}.")

from langchain.smith import RunEvalConfig, run_on_dataset

eval_config = RunEvalConfig(
    evaluators=[
        # You can specify an evaluator by name/enum.
        # In this case, the default criterion is "helpfulness"
        "criteria",
        # Or you can configure the evaluator
        RunEvalConfig.Criteria("harmfulness"),
        RunEvalConfig.Criteria(
            {"cliche": "Are the lyrics cliche?"
                       " Respond Y if they are, N if they're entirely unique."}
        ),
    ]
)
run_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=create_chain,
    evaluation=eval_config,
    verbose=True,
    project_name="llmchain-test-1",
)

This example gives us a lot more to look at than the last one. The code above stores four prompts in a dataset, runs the chain against each of them, and then runs multiple evaluations against each generated rap battle result.

Here are the evaluation statistics, which were printed in the terminal during the run:


Eval quantiles:
             0.25  0.5  0.75  mean  mode
harmfulness  0.00  0.0   0.0  0.00   0.0
helpfulness  0.75  1.0   1.0  0.75   1.0
cliche       1.00  1.0   1.0  1.00   1.0

Because the criteria evaluators score each result as 0 or 1, these numbers tell us that none of the four results was judged harmful, three of the four were judged helpful, and all four were judged cliche. Somebody had fun creating the rap battle prompts, as shown in the dataset below:


Figure 9. The key-value dataset created by the client.create_dataset() call.

As an aside, I had to look up Aubrey Plaza, who played the deadpan comic character April Ludgate in Parks and Recreation.

This code used its own project name, llmchain-test-1, so that’s where we look for results:


Figure 10. The first line in each pair is the LLM chain result, and the second is the LLM result.

Here is the Barbie vs. Oppenheimer rap battle, as generated by gpt-3.5-turbo.


Figure 11. This is the end of the Barbie/Oppenheimer rap battle text generated by the LLM chain. It won’t win any prizes.

The LangSmith Cookbook

While the standard LangSmith documentation covers the basics, the LangSmith Cookbook repository delves into common patterns and real-world use-cases. You should clone or fork the repo to run the code. The cookbook covers tracing your code without LangChain (using the @traceable decorator); using the LangChain Hub to discover, share, and version control prompts; testing and benchmarking your LLM systems in Python and TypeScript or JavaScript; using user feedback to improve, monitor, and personalize your applications; exporting data for fine-tuning; and exporting your run data for exploratory data analysis.
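
The tracing-without-LangChain pattern is worth a quick illustration. Here’s a minimal sketch using the SDK’s @traceable decorator; the summarize function and its contents are made up for the example, and it assumes the same LangSmith environment variables used earlier:

from langsmith import traceable

@traceable(run_type="chain")  # logs inputs, outputs, timing, and errors to LangSmith
def summarize(text: str) -> str:
    # In the cookbook this would wrap an LLM call; any Python function traces the same way.
    return text[:80]

summarize("LangSmith traces this call even though no LangChain code is involved.")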

Conclusion

LangSmith is a platform that works in tandem with LangChain or by itself. In this article, you’ve seen how to use LangSmith to debug, test, evaluate, and monitor chains and intelligent agents in a production-grade LLM application.


Martin Heller is a contributing editor and reviewer for InfoWorld. Formerly a web and Windows programming consultant, he developed databases, software, and websites from his office in Andover, Massachusetts, from 1986 to 2010. More recently, he has served as VP of technology and education at Alpha Software and chairman and CEO at Tubifi. Disclosure: He also writes for Hewlett-Packard’s TechBeacon marketing website.
