Engineering
How we saved hundreds of engineering hours by writing tests with LLMs

John Wang
Co-Founder and CTO

At Assembled, engineering velocity is our competitive edge. We pride ourselves on delivering new features at a fast pace. But how do we maintain quality without slowing down? The answer lies in robust testing. As Martin Fowler aptly puts it:

[Testing] can drastically reduce the number of bugs that get into production… But the biggest benefit isn't about merely avoiding production bugs, it's about the confidence that you get to make changes to the system.

Martin Fowler

Despite this, writing comprehensive tests is often overlooked due to time constraints or the complexity involved. Large Language Models (LLMs) have shifted this dynamic by making it significantly easier and faster to generate robust tests. Tasks that previously required hours can now be completed in just 5–10 minutes.

We've observed tangible benefits within our team:

  • An engineer who previously wrote few tests began consistently writing them after utilizing LLMs for test generation.
  • Another engineer, known for writing thorough tests, saved weeks of time by using LLMs to streamline the process.
  • Collectively, our engineers have saved hundreds of hours, reallocating that time to developing new features and refining existing ones.

In this blog post, we'll explore how we’ve used LLMs to enhance our testing practices.

Leveraging LLMs for testing

To get started, you'll need access to a high-quality LLM for code generation, such as OpenAI's o1-preview or Anthropic's Claude 3.5 Sonnet.

Then, you should craft a precise prompt that guides the model to produce the desired output. Here's a sample prompt we've found effective for generating Go unit tests:

Help me write a comprehensive set of unit tests in Golang for the following function:

<function_to_test>
// Insert your function code here
</function_to_test>

Here are the definitions of the associated structs used in the function:

<struct_definitions>
// Optionally insert any relevant struct definitions here
</struct_definitions>

Please ensure that:
- The tests use the fixture pattern by defining different test cases in a slice.
- The tests follow Go's testing best practices, including proper naming conventions and code organization.
- The tests use the `testing` and `require` packages as shown in the example below.
- The tests cover various scenarios, including normal cases, edge cases, and error handling.

<test_example>
// Include an example of a good unit test from your codebase
</test_example>

In this prompt, you need to provide:

  • Function to test: Copy and paste the exact code you’re looking to write tests for.
  • Struct definitions: Include any relevant definitions that the function uses (especially for any objects that appear in the input or output of the function).
  • Example of a test suite: existing tests that reflect your codebase's style and conventions (see the illustrative snippet below).
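
For instance, the `<test_example>` block might contain a short, idiomatic test from your codebase. The one below is invented purely for illustration (the `NormalizeEmail` function is hypothetical), but it shows the kind of baseline that steers the model toward the fixture pattern and the `require` package:

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// Baseline example demonstrating the table-driven fixture style we want the
// LLM to imitate.
func TestNormalizeEmail(t *testing.T) {
	fixtures := []struct {
		Name     string
		Input    string
		Expected string
	}{
		{Name: "Lowercases the address", Input: "User@Example.COM", Expected: "user@example.com"},
		{Name: "Trims surrounding whitespace", Input: "  user@example.com ", Expected: "user@example.com"},
	}

	for _, fixture := range fixtures {
		t.Run(fixture.Name, func(t *testing.T) {
			require.Equal(t, fixture.Expected, NormalizeEmail(fixture.Input))
		})
	}
}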

Once you’ve dropped this into an LLM and generated a result, review and refine the output: check for compilation issues, add any edge cases the LLM missed, and adjust the style to match your codebase's conventions. We’ve found that a few rounds of back and forth are sometimes necessary to arrive at an acceptable test suite. Once you’re close enough, copy and paste the resulting tests back into your codebase.
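
For Go in particular, a quick way to surface compilation problems and failing assertions before a closer human review is to run the generated suite directly (the package path and test name below are placeholders):

go test ./path/to/package -run TestYourFunction -v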

If you have an AI-assisted code editor like Copilot or Cursor, the principles remain the same, though because these tools can provide context-aware suggestions based on your existing code, you can often get away with less detailed prompts.

Example in action

Suppose you're building an e-commerce platform and have a function that calculates an order summary. Here's how you might apply the above approach.

// Struct definitions
type OrderItem struct {
    ProductID   string
    Quantity    int
    UnitPrice   float64
    Weight      float64 // Weight per unit in kg
    Category    string
}

type OrderSummary struct {
    TotalPrice      float64
    TotalWeight     float64
    ItemsByCategory map[string]int // Category name to total quantity
}

// Function to test
func CalculateOrderSummary(items []OrderItem) OrderSummary {
    itemsByCategory := make(map[string]int)
    totalPrice := 0.0
    totalWeight := 0.0

    for _, item := range items {
        totalItemPrice := float64(item.Quantity) * item.UnitPrice
        totalItemWeight := float64(item.Quantity) * item.Weight

        totalPrice += totalItemPrice
        totalWeight += totalItemWeight

        itemsByCategory[item.Category] += item.Quantity
    }

    summary := OrderSummary{
        TotalPrice:      totalPrice,
        TotalWeight:     totalWeight,
        ItemsByCategory: itemsByCategory,
    }
    return summary
}

Using the suggested prompt, we fed this code into ChatGPT o1-preview and, in just 48 seconds, it generated a comprehensive test suite that was ready to use straight out of the box. Here’s the full prompt and results from ChatGPT.

You’ll notice that the resulting tests are both comprehensive and well written:

  • The tests cover essentially every case you might think of: empty slices, nil slices, a single item, multiple items, items with zero quantity, and so on. These test cases are mutually exclusive and collectively exhaustive, and they cover most of the edge cases a good engineer would think of (a reconstruction of the elided fixtures follows the listing below).
  • Moreover, the code follows the table-driven fixture style that is idiomatic in Go, exactly the format we specified in the initial prompt. The tests even use the testify/require library, as prescribed in the original example.

import (
	"testing"

	"github.com/stretchr/testify/require"
)

func TestCalculateOrderSummary(t *testing.T) {
	fixtures := []struct {
		Name     string
		Items    []OrderItem
		Expected OrderSummary
	}{
	  ...
		{
			Name: "Multiple items in different categories",
			Items: []OrderItem{
				{
					ProductID: "P1",
					Quantity:  2,
					UnitPrice: 5.0,
					Weight:    0.2,
					Category:  "Books",
				},
				{
					ProductID: "P2",
					Quantity:  1,
					UnitPrice: 100.0,
					Weight:    1.0,
					Category:  "Electronics",
				},
			},
			Expected: OrderSummary{
				TotalPrice:  (2 * 5.0) + (1 * 100.0),
				TotalWeight: (2 * 0.2) + (1 * 1.0),
				ItemsByCategory: map[string]int{
					"Books":       2,
					"Electronics": 1,
				},
			},
		},
		...
	}

	for _, fixture := range fixtures {
		t.Run(fixture.Name, func(t *testing.T) {
			result := CalculateOrderSummary(fixture.Items)
			require.Equal(t, fixture.Expected.TotalPrice, result.TotalPrice, "TotalPrice mismatch")
			require.Equal(t, fixture.Expected.TotalWeight, result.TotalWeight, "TotalWeight mismatch")
			require.Equal(t, fixture.Expected.ItemsByCategory, result.ItemsByCategory, "ItemsByCategory mismatch")
		})
	}
}
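
The `...` markers stand in for the rest of the generated fixtures. To give a sense of the cases they cover, here are two reconstructed entries for the same fixtures slice (written for illustration, not the model's verbatim output):

		{
			Name:  "Empty slice of items",
			Items: []OrderItem{},
			Expected: OrderSummary{
				TotalPrice:      0,
				TotalWeight:     0,
				ItemsByCategory: map[string]int{},
			},
		},
		{
			Name: "Item with zero quantity",
			Items: []OrderItem{
				{ProductID: "P3", Quantity: 0, UnitPrice: 10.0, Weight: 0.5, Category: "Books"},
			},
			Expected: OrderSummary{
				TotalPrice:  0,
				TotalWeight: 0,
				// The category key is present even when quantity is zero,
				// because the function always writes to the map.
				ItemsByCategory: map[string]int{"Books": 0},
			},
		},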

Extending to more complex scenarios

The same approach can be applied to more complex testing scenarios: by adjusting the prompt and providing a different set of baseline test cases, you can generate tests for more involved code as well.
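
For instance (a hypothetical sketch, condensed into one snippet for brevity; the function, interface, and stub below are invented for this post rather than taken from our codebase), the same prompt structure works for functions with injected dependencies: include the interface definition alongside the structs, provide a baseline test that uses a hand-written stub, and ask the model to generate a table of cases against that stub.

import (
	"fmt"
	"testing"

	"github.com/stretchr/testify/require"
)

// Hypothetical function with an injected dependency.
type RateProvider interface {
	RateFor(category string) (float64, error)
}

func ApplyTax(price float64, category string, rates RateProvider) (float64, error) {
	rate, err := rates.RateFor(category)
	if err != nil {
		return 0, err
	}
	return price * (1 + rate), nil
}

// stubRates is a minimal in-memory implementation used only in tests.
type stubRates struct {
	rates map[string]float64
}

func (s stubRates) RateFor(category string) (float64, error) {
	rate, ok := s.rates[category]
	if !ok {
		return 0, fmt.Errorf("unknown category %q", category)
	}
	return rate, nil
}

func TestApplyTax(t *testing.T) {
	provider := stubRates{rates: map[string]float64{"Books": 0.05}}

	fixtures := []struct {
		Name      string
		Price     float64
		Category  string
		Expected  float64
		ExpectErr bool
	}{
		{Name: "Known category applies its rate", Price: 100.0, Category: "Books", Expected: 105.0},
		{Name: "Unknown category returns an error", Price: 100.0, Category: "Food", ExpectErr: true},
	}

	for _, fixture := range fixtures {
		t.Run(fixture.Name, func(t *testing.T) {
			result, err := ApplyTax(fixture.Price, fixture.Category, provider)
			if fixture.ExpectErr {
				require.Error(t, err)
				return
			}
			require.NoError(t, err)
			require.InDelta(t, fixture.Expected, result, 1e-9)
		})
	}
}

The hand-written stub keeps the dependency deterministic, which makes it much easier for the model to propose sensible expected values for each case.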

Considerations

At Assembled, we’ve been using LLMs to write tests for a few months now and have seen big boosts in engineering productivity. That said, there are a few considerations to keep in mind as you start using LLMs for test writing:

  • Iterative refinement: You may need several iterations to cover missed edge cases or to match your codebase's standards. Sometimes the LLM generates code that doesn’t compile, so asking it to make adjustments is critical.
  • Double check your test logic: While LLMs are pretty good out of the box, they can sometimes get tests wrong. For example, one of our engineers had an experience where the model gave incorrect output because of improper formatting. We insist that all Assembled engineers read and run any LLM-generated tests before merging into production.
  • Customize your prompt to your specific context: Our engineers have found that tailoring their prompts can significantly enhance the quality of the generated tests. For example, you might consider specifying your test frameworks (e.g. “Use Jest and React Testing Library for testing this React component.”) or highlighting important edge cases (e.g. “Ensure you include tests for handling null inputs and maximum integer values.”).
  • Examples matter: LLMs do their best work when they have a good example of tests to learn from. The engineering team at Assembled has built a large repository of comprehensive and idiomatic tests over time, which makes it easier to use these techniques. Remember that your examples are often your most important way to drive the LLM to do what you want.
  • Use the smartest models: Models like o1-preview or Claude 3.5 Sonnet generally provide better results. Since latency isn't a major concern, we tend to use the best available models.
  • Code structure reflects testability: If you’re having trouble getting the LLM to construct suitable tests, consider refactoring your code; the function's inputs and outputs are probably poorly structured or overly complex. You can even ask the LLM to break things up and refactor your code using the same prompting principles discussed above (see the sketch after this list).
  • Don’t overdo testing: You generally want to test the functions that have clear input / output and which contain the most important pieces of logic. You don’t need to test that a checkbox is working correctly (unless you’re the maintainer of a component library). Likewise, glue code is tough to test, and writing tests for some pretty straightforward glue code may not be worth it — though you should check on a case-by-case basis (e.g., if that glue code is a very hot codepath).
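
As a hypothetical illustration of those last two points (the handler and discount rule below are invented for this post, building on the earlier OrderItem example), pulling the pricing logic out of an HTTP handler leaves a pure function that is easy to paste into a prompt and cover with table-driven tests, while the remaining glue stays thin enough that testing it may not be worth the effort:

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Before: the discount rule is buried inside an HTTP handler, so exercising it
// means constructing requests and recording responses.
func handleQuote(w http.ResponseWriter, r *http.Request) {
	var items []OrderItem
	if err := json.NewDecoder(r.Body).Decode(&items); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	total := CalculateOrderSummary(items).TotalPrice
	if total > 100 {
		total *= 0.9 // 10% discount on large orders
	}
	fmt.Fprintf(w, "%.2f", total)
}

// After: the rule is a pure function with obvious inputs and outputs, which is
// easy to describe in a prompt and easy for an LLM to cover exhaustively.
func ApplyVolumeDiscount(total float64) float64 {
	if total > 100 {
		return total * 0.9
	}
	return total
}

// The handler keeps only the decoding and response glue.
func handleQuoteRefactored(w http.ResponseWriter, r *http.Request) {
	var items []OrderItem
	if err := json.NewDecoder(r.Body).Decode(&items); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	fmt.Fprintf(w, "%.2f", ApplyVolumeDiscount(CalculateOrderSummary(items).TotalPrice))
}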

Conclusion

Using LLMs to generate comprehensive test suites in minutes has been a game changer at Assembled. It lowers the activation energy to write tests and makes it less likely that engineers skip them due to time constraints. The result is a cleaner, safer codebase and higher development velocity.

We’re hiring

We’ve got a lot of features to build and tests to write. If you’re interested in helping us transform customer support, check out our open roles.

Thanks to Jake Chao, Mae Cromwell, and Whitney Rose for helping with drafts of this post.