What is Load Testing?
Load testing is the practice of simulating real-world traffic on a software system to understand how it behaves under expected and peak conditions. Unlike functional testing — which checks whether features work correctly — load testing answers the question: how many users can this system handle before it falls over?
Downtime is expensive: industry estimates put the cost of a high-profile outage at roughly $5,600 per minute in lost revenue and damaged customer trust. Most such outages are preventable — teams that run systematic load tests before major launches and deployments catch the bottlenecks weeks before real users do. Load testing is not a luxury reserved for FAANG companies; it is table stakes for any production system serving more than a few hundred users.
Specifically, a regular load testing practice lets you:

- Catch breaking points before real users hit them during launches or traffic spikes.
- Confirm that p95/p99 latency and error-rate targets hold at projected user volumes.
- Determine how many servers, pods, or DB connections you actually need.
- Expose slow database queries, N+1 problems, memory leaks, and connection pool exhaustion.
- Compare performance before and after a refactor, upgrade, or infrastructure change.
- Ship with confidence: your team and stakeholders know the system has been pressure-tested.
Think of load testing as a fire drill — you run it regularly, deliberately, in controlled conditions, so that when a real fire comes (a viral post, a Product Hunt launch, a Black Friday sale), your team knows exactly what will break and at what threshold. The goal is not to pass a test; the goal is to learn where your system's limits are and decide whether those limits are acceptable.
Types of Performance Tests
"Load test" is often used as an umbrella term, but there are five distinct test types, each answering a different question. Running the wrong type gives you misleading confidence.
**Load test.** The baseline test: simulate your anticipated peak concurrent users and measure throughput, latency, and error rates. Typically runs at 1x–1.5x your observed peak traffic for 10–30 minutes to verify steady-state performance.
When to run: before every major release, after infrastructure changes, and after adding a new high-traffic feature.
Load profile: ramp up → steady state at target load → ramp down.
**Stress test.** Push the system beyond its rated capacity — 2x, 5x, 10x normal load — until it degrades or fails. The goal is to find the breaking point and observe the failure mode: does it return errors gracefully, queue and recover, or deadlock and require a restart?
When to run: when establishing capacity limits for the first time, after significant architectural changes, and before signing SLA contracts.
Load profile: ramp up aggressively until errors spike or latency becomes unacceptable.
**Spike test.** Jump instantaneously from baseline to 5–10x load with no ramp-up period. Tests whether auto-scaling kicks in fast enough, whether connection pools handle the sudden surge, and whether queues absorb the burst without cascading failures.
When to run: if you expect unpredictable spikes (viral social media, flash sales, sports events), and after enabling auto-scaling to verify it actually works.
Load profile: low baseline → instant spike to maximum → return to baseline.
**Soak test.** Run moderate load (50–70% of peak) for 4–24 hours or longer. Designed to catch memory leaks, connection pool exhaustion, disk space accumulation, cache poisoning, and other time-dependent degradation that short tests miss entirely.
When to run: before the first production launch, monthly as a health check, and whenever production metrics make you suspect a memory leak.
Load profile: steady moderate load for an extended duration (4h, 8h, 24h).
**Step (scalability) test.** Systematically increase load in small increments and measure which metric degrades first at each step. Produces a capacity curve of throughput vs. latency, letting you identify the precise inflection point. More controlled and data-rich than a basic stress test; a minimal k6 sketch of this profile appears below.
When to run: capacity planning, choosing between architectural options, and justifying infrastructure spend to stakeholders.
Load profile: incremental steps (10% → 20% → 30% → … of max), measuring at each plateau.
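To make the step profile concrete, here is a minimal k6 sketch. The endpoint URL and the assumed 1,000-VU ceiling are placeholders, not recommendations; size the steps to your own capacity estimate.

```javascript
// step-test.js: step-load profile, 10% → 20% → 30% of an assumed 1,000-VU maximum
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 100 }, // step up to 10%
    { duration: '3m', target: 100 }, // hold and measure this plateau
    { duration: '1m', target: 200 }, // step up to 20%
    { duration: '3m', target: 200 },
    { duration: '1m', target: 300 }, // step up to 30%
    { duration: '3m', target: 300 },
    // ...keep stepping until throughput flattens or latency degrades
  ],
};

export default function () {
  http.get('https://api.yourapp.com/products'); // placeholder endpoint
  sleep(1);
}
```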
Tool Comparison: k6, JMeter, Locust, Artillery, Gatling
Five tools dominate the modern load testing landscape. Each has a distinct philosophy, scripting model, and performance profile. Here is how they compare across the dimensions that matter most.
| Tool | Language | Protocol Support | Max VUs (single node) | CI/CD Friendly | Cloud Execution |
|---|---|---|---|---|---|
| k6 | JavaScript / TypeScript | HTTP/1.1, HTTP/2, WebSocket, gRPC | 50,000+ | Excellent | k6 Cloud, Grafana |
| JMeter | GUI / Groovy / Java | HTTP, HTTPS, JDBC, FTP, SOAP, SMTP | 5,000–10,000 | Good (JMeter Maven Plugin) | BlazeMeter, Azure Load Testing |
| Locust | Python | HTTP, WebSocket (custom clients) | 10,000+ (multiple worker processes) | Good | Self-hosted distributed mode |
| Artillery | YAML / JavaScript | HTTP, WebSocket, Socket.io, Kinesis | 5,000–20,000 | Excellent (npm package) | Artillery Cloud, AWS Lambda |
| Gatling | Scala / Java / Kotlin | HTTP/1.1, HTTP/2, WebSocket, JMS | 30,000+ | Good (Maven/Gradle plugin) | Gatling Enterprise |
**k6 strengths:** Extremely efficient — a single k6 process can simulate 50,000+ virtual users with low CPU overhead. Scripts are plain JavaScript/TypeScript. First-class CI/CD output (JUnit XML, Prometheus, Datadog). Thresholds are built in, so tests fail automatically when SLOs are breached. Free, open source, backed by Grafana Labs.
**k6 limitations:** No browser automation. Runs JS in a custom runtime, so some Node.js APIs are unavailable. Distributed execution requires k6 Cloud (paid) or manual setup.
**JMeter strengths:** Supports virtually every protocol out of the box. The GUI makes it accessible to non-developers. Massive plugin ecosystem. Widely accepted in enterprise procurement processes. Excellent distributed testing with JMeter server nodes.
**JMeter limitations:** The GUI-based workflow is clunky for code review and version control. Java heap memory limits VU count per node. XML-based test plans are difficult to maintain as code. The thread-per-VU model consumes more resources than k6 or Gatling.
**Locust strengths:** Real Python — full access to the Python ecosystem for data manipulation, custom auth logic, and complex test scenarios. Excellent distributed mode. Clean web UI for real-time monitoring. Easy to extend with custom clients.
**Locust limitations:** Python's GIL limits per-process concurrency, so large loads require many worker processes. Slower than k6 or Gatling for pure HTTP load generation. Less CI/CD tooling out of the box.
**Artillery strengths:** YAML test definitions are readable by non-engineers. Ships as an npm package, so it is trivially added to any Node.js project. Built-in support for WebSocket and Socket.io. Serverless execution via AWS Lambda delivers massive distributed load without managing servers.
**Artillery limitations:** YAML becomes verbose for complex scenarios. Less community content than k6 or JMeter. Cloud execution features require a paid tier.
**Gatling strengths:** Asynchronous Netty-based engine with the highest VU-per-core ratio among JVM tools. Excellent HTML reports with detailed percentile charts. Strong HTTP/2 support. The type-safe Scala/Kotlin DSL catches scripting errors at compile time.
**Gatling limitations:** The Scala/Kotlin DSL has a learning curve. Slower to iterate than k6. Advanced features (distributed execution, CI dashboards) require the paid Gatling Enterprise.
Key Metrics & SLOs
Collecting data is not the hard part — knowing which numbers matter and what targets to set is. These are the seven metrics every load test should capture, plus industry-standard thresholds to guide your SLO definitions.
| Metric | Definition | Good Threshold | Warning Sign |
|---|---|---|---|
| RPS / TPS | Requests (or transactions) per second processed by the system | ≥ target throughput at peak load | RPS plateaus or drops while errors rise |
| p50 Latency (Median) | 50th percentile response time — what the typical user experiences | < 100ms for APIs, < 500ms for web pages | Above 200ms for simple read APIs |
| p95 Latency | 95% of requests complete within this time | < 300ms for APIs, < 1s for web pages | Above 1s for any user-facing endpoint |
| p99 Latency | 99% of requests complete within this time — exposes tail latency | < 500ms for APIs, < 2s for web pages | Above 2s for critical paths |
| Error Rate | Percentage of requests returning 4xx/5xx responses | < 0.1% under normal load, < 1% under peak | Any sustained error rate above 0.5% |
| Throughput | Total data transferred per second (KB/s or MB/s) | Stable across test duration at target load | Degrading throughput with rising latency |
| Virtual Users (VUs) | Concurrent simulated users making requests | Matches or exceeds your peak concurrency target | System instability before reaching VU target |
Setting SLOs From Load Test Data
- **Set targets before you test.** Write down your SLO targets before you see any data. If you set targets after seeing results, you are rationalizing, not measuring. Example: "The checkout API must serve p99 latency under 400ms at 500 concurrent users with an error rate below 0.1%."
- **Track endpoints separately.** GET /products and POST /orders have very different performance characteristics. Aggregating all endpoints into one average hides the endpoints that matter most — usually writes and search.
- **Use percentiles, not averages.** Average latency is meaningless for user experience. A p50 of 50ms with a p99 of 5,000ms means 1 in 100 users waits 5 seconds — completely unacceptable, but invisible in averages (see the sketch after this list).
- **Automate the gate.** Define thresholds in your load test script so the test exits with a non-zero code when SLOs are breached. This gates deployments in CI/CD without requiring manual analysis of every run; the k6 script in the next section does exactly this.
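To see why averages mislead, here is a small self-contained sketch using nearest-rank percentiles over fabricated latency samples; the numbers are invented for illustration.

```javascript
// percentiles.js: how p50/p99 are computed, and how the average hides the tail
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1; // nearest-rank method
  return sorted[Math.max(0, idx)];
}

const latenciesMs = [48, 50, 52, 55, 60, 70, 95, 120, 300, 5000]; // made-up samples
const avg = latenciesMs.reduce((a, b) => a + b, 0) / latenciesMs.length;

console.log(percentile(latenciesMs, 50)); // 60   (the median looks healthy)
console.log(percentile(latenciesMs, 99)); // 5000 (the tail a real user hits)
console.log(avg);                         // 585  (representative of nobody)
```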
Writing Load Test Scripts
Below are production-ready script examples for k6, Artillery, and Locust — the three most commonly used tools in modern engineering teams.
k6 — Load Test with Thresholds and Stages
// k6 load test — e-commerce API
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate } from 'k6/metrics';

const errorRate = new Rate('errors'); // custom metric fed by the checks below

export const options = {
  stages: [
    { duration: '2m', target: 100 }, // Ramp up to 100 VUs
    { duration: '5m', target: 100 }, // Hold at 100 VUs (steady state)
    { duration: '2m', target: 500 }, // Spike to 500 VUs
    { duration: '5m', target: 500 }, // Hold at 500 VUs
    { duration: '2m', target: 0 },   // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<300', 'p(99)<500'], // SLO thresholds
    errors: ['rate<0.01'],          // < 1% on the custom error metric
    http_req_failed: ['rate<0.01'], // < 1% on k6's built-in failure rate
  },
};

const BASE_URL = 'https://api.yourapp.com';

export function setup() {
  // Authenticate once; the returned token is shared with every VU
  const res = http.post(`${BASE_URL}/auth/login`, JSON.stringify({
    email: 'loadtest@example.com',
    password: __ENV.LOAD_TEST_PASSWORD,
  }), { headers: { 'Content-Type': 'application/json' } });
  return { token: res.json('accessToken') };
}

export default function (data) {
  const headers = {
    'Authorization': `Bearer ${data.token}`,
    'Content-Type': 'application/json',
  };

  // Simulate a user browsing products
  const productRes = http.get(`${BASE_URL}/products?page=1&limit=20`, { headers });
  check(productRes, {
    'products status 200': (r) => r.status === 200,
    'products response time OK': (r) => r.timings.duration < 300,
  });
  errorRate.add(productRes.status !== 200);
  sleep(1);

  // View a specific product (fall back to a fixed ID if the list came back empty)
  const productId = productRes.json('data')?.[0]?.id ?? 'prod_001';
  const detailRes = http.get(`${BASE_URL}/products/${productId}`, { headers });
  check(detailRes, { 'product detail 200': (r) => r.status === 200 });
  sleep(2);

  // Add to cart
  const cartRes = http.post(`${BASE_URL}/cart/items`, JSON.stringify({
    productId,
    quantity: 1,
  }), { headers });
  check(cartRes, { 'add to cart 200': (r) => r.status === 200 });
  errorRate.add(cartRes.status !== 200);
  sleep(1);
}

Artillery — YAML-Based API Test
# artillery.yml — API load test config
config:
  target: "https://api.yourapp.com"
  phases:
    - duration: 60      # seconds
      arrivalRate: 10   # new users per second
      name: Warm up
    - duration: 300
      arrivalRate: 50
      name: Sustained load
    - duration: 120
      arrivalRate: 200
      name: Peak spike
  defaults:
    headers:
      Content-Type: "application/json"
  variables:
    token: "{{ $env.LOAD_TEST_TOKEN }}"
  plugins:
    expect: {}          # enables the expect assertions used below
  ensure:
    p99: 500            # fail the run if p99 exceeds 500ms
    maxErrorRate: 1     # fail the run if error rate > 1%
scenarios:
  - name: Browse and add to cart
    weight: 70          # 70% of traffic follows this scenario
    flow:
      - get:
          url: "/products?page=1"
          headers:
            Authorization: "Bearer {{ token }}"
          expect:
            - statusCode: 200
      - think: 2        # pause 2s, like a real user reading the page
      - post:
          url: "/cart/items"
          headers:
            Authorization: "Bearer {{ token }}"
          json:
            productId: "prod_001"
            quantity: 1
          expect:
            - statusCode: 200
  - name: Search only
    weight: 30
    flow:
      - get:
          url: "/search?q=laptop"
          headers:
            Authorization: "Bearer {{ token }}"
          expect:
            - statusCode: 200

Locust — Python-Based Distributed Test
# locustfile.py — distributed load test
from locust import HttpUser, task, between
import random
import os

class EcommerceUser(HttpUser):
    wait_time = between(1, 3)  # Wait 1-3s between tasks
    token = None

    def on_start(self):
        """Authenticate before running tasks"""
        res = self.client.post("/auth/login", json={
            "email": "loadtest@example.com",
            "password": os.environ["LOAD_TEST_PASSWORD"],
        })
        self.token = res.json()["accessToken"]
        self.client.headers.update({"Authorization": f"Bearer {self.token}"})

    @task(5)
    def browse_products(self):
        """Most common action — weight 5"""
        page = random.randint(1, 10)
        with self.client.get(
            f"/products?page={page}&limit=20",
            name="/products [paginated]",
            catch_response=True,
        ) as res:
            if res.status_code != 200:
                res.failure(f"Expected 200, got {res.status_code}")
            elif res.elapsed.total_seconds() > 0.5:
                res.failure(f"Too slow: {res.elapsed.total_seconds():.2f}s")

    @task(3)
    def search(self):
        """Search — weight 3"""
        terms = ["laptop", "phone", "headphones", "monitor"]
        q = random.choice(terms)
        self.client.get(f"/search?q={q}", name="/search")

    @task(2)
    def add_to_cart(self):
        """Add to cart — weight 2"""
        self.client.post("/cart/items", json={
            "productId": f"prod_{random.randint(1, 100):03d}",
            "quantity": random.randint(1, 3),
        }, name="/cart/items [POST]")

    @task(1)
    def checkout(self):
        """Checkout — weight 1 (least frequent)"""
        self.client.post("/orders", json={
            "paymentMethodId": "pm_test_visa",
        }, name="/orders [POST]")

Test Scenarios & Strategies
A realistic test scenario is the difference between a load test that finds real problems and one that gives you false confidence. These strategies ensure your tests reflect actual user behavior.
Record real user sessions from production using browser HAR files or API gateway logs and replay them. This captures the exact mix of endpoints, think times, and data patterns that real users exhibit — including the rare but expensive paths like checkout and search.
Never use the same user ID, product ID, or search query for every virtual user. Static test data creates artificially hot cache entries and completely distorts database query performance. Use CSV data files or generated data to ensure each VU uses unique, realistic inputs.
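In k6, for example, each VU can pull its own row from a shared data file. A minimal sketch, assuming a users.json file with the fields shown:

```javascript
// data-driven.js: SharedArray loads the file once and shares it across all VUs
import { SharedArray } from 'k6/data';
import http from 'k6/http';

const users = new SharedArray('users', function () {
  // e.g. [{ "productId": "prod_017", "query": "laptop" }, ...]
  return JSON.parse(open('./users.json'));
});

export default function () {
  // __VU is the 1-based virtual user number, so rows are spread across VUs
  const row = users[(__VU - 1) % users.length];
  http.get(`https://api.yourapp.com/products/${row.productId}`); // placeholder URL
  http.get(`https://api.yourapp.com/search?q=${row.query}`);
}
```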
Not all users do the same thing. In most e-commerce apps, 60–70% of users just browse, 20–30% search, and only 5–10% complete a purchase. Model this distribution using scenario weights so your API endpoints receive realistic relative load.
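One way to encode such a split in k6 is the scenarios option; the Artillery and Locust scripts above achieve the same thing with scenario weights and @task weights. A sketch with an illustrative 70/20/10 browse/search/checkout mix:

```javascript
// traffic-mix.js: three user behaviors running concurrently at a 70/20/10 split
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  scenarios: {
    browsers:  { executor: 'constant-vus', vus: 70, duration: '10m', exec: 'browse' },
    searchers: { executor: 'constant-vus', vus: 20, duration: '10m', exec: 'search' },
    buyers:    { executor: 'constant-vus', vus: 10, duration: '10m', exec: 'checkout' },
  },
};

const BASE = 'https://api.yourapp.com'; // placeholder

export function browse() {
  http.get(`${BASE}/products?page=1`);
  sleep(1 + Math.random() * 4); // randomized 1-5s think time
}

export function search() {
  http.get(`${BASE}/search?q=laptop`);
  sleep(1 + Math.random() * 4);
}

export function checkout() {
  http.post(`${BASE}/orders`, JSON.stringify({ paymentMethodId: 'pm_test_visa' }),
    { headers: { 'Content-Type': 'application/json' } });
  sleep(1 + Math.random() * 4);
}
```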
Real users pause between actions — they read a product description, type a search query, or hesitate before clicking buy. Including sleep() calls between requests (1–5 seconds, varying randomly) dramatically changes the concurrency model and produces more realistic connection pool and session management behavior.
JIT compilation, cache warming, and connection pool establishment all happen during the first few minutes of a test. Treat the first 2–3 minutes as a warm-up period and exclude those data points from your SLO evaluation. k6 stages make this straightforward.
CI/CD Integration
The highest-value change you can make to your load testing practice is running tests automatically in your deployment pipeline. A test that only runs manually before quarterly releases catches problems weeks too late.
GitHub Actions — k6 Load Test on Every Pull Request
# .github/workflows/load-test.yml
name: Load Test
on:
  pull_request:
    branches: [main]
  workflow_dispatch:
    inputs:
      target_rps:
        description: 'Target requests per second'
        default: '100'
jobs:
  load-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start staging environment
        run: |
          mkdir -p results  # directory for the k6 summary export below
          docker compose -f docker-compose.staging.yml up -d
          sleep 15  # Wait for services to be healthy
      - name: Run k6 load test
        uses: grafana/k6-action@v0.3.1
        with:
          filename: tests/load/smoke-test.js
          # --summary-export produces the JSON that the PR comment step reads
          flags: --env BASE_URL=http://localhost:3000 --summary-export=results/summary.json
        env:
          LOAD_TEST_PASSWORD: ${{ secrets.LOAD_TEST_PASSWORD }}
      - name: Upload results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: load-test-results
          path: results/
      - name: Comment PR with results
        uses: actions/github-script@v7
        if: always()
        with:
          script: |
            const fs = require('fs');
            const summary = fs.readFileSync('results/summary.json', 'utf8');
            const data = JSON.parse(summary);
            // p(99) is present only if the k6 script adds it to summaryTrendStats
            const p99 = data.metrics.http_req_duration.values['p(99)'].toFixed(0);
            const errorRate = (data.metrics.http_req_failed.values.rate * 100).toFixed(2);
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Load Test Results
            - p99 latency: ${p99}ms
            - Error rate: ${errorRate}%`,
            });

Three-Tier Load Testing Strategy for CI/CD
**Tier 1: smoke test on every pull request.** Verify the system does not crash under minimal load after every code change. Catches obvious regressions immediately and is fast enough not to slow down the PR review cycle.
**Tier 2: full load test before every production deployment.** Validate SLO compliance at expected peak load. Should block the deployment if thresholds are breached.
**Tier 3: scheduled soak test.** Detect memory leaks, connection pool exhaustion, and other time-dependent degradation. Runs on a schedule rather than blocking deployments; alert on-call if thresholds are breached.
Interpreting Results & Finding Bottlenecks
Raw numbers from a load test are a starting point, not an answer. The skill is reading the patterns and correlating metrics to identify root causes. Here are the most common failure patterns and how to diagnose them.
**Sequential bottleneck.** Throughput stops scaling because work serializes somewhere — typically a database query without an index, lock contention, or a single-threaded queue processor.
Diagnose: EXPLAIN ANALYZE the slow queries. Check lock wait times in your DB. Profile CPU usage per service and look for one process pinned at 100%.
Fix: add the missing indexes, optimize the hot query, and move sequential processing to a worker pool.
**Resource exhaustion.** The connection pool is depleted, file descriptors are maxed out, or a queue is full; requests are being rejected outright rather than merely slowed.
Diagnose: check active DB connections against the pool size. Check open file descriptors (ulimit). Look for "connection refused" or "ECONNRESET" errors in logs.
Fix: increase the connection pool size, raise OS ulimits, add retry logic with backoff (see the sketch below), and implement connection queuing.
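A generic sketch of the retry-with-backoff fix: exponential delay with jitter, capped at 10 seconds (all constants are illustrative, and the usage line is hypothetical).

```javascript
// retry.js: exponential backoff with jitter around any async operation
async function withRetry(fn, attempts = 5) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === attempts - 1) throw err; // out of attempts, surface the error
      const base = Math.min(1000 * 2 ** i, 10_000);   // 1s, 2s, 4s, 8s, capped at 10s
      const delay = base * (0.5 + Math.random() / 2); // jitter: 50-100% of base
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// usage (hypothetical): const rows = await withRetry(() => pool.query('SELECT 1'));
```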
**Periodic latency spikes.** Garbage collection pauses, periodic background jobs, or cache invalidation events cause cyclical latency spikes.
Diagnose: correlate latency spikes with GC logs (JVM: -verbose:gc; Node.js: --trace-gc). Check cron job schedules. Monitor cache hit rate over time.
Fix: tune GC settings, move background jobs to off-peak hours, and stagger cache invalidation.
**Gradual degradation.** A memory leak, connection leak, or filling disk: resources are consumed but never released, causing eventual failure.
Diagnose: graph memory usage and heap size over time — they should be flat, not steadily rising. Track open file descriptors and TCP connections over time. Monitor disk usage. (A minimal heap-sampling sketch follows.)
Fix: find and fix the leak. Common culprits: unclosed database connections, event listeners that are never removed, and log rotation that was never configured.
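During a soak test, a trivial in-process sampler makes the trend easy to plot. A Node.js sketch; the interval and output format are arbitrary choices.

```javascript
// heap-sampler.js: log RSS and heap usage every 30s, then graph it after the run
setInterval(() => {
  const { rss, heapUsed } = process.memoryUsage();
  console.log(JSON.stringify({
    t: new Date().toISOString(),
    rssMB: +(rss / 1e6).toFixed(1),
    heapMB: +(heapUsed / 1e6).toFixed(1),
  }));
}, 30_000);
```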
**Endpoint-specific slowness.** The slow endpoints are doing more work: complex joins, external API calls, heavy computation, or a missing cache.
Diagnose: use distributed tracing (Jaeger, Tempo, Datadog APM) to break latency down by span. Which external call takes the longest? Which DB query is slowest?
Fix: cache expensive computations, optimize the specific DB query, parallelize independent external API calls (see the sketch below), and move heavy computation to async queues.
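The parallelization fix in Node.js looks like this; the internal hostnames and helper name are made up for illustration.

```javascript
// parallel-calls.js: issue independent upstream calls concurrently (Node 18+ fetch)
async function getCheckoutContext(userId, productId) {
  const [profile, inventory, pricing] = await Promise.all([
    fetch(`http://profiles.internal/users/${userId}`).then((r) => r.json()),
    fetch(`http://inventory.internal/items/${productId}`).then((r) => r.json()),
    fetch(`http://pricing.internal/items/${productId}`).then((r) => r.json()),
  ]);
  // total latency ≈ the slowest call, instead of the sum of all three
  return { profile, inventory, pricing };
}
```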
k6 — Streaming Metrics to Grafana + Prometheus
# Run k6 and push metrics to Prometheus remote write.
# Note: K6_PROMETHEUS_RW_* settings must be process environment variables;
# k6's --env flag only sets script-level __ENV values, not output config.
K6_PROMETHEUS_RW_SERVER_URL=http://prometheus:9090/api/v1/write \
K6_PROMETHEUS_RW_TREND_AS_NATIVE_HISTOGRAM=true \
k6 run --out experimental-prometheus-rw tests/load/checkout.js

# Or output to InfluxDB for Grafana dashboards
k6 run \
  --out influxdb=http://influxdb:8086/k6 \
  tests/load/checkout.js

# Import the official k6 Grafana dashboard (ID: 2587)
# It shows RPS, p50/p95/p99 latency, error rate, and VU count in real time

Case Study: SaaS Platform Saves Black Friday
A B2C e-commerce SaaS serving 400+ merchant storefronts reached out to Codazz three weeks before Black Friday after their platform had gone down during the previous year's sale event. Here is what we found and fixed.
We ran a standard load test at 2x their estimated Black Friday peak (1,200 concurrent users). The checkout API hit p99 of 8,200ms at just 400 VUs — well before peak. The product search endpoint returned 502 errors at 600 VUs.
Distributed tracing revealed three issues: (1) the checkout endpoint ran 23 sequential database queries due to an N+1 ORM problem, each query adding ~35ms; (2) product search used a LIKE '%query%' pattern with no full-text index, causing full table scans at scale; (3) the Node.js API had a database connection pool of just 5 connections, shared across all 16 worker processes.
We batched the 23 ORM queries into 3 optimized joins (checkout p99 dropped from 8,200ms to 180ms). Added PostgreSQL full-text search with GIN index (search p99 dropped from timeout to 45ms). Increased connection pool to 25 per process and added PgBouncer for connection multiplexing.
The platform handled 2,800 concurrent users at peak — 2.3x the previous year's traffic — with p99 latency of 210ms and 0.03% error rate. Zero downtime. The merchant reported a 340% increase in Black Friday GMV vs the prior year.
Need Help Load Testing Your Application?
Codazz engineers have run load tests for SaaS platforms, fintech APIs, and e-commerce sites — and fixed the bottlenecks that tests reveal. Let us pressure-test your system before your users do.