Performance Engineering · March 2026

Load Testing Guide 2026:
Tools, Techniques & Best Practices

A complete production playbook covering every type of performance test, tool comparisons (k6, JMeter, Locust, Artillery, Gatling), key metrics, CI/CD integration, results interpretation, and a real-world case study.

20 min read · Updated March 21, 2026 · By Codazz Engineering

What is Load Testing?

Load testing is the practice of simulating real-world traffic on a software system to understand how it behaves under expected and peak conditions. Unlike functional testing — which checks whether features work correctly — load testing answers the question: how many users can this system handle before it falls over?

Industry estimates commonly put the cost of a major outage at around $5,600 per minute in lost revenue, before counting the damage to customer trust. Most high-profile outages are preventable: teams that run systematic load tests before major launches and deployments catch the bottlenecks weeks before real users do. Load testing is not a luxury reserved for FAANG companies; it is table stakes for any production system serving more than a few hundred users.

Prevent Outages

Catch breaking points before real users hit them during launches or traffic spikes.

Validate SLOs

Confirm that p95/p99 latency and error rate targets hold at projected user volumes.

Size Infrastructure

Determine exactly how many servers, pods, or DB connections you actually need.

Find Bottlenecks

Expose slow database queries, N+1 problems, memory leaks, and connection pool exhaustion.

Benchmark Changes

Compare performance before and after a refactor, upgrade, or infrastructure change.

Build Confidence

Ship with certainty — your team and stakeholders know the system has been pressure-tested.

The Load Testing Mindset

Think of load testing as a fire drill — you run it regularly, deliberately, in controlled conditions, so that when real fire comes (a viral post, a Product Hunt launch, a Black Friday sale), your team knows exactly what will break and at what threshold. The goal is not to pass a test; the goal is to learn where your system's limits are and decide whether those limits are acceptable.

Types of Performance Tests

"Load test" is often used as an umbrella term, but there are five distinct test types, each answering a different question. Running the wrong type gives you misleading confidence.

Load Test
Does the system perform acceptably at expected peak load?

The baseline test. Simulate your anticipated peak concurrent users and measure throughput, latency, and error rates. Typically runs at 1x–1.5x your observed peak traffic for 10–30 minutes to verify steady-state performance.

Run When

Before every major release. After infrastructure changes. After adding a new high-traffic feature.

Traffic Shape

Ramp up → steady state at target load → ramp down

Stress Test
What is the breaking point, and how does the system fail?

Push the system beyond its rated capacity — 2x, 5x, 10x normal load — until it degrades or fails. The goal is to find the breaking point AND observe the failure mode. Does it return errors gracefully? Queue and recover? Or deadlock and require a restart?

Run When

When establishing capacity limits for the first time. After significant architectural changes. Before signing SLA contracts.

Traffic Shape

Ramp up aggressively until errors spike or latency becomes unacceptable

Spike Test
Can the system survive sudden traffic explosions?

Instantaneously jump from baseline to 5–10x load with no ramp-up period. Tests whether auto-scaling kicks in fast enough, whether connection pools handle the sudden surge, and whether queues absorb the burst without cascading failures.

Run When

If you expect unpredictable spikes (viral social media, flash sales, sports events). After enabling auto-scaling to verify it actually works.

Traffic Shape

Low baseline → instant spike to maximum → return to baseline

Soak / Endurance Test
Does the system remain stable over many hours or days?

Run moderate load (50–70% of peak) for 4–24 hours or longer. Designed to catch memory leaks, connection pool exhaustion, disk space accumulation, cache poisoning, and other time-dependent degradation that short tests miss entirely.

Run When

Before first production launch. Monthly as a health check. Whenever you suspect a memory leak from production metrics.

Traffic Shape

Steady moderate load for extended duration (4h, 8h, 24h)

Breakpoint Test
At exactly what load does each component become the bottleneck?

Systematically increase load in small increments and measure which metric degrades first at each step. Produces a capacity curve showing throughput vs latency, letting you identify the precise inflection point. More controlled and data-rich than a basic stress test.

Run When

Capacity planning. Choosing between architectural options. Justifying infrastructure spend to stakeholders.

Traffic Shape

Incremental steps: 10% → 20% → 30% → … of max, measuring at each plateau
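
The five traffic shapes above are all just stage lists. As a rough illustration, here is a plain-Python sketch (the `vu_schedule` helper is hypothetical, mimicking k6-style linear ramping) showing how the shapes differ numerically:

```python
def vu_schedule(stages, start_vus=0):
    """Expand [(duration_s, target_vus), ...] into one VU count per second,
    ramping linearly from the previous target to the next -- the same model
    k6 uses for its ramping stages."""
    schedule = []
    current = start_vus
    for duration, target in stages:
        for s in range(1, duration + 1):
            schedule.append(round(current + (target - current) * s / duration))
        current = target
    return schedule

# Load test: ramp up, hold steady, ramp down
load = vu_schedule([(120, 100), (600, 100), (120, 0)])

# Spike test: low baseline, near-instant jump, return to baseline
spike = vu_schedule([(60, 10), (1, 100), (120, 100), (1, 10), (60, 10)])

# Breakpoint test: incremental plateaus, measuring at each step
bp = vu_schedule([(30, v) for v in range(50, 501, 50)])
```

Feeding schedules like these to your tool of choice (k6 `stages`, Locust `LoadTestShape`, Artillery `phases`) keeps the test intent explicit and reviewable.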

Tool Comparison: k6, JMeter, Locust, Artillery, Gatling

Five tools dominate the load testing landscape in 2026. Each has a distinct philosophy, scripting model, and performance profile. Here is how they compare across the dimensions that matter most.

| Tool | Language | Protocol Support | Max VUs (single node) | CI/CD Friendly | Cloud Execution |
|---|---|---|---|---|---|
| k6 | JavaScript / TypeScript | HTTP/1.1, HTTP/2, WebSocket, gRPC | 50,000+ | Excellent | k6 Cloud, Grafana |
| JMeter | GUI / Groovy / Java | HTTP, HTTPS, JDBC, FTP, SOAP, SMTP | 5,000–10,000 | Good (JMeter Maven Plugin) | BlazeMeter, Azure Load Testing |
| Locust | Python | HTTP, WebSocket (custom clients) | 10,000+ (distributed) | Good | Self-hosted distributed mode |
| Artillery | YAML / JavaScript | HTTP, WebSocket, Socket.io, Kinesis | 5,000–20,000 | Excellent (npm package) | Artillery Cloud, AWS Lambda |
| Gatling | Scala / Java / Kotlin | HTTP/1.1, HTTP/2, WebSocket, JMS | 30,000+ | Good (Maven/Gradle plugin) | Gatling Enterprise |

k6
Best for modern JavaScript/TypeScript teams and CI/CD pipelines
Strengths

Extremely efficient — single k6 process can simulate 50,000+ virtual users with low CPU overhead. Scripts are plain JavaScript/TypeScript. First-class CI/CD output (JUnit XML, Prometheus, Datadog). Thresholds built-in — tests fail automatically when SLOs are breached. Free, open-source, backed by Grafana Labs.

Limitations

No browser automation. Runs JS in a custom runtime — some Node.js APIs are unavailable. Distributed execution requires k6 Cloud (paid) or manual setup.

Verdict: First choice for API load testing in 2026. If your team writes JavaScript, start here.
Apache JMeter
Best for enterprises with GUI requirements or legacy SOAP/JDBC testing
Strengths

Supports virtually every protocol out of the box. GUI makes it accessible to non-developers. Massive plugin ecosystem. Widely accepted in enterprise procurement processes. Excellent distributed testing with JMeter server nodes.

Limitations

GUI-based workflow is clunky for code review and version control. Java heap memory limits VU count per node. XML-based test plans are difficult to maintain as code. Thread-per-VU model means higher resource consumption than k6 or Gatling.

Verdict: Choose JMeter if you need JDBC/database testing, legacy protocol support, or your QA team requires a GUI. Avoid for greenfield HTTP API projects.
Locust
Best for Python teams who want full scripting flexibility
Strengths

Real Python — full access to the Python ecosystem for data manipulation, custom auth logic, complex test scenarios. Excellent distributed mode. Clean web UI for real-time monitoring. Easy to extend with custom clients.

Limitations

GIL (Python's Global Interpreter Lock) limits per-process concurrency — requires many workers for large loads. Slower than k6 or Gatling for pure HTTP load generation. Less CI/CD tooling out of the box.

Verdict: Excellent for Python teams and complex test scenarios requiring rich data manipulation or custom protocol clients.
Artillery
Best for teams wanting a YAML-first, config-driven approach
Strengths

YAML test definitions are readable by non-engineers. npm package — trivially added to any Node.js project. Built-in support for WebSocket and Socket.io. Serverless execution via AWS Lambda for massive distributed load without managing servers.

Limitations

YAML can become verbose for complex scenarios. Less community content than k6 or JMeter. Cloud execution features require paid tier.

Verdict: Great for teams that want simple, readable test definitions and serverless execution. Ideal for WebSocket and real-time API testing.
Gatling
Best for Java/Scala teams needing the highest single-node VU density
Strengths

Asynchronous Netty-based engine — highest VU-per-core ratio among JVM tools. Excellent HTML reports with detailed percentile charts. Strong HTTP/2 support. Type-safe DSL in Scala/Kotlin catches scripting errors at compile time.

Limitations

Scala/Kotlin DSL has a learning curve. Slower to iterate than k6. Advanced features (distributed execution, CI dashboards) require Gatling Enterprise (paid).

Verdict: Top choice for JVM teams and any scenario where you need maximum VU density on a small number of machines.

Key Metrics & SLOs

Collecting data is not the hard part — knowing which numbers matter and what targets to set is. These are the seven metrics every load test should capture, plus industry-standard thresholds to guide your SLO definitions.

| Metric | Definition | Good Threshold | Warning Sign |
|---|---|---|---|
| RPS / TPS | Requests (or transactions) per second processed by the system | ≥ target throughput at peak load | RPS plateaus or drops while errors rise |
| p50 Latency (Median) | 50th percentile response time — what the typical user experiences | < 100ms for APIs, < 500ms for web pages | Above 200ms for simple read APIs |
| p95 Latency | 95% of requests complete within this time | < 300ms for APIs, < 1s for web pages | Above 1s for any user-facing endpoint |
| p99 Latency | 99% of requests complete within this time — exposes tail latency | < 500ms for APIs, < 2s for web pages | Above 2s for critical paths |
| Error Rate | Percentage of requests returning 4xx/5xx responses | < 0.1% under normal load, < 1% under peak | Any sustained error rate above 0.5% |
| Throughput | Total data transferred per second (KB/s or MB/s) | Stable across test duration at target load | Degrading throughput with rising latency |
| Virtual Users (VUs) | Concurrent simulated users making requests | Matches or exceeds your peak concurrency target | System instability before reaching VU target |

Setting SLOs From Load Test Data

Define objectives before running tests

Write down your SLO targets before you see any data. If you set targets after seeing results, you are rationalizing, not measuring. Example: "The checkout API must serve p99 latency under 400ms at 500 concurrent users with error rate below 0.1%."

Separate read and write endpoints

GET /products and POST /orders have very different performance characteristics. Track them separately. Aggregating all endpoints into one average hides the endpoints that matter most — usually writes and search.

Measure at the 95th and 99th percentile, not average

Average latency is meaningless for user experience. A p50 of 50ms with a p99 of 5,000ms means 1 in 100 users waits 5 seconds — completely unacceptable, but invisible in averages.
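
To see concretely why averages mislead, here is a self-contained sketch using synthetic latencies (nearest-rank is one common percentile definition; the numbers are illustrative):

```python
import random
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: value at the ceiling of the p-th rank."""
    ordered = sorted(samples)
    k = min(len(ordered) - 1, max(0, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

random.seed(42)
# ~98.5% fast responses around 50ms, plus ~1.5% pathological 5-second outliers
latencies = [random.gauss(50, 5) for _ in range(9850)] + [5000.0] * 150

mean = statistics.fmean(latencies)   # ~124ms -- looks almost healthy
p50 = percentile(latencies, 50)      # ~50ms  -- the typical user is fine
p99 = percentile(latencies, 99)      # 5000ms -- the tail the average hides
```

The mean sits near 124ms, which would pass many naive dashboards, while the p99 exposes the 5-second tail that real users actually feel.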

Use k6 thresholds to auto-fail CI builds

Define thresholds in your k6 script so the test exits with a non-zero code when SLOs are breached. This gates deployments in CI/CD without requiring manual analysis of every run.

Writing Load Test Scripts

Below are production-ready script examples for k6, Artillery, and Locust — the three most commonly used tools in modern engineering teams.

k6 — Load Test with Thresholds and Stages

// k6 load test — e-commerce API
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate } from 'k6/metrics';

const errorRate = new Rate('errors');

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // Ramp up to 100 VUs
    { duration: '5m', target: 100 },   // Hold at 100 VUs (steady state)
    { duration: '2m', target: 500 },   // Spike to 500 VUs
    { duration: '5m', target: 500 },   // Hold at 500 VUs
    { duration: '2m', target: 0 },     // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<300', 'p(99)<500'], // SLO thresholds
    errors: ['rate<0.01'],                          // < 1% error rate
    http_req_failed: ['rate<0.01'],
  },
};

const BASE_URL = 'https://api.yourapp.com';

export function setup() {
  // Authenticate and return shared token
  const res = http.post(`${BASE_URL}/auth/login`, JSON.stringify({
    email: 'loadtest@example.com',
    password: __ENV.LOAD_TEST_PASSWORD,
  }), { headers: { 'Content-Type': 'application/json' } });

  return { token: res.json('accessToken') };
}

export default function (data) {
  const headers = {
    'Authorization': `Bearer ${data.token}`,
    'Content-Type': 'application/json',
  };

  // Simulate user browsing products
  const productRes = http.get(`${BASE_URL}/products?page=1&limit=20`, { headers });
  check(productRes, {
    'products status 200': (r) => r.status === 200,
    'products response time OK': (r) => r.timings.duration < 300,
  });
  errorRate.add(productRes.status !== 200);

  sleep(1);

  // View a specific product
  const productId = productRes.json('data')[0]?.id ?? 'prod_001';
  const detailRes = http.get(`${BASE_URL}/products/${productId}`, { headers });
  check(detailRes, { 'product detail 200': (r) => r.status === 200 });

  sleep(2);

  // Add to cart
  const cartRes = http.post(`${BASE_URL}/cart/items`, JSON.stringify({
    productId,
    quantity: 1,
  }), { headers });
  check(cartRes, { 'add to cart 200': (r) => r.status === 200 });
  errorRate.add(cartRes.status !== 200);

  sleep(1);
}

Artillery — YAML-Based API Test

# artillery.yml — API load test config
config:
  target: "https://api.yourapp.com"
  phases:
    - duration: 60     # seconds
      arrivalRate: 10  # new users per second
      name: Warm up
    - duration: 300
      arrivalRate: 50
      name: Sustained load
    - duration: 120
      arrivalRate: 200
      name: Peak spike
  defaults:
    headers:
      Content-Type: "application/json"
  variables:
    token: "{{ $env.LOAD_TEST_TOKEN }}"
  ensure:
    p99: 500      # fail if p99 exceeds 500ms
    maxErrorRate: 1  # fail if error rate > 1%

scenarios:
  - name: Browse and add to cart
    weight: 70  # 70% of traffic follows this scenario
    flow:
      - get:
          url: "/products?page=1"
          headers:
            Authorization: "Bearer {{ token }}"
          expect:
            - statusCode: 200
      - think: 2
      - post:
          url: "/cart/items"
          headers:
            Authorization: "Bearer {{ token }}"
          json:
            productId: "prod_001"
            quantity: 1
          expect:
            - statusCode: 200

  - name: Search only
    weight: 30
    flow:
      - get:
          url: "/search?q=laptop"
          headers:
            Authorization: "Bearer {{ token }}"
          expect:
            - statusCode: 200

Locust — Python-Based Distributed Test

# locustfile.py — distributed load test
from locust import HttpUser, task, between, events
import random
import os

class EcommerceUser(HttpUser):
    wait_time = between(1, 3)  # Wait 1-3s between tasks
    token = None

    def on_start(self):
        """Authenticate before running tasks"""
        res = self.client.post("/auth/login", json={
            "email": "loadtest@example.com",
            "password": os.environ["LOAD_TEST_PASSWORD"],
        })
        self.token = res.json()["accessToken"]
        self.client.headers.update({"Authorization": f"Bearer {self.token}"})

    @task(5)
    def browse_products(self):
        """Most common action — weight 5"""
        page = random.randint(1, 10)
        with self.client.get(
            f"/products?page={page}&limit=20",
            name="/products [paginated]",
            catch_response=True
        ) as res:
            if res.status_code != 200:
                res.failure(f"Expected 200, got {res.status_code}")
            elif res.elapsed.total_seconds() > 0.5:
                res.failure(f"Too slow: {res.elapsed.total_seconds():.2f}s")

    @task(3)
    def search(self):
        """Search — weight 3"""
        terms = ["laptop", "phone", "headphones", "monitor"]
        q = random.choice(terms)
        self.client.get(f"/search?q={q}", name="/search")

    @task(2)
    def add_to_cart(self):
        """Add to cart — weight 2"""
        self.client.post("/cart/items", json={
            "productId": f"prod_{random.randint(1, 100):03d}",
            "quantity": random.randint(1, 3),
        }, name="/cart/items [POST]")

    @task(1)
    def checkout(self):
        """Checkout — weight 1 (least frequent)"""
        self.client.post("/orders", json={
            "paymentMethodId": "pm_test_visa",
        }, name="/orders [POST]")

Test Scenarios & Strategies

A realistic test scenario is the difference between a load test that finds real problems and one that gives you false confidence. These strategies ensure your tests reflect actual user behavior.

Use Real User Session Flows

Record real user sessions from production using browser HAR files or API gateway logs and replay them. This captures the exact mix of endpoints, think times, and data patterns that real users exhibit — including the rare but expensive paths like checkout and search.

Pro tip: k6 supports HAR file conversion via the har-to-k6 converter. Artillery supports HAR replay natively.
Parameterize Test Data

Never use the same user ID, product ID, or search query for every virtual user. Static test data creates artificially hot cache entries and completely distorts database query performance. Use CSV data files or generated data to ensure each VU uses unique, realistic inputs.

Pro tip: Prepare a CSV with 10,000 rows of test user credentials, product IDs, and search terms. Feed each VU a unique row using SharedArray in k6.
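
As a sketch of that pro tip, here is a small generator for such a data file (the file name, column names, and ID ranges are illustrative, not prescribed by any tool):

```python
import csv
import random

random.seed(7)

# Hypothetical data file; columns match what a k6 SharedArray loader or a
# Locust CSV reader might consume. One unique row per virtual user.
with open("test-data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["email", "product_id", "search_term"])
    terms = ["laptop", "phone", "headphones", "monitor"]
    for i in range(10_000):
        writer.writerow([
            f"loadtest+{i}@example.com",            # unique user per VU
            f"prod_{random.randint(1, 5000):04d}",  # spread cache keys around
            random.choice(terms),
        ])
```

Because every email is unique and product IDs are spread across a wide range, the test exercises cold cache paths and realistic database query plans instead of hammering one hot row.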
Model Realistic Traffic Distribution

Not all users do the same thing. In most e-commerce apps, 60–70% of users just browse, 20–30% search, and only 5–10% complete a purchase. Model this distribution using scenario weights so your API endpoints receive realistic relative load.

Pro tip: Use the scenarios object in k6 or the weight field in Artillery scenarios to control traffic distribution.
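
The same weighting logic can be sketched in a few lines of plain Python (the 70/20/10 split is an assumed example distribution, not a universal rule):

```python
import random
from collections import Counter

random.seed(1)

# Assumed distribution: most users browse, some search, few purchase
SCENARIOS = ["browse", "search", "purchase"]
WEIGHTS = [70, 20, 10]

def pick_scenario():
    """Choose a scenario for one virtual-user iteration, weighted."""
    return random.choices(SCENARIOS, weights=WEIGHTS, k=1)[0]

counts = Counter(pick_scenario() for _ in range(100_000))
shares = {s: counts[s] / 100_000 for s in SCENARIOS}
# shares converge toward 0.70 / 0.20 / 0.10
```

This is exactly what k6's scenario executors and Artillery's weight field do for you internally; writing it out makes the intended mix easy to review.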
Include Think Time

Real users pause between actions — they read a product description, type a search query, or hesitate before clicking buy. Including sleep() calls between requests (1–5 seconds, varying randomly) dramatically changes the concurrency model and produces more realistic connection pool and session management behavior.

Pro tip: Use between(1, 3) in Locust or sleep(Math.random() * 2 + 1) in k6 to vary think times.
Warm Up the System Before Measuring

JIT compilation, cache warming, and connection pool establishment all happen during the first few minutes of a test. Treat the first 2–3 minutes as a warm-up period and exclude those data points from your SLO evaluation. k6 stages make this straightforward.

Pro tip: In k6, add a short warm-up stage before your primary measurement stage. In Grafana dashboards, use time range selection to exclude warm-up data.
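
A toy illustration of the exclusion step, using synthetic samples (the 120-second cutoff and the latency numbers are assumptions for the example):

```python
# Each sample is (seconds_since_test_start, latency_ms); the first two
# minutes are artificially slow to mimic cold caches and JIT warm-up.
samples = [(t, (400 - 2 * t) if t < 120 else 160) for t in range(0, 600, 5)]

WARMUP_S = 120  # treat the first 2 minutes as warm-up

warm = [latency for t, latency in samples if t < WARMUP_S]
measured = [latency for t, latency in samples if t >= WARMUP_S]

overall_max = max(l for _, l in samples)  # 400ms -- dominated by warm-up noise
steady_max = max(measured)                # 160ms -- the number to judge SLOs by
```

Judging the SLO on `measured` rather than the full series avoids failing a healthy build on startup transients.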

CI/CD Integration

The highest-value change you can make to your load testing practice is running tests automatically in your deployment pipeline. A test that only runs manually before quarterly releases catches problems weeks too late.

GitHub Actions — k6 Load Test on Every Pull Request

# .github/workflows/load-test.yml
name: Load Test

on:
  pull_request:
    branches: [main]
  workflow_dispatch:
    inputs:
      target_rps:
        description: 'Target requests per second'
        default: '100'

jobs:
  load-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Start staging environment
        run: |
          docker compose -f docker-compose.staging.yml up -d
          mkdir -p results  # Directory for the exported k6 summary
          sleep 15  # Wait for services to be healthy

      - name: Run k6 load test
        uses: grafana/k6-action@v0.3.1
        with:
          filename: tests/load/smoke-test.js
          # --summary-export writes the JSON that the PR-comment step parses;
          # --summary-trend-stats ensures p(99) is included in that export
          flags: --env BASE_URL=http://localhost:3000 --summary-export=results/summary.json --summary-trend-stats "avg,med,p(95),p(99)"
        env:
          LOAD_TEST_PASSWORD: ${{ secrets.LOAD_TEST_PASSWORD }}

      - name: Upload results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: load-test-results
          path: results/

      - name: Comment PR with results
        uses: actions/github-script@v7
        if: always()
        with:
          script: |
            const fs = require('fs');
            const summary = fs.readFileSync('results/summary.json', 'utf8');
            const data = JSON.parse(summary);
            // --summary-export writes trend stats as flat keys and rate metrics as { value }
            const p99 = data.metrics.http_req_duration['p(99)'].toFixed(0);
            const errorRate = (data.metrics.http_req_failed.value * 100).toFixed(2);
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Load Test Results
- p99 latency: ${p99}ms
- Error rate: ${errorRate}%`,
            });

Three-Tier Load Testing Strategy for CI/CD

Smoke Test (every PR)
2–5 minutes · 5–10 VUs

Verify the system does not crash under minimal load after every code change. Catches obvious regressions immediately. Fast enough to not slow down the PR review cycle.

Threshold: p99 < 1000ms, error rate < 1%
Load Test (staging, before deploy)
15–30 minutes · Expected peak concurrency

Validate SLO compliance at expected peak load before every production deployment. Should block deployment if thresholds are breached.

Threshold: p95 < 300ms, p99 < 500ms, error rate < 0.1%
Soak Test (weekly, overnight)
4–8 hours · 50–70% of peak

Detect memory leaks, connection pool exhaustion, and other time-dependent degradation. Run on a schedule rather than blocking deployments. Alert on-call if thresholds are breached.

Threshold: No error rate increase over time, no latency drift > 20%

Interpreting Results & Finding Bottlenecks

Raw numbers from a load test are a starting point, not an answer. The skill is reading the patterns and correlating metrics to identify root causes. Here are the most common failure patterns and how to diagnose them.
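
One cross-check worth applying to any result set is Little's Law: the number of users concurrently in the system equals throughput times the time each user spends per iteration (latency plus think time). A minimal sketch, with illustrative numbers:

```python
def expected_vus(rps, latency_s, think_s=0.0):
    """Little's Law: concurrency = arrival rate x time spent in the system."""
    return rps * (latency_s + think_s)

# e.g. sustaining 500 req/s at 200ms latency with 2s think time per request
vus_needed = expected_vus(500, 0.2, 2.0)  # 1100.0 VUs

# If your test reports far lower throughput per VU than this predicts,
# requests are queuing somewhere -- a sign of hidden contention.
```

Comparing measured RPS-per-VU against this prediction is a quick way to spot queuing before digging into traces.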

Latency increases linearly with VU count
Meaning

You are hitting a sequential bottleneck — typically a database query without an index, a lock contention issue, or a single-threaded queue processor.

How to Diagnose

EXPLAIN ANALYZE slow queries. Check lock wait times in your DB. Profile CPU usage per service — look for one process at 100%.

Fix

Add missing indexes. Optimize the hot query. Move sequential processing to a worker pool.

Latency is fine but error rate spikes at high VU count
Meaning

Resource exhaustion — connection pool depleted, file descriptors maxed out, or queue full. Requests are being rejected rather than just slowed.

How to Diagnose

Check active DB connections vs pool size. Check open file descriptors (ulimit). Look for "connection refused" or "ECONNRESET" errors in logs.

Fix

Increase connection pool size. Raise OS ulimits. Add retry logic with backoff. Implement connection queuing.

Latency spikes then recovers repeatedly (sawtooth pattern)
Meaning

Garbage collection pauses, periodic background jobs, or cache invalidation events causing cyclical latency spikes.

How to Diagnose

Correlate latency spikes with GC pause logs (JVM: -verbose:gc, Node.js: --trace-gc). Check cron job schedules. Monitor cache hit rate over time.

Fix

Tune GC settings. Move background jobs to off-peak. Stagger cache invalidation.

Performance gradually degrades over time (soak test)
Meaning

Memory leak, connection leak, or disk fill. Resources consumed but never released, causing eventual failure.

How to Diagnose

Graph memory usage and heap size over time — should be flat, not steadily rising. Check open file descriptors and TCP connections over time. Monitor disk usage.

Fix

Find and fix the leak. Common culprits: unclosed database connections, event listeners not removed, log rotation not configured.

Specific endpoints much slower than others at the same VU count
Meaning

The slow endpoints are doing more work — complex joins, external API calls, heavy computation, or missing cache.

How to Diagnose

Use distributed tracing (Jaeger, Tempo, Datadog APM) to break down latency by span. Which external call takes the longest? Which DB query is slowest?

Fix

Cache expensive computations. Optimize the specific DB query. Parallelize independent external API calls. Move heavy computation to async queues.

k6 — Streaming Metrics to Grafana + Prometheus

# Run k6 and push metrics to Prometheus remote write
# (K6_* settings must be real environment variables; k6's --env flag only
#  populates __ENV inside the script, not k6's own configuration)
K6_PROMETHEUS_RW_SERVER_URL=http://prometheus:9090/api/v1/write \
K6_PROMETHEUS_RW_TREND_AS_NATIVE_HISTOGRAM=true \
k6 run --out experimental-prometheus-rw tests/load/checkout.js

# Or output to InfluxDB for Grafana dashboards
k6 run \
  --out influxdb=http://influxdb:8086/k6 \
  tests/load/checkout.js

# Import the official k6 Grafana dashboard (ID: 2587)
# It shows RPS, p50/p95/p99 latency, error rate, VU count in real-time

Case Study: SaaS Platform Saves Black Friday

A B2C e-commerce SaaS serving 400+ merchant storefronts reached out to Codazz three weeks before Black Friday after their platform had gone down during the previous year's sale event. Here is what we found and fixed.

Baseline Assessment

We ran a standard load test at 2x their estimated Black Friday peak (1,200 concurrent users). The checkout API hit p99 of 8,200ms at just 400 VUs — well before peak. The product search endpoint returned 502 errors at 600 VUs.

Root Cause Analysis

Distributed tracing revealed three issues: (1) The checkout endpoint ran 23 sequential database queries due to an N+1 ORM problem — each query added ~35ms. (2) Product search used a LIKE %query% pattern with no full-text index — full table scans at scale. (3) The Node.js API had a database connection pool of just 5 connections, shared across all 16 worker processes.

Fixes Implemented

We batched the 23 ORM queries into 3 optimized joins (checkout p99 dropped from 8,200ms to 180ms). Added PostgreSQL full-text search with GIN index (search p99 dropped from timeout to 45ms). Increased connection pool to 25 per process and added PgBouncer for connection multiplexing.

Black Friday Results

The platform handled 2,800 concurrent users at peak — 2.3x the previous year's traffic — with p99 latency of 210ms and 0.03% error rate. Zero downtime. The merchant reported a 340% increase in Black Friday GMV vs the prior year.

| Metric | Before | After |
|---|---|---|
| Checkout p99 latency | 8,200ms | 180ms |
| Search p99 latency | Timeout | 45ms |
| Peak concurrent users | 400 (crashing) | 2,800 (stable) |
| Error rate at peak | ~12% | 0.03% |


Need Help Load Testing Your Application?

Codazz engineers have run load tests for SaaS platforms, fintech APIs, and e-commerce sites — and fixed the bottlenecks that tests reveal. Let us pressure-test your system before your users do.