Instrument Error Code Categorization for Deterministic Recovery

When an instrument raises -221,"Settings conflict" in the middle of a source-measure sweep, the control loop has milliseconds to decide whether to retry the point, clear the register and continue, or drive the channel to a safe state and abort the run. Get that decision wrong and the failure mode is silent: a syringe pump that keeps dispensing after a settings conflict, a temperature controller whose PID loop integrates against a stale setpoint, or an eight-hour assay whose dataset is quietly corrupted from the point of the first unhandled fault. Error code categorization is the layer that turns raw transport anomalies and vendor status registers into a small, fixed set of deterministic states — transient, recoverable, terminal — so that every fault reaches exactly one handler and control code never guesses. This guide builds that categorization engine in Python, targeting the mixed serial, USB, and GPIB fleets that real labs run.

Prerequisites and Hardware Scope

This page assumes Python 3.10+ (for match statements and modern typing), pyserial>=3.5 for raw serial transports, and pyvisa>=1.13 with an NI-VISA or pyvisa-py backend for USBTMC and GPIB instruments. The categorization patterns apply to any device that either exposes an IEEE 488.2 / SCPI status model (*ESR?, *STB?, SYST:ERR?) or returns structured status bytes over a raw byte stream: SCPI-compliant power supplies, source-measure units, spectrometers, and function generators, plus proprietary syringe pumps and motion controllers that use vendor-specific status codes.

Categorization is a boundary layer, not a starting point. It consumes a normalized transport that has already been configured for deterministic reads and writes. Before the categorizer can trust a SYST:ERR? response, the receive buffer, termination characters, and read timeouts must be pinned per instrument, as developed in PySerial Configuration & Tuning, and sessions must be allocated through a VISA Resource Manager so that resource strings resolve to the correct backend. Everything below sits directly above those layers and directly below the retry policy in Timeout Handling & Retry Logic.

Isolating Transport Faults from Application Faults

Instrument control pipelines orchestrate heterogeneous buses, each introducing distinct failure modes. A robust categorization system must first separate transport-layer faults — framing errors, parity mismatches, USB enumeration drops, GPIB bus arbitration lockups — from application-layer faults, which are the instrument’s own reasoned complaints about invalid parameters, execution conflicts, or hardware interlocks. The distinction is operationally decisive: transport faults are usually retryable because the instrument may never have received or answered the command, whereas an application fault means the instrument received the command, understood it, and rejected it — retrying an identical command produces an identical rejection.

Without this separation, retry logic becomes non-deterministic and state machines drift. The rule is that transport errors are caught at the I/O driver level and wrapped in a dedicated exception type (for example TransportFaultError), never allowed to propagate as a bare OSError, serial.SerialException, or pyvisa.VisaIOError. Only after transport normalization does a response reach the categorizer, which is then free to assume it is looking at a syntactically complete instrument reply rather than a fragment. This is the same isolation principle that the parent Serial, USB, and GPIB Communication Workflows reference builds from the physical layer up, applied at the error boundary.
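The wrapping rule above can be sketched as a decorator applied to each low-level I/O method. This is a minimal sketch, not the driver's actual implementation: the name wraps_transport_faults and the read_reply stand-in are hypothetical, and OSError and TimeoutError substitute for serial.SerialException and pyvisa.VisaIOError so the example runs without hardware libraries installed.

```python
import functools
from typing import Callable, TypeVar

T = TypeVar("T")


class TransportFaultError(Exception):
    """Raised for I/O-layer faults (framing, timeout, enumeration drop)."""


def wraps_transport_faults(
    *io_errors: type[BaseException],
) -> Callable[[Callable[..., T]], Callable[..., T]]:
    """Decorator factory: re-raise the listed I/O exceptions as TransportFaultError."""
    caught = io_errors or (OSError,)

    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @functools.wraps(func)
        def wrapper(*args: object, **kwargs: object) -> T:
            try:
                return func(*args, **kwargs)
            except caught as exc:
                # Chain the original exception so the root cause stays visible.
                raise TransportFaultError(f"{func.__name__}: {exc!r}") from exc

        return wrapper

    return decorator


# Assumption: a real driver would list serial.SerialException and
# pyvisa.VisaIOError here instead of these stdlib stand-ins.
@wraps_transport_faults(OSError, TimeoutError)
def read_reply(fail: bool = False) -> str:
    """Hypothetical low-level read used only to exercise the decorator."""
    if fail:
        raise TimeoutError("read timed out")
    return '0,"No error"'
```

Keeping the exception list as a decorator argument means each driver declares which library exceptions it can raise, while the categorizer only ever sees one type.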

The Three-State Severity Model and Code Ranges

Every categorized fault resolves to one of three severities. Keeping the target set this small is what makes recovery deterministic: each handler is unambiguous, and there is no fourth “maybe” branch for a control loop to stall on. The table below is the reference the engine encodes — the mapping from IEEE 488.2 / SCPI numeric ranges (and common transport signatures) to a severity and its mandated recovery action.

Severity	Source signature	SCPI / vendor range	Mandated recovery action
`TRANSIENT`	Read timeout, partial frame, `VisaIOError` timeout, transient bus congestion	Transport layer (no SCPI code)	Clear input buffer, defer to backoff policy, retry same command
`RECOVERABLE`	Command error (bad header/syntax), execution error (settings conflict, out of range)	`-100` to `-199`, `-200` to `-299`	Issue `*CLS`, drain `SYST:ERR?`, re-raise a typed fault; skip or re-parameterize the offending step
`TERMINAL`	Device-specific hardware fault, query interrupted/unterminated, self-test failure	`-300` to `-399`, `-400` to `-499`	Drive channel to safe state, open circuit breaker, halt queue, page telemetry
`TERMINAL`	Safety interlock, over-temperature, over-current asserted	Vendor status bit / GPIO line	Immediate `safety_shutdown()`, no retry, no command round-trip

Two subtleties matter for correctness. First, SCPI negative codes are ordered low-to-high inside each class, so a range test must be written low <= code <= high with both bounds inclusive. An off-by-one boundary silently drops -200 out of the execution-error range, and the code then falls through to the TERMINAL default instead of being handled as a recoverable execution error. Second, a transient fault never arrives with a SCPI code attached: a timeout surfaces as a transport exception, and the instrument may never have received the command at all, so the categorizer must synthesize the TRANSIENT verdict from the transport exception type, not from SYST:ERR?. The full five-boundary taxonomy, including positive vendor-specific codes and the exact FIFO drain sequence, is developed in Categorizing SCPI Error Codes for Automated Recovery.

Synchronous Categorization at the Query Boundary

In blocking command-response architectures, categorization must occur immediately after the instrument acknowledges a query, before the next command is issued. The guard function assembled over the next three listings wraps a single query: it polls the error queue and either returns a clean response or raises a categorized fault, relying on the transport layer beneath it to have validated termination already. Because it runs inline, the malformed-payload case never reaches a downstream parser or a Checksum/CRC Validation stage with a half-frame masquerading as measurement data. The first listing defines the shared types: the severity enum, the two exception classes, and the transport contract.

import logging
from enum import Enum, auto
from typing import Optional, Protocol

logger = logging.getLogger(__name__)


class ErrorSeverity(Enum):
    TRANSIENT = auto()
    RECOVERABLE = auto()
    TERMINAL = auto()


class TransportFaultError(Exception):
    """Raised for I/O-layer faults (framing, timeout, enumeration drop)."""


class InstrumentError(Exception):
    """A categorized application-layer instrument fault."""

    def __init__(
        self,
        code: int,
        message: str,
        severity: ErrorSeverity,
        context: Optional[dict[str, object]] = None,
    ) -> None:
        self.code = code
        self.message = message
        self.severity = severity
        self.context: dict[str, object] = context or {}
        super().__init__(f"[{code}] {message} ({severity.name})")


class Transport(Protocol):
    """Minimal transport contract the categorizer depends on."""

    def query(self, command: str) -> str: ...
    def clear_status(self) -> None: ...      # sends *CLS
    def safety_shutdown(self) -> None: ...    # channel off / output disable

Mapping Numeric Codes to Severity

The classification matrix is a pure, table-driven function so its behavior is fully reproducible under regression testing — the same code always yields the same severity, with no hidden state. Keeping it separate from the I/O path also lets it be unit-tested against a fixture of real error strings captured from the bench.

# Inclusive (low, high) SCPI/IEEE 488.2 ranges. Negative codes run low -> high,
# so `low <= code <= high` is the correct membership test.
SCPI_SEVERITY_MAP: dict[tuple[int, int], ErrorSeverity] = {
    (-199, -100): ErrorSeverity.RECOVERABLE,  # Command errors  (-1xx)
    (-299, -200): ErrorSeverity.RECOVERABLE,  # Execution errors (-2xx)
    (-399, -300): ErrorSeverity.TERMINAL,     # Device-specific  (-3xx)
    (-499, -400): ErrorSeverity.TERMINAL,     # Query errors     (-4xx)
}


def classify_scpi_error(code: int, message: str) -> InstrumentError:
    """Map a numeric SCPI error code to a categorized InstrumentError.

    Unknown or positive (vendor) codes default to TERMINAL: fail safe, never
    optimistically retry a fault the taxonomy does not recognize.
    """
    severity = ErrorSeverity.TERMINAL
    for (low, high), sev in SCPI_SEVERITY_MAP.items():
        if low <= code <= high:
            severity = sev
            break
    return InstrumentError(code=code, message=message, severity=severity)
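The regression fixture mentioned above might start from the range boundaries, since those are where an off-by-one would hide. The sketch below restates the inclusive ranges locally so it runs on its own; Severity, severity_for, BOUNDARY_CASES, and failing_cases are hypothetical names, and a real suite would import classify_scpi_error and add bench-captured strings instead.

```python
from enum import Enum, auto


class Severity(Enum):
    RECOVERABLE = auto()
    TERMINAL = auto()


# Local restatement of the inclusive (low, high) ranges from the severity map.
RANGES: dict[tuple[int, int], Severity] = {
    (-199, -100): Severity.RECOVERABLE,  # Command errors
    (-299, -200): Severity.RECOVERABLE,  # Execution errors
    (-399, -300): Severity.TERMINAL,     # Device-specific errors
    (-499, -400): Severity.TERMINAL,     # Query errors
}


def severity_for(code: int) -> Severity:
    """Same membership test as the classifier; unknown codes fail safe."""
    for (low, high), sev in RANGES.items():
        if low <= code <= high:
            return sev
    return Severity.TERMINAL


# Every range edge, plus codes just outside the taxonomy.
BOUNDARY_CASES: list[tuple[int, Severity]] = [
    (-100, Severity.RECOVERABLE),
    (-199, Severity.RECOVERABLE),
    (-200, Severity.RECOVERABLE),
    (-299, Severity.RECOVERABLE),
    (-300, Severity.TERMINAL),
    (-399, Severity.TERMINAL),
    (-400, Severity.TERMINAL),
    (-499, Severity.TERMINAL),
    (-99, Severity.TERMINAL),    # below the command-error class: unknown
    (-500, Severity.TERMINAL),   # beyond the query-error class: unknown
    (101, Severity.TERMINAL),    # positive vendor code: unknown
]


def failing_cases() -> list[int]:
    """Return the codes whose classification disagrees with the fixture."""
    return [code for code, want in BOUNDARY_CASES if severity_for(code) is not want]
```

An empty list from failing_cases() means every edge classifies as intended. Changing either comparison to a strict < moves -200 or -299 into the fall-through default, and the fixture reports it.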

Routing a Query Through the Categorizer

The execution wrapper ties the pieces together. Transport exceptions become TRANSIENT faults that are re-raised as timeouts for the backoff layer to absorb; SCPI faults are classified and routed either to *CLS-and-re-raise or to a safe-state shutdown. Note the deliberate asymmetry: recoverable faults clear status and re-raise a typed error so the caller can decide whether to skip or re-parameterize the step, while terminal faults drive the hardware safe first, before anything else is sent.

def execute_with_categorization(command: str, dev: Transport) -> str:
    """Issue `command`, poll SYST:ERR?, and return a clean response or raise
    a categorized fault. Terminal faults leave the instrument in a safe state.
    """
    try:
        response = dev.query(command)
        # Poll inside the same guard: a transport fault on the error query is
        # just as transient as one on the command itself.
        err_raw = dev.query("SYST:ERR?")
    except TransportFaultError as exc:
        # No SCPI code exists for a transport fault: synthesize TRANSIENT.
        logger.warning("Transport fault on %r: %s", command, exc)
        raise TimeoutError("Transient transport fault; defer to backoff") from exc

    code_str, _, msg = err_raw.partition(",")
    try:
        err_code = int(code_str)
    except ValueError as exc:
        raise TransportFaultError(f"Malformed error reply: {err_raw!r}") from exc

    if err_code == 0:
        return response

    fault = classify_scpi_error(err_code, msg.strip().strip('"'))
    logger.error(
        "Categorized instrument fault",
        extra={"code": fault.code, "severity": fault.severity.name},
    )

    match fault.severity:
        case ErrorSeverity.RECOVERABLE:
            dev.clear_status()
            raise fault
        case ErrorSeverity.TERMINAL:
            dev.safety_shutdown()
            raise RuntimeError("Terminal hardware fault; queue halted") from fault
        case ErrorSeverity.TRANSIENT:
            raise TimeoutError("Transient fault; defer to backoff") from fault
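The reply parsing inside the wrapper above can be factored into a small helper so it is testable on its own. This is a sketch under assumptions: ErrorReply and parse_error_reply are hypothetical names, and the helper mirrors the wrapper's partition-on-first-comma idiom rather than any vendor's documented grammar.

```python
from typing import NamedTuple


class ErrorReply(NamedTuple):
    code: int
    message: str


def parse_error_reply(raw: str) -> ErrorReply:
    """Split a SYST:ERR? reply such as '-221,"Settings conflict"' into parts.

    Only the first comma separates code from message, so commas inside the
    quoted message text are preserved.
    """
    code_str, _, msg = raw.strip().partition(",")
    try:
        code = int(code_str)
    except ValueError as exc:
        raise ValueError(f"Malformed error reply: {raw!r}") from exc
    return ErrorReply(code, msg.strip().strip('"'))
```

A caller would translate the ValueError into whatever transport-fault type the pipeline uses, as the wrapper does for a malformed reply.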

Deferred Categorization in Asynchronous Queues

Non-blocking architectures maximize instrument throughput but change where categorization happens. Async Command Queuing Systems decouple command issuance from response parsing, so a fault surfaces after the command that caused it has already left the head of the queue. The categorizer must therefore tag each fault with execution context — timestamp, command hash, queue position — and route it to a centralized handler rather than raising inline. The severity model is identical; only the delivery mechanism differs.

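import asyncio
from typing import Protocol


class AsyncTransport(Protocol):
    """Async counterpart of Transport (assumed contract for this listing)."""

    async def query(self, command: str) -> str: ...
    async def safety_shutdown(self) -> None: ...


async def execute_async(command: str, dev: AsyncTransport) -> str:
    try:
        response = await dev.query(command)
        err_raw = await dev.query("SYST:ERR?")
    except TransportFaultError as exc:
        raise asyncio.TimeoutError("Transient fault; retry via backoff") from exc

    code_str, _, msg = err_raw.partition(",")
    try:
        err_code = int(code_str)
    except ValueError as exc:
        # Same guard as the synchronous path: a garbled reply is a framing fault.
        raise TransportFaultError(f"Malformed error reply: {err_raw!r}") from exc
    if err_code == 0:
        return response

    fault = classify_scpi_error(err_code, msg.strip().strip('"'))
    # get_running_loop() is the supported way to reach the loop from a coroutine.
    fault.context.update(command=command, queued_at=asyncio.get_running_loop().time())
    if fault.severity is ErrorSeverity.TERMINAL:
        await dev.safety_shutdown()   # suspend the queue before re-raising
    raise fault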

Transient faults yield an asyncio.TimeoutError so the retry policy re-runs the command without blocking the event loop; terminal faults suspend the queue and transition the affected channel to a safe state before any further work is scheduled. This preserves scheduler responsiveness for the other instruments still polling on the same loop.

Figure: SCPI error routing. Poll the error queue, parse the numeric code, and classify it as TRANSIENT, RECOVERABLE, or TERMINAL so each fault reaches its correct handler.

Hardware-Specific Variants and Edge Cases

The categorizer’s transport-fault synthesis has to account for how different physical layers manifest the same logical fault, or it will misclassify recoverable timeouts as terminal errors.

FTDI vs CP210x latency behavior: FTDI FT232 bridges expose a tunable latency timer (default 16 ms) that holds short replies in the bridge until the timer expires or the buffer fills; a categorizer that samples SYST:ERR? faster than that window can read a stale reply and attribute a prior command’s error to the current one. Silicon Labs CP2102/CP210x bridges buffer differently and are less prone to this, but both must have flow control validated before software retries. Treat a suspected stale read as TRANSIENT, not as the reported code.
USBTMC vs GPIB status semantics: Over GPIB, a Service Request (SRQ) and serial poll surface faults out of band, so the categorizer can be event-driven. USBTMC delivers status through the interrupt-IN endpoint and the same SYST:ERR? queue, but an aborted bulk transfer raises a VisaIOError that has no SCPI code — it must be caught as a transport fault and synthesized to TRANSIENT, never passed to classify_scpi_error.
Buffer drift and stale errors: If SYST:ERR? returns stale messages, the queue was not fully drained on the previous cycle. Run a drain loop during initialization and after every recovery — while int(dev.query("SYST:ERR?").partition(",")[0]) != 0: pass — so categorization always starts from a known-empty queue.
Interlocks bypass the error queue: Hardware safety interlocks (over-temperature, lid-open, external E-stop) frequently trip relays without ever enqueuing a SCPI error. Poll the dedicated status bit or digital I/O line directly at the transport layer; when it asserts, categorize immediately as TERMINAL and halt without waiting for a command round-trip.
CH340/CH341 DTR/RTS glitches: These consumer bridges drop control lines under sustained high-throughput polling, producing phantom timeouts. Correlate errno.EIO with the bridge firmware version and pin the driver before escalating; classify the phantom timeout as TRANSIENT with a bounded retry ceiling so a flaky cable does not masquerade as a healthy instrument.
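The one-line drain loop in the stale-errors item has no upper bound, so a device that keeps reporting a fault would spin forever. A bounded variant might look like the sketch below; drain_error_queue is a hypothetical name, the ceiling of 32 reads is an assumption rather than a standard value, and the query callable stands in for dev.query.

```python
from typing import Callable


def drain_error_queue(query: Callable[[str], str], max_reads: int = 32) -> list[str]:
    """Read SYST:ERR? until the queue reports code 0.

    Returns the drained entries so they can be logged, and raises if the
    queue is still non-empty after `max_reads` polls.
    """
    drained: list[str] = []
    for _ in range(max_reads):
        reply = query("SYST:ERR?").strip()
        code_str, _, _ = reply.partition(",")
        if int(code_str) == 0:
            return drained
        drained.append(reply)
    raise RuntimeError(f"Error queue not empty after {max_reads} reads")
```

Returning the drained entries rather than discarding them keeps stale faults available for the telemetry sink, which helps when reconstructing why a previous cycle left the queue dirty.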

Fault Categorization Reference

The following matrix maps observed fault signatures to root cause and the recovery action the engine should take. It is the operational companion to the severity model above — an engineer mid-debug can match a symptom to a row and read off the correct response.

Fault signature	Likely root cause	Severity	Recovery action
`SYST:ERR?` returns `-113,"Undefined header"`	Malformed or vendor-unsupported command mnemonic	`RECOVERABLE`	`*CLS`, log offending command, skip step; fix command generation
`-222,"Data out of range"` on a setpoint write	Requested value exceeds instrument limits	`RECOVERABLE`	Clamp to published range or abort step; do not retry unchanged
`-350,"Queue overflow"`	Error queue filled faster than it was drained	`TERMINAL`	Drain fully, `*RST`, re-initialize session before resuming
`-410,"Query INTERRUPTED"`	New command sent before prior response was read	`TERMINAL`	Resynchronize with `*OPC?`; enforce strict query sequencing
`VisaIOError` timeout, empty read	Bus contention, USB latency batching, cable fault	`TRANSIENT`	Flush buffer, apply backoff, retry up to ceiling
Interlock GPIO asserted, no SCPI code	Over-temperature / E-stop / lid-open safety trip	`TERMINAL`	Immediate `safety_shutdown()`; require operator reset

Integrating with Adjacent Layers

Error code categorization is the pivot between transport and orchestration, so it wires into several neighbors. Below it, it depends on the framing and termination contracts fixed in PySerial Configuration & Tuning and on the per-vendor quirk normalization performed by the Protocol Abstraction Layers, which guarantee that a SYST:ERR? reply is complete and correctly terminated before the categorizer parses it. The numeric ranges it keys on are the ones fixed by SCPI Command Set Standardization.

Above it, the TRANSIENT verdict is the contract that Timeout Handling & Retry Logic consumes: the categorizer decides whether a fault is retryable, and the backoff layer decides when to retry. Terminal categorizations should open a circuit breaker after a threshold of consecutive faults (a common default is three) so a degrading instrument is taken out of rotation rather than retried into starvation, with the failure written to a persistent telemetry sink. When the same fleet is captured for LIMS export, categorized faults become part of the audit trail described in the Scientific Instrument Control Architecture & Taxonomy reference.
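The consecutive-fault threshold described above can be sketched as a small counter object. This is a minimal illustration, not a complete breaker: CircuitBreaker and its method names are hypothetical, there is no half-open or timed reset state, and the default of three simply follows the figure quoted in the text.

```python
class CircuitBreaker:
    """Opens after `threshold` consecutive terminal faults; any success resets it."""

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_faults = 0

    @property
    def is_open(self) -> bool:
        """True once the instrument should be taken out of rotation."""
        return self.consecutive_faults >= self.threshold

    def record_fault(self) -> bool:
        """Count one terminal fault and report whether the breaker is now open."""
        self.consecutive_faults += 1
        return self.is_open

    def record_success(self) -> None:
        """A clean command breaks the streak."""
        self.consecutive_faults = 0
```

An orchestrator would call record_fault() wherever a terminal categorization is caught and stop scheduling work for that instrument once it returns True, writing the event to the telemetry sink at the same point.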

Implementation Checklist

Every transport exception (serial.SerialException, pyvisa.VisaIOError, OSError) is caught at the I/O boundary and wrapped as TransportFaultError — no bare I/O exception escapes to application code.
The severity map uses inclusive low <= code <= high bounds and is covered by unit tests against a fixture of real bench-captured error strings.
Unknown and positive vendor-specific codes default to TERMINAL, verified by a test that feeds an out-of-taxonomy code.
A SYST:ERR? drain loop runs at session init and after every recovery, confirmed by asserting an empty queue (0,"No error") post-drain on real hardware.
Terminal categorization drives the channel to a documented safe state (output disabled / valve closed) before re-raising, validated on the bench with a deliberately induced fault.
Transient categorization hands off to the backoff policy with a bounded retry ceiling, and a phantom-timeout test on a CH340/FTDI bridge does not exhaust into a false terminal state.
Interlock and E-stop lines are polled independently of the SCPI queue and categorize as TERMINAL within one control cycle of assertion.
A circuit breaker opens after N consecutive terminal faults and logs to a persistent telemetry sink for post-run reconstruction.

Categorizing SCPI Error Codes for Automated Recovery — the full five-boundary SCPI taxonomy and FIFO drain sequence
Timeout Handling & Retry Logic — the backoff policy that consumes TRANSIENT verdicts
Async Command Queuing Systems — deferred categorization in non-blocking pipelines
PySerial Configuration & Tuning — the transport contract categorization depends on
VISA Resource Manager Setup — session allocation beneath the error boundary
