Categorizing SCPI Error Codes for Automated Recovery

An automated instrument sequence cannot decide how to recover from a fault until it knows what class of fault it is looking at. A -113 “Undefined header” is a programming mistake that will fail identically on every retry; a -410 “Query INTERRUPTED” is a transient message-exchange desync that a single resync clears; a -350 “Queue overflow” means errors have already been silently dropped and the run is no longer trustworthy. This page builds a deterministic classifier that reads the IEEE 488.2 / SCPI error queue, maps each numeric code to exactly one recovery class through a total, non-overlapping function, and emits a structured directive that downstream recovery logic can act on without ever parsing an English description.

Scope: Numeric-Only Classification of the SCPI Error Queue

The technique here is deliberately narrow. It assumes the instrument speaks SCPI layered on IEEE 488.2 and exposes its faults through the standard SYSTem:ERRor? FIFO, returning each entry as <code>,"<description>". It applies equally whether that queue is drained over RS-232, USB-TMC, or GPIB, because classification operates on the parsed integer code alone — the transport only governs how the string arrives, not what the code means. What it does not cover is the transport read itself: buffer flushing, termination handling, and query synchronisation are the job of the parent Error Code Categorization cluster and the broader Serial, USB, and GPIB Communication Workflows pillar. This page begins once a raw SYST:ERR? response string is already in hand, and ends when a typed recovery directive has been produced.

Two assumptions hold throughout. First, the classifier must be a total function: every possible integer — including reserved gaps, out-of-spec values, and vendor positives — resolves to some class, because a recovery loop that raises KeyError on an unexpected code is itself a new failure mode. Second, classification must depend on the code and never on the description text, since vendors reword descriptions across firmware revisions while the IEEE 488.2 hundreds-blocks stay fixed. A validator built on string matching drifts silently the day a firmware update ships.

Mechanism: The IEEE 488.2 Hundreds-Block Partition

SCPI inherits its error taxonomy from IEEE 488.2, which partitions the negative integers into hundreds-blocks, each with a defined fault semantics. Positive codes are reserved for device-specific meanings, and 0 denotes an empty queue. The classifier is a piecewise map over these disjoint intervals:

 $C (e) = ⎩ ⎨ ⎧ EMPTY COMMAND EXECUTION DEVICE QUERY VENDOR UNKNOWN e = 0 - 199 \leq e \leq - 100 - 299 \leq e \leq - 200 - 399 \leq e \leq - 300 - 499 \leq e \leq - 400 e \geq 1 otherwise$

The otherwise branch is not decorative: it absorbs the reserved -999 … -500 band and any malformed value, guaranteeing totality. Each class then carries a fixed recovery disposition — a boolean “retryable-after-remediation” flag plus a canonical action — so the mapping from code → class → action is fully deterministic and unit-testable:

Class	Code block	Fault semantics	Retryable	Canonical recovery action
`COMMAND`	`-100 … -199`	Malformed or unsupported header/parameter syntax	No	Halt sequence, surface as programming defect; re-issue is futile
`EXECUTION`	`-200 … -299`	Valid command, device could not execute (out of range, settings conflict, trigger ignored)	Conditional	Reset execution state (`INIT:CONT OFF`), validate parameters, re-issue once
`DEVICE`	`-300 … -399`	Hardware/firmware fault: queue overflow, memory, self-test, calibration	No	Trigger interlock, `*CLS`, escalate to operator; no software retry
`QUERY`	`-400 … -499`	Message-exchange desync: interrupted, unterminated, deadlocked	Yes	Resync only (`CLS`, drain queue); never* `*RST`
`VENDOR`	`≥ +1`	Device-specific, defined by vendor manual	Route	Dispatch to registered vendor handler; fail safe if none
`UNKNOWN`	reserved / malformed	Out-of-spec code	No	Fail safe, log verbatim for triage

The single most consequential distinction this table encodes is QUERY versus DEVICE. A -410 looks alarming but is a host-side protocol mistake — the controller sent a new query before reading the previous response — and is cleared by a resync, whereas a naive *RST on every error throws away instrument state on a fault that never touched the hardware. Conversely, attempting a software retry on a -330 “Self-test failed” masks a real hardware fault behind a retry loop.

Classifier decision tree: classify tests the sign first, then partitions the negative hundreds-blocks with bisect, resolving every integer to exactly one of seven classes. Each terminal carries its fixed RecoveryAction and retryable flag — only QUERY (RESYNC) and EXECUTION (RESET-STATE, conditional) re-issue; COMMAND, DEVICE and UNKNOWN never do.

Production Classifier and Recovery Directive

The implementation parses one SYST:ERR? response into a structured, categorised object and resolves a RecoveryDirective. Classification uses bisect over the sorted interval boundaries rather than a chain of comparisons, so adding a block is a one-line table edit and lookup stays branch-stable. The parser is defensive about the exact wire format — leading +, surrounding quotes, stray whitespace, and a missing description are all handled rather than assumed away.

from __future__ import annotations

import bisect
import enum
import logging
import re
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class ErrorClass(enum.Enum):
    EMPTY = "empty"
    COMMAND = "command"        # -1xx: syntax / unsupported header
    EXECUTION = "execution"    # -2xx: valid command, could not execute
    DEVICE = "device"          # -3xx: hardware / firmware fault
    QUERY = "query"            # -4xx: message-exchange desync
    VENDOR = "vendor"          # >=+1: device-specific
    UNKNOWN = "unknown"        # reserved / out-of-spec


class RecoveryAction(enum.Enum):
    NONE = "none"              # clean queue
    HALT = "halt"             # unrecoverable programming defect
    RESET_STATE = "reset_state"
    INTERLOCK = "interlock"   # hardware fault: escalate, no retry
    RESYNC = "resync"         # drain + *CLS, no *RST
    DISPATCH = "dispatch"     # hand to vendor-specific handler
    FAIL_SAFE = "fail_safe"


@dataclass(frozen=True)
class SCPIError:
    code: int
    message: str
    error_class: ErrorClass


@dataclass(frozen=True)
class RecoveryDirective:
    error: SCPIError
    action: RecoveryAction
    retryable: bool


# Sorted lower bounds of the negative hundreds-blocks; index i maps a code that
# is >= _BOUNDS[i] (and < the next bound) to _NEG_CLASSES[i]. Kept as data so the
# partition is auditable and unit-testable, never buried in if/elif control flow.
_BOUNDS = (-499, -399, -299, -199, -99)
_NEG_CLASSES = (
    ErrorClass.QUERY,      # -499 .. -400
    ErrorClass.DEVICE,     # -399 .. -300
    ErrorClass.EXECUTION,  # -299 .. -200
    ErrorClass.COMMAND,    # -199 .. -100
)

_ACTION_TABLE: dict[ErrorClass, tuple[RecoveryAction, bool]] = {
    ErrorClass.EMPTY: (RecoveryAction.NONE, False),
    ErrorClass.COMMAND: (RecoveryAction.HALT, False),
    ErrorClass.EXECUTION: (RecoveryAction.RESET_STATE, True),
    ErrorClass.DEVICE: (RecoveryAction.INTERLOCK, False),
    ErrorClass.QUERY: (RecoveryAction.RESYNC, True),
    ErrorClass.VENDOR: (RecoveryAction.DISPATCH, False),
    ErrorClass.UNKNOWN: (RecoveryAction.FAIL_SAFE, False),
}

_RESPONSE_RE = re.compile(r"^\s*([+-]?\d+)\s*(?:,\s*(.*))?$")


def classify(code: int) -> ErrorClass:
    """Map an SCPI/IEEE 488.2 error code to exactly one class (total function)."""
    if code == 0:
        return ErrorClass.EMPTY
    if code >= 1:
        return ErrorClass.VENDOR
    if code < -499:                       # reserved -999..-500 and below
        return ErrorClass.UNKNOWN
    # code in [-499, -100]: locate its hundreds-block; anything -99..-1 is UNKNOWN.
    idx = bisect.bisect_right(_BOUNDS, code) - 1
    if 0 <= idx < len(_NEG_CLASSES):
        return _NEG_CLASSES[idx]
    return ErrorClass.UNKNOWN


def parse_error(raw: str) -> SCPIError:
    """Parse one raw ``SYST:ERR?`` response into a categorised ``SCPIError``.

    Accepts leading ``+``, quoted or bare descriptions, and a missing message.
    Raises ``ValueError`` only when no integer code can be recovered at all.
    """
    match = _RESPONSE_RE.match(raw)
    if not match:
        raise ValueError(f"Uninterpretable SCPI error response: {raw!r}")
    code = int(match.group(1))
    message = (match.group(2) or "").strip().strip('"').strip()
    return SCPIError(code=code, message=message, error_class=classify(code))


def resolve(error: SCPIError) -> RecoveryDirective:
    """Attach the fixed recovery disposition for this error's class."""
    action, retryable = _ACTION_TABLE[error.error_class]
    return RecoveryDirective(error=error, action=action, retryable=retryable)


def drain_and_classify(
    query_fn: Callable[[str], str], max_entries: int = 64
) -> list[RecoveryDirective]:
    """Fully drain the FIFO via repeated ``SYST:ERR?`` and classify every entry.

    Reads until the queue reports ``0,"No error"`` or ``max_entries`` is hit, so a
    stuck instrument that never returns 0 cannot spin the loop forever. The
    highest-severity directive should drive recovery; the full list is retained
    for the audit trail.
    """
    directives: list[RecoveryDirective] = []
    for _ in range(max_entries):
        error = parse_error(query_fn("SYST:ERR?"))
        if error.error_class is ErrorClass.EMPTY:
            break
        directive = resolve(error)
        logger.warning(
            "SCPI %s (%d): %s -> %s",
            error.error_class.name, error.code, error.message, directive.action.name,
        )
        directives.append(directive)
    else:
        logger.error("Error queue did not clear within %d reads", max_entries)
    return directives

The drain_and_classify loop is bounded on purpose: an instrument that keeps returning nonzero codes indefinitely — a symptom of a hardware fault re-arming faster than it drains — must not trap the caller in an infinite poll. Pair this classifier with the retry ceilings from Timeout Handling & Retry Logic, and specifically the bounded curve in exponential backoff for serial timeouts, so a retryable directive re-issues under a delay budget rather than immediately.

Validating the Classifier Against a Live Instrument

Prove the partition is total and correct before trusting it in a run. A property-style loopback asserts that every integer in the working range resolves and that the standard exemplar codes land in their intended class:

# Totality: no integer in the SCPI range may raise or fall through.
for code in range(-1000, 33000):
    assert isinstance(classify(code), ErrorClass)

# Standard exemplars from IEEE 488.2 / SCPI Vol.2.
assert classify(0) is ErrorClass.EMPTY
assert classify(-113) is ErrorClass.COMMAND       # Undefined header
assert classify(-222) is ErrorClass.EXECUTION     # Data out of range
assert classify(-350) is ErrorClass.DEVICE        # Queue overflow
assert classify(-410) is ErrorClass.QUERY         # Query INTERRUPTED
assert classify(2001) is ErrorClass.VENDOR        # device-specific
assert classify(-550) is ErrorClass.UNKNOWN       # reserved band

# Parser tolerates the real wire variants.
assert parse_error('-222,"Data out of range"').code == -222
assert parse_error('+0, "No error"').error_class is ErrorClass.EMPTY
assert parse_error("-410").message == ""           # description omitted
assert resolve(parse_error('-410,"Query INTERRUPTED"')).action is RecoveryAction.RESYNC

Against real hardware, provoke each class deliberately rather than waiting for it. Sending a bogus header such as VOLT:BOGUS and then reading SYST:ERR? should return a -113 classified COMMAND; issuing a query for a value outside the instrument’s range should yield a -222 EXECUTION. On the instrument side, the standard indicator is the Event Status Register: after *CLS the error bits (CME, EXE, DDE, QYE in *ESR?) should read clear, and a persistent DDE bit that survives *CLS corroborates a DEVICE-class fault the classifier flagged as non-retryable. In the host logs, watch the ratio of classes over time — a rising QUERY count points at a query-synchronisation bug in your own polling code (often solved by the ordering discipline in Async Command Queuing Systems), while a rising COMMAND count points at a command string your driver is generating wrong.

Failure Modes Specific to Code Classification

These faults are unique to categorising the error code — distinct from the transport-read faults owned by the parent cluster.

Queue overflow erases evidence (-350). The FIFO holds a bounded number of entries (commonly 10–32). When it fills, the instrument overwrites the last slot with -350,"Queue overflow" and discards the errors that would have followed. Classifying -350 correctly as DEVICE is necessary but not sufficient: its presence means the classification you just performed is over an incomplete record. Treat any -350 as a signal to *CLS, abort the current step, and restart error accounting — do not assume the codes you did read are the whole story.

Vendor positives colliding with assumed ranges. Some vendors place safety-critical faults in the positive space that IEEE 488.2 leaves device-specific. Because classify tests the sign first, a +200 never leaks into the negative partition — but the default DISPATCH/fail-safe disposition means an unregistered vendor code is treated as non-retryable, which is correct. Register handlers per instrument model from its manual; diagnose an unexpected positive by logging the code verbatim and cross-referencing the vendor’s SCPI supplement, never by guessing from the description string.

Query-class faults misrouted to *RST. The most damaging mistake is treating a -410/-420/-430 as a device fault and issuing *RST, wiping configured ranges, trigger setup, and calibration in the middle of a run. The classifier assigns these to QUERY with a RESYNC action precisely to prevent that. If you observe instrument state resetting on transient errors, grep the recovery dispatch for any path where a QUERY directive reaches a reset call; the fix is to honour directive.action, not the raw code’s magnitude.

Description-based branching that drifts on firmware updates. Any recovery logic that switches on substrings like "out of range" instead of the code will silently mis-handle faults the day a firmware revision rewords them. Keep every decision anchored to error.code and error.error_class; use error.message only for logging and operator display. When you standardise codes across a mixed fleet, fold that discipline into the conventions on SCPI Command Set Standardization.

Once a directive is resolved, it becomes the structured fault signal that the rest of the pipeline consumes: transient QUERY and EXECUTION classes feed the retry machinery, DEVICE classes trip the resilience patterns described across the Serial, USB, and GPIB Communication Workflows pillar, and the same categorised codes let a data-validation layer such as Checksum & CRC Validation distinguish a corrupt frame from a device that is merely busy.

Error Code Categorization — the parent framework: queue polling, transport normalisation, and the full recovery-routing model this classifier plugs into.
Timeout Handling & Retry Logic — bounded retry ceilings and cumulative-timeout guards for retryable directives.
Async Command Queuing Systems — ordered command/response scheduling that prevents the query desyncs behind -4xx faults.
PySerial Configuration & Tuning — buffer and latency settings that determine whether SYST:ERR? reads arrive intact.
SCPI Command Set Standardization — keeping error semantics consistent across a mixed-vendor instrument fleet.

← Back to Error Code Categorization

References

IEEE 488.2-1992 Standard Codes, Formats, Protocols and Common Commands — the hundreds-block error taxonomy SCPI inherits.
Python bisect module documentation — the sorted-boundary lookup used for branch-stable classification.