Unifying and cleaning up failure remoting

https://blueprints.launchpad.net/oslo.messaging/+spec/failure-remoting

We currently have a couple of ways of remoting failures (exceptions plus traceback information) that occur on remote systems back to their source. Each approach makes trade-offs that suit its own problem area. To encourage unification, this spec proposes an implementation that takes the best aspects of the existing ones and leaves their weaknesses behind.

Problem description

There is a repeated desire to serialize an exception, its type, and as much information as possible about its cause (i.e. its traceback) when code on a remote system fails, so that the failure can be sent to some other system (typically over RPC, REST or another non-local interface). For brevity, let us call the tuple of (exception_type, value, traceback) (typically created by a call to sys.exc_info) a failure object. When the failure is created inside the local process, the exception, its class and its traceback are natively supported and can be examined, output, logged (typically using the traceback module), handled (via try/except blocks) and analyzed. When the exception is created remotely and sent to a receiver, recreating that failure becomes more complicated for a few reasons:

  • Serializing a traceback object (which typically contains references to local stack frames) typically means that the reconstructed traceback will not be as rich as the original, because those local stack frames do not exist in the receiver's process. Traceback serialization/deserialization is therefore a lossy process, and as a side effect the traceback module either cannot be used for remote exceptions or may produce inaccurate information for them.

  • Input validation must now be performed, ensuring that the serialized format created by the sender is actually valid (this excludes using pickle for serialization/deserialization due to its widely known security vulnerabilities).

  • The receiver of the failure, if it wants to recreate an exception object from the serialized version, must have access to the same exception type/class that was used to create the original exception; this may not always be possible (depending on the modules and classes accessible from the receiver's sys.path).

  • Any contained exception value (typically a string, but not limited to one) will need to be reconstructed. This may not always be possible, for example if the originating exception value references a local file handle or another non-serializable object, such as a local threading lock.
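The losses listed above can be reproduced with the standard library alone. The following is a minimal sketch, not code from any of the libraries discussed in this spec; the helper name `capture` and the dictionary keys are assumptions chosen for illustration.

```python
import json
import sys
import threading
import traceback


def capture():
    """Turn the active exception into a JSON-friendly dict (hypothetical helper)."""
    exc_type, exc_value, tb = sys.exc_info()
    args = []
    for arg in exc_value.args:
        try:
            json.dumps(arg)
            args.append(arg)
        except (TypeError, ValueError):
            # Locks, file handles and similar objects cannot cross the wire;
            # only their repr() survives.
            args.append(repr(arg))
    return {
        "class": exc_type.__name__,
        "module": exc_type.__module__,
        "message": str(exc_value),
        "args": args,
        # The traceback object references live frames, so only text is kept.
        "tb": traceback.format_tb(tb),
    }


try:
    raise RuntimeError("lock is busy", threading.Lock())
except RuntimeError:
    live_tb = sys.exc_info()[2]
    data = capture()

# The live traceback object itself is rejected by the JSON encoder.
try:
    json.dumps(live_tb)
    tb_is_serializable = True
except TypeError:
    tb_is_serializable = False

# What the receiver sees: strings and lists, no frames and no lock.
received = json.loads(json.dumps(data))
```

On the receiving side the lock argument is only a string and the traceback is only a list of formatted lines, which is the lossy behaviour described in the bullets above.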

What exists

There are a few known implementations of failure capturing, serialization and deserialization/reconstruction. Let us dive into how each one works and analyze the benefits and drawbacks of each approach.

Oslo.privsep

Source:

Commentary

  • Sends back class + module name across socket channel + exception arguments.

  • Drops traceback (logs it on privileged side).

  • Recreates new class object with sent across arguments (and reraises) on unprivileged side (ideally nothing leaks across?).
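The approach in the bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not oslo.privsep's actual code: the function names `dump_exception` and `load_exception` and the JSON list layout are hypothetical.

```python
import importlib
import json


def dump_exception(exc):
    """Sending side: describe the exception without its traceback."""
    cls = type(exc)
    return json.dumps([cls.__module__, cls.__name__, list(exc.args)])


def load_exception(payload):
    """Receiving side: rebuild an instance of the named exception class."""
    module_name, class_name, args = json.loads(payload)
    # A real implementation would likely restrict which modules may be
    # imported here; this sketch trusts the sender.
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    if not (isinstance(cls, type) and issubclass(cls, Exception)):
        raise TypeError("%s.%s is not an exception class"
                        % (module_name, class_name))
    return cls(*args)


payload = dump_exception(KeyError("missing"))
rebuilt = load_exception(payload)
```

The rebuilt exception has the right type and arguments but no traceback, which matches the "drops traceback" note above.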

Oslo.messaging

Source:

A similar (same?) copy seems to be in nova (for cells?):

Docs: unknown

Commentary

Serializes: yes (to json); keyword arguments of the exception are extracted from the optional exception attribute kwargs, and the class name and module name of the exception are captured, with the final data format being:

data = {
    'class': cls_name,
    'module': mod_name,
    'message': six.text_type(exception),
    'tb': tb,
    'args': exception.args,
    'kwargs': kwargs
}

Deserializes: yes; previous json data is loaded as a dictionary.

Validates: No; jsonschema validation is not currently performed.

Reconstructs: yes (with limitations); the exception message is loaded from message in the data and concatenated with the traceback from the tb dictionary element. The received module is then verified against a provided list, and if it is not allowed a generic exception is raised which attempts to encapsulate the received failure. This generic exception (which does retain the traceback) is created via:

oslo_messaging.RemoteError(data.get('class'), data.get('message'),
                           data.get('tb'))

Otherwise if the module is one of the allowed types the exception class object is recreated by using:

klass = <load module and class and verify class is an exception type>
exception = klass(*data.get('args', []), **data.get('kwargs', {}))

If this works, then to ensure that the __str__ and __unicode__ methods correctly return the message key from the previously mentioned data dictionary, a dynamic exception type is created with a dynamically created function that returns the provided message; the exception created above then has its __class__ attribute replaced with this new dynamic exception type (woah!):

ex_type = type(exception)
str_override = lambda self: message
new_ex_type = type(ex_type.__name__ + _REMOTE_POSTFIX, (ex_type,),
                   {'__str__': str_override, '__unicode__': str_override})
new_ex_type.__module__ = '%s%s' % (module, _REMOTE_POSTFIX)
exception.__class__ = new_ex_type

If this doesn't work, the exception's class is left untouched and instead the exception.args list is replaced with a new args list that has the message from the data dict as its first entry (replacing the prior first entry).
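A self-contained version of the class swap and its fallback, as described above, might look like the sketch below. It is not oslo.messaging's code: `QuotaExceeded`, `apply_remote_message` and the value of `_REMOTE_POSTFIX` are assumptions made for the demonstration, and only `__str__` is overridden.

```python
_REMOTE_POSTFIX = "_Remote"  # assumed value, for illustration only


class QuotaExceeded(Exception):
    """Hypothetical application exception used only for this demonstration."""


def apply_remote_message(exception, module, message):
    """Make str(exception) return ``message``, as described above."""
    ex_type = type(exception)
    try:
        new_ex_type = type(ex_type.__name__ + _REMOTE_POSTFIX, (ex_type,),
                           {"__str__": lambda self: message})
        new_ex_type.__module__ = "%s%s" % (module, _REMOTE_POSTFIX)
        exception.__class__ = new_ex_type
    except TypeError:
        # Some types refuse __class__ assignment; fall back to replacing
        # the first argument with the remote message.
        exception.args = (message,) + exception.args[1:]
    return exception


remote = apply_remote_message(QuotaExceeded("local text"), "myapp.exc",
                              "remote text")
builtin = apply_remote_message(ValueError("local text"), "builtins",
                               "remote text")
```

In both cases `str()` of the result yields the remote message and `isinstance` checks against the original type still pass; which branch is taken for a given type depends on whether the interpreter allows `__class__` assignment for it.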

Notes:

  • Appears to lose remote traceback info during above reconstruction process (unless RemoteError is returned, which does not lose the traceback, but does lose the original type + associated information).

  • Does not capture chained exception information.

  • Copied (or some version of it) into nova cells (currently unknown what version/sha the nova folks copied from).

TaskFlow

Source:

Docs:

Commentary

Serializes: True; translates exception (or sys.exc_info call) into a dictionary using to_dict method. Example:

>>> from taskflow.types import failure
>>> try:
...    raise IOError("I have broke")
... except Exception:
...    f = failure.Failure()
...
>>> print(json.dumps(f.to_dict(), indent=4, sort_keys=True))
{
    "causes": [],
    "exc_type_names": [
        "IOError",
        "EnvironmentError",
        "StandardError",
        "Exception"
    ],
    "exception_str": "I have broke",
    "traceback_str": "  File \"<stdin>\", line 2, in <module>\n",
    "version": 1
}

Deserializes: True; loads from json into dictionary.

Validates: True; uses jsonschema with schema:

SCHEMA = {
    "$ref": "#/definitions/cause",
    "definitions": {
        "cause": {
            "type": "object",
            'properties': {
                'version': {
                    "type": "integer",
                    "minimum": 0,
                },
                'exception_str': {
                    "type": "string",
                },
                'traceback_str': {
                    "type": "string",
                },
                'exc_type_names': {
                    "type": "array",
                    "items": {
                        "type": "string",
                    },
                    "minItems": 1,
                },
                'causes': {
                    "type": "array",
                    "items": {
                        "$ref": "#/definitions/cause",
                    },
                }
            },
            "required": [
                "exception_str",
                'traceback_str',
                'exc_type_names',
            ],
            "additionalProperties": True,
        },
    },
}

Reconstructs: True when failure objects are raised locally (when serialization is not used); False when serialized using to_dict. Instead of going through a process like the one described for oslo.messaging above, this object wraps the originating exception(s) in a new exception, WrappedFailure, which exposes their type information (as strings) and traceback, and provides accessors and useful methods (defined on the failure class) for introspecting the contained information.

Notes:

  • Captures (and serializes and deserializes) chained exceptions (as nested failure objects). Seen in above schema as causes key (which self-references the schema object).
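The nesting of causes can be illustrated with Python 3 exception chaining. The sketch below is not TaskFlow's implementation; `failure_to_dict` is a hypothetical helper that reuses the key names from the schema above and follows only explicit `__cause__` links.

```python
import json
import traceback


def failure_to_dict(exc):
    """Describe ``exc`` and, recursively, the exception that caused it."""
    causes = []
    if exc.__cause__ is not None:
        causes.append(failure_to_dict(exc.__cause__))
    return {
        "version": 1,
        "exception_str": str(exc),
        # Class names from most to least specific, excluding ``object``.
        "exc_type_names": [cls.__name__ for cls in type(exc).__mro__
                           if issubclass(cls, BaseException)],
        "traceback_str": "".join(traceback.format_tb(exc.__traceback__)),
        "causes": causes,
    }


try:
    try:
        raise KeyError("volume-id")
    except KeyError as inner:
        raise RuntimeError("attach failed") from inner
except RuntimeError as outer:
    data = failure_to_dict(outer)

encoded = json.dumps(data, indent=4, sort_keys=True)
```

Each entry in `causes` has the same shape as its parent, which is what the self-referencing `causes` definition in the schema expresses.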

Twisted

Source:

Docs:

Commentary

Example:

>>> from twisted.python import failure
>>> import pickle
>>> import traceback
>>> def blow_up():
...    raise ValueError("broken")
>>> try:
...    blow_up()
... except ValueError:
...    f = failure.Failure()
>>> print(f)
[Failure instance: Traceback: <type 'exceptions.ValueError'>: broken
--- <exception caught here> ---
<stdin>:2:<module>
<stdin>:2:blow_up
]
>>> f.raiseException()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in <module>
  File "<stdin>", line 2, in blow_up
ValueError: broken
>>> f_p = pickle.dumps(f)
>>> f_2 = pickle.loads(f_p)
>>> f_2.raiseException()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 2, in raiseException
ValueError: broken
>>> print(f_2.tb)
None
>>> traceback.print_tb(f_2.getTracebackObject())
  File "<stdin>", line 2, in <module>
  File "<stdin>", line 2, in blow_up

Serializes: pickle supported via __getstate__ method. Since they have created a mostly working replacement for the frame information that a traceback stores it becomes possible to better integrate with the traceback module (which accesses that frame information to try to create useful traceback details).

Deserializes: Yes, via pickle.

Validates: No (pickle is known to be vulnerable anyway to loading arbitrary code).

Reconstructs: Partially; a frame-like replica structure is created that mostly works like the original (except it can’t be re-raised, but it can be passed to the traceback module to have its functions seemingly work).
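The idea of a frame-free replica that the traceback module can still format can be shown with the standard library. This sketch does not reproduce Twisted's implementation; it only assumes that a list of (filename, lineno, name, line) tuples is an acceptable stand-in for the frame information.

```python
import json
import sys
import traceback


def blow_up():
    raise ValueError("broken")


try:
    blow_up()
except ValueError:
    tb = sys.exc_info()[2]
    # Plain (filename, lineno, name, line) tuples hold no frame references.
    frames = [(f.filename, f.lineno, f.name, f.line)
              for f in traceback.extract_tb(tb)]

# Simulate the trip across the wire; JSON turns the tuples into lists.
received = [tuple(frame) for frame in json.loads(json.dumps(frames))]

# format_list() accepts such tuples, so the traceback module still works.
rendered = "".join(traceback.format_list(received))
```

As with the Twisted replica, the result can be printed and inspected but cannot be attached to an exception and re-raised as a real traceback.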

Proposed change

Create a new library, https://pypi.org/project/failure (or other better named library) that encompasses the combination of the 3-4 models described above.

It would primarily provide a Failure object (like the ones provided by taskflow and twisted) as its main exposed API. That failure class would have a __getstate__ method so that it can be pickled (for situations where this is desired), and to_dict and from_dict methods that can be used for json serialization and deserialization. It would also have introspection APIs (similar to what is provided by twisted and taskflow) so that the underlying exception information can be accessed in a nice manner.

Basic examples of these API(s) that would be great to have (and have proven themselves useful):

@classmethod
def validate(cls, data):
    """Validate input data matches expected failure format."""

def check(self, *exc_classes):
    """Check if any of ``exc_classes`` caused the failure.

    ...

    """

def reraise(self):
    """Re-raise captured exception."""

@property
def causes(self):
    """Tuple of all *inner* failure *causes* of this failure.

    ...

    """

def pformat(self, traceback=False):
    """Pretty formats the failure object into a string."""

@classmethod
def from_dict(cls, data):
    """Converts a dictionary into an object."""

def to_dict(self):
    """Converts this object to a dictionary."""

def copy(self):
    """Copies this object."""

To take advantage of the re-raising capabilities in oslo.messaging, this class should also have a reraise method that attempts to reraise the given failure (if and only if it matches a given list of exception types). It would not attempt to dynamically create __str__ and __unicode__ methods (the class manipulation magic happening in oslo.messaging), to avoid the peculiarities of that chunk of code. If the contained failure does not match a known list of failures, then reraise will return false and will not re-raise anything, leaving it up to the caller to decide what to do in this situation (perhaps at this point a common WrappedFailure-like exception should be raised?).
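One possible shape for this behaviour is sketched below. Everything here is an assumption for illustration: the class name `SketchFailure`, its constructor, the dictionary keys, and in particular passing the allowed types as a parameter to `reraise` (the spec does not say where the list comes from).

```python
class SketchFailure(object):
    """Hypothetical, stripped-down stand-in for the proposed Failure class."""

    def __init__(self, module, cls_name, args, traceback_str=""):
        self.module = module
        self.cls_name = cls_name
        self.args = tuple(args)
        self.traceback_str = traceback_str

    @classmethod
    def from_exception(cls, exc):
        exc_type = type(exc)
        return cls(exc_type.__module__, exc_type.__name__, exc.args)

    @classmethod
    def from_dict(cls, data):
        return cls(data["module"], data["class"], data["args"],
                   data.get("traceback_str", ""))

    def to_dict(self):
        return {"module": self.module, "class": self.cls_name,
                "args": list(self.args),
                "traceback_str": self.traceback_str}

    def reraise(self, allowed):
        """Raise the original type if it is in ``allowed``, else return False."""
        for exc_type in allowed:
            if (exc_type.__module__, exc_type.__name__) == (self.module,
                                                            self.cls_name):
                # No dynamic subclassing; a plain instance is raised.
                raise exc_type(*self.args)
        return False


wire = SketchFailure.from_exception(LookupError("no such host")).to_dict()
failure = SketchFailure.from_dict(wire)

# Not in the allowed list: nothing is raised and the caller decides.
unmatched = failure.reraise([ValueError])

try:
    failure.reraise([ValueError, LookupError])
    reraised = None
except LookupError as exc:
    reraised = exc
```

Because the receiver supplies the exception classes, no module is imported based on data from the wire, and a non-matching failure simply yields `False`.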

The validation logic using jsonschema would be taken from taskflow and used when deserializing so that errors with bad data can be found earlier (at data load time) rather than later (at data access time).
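The difference between failing at load time and failing at access time can be shown with a small stand-in. The spec proposes jsonschema for this; the sketch below deliberately swaps in hand-written standard-library checks that mirror the required keys of the TaskFlow schema shown earlier, and the names `validate` and `load_failure` are hypothetical.

```python
import json


def validate(data):
    """Raise ValueError unless ``data`` looks like a serialized failure."""
    if not isinstance(data, dict):
        raise ValueError("failure must be an object")
    for key in ("exception_str", "traceback_str"):
        if not isinstance(data.get(key), str):
            raise ValueError("%r must be a string" % key)
    names = data.get("exc_type_names")
    if (not isinstance(names, list) or not names
            or not all(isinstance(name, str) for name in names)):
        raise ValueError("'exc_type_names' must be a non-empty string list")
    causes = data.get("causes", [])
    if not isinstance(causes, list):
        raise ValueError("'causes' must be a list")
    for cause in causes:
        # Causes have the same shape, so the check recurses.
        validate(cause)


def load_failure(text):
    """Decode and validate in one step so bad data fails immediately."""
    data = json.loads(text)
    validate(data)
    return data


good = {"exception_str": "boom", "traceback_str": "",
        "exc_type_names": ["ValueError", "Exception"], "causes": []}
loaded = load_failure(json.dumps(good))

try:
    load_failure(json.dumps({"exception_str": "boom"}))
    rejected = False
except ValueError:
    rejected = True
```

The incomplete payload is rejected when it is loaded, before any caller tries to read `traceback_str` or `exc_type_names` from it.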

To provide the twisted-like integration with the traceback module (by turning the internal format of a traceback into a pure python object representation), it has been discussed whether the traceback2 module can provide equivalent functionality. If it can, it should be used to achieve similar integration (it would be even better if the integration also allowed re-raising this pure python traceback and frame representation as an actual traceback, although this may not be a reasonable expectation).

Alternatives

Keep having multiple variations, each with their own weaknesses and benefits, instead of unifying them under a single library.

Impact on Existing APIs

Ideally none, as the users should still get the same functionality, but if this is done correctly they will get more meaningful tracebacks, more meaningful introspection on failure objects and overall better and more consistent failures.

Security impact

Performance Impact

N/A

Configuration Impact

N/A

Developer Impact

This should make developers' lives better.

Testing Impact

Having the failure code in its own library allows it to be easily mocked and tested (versus, say, having it deeply embedded in oslo.messaging, where it is not so easily testable/reviewable), so overall this should improve test coverage (and overall code quality).

Implementation

Assignee(s)

Primary assignee: harlowja

Milestones

Target Milestone for completion: Mitaka

Work Items

  1. Create skeleton library.

  2. Get skeleton up on gerrit and integrated into oslo pipelines.

  3. Start to move around code from oslo.messaging and taskflow and refactor to start to form this new library; use concepts and learning from twisted and bolt-ons (and others) to help make this library the best it can be.

  4. Review, code and repeat.

  5. Release and integrate.

  6. Delete older dead code.

  7. Profit!

Incubation

N/A

Documentation Impact

Dependencies

References

N/A (all inline)

Note

This work is licensed under a Creative Commons Attribution 3.0 Unported License. http://creativecommons.org/licenses/by/3.0/legalcode