Unifying and cleaning up failure remoting¶
https://blueprints.launchpad.net/oslo.messaging/+spec/failure-remoting
We currently have a couple ways of remoting failures (exceptions + traceback information) that occur on remote systems back to their source. These different ways have differences that make each solution valid and applicable to its problem area. To encourage unification, this spec will work on a proposal that can take the best aspects of both implementations and leave the weaknesses of both behind to make a best of breed implementation.
Problem description¶
There is a repeated desire to be able to serialize an exception, an
exception type, and as much information about the exceptions cause (ie
its traceback) when a creator on a remote system fails to some
other system (typically transmitted over some RPC or REST or other
non-local interface). For brevity sake let us call the tuple
of (exception_type, value, traceback)
(typically created from some
call to sys.exc_info) a failure object. When on a local machine
and the failure is created inside its own process the exception, its class
and its traceback are natively supported and can be examined, output,
logged (typically using the traceback module), handled (via try/catch
blocks) and analyzed; but when that exception is remotely created and
sent to a receiver the recreation of that failure becomes that much more
complicated for a few reasons:
Serialization of a traceback object (which typically contains references to local stack frames) into some serializable format typically means that the reconstructed traceback will not be as rich as it was when created on the local process due to the fact that those local stack frames will not exist in the receivers process. This implies that traceback serialization/deserialization is a lossy process and by side-effect this means that for remote exceptions the traceback module can not be used and/or that the information it produces may not be accurate.
Input validation must now be performed, ensuring that the serialized format created by the sender is actually valid (this excludes using pickle for serialization/deserialization due to its widely known security vulnerabilities).
The receiver of the failure, if it desires to try to recreate an exception object from the serialized version must have access to the same exception type/class that was used to create the original exception; this may not always be possible (depending on modules and classes accessible from the receivers
sys.path
).Any contained exception value (typically a
string
, but not limited to) will need to be reconstructed (this may not always be possible, for example if the originating exception value references some local file handle or other non-serializable object, such as a local threading lock).
What exists¶
There are a few known implementations of failure capturing, serialization and deserialization/reconstruction. Let us dive into how each one works and analyze the benefits and drawbacks of each approach.
Oslo.privsep¶
Source:
https://github.com/openstack/oslo.privsep/blob/1.13.0/oslo_privsep/daemon.py#L181-L187
https://github.com/openstack/oslo.privsep/blob/1.13.0/oslo_privsep/daemon.py#L181-L187
Commentary¶
Sends back class + module name across socket channel + exception arguments.
Drops traceback (logs it on priviliged side).
Recreates new class object with sent across arguments (and reraises) on unpriviliged side (ideally nothing leaks across?).
Oslo.messaging¶
Source:
https://github.com/openstack/oslo.messaging/blob/2.5.0/oslo_messaging/_drivers/common.py#L164
https://github.com/openstack/oslo.messaging/blob/2.5.0/oslo_messaging/_drivers/common.py#L204
A similar (same?) copy seems to be in nova (for cells?):
https://github.com/openstack/nova/blob/stable/liberty/nova/cells/messaging.py#L1878
https://github.com/openstack/nova/blob/stable/liberty/nova/cells/messaging.py#L1918
Docs: unknown
Commentary¶
Serializes: yes (to json); keyword arguments of exception are extracted
from optional exception attribute kwargs
, class name and module name
of exception are captured with final data format being:
data = {
'class': cls_name,
'module': mod_name,
'message': six.text_type(exception),
'tb': tb,
'args': exception.args,
'kwargs': kwargs
}
Deserializes: yes; previous json data is loaded as a dictionary.
Validates: No; jsonschema validation is not currently performed.
Reconstructs: yes (with limitations); message of exception from
message
in data
is loaded and concated with traceback from tb
dictionary element, module received is then verified against a provided list
and if module received is not allowed a generic exception is raised which
attempts to encapsulate the received failure. This generic
exception (which does retain the traceback) is created via:
oslo_messaging.RemoteError(data.get('class'), data.get('message'),
data.get('tb'))
Otherwise if the module is one of the allowed types the exception class object is recreated by using:
klass = <load module and class and verify class is an exception type>
exception = klass(*data.get('args', []), **data.get('kwargs', {}))
Then if this works, to ensure the __str__
and __unicode__
methods
correctly return the message
key in the previously mentioned data
dictionary a dynamic exception type is created with a dynamically created
function that returns provided message
; then the exception
created
above has its __class__
attribute replaced to be this new dynamic
exception type (woah!):
exc_type = type(exception)
str_override = lambda self: message
new_ex_type = type(ex_type.__name__ + _REMOTE_POSTFIX, (ex_type,),
{'__str__': str_override, '__unicode__': str_override})
new_ex_type.__module__ = '%s%s' % (module, _REMOTE_POSTFIX)
exception.__class__ = new_ex_type
if this doesn’t work then exception
is returned
untouched and instead the exception.args
list is replaced with a new
args
list that has the message
from the data
dict as its first
entry (replacing the prior args
first entry with its own).
Notes:
Appears to lose remote traceback info during above reconstruction process (unless RemoteError is returned, which does not lose the traceback, but does lose the original type + associated information).
Does not capture chained exception information.
Copied (or some version of it) into nova cells (currently unknown what version/sha the nova folks copied from).
TaskFlow¶
Source:
Docs:
Commentary¶
Serializes: True; translates exception (or sys.exc_info
call) into
a dictionary using to_dict
method. Example:
>>> from taskflow.types import failure
>>> try:
... raise IOError("I have broke")
... except Exception:
... f = failure.Failure()
...
>>> print(json.dumps(f.to_dict(), indent=4, sort_keys=True))
{
"causes": [],
"exc_type_names": [
"IOError",
"EnvironmentError",
"StandardError",
"Exception"
],
"exception_str": "I have broke",
"traceback_str": " File \"<stdin>\", line 2, in <module>\n",
"version": 1
}
Deserializes: True; loads from json into dictionary.
Validates: True; uses jsonschema with schema:
SCHEMA = {
"$ref": "#/definitions/cause",
"definitions": {
"cause": {
"type": "object",
'properties': {
'version': {
"type": "integer",
"minimum": 0,
},
'exception_str': {
"type": "string",
},
'traceback_str': {
"type": "string",
},
'exc_type_names': {
"type": "array",
"items": {
"type": "string",
},
"minItems": 1,
},
'causes': {
"type": "array",
"items": {
"$ref": "#/definitions/cause",
},
}
},
"required": [
"exception_str",
'traceback_str',
'exc_type_names',
],
"additionalProperties": True,
},
},
}
Reconstructs: True when failure objects are raised locally (when serialization
is not used). False when serialized using to_dict
; Instead of going
through process like defined in oslo.messaging
above this object
instead wraps originating exception(s) in a new exception WrappedFailure and
exposes its type (string version of) information and its traceback in a
new exception and provides accessors and useful methods (defined on the
failure class) to contained information for introspection purposes.
Notes:
Captures (and serializes and deserializes) chained exceptions (as nested failure objects). Seen in above schema as
causes
key (which self-references the schema object).
Twisted¶
Source:
Docs:
Commentary¶
Example:
>>> from twisted.python import failure
>>> import pickle
>>> import traceback
>>> def blow_up():
... raise ValueError("broken")
>>> try:
... blow_up()
... except ValueError:
... f = failure.Failure()
>>> print(f)
[Failure instance: Traceback: <type 'exceptions.ValueError'>: broken
--- <exception caught here> ---
<stdin>:2:<module>
<stdin>:2:blow_up
]
>>> f.raiseException()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 2, in <module>
File "<stdin>", line 2, in blow_up
ValueError: broken
>>> f_p = pickle.dumps(f)
>>> f_2 = pickle.loads(f_p)
>>> f_2.raiseException()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 2, in raiseException
ValueError: broken
>>> print(f_2.tb)
None
>>> traceback.print_tb(f_2.getTracebackObject())
File "<stdin>", line 2, in <module>
File "<stdin>", line 2, in blow_up
Serializes: pickle supported via __getstate__
method. Since
they have created a mostly working replacement for the frame information
that a traceback stores it becomes possible to better integrate with
the traceback module (which accesses that frame information to try to
create useful traceback details).
Deserializes: Yes, via pickle.
Validates: No (pickle is known to be vulnerable anyway to loading arbitrary code).
Reconstructs: Partially, a frame-like replica structure is created that mostly works like the original (except it can’t be re-raised, but it can be passed to the traceback module to have its functions seemingly work).
Proposed change¶
Create a new library, https://pypi.org/project/failure (or other better named library) that encompasses the combination of the 3-4 models described above.
It would primarily provide a Failure
object (like provided by
taskflow and twisted) as its main exposed API. That failure
class would have a __get_state__
method so that it can be pickled (for
situations where this is desired) and a to_dict
and from_dict
that
can be used for json serialization and deserialization. It would also have
introspection APIs (similar to what is provided by twisted and taskflow) so
that the underlying exception information can be accessed in nice manner.
Basic examples of these API(s) that would be great to have (and have proven themselves useful):
@classmethod
def validate(cls, data):
"""Validate input data matches expected failure format."""
def check(self, *exc_classes):
"""Check if any of ``exc_classes`` caused the failure.
...
"""
def reraise(self):
"""Re-raise captured exception."""
@property
def causes(self):
"""Tuple of all *inner* failure *causes* of this failure.
...
"""
def pformat(self, traceback=False):
"""Pretty formats the failure object into a string."""
@classmethod
def from_dict(cls, data):
"""Converts this from a dictionary to a object."""
def to_dict(self):
"""Converts this object to a dictionary."""
def copy(self):
"""Copies this object."""
To take advantage of the re-raising capabilities in oslo.messaging this
class should also have a reraise
method that can attempt to reraise the
given failure (if and only if it matches a given list of exception types). It
would not attempt to dynamically create a __str__
and __repr__
method (the class manipulation magic happening in oslo.messaging) to avoid
the peculiarities of this chunk of code. If the contained failure does
not match a known list of failures, then reraise
will return false and
it will not re-raise anything (leaving it up to the caller to decide what
to do in this situation, perhaps at this point a common WrappedFailure
like exception should be raised?).
The validation logic using jsonschema would be taken from taskflow and used when deserializing so that errors with bad data can be found earlier (at data load time) rather than later (at data access time).
To provide the twisted like integration with the traceback module (by turning the internal format of a traceback into a pure python object representation) there has been discussed if the traceback2 module can provide equivalent functionality, if it can then it should be used to achieve similar integration (it would be even better if the integration would also allow for re-raising this pure python trackback and frame representation as an actual traceback, although this may not be a reasonable expectation).
Alternatives¶
Keep having multiple variations, each with their own weaknesses and benefits, instead of unifying them under a single library.
Impact on Existing APIs¶
Ideally none, as the users should still get the same functionality, but if this is done correctly they will get more meaningful tracebacks, more meaningful introspection on failure objects and overall better and more consistent failures.
Security impact¶
Performance Impact¶
N/A
Configuration Impact¶
N/A
Developer Impact¶
This should make developers lives better.
Testing Impact¶
Having the failure code in its own library, allows it to be easily mocked and tested (vs say having it deeply embedded in oslo.messaging where it is not so easily testable/reviewable…); so overall this should improve test coverage (and overall code quality).
Implementation¶
Assignee(s)¶
Primary assignee: harlowja
Milestones¶
Target Milestone for completion: Mikita
Work Items¶
Create skeleton library.
Get skeleton up on gerrit and integrated into oslo pipelines.
Start to move around code from oslo.messaging and taskflow and refactor to start to form this new library; use concepts and learning from twisted and bolt-ons (and others) to help make this library the best it can be.
Review and code and repeat.
Release and integrate.
Delete older dead code.
Profit!
Incubation¶
N/A
Documentation Impact¶
Dependencies¶
References¶
N/A (all inline)
Note
This work is licensed under a Creative Commons Attribution 3.0 Unported License. http://creativecommons.org/licenses/by/3.0/legalcode