Add DataFrame/DataPoint objects
CloudKitty has an internal data format called “DataFrame”. It is used almost everywhere: the API returns dataframes, the storage driver expects to store dataframes, and collected data is retrieved as a dataframe. However, dataframes are always passed around as dicts, which makes manipulating them tedious. This is a proposal to add DataFrame and DataPoint class definitions, which would allow easier conversion and manipulation of dataframes.
https://storyboard.openstack.org/#!/story/2005890
Problem Description
The “dataframe” format is specified in multiple places, but there is no true
implementation of it: dicts respecting the format specifications are passed
around instead. This can be error-prone: the integrity of these objects is not
guaranteed (a function might modify them, even without intending to), and some
specific details may vary from one part of the codebase to another (for
example a float may be used instead of a decimal.Decimal).
Furthermore, the dataframe format is not exactly the same in the v1 and v2
storage interfaces. v1 has a single desc key containing every metadata
attribute of a data point, whereas v2 provides two keys, metadata and
groupby, depending on the type of the attribute. This leads to conversions
between v1 and v2 format in several places in the code. Example taken from the
CloudKittyFormatTransformer:
    def format_item(self, groupby, metadata, unit, qty=1.0):
        data = {}
        data['groupby'] = groupby
        data['metadata'] = metadata
        # For backward compatibility.
        data['desc'] = data['groupby'].copy()
        data['desc'].update(data['metadata'])
        data['vol'] = {'unit': unit, 'qty': qty}
        return data
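The integrity problem described above can be illustrated with a small sketch. The `careless_consumer` helper and the `project_id` and `flavor` keys are hypothetical and only serve the illustration; they are not taken from the CloudKitty codebase.

```python
from decimal import Decimal

# A data point in the dict-based v2 format described above.
point = {
    "vol": {"unit": "GiB", "qty": Decimal("1.2")},
    "groupby": {"project_id": "abc"},
    "metadata": {"flavor": "m1.small"},
}


def careless_consumer(data):
    """Hypothetical helper that alters its argument as a side effect."""
    data["vol"]["qty"] = 1.2            # swaps the Decimal for a float
    data["groupby"].pop("project_id")   # drops a grouping key
    return data


careless_consumer(point)

# The caller's object has been changed without any error being raised.
qty_type = type(point["vol"]["qty"]).__name__
remaining_groupby = dict(point["groupby"])
```

Nothing in the dict-based representation prevents either modification, which is the kind of silent drift the proposed classes are meant to rule out.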
Proposed Change
The proposed solution is to introduce two new classes: DataPoint and
DataFrame.
DataPoint
DataPoint replaces a single data point represented by a dict with the
following format:
    {
        "vol": {
            "unit": "GiB",
            "qty": 1.2,
        },
        "rating": {
            "price": 0.04,
        },
        "groupby": {
            "group_one": "one",
            "group_two": "two",
        },
        "metadata": {
            "attr_one": "one",
            "attr_two": "two",
        },
    }
The following attributes will be accessible in a DataPoint object:
qty: decimal.Decimal
price: decimal.Decimal
groupby: werkzeug.datastructures.ImmutableDict
metadata: werkzeug.datastructures.ImmutableDict
desc: werkzeug.datastructures.ImmutableDict
Note
desc will be a combination of metadata and groupby
In order to ensure data consistency, the DataPoint class will subclass a
type generated by collections.namedtuple, which makes its fields read-only.
The groupby and metadata attributes will be stored as
werkzeug.datastructures.ImmutableDict.
In addition to its base attributes, the DataPoint class will have a
desc attribute (implemented as a property), which will return an
ImmutableDict (a merge of metadata and groupby).
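A minimal sketch of this design, under stated assumptions: `types.MappingProxyType` from the standard library is used here as a stand-in for `werkzeug.datastructures.ImmutableDict` so that the example has no third-party dependency, the `unit` field is assumed from the dict format shown above, and metadata is assumed to win over groupby when keys clash (as in the `format_item` example).

```python
import collections
import decimal
import types


def _freeze(mapping):
    # Stand-in for werkzeug.datastructures.ImmutableDict (assumption):
    # a read-only view over a private copy of the mapping.
    return types.MappingProxyType(dict(mapping))


_DataPointBase = collections.namedtuple(
    "_DataPointBase", ["unit", "qty", "price", "groupby", "metadata"])


class DataPoint(_DataPointBase):
    __slots__ = ()  # no per-instance __dict__, so no stray attributes

    def __new__(cls, unit, qty, price, groupby, metadata):
        return super().__new__(
            cls,
            unit,
            # Go through str() so floats such as 1.2 convert cleanly.
            decimal.Decimal(str(qty)),
            decimal.Decimal(str(price)),
            _freeze(groupby),
            _freeze(metadata),
        )

    @property
    def desc(self):
        # Merge of groupby and metadata, rebuilt on each access.
        merged = dict(self.groupby)
        merged.update(self.metadata)
        return _freeze(merged)


point = DataPoint("GiB", 1.2, 0.04,
                  {"group_one": "one"}, {"attr_one": "one"})
```

With this layout, both assigning to `point.qty` and writing to `point.groupby` raise an exception instead of silently changing the data point.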
DataPoint instances will expose the following methods:
set_price: Set the price of the DataPoint. Returns a new instance.
as_dict: Returns an (optionally mutable) dict representation of the object. For convenience with API backward compatibility, it will be possible to obtain the result in legacy format (desc will replace metadata and groupby).
json: Returns a json representation of the object. For convenience with API backward compatibility, it will be possible to obtain the result in legacy format (desc will replace metadata and groupby).
from_dict: Creates a DataPoint from its dict representation.
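The methods listed above could look roughly like the following sketch. Several details are assumptions made for illustration: the `legacy` keyword name, the `unit` field, the use of `types.MappingProxyType` instead of werkzeug's ImmutableDict, and the serialisation of Decimal values as JSON numbers. The "optionally mutable" variant of as_dict is left out for brevity.

```python
import collections
import decimal
import json
import types

_Fields = collections.namedtuple(
    "_Fields", ["unit", "qty", "price", "groupby", "metadata"])


def _dec(value):
    # Go through str() so floats such as 0.04 convert cleanly.
    return decimal.Decimal(str(value))


class DataPoint(_Fields):
    __slots__ = ()

    def __new__(cls, unit, qty, price, groupby, metadata):
        # MappingProxyType: stdlib stand-in for ImmutableDict (assumption).
        return super().__new__(
            cls, unit, _dec(qty), _dec(price),
            types.MappingProxyType(dict(groupby)),
            types.MappingProxyType(dict(metadata)))

    def set_price(self, price):
        # _replace() returns a new tuple; the original is left untouched.
        return self._replace(price=_dec(price))

    def as_dict(self, legacy=False):
        output = {
            "vol": {"unit": self.unit, "qty": self.qty},
            "rating": {"price": self.price},
        }
        if legacy:
            # Legacy layout: a single "desc" key replaces both mappings.
            output["desc"] = {**self.groupby, **self.metadata}
        else:
            output["groupby"] = dict(self.groupby)
            output["metadata"] = dict(self.metadata)
        return output

    def json(self, legacy=False):
        # Assumption: Decimal values are emitted as JSON numbers.
        return json.dumps(self.as_dict(legacy=legacy), default=float)

    @classmethod
    def from_dict(cls, data):
        return cls(
            unit=data["vol"]["unit"],
            qty=data["vol"]["qty"],
            price=data["rating"]["price"],
            groupby=data["groupby"],
            metadata=data["metadata"])


point = DataPoint.from_dict({
    "vol": {"unit": "GiB", "qty": 1.2},
    "rating": {"price": 0.04},
    "groupby": {"group_one": "one"},
    "metadata": {"attr_one": "one"},
})
repriced = point.set_price("0.08")
legacy = repriced.as_dict(legacy=True)
```

Because set_price returns a new instance, `point` still carries the original price after the call, and only `repriced` reflects the update.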
DataFrame
DataFrame replaces a dataframe represented by a dict with the following
format:
    {
        "period": {
            "begin": datetime.datetime,
            "end": datetime.datetime,
        },
        "usage": {
            "metric_one": [], # list of datapoints
            [...]
        }
    }
A DataFrame is a wrapper around a collection of DataPoint objects.
DataFrame instances will have two read-only attributes: start and
end (stored as datetime.datetime objects).
DataFrame instances will expose the following methods:
as_dict: Returns an (optionally mutable) dict representation of the object. For convenience with API backward compatibility, it will be possible to obtain the result in legacy format.
json: Returns a json representation of the object. For convenience with API backward compatibility, it will be possible to obtain the result in legacy format.
from_dict: Creates a DataFrame from its dict representation.
add_points: Adds a list of DataPoint objects to a dataframe for a given metric.
iterpoints: Generator function iterating over all points in the DataFrame. Yields (metric_name, DataPoint) tuples.
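A possible shape for the wrapper, reduced to the read-only period attributes plus add_points and iterpoints. This is only a sketch: the argument order of add_points and the private attribute names are assumptions, and a plain namedtuple stands in for the DataPoint class. The serialisation methods are omitted.

```python
import collections
import datetime
import decimal

# Minimal stand-in for the DataPoint class, just enough for this sketch.
DataPoint = collections.namedtuple(
    "DataPoint", ["unit", "qty", "price", "groupby", "metadata"])


class DataFrame:
    def __init__(self, start, end):
        self._start = start
        self._end = end
        self._usage = {}  # metric name -> list of DataPoint objects

    @property
    def start(self):
        # No setter is defined, so the attribute is read-only.
        return self._start

    @property
    def end(self):
        return self._end

    def add_points(self, points, metric_name):
        """Append a list of points to the given metric."""
        self._usage.setdefault(metric_name, []).extend(points)

    def iterpoints(self):
        """Yield (metric_name, DataPoint) tuples for every stored point."""
        for metric_name, points in self._usage.items():
            for point in points:
                yield metric_name, point


frame = DataFrame(
    start=datetime.datetime(2019, 6, 1, 0, 0),
    end=datetime.datetime(2019, 6, 1, 1, 0),
)
volume = DataPoint("GiB", decimal.Decimal("1.2"), decimal.Decimal("0.04"),
                   {"group_one": "one"}, {"attr_one": "one"})
frame.add_points([volume], "metric_one")
pairs = list(frame.iterpoints())
```

Iterating through iterpoints keeps callers from depending on how points are grouped internally, which is useful when the same code has to process every point regardless of its metric.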
Note
Given that the from_dict method of both classes will mainly be
used at the API level, voluptuous schemas matching the classes
will be added, and the argument passed to from_dict will be
validated against the matching schema.
Alternatives
The codebase could be left as is, leaving developers to deal with tedious dataframe manipulation.
Data model impact
Data structures manipulated internally get hardened.
REST API impact
None. However, this would make it easier to add a future endpoint for pushing dataframes to CloudKitty.
Security impact
None.
Notifications Impact
None.
Other end user impact
None.
Performance Impact
Instantiating DataPoints might be slightly slower than instantiating dicts.
However, namedtuple is a high-performance container, and several
dict formatting steps that are currently executed will be skipped if we use
a namedtuple subclass, so there may be no overhead at all.
Other deployer impact
None.
Developer impact
Manipulating objects with a clear and strict interface should make developing with dataframes easier and far less error-prone.
No extra dependencies are required.
Implementation
Assignee(s)
Primary assignee:
peschk_l
Work Items
Create validation utilities for checking the datapoint/dataframe format.
Submit the new classes along with tests.
Dependencies
None.
Testing
This will be tested with unit tests. 100% test coverage is expected.
Documentation Impact
None.
References
None.