Storage of recently logged events for clusters

https://blueprints.launchpad.net/sahara/+spec/event-log

This specification proposes adding an event log to each cluster.

Problem description

It would be more user-friendly to have an event log assigned to each cluster. Users would then be able to see the steps performed to deploy a cluster. If there is an issue with their cluster, users will be able to see the reasons for it in the UI and won’t be required to read the Sahara logs.

Proposed change

The new feature will provide the following abilities:

  • Each cluster will have an event log assigned to it. The deployer will be able to view it in Horizon and so see all the steps of cluster deployment.

  • The deployer will be able to see the current progress of cluster provisioning.

Alternatives

None

Data model impact

This change will require writing event log messages to the database. It is a good idea to store event messages in the database in the same manner in which we store cluster data, node group data and so on.

Plugins should provide a list of provisioning steps and be able to report the status of the current step. All steps should be performed sequentially, and we will store events only for the current step. All completed steps should have their duration stored in the database. There is no reason to store events for successfully completed steps, so they will be dropped periodically.

If an error occurs while provisioning a cluster, we will have error events saved for the current step. We should also store an error_id for each error event, which would help admins find the reasons for failures in the sahara logs.
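The retention rule described above could be sketched as follows. This is only an illustration, assuming steps and events are plain dictionaries; the function name prune_events and the in-memory lists are hypothetical and stand in for the database queries a periodic task would run.

```python
def prune_events(steps, events):
    """Return only the events that are still worth keeping.

    Events are dropped when their step finished successfully. Events of
    the current (unfinished) step and of failed steps are kept.
    """
    # Steps whose `successful` flag is True have finished without errors.
    finished_ok = {step["id"] for step in steps if step["successful"] is True}
    return [e for e in events if e["provision_step_id"] not in finished_ok]


# Hypothetical sample data: step 1 is done, step 2 is still running.
steps = [
    {"id": "1", "step_name": "Launching instances", "successful": True},
    {"id": "2", "step_name": "Waiting for ssh", "successful": None},
]
events = [
    {"id": "a", "provision_step_id": "1", "successful": True},
    {"id": "b", "provision_step_id": "2", "successful": False},
]
kept = prune_events(steps, events)
```

In this sketch only the event attached to the unfinished step survives, so the error information needed for debugging is still available.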

We should have a new database object called ClusterEvent, which will have the following fields:

  • node_group_id;

  • instance_id;

  • instance_name;

  • event_info;

  • successful;

  • provision_step_id;

  • id;

  • created_at;

  • updated_at;

We should also have a new database object called ClusterProvisionStep, which will have the following fields:

  • id;

  • cluster_id;

  • step_name;

  • step_type;

  • completed;

  • total;

  • successful;

  • started_at;

  • completed_at;

  • created_at;

  • updated_at;
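The two objects listed above might look roughly like this when written as Python dataclasses. This is a sketch rather than the actual database model: the field types, the defaults, and the reading of total and completed as counts of processed units are assumptions, since the field lists above give only names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now():
    """Timestamp helper used for created_at / updated_at defaults."""
    return datetime.now(timezone.utc)


@dataclass
class ClusterProvisionStep:
    id: str
    cluster_id: str
    step_name: str                        # e.g. "Waiting for ssh"
    step_type: str                        # "creating", "scaling" or "deleting"
    total: int                            # assumed: units of work in the step
    completed: int = 0                    # assumed: units already processed
    successful: Optional[bool] = None     # None while the step is running
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)


@dataclass
class ClusterEvent:
    id: str
    provision_step_id: str                # links the event to its step
    node_group_id: str
    instance_id: str
    instance_name: str
    successful: bool
    event_info: Optional[str] = None      # error details for failed events
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)


step = ClusterProvisionStep(
    id="2", cluster_id="1111", step_name="Waiting for ssh",
    step_type="creating", total=1000,
)
event = ClusterEvent(
    id="e1", provision_step_id=step.id, node_group_id="ng1",
    instance_id="i1", instance_name="cluster-datanode-001", successful=True,
)
```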

The step_name and step_type fields will contain detailed information about the step. step_name will contain a description of the step, for example "Waiting for ssh" or "Launching instances". step_type will contain information about the process the step belongs to. For example, if we are creating a new cluster this field will contain creating, and when scaling a cluster it will contain scaling. So the possible values of this field are creating, scaling and deleting. We should also add the ability to get the main provisioning steps for each step_type from the Plugin SPI as a dictionary. For example, the expected return value is:

{
   "creating": [
       "Launching instances",
       "Waiting for ssh",
       ....
   ],
   "scaling": [
       ....
   ],
   "deleting": [
       ....
   ]
}
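A plugin-side method returning such a dictionary could be sketched as below. The class ExamplePlugin, the method name get_provisioning_steps and the helper steps_for are hypothetical names chosen for illustration; only the two step names shown above come from this spec, and the empty lists are placeholders for steps that the spec leaves unspecified.

```python
# Step types named in this spec.
STEP_TYPES = ("creating", "scaling", "deleting")


class ExamplePlugin:
    """Hypothetical plugin exposing its main provisioning steps."""

    def get_provisioning_steps(self):
        # Maps each step_type to an ordered list of step names.
        return {
            "creating": ["Launching instances", "Waiting for ssh"],
            "scaling": [],    # placeholder, not specified in the spec
            "deleting": [],   # placeholder, not specified in the spec
        }


def steps_for(plugin, step_type):
    """Return the ordered step names a plugin reports for one step_type."""
    if step_type not in STEP_TYPES:
        raise ValueError("unknown step_type: %s" % step_type)
    return list(plugin.get_provisioning_steps().get(step_type, []))


creating_steps = steps_for(ExamplePlugin(), "creating")
```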

Cluster should have a new field:

  • provisioning_progress

This field will contain a list of provisioning steps, which makes it possible to get information about provisioning steps from the cluster. We should update this list with new steps every time we start a new process on the cluster (creating/scaling/deleting). Provisioning steps should be updated both from the plugin and from infra, because some steps are the same for all clusters.
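One way to picture how the list grows when a new process starts is the sketch below. It assumes the cluster is a plain dictionary and that new steps are appended to the existing list; the function name add_provisioning_steps and the use of uuid4 for step ids are illustrative choices, not part of this spec.

```python
import uuid
from datetime import datetime, timezone


def add_provisioning_steps(cluster, step_type, step_names, total):
    """Append one not-yet-started step record per name to the cluster."""
    now = datetime.now(timezone.utc)
    progress = cluster.setdefault("provisioning_progress", [])
    for name in step_names:
        progress.append({
            "id": str(uuid.uuid4()),
            "cluster_id": cluster["id"],
            "step_name": name,
            "step_type": step_type,
            "completed": 0,
            "total": total,
            "successful": None,      # unknown until the step finishes
            "started_at": None,
            "completed_at": None,
            "created_at": now,
            "updated_at": now,
        })
    return progress


# Both infra and the plugin could call the same helper for their own steps.
cluster = {"id": "1111"}
add_provisioning_steps(
    cluster, "creating", ["Launching instances", "Waiting for ssh"], total=1000
)
```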

REST API impact

The existing GET request for a cluster should be updated with information about completed steps and short information about the current step. For example, we will have the following response: “Launching instances completed 1000 out of 1000 in 10 minutes”, “Trying ssh completed: 59 out of 1000”. The steps in the response should also be sorted by created_at in ascending order.

{
   "cluster": {
       "status": "Waiting",
       ....
       "provisioning_progress": [
          {
            "id": "1",
            "cluster_id": "1111",
            "step_name": "Launching instances",
            "step_type": "creating",
            "completed": 1000,
            "total": 1000,
            "successful": True,
            "created_at": "2013-10-09 12:37:19.295701",
            "started_at": 36000000,
            "completed_at": 18000000,
            "updated_at": "2013-10-09 12:37:19.295701",
          },
          {
            "id": "2",
            "cluster_id": "1111",
            "step_name": "Waiting for ssh",
            "step_type": "creating",
            "completed": 59,
            "total": 1000,
            "successful": None,
            "created_at": "2013-10-09 12:37:19.295701",
            "started_at": 18000000,
            "completed_at": None,
            "updated_at": "2013-10-09 12:37:19.295701",
          }
       ],
       ....
   }
}
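The short per-step messages and the ordering by created_at could be derived from such a response roughly as follows. This sketch assumes started_at, completed_at and created_at are available as datetime objects, which differs from the integer values shown in the example above; the function names are hypothetical.

```python
from datetime import datetime


def summarize_step(step):
    """Build a short progress message for one provisioning step."""
    text = "%s completed %d out of %d" % (
        step["step_name"], step["completed"], step["total"])
    # The duration is only known once the step has finished.
    if step["started_at"] is not None and step["completed_at"] is not None:
        elapsed = step["completed_at"] - step["started_at"]
        text += " in %d minutes" % int(elapsed.total_seconds() // 60)
    return text


def progress_summary(steps):
    """Summarize all steps, oldest first (ascending created_at)."""
    ordered = sorted(steps, key=lambda s: s["created_at"])
    return [summarize_step(s) for s in ordered]


# Hypothetical data, deliberately listed out of order.
steps = [
    {"step_name": "Waiting for ssh", "completed": 59, "total": 1000,
     "created_at": datetime(2013, 10, 9, 12, 47, 19),
     "started_at": datetime(2013, 10, 9, 12, 47, 19),
     "completed_at": None},
    {"step_name": "Launching instances", "completed": 1000, "total": 1000,
     "created_at": datetime(2013, 10, 9, 12, 37, 19),
     "started_at": datetime(2013, 10, 9, 12, 37, 19),
     "completed_at": datetime(2013, 10, 9, 12, 47, 19)},
]
summary = progress_summary(steps)
```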

In case of errors:

{
   "cluster": {
       "status": "Waiting",
       ....
       "provisioning_progress": [
          {
            "id": "1",
            "cluster_id": "1111",
            "step_name": "Launching instances",
            "step_type": "creating",
            "completed": 1000,
            "total": 1000,
            "successful": True,
            "created_at": "2013-10-09 12:37:19.295701",
            "started_at": 36000000,
            "completed_at": 18000000,
            "updated_at": "2013-10-09 12:37:19.295701",
          },
          {
            "id": "2",
            "cluster_id": "1111",
            "step_name": "Waiting for ssh",
            "step_type": "creating",
            "completed": 59,
            "total": 1000,
            "successful": False,
            "created_at": "2013-10-09 12:37:19.295701",
            "started_at": 18000000,
            "completed_at": None,
            "updated_at": "2013-10-09 12:37:19.295701",
          }
       ],
       ....
   }
}

In these cases we will also have events stored in the database, from which we can debug cluster problems. Because the first steps of cluster provisioning are the same for all clusters, infra should update the provisioning_progress field for these steps. For all plugin-related steps, the plugin should update the provisioning_progress field. So the new cluster field should be updated both from infra and from the plugin.

A new endpoint should be added to get details of the current provisioning step: GET /v1.1/<tenant_id>/clusters/<cluster_id>/progress

The expected response should look like:

{
     "events": [
           {
              'node_group_id': "ee258cbf-4589-484a-a814-81436c18beb3",
              'instance_id': "ss678cbf-4589-484a-a814-81436c18beb3",
              'instance_name': "cluster-namenode-001",
              'provisioning_step_id': '1',
              'event_info': None,
              'successful': True,
              'id': "ss678cbf-4589-484a-a814-81436c18eeee",
              'created_at': "2014-10-29 12:36:59.329034",
           },
           {
              'cluster_id': "d2498cbf-4589-484a-a814-81436c18beb3",
              'node_group_id': "ee258www-4589-484a-a814-81436c18beb3",
              'instance_id': "ss678www-4589-484a-a814-81436c18beb3",
              'instance_name': "cluster-datanode-001",
              'provisioning_step_id': '1',
              'event_info': None,
              'successful': True,
              'id': "ss678cbf-4589-484a-a814-81436c18eeee",
              'created_at': "2014-10-29 12:36:59.329034",
           },
           {
              'cluster_id': "d2498cbf-4589-484a-a814-81436c18beb3",
              'node_group_id': "ee258www-4589-484a-a814-81436c18beb3",
              'instance_id': "ss678www-4589-484a-a814-81436c18beb3",
              'instance_name': "cluster-datanode-001",
              'provisioning_step_id': '2',
              'event_info': "Trying to access failed: reason in sahara logs "
                            "by id ss678www-4589-484a-a814-81436c18beb3",
              'successful': False,
              'id': "ss678cbf-4589-484a-a814-81436c18eeee",
              'created_at': "2014-10-29 12:36:59.329034",
           },
     ]
}

Event info for the failed step will contain the traceback of an error.
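A client consuming this endpoint might build the request path and pick out the failed events as in the sketch below. The helper names progress_url and failed_events are hypothetical, the tenant id is a made-up value, and the sample response is a trimmed version of the example above; no HTTP call is made here.

```python
def progress_url(tenant_id, cluster_id):
    """Path of the proposed progress endpoint for one cluster."""
    return "/v1.1/%s/clusters/%s/progress" % (tenant_id, cluster_id)


def failed_events(response):
    """Return the events that report a failure."""
    return [e for e in response["events"] if e["successful"] is False]


# Trimmed sample response with one successful and one failed event.
response = {
    "events": [
        {"instance_name": "cluster-namenode-001",
         "provisioning_step_id": "1",
         "event_info": None,
         "successful": True},
        {"instance_name": "cluster-datanode-001",
         "provisioning_step_id": "2",
         "event_info": "Trying to access failed: reason in sahara logs "
                       "by id ss678www-4589-484a-a814-81436c18beb3",
         "successful": False},
    ]
}
failures = failed_events(response)
url = progress_url("tenant-1", "1111")
```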

Other end user impact

None

Deployer impact

This change will take effect immediately after it is merged. It is also a good idea to be able to disable the event log from the configuration.

Developer impact

None

Sahara-image-elements impact

None

Sahara-dashboard / Horizon impact

This change will add a section with event logs to the Horizon page /data_processing/clusters/cluster_id. On this page it will be possible to see the main provisioning steps and the current progress of each of them. It will also be possible to see the events of the current provisioning step. In case of errors we will be able to see all events of the current step and the main reasons for failures.

Implementation

Assignee(s)

Primary assignee:

vgridnev

Other contributors:

slukjanov, alazarev, nkonovalov

Work Items

This feature requires the following modifications:

  • Add ability to get info about main steps of provisioning cluster from plugin;

  • Add ability to view progress of current provisioning step;

  • Add ability to record events for the current cluster and current step;

  • Add periodic task to erase redundant events from previous step;

  • Add ability to view events about current step of cluster provisioning;

  • Sahara docs should be updated with some use cases of this feature;

  • Saharaclient should be modified with new REST API feature;

  • New cluster tab with events in Horizon should be implemented;

  • Add unit tests for the new event features.

Dependencies

Depends on OpenStack requirements

Testing

As noted in the Work Items section, this feature will require unit tests.

Documentation Impact

As noted in the Work Items section, this feature will require updating the docs with some use cases of the feature.

References