ELK Stack

date:2017-12-11 11:00
tags:logging, monitoring, operations

Blueprint on Launchpad:

Log file analysis is an important part of maintaining and troubleshooting OpenStack clouds, but using traditional single server methodology to analyze the logs on clouds with tens, hundreds or thousands of servers can become problematic and unwieldy. By leveraging the search, collation and analysis features of the ELK (Elasticsearch`[1]`_, Logstash`[2]`_ and Kibana`[3]`_) stack we can provide a cloud level view of all of the log files. The ELK stack also provides the ability to correlate log messages across various services, perform detailed log analysis and do trending based on metrics derived from log messages.

Problem description

For deployers and operators findings specific events in the myriad log files produced by the various OpenStack, system and ancillary services can be tedious and error prone. With traditional tools the possibility of missing critical log entries grows as the size of the cluster increases. Log file analysis provides vital information about the state of the OpenStack services as well as the underlying hardware. Currently there are no tools provided by OpenStack-Ansible to detailed log analysis, correlation and trending.

Proposed change

Utilizing the logging/utility node we install the ELK stack in containers, logs are shipped from the individual nodes/containers using the Filebeat package. Using Filebeat to perform the initial log shipping allows us to do initial multiline parsing distributing the load away from a single Logstash container. Version requirements of the ELK packages will be maintained in the ELK roles and barring security fixes the major version of those packages should not change during the release cycle of Openstack. The ELK roles are consumed via Ansible Galaxy pointing to specific SHAs.

Notable changes:
  • Create 3 containers on the logging/utility node, one each for Elasticsearch, Logstash and Kibana. (Additional containers can be created to facilitate HA if needed.)
  • Install the Filebeat package on all nodes/containers
  • ELK and Filebeats galaxy role SHAs added to ansible-requirements.yml

Alternatives

Logs are currently shipped to a centralized rsyslog-server container on the logging/utility server allowing for some sort of centralized log parsing using command line utilities. There are other 3rd party solutions with various levels of cost, adoption and support.

Playbook/Role impact

The changes required are located in stand alone playbooks. Additional roles will need to be created for Logstash, Kibana and Filebeat, the `ansible-elasticsearch``[4]`_ maintained by elastic.co provides Elasticsearch. Configuration can be stand-alone or integrated into the user-variables.yml and user-secrets.yml files.

Upgrade impact

As this is the initial implementation there is no upgrade impact. Future versions will require upgrade planning as it may be necessary to upgrade versions of the ELK packages, OpenJDK packages and possibly the Elasticsearch database itself.

Security impact

This software provides a web based front end as well as API access to any information contained in the Openstack, service and system logs that are shipped to it. As such it will need to be only visible to authenticated users. All access can be secured through the traditional hardening that is applied to any standard web service, namely TLS and an authentication mechanism. Furthermore since the ELK stack is behind a VIP we can limit access to certain IPs and/or networks via a number of ACLs.

By default logs are shipped in plaintext, it is possible, however, to enable SSL encryption on this transport should it be needed.

Performance impact

Based on testing and real-world analysis the largest performance impact will be on the logging/utility server. As this devices original intent was to perform log processing this is expected and not unusual. The filebeat service running in each node/container has demonstrated a negligible performance impact, but certain best practices such as limiting logging levels and eliminating tracebacks in the logs will help maintain the light footprint. Filebeat should not impact the operation of any Openstack services as it is simply a log file processor/shipper, although network utilization could be a concern should debug logging be enabled on a particularly busy service.

Elastic.co is the maintainer of all of the software other than Java, which is maintained by Oracle corporation. Both of these entities provide enterprise software and thus follow strict release schedules and have reliable upstream repositories for their software.

End user impact

End users should not notice the changes from this work. This is primarily intended for deployers and operators. This change does give operations teams more insight into the environment and will hopefully facilitate a more performant and stable deployment.

Deployer impact

The ELK stack is an optional component and does not directly interact with any Openstack services. All of the ELK packages are provided via apt/yum repositories. An additional secret will need to be created for the kibana user. The filebeat package will be installed in all containers and on all nodes but it is extremely lightweight, with configuration stored in /etc/filebeat. Java is required for ELK so the openjdk (default) or JDK implementation of the deployers choosing will need to be installed in three containers on the logging/utility node.

Developer impact

This should be a minimal change for developers, the one thing that they will need to keep in mind is if additional log files are added they will need to be added to the filebeat configuration, this can be handled by re-running the filebeat play against the containers with the new logs.

Dependencies

There are no dependencies.

Implementation

Assignee(s)

Primary Assignee:
David Wilde (d34dh0r53)

Work items

  1. Create ELK and filebeats roles in openstack-ansible, these roles will be generic enough to be published to ansible-galaxy so that they are usable by the Ansible community at large.
  2. Create playbook(s) to install the ELK stack and filebeats, these playbooks will install the OpenStack specific configuration and parsing files.
  3. Create testing procedures for the stack
  4. Documentation

Testing

The ELK stack should be tested on each commit by ensuring that the services start and that logs are flowing into the system and being parsed correctly. This can be acomplished by injecting a line into a services log file and then using the elasticsearch API via curl to verify that the line was correctly inserted into the database with the expected fields parsed.

Documentation impact

Along with the general installation procedures and configuration the key points of documentation will be:

  • Filebeats parsing rules
  • Logstash parsing rules
  • Kibana dashboard configuration
  • The default Kibana dashboard
  • Performance impact and tuning of the ELK stack