Quotas, Usage Plans, and Capacity Management

Cross Project Spec - None

User Story Tracker - None

Problem Description

Problem Definition

A canonical property of an IaaS system like OpenStack is “capacity on demand”. Users expect to be able to allocate new resources via UI or API whenever needed, and to release them when the need ends. By supporting a large number of users, pooling resources, and maintaining some excess capacity, the cloud service provider (CSP) presents the illusion of infinite capacity.

In practice, of course, the resources are not infinite, and the CSP must institute measures to manage capacity so that resource exhaustion is minimized. This is generally done by imposing a cap or quota on the resources that a particular project may consume, and by managing the relationship between the available physical resources and the aggregate quotas for all projects. When a project requires more resources than its assigned quota, the user must typically submit a request, which usually requires human approval. The CSP may reject the request, or delay it until sufficient capacity is available. When the request is approved, the quota for the project is modified to reflect the new limit.
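
As a rough illustration of the relationship between physical resources and aggregate quotas, the sketch below approves a quota increase only while the sum of all project quotas stays within the physical capacity multiplied by an overcommit factor. The class name, the `overcommit` parameter, and the approval rule are assumptions made for illustration; they do not describe an existing OpenStack API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class QuotaManager:
    """Hypothetical quota bookkeeping for a single resource type (vCPUs)."""
    physical_vcpus: int
    overcommit: float = 1.0  # assumed policy knob: allowed quota/capacity ratio
    quotas: Dict[str, int] = field(default_factory=dict)

    def aggregate_quota(self) -> int:
        """Sum of the quotas currently assigned to all projects."""
        return sum(self.quotas.values())

    def request_increase(self, project: str, new_limit: int) -> bool:
        """Approve the new limit only if the aggregate stays within capacity.

        Returns True and updates the quota on approval; returns False and
        leaves the quota unchanged otherwise (the CSP would reject or defer).
        """
        current = self.quotas.get(project, 0)
        projected = self.aggregate_quota() - current + new_limit
        if projected > self.physical_vcpus * self.overcommit:
            return False
        self.quotas[project] = new_limit
        return True
```

A CSP that rejects or defers a request would leave the quota untouched, which is why the method only writes the new limit after the capacity check passes.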

Other CSPs have introduced a number of mechanisms to provide them with flexibility in managing capacity. These include group quotas (shared by related projects), reserved instances, ephemeral instances (which may be reclaimed for reallocation), and market-based allocation models. At the present time, OpenStack does not support any of these.

One common factor in all these processes is that they do not reflect temporal variations in resource usage. Yet in many cases the user knows how their usage is going to vary over time, and such information would be useful to the CSP who needs to decide how to handle each request. It might also facilitate the automation of some of the processing. The following user stories capture the possibilities here.

This user story is also applicable to Telcos / TSPs (Telecommunication Service Providers). The industry is moving toward NFV (Network Function Virtualization), which leverages the benefits of cloud technologies and virtualization by deploying VNFs (virtual network functions) on industry-standard high-volume servers, switches, and storage located in data centers, network nodes, and end-user premises. The resource requirements for these VNFs are described in the VNF Descriptor (VNFD), which is being standardized under the aegis of ETSI NFV ISG [1] and OASIS TOSCA.

Opportunity/Justification

CSPs and TSPs need to manage and utilize a finite amount of resources efficiently, taking the temporal characteristics of demand into account. Current OpenStack services do not support such flexible resource usage requests or the scheduling of resources for future use. In particular:

  • For high priority VNFs (e.g. mobile core network nodes), the TSP requires a guarantee that the resources needed to run the VNFs will be available at different points in time (e.g. in the future) and in different operational scenarios.

Use Cases

User Stories

This section utilizes the OpenStack UX Personas.

  • As Wei the project owner of a Telco operator, I want to specify my resource usage request (RUR) in a way that will enable automated processing by the CSP, so that my RUR will be handled more quickly and accurately.
  • As Adrian the infrastructure architect, I want to be able to automate the processing of RURs so that I can meet my user SLAs and gain more timely and accurate data input to my capacity management and planning systems.
  • As Wei, I want to be able to describe the temporal characteristics of my RUR, so that the CSP can plan capacity more accurately and reduce the chances of a resource request failure. My CSP may also offer me better pricing for more accurate usage prediction. Some examples of time-based RURs:
    1. I plan to use up to 60 vCPUs and 240GB of RAM from 6/1/2016 to 8/14/2016.
    2. I plan to use 200GB of object storage starting on 8/14/2016, increasing by 100GB every calendar month thereafter.
    3. I want guaranteed access to 30 vCPUs and 200GB of RAM for my project. In addition, during October-December, I want to be able to increase my usage to 150 vCPUs and 1TB of RAM.
    4. I want guaranteed access to 4 instances with 1 vCPU and 1GB of RAM and 10GB of disk and a guaranteed minimum bandwidth of 1Gbps between the instances. This example is similar to what would be described in the VNFD.
  • As Wei, I want to be able to submit an updated version of a rolling RUR for my project every month, so that my CSP has accurate information and can give me the best price and SLA.
  • As Wei, I want to be able to take advantage of pricing and other offers from my CSP in order to meet the business objectives for my project. For example:
    1. I want 60 vCPUs for a minimum of one hour. After that time, the CSP may shut down all my instances if the resources are needed elsewhere. (I assume that the price is lower on such instances.)
    2. I want up to 100 vCPUs for the next 24 hours. Tell me how many I can have.
  • As Adrian, I want to be able to automate the construction and interpretation of a time-based resource usage plan so that I can schedule the most cost-effective actions to maintain my SLA. Some examples of actions:
    1. Schedule the provisioning of additional infrastructure.
    2. Repurpose existing allocated infrastructure.
    3. Assign a new project to one of a number of regions based on usage projections.
    4. Add “burst capacity” from a federation partner or reseller.
    5. Modify or defer another project.
  • As Wei, I want to be able to query/update/terminate a RUR at any point in time.
  • As Wei, I want to receive an appropriate error message if a RUR is not successful. If a RUR fails, I want the environment to be reverted to its pre-RUR state; in other words, a RUR transaction should be atomic. The error message should contain sufficient information for the user to modify the RUR.
  • As Adrian, I want to be able to automate RUR processing with chargeback, so that only users who meet the following requirements are considered for resources:
    1. The account is up to date on payments.
    2. The RUR is within quota.
    3. The cost of the RUR plus the current balance is below the project/tenant threshold.
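
The time-based RURs and the chargeback conditions in the stories above can be sketched as a small data model plus an eligibility check. This is only an illustration: the class names, field names, the interpretation of "balance" as the amount currently owed, and the idea of returning a list of reasons (so that a failed RUR carries enough information to be modified) are all assumptions, not part of any existing OpenStack service.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ResourceUsageRequest:
    """Hypothetical time-based RUR; all field names are illustrative."""
    project: str
    vcpus: int
    ram_gb: int
    start: date
    end: Optional[date] = None  # None means the request is open-ended
    estimated_cost: float = 0.0

    def active_on(self, day: date) -> bool:
        """True if the request covers the given day (end date inclusive)."""
        return self.start <= day and (self.end is None or day <= self.end)


@dataclass(frozen=True)
class Account:
    """Hypothetical chargeback view of a project/tenant."""
    payments_up_to_date: bool
    quota_vcpus: int
    quota_ram_gb: int
    balance: float    # assumed: amount currently owed
    threshold: float  # assumed: project/tenant spending limit


def rejection_reasons(rur: ResourceUsageRequest, account: Account) -> List[str]:
    """Return why the RUR is not eligible; an empty list means eligible."""
    reasons = []
    if not account.payments_up_to_date:
        reasons.append("account is not up to date on payments")
    if rur.vcpus > account.quota_vcpus or rur.ram_gb > account.quota_ram_gb:
        reasons.append(
            f"RUR exceeds quota ({account.quota_vcpus} vCPUs, "
            f"{account.quota_ram_gb} GB RAM)"
        )
    if rur.estimated_cost + account.balance >= account.threshold:
        reasons.append("estimated cost plus current balance reaches the threshold")
    return reasons


# Example 1 from the list above: up to 60 vCPUs and 240GB of RAM
# from 6/1/2016 to 8/14/2016. The cost figure is made up.
summer = ResourceUsageRequest("wei-project", 60, 240,
                              date(2016, 6, 1), date(2016, 8, 14),
                              estimated_cost=500.0)
```

Returning every failed condition at once, rather than stopping at the first, is one way to give the user enough detail to correct the RUR in a single resubmission.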

Usage Scenarios Examples

  1. Reserve resources for upcoming events
    1. Wei, the project owner of a Telco operator, is in charge of network planning for big events, such as mega-concerts and festivals, where local traffic spikes are expected.
    2. In order to ensure sufficient network capacity for the upcoming Fuji Rock Festival on 22-24 July 2017, Wei reserves additional capacity by creating a RUR which describes the aforementioned dates and the amount of additional resources, e.g., 4 instances with 1 vCPU, 1GB of RAM, 10GB of disk, and a guaranteed minimum bandwidth of 1Gbps between the instances which are required for scaling the service.
    3. After the RUR has been successfully processed, Wei receives an acknowledgement that the appropriate resources are reserved for the event dates.
  2. Reserve resources for maintenance works
    1. Wei is responsible for updating his services, and Rey the cloud operator is responsible for maintaining the underlying cloud environment, including its hardware. The team plans a maintenance window for several compute hosts next Monday.
    2. To avoid impact on the service, Wei plans to migrate all VMs running on those hosts on Sunday, i.e., a day before the maintenance window, to other hosts that are not affected by the maintenance work.
    3. In order to ensure that those other hosts are available from Sunday to the end of the maintenance window, Wei reserves the required resources through his frontend tools.
    4. In the backend, the system creates respective RURs for this time window to guarantee the availability of the resources and the system returns a reservation ID to Wei.
    5. On Sunday, Wei triggers the migration of the affected VMs, referring to the reservation ID. Rey then triggers the maintenance work on the cloud. If the work finishes earlier than expected, Wei can migrate the VMs back and release the reservation ahead of its planned end time.
  3. Reserve resources for disaster recovery
    1. Wei is in charge of ensuring that core services keep running in disaster cases. In order to react immediately to a disaster situation, the service maintains a disaster configuration for its core services and keeps the respective resources reserved for such situations.
    2. Just now, an earthquake has hit the country and an automated tsunami warning was issued by the government. Wei has a short time window to prepare for the tsunami hitting the coastlines and its effects, e.g. a high volume of extraordinary communication such as emergency communication, evacuation instructions, and safety confirmations.
    3. Wei switches the service to a pre-configured disaster configuration. Switching to the disaster configuration is supported by resources that had been exclusively reserved for such situations.
  4. Reserve resources for launching new services
    1. Wei is in charge of introducing a new service that has complex requirements on the infrastructure resources. To avoid the risk that a requirement cannot be met during allocation, forcing the allocation to be rolled back or changed, Wei first creates a reservation for the required resources, specifying in the request all parameters and conditions the resources have to fulfil.
    2. The reservation service tries to reserve the resources with the specified criteria. After having successfully created the reservation, a reservation ID is returned to Wei.
    3. Wei then triggers the setup of the service referencing the reservation ID knowing that all resource requirements can be met. The new service is initialized without conflicts.
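
The reservation flow that recurs in these scenarios (reserve for a time window, receive a reservation ID, release early if possible) can be sketched as follows. This is a minimal in-memory model under stated assumptions: a single resource type (vCPUs), half-open time windows, and hypothetical names such as `ReservationService` and `ReservationError`. It does not represent an existing OpenStack API.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime
from typing import Dict


class ReservationError(Exception):
    """Raised when a reservation cannot be satisfied in full."""


@dataclass(frozen=True)
class Reservation:
    vcpus: int
    start: datetime
    end: datetime  # exclusive


class ReservationService:
    """Hypothetical reservation bookkeeping for one pool of vCPUs."""

    def __init__(self, capacity_vcpus: int) -> None:
        self.capacity_vcpus = capacity_vcpus
        self._reservations: Dict[str, Reservation] = {}

    def _peak_usage(self, start: datetime, end: datetime) -> int:
        """Highest number of reserved vCPUs at any instant in [start, end)."""
        overlapping = [r for r in self._reservations.values()
                       if r.start < end and start < r.end]
        # Usage can only rise at the window start or where a reservation begins.
        points = {start} | {r.start for r in overlapping if r.start > start}
        return max(sum(r.vcpus for r in overlapping if r.start <= p < r.end)
                   for p in points)

    def available(self, start: datetime, end: datetime) -> int:
        """vCPUs that can still be guaranteed for the whole window."""
        return self.capacity_vcpus - self._peak_usage(start, end)

    def reserve(self, vcpus: int, start: datetime, end: datetime) -> str:
        """Reserve all requested vCPUs or nothing; return a reservation ID."""
        if start >= end:
            raise ValueError("start must be before end")
        free = self.available(start, end)
        if vcpus > free:
            # State is untouched on failure, so the request is all-or-nothing.
            raise ReservationError(
                f"requested {vcpus} vCPUs, only {free} available in the window")
        reservation_id = str(uuid.uuid4())
        self._reservations[reservation_id] = Reservation(vcpus, start, end)
        return reservation_id

    def release(self, reservation_id: str) -> None:
        """Release a reservation, possibly before its planned end time."""
        self._reservations.pop(reservation_id)


# Scenario 1: four 1-vCPU instances for 22-24 July 2017 (end is exclusive).
service = ReservationService(capacity_vcpus=6)
festival_id = service.reserve(4, datetime(2017, 7, 22), datetime(2017, 7, 25))
```

Because the capacity check runs before any state is changed, a failed request leaves existing reservations exactly as they were, which matches the atomicity asked for in the user stories. A fuller model would also cover RAM, disk, and bandwidth between instances.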

Requirements

  • The implementation of these capabilities will depend in part on the existence of a more flexible and holistic quota scheme, so that the capacity management system can adjust quotas programmatically.
  • It will also require a rich monitoring, notification, and visualization system, so that both user and CSP have accurate and timely data about the behavior of the system.

External References

[1] ETSI NFV IFA has specified the concept and use cases of “resource reservation”
and VNFD in the following standard specifications: <http://www.etsi.org/deliver/etsi_gs/NFV-IFA>

Rejected User Stories / Usage Scenarios

None.

Glossary

  • RUR - Resource Usage Request
  • CSP - Cloud service provider
  • VNFD - Virtual Network Function (VNF) Descriptor; describes the resource requirements of a VNF