Neutron L3 HA VRRP support¶

https://blueprints.launchpad.net/fuel/+spec/neutron-vrrp-deployment

The aim of this blueprint is to add High Availability Features on virtual routers to Fuel. L3 HA VRRP support covers only access to VMs and access from VMs to Internet.

L3 HA feature gives an opportunity to establish connection faster after L3 agent failover than router rescheduling.

Problem description¶

Currently we are able to spawn many l3 agents, however each l3 agent is a SPOF. If an l3 agent fails, all virtual routers of this agent will be lost, and consequently all VMs connected to these virtual routers will be isolated from external networks and possibly from other tenant networks. Existing rescheduling has one big issue - thousands of routers take hours to finish the rescheduling and configuration process.

Proposed changes¶

The idea of this spec is to schedule a virtual router to at least two l3 agents, but this limit could be increased by changing a parameter in the neutron configuration file.

L3 HA starts a keepalived instance in every router namespace. The different router instances talk to one another via a dedicated HA network, one per tenant. This network is created under the blank tenant to hide it from the CLI and GUI. The HA network is a Neutron tenant network, same as every other network, and uses the default segmentation technology. HA routers have an ‘HA’ device in their namespace: When a HA router is created, it is scheduled to a number of network nodes, along with a port per network node, belonging to the tenant’s HA network. keepalived traffic is forwarded through the HA device (As specified in the keepalived.conf file used by the keepalived instance in the router namespace).

Flows:

        +----+                          +----+
        |    |                          |    |
+-------+ QG +------+           +-------+ QG +------+
|       |    |      |           |       |    |      |
|       +-+--+      |           |       +-+--+      |
|     VIPs|         |           |         |VIPs     |
|         |      +--+-+      +--+-+       |         |
|         +      |    |      |    |       +         |
|  KEEPALIVED+---+ HA +------+ HA +----+KEEPALIVED  |
|         +      |    |      |    |       +         |
|         |      +--+-+      +--+-+       |         |
|     VIPs|         |           |         |VIPs     |
|       +-+--+      |           |       +-+--+      |
|       |    |      |           |       |    |      |
+-------+ QR +------+           +-------+ QR +------+
        |    |                          |    |
        +----+                          +----+

Web UI¶

In section Neutron Advanced Configuration we need a checkbox for enabling L3 HA. This checkbox cannot be enabled if DVR is turned on.

Nailgun¶

Additional option ‘neutron_l3_ha’ will be added into opentack.yaml. It will marked as incompatible with Neutron DVR.

Data model¶

None

REST API¶

No FUEL REST API changes.

Orchestration¶

None

RPC Protocol¶

None

Fuel Client¶

None

Plugins¶

None

Fuel Library¶

The following options should be passed to neutron::server class in order to enable L3 HA and disable legacy rescheduling:

l3_ha = True
max_l3_agents_per_router = 0
min_l3_agents_per_router = 2
allow_automatic_l3agent_failover = False

Alternatives¶

None

Upgrade impact¶

Upgrade from legacy to HA router was not added in Liberty, but will be backported from upstream.

We can upgrade legacy router to HA router by 3 steps:

neutron router-update router1 --admin_state_up=False
neutron router-update router1 --ha True
neutron router-update router1 --admin_state_up=True

It also will be possible to make HA router legacy router.

Security impact¶

None

Notifications impact¶

None

End user impact¶

None

Performance impact¶

HA L3 is based on Keepalived(VRRP protocol) which gives the following features:

Works within tenant networks
Failover independent from RPC layer
Expected to be quicker than rescheduling (Rescheduling - 1 router - 5 sec, then linear growth with number of routers Rough failover time: single router - 7-8 sec, 30 routers - 10 sec)

Deployment impact¶

L3 HA feature uses service network called “HA network” for VRRP protocol messages. This network is created for every tenant, so if there’s a limited number of tunnels (or VLANs) for Neutron private networks it should be considered.

Developer impact¶

None

Infrastructure impact¶

None

Documentation impact¶

Ability to enable L3 HA support in Neutron should be documented in Fuel Deployment Guide.

Expected OSCI impact¶

keepalived must satisfy the following criteria: 1.2.13, >1.2.16 (done for Ubuntu 14.04, satisfied in CentOS 7)

Implementation¶

Assignee(s)¶

Primary assignee:: Ann Kamyshnikova <akamyshnikova>
Other contributors:: Sergey Kolekonov <skolekonov> (DE) Kristina Kuznetsova <kkuznetsova> (QA)
Mandatory design review:: Eugene Nikanorov <enikanorov> Oleg Bondarev <obondarev> Sergey Vasilenko <svasilenko>

Work Items¶

Patch fuel-library and nailgun to add option for enabling L3 HA
- disable L3 HA by default
Scale testing
Rally scenarios
Shaker scenarios
checking compatibility with plugins
debug
bug fixing/backport from upstream
Patch fuel-web to add ability to enable/disable L3 HA
- disable L3 HA by default

Dependencies¶

Since this implementation relies on Keepalived, Keepalived has to be installed on each l3 node. The required version of Keepalived is the version 1.2.10 in order to have the IPV6 support. Safe versions:1.2.13,>1.2.16

Testing, QA¶

Manual Acceptance Tests¶

Create HA router and check:

Existence of keepalived process
(from admin) HA network(subnet) should be created for current tenant.
List of agents hosting router should contain list of agents and their status - one active others standby.
Neutron router port list should contain list of HA ports

HA/Destructive Tests¶

All existing HA/destructive tests should pass on env with L3 HA enabled.

Additional scenarios should include:

L3 agent ban: ping test
- create router
- set gateway for external network and add interface
- boot an instance in private net
- add floating ip to vm
- check what agent is active
- start ping vm via floating ip
- ban active l3 agent
- Less than 10 packages should be lost
- Check that another agent become active
SSH session failover
- The same first 5 steps as for “L3 agent ban: ping test”
- Enter vm using ssh
- From vm ping 8.8.8.8
- Stop active agent.
- After some time (less than 10 packages can be lost) ping will be continued and another agent become master.
Test on 50 networks and 50 routers
- Create 50 networks, subnets and 50 routers, for each router add interface to subnet, for some (for each fifths, for example) set gateway to public network.
- Boot vm in one of the networks and add floating ip on it.
- Same last 5 steps as for “L3 agent ban: ping test”
Test with 20 vms
- Create 2 private networks with subnets
- Create router and set gateway for external network and add interfaces for private networks that was created on the previous step.
- Boot 10 vm in each network.
- For each pair using iperf check connectivity restoration (around 15% of loss) for ICMP and TCP/UDP traffic during L3 agent failover.

Scale¶

Environment with L3 HA enabled should pass all tests currently run on Scale Lab with no significant performance degradation. No additional Rally scenarios are needed to test specifics of L3 HA.

Acceptance criteria¶

Pass acceptance functional test - after active L3 agent fails, connection establishes and less than 10 packages should be lost.

References¶

Blueprint

OpenStack

Neutron L3 HA VRRP support¶

Problem description¶

Proposed changes¶

Web UI¶

Nailgun¶

Data model¶

REST API¶

Orchestration¶

RPC Protocol¶

Fuel Client¶

Plugins¶

Fuel Library¶

Alternatives¶

Upgrade impact¶

Security impact¶

Notifications impact¶

End user impact¶

Performance impact¶

Deployment impact¶

Developer impact¶

Infrastructure impact¶

Documentation impact¶

Expected OSCI impact¶

Implementation¶

Assignee(s)¶

Work Items¶

Dependencies¶

Testing, QA¶

Manual Acceptance Tests¶

HA/Destructive Tests¶

Scale¶

Acceptance criteria¶

References¶

Table Of Contents

Previous topic

Next topic

Project Source

This Page

OpenStack

Neutron L3 HA VRRP support¶

Problem description¶

Proposed changes¶

Web UI¶

Nailgun¶

Data model¶

REST API¶

Orchestration¶

RPC Protocol¶

Fuel Client¶

Plugins¶

Fuel Library¶

Alternatives¶

Upgrade impact¶

Security impact¶

Notifications impact¶

End user impact¶

Performance impact¶

Deployment impact¶

Developer impact¶

Infrastructure impact¶

Documentation impact¶

Expected OSCI impact¶

Implementation¶

Assignee(s)¶

Work Items¶

Dependencies¶

Testing, QA¶

Manual Acceptance Tests¶

HA/Destructive Tests¶

Scale¶

Acceptance criteria¶

References¶

Table Of Contents

Previous topic

Next topic

Project Source

This Page

Quick search

Navigation