Monday, December 29, 2014

OpenStack Neutron Distributed Virtual Router (DVR) - Part 2 of 3

In this post, the second of a three-post series about DVR, I go into the North-South DNAT scenario in detail.

Up until Juno, all L3 traffic was sent through the network node.  In Juno, DVR was introduced to distribute the load from the network node onto the compute nodes.

The L3 networking in Neutron is divided into 3 main services:

  1. East-West communication: IP traffic between VMs in the data center
  2. Floating IP (aka DNAT): The ability to provide a public IP to a VM, making it directly accessible from public networks (e.g. the Internet)
  3. Shared IP (aka SNAT): The ability to provide public network access to VMs in the data center using a shared (public) IP address

In my previous post, I covered how DVR distributes the East-West L3 traffic.
In this post I begin covering the North-South traffic, starting with Floating IP (DNAT).

DNAT Floating IP North-South 

To let the compute nodes handle floating IP DNAT traffic locally, Juno's DVR requires an additional physical port, connected to the external network, on each compute node.

The Floating IP functionality enables direct access from the public network (e.g. the Internet) to a VM.

Let's follow the example below, where we will assign a floating IP to the web servers.  

When we associate a VM with a floating IP, the following actions take place:
  1. The fip-<netid> namespace is created on the local compute node (if it does not yet exist)
  2. A new port rfp-<portid> is created in the qrouter-<routerid> namespace (if it does not yet exist)
  3. The rfp port in the qrouter namespace is assigned the associated floating IP address
  4. The fbr port is created in the fip namespace and linked via a point-to-point network to the rfp port of the qrouter namespace
  5. The fip namespace gateway port fg-<portid> is assigned an additional address from the public network range (the floating IP range)
  6. The fg-<portid> port is configured to act as an ARP proxy
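
The six steps above can be modeled with a small in-memory sketch. It is not Neutron code: the `ComputeNode` class, the `associate_floating_ip` function, and all IDs and addresses are hypothetical, and the port names simply follow the placeholders used in this post. What it tries to show is the "create only if missing" behavior, so that a second association on the same node reuses the existing namespaces and ports.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeNode:
    """In-memory stand-in for the namespaces and ports on one compute node."""
    # namespace name -> {port name -> list of assigned addresses}
    namespaces: dict = field(default_factory=dict)
    # (namespace, port) pairs configured to answer ARP on behalf of others
    proxy_arp: set = field(default_factory=set)

    def ensure_port(self, namespace: str, port: str) -> list:
        """Create the namespace and the port only if they do not exist yet."""
        return self.namespaces.setdefault(namespace, {}).setdefault(port, [])


def associate_floating_ip(node, net_id, router_id, port_id, gw_port_id,
                          floating_ip, fg_address):
    """Apply the six association steps to the (hypothetical) node model."""
    fip_ns = f"fip-{net_id}"
    router_ns = f"qrouter-{router_id}"

    # Step 1: the fip namespace is created if it is missing.
    node.namespaces.setdefault(fip_ns, {})

    # Steps 2-3: the rfp port exists in the router namespace and carries
    # the floating IP.
    rfp = node.ensure_port(router_ns, f"rfp-{port_id}")
    if floating_ip not in rfp:
        rfp.append(floating_ip)

    # Step 4: the peer port in the fip namespace (point-to-point link).
    node.ensure_port(fip_ns, f"fbr-{port_id}")

    # Step 5: the fg gateway port gets one address from the public range.
    fg = node.ensure_port(fip_ns, f"fg-{gw_port_id}")
    if fg_address not in fg:
        fg.append(fg_address)

    # Step 6: the fg port acts as an ARP proxy.
    node.proxy_arp.add((fip_ns, f"fg-{gw_port_id}"))
    return node


node = ComputeNode()
associate_floating_ip(node, "net1", "r1", "p1", "g1",
                      "203.0.113.10", "203.0.113.2")
print(node.namespaces)
```

Calling the function twice with the same arguments leaves the model unchanged, which mirrors the "if it does not yet exist" wording in the list.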

Now, let's take a closer look at VM4 (one of the web servers).

In the diagram below, the red dashed line shows the outbound network traffic flow from VM4 to the public network.

The flow goes through the following steps:
  1. The originating VM sends a packet to its default gateway, and the integration bridge forwards the traffic to the local DVR gateway port (qr-<portid>)
  2. DVR routes the packet, using its routing table, to the rfp-<portid> port
  3. An IPTables NAT rule is applied to the packet, changing the source IP from VM4's private address to the assigned floating IP; the packet is then sent through the rfp-<portid> port, which connects to the fip namespace via a point-to-point network (e.g.
  4. The packet is received on the fbr-<portid> port in the fip namespace and then routed out through the fg-<portid> port

At this point you may be confused by the descriptions, so let's try to simplify this a bit with a concrete example:
[Public Network range]=
[Web Network]=
[VM4 floating IP]=
[VM4 private IP]=

As you can see in the diagram, routing consumes an additional IP from the public range per compute node (e.g.

The reverse flow follows the same route; the fg-<portid> port acts as an ARP proxy for the DVR namespace.
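
The address rewriting in both directions can be sketched as a pair of lookups. The addresses below are hypothetical values taken from documentation ranges; they are not the ones used in the example above. The functions only model the effect of the NAT rules on a packet, not how IPTables implements them.

```python
# Hypothetical addresses (documentation ranges), NOT the values from the post.
PRIVATE_IP = "10.0.1.4"        # stands in for VM4's private address
FLOATING_IP = "203.0.113.10"   # stands in for VM4's floating IP

FIXED_TO_FLOATING = {PRIVATE_IP: FLOATING_IP}
FLOATING_TO_FIXED = {FLOATING_IP: PRIVATE_IP}


def outbound(packet: dict) -> dict:
    """VM -> public network: rewrite the source address to the floating IP."""
    src = FIXED_TO_FLOATING.get(packet["src"], packet["src"])
    return {**packet, "src": src}


def inbound(packet: dict) -> dict:
    """Public network -> VM: rewrite the destination back to the private IP."""
    dst = FLOATING_TO_FIXED.get(packet["dst"], packet["dst"])
    return {**packet, "dst": dst}


request = outbound({"src": PRIVATE_IP, "dst": "198.51.100.7"})
reply = inbound({"src": "198.51.100.7", "dst": request["src"]})
print(request)
print(reply)
```

The external host only ever sees the floating IP, and the reply is translated back to the private address before it reaches the VM.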

In the next post, I will go into the North-South scenario using Shared IP (SNAT).

Please feel free to leave comments, questions and corrections.

Tuesday, December 16, 2014

OpenStack Neutron Distributed Virtual Router (DVR) - Part 1 of 3

In this 3-post series I cover how DVR works in (just about enough) detail. 

I assume you know how virtual routing is managed in Neutron.

Up until Juno, all L3 traffic was sent through the network node, including even traffic between VMs residing on the same physical host.
This, of course, created a bottleneck, as well as a single point of failure (SPOF).

In Juno, DVR was introduced to overcome these problems.

As I described in my previous post, L3 networking in Neutron is divided into 3 main services:

  1. East-West communication: IP traffic between VMs in the data center
  2. Floating IP (aka DNAT): The ability to provide a public IP to a VM, making it directly accessible from public networks (e.g. the Internet)
  3. Shared IP (aka SNAT): The ability to provide public network access to VMs in the data center using a shared (public) IP address

The main idea of DVR is to distribute the L3 Agent functionality to all the compute nodes, distributing the load and eliminating the SPOF.

DVR successfully distributes the first two services. 

However, DVR does not distribute SNAT, which is still handled centrally on the Network Node.

The main challenges for SNAT distribution were maintaining FWaaS statefulness and conserving addresses from the public network address pool (since distributing SNAT would require each compute node to have a public IP address).
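
As a quick summary of the split described above, here is a tiny sketch of where each L3 service is handled. The function and the service labels are hypothetical names introduced for illustration only.

```python
SERVICES = ("east-west", "floating-ip", "snat")
# Per the description above, Juno's DVR distributes only the first two.
DISTRIBUTED_BY_DVR = {"east-west", "floating-ip"}


def handling_node(service: str, dvr_enabled: bool) -> str:
    """Return which kind of node handles the given L3 service."""
    if service not in SERVICES:
        raise ValueError(f"unknown L3 service: {service}")
    if dvr_enabled and service in DISTRIBUTED_BY_DVR:
        return "compute node"
    return "network node"


for service in SERVICES:
    print(service, "->", handling_node(service, dvr_enabled=True))
```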

Neutron components with DVR entities are shown in red

DVR implements a virtual router element on every compute node using a Linux network namespace (similar to the L3 Agent's virtual router).

This is achieved by cloning the current centralized L3 implementation onto every compute node.
For each service, this is handled in a different manner.

Inter subnet routing East-West 

All the cross subnet traffic (between VMs of the same tenant) in DVR is now handled locally on the compute node using the router namespace.

A Linux namespace is created for every virtual router, on each compute node that hosts VMs that are connected to that router.

In order to configure these local DVR namespaces, the DVR L3 Agent (which used to be deployed only to the network node) is deployed onto all compute nodes as you can see in the diagram below.  

Extra OVS flows are required to enable the DVR functionality, so the L2 OVS Agent was enhanced to install them.  These enhancements are described below.

In order to simplify management, the same IP and MAC addresses are reused on all the compute node DVRs (i.e. a single {IP, MAC} per virtual router port).
Owing to this, ARP traffic for the virtual router ports is kept local to the compute node.

All the ARP entries for networks attached to each DVR are proactively populated by the L3 DVR Agent on all the cloned virtual router namespaces, i.e. on all the compute nodes that participate. 
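
To make the proactive ARP population concrete, here is a minimal sketch that builds the static ARP table one DVR namespace would need. The port records, subnet names, and addresses are hypothetical; the real agent gets this data from the Neutron server.

```python
def build_arp_table(vm_ports, router_subnets):
    """Return {ip: mac} for every VM port on a subnet attached to the router."""
    return {
        port["ip"]: port["mac"]
        for port in vm_ports
        if port["subnet"] in router_subnets
    }


# Hypothetical VM ports known to the deployment.
vm_ports = [
    {"ip": "10.0.1.4", "mac": "fa:16:3e:aa:00:01", "subnet": "web"},
    {"ip": "10.0.2.3", "mac": "fa:16:3e:aa:00:03", "subnet": "app"},
    {"ip": "10.0.9.9", "mac": "fa:16:3e:aa:00:09", "subnet": "other"},
]

# Only subnets attached to this router end up in its ARP table.
arp_table = build_arp_table(vm_ports, {"web", "app", "db"})
print(arp_table)
```

Because the table is filled in ahead of time, the namespace does not need to send ARP requests for VMs on the attached networks.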

In order to avoid having traffic from different physical hosts using the same MAC (which, as we said, we are reusing), each of the cloned DVRs is assigned a system-wide unique MAC address. 

I will reuse the example from my previous post: the popular cloud deployment 3-tier web application design pattern (www-app-db).

In Horizon, we set up 3 subnets (in addition to the public) with one virtual router to bind them all.

On each compute node, the DVR namespace is pre-configured with the following:
  • Port MAC and IP addresses
  • Routing tables (default route per subnet and static routes)
  • ARP table entries for all the VMs in the connected networks

In addition to the namespace, OVS flows are installed into the integration bridge and into the tunnel bridge to:
  • Isolate DVR broadcast domain to be local to the integration bridge  
  • Translate the globally assigned MAC of the DVR into the local MAC address and vice versa
  • Redirect traffic correctly per VM (new flows are installed for every new VM started on the compute node)

Realization of our topology on 2 Compute nodes

As we can see, VM1 on the Web subnet communicates with VM3 on the App subnet:
  1. The packet reaches the subnet GW port on the local compute node
    • It is routed using the DVR namespace routing table into the destination subnet
    • The Linux namespace then performs the following actions on the packet:
      • Use the pre-populated ARP table to set the destination MAC address
      • Set the source MAC to the local DVR GW port of the destination subnet
  2. On the tunnel bridge the source MAC is replaced with the global DVR MAC address
  3. The packet reaches the compute node that hosts the destination VM:
    • The segmentation ID is replaced with the local VLAN
    • The packet is matched by the global source MAC address and forwarded to the integration bridge
  4. The packet source MAC is replaced with the appropriate local DVR GW MAC and forwarded directly to the destination VM port

In the reverse flow from VM3 to VM1, packets are routed through the local DVR instance of the compute node that hosts VM3.
Note that the reverse flow takes a different route (see the blue line).
Because of this asymmetry, we cannot create stateful East-West rules; hence FWaaS for East-West traffic is not supported by DVR in Juno.
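
The MAC rewriting in the VM1 to VM3 flow can be traced with a short sketch. All MAC and IP addresses, host names, and function names here are hypothetical; the three functions stand in for the namespace routing step, the tunnel bridge on the source host, and the bridges on the destination host.

```python
# Hypothetical addresses, chosen only for illustration.
GW_MAC = {"web": "fa:16:3e:00:00:01", "app": "fa:16:3e:00:00:02"}  # same on every node
HOST_DVR_MAC = {"compute1": "fa:16:3f:00:00:01", "compute2": "fa:16:3f:00:00:02"}
ARP_TABLE = {"10.0.2.3": "fa:16:3e:aa:00:03"}  # pre-populated entry for VM3


def route_in_namespace(packet, dst_subnet):
    """Step 1: set the destination MAC from ARP, source MAC to the local GW port."""
    return {**packet,
            "dst_mac": ARP_TABLE[packet["dst_ip"]],
            "src_mac": GW_MAC[dst_subnet]}


def leave_via_tunnel_bridge(packet, host):
    """Step 2: replace the shared GW MAC with the host's unique DVR MAC."""
    return {**packet, "src_mac": HOST_DVR_MAC[host]}


def arrive_at_destination(packet, dst_subnet):
    """Steps 3-4: match the unique DVR MAC and restore the local GW MAC."""
    if packet["src_mac"] not in HOST_DVR_MAC.values():
        return packet  # not DVR-routed traffic, leave it untouched
    return {**packet, "src_mac": GW_MAC[dst_subnet]}


packet = {"src_ip": "10.0.1.4", "dst_ip": "10.0.2.3",
          "src_mac": "fa:16:3e:aa:00:01", "dst_mac": GW_MAC["web"]}
routed = route_in_namespace(packet, "app")
on_wire = leave_via_tunnel_bridge(routed, "compute1")
delivered = arrive_at_destination(on_wire, "app")
print(on_wire["src_mac"], delivered["src_mac"])
```

On the wire, the packet carries the per-host MAC, so the physical network never sees the same source MAC coming from two hosts; VM3 still sees its usual gateway MAC.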

In my next posts, I will go into the North-South scenario, covering the DNAT and SNAT services.

Wednesday, December 10, 2014

Layer-3 Services in OpenStack Neutron Pre Juno

In this multi-part blog series I intend to dive into the L3 Services in OpenStack Neutron.

Neutron defines an API extension that allows administrators and tenants to create virtual routers.  Their purpose is to connect several virtual L2 network subnets (which are defined using other Neutron APIs).
Another API defined by Neutron and implemented in the L3 service is the Floating IP extension, which provides public connectivity, whether directly to a VM (aka DNAT) or via a shared gateway (aka SNAT).

The L3 Services in Neutron also handle the optional Extra Route extension API.

Neutron Virtual Router

In the above diagram we can see the following:

  • Yellow: Inter subnet routing (East/West)
  • Cyan: SNAT (port mapping and masquerading the IP address)
  • Red: DNAT (floating IPs, public N/S connectivity directly to VM)
  • Static routes (Extra Routes), defined inside the virtual router

Now, let's look at how everything gets wired when the system loads.

1. The L3 Service Plugin loads inside the Neutron server (which usually runs on the Controller Node).  It handles the layer 3 RESTful APIs and the data access layer. 

L3 Services API

2. The L3 Agent usually runs on the Network Node.  When started, it registers itself with the Neutron L3 Service Plugin Router Scheduler, via the Message Queue.  It implements the virtual router functionality defined by the Neutron API.

3. Each L3 Agent locally manages the virtual routers that are assigned to it, based on configuration data it receives on the Message Queue from the L3 Service Plugin.

4. The OpenStack Neutron reference implementation of the L3 Agent uses Linux network namespaces and IPTables rules to implement the virtual routers.

5. Linux network namespaces provide an isolated network stack with its own routing table and IPTables rules.  This enables reuse of IP addresses, with scope limited to the namespace.

6. The L3 Agent creates a network namespace for each virtual router in the tenant network.  Then, it creates all the virtual router ports inside the namespace.

7. Each virtual router port in the namespace is tagged in the integration bridge (aka br-int) with the local segmentation ID of the subnet.

8. Each virtual router namespace includes a gateway port to the external network (aka br-ex).

9. DNAT (Floating IP) and SNAT are implemented using IPTables rules that are applied on the gateway port.
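
To illustrate item 9, here is a sketch that renders the kind of NAT rules involved as iptables argument strings. It uses the generic `PREROUTING` and `POSTROUTING` chains of the `nat` table with the standard `DNAT` and `SNAT` targets; the chains the L3 Agent actually manages may differ, so treat the chain names, function names, device name, and addresses as assumptions made for illustration.

```python
def floating_ip_rules(floating_ip: str, fixed_ip: str) -> list:
    """Rules for one floating IP: DNAT inbound traffic, SNAT outbound traffic."""
    return [
        f"-t nat -A PREROUTING -d {floating_ip}/32 "
        f"-j DNAT --to-destination {fixed_ip}",
        f"-t nat -A POSTROUTING -s {fixed_ip}/32 "
        f"-j SNAT --to-source {floating_ip}",
    ]


def shared_snat_rule(gateway_device: str, gateway_ip: str) -> str:
    """Rule for shared-IP SNAT: everything leaving via the gateway port."""
    return (f"-t nat -A POSTROUTING -o {gateway_device} "
            f"-j SNAT --to-source {gateway_ip}")


# Hypothetical addresses and device name.
for rule in floating_ip_rules("203.0.113.10", "10.0.1.4"):
    print("iptables", rule)
print("iptables", shared_snat_rule("qg-example", "203.0.113.2"))
```

The sketch only prints the rules; it does not run iptables, and it would need to run inside the router namespace to have the effect described above.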

Let's take the popular cloud deployment 3-tier web application design pattern (www-app-db) as an example.
In Horizon, we set up 3 subnets (in addition to the external) with one virtual router to bind them all.

  • The virtual router is realized by the L3 Agent using a namespace
  • For each connected network a port is created on the br-int within the namespace
  • Each port is tagged with a local VLAN ID
  • Local VLAN IDs are mapped to the Segmentation ID by the Tunnel Bridge (aka br-tun)
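
The last bullet can be pictured as a two-way lookup kept per node. This is a minimal sketch with a hypothetical `LocalVlanMap` class and made-up IDs; the real mapping is implemented as OVS flows on br-tun, not as a Python object.

```python
class LocalVlanMap:
    """Two-way mapping between node-local VLAN IDs and segmentation IDs."""

    def __init__(self):
        self._to_segment = {}
        self._to_local = {}

    def add(self, local_vlan: int, segmentation_id: int) -> None:
        self._to_segment[local_vlan] = segmentation_id
        self._to_local[segmentation_id] = local_vlan

    def outbound(self, local_vlan: int) -> int:
        """br-int -> tunnel: local VLAN tag to segmentation ID."""
        return self._to_segment[local_vlan]

    def inbound(self, segmentation_id: int) -> int:
        """tunnel -> br-int: segmentation ID to local VLAN tag."""
        return self._to_local[segmentation_id]


# Hypothetical IDs for the three tenant networks on one node.
vlan_map = LocalVlanMap()
vlan_map.add(local_vlan=1, segmentation_id=1001)  # web
vlan_map.add(local_vlan=2, segmentation_id=1002)  # app
vlan_map.add(local_vlan=3, segmentation_id=1003)  # db
print(vlan_map.outbound(2), vlan_map.inbound(1003))
```

Since the VLAN IDs are local, another node may use different local tags for the same segmentation IDs.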

For the most part, this solution works well.  However, there is an inherent limitation that affects the overall system performance and scalability:

All cross-subnet traffic hits the Network Node. 

The problem becomes more apparent in large deployments.  The Network node quickly becomes a bottleneck in the system.

In the Juno release, a solution to this problem was introduced.
It is called DVR (Distributed Virtual Router).
The main idea is to distribute the L3 Agent functionality onto the Compute nodes.

In my next post, I will describe the DVR solution.