Thursday, January 22, 2015

Neutron DVR The SDN Way

In my previous posts I covered the existing Openstack virtual L3 implementations, from the base centralized implementation to the state of the art solution, DVR, that distributes the load among the computes nodes.

I also summarized the limitations of DVR and hopefully convinced you of the motivation for yet another L3 implementation in Neutron.

Just to quickly recap, the approach for DVR’s design was to take the existing centralized router implementation based on Linux network namespaces and clone it on all compute nodes. This is an evolutionary approach and a reasonable step to take in distributing L3 from which we've learned a lot. However, this solution is far form optimal.

In this post I’m going to present a PoC implementation of an embedded L3 controller in Neutron that solves the same problem (the network node bottleneck) the SDN way, and overcomes some of the DVR limitations.

I chose to start with just the East-West traffic, and leave the North-South for the next step.

I will share some of my ideas for North-South implementation in upcoming posts, as well as links to the Stackforge project.

If you're interested in joining the effort or take the code for a test drive, feel free to email me or leave a comment below.

SDN Controllers are generally perceived as huge pieces of complex code with hundreds of thousands of lines of code that try to handle everything.  It's natural that people would not deem such software to be capable of running "in line".  However, this is not necessarily so.  As I will demonstrate, the basic SDN Controller code can be separated and deployed in a lean and lightweight manner.

It occurred to me that by combining the well-defined abstraction layers already in Neutron (namely the split into br-tun, br-int and br-ext) and a lightweight SDN controller that can be embedded directly into Neutron, it will be possible to solve the network node bottleneck and the L3 high-availability problems in the virtual overlay network in a simple way.

Solution overview 

The proposed method is based on the separation of the routing control plane from the data plane. This is accomplished by implementing the routing logic in distributed forwarding rules on the virtual switches. In OpenFlow these rules are called flows. To put this simply, the virtual router is implemented in using OpenFlow flows. 
The functionality of the virtual routes that we should address in our solution is:

  • L3 routing
  • ARP response for the router ports
  • OAM, like ICMP (ping) and others

L3 Controller embedded in Neutron POC

We decided to create our controller PoC with Openstack design tenets in mind. Specifically the following:

  1. Scalability     - Support thousands of compute nodes
  2. Elasticity       - Keep controllers stateless to allow for dynamic growth 
  3. Performance  - Improve upon DVR
  4. Reliability      - Highly available
  5. Non intrusive - Rely on the existing abstraction and plug-able module

What we intend to prove is:
  • We indeed simplify the DVR flows
  • We reduce resource overhead (e.g. bridges, ports, namespaces)
  • We remove existing bottlenecks (compared to L3Agent and DVR)
  • We improve performance
We've started benchmarking the different models and I'll post results as soon as we have them ready.

Reactive vs. Proactive Mode

SDN defines two modes for managing a switch: reactive and proactive.
In our work, We decided to combine these two modes so that we could benefit from the advantages of both modes (although if you believe the FUD about the reactive mode performance, it is quite possible to enhance it to be purely proactive). To learn more about this mixed mode see my previous blog

The Proactive part

In the solution we install a flow pipeline in the OVS br-int in order to offload the handling of L2 and intra-subnet L3 traffic by forwarding these to the NORMAL path (utilizing the hybrid OpenFlow switch). This means that we reuse the built in mechanisms in Neutron for all L2 traffic (i.e. ML2 remains untouched and fully functional) and for L3 traffic that does not need routing (between IPs in the same tenant + subnet).
In addition we use the OVS OpenFlow extensions in order to install an ARP responder for every virtual router port. This is done to offload ARP responses to the compute nodes instead of replying from the controller.

The Reactive part

Out of the remaining traffic (i.e. inter-subnet L3 traffic) the only traffic that is handled in a reactive mode is the first packet of inter-subnet communications between VMs or traffic addressed directly to the routers' ports.

The Pipeline

Perhaps the most important part of the solution is the OpenFlow pipeline which we install into the integration bridge upon bootstrap.
This is the flow that controls all traffic in the OVS integration bridge (br-int).
The pipeline works in the following manner:

  1. Classify the traffic
  2. Forward to the appropriate element:
    1. If it is ARP, forward to the ARP Responder table
    2. If routing is required (L3), forward to the L3 Forwarding table (which implements a virtual router)
    3. Otherwise, offload to NORMAL path
At the end of this post is a detailed explanation of the pipeline and implementation brief in case you're interested in the gory details.

In my following posts I will cover in detail the code modifications that were required to support the L3 controller as well as publish a performance study comparing the L3 Agent, DVR and the L3 controller for inter subnet traffic.

If you want to try it out, the code plus installation guide  are available here.

Again, if you'd like to join the effort, feel free to get in touch.

detailed explanation of the pipeline

The following diagram shows the multi-table OpenFlow pipeline installed into the OVS integration bridge (br-int) in order to represent the virtual router using flows only:
L3 Flows Pipeline

The following table describes the purpose of each of the pipeline tables:

Table Name
Metadata &
  • Tag traffic with the appropriate metadata by input port, the value is the segmentation ID
  • Input traffic from the tunnel bridge is offloaded to the NORMAL path
  • All other traffic is sent to the "Classifier" table for classification
  • All ARP traffic is sent to the ARP responder table
  • All broadcast and multicast traffic is offloaded to the Normal path
  • All L3 traffic is sent to the L3 forwarding table
ARP Responder
  • This table handles ARP responses for the virtual router interfaces.
  • The vlan tag is removed from all other traffic and offloaded to the NORMAL path
L3 Forwarding
  • This table is used by the Controller to install flows to handle VM to VM inter subnet traffic (detailed explanation below)
  • Traffic destined for a virtual router is sent to the controller (e.g. ping, packet_in, etc)
  • All local subnet traffic is offloaded to the NORMAL path
  • All other traffic is sent to the public network table
Public Network
  •  will be described in following blogs

PoC Implementation brief

For the first release, we used the Open vSwitch (OVS), with OpenFlow v1.3 as the southbound protocol, using the RYU project implementation of the protocol stack as a base library.
The PoC support Route API for IPV4 East -West traffic.  
In the current PoC 

implementation, the L3 controller is embedded into  the  L3 service plugin.  


  1. Brilliant work, Eran! Thanks for sharing

  2. Ecxellent!
    Which SDN do you recommend?

    1. For our purpose we used the RYU framework for its OpenFlow protocol support. We used RYU here as a library. This made the code base pretty small and focused on the needs.