Thursday, January 15, 2015

Common Misconceptions about SDN Controller Management and Scalability

Here are a few common misconceptions about SDN controller management and scalability.

#1: Either Proactive mode or Reactive mode

When an SDN controller wants to enforce a set of rules or a policy on a forwarding element, it uses a southbound API such as OVSDB, OpFlex or OpenFlow. In OpenFlow, this process is called flow installation, and it can be done using different methods: proactive, reactive or mixed (aka the hybrid approach).




Proactive

During bootstrap, the controller installs all flows and pipelines (multi-table entries) into all the forwarding elements.  The flows must cover all possible scenarios.
Whenever there is a change in the network, the controller removes or installs flows where necessary.
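
To make this concrete, here is a minimal sketch of proactive installation using the Ryu controller framework with OpenFlow 1.3. This is only one way to do it, and the match fields and output port are hypothetical placeholders:

```python
# Minimal sketch of proactive flow installation (Ryu, OpenFlow 1.3).
# The subnet and output port below are hypothetical placeholders.
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3


class ProactivePipeline(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
    def switch_features_handler(self, ev):
        # Called once per forwarding element when it connects: push the whole
        # pipeline up front so it covers every expected scenario.
        dp = ev.msg.datapath
        ofp, parser = dp.ofproto, dp.ofproto_parser
        # Example rule: send traffic for a known subnet out of port 2.
        match = parser.OFPMatch(eth_type=0x0800,
                                ipv4_dst=('10.0.0.0', '255.255.255.0'))
        actions = [parser.OFPActionOutput(2)]
        inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]
        dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=100,
                                      match=match, instructions=inst))
```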


Reactive

When a packet arrives in a switch, a look-up is performed on the flow tables.
If there is no match and the switch is connected to a controller, it will forward the packet (either with or without its payload) to the controller.
In OpenFlow, this is called a PACKET_IN message.

It is also possible to create rules that forward the packet to the controller when matched; this, too, is a form of reactive handling.
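
A minimal reactive sketch in the same spirit (again Ryu and OpenFlow 1.3; the forwarding decision is deliberately simplified to a flood):

```python
# Minimal sketch of reactive handling (Ryu, OpenFlow 1.3): react to
# PACKET_IN, install a flow for the rest of the stream, and return the
# triggering packet to the data plane. The decision logic is a placeholder.
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import MAIN_DISPATCHER, set_ev_cls
from ryu.lib.packet import ethernet, packet
from ryu.ofproto import ofproto_v1_3


class ReactiveHandler(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPPacketIn, MAIN_DISPATCHER)
    def packet_in_handler(self, ev):
        msg = ev.msg
        dp = msg.datapath
        ofp, parser = dp.ofproto, dp.ofproto_parser
        in_port = msg.match['in_port']
        eth = packet.Packet(msg.data).get_protocols(ethernet.ethernet)[0]

        # The real forwarding decision goes here; the sketch simply floods.
        actions = [parser.OFPActionOutput(ofp.OFPP_FLOOD)]

        # Install a flow so the switch handles the rest of this stream itself.
        match = parser.OFPMatch(in_port=in_port, eth_dst=eth.dst)
        inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]
        dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=10,
                                      match=match, instructions=inst))

        # Send the packet that triggered PACKET_IN back to the data plane.
        data = msg.data if msg.buffer_id == ofp.OFP_NO_BUFFER else None
        dp.send_msg(parser.OFPPacketOut(datapath=dp, buffer_id=msg.buffer_id,
                                        in_port=in_port, actions=actions,
                                        data=data))
```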


Mixed Proactive / Reactive

With the mixed approach, you can benefit from the best of both worlds, achieving a balance between dynamic management and performance.

For frequently-used and rarely-modified flows, you install a pipeline and flows proactively.

For unmatched flows, or for flows that you want to handle in-line in your SDN application, you install a flow to forward the packet to the controller.

This mixed approach allows the controller to focus on making real-time dynamic decisions only on the traffic that requires it (reactive), while leaving the heavy-lifting of the majority of the traffic to the real-time forwarding element (proactive).

In addition, with this approach you can avoid over-provisioning the forwarding element with flows that are rare, thus dramatically reducing the number of entries in the flow tables. 
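
The glue for the mixed approach is typically a lowest-priority table-miss entry that punts anything the proactive pipeline did not match up to the controller. A hedged Ryu sketch:

```python
def add_table_miss(dp):
    """Install a priority-0 catch-all that punts unmatched packets to the
    controller, where a reactive handler (like the one sketched earlier)
    takes over. Everything else stays on the proactive fast path."""
    ofp, parser = dp.ofproto, dp.ofproto_parser
    match = parser.OFPMatch()  # wildcard: anything the pipeline missed
    actions = [parser.OFPActionOutput(ofp.OFPP_CONTROLLER,
                                      ofp.OFPCML_NO_BUFFER)]
    inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]
    dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=0,
                                  match=match, instructions=inst))
```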

#2: Lack of Scalability and High Availability


In SDN, the control plane (i.e. the SDN Controller) is separated from the data plane (i.e. the Forwarding Elements).

Due to the centralized nature of control in SDN, we now need to support high availability and redundancy of the Control Plane.

In OpenFlow, the Forwarding Element (the Switch) connects to the controller via TCP or via TLS for secure channels.

Up until OpenFlow v1.2, whenever the connection to the controller was lost, the switch could no longer send PACKET_IN messages to the controller, so it had to either drop all unmatched traffic or hand it to the NORMAL pipeline (only on switches that implement the dual, hybrid nature).

This non-deterministic approach would yield unpredictable network behavior while the controller was unavailable.

OpenFlow v1.2 introduced the capability of working with multiple controllers.
The Master and Slave modes provide a mechanism for Active-Passive high availability, whereas the Equal mode provides an Active-Active model.

OpenFlow multiple controllers


With the multi-controller capability, we gained the control plane high availability we needed, improved reliability, fast recovery from failure and controller load balancing.
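
For illustration, here is a hedged sketch of an application instance claiming the Master role for a datapath with OpenFlow 1.3 in Ryu (the generation_id bookkeeping is simplified; on the switch side, Open vSwitch can be pointed at several controllers at once, e.g. with ovs-vsctl set-controller):

```python
def request_role(dp, role_name='master', generation_id=0):
    """Hedged sketch: ask the switch to treat this controller connection as
    MASTER, SLAVE or EQUAL (Ryu, OpenFlow 1.3). The generation_id handling
    is deliberately simplified."""
    ofp, parser = dp.ofproto, dp.ofproto_parser
    role = {'master': ofp.OFPCR_ROLE_MASTER,
            'slave': ofp.OFPCR_ROLE_SLAVE,
            'equal': ofp.OFPCR_ROLE_EQUAL}[role_name]
    dp.send_msg(parser.OFPRoleRequest(dp, role, generation_id))
```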

#3: You cannot design SDN Applications for Really Big Scale


Utilizing the two mechanisms I covered - Mixed-Mode and Multi-Controller - is the key to designing really big scale SDN applications.

If you take care to design your application to be stateless (whenever possible) and to share nothing (or little) with other controller instances, you can benefit from the Reactive mode, where any instance can handle any flow.

In my opinion, the best approach for achieving minimal controller response latency and maximal bandwidth, while keeping the dynamic allocation of flows, is to use all of these mechanisms together.

When you add to that the OVS OpenFlow extensions and the dual-nature Hybrid OpenFlow capability (which I covered in my previous post), you can really gain a dramatic performance and management boost.


In my next post I will demonstrate how we utilized these design guidelines and capabilities in a prototype for an alternative to Neutron's L3 Distributed Virtual Router.

Tuesday, January 13, 2015

Hybrid OpenFlow Switch


In my last post I summarized the DVR solution and tried to explain the motivation for yet another L3 implementation in Neutron that I am going to present in the coming posts.

This 2-post series is intended to cover basic SDN and OpenFlow mechanisms that we used in the L3 controller:
  • Hybrid OpenFlow Switch 
  • SDN models for managing the forwarding elements (switches)

Hybrid OpenFlow Switch

The hybrid OpenFlow switch was introduced in OpenFlow/1.1. Hybrid switches support both the OpenFlow operation pipeline and normal (legacy) Ethernet switching functionality.

The hybrid switch allows forwarding of packets from the OpenFlow pipeline to the normal pipeline through the NORMAL and FLOOD reserved ports.

The main reason for introducing the hybrid switch was to optimize the handling of operations like MAC learning, where a reactive approach is simply not efficient: doing MAC learning in the OpenFlow controller imposes a significant cost in network bandwidth and latency, and does not scale to large networks.

The NORMAL action comes to the rescue and lets us offload the legacy, non-OpenFlow pipeline (the MAC learning mechanism, VLANs, ACLs, QoS and other base features) to the forwarding element's kernel module, which is optimized to handle such operations at near line rate.
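
A hedged sketch of this offload: a low-priority catch-all flow whose only action is the reserved NORMAL port (Ryu, OpenFlow 1.3):

```python
def offload_to_normal(dp):
    """Hedged sketch: hand everything not matched by more specific OpenFlow
    rules to the legacy pipeline of the hybrid switch via the reserved
    NORMAL port (Ryu, OpenFlow 1.3)."""
    ofp, parser = dp.ofproto, dp.ofproto_parser
    actions = [parser.OFPActionOutput(ofp.OFPP_NORMAL)]
    inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]
    dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=1,
                                  match=parser.OFPMatch(), instructions=inst))
```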

But what happens when the NORMAL action is used in an OpenFlow flow?

Basically, what happens is that the traffic is redirected to a completely separated processing pipeline. This is illustrated in the diagram below.


OVS Hybrid OpenFlow Switch Pipelines
The OpenFlow pipeline and the Normal pipeline each act as a completely isolated switch.

There are, however, some issues with this hybrid approach.
  • The NORMAL pipeline is not standardized, so it behaves differently on switches from different vendors - there is variance in the supported features and no standard way to configure them
  • The NORMAL pipeline does not play well with some OpenFlow actions. For example, if a port is tagged for the NORMAL pipeline (using ovs-vsctl), you cannot tag it using OpenFlow actions and then forward it to the NORMAL path, because it will end up being dropped due to a double-tagging error.

The Open vSwitch extensions to OpenFlow were developed to support these extra features using flows, for example the LEARN action (an Open vSwitch extension to OpenFlow) for MAC learning.
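
As an illustration, here is a hedged sketch (driving ovs-ofctl from Python) of a MAC-learning flow in the style of the Open vSwitch tutorials; the bridge name, table numbers and timeout are placeholders:

```python
# Hedged sketch: MAC learning in the fast path using the Open vSwitch
# LEARN action instead of the controller. Bridge name, table numbers and
# timeout are illustrative placeholders.
import subprocess

BRIDGE = "br0"  # hypothetical bridge name

# Every packet hitting table 0 teaches table 1 a reverse rule: "packets whose
# eth_dst equals this packet's eth_src should use this packet's in_port",
# and then continues processing in table 1.
learn_flow = (
    "table=0, priority=1, "
    "actions=learn(table=1, hard_timeout=300, "
    "NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[], "
    "load:NXM_OF_IN_PORT[]->NXM_NX_REG0[0..15]), "
    "resubmit(,1)"
)

subprocess.run(["ovs-ofctl", "add-flow", BRIDGE, learn_flow], check=True)
```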

In my next post I will cover the SDN models for managing the forwarding elements in reactive, proactive and mixed modes.

Thursday, January 8, 2015

Openstack Neutron DVR - Summary

Following my 3-post series about Openstack Juno DVR in detail, this post is a summary of DVR from my point of view.

In the next post I’m going to present a POC implementation of an embedded L3 controller in Neutron that solves the problem differently. In this post I’d like to explain the motivation for yet another L3 implementation in Neutron.

I believe the L3 Agent drawbacks are pretty clear (centralized bottleneck, limited HA, etc), so in this post I’m going to cover the key benefits and drawbacks of DVR as I see them. So without further ado, let’s start.

Pros


First of all, from a functional point of view, DVR successfully distributed the East-West traffic and the DNAT floating IP traffic, significantly offloading the contended Network node. This achieves two key benefits: first, the failure domains are much smaller (a Network node failure only affects SNAT), and second, scalability improves because the load is distributed across all the compute nodes.




Cons

The approach for DVR’s design was to take the existing centralized router implementation based on Linux network namespaces and clone it on all compute nodes. This is an evolutionary approach and an obvious step to take in distributing L3, and one we’ve learned a lot from. However, it adds load and complexity in three major areas: Management, Performance and Code.

To explain the technical details I will briefly cover two Linux networking constructs used in the solution for East-West communications:

1. Linux network namespaces (which internally load a complete TCP/IP stack with ARP tables and routing tables)
2. OVS flows

OVS flows were needed in order to block local ARP, redirect return traffic directly to the VMs and replace the router ports’ MAC addresses (which cannot be accomplished reasonably using namespaces alone). On the flip side, OVS flows can easily accomplish everything that DVR uses the namespaces for (in East-West communication) and, perhaps more importantly, they do so more efficiently (avoiding the overhead of the extra TCP/IP stack, etc.). To hint at our solution: flows also allow us to further improve the solution by selectively using a reactive approach where relevant.



Extra Flows Installed for DVR
So overall, it seems that using the Linux network namespace as a black box to emulate a router is overkill. This approach makes sense in the centralized solution (L3 Agent), where each tenant’s virtual router uses a single namespace in the DC; however, in the distributed approach, where the number of namespaces in the DC is multiplied by the number of compute nodes, it needs to be reevaluated.
If you’re not convinced yet, here is a summary of the implications of this approach:

Resource consumption and Performance

1. DVR adds an additional namespace as well as virtual ports per router on all compute nodes that host routed VMs. This means additional TCP/IP stacks that each and every cross subnet packet traverses. This adds latency and host CPU consumption.

2. ARP tables on all namespaces in the Compute nodes are proactively pre-populated with all the possible entries. Whenever a VM is started, all compute nodes which have running VMs from the same tenant will be updated to keep these tables up to date (which potentially adds latency to VM start time)

3. Flows and routing rules are proactively installed on all compute nodes.

Management complexity

1. Multiple configuration points that need to be synchronized: namespaces, routing tables, flow tables, ARP tables and IPTables.

2. Namespaces on all compute nodes need to be kept in sync all the time with the tenant’s VM ARP tables and routing information and need to be tracked to handle failures etc.

3. The current DVR implementation does not support a reactive mode (i.e. creating a flow just-in-time), thus all possible flows are created on all hosts, even if they’re never used.

Code Complexity

1. DVR required cross-component changes due to its multiple configuration points: the Neutron server Data Model, ML2 Plugin, L3 Plugin, OVS L2 Agent and the L3 Agent.

2. Using all the Linux networking constructs in a single solution (namespaces, flows, Linux bridges, etc.) requires code that can manage all of them.

3. The solution is tightly coupled with the overlay manager (ML2), which means that every new type driver (today only vxlan) requires code additions at all levels (as can be seen from the vlan patches).

In my next post I will present an alternative solution that we evaluated and developed (code available) for distributing the virtual router, which attempts to overcome these limitations using SDN technologies.

Comments, questions and corrections are welcome.

Thursday, January 1, 2015

Openstack Neutron Distributed Virtual Router (DVR) - Part 3 of 3

In this post, the last of a 3-post series about Openstack Juno DVR, I go into the North-South SNAT scenario in detail.

Up until Juno, all L3 traffic was sent through the network node.  In Juno, DVR was introduced to distribute the load from the network node onto the compute nodes.

The L3 networking in Neutron is divided into 3 main services:
  1. East-West communication: IP traffic between VMs in the data center
  2. Floating IP (aka DNAT): The ability to provide a public IP to a VM, making it directly accessible from public networks (i.e. internet)
  3. Shared IP (aka SNAT): The ability to provide public network access to VMs in the data center using a shared (public) IP address
In my previous posts, I covered how DVR distributes the East-West L3 traffic and the North-South DNAT traffic.
In this post I finish covering the North-South traffic with the Shared IP (SNAT).

SNAT Shared Gateway  North-South 

The SNAT North-South shared public gateway functionality was not distributed by DVR.  It remains centralized on the Network Node, as you can see in the diagram below.


In order to configure the SNAT namespaces on the Network node, the SNAT-DVR L3 agent is deployed.

An additional private address is assigned in the SNAT namespace for each of the DVR connected subnets, thus providing the centralized SNAT functionality.

Let's follow a communication flow from VM3 in the App network to the public network.
  1. The packet is forwarded to the DVR namespace via the local DVR default gateway port
  2. If the packet destination is not private, then it is routed to the appropriate default public gateway port (based on the source subnet) on the SNAT namespace 
  3. An OVS flow converts the local VLAN into the App network segmentation ID
  4. The packet is forwarded into the SNAT namespace and NAT-ed using the SNAT public address
Let's look at a concrete example:
[Public Network range]=10.100.100.160/28
[App Network]=192.168.200.0/24
[Shared Public IP (SNAT)]=10.100.100.162
[VM3 private IP]=192.168.200.6
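
For this example, the centralized SNAT step essentially boils down to a source-NAT rule inside the SNAT namespace. A hedged sketch follows (the real agent programs its own IPTables chains; the namespace and chain names here are simplified):

```python
# Hedged sketch: the essence of the centralized SNAT step for the example
# above, expressed as a plain IPTables rule inside the SNAT namespace.
# The namespace name is a placeholder; Neutron uses its own chains.
import subprocess

SNAT_NS = "snat-<routerid>"          # placeholder namespace name
APP_NET = "192.168.200.0/24"         # [App Network]
SHARED_PUBLIC_IP = "10.100.100.162"  # [Shared Public IP (SNAT)]

subprocess.run(
    ["ip", "netns", "exec", SNAT_NS,
     "iptables", "-t", "nat", "-A", "POSTROUTING",
     "-s", APP_NET, "-j", "SNAT", "--to-source", SHARED_PUBLIC_IP],
    check=True)
```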

In my next post I will summarize the DVR solution in Openstack Juno.

Questions and comments are welcome.

The configuration files I used to set up Openstack Neutron in DVR mode can be found here.

Monday, December 29, 2014

Openstack Neutron Distributed Virtual Router (DVR) - Part 2 of 3

In this post, the 2nd of a 3-post series about DVR, I go into the North-South DNAT scenario in detail.

Up until Juno, all L3 traffic was sent through the network node.  In Juno, DVR was introduced to distribute the load from the network node onto the compute nodes.


The L3 networking in Neutron is divided into 3 main services:

  1. East-West communication: IP traffic between VMs in the data center
  2. Floating IP (aka DNAT): The ability to provide a public IP to a VM, making it directly accessible from public networks (i.e. internet)
  3. Shared IP (aka SNAT): The ability to provide public network access to VMs in the data center using a shared (public) IP address
In my previous post, I covered how DVR distributes the East-West L3 traffic.
In this post I am going to begin covering the North-South traffic, starting with Floating IP (DNAT).

DNAT Floating IP North-South 

In order to support Juno's DVR local handling of floating IP DNAT traffic in the compute nodes, we now require an additional physical port that connects to the external network, on each compute node. 

The Floating IP functionality enables direct access from the public network (e.g. Internet) to a VM.

Let's follow the example below, where we will assign a floating IP to the web servers.  






When we associate a VM with a floating IP, the following actions take place:
  1. The fip-<netid> namespace is created on the local compute node (if it does not yet exist)
  2. A new port rfp-<portid> is created in the qrouter-<routerid> namespace (if it does not yet exist)
  3. The rfp port on the qrouter namespace is assigned the associated floating IP address
  4. The fpr port in the fip namespace is created and linked via a point-to-point network to the rfp port of the qrouter namespace
  5. The fip namespace gateway port fg-<portid> is assigned an additional address from the public network range (the floating IP range)
  6. The fg-<portid> is configured as a Proxy ARP

Now, let's take a closer look at VM4 (one of the web servers).

In the diagram below, the red dashed line shows the outbound network traffic flow from VM4 to the public network.



The flow goes through the following steps:
  1. The originating VM sends a packet via its default gateway, and the integration bridge forwards the traffic to the local DVR gateway port (qr-<portid>).
  2. DVR routes the packet using the routing table to the rfp-<portid> port
  3. A NAT rule is applied to the packet using IPTables, changing the source IP of VM4 to the assigned floating IP; the packet is then sent through the rfp-<portid> port, which connects to the fip namespace via a point-to-point network (e.g. 169.254.31.28/31)
  4. The packet is received on the fpr-<portid> port in the fip namespace and then routed outside through the fg-<portid> port
At this point you may be confused with the descriptions, so let's try to simplify this a bit with a concrete example:
[Public Network range]=10.100.100.160/28
[Web Network]=10.0.0.0/24
[VM4 floating IP]=10.100.100.163
[VM4 private IP]=10.0.0.6
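
Here is a hedged sketch of what the floating IP NAT amounts to for VM4, written as plain IPTables rules in the qrouter namespace (Neutron's actual chain names differ; the values are taken from the example above):

```python
# Hedged sketch: the NAT essence of VM4's floating IP, as plain IPTables
# rules inside the qrouter namespace. Neutron installs equivalent rules in
# its own chains; the namespace name here is a placeholder.
import subprocess

QROUTER_NS = "qrouter-<routerid>"   # placeholder namespace name
FLOATING_IP = "10.100.100.163"      # [VM4 floating IP]
FIXED_IP = "10.0.0.6"               # [VM4 private IP]

def ns_nat_rule(*rule):
    subprocess.run(["ip", "netns", "exec", QROUTER_NS,
                    "iptables", "-t", "nat", *rule], check=True)

# Inbound: traffic addressed to the floating IP is DNAT-ed to the VM.
ns_nat_rule("-A", "PREROUTING", "-d", FLOATING_IP, "-j", "DNAT",
            "--to-destination", FIXED_IP)
# Outbound: traffic sourced from the VM is SNAT-ed to the floating IP.
ns_nat_rule("-A", "POSTROUTING", "-s", FIXED_IP, "-j", "SNAT",
            "--to-source", FLOATING_IP)
```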

As you can see in the diagram, routing consumes an additional IP from the public range per compute node (e.g. 10.100.100.164).


The reverse flow goes along the same route; the fg-<portid> port acts as a proxy ARP for the DVR namespace.

In the next post, I will go into the North-South scenario using Shared IP (SNAT).

Please feel free to leave comments, questions and corrections.

Tuesday, December 16, 2014

Openstack Neutron Distributed Virtual Router (DVR) - Part 1 of 3

In this 3-post series I cover how DVR works in (just about enough) detail. 

I assume you know how virtual routing is managed in Neutron.

Up until Juno, all L3 traffic was sent through the network node, including even traffic between VMs residing on the same physical host.
This, of course, created a bottleneck, as well as a single point of failure (SPOF).

In Juno, DVR was introduced to overcome these problems.

As I described in my previous post, L3 networking in Neutron is divided into 3 main services:

  1. East-West communication: IP traffic between VMs in the data center
  2. Floating IP (aka DNAT): The ability to provide a public IP to a VM, making it directly accessible from public networks (i.e. internet)
  3. Shared IP (aka SNAT): The ability to provide public network access to VMs in the data center using a shared (public) IP address

The main idea of DVR is to distribute the L3 Agent functionality to all the compute nodes, distributing the load and eliminating the SPOF.

DVR successfully distributes the first two services. 

However, SNAT is not distributed by DVR, and remains handled centrally on the Network Node.

The main challenges for SNAT distribution were maintaining FWaaS statefulness and conserving IP addresses from the public network address pool (since distributing SNAT would require each compute node to have a public IP address).


Neutron components with DVR entities are shown in red

DVR implements a virtual router element on every compute node using the Linux network namespace (similarly to L3 Agent virtual router). 

This is achieved by cloning the current centralized L3 implementation onto every compute node.
For each service, this is handled in a different manner.

Inter subnet routing East-West 

All the cross subnet traffic (between VMs of the same tenant) in DVR is now handled locally on the compute node using the router namespace.

A Linux namespace is created for every virtual router, on each compute node that hosts VMs that are connected to that router.

In order to configure these local DVR namespaces, the DVR L3 Agent (which used to be deployed only to the network node) is deployed onto all compute nodes as you can see in the diagram below.  

Extra OVS flows were required to enable the DVR functionality. To support this, enhancements were applied to the L2 OVS Agent; these enhancements are described below.

In order to simplify management, the same IP and MAC addresses are reused on all the compute node DVRs (i.e. a single {IP, MAC} pair per virtual router port).
Owing to this, ARP traffic for the virtual router ports is kept local to the compute node.

All the ARP entries for networks attached to each DVR are proactively populated by the L3 DVR Agent on all the cloned virtual router namespaces, i.e. on all the compute nodes that participate. 
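
At the Linux level, this pre-population amounts to pushing permanent neighbour entries into each cloned router namespace. A hedged sketch (all names and addresses are placeholders):

```python
# Hedged sketch: pre-populating a permanent ARP (neighbour) entry for a VM
# inside a cloned DVR namespace, roughly what the DVR L3 Agent does on every
# participating compute node. All names and addresses are placeholders.
import subprocess

QROUTER_NS = "qrouter-<routerid>"   # placeholder namespace name
QR_PORT = "qr-<portid>"             # placeholder router port in the namespace
VM_IP = "10.0.0.6"                  # example VM address
VM_MAC = "fa:16:3e:00:00:06"        # example VM MAC

subprocess.run(
    ["ip", "netns", "exec", QROUTER_NS,
     "ip", "neigh", "replace", VM_IP, "lladdr", VM_MAC,
     "dev", QR_PORT, "nud", "permanent"],
    check=True)
```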

In order to avoid having traffic from different physical hosts using the same MAC (which, as we said, we are reusing), each of the cloned DVRs is assigned a system-wide unique MAC address. 

I will reuse the example from my previous post: the popular cloud deployment 3-tier web application design pattern (www-app-db).

In Horizon, we set up 3 subnets (in addition to the public) with one virtual router to bind them all.













On each compute node, the DVR namespace is pre-configured with the following:
  • Port MAC and IP addresses
  • Routing tables (default route per subnet and static routes)
  • ARP table entries for all the VMs in the connected networks
In addition to the namespace, OVS flows are installed into the integration bridge and into the tunnel bridge (an illustrative sketch follows this list) to:
  • Isolate DVR broadcast domain to be local to the integration bridge  
  • Translate the globally assigned MAC of the DVR into the local MAC address and vice versa
  • Redirect traffic correctly per VM (new flows are installed for every new VM started on the compute node)
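
The sketch below illustrates, in a heavily simplified form, the kind of MAC-rewrite flows this involves; the real Neutron flows span more tables and use additional match fields, and all MAC addresses, ports and priorities here are placeholders:

```python
# Hedged, simplified sketch of the MAC-rewrite flows DVR relies on. Real
# Neutron flows are spread across more tables; all values are placeholders.
import subprocess

LOCAL_GW_MAC = "fa:16:3e:aa:aa:aa"     # shared per-router gateway MAC
UNIQUE_DVR_MAC = "fa:16:3f:11:11:11"   # host-unique DVR MAC
VM_MAC = "fa:16:3e:00:00:06"           # destination VM
VM_OFPORT = "5"                        # OpenFlow port of the VM on br-int

def add_flow(bridge, flow):
    subprocess.run(["ovs-ofctl", "add-flow", bridge, flow], check=True)

# Leaving the source host: replace the shared gateway MAC with the
# host-unique DVR MAC before the packet enters the overlay.
add_flow("br-tun", f"priority=4,dl_src={LOCAL_GW_MAC},"
                   f"actions=mod_dl_src:{UNIQUE_DVR_MAC},NORMAL")

# Arriving at the destination host: recognise the unique DVR MAC, restore a
# local gateway MAC and deliver straight to the destination VM's port.
add_flow("br-int", f"priority=4,dl_src={UNIQUE_DVR_MAC},dl_dst={VM_MAC},"
                   f"actions=mod_dl_src:{LOCAL_GW_MAC},output:{VM_OFPORT}")
```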
Realization of our topology on 2 Compute nodes

As we can see, VM1 on Web subnet communicates with VM3 on App subnet:
  1. The packet reaches the subnet GW port on the local compute node 
    • It is routed using the DVR namespace routing table into the destination subnet
    • The Linux namespace then performs the following actions on the packet:
      • Use the pre-populated ARP table to set the destination MAC address
      • Set the source MAC to the local DVR GW port of the destination subnet
  2. On the tunnel bridge the source MAC is replaced with the global DVR MAC address
  3. The packet reaches the compute node that hosts the destination VM:
    • The segmentation ID is replaced with the local VLAN ID
    • The packet is matched by the global source MAC address and forwarded to the integration bridge
  4. The packet source MAC is replaced with the appropriate local DVR GW MAC and forwarded directly to the destination VM port
In the reverse flow from VM3 to VM1, packets are routed through the local DVR instance of the compute node that hosts VM3.
Note that the reverse flow goes in a different route (see the blue line). 
Due to this behavior, we cannot create stateful East-West rules, hence FWaaS for East-West traffic is not supported by DVR Juno.

In my next posts, I will go into the North-South scenario, covering the DNAT and SNAT services.

Wednesday, December 10, 2014

Layer-3 Services in OpenStack Neutron Pre Juno

  
In this multi-part blog series I intend to dive into the L3 Services in Neutron Openstack.  

Neutron defines an API extension that allows Administrators and tenants to create virtual routers.  The purpose of these routers is to connect several virtual L2 network subnets (which are defined using other Neutron APIs).
Another API defined by Neutron and implemented in the L3 service is the Floating IP extension that provides public connectivity, whether directly to a VM (aka DNAT) or via shared gateway (aka SNAT).

The L3 Services in Neutron also handle the optional Extra Route extension API.
Neutron Virtual Router



In the above diagram we can see the following:


  • Yellow: Inter subnet routing (East/West)
  • Cyan: SNAT (port mapping and masquerading the IP address)
  • Red: DNAT (floating IPs, public N/S connectivity directly to VM)
  • Static routes (Extra Routes), defined inside the virtual router


Now, let's look at how everything gets wired when the system loads.



1. The L3 Service Plugin loads inside the Neutron server (which usually runs on the Controller Node).  It handles the layer 3 RESTful APIs and the data access layer. 


L3 Services API

2. The L3 Agent usually runs on the Network Node.  When started, it registers itself on the Neutron L3 Service Plugin Router Scheduler, via the Message Queue.  It implements the virtual router functionality defined by the Neutron API. 

3. Each L3 Agent locally manages virtual routers that are assigned to it, based on configuration data it receives on the Message Queue from the L3 Service Plugin

4. The Openstack Neutron reference implementation of the L3 Agent uses Linux network namespaces and IPTables rules to implement the virtual routers.

5. A Linux network namespace provides an isolated network stack with its own routing table and IPTables rules.  It enables reuse of IP addresses, with scope limited to the namespace.

6. The L3 Agent creates a network namespace for each virtual router in the tenant network.  Then, it creates all the virtual router ports inside the namespace.

7. Each virtual router port in the namespace is tagged in the integration bridge (aka br-int) with the local segmentation ID of the subnet.

8. Each virtual router namespace includes a gateway port to the external network (aka br-ex).

9. DNAT (Floating IP) and SNAT are implemented using IPTables rules that are applied on the gateway port.
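
To make steps 4-9 concrete, here is a hedged sketch of reproducing the bare skeleton of such a virtual router by hand (names are placeholders, and the OVS wiring, VLAN tagging and Neutron-specific IPTables chains are omitted):

```python
# Hedged sketch: the skeleton of what the L3 Agent builds per virtual router,
# reproduced by hand. Namespace and interface names are placeholders; the OVS
# port wiring, VLAN tagging and Neutron-specific IPTables chains are omitted.
import subprocess

NS = "qrouter-<routerid>"   # one namespace per virtual router (step 6)
GW_IF = "qg-<portid>"       # gateway port towards br-ex (step 8)

def run(*cmd):
    subprocess.run(list(cmd), check=True)

# Steps 5-6: an isolated network stack for the router.
run("ip", "netns", "add", NS)
# Step 9 (simplified): source-NAT tenant traffic leaving via the gateway port.
run("ip", "netns", "exec", NS, "iptables", "-t", "nat", "-A", "POSTROUTING",
    "-o", GW_IF, "-j", "MASQUERADE")
```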

Let's take the popular cloud deployment 3-tier web application design pattern (www-app-db) as an example.
In Horizon, we set up 3 subnets (in addition to the external) with one virtual router to bind them all.




  • The virtual router is realized by the L3 Agent using a namespace
  • For each connected network a port is created on the br-int within the namespace
  • Each port is tagged with a local VLAN ID
  • Local VLAN IDs are mapped to the Segmentation ID by the Tunnel Bridge (aka br-tun)





For the most part, this solution works well.  However, there is an inherent limitation that affects the overall system performance and scalability:

All cross-subnet traffic hits the Network Node. 


  
The problem becomes more apparent in large deployments.  The Network node quickly becomes a bottleneck in the system.










In Juno release, a solution to this problem was introduced.  
It was called DVR (Distributed Virtual Router).
The main idea was to distribute L3 Agents functionality onto the Compute nodes.

In my next post, I will describe the DVR solution.