Scaling to support thousands of BGP peerings in a SaaS environment

Costas Drogos

denog11

When analyzing peering traffic and identifying DDoS attacks, BGP provides valuable insight that supplements flow data. In this talk we'll go over the challenges, actions, and lessons learned over the past four years of supporting thousands of peerings in a multi-tenant SaaS platform.

Kentik uses multiple auxiliary sources, such as SNMP, DNS, RADIUS, or Streaming Telemetry, to enrich the ingested flow data. The most prominent of these sources, though, is BGP. With BGP data, Kentik can produce BGP-related analytics such as peering analytics. It can also use the peering bidirectionally to enable DDoS mitigation capabilities such as RTBH and Flowspec.

In this presentation we'll start with a short introduction to how Kentik uses BGP, in order to define the technical requirements for the setup. We'll then walk through the different generations of the setup over the years:
1. 1 active node (2 nodes in active-backup) - ucarp
2. 4 active nodes with mask-based hashing - RTBH functionality and exabgp are introduced
3. 10 active nodes with full-tuple hashing and support for balancing IPv6 (current setup, slowly being deprecated) - Flowspec is introduced
4. 16+ nodes with IPVS+keepalived and easy pooling/depooling setup (now in testing)
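
As a rough illustration of the difference between generations 2 and 3 in the list above, the sketch below contrasts the two hashing styles. It is not Kentik's implementation: the function names, the use of the low-order bits of the peer's source address as the "mask", and the choice of SHA-256 over the TCP 4-tuple are all assumptions made for illustration only.

```python
import hashlib
import ipaddress


def mask_based_node(src_ip: str, node_count: int = 4) -> int:
    """Pick a node from the low-order bits of the peer's source address.

    Hypothetical helper: assumes the mask covers the lowest bits, so
    node_count must be a power of two.
    """
    if node_count <= 0 or node_count & (node_count - 1):
        raise ValueError("node_count must be a power of two")
    addr = int(ipaddress.ip_address(src_ip))
    return addr & (node_count - 1)


def full_tuple_node(src_ip: str, src_port: int,
                    dst_ip: str, dst_port: int,
                    node_count: int = 10) -> int:
    """Pick a node by hashing the whole TCP 4-tuple.

    Hypothetical helper: works for IPv4 and IPv6 alike, and node_count
    does not need to be a power of two.
    """
    key = b"".join([
        ipaddress.ip_address(src_ip).packed,
        src_port.to_bytes(2, "big"),
        ipaddress.ip_address(dst_ip).packed,
        dst_port.to_bytes(2, "big"),
    ])
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % node_count


# Peers that all use the ".1" address of their subnet collide under the mask.
peers = ["192.0.2.1", "198.51.100.1", "203.0.113.1"]
mask_nodes = [mask_based_node(ip) for ip in peers]

# With the full tuple, the ephemeral source port also feeds the hash.
tuple_nodes = [full_tuple_node(ip, 40000 + i, "192.0.2.254", 179)
               for i, ip in enumerate(peers)]
```

In this toy example every peer in `peers` maps to the same node under the mask, because only the last two address bits are inspected. That shows one way a mask-based scheme can balance unevenly, whereas hashing the full tuple draws on more input bits.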

The requirement throughout is that the external, customer-facing service remains stable and needs no reconfiguration. For each phase we'll illustrate the challenges, examine the options available to Kentik engineers, explain the choice that was made, and describe the outcome. Together, these steps have led to Kentik supporting more than 4000 peerings across 16 nodes today.