Tags: Azure


Overview


Kublr Kubernetes clusters use Azure storage accounts to coordinate and configure master and worker nodes.

The Azure API throttles requests to storage accounts on a per-client (service account), per-subscription, and per-tenant basis.


As a result, when too many cluster nodes in a subscription or tenant use the same service account, Kubernetes clusters may become intermittently unavailable. The root cause is the Azure error "The server rejected the request because too many requests have been received" with the error code "TooManyRequests".


Other applications that work with storage accounts in the same subscription or tenant may also contribute to exhausting the API call limits. These applications do not have to use the same storage account as the Kubernetes clusters: Azure throttles API calls based on the total number of calls to all storage accounts in the same subscription or tenant.


See the Microsoft Azure documentation listed under References for more details on throttling.


Solution 1: Upgrade Kubernetes to 1.18 or later


Run Kubernetes 1.18.x or later. These versions contain many improvements, which are described in "AKS throttling/429 errors" and "Support large clusters without throttling".


Solution 2: Reconfigure third-party applications to make fewer calls


If third-party applications (such as monitoring applications) make an excessive number of GET requests, change the settings of these applications to reduce the frequency of the GET calls.


Solution 3: Split your clusters into different subscriptions or tenants


If there are numerous clusters and nodes, try splitting the clusters across different subscriptions or tenants. Splitting also helps when many clients (such as Rancher, Terraform, and so on) call the Azure API.


Solution 4: Use different service accounts for different clusters


Because Azure calculates request limits for a subscription and for a tenant on a per-client (service account or user) basis, you can try using different Azure service accounts for different clusters.


Solution 5: Reduce the frequency of Kublr coordination calls


By default, Kublr master and worker nodes check the storage account secret store every 30 seconds. This interval can be increased so that Kublr agents make fewer coordination calls.
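
To get a feel for how much the interval matters, the following rough sketch estimates the steady-state number of coordination calls per hour. It assumes exactly one call per node per interval and ignores retries after errors; the function name and the node count of 100 are hypothetical and used only for illustration.

```python
def calls_per_hour(node_count, period_seconds):
    """Estimate coordination calls per hour for nodes polling at a fixed period.

    Assumption: each node makes exactly one call per period, with no retries.
    """
    if node_count < 0 or period_seconds <= 0:
        raise ValueError("node_count must be >= 0 and period_seconds must be > 0")
    return node_count * 3600 / period_seconds


# Hypothetical fleet of 100 nodes sharing one service account.
default_rate = calls_per_hour(100, 30)    # default interval of 30 seconds
relaxed_rate = calls_per_hour(100, 120)   # interval raised to 2 minutes

print(f"30s interval: {default_rate:.0f} calls/hour")
print(f"2m interval:  {relaxed_rate:.0f} calls/hour")
```

Under these assumptions, raising the interval from 30 seconds to 2 minutes cuts the call volume to a quarter. Actual traffic also depends on retries and on any other clients that use the same subscription or tenant.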


The following cluster specification snippet shows relevant configuration parameter changes:


spec:
  kublrAgentConfig:
    # delay to retry a call after an error (default: '5s')
    node_monitor_min_period: '20s'
    # standard call interval (default: '30s')
    node_monitor_period: '2m'
  kublrSeederConfig:
    # standard call interval (default: '30s')
    node_monitor_period: '2m'


A duration string is a sequence of decimal numbers, each with optional fraction and a unit suffix, such as "300ms", "1.5h" or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
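
The format described above can be checked before a value is placed in the cluster specification. The sketch below is a simplified validator written for illustration, not the parser that Kublr itself uses; the grammar appears to match the one accepted by Go's time.ParseDuration, but that correspondence is an assumption. The name parse_duration_ns is hypothetical, and signed values are not handled.

```python
import re
from decimal import Decimal

# Nanoseconds per unit, for the units listed in the text above.
_UNIT_NS = {
    "ns": 1,
    "us": 1_000,
    "µs": 1_000,
    "ms": 1_000_000,
    "s": 1_000_000_000,
    "m": 60_000_000_000,
    "h": 3_600_000_000_000,
}

# One token is a decimal number (optional fraction) followed by a unit suffix.
_TOKEN = re.compile(r"(\d+(?:\.\d*)?|\.\d+)(ns|us|µs|ms|s|m|h)")


def parse_duration_ns(text):
    """Return the duration in whole nanoseconds, or raise ValueError."""
    if not text:
        raise ValueError("empty duration string")
    total = Decimal(0)
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration {text!r} at position {pos}")
        number, unit = match.groups()
        # Decimal keeps fractional values such as "1.5h" exact.
        total += Decimal(number) * _UNIT_NS[unit]
        pos = match.end()
    return int(total)


# Values used in the snippet above, expressed in seconds.
for value in ("20s", "2m"):
    print(value, "=", parse_duration_ns(value) // 1_000_000_000, "seconds")
```

A value without a unit suffix, such as "30", is rejected by this sketch, which is a common mistake worth catching before the specification is applied.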


References