[Affects Kublr 1.22.3 and earlier]

[Tags: azure, staticip]




Overview


Kublr 1.23 and later uses a newer default Azure API version, '2022-01-01', for Azure resources.

Due to changes in the default availability zone settings between different versions of the Azure resource API, some resources cannot be updated when an Azure cluster created in Kublr 1.22 or earlier is updated in Kublr 1.23 or later. Azure rejects the attempt to change the availability zone settings of an existing resource and interrupts the update.


The issue can be resolved by explicitly specifying the current availability zone settings for the problematic resources in the Kublr cluster specification.


Prerequisites


1. Kublr Control Plane upgraded to, or running, v1.23.0 or later

2. Azure cluster created in Kublr Control Plane v1.22.3 or earlier


Issue


When the Azure cluster update process is started, the user may see an Azure Deployment error in the Events tab in the UI, with the following or a similar message:


Azure Location deployment failed
Failed.
{
  "code": "DeploymentFailed",
  "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.",
  "details": [
    {
      "code": "ResourceAvailabilityZonesCannotBeModified",
      "message": "Resource /subscriptions/****/resourceGroups/****/providers/Microsoft.Network/publicIPAddresses/****-MasterIp has an existing availability zone constraint 1, 2, 3 and the request has availability zone constraint NoZone, which do not match. Zones cannot be added/updated/removed once the resource is created. The resource cannot be updated from regional to zonal or vice-versa."
    }
  ]
}


In some cases, due to a known issue in the UI, the error message may be missing the details:


Azure Location deployment failed
failed to update location: 'Azure Location deployment failed: Failed.
{}'

In this case, the error details can be confirmed directly in the Azure portal, in the Deployment Azure resource view of the corresponding cluster.


Root cause


Starting with v1.23.0, Kublr uses Azure apiVersion '2022-01-01' for Azure resources in Azure Deployments.

Azure clusters created by earlier Kublr Control Plane versions deploy their public IP addresses with apiVersion '2018-08-01'; with that apiVersion, the public IP addresses end up with an availability zone constraint by default (zones 1, 2, 3 in the error above).

When such a cluster is redeployed with apiVersion '2022-01-01', the deployment attempts to reconfigure the public IP addresses with 'NoZone' availability, and Azure rejects the change because the availability zones of an existing resource cannot be modified.


Solutions


Solution 1: stay on the old apiVersion


Update the cluster specification so that the old apiVersion is used for the impacted components: loadBalancerPublicIP, loadBalancerPrivateFrontendIPConfig, and natGatewayPublicIP.

Use the following cluster specification changes:


spec:
  locations:
    - azure:
        armTemplateExtras:
          loadBalancerPrivate:
            apiVersion: '2018-08-01'
          loadBalancerPublicIP:
            apiVersion: '2018-08-01'
          natGatewayPublicIP:
            apiVersion: '2018-08-01'


Solution 2: use availability zone constraints


You can explicitly specify zone constraints for the affected components in the cluster specification:


spec:
  locations:
    - azure:
        armTemplateExtras:
          loadBalancerPrivateFrontendIPConfig:
            zones: ['1', '2', '3']
          loadBalancerPublicIP:
            zones: ['1', '2', '3']
          natGatewayPublicIP:
            zones: ['1', '2', '3']


Make sure to check the specific list of zones configured on the resources in the Azure portal: different Azure regions may have different sets of zones, or no zones at all.
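
For example, if the Azure portal shows that the public IP addresses and the load balancer frontend were created in a single zone rather than being zone-redundant, the zone constraint in the specification must match what Azure reports. The values below are illustrative only, assuming the resources reside in zone 1:


spec:
  locations:
    - azure:
        armTemplateExtras:
          loadBalancerPrivateFrontendIPConfig:
            zones: ['1']   # illustrative value; use the zones reported by Azure
          loadBalancerPublicIP:
            zones: ['1']
          natGatewayPublicIP:
            zones: ['1']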


Solution 3: recreate the affected resources


Important Note! This is not a recommended solution: the public IP addresses used by the cluster will change, which may result in cluster and/or workload downtime!


Open the Azure portal https://portal.azure.com/ and navigate to the cluster's resource group:

  • Delete the LoadBalancers named cluster-name and cluster-name-internal
  • Detach the NATGateway cluster-name-NatGateway from all networks
  • Delete the PublicIP addresses cluster-name-MasterIp and cluster-name-NatIP
  • Delete the NATGateway cluster-name-NatGateway
  • Change the master group update policy in the cluster specification as follows and run the cluster update:


spec:
  master:
    updateStrategy:
      drainStrategy:
        skip: true
      rollingUpdate:
        maxUnavailable: 100%


  • Wait for the cluster to recover and become healthy
  • Change the master group update policy back to its previous values (see the example below)
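

Restoring the policy means reverting drainStrategy.skip and rollingUpdate.maxUnavailable to the values the master group used before the change. The following is a minimal sketch, assuming the group previously drained nodes and updated one master at a time; the exact values may differ in your cluster, so check the original specification:


spec:
  master:
    updateStrategy:
      drainStrategy:
        skip: false          # re-enable node draining (assumed previous value)
      rollingUpdate:
        maxUnavailable: 1    # assumed previous value; restore what your spec used before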