azure-cosmos: async client does not recover from 403 on expired AAD token (bearer token policy only handles 401) #45886

@KrzysztofKasprowicz

Bug Report

Description

When using azure.cosmos.aio.CosmosClient with azure.identity.aio.DefaultAzureCredential (managed identity), the client permanently fails with HTTP 403 after the AAD token expires. Only a full process restart recovers it.

Root Cause Analysis

The AsyncBearerTokenCredentialPolicy in azure-core invalidates its cached _token only on HTTP 401 responses:

if response.http_response.status_code == 401:
    self._token = None  # any cached token is invalid

However, Cosmos DB returns HTTP 403 (not 401) for expired AAD tokens, with sub-status 5300 (AAD_REQUEST_NOT_AUTHORIZED).

The azure-cosmos retry layer in _retry_utility_async.py handles 403 only for region-failover sub-statuses:

  • SubStatusCodes.DATABASE_ACCOUNT_NOT_FOUND (1008)
  • SubStatusCodes.WRITE_FORBIDDEN (3)

It does not handle 403 with sub-status 5300 (expired/invalid AAD token).
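To make the gap concrete, here is a small sketch of how 403 responses are treated according to the analysis above. The helper classify_403 and its return labels are hypothetical and exist only for illustration; the sub-status values are the ones quoted in this report, and this is not the code in _retry_utility_async.py.

```python
from typing import Optional

# Sub-status values quoted in this report.
DATABASE_ACCOUNT_NOT_FOUND = 1008
WRITE_FORBIDDEN = 3
AAD_REQUEST_NOT_AUTHORIZED = 5300


def classify_403(status_code: int, sub_status: Optional[int]) -> str:
    """Hypothetical helper: label how a response is treated per this report."""
    if status_code != 403:
        return "not-403"
    if sub_status in (DATABASE_ACCOUNT_NOT_FOUND, WRITE_FORBIDDEN):
        # These two sub-statuses go through the region-failover retry path.
        return "region-failover-retry"
    if sub_status == AAD_REQUEST_NOT_AUTHORIZED:
        # Expired or invalid AAD token: no retry and no token invalidation.
        return "unhandled-aad"
    return "unhandled-other"


if __name__ == "__main__":
    for sub in (WRITE_FORBIDDEN, DATABASE_ACCOUNT_NOT_FOUND, AAD_REQUEST_NOT_AUTHORIZED):
        print(sub, classify_403(403, sub))
```

Only the "unhandled-aad" case is the subject of this issue; the other branches are listed for contrast.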

The AsyncCosmosBearerTokenCredentialPolicy (in azure.cosmos.aio._auth_policy_async) inherits from AsyncBearerTokenCredentialPolicy but does not override send() to handle 403.

Result

The proactive token refresh (_should_refresh_token, 5 minutes before expiry) normally keeps the token from expiring. When that refresh fails for any reason (e.g., transient managed identity endpoint unavailability, MSAL cache issues), the token expires after about 1 hour, every subsequent Cosmos request fails with 403, and the SDK has no recovery path:

  1. The cached _token in the policy is expired, but _should_refresh_token() calls credential.get_token() which may return a cached token from MSAL
  2. Cosmos rejects it with 403
  3. The policy does not clear _token on 403 (only on 401)
  4. The cycle repeats indefinitely
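The cycle above can be reproduced in isolation with a simplified model. Everything in this sketch is a stand-in: FakeService, FakeCredential and Only401Policy are hypothetical classes, not azure-core or azure-cosmos code, and the model deliberately ignores the proactive expiry check so that the policy keeps reusing its cached token.

```python
import asyncio
from typing import List, Optional


class FakeService:
    """Stand-in for Cosmos DB: answers 403 for any token it no longer accepts."""

    def __init__(self) -> None:
        self.rejected = set()

    async def send(self, token: str) -> int:
        return 403 if token in self.rejected else 200


class FakeCredential:
    """Stand-in credential that would hand out a new token on every call."""

    def __init__(self) -> None:
        self.calls = 0

    async def get_token(self) -> str:
        self.calls += 1
        return f"token-{self.calls}"


class Only401Policy:
    """Simplified model: reuse the cached token and drop it only on HTTP 401."""

    def __init__(self, credential: FakeCredential, service: FakeService) -> None:
        self._credential = credential
        self._service = service
        self._token: Optional[str] = None

    async def send(self) -> int:
        if self._token is None:
            self._token = await self._credential.get_token()
        status = await self._service.send(self._token)
        if status == 401:
            self._token = None  # the only invalidation path in this model
        return status


async def simulate(requests_after_expiry: int = 3) -> List[int]:
    service = FakeService()
    policy = Only401Policy(FakeCredential(), service)
    statuses = [await policy.send()]          # first request succeeds
    service.rejected.add("token-1")           # the token "expires" on the server side
    for _ in range(requests_after_expiry):
        statuses.append(await policy.send())  # cached token is reused every time
    return statuses


if __name__ == "__main__":
    print(asyncio.run(simulate()))  # [200, 403, 403, 403]
```

In this model the credential could issue a fresh token at any time, but the policy never asks for one because a 403 does not clear the cache.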

Reproduction

  1. Create a long-running application (e.g., FastAPI service on Azure Container Apps)
  2. Use azure.cosmos.aio.CosmosClient with azure.identity.aio.DefaultAzureCredential (managed identity)
  3. Follow the recommended singleton pattern (single client instance for the app lifetime)
  4. Wait for the AAD token to expire
  5. All Cosmos operations fail with 403 and never recover

Expected Behavior

The AsyncCosmosBearerTokenCredentialPolicy should handle HTTP 403 responses with AAD-related sub-statuses by invalidating the cached token and retrying with a fresh one, similar to how the base class handles 401.

Suggested Fix

In AsyncCosmosBearerTokenCredentialPolicy, override send() to also clear self._token on 403 and retry:

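async def send(self, request):
    response = await super().send(request)
    # Cosmos DB reports a rejected AAD token as 403, which the base class ignores.
    if response.http_response.status_code == 403:
        self._token = None  # discard the cached token
        await self.on_request(request)  # acquire a fresh token and re-authorize the request
        response = await self.next.send(request)  # retry once with the new token
    return response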
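A narrower variant of this idea, sketched against stand-in objects rather than the real pipeline, would refresh only when the 403 carries the AAD sub-status and would retry exactly once so a permanently unauthorized identity cannot loop. The classes below are hypothetical, and the header name x-ms-substatus is an assumption about where the sub-status is carried; neither is taken from the SDK source.

```python
import asyncio
from typing import Dict, List, Optional, Tuple

SUBSTATUS_HEADER = "x-ms-substatus"   # assumed header name, not confirmed by this report
AAD_REQUEST_NOT_AUTHORIZED = "5300"   # sub-status quoted in this report


class FakeService:
    """Stand-in for Cosmos DB: rejects listed tokens with 403 and sub-status 5300."""

    def __init__(self) -> None:
        self.rejected = set()

    async def send(self, token: str) -> Tuple[int, Dict[str, str]]:
        if token in self.rejected:
            return 403, {SUBSTATUS_HEADER: AAD_REQUEST_NOT_AUTHORIZED}
        return 200, {}


class FakeCredential:
    """Stand-in credential that hands out a new token string on every call."""

    def __init__(self) -> None:
        self.calls = 0

    async def get_token(self) -> str:
        self.calls += 1
        return f"token-{self.calls}"


class RefreshOn403Policy:
    """Simplified model of the proposed behavior: refresh on 403/5300, retry once."""

    def __init__(self, credential: FakeCredential, service: FakeService) -> None:
        self._credential = credential
        self._service = service
        self._token: Optional[str] = None

    async def send(self) -> int:
        if self._token is None:
            self._token = await self._credential.get_token()
        status, headers = await self._service.send(self._token)
        if status == 403 and headers.get(SUBSTATUS_HEADER) == AAD_REQUEST_NOT_AUTHORIZED:
            # Replace the cached token and retry a single time.
            self._token = await self._credential.get_token()
            status, headers = await self._service.send(self._token)
        return status


async def simulate(requests_after_expiry: int = 3) -> Tuple[List[int], int]:
    service = FakeService()
    credential = FakeCredential()
    policy = RefreshOn403Policy(credential, service)
    statuses = [await policy.send()]          # first request succeeds
    service.rejected.add("token-1")           # the token "expires" on the server side
    for _ in range(requests_after_expiry):
        statuses.append(await policy.send())
    return statuses, credential.calls


if __name__ == "__main__":
    print(asyncio.run(simulate()))  # ([200, 200, 200, 200], 2)
```

Checking the sub-status keeps other 403 causes (for example, a missing RBAC role assignment reported under a different sub-status) from triggering extra token requests, and the single retry bounds the added latency to one round trip.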

Environment

  • azure-cosmos 4.15.0
  • azure-identity 1.25.3
  • azure-core 1.34.0
  • Python 3.12
  • Runtime: Azure Container Apps with system-assigned managed identity

Related Issues (other SDKs, same root cause)

This appears to be a cross-SDK design gap — Cosmos DB returns 403 for expired tokens, but all SDK bearer token policies only handle 401.

Labels

Cosmos, Service Attention, customer-reported, needs-team-attention, question
