Bug Report
Description
When using azure.cosmos.aio.CosmosClient with azure.identity.aio.DefaultAzureCredential (managed identity), the client permanently fails with HTTP 403 after the AAD token expires. Only a full process restart recovers it.
Root Cause Analysis
The AsyncBearerTokenCredentialPolicy in azure-core only invalidates its cached _token on HTTP 401 responses (source):
    if response.http_response.status_code == 401:
        self._token = None  # any cached token is invalid
However, Cosmos DB returns HTTP 403 (not 401) for expired AAD tokens (sub-status 5300, AAD_REQUEST_NOT_AUTHORIZED).
The azure-cosmos retry layer in _retry_utility_async.py handles 403 only for region-failover sub-statuses:
- SubStatusCodes.DATABASE_ACCOUNT_NOT_FOUND (1008)
- SubStatusCodes.WRITE_FORBIDDEN (3)
It does not handle 403 with sub-status 5300 (expired/invalid AAD token).
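To make the gap concrete, here is a minimal sketch of how a 403 could be classified by sub-status. It assumes the sub-status arrives in the x-ms-substatus response header; the function name classify_403 and the category labels are hypothetical and not part of azure-cosmos. The numeric codes are the ones quoted in this report.

```python
# Hypothetical helper, not part of azure-cosmos: sorts a 403 by its sub-status.
REGION_FAILOVER_SUBSTATUSES = {1008, 3}  # DATABASE_ACCOUNT_NOT_FOUND, WRITE_FORBIDDEN
AAD_SUBSTATUSES = {5300}                 # AAD_REQUEST_NOT_AUTHORIZED (per this report)


def classify_403(status_code, headers):
    """Return a coarse label for a response, based on status and sub-status."""
    if status_code != 403:
        return "not_403"
    raw = headers.get("x-ms-substatus")  # assumed header carrying the sub-status
    try:
        substatus = int(raw)
    except (TypeError, ValueError):
        return "other"  # header missing or not numeric
    if substatus in REGION_FAILOVER_SUBSTATUSES:
        return "region_failover"  # the case the retry layer handles today
    if substatus in AAD_SUBSTATUSES:
        return "aad_auth"  # the case this report says is unhandled
    return "other"
```

Only the "aad_auth" branch would justify dropping the cached token; a plain RBAC denial would fall under "other" and should not trigger a retry.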
The AsyncCosmosBearerTokenCredentialPolicy (in azure.cosmos.aio._auth_policy_async) inherits from AsyncBearerTokenCredentialPolicy but does not override send() to handle 403.
Result
Once the initial AAD token expires (~1 hour), every subsequent Cosmos request fails with 403. The proactive token refresh (_should_refresh_token, 5 minutes before expiry) normally prevents this, but when it fails for any reason (e.g., transient managed identity endpoint unavailability, MSAL cache issues), the SDK has no recovery path:
- The cached _token in the policy is expired, but _should_refresh_token() calls credential.get_token() which may return a cached token from MSAL
- Cosmos rejects it with 403
- The policy does not clear _token on 403 (only on 401)
- The cycle repeats indefinitely
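The loop above can be illustrated with a toy model that has no Azure dependency. Every class here (ToyCredential, ToyService, ToyPolicy) is a hypothetical stand-in, and the model deliberately omits proactive refresh to mimic the case where it has failed. It only shows why a cache that is invalidated on 401 alone never recovers from a 403.

```python
import asyncio


class ToyCredential:
    """Stand-in credential: issues a new token string on every call."""
    def __init__(self):
        self.issued = 0

    async def get_token(self):
        self.issued += 1
        return f"token-{self.issued}"


class ToyService:
    """Stand-in service: answers 403 for tokens marked as expired, else 200."""
    def __init__(self):
        self.expired = set()

    async def handle(self, token):
        return 403 if token in self.expired else 200


class ToyPolicy:
    """Caches a token and drops it only for the status codes in invalidate_on."""
    def __init__(self, credential, service, invalidate_on):
        self.credential = credential
        self.service = service
        self.invalidate_on = invalidate_on
        self._token = None

    async def send(self):
        if self._token is None:
            self._token = await self.credential.get_token()
        status = await self.service.handle(self._token)
        if status in self.invalidate_on:
            self._token = None  # the next send() fetches a new token
        return status


async def run_scenario(invalidate_on):
    service = ToyService()
    policy = ToyPolicy(ToyCredential(), service, invalidate_on)
    statuses = [await policy.send()]   # first request succeeds
    service.expired.add(policy._token)  # the cached token now expires
    for _ in range(3):
        statuses.append(await policy.send())
    return statuses


print(asyncio.run(run_scenario((401,))))      # [200, 403, 403, 403]
print(asyncio.run(run_scenario((401, 403))))  # [200, 403, 200, 200]
```

With invalidation on 401 only, the toy policy keeps sending the same rejected token; once 403 is added, the request after the first failure succeeds.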
Reproduction
- Create a long-running application (e.g., FastAPI service on Azure Container Apps)
- Use azure.cosmos.aio.CosmosClient with azure.identity.aio.DefaultAzureCredential (managed identity)
- Follow the recommended singleton pattern (single client instance for the app lifetime)
- Wait for the AAD token to expire
- All Cosmos operations fail with 403 and never recover
Expected Behavior
The AsyncCosmosBearerTokenCredentialPolicy should handle HTTP 403 responses with AAD-related sub-statuses by invalidating the cached token and retrying with a fresh one, similar to how the base class handles 401.
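A rough sketch of that behavior, written against stand-in objects rather than the real azure-core pipeline: SketchAuthPolicy, its constructor arguments, and the use of the x-ms-substatus header are assumptions made for illustration. It retries once, and only when the 403 carries an AAD sub-status, so ordinary permission denials are returned to the caller unchanged.

```python
import asyncio
from types import SimpleNamespace

AAD_SUBSTATUSES = {"5300"}  # from this report; other AAD sub-statuses may apply


class SketchAuthPolicy:
    """Illustrative stand-in for a bearer-token policy (not the azure-core class)."""

    def __init__(self, get_token, next_send):
        self._get_token = get_token  # async callable returning a token string
        self._next_send = next_send  # async callable(request) -> response
        self._token = None

    async def on_request(self, request):
        # Fetch a token only when none is cached, then attach it.
        if self._token is None:
            self._token = await self._get_token()
        request.headers["Authorization"] = f"Bearer {self._token}"

    @staticmethod
    def _is_aad_rejection(response):
        return (
            response.status_code == 403
            and response.headers.get("x-ms-substatus") in AAD_SUBSTATUSES
        )

    async def send(self, request):
        await self.on_request(request)
        response = await self._next_send(request)
        if self._is_aad_rejection(response):
            self._token = None  # treat the cached token as invalid
            await self.on_request(request)
            response = await self._next_send(request)  # single retry, no loop
        return response


async def demo():
    tokens = iter(["stale", "fresh"])

    async def get_token():
        return next(tokens)

    async def fake_service(request):
        # Rejects the stale token the way the report describes: 403 / 5300.
        if request.headers["Authorization"] == "Bearer stale":
            return SimpleNamespace(status_code=403, headers={"x-ms-substatus": "5300"})
        return SimpleNamespace(status_code=200, headers={})

    policy = SketchAuthPolicy(get_token, fake_service)
    response = await policy.send(SimpleNamespace(headers={}))
    return response.status_code
```

Limiting the retry to one attempt keeps a genuinely unauthorized identity from looping, and the sub-status check avoids re-requesting tokens on 403s that a new token cannot fix.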
Suggested Fix
In AsyncCosmosBearerTokenCredentialPolicy, override send() to also clear self._token on 403 and retry:
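    async def send(self, request):
        response = await super().send(request)
        if response.http_response.status_code == 403:
            # Cosmos DB signals an expired AAD token with 403, so drop the cached token
            self._token = None
            # Re-authorize the request with a fresh token and resend it once
            await self.on_request(request)
            response = await self.next.send(request)
        return response
Environment
- azure-cosmos 4.15.0
- azure-identity 1.25.3
- azure-core 1.34.0
- Python 3.12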
- Runtime: Azure Container Apps with system-assigned managed identity
Related Issues (other SDKs, same root cause)
- JavaScript: CosmosDB with managed Identity - Provided AAD token has been expired azure-sdk-for-js#22620
- .NET: 403, substatus 5301 azure-cosmos-dotnet-v3#3110
This appears to be a cross-SDK design gap: Cosmos DB returns 403 for expired tokens, but the SDK bearer token policies only handle 401.