Conversation
There was a problem hiding this comment.
Pull request overview
Improves ROCm GPU power monitoring robustness by handling unsupported power fields returned by amdsmi_get_power_info() on newer hardware (e.g., MI300X) and normalizing power_limit units so monitoring doesn’t crash and reports correct watts.
Changes:
- Update ROCm
get_device_power()to fall back fromaverage_socket_powertocurrent_socket_powerwhen the former is unsupported ('N/A'). - Update ROCm
get_device_power_limit()to treatpower_limitas microwatts (when applicable) and convert to watts.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| power = power_measure.get('average_socket_power') | ||
| if not isinstance(power, (int, float)): | ||
| power = power_measure.get('current_socket_power') | ||
| if not isinstance(power, (int, float)): | ||
| return None |
There was a problem hiding this comment.
The new numeric validation (isinstance(..., (int, float))) is stricter than the previous int(...) cast and will reject many valid numeric types (e.g., numpy scalars, ctypes ints) even though int(value) would work. Consider using numbers.Real/numbers.Number for the type check, or just attempt int(power) in a small try/except and only fall back / return None on conversion failure.
| power = power_measure.get('average_socket_power') | ||
| if not isinstance(power, (int, float)): | ||
| power = power_measure.get('current_socket_power') | ||
| if not isinstance(power, (int, float)): | ||
| return None | ||
| return int(power) |
There was a problem hiding this comment.
Using dict.get() here changes behavior vs the prior code: if the expected key is missing, the function now returns None without logging a warning (because no exception is raised). If missing keys indicate an amdsmi schema/version mismatch, consider explicitly checking for key presence / unexpected values and logging a warning before returning None so monitoring failures are diagnosable.
| # amdsmi returns power_limit in microwatts (e.g. 750000000 for 750W), convert to watts. | ||
| if power_limit > 100000: | ||
| power_limit = power_limit // 1000000 |
There was a problem hiding this comment.
The microwatts-to-watts conversion heuristic uses an unexplained magic threshold (> 100000). This is hard to reason about and could silently mis-handle other unit conventions. Prefer a named constant (e.g., MICROWATTS_PER_WATT = 1_000_000) and a clearer conversion rule (always convert if amdsmi is known to return µW, or detect by comparing against MICROWATTS_PER_WATT).
| # amdsmi sets fields to 'N/A' when the hardware reports 0xFFFF (unsupported). | ||
| # On MI300X, average_socket_power is unsupported, so fall back to current_socket_power. | ||
| power = power_measure.get('average_socket_power') | ||
| if not isinstance(power, (int, float)): | ||
| power = power_measure.get('current_socket_power') | ||
| if not isinstance(power, (int, float)): | ||
| return None | ||
| return int(power) |
There was a problem hiding this comment.
This new fallback behavior for ROCm power fields (handling 'N/A' and falling back to current_socket_power) is currently untested. Please add a unit test that mocks amdsmi_get_power_info to cover: (1) average_socket_power='N/A' falling back, and (2) both fields unsupported returning None.
| power_limit = power_measure.get('power_limit') | ||
| if not isinstance(power_limit, (int, float)): | ||
| return None | ||
| # amdsmi returns power_limit in microwatts (e.g. 750000000 for 750W), convert to watts. | ||
| if power_limit > 100000: | ||
| power_limit = power_limit // 1000000 | ||
| return int(power_limit) |
There was a problem hiding this comment.
The µW→W normalization for power_limit should be covered by a unit test (mock amdsmi_get_power_info) to ensure values like 750000000 are converted to 750 and non-numeric/'N/A' values return None.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #794 +/- ##
==========================================
- Coverage 85.69% 85.58% -0.11%
==========================================
Files 103 103
Lines 7890 7900 +10
==========================================
Hits 6761 6761
- Misses 1129 1139 +10
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Description
GPU monitor crashes with invalid literal for int() with base 10: 'N/A' — amdsmi_get_power_info() returns 'N/A' for average_socket_power on MI300X (hardware reports 0xFFFF for unsupported fields). The
code called int() on this value without validation. Additionally, power_limit is returned in microwatts but was treated as watts.
Changes
get_device_power(): Fall back to current_socket_power when average_socket_power is 'N/A'. get_device_power_limit(): Convert microwatts to watts when needed.