Enhancement - Fix power monitoring for ROCm 6.3#794

Open

polarG wants to merge 1 commit intomainfrom

dev/hongtaohang/fix-amd-device-manager

Contributor

polarG commented Mar 25, 2026

Description
GPU monitor crashes with invalid literal for int() with base 10: 'N/A' — amdsmi_get_power_info() returns 'N/A' for average_socket_power on MI300X (hardware reports 0xFFFF for unsupported fields). The
code called int() on this value without validation. Additionally, power_limit is returned in microwatts but was treated as watts.

Changes
get_device_power(): Fall back to current_socket_power when average_socket_power is 'N/A'. get_device_power_limit(): Convert microwatts to watts when needed.


          Fix amd device manager.

85377ad

polarG requested a review from a team as a code owner

March 25, 2026 23:02

Copilot AI review requested due to automatic review settings

March 25, 2026 23:02

polarG self-assigned this

polarG added enhancement ROCm labels

Copilot started reviewing on behalf of polarG

March 25, 2026 23:03

Copilot AI reviewed

View reviewed changes

Contributor

Copilot AI left a comment

Pull request overview

Improves ROCm GPU power monitoring robustness by handling unsupported power fields returned by amdsmi_get_power_info() on newer hardware (e.g., MI300X) and normalizing power_limit units so monitoring doesn’t crash and reports correct watts.

Changes:

Update ROCm get_device_power() to fall back from average_socket_power to current_socket_power when the former is unsupported ('N/A').
Update ROCm get_device_power_limit() to treat power_limit as microwatts (when applicable) and convert to watts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

superbench/common/utils/device_manager.py

Comment on lines +394 to +398

+                          power = power_measure.get('average_socket_power')
+                          if not isinstance(power, (int, float)):
+                              power = power_measure.get('current_socket_power')
+                          if not isinstance(power, (int, float)):
+                              return None

Copilot AI Mar 25, 2026

The new numeric validation (isinstance(..., (int, float))) is stricter than the previous int(...) cast and will reject many valid numeric types (e.g., numpy scalars, ctypes ints) even though int(value) would work. Consider using numbers.Real/numbers.Number for the type check, or just attempt int(power) in a small try/except and only fall back / return None on conversion failure.

Copilot uses AI. Check for mistakes.

superbench/common/utils/device_manager.py

Comment on lines +394 to +399

+                          power = power_measure.get('average_socket_power')
+                          if not isinstance(power, (int, float)):
+                              power = power_measure.get('current_socket_power')
+                          if not isinstance(power, (int, float)):
+                              return None
+                          return int(power)

Copilot AI Mar 25, 2026

Using dict.get() here changes behavior vs the prior code: if the expected key is missing, the function now returns None without logging a warning (because no exception is raised). If missing keys indicate an amdsmi schema/version mismatch, consider explicitly checking for key presence / unexpected values and logging a warning before returning None so monitoring failures are diagnosable.

Copilot uses AI. Check for mistakes.

superbench/common/utils/device_manager.py

Comment on lines +418 to +420

+                          # amdsmi returns power_limit in microwatts (e.g. 750000000 for 750W), convert to watts.
+                          if power_limit > 100000:
+                              power_limit = power_limit // 1000000

Copilot AI Mar 25, 2026

The microwatts-to-watts conversion heuristic uses an unexplained magic threshold (> 100000). This is hard to reason about and could silently mis-handle other unit conventions. Prefer a named constant (e.g., MICROWATTS_PER_WATT = 1_000_000) and a clearer conversion rule (always convert if amdsmi is known to return µW, or detect by comparing against MICROWATTS_PER_WATT).

Copilot uses AI. Check for mistakes.

superbench/common/utils/device_manager.py

Comment on lines +392 to +399

+                          # amdsmi sets fields to 'N/A' when the hardware reports 0xFFFF (unsupported).
+                          # On MI300X, average_socket_power is unsupported, so fall back to current_socket_power.
+                          power = power_measure.get('average_socket_power')
+                          if not isinstance(power, (int, float)):
+                              power = power_measure.get('current_socket_power')
+                          if not isinstance(power, (int, float)):
+                              return None
+                          return int(power)

Copilot AI Mar 25, 2026

This new fallback behavior for ROCm power fields (handling 'N/A' and falling back to current_socket_power) is currently untested. Please add a unit test that mocks amdsmi_get_power_info to cover: (1) average_socket_power='N/A' falling back, and (2) both fields unsupported returning None.

Copilot uses AI. Check for mistakes.

superbench/common/utils/device_manager.py

Comment on lines +415 to +421

+                          power_limit = power_measure.get('power_limit')
+                          if not isinstance(power_limit, (int, float)):
+                              return None
+                          # amdsmi returns power_limit in microwatts (e.g. 750000000 for 750W), convert to watts.
+                          if power_limit > 100000:
+                              power_limit = power_limit // 1000000
+                          return int(power_limit)

Copilot AI Mar 25, 2026

The µW→W normalization for power_limit should be covered by a unit test (mock amdsmi_get_power_info) to ensure values like 750000000 are converted to 750 and non-numeric/'N/A' values return None.

Copilot uses AI. Check for mistakes.

codecov bot commented Mar 25, 2026 •

edited

Loading

Codecov Report

❌ Patch coverage is 0% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.58%. Comparing base (036c471) to head (85377ad).

Files with missing lines	Patch %	Lines
superbench/common/utils/device_manager.py	0.00%	12 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #794      +/-   ##
==========================================
- Coverage   85.69%   85.58%   -0.11%     
==========================================
  Files         103      103              
  Lines        7890     7900      +10     
==========================================
  Hits         6761     6761              
- Misses       1129     1139      +10

Flag	Coverage Δ
cpu-python3.10-unit-test	`70.33% <0.00%> (-0.10%)`	⬇️
cpu-python3.7-unit-test	`69.76% <0.00%> (-0.09%)`	⬇️
cuda-unit-test	`83.49% <0.00%> (-0.11%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement ROCm