Skip to content

Enhancement - Fix power monitoring for ROCm 6.3#794

Open
polarG wants to merge 1 commit intomainfrom
dev/hongtaohang/fix-amd-device-manager
Open

Enhancement - Fix power monitoring for ROCm 6.3#794
polarG wants to merge 1 commit intomainfrom
dev/hongtaohang/fix-amd-device-manager

Conversation

@polarG
Copy link
Contributor

@polarG polarG commented Mar 25, 2026

Description
GPU monitor crashes with invalid literal for int() with base 10: 'N/A' — amdsmi_get_power_info() returns 'N/A' for average_socket_power on MI300X (hardware reports 0xFFFF for unsupported fields). The
code called int() on this value without validation. Additionally, power_limit is returned in microwatts but was treated as watts.

Changes
get_device_power(): Fall back to current_socket_power when average_socket_power is 'N/A'. get_device_power_limit(): Convert microwatts to watts when needed.

@polarG polarG requested a review from a team as a code owner March 25, 2026 23:02
Copilot AI review requested due to automatic review settings March 25, 2026 23:02
@polarG polarG self-assigned this Mar 25, 2026
@polarG polarG added enhancement New feature or request ROCm labels Mar 25, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves ROCm GPU power monitoring robustness by handling unsupported power fields returned by amdsmi_get_power_info() on newer hardware (e.g., MI300X) and normalizing power_limit units so monitoring doesn’t crash and reports correct watts.

Changes:

  • Update ROCm get_device_power() to fall back from average_socket_power to current_socket_power when the former is unsupported ('N/A').
  • Update ROCm get_device_power_limit() to treat power_limit as microwatts (when applicable) and convert to watts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +394 to +398
power = power_measure.get('average_socket_power')
if not isinstance(power, (int, float)):
power = power_measure.get('current_socket_power')
if not isinstance(power, (int, float)):
return None
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new numeric validation (isinstance(..., (int, float))) is stricter than the previous int(...) cast and will reject many valid numeric types (e.g., numpy scalars, ctypes ints) even though int(value) would work. Consider using numbers.Real/numbers.Number for the type check, or just attempt int(power) in a small try/except and only fall back / return None on conversion failure.

Copilot uses AI. Check for mistakes.
Comment on lines +394 to +399
power = power_measure.get('average_socket_power')
if not isinstance(power, (int, float)):
power = power_measure.get('current_socket_power')
if not isinstance(power, (int, float)):
return None
return int(power)
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using dict.get() here changes behavior vs the prior code: if the expected key is missing, the function now returns None without logging a warning (because no exception is raised). If missing keys indicate an amdsmi schema/version mismatch, consider explicitly checking for key presence / unexpected values and logging a warning before returning None so monitoring failures are diagnosable.

Copilot uses AI. Check for mistakes.
Comment on lines +418 to +420
# amdsmi returns power_limit in microwatts (e.g. 750000000 for 750W), convert to watts.
if power_limit > 100000:
power_limit = power_limit // 1000000
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The microwatts-to-watts conversion heuristic uses an unexplained magic threshold (> 100000). This is hard to reason about and could silently mis-handle other unit conventions. Prefer a named constant (e.g., MICROWATTS_PER_WATT = 1_000_000) and a clearer conversion rule (always convert if amdsmi is known to return µW, or detect by comparing against MICROWATTS_PER_WATT).

Copilot uses AI. Check for mistakes.
Comment on lines +392 to +399
# amdsmi sets fields to 'N/A' when the hardware reports 0xFFFF (unsupported).
# On MI300X, average_socket_power is unsupported, so fall back to current_socket_power.
power = power_measure.get('average_socket_power')
if not isinstance(power, (int, float)):
power = power_measure.get('current_socket_power')
if not isinstance(power, (int, float)):
return None
return int(power)
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new fallback behavior for ROCm power fields (handling 'N/A' and falling back to current_socket_power) is currently untested. Please add a unit test that mocks amdsmi_get_power_info to cover: (1) average_socket_power='N/A' falling back, and (2) both fields unsupported returning None.

Copilot uses AI. Check for mistakes.
Comment on lines +415 to +421
power_limit = power_measure.get('power_limit')
if not isinstance(power_limit, (int, float)):
return None
# amdsmi returns power_limit in microwatts (e.g. 750000000 for 750W), convert to watts.
if power_limit > 100000:
power_limit = power_limit // 1000000
return int(power_limit)
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The µW→W normalization for power_limit should be covered by a unit test (mock amdsmi_get_power_info) to ensure values like 750000000 are converted to 750 and non-numeric/'N/A' values return None.

Copilot uses AI. Check for mistakes.
@codecov
Copy link

codecov bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 0% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.58%. Comparing base (036c471) to head (85377ad).

Files with missing lines Patch % Lines
superbench/common/utils/device_manager.py 0.00% 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #794      +/-   ##
==========================================
- Coverage   85.69%   85.58%   -0.11%     
==========================================
  Files         103      103              
  Lines        7890     7900      +10     
==========================================
  Hits         6761     6761              
- Misses       1129     1139      +10     
Flag Coverage Δ
cpu-python3.10-unit-test 70.33% <0.00%> (-0.10%) ⬇️
cpu-python3.7-unit-test 69.76% <0.00%> (-0.09%) ⬇️
cuda-unit-test 83.49% <0.00%> (-0.11%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request ROCm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants