Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PMON Voltage/Current Sensor Monitoring Enhancement #1394

Merged
merged 13 commits into from
Feb 1, 2024

Conversation

bmridul
Copy link
Contributor

@bmridul bmridul commented Jun 28, 2023

HLD for Voltage and Current sensor monitoring.

Related PRs:

PR Description Status
sonic-net/sonic-utilities#2941 Sensormon CLIs Merged
sonic-net/sonic-platform-common#393 Sensors Base Classes Merged
sonic-net/sonic-platform-daemons#393 Sensormon Daemon Merged
sonic-net/sonic-buildimage#16089 Add Sensormon in PMON Merged
sonic-net/sonic-buildimage#16646 System Health support Open
sonic-net/sonic-snmpagent#295 SNMP support Open
sonic-net/sonic-mgmt#10736 SONiC Mgmt test cases Open
sonic-net/sonic-platform-common#426 File system bases sensors Merged

@prgeor
Copy link
Contributor

prgeor commented Jul 11, 2023

@bmridul why not use the existing lm-sensors in linux which provides much more detailed output ? We have sensord running in PMON which can monitor voltage/current. We also syslog alerts for high temperature and current.

root@3POs-8101-H11-18:~# docker exec pmon service sensord status
sensord is running.
root@3POs-8101-H11-18:~#

doc/pmon/pmon-sensormon.md Outdated Show resolved Hide resolved
@bmridul
Copy link
Contributor Author

bmridul commented Jul 14, 2023

@bmridul why not use the existing lm-sensors in linux which provides much more detailed output ? We have sensord running in PMON which can monitor voltage/current. We also syslog alerts for high temperature and current.

root@3POs-8101-H11-18:~# docker exec pmon service sensord status
sensord is running.
root@3POs-8101-H11-18:~#

Hi @prgeor,

Please note that the solution proposed is not mutually exclusive to Sensord/Lmsensors. The solution proposed here can be used alongside or without Sensord.

There are 2 main reasons we are proposing this.

  1. Sensord provides monitoring of sensors which are supported by the linux hwmon infrastructure. The list of devices supported are mentioned here : https://hwmon.wiki.kernel.org/device_support_status

However, the issue is that there are a number of devices which report voltage/current measurements that are not supported by hwmon.

An example is device LTC2497 which can report voltage measurements.

The open source driver available for this device supports a different linux infrastructure – Industrial I/O
https://www.kernel.org/doc/html/v4.12/driver-api/iio/intro.html

The driver code is here : https://github.com/torvalds/linux/blob/master/drivers/iio/adc/ltc2497.c

This device will not be monitored by hwmon unless we patch its standard kernel driver which is not something we would want to do.

Sensormon can cover this gap and provide one CLI/rpcoess to view/monitor all the voltage and current sensors.

  1. Even when hwmon does support the devices, we observed that the fault reporting only works when the device indicates a fault. A user supplied threshold would not generate a minor or major alarm. Sensormon can provide this support and alert if there is a failure even when the device may not have the capability to generate an alarm.

@prgeor
Copy link
Contributor

prgeor commented Jul 14, 2023

@bmridul why not use the existing lm-sensors in linux which provides much more detailed output ? We have sensord running in PMON which can monitor voltage/current. We also syslog alerts for high temperature and current.

root@3POs-8101-H11-18:~# docker exec pmon service sensord status
sensord is running.
root@3POs-8101-H11-18:~#

Hi @prgeor,

Please note that the solution proposed is not mutually exclusive to Sensord/Lmsensors. The solution proposed here can be used alongside or without Sensord.

There are 2 main reasons we are proposing this.

  1. Sensord provides monitoring of sensors which are supported by the linux hwmon infrastructure. The list of devices supported are mentioned here : https://hwmon.wiki.kernel.org/device_support_status

However, the issue is that there are a number of devices which report voltage/current measurements that are not supported by hwmon.

An example is device LTC2497 which can report voltage measurements.

The open source driver available for this device supports a different linux infrastructure – Industrial I/O https://www.kernel.org/doc/html/v4.12/driver-api/iio/intro.html

The driver code is here : https://github.com/torvalds/linux/blob/master/drivers/iio/adc/ltc2497.c

This device will not be monitored by hwmon unless we patch its standard kernel driver which is not something we would want to do.

Sensormon can cover this gap and provide one CLI/rpcoess to view/monitor all the voltage and current sensors.

  1. Even when hwmon does support the devices, we observed that the fault reporting only works when the device indicates a fault. A user supplied threshold would not generate a minor or major alarm. Sensormon can provide this support and alert if there is a failure even when the device may not have the capability to generate an alarm.

@bmridul I am still not quite convinced the use of sensormon, since devices using IIO is mostly used in consumer products like laptop, mobile phones. The operating range of the chip is best known to the chip manufacturer and hence its better to depend upon the chip based alarm rather than platform integrator arriving at the customized threshold to raise alarms. As you rightly pointed out, hwmon already includes almost all sensors barring few exceptions, so I would prefer if you make effort to include these few exception devices included in hwmon support.

Nevertheless, please capture the reasons in the HLD for sensormon.

@bmridul
Copy link
Contributor Author

bmridul commented Jul 17, 2023

@bmridul why not use the existing lm-sensors in linux which provides much more detailed output ? We have sensord running in PMON which can monitor voltage/current. We also syslog alerts for high temperature and current.

root@3POs-8101-H11-18:~# docker exec pmon service sensord status
sensord is running.
root@3POs-8101-H11-18:~#

Hi @prgeor,
Please note that the solution proposed is not mutually exclusive to Sensord/Lmsensors. The solution proposed here can be used alongside or without Sensord.
There are 2 main reasons we are proposing this.

  1. Sensord provides monitoring of sensors which are supported by the linux hwmon infrastructure. The list of devices supported are mentioned here : https://hwmon.wiki.kernel.org/device_support_status

However, the issue is that there are a number of devices which report voltage/current measurements that are not supported by hwmon.
An example is device LTC2497 which can report voltage measurements.
The open source driver available for this device supports a different linux infrastructure – Industrial I/O https://www.kernel.org/doc/html/v4.12/driver-api/iio/intro.html
The driver code is here : https://github.com/torvalds/linux/blob/master/drivers/iio/adc/ltc2497.c
This device will not be monitored by hwmon unless we patch its standard kernel driver which is not something we would want to do.
Sensormon can cover this gap and provide one CLI/rpcoess to view/monitor all the voltage and current sensors.

  1. Even when hwmon does support the devices, we observed that the fault reporting only works when the device indicates a fault. A user supplied threshold would not generate a minor or major alarm. Sensormon can provide this support and alert if there is a failure even when the device may not have the capability to generate an alarm.

@bmridul I am still not quite convinced the use of sensormon, since devices using IIO is mostly used in consumer products like laptop, mobile phones. The operating range of the chip is best known to the chip manufacturer and hence its better to depend upon the chip based alarm rather than platform integrator arriving at the customized threshold to raise alarms. As you rightly pointed out, hwmon already includes almost all sensors barring few exceptions, so I would prefer if you make effort to include these few exception devices included in hwmon support.

Nevertheless, please capture the reasons in the HLD for sensormon.

@prgeor , Thanks for your comments. I will update the HLD with more details. Few responses

  1. The component vendors might push a driver based on initial application of the device. However the components can be used in a variety of applications once generally available.
  2. Many devices do not provide an alarm function. If the device has hardware logic for monitoring the voltage/current, it might provide an alarm, but many devices do not.
  3. Even though the device might be in its max/min limits, it might have an adverse affect on the components it is delivering the voltage to. So the sensor thresholds should be adjusted from a board design perspective.

The uber point is that the sensord may not provide complete coverage. If Linux Sensord is good enough for a platform, it can choose not to use this infra. We can discuss further in community meeting.

@jeff-yin
Copy link

jeff-yin commented Jul 19, 2023

As discussed during the community call today -- please explore integration with SONiC PDDF (https://github.com/sonic-net/SONiC/blob/master/doc/platform/brcm_pdk_pddf.md) so that platform vendors can easily integrate support for voltage/current sensors using their PDDF JSON files.

Copy link
Contributor

@Junchao-Mellanox Junchao-Mellanox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few general comments:

  1. How is the daemon connect to system health service?
  2. How is the daemon connect to snmp?
  3. How to build the sensor hierarchy relation? For example, should PSU voltage sensor object under PSU object or under chassis object?
  4. There is already logic in PSUD to handle PSU voltage threshold, do you plan to remove it?
  5. How to handle hot swappable sensors? For example, PSU sensor, module sensor?

@bmridul
Copy link
Contributor Author

bmridul commented Sep 6, 2023

As discussed during the community call today -- please explore integration with SONiC PDDF (https://github.com/sonic-net/SONiC/blob/master/doc/platform/brcm_pdk_pddf.md) so that platform vendors can easily integrate support for voltage/current sensors using their PDDF JSON files.

I looked briefly into the PDDF documentation and some code. It seems the PDDF support for voltage and current sensors can be added in a similar way as as other platform components are supported. E,g, The code below is implementing the platform support for thermal sensors.

https://github.com/sonic-net/sonic-buildimage/blob/204579a0ccfe80c1f2afea2bda0574528faecb91/platform/pddf/platform-api-pddf-base/sonic_platform_pddf_base/pddf_thermal.py

Voltage and current sensors support can be similarly added.

abdosi pushed a commit to sonic-net/sonic-utilities that referenced this pull request Sep 7, 2023
Added support for voltage and current sensor monitoring CLIs as mentioned in the feature HLD for PMON Voltage/Current Sensor Monitoring Enhancement.
sonic-net/SONiC#1394
* Addressed review comments

* Fix review comment

* UT file

* Fixed dependency for running unit test on platform common changes

* Fixed standalone unit test failure

* Addressed review comment
abdosi pushed a commit to sonic-net/sonic-platform-daemons that referenced this pull request Sep 7, 2023
Sensormond daemon is introduced. It collects the voltage and current sensor information from the platform and populates in the StateDB. The sensor data is available to view using CLIs. Sensormond raises syslogs when the sensors report measurement outside the thresholds.

sonic-net/SONiC#1394
@Junchao-Mellanox
Copy link
Contributor

A few general comments:

  1. How is the daemon connect to system health service?
  2. How is the daemon connect to snmp?
  3. How to build the sensor hierarchy relation? For example, should PSU voltage sensor object under PSU object or under chassis object?
  4. There is already logic in PSUD to handle PSU voltage threshold, do you plan to remove it?
  5. How to handle hot swappable sensors? For example, PSU sensor, module sensor?

Hi @bmridul @abdosi , how about these comments?

@bmridul
Copy link
Contributor Author

bmridul commented Sep 8, 2023

A few general comments:

  1. How is the daemon connect to system health service?

It would be good to show the out of out of range sensor readings in system health o/p. Currently out of range temperature sensors are also not shown in system health. We should add those as well.

I have added PR for system health integration.

@bmridul
Copy link
Contributor Author

bmridul commented Sep 8, 2023

A few general comments:
2. How is the daemon connect to snmp?

SNMP support will need to be added to represent the voltage/current sensors in Entity MIB. We will need to address this with changes in rfc2737 implmentation.

@bmridul
Copy link
Contributor Author

bmridul commented Oct 3, 2023

A few general comments:
2. How is the daemon connect to snmp?

SNMP support will need to be added to represent the voltage/current sensors in Entity MIB. We will need to address this with changes in rfc2737 implmentation.

SNMP support PR added.

@bmridul
Copy link
Contributor Author

bmridul commented Oct 3, 2023

@Junchao-Mellanox @jeff-yin @prgeor , Pls check if ur comments have been addressed.

@jeff-yin
Copy link

jeff-yin commented Oct 3, 2023

@Junchao-Mellanox @jeff-yin @prgeor , Pls check if ur comments have been addressed.

I didn't really see any updates with regards to PDDF specifically. If it's not going to be covered, please indicate that in the HLD and provide a path for integration in the future.

@bmridul
Copy link
Contributor Author

bmridul commented Oct 3, 2023

@Junchao-Mellanox @jeff-yin @prgeor , Pls check if ur comments have been addressed.

I didn't really see any updates with regards to PDDF specifically. If it's not going to be covered, please indicate that in the HLD and provide a path for integration in the future.

Hi Jeff,

I had replied to your comment above - #1394 (comment)

Currently our platforms donot support PDDF infra so I am not sure how to test PDDF integration. However it seems straightforward to me. I can open a git issue for this and future integration can be tracked that way. LMK.

JunhongMao pushed a commit to JunhongMao/sonic-utilities that referenced this pull request Oct 4, 2023
Added support for voltage and current sensor monitoring CLIs as mentioned in the feature HLD for PMON Voltage/Current Sensor Monitoring Enhancement.
sonic-net/SONiC#1394
* Addressed review comments

* Fix review comment

* UT file

* Fixed dependency for running unit test on platform common changes

* Fixed standalone unit test failure

* Addressed review comment
@jeff-yin
Copy link

jeff-yin commented Oct 4, 2023

@Junchao-Mellanox @jeff-yin @prgeor , Pls check if ur comments have been addressed.

I didn't really see any updates with regards to PDDF specifically. If it's not going to be covered, please indicate that in the HLD and provide a path for integration in the future.

Hi Jeff,

I had replied to your comment above - #1394 (comment)

Currently our platforms donot support PDDF infra so I am not sure how to test PDDF integration. However it seems straightforward to me. I can open a git issue for this and future integration can be tracked that way. LMK.

Sure, that's fine. Please do create the issue to track it.

jeff-yin
jeff-yin previously approved these changes Oct 4, 2023
@prgeor prgeor merged commit e000354 into sonic-net:master Feb 1, 2024
1 check passed
rlhui pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Feb 24, 2024
Enable Sensormon daemon in PMON container.
Pls see HLD : sonic-net/SONiC#1394
sonic-otn pushed a commit to Weitang-Zheng/sonic-buildimage that referenced this pull request Mar 11, 2024
Enable Sensormon daemon in PMON container.
Pls see HLD : sonic-net/SONiC#1394
saksarav-nokia pushed a commit to saksarav-nokia/sonic-buildimage that referenced this pull request Mar 12, 2024
Enable Sensormon daemon in PMON container.
Pls see HLD : sonic-net/SONiC#1394
a114j0y pushed a commit to a114j0y/SONiC that referenced this pull request Mar 12, 2024
* Initial draft for Voltage monitor

* Fixed a link.

* Fix more links

* Fix formatting

* Added SensorMon Changes

* Changed daemon name

* Added internal review comments.

* Addressed review comment.

* Review comments

* Addressed review comment

* Addressed review comments

* Added support for file based sensor config.

* Fixed formatting
@zhangyanzhao
Copy link
Collaborator

HLD is merged but code PRs are still open, move to backlog

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: MovedToBacklog
Development

Successfully merging this pull request may close these issues.

6 participants