Trixie Status

jhickeyNRC edited this page Mar 27, 2024 · 73 revisions

Current Trixie Operational Status

Upcoming Planned Downtime

Friday, April 12, 2024 - This downtime is scheduled to start at 2:30PM EDT on Friday April 12th and will conclude on the evening of Sunday the 14th. A notice will be sent out when the downtime is completed, and the cluster is back online. If you have any questions or concerns about this shutdown, please let us know. Thank you for your patience. Research Platform Support

Current Issues / Outages

None at this time

Past Events / Incidents

  • [RESOLVED] - Friday, February 2, 2024 - This downtime is scheduled to start at 2:00PM EST on February 2nd and will conclude on the evening of the 3rd. A notice will be sent out when the downtime is completed, and the cluster is back online. If you have any questions or concerns about this shutdown, please let us know. Thank you for your patience. Research Platform Support
  • [RESOLVED] - Friday, January 19, 2024 - This downtime is scheduled to start at 2:30 PM EST on Friday January 19th and will conclude on the evening of the 20th. A notice will be sent out when the downtime is completed, and the cluster is back online. If you have any questions or concerns about this shutdown, please let us know. Thank you for your patience. Research Platform Support
  • [CANCELLED] - Thursday, December 14th, 2023 - This downtime is scheduled to start at 2:30 PM EST on Thursday the 14th of December and will conclude on the morning of the 15th. A notice will be sent out when the downtime is completed, and the cluster is back online. If you have any questions or concerns about this shutdown, please let us know. Thank you for your patience. Research Platform Support
  • [RESOLVED] - Monday, August 28th, 2023 - KITS will be performing maintenance on the Trixie HPC cluster on Tuesday, August 29th at 6AM EDT. The cluster should be brought back online around 12PM. Jobs whose run time would overlap the start of the maintenance period will stay in the queue and run after the maintenance. A notice will be sent out when the downtime is completed and the cluster is back online. If you have any questions or concerns please let us know. Research Platform Support: rps-spr@nrc-cnrc.gc.ca
  • [RESOLVED] - Wednesday July 19, 2023 - Thursday July 20, 2023 - Please note that RPPM will be shutting down regular power to building M-55 for electrical emergency repairs on July 20th from 6 to 7AM EDT. KITS will therefore be shutting down the Trixie HPC from July 19th at 5 PM to soon after 7AM on the 20th. A notice will be sent out when the downtime is completed and the cluster is back online. If you have any questions or concerns please let us know. IT Operations: ITOperations-OperationsTI@nrc-cnrc.gc.ca
  • [RESOLVED] - Monday June 19, 2023 - Wednesday June 21, 2023 - Please note that in support of the new generator installation taking place at building M-55, RPPM will be shutting down power to building M-55, taking place on two consecutive evenings. Trixie HPC will be unavailable during the scheduled period of Monday June 19th 1:00 pm EDT to morning of Wednesday June 21st. A notice will be sent out when the downtime is completed, and the cluster is back online. If you have any questions or concerns about this maintenance or the maintenance schedule please let us know. Thank you for your patience. IT Operations: ITOperations-OperationsTI@nrc-cnrc.gc.ca
  • [RESOLVED] - Friday June 9, 2023 - Monday June 12, 2023 - A period of downtime is required for the Trixie HPC due to work on the building's electrical systems. We will also use this time to perform some routine maintenance and upgrades. This downtime is scheduled to start at 2:00 PM EDT on Friday June 9th and will conclude on the afternoon of Monday June 12th. A notice will be sent out when the downtime is completed, and the cluster is back online. If you have any questions or concerns about this maintenance or the maintenance schedule please let us know. Thank you for your patience. IT Operations: ITOperations-OperationsTI@nrc-cnrc.gc.ca
  • [RESOLVED] - Monday March 27, 2023 - A period of downtime is required for the Trixie HPC cluster to perform some routine maintenance and upgrades. This downtime is scheduled for the day of Monday March 27th. A notice will be sent out when the downtime is completed, and the cluster is back online. If you have any questions or concerns about this maintenance or the maintenance schedule please let us know. Please note that Slurm should stop accepting jobs that would run into the maintenance period. They will be held in the queue until the maintenance period has ended. Thank you for your patience. IT Operations: ITOperations-OperationsTI@nrc-cnrc.gc.ca
  • [RESOLVED] - Monday November 28, 2022 - A period of downtime is required for the Trixie HPC cluster to perform some routine maintenance and upgrades. This downtime is scheduled for the day of Monday November 28th. A notice will be sent out when the downtime is completed, and the cluster is back online. If you have any questions or concerns about this maintenance or the maintenance schedule please let us know. Thank you for your patience. IT Operations: ITOperations-OperationsTI@nrc-cnrc.gc.ca
  • [RESOLVED] - Monday May 30, 2022 - Please note that the Trixie server is currently offline - possibly due to a network issue.
  • [RESOLVED] - Monday April 11 - Wednesday April 13, 2022 (3 days) - Please note that Trixie (AI4D HPC cluster) will be unavailable from April 11-13th (3 days) due to a scheduled upgrade. The GPFS file system will be upgraded during this time. If you have any questions or concerns about this maintenance please send an email to ITOperations-OperationsTI@nrc-cnrc.gc.ca. Update - Thursday April 14, 2022: Due to complications with the upgrade of the storage array, the scheduled downtime for the AI4D-Trixie cluster has been extended. There is currently no estimate for when the cluster will return to service, but an update will be sent out as soon as there is more information. Update - Tuesday April 19, 2022: We are returning the AI4D-Trixie cluster to operational status. Unfortunately the storage array is in a degraded state and is only operating with 25% of its normal transfer capacity. Expect to see slowness from I/O-intensive operations. All of the Compute Nodes as well as the Head Node have been re-imaged. The default operating environment has changed, so expect the versions of much of the software loaded when you log in to have changed. This may cause issues with job scripts created for the previous environment. If you experience any issues please let us know (ITOperations-OperationsTI@nrc-cnrc.gc.ca).
  • [RESOLVED] - Tuesday January 25, 2022 - There appear to be several issues with the Trixie HPC that are impacting access through the Bastion Host and general performance on the headnode. KITS-ITOps is investigating and hopes to resolve the issues as soon as possible. We will provide an update when we have more information. Update - There is a technical issue with the SSC managed switch due to a recent power outage. Access from Legacy and RES should still function but external access through the Bastion Host is not working. An SSC technician is supposed to be on site tomorrow morning to investigate.
  • [RESOLVED] - Wednesday December 15, 2021 - External access to Trixie is not available. It appears that there is an error with the SSL cert for the external LoginTC URL. When trying to use the LoginTC app on your phone to accept a login request a certificate error appears and the request is never received. Hopefully the issue will be resolved quickly, but access could be offline for a day or two.
  • [RESOLVED] - Wednesday December 15, 2021 - Trixie is currently unavailable and the issue is being investigated.
  • [RESOLVED] - Thursday Dec 2, 2021 - SSH connections to Trixie via the external bastion host are being blocked. Internal NRC network connectivity and Trixie operations continue normally. Investigation of root cause underway.
  • [RESOLVED - downtime completed successfully] - Monday August 23, 2021 - There will be a maintenance period for the Trixie AI4D Cluster on Monday August 23rd starting at 8:00 am EDT. Access to the cluster will not be possible during the maintenance. The entire day will be reserved for the maintenance but current estimates suggest it will be returned to service by noon. Maintenance will involve the replacement of a power distribution unit in one of the racks as well as configuration changes on the primary head node. Every effort will be made to preserve the job queue during the maintenance.
  • [RESOLVED - downtime completed successfully] - We are planning a period of scheduled downtime for the Trixie-AI4D cluster on Monday June 28th from 8:00am to 4:00pm EDT. This will allow a few maintenance tasks to be performed that would interrupt service. These tasks include modifying the partition structure on the primary head node as well as some security patching. - It has become necessary to add a firmware update to this maintenance window for the Mellanox switches. This will cause the GPFS file system to become unavailable, forcing us to shut down the cluster entirely. All jobs in the queue at the start of the maintenance period will likely be lost. - Due to unforeseen complications the maintenance period must be extended.
  • [RESOLVED - downtime completed successfully] - A period of downtime for the Trixie (AI4D) cluster is being scheduled for Monday, May 17th from 8:00 am - 6:00 pm EDT. Due to a hardware issue on the storage array there will need to be a scheduled maintenance period as per the vendor's recommendation. Please note that the nature of the maintenance will require all jobs in the queue to be terminated at the start of the maintenance window.
  • [RESOLVED - nodes back in main queue] - Compute nodes cn110 and cn125 have been taken out of the main queue to troubleshoot GPU issues
  • [RESOLVED - downtime completed successfully] - A period of downtime for the Trixie (AI4D) cluster is being scheduled for Monday, April 19th from 9:00am-3:00pm. Due to the nature of the maintenance all jobs in the queue will be terminated at the start of the maintenance window.
  • [RESOLVED] - December 17 - We are currently experiencing issues with Trixie head node performance as detailed in #35. Investigation pending.
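
Several of the notices above mention that jobs whose run time would overlap a maintenance period are held in the queue until the maintenance has ended. The following is a rough sketch of that rule, useful for estimating whether a job is likely to start before a downtime. It is not how Slurm is implemented, and the function name, the submission time, and the job time limits are hypothetical values chosen for illustration; only the maintenance start (Tuesday, August 29th, 2023 at 6AM) comes from the notices.

```python
from datetime import datetime, timedelta


def can_start_before_maintenance(start_time, time_limit, maintenance_start):
    """Return True if a job started at start_time with the given time limit
    would finish no later than the start of the maintenance period.

    Illustrative helper only; the real scheduler also considers node
    availability, priorities, and backfill.
    """
    return start_time + time_limit <= maintenance_start


# Maintenance start taken from the August 29th, 2023 notice (6AM local time).
maintenance_start = datetime(2023, 8, 29, 6, 0)

# Hypothetical: a job that could begin running the evening before.
start_time = datetime(2023, 8, 28, 20, 0)

for hours in (4, 12):
    limit = timedelta(hours=hours)
    if can_start_before_maintenance(start_time, limit, maintenance_start):
        print(f"{hours}h job: fits before the maintenance")
    else:
        print(f"{hours}h job: would be held until the maintenance ends")
```

Under this rule, requesting a shorter time limit where the work allows it gives a job a better chance of running before a planned downtime.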

Notes