Alarm Service
The Alarm Service provides an API to manage alarms in the TMT software system. The service uses Redis to store Alarm data, including the alarm status and associated metadata. Alarm “keys” are used to access information about an alarm.
Dependencies
The Alarm Service comes bundled with the Framework, so no additional dependency needs to be added to your build.sbt file if you are using the Framework. To use the Alarm Service without the Framework, add this to your build.sbt file:
- sbt
-
libraryDependencies += "com.github.tmtsoftware.csw" %% "csw-alarm-client" % "5.0.1"
API Flavors
There are two APIs provided in the Alarm Service: a client API, and an administrative (admin) API. The client API is the API used by component developers to set the severity of an alarm. This is the only functionality needed by component developers. As per TMT policy, the severity of an alarm must be set periodically (within some time limit) in order to maintain the integrity of the alarm status. If an alarm severity is not refreshed within the time limit, currently set at 9 seconds, the severity is set to Disconnected
by the Alarm Service, which indicates to the operator that there is some problem with the component’s ability to evaluate the alarm status.
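The refresh rule above can be illustrated with a small stand-alone model. This is only a sketch: the Severity types, the observedSeverity function, and the choice to treat a value set exactly 9 seconds ago as still valid are assumptions made for illustration, not the Alarm Service implementation.

```scala
object SeverityExpiry {
  // Hypothetical stand-ins for the service's severity types.
  sealed trait Severity
  case object Okay          extends Severity
  case object Warning       extends Severity
  case object Major         extends Severity
  case object Critical      extends Severity
  case object Indeterminate extends Severity
  case object Disconnected  extends Severity

  // Time limit quoted in the text above (9 seconds).
  val RefreshLimitMillis: Long = 9000L

  // Severity as an observer would see it at `nowMillis`, given the last
  // severity the component set and the time at which it was set.
  def observedSeverity(lastSet: Severity, lastSetMillis: Long, nowMillis: Long): Severity =
    if (nowMillis - lastSetMillis > RefreshLimitMillis) Disconnected else lastSet
}
```

In this model, a component that set Major at time 0 is still seen as Major 5 seconds later, but as Disconnected once more than 9 seconds have passed without a refresh.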
The admin API provides all of the functions needed to manage the alarm store, as well as providing access to monitor alarms for use by an operator or instrument specialist. The admin API provides the ability to load alarm data into the alarm store, set the severity of an alarm, acknowledge alarms, shelve or unshelve alarms, reset a latched alarm, get the metadata/status/severity of an alarm, and get or subscribe to aggregations of severity and health of the alarm, a component’s alarms, a subsystem’s alarms, or the alarms of the whole TMT System.
A command line tool is provided as part of the Alarm Service that implements this API and provides low level control over the Alarm Service. More details about alarm CLI can be found here: CSW Alarm Client CLI application
Eventually, operators will use Graphical User Interfaces that access the admin API through a UI gateway. This will be delivered as part of the ESW HCMS package.
Since the admin API will primarily be used with the CLI and HCMS applications, it is only supported in Scala, and not Java.
To summarize, the APIs are as follows:
* client API (AlarmService): Must be used by components. Available method is: {setSeverity}
* admin API (AlarmAdminService): Expected to be used by administrators. Available methods are: {initAlarms | setSeverity | acknowledge | shelve | unshelve | reset | getMetadata | getStatus | getCurrentSeverity | getAggregatedSeverity | getAggregatedHealth | subscribeAggregatedSeverityCallback | subscribeAggregatedSeverityActorRef | subscribeAggregatedHealthCallback | subscribeAggregatedHealthActorRef}
Creating clientAPI and adminAPI
For component developers, the client API is provided as an AlarmService object in the CswContext object injected into the ComponentHandlers class provided by the framework.
If you are not using csw-framework, you can create an AlarmService using AlarmServiceFactory.
- Scala
-
source
// create alarm client using host and port of alarm server
private val clientAPI1 = new AlarmServiceFactory().makeClientApi("localhost", 5225)
// create alarm client using location service
private val clientAPI2 = new AlarmServiceFactory().makeClientApi(locationService)
// create alarm admin using host and port of alarm server
private val adminAPI1 = new AlarmServiceFactory().makeAdminApi("localhost", 5226)
// create alarm admin using location service
private val adminAPI2 = new AlarmServiceFactory().makeAdminApi(locationService)
- Java
-
source
// create alarm client using host and port of alarm server
final IAlarmService jclientAPI1 = new AlarmServiceFactory().jMakeClientApi("localhost", 5227, actorSystem);
// create alarm client using location service
IAlarmService jclientAPI2 = new AlarmServiceFactory().jMakeClientApi(jLocationService, actorSystem);
Rules and checks
- When representing a unique alarm, the alarm name or component name must not contain any of the characters * [ ] ^ - or any whitespace characters.
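A check for this rule could be sketched as follows. The object and function names are hypothetical, and rejecting empty names is an extra assumption that the rule above does not state.

```scala
object AlarmNameCheck {
  // Characters the rule above forbids in alarm and component names.
  private val forbidden: Set[Char] = Set('*', '[', ']', '^', '-')

  // True when the name is non-empty and contains no forbidden or whitespace characters.
  def isValidName(name: String): Boolean =
    name.nonEmpty && !name.exists(c => forbidden.contains(c) || c.isWhitespace)
}
```

For example, AlarmNameCheck.isValidName("tromboneAxisLowLimitAlarm") passes, while "trombone-axis" and "low limit" are rejected.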
Model Classes
- AlarmKey : Represents the unique alarm in the TMT system. It is composed of subsystem, component and alarm name.
- ComponentKey : Represents all alarms of a component. Used for getting severity or health of an entire component.
- SubsystemKey : Represents all alarms of a subsystem. Used for getting severity or health of an entire subsystem.
- GlobalKey : Represents all alarms present in the TMT system. Used for getting severity or health of an entire observatory.
- AlarmMetadata : Represents static metadata of an alarm, which will not change in its entire lifespan.
- AlarmStatus : Represents the dynamically changing data of an alarm, which changes when the severity changes or when an operator changes it manually
- AlarmSeverity : Represents severity levels that can be set by the component developer e.g. Okay, Indeterminate, Warning, Major and Critical
- FullAlarmSeverity : Represents all possible severity levels of the alarm i.e. Disconnected (cannot be set by the developer) plus other severity levels that can be set by the developer
- AlarmHealth : Represents possible health of an alarm or component or subsystem or whole TMT system
Client API
setSeverity
Sets the severity of the given alarm. The severity must be refreshed by setting it at a regular interval or it will automatically be changed to Disconnected after a specific time.
- Scala
-
source
val alarmKey = AlarmKey(Prefix(NFIRAOS, "trombone"), "tromboneAxisLowLimitAlarm")
val resultF: Future[Done] = clientAPI.setSeverity(alarmKey, Okay)
- Java
-
source
private final AlarmKey alarmKey = new AlarmKey(Prefix.apply(JSubsystem.NFIRAOS, "trombone"), "tromboneAxisLowLimitAlarm");
Future<Done> doneF = jclientAPI1.setSeverity(alarmKey, JAlarmSeverity.Okay);
- If the alarm is not refreshed within 9 seconds, its severity will be inferred as Disconnected.
- If the alarm is auto-acknowledgeable and the severity is set to Okay, the alarm will be auto-acknowledged and will not require any explicit acknowledgement by an administrator.
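Because the severity has to be refreshed before the time limit expires, a component typically needs some kind of periodic task. The following is a minimal sketch using only the JDK scheduler. SeverityRefresher and its publish parameter are hypothetical: publish stands in for a call such as clientAPI.setSeverity(alarmKey, currentSeverity), and the refresh period is left to the caller, who should pick a value comfortably below the 9 second limit.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

object SeverityRefresher {
  // Starts a background task that calls `publish` immediately and then once
  // every `periodMillis`. Returns the scheduler so the caller can shut it down.
  def start(periodMillis: Long)(publish: () => Unit): ScheduledExecutorService = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val task: Runnable = new Runnable {
      def run(): Unit =
        // An uncaught exception would cancel all later runs of a fixed-rate
        // task, so failures are reported and the loop keeps going.
        try publish()
        catch { case e: Exception => println(s"severity refresh failed: ${e.getMessage}") }
    }
    scheduler.scheduleAtFixedRate(task, 0L, periodMillis, TimeUnit.MILLISECONDS)
    scheduler
  }
}
```

A component could call SeverityRefresher.start(3000L)(() => ...) when it initializes and call shutdown() on the returned scheduler when it shuts down. The 3 second period here is an arbitrary illustrative value.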
Admin API
initAlarms
Loads the given alarm data into the alarm store, taking the alarm configuration file as input.
- Scala
-
source
val resource = "test-alarms/valid-alarms.conf"
val alarmsConfig: Config = ConfigFactory.parseResources(resource)
val result2F: Future[Done] = adminAPI.initAlarms(alarmsConfig)
Alarm configuration files are written in the HOCON format using the following fields:
- subsystem: subsystem name the alarm belongs to
- component: name of component for the alarm, matching the name in the componentInfo file (see Describing Components)
- name: name of the alarm
- description: a description of what the alarm represents
- location: physical location within observatory or instrument in which the alarm condition is occurring
- alarmType: the general category for the alarm. Must be one of the following:
- Absolute: An alarm generated when a setpoint is exceeded.
- BitPattern: An alarm generated when a pattern of digital signals matches a predetermined pattern.
- Calculated: An alarm generated from a calculated value instead of a direct process measurement.
- Deviation: An alarm generated when the difference between two analog values exceeds a limit (e.g., deviation between primary and redundant instruments or a deviation between process variable and setpoint).
- Discrepancy: An alarm generated by a mismatch between the expected state of a plant or device and its actual state (e.g., when a motor fails to start after it is commanded to the on state).
- Instrument: An alarm generated by a field device to indicate a fault (e.g., a sensor failure).
- RateChange: An alarm generated when the change in a process variable per unit time, (dPV/dt), exceeds a defined limit.
- RecipeDriven: An alarm with limits that depend on the recipe that is currently being executed.
- Safety: An alarm that is tied to and echoing an action or interlock in the subsystem’s safety controller. (Note: At TMT, the Alarm Service cannot be a primary hazard control for severe hazards).
- Statistical: An alarm generated based on statistical properties of one or more process variables.
- System: An alarm generated by the control system to indicate a fault within the system hardware, software, or components (e.g., unrecoverable communication error).
- supportedSeverities: list of non-Okay severities the alarm may become (Warning, Major, Critical). All alarms are assumed to support Okay, Disconnected, and Indeterminate.
- probableCause: a description of the likely cause of the alarm reaching each severity level
- operatorResponse: instructions or information to help the operator respond to the alarm.
- isAutoAcknowledgeable: true/false flag for whether the alarm is automatically acknowledged when its severity returns to Okay.
- isLatchable: true/false flag for whether the alarm latches at its highest severity until reset.
- activationStatus: Active or Inactive, indicating whether the alarm is currently active (and considered in aggregated severity and health calculations)
- alarms.conf
-
source
alarms: [
  {
    prefix = nfiraos.trombone
    name = tromboneAxisLowLimitAlarm
    description = "Warns when trombone axis has reached the low limit"
    location = "south side"
    alarmType = Absolute
    supportedSeverities = [Warning, Major, Critical]
    probableCause = "the trombone software has failed or the stage was driven into the low limit"
    operatorResponse = "go to the NFIRAOS engineering user interface and select the datum axis command"
    isAutoAcknowledgeable = false
    isLatchable = true
    activationStatus = Active
  },
  {
    prefix = nfiraos.trombone
    name = tromboneAxisHighLimitAlarm
    description = "Warns when trombone axis has reached the high limit"
    location = "south side"
    alarmType = Absolute
    supportedSeverities = [Warning, Major]
    probableCause = "the trombone software has failed or the stage was driven into the high limit"
    operatorResponse = "go to the NFIRAOS engineering user interface and select the datum axis command"
    isAutoAcknowledgeable = true
    isLatchable = true
    activationStatus = Active
  },
  {
    prefix = tcs.tcspk
    name = cpuExceededAlarm
    description = "This alarm is activated when the tcsPk Assembly can no longer calculate all of its pointing values in the time allocated. The CPU may lock power, or there may be pointing loops running that are not needed. Response: Check to see if pointing loops are executing that are not needed or see about a more powerful CPU."
    location = "in computer..."
    alarmType = Absolute
    supportedSeverities = [Warning, Major, Critical]
    probableCause = "too fast..."
    operatorResponse = "slow it down..."
    isAutoAcknowledgeable = true
    isLatchable = false
    activationStatus = Active
  },
  {
    prefix = lgsf.tcspkinactive
    name = cpuIdleAlarm
    description = "This alarm is activated CPU is idle"
    location = "in computer..."
    alarmType = Absolute
    supportedSeverities = [Warning, Major, Critical]
    probableCause = "too fast..."
    operatorResponse = "slow it down..."
    isAutoAcknowledgeable = true
    isLatchable = false
    activationStatus = Inactive
  }
]
acknowledge
Acknowledges the given alarm after it has been raised to a higher severity.
- Scala
-
source
val result3F: Future[Done] = adminAPI.acknowledge(alarmKey)
shelve
Shelves the given alarm. A shelved alarm is unshelved automatically at a specific time (8 AM local time by default) if it is not unshelved manually before then. The automatic unshelve time can be configured in application.conf, e.g. csw-alarm.shelve-timeout = h:m:s a.
- Scala
-
source
val result4F: Future[Done] = adminAPI.shelve(alarmKey)
Shelved alarms are still included in the aggregated severity and health calculations.
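To make the automatic unshelve time concrete, the sketch below computes when a shelved alarm would next be unshelved under the default setting. It assumes the alarm is unshelved at the next occurrence of the configured local time after shelving. UnshelveSchedule and nextUnshelve are hypothetical names, and how the service itself schedules this is not described in this section.

```scala
import java.time.{LocalDateTime, LocalTime}

object UnshelveSchedule {
  // Default automatic unshelve time stated above: 8 AM local time.
  val DefaultUnshelveAt: LocalTime = LocalTime.of(8, 0)

  // Next occurrence of `unshelveAt` strictly after `shelvedAt`.
  def nextUnshelve(shelvedAt: LocalDateTime, unshelveAt: LocalTime = DefaultUnshelveAt): LocalDateTime = {
    val sameDay = shelvedAt.toLocalDate.atTime(unshelveAt)
    if (sameDay.isAfter(shelvedAt)) sameDay else sameDay.plusDays(1)
  }
}
```

Under this assumption, an alarm shelved at 22:30 would be unshelved at 08:00 the following morning, while one shelved at 06:00 would be unshelved two hours later on the same day.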
unshelve
Unshelves the given alarm.
- Scala
-
source
val result5F: Future[Done] = adminAPI.unshelve(alarmKey)
reset
Resets the status of the given latched alarm by setting the latched severity to the current severity and the acknowledgement status to acknowledged, without changing any other properties of the alarm.
- Scala
-
source
val result6F: Future[Done] = adminAPI.reset(alarmKey)
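The interaction between latching, acknowledgement, and reset can be illustrated with a small state model. This is a simplified sketch and not the service's implementation: the Alarm case class, the four-level severity ordering, and the update rules inside setSeverity are assumptions chosen to match the descriptions in this document.

```scala
object LatchModel {
  // Hypothetical severity levels; the ordering is an assumption for illustration.
  sealed abstract class Severity(val level: Int)
  case object Okay     extends Severity(0)
  case object Warning  extends Severity(1)
  case object Major    extends Severity(2)
  case object Critical extends Severity(3)

  final case class Alarm(
      isLatchable: Boolean,
      isAutoAcknowledgeable: Boolean,
      current: Severity = Okay,
      latched: Severity = Okay,
      acknowledged: Boolean = true
  ) {
    def setSeverity(s: Severity): Alarm = {
      // A latchable alarm holds its highest severity; otherwise latched follows current.
      val newLatched = if (isLatchable && s.level < latched.level) latched else s
      val raised     = newLatched.level > latched.level
      val acked =
        if (raised) false                                // a raised alarm needs acknowledgement
        else if (s == Okay && isAutoAcknowledgeable) true // auto-acknowledge on return to Okay
        else acknowledged
      copy(current = s, latched = newLatched, acknowledged = acked)
    }

    def acknowledge: Alarm = copy(acknowledged = true)

    // Reset: latched severity becomes the current severity, status becomes acknowledged.
    def reset: Alarm = copy(latched = current, acknowledged = true)
  }
}
```

In this model, a latchable alarm that goes to Major and then back to Okay keeps a latched severity of Major until reset is called, at which point the latched severity returns to Okay.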
getMetadata
Gets the metadata of an alarm, component, subsystem, or whole TMT system. The following information is returned for each alarm:
- subsystem
- component
- name
- description
- location
- alarmType
- supported severities
- probable cause
- operator response
- is autoAcknowledgeable
- is latchable
- activation status
- Scala
-
source
val metadataF: Future[AlarmMetadata] = adminAPI.getMetadata(alarmKey)
metadataF.onComplete {
  case Success(metadata)  => println(s"${metadata.name}: ${metadata.description}")
  case Failure(exception) => println(s"Error getting metadata: ${exception.getMessage}")
}
Inactive alarms do not take part in the aggregation of severity or health. Alarms are set active or inactive in the alarm configuration file, and not through either API.
getStatus
Gets the status of the alarm which contains fields like:
- latched severity
- acknowledgement status
- shelve status
- alarm time
- Scala
-
source
val statusF: Future[AlarmStatus] = adminAPI.getStatus(alarmKey)
statusF.onComplete {
  case Success(status)    => println(s"${status.alarmTime}: ${status.latchedSeverity}")
  case Failure(exception) => println(s"Error getting status: ${exception.getMessage}")
}
getCurrentSeverity
Gets the severity of the alarm.
- Scala
-
source
val severityF: Future[FullAlarmSeverity] = adminAPI.getCurrentSeverity(alarmKey)
severityF.onComplete {
  case Success(severity)  => println(s"${severity.name}: ${severity.level}")
  case Failure(exception) => println(s"Error getting severity: ${exception.getMessage}")
}
getAggregatedSeverity
Gets the aggregated severity for the given alarm/component/subsystem/whole TMT system. Aggregation of the severity represents the most severe alarm amongst the aggregated alarms.
- Scala
-
source
val componentKey = ComponentKey(Prefix(NFIRAOS, "tromboneassembly"))
val aggregatedSeverityF: Future[FullAlarmSeverity] = adminAPI.getAggregatedSeverity(componentKey)
aggregatedSeverityF.onComplete {
  case Success(severity)  => println(s"aggregate severity: ${severity.name}: ${severity.level}")
  case Failure(exception) => println(s"Error getting aggregate severity: ${exception.getMessage}")
}
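The aggregation rule (the most severe alarm among the active alarms in scope) can be sketched with plain collections. Everything in this block is a hypothetical model: the AlarmState record, the optional subsystem and component filters standing in for the key types, and in particular the numeric severity levels, which are assumed here only to define an ordering.

```scala
object SeverityAggregation {
  // Assumed ordering of severities, used only to pick the "most severe" one.
  sealed abstract class Severity(val level: Int)
  case object Okay          extends Severity(0)
  case object Warning       extends Severity(1)
  case object Major         extends Severity(2)
  case object Indeterminate extends Severity(3)
  case object Disconnected  extends Severity(4)
  case object Critical      extends Severity(5)

  final case class AlarmState(
      subsystem: String,
      component: String,
      name: String,
      severity: Severity,
      active: Boolean
  )

  // No filter = whole system; subsystem only = one subsystem;
  // subsystem and component = one component. Inactive alarms are ignored.
  def aggregate(
      alarms: Seq[AlarmState],
      subsystem: Option[String] = None,
      component: Option[String] = None
  ): Option[Severity] = {
    val inScope = alarms.filter { a =>
      a.active && subsystem.forall(_ == a.subsystem) && component.forall(_ == a.component)
    }
    if (inScope.isEmpty) None else Some(inScope.map(_.severity).maxBy(_.level))
  }
}
```

The function returns None when no active alarm falls within the requested scope. How the service reports that case is not covered in this section.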
getAggregatedHealth
Gets the aggregated health for the given alarm/component/subsystem/whole TMT system. Aggregation of health is either Good, Ill, or Bad, based on the most severe alarm amongst the aggregated alarms.
- Scala
-
source
val subsystemKey = SubsystemKey(IRIS)
val healthF: Future[AlarmHealth] = adminAPI.getAggregatedHealth(subsystemKey)
healthF.onComplete {
  case Success(health)    => println(s"${subsystemKey.subsystem.name} health = ${health.entryName}")
  case Failure(exception) => println(s"Error getting health: ${exception.getMessage}")
}
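Health is derived from the most severe alarm, so it can be thought of as a mapping from severity to one of three values. The exact mapping is not given in this section. The sketch below assumes Okay and Warning map to Good, Major maps to Ill, and Critical, Disconnected, and Indeterminate map to Bad; treat both the mapping and the type names as assumptions to be checked against the service.

```scala
object HealthMapping {
  // Hypothetical stand-ins for the service's severity and health types.
  sealed trait Severity
  case object Okay          extends Severity
  case object Warning       extends Severity
  case object Major         extends Severity
  case object Critical      extends Severity
  case object Indeterminate extends Severity
  case object Disconnected  extends Severity

  sealed trait Health
  case object Good extends Health
  case object Ill  extends Health
  case object Bad  extends Health

  // Assumed mapping from an (aggregated) severity to health.
  def healthOf(severity: Severity): Health = severity match {
    case Okay | Warning                          => Good
    case Major                                   => Ill
    case Critical | Indeterminate | Disconnected => Bad
  }
}
```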
subscribeAggregatedSeverityCallback
Subscribes to the changes of aggregated severity for given alarm/component/subsystem/whole TMT system by providing a callback which gets executed for every change.
- Scala
-
source
val alarmSubscription: AlarmSubscription = adminAPI.subscribeAggregatedSeverityCallback(
  ComponentKey(Prefix(NFIRAOS, "tromboneAssembly")),
  aggregatedSeverity => { /* do something*/ }
)
// to unsubscribe:
val unsubscribe1F: Future[Done] = alarmSubscription.unsubscribe()
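The callback style of subscription follows a familiar pattern: register a function, have it invoked when the value changes, and keep a handle for unsubscribing. The stand-alone sketch below shows that pattern only. Notifier is a hypothetical class with no connection to the service's internals, and it assumes that setting the same value again does not count as a change.

```scala
object ChangeNotifier {
  final class Notifier[A](initial: A) {
    private var current: A                  = initial
    private var callbacks: Map[Int, A => Unit] = Map.empty
    private var nextId: Int                 = 0

    // Registers a callback and returns a function that removes it again.
    def subscribe(callback: A => Unit): () => Unit = synchronized {
      val id = nextId
      nextId += 1
      callbacks += (id -> callback)
      () => synchronized { callbacks -= id }
    }

    // Stores the new value and invokes callbacks only if it differs from the old one.
    def update(value: A): Unit = synchronized {
      if (value != current) {
        current = value
        callbacks.values.foreach(cb => cb(value))
      }
    }
  }
}
```

A subscriber that registers, sees updates of Okay, Major, Major, then unsubscribes before an update of Critical, would have its callback invoked once, with Major, given an initial value of Okay.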
subscribeAggregatedSeverityActorRef
Subscribes to the changes of aggregated severity for given alarm/component/subsystem/whole TMT system by providing an actor which will receive a message of aggregated severity on every change.
- Scala
-
source
val severityActorRef = typed.ActorSystem(behaviour[FullAlarmSeverity], "fullSeverityActor")
val alarmSubscription2: AlarmSubscription =
  adminAPI.subscribeAggregatedSeverityActorRef(SubsystemKey(NFIRAOS), severityActorRef)
// to unsubscribe:
val unsubscribe2F: Future[Done] = alarmSubscription2.unsubscribe()
subscribeAggregatedHealthCallback
Subscribes to the changes of aggregated health for given alarm/component/subsystem/whole TMT system by providing a callback which gets executed for every change.
- Scala
-
source
val alarmSubscription3: AlarmSubscription = adminAPI.subscribeAggregatedHealthCallback(
  ComponentKey(Prefix(IRIS, "ImagerDetectorAssembly")),
  aggregatedHealth => { /* do something*/ }
)
// to unsubscribe
val unsubscribe3F: Future[Done] = alarmSubscription3.unsubscribe()
subscribeAggregatedHealthActorRef
Subscribes to the changes of aggregated health for given alarm/component/subsystem/whole TMT system by providing an actor which will receive a message of aggregated health on every change.
- Scala
-
source
val healthActorRef = typed.ActorSystem(behaviour[AlarmHealth], "healthActor")
val alarmSubscription4: AlarmSubscription =
  adminAPI.subscribeAggregatedHealthActorRef(SubsystemKey(IRIS), healthActorRef)
// to unsubscribe
val unsubscribe4F: Future[Done] = alarmSubscription4.unsubscribe()
Technical Description
See Alarm Service Technical Description.