Alarm Service

The Alarm Service provides an API to manage alarms in the TMT software system. The service uses Redis to store Alarm data, including the alarm status and associated metadata. Alarm “keys” are used to access information about an alarm.

Dependencies

The Alarm Service comes bundled with the Framework, so no additional dependency needs to be added to your build.sbt file if you are using it. To use the Alarm Service without the framework, add this to your build.sbt file:

sbt
libraryDependencies += "com.github.tmtsoftware.csw" %% "csw-alarm-client" % "5.0.1"

API Flavors

There are two APIs provided in the Alarm Service: a client API, and an administrative (admin) API. The client API is the API used by component developers to set the severity of an alarm. This is the only functionality needed by component developers. As per TMT policy, the severity of an alarm must be set periodically (within some time limit) in order to maintain the integrity of the alarm status. If an alarm severity is not refreshed within the time limit, currently set at 9 seconds, the severity is set to Disconnected by the Alarm Service, which indicates to the operator that there is some problem with the component’s ability to evaluate the alarm status.
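The refresh rule can be pictured as a time-to-live check on the last reported severity. The following is a minimal, library-free Java sketch under stated assumptions, not the Alarm Service implementation: the class `SeverityTtlSketch`, its methods, and the in-memory timestamp are hypothetical. Only the 9 second limit and the fallback to Disconnected come from the text above.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of the refresh rule; not part of csw-alarm.
final class SeverityTtlSketch {
    // Time limit quoted in the text above.
    static final Duration REFRESH_LIMIT = Duration.ofSeconds(9);

    private String severity;
    private Instant lastRefresh; // null until the first report

    // A component calls this each time it re-evaluates its alarm.
    void setSeverity(String newSeverity, Instant now) {
        severity = newSeverity;
        lastRefresh = now;
    }

    // What an observer would see at time `now`.
    String currentSeverity(Instant now) {
        if (lastRefresh == null) {
            return "Disconnected";
        }
        Duration age = Duration.between(lastRefresh, now);
        // A report older than the limit is no longer trusted.
        return age.compareTo(REFRESH_LIMIT) > 0 ? "Disconnected" : severity;
    }
}
```

In this sketch a severity reported at time t is still visible at t + 5 s, and reads as Disconnected at t + 10 s unless it has been set again in between.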

The admin API provides all of the functions needed to manage the alarm store, as well as access to monitor alarms for use by an operator or instrument specialist. The admin API provides the ability to load alarm data into the alarm store, set the severity of an alarm, acknowledge alarms, shelve or unshelve alarms, reset a latched alarm, get the metadata/status/severity of an alarm, and get or subscribe to aggregations of severity and health of an alarm, a component’s alarms, a subsystem’s alarms, or the alarms of the whole TMT System.

A command line tool is provided as part of the Alarm Service that implements this API and provides low-level control over the Alarm Service. More details about the alarm CLI can be found here: CSW Alarm Client CLI application

Eventually, operators will use Graphical User Interfaces that access the admin API through a UI gateway. This will be delivered as part of the ESW HCMS package.

Note

Since the admin API will primarily be used with the CLI and HCMS applications, it is only supported in Scala, and not Java.

To summarize, the APIs are as follows:

  • client API (AlarmService): Must be used by components. Available method is: {setSeverity}
  • admin API (AlarmAdminService): Expected to be used by administrators. Available methods are: {initAlarms | setSeverity | acknowledge | shelve | unshelve | reset | getMetadata | getStatus | getCurrentSeverity | getAggregatedSeverity | getAggregatedHealth | subscribeAggregatedSeverityCallback | subscribeAggregatedSeverityActorRef | subscribeAggregatedHealthCallback | subscribeAggregatedHealthActorRef}

Creating clientAPI and adminAPI

For component developers, the client API is provided as an AlarmService object in the CswContext object injected into the ComponentHandlers class provided by the framework.

If you are not using csw-framework, you can create AlarmService using AlarmServiceFactory.

Scala
// create alarm client using host and port of alarm server
private val clientAPI1 = new AlarmServiceFactory().makeClientApi("localhost", 5225)

// create alarm client using location service
private val clientAPI2 = new AlarmServiceFactory().makeClientApi(locationService)

// create alarm admin using host and port of alarm server
private val adminAPI1 = new AlarmServiceFactory().makeAdminApi("localhost", 5226)

// create alarm admin using location service
private val adminAPI2 = new AlarmServiceFactory().makeAdminApi(locationService)
Java
// create alarm client using host and port of alarm server
final IAlarmService jclientAPI1 = new AlarmServiceFactory().jMakeClientApi("localhost", 5227, actorSystem);

// create alarm client using location service
IAlarmService jclientAPI2 = new AlarmServiceFactory().jMakeClientApi(jLocationService, actorSystem);

Rules and checks

  • When representing a unique alarm, the alarm name or component name must not have * [ ] ^ - or any whitespace characters
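The rule above can be checked before a key is built. This is a minimal sketch, assuming only the characters listed in the rule; the class `AlarmNameRuleSketch` and its method are hypothetical and are not part of the Alarm Service API.

```java
import java.util.regex.Pattern;

// Hypothetical pre-check for alarm and component names; not part of csw-alarm.
final class AlarmNameRuleSketch {
    // Characters named in the rule above: * [ ] ^ - and any whitespace.
    private static final Pattern FORBIDDEN = Pattern.compile("[*\\[\\]^\\-\\s]");

    // True when the name contains none of the forbidden characters.
    static boolean isValidName(String name) {
        return !FORBIDDEN.matcher(name).find();
    }
}
```

For example, `isValidName("tromboneAxisLowLimitAlarm")` passes, while names such as "low limit" or "low-limit" are rejected.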

Model Classes

  • AlarmKey : Represents the unique alarm in the TMT system. It is composed of subsystem, component and alarm name.
  • ComponentKey : Represents all alarms of a component. Used for getting severity or health of an entire component.
  • SubsystemKey : Represents all alarms of a subsystem. Used for getting severity or health of an entire subsystem.
  • GlobalKey : Represents all alarms present in the TMT system. Used for getting severity or health of an entire observatory.
  • AlarmMetadata : Represents static metadata of an alarm, which will not change in its entire lifespan.
  • AlarmStatus : Represents the dynamically changing data of an alarm, which changes when the severity changes or when an operator changes it manually.
  • AlarmSeverity : Represents severity levels that can be set by the component developer e.g. Okay, Indeterminate, Warning, Major and Critical
  • FullAlarmSeverity : Represents all possible severity levels of the alarm i.e. Disconnected (cannot be set by the developer) plus other severity levels that can be set by the developer
  • AlarmHealth : Represents possible health of an alarm or component or subsystem or whole TMT system

Client API

setSeverity

Sets the severity of the given alarm. The severity must be refreshed by setting it at a regular interval or it will automatically be changed to Disconnected after a specific time.

Scala
val alarmKey              = AlarmKey(Prefix(NFIRAOS, "trombone"), "tromboneAxisLowLimitAlarm")
val resultF: Future[Done] = clientAPI.setSeverity(alarmKey, Okay)
Java
private final AlarmKey alarmKey = new AlarmKey(Prefix.apply(JSubsystem.NFIRAOS, "trombone"), "tromboneAxisLowLimitAlarm");
Future<Done> doneF = jclientAPI1.setSeverity(alarmKey, JAlarmSeverity.Okay);
Note
  • If the alarm is not refreshed within 9 seconds, it will be inferred as Disconnected
  • If the alarm is auto-acknowledgeable and the severity is set to Okay, the alarm is acknowledged automatically and needs no explicit acknowledgement by an administrator
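One way to meet the refresh requirement is to re-evaluate and re-publish the severity on a fixed schedule. The sketch below is an assumption-laden illustration, not framework code: `SeverityRefresherSketch`, `evaluate`, and `publish` are hypothetical names, and `publish` stands in for a call such as `jclientAPI1.setSeverity(alarmKey, severity)`.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical helper that keeps an alarm refreshed; not part of csw-alarm.
final class SeverityRefresherSketch {
    private final Supplier<String> evaluate; // recomputes the current severity
    private final Consumer<String> publish;  // stand-in for setSeverity(alarmKey, severity)
    private ScheduledExecutorService scheduler;

    SeverityRefresherSketch(Supplier<String> evaluate, Consumer<String> publish) {
        this.evaluate = evaluate;
        this.publish = publish;
    }

    // One refresh cycle: the severity is published even if it has not changed.
    void tick() {
        publish.accept(evaluate.get());
    }

    // Repeats tick() at a fixed period; choose a period shorter than the 9 second limit.
    synchronized void start(long periodSeconds) {
        scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                tick();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all further runs, so report and continue.
                System.err.println("severity refresh failed: " + e.getMessage());
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
    }

    synchronized void stop() {
        if (scheduler != null) {
            scheduler.shutdownNow();
        }
    }
}
```

A period of a few seconds, for example `start(3)`, leaves some margin below the 9 second limit if a single refresh is delayed.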

Admin API

initAlarms

Loads the alarm data from the given alarm configuration file into the alarm store.

Scala
val resource               = "test-alarms/valid-alarms.conf"
val alarmsConfig: Config   = ConfigFactory.parseResources(resource)
val result2F: Future[Done] = adminAPI.initAlarms(alarmsConfig)

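Alarm configuration files are written in the HOCON format using the following fields (the example file below gives the subsystem and component together as a single prefix entry, such as nfiraos.trombone):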

  • subsystem: subsystem name the alarm belongs to
  • component: name of component for the alarm, matching the name in the componentInfo file (see Describing Components)
  • name: name of the alarm
  • description: a description of what the alarm represents
  • location: physical location within the observatory or instrument in which the alarm condition is occurring
  • alarmType: the general category for the alarm. Must be one of the following:
    • Absolute: An alarm generated when a setpoint is exceeded.
    • BitPattern: An alarm generated when a pattern of digital signals matches a predetermined pattern.
    • Calculated: An alarm generated from a calculated value instead of a direct process measurement.
    • Deviation: An alarm generated when the difference between two analog values exceeds a limit (e.g., deviation between primary and redundant instruments or a deviation between process variable and setpoint).
    • Discrepancy: An alarm generated by a mismatch between the expected state of a plant or device and its actual state (e.g., when a motor fails to start after it is commanded to the on state).
    • Instrument: An alarm generated by a field device to indicate a fault (e.g., a sensor failure).
    • RateChange: An alarm generated when the change in a process variable per unit time, (dPV/dt), exceeds a defined limit.
    • RecipeDriven: An alarm with limits that depend on the recipe that is currently being executed.
    • Safety: An alarm that is tied to and echoing an action or interlock in the subsystem’s safety controller. (Note: At TMT, the Alarm Service cannot be a primary hazard control for severe hazards.)
    • Statistical: An alarm generated based on statistical properties of one or more process variables.
    • System: An alarm generated by the control system to indicate a fault within the system hardware, software, or components (e.g., unrecoverable communication error).
  • supportedSeverities: list of non-Okay severities the alarm may become (Warning, Major, Critical). All alarms are assumed to support Okay, Disconnected, and Indeterminate.
  • probableCause: a description of the likely cause of the alarm reaching each severity level
  • operatorResponse: instructions or information to help the operator respond to the alarm.
  • isAutoAcknowledgeable: true/false flag for whether the alarm is acknowledged automatically when its severity returns to Okay.
  • isLatchable: true/false flag for whether the alarm latches at its highest severity until reset.
  • activationStatus: Active/Inactive flag for whether the alarm is currently active (and considered in aggregated severity and health calculations)
alarms.conf
alarms: [
  {
    prefix = nfiraos.trombone
    name = tromboneAxisLowLimitAlarm
    description = "Warns when trombone axis has reached the low limit"
    location = "south side"
    alarmType = Absolute
    supportedSeverities = [Warning, Major, Critical]
    probableCause = "the trombone software has failed or the stage was driven into the low limit"
    operatorResponse = "go to the NFIRAOS engineering user interface and select the datum axis command"
    isAutoAcknowledgeable = false
    isLatchable = true
    activationStatus = Active
  },
  {
    prefix = nfiraos.trombone
    name = tromboneAxisHighLimitAlarm
    description = "Warns when trombone axis has reached the high limit"
    location = "south side"
    alarmType = Absolute
    supportedSeverities = [Warning, Major]
    probableCause = "the trombone software has failed or the stage was driven into the high limit"
    operatorResponse = "go to the NFIRAOS engineering user interface and select the datum axis command"
    isAutoAcknowledgeable = true
    isLatchable = true
    activationStatus = Active
  },
  {
    prefix = tcs.tcspk
    name = cpuExceededAlarm
    description = "This alarm is activated when the tcsPk Assembly can no longer calculate all of its pointing values in the time allocated. The CPU may lack power, or there may be pointing loops running that are not needed. Response: Check to see if pointing loops are executing that are not needed or see about a more powerful CPU."
    location = "in computer..."
    alarmType = Absolute
    supportedSeverities = [Warning, Major, Critical]
    probableCause = "too fast..."
    operatorResponse = "slow it down..."
    isAutoAcknowledgeable = true
    isLatchable = false
    activationStatus = Active
  },
  {
    prefix = lgsf.tcspkinactive
    name = cpuIdleAlarm
    description = "This alarm is activated when the CPU is idle"
    location = "in computer..."
    alarmType = Absolute
    supportedSeverities = [Warning, Major, Critical]
    probableCause = "too fast..."
    operatorResponse = "slow it down..."
    isAutoAcknowledgeable = true
    isLatchable = false
    activationStatus = Inactive
  }
]

acknowledge

Acknowledges the given alarm after it has been raised to a higher severity.

Scala
val result3F: Future[Done] = adminAPI.acknowledge(alarmKey)

shelve

Shelves the given alarm. A shelved alarm is unshelved automatically at a specific time (8 AM local time by default) if it is not unshelved manually before then. The automatic unshelve time can be configured in application.conf, for example csw-alarm.shelve-timeout = h:m:s a.

Scala
val result4F: Future[Done] = adminAPI.shelve(alarmKey)
Note

Shelved alarms are still included in the aggregated severity and health calculations.
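To illustrate the automatic unshelve time, the sketch below computes the next local date and time at which a shelved alarm would be unshelved. It assumes that "h:m:s a" is read as a java.time pattern (12-hour clock with an AM/PM marker) and that "8:00:00 AM" is a valid value. Both are assumptions rather than confirmed details of the service, and `ShelveTimeoutSketch` is a hypothetical name.

```java
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical illustration of the shelve timeout; not part of csw-alarm.
final class ShelveTimeoutSketch {
    // Pattern quoted in the text above, assumed to use java.time pattern letters.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("h:m:s a", Locale.ENGLISH);

    // Next local date-time, strictly after `now`, that matches the configured time of day.
    static LocalDateTime nextUnshelve(LocalDateTime now, String shelveTimeout) {
        LocalTime timeOfDay = LocalTime.parse(shelveTimeout, FORMAT);
        LocalDateTime today = now.toLocalDate().atTime(timeOfDay);
        return today.isAfter(now) ? today : today.plusDays(1);
    }
}
```

Under these assumptions, an alarm shelved at 9 AM would stay shelved until 8 AM the next day, while one shelved at 7 AM would be unshelved an hour later.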

unshelve

Unshelves the given alarm.

Scala
val result5F: Future[Done] = adminAPI.unshelve(alarmKey)

reset

Resets the status of the given latched alarm by setting the latched severity to the current severity and the acknowledgement status to acknowledged, without changing any other properties of the alarm.

Scala
val result6F: Future[Done] = adminAPI.reset(alarmKey)
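The interaction between latching and reset can be sketched as a small state holder. This is an illustration under assumptions, not the service's logic: `LatchSketch` is hypothetical, only four severities are modelled in their natural order, and marking the alarm unacknowledged when the latched severity rises is an assumption.

```java
// Hypothetical model of a latchable alarm; not part of csw-alarm.
final class LatchSketch {
    // Declared in increasing order so compareTo reflects severity.
    enum Severity { OKAY, WARNING, MAJOR, CRITICAL }

    private Severity current = Severity.OKAY;
    private Severity latched = Severity.OKAY;
    private boolean acknowledged = true;

    // The latched severity only moves upwards until reset() is called.
    void setSeverity(Severity severity) {
        current = severity;
        if (severity.compareTo(latched) > 0) {
            latched = severity;
            acknowledged = false; // assumption: a rise needs acknowledgement
        }
    }

    // Mirrors the description above: latched becomes current, status becomes acknowledged.
    void reset() {
        latched = current;
        acknowledged = true;
    }

    Severity current() { return current; }
    Severity latched() { return latched; }
    boolean isAcknowledged() { return acknowledged; }
}
```

In this model, an alarm that goes to Major and back to Okay keeps a latched severity of Major until `reset()` is called.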

getMetadata

Gets the metadata of an alarm, component, subsystem, or whole TMT system. The following information is returned for each alarm:

  • subsystem
  • component
  • name
  • description
  • location
  • alarmType
  • supported severities
  • probable cause
  • operator response
  • is autoAcknowledgeable
  • is latchable
  • activation status
Scala
val metadataF: Future[AlarmMetadata] = adminAPI.getMetadata(alarmKey)
metadataF.onComplete {
  case Success(metadata)  => println(s"${metadata.name}: ${metadata.description}")
  case Failure(exception) => println(s"Error getting metadata: ${exception.getMessage}")
}
Note

Inactive alarms do not take part in the aggregation of severity or health. Alarms are set active or inactive in the alarm configuration file, and not through either API.

getStatus

Gets the status of the alarm which contains fields like:

  • latched severity
  • acknowledgement status
  • shelve status
  • alarm time
Scala
val statusF: Future[AlarmStatus] = adminAPI.getStatus(alarmKey)
statusF.onComplete {
  case Success(status)    => println(s"${status.alarmTime}: ${status.latchedSeverity}")
  case Failure(exception) => println(s"Error getting status: ${exception.getMessage}")
}

getCurrentSeverity

Gets the severity of the alarm.

Scala
val severityF: Future[FullAlarmSeverity] = adminAPI.getCurrentSeverity(alarmKey)
severityF.onComplete {
  case Success(severity)  => println(s"${severity.name}: ${severity.level}")
  case Failure(exception) => println(s"Error getting severity: ${exception.getMessage}")
}

getAggregatedSeverity

Gets the aggregated severity for the given alarm/component/subsystem/whole TMT system. Aggregation of the severity represents the most severe alarm amongst the aggregated alarms.

Scala
val componentKey                                   = ComponentKey(Prefix(NFIRAOS, "tromboneassembly"))
val aggregatedSeverityF: Future[FullAlarmSeverity] = adminAPI.getAggregatedSeverity(componentKey)
aggregatedSeverityF.onComplete {
  case Success(severity)  => println(s"aggregate severity: ${severity.name}: ${severity.level}")
  case Failure(exception) => println(s"Error getting aggregate severity: ${exception.getMessage}")
}
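The "most severe alarm" rule amounts to taking a maximum over severity levels. The sketch below is a library-free illustration: `AggregatedSeveritySketch` is hypothetical, and the numeric levels (Okay lowest, Critical highest, with Indeterminate and Disconnected ranked above Major) are an assumption made for the example, so check them against the `level` values of the real severity classes.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.Optional;

// Hypothetical illustration of severity aggregation; not part of csw-alarm.
final class AggregatedSeveritySketch {
    // Levels are assumed for illustration only.
    enum Severity {
        OKAY(0), WARNING(1), MAJOR(2), INDETERMINATE(3), DISCONNECTED(4), CRITICAL(5);

        final int level;

        Severity(int level) {
            this.level = level;
        }
    }

    // Returns the most severe entry, or empty when there is nothing to aggregate.
    static Optional<Severity> aggregate(Collection<Severity> severities) {
        return severities.stream().max(Comparator.comparingInt(s -> s.level));
    }
}
```

With these assumed levels, a component whose alarms are at Okay, Major, and Warning aggregates to Major.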

getAggregatedHealth

Gets the aggregated health for the given alarm/component/subsystem/whole TMT system. Aggregated health is Good, Ill, or Bad, based on the most severe alarm amongst the aggregated alarms.

Scala
val subsystemKey                 = SubsystemKey(IRIS)
val healthF: Future[AlarmHealth] = adminAPI.getAggregatedHealth(subsystemKey)
healthF.onComplete {
  case Success(health)    => println(s"${subsystemKey.subsystem.name} health = ${health.entryName}")
  case Failure(exception) => println(s"Error getting health: ${exception.getMessage}")
}
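Health is derived from the most severe alarm, so it can be pictured as a mapping from severity to one of three values. The mapping in this sketch (Okay and Warning to Good, Major to Ill, everything else to Bad) is an assumption chosen for illustration and is not stated in this document. `AggregatedHealthSketch` and its members are hypothetical names.

```java
// Hypothetical illustration of the severity-to-health mapping; not part of csw-alarm.
final class AggregatedHealthSketch {
    enum Severity { OKAY, WARNING, MAJOR, INDETERMINATE, DISCONNECTED, CRITICAL }

    enum Health { GOOD, ILL, BAD }

    // Assumed mapping: only the three health values come from the text above.
    static Health fromSeverity(Severity mostSevere) {
        switch (mostSevere) {
            case OKAY:
            case WARNING:
                return Health.GOOD;
            case MAJOR:
                return Health.ILL;
            default:
                return Health.BAD;
        }
    }
}
```

The input here is the aggregated severity of the key in question, so a subsystem's health follows its single worst alarm.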

subscribeAggregatedSeverityCallback

Subscribes to the changes of aggregated severity for given alarm/component/subsystem/whole TMT system by providing a callback which gets executed for every change.

Scala
val alarmSubscription: AlarmSubscription = adminAPI.subscribeAggregatedSeverityCallback(
  ComponentKey(Prefix(NFIRAOS, "tromboneAssembly")),
  aggregatedSeverity => { /* do something*/ }
)
// to unsubscribe:
val unsubscribe1F: Future[Done] = alarmSubscription.unsubscribe()
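The "executed for every change" behaviour can be pictured as a filter that forwards a value to the callback only when it differs from the previous one. The following is a conceptual sketch, not how the service delivers updates: `ChangeCallbackSketch` and `offer` are hypothetical, and the real subscription is managed by the Alarm Service rather than fed by the caller.

```java
import java.util.Objects;
import java.util.function.Consumer;

// Hypothetical model of a change-driven callback subscription; not part of csw-alarm.
final class ChangeCallbackSketch<T> {
    private final Consumer<T> callback;
    private T last; // null until the first value arrives
    private boolean subscribed = true;

    ChangeCallbackSketch(Consumer<T> callback) {
        this.callback = callback;
    }

    // Receives the latest aggregated value; the callback runs only when it differs from the last one.
    void offer(T value) {
        if (subscribed && !Objects.equals(value, last)) {
            last = value;
            callback.accept(value);
        }
    }

    // After this call no further values reach the callback.
    void unsubscribe() {
        subscribed = false;
    }
}
```

Because the callback runs once per change, it is a natural place to update a display or log entry, and repeated refreshes at the same severity do not trigger it in this model.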

subscribeAggregatedSeverityActorRef

Subscribes to the changes of aggregated severity for given alarm/component/subsystem/whole TMT system by providing an actor which will receive a message of aggregated severity on every change.

Scala
val severityActorRef = typed.ActorSystem(behaviour[FullAlarmSeverity], "fullSeverityActor")
val alarmSubscription2: AlarmSubscription =
  adminAPI.subscribeAggregatedSeverityActorRef(SubsystemKey(NFIRAOS), severityActorRef)

// to unsubscribe:
val unsubscribe2F: Future[Done] = alarmSubscription2.unsubscribe()

subscribeAggregatedHealthCallback

Subscribes to the changes of aggregated health for given alarm/component/subsystem/whole TMT system by providing a callback which gets executed for every change.

Scala
val alarmSubscription3: AlarmSubscription = adminAPI.subscribeAggregatedHealthCallback(
  ComponentKey(Prefix(IRIS, "ImagerDetectorAssembly")),
  aggregatedHealth => { /* do something*/ }
)

// to unsubscribe
val unsubscribe3F: Future[Done] = alarmSubscription3.unsubscribe()

subscribeAggregatedHealthActorRef

Subscribes to the changes of aggregated health for given alarm/component/subsystem/whole TMT system by providing an actor which will receive a message of aggregated health on every change.

Scala
val healthActorRef                        = typed.ActorSystem(behaviour[AlarmHealth], "healthActor")
val alarmSubscription4: AlarmSubscription = adminAPI.subscribeAggregatedHealthActorRef(SubsystemKey(IRIS), healthActorRef)

// to unsubscribe
val unsubscribe4F: Future[Done] = alarmSubscription4.unsubscribe()

Technical Description

See Alarm Service Technical Description.

Source code for examples