
Alert Configuration Overview

Alert Configuration

Creating Alerts

See the Alerts Quick Reference for the nearly 1000 metrics available in ProTop. NOTE: Not all metric names make sense to alert on. For more information on alerting, use the 'Need Help?' button at the top of this page.

To create a new alert that will be sent to, seen in, and potentially acted upon by the ProTop Portal:

  1. Look for the Data Collector name in the ptInitDC variable in your pt3agent configuration file, and add it if it is not already there
  2. Add the alert parameters for the metric of interest to your alert configuration file
  3. Restart the pt3agent if you changed your pt3agent configuration file

For example, let's say we want to be alerted whenever the buffer hit rate drops below 98%, indicating some process might be reading a large number of records from the disk. For this we need these three elements:

    1. Data Collector: B2
    2. Metric Name: Hit%
    3. Data Type: num

And to follow these steps:

A. Edit your pt3agent configuration file (e.g. etc/pt3agent.yourCustId.cfg) and search for ptInitDC. Append B2 to the end of the quoted string so it looks like this:

ptInitDC "DBId,Dashboard,Configuration,TableActivity,IndexActivity,LatchActivity,ResourceWaits,StorageAreas,Blocked,RemoteServerActivity,OSInfo,AppActivity,B2" 

B. Edit your alert configuration file (e.g. etc/alert.yourCustId.cfg) and add this line at the bottom:

Hit% num < 98 "" "hourly" "Buffer Hit Rate Low &1 &2 &3" alarm 

This alert definition says:

  • once an hour (the countdown to the next alert starts the first time the threshold is breached after alert.cfg is (re)loaded)
  • when the alertable metric, Hit%, a number
  • is less than 98
  • send the Alert message "Buffer Hit Rate Low 97 < 98" to the ProTop Portal, where &1 is the current value (97 in this illustration), &2 is the operator (<) and &3 is the limit (98)
  • as an alarm (shown in orange in the ProTop Portal Dashboard Alerts window)

See The Elements of an Alert Definition article for more detail.
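The breakdown above can be sketched in Python. This is only an illustration of how the fields of a definition line map onto the description, not ProTop's own parser. The function and field names (parse_alert_definition, render_message, "unnamed") are hypothetical, and the purpose of the empty fifth field is not covered in this article, so it is carried through untouched.

```python
import shlex

# Hypothetical field names, chosen for this illustration only.
FIELDS = ["metric", "data_type", "operator", "threshold",
          "unnamed", "frequency", "message", "actions"]


def parse_alert_definition(line):
    """Split one alert definition line into named fields.

    shlex honours the double quotes, so the message stays one field
    and the empty "" field is preserved as an empty string.
    """
    tokens = shlex.split(line)
    alert = dict(zip(FIELDS, tokens))
    alert["actions"] = alert["actions"].split(",")
    # An optional ninth token is the filter described under Alert Filters.
    alert["filter"] = tokens[8] if len(tokens) > 8 else None
    return alert


def render_message(alert, current_value):
    """Substitute &1 (current value), &2 (operator) and &3 (limit)."""
    return (alert["message"]
            .replace("&1", str(current_value))
            .replace("&2", alert["operator"])
            .replace("&3", alert["threshold"]))


definition = 'Hit% num < 98 "" "hourly" "Buffer Hit Rate Low &1 &2 &3" alarm'
alert = parse_alert_definition(definition)
print(alert["metric"], alert["actions"])
print(render_message(alert, 97))
```

The value 97 passed to render_message is an invented sample reading, used only to show the substitution.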

C. If you changed the ptInitDC variable in your pt3agent configuration file, restart the pt3agent by removing the tmp/pt3agent.*.flg file. The pt3agent will then be restarted by the dbmonitor process.
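Step A can also be scripted. The following is a minimal Python sketch, assuming the ptInitDC line has exactly the form shown above (the keyword followed by one double-quoted, comma-separated list). ensure_collector is a hypothetical helper, not part of ProTop.

```python
import re


def ensure_collector(cfg_line, collector):
    """Return the ptInitDC line with `collector` appended if it is missing."""
    match = re.match(r'(\s*ptInitDC\s+)"([^"]*)"', cfg_line)
    if match is None:
        raise ValueError("not a ptInitDC line")
    names = [name for name in match.group(2).split(",") if name]
    if collector not in names:
        names.append(collector)  # new collectors go at the end of the list
    return f'{match.group(1)}"{",".join(names)}"'


print(ensure_collector('ptInitDC "DBId,Dashboard,OSInfo"', "B2"))
```

Running the helper a second time on its own output leaves the line unchanged, so it is safe to apply repeatedly.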

Another example: 

Here is an alarm (in orange on the right) as seen in the ProTop Portal Dashboard Alerts window:

[Image: alarm shown in the ProTop Portal Dashboard Alerts window]

This is the alert definition in alert.yourCustId.cfg that produced the alarm in the image above:

otrx num > 7200 "" "hourly" "Old Transaction &1 &2 &3" zOldTRXDetails,alarm 

This alert includes an Alert Enhancer (discussed below), zOldTRXDetails, which provides the additional detail seen at the bottom of the blue box. It is executed and added to the Alert message body before the alarm is sent to the portal.

Changing Alerts and their Effect on Alert Timing

The countdown to the next alert for a given definition starts the first time its metric threshold is breached.  If you have an existing breach you will get an alert immediately upon restarting the ProTop agent, then the countdown to the next alert for that metric breach begins.

The ProTop agent re-loads the alert definitions when it is restarted. It also re-loads them every monInt seconds if the alert.*.cfg file has changed. The variable monInt is set in etc/pt3agent.*.cfg; the default is 300.

When new or changed definitions are loaded by the agent, existing definitions and running countdowns to the next alert are unaffected. If you removed or commented out an alert, that alert will no longer fire.
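The timing behaviour described in this section can be modelled with a small Python sketch. It is an interpretation of the text, not ProTop's implementation: the class name AlertThrottle is hypothetical, "hourly" is assumed to mean 3600 seconds, and it is assumed that an alert fires again if the breach is still present once the countdown has run out.

```python
class AlertThrottle:
    """Fire on the first breach, then stay quiet until the interval elapses."""

    def __init__(self, interval_seconds):
        self.interval_seconds = interval_seconds
        self.next_allowed = None  # no countdown running yet

    def should_fire(self, breached, now):
        if not breached:
            return False
        if self.next_allowed is not None and now < self.next_allowed:
            return False  # countdown from an earlier alert is still running
        self.next_allowed = now + self.interval_seconds
        return True


hourly = AlertThrottle(interval_seconds=3600)
# (seconds since agent start, threshold breached?)
samples = [(0, True), (300, True), (3600, True), (3900, False)]
print([hourly.should_fire(breached, now) for now, breached in samples])
```

In this model an agent restart corresponds to creating a fresh AlertThrottle, which is why an existing breach alerts immediately after a restart, while a re-load of changed definitions keeps the existing object and its running countdown.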

Alert Enhancers

Alert enhancers gather important context information for your alerts so you don't have to. This detail is added to the body of the message before an action such as "alarm" or "page" sends it off. The action list can include as many enhancers as make sense.

Take this alert definition in etc/alert.yourCustId.cfg:

resrcWts num > 500 "" "hourly" "&1 &2 &3" alert 

This will send an alert, once an hour, to the portal when the alertable metric resrcWts (resource waits) exceeds 500. If resrcWts reaches, say, 514, the alert message in the portal will read "resrcWts 514 > 500". This may be enough to kick off an investigation, but critical details may have been lost by the time we notice the alert and pull up ProTop RT to look into the matter.

If we edit the alert definition and add Alert Enhancers, say Resource Details (zResourceDetails) and Latch Details (zLatchDetails), to the action list, the new definition looks like this:

resrcWts num > 500 "" "hourly" "&1 &2 &3" zResourceDetails,zLatchDetails,alert 

Resource and Latch details, in that order, will be added to the Alert message body before the action "alert" is sent to the portal. You will now see the busiest Resources and Latches and the extent to which each was in use at the time the alert was triggered, which gets you to the root cause faster than you would without the enhancers.

Again, you can add as many enhancers to your alert definition as make sense. They will be executed in the order in which they appear in the comma-separated list of items in the action section of the alert definition.
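The left-to-right processing of the action list can be pictured with the following Python sketch. It is a simplified model under stated assumptions, not ProTop code: run_actions and the stand-in enhancer functions are hypothetical, and the enhancers return placeholder text rather than real detail.

```python
def run_actions(message, action_list, enhancers):
    """Walk a comma-separated action list from left to right.

    `enhancers` maps an enhancer name to a function returning detail text.
    Returns (action, body) pairs showing what each delivery action
    would have sent.
    """
    body = message
    sent = []
    for action in action_list.split(","):
        if action in enhancers:
            body += "\n" + enhancers[action]()  # add context to the body
        else:
            sent.append((action, body))  # deliver whatever has accumulated
    return sent


# Stand-in enhancers returning placeholder text, not real ProTop output.
fake_enhancers = {
    "zResourceDetails": lambda: "<resource details>",
    "zLatchDetails": lambda: "<latch details>",
}

for action, body in run_actions("resrcWts 514 > 500",
                                "zResourceDetails,zLatchDetails,alert",
                                fake_enhancers):
    print(action, repr(body))
```

Under this model, listing "alert" before the enhancers would deliver the bare message, which is why the delivery action belongs at the end of the list.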

Here is an example of the alert above as seen in the ProTop Portal Dashboard. The blue box pops up when you hover your cursor over the Alert Message in the Alerts window (red box in the image). This shows the Alert Message "resrcWts 514 > 500" enhanced with Resource Details and Latch Details:

[Image: Alert Message enhanced with Resource Details and Latch Details in the ProTop Portal Dashboard]

Available Alert Enhancers

Alert Enhancer | Useful For | Shows | ProTop RT Panel | Alertable Metrics
zOldTRXDetails | Issues regarding old transactions, excessive use of BI | User Number, User Name, PID, Device, Flags, Duration, Idle Time and Wait Time; it also fires bin/OldTRXDetails.sh if it exists, and any output is fed to the next action in the alert action list; see bin/OldTRXDetails.shx for an example | Active Transactions (x) | Active Transactions
zDirtySchema | Identifying which schema currently live in the schema area so they can be moved to a more appropriate area | Object Number, Type and Name | |
zHashPct | Tuning Database Buffers and Hash Table Entries | -hash, -B, -B2 and -hash as a percent of the database buffer pool | Configuration Startup (c) | Configuration (Startup)
zBlockedDetails | Troubleshooting Locking Conflicts | For the "blocked" user: User Number, User Name, PID, Flags, Duration, Wait, Resource ID, Table Name | Blocked (b) | Blocked Sessions
zBlockerDetails | Troubleshooting Locking Conflicts | For the "blocker": User Number, User Name, PID, Flags, Duration, Wait, Resource ID, Table Name; used with zBlockedDetails above | Blocked (b) | Blocked Sessions
zLatchDetails | Troubleshooting Latch contention | Latch Name, Number of Requests for that Latch, Number of Waits for that Latch, and Lock % | Latch Activity (w) | Latch Activity
zUserLocks | Identifying users with the highest numbers of record locks to get at the code they are running | User Number, User Name, PID, Flags, Logical Reads, Record Locks, Locks High Water Mark, Line Number in code they are running and the name of the procedure they are running (requires client statement cache be enabled for suspect users) | User IO Activity (u) | User IO Activity
zUserLkHWM | Similar to zUserLocks above but presented in order by high water mark | | |
zUserActivity | Similar to zUserLocks and zUserLkHWM | | |
zRecordActivity | Investigating Table Usage | Table Number, Table Name, Number of Records in the Table, Number of Creates, Reads, Updates and Deletes plus Churn (Ratio of Reads to Records in the Table) | Table Activity (t) | Table Activity
zResourceDetails | Troubleshooting Database Internal Resource contention | Resource Name, Number of Requests for that Resource, Number of Waits for that Resource, and Lock % | Latch Activity (w) | Resource Activity
zIndexActivity | Investigating Index Usage | Index Number, Index Name, Number of Index Blocks for the Index, Number of Index Creates, Reads, Splits, Deletes and Block Level Index Deletes | Index Activity (i) | Index Activity
zAppNote | Providing Application Specific Context | As defined by the user when configuring Application Monitoring | Application Specific (e) | Application Specific
zlog2rec | Investigating High OS Read Activity Reported by the Database | Logical Reads, Record Reads, Logical Read Threshold and indicates if a Backup is running | Dashboard (d) | Dashboard

Alert Actions

The next argument in an alert.yourCustId.cfg entry is a comma-separated list of actions to perform when this alert fires. The actions are performed in order from left to right. For example, if you want to see the output of the Enhancer and/or the script in the details of the info/warning/alarm/page, make sure that the info/warning/alarm/page is last in the list.

Action | Meaning
info | seen
warning | seen in the Dashboard Alerts window in yellow
alarm | seen in the Dashboard Alerts window in orange
page | seen in the Dashboard Alerts window in red; also sends out paging events as configured in the portal
script | in addition to the above alerts, if the word script is included, a script named the same as the alertable metric (no extension), if found in the ProTop bin directory, will be executed
enhancers | see above

Alert Filters

You can include a parameter in your alert definitions in etc/alert.*.cfg specifying a filter. This allows you to add a bit of secondary logic before an alert is sent to the portal. This is useful for alerts on collections of metrics, such as table and index stats, storage areas or app servers, where there is a group of things but you want to alert differently based on the name of a particular instance.

Sample syntax:

tblRd num > 5000 "" "hourly" "&1 &2 &3" alert "tblName = 'product specs'" 

Format:

This is a double-quoted, space-delimited set of three fields:

  1. fieldName - an Alertable Metric from the same Data Collector as the Alertable Metric at the start of the alert definition. In the example above, tblRd and tblName are both from the TableActivity Data Collector. The value compared is the field's value in the current record.
  2. operator - see Rule 4 below
  3. targetValue - if targetValue has embedded spaces, it must be enclosed in single quotes within the string (as shown above)

Rules:

  1. The set is placed after the alert actions string (just 'alert' in this case)
  2. The filter is basically an AND condition: the alert fires only if the threshold is breached and the filter is also true
  3. Only character comparisons are available
  4. The usual =, >, <, <>, >=, <= operators are supported,
    plus the following:
    • inlist -- equivalent to lookup( X, targetValues ) where X is the current value of the field and the targetValues are a comma delimited string (e.g. _File,_Field)

    • !inlist -- not( inlist )

    • begins -- X begins targetValue (e.g. _Fi)

    • !begin -- not( X begins targetValue ) note: there is no "s", that is deliberate, read as "does not begin"

    • matches -- X matches targetValue (e.g. "tblName matches _Fi*"; note the asterisk)

    • !match -- not( X matches targetValue ) note: there is no "es", that is also deliberate, read as "does not match"

    • longer -- X has more characters than targetValue (a number, e.g. "tblName longer 5")

    • shorter -- X has fewer characters than targetValue (a number, e.g. "tblName shorter 6")

NOTE: These comparison operators are also available for the initial alert comparison.
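The filter format and operators above can be sketched in Python as follows. This is an illustrative model, not ProTop's implementation: filter_passes and the OPERATORS table are hypothetical names, only the "*" wildcard shown in the matches example is handled, and Python's case-sensitive string comparison is used, which may differ from how ProTop compares character values.

```python
import re
import shlex


def _matches(value, pattern):
    # Only the "*" wildcard from the example above is handled here.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, value) is not None


# Every comparison works on strings, mirroring Rule 3.
OPERATORS = {
    "=": lambda x, t: x == t,
    "<>": lambda x, t: x != t,
    ">": lambda x, t: x > t,
    "<": lambda x, t: x < t,
    ">=": lambda x, t: x >= t,
    "<=": lambda x, t: x <= t,
    "inlist": lambda x, t: x in t.split(","),
    "!inlist": lambda x, t: x not in t.split(","),
    "begins": lambda x, t: x.startswith(t),
    "!begin": lambda x, t: not x.startswith(t),
    "matches": _matches,
    "!match": lambda x, t: not _matches(x, t),
    "longer": lambda x, t: len(x) > int(t),
    "shorter": lambda x, t: len(x) < int(t),
}


def filter_passes(filter_text, record):
    """Evaluate a 'fieldName operator targetValue' filter against one record."""
    # shlex strips the single quotes around a targetValue with spaces.
    field, operator, target = shlex.split(filter_text)
    return OPERATORS[operator](str(record[field]), target)


print(filter_passes("tblName = 'product specs'", {"tblName": "product specs"}))
print(filter_passes("tblName inlist _File,_Field", {"tblName": "_Field"}))
```

The record dictionary stands in for the current row of the Data Collector; its contents here are sample values for illustration.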

Alert Scheduling

It is possible within ProTop Alerting to restrict alerts to fire within a given timeframe.  In general, you define a schedule in your alert configuration file and list below that schedule the alert definitions you want to be active in that window.  You can define multiple schedules in the same manner.  Only the alerts after the current schedule line and before the next schedule are affected by that schedule.

For example, to limit an alert for excessive table reads to fire only on weekdays during business hours, follow these steps:

1. Start by adding this block of code to the bottom of your alert.*.cfg file:

### scheduled alerts
#
# Schedule lines apply to all alerts up to the next schedule line
#
# If there is no "schedule" line then alerts default to "always"
#
#       schedule  weekday-list monthday-list start-time end-time
#       ========  ============ ============= ========== ========
#       schedule     2,3,4,5,6             *       6:00    18:00
#
# Weekday-List is a comma delimited list of weekday numbers where Sunday = 1
# weekday-List and monthday-list are currently ignored
# Start-time is the time to begin applying alerts in hh:mm:ss format
# End-time is the time to stop applying alerts in hh:mm:ss format


schedule 2,3,4,5,6 * 8:00 18:00

tblRd   num  > 100000 "" "hourly" "Reads exceeding 100k/sec &1 &2 &3" alert

#schedule 2,3,4,5,6 * 8:00 17:00

#tblRd   num  > 200 "" "60" "alert filter test &1 &2 &3" alert "tblName = _Field"

2. Find the similar alert definition that exists above this one and comment it out.

3. Save the file.

The pt3agent will re-read this alert configuration file at the next monitor interval (monInt) as defined in pt3agent.*.cfg.

This schedule says: if it is Monday, Tuesday, Wednesday, Thursday, or Friday, on any day of the month, and the time is between 8 AM and 6 PM, execute the alert definitions that follow, up to the next schedule line (if any).

CAVEAT: As noted in the commented code block above, the day of the week and day of the month are currently ignored. There is a fix in the works. The feature is still useful for setting a daily time window outside of which you will not get alerts from the definitions under that schedule (for example, to stay quiet at 3 AM when you know large reports are running).
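The effect of a schedule line, including the caveat, can be modelled with a short Python sketch. It is a sketch under assumptions rather than ProTop's logic: the function names are hypothetical, both ends of the window are treated as inclusive, and windows that cross midnight are not considered.

```python
from datetime import time


def parse_clock(text):
    """Turn '8:00' or '18:00:00' into a datetime.time."""
    parts = [int(p) for p in text.split(":")]
    parts += [0] * (3 - len(parts))  # missing minutes/seconds default to 0
    return time(*parts)


def parse_schedule(line):
    """Parse 'schedule weekday-list monthday-list start-time end-time'."""
    keyword, weekdays, monthdays, start, end = line.split()
    if keyword != "schedule":
        raise ValueError("not a schedule line")
    # weekdays and monthdays are read but not used, matching the caveat above
    return parse_clock(start), parse_clock(end)


def in_window(schedule_line, now):
    """True if `now` (a datetime.time) falls inside the schedule window."""
    start, end = parse_schedule(schedule_line)
    return start <= now <= end


line = "schedule 2,3,4,5,6 * 8:00 18:00"
print(in_window(line, time(12, 0)))  # midday
print(in_window(line, time(3, 0)))   # 3 AM
```

With the schedule from the example, a midday reading falls inside the window and a 3 AM reading falls outside it, regardless of the weekday.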