Replication Monitoring

Configuration:

  1. Edit etc/pt3agent.[*].cfg
  2. Add the “replAgent” data collector to ptInitDC
  3. Restart the agent by removing tmp/pt3agent.[*].flg. The dbmonitor will restart the agent shortly.
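The restart in step 3 could be scripted along the following lines. This is only a sketch: the function name, the `protop_dir` argument, and the reading of `[*]` as a per-agent name are assumptions made for illustration, not part of ProTop itself.

```python
from pathlib import Path


def restart_agent(protop_dir: str, agent_name: str) -> bool:
    """Remove tmp/pt3agent.<agent_name>.flg under the ProTop directory.

    Per the steps above, dbmonitor is expected to restart the agent
    once the flag file is gone. `agent_name` is a hypothetical stand-in
    for the [*] part of the file name.

    Returns True if a flag file was removed, False if none existed.
    """
    flag = Path(protop_dir) / "tmp" / f"pt3agent.{agent_name}.flg"
    try:
        flag.unlink()
        return True
    except FileNotFoundError:
        # Nothing to remove: the agent was not running or was already flagged.
        return False
```

Returning a boolean rather than raising makes the helper safe to call repeatedly from a deployment script.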

The agent on the source database uses the repl.properties file to connect to one target database.

NOTE: The source agent can connect to only one replication target database. Connecting to both targets is a planned enhancement.

If extra parameters are required to connect to the target database (e.g. -U and -P), copy [PROTOPDIR]/etc/replx.pfx to [PROTOPDIR]/etc/replx.pf and add the parameters to that .pf file. There is only one global replx.pf.
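As a rough illustration of that copy-and-edit step, the sketch below creates replx.pf from the replx.pfx template and appends extra connection parameters. The function name and the example values are hypothetical; only the two file names come from the text above.

```python
import shutil
from pathlib import Path
from typing import List


def create_replx_pf(protop_dir: str, extra_params: List[str]) -> Path:
    """Copy etc/replx.pfx to etc/replx.pf (if missing) and append parameters.

    Each entry of `extra_params` is written on its own line.
    Note that every call appends, so calling it twice duplicates the lines.
    """
    etc = Path(protop_dir) / "etc"
    template = etc / "replx.pfx"
    target = etc / "replx.pf"

    # Keep an existing replx.pf; only copy the template the first time.
    if not target.exists():
        shutil.copyfile(template, target)

    text = target.read_text(encoding="utf-8")
    # Make sure the appended parameters start on a fresh line.
    if text and not text.endswith("\n"):
        text += "\n"
    text += "".join(param + "\n" for param in extra_params)
    target.write_text(text, encoding="utf-8")
    return target
```

A call might look like `create_replx_pf("/path/to/protop", ["-U someuser", "-P somepassword"])`, where the path and credentials are placeholders.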

[This connection is currently used to calculate the lag between the source and the target, but this information has proven unreliable in OE 10 and 11. Its reliability in OE 12 has not yet been determined.]

replagent.p

Alert metrics are included in the default etc/alert.cfg (see the top of that file for more detail on these parameters):

Metric Name    Type  Operator  Threshold  Sensitivity  Frequency  Message and Parameters                     Action
trxBehind      num   >         1000       ""           "hourly"   "Replication lag &1 &2 &3"                 alert  # verify usefulness
agentCommStat  num   <>        1          ""           "hourly"   "Replication agent comm status &1 &2 &3"  alert
agentStatus    num   <>        3049       "5:5"        "28800"    "Replication agent status &1 &2 &3"        page
agentStatus    num   <>        3049       "3:3"        "28800"    "Replication agent status &1 &2 &3"        alarm
agentStatus    num   <>        3049       ""           "hourly"   "Replication agent status &1 &2 &3"        alert
picaFree       num   <         50000      ""           "hourly"   "&1 &2 &3"                                 alarm
picaUsed       num   >         1000       ""           "hourly"   "&1 &2 &3"                                 alarm
picaUsed       num   >         0          ""           "hourly"   "&1 &2 &3"                                 alert
picaUsedPct    num   >         30         ""           "hourly"   "&1 &2 &3"                                 alarm
picaUsedPct    num   >         5          ""           "hourly"   "&1 &2 &3"                                 alert

  • agentCommStat: a value of 1 means the agent is connected; 2 means it is disconnected.
  • agentStatus: the status value reported by “dsrutil [db] -C status -detail”. The desired value is 6021 on the source and 3049 on the target. Your etc/alert.[*].cfg should always check for “num <> 3049”.

Note that even on the source side, ProTop reads the agent status from the _repl-AgentControl VST, which is why the alert always compares against 3049 rather than 6021.
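To make the column layout of these alert definitions concrete, here is a minimal sketch that splits one definition line into its eight fields and checks a sample value against the operator and threshold. It is not ProTop's own parser: the names `AlertRule`, `parse_rule` and `fired_actions` are invented for illustration, only the three operators that appear above are handled, and the sensitivity and frequency fields are carried along without being interpreted (their meaning is described at the top of etc/alert.cfg).

```python
import operator
import shlex
from typing import List, NamedTuple

# Only the operators that appear in the excerpt above.
OPERATORS = {">": operator.gt, "<": operator.lt, "<>": operator.ne}


class AlertRule(NamedTuple):
    metric: str
    kind: str
    op: str
    threshold: float
    sensitivity: str
    frequency: str
    message: str
    action: str


def parse_rule(line: str) -> AlertRule:
    """Split one alert definition into fields, ignoring a trailing # comment."""
    fields = shlex.split(line, comments=True)
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    metric, kind, op, threshold, sensitivity, frequency, message, action = fields
    return AlertRule(metric, kind, op, float(threshold),
                     sensitivity, frequency, message, action)


def fired_actions(rules: List[AlertRule], metric: str, value: float) -> List[str]:
    """Return the actions of every rule for `metric` whose comparison is true."""
    return [r.action for r in rules
            if r.metric == metric and OPERATORS[r.op](value, r.threshold)]


rules = [
    parse_rule('agentStatus num <> 3049 "" "hourly" "Replication agent status &1 &2 &3" alert'),
    parse_rule('picaUsedPct num > 30 "" "hourly" "&1 &2 &3" alarm'),
    parse_rule('picaUsedPct num > 5 "" "hourly" "&1 &2 &3" alert'),
]
```

With these three rules, a picaUsedPct of 10 satisfies only the "> 5" rule, and an agentStatus of 3049 satisfies none.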

The meaning of other values is available in the Progress documentation, including:

3048: Startup synchronization
3050: Recovery synchronization
3051: Online backup of the target DB
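For scripts that post-process alerts, the codes listed here can be kept in a small lookup table. The sketch below contains only the values mentioned on this page; the function names are hypothetical, and any other code is reported as unknown rather than guessed.

```python
# Agent status codes mentioned on this page; not a complete list.
AGENT_STATUS = {
    3048: "Startup synchronization",
    3049: "Desired value (the value the alert checks for)",
    3050: "Recovery synchronization",
    3051: "Online backup of the target DB",
}


def describe_agent_status(code: int) -> str:
    """Return a short description, or point to the Progress documentation."""
    return AGENT_STATUS.get(
        code, f"Unknown status {code}; see the Progress documentation"
    )


def is_desired_status(code: int) -> bool:
    """Mirror the 'num <> 3049' check: anything other than 3049 is a deviation."""
    return code == 3049
```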

There are a few known bugs in OE Replication where the VST data is incorrect. In those cases, we rely on secondary alerts to reveal a problem that the replAgent data collector does not report. For example, the number of locked AI files will increase and the “ai_Locked” metric in alert.cfg will fire an alert; numerous database log file alerts may also fire.