
Pro2 Monitoring

ProTop can monitor Pro2 SQL replication when the Pro2Queues data collector is added to ptInitDC. Refer to the list of alertable metrics for Pro2 Queues to see which fields can be monitored and alerted on.

NOTICE: If you have previously used the "Pro2Activity" data collector for alerting, switch to "Pro2Queues". The "Pro2Activity" data collector will be removed in a future release.

Configuration steps:

  1. Edit etc/pt3agent.[*].cfg
  2. Add “Pro2Queues” to ptInitDC (see the example below)
  3. Restart the agent by removing tmp/pt3agent.[*].flg. The dbmonitor will restart it shortly.
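
A minimal sketch of steps 2 and 3, assuming the friendly name “friendlyName” and the collector list shown in the log excerpt below (the exact layout of the ptInitDC entry in your .cfg file may differ from installation to installation - just append Pro2Queues to whatever list is already there):

ptInitDC DBId,Dashboard,Configuration,TableActivity,IndexActivity,LatchActivity,ResourceWaits,StorageAreas,RemoteServerActivity,ReplAgent,Blocked,ActiveTRX,UserIOActivity,df,OSInfo,Pro2Queues

rm tmp/pt3agent.friendlyName.flg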

Verification:

To verify that “Pro2Queues” is in use, look for three things in the logs. They can all be checked at once with this command on Unix (a FINDSTR equivalent for Windows is shown after the log excerpt):

grep -i pro2q log/pt3agent.friendlyName.log | more

2021/06/19 00:01:56.998-04:00 friendlyName 60 DBId,Dashboard,Configuration,TableActivity,IndexActivity,LatchActivity,ResourceWaits,StorageAreas,RemoteServerActivity,ReplAgent,Blocked,ActiveTRX,UserIOActivity,df,OSInfo,Pro2Queues

2021/06/19 00:01:57.483-04:00 dc/pro2qmon.p has been initialized as Pro2Queues

2021/06/19 00:02:25.959-04:00 postData: pro2q 5 records, 6 lines, length= 401
. . .

The final line is hopefully repeated many times. That is the line in the log file showing that the pro2q data is actually being sent to the portal.
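
On Windows, a roughly equivalent check (assuming the same relative log path) would be:

findstr /i pro2q log\pt3agent.friendlyName.log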

Background:

Procedure dc/pro2mon.p is the old data collector that was previously bound to the “2” key in ProTop and known as “Pro2Activity” (pt3agent.p can still use it but protop.p cannot). Procedure dc/pro2qmon.p is the new data collector bound to “2” and known to pt3agent as “Pro2Queues”.

Here's an example of the Pro2 Queue panel brought up when the "2" key is invoked in ProTop. Each column is described below.

[Image: Pro2 Queue panel]

Each queue (QNum) is monitored individually.

The only known Status is “Running”.

Possible values of Action are:

  • “Delete”, which means that replQueue records are deleted after processing

  • “Mark”, which means that a flag is set but the record is retained after being replicated

The Enabled, Disabled, Paused, and Orphans columns show the number of tables in each of those categories.

Orphaned means that a record was found in the queue for a table that is NOT listed as belonging to that queue. This might happen if someone changes their mind and removes a table from Pro2. Orphans can disrupt the queue record counts and processing, so they need to be addressed when they occur; that is why there is an alarm for orphans > 0 (see below).

If you do have “orphan” replQueue records, you can add PRO2QSKIP2SEQ=n to your bin/localenv file, where "n" is determined by running ad-hoc queries on replQueue. This will skip past the orphans and start counting at the next sequence beyond the orphans.
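
For example (the sequence value here is purely illustrative - the real value must come from your own ad-hoc queries against replQueue):

PRO2QSKIP2SEQ=1234567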

Depth is the number of records in that queue that are waiting to be processed. It is calculated in one of three ways; the single-character column to the right tells you how it was calculated:

  • x = "exact". We counted each record. The bin/localenv variable PRO2QCOUNTLIM defines how large a queue can be and still be counted record by record; the default is 1,000 (see the example after this list).

  • 2 = “Pro2”. That means we use the values that Pro2 calculates internally. These are only updated every 30 minutes, so they are very stale. We prefer not to use these values, but may fall back to them when counting ourselves would be too expensive.

  • p = “proportional”. We look at the oldest record in the queue and compare its sequence to the next value that Pro2 will pull from the db sequence it is using, assuming that records are randomly distributed across the queues. The PRO2QESTIMATE variable can be used to suppress proportional estimates, in which case the Pro2 values are used instead. Proportional queue depth is not 100% accurate (although it has been within 1% in testing) but it is very fast to calculate.
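
As an illustration, raising the exact-count limit in bin/localenv might look like this (the value is an example; the entry follows the same style as PRO2QSKIP2SEQ above):

PRO2QCOUNTLIM=5000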

The Queue Lag is the age of the oldest record in the queue. By default we alert at 30 minutes and alarm at one hour (see below). Some users may prefer more aggressive alerts for lag time; use zLagTime as the alert metric. zLagTime is seconds as an integer rather than an hh:mm:ss string and is therefore much easier to write alerts for. If the oldest record is more than a day old, the number of days appears to the left of the hh:mm:ss and zLagTime will be a very large number of seconds. Don’t worry, zLagTime is an int64 ;)
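
For example, a more aggressive pair of etc/alert.cfg entries (these thresholds are illustrative; the format matches the default entries shown at the end of this article) might be:

zLagTime num > 600 "" "hourly" "&1 &2 &3" alert
zLagTime num > 1800 "" "hourly" "&1 &2 &3" alarm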

At times very small queue depths may show a lag of zero (or blank) and an oldest table of n/a. This is because the queue catches up while ProTop is in the process of collecting data about it. This is not considered to be a problem.

The Oldest Table in the queue is for informational purposes only. It is unlikely to make sense as an alert metric, but it does give a sense of which table’s records a queue is processing.

Ditto the Source DB. The source db name is the “Pro2 name” for the db; it is neither the ProTop friendly name nor the db physical name. This name might be handy if you need to talk to a Pro2 admin or use the Pro2 admin console.

The Pro2 schema is not in every database, and the replicated data might come from multiple databases, so the data collector uses dynamic queries. It will complain politely if you try to run it against a database that does not support Pro2:

[Image: "no pro2 schema" message]

Default alerts in etc/alert.cfg:

rq_pausedTbls num > 0 "" "daily" "&1 &2 &3" alert
rq_orphanTbls num > 0 "" "daily" "&1 &2 &3" alarm

rq_qStatus char <> "Running" "" "hourly" "&1 &2 &3" alert

rq_Depth num > 100000 "" "hourly" "&1 &2 &3" alert
rq_Depth num > 1000000 "" "hourly" "&1 &2 &3" alarm

zLagTime num > 1800 "" "hourly" "&1 &2 &3" alert
zLagTime num > 3600 "" "hourly" "&1 &2 &3" alarm