How to Respond to "No Heartbeat" Alerts

The "no heartbeat" alert means that the ProTop agent is not sending data to the ProTop web portal. Here's what you should do.

Alert Text

No heartbeat from: site.resource since 2021-10-05T07:42:30.392Z Sensitivity: 1200 seconds

Alert generated: 2021/10/05 08:47:01-0400
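
To see how long a resource had been silent when the alert fired, the two timestamps in the alert text can be compared. The following is a minimal Python sketch, assuming the alert text always follows the layout shown above; the function names parse_alert and parse_generated are hypothetical and are not part of ProTop.

```python
import re
from datetime import datetime

# Pattern for the alert line shown above (the layout is assumed from that one sample).
ALERT_RE = re.compile(
    r"No heartbeat from: (?P<resource>\S+) since (?P<since>\S+) "
    r"Sensitivity: (?P<sensitivity>\d+) seconds"
)


def parse_alert(text):
    """Return (resource, last_heartbeat, sensitivity_seconds) from the alert text."""
    match = ALERT_RE.search(text)
    if match is None:
        raise ValueError("not a no-heartbeat alert")
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    since = datetime.fromisoformat(match.group("since").replace("Z", "+00:00"))
    return match.group("resource"), since, int(match.group("sensitivity"))


def parse_generated(stamp):
    """Parse the 'Alert generated' timestamp, e.g. '2021/10/05 08:47:01-0400'."""
    return datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S%z")


resource, since, sensitivity = parse_alert(
    "No heartbeat from: site.resource since 2021-10-05T07:42:30.392Z "
    "Sensitivity: 1200 seconds"
)
generated = parse_generated("2021/10/05 08:47:01-0400")
silent_seconds = (generated - since).total_seconds()
print(resource, sensitivity, silent_seconds)
```

Both timestamps carry a UTC offset, so the subtraction is correct even though one is in UTC and the other in local time.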

Description

ProTop relies on monitoring agents (pt3agent) to gather telemetry data, generate alerts and upload all this data to one of our global web portals. If data stops flowing for whatever reason, a "no heartbeat" alert is generated.

The root cause is typically one or more of the following:

- The pt3agent cannot HTTP POST to the ProTop web portal. This could be a firewall issue or an Internet connectivity issue.
- The pt3agent cannot connect to the monitored database
- The ProTop service has been stopped and/or disabled
- The pt3agent is unresponsive

Portal Configuration (superadmins only)

A portal setting named "HB_NAG_FREQUENCY" specifies how much time must pass before a new "no heartbeat" alert is generated for the same resource.
If no value is provided for this setting, the default of "30m" is used. Values are given in Jira format, for example, "1h 25m 3s".
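
To illustrate what a value such as "1h 25m 3s" amounts to, here is a minimal Python sketch that converts it to seconds. It is not the portal's parser: the function name parse_duration is hypothetical, and it assumes only the h, m and s units that appear in the examples above.

```python
import re

# Seconds per unit; only the units seen in the examples above are handled (an assumption).
UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
TOKEN_RE = re.compile(r"(\d+)([hms])")


def parse_duration(value):
    """Convert a duration such as '1h 25m 3s' to a number of seconds."""
    tokens = value.split()
    if not tokens:
        raise ValueError("empty duration")
    total = 0
    for token in tokens:
        match = TOKEN_RE.fullmatch(token)
        if match is None:
            raise ValueError(f"unsupported duration token: {token!r}")
        total += int(match.group(1)) * UNIT_SECONDS[match.group(2)]
    return total


print(parse_duration("1h 25m 3s"))  # 3600 + 1500 + 3 = 5103
print(parse_duration("30m"))        # the default value: 1800
```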

Corrective Action(s)

1. Check if the ProTop processes are running

Look for _progres processes running "-p util/pt3agent.p" in Task Manager (Windows) or using "ps -ef" (*nix). There could be many, with the "-param" value corresponding to the unique agent name. For example:

_progres -b -p util/pt3agent.p -param abcprod -pf etc/protop.pf ...
_progres -b -p util/pt3agent.p -param xyzprod -pf etc/protop.pf ...

If you see a pt3agent running for the resource that generated the heartbeat alert, skip to step 3.
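
On a server with many agents, scanning the process list by eye is error-prone. The following Python sketch extracts the agent names from process-listing text; it assumes command lines shaped like the two examples above, and the function name running_agents is hypothetical.

```python
import re

# Matches "-p util/pt3agent.p ... -param <name>" (an optional "./" prefix is tolerated).
AGENT_RE = re.compile(r"-p\s+(?:\./)?util/pt3agent\.p\b.*?-param\s+(\S+)")


def running_agents(ps_output):
    """Return the -param value of every pt3agent found in the process listing."""
    names = []
    for line in ps_output.splitlines():
        if "_progres" not in line:
            continue  # not a 4GL client process
        match = AGENT_RE.search(line)
        if match:
            names.append(match.group(1))
    return names


sample = """\
_progres -b -p util/pt3agent.p -param abcprod -pf etc/protop.pf ...
_progres -b -p util/pt3agent.p -param xyzprod -pf etc/protop.pf ...
_progres -1 -b -p ./util/dbmonitor.p -param siteName ...
"""
print(running_agents(sample))
```

On *nix, the text could come from subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout. If the resource named in the alert is missing from the result, continue with step 2.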

2. Check if the DB Monitor process is running

The DB Monitor is the master ProTop process. Among other things, it is responsible for starting the monitoring agents.

Look for an _progres process running "-p util/dbmonitor.p". For example:

_progres -1 -b -p ./util/dbmonitor.p -param siteName ...

If it is not running, start it. On Windows, look for the ProTop3 DB Monitor service. On *nix, the DB Monitor is started with $PROTOP/bin/dbmonitor.sh, usually via a cron job.

Sometimes DBAs comment out the cron job or stop the Windows service while doing maintenance, then forget to enable the service after the maintenance is complete.

3. Check the pt3agent log files

If the pt3agents seem to be running, check their log files for error messages.

Each pt3agent has three log files: $PROTOP/log/pt3agent.agentname.log, $PROTOP/tmp/pt3agent.agentname.err and $PROTOP/tmp/pt3agent.agentname.debug.

Some things to look for:

What is the time stamp of the last entry in the pt3agent.agentname.log file? If it was some time ago, the agent may be hung. For example, with some older versions of OpenEdge, shutting down a DB may leave connected processes hung.
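
A quick way to spot a quiet log is to check how long ago the file was last written. The Python sketch below uses the file's modification time as a stand-in for the time stamp of the last entry, which is an assumption. The function names and the threshold are hypothetical; choose a threshold that suits how often your agents normally log.

```python
import os
import time


def seconds_since_last_write(log_path, now=None):
    """Return how many seconds ago the log file was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(log_path)


def looks_hung(log_path, max_quiet_seconds, now=None):
    """True if the log has been quiet for longer than max_quiet_seconds."""
    return seconds_since_last_write(log_path, now) > max_quiet_seconds
```

For example, looks_hung(os.path.expandvars("$PROTOP/log/pt3agent.abcprod.log"), 1200) reports whether the abcprod agent's log has been silent for more than 20 minutes on *nix.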

First, remove $PROTOP/tmp/pt3agent.agentname.flg and wait 1 minute. If the pt3agent is alive, it will notice the absence of the flag file and auto-terminate.

Otherwise, run $DLC/bin/proGetStack <agent pid> (works on both Windows and *nix). This generates a protrace.pid file in the working directory of the process and occasionally wakes the process up. Afterward, check again whether the process has disappeared.

Please send the log files and protrace file to support@wss.com.

Any 4GL error messages? Our expert development team here at White Star Software never allows any bugs to pollute their beautiful programs, but occasionally an evil DBA will foolishly think that they can write code, resulting in unsanctioned changes.

In this case, remove $PROTOP/tmp/pt3agent.agentname.flg and wait 1 minute. The DB Monitor should respawn a fresh agent.

Please send the log files to support@wss.com.

Any HTTP(S) error messages? Internet connectivity and firewall changes are the most common cause of "no heartbeat" alerts.

Follow the instructions under "No Data in the Web Portal" in our Troubleshooting Guide.

Killing the agent

If all else fails, you can use $PROTOP/bin/killprosession.sh to kill the agent process (or any shared memory 4GL client). Please read the Killprosession page first, as it needs to be activated before use.

There is no safe way to kill a shared memory process on Windows, as the "End Task" button is the equivalent of a "kill -9" on *nix.

On Windows, the agent PID in the .flg file can point to the wrong process

If ProTop cannot clean up properly when shutting down, for example because of an unplanned server reboot, the pt3agent's .flg file is not removed. When the server restarts, the dbmonitor service may see the .flg file, find a _progres process running with the PID recorded in it, and conclude that the agent for that resource is still running.

However, especially on a busy server with many local _progres sessions, the OS may have reassigned the PID in the old .flg file to a different _progres process. This hides the fact that the agent in question is not running, so the portal keeps generating heartbeat alerts for that resource.

The way to fix this is to simply remove the old .flg file and let the dbmonitor service start the agent.
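
Before removing a .flg file, it can help to confirm that the PID it records really belongs to a different process. The Python sketch below shows the comparison only and is not how the dbmonitor service performs its check. The function name flag_is_stale is hypothetical. The caller supplies the PID read from the .flg file and a mapping of PID to full command line, gathered by whatever means is available on the server (for example, the Command line column in Task Manager).

```python
def flag_is_stale(flag_pid, agent_name, command_lines):
    """Return True if flag_pid is not a running pt3agent for agent_name.

    command_lines maps each running PID to its full command line.
    """
    command = command_lines.get(flag_pid)
    if command is None:
        return True  # no process has this PID any more
    tokens = command.split()
    # Normalise Windows path separators before looking for the agent program.
    is_agent = any(t.replace("\\", "/").endswith("util/pt3agent.p") for t in tokens)
    try:
        param = tokens[tokens.index("-param") + 1]
    except (ValueError, IndexError):
        param = None  # no -param argument on this command line
    return not (is_agent and param == agent_name)


# Hypothetical PIDs and command lines, for illustration only.
processes = {
    4120: "_progres -b -p util/pt3agent.p -param abcprod -pf etc/protop.pf",
    5200: "_progres -b -p some/other.p",
}
print(flag_is_stale(5200, "xyzprod", processes))
```

In the illustration, PID 5200 is a _progres process but not the xyzprod agent, which is exactly the situation described above.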

Everything looks good, but I'm still getting the occasional "No Heartbeat" alert

The HTTP response from the portal may be excessively delayed by heavy network traffic. Try setting WAITLIMIT=10 in your bin/localenv file and restarting ProTop. The default is 5 seconds, which may be too short for some systems.