Need to restart OE Replication?

More often than not, replication gets hung up due to a communication glitch in the network between the replication server running on the source database and the agent(s) running on the target database(s).

If that is the case, the simplest and quickest way to get replication up and running again is first to restart the replication agent(s), followed by restarting the replication server.

The Progress commands look like this (run with the ID that owns the database files):

dsrutil <target database> -C restart agent

(repeat for the second target if available)

dsrutil <source database> -C restart server
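The ordering rule above (every agent first, the server last) can be sketched as a small helper. This is only an illustration: it assumes the restart is issued through dsrutil, the same utility this article uses for the status and terminate commands, and the function name and database names are hypothetical. It only builds the command lines; each one still has to be run on the host that owns that database.

```python
def restart_commands(source_db, target_dbs):
    """Build the restart command lines: one per agent first, the server last."""
    commands = [f"dsrutil {db} -C restart agent" for db in target_dbs]
    commands.append(f"dsrutil {source_db} -C restart server")
    return commands


if __name__ == "__main__":
    # Hypothetical database names, for illustration only.
    for line in restart_commands("sports-src", ["sports-tgt1", "sports-tgt2"]):
        print(line)
```

Keeping the list in this order makes it harder to restart the server before an agent is ready for it.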

Once you have restarted the agents and server, check the status of each with the following:

dsrutil <database> -C status -verbose

You want to see the following:

Source:
6021: Normal Processing

Target:
3049: Normal Processing
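If you want to script this check, a minimal sketch follows. It assumes the status output contains the numeric code as a standalone four-digit number, as in the lines shown above; the exact layout of the verbose output is not confirmed here, and the function names are hypothetical.

```python
import re

# Status codes quoted in this article for a healthy source and target.
NORMAL_CODES = {"source": 6021, "target": 3049}


def status_code(output):
    """Return the first standalone four-digit number in the text, or None."""
    match = re.search(r"\b(\d{4})\b", output)
    return int(match.group(1)) if match else None


def is_normal(role, output):
    """True when the status text carries the normal-processing code for role."""
    return status_code(output) == NORMAL_CODES[role]
```

Feed it the captured text of the status command, for example is_normal("source", text).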

If replication communications have been down for a while, the target(s) may need some time to synchronize with the source database once communications are re-established. In that case, the status reported above may be "Startup Synchronization."

You can expect the source and target(s) to remain in this mode until all of your locked AI extents have been replicated to the target(s).

To see how many locked extents you have, run the following:

rfutil <source database> -C aimage list

Look for "locked" AI extents. The speed at which the number of locked AI extents decreases should give you an idea of how long the sync will take.
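Counting the locked extents by eye gets tedious when there are many. The sketch below counts them from the captured text of the aimage list command. It assumes each extent is reported with a line of the form "Status: <state>"; that layout is an assumption, so adjust the matching to whatever your version prints. The function name is hypothetical.

```python
def count_locked_extents(listing):
    """Count lines of the form 'Status: ...' whose value mentions Locked."""
    locked = 0
    for line in listing.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "status" and "locked" in value.lower():
            locked += 1
    return locked
```

Run it a few minutes apart and compare the counts to see how quickly the backlog is draining.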

Restarting OE Replication on Windows

An example Windows batch file is available at %PROTOP%\bin\restartrepl.batx. Copy and edit this template to restart a given replication server or agent without tying it to the user's login session. It creates a Windows service, runs the service, and then removes the service.
NOTE: This avoids the situation where you must keep a command window open to prevent the service from being killed when the user who started it logs out.

The script prints some warnings/errors, which can usually be ignored. The restart is still executed.

NOTE: No OE Replication Plus on your target server, and therefore no ProTop installed there? Not to worry. You can still create the same scripts: install them where your other DBA scripts live and change the reference to PROTOPENV.

To set this up:
On a DEV/TEST SOURCE database box where you have ProTop installed:
1. copy bin\restartrepl.batx to restartrepl-srv.bat (name it as you see fit)
2. edit restartrepl-srv.bat 
3. set the DB environment variable to point explicitly to the source database and save
 
On the DEV/TEST TARGET database box(es) where you have ProTop installed:
1. copy bin\restartrepl.batx to restartrepl-agt1.bat (name it as you see fit)
2. edit restartrepl-agt1.bat 
3. set the DB environment variable to point explicitly to the target database
4. change "restart server" to "restart agent" and save (repeat for 2nd target if you have one)

To test this in a DEV/TEST environment to verify it is working as expected:
1. Log in to the test TARGET server(s) with admin privileges and execute %PROTOP%\bin\protopenv. This will set the ProTop and Progress variables needed for the status command below.
2. Run:  
dsrutil <target dbname> -C status -verbose
Having not changed anything yet, it should show 3049 "normal processing."
3. To fake a glitch in communications, run:
 dsrutil <target dbname> -C terminate agent 
Check the status (step 2 above). It will be ending or ended.
4. Run the new restartrepl-agt1.bat to restart the agent(s). Then rerun the status command to verify it is "listening." It will now wait for the server to be restarted.
5. Log into the SOURCE db box with admin privileges and execute %PROTOP%\bin\protopenv.
6. Run:
dsrutil <source dbname> -C status -verbose
Having not changed anything yet, it should show 6021 normal processing.

7. To fake a comms glitch run:

dsrutil <source dbname> -C terminate server
Check the status (step 6 above). It will be ending or ended.
8. Run the new restartrepl-srv.bat and verify it restarts the replication server by rerunning the status command (step 6 above), confirming it is showing 6021 "normal processing."
It may take a few seconds to return to normal processing as there are several states the server and agent(s) go through on their way back to normal processing.
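Because the server and agent(s) pass through several states, a script that checks the status once right after a restart can report a false failure. A minimal polling sketch follows. It assumes you supply get_code, a hypothetical callable that runs the status command and returns the numeric code; the timeout and interval defaults are arbitrary.

```python
import time


def wait_for_status(get_code, expected, timeout=120.0, interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_code() until it returns expected or timeout seconds elapse."""
    deadline = clock() + timeout
    while True:
        if get_code() == expected:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

For example, wait_for_status(read_source_code, 6021) returns True once the source reports normal processing, where read_source_code is your own wrapper around the status command. After a long outage, the timeout would need to be much larger to cover startup synchronization.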

On to production ...


Deploy the new scripts to your production source and target db servers, and edit the DB names accordingly. Plan some time to test this in your production environment before you must rely on it!

Now, when needed in production, perform the restart in this order:
1. Log into the TARGET db server(s) as an admin and run your restartrepl-agt1.bat.
2. Log into the SOURCE db server as an admin and run your restartrepl-srv.bat.
The replication server should connect to your agent(s), and depending on how long replication has been off, it might take a while for the server and agents to return to "normal processing." Their status will be "startup synchronization" until the sync is complete.  
If in doubt, you can always run (as above):
dsrutil <[source | target] dbname> -C terminate [server | agent]
Then run the respective restart command.
Always kill/restart the agent(s) first, then the server.
If replication has been down for a while and you are in "startup synchronization," you can see how far the sync has to go. On the source db server, run:
rfutil <source db name> -C aimage list
Look for "locked" AI extents. The speed at which the number of locked AI extents decreases should give you an idea of how long the sync will take.
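To turn that observation into a rough number, take two counts of locked extents a known time apart and extrapolate linearly. The sketch below does only that arithmetic. It assumes the drain rate stays constant, which it may not if the source keeps filling new extents, and the function name is hypothetical.

```python
def estimate_minutes_remaining(earlier_count, later_count, minutes_between):
    """Linear estimate of minutes until no locked extents remain.

    Returns None when the count is not shrinking, since no estimate is possible.
    """
    drained = earlier_count - later_count
    if drained <= 0 or minutes_between <= 0:
        return None
    rate = drained / minutes_between  # extents cleared per minute
    return later_count / rate
```

As an illustration with made-up numbers: 40 locked extents dropping to 30 over five minutes is a rate of 2 per minute, so the remaining 30 would take about 15 minutes.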

Other resources

Progress article on How to restart the Replication Server under the Local System Account.