Failover

General Setup

In many environments the PBX is a critical service that requires a backup solution in the event that the primary server becomes unavailable. There are several ways to achieve this.

Choosing Suitable Hardware

Hardware reliability has improved considerably over the past years. Operators should take advantage of this to provide a stable foundation for a reliable PBX service. Because the PBX needs only modest hardware resources, it is highly recommended to use SSDs and redundant power supplies, and to operate the server in an environment that reduces the risk of hardware failure (controlled humidity and temperature).

It is not unreasonable to rely on such hardware without an automatic failover mechanism. There are reports of servers running for more than 18 years on the technology that was available when they were built. If a hardware failure is not expected for 10 years or more, the effort of setting up a secondary server with an automatic failover mechanism may outweigh the benefit for companies where a few hours of downtime are not mission critical. In such cases, it is critical to have an automatic file system backup, e.g. through cloud file system providers or network SAN devices. If the server breaks, the PBX can be installed on backup hardware, where the service can continue. This is a pragmatic approach that saves time and cost and keeps the damage limited. When choosing this approach, it is important to have a plan ready for the failover case in order not to get caught on the wrong foot when it happens.

Virtualization

In a virtualized environment, the host software may take care of hardware failures. The PBX has a relatively small footprint; periodic snapshots of the virtual machine can be restored in a short amount of time on secondary hardware, where it resumes operation. If this is done within a second, it is possible to keep existing TCP connections alive and ongoing calls connected. This kind of setup can achieve outstanding resilience against hardware failures and even makes it possible to swap out hardware while the PBX is running.

The virtualization solution does not require any specific setup change on the PBX itself.

External Failover Software

Various external solutions are available that take care of hardware failover, for Linux, Windows and other operating systems. These services essentially start up the secondary server when the first one becomes unavailable; using such software can be a good solution to increase the uptime of the service and reduce the failover time to a few minutes.

Using the PBX Failover Feature

When the PBX starts up, it can delay reading the configuration information. This is useful for secondary servers that should start up only when the primary server goes down. This way the secondary server can start with the last configuration that the primary was using, including up-to-date call records and mailbox messages.

The failover feature is included in the hosted PBX license.

The PBX uses a special path to store the failover information, which can be set with the command line option --serverdir <dir>. The filename itself is pbxctrl-failover.xml. The directory path tells the PBX where to read and store the information related to the failover. This way, the complete working directory can be kept in sync with the primary server. If the file system synchronization can make exceptions for the files, the PBX can also store the information in the working directory of the PBX.

The following image shows which options are available.

The current state may be one of the following states:

  • Starting: In this state the PBX is testing if the primary server is operating yet. Unless the primary server has been working before, the secondary server will not start counting failures.
  • Waiting: After the primary server has been found responding, the secondary server starts polling for failures.
  • Verifying: After a failure of the primary server has been detected, the secondary PBX needs to verify that it is itself still operational by checking the connectivity to a web server that is supposed to be up all the time.
  • Failover: In this state the secondary PBX is operating as the failover PBX.
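The four states above can be read as a small state machine. The following Python sketch is only an illustration of that reading, not the PBX's actual implementation. The function name, the rule that failures must be consecutive, the exact threshold comparison, and the return to Waiting after a failed verification are assumptions that the text above does not specify.

```python
from enum import Enum


class State(Enum):
    STARTING = "starting"
    WAITING = "waiting"
    VERIFYING = "verifying"
    FAILOVER = "failover"


def next_state(state, failures, primary_ok, validation_ok, tolerated_failures):
    """Return (new_state, new_failure_count) after one polling cycle."""
    if state is State.STARTING:
        # Failures are not counted until the primary has been seen working.
        return (State.WAITING, 0) if primary_ok else (State.STARTING, 0)
    if state is State.WAITING:
        if primary_ok:
            # Assumption: a successful poll resets the failure counter.
            return State.WAITING, 0
        failures += 1
        # Assumption: verification starts once the tolerated number is reached.
        if failures >= tolerated_failures:
            return State.VERIFYING, failures
        return State.WAITING, failures
    if state is State.VERIFYING:
        # The secondary takes over only if its own connectivity is confirmed.
        if validation_ok:
            return State.FAILOVER, failures
        # Assumption: if the validation fails, resume polling from scratch.
        return State.WAITING, 0
    # Once in Failover, this sketch stays there.
    return State.FAILOVER, failures
```

Keeping the transition logic in a pure function that performs no network access makes it easy to test each state separately from the polling itself.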

Failover Detection Parameters

Interval for checking server availability. The secondary server needs to poll the primary server periodically to check its availability. This setting controls how many seconds it waits between the tests. Shorter intervals result in faster failover detection, but also mean a higher load on the primary server.

Number of tolerated failures. Sometimes the checks fail for reasons that are not fatal. This setting controls how many failures the secondary PBX tolerates in order to avoid false alarms. Multiplied by the interval, this gives the time that it takes to detect a failure. Making this detection time too short results in false alarms.
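The relationship between the two settings is a simple product. The short Python sketch below works through a few combinations; the function name and the sample values are illustrative assumptions, not defaults taken from the PBX.

```python
def detection_time_seconds(interval_seconds, tolerated_failures):
    """Approximate time until an outage of the primary is detected."""
    return interval_seconds * tolerated_failures


# Illustrative values only; they are not defaults taken from the PBX.
for interval, failures in [(5, 3), (10, 3), (30, 2)]:
    seconds = detection_time_seconds(interval, failures)
    print(f"interval={interval}s, tolerated failures={failures} -> about {seconds}s")
```

For example, polling every 10 seconds with 3 tolerated failures gives a detection time of about 30 seconds, to which the time for the validation step still has to be added.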

URL of the primary server. This setting contains the URL of the primary server (e.g. http://192.168.1.2). It does not matter what content is returned, as long as the primary PBX returns a 200 OK. In order to reduce the load caused by the polling, using http instead of https is recommended. It is a good idea to use an IP address instead of a DNS name to make sure that a DNS server does not become the point of failure.

URL for validating the failover event. When the failure of the primary server is detected, the secondary server will use this URL to validate the event. It should point to a web site that is considered to be always on (e.g. http://vodia.com). This server needs to return a 200 OK; the page content is irrelevant.
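Both URLs are checked in the same way: only the status code matters. The Python sketch below shows one way such a check could look, using the standard library. It is not the PBX's own code; the function name, the timeout value, and the injectable opener parameter (added so the check can be exercised without network access) are assumptions.

```python
import http.client
import urllib.error
import urllib.request


def returns_200(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if a GET on url yields HTTP status 200; the body is ignored."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers refused connections, timeouts, DNS failures,
        # HTTP error statuses, and malformed URLs.
        return False
```

With the example addresses from the text, returns_200("http://192.168.1.2") would probe the primary server and returns_200("http://vodia.com") would perform the validation. Note that urlopen follows redirects, so a redirect that ends in a 200 response also counts as success in this sketch.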

Action URL in the Event of a Failover

When the secondary PBX makes the state transition from verifying to failover, it can issue a web request to an outside server. This can be used to trigger events, typically a change of the DNS address for the server. If there are multiple domains involved, it can make sense to use CNAME addresses for the domains and change just that one DNS record for the PBX.

When using the Action URL to change the DNS address of the system, you should make sure that you provision the domain name instead of the IP address of the server. See Outbound Proxy Provisioning for more information.

Action URLs are described on a separate page.
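The trigger condition for the Action URL can be sketched as follows: the request is issued only on the transition from verifying to failover. This Python example is a hedged illustration; the function name, the state strings, the plain GET request, and the injectable send parameter are assumptions, and the real request format is defined on the Action URL page.

```python
import urllib.request


def fire_action_url(old_state, new_state, action_url, send=None):
    """Issue one web request when moving from 'verifying' to 'failover'.

    Returns True if the request was issued, False otherwise.
    """
    if (old_state, new_state) != ("verifying", "failover"):
        return False
    if send is None:
        # Assumption: a plain GET is sufficient for the receiving server.
        def send(url):
            urllib.request.urlopen(url, timeout=10).close()
    send(action_url)
    return True
```

The receiving server would then perform the actual change, for example updating the single DNS record that the CNAME entries of the domains point to.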

License

The servers that are involved in the failover setup should use the same license activation code. The license must list all IP addresses that will be used by the primary and the secondary server. The license file can be part of the file system replication because it contains exactly the same content for all involved servers.

Servers that are on standby do not affect the metering for the license. Only active servers are counted in the hosted PBX metering. A failover will not impact the readout data.