Summary of 10/26 Service Interruption

We sincerely apologize for this service interruption. We know that you have many choices for your phone service, and we deeply appreciate your patience and understanding during yesterday's interruption. The full details of the service issue are below.

One of the first things we do when a service issue occurs is update our Network Alert blog and Twitter page with as much information as we have at that time. We then post comments on the original post as we learn more. Our Network Alert blog is here:
http://www.junctionnetworks.com/blog/category/network-alerts

Our Twitter account is:
http://www.twitter.com/onsip

As a rule, Junction Networks maintains three different types of maintenance windows:
1.) Weekend - early morning: The maintenance performed will produce a service disruption and could affect multiple systems.
2.) Weekday - early morning: The maintenance performed may produce a service disruption, but is isolated to a single system.
3.) Intra-day: The work performed should not affect our customers.
No maintenance, even maintenance known to cause a service disruption, is expected to disrupt service for more than a fraction of a second. For anything that would cause a more serious disruption (one second or more), backup services are swapped in to take the place of the system under maintenance.

On October 26th around 10:10am, the firewall on a database server was restarted in order to update firewall rules. This maintenance has been performed intra-day in the past without issue. This time, an authorization sub-system attempted to contact the database through the firewall at exactly the moment the firewall was down, which caused the authorization sub-system to hang in a non-responsive state. None of the PSTN gateway machines could communicate with the authorization sub-system, so they rejected all calls to and from the PSTN. It took approximately 20 minutes to diagnose the problem and restart the authorization sub-system. Once the authorization sub-system was restarted, service returned to normal.
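
A common general safeguard against this kind of hang is to give every database connection attempt a timeout and a small, bounded number of retries, so that the caller fails fast instead of blocking indefinitely. The Python sketch below is illustrative only and is not the code that runs in the authorization sub-system; the function names, timeout, and retry values are all assumptions.

```python
import socket
import time


def open_db_socket(host, port, timeout_s=2.0):
    """Open a TCP connection that gives up after timeout_s seconds
    instead of blocking forever (hypothetical helper)."""
    return socket.create_connection((host, port), timeout=timeout_s)


def connect_with_retry(connect, attempts=3, delay_s=0.5, sleep=time.sleep):
    """Call connect() up to `attempts` times, pausing between failures.

    Raises ConnectionError if every attempt fails, so the caller can
    report an error rather than hang.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except OSError as exc:  # covers timeouts and refused connections
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay_s)
    raise ConnectionError(
        f"database unreachable after {attempts} attempts"
    ) from last_error
```

A caller would wrap the real connection step, for example `connect_with_retry(lambda: open_db_socket(host, port))`, and treat `ConnectionError` as a signal to return an error or retry later.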

Also on October 26th, around 11:15am, a main database server started filling the monitoring system with error messages. The monitoring system produced so many alarms that the problem was initially diagnosed as either data corruption or a breakdown in the monitoring system itself. The system was so swamped with error messages that it took some time to find where the problem actually was. We finally determined that the RAID system on the main database server had caused the box to freeze. While the timing of the two problems suggests they were related, the nature of each problem was very different.
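
When a monitoring system is flooded like this, one general technique is to collapse repeated alarms into a per-source summary so that the noisiest host stands out. The Python sketch below illustrates that idea only; it does not describe our monitoring system, and the alarm fields (`host`, `check`) are hypothetical.

```python
from collections import Counter


def summarize_alarms(alarms):
    """Collapse a flood of alarms into (host, check, count) rows,
    most frequent first. Each alarm is a dict with hypothetical
    'host' and 'check' keys."""
    counts = Counter((a["host"], a["check"]) for a in alarms)
    return [(host, check, n) for (host, check), n in counts.most_common()]
```

Sorting by count puts the source generating the most alarms at the top of the list, which is usually the first place to look.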

At 11:30am, we began the process of migrating to our backup database server. The backup database server is designed to function as a "hot-swap" to which we can switch over without having to restore any data, as it syncs itself in real time. Unfortunately, the backup system had not been syncing since 10:10am, when the first problem occurred, and we did lose some data between 10:10am and 11:15am. By 11:40am, we had migrated the majority of the system over to the backup database server. There were a few hiccups after that - most notably, phone registration data was 'old' and caused the system to think that some phones had lost registration. That state was corrected when phones either re-registered on their own or were rebooted.
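
A stalled sync like this one can be caught early by alerting whenever the backup's last successful sync is older than a threshold. The Python sketch below shows the general idea under stated assumptions: the function name and the 60-second threshold are hypothetical, and how the last sync time is obtained depends on the database in use.

```python
from datetime import datetime, timedelta


def replica_is_stale(last_sync, now, max_lag=timedelta(seconds=60)):
    """Return True when the backup's last successful sync is older
    than max_lag, meaning a switch-over would lose recent data."""
    return now - last_sync > max_lag
```

Run on a schedule, a check like this would raise an alarm within minutes of the backup falling behind, rather than the gap being discovered at switch-over time.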

Remedies:

Once the system was again stable, we sent an engineer to our datacenter to reboot the problematic database server. It came back up without incident and left no indication that there was ever a problem - it appears to be completely fine. Regardless, our engineers prepared new hardware to replace this server.

Around 11:00pm on October 26th, we finished bringing up a new database server (completely new physical hardware). It is now configured and synced as a hot-swap backup to the backup database server we migrated to during the incident. This weekend, the new server will become the primary database server. We will perform this maintenance during a weekend maintenance window, but we do not expect any service disruption from this operation.