Webhook Delays 5/10 (RESOLVED)
5/10/2012 5:19PM EST
Webhooks have returned to being nearly instantaneous.
5/10/2012 5:09PM EST
Webhooks are currently running on a slight delay. We have added extra capacity and they are now recovering. We will post again when they are fully recovered.
Authorize.Net Partial Outage (RESOLVED)
5/2/2012 11:30AM EST: The issue at Authorize.Net was resolved soon after we posted about it. It is very difficult to get direct information related to outages from Authorize.Net, but our monitoring shows that the issue began at 1:41PM EST and was resolved at 2:28PM EST. We did see an additional small blip of connection issues with Authorize.Net between 6:22PM EST and 6:43PM EST yesterday (5/1/2012) but things have been clean since then.
Original Post:
There is a partial outage at the Authorize.Net payment gateway. It appears to have been going in and out of availability for the last few hours. This will have the following effects:
- Paid signups from your customers may fail with a message such as “The connection to the remote server timed out”. Your customers will see that the signup failed, and their card will not be charged. They will have to try again at a later point in time.
- Renewals to your subscriptions may fail with a message such as “The connection to the remote server timed out”. The subscription will move to the “Past Due” state but will retry within 24 hours. Depending on the exact error and your dunning settings, affected customers may receive a failed payment email.
We will continue to monitor the situation. We will also be working to improve our error handling so that failed payment emails are not sent in situations such as this.
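For those interested, the error handling improvement described above could look roughly like the sketch below. It is an illustration only, not our actual implementation: the names `classify_failure`, `handle_failed_renewal`, `RenewalOutcome` and the list of marker strings are all hypothetical, and a real integration would more likely inspect structured error codes from the gateway client than match on message text. The idea is simply to treat a gateway timeout differently from a genuine card decline, so that the subscription still goes Past Due and is retried within 24 hours, but no failed payment email goes out when the gateway itself was unreachable.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class FailureKind(Enum):
    GATEWAY_UNAVAILABLE = "gateway_unavailable"  # timeout or connection error
    CARD_DECLINED = "card_declined"              # the gateway answered and refused


@dataclass
class RenewalOutcome:
    state: str
    retry_at: datetime
    send_failed_payment_email: bool


# Hypothetical marker strings, assumed for illustration only.
TRANSIENT_MARKERS = ("timed out", "connection refused", "connection reset")


def classify_failure(message: str) -> FailureKind:
    """Guess whether a failure was caused by the gateway being unreachable."""
    lowered = message.lower()
    if any(marker in lowered for marker in TRANSIENT_MARKERS):
        return FailureKind.GATEWAY_UNAVAILABLE
    return FailureKind.CARD_DECLINED


def handle_failed_renewal(message: str, now: datetime) -> RenewalOutcome:
    """Mark the subscription past due, schedule a retry, and decide on the email."""
    kind = classify_failure(message)
    return RenewalOutcome(
        state="past_due",
        retry_at=now + timedelta(hours=24),  # retry no later than 24 hours out
        # Only notify the customer when the card itself was refused.
        send_failed_payment_email=(kind is FailureKind.CARD_DECLINED),
    )
```

With this sketch, a renewal that fails with “The connection to the remote server timed out” is retried without notifying the customer, while any other failure still follows the normal dunning path.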
Unexpected Outage (RESOLVED)
4/30/2012 12:41PM ET
The networking issues have been resolved and Chargify is back online. We have confirmed that this was a datacenter-wide outage, but we are still waiting to learn the cause. We will share it when we have it.
4/30/2012 12:27 PM ET
We’re experiencing networking issues with our load balancer at the moment that are causing us to drop or fail to respond to requests. We’ll post an update as soon as we have one.
Degraded Performance (RESOLVED)
4/29/2012 1:00AM EST
The load balancer issue has been resolved and services are at 100% once again.
4/28/2012 9:30PM EST
We’re experiencing another bout of degraded performance this evening. The problem has been traced to the load balancer, which is overloading some servers while leaving others idle. We are working to fix the load balancer issue now.
Possible Diminished Performance (RESOLVED)
There appears to be a runaway process on one of our web servers. We’re looking into it. The remaining web servers are operating normally and response times should be normal.
Webhook Delays 4/24 (RESOLVED)
8:00 AM EST
Webhooks are running on a slight delay this morning. We’ve added extra capacity to work through the backlog.
Our proactive monitoring warned us immediately, allowing us to address the situation in a timely fashion.
8:30 AM EST
Webhooks have been caught up.
Service Outage (RESOLVED)
We are experiencing a service-affecting outage and are working to restore services. The outage began at approximately 3:30AM EST, and datacenter staff have been working on the issue since shortly after that. Members of our engineering team are also involved.
Unfortunately there is no ETA at this point in time.
6:22AM EST Service is being restored.
6:26AM EST Engineering is verifying all aspects of the service.
6:34AM EST Some of the application servers did not come back cleanly; they have just been restarted.
6:39AM EST All services are fully restored.
9:55AM EST We’re working with the datacenter to compile a Reason For Outage report and will share it with you once we have it. And, of course, we’re reviewing all of our procedures so we can make sure communication is improved.
4:00PM EST 4/18/2012 For those interested, here is the preliminary report from the datacenter where the problem occurred. We will be following up with them to ensure proper redundancy and procedures are in place. We’ll also continue to consider other options at our disposal for mitigating this type of outage.
One of our networking switches experienced memory overutilization due to high traffic on a monitoring port. The memory overutilization caused the CAM table to become corrupt. Once this occurred the network was flooded with broadcast traffic. The tremendous load of the broadcast traffic resulted in the environments being unavailable. To resolve the issue the monitor port was disabled and the switch was rebooted. This monitoring port will not be re-enabled and another solution will be found for this traffic. We are working to produce the final RCA as quickly as possible and the team will be following up on several takeaways from the incident.
Partial Outage (RESOLVED)
We experienced a partial outage of services for 3 minutes starting at 5:13AM EST. The system self-recovered before we were able to log in to investigate. The issue has since been resolved.
Increased Response Times 4/5 (RESOLVED)
3:48 PM ET
- One of our application servers began flailing, causing longer-than-normal response times for a subset of requests in the UI and API.
5:20 PM ET
- The application server causing the issue has since been rebooted and the system has stabilized.
Webhook Delays 3/13 (RESOLVED)
9:50 AM EST
Webhooks are running on a delay again this morning. We’ve added extra capacity to work through the backlog.
We’re also preparing a fix for our proactive monitoring of this issue, which has failed us twice in the last 2 days.
We know timely webhooks are important to you, so rest assured we’ll get this fixed!
11:00 AM EST
Webhooks recovered quickly (by 10:15AM EST). As stated, we’re deploying new monitoring today to help prevent this kind of thing from happening again.