Mailgun post mortem May 2016

A review of the incidents that impacted the availability of Mailgun's service in 2016.

This was originally reported on May 31, 2016.

Mailgun recently experienced several separate incidents that impacted the availability of our services. We’d like to take this opportunity to give our customers visibility into the root cause of each incident, along with details on what we’ve done to address them and to keep them from recurring.

Recent issues

Distributed Denial of Service (DDoS) Attacks

Mailgun is frequently the target of large and varied distributed denial of service (DDoS) attacks. While many attacks are blocked with minimal disruption, we’ve experienced several incidents with a prolonged impact on our services. In particular, an attack on May 23rd targeted portions of our primary data center. The method of the attack was unusual enough that our hosting provider needed nearly an hour to identify it and deploy the mitigations that allowed us to restore services to 100%.

API/SMTP timeouts or “SSL handshake” errors

Beginning earlier in the month, we observed a steady increase in customers experiencing abnormal timeouts when using the Mailgun API/SMTP service. As the number of reports regarding this issue increased, it became clear that there was a systemic issue with our service.

We maintain several different systems for monitoring throughput and latency in our service, but our own data did not match what customers were experiencing. After completing the investigation of our application, we escalated the issue to our hosting provider so they could inspect our network infrastructure and determine whether the problem lay in our managed load balancer or other network devices.

The initial troubleshooting sessions were inconclusive as the issue itself was intermittent and difficult to reproduce. After several attempts, we were finally able to verify that the requests that were timing out were not reaching our edge networking device, leading us to investigate upstream devices in the network.

We started by inspecting the device responsible for protecting our infrastructure from DDoS attacks. When this device was disabled, we were no longer able to reproduce the timeouts, so we immediately began investigating the possible causes. After analysis with our hosting provider’s DDoS team, we discovered that our DDoS mitigation system was impacting legitimate traffic. After making adjustments to our countermeasures, we were able to eliminate these errors.

Intermittent 421 errors

Mailgun returns a 421 error when we are unable to successfully queue a message. This error is designed to notify the user that the message was not received by Mailgun and should be re-attempted later. This is a normal part of SMTP and signals the sender to retry the message after a delay.
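On the sending side, the retry behavior described above might look like the following minimal sketch. It assumes the sender uses Python's standard smtplib, which raises SMTPResponseException (or a subclass) carrying the server's reply code. The helper name send_with_retry, the attempt limit, and the backoff delays are hypothetical choices for illustration, not values recommended by Mailgun.

```python
import smtplib
import time


def send_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() and retry with exponential backoff when the server replies 421.

    `send` is any zero-argument callable that performs the SMTP delivery,
    for example a lambda wrapping smtplib.SMTP.sendmail. All names and
    default values here are illustrative assumptions.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except smtplib.SMTPResponseException as exc:
            # Only 421 means "not accepted, try again later"; re-raise anything
            # else, and give up once the final attempt has also failed.
            if exc.smtp_code != 421 or attempt == max_attempts - 1:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

Because the message was never queued when a 421 is returned, resending it does not create a duplicate. The growing delay avoids adding load while the service is degraded.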

Last week, we started to see elevated levels of 421s being returned. These errors were caused by performance degradation in our Cassandra clusters, which is where we persist messages.

The Cassandra performance issues were caused by a compaction bug in the version of Cassandra we were running. The bug caused compactions to stall and disk I/O to spike, reducing overall Cassandra throughput. While the cluster was in this condition, we were intermittently unable to store messages, resulting in the 421 errors.

Corrective actions

  1. Alerting – While in many cases our DDoS mitigation system does not cause disruption, we’ve learned that it’s important for our Mailgun engineering team to know when the system is activated. Having this data allows us to determine more effectively whether issues are being caused by these protections. We’ve already worked with our hosting provider to deploy a system that alerts our engineers when these protections activate.

  2. Mitigation Profiles – We are tuning our DDoS mitigation profiles to improve our defensive posture while minimizing the impact to legitimate traffic. We’re working with our hosting provider to develop these profiles and expect this work to be completed this week.

  3. Cassandra Upgrade – We’ve started rolling upgrades of our Cassandra clusters to a version that is not affected by the compaction bug, along with configuration adjustments better suited to our workload.

  4. Infrastructure Design – We’re in the process of redesigning the underlying Mailgun infrastructure. This effort will give us a more robust network and deployment structure that will reduce the impact of similar types of attacks. This effort is underway and we will be sharing more details in the future.
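As an illustration of the correlation described in the Alerting item, the sketch below computes what share of reported timeouts occurred while DDoS protections were active. It is a hypothetical example, not Mailgun's tooling: the function name, the input shapes (a list of timeout timestamps and a list of start and end pairs for mitigation windows), and the sample times are all assumptions.

```python
from datetime import datetime, timedelta


def fraction_during_mitigation(timeouts, windows):
    """Return the share of timeout timestamps inside any (start, end) window.

    `timeouts` is a list of datetimes for customer-reported timeouts and
    `windows` is a list of (start, end) datetimes for periods when DDoS
    mitigation was active. Both input shapes are illustrative assumptions.
    """
    if not timeouts:
        return 0.0
    hits = sum(
        any(start <= t <= end for start, end in windows) for t in timeouts
    )
    return hits / len(timeouts)


# Made-up sample data: one ten-minute mitigation window and four timeouts.
start = datetime(2016, 5, 23, 12, 0)
windows = [(start, start + timedelta(minutes=10))]
timeouts = [
    start - timedelta(minutes=5),
    start + timedelta(minutes=1),
    start + timedelta(minutes=5),
    start + timedelta(minutes=30),
]
share = fraction_during_mitigation(timeouts, windows)
```

A share close to 1.0 would suggest the protections are involved, while a share close to the fraction of time mitigation was active anyway would suggest they are not.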

Finally, while we know these types of incidents are challenging, the Mailgun team is committed to focusing on the plan above and any other steps necessary to ensure that you can continue relying on Mailgun for your email delivery.
