Engineering
Moving fast with high reliability: Lessons from building Slack for Plain
Preslav Mihaylov
Feb 14, 2024
Over the past couple of months, we’ve been hard at work bringing Slack to Plain.
For those unfamiliar, Plain is a support platform specifically for technical B2B support. For many of our customers, support is shifting away from email to shared Slack Connect channels, so it was important for us to integrate this channel natively into our product.
For many teams, Slack is where they help their most important customers. Because of this, we knew from the outset that reliability was key. If we fail to ingest a message, it could result in missing time-sensitive, high-priority support requests.
However, Slack is also an unusually complex channel, with many custom paradigms and conventions and a large surface area.
This led to a really interesting challenge: how to balance shipping speed and reliability.
In this article, I’ll share how we found this balance and how we shipped this core component of our platform.
First, some context
Before diving in, it’s worth understanding how our Slack integration works. At its most basic level, whenever a customer asks a question in a Slack channel, the message and any subsequent threads are captured in Plain.
From Plain, you can reply to Slack messages and the reply appears exactly as it would if you replied directly from Slack.
If you’re curious to see how it works, you can check out more information in our docs.
Spike: One week from kickoff to working feature
We started the project by deep-diving into Slack’s APIs, and prototyping a vertical cut of the full Slack integration to get a subset of the full functionality working end to end. For this first pass, we didn’t write any tests, ignored edge cases, and kept all Slack-related code contained by using feature flags and a separate code structure.
Using this approach, we were able to ship a working prototype of the integration one week after kicking off the project. This was hugely valuable in helping us understand how we wanted Slack and Plain to interact and shape the full scope.
We took this approach for a few reasons.
Testing the unknown is slow and wasteful
It would have been a waste of time at the beginning to diligently test every edge case without knowing what would eventually be kept and what we’d end up removing. Edge case handling wasn’t going to change the broad strokes of how we wanted the integration to function, so ignoring the edge cases initially helped us to cover more ground faster.
We also still had many open questions about Slack’s technical domain at this point: many of the possible message payloads in Slack’s API are not well documented and seem rarely used.
For example, when you fetch a user profile, it includes a profile.email field that we use to associate a user ID with a customer. We found out that this field isn’t returned if a Slack workspace has turned off the setting for publicly sharing user emails.
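As a minimal sketch of handling this gotcha, the lookup can treat the email as optional rather than assuming it is present. The payload shape below follows Slack’s users.info response; the function name extract_email and the sample data are illustrative assumptions, not Plain’s actual code.

```python
from typing import Optional


def extract_email(user_info: dict) -> Optional[str]:
    """Return the profile email from a Slack user payload, or None if it is absent."""
    profile = user_info.get("user", {}).get("profile", {})
    email = profile.get("email")
    # Treat a missing key and an empty string the same way: no usable email.
    return email or None


# Hypothetical sample payloads: one workspace shares emails, the other does not.
with_email = {"user": {"id": "U123", "profile": {"email": "ada@example.com"}}}
without_email = {"user": {"id": "U456", "profile": {"real_name": "Grace"}}}

print(extract_email(with_email))
print(extract_email(without_email))
```

Callers then need an explicit path for the None case, for example matching the customer some other way, instead of failing the whole message.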
We discovered this and many more “gotchas” only after deploying our initial implementation. This is why it was crucial for us to get a first version out quickly so we could start using it ourselves, along with some early testers.
What gave us the most confidence in this phase was good old manual tests. They helped us to better understand Slack’s APIs and let us quickly test how a product decision would play out.
During this testing phase, we set up a separate Slack application in our dev and local environments, which made it easy to test a few features without impacting normal business comms.
Tests don’t matter if the blast radius of an incident is zero
Without any tests, the likelihood of an incident is high. But thanks to thorough feature flagging and a few architectural decisions, we made sure that no other system in Plain was impacted. This helped us to feel comfortable leaving our Slack integration “broken” at the end of any given day without worrying about its impact on anything else.
The same feature flags were also really helpful later on in gradually rolling out Slack to a few early testers.
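The idea of gating everything behind a flag can be sketched roughly as follows. This is an illustration under assumptions: the FeatureFlags class, the per-workspace allowlist, and handle_event are hypothetical names, and a real flag system would likely live in a dedicated service.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    # Hypothetical flag store: workspace IDs that have the Slack integration enabled.
    slack_enabled_workspaces: set = field(default_factory=set)

    def slack_enabled(self, workspace_id: str) -> bool:
        return workspace_id in self.slack_enabled_workspaces


def handle_event(flags: FeatureFlags, workspace_id: str, event: dict, ingest) -> str:
    """Run the Slack ingestion path only for workspaces that have the flag on."""
    if not flags.slack_enabled(workspace_id):
        # Flag off: the Slack code path is never entered for this workspace.
        return "skipped"
    ingest(event)
    return "ingested"
```

Rolling out to an early tester then amounts to adding their workspace ID to the allowlist, and rolling back amounts to removing it.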
As complexity grew, we gradually added automated tests
As we built the integration, we focused on a handful of use cases, which we could manually verify in a few minutes. This let us quickly iterate on the core ingestion logic and behavior.
However, as the feature grew in size and complexity, it took more time and effort to verify that our changes introduced no regressions. At this point, more of the team started working on Slack as well, which made it even harder to keep track of an ever-growing list of complex edge cases.
This was the point where we started to lean into automated testing. Although imperfect, we used test coverage as a proxy to ensure we weren’t missing any important flows until we hit a healthy 90% coverage.
Ship: Building for reliability
Another key aspect of ensuring reliability was setting up our infrastructure in a way that minimizes points of failure. We wanted to be able to recover easily from the errors and bugs we would inevitably run into.
This is an overview of the infrastructure as it is today:
We subscribe to various Slack events using the Slack Events API. Whenever anything of interest happens, Slack makes a webhook call to a public endpoint that we expose with AWS API Gateway.
From there, the JSON payload of the event ends up in an S3 bucket as a file. This triggers an S3 notification which is eventually handled by our slack-webhook-handler.
There are several interesting choices in this setup.
S3 as the starting point
The first step in the flow is to verify the request from Slack and upload it to S3. This ensures that we only process valid requests and that we can rely on S3 events to then trigger any downstream handlers.
We built this using a very small and intentionally simple lambda to reduce the complexity as much as possible. This is important since a failure at this point would have effectively meant we lost a message. This initial lambda was one of the first things we built, and it hasn’t changed since.
With this setup, in the event of a bug in our webhook processing logic, we can replay the webhook payload from S3 at any point and backfill whatever messages we missed. The improved reliability clearly outweighs the negligible added latency.
In previous integrations, we’ve used API Gateway’s ability to directly integrate with other AWS services and a custom Lambda authorizer. This approach avoids writing a lot of custom code and thus further reduces the chance of failure.
Unfortunately, Lambda authorizers don’t have access to the body of the request, meaning we couldn’t use this approach to authenticate Slack webhook requests, which rely on a combination of request headers and the body. You can read more about that process here.
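To illustrate why the body is needed, here is a minimal sketch of Slack’s documented request signing check: the value of the X-Slack-Request-Timestamp header and the raw body are combined, signed with the app’s signing secret using HMAC-SHA256, and compared against the X-Slack-Signature header. The function name, parameter names, and the five-minute tolerance as a default argument are assumptions for this example, not Plain’s actual lambda.

```python
import hashlib
import hmac
import time


def verify_slack_signature(signing_secret, timestamp, body, signature,
                           now=None, max_age_seconds=300):
    """Return True if the request was signed by Slack and is recent enough.

    timestamp and signature are header values (str); body is the raw request bytes.
    """
    now = time.time() if now is None else now
    try:
        request_time = int(timestamp)
    except (TypeError, ValueError):
        return False
    # Reject old requests to limit replay attacks.
    if abs(now - request_time) > max_age_seconds:
        return False
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Because the signature covers the exact bytes of the body, the check has to run somewhere that can read the unparsed request, which is why it sits in the small lambda rather than in an authorizer.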
Concurrency and retries using SQS
The next part of the ingestion route attempts to tackle several functional concerns.
We use an SQS FIFO queue, which enables our lambda to process one message at a time per Slack channel. To achieve this, the S3 notifications are consumed by a tiny slack-webhook-event-grouper lambda. This lambda writes incoming messages to the FIFO queue, using the Slack channel ID as the message group ID.
Allowing multiple messages to be ingested in parallel would have meant handling complex scenarios to ensure messages appeared in Plain correctly.
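The grouping step can be sketched as a function that builds the parameters for an SQS send_message call (which also needs a QueueUrl). MessageGroupId and MessageDeduplicationId are real SQS FIFO parameters; using Slack’s event_id for deduplication, the fallback group name, and putting the whole payload in the message body are assumptions for illustration only.

```python
import json


def build_fifo_message(slack_payload: dict) -> dict:
    """Build SQS FIFO send_message parameters grouped by Slack channel."""
    event = slack_payload.get("event", {})
    # Messages sharing a group ID are delivered in order, one at a time.
    channel_id = event.get("channel") or "unknown-channel"  # fallback is an assumption
    message = {
        "MessageBody": json.dumps(slack_payload),
        "MessageGroupId": channel_id,
    }
    event_id = slack_payload.get("event_id")
    if event_id:
        # Assumption: reuse Slack's event ID so a redelivered webhook is not queued twice.
        message["MessageDeduplicationId"] = event_id
    return message
```

Grouping by channel keeps ordering strict where it matters, within a conversation, while different channels can still be processed in parallel.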
Apart from addressing the concurrency issue, the queue is used as a retry mechanism. Each webhook is retried several times in case of intermittent failures, such as a downstream API or service being temporarily unavailable. If we still fail to ingest the message after several attempts, it ends up in a dead-letter queue (DLQ), which we manually inspect and decide how to deal with.
The DLQ was and continues to be a great source of new and unexpected Slack API payloads and, in 99% of cases, bugs we have to fix.
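The retry-then-DLQ behavior can be simulated in a few lines. In SQS itself this is configuration rather than code (a redrive policy with a maxReceiveCount), so the function below is only an in-memory model of the flow, with hypothetical names throughout.

```python
def process_with_retries(message, handler, max_attempts, dead_letters):
    """Try handler up to max_attempts times; park the message on final failure.

    Returns the attempt number that succeeded, or None if the message was dead-lettered.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return attempt
        except Exception as error:  # intermittent or permanent failure
            last_error = error
    # Keep the original message so it can be inspected and replayed after a fix.
    dead_letters.append({"message": message, "error": repr(last_error)})
    return None
```

An intermittent failure succeeds on a later attempt and never reaches the dead-letter list, while a payload the handler cannot parse fails every attempt and is kept for manual inspection.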
Isolating subsystem failures
One detail worth highlighting is that the main handler, slack-webhook-handler, is not responsible for managing the full Plain thread lifecycle. It’s only responsible for persisting relevant data about processed Slack messages and publishing events which get routed to other lambdas.
For example:
Thread state machine. This updates the thread’s status based on events it receives. For example, every time we receive a new Slack message from a customer, this transitions the thread to Todo.
Notification sender. This creates notification events which fan out to the different notification channels we support (e.g. Slack, Discord, email).
Timeline writer. This persists entries in the thread’s timeline which you can then see in Plain.
This setup prevents failures in one system from impacting others. For example, if sending notifications fails intermittently because of, say, the Slack or Discord API, you can still see messages in a thread in Plain. Running these downstream actions separately keeps Plain working, lets us retry them independently, and clearly isolates event failures.
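The isolation property can be illustrated with an in-process sketch. In the setup described above the subscribers are separate lambdas receiving routed events, so this single function is only a simplified model; fan_out and the subscriber names are assumptions.

```python
def fan_out(event, subscribers):
    """Deliver event to every subscriber and collect failures instead of aborting."""
    failures = {}
    for name, handler in subscribers.items():
        try:
            handler(event)
        except Exception as error:
            # One subscriber failing must not stop the others from running.
            failures[name] = error
    return failures


timeline = []


def send_notification(event):
    # Simulates a third-party API outage.
    raise ConnectionError("notification API unavailable")


failed = fan_out(
    {"type": "slack.message_received", "text": "Is the API down?"},
    {"notification_sender": send_notification, "timeline_writer": timeline.append},
)
print(sorted(failed))
print(len(timeline))
```

Even though the notification sender fails first, the timeline entry is still written, and the recorded failure can be retried on its own.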
Harden: Erroring well
Reacting to errors quickly
From the outset, we knew that handling errors promptly was important because of how crucial Slack is to our customers. Fixing a bug 8 hours after a message has failed is too long a wait.
For this reason, we decided to build very sensitive alerts from the very beginning. We set up PagerDuty and Slack notifications that fire on a single DLQ item in any Slack queue. We also used Sentry to help us quickly debug issues and jump to the relevant logs. This setup allowed us to quickly address issues and, when we couldn’t fix the bug right away, notify the customer immediately.
Although noisy at first, our speed at communicating errors to customers helped us to build a lot of trust in our ability to make Slack a viable channel.
Failing gracefully
Slack supports dozens of message subtypes, plus a few more that are not documented.
As we started to build out Slack, we quickly found that there were several messages that we couldn’t properly parse or display in Plain (for example, messages where the author had been hidden by Slack… yes, that’s a thing).
The context behind each message failure is hard to understand because it’s often unusual or undocumented. But that doesn’t make it any less important to our customers. We knew we needed to handle these cases, but we equally needed to continue making progress on the integration without spending weeks on rare, one-off cases. As this list of so-called expected errors started to grow, we needed to implement a way of surfacing them to customers automatically.
This is why we decided to implement different service modes for different messages: a full-fledged mode where we ingest messages with full context into Plain (including usernames, avatars, message text, attachments, etc.) and a degraded mode where we extract the minimum information we can from a message and inform you about it.
The degraded message handling looks like this in a thread:
If even that results in a failure, we fall back to our notifications system to inform you there was a message we couldn’t process.
And if all of the above fails, we log it to Sentry so we can take manual action:
Failing gracefully (and thoroughly) lets our customers feel confident that no matter what, they won’t miss any important conversation.
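The fallback order described above (full ingestion, then degraded mode, then a notification, then an error log) can be sketched as a simple chain. All function and parameter names here are hypothetical, and the real steps would write to Plain, the notification system, and Sentry rather than take plain callables.

```python
def ingest_with_fallbacks(message, full, degraded, notify, log_for_manual_action):
    """Try each service mode in order and return the name of the one that handled it."""
    steps = (("full", full), ("degraded", degraded), ("notified", notify))
    for mode, step in steps:
        try:
            step(message)
            return mode
        except Exception:
            # Fall through to the next, less detailed, mode.
            continue
    # Last resort: record the message so someone can act on it manually.
    log_for_manual_action(message)
    return "logged"
```

Each step asks less of the message than the one before it, so a payload that cannot be fully parsed still surfaces somewhere a person will see it.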
Fin
It’s always hard to know whether you could have built something faster or whether you really have captured every possible failure case. And hindsight is cheap. Overall, however, we’re really happy with where we landed and how we were able to move quickly without compromising reliability.
Prototyping quickly allowed us to make mistakes early on and iterate quickly. Our infrastructure design made sure we could recover from failures and our nuanced error handling ensured that when we did fail, we were able to react quickly and never leave a customer hanging.
Slack has now been live for a couple of weeks. So far, we’ve synced over 99.94% of messages correctly and have achieved a good starting depth for our integration: we intelligently group individual Slack messages into threads, and show everything from code blocks and images to custom emoji reactions right in Plain for a frictionless support experience.
If you’re interested in giving it a try, you can set up a workspace at https://www.plain.com/.