Engineering
Moving to local-first state management with GraphQL
Jordan Drake
Mar 27, 2024
In the past year at Plain we've released several major new features including Slack, rich-text editing, AI triage, and an entire re-design just to name a few. However, one feature that may have gone unnoticed was a fundamental shift in how we do state management on the front-end.
If you’re not familiar with Plain, I’ll give you a quick background to help you better understand this post.
Plain is a support platform designed for technical product teams. It helps you manage all of your support requests via email, contact forms, and Slack. It's the kind of tool that is open and in use all day, every day. More than anything else, what matters in Plain is the workflow - our users expect everything to be as fast, reactive, and keyboard-friendly as their IDE. When your customers contact you across any channel, their request is raised in Plain as a thread where you can triage, respond, attach metadata, and so on.
The front-end application is a React application built on Next.js. We don’t currently render server-side, but that’s a post for another day! We use Redux heavily for both UI state and external state. You don’t need to be a React or Redux expert to understand this post. If you’re unfamiliar with these technologies then think of Redux as the model. You dispatch actions to update this model in a controlled manner and React updates the view whenever the model changes. If you are familiar with useReducer in React, it’s basically like that but more advanced.
How state management used to work
The best way to show you how state management used to work is by talking through an end-to-end example.
This is the state management code for fetching a single thread in practice:
This is how we’d consume it in the front-end:
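As a rough illustration of the kind of flow described here, the following is a minimal, framework-free sketch rather than Plain's actual code. The `Thread` shape, the action names, and the `describeThread` consumer are all hypothetical; in the real application the reducer would live in Redux and the consumer would be a React component.

```typescript
// Hypothetical shapes and action names, for illustration only.
interface Thread {
  id: string;
  title: string;
}

// Request state that every fetch flow had to track by hand.
type ThreadRequest =
  | { status: "idle" }
  | { status: "loading" }
  | { status: "success"; thread: Thread }
  | { status: "error"; message: string };

type ThreadAction =
  | { type: "thread/fetchStarted" }
  | { type: "thread/fetchSucceeded"; thread: Thread }
  | { type: "thread/fetchFailed"; message: string };

// Reducer: one case per step of the request lifecycle.
function threadReducer(state: ThreadRequest, action: ThreadAction): ThreadRequest {
  switch (action.type) {
    case "thread/fetchStarted":
      return { status: "loading" };
    case "thread/fetchSucceeded":
      return { status: "success", thread: action.thread };
    case "thread/fetchFailed":
      return { status: "error", message: action.message };
    default:
      return state;
  }
}

// Consumer: stands in for a component that must handle every request state.
function describeThread(request: ThreadRequest): string {
  if (request.status === "success") return request.thread.title;
  if (request.status === "error") return `Failed to load thread: ${request.message}`;
  return "Loading...";
}
```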
For requesting and rendering a single entity, that is pretty manageable. However, things start to get more complicated when you want to start displaying lists of things on one page.
Below is an example of what a list of threads looked like in our Redux store:
Let me break this down a little:
First, we have a map of threads by id. This is all threads across all the lists we’ve seen which lets us easily reference a single thread without having to scan through all lists.
A map of queryId to a thread list, which includes some information about the request, such as the variables used to call it, and ultimately the list of threadIds which we’d expand against threads.
queryId is a deterministic hash of the variables (both filters and page parameters) used to request that list. It serves as a cache key should something request a list using the same variables, and it ensures that when the UI makes different requests for different lists of threads, the relevant data and request state are kept separate.
state is a flag we would set to ‘stale’ when something has invalidated this particular list. Anything consuming a stale list should then immediately re-fetch it. More on this later, but this is essentially how we invalidate lists cached in Redux.
pageInfo provides us with the data needed to request the next page of this particular list. This adheres to Relay’s GraphQL cursor-based pagination specification. Getting the next page for this list would not modify this list but would instead be treated as a separate query (with its own queryId).
When we made a query for a list of threads, we'd extract and upsert the threads onto the threads map and at the same time create a new list by extracting the ids from the threads. This way we have both the ability to show the list but also show individual items without having to re-fetch just one thread by itself.
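The structure described above could be sketched roughly as follows. This is an illustration under assumptions, not the original store: the field names follow the prose, but the exact shapes, the `receiveThreadList` and `expandList` helpers, and the use of a sorted JSON string in place of a real hash are all hypothetical.

```typescript
interface Thread {
  id: string;
  title: string;
}

interface PageInfo {
  hasNextPage: boolean;
  endCursor: string | null;
}

interface ThreadList {
  variables: Record<string, unknown>;
  threadIds: string[];
  state: "fresh" | "stale";
  pageInfo: PageInfo;
}

interface ThreadsState {
  threads: Record<string, Thread>; // every thread seen, by id
  lists: Record<string, ThreadList>; // one entry per queryId
}

// Deterministic key: the same variables always map to the same queryId.
// A sorted JSON string stands in for the hash mentioned in the post.
function queryIdFor(variables: Record<string, unknown>): string {
  const sortedKeys = Object.keys(variables).sort();
  return JSON.stringify(sortedKeys.map((key) => [key, variables[key]]));
}

// Upsert the threads into the shared map and record the list as ids only.
function receiveThreadList(
  state: ThreadsState,
  variables: Record<string, unknown>,
  threads: Thread[],
  pageInfo: PageInfo
): ThreadsState {
  const byId: Record<string, Thread> = { ...state.threads };
  for (const thread of threads) byId[thread.id] = thread;
  const list: ThreadList = {
    variables,
    threadIds: threads.map((thread) => thread.id),
    state: "fresh",
    pageInfo,
  };
  return { threads: byId, lists: { ...state.lists, [queryIdFor(variables)]: list } };
}

// Expand a list's ids against the shared threads map.
function expandList(state: ThreadsState, queryId: string): Thread[] {
  const list = state.lists[queryId];
  if (!list) return [];
  return list.threadIds
    .map((id) => state.threads[id])
    .filter((thread): thread is Thread => thread !== undefined);
}
```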
Although complicated, this approach largely worked as we're a support platform; you probably don't want pages and pages of open threads - you want zero. With this very manual and complicated state structure, we were able to build out a lot of the early functionality within Plain.
It was, however, not without its flaws.
Pain points
Liveliness
As a support platform, your users are spending all day in your tool. With this comes the natural expectation that everything is live. It would be a fundamental user experience failure if they had to reload the page to check for new threads so updating the application whenever a new thread came in was something we built in from the very beginning. For this, we use GraphQL subscriptions over websockets.
Unfortunately, when you’re dealing with a bunch of lists this can be very tricky. When you receive a subscription event that a thread has updated, you need to know which lists are now out of date. For that you have to scan all your lists and answer the following questions:
If this thread is in this list already, is it in the same order?
If this thread is not in this list, should it be added? How does this affect the previous or next page?
If this thread should now be removed from this list, how does this affect the previous or next page?
You basically end up having to rewrite filtering, sorting, and querying logic in the front-end that exactly matches what you have in the SQL in the back-end just so you can understand how the lists you have stored in the front-end are affected. Getting this wrong means ending up with out-of-date information, duplicates, or missing items entirely. You can avoid this by going for the nuclear option and invalidating all lists, but this would have to happen on any change, including something as trivial as applying a label to a single thread. When you’re dealing with a lot of threads and a lot of support agents, this can happen several times a minute.
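To make the trade-off concrete, here is a small sketch with hypothetical names (`CachedList`, `invalidateListsContaining`, `invalidateAllLists`). It compares a targeted invalidation, which only marks lists that already contain the updated thread, with the nuclear option. The targeted version is cheap but misses any list the thread should now be added to, which is exactly where duplicated filtering logic would otherwise be needed.

```typescript
interface CachedList {
  threadIds: string[];
  state: "fresh" | "stale";
}
type ListsByQueryId = Record<string, CachedList>;

// Targeted: only lists that already contain the thread are marked stale.
function invalidateListsContaining(lists: ListsByQueryId, threadId: string): ListsByQueryId {
  const next: ListsByQueryId = {};
  for (const queryId of Object.keys(lists)) {
    const list = lists[queryId];
    if (list === undefined) continue;
    const affected = list.threadIds.indexOf(threadId) !== -1;
    next[queryId] = affected ? { ...list, state: "stale" } : list;
  }
  return next;
}

// Nuclear option: every cached list is marked stale and must be re-fetched.
function invalidateAllLists(lists: ListsByQueryId): ListsByQueryId {
  const next: ListsByQueryId = {};
  for (const queryId of Object.keys(lists)) {
    const list = lists[queryId];
    if (list === undefined) continue;
    next[queryId] = { ...list, state: "stale" };
  }
  return next;
}
```

If thread `t1` moves from a "todo" list to a "done" list, the targeted version leaves the "done" list marked fresh even though it is now missing `t1`.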
Developer experience
Coding against these types of lists can be tedious. It wasn’t just the big thread list that was paginated and built as a list, it was pretty much every entity we had; labels, groups, snippets, users, customers, and more.
Each of those had its own Redux code. 90% of this was the same from entity to entity, but each had its own idiosyncrasies (we hold up our hands here, a lot of this could have been avoided) such as when to upsert, soft vs hard deletes, and cascading from other entities (e.g. a thread has a reference to a customer).
Then every place they are shown as a list (even something as minor as a dropdown) you need to handle loading states, error states, re-fetching stale lists, knowing you’re at the end of a page to load more, etc. Almost all components that render an entity end up with some sort of asynchronous handling and whilst React, shared components, and other abstractions can help here these are still logical pathways you need to consider, test, and mitigate against bugs.
Performance
Since the requests to fetch particular lists and entities are all driven from the component structure, it is hard to do any sort of meaningful pre-loading as you need to wait until you encounter that component before making a request. This can also result in something of an N+1 problem where you make one request to render one component which itself has components in it that trigger additional requests.
Similarly, when applying filters or sorts you’d be greeted by a loading spinner and whilst this would be relatively fast it impacts perceived performance and disincentivizes our users from using all the features of the application.
When we got it right, optimistic updates largely helped with perceived performance but in practice what often would happen is that a simple innocuous change (like adding a label to a thread) would invalidate a bunch of unrelated requests and trigger spinner-armageddon making the whole app feel slow.
UI Limitations
By only ever having small slices of the full list for a given entity we often found ourselves having a lot of weird UI limitations.
For example, if we wanted to show a filter that was contextually relevant to the full list but not within the small slice we currently had, we would have to build niche back-end APIs to give us metadata on the full list to know what to show.
Similarly, it’s hard to truly show a full overview of the state of your support queues. One way of providing this was with total counts given to us from a back-end API which we did but we’d also constantly have to invalidate and update these with similar difficulties as mentioned when invalidating lists.
More often than not, you’re constantly running into small spinners in several places on the page for metadata, counts, relationships (e.g. displaying something about a customer of a given thread), and so on.
Options we considered
If you’re familiar with using GraphQL within React then at this point you’re probably asking why we didn't turn to a popular GraphQL client (such as Apollo or Relay) or even a general request client like react-query. These libraries would handle a lot of what we’ve done ourselves in Redux and you’re given an interface like this:
This is something we considered and investigated, but these libraries don’t solve the problems with list invalidation out of the box and we’d still have to handle asynchronicity everywhere. It would also require a large amount of re-architecting of the existing codebase and make enabling optimistic updates across different parts of our graph very challenging.
Another approach we briefly considered was Replicache which specializes in providing a local-first, persistent, optimistically updating experience. This would have however meant building a new separate API for the front-end vs being able to share the same API our customers use to integrate with Plain today. If we were to rebuild Plain today from scratch Replicache would be a very strong candidate for our front-end architecture.
Our approach today
To solve our state management issues an attractive solution would have been to go for a local-first approach similar to Linear where the client downloads all data and then all operations happen on the client and are synced back to the back-end.
Initially, we dismissed this approach as we didn’t believe you would be able to download all support threads, customers, companies, etc. to the browser as the data set would be too large. However, as we built Plain we saw that your data set does not in fact grow infinitely, unlike articles on a news page or messages in a chat room. When you are in a support tool you only care about active support requests and this data set, as you work through your support queue, will trend downwards. This very much put local-first state management on the table.
We still however had to have a solution for viewing historical data. To achieve this we decided to offer a more classic SPA experience where all data is fetched on demand and only cached in memory. We felt comfortable with this trade-off as it optimized the experience for the 90% of cases where you are trying to help your customers while also keeping implementation complexity down.
Another assumption we saw we could make was that user-created entities such as users, snippets, labels, and customer groups would remain at a relatively reasonable (< 10000) size. If we ever have a customer with more than 10000 users then I think we can afford to rethink this architecture 😉
Technical breakdown
We still store all our entities within Redux in a way that is not too dissimilar to the threads mapping of our previous approach. The similarity enabled us to speed up the transition massively because a consumer that looked up an entity by id remained almost exactly the same.
The major difference is how these entities get into Redux. Before we’d do a typical Redux flow where each request would have a specific set of actions and reducer in Redux. Now we automatically parse the response of every request and if we find any entity we upsert it into our Redux state. This approach gives us a single pipeline of entity data into our application whether it is from a mass fetch, a specific fetch of a single entity, an entity updated in the result of a mutation, or even subscriptions.
Redux
This is what our Redux store for entities looks like today; a single store for all entities with actions for inserting generic entities rather than each individual entity having its own store and own actions ( upsertEntity vs upsertThread, upsertUser, etc.). Each entity is a map of the entity id to the entity model. There’s no request metadata per entity, just a top-level loading state to cover bootstrapping and hard vs soft loading.
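A store along these lines might look like the sketch below. It is an assumption-laden illustration: the entity keys, the `"hard" | "soft" | "idle"` loading values, and the action names are hypothetical, and the reducer is written as a plain function rather than with any particular Redux helper.

```typescript
interface Entity {
  id: string;
}

type EntityKey = "threads" | "users" | "labels";
type LoadingState = "hard" | "soft" | "idle";

interface EntitiesState {
  loading: LoadingState; // single top-level flag, no per-request metadata
  entities: Record<EntityKey, Record<string, Entity>>;
}

type EntitiesAction =
  | { type: "entities/upsertEntity"; key: EntityKey; entity: Entity }
  | { type: "entities/setLoading"; loading: LoadingState };

const initialEntitiesState: EntitiesState = {
  loading: "hard",
  entities: { threads: {}, users: {}, labels: {} },
};

// One generic upsert replaces upsertThread, upsertUser, and so on.
function entitiesReducer(state: EntitiesState, action: EntitiesAction): EntitiesState {
  switch (action.type) {
    case "entities/upsertEntity":
      return {
        ...state,
        entities: {
          ...state.entities,
          [action.key]: {
            ...state.entities[action.key],
            [action.entity.id]: action.entity,
          },
        },
      };
    case "entities/setLoading":
      return { ...state, loading: action.loading };
    default:
      return state;
  }
}
```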
We created a variety of abstractions to select data from this Redux store. For performance reasons, we make use of local memoization and memoized selectors, which means that if several parts of the application want the same data we won’t have to perform the same filtering and aggregating twice. One such abstraction is our useThreads hook:
We intentionally abstract away from using selectors directly in components to smoothen potential future transitions.
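The memoization idea behind such a hook can be sketched without React. The `memoizeLast` helper and `selectOpenThreads` selector below are hypothetical names, not the actual `useThreads` implementation. Because Redux updates are immutable, an unchanged `threads` map keeps the same reference, so comparing references is enough to skip the filter and sort.

```typescript
interface Thread {
  id: string;
  status: "todo" | "done";
  updatedAt: string;
}
type ThreadsById = Record<string, Thread>;

// Remembers the last input (by reference) and reuses the last output.
function memoizeLast<I, O>(compute: (input: I) => O): (input: I) => O {
  let cache: { input: I; output: O } | undefined;
  return (input: I): O => {
    if (cache === undefined || cache.input !== input) {
      cache = { input, output: compute(input) };
    }
    return cache.output;
  };
}

// Open threads, most recently updated first.
const selectOpenThreads = memoizeLast((threads: ThreadsById): Thread[] =>
  Object.keys(threads)
    .map((id) => threads[id])
    .filter((thread): thread is Thread => thread !== undefined && thread.status === "todo")
    .sort((a, b) => Date.parse(b.updatedAt) - Date.parse(a.updatedAt))
);
```

A hook would then only need to read the entity map from the store and pass it through the selector, so every component asking for the same data shares one computed result.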
Parsing a payload
Every payload whether it is from a query, mutation or subscription is scanned for entities. This is achieved by adding a small amount of code in our request wrapper which is connected to our Redux store. The response is then returned back to the original caller.
To parse potential entities, we use Zod to build models. This gives us the confidence that each entity will have the exact same shape, and we can transform the data and add additional client-only properties too.
There are several major benefits to this approach:
We don’t need to write bespoke code for each possible query and request flow. We just need to define a model and potentially some specific behavior for how that entity is stored (e.g. create vs update vs ignore)
Subscriptions can be parsed just like a response payload
Any response that happens to contain a matching entity, even if indirectly, updates that entity in our Redux ensuring we always have the most up-to-date entities without extra code
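The scanning step might be sketched as below. Several things here are assumptions rather than details from the post: entities are recognised by a `__typename` plus an `id` field, the list of known types is hard-coded, and a hand-written check stands in for the Zod models so the example has no dependencies. The function names are hypothetical.

```typescript
interface Entity {
  id: string;
  __typename: string;
  [field: string]: unknown;
}
type EntityStore = Record<string, Record<string, Entity>>;

// Assumed set of entity types the client has models for.
const knownEntityTypes = ["Thread", "Customer"];

// Stand-in for model parsing: accept objects with a string id and a known type.
function isKnownEntity(value: unknown): value is Entity {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  const typename = candidate["__typename"];
  return (
    typeof candidate["id"] === "string" &&
    typeof typename === "string" &&
    knownEntityTypes.indexOf(typename) !== -1
  );
}

// Walk the whole payload, collecting entities at any depth.
function collectEntities(payload: unknown, found: Entity[] = []): Entity[] {
  if (Array.isArray(payload)) {
    for (const item of payload) collectEntities(item, found);
  } else if (typeof payload === "object" && payload !== null) {
    if (isKnownEntity(payload)) found.push(payload);
    const record = payload as Record<string, unknown>;
    for (const key of Object.keys(record)) collectEntities(record[key], found);
  }
  return found;
}

// Upsert everything found, then hand the response back to the caller untouched.
function handleResponse<T>(store: EntityStore, response: T): { store: EntityStore; response: T } {
  const next: EntityStore = { ...store };
  for (const entity of collectEntities(response)) {
    next[entity.__typename] = { ...(next[entity.__typename] || {}), [entity.id]: entity };
  }
  return { store: next, response };
}
```

The same function can be fed a query result, a mutation result, or a subscription event, which is what gives the single pipeline described above.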
Entity idiosyncrasies
Not all entities are created equal; some differences are integral to their behavior, others are simply a result of our understanding of the problem space improving over time and frankly, some are mistakes on our part. Whilst we try to handle entities as generically as possible, we need mechanisms to support any different behaviors.
Each entity has an ‘entity manager’ which is responsible for handling all these differences in one place.
For an example, here’s the thread entity manager:
This logic sits outside Redux but is frequently referenced by it so the key matches the same value within our Redux store.
Model is the Zod model we use for parsing an entity.
The storage filter ensures that we only store entities in our persistence (see below) that are active to avoid our storage growing infinitely with inactive entities.
The upserter is a comparator function that takes a new entity and the previous entity of the same id (if it exists). Fortunately, a thread is a relatively simple case as it has an ‘updatedAt’ field.
Not all users have permission to view every entity within Plain. requiredPermissions ensures that we don’t attempt to fetch entities that the user does not have access to as this request will error. We obviously enforce these on the back-end but this helps reduce the number of requests to our back-end.
fetchAll is a function that loads all active entities into the store for synchronization. In the case of threads, this is our ThreadsQuery with filters for threads that aren’t marked as done. Typically the filters we apply to a fetch are the same as those in the storage filter.
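Putting those pieces together, an entity manager could be shaped roughly like this. It is a sketch under assumptions: the `EntityManager` interface, the permission string, and the thread statuses are hypothetical, a hand-written `parseThread` stands in for the Zod model, the upserter is assumed to return whichever version should be stored, and `fetchAll` is a stub rather than the real ThreadsQuery.

```typescript
type ThreadStatus = "todo" | "snoozed" | "done";

interface Thread {
  id: string;
  status: ThreadStatus;
  updatedAt: string; // ISO-8601 timestamp
}

interface EntityManager<T extends { id: string }> {
  key: string; // matches the key used in the Redux store
  parse: (raw: unknown) => T | null; // stands in for the Zod model
  storageFilter: (entity: T) => boolean;
  upserter: (incoming: T, existing: T | undefined) => T;
  requiredPermissions: string[];
  fetchAll: () => Promise<T[]>;
}

const threadStatuses: ThreadStatus[] = ["todo", "snoozed", "done"];

function parseThread(raw: unknown): Thread | null {
  if (typeof raw !== "object" || raw === null) return null;
  const record = raw as Record<string, unknown>;
  const id = record["id"];
  const updatedAt = record["updatedAt"];
  const status = threadStatuses.filter((candidate) => candidate === record["status"])[0];
  if (typeof id !== "string" || typeof updatedAt !== "string" || status === undefined) return null;
  return { id, status, updatedAt };
}

const threadManager: EntityManager<Thread> = {
  key: "threads",
  parse: parseThread,
  // Only active threads are persisted locally.
  storageFilter: (thread) => thread.status !== "done",
  // Keep whichever version was updated most recently.
  upserter: (incoming, existing) =>
    existing === undefined || Date.parse(incoming.updatedAt) >= Date.parse(existing.updatedAt)
      ? incoming
      : existing,
  requiredPermissions: ["thread:read"], // hypothetical permission name
  // Stub: the real version would run the threads query filtered to active threads.
  fetchAll: async () => [],
};
```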
Persistence
The trade-off for loading all the data upfront on application bootstrap is that we’ve significantly increased the time the user is behind a global loading state. To mitigate this, we introduced persisting the entities you’ve downloaded to your local storage. This means that when you load Plain you immediately see all the previous threads and state while we sync in the background.
Persisting data locally comes with a few issues. The first problem is that the model of the entity you’ve stored might have changed and trying to load it would break the application. In order to avoid this we treat locally stored data the same way we would a response payload; by parsing it to ensure it meets the current model. We also store the entities against a key generated by hashing the model schema.
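Both safeguards can be sketched together. The example below is illustrative only: `KeyValueStorage` mirrors the `getItem`/`setItem` part of the Web Storage API so that an in-memory stand-in can be used, the schema is represented by a plain descriptor string, a simple string hash stands in for whatever hashing the real implementation uses, and all function names are hypothetical.

```typescript
interface Thread {
  id: string;
  title: string;
}

// Subset of the Web Storage API, so localStorage could be passed in directly.
interface KeyValueStorage {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

function createMemoryStorage(): KeyValueStorage {
  const data: Record<string, string> = {};
  return {
    getItem: (key) => {
      const value = data[key];
      return value === undefined ? null : value;
    },
    setItem: (key, value) => {
      data[key] = value;
    },
  };
}

// Small non-cryptographic string hash (djb2 variant).
function hashString(input: string): string {
  let hash = 5381;
  for (let i = 0; i < input.length; i++) {
    hash = ((hash * 33) ^ input.charCodeAt(i)) >>> 0;
  }
  return hash.toString(16);
}

// A schema change produces a new key, so old data is simply never read.
function storageKey(entityKey: string, schemaDescriptor: string): string {
  return `entities:${entityKey}:${hashString(schemaDescriptor)}`;
}

function parseThread(raw: unknown): Thread | null {
  if (typeof raw !== "object" || raw === null) return null;
  const record = raw as Record<string, unknown>;
  const id = record["id"];
  const title = record["title"];
  return typeof id === "string" && typeof title === "string" ? { id, title } : null;
}

function saveThreads(storage: KeyValueStorage, schema: string, threads: Thread[]): void {
  storage.setItem(storageKey("threads", schema), JSON.stringify(threads));
}

// Stored data is parsed like a response payload; anything invalid is dropped.
function loadThreads(storage: KeyValueStorage, schema: string): Thread[] {
  const raw = storage.getItem(storageKey("threads", schema));
  if (raw === null) return [];
  try {
    const parsed: unknown = JSON.parse(raw);
    if (!Array.isArray(parsed)) return [];
    return parsed.map(parseThread).filter((thread): thread is Thread => thread !== null);
  } catch {
    return [];
  }
}
```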
Another issue is that there’s a reasonable chance somebody else has updated an entity (e.g. by responding to a thread) whilst you were offline so whenever you go to a thread we re-fetch that thread behind a loading spinner. Later down the line, we plan to invest in full conflict resolution on the API to provide a complete solution across the board for all entities.
Mutations
This entity pipeline model solves querying entities but doesn’t provide any solutions for mutations.
Previously, a mutation would look very similar to a query, you’d have a bespoke flow for it with appropriate Redux actions and request states and when it was successful you’d update the entity in its mapping and identify what lists are now invalid.
The new approach automatically handles the response of a mutation. Whilst we could have continued using Redux for the request itself, we no longer wanted to maintain all the boilerplate around managing request state, so we opted to use TanStack Query (formerly React Query). This handles request states and errors for us without us needing to write Redux actions and reducers for each one. With the help of our own abstractions, we end up with an interface that looks like this:
Optimistic updates
Whilst not directly part of the entity pipeline, we wouldn’t want to commit to an architecture that didn’t allow us to have optimistic updates when performing mutations as this is something that users expect of a modern web application to make it feel fast and performant. To achieve this we added the ability to provide ‘patches’ to our entities from the `createMutationHook` abstraction:
A patch is a function that takes an entity and performs a mutation on it. When anything reads that entity, the patch is applied until the patch is reverted either due to a success (in which case the entity has been updated by the pipeline) or an error. This uses Immer under the hood, which itself uses metaprogramming techniques (Proxies) to provide a mutable interface in an immutable system.
We opted for patch functions over patch objects because they’re much more powerful, more concise and patch objects can unintentionally override each other (e.g. adding and removing items to an array).
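The read-time behaviour of patches can be sketched without Immer. In the example below a JSON deep copy stands in for Immer's draft, so it only suits plain JSON-like entities, and `createPatchedStore` with its methods is a hypothetical name rather than the real abstraction. The stored entity is never changed by a patch; patches are applied to a copy whenever the entity is read.

```typescript
interface Thread {
  id: string;
  labels: string[];
}

// A patch mutates a draft copy of the entity in place.
type Patch<T> = (draft: T) => void;

function createPatchedStore<T extends { id: string }>(initial: T[]) {
  const entities: Record<string, T> = {};
  for (const entity of initial) entities[entity.id] = entity;
  let patches: { patchId: number; entityId: string; apply: Patch<T> }[] = [];
  let nextPatchId = 1;

  return {
    // Called by the entity pipeline when fresh data arrives.
    upsert(entity: T): void {
      entities[entity.id] = entity;
    },
    // Register an optimistic patch and return a handle used to revert it.
    addPatch(entityId: string, apply: Patch<T>): number {
      const patchId = nextPatchId++;
      patches.push({ patchId, entityId, apply });
      return patchId;
    },
    // Reverted on success (the pipeline has the real data) or on error.
    revertPatch(patchId: number): void {
      patches = patches.filter((patch) => patch.patchId !== patchId);
    },
    // Reads apply all pending patches, in order, to a copy.
    read(entityId: string): T | undefined {
      const stored = entities[entityId];
      if (stored === undefined) return undefined;
      const draft = JSON.parse(JSON.stringify(stored)) as T;
      for (const patch of patches) {
        if (patch.entityId === entityId) patch.apply(draft);
      }
      return draft;
    },
  };
}
```

Two patches on the same array, one adding a label and one removing another, compose in order here, whereas two patch objects that each carry a full replacement `labels` array would overwrite one another.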
Offline handling
Because we’ve designed the web application to be used all day, a lot of our users tend to close their work laptops with Plain open and rightfully expect Plain to continue to work once they open it again the next morning. Since they don’t reload the page they aren’t going to trigger the soft loading synchronization and their browser has almost certainly paused/disconnected from the subscription websockets events.
So we need to know when a user has been away from Plain for a reasonable amount of time to trigger the synchronization and reconnect the subscriptions. To achieve this we layer a few techniques. Firstly, we reference various browser APIs such as navigator.onLine (connected to a LAN, not necessarily the internet) and document.visibilityState (the browser/tab is active). The behavior of these APIs across different browsers is neither consistent nor well documented so we only trust their negative states. Ultimately we worked out that the most reliable way to know the browser had paused you was to simply have a setInterval running every second. By comparing the machine’s timestamp between every tick we'd definitely know if the JavaScript thread had been paused by the browser. The code for this is below:
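A minimal sketch of the tick-comparison technique follows. The function and option names are hypothetical, and the threshold is an assumed tuning value that is not stated in the post. The clock is injectable so the logic can be exercised without waiting on real timers.

```typescript
interface PauseDetectorOptions {
  intervalMs: number; // how often the timer should tick (one second in the post)
  thresholdMs: number; // extra delay tolerated before a gap counts as a pause (assumed)
  onResume: (pausedForMs: number) => void;
  now?: () => number; // injectable clock, defaults to Date.now
}

function createPauseDetector(options: PauseDetectorOptions) {
  const now = options.now || (() => Date.now());
  let lastTick = now();

  // Compare this tick's timestamp with the previous one.
  function tick(): void {
    const current = now();
    const elapsed = current - lastTick;
    lastTick = current;
    if (elapsed > options.intervalMs + options.thresholdMs) {
      // Far more time passed than one interval, so the timer was paused.
      options.onResume(elapsed);
    }
  }

  return {
    tick,
    // Starts the real timer and returns a function that stops it.
    start(): () => void {
      lastTick = now();
      const handle = setInterval(tick, options.intervalMs);
      return () => clearInterval(handle);
    },
  };
}
```

In the application, `onResume` would be the place to trigger a synchronization and reconnect the subscriptions.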
Downsides
This new approach isn’t without its negatives or compromises. Many of these are things we can iterate and improve upon over time and some are integral to the approach.
A big concern now is browser performance. We’re demanding a lot more from our users' browsers and with lots of complex filtering there’s potential for some O(N*M) operations. We have to be careful here because whilst an expensive asynchronous operation will result in a user seeing a loading spinner for a while, an expensive synchronous operation will completely freeze the browser. To avoid this, we use a combination of good algorithm design (e.g. index tables) and meta-programming (e.g. memoization). In the future, we could look at Web Workers to offload expensive operations away from the main browser thread.
Similarly, we’re also limited by the browser in terms of persistent storage size. If somebody has several thousand active threads on a browser that has a lower local storage capacity then local persistence will fail to work. Even when the data is too large, the application will still work as expected, as persistence is built as an addition and is not in the critical flow. There are, however, other paths we can investigate, such as using a different browser storage API like IndexedDB and compressing the data with libraries such as lz-string.
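As an example of what an index table can buy, the sketch below builds a label-to-threads index once so that each label lookup no longer has to scan every thread. The `labelIds` field and the function names are assumptions made for illustration.

```typescript
interface Thread {
  id: string;
  labelIds: string[];
}

// Naive lookup: scans all N threads for every label asked about.
function threadIdsWithLabelNaive(threads: Thread[], labelId: string): string[] {
  return threads
    .filter((thread) => thread.labelIds.indexOf(labelId) !== -1)
    .map((thread) => thread.id);
}

// Index table: one pass over the threads, then each lookup is a map read.
function buildLabelIndex(threads: Thread[]): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const thread of threads) {
    for (const labelId of thread.labelIds) {
      const bucket = index.get(labelId);
      if (bucket === undefined) {
        index.set(labelId, [thread.id]);
      } else {
        bucket.push(thread.id);
      }
    }
  }
  return index;
}

function threadIdsWithLabel(index: Map<string, string[]>, labelId: string): string[] {
  return index.get(labelId) ?? [];
}
```

Counting threads for M labels with the naive version walks the thread list M times, while the indexed version walks it once; combined with memoization, the index only needs rebuilding when the thread map actually changes.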
The other thing to consider is that a lot of this code is built and maintained by us. There are some solutions for pre-loading all entities in some GraphQL client libraries but this is not their typical use-case and I think we’d end up almost working against the library’s idioms to end up with the results we want. Additionally, the approach we’ve taken isn’t strongly linked to GraphQL. We could easily change to a REST API, an event-driven API, etc. with very minimal changes.
There are also parts of the application that don’t fit this model at all. We have settings pages and authentication flows that exist entirely out of this. Previously the state for these flows was managed in Redux but as part of this work, we’ve managed to move a lot of it to use TanStack Query instead. This is ok for now but it's certainly not as nice and simple as having a single way to do things across the full app.
That all being said, I think we’re in a vastly better place: the negatives of this approach are much less significant than those of the previous one, and the benefits are massive.
How’s it going
It's been a long journey getting here but we feel we've solved a lot of the problems we set out to tackle.
For subscriptions, the issue here was knowing what lists to invalidate and re-fetch when a new update came in. In the new architecture, there’s no concept of lists in our Redux, so when we want to display a list of entities we simply do that consumption by applying filters using client-side code. This also has the additional benefit of making applying filters and sorts instantaneous.
The UI now is also no longer limited by pages nor does it require niche back-end APIs for minor features. We’re able to display as many entities on a given page as we’d like. In fact, we’ve become a bit of a victim of our own success here and have had to solve new problems such as the performance of rendering hundreds of threads all at once. Fortunately, this is a solved problem in React with virtualization libraries like Virtuoso which basically only renders items in view. We can now build complex filters and views purely on the client-side and we can extract metadata from the entire set of active entities to provide something as trivial as a count to more interesting features like metrics.
The developer experience has also massively improved. We no longer need to write Redux code to handle every single request flow, we don’t have to handle asynchronicity and error states everywhere, and we can be confident that if we make a mutation then the entity will be updated through the pipeline without having to manually update it.
Overall we’re able to deliver new features faster with fewer lines of code whilst providing a higher quality of user experience than ever before. In fact, the front-end codebase before we made these architecture changes had significantly more lines of code than it did afterward. Introducing a new entity or request flow used to be something developers did begrudgingly; now it is such a joy that there’s often competition between us to see who gets to pick it up.
It also feels really fast.
Where might we take this next?
We've overcome the first step in becoming a local-first app but we still have a long way to go to make this more mature.
From optimizing how we synchronize clients to leveraging service workers, queuing mutations when offline, and handling undos globally... there is a lot to be done.