Slow web socket connection
Incident Report for InEvent
Postmortem

What is a “Websocket”?

Without getting into too much detail, a Websocket is a method of network communication used for real-time applications. We use Websockets for real-time communication and interaction, and these are the modules that use the Websocket service:

  • News Feed;
  • Inbox;
  • Session Chat;
  • Session Q&A;
  • Networking;
  • Creation of Group Rooms;
  • Invitations;
  • Push Notifications;
  • Live updates (session settings changes);
  • Networking Roulette;

Issues with Native Websocket and Regular Websocket (Google Firebase)

We have two Websocket providers: Google Firebase (Realtime Database) and our own implementation (Native Websocket). Today a large number of users connected at the same time, and this caused the Websocket servers to halt. Google Firebase couldn’t scale fast enough, and the Native Websockets couldn’t handle the load either. As a result, users saw the “Connecting” popup appear and never disappear.

Issues with Caching Server (Redis)

We had a major outage with our Caching server (Redis) that caused the entire platform and backend to go offline. The Redis server clogged up and couldn’t handle the server scaling and load, and this resulted in an overall failure of the platform. The landing page and login page were still operational.

Fixes implemented

For Native Websockets, we have implemented manual scaling for now, and we will work on an autoscaling mechanism to support large loads in the future. If the connection fails, you will still be able to use the Virtual Lobby normally, with limited interaction.

For Google Firebase, we couldn’t implement a fix. We will try sharding the entire operation into multiple microservices for different modules (Chat, Q&A, etc.), but since Firebase doesn’t support replicas, it will be hard to scale for large events. If your event has more than 5,000 users, it’s better to use Native Websockets. If the connection fails, you will still be able to use the Virtual Lobby normally, with limited interaction.
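The sharding idea described above can be pictured with a minimal sketch. It assumes each module gets its own pool of shards and that an event is pinned to one shard by hashing its identifier. The shard names, the `MODULE_SHARDS` table, and the `pick_shard` helper are hypothetical and only illustrate the routing; they do not describe the actual deployment.

```python
import hashlib

# Hypothetical shard pools per module. The real layout is not described in this report.
MODULE_SHARDS = {
    "chat": ["chat-0", "chat-1", "chat-2"],
    "qa": ["qa-0"],
}


def pick_shard(module: str, event_id: str) -> str:
    """Route an event to one shard of the given module.

    A stable hash is used (not the built-in hash(), which varies between
    processes) so that every server picks the same shard for the same event.
    """
    shards = MODULE_SHARDS[module]
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

With this kind of routing, a spike in chat traffic for one event would land on a single chat shard and leave the Q&A shard untouched.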

For the Caching Server (Redis), we are still implementing a fix, but we have deployed a temporary workload that should replace the caching server for now and keep the platform and the backend stable. This is an internal fix and shouldn’t affect the user experience.

What to expect for this week

The platform backend and all its modules should be operational. In case Native Websockets or Google Firebase fails, you will still be able to access the platform and the Virtual Lobby, but users will have a limited experience without realtime interactions – chat, Q&A and the other modules listed above will not be operational.
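As a rough illustration of the limited experience described above, the sketch below keeps the Virtual Lobby available in every case and switches the realtime modules off when neither Websocket provider is connected. The `platform_state` function and its return shape are assumptions made for this example; the module names are taken from the list at the top of this report.

```python
# Modules that depend on the Websocket service, as listed earlier in this report.
REALTIME_MODULES = (
    "News Feed",
    "Inbox",
    "Session Chat",
    "Session Q&A",
    "Networking",
    "Creation of Group Rooms",
    "Invitations",
    "Push Notifications",
    "Live updates",
    "Networking Roulette",
)


def platform_state(realtime_connected: bool) -> dict:
    """Describe what a user can reach (hypothetical helper).

    The Virtual Lobby stays available either way; realtime modules are
    only enabled while a Websocket connection is up.
    """
    return {
        "virtual_lobby": True,
        "realtime_modules": list(REALTIME_MODULES) if realtime_connected else [],
    }
```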

We are constantly working on improvements and we will announce when we have both realtime Websockets fully functional for large events. Meanwhile, we can guarantee that the backend and the Virtual Lobby will be online – even in case of limited realtime experience.

Posted Sep 13, 2021 - 17:33 EDT

Resolved
This incident has been resolved.
Posted Sep 13, 2021 - 17:15 EDT
Monitoring
The team has concluded the implementation of all temporary solutions on the platform.
This includes adding a timeout option for slow-connecting sockets and disabling Redis as a single connection.
The web socket chat will remain on the Firebase instance with limited chat support until we add support for per-instance local connections, which should happen next week.
The Redis cache team will implement a new permanent solution by the end of this week, Friday at the latest.
Posted Sep 13, 2021 - 13:07 EDT
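The timeout on slow-connecting sockets mentioned in the Monitoring update above could look roughly like the following sketch. It assumes the client opens its socket through some coroutine, here a hypothetical `connect` argument, and gives up after a fixed number of seconds so the interface can fall back to limited mode and not wait on “Connecting” forever. This is not the actual InEvent client code.

```python
import asyncio


async def connect_with_timeout(connect, timeout_seconds: float):
    """Await connect() and return its result, or None if it takes too long.

    `connect` is a hypothetical coroutine function that opens the socket.
    asyncio.wait_for cancels the pending attempt when the timeout expires.
    """
    try:
        return await asyncio.wait_for(connect(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # Caller can treat None as "continue without realtime features".
        return None
```

A caller that receives `None` would skip the realtime modules and load the rest of the page as usual.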
Update
We have fixed an autoscaling Redis instance that was not able to redirect the load to a separate system. InEvent uses Redis to quickly balance its write operations instead of relying only on a traditional SQL database.

The following components are still affected:
Firebase Web Sockets, which covers the live chat on the Virtual Lobby.
Posted Sep 13, 2021 - 11:45 EDT
Identified
We have traced the slowness to the Firebase product.
We are deploying an alternative using the Native Websockets, available under Company > Tools.
We are currently deploying a fix for UI improvements on the Firebase UI console while Firebase is not responsive.
Chat may be offline while the fix is being applied, but video and streaming should work normally.
Posted Sep 13, 2021 - 10:55 EDT
Investigating
We are currently seeing a slow scalability response from the InEvent Firebase socket group. We are investigating the issue with the Google Firebase team.
Posted Sep 13, 2021 - 09:45 EDT
This incident affected: Firebase Web Sockets.