FiveM outage postmortem: 2018-10-13

@argon wrote:

There was an incident today in which FiveM services suffered a brief but major outage as a result of a cascading failure triggered by a routine service upgrade.

Timeline

  • 5:30 PM CEST: An upgrade of the Discourse forum software was initiated.
  • 5:33 PM: It was noticed that the upgrade hung while pulling the Docker base image, potentially due to network issues. The upgrade was canceled and retried.
  • At this time, the forums and policy services had gone offline. This downtime was expected to be brief and to go unnoticed by users.
  • 5:40 PM: Docker had corrupted parts of the new base image on vedic. Internet search results indicated that a reset of Docker was the only solution.
  • 5:4x PM: Stopping the Docker service hung indefinitely, prompting a host reboot.
  • 5:48 PM: The host hadn’t come up, and accessing iLO initially failed due to a misconfiguration. PXE boot was reconfigured to boot from a rescue filesystem, and the host was power-cycled again.
  • 5:56 PM: The misconfiguration was identified: prefixing https:// to the incorrect iLO link resolved an invalid redirect to a LAN IP.
  • 5:57 PM: Connecting to the iLO console instantly made the machine resume booting. For safety purposes, a backup was initiated to a remote host.
  • 5:58 PM: A tweet was posted indicating that the issue was being worked on.
  • 6:25 PM: The backup procedure completed, and vedic could be rebooted to continue rebuilding the Docker data store.
  • 6:35 PM: Since Postgres wasn’t shut down cleanly, the Discourse start scripts had to be modified to give Postgres time to recover (a sketch of this kind of wait follows the timeline).
  • Around the same time, we re-added the new Docker host data to the Rancher cluster.
  • 6:51 PM: Users started reporting downtime of CnL heartbeats. Investigation showed that oceanic2 had its database service suspended, leaving the second shard of the heartbeat table with only a single replica. As a result, writes routed to this shard failed, and users encountered errors when joining servers.
  • At this point, war mode engaged, and timestamps weren’t kept.
  • Reconfiguring Docker left the DNS settings incorrect, which prevented Rancher from bringing up the new host in time. We therefore attempted to reconfigure the data table instead.
  • The heartbeat table was flushed and recreated after attempts to set a 2/5 configuration (2 shards, 5 replicas, allowing for 3 failures per shard) led to I/O overload on all servers in the cluster.
  • This also mandated a rolling recycle of all database nodes, leading to multiple intermittent outages of CnL.
  • 7:26 PM: All services resumed normal operation, and monitoring indicated heartbeats were being kept in the transient dataset.
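
The Postgres step at 6:35 PM boiled down to making the start sequence wait for the database to finish crash recovery before Discourse tried to use it. Below is a minimal sketch of that kind of guard, written in Python for illustration only; the actual Discourse start scripts are shell-based, pg_isready must be on the PATH, and the host, port, and timeout values are assumptions rather than production settings.

    import subprocess
    import time

    def wait_for_postgres(host="127.0.0.1", port=5432, timeout=600):
        """Poll pg_isready until Postgres accepts connections or the timeout expires."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            result = subprocess.run(["pg_isready", "-h", host, "-p", str(port)])
            if result.returncode == 0:  # 0 means the server is accepting connections
                return True
            time.sleep(5)  # crash recovery after an unclean shutdown can take a while
        return False

    if __name__ == "__main__":
        if not wait_for_postgres():
            raise SystemExit("Postgres did not finish recovery within the timeout")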

Lessons learned

  • The external backup system of the forums should be reconfigured so that it automatically saves backups of all data, not just the Postgres database.
  • A 1/5 configuration should be used instead of a 2/3 configuration, since with only 3 replicas per shard, 2 servers failing can leave a whole shard unreachable (see the sketch after this list).
  • Monitoring should not run on the same node as other services: we only found out about the CnL outage a few minutes late because the Docker host that ran the monitoring service was still being rebuilt.
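
To make the replica arithmetic concrete: during the incident, a shard with only one surviving replica stopped accepting writes, which suggests a shard needs at least two live replicas to stay writable. Treating that as an assumption (the database's actual write rule isn't documented here), a quick back-of-the-envelope check of the layouts mentioned above looks like this:

    # Assumption: a shard stops accepting writes once fewer than 2 replicas remain,
    # which matches the incident where a shard with one surviving replica rejected writes.
    MIN_WRITE_REPLICAS = 2

    def tolerated_failures(replicas_per_shard):
        """Replica failures a single shard can absorb while staying writable."""
        return max(replicas_per_shard - MIN_WRITE_REPLICAS, 0)

    # shards/replicas layouts mentioned in the postmortem
    for shards, replicas in [(2, 3), (2, 5), (1, 5)]:
        print(f"{shards}/{replicas}: each shard tolerates "
              f"{tolerated_failures(replicas)} replica failure(s)")

Under this assumption, a 2/3 layout only tolerates one failed replica per shard, while the 2/5 and 1/5 layouts tolerate three, which lines up with the failure seen at 6:51 PM.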
