Challenges of game backend, lots of concurrent users, complex interactions between players, persistent world with state mutation (every seconds) running 24/7, single unified game world with no barrier between users.
Elixir, based on Erlang makes things easier to built cluster, distributed messaging and pretty great for stateful servers, using a database to store these states. Furthermore, Elixir is fault tolerant, rapid to develop with a small team, the tool and documentation are present, the learning curve is low.
Concerning the deployment, everything runs in a cluster controlled by Kubernetes (https://kubernetes.io/) and permit to make a fully connected Erlang mesh between nodes. Kubernetes is only used to push the different server working together. As external services, we use load balancer, key-value database (Google Cloud Data-Store https://cloud.google.com/datastore/ with really nice user experience), Log storage/analysis (papertrail https://papertrailapp.com/, useful to analyze stuff), Infrastructure monitoring (datadog https://www.datadoghq.com/), Game analytics data-driven.
The game server is one distillery release (https://github.com/bitwalker/distillery), packaged with docker as container and everything is orchestrated with Kubernetes for the deployment and do lot of boring work. The built and the deploy are controlled using Jenkins pipelines.
Connections are made persistent (during the session of the client) using ranch (https://github.com/ninenines/ranch), bidirectional RPC API, all users are also authenticated with the possibility to spawns or resumes session. Most operations have automatic feedback when there is some actions made. If the connection drop, the session can be spawned to another server (because it is actually a distributed application and a connection can be made on another server).
The Game Entities, session module glues the connections and the game world, can resume the connection and implements transactions between players. The player entity is used to store and updates player states. Expedition entity is used for events between guild or players.
GenEntity is an extended version of GenServer (https://hexdocs.pm/elixir/GenServer.html) adding more game logic and less error prone. It permits to discovery with global process registry, can manage Lifecycle management (add new features easily), it is a persistant state into database (and try to store the data until it succeeds), all communication is subscription based one (Each entities can subscribe to each others). Finally, there is more debugging and introspection feature. GenEntity does not have guarantee to be unique, it is possible to have multi same processes running at the same time, if there is writing into database, we can see the duplicate later. There is no guaranty of atomicity when you write to GenEntity processes. This is a same concept than distributed logs, a log can be drop sometime.
Global systems englobe the player-to-player trading, by allowing them to publish items and sale them into the system. Online players can receive item continuously and this entity is fully connect to the cluster.
Clustering helper, use the state of the Kubernetes cluster and try to replicate the state of the Erlang cluster. It keeps tracks of all machine when one is up or down, it sends alerts. Another part of the entities is Ditributed Locks, used to registry process for game entities, it is based on sloppy quorum model and permit to be distributed for high throughput. It gives a pretty good guarantees but it is not perfect. If the cluster split, we keep only the biggest cluster.
How distributed locks work, it grants timed leases to entities, sloppy quorum with three replicas and implement a distributed version of wooga/locker (https://github.com/wooga/locker). A static grid of virtual nodes, it takes the hash of the id, servers compete for ownership based on vnodes, and each nodes are aware of the zone. It is a best-effort consistency without guarantees. The lost entities are overwritten for higher availability, there is a versionning feature to resolve the conflict.
The scaling is simulated with AI clients to see if the system scales (they are using ubuntu). Be careful with the amount of users, for example, the limit of the files/processes, the limit of the logger, the limit to ranch max connection (4k), the DynamoDB (https://aws.amazon.com/dynamodb/) bandwidth... Finally, the disk full (after 12h with 50k connections).
The limit (actually), 420k concurrents connection on 8-nodes with 36 vCPU, 52k connections per nodes, 3GB of memory used per node, and expected to scale further.