The Written Realms codebase (originally named Advent) started on March 31st, 2013. There have been over two thousand commits since then, a true labor of love that has spawned a very complex system which feels remarkably like a young being sometimes. Needy, time consuming, fragile, mysterious, surprising… I do not have children, but sometimes it feels like I do.
These are all of the commits to the master branch of the backend code. It doesn't take into account the other repos (frontend, deployment, documentation, this blog). But the pattern would be repeated there too. It's not a perfect representation of all of the effort that goes into Written Realms, but it is representative.
What I really want to talk about right now is what's off the edge to the right of the graph -- the road ahead. But as is often the case, first we have to talk a little about the past.
The volume of commits clearly slowed down at the end of 2020. This can be corroborated by the lack of blog posts, newsletters or generally new features. I wouldn't go as far as saying that I burned out, but I did hit a wall. Some of it was that I moved and started a new life, some of it was personal / family issues, but some of it was also the project itself.
In a way, Written Realms has been way more stable than I thought it would be. Don't get me wrong, it's gone down innumerable times, and it usually requires me to restart some of the processes every few days and do a full platform restart every week or two. But when you build a platform like this you always fear the worst-case scenarios, and those have not come to pass. Plenty of lag, but the system is almost always up.
But beneath that overall stability, there's been a lot of micro-instability (and sometimes, not so micro). The two main flavors of it are A) the lag and B) the ghosts. They are the two fundamental issues that I end up wrestling with on a daily basis, in one way or another. Both are of my own making.
The lag is the biggest problem. It's really a collection of problems, all stemming from the fact that I had no idea what I was doing when I started working on the game engine. I probably still don't, but at least now I've gained some understanding from the mistakes. The issue of lag could easily take up a whole other blog post, but I'll briefly touch on it here.
The game's live data resides in a Redis database. The game code is written in Python. We want to be able to manipulate the game data by manipulating objects. For example, player.health = 20 is an object operation that has to set the player's 'health' attribute in the Redis data structure to '20'. The layer that translates object operations into database operations is called the ORM (Object-Relational Mapping). It's a really complicated piece of software and the last thing you'd want to grow in house… which is exactly what I did. I wrote it myself, from scratch. It was in fact how Realms started, and it's kind of an abomination.
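To make the idea concrete, here is a minimal sketch of how an attribute assignment like player.health = 20 can be turned into a hash write. It is not the actual Written Realms ORM: FakeRedis is a hypothetical in-memory stand-in so the example runs without a server, and the Field and Player classes and the "player:1" key layout are assumptions made purely for illustration.

```python
class FakeRedis:
    """Hypothetical in-memory stand-in for a Redis client (hash get/set only)."""

    def __init__(self):
        self.hashes = {}

    def hset(self, key, field, value):
        # Redis stores values as strings, so mimic that here.
        self.hashes.setdefault(key, {})[field] = str(value)

    def hget(self, key, field):
        return self.hashes.get(key, {}).get(field)


class Field:
    """Descriptor that maps attribute reads and writes to a hash field."""

    def __init__(self, cast=str):
        self.cast = cast

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        raw = obj.db.hget(obj.key, self.name)
        return None if raw is None else self.cast(raw)

    def __set__(self, obj, value):
        # Every assignment becomes one write to the data store.
        obj.db.hset(obj.key, self.name, value)


class Player:
    health = Field(cast=int)

    def __init__(self, db, player_id):
        self.db = db
        self.key = f"player:{player_id}"  # assumed key layout


db = FakeRedis()
player = Player(db, 1)
player.health = 20
print(db.hashes)      # {'player:1': {'health': '20'}}
print(player.health)  # 20
```

Note that in this sketch every attribute read and every assignment is its own trip to the store, which is convenient to write against and costly when it happens many times per command.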
I was just so damn interested in it. I'm someone who loves to break apart the clock completely, down to its last nuts and bolts. I wanted to play with the concepts of it. It's literally how the project started, me wanting to combine that interest with my now two-decade old desire to build a better MUD. Everything else was built around that. It was both my origin story and my original sin.
Compounding that is the problem of how the game code is written, especially the combat code. A typical execution of a combat skill is a series of small operations, each of which fetches some data, does something with it, and writes the result back.
And these "fetch - do - write" segments repeat sequentially until all of the components of the skill have been handled. This is a very natural way to go about it: it yields very readable and maintainable code, and it's horribly slow. For complex skills you end up doing this fetch-do-write cycle a bunch of times. The correct way to handle this is to do all of the data fetching up front, all at once, as efficiently as possible, then do all the calculations (never having to do any kind of I/O again), and finally write all the results.
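The difference between the two approaches can be sketched by counting round trips to the data store. This is an illustration only, not the real combat code: CountingStore, the "mob:N:health" keys, and both skill functions are hypothetical names invented for the example.

```python
class CountingStore:
    """Hypothetical key-value store that counts round trips."""

    def __init__(self, data):
        self.data = dict(data)
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1
        return self.data[key]

    def set(self, key, value):
        self.round_trips += 1
        self.data[key] = value

    def get_many(self, keys):
        self.round_trips += 1  # one trip, however many keys
        return {key: self.data[key] for key in keys}

    def set_many(self, updates):
        self.round_trips += 1
        self.data.update(updates)


def skill_sequential(store, targets, damage):
    """Fetch, do, write, once per target."""
    for target in targets:
        health = store.get(f"{target}:health")        # fetch
        health = max(0, health - damage)              # do
        store.set(f"{target}:health", health)         # write


def skill_batched(store, targets, damage):
    """Fetch everything, compute with no I/O, then write everything."""
    keys = [f"{target}:health" for target in targets]
    snapshot = store.get_many(keys)
    updates = {key: max(0, value - damage) for key, value in snapshot.items()}
    store.set_many(updates)


initial = {"mob:1:health": 30, "mob:2:health": 8, "mob:3:health": 15}
targets = ["mob:1", "mob:2", "mob:3"]

slow = CountingStore(initial)
skill_sequential(slow, targets, damage=10)

fast = CountingStore(initial)
skill_batched(fast, targets, damage=10)

print(slow.round_trips, fast.round_trips)  # 6 2
print(slow.data == fast.data)              # True
```

Both versions leave the store in the same state. The batched one does it in two round trips regardless of how many targets the skill touches, while the sequential one pays two per target.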
I've begun the long process of writing a new version of the combat code which implements this idea of batching the reads & writes, and decouples from the original implementation of the ORM. Not only will it take time, but it will also be a fork in the combat system. Initially, it will be less powerful than the current default combat system with the 4 prebuilt archetypes. It's one of those situations where we'll have to turn around and go back down the mountain for a while before finding a better way up.
I've received the following bug report hundreds of times: "I'm fighting this soldier who's at 1 health and he won't die." I've spent more time and energy tracking down this one issue than any other, save perhaps the 'lag' umbrella term. I've come to refer to it as "the ghosts", but zombies is perhaps slightly more accurate. The problem is that death, as simple as it sometimes appears to be conceptually, is a damn complicated process to get right in a game.
This issue, too, could easily spawn its own blog post. But at the heart of it was the classical sin of premature optimization. I was trying to plan ahead for when the platform would need to scale and thought that I should start with a multi-processing approach right off the bat. The idea is as follows: if there is an auto-combat loop that happens every 2 seconds, and another system that handles user-initiated attacks, you can have a different process for each. It sounds good in theory because at a high level multiple processes sharing the workload is more performant than stacking all the work for a single process.
In practice, it inevitably creates race conditions. Anytime two different processes manipulate the same data reference in the game engine, they're going to step on each other's toes. The worst problem is when a process writes to a mob that has died an instant before. If the death happens in between the writing process's read and its write, the writing process is going to create a zombie. This is because the writing process didn't reload its internal representation of the dead mob before sending out its write. Doing so before every write would not only hurt performance, it wouldn't even necessarily prevent a race condition from the other side.
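The zombie scenario can be replayed deterministically in a few lines. This is a simplified sketch: the dictionary stands in for the shared data store, the two "processes" are just interleaved steps in one script, and the record layout (health, alive, stunned) is an assumption made up for the example.

```python
import copy

# Shared game data, standing in for the live data store.
store = {"mob:7": {"health": 1, "alive": True, "stunned": False}}


def read(key):
    # Each process works on its own private copy of the record.
    return copy.deepcopy(store[key])


def write(key, record):
    # Writes replace the whole record, as a naive object save would.
    store[key] = copy.deepcopy(record)


# Step 1: process A (say, a skill handler) reads the mob while it is alive.
a_view = read("mob:7")

# Step 2: process B (say, the auto-combat loop) lands the killing blow.
b_view = read("mob:7")
b_view["health"] = 0
b_view["alive"] = False
write("mob:7", b_view)

# Step 3: process A finishes its work and writes back its stale copy.
a_view["stunned"] = True
write("mob:7", a_view)

print(store["mob:7"])  # {'health': 1, 'alive': True, 'stunned': True}
```

Process A never touched the health field, yet its write resurrected the mob at 1 health, which is the soldier who won't die from the bug report.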
How many independent game engine processes does Written Realms run at once? Historically it's been around 7. That's a lot of feet that can step on each other's toes.
Multiple processes, however, are not only desirable but necessary. The best example of that is when you use the save command and write all of your player's game data to the API server. It can take a second or two to save a player's data, and if the save command was executed by the same process as the auto-combat loop, everyone would freeze every time a player saved. You have to split up the work and allow long running operations to finish asynchronously.
The trick is to make sure that all of the 'key operations' for a given world are executed in order by a single process, always. You can have other processes handling monitoring and queuing operations that need to be executed, but the operations that actually change the data structures upon which the game depends have to be carefully routed and queued. That queuing work is something I've been focused on heavily over the last few months, and I've actually been making good progress on it. But for now, the ghosts remain…
As much time as I spend wrestling with the lag and the ghosts, the problems do not end with them. Synchronicity, for example, is a very tricky one. The player's data exists in the game engine and it exists in the API server. The two need to constantly communicate to make sure that no accidental losses (or duping) happen, and although it's a rare occurrence there have been a handful of instances where players have lost their gear.
There's always new bugs, new exploits I hadn't thought of, sometimes there's actual nefarious attempts to crash the platform, something I've learned a lot from but which I'm not typically exactly eager to run into. There's been a lot of moderation and policing, something I'd never really thought about until it'd become completely necessary. But I've had to build tools to mute (globally and on an individual player-to-player basis), kick, and ban players, and there's many more that need to be developed still. It's not something I ever really look forward to working on, and it can be oddly draining.
The interesting thing about all of the above is that there's been not even a hint of a mention of working on new features. That's because there's been less than a handful of them in the last couple of years. I likely spend less than 5% of my time working on new features. All of the time is spent monitoring, investigating, rewriting existing code, pushing fixes.
But when I think about what I actually want to work on, it's always new features that will empower builders, new content, and gameplay. And that is why I've experienced significant headwinds over the last couple years. Anytime I get on to do stuff, I know it won't be any of what I actually want to do.
But one of the most powerful motivators is embarrassment. The fact that the system slows to a crawl when 20 players are on is going to keep me focused on the performance issues until they are solved. When you start out on this kind of project, you want to build the next big thing. How far we are from even being able to truly handle being big is deeply humbling, a decade in.
A bit of good news: the lag has been much improved in the last few weeks, a result of a couple of changes.
1) Thanks to our generous Patreon donors, I was able to upgrade the Prod server and throw hardware at the problem. It now has double the cores and RAM, and a dedicated CPU. The box feels noticeably zippier. We are incredibly grateful for those contributions, and reinvest 100% of the money into the platform, be it server upgrades or advertising spending.
2) A couple weeks ago I rolled out a new "queue routing" system (not quite sure what to call it yet) that allows world tasks to be safely distributed among multiple processes, where each world is guaranteed to have all of its tasks executed sequentially by the same process, and where the performance lag experienced by one world can be segregated from the other worlds. Edeus, for example, is on a different process from all other worlds for all synchronous tasks.
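A rough sketch of what such routing could look like follows. It is not the actual implementation: threads stand in for the separate processes, the WorldRouter class and the hashing scheme are assumptions, and "World A" and "World B" are placeholder world names. The one property it tries to show is that every task for a given world lands in the same queue and is executed in submission order.

```python
import queue
import threading
import zlib


class WorldRouter:
    """Routes every task for a given world to exactly one worker queue."""

    def __init__(self, shared_workers, dedicated_worlds=()):
        self.shared_workers = shared_workers
        self.queues = [queue.Queue() for _ in range(shared_workers)]
        self.dedicated = {}
        for world in dedicated_worlds:
            # A dedicated world gets a queue that no other world shares.
            self.dedicated[world] = len(self.queues)
            self.queues.append(queue.Queue())

    def worker_for(self, world):
        if world in self.dedicated:
            return self.dedicated[world]
        # crc32 is stable across runs, unlike the built-in hash() on strings.
        return zlib.crc32(world.encode("utf-8")) % self.shared_workers

    def submit(self, world, task):
        self.queues[self.worker_for(world)].put((world, task))


def run_worker(index, task_queue, log):
    """Drain one queue in FIFO order until the None sentinel arrives."""
    while True:
        item = task_queue.get()
        if item is None:
            return
        world, task = item
        log.append((index, world, task))  # "executing" the task


router = WorldRouter(shared_workers=2, dedicated_worlds=["Edeus"])
for step in range(3):
    for world in ["Edeus", "World A", "World B"]:
        router.submit(world, f"tick {step}")

log = []
threads = [
    threading.Thread(target=run_worker, args=(index, task_queue, log))
    for index, task_queue in enumerate(router.queues)
]
for task_queue in router.queues:
    task_queue.put(None)  # sentinel: stop once the queue is drained
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

for world in ["Edeus", "World A", "World B"]:
    print(world, [task for _, w, task in log if w == world])
```

Because a world's tasks only ever sit in one queue, no two workers can interleave writes to the same world, and a slow world only delays the other worlds that happen to share its worker.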
I've also built tools to measure and monitor lag. I can now generate reports that give granular information on which commands are slow, and in which worlds. Similarly, I've put a number of tracing mechanisms in place to try to eradicate the last of the ghost issues, although I cannot yet report satisfactory progress on that.
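As an illustration of the kind of report described above, here is a small sketch that collects per-command timings grouped by world and lists the slow ones. The LagMonitor class, its method names, and the threshold are all hypothetical, and the sample durations are made-up numbers rather than real measurements.

```python
import time
from collections import defaultdict
from statistics import mean


class LagMonitor:
    """Collects command durations keyed by (world, command)."""

    def __init__(self):
        self.samples = defaultdict(list)

    def record(self, world, command, seconds):
        self.samples[(world, command)].append(seconds)

    def timed(self, world, command, func, *args, **kwargs):
        """Run func, recording how long it took even if it raises."""
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            self.record(world, command, time.perf_counter() - start)

    def report(self, threshold):
        """Rows of (world, command, count, mean, max), slowest mean first."""
        rows = [
            (world, command, len(durations), mean(durations), max(durations))
            for (world, command), durations in self.samples.items()
            if mean(durations) >= threshold
        ]
        return sorted(rows, key=lambda row: row[3], reverse=True)


monitor = LagMonitor()
# Made-up durations in seconds, for illustration only.
monitor.record("Edeus", "kill", 0.5)
monitor.record("Edeus", "kill", 1.5)
monitor.record("World A", "look", 0.25)

for row in monitor.report(threshold=0.5):
    print(row)  # ('Edeus', 'kill', 2, 1.0, 1.5)
```

In a live system the timed helper would wrap the command handlers, so the samples come from real executions instead of hand-entered values.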
As mentioned earlier, I've also started the process of rewriting the combat system. This is a large undertaking that will involve some difficult decisions and compromises, but I believe that batching and proper task routing are the only way to real performance.
Finally, what I actually want to talk about: the future.
There will always be a part of working on Written Realms that will involve fixing, reworking, untangling and policing. But if there’s one thing that I’ve learned in the last decade, it’s that it can’t be all I do. New features, visible changes to the site, and new content must be part of the story going forward. I’m going to make time for these things once more.
Alright, let's talk about new stuff!
First, new content! I am thrilled to announce a new world in development, Demigods. It takes place in 5th century BC Greece, during the second Achaemenid invasion of Greece (480 BC). That’s right after Thermopylae (if you've ever seen or read 300), an incredibly exciting historical period with lots of potential for fantastic quests and storylines.
While I am doing all of the high level design and planning for the world, I’m not actually the one writing the room descriptions. That honor belongs to one GPT-4, the latest breath-taking release by OpenAI. GPT and I have been collaborating on the world creation for about a week now, and the results have honestly blown me away. I can’t wait to share it with you, and I will be writing a separate blog post on the subject.
Demigods will feature the custom skills system which allows builders to replace the standard warrior/mage/cleric/assassin archetypes that ship with Written Realms worlds by default. It will also feature the concept of Instances, which are basically worlds within worlds where players can carry out certain tasks under a slightly different set of rules than their 'normal' world. They are also the ultimate solution to the lag problem, as we can have Instances run on separate processes from their parent world.
So that's definitely happening. Other things that are not on the immediate docket but that I think about often, and am really itching to implement:
Finally, I'd like to get back to regular communications with the user base. A consequence of doing almost exclusively backend work for the last two years is that I’ve not been communicating updates or posting about my work. A lot of that is because as you can see with the length of this post, it’s hard to get into any of this without having to expose so much background and details that it felt impossibly intimidating to broach the subject. Now that I’ve published this retrospective write-up, I’m going to try to carry that momentum forward and attempt to be more publicly active than I have been.
I'll conclude this long post by thanking all of the players and builders that have, in some way or other, appreciated and contributed to this platform since 2013. Special thanks to Patrick O'Malley, without whom I wouldn't have had the confidence to push through this insanity, and huge thanks to our Patreon donors, current and past, who keep the lights on by covering server costs, and beyond that who provide me with a constant stream of gratitude and motivation.