Comments

Hello everybody, my name is CCP Prism X and I am a Libra!

This summer we saw a welcome influx of players with the release of EVE Online: Odyssey. On occasions as such we usually celebrate our great success by pulling out the champagne and our $1000 Japanese pants, and having ourselves a great big pillow fight. However, this time around the influx of people was so dramatic that it worried us; Empire Space was burning. There were so many people in New Eden, doing so many things, that the actual CPU cores (reffered to as "nodes" hereafter) running the Empire Space systems were under extreme load. This was causing Time Dilation to affect the various, and disparate, portions of Empire Space. Time Dilation is all well and good when you‘re blobbing with your space-friends in Nullsec. But when you‘re just running a mission in some dead-end system, all alone in local, and suddenly everything starts going all bullet-time on you: you get frustrated.

Empire Space should really have such predictable load that Time Dilation should never kick in there. If it happens, our Static Cluster Premapper should at least be able to avoid a reoccurence. That is because we track an estimated load fingerprint of all solar systems and assign systems to different nodes, based on said estimated load, during startup. As long as that estimate is correct-ish and the premapper is functioning as it should, things should run fairly smoothly. But the same systems were consistently starting up on nodes with too many other systems assigned to them. Things were not running very smoothly at all and we weren‘t quite sure why.

† Assuming that no social engineering takes place that diverts load to new, unexpected, places.
Example: Announcing an upcoming LP offer from a corporation that only has a single good agent in high-security space will result in a lot of new load in that agents system which the premapper cannot in any way anticipate.

Step 1: Identify the problem

Above I omitted a fairly important fact about this premapping of solar systems: systems within the same constellation want to stay on the same node. They want it so bad that they‘ll do so at the expense of balanced load distribution as they‘ll ignore the, load wise, most optimal node assignment if their preferred node is only 20% more loaded than the optimal one. This behavior causes a huge deviation between the load of different nodes whereas we‘d rather want the premapper to spread the load as evenly as possible over all the nodes available to it. Furthermore this behavior doesn‘t extend above the constellation level so effectively a node can run two whole constellations, but one is located in the Deep North and the other in the Deep South. So a fleet fight in the Deep  South could enforce Time Dilation on the other constellation located in the Deep North. That‘s frustrating for the Deep North pilots!

Here is visualization of the clustering of systems to nodes in the old systems (Note, this is 3D data badly projected on a 2D plane so there‘s some skew). Systems of the same colour reside on the same nodes. You‘ll quickly notice that this is just a jumble of colours with little consistency. I‘ve drawn red circles around two clusters which all belong to the same node to show the distance between the two:

 

And to top it off this isn‘t even a good load distribution. The difference between node load is fairly staggering as you can see on the standard deviation values here:

But why were  we clustering systems on the constellation level? Because a long time ago it was theorized that inter-node stargate jumps were more expensive than intra-node jumps. Presumably, that is due to the assumed overhead introduced by network latency. Now we know that this assumption is completely wrong and inter-node jumps are actually cheaper than their counterpart. A stargate jump requires the removal of the player from one system, and his introduction into another one. Splitting that work between two nodes spreads load better than doing it all on the same node.

Realizing this, we just decided to remove any such node affinity behavior from the premapping logic as a first step. Systems would just be assigned to nodes on the basis of the most loaded, unassigned, system being assigned to the least loaded node. That resulted in an amazingly well balanced cluster and the load problem was solved! But even though we‘d made a very balanced cluster the Time Dilation problem was now even more pronounced. Systems had just been semi-randomly assigned to nodes with no respect to locality and so we still had people in the Deep North affected by fleet fights in the Deep South. Revert the changes and put the thinking cap back on!

Step 2: Solve the problem

So we needed a way to split load between nodes whilst ensuring that all systems on a node belong to the same space-neighborhood. As a secondary concern we really did not want to increase the time premapping took, and we used the opportunity to completely nuke the old T-SQL code and recreate it as a Python module. That last decision, unsurprisingly, turned out to be the most important one as a procedural language with a large supporting ecosystem of available tools is much better geared to solve a procedural problem than a highly limited relational language.

With Python at our side we initially arrived at a fairly simple and approachable solution. As we‘re attempting to split up a cluster of stars with defined coordinates so that both halves are adjacent and equally weighted; why don‘t we just do that? If you split any coordinate system in half with a plane, both halves will be adjacent to each other and that solves the problem of proximity. That leaves us with a way of ensuring that both halves are close to equal in their load. As we keep track of each systems load this just becomes a matter of tweaking the split line iteratively in some sensible manner. That‘s not more complex than using the properties of Binary Search! We just have to move the split line a given distance towards the more heavy of the two halves, and then half the distance we‘ll move it on the next iteration. Theoretically we should bounce between both sides of the optimal split line for a while until we hit it. I say theoretically because we‘re not trying to find the optimal solution here, only a good enough solution in the least amount of time.

Here is a graph to help you visualize this process. Imagine the circle stands for the entire cluster of stars we want to split up. The stars themselves are not shown because they're not important, all that matters is that the split will not mix stars from the two distinct halves between them and that the load is as equal as possible.

Once we’re done splitting the initial universe into two clusters, we’ll pick the heavier of those two and split it again. Then we’ll pick the heaviest of those three and split it again and repeat the process on the four resulting clusters of stars. We’ll repeat this until we have as many clusters as we have nodes at which point we commit the mapping to the database and continue on with startup.

To visualize this properly I’ve provided some visualization, using this method, run against the same dataset as the graph above. However, the first three steps are the split of Empire Space only due to the inner workings of the new method and only serve to describe the method better. The last picture is the final allocation of the entire universe (on the same data) and can thus be compared to the similar graph above.

Initial Empire Space step:

 

First Empire Space split. Note the size difference of the two halves due to the fact that not all systems are equally loaded:

 

Seventh Empire Space split (that grey blob is two hues of gray):

 

End result of entire universe. Note how much more uniform the color distribution is:

Step 3: Actually Solve the Problem

That’s how we initially pushed out the new cluster premapper a few weeks before Rubicon was released. We did so understanding what the more Computer Sciency types out there might already have realized from the above; this results in a balanced load spread if, and only if, the number of nodes available for allocation are a whole power of two. Anything else will result in a fairly unbalanced load spread as we’re always splitting load into two halves, creating a sort of a Binary Tree. For a perfectly equal split we must have enough nodes to fill the leaf level of that imaginary tree or we’ll have nodes on level leaf-1 with double the amount of load on the nodes on the leaf level.

Here is a very simple picture displaying the whole power of two issue with 3 nodes as opposed to 2^2=4 nodes. Please keep in mind that this is a perfect world example. We do not actually expect all load splits to be perfectly equal.

This resulted in a too uneven of a split. The statistics below do look better for Nullsec but in Empire we’re still looking at a fairly large standard deviation between nodes. But at least the CPU Minimum isn’t zero here. Sadly the memory stats just got plain worse, but we’re really balancing primarily for CPU.

Obviously the optimal split here is 33% of the load allocated to each node but you can see how the nice and simple algorithm outlined above completely fails here. But since we know this works optimally if we divide the load by a number of nodes that is a whole level of two we just need to force that behavior! We could of course constrain ourselves to always have a whole power of two number of nodes for solar systems but that feels like a rather silly artificial constraint. Especially if you know that any number in base 10 (decimal number system) can be represented in base 2 (binary number system), then this just becomes a matter of rearranging the binary tree above into branches of whole number of two nodes and running the algorithm detailed above on those branches.

Using a bit more complex example than the three nodes above, we can visualize this in action as thus:

This is easy enough to achieve with minimal alterations to the code used in the initial solution. That solution already relies on a method of splitting coordinate systems in two equally loaded parts. Altering it to split it into two X/Y parts is no big deal.

Using this new approach on the exact same dataset as the examples above we end up with this pretty picture of a balanced universe within which space-friends shoot other space-friends in their respective faces:

 

And now the CPU standard deviation has dropped drastically! We're hoping to have this code out by tomorrow Wednesday, December 4th.

Step 4: Write a Dev Blog

Then do it again. This is the second version of this Dev Blog. I’m pretty sure nobody would have enjoyed reading the first one. Sadly I couldn’t reuse a lot of stuff from that first attempt, like I could with the code. It simply had too many mentions of water carriers, aqueducts and Beyonce Knowles. Probably wouldn’t have made a lot of sense to anybody!

So there you have it; our recipe for a Balanced Universe. Perhaps you can find some way of applying it to your own lives! I’m thinking those fancy colored dot charts could probably substitute for a Rorschach test. But in case you just read through my entire blog and do not feel like you’ve gained much from it I’d like to present you with this cool island song as a peace offering (Don’t browse at work kids!).

That’s it folks! If you’ve got some further questions I’ll be looking at the comment thread regularly.

  • Cloora

    Holy shit I don’t understand any of this. Thank goodness there are people that do.

    • Dumbledore

      I asked Minerva and Severus and they don’t understand it either.

  • Gould Ko

    I see a Noctis heading north west. What really surprised me wasn’t the odd load balancing, it was the number of nodes they have. Honestly I expected more. Also, I’m assuming that a single node doesn’t resolve to a single server (or blade) but a CPU core (ie thread). Why else would they be worried about CPU load balancing across nodes when they would have assigned a CPU core to a cluster of systems.

    Are they running multiple systems on a single (time sharing) CPU core?

    • Aonus_the_blaster_maniac

      i called Best Buy to log into EN24 to answer your question.

    • Dirty Rotten Sneaky Bastard

      Something like that. That is, except for Jita, it gets 1-2 threads of it’s own.

    • Blu

      CPU Time is always limited no matter how much CPU “cores” you actually have, and the main problem here is the one thing managing the synchronization.

  • Ashesofempires

    This Dev Blog really needs some background information for the uninitiated:

    EVE’s actual systems are ran in groups on a single core of a server processor. The stats from 2007 are 60 blade servers, each blade running a pair of Opteron dual core processors, as well as a pair of super nodes used to power Jita. There are, presumably, backup blades used in case of hardware failures and to reinforce nodes for fleet fights. So, that’s 2 cores per processor, 2 processors per server, 60 servers, for a total of 240 threads. 5000 systems in EVE, load balanced across 240 threads; that’s an average of 23-24 systems per thread. Each thread takes about 30mb of memory with no load, and each server has 4gb available. Optimally balanced (hah), Tranquility is supposed to be able around 190,000 players under max load.

    This dev blog addresses the fact that their load balancing algorithm was, until now, shit. Which we’ve all suspected for quite a while now.

    • Dirty Rotten Sneaky Bastard

      Dual cores? Really?

      Intel Xeon E5-2690 130w 10-core processors @3.0 GHz are $2,149.00 each
      SUPERMICRO MBD-X9DR3-F-O SSI EEB Server Motherboard Dual LGA 2011 run $449.00 ea.
      Wintec 32GB (4 x 8GB) 240-Pin DDR3 SDRAM is 4 sticks @ $389.00 x2 per board = $778
      Multiply by 64 servers = $216,176.77
      That would give CCP 1,280 threads with 3.2GB per thread. A sizable investment to be sure, but not too big to give a large upgrade that would nearly eliminate TiDi completely and insure more player retention as well as drawing more players to Eve.
      CCP needs to do this or something similar soon before they start hemoraging players to other games. They would do well to remember that Star Citizen is coming and everyone knows that Chris Roberts doesn’t half-step anything.

      • Internet Lawyer

        It wont eliminate TiDi at all. Try reading the blog again. What this new node distribution does is minimize the chances of someone else’s TiDi from affecting you in a different system. Fundamentally, eve is still single threaded, and large numbers of players in one system will always generate TiDi – period.

        • Dirty Rotten Sneaky Bastard

          If you knew anything about IT systems, you would not have said that ( Me -Bachelors Computer Science, working on Masters in Cyber Security @ Univ of Ala.) More threads = less systems per thread = more load per system can be tolerated by the server. It would eliminate TiDi from all but the biggest fights. The blog above was about doing software tweaks to reduce TiDi without changing the hardware. However, software can only do so much. They really need to upgrade the hardware. “Better to remain silent and be thought a fool than to speak up and remove all doubt.” – W.C. Fields

          • Ashesofempires

            “It would reduce Tidi from all but the biggest fights.”
            No, it would not. The problem has always been that a fleet fight happens in a single thread. More cores/threads means that the average number pilots per thread goes down. It does absolutely nothing to help with nodes that are maxed out by a single system. The only thing that helps in that instance is higher IPC and I/O bandwidth.

          • Dirty Rotten Sneaky Bastard

            Once your core programming is re-written for multi-threading, it is a simple matter to introduce a sequence to enable thread switching and sharing.

            In layman’s terms, if a big fight in a system is maxxing out a single thread, it can switch more threads over to share the load. That is not possible with only 256 threads spread over 5,000 systems.

            But if they upgraded like I said above to 1,280 threads, then you would be able to have threads sitting in reserve to switch to where they are needed.
            So as you see, now that they are re-writing the code for multi-threading, they need to also upgrade their hardware to massively increase the available threads to take full advantage of this.
            Also, you seem to be a fairly young gamer, so I’ll spell it out for you.

            Back in the day, Chris Roberts was a giant legend in gaming circles, he is responsible for titles such as: the Wing Commander series, the Privateer series, Starlancer, and Freelancer. You like being able to go where you want and when you want in game and doing what you like, when you want? Thank Chris for that. He pioneered that in Privateer. Before that, you went where the game told you to go and when the game told you.
            When the man releases a title, he has already worked out the major bugs, you might find a few minor ones here and there.
            And yes, CCP needs to worry about Star Citizen. Chris Roberts is simply one of the best game designers ever. When he publishes a game, you can bet it will be cutting edge in many ways.

          • Internet Lawyer

            “Once your core programming is re-written for multi-threading”

            In an ideal world – yes. But it isn’t. The rest of your argument is irrelevant.

        • Blu

          “eve is still single threaded” Bullshit.

          AND Not with a better architecture. Before anyone tells me “nooo it will cause blablabla..” I already developed architectures way better suitable for multi client applications (similar to EVE Online) that can handle trillions of requests on a single node simultaneously. The problem for CCP is just the amount of money and time they would need to invest to migrate their systems.

          • Ashesofempires

            EVE’s server processes are written in Stackless Python. Python utilizes a feature called the global interpreter lock to prevent manipulation of values by more than one thread at once. While CCP has built a few workarounds like CarbonIO to help, the server processes themselves are single threaded. That’s why CCP’s choice of hardware seems odd on the surface but makes sense when you look at the hardware available and their specific needs. Multi-core servers are clocked lower, and CCP needs the highest possible instructions per clock setup available. that’s why they opted for dual core. There were no quad or larger chips available clocked at 3.33 GHz.
            EVE is nothing like a traditional transactional database. It has to deal with much more than a massive amount of server calls, it has to do complex calculation for a large number of players at the same time. Physics engine checks, inventory checks, ship stat changes, player state changes, crimewatch, NPC calculations, and so forth.
            It’s hobbled by the initial choice of programming language, which is inherently difficult to write multi threaded code for. Nothing short of a complete re-write of the sol node code will ever sever CCP’s dependence on high IPC server processors.

          • Blu

            Read my reply again. You point out the obvious. Eve is not single threaded, what you describe is when you use pure Phython on a single machine, then you say it yourself they utilize some other technique, which is in the core of it native calls. At the end everything is eventual sync. so you need to run this part together at some point, here you can apply that all code is only capable of a certain degree of parallel execution anyways, so as a solution you need a better architecture for the whole system, which again is a costly matter.

          • Internet Lawyer

            How about you stop arguing, start listening and go read some dev blogs about WTF is actually going on in the physics engine and exactly why a single solar system must run in a single thread in a single CPU.

            Unless you (as the expert computer programmer you say you are) feel you could rewrite several years worth of code into a new language this afternoon.

          • Ashesofempires

            Watch this AMA by CCP Veritas. In it he explicitly states that EVE is single threaded.

            http://www.youtube.com/watch?v=UcEUB6h4Br0

            For those too lazy to sit through the whole thing, He basically says “EVE is lightly threaded in the client (two threads), single threaded on the server, and the reason they haven’t made much in the way of progress in hardware is because the chipmakers themselves have stopped pushing speed and started working on efficiency and high core counts.

            Here’s the Dev Blog on CarbonIO, and the explanation that puts to rest, once and for all, your idiotic assertions that EVE is multi-threaded.

            http://community.eveonline.com/news/dev-blogs/2332

      • Bawk Bawkbagawk

        yes, star citizen is coming, in 2015, maybe. it isnt the boogeyman, it is just a pile of socks.

        • Ashesofempires

          The game sounds like vaporware to me, but I’m willing to bet that when it releases it’s about as feature complete as EVE was in 2003. Which is to say, not very.

      • the one

        bahahaha star citizens never coming out its just a big scam for chris roberts to elope somewere without extradition and sun himself forever

    • SM Farley

      Nice detective work, seems they should upgrade the hardware since the total population of EVE is increasing steadily.
      Upgrade and Prosper!

  • Robo

    Erm whaaa…

  • Bawk Bawkbagawk

    “Assuming that no social engineering takes place that diverts load to new, unexpected, places.

    Example: Announcing an upcoming LP offer from a corporation that only
    has a single good agent in high-security space will result in a lot of
    new load in that agents system which the premapper cannot in any way
    anticipate.” maybe not but the fucking PEOPLE who set all this up should have and CCP knows that very well. morons.

    • SM Farley

      I think he was being sarcastic. The Developers and Designers are always synced, and I assume the agents were placed for the SOE corp long ago (say 7-10 years ago) and the Designers/Devs of today were probably unaware of the fact that there was only 1 level 4 agent there. That being said it is still there issue that affects thousands of pilots and I applaud then for not delaying the fix of these situations. Frankly I have a lot more respect for CCP in the past couple of years then I ever did since launch. I was there, Eve was not Real!.. Now its pretty freaking good and is currently the only game that i play on the regular.

      • Bawk Bawkbagawk

        did you take cold medication before you typed this?

  • Dirk MacGirk

    wtf – I can’t read all this. Did it solve anything? Or is the guy running missions alone in a hisec dead-end system still unhappy?

  • Confused

    I am sorry, experienced tidi whilst reading this and now i am 128 years old, whats a cummpotar?

    Think we all just want a game where we can enjoy to the maximum of it’s potential. Just improve your hardware please.

    One thing i did notice is that when you first login it does take a while to jump into a system that
    hasn’t been jumped into yet, i assume this is because it is loading, how abouts purging the system again if no one jumps through it within say 15 mins, or if peeps in station just hanging out, no one in space, do the same which would result in a slight delay in undocking, not sure of effect, but every little thing helps i a sure.

    • N3_Grunt

      The hardware is not the problem here. The legacy code is written for a singlecore processor. They can have the best hardware in the world, and you would not notice a single improvement.

      Rewriting this code to support multicore processing will solve alot from what I know. But then again, I am no programmer so my comment is not set in stone.

      • Ashesofempires

        This. I think a lot of people don’t know or don’t remember that up until 2005 the dominant trend in CPU design was for high GHz single core, hyperthreaded processors (Pentium 4), and that plan didn’t work out so well.

        EVE was designed before the rise of multi-threading. It’s unfortunate that CCP chose a programming language that’s really shitty at multi-threaded processing.

        • Fartolio

          Means not it should stay in the dark past, isn’t it?…or am i wrong?

          • N3_Grunt

            You are right, it should not. But it is a very big undertaking rewriting legacy code of a game that is over 10 years old. This is not done over night, and might take months if not years to fully rewrite to todays standards.

          • Ashesofempires

            It’s an absolutely monumental undertaking, and not something to be decided upon lightly. Look how many times they fucked up the load balancing algorithm before they got it “right.” The biggest thing is to maintain synchronicity (is that a word? i hope so) across all of the actors on a node, from player ships, to drones, to npcs, to starbases and stations and even across nodes. If one thread communicates to the backend database that a player has warped, but a different thread says his ship blew up, suddenly you have an actor that is both dead and alive. How do you resolve such a problem?

            High load tends to cause desync issues as well; if anyone here was around for the original WoW launch (yeah yeah) you would remember how bad the de-sync issues were. Multi-threading an individual star system would mean that their load balancer would have to be a lot more dynamic than “10 minutes later it will rebalance.” If not, you’ll end up with situations where a portion of the players will be able to interact fluidly with the server and others are stuck in limbo, or end up desync’d.

            From a technical, engineering standpoint, it’s a task I would cringe at. I hope CCP Veritas is working on it (if he even still works at CCP) because he’s a damn fine software engineer.

          • Fartolio

            10 years of non-evolution is a long time sir…

  • renoirdana

    Great post! I am interested in the details, as my day job is software architecture of systems, and it’s great to see what you’re working on. It really helps to explain the unique challenges of eve – and how the entire universe really is one place.

  • Titshit

    TL’DR

  • I think CCP have said themselves that -“Rewriting the legacy code would be a very difficult and time consuming task and they expect if and when they do it they will come across many problems”
    It is also my understanding that they will have to do it eventually.
    I think they are prolonging the inevitable if were honest. Especially if they honestly expect another 10 years of eve. However I am no computer coder and I know it can be difficult. Just recently I have been playing around with and looking at very very minor code in regards to playstation emulators for PC and that had my mind in a twist so god knows about all of this.