Efficiently storing data for large populations.

I’m working on an ‘adventure life style simulation’ game similar in vein to daggerfall. One of the core aspects of it is that I’d like to have a large population that is also persistent (or at least semi persistent). I’ve come up with a way to procedurally generate any given denizen and assign them a unique 32-bit id. This id can then be used to re-create that same denizen at any time. For example, I can query the system with something like Population.Query().Age(MiddleAged).Race(DarkElf).Occupation(WeaponSmith).Random() and it will return me a unique 32-bit integer randomly selected from the range of valid integers that fit the query. This uid can later be used to map back to those exact same stats. So I can plug in Population.FromUid(id) and get back a data structure with the fields for age, race, sex, occupation, and homeland set appropriate to the original query. This makes it extremely efficient to find any given person in any given place because I don’t actually need to know anything other than where I am and what kind of person I’m looking for.

However, I’m at a slight junction where I’m not sure how to efficiently make use of this data at a more granular level. As an example, I’d like to use this system to ensure that for any given play-through the player would see the same shopkeep at the same shop every time. So it seems I’d likely have to query for this individual and then tie that unique ID to that specific shop. This starts to feel like a lot of data to preserve. For example with a population sample size of 60 million (roughly the population of medieval Europe) that’s nearly a quarter of a gigabyte if I were to end up using the full population range and tying each person to a single building. And that’s assuming one building per person. If they have a place of work and a place of residence that’s twice I need to store that id.

And that’s just to store that person’s association with a building. It goes further down the rabbit hole if I want to store state specific to that individual too. For example, what about death? My best idea so far for this aspect is to use a portion of the 32-uid as the actual id part and then use the upper bits to store additional meta information. However, I’m not sure how I could easily read or write that information that is once again efficient.

My pacled brain is mostly still thinking in terms of databases and tables but I feel like there is a much more clever way to handle this with much higher data density. Does anyone here have experience with such things or have suggestions to make?

You should think of what exactly makes that shop unique. It is likely the position on the map. So the shop’s world coordinate produces a hash, which you use to arrive to a valid uid.

The other thing to consider is that there are way less than 60 million shops, so you can probably associate each shop with a shopkeeper and have this in the hundreds at most. Who needs true scale in their games anyway?

1 Like

Hmm hadn’t considered that since I was thinking in terms of where they work AND live but I guess a combination of the two used as a hash would have the same result. I’ll have to rework my population system since it doesn’t work at that granular of a location (and indeed doesn’t use hashing at all).

You do make a valid point about the likelyhood of ACTUALLY needing all of that data at the start. I could likely just fill in a database as things are discovered. Perhaps do the same when states such as death are updated as well. There is the potential for filebloat there but I might be thinking at a scale much higher than the realistic use case.

Yeah so I decided to actually sit down and work out what kinds of numbers I would realistically be looking at. And the numbers are much more promising than I initially thought. Currently I have it planned for 5 regions. Each region is going to have at least 1 capital city and probably in the ballpark of 10 regular cities. Villages is looking to be in the range of 20-ish but I don’t know how I’m going to categorize a few other things so that might change a bit.

I’m thinking each capital will have a max population of about 100k. Each city will have 10k and each village will be 1k or less. Given the numbers above that comes out to a total population of about 220k people per region or 1.1 million in total among all 5 regions. I can easily fit that into 24 bits of data which leaves me with 8 bits left over for state info per denizen (I could even sacrifice about 60k off the total population and squeeze it down to 20 bits if I really wanted).

At 1.1million, using a 4-byte value per denizen (3 for the uid and 1 for the state) I can fit all of that into a space of just over 4 megabytes. Not tiny but certainly not as gargantuan as I was thinking. I could literally store 255 mutually exclusive states or 8 binary states (or some combo of the two) for every single denizen that could ever exist in the game with that. And if I decided to expand and add another region or two it wouldn’t be a huge deal as that 24-bit value has room for up to 15 million more uids! So if I decided to even double my population I could still easily fit all of that in only double the memory, which is about 8 megs.

Honestly I should probably just consider storing everything in memory and just dumping to a file on save but the idea of using sqlite as a databse seems nicer. I’ve never used sqlite on such large datasets but I can’t imagine it would struggle much with this. Obviously the file size will be much bigger than the raw data would suggest but I think worst case I might be looking at a few dozen megs? That doesn’t seem too big for a single savefile these days.

All of that being said I’m still kinda trying to see how much more I can generate procedurally vs storing in such a database. Take the earlier examples of workplace/living place. Each id is actually unique to the region it comes from but I don’t take into consideration what specific city/village they might have come from. Depending on how much of that leftover 8 bits is used for state I could probably spare some to map to the town of the current region. After that I could map occupation to buildings in sequence. For example, the first blacksmith generated always goes in the first smithy of the town. If there ends up being more buildings than NPCs I guess I could just make them vaccant - out of buisness or for sale or something lol Obviously this doesn’t take into account their home though. Only the place of business.

Okay, so I realized I was being a bit bone-headed with my previous post. There is absolutely no reason to store the Uid along with the state information since it is already implied simply by the index into the array in the first place. Essentially what this means is that if I limit my max population to say, 2million, I can store a 4-byte value per denizen in an array with a length of 2 million which takes up 8 megs but allows me to store 32-bits worth of data per entity. This is ideal since I probably don’t want this data packed into the uid anyway otherwise it becomes a pain to pull all of that data out for persistence. Ultimately this solves a lot of problems.

So far I’ve hard-limited the total population to 2,097,152 using a const value. I’ve also added a static array with that many elements of the type uint. I figured the breakdown can go something like this:

        //For a 32-bit state the breakdown is as follows
        //5 Bits    = Town Id       (32)
        //9 Bits    = Residence Id  (512)
        //1 Bit     = Death State   (2)
        //3 Bits    = Faction       (8)
        //7 Bits    = Workplace Id  (128)
        //1 Bit     = In Use        (1)
        //7 Bits    = UN-USED
        public static readonly uint TownIdMask          = 0x0000_001F; //           31
        public static readonly uint ResidenceIdMask     = 0x0000_3FE0; //       16_352
        public static readonly uint DeathStateMask      = 0x0000_4000; //       16_384
        public static readonly uint FactionMask         = 0x0003_8000; //      229_376
        public static readonly uint WorkplaceIdMask     = 0x01FC_0000; //   33_292_288
        public static readonly uint InUse               = 0x0200_0000; //   33_554_432
        public static readonly uint UnusedMask          = 0xFC00_0000; //4_227_858_432

My only real concern now is that I’m still stuck with manual iteration for finding a particular individual for a particular building. The suggestion above about using the specific building as part of the hash sounds nice in theory but is totally incompatible with how my data is structured and calculated. Essentially it’s just a recursive tree where each node represents a sub-section of the total population for that category. For example I split the world population up based on each region’s percentage of the total pop and store them in nodes with index ranges to represent the indices valid for that region. Each region then breaks down further by ages and percentages that each represent, then races, then sex, then occupation. I’m not sure I would even have enough resolution to actually break it down further by towns and individual houses and even if I did it would be a lot of work to mash that into the system at this point.

I’m thinking it might be best to run some performance tests and simply brute-force iteration. For example, if I need to know the shop keeper for a particular building in a particular city I’ll just iterate over every single element of the 2million-long array, mask out the bits that represent the town id and the building id, and look for a match.

Well, to solve dependency issues, any transient PCG scheme demands that you have a reliable hash seed to begin with. If you don’t design everything around this, then you can as well do a world-generation step where you initialize your world state. Thus you never mix the two approaches, but your world seems to be too big/populous for this, so I’m not sure if the latter is a good design.

You can, for example, partition your world into regions and provinces, and then declare “state codes” for each province based on the geography, which is practically a hash seed. You can then model demographics by implementing look up tables that probabilistically map state code to specific citizen hash blocks. You then produce citizens from these hash blocks, where each hash is technically the same as your uid.

That way you are free to reshape the look up tables dynamically, and thus modify demographics as the game world gets older and maybe influenced. Of course, you would have to design every little nook and cranny for such an elaborate machine, but planning lets you crunch a lot of data in the long term. And in your case, it’s not only that such data that big (megabytes are nothing anyway), but performance issues are more likely if you cannot rely entirely on local generation, and you are to maintain a complex inter-dependency structure and then constantly query over N dimensions to look for truth (treat dimensions as exponents really). That has to scale poorly no matter what you do.

Again, if you want to keep track of your denizens, to remember them in certain roles or certain duties, having specific inventories and whatnot, there is no other way but to associate them 1 to 1. Every such association thus becomes a burden on your system. This is exactly why you don’t see games that model that many people or that many locations etc. And frankly it’s boring, especially given it’s mostly random and essentially doomed to feel generic.

You can try and develop an idea where your denizens are much more transient, i.e. pop into existence only as a local information, but if there are certain characters you wish to promote into becoming persistent, you actually only treat these ones differently in your system. I.e. you might want to hook quests onto them, produce VIPs or have shopkeepers. This is a perfect compromise for a truly big world inhabited with millions of souls, most of which are really just a background noise anyway.

This way the model won’t ever scale into megabytes, and very much depends on the actual behavior of the player, especially if you introduce a way to recycle a pool of persistent people. I.e. a shopkeeper grows too old, dies, quits, or closes the shop, a VIP went onto a crusade or to another continent, a quest is done and its giver is no longer hanging in the local tavern etc.

1 Like