I just had another idea, but this one depends on how far back the server logs go...
You mentioned previously that the server logs the position of node placements (and removals), but not the orientation/metadata. For now, let's assume a theoretical scenario where the server logs encompass everything back to the beginning of time and everything else works out as expected (obviously not the case):
- Replay the node placement and removal logs against a freshly generated copy of the map. To save disk space, blocks are generated on demand, i.e. a block is only generated when it is first referenced in the log. The resulting map will contain all the correct nodes in the correct positions but with the wrong orientations. Let's call this map A.
- For each block in the backup, attempt to find its position by matching it against map A, comparing only the node strings and ignoring the node metadata. Put each matched backup block at the position found. Let's call the resulting map B.
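
To make the two steps above a bit more concrete, here is a rough Python sketch. Everything in it is an assumption on my part for illustration: the log is taken to be a list of `(action, pos, node_name)` tuples, a backup block is a list of 4096 `(node_name, metadata)` pairs, and `generate_block` is a hypothetical stand-in for whatever would run the map generator for a single block. It is not the real log or database format.

```python
from collections import defaultdict

BLOCK = 16  # a map block is 16x16x16 nodes

def block_of(pos):
    """Return (block position, index inside the block) for a node position."""
    bpos = tuple(c // BLOCK for c in pos)
    x, y, z = (c % BLOCK for c in pos)
    return bpos, (z * BLOCK + y) * BLOCK + x

def replay(log, generate_block):
    """Build map A by replaying the log; blocks are generated only when referenced."""
    map_a = {}
    for action, pos, name in log:
        bpos, idx = block_of(pos)
        if bpos not in map_a:
            map_a[bpos] = list(generate_block(bpos))  # hypothetical mapgen call
        map_a[bpos][idx] = name if action == "place" else "air"
    return map_a

def match_exact(map_a, backup_blocks):
    """Build map B by matching node names only; metadata is ignored."""
    by_names = defaultdict(list)
    for block in backup_blocks:
        by_names[tuple(name for name, _meta in block)].append(block)
    positions = defaultdict(list)
    for bpos, names in map_a.items():
        positions[tuple(names)].append(bpos)
    map_b, ambiguous = {}, []
    for names, where in positions.items():
        blocks = by_names.get(names, [])
        if len(blocks) == 1 and len(where) == 1:
            map_b[where[0]] = blocks[0]
        elif blocks:
            # Several candidates share the same node names: needs a human.
            ambiguous.append((where, blocks))
    return map_b, ambiguous
```

Backup blocks whose node names never appear in map A simply fall through and are not returned, which corresponds to discarding the (most likely unmodified) blocks.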
Now, in an ideal world, map B should be more-or-less 100% accurate (there may be some ambiguities in the case of blocks with identical nodes but different metadata and human verification may be required for these blocks). Any blocks that weren't matched against map A are most likely unmodified and weren't generated in map A and can therefore be discarded.
Because this isn't an ideal world, it obviously won't work out that perfectly.
But by using a closest-match algorithm rather than an exact-match algorithm in the second step, a fairly complete map should be constructed (more accurate/complete than the result from my previous post, with less processing and simpler steps required). Using a closest-match algorithm increases the average time taken to match one block but is possibly still feasible with some optimisations (multi-threading, processing blocks in batches to reduce the number of required database reads, etc.).
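
As a sketch of what "closest-match" could mean here, the simplest scoring I can think of is the fraction of positions where the node names agree, keeping the best-scoring block of map A if it clears a threshold. The `min_score` value of 0.9 is an arbitrary assumption, the data layout is hypothetical, and this is the brute-force version without any of the optimisations mentioned above.

```python
def similarity(names_a, names_b):
    """Fraction of positions at which two equally sized blocks hold the same node name."""
    same = sum(a == b for a, b in zip(names_a, names_b))
    return same / len(names_a)

def closest_match(backup_names, map_a, min_score=0.9):
    """Return (block position, score) of the most similar block in map A.

    Returns None if no block reaches min_score (an arbitrary cut-off).
    """
    best_pos, best_score = None, 0.0
    for bpos, names in map_a.items():
        score = similarity(backup_names, names)
        if score > best_score:
            best_pos, best_score = bpos, score
    if best_score < min_score:
        return None
    return best_pos, best_score
```

This compares every backup block against every block of map A, so the cost grows with the product of the two counts, which is why batching and multi-threading would matter in practice.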
The downside is that it won't work with blocks that have been heavily modified in ways that aren't captured in the server logs. That includes blocks heavily modified by falling nodes (sand and gravel falling is not logged), buckets (empty bucket use is logged, but lava and water bucket placement is not, and neither is lava cooling), TNT (TNT placement and explosion are logged, but not the exact nodes which were affected), saplings (sapling placement and growth are logged, but not the exact nodes which were affected), and countless other things that I don't feel like trying to guess right now (I could probably create an exhaustive list if I had to).
But maybe this could be considered as a technique to use alongside whatever other techniques can be found. I'd happily try this myself if I had the disk space and especially the internet (I could probably gather up the disk space but our internet is limited to 10 GB/month). I could run it on one of ExeterDad's computers/servers but that would require him trusting me to not screw him over...