« Some Notes on Xpages (or What I Did Over My Summer Holiday) | Main| Land of Confusion (or Listen to Me Demonstrate My Ignorance) »

Andrew's Allusion (or Creating Primary Keys in NSFs)


Andrew mentioned something I discussed with him regarding setting UniversalIDs (or UniqueIDs.... the jury is still out on what the proper terminology is after almost 20 years of production Notes apps) on documents.  Mr. Pollack likes the idea for the sake of maintaining document relationship structures across multiple non-replica instances, and I agree that it's good for that, but I find myself using it for a bit more esoteric a purpose lately.

I was inspired to look at this more closely by reading Andrei Kouvchinnikov's marvelous article on his DigestSearch toolkit at devWorks.  Andrei struck upon the most marvelous combination of ideas...

1) .getDocumentByUNID( ) is dramatically faster than .getDocumentByKey (or .search( )... ugh!)
2) .setUniversalID( ) can accept any string that is a valid UNID.  That is, anything that's a 32-character hexadecimal string.
3) The @Password function returns an MD2 hash variant that is cryptographically sound and.... to our good fortune... a 32-character hexadecimal string.

Since @Password is a hash function, it returns a unique value that is pretty well evenly spread over the available bit space of 2^128, that is theoretically unique for any given unique input string.   So if you @Password(@UserName), you'll get a hash value that is perfectly consistent to your user name, but all-but-impossible to reverse.  (Note for the OCD: yes, you can hash a dictionary and compare the hashes.)

In DigestSearch, Andrei allows you to specify any set of field names in a database, and run an index process against the values in them.  He then scans the database, identifies each value within those fields, and treats that as a search criteria.  For instance, FirstName = Nathan or Tags = l33tn3ss.  He then takes that search criteria, and performs an @Password, and gets a document with that UNID.  If the document doesn't exist yet, he creates it, and stores the UNID of the current document in an array in the digest document.

The effect is, given any arbitrary search term against a mapped field, one can determine the UNID from the string, rather than depending on a .getDocumentByKey.  In other words, it's not necessary to build the index.  You can determine what the UNID is directly, and then attempt to retrieve it.  If it's not there, then there's no documents matching your search criteria.

If I did this here at Escape Velocity, I could index every Title, Author and Tag on every entry.  If I searched for "Author = Nathan T. Freeman" I'd get a document with a UNID field that I could then walk through.  If I searched for "Title = I love Exchange", the UNID wouldn't exist, and I'd know that the result set was null.

So, Andrei is magnificently smarter than any of us.

But I started contemplating this and thought "Why should I only do this for digest searches?  Why not do it all the time?"

If, for instance, I'm building an NSF of documents that define URLs, why should I create arbitrary documents and store a field called "URL"?  Why not create the document, and set the UNID to be @Password(URL)?  (Note for the OCD: it's slightly more complicated than that, but if you understand this part, you'll understand the minor modification you need to make.)

Then, if I'm trying to find whether the document exists for a given URL, I don't have to build a view with a sorted column of URL, and then search the view for the key.  I just @Password(URL) and I know the UNID already!  All my searches become 5-10 times faster!

This becomes relevant in the context of Xpages because now that we can do crazy things like lookup values in a view column, scalability needs remind us that there's a big difference in server workload between looping through a bunch of @DbLookups and a bunch of @GetDocField formulas.

In other words, as data architectures change to take advantage of the more relational models that are now possible, our optimization approaches have to change.

The UNID of a document is essentially the primary key to it.  It's the fastest possible index because the data structure is centered around it.  (The on-disk structure is centered on NoteID, but that's a question of data's written to disk, not the logical structure of the data store.  And the performance difference is extremely marginal.)  If you build intelligence into that primary key, then you can get at it much faster than by other methods.  And that's a good thing in the world of more server-side looping than we've ever faced before.

I've already started using this approach in my applications.  It's working great.  So far as I can tell, there's exactly one side-effect: @Created no longer works.

You see, neither Andrei nor I were the first to build intelligence into the UNID.  It turns out, Lotus already did that years ago by embedded a timestamp into it.  So if you set the UNID to be whatever you need it to be, chances are that the @Created date that's reported will be some crazy value that your OS won't be able to actually report.

Fortunately, the response to that is: so what?  For 90% faster access to a given document, I'll set a requirement that I write the original creation time in a field.  It doesn't mess with the "Added to file" date, and it doesn't affect @Modified, so it doesn't appear to mess with replication.

I hope it proves valuable to you.

Comments

1 - I missed my bedtime last night by a couple hours owing to your giving me a sneak preview of this. Mind you not because of any editorial review burdens, but because I just had to put it into practice on an app I'm building.

I had just built a nifty little attachment controller feature for the Notes client (more to come on that) using an embedded view and some pretty cool hacks I picked up from folks like Rocky Oliver, Thomas Schulte, and Andre Guirard. Part of that solution is to create temporary documents to hold information about attachments that have been added but not saved to the main document yet. Where this hack comes into play is when I do the cleanup of those temporary documents (in both PostSave and QueryClose). My first approach (just to get things working mind you) was to use db.search, which as you point out isn't the fastest method. Before seeing this hack I would have probably gone ahead and created a lookup view but I'm glad I didn't have to. Emoticon

Thanks Nathan!

2 - Wow. That's cool. Really cool. Now I need to go think about all the ways we could be using this technique. Thanks, Nathan, you rock.

3 - Cool stuff - I've actually got a fair amount of experience with .setUniversalID and a similar approach. A couple of things that people should keep in mind:

1. setUniversalID (as of 7.0.2, haven't tested in 7.0.3 or 8.0.x yet) has some weird stack/memory leaks when run on large volumes of docs. It seems to work anyway, but you may encounter crashes or other weirdness after the process is finished.

2. Don't underestimate the speed of GetDocumentByKey. It's very, VERY fast. In fact, it's faster in Andrei's testing than DigestSearch when run locally. That's something to keep in mind for web-based apps.

3. Also keep in mind that speed will be impacted by what you want to *do* with the doc once you've got it, and *how* you get the doc will decide what options you have available. Because of that, GetDocumentByKey can end up being the fastest approach anyway.

Here's why: Once you've got a doc, nothing is faster than grabbing a value contained in a view using NotesDocument.ColumnValues(). Of course, you don't have this option unless you actually obtained a handle on the doc through a view (e.g. GetDocumentByKey). With a direct UNID-level search you are limited to .GetItemValue or something else that results in "cracking open" the document.

This can result in an overall slower algorithm even if the search itself ends up running faster, because getting the item values themselves is slower. This is especially true if you need to iterate over multiple fields.



4 - My only concern with this approach is it seems to be best suited for data structures where the piece(s) that comprise the document's now-meaningful UNID are known before the document is initially created... and never change. Which is why I think the URL example is perfect: if you want to store certain information about a URL in a document, and you're hashing that URL to get/set the document's UNID, then by definition that UNID would never again change. If, on the other hand, a user name forms part (or all) of the key, and that user's name ever changes, then the document's UNID needs to change in order to continue to be meaningful, which could cause replication issues. Suddenly you have a new document containing all the old data... so you'd need some logic for deleting the old document, stamping it obsolete, etc. I assume this could cause response hierarchy mayhem as well if not handled properly. So I've been mulling over ways to get easy wins with this technique, but am cautiously optimistic about being able to do so with minimal risk.

5 - Oh, nice. I must have a play with this.

Now... SUPER PEDANT MODE, ON!!!

"Since @Password is a hash function, it returns a unique value that is pretty well evenly spread over the available bit space of 2^128, that is theoretically unique for any given unique input string."

Erm, no. It's not theoretically unique. Even a perfect cryptographic hash will have a collision if it can hash values over the length of its output. So if you use @password 2^128 +1 times you are guaranteed to have at least one collision.

Of course the chances are pretty small. For example, even if you used @password 2^32 (~4 Billion) times, then the the chances of a collision are 1 in 2^32. If you only used it 2^16 (~65k) the chance of a collision are 1 in 2^48 (~280 Trillion)

Cryptography is fun. How often to get to play with export controlled weaponry ;)

6 - @5 - The possibility of a collision after 2^128+1 is why I mentioned the size of the bit space in the first place. kthx.

The cryptographic soundness of the algorithm is that the hash function doesn't favor any one part of that space over another.

@3 - GetDocumentByKey is, I agree, quite fast on stable indicies. However, if I combine your points 2 and 3, the proposed approach becomes even more important in high-transaction environments. If you do lots of writes, then .columnValues won't help you much -- you'll need the doc handle anyway, and it's going to dirty your view.

Please note that the other purpose here is to ensure uniqueness of the existence of that key. I can think of about a dozen situations where one would like to design something where there's one-and-only-one document that meets certain criteria. Sequential numbers anyone? Why not just @password("0000001"), @Password("0000002"), @Password("0000003"), etc?

7 - @6, It's not a possibility after 2^128, it's a guaranteed. You have the possibility of a collision from the word go. The probability just goes up and up till it's 1 in 1.

8 - @7 - Thank you for being needlessly pedantic. We don't get enough of that on the internets these days.

And if you're going to be pedantic, get the math right. The birthday bound of a balanced 2^128 space is 2^64, or about 1.4*10^19, before you have a greater than 50% chance of finding a collision. In a a space of 2^128, the expected number of values you have to choose before having a 10^15 chance (one in a million billion) is 820 billion.

If you're prepared to populate an NSF with 3.4*10^38 documents to get your guaranteed collision, go for it. If you were able to design clients that could write docs to the server at a rate of one per nanosecond, and you had a billion of those machines, you'd only need the lifetime of 269 universes to complete the task.

Good luck with that.

9 - @8, my, you're prickly this morning. Someone piss in your cornflakes? ;P

All I was saying is that it's not guaranteed unique. I then went on to point out that a collision it was very unlikely.

But thanks for the help with my maths home work, you shall have a gold star.

Emoticon

10 - @9,

Nathan "prickly" - never Emoticon.

11 - @9 & 10 - not prickly... just annoyed at the missing of the point. Particularly in light of the fact that there's several notes in the original post that mention that there's a little more to it in implementation than I'm mentioning here. I didn't bother saying that you actually need to use @Middle(@Password(key); "("; ")") as the formula either, because it's nitpicking and if you're an advanced enough Domino coder to need primary keys, you can probably debug some @formula, too.

If you want to generate a collision on UNIDs, there's easier ways to do it. Take two servers, offset their time by 2 minutes, then create a common replica on each. Set them to replicate every hour. Then generate new documents in local agent processes as fast as possible.

Because the UNID has an implicit timestamp, your odds of getting a collision are much better than with hashed keys. The minute you get a replication conflict, you had a collision.

This nitpicking about how many angels can dance on the head of a pin does not help advance the art, and simple generates an artificial reason for someone to think they can't take an incredibly easy approach to dramatically improving the performance of their applications.

Let's put it another way, shall we? If you have an anonymous form with 1 byte in the Form field and absolutely no other information at all, that takes 8 bytes of disk space. A UNID itself is 16 bytes. So the smallest on-disk requirement of a document is around 24 bytes. Given that the current limit of an NSF is about 7.03x(10^13) bytes, you can only generate about 2.93 billion records in a Notes database at all. So if you completely filled an NSF with the smallest possible records while using unique keys that were hashed to generate UNIDs, your odds of a collision are in the neighborhood of 10^-16.

Or yet another way: it would require 442,721,857,769,029,238,784 bytes of storage, or 442 exabytes of storage, to hold the number of documents you'd need to even have a 50% chance of a collision. That's only 88 times the total storage requirements for all words ever spoken by humans, or about 50 times the amount of information produced annually by the sum total of civilization in all print, film, magnetic or optical media.

So, sure... there's lots of risk involved.

12 - @11, And who said otherwise?

13 - @12 - You did when you got the math wrong in comment 5.

14 - @11, Also to clarify, I really do think this is a excellent technique. As the first line of my first comment noted, I'm going to have a play with this. All cool stuff, and I've used crypto hashes for uniqueness in a few places without a problem. So if you thought I was trying to argue against the technique I apologise.

In the mean time, are we talking standard angels or jumbo ones on the head of that pin?

15 - Fair enough. I just get frustrated because I've heard more than one IT exec try to claim that, for instance, sequential numbers are needed because they "guarantee uniqueness" where something like a 128-bit hash doesn't. Of course, these are typically the same guys who have "password123" on their accounts. So what the hell do they know?

I just want to have audience members be well-equipped on this point. When the Oracle guy says "you might have a collision on your key," I'd like the Domino guy to be able to point out that if the sum total of human civilization dedicated all their information-generation efforts to making hashed UNIDs, it would take 50 years to even get to an even chance.

16 - @16, On the maths point, I'm going to have to fire the researcher. I was using a note I took somewhere for figuring out numbers of bits needed for given numbers of key values and desired collision probability. It's complete bolox.
Emoticon

So thanks for drawing my attention to it. Emoticon

17 - lol. Well, let's be clear.... I looked it up. :) { Link } There's a nice handy grid there.

But I knew the odds weren't the kind of direct linear approach you were using. The likelihood of any two matches increases as an exponent. But that means it also decreases as an exponent.

Mostly, I think people just need to learn how big 10^38 really is. Emoticon

18 - Interesting stuff. I'm going to have to pull apart DigestSearch to really understand what it's doing. I'm thinking that combining this with HayStack could be extremely powerful.

19 - Something that is much more likely to happen than a duplicate hash value is a duplicate caused by deletion stubs in your DigestSearch index db.

I'm sure most people here know this, but if you delete a doc, that UNID is still in use. You won't be able to assign another doc to that UNID until the deletion stub has been purged, or you make a new copy of the DB (effectively removing the stubs).

You could also theoretically bump into a design doc's UNID, or a replication setting doc's UNID, the ACL, etc. You could have no code at all in the db, which would help, but it's still potentially a problem. Be careful.



20 - @19 - Dammit, Erik, you're better at this stuff than this!

Yes, a collision with a deleted doc is perfectly viable. And that means you used to have the key and have since deleted it. Personally, I'm not sure why you would have deleted it. I'm a fan of never deleting ANYTHING, but simply changing it's status. In an RDBMS, would you allow the deletion of a row, or would you simply set row #421 to a status of "inactive?" Yeah, I thought so.

But Erik's right, that if you're using a primary key strategy such as this, you should defensively code by looking for .IsDeleted, in case you used to have that key and for whatever reason it's removed from the database now.

HOWEVER, the odds of a collision with an arbitrary design element are so astronomically high as to be laughable. Which is EXACTLY the point I was making with Kerr. There's no need to "be careful." If you had 820 BILLION FOLDERS in your design, there would still be only 1 in 10^15 chance that you had a UNID collision. That is about the same probability as the uncorrectable bit rate failure of a hard drive. In other words, if you dedicated 18TB of drive space to doing nothing but writing the smallest possible documents at a rate of one per millisecond nonstop for the next 26 years into NSFs, you'd still have an 6-times-better chance of a meteor striking your house and killing you than you would of having a UNID collision on ANY document, much less specifically a design element.

I hope you people making these remarks understand that having anything but perfectly synchronized atomic clocks on every server and workstation results in EXACTLY THE SAME LIKELIHOOD of generating UNID collisions. The MD2 hash with ofthe rolling checksum is balanced, so it's not like the chances of a collision are less because the plaintext values are more similar.

I don't see anyone claiming we should have sequential IDs being assigned to UNIDs across multiple servers because we're at risk of random collisions.

21 - Trust me, I understand it. Emoticon The problem I see somebody running into is when they mass-delete their index docs for whatever reason -- "rebuilding the index", so to speak -- and attempt to re-create them without changing their hash-calculation logic. That would definitely result in massive UNID collisions unless they took the time to purge the stubs.

Of course, this is more likely to happen during the dev phase where they're just playing around, doing benchmarks, etc. But I wanted to mention it so that people taking the time to experiment with this method wouldn't just write it off as "it's busted, screw it."

I don't like to delete things, either. I'm personally a HUGE fan of setting the Form field to "Deleted", and setting a field such as "FormOld" to whatever Form used to be. Of course, with that lovely Optimize Document I-can't-even-remember-what-the-frak-it's-called-right-now SPR we've been talking about that method can be a bit flakey. Emoticon

22 - I think (particularly in large databases) this lookup method would be good for more than just the speed of individual lookups. It would mean having fewer lookup views, and every view in a database increases the total index size and the time required to update the indexes.

I'm currently working on a database that contains about 1.2 million docs and 145 views. If there was a fast and easy way of changing the UNIDs of all those documents, I could remove at least 10 lookup views (maybe more) and about 500MB of indexes.

23 - First things first: Great idea. And great you share it.

@Kerr, Nathan, Erik: all your calculations are based on the assumption of an ideal hash function, which doesn't have collisions for values less or equal in length as the result set's members (i. e. for 128 bit hashes no collisions for 128 bit input values).

To clarify my point: even a function, which maps every input value to the same value (say zero), could be called a hash function. But you get a guarateed collision with any two different values. A very bad hash function, indeed. Emoticon

The @Password function is certainly a much better hash function, and I don't even now the algorithm Iris has implemented, nor if they implemented it flawlessly. But I don't believe, it's an ideal hash function. Therefore the chance of a collision is certainly much bigger, but nonetheless negligible.

And even knowing the algorithm and its exact implemenation it's a quite challanging task to derive the exact distribution of the result set. Believe me, I have a degree in mathematics. Emoticon

Let us put those fruitless considerations aside, please, and concentrate on the possible uses and pitfalls.

Thomas

24 - Thomas, @Password is an MD2 variant that is specific either to Notes itself or to the RSA BSAFE implementation that they have. I'm not sure which.

The Westford security team is the authority on this stuff, but my understanding is that it's MD2 with an additional rolling checksum. That variation actually prevents it from being subject to some of the more recent deliberate collision approaches that plagued some other hash functions.

In any case, unless @Password and the built-in UNID generator actually run the same underlying code, I can't see any reason to believe that using this primary key approach is any more likely to result in an unintentional collision than the regular UNID process in Domino is -- particularly given the known flaws with UNID generation as it is.

Post A Comment

:-D:-o:-p:-x:-(:-):-\:angry::cool::cry::emb::grin::huh::laugh::lips::rolleyes:;-)

11 Aug 

Hire Me 

Lotus-911-Logo.jpg

Search 

Disclaimer 

Welcome to Escape Velocity!

Opinions expressed here by Nathan T. Freeman are not necessarily those of his employer. However, there's a decent chance they are, so check with them if you really want to know.

But really... do you need that kind of validation? Are the opinions expressed here in doubt?

MiscLinks