
The definitive case for segregation


No, not THAT kind of segregation!  I mean segregation of data.

Ever since Notes first got calendaring & scheduling features, a few yellow bleeders have questioned the wisdom of putting calendar events into the same data store as email.  As features have progressed, this practice has gotten more prevalent, with Contacts and Chat histories now stored in the mail file, and some asking for RSS feeds and Journal entries to be included as well.  There are two reasons for this strategy...

1) It's easier for an administrator to manage a single user's presence within the Domino domain, because the directory and client configuration are engineered to automatically associate a single user with a single mail file, so user maintenance via servers is much simpler;

2) It's easier to write something like DWA, because cross-NSF references on the server are unnecessary, and Domino HTML performance is generally better with fewer NSF handles open at the server.

These are both good reasons.  Item 1 definitely makes it easier for admins to deal with backups, user migrations, restores and disaster recovery.   Item 2 helps scalability on the Domino server.

But at what cost?

Let's take two scenarios.  In the first, we'll imagine all personal data for a user is in one NSF.  This is pretty close to how things are today.  If you crack open the 8.0.1 mail template, you'll see there's design for messaging, C&S, contacts, groups, chat and journals.  The possibility exists that RSS feeds might be added to that as well.  So let's imagine that all PIM functions have been merged into one colossal template, that has everything a growing user needs.

In the second scenario, imagine all personal data for a user were split into purpose-specific NSFs.  This is the user's PERCEPTION of Notes as it exists today, with one store for Mail, one for C&S, one for Chat History, one for Contacts & Groups, one for Journals and one for RSS.  Nothing about the UI suggests that these areas of functionality are contained in the same data store.  In fact, much of the functionality goes out of its way to make it appear that calendars are completely separate from To-dos, which are completely separate from mail.  So we have no issues like "a view can't show documents from multiple databases."  The PIM UI segregates all of it for us already!

Let's further imagine that a given user has been working with their data for a while, and is fairly typical about how they maintain it (i.e., they don't).  Let's say PIM functions have a 5 year use history for this user, and chat & RSS have a 2 year history, with some rough assumptions about daily traffic...

                  Years   Per day
Email messages      5       25
Contacts            5        1
Appointments        5        4
To-Dos              5        3
Feed Stories        2       80
Chats               2       10
Groups             n/a     n/a
Configs            n/a     n/a
Journals            5        1


After 5 years, this user has a mail NSF with about 98000 documents in it.  Pretty close to 100K.  I bet you've got plenty of users in your environment whose mail files look like this.
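
The arithmetic behind that figure can be sketched in a few lines of Python. Note the assumptions: the 280 active days per year is not stated anywhere in this post; it is simply the value that makes the rates in the table reproduce the per-type document counts used in the later tables. The flat counts of 20 Groups and 30 Configs are likewise taken from those later tables.

```python
# Rough document-count model for one user's all-in-one mail file.
# ASSUMPTION: 280 active days per year (inferred from the numbers, not stated in the post).
DAYS_PER_YEAR = 280

# type: (years of history, documents per day), from the traffic table
TRAFFIC = {
    "Email messages": (5, 25),
    "Contacts": (5, 1),
    "Appointments": (5, 4),
    "To-Dos": (5, 3),
    "Feed Stories": (2, 80),
    "Chats": (2, 10),
    "Journals": (5, 1),
}

# Groups and Configs have no daily rate; these flat counts come from the later tables.
FLAT = {"Groups": 20, "Configs": 30}


def doc_counts(traffic=TRAFFIC, flat=FLAT, days=DAYS_PER_YEAR):
    """Return a dict of document counts per PIM type."""
    counts = {name: years * per_day * days
              for name, (years, per_day) in traffic.items()}
    counts.update(flat)
    return counts


if __name__ == "__main__":
    counts = doc_counts()
    for name, n in counts.items():
        print(f"{name:15} {n:>7}")
    print(f"{'Total':15} {sum(counts.values()):>7}")
```

Change the rates or the history to match your own users and the total moves accordingly.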

A lot of admins that I talk to are quick on the draw to say stuff like "if this is all in one place, then when a user's database gets corrupt, we only have one file to restore."  Well, sure.  Of course, that leads to the question of "why did the user's database get corrupt in the first place?"  Could it be that there are 100,000 documents and 2000 design elements crammed into one store?  Seems likely to me.

It's certainly true that the single store means that there's only one file to replicate between a mobile client and the home server.  But this is a matter of engineering on the Domino server.  There's no technical reason why your personal address book, journals, RSS feeds and any other NSF you have couldn't replicate back to your home server.  In fact, this architecture exists today.  It's called Roaming, and while it's not bulletproof, when it's properly configured it provides great administrative success for customers.  It could be better, and Lotus has certainly committed to making it better.

So the administrative burden of many versus one NSF for users is a bit of a red herring.

What about the other reason?  The NSF handles needed by the server?  What if using DWA on the server required 3 or 4 NSFs to be open?  Wouldn't that impact scalability?

Sure, it probably would.  But what's the impact on scalability on the server today?  The secret, I think, lies in the Indexer.

There are 98 views that make up the PIM features of Notes 8.0.1...

                  Views
Email messages      23
Contacts            17
Appointments        17
To-Dos               9
Feed Stories         5
Chats                1
Groups               7
Configs             18
Journals             1
Total               98


My question is: what percentage of indexer time and effort is spent figuring out what data to EXCLUDE from any given view?  Well, that's a simple matter of a spreadsheet calculation.  You'll find my spreadsheet as SEGREGATION.ODS in the downloads section of this blog.

                  Views   # Docs   Excluded   Incl Evals   Excl Evals
Email messages      23     35000      63050       805000      1450150
Contacts            17      1400      96650        23800      1643050
Appointments        17      5600      92450        95200      1571650
To-Dos               9      4200      93850        37800       844650
Feed Stories         5     44800      53250       224000       266250
Chats                1      5600      92450         5600        92450
Groups               7        20      98030          140       686210
Configs             18        30      98020          540      1764360
Journals             1      1400      96650         1400        96650
Total               98     98050                 1193480      8415420


That's about 9.6 million evaluations needed to figure out what docs show in what views.
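
For anyone who would rather not open a spreadsheet, here is a minimal Python sketch of the same calculation. It assumes the simple model used in the table above: every view evaluates every document in the NSF once, either to include it or to exclude it. The function name is mine, not anything from Domino.

```python
# (views, docs) per PIM type, copied from the table above
PIM = {
    "Email messages": (23, 35000),
    "Contacts": (17, 1400),
    "Appointments": (17, 5600),
    "To-Dos": (9, 4200),
    "Feed Stories": (5, 44800),
    "Chats": (1, 5600),
    "Groups": (7, 20),
    "Configs": (18, 30),
    "Journals": (1, 1400),
}


def all_in_one_evals(pim):
    """Return (include_evals, exclude_evals) when all types share one NSF."""
    total_docs = sum(docs for _, docs in pim.values())
    # A view "includes" the documents of its own type...
    incl = sum(views * docs for views, docs in pim.values())
    # ...and has to look at, then reject, every other document in the store.
    excl = sum(views * (total_docs - docs) for views, docs in pim.values())
    return incl, excl


if __name__ == "__main__":
    incl, excl = all_in_one_evals(PIM)
    print("Include evaluations:", incl)
    print("Exclude evaluations:", excl)
    print("Total evaluations:  ", incl + excl)
```

Under this model the total is just the number of views times the number of documents (98 x 98050), which is why every extra document type in the store makes every view more expensive.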

Now let's look at what happens if we split up the data into logical NSFs that represent more of the user experience.

                  # Docs   Excluded   Incl Evals   Excl Evals
Email messages     35000          0       805000            0
Appointments        5600       4200        95200        37800
To-Dos              4200       4200        37800        71400
Feed Stories       44800          0       224000            0
Chats               5600          0         5600            0
Contacts            1400         50        23800         1250
Groups                20       1430          140        50050
Configs               30       1420          540        34080
Journals            1400          0            1            0
Total                                    1192081       194580

(Appointments and To-Dos share one NSF of 9800 docs; Contacts, Groups and Configs share another of 1450 docs.)


You're reading that correctly.  The total number of evaluations is just under 1.4 million.

So if we look at building collections for the view indexer, the all-in-one strategy makes the indexer do roughly SEVEN TIMES the work!!!
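
Here is a rough Python sketch of the comparison, for anyone who wants to try other scenarios. It assumes the simplest possible model: each view evaluates every document in its own NSF, and nothing else. The grouping of types into NSFs (C&S together, Contacts with Groups and Configs) is my reading of the second scenario, and the store names are made up. This sketch will not reproduce the split table's figures exactly, since the spreadsheet evidently counts the shared stores differently; it comes out around 1.35 million, and the ratio still lands near seven to one.

```python
# (views, docs) per PIM type, from the tables in this post
PIM = {
    "Email messages": (23, 35000),
    "Contacts": (17, 1400),
    "Appointments": (17, 5600),
    "To-Dos": (9, 4200),
    "Feed Stories": (5, 44800),
    "Chats": (1, 5600),
    "Groups": (7, 20),
    "Configs": (18, 30),
    "Journals": (1, 1400),
}

# ASSUMPTION: hypothetical store names and grouping, based on the second scenario.
SPLIT = {
    "mail": ["Email messages"],
    "calendar": ["Appointments", "To-Dos"],
    "feeds": ["Feed Stories"],
    "chat": ["Chats"],
    "contacts": ["Contacts", "Groups", "Configs"],
    "journal": ["Journals"],
}


def evals(pim, layout):
    """Total view evaluations: each view scans every doc in its own NSF."""
    total = 0
    for types in layout.values():
        views = sum(pim[t][0] for t in types)
        docs = sum(pim[t][1] for t in types)
        total += views * docs
    return total


if __name__ == "__main__":
    combined = evals(PIM, {"everything": list(PIM)})
    split = evals(PIM, SPLIT)
    print("All-in-one:", combined)
    print("Split:     ", split)
    print(f"Ratio:      {combined / split:.2f}x")
```

Edit SPLIT to try other groupings, such as giving To-Dos their own store.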

Sure, there's more to the indexer than figuring out what docs to include & exclude; there's also collating and summary representation.  But even if we heavily discount the difference, it seems likely that segregating the PIM data stores in Domino would yield at least a 50% improvement in indexer workload.

Is the indexer a bottleneck on your mail servers?  It's a silly question, isn't it?

Don't take my word for it, though.  Download the spreadsheet.  Look at the templates.  Come up with your own scenarios.  Maybe someone better than me with productivity apps can come up with trend graphs or something.

But please, for the love of all that is sacred and holy, stop asking Lotus to cram everything about a user into a single NSF data store.  The processor matters.

Comments

1 - It's an interesting idea. Seems to me that this is the type of situation that document table optimization was designed to address. Of course, because of the way the view selections are programmed in the mail file, it cannot take advantage of this feature. AS YET.

The problem with breaking these up now is that there are a lot of applications out there that go looking for the mail database and expect to find all this stuff.

2 - I agree wholeheartedly. The actual mail template doesn't bother me too much, but that's just because I don't have to deal with it other than as a user. If they did split it up, though, then hopefully we would see some nice features that would help in multi-database, single-logical-app scenarios.

3 - This is all very interesting theory, but have you actually MEASURED this? Reality has a nasty habit of being very different from theory. If you could come up with a reasonable way of measuring this and got the numbers you expected, you would have a very strong case. But then again, you might find when you do real measurements, the overhead savings from splitting up the data store are not significant.

4 - I'm really getting annoyed with the party line of backwards compatibility at all costs. Progress means changing and adapting. IBM has to do what's right by their customers. If third-party vendors can't support more recent releases that's not IBM's fault. It's happening now anyway, with Blackberry not officially supporting R8 yet (afaik).

5 - To Andre's point, things like @MailDbName and @NameLookup would have to be carefully rethought.

Unfortunately I think it's probably unavoidable that lots of existing custom application calls will have to be rewritten to accommodate such an architectural change. Given that this will dramatically violate the "backward compatibility" standards to which Notes has tried to adhere, this will probably have to be an optional as opposed to default configuration.

Of course, if some future release of Notes is able to seamlessly handle the concept of multi-nsf applications, maybe this won't be such a problem. For example, let's say Notes 9 allows an "application" to be defined as a primary nsf with multiple "pointers" to secondary nsfs, all addressable through calls to the primary one. That way existing calls to the mail db would work just fine, and references to some view would in turn be pointed transparently to the secondary nsf which contains it.

As I think about it, the "primary" nsf would almost have to consist of nothing BUT pointers to elements in its secondary nsfs. How else could you avoid the problem of duplicate view names?

I dunno, maybe there is a way to compartmentalize a single nsf to avoid the current indexing overhead.
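
The "primary nsf made of nothing but pointers" idea in this comment can be illustrated with a toy Python sketch. Everything here is hypothetical: `PrimaryStore`, the file names and the view names are invented for illustration and do not correspond to any Notes or Domino API. The point is only that a pointer table keyed by view name forces duplicates to be caught when a secondary store is registered.

```python
class PrimaryStore:
    """Toy 'primary nsf': holds only pointers from view names to secondary stores."""

    def __init__(self):
        self._pointers = {}  # view name -> name of the secondary store that owns it

    def register(self, store, view_names):
        """Attach a secondary store, refusing any view name that is already taken."""
        for view in view_names:
            if view in self._pointers:
                raise ValueError(
                    f"view {view!r} already provided by {self._pointers[view]!r}")
        for view in view_names:
            self._pointers[view] = store

    def resolve(self, view):
        """Return the secondary store that a call against the primary should go to."""
        return self._pointers[view]


if __name__ == "__main__":
    primary = PrimaryStore()
    primary.register("mail.nsf", ["Inbox", "Sent"])
    primary.register("calendar.nsf", ["Calendar", "To-Dos"])
    print(primary.resolve("Calendar"))
```

An existing call that asks the "mail db" for a calendar view would be answered by looking up the pointer and forwarding the request.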


6 - @3 - Chris, are you volunteering to help with the testing effort? The amount of work would be.... significant.

But having built several dozen Notes application suites over the last 15 years, I'm quite confident that I'd be proven right. View overhead in an NSF is notorious for being a performance killer.

@4 & 5 - I agree that there would be some cost to the API changes. I'd point out that IBM is already undertaking this with directory federation, which is why we have all those new NotesDirectory objects in Lotusscript.

There's a reason why it's important to look at this question NOW. Unfortunately confidentiality prevents me from being able to say more.

7 - Since I do both Admin & Dev, I dither back and forth on this. From an ideal dev perspective you are perfectly correct.

You'd think that processor matters, but I thought memory and disk space mattered, too, and we've proven with Notes 8 that that is not the case. Why is processor special?

When designing systems, you make tradeoffs. adminCraig would gladly trade some processor for simplicity.

We have to be stewards of our users'/clients' data. Fewer dbs means fewer things we need to keep track of. On a server with 1500 mailfiles you're talking about a lot of NSFs.

Without a rocksolid and reliable way to assure that the proposed collection of NSFs is replicated and backed up to the server, it's hard to go with this.

Since I'm split, I have two ideas out on ideajam that cover both sides of this debate:

Although neither has the magic 1000 votes, it's not even close.

{ Link }

{ Link }

8 - I wasn't going to say anything but upon further reflection.

A post with that subject posted around April 4th, esp. THIS April 4th (40th anniversary of 04/04/1968), is beyond what I would consider appropriate.

<SET SOAPBOX=OFF>

9 - @8 - Sorry, Craig... the date hadn't even occurred to me. No offense was intended.

As far as technical content goes, "adminCraig would gladly trade some processor for simplicity."

That's the thing: the present design is horrendously complex. It's not simple at all, because managing the data silos is basically left up to View selection formulas, which are a relatively expensive way to do it.

10 - @5 - I like your idea Kevin. I think it would have to be sort of a "bookmarks.nsf on massive steroids".

This actually sounds like something worth playing with. A special nsf that is nothing more than a set of configurable pointers to design elements in other nsfs.

hmmmmm, interesting indeed.


11 - Something else that makes the "cram it all in one" even less desirable is this often overlooked fact:

DOMINO LOCKS THE ENTIRE DATABASE WHEN IT IS WRITING *ANYTHING* TO IT.

Did you add a doc to a DB? The entire DB is locked while that happens. Everything else queues up until that doc write is complete.

Then when the indexer decides it's time to update a view, it LOCKS THE ENTIRE DB until that view update is completed.

When properly designed, 100 dbs with 10,000 docs each will far out perform 1 db with 1,000,000.

12 - @11 Agreed.
But what happens if the db with 1,000,000 performs at an acceptable rate?

There is a danger in over-optimizing one thing and leading to other issues.

For instance, if you take one mailfile and split it into logically 4, you'd end up with a server that used to have 1500 mailfiles now having 6000. And 4500 of them would have to be replicated to users' desktops.

That's some significant replication overhead that would have an impact on processor, I'm sure.

13 - @12 - Client-based replication requests are driven by the client. All the processing takes place AT the client, except for ACL checking (same in both cases) and the response to the node modified collection, which is pretty close to free in either case.

Of course, with data segregation, I might elect to only update my calendar and not update my email (perhaps because I don't have time to deal with large attachments at the moment, and I get alerts on my Blackberry anyway.) Is there an easy way to allow the user to make that determination today?

And yes, you'd have a server that has 6000 NSFs instead of 1500. Of course, they'd all be nicely organized, just like mail files typically are. They'd tend to be substantially smaller. They'd be easier to compact, since most of the time, only your email files really require compaction, so you wouldn't need to dig through lining up all those calendar entries or contacts. Backups would be easier because the need to have files other than the mail with open handles would be reduced.

And of course, if you combine this whole idea with Summary compression and DAOS, your storage requirements could be reduced dramatically.

Anyone ever thought that it might be a lot easier to optimize your server architecture by putting Calendars & Contacts on a separate disk platter from email? That's impossible today. Segregate, and it becomes a no-brainer.

So let's see if I can sum up... on the side of splitting the Dbs, we have reduced file size, reduced indexer load, reduced design complexity, storage architecture flexibility, replication flexibility, and improved administration options. On the side of keeping them together we have reduced absolute file count (irrelevant if you implement DAOS, by the way,) less work for Lotus, and the fact that it's how it is today.

That seems to pretty much sum up the two positions to me.

14 - @12 - If the db performs acceptably, great. I've got a gargantuan app here that has both scenarios: a 1.5 million doc, mostly static db used for lookups, and thousands of "split", sets-of-6 dbs. Both fly. But if my sets-of-6 were combined in one db, I'd be dead in the water.

They're all nicely organized, easy to compact, easy to fixup/updall/etc... it's a no-brainer setup.

The only disadvantage (and I'm a bit shocked to see Nathan change his position on this @13) is design complexity. A multi-NSFed application is harder to work with design-wise. E.g.:

1. There is no cross-Db @Formula equivalent for @SetDocField.

2. The cross-DB equivalent for @GetDocField is @DbLookup, using UNID as your key. This unfortunately requires you to keep a "UNIDs" view in each DB so the DbLookup has a view to use.

Of course, IBM/Lotus could likely address both of these shortcomings REAL quick.

Admin-wise I would argue that it's easier to deal with sets in most cases, but there are disadvantages in dealing with sets. Of course, if you roll-your-own admin tools to help out this disappears quickly. IBM could solve this problem too, and Nathan has written about this a million times.

Oh, and don't forget that with split DBs you can have *multiple* indexers indexing both your mail and your calendar at the same time. That's huge.

15 - @14 Design complexity is increased in multi-NSF implementations *if the NSFs interact.* But in PIM apps, they don't! The data is already silo'ed between PIM functions. You don't see a single view with emails and to-dos at the same time (though you should, because follow-up flags and To-dos are redundant.) The "copy into new" is already in Lotusscript, and it would be absolutely trivial to point that to an abstract NSF location instead of the current NSF.

The net effect of the "all in one" strategy is that the template itself is hugely complicated, with dozens of forms and subforms and views. The design simplification would be that EACH SEGMENT would only have the design elements needed for that functionality base.

How much better would the high-volatility environment of email work if there were only, say, 10 forms, 15 views and 5 subforms? Wouldn't that make things like, say, email customizations easier to manage?

Hell yeah it would.

And IBM should be responsible for writing the NSF set management tools. (They should have written them 10 years ago, when customers first started asking.) And in some sense, they already have, because THIS IS EXACTLY WHAT ROAMING USERS DOES. So fix Roaming users and the administrative problem goes away.

16 - @15 - Wow, I hadn't dissected the template so far as to see that they (the PIM functions) don't interact in the slightest.

In that case it's absolutely ridiculous to lump everything together, except for the admin pieces. And those aren't really *that* big a deal.

DOLS already somewhat deals with this too, via the subscription model -- one "subscription" has multiple NSFs, and replication is done as a batch. The user has no clue that there's multiple dbs unless they happen to watch the detailed logging as things sync.

DOLS is slightly busted, though, because it stores everything in this "Sub_x" folder nonsense, making it impossible (literally) at times to determine filepaths to DBs. IBM could fix that, though, if they wanted.

I haven't done much with Roaming Users, though. How are those broken?



17 - @14 - As I said in my ideajam post - it's cool with me as long as the tools are there to properly manage and protect the data.

We haven't seen them in 10 years, and they don't contribute to portal adoption, so I am in a position where I'll be happy when I see them.

18 - Regarding the increased (multiplied) number of databases: doesn't the server keep them all open in memory and reserve some resources for each ?

19 - Wow, good topic...I just found this...glad I subscribed on RSS.

I'm thinking outside the box a little on my response:

I've always treated an .nsf as being like a mini file system. Each document is like an OS-level file, and these are stored in a UNID table that's a lot like the OS-level File Allocation Table (FAT).

As you get more and more files on disk, the FAT gets bigger and disk activities get a bit slower. This can be helped by doing a defrag, much like we do a compact in Domino. We have fixup, the OS has scandisk.

How do we improve things at OS level and make things faster and more efficient? Would we put the OS and the application on the same disk? No!

We split it all into partitions. Even if stored on the same physical disk, a partition can help a lot. You can move a hard drive from one machine to another, all the data comes with it, but it's still segregated.

So what if Lotus developed a new On Disk Structure with some kind of partition within the nsf? The new ODS would effectively have 2 (or more) UNID tables, like a partitioned hard drive would have 2 different FATs. A view in partition 1 would only show documents stored in partition 1. Likewise, a view in partition 2 would only show those documents. This would reduce all that indexer load you referred to above...no need to exclude documents that aren't listed in that partition.

All the functions like @dblookup would need to be updated to allow a partition number to be specified in the query...like we specify a column number now...it wouldn't be complicated.

For backwards compatibility, you could have all older servers and clients (that didn't specify a partition number in the request) simply be served up partition 1 by default. As a result, all the older clients/servers couldn't access the data in the new partitions...but server interaction across versions would work perfectly well.

At OS level, there would still only be one .nsf file...in the case of Mail/C&S this would be one per user. You could have a partition for mail, one for contacts, one for calendar entries etc...only need to replicate one file down to the client...but performance would be optimised with markedly smaller unid tables.
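
A toy Python model of the partition idea described in this comment might look like the following. It is purely illustrative: `PartitionedStore` is an invented name, no such On Disk Structure exists, and the sketch only counts selection evaluations. It shows the two properties the comment relies on: a view scans only its own partition, and a caller that does not name a partition gets partition 1.

```python
from collections import defaultdict


class PartitionedStore:
    """Toy single-file store with one document table per partition."""

    def __init__(self):
        self._partitions = defaultdict(list)  # partition number -> list of documents
        self.evaluations = 0                  # how many docs any view has had to examine

    def add(self, doc, partition=1):
        self._partitions[partition].append(doc)

    def view(self, selects, partition=1):
        """Build a view over one partition; legacy callers default to partition 1."""
        hits = []
        for doc in self._partitions[partition]:
            self.evaluations += 1
            if selects(doc):
                hits.append(doc)
        return hits


if __name__ == "__main__":
    store = PartitionedStore()
    for i in range(1000):
        store.add({"form": "Memo", "n": i})                   # mail lands in partition 1
    for i in range(10):
        store.add({"form": "Appointment", "n": i}, partition=2)

    calendar = store.view(lambda d: d["form"] == "Appointment", partition=2)
    print(len(calendar), "entries found after", store.evaluations, "evaluations")
```

The calendar view never touches the thousand mail documents, which is the indexer saving the post is after, while everything still lives in one file.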

Kevin Pettitt - Your pointer idea sounds too much like CC:Mail...scary.

20 - @18 - Yes, that's correct. Number of open NSF handles is generally offered by Lotus as the reason for the single data store strategy in the product today.

I have a very difficult time believing that the resources consumed by the additional NSF handles outweigh the additional stress on the indexer. Even if maintaining 3 or 4 open db handles per user scales linearly, surely that's worth an 80 percent reduction in indexer workload.

21 - @20 - Agreed. The primary concern with handles would be memory consumption and fragmentation. That may technically be a concern now in some extreme cases, but as Domino moves into the 64-bit realm that should become less and less of a concern. ND8 also removed the last of Domino's legacy 16-bit memory management, so it may be ready to rock-and-roll with more DBs as-is.

Assuming the server can simply stay up with the number of handles it needs, it's *got* to be less overhead than the indexer.

There's the potential for more overhead being consumed in the server's design cache with multiple DBs (from duplicated entries), but again that's probably the same tradeoff -- more handles. And based on what you stated about the PIM functions being separated, there really shouldn't be much duplication of design elements anyway, aside from some shared libraries.


22 - @21, that reminds me of the write up Damien did of the formula engine rewrite. His new version was designed with more modern constraints in mind. As such it did away with things that were necessary to make the old version run on much lower spec hardware. The end result was that the new version was much more efficient than the old when running on new hardware. Obviously a lot more than just that went into it, but the point should be that sound design decisions made for 10-20 year old constraints may not be valid any more.

Disclaimer 

Welcome to Escape Velocity!

Opinions expressed here by Nathan T. Freeman are not necessarily those of his employer. However, there's a decent chance they are, so check with them if you really want to know.

But really... do you need that kind of validation? Are the opinions expressed here in doubt?
