This is my contribution to T-SQL Tuesday #99: Dealer’s Choice, hosted by Aaron Bertrand this month. I chose door #2 and will write about T-SQL bad habits. One thing that comes to mind immediately is how to handle ID columns and which data type to choose. First of all: I don’t think there is a one-size-fits-all approach you could use without thinking twice. My experience comes from working for a mid-sized business with fairly small tables that rarely hold millions of rows. So if you are working with VLDBs, please don’t feel intimidated by my writing…for you the laws of physics are probably quite different.
A few years ago I designed a completely new data warehouse. One essential part was to introduce surrogate ID columns to serve as the primary key of each table. I had been burned in the past by having to update lots of tables because the natural key of the source system changed: it was reused for completely different entities, so all the existing data in the data warehouse had to be migrated to new key values to avoid collisions. Long story short: it was a lot of work and a long-running process. The lesson I learned was simple: never trust a source system’s primary key column to be stable in the long run.
Looking at a bad role model
One important decision was the data type of this surrogate key column in the new data warehouse. First I looked at the source system’s primary keys: most of them are defined as BIGINT, and the values used are UNIX timestamps. These are generated by the application and guaranteed to be unique…but at what cost? Since the system started in 2005, the lowest value in the ID columns is something like 1108279285. Today we have reached values like 1518506486. This means that all numbers below 1108279285 will never be used, and that part of the BIGINT range is simply wasted. Furthermore, these keys are too smart: you can guess the insert date from the key value, which violates the principle that surrogate keys should be meaningless integers. Thank God nobody regularly converts these values to dates and vice versa, but the mere possibility makes me shudder a bit. So this usage of IDs was not an example to follow.
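To illustrate why such keys are “too smart”: a UNIX-timestamp key can be decoded back into a date with a single DATEADD call. A minimal sketch (the column alias is made up for illustration):

```sql
-- Decode a UNIX-timestamp "ID" back into a date.
-- 1518506486 seconds after 1970-01-01 lands in February 2018,
-- so the key leaks the approximate insert date of the row.
SELECT DATEADD(SECOND, 1518506486, '19700101') AS leaked_insert_date;
```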
My colleagues and I looked for alternative data types and came up with four possible candidates:
- Unique Identifier (16 Byte)
- Numeric (with 9 digits: 5 Byte)
- Integer (4 Byte)
- Big Integer (8 Byte)
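The storage sizes listed above are easy to verify with DATALENGTH, which returns the number of bytes used to store a value (a quick sketch, nothing production-specific):

```sql
-- DATALENGTH returns the storage size of each value in bytes.
SELECT
    DATALENGTH(NEWID())                  AS uniqueidentifier_bytes, -- 16
    DATALENGTH(CAST(1 AS NUMERIC(9, 0))) AS numeric9_bytes,         -- 5
    DATALENGTH(CAST(1 AS INT))           AS int_bytes,              -- 4
    DATALENGTH(CAST(1 AS BIGINT))        AS bigint_bytes;           -- 8
```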
We quickly ruled out the first two (unique identifier and numeric) because of their space consumption. We expected at most one or two billion rows in the complete database at some point in the future, so a unique identifier seemed like overkill for this scenario. Numeric with 9 digits needs more space than integer while providing a smaller range, and its ability to store decimal values is not needed for an ID column anyway.
So we were left with the two apparently most popular data types for ID columns: Integer and Big Integer. Big Integer has a positive range of more than 9 quintillion (about 9.2 × 10^18). That is a huge number and we could not imagine it being exhausted quickly. Integer, however, has a positive range of just a little more than 2 billion.
But who says that an ID column has to be positive? By definition the value of a surrogate key “contains no semantic meaning” (Wikipedia). Therefore the value -2 serves the same purpose as the value 4: giving a row a unique identifier. I don’t claim these values are equal…but whether a value is negative or positive doesn’t matter when it carries no meaning at all.
So choosing an integer column and starting the IDs at the lowest possible value, -2,147,483,648, let us take advantage of more than 4 billion possible values. We implemented this using a SEQUENCE object to generate unique values for the whole database. Strictly speaking there is no need for IDs to be distinct across separate tables, but a single sequence object was the easiest way to implement the surrogate key pipeline in our ETL tool.
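A sequence covering the full integer range can be sketched like this (the sequence name is made up; cache size and increment are assumptions you would tune to your own workload):

```sql
-- One sequence for the whole database, starting at the lowest INT value
-- so the full range of ~4.29 billion values is available.
CREATE SEQUENCE dbo.SurrogateKeySequence
    AS INT
    START WITH -2147483648
    INCREMENT BY 1
    MINVALUE -2147483648
    MAXVALUE 2147483647
    CACHE 1000;

-- The ETL pipeline draws new surrogate keys like this:
SELECT NEXT VALUE FOR dbo.SurrogateKeySequence AS NewSurrogateKey;
```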
Ensuring smooth operation
Of course a slight scepticism remains: will these values last for the lifetime of the data warehouse? So we implemented a simple daily check in an agent job that reports the share of key values already used. Here is today’s output:
As you can see, more than 98% of the sequence values are still unassigned even though the data warehouse has been running for years. In case you are interested, here’s the SQL that generates this little report:
```sql
/* Checks the left sequence values */
DECLARE @tableHTML NVARCHAR(MAX);

SET @tableHTML =
    N'<H1>Sequence Check</H1>'
    + N'<table border="1">'
    + N'<tr><th>sequence name</th><th>starting value</th><th>maximum value</th>'
    + N'<th>current value</th><th>left sequence values in percent</th></tr>'
    + CAST((
        SELECT
            td = s.name, '',
            td = CAST(s.minimum_value AS BIGINT), '',
            td = CAST(s.maximum_value AS BIGINT), '',
            td = CAST(s.current_value AS BIGINT), '',
            td = ABS((s.current_value - s.maximum_value)
                     / (s.maximum_value - s.minimum_value)) * 100
        FROM (
            SELECT
                s.name,
                CAST(s.minimum_value AS NUMERIC(18, 8)) AS minimum_value,
                CAST(s.maximum_value AS NUMERIC(18, 8)) AS maximum_value,
                CAST(s.current_value AS NUMERIC(18, 8)) AS current_value
            FROM sys.sequences AS s
        ) AS s
        FOR XML PATH('tr'), TYPE
    ) AS NVARCHAR(MAX))
    + N'</table>';

EXEC msdb.dbo.sp_send_dbmail
    @profile_name = 'xyz',
    @recipients   = 'abc',
    @body         = @tableHTML,
    @subject      = 'Sequence Check',
    @body_format  = 'HTML';
```