ETL, Auditing and Row Counts...
Yes, it has been too long since I last posted – My wife and I are in the middle of selling our house, so it has been a busy month doing little projects here and there. For anyone doing ETL projects or working on a BI project and in the ETL phase, auditing is a major part of your work. In this instance, I am specifically looking at physical statistics on a table in my database. Here is a simple schema for a table to track these statistics:
CREATE TABLE [dbo].[TrackTableProcessLog]
( [TableProcessKey] [int] IDENTITY(1,1) NOT NULL, [ProcessLogKey] [int] NOT NULL, [Object] [varchar](250) NOT NULL, [ObjectDatabase] [varchar](75) NOT NULL, [CountExtractRows] [int] NULL, [CountInsertRows] [int] NULL, [CountUpdatedRows] [int] NULL, [CountDeletedRows] [int] NULL, [CountErrorRows] [int] NULL, [CountInitialRows] [bigint] NULL, [CountFinalRows] [bigint] NULL, [RecordInsertDate] [datetime] NOT NULL CONSTRAINT [DF_TableProcessLog_RecordInsertDate] DEFAULT (getdate()), CONSTRAINT [PK_TrackTableProcessLog] PRIMARY KEY CLUSTERED ( [TableProcessKey] ASC)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
As you can see, I am tracking counts on several different items for my table objects in my database. Most of the counts are derived directly from the ETL process, inside my SSIS package, however, two of the counts are not derived from the ETL action itself: CountInitialRows and CountFinalRows. Both of these are counts of the rows in my table object before and after the ETL operation – I need this to be exact as this table is used for auditing purposes and all the numbers need to line up properly (or not, so that I can be alerted of an issue with the ETL process).
Thankfully, with SQL 2005, there are several ways to readily grab these figures, even on very large tables, without resorting to “select count(*) from …”. I will illustrate the several methods available, and you can pick which one you like best. Just as a note, each method returns the same result – I have run large data operations against some target tables and test that each method returns the same answer, so thus far, any or a combination thereof should return the same result.
First, let’s look at the system stored procedure: sp_spaceused. In executing this stored procedure with a table object as the parameter, your output will be: name, rows, reserved, data, index_size and unused. In this case, I just need rows, so while the other information is nice, it is unnecessary. The procedure also performs some other steps, so if I am shooting for efficiency in my data movement processes, then I want to eliminate any extraneous querying.
If you break open sp_spaceused, you will find that it uses the dynamic management view sys.dm_db_partition_stats to deliver the row metric. If you want to use this dynamic management view, you will need to be sensitive to the fact that a single table object can live on multiple partitions – and your indexes can live on other partitions as well. To query this table, you will need to sum the row_count column for a given object_id and where the index_id value is less than 2; this is necessary to exclude counts from any non-clustered indexes related to a table object.
Row count data is also available in sys.sysindexes – to get this data you will filter on the id column = object_id of your target object, however this table is not necessarily purposed for this information, so while possible, I would recommend using one of the other methods.
The last way to conveniently access row count information is through sys.partitions table. The filter for this data is the same as the dynamic management view; you will need to sum across all rows for a specific object_id and include only rows where the index_id is less than 2.So all in all, here are several methods to use to retrieve row count information for your table object. Each method provides sub-second response times for my ETL operations, allowing me to conveniently gather statistics about my tables as I populate my warehouse.Please feel free to post or e-mail with any questions!
References:
Sys.partitions: http://msdn2.microsoft.com/en-us/library/ms175012.aspx
Sys.dm_dm_partition_stats: http://msdn2.microsoft.com/en-us/library/ms187737.aspx
Table and index organization: http://msdn2.microsoft.com/en-us/library/ms189051.aspx