SSIS Design Pattern - ETL Instrumentation, Part 1
Introduction
SSIS is a fantastic ETL engine. When I build and ETL solution in SSIS, I like to collect runtime metrics. Why? I use the data initially to determine a baseline for performance and to determine, in some cases, if I'm loading within a defined window.
I refer to this process of collecting performance and runtime metrics as ETL Instrumentation and I collect this data for every enterprise ETL solution I design.
The types of things I want to know are:
- Did the process start when I wanted it to start?
- Did it succeed?
-
- If not, is it still running?
- Did it fail?
- How long did it execute?
- How does recent performance compare to past performance?
In this mini-series entitled SSIS Design Pattern - ETL Instrumentation, I will demonstrate how I capture, report, and measure these data using SSIS, SSRS, SQL Agent Jobs, and Transact-SQL.
In Part 1, we build an elementary system. In subsequent parts, we will improve upon the concepts demonstrated here.
To Instrument Or Not To Instrument; That Is The Question
My thoughts on this are best summed up thus:
There are two types of enterprise data warehouse developers:
1) Those who collect ETL instrumentation data; and
2) Those who will.
Why do I make this statement? When you build ETL to load your enterprise data warehouse, you are building the source for the reporting system. Should anything unfortunate happen to the data between the time you read it from the System Of Record and the time it lands in your Reporting Server Database, you are basically guilty until proven innocent. ETL Instrumentation will provide evidence for (and sometimes against) you as you defend your ETL design and implementation.
Building The Database
Before we instrument an ETL process, we need a database solution for the data. We'll start by building the database itself, then add objects.
Let's build the database. We'll call it SSISRunTimeMetrics. Open SQL Server Management Studio (SSMS) and execute the following Transact-SQL:
use master
go
if not exists(select name
from master.sys.databases
where name = 'SSISRunTimeMetrics')
Create Database SSISRunTimeMetrics
go
This script creates the SSISRunTimeMetrics database. Next, let's create a schema and table to hold our metrics:
use SSISRunTimeMetrics
go
-- vars...
declare @sql varchar(255)
-- create ssis schema...
if not exists(select name
from sys.schemas
where name = 'ssis')
begin
set @sql = 'Create Schema ssis'
exec(@sql)
end
-- create RunTimeMetrics table...
if not exists(select s.name + '.' + t.name
from sys.tables t
inner join sys.schemas s on s.schema_id = t.schema_id
where t.name = 'RunTimeMetrics'
and s.name = 'ssis')
Create Table ssis.RunTimeMetrics
(id int identity(1,1)
,packageStartDateTime datetime null
,packageEndDateTime datetime null
,packageName varchar(255) null
,packageVersion varchar(25) null
,packageStatus varchar(25) null)
Next, let's add a stored procedure to record a package start event:
use SSISRunTimeMetrics
go
if exists(select s.name + '.' + p.name
from sys.procedures p
inner join sys.schemas s on s.schema_id = p.schema_id
where p.name = 'usp_RecordPackageStart'
and s.name = 'ssis')
begin
Drop Procedure ssis.usp_RecordPackageStart
end
go
Create Procedure ssis.usp_RecordPackageStart
@packageStartDateTime datetime = null
,@packageName varchar(255) = null
,@packageVersion varchar(25) = null
,@packageStatus varchar(25) = null
As
begin
-- set @packageStartDateTime default...
declare @Now datetime
set @Now = GetDate()
if (@packageStartDateTime Is Null)
set @packageStartDateTime = @Now
-- insert the run time metrics data...
insert into ssis.RunTimeMetrics
(packageStartDateTime
,packageName
,packageVersion
,packageStatus)
values
(@packageStartDateTime
,@packageName
,@packageVersion
,@packageStatus)
end
go
That's all for now. We'll add more stuff later.
Building The SSIS Project
Open SQL Server Business Intelligence Development Studio (BIDS) and create a new project. When the New Project dialog displays, select Business Intelligence Projects from the Project types list and Integration Services Project from the Templates. Name the project SSISRunTimeMetrics and click the OK button:
Rename the Package SSISRunTimeMetrics.dtsx:
Always click the Yes button when you see this dialog (trust me):
Right-click in the white-space on the Control Flow canvas and click Add Annotation to document your code:
I can't say enough about good documentation. You never know when someone else will have to open your source code to troubleshoot.
Right-click in the white-space of the Control Flow canvas and click Variables. Add a variable named sVersion. Make sure the Scope is the SSISRunTimeMetrics package and make the Data Type String:
Add a Sequence Container and change the name to "Step 1 - Do some stuff":
Sequence Containers are nice for several reasons:
- They're aesthetic. You can break up chucks of functionality logically... how cool!
- When troubleshooting, you can execute the contents of a container from the right-click context menu.
- You can isolate chunks of functionality using transactions by container.
Next, add an Execute SQL Task and name it "Log Start of Package Execution":
Double-click the Execute SQL Task to edit it. Click Connection and select "":
When the Configure OLE DB Connection Manager dialog displays, click the "New..." button:
Enter or select your server in the Server Name dropdown. In the "Connect to a database" panel, select the "Select or enter a database name" option and enter or select the SSISRunTimeMetrics database you created earlier. Click the OK button to proceed:
Then click the OK button again to set the Connection property of the Execute SQL Task:
Select the SQL Statement property and click the ellipsis. When the "Enter SQL Query" dialog displays, enter the following parameterized Transact-SQL statement:
Exec ssis.usp_RecordPackageStart ?,?,?,'Started'
Click the OK button to proceed:
Click on the Parameter Mapping page and configure parameters as shown:
Click the OK button to proceed.
Next, let's do some work on the sVersion variable. Click on the sVersion variable name and press the F4 key to display variable properties. Change the EvaluateAsExpression property to True, then click on the Expression property's ellipsis:
When the Expression Builder displays, enter the following SSIS Expression Language text to build a version string in the format Major.Minor.Build:
(DT_WSTR, 3) @VersionMajor + "." + (DT_WSTR, 3)@VersionMinor + "." + (DT_WSTR, 3)@VersionBuild
Click the "Evaluate Expression" button to check the expression. Then click the OK button to close the Expreesion Builder:
Add a Script Task and configure it to popup a message box displaying the value of sVersion:
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Public Class ScriptMain
Public Sub Main()
Dim sVersion As String = Dts.Variables("sVersion").Value.ToString
Dim sMsg As String
sMsg = "Version: " & sVersion
MsgBox(sMsg)
Dts.TaskResult = Dts.Results.Success
End Sub
End Class
Execute the package. Then execute the following Transact-SQL query to check the results:
select * from ssis.RunTimeMetrics
It's time to add a procedure to log the end of package execution, but we have a problem: How do we identify the row we just added to the ssis.RunTimeMetrics table? Think about it. We just inserted a row, and since that's the only row in the table it's pretty simple to identify. Later, we will have lots of rows in the table in varying states. We need a method to positively identify the row we're working with.
We could use the last row inserted - apply Max(id) criteria - but what happens as our metrics grows to include multiple packages? One package could start while another was executing and we'd update the wrong row.
We require a means to positively identify the row when added to the table. Such a means exists. If we modify the ssis.usp_RecordPackageStart stored procedure we can have it return the value of the inserted ID. The OUTPUT clause gives us access to the inserted and deleted virtual tables. Modify the stored procedure by adding the following OUTPUT clause as shown:
output inserted.id as 'Inserted_ID'
Execute the Transact-SQL to drop and recreate the ssis.usp_RecordPackageStart stored procedure:
Add another Int32 Type variable named iPackageLoadID:
Back in Execute SQL Task, change the ResultSet property to "Single row":
Click the Result Set page and add a new resultset. Name it 0 (ordinals work well) and assign the results of this query to the iPackageLoadID variable you just created. This will push the results of the OUTPUT clause in the stored procedure (which will include the id of the newly inserted row) into the iPackageLoadID variable. You can then access it later in the package to update the precise row:
Alter the Script Task by adding iPackageLoadID to the list of ReadOnlyVariables:
Alter the VB.Net script to read:
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Public Class ScriptMain
Public Sub Main()
Dim sVersion As String = Dts.Variables("sVersion").Value.ToString
Dim sPackageLoadID As String = Dts.Variables("iPackageLoadID").Value.ToString
Dim sMsg As String
sMsg = "Version: " & sversion & vbCrLf
sMsg = sMsg & "PackageLoadID: " & sPackageLoadID
MsgBox(sMsg)
Dts.TaskResult = Dts.Results.Success
End Sub
End Class
Execute the package to test the new script:
Next, add a new stored procedure named ssis.usp_RecordPackageEnd to the SSISRunTimeMetrics database using the following Transact-SQL script:
use SSISRunTimeMetrics
go
if exists(select s.name + '.' + p.name
from sys.procedures p
inner join sys.schemas s on s.schema_id = p.schema_id
where p.name = 'usp_RecordPackageEnd'
and s.name = 'ssis')
begin
Drop Procedure ssis.usp_RecordPackageEnd
end
go
Create Procedure ssis.usp_RecordPackageEnd
@id int
,@packageEndDateTime datetime = null
,@packageStatus varchar(25) = null
As
begin
-- set @packageEndDateTime default...
declare @Now datetime
set @Now = GetDate()
if (@packageEndDateTime Is Null)
set @packageEndDateTime = @Now
-- update the run time metrics data...
update ssis.RunTimeMetrics
Set packageEndDateTime = @packageEndDateTime
,packageStatus = @packageStatus
where id = @id
end
go
Now add a new Execute SQL Task named "Log End of Package Execution":
Set the Connection property to your SSISRunTimeMetrics database and the SQLStatement property to the following Transact-SQL statement:
declare @Now datetime
set @Now = GetDate()
Exec ssis.usp_RecordPackageEnd ?,@Now,'Succeeded'
Click the Parameter Mapping page and configure the iPackageLoadID variable as an input variable of the Long data type, assigned to ordinal 0:
Click the OK button:
Execute the package. Note that each time a change is saved, the VersionBuild property auto-increments, thus updating your sVersion variable value:
The package succeeds:
Examine the data stored in the ssis.RunTimeMetrics table. Note the latest row now contains a Start and End DateTime, and the packageStatus field now records "Succeeded":
We will work with these concepts more - fleshing them out by adding more robust code and additional functionality. For now, this is a good start. Congratulations!
:{> Andy