Microsoft’s new Master Data Management site is a good place to start learning about Master Data Management and Microsoft’s efforts in this area.
Master Data is really a pretty common thing for engineers. I learned about it way back in my manufacturing engineering days.
Consider this scenario: Conglomerate C (CC) makes widgets and starts acquiring businesses that also make widgets. CC sells widgets by the pound, but Acquisition A (AA) measures them by counting individual widgets, while Acquisition B (AB) sells them by the case (gross, or 144 ea).
CC now wants all this data in a data warehouse so they can compare apples to apples and know, among other things, how many widgets they’re actually making and selling on a given day.
Note: Instrumentation and measurement are scientific disciplines in their own right. There's a lot more to this, which I hope to cover in this blog.
The Unit of Measure in the existing database, dbCC, is pounds. The Widgets tables from the three companies look like this:
dbCC.dbo.Widgets

| ID | Date     | Weight (lbs) |
|----|----------|--------------|
| 1  | 1/1/2007 | 2076         |
| 2  | 1/2/2007 | 2100         |
| 3  | 1/3/2007 | 1977         |
dbAA.Product.Widgets

| ProductID | Date       | Count  |
|-----------|------------|--------|
|           | 1 Jan 2007 | 10,265 |
|           | 2 Jan 2007 | 13,009 |
|           | 3 Jan 2007 | 17,121 |
dbAB.dbo.Widgets

| ID | Date     | Cases |
|----|----------|-------|
| 1  | 20070101 | 84    |
| 2  | 20070102 | 82    |
| 3  | 20070103 | 99    |
MDM is all about standardizing this data. The key to standardizing it is recognizing traits in the data itself. For instance, the Cases-to-Count ratio is most likely stable and predictable, so conversion is easily accomplished with multiplication (or division, depending on which direction you standardize). But converting weight to a count (individual or case count) depends on other factors. Most notably, do all widgets weigh the same? If not, what’s the acceptable tolerance?
Dimensional analysis (the multiplication or division you do to convert known quantities) is really a question of measurement granularity (or grain). You will want to store as fine a grain as possible, trust me. Looking at the sample data, you will want to store WidgetCount somewhere. dbAA is already in this format. Yay. dbAB is easy enough: dbAB.dbo.Widgets.Cases * 144 gives you WidgetCount. Again, the math on widget Weight is where things get fuzzy.
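To make the conversions concrete, here's a minimal sketch of the standardization expressions (not the full load). The 5-widgets-per-pound factor for dbCC is an assumed business rule (more on that below), not something derivable from the data:

-- Sketch: standardizing each source to a WidgetCount
-- (5 widgets per pound is an assumed business rule)
Select ID, Date, (Weight * 5) As WidgetCount From [dbCC].dbo.Widgets -- pounds to count
Select ProductID, Date, [Count] As WidgetCount From [dbAA].Product.Widgets -- already a count
Select ID, Date, (Cases * 144) As WidgetCount From [dbAB].dbo.Widgets -- cases (gross) to count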
This fuzziness will impact the integrity of your data. There are a couple of important measures of data warehouse integrity: data accuracy and signal-to-noise (usually defined by the percentage of “unknowns” in the data).
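As a rough illustration, one hypothetical way to measure that signal-to-noise is the percentage of rows whose standardized count is unknown once the data lands in the staging Widget table introduced below:

-- Hypothetical signal-to-noise check: percentage of rows with an unknown Count
Select (100.0 * Sum(Case When w.[Count] Is Null Then 1 Else 0 End)) / Count(*) As PercentUnknownCount
From Products.Widget w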
When I have encountered this scenario in the field, I have informed the customer of the dangers and begged them to collect better metrics at the WidgetWeight station.
There are other issues in these examples: date and ID standardization. Dates are fairly straightforward. The IDs can be a little tricky. To standardize the IDs in this example, I would consider a Location and ID compound key on the first pass. I’d create a couple of tables in the data warehouse staging database that look like this:
Staging.Products.Widget

| LocationID | ID | Date     | Count  |
|------------|----|----------|--------|
| 1          | 1  | 1/1/2007 | 10,380 |
| 1          | 2  | 1/2/2007 | 10,500 |
| 1          | 3  | 1/3/2007 | 9,885  |
| 2          | 1  | 1/1/2007 | 10,265 |
| 2          | 2  | 1/2/2007 | 13,009 |
| 2          | 3  | 1/3/2007 | 17,121 |
| 3          | 1  | 1/1/2007 | 12,096 |
| 3          | 2  | 1/2/2007 | 11,808 |
| 3          | 3  | 1/3/2007 | 14,256 |
Staging.Products.Location

| LocationID | LocationDescription |
|------------|---------------------|
| 1          | dbCC                |
| 2          | dbAA                |
| 3          | dbAB                |
I’ve assumed (based on customer feedback) that I get 5 widgets per pound from dbCC, and I know the math for the rest. I normalized the dates and IDs, and added a LocationID column and a Location table to manage my data sources and their IDs.
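For reference, a minimal DDL sketch of these two staging tables might look like this (the column data types and constraint names are assumptions, not taken from an actual schema):

-- Hypothetical staging table definitions (data types and constraint names are assumptions)
Create Table Products.Location
(LocationID int Not Null Constraint PK_Location Primary Key
,LocationDescription varchar(50) Not Null)

Create Table Products.Widget
(LocationID int Not Null Constraint FK_Widget_Location References Products.Location(LocationID)
,ID int Not Null
,Date smalldatetime Not Null
,[Count] int Not Null
,Constraint PK_Widget Primary Key (LocationID, ID)) -- the Location and ID compound key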
There are some definite tricks to initial master data loads. To get data into this format in Staging I would execute the following queries:
-- initial load...
-- dbCC: convert Weight (pounds) to widget count at the assumed 5 widgets per pound...
Insert Into Products.Widget
(LocationID
,ID
,Date
,[Count])
Select l.LocationID
,s.ID
,s.Date
,(s.Weight * 5)
From Products.Location l
Left Outer Join [dbCC].dbo.Widgets s on s.ID >= l.LocationID or s.ID < l.LocationID
Where l.LocationDescription = 'dbCC'
order by s.ID
-- dbAA: stage rows into a table variable to generate sequential IDs with ROW_NUMBER(), then load the counts as-is...
declare @tbl table
(ID int
,Date smalldatetime
,[Count] int)
Insert Into @tbl
(ID
,Date
,[Count])
Select
row_number() over(order by s.ProductID) as 'RowNumber'
,s.Date
,s.[Count]
From [dbAA].Product.Widgets s
order by s.ProductID
Insert Into Products.Widget
(LocationID
,ID
,Date
,[Count])
Select l.LocationID
,s.ID
,Convert(smalldatetime, s.Date)
,s.[Count]
From Products.Location l
Left Outer Join @tbl s on s.ID >= l.LocationID or s.ID < l.LocationID
Where l.LocationDescription = 'dbAA'
order by s.ID
-- dbAB: convert the integer YYYYMMDD date to smalldatetime and convert Cases to widget count (144 per case)...
Insert Into Products.Widget
(LocationID
,ID
,Date
,[Count])
Select l.LocationID
,s.ID
,Convert(smalldatetime, Left(Convert(varchar,s.Date), 4) + '-' +
SubString(Convert(varchar,s.Date), 5, 2) + '-' +
Right(Convert(varchar,s.Date), 2))
,(s.Cases * 144)
From Products.Location l
Left Outer Join [dbAB].dbo.Widgets s on s.ID >= l.LocationID or s.ID < l.LocationID
Where l.LocationDescription = 'dbAB'
order by s.ID
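Once the three loads have run, a quick sanity check against the goal stated earlier (comparing apples to apples across acquisitions) might be a hypothetical query like this, which totals standardized widgets per day:

-- Hypothetical sanity check: total widget count per day across all locations
Select w.Date
,Sum(w.[Count]) As TotalWidgetCount
From Products.Widget w
Group By w.Date
Order By w.Date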
There are definitely better ways to finesse the data into the initial Staging load than this. The Outer Joins are meaningless; they serve no logical purpose other than to "trick" the source rows onto the same row as the location lookup data. We will examine some better ways to load MDM in future posts.
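As one example of a more direct formulation (a sketch, not the definitive fix), a cross join to the single matching Location row expresses the same intent without the dummy join predicate; here it is for the dbCC load, using the same assumed 5-widgets-per-pound factor:

-- Sketch of a cleaner alternative: cross join to the one matching Location row
Insert Into Products.Widget
(LocationID
,ID
,Date
,[Count])
Select l.LocationID
,s.ID
,s.Date
,(s.Weight * 5) -- assumed business rule: 5 widgets per pound
From [dbCC].dbo.Widgets s
Cross Join Products.Location l
Where l.LocationDescription = 'dbCC'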
There’s more to Master Data, but this is the type of business problem folks are trying to solve when they talk about MDM.
:{> Andy
Technorati Tags: Master Data, MDM, Staging, Load