There has been a long-standing debate in the QlikView world regarding when a data model should be normalized (i.e. split into one or more "Fact" tables and one or more "Dimension" tables) and when it should be denormalized (i.e. combined into as few tables as possible via one or more JOIN operations). Conventional QlikView wisdom holds that each link between two tables comes at a cost in memory; nevertheless, it is very rare to see a QlikView developer actually create a fully denormalized data model. Since, to the best of my knowledge, no clear test results demonstrating tangible benefits of either approach had been published, I ran some simple tests, with some surprising results!
I set up two simple tester applications and populated their data models as follows (a sketch of each load script follows its description):
Normalized Data Model
- Invoice (Fact) Table: Autogenerated table of 10,000,000 rows of random data. Fields: InvoiceNum (unique identifier of each Fact row), DateKey (numeric key to the Calendar table), CustomerID (numeric key to the Customers table), Sales (metric)
- Customers (Dimension) Table: Table containing distinct values of the CustomerID field from the Fact table, using the Fact table as a data source. Fields: CustomerID (numeric key to the Fact table), CustomerName (text string containing dummy customer names)
- Calendar (Dimension) Table: Full calendar table populated using the minimum/maximum values of field DateKey from the Fact table. Fields: DateKey (numeric key to the Fact table), Date, Month, Year, MonthYear, Quarter
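For reference, here is a minimal sketch of what such a load script might look like. The table and field names follow the description above; the random-value expressions, the date range, and the number of distinct customers are assumptions for illustration only.

```
// Sketch of a normalized load script. Random-value expressions, date range and
// customer count are illustrative assumptions, not the exact values used in the tests.
Invoice:
LOAD
    RecNo()                                as InvoiceNum,
    Floor(MakeDate(2010) + Rand() * 365)   as DateKey,     // random day in an assumed one-year range
    Ceil(Rand() * 10000)                   as CustomerID,  // assumed 10,000 distinct customers
    Round(Rand() * 1000, 0.01)             as Sales
AUTOGENERATE 10000000;

Customers:
LOAD DISTINCT
    CustomerID,
    'Customer ' & CustomerID               as CustomerName // dummy customer names
RESIDENT Invoice;

// Build the Calendar from the minimum/maximum DateKey found in the Fact table
MinMax:
LOAD Min(DateKey) as MinDate, Max(DateKey) as MaxDate RESIDENT Invoice;
LET vMinDate = Peek('MinDate');
LET vMaxDate = Peek('MaxDate');
DROP TABLE MinMax;

Calendar:
LOAD
    DateKey,
    Date(DateKey)                          as Date,
    Month(DateKey)                         as Month,
    Year(DateKey)                          as Year,
    Date(MonthStart(DateKey), 'MMM-YYYY')  as MonthYear,
    'Q' & Ceil(Month(DateKey) / 3)         as Quarter;
LOAD $(vMinDate) + IterNo() - 1            as DateKey
AUTOGENERATE 1
WHILE $(vMinDate) + IterNo() - 1 <= $(vMaxDate);
```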
Denormalized Data Model
To keep the actual data identical for testing purposes, the denormalized data model binary loads the normalized data model and then performs two LEFT JOIN operations to bring the data from the Customers and Calendar tables into the Fact table. The Customers and Calendar tables are then dropped.
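In script terms, the denormalized version might look something like the sketch below (the source document name is an assumption):

```
// BINARY must be the first statement in the script; it loads the data model
// of the normalized document as-is. The file name is an assumption.
BINARY Normalized.qvw;

// Fold both dimension tables into the Fact table via their key fields
LEFT JOIN (Invoice)
LOAD * RESIDENT Customers;

LEFT JOIN (Invoice)
LOAD * RESIDENT Calendar;

// The original dimension tables are no longer needed
DROP TABLES Customers, Calendar;
```

Because the dimension keys are unique in this setup, the LEFT JOINs do not change the Fact table's row count of 10,000,000.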
As a quick disclaimer, these tests were run on the same computer at exactly the same time to minimize the chance of other applications interfering. The numbers themselves will no doubt vary from computer to computer; the interesting trend is the relationship between the normalized and denormalized statistics:
These results demonstrate that denormalizing a data model results in significant performance improvements in chart calculation times, as well as improved overall application memory usage. Interestingly, the physical size of the application on the hard drive is slightly larger for the denormalized model, despite the performance improvements.
Although these results are highly interesting, I am certainly not advocating denormalizing a data model in every situation. For one thing, denormalizing in such a fashion only works if the Fact table rows have a 1:1 relationship to the Dimension table rows; a 1:N relationship would result in Fact table row duplication when the Dimension table is joined in, as the small example below illustrates. Another consideration is the ease of application redevelopment: for example, a partial reload where only one table is modified obviously becomes impossible once all tables have been joined together. Finally, one should consider overall data model aesthetics; denormalizing may not be the way to go if finding fields becomes much more difficult once they are all in a single table. There are probably other considerations as well, but hopefully these will give the reader some idea of when denormalizing should be avoided.
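As a tiny, purely hypothetical illustration of that duplication effect (the table and field names here are made up for the example):

```
// The key value 1 appears twice in the dimension table (one customer, two contacts),
// so the single fact row is duplicated by the join and Sum(Sales) doubles from 100 to 200.
Fact:
LOAD * INLINE [
InvoiceNum, CustomerID, Sales
1001, 1, 100
];

LEFT JOIN (Fact)
LOAD * INLINE [
CustomerID, CustomerContact
1, Alice
1, Bob
];
```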
See for Yourself!
I have uploaded the applications used to create these results here:
Please feel free to download them and run these (or more advanced) tests for yourself. To get an accurate performance comparison, I suggest running both applications on the same computer at the same time. And please leave comments below with any interesting results you may find!