Normalize or Denormalize: The Results Might Surprise You

There has been a long-standing debate in the QlikView world regarding when a data model should be normalized (i.e. split out into one or more "Fact" tables and one or more "Dimension" tables) and when it should be denormalized (i.e. combined into as few tables as possible via one or more JOIN operations). Conventional QlikView wisdom holds that each link between two tables comes at a cost of memory; nevertheless, it is very rare to see a QlikView developer actually create a fully denormalized data model. Since, to the best of my knowledge, no clear test results demonstrating tangible benefits of either approach had been published, I ran some simple tests—with some surprising results!

I set up a simple tester application and populated the data models of the respective applications as follows (an illustrative load-script sketch follows each description):

Normalized Data Model

  • Invoice (Fact) Table: Autogenerated table of 10,000,000 rows of random data. Fields: InvoiceNum (unique identifier of each Fact row), DateKey (numeric key to the Calendar table), CustomerID (numeric key to the Customer table), Sales (metric)
  • Customer (Dimension) Table: Table containing distinct values of the CustomerID field from the Fact table, using the Fact table as a data source. Fields: CustomerID (numeric key to the Fact table), CustomerName (text string containing dummy customer names)
  • Calendar (Dimension) Table: Full calendar table populated using the minimum/maximum values of field DateKey from the Fact table. Fields: DateKey (numeric key to the Fact table), Date, Month, Year, MonthYear, Quarter
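
For reference, the load script for the normalized model looks roughly like the sketch below. The table and field names come from the description above; the random-data expressions, date range, and customer count are my own assumptions.

```
// Sketch of the normalized model. Table and field names follow the post;
// the random-data expressions, date range, and customer count are assumptions.
Invoice:
LOAD
    RecNo()                                  AS InvoiceNum,
    Floor(MakeDate(2009) + Rand() * 3 * 365) AS DateKey,     // random date over ~3 years
    Ceil(Rand() * 10000)                     AS CustomerID,  // ~10,000 distinct customers
    Round(Rand() * 1000, 0.01)               AS Sales
AUTOGENERATE 10000000;

Customer:
LOAD DISTINCT
    CustomerID,
    'Customer ' & CustomerID AS CustomerName  // dummy customer names
RESIDENT Invoice;

// Calendar spanning the minimum/maximum DateKey found in the Fact table
MinMax:
LOAD Min(DateKey) AS MinDate, Max(DateKey) AS MaxDate RESIDENT Invoice;
LET vMinDate = Peek('MinDate');
LET vMaxDate = Peek('MaxDate');
DROP TABLE MinMax;

Calendar:
LOAD
    DateKey,
    Date(DateKey)                         AS Date,
    Month(DateKey)                        AS Month,
    Year(DateKey)                         AS Year,
    Date(MonthStart(DateKey), 'MMM-YYYY') AS MonthYear,
    'Q' & Ceil(Month(DateKey) / 3)        AS Quarter;
LOAD $(vMinDate) + IterNo() - 1 AS DateKey
AUTOGENERATE 1
WHILE $(vMinDate) + IterNo() - 1 <= $(vMaxDate);
```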

Denormalized Data Model

To keep the actual data identical for testing purposes, the denormalized data model binary loads the normalized data model and then performs two LEFT JOIN operations to bring the data from the Customer and Calendar tables into the Fact table. The Customer and Calendar tables are then dropped.
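
In sketch form, the denormalized script would look something like the following; the .qvw file name is a placeholder, and the BINARY statement must be the first statement in the script.

```
// Sketch of the denormalized model. "Normalized.qvw" is a placeholder file name.
// BINARY pulls in the entire normalized data model and must come first.
Binary [Normalized.qvw];

LEFT JOIN (Invoice)
LOAD * RESIDENT Customer;   // joins on the shared CustomerID field
DROP TABLE Customer;

LEFT JOIN (Invoice)
LOAD * RESIDENT Calendar;   // joins on the shared DateKey field
DROP TABLE Calendar;
```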

Results

As a quick disclaimer, these tests were run on the same computer at exactly the same time to minimize the chance of other applications interfering. The numbers themselves will no doubt vary from computer to computer; the interesting trend is the relationship between the normalized and denormalized statistics:


These results demonstrate that denormalizing a data model results in significant performance improvements in chart calculation times, as well as improved overall application memory usage. Interestingly, the physical size of the application on the hard drive is slightly larger for the denormalized model, despite the performance improvements.

Although these results are highly interesting, I am certainly not advocating denormalizing a data model in every situation. For one thing, denormalizing in such a fashion only works if each Fact table row matches at most one row in the Dimension table; a 1:N relationship would result in Fact table row duplication if the Dimension table is joined in. Another consideration is the ease of application redevelopment—e.g. a partial reload where only one table is modified obviously becomes impossible once all tables have been joined together. Finally, one should consider overall data model aesthetics; denormalizing may not be the way to go if finding fields becomes much more difficult once all fields are in a single table. There are probably other considerations as well, but hopefully these will give the reader some idea of when denormalizing should be avoided.
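
To make the duplication risk concrete with a toy example (hypothetical inline data, not part of the test applications): if a joined table contains more than one row per key value, each matching Fact row is multiplied accordingly, and simple aggregations are inflated.

```
// Toy example (hypothetical data): a 1:N join duplicates Fact rows.
Facts:
LOAD * INLINE [
InvoiceNum, CustomerID, Sales
1, 10, 100
2, 20, 200
];

LEFT JOIN (Facts)
LOAD * INLINE [
CustomerID, ContactName
10, Alice
10, Bob
20, Carol
];

// Facts now has 3 rows: invoice 1 appears twice (once for Alice, once for Bob),
// so Sum(Sales) is inflated from 300 to 400.
```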

See for Yourself!

I have uploaded the applications used to create these results here:

Please feel free to download them and run these (or more advanced) tests for yourself. To get an accurate performance measurement, I suggest that these both be run on the same computer at the same time. And please leave comments below with any interesting results you may find!


11 Responses to Normalize or Denormalize: The Results Might Surprise You

  1. Amaury Moreau says:

    Hello Vlad,

    First, thank you for your very interesting posts.

    Calculation is faster with a denormalized data model.

    But I have a question:

    Do you know if chart calculation times are the same between both models when you select some values in a “Customer Name” or “Year” list box?

    In fact, with a normalized data model there are fewer associations to update (only the associations between CustomerID in the Fact/Customer tables, between DateKey in the Fact/Calendar tables, and some associations inside the dimension tables).

    With a denormalized data model, all the fields are directly associated because all the fields are in the same table.

    So it could potentially take longer to update these associations.

    I have not had time to run these tests with big tables (> 100,000,000 rows).

    I would like to hear your thoughts.

    Regards

  2. Amaury,

    Thank you for your comment. Your question is quite interesting. If I understand you correctly, you’re asking whether the conclusions I reached above would be different if I had filtered my data set in both versions rather than leaving it entirely unfiltered. Your point about the field associations was very thought-provoking, and I must confess I didn’t know the answer off the top of my head.

    I decided to create a simple experiment using the above files. I selected Year = 2010 in both applications, saved and closed both, and then reopened and reran the performance tests above three times. Here are the average results I got:

    Normalized:
    CH01: 4 KB; 312 ms
    CH02: 5 MB; 15,584 ms
    CH03: 34 MB; 14,664 ms
    CH04: 45 MB; 19,016 ms
    CH05: 73 MB; 26,816 ms
    CH06: 96 MB; 10,779 ms

    Denormalized:
    CH01: 4 KB; 172 ms
    CH02: 5 MB; 2,418 ms
    CH03: 34 MB; 6,178 ms
    CH04: 45 MB; 9,111 ms
    CH05: 73 MB; 26,177 ms
    CH06: 96 MB; 19,110 ms

    The denormalized model still resulted in a performance improvement in charts 1 – 4, was pretty much even for chart 5, and actually had worse performance than the normalized model for chart 6; I was able to verify this in all 3 tests.

    The only difference between chart 4 and chart 6 is that chart 6 includes subtotals at the Year level (which is the field that I filtered). I’m not sure why, but this is not the first time I’ve noticed that pivot tables in QlikView tend to be heavier when either subtotals are enabled or dimensions are pivoted. This is just a guess, but I think the heavier performance in this one chart has to do with the specific way in which pivot tables with subtotals are calculated. Your hypothesis (that denormalization leads to associations that are more memory-heavy in general) doesn’t seem to hold up, since we did see a performance improvement in all but that last chart.

    Good question!
    Vlad

  3. Amaury Moreau says:

    Thank you Vlad for your answer.

    It was not exactly my question; I was not clear in my explanation. Nevertheless, your latest tests are interesting.

    My question focused mainly on the average time for QlikView to update the associations when you select one or more values in a list box (so before the calculations in the objects).

    Is this time better, worse, or equal when you select a value in a field outside the fact table (Year, for example) versus a field present in the fact table (in both cases through a list box)?

    It is perfectly possible that the differences don’t appear with a fact table containing 10,000,000 rows.

    It would be interesting to run all the tests with a fact table containing at least 100,000,000 rows. If I have time, I will run the tests.

    Regards

  4. Amaury, sorry I misunderstood. Your test sounds interesting; please let me know how it turns out. My theory is that the pattern will follow the results I posted in response to your previous comment. Should be interesting to see if that holds true!

    All the best,
    Vlad

  5. Michael Steedle says:

    I realize this is a super-old post, but my theory is that people have long attributed improved performance to denormalizing, when most of the difference is actually a byproduct of the join operations re-sorting the data. Data sort order may be more important than the structure of the data model.

    Here are some tests I performed. You should give it a whirl and see whether you think it’s worth a follow-up post.
    1. To reduce the noise from sorting, re-sort the Invoices table by the PK (the original sort order) after doing the joins. Comparing this to the normalized results would tell you how much of the change was really from removing the associations versus the sorting that resulted from the joins.
    2. In your denormalized example, change the order in which the dimensions are joined to Invoices (calendar then customer, customer then calendar). Notice that the performance results materially change, yet the structure is identical. The order in which you join tables, which affects the hidden performance variable of sort order, does matter. In fact, if you aren’t aware of the effect, the order in which you arbitrarily join the tables may lead you to conclude that denormalizing is much better or much worse than it really is.
    3. As another example, keep your original, normalized data model, but sort the Invoices table by the CustomerID. Without the RAM overhead of denormalizing (which can really matter with extremely large data), you can achieve almost all of the performance improvement. The data model looks identical. So maybe the denormalizing wasn’t the cure we thought it was.

    Because all but one of the charts has Customer as the first dimension, sorting by that (or joining it last) improves the performance most, overall. But that is highly variable, depending on your UI, so unfortunately there is no one right answer on how to universally squeeze the best performance out of a data model.
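
    For anyone who wants to try suggestion 3, here is a minimal sketch of a resident re-sort, assuming the table and field names from the post:

    ```
    // Re-sort the resident fact table by CustomerID without changing the
    // data model's structure; table and field names assumed from the post.
    SortedInvoice:
    NOCONCATENATE LOAD *
    RESIDENT Invoice
    ORDER BY CustomerID;

    DROP TABLE Invoice;
    RENAME TABLE SortedInvoice TO Invoice;
    ```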

    • Michael, that’s an interesting theory. I’d be very surprised if it holds up though. Qlik’s index is much smarter than simply sorting a table, and uses memory pointers behind the scenes. I’ve never heard of a sort order making any difference in performance. Unfortunately, I don’t have time at the moment to test this, but would be interested in your results if you do!

  6. Michael Steedle says:

    WordPress is giving me an error when I try to post any URL. Go to michaelsteedle.com. I just posted the results.

  7. Rob Wunderlich says:

    I have tested Mike’s ideas about sort order and have come to similar conclusions. I’ve posted some detailed results here: qlikviewcookbook.com/2018/09/does-data-sort-order-impact-chart-calc-time/

    • Very cool, Rob. Didn’t realize that the question of sorting had already been “asked and answered”, as it were. Do you implement key-field sorting in all your apps as a matter of habit? And which key field do you sort on, the heaviest?

      • Rob Wunderlich says:

        My observation is that the performance difference is only significant when the chart has a lot of rows in the primary dimension. So yes, for large apps I try to keep the fact table sorted on that field. On the other hand, I tend to avoid displaying charts with many rows so it’s not usually a big win.

  8. For future readers, here’s a link to Mike’s latest post.
