Saving memory with Interval Match

As veteran QlikView users well know, Interval Match can be one of the most powerful functions available to developers. Interval Match is used to match a date/timestamp in one table to an interval in another (defined by a numeric start date/timestamp and a numeric end date/timestamp). However, even expert users may sometimes not consider using Interval Match as often as they should. Interval Match provides 2 types of savings: (1) Memory: avoids having to load all possible dates/timestamps into a Fact table; and (2) Load Time: often not using Interval Match means that a join must be used instead which, especially on large data sets, may add seconds or even minutes to script reload times. This article will lay out one non-traditional method of using Interval Match that will hopefully give you some ideas as to the versatility of this function!

Let's take a hypothetical scenario. The client is a global online education institution that offers classes to its students. The client would like a QlikView dashboard to be able to analyze how many classes it offers per day. To facilitate this, the client has provided daily snapshots of its database of classes. Each snapshot contains a list of classes offered on that date, and 2 years of snapshots have been provided with approximately 300,000 rows per snapshot.

As most readers can probably tell already, we are going to have a data volume challenge if we attempt to concatenate all snapshots onto a Fact table. 2 years of snapshots x 300,000 rows per snapshot = 219,000,000 rows! But let's remember the type of data we are dealing with. Each snapshot contains information about classes, and one thing we know about classes is that they have a start date and an end date (for example, at the beginning of the semester and the end of the semester, respectively). Interval Match can save the day!

Step 1 - Load All Snapshots
Before we can use Interval Match, we must first load all snapshots into memory. The snapshots should all be concatenated onto a single resident table, making sure to include the snapshot date field.

Step 2 - Define Interval Start
To identify the interval start, an ORDER BY load is necessary. A new interval start date should be declared each time a new class is encountered or a dimension of that class changes. For example, suppose each snapshot consists of the following fields:

ClassID,
ClassType,
Instructor,
SnapshotDate

Assuming that all snapshots have been loaded into resident table "snapshots", then the ORDER BY load to define the interval start would resemble the following:

fact_temp1:
LOAD
ClassID,
ClassType,
Instructor,
SnapshotDate,
if(ClassID=previous(ClassID) and ClassType=previous(ClassType) and Instructor=previous(Instructor)
and SnapshotDate=previous(SnapshotDate)+1,
peek(SegmentStartDate),
SnapshotDate
) as SegmentStartDate
RESIDENT snapshots
ORDER BY ClassID,SnapshotDate;

Step 3 - Define Interval End
We can identify the interval end with a simple GROUP BY load to find the maximum date per interval start. This load will also result in our final Fact table:

FACT:
LOAD
ClassID,
ClassType,
Instructor,
SegmentStartDate,
max(SnapshotDate) as SegmentEndDate
RESIDENT fact_temp1
GROUP BY ClassID,ClassType,Instructor,SegmentStartDate;

The temporary tables "snapshots" and "fact_temp1" can now be dropped.

Step 4 - Create a calendar
Before we can match our intervals, we need to create a calendar to have a date field on which to match. The following simple code will suffice:

min_max_dates:
LOAD
min(SegmentStartDate) as mindate,
max(SegmentEndDate) as maxdate
RESIDENT FACT;

LET vMinDate = peek('mindate');
LET vMaxDate = peek('maxdate');
LET vNoDays = '$(vMaxDate)' - '$(vMinDate)' + 1;

Calendar:
LOAD
*,
weekstart(Date) as Week,
year(Date) as Year,
month(Date) as Month,
date(monthstart(Date),'MMM YYYY') as MonthYear,
'Q' & ceil(month(Date)/3) as Quarter,
day(Date) as Day,
weekday(Date) as Weekday,
'Q' & ceil(month(Date)/3) & ' - ' & year(Date) as YearQuarter
;
LOAD
date('$(vMinDate)' + recno() - 1) as Date
AUTOGENERATE $(vNoDays);

Step 5 - Interval Match
It is always a best practice to only use Interval Match on distinct intervals. For example, if 5 classes are offered with the same start and end date, then only a single interval need to be matched on. To use Interval Match with our data, simply do the following:

interval_temp:
LOAD DISTINCT
SegmentStartDate,
SegmentEndDate
RESIDENT FACT;

INTERVAL:
INTERVALMATCH (Date) LOAD
SegmentStartDate,
SegmentEndDate
RESIDENT interval_temp;

The temporary table "interval_temp" can now be dropped. The result of this load script will produce (1) a Fact table with a greatly reduced number of rows, (2) a simple Calendar table, (3) an Interval table containing field: Date, SegmentStartDate, and SegmentEndDate, and (4) a synthetic table containing fields SegmentStartDate and SegmentEndDate. In general, Interval Match is one of the few cases in QlikView where synthetic keys are a best practice!

As mentioned above, simply concatenating all the snapshots onto the Fact table would have resulted in ~219,000,000 rows. Suppose that the data set contained only 2,000,000 unique ClassID values. Using the method described in this article, the Fact table will be reduced to ~2,000,000 rows. Performance will be greatly improved and memory consumption will be reduced!

This entry was posted in Development and tagged , , , . Bookmark the permalink.

5 Responses to Saving memory with Interval Match

  1. Alexander Boyko says:

    A very good example.
    I would like more source code to test these possibilities.

  2. Thanks, Alexander. All the code is available above. Unfortunately, I don’t have any non-confidential example data that I can share.

  3. Frank van der Mast says:

    Please be advised the procedure laid out here only works if your snapshots cannot ever repeat.

    E.g. if you have 3 days where ‘situation A’ exists, then a date where it doesn’t, and then again 3 days where ‘situation A’ exists.
    Doing the MIN and MAX date grouping yields you with a min date for the first occurrence of ‘Situation A’ while your MAX date yields the last date of the second occurrence of ‘Situation A’, yielding you with a single interval entry of 7 days instead of 2 intervals of 3 days each!

    • Frank van der Mast says:

      Correction: I misread your procedure, it does account for that! 🙂
      I quickly glanced over it and thought you did a MIN MAX over the date field, which wouldn’t work.
      Please disregard my previous reply haha.

Leave a Reply

Your email address will not be published. Required fields are marked *

Notify via email when new comments are added

Blog Home
Categories
Archives