Saturday, March 23, 2013

TPC-H Query 1: Performance Optimization Through Materialization

In this tutorial, we will attempt to improve the performance to TPC-H query 1 for scale factor 10 database. For this purpose, we have generated the LINEITEM table with scale factor 10. Table generation will take some time as it will be of 7.3 GB. For this tutorial, I will be using Intel Core 2 Duo 2.2 GHz machine with only 2 GB of RAM running Windows 7. I will be using SQL Server 2008 DBMS. Time may vary for you depending upon the type of hardware you have.

Once LINEITEM.tbl file was generated. I create LINEITEM table using following TPC-H ddl query:

CREATE TABLE LINEITEM ( L_ORDERKEY    INTEGER NOT NULL,
                             L_PARTKEY     INTEGER NOT NULL,
                             L_SUPPKEY     INTEGER NOT NULL,
                             L_LINENUMBER  INTEGER NOT NULL,
                             L_QUANTITY    DECIMAL(15,2) NOT NULL,
                             L_EXTENDEDPRICE  DECIMAL(15,2) NOT NULL,
                             L_DISCOUNT    DECIMAL(15,2) NOT NULL,
                             L_TAX         DECIMAL(15,2) NOT NULL,
                             L_RETURNFLAG  CHAR(1) NOT NULL,
                             L_LINESTATUS  CHAR(1) NOT NULL,
                             L_SHIPDATE    DATE NOT NULL,
                             L_COMMITDATE  DATE NOT NULL,
                             L_RECEIPTDATE DATE NOT NULL,
                             L_SHIPINSTRUCT CHAR(25) NOT NULL,
                             L_SHIPMODE     CHAR(10) NOT NULL,
                             L_COMMENT      VARCHAR(44) NOT NULL);


After creating the table I loaded the data from LINEITEM.tbl file to my table using SQL Server BULK INSERT command.

BULK INSERT LINEITEM FROM 'D:\lineitem.tbl' WITH (FIELDTERMINATOR = '|')

This will take some good amount of time. On my hardware it took around 45 minutes. Once data loading is successfully complete, I will recommend you to restart SQL Server Database Engine services. This will free lot of memory that SQL Server might not be willing to release otherwise. This behavior can be observed in image below:



After successfully loading the data. The second step is to create constraints. In this tutorial, I will only be creating the primary key constraint. However, there are two more foreign key constraint in proper TPC-H specification. Please create the primary key for LINEITEM table using the query below:

ALTER TABLE dbo.LINEITEM

ADD PRIMARY KEY (L_ORDERKEY,L_LINENUMBER);


On my computer, it took around 32 minutes to complete. And I again restarted SQL Server Database Engine services to ensure that my computer is with enough memory to progress forward without heavily relying on virtual memory.



Now we are ready to execute TPC-H Query 1 for first time on our scale factor 10 LINEITEM table. Please execute the following query:



SELECT
L_RETURNFLAG
, L_LINESTATUS
, SUM(L_QUANTITY) AS SUM_QTY
, SUM(L_EXTENDEDPRICE) AS SUM_BASE_PRICE
, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS SUM_DISC_PRICE
, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)) AS SUM_CHARGE
, AVG(L_QUANTITY) AS AVG_QTY
, AVG(L_EXTENDEDPRICE) AS AVG_PRICE
, AVG(L_DISCOUNT) AS AVG_DISC
, COUNT(*) AS COUNT_ORDER
FROM LINEITEM
WHERE L_SHIPDATE <= dateadd(dd, -90, cast('1998-12-01' as datetime))
GROUP BY L_RETURNFLAG, L_LINESTATUS
ORDER BY L_RETURNFLAG,L_LINESTATUS

This query took around 8 minutes to complete on my computer. Now consider a business user executing the same query. Will it be affordable to to wait for 8 minutes, especially when you boss is on your head waiting for your response? Of-course not. We must find a way to reduce the query execution time.

As I mentioned earlier, keep check on your computer memory consumption. If it is high, a quick relief will be to restart the engine to keep progressing with this tutorial.


The result of query 1 was as shown below:


What are possibilities to reduce the query execution time? One popular approach is to use materialized views. In this tutorial, we will use the same approach, but as we are using SQL Server 2008. It will be done a different way. To achieve similar behavior as materialized view. We have to make use of clustered index in SQL Server 2008.

The solution is simple. First create a view. In this view you have to return all columns that you want to use as criteria, i.e., to restrict the query results. For example, in TPC-H Query 1 L_SHIPDATE is is used in where clause, therefore, we will return L_SHIPDATE as column in our view. A sample view is give below:

create view queryone with schemabinding as
SELECT
L_SHIPDATE
, L_RETURNFLAG
, L_LINESTATUS
, SUM(L_QUANTITY) AS SUM_QTY
, SUM(L_EXTENDEDPRICE) AS SUM_BASE_PRICE
, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS SUM_DISC_PRICE
, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)) AS SUM_CHARGE
, SUM(L_QUANTITY) AS AVG_QTY_SUM
, SUM(L_EXTENDEDPRICE) AS AVG_PRICE_SUM
, SUM(L_DISCOUNT) AS AVG_DISC_SUM
--, AVG(L_QUANTITY) AS AVG_QTY --We have done AVG_QTY_SUM above
--, AVG(L_EXTENDEDPRICE) AS AVG_PRICE --We have done AVG_PRICE_SUM above
--, AVG(L_DISCOUNT) AS AVG_DISC -- We have done AVG_DISC_SUM
, COUNT_BIG(*) AS COUNT_ORDER
FROM dbo.LINEITEM
--WHERE L_SHIPDATE <= dateadd(dd, -90, cast('1998-12-01' as datetime))
GROUP BY L_SHIPDATE, L_RETURNFLAG, L_LINESTATUS
--ORDER BY L_RETURNFLAG,L_LINESTATUS --invalid in views
go

An important point to observe here is that we have used with schemabinding option while creating view. It is important to successfully complete our next step. the Once the view is created. The next step is to create a unique clustered index on view using GROUP BY clause fields in the same order. Unique clustered index query is given below:

create unique clustered index queryonemv on queryone(L_SHIPDATE, L_RETURNFLAG, L_LINESTATUS)
go

Creating clustered index will take some time. And again before proceeding further, it is recommended to restart the SQL Server Database Engine to release the memory still in use by engine as shown below in Figure.


Now we are done with our optimization. All we need to do is to check our TPC-H query 1 again, but this time we will execute it on our newly created view instead of LINEITEM table directly. The query rewritten to get the same result as the TPC-H Query 1 on LINEITEM table is given below:

select
[L_RETURNFLAG]
      ,[L_LINESTATUS]
      ,sum([SUM_QTY]) as SUM_QTY
      ,sum([SUM_BASE_PRICE]) as SUM_BASE_PRICE
      ,sum([SUM_DISC_PRICE]) as SUM_DISC_PRICE
      ,sum([SUM_CHARGE]) as SUM_CHARGE
      ,round(avg([AVG_QTY_SUM]/[COUNT_ORDER]), 2) QTY_SUM
      ,round(avg([AVG_PRICE_SUM]/[COUNT_ORDER]), 2) PRICE_SUM
      ,round(avg([AVG_DISC_SUM]/[COUNT_ORDER]), 2) DISC_SUM
      ,sum([COUNT_ORDER]) as COUNT_ORDER
from queryone
WHERE L_SHIPDATE <= dateadd(dd, -90, cast('1998-12-01' as datetime))
GROUP BY L_RETURNFLAG, L_LINESTATUS
ORDER BY L_RETURNFLAG,L_LINESTATUS

This query took no time to return our results as show in figure below. Either we rely on Codd's Rules or FASMI test, this response time will be acceptable for all for any OLAP task.


In case you are using Oracle. You can perform similar optimization using following materialized view query (please correct the syntax and naming correction according to ORACLE conventions):

CREATE MATERIALIZED VIEW queryone as
SELECT
L_SHIPDATE
, L_RETURNFLAG
, L_LINESTATUS
, SUM(L_QUANTITY) AS SUM_QTY

, SUM(L_EXTENDEDPRICE) AS SUM_BASE_PRICE
, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS SUM_DISC_PRICE
, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)) AS SUM_CHARGE
, SUM(L_QUANTITY) AS AVG_QTY_SUM
, SUM(L_EXTENDEDPRICE) AS AVG_PRICE_SUM
, SUM(L_DISCOUNT) AS AVG_DISC_SUM
--, AVG(L_QUANTITY) AS AVG_QTY
--, AVG(L_EXTENDEDPRICE) AS AVG_PRICE
--, AVG(L_DISCOUNT) AS AVG_DISC
, COUNT(*) AS COUNT_ORDER
FROM LINEITEM
--WHERE L_SHIPDATE <= dateadd(dd, -90, cast('1998-12-01' as datetime))
GROUP BY L_SHIPDATE, L_RETURNFLAG, L_LINESTATUS
--ORDER BY L_RETURNFLAG,L_LINESTATUS --invalid in views