When it comes to performance tuning a database, there are several proven methods available, including indexing, fine tuning the queries, etc. Partitioning can also be used to categorize data based on its nature (for example geography) and then playing with a subset of the data. As the volume of records accessed is lower, query performance is improved. But this may not yield performance gains in all the cases. I am going to discuss my experiments on this.
In my current data warehousing project, the data warehouse stores a couple of years of data but the reporting requires only recent information. Although the business requires older data to be readily available in the data warehouse (DW), in case they need to do some ad-hoc analysis. Now the reporting subsystem is facing performance issues in rendering time for a lot of reports. The solution suggested by architects, to use report-ready tables, containing only the recent records. Means, all data required by reports for the final rendering is prepared by the ETL process itself and made available as a cache, from which reports can directly pick records or do some filtering or aggregation. Of course this solves the problem, but reduces the reporting tool to a mere rendering tool, not taping its full potential. No need to say that you will need to rewrite the ETL to change anything in the final reports.
So, I thought of an alternate mechanism, to use partitions. My plan was to partition the data based on transaction date values available in the records. May be views can be created with a filter for transaction date restricting it to the recent data, forcing the DB engine to look in the partition with the latest data. Architects were not convinced about the performance gain it can give. I did some googling and found some positive results. So I decided to do some testing on this. Below is how I did this…
-
Simulate the current scenario:
To understand the performance gain that can be achieved, the current situation has to be simulated to a reasonable extent. Tables need to be with similar structure and various types of columns reflecting the tables in the data warehouse.
-
Create a table with similar data types
The following script was used to create the table containing an identity column as key column, a date column for partitioning, a string column for text search checks, etc.
CREATE TABLE PartitionTest
(ID int,
TxDate Datetime,
Amount float,
Comment varchar(15),
CONSTRAINT [PK_PartitionTest] PRIMARY KEY CLUSTERED (ID ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
ON [PRIMARY])
-
Populate the table with random sample data
Following script was used to populate the table. A sample of 10 million records was used.
DECLARE
@i INT = 1,
@Date DATETIME = ’2006-01-01′,
@Today DATE = GETDATE()
WHILE @i<=10000000
BEGIN
DECLARE
@j INT = 0,
@k INT = 5+CAST(rand()*10 AS INT),
@Str VARCHAR(15) = ”
WHILE @j<@k
BEGIN
SET @Str=@Str+CHAR(CAST(rand()*100 AS INT)%26+97)
SET @j =@j+1
END
INSERT INTO PartitionTest VALUES (
@i
,@Date
,CAST(rand()*100 as int)
,@Str
)
SET @Date=DATEADD(MILLISECOND,CAST(rand()*10 AS INT)+1,@Date)
SET @i =@i+1
END
-
Create a copy of the table
A copy of the table (same 10 million records) with a name PartitionTestPT was created.
-
Partition the copy
The PartitionTestPT was partitioned based on the TxDate column with appropriate partition boundaries. As the data generated covered only a time span of 15 hours, one partition per hour was created.
-
Fire queries and measure execution time
As the sample data is ready now, different queries have to be fired with filters using different columns, to assess the performance using query execution time.
-
Fire queries based on primary key column
BEGIN
DECLARE
@StartTime DATETIME2=GETDATE()
BEGIN
SELECT *
FROM PartitionTest
WHERE ID=5489257
SELECT DATEDIFF(MILLISECOND,@StartTime,GETDATE())
END
END
-
Fire queries based on partition column
BEGIN
DECLARE
@StartTime DATETIME2=GETDATE()
BEGIN
SELECT *
FROM PartitionTest
WHERE TxDate=’2006-01-01 08:38:33.587′
SELECT DATEDIFF(MILLISECOND,@StartTime,GETDATE())
END
END
-
Fire queries based on text column
BEGIN
DECLARE
@StartTime DATETIME2=GETDATE()
BEGIN
SELECT *
FROM PartitionTest
WHERE Comment=’acynieci’
SELECT DATEDIFF(MILLISECOND,@StartTime,GETDATE())
END
END
-
Analyze final results
Below is a summary of results for the above queries, fired with different values. (The table was corrected after Niyaz pointed out some mistakes)
| Search Column |
Value used |
Time taken (ms) |
Value used |
Time taken (ms) |
Value used |
Time taken (ms) |
Average (ms) |
| Before partitioning |
| Key |
9875236 |
0 |
1878236 |
0 |
5489257 |
47 |
15.67 |
| String |
ahfwkubgiuhpr |
1987 |
dfmscx |
1987 |
acynieci |
10203 |
4725.67 |
| Date |
2006-01-01 15:32:45.510 |
1530 |
2006-01-01 02:57:27.780 |
1516 |
2006-01-01 08:38:33.587 |
1530 |
1525.33 |
| After partitioning |
|
9875236 |
1814 |
1878236 |
1780 |
5489257 |
9250 |
4281.33 |
|
ahfwkubgiuhpr |
2470 |
dfmscx |
2466 |
acynieci |
2436 |
2457.33 |
|
2006-01-01 15:32:45.510 |
93 |
2006-01-01 02:57:27.780 |
127 |
2006-01-01 08:38:33.587 |
140 |
120.00 |
| After partitioning and indexing the ID column |
|
9875236 |
17 |
1878236 |
30 |
5489257 |
63 |
36.67 |
|
ahfwkubgiuhpr |
2610 |
dfmscx |
2546 |
acynieci |
2470 |
2542.00 |
|
2006-01-01 15:32:45.510 |
123 |
2006-01-01 02:57:27.780 |
94 |
2006-01-01 08:38:33.587 |
127 |
114.67 |
[Following paragraph was edited after Niyaz’s comment]
The results clearly show a huge performance gain, around 12 ~ 13 times, for queries utilizing the partitioned column. And queries utilizing the key column are not affected much (after re-planning the index on ID column). Even the string column based search has improved very much.
To answer Niyaz’s questions, I ran another pair of queries to see how partitioning will perform in a real environment. I used the following queries,
BEGIN
DECLARE
@StartTime DATETIME2=GETDATE()
BEGIN
SELECT *
FROM PartitionTestPTAI
WHERE
Comment=’acynieci’
SELECT DATEDIFF(MILLISECOND,@StartTime,GETDATE())
END
END
And then used a query with a filter on TxDate column to utilize the partition, as shown below,
BEGIN
DECLARE
@StartTime DATETIME2=GETDATE()
BEGIN
SELECT *
FROM PartitionTestPTAI
WHERE
Comment=’acynieci’
AND TxDate BETWEEN ’2006-01-01 08:00:00.000′ AND ’2006-01-01 09:00:00.000′
SELECT DATEDIFF(MILLISECOND,@StartTime,GETDATE())
END
END
And below are the results I got.
| Search Column |
Value used |
Time taken (ms) |
Value used |
Time taken (ms) |
Value used |
Time taken (ms) |
Average (ms) |
| Without the filter on TxDate column |
| String |
ahfwkubgiuhpr |
2550 |
dfmscx |
2546 |
acynieci |
2486 |
2527.33 |
| With the filter on TxDate column |
| Comment |
ahfwkubgiuhpr |
187 |
dfmscx |
170 |
acynieci |
407 |
254.67 |
| TxDate |
2006-01-01 15:00:00.000 |
|
2006-01-01 02:00:00.000 |
|
2006-01-01 08:00:00.000 |
|
|
The results shows that, when a partitioned column is utilized in the query, it may provide up to ten times more performance.
So that concludes my experiments on partitioning. Please let me know your views on this.