Posts Tagged: Data warehousing


26
Aug 09

SSIS: Transaction enabled tasks fail due to MsDTC

In my current project we have multiple SSIS developers in team and we all were sharing the database server instance in my system. We faced an issue yesterday when a team mate was trying to implement transactions in SSIS. The package fails in other machines although it is running smooth in my system where the database resides. The error message thrown out was,

The SSIS Runtime has failed to enlist the OLE DB connection in a distributed transaction with error 0x8004D00E “The transaction has already been implicitly or explicitly committed or aborted”

Soon we realized that this is something related to Microsoft Distributed Transaction Coordinator (MsDTC). We did some search and got a tool called “Dtcping.exe” which will check the health status of MsDTC processes in different machines. The tool reported an error “Access denied”, hinting some security issue with in the MsDTC. But we were not lucky even after a couple of hours of Googling. Then I decided to lean the security settings for MsDTC and I found that all network related connectivity is disabled by default. I learned that authentication settings can cause trouble as our machines were running Windows XP and as they were in a work group (we have a strange network configuration in my organization). I changed to “No Authentication Required” for MsDTC instances in all machines and it worked! To change the Security Configuration for MsDTC, go to Control Panel >> Administrative Tools >> Component Services >> Computers >> Right click My Computer and then click Properties >> Click the MSDTC tab >> Click Security Configuration. Below is a screen shot of the settings that I used, but I do not recommend this configuration for all cases as I am not aware of the impact it can have on securiy.

MsDTC

Thought of sharing this with you. And check the Technet link describing the security settings for MsDTC. It is very useful.


2
Aug 09

Does partitioning improve performance for SQL tables?

When it comes to performance tuning a database, there are several proven methods available, including indexing, fine tuning the queries, etc. Partitioning can also be used to categorize data based on its nature (for example geography) and then playing with a subset of the data. As the volume of records accessed is lower, query performance is improved. But this may not yield performance gains in all the cases. I am going to discuss my experiments on this.

In my current data warehousing project, the data warehouse stores a couple of years of data but the reporting requires only recent information. Although the business requires older data to be readily available in the data warehouse (DW), in case they need to do some ad-hoc analysis. Now the reporting subsystem is facing performance issues in rendering time for a lot of reports. The solution suggested by architects, to use report-ready tables, containing only the recent records. Means, all data required by reports for the final rendering is prepared by the ETL process itself and made available as a cache, from which reports can directly pick records or do some filtering or aggregation. Of course this solves the problem, but reduces the reporting tool to a mere rendering tool, not taping its full potential. No need to say that you will need to rewrite the ETL to change anything in the final reports.

So, I thought of an alternate mechanism, to use partitions. My plan was to partition the data based on transaction date values available in the records. May be views can be created with a filter for transaction date restricting it to the recent data, forcing the DB engine to look in the partition with the latest data. Architects were not convinced about the performance gain it can give. I did some googling and found some positive results. So I decided to do some testing on this. Below is how I did this…

  1. Simulate the current scenario:

    To understand the performance gain that can be achieved, the current situation has to be simulated to a reasonable extent. Tables need to be with similar structure and various types of columns reflecting the tables in the data warehouse.

    1. Create a table with similar data types

      The following script was used to create the table containing an identity column as key column, a date column for partitioning, a string column for text search checks, etc.

      CREATE TABLE PartitionTest
      (ID int,
      TxDate Datetime,
      Amount float,
      Comment varchar(15),
      CONSTRAINT [PK_PartitionTest] PRIMARY KEY CLUSTERED (ID ASC)
      WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON)
      ON [PRIMARY])

    2. Populate the table with random sample data

      Following script was used to populate the table. A sample of 10 million records was used.

      DECLARE
      @i INT = 1,
      @Date DATETIME = ’2006-01-01′,
      @Today DATE = GETDATE()

      WHILE @i<=10000000
      BEGIN
      DECLARE
      @j INT = 0,
      @k INT = 5+CAST(rand()*10 AS INT),
      @Str VARCHAR(15) = ”
      WHILE @j<@k
      BEGIN
      SET @Str=@Str+CHAR(CAST(rand()*100 AS INT)%26+97)
      SET @j  =@j+1
      END
      INSERT INTO PartitionTest VALUES (
      @i
      ,@Date
      ,CAST(rand()*100 as int)
      ,@Str
      )

      SET @Date=DATEADD(MILLISECOND,CAST(rand()*10 AS INT)+1,@Date)
      SET @i   =@i+1
      END

    3. Create a copy of the table

      A copy of the table (same 10 million records) with a name PartitionTestPT was created.

    4. Partition the copy

      The PartitionTestPT was partitioned based on the TxDate column with appropriate partition boundaries. As the data generated covered only a time span of 15 hours, one partition per hour was created.

  2. Fire queries and measure execution time

    As the sample data is ready now, different queries have to be fired with filters using different columns, to assess the performance using query execution time.

    1. Fire queries based on primary key column

      BEGIN
      DECLARE
      @StartTime DATETIME2=GETDATE()
      BEGIN
      SELECT *
      FROM   PartitionTest
      WHERE  ID=5489257

      SELECT DATEDIFF(MILLISECOND,@StartTime,GETDATE())
      END
      END

    2. Fire queries based on partition column

      BEGIN
      DECLARE
      @StartTime DATETIME2=GETDATE()
      BEGIN
      SELECT *
      FROM   PartitionTest
      WHERE  TxDate=’2006-01-01 08:38:33.587′

      SELECT DATEDIFF(MILLISECOND,@StartTime,GETDATE())
      END
      END

    3. Fire queries based on text column

      BEGIN
      DECLARE
      @StartTime DATETIME2=GETDATE()
      BEGIN
      SELECT *
      FROM   PartitionTest
      WHERE  Comment=’acynieci’

      SELECT DATEDIFF(MILLISECOND,@StartTime,GETDATE())
      END
      END

  3. Analyze final results

    Below is a summary of results for the above queries, fired with different values. (The table was corrected after Niyaz pointed out some mistakes)

Search Column Value used Time taken (ms) Value used Time taken
(ms)
Value used Time taken
(ms)
Average (ms)
Before partitioning
Key 9875236 0 1878236 0 5489257 47 15.67
String ahfwkubgiuhpr 1987 dfmscx 1987 acynieci 10203 4725.67
Date 2006-01-01 15:32:45.510 1530 2006-01-01 02:57:27.780 1516 2006-01-01 08:38:33.587 1530 1525.33
After partitioning
9875236 1814 1878236 1780 5489257 9250 4281.33
ahfwkubgiuhpr 2470 dfmscx 2466 acynieci 2436 2457.33
2006-01-01 15:32:45.510 93 2006-01-01 02:57:27.780 127 2006-01-01 08:38:33.587 140 120.00
After partitioning and indexing the ID column
9875236 17 1878236 30 5489257 63 36.67
ahfwkubgiuhpr 2610 dfmscx 2546 acynieci 2470 2542.00
2006-01-01 15:32:45.510 123 2006-01-01 02:57:27.780 94 2006-01-01 08:38:33.587 127 114.67

[Following paragraph was edited after Niyaz’s comment]
The results clearly show a huge performance gain, around 12 ~ 13 times, for queries utilizing the partitioned column. And queries utilizing the key column are not affected much (after re-planning the index on ID column). Even the string column based search has improved very much.

To answer Niyaz’s questions, I ran another pair of queries to see how partitioning will perform in a real environment. I used the following queries,

BEGIN
DECLARE
@StartTime DATETIME2=GETDATE()
BEGIN
SELECT *
FROM   PartitionTestPTAI
WHERE 
Comment=’acynieci’

SELECT DATEDIFF(MILLISECOND,@StartTime,GETDATE())
END
END

And then used a query with a filter on TxDate column to utilize the partition, as shown below,

BEGIN
DECLARE
@StartTime DATETIME2=GETDATE()
BEGIN
SELECT *
FROM   PartitionTestPTAI
WHERE 
Comment=’acynieci’
AND TxDate BETWEEN ’2006-01-01 08:00:00.000′ AND ’2006-01-01 09:00:00.000′

SELECT DATEDIFF(MILLISECOND,@StartTime,GETDATE())
END
END

And below are the results I got.

Search Column Value used Time taken (ms) Value used Time taken
(ms)
Value used Time taken
(ms)
Average (ms)
Without the filter on TxDate column
String ahfwkubgiuhpr 2550 dfmscx 2546 acynieci 2486 2527.33
With the filter on TxDate column
Comment ahfwkubgiuhpr 187 dfmscx 170 acynieci 407 254.67
TxDate 2006-01-01 15:00:00.000 2006-01-01 02:00:00.000 2006-01-01 08:00:00.000

The results shows that, when a partitioned column is utilized in the query, it may provide up to ten times more performance.

So that concludes my experiments on partitioning. Please let me know your views on this.