Skip to content

Reading SQL PDW Query Plans (Part 3)

2011 January 26

Welcome to my series of posts on reading SQL Server 2008 R2 Parallel Data Warehouse Query Plans.  In previous posts, in Part 1 we looked at a simple query against a dimension table that was replicated and in Part 2 we examined a query against a fact table that was distributed.  Today we will take a look at what happens when you join a distributed table with a replicated table.

One Replicated Table joining One Distributed Table

We’ll first look at a simple join between two tables and the results we get from looking at the plan in the admin console as we have done before.  We will be joining together two tables: DimDate and FactInternetSales.  DimDate is a replicated table, meaning that it resides in whole on each compute node.  FactInternetSales is a distributed table. It is hash distributed on the SalesOrderNumber column.  For more on the basics of how PDW table architecture, check out this presentation from Matt Peebles. 

Here is the query:

SELECT DateKey
     , FiscalYear
     , ProductKey
     , SalesAmount
  FROM DimDate              dat
  JOIN FactInternetSales    sal
    ON dat.DateKey     =    sal.OrderDateKey;

  

Here is the plan.  Remember, the important things to look for are going through the steps one by one and looking at the operation, location, and distribution.  In this case I can see that in StepID 0, I have an OnOperation occurring on AllDistributions.  This means that the query shown will run on all distributions within the appliance. 

image

In this appliance’s case that means that 32 queries are being sent to the compute nodes and returned to the control node who then, according to StepID 1, will simply stream the results back to the client that requested the information.  That doesn’t mean the compute node communicates with the client, but sends them back through the control node who does nothing but pass them on to the client.  All communication with the PDW appliance is done through the control node.  As far as the client is concerned, it has no idea its talking to a set of servers, it see just one PDW database server.

Using Explain

So far, its been simple and quick to use the admin console.  One question you might have is what if I have a complicated query and it takes 20 minutes to run?  I don’t want to wait until the query is complete to then login in to the Admin Console to find out what it did.  Do you have an equivalent to SQL Server’s SET SHOWPLAN_XML ON statement?  The good news is yes PDW does.  It’s the EXPLAIN statement.

EXPLAIN

SELECT DateKey
     , FiscalYear
     , ProductKey
     , SalesAmount
  FROM DimDate              dat
  JOIN FactInternetSales    sal
    ON dat.DateKey     =    sal.OrderDateKey;

If I run the above statement, instead of running the query and getting back a result set, I get the following query plan.  Essentially, it is the same as what you see in the admin console.  First, it shows the query being passed to PDW.  Second, it shows the query plan.  There are two DSQL Operations: an On and a Return.  The first dsql_operation called OnOperation is run on AllDistributions and produces resultset RS_3.  In the next dsql_operation, the ReturnOperation sends resultset RS_3 back to the client.

 <?xml version=”1.0″ encoding=”utf-8″?>
<dsql_query>
  <sql>select DateKey
     , FiscalYear
     , ProductKey
     , SalesAmount
  from DimDate              dat
  join FactInternetSales    sal
    on dat.DateKey = sal.OrderDateKey;</sql>
  <params />
  <dsql_operations>
    <dsql_operation>
      <optype>ON</optype>
      <location permanent=”false” distribution=”AllDistributions”>Compute</location>
      <resultset>RS_3</resultset>
      <sql_operations>
        <sql_operation type=”statement”>SELECT
[dat].[DateKey],
[dat].[FiscalYear],
[sal].[ProductKey],
[sal].[SalesAmount]
FROM
[adventureworksdw2008].[dbo].[DimDate] [dat] JOIN [adventureworksdw2008].[dbo].[FactInternetSales] [sal] ON ([dat].[DateKey] = [sal].[OrderDateKey]) </sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation>
      <optype>RETURN</optype>
      <resultset>RS_3</resultset>
    </dsql_operation>
  </dsql_operations>
</dsql_query>

No need to wait on a query to complete to find out what the PDW Query Plan will be, just use the EXPLAIN statement within your query windows to get the results right there in Nexus.  I’ll tend to use the screen captures of the admin console in the blog simply because they are much easier to digest quickly.

Two Replicated Tables joined to One Distributed Table

 

SELECT DateKey
     , FiscalYear
     , FirstName
     , LastName
     , EmailAddress
     , ProductKey
     , SalesAmount
  FROM DimDate              dat
  JOIN FactInternetSales    sal
    ON dat.DateKey      =   sal.OrderDateKey
  JOIN DimCustomer          cst
    ON sal.CustomerKey  =   cst.CustomerKey
WHERE FiscalYear       =   2004;

Both DimDate and DimCustomer are replicated tables, meaning they reside in whole on each compute node.  FactInternetSales is a distributed table, meaning the data is distributed on the compute nodes with each holding a partial set of the tables data.  How would you expect PDW to handle the above query? 

We should expect to see essentially the same results as before.   That is each query sent to a distribution should be able to answer the query wholly without having to move data around or aggregate it on the control node.   Each distribution will return the query and stream it back to the client with no data movement or temporary tables

In fact, that is what we get.  

image

 

Drilling into the details of the OnOperation we see the first 16 of 32 concurrent queries being run within the appliance.  If this was a one data rack HP appliance, this would have been 80 concurrent queries because there are 10 compute nodes in a HP rack.  I’ve expanded out one of the queries so that you can see the details of the query against the physical tables for distribution D on SQLCMP01. 

image

In future posts, we will continue to look at more complicated queries and their plans.  Additionally, we will start examining ways to mitigate plans that you don’t like.

No comments yet

Leave a Reply

Note: You can use basic XHTML in your comments. Your email address will never be published.

Subscribe to this comment feed via RSS