Skip to content

Reading SQL PDW Query Plans (Part 1)

2011 January 18

 Querying A Replicated Table

This week we will start a series of posts on reading SQL PDW Query plans when executing queries against SQL Server 2008 R2 Parallel Data Warehouse.  Just like on SQL Server, being able to read a query plan and then determining what you may be able to do to improve performance is an essential part of performance tuning not only for a developer but the administrator in charge of the appliance.  A few things you want to understand when looking at a query plan in PDW are:
  1. How are my tables designed?  Replicated or Distributed?
  2. What Actions are taking place on the Compute Nodes?
  3. Am I causing the results to get stored on the Control Node before streaming back to the client?  Do I need to?

Ideally you should already know which tables in your database are replicated and which are distributed.  If you don’t, you can quickly check it out in Nexus, where there is a nifty icon next to the table name.  Notice the multiple table icon next to the dimension tables in my copy of AdventureWorksDW2008?  These icons indicate that the table exists in whole as a replicated table on each compute node.  A replicated table is a table that is stored in its entirety on each Compute node.  The single table icon next to the fact tables indicates a distributed table.  A distributed table is a table whose rows are dispersed across multiple storage locations.  Each compute node has eight distributions.

 
image
 
Additionally, you can query the DMV’s to determine which tables are distributed or replicated:
 

SELECT name
     , distribution_policy
     , distribution_policy_desc
  FROM sys.tables;

select tab.name
     , case when count(*) > 1
             then ‘Distributed’
             else ‘Replicated’
       end
  from sys.tables tab
  join sys.pdw_table_mappings map
    on tab.object_id = map.object_id
group by tab.name

 
 
Result Set:
image
 

Querying from Replicated Table

 

Let’s take a simple example of querying from a single replicated table, DimCustomer.

 

select CustomerKey
     , FirstName
     , LastName
     , Phone
  from DimCustomer

 
I know that I have a replicated table of DimCustomer residing on each of my compute nodes.  If I’m only querying from this table, logic would tell me that PDW really only needs to extract the information from a single compute node.  Looking at the query plan, we can see that this is exactly what happens.  This query plan only has two steps:
  1. The OnOperation can be used to run a query that does not require data movement.  A close look at this line tells us that it is being run on a ComputeNode and gives us the logical query being run on against PDW>
  2. The ReturnOperation which is the last step in any query plan.  The ReturnOperation does what you think it does, it sends the results back to the requester.
 
image
 
If you click on View Details for Step ID 0, you can get the actual compute node that this query ran on.  In this case, it was SQLCMP04.  Additionally, you can see the actual query passed to the compute node.
 
image
 
What may be interesting about these operations is that if I rerun the query, it will run on a different compute node each time.  From my observations, it looks like that PDW will rotate each operation on the appliance that needs to run on a single node only and increment the node id and run it there.  In this fashion, no single node is ever bearing the brunt these operations.  Finally, for this query, it is important to notice that there are no operations occurring on the control node, thus the results are streamed directly back to the client.
 
In the next post, we will take a look at a simple query against a distributed table.
2 Responses leave one →
  1. November 29, 1999

    A simpler approach to determining if a table is replicated or distributed is to view the sys.tables view directly. There are two additional columns in the system tables view that provide the nature of the table; distribution_policy and distribution_policy_desc. The distribution_policy column is an integer representation whereas the distribution_policy_desc actuals spells out if the table is replicated or distributed. The supported values in distribution_policy are 2 and 3, where 2 is distributed and 3 is replicated.

    E.g.
    CREATE TABLE dist_table (i int) with (distribution=hash(i));
    CREATE TABLE rep_table (i int);
    SELECT
    name,
    distribution_policy,
    distribution_policy_desc
    FROM sys.tables;

    Result:

    name distribution_policy distribution_policy_desc
    —————————————————————————————-
    rep_table 3 REPLICATE
    dist_table 2 HASH

  2. Brian Mitchell permalink*
    November 29, 1999

    @Justin – You are right. I’ve run across that before and completely forgot about it when writing the post. I’ve added your query to the post. Thanks Justin!

Leave a Reply

Note: You can use basic XHTML in your comments. Your email address will never be published.

Subscribe to this comment feed via RSS