Skip to content

Reading SQL PDW Query Plans (Part 2)

2011 January 20

Querying a Distributed Table

In the first part of this series, we took a look at a very simple query that hit a single replicated table.  Let’s take a look at another fairly straightforward query. A simple query that queries a single distributed table.  I know these may seem simplistic, but I believe that building upon the straightforward helps us understand the more complicated stuff coming up.  Here is the query:

select ProductKey 
     , OrderDateKey 
     , ShipDateKey  
     , CustomerKey 
     , SalesAmount 

  from FactInternetSales

 
Query Plan from Admin Console (click to enlarge):
image

Looking at the query plan for this query, we can see that there are only two steps.

0. OnOperation that runs the query on all distributions.

1. ReturnOperation that streams the results back to the client.

By clicking on the View Details link for Step ID 0, we get the following page(click to enlarge).

clip_image004

We are looking at four compute node mini appliance used for testing. On any appliance, each compute node has eight distributions, thus a four node appliance has 32 distributions.  Each of the 32 distributions has it’s own physical table that stores its portion of data for the logical FactInternetSales table.  This is based off of the hash of the distribution key defined at table creation.  Each of these 32 distributions has a query passed to it that returns its portion of the data.  Here I can see the duration of the query for each distribution. This can be helpful in diagnosing long running queries. If I have a certain distribution taking significantly longer than others, it may be an indication of data skew.
 
A 10 compute node HP appliance would have 80 distributions and thus would have 80 separate queries passed to the compute nodes for the same query we are looking at.  Each one of those queries would be associated with it’s own processor through SoftNuma and its data would reside on its own LUN based on the filegroup layout.  This is where the scale of MPP becomes quite handy and efficient.
 

Just like in the example of a simple query against a replicated table, a simple query against a distributed table does not involved storing any data on the control node.  Once results are returned back from the compute node, the results are streamed directly to the client.  The power of

Another look at the query plan (click to enlarge):

image

In the next post we’ll start looking at what happens when you join dimension tables to fact tables.  At the physical level, that would be a replicated table joined to a distributed table.

No comments yet

Leave a Reply

Note: You can use basic XHTML in your comments. Your email address will never be published.

Subscribe to this comment feed via RSS