
Chapter 1 - What is Columnar?

When you go into court, you are putting your fate into the hands of twelve people who weren't smart enough to
get out of jury duty.
- Norm Crosby

WHAT IS PARALLEL PROCESSING?


"After enlightenment, the laundry"

-Zen Proverb

"After parallel processing the laundry, enlightenment!"

-Redshift Zen Proverb

Two guys were having fun on a Saturday night when one said, "I've got to go and do my laundry." The other
said, "What!?" The first man explained that if he went to the laundromat the next morning, he would be lucky to
get one machine and be there all day. But if he went on Saturday night, he could get all the machines. Then, he
could do all his wash and dry in two hours. Now that's parallel processing mixed in with a little dry humor!

THE BASICS OF A SINGLE COMPUTER

When you are courting a nice girl, an hour seems like a second. When you sit on a red-hot cinder, a second
seems like an hour. That's relativity.

Albert Einstein

Data on disk does absolutely nothing. When data is requested, the computer moves the data one block at a time
from disk into memory. Once the data is in memory, it is processed by the CPU at lightning speed. All computers
work this way. The "Achilles Heel" of every computer is the slow process of moving data from disk to memory.
The real theory of relativity is finding out how to get blocks of data from the disk into memory faster!

DATA IN MEMORY IS FAST AS LIGHTNING

You can observe a lot by watching.

Yogi Berra

Once the data block is moved off of the disk and into memory, the processing of that block happens as fast as
lightning. It is the movement of the block from disk into memory that slows down every computer. Data being
processed in memory is so fast that even Yogi Berra couldn't catch it!

PARALLEL PROCESSING OF DATA

"If the facts don't fit the theory, change the facts."

-Albert Einstein

Big Data is all about parallel processing. Parallel processing is all about taking the rows of a table and spreading
them among many parallel processing units. Above, we can see a table called Orders. There are 16 rows in the
table. Each parallel processor holds four rows. Now they can process the data in parallel and be four times as
fast. What Albert Einstein meant to say was, "If the theory doesn't fit the dimension table, change it to a fact."

A TABLE HAS COLUMNS AND ROWS

The table above has nine rows. Our small system above has three parallel processing units. Each unit holds three
rows.
EACH PARALLEL PROCESS ORGANIZES THE ROWS INSIDE A DATA BLOCK

The rows of a table are stored on disk in a data block. Above, you can see we have four rows in each data block.
Think of the data block as a suitcase you might take to the airport (without the $50 fee).

MOVING DATA BLOCKS IS LIKE CHECKING IN LUGGAGE

Please put your data block on the scale (inside memory)

To a computer, the data block on disk is as heavy as a large suitcase. It is difficult and cumbersome to lift.

FACTS THAT ARE DISTURBING

The data block above has nine rows and five columns. If someone requested to see Rob Rivers' salary, the entire
data block would still have to move into memory. Then, a salary of 50000 would be returned. That is a lot of
heavy lifting just to analyze one row and return one column. It is just like burning an entire candle just because
you need a flicker of light!

WHY COLUMNAR?
Each data block holds a single column. The row can be rebuilt because everything is aligned perfectly. If
someone runs a query that would return the average salary, then only one small data block is moved into
memory. The salary block moves into memory where it is processed as fast as lightning. We just cut down on
moving large blocks by 80%! Why columnar? Because like our Yiddish Proverb says, "All data is not kneaded on
every query, so that is why it costs so much dough."

ROW BASED BLOCKS VS. COLUMNAR BASED BLOCKS

Both designs have the same amount of data. Both take up just as much space. In this example, both have nine rows
and five columns. If a query needs to analyze all of the rows or return most of the columns, then the row-based
design is faster and more efficient. However, if the query only needs to analyze a few rows or merely a few
columns, then the columnar design is much lighter because not all of the data is moved into memory. Just one or
two columns move. Take the road less traveled.

AS ROW-BASED TABLES GET BIGGER, THE BLOCKS SPLIT

When you go on vacation for two weeks, you might pack a lot of clothes. That is when you take two suitcases. A
data block can only get so big before it is forced to split; otherwise, it might not fit into memory.

DATA BLOCKS ARE PROCESSED ONE AT A TIME PER UNIT


At the airport luggage counter, each bag needs to be weighed. You put bag one on first, and then after it is
processed, you put on bag two. That is how the processing of data blocks happens: one data block at a time.

COLUMNAR TABLES STORE EACH COLUMN IN SEPARATE BLOCKS

This is the same data you saw on the previous page! The difference is that the above is a columnar design. I have
color coded this for you. There are eight rows in the table and five columns. Notice that the entire row stays on the
same disk, but each column is a separate block. This is a brilliant design for ad hoc queries and analytics
because when only a few columns are needed, columnar can move just the columns it needs to. Columnar can't
be beat for queries because the blocks are so much smaller, and what isn't needed isn't moved.

VISUALIZE THE DATA ROWS VS. COLUMNS

Both examples above have the same data and the same amount of data. If your applications tend to analyze the
majority of columns or read the entire table, then a row-based system (top example) is the more efficient design.
Columnar tables are advantageous when only a few columns need to be read. This is just one of the reasons that
analytics goes with columnar like bread goes with butter. A row-based system must move the entire block into
memory even if it only needs to read one row or even a single column. If a user above needed to analyze the
Salary, the columnar system would move 80% less block mass.

THE ARCHITECTURE OF REDSHIFT


Be the change that you want to see in the world.

- Mahatma Gandhi

The leader node is the brains behind the entire operation. The user logs into the leader node, and for each SQL
query, the leader node will come up with a plan to retrieve the data. It passes that compiled plan to each compute
node, and each slice processes its portion of the data. If the data is spread evenly, parallel processing works
perfectly. This technology is relatively inexpensive. It might not "be the change", but it will help your company
"keep the change" because costs are low.

REDSHIFT HAS LINEAR SCALABILITY

"A Journey of a thousand miles begins with a single step."

- Lao Tzu

Redshift was born to be parallel. With each query, a single step is performed in parallel by each Slice. A Redshift
system consists of a series of slices that will work in parallel to store and process your data. This design allows
you to start small and grow infinitely. If your Redshift system provides you with an excellent Return On
Investment (ROI), then continue to invest by purchasing more nodes (adds additional slices). Most companies
start small, but after seeing what Redshift can do, they continue to grow their ROI from the single step of
implementing a Redshift system to millions of dollars in profits. Double your slices and double your
speed... forever. Redshift actually provides a journey of a thousand smiles!

DISTRIBUTION STYLES

KEY distribution - The rows are distributed according to the values in one column. The leader node
places matching values on the same node slice. If you distribute a pair of tables on the joining keys,
the leader node co-locates the rows on the slices according to the values in the joining columns.
Now, matching values from the common columns are physically stored together. This is extremely
important for table joins.

ALL distribution - A copy of the entire table is distributed to every node.

EVEN distribution - The rows are distributed across the slices in a round-robin fashion, regardless of
the values in any particular column. EVEN distribution is appropriate when a table does not
participate in joins or when there is not a clear choice between KEY distribution and ALL
distribution. EVEN distribution is the default distribution style.
Redshift gives you three great choices to distribute your tables. If you have two tables that are being joined
together a lot and they are about the same size, then you want to give them both the same distribution key as the
join key. This co-locates the matching rows on the same slice. Two rows being joined together must be on the
same slice (or Redshift will move one or both of the rows temporarily to satisfy the join requirement). If you join
two tables a lot, but one table is really big and the other is small, then you want to have the small table distributed
by ALL. Use your distribution key to ensure joins happen faster, but also use it to spread the data as evenly
among the slices as possible.
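As a quick sketch (the table and column names here are illustrative, not taken from the book's examples), the three styles might be declared like this:

```sql
CREATE TABLE Employee_Table
(Emp_No    INTEGER,
 Dept_No   INTEGER,
 Last_Name VARCHAR(20))
DISTSTYLE KEY DISTKEY (Dept_No);

CREATE TABLE Department_Table
(Dept_No   INTEGER,
 Dept_Name VARCHAR(20))
DISTSTYLE ALL;

CREATE TABLE Web_Log_Table
(Log_Date DATE,
 Log_Text VARCHAR(100))
DISTSTYLE EVEN;
```

Give two tables that join together often the same distribution key (the join column) so their matching rows land on the same slice.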

DISTRIBUTION KEY WHERE THE DATA IS UNIQUE

The entire row of a table is on a slice, but each column in the row is in a separate container (block). A Unique
Distribution Key spreads the rows of a table evenly across the slices. A good Distribution Key is the key to good
distribution!

ANOTHER WAY TO CREATE A TABLE

We have chosen the Emp_No column as both the distribution key and the sort key. We can control both!
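The slide is not reproduced here, but a table that controls both keys on Emp_No might be created like this (the column list is illustrative):

```sql
CREATE TABLE Employee_Table
(Emp_No     INTEGER,
 First_Name VARCHAR(20),
 Last_Name  VARCHAR(20),
 Salary     DECIMAL(10,2))
DISTKEY (Emp_No)
SORTKEY (Emp_No);
```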

DISTRIBUTION KEY WHERE THE DATA IS NON-UNIQUE

The data did not spread evenly among the slices for this table. Do you know why? The Distribution Key is
Dept_No. All like values went to the same slice. This distribution isn't perfect, but it is reasonable, so it is an
acceptable practice.
DISTRIBUTION KEY IS ALL

When ALL is selected as the distribution style, a copy of the entire table is distributed to every node.

EVEN DISTRIBUTION KEY

The data spread evenly among the slices for this table. Do you know why? With EVEN distribution, the rows are
delivered in a round-robin fashion, regardless of the values in any particular column, so each slice receives the
same number of rows in turn. Use EVEN when a table does not participate in joins or when there is no clear
choice between KEY and ALL.

MATCHING DISTRIBUTION KEYS FOR CO-LOCATION OF JOINS

Notice that both tables are distributed on Dept_No. When these two tables are joined WHERE Dept_No =
Dept_No, the rows with matching department numbers are on the same Slice. This is called Co-Location. This
makes joins efficient and fast.

BIG TABLE / SMALL TABLE JOINS


Notice that the Department_Table has only four rows. Those four rows are copied to every slice. This is
distributed by ALL. Now, the Department_Table can be joined to the Employee_Table with a guarantee that
matching rows are co-located. They are co-located because the smaller table has copied ALL of its rows to each
slice. When two joining tables have one large table (fact table) and one small table (dimension table), then use
the ALL keyword to distribute the smaller table.

FACT AND DIMENSION TABLE DISTRIBUTION KEY DESIGNS

The fact table (Line_Order_Fact_Table) is the largest table, but the Part_Table is the largest dimension table.
That is why you make Part_Key the distribution key for both tables. Now, when these two tables are joined
together, the matching Part_Key rows are on the same slice. You can then distribute by ALL on the other
dimension tables. Each of these tables will have all of their rows on each slice. Now, everything that joins to the fact
table is co-located!

IMPROVING PERFORMANCE BY DEFINING A SORT KEY

There are three basic reasons to use the sortkey keyword when creating a table. 1) If recent data is queried most
frequently, specify the timestamp or date column as the leading column for the sort key. 2) If you do frequent
range filtering or equality filtering on one column, specify that column as the sort key. 3) If you frequently join a
(dimension) table, specify the join column as the sort key. Above, you can see we have made our sortkey the
Order_Date column. Look how the data is sorted!
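A sketch of such a table, sorted on Order_Date (the column list is illustrative):

```sql
CREATE TABLE Order_Table
(Order_Number    INTEGER,
 Customer_Number INTEGER,
 Order_Date      DATE,
 Order_Total     DECIMAL(10,2))
DISTKEY (Customer_Number)
SORTKEY (Order_Date);
```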

SORT KEYS HELP GROUP BY, ORDER BY AND WINDOW FUNCTIONS


When data is sorted on a strategic column, it improves GROUP BY and ORDER BY operations, window
functions (PARTITION BY and ORDER BY operations), and can even optimize compression. But as new rows
are incrementally loaded, these new rows are sorted, yet they reside temporarily in a separate region on disk. In
order to maintain a fully sorted table, you need to run the VACUUM command at regular intervals. You will also
need to run ANALYZE.
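For example, after a nightly batch load you might run (the table name is illustrative):

```sql
VACUUM Order_Table;   -- re-sort rows and reclaim space
ANALYZE Order_Table;  -- refresh the planner's statistics
```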

EACH BLOCK COMES WITH METADATA

Amazon Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as
part of the metadata. If a range-restricted column is a sort key, the query processor is able to use the min and max
values to rapidly skip over large numbers of blocks during table scans. Where most databases use indexes to
determine where data is, Redshift uses the block's metadata to determine where data is NOT!

Our query above is looking for data WHERE Order_Total < 300. The metadata shows this block will contain
rows, and therefore it will be moved into memory for processing. Each slice has metadata for each of the blocks
they own.

HOW DATA MIGHT LOOK ON A SLICE

Redshift allocates 1 MB per block when a table begins loading. When a block is filled, another is allocated. I
want you to imagine that we created a table that had only one column, and that column was Order_Date. On
January 1st, data was loaded. Notice in the examples that as data is loaded, it continues to fill until the block
reaches 1 MB. The Order_Date is ordered (because as each day is loaded, it fills up the next slot). Then, notice
how the metadata has the min and max Order_Date. The metadata is designed to inform Redshift whether this
block should be read when this table is queried. If a query is looking for data in April, then there is no reason to
read block 1 because it falls outside of the min/max range.

QUESTION: HOW MANY BLOCKS MOVE INTO MEMORY?


SELECT *
FROM Orders
WHERE Order_Total < 250.00

Looking at the SQL and the metadata, how many blocks will need to be moved into memory?

ANSWER: HOW MANY BLOCKS MOVE INTO MEMORY?

SELECT *
FROM Orders
WHERE Order_Total < 250.00

Only one block moves into memory. The metadata shows that the min and max values for Order_Total only fall
into the range for the last slice. Only that slice moves its block into memory.

QUIZ: MASTER THAT QUERY WITH THE METADATA

Looking at the SQL and the metadata, how many blocks will need to be moved into memory for each query?

ANSWER TO QUIZ: MASTER THAT QUERY WITH THE METADATA


Above are your answers.

THE ANALYZE COMMAND COLLECTS STATISTICS

The ANALYZE command updates table statistics for use by the query planner. You can analyze all the tables in an
entire database, or you can analyze specific tables, including temporary tables. If you want to analyze a specific
table, you can, but you can name no more than one table_name in a single ANALYZE table_name statement. If
you do not specify a table_name, all of the tables in the currently connected database are analyzed, including the
persistent tables in the system catalog.
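The two forms look like this (the table name is illustrative):

```sql
ANALYZE;                 -- analyzes every table in the current database
ANALYZE Employee_Table;  -- analyzes one specific table
```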

REDSHIFT AUTOMATICALLY ANALYZES SOME CREATE STATEMENTS


Redshift automatically analyzes tables that you create with the following commands:
1. CREATE TABLE AS
2. CREATE TEMP TABLE AS
3. SELECT INTO
You do not need to run the ANALYZE command on these tables when they are first created. If you modify them
with additional inserts, updates, or deletes, you should analyze them in the same way as other tables.
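Sketches of the three automatically analyzed forms (the table names are illustrative):

```sql
CREATE TABLE Employee_Backup AS
SELECT * FROM Employee_Table;

CREATE TEMP TABLE Employee_Temp AS
SELECT * FROM Employee_Table;

SELECT *
INTO Employee_Copy
FROM Employee_Table;
```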

The above examples won't need the ANALYZE statement because it is done automatically, but if you modify
these tables, you will need to run the ANALYZE command.

WHAT IS A VACUUM?
Amazon Redshift doesn't automatically reclaim and reuse space that is freed when you delete or update rows.
These rows are logically deleted, but not physically deleted, until you run a vacuum. The vacuum will reclaim
the space. To perform an update, Amazon Redshift deletes the original row and appends the updated row, so
every update is effectively a delete followed by an insert. When you perform a delete, the rows are marked for
deletion but not removed.
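Redshift supports several forms of the command; VACUUM FULL (the default) both reclaims the deleted space and re-sorts the rows (the table name is illustrative):

```sql
VACUUM FULL Employee_Table;        -- reclaim space and re-sort
VACUUM DELETE ONLY Employee_Table; -- reclaim space without re-sorting
VACUUM SORT ONLY Employee_Table;   -- re-sort without reclaiming space
```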

WHEN IS A GOOD TIME TO VACUUM?


Run VACUUM during maintenance, batch windows, or time periods when you expect minimal activity on the
cluster.
A large unsorted region results in longer vacuum times. If you delay vacuuming, the vacuum will take longer
because more data has to be reorganized. Keep the vacuum regular enough to properly maintain the table.
VACUUM is an I/O intensive operation, so the longer it takes for your vacuum to complete, the more impact it
will have on concurrent queries and other database operations running on your cluster.

Time flies like an arrow. Fruit flies like a banana.

- Groucho Marx

A vacuum can be time consuming, and it is very I/O intensive. That is why the above advice is needed. Vacuum
wisely. You can run the VACUUM command to get rid of the logically deleted rows and re-sort the table 100%
perfectly. When about 10% of the table has changed over time, it is a good practice to run both the VACUUM and
ANALYZE commands. Like Groucho Marx has basically stated, "If data processing slows down and users get
groucho, hit your marks and make it fly after a vacuum."

THE VACUUM COMMAND GROOMS A TABLE

When tables are originally created and loaded, the rows are in perfect order, either naturally or because a sort key
was specified. As additional inserts, updates, and deletes are performed over time, two things happen. First, rows
that have been deleted or updated are removed only logically, so physically they are still there, taking up space.
Second, new rows that are inserted are stored on a different part of the disk, so the sort is no longer 100%
accurate.

DATABASE LIMITS
Amazon Redshift enforces these limits for databases.
1. Maximum of 60 user-defined databases per cluster.
2. Maximum of 127 characters for a database name.
3. Cannot be a reserved word.
CREATE DATABASE SQL_Class2 WITH OWNER TeraTom ;

Where there is no patrol car, there is no speed limit.


-Al Capone

The example above creates a database named SQL_Class2 and gives ownership to the user TeraTom. You can
only create a maximum of 60 user-defined databases per cluster, so get yours created before the mob!

CREATING A DATABASE
create database sql_class ;

The best way to predict the future is to create it.


- Sophia Bedford-Pierce

A Redshift cluster can have many databases. Above is the syntax to create a database. The database is named
sql_class. The data in a database can help you predict the future, and Redshift makes it so easy to create it. I think
Sophia Bedford-Pierce must be a DBA!

CREATING A USER
create user teratom
password 'TLc123123' ;

Password must:
be between 8 and 64 characters
have at least one uppercase letter
have at least one lowercase letter
have at least one number

To create a new user, you specify the name of the new user and a password. The password is required, and it
must be reasonably secure. It must have between 8 and 64 characters, and it must include at least one uppercase
letter, one lowercase letter, and one number.

DROPPING A USER
Drop user teratom;

All glory comes from daring to begin.


Anonymous

If you delete a database user account, the user will no longer be able to access any of the cluster databases. The
quote above is the opposite of the DBA credo which states, "All glory comes from daring to drop a user."

INSERTING INTO A TABLE


INSERT INTO Customer_Table
VALUES (121346543, 'Lawn Drivers', '555-1234') ;

The INSERT command inserts individual rows into a database table.

RENAMING A TABLE OR A COLUMN


ALTER TABLE Employee_Table
rename to Employee_Table_Backup ;
ALTER TABLE Student_Table
RENAME COLUMN Grade_Pt to Grade_Point;

The first command renames the Employee_Table to Employee_Table_Backup. The second example renames the
column Grade_Pt to Grade_Point.

ADDING AND DROPPING A COLUMN TO A TABLE


ALTER TABLE Employee_Table
ADD COLUMN Mgr int
default NULL;
ALTER TABLE Employee_Table
DROP COLUMN Mgr ;

In our first example we have added a new column called Mgr to the table Employee_Table. The second example
drops that column.
Chapter 2 - Best Practices For Table Design

Beware of the young doctor and the old barber.


- Benjamin Franklin

CONVERTING TABLE STRUCTURES TO REDSHIFT

Above, we are converting all of the tables in a Teradata database to Redshift table structures. We went to our
Teradata system and right clicked on the database SQL_Class and chose "Convert Table Structures". We selected
all of the tables and hit the blue arrow. We then chose to convert to Redshift. Watch in amazement what happens
next!

CONVERTING TABLE STRUCTURES TO REDSHIFT FINALE

All 20 Teradata tables have now been converted to Redshift. Just cut and paste to your Redshift system, and you
have converted the tables.

BEST PRACTICES FOR DESIGNING TABLES


1. Choose the best sort key

2. Choose a great distribution key


3. Consider defining primary key and foreign key constraints

4. Use the smallest possible column size

5. Use date/time data types for date columns

6. Specify redundant predicates on the sort column


I have found the best way to give advice to your children is to find out what they want and then advise them to
do it.

--Harry S. Truman

As you design your database, there are important decisions you must make that will heavily influence overall
query performance. These design choices also have a significant effect on how data is stored, which in turn
affects query performance by reducing the number of I/O operations and minimizing the memory required to
process certain queries. Harry S. Truman was right. "If you want your Redshift system to run brilliantly, take
advice from your users, and use best practices to deliver what they asked for".

CHOOSE THE BEST SORT KEY


When you give an Amazon Redshift table a sort key, it stores your data on disk in sorted order.

The sort order is used by the optimizer to determine optimal query plans.

If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key.

If you do range filtering or equality filtering on one column, specify that column as the sort key.

If you frequently join a table, specify the join column as both the sort key and the distribution key.
Data sorted correctly helps eliminate unneeded blocks. This is because Redshift has metadata on each block
showing column min and max values.

When you give an Amazon Redshift table a sort key, it stores your data on disk in sorted order. The sort order is
used by the optimizer to determine optimal query plans. If recent data is queried most frequently, specify the
timestamp column as the leading column for the sort key. If you do frequent range filtering or equality filtering
on one column, specify that column as the sort key. If you frequently join a table, specify the join column as both
the sort key and the distribution key. This enables the query optimizer to choose a sort merge join instead of a
slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase
of the sort merge join.

EACH BLOCK COMES WITH METADATA

Amazon Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as
part of the metadata. If a range-restricted column is a sort key, the query processor is able to use the min and max
values to rapidly skip over large numbers of blocks during table scans. Where most databases use indexes to
determine where data is, Redshift uses the block's metadata to determine where data is NOT!
Our query above is looking for data WHERE Order_Total < 300. The metadata shows this block will contain
rows, and therefore it will be moved into memory for processing. Each slice has metadata for each of the blocks
they own.

CREATING A SORT KEY

There are three basic reasons to use the sortkey keyword when creating a table. 1) If recent data is queried most
frequently, specify the timestamp or date column as the leading column for the sort key. 2) If you do frequent
range filtering or equality filtering on one column, specify that column as the sort key. 3) If you frequently join a
(dimension) table, specify the join column as the sort key. Above, you can see we have made our sortkey the
Order_Date column. Look how the data is sorted!

SORT KEYS HELP GROUP BY, ORDER BY AND WINDOW FUNCTIONS

When data is sorted on a strategic column, it improves GROUP BY and ORDER BY operations, window
functions (PARTITION BY and ORDER BY operations), and can even optimize compression. But as new rows
are incrementally loaded, these new rows are sorted, yet they reside temporarily in a separate region on disk. In
order to maintain a fully sorted table, you need to run the VACUUM command at regular intervals. You will also
need to run ANALYZE.

CHOOSE A GREAT DISTRIBUTION KEY


Good data distribution has two goals:

1. To distribute data evenly among the nodes and slices in a cluster.

2. To collocate data for joins and aggregations.

Uneven distribution, or data skew, forces some nodes to do more work than others, which slows down the entire
process. With parallel processing, a query is only as fast as the slowest node. Even distribution is a key concept
because each node processes the information it owns simultaneously with its node peers.
When rows that participate in joins or aggregations are located on different nodes, more data has to be moved
among nodes. This is because Amazon Redshift must ensure that two rows being joined are on the same node in
the same memory. If this is not the case, then Redshift will either copy the smaller table to all nodes temporarily
or redistribute one or both tables.

DISTRIBUTION KEY WHERE THE DATA IS UNIQUE

The entire row of a table is on a slice, but each column in the row is in a separate container (block). A Unique
Distribution Key spreads the rows of a table evenly across the slices. A good Distribution Key is the key to good
distribution!

MATCHING DISTRIBUTION KEYS FOR CO-LOCATION OF JOINS

Notice that both tables are distributed on Dept_No. When these two tables are joined WHERE Dept_No =
Dept_No, the rows with matching department numbers are on the same Slice. This is called Co-Location. This
makes joins efficient and fast.

BIG TABLE / SMALL TABLE JOINS

Notice that the Department_Table has only four rows. Those four rows are copied to every slice. This is
distributed by ALL. Now, the Department_Table can be joined to the Employee_Table with a guarantee that
matching rows are co-located. They are co-located because the smaller table has copied ALL of its rows to each
slice. When two joining tables have one large table (fact table) and the other table is small (dimension table),
then use the ALL keyword to distribute the smaller table.
DEFINE PRIMARY KEY AND FOREIGN KEY CONSTRAINTS
1. Define primary key and foreign key constraints between tables wherever appropriate.

2. Primary key and foreign key constraints are informational only.

3. Amazon Redshift does not enforce unique, primary key, and foreign key constraints.

4. The query planner uses these keys in certain statistical computations, to infer uniqueness and referential
relationships that affect subquery decorrelation techniques, to order large numbers of joins, and to eliminate
redundant joins.

5. Amazon Redshift does enforce NOT NULL column constraints.

Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints. Your application is
responsible for ensuring uniqueness and managing the DML operations. The query planner will use primary and
foreign keys in certain statistical computations, to infer uniqueness and referential relationships that affect
subquery decorrelation techniques, to order large numbers of joins, and to eliminate redundant joins. The planner
leverages these key relationships, but it assumes that all keys in Amazon Redshift tables are valid as loaded. If
your application allows invalid foreign keys or primary keys, some queries could return incorrect results. For
example, a SELECT DISTINCT query might return duplicate rows if the primary key is not unique. Do not
define key constraints for your tables if you doubt their validity. On the other hand, you should always declare
primary and foreign keys and uniqueness constraints when you know that they are valid.

PRIMARY KEY AND FOREIGN KEY EXAMPLES

The query planner uses referential integrity in certain situations: to infer uniqueness and referential
relationships that affect subquery techniques, to order large numbers of joins, and to eliminate redundant joins.

Amazon Redshift does not enforce primary key and foreign key constraints. The only reason to apply them is so
the query optimizer can generate a better query plan.
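The slide's examples are not reproduced here, but informational constraints might be declared like this (the table and column names are illustrative; Redshift records the constraints but does not enforce them, except NOT NULL):

```sql
CREATE TABLE Department_Table
(Dept_No   INTEGER NOT NULL PRIMARY KEY,
 Dept_Name VARCHAR(20));

CREATE TABLE Employee_Table
(Emp_No    INTEGER NOT NULL PRIMARY KEY,
 Dept_No   INTEGER REFERENCES Department_Table (Dept_No),
 Last_Name VARCHAR(20));
```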

USE THE SMALLEST COLUMN SIZE WHEN CREATING TABLES

You will improve query performance by reducing columns to the minimum possible size. Table size is not
impacted, but query processing will be if the column is being processed in a temporary table to gather
intermediate results.

Amazon Redshift compresses column data very effectively, so creating columns much larger than necessary has
minimal impact on the size of data tables. It is in the processing of queries that the size can hurt you. This is
because during processing for complex queries, intermediate query results might need to be stored in temporary
tables. Because temporary tables are not compressed, unnecessarily large columns consume excessive memory
and temporary disk space, which can affect query performance. Don't go overboard here! Don't get columns so
small that they can't contain your largest values!

USE DATE/TIME DATA TYPES FOR DATE COLUMNS

Use the DATE or TIMESTAMP data type rather than a character type when storing date/time information.
Amazon Redshift stores DATE and TIMESTAMP data more efficiently than CHAR or VARCHAR, which results
in better query performance. Let Amazon Redshift handle the DATE or TIMESTAMP conversions internally
instead of you trying to do so in your applications. Most of the time, users utilize CHAR or VARCHAR in the
ETL process of moving data. There is no need to do that because Redshift handles any conversions necessary.
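As a sketch, declare the column with a temporal type instead of a character type (the names are illustrative):

```sql
CREATE TABLE Order_Table
(Order_Number INTEGER,
 Order_Date   DATE,        -- rather than VARCHAR(10)
 Order_Stamp  TIMESTAMP);  -- rather than CHAR(26)
```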

SPECIFY REDUNDANT PREDICATES ON THE SORT COLUMN


We want to join Table1 with Table2. Both have a distribution key on
Customer_Number and both have a sort key on Order_Date.
SELECT T1.*, T2.Order_Number
FROM Table1 as T1
INNER JOIN
Table2 as T2
ON T1.Customer_Number = T2.Customer_Number
WHERE T1.Order_Date > '1/1/2014';

You should consider using a predicate on the leading sort column of the fact table, or the largest table, in a join. You can also add predicates to filter other tables that participate in the join, even when the predicates are redundant. These predicates go in WHERE or AND clauses. Because Redshift keeps the max and min value of each column per block, you can get better performance when you choose a good sortkey. Redshift always checks the min and max values to see whether a block should even be read, which allows it to skip blocks entirely. Adding a redundant AND clause on Table2's sort column means blocks of the second table can be skipped as well.
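A sketch of the redundant-predicate version of the join above (the extra AND clause simply repeats the filter on Table2's sort column):

```sql
SELECT T1.*, T2.Order_Number
FROM Table1 as T1
INNER JOIN
Table2 as T2
ON T1.Customer_Number = T2.Customer_Number
WHERE T1.Order_Date > '1/1/2014'
AND   T2.Order_Date > '1/1/2014';  -- redundant, but lets Redshift skip Table2 blocks
```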
SETTING THE STATEMENT_TIMEOUT TO ABORT LONG QUERIES

The statement_timeout setting is designed to abort any statement that runs longer than the number of milliseconds specified. If the system setting WLM timeout (max_execution_time) is also specified as part of a WLM configuration, the lower of statement_timeout and max_execution_time is used.
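A sketch of the setting in action (the 10-millisecond value is only for illustration):

```sql
-- Abort any statement in this session that runs longer than 10 milliseconds
set statement_timeout to 10;

SELECT * FROM Order_Table;   -- aborts if it runs longer than 10 ms

-- A value of 0 turns the timeout off
set statement_timeout to 0;
```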

Chapter 3 - System Tables

He who asks a question may be a fool for five minutes, but he who never asks a question remains a fool
forever.
-Unknown

AMAZON REDSHIFT SYSTEM TABLES


Amazon Redshift provides access to the following types of system tables:

STL tables for logging - These system tables are generated from Amazon Redshift log files to provide a history
of the system. Logging tables have an STL prefix.

STV tables for snapshot data - These tables are virtual system tables that contain snapshots of the current
system data. Snapshot tables have an STV prefix.

System views - System views contain a subset of data found in several of the STL and STV system tables.
System views have an SVV or SVL prefix.

System catalog tables - The system catalog tables store schema metadata, such as information about tables and
columns. System catalog tables have a PG prefix.

Every Redshift system automatically contains a number of system tables. These system tables contain
information about the installation and about the various queries and processes that are running on the system.
You can query these system tables to collect information about the Redshift database that is installed.

TROUBLESHOOTING CATALOG TABLE PG_TABLE_DEF


set search_path to '$user', 'public', 'sql_class';

The above query references the system catalog table named pg_table_def, which runs exclusively on the leader node. PG_TABLE_DEF will only return information for tables in schemas that are included in the search path. The first query failed because the 'employee_table' was not in the search_path. Above, we added sql_class to our path. The first query will work now because the schema sql_class has been placed in our search path, and that is where the employee_table resides.

SEEING THE SYSTEM TABLES IN YOUR NEXUS TREE

The Redshift catalog is in the pg_catalog schema. You can query these tables with SQL or merely do a "Quick Select" by right-clicking on any table in the tree. We just did a "Quick Select" on the pg_aggregate table.

CATALOG TABLE PG_TABLE_DEF

The above query references the system catalog table named pg_table_def, which runs exclusively on the leader node. PG_TABLE_DEF will only return information for tables in schemas that are included in the search path. The query we ran on the previous page failed because the 'employee_table' was not in the search_path. The schema that contains the employee_table is sql_class. Once we added sql_class to our search path, the query ran perfectly!

CHECKING TABLES FOR SKEW (POOR DISTRIBUTION)


SELECT TRIM(name) as Table_Name
,slice
,sum(num_values) as rows
from svv_diskusage
where name in ('Order_Table', 'Customer_Table')
and col =0
group by name, slice
order by name, slice;

Uneven distribution, or data distribution skew, forces some nodes or slices to do more work than others, which inhibits query performance. To check for distribution skew, you can query the SVV_DISKUSAGE system view.
Each row in the system table SVV_DISKUSAGE records the statistics for one disk block. The num_values
column gives the number of rows in that disk block, so when you sum(num_values), it returns the number of
rows on each slice.

CHECKING ALL STATEMENTS THAT USED THE ANALYZE COMMAND


SELECT xid
,to_char(starttime, 'HH24:MI:SS.MS') as starttime
,date_diff('sec', starttime, endtime) as secs
,substring(text, 1, 40) as ActualText
FROM svl_statementtext
WHERE sequence = 0
AND xid in (select xid from svl_statementtext s
            where s.text like 'padb_fetch_sample%')
order by xid desc, starttime;

The query above returns all the statements that ran in every completed transaction that included an ANALYZE
command.

CHECKING HOW MANY 1 MB BLOCKS OF DISK SPACE EACH TABLE USES


select P.name as "Table"
,count(*) as "1 MB blocks"
from stv_blocklist as B
INNER JOIN
stv_tbl_perm as P
ON B.tbl = P.id
AND B.slice = P.slice
WHERE P.name in ('Customer_Table', 'Order_Table')
GROUP BY P.name
ORDER BY 1 asc;
You can easily check on how many 1 MB blocks of disk space are used for each table by querying the
STV_BLOCKLIST table. This will give you measurements on table sizes.

CHECKING FOR DETAILS ABOUT THE LAST COPY OPERATION


SELECT query as Query
,TRIM(filename) as File
,curtime as Updated
from stl_load_commits
where query = pg_last_copy_id() ;

The above example returns details for the last COPY operation.

CHECKING WHEN A TABLE HAS LAST BEEN ANALYZED


SELECT query
,rtrim(querytxt)
,starttime
FROM stl_query
WHERE querytxt like 'padb_fetch_sample%'
AND querytxt like '%Sales_Table%'
ORDER BY 1 desc;

To find out when ANALYZE commands were run, you can query STL_QUERY. For example, to find out when
the Sales_Table was last analyzed, run the query above.

CHECKING FOR COLUMN INFORMATION ON A TABLE


SELECT Schemaname as "Schema"
,Tablename
,"column"
,Type
,Distkey
FROM pg_table_def
WHERE tablename = 'Department_Table';

The above example returns information for the Department_Table.

SYSTEM TABLES FOR TROUBLESHOOTING DATA LOADS

SELECT *
FROM ch_loadview
WHERE table_name='Employee_Table';

The example above is helpful in troubleshooting data load issues.

DETERMINING WHETHER A QUERY IS WRITING TO DISK

If IS_DISKBASED is true ("t") for any step in the SVL_QUERY_SUMMARY view, then that step wrote data to disk.
Chapter 4 - Compression

Speak in a moment of anger and you'll deliver the greatest speech you'll ever regret.
- Anonymous

COMPRESSION TYPES

The table above identifies the supported compression encodings and the data types that support the encoding.
Compression reduces the size of data when it is stored, and it is a column-level operation. Compression
conserves storage space and reduces the size of data that is read from storage, which will then reduce the amount
of disk I/O, thus improving query performance. By default, Amazon Redshift stores data in its raw,
uncompressed format, but you can apply a compression type, or encoding, to the columns in a table manually
(when the table is created). Or you can use the COPY command to analyze and apply compression automatically.
Either way, it is important to compress your data.

BYTE DICTIONARY COMPRESSION

Byte dictionary encoding utilizes a separate dictionary of unique values for each block of column values on disk.
Remember, each Amazon Redshift disk block occupies 1 MB. The dictionary contains up to 256 one-byte values
that are stored as indexes to the original data values. If more than 256 values are stored in a single block, the
extra values are written into the block in raw, uncompressed form. The process repeats for each disk block. This
encoding is very effective when a column contains a limited number of unique values, and it is especially optimal
when there are fewer than 256 unique values.
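The mechanics can be sketched in a few lines of Python (an illustration of the idea only, not Redshift's actual implementation):

```python
def byte_dict_encode(values, max_entries=256):
    """Sketch of byte-dictionary encoding for one block: the first 256
    unique values get a one-byte index; anything beyond that limit is
    stored raw (uncompressed)."""
    dictionary = []   # up to 256 unique values for this block
    encoded = []
    for v in values:
        if v in dictionary:
            encoded.append(dictionary.index(v))   # one-byte index
        elif len(dictionary) < max_entries:
            dictionary.append(v)
            encoded.append(len(dictionary) - 1)
        else:
            encoded.append(v)                     # raw, uncompressed
    return dictionary, encoded
```

For a column of US state codes, the 50 unique values fit easily in the dictionary, so every stored value shrinks to a single byte.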

DELTA ENCODING
Delta encodings are very useful for date and time columns. Delta encoding compresses data by recording the
difference between values that follow each other in the column. These differences are recorded in a separate
dictionary for each block of column values on disk. If the column contains 10 integers in sequence from 1 to 10,
the first will be stored as a 4-byte integer (plus a 1-byte flag), and the next 9 will each be stored as a byte with the
value 1, indicating that it is one greater than the previous value. Delta encoding comes in two variations. DELTA
records the differences as 1-byte values (8-bit integers), and DELTA32K records differences as 2-byte values
(16-bit integers).
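The 1-to-10 example in the paragraph above can be sketched in Python (an illustration of the idea only):

```python
def delta_encode(ints):
    """Sketch of DELTA encoding: keep the first value whole, then store
    each later value as its difference from the previous value."""
    if not ints:
        return []
    return [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]

def delta_decode(encoded):
    """Rebuild the original column by running the differences forward."""
    values = [encoded[0]]
    for diff in encoded[1:]:
        values.append(values[-1] + diff)
    return values
```

Encoding 1 through 10 yields the first value followed by nine 1s, exactly as described above; real DELTA stores those differences as 1-byte values (DELTA32K as 2-byte values).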

LZO ENCODING

Designed to work best with Char and Varchar data that store long character strings

Is a portable lossless data compression library written in ANSI C

Offers fast compression and extremely fast decompression

Includes slower compression levels achieving a quite competitive compression ratio while still decompressing at
this very high speed

Often implemented with a tool called LZOP

Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm that is focused on decompression speed.
LZO encoding provides a high compression ratio with good performance. LZO encoding is designed to work
well with character data. It is especially good for CHAR and VARCHAR columns that store very long character
strings, especially free-form text such as product descriptions, user comments, or JSON strings.

MOSTLY ENCODING
Mostly encodings are useful when the data type for a column is larger than the majority of the stored values
require. By specifying a mostly encoding for this type of column, you can compress the majority of the values in
the column to a smaller standard storage size. The remaining values that cannot be compressed are stored in their
raw form.
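A sketch of declaring a mostly encoding (the table and column names are hypothetical; MOSTLY8, MOSTLY16, and MOSTLY32 are the available variants):

```sql
CREATE TABLE Event_Log
( Event_ID   BIGINT
, Event_Code INTEGER ENCODE MOSTLY8   -- most values fit in one byte
);
```

Values of Event_Code that do not fit in one byte are stored in their raw form, as described above.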

RUNLENGTH ENCODING

Runlength encoding replaces a value that is repeated consecutively with a token that consists of the value and a
count of the number of consecutive occurrences (the length of the run). This is where the name Runlength comes
into play. A separate dictionary of unique values is created for each block of column values on disk. This
encoding is best suited to a table in which data values are often repeated consecutively, for example, when the
table is sorted by those values.
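The token idea can be sketched in Python (an illustration of the idea only):

```python
def runlength_encode(values):
    """Sketch of runlength encoding: collapse each run of consecutive
    repeats into a (value, run_length) token."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            # extend the current run
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            # start a new run of length 1
            runs.append((v, 1))
    return runs
```

A column the table is sorted on compresses dramatically this way: a million consecutive identical values collapse into a single token.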

TEXT255 AND TEXT32K ENCODINGS

Text255 and text32k encodings are useful for compressing VARCHAR columns only. Both compression
techniques work best when the same words recur often. A separate dictionary of unique words is created for each
block of column values on disk. Text255 has a dictionary that contains the first 245 unique words in the column.
Those words are replaced on disk by a one-byte index value representing one of the 245 values, and any words
that are not represented in the dictionary are stored uncompressed. This process is repeated for each block.

For the text32k encoding, the principle is the same, but the dictionary for each block does not capture a specific
number of words. Instead, the dictionary indexes each unique word it finds until the combined entries reach a
length of 32K, minus some overhead. The index values are stored in two bytes.

ANALYZE COMPRESSION
ANALYZE COMPRESSION
[ [ table_name ]
[ ( column_name [, . . .] ) ] ]
[COMPROWS numrows]
Table_Name - You can optionally specify a table_name to analyze a single table. If you do not specify a
table_name, all of the tables in the currently connected database are analyzed. You cannot specify more than one
table_name with a single ANALYZE COMPRESSION statement. You can also analyze compression for
temporary tables.

Column_Name - If you specify a table_name, you can also specify one or more columns in the table (as a
comma-separated list within parentheses).

COMPROWS - This is the number of rows to be used as the sample size for compression analysis. The analysis is
run on rows from each data slice. For example, if you specify COMPROWS 2000000 (2,000,000) and the system
contains 4 total slices, no more than 500,000 rows per slice are read and analyzed. If COMPROWS is not
specified, the sample size defaults to 100,000 per slice.

Numrows - Number of rows to be used as the compression sample size. The accepted range for numrows is a
number between 1000 and 1000000000 (1,000,000,000).

The ANALYZE COMPRESSION command performs compression analysis and produces a report with the suggested
column encoding schemes for the tables analyzed. ANALYZE COMPRESSION does not modify the column
encodings of the table but merely makes suggestions. To implement the suggestions, you must recreate the table,
or create a new table with the same schema. ANALYZE COMPRESSION does not consider Runlength encoding
on any column that is designated as a SORTKEY. This is because range-restricted scans might perform poorly
when SORTKEY columns are compressed much more highly than other columns. ANALYZE COMPRESSION
acquires an exclusive table lock, which prevents concurrent reads and writes against the table. Only run the
ANALYZE COMPRESSION command when the table is idle.
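A sketch of the command in use (the Sales_Table column names and the COMPROWS value are only for illustration):

```sql
-- Report suggested encodings for two columns, sampling 1,000,000 rows
ANALYZE COMPRESSION Sales_Table (Product_ID, Sale_Date) COMPROWS 1000000;
```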

COPY

The above example (two parts) gives the syntax for COPY from Amazon S3, COPY from Amazon EMR, and
COPY from a remote host (COPY from SSH).
Chapter 5 - Temporary Tables

They can conquer who believe they can.


- Rita Rudner

CREATE TABLE SYNTAX


CREATE [ [LOCAL ] { TEMPORARY | TEMP } ] TABLE table_name
( { column_name data_type [column_attributes] [ column_constraints ]
| table_constraints
| LIKE parent_table [ { INCLUDING | EXCLUDING } DEFAULTS ] }
[, ... ] )
[table_attribute]

where column_attributes are:


[ DEFAULT default_expr ]
[ IDENTITY ( seed, step ) ]
[ ENCODE encoding ]
[ DISTKEY ]
[ SORTKEY ]

and column_constraints are:


[ { NOT NULL | NULL } ]
[ { UNIQUE | PRIMARY KEY } ]
[ REFERENCES reftable [ ( refcolumn ) ] ]

and table_constraints are:


[ UNIQUE ( column_name [, ... ] ) ]
[ PRIMARY KEY ( column_name [, ... ] ) ]
[ FOREIGN KEY (column_name [, ... ] ) REFERENCES reftable [ ( refcolumn ) ] ]

and table_attributes are:


[ DISTSTYLE { EVEN | KEY | ALL } ]
[ DISTKEY ( column_name ) ]
[ SORTKEY ( column_name [, ...] ) ]

Creates a new table in the current database. The owner of this table is the issuer of the CREATE TABLE
command.

BASIC TEMPORARY TABLE EXAMPLES


When you create a temporary table, it is visible only within the current session. The table is automatically
dropped at the end of the session. Above, we use the pound sign (#) at the front of the table name to
automatically make the table a temporary table. We then populate the table with an Insert/Select.
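A minimal sketch of the technique described above (the column list is hypothetical):

```sql
-- The # prefix automatically makes this a temporary table
CREATE TABLE #Emp_Temp
( Employee_No INTEGER
, Salary      DECIMAL(10,2)
);

-- Populate it with an Insert/Select from the permanent table
INSERT INTO #Emp_Temp
SELECT Employee_No, Salary
FROM Employee_Table;
```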

MORE ADVANCED TEMPORARY TABLE EXAMPLES

When you create a temporary table, it is visible only within the current session. The table is automatically
dropped at the end of the session. A derived table only lasts the life of a single query, but a temporary table lasts
the entire session. This allows a user to run hundreds of queries against the temporary table. A temporary table
can have the same name as a permanent table, but I don't recommend this. You don't give a temporary table a
schema because it is automatically associated with the user's session. Once the session is over, the table and data
are dropped. If the user tries to query the table in another session, the system won't recognize the table. In other
words, the table doesn't exist outside of the session in which it was created.

ADVANCED TEMPORARY TABLE EXAMPLES

When you create a temporary table, it is visible only within the current session. The table is automatically
dropped at the end of the session. Above are some examples that allow you to define a different distkey, diststyle
and sortkey. Users (by default) are granted permission to create temporary tables by their automatic membership
in the PUBLIC group. To remove the privilege for any users to create temporary tables, revoke the TEMP
permission from the PUBLIC group and then explicitly grant the permission to create temporary tables to
specific users or groups of users.
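Two sketches of the points above (the table, database, and group names are hypothetical):

```sql
-- A temporary table with its own distkey and sortkey
CREATE TEMP TABLE Emp_Work
DISTKEY (Dept_No)
SORTKEY (Last_Name)
AS
SELECT * FROM Employee_Table;

-- Remove the default TEMP privilege, then grant it back selectively
REVOKE TEMP ON DATABASE sql_class FROM PUBLIC;
GRANT TEMP ON DATABASE sql_class TO GROUP etl_users;
```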
TABLE LIMITS AND CTAS
The maximum number of permanent tables is 9,900.
The maximum number of characters for a table name is 127.
The maximum number of columns you can define in a single table is 1,600.
CREATE TABLE
Student_Table_Backup
AS
SELECT *
FROM Student_Table;
To have everything is to possess nothing.
--Buddha

The resulting table inherits the distribution and sort key from the Student_Table (STUDENT_ID). Buddha might
have been wrong here. "To have everything is to possess 9,900 permanent tables."

PERFORMING A DEEP COPY


A deep copy recreates and repopulates a table by using a bulk insert, which automatically sorts the table. If a table
has a large unsorted region, a deep copy is much faster than a vacuum. You can choose one of four methods to
create a copy of the original table:
1) Use the original table DDL. This is the best method for perfect reproduction.

2) Use CREATE TABLE AS (CTAS). If the original DDL is not available, you can use CREATE TABLE AS to
create a copy of the current table, then rename the copy. The new table will not inherit the encoding, distkey, sortkey,
not null, primary key, and foreign key attributes of the parent table.

3) Use CREATE TABLE LIKE. If the original DDL is not available, you can use CREATE TABLE LIKE to
recreate the original table. The new table will not inherit the primary key and foreign key attributes of the parent
table. The new table does, though, inherit the encoding, distkey, sortkey, and not null attributes of the parent
table.

4) Create a temporary table and truncate the original table. If you need to retain the primary key and foreign key
attributes of the parent table, you can use CTAS to create a temporary table, then truncate the original table and
populate it from the temporary table. This method is slower than CREATE TABLE LIKE because it requires two
insert statements.

A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table. If a table
has a large unsorted region, a deep copy is much faster than a vacuum. The difference is that you cannot make
concurrent updates during a deep copy operation which you can do during a vacuum. The next four slides will
show each technique with an example.

DEEP COPY USING THE ORIGINAL DDL


1) Use the original table DDL. This is the best method for perfect reproduction.
1. Create a copy of the table using the original CREATE TABLE DDL.

2. Use an INSERT INTO ... SELECT statement to populate the copy with data from the original table.
3. Drop the original table.

4. Use an ALTER TABLE statement to rename the copy to the original table name.

A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table.

DEEP COPY USING A CTAS


2) Use CREATE TABLE AS (CTAS). If the original DDL is not available, you can use CREATE TABLE AS to
create a copy of the current table, then rename the copy. The new table will not inherit the encoding, distkey, sortkey,
not null, primary key, and foreign key attributes of the parent table.
1. Create a copy of the original table by using CREATE TABLE AS to select the rows from the original table.

2. Drop the original table.

3. Use an ALTER TABLE statement to rename the new table to the original table.
The following example performs a deep copy on the Sales_Table using a duplicate of the Sales_Table
named Sales_Table_Copy.

CREATE TABLE Sales_Table_Copy as (select * from Sales_Table) ;

DROP TABLE Sales_Table ;

ALTER TABLE Sales_Table_Copy rename to Sales_Table ;

A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table.

DEEP COPY USING A CREATE TABLE LIKE


3) Use CREATE TABLE LIKE. If the original DDL is not available, you can use CREATE TABLE LIKE to
recreate the original table. The new table will not inherit the primary key and foreign key attributes of the parent
table. The new table does, though, inherit the encoding, distkey, sortkey, and not null attributes of the parent table.
1. Create a new table using CREATE TABLE LIKE.
2. Use an INSERT INTO ... SELECT statement to copy the rows from the current table to the new table.
3. Drop the current table.
4. Use an ALTER TABLE statement to rename the new table to the original table.
The following example performs a deep copy on the Sales_Table using a duplicate of the Sales_Table
named Sales_Table_Copy.
CREATE TABLE Sales_Table_Copy (like Sales_Table);
INSERT INTO Sales_Table_Copy (select * from Sales_Table);
DROP TABLE Sales_Table;
ALTER TABLE Sales_Table_Copy RENAME to Sales_Table;
A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table.

DEEP COPY BY CREATING A TEMP TABLE AND TRUNCATING


4) Create a temporary table and truncate the original table. If you need to retain the primary key and foreign key
attributes of the parent table, you can use CTAS to create a temporary table, then truncate the original table and
populate it from the temporary table. This method is slower than CREATE TABLE LIKE because it requires two
insert statements.
1. Use CREATE TABLE AS to create a temporary table with the rows from the original table.

2. Truncate the current table.

3. Use an INSERT INTO ... SELECT statement to copy the rows from the temporary table to the original table.

4. Drop the temporary table.


The following example performs a deep copy on the Sales_Table using a duplicate of the Sales_Table
named Sales_Table_Copy.
CREATE TEMP TABLE Sales_Table_Copy as (select * from Sales_Table);
TRUNCATE Sales_Table;
INSERT INTO Sales_Table (select * from Sales_Table_Copy);
DROP TABLE Sales_Table_Copy;

A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table.

CREATING A DERIVED TABLE


Exists only within a query
Materialized by a SELECT statement inside a query
Space comes from the user's spool space
Deleted when the query ends

The SELECT statement that creates and populates the derived table is always inside parentheses.

THE THREE COMPONENTS OF A DERIVED TABLE


A derived table will always have a SELECT query to materialize the derived table with data. The
SELECT query always starts with an open parenthesis and ends with a close parenthesis.

The derived table must be given a name. Above, we called our derived table TeraTom.

You will need to define (alias) the columns in the derived table. Above, we allowed Dept_No to default
to Dept_No, but we had to specifically alias AVG(Salary) as AVGSAL.

Every derived table must have the three components listed above.
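The three components might look like this (a sketch built on the Employee_Table used throughout):

```sql
SELECT Dept_No, AVGSAL
FROM
  ( SELECT Dept_No, AVG(Salary) AS AVGSAL  -- (3) alias the aggregate
    FROM Employee_Table
    GROUP BY Dept_No
  )                                        -- (1) the SELECT in parentheses
  AS TeraTom ;                             -- (2) the derived table's name
```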

NAMING THE DERIVED TABLE

In the example above, TeraTom is the name we gave the derived table. It is mandatory that you always name the
table, or you will get an error.

ALIASING THE COLUMN NAMES IN THE DERIVED TABLE

AVGSAL is the name we gave to the column in our Derived Table that we call TeraTom. Our SELECT (which
builds the columns) shows we are only going to have one column in our derived table and we have named that
column AVGSAL.

VISUALIZE THIS DERIVED TABLE


Our example above shows the data in the derived table named TeraTom. This query allows us to see each
employee and the plus or minus avg of their salary compared to the other workers in their department.

MOST DERIVED TABLES ARE USED TO JOIN TO OTHER TABLES

The first five columns in the Answer Set came from the Employee_Table. AVGSAL came from the derived table
named TeraTom.

MULTIPLE WAYS TO ALIAS THE COLUMNS IN A DERIVED TABLE

OUR JOIN EXAMPLE WITH A DIFFERENT COLUMN ALIASING STYLE


COLUMN ALIASING CAN DEFAULT FOR NORMAL COLUMNS

TeraTom

Dept_No    AVGSAL
?          32800.50
10         64300.00
100        48850.00
200        44944.44
300        40200.00
400        48333.33

(The derived table is built first.)

In a derived table, you will always have a SELECT query in parenthesis, and you will always name the table.
You have options when aliasing the columns. As in the example above, you can let normal columns default to
their current name.

CREATING A DERIVED TABLE USING THE WITH COMMAND


When using the WITH command, we can CREATE our derived table before running the main query. The only
catch is that a query can contain only one WITH keyword, though a single WITH can define multiple derived
tables separated by commas.
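A minimal sketch of the WITH form (using the same TeraTom derived table as the earlier examples):

```sql
WITH TeraTom AS
( SELECT Dept_No, AVG(Salary) AS AVGSAL
  FROM Employee_Table
  GROUP BY Dept_No
)
SELECT *
FROM TeraTom ;
```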

OUR JOIN EXAMPLE WITH THE WITH SYNTAX

Now, the lower portion of the query refers to TeraTom almost like it is a permanent table, but it is not!

WITH STATEMENT THAT USES A SELECT *

The following example shows the simplest possible case of a query that contains a WITH clause. The WITH
query named TeraTom selects all of the rows from the Student_Table. The main query, in turn, selects all of the
rows from TeraTom. The TeraTom table exists only for the life of the query.
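A sketch of that simplest case:

```sql
WITH TeraTom AS
( SELECT * FROM Student_Table )
SELECT *
FROM TeraTom ;
```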

A WITH CLAUSE THAT PRODUCES TWO TABLES


with Budget_Derived as
(SELECT Max(Budget) as Max_Budget
FROM Department_Table),
Emp_Derived as
(SELECT Dept_No, AVG(Salary) as Avg_Sal
FROM Employee_Table
GROUP BY Dept_No)
select E.*, Max_Budget - Budget as Under_Max, Avg_Sal
FROM Employee_Table as E
INNER JOIN
Emp_Derived
On E.Dept_No = Emp_Derived.Dept_No
INNER JOIN Department_Table as D
ON E.Dept_No = D.Dept_No;

The most important thing a father can do for his children is to love their mother.
-Anonymous

The example above shows two tables created from the WITH statement. "Sometimes the most important thing
a WITH clause can do is to have multiple children."

THE SAME DERIVED QUERY SHOWN THREE DIFFERENT WAYS

QUIZ - ANSWER THE QUESTIONS


SELECT Dept_No, First_Name, Last_Name, AVGSAL
FROM Employee_Table
INNER JOIN
(SELECT Dept_No, AVG(Salary)
FROM Employee_Table
GROUP BY Dept_No) as TeraTom(Depty, AVGSAL)
ON Dept_No = Depty ;
1) What is the name of the derived table? __________

2) How many columns are in the derived table? _______

3) What is the name of the derived table columns? ______

4) Is there more than one row in the derived table? _______

5) What common keys join the Employee and Derived? _______

6) Why were the join keys named differently? ______________

ANSWER TO QUIZ - ANSWER THE QUESTIONS


SELECT Dept_No, First_Name, Last_Name, AVGSAL
FROM Employee_Table
INNER JOIN
(SELECT Dept_No, AVG(Salary)
FROM Employee_Table
GROUP BY Dept_No) as TeraTom(Depty, AVGSAL)
ON Dept_No = Depty ;
1) What is the name of the derived table? TeraTom

2) How many columns are in the derived table? 2

3) What's the name of the derived columns? Depty and AVGSAL

4) Is there more than one row in the derived table? Yes

5) What keys join the tables? Dept_No and Depty

6) Why were the join keys named differently? If both were named Dept_No, we would error unless we fully
qualified.

CLEVER TRICKS ON ALIASING COLUMNS IN A DERIVED TABLE

A DERIVED TABLE LIVES ONLY FOR THE LIFETIME OF A SINGLE QUERY

AN EXAMPLE OF TWO DERIVED TABLES IN A SINGLE QUERY


WITH T (Dept_No, AVGSAL) AS
(SELECT Dept_No, AVG(Salary)
 FROM Employee_Table
 GROUP BY Dept_No)
SELECT T.Dept_No, First_Name, Last_Name, AVGSAL, Counter
FROM Employee_Table as E
INNER JOIN
T
ON E.Dept_No = T.Dept_No
INNER JOIN
(SELECT Employee_No, SUM(1) OVER(PARTITION BY Dept_No
   ORDER BY Dept_No, Last_Name ROWS UNBOUNDED PRECEDING)
 FROM Employee_Table) as S (Employee_No, Counter)
ON E.Employee_No = S.Employee_No
ORDER BY T.Dept_No;

CONNECTING TO REDSHIFT VIA NEXUS



Chapter 6 - Explain

Fall seven times, stand up eight.


- Japanese Proverb

THREE WAYS TO RUN AN EXPLAIN

When you run an EXPLAIN, you are seeing the plan passed to the slices by the optimizer on the Leader Node.
There are three ways to see an EXPLAIN. You can merely type the word EXPLAIN in front of any SQL. You
can also hit F6 (Function Key 6), or you can click on the magnifying glass in Nexus. The EXPLAIN shows the
plan, but does NOT run the actual query. Once you see the costs of the EXPLAIN, you can decide whether or not
to run the query.
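For example, putting EXPLAIN in front of a query returns the plan, with its cost, row, and width estimates, without executing the query (a sketch using the Employee_Table from earlier chapters):

```sql
EXPLAIN
SELECT Dept_No, AVG(Salary)
FROM Employee_Table
GROUP BY Dept_No ;
```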

EXPLAIN STEPS, SEGMENTS AND STREAMS


Step - Each individual step is an individual operation in the explain plan. Steps can even be combined to allow
compute nodes to perform a query, join, subquery, or other type of database operation.

Segment - Segments are a number of steps that can be done by a single process. A segment is a single
compilation unit executable by compute nodes. Each segment begins with a scan or reading of table data and
ends either with a materialization step or some other network activity.

Stream - A collection of segments that always begins with a scan or reading of some data set and ends with a
materialization or blocking step. Materialization or blocking steps can include HASH, AGG, SORT, and SAVE.

Last Segment - The term last segment means the query returns the data. If the return set is aggregated or sorted,
the compute nodes each send their piece of the intermediate result to the leader node, which then merges the data
so the final result can be sent back to the requesting client.

EXPLAIN TERMS FOR SCANS AND JOINS


Sequential Scan - Also termed a scan. This means that Amazon Redshift will scan the entire table sequentially
from beginning to end. Deem this a Full Table Scan. This means that Redshift also evaluates query constraints
for every row (Filter) if specified with a WHERE clause. This can also be utilized to run INSERT, UPDATE, and
DELETE statements.

Merge Join - Also termed an mjoin. This is commonly used for inner joins and outer joins (for join tables that
are both distributed and sorted on the joining columns), and this is generally the fastest Amazon Redshift join
algorithm.

Hash Join - Also termed an hjoin. This is also used for inner joins and left and right outer joins, and it is
typically faster than a nested loop join. Hash Join reads the outer table, then hashes the joining column and finds
matches in the inner hash table.

Nested Loop Join - Also termed an nloop. This is the least optimal join, and it is mainly used for cross-joins,
product joins, and Cartesian product joins. It is also used for joins without a join condition and for inequality joins.

EXPLAIN TERMS FOR AGGREGATION AND SORTS


Aggregate - Also termed an aggr. This is an operator/step for scalar aggregate functions.

HashAggregate - Also termed an aggr. This is an operator/step for any grouped aggregate functions. This has
the ability to operate from disk by virtue of the hash table spilling to disk.

GroupAggregate - Also termed an aggr. This is an operator that is sometimes chosen for grouped aggregate
queries. This is only done if the Amazon Redshift force_hash_grouping configuration setting is off.

Sort - Also termed a sort. The ORDER BY clause controls this sort. It can also support other operations such
as UNIONs and joins. It can also operate from disk.

Merge - Also termed a merge. This produces the final sorted results of a query based on intermediate sorted
results derived from operations performed in parallel.

EXPLAIN TERMS FOR SET OPERATORS AND MISCELLANEOUS TERMS


SetOp Except - Also termed an hjoin. This is only used for EXCEPT queries.

Hash Intersect - Also termed an hjoin. This is used for INTERSECT queries.

Append - Also termed save. This is the append used with a Subquery Scan to implement UNION and UNION
ALL queries.

Limit - Also termed limit. This term evaluates the LIMIT clause.

Materialize - Also termed save. This term means to materialize rows for input to nested loop joins and some
merge joins. This can operate from disk.

Unique - Also termed unique. This term applies to SELECT DISTINCT queries and to UNION queries, which
must eliminate duplicates. This can operate from disk.

Window - Also termed window. This term means to compute aggregate and ranking window functions. This
can operate from disk.
EXPLAIN TERMS FOR NETWORK OPERATIONS AND DML
Network (Broadcast) - Also termed a bcast. This is a broadcast that is considered an attribute of the Join
Explain operators and steps.

Network (Distribute) - Also termed a dist. This is used to distribute rows to compute nodes for parallel
processing by the data warehouse cluster.

Network (Send to Leader) - Also termed return. This sends results back to the leader for further processing.

INSERT (Using Results) - Also termed insert. This inserts data.

Delete (Scan and Filter) - Also termed delete. This term means to delete data. This operation can operate from
disk.

Update (Scan and Filter) - Also termed delete, insert. This term means an update is implemented as a delete
followed by an insert.

EXPLAIN EXAMPLE AND THE COST

Costs are cumulative as you read up the plan, so the HashAggregate cost number in this example (0.09) (red
arrow) consists mostly of the Seq Scan cost below it (0.06) (blue arrow). So, the first scan (blue arrow) was 0.06
and the final cost is 0.09, which means the final step was a 0.03 cost. Add up 0.06 (blue arrow) and 0.03 (red
arrow) and you get 0.09!
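As a sketch (the table and numbers below are illustrative, not the ones from the book's figure), a plan of this shape might look like:

```sql
EXPLAIN
SELECT class_code, COUNT(*)
FROM   student_table
GROUP BY class_code;

-- XN HashAggregate  (cost=0.09..0.09 rows=6 width=7)
--   ->  XN Seq Scan on student_table  (cost=0.00..0.06 rows=10 width=7)
```

The aggregate's 0.09 includes the scan's 0.06 below it; the aggregate step itself contributes only the 0.03 difference.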

EXPLAIN EXAMPLE AND THE ROWS

The keyword rows in the EXPLAIN means the expected number of rows to return. In this example, the scan is
expected to return 6 rows. The HashAggregate operator is expected to return 6 rows (red arrow).

EXPLAIN EXAMPLE AND THE WIDTH


The keyword width in the EXPLAIN means the estimated width of the average row, in bytes. It is important to
ANALYZE the table so statistics are collected. With current statistics, the accuracy of the width and row-count
estimates improves dramatically.

SIMPLE EXPLAIN EXAMPLE AND THE COSTS

Above, you can see a simple query with a simple explain plan. This is designed to show how the cost works.

EXPLAIN JOIN EXAMPLE USING DS_BCAST_INNER

A Broadcast (BCAST) means to duplicate the table in its entirety across all nodes. We are duplicating the
Department_Table on all nodes.

EXPLAIN JOIN EXAMPLE USING DS_DIST_NONE


Take a look at the above query and the explain plan. We are joining three tables. The Student_Table and the
Student_Course_Table are joined first. They both have a Distribution Key of Student_ID. Since they join on this
column, there is no need to redistribute the data thus the DS_DIST_NONE keyword. Matching rows are already
on the same node.

EXPLAIN SHOWING DS_DIST_NONE VISUALLY

The Student_Table and the Student_Course_Table are joined first on Student_ID. Both tables have a Distribution
Key of Student_ID. Since they join on Student_ID, the matching rows are already on the same slice. So, there is
no need to redistribute the data thus the DS_DIST_NONE keyword. Then, the Course_Table is broadcast to all
slices (DS_BCAST_INNER) where it can be joined to the results of the first two table join. Data movement
happens in joins because the joining of two rows has to happen in the same slice and memory.
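A hedged sketch of the join described above (the column names for the Course_Table are assumptions; the book's figure shows the actual SQL):

```sql
-- Assumes Student_Table and Student_Course_Table share DISTKEY (Student_ID)
SELECT S.First_Name, SC.Course_ID, C.Course_Name
FROM   Student_Table S
JOIN   Student_Course_Table SC
  ON   S.Student_ID = SC.Student_ID    -- DS_DIST_NONE: matching rows co-located
JOIN   Course_Table C
  ON   SC.Course_ID = C.Course_ID ;    -- DS_BCAST_INNER: Course_Table broadcast
```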

EXPLAIN WITH A WARNING

The example above might become a problem. The costs look high and the EXPLAIN plan is shouting out to
review the join predicates to avoid Cartesian product joins.

EXPLAIN FOR ORDERED ANALYTICS SUCH AS CSUM


Let's examine this EXPLAIN plan from the bottom up. The first thing done (bottom) is a sequential scan of the
sales_table. The cost for reading the first row is 0. The cost for reading all rows is 0.21. Then, the data is passed
to slice 0. The keyword Network means to send intermediate results to the leader node for further processing. On
the leader node, the data is sorted by Product_Id and Sale_Date. Then, the actual Cumulative Sum, Moving Sum,
and Moving Avg are calculated. Notice the cost of the first row and the last row are both large. They expect 21
rows to be returned with an average width per row of 21 bytes. This is a good approach for analyzing an
EXPLAIN plan.
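The query being explained is of roughly this shape (column names are assumed from the description; the figure shows the exact SQL):

```sql
SELECT Product_ID, Sale_Date, Daily_Sales,
       SUM(Daily_Sales) OVER (PARTITION BY Product_ID
                              ORDER BY Sale_Date
                              ROWS UNBOUNDED PRECEDING) AS Cum_Sum
FROM   Sales_Table ;
```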

EXPLAIN FOR SCALAR AGGREGATE FUNCTIONS

XN Aggregate (cost=0.11..0.11 rows=1 width=12)
  -> XN Seq Scan on employee_table (cost=0.00..0.09 rows=9 width=12)

The keyword Aggregate in the EXPLAIN is used for aggregation scalar functions. A scalar function means that
only one row and one column are returned in the answer set for an aggregation function. Notice that the above
query produces a scalar result. The AVG(Salary) in the Employee_Table is $46782.15. That result is only one
column and one row! It is scalar!

EXPLAIN FOR HASHAGGREGATE FUNCTIONS

XN HashAggregate (cost=0.18..0.22 rows=5 width=14)
  -> XN Seq Scan on employee_table (cost=0.00..0.09 rows=9 width=14)

The keyword HashAggregate in the EXPLAIN is used for unsorted grouped aggregate functions. Notice there is
no sort!
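A grouped query with no ORDER BY, such as this hedged sketch, is what produces a HashAggregate:

```sql
SELECT Dept_No, AVG(Salary)
FROM   Employee_Table
GROUP BY Dept_No ;   -- grouped but unsorted, so HashAggregate (no Sort step)
```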

EXPLAIN USING LIMIT, MERGE AND SORT

The keyword Limit is used to evaluate the LIMIT clause. The keyword Sort is used to evaluate the ORDER BY
clause. The keyword Merge is used when producing the final sorted results, which are derived from intermediate
sorted results that each slice processed in parallel. Remember, each slice must perform its own work. Then, the
data is sorted on each slice and passed to the leader node where a Merge operation is performed.

EXPLAIN USING A WHERE CLAUSE FILTER

XN Seq Scan on student_table (cost=0.00..0.12 rows=4 width=53)


Filter: (class_code = 'FR'::bpchar)

The keyword Filter is used to evaluate the WHERE clause. In the above example, we are filtering the returning
rows by only looking for Freshmen, who have a class_code of 'FR'. Our EXPLAIN (in yellow) shows the
keyword Filter looking for 'FR'.

EXPLAIN USING THE KEYWORD DISTINCT


The keyword Unique in the EXPLAIN is used to evaluate the Distinct clause. In the above example, we do a
sequential scan of the entire student_table. Then, you see the keyword UNIQUE in the EXPLAIN plan. This
ensures that no duplicate values will be returned. The data is then sorted on each slice and sent to the leader node
for a final merge among all slices.

EXPLAIN FOR SUBQUERIES

A subquery involves at least two queries: a top query and a bottom query. In the above example, the bottom query
runs first on the Department_Table. The result set consists of the column Dept_No. The Employee_Table (top
query) is scanned next. Then, the results of both the Department_Table and the Employee_Table scans are hashed
by Dept_No. This places all matches on the same slice. The rows can then be joined using a Hash Join in memory.
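A sketch of the kind of subquery described (the figure shows the actual SQL):

```sql
SELECT Last_Name, Dept_No
FROM   Employee_Table
WHERE  Dept_No IN (SELECT Dept_No
                   FROM   Department_Table) ;
```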

Chapter 7 Basic SQL Functions

When I was 14, I thought my parents were the stupidest people in the world. When I was 21, I was amazed at
how much they learned in seven years.
- Mark Twain

FINDING THE CURRENT SCHEMA ON THE LEADER NODE

The CURRENT_SCHEMA function is a leader-node only function. In this example, the query does not reference
a table, so it runs exclusively on the leader node in order to show the schema. Our Current_Schema is the public
schema.

GETTING THINGS SET UP IN YOUR SEARCH PATH


set search_path to '$user', 'public', 'sql_class', 'sql_views' ;

We ran two queries above. The first query showed us our search_path, and it contained $user and public. Since
we will be querying the schemas sql_class and sql_views in our example labs, we need to place them in our
search_path. The second example (in yellow) has done this through the "Set search_path" command.

FIVE DETAILS YOU NEED TO KNOW ABOUT THE SEARCH_PATH

The current session's temporary-table schema, pg_temp_nnn, is searched (if it exists). It can be explicitly listed in
the path by using the alias pg_temp. If not listed in the path, it will be searched first (even before pg_catalog).
Remember, the temporary schema is only searched for tables and view names. It is not searched for any function
names.

Above are the five things you need to know about how the Search_Path works.

INTRODUCTION

The Student_Table above will be used in our early SQL Examples

This is a pictorial of the Student_Table which we will use to present some basic examples of SQL and get some
hands-on experience with querying this table. This book attempts to show you the table, show you the query, and
show you the result set.

SELECT * (ALL COLUMNS) IN A TABLE


Almost every SQL statement will consist of a SELECT and a FROM. You SELECT the columns you want to see
on your report, and an asterisk (*) means you want to see all columns of the table in the returning answer set!

SELECT SPECIFIC COLUMNS IN A TABLE


SELECT First_Name
,Last_Name
,Class_Code
,Grade_Pt
FROM Student_Table ;

This is a great way to show exactly which columns you are selecting from the table.

COMMAS IN THE FRONT OR BACK?

Why is the example on the left better even though they are functionally equivalent? Errors are easier to spot and
comments won't cause errors.
PLACE YOUR COMMAS IN FRONT FOR BETTER DEBUGGING
CAPABILITIES

"A life filled with love may have some thorns,


but a life empty of love will have no roses."

Anonymous

Having commas in front to separate column names makes it easier to debug. Remember our quote above. "A
query filled with commas at the end just might fill you with thorns, but a query filled with commas in the front
will allow you to always come up smelling like roses."
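For example, with leading commas a column can be commented out without breaking the statement:

```sql
SELECT First_Name
      ,Last_Name
--    ,Class_Code        -- safely commented out; no dangling comma left behind
      ,Grade_Pt
FROM   Student_Table ;
```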

SORT THE DATA WITH THE ORDER BY KEYWORD

Rows typically come back to the report in random order. To order the result set, you must use an ORDER BY.
When you order by a column, it will order in ASCENDING order. This is called the Major Sort!

ORDER BY DEFAULTS TO ASCENDING


Rows typically come back to the report in random order, but we decided to use the ORDER BY statement. Now,
the data comes back ordered by Last_Name.

USE THE NAME OR THE NUMBER IN YOUR ORDER BY STATEMENT

The ORDER BY can use a number to represent the sort column. The number 2 represents the second column on
the report.

TWO EXAMPLES OF ORDER BY USING DIFFERENT TECHNIQUES

Notice that the answer set is sorted in ascending order based on the column Grade_Pt. Also notice that Grade_Pt
is the fifth column coming back on the report. That is why the SQL in both statements is ordering by Grade_Pt.
Did you notice where the null value came back? In Redshift, nulls sort last in ascending order and first in
descending order.

CHANGING THE ORDER BY TO DESCENDING ORDER

Notice that the answer set is sorted in descending order based on the column Last_Name. Also, notice that
Last_Name is the second column coming back on the report. We could have done an Order By 2. If you spell out
the word DESCENDING the query will fail, so you must remember to just use DESC.
NULL VALUES SORT LAST IN ASCENDING MODE (DEFAULT)

Did you notice that the null value came back last? In Redshift, nulls sort last in ascending order and first in
descending order.

NULL VALUES SORT FIRST IN DESCENDING MODE (DESC)

You can ORDER BY in descending order by putting a DESC after the column name or its corresponding number.
In Redshift, null values will sort first in DESC order.

MAJOR SORT VS. MINOR SORTS

Major sort is the first sort. There can only be one major sort. A minor sort kicks in if there are Major Sort ties.
There can be zero or more minor sorts.

MULTIPLE SORT KEYS USING NAMES VS. NUMBERS


In the example above, the Dept_No is the major sort and we have two minor sorts. The minor sorts are on the
Salary and the Last_Name columns. Both Queries above have an equivalent Order by statement and sort exactly
the same.

SORTS ARE ALPHABETICAL, NOT LOGICAL


SELECT * FROM Student_Table
ORDER BY Class_Code ;

This sorts alphabetically. Can you change the sort so the Freshmen come first, followed by the Sophomores,
Juniors, Seniors and then the null?

Can you change the query to Order BY Class_Code logically (FR, SO, JR, SR, ?)?

USING A CASE STATEMENT TO SORT LOGICALLY

This is the way the pros do it.
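One way to write it (a sketch; the book's figure shows its own version):

```sql
SELECT *
FROM   Student_Table
ORDER BY CASE Class_Code
           WHEN 'FR' THEN 1
           WHEN 'SO' THEN 2
           WHEN 'JR' THEN 3
           WHEN 'SR' THEN 4
           ELSE 5             -- nulls and anything else sort last
         END ;
```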

HOW TO ALIAS A COLUMN NAME


ALIAS Rules!
1) AS is optional
2) Use Double Quotes when Spaces are in the Alias name
3) Use Double Quotes when the Alias is a reserved word

When you ALIAS a column, you give it a new name for the report header. You should always reference the
column using the ALIAS everywhere else in the query. You never need Double Quotes in SQL unless you are
Aliasing.
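The three rules can be illustrated like this (the alias names here are made up for illustration):

```sql
SELECT Grade_Pt   AS GPA             -- AS is optional
      ,Last_Name  "Student Name"     -- spaces need double quotes
      ,Class_Code AS "Order"         -- reserved words need double quotes
FROM   Student_Table ;
```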

A MISSING COMMA CAN BY MISTAKE BECOME AN ALIAS

Column names must be separated by commas. Notice in this example, there is a comma missing between
Class_Code and Grade_Pt. The result is that only three columns appear on your report, with one of them aliased
incorrectly (Class_Code comes back renamed as Grade_Pt).

COMMENTS USING DOUBLE DASHES ARE SINGLE LINE COMMENTS

Double dashes make a single line comment that will be ignored by the system.

COMMENTS FOR MULTI-LINES


Slash Asterisk starts a multi-line comment and Asterisk Slash ends the comment.

COMMENTS FOR MULTI-LINES AS DOUBLE DASHES PER LINE

Double dashes in front of both lines comment both lines out, and they're ignored.
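Both comment styles together, as a quick sketch:

```sql
-- single-line comment: everything after the dashes is ignored
SELECT First_Name
/* multi-line comment
   spanning two lines */
FROM   Student_Table ;
```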

A GREAT TECHNIQUE FOR COMMENTS TO LOOK FOR SQL ERRORS

The query on the left had an error because the keyword Sum is reserved. We can test whether this is the problem
by commenting out that line in our SQL (example on the right). Now, our query works, so we know the problem is
on the line that we commented out. Once we put double quotes around the alias ("Sum"), it works. Use comments
to help you debug.

Chapter 8 - The WHERE Clause

I saw the angel in the marble and carved until I set him free.
- Michelangelo

USING LIMIT TO BRING BACK A SAMPLE


LIMIT { ALL | <integer-number> }

The following is an example using LIMIT:


SELECT *
FROM Employee_table
LIMIT 3 ;

Redshift offers a capability in its SQL to limit the number of rows returned from the table's data. It is the
LIMIT clause, and it is normally added at the end of a valid SELECT statement, as in the example and syntax
above. A LIMIT clause caps how many rows are returned, but the real filtering of rows comes from the
WHERE clause.

USING LIMIT WITH AN ORDER BY STATEMENT


The brilliance of the example above is that we have sorted the data using an ORDER BY statement. Since we are
sorting by Salary DESC, and we have a limit of 5 rows, this will bring back the top 5 salaried employees.

THE WHERE CLAUSE LIMITS RETURNING ROWS

The WHERE clause here filters how many rows are coming back. In this example, I am asking for the report
to return only rows WHERE the first name is Henry.

USING A COLUMN ALIAS THROUGHOUT THE SQL

When you ALIAS a column, you give it a new name for the report header, but a good rule of thumb is to refer to
the column by the alias throughout the query.

DOUBLE QUOTED ALIASES ARE FOR RESERVED WORDS AND SPACES


Write a wise saying and your name will live forever.

-Anonymous

When you ALIAS a column, you give it a new name for the report header, but a good rule of thumb is to refer to
the column by the alias throughout the query. Whoever wrote the above quote was way off. "Write a wise alias
and it will live until the query ends. Bummer!"

CHARACTER DATA NEEDS SINGLE QUOTES IN THE WHERE CLAUSE

In the WHERE clause, if you search for character data such as first name, you need single quotes around it. You
don't single-quote integers.

CHARACTER DATA NEEDS SINGLE QUOTES, BUT NUMBERS DON'T

Character data (letters) needs single quotes, but you need NO single quotes for integers (numbers). Remember,
you never use double quotes except for aliasing.

NULL MEANS UNKNOWN DATA SO EQUAL (=) WON'T WORK


The first thing you need to know about a NULL is that it is unknown data. It is NOT a zero; it is missing data. Since
we don't know what is in a NULL, you can't use an = sign. You must use IS NULL or IS NOT NULL.

USE IS NULL OR IS NOT NULL WHEN DEALING WITH NULLS


SELECT *
FROM Student_Table
WHERE Class_Code IS NULL ;

If you are looking for a row that holds NULL value, you need to put IS NULL. This will only bring back the
rows with a NULL value in it.

NULL IS UNKNOWN DATA SO NOT EQUAL WON'T WORK

The same goes for = NOT NULL. We can't compare a NULL with any equal sign. We can only deal with
NULL values using IS NULL and IS NOT NULL.

USE IS NULL OR IS NOT NULL WHEN DEALING WITH NULLS


SELECT *
FROM Student_Table
WHERE Class_Code IS NOT NULL ;

Much like before, when you want to bring back the rows that do not have NULLs in them, you put an IS NOT
NULL in the WHERE Clause.
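A side-by-side sketch of the wrong and right way:

```sql
SELECT * FROM Student_Table
WHERE  Class_Code = NULL ;     -- returns no rows: comparing to NULL is unknown

SELECT * FROM Student_Table
WHERE  Class_Code IS NULL ;    -- correct: returns the rows holding a NULL
```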

USING GREATER THAN OR EQUAL TO (>=)

All rows returned have a Grade_Pt >= 3.0

The WHERE clause doesn't just deal with equals. You can look for things that are GREATER THAN or LESS
THAN, along with asking for things that are GREATER/LESS THAN OR EQUAL to a value.

AND IN THE WHERE CLAUSE

Notice the WHERE statement and the word AND. In this example, qualifying rows must have a Class_Code =
FR and also must have a First_Name of Henry. Notice how the WHERE and the AND clause are on their own
line. Good practice!

TROUBLESHOOTING AND

What is going wrong here? You are using an AND to check the same column. What you are basically asking with
this syntax is to see the rows that have BOTH a Grade_Pt of 3.0 and a 4.0. That is impossible, so no rows will be
returned.

OR IN THE WHERE CLAUSE

Notice above in the WHERE clause we use OR. OR allows a row to qualify and return if either condition is
TRUE.

TROUBLESHOOTING OR

Notice above in the WHERE clause we use OR. OR allows a row to qualify and return if either condition is
TRUE. The first example errors and is a common mistake. The second example is perfect.

TROUBLESHOOTING CHARACTER DATA


This query errors! What is WRONG with this syntax? No Single quotes around SR.

USING DIFFERENT COLUMNS IN AN AND STATEMENT

Notice that AND separates two different columns, and the data will come back if both are TRUE.

QUIZ HOW MANY ROWS WILL RETURN?

SELECT * FROM Student_Table


WHERE Grade_Pt = 4.0 OR Grade_Pt = 3.0
AND Class_Code = 'SR' ;

Which Seniors have a 3.0 or a 4.0 Grade_Pt average? How many rows will return?

A) 2 C) Error

B) 1 D) 3

ANSWER TO QUIZ HOW MANY ROWS WILL RETURN?

We had two rows return! Isn't that a mystery? Why?

WHAT IS THE ORDER OF PRECEDENCE?

()

NOT

AND

OR

SELECT *
FROM Student_Table
WHERE Grade_Pt = 4.0 OR Grade_Pt = 3.0
AND Class_Code = 'SR' ;

Syntax has an ORDER OF PRECEDENCE. It will read anything with parentheses around it first. Then, it will
read all the NOT statements, then the AND statements, and FINALLY the OR statements. This is why the last
query came out odd. Let's fix it and bring back the right answer set.

USING PARENTHESES TO CHANGE THE ORDER OF PRECEDENCE


This is the proper way of looking for rows that have a Grade_Pt of either 3.0 or 4.0 AND also have a
Class_Code of SR. Only ONE row comes back. Parentheses are evaluated first, so this allows you to direct
exactly what you want evaluated first.
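The corrected quiz query, as a sketch:

```sql
SELECT *
FROM   Student_Table
WHERE  (Grade_Pt = 4.0 OR Grade_Pt = 3.0)   -- parentheses evaluated first
AND    Class_Code = 'SR' ;
```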

USING AN IN LIST IN PLACE OF OR

Using an IN list is a great way of looking for rows that have a Grade_Pt of either 3.0 or 4.0 AND also have a
Class_Code of SR. Only ONE row comes back.

THE IN LIST IS AN EXCELLENT TECHNIQUE

The IN Statement avoids retyping the same column name separated by an OR. The IN allows you to search the
same column for a list of values. Both queries above are equal, but the IN list is a nice way to keep things easy
and organized.

IN LIST VS. OR BRINGS THE SAME RESULTS


The IN Statement avoids retyping the same column name separated by an OR. The IN allows you to search the
same column for a list of values. Both queries above are equal, but the IN list is a nice way to keep things easy
and organized.

USING A NOT IN LIST


SELECT *
FROM Student_Table
WHERE Grade_Pt NOT IN (2.0, 3.0, 4.0) ;

First you imitate, then you innovate.


- Miles Davis

You can also ask to see the results that ARE NOT IN your parameter list. That requires the column name and a
NOT IN. Neither the IN nor NOT IN can search for NULLs! Miles Davis got this IT quote all wrong. First you
innovate, and then you sue anyone who imitates. Please make a note of it!

A TECHNIQUE FOR HANDLING NULLS WITH A NOT IN LIST

This is a great technique to look for a NULL when using a NOT IN List.
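One common way to write this (an assumption on my part; the book's figure shows its exact technique):

```sql
SELECT *
FROM   Student_Table
WHERE  Grade_Pt NOT IN (2.0, 3.0, 4.0)
   OR  Grade_Pt IS NULL ;          -- explicitly bring back the NULL row too
```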

ANOTHER TECHNIQUE FOR HANDLING NULLS WITH A NOT IN LIST

This is a great technique to eliminate any NULL values when using a NOT IN List.

BETWEEN IS INCLUSIVE

SELECT *
FROM Student_Table
WHERE Grade_Pt BETWEEN 2.0 AND 4.0 ;

This is a BETWEEN. It allows you to see if a column falls within a range. It is inclusive, meaning that in our
example, we will also be getting the rows that have exactly 2.0 or 4.0 in the column!

NOT BETWEEN IS ALSO INCLUSIVE

"The difference between genius and stupidity is that genius has its limits."

Albert Einstein

This is a NOT BETWEEN example. It allows you to see if a column does not fall within a range. It is also
inclusive, meaning that in our example, no rows where the Grade_Pt is between 2.0 and 4.0 will return, and rows
with exactly 2.0 or 4.0 will also not return.

LIKE COMMAND UNDERSCORE IS WILDCARD FOR ONE CHARACTER

The _ underscore sign is a wildcard for any single character. We are looking for anyone who has an 'a' as the
second letter of their last name.
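As a sketch:

```sql
SELECT *
FROM   Student_Table
WHERE  Last_Name LIKE '_a%' ;   -- any first character, 'a' as the second
```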

LIKE COMMAND WORKS DIFFERENTLY ON CHAR VS VARCHAR


It is important that you know the data type of the column you are using with your LIKE command. VARCHAR
and CHAR data differ slightly.

THE ILIKE COMMAND IS NOT CASE SENSITIVE

With Redshift, the ilike command is NOT case sensitive, but the like command is case sensitive. These rows
came back because they have an 'AR' in positions 2 and 3 of their last_name. The 'AR' are not really capitalized,
but that is why you use the ilike command. It doesn't care about case!

TROUBLESHOOTING LIKE COMMAND ON CHARACTER DATA

This is a CHAR(20) data type. That means that any value under 20 characters will be padded with spaces at the
end until it reaches 20 characters. You will not get any rows back from this example because, technically, no row
ends in an 'n'; each instead ends in a space.

INTRODUCING THE TRIM COMMAND


This is a CHAR(20) data type. That means that every Last_Name is going to be 20 characters long. Most names
are not really 20 characters long, so spaces are padded at the end to ensure filling up all 20 characters. We need to
do the TRIM command to remove the leading and trailing spaces. Once the spaces are trimmed, we can find out
whose name ends in 'n'.
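A sketch of the TRIM technique described:

```sql
SELECT *
FROM   Student_Table
WHERE  TRIM(Last_Name) LIKE '%n' ;   -- pad spaces removed before the LIKE test
```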

QUIZ WHAT DATA IS LEFT JUSTIFIED AND WHAT IS RIGHT?


SELECT *
FROM Sample_Table
WHERE Column1 IS NULL
AND Column2 IS NULL ;

Which Column from the Answer Set could have a DATA TYPE of INTEGER, and which could have Character
Data?

NUMBERS ARE RIGHT JUSTIFIED AND CHARACTER DATA IS LEFT


SELECT *
FROM Sample_Table
WHERE Column1 IS NULL
AND Column2 IS NULL ;

All Integers will start from the right and move left. Thus, Col1 was defined during the table create statement to
hold an INTEGER. The next page shows a clear example.

ANSWER WHAT DATA IS LEFT JUSTIFIED AND WHAT IS RIGHT?


SELECT Employee_No, First_Name
FROM Employee_Table
WHERE Employee_No = 2000000;
All Integers will start from the right and move left. All Character data will start from the left and move to the
right.

AN EXAMPLE OF DATA WITH LEFT AND RIGHT JUSTIFICATION


SELECT Student_ID, Last_Name
FROM Student_Table ;

This is how a standard result set will look. Notice that the integer data in Student_ID starts from the right and
goes left. Character data in Last_Name moves left to right, like we are used to seeing when reading English.

A VISUAL OF CHARACTER DATA VS. VARCHAR DATA

Character data pads spaces to the right, and Varchar uses a 2-byte variable-length indicator (VLI) instead.

USE THE TRIM COMMAND TO REMOVE SPACES ON CHAR DATA


By using the TRIM command on the Last_Name column, you are able to trim off any spaces from the end. Once
we use the TRIM on Last_Name, we have eliminated any spaces at the end, so now we are set to bring back
anyone with a Last_Name that truly ends in n!

LIKE AND YOUR ESCAPE CHARACTER OF CHOICE

Sometimes you want to use the LIKE command, but you also want to search for the values of percent (%) or
Underscore (_). You can turn off these wildcards by using an escape character. The following example uses the
escape character @ to search for strings that include "_" just after the word "start". The @ sign just in front of the
underscore (_) means that the underscore is no longer a wildcard, but an actual literal underscore.

LIKE AND THE DEFAULT ESCAPE CHARACTER

Sometimes you want to use the LIKE command, but you also want to search for the values of percent (%) or
Underscore (_). You can turn off these wildcards by using an escape character. The following example uses the
default escape characters \\ to search for strings that include underscore "_" just after the word "start". The \\ just
in front of the underscore means that the underscore is no longer a wildcard, but a literal underscore.
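A sketch of both escape styles (Sample_Table and its Description column are hypothetical names for illustration):

```sql
SELECT * FROM Sample_Table
WHERE  Description LIKE '%start@_%' ESCAPE '@' ;  -- '@' makes '_' a literal

SELECT * FROM Sample_Table
WHERE  Description LIKE '%start\\_%' ;            -- default escape: backslash
```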

POSIX PATTERN MATCHING OPERATORS

. Matches any single character.

* Matches zero or more occurrences.

+ Matches one or more occurrences.

? Matches zero or one occurrence.

| Specifies alternative matches; for example, E | H means E or H.

^ Matches the beginning-of-line character.

$ Matches the end-of-line character.

[] Brackets specify a matching list that must match one expression in the list. A
caret (^) precedes a nonmatching list, which matches any character except for
the expressions represented in the list.

() Parentheses group items into a single logical item.

{m} Repeat the previous item exactly m times.

{m,} Repeat the previous item m or more times.

{m,n} Repeat the previous item at least m and not more than n times.

[: :] Matches any character within a POSIX character class. In the following character
classes, Amazon Redshift supports only ASCII characters: [:alnum:],
[:alpha:], [:lower:], [:upper:]

POSIX pattern matching supports the above metacharacters

SIMILAR TO OPERATORS

% Matches any sequence of zero or more characters.

_ Matches any single character.

| Denotes alternation (either of two alternatives).

* Repeat the previous item zero or more times.

+ Repeat the previous item one or more times.

? Repeat the previous item zero or one time.

{m} Repeat the previous item exactly m times.

{m,} Repeat the previous item m or more times.

{m,n} Repeat the previous item at least m and not more than n times.

() Parentheses group items into a single logical item.

[. . .] A bracket expression specifies a character class, just as in


POSIX regular expressions.

SELECT First_Name
,Last_Name
FROM Employee_Table
WHERE First_Name similar to '%e%|%h%'
ORDER BY First_Name;

The following example finds all employees with a First_Name that contains an "e" or an "h". Regular expression
matching using SIMILAR TO is computationally expensive. We recommend using LIKE whenever possible,
especially when processing a very large number of rows. A query that uses LIKE will often execute several times
faster than a functionally identical query that uses a regular expression. The next page shows the answer set.

SIMILAR TO EXAMPLE WITH LOWER CASE LETTERS

The example above finds all employees with a First_Name that contains an "e" or an "h". Herbert did not qualify
on the 'h' (his 'H' is capitalized, and the pattern is lowercase), but he returned because he does have an 'e' in his
First_Name.

SIMILAR TO EXAMPLE WITH LOWER AND UPPER CASE LETTERS

The example above finds all employees with a First_Name that contains an "i" or a capital "H". Notice that
"John" is no longer in the answer set (like he was in the previous example). John has an "h" in it, but not a capital
"H".

SIMILAR TO EXAMPLE WITH MULTIPLE OCCURRENCES


The example above finds all employees with a First_Name that contains two l's. Both Billy and William contain
two l's back to back. Notice that the name William also has the letter 'i' in it twice, but if I had changed the query
to look for 'i' instead of 'l', then William would not have come back. The occurrences must follow consecutively.
I will show the same query on the next page, but using 'i' instead of 'l'.
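The query described is of roughly this shape (a sketch):

```sql
SELECT First_Name
FROM   Employee_Table
WHERE  First_Name SIMILAR TO '%l{2}%' ;   -- two consecutive l's
```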

MULTIPLE OCCURRENCES MUST BE CONSECUTIVE

The name William has the letter 'i' in it twice, but no rows came back. This is because the occurrences must
follow each other consecutively. There needs to be a name with two occurrences of the letter 'i' back to back!

Chapter 9 - Distinct Vs Group By AND TOP

A bird does not sing because it has the answers, it sings because it has a song.
- Anonymous

THE DISTINCT COMMAND

DISTINCT eliminates duplicates from returning in the Answer Set.

DISTINCT VS. GROUP BY


Both examples produce the exact same result:

Class_Code
FR
JR
SO
SR
?

Rules for Distinct Vs. GROUP BY

(1) Many Duplicates use GROUP BY

(2) Few Duplicates use DISTINCT

(3) Space Exceeded use GROUP BY

Distinct and GROUP BY in the two examples return the same answer set.
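The two equivalent forms, as a sketch:

```sql
SELECT DISTINCT Class_Code
FROM   Student_Table ;

SELECT Class_Code
FROM   Student_Table
GROUP BY Class_Code ;
```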

QUIZ HOW MANY ROWS COME BACK FROM THE DISTINCT?

How many rows will come back from the above SQL?

ANSWER HOW MANY ROWS COME BACK FROM THE DISTINCT?


SELECT Distinct Class_Code, Grade_Pt
FROM Student_Table
ORDER BY Class_Code, Grade_Pt ;

How many rows will come back from the above SQL? 10. All rows came back. Why? Because there are no exact
duplicates that contain a duplicate Class_Code and Duplicate Grade_Pt combined. Each row in the SELECT list
is distinct.

TOP COMMAND

In the above example, we brought back 3 rows only. This is because of the TOP 3 statement which means to get
an answer set, and then bring back the first 3 rows in that answer set. Because this example does not have an
ORDER BY statement, you can consider this example as merely bringing back 3 random rows.

TOP COMMAND IS BRILLIANT WHEN ORDER BY IS USED!

In the above example, we brought back 3 rows only. This is because of the TOP 3 statement which means to get
an answer set, and then bring back the first 3 rows. Because this example uses an ORDER BY statement, the data
brought back is from the top 3 students with the highest Grade_Pt. This is the real power of the TOP command.
Use it with an ORDER BY!
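A sketch of TOP with an ORDER BY:

```sql
SELECT TOP 3 *
FROM   Student_Table
ORDER BY Grade_Pt DESC ;   -- the 3 highest grade point averages
```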

WHAT IS THE DIFFERENCE BETWEEN TOP AND LIMIT?

Both queries above bring back the top 3 students with the highest grade_pt. The TOP command is designed to
bring back the top n rows. The LIMIT clause is used more often if you merely want to see a quick sample, but
both techniques will work with an ORDER BY statement and both can utilize an ORDER BY statement in the
creation of a view.
Chapter 10 - Aggregation

Redshift climbed Aggregate Mountain and delivered a better way to Sum It.
Tera-Tom Coffing

QUIZ YOU CALCULATE THE ANSWER SET IN YOUR OWN MIND


Aggregation_Table

Employee_No Salary

423400 100000.00

423401 100000.00

423402 NULL

SELECT AVG(Salary) as "AVG"
      ,Count(Salary) as SalCnt
      ,Count(*) as RowCnt
FROM Aggregation_Table ;

What would the result set be from the above query? The next slide shows answers!

ANSWER YOU CALCULATE THE ANSWER SET IN YOUR OWN MIND

SELECT AVG(Salary) as "AVG"
      ,Count(Salary) as SalCnt
      ,Count(*) as RowCnt
FROM Aggregation_Table ;

Here are your answers! The AVG is 100000.00 (the NULL salary is ignored), SalCnt is 2 (COUNT(Salary) skips
the NULL), and RowCnt is 3 (COUNT(*) counts every row).

THE 3 RULES OF AGGREGATION


1) Aggregates Ignore Null Values.

2) Aggregates WANT to come back in one row.

3) You CAN'T mix Aggregates with normal columns unless you use a GROUP BY.

THERE ARE FIVE AGGREGATES


There are FIVE AGGREGATES which are the following:

MIN The Minimum Value.


MAX The Maximum Value.
AVG The Average of the Column Values.
SUM The Sum Total of the Column Values.
COUNT The Count of the Column Values.

SELECT MIN (Salary)
,MAX (Salary)
,SUM (Salary)
,AVG (Salary)
,Count(*)
FROM Employee_Table ;

Don't count the days, make the days count.

-Muhammad Ali

The five aggregates are listed above. Muhammad Ali was way off in his quote. He meant to say, "Don't you count
the days, make the data count for you".

QUIZ HOW MANY ROWS COME BACK?

How many rows will the above query produce in the result set?

ANSWER HOW MANY ROWS COME BACK?


How many rows will the above query produce in the result set? The answer is one.

TROUBLESHOOTING AGGREGATES

If you have a normal column (non aggregate) in your query, you must have a corresponding GROUP BY
statement.

GROUP BY WHEN AGGREGATES AND NORMAL COLUMNS MIX

If you have a normal column (non aggregate) in your query, you must have a corresponding GROUP BY
statement.

GROUP BY DELIVERS ONE ROW PER GROUP


The GROUP BY Dept_No command allows the aggregates to be calculated per Dept_No. The data has also been
sorted with the ORDER BY statement.
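A sketch of the grouped aggregation described:

```sql
SELECT   Dept_No, AVG(Salary) AS AvgSal, COUNT(*) AS Cnt
FROM     Employee_Table
GROUP BY Dept_No
ORDER BY Dept_No ;
```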

GROUP BY DEPT_NO OR GROUP BY 1 THE SAME THING

Both queries above produce the same result. The GROUP BY allows you to either name the column or use the
number in the SELECT list just like the ORDER BY.

LIMITING ROWS AND IMPROVING PERFORMANCE WITH WHERE

Will Dept_No 300 be calculated? Of course you know it will . . . NOT!

WHERE CLAUSE IN AGGREGATION LIMITS UNNEEDED CALCULATIONS

The system eliminates reading any other Dept_Nos other than 200 and 400. This means that only Dept_Nos of
200 and 400 will come off the disk to be calculated.

KEYWORD HAVING TESTS AGGREGATES AFTER THEY ARE TOTALED


Previous Answer Set

The HAVING clause only works on aggregate totals. The WHERE clause filters rows before the calculation, but
the HAVING clause filters the aggregate totals after the calculations, thus eliminating certain aggregate totals.

KEYWORD HAVING IS LIKE AN EXTRA WHERE CLAUSE FOR TOTALS

New Answer Set using the HAVING Statement

The HAVING Clause only works on Aggregate Totals, and in the above example, only Count(*) > 2 can return.
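A sketch of the pattern described above, assuming the sample Employee_Table:

SELECT Dept_No, COUNT(*)
FROM Employee_Table
GROUP BY Dept_No
HAVING COUNT(*) > 2 ;

The WHERE-less GROUP BY counts every department, and the HAVING then throws away any total of 2 or less.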

Chapter 11 - Join Functions

When spider webs unite they can tie up a lion.


- African Proverb

A TWO-TABLE JOIN USING TRADITIONAL SYNTAX


A Join combines columns on the report from more than one table. The example above joins the Customer_Table
and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN
CONDITION is which Column from each table is a match. In this case, Customer_Number is a match that
establishes the relationship, so this join will happen on matching Customer_Number columns.
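The slide's query is not reproduced in the text; in traditional syntax, using the matching column named in the description, it would look something like:

SELECT Customer_Name, Order_Date, Order_Total
FROM Customer_Table, Order_Table
WHERE Customer_Table.Customer_Number = Order_Table.Customer_Number ;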

A TWO-TABLE JOIN USING NON-ANSI SYNTAX WITH TABLE ALIAS

A Join combines columns on the report from more than one table. The example above joins the Customer_Table
and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN
CONDITION means which Column from each table is a match. In this case, Customer_Number is a match that
establishes the relationship.

YOU CAN FULLY QUALIFY ALL COLUMNS

Whenever a column is in both tables, you must fully qualify it when doing a join. You don't have to fully qualify
columns that appear in only one of the tables because the system knows which table that particular column is in.
You can choose to fully qualify every column if you like. This is a good practice because it makes it more
apparent which columns belong to which tables for anyone else looking at your SQL.

A TWO-TABLE JOIN USING ANSI SYNTAX

This is the same join as the previous slide except it is using ANSI syntax. Both will return the same rows with the
same performance. Rows are joined when the Customer_Number matches on both tables, but non-matches won't
return.
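The ANSI form of the join, using the matching column named in the description, looks like this:

SELECT Customer_Name, Order_Date, Order_Total
FROM Customer_Table as C
INNER JOIN
Order_Table as O
ON C.Customer_Number = O.Customer_Number ;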
BOTH QUERIES HAVE THE SAME RESULTS AND PERFORMANCE

Both of these syntax techniques bring back the same result set and have the same performance. The INNER
JOIN is considered ANSI. Which one does Outer Joins?

QUIZ CAN YOU FINISH THE JOIN SYNTAX?

SELECT First_Name, Last_Name,


Department_Name
FROM Employee_Table as E
INNER JOIN
Department_Table as D
ON
Finish the Join

Finish this join by placing the missing SQL in the proper place!

ANSWER TO QUIZ CAN YOU FINISH THE JOIN SYNTAX?

This query is ready to run.
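Assuming the two tables match on a Dept_No column, the completed join reads:

SELECT First_Name, Last_Name,
Department_Name
FROM Employee_Table as E
INNER JOIN
Department_Table as D
ON E.Dept_No = D.Dept_No ;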

QUIZ CAN YOU FIND THE ERROR?


This query has an error! Can you find it?

ANSWER TO QUIZ CAN YOU FIND THE ERROR?

If a column in the SELECT list is in both tables, you must fully qualify it.

SUPER QUIZ CAN YOU FIND THE DIFFICULT ERROR?

This query has an error! Can you find it?

ANSWER TO SUPER QUIZ CAN YOU FIND THE DIFFICULT ERROR?


If a column in the SELECT list is in both tables, you must fully qualify it.

QUIZ WHICH ROWS FROM BOTH TABLES WON'T RETURN?

An Inner Join returns matching rows, but did you know an Outer Join returns both matching rows and non-
matching rows? You will understand soon!

ANSWER TO QUIZ WHICH ROWS FROM BOTH TABLES WON'T RETURN?

The bottom line is that the three rows excluded did not have a matching Dept_No.

LEFT OUTER JOIN

This is a LEFT OUTER JOIN. That means that all rows from the LEFT table will appear in the report regardless
of whether a match is found in the RIGHT table.

LEFT OUTER JOIN RESULTS


A LEFT Outer Join returns all rows from the LEFT table, including all matches. If a LEFT row can't find a
match, NULLs are placed in the right table's columns!

LEFT OUTER JOINS COMPATIBLE WITH ORACLE


SELECT Customer_Name, Order_Date, Order_Total
FROM Customer_Table as C
LEFT OUTER JOIN
Order_Table as O
ON C.Customer_Number = O.Customer_Number ;
SELECT Customer_Name, Order_Date, Order_Total
FROM Customer_Table as C,
Order_Table as O
WHERE C.Customer_Number = O.Customer_Number (+) ;

Can't died when Could was born.

-- Author Unknown

Redshift supports outer joins using both the ANSI syntax and the Oracle syntax. I think Oracle joins are a real
plus!

RIGHT OUTER JOIN

This is a RIGHT OUTER JOIN. That means that all rows from the RIGHT table will appear in the report
regardless of whether a match is found in the LEFT table.

RIGHT OUTER JOIN EXAMPLE AND RESULTS


All rows from the RIGHT table were returned with matches, but since Dept_No 500 didn't have a match, the
system put a NULL value in the LEFT table's column values.

RIGHT OUTER JOINS COMPATIBLE WITH ORACLE


SELECT Customer_Name, Order_Date, Order_Total
FROM Customer_Table as C
RIGHT OUTER JOIN
Order_Table as O
ON C.Customer_Number = O.Customer_Number ;
SELECT Customer_Name, Order_Date, Order_Total
FROM Customer_Table as C,
Order_Table as O
WHERE C.Customer_Number (+) = O.Customer_Number ;

Redshift supports outer joins using both the ANSI syntax and the Oracle syntax

FULL OUTER JOIN

This is a FULL OUTER JOIN. That means that all rows from both the LEFT and RIGHT tables will appear in the
report regardless of whether a match is found.

FULL OUTER JOIN RESULTS


The FULL Outer Join Returns all rows from both Tables. NULLs show the flaws!

WHICH TABLES ARE THE LEFT AND WHICH ARE THE RIGHT?

Can you list which tables above are left tables and which tables are right tables?

ANSWER - WHICH TABLES ARE THE LEFT AND WHICH ARE THE RIGHT?

The first table is always the left table and the rest are right tables. The results from the first two tables being
joined become the left table for the next join.

INNER JOIN WITH ADDITIONAL AND CLAUSE

The additional AND is performed first in order to eliminate unwanted data, so the join is less intensive than
joining everything first and then eliminating rows that don't qualify.

ANSI INNER JOIN WITH ADDITIONAL AND CLAUSE


The additional AND is performed first in order to eliminate unwanted data, so the join is less intensive than
joining everything first and then eliminating after.

ANSI INNER JOIN WITH ADDITIONAL WHERE CLAUSE

The additional WHERE is performed first in order to eliminate unwanted data, so the join is less intensive than
joining everything first and then eliminating.

OUTER JOIN WITH ADDITIONAL WHERE CLAUSE

The additional WHERE is performed last on Outer Joins. All rows will be joined first and then the additional
WHERE clause filters after the join takes place.

OUTER JOIN WITH ADDITIONAL AND CLAUSE


The additional AND is performed in conjunction with the ON statement on Outer Joins. All rows will be
evaluated with the ON clause and the AND combined.

OUTER JOIN WITH ADDITIONAL AND CLAUSE RESULTS

The additional AND is performed in conjunction with the ON statement on Outer Joins. This can surprise you.
Only Mandee is in Dept_No 100, so she showed up like expected, but an outer join returns non-matches also.
Ouch!!!

QUIZ WHY IS THIS CONSIDERED AN INNER JOIN?

This is considered an INNER JOIN because we are doing a LEFT OUTER JOIN on the Employee_Table and
then filtering with the AND for a column in the right table!

THE DREADED PRODUCT JOIN


This query becomes a Product Join because it does not possess any JOIN Conditions (Join Keys). Every row
from one table is compared to every row of the other table, and quite often, the data is not what you intended to
get back.
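Based on the results described on the next page, the query was presumably along these lines: a LIKE filter but no join condition at all.

SELECT E.First_Name, E.Last_Name, D.Department_Name
FROM Employee_Table as E, Department_Table as D
WHERE D.Department_Name LIKE '%m%' ;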

THE DREADED PRODUCT JOIN RESULTS

How can Billy Coffing work in 3 different departments?

A Product Join is often a mistake! 3 Department rows had an m in their name, so these were joined to every
employee, and the information is worthless.

THE HORRIFYING CARTESIAN PRODUCT JOIN

A Cartesian Product Join is usually a big mistake.

THE ANSI CARTESIAN JOIN WILL ERROR


This causes an error. ANSI won't let this run unless a join condition is present.

QUIZ DO THESE JOINS RETURN THE SAME ANSWER SET?

Do these two queries produce the same result?

ANSWER DO THESE JOINS RETURN THE SAME ANSWER SET?

Do these two queries produce the same result? No, Query 1 Errors due to ANSI syntax and no ON Clause, but
Query 2 Product Joins to bring back junk!

THE CROSS JOIN


This query becomes a Product Join because a Cross Join is an ANSI Product Join. It will compare every row
from the Customer_Table to Order_Number 123456 in the Order_Table. Check out the Answer Set on the next
page.

THE CROSS JOIN ANSWER SET

Quite often, this Cross Join produces information that just isn't worth anything!

THE SELF JOIN

A Self Join gives itself 2 different Aliases, which is then seen as two different tables.
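A sketch of a typical self join, assuming the Employee_Table carries a manager column. The column names Mgr_No and Employee_No are hypothetical; substitute whatever your table actually uses.

SELECT Emp.First_Name, Emp.Last_Name, Mgr.Last_Name AS Manager
FROM Employee_Table as Emp,
Employee_Table as Mgr
WHERE Emp.Mgr_No = Mgr.Employee_No ;

The same table appears twice under two aliases, so the system treats it as two different tables.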

THE SELF JOIN WITH ANSI SYNTAX

A Self Join gives itself 2 different Aliases, which is then seen as two different tables.

QUIZ WILL BOTH QUERIES BRING BACK THE SAME ANSWER SET?
Will both queries bring back the same result set?

ANSWER WILL BOTH QUERIES BRING BACK THE SAME ANSWER SET?
Will both queries bring back the same result set? Yes! Because they're both inner joins.

QUIZ WILL BOTH QUERIES BRING BACK THE SAME ANSWER SET?

Will both queries bring back the same result set?

ANSWER WILL BOTH QUERIES BRING BACK THE SAME ANSWER SET?

Will both queries bring back the same result set? NO! The WHERE is performed last.
HOW WOULD YOU JOIN THESE TWO TABLES?

How would you join these two tables together? You can't do it. There is no matching column with like data.
There is no Primary Key/Foreign Key relationship between these two tables. That is why you are about to be
introduced to a bridge table. It is formally called an Associative table or a Lookup table.

AN ASSOCIATIVE TABLE IS A BRIDGE THAT JOINS TWO TABLES

The Associative Table is a bridge between the Course_Table and Student_Table.

QUIZ CAN YOU WRITE THE 3-TABLE JOIN?

SELECT ALL Columns from the Course_Table and Student_Table and Join them

ANSWER TO QUIZ CAN YOU WRITE THE 3-TABLE JOIN?


The Associative Table is a bridge between the Course_Table and Student_Table, and its sole purpose is to join
these two tables together.
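Written out in traditional syntax, the three-table join is:

SELECT S.*, C.*
FROM Student_Table as S,
Course_Table as C,
Student_Course_Table as SC
WHERE S.Student_ID = SC.Student_ID
AND C.Course_ID = SC.Course_ID ;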

QUIZ CAN YOU WRITE THE 3-TABLE JOIN TO ANSI SYNTAX?

SELECT S.*, C.*


FROM Student_Table as S,
Course_Table as C,
Student_Course_Table as SC
Where S.Student_ID = SC.Student_ID
AND C.Course_ID = SC.Course_ID ;
Convert this query to ANSI syntax

Please re-write the above query using ANSI Syntax.

ANSWER CAN YOU WRITE THE 3-TABLE JOIN TO ANSI SYNTAX?

The above queries show both traditional and ANSI form for this three table join.
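In ANSI syntax, the same three-table join can be written:

SELECT S.*, C.*
FROM Student_Table as S
INNER JOIN
Student_Course_Table as SC
ON S.Student_ID = SC.Student_ID
INNER JOIN
Course_Table as C
ON C.Course_ID = SC.Course_ID ;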

QUIZ CAN YOU PLACE THE ON CLAUSES AT THE END?


Please re-write the above query and place both ON Clauses at the end.

ANSWER CAN YOU PLACE THE ON CLAUSES AT THE END?

This is tricky. The only way it works is to place the ON clauses backwards. The first ON Clause represents the
last INNER JOIN and then moves backwards.
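A sketch of that rewrite, where the first ON clause pairs with the last INNER JOIN:

SELECT S.*, C.*
FROM Student_Table as S
INNER JOIN
Student_Course_Table as SC
INNER JOIN
Course_Table as C
ON C.Course_ID = SC.Course_ID
ON S.Student_ID = SC.Student_ID ;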

THE 5-TABLE JOIN LOGICAL INSURANCE MODEL

Above is the logical model for the insurance tables showing the Primary Key and Foreign Key relationships
(PK/FK).

QUIZ - WRITE A FIVE TABLE JOIN USING ANSI SYNTAX

Your mission is to write a five table join selecting all columns using ANSI syntax.

ANSWER - WRITE A FIVE TABLE JOIN USING ANSI SYNTAX


SELECT
cla1.*, sub1.*, add1.*, pro1.*, ser1.*
FROM CLAIMS AS cla1
INNER JOIN
SUBSCRIBERS AS sub1
ON cla1.Subscriber_No = sub1.Subscriber_No
AND cla1.Member_No = sub1.Member_No
INNER JOIN
ADDRESSES AS add1
ON sub1.Subscriber_No = add1.Subscriber_No
INNER JOIN
PROVIDERS AS pro1
ON cla1.Provider_No = pro1.Provider_Code
INNER JOIN
SERVICES AS ser1
ON cla1.Claim_Service = ser1.Service_Code ;

Above is the example writing this five table join using ANSI syntax.

QUIZ - WRITE A FIVE TABLE JOIN USING NON-ANSI SYNTAX

Your mission is to write a five table join selecting all columns using Non-ANSI syntax.

ANSWER - WRITE A FIVE TABLE JOIN USING NON-ANSI SYNTAX


SELECT cla1.*, sub1.*, add1.*, pro1.*, ser1.*
FROM CLAIMS AS cla1,
SUBSCRIBERS AS sub1,
ADDRESSES AS add1,
PROVIDERS AS pro1,
SERVICES AS ser1
WHERE cla1.Subscriber_No = sub1.Subscriber_No
AND cla1.Member_No = sub1.Member_No
AND sub1.Subscriber_No = add1.Subscriber_No
AND cla1.Provider_No = pro1.Provider_Code
AND cla1.Claim_Service = ser1.Service_Code ;

Above is the example writing this five table join using Non-ANSI syntax.
QUIZ RE-WRITE THIS PUTTING THE ON CLAUSES AT THE END
SELECT
cla1.*, sub1.*, add1.*, pro1.*, ser1.*
FROM CLAIMS AS cla1
INNER JOIN
SUBSCRIBERS AS sub1
ON cla1.Subscriber_No = sub1.Subscriber_No
AND cla1.Member_No = sub1.Member_No
INNER JOIN
ADDRESSES AS add1
ON sub1.Subscriber_No = add1.Subscriber_No
INNER JOIN
PROVIDERS AS pro1
ON cla1.Provider_No = pro1.Provider_Code
INNER JOIN
SERVICES AS ser1
ON cla1.Claim_Service = ser1.Service_Code ;

Above is the five table join written in ANSI syntax. Re-write it placing the ON clauses at the end.

ANSWER RE-WRITE THIS PUTTING THE ON CLAUSES AT THE END

Above is the example writing this five table join using ANSI syntax with the ON clauses at the end. We had to
move the tables around also to make this happen. Notice that the first ON clause represents the last two tables
being joined, and then it works backwards.
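The rewritten query is not reproduced in the text; one arrangement that nests correctly (tables reordered so each ON clause can see its pair of tables) would be:

SELECT cla1.*, sub1.*, add1.*, pro1.*, ser1.*
FROM ADDRESSES AS add1
INNER JOIN
SUBSCRIBERS AS sub1
INNER JOIN
PROVIDERS AS pro1
INNER JOIN
CLAIMS AS cla1
INNER JOIN
SERVICES AS ser1
ON cla1.Claim_Service = ser1.Service_Code
ON cla1.Provider_No = pro1.Provider_Code
ON cla1.Subscriber_No = sub1.Subscriber_No
AND cla1.Member_No = sub1.Member_No
ON sub1.Subscriber_No = add1.Subscriber_No ;

The first ON joins CLAIMS to SERVICES, the next joins that result to PROVIDERS, then to SUBSCRIBERS, and finally to ADDRESSES.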
Chapter 12 - Date Functions

"An inch of time cannot be bought with an inch of gold."


- Chinese Proverb

CURRENT_DATE
This example uses the Current_Date to return the current date.
SELECT Current_Date as ANSI_Date;

ANSI_Date
------------
2014-10-04

Not all who wander are lost.


J. R. R. Tolkien

The Current_Date function will return today's date.

TIMEOFDAY()
TIMEOFDAY() returns a VARCHAR data type and specifies the weekday, date, and time.
SELECT TIMEOFDAY() ;

timeofday
------------
Mon Oct 6 22:53:50.333525 2014 UTC

Always remember that you are unique just like everyone else.

Anonymous

The TIMEOFDAY function returns the weekday, date and the time.
SYSDATE RETURNS A TIMESTAMP WITH MICROSECONDS
This example uses the SYSDATE function
to return the full timestamp for the current date.
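The slide's first query is not reproduced in the text; it is simply:

SELECT SYSDATE ;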

This example uses the SYSDATE function inside the TRUNC function to return the current date without the time
included.
SELECT TRUNC(SYSDATE) ;

trunc
------------
2014-10-04

The SYSDATE function returns the current date and time according to the system clock on the leader node. The
functions CURRENT_DATE and TRUNC(SYSDATE) produce the same results.

GETDATE RETURNS A TIMESTAMP WITHOUT MICROSECONDS


This example uses the GETDATE() function
to return the full timestamp for the current date.
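The slide's first query is not reproduced in the text; it is simply:

SELECT GETDATE() ;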

This example uses the GETDATE() function inside the TRUNC function to return the current date without the
time included.
SELECT TRUNC(GETDATE());
trunc
------------
2014-10-04

GETDATE returns a TIMESTAMP. The parentheses are required.

ADD OR SUBTRACT DAYS FROM A DATE


SELECT Order_Date + 60 AS "Due Date"
,Order_Date
,to_char(Order_total,'$99,999.99') as Total_Due
FROM Order_Table
ORDER BY 1 ;
When you add or subtract from a Date, you are adding/subtracting Days.

Because Dates are stored internally on disk as integers, it makes it easy to add days to the calendar. In the query
above, we are adding 60 days to the Order_Date. Also, notice the to_char command which will format the
amount.

THE ADD_MONTHS COMMAND RETURNS A TIMESTAMP


SELECT Order_Date
,Add_Months (Order_Date,2) as "Due Date"
,Order_Total
FROM Order_Table ORDER BY 1 ;

The ADD_MONTHS function adds a specified number of months to a date or timestamp value. If the date is the
last day of the month, or if the resulting month is shorter, the function returns the last day of the month in the
result. For other dates, the result contains the same day number as the date expression. The number of months
can be a positive or negative integer, or any value that implicitly converts to an integer, so you can even use a
negative number to subtract months from dates. The DATEADD function provides similar functionality.

THE ADD_MONTHS COMMAND WITH TRUNC REMOVES TIME


SELECT Order_Date
,TRUNC(Add_Months (Order_Date,2)) as "Due Date"
,Order_Total
FROM Order_Table ORDER BY 1 ;

Above, we used the TRUNC command to get rid of the time (00:00:00) on the returning answer set. The
ADD_MONTHS function adds a specified number of months to a date or timestamp value. If the date is the last
day of the month, or if the resulting month is shorter, the function returns the last day of the month in the result.
For other dates, the result contains the same day number as the date expression. The number of months can be a
positive or negative integer, or any value that implicitly converts to an integer, so you can even use a negative
number to subtract months from dates. The DATEADD function provides similar functionality.

ADD_MONTHS COMMAND TO ADD 1-YEAR OR 5-YEARS


There is no Add_Year command, so put in 12 months for 1-year

In this example, we multiplied 12 months times 5 for a total of 5 years!

The Add_Months command adds months to any date. Above, we used a great technique that would give us 1-
year. We then showed an even better technique to get 5-years.

DATEADD FUNCTION AND ADD_MONTHS FUNCTION ARE DIFFERENT


DATEADD: If there are fewer days in the date you are adding to
than in the result month, the result will be the corresponding day
of the result month, not the last day of that month. For example,
April 30th + 1 month is May 30th:

SELECT DATEADD (month,1,'2014-04-30');


DATEADD
------------------------
2014-05-30 00:00:00
ADD_MONTHS: If the date you are adding to is the last day of the
month, the result is always the last day of the result month, regardless of
the length of the month. For example, April 30th + 1 month is May 31st:
SELECT ADD_Months ('2014-04-30',1);
ADD_Months
------------------------
2014-05-31 00:00:00

The DATEADD and ADD_MONTHS functions handle dates that fall at the ends of months differently.

THE EXTRACT COMMAND


The EXTRACT command extracts portions of Date, Time, and Timestamp
SELECT Order_Date
,Add_Months (Order_Date,12 * 5) as "Due Date"
,Order_Total
FROM Order_Table
WHERE EXTRACT(Month from Order_Date) = 9
ORDER BY 1 ;

You miss 100 percent of the shots you never take.

Wayne Gretzky

This is the Extract command. It returns a date part, such as a day, month, or year, from a timestamp value or
expression.

EXTRACT FROM DATES AND TIME


SELECT
Current_Date
,EXTRACT(Year from Current_Date) as Yr
,EXTRACT(Month from Current_Date) as Mo
,EXTRACT(Day from Current_Date) as Da
,Current_Time
,EXTRACT(Hour from Current_Time) as Hr
,EXTRACT(Minute from Current_Time) as Mn
,EXTRACT(Second from Current_Time) as Sc ;

Just like the Add_Months, the EXTRACT Command is a Temporal Function or a Time-Based Function.

EXTRACT WITH DATE AND TIME LITERALS


SELECT
EXTRACT(YEAR FROM DATE '2000-10-01') AS "Yr"
,EXTRACT(MONTH FROM DATE '2000-10-01') AS "Mth"
,EXTRACT(DAY FROM DATE '2000-10-01') AS "Day"
,EXTRACT(HOUR FROM TIME '10:01:30') AS "Hr"
,EXTRACT(MINUTE FROM TIME '10:01:30') AS "Min"
,EXTRACT(SECOND FROM TIME '10:01:30') AS "Sec"
,EXTRACT(MONTH FROM current_timestamp) AS ts_Mth
,EXTRACT(SECOND FROM current_timestamp) AS ts_Part ;

Just like the Add_Months, the EXTRACT Command is a Temporal Function or a Time-Based Function, and the
above is designed to show how to use it with literal values.
EXTRACT OF THE MONTH ON AGGREGATE QUERIES
SELECT EXTRACT(Month FROM Order_date)
,COUNT(*) AS Nbr_of_rows
,AVG(Order_Total)
FROM Order_Table
GROUP BY 1
ORDER BY 1 ;

The above SELECT uses the EXTRACT to only display the month and also to control the number of aggregates
displayed in the GROUP BY. Notice the Answer Set headers.

THE DATEDIFF COMMAND

This function uses a datepart (day, week, month etc.) and two target expressions. This function returns the
difference between the two expressions. The expressions must be date or timestamp expressions and they must
both contain the specified datepart. If the second date is later than the first date, the result is positive. If the
second date is earlier than the first date, the result is negative.
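A couple of sketches with literal dates show how DATEDIFF counts datepart boundaries crossed:

SELECT DATEDIFF(day, date '2014-01-01', date '2014-02-01') ;
-- returns 31

SELECT DATEDIFF(month, date '2014-01-01', date '2014-06-15') ;
-- returns 5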

THE DATEDIFF FUNCTION ON COLUMN DATA


This function uses a datepart (day, week, month etc.) and two target expressions. This function returns the
difference between the two expressions. The expressions must be date or timestamp expressions, and they must
both contain the specified datepart. If the second date is later than the first date, the result is positive. If the
second date is earlier than the first date, the result is negative.

THE DATE_PART FUNCTION USING A DATE

The DATE_PART function operates on the specific part of the date value (year, month, or day, for example) that
you request. The expression must be a date or timestamp expression that contains the specified date_part.

THE DATE_PART FUNCTION USING A TIME


Extract the minute from the timestamp

pgdate_part
8

Speak in a moment of anger and youll deliver the greatest speech youll ever regret.

Anonymous

The DATE_PART function operates on the specific part of the date value (year, month, or day, for example) that
you request. The expression must be a date or timestamp expression that contains the specified DATE_PART.
Notice that the default column name for the DATE_PART function is PGDATE_PART.

DATE_PART ABBREVIATIONS
Below are dateparts for date or timestamp functions. The following table identifies the datepart and timepart
names and abbreviations that are accepted as arguments to the following functions:

Above are the functions for datepart or timepart, their parts, and the acceptable abbreviations.

THE TO_CHAR COMMAND


SELECT Order_Date
,Order_Date + 60 as "Due Date"
,to_char(Order_Total, '$99,999.99') As Order_Total
,Order_Date + 50 as "Discount Date"
,to_char(Order_Total *.98, '$99,999.99') as Discounted
FROM Order_Table
ORDER BY 1 ;

The to_char command will take a value and convert it to a character string.

CONVERSION FUNCTIONS

Function Name Conversion Operation

to_number() Character to numeric

to_date() Character or timestamp to date

to_timestamp() Character to timestamp

to_char() Numeric, date or timestamp to character

The following shows the syntax for using these functions:


to_number(<character-data>,'<template>')
to_date(<character-data>,'<template>')
to_timestamp(<character-data>,'<template>')
to_char(<numeric-data>)
to_char(<date-data>,'<template>')

Redshift provides some functions that assist in the conversion of data from one type to another.
CONVERSION FUNCTION TEMPLATES

HH, HH12 Hour of day (01:12).

HH24 Hour of day (00:23).

MI Minute (00:59).

SS Second (00:59).

SSSS Seconds past midnight (0:86399).

AM, am, A.M., a.m. or PM, pm, P.M., p.m. Meridian indicator (uppercase and lowercase).

Y,YYY Year (4 and more digits) with a comma.

YYYY Year (4 and more digits).

YYY Last 3 digits of the year.

YY Last 2 digits of the year.

Y Last digit of the year.

MONTH, Month, month Full month name (blank-padded to 9 chars).

MON, Mon, mon Abbreviated uppercase month name (3 chars).

MM Month number (01:12).

DAY, Day, day Full day name (blank-padded to 9 chars).

DY, Dy, dy Abbreviated uppercase day name (3 chars).

DDD Day of the year (001:366).

DD Day of the month (01:31).

D Day of the week (1:7; SUN=1).

BC, bc, B.C., b.c or AD, ad, A.D., a.d. Era indicator (uppercase and lowercase).

CONVERSION FUNCTION TEMPLATES CONTINUED


W Week of the month (1:5) where the first week starts on the first day of the month.

WW Week number of the year (1:53) where the first week starts on the first day of the year.

IW ISO week number of the year (The first Thursday of the new year is in week 1.)
CC Century (2 digits).

J Julian Day (days since January 1, 4712 BC).

Q Quarter

RM Month in Roman Numerals (I-XII; I=January) uppercase.

rm Month in Roman Numerals (i-xii; i=January) lowercase.

FM prefix Fill mode (suppresses padding blanks and zeroes).

TH, th suffix Add uppercase ordinal number suffix.

FX prefix Fixed format global option.

9 Value with the specified number of digits.

0 Value with leading zeros.

. (period) Decimal point.

, (comma) Group (thousand) separator.

PR Negative value in angle brackets.

S Negative value with minus sign (uses locale).

L Currency symbol (uses locale).

D Decimal point (uses locale).

G Group separator (uses locale).

MI Minus sign in the specified position (if number < 0).

RN Roman numeral (input between 1 and 3999).

V Shift n digits (see notes).

FORMATTING A DATE
The to_char command will take a value and convert it to a character string. This includes formatting a date.
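A sketch using a literal date and the templates from the tables above:

SELECT to_char(date '2005-12-31', 'Month DD, YYYY') ;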

A SUMMARY OF MATH OPERATIONS ON DATES

DATE - DATE = Interval (days between dates)

DATE + or - Integer = Date

Let's find the number of days Tera-Tom has been alive.
SELECT (date '2012-01-10') - (date '1959-01-10') AS "Tera-Tom's Age In Days";
Tera-Tom's Age In Days
19358

A DATE - DATE is an interval of days between dates. A DATE + or - Integer = Date. The query above uses the
dates the traditional way to deliver the Interval.

USING A MATH OPERATION TO FIND YOUR AGE IN YEARS

DATE - DATE = Interval (days between dates)

DATE + or - Integer = Date

Let's find the number of days Tera-Tom has been alive.
SELECT (date '2012-01-10') - (date '1959-01-10') AS "Tera-Tom's Age In Days";
Tera-Tom's Age In Days

19358
Let's find the number of years Tera-Tom has been alive.
SELECT ((date '2012-01-10') - (date '1959-01-10'))/365 AS "Tera-Tom's Age In Years";
Tera-Tom's Age In Years

53

A DATE - DATE is an interval of days between dates. A DATE + or - Integer = Date. Both queries above
perform the same kind of math, but the top query finds "Days" and the query on the bottom finds "Years".

DATE RELATED FUNCTIONS


TO_CHAR - Receives a date and based on the template
characters, displays portions of the date.

TO_DATE - Receives a character string and converts it to a


date based on the template provided.
SELECT to_char(Order_Date, 'Day dddd, Mon yy')
,Order_Date -365 "Year Later Date"
,to_char(Order_Total,'$99,999.99') Order_Total
,to_date('Dec 31, 2005','mon dd, yyyy') as "Due Date"
FROM Order_Table ORDER BY 2 ;
Answer Set

A SIDE TITLE EXAMPLE WITH RESERVED WORDS AS AN ALIAS


SELECT 'Due Date:' AS "" /* empty alias for no title */
,EXTRACT(Month FROM Order_date+64) AS "Month"
,EXTRACT(Day FROM Order_date+64) AS "Day"
,EXTRACT(Year FROM Order_date+64) AS "Year"
,to_char(Order_Date, 'Mon-dd, yyyy')
,Order_Total
FROM Order_Table
ORDER BY 2,3 ;

The next SELECT operation uses entirely ANSI compliant code to show the month and day of the payment due
date in 2 months and 4 days. Notice it uses double quotes to allow reserved words as alias names.

IMPLIED EXTRACT OF DAY, MONTH AND YEAR


Compatibility: Redshift Extension. The syntax for implied extract:

SELECT to_char(<date-data>,'DD') /* extracts the day */
,to_char(<date-data>,'MM') /* extracts the month */
,to_char(<date-data>,'YYYY') /* extracts the year */
FROM <table-name> ;

--The following SELECT uses to_char to extract the three portions of Tom's literal birthday
SELECT to_char(date '2012-01-10','DD') AS Day_portion
,to_char(date '2012-01-10','MM') AS Month_portion
,to_char(date '2012-01-10','YYYY') AS Year_portion ;

It was mentioned earlier that Redshift stores a date as an integer and therefore allows math operations to be
performed on a date. Although the EXTRACT works great and it is ANSI compliant, it is a function. Therefore, it
must be executed and the parameters passed to it to identify the desired portion as data. Then, it must pass back
the answer. As a result, there is additional overhead processing required to use it.
DATE_PART FUNCTION
Compatibility: Redshift Extension. Syntax of DATE_PART:

DATE_PART('<text>',<date-time-timestamp>)

Where <text> can be: 'YEAR', 'MONTH', 'DAY' for a date or a
timestamp and 'HOUR', 'MINUTE', 'SECOND' for a time and a timestamp.

The following SELECT will show DATE_PART with a Date, Time, and Timestamp.

SELECT CURRENT_DATE
,DATE_PART('MONTH',CURRENT_DATE)
,CURRENT_TIME
,DATE_PART('MINUTE',CURRENT_TIME)
,CURRENT_TIMESTAMP
,DATE_PART('SECOND',CURRENT_TIMESTAMP) ;

The DATE_PART function works exactly like EXTRACT. Although the name contains DATE, it also works with
time and time stamp data. Notice the column headers!

DATE_PART FUNCTION USING AN ALIAS

The following SELECT will show DATE_PART with a Date,


Time, and Timestamp and add an ALIAS for each DATE_PART.

SELECT
CURRENT_DATE
,DATE_PART('MONTH',CURRENT_DATE) as "MONTH"
,CURRENT_TIME
,DATE_PART('MINUTE',CURRENT_TIME) as "MINUTE"
,CURRENT_TIMESTAMP
,DATE_PART('SECOND',CURRENT_TIMESTAMP) as "SECOND" ;

The DATE_PART function works exactly like EXTRACT. Although the name contains DATE, it also works with
time and time stamp data. Now notice the column headers!

DATE_TRUNC FUNCTION
Compatibility: Redshift Extension. Syntax of DATE_TRUNC:

DATE_TRUNC('<text>',<date>)
Where <text> can be: 'YEAR', 'MONTH', 'DAY' for a date or a timestamp and 'HOUR', 'MINUTE', 'SECOND'
for a time and a timestamp. Although DAY and SECOND are allowed, they have no impact on the output data, see
below.
SELECT CURRENT_TIMESTAMP
,DATE_TRUNC('YEAR',CURRENT_TIMESTAMP) AS Yr_Trunc
,DATE_TRUNC('MONTH',CURRENT_TIMESTAMP) AS Mo_Trunc
,DATE_TRUNC('DAY',CURRENT_TIMESTAMP) AS Da_Trunc;

The DATE_TRUNC function has an interesting capability in that it truncates the portion of a date back to the
first. Notice that the year portion becomes January 1, the month portion becomes August 1, and the day portion
does not change. However, the entire time portion is set back to 12:00:00. When DATE data is used, the time
portion is set to 12:00:00 just like in the above.

DATE_TRUNC FUNCTION USING TIME


Compatibility: Redshift Extension. Syntax of DATE_TRUNC:

DATE_TRUNC('<text>',<date>)

Where <text> can be: 'YEAR', 'MONTH', 'DAY' for a date or a timestamp and 'HOUR', 'MINUTE', 'SECOND'
for a time and a timestamp. Although DAY and SECOND are allowed, they have no impact on the output data, see
below.

SELECT CURRENT_TIMESTAMP
,DATE_TRUNC('HOUR',CURRENT_TIMESTAMP) hr_trunc
,DATE_TRUNC('MINUTE',CURRENT_TIMESTAMP) min_trunc
,DATE_TRUNC('SECOND',CURRENT_TIMESTAMP) sec_trunc;

Notice that the hour portion becomes 9:00:00, the minute portion becomes 9:05:00, and the second and date
portions do not change. If TIME was used, there would be no DATE portion as in a TIME STAMP.

MONTHS_BETWEEN FUNCTION
Compatibility: Redshift Extension

Syntax: MONTHS_BETWEEN (<start_date>,<end_date>)

The following example uses the MONTHS_BETWEEN with some fixed dates

SELECT months_between(date '2004-10-01',date '2004-09-01')
,months_between(date '2004-10-01',date '2004-09-15')
,months_between(date '2004-10-01',date '2002-08-15')
,months_between(date '2003-10-01',date '2004-10-15') ;
The MONTHS_BETWEEN function is handy for doing date subtraction. Unlike normal date subtraction, it
returns a fractional portion based on the days in a month.

MONTHS_BETWEEN FUNCTION IN ACTION


SELECT ADD_MONTHS(Order_Date, 2) AS "Due Date"
,Order_Date
,to_char(Order_total,'$99,999.99')
,MONTHS_BETWEEN(Add_Months(order_date,2),order_date)
FROM Order_Table
ORDER BY 2 ;

The above example uses the order table to demonstrate MONTHS_BETWEEN.

ANSI TIME

Redshift has the ANSI time display and TIME data type.
CURRENT_TIME is the ANSI name of the time function.
SELECT CURRENT_TIME;

TIME
17:27:56
SELECT Current_Time
,CURRENT_TIME - 55 as Subtract ;

TIME Subtract
17:27:56 17:27:01

As well as creating a TIME data type, intelligence has been added to the clock software. It can increment or
decrement TIME with the result increasing to the next minute or decreasing from the previous minute based on
the addition or subtraction of seconds.

ANSI TIMESTAMP
TIMESTAMP is a display format, a reserved name and a new data type. It is a combination of the DATE and
TIME data types combined together into a single column data type.
SELECT CURRENT_TIMESTAMP ;
Notice that there is a space between the DATE and TIME portions of a TIMESTAMP. This is a required element
to delimit or separate the day from the hour.

REDSHIFT TIMESTAMP FUNCTION


Compatibility: Redshift Extension

The TIMESTAMP function can be used to convert a date or combination of a date and time into a timestamp.
Syntax for using TIMESTAMP:

TIMESTAMP(<date> [ <time> ] )
SELECT TIMESTAMP(CURRENT_DATE)
,TIMESTAMP(CURRENT_DATE, CURRENT_TIME)
,TIMESTAMP(DATE '2005-10-01', TIME '08:30:05') ;

What a wonderful feature. Redshift allows you to convert a date, or a combination of a date and a time, into a
timestamp. The example above converts the current date alone, the current date and time, and a literal date and
time. This should be all you need.

REDSHIFT TO_TIMESTAMP FUNCTION


The TO_TIMESTAMP function can be used to convert character strings to a timestamp.

Syntax for using TO_TIMESTAMP:

TO_TIMESTAMP(<date-string> [ <time-string> ] )

TO_TIMESTAMP TO_TIMESTAMP
10/01/2005 8:30:05.204331 10/01/2005 8:30:05.204331
Redshift allows you to convert character strings into a Timestamp. Notice that both answers are exactly the same.
The second parameter is NOT how the data should be output or formatted, but instead it reflects how the string
should be interpreted.
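Python's strptime works the same way, which makes the point concrete: the pattern describes how to read the input string, not how to display the result. The fractional seconds from the example above are dropped here for brevity:

```python
from datetime import datetime

s = "10/01/2005 8:30:05"
# The pattern describes the INPUT layout; the parsed value itself is format-free.
ts = datetime.strptime(s, "%m/%d/%Y %H:%M:%S")
print(ts.year, ts.month, ts.second)  # 2005 10 5
```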

REDSHIFT NOW() FUNCTION

Compatibility: Redshift Extension

The timestamp can also be obtained using the NOW() function:

SELECT NOW() ;
NOW
08/03/2012 11:07:10

Redshift allows you to see the date and time with the NOW() function. The next time someone asks for the time
tell them NOW.

REDSHIFT TIMEOFDAY FUNCTION


Compatibility: Redshift Extension

To get a bit more extended version of a time stamp use TIMEOFDAY:

SELECT TIMEOFDAY() ;

Answer Set
TIMEOFDAY
Fri Aug 03 11:11:38 2012 EDT

Redshift offers an extended version of a timestamp that is robust and verbose.

REDSHIFT AGE FUNCTION


Compatibility: Redshift Extension

The AGE function returns the interval (discussed later in this chapter) between two time stamps. If you use a
single time stamp, the age function returns the interval between the current time and the time stamp provided.
The interval returned by the age function can include year and month data as well as day and time data.

Syntax of AGE:
AGE(<start-date>,<end-date>)

SELECT CURRENT_TIMESTAMP
,AGE('10-28-2004','7-20-2003')
,AGE(current_timestamp,'7-20-2003')
,AGE('7-20-2003') as AGE2 /* defaults to CURRENT_TIMESTAMP */ ;

To subtract one time stamp from another, use the AGE function.

TIME ZONES
A time zone relative to London (UTC) might be:

LA          Miami       Frankfurt   Hong Kong
+08:00      +05:00       00:00      -08:00

A time zone relative to New York (EST) might be:

LA          Miami       Frankfurt   Hong Kong
+03:00       00:00      -05:00      -13:00

Redshift has the ability to adjust time and timestamp values to reflect the difference in hours between the user's
time zone, the system time zone, and the United Kingdom location that was historically called Greenwich Mean
Time (GMT). Since the Greenwich observatory has been "decommissioned," this same time zone is now called
Coordinated Universal Time (UTC).

Here, the time zones used are represented from the perspective of the system at EST. In the above, it appears to
be backward. This is because the time zone is set using the number of hours that the system is from the user.

SETTING TIME ZONES


A Time Zone should be established for the system and every user in each different time zone.

Syntax for changing the session time zone:

SET TIME ZONE { LOCAL | DEFAULT | 'xxx[-]H[H:MI]yyy' } ;

Where xxx is the designation for standard time and yyy is for daylight saving time.

Setting a Session's time zone:

SET TIME ZONE LOCAL ; /* use system level */


SET TIME ZONE 'PST8PDT' ; /* explicit setting Pacific */
SET TIME ZONE 'HKT-11:00HKT'; /*uses both hours and optional minutes */

A Redshift session can modify the time zone during normal operations without requiring a logoff and logon. At
this time, the system only recognizes time zone processing for columns stored with a data type of TIME WITH
TIME ZONE. Hopefully, it will soon also be added to TIMESTAMP when stored in a table.
USING TIME ZONES
The way time zones are implemented in Redshift is that the session time zone setting adjusts the value returned
by the TIME and TIMESTAMP when referenced in an SQL statement. To make some of the changes more
apparent, the following statements "assume" that they are all run at the same time.

Examples above set the time zone and then query Current_Timestamp simultaneously.

INTERVALS FOR DATE, TIME AND TIMESTAMP

Interval Chart

Simple Intervals More involved Intervals

YEAR DAY TO HOUR

MONTH DAY TO MINUTE

DAY DAY TO SECOND

HOUR HOUR TO MINUTE

MINUTE HOUR TO SECOND

SECOND MINUTE TO SECOND

Its not the size of the dog in the fight, but the size of the fight in the dog.
Archie Griffin

Redshift has added INTERVAL processing; however, it is not ANSI compliant. Intervals are used to perform
DATE, TIME, and TIMESTAMP arithmetic and conversion.

USING INTERVALS
SELECT Current_Date as Our_Date
      ,Current_Date + Interval '1' Day as Plus_1_Day
      ,Current_Date + Interval '3' Month as Plus_3_Months
      ,Current_Date + Interval '5' Year as Plus_5_Years ;
The afternoon knows what the morning never suspected.
- Swedish Proverb

To use the ANSI syntax for intervals, the SQL statement must be very specific as to what the data values mean
and the format in which they are coded. ANSI standards tend to be lengthier to write and more restrictive as to
what is and what is not allowed regarding the values and their use.

TROUBLESHOOTING THE BASICS OF A SIMPLE INTERVAL


SELECT Date '2012-01-29' as Our_Date
,Date '2012-01-29' + INTERVAL '1' Month as Leap_Year

Our_Date Leap_Year

01/29/2012 02/29/2012
SELECT Date '2011-01-29' as Our_Date
,Date '2011-01-29' + INTERVAL '1' Month as Leap_Year

Error Invalid Date

The first example works because we added 1 month to the date '2012-01-29' and got '2012-02-29'. Because 2012
was a leap year, there actually is a February 29, 2012. The next example is the real point: we have a date of
'2011-01-29' and we add 1 month to it, but there is no February 29 in 2011, so the query fails.
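Any calendar-aware library enforces the same rule; a quick Python check shows the identical failure:

```python
from datetime import date

date(2012, 2, 29)  # valid: 2012 is a leap year
try:
    date(2011, 2, 29)  # there is no February 29 in 2011
except ValueError as e:
    print("invalid date:", e)
```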

INTERVAL ARITHMETIC RESULTS

DATE and TIME arithmetic results using intervals:

Once the game is over, the king and the pawn go back in the same box.
- Italian Proverb
To use DATE and TIME arithmetic, it is important to keep in mind the results of various operations. The above
chart is your Interval guide.

A DATE INTERVAL EXAMPLE


SELECT (DATE '1999-10-01' - DATE '1988-10-01') DAY AS Actual_Days ;

ERROR Interval Field Overflow

The error occurred because the default for all intervals is 2 digits.

SELECT (DATE '1999-10-01' - DATE '1988-10-01') DAY(4) AS Actual_Days ;

Actual_Days
4017

The default for all intervals is 2 digits. We received an overflow error because Actual_Days is 4017. The
second example works because we demanded the output be 4 digits (the maximum for intervals).

A TIME INTERVAL EXAMPLE

SELECT
 (TIME '12:45:01' - TIME '10:10:01') HOUR AS Actual_Hours
,(TIME '12:45:01' - TIME '10:10:01') MINUTE(3) AS Actual_Minutes
,(TIME '12:45:01' - TIME '10:10:01') SECOND(4) AS Actual_Seconds
,(TIME '12:45:01' - TIME '10:10:01') SECOND(4,4) AS Actual_Seconds4 ;

SELECT (TIME '12:45:01' - TIME '10:10:01') MINUTE AS Actual_Minutes ;

ERROR Interval Field Overflow

The default for all intervals is 2 digits, but notice in the top example we put in 3 digits for Minute, 4 digits for
Second, and 4,4 digits for Actual_Seconds4. If we had not, we would have received an overflow error, as in
the bottom example.

A DATE INTERVAL EXAMPLE


SELECT
Current_Date,
INTERVAL -'2' YEAR + CURRENT_DATE as Two_years_Ago;

Our_Date     Two_Years_Ago
06/18/2012   06/18/2010
I know that you believe that you understand what you think I said, but
I am not sure you realize that what you heard is not what I meant.

-Sign on Pentagon office wall

The above Interval example uses INTERVAL -'2' YEAR to go back two years in time.

A COMPLEX TIME INTERVAL EXAMPLE USING CAST


Below is the syntax for using CAST with an interval:

SELECT CAST(<interval> AS INTERVAL <interval>)
FROM <table-name> ;

The following converts an INTERVAL of 6 years and 2 months to an INTERVAL number of months:
SELECT
CAST( (INTERVAL '6-02' YEAR TO MONTH) AS INTERVAL MONTH );
6-02
74

The CAST function (Convert And Store) is the ANSI method for converting data from one type to another. It can
also be used to convert one INTERVAL to another INTERVAL representation. Although CAST is normally
used in the SELECT list, it also works in the WHERE clause for comparisons.

A COMPLEX TIME INTERVAL EXAMPLE USING CAST


This request attempts to convert 1300 months to show
the number of years and months. Why does it fail?
SELECT
CAST(INTERVAL '1300' MONTH AS INTERVAL YEAR TO MONTH)
AS "Years & Months";

ERROR

SELECT
CAST(INTERVAL '1300' MONTH AS INTERVAL YEAR(3) TO MONTH)
AS "Years & Months";

Years & Months
108-04

The top query failed because the INTERVAL result defaults to 2 digits, and we have a 3-digit answer for the year
portion (108). The bottom query fixes that by specifying 3 digits. The biggest advantage of using INTERVAL
processing is that SQL written on another system becomes compatible.
THE OVERLAPS COMMAND
Compatibility: Redshift Extension

The syntax of the OVERLAPS is:

SELECT <literal>
WHERE (<start-date-time>, <end-date-time>) OVERLAPS
(<start-date-time>, <end-date-time>) ;
SELECT 'The Dates Overlap' as Dater
WHERE (DATE '2001-01-01', DATE '2001-11-30') OVERLAPS
(DATE '2001-10-15', DATE '2001-12-31');

When working with dates and times, sometimes it is necessary to determine whether two different ranges have
common points in time. Redshift provides a Boolean function to make this test for you. It is called OVERLAPS;
it evaluates to true if the ranges have points in common, otherwise it returns false. The literal is returned because
both date ranges have the period from October 15 through November 30 in common.

AN OVERLAPS EXAMPLE THAT RETURNS NO ROWS


SELECT 'The dates overlap' AS OverlapAnswer
WHERE (DATE '2001-01-01', DATE '2001-11-30') OVERLAPS
(DATE '2001-11-30', DATE '2001-12-31') ;

I don't know who my grandfather was. I am more interested in who


his grandson will become.
Abraham Lincoln

The above SELECT example tests two literal dates and uses the OVERLAPS to determine whether or not to
display the character literal. The literal was not selected because the ranges do not overlap. So, the common
single date of November 30 does not constitute an overlap. When dates are used, 2 days must be involved, and
when time is used, 2 seconds must be contained in both ranges.

THE OVERLAPS COMMAND USING TIME


SELECT 'The Times Overlap' As DoThey
WHERE (TIME '08:00:00', TIME '02:00:00') OVERLAPS
(TIME '02:01:00', TIME '04:15:00') ;
The above SELECT example tests two literal times and uses the OVERLAPS to determine whether or not to
display the character literal. This is a tricky example, and it is shown to prove a point. At first glance, it appears
as if this answer is incorrect because 02:01:00 looks like it starts 1 minute after the first range ends. However, the
system works on a 24-hour clock when a date and time (timestamp) are not used together. Therefore, the system
considers the earlier time of 2 AM as the start and the later time of 8 AM as the end of the range. Therefore,
not only do they overlap, the second range is entirely contained in the first range.
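The OVERLAPS rule can be sketched as a small predicate. The semantics below are assumed from the examples in this chapter: each range is normalized so the earlier point is the start, and sharing only a single endpoint does not count as an overlap:

```python
def overlaps(a_start, a_end, b_start, b_end):
    # Normalize each pair so the earlier value is the start.
    a_start, a_end = min(a_start, a_end), max(a_start, a_end)
    b_start, b_end = min(b_start, b_end), max(b_start, b_end)
    # Strict comparison: touching at a single endpoint is NOT an overlap.
    return max(a_start, b_start) < min(a_end, b_end)

# The tricky TIME example: (08:00, 02:00) is normalized to (02:00, 08:00),
# which entirely contains (02:01, 04:15).
print(overlaps("08:00:00", "02:00:00", "02:01:00", "04:15:00"))  # True
```

Zero-padded strings compare correctly here; the same predicate also rejects the date ranges that merely touch on November 30.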

THE OVERLAPS COMMAND USING A NULL VALUE


SELECT 'The Times Overlap' As Time1
WHERE
(TIME '10:00:00', NULL) OVERLAPS (TIME '01:01:00', TIME '04:15:00')

The above SELECT example tests two literal dates and uses the OVERLAPS to determine whether or not to
display the character literal:

When using the OVERLAPS function, there are a couple of situations to keep in mind:

1. A single point in time, i.e. the same date, does not constitute an overlap. There must be at least one second of
time in common for TIME or one day when using DATE.

2. Using a NULL as one of the parameters, the other DATE or TIME constitutes a single point in time versus a
range.

Chapter 13 - OLAP Functions

Don't count the days, make the days count.


- Muhammad Ali

CSUM
This ANSI version of CSUM is SUM() Over. Right now, the syntax wants to see the sum of the Daily_Sales after
it is first sorted by Sale_Date. Rows Unbounded Preceding makes this a CSUM. The ANSI Syntax seems
difficult, but only at first.

CSUM THE SORT EXPLAINED


SELECT Product_ID, Sale_Date, Daily_Sales,
SUM(Daily_Sales) OVER (ORDER BY Sale_Date
ROWS UNBOUNDED PRECEDING) AS SUMOVER
FROM Sales_Table WHERE Product_ID BETWEEN 1000 and 2000 ;

The first thing the above query does before calculating is SORT all the rows by Sale_Date. The Sort is located
right after the ORDER BY.

CSUM ROWS UNBOUNDED PRECEDING EXPLAINED


SELECT Product_ID, Sale_Date, Daily_Sales,
SUM(Daily_Sales) OVER (ORDER BY Sale_Date
ROWS UNBOUNDED PRECEDING) AS SUMOVER
FROM Sales_Table WHERE Product_ID BETWEEN 1000 and 2000 ;

The keywords ROWS UNBOUNDED PRECEDING determine that this is a CSUM. There are only a few different
statements, and ROWS UNBOUNDED PRECEDING is the main one. It means start calculating at the beginning
row, and continue calculating until the last row.
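Because SQLite (3.25+) supports the same ANSI window syntax, the cumulative behavior can be tried locally from Python; the table and values below are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, daily_sales REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("2023-01-01", 10.0), ("2023-01-02", 20.0), ("2023-01-03", 5.0)])
rows = conn.execute("""
    SELECT sale_date, daily_sales,
           SUM(daily_sales) OVER (ORDER BY sale_date
                                  ROWS UNBOUNDED PRECEDING) AS sumover
    FROM sales
""").fetchall()
print(rows)  # running totals: 10.0, 30.0, 35.0
```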

CSUM MAKING SENSE OF THE DATA


SELECT Product_ID, Sale_Date, Daily_Sales,
SUM(Daily_Sales) OVER (ORDER BY Sale_Date
ROWS UNBOUNDED PRECEDING) AS SUMOVER
FROM Sales_Table WHERE Product_ID BETWEEN 1000 and 2000 ;
The second SUMOVER row is 90739.28. That is derived by adding the first row's Daily_Sales (41888.88) to
the SECOND row's Daily_Sales (48850.40).

CSUM MAKING EVEN MORE SENSE OF THE DATA


SELECT Product_ID, Sale_Date, Daily_Sales,
SUM(Daily_Sales) OVER (ORDER BY Sale_Date
ROWS UNBOUNDED PRECEDING) AS SUMOVER
FROM Sales_Table WHERE Product_ID BETWEEN 1000 and 2000 ;

The third SUMOVER row is 138739.28. That is derived by taking the first row's Daily_Sales (41888.88) and
adding it to the SECOND row's Daily_Sales (48850.40). Then, you add that total to the THIRD row's
Daily_Sales (48000.00).

CSUM THE MAJOR AND MINOR SORT KEY(S)


SELECT Product_ID, Sale_Date, Daily_Sales,
SUM(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
ROWS UNBOUNDED PRECEDING) AS SumOVER
FROM Sales_Table ;

You can have more than one SORT KEY. In the top query, Product_ID is the MAJOR Sort, and Sale_Date is the
MINOR Sort.

RESET WITH A PARTITION BY STATEMENT


SELECT Product_ID, Sale_Date, Daily_Sales,
SUM(Daily_Sales) OVER (PARTITION BY Product_ID
ORDER BY Product_ID, Sale_Date
ROWS UNBOUNDED PRECEDING)
AS SumANSI
FROM Sales_Table ;

The PARTITION BY statement is how you reset in ANSI. This will cause SumANSI to start over (reset) its
calculation for each NEW Product_ID.
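The reset can be sketched in plain Python, assuming the rows arrive already sorted by the partitioning column:

```python
from itertools import groupby, accumulate
from operator import itemgetter

rows = [(1000, 10.0), (1000, 20.0), (2000, 5.0), (2000, 7.0)]  # (product_id, daily_sales)

result = []
for product_id, grp in groupby(rows, key=itemgetter(0)):
    # accumulate() restarts for each partition, mirroring PARTITION BY.
    for subtotal in accumulate(sales for _, sales in grp):
        result.append((product_id, subtotal))
print(result)  # [(1000, 10.0), (1000, 30.0), (2000, 5.0), (2000, 12.0)]
```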

PARTITION BY ONLY RESETS A SINGLE OLAP NOT ALL OF THEM


SELECT Product_ID, Sale_Date, Daily_Sales,
SUM(Daily_Sales) OVER (PARTITION BY Product_ID
ORDER BY Product_ID, Sale_Date
ROWS UNBOUNDED PRECEDING) AS Subtotal,
SUM(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
ROWS UNBOUNDED PRECEDING) AS GRANDTotal
FROM Sales_Table ;

Above are two OLAP statements. Only one has PARTITION BY, so only it resets.

ANSI MOVING WINDOW IS CURRENT ROW AND PRECEDING N ROWS

The SUM () Over allows you to get the moving SUM of a certain column.

HOW ANSI MOVING SUM HANDLES THE SORT


The SUM OVER places the sort after the ORDER BY.

QUIZ HOW IS THAT TOTAL CALCULATED?


SELECT Product_ID, Sale_Date, Daily_Sales,
SUM(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
ROWS 2 Preceding) AS Sum3_ANSI
FROM Sales_Table ;

With a Moving Window of 3, how is the 139350.69 amount derived in the Sum3_ANSI column in the third row?

ANSWER TO QUIZ HOW IS THAT TOTAL CALCULATED?


SELECT Product_ID, Sale_Date, Daily_Sales,
SUM(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
ROWS 2 Preceding) AS Sum3_ANSI
FROM Sales_Table ;

With a Moving Window of 3, how is the 139350.69 amount derived in the Sum3_ANSI column in the third row?
It is the sum of 48850.40, 54500.22, and 36000.07: the current row's Daily_Sales plus the previous two rows'
Daily_Sales.
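The moving-window arithmetic can be sketched as a sliding slice over the sorted values (the current row plus up to two preceding rows); the values below are the Daily_Sales figures from the quiz:

```python
def moving_sum(values, preceding=2):
    # Each window is the current value plus up to `preceding` earlier values,
    # matching ROWS 2 PRECEDING for the default.
    return [sum(values[max(0, i - preceding): i + 1]) for i in range(len(values))]

daily = [41888.88, 48850.40, 54500.22, 36000.07]
print(moving_sum(daily))  # the 4th entry is 48850.40 + 54500.22 + 36000.07
```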

MOVING SUM EVERY 3-ROWS VS A CONTINUOUS AVERAGE


SELECT Product_ID, Sale_Date, Daily_Sales,
SUM(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
ROWS 2 Preceding) AS SUM3,
SUM(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
ROWS UNBOUNDED Preceding) AS Continuous
FROM Sales_Table;

The ROWS 2 Preceding gives the moving sum (MSUM) for every 3 rows. The ROWS UNBOUNDED Preceding
gives the continuous (cumulative) sum.

PARTITION BY RESETS AN ANSI OLAP

Use a PARTITION BY Statement to Reset the ANSI OLAP. Notice it only resets the OLAP command containing
the Partition By statement, but not the other OLAPs.

MOVING AVERAGE
SELECT Product_ID, Sale_Date, Daily_Sales,
AVG(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
ROWS 2 Preceding) AS AVG_3
FROM Sales_Table ;

Notice the Moving Window of 3 in the syntax and that it is a 2 in the ANSI version. That is because in ANSI, it is
considered the Current Row and 2 preceding.

THE MOVING WINDOW IS CURRENT ROW AND PRECEDING


The AVG () Over allows you to get the moving average of a certain column. The Rows 2 Preceding is a moving
window of 3 in ANSI.

HOW MOVING AVERAGE HANDLES THE SORT

Much like the SUM OVER Command, the Average OVER places the sort keys via the ORDER BY keywords.

QUIZ HOW IS THAT TOTAL CALCULATED?


SELECT Product_ID, Sale_Date, Daily_Sales,
AVG(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
ROWS 2 Preceding) AS AVG_3_ANSI
FROM Sales_Table ;

With a Moving Window of 3, how is the 46450.23 amount derived in the AVG_3_ANSI column in the third row?

ANSWER TO QUIZ HOW IS THAT TOTAL CALCULATED?


SELECT Product_ID, Sale_Date, Daily_Sales,
AVG(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
ROWS 2 Preceding) AS AVG_3_ANSI
FROM Sales_Table ;
AVG of 48850.40, 54500.22, and 36000.07

QUIZ HOW IS THAT 4TH ROW CALCULATED?


SELECT Product_ID, Sale_Date, Daily_Sales,
AVG(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
ROWS 2 Preceding) AS AVG_3_ANSI
FROM Sales_Table ;

With a Moving Window of 3, how is the 43566.91 amount derived in the AVG_3_ANSI column in the fourth
row?

ANSWER TO QUIZ HOW IS THAT 4TH ROW CALCULATED?


SELECT Product_ID, Sale_Date, Daily_Sales,
AVG(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
ROWS 2 Preceding) AS AVG_3_ANSI
FROM Sales_Table ;

AVG of 54500.22, 36000.07, and 40200.43

With a Moving Window of 3, how is the 43566.91 amount derived in the AVG_3_ANSI column in the fourth
row? The current row plus Rows 2 Preceding.

MOVING AVERAGE EVERY 3-ROWS VS A CONTINUOUS AVERAGE


SELECT Product_ID, Sale_Date, Daily_Sales,
AVG(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
ROWS 2 Preceding) AS AVG3,
AVG(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
ROWS UNBOUNDED Preceding) AS Continuous
FROM Sales_Table;

The ROWS 2 Preceding gives the moving average (MAVG) for every 3 rows. The ROWS UNBOUNDED
Preceding gives the continuous (cumulative) average.

PARTITION BY RESETS AN ANSI OLAP

Use a PARTITION BY Statement to Reset the ANSI OLAP. The Partition By statement only resets the column
using the statement. Notice that only Continuous resets.

RANK DEFAULTS TO ASCENDING ORDER


SELECT Product_ID, Sale_Date, Daily_Sales,
RANK() OVER (ORDER BY Daily_Sales) AS Rank1
FROM Sales_Table
WHERE Product_ID IN (1000, 2000) ;

This is the RANK() OVER. It provides a rank for your queries. Notice how you do not place anything within the
() after the word RANK. Default Sort is ASC.
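RANK's behavior, ascending by default with ties sharing a rank and creating gaps, can be sketched as:

```python
def rank(values, descending=False):
    # Competition ranking: ties share a rank; the next rank skips ahead.
    ordered = sorted(values, reverse=descending)
    return [ordered.index(v) + 1 for v in values]

print(rank([30, 10, 20, 10]))  # [4, 1, 3, 1]
```

The two tied values both take rank 1, and the next value jumps to rank 3, exactly as RANK() OVER does.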

GETTING RANK TO SORT IN DESC ORDER


SELECT Product_ID, Sale_Date, Daily_Sales,
RANK() OVER (ORDER BY Daily_Sales DESC)
AS Rank1
FROM Sales_Table WHERE Product_ID IN (1000, 2000) ;
Is the query above in ASC mode or DESC mode for sorting?

RANK() OVER AND PARTITION BY


SELECT Product_ID, Sale_Date, Daily_Sales,
RANK() OVER (PARTITION BY Product_ID
ORDER BY Daily_Sales DESC)
AS Rank1
FROM Sales_Table WHERE Product_ID IN (1000, 2000) ;

What does the PARTITION Statement in the RANK() OVER do? It resets the rank.

RANK() OVER AND LIMIT


SELECT Product_ID, Sale_Date, Daily_Sales,
RANK() OVER (ORDER BY Daily_Sales DESC)
AS Rank1
FROM Sales_Table
WHERE Product_ID IN (1000, 2000)
Limit 6 ;

The Limit statement limits rows once the Rank has been calculated.

PERCENT_RANK() OVER
SELECT Product_ID, Sale_Date, Daily_Sales,
PERCENT_RANK() OVER (PARTITION BY PRODUCT_ID
ORDER BY Daily_Sales DESC) AS PercentRank1
FROM Sales_Table WHERE Product_ID in (1000, 2000) ;

We now have added a Partition statement which resets on Product_ID so this produces 7 rows for each of our
Product_IDs.

PERCENT_RANK() OVER WITH 14 ROWS IN CALCULATION


SELECT Product_ID, Sale_Date, Daily_Sales,
PERCENT_RANK()
OVER ( ORDER BY Daily_Sales DESC) AS PercentRank1
FROM Sales_Table WHERE Product_ID IN (1000, 2000) ;

PERCENT_RANK is just like RANK; however, it gives you the rank as a percentage of all the other rows, up to
100%.
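Standard PERCENT_RANK is computed as (rank - 1) / (rows - 1), a fraction from 0 to 1 that many tools display as a percentage; a sketch:

```python
def percent_rank(values):
    ordered = sorted(values)
    n = len(values)
    # (rank - 1) / (n - 1): the lowest value gets 0.0, the highest 1.0.
    return [ordered.index(v) / (n - 1) for v in values]

print(percent_rank([10, 20, 30, 40, 50]))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```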

PERCENT_RANK() OVER WITH 21 ROWS IN CALCULATION


SELECT Product_ID, Sale_Date, Daily_Sales,
PERCENT_RANK() OVER ( ORDER BY Daily_Sales DESC)
AS PercentRank1
FROM Sales_Table ;
PERCENT_RANK is just like RANK; however, it gives you the rank as a percentage of all the other rows, up to
100%.

QUIZ WHAT CAUSES THE PRODUCT_ID TO RESET?


SELECT Product_ID, Sale_Date, Daily_Sales,
PERCENT_RANK() OVER (PARTITION BY PRODUCT_ID
ORDER BY Daily_Sales DESC) AS PercentRank1
FROM Sales_Table WHERE Product_ID in (1000, 2000) ;

What caused the Product_IDs to reset?

ANSWER TO QUIZ WHAT CAUSED THE PRODUCT_ID TO RESET?


SELECT Product_ID, Sale_Date, Daily_Sales,
PERCENT_RANK() OVER (PARTITION BY PRODUCT_ID
ORDER BY Daily_Sales DESC) AS PercentRank1
FROM Sales_Table WHERE Product_ID in (1000, 2000) ;

What caused the Product_IDs to reset? It was the PARTITION BY statement.

COUNT OVER FOR A SEQUENTIAL NUMBER


SELECT Product_ID, Sale_Date, Daily_Sales,
COUNT(*) OVER (ORDER BY Product_ID, Sale_Date
ROWS UNBOUNDED PRECEDING) AS Seq_Number
FROM Sales_Table WHERE Product_ID IN (1000, 2000) ;
This is the COUNT OVER. It will provide a sequential number starting at 1. The Keyword(s) ROWS
UNBOUNDED PRECEDING causes Seq_Number to start at the beginning and increase sequentially to the end.

QUIZ WHAT CAUSED THE COUNT OVER TO RESET?


SELECT Product_ID, Sale_Date, Daily_Sales,
COUNT(*) OVER (PARTITION BY Product_ID
ORDER BY Product_ID, Sale_Date
ROWS UNBOUNDED PRECEDING) AS StartOver
FROM Sales_Table WHERE Product_ID IN (1000, 2000) ;

What Keyword(s) caused StartOver to reset?

ANSWER TO QUIZ WHAT CAUSED THE COUNT OVER TO RESET?


SELECT Product_ID, Sale_Date, Daily_Sales,
COUNT(*) OVER (PARTITION BY Product_ID
ORDER BY Product_ID, Sale_Date
ROWS UNBOUNDED PRECEDING) AS StartOver
FROM Sales_Table WHERE Product_ID IN (1000, 2000) ;
What Keyword(s) caused StartOver to reset? It is the PARTITION BY statement.

THE MAX OVER COMMAND


SELECT Product_ID, Sale_Date, Daily_Sales,
MAX(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
ROWS UNBOUNDED PRECEDING) AS MaxOver
FROM Sales_Table WHERE Product_ID IN (1000, 2000) ;

After the sort, the Max() Over shows the Max Value up to that point.

MAX OVER WITH PARTITION BY RESET


SELECT Product_ID, Sale_Date, Daily_Sales,
MAX(Daily_Sales) OVER (PARTITION BY Product_ID
ORDER BY Product_ID, Sale_Date
ROWS UNBOUNDED PRECEDING) AS MaxOver
FROM Sales_Table WHERE Product_ID IN (1000, 2000) ;

The largest value in the MaxOver column is 64300.00. Once it was reached, it did not carry through to the end
because of the PARTITION BY reset.

THE MIN OVER COMMAND


SELECT Product_ID, Sale_Date, Daily_Sales
,MIN(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
ROWS UNBOUNDED PRECEDING) AS MinOver
FROM Sales_Table WHERE Product_ID IN (1000, 2000) ;
After the sort, the MIN() Over shows the Min Value up to that point.

QUIZ FILL IN THE BLANK


SELECT Product_ID, Sale_Date, Daily_Sales,
MIN(Daily_Sales) OVER (PARTITION BY Product_ID
ORDER BY Product_ID, Sale_Date
ROWS UNBOUNDED PRECEDING) AS MinOver
FROM Sales_Table WHERE Product_ID IN (1000, 2000) ;

The last two answers (MinOver) are blank, so you can fill in the blank.

ANSWER FILL IN THE BLANK


SELECT Product_ID, Sale_Date, Daily_Sales,
MIN(Daily_Sales) OVER (PARTITION BY Product_ID
ORDER BY Product_ID, Sale_Date
ROWS UNBOUNDED PRECEDING) AS MinOver
FROM Sales_Table WHERE Product_ID IN (1000, 2000) ;
The last two answers (MinOver) are filled in.

THE ROW_NUMBER COMMAND


SELECT Product_ID, Sale_Date, Daily_Sales,
ROW_NUMBER() OVER
(ORDER BY Product_ID, Sale_Date) AS Seq_Number
FROM Sales_Table WHERE Product_ID IN (1000, 2000) ;

The ROW_NUMBER() Keyword(s) caused Seq_Number to increase sequentially. Notice that this does NOT
have a Rows Unbounded Preceding, and it still works!

QUIZ HOW DID THE ROW_NUMBER RESET?


SELECT Product_ID, Sale_Date, Daily_Sales,
ROW_NUMBER() OVER (PARTITION BY Product_ID
ORDER BY Product_ID, Sale_Date ) AS StartOver
FROM Sales_Table WHERE Product_ID IN (1000, 2000) ;

What Keyword(s) caused StartOver to reset?

ANSWER TO QUIZ HOW DID THE ROW_NUMBER RESET?


SELECT Product_ID, Sale_Date, Daily_Sales,
ROW_NUMBER() OVER (PARTITION BY Product_ID
ORDER BY Product_ID, Sale_Date ) AS StartOver
FROM Sales_Table WHERE Product_ID IN (1000, 2000) ;
What Keyword(s) caused StartOver to reset? It is the PARTITION BY statement.

STANDARD DEVIATION FUNCTIONS USING STDDEV / OVER


There are actually three different versions of the standard deviation:

STDDEV - returns the standard deviation of the expression, which is the square root of VARIANCE. When
VARIANCE returns null, this function returns null.

STDDEV_POP - computes the population standard deviation. This function is the same as the square root of the
VAR_POP function. When VAR_POP returns null, this function returns null.

STDDEV_SAMP - computes the sample standard deviation, which is the square root of VAR_SAMP. When
VAR_SAMP returns null, this function returns null.

The above information is an introduction to Standard Deviation in an OLAP statement.

STANDARD DEVIATION FUNCTIONS AND STDDEV / OVER SYNTAX


The following ANSI syntax is used with the three standard deviation functions:

{ STDDEV | STDDEV_POP | STDDEV_SAMP } (<column-name>) OVER
([ PARTITION BY <column-list> ]
ORDER BY <column-list>
[ ROWS [BETWEEN] { UNBOUNDED | <number> } PRECEDING
[ AND { UNBOUNDED | <number> } FOLLOWING ] ]
[ EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES |
EXCLUDE NO OTHERS ] ) ;

The above information is the syntax for standard deviation using the OVER clause. In order to provide the
moving functionality, it is necessary to have a method that designates the number of rows to include in the
STDDEV. It allows the function to calculate values from columns contained in rows that are before the current
row and also rows that are after the current row.

STDDEV / OVER EXAMPLE


Wow! What an amazing example.

VARIANCE / OVER SYNTAX


The following ANSI syntax is used with the variance functions:

{ VARIANCE | VAR_POP | VAR_SAMP } (<column-name>) OVER
([ PARTITION BY <column-list> ]
ORDER BY <column-list>
[ ROWS [BETWEEN] { UNBOUNDED | <number> } PRECEDING
[ AND { UNBOUNDED | <number> } FOLLOWING ] ]
[ EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES |
EXCLUDE NO OTHERS ] ) ;

The above information is the syntax for VARIANCE and VAR_SAMP using the OVER clause. In order to
provide the moving functionality, it is necessary to have a method that designates the number of rows to include
in the VARIANCE. It allows the function to calculate values from columns contained in rows that are before the
current row and also rows that are after the current row.

VARIANCE FUNCTIONS USING VARIANCE / OVER


VARIANCE Function
Returns the variance of the expression. If you apply this function to an empty set, it returns null. Redshift
calculates the expression as follows:

0 if the number of rows in expression = 1
VAR_SAMP if the number of rows in expression > 1

VARIANCE = (SUM(expr²) - (SUM(expr))² / COUNT(expr)) / (COUNT(expr) - 1)

VAR_POP Function
Returns the population variance of a set of numbers after discarding the nulls in this set. If you apply this
function to an empty set, it returns null.

VAR_POP = (SUM(expr²) - (SUM(expr))² / COUNT(expr)) / COUNT(expr)

VAR_SAMP Function
Returns the sample variance of a set of numbers after discarding the nulls in this set. If you apply this function to
an empty set, it returns null.

The above information is an introduction to the Variance functions used in conjunction with an OVER statement.

USING VARIANCE WITH PARTITION BY EXAMPLE


SELECT Product_ID AS "Prod", Sale_Date, Daily_Sales AS "Sales"
      ,VARIANCE(Daily_Sales) OVER ( ORDER BY sale_date
                                    ROWS unbounded PRECEDING) AS VAR_S
      ,VARIANCE(Daily_Sales) OVER ( PARTITION BY sale_date
                                    ORDER BY sale_date ROWS unbounded PRECEDING ) VAR_P
      ,VAR_POP(Daily_Sales) OVER ( ORDER BY sale_date
                                   ROWS unbounded PRECEDING) AS VAR_POP
      ,VAR_SAMP(Daily_Sales) OVER ( ORDER BY sale_date
                                    ROWS unbounded PRECEDING) AS VAR_SMP
FROM sql_class..sales_table
WHERE EXTRACT(MONTH FROM Sale_Date) = 9 ;

Wow! Another amazing example. The above example uses the variance functions to produce output sorted on the
sale date for the dates in September.

USING FIRST_VALUE AND LAST_VALUE


The FIRST_VALUE and LAST_VALUE functions allow you to specify sorted aggregate groups and return the
first or last value of each group. The function needs to know the length of the data at run time and does not allow
a decimal value.
Syntax for FIRST_VALUE and LAST_VALUE:

{FIRST_VALUE | LAST_VALUE} ({ <column reference> | <value expression> | *})


OVER
([PARTITION BY <column reference>[,...]]
ORDER BY {<column reference> [ASC | DESC] } [,...]]
[ ROWS | RANGE {{ CURRENT ROW | UNBOUNDED
| <literal value> PRECEDING} | BETWEEN {CURRENT ROW |
UNBOUNDED PRECEDING | <literal value> PRECEDING}
AND {CURRENT ROW
| UNBOUNDED FOLLOWING | <literal value> FOLLOWING}}]
[ EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES
| EXCLUDE NO OTHERS ] ) ;

The above provides information and the syntax for FIRST_VALUE and LAST_VALUE.
USING FIRST_VALUE
SELECT Last_name, first_name, dept_no
      ,FIRST_VALUE(first_name)
       OVER (ORDER BY dept_no, last_name desc
             rows unbounded preceding) AS "First All"
      ,FIRST_VALUE(first_name)
       OVER (PARTITION BY dept_no
             ORDER BY dept_no, last_name desc
             rows unbounded preceding) AS "First Partition"
FROM SQL_Class..Employee_Table;

The above example uses FIRST_VALUE to show you the very first first_name returned. It also uses the keyword
Partition to show you the very first first_name returned in each department.

USING LAST_VALUE
SELECT Last_name, first_name, dept_no
      ,LAST_VALUE(first_name)
       OVER (ORDER BY dept_no, last_name desc
             rows unbounded preceding) AS "Last All"
      ,LAST_VALUE(first_name)
       OVER (PARTITION BY dept_no
             ORDER BY dept_no, last_name desc
             rows unbounded preceding) AS "Last Partition"
FROM sql_class.Employee_Table;

FIRST_VALUE and LAST_VALUE are good to use anytime you need to propagate a value from one row to
all or multiple rows based on a sorted sequence. However, the output from the LAST_VALUE function appears
to be incorrect or incomplete until you understand a few concepts. The SQL request specifies "rows unbounded
preceding", so the window frame for each row ends at the current row. Within that frame, the current row is
always the last row, and therefore its own value appears in the output.
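The caveat can be shown directly: with a frame of ROWS UNBOUNDED PRECEDING, the frame ends at the current row, so the "last value" is always the current row's own value, while the first value stays fixed:

```python
def first_value_running(values):
    # FIRST_VALUE over an unbounded-preceding frame: fixed at the first row.
    return [values[0] for _ in values]

def last_value_running(values):
    # LAST_VALUE over a frame ending at the current row: always the current row.
    return list(values)

names = ["Adams", "Baker", "Chin"]
print(first_value_running(names))  # ['Adams', 'Adams', 'Adams']
print(last_value_running(names))   # ['Adams', 'Baker', 'Chin']
```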

USING LAG AND LEAD


The LAG and LEAD functions allow you to compare different rows of a table by specifying an offset from the
current row. You can use these functions to analyze change and variation.
Syntax for LAG and LEAD:

{LAG | LEAD} (<value expression>, [<offset> [, <default>]]) OVER


([PARTITION BY <column reference>[,...]]
ORDER BY <column reference> [ASC | DESC] [,...] );

Only he who attempts the ridiculous may achieve the impossible.


Don Quixote

The above provides information and the syntax for LAG and LEAD.

USING LEAD
SELECT last_name, dept_no
      ,lead(dept_no)
       over (order by dept_no, last_name) as "Lead All"
      ,lead(dept_no) over (Partition by dept_no
       order by dept_no, last_name) as "Lead Partition"
FROM employee_table;

As you can see, the first LEAD brings back the value from the next row, except for the last row, which has no
row following it. The offset value was not specified in this example, so it defaulted to 1 row.
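LEAD with the default offset of 1 amounts to shifting the column up one row, with None (SQL NULL) filling in at the end; a sketch:

```python
def lead(values, offset=1, default=None):
    # Value from `offset` rows ahead; rows near the end get the default (NULL).
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]

print(lead([100, 200, 300]))     # [200, 300, None]
print(lead([100, 200, 300], 2))  # [300, None, None]
```

The same helper run in reverse order of thinking gives LAG: the nulls simply move to the front instead of the end.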

USING LEAD WITH AN OFFSET OF 2


SELECT last_name, dept_no
      ,lead(dept_no,2)
       over (order by dept_no, last_name) as "Lead All"
      ,lead(dept_no,2) over (Partition by dept_no
       order by dept_no, last_name) as "Lead Partition"
FROM employee_table;
Above, each value in the first LEAD comes from 2 rows away, and the partitioned LEAD only returns a value
when the group contains at least one more row than the offset.

USING LAG
SELECT last_name, dept_no
      ,lag(dept_no)
       over (order by dept_no, last_name) as "Lag All"
      ,lag(dept_no)
       over (Partition by dept_no
       order by dept_no, last_name) as "Lag Partition"
FROM employee_table;

From the example above, you see that LAG uses the value from a previous row and makes it available in the current
row. For LAG, the first row(s) will contain a null based on the value of the offset; here it defaulted to 1. The first
null comes from the function itself (there is no preceding row), whereas the second row gets its null from the first
row, whose Dept_No is NULL.
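The leading-null behavior can be sketched the same way, again on SQLite (3.25+) through Python's sqlite3 module with made-up employees:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee_table (last_name TEXT, dept_no INTEGER);
INSERT INTO employee_table VALUES
  ('Chambers', 10), ('Smith', 10), ('Jones', 20), ('Larkins', 20), ('Reilly', 20);
""")
rows = con.execute("""
    SELECT last_name, dept_no,
           -- the first row of the whole ordered set has no previous row -> NULL
           LAG(dept_no) OVER (ORDER BY dept_no, last_name)  AS lag_all,
           -- the first row of EACH partition has no previous row -> NULL
           LAG(dept_no) OVER (PARTITION BY dept_no
                              ORDER BY dept_no, last_name)  AS lag_partition
    FROM employee_table
    ORDER BY dept_no, last_name
""").fetchall()
for row in rows:
    print(row)
```

Jones gets a null in the partitioned LAG even though a row precedes it overall, because Jones is the first row of the Dept_No 20 partition.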

USING LAG WITH AN OFFSET OF 2


SELECT last_name, dept_no,
lag(dept_no,2)
over (order by dept_no, last_name) as "Lag All",
lag(dept_no,2)
over (Partition by dept_no
order by dept_no, last_name) as "Lag Partition"
FROM employee_table;

For this example, the first two rows have a null because there is no row two rows before them. The number of
leading nulls will always equal the offset value. There is a third null because Jones's Dept_No is null.
Chapter 14 - Temporary Tables

I cannot imagine any condition which would cause this ship to founder. Modern shipbuilding has gone beyond
that.
- E. I. Smith, Captain of the Titanic

CREATING A DERIVED TABLE


Exists only within a query
Materialized by a SELECT Statement inside a query
Space comes from the User's Spool space
Deleted when the query ends

The SELECT Statement that creates and populates the Derived table is always inside Parentheses.

THE THREE COMPONENTS OF A DERIVED TABLE

A derived table will always have a SELECT query to materialize the derived table with data. The
SELECT query always starts with an open parenthesis and ends with a close parenthesis.

The derived table must be given a name. Above we called our derived table TeraTom.

You will need to define (alias) the columns in the derived table. Above we allowed Dept_No to default
to Dept_No, but we had to specifically alias AVG(Salary) as AVGSAL.

Every derived table must have the three components listed above.
NAMING THE DERIVED TABLE

In the example above, TeraTom is the name we gave the Derived Table. You must always name the derived table,
or the query errors.

ALIASING THE COLUMN NAMES IN THE DERIVED TABLE

AVGSAL is the name we gave to the column in our Derived Table that we call TeraTom. Our SELECT (which
builds the columns) shows we are only going to have one column in our derived table, and we have named that
column AVGSAL.

VISUALIZE THIS DERIVED TABLE

Our example above shows the data in the derived table named TeraTom. This query allows us to see each
employee and the plus or minus avg of their salary compared to the other workers in their department.

MOST DERIVED TABLES ARE USED TO JOIN TO OTHER TABLES


The first five columns in the Answer Set came from the Employee_Table. AVGSAL came from the derived table
named TeraTom.
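A sketch of this join, run on SQLite through Python's sqlite3 module; the employees and salaries are made up, and the columns are aliased inside the SELECT (one of the aliasing styles discussed here) because SQLite does not accept a column list after the table alias:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee_table (last_name TEXT, dept_no INTEGER, salary REAL);
INSERT INTO employee_table VALUES
  ('Jones', 10, 40000.0), ('Smith', 10, 60000.0), ('Chambers', 20, 50000.0);
""")
rows = con.execute("""
    SELECT e.last_name, e.dept_no, e.salary, t.avgsal
    FROM employee_table AS e
    INNER JOIN (SELECT dept_no, AVG(salary) AS avgsal   -- derived table is built first
                FROM employee_table
                GROUP BY dept_no) AS t
      ON e.dept_no = t.dept_no
    ORDER BY e.last_name
""").fetchall()
for row in rows:
    print(row)
```

Every employee row carries the AVGSAL of its own department alongside the real salary, ready for a plus-or-minus comparison.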

MULTIPLE WAYS TO ALIAS THE COLUMNS IN A DERIVED TABLE

OUR JOIN EXAMPLE WITH A DIFFERENT COLUMN ALIASING STYLE

COLUMN ALIASING CAN DEFAULT FOR NORMAL COLUMNS

TeraTom

Dept_No AVGSAL

? 32800.50

10 64300.00

100 48850.00
200 44944.44

300 40200.00

400 48333.33

The derived table


is built first

In a derived table, you will always have a SELECT query in parenthesis, and you will always name the table.
You have options when aliasing the columns. As in the example above, you can let normal columns default to
their current name.

CREATING A DERIVED TABLE USING THE WITH COMMAND

When using the WITH Command, we can CREATE our Derived table before running the main query. The only
issue here is that you can have only one WITH clause per query, although it can define multiple tables.

OUR JOIN EXAMPLE WITH THE WITH SYNTAX

Now, the lower portion of the query refers to TeraTom almost like it is a permanent table, but it is not!

with TeraTom as (select * from Student_Table)
SELECT *
FROM TeraTom
ORDER BY 1
LIMIT 5;

We're going to have the best-educated American people in the world.

Dan Quayle
The following example shows the simplest possible case of a query that contains a WITH clause. The WITH
query named TeraTom selects all of the rows from the Student_Table. The main query, in turn, selects all of the
rows from TeraTom. The TeraTom table exists only for the life of the query.
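The same query can be sketched on SQLite through Python's sqlite3 module; the student IDs and grade points are hypothetical sample data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE student_table (student_id INTEGER, grade_pt REAL);
INSERT INTO student_table VALUES
  (123250, 3.95), (125634, 2.00), (231222, 3.35),
  (234121, 3.50), (260000, 0.00), (280023, 3.00), (322133, 2.88);
""")
rows = con.execute("""
    WITH TeraTom AS (SELECT * FROM student_table)  -- built before the main query
    SELECT *
    FROM TeraTom
    ORDER BY 1
    LIMIT 5
""").fetchall()
for row in rows:
    print(row)
```

TeraTom vanishes as soon as the statement finishes; a second query referencing it would fail.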

A WITH CLAUSE THAT PRODUCES TWO TABLES


with Budget_Derived as
(SELECT Max(Budget) as Max_Budget
FROM Department_Table),
Emp_Derived as
(SELECT Dept_No, AVG(Salary) as Avg_Sal
FROM Employee_Table
GROUP BY Dept_No)

select E.*, Max_Budget - Budget as Under_Max, Avg_Sal


FROM Employee_Table as E
CROSS JOIN Budget_Derived
INNER JOIN
Emp_Derived
On E.Dept_No = Emp_Derived.Dept_No
INNER JOIN Department_Table as D
ON E.Dept_No = D.Dept_No;

The example above shows two tables created from the WITH statement.

THE SAME DERIVED QUERY SHOWN THREE DIFFERENT WAYS

QUIZ - ANSWER THE QUESTIONS


SELECT Dept_No, First_Name, Last_Name, AVGSAL
FROM Employee_Table
INNER JOIN
(SELECT Dept_No, AVG(Salary)
FROM Employee_Table
GROUP BY Dept_No) as TeraTom (Depty, AVGSAL)
ON Dept_No = Depty ;
1) What is the name of the derived table? __________

2) How many columns are in the derived table? _______

3) What is the name of the derived table columns? ______


4) Is there more than one row in the derived table? _______

5) What common keys join the Employee and Derived? _______

6) Why were the join keys named differently? ______________

ANSWER TO QUIZ - ANSWER THE QUESTIONS


SELECT Dept_No, First_Name, Last_Name, AVGSAL
FROM Employee_Table
INNER JOIN
(SELECT Dept_No, AVG(Salary)
FROM Employee_Table
GROUP BY Dept_No) as TeraTom (Depty, AVGSAL)
ON Dept_No = Depty ;
1) What is the name of the derived table? TeraTom

2) How many columns are in the derived table? 2

3) What's the name of the derived columns? Depty and AVGSAL

4) Is there more than one row in the derived table? Yes

5) What keys join the tables? Dept_No and Depty

6) Why were the join keys named differently? If both were named Dept_No, we would error unless we fully
qualified.

CLEVER TRICKS ON ALIASING COLUMNS IN A DERIVED TABLE

A DERIVED TABLE LIVES ONLY FOR THE LIFETIME OF A SINGLE QUERY

AN EXAMPLE OF TWO DERIVED TABLES IN A SINGLE QUERY
WITH T (Dept_No, AVGSAL) AS
(SELECT Dept_No, AVG(Salary) FROM Employee_Table
GROUP BY Dept_No)

SELECT T.Dept_No, First_Name, Last_Name,


AVGSAL, Counter
FROM Employee_Table as E
INNER JOIN
T
ON E.Dept_No = T.Dept_No
INNER JOIN

(SELECT Employee_No, SUM(1) OVER(PARTITION BY Dept_No


ORDER BY Dept_No, Last_Name Rows Unbounded Preceding)
FROM Employee_Table) as S (Employee_No, Counter)

ON E.Employee_No = S.Employee_No
ORDER BY T.Dept_No;

CREATE TABLE SYNTAX


CREATE [ [LOCAL ] { TEMPORARY | TEMP } ] TABLE table_name
( { column_name data_type [column_attributes] [ column_constraints ]
| table_constraints
| LIKE parent_table [ { INCLUDING | EXCLUDING } DEFAULTS ] }
[, ... ] )
[table_attribute]
where column_attributes are:
[ DEFAULT default_expr ]
[ IDENTITY ( seed, step ) ]
[ ENCODE encoding ]
[ DISTKEY ] [ SORTKEY ]
and column_constraints are:
[ { NOT NULL | NULL } ]
[ { UNIQUE | PRIMARY KEY } ]
[ REFERENCES reftable [ ( refcolumn ) ] ]
and table_constraints are:
[ UNIQUE ( column_name [, ... ] ) ]
[ PRIMARY KEY ( column_name [, ... ] ) ]
[ FOREIGN KEY (column_name [, ... ] ) REFERENCES reftable [ ( refcolumn ) ] ]
and table_attributes are:
[ DISTSTYLE { EVEN | KEY | ALL } ]
[ DISTKEY ( column_name ) ]
[ SORTKEY ( column_name [, ...] ) ]

The CREATE TABLE syntax creates a new table in the current database. The owner of this table is the issuer of the
CREATE TABLE command.

BASIC TEMPORARY TABLE EXAMPLES


When you create a temporary table, it is visible only within the current session. The table is automatically
dropped at the end of the session. Above, we use the pound sign (#) at the front of the table name to
automatically make the table a temporary table. We then populate the table with an Insert/Select.
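Here is a sketch of that pattern, run on SQLite through Python's sqlite3 module; the table and its rows are made up, and note that SQLite spells the Redshift `#` prefix as an explicit TEMP keyword:

```python
import sqlite3

con = sqlite3.connect(":memory:")   # one connection plays the role of a session
con.executescript("""
CREATE TABLE sales_table (order_no INTEGER, order_total REAL);
INSERT INTO sales_table VALUES (1, 100.00), (2, 250.00), (3, 975.50);

-- Redshift would accept:  CREATE TABLE #big_sales AS ...
-- SQLite uses the TEMP keyword instead; the effect is the same here:
CREATE TEMP TABLE big_sales AS
SELECT * FROM sales_table WHERE order_total > 150;
""")
rows = con.execute("SELECT * FROM big_sales ORDER BY order_no").fetchall()
for row in rows:
    print(row)
```

The temporary table is visible only to this connection; when the connection (session) closes, big_sales is gone while sales_table would persist in a file-backed database.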

MORE ADVANCED TEMPORARY TABLE EXAMPLES

When you create a temporary table, it is visible only within the current session. The table is automatically
dropped at the end of the session. A derived table only lasts the life of a single query, but a temporary table lasts
the entire session. This allows a user to run hundreds of queries against the temporary table. A temporary table
can have the same name as a permanent table, but I don't recommend this. You don't give a temporary table a
schema because it is automatically associated with the user's session. Once the session is over, the table and data
are dropped. If the user tries to query the table in another session, the system won't recognize the table. In other
words, the table doesn't exist outside of the current session it was created in.

ADVANCED TEMPORARY TABLE EXAMPLES

When you create a temporary table, it is visible only within the current session. The table is automatically
dropped at the end of the session. Above are some examples that allow you to define a different distkey, diststyle
and sortkey. Users (by default) are granted permission to create temporary tables by their automatic membership
in the PUBLIC group. To remove the privilege for any users to create temporary tables, revoke the TEMP
permission from the PUBLIC group, and then explicitly grant the permission to create temporary tables to
specific users or groups of users.
PERFORMING A DEEP COPY
A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table. If a table
has a large unsorted region, a deep copy is much faster than a vacuum. You can choose one of four methods to
create a copy of the original table:
1) Use the original table DDL. This is the best method for perfect reproduction.
2) Use CREATE TABLE AS (CTAS). If the original DDL is not available, you can use CREATE TABLE AS to
create a copy of the current table, then rename the copy. The new table will not inherit the encoding, distkey, sortkey,
not null, primary key, and foreign key attributes of the parent table.
3) Use CREATE TABLE LIKE. If the original DDL is not available, you can use CREATE TABLE LIKE to
recreate the original table. The new table will not inherit the primary key and foreign key attributes of the parent
table. The new table does, though, inherit the encoding, distkey, sortkey, and not null attributes of the parent
table.
4) Create a temporary table and truncate the original table. If you need to retain the primary key and foreign key
attributes of the parent table, you can use CTAS to create a temporary table, then truncate the original table and
populate it from the temporary table. This method is slower than CREATE TABLE LIKE because it requires two
insert statements.

A deep copy recreates and repopulates a table by using a bulk insert, which automatically sorts the table. If a
table has a large unsorted region, a deep copy is much faster than a vacuum. The difference is that you cannot
make concurrent updates during a deep copy operation which you can do during a vacuum. The next four slides
will show each technique with an example.

DEEP COPY USING THE ORIGINAL DDL


1) Use the original table DDL. This is the best method for perfect reproduction.
1. Create a copy of the table using the original CREATE TABLE DDL.

2. Use an INSERT INTO ... SELECT statement to populate the copy with data from the original table.

3. Drop the original table.

4. Use an ALTER TABLE statement to rename the copy to the original table name.

A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table.
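The four steps can be sketched end-to-end on SQLite through Python's sqlite3 module; the Sales_Table DDL and rows are hypothetical stand-ins for the original DDL:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Step 1: the original table, plus a copy built from the same DDL
CREATE TABLE sales_table      (order_no INTEGER, order_total REAL);
CREATE TABLE sales_table_copy (order_no INTEGER, order_total REAL);
INSERT INTO sales_table VALUES (1, 100.00), (2, 250.00);

-- Step 2: bulk insert the rows into the copy
INSERT INTO sales_table_copy SELECT * FROM sales_table;

-- Step 3: drop the original
DROP TABLE sales_table;

-- Step 4: rename the copy back to the original name
ALTER TABLE sales_table_copy RENAME TO sales_table;
""")
rows = con.execute("SELECT * FROM sales_table ORDER BY order_no").fetchall()
for row in rows:
    print(row)
```

After the rename, queries against Sales_Table behave exactly as before, but the table was rebuilt by a bulk insert.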

DEEP COPY USING A CTAS


2) Use CREATE TABLE AS (CTAS). If the original DDL is not available, you can use CREATE TABLE AS to
create a copy of the current table, then rename the copy. The new table will not inherit the encoding, distkey, sortkey,
not null, primary key, and foreign key attributes of the parent table.
1. Create a copy of the original table by using CREATE TABLE AS to select the rows from the original table.

2. Drop the original table.

3. Use an ALTER TABLE statement to rename the new table to the original table.
The following example performs a deep copy on the Sales_Table using a duplicate of the Sales_Table
named Sales_Table_Copy.

CREATE TABLE Sales_Table_Copy as (select * from Sales_Table) ;

DROP TABLE Sales_Table ;

ALTER TABLE Sales_Table_Copy rename to Sales_Table ;

A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table.

DEEP COPY USING A CREATE TABLE LIKE


3) Use CREATE TABLE LIKE. If the original DDL is not available, you can use CREATE TABLE LIKE to
recreate the original table. The new table will not inherit the primary key and foreign key attributes of the parent
table. The new table does, though, inherit the encoding, distkey, sortkey, and not null attributes of the parent table.
1. Create a new table using CREATE TABLE LIKE.

2. Use an INSERT INTO ... SELECT statement to copy the rows from the current table to the new table.

3. Drop the current table.

4. Use an ALTER TABLE statement to rename the new table to the original table.
The following example performs a deep copy on the Sales_Table using a duplicate of the Sales_Table
named Sales_Table_Copy.

CREATE TABLE Sales_Table_Copy (like Sales_Table);

INSERT INTO Sales_Table_Copy (select * from Sales_Table);

DROP TABLE Sales_Table;

ALTER TABLE Sales_Table_Copy RENAME to Sales_Table;

A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table.

DEEP COPY BY CREATING A TEMP TABLE AND TRUNCATING THE ORIGINAL


4) Create a temporary table and truncate the original table. If you need to retain the primary key and foreign key
attributes of the parent table, you can use CTAS to create a temporary table, then truncate the original table and
populate it from the temporary table. This method is slower than CREATE TABLE LIKE because it requires two
insert statements.
1. Use CREATE TABLE AS to create a temporary table with the rows from the original table.

2. Truncate the current table.

3. Use an INSERT INTO ... SELECT statement to copy the rows from the temporary table to the original table.

4. Drop the temporary table.


The following example performs a deep copy on the Sales_Table using a duplicate of the Sales_Table
named Sales_Table_Copy.
CREATE Temp Table Sales_Table_Copy as select * from Sales_Table ;

TRUNCATE Sales_Table ;

Insert Into Sales_Table (select * from Sales_Table_Copy);

DROP Table Sales_Table_Copy;

A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table.

Chapter 15 - Sub-query Functions

An invasion of Armies can be resisted, but not an idea whose time has come.
- Victor Hugo

AN IN LIST IS MUCH LIKE A SUBQUERY

This query is very simple and easy to understand. It uses an IN List to find all Employees who are in Dept_No
100 or Dept_No 200.

AN IN LIST NEVER HAS DUPLICATES JUST LIKE A SUBQUERY


What is going on with this IN List? Why in the world are there duplicates in there? Will this query even work?
What will the result set look like? Turn the page!

AN IN LIST IGNORES DUPLICATES

Duplicate values are ignored here. We got the same rows back as before, and it is as if the system ignored the
duplicate values in the IN List. That is exactly what happened.

THE SUBQUERY

The query above is a Subquery which means there are multiple queries in the same SQL. The bottom query runs
first, and its purpose in life is to build a distinct list of values that it passes to the top query. The top query then
returns the result set. This query solves the problem: Show all Employees in Valid Departments!

THE THREE STEPS OF HOW A BASIC SUBQUERY WORKS

The bottom query runs first and builds a distinct IN list. Then the top query runs using the list.
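The two steps are easy to watch on SQLite through Python's sqlite3 module; the departments and employees below are made up, including one employee in an invalid department and one with a NULL Dept_No:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE department_table (dept_no INTEGER, department_name TEXT);
CREATE TABLE employee_table   (last_name TEXT, dept_no INTEGER);
INSERT INTO department_table VALUES (100, 'Sales'), (200, 'Research');
INSERT INTO employee_table VALUES
  ('Jones', 100), ('Smith', 200), ('Chambers', 999), ('Reilly', NULL);
""")
rows = con.execute("""
    SELECT last_name, dept_no
    FROM employee_table
    WHERE dept_no IN (SELECT dept_no FROM department_table)  -- bottom query runs first
    ORDER BY last_name
""").fetchall()
for row in rows:
    print(row)
```

Chambers (Dept_No 999) and Reilly (NULL) fall out because their departments are not in the list the bottom query built.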

THESE ARE EQUIVALENT QUERIES


Both queries above are the same. Query 2 has values in an IN list. Query 1 runs a subquery to build the values in
the IN list.

THE FINAL ANSWER SET FROM THE SUBQUERY

QUIZ- ANSWER THE DIFFICULT QUESTION

How are Subqueries similar to Joins between two tables?

A great question was asked above. Do you know the key to answering? Turn the page!

ANSWER TO QUIZ- ANSWER THE DIFFICULT QUESTION

How are Subqueries similar to Joins between two tables?


A Subquery between two tables or a Join between two tables will each need a common key that represents the
relationship. This is called a Primary Key/Foreign Key relationship.
A Subquery will use a common key linking the two tables together very similar to a join! When subquerying
between two tables, look for the common link between the two tables. Most of the time they both have a column
with the same name but not always.

SHOULD YOU USE A SUBQUERY OR A JOIN?

If you only want to see a report where the final result set has only columns from one table, use a Subquery.
Obviously, if you need columns on the report where the final result set has columns from both tables, you have to
do a Join.

QUIZ- WRITE THE SUBQUERY

Write the Subquery


Select all columns in the Customer_Table if the customer has placed an order!

Here is your opportunity to show how smart you are. Write a Subquery that will bring back everything from the
Customer_Table if the customer has placed an order in the Order_Table. Good luck! Advice: Look for the
common key among both tables!

ANSWER TO QUIZ- WRITE THE SUBQUERY

Write the Subquery


Select all columns in the Customer_Table if the customer has placed an order!

The common key among both tables is Customer_Number. The bottom query runs first and delivers a distinct list
of Customer_Numbers which the top query uses in the IN List!
QUIZ- WRITE THE MORE DIFFICULT SUBQUERY

Write the Subquery


Select all columns in the Customer_Table if the customer has placed an order over $10,000.00 Dollars!

Here is your opportunity to show how smart you are. Write a Subquery that will bring back everything from the
Customer_Table if the customer has placed an order in the Order_Table that is greater than $10,000.00.

ANSWER TO QUIZ- WRITE THE MORE DIFFICULT SUBQUERY

Write the Subquery


Select all columns in the Customer_Table if the customer has placed an order over $10,000.00 Dollars!

Here is your answer!

QUIZ- WRITE THE SUBQUERY WITH AN AGGREGATE

Write the Subquery


Select all columns in the Employee_Table if the employee makes a greater Salary than the AVERAGE Salary.

Another opportunity knocking! Would someone please answer the query door?

ANSWER TO QUIZ- WRITE THE SUBQUERY WITH AN AGGREGATE


Write the Subquery
Select all columns in the Employee_Table if the employee makes a greater Salary than the AVERAGE Salary.
SELECT *
FROM Employee_Table
WHERE Salary > (
SELECT AVG(Salary)
FROM Employee_Table) ;

QUIZ- WRITE THE CORRELATED SUBQUERY

Write the Correlated Subquery


Select all columns in the Employee_Table if the employee makes a greater Salary than
the AVERAGE Salary (within their own Department).

Another opportunity knocking! This is a tough one, and only the best get this written correctly.

ANSWER TO QUIZ- WRITE THE CORRELATED SUBQUERY

Write the Correlated Subquery


Select all columns in the Employee_Table if the employee makes a greater Salary than
the AVERAGE Salary (within their own Department).
SELECT *
FROM Employee_Table as EE
WHERE Salary > (
SELECT AVG(Salary)
FROM Employee_Table as EEEE
WHERE EE.Dept_No = EEEE.Dept_No) ;

THE BASICS OF A CORRELATED SUBQUERY


The Top Query is Co-Related (Correlated) with the Bottom Query.

The table name from the top query and the table name from the bottom query are given a different alias.

The bottom query WHERE clause co-relates Dept_No from Top and Bottom.

The top query is run first.

The bottom query is run one time for each distinct value delivered from the top query.
SELECT *
FROM Employee_Table as EE
WHERE Salary > (
SELECT AVG(Salary)
FROM Employee_Table as EEEE
WHERE EE.Dept_No = EEEE.Dept_No) ;

A correlated subquery breaks all the rules. It is the top query that runs first. Then, the bottom query is run one
time for each distinct column in the bottom WHERE clause. In our example, this is the column Dept_No. This is
because in our example, the WHERE clause is comparing the column Dept_No. After the top query runs and
brings back its rows, the bottom query will run one time for each distinct Dept_No. If this is confusing, it is not
you. These take a little time to understand, but I have a plan to make you an expert. Keep reading!

THE TOP QUERY ALWAYS RUNS FIRST IN A CORRELATED SUBQUERY

CORRELATED SUBQUERY EXAMPLE VS. A JOIN WITH A DERIVED


TABLE
Both queries above will bring back all employees making a salary that is greater than the average salary in their
department. The biggest difference is that the Join with the Derived Table also shows the Average Salary in the
result set.

QUIZ- A SECOND CHANCE TO WRITE A CORRELATED SUBQUERY

Select all columns in the Sales_Table if the Daily_Sales column is greater than the Average Daily_Sales within
its own Product_ID.

Another opportunity knocking! This is your second chance. I will even give you a third chance.

ANSWER - A SECOND CHANCE TO WRITE A CORRELATED SUBQUERY


Select all columns in the Sales_Table if the Daily_Sales column is greater than the Average Daily_Sales within
its own Product_ID.
SELECT * FROM Sales_Table as TopS
WHERE Daily_Sales > (
SELECT AVG(Daily_Sales)
FROM Sales_Table as BotS
WHERE TopS.Product_ID = BotS.Product_ID)
ORDER BY Product_ID, Sale_Date ;
QUIZ- A THIRD CHANCE TO WRITE A CORRELATED SUBQUERY

Select all columns in the Sales_Table if the Daily_Sales column is greater than the Average Daily_Sales within
its own Sale_Date.

Another opportunity knocking! There is just one minor adjustment and you are home free.

ANSWER - A THIRD CHANCE TO WRITE A CORRELATED SUBQUERY


Select all columns in the Sales_Table if the Daily_Sales column is greater than the Average Daily_Sales within
its own Sale_Date.
SELECT * FROM Sales_Table as TopS
WHERE Daily_Sales > (
SELECT AVG(Daily_Sales)
FROM Sales_Table as BotS
WHERE TopS.Sale_Date = BotS.Sale_Date)
ORDER BY Sale_Date ;

QUIZ- LAST CHANCE TO WRITE A CORRELATED SUBQUERY


Select all columns in the Student_Table if the Grade_Pt column is greater than the Average Grade_Pt within its
own Class_Code.

Another opportunity knocking! There is just one minor adjustment and you are home free.

ANSWER LAST CHANCE TO WRITE A CORRELATED SUBQUERY


Select all columns in the Student_Table if the Grade_Pt column is greater than the Average Grade_Pt within its
own Class_Code.
SELECT * FROM Student_Table as TopS
WHERE Grade_Pt > (
SELECT AVG(Grade_Pt)
FROM Student_Table as BotS
WHERE TopS.Class_Code = BotS.Class_Code )
ORDER BY Class_Code ;

QUIZ- WRITE THE NOT SUBQUERY

Write the Subquery

Select all columns in the Customer_Table if the


Customer has NOT placed an order.

Another opportunity knocking! Write the above query!


ANSWER TO QUIZ- WRITE THE NOT SUBQUERY

Select all columns in the Customer_Table if the


Customer has NOT placed an order.

QUIZ- WRITE THE SUBQUERY USING A WHERE CLAUSE

Write the Subquery


Select all columns in the Order_Table that were placed
by a customer with Bill anywhere in their name.

Another opportunity to show your brilliance is ready for you to make it happen.

ANSWER - WRITE THE SUBQUERY USING A WHERE CLAUSE

Write the Subquery


Select all columns in the Order_Table that were placed
by a customer with Bill anywhere in their name.
SELECT *
FROM Order_Table
WHERE Customer_Number IN
(SELECT Customer_Number FROM Customer_Table
WHERE Customer_Name ilike '%Bill%') ;

Great job on writing your query just like the above.

QUIZ- WRITE THE SUBQUERY WITH TWO PARAMETERS

Write the Subquery


What is the highest dollar order for each Customer?
This Subquery will involve two parameters!

Get ready to be amazed at either yourself or the Answer on the next page!

ANSWER TO QUIZ- WRITE THE SUBQUERY WITH TWO PARAMETERS

Write the Subquery


What is the highest dollar order for each Customer?
This Subquery will involve two parameters!

This is how you utilize multiple parameters in a Subquery! Turn the page for more.

HOW THE DOUBLE PARAMETER SUBQUERY WORKS

SELECT Customer_Number, Order_Number, Order_Total


FROM Order_Table
WHERE (Customer_Number, Order_Total) IN
(SELECT Customer_Number, MAX(Order_Total)
FROM Order_Table GROUP BY 1) ;

The bottom query runs first returning two columns. Next page for more info!

MORE ON HOW THE DOUBLE PARAMETER SUBQUERY WORKS

The IN list is built and the top query can now process for the final Answer Set.

QUIZ WRITE THE TRIPLE SUBQUERY


Write the Subquery
What is the Customer_Name who has the highest dollar order
among all customers? This query will have multiple Subqueries!

Good luck in writing this. Remember that this will involve multiple Subqueries.

ANSWER TO QUIZ WRITE THE TRIPLE SUBQUERY

Write the Subquery


What is the Customer_Name who has the highest dollar order
among all customers? This query will have multiple Subqueries!

The query is above and, of course, the answer is XYZ Plumbing.

QUIZ HOW MANY ROWS RETURN ON A NOT IN WITH A NULL?

How many rows return from the query now that a


NULL value is in a Customer_Number?

We really didn't place a new row inside the Order_Table with a NULL value for the Customer_Number column,
but in theory, if we had, how many rows would return?

ANSWER HOW MANY ROWS RETURN ON A NOT IN WITH A NULL?


How many rows return from the query now that a
NULL value is in a Customer_Number?

ZERO rows will return

The answer is no rows come back. This is because when you have a NULL value in a NOT IN list, the system
doesn't know the value of NULL, so it returns nothing.

HOW TO HANDLE A NOT IN WITH POTENTIAL NULL VALUES

How many rows return NOW from the query? 1 Acme Products

You can utilize a WHERE clause that tests to make sure Customer_Number IS NOT NULL. This should be used
when a NOT IN could encounter a NULL.
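Both the trap and the fix can be sketched on SQLite through Python's sqlite3 module; the two customers and the NULL order below are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer_table (customer_number INTEGER, customer_name TEXT);
CREATE TABLE order_table    (customer_number INTEGER, order_total REAL);
INSERT INTO customer_table VALUES
  (11111111, 'Billys Best Choice'), (31313131, 'Acme Products');
INSERT INTO order_table VALUES (11111111, 12347.53), (NULL, 5111.47);
""")
# NOT IN against a list containing a NULL: zero rows come back
no_rows = con.execute("""
    SELECT customer_name FROM customer_table
    WHERE customer_number NOT IN
      (SELECT customer_number FROM order_table)
""").fetchall()
# Filtering the NULLs out of the subquery restores the expected answer
one_row = con.execute("""
    SELECT customer_name FROM customer_table
    WHERE customer_number NOT IN
      (SELECT customer_number FROM order_table
       WHERE customer_number IS NOT NULL)
""").fetchall()
print(no_rows, one_row)
```

The IS NOT NULL guard in the subquery is all it takes to bring Acme Products back.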

USING A CORRELATED EXISTS

Use EXISTS to find which Customers have placed an Order?


SELECT Customer_Number, Customer_Name
FROM Customer_Table as Top1
WHERE EXISTS
(SELECT * FROM Order_Table as Bot1
Where Top1.Customer_Number = Bot1.Customer_Number ) ;

The EXISTS command will determine via a Boolean if something is True or False. If a customer placed an order,
it EXISTS, and using the Correlated Exists statement, only customers who have placed an order will return in the
answer set. EXISTS is different from IN, as it is less restrictive, as you will soon understand.

HOW A CORRELATED EXISTS MATCHES UP

SELECT Customer_Number, Customer_Name


FROM Customer_Table as Top1
WHERE EXISTS
(SELECT * FROM Order_Table as Bot1
Where Top1.Customer_Number = Bot1.Customer_Number ) ;
Customer_Number Customer_Name

11111111 Billys Best Choice

31323134 ACE Consulting

57896883 XYZ Plumbing

87323456 Databases N-U

Only customers who placed an order return with the above Correlated EXISTS.

THE CORRELATED NOT EXISTS

Use NOT EXISTS to find which Customers have NOT placed an Order?
SELECT Customer_Number, Customer_Name
FROM Customer_Table as Top1
WHERE NOT EXISTS
(SELECT * FROM Order_Table as Bot1
Where Top1.Customer_Number = Bot1.Customer_Number ) ;

The NOT EXISTS command will also determine via a Boolean if something is True or False. If a customer has no
matching order, nothing EXISTS for that customer, and using the Correlated NOT EXISTS statement, only
customers who have NOT placed an order will return in the answer set.

THE CORRELATED NOT EXISTS ANSWER SET

Use NOT EXISTS to find which Customers have NOT placed an Order?
SELECT Customer_Number, Customer_Name
FROM Customer_Table as Top1
WHERE NOT EXISTS
(SELECT * FROM Order_Table as Bot1
Where Top1.Customer_Number = Bot1.Customer_Number ) ;
Customer_Number Customer_Name

31313131 Acme Products

The only customer who did NOT place an order was Acme Products.

QUIZ HOW MANY ROWS COME BACK FROM THIS NOT EXISTS?

SELECT Customer_Number, Customer_Name


FROM Customer_Table as Top1
WHERE NOT EXISTS
(SELECT * FROM Order_Table as Bot1
Where Top1.Customer_Number = Bot1.Customer_Number ) ;
How many rows return from the query?

A NULL value in a list for queries with NOT IN returned nothing, but you must now decide if that is also true for
the NOT EXISTS. How many rows will return?

ANSWER HOW MANY ROWS COME BACK FROM THIS NOT EXISTS?

SELECT Customer_Number, Customer_Name


FROM Customer_Table as Top1
WHERE NOT EXISTS
(SELECT * FROM Order_Table as Bot1
Where Top1.Customer_Number = Bot1.Customer_Number ) ;
How many rows return from the query?
One row
Acme Products

NOT EXISTS is unaffected by a NULL in the list; that's why it is more flexible!
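This resilience can be sketched on SQLite through Python's sqlite3 module; the same hypothetical customers as before, with a NULL Customer_Number sitting in the Order_Table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer_table (customer_number INTEGER, customer_name TEXT);
CREATE TABLE order_table    (customer_number INTEGER, order_total REAL);
INSERT INTO customer_table VALUES
  (11111111, 'Billys Best Choice'), (31313131, 'Acme Products');
INSERT INTO order_table VALUES (11111111, 12347.53), (NULL, 5111.47);
""")
rows = con.execute("""
    SELECT customer_number, customer_name
    FROM customer_table AS Top1
    WHERE NOT EXISTS
      (SELECT * FROM order_table AS Bot1
       WHERE Top1.customer_number = Bot1.customer_number)
""").fetchall()
print(rows)
```

Unlike the NOT IN version, the NULL order row never matches any customer, so NOT EXISTS still returns Acme Products.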
Chapter 16 - Substrings and Positioning Functions

It's always been and always will be the same in the world: the horse does the work and the coachman is tipped.
- Anonymous

THE TRIM COMMAND TRIMS BOTH LEADING AND TRAILING SPACES


Query 1
SELECT Last_Name
,Trim(Last_Name) AS No_Spaces
FROM Employee_Table ;
Query 2
SELECT Last_Name
,Trim(Both from Last_Name) AS No_Spaces
FROM Employee_Table ;
Both queries above do the exact same thing. They remove spaces from the beginning and the end of the column
Last_Name.

Both queries trim both the leading and trailing spaces from Last_Name.

A VISUAL OF THE TRIM COMMAND USING CONCATENATION

When you use the TRIM command on a column, that column will have all beginning and ending spaces
removed.

TRIM AND TRAILING IS CASE SENSITIVE

For LEADING and TRAILING TRIM commands, the characters being trimmed are case sensitive.

HOW TO TRIM TRAILING LETTERS

The above example removed the trailing y from the First_Name and the trailing g from the Last_Name.
Remember that this is case sensitive.

THE SUBSTRING COMMAND

First_Name Quiz

Squiggy qui

John ohn

Richard ich

Herbert erb

Mandee and

Cletus let

William ill

Billy ill

Loraine ora

This is a SUBSTRING. The substring is passed two parameters, and they are the starting position of the string
and the number of positions to return (from the starting position). The above example will start in position 2 and
go for 3 positions!
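A sketch of this on SQLite through Python's sqlite3 module, using a few of the sample first names; note SQLite spells SUBSTRING as SUBSTR with comma-separated arguments instead of Redshift's FROM ... FOR ... form:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee_table (first_name TEXT);
INSERT INTO employee_table VALUES ('Squiggy'), ('John'), ('Richard');
""")
rows = con.execute("""
    SELECT first_name,
           SUBSTR(first_name, 2, 3) AS quiz,      -- start at 2, go for 3
           SUBSTR(first_name, 2)    AS go_to_end  -- no length: run to the end
    FROM employee_table
    ORDER BY first_name
""").fetchall()
for row in rows:
    print(row)
```

The second column matches the book's Quiz output, and the third shows the go-to-the-end behavior from the next page.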
HOW SUBSTRING WORKS WITH NO ENDING POSITION

First_Name GoToEnd

Squiggy quiggy

John ohn

Richard ichard

Herbert erbert

Mandee andee

Cletus letus

William illiam

Billy illy

Loraine oraine

If you don't tell the Substring the end position, it will go all the way to the end.

USING SUBSTRING TO MOVE BACKWARDS

First_Name Before1

Squiggy Squig

John John

Richard Richa

Herbert Herbe

Mandee Mande

Cletus Cletu

William Willi

Billy Billy

Loraine Lorai
A starting position of zero moves one space in front of the beginning. Notice that our FOR Length is 6 so
Squiggy turns into Squig. The point being made here is that both the starting position and ending positions
can move backwards which will come in handy as you see other examples.

HOW SUBSTRING WORKS WITH A STARTING POSITION OF -1

First_Name Before2

Squiggy S

John J

Richard R

Herbert H

Mandee M

Cletus C

William W

Billy B

Loraine L

A starting position of -1 moves two spaces in front of the beginning. Notice that our FOR Length is 3, so each
name delivers only the first initial. The point being made here is that both the starting position and ending
positions can move backwards which will come in handy as you see other examples.

HOW SUBSTRING WORKS WITH AN ENDING POSITION OF 0

First_Name WhatsUp

Squiggy

John

Richard

Herbert

Mandee

Cletus
William

Billy

Loraine

In our example above, we start in position 3, but we go for zero positions, so nothing is delivered in the column.
That is what's up!

THE POSITION COMMAND FINDS A LETTERS POSITION


SELECT Last_Name
,Position ('e' in Last_Name) AS Find_The_E
,Position ('f' in Last_Name) AS Find_The_F
FROM Employee_Table ;

This is the position counter. What it will do is tell you what position a letter is in. Why did Jones have a 4 in the
result set? The e is in the 4th position. Why did Smith get a zero for both columns? There is no e and no f in
Smith. If a letter appears more than once, only the first occurrence is reported.
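A sketch on SQLite through Python's sqlite3 module; SQLite spells POSITION('e' IN Last_Name) as INSTR(last_name, 'e'), and the two last names are from the book's samples:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee_table (last_name TEXT);
INSERT INTO employee_table VALUES ('Jones'), ('Smith');
""")
rows = con.execute("""
    SELECT last_name,
           INSTR(last_name, 'e') AS find_the_e,  -- Redshift: POSITION('e' IN Last_Name)
           INSTR(last_name, 'f') AS find_the_f
    FROM employee_table
    ORDER BY last_name
""").fetchall()
for row in rows:
    print(row)
```

Jones reports 4 for the e and 0 for the missing f; Smith reports 0 for both.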

QUIZ FIND THAT SUBSTRING STARTING POSITION


SELECT DISTINCT Department_Name as Dept_Name,
SUBSTRING(Department_Name FROM
POSITION(' ' IN Department_Name) +1) as Word2
FROM Department_Table
WHERE POSITION(' ' IN trim(Department_Name)) >0;
Dept_Name Word2

Customer Support Support

Human Resources Resources

Research and Develop and Develop


What is the Starting Position here?

What is the Starting position of the Substring in the above query? Hint: This only looks for a Dept_Name that
has two words or more.

ANSWER TO QUIZ FIND THAT SUBSTRING STARTING POSITION


SELECT DISTINCT Department_Name as Dept_Name
,SUBSTRING(Department_Name FROM
POSITION(' ' IN Department_Name) +1) as Word2
FROM Department_Table
WHERE POSITION(' ' IN TRIM(Department_Name)) > 0 ;
Dept_Name Word2

Customer Support Support

Human Resources Resources

Research and Develop and Develop


What is the Starting Position here?
The Starting Position is calculated by finding the position of the first SPACE and then adding 1.
Customer Support (FROM 10)
Human Resources (FROM 7)
Research and Develop (FROM 10)

What is the Starting position of the Substring in the above query? See above!

USING THE SUBSTRING TO FIND THE SECOND WORD ON


SELECT DISTINCT Department_Name as Dept_Name
,SUBSTRING(Department_Name FROM
POSITION(' ' IN Department_Name) +1) as Word2
FROM Department_Table
WHERE POSITION(' ' IN TRIM(Department_Name)) > 0 ;
Dept_Name Word2

Customer Support Support

Human Resources Resources

Research and Develop and Develop

Notice we only had three rows come back. That is because our WHERE clause looks only for Department_Names that
have multiple words. Then, notice that the starting position of the SUBSTRING is a POSITION expression that looks
for the first space. Then, it adds 1 to that position, and we have the starting position for the 2nd word. We don't
give a FOR length parameter, so it goes to the end.

QUIZ WHY DID ONLY ONE ROW RETURN


SELECT Department_Name
,SUBSTRING(Department_Name from
POSITION(' ' IN Department_Name) + 1 +
POSITION(' ' IN SUBSTRING(Department_Name
FROM POSITION(' ' IN Department_Name) + 1))) as Third_Word
FROM Department_Table
WHERE POSITION(' ' IN
TRIM(SUBSTRING(Department_Name from
POSITION(' ' in Department_Name) + 1))) > 0 ;
Dept_Name Third_Word

Research and Develop Develop

Why did only one row come back?

ANSWER TO QUIZ WHY DID ONLY ONE ROW RETURN


SELECT Department_Name
,SUBSTRING(Department_Name from
POSITION(' ' IN Department_Name) + 1 +
POSITION(' ' IN SUBSTRING(Department_Name
FROM POSITION(' ' IN Department_Name) + 1))) as Third_Word
FROM Department_Table
WHERE POSITION(' ' IN
TRIM(SUBSTRING(Department_Name from
POSITION(' ' in Department_Name) + 1))) > 0 ;
Dept_Name Third_Word

Research and Develop Develop

It has 3 words

Why did only one row come back? It's the only Department_Name with three words. The SUBSTRING and the
WHERE clause both look for the first space, and if they find it, they look for the second space. If they find that,
they add 1 to it, and the starting position is the third word. There is no FOR length, so it defaults to go to the
end.

CONCATENATION
See those || ? Those represent concatenation, which allows you to combine multiple columns into one column. The
Pipe Symbol (|) on your keyboard is just above the ENTER key. Don't put a space in between; just put two
Pipe Symbols together. In this example, we have combined the first name, then a single space, and then the last
name to get a new column called Full Name, like Squiggy Jones.
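The example itself is not shown; a sketch consistent with the description would be (table name and alias are assumptions):

```sql
SELECT First_Name || ' ' || Last_Name AS Full_Name
FROM Student_Table ;
```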

CONCATENATION AND SUBSTRING

Of the three items being concatenated together, what is the first item of concatenation in the example above? The
first initial of the First_Name. Then, we concatenated a literal space and a period. Then, we concatenated the
Last_Name.

FOUR CONCATENATIONS TOGETHER

Why did we TRIM the Last_Name? To get rid of the trailing spaces; otherwise the output would have looked odd. How
many items are being concatenated in the example above? There are 4 items concatenated. We start with the
Last_Name (after we trim it), then we have a single space, then we have the first initial of the First_Name, and
then we have a period.
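The four-item example is not shown, but a sketch consistent with the description would be (the alias and table name are assumptions):

```sql
SELECT TRIM(Last_Name) || ' '
    || SUBSTRING(First_Name FROM 1 FOR 1) || '.' AS Short_Name
FROM Student_Table ;
```

For Squiggy Jones this would deliver Jones S.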

TROUBLESHOOTING CONCATENATION

What happened above to cause the error? Can you see it? The Pipe Symbols have a space between them like | |,
when it should be ||. It is a tough one to spot, so be careful.
DECLARING A CURSOR

The above example declares a cursor named TeraTom to select sales information from the Sales_Table and then
fetch rows from the result set using the cursor.
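The cursor example is not shown; a minimal sketch consistent with the description would be as follows. The column list is an assumption, and note that Redshift requires cursors to run inside a transaction block:

```sql
BEGIN;

DECLARE TeraTom CURSOR FOR
SELECT Product_ID, Sale_Date, Daily_Sales
FROM Sales_Table ;

FETCH FORWARD 10 FROM TeraTom ;

CLOSE TeraTom ;
COMMIT;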

Chapter 17 Interrogating the Data

"The difference between genius and stupidity is that genius has its limits"
- Albert Einstein

QUIZ WHAT WOULD THE ANSWER BE?

Sample_Table

Class_Code Grade_Pt

Fr 0

Can you guess what would return in the Answer Set?

Use the fake table above called Sample_Table, and try to predict what the Answer would be if this query were
run on the system.
ANSWER TO QUIZ WHAT WOULD THE ANSWER BE?

Sample_Table

Class_Code Grade_Pt

Fr 0

Can you guess what would return in the Answer Set?

Error Division by zero

You get an error when you DIVIDE by ZERO! Let's turn the page and fix it!

THE NULLIFZERO COMMAND

SELECT Class_Code
,Grade_Pt / ( NULLIFZERO (Grade_Pt) * 2 ) AS Math1
FROM Sample_Table;

What the NULLIFZERO does is make a zero into a NULL. So, the answer set you'd get from this is a simple
FR, and then a NULL value, usually represented by a ?. If you have a calculation where a ZERO could kill the
operation, and you don't want that, you can use the NULLIFZERO command to convert any zero value to a
NULL value.
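NULLIFZERO is shorthand; the same protection can be sketched with the standard NULLIF function, which this chapter covers shortly:

```sql
SELECT Class_Code
,Grade_Pt / ( NULLIF(Grade_Pt, 0) * 2 ) AS Math1
FROM Sample_Table;
```

NULLIF(Grade_Pt, 0) returns NULL whenever Grade_Pt is zero, so the division delivers NULL instead of erroring.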

QUIZ FILL IN THE BLANK VALUES IN THE ANSWER SET

SELECT NULLIFZERO (Cust_No) AS Cust_No
,NULLIFZERO (Acc_Balance) AS Acc_Balance
,NULLIFZERO (Location) AS Location
FROM Sample_Table ;
Fill in the Answer Set above after looking at the table and the query.

Okay! Time to show me your brilliance! What would the Answer Set produce?

ANSWER TO QUIZ FILL IN THE BLANK VALUES IN THE ANSWER SET

SELECT NULLIFZERO (Cust_No) AS Cust_No
,NULLIFZERO (Acc_Balance) AS Acc_Balance
,NULLIFZERO (Location) AS Location
FROM Sample_Table ;

Here is the answer set! How'd you do? The NULLIFZERO command found a zero in Cust_No, so it made it
Null. The others were not zero, so they retained their values. The only time NULLIFZERO changes data is if it
finds a zero, and then it changes it to null.

QUIZ FILL IN THE ANSWERS FOR THE NULLIF COMMAND

SELECT NULLIF(Cust_No, 0) AS Cust1
,NULLIF(Cust_No, 3) AS Cust2
,NULLIF(Acc_Balance, 0) AS Acc1
,NULLIF(Acc_Balance, 3) AS Acc2
,NULLIF(Location, 0) AS Loc1
,NULLIF(Location, 3) AS Loc2
FROM Sample_Table;

Fill in the Answer Set above after looking at the table and the query.

You can also use the NULLIF(). What you are asking Redshift to do is to NULL the answer if the COLUMN
matches the number in the parentheses. What would the above Answer Set produce from your analysis?

ANSWER TO QUIZ FILL IN THE ANSWERS FOR THE NULLIF COMMAND


SELECT NULLIF(Cust_No, 0) AS Cust1
,NULLIF(Cust_No, 3) AS Cust2
,NULLIF(Acc_Balance, 0) AS Acc1
,NULLIF(Acc_Balance, 3) AS Acc2
,NULLIF(Location, 0) AS Loc1
,NULLIF(Location, 3) AS Loc2
FROM Sample_Table;

Look at the answers above, and if it doesn't make sense, go over it again until it does.

THE ZEROIFNULL COMMAND

SELECT ZEROIFNULL (Cust_No) as Cust
,ZEROIFNULL (Acc_Balance) as Balance
,ZEROIFNULL (Location) as Location
FROM Sample_Table;

Fill in the Answer Set above after looking at the table and the query.

This is the ZEROIFNULL. What it will do is put a zero into a place where a NULL shows up. What would the
Answer Set produce?

ANSWER TO THE ZEROIFNULL QUESTION

SELECT ZEROIFNULL (Cust_No) as Cust


,ZEROIFNULL (Acc_Balance) as Balance
,ZEROIFNULL (Location) as Location
FROM Sample_Table ;

The answer set placed a zero in the place of the NULL Acc_Balance, but the other values didn't change because
they were NOT Null.
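ZEROIFNULL can likewise be sketched with standard SQL using COALESCE, which is covered next (the aliases mirror the query above):

```sql
SELECT COALESCE(Cust_No, 0)     as Cust
,COALESCE(Acc_Balance, 0) as Balance
,COALESCE(Location, 0)    as Location
FROM Sample_Table;
```

COALESCE(column, 0) returns the column when it is not Null and 0 when it is, which is exactly what ZEROIFNULL does.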
THE COALESCE COMMAND

SELECT Last_Name
,COALESCE (Home_Phone, Work_Phone, Cell_Phone) as Phone
FROM Sample_Table ;
Last_Name Phone

Fill in the Answer Set above after looking at the table and the query.

Coalesce returns the first non-Null value in a list, and if all values are Null, returns Null.

THE COALESCE ANSWER SET

SELECT Last_Name
,COALESCE (Home_Phone, Work_Phone, Cell_Phone) as Phone
FROM Sample_Table ;
Last_Name Phone

Jones 555-1234

Patel 456-7890

Gonzales 354-0987

Nguyen ?

Coalesce returns the first non-Null value in a list, and if all values are Null, returns Null.

THE COALESCE QUIZ

SELECT Last_Name
,COALESCE (Home_Phone, Work_Phone, Cell_Phone, 'No Phone') as Phone
FROM Sample_Table ;
Last_Name Phone

Fill in the Answer Set above after looking at the table and the query.

Coalesce returns the first non-Null value in a list, and if all values are Null, returns Null. Since we decided in the
above query we don't want NULLs, notice we have placed a literal 'No Phone' in the list. How will this affect
the Answer Set?

ANSWER THE COALESCE QUIZ

SELECT Last_Name
,COALESCE (Home_Phone, Work_Phone, Cell_Phone, 'No Phone') as Phone
FROM Sample_Table ;
Last_Name Phone

Jones 555-1234

Patel 456-7890

Gonzales 354-0987

Nguyen No Phone

Answers are above! We put a literal in the list so there's no chance of a NULL returning.

THE BASICS OF CAST (CONVERT AND STORE)


CAST will convert a column or value's data type
temporarily into another data type. Below is the syntax:
SELECT CAST(<column-name> AS <data-type>[(<length>)] )
FROM <table-name> ;

Examples using CAST:


CAST ( <smallint-data> AS CHAR(5) ) /* convert smallint to character */
CAST ( <decimal-data> AS INTEGER ) /* truncates decimals */
CAST ( <byteint-data> AS SMALLINT ) /* convert binary to smallint */
CAST ( <char-data> AS BYTE (128) ) /* convert character to binary */
CAST ( <byteint-data> AS VARCHAR(5) ) /* convert byteint to character */
CAST ( <integer-data> AS FLOAT ) /* convert integer to float point */

Data can be converted from one type to another by using the CAST function. As long as the data involved does
not break any data rules (i.e. placing alphabetic or special characters into a numeric data type), the conversion
works. The name of the CAST function comes from the Convert And STore operation that it performs.

SOME GREAT CAST (CONVERT AND STORE) EXAMPLES


SELECT CAST('ABCDE' AS CHAR(1) ) AS Trunc
,CAST(128 AS CHAR(3) ) AS OK
,CAST(127 AS INTEGER ) AS Bigger ;

The first CAST truncates the five characters (left to right) to form the single character A. In the second CAST,
the integer 128 is converted to three characters and left justified in the output. The 127 was initially stored as a
SMALLINT (5 digits, up to 32767) and then converted to an INTEGER. Hence, it uses 11 character positions
for its display: ten numeric digits and a sign (positive assumed), right justified as numeric.

SOME GREAT CAST (CONVERT AND STORE) EXAMPLES


SELECT CAST(121.53 AS SMALLINT) AS Whole
,CAST(121.53 AS DECIMAL(3,0)) AS Rounder ;
Whole Rounder

121 122

The value of 121.53 was initially stored as a DECIMAL as 5 total digits with 2 of them to the right of the
decimal point. Then, it is converted to a SMALLINT using CAST to remove the decimal positions. Therefore, it
truncates data by stripping off the decimal portion. It does not round data using this data type. On the other hand,
the CAST in the column called Rounder is converted to a DECIMAL as 3 digits with no digits (3,0) to the
right of the decimal, so it rounds data values instead of truncating. Since .53 is greater than .5, it is rounded
up to 122.

SOME GREAT CAST (CONVERT AND STORE) EXAMPLES


SELECT Order_Number as OrdNo
,Customer_Number as CustNo
,Order_Date
,Order_Total
,CAST(Order_Total as integer) as Chopped
,CAST(Order_Total as Decimal(5,0)) as Rounded
FROM Order_Table ;

The column Chopped takes Order_Total (a DECIMAL(10,2)) and CASTs it as an integer, which chops off the
decimals. Rounded CASTs Order_Total as a DECIMAL(5,0), which takes the decimals and rounds up if the
decimal is .50 or above.

THE BASICS OF THE CASE STATEMENTS

Sample_Table

Course_Name Credits

Tera-Tom on SQL 1

SELECT Course_Name
,CASE Credits
WHEN 1 THEN 'One Credit'
WHEN 2 THEN 'Two Credits'
WHEN 3 THEN 'Three Credits'
END AS CreditAlias
FROM Sample_Table ;
Course_Name CreditAlias

Fill in the Answer Set above after looking at the table and the query.

This is a CASE statement, which allows you to evaluate a column in your table and, from that, come up with
a new answer for your report. Every CASE begins with a CASE, and they all must end with a corresponding
END. What would the answer be?

THE BASICS OF THE CASE STATEMENT

Sample_Table

Course_Name Credits

Tera-Tom on SQL 1

SELECT Course_Name
,CASE Credits
WHEN 1 THEN 'One Credit'
WHEN 2 THEN 'Two Credits'
WHEN 3 THEN 'Three Credits'
END AS CreditAlias
FROM Sample_Table ;
Course_Name CreditAlias

Tera-Tom on SQL One Credit

This is a CASE statement, which allows you to evaluate a column in your table and, from that, come up with
a new answer for your report. Every CASE begins with a CASE, and they all must end with a corresponding
END. Since Credits is 1, the first WHEN matches and the answer is 'One Credit'.

VALUED CASE VS. A SEARCHED CASE

The second example is better unless you have a simple query like the first example.
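The two examples referenced above are not shown, but sketches of each form might look like this (Course_Table comes from the quiz that follows; the literals are assumptions):

```sql
-- Valued (simple) CASE: one column is compared to literal values
SELECT Course_Name
,CASE Credits
    WHEN 1 THEN 'One Credit'
    WHEN 2 THEN 'Two Credits'
    ELSE 'Many Credits'
 END AS CreditAlias
FROM Course_Table ;

-- Searched CASE: each WHEN carries its own full condition
SELECT Course_Name
,CASE
    WHEN Credits = 1 THEN 'One Credit'
    WHEN Credits = 2 THEN 'Two Credits'
    ELSE 'Many Credits'
 END AS CreditAlias
FROM Course_Table ;
```

The searched form can test different columns and ranges in each WHEN, which is why it is the more flexible choice for anything beyond simple equality.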

QUIZ - VALUED CASE STATEMENT

Look at the CASE Statement and look at the Course_Table, and fill in the Answer Set.

ANSWER - VALUED CASE STATEMENT


Above is the full answer set.

QUIZ - SEARCHED CASE STATEMENT

Look at the CASE Statement and look at the Course_Table, and fill in the Answer Set.

ANSWER - SEARCHED CASE STATEMENT

Above is the full answer set.

QUIZ - WHEN NO ELSE IS PRESENT IN CASE STATEMENT


Notice now that we have a 4 under the Credit column. However, in our CASE statement, we don't have
instructions on what to do if the number is 4. What will occur?

ANSWER - WHEN NO ELSE IS PRESENT IN CASE STATEMENT

A null value occurs when the evaluation falls through the CASE and there is no ELSE statement. Notice above
that we have a 4 under the Credit column. However, in our CASE statement, we don't have instructions on
what to do if the number is 4. That is why the null value is in the report.

WHEN AN ELSE IS PRESENT IN CASE STATEMENT

Notice now that we have a 4 under the Credit column. However, in our CASE statement, we don't have
instructions on what to do if the number is 4. What will occur?

ANSWER - WHEN AN ELSE IS PRESENT IN CASE STATEMENT


Since our value of 4 fell through the CASE statement, the ELSE statement kicked in and we delivered 'Don''t
Know'. Notice the two single quotes that provided the apostrophe in the word Don't.

WHEN AN ALIAS IS NOT USED IN A CASE STATEMENT

Notice now that we don't have an ALIAS for the CASE statement. What will the system place in there for the
column title?

ANSWER - WHEN AN ALIAS IS NOT USED IN A CASE STATEMENT

Notice now that we don't have an ALIAS for the CASE statement. The title given by default is <CASE
Expression>. That is why you should ALIAS your CASE statements.

COMBINING SEARCHED CASE AND VALUED CASE


The query above uses both a Valued CASE and a Searched CASE. That's ALLOWED!

NESTED CASE
SELECT Last_Name
,CASE Class_Code
WHEN 'JR' THEN 'Jr'
||(CASE WHEN Grade_pt < 2 THEN 'Failing'
WHEN Grade_pt < 3.5 THEN 'Passing'
ELSE 'Exceeding'
END)
ELSE 'Sr'
||(CASE WHEN Grade_pt < 2 THEN 'Failing'
WHEN Grade_pt < 3.5 THEN 'Passing'
ELSE 'Exceeding'
END)
END AS Status
FROM Student_Table WHERE Class_Code IN ('JR','SR')
ORDER BY Class_Code, Last_Name;
Last_Name Status

Bond Jr Exceeding

McRoberts Jr Failing

Delaney Sr Passing

Phillips Sr Passing

A NESTED CASE occurs when you have a CASE statement within another CASE statement. Notice the double
Pipe Symbols (||) that provide concatenation.

PUT A CASE IN THE ORDER BY
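The example for this section is not shown; a sketch of a CASE inside the ORDER BY would look like this (the class-order mapping is an assumption):

```sql
SELECT Last_Name
,Class_Code
FROM Student_Table
ORDER BY CASE Class_Code
    WHEN 'FR' THEN 1
    WHEN 'SO' THEN 2
    WHEN 'JR' THEN 3
    WHEN 'SR' THEN 4
    ELSE 5
 END ;
```

Because the CASE produces a number for each row, the rows sort in class order (freshmen first) instead of alphabetically by Class_Code.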


Chapter 18 - View Functions

"Be the change that you want to see in the world."


-Mahatma Gandhi

CREATING A SIMPLE VIEW TO RESTRICT SENSITIVE COLUMNS

CREATE View Employee_V AS


SELECT Employee_No
,First_Name
,Last_Name
,Dept_No
FROM Employee_Table ;

The purposes of views are to restrict access to certain columns, to derive columns or join tables, and to restrict
access to certain rows (if a WHERE clause is used). This view does not allow the user to see the column Salary.

CREATING A SIMPLE VIEW TO RESTRICT ROWS

CREATE VIEW Employee_View


AS
SELECT First_Name
,Last_Name
,Dept_No
,Salary
FROM Employee_Table
WHERE Dept_No IN (300, 400) ;

The purposes of views are to restrict access to certain columns, to derive columns or join tables, and to restrict
access to certain rows (if a WHERE clause is used). This view does not allow the user to see information about
rows unless the rows have a Dept_No of either 300 or 400.

CREATING A VIEW TO JOIN TABLES TOGETHER

This view is designed to join two tables together. By creating a view, we have now made it easier for the user
community to join these tables by merely selecting the columns you want from the view. The view exists now in
the database sql_views and accesses the tables in sql_class.
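The join view itself is not shown; a sketch consistent with the description would be as follows (the view name and column list are assumptions; the schemas sql_views and sql_class come from the text):

```sql
CREATE VIEW sql_views.employee_dept_v AS
SELECT e.Employee_No
,e.Last_Name
,e.First_Name
,d.Department_Name
FROM sql_class.Employee_Table AS e
INNER JOIN sql_class.Department_Table AS d
ON e.Dept_No = d.Dept_No ;
```

Users then simply SELECT from the view, and the join happens behind the scenes.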

YOU SELECT FROM A VIEW

Once the view is created, then users can query them with a SELECT statement. Above, we have queried the view
we created to join the employee_table to the department_table (created on previous page). Users can select all
columns with an asterisk, or they can choose individual columns (separated by a comma). Above, we selected all
columns from the view.

BASIC RULES FOR VIEWS


1. All aggregation needs to have an ALIAS
2. Any derived columns (such as math) need an ALIAS
Above are the basic rules of Views with an excellent example. Redshift allows views to be created with an
ORDER BY statement. In our example above, we did NOT include an ORDER BY statement to create the view.
We allow the users to perform the ORDER BY when they SELECT from the view.

AN ORDER BY EXAMPLE INSIDE OF A VIEW

Redshift allows for an ORDER BY statement in the creation of a view. In the example above, notice that we have
an ORDER BY statement inside the view creation. When the user selects from the view, the data comes back
already sorted.

AN ORDER BY INSIDE OF A VIEW THAT IS QUERIED DIFFERENTLY

Redshift allows for an ORDER BY statement in the creation of a view. In the example above, notice that we have
an ORDER BY statement inside the view creation. In our second example, where the user selects from the view,
they also put in a different ORDER BY statement. The data comes back sorted by Class_Code.

CREATING A VIEW WITH ORDERED ANALYTICS


This view is used to create a Cumulative Sum, Moving Sum, Moving Average, and Moving Difference on the
sales_table. Users will now be able to query this view with a simple SELECT statement. Views are designed to
take the complexity out of querying for the majority of the user community. We are allowed to have an ORDER
BY statement in the above creation of the view only because the ORDER BY statement is part of the ordered
analytic.
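The view itself is not shown; a sketch consistent with the description would be as follows (the view name, aliases, and window frames are assumptions; Sales_Table comes from the text):

```sql
CREATE VIEW sales_analytics_v AS
SELECT Product_ID
,Sale_Date
,Daily_Sales
,SUM(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
      ROWS UNBOUNDED PRECEDING) AS Cum_Sum
,SUM(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
      ROWS 2 PRECEDING)         AS Moving_Sum
,AVG(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
      ROWS 2 PRECEDING)         AS Moving_Avg
FROM Sales_Table ;
```

The ORDER BY clauses here live inside the OVER () of each ordered analytic, which is why they are allowed in the view creation.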

CREATING A VIEW WITH THE TOP COMMAND

This view is used to find the top 3 salaried employees in the employee_table. Notice that the view creation has an
ORDER BY statement. This is another exception to the rule that you can't have an ORDER BY statement in a
view creation. The reason is that the TOP command goes with ORDER BY like bread goes with butter. This view
actually selects all the data from the employee_table. Then, the system sorts the data with the ORDER BY
statement so that the rows show the largest to the smallest salaries. Then, only the top 3 salaried employees are
selected.

CREATING A VIEW WITH THE LIMIT COMMAND

This view is used to find the top 3 students with the highest grade points. Notice that the view creation has an
ORDER BY statement. This is another exception to the rule that you can't have an ORDER BY statement in a
view creation. The reason is that the LIMIT command goes with ORDER BY like bread goes with butter. This
view actually selects all the data from the student_table. Then, the system sorts the data with the ORDER BY
statement so that the rows show the highest to the lowest Grade_Pt's. Then, only the top 3 students are selected.
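The view itself is not shown; a sketch consistent with the description would be (the view name is an assumption):

```sql
CREATE VIEW Top3_Students_v AS
SELECT *
FROM Student_Table
ORDER BY Grade_Pt DESC
LIMIT 3 ;
```

The ORDER BY sorts the grade points from highest to lowest, and LIMIT 3 keeps only the top three rows.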

ALTERING A TABLE
CREATE VIEW Emp_HR_v AS
SELECT Employee_No
,Dept_No
,Last_Name
,First_Name
FROM Employee_Table ;
Altering the actual Table

This view will run after the table has added an additional column!

ALTERING A TABLE AFTER A VIEW HAS BEEN CREATED


CREATE VIEW Emp_HR_v4 AS
SELECT *
FROM Employee_Table4 ;
Altering the actual Table

This view runs after the table has added an additional column, but it won't include Mgr_No in the view results
even though there is a SELECT * in the view. The view includes only the columns present when the view was
CREATED.

A VIEW THAT ERRORS AFTER AN ALTER


CREATE VIEW Emp_HR_v5 AS
SELECT Employee_No
,Dept_No
,Last_Name
,First_Name
FROM Employee_Table5 ;
Altering the actual Table

This view will NOT run after the table has dropped a column referenced in the view.

TROUBLESHOOTING A VIEW
CREATE VIEW Emp_HR_v6 AS
SELECT *
FROM Employee_Table6 ;
Altering the actual Table

This view will NOT run after the table has dropped a column referenced in the view, even though the view was
CREATED with a SELECT *. At view CREATE time, the columns present were the only ones the view
considered itself responsible for, and Dept_No was one of those columns. Once Dept_No was dropped, the view no
longer works.

UPDATING DATA IN A TABLE THROUGH A VIEW


CREATE VIEW Emp_HR_v8 AS
SELECT *
FROM Employee_Table8;
Updating the table
through the View
UPDATE Emp_HR_V8
SET Salary = 88888.88
WHERE Employee_No = 2000000;
Will the View still run?
SELECT *
FROM Employee_Table8
WHERE Employee_No = 2000000;

You can UPDATE a table through a View if you have the RIGHTS to do so.

Chapter 19 - Set Operators Functions

"The man who doesn't read good books has no advantage over the man who can't read them."
-Mark Twain

RULES OF SET OPERATORS


1. Each query will have at least two SELECT statements separated by a SET operator
2. SET operators are UNION, INTERSECT, or EXCEPT/MINUS
3. Must specify the same number of columns from the same domain (data type/range)
4. If using aggregates, both SELECTs must have their own GROUP BY
5. Both SELECTs must have a FROM clause
6. The first SELECT is used for all ALIAS, TITLE, and FORMAT statements
7. The second SELECT will have the ORDER BY statement, which must be a number
8. When there are multiple operators, the order of precedence is INTERSECT, UNION, and EXCEPT/MINUS
9. Parentheses can change the order of precedence
10. Duplicate rows are eliminated in the spool unless the ALL keyword is used

INTERSECT EXPLAINED LOGICALLY

SELECT * FROM Table_Red


INTERSECT
SELECT * FROM Table_Blue ;

In this example, what numbers in the answer set would come from the query above ?

INTERSECT EXPLAINED LOGICALLY

SELECT * FROM Table_Red


INTERSECT
SELECT * FROM Table_Blue ;

In this example, only the number 3 was in both tables so they INTERSECT

UNION EXPLAINED LOGICALLY

SELECT * FROM Table_Red


UNION
SELECT * FROM Table_Blue ;

In this example, what numbers in the answer set would come from the query above ?

UNION EXPLAINED LOGICALLY


SELECT * FROM Table_Red
UNION
SELECT * FROM Table_Blue ;

1 2 3 4 5

Both top and bottom queries run simultaneously, then the two different spools files are merged to eliminate
duplicates and place the remaining numbers in the answer set.

UNION ALL EXPLAINED LOGICALLY

SELECT * FROM Table_Red


UNION ALL
SELECT * FROM Table_Blue ;

In this example, what numbers in the answer set would come from the query above ?

UNION ALL EXPLAINED LOGICALLY

SELECT * FROM Table_Red


UNION ALL
SELECT * FROM Table_Blue ;

1 2 3 3 4 5

Both top and bottom queries run simultaneously. Then, the two different spools files are merged together to build
the answer set. The ALL prevents eliminating Duplicates.

EXCEPT EXPLAINED LOGICALLY

SELECT * FROM Table_Red


EXCEPT
SELECT * FROM Table_Blue ;
EXCEPT and MINUS do the exact
same thing so either word will work!

In this example, what numbers in the answer set would come from the query above ?

EXCEPT EXPLAINED LOGICALLY

SELECT * FROM Table_Red


EXCEPT
SELECT * FROM Table_Blue ;
1 2

The Top query SELECTED 1, 2, 3 from Table_Red. From that point on, only 1, 2, 3 at most could come back.
The bottom query is run on Table_Blue, and if there are any matches, they are not ADDED to the 1, 2, 3 but
instead take away either the 1, 2, or 3.

MINUS EXPLAINED LOGICALLY

SELECT * FROM Table_Blue


MINUS
SELECT * FROM Table_Red ;
EXCEPT and MINUS do the exact
same thing, so either word will work

What will the answer set be? Notice I changed the order of the tables in the query!

MINUS EXPLAINED LOGICALLY

SELECT * FROM Table_Blue


MINUS
SELECT * FROM Table_Red ;
4 5
The Top query SELECTED 3, 4, 5 from Table_Blue. From that point on, only 3, 4, 5 at most could come back.
The bottom query is run on Table_Red, and if there are any matches, they are not ADDED to the 3, 4, 5 but
instead take away either the 3, 4, or 5.

TESTING YOUR KNOWLEDGE

Will the result set be the same for both queries above?

Will both queries bring back the exact same result set? Check out the next page to find out.

ANSWER - TESTING YOUR KNOWLEDGE

Will the result set be the same for both queries above?
Yes

Both queries above are exactly the same to the system and produce the same result set.

TESTING YOUR KNOWLEDGE

Will the result set be the same for both queries above?

Will both queries bring back the exact same result set? Check out the next page to find out.

ANSWER - TESTING YOUR KNOWLEDGE


Will the result set be the same for both queries above?

No! The first query returns 4, 5, and the query on the right returns 1, 2.

AN EQUAL AMOUNT OF COLUMNS IN BOTH SELECT LIST

You must have an equal number of columns in both SELECT lists. This is because data is compared from the two
spool files, and duplicates are eliminated. So, for comparison purposes, there must be an equal number of
columns in both queries.

COLUMNS IN THE SELECT LIST SHOULD BE FROM THE SAME DOMAIN

The above query works without error, but no data is returned. There are no First Names that are the same as
Department Names. This is like comparing Apples to Oranges. That means they are NOT in the same Domain.

THE TOP QUERY HANDLES ALL ALIASES


The Top Query is responsible for ALIASING.

THE BOTTOM QUERY DOES THE ORDER BY (A NUMBER)

The Bottom Query is responsible for sorting, but the ORDER BY statement must be a number, which represents
column1, column2, column3, etc.

GREAT TRICK: PLACE YOUR SET OPERATOR IN A DERIVED TABLE


SELECT Employee_No AS MANAGER
,Trim(Last_Name) || ', ' || First_Name as "Name"
FROM Employee_Table
INNER JOIN
(SELECT Employee_No
FROM Employee_Table
INTERSECT
SELECT Mgr_No
FROM Department_Table)
AS TeraTom (empno)
ON Employee_No = empno
ORDER BY "Name" ;
MANAGER Name

1256349 Harrison, Herbert

1333454 Smith, John

1000234 Smythe, Richard

1121334 Strickling, Cletus

The Derived Table gave us the empno for all managers, and we were able to join it.

UNION VS. UNION ALL


SELECT Department_Name, Dept_No
FROM Department_Table
UNION ALL
SELECT Department_Name, Dept_No
FROM Department_Table
ORDER BY 1;
UNION eliminates duplicates, but UNION ALL does not.

A GREAT EXAMPLE OF HOW EXCEPT WORKS

SELECT Dept_No as Department_Number


FROM Department_Table
EXCEPT
SELECT Dept_No
FROM Employee_Table
ORDER BY 1 ;
Department_Number
500

This query brought back all Departments without any employees.

Chapter 20 Statistical Aggregate Functions

"You can make more friends in two months by becoming interested in other people than you will in two years by
trying to get other people interested in you."
-Dale Carnegie

THE STATS TABLE

Above is the Stats_Table data which we will use in our statistical examples.

STDDEV
The query below returns the sample standard deviation for the Daily_Sales column in the Sales_Table. The
Daily_Sales column is a DECIMAL. The scale of the result is reduced to 10 digits.
SELECT CAST(STDDEV(Daily_Sales) as dec(18,10)) FROM Sales_Table ;
stddev
-----------------------
13389.6235806995

CASTING STDDEV_SAMP AND SQRT (VAR_SAMP)


The query below returns both the sample standard deviation and the square root of the sample variance for the
Daily_Sales column in the Sales_Table. Notice that the results of these calculations are the same.
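The query itself is not shown; a sketch consistent with the description would be (the aliases are assumptions):

```sql
SELECT CAST(STDDEV_SAMP(Daily_Sales)    as dec(18,10)) AS SD_Samp
,CAST(SQRT(VAR_SAMP(Daily_Sales)) as dec(18,10)) AS Root_Of_Var
FROM Sales_Table ;
```

Both columns come back identical, because the standard deviation is by definition the square root of the variance.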

THE STDDEV_POP FUNCTION


SELECT STDDEV_POP(col1) AS SDPCol1
FROM Stats_Table;

The standard deviation function is a statistical measure of spread or dispersion of values. It is the square root of
the variance (the average of the squared differences from the mean). This measure is used to compare the amount
by which a set of values differs from the arithmetic mean.

The STDDEV_POP function is one of two that calculate the standard deviation. The population consists of all the
rows included based on the comparison in the WHERE clause.
Syntax for using STDDEV_POP:

STDDEV_POP(<column-name>)

A STDDEV_POP EXAMPLE

THE STDDEV_SAMP FUNCTION


SELECT STDDEV_SAMP(col1) AS SDSCol1
FROM Stats_Table;

The standard deviation function is a statistical measure of spread or dispersion of values. It is the square root of
the variance (the average of the squared differences from the mean). This measure is used to compare the amount
by which a set of values differs from the arithmetic mean.

The STDDEV_SAMP function is one of two that calculate the standard deviation. The sample version treats the
rows returned by the WHERE clause as a sample of a larger population (dividing by n-1), whereas the population
version treats those rows as the entire population (dividing by n).
Syntax for using STDDEV_SAMP:

STDDEV_SAMP(<column-name>)

A STDDEV_SAMP EXAMPLE

THE VAR_POP FUNCTION


SELECT VAR_POP(col1) AS VPCol1
FROM Stats_Table;
The Variance function is a measure of dispersion (spread of the distribution) and is the square of the standard
deviation. There are two forms of Variance in Redshift; VAR_POP is for the entire population of data rows
allowed by the WHERE clause.

Although standard deviation and variance are regularly used in statistical calculations, the meaning of variance is
not easy to elaborate. Most often variance is used in theoretical work where a variance of the sample is needed.

There are two methods for using variance. These are the Kruskal-Wallis one-way Analysis of Variance and
Friedman two-way Analysis of Variance by rank.
Syntax for using VAR_POP:

VAR_POP(<column-name>)

A VAR_POP EXAMPLE

THE VAR_SAMP FUNCTION


SELECT VAR_SAMP(col1) AS VSCol1
FROM Stats_Table;

The Variance function is a measure of dispersion (spread of the distribution) and is the square of the standard
deviation. There are two forms of Variance in Redshift; VAR_SAMP is used when the data rows allowed through
by the WHERE clause are treated as a sample of a larger population.

Although standard deviation and variance are regularly used in statistical calculations, the meaning of variance is
not easy to elaborate. Most often variance is used in theoretical work where a variance of the sample is needed to
look for consistency.

There are two methods for using variance. These are the Kruskal-Wallis one-way Analysis of Variance and
Friedman two-way Analysis of Variance by rank.
Syntax for using VAR_SAMP:

VAR_SAMP(<column-name>)

A VAR_SAMP EXAMPLE
