When you go into court, you are putting your fate into the hands of twelve people who weren't smart enough to
get out of jury duty.
- Norm Crosby
Two guys were having fun on a Saturday night when one said, "I've got to go and do my laundry." The other said, "What!?" The first man explained that if he went to the laundromat the next morning, he would be lucky to
get one machine and be there all day. But if he went on Saturday night, he could get all the machines. Then, he
could do all his wash and dry in two hours. Now that's parallel processing mixed in with a little dry humor!
When you are courting a nice girl, an hour seems like a second. When you sit on a red-hot cinder, a second
seems like an hour. That's relativity.
- Albert Einstein
Data on disk does absolutely nothing. When data is requested, the computer moves the data one block at a time
from disk into memory. Once the data is in memory, it is processed by the CPU at lightning speed. All computers
work this way. The "Achilles Heel" of every computer is the slow process of moving data from disk to memory.
The real theory of relativity is finding out how to get blocks of data from the disk into memory faster!
DATA IN MEMORY IS FAST AS LIGHTNING
- Yogi Berra
Once the data block is moved off of the disk and into memory, the processing of that block happens as fast as
lightning. It is the movement of the block from disk into memory that slows down every computer. Data being
processed in memory is so fast that even Yogi Berra couldn't catch it!
"If the facts don't fit the theory, change the facts."
-Albert Einstein
Big Data is all about parallel processing. Parallel processing is all about taking the rows of a table and spreading
them among many parallel processing units. Above, we can see a table called Orders. There are 16 rows in the
table. Each parallel processor holds four rows. Now they can process the data in parallel and be four times as
fast. What Albert Einstein meant to say was, "If the theory doesn't fit the dimension table, change it to a fact."
The table above has 9 rows. Our small system above has three parallel processing units. Each unit holds three
rows.
EACH PARALLEL PROCESS ORGANIZES THE ROWS INSIDE A DATA BLOCK
The rows of a table are stored on disk in a data block. Above, you can see we have four rows in each data block.
Think of the data block as a suitcase you might take to the airport (without the $50 fee).
To a computer, the data block on disk is as heavy as a large suitcase. It is difficult and cumbersome to lift.
The data block above has nine rows and five columns. If someone requested to see Rob Rivers' salary, the entire
data block would still have to move into memory. Then, a salary of 50000 would be returned. That is a lot of
heavy lifting just to analyze one row and return one column. It is just like burning an entire candle just because
you need a flicker of light!
WHY COLUMNAR?
Each data block holds a single column. The row can be rebuilt because everything is aligned perfectly. If
someone runs a query that would return the average salary, then only one small data block is moved into
memory. The salary block moves into memory where it is processed as fast as lightning. We just cut down on
moving large blocks by 80%! Why columnar? Because like our Yiddish Proverb says, "All data is not kneaded on
every query, so that is why it costs so much dough."
Both designs have the same amount of data. Both take up just as much space. In this example, both have nine rows and five columns. If a query needs to analyze all of the rows or return most of the columns, then the row-based
design is faster and more efficient. However, if the query only needs to analyze a few rows or merely a few
columns, then the columnar design is much lighter because not all of the data is moved into memory. Just one or
two columns move. Take the road less traveled.
When you go on vacation for two weeks, you might pack a lot of clothes. It is then that you take two suitcases. A
data block can only get so big before it is forced to split, otherwise it might not fit into memory.
This is the same data you saw on the previous page! The difference is that the above is a columnar design. I have
color coded this for you. There are nine rows in the table and five columns. Notice that the entire row stays on the
same disk, but each column is a separate block. This is a brilliant design for Ad Hoc queries and analytics
because when only a few columns are needed, columnar can move just the columns it needs to. Columnar can't
be beat for queries because the blocks are so much smaller, and what isn't needed isn't moved.
Both examples above have the same data and the same amount of data. If your applications tend to need to
analyze the majority of columns or read the entire table, then a row-based system (top example) is the more efficient choice because each block it moves contains complete rows. Columnar tables are advantageous when only a few columns need to be read. This is just one
of the reasons that analytics goes with columnar like bread goes with butter. A row-based system must move the
entire block into memory even if it only needs to read one row or even a single column. If a user above needed to
analyze the Salary, the columnar system would move 80% less block mass.
- Mahatma Gandhi
The leader node is the brains behind the entire operation. The user logs into the leader node, and for each SQL
query, the leader node will come up with a plan to retrieve the data. It passes that compiled plan to each compute
node, and each slice processes their portion of the data. If the data is spread evenly, parallel processing works
perfectly. This technology is relatively inexpensive. It might not "be the change", but it will help your company
"keep the change" because costs are low.
- Lao Tzu
Redshift was born to be parallel. With each query, a single step is performed in parallel by each Slice. A Redshift
system consists of a series of slices that will work in parallel to store and process your data. This design allows
you to start small and grow infinitely. If your Redshift system provides you with an excellent Return On
Investment (ROI), then continue to invest by purchasing more nodes (adds additional slices). Most companies
start small, but after seeing what Redshift can do, they continue to grow their ROI from the single step of
implementing a Redshift system to millions of dollars in profits. Double your slices and double your
speeds . . . forever. Redshift actually provides a journey of a thousand smiles!
DISTRIBUTION STYLES
KEY distribution - The rows are distributed according to the values in one column. The leader node
places matching values on the same node slice. If you distribute a pair of tables on the joining keys,
the leader node co-locates the rows on the slices according to the values in the joining columns.
Now, matching values from the common columns are physically stored together. This is extremely
important for table joins.
EVEN distribution - The rows are distributed across the slices in a round-robin fashion, regardless of the values in any particular column. EVEN distribution is appropriate when a table does not participate in joins or when there is not a clear choice between KEY distribution and ALL distribution. EVEN distribution is the default distribution style.
ALL distribution - A copy of the entire table is distributed to every node. ALL distribution is appropriate for small, slowly changing tables, such as dimension tables that are frequently joined to a large table, because every slice then has local access to all of their rows.
Redshift gives you three great choices to distribute your tables. If you have two tables that are being joined
together a lot and they are about the same size, then you want to give them both the same distribution key as the
join key. This co-locates the matching rows on the same slice. Two rows being joined together must be on the
same slice (or Redshift will move one or both of the rows temporarily to satisfy the join requirement). If you join
two tables a lot, but one table is really big and the other is small, then you want to have the small table distributed
by ALL. Use your distribution key to ensure joins happen faster, but also use it to spread the data as evenly
among the slices as possible.
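A sketch of how the three styles are declared, using hypothetical tables (the table and column choices are ours, not from the book):

CREATE TABLE orders
( order_no    INTEGER,
  cust_no     INTEGER,
  order_total DECIMAL(10,2) )
DISTKEY (cust_no) ;       -- KEY: rows hash to a slice by cust_no

CREATE TABLE web_clicks
( click_time  TIMESTAMP,
  url         VARCHAR(500) )
DISTSTYLE EVEN ;          -- EVEN: round-robin, no join column

CREATE TABLE department_table
( dept_no     INTEGER,
  dept_name   VARCHAR(30) )
DISTSTYLE ALL ;           -- ALL: full copy on every node

If a second large table joins to orders on cust_no, give it DISTKEY (cust_no) as well so the matching rows land on the same slice.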
The entire row of a table is on a slice, but each column in the row is in a separate container (block). A Unique
Distribution Key spreads the rows of a table evenly across the slices. A good Distribution Key is the key to good
distribution!
We have chosen the Emp_No column as both the distribution key and the sort key. We can control both!
The data did not spread evenly among the slices for this table. Do you know why? The Distribution Key is
Dept_No. All like values went to the same slice. This distribution isn't perfect, but it is reasonable, so it is an
acceptable practice.
DISTRIBUTION KEY IS ALL
When ALL is selected as the distribution style, the entire table is copied to each slice.
Notice that both tables are distributed on Dept_No. When these two tables are joined WHERE Dept_No =
Dept_No, the rows with matching department numbers are on the same Slice. This is called Co-Location. This
makes joins efficient and fast.
The fact table (Line_Order_Fact_Table) is the largest table, but the Part_Table is the largest dimension table.
That is why you make Part_Key the distribution key for both tables. Now, when these two tables are joined
together, the matching Part_Key rows are on the same slice. You can then distribute by ALL on the other
dimension tables. Each of these tables will have all of its rows on each slice. Now, everything that joins to the fact
table is co-located!
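A hedged sketch of that star-schema layout; only the key and table names come from the page, the column lists are invented:

CREATE TABLE line_order_fact_table
( order_key INTEGER,
  part_key  INTEGER DISTKEY,  -- co-locates with part_table rows
  quantity  INTEGER ) ;

CREATE TABLE part_table
( part_key  INTEGER DISTKEY,  -- same distribution key as the fact table
  part_name VARCHAR(50) ) ;

CREATE TABLE supplier_table
( supp_key  INTEGER,
  supp_name VARCHAR(50) )
DISTSTYLE ALL ;               -- small dimension copied to every slice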
There are three basic reasons to use the sortkey keyword when creating a table. 1) If recent data is queried most
frequently, specify the timestamp or date column as the leading column for the sort key. 2) If you do frequent
range filtering or equality filtering on one column, specify that column as the sort key. 3) If you frequently join a
(dimension) table, specify the join column as the sort key. Above, you can see we have made our sortkey the
Order_Date column. Look how the data is sorted!
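A table built that way might look like this (the table and column names are assumed):

CREATE TABLE order_table
( order_no    INTEGER,
  cust_no     INTEGER,
  order_date  DATE,
  order_total DECIMAL(10,2) )
DISTKEY (order_no)
SORTKEY (order_date) ;   -- rows are stored on disk in Order_Date order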
Amazon Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as
part of the metadata. If a range-restricted column is a sort key, the query processor is able to use the min and max
values to rapidly skip over large numbers of blocks during table scans. Where most databases use indexes to determine where data is, Redshift uses the block's metadata to determine where data is NOT!
Our query above is looking for data WHERE Order_Total < 300. The metadata shows this block may contain qualifying rows, and therefore it will be moved into memory for processing. Each slice has metadata for each of the blocks
they own.
Redshift allocates 1 MB per block when a table begins loading. When a block is filled, another is allocated. I want you to imagine that we created a table that had only one column, and that column was Order_Date. On January 1st, data was loaded. Notice in the examples that as data is loaded, it continues to fill until the block reaches 1 MB. The Order_Date is ordered (because as each day is loaded, it fills up the next slot). Then, notice
how the metadata has the min and max Order_Date. The metadata is designed to inform Redshift whether this
block should be read when this table is queried. If a query is looking for data in April, then there is no reason to
read block 1 because it falls outside of the min/max range.
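Continuing that one-column example, a query like the following hypothetical one lets Redshift skip every block whose min/max range ends before April (the table name is our invention):

SELECT COUNT(*)
FROM   order_date_table
WHERE  order_date BETWEEN '2023-04-01' AND '2023-04-30' ;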
Looking at the SQL and the metadata, how many blocks will need to be moved into memory?
Only one block moves into memory. The metadata shows that the min and max for Order_Total satisfy the query's range only on the last slice. Only that slice moves its block into memory.
Looking at the SQL and the metadata, how many blocks will need to be moved into memory for each query?
The Analyze command updates table statistics for use by the query planner. You can analyze all the tables in an
entire database, or you can analyze specific tables, including temporary tables. If you want to analyze a specific table, you can, but you cannot specify more than one table_name in a single ANALYZE statement. If you do not
specify a table_name, all of the tables in the currently connected database are analyzed including the persistent
tables in the system catalog.
The above examples won't need the analyze statement because it is done automatically, but if you modify these
tables, you will need to run the analyze command. The Analyze command updates table statistics for use by the
query planner. You can analyze all the tables in an entire database, or you can analyze specific tables including
temporary tables.
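The forms described above look like this (sales_table and its columns are our own example):

ANALYZE ;                                    -- every table in the database
ANALYZE sales_table ;                        -- a single table
ANALYZE sales_table (order_date, cust_no) ;  -- specific columns only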
WHAT IS A VACUUM?
Amazon Redshift doesn't automatically reclaim and reuse space that is freed when you delete or update rows. These rows are logically deleted but not physically deleted (until you run a vacuum). The vacuum will reclaim the space. To perform an update, Amazon Redshift deletes the original row and appends the updated row, so every update is effectively a delete followed by an insert. When you perform a delete, the rows are marked for deletion but not removed.
- Groucho Marx
A vacuum can be time consuming and it is very intensive. That is why the above advice is needed. Vacuum
wisely. You can run the vacuum command to get rid of the logically deleted rows and resort the table 100%
perfectly. When about 10% of the table has changed over time, it is a good practice to run both the Vacuum and
Analyze commands. Like Groucho Marx has basically stated, "If data processing slows down and users get
groucho, hit your marks and make it fly after a vacuum."
When tables are originally created and loaded, the rows are in perfect order, either naturally or because a sort key was specified. As additional inserts, updates, and deletes are performed over time, two things happen. First, rows that have been modified are deleted only logically, so they remain physically on disk even though they have been logically deleted. Second, newly inserted rows are stored in a different region of the disk, so the sort is no longer 100% accurate.
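A minimal sketch of the maintenance pair described above, on an assumed sales_table:

VACUUM sales_table ;    -- reclaims logically deleted rows and re-sorts
ANALYZE sales_table ;   -- refreshes the statistics once the vacuum is done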
DATABASE LIMITS
Amazon Redshift enforces these limits for databases.
1. Maximum of 60 user-defined databases per cluster.
2. Maximum of 127 characters for a database name.
3. Cannot be a reserved word.
CREATE DATABASE SQL_Class2 WITH OWNER TeraTom ;
The following example creates a database named SQL_Class2 and gives ownership to the user TeraTom. You can only create a maximum of 60 user-defined databases per cluster, so get yours created before the mob!
CREATING A DATABASE
create database sql_class ;
A Redshift cluster can have many databases. Above is the syntax to create a database. The database is named
sql_class. The data in a database can help you predict the future, and Redshift makes it so easy to create it. I think
Sophia Bedford-Pierce must be a DBA!
CREATING A USER
create user teratom
password 'TLc123123' ;
Password must:
be between 8 and 64 characters
have at least one uppercase letter
have at least one lowercase letter
have at least one number
To create a new user, you specify the name of the new user and a password. The password is required, and it
must be reasonably secure. It must have between 8 and 64 characters, and it must include at least one uppercase
letter, one lowercase letter, and one number.
DROPPING A USER
drop user teratom ;
If you delete a database user account, the user will no longer be able to access any of the cluster databases. The
quote above is the opposite of the DBA credo which states, "All glory comes from daring to drop a user."
The first command renames the Employee_Table to Employee_Table_Backup. The second example renames the
column Grade_Pt to Grade_Point.
In our first example we have added a new column called Mgr to the table Employee_Table. The second example
drops that column.
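The commands being described were not reproduced here, but they would look something like this (which table owns Grade_Pt is our assumption):

ALTER TABLE employee_table RENAME TO employee_table_backup ;
ALTER TABLE student_table  RENAME COLUMN grade_pt TO grade_point ;
ALTER TABLE employee_table ADD COLUMN mgr INTEGER ;
ALTER TABLE employee_table DROP COLUMN mgr ;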
Chapter 2 - Best Practices For Table Design
Above, we are converting all of the tables in a Teradata database to Redshift table structures. We went to our
Teradata system and right clicked on the database SQL_Class and chose "Convert Table Structures". We selected
all of the tables and hit the blue arrow. We then chose to convert to Redshift. Watch in amazement what happens
next!
All 20 Teradata tables have now been converted to Redshift. Just cut and paste to your Redshift system, and you
have converted the tables.
- Harry S. Truman
As you design your database, there are important decisions you must make that will heavily influence overall
query performance. These design choices also have a significant effect on how data is stored, which in turn
affects query performance by reducing the number of I/O operations and minimizing the memory required to
process certain queries. Harry S. Truman was right. "If you want your Redshift system to run brilliantly, take
advice from your users, and use best practices to deliver what they asked for".
The sort order is used by the optimizer to determine optimal query plans.
If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key.
If you do range filtering or equality filtering on one column, specify that column as the sort key.
If you frequently join a table, specify the join column as both the sort key and the distribution key.
Data sorted correctly helps eliminate unneeded blocks. This is because Redshift has metadata on each block
showing column min and max values.
When you give an Amazon Redshift table a sort key, it stores your data on disk in sorted order. The sort order is
used by the optimizer to determine optimal query plans. If recent data is queried most frequently, specify the
timestamp column as the leading column for the sort key. If you do frequent range filtering or equality filtering
on one column, specify that column as the sort key. If you frequently join a table, specify the join column as both
the sort key and the distribution key. This enables the query optimizer to choose a sort merge join instead of a
slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase
of the sort merge join.
Amazon Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as
part of the metadata. If a range-restricted column is a sort key, the query processor is able to use the min and max
values to rapidly skip over large numbers of blocks during table scans. Where most databases use indexes to determine where data is, Redshift uses the block's metadata to determine where data is NOT!
Our query above is looking for data WHERE Order_Total < 300. The metadata shows this block may contain qualifying rows, and therefore it will be moved into memory for processing. Each slice has metadata for each of the blocks
they own.
There are three basic reasons to use the sortkey keyword when creating a table. 1) If recent data is queried most
frequently, specify the timestamp or date column as the leading column for the sort key. 2) If you do frequent
range filtering or equality filtering on one column, specify that column as the sort key. 3) If you frequently join a
(dimension) table, specify the join column as the sort key. Above, you can see we have made our sortkey the
Order_Date column. Look how the data is sorted!
When data is sorted on a strategic column, it will improve aggregations (GROUP BY and ORDER BY operations) and window functions (PARTITION BY and ORDER BY operations), and it even serves as a means of optimizing compression. But,
as new rows are incrementally loaded, these new rows are sorted but they reside temporarily in a separate region
on disk. In order to maintain a fully sorted table, you need to run the VACUUM command at regular intervals.
You will also need to run ANALYZE.
Uneven distribution, or data skew, forces some nodes to do more work than others which slows down the entire
process. With parallel processing, a query is only as fast as the slowest node. Even distribution is a key concept
when each node processes the information they own simultaneously with their node peers.
When rows that participate in joins or aggregations are located on different nodes, more data has to be moved
among nodes. This is because Amazon Redshift must ensure that two rows being joined are on the same node in
the same memory. If this is not the case, then Redshift will either copy the smaller table to all nodes temporarily
or redistribute one or both tables.
The entire row of a table is on a slice, but each column in the row is in a separate container (block). A Unique
Distribution Key spreads the rows of a table evenly across the slices. A good Distribution Key is the key to good
distribution!
Notice that both tables are distributed on Dept_No. When these two tables are joined WHERE Dept_No =
Dept_No, the rows with matching department numbers are on the same Slice. This is called Co-Location. This
makes joins efficient and fast.
Notice that the Department_Table has only four rows. Those four rows are copied to every slice. This is
distributed by ALL. Now, the Department_Table can be joined to the Employee_Table with a guarantee that
matching rows are co-located. They are co-located because the smaller table has copied ALL of its rows to each
slice. When two joining tables have one large table (fact table) and the other table is small (dimension table),
then use the ALL keyword to distribute the smaller table.
DEFINE PRIMARY KEY AND FOREIGN KEY CONSTRAINTS
1. Define primary key and foreign key constraints between tables wherever appropriate.
2. Amazon Redshift does not enforce unique, primary key, and foreign key constraints.
3. The query planner uses these keys in certain statistical computations, to infer uniqueness and referential relationships that affect subquery decorrelation techniques, to order large numbers of joins, and to eliminate redundant joins.
Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints. Your application is
responsible for ensuring uniqueness and managing the DML operations. The query planner will use primary and
foreign keys in certain statistical computations, to infer uniqueness and referential relationships that affect
subquery decorrelation techniques, to order large numbers of joins, and to eliminate redundant joins. The planner
leverages these key relationships, but it assumes that all keys in Amazon Redshift tables are valid as loaded. If
your application allows invalid foreign keys or primary keys, some queries could return incorrect results. For
example, a SELECT DISTINCT query might return duplicate rows if the primary key is not unique. Do not
define key constraints for your tables if you doubt their validity. On the other hand, you should always declare
primary and foreign keys and uniqueness constraints when you know that they are valid.
Amazon Redshift does not enforce primary key and foreign key constraints. The only reason to apply them is so
the query optimizer can generate a better query plan.
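A sketch of declaring the (unenforced) constraints, with invented tables and columns:

CREATE TABLE department_table
( dept_no   INTEGER NOT NULL PRIMARY KEY,
  dept_name VARCHAR(30) ) ;

CREATE TABLE employee_table
( emp_no   INTEGER NOT NULL PRIMARY KEY,
  dept_no  INTEGER REFERENCES department_table (dept_no) ) ;

-- Informational only: Redshift will NOT reject a duplicate emp_no,
-- but the planner uses these declarations to build better plans.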
Amazon Redshift compresses column data very effectively, so creating columns much larger than necessary has
minimal impact on the size of data tables. It is in the processing of queries that the size can hurt you. This is
because during processing for complex queries, intermediate query results might need to be stored in temporary
tables. Because temporary tables are not compressed, unnecessarily large columns consume excessive memory
and temporary disk space, which can affect query performance. Don't go overboard here! Don't get columns so
small that they can't contain your largest values!
Use the DATE or TIMESTAMP data type rather than a character type when storing date/time information.
Amazon Redshift stores DATE and TIMESTAMP data more efficiently than CHAR or VARCHAR, which results
in better query performance. Let Amazon Redshift handle the DATE or TIMESTAMP conversions internally
instead of you trying to do so in your applications. Most of the time, the place users utilize CHAR or VARCHAR for dates is in the ETL process of moving data. There is no need to do that because Redshift handles any conversions necessary.
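For instance (an invented table), declare the columns with temporal types and let Redshift handle the conversions:

CREATE TABLE sales_table
( sale_id   INTEGER,
  sale_date DATE,          -- not CHAR(10): smaller and faster to compare
  sale_ts   TIMESTAMP ) ;  -- not VARCHAR(26)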
You should consider using a predicate on the leading sort column of the fact table, or the largest table, in a join.
You can also add predicates to filter other tables that participate in the join, even when the predicates are
redundant. These predicates refer to WHERE or AND clauses. Because Redshift has the max and min value for
each column per block, you can get better performance when you choose a good sortkey. This allows Redshift to
skip reading certain blocks because Redshift always checks the min and max values to see if the block should
even be read. The second example above uses a redundant AND clause in hopes the entire table won't have to be
read.
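A hedged sketch of such a redundant predicate, assuming hypothetical tables where sale_date and cal_date are each table's sort key:

SELECT d.cal_date, SUM(s.sale_amount)
FROM   sales_table s
JOIN   date_dim d
  ON   s.sale_date = d.cal_date
WHERE  s.sale_date BETWEEN '2024-01-01' AND '2024-01-31'
  AND  d.cal_date  BETWEEN '2024-01-01' AND '2024-01-31'  -- redundant, but lets blocks be skipped
GROUP BY d.cal_date ;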
SETTING THE STATEMENT_TIMEOUT TO ABORT LONG QUERIES
The above query aborts because it took longer than 10 milliseconds. The statement_timeout is designed to abort
any statement that takes over the milliseconds specified. If the system setting WLM timeout
(max_execution_time) is also specified as part of a WLM configuration, the lower of statement_timeout and
max_execution_time is used.
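The setting itself is session-level and takes a value in milliseconds; a sketch:

SET statement_timeout TO 10 ;   -- abort anything running longer than 10 ms
SET statement_timeout TO 0 ;    -- a value of 0 turns the timeout off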
He who asks a question may be a fool for five minutes, but he who never asks a question remains a fool
forever.
-Unknown
STL tables for logging - These system tables are generated from Amazon Redshift log files to provide a history
of the system. Logging tables have an STL prefix.
STV tables for snapshot data - These tables are virtual system tables that contain snapshots of the current
system data. Snapshot tables have an STV prefix.
System views - System views contain a subset of data found in several of the STL and STV system tables.
Systems views have an SVV or SVL prefix.
System catalog tables - The system catalog tables store schema metadata, such as information about tables and
columns. System catalog tables have a PG prefix.
Every Redshift system automatically contains a number of system tables. These system tables contain
information about the installation and about the various queries and processes that are running on the system.
You can query these system tables to collect information about the Redshift database that is installed.
The above query references the system catalog table named pg_table_def, and it runs exclusively on the
leader node. PG_TABLE_DEF will only return information for tables in schemas that are included in the search
path. The first query failed because the 'employee_table' was not in the search_path. Above, we added sql_class
to our path. The first query will work now because the database sql_class has been placed in our search path, and
that is where the employee_table resides.
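The sequence described above can be sketched like this (the column list follows the PG_TABLE_DEF documentation; employee_table and sql_class are the book's examples):

SHOW search_path ;
SET search_path TO sql_class ;

SELECT "column", type, encoding, distkey, sortkey
FROM   pg_table_def
WHERE  tablename = 'employee_table' ;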
The Redshift catalog is in the pg_catalog database. You can query these tables with SQL or merely do a "Quick
Select" by right clicking on any table in the tree. We just did a "Quick Select" on the pg_aggregate table.
The above query references the system catalog table named pg_table_def, and it runs exclusively on the
leader node. PG_TABLE_DEF will only return information for tables in schemas that are included in the search
path. The query we ran on the previous page failed because the 'employee_table' was not in the search_path. The
database that contains the employee_table is the sql_class database. Once we added the database sql_class to our
search path, the query ran perfectly!
Uneven distribution, or data distribution skew, forces some nodes or slices to do more work than others which
inhibits query performance. To check for distribution skew, you can query the SVV_DISKUSAGE system view.
Each row in the system table SVV_DISKUSAGE records the statistics for one disk block. The num_values
column gives the number of rows in that disk block, so when you sum(num_values), it returns the number of
rows on each slice.
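A sketch of that skew check, restricting to col = 0 so each row is counted once (as the AWS examples do); the table name is the book's:

SELECT slice, SUM(num_values) AS row_count
FROM   svv_diskusage
WHERE  name = 'employee_table'
  AND  col = 0
GROUP BY slice
ORDER BY slice ;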
The query above returns all the statements that ran in every completed transaction that included an ANALYZE
command.
The above example returns details for the last COPY operation.
To find out when ANALYZE commands were run, you can query STL_QUERY. For example, to find out when
the Sales_Table was last analyzed, run the query above.
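A sketch of that lookup; ANALYZE activity is logged with query text beginning padb_fetch_sample (per the AWS examples), and sales_table is assumed to be the lowercase name of the book's Sales_Table:

SELECT query, RTRIM(querytxt), starttime
FROM   stl_query
WHERE  querytxt LIKE 'padb_fetch_sample%'
  AND  querytxt LIKE '%sales_table%'
ORDER BY query DESC ;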
SELECT *
FROM ch_loadview
WHERE table_name='Employee_Table';
If IS_DISKBASED is true ("t") for any step, then that step wrote data to disk.
Chapter 4 - Compression
Speak in a moment of anger and you'll deliver the greatest speech you'll ever regret.
- Anonymous
COMPRESSION TYPES
The table above identifies the supported compression encodings and the data types that support the encoding.
Compression reduces the size of data when it is stored, and it is a column-level operation. Compression
conserves storage space and reduces the size of data that is read from storage, which will then reduce the amount
of disk I/O, thus improving query performance. By default, Amazon Redshift stores data in its raw,
uncompressed format, but you can apply a compression type, or encoding, to the columns in a table manually
(when the table is created). Or you can use the COPY command to analyze and apply compression automatically.
Either way, it is important to compress your data.
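Manually applied encodings look like this; the table, columns, and encoding choices below are illustrative, not prescriptive:

CREATE TABLE employee_table
( emp_no     INTEGER       ENCODE delta,     -- values rise in small steps
  dept_no    INTEGER       ENCODE bytedict,  -- few distinct values
  last_name  VARCHAR(30)   ENCODE text255,   -- recurring words
  comments   VARCHAR(500)  ENCODE lzo,       -- long free-form strings
  salary     DECIMAL(10,2) ENCODE mostly16 ) ;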
Byte dictionary encoding utilizes a separate dictionary of unique values for each block of column values on disk.
Remember, each Amazon Redshift disk block occupies 1 MB. The dictionary contains up to 256 one-byte values
that are stored as indexes to the original data values. If more than 256 values are stored in a single block, the
extra values are written into the block in raw, uncompressed form. The process repeats for each disk block. This
encoding is very effective when a column contains a limited number of unique values, and it is especially optimal when there are fewer than 256 unique values.
DELTA ENCODING
Delta encodings are very useful for date and time columns. Delta encoding compresses data by recording the
difference between values that follow each other in the column. These differences are recorded in a separate
dictionary for each block of column values on disk. If the column contains 10 integers in sequence from 1 to 10,
the first will be stored as a 4-byte integer (plus a 1-byte flag), and the next 9 will each be stored as a byte with the
value 1, indicating that it is one greater than the previous value. Delta encoding comes in two variations. DELTA
records the differences as 1-byte values (8-bit integers), and DELTA32K records differences as 2-byte values
(16-bit integers).
LZO ENCODING
LZO is designed to work best with CHAR and VARCHAR data that store long character strings. It includes slower compression levels that achieve a quite competitive compression ratio while still decompressing at very high speed.
Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm that is focused on decompression speed.
LZO encoding provides a high compression ratio with good performance. LZO encoding is designed to work
well with character data. It is especially good for CHAR and VARCHAR columns that store very long character strings, such as free-form text like product descriptions, user comments, or JSON strings.
MOSTLY ENCODING
Mostly encodings are useful when the data type for a column is larger than the majority of the stored values
require. By specifying a mostly encoding for this type of column, you can compress the majority of the values in
the column to a smaller standard storage size. The remaining values that cannot be compressed are stored in their
raw form.
RUNLENGTH ENCODING
Runlength encoding replaces a value that is repeated consecutively with a token that consists of the value and a
count of the number of consecutive occurrences (the length of the run). This is where the name Runlength comes
into play. A separate dictionary of unique values is created for each block of column values on disk. This
encoding is best suited to a table in which data values are often repeated consecutively, for example, when the
table is sorted by those values.
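A minimal sketch (illustrative names): a column whose values arrive in long consecutive runs is the ideal RUNLENGTH candidate:

```sql
-- Rows are loaded grouped by state, so 'OH', 'TX', ... repeat in long
-- runs that collapse to a (value, run-length) token on disk.
CREATE TABLE Customer_By_State
( Customer_Id INTEGER,
  State       CHAR(2) ENCODE RUNLENGTH
);
```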
TEXT255 AND TEXT32K ENCODING
Text255 and text32k encodings are useful for compressing VARCHAR columns only. Both compression
techniques work best when the same words recur often. A separate dictionary of unique words is created for each
block of column values on disk. Text255 has a dictionary that contains the first 245 unique words in the column.
Those words are replaced on disk by a one-byte index value representing one of the 245 values, and any words
that are not represented in the dictionary are stored uncompressed. This process is repeated for each block.
For the text32k encoding, the principle is the same, but the dictionary for each block does not capture a specific
number of words. Instead, the dictionary indexes each unique word it finds until the combined entries reach a
length of 32K, minus some overhead. The index values are stored in two bytes.
ANALYZE COMPRESSION
ANALYZE COMPRESSION
[ [ table_name ]
[ ( column_name [, . . .] ) ] ]
[COMPROWS numrows]
Table_Name - You can optionally specify a table_name to analyze a single table. If you do not specify a
table_name, all of the tables in the currently connected database are analyzed. You cannot specify more than one
table_name with a single ANALYZE COMPRESSION statement. You can also analyze compression for
temporary tables.
Column_Name - If you specify a table_name, you can also specify one or more columns in the table (as a
comma-separated list within parentheses).
COMPROWS - This is the number of rows to be used as the sample size for compression analysis. The analysis is
run on rows from each data slice. For example, if you specify COMPROWS 2000000 (2,000,000) and the system
contains 4 total slices, no more than 500,000 rows per slice are read and analyzed. If COMPROWS is not
specified, the sample size defaults to 100,000 per slice.
Numrows - The number of rows to be used as the compression sample size. The accepted range for numrows is a
number between 1000 and 1000000000 (1,000,000,000).
The ANALYZE COMPRESSION command performs compression analysis and produces a report with the suggested
column encoding schemes for the tables analyzed. ANALYZE COMPRESSION does not modify the column
encodings of the table but merely makes suggestions. To implement the suggestions, you must recreate the table,
or create a new table with the same schema. ANALYZE COMPRESSION does not consider Runlength encoding
on any column that is designated as a SORTKEY. This is because range-restricted scans might perform poorly
when SORTKEY columns are compressed much more highly than other columns. ANALYZE COMPRESSION
acquires an exclusive table lock, which prevents concurrent reads and writes against the table. Only run the
ANALYZE COMPRESSION command when the table is idle.
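Following the syntax above, here is a hedged sketch of the command against tables used elsewhere in this book:

```sql
-- Analyze one table, sampling up to 1,000,000 rows across the slices.
ANALYZE COMPRESSION Student_Table COMPROWS 1000000;

-- Analyze only two columns of a single table.
ANALYZE COMPRESSION Employee_Table (Salary, Dept_No);
```

Remember that the output is only a report of suggested encodings; nothing changes until you recreate the table.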
COPY
The above example (two parts) gives the syntax for COPY from Amazon S3, COPY from Amazon EMR, and
COPY from a remote host (COPY from SSH).
Chapter 5 - Temporary Tables
Creates a new table in the current database. The owner of this table is the issuer of the CREATE TABLE
command.
When you create a temporary table, it is visible only within the current session. The table is automatically
dropped at the end of the session. A derived table only lasts the life of a single query, but a temporary table lasts
the entire session. This allows a user to run hundreds of queries against the temporary table. A temporary table
can have the same name as a permanent table, but I don't recommend this. You don't give a temporary table a
schema because it is automatically associated with the user's session. Once the session is over, the table and data
are dropped. If the user tries to query the table in another session, the system won't recognize the table. In other
words, the table doesn't exist outside of the session it was created in.
Above are some examples that allow you to define a different DISTKEY, DISTSTYLE, and SORTKEY on a
temporary table. Users (by default) are granted permission to create temporary tables by their automatic
membership in the PUBLIC group. To remove the privilege for any users to create temporary tables, revoke the
TEMP permission from the PUBLIC group and then explicitly grant the permission to create temporary tables to
specific users or groups of users.
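A minimal sketch of the idea (the temporary table name and key choices are illustrative):

```sql
-- Session-scoped copy of the Student_Table with its own keys.
-- It is dropped automatically when the session ends.
CREATE TEMPORARY TABLE Temp_Students
DISTKEY (Student_ID)
SORTKEY (Student_ID)
AS
SELECT * FROM Student_Table;
```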
TABLE LIMITS AND CTAS
The maximum number of permanent tables is 9,900.
The maximum number of characters for a table name is 127.
The maximum number of columns you can define in a single table is 1,600.
CREATE TABLE
Student_Table_Backup
AS
SELECT *
FROM Student_Table;
To have everything is to possess nothing.
--Buddha
The resulting table inherits the distribution and sort key from the Student_Table (STUDENT_ID). Buddha might
have been wrong here. "To have everything is to possess 9,900 permanent tables."
2) Use CREATE TABLE AS (CTAS). If the original DDL is not available, you can use CREATE TABLE AS to
create a copy of the current table, then rename the copy. The new table will not inherit the encoding, distkey,
sortkey, not null, primary key, and foreign key attributes of the parent table.
3) Use CREATE TABLE LIKE. If the original DDL is not available, you can use CREATE TABLE LIKE to
recreate the original table. The new table will not inherit the primary key and foreign key attributes of the parent
table. The new table does, though, inherit the encoding, distkey, sortkey, and not null attributes of the parent
table.
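Method 3 above can be sketched like this (the new table name is illustrative):

```sql
-- The LIKE clause inherits encoding, distkey, sortkey, and NOT NULL,
-- but not the primary key or foreign key attributes.
CREATE TABLE Student_Table_New (LIKE Student_Table);
INSERT INTO Student_Table_New
SELECT * FROM Student_Table;
```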
4) Create a temporary table and truncate the original table. If you need to retain the primary key and foreign key
attributes of the parent table, you can use CTAS to create a temporary table, then truncate the original table and
populate it from the temporary table. This method is slower than CREATE TABLE LIKE because it requires two
insert statements.
A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table. If a table
has a large unsorted region, a deep copy is much faster than a vacuum. The difference is that you cannot make
concurrent updates during a deep copy operation which you can do during a vacuum. The next four slides will
show each technique with an example.
1. Use the original CREATE TABLE DDL to create a copy of the table.
2. Use an INSERT INTO ... SELECT statement to populate the copy with data from the original table.
3. Drop the original table.
4. Use an ALTER TABLE statement to rename the copy to the original table name.
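The steps above can be sketched end to end. The column list is illustrative (the book does not show the Sales_Table DDL here), so treat this as a pattern, not the exact table:

```sql
-- 1. Recreate the table from its original CREATE TABLE DDL.
CREATE TABLE Sales_Table_Copy
( Sale_Id   INTEGER,
  Sale_Date DATE,
  Amount    DECIMAL(10,2)
);
-- 2. The bulk insert sorts the data as it loads.
INSERT INTO Sales_Table_Copy SELECT * FROM Sales_Table;
-- 3. Drop the original table.
DROP TABLE Sales_Table;
-- 4. Rename the copy to the original table name.
ALTER TABLE Sales_Table_Copy RENAME TO Sales_Table;
```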
A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table.
3. Use an ALTER TABLE statement to rename the new table to the original table.
The following example performs a deep copy on the Sales_Table using a duplicate of the Sales_Table
named Sales_Table_Copy.
A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table.
3. Use an INSERT INTO ... SELECT statement to copy the rows from the temporary table to the original table.
The SELECT statement that creates and populates the derived table is always inside parentheses.
The derived table must be given a name. Above, we called our derived table TeraTom.
You will need to define (alias) the columns in the derived table. Above, we allowed Dept_No to default
to Dept_No, but we had to specifically alias AVG(Salary) as AVGSAL.
Every derived table must have the three components listed above.
In the example above, TeraTom is the name we gave the derived table. It is mandatory that you always name the
table or the query errors.
AVGSAL is the name we gave to the column in our Derived Table that we call TeraTom. Our SELECT (which
builds the columns) shows we are only going to have one column in our derived table and we have named that
column AVGSAL.
The first five columns in the Answer Set came from the Employee_Table. AVGSAL came from the derived table
named TeraTom.
TeraTom
Dept_No    AVGSAL
?          32800.50
10         64300.00
100        48850.00
200        44944.44
300        40200.00
400        48333.33
In a derived table, you will always have a SELECT query in parentheses, and you will always name the table.
You have options when aliasing the columns. As in the example above, you can let normal columns default to
their current name.
Now, the lower portion of the query refers to TeraTom almost as if it were a permanent table, but it is not!
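The pattern described above can be sketched as one query (a hedged reconstruction, since the original slide is not reproduced here):

```sql
-- The derived table TeraTom computes the average salary per
-- department; the outer query joins each employee to that average.
SELECT E.*,
       TeraTom.AVGSAL
FROM Employee_Table E
INNER JOIN
     ( SELECT Dept_No,
              AVG(Salary) AS AVGSAL
       FROM Employee_Table
       GROUP BY Dept_No ) AS TeraTom
ON E.Dept_No = TeraTom.Dept_No;
```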
The following example shows the simplest possible case of a query that contains a WITH clause. The WITH
query named TeraTom selects all of the rows from the Student_Table. The main query, in turn, selects all of the
rows from TeraTom. The TeraTom table exists only for the life of the query.
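A minimal sketch of that simplest case:

```sql
-- TeraTom exists only for the life of this one query.
WITH TeraTom AS
( SELECT * FROM Student_Table )
SELECT *
FROM TeraTom;
```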
The most important thing a father can do for his children is to love their mother.
-Anonymous
The following example shows two tables created from the With statement. "Sometimes the most important thing
a WITH clause can do is to have multiple children."
6) Why were the join keys named differently? If both were named Dept_No, we would error unless we fully
qualified them.
ON E.Employee_No = S.Employee_No
ORDER BY T.Dept_No;
When you run an EXPLAIN, you are seeing the plan passed to the slices by the optimizer on the Leader Node.
There are three ways to see an EXPLAIN. You can merely type the word EXPLAIN in front of any SQL. You
can also hit F6 (Function Key 6), or you can click on the magnifying glass in Nexus. The EXPLAIN shows the
plan, but does NOT run the actual query. Once you see the costs of the EXPLAIN, you can decide whether or not
to run the query.
Segment - Segments are a number of steps that can be done by a single process. A segment is a single
compilation unit executable by compute nodes. Each segment begins with a scan or reading of table data and
ends either with a materialization step or some other network activity.
Stream - A collection of segments that always begins with a scan or reading of some data set and ends with a
materialization or blocking step. Materialization or blocking steps can include HASH, AGG, SORT, and SAVE.
Last segment - The term last segment means the query returns the data. If the return set is aggregated or sorted,
the compute nodes each send their piece of the intermediate result to the leader node, which then merges the data
so the final result can be sent back to the requesting client.
Merge Join - Also termed an mjoin. This is commonly used for inner joins and outer joins (for join tables that
are both distributed and sorted on the joining columns), and this is generally the fastest Amazon Redshift join
algorithm.
Hash Join - Also termed an hjoin. This is also used for inner joins and left and right outer joins, and it is
typically faster than a nested loop join. Hash Join reads the outer table, then hashes the joining column and finds
matches in the inner hash table.
Nested Loop Join - Also termed an nloop. This is the least optimal join, and it is mainly used for cross-joins,
product joins, and Cartesian product joins. It is also used for joins without a join condition and inequality joins.
HashAggregate - Also termed an aggr. This is an operator/step for any grouped aggregate functions. This has
the ability to operate from disk by virtue of hash table spilling to disk.
GroupAggregate - Also termed an aggr. This is an operator that is sometimes chosen for grouped aggregate
queries. This is only done if the Amazon Redshift force_hash_grouping configuration setting is off.
Sort - Also termed a sort. The ORDER BY clause controls this sort. It can also perform other operations such
as UNIONs and joins. It can also operate from disk.
Merge - Also termed a merge. This produces the final sorted results of a query based on intermediate sorted
results derived from operations performed in parallel.
Hash Intersect - Also termed an hjoin. This is used for INTERSECT queries.
Append - Also termed a save. This is the append used with a Subquery Scan to implement UNION and UNION
ALL queries.
Limit - Also termed a limit. This term evaluates the LIMIT clause.
Materialize - Also termed a save. This term means to materialize rows for input to nested loop joins and some
merge joins. This can operate from disk.
Unique - Also termed a unique. This eliminates duplicates for SELECT DISTINCT queries and UNION
queries. This can operate from disk.
Window - Also termed a window. This term means to compute aggregate and ranking window functions. This
can operate from disk.
EXPLAIN TERMS FOR SET OPERATORS AND MISCELLANEOUS TERMS
Network (Broadcast) - Also termed a bcast. This is a broadcast that is considered an attribute of the Join
Explain operators and steps.
Network (Distribute) - Also termed a dist. This is used to distribute rows to compute nodes for parallel
processing by the data warehouse cluster.
Network (Send to Leader) - Also termed a return. This sends results back to the leader for further processing.
Delete (Scan and Filter) - Also termed a delete. This term means to delete data. This operation can operate from
disk.
Update (Scan and Filter) - Also termed delete, insert. An update is implemented as a delete followed by an
insert.
Costs are cumulative as you read up the plan, so the HashAggregate cost number in this example (0.09) (red
arrow) consists mostly of the Seq Scan cost below it (0.06) (blue arrow). So, the first scan (blue arrow) was 0.06
and the final cost is 0.09, which means the final step was a 0.03 cost. Add up 0.06 (blue arrow) and 0.03 (red
arrow) and you get 0.09!
The keyword rows in the EXPLAIN means the expected number of rows to return. In this example, the scan is
expected to return 6 rows. The HashAggregate operator is expected to return 6 rows (red arrow).
Above, you can see a simple query with a simple explain plan. This is designed to show how the cost works.
A Broadcast (BCAST) means to duplicate the table in its entirety across all nodes. We are duplicating the
Department_Table on all nodes.
The Student_Table and the Student_Course_Table are joined first on Student_ID. Both tables have a Distribution
Key of Student_ID. Since they join on Student_ID, the matching rows are already on the same slice. So, there is
no need to redistribute the data thus the DS_DIST_NONE keyword. Then, the Course_Table is broadcast to all
slices (DS_BCAST_INNER) where it can be joined to the results of the first two table join. Data movement
happens in joins because the joining of two rows has to happen in the same slice and memory.
The example above might become a problem. The costs look high and the EXPLAIN plan is shouting out to
review the join predicates to avoid Cartesian product joins.
The keyword Aggregate in the EXPLAIN is used for aggregation scalar functions. A scalar function means that
only one row and one column are returned in the answer set for an aggregation function. Notice that the above
query produces a scalar result. The AVG(Salary) in the Employee_Table is $46782.15. That result is only one
column and one row! It is scalar!
The keyword HashAggregate in the EXPLAIN is used for unsorted grouped aggregate functions. Notice there is
no sort!
EXPLAIN USING LIMIT, MERGE AND SORT
The keyword Limit is used to evaluate the LIMIT clause. The keyword Sort is used to evaluate the ORDER BY
clause. The Keyword Merge is used when producing the final sorted results, which is derived from intermediate
sorted results that each slice parallel processed. Remember, each slice must perform their work. Then, the data is
sorted on each slice and passed to the leader node where a Merge operation is performed.
The keyword Filter is used to evaluate the WHERE clause. In the above example, we are filtering the returning
rows by only looking for Freshman who have a class_code of 'FR'. Our EXPLAIN (in yellow) shows the
keyword filter looking for 'FR'.
A subquery involves at least two queries. A top and bottom query. In the above example, the bottom query is run
first on the Department_Table. The result set consists of the column Dept_No. The Employee_Table (top query)
is scanned next. Then, the results of both the Department_Table and the Employee_Table scans are hashed by
Dept_No. This places all matches on the same slice. The rows can then be joined using a Hash Join in memory.
When I was 14, I thought my parents were the stupidest people in the world. When I was 21, I was amazed at
how much they learned in seven years.
- Mark Twain
The CURRENT_SCHEMA function is a leader-node only function. In this example, the query does not reference
a table, so it runs exclusively on the leader node in order to show the schema. Our Current_Schema is the public
schema.
We ran two queries above. The first query showed us our search_path and it contained $user and public. Since
we will be querying the databases sql_class and sql_views in our example labs, we need to place them in our
search_path. The second example (in yellow) has done this through the "Set search_path" command.
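The two statements might look like this (assuming schemas named sql_class and sql_views exist in your cluster):

```sql
-- See the current search_path.
SHOW search_path;

-- Add the lab schemas so unqualified table names resolve to them.
SET search_path TO '$user', public, sql_class, sql_views;
```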
The current session's temporary-table schema, pg_temp_nnn, is searched (if it exists). It can be explicitly listed in
the path by using the alias pg_temp. If not listed in the path, it will be searched first (even before pg_catalog).
Remember, the temporary schema is only searched for tables and view names. It is not searched for any function
names.
Above are the five things you need to know about how the Search_Path works.
INTRODUCTION
This is a pictorial of the Student_Table which we will use to present some basic examples of SQL and get some
hands-on experience with querying this table. This book attempts to show you the table, show you the query, and
show you the result set.
This is a great way to show the columns you are selecting from the Table_Name.
Why is the example on the left better even though they are functionally equivalent? Errors are easier to spot and
comments won't cause errors.
PLACE YOUR COMMAS IN FRONT FOR BETTER DEBUGGING
CAPABILITIES
Anonymous
Having commas in front to separate column names makes it easier to debug. Remember our quote above. "A
query filled with commas at the end just might fill you with thorns, but a query filled with commas in the front
will allow you to always come up smelling like roses."
Rows typically come back to the report in random order. To order the result set, you must use an ORDER BY.
When you order by a column, it will order in ASCENDING order. This is called the Major Sort!
The ORDER BY can use a number to represent the sort column. The number 2 represents the second column on
the report.
Notice that the answer set is sorted in ascending order based on the column Grade_Pt. Also notice that Grade_Pt
is the fifth column coming back on the report. That is why the SQL in both statements is ordering by Grade_Pt.
Did you notice that the null value came back first? Nulls sort first in ascending order and last in descending
order.
Notice that the answer set is sorted in descending order based on the column Last_Name. Also, notice that
Last_Name is the second column coming back on the report. We could have done an Order By 2. If you spell out
the word DESCENDING the query will fail, so you must remember to just use DESC.
NULL VALUES SORT FIRST IN ASCENDING MODE (DEFAULT)
Did you notice that the null value came back first? Nulls sort first in ascending order and last in descending
order.
You can ORDER BY in descending order by putting a DESC after the column name or its corresponding number.
Null Values will sort Last in DESC order.
Major sort is the first sort. There can only be one major sort. A minor sort kicks in if there are Major Sort ties.
There can be zero or more minor sorts.
This sorts alphabetically. Can you change the sort so the Freshman come first, followed by the Sophomores,
Juniors, Seniors and then the Null?
Can you change the query to Order BY Class_Code logically (FR, SO, JR, SR, ?)?
When you ALIAS a column, you give it a new name for the report header. You should always reference the
column using the ALIAS everywhere else in the query. You never need Double Quotes in SQL unless you are
Aliasing.
Column names must be separated by commas. Notice in this example, there is a comma missing between
Class_Code and Grade_Pt. The result is that only three columns appear on your report, with one of them
aliased incorrectly.
Double dashes make a single-line comment that will be ignored by the system.
Double dashes in front of both lines comment both lines out, and they're ignored.
The query on the left had an error because the keyword Sum is reserved. We can test if this is the problem by
commenting out that line in our SQL (example on the right). Now, our query works. We know the problem is on
the line that we commented out. Once we put "Sum" (double quotes around the alias) it works. Use comments to
help you debug.
Chapter 8 - The WHERE Clause
I saw the angel in the marble and carved until I set him free.
- Michelangelo
Redshift offers a unique capability in its SQL to limit the number of rows returned from the table's data. It is the
LIMIT clause, and it is normally added at the end of a valid SELECT statement, as in the example and syntax
above. This example uses a LIMIT clause to reduce the rows returned, but in reality, the limiting of rows comes
from the WHERE clause.
The WHERE clause here filters how many rows come back. In this example, I am asking for the report
to return only rows WHERE the first name is Henry.
When you ALIAS a column, you give it a new name for the report header, but a good rule of thumb is to refer to
the column by the alias throughout the query.
-Anonymous
When you ALIAS a column, you give it a new name for the report header, but a good rule of thumb is to refer to
the column by the alias throughout the query. Whoever wrote the above quote was way off. "Write a wise alias
and it will live until the query ends bummer".
In the WHERE clause, if you search for character data such as first name, you need single quotes around it. You
don't single-quote integers.
Character data (letters) need single quotes, but you need NO Single Quotes for Integers (numbers). Remember,
you never use double quotes except for aliasing.
If you are looking for a row that holds NULL value, you need to put IS NULL. This will only bring back the
rows with a NULL value in it.
The same goes for = NOT NULL. We can't compare a NULL with any equal sign. We can only deal with
NULL values with IS NULL and IS NOT NULL.
Much like before, when you want to bring back the rows that do not have NULLs in them, you put an IS NOT
NULL in the WHERE Clause.
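The two NULL tests side by side:

```sql
-- Only rows where Grade_Pt holds a NULL.
SELECT * FROM Student_Table
WHERE Grade_Pt IS NULL;

-- Only rows where Grade_Pt holds a value.
SELECT * FROM Student_Table
WHERE Grade_Pt IS NOT NULL;
```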
The WHERE clause doesn't just deal with equals. You can look for things that are GREATER or LESSER
THAN, along with asking for things that are GREATER/LESSER THAN OR EQUAL to.
Notice the WHERE statement and the word AND. In this example, qualifying rows must have a Class_Code =
'FR' and also must have a First_Name of 'Henry'. Notice how the WHERE and the AND clause are on their own
line. Good practice!
TROUBLESHOOTING AND
What is going wrong here? You are using an AND to check the same column. What you are basically asking with
this syntax is to see the rows that have BOTH a Grade_Pt of 3.0 and a 4.0. That is impossible, so no rows will be
returned.
Notice above in the WHERE Clause we use OR. Or allows for either of the parameters to be TRUE in order for
the data to qualify and return.
TROUBLESHOOTING OR
Notice above in the WHERE Clause we use OR. Or allows for either of the parameters to be TRUE in order for
the data to qualify and return. The first example errors and is a common mistake. The second example is perfect.
Notice that AND separates two different columns, and the data will come back if both are TRUE.
Which Seniors have a 3.0 or a 4.0 Grade_Pt average. How many rows will return?
A) 2    B) 1    C) Error    D) 3
ANSWER TO QUIZ HOW MANY ROWS WILL RETURN?
1) Parentheses ( )
2) NOT
3) AND
4) OR
SELECT *
FROM Student_Table
WHERE Grade_Pt = 4.0 OR Grade_Pt = 3.0
AND Class_Code = 'SR' ;
Syntax has an ORDER OF PRECEDENCE. It will read anything with parentheses around it first. Then, it will
read all the NOT statements, then the AND statements, and FINALLY, the OR statements. This is why the last
query came out odd. Let's fix it and bring back the right answer set.
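Fixed with parentheses, which evaluate before the AND:

```sql
SELECT *
FROM Student_Table
WHERE (Grade_Pt = 4.0 OR Grade_Pt = 3.0)
  AND Class_Code = 'SR' ;
```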
Using an IN list is a great way of looking for rows that have a Grade_Pt of 3.0 or 4.0 AND also have a
Class_Code of 'SR'. Only ONE row comes back.
The IN Statement avoids retyping the same column name separated by an OR. The IN allows you to search the
same column for a list of values. Both queries above are equal, but the IN list is a nice way to keep things easy
and organized.
You can also ask to see the results that ARE NOT IN your parameter list. That requires the column name and a
NOT IN. Neither the IN nor the NOT IN can search for NULLs! Miles Davis got this IT quote all wrong. "First
you innovate, and then you sue anyone who imitates." Please make a note of it!
This is a great technique to look for a NULL when using a NOT IN List.
This is a great technique to eliminate any NULL values when using a NOT IN List.
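The two techniques can be sketched as:

```sql
-- Look for NULLs alongside a NOT IN list.
SELECT * FROM Student_Table
WHERE Grade_Pt NOT IN (2.0, 3.0, 4.0)
   OR Grade_Pt IS NULL;

-- Keep NULLs out of a NOT IN subquery so it can return rows.
SELECT * FROM Student_Table
WHERE Grade_Pt NOT IN
      ( SELECT Grade_Pt FROM Student_Table
        WHERE Grade_Pt IS NOT NULL );
```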
BETWEEN IS INCLUSIVE
SELECT *
FROM Student_Table
WHERE Grade_Pt BETWEEN 2.0 AND 4.0 ;
This is a BETWEEN. What this allows you to do is see if a column falls in a range. It is inclusive meaning that in
our example, we will be getting the rows that also have a 2.0 and 4.0 in their column!
"The difference between genius and stupidity is that genius has its limits."
Albert Einstein
This is a NOT BETWEEN example. What this allows you to do is see if a column does not fall in a range.
Because BETWEEN is inclusive, in our example we will get no rows where the Grade_Pt is between 2.0 and 4.0,
and the endpoint values 2.0 and 4.0 will also not return.
The _ underscore sign is a wildcard for any single character. We are looking for anyone who has an 'a' as the
second letter of their last name.
With Redshift, the ILIKE command is NOT case sensitive, but the LIKE command is case sensitive. These rows
came back because they have an 'AR' in positions 2 and 3 of their Last_Name. The 'AR' is not really capitalized,
but that is why you use the ILIKE command. It doesn't care about case!
This is a CHAR(20) data type. That means that any words under 20 characters will pad spaces behind them until
they reach 20 characters. You will not get any rows back from this example because technically, no row ends in
an N, but instead ends in a space.
Which Column from the Answer Set could have a DATA TYPE of INTEGER, and which could have Character
Data?
All Integers will start from the right and move left. Thus, Col1 was defined during the table create statement to
hold an INTEGER. The next page shows a clear example.
This is how a standard result set will look. Notice that the integer type in Student_ID starts from the right and
goes left. Character data type in Last_Name moves left to right, like we are used to seeing while reading English.
CHAR data pads spaces to the right, and VARCHAR uses a 2-byte VLI (variable length indicator) instead.
Sometimes you want to use the LIKE command, but you also want to search for the values of percent (%) or
Underscore (_). You can turn off these wildcards by using an escape character. The following example uses the
escape character @ to search for strings that include "_" just after the word "start". The @ sign just in front of the
underscore (_) means that the underscore is no longer a wildcard, but an actual literal underscore.
Sometimes you want to use the LIKE command, but you also want to search for the values of percent (%) or
Underscore (_). You can turn off these wildcards by using an escape character. The following example uses the
default escape characters \\ to search for strings that include underscore "_" just after the word "start". The \\ just
in front of the underscore means that the underscore is no longer a wildcard, but a literal underscore.
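A hedged sketch of both escape styles (the table and column names are illustrative, not from the book):

```sql
-- '@' declared as the escape character: '@_' is a literal underscore.
SELECT * FROM Event_Log
WHERE Event_Name LIKE 'start@_%' ESCAPE '@';

-- The default backslash escape accomplishes the same thing.
SELECT * FROM Event_Log
WHERE Event_Name LIKE 'start\\_%';
```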
SIMILAR TO OPERATORS
[] Brackets specify a matching list, that should match one expression in the list. A
caret (^) precedes a nonmatching list, which matches any character except for
the expressions represented in the list.
{m,n} Repeat the previous item at least m and not more than n times.
[: :] Matches any character within a POSIX character class. In the following character
classes, Amazon Redshift supports only ASCII characters: [:alnum:],
[:alpha:],[:lower:], [:upper:]
SIMILAR TO OPERATORS
{m,n} Repeat the previous item at least m and not more than n times.
SELECT First_Name
,Last_Name
FROM Employee_Table
WHERE First_Name similar to '%e%|%h%'
ORDER BY First_Name;
The following example finds all employees with a First_Name that contains "e" or "h". Regular expression
matching using SIMILAR TO is computationally expensive. We recommend using LIKE whenever possible,
especially when processing a very large number of rows. For example, the following queries are functionally
identical, but the query that uses LIKE executes several times faster than the query that uses a regular expression.
The next page shows the answer set.
The example above finds all employees with a First_Name that contains an "e" or an "h". Herbert did not match
on the 'h' (his capital 'H' is not a lowercase 'h'), but he returned because he does have an 'e' in his First_Name.
The example above finds all employees with a First_Name that contains an "i" or a capital "H". Notice that
"John" is no longer in the answer set (like he was in the previous example). John has an "h" in it, but not a capital
"H".
The name William has the letter 'i' in it twice, but no rows came back. This is because the occurrences must
follow each other consecutively. There needs to be a name with two occurrences of the letter 'i' back to back!
A bird does not sing because it has the answers, it sings because it has a song.
- Anonymous
Distinct and GROUP BY in the two examples return the same answer set.
How many rows will come back from the above SQL?
How many rows will come back from the above SQL? 10. All rows came back. Why? Because there are no exact
duplicates when the Class_Code and Grade_Pt values are combined. Each row in the SELECT list is distinct.
TOP COMMAND
In the above example, we brought back 3 rows only. This is because of the TOP 3 statement which means to get
an answer set, and then bring back the first 3 rows in that answer set. Because this example does not have an
ORDER BY statement, you can consider this example as merely bringing back 3 random rows.
In the above example, we brought back 3 rows only. This is because of the TOP 3 statement which means to get
an answer set, and then bring back the first 3 rows. Because this example uses an ORDER BY statement, the data
brought back is from the top 3 students with the highest Grade_Pt. This is the real power of the TOP command.
Use it with an ORDER BY!
Both queries above bring back the top 3 students with the highest grade_pt. The TOP command is designed to
bring back the top n rows. The LIMIT clause is used more often if you merely want to see a quick sample, but
both techniques will work with an ORDER BY statement and both can utilize an ORDER BY statement in the
creation of a view.
Chapter 10 - Aggregation
Redshift climbed Aggregate Mountain and delivered a better way to Sum It.
Tera-Tom Coffing
Employee_No    Salary
423400         100000.00
423401         100000.00
423402         NULL
What would the result set be from the above query? The next slide shows answers!
3) You CAN'T mix aggregates with normal columns unless you use a GROUP BY.
-Muhammad Ali
The five aggregates are listed above. Muhammad Ali was way off in his quote. He meant to say, "Don't you count
the days, make the data count for you".
How many rows will the above query produce in the result set?
TROUBLESHOOTING AGGREGATES
If you have a normal column (non aggregate) in your query, you must have a corresponding GROUP BY
statement.
Both queries above produce the same result. The GROUP BY allows you to either name the column or use the
number in the SELECT list just like the ORDER BY.
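As a sketch of the rule and of both GROUP BY styles (assuming the Employee_Table with its Dept_No and Salary columns):

```sql
-- Every normal (non-aggregate) column must appear in the GROUP BY.
-- Grouping by the column name:
SELECT Dept_No, SUM(Salary)
FROM Employee_Table
GROUP BY Dept_No;

-- Grouping by the position in the SELECT list:
SELECT Dept_No, SUM(Salary)
FROM Employee_Table
GROUP BY 1;
```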
The system eliminates reading any other Dept_Nos other than 200 and 400. This means that only Dept_Nos of
200 and 400 will come off the disk to be calculated.
The HAVING Clause only works on Aggregate Totals. The WHERE filters rows to be excluded from calculation,
but the HAVING filters the Aggregate totals after the calculations, thus eliminating certain Aggregate totals.
The HAVING Clause only works on Aggregate Totals, and in the above example, only Count(*) > 2 can return.
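A sketch of both filters working together (the Employee_Table column names and the HAVING threshold here are assumptions for illustration):

```sql
SELECT Dept_No, AVG(Salary)
FROM Employee_Table
WHERE Dept_No IN (200, 400)      -- filters rows before the calculation
GROUP BY Dept_No
HAVING AVG(Salary) > 50000;      -- filters the aggregate totals afterward
```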
A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION specifies which column from each table must match. In this case, Customer_Number is the match that establishes the relationship.
Whenever a column is in both tables, you must fully qualify it when doing a join. You don't have to fully qualify columns that are in only one of the tables because the system knows which table that particular column is in. You can choose to fully qualify every column if you like. This is a good practice because it is more apparent which columns belong to which tables for anyone else looking at your SQL.
This is the same join as the previous slide except it is using ANSI syntax. Both will return the same rows with the same performance. Rows are joined when the Customer_Number matches on both tables, but non-matches won't return.
BOTH QUERIES HAVE THE SAME RESULTS AND PERFORMANCE
Both of these syntax techniques bring back the same result set and have the same performance. The INNER
JOIN is considered ANSI. Which one does Outer Joins?
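A sketch of the two syntax styles (Order_Total comes from the Order_Table seen earlier; the alias names are illustrative):

```sql
-- Non-ANSI syntax: tables separated by commas, join condition in the WHERE
SELECT cust.Customer_Number, Order_Total
FROM Customer_Table AS cust, Order_Table AS ord
WHERE cust.Customer_Number = ord.Customer_Number;

-- ANSI syntax: INNER JOIN with an ON clause;
-- only this form extends naturally to outer joins
SELECT cust.Customer_Number, Order_Total
FROM Customer_Table AS cust
INNER JOIN Order_Table AS ord
  ON cust.Customer_Number = ord.Customer_Number;
```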
Finish this join by placing the missing SQL in the proper place!
If a column in the SELECT list is in both tables, you must fully qualify it.
An Inner Join returns matching rows, but did you know an Outer Join returns both matching rows and non-
matching rows? You will understand soon!
The bottom line is that the three rows excluded did not have a matching Dept_No.
This is a LEFT OUTER JOIN. That means that all rows from the LEFT Table will appear in the report regardless of whether a match is found in the right table.
Redshift supports outer joins using both the ANSI syntax and the Oracle syntax. I think Oracle joins are a real
plus!
This is a RIGHT OUTER JOIN. That means that all rows from the RIGHT Table will appear in the report regardless of whether a match is found with the LEFT Table.
Redshift supports outer joins using both the ANSI syntax and the Oracle syntax
This is a FULL OUTER JOIN. That means that all rows from both the RIGHT and LEFT Tables will appear in the report regardless of whether a match is found.
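The outer join forms can be sketched as follows (the Department_Table name and its columns are assumptions for illustration):

```sql
-- LEFT OUTER JOIN: every employee returns, matched or not;
-- non-matches show NULLs from the right table
SELECT emp.Last_Name, dept.Dept_No
FROM Employee_Table AS emp
LEFT OUTER JOIN Department_Table AS dept
  ON emp.Dept_No = dept.Dept_No;

-- FULL OUTER JOIN: all rows from both tables return
SELECT emp.Last_Name, dept.Dept_No
FROM Employee_Table AS emp
FULL OUTER JOIN Department_Table AS dept
  ON emp.Dept_No = dept.Dept_No;
```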
WHICH TABLES ARE THE LEFT AND WHICH ARE THE RIGHT?
Can you list which tables above are left tables and which tables are right tables?
ANSWER - WHICH TABLES ARE THE LEFT AND WHICH ARE THE
RIGHT?
The first table is always the left table and the rest are right tables. The result of the first two tables being joined becomes the left table for the next join.
The additional AND is performed first in order to eliminate unwanted data, so the join is less intensive than joining everything first and then eliminating rows that don't qualify.
The additional WHERE is performed first in order to eliminate unwanted data, so the join is less intensive than
joining everything first and then eliminating.
The additional WHERE is performed last on Outer Joins. All rows will be joined first and then the additional
WHERE clause filters after the join takes place.
The additional AND is performed in conjunction with the ON statement on Outer Joins. This can surprise you. Only Mandee is in Dept_No 100, so she showed up as expected, but an outer join returns the non-matches also. Ouch!!!
This is effectively an INNER JOIN because we are doing a LEFT OUTER JOIN on the Employee_Table and then filtering with a WHERE on a column in the right table!
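A sketch of the difference (table and column names assumed as in the earlier examples):

```sql
-- The AND joins with the ON, so every left-table row still returns;
-- non-matches simply show NULLs from the right table
SELECT emp.Last_Name, dept.Dept_No
FROM Employee_Table AS emp
LEFT OUTER JOIN Department_Table AS dept
  ON emp.Dept_No = dept.Dept_No
 AND dept.Dept_No = 100;

-- The WHERE filters after the join, discarding the non-matches,
-- so the result looks like an inner join
SELECT emp.Last_Name, dept.Dept_No
FROM Employee_Table AS emp
LEFT OUTER JOIN Department_Table AS dept
  ON emp.Dept_No = dept.Dept_No
WHERE dept.Dept_No = 100;
```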
A Product Join is often a mistake! 3 Department rows had an 'm' in their name, so these were joined to every employee, and the information is worthless.
Do these two queries produce the same result? No. Query 1 errors due to ANSI syntax with no ON Clause, but Query 2 Product Joins and brings back junk!
This Cross Join quite often produces information that just isn't worth anything!
A Self Join gives itself 2 different Aliases, which is then seen as two different tables.
A Self Join gives itself 2 different Aliases, which is then seen as two different tables.
QUIZ WILL BOTH QUERIES BRING BACK THE SAME ANSWER SET?
Will both queries bring back the same result set?
Will both queries bring back the same result set? Yes! Because they're both inner joins.
QUIZ WILL BOTH QUERIES BRING BACK THE SAME ANSWER SET?
Will both queries bring back the same result set? NO! The WHERE is performed last.
HOW WOULD YOU JOIN THESE TWO TABLES?
How would you join these two tables together? You can't do it. There is no matching column with like data.
There is no Primary Key/Foreign Key relationship between these two tables. That is why you are about to be
introduced to a bridge table. It is formally called an Associative table or a Lookup table.
SELECT ALL Columns from the Course_Table and Student_Table and Join them
The above queries show both traditional and ANSI form for this three table join.
This is tricky. The only way it works is to place the ON clauses backwards. The first ON Clause represents the
last INNER JOIN and then moves backwards.
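A sketch of the reversed ON clauses (the bridge table Student_Course_Table and its key columns are illustrative names, not necessarily the book's):

```sql
SELECT s.*, c.*
FROM Student_Table AS s
INNER JOIN Student_Course_Table AS sc
INNER JOIN Course_Table AS c
  ON sc.Course_ID = c.Course_ID    -- first ON pairs with the last INNER JOIN
  ON s.Student_ID = sc.Student_ID; -- last ON pairs with the first INNER JOIN
```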
Above is the logical model for the insurance tables showing the Primary Key and Foreign Key relationships
(PK/FK).
Your mission is to write a five table join selecting all columns using ANSI syntax.
Above is the example writing this five table join using ANSI syntax.
Your mission is to write a five table join selecting all columns using Non-ANSI syntax.
Above is the example writing this five table join using Non-ANSI syntax.
QUIZ RE-WRITE THIS PUTTING THE ON CLAUSES AT THE END
SELECT
cla1.*, sub1.*, add1.*, pro1.*, ser1.*
FROM CLAIMS AS cla1
INNER JOIN
SUBSCRIBERS AS sub1
ON cla1.Subscriber_No = sub1.Subscriber_No
AND cla1.Member_No = sub1.Member_No
INNER JOIN
ADDRESSES AS add1
ON sub1.Subscriber_No = add1.Subscriber_No
INNER JOIN
PROVIDERS AS pro1
ON cla1.Provider_No = pro1.Provider_Code
INNER JOIN
SERVICES AS ser1
ON cla1.Claim_Service = ser1.Service_Code ;
Above is the example writing this five table join using ANSI syntax with the ON clauses at the end. We had to
move the tables around also to make this happen. Notice that the first ON clause represents the last two tables
being joined, and then it works backwards.
Chapter 12 - Date Functions
CURRENT_DATE
This example uses the Current_Date to return the current date.
SELECT Current_Date as ANSI_Date;
ANSI_Date
------------
2014-10-04
TIMEOFDAY()
TIMEOFDAY() returns a VARCHAR data type and specifies the weekday, date, and time.
SELECT TIMEOFDAY() ;
timeofday
------------
Mon Oct 6 22:53:50.333525 2014 UTC
Always remember that you are unique, just like everyone else.
Anonymous
The TIMEOFDAY function returns the weekday, date and the time.
SYSDATE RETURNS A TIMESTAMP WITH MICROSECONDS
This example uses the SYSDATE function
to return the full timestamp for the current date.
This example uses the SYSDATE function inside the TRUNC function to return the current date without the time
included.
SELECT TRUNC(SYSDATE) ;
trunc
------------
2014-10-04
The SYSDATE function returns the current date and time according to the system clock on the leader node. The
functions CURRENT_DATE and TRUNC(SYSDATE) produce the same results.
This example uses the GETDATE() function inside the TRUNC function to return the current date without the
time included.
SELECT TRUNC(GETDATE());
trunc
------------
2014-10-04
Because Dates are stored internally on disk as integers, it is easy to add days to the calendar. In the query above, we are adding 60 days to the Order_Date. Also, notice the to_char command, which formats the amount.
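A sketch of the idea (column names from the Order_Table examples; the alias and to_char mask are illustrative):

```sql
SELECT Order_Date,
       Order_Date + 60 AS Due_Date,                  -- integer math adds 60 days
       to_char(Order_Total, '$99,999.99') AS Amount  -- formats the amount
FROM Order_Table;
```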
The ADD_MONTHS function adds a specified number of months to a date or timestamp value. If the date is the last day of the month, or if the resulting month is shorter, the function returns the last day of the month in the result. For other dates, the result contains the same day number as the date expression. The number of months is a positive or negative integer, or any value that implicitly converts to an integer, so you can even use a negative number to subtract months from dates. The DATEADD function provides similar functionality.
Above, we used the TRUNC command to get rid of the time (00:00:00) on the returning answer set. The ADD_MONTHS function adds a specified number of months to a date or timestamp value. If the date is the last day of the month, or if the resulting month is shorter, the function returns the last day of the month in the result. For other dates, the result contains the same day number as the date expression. The number of months is a positive or negative integer, or any value that implicitly converts to an integer, so you can even use a negative number to subtract months from dates. The DATEADD function provides similar functionality.
The Add_Months command adds months to any date. Above, we used a great technique that gives us 1 year. We then showed an even better technique to get 5 years.
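A sketch of these techniques, with TRUNC removing the time portion (the aliases are illustrative):

```sql
SELECT TRUNC(ADD_MONTHS(CURRENT_DATE, 12))   AS Plus_1_Year,
       TRUNC(ADD_MONTHS(CURRENT_DATE, 12*5)) AS Plus_5_Years,
       TRUNC(ADD_MONTHS(CURRENT_DATE, -1))   AS Minus_1_Month;
```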
The DATEADD and ADD_MONTHS functions handle dates that fall at the ends of months differently.
This is the Extract command. It returns a date part, such as a day, month, or year, from a timestamp value or
expression.
Just like the Add_Months, the EXTRACT Command is a Temporal Function or a Time-Based Function.
Just like the Add_Months, the EXTRACT Command is a Temporal Function or a Time-Based Function, and the
above is designed to show how to use it with literal values.
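For instance, EXTRACT against a literal date might look like this sketch:

```sql
SELECT EXTRACT(YEAR  FROM DATE '2014-10-04') AS Yr,  -- 2014
       EXTRACT(MONTH FROM DATE '2014-10-04') AS Mo,  -- 10
       EXTRACT(DAY   FROM DATE '2014-10-04') AS Dy;  -- 4
```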
EXTRACT OF THE MONTH ON AGGREGATE QUERIES
SELECT EXTRACT(Month FROM Order_date)
,COUNT(*) AS Nbr_of_rows
,AVG(Order_Total)
FROM Order_Table
GROUP BY 1
ORDER BY 1 ;
The above SELECT uses the EXTRACT to only display the month and also to control the number of aggregates
displayed in the GROUP BY. Notice the Answer Set headers.
This function uses a datepart (day, week, month etc.) and two target expressions. This function returns the
difference between the two expressions. The expressions must be date or timestamp expressions and they must
both contain the specified datepart. If the second date is later than the first date, the result is positive. If the
second date is earlier than the first date, the result is negative.
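The description above matches Redshift's DATEDIFF function; a minimal sketch with literal dates:

```sql
SELECT DATEDIFF(day,   DATE '2014-10-01', DATE '2014-10-04') AS Days_Diff, -- 3
       DATEDIFF(month, DATE '2014-10-04', DATE '2014-08-04') AS Mos_Diff;  -- -2
```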
The specific part of the date value (year, month, or day, for example) that the datepart function operates on. The
expression must be a date or timestamp expression that contains the specified date_part.
pgdate_part
8
Speak in a moment of anger and you'll deliver the greatest speech you'll ever regret.
Anonymous
The specific part of the date value (year, month, or day, for example) that the DATE_PART function operates on. The expression must be a date or timestamp expression that contains the specified DATE_PART. Notice that the default column name for the DATE_PART function is PGDATE_PART.
DATE_PART ABBREVIATIONS
Below are dateparts for date or timestamp functions. The following table identifies the datepart and timepart
names and abbreviations that are accepted as arguments to the following functions:
Above are the functions for datepart or timepart, their parts, and the acceptable abbreviations.
The to_char command will take a value and convert it to a character string.
CONVERSION FUNCTIONS
Redshift provides some functions that assist in the conversion of data from one type to another.
CONVERSION FUNCTION TEMPLATES
MI Minute (00:59).
SS Second (00:59).
AM, am, A.M., a.m. or PM, pm, P.M., p.m. Meridian indicator (uppercase and lowercase).
BC, bc, B.C., b.c or AD, ad, A.D., a.d. Era indicator (uppercase and lowercase).
WW Week number of the year (1:53) where the first week starts on the first day of the year.
IW ISO week number of the year (The first Thursday of the new year is in week 1.)
CC Century (2 digits).
Q Quarter
FORMATTING A DATE
The to_char command will take a value and convert it to a character string. This includes formatting a date.
Let's find the number of days Tera-Tom had been alive as of his last birthday.
SELECT (date '2012-01-10') - (date '1959-01-10') AS "Tom's Age In Days";
Tom's Age In Days
19358
A DATE - DATE is an interval of days between dates. A DATE + or - an Integer = a DATE. The query above uses the dates the traditional way to deliver the Interval.
Let's find the number of days Tera-Tom had been alive as of his last birthday.
SELECT (date '2012-01-10') - (date '1959-01-10') AS "Tom's Age In Days";
Tom's Age In Days
19358
Let's find the number of years Tera-Tom had been alive as of his last birthday.
SELECT ((date '2012-01-10') - (date '1959-01-10'))/365 "Tom's Age In Years";
Tom's Age In Years
53
A DATE - DATE is an interval of days between dates. A DATE + or - an Integer = a DATE. Both queries above perform the same function, but the top query uses the date functions to find "Days" and the query on the bottom finds "Years".
The next SELECT operation uses entirely ANSI compliant code to show the month and day of the payment due
date in 2 months and 4 days. Notice it uses double quotes to allow reserved words as alias names.
--The following SELECT uses math to extract the three portions of Tom's literal birthday
SELECT to_char(date '2012-01-10','DD') AS Day_portion
,to_char(date '2012-01-10','MM') AS Month_portion
,to_char(date '2012-01-10','YYYY') AS Year_portion ;
It was mentioned earlier that Redshift stores a date as an integer and therefore allows math operations to be
performed on a date. Although the EXTRACT works great and it is ANSI compliant, it is a function. Therefore, it
must be executed and the parameters passed to it to identify the desired portion as data. Then, it must pass back
the answer. As a result, there is additional overhead processing required to use it.
DATE_PART FUNCTION
Compatibility: Redshift Extension. Syntax of DATE_PART:
DATE_PART('<text>',<date-time-timestamp>)
Where <text> can be: 'YEAR', 'MONTH', 'DAY' for a date or a timestamp, and 'HOUR', 'MINUTE', 'SECOND' for a time and a timestamp.
The following SELECT will show DATE_PART with a Date, Time, and Timestamp.
SELECT CURRENT_DATE
,DATE_PART('MONTH',CURRENT_DATE)
,CURRENT_TIME
,DATE_PART('MINUTE',CURRENT_TIME)
,CURRENT_TIMESTAMP
,DATE_PART('SECOND',CURRENT_TIMESTAMP) ;
The DATE_PART function works exactly like EXTRACT. Although the name contains DATE, it also works with
time and time stamp data. Notice the column headers!
SELECT
CURRENT_DATE
,DATE_PART('MONTH',CURRENT_DATE) as "MONTH"
,CURRENT_TIME
,DATE_PART('MINUTE',CURRENT_TIME) as "MINUTE"
,CURRENT_TIMESTAMP
,DATE_PART('SECOND',CURRENT_TIMESTAMP) as "SECOND" ;
The DATE_PART function works exactly like EXTRACT. Although the name contains DATE, it also works with
time and time stamp data. Now notice the column headers!
DATE_TRUNC FUNCTION
Compatibility: Redshift Extension. Syntax of DATE_TRUNC:
DATE_TRUNC('<text>',<date>)
Where <text> can be: 'YEAR', 'MONTH', 'DAY' for a date or a timestamp, and 'HOUR', 'MINUTE', 'SECOND' for a time and a timestamp. Although DAY and SECOND are allowed, they have no impact on the output data; see below.
SELECT CURRENT_TIMESTAMP
,DATE_TRUNC('YEAR',CURRENT_TIMESTAMP) AS Yr_Trunc
,DATE_TRUNC('MONTH',CURRENT_TIMESTAMP) AS Mo_Trunc
,DATE_TRUNC('DAY',CURRENT_TIMESTAMP) AS Da_Trunc;
The DATE_TRUNC function has an interesting capability in that it truncates the portion of a date back to the
first. Notice that the year portion becomes January 1, the month portion becomes August 1, and the day portion
does not change. However, the entire time portion is set back to 12:00:00. When DATE data is used, the time
portion is set to 12:00:00 just like in the above.
DATE_TRUNC('<text>',<date>)
Where <text> can be: 'YEAR', 'MONTH', 'DAY' for a date or a timestamp, and 'HOUR', 'MINUTE', 'SECOND' for a time and a timestamp. Although DAY and SECOND are allowed, they have no impact on the output data; see below.
SELECT CURRENT_TIMESTAMP
,DATE_TRUNC('HOUR',CURRENT_TIMESTAMP) hr_trunc
,DATE_TRUNC('MINUTE',CURRENT_TIMESTAMP) min_trunc
,DATE_TRUNC('SECOND',CURRENT_TIMESTAMP) sec_trunc;
Notice that the hour portion becomes 9:00:00, the minute portion becomes 9:05:00, and the second and date
portions do not change. If TIME was used, there would be no DATE portion as in a TIME STAMP.
MONTHS_BETWEEN FUNCTION
Compatibility: Redshift Extension
The following example uses the MONTHS_BETWEEN with some fixed dates
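The example slide is not reproduced here, but a minimal sketch with fixed dates might look like:

```sql
-- MONTHS_BETWEEN(date1, date2) returns date1 minus date2 in months
SELECT MONTHS_BETWEEN('2014-10-01', '2014-04-01') AS Mos_Between;
```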
ANSI TIME
Redshift has the ANSI time display and TIME data type.
CURRENT_TIME is the ANSI name of the time function.
SELECT CURRENT_TIME;
TIME
17:27:56
SELECT Current_Time
,CURRENT_TIME - 55 as Subtract ;
TIME Subtract
17:27:56 17:27:01
As well as creating a TIME data type, intelligence has been added to the clock software. It can increment or
decrement TIME with the result increasing to the next minute or decreasing from the previous minute based on
the addition or subtraction of seconds.
ANSI TIMESTAMP
TIMESTAMP is a display format, a reserved name, and a data type. It combines the DATE and TIME data types into a single column data type.
SELECT CURRENT_TIMESTAMP
Notice that there is a space between the DATE and TIME portions of a TIMESTAMP. This is a required element
to delimit or separate the day from the hour.
The TIMESTAMP function can be used to convert a date or combination of a date and time into a timestamp.
Syntax for using TIMESTAMP:
TIMESTAMP(<date> [ <time> ] )
SELECT TIMESTAMP(CURRENT_DATE)
,TIMESTAMP(CURRENT_DATE, CURRENT_TIME)
,TIMESTAMP(DATE '2005-10-01', TIME '08:30:05') ;
What a wonderful feature. Redshift allows you to convert a date or a combination of a date and a time into a
Timestamp. The example above shows an example of converting a date, a date and time, and a literal date and
time. This should be all you need.
TO_TIMESTAMP(<date-string> [ <time-string> ] )
TO_TIMESTAMP TO_TIMESTAMP
10/01/2005 8:30:05.204331 10/01/2005 8:30:05.204331
Redshift allows you to convert character strings into a Timestamp. Notice that both answers are exactly the same.
The second parameter is NOT how the data should be output or formatted, but instead it reflects how the string
should be interpreted.
SELECT NOW() ;
NOW
08/03/2012 11:07:10
Redshift allows you to see the date and time with the NOW() function. The next time someone asks for the time
tell them NOW.
SELECT TIMEOFDAY () ;
Answer Set
TIMEOFDAY
Fri Aug 03 11:11:38 2012 EDT
Redshift allows you an extended version of a time stamp that is robust and verbose.
The AGE function returns the interval (discussed later in this chapter) between two time stamps. If you use a
single time stamp, the age function returns the interval between the current time and the time stamp provided.
The interval returned by the age function can include year and month data as well as day and time data.
Syntax of AGE:
AGE(<start-date>,<end-date>)
SELECT CURRENT_TIMESTAMP
,AGE('10-28-2004','7-20-2003')
,AGE(current_timestamp,'7-20-2003')
,AGE('7-20-2003') as AGE2 /* defaults to CURRENT_TIMESTAMP */ ;
To subtract one time stamp from another, use the AGE function.
TIME ZONES
A time zone relative to London (UTC) might be:
LA----------Miami-----------Frankfurt------------Hong Kong
+8:00 +05:00 00:00 -08:00
LA----------Miami-----------Frankfurt------------Hong Kong
+3:00 00:00 -05:00 -13:00
Redshift has the ability to adjust the time and timestamp values to reflect the hours difference between the user's time zone, the system time zone, and the United Kingdom location that was historically called Greenwich Mean Time (GMT). Since the Greenwich observatory has been "decommissioned," the new reference to this same time zone is called Coordinated Universal Time (UTC).
Here, the time zones used are represented from the perspective of the system at EST. In the above, it appears to
be backward. This is because the time zone is set using the number of hours that the system is from the user.
Where xxx is the designation for standard time and yyy is for daylight savings time
A Redshift session can modify the time zone during normal operations without requiring a logoff and logon. At this time, Redshift only recognizes time zone processing stored in a table with a data type of TIME WITH TIME ZONE. Hopefully, it will soon also be added to TIMESTAMP when stored in a table.
USING TIME ZONES
The way time zones are implemented in Redshift is that the session time zone setting adjusts the value returned
by the TIME and TIMESTAMP when referenced in an SQL statement. To make some of the changes more
apparent, the following statements "assume" that they are all run at the same time.
Examples above set the time zone and then query Current_Timestamp simultaneously.
Interval Chart
It's not the size of the dog in the fight, but the size of the fight in the dog.
Archie Griffin
Redshift has added INTERVAL processing; however, it is not ANSI compliant. Intervals are used to perform DATE, TIME and TIMESTAMP arithmetic and conversion.
USING INTERVALS
SELECT Current_Date as Our_Date
,Current_Date + Interval '1' Day as Plus_1_Day
,Current_Date + Interval '3' Month as Plus_3_Months
,Current_Date + Interval '5' Year as Plus_5_Years ;
The afternoon knows what the morning never suspected.
- Swedish Proverb
To use the ANSI syntax for intervals, the SQL statement must be very specific as to what the data values mean
and the format in which they are coded. ANSI standards tend to be lengthier to write and more restrictive as to
what is and what is not allowed regarding the values and their use.
SELECT Date '2012-01-29' as Our_Date
,Date '2012-01-29' + INTERVAL '1' Month as Leap_Year ;
Our_Date Leap_Year
01/29/2012 02/29/2012
The first example works because we added 1 month to the date '2012-01-29' and we got '2012-02-29'. Because this was a leap year, there actually is a date of February 29, 2012. The next example is the real point. We have a date of '2011-01-29' and we add 1 month to that, but there is no February 29 in 2011, so the query fails.
Once the game is over, the king and the pawn go back in the same box.
- Italian Proverb
To use DATE and TIME arithmetic, it is important to keep in mind the results of various operations. The above
chart is your Interval guide.
Actual_Days
4017
The default for all intervals is 2 digits. We received an overflow error because the Actual_Days is 4017. The
second example works because we demanded the output to be 4 digits (the maximum for intervals).
SELECT
(TIME '12:45:01' - TIME '10:10:01') HOUR AS Actual_Hours
,(TIME '12:45:01' - TIME '10:10:01') MINUTE(3) AS Actual_Minutes
,(TIME '12:45:01' - TIME '10:10:01') SECOND(4) AS Actual_Seconds
,(TIME '12:45:01' - TIME '10:10:01') SECOND(4,4) AS Actual_Seconds4
ERROR Interval Field Overflow
The default for all intervals is 2 digits, but notice in the top example, we put in 3 digits for Minute, 4 digits for Second, and 4,4 digits for the Actual_Seconds4. If we had not, we would have received an overflow error as in the bottom example.
Date Two_Year_Ago
06/18/2012 06/18/2010
I know that you believe that you understand what you think I said, but
I am not sure you realize that what you heard is not what I meant.
The CAST function (Convert And Store) is the ANSI method for converting data from one type to another. It can
also be used to convert one INTERVAL to another INTERVAL representation. Although the CAST is normally
used in the SELECT list, it works in the WHERE clause for comparison reasons.
The top query failed because the INTERVAL result defaults to 2 digits and we have a 3-digit answer for the year portion (108). The bottom query fixes that by specifying 3 digits. The biggest advantage in using INTERVAL processing is that SQL written on another system is now compatible.
THE OVERLAPS COMMAND
Compatibility: Redshift Extension
SELECT <literal>
WHERE (<start-date-time>, <end-date-time>) OVERLAPS
(<start-date-time>, <end-date-time>) ;
SELECT 'The Dates Overlap' as Dater
WHERE (DATE '2001-01-01', DATE '2001-11-30') OVERLAPS
(DATE '2001-10-15', DATE '2001-12-31');
When working with dates and times, sometimes it is necessary to determine whether two different ranges have
common points in time. Redshift provides a Boolean function to make this test for you. It is called OVERLAPS;
it evaluates true if multiple points are in common, otherwise it returns a false. The literal is returned because both
date ranges have from October 15 through November 30 in common.
The above SELECT example tests two literal dates and uses the OVERLAPS to determine whether or not to
display the character literal. The literal was not selected because the ranges do not overlap. So, the common
single date of November 30 does not constitute an overlap. When dates are used, 2 days must be involved, and
when time is used, 2 seconds must be contained in both ranges.
The above SELECT example tests two literal dates and uses the OVERLAPS to determine whether or not to
display the character literal:
When using the OVERLAPS function, there are a couple of situations to keep in mind:
1. A single point in time, i.e. the same date, does not constitute an overlap. There must be at least one second of
time in common for TIME or one day when using DATE.
2. When a NULL is used as one of the parameters, the other DATE or TIME constitutes a single point in time rather than a range.
CSUM
This ANSI version of CSUM is SUM() Over. Right now, the syntax wants to see the sum of the Daily_Sales after
it is first sorted by Sale_Date. Rows Unbounded Preceding makes this a CSUM. The ANSI Syntax seems
difficult, but only at first.
The first thing the above query does before calculating is SORT all the rows by Sale_Date. The Sort is located
right after the ORDER BY.
The keywords Rows Unbounded Preceding determines that this is a CSUM. There are only a few different
statements and Rows Unbounded Preceding is the main one. It means start calculating at the beginning row, and
continue calculating until the last row.
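The description above can be sketched with the Sales_Table columns used throughout this chapter:

```sql
SELECT Sale_Date, Daily_Sales,
       SUM(Daily_Sales) OVER (ORDER BY Sale_Date
            ROWS UNBOUNDED PRECEDING) AS SumOver  -- the cumulative sum (CSUM)
FROM Sales_Table;
```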
The third SUMOVER row is 138739.28. That is derived by taking the first row's Daily_Sales (41888.88) and adding it to the SECOND row's Daily_Sales (48850.40). Then, you add that total to the THIRD row's Daily_Sales (48000.00).
You can have more than one SORT KEY. In the top query, Product_ID is the MAJOR Sort, and Sale_Date is the
MINOR Sort.
The PARTITION Statement is how you reset in ANSI. This will cause the SUMANSI to start over (reset) on its
calculating for each NEW Product_ID.
Above are two OLAP statements. Only one has PARTITION BY, so only it resets.
The SUM () Over allows you to get the moving SUM of a certain column.
With a Moving Window of 3, how is the 139350.69 amount derived in the Sum3_ANSI column in the third row?
With a Moving Window of 3, how is the 139350.69 amount derived in the Sum3_ANSI column in the third row?
It is the sum of 48850.40, 54500.22 and 36000.07. The current row of Daily_Sales plus the previous two rows of
Daily_Sales.
The ROWS 2 Preceding gives the MSUM for every 3 rows. The ROWS UNBOUNDED Preceding gives the
continuous MSUM.
Use a PARTITION BY Statement to Reset the ANSI OLAP. Notice it only resets the OLAP command containing
the Partition By statement, but not the other OLAPs.
MOVING AVERAGE
SELECT Product_ID, Sale_Date, Daily_Sales,
AVG(Daily_Sales) OVER (ORDER BY Product_ID, Sale_Date
ROWS 2 Preceding) AS AVG_3
FROM Sales_Table ;
Notice the Moving Window of 3 in the syntax and that it is a 2 in the ANSI version. That is because in ANSI, it is
considered the Current Row and 2 preceding.
Much like the SUM OVER Command, the Average OVER places the sort keys via the ORDER BY keywords.
With a Moving Window of 3, how is the 46450.23 amount derived in the AVG_3_ANSI column in the third row?
With a Moving Window of 3, how is the 43566.91 amount derived in the AVG_3_ANSI column in the fourth
row?
With a Moving Window of 3, how is the 43566.91 amount derived in the AVG_3_ANSI column in the fourth
row? The current row plus Rows 2 Preceding.
The ROWS 2 Preceding gives the MAVG for every 3 rows. The ROWS UNBOUNDED Preceding gives the
continuous MAVG.
Use a PARTITION BY Statement to Reset the ANSI OLAP. The Partition By statement only resets the column
using the statement. Notice that only Continuous resets.
This is the RANK() OVER. It provides a rank for your queries. Notice how you do not place anything within the
() after the word RANK. Default Sort is ASC.
What does the PARTITION Statement in the RANK() OVER do? It resets the rank.
The LIMIT statement limits rows once the Rank has been calculated.
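A sketch pulling the pieces together (Sales_Table columns as above; the LIMIT value is illustrative):

```sql
SELECT Product_ID, Sale_Date, Daily_Sales,
       RANK() OVER (PARTITION BY Product_ID           -- resets per product
                    ORDER BY Daily_Sales DESC) AS Sales_Rank
FROM Sales_Table
ORDER BY Product_ID, Sales_Rank
LIMIT 5;                                              -- limits after the rank
```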
PERCENT_RANK() OVER
SELECT Product_ID, Sale_Date, Daily_Sales,
PERCENT_RANK() OVER (PARTITION BY PRODUCT_ID
ORDER BY Daily_Sales DESC) AS PercentRank1
FROM Sales_Table WHERE Product_ID in (1000, 2000) ;
We now have added a Partition statement which resets on Product_ID so this produces 7 rows for each of our
Product_IDs.
PERCENT_RANK is just like RANK; however, it gives you the rank as a percentage of the other rows, ranging up to 100%.
After the sort, the Max() Over shows the Max Value up to that point.
The largest value is 64300.00 in the column MaxOver. Once it was evaluated, it did not continue until the end
because of the PARTITION BY reset.
The last two answers (MinOver) are blank, so you can fill in the blank.
The ROW_NUMBER() Keyword(s) caused Seq_Number to increase sequentially. Notice that this does NOT
have a Rows Unbounded Preceding, and it still works!
The above information is the syntax for standard deviation using the OVER clause. In order to provide the
moving functionality, it is necessary to have a method that designates the number of rows to include in the
STDDEV. It allows the function to calculate values from columns contained in rows that are before the current
row and also rows that are after the current row.
The above information is the syntax for VARIANCE and VAR_SAMP using the OVER clause. In order to
provide the moving functionality, it is necessary to have a method that designates the number of rows to include
in the VARIANCE. It allows the function to calculate values from columns contained in rows that are before the
current row and also rows that are after the current row.
The above information is an introduction to the Variance functions used in conjunction with an OVER statement.
Wow! Another amazing example. The above example uses all three standard deviation functions to produce
output sorting on the sales date for the dates in September.
The above provides information and the syntax for FIRST_VALUE and LAST_VALUE.
USING FIRST_VALUE
SELECT Last_name, first_name, dept_no
,FIRST_VALUE(first_name)
OVER (ORDER BY dept_no, last_name desc
rows unbounded preceding) AS "First All"
,FIRST_VALUE(first_name)
OVER (PARTITION BY dept_no
ORDER BY dept_no, last_name desc
rows unbounded preceding) AS "First Partition"
FROM SQL_Class.Employee_Table;
The above example uses FIRST_VALUE to show you the very first first_name returned. It also uses the keyword
Partition to show you the very first first_name returned in each department.
USING LAST_VALUE
SELECT Last_name, first_name, dept_no
,LAST_VALUE(first_name)
OVER (ORDER BY dept_no, last_name desc
rows unbounded preceding) AS "Last All"
,LAST_VALUE(first_name)
OVER (PARTITION BY dept_no
ORDER BY dept_no, last_name desc
rows unbounded preceding) AS "Last Partition"
FROM sql_class.Employee_Table;
FIRST_VALUE and LAST_VALUE are good to use anytime you need to propagate a value from one row to all or multiple rows based on a sorted sequence. However, the output from the LAST_VALUE function appears to be incorrect, or at least a little surprising, until you understand a few concepts. The SQL request specifies "rows unbounded preceding", so the window frame for each row ends at the current row. LAST_VALUE looks at the last row of that frame, and the current row is always the last row of its own frame, and therefore each row's own value appears in the output.
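A minimal sketch with Python's sqlite3 (window functions need SQLite 3.25+) makes the frame behavior visible side by side; the three-name table is invented:

```python
import sqlite3

# With ROWS UNBOUNDED PRECEDING, the frame ends at the current row, so
# LAST_VALUE returns the current row's own value. Extending the frame to
# UNBOUNDED FOLLOWING gives the true last value of the sorted sequence.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT)")
con.executemany("INSERT INTO t VALUES (?)", [("Amy",), ("Bob",), ("Cal",)])

rows = con.execute("""
    SELECT name,
           LAST_VALUE(name) OVER (ORDER BY name
               ROWS UNBOUNDED PRECEDING) AS frame_ends_here,
           LAST_VALUE(name) OVER (ORDER BY name
               ROWS BETWEEN UNBOUNDED PRECEDING
                        AND UNBOUNDED FOLLOWING) AS true_last
    FROM t
    ORDER BY name
""").fetchall()
```

The first column echoes each row back; only the widened frame reports "Cal" everywhere.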
The above provides information and the syntax for LAG and LEAD.
USING LEAD
SELECT last_name, dept_no,
lead(dept_no)
over (order by dept_no, last_name) as "Lead All",
lead(dept_no) over (Partition by dept_no
order by dept_no, last_name) as "Lead Partition"
FROM employee_table;
As you can see, LEAD brings back the value from the next row, except for the last row, which has no row following it. The offset value was not specified in this example, so it defaulted to a value of 1 row.
USING LAG
SELECT last_name, dept_no,
lag(dept_no)
over (order by dept_no, last_name) as "Lag All",
lag(dept_no)
over (Partition by dept_no
order by dept_no, last_name) as "Lag Partition"
FROM employee_table;
From the example above, you see that LAG takes the value from a previous row and makes it available in the current row. For LAG, the first row(s) will contain a null based on the offset value; here it defaulted to 1. The first null comes from the function itself (there is no preceding row), whereas the second row gets its null because the value it pulls from the first row is null.
For this example, the first two rows have a null because there is no row two rows before them. The number of nulls will always be the same as the offset value. There is a third null because Jones's Dept_No is null.
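LEAD and LAG are easy to see side by side. Here is a minimal sketch with Python's sqlite3 (SQLite 3.25+ for window functions); the three rows are invented, not the book's Employee_Table:

```python
import sqlite3

# LEAD pulls dept_no from the NEXT row; LAG pulls it from the PREVIOUS
# row. With the default offset of 1, the last row's LEAD and the first
# row's LAG are NULL (None in Python).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (last_name TEXT, dept_no INT)")
con.executemany("INSERT INTO emp VALUES (?, ?)",
                [("Adams", 100), ("Baker", 100), ("Chen", 200)])

rows = con.execute("""
    SELECT last_name,
           LEAD(dept_no) OVER (ORDER BY dept_no, last_name) AS next_dept,
           LAG(dept_no)  OVER (ORDER BY dept_no, last_name) AS prev_dept
    FROM emp
    ORDER BY dept_no, last_name
""").fetchall()
```

Adams has no previous row, so its LAG is None; Chen has no following row, so its LEAD is None.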
Chapter 14 - Temporary Tables
I cannot imagine any condition which would cause this ship to founder. Modern shipbuilding has gone beyond
that.
- E. J. Smith, Captain of the Titanic
The SELECT statement that creates and populates the Derived table is always inside parentheses.
A derived table will always have a SELECT query to materialize the derived table with data. The
SELECT query always starts with an open parenthesis and ends with a close parenthesis.
The derived table must be given a name. Above we called our derived table TeraTom.
You will need to define (alias) the columns in the derived table. Above we allowed Dept_No to default
to Dept_No, but we had to specifically alias AVG(Salary) as AVGSAL.
Every derived table must have the three components listed above.
NAMING THE DERIVED TABLE
In the example above, TeraTom is the name we gave the Derived Table. It is mandatory that you always name the table, or the query errors.
AVGSAL is the name we gave to the column in our Derived Table that we call TeraTom. Our SELECT (which
builds the columns) shows we are only going to have one column in our derived table, and we have named that
column AVGSAL.
Our example above shows the data in the derived table named TeraTom. This query allows us to see each
employee and the plus or minus avg of their salary compared to the other workers in their department.
TeraTom
Dept_No AVGSAL
? 32800.50
10 64300.00
100 48850.00
200 44944.44
300 40200.00
400 48333.33
In a derived table, you will always have a SELECT query in parentheses, and you will always name the table.
You have options when aliasing the columns. As in the example above, you can let normal columns default to
their current name.
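The same TeraTom pattern works anywhere derived tables are supported. Here is a minimal sketch with Python's sqlite3; the table and salary figures are invented for illustration:

```python
import sqlite3

# Sketch of the TeraTom derived table: an aliased subquery in the FROM
# clause computes AVG(Salary) per department, then joins back to each
# employee to show salary versus the department average.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee_table (last_name TEXT, dept_no INT, salary REAL)")
con.executemany("INSERT INTO employee_table VALUES (?, ?, ?)",
                [("Jones", 100, 50000.0),
                 ("Smith", 100, 60000.0),
                 ("Patel", 200, 40000.0)])

rows = con.execute("""
    SELECT e.last_name,
           e.salary - TeraTom.AVGSAL AS vs_dept_avg
    FROM employee_table AS e
    JOIN (SELECT dept_no, AVG(salary) AS AVGSAL
          FROM employee_table
          GROUP BY dept_no) AS TeraTom          -- the derived table
      ON e.dept_no = TeraTom.dept_no
    ORDER BY e.last_name
""").fetchall()
```

All three components are present: the SELECT in parentheses, the mandatory name TeraTom, and the aliased column AVGSAL.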
When using the WITH command, we can CREATE our Derived table before running the main query. The only issue here is that you can have only one WITH clause, although it can define multiple tables separated by commas.
Now, the lower portion of the query refers to TeraTom almost as if it were a permanent table, but it is not!
WITH TeraTom AS (SELECT * FROM Student_Table)
SELECT *
FROM TeraTom
ORDER BY 1
LIMIT 5;
The following example shows the simplest possible case of a query that contains a WITH clause. The WITH
query named TeraTom selects all of the rows from the Student_Table. The main query, in turn, selects all of the
rows from TeraTom. The TeraTom table exists only for the life of the query.
The following example shows two tables created from the With statement.
6) Why were the join keys named differently? If both were named Dept_No, we would get an error unless we fully qualified them.
ON E.Employee_No = S.Employee_No
ORDER BY T.Dept_No;
Create Table Syntax creates a new table in the current database. The owner of this table is the issuer of the
CREATE TABLE command.
When you create a temporary table, it is visible only within the current session. The table is automatically dropped at the end of the session. A derived table only lasts the life of a single query, but a temporary table lasts the entire session. This allows a user to run hundreds of queries against the temporary table. A temporary table can have the same name as a permanent table, but I don't recommend this. You don't give a temporary table a schema because it is automatically associated with the user's session. Once the session is over, the table and data are dropped. If the user tries to query the table in another session, the system won't recognize the table. In other words, the table doesn't exist outside of the current session it was created in.
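Session-only visibility is easy to demonstrate. Here is a minimal sketch using Python's sqlite3, whose TEMP tables behave like the Redshift temporary tables described here; the shared-cache in-memory database and every name in it are invented for the illustration:

```python
import sqlite3

# Two connections ("sessions") share one in-memory database, but each
# connection's TEMP namespace is private, just as each Redshift session's
# temporary tables are private to that session.
uri = "file:sessiondemo?mode=memory&cache=shared"
con1 = sqlite3.connect(uri, uri=True)  # session 1
con2 = sqlite3.connect(uri, uri=True)  # session 2

con1.execute("CREATE TEMP TABLE scratch (n INT)")
con1.execute("INSERT INTO scratch VALUES (1)")

# Session 1 can query its temporary table freely.
count = con1.execute("SELECT COUNT(*) FROM scratch").fetchone()[0]

# Session 2 cannot see the temp table at all: "no such table".
try:
    con2.execute("SELECT * FROM scratch")
    visible_elsewhere = True
except sqlite3.OperationalError:
    visible_elsewhere = False
```

The second session fails exactly the way the text describes: outside its own session, the table doesn't exist.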
Above are some examples that allow you to define a different distkey, diststyle, and sortkey. Users (by default) are granted permission to create temporary tables by their automatic membership in the PUBLIC group. To remove the privilege for any users to create temporary tables, revoke the TEMP permission from the PUBLIC group, and then explicitly grant the permission to create temporary tables to specific users or groups of users.
PERFORMING A DEEP COPY
A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table. If a table
has a large unsorted region, a deep copy is much faster than a vacuum. You can choose one of four methods to
create a copy of the original table:
1) Use the original table DDL. This is the best method for perfect reproduction.
2) Use CREATE TABLE AS (CTAS). If the original DDL is not available, you can use CREATE TABLE AS to
create a copy of current table, then rename the copy. The new table will not inherit the encoding, distkey, sortkey,
not null, primary key, and foreign key attributes of the parent table.
3) Use CREATE TABLE LIKE. If the original DDL is not available, you can use CREATE TABLE LIKE to
recreate the original table. The new table will not inherit the primary key and foreign key attributes of the parent
table. The new table does, though, inherit the encoding, distkey, sortkey, and not null attributes of the parent
table.
4) Create a temporary table and truncate the original table. If you need to retain the primary key and foreign key
attributes of the parent table, you can use CTAS to create a temporary table, then truncate the original table and
populate it from the temporary table. This method is slower than CREATE TABLE LIKE because it requires two
insert statements.
A deep copy recreates and repopulates a table by using a bulk insert, which automatically sorts the table. If a table has a large unsorted region, a deep copy is much faster than a vacuum. The difference is that you cannot make concurrent updates during a deep copy operation, as you can during a vacuum. The next four slides will show each technique with an example.
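The overall shape of a deep copy (method 1, reusing the original DDL) can be sketched with Python's sqlite3. Table names, columns, and values are invented, and the ORDER BY stands in for the automatic re-sort a Redshift bulk insert performs:

```python
import sqlite3

# Sketch of the deep-copy pattern: create a copy, bulk-insert sorted rows,
# drop the original, rename the copy back to the original name.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_table (product_id INT, daily_sales REAL)")
con.executemany("INSERT INTO sales_table VALUES (?, ?)",
                [(2000, 48850.40), (1000, 54500.22)])

# 1. Create the copy with the original CREATE TABLE DDL.
con.execute("CREATE TABLE sales_table_copy (product_id INT, daily_sales REAL)")
# 2. Bulk insert the rows (sorted) into the copy.
con.execute("INSERT INTO sales_table_copy "
            "SELECT * FROM sales_table ORDER BY product_id")
# 3. Drop the original table.
con.execute("DROP TABLE sales_table")
# 4. Rename the copy back to the original name.
con.execute("ALTER TABLE sales_table_copy RENAME TO sales_table")

rows = con.execute("SELECT * FROM sales_table ORDER BY product_id").fetchall()
```

After the rename, queries against sales_table see the freshly sorted copy under the original name.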
2. Use an INSERT INTO ... SELECT statement to populate the copy with data from the original table.
4. Use an ALTER TABLE statement to rename the copy to the original table name.
A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table.
3. Use an ALTER TABLE statement to rename the new table to the original table.
The following example performs a deep copy on the Sales_Table using a duplicate of the Sales_Table named Sales_Table_Copy.
A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table.
2. Use an INSERT INTO ... SELECT statement to copy the rows from the current table to the new table.
4. Use an ALTER TABLE statement to rename the new table to the original table.
The following example performs a deep copy on the Sales_Table using a duplicate of the Sales_Table named Sales_Table_Copy.
A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table.
3. Use an INSERT INTO ... SELECT statement to copy the rows from the temporary table to the original table.
TRUNCATE Sales_Table ;
A deep copy recreates and repopulates a table by using a bulk insert which automatically sorts the table.
An invasion of Armies can be resisted, but not an idea whose time has come.
- Victor Hugo
This query is very simple and easy to understand. It uses an IN List to find all Employees who are in Dept_No
100 or Dept_No 200.
Duplicate values are ignored here. We got the same rows back as before, and it is as if the system ignored the
duplicate values in the IN List. That is exactly what happened.
THE SUBQUERY
The query above is a Subquery which means there are multiple queries in the same SQL. The bottom query runs
first, and its purpose in life is to build a distinct list of values that it passes to the top query. The top query then
returns the result set. This query solves the problem: Show all Employees in Valid Departments!
The bottom query runs first and builds a distinct IN list. Then the top query runs using the list.
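The two-step flow is easy to watch in miniature. Here is a sketch with Python's sqlite3; the tables, names, and the invalid department 999 are invented for illustration:

```python
import sqlite3

# The bottom query builds a distinct IN list of valid Dept_Nos; the top
# query keeps only the employees whose dept_no appears in that list.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employee_table (last_name TEXT, dept_no INT);
    CREATE TABLE department_table (dept_no INT, dept_name TEXT);
    INSERT INTO employee_table VALUES ('Jones', 100), ('Smith', 200), ('Ghost', 999);
    INSERT INTO department_table VALUES (100, 'Sales'), (200, 'HR');
""")

rows = con.execute("""
    SELECT last_name
    FROM employee_table
    WHERE dept_no IN (SELECT dept_no FROM department_table)
    ORDER BY last_name
""").fetchall()
```

'Ghost' sits in a department that isn't in the Department_Table, so the subquery filters it out.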
A great question was asked above. Do you know the key to answering? Turn the page!
If you only want to see a report where the final result set has only columns from one table, use a Subquery.
Obviously, if you need columns on the report where the final result set has columns from both tables, you have to
do a Join.
Here is your opportunity to show how smart you are. Write a Subquery that will bring back everything from the
Customer_Table if the customer has placed an order in the Order_Table. Good luck! Advice: Look for the
common key among both tables!
The common key among both tables is Customer_Number. The bottom query runs first and delivers a distinct list
of Customer_Numbers which the top query uses in the IN List!
QUIZ- WRITE THE MORE DIFFICULT SUBQUERY
Here is your opportunity to show how smart you are. Write a Subquery that will bring back everything from the
Customer_Table if the customer has placed an order in the Order_Table that is greater than $10,000.00.
Another opportunity knocking! Would someone please answer the query door?
Another opportunity knocking! This is a tough one, and only the best get this written correctly.
The table name from the top query and the table name from the bottom query are given a different alias.
The bottom query WHERE clause correlates Dept_No from the top and bottom queries.
The bottom query is run one time for each distinct value delivered from the top query.
SELECT *
FROM Employee_Table as EE
WHERE Salary > (
SELECT AVG(Salary)
FROM Employee_Table as EEEE
WHERE EE.Dept_No = EEEE.Dept_No) ;
A correlated subquery breaks all the rules. It is the top query that runs first. Then, the bottom query is run one time for each distinct value of the column in the bottom WHERE clause. In our example, that column is Dept_No, because the WHERE clause is comparing the column Dept_No. After the top query runs and brings back its rows, the bottom query will run one time for each distinct Dept_No. If this is confusing, it is not you. These take a little time to understand, but I have a plan to make you an expert. Keep reading!
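The book's salary-versus-department-average query runs unchanged in most SQL engines. Here is a minimal sketch with Python's sqlite3; the three employees and their salaries are invented:

```python
import sqlite3

# Correlated subquery: for each employee row (ee), the inner query
# computes the average salary of that employee's own department.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee_table (last_name TEXT, dept_no INT, salary REAL)")
con.executemany("INSERT INTO employee_table VALUES (?, ?, ?)",
                [("Low", 100, 40000.0),
                 ("High", 100, 60000.0),
                 ("Solo", 200, 50000.0)])

rows = con.execute("""
    SELECT last_name
    FROM employee_table AS ee
    WHERE salary > (SELECT AVG(salary)
                    FROM employee_table AS eeee
                    WHERE ee.dept_no = eeee.dept_no)
""").fetchall()
```

Only "High" beats its department's average of 50,000; "Solo" exactly equals its own average, and equal is not greater than.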
Select all columns in the Sales_Table if the Daily_Sales column is greater than the Average Daily_Sales within
its own Product_ID.
Another opportunity knocking! This is your second chance. I will even give you a third chance.
Select all columns in the Sales_Table if the Daily_Sales column is greater than the Average Daily_Sales within
its own Sale_Date.
Another opportunity knocking! There is just one minor adjustment and you are home free.
Another opportunity knocking! There is just one minor adjustment and you are home free.
Another opportunity to show your brilliance is ready for you to make it happen.
Get ready to be amazed at either yourself or the Answer on the next page!
This is how you utilize multiple parameters in a Subquery! Turn the page for more.
The bottom query runs first returning two columns. Next page for more info!
The IN list is built and the top query can now process for the final Answer Set.
Good luck in writing this. Remember that this will involve multiple Subqueries.
We really didn't place a new row inside the Order_Table with a NULL value for the Customer_Number column, but in theory, if we had, how many rows would return?
The answer is that no rows come back. This is because when you have a NULL value in a NOT IN list, the system doesn't know whether any value equals the NULL, so every comparison is unknown and it returns nothing.
How many rows return NOW from the query? 1 Acme Products
You can utilize a WHERE clause that tests to make sure Customer_Number IS NOT NULL. This should be used
when a NOT IN could encounter a NULL.
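Both the trap and the IS NOT NULL fix can be shown in a few lines with Python's sqlite3. The data below is invented (except for the Acme Products name from the text), with one deliberately NULL order row:

```python
import sqlite3

# NOT IN against a list containing NULL returns no rows at all; screening
# the NULL out with IS NOT NULL restores the expected answer.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer_table (customer_number INT, customer_name TEXT);
    CREATE TABLE order_table (customer_number INT);
    INSERT INTO customer_table VALUES (1, 'Acme Products'), (2, 'Billy''s Best');
    INSERT INTO order_table VALUES (2), (NULL);
""")

# With a NULL in the list, every NOT IN test evaluates to unknown.
no_rows = con.execute("""
    SELECT customer_name FROM customer_table
    WHERE customer_number NOT IN (SELECT customer_number FROM order_table)
""").fetchall()

# Removing the NULL first brings back the customer with no orders.
fixed = con.execute("""
    SELECT customer_name FROM customer_table
    WHERE customer_number NOT IN (SELECT customer_number FROM order_table
                                  WHERE customer_number IS NOT NULL)
""").fetchall()
```

The first query returns an empty set; the second returns the one customer who never ordered.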
The EXISTS command will determine via a Boolean if something is True or False. If a customer placed an order, it EXISTS, and using the Correlated EXISTS statement, only customers who have placed an order will return in the answer set. EXISTS is different from IN, as it is less restrictive, as you will soon understand.
Only customers who placed an order return with the above Correlated EXISTS.
Use NOT EXISTS to find which Customers have NOT placed an Order?
SELECT Customer_Number, Customer_Name
FROM Customer_Table as Top1
WHERE NOT EXISTS
(SELECT * FROM Order_Table as Bot1
Where Top1.Customer_Number = Bot1.Customer_Number ) ;
Customer_Number Customer_Name
The only customer who did NOT place an order was Acme Products.
QUIZ HOW MANY ROWS COME BACK FROM THIS NOT EXISTS?
A NULL value in a list for queries with NOT IN returned nothing, but you must now decide if that is also true for
the NOT EXISTS. How many rows will return?
ANSWER HOW MANY ROWS COME BACK FROM THIS NOT EXISTS?
NOT EXISTS is unaffected by a NULL in the list; that's why it is more flexible!
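You can prove the contrast with the same invented data used for the NOT IN trap. A sketch with Python's sqlite3:

```python
import sqlite3

# NOT EXISTS correlates row by row, so the NULL order row simply never
# matches any customer; it cannot poison the whole result the way a NULL
# in a NOT IN list does.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer_table (customer_number INT, customer_name TEXT);
    CREATE TABLE order_table (customer_number INT);
    INSERT INTO customer_table VALUES (1, 'Acme Products'), (2, 'Billy''s Best');
    INSERT INTO order_table VALUES (2), (NULL);
""")

rows = con.execute("""
    SELECT customer_name
    FROM customer_table AS top1
    WHERE NOT EXISTS (SELECT *
                      FROM order_table AS bot1
                      WHERE top1.customer_number = bot1.customer_number)
""").fetchall()
```

Even with the NULL present, the customer without an order still comes back.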
Chapter 16 - Substrings and Positioning Functions
It's always been and always will be the same in the world: the horse does the work and the coachman is tipped.
- Anonymous
Both queries trim both the leading and trailing spaces from Last_Name.
When you use the TRIM command on a column, that column will have all beginning and ending spaces
removed.
The above example removed the trailing y from the First_Name and the trailing g from the Last_Name.
Remember that this is case sensitive.
First_Name Quiz
Squiggy qui
John ohn
Richard ich
Herbert erb
Mandee and
Cletus let
William ill
Billy ill
Loraine ora
This is a SUBSTRING. The substring is passed two parameters, and they are the starting position of the string
and the number of positions to return (from the starting position). The above example will start in position 2 and
go for 3 positions!
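The same two-parameter behavior can be checked with Python's sqlite3, which spells the function SUBSTR but treats the parameters the same way (start position, then length, with the length optional):

```python
import sqlite3

# SUBSTR('Squiggy', 2, 3) starts at position 2 and takes 3 characters;
# omitting the length runs to the end of the string.
con = sqlite3.connect(":memory:")
row = con.execute(
    "SELECT SUBSTR('Squiggy', 2, 3), SUBSTR('Squiggy', 2)").fetchone()
```

Starting at the 'q' in position 2, three characters gives 'qui', and with no length the rest of the name comes along.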
HOW SUBSTRING WORKS WITH NO ENDING POSITION
First_Name GoToEnd
Squiggy quiggy
John ohn
Richard ichard
Herbert erbert
Mandee andee
Cletus letus
William illiam
Billy illy
Loraine oraine
If you don't tell the Substring the end position, it will go all the way to the end.
First_Name Before1
Squiggy Squig
John John
Richard Richa
Herbert Herbe
Mandee Mande
Cletus Cletu
William Willi
Billy Billy
Loraine Lorai
A starting position of zero moves one space in front of the beginning. Notice that our FOR length is 6, so Squiggy turns into Squig. The point being made here is that both the starting and ending positions can move backwards, which will come in handy as you see other examples.
First_Name Before2
Squiggy S
John J
Richard R
Herbert H
Mandee M
Cletus C
William W
Billy B
Loraine L
A starting position of -1 moves two spaces in front of the beginning. Notice that our FOR length is 3, so each name delivers only the first initial. The point being made here is that both the starting and ending positions can move backwards, which will come in handy as you see other examples.
First_Name WhatsUp
Squiggy
John
Richard
Herbert
Mandee
Cletus
William
Billy
Loraine
In our example above, we start in position 3, but we go for zero positions, so nothing is delivered in the column. That is what's up!
This is the position counter. What it will do is tell you what position a letter is in. Why did Jones have a 4 in the result set? The 'e' was in the 4th position. Why did Smith get a zero for both columns? There is no 'e' in Smith and no 'f' in Smith. If there are two 'f's, only the first occurrence is reported.
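The same position counting is available in sqlite3 as INSTR, which returns the 1-based position of the first occurrence, or 0 when the letter is absent:

```python
import sqlite3

# INSTR mirrors the position counter: 1-based position of the first hit,
# zero when the character never appears.
con = sqlite3.connect(":memory:")
row = con.execute(
    "SELECT INSTR('Jones', 'e'), INSTR('Smith', 'e'), INSTR('Smith', 'f')"
).fetchone()
```

'e' is the 4th letter of Jones, and Smith scores zero on both lookups, exactly as described above.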
What is the Starting position of the Substring in the above query? Hint: This only looks for a Dept_Name that
has two words or more.
What is the Starting position of the Substring in the above query? See above!
Notice we only had three rows come back. That is because our WHERE looks only for Department_Names that have multiple words. Then, notice that our starting position for the Substring is a subquery that looks for the first space. Then, it adds 1 to that position, and we have a starting position for the 2nd word. We don't give a FOR length, so the substring runs to the end of the name.
It has 3 words
Why did only one row come back? It's the only Department_Name with three words. The SUBSTRING and the WHERE clause both look for the first space, and if they find it, they look for the second space. If they find that, they add 1 to it, and the Starting Position is the third word. There is no FOR position, so it defaults to go to the end.
CONCATENATION
See those || ? Those represent concatenation, which allows you to combine multiple columns into one column. The | (pipe symbol) on your keyboard is just above the ENTER key. Don't put a space in between; just put two pipe symbols together. In this example, we have combined the first name, then a single space, and then the last name to get a new column called Full_Name, like Squiggy Jones.
Of the three items being concatenated together, what is the first item of concatenation in the example above? The
first initial of the First_Name. Then, we concatenated a literal space and a period. Then, we concatenated the
Last_Name.
Why did we TRIM the Last_Name? To get rid of the spaces, or the output would have looked odd. How many items are being concatenated in the example above? There are 4 items concatenated. We start with the Last_Name (after we trim it), then we have a single space, then we have the first initial of the First_Name, and then we have a period.
TROUBLESHOOTING CONCATENATION
What happened above to cause the error? Can you see it? The pipe symbols || have a space between them, like | |, when it should be ||. It is a tough one to spot, so be careful.
DECLARING A CURSOR
The above example declares a cursor named TeraTom to select sales information from the Sales_Table and then
fetch rows from the result set using the cursor.
"The difference between genius and stupidity is that genius has its limits"
- Albert Einstein
Sample_Table
Class_Code Grade_Pt
Fr 0
Use the fake table above called Sample_Table, and try and predict what the Answer will be if this query was
running on the system.
ANSWER TO QUIZ WHAT WOULD THE ANSWER BE?
Sample_Table
Class_Code Grade_Pt
Fr 0
You get an error when you DIVIDE by ZERO! Let's turn the page and fix it!
SELECT Class_Code
,Grade_Pt / ( NULLIFZERO (Grade_pt) * 2 ) AS Math1
FROM Sample_Table;
What the NULLIFZERO does is turn a zero into a NULL. So, the answer set you'd get from this is a simple FR, and then a NULL value, usually represented by a ?. If you have a calculation where a ZERO could kill the operation, and you don't want that, you can use the NULLIFZERO command to convert any zero value to a NULL value.
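NULLIFZERO is a Redshift/Teradata spelling; the standard equivalent is NULLIF(column, 0), which most engines (including SQLite) understand. Here is a sketch with Python's sqlite3 using an invented one-row table; note that SQLite quietly returns NULL on division by zero rather than erroring, so the sketch just shows the NULLIF effect on the calculation:

```python
import sqlite3

# NULLIF(grade_pt, 0) turns the zero into NULL, so the whole expression
# evaluates to NULL instead of dividing by zero.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sample_table (class_code TEXT, grade_pt REAL)")
con.execute("INSERT INTO sample_table VALUES ('FR', 0)")

row = con.execute("""
    SELECT class_code,
           grade_pt / (NULLIF(grade_pt, 0) * 2) AS math1
    FROM sample_table
""").fetchone()
```

The row comes back as FR with a NULL (None in Python) where the doomed division would have been.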
Okay! Time to show me your brilliance! What would the Answer Set produce?
Here is the answer set! How'd you do? The NULLIFZERO command found a zero in Cust_No, so it made it Null. The others were not zero, so they retained their value. The only time NULLIFZERO changes data is when it finds a zero, and then it changes it to null.
Fill in the Answer Set above after looking at the table and the query.
You can also use NULLIF(). What you are asking Redshift to do is to NULL the answer if the COLUMN matches the number in the parentheses. What would the above Answer Set produce from your analysis?
Look at the answers above, and if it doesn't make sense, go over it again until it does.
Fill in the Answer Set above after looking at the table and the query.
This is the ZEROIFNULL. What it will do is put a zero wherever a NULL shows up. What would the Answer Set produce?
The answer set placed a zero in the place of the NULL Acc_Balance, but the other values didn't change because they were NOT Null.
THE COALESCE COMMAND
SELECT Last_Name
,COALESCE (Home_Phone, Work_Phone, Cell_Phone) as Phone
FROM Sample_Table ;
Last_Name Phone
Fill in the Answer Set above after looking at the table and the query.
Coalesce returns the first non-Null value in a list, and if all values are Null, returns Null.
SELECT Last_Name
,COALESCE (Home_Phone, Work_Phone, Cell_Phone) as Phone
FROM Sample_Table ;
Last_Name Phone
Jones 555-1234
Patel 456-7890
Gonzales 354-0987
Nguyen ?
Coalesce returns the first non-Null value in a list, and if all values are Null, returns Null.
SELECT Last_Name
,COALESCE (Home_Phone, Work_Phone, Cell_Phone, 'No Phone') as Phone
FROM Sample_Table ;
Last_Name Phone
Fill in the Answer Set above after looking at the table and the query.
Coalesce returns the first non-Null value in a list, and if all values are Null, returns Null. Since we decided in the above query we don't want NULLs, notice we have placed a literal 'No Phone' in the list. How will this affect the Answer Set?
SELECT Last_Name
,COALESCE (Home_Phone, Work_Phone, Cell_Phone, 'No Phone') as Phone
FROM Sample_Table ;
Last_Name Phone
Jones 555-1234
Patel 456-7890
Gonzales 354-0987
Nguyen No Phone
Answers are above! We put a literal in the list, so there's no chance of a NULL returning.
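COALESCE works the same way everywhere. Here is a minimal sketch with Python's sqlite3 using an invented two-row phone table:

```python
import sqlite3

# COALESCE returns the first non-NULL argument; the literal 'No Phone'
# at the end guarantees something always comes back.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sample_table "
            "(last_name TEXT, home_phone TEXT, work_phone TEXT, cell_phone TEXT)")
con.executemany("INSERT INTO sample_table VALUES (?, ?, ?, ?)",
                [("Jones", "555-1234", None, None),
                 ("Nguyen", None, None, None)])

rows = con.execute("""
    SELECT last_name,
           COALESCE(home_phone, work_phone, cell_phone, 'No Phone') AS phone
    FROM sample_table
    ORDER BY last_name
""").fetchall()
```

Jones's home phone wins on the first row; with every phone NULL, Nguyen falls through to the literal.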
Data can be converted from one type to another by using the CAST function. As long as the data involved does
not break any data rules (i.e. placing alphabetic or special characters into a numeric data type), the conversion
works. The name of the CAST function comes from the Convert And STore operation that it performs.
The first CAST truncates the five characters (left to right) to form the single character A. In the second CAST,
the integer 128 is converted to three characters and left justified in the output. The 127 was initially stored in a
SMALLINT (5 digits - up to 32767) and then converted to an INTEGER. Hence, it uses 11 character positions
for its display, ten numeric digits and a sign (positive assumed) and right justified as numeric.
121 122
The value of 121.53 was initially stored as a DECIMAL as 5 total digits with 2 of them to the right of the
decimal point. Then, it is converted to a SMALLINT using CAST to remove the decimal positions. Therefore, it
truncates data by stripping off the decimal portion. It does not round data using this data type. On the other hand,
the CAST in the fifth column called Rounder is converted to a DECIMAL as 3 digits with no digits (3,0) to the
right of the decimal, so it will round data values instead of truncating. Since .53 is greater than .5, it is rounded
up to 122.
The column Chopped takes Order_Total (a DECIMAL(10,2)) and CASTs it as an integer, which chops off the decimals. Rounded CASTs Order_Total as a DECIMAL(5,0), which takes the decimals and rounds up if the decimal is .50 or above.
Sample_Table
Course_Name Credits
Tera-Tom on SQL 1
SELECT Course_Name
,CASE Credits
WHEN 1 THEN 'One Credit'
WHEN 2 THEN 'Two Credits'
WHEN 3 THEN 'Three Credits'
END AS CreditAlias
FROM Sample_Table ;
Course_Name CreditAlias
Fill in the Answer Set above after looking at the table and the query.
This is a CASE STATEMENT which allows you to evaluate a column in your table, and from that, come up with
a new answer for your report. Every CASE begins with a CASE, and they all must end with a corresponding
END. What would the answer be?
The second example is better unless you have a simple query like the first example.
Look at the CASE Statement and look at the Course_Table, and fill in the Answer Set.
Look at the CASE Statement and look at the Course_Table, and fill in the Answer Set.
A null value will occur when the evaluation falls through the CASE and there is no ELSE statement. Notice above that we have a 4 under the Credits column. However, in our CASE statement, we don't have instructions on what to do if the number is 4. That is why the null value is in the report.
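The fall-through is easy to reproduce. Here is a sketch with Python's sqlite3; the 4-credit "Mystery Course" row is invented to trigger the miss:

```python
import sqlite3

# The 4-credit course matches no WHEN, and with no ELSE the CASE
# evaluates to NULL (None in Python).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sample_table (course_name TEXT, credits INT)")
con.executemany("INSERT INTO sample_table VALUES (?, ?)",
                [("Tera-Tom on SQL", 1), ("Mystery Course", 4)])

rows = con.execute("""
    SELECT course_name,
           CASE credits
                WHEN 1 THEN 'One Credit'
                WHEN 2 THEN 'Two Credits'
                WHEN 3 THEN 'Three Credits'
           END AS credit_alias
    FROM sample_table
    ORDER BY course_name
""").fetchall()
```

Adding an ELSE branch (say, ELSE 'Other') would replace that NULL with a value of your choosing.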
Notice now that we have a 4 under the Credits column. However, in our CASE statement, we don't have instructions on what to do if the number is 4. What will occur?
Notice now that we don't have an ALIAS for the CASE statement. What will the system place in there for the column title?
Notice now that we don't have an ALIAS for the CASE statement. The title given by default is < CASE Expression >. That is why you should ALIAS your CASE statements.
NESTED CASE
SELECT Last_Name
,CASE Class_Code
WHEN 'JR' THEN 'Jr '
||(CASE WHEN Grade_pt < 2 THEN 'Failing'
WHEN Grade_pt < 3.5 THEN 'Passing'
ELSE 'Exceeding'
END)
ELSE 'Sr '
||(CASE WHEN Grade_pt < 2 THEN 'Failing'
WHEN Grade_pt < 3.5 THEN 'Passing'
ELSE 'Exceeding'
END)
END AS Status
FROM Student_Table WHERE Class_Code IN ('JR','SR')
ORDER BY Class_Code, Last_Name;
Last_Name Status
Bond Jr Exceeding
McRoberts Jr Failing
Delaney Sr Passing
Phillips Sr Passing
A NESTED Case occurs when you have a Case Statement within another CASE Statement. Notice the Double
Pipe symbols (||) that provide Concatenation.
The purpose of views is to restrict access to certain columns, derive columns or join tables, and to restrict access to certain rows (if a WHERE clause is used). This view does not allow the user to see the column Salary.
The purpose of views is to restrict access to certain columns, derive columns or join tables, and to restrict access to certain rows (if a WHERE clause is used). This view does not allow the user to see information about rows unless the rows have a Dept_No of either 300 or 400.
This view is designed to join two tables together. By creating a view, we have now made it easier for the user community to join these tables by merely selecting the columns they want from the view. The view exists now in the database sql_views and accesses the tables in sql_class.
Once the view is created, then users can query them with a SELECT statement. Above, we have queried the view
we created to join the employee_table to the department_table (created on previous page). Users can select all
columns with an asterisk, or they can choose individual columns (separated by a comma). Above, we selected all
columns from the view.
Redshift allows for an ORDER BY statement in the creation of a view. In the example above, notice that we have
an ORDER BY statement inside the view creation. When the user selects from the view, the data comes back
already sorted.
Redshift allows for an ORDER BY statement in the creation of a view. In the example above, notice that we have an ORDER BY statement inside the view creation. In our second example, where the user selects from the view, they also put in a different ORDER BY statement. The data comes back sorted by class_code.
This view is used to find the top 3 salaried employees in the employee_table. Notice that the view creation has an
ORDER BY statement. This is another exception to the rule that you can't have an ORDER BY statement in a
view creation. The reason is that the TOP command goes with ORDER BY like bread goes with butter. This view
actually selects all the data from the employee_table. Then, the system sorts the data with the ORDER BY
statement so that the rows show the largest to the smallest salaries. Then, only the top 3 salaried employees are
selected.
This view is used to find the top 3 students with the highest grade points. Notice that the view creation has an
ORDER BY statement. This is another exception to the rule that you can't have an ORDER BY statement in a
view creation. The reason is that the LIMIT command goes with ORDER BY like bread goes with butter. This
view actually selects all the data from the student_table. Then, the system sorts the data with the ORDER BY
statement so that the rows show the highest to the lowest Grade_Pt' s. Then, only the top 3 students are selected.
ALTERING A TABLE
CREATE VIEW Emp_HR_v AS
SELECT Employee_No
,Dept_No
,Last_Name
,First_Name
FROM Employee_Table ;
Altering the actual Table
This view will run after the table has added an additional column!
This view runs after the table has added an additional column, but it won't include Mgr_No in the view results even though there is a SELECT * in the view. The view includes only the columns present when the view was CREATED.
This view will NOT run after the table has dropped a column referenced in the view.
TROUBLESHOOTING A VIEW
CREATE VIEW Emp_HR_v6 AS
SELECT *
FROM Employee_Table6 ;
Altering the actual Table
This view will NOT run after the table has dropped a column referenced in the view, even though the view was CREATED with a SELECT *. At view CREATE time, the columns present were the only ones the view considered itself responsible for, and Dept_No was one of those columns. Once Dept_No was dropped, the view no longer works.
You can UPDATE a table through a View if you have the RIGHTS to do so.
"The man who doesn't read good books has no advantage over the man who can't read them."
-Mark Twain
In this example, what numbers in the answer set would come from the query above ?
In this example, only the number 3 was in both tables so they INTERSECT
In this example, what numbers in the answer set would come from the query above ?
1 2 3 4 5
Both top and bottom queries run simultaneously; then the two different spool files are merged to eliminate duplicates and place the remaining numbers in the answer set.
In this example, what numbers in the answer set would come from the query above ?
1 2 3 3 4 5
Both top and bottom queries run simultaneously. Then, the two different spool files are merged together to build the answer set. The ALL prevents eliminating duplicates.
In this example, what numbers in the answer set would come from the query above ?
The Top query SELECTED 1, 2, 3 from Table_Red. From that point on, only 1, 2, 3 at most could come back.
The bottom query is run on Table_Blue, and if there are any matches, they are not ADDED to the 1, 2, 3 but
instead take away either the 1, 2, or 3.
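All four set operators can be compared at once with the chapter's 1-2-3 versus 3-4-5 style data. A sketch with Python's sqlite3 (table names invented to echo Table_Red and Table_Blue):

```python
import sqlite3

# UNION de-duplicates, UNION ALL keeps duplicates, INTERSECT keeps the
# common values, and EXCEPT subtracts the bottom query from the top.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table_red (n INT);
    CREATE TABLE table_blue (n INT);
    INSERT INTO table_red VALUES (1), (2), (3);
    INSERT INTO table_blue VALUES (3), (4), (5);
""")

def run(sql):
    return [r[0] for r in con.execute(sql).fetchall()]

union_rows  = run("SELECT n FROM table_red UNION     SELECT n FROM table_blue ORDER BY 1")
all_rows    = run("SELECT n FROM table_red UNION ALL SELECT n FROM table_blue ORDER BY 1")
common_rows = run("SELECT n FROM table_red INTERSECT SELECT n FROM table_blue")
except_rows = run("SELECT n FROM table_red EXCEPT    SELECT n FROM table_blue ORDER BY 1")
```

UNION gives 1 2 3 4 5, UNION ALL gives 1 2 3 3 4 5, INTERSECT gives only the shared 3, and EXCEPT takes 3 away to leave 1 and 2.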
What will the answer set be? Notice I changed the order of the tables in the query!
Will the result set be the same for both queries above?
Will both queries bring back the exact same result set? Check out the next page to find out.
Will the result set be the same for both queries above?
Yes
Both queries above are exactly the same to the system and produce the same result set.
Will the result set be the same for both queries above?
Will both queries bring back the exact same result set? Check out the next page to find out.
No! The first query returns 4, 5, and the query on the right returns 1, 2.
You must have an equal number of columns in both SELECT lists. This is because data from the two spool files is compared so duplicates can be eliminated, and for that comparison to work, both queries must return the same number of columns.
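A quick SQLite sketch of the rule: when the top query returns two columns and the bottom query returns one, the set operation is rejected (the exact error text varies by database):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Table_Red (num INTEGER, color TEXT)")
con.execute("CREATE TABLE Table_Blue (num INTEGER)")

# Two columns on top, one on the bottom: the database rejects the query
mismatch_rejected = False
try:
    con.execute(
        "SELECT num, color FROM Table_Red UNION SELECT num FROM Table_Blue"
    )
except sqlite3.OperationalError as err:
    mismatch_rejected = True
    print(err)
```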
The above query works without error, but no data is returned because no first name matches a department name. This is like comparing apples to oranges: the columns are NOT in the same domain.
The bottom query is responsible for sorting, but the ORDER BY clause must use a number that represents the column position: 1 for column one, 2 for column two, and so on.
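A sketch of positional ORDER BY on a set query in SQLite, using the same assumed Table_Red (1, 2, 3) and Table_Blue (3, 4, 5):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Table_Red (num INTEGER)")
con.execute("CREATE TABLE Table_Blue (num INTEGER)")
con.executemany("INSERT INTO Table_Red VALUES (?)", [(1,), (2,), (3,)])
con.executemany("INSERT INTO Table_Blue VALUES (?)", [(3,), (4,), (5,)])

# ORDER BY is written once, after the bottom query, and refers to the
# output column by position (1 = the first column of the SELECT list)
rows = con.execute(
    "SELECT num FROM Table_Red UNION SELECT num FROM Table_Blue ORDER BY 1 DESC"
).fetchall()
print(rows)  # [(5,), (4,), (3,), (2,), (1,)]
```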
The derived table gave us the empno for every manager, and we were able to join to it.
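The original query is in a figure not reproduced here; the idea can be sketched in SQLite with a hypothetical Employee_Table whose mgr_no column holds each employee's manager's empno:

```python
import sqlite3

# Hypothetical Employee_Table: empno, name, and mgr_no (the manager's empno)
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Employee_Table (empno INTEGER, name TEXT, mgr_no INTEGER)"
)
con.executemany(
    "INSERT INTO Employee_Table VALUES (?, ?, ?)",
    [(1, "Ann", None), (2, "Bob", 1), (3, "Cal", 1)],
)

# The derived table lists every empno that appears as someone's manager;
# joining it back to Employee_Table keeps only the managers' rows
rows = con.execute(
    """
    SELECT e.empno, e.name
    FROM Employee_Table AS e
    JOIN (SELECT DISTINCT mgr_no FROM Employee_Table
          WHERE mgr_no IS NOT NULL) AS mgrs
      ON e.empno = mgrs.mgr_no
    """
).fetchall()
print(rows)  # [(1, 'Ann')]
```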
"You can make more friends in two months by becoming interested in other people than you will in two years by
trying to get other people interested in you."
-Dale Carnegie
Above is the Stats_Table data that we will use in our statistical examples.
STDDEV
The query below returns the sample standard deviation for the Daily_Sales column in the Sales_Table. The Daily_Sales column is a DECIMAL, and the CAST reduces the scale of the result to 10 digits.
SELECT CAST(STDDEV(Daily_Sales) as dec(18,10)) FROM Sales_Table ;
stddev
-----------------------
13389.6235806995
The standard deviation function is a statistical measure of the spread, or dispersion, of a set of values. It is the square root of the variance, which is the average of the squared differences from the mean. This measure shows the amount by which a set of values differs from the arithmetic mean.
The STDDEV_POP function is one of the two functions that calculate the standard deviation. It treats all of the rows allowed through by the WHERE clause as the entire population.
Syntax for using STDDEV_POP:
STDDEV_POP(<column-name>)
A STDDEV_POP EXAMPLE
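The book's figure for this example is not reproduced here, but the population calculation can be sketched in Python with hypothetical Daily_Sales values; STDDEV_POP divides the squared deviations by n, the full row count:

```python
import statistics

# Hypothetical Daily_Sales values standing in for the rows
# allowed through by the WHERE clause
daily_sales = [100.0, 200.0, 300.0, 400.0]

# Population standard deviation: divide the squared deviations by n
mean = sum(daily_sales) / len(daily_sales)
pop_var = sum((x - mean) ** 2 for x in daily_sales) / len(daily_sales)
pop_stddev = pop_var ** 0.5
print(pop_stddev)                      # ≈ 111.8034
print(statistics.pstdev(daily_sales))  # the same calculation, built in
```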
The standard deviation function is a statistical measure of the spread, or dispersion, of a set of values. It is the square root of the variance, which is the average of the squared differences from the mean. This measure shows the amount by which a set of values differs from the arithmetic mean.
The STDDEV_SAMP function is the other of the two functions that calculate the standard deviation. It treats the rows allowed through by the WHERE clause as a random sample drawn from a larger population, whereas STDDEV_POP treats those rows as the entire population.
Syntax for using STDDEV_SAMP:
STDDEV_SAMP(<column-name>)
A STDDEV_SAMP EXAMPLE
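Again the figure is not reproduced here; a sketch of the sample calculation in Python with the same hypothetical Daily_Sales values. The sample form divides the squared deviations by n - 1 rather than n, which is why it comes out slightly larger than the population form:

```python
import statistics

# Hypothetical Daily_Sales values standing in for the qualifying rows
daily_sales = [100.0, 200.0, 300.0, 400.0]

# Sample standard deviation: divide the squared deviations by n - 1
mean = sum(daily_sales) / len(daily_sales)
samp_var = sum((x - mean) ** 2 for x in daily_sales) / (len(daily_sales) - 1)
samp_stddev = samp_var ** 0.5
print(samp_stddev)                    # ≈ 129.0994
print(statistics.stdev(daily_sales))  # the same calculation, built in
```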
Although standard deviation and variance are regularly used in statistical calculations, the meaning of variance is not easy to interpret on its own. Most often, variance is used in theoretical work where the variance of a sample is needed. Two well-known analyses built on variance are the Kruskal-Wallis one-way analysis of variance and the Friedman two-way analysis of variance by ranks.
Syntax for using VAR_POP:
VAR_POP(<column-name>)
A VAR_POP EXAMPLE
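The figure is not reproduced here; a sketch of the population variance in Python with hypothetical Daily_Sales values. VAR_POP is simply the square of STDDEV_POP, dividing the squared deviations by n:

```python
import statistics

# Hypothetical Daily_Sales values standing in for the qualifying rows
daily_sales = [100.0, 200.0, 300.0, 400.0]

# Population variance: the average of the squared deviations (divide by n)
mean = sum(daily_sales) / len(daily_sales)
pop_var = sum((x - mean) ** 2 for x in daily_sales) / len(daily_sales)
print(pop_var)                            # 12500.0
print(statistics.pvariance(daily_sales))  # the same calculation, built in
```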
The variance function is a measure of dispersion (the spread of the distribution), calculated as the square of the standard deviation. There are two forms of variance in Redshift; VAR_SAMP treats the data rows allowed through by the WHERE clause as a random sample.
Although standard deviation and variance are regularly used in statistical calculations, the meaning of variance is not easy to interpret on its own. Most often, variance is used in theoretical work where the variance of a sample is needed to look for consistency. Two well-known analyses built on variance are the Kruskal-Wallis one-way analysis of variance and the Friedman two-way analysis of variance by ranks.
Syntax for using VAR_SAMP:
VAR_SAMP(<column-name>)
A VAR_SAMP EXAMPLE
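The figure is not reproduced here; a sketch of the sample variance in Python with the same hypothetical Daily_Sales values. VAR_SAMP is the square of STDDEV_SAMP, dividing the squared deviations by n - 1:

```python
import statistics

# Hypothetical Daily_Sales values standing in for the qualifying rows
daily_sales = [100.0, 200.0, 300.0, 400.0]

# Sample variance: divide the squared deviations by n - 1
mean = sum(daily_sales) / len(daily_sales)
samp_var = sum((x - mean) ** 2 for x in daily_sales) / (len(daily_sales) - 1)
print(samp_var)                          # ≈ 16666.6667
print(statistics.variance(daily_sales))  # the same calculation, built in
```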