Let's understand this picture by taking the example of an organization - Calvin Klein (CK). Calvin Klein has outlets in most parts of India. Every outlet stores its customer data in its own database, and it is not mandatory that every outlet use the same database: some outlets may have Sybase as their database, some might be using Oracle, and some stores prefer storing their data in simple text files.

Before proceeding with our explanation, we should know what OLTP is. It stands for Online Transaction Processing: the online transactions (Insert, Update, Delete) performed on the database at every outlet by the customers. After the daily data of the customers who visited the Calvin Klein outlets at different stores is stored, the data is integrated and saved in a centralized database. This is done with the help of the Integration Services component of MS SQL Server. Integration means merging data from heterogeneous data stores (text files, spreadsheets, mainframes, Oracle, etc.), refreshing data in data warehouses, and cleansing data before loading to remove errors (for example, the date format may differ across outlet databases, so the values are converted to a single common format). Now you should be clear about the integration concept. This is our Phase 1 - SSIS.

The next step is to analyze the stored centralized data. This huge data set is divided into data marts, on which the analytic process is carried out. Analysis Services uses the OLAP (Online Analytical Processing) component and data mining capabilities. It allows you to build multi-dimensional structures called CUBES to pre-calculate and store complex aggregations, and also to build mining models to perform data analysis. This helps to identify valuable information such as recent trends, patterns, and customer dislikes. Business analysts then perform data mining on the multi-dimensional cube structure to look at the data from different perspectives. Multi-dimensional analysis of the huge data set completes Phase 2 - SSAS.

Now the only thing left is to present this analysis graphically so that the organization (Calvin Klein) can make effective decisions to enhance revenue, gain maximum profit, and reduce time wastage. This is done in the form of reports, scorecards, plans, dashboards, Excel workbooks, etc. These reports tell the organization what the revenue of Calvin Klein was at a specific time and place, where it captured the market, where it is lacking and needs a boost, and many other things end users wish to look into. This reporting is done with the SQL Server Reporting Services tool and completes Phase 3 - SSRS.
- Helps in providing more accurate historical data by eliminating guesswork. Since analysis is mainly done on huge volumes of data, accurate historical data ensures that we get correct results.
- We can analyse customer behaviour and taste (what a customer thinks, what he likes most, what he hates, etc.), which can enhance your business and decision-making power.
- We can easily see where our customers need more attention and where we dominate the market in satisfying clients' needs.
- Complex business queries are solved with a single click and at a faster rate, which saves a lot of time.
- Improves efficiency using forecasting. You can analyse data to see where your business has been, where it is now, and where it is going.
- Integration of data from different data stores using ETL, on which analysis is then done.
Don't panic after looking at these complex words; this explains the meaning of Business Intelligence to a large extent.
SSIS - SSIS stands for SQL Server Integration Services. It is a platform for data integration and workflow applications. It can perform operations like data migration and ETL (Extract, Transform and Load).
E - Extraction of data from heterogeneous data stores (text files, spreadsheets, mainframes, Oracle, etc.). This process is known as EXTRACTION.
T - Refreshing data in the data warehouses and data marts. Also used to cleanse data before loading to remove errors. This process is known as TRANSFORMATION.
L - High-speed load of data into Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) databases. This process is known as LOADING.
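To make the three steps concrete, here is a minimal T-SQL sketch of one ETL pass, assuming hypothetical staging.OutletSales and dw.CustomerSales tables (not part of the original example):

-- Extract from the staging table, transform the date format, and load the warehouse table.
INSERT INTO dw.CustomerSales (CustomerID, OutletCode, SaleDate, Amount)
SELECT CustomerID,
       OutletCode,
       CONVERT(datetime, SaleDate, 103),  -- transform: 'dd/mm/yyyy' text into one common datetime format
       Amount
FROM staging.OutletSales
WHERE ISDATE(SaleDate) = 1;               -- cleanse: skip rows whose date cannot be parsed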
BIDS: It is a tool used to develop SSIS packages. It ships with SQL Server as an interface that allows developers to work on the control flow of a package step by step.
SSMS: It provides different options for creating an SSIS package, such as the Import/Export wizard. With this wizard, we can create a structure for how the data flow should happen. The created package can then be deployed as per the requirement.
Now, you must be curious to know about Data flow and Control flow. Data flow means extracting data into the server's memory, transforming it, and writing it out to an alternative destination, whereas Control flow means a set of instructions that tell the program executor how to execute tasks and containers within the SSIS packages. All these concepts are explained in the SSIS architecture.
SSIS Architecture:
1. Packages - A package is a collection of tasks framed together with precedence constraints to manage and execute tasks in an order. It is compiled into an XML-structured file with the .dtsx extension.
2. Control Flow - It acts as the brain of a package. It consists of one or more tasks and
containers that executes when package runs. Control flow orchestrates the order of
execution for all its components.
3. Tasks - A task can best be explained as an individual unit of work.
4. Precedence Constraints - These are the arrows in the Control flow of a package that connect the tasks together and manage the order in which the tasks will execute. In the Data flow, these arrows are known as data paths.
5. Containers - Core units in the SSIS architecture for grouping tasks together logically into
units of work are known as Containers.
6. Connection Managers - Connection managers are used to centralize connection strings to
data sources and to abstract them from the SSIS packages. Multiple tasks can share the
same Connection manager.
7. Data Flow - The core strength of SSIS is its capability to extract data into the server's memory (Extraction), transform it (Transformation), and write it out to an alternative destination (Loading).
8. Sources - A source is a component that you add to the Data Flow design surface to specify
the location of the source data.
9. Transformations - Transformations are key components within the Data Flow that allow
changes to the data within the data pipeline.
10. Destinations - Inside the Data Flow, destinations consume the data after it leaves the last transformation component.
11. Variables - Variables can be set to evaluate to an expression at runtime.
12. Parameters - Parameters behave much like variables but with a few main exceptions.
13. Event Handlers - Event handlers run in response to the run-time events that packages, tasks, and containers raise.
14. Log Providers - Log providers handle the logging of package run-time information, such as the start time and the stop time of the package and its tasks and containers.
15. Package Configurations - After developing your package and before deploying it from UAT to the production environment, you need to apply certain package configurations to match the production server.
This completes the basics of SSIS and its architecture.
SSIS Architecture
Microsoft SQL Server Integration Services (SSIS) consists of four key parts:
SSIS Service
SSIS Object Model
SSIS Runtime - runs packages and supports logging, debugging, configuration, connections, and transactions
Components - the Control Flow and the Data Flow
Control Flow
Control flow deals with the orderly processing of tasks, which are individual, isolated units of work that perform a specific action ending with a finite outcome (one that can be evaluated as Success, Failure, or Completion). While their sequence can be customized by linking them into arbitrary arrangements with precedence constraints, and by grouping them together or repeating their execution in a loop with the help of containers, a subsequent task does not initiate unless its predecessor has completed.
Container
Containers provide structure in packages and services to tasks in the control flow. Integration Services includes the following container types for grouping tasks and implementing repeating control flows:
Tasks
Tasks do the work in packages. Integration Services includes tasks for performing a
variety of functions.
The Data Flow task: It defines and runs data flows that extract data, apply transformations, and load data.
Data preparation tasks: These copy files and directories, download files and data, save data returned by Web methods, or work with XML documents.
SQL Server tasks: These access, copy, insert, delete, or modify SQL Server objects and data.
Precedence constraints
Precedence constraints connect containers and tasks in packages into an ordered control flow. You can control the sequence of execution for tasks and containers, and specify conditions that determine whether tasks and containers run.
Data Flow
The Data Flow carries out its processing responsibilities by employing the pipeline paradigm, carrying data record by record from its source to a destination and modifying it in transit by applying transformations. (There are exceptions to this rule, since some components, such as Sort or Aggregate, require the ability to view the entire data set before handing it over to their downstream counterparts.) The items used to create a data flow fall into three categories.
1. Data Flow Sources: These elements are used to read data from different types of sources (SQL Server, Excel sheets, etc.).
2. Data Flow Transformations: These elements are used to process the data (cleansing, adding new columns, etc.).
3. Data Flow Destinations: These elements are used to save the processed data into the desired destination (SQL Server, Excel sheets, etc.).
Flat File Source: Text files remain a common export format preferred over other formats, which is why the Flat File Source remains a popular Data Flow data source.
OLE DB Source: The OLE DB Source is used when data access is performed via an OLE DB provider. It's a fairly simple data source type, and everyone is familiar with OLE DB connections.
Raw file Source: The Raw File Source is used to import data that is stored in the
SQL Server raw file format. It is a rapid way to import data that has perhaps been
output by a previous package in the raw format.
XML Source: The XML Source requires an XML Schema Definition (XSD) file,
which is really the most important part of the component because it describes how SSIS
should handle the XML document.
Data Flow Transformation
Items in this category are used to perform different operations to shape the data into the desired format.
Conditional Split: The Conditional Split task splits Data Flow based on a
condition. Depending upon the results of an evaluated expression, data is routed as
specified by the developer.
Copy Column: The Copy Column task makes a copy of a column contained in
the input-columns collection and appends it to the output-columns collection.
Data Conversion: It converts data from one type to another, much like type casting.
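A rough T-SQL analogy (dbo.Products and its Price column are hypothetical names used only for illustration):

SELECT CAST(Price AS decimal(10, 2)) AS PriceAsDecimal,  -- numeric type conversion
       CONVERT(nvarchar(20), Price)  AS PriceAsText      -- string type conversion
FROM dbo.Products;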
Data Mining Query: The data-mining implementation in SQL Server 2005 is all
about the discovery of factually correct forecasted trends in data. This is configured
within SSAS against one of the provided data-mining algorithms. The DMX query
requests a predictive set of results from one or more such models built on the same
mining structure. It can be a requirement to retrieve predictive information about the
same data calculated using the different available algorithms.
Derived Column: One or more new columns are appended to the output-columns collection based upon the work performed by the task, or the result of the derived function replaces an existing column value.
Export Column: It is used to extract data from within the input stream and write it to a file. There's one caveat: the data type of the column or columns for export must be DT_TEXT, DT_NTEXT, or DT_IMAGE.
Fuzzy Grouping: Fuzzy Grouping is for use in cleansing data. By setting and
tweaking task properties, you can achieve great results because the task interprets
input data and makes intelligent decisions about its uniqueness.
Fuzzy Lookup: It uses a reference (or lookup) table to find suitable matches. The
reference table needs to be available and selectable as a SQL Server 2005 table. It uses
a configurable fuzzy-matching algorithm to make intelligent matches.
Lookup: The Lookup task leverages reference data and joins between input
columns and columns in the reference data to provide a row-by-row lookup of source
values. This reference data can be a table, view, or dataset.
Merge: The Merge task combines two separate sorted datasets into a single
dataset that is expressed as a single output.
Merge Join: The Merge Join transform uses joins to generate output. Rather than
requiring you to enter a query containing the join, however (for example SELECT
x.columna, y.columnb FROM tablea x INNER JOIN tableb y ON x.joincolumna =
y.joincolumnb), the task editor lets you set it up graphically.
Multicast: The Multicast transform takes an input and makes any number of copies of it, directed as distinct outputs.
Row Count: The Row Count task counts the number of rows as they flow through
the component. It uses a specified variable to store the final count. It is a very
lightweight component in that no processing is involved, because the count is just a
property of the input-rows collection.
Row Sampling: The Row Sampling task, in a similar manner to the Percentage Sampling transform I discussed earlier, is used to create a (pseudo)random selection of data from the Data Flow. This transform is very useful for performing operations that would normally be executed against a full set of data held in a table. In very high-volume OLTP databases, however, this just isn't possible at times. The ability to execute tasks against a representative subset of the data is a suitable and valuable alternative.
Sort: This transform goes a step further than the equivalent ORDER BY clause in the average SQL statement in that it can also strip out duplicate values.
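In rough T-SQL terms (table and column names are illustrative only), the Sort transform with duplicate removal enabled behaves like:

SELECT DISTINCT LastName  -- the optional duplicate removal
FROM dbo.Customers
ORDER BY LastName;        -- the sort itself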
Script Component: The Script Component is used for scripting custom code in a transformation. It can be used not only as a transform but also as a source or a destination component.
Term Lookup: This task wraps the functionality of the Term Extraction transform
and uses the values extracted to compare to a reference table, just like the Lookup
transform.
Union All: Just like a UNION ALL statement in SQL, the Union All task combines any number of inputs into one output. Unlike in the Merge task, no sorting takes place in this transformation. The columns and data types for the output are created when the first input is connected to the task.
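In T-SQL terms (the two staging tables are hypothetical), the behavior matches:

SELECT CustomerID, Amount FROM staging.OnlineSales
UNION ALL  -- unlike Merge, no sorting and no duplicate removal takes place
SELECT CustomerID, Amount FROM staging.StoreSales;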
Data Mining Model Training: It trains data-mining models using sorted data contained in the upstream Data Flow. The received data is piped through the SSAS data-mining algorithms for the relevant model.
DataReader Destination: The results of an SSIS package executed from a .NET
assembly can be consumed by connecting to the DataReader destination.
Excel Destination: The Excel Destination has a number of options for how the
destination Excel file should be accessed. (Table or View, TableName or ViewName
variable, and SQL Command)
Flat File Destination: The Flat File Destination component writes data out to a
text file in one of the standard flat-file formats: delimited, fixed width, fixed width with
row delimiter.
OLE DB Destination: The OLE DB Destination component inserts data into any OLE DB-compliant data source.
Raw File Destination: The Raw File Destination is all about raw speed. It is an entirely native format and can be exported and imported more rapidly than any other connection type, in part because the data doesn't need to pass through a connection manager.
You will get the New Project dialog box, where you should:
Select Business Intelligence Projects in Project Types
Solution Explorer - on the right you see the Solution Explorer with your SSIS project (first icon from the top). If you don't see it, go to View > Solution Explorer. In the majority of cases you will use SSIS Packages only; the rest is rarely used in practice (best practice).
Package tab - In the middle we have our package.dtsx open, which contains the control flow and data flow that we will use.
Toolbox - This shows the tools (items/tasks) that we can use to build our ETL package. The Toolbox is different for the Control Flow and Data Flow tabs of the package.
Control Flow - Here you will be able to control your execution steps. For example, you can log certain information before you start the data transfer, check whether a file exists, or send an email when a package fails or finishes. In here you will also add a task to move data from source to destination; however, you will use the Data Flow tab to configure it.
Data Flow - This is used to extract source data and define the destination. During the data flow you can perform all sorts of transformations, for instance creating a new calculation.
Double-click the Employee Load data flow task (ensure the box is not selected; otherwise the double-click will work like a rename). Notice that SSIS automatically goes to the Data Flow tab, where you can configure the data flow.
See the screenshot below, which shows that we are now in the Data Flow tab; notice the data flow task drop-down box, which says Employee Load. You can have multiple data flow items in the control flow, so this drop-down box allows you to switch between them.
From the toolbox (while in the Data Flow tab), drag a Flat File Source into the empty space.
Right-click the source and select Rename. Type Employee CSV Source.
Double-click the Employee CSV Source. A dialog box will appear with the header 'Flat File Source Editor'.
Next we will create an SSIS package connection, which will be stored in the package and will connect to the CSV file. In order to do that, click the New button.
Type the connection manager name and description.
Click the Browse button and find the employee.csv file (by default the dialog filters on *.txt files; change it to *.csv files).
Once you are back, tick 'Column names in the first data row'.
You should see a warning stating that your columns are not defined. Simply click Columns, which will set them for you (the default settings should be fine).
The OK button should be enabled now, so click it to complete the process.
On the first dialog box, the connection manager should say EmployeeCSV; click OK to close the dialog box.
Now from the toolbox let's drag an OLE DB Destination into the data flow empty space and rename it to Employee Table (OLE DB Destination is in the Data Flow Destinations section of the toolbox; I thought I would clarify that, as it is easy to pick OLE DB Source instead, which is not what we want).
Now we are going to create a data path, which means we define the source and its destination. We do that by clicking the source (once). You should see a green arrow. Click it (once, or press and hold) and move it over the destination (click, or release the mouse). You have created a "data path" in the SSIS package (Data Flow).
Now that the new connection is selected, we will create the destination table. Notice that I highlighted the data access mode with the value "table or view - fast load"; this is an important value that makes the load very quick, so make sure you remember this one.
To create the new table, click New for the table/view drop-down box (see below), change the table name to [Employee] and click OK.
To finish the process, click Mappings, which will create the mapping between source fields and destination fields, and click OK.
Let's test our SSIS package. Click Run (the play button on the toolbar). You should see that the extract from the source worked (green), the arrows should show 2 rows from our CSV file, and the destination should also go green, which means it successfully loaded 2 rows from the file.
Steps: Follow steps 1 to 3 in my first article to open the BIDS project and select the Integration Services project type. Once the project is created, we will see how to use the Derived Column control. Once you open the project, just drag and drop the Derived Column control along with a source and destination provider, as shown in the image below. Now we need to do the configuration for each of the tasks; first we will start with the source. In our example we are going to create a table as shown in the scripts below.
CREATE TABLE EmpDetails (EMPID int, EMPFName varchar(10), EMPLName varchar(10), EMPDOB datetime, EMPSal int, EMPHra int)
GO
INSERT INTO EmpDetails (EMPID, EMPFName, EMPLName, EMPDOB, EMPSal, EMPHra)
VALUES (1, 'Karthik', 'Anbu', '01/01/1980', 10000, 1500),
       (2, 'Arun', 'Kumar', '02/02/1981', 8000, 1200),
       (3, 'Ram', 'Kumar', '01/02/1982', 6000, 1000)
Now configure the source to get the details from the table above. Once the source is
configured now we need to do the configuration for the destination section. So here we
are going to create a new table as shown in the below script
CREATE TABLE EmpDetailsDestination (EmpFullName varchar(21), EmpAge int, EmpCTC int, InsertedDate DATETIME)
Now the records in both the source and destination tables are shown in the screen below. Our primary goal is to do some manipulations using the Derived Column task and save the results in a separate table. So we configure the Derived Column; double-clicking the control opens the configuration window shown in the screen below. In the expression section you can see we have created some expressions to do the manipulations as per our requirement. Next we need to do the configuration for the destination by mapping the columns as shown in the screen below. Once all the task steps are configured, press F5 to build and execute the package. Once your package has executed, your screen looks like the one below, and we can see the output in the destination table as expected.
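The exact expressions are only visible in the screenshots, but given the destination columns created above, the manipulations were presumably equivalent to this T-SQL (a sketch, not actual SSIS expression syntax):

SELECT EMPFName + ' ' + EMPLName         AS EmpFullName,  -- 10 + 1 + 10 characters, hence varchar(21)
       DATEDIFF(year, EMPDOB, GETDATE()) AS EmpAge,       -- age derived from date of birth
       EMPSal + EMPHra                   AS EmpCTC,       -- salary plus HRA
       GETDATE()                         AS InsertedDate  -- load timestamp
FROM EmpDetails;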
Merge Join
Problem
When loading data into SQL Server you have the option of using SQL Server Integration Services to handle more complex loading and data transforms than just doing a straight load such as using BCP. One problem that you may be faced with is that data is given to you in multiple files, such as sales and sales orders, but the loading process requires you to join these flat files during the load instead of doing a preload and then later merging the data. What options exist, and how can this be done?
Solution
SQL Server Integration Services (SSIS) offers a lot more features and options than DTS offered. One of these new options is the MERGE JOIN task. With this task you can merge multiple input files into one process and handle this source data as if it were from one source.
Let's take a look at an example of how to use this.
Here we have two source files, an OrderHeader and an OrderDetail. We want to merge this data and load it into one table in SQL Server called Orders.
OrderHeader source file.
Orders table
Next we need to build our load from these two flat file sources and then use the MERGE
JOIN task to merge the data. So the Data Flow steps would look something like this.
At this point, if you try to edit the MERGE JOIN task, you will get the error below. The reason for this is that the data needs to be sorted for the MERGE JOIN task to work. We will look at two options for handling this sorting need.
Next you need to let SSIS know which column is the SortKey. Here we are specifying the
OrderID column. This also needs to be done for both of the flat file sources.
Once this is complete you will be able to move on with the setup and select the input
process as shown below.
From here you can select the columns that you want to have for output as well as determine
what type of join you want to employ between these two files.
Lastly, you would need to add your OLE DB destination, select the table, and map the columns to finish the process.
If you right click the Sort task and select Edit you will get a screen such as following. Here
you need to select which column the data should be sorted on. This needs to be done for
both of the flat source files.
After this is done you can move on and finish the load process. The MERGE JOIN works just as stated above, as does the OLE DB destination.
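Conceptually, the MERGE JOIN task produces the same result as a T-SQL join between the two files; the detail columns below are assumptions, with OrderID taken as the join/sort key used in the example:

SELECT h.OrderID,
       h.OrderDate,   -- assumed header column
       d.ProductID,   -- assumed detail column
       d.Quantity     -- assumed detail column
FROM dbo.OrderHeader h
INNER JOIN dbo.OrderDetail d
    ON h.OrderID = d.OrderID  -- both inputs must arrive sorted on this key
ORDER BY h.OrderID;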
Lookup Transformation
The Lookup transformation performs lookups by joining data in input columns with columns in a reference dataset. We use the lookup to access additional information in a related table, based on values in common join columns. The Lookup reference dataset can be a cache file, an existing table or view, a new table, or the result of an SQL query.
Implementation
In this scenario we want to get the department name and location information from the department
table for each corresponding employee record from the source employee table.
Here we have the EMP table as OLEDB Source, next the DEPT table as the Lookup dataset and finally
the OLEDB Destination table to stage the data.
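In plain T-SQL the same result would come from an equi-join; the classic EMP/DEPT column names below (DeptNo, DName, Loc) are assumptions, since the article does not list the columns:

SELECT e.*,      -- every employee column from the source
       d.DName,  -- department name from the lookup table
       d.Loc     -- department location from the lookup table
FROM dbo.EMP e
INNER JOIN dbo.DEPT d
    ON e.DeptNo = d.DeptNo;  -- the equi-join condition defined in the Lookup editor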
Next we double-click the Lookup transformation to go to the editor. Set the Connection type to OLE DB connection manager. When required, the Lookup dataset can instead be a cache file.
Cache Mode
There are three caching options available to be configured: Full cache, Partial cache and No cache. In Full cache mode, the Lookup transformation generates a warning while caching when it detects duplicates in the join key of the reference dataset.
Next we select the OLEDB connection object from the OLEDB connection manager browser. Next we
specify the table or view. We can also use the resultant dataset of an SQL statement as Lookup
reference as mentioned earlier if required.
Next we define the simple equi join condition between the Source Input Columns and the Reference
Lookup Available columns. Next we define the Lookup Columns as Output. We can rename or Alias the
Reference Lookup column name if required.
Next, in Partial cache mode, we can specify the cache size here. We can also modify the custom query if required.
Select Ignore failure for the Error output. If there is no matching entry in the reference dataset, no join occurs. By default, the Lookup transformation treats rows without matching entries as errors. However, if we configure the Lookup transformation to ignore lookup failures, such rows are redirected to the no match output.
Lookup Output
The Lookup transformation has the following outputs:
Match output - It handles the rows in the transformation input that match at least one entry in the reference dataset.
No Match output - It handles the rows in the input that do not match any entry in the reference dataset.
As mentioned earlier, if Lookup transformation is configured to treat the rows without matching
entries as errors, the rows are redirected to the error output else they are redirected to the no
match output.
Fuzzy Lookup
Select "Fuzzy Lookup" from "Data Flow Transformation" and Drag it on "Data Flow" tab.
And connect extended green arrow from OLE DB Source to your fuzzy lookup. Double
click on Fuzzy Lookup task to configure it.
Select "OLE DB Connection" and "Reference Table name" in "Reference Table" tab.
Map Lookup column and Output Column in "Columns tab. Add prefix "Ref_" in output
column filed.
Select "Conditional Split" from "Data Flow Transformation" and Drag it on "Data Flow"
tab. and connect extended green arrow from Fuzzy Lookup to your "Conditional Split".
Double click on Conditional Split task to configure it.
Create two output. One is "Solid Matched" which Condition is "_Similarity > 0.85 &&
_Confidence > 0.8" and another is "Likely Matched" which condition is "_Similarity > .65
&& _Confidence > 0.75". Click OK.
Select "Derived Column" from "Data Flow Transformation" and Drag it on "Data Flow"
tab. and connect extended green arrow from Conditional Split to your "Derived
Column".
Select another "Derived Column" from "Data Flow Transformation" and Drag it on "Data
Flow" tab. and connect extended green arrow from Conditional Split to your "Derived
Column 1".
Select another "Derived Column" from "Data Flow Transformation" and Drag it on "Data
Flow" tab. And connect extended green arrow from Conditional Split to your "Derived
Column 2".
Select another "Union All" from "Data Flow Transformation" and Drag it on "Data Flow"
tab. and connect extended green arrow from Derived Column to your "Union All" and
Derived Column 1 to your "Union All" and Derived Column 2 to your "Union All".
Select "SQL Server Destination" from "Data Flow Destination" and Drag it on "Data
Flow" tab. and connect extended green arrow from Union All to your "SQL Server
Destination".
Double click on SQL Server Destination task to configure it. Click New for create a New
Table or Select from List.
Click OK.
If you execute the package with debugging (press F5), the package should succeed and
appear as shown here:
SELECT [firstName]
,[LastName]
,[Ref_firstName]
,[Ref_LastName]
,[_Similarity]
,[_Confidence]
,[_Similarity_firstName]
,[_Similarity_LastName]
,[_Match]
Example:
Data looks like:

Product   Color    Price
iPhone    White    199
iPad      White    300
iPhone    Pink     250
iPod      White    50
iPad      Pink     350
iPod      Pink     75
iPhone    orange   150
iPad      orange   399
iPod      orange   50

Pivoting Color using the Price value gives:

Product   orange   Pink   White
iPad      399      350    300
iPhone    150      250    199
iPod      50       75     50
use AdventureWorks
select
    YEAR(OrderDate) as Year,
    pc.Name as ProductCategoryName,
    SUM(linetotal) as LineTotal
from
    Production.Product p
    join Production.ProductSubcategory ps
        on p.ProductSubcategoryID = ps.ProductSubcategoryID
    join Production.ProductCategory pc
        on pc.ProductCategoryID = ps.ProductCategoryID
    join Sales.SalesOrderDetail sod
        on sod.ProductID = p.ProductID
    join Sales.SalesOrderHeader soh
        on soh.SalesOrderID = sod.SalesOrderID
group by
    YEAR(OrderDate),
    pc.Name
Year   ProductCategoryName   LineTotal
2001   Accessories           20235.36461
2001   Bikes                 10661722.28
2001   Clothing              34376.33525
2001   Components            615474.9788
2002   Accessories           92735.35171
2002   Bikes                 26486358.2
2002   Clothing              485587.1528
2002   Components            3610092.472
2003   Accessories           590257.5852
2003   Bikes                 34923280.24
2003   Clothing              1011984.504
2003   Components            5485514.832
2004   Accessories           568844.5824
2004   Bikes                 22579811.98
2004   Clothing              588594.5323
2004   Components            2091511.004
Destination: Pivoting ProductCategoryName using the LineTotal value will result in:
Year   Accessories   Bikes         Clothing   Components
2001   20235.36461   10661722.28   34376.34   615474.9788
2002   92735.35171   26486358.2    485587.2   3610092.472
2003   590257.5852   34923280.24   1011985    5485514.832
2004   568844.5824   22579811.98   588594.5   2091511.004
USE tempdb
GO
CREATE TABLE [dbo].[Pivot_Example](
    [Year] [int] NULL,
    [Accessories] [float] NULL,
    [Bikes] [float] NULL,
    [Clothing] [float] NULL,
    [Components] [float] NULL
) ON [PRIMARY]
GO
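For comparison, the same pivot can also be expressed directly in T-SQL with the PIVOT operator; this is a sketch built from the source query and column names shown above, not part of the original package:

SELECT [Year], [Accessories], [Bikes], [Clothing], [Components]
FROM (
    -- the aggregation source, equivalent to the earlier query before grouping
    SELECT YEAR(soh.OrderDate) AS [Year],
           pc.Name AS ProductCategoryName,
           sod.LineTotal
    FROM Production.Product p
    JOIN Production.ProductSubcategory ps ON p.ProductSubcategoryID = ps.ProductSubcategoryID
    JOIN Production.ProductCategory pc ON pc.ProductCategoryID = ps.ProductCategoryID
    JOIN Sales.SalesOrderDetail sod ON sod.ProductID = p.ProductID
    JOIN Sales.SalesOrderHeader soh ON soh.SalesOrderID = sod.SalesOrderID
) src
PIVOT (
    SUM(LineTotal) FOR ProductCategoryName IN ([Accessories], [Bikes], [Clothing], [Components])
) AS pvt
ORDER BY [Year];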
Function    Our example columns
Set Key     Year
Pivot Key   ProductCategoryName
Value       LineTotal
Note:
Source Column: It is the lineage ID of the input column which holds the
value for the output column.
In our above example:
Output Column   Source Column
Year            Lineage ID of the Year input column
Accessories     Lineage ID of the LineTotal input column
Bikes           Lineage ID of the LineTotal input column
Clothing        Lineage ID of the LineTotal input column
Components      Lineage ID of the LineTotal input column
The data flow has a special place in SSIS and as such has its own workspace, which is represented by the Data Flow tab in SSIS Designer, as shown in Figure 1.
To configure the data flow, double-click the Data Flow task in the control flow. This will move you to the Data
Flow tab, shown in Figure 3.
A Data Flow task will always start with a source and will usually, but not always, end with a destination. You can also add as many transformations as necessary to prepare the data for the destination. For example, you can use the Derived Column transformation to add a computed column to the data flow, or you can use a Conditional Split transformation to split data into different destinations based on specified criteria. These and other components will be explained in future articles.
To add components to the Data Flow task, you need to open the Toolbox if it's not already open. To do this, point to the View menu and then click Toolbox, as shown in Figure 4.
A database icon is associated with that source type. Other source types will show different icons.
A reversed red X appears to the right of the name. This indicates that the component has not yet been
properly configured.
Two arrows extend below the component. These are called data paths. In this case, there is one green
and one red. The green data path marks the flow of data that has no errors. The red data path
redirects rows whose values are truncated or that generate an error. Together these data paths enable
the developer to specifically control the flow of data, even if errors are present.
To configure the OLE DB source, right-click the component and then click Edit. The OLE DB Source Editor
appears, as shown in Figure 8.
Table or view
Table name or view name variable
SQL command
For this example, we'll select the Table or View option because we'll be retrieving our data through the uvw_GetEmployeePayRate view, which returns the latest employee pay raise and the amount of that raise. Listing 1 shows the Transact-SQL used to create the view in the AdventureWorks database.
CREATE VIEW uvw_GetEmployeePayRate
AS
SELECT H.EmployeeID,
       RateChangeDate,
       Rate
FROM HumanResources.EmployeePayHistory H
JOIN (SELECT EmployeeID,
             MAX(RateChangeDate) AS [MaxDate]
      FROM HumanResources.EmployeePayHistory
      GROUP BY EmployeeID
     ) xx ON H.EmployeeID = xx.EmployeeID
        AND H.RateChangeDate = xx.MaxDate
GO
Listing 1: The uvw_GetEmployeePayRate view definition
After you ensure that Table or view is selected in the Data access mode drop-down list, select the
uvw_GetEmployeePayRate view from the Name of the table or the view drop-down list. Now go to
the Columns page to select the columns that will be returned from the data source. By default, all columns are
selected. Figure 9 shows the columns (EmployeeID, RateChangeDate, and Rate) that will be added to the
data flow for our package, as they appear on the Columns page.
Figure 10: The Error Output page of the OLE DB Source Editor
By default, if there is an error or truncation, the component will fail. You can override the default behavior, but explaining how to do that is beyond the scope of this article. You'll learn about error handling in future articles.
Now return to the Connection Manager page and click the Preview button to view a sample dataset in the
Preview Query Results window, shown in Figure 11. Previewing the data ensures that what is being
returned is what you are expecting.
The next step in configuring our data flow is to add a transformation component. In this case, well add the
Derived Column transformation to create a column that calculates the annual pay increase for each
employee record we retrieve through the OLE DB source.
To add the component, expand the Data Flow Transformations category in the Toolbox window, and
drag the Derived Column transformation (shown in Figure 12) to the Data Flow tab design surface.
Figure 12: The Derived Column transformation as its listed in the Toolbox
Drag the green data path from the OLE DB source to the Derived Column transformation to associate the two components, as shown in Figure 13. (If you don't connect the two components, they won't be linked and, as a result, you won't be able to edit the transformation.)
Figure 13: Using the data path to connect the two components
The next step is to configure the Derived Column component. Double-click the component to open the
Derived Column Transformation Editor, as shown in Figure 14.
1. Objects you can use as a starting point. For example, you can either select columns from your data flow or select a variable. (We will be working with variables in a future article.)
2. Functions and operators you can use in your derived column expression. For example, you can use a mathematical function to calculate data returned from a column or use a date/time function to extract the year from a selected date.
3. Workspace where you build one or more derived columns. Each row in the grid contains the details necessary to define a derived column.
For this exercise, we'll be creating a derived column that calculates a pay raise for employees. The first step is to select the existing column that will be the basis for our new column.
To select the column, expand the Columns node, and drag the Rate column to the Expression column of the first row in the derived columns grid, as shown in Figure 15.
Figure 15: Adding a column to the Expression column of the derived column grid
When you add your column to the Expression column, SSIS prepopulates the other columns in that row of
the grid, as shown in Figure 16.
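The finished expression appears only in the screenshots; as an illustration (the 10% raise figure is an assumption), its T-SQL equivalent over the view would be:

SELECT EmployeeID,
       Rate,
       Rate * 1.10 AS NewRate  -- hypothetical 10% pay raise derived from the Rate column
FROM uvw_GetEmployeePayRate;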
Once you are happy with the expression, click OK to complete the process. You will be returned to the Data Flow tab. From here, you can rename the Derived Column transformation to clearly show what it does. Again, there are two data paths you can use to link to further transformations or to connect to destinations.
Figure 19: Connecting the data path from the transformation to the destination
As you can see, even though we have connected the PayRate transformation to the Excel destination, we still have the reversed red X showing us that there is a connection issue. This is because we have not yet selected the connection manager or linked the data flow columns to those in the Excel destination.
Next, right-click the Excel destination, and click Edit. This launches the Excel Destination Editor dialog box, shown in Figure 20. On the Connection Manager page, under OLE DB connection manager, click the New button, then under Excel File Path click the Browse button, select the file you created in the previous article, and click OK. Then under Name of the Excel Sheet, select the appropriate sheet from the file.
Figure 22: Mapping the columns between the data flow and the destination
Once you've properly mapped the columns, click OK. The Data Flow tab should now look similar to the screenshot in Figure 23.
Figure 24: Clicking the Execute button to run your SSIS package
As the package progresses through the data flow components, each one will change color. The component will turn yellow while it is running, then turn green or red on completion. If it turns green, it has run successfully, and if it turns red, it has failed. Note, however, that if a component runs too quickly, you won't see it turn yellow. Instead, it will go straight from white to green or red.
The Data Flow tab also shows the number of rows that are processed along each step of the way. That number is displayed next to the data path. For our example package, 290 rows were processed between the Employees source and the PayRate transformation, and 290 rows were processed between the transformation and the Excel destination. Figure 25 shows the data flow after the three components ran successfully. Note that the number of processed rows is also displayed.
The DimEmployee dimension table has the following columns:
EmpKey int identity(1,1),
EmpId int,
Name varchar(100),
Designation varchar(100),
City varchar(100),
Phone varchar(10),
StartDate datetime,
EndDate datetime
The wizard classifies dimension attributes into three types:
Fixed Attribute --> No change expected.
Changing Attribute --> Changes are expected, but there is no need to record history; the same record will be updated.
Historical Attribute --> If this attribute is changed, the old record will be expired (by setting EndDate to the current date) and a new record will be inserted with the new attribute value.
In our example, we don't expect any change to the Name attribute, hence we selected it as a Fixed Attribute, and the rest (Phone, Designation and City) will be selected as Historical Attributes. Once we are done, click Next.
6. For all other screens in this wizard just select Next, and on the last screen select Finish.
That's it - we have implemented the SCD transformation. Your data flow should look as shown below.
So when I run my data flow for the first time, all 9 records will be redirected to the New Output and will be inserted into the DimEmployee table.
Next, I made some changes in EmployeeFeed.xls (the changes are marked in yellow), so there are 4 records which have changed and 2 new records added.
If you look at the data flow, two records are redirected through the New Output pipeline and 4 moved through the Historical Attribute output. What happens to those 4 records is that we update the EndDate of the existing rows to the current date, then insert them again with the new changed attributes, keeping EndDate as null, as shown below.
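A rough T-SQL sketch of what happens on the Historical Attribute path for each changed record (the wizard actually generates OLE DB Command and destination components, so the exact statements it runs may differ):

-- Expire the current row for the changed employee
UPDATE dbo.DimEmployee
SET EndDate = GETDATE()
WHERE EmpId = @EmpId
  AND EndDate IS NULL;

-- Re-insert the employee with the new attribute values and an open EndDate
INSERT INTO dbo.DimEmployee (EmpId, Name, Designation, City, Phone, StartDate, EndDate)
VALUES (@EmpId, @Name, @Designation, @City, @Phone, GETDATE(), NULL);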
File System, For Loop and Foreach Loop Control Flow Tasks
In some ETL scenarios, when processing files, it is necessary to rename the already processed files and move them to a different location. In SSIS you can accomplish that in a single step using the File System Task. The example I have prepared assumes the package will process a set of files using a ForEach Loop container; then, for each file, using the 'Rename' operation in the File System Task will do both: rename and move the file.
Here are some screenshots and notes about the package:
First of all, the big picture. The control flow has a ForEach Loop container with a File System Task inside. Notice that the Data Flow task is empty; it is intended to show where the real ETL work should go, but this can be any processing your scenario requires.
Then, details about the ForEach Loop container. Basically, this container is configured to process all *.txt files in the C:\Temp\Source folder, where all the files 'to be processed' are expected to be.
I am pretty sure there are different ways of accomplishing this simple task, but I like this one because it does not require writing custom code and relies on expressions.
SSIS For Loop Containers
The For Loop is one of two loop containers available in SSIS. In my opinion it is easier to set up and use than the Foreach Loop, but it is just as useful. The basic function of the For Loop is to loop over whatever tasks you put inside the container a predetermined number of times, or until a condition is met. The For Loop container, as is true of all the containers in SSIS, supports transactions by setting the TransactionOption in the properties pane of the container to "Required", or "Supported" if a parent container or the package itself is set to "Required".
There are three expressions that control the number of times the
loop executes in the For Loop container.
1. The InitExpression is the first expression to be evaluated on the For
Loop and is only evaluated once at the beginning. This expression is
optional in the For Loop Container. It is evaluated before any work is
done inside the loop. Typically you use it to set the initial value for the
variable that will be used in the other expressions in the For Loop
Container. You can also use it to initialize a variable that might be used
in the workflow of the loop.
2. The EvalExpression is the second expression evaluated when the
loop first starts. This expression is not optional. It is also evaluated
before any work is performed inside the container, and then evaluated
at the beginning of each loop. This is the expression that determines if
the loop continues or terminates. If the expression entered evaluates
to TRUE, the loop executes again. If it evaluates to FALSE, the loop
ends. Make sure to pay particular attention to this expression. I will
admit that I have accidentally written an expression in the
EvalExpression that evaluates to False right away and terminated the
loop before any work was done, and it took me longer than it probably
should have to figure out that the EvalExpression was the reason why
it was wrong.
3. The AssignExpression is the last expression used in the For Loop. It is
used to change the value of the variable used in the EvalExpression.
This expression is evaluated for each pass through the loop as well, but
at the end of the workflow. This expression is optional.
Let's walk through setting up an example package. In this example we'll create a loop that executes a given number of times.
Create a new package and add two variables to it, intStartVal and intEndVal. Next add a For Loop container to the package and open the editor. Assign the following values for the expressions:
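(The assigned values appear only in the original screenshot. A plausible configuration, assuming intStartVal starts at 0 and intEndVal was initialized to 5, would be:
InitExpression: @intStartVal = 0
EvalExpression: @intStartVal < @intEndVal
AssignExpression: @intStartVal = @intStartVal + 1
With these values the loop body runs five times, which matches the iterations shown below.)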
That is all the configuring that is required for the For Loop container. Now let's add a Script Task that will display a message box with the value of the intStartVal variable as the loop updates it. Here is the code to do that:
Public Sub Main()
    ' Display the current value of the loop variable
    MsgBox(Dts.Variables("intStartVal").Value)
    ' Report success so the package can continue
    Dts.TaskResult = ScriptResults.Success
End Sub
Once that is done the package is ready to execute.
First Iteration
Second Iteration
Fifth Iteration
Complete
I want to assign a meaningful value to the JobTitle variable to provide a better milepost during the iterative process.
Next, I create a variable named JobTitles to hold the collection itself. You do not always need to create a second variable; it depends on the collection type. In this case, because I'll be retrieving data from a view, I need a variable to hold the result set returned by my query, and that variable must be configured with the Object data type. However, I don't need to assign an initial value to the variable. The value System.Object is automatically inserted, as shown in Figure 1.
Figure 1: Adding the JobTitle and JobTitles variables to your SSIS package
Because I created the variables at package scope, they'll be available to all components in my control flow. I could have waited to create the JobTitle variable until after I added the Foreach Loop container; then I could have configured the variable at the scope of the container. I've seen it done both ways, and I've done it both ways. Keep in mind, however, that if you plan to use the variable outside of the Foreach Loop container, make sure it has package scope.
Figure 2: Configuring the General page of the Execute SQL Task editor
Because the Execute SQL task has been set up to return a result set, you need some place to put those results. That's where the JobTitles variable comes in. The task will pass the result set to the variable as an ADO object, which is why the variable has to be configured with the Object data type. The variable can then be used to provide those results to the Foreach Loop container.
So the next step in configuring the Execute SQL task is to map the JobTitles variable to
the result set, which I do on the Result Set page of the Execute SQL Task editor, shown
in Figure 3.
Figure 3: Configuring the Result Set page of the Execute SQL Task editor
To create the mapping, I click Add and then specify the JobTitles variable in the first row
of the VariableName column. Notice in the figure that I include the User namespace,
followed by two colons. I then set the value in the Result Name column to 0.
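The SQLStatement on the General page is not reproduced in this excerpt; since the collection is a list of job titles pulled from a view, it was presumably something like this (the view name is an assumption):

SELECT DISTINCT JobTitle
FROM HumanResources.vEmployee;  -- assumed view; the article only says the data comes from a view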
That's all you need to do to configure the Execute SQL task. The next step is to add a Foreach Loop container and connect the precedence constraint from the Execute SQL task to the container. Then you can configure the container. When doing so, you must select an enumerator type. The enumerator type indicates the type of collection you're working with, such as files in a folder or rows in a table. In this case, because the result set is stored in the JobTitles variable as an ADO object, I select the Foreach ADO enumerator, as shown in Figure 4.
Figure 5: Configuring the Variable Mappings page of the Foreach Loop editor
For my example, I create a mapping to the JobTitle variable. To do this, I select the variable from the drop-down list in the first row of the Variable column and set the index to 0. I use 0 because my collection is taken from the first column of the result set stored in the JobTitles variable. If there were more columns, the number would depend on the column position. The positions are zero-based, so the first column requires a 0 value in the Index column. If my result set included four columns and I was using the third column, my Index value would be 2.
That's all there is to setting up the Foreach Loop container for this example. After I complete the setup, I add a Data Flow task to the container. My control flow now looks similar to the one shown in Figure 6.
NOTE: The query actually needs to retrieve data only from the FirstName and LastName columns. However, I also included the JobTitle column simply as a way to verify that the data populating the CSV files is the correct data.
To map the parameter to the variable, click the Parameters button on the Connection
Manager page of the OLE DB Source editor. The button is to the right of where you add
your Select statement. This launches the Set Query Parameters dialog box, shown in
Figure 8.
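The Select statement itself is not shown here; given the columns mentioned in the note above, it presumably resembled the following, where the ? placeholder is the parameter being mapped (the view name is again an assumption):

SELECT FirstName, LastName, JobTitle
FROM HumanResources.vEmployee  -- assumed view
WHERE JobTitle = ?;            -- OLE DB parameter placeholder mapped to the JobTitle variable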
The Bulk Insert task is used to copy large amounts of data into SQL Server tables from text files. For example, imagine a data analyst in your organization provides a feed from a mainframe system to you in the form of a text file, and you need to import it into a SQL Server table. The easiest way to accomplish this in an SSIS package is through the Bulk Insert task.
Configuring Bulk Insert Task
Drag the Bulk Insert task from the toolbox into the control flow window.
Double-click the Bulk Insert task to open the task editor. Click Connection in the left tab.
In the Connection tab, specify the OLE DB connection manager to connect to the destination SQL Server database and the table into which data is inserted. Also specify the Flat File connection manager used to access the source file, and select the column and row delimiters used in the flat file.
Click Options in the left tab of the editor, and select the code page of the file and the starting row number (FirstRow). Also specify the actions to perform on the destination table or view when the task inserts the data. The options are to check constraints, enable identity inserts, keep nulls, fire triggers, or lock the table.
On running the package, the data will be copied from the source to the destination. Bulk Insert doesn't have an option to truncate and load; hence you must use an Execute SQL task to delete the data already present in the table before loading the flat file data.
It is an easy task to use and configure, but it has a few cons, as the sketch after this list also illustrates:
1. It only allows appending data to the table; you cannot perform a truncate and load.
2. Only a flat file can be used as the source, not any other type of database.
3. Only SQL Server databases can be used as the destination. It doesn't support any other files/RDBMS systems.
4. A failure in the Bulk Insert task does not automatically roll back successfully loaded batches.
5. Only members of the sysadmin fixed server role can run a package that contains a Bulk Insert task.
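Under the covers the task issues a T-SQL BULK INSERT statement. A hedged sketch with hypothetical table and file names, using options that mirror those listed above:

-- What the preceding Execute SQL task would do, since the task itself cannot truncate:
-- TRUNCATE TABLE dbo.MainframeFeed;

BULK INSERT dbo.MainframeFeed
FROM 'C:\Feeds\mainframe_feed.txt'
WITH (
    FIELDTERMINATOR = ',',   -- column delimiter
    ROWTERMINATOR = '\n',    -- row delimiter
    FIRSTROW = 2,            -- starting row number
    CHECK_CONSTRAINTS,       -- check constraints
    KEEPNULLS,               -- keep nulls
    FIRE_TRIGGERS,           -- fire triggers
    TABLOCK                  -- lock the table
);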
Case
How do you get a rowcount when you execute an Insert, Update or Delete query with an Execute SQL Task? I want to log the number of affected rows, just like in a Data Flow.
Solution
The Transact-SQL function @@ROWCOUNT can help you here. It returns the number of rows affected
by the last statement.
1) Variable
Create an integer variable named 'NumberOfRecords' to store the number of affected rows in.
4) SQLStatement
Enter your query, but add the following text at the bottom of your query: SELECT @@ROWCOUNT as NumberOfRecords; This query will return the number of affected rows in the column NumberOfRecords.
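For example (the update statement is hypothetical; only the final SELECT matters for the result set):

UPDATE dbo.Employee
SET Salary = Salary * 1.05
WHERE Department = 'Sales';

SELECT @@ROWCOUNT AS NumberOfRecords;  -- number of rows affected by the UPDATE above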
5) Result Set
Go to the Result Set tab and change the Result Name to NumberOfRecords. This is the name of the column. Select the variable from step 1 to store the value in.
Result Set
6) The Result
To show you the value of the variable with the number of affected records, I added a Script Task with a simple message box. You can add your own logging, for example a Script Task that fires an event or an Execute SQL task that inserts a logging record.
The Result
Configurations:
Figure 5: The Select Configuration Type screen in the Package Configuration wizard
From the Configuration type drop-down list, select XML configuration file. You can then choose to specify your configuration settings directly or specify a Windows environment variable that stores the path and file name of the configuration file. For this example, I selected the Specify configuration settings directly option and specified the following path and file name: C:\Projects\SsisConfigFiles\LoadPersonData.dtsConfig. The main thing to notice is that the file should use the extension dtsConfig.
NOTE: If you specify an XML file that already exists, you'll be prompted to choose whether to use that file's existing settings or to overwrite them with the package's current settings. If you use the file's settings, you'll skip the next screen; otherwise, the wizard will proceed as if the file had not existed. Also, if you choose to use an environment variable to store the path and file name, the wizard will not create a configuration file and will again skip the next screen. Even if you use an environment variable, you might want to create the file first and then select the environment variable option afterwards.
The next screen in the wizard is Select Properties to Export. As the name implies, this is where you select the properties for which you want package configurations. In this case, I selected the Value property of the ConnectMngr variable and the ServerName property of each of the two connection managers, as shown in Figure 6.
Figure 7: The Completing the Wizard screen in the Package Configuration wizard
If you're satisfied with the settings, click Finish. The wizard will automatically generate the XML configuration file and add the properties that you've specified. The file will also be listed in the Package Configuration Organizer, as shown in Figure 8.
Figure 8: The XML package configuration as its listed in the Package Configuration
Organizer
NOTE: When you add an XML configuration file, no values are displayed in the
Target Object and Target Property columns of the Package Configuration Organizer.
This is because XML configuration files support multiple package configurations.
You should also verify whether the XML package configuration file has been created
in the specified location. For this example, I added the file to the
C:\Projects\SsisConfigFiles\ folder. The file is automatically saved with the dtsConfig
extension. If you open the file in a text editor or browser, you should see the XML
necessary for a configuration file. Figure 9 shows the LoadPersonData.dtsConfig file
as it appears in Internet Explorer.
Figure 10: Running the LoadPersonData package with the default settings
As you would expect, the Server A data flow ran, but not the Server B data flow. However, the advantage of using XML configuration files is that you can modify property settings without modifying the package itself. When the package runs, it checks the configuration file. If the file exists, it uses the values from the listed properties. That means if I change the property values in the file, the package will use those new values when it runs.
For instance, if I change the value of the ConnectMngr variable from Server A to Server B, the package will use that value. As a result, the precedence constraint that connects to the Server A Data Flow task will evaluate to False, the precedence constraint that connects to the Server B Data Flow task will evaluate to True, and the Server B data flow will run. Figure 11 shows what happens if I change the variable's value in the XML configuration file to Server B.
Figure 11: Running the Server B Data Flow task in the LoadPersonData SSIS package
As you would expect, the Server B Data Flow task ran, but not the Server A Data
Flow task. If I had changed the values of the ServerName properties for the
connection managers, my source and destination servers would also have been
different.
Clearly, XML configuration files offer a great deal of flexibility for supplying property
values to your packages. They are particularly handy when deploying your packages
to different environments. Server and instance names can be easily changed, as
can any other value. If you hard-code the path and file name of the XML
configuration file into the package, as I've done in this example, then you must
modify the package if that file location or name changes. You can get around this by
using a Windows environment variable, but that's not always a practical solution. In
addition, you can override the configuration path and file name by using the
/ConfigFile option with the DTExec utility.
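For example, a command line along the following lines would point the package at a configuration file of your choosing at run time. This is a sketch: the .dtsx path is an assumption (the article never shows it), though the configuration folder matches the one used above.

rem A hedged example of overriding the configuration at run time;
rem adjust the package and file paths to match your environment.
dtexec /FILE "C:\Projects\LoadPersonData.dtsx" /ConfigFile "C:\Projects\SsisConfigFiles\LoadPersonData.dtsConfig"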
Whatever approach you take, you'll find XML configuration files to be a useful tool
that can help streamline your development and deployment efforts. They're easy to
set up and maintain, and well worth the time it takes to learn how to use them and
how to implement them into your solutions.
Debugging and Logging
SQL Server Business Intelligence Development Studio (BIDS) provides several tools
you can use to troubleshoot the data flow of a SQL Server Integration Services
(SSIS) package. The tools let you sample a subset of data, capture data flow row
counts, view data as it passes through data paths, redirect data that generates
errors, and monitor package execution. You can use these tools for any package
that contains a data flow, regardless of the data's source or destination or what
transformations are being performed.
The better you understand the debugging tools, the more efficiently you can
troubleshoot your data flow. In this article, I demonstrate how each debugging tool
works. To do so, I set up a test environment that includes a comma-separated text
file, a table in a SQL Server database, and an SSIS package that retrieves data from
the text file and inserts it into the table. The text file contains data from the
Person.Person table in the AdventureWorks2008R2 database. To populate the file, I
ran the following bcp command:
bcp "SELECT TOP 10000 BusinessEntityID, FirstName, LastName FROM
AdventureWorks2008R2.Person.Person ORDER BY BusinessEntityID" queryout
C:\DataFiles\PersonData.txt -c -t, -S localhost\SqlSrv2008R2 T
After I created the file, I manipulated the first row of data in the file by extending
the LastName value in the first row to a string greater than 50 characters. As you'll
see later in the article, I did this in order to introduce an error into the data flow so I
can demonstrate how to handle such errors.
Next I used the following Transact-SQL script to create the PersonName table in the
AdventureWorks2008R2 database:
USE AdventureWorks2008R2
GO
IF OBJECT_ID('dbo.PersonName') IS NOT NULL
DROP TABLE dbo.PersonName
GO
CREATE TABLE dbo.PersonName
(
NameID INT PRIMARY KEY,
FullName NVARCHAR(110) NOT NULL
)
After I set up the source and target, I created an SSIS package. Initially, I configured
the package with the following component:
A Data Flow task that retrieves data from the text file, creates a derived
column, and inserts the data into the PersonName table.
Figure 1 shows the data flow components I added to the package, including those
components related to troubleshooting the data flow.
When you're developing an SSIS package that retrieves large quantities of data, it
can be helpful to work with only a subset of data until you've resolved any issues in
the data flow. SSIS provides two data flow components that let you work with a
randomly selected subset of data. The Row Sampling Transformation component
lets you specify the number of rows you want to include in your random data
sample, and the Percentage Sampling Transformation component lets you specify
the percentage of rows.
Both components support two data outputs: one for the sampled data and one for
the unsampled data. Each component also lets you specify a seed value so that the
sample is the same each time you run the package. When you don't specify a seed
value, the seed is derived from the operating system's tick count, so the data
sample is different each time you run the data flow.
If you refer back to Figure 1, you'll see that I added a Row Sampling Transformation
component right after the Flat File Source component. Figure 2 shows the Row
Sampling Transformation Editor. Notice that I configured the component to retrieve
1000 rows of sample data, but I did not specify a seed value.
When data passes through a data flow, the SSIS design surface displays the number
of rows passing along each data path. The count changes as data moves through
the pipeline. After the package has finished executing, the number displayed is the
number of rows that passed through the data path in the last buffer. If the data
moved through multiple buffers, that final number does not provide an accurate total count.
However, you can add a Row Count Transformation component to the data flow. The
transformation adds together the rows from all buffers and stores the final count in
a variable. This can be useful when you want to ensure
that a particular point in the data flow contains the number of rows you would
expect. You can then compare that number to the number of rows in your source or
destination.
To retrieve the row count from the variable, you can use whatever method you like.
For instance, you can create an event handler that captures the variable value and
saves it to a table in a SQL Server database. How you retrieve that value is up to
you. The trick is to use the Row Count Transformation component to capture the
total rows and save them to the variable.
In my sample SSIS package, I created an integer variable named RowCount. Then, after
the Derived Column component, I added a Row Count Transformation component.
Figure 3 shows the component's editor. The only step I needed to take to configure
the component was to add the variable name to the VariableName property.
Figure 3: Verifying the row counts of data passing along a data path
When the package runs, the final count from that part of the data flow will be saved
to the RowCount variable. I verified the RowCount value by adding a watch to the
control flow, but in an actual development environment, you'd probably want to
retrieve the value through a mechanism such as an event handler, as mentioned
above, so you have a record you can maintain as long as necessary.
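If you do want a durable record of the counts, one possibility (a sketch, not the approach used in the sample package) is to create a small logging table and have an Execute SQL Task in the Data Flow task's OnPostExecute event handler write the variable's value into it. The table and column names here are hypothetical:

-- Hypothetical logging table for captured row counts.
CREATE TABLE dbo.DataFlowRowCounts
(
  LogID    INT IDENTITY(1,1) PRIMARY KEY,
  LoggedAt DATETIME NOT NULL DEFAULT GETDATE(),
  RowsSeen INT NOT NULL
);
GO
-- Statement for the Execute SQL Task (OLE DB connection); map the
-- User::RowCount variable to the ? parameter in the task's editor.
INSERT INTO dbo.DataFlowRowCounts (RowsSeen) VALUES (?);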
Adding Data Viewers to the Data Path
When troubleshooting data flow, it can be useful to view the actual data as it passes
through a data path. You can do this by adding one or more data viewers to your
data flow. SSIS supports several types of data viewers. The one most commonly
used is the grid data viewer, which displays the data in tabular format. However,
you can also create data viewers that display histograms, scatter plot charts, or
column charts. These types of data viewers tend to be useful for more analytical
types of data review, but for basic troubleshooting, the grid data viewer is often the
best place to start.
To create a grid data viewer, open the editor for the data path on which you want to
view the data, then go to the Data Viewers page, as shown in Figure 4.
I connected the red data path to a Flat File Destination component so I could store
rows that generate errors in a text file. When you connect an error output to another
component, the Configure Error Output dialog box appears, as shown in Figure 8.
Notice that for each column, you can configure what action to take for either errors
or truncations. An error might be something like corrupt data or an incorrect data
type. A truncation occurs if a value is too long for the configured type. By default,
each column is configured to fail the component whether there is an error or
truncation.
You might not need to use all the tools that SSIS provides for debugging your data
flow, but whatever tools you do implement can prove quite useful when trying to
troubleshoot an issue. By working with data samples, monitoring row counts, using
data viewers, configuring error-handling, and monitoring package execution, you
should be able to pinpoint where any problems might exist in your data flow. From
there, you can take the steps necessary to address those problems. Without the
SSIS troubleshooting tools, locating the source of the problem can take an
inordinate amount of time. The effort you put in now to learn how to use these tools
and take advantage of their functionality can pay off big every time you run an SSIS
package.
Steps to configure logging
Open the package in Business Intelligence Development Studio (BIDS) and make sure
you are in design mode. On the Control Flow tab, right-click an empty area of the
design surface (not one of the control flow tasks) and select Logging from the
context menu displayed (picture below).
The Configure SSIS Logs dialog box is displayed. In the left-hand pane, a tree
view is displayed. Select the package by checking the check box that corresponds to it
(picture below). You can also check individual tasks.
Upon selecting the package or a task, you can then configure logging through the
available logging providers in the drop-down list, as shown below. You can add
multiple logs of the same type and/or of other types. In our example we will select
only one log provider, the SSIS log provider for Text Files. After
selecting the log provider, click the Add button.
Once the log type is selected and added, the dialog box looks like the picture below.
Choose the log file by selecting the check box to the left of it, then go to the
Configuration column to configure the location of the log file; in our example it is a
text file.
When you go to the Configuration column, a drop-down list appears that includes a
<new connection> entry. Choose that entry, and a small window opens that is
similar to the one shown below.
Choose Create file as the usage type and click the Browse button. A dialog box
opens, and we need to navigate to the directory where the SSIS package log file will
be created. I am choosing the default Log directory of that instance here (picture
below).
After choosing the location and the name of the file to be used, click the Open button
in the current dialog box, which takes you back to the previous dialog; select OK to
configure the file location. Now we are all set, except for the events that will be
logged into this log file. To select the events, switch to the Details tab as shown
below. Choose the events that need to be logged into the log file. Choosing the
events selectively is important, since we do not want too much information
written into the log file, making it difficult to find information when needed. I always
choose the OnError and OnTaskFailed events for every task, and some additional
events in the case of Data Flow tasks.
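As an aside, if you choose the SSIS log provider for SQL Server instead of the text-file provider, the events are written to the dbo.sysssislog table (in SQL Server 2008) in the database that the provider's connection manager points to. A query such as the following sketch can then pull back the interesting events:

-- A sketch of reading SQL Server-based SSIS log entries; assumes the
-- SSIS log provider for SQL Server was configured rather than a text file.
SELECT event, source, starttime, endtime, message
FROM dbo.sysssislog
WHERE event IN ('OnError', 'OnTaskFailed')
ORDER BY starttime DESC;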
Set the "Create Deployment Utility" as "True" and specify the "Deployment Path".
As soon as you build your project deployment utility is created in the above specified
folder with the package file. The file type of Deployment Utility is "Integration Services
Deployment Manifest". The extension of the deployment package is
"*.SSISDeploymentManifest".
When you run this manifest file, the Package Deployment wizard starts, which helps
in deploying the package.
As discussed above, you can also specify the deployment destination for the SSIS
package.
If you choose to install in the file system, you just have to specify the destination
folder and start the wizard. If you choose instead to install in a SQL Server
instance, you have to specify the SQL Server instance in which you want to install
the package.
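If you prefer not to use the manifest at all, the dtutil command-line utility can copy a package to either destination. The following is a sketch; the paths, server name, and package names are illustrative, not taken from this article:

rem Copy a package from the file system into the MSDB store of an instance.
dtutil /FILE "C:\Projects\LoadPersonData.dtsx" /DestServer "localhost\SqlSrv2008R2" /COPY SQL;"LoadPersonData"

rem Or copy it to another file-system location.
dtutil /FILE "C:\Projects\LoadPersonData.dtsx" /COPY FILE;"C:\Deploy\LoadPersonData.dtsx"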
Security:
When the package is saved, any property that is tagged with Sensitive="1" gets handled per
the ProtectionLevel property setting in the SSIS package. The ProtectionLevel property can
be selected from the following list of available options (click anywhere in the design area of
the Control Flow tab in the SSIS designer to show the package properties):
DontSaveSensitive
EncryptSensitiveWithUserKey
EncryptSensitiveWithPassword
EncryptAllWithPassword
EncryptAllWithUserKey
ServerStorage
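As an aside, the protection level can also be changed outside the designer with the dtutil utility's /ENCRYPT option, where a numeric level corresponds to the settings listed above (0 = DontSaveSensitive, 2 = EncryptSensitiveWithPassword, 3 = EncryptAllWithPassword, and so on). A sketch, with an illustrative path and password:

rem Re-save the package with EncryptAllWithPassword (level 3).
dtutil /FILE "C:\Projects\LoadPersonData.dtsx" /ENCRYPT FILE;"C:\Projects\LoadPersonData.dtsx";3;MyP@ssw0rd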
To show the effect of the ProtectionLevel property, add an OLE DB Connection Manager to
an SSIS package:
The above connection manager is for a SQL Server database that uses SQL Server
authentication; the password gives the SSIS package some sensitive information that must
be handled per the ProtectionLevel package property.
Now let's discuss each ProtectionLevel setting using an SSIS package with the above OLE
DB Connection Manager added to it.
DontSaveSensitive
When you specify DontSaveSensitive as the ProtectionLevel, any sensitive information is
simply not written out to the package XML file when you save the package. This could be
useful when you want to make sure that anything sensitive is excluded from the package
before sending it to someone. After saving the package using this setting, when you open it
up and edit the OLE DB Connection Manager, the password is blank even though the Save
my password checkbox is checked:
EncryptSensitiveWithUserKey
EncryptSensitiveWithUserKey encrypts sensitive information based on the credentials of the
user who created the package; e.g. the password in the package XML would look like the
following (actual text below is abbreviated to fit the width of the article):
<DTS:PASSWORD Sensitive="1" DTS:Name="Password"
Encrypted="1">AQAAANCMnd8BFdERjHoAwE/Cl+...</DTS:PASSWORD>
Note that the package XML for the password has the attribute Encrypted="1"; when the
user who created the SSIS package opens it the above text is decrypted automatically in
order to connect to the database. This allows the sensitive information to be stored in the
SSIS package but anyone looking at the package XML will not be able to decrypt the text
and see the password.
There is a limitation with this setting; if another user (i.e. a different user than the one who
created the package and saved it) opens the package the following error will be displayed:
If the user edits the OLE DB Connection Manager, the password will be blank. It is important
to note that EncryptSensitiveWithUserKey is the default value for the ProtectionLevel
property. During development this setting may work okay. However, you do not want to
deploy an SSIS package with this setting, as only the user who created it will be able to
execute it.
EncryptSensitiveWithPassword
The EncryptSensitiveWithPassword setting for the ProtectionLevel property requires that you
specify a password in the package, and that password will be used to encrypt and decrypt
the sensitive information in the package. To fill in the package password, click on the button
in the PackagePassword field of the package properties as shown below:
You will be prompted to enter the password and confirm it. When opening a package with a
ProtectionLevel of EncryptSensitiveWithPassword, you will be prompted to enter the
password as shown below:
EncryptAllWithPassword
The EncryptAllWithPassword setting for the ProtectionLevel property allows you to encrypt
the entire contents of the SSIS package with your specified password. You specify the
package password in the PackagePassword property, same as with the
EncryptSensitiveWithPassword setting. After saving the package you can view the package
XML as shown below:
Note that the entire contents of the package are encrypted and the encrypted text is shown in
the CipherValue element. This setting completely hides the contents of the package. When
you open the package you will be prompted for the password. If you lose the password
there is no way to retrieve the package contents. Keep that in mind.
When you execute a package with this setting using DTEXEC, you can specify the password
on the command line using the /Decrypt password command line argument.
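For example (the package path and password here are illustrative):

rem Running a package saved with EncryptAllWithPassword.
dtexec /FILE "C:\Projects\LoadPersonData.dtsx" /Decrypt MyP@ssw0rd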
EncryptAllWithUserKey
The EncryptAllWithUserKey setting for the ProtectionLevel property allows you to encrypt
the entire contents of the SSIS package by using the user key. This means that only the
user who created the package will be able to open it, view and/or modify it, and run it. After
saving a package with this setting the package XML will look similar to this:
Note that the entire contents of the package are encrypted and contained in the Encrypted
element.
ServerStorage
The ServerStorage setting for the ProtectionLevel property allows the package to retain all
sensitive information when you are saving the package to SQL Server. SSIS packages saved
to SQL Server use the MSDB database. This setting assumes that you can adequately secure
the MSDB database and therefore it's okay to keep sensitive information in a package in an
unencrypted form.
Scheduling:
An SSIS package can be scheduled in a SQL Server Agent job. Here is a quick note on
how to do this.
First, create a new job from the SQL Server Agent menu.
Select the type as SQL Server Integration Services Package. Select the package
source as File system and give the package path.
On the next screen you can select and configure the desired schedule.
As you can see, this is a very easy process. Let me know if you have any further questions.
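The same job can also be created programmatically. Here is a hedged T-SQL sketch using the msdb stored procedures; the job name, package path, and schedule are all illustrative:

USE msdb;
GO
-- Create the job and a single SSIS job step.
EXEC dbo.sp_add_job @job_name = N'Run LoadPersonData';
EXEC dbo.sp_add_jobstep
    @job_name = N'Run LoadPersonData',
    @step_name = N'Execute package',
    @subsystem = N'SSIS',
    @command = N'/FILE "C:\Projects\LoadPersonData.dtsx"';
-- Run the job every day at 01:00.
EXEC dbo.sp_add_jobschedule
    @job_name = N'Run LoadPersonData',
    @name = N'Nightly',
    @freq_type = 4,              -- daily
    @freq_interval = 1,
    @active_start_time = 010000; -- 01:00:00
-- Target the local server.
EXEC dbo.sp_add_jobserver @job_name = N'Run LoadPersonData';
GO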