
SAP HANA Predictive Analysis Library (PAL) Reference



SAP HANA Appliance Software SPS 05

Target Audience: Consultants, Administrators, SAP Hardware Partner, Others

Public March 2013

SAP AG 2013


Copyright
2013 SAP AG or an SAP affiliate company. All rights reserved. No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP AG. The information contained herein may be changed without prior notice. Some software products marketed by SAP AG and its distributors contain proprietary software components of other software vendors. Adobe, the Adobe logo, Acrobat, PostScript, and Reader are trademarks or registered trademarks of Adobe Systems Incorporated in the United States and other countries. Apple, App Store, FaceTime, iBooks, iPad, iPhone, iPhoto, iPod, iTunes, Multi-Touch, Objective-C, Retina, Safari, Siri, and Xcode are trademarks or registered trademarks of Apple Inc. Bluetooth is a registered trademark of Bluetooth SIG Inc. Citrix, ICA, Program Neighborhood, MetaFrame now XenApp, WinFrame, VideoFrame, and MultiWin are trademarks or registered trademarks of Citrix Systems Inc. Computop is a registered trademark of Computop Wirtschaftsinformatik GmbH. Edgar Online is a registered trademark of EDGAR Online Inc., an R.R. Donnelley & Sons Company. Facebook, the Facebook and F logo, FB, Face, Poke, Wall, and 32665 are trademarks of Facebook. Google App Engine, Google Apps, Google Checkout, Google Data API, Google Maps, Google Mobile Ads, Google Mobile Updater, Google Mobile, Google Store, Google Sync, Google Updater, Google Voice, Google Mail, Gmail, YouTube, Dalvik, and Android are trademarks or registered trademarks of Google Inc. HP is a registered trademark of the Hewlett-Packard Development Company L.P. HTML, XML, XHTML, and W3C are trademarks, registered trademarks, or claimed as generic terms by the Massachusetts Institute of Technology (MIT), European Research Consortium for Informatics and Mathematics (ERCIM), or Keio University. 
IBM, DB2, DB2 Universal Database, System i, System i5, System p, System p5, System x, System z, System z10, z10, z/VM, z/OS, OS/390, zEnterprise, PowerVM, Power Architecture, Power Systems, POWER7, POWER6+, POWER6, POWER, PowerHA, pureScale, PowerPC, BladeCenter, System Storage, Storwize, XIV, GPFS, HACMP, RETAIN, DB2 Connect, RACF, Redbooks, OS/2, AIX, Intelligent Miner, WebSphere, Tivoli, Informix, and Smarter Planet are trademarks or registered trademarks of IBM Corporation. Microsoft, Windows, Excel, Outlook, PowerPoint, Silverlight, and Visual Studio are registered trademarks of Microsoft Corporation. INTERMEC is a registered trademark of Intermec Technologies Corporation. IOS is a registered trademark of Cisco Systems Inc. The Klout name and logos are trademarks of Klout Inc. Linux is the registered trademark of Linus Torvalds in the United States and other countries. Motorola is a registered trademark of Motorola Trademark Holdings LLC. Mozilla and Firefox and their logos are registered trademarks of the Mozilla Foundation. Novell and SUSE Linux Enterprise Server are registered trademarks of Novell Inc. OpenText is a registered trademark of OpenText Corporation. Oracle and Java are registered trademarks of Oracle and its affiliates. QR Code is a registered trademark of Denso Wave Incorporated. RIM, BlackBerry, BBM, BlackBerry Curve, BlackBerry Bold, BlackBerry Pearl, BlackBerry Torch, BlackBerry Storm, BlackBerry Storm2, BlackBerry PlayBook, and BlackBerry AppWorld are trademarks or registered trademarks of Research in Motion Limited. SAVO is a registered trademark of The Savo Group Ltd. The Skype name is a trademark of Skype or related entities. Twitter and Tweet are trademarks or registered trademarks of Twitter. UNIX, X/Open, OSF/1, and Motif are registered trademarks of the Open Group. Wi-Fi is a registered trademark of Wi-Fi Alliance. 
SAP, R/3, ABAP, BAPI, SAP NetWeaver, Duet, PartnerEdge, ByDesign, SAP BusinessObjects Explorer, StreamWork, SAP HANA, the Business Objects logo, BusinessObjects, Crystal Reports, Crystal Decisions, Web Intelligence, Xcelsius, Sybase, Adaptive Server, Adaptive Server Enterprise, iAnywhere, Sybase 365, SQL Anywhere, Crossgate, B2B 360 and B2B 360 Services, m@gic EDDY, Ariba, the Ariba logo, Quadrem, b-process, Ariba Discovery, SuccessFactors, Execution is the Difference, BizX Mobile Touchbase, It's time to love work again, SuccessFactors Jam and BadAss SaaS, and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany or an SAP affiliate company. All other product and service names mentioned are the trademarks of their respective companies. Data contained in this document serves informational purposes only. National product specifications may vary. These materials are subject to change without notice. These materials are provided by SAP AG and its affiliated companies ("SAP Group") for informational purposes only, without representation or warranty of any kind, and SAP Group shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP Group products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty.


Table of Contents

1 What is PAL?
2 Getting Started
  2.1 Prerequisites
  2.2 Application Function Libraries (AFL)
  2.3 Security
  2.4 How to Call PAL Functions
    2.4.1 Parameter Table Structure
3 PAL Functions
  3.1 Clustering Algorithms
    3.1.1 Anomaly Detection
    3.1.2 K-means
    3.1.3 Self-Organizing Maps
  3.2 Classification Algorithms
    3.2.1 Bi-Variate Geometric Regression
    3.2.2 Bi-Variate Natural Logarithmic Regression
    3.2.3 C4.5 Decision Tree
    3.2.4 CHAID Decision Tree
    3.2.5 Exponential Regression
    3.2.6 KNN
    3.2.7 Multiple Linear Regression
    3.2.8 Polynomial Regression
    3.2.9 Logistic Regression
  3.3 Association Algorithms
    3.3.1 Apriori
  3.4 Time Series Algorithms
    3.4.1 Single Exponential Smoothing
    3.4.2 Double Exponential Smoothing
    3.4.3 Triple Exponential Smoothing
  3.5 Preprocessing Algorithms
    3.5.1 Binning
    3.5.2 Inter-quartile Range Test
    3.5.3 Sampling
    3.5.4 Scaling Range
    3.5.5 Variance Test
  3.6 Miscellaneous
    3.6.1 ABC Analysis
    3.6.2 Weighted Score Table
4 End-to-End Scenarios
5 Best Practices


1 What is PAL?

SAP HANA's SQLScript, an extension of SQL that includes enhanced control-flow capabilities, lets developers define complex application logic inside database procedures. However, it is difficult or even impossible to describe predictive analysis logic with procedures. For example, an application may need to perform a cluster analysis on a huge customer table with 1T records. It is impossible to implement the analysis in a procedure using the simple classic K-means algorithm, and equally impossible with the more complicated algorithms in the data-mining area. Transferring large tables to the application server to perform the K-means calculation would also be costly.

The Predictive Analysis Library (PAL) defines functions that can be called from within SQLScript procedures to perform analytic algorithms. Currently, PAL includes classic and universal predictive analysis algorithms in six data-mining categories:
- Clustering
- Classification
- Association
- Time Series
- Preprocessing
- Miscellaneous

The algorithms in PAL were carefully selected based on the following criteria:
- The algorithms are needed for SAP HANA applications.
- The algorithms are the most commonly used based on market surveys (e.g., Rexer Analytics and KDnuggets polls).
- The algorithms are generally available in other database products.

2 Getting Started

2.1 Prerequisites

To use the PAL functions, you must:
- Install SAP HANA SPS05.
- Install the Application Function Library (AFL), which includes PAL. For more information, see the section Installing Application Function Libraries (AFLs) on a SAP HANA System in the SAP HANA Installation Guide with Unified Installer.

2.2 Application Function Libraries (AFL)

You can dramatically increase performance by executing complex computations in the database instead of at the application server level. SAP HANA provides several techniques to move application logic into the database, and one of the most important is the use of application functions. Application functions are like database procedures written in C++ and called from outside to perform data-intensive and complex operations. Functions for a particular topic are grouped into an application function library (AFL), such as the Predictive Analysis Library (PAL) and the Business Function Library (BFL).

Currently, all AFLs are delivered in one archive (that is, one SAR file with the name AFL<version_string>.SAR). The AFL archive is not part of the HANA appliance, and must be installed separately by the administrator.

Each release of AFL has a version in the form of <revision_number>.<patch_level>. For example, AFL 40.01 refers to revision 40 and patch level 01. The revision of the AFL must match the revision of SAP HANA. Thus, an AFL revision 40 (any patch level) should be installed with SAP HANA revision 40 only.

2.3 Security

This section provides detailed security information that can help administrators and architects answer some common questions.

1. User and Schema
   During startup, the system creates the user _SYS_AFL, with default schema _SYS_AFL. All AFL objects (such as areas, packages, functions, and procedures) are created under this user and schema. Therefore, all these objects have fully specified names in the form of _SYS_AFL.<object name>.
2. Role Assignment
   For each AFL library, there is a role. You must be assigned this role to execute the functions in the library. The role for the PAL library is named:
       AFL__SYS_AFL_AFLPAL_EXECUTE

Note: There are two underscores between AFL and SYS. Once a role is created, it cannot be dropped. In other words, even when an area with all its objects is dropped and re-created during system startup, the user still keeps the role originally granted.
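As a minimal sketch of the role assignment described above, the statement below grants the PAL role to a user. The user name USER1 is only an illustration, and the statement assumes it is run by a user who is allowed to grant this role.

```sql
-- Grant the PAL execution role to USER1 (hypothetical user name).
-- Note the two underscores between AFL and SYS in the role name.
GRANT AFL__SYS_AFL_AFLPAL_EXECUTE TO USER1;
```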

2.4 How to Call PAL Functions

To use PAL functions, you must do the following:
1. Create the AFL_WRAPPER_GENERATOR procedure. This only needs to be done once.
2. From within SQLScript code, generate a procedure that wraps the PAL function.
3. Call the procedure, for example, from an SQLScript procedure.

Step 1: Create the AFL_WRAPPER_GENERATOR Procedure

Before using any AFL function, you need to create the AFL_WRAPPER_GENERATOR procedure. It is used to generate a wrapper for the AFL functions that take tables with a variable number of columns as inputs. This procedure only needs to be created once.

1. Make sure you are the SYSTEM user.
2. Go to /hanamnt/<SID>/HDB <instance_number>/exe/plugins/afl/ and execute the afl_wrapper_generator.sql script file. As a result, the AFL_WRAPPER_GENERATOR procedure is owned by the SYSTEM user.
3. Grant the EXECUTE privilege on system.afl_wrapper_generator to other users. For example, if the user name is USER1, run the command:
       GRANT EXECUTE ON system.afl_wrapper_generator to USER1

Note: The above steps need to be performed each time after the HANA instance is restarted.

Step 2: Generate a PAL Procedure

Any user who has been granted the EXECUTE privilege on the system.afl_wrapper_generator procedure can generate a procedure for a specific PAL function. The syntax is shown below:

    CALL SYSTEM.AFL_WRAPPER_GENERATOR(
        '<procedure_name>',
        '<area_name>',
        '<function_name>',
        <signature_table>);

- <procedure_name>: A name for the PAL procedure. This can be anything you want.
- <area_name>: Always set to AFLPAL.
- <function_name>: A PAL built-in function name.
- <signature_table>: A user-defined table variable. The table contains records to describe the input table type, parameter table type, and result table type. A typical table variable references a table with the following definition:

    Index   Table Type Name           Direction
    1       <INPUT table type>        in
    2       <PARAMETER table type>    in
    3       <OUTPUT table type>       out

Notes
1. The system.afl_wrapper_generator procedure runs in definer mode. This means that the user who generates a PAL procedure must grant the SELECT privilege on the signature table to the SYSTEM user, who is the definer of system.afl_wrapper_generator. For example, if the user name is USER1, run the command:
       GRANT SELECT ON user1.<signature table> to SYSTEM
2. The records in the signature table must follow this order: first input table types, next parameter table type, and then output table types.
3. The signature table must be created before generating the PAL procedure. The table type names are user-defined. You can find detailed table type definitions for each PAL function in Chapter 3.
4. It is suggested that you add <schema_name> before the table type name in <signature_table>.
5. Since all the generated procedures and the procedure parameter table types belong to the _SYS_AFL schema, their names must be unique. The procedure names are defined by users. When generating a PAL procedure, make sure you give a unique procedure name. The parameter table type names are given by the system, so they are guaranteed to be unique.
6. If you want to drop an existing procedure and then generate it again, you can use the user SYSTEM to remove the generated procedure and all its parameter table types, by running:
       DROP PROCEDURE _SYS_AFL.<PROCEDURE_NAME>;
       DROP TYPE _SYS_AFL.<PROCEDURE_NAME>__TT_P1;
       DROP TYPE _SYS_AFL.<PROCEDURE_NAME>__TT_P2;
       DROP TYPE _SYS_AFL.<PROCEDURE_NAME>__TT_P3;
   (until all table type names in the signature table are dropped)

Step 3: Call a PAL Procedure

After generating a PAL procedure, any user who has the AFL__SYS_AFL_AFLPAL_EXECUTE role can call the procedure, using the syntax below.

    CALL <procedure_name>(
        <data_input_table> {,},
        <parameter_table>,
        <output_table> {,}) with overview;

- <procedure_name>: The procedure name specified when generating the procedure in Step 2.
- <data_input_table>: User-defined name(s) of the procedure's input table(s). Detailed input table definitions for each procedure can be found in Chapter 3.
- <parameter_table>: User-defined name of the procedure's parameter table. The table structure is described in Section 2.4.1. Detailed parameter table definitions for each procedure can be found in Chapter 3.
- <output_table>: User-defined name(s) of the procedure's output table(s). Detailed output table definitions for each procedure can be found in Chapter 3.

Notes
1. The input, parameter, and output tables must be created before calling the procedure.
2. Some PAL algorithms have more than one input table or more than one output table.
3. All AFL objects are owned by the _SYS_AFL user and reside in the _SYS_AFL schema. To call the PAL procedure generated in Step 2, you need the AFL__SYS_AFL_AFLPAL_EXECUTE role.

2.4.1 Parameter Table Structure

PAL functions use parameter tables to transfer parameter values. Each PAL function has its own parameter table. To avoid a conflict of table names when several users call PAL functions at the same time, the parameter table must be created as a local temporary column table, so that each parameter table has its own unique scope per session. The table structure is as follows:

    Column Name   Data Type         Description
    Name          Varchar or char   Parameter name
    intArgs       Integer           Integer parameter value
    doubleArgs    Double            Double parameter value
    stringArgs    Varchar or char   String parameter value

Each row contains only one parameter value: an integer, a double, or a string. The following table is an example of a parameter table with three parameters. The first parameter, THREAD_NUMBER, is an integer parameter. Thus, in the THREAD_NUMBER row, you should fill the parameter value in the intArgs column, and leave the doubleArgs and stringArgs columns blank.

    Name            intArgs   doubleArgs   stringArgs
    THREAD_NUMBER   1
    SUPPORT                   0.2
    VAR_NAME                               hello
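The example parameter table above could be created as a local temporary column table along the following lines. This is a sketch only: the table name #PAL_CONTROL_TBL and the VARCHAR lengths are illustrative choices, and the three parameters are the ones from the example rather than the parameters of any particular PAL function.

```sql
-- Local temporary tables are session-scoped; in SAP HANA their names start with '#'.
-- #PAL_CONTROL_TBL is a hypothetical name chosen for this sketch.
CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_CONTROL_TBL (
    "NAME"       VARCHAR(50),
    "INTARGS"    INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR(100)
);

-- Each row fills exactly one of the three value columns and leaves the others NULL.
INSERT INTO #PAL_CONTROL_TBL VALUES ('THREAD_NUMBER', 1,    NULL, NULL);
INSERT INTO #PAL_CONTROL_TBL VALUES ('SUPPORT',       NULL, 0.2,  NULL);
INSERT INTO #PAL_CONTROL_TBL VALUES ('VAR_NAME',      NULL, NULL, 'hello');
```

Because the table is local to the session, two users running the same statements at the same time do not interfere with each other.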

3 PAL Functions

The following are the available algorithms and functions in the Predictive Analysis Library.

    Category         PAL Algorithm                               Built-in Function Name
    Clustering       Anomaly Detection                           ANOMALYDETECTION
                     K-means                                     KMEANS
                                                                 VALIDATEKMEANS
                     Self-Organizing Maps                        SELFORGMAP
    Classification   Bi-Variate Geometric Regression             GEOREGRESSION
                                                                 FORECASTWITHGEOR
                     Bi-Variate Natural Logarithmic Regression   LNREGRESSION
                                                                 FORECASTWITHLNR
                     C4.5 Decision Tree                          CREATEDT
                                                                 PREDICTWITHDT
                     CHAID Decision Tree                         CREATEDTWITHCHAID
                                                                 PREDICTWITHDT
                     Exponential Regression                      EXPREGRESSION
                                                                 FORECASTWITHEXPR
                     KNN                                         KNN
                     Multiple Linear Regression                  LRREGRESSION
                                                                 FORECASTWITHLR
                     Polynomial Regression                       POLYNOMIALREGRESSION
                                                                 FORECASTWITHPOLYNOMIALR
                     Logistic Regression                         LOGISTICREGRESSION
                                                                 FORECASTWITHLOGISTICR
    Association      Apriori                                     APRIORIRULE
                                                                 LITEAPRIORIRULE
    Preprocessing    Binning                                     BINNING
                     Inter-Quartile Range Test                   IQRTEST
                     Sampling                                    SAMPLING
                     Scaling Range                               SCALINGRANGE
                     Variance Test                               VARIANCETEST
    Time Series      Single Exponential Smoothing                SINGLESMOOTH
                     Double Exponential Smoothing                DOUBLESMOOTH
                     Triple Exponential Smoothing                TRIPLESMOOTH
    Miscellaneous    ABC Analysis                                ABC
                     Weighted Score Table                        WEIGHTEDTABLE

3.1 Clustering Algorithms

3.1.1 Anomaly Detection

Anomaly detection is used to find existing data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called anomalies or outliers. In different application domains, anomalies are also referred to as discordant observations, exceptions, aberrations, surprises, peculiarities, or contaminants.

Anomalies in data can translate to significant (and often critical) actionable information in a wide variety of application domains. For example, an anomalous traffic pattern in a computer network could mean that a hacked computer is sending out sensitive data to an unauthorized destination. An anomalous MRI image may indicate the presence of malignant tumors. Anomalies in credit card transaction data could indicate credit card or identity theft, and anomalous readings from a spacecraft sensor could signify a fault in some component of the spacecraft.

PAL uses k-means to realize anomaly detection in two steps:
1. Use k-means to group the original data into k clusters.
2. Identify points that are far from all cluster centers as anomalies.
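To make the second step more concrete, the query below ranks points by their distance to the center they were assigned to and returns the farthest ones as outlier candidates. It is an illustration of the idea only, not PAL's internal implementation. The table MY_ASSIGNMENTS and its columns are hypothetical: it is assumed to hold one row per point with the assigned cluster and the distance to that cluster's center, as a k-means run might produce.

```sql
-- MY_ASSIGNMENTS("ID", "CENTER_ASSIGN", "DISTANCE") is a hypothetical table
-- holding the result of a previous clustering run.
-- The points farthest from their own cluster center are the outlier candidates.
SELECT "ID", "CENTER_ASSIGN", "DISTANCE"
FROM MY_ASSIGNMENTS
ORDER BY "DISTANCE" DESC
LIMIT 2;  -- number of candidates to report; chosen arbitrarily for this sketch
```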

Prerequisites
- The input data contains an ID column, and the other columns are of integer or double data type.
- The input data does not contain null values. The algorithm will issue errors when encountering null values.

ANOMALYDETECTION

Procedure Generation

    CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>', 'AFLPAL', 'ANOMALYDETECTION', <signature table>);

The signature table should contain the following records:

    Index   Table Type Name           Direction
    1       <INPUT table type>        in
    2       <PARAMETER table type>    in
    3       <OUTPUT table type>       out

Procedure Calling

    CALL <procedure name>(<input table>, <parameter table>, <output table>) with overview;

The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature

Input Table

    Table   Column          Column Data Type    Description      Constraint
    Data    1st column      Integer or string   ID               It must be the first column.
            Other columns   Integer or double   Attribute data

Parameter Table

    Name                 Data Type   Description
    GROUP_NUMBER         Integer     Number of groups (k). If k is not specified, the G-means
                                     method will be used to determine the number of clusters.
    DISTANCE_LEVEL       Integer     Computes the distance between the item and the cluster center.
                                     1 = Manhattan distance
                                     2 = Euclidean distance
                                     3 = Minkowski distance
    OUTLIER_PERCENTAGE   Double      Indicates the proportion of anomalies in the source data.
    OUTLIER_DEFINE       Integer     Specifies which point should be defined as outlier:
                                     1 = max distance between the point and the center it belongs to
                                     2 = max sum distance from the point to all centers
    MAX_ITERATION        Integer     Maximum number of iterations.
    INIT_TYPE            Integer     Center initialization type:
                                     1 = first K
                                     2 = random with replacement
                                     3 = random without replacement
                                     4 = patented method of selecting the initial centers
                                         (US 6,882,998 B1)
    NORMALIZATION        Integer     Normalization type:
                                     0 = no
                                     1 = yes. For each point X(x1,x2,...,xn), the normalized value
                                         will be X'(|x1|/S,|x2|/S,...,|xn|/S),
                                         where S = |x1|+|x2|+...+|xn|.
                                     2 = for each column C, get the min and max value of C,
                                         and then C[i] = (C[i]-min)/(max-min).
    THREAD_NUMBER        Integer     Number of threads.
    EXIT_THRESHOLD       Double      Threshold (actual value) for exiting the iterations.
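The two NORMALIZATION options can be illustrated in plain SQL. The query below is a sketch for intuition, not the code PAL runs internally. It assumes a hypothetical table MY_POINTS with an ID column and two DOUBLE attribute columns, V000 and V001; the output column names are also invented for this example.

```sql
-- MY_POINTS("ID", "V000", "V001") is a hypothetical two-attribute input table.
SELECT p."ID",
       -- Option 1: divide each |x_i| by S = |x1| + |x2| for the same row
       ABS(p."V000") / (ABS(p."V000") + ABS(p."V001")) AS "V000_NORM1",
       ABS(p."V001") / (ABS(p."V000") + ABS(p."V001")) AS "V001_NORM1",
       -- Option 2: min-max scaling per column, (C[i] - min) / (max - min)
       (p."V000" - s."MIN_V000") / (s."MAX_V000" - s."MIN_V000") AS "V000_NORM2",
       (p."V001" - s."MIN_V001") / (s."MAX_V001" - s."MIN_V001") AS "V001_NORM2"
FROM MY_POINTS p
CROSS JOIN (
    -- One-row subquery with the column-wise minimum and maximum values
    SELECT MIN("V000") AS "MIN_V000", MAX("V000") AS "MAX_V000",
           MIN("V001") AS "MIN_V001", MAX("V001") AS "MAX_V001"
    FROM MY_POINTS
) s;
```

The sketch does not guard against division by zero: option 1 fails for a row whose attributes are all zero, and option 2 fails for a column whose values are all identical.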

Output Table

    Table    Column          Column Data Type    Description               Constraint
    Result   1st column      Integer or string   ID                        It must have the same type as the input data table.
             Other columns   Integer or double   Coordinates of outliers

Example

Assume that:
- DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and
- USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

    SET SCHEMA DM_PAL;

    DROP TYPE PAL_AD_RESULT_T;
    CREATE TYPE PAL_AD_RESULT_T AS TABLE(
        "ID" INT,
        "V000" DOUBLE,
        "V001" DOUBLE
    );

    DROP TYPE PAL_AD_DATA_T;
    CREATE TYPE PAL_AD_DATA_T AS TABLE(
        "ID" INT,
        "V000" DOUBLE,
        "V001" DOUBLE,
        primary key("ID")
    );

    DROP TYPE PAL_CONTROL_T;
    CREATE TYPE PAL_CONTROL_T AS TABLE(
        "NAME" VARCHAR (50),
        "INTARGS" INTEGER,
        "DOUBLEARGS" DOUBLE,
        "STRINGARGS" VARCHAR (100)
    );

    -- create procedure
    DROP TABLE PDATA;
    CREATE COLUMN TABLE PDATA(
        "ID" INT,
        "TYPENAME" VARCHAR(100),
        "DIRECTION" VARCHAR(100)
    );
    INSERT INTO PDATA VALUES (1, 'DM_PAL.PAL_AD_DATA_T', 'in');
    INSERT INTO PDATA VALUES (2, 'DM_PAL.PAL_CONTROL_T', 'in');
    INSERT INTO PDATA VALUES (3, 'DM_PAL.PAL_AD_RESULT_T', 'out');

    GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
    call SYSTEM.afl_wrapper_generator('PAL_ANOMALY_DETECTION', 'AFLPAL', 'ANOMALYDETECTION', PDATA);

    DROP TABLE PAL_AD_DATA_TAB;
    CREATE COLUMN TABLE PAL_AD_DATA_TAB (
        "ID" INT,
        "V000" DOUBLE,
        "V001" DOUBLE,
        primary key("ID")
    );
    INSERT INTO PAL_AD_DATA_TAB VALUES (0 , 0.5, 0.5);
    INSERT INTO PAL_AD_DATA_TAB VALUES (1 , 1.5, 0.5);
    INSERT INTO PAL_AD_DATA_TAB VALUES (2 , 1.5, 1.5);
    INSERT INTO PAL_AD_DATA_TAB VALUES (3 , 0.5, 1.5);
    INSERT INTO PAL_AD_DATA_TAB VALUES (4 , 1.1, 1.2);
    INSERT INTO PAL_AD_DATA_TAB VALUES (5 , 0.5, 15.5);
    INSERT INTO PAL_AD_DATA_TAB VALUES (6 , 1.5, 15.5);
    INSERT INTO PAL_AD_DATA_TAB VALUES (7 , 1.5, 16.5);
    INSERT INTO PAL_AD_DATA_TAB VALUES (8 , 0.5, 16.5);
    INSERT INTO PAL_AD_DATA_TAB VALUES (9 , 1.2, 16.1);
    INSERT INTO PAL_AD_DATA_TAB VALUES (10, 15.5, 15.5);
    INSERT INTO PAL_AD_DATA_TAB VALUES (11, 16.5, 15.5);
    INSERT INTO PAL_AD_DATA_TAB VALUES (12, 16.5, 16.5);
    INSERT INTO PAL_AD_DATA_TAB VALUES (13, 15.5, 16.5);
    INSERT INTO PAL_AD_DATA_TAB VALUES (14, 15.6, 16.2);
    INSERT INTO PAL_AD_DATA_TAB VALUES (15, 15.5, 0.5);
    INSERT INTO PAL_AD_DATA_TAB VALUES (16, 16.5, 0.5);
    INSERT INTO PAL_AD_DATA_TAB VALUES (17, 16.5, 1.5);
    INSERT INTO PAL_AD_DATA_TAB VALUES (18, 15.5, 1.5);
    INSERT INTO PAL_AD_DATA_TAB VALUES (19, 15.7, 1.6);
    INSERT INTO PAL_AD_DATA_TAB VALUES (20,-1.0, -1.0);

    DROP TABLE PAL_CONTROL_TAB;
    CREATE COLUMN TABLE PAL_CONTROL_TAB (
        "NAME" VARCHAR (50),
        "INTARGS" INTEGER,
        "DOUBLEARGS" DOUBLE,
        "STRINGARGS" VARCHAR (100)
    );
    INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER',2,null,null);
    INSERT INTO PAL_CONTROL_TAB VALUES ('GROUP_NUMBER',4,null,null);
    INSERT INTO PAL_CONTROL_TAB VALUES ('INIT_TYPE',4,null,null);
    INSERT INTO PAL_CONTROL_TAB VALUES ('DISTANCE_LEVEL',2,null,null);
    INSERT INTO PAL_CONTROL_TAB VALUES ('MAX_ITERATION',100,null,null);

    DROP TABLE PAL_AD_RESULT_TAB;
    CREATE COLUMN TABLE PAL_AD_RESULT_TAB (
        "ID" INT,
        "V000" DOUBLE,
        "V001" DOUBLE
    );

    CALL _SYS_AFL.PAL_ANOMALY_DETECTION(PAL_AD_DATA_TAB, PAL_CONTROL_TAB, PAL_AD_RESULT_TAB) with overview;

    select * from PAL_AD_RESULT_TAB;

Expected Result PAL_AD_RESULT_TAB:

3.1.2 K-means

In predictive analysis, k-means clustering is a method of cluster analysis. The k-means algorithm partitions n observations or records into k clusters in which each observation belongs to the cluster with the nearest center. In marketing and customer relationship management, this algorithm uses customer data to track customer behavior and create strategic business initiatives. Organizations can thus divide their customers into segments based on variables such as demographics, customer behavior, customer profitability, measure of risk, lifetime value of a customer, or retention probability.

Clustering groups records together according to an algorithm or mathematical formula that attempts to find centroids, or centers, around which similar records gravitate. The most common algorithm uses an iterative refinement technique, also referred to as Lloyd's algorithm. Given an initial set of k means m1, ..., mk, the algorithm proceeds by alternating between two steps:
- Assignment step: assigns each observation to the cluster with the closest mean.
- Update step: calculates the new means to be the center of the observations in the cluster.

The algorithm repeats until the assignments no longer change.

The k-means implementation in PAL supports multithreading, data normalization, different distance measures, and cluster quality measurement (Silhouette). The implementation does not support categorical data, but this can be managed through data transformation. The first K and random K starting methods are supported.
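The two alternating steps can be written as two plain SQL queries for a two-attribute data set with Euclidean distance. This is a sketch for intuition only and says nothing about how PAL implements k-means internally. The tables MY_POINTS, MY_CENTERS, and MY_ASSIGNMENTS and their columns are hypothetical names introduced for this example.

```sql
-- Hypothetical tables:
--   MY_POINTS("ID", "V000", "V001")             the observations
--   MY_CENTERS("CENTER_ID", "V000", "V001")     the current k means
--   MY_ASSIGNMENTS("ID", "CENTER_ASSIGN")       result of the assignment step

-- Assignment step: pair every point with the center at minimum Euclidean distance.
-- If two centers are exactly equally close, the point appears once per tied center.
SELECT p."ID",
       c."CENTER_ID" AS "CENTER_ASSIGN",
       SQRT(POWER(p."V000" - c."V000", 2) + POWER(p."V001" - c."V001", 2)) AS "DISTANCE"
FROM MY_POINTS p
CROSS JOIN MY_CENTERS c
WHERE SQRT(POWER(p."V000" - c."V000", 2) + POWER(p."V001" - c."V001", 2)) = (
    SELECT MIN(SQRT(POWER(p."V000" - c2."V000", 2) + POWER(p."V001" - c2."V001", 2)))
    FROM MY_CENTERS c2
);

-- Update step: each new center is the mean of the points currently assigned to it.
SELECT a."CENTER_ASSIGN" AS "CENTER_ID",
       AVG(p."V000") AS "V000",
       AVG(p."V001") AS "V001"
FROM MY_POINTS p
JOIN MY_ASSIGNMENTS a ON a."ID" = p."ID"
GROUP BY a."CENTER_ASSIGN";
```

Repeating the two queries, with the result of each feeding the next, until MY_ASSIGNMENTS stops changing corresponds to the termination condition described above.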

Prerequisites
- The input data contains an ID column, and the other columns are of integer or double data type.
- The input data does not contain null values. The algorithm will issue errors when encountering null values.

KMEANS

This is a clustering function using the k-means algorithm.

Procedure Generation

    CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>', 'AFLPAL', 'KMEANS', <signature table>);

The signature table should contain the following records:

    Index   Table Type Name                    Direction
    1       <INPUT table type>                 in
    2       <PARAMETER table type>             in
    3       <Result OUTPUT table type>         out
    4       <Center Point OUTPUT table type>   out

Procedure Calling

    CALL <procedure name>(<input table>, <parameter table>, <result output table>, <center point output table>) with overview;

The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature

Input Table

    Table   Column          Column Data Type    Description      Constraint
    Data    1st column      Integer or string   ID               This must be the first column.
            Other columns   Integer or double   Attribute data

Parameter Table

Name            Data Type  Description
GROUP_NUMBER    Integer    Number of groups (k).
DISTANCE_LEVEL  Integer    Computes the distance between the item and cluster center:
                           1 = Manhattan distance
                           2 = Euclidean distance
                           3 = Minkowski distance
MAX_ITERATION   Integer    Maximum iterations.
INIT_TYPE       Integer    Center initialization type:
                           1 = first K
                           2 = random with replacement
                           3 = random without replacement
                           4 = a patented method of selecting the initial centers (US 6,882,998 B1)
NORMALIZATION   Integer    Normalization type:
                           0 = no
                           1 = yes. For each point X (x1,x2,...,xn), the normalized value will be
                               X'(|x1|/S,|x2|/S,...,|xn|/S), where S = |x1|+|x2|+...+|xn|.
                           2 = for each column C, get the min and max value of C, and then
                               C[i] = (C[i]-min)/(max-min).
THREAD_NUMBER   Integer    Number of threads.
EXIT_THRESHOLD  Double     Threshold (actual value) for exiting the iterations.
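The two normalization types and the three distance levels listed in the parameter table can be illustrated with a short Python sketch. The function names are hypothetical, and the Minkowski exponent (power) is an assumption, since the parameter table does not state which exponent PAL uses.

```python
import math


def normalize_row_sum(point):
    """NORMALIZATION = 1: map each x_i to |x_i| / S, with S = |x_1| + ... + |x_n|."""
    s = sum(abs(x) for x in point)
    return [abs(x) / s for x in point]  # assumes S is not zero


def normalize_min_max(column):
    """NORMALIZATION = 2: map each C[i] to (C[i] - min) / (max - min)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]  # assumes max differs from min


def distance(a, b, level, power=3):
    """DISTANCE_LEVEL 1, 2, 3; `power` is an assumed Minkowski exponent."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if level == 1:  # Manhattan
        return sum(diffs)
    if level == 2:  # Euclidean
        return math.sqrt(sum(d * d for d in diffs))
    if level == 3:  # Minkowski
        return sum(d ** power for d in diffs) ** (1.0 / power)
    raise ValueError("unknown DISTANCE_LEVEL")


print(normalize_row_sum([1, -3]))       # [0.25, 0.75]
print(normalize_min_max([2, 4, 6]))     # [0.0, 0.5, 1.0]
print(distance((0, 0), (3, 4), 1))      # 7
print(distance((0, 0), (3, 4), 2))      # 5.0
```

Note that type 1 works per point (row) while type 2 works per column, so the two options rescale the data in different directions.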


Output Tables

Table          Column         Column Data Type   Description
Result         1st column     Integer or string  ID
               2nd column     Integer or double  Clustered item assigned to class number
               3rd column     Integer or double  The distance between the cluster and each point in the cluster
Center Points  1st column     Integer            Cluster center ID
               Other columns  Double             Cluster center coordinates

Example

Assume that:
DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and
USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;

DROP TYPE PAL_KMEANS_RESASSIGN_T;
CREATE TYPE PAL_KMEANS_RESASSIGN_T AS TABLE(
    "ID" INT,
    "CENTER_ASSIGN" INT,
    "DISTANCE" DOUBLE
);

DROP TYPE PAL_KMEANS_DATA_T;
CREATE TYPE PAL_KMEANS_DATA_T AS TABLE(
    "ID" INT,
    "V000" DOUBLE,
    "V001" DOUBLE,
    primary key("ID")
);

DROP TYPE PAL_KMEANS_CENTERS_T;
CREATE TYPE PAL_KMEANS_CENTERS_T AS TABLE(
    "CENTER_ID" INT,
    "V000" DOUBLE,
    "V001" DOUBLE
);

DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);

-- create kmeans procedure
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA(
    "ID" INT,
    "TYPENAME" VARCHAR(100),
    "DIRECTION" VARCHAR(100)
);
INSERT INTO PDATA VALUES (1, 'DM_PAL.PAL_KMEANS_DATA_T', 'in');
INSERT INTO PDATA VALUES (2, 'DM_PAL.PAL_CONTROL_T', 'in');
INSERT INTO PDATA VALUES (3, 'DM_PAL.PAL_KMEANS_RESASSIGN_T', 'out');
INSERT INTO PDATA VALUES (4, 'DM_PAL.PAL_KMEANS_CENTERS_T', 'out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('PAL_KMEANS', 'AFLPAL', 'KMEANS', PDATA);

DROP TABLE PAL_KMEANS_DATA_TAB;
CREATE COLUMN TABLE PAL_KMEANS_DATA_TAB(
    "ID" INT,
    "V000" DOUBLE,
    "V001" DOUBLE,
    primary key("ID")
);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (0 , 0.5, 0.5);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (1 , 1.5, 0.5);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (2 , 1.5, 1.5);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (3 , 0.5, 1.5);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (4 , 1.1, 1.2);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (5 , 0.5, 15.5);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (6 , 1.5, 15.5);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (7 , 1.5, 16.5);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (8 , 0.5, 16.5);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (9 , 1.2, 16.1);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (10, 15.5, 15.5);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (11, 16.5, 15.5);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (12, 16.5, 16.5);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (13, 15.5, 16.5);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (14, 15.6, 16.2);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (15, 15.5, 0.5);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (16, 16.5, 0.5);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (17, 16.5, 1.5);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (18, 15.5, 1.5);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (19, 15.7, 1.6);

DROP TABLE PAL_CONTROL_TAB;
CREATE COLUMN TABLE PAL_CONTROL_TAB(
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER',2,null,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('GROUP_NUMBER',4,null,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('INIT_TYPE',4,null,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('DISTANCE_LEVEL',2,null,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('MAX_ITERATION',100,null,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('EXIT_THRESHOLD',null,0.000001,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('NORMALIZATION',0,null,null);

--clean kmeans result
DROP TABLE PAL_KMEANS_RESASSIGN_TAB;
CREATE COLUMN TABLE PAL_KMEANS_RESASSIGN_TAB(
    "ID" INT,
    "CENTER_ASSIGN" INT,
    "DISTANCE" DOUBLE,
    primary key("ID")
);

DROP TABLE PAL_KMEANS_CENTERS_TAB;
CREATE COLUMN TABLE PAL_KMEANS_CENTERS_TAB(
    "CENTER_ID" INT,
    "V000" DOUBLE,
    "V001" DOUBLE
);

CALL _SYS_AFL.PAL_KMEANS(PAL_KMEANS_DATA_TAB, PAL_CONTROL_TAB, PAL_KMEANS_RESASSIGN_TAB, PAL_KMEANS_CENTERS_TAB) with overview;

SELECT * FROM PAL_KMEANS_CENTERS_TAB;
SELECT * FROM PAL_KMEANS_RESASSIGN_TAB;


Expected Result PAL_KMEANS_RESASSIGN_TAB:

PAL_KMEANS_CENTERS_TAB:


VALIDATEKMEANS
This is a quality measurement function for k-means clustering.

Procedure Generation

CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','VALIDATEKMEANS', <signature table>);

The signature table should contain the following records:

Index  Table Type Name          Direction
1      <Data INPUT table type>  in
2      <Type INPUT table type>  in
3      <PARAMETER table type>   in
4      <OUTPUT table type>      out

Procedure Calling

CALL <procedure name>(<data input table>, <type input table>, <parameter table>, <output table>) with overview;

The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature

Input Tables

Table                 Column         Column Data Type   Description
Data                  1st column     Integer            ID
                      Other columns  Integer or double  Attribute data
Type Data/Class Data  1st column     Integer            ID
                      2nd column     Integer            Class type

Parameter Table

Name           Data Type  Description
VARIABLE_NUM   Integer    Number of variables
THREAD_NUMBER  Integer    Number of threads

Output Table

Table   Column      Column Data Type  Description
Result  1st column  Varchar or char   Name
        2nd column  Double            Measure result
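For orientation, the Silhouette measure named earlier as the cluster quality measurement can be sketched in Python as follows. This uses the standard textbook definition with Euclidean distance; the section does not spell out PAL's exact formula, so the details (and the function name silhouette) are assumptions.

```python
import math


def silhouette(points, labels):
    """Mean silhouette value over all points (needs at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the sum includes the point itself at distance 0, hence len - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
print(silhouette(points, labels))  # close to 1 for well-separated clusters
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below suggest overlapping or misassigned points.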


Example

Assume that:
DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and
USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;

DROP TYPE T_KMEANS_DATA;
CREATE TYPE T_KMEANS_DATA AS TABLE(
    "ID" INT,
    "V000" DOUBLE,
    "V001" DOUBLE,
    primary key("ID")
);

DROP TYPE T_KMEANS_TYPE_ASSIGN;
CREATE TYPE T_KMEANS_TYPE_ASSIGN AS TABLE(
    "ID" INTEGER,
    "TYPE_ASSIGN" INTEGER
);

DROP TYPE T_KMEANS_RESULT_SVALUE;
CREATE TYPE T_KMEANS_RESULT_SVALUE AS TABLE(
    "NAME" VARCHAR (50),
    "S" DOUBLE
);

DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE,"strArgs" VARCHAR(100));

DROP table PDATA;
CREATE column table PDATA("ID" INT,"TYPENAME" VARCHAR(100),"DIRECTION" VARCHAR(100));
insert into PDATA values (1,'DM_PAL.T_KMEANS_DATA','in');
insert into PDATA values (2,'DM_PAL.T_KMEANS_TYPE_ASSIGN','in');
insert into PDATA values (3,'DM_PAL.CONTROL_T','in');
insert into PDATA values (4,'DM_PAL.T_KMEANS_RESULT_SVALUE','out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('palValidateKMeans','AFLPAL','VALIDATEKMEANS',PDATA);

DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE,"strArgs" VARCHAR(100));
INSERT INTO #CONTROL_TAB VALUES ('VARIABLE_NUM', 2, null, null);
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER', 1, null, null);

DROP VIEW V_KMEANS_TYPE_ASSIGN;
CREATE VIEW V_KMEANS_TYPE_ASSIGN AS SELECT "ID", "CENTER_ASSIGN" AS "TYPE_ASSIGN" FROM PAL_KMEANS_RESASSIGN_TAB;

DROP TABLE KMEANS_SVALUE_TAB;
CREATE COLUMN TABLE KMEANS_SVALUE_TAB (
    "NAME" VARCHAR (50),
    "S" DOUBLE
);

CALL _SYS_AFL.palValidateKMeans(PAL_KMEANS_DATA_TAB, V_KMEANS_TYPE_ASSIGN, "#CONTROL_TAB", KMEANS_SVALUE_TAB) with overview;

SELECT * FROM KMEANS_SVALUE_TAB;

Expected Result KMEANS_SVALUE_TAB:


3.1.3 Self-Organizing Maps

Self-organizing feature maps (SOMs) are one of the most popular neural network methods for cluster analysis. They are sometimes referred to as Kohonen self-organizing feature maps, after their creator, Teuvo Kohonen, or as topologically ordered maps. SOMs aim to represent all points in a high-dimensional source space by points in a low-dimensional (usually 2-D or 3-D) target space, such that the distance and proximity relationships are preserved as much as possible. This makes SOMs useful for visualizing low-dimensional views of high-dimensional data, akin to multidimensional scaling. SOMs can also be viewed as a constrained version of k-means clustering, in which the cluster centers tend to lie in a low-dimensional manifold in the feature or attribute space.

The learning process mainly includes three steps:
1. Initialize the weight vectors in each unit.
2. Select the Best Matching Unit (BMU) for every point and update the weight vectors of the BMU and its neighbors.
3. Repeat Step 2 until convergence or until the maximum number of iterations is reached.

The SOM approach has many applications, such as visualization, web document clustering, and speech recognition.
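The three learning steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the initialization from random input points, the linearly decaying learning rate, and the Gaussian neighborhood function are common textbook choices, not details taken from the PAL implementation, and the names train_som and assign_to_units are hypothetical.

```python
import math
import random


def train_som(points, n, max_iteration=100, seed=0):
    """Train an n x n map on `points`; returns a list of n*n weight vectors."""
    rng = random.Random(seed)
    units = n * n
    # Step 1: initialize every unit's weight vector from a random input point.
    weights = [list(rng.choice(points)) for _ in range(units)]
    for t in range(max_iteration):
        decay = 1.0 - t / max_iteration
        rate = 0.5 * decay                   # assumed learning-rate schedule
        radius = max(n / 2.0 * decay, 0.5)   # assumed neighborhood radius
        for p in points:
            # Step 2a: the BMU is the unit whose weight vector is closest to p.
            bmu = min(range(units), key=lambda u: math.dist(p, weights[u]))
            bmu_row, bmu_col = divmod(bmu, n)
            # Step 2b: pull the BMU and its grid neighbors toward p.
            for u in range(units):
                row, col = divmod(u, n)
                grid_dist_sq = (row - bmu_row) ** 2 + (col - bmu_col) ** 2
                influence = math.exp(-grid_dist_sq / (2.0 * radius ** 2))
                weights[u] = [w + rate * influence * (x - w)
                              for w, x in zip(weights[u], p)]
        # Step 3: repeat; this sketch stops only at max_iteration.
    return weights


def assign_to_units(points, weights):
    """Return, for each point, the index of its best matching unit."""
    return [min(range(len(weights)), key=lambda u: math.dist(p, weights[u]))
            for p in points]


points = [(0.1, 0.2), (0.3, 0.4), (16.2, 1.5), (50.1, 50.2)]
weights = train_som(points, n=2, max_iteration=20)
print(assign_to_units(points, weights))
```

The two return values correspond loosely to the two outputs of the function described below: the weight vectors per unit cell, and the unit cell assigned to each input tuple.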

Prerequisites
The first column of the input data is an ID column, and the other columns are of integer or double data type. The input data does not contain null values. The algorithm issues errors when it encounters null values.

SELFORGMAP
Procedure Generation

CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>', 'AFLPAL', 'SELFORGMAP', <signature table>);

The signature table should contain the following records:

Index  Table Type Name             Direction
1      <INPUT table type>          in
2      <PARAMETER table type>      in
3      <Map OUTPUT table type>     out
4      <Assign OUTPUT table type>  out

Procedure Calling

CALL <procedure name>(<input table>, <parameter table>, <map output table>, <assign output table>) with overview;

The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature

Input Table

Table  Column         Column Data Type   Description     Constraint
Data   1st column     Integer or string  ID              This must be the first column.
       Other columns  Integer or double  Attribute data

Parameter Table

Name           Data Type  Description
MAX_ITERATION  Integer    Maximum number of iterations.
NORMALIZATION  Integer    Normalization type:
                          0 = no
                          1 = transform to new range (0.0, 1.0)
                          2 = z-score normalization
THREAD_NUMBER  Integer    Number of threads.
SIZE_OF_MAP    Integer    Self-organizing map is made up of n x n unit cells. This parameter indicates the n.

Output Tables

Table       Column                             Column Data Type   Description
SOM Map     1st column                         Integer            Unit cell ID.
            Other columns except the last one  Double             Weight vectors used to simulate the original tuples.
            Last column                        Integer            Number of original tuples that every unit cell contains.
SOM Assign  1st column                         Integer or string  ID of original tuples
            2nd column                         Integer            ID of the unit cells


Example

Assume that:
DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and
USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;

DROP TYPE PAL_SOM_DATA_T;
CREATE TYPE PAL_SOM_DATA_T AS TABLE(
    "TRANS_ID" INT,
    "V000" DOUBLE,
    "V001" DOUBLE,
    primary key("TRANS_ID")
);

DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);

DROP TYPE PAL_SOM_MAP_T;
CREATE TYPE PAL_SOM_MAP_T AS TABLE(
    "CELL_ID" INT,
    "WEIGHT000" DOUBLE,
    "WEIGHT001" DOUBLE,
    "NUMS_TUPLE" INT
);

DROP TYPE PAL_SOM_RESASSIGN_T;
CREATE TYPE PAL_SOM_RESASSIGN_T AS TABLE(
    "TRANS_ID" INT,
    "CELL_ID" INT,
    primary key("TRANS_ID")
);

-- create procedure
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA(
    "ID" INT,
    "TYPENAME" VARCHAR(100),
    "DIRECTION" VARCHAR(100)
);
INSERT INTO PDATA VALUES (1, 'DM_PAL.PAL_SOM_DATA_T', 'in');
INSERT INTO PDATA VALUES (2, 'DM_PAL.PAL_CONTROL_T', 'in');
INSERT INTO PDATA VALUES (3, 'DM_PAL.PAL_SOM_MAP_T', 'out');
INSERT INTO PDATA VALUES (4, 'DM_PAL.PAL_SOM_RESASSIGN_T', 'out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('PAL_SELF_ORG_MAP', 'AFLPAL', 'SELFORGMAP', PDATA);

DROP TABLE PAL_SOM_DATA_TAB;
CREATE COLUMN TABLE PAL_SOM_DATA_TAB(
    "TRANS_ID" INT,
    "V000" DOUBLE,
    "V001" DOUBLE,
    primary key("TRANS_ID")
);
INSERT INTO PAL_SOM_DATA_TAB VALUES (0 , 0.1, 0.2);
INSERT INTO PAL_SOM_DATA_TAB VALUES (1 , 0.22, 0.25);
INSERT INTO PAL_SOM_DATA_TAB VALUES (2 , 0.3, 0.4);
INSERT INTO PAL_SOM_DATA_TAB VALUES (3 , 0.4, 0.5);
INSERT INTO PAL_SOM_DATA_TAB VALUES (4 , 0.5, 1.0);
INSERT INTO PAL_SOM_DATA_TAB VALUES (5 , 1.1, 15.1);
INSERT INTO PAL_SOM_DATA_TAB VALUES (6 , 2.2, 11.2);
INSERT INTO PAL_SOM_DATA_TAB VALUES (7 , 1.3, 15.3);
INSERT INTO PAL_SOM_DATA_TAB VALUES (8 , 1.4, 15.4);
INSERT INTO PAL_SOM_DATA_TAB VALUES (9 , 3.5, 15.9);
INSERT INTO PAL_SOM_DATA_TAB VALUES (10,13.1, 1.1);
INSERT INTO PAL_SOM_DATA_TAB VALUES (11,16.2, 1.5);
INSERT INTO PAL_SOM_DATA_TAB VALUES (12,16.3, 1.3);
INSERT INTO PAL_SOM_DATA_TAB VALUES (13,12.4, 2.4);
INSERT INTO PAL_SOM_DATA_TAB VALUES (14,16.9, 1.9);
INSERT INTO PAL_SOM_DATA_TAB VALUES (15,49.0, 40.1);
INSERT INTO PAL_SOM_DATA_TAB VALUES (16,50.1, 50.2);
INSERT INTO PAL_SOM_DATA_TAB VALUES (17,50.2, 48.3);
INSERT INTO PAL_SOM_DATA_TAB VALUES (18,55.3, 50.4);
INSERT INTO PAL_SOM_DATA_TAB VALUES (19,50.4, 56.5);

DROP TABLE PAL_CONTROL_TAB;
CREATE COLUMN TABLE PAL_CONTROL_TAB (
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER', 2, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('MAX_ITERATION', 200, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('SIZE_OF_MAP', 4, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('NORMALIZATION', 0, null, null);

DROP TABLE PAL_SOM_MAP_TAB;
CREATE COLUMN TABLE PAL_SOM_MAP_TAB (
    "CELL_ID" INT,
    "WEIGHT000" DOUBLE,
    "WEIGHT001" DOUBLE,
    "NUMS_TUPLE" INT
);

DROP TABLE PAL_SOM_RESASSIGN_TAB;
CREATE COLUMN TABLE PAL_SOM_RESASSIGN_TAB (
    "TRANS_ID" INT,
    "CELL_ID" INT,
    primary key("TRANS_ID")
);

CALL _SYS_AFL.PAL_SELF_ORG_MAP(PAL_SOM_DATA_TAB, PAL_CONTROL_TAB, PAL_SOM_MAP_TAB, PAL_SOM_RESASSIGN_TAB) with overview;

select * from PAL_SOM_MAP_TAB;
select * from PAL_SOM_RESASSIGN_TAB;


Expected Result PAL_SOM_MAP_TAB:

PAL_SOM_RESASSIGN_TAB:


3.2 Classification Algorithms

3.2.1 Bi-Variate Geometric Regression

Geometric regression is an approach used to model the relationship between a scalar variable y and a single variable denoted x. In geometric regression, data are modeled using geometric functions, and unknown model parameters are estimated from the data. Such models are called geometric models. In PAL, geometric regression is implemented by transforming the model into a linear regression and solving that:

$$y = \beta_0 x^{\beta_1}$$

where $\beta_0$ and $\beta_1$ are the parameters that need to be calculated.

The steps are:
1. Take the natural logarithm of both sides: $\ln(y) = \ln(\beta_0 x^{\beta_1})$
2. Transform it into: $\ln(y) = \ln(\beta_0) + \beta_1 \ln(x)$
3. Let $y' = \ln(y)$, $x' = \ln(x)$, and $\beta_0' = \ln(\beta_0)$, which gives: $y' = \beta_0' + \beta_1 x'$

Thus, y' and x' have a linear relationship, which can be solved with the linear regression method.
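The transformation can be checked with a small Python sketch that fits the linear model on the logarithms and then maps the intercept back. The function names are hypothetical, ordinary least squares is assumed as the linear regression method, and all x and y values must be positive for the logarithms to exist.

```python
import math


def fit_geometric(xs, ys):
    """Fit y = b0 * x**b1 by least squares on (ln x, ln y); returns (b0, b1)."""
    lx = [math.log(x) for x in xs]   # x' = ln(x)
    ly = [math.log(y) for y in ys]   # y' = ln(y)
    n = len(xs)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b1 = sxy / sxx
    b0_prime = mean_y - b1 * mean_x  # b0' = ln(b0)
    return math.exp(b0_prime), b1


def predict_geometric(b0, b1, x):
    """Evaluate the fitted geometric model at x."""
    return b0 * x ** b1


xs = [1, 2, 3, 4, 5]
ys = [2 * x ** 3 for x in xs]        # exact data from y = 2 * x^3
b0, b1 = fit_geometric(xs, ys)
print(round(b0, 6), round(b1, 6))    # 2.0 3.0
```

Because the fit minimizes the squared error of ln(y) rather than of y itself, noisy data can give coefficients that differ from a direct nonlinear fit.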

The implementation also supports calculating the F value and R^2 to determine statistical significance.

Prerequisites
No missing or null data in the inputs. The data is numeric, not categorical.


GEOREGRESSION
This is a geometric regression function.

Procedure Generation

CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','GEOREGRESSION', <signature table>);

The signature table should contain the following records:

Index  Table Type Name                   Direction
1      <INPUT table type>                in
2      <PARAMETER table type>            in
3      <Result OUTPUT table type>        out
4      <Fitted OUTPUT table type>        out
5      <Significance OUTPUT table type>  out
6      <PMML OUTPUT table type>          out

Procedure Calling

CALL <procedure name>(<input table>, <parameter table>, <result output table>, <fitted output table>, <significance output table>, <PMML output table>) with overview;

The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature

Input Table

Table  Column      Column Data Type    Description
Data   1st column  Integer or varchar  ID
       2nd column  Integer or double   Variable y
       3rd column  Integer or double   Variable x


Parameter Table

Name           Data Type  Description
THREAD_NUMBER  Integer    Number of threads.
PMML_EXPORT    Integer    0 (default): does not export geometric regression model in PMML.
                          1: exports geometric regression model in PMML in single row.
                          2: exports geometric regression model in PMML in several rows, each row containing a maximum of 5000 characters.

Output Tables

Table         Column      Column Data Type    Description
Result        1st column  Integer             ID
              2nd column  Integer or double   Value Ai
                                              A0: intercept
                                              A1: beta coefficient for X1
Fitted Data   1st column  Integer or varchar  ID
              2nd column  Integer or double   Value Yi
Significance  1st column  Varchar or char     Name (R^2 / F)
              2nd column  Double              Value
PMML Result   1st column  Integer             ID
              2nd column  CLOB or varchar     Geometric regression model in PMML format


Example

Assume that:
DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and
USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;

DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE( "ID" INT,"Y" DOUBLE,"X1" DOUBLE);
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("ID" INT,"Ai" DOUBLE);
DROP TYPE FITTED_T;
CREATE TYPE FITTED_T AS TABLE("ID" INT,"Fitted" DOUBLE);
DROP TYPE SIGNIFICANCE_T;
CREATE TYPE SIGNIFICANCE_T AS TABLE("NAME" varchar(50),"VALUE" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE,"strArgs" VARCHAR(100));
DROP TYPE MODEL_T;
CREATE TYPE MODEL_T AS TABLE("ID" INT,"Model" varchar(5000));

DROP table PDATA;
CREATE column table PDATA("ID" INT,"TYPENAME" VARCHAR(100),"DIRECTION" VARCHAR(100));
insert into PDATA values (1,'DM_PAL.DATA_T','in');
insert into PDATA values (2,'DM_PAL.CONTROL_T','in');
insert into PDATA values (3,'DM_PAL.RESULT_T','out');
insert into PDATA values (4,'DM_PAL.FITTED_T','out');
insert into PDATA values (5,'DM_PAL.SIGNIFICANCE_T','out');
insert into PDATA values (6,'DM_PAL.MODEL_T','out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('palGeoR','AFLPAL','GEOREGRESSION',PDATA);

DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE,"strArgs" VARCHAR(100));
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',8,null,null);
INSERT INTO #CONTROL_TAB VALUES ('PMML_EXPORT',2,null,null);

DROP TABLE DATA_TAB;
CREATE COLUMN TABLE DATA_TAB ( "ID" INT,"Y" DOUBLE,"X1" DOUBLE);
INSERT INTO DATA_TAB VALUES (0,1.1,1);
INSERT INTO DATA_TAB VALUES (1,4.2,2);
INSERT INTO DATA_TAB VALUES (2,8.9,3);
INSERT INTO DATA_TAB VALUES (3,16.3,4);
INSERT INTO DATA_TAB VALUES (4,24,5);
INSERT INTO DATA_TAB VALUES (5,36,6);
INSERT INTO DATA_TAB VALUES (6,48,7);
INSERT INTO DATA_TAB VALUES (7,64,8);
INSERT INTO DATA_TAB VALUES (8,80,9);
INSERT INTO DATA_TAB VALUES (9,101,10);

DROP TABLE RESULTS_TAB;
CREATE COLUMN TABLE RESULTS_TAB ("ID" INT,"Ai" DOUBLE);
DROP TABLE FITTED_TAB;
CREATE COLUMN TABLE FITTED_TAB ("ID" INT,"Fitted" DOUBLE);
DROP TABLE SIGNIFICANCE_TAB;
CREATE COLUMN TABLE SIGNIFICANCE_TAB ("NAME" varchar(50),"VALUE" DOUBLE);
DROP TABLE MODEL_TAB;
CREATE COLUMN TABLE MODEL_TAB ("ID" INT, "PMMLMODEL" VARCHAR(5000));

CALL _SYS_AFL.palGeoR(DATA_TAB, "#CONTROL_TAB", RESULTS_TAB, FITTED_TAB, SIGNIFICANCE_TAB, MODEL_TAB) with overview;

SELECT * FROM RESULTS_TAB;
SELECT * FROM FITTED_TAB;
SELECT * FROM SIGNIFICANCE_TAB;
SELECT * FROM MODEL_TAB;


Expected Result RESULTS_TAB:

FITTED_TAB:

SIGNIFICANCE_TAB:

MODEL_TAB:


FORECASTWITHGEOR
This function performs prediction with the geometric regression result.

Procedure Generation

CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','FORECASTWITHGEOR', <signature table>);

The signature table should contain the following records:

Index  Table Type Name                 Direction
1      <Predictive INPUT table type>   in
2      <Coefficient INPUT table type>  in
3      <PARAMETER table type>          in
4      <OUTPUT table type>             out

Procedure Calling

CALL <procedure name>(<predictive input table>, <coefficient input table>, <parameter table>, <output table>) with overview;

The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature

Input Tables

Table            Column      Column Data Type    Description
Predictive Data  1st column  Integer or varchar  ID
                 2nd column  Integer or double   Variable X
Coefficient      1st column  Integer             ID (start from 0)
                 2nd column  Integer or double   Value Ai

Parameter Table

Name           Data Type  Description
THREAD_NUMBER  Integer    Number of threads

Output Table

Table          Column      Column Data Type    Description
Fitted Result  1st column  Integer or varchar  ID
               2nd column  Integer or double   Value Yi


Example

Assume that:
DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and
USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;

DROP TYPE PREDICT_T;
CREATE TYPE PREDICT_T AS TABLE( "ID" INT,"X1" DOUBLE);
DROP TYPE COEFFICIENT_T;
CREATE TYPE COEFFICIENT_T AS TABLE("ID" INT,"Ai" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE,"strArgs" VARCHAR(100));
DROP TYPE FITTED_T;
CREATE TYPE FITTED_T AS TABLE("ID" INT,"Fitted" DOUBLE);

DROP table PDATA;
CREATE column table PDATA("ID" INT,"TYPENAME" VARCHAR(100),"DIRECTION" VARCHAR(100));
insert into PDATA values (1,'DM_PAL.PREDICT_T','in');
insert into PDATA values (2,'DM_PAL.COEFFICIENT_T','in');
insert into PDATA values (3,'DM_PAL.CONTROL_T','in');
insert into PDATA values (4,'DM_PAL.FITTED_T','out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('palForecastWithGeoR','AFLPAL','FORECASTWITHGEOR',PDATA);

DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE,"strArgs" VARCHAR(100));
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',8,null,null);

DROP TABLE PREDICTDATA_TAB;
CREATE COLUMN TABLE PREDICTDATA_TAB ( "ID" INT,"X1" DOUBLE);
INSERT INTO PREDICTDATA_TAB VALUES (0,1);
INSERT INTO PREDICTDATA_TAB VALUES (1,2);
INSERT INTO PREDICTDATA_TAB VALUES (2,3);
INSERT INTO PREDICTDATA_TAB VALUES (3,4);
INSERT INTO PREDICTDATA_TAB VALUES (4,5);

DROP TABLE COEEFICIENT_TAB;
CREATE COLUMN TABLE COEEFICIENT_TAB ("ID" INT,"Ai" DOUBLE);
INSERT INTO COEEFICIENT_TAB VALUES (0,1);
INSERT INTO COEEFICIENT_TAB VALUES (1,1.99);

DROP TABLE FITTED_TAB;
CREATE COLUMN TABLE FITTED_TAB ("ID" INT,"Fitted" DOUBLE);

CALL _SYS_AFL.palForecastWithGeoR(PREDICTDATA_TAB, COEEFICIENT_TAB, "#CONTROL_TAB", FITTED_TAB) with overview;

SELECT * FROM FITTED_TAB;

Expected Result FITTED_TAB:


3.2.2 Bi-Variate Natural Logarithmic Regression

Bi-variate natural logarithmic regression is an approach to modeling the relationship between a scalar variable y and one variable denoted x. In natural logarithmic regression, data are modeled using natural logarithmic functions, and unknown model parameters are estimated from the data. Such models are called natural logarithmic models. In PAL, natural logarithmic regression is implemented by transforming the model into a linear regression and solving that:

$$y = \beta_1 \ln(x) + \beta_0$$

where $\beta_0$ and $\beta_1$ are the parameters that need to be calculated.

Let $x' = \ln(x)$. Then:

$$y = \beta_0 + \beta_1 x'$$

Thus, y and x' have a linear relationship, which can be solved with the linear regression method.

The implementation also supports calculating the F value and R^2 to determine statistical significance.
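As an illustration of the transformation and of R^2, the following Python sketch fits the logarithmic model by ordinary least squares on x' = ln(x). The function name fit_ln is hypothetical, and the way PAL computes R^2 and the F value is not described in this section, so the R^2 formula here is the usual textbook one and the F value is omitted. The sample values are the ones loaded into DATA_TAB in the LNREGRESSION example.

```python
import math


def fit_ln(xs, ys):
    """Fit y = b0 + b1 * ln(x) by least squares; returns (b0, b1, r_squared)."""
    xp = [math.log(x) for x in xs]            # x' = ln(x)
    n = len(xs)
    mean_x = sum(xp) / n
    mean_y = sum(ys) / n
    sxx = sum((u - mean_x) ** 2 for u in xp)
    sxy = sum((u - mean_x) * (y - mean_y) for u, y in zip(xp, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (b0 + b1 * u)) ** 2 for u, y in zip(xp, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return b0, b1, 1.0 - ss_res / ss_tot


xs = [1, 2, 3, 4, 5, 6, 7]
ys = [10, 80, 130, 160, 180, 190, 192]
b0, b1, r2 = fit_ln(xs, ys)
print(b0, b1, r2)
```

On this sample the fitted coefficients agree closely with the values 14.86160299 and 98.29359746 that the FORECASTWITHLNR example passes in as A0 and A1.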

Prerequisites
No missing or null data in the inputs. The data is numeric, not categorical. Given the structure as Y and X, there are more than 2 records available for analysis.


LNREGRESSION
This is a logarithmic regression function.

Procedure Generation

CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','LNREGRESSION', <signature table>);

The signature table should contain the following records:

Index  Table Type Name                   Direction
1      <INPUT table type>                in
2      <PARAMETER table type>            in
3      <Result OUTPUT table type>        out
4      <Fitted OUTPUT table type>        out
5      <Significance OUTPUT table type>  out
6      <PMML OUTPUT table type>          out

Procedure Calling

CALL <procedure name>(<input table>, <parameter table>, <result output table>, <fitted output table>, <significance output table>, <PMML output table>) with overview;

The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature

Input Table

Table  Column      Column Data Type    Description
Data   1st column  Integer or varchar  ID
       2nd column  Integer or double   Variable y
       3rd column  Integer or double   Variable X


Parameter Table

Name           Data Type  Description
THREAD_NUMBER  Integer    Number of threads
PMML_EXPORT    Integer    0 (default): does not export logarithmic regression model in PMML.
                          1: exports logarithmic regression model in PMML in single row.
                          2: exports logarithmic regression model in PMML in several rows, each row containing a maximum of 5000 characters.

Output Tables

Table         Column      Column Data Type    Description
Result        1st column  Integer             ID
              2nd column  Integer or double   Value Ai
                                              A0: intercept
                                              A1: beta coefficient for X1
Fitted Data   1st column  Integer or varchar  ID
              2nd column  Integer or double   Value Yi
Significance  1st column  Varchar or char     Name (R^2 / F)
              2nd column  Double              Value
PMML Result   1st column  Integer             ID
              2nd column  CLOB or varchar     Logarithmic regression model in PMML format


Example

Assume that:
DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and
USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;

DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE( "ID" INT,"Y" DOUBLE,"X1" DOUBLE);
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("ID" INT,"Ai" DOUBLE);
DROP TYPE FITTED_T;
CREATE TYPE FITTED_T AS TABLE("ID" INT,"Fitted" DOUBLE);
DROP TYPE SIGNIFICANCE_T;
CREATE TYPE SIGNIFICANCE_T AS TABLE("NAME" varchar(50),"VALUE" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE,"strArgs" VARCHAR(100));
DROP TYPE MODEL_T;
CREATE TYPE MODEL_T AS TABLE("ID" INT,"Model" varchar(5000));

DROP table PDATA;
CREATE column table PDATA("ID" INT,"TYPENAME" VARCHAR(100),"DIRECTION" VARCHAR(100));
insert into PDATA values (1,'DM_PAL.DATA_T','in');
insert into PDATA values (2,'DM_PAL.CONTROL_T','in');
insert into PDATA values (3,'DM_PAL.RESULT_T','out');
insert into PDATA values (4,'DM_PAL.FITTED_T','out');
insert into PDATA values (5,'DM_PAL.SIGNIFICANCE_T','out');
insert into PDATA values (6,'DM_PAL.MODEL_T','out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('palLnR','AFLPAL','LNREGRESSION',PDATA);

DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE,"strArgs" VARCHAR(100));
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',8,null,null);
INSERT INTO #CONTROL_TAB VALUES ('PMML_EXPORT',2,null,null);

DROP TABLE DATA_TAB;
CREATE COLUMN TABLE DATA_TAB ( "ID" INT,"Y" DOUBLE,"X1" DOUBLE);
INSERT INTO DATA_TAB VALUES (0,10,1);
INSERT INTO DATA_TAB VALUES (1,80,2);
INSERT INTO DATA_TAB VALUES (2,130,3);
INSERT INTO DATA_TAB VALUES (3,160,4);
INSERT INTO DATA_TAB VALUES (4,180,5);
INSERT INTO DATA_TAB VALUES (5,190,6);
INSERT INTO DATA_TAB VALUES (6,192,7);

DROP TABLE RESULTS_TAB;
CREATE COLUMN TABLE RESULTS_TAB ("ID" INT,"Ai" DOUBLE);
DROP TABLE FITTED_TAB;
CREATE COLUMN TABLE FITTED_TAB ("ID" INT,"Fitted" DOUBLE);
DROP TABLE SIGNIFICANCE_TAB;
CREATE COLUMN TABLE SIGNIFICANCE_TAB ("NAME" varchar(50),"VALUE" DOUBLE);
DROP TABLE MODEL_TAB;
CREATE COLUMN TABLE MODEL_TAB("ID" INT, "PMMLMODEL" VARCHAR(5000));

CALL _SYS_AFL.palLnR(DATA_TAB, "#CONTROL_TAB", RESULTS_TAB, FITTED_TAB, SIGNIFICANCE_TAB, MODEL_TAB) with overview;

SELECT * FROM RESULTS_TAB;
SELECT * FROM FITTED_TAB;
SELECT * FROM SIGNIFICANCE_TAB;
SELECT * FROM MODEL_TAB;


Expected Result RESULTS_TAB:

FITTED_TAB:

SIGNIFICANCE_TAB:

MODEL_TAB:


FORECASTWITHLNR
This function performs prediction with the natural logarithmic regression result.

Procedure Generation

CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','FORECASTWITHLNR', <signature table>);

The signature table should contain the following records:

Index  Table Type Name                 Direction
1      <Predictive INPUT table type>   in
2      <Coefficient INPUT table type>  in
3      <PARAMETER table type>          in
4      <OUTPUT table type>             out

Procedure Calling

CALL <procedure name>(<predictive input table>, <coefficient input table>, <parameter table>, <output table>) with overview;

The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature

Input Tables

Table            Column      Column Data Type    Description
Predictive Data  1st column  Integer or varchar  ID
                 2nd column  Integer or double   Variable X
Coefficient      1st column  Integer             ID (start from 0)
                 2nd column  Integer or double   Value Ai

Parameter Table

Name           Data Type  Description
THREAD_NUMBER  Integer    Number of threads

Output Table

Table          Column      Column Data Type    Description
Fitted Result  1st column  Integer or varchar  ID
               2nd column  Integer or double   Value Yi


Example

Assume that:
DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and
USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;

DROP TYPE PREDICT_T;
CREATE TYPE PREDICT_T AS TABLE( "ID" INT,"X1" DOUBLE);
DROP TYPE COEFFICIENT_T;
CREATE TYPE COEFFICIENT_T AS TABLE("ID" INT,"Ai" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE,"strArgs" VARCHAR(100));
DROP TYPE FITTED_T;
CREATE TYPE FITTED_T AS TABLE("ID" INT,"Fitted" DOUBLE);

DROP table PDATA;
CREATE column table PDATA("ID" INT,"TYPENAME" VARCHAR(100),"DIRECTION" VARCHAR(100));
insert into PDATA values (1,'DM_PAL.PREDICT_T','in');
insert into PDATA values (2,'DM_PAL.COEFFICIENT_T','in');
insert into PDATA values (3,'DM_PAL.CONTROL_T','in');
insert into PDATA values (4,'DM_PAL.FITTED_T','out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('palForecastWithLnR','AFLPAL','FORECASTWITHLNR',PDATA);

DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE,"strArgs" VARCHAR(100));
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',8,null,null);

DROP TABLE PREDICTDATA_TAB;
CREATE COLUMN TABLE PREDICTDATA_TAB ( "ID" INT,"X1" DOUBLE);
INSERT INTO PREDICTDATA_TAB VALUES (0,1);
INSERT INTO PREDICTDATA_TAB VALUES (1,2);
INSERT INTO PREDICTDATA_TAB VALUES (2,3);
INSERT INTO PREDICTDATA_TAB VALUES (3,4);
INSERT INTO PREDICTDATA_TAB VALUES (4,5);
INSERT INTO PREDICTDATA_TAB VALUES (5,6);
INSERT INTO PREDICTDATA_TAB VALUES (6,7);

DROP TABLE COEEFICIENT_TAB;
CREATE COLUMN TABLE COEEFICIENT_TAB ("ID" INT,"Ai" DOUBLE);
INSERT INTO COEEFICIENT_TAB VALUES (0,14.86160299);
INSERT INTO COEEFICIENT_TAB VALUES (1,98.29359746);

DROP TABLE FITTED_TAB;
CREATE COLUMN TABLE FITTED_TAB ("ID" INT,"Fitted" DOUBLE);

CALL _SYS_AFL.palForecastWithLnR(PREDICTDATA_TAB, COEEFICIENT_TAB, "#CONTROL_TAB", FITTED_TAB) with overview;

SELECT * FROM FITTED_TAB;

Expected Result FITTED_TAB:


3.2.3 C4.5 Decision Tree

A decision tree is used as a classifier for determining an appropriate action or decision among a predetermined set of actions for a given case. It helps you identify the factors to consider and see how each factor has historically been associated with different outcomes of the decision. A decision tree uses a tree-like structure of conditions and their possible consequences. Each node of a decision tree is either a leaf node or a decision node.
Leaf node: holds the value of the dependent (target) variable.
Decision node: contains one condition that specifies a test on an attribute value. The outcomes of the condition lead to branches, each ending in a subtree or a leaf node.
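The leaf/decision node structure described above can be sketched in a few lines of Python. This is only an illustration: the class names, the `classify` helper, and the hand-built tree are hypothetical and do not reflect the JSON model format that PAL produces.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Leaf:
    label: str  # value of the dependent (target) variable


@dataclass
class Decision:
    attribute: str  # attribute tested by this node's condition
    # One branch per outcome of the test; each leads to a subtree or a leaf.
    branches: Dict[str, "Node"] = field(default_factory=dict)


Node = Union[Leaf, Decision]


def classify(node, record):
    """Follow the branches that match the record until a leaf is reached."""
    while isinstance(node, Decision):
        node = node.branches[record[node.attribute]]
    return node.label


# Hypothetical, hand-built tree (not the output of CREATEDT).
tree = Decision("SALESPERIOD", {
    "Winter": Leaf("Good"),
    "Summer": Leaf("Poor"),
    "Spring": Decision("REGION", {"North": Leaf("Average"), "West": Leaf("Poor")}),
})

print(classify(tree, {"SALESPERIOD": "Spring", "REGION": "West"}))
```

Classification walks from the root to exactly one leaf, so its cost is bounded by the depth of the tree.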

As a classification algorithm, C4.5 builds decision trees from a set of training data, using the concept of information entropy. The training data is a set of already classified samples. At each node of the tree, C4.5 chooses the attribute that most effectively splits the data into subsets dominated by one class or another. Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then proceeds recursively until a stopping criterion is met, such as the minimum number of cases in a leaf node.

The C4.5 decision tree functions implemented in PAL support both discrete and continuous values. A continuous attribute is discretized using fixed intervals provided by the user. For example, if the salary ranges from $100 to $20000, you can form intervals such as $0 to $8000, $8000 to $18000, and $18000 to $20000. Each attribute value falls into exactly one of these intervals.

In the PAL implementation, the REP (Reduced Error Pruning) algorithm is used as the pruning method.
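The split criterion can be illustrated with a small Python sketch. It computes the normalized information gain (often called the gain ratio) for each attribute of the sample training data used in the CREATEDT example. The function names are hypothetical, the handling of interval boundaries in `discretize` is an assumption, and the code is not the PAL implementation.

```python
import math
from bisect import bisect_right
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on `values`, normalized by the split entropy."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


def discretize(x, cut_points):
    """Map a continuous value to an interval index.

    Assumption: a value equal to a cut point goes to the upper interval.
    """
    return bisect_right(sorted(cut_points), x)


region = ["South", "North", "West", "East", "West", "East", "South", "South", "North"]
period = ["Winter", "Spring", "Summer", "Autumn", "Spring", "Spring", "Summer", "Spring", "Winter"]
revenue = [100000, 45000, 30000, 5000, 5000, 200000, 25000, 10000, 50000]
label = ["Good", "Average", "Poor", "Poor", "Poor", "Good", "Poor", "Average", "Average"]

revenue_bin = [discretize(r, [25000, 60000]) for r in revenue]
for name, column in [("REGION", region), ("SALESPERIOD", period), ("REVENUE", revenue_bin)]:
    print(name, round(gain_ratio(column, label), 4))
```

The attribute with the largest printed value would be the one chosen for the split at the root under this criterion.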

Prerequisites
The column order and column number of the predicted data are the same as the order and number used in tree model building.
The last column of the training data is used as a predicted field and is of discrete type.
The predicted data set has an ID column.
The input data does not contain null values. The algorithm will issue errors when encountering null values.
The table used to store the tree model is a column table.


CREATEDT
This function creates a decision tree from the input training data.

Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>', 'AFLPAL', 'CREATEDT', <signature table>);
The signature table should contain the following records:

Index   Table Type Name               Direction
1       <INPUT table type>            in
2       <PARAMETER table type>        in
3       <Result OUTPUT table type>    out
4       <PMML OUTPUT table type>      out

Procedure Calling
CALL <procedure name>(<input table>, <parameter table>, <result output table>, <PMML output table>) with overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature

Input Table
Table                        Column    Column Data Type                    Description                                     Constraint
Training / Historical Data   Columns   Varchar, char, integer, or double   Table used to build the predictive tree model   Discrete value: integer or varchar/char
                                                                                                                           Continuous value: integer or double


Parameter Table
Name               Data Type           Description
PERCENTAGE         Double              The percentage to be applied to the input training data set.
MIN_NUMS_RECORDS   Integer             Controls the minimum number of training records in every leaf node. The default is zero.
THREAD_NUMBER      Integer             Number of threads.
IS_SPLIT_MODEL     Integer             Indicates whether the string of the tree model should be split. If the value is not 0, the tree model will be split, and the maximum length of each unit is k.
CONTINUOUS_COL     Integer or double   (Optional) Defines which column needs discretization and the interval provided by the user. The column index starts from zero. The integer value specifies the column position. The double value specifies the interval.
PMML_EXPORT        Integer             0 (default): does not export the PMML tree model.
                                       1: exports the PMML tree model in a single row.
                                       2: exports the PMML tree model in several rows, each row containing a maximum of 5000 characters.

Output Tables
Table                 Column       Column Data Type   Description                               Constraint
Result (tree model)   1st column   Integer            ID
                      2nd column   CLOB or varchar    Tree model saved as a JSON string.        The table must be a column table. The maximum length is 5000.
PMML Result           1st column   Integer            ID
                      2nd column   CLOB or varchar    C4.5 decision tree model in PMML format


Example
Assume that DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator, and that USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;

-- Table types for the procedure signature
DROP TYPE PAL_DATA_T;
CREATE TYPE PAL_DATA_T AS TABLE(
    "REGION" VARCHAR(50),
    "SALESPERIOD" VARCHAR(50),
    "REVENUE" DOUBLE,
    "CLASSLABEL" VARCHAR(50)
);
DROP TYPE PAL_JSONMODEL_T;
CREATE TYPE PAL_JSONMODEL_T AS TABLE(
    "ID" INT,
    "JSONMODEL" VARCHAR(5000)
);
DROP TYPE PAL_PMMLMODEL_T;
CREATE TYPE PAL_PMMLMODEL_T AS TABLE(
    "ID" INT,
    "PMMLMODEL" VARCHAR(5000)
);
DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR(100)
);

-- create procedure
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA(
    "ID" INT,
    "TYPENAME" VARCHAR(100),
    "DIRECTION" VARCHAR(100)
);
INSERT INTO PDATA VALUES (1, 'DM_PAL.PAL_DATA_T', 'in');
INSERT INTO PDATA VALUES (2, 'DM_PAL.PAL_CONTROL_T', 'in');
INSERT INTO PDATA VALUES (3, 'DM_PAL.PAL_JSONMODEL_T', 'out');
INSERT INTO PDATA VALUES (4, 'DM_PAL.PAL_PMMLMODEL_T', 'out');
GRANT SELECT ON DM_PAL.PDATA TO SYSTEM;
CALL SYSTEM.afl_wrapper_generator('PAL_CREATEDT', 'AFLPAL', 'CREATEDT', PDATA);

-- Training data
DROP TABLE PAL_TRAINING_TAB;
CREATE COLUMN TABLE PAL_TRAINING_TAB(
    "REGION" VARCHAR(50),
    "SALESPERIOD" VARCHAR(50),
    "REVENUE" DOUBLE,
    "CLASSLABEL" VARCHAR(50)
);
INSERT INTO PAL_TRAINING_TAB VALUES ('South', 'Winter', 100000, 'Good');
INSERT INTO PAL_TRAINING_TAB VALUES ('North', 'Spring', 45000, 'Average');
INSERT INTO PAL_TRAINING_TAB VALUES ('West', 'Summer', 30000, 'Poor');
INSERT INTO PAL_TRAINING_TAB VALUES ('East', 'Autumn', 5000, 'Poor');
INSERT INTO PAL_TRAINING_TAB VALUES ('West', 'Spring', 5000, 'Poor');
INSERT INTO PAL_TRAINING_TAB VALUES ('East', 'Spring', 200000, 'Good');
INSERT INTO PAL_TRAINING_TAB VALUES ('South', 'Summer', 25000, 'Poor');
INSERT INTO PAL_TRAINING_TAB VALUES ('South', 'Spring', 10000, 'Average');
INSERT INTO PAL_TRAINING_TAB VALUES ('North', 'Winter', 50000, 'Average');

-- Parameter table
DROP TABLE PAL_CONTROL_TAB;
CREATE COLUMN TABLE PAL_CONTROL_TAB(
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);
INSERT INTO PAL_CONTROL_TAB VALUES ('PERCENTAGE', null, 1.0, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER', 2, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('IS_SPLIT_MODEL', 1, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('PMML_EXPORT', 2, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('CONTINUOUS_COL', 2, 25000, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('CONTINUOUS_COL', 2, 60000, null);

-- Output tables and procedure call
DROP TABLE PAL_JSONMODEL_TAB;
CREATE COLUMN TABLE PAL_JSONMODEL_TAB(
    "ID" INT,
    "JSONMODEL" VARCHAR(5000)
);
DROP TABLE PAL_PMMLMODEL_TAB;
CREATE COLUMN TABLE PAL_PMMLMODEL_TAB(
    "ID" INT,
    "PMMLMODEL" VARCHAR(5000)
);
CALL _SYS_AFL.PAL_CREATEDT(PAL_TRAINING_TAB, PAL_CONTROL_TAB, PAL_JSONMODEL_TAB, PAL_PMMLMODEL_TAB) with overview;
SELECT * FROM PAL_JSONMODEL_TAB;
SELECT * FROM PAL_PMMLMODEL_TAB;

Expected Result PAL_JSONMODEL_TAB:

PAL_PMMLMODEL_TAB:


PREDICTWITHDT
This function uses decision trees to perform prediction.

Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>', 'AFLPAL', 'PREDICTWITHDT', <signature table>);
The signature table should contain the following records:

Index   Table Type Name            Direction
1       <Data INPUT table type>    in
2       <PARAMETER table type>     in
3       <Model INPUT table type>   in
4       <OUTPUT table type>        out

Procedure Calling
CALL <procedure name>(<data input table>, <parameter table>, <model input table>, <output table>) with overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature

Input Tables
Table              Column          Column Data Type   Description
Predicted Data     1st column      Integer            ID
                   Other columns   Varchar or char    Data to be classified (predicted)
Predictive Model   1st column      Integer            ID
                   2nd column      Varchar            Serialized tree model

Parameter Table
Name            Data Type   Description
THREAD_NUMBER   Integer     Number of threads

Output Table
Table                 Column       Column Data Type   Description
Result (tree model)   1st column   Integer            ID
                      2nd column   Varchar or char    Predictive result


Example
Assume that DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator, and that USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

-- Note: Before running this example, make sure you have created the tree model using the CREATEDT function.
SET SCHEMA DM_PAL;

-- Table types for the procedure signature
DROP TYPE PAL_DATA_T;
CREATE TYPE PAL_DATA_T AS TABLE(
    "ID" INT,
    "REGION" VARCHAR(50),
    "SALESPERIOD" VARCHAR(50),
    "REVENUE" DOUBLE
);
DROP TYPE PAL_JSONMODEL_T;
CREATE TYPE PAL_JSONMODEL_T AS TABLE(
    "ID" INT,
    "JSONMODEL" VARCHAR(5000)
);
DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);
DROP TYPE PAL_RESULT_T;
CREATE TYPE PAL_RESULT_T AS TABLE(
    "ID" INT,
    "CLASSLABEL" VARCHAR(50)
);

-- create procedure
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA(
    "ID" INT,
    "TYPENAME" VARCHAR(100),
    "DIRECTION" VARCHAR(100)
);
INSERT INTO PDATA VALUES (1, 'DM_PAL.PAL_DATA_T', 'in');
INSERT INTO PDATA VALUES (2, 'DM_PAL.PAL_CONTROL_T', 'in');
INSERT INTO PDATA VALUES (3, 'DM_PAL.PAL_JSONMODEL_T', 'in');
INSERT INTO PDATA VALUES (4, 'DM_PAL.PAL_RESULT_T', 'out');
GRANT SELECT ON DM_PAL.PDATA TO SYSTEM;
CALL SYSTEM.afl_wrapper_generator('PAL_PREDICTWITHDT', 'AFLPAL', 'PREDICTWITHDT', PDATA);

-- Data to classify
DROP TABLE PAL_DATA_TAB;
CREATE COLUMN TABLE PAL_DATA_TAB (
    "ID" INT,
    "REGION" VARCHAR(50),
    "SALESPERIOD" VARCHAR(50),
    "REVENUE" DOUBLE
);
INSERT INTO PAL_DATA_TAB VALUES (0, 'South', 'Autumn', 60000);
INSERT INTO PAL_DATA_TAB VALUES (1, 'North', 'Spring', 30000);
INSERT INTO PAL_DATA_TAB VALUES (2, 'South', 'Summer', 25000);
INSERT INTO PAL_DATA_TAB VALUES (3, 'West', 'Winter', 5000);

-- Parameter table
DROP TABLE PAL_CONTROL_TAB;
CREATE COLUMN TABLE PAL_CONTROL_TAB (
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER', 2, null, null);

-- Output table and procedure call (PAL_JSONMODEL_TAB holds the model created by CREATEDT)
DROP TABLE PAL_RESULT_TAB;
CREATE TABLE PAL_RESULT_TAB(
    "ID" INT,
    "CLASSLABEL" VARCHAR(50)
);
CALL _SYS_AFL.PAL_PREDICTWITHDT(PAL_DATA_TAB, PAL_CONTROL_TAB, PAL_JSONMODEL_TAB, PAL_RESULT_TAB) with overview;
SELECT * FROM PAL_RESULT_TAB;


Expected Result PAL_RESULT_TAB:


3.2.4 CHAID Decision Tree

CHAID stands for CHi-squared Automatic Interaction Detection. Like C4.5, it is a classification method for building decision trees, but it uses chi-square statistics to identify optimal splits. CHAID examines the cross tabulations between each of the input fields and the outcome, and tests for significance using a chi-square independence test. If more than one of these relations is statistically significant, CHAID selects the input field that is the most significant (smallest p-value). CHAID can generate non-binary trees.
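The chi-square independence test on a cross tabulation can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name is hypothetical and the code is not the PAL implementation. It returns the test statistic and the degrees of freedom. Turning these into a p-value requires the chi-square distribution, for example `scipy.stats.chi2.sf(stat, dof)` if SciPy happens to be available.

```python
from collections import Counter


def chi_square(values, labels):
    """Chi-square statistic and degrees of freedom for the cross tabulation
    of an input field (`values`) against the outcome (`labels`)."""
    n = len(labels)
    row_totals = Counter(values)
    col_totals = Counter(labels)
    observed = Counter(zip(values, labels))
    stat = 0.0
    for v in row_totals:
        for y in col_totals:
            # Count expected in this cell if the field and outcome were independent.
            expected = row_totals[v] * col_totals[y] / n
            stat += (observed[(v, y)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# A field that perfectly separates the outcome, and one unrelated to it.
outcome = ["a", "a", "b", "b"]
print(chi_square(["x", "x", "y", "y"], outcome))
print(chi_square(["x", "y", "x", "y"], outcome))
```

A larger statistic for the same degrees of freedom corresponds to a smaller p-value, which is why the first field would be preferred over the second.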

Prerequisites
The column order and column number of the predicted data are the same as the order and number used in tree model building.
The last column of the training data is used as a predicted field and is of discrete type.
The predicted data set has an ID column.
The input data does not contain null values. The algorithm will issue errors when encountering null values.
The table used to store the tree model is a column table.

CREATEDTWITHCHAID
This function creates a decision tree from the input training data.

Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>', 'AFLPAL', 'CREATEDTWITHCHAID', <signature table>);
The signature table should contain the following records:

Index   Table Type Name               Direction
1       <INPUT table type>            in
2       <PARAMETER table type>        in
3       <Result OUTPUT table type>    out
4       <PMML OUTPUT table type>      out

Procedure Calling
CALL <procedure name>(<input table>, <parameter table>, <result output table>, <PMML output table>) with overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.


Signature

Input Table
Table                        Column    Column Data Type                    Description                                     Constraint
Training / Historical Data   Columns   Varchar, char, integer, or double   Table used to build the predictive tree model   Discrete value: integer or varchar/char
                                                                                                                           Continuous value: integer or double

Parameter Table
Name               Data Type           Description
MIN_NUMS_RECORDS   Integer             Controls the minimum number of training records in every leaf node. The default is zero.
PERCENTAGE         Double              The percentage to be applied to determine the input training data set.
IS_SPLIT_MODEL     Integer             Indicates whether the string of the tree model should be split. If the value does not equal zero, the tree model will be split, and the maximum length of each unit is 1k.
THREAD_NUMBER      Integer             Number of threads.
CONTINUOUS_COL     Integer or double   (Optional) Defines which column needs discretization and the interval provided by the user. The column index starts from zero. The integer value specifies the column position. The double value specifies the interval.
PMML_EXPORT        Integer             0 (default): does not export the PMML tree model.
                                       1: exports the PMML tree model in a single row.
                                       2: exports the PMML tree model in several rows, each row containing a maximum of 5000 characters.

Output Tables
Table                 Column       Column Data Type   Description                                             Constraint
Result (tree model)   1st column   Integer            ID
                      2nd column   Varchar or CLOB    Tree model saved as a JSON string in the 2nd column.    The table must be a column table. The maximum length is 5000.
PMML Result           1st column   Integer            ID
                      2nd column   CLOB or varchar    CHAID decision tree model in PMML format


Example
Assume that DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator, and that USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;

-- Table types for the procedure signature
DROP TYPE PAL_DATA_T;
CREATE TYPE PAL_DATA_T AS TABLE(
    "REGION" VARCHAR(50),
    "SALESPERIOD" VARCHAR(50),
    "REVENUE" DOUBLE,
    "CLASSLABEL" VARCHAR(50)
);
DROP TYPE PAL_PMMLMODEL_T;
CREATE TYPE PAL_PMMLMODEL_T AS TABLE(
    "ID" INT,
    "PMMLMODEL" VARCHAR(5000)
);
DROP TYPE PAL_JSONMODEL_T;
CREATE TYPE PAL_JSONMODEL_T AS TABLE(
    "ID" INT,
    "JSONMODEL" VARCHAR(5000)
);
DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);

-- create procedure
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA(
    "ID" INT,
    "TYPENAME" VARCHAR(100),
    "DIRECTION" VARCHAR(100)
);
INSERT INTO PDATA VALUES (1, 'DM_PAL.PAL_DATA_T', 'in');
INSERT INTO PDATA VALUES (2, 'DM_PAL.PAL_CONTROL_T', 'in');
INSERT INTO PDATA VALUES (3, 'DM_PAL.PAL_JSONMODEL_T', 'out');
INSERT INTO PDATA VALUES (4, 'DM_PAL.PAL_PMMLMODEL_T', 'out');
GRANT SELECT ON DM_PAL.PDATA TO SYSTEM;
CALL SYSTEM.afl_wrapper_generator('PAL_CREATEDT_WITH_CHAID', 'AFLPAL', 'CREATEDTWITHCHAID', PDATA);

-- Training data
DROP TABLE PAL_TRAINING_TAB;
CREATE COLUMN TABLE PAL_TRAINING_TAB(
    "REGION" VARCHAR(50),
    "SALESPERIOD" VARCHAR(50),
    "REVENUE" DOUBLE,
    "CLASSLABEL" VARCHAR(50)
);
INSERT INTO PAL_TRAINING_TAB VALUES ('South', 'Winter', 100000, 'Good');
INSERT INTO PAL_TRAINING_TAB VALUES ('North', 'Spring', 45000, 'Average');
INSERT INTO PAL_TRAINING_TAB VALUES ('West', 'Summer', 30000, 'Poor');
INSERT INTO PAL_TRAINING_TAB VALUES ('East', 'Autumn', 5000, 'Poor');
INSERT INTO PAL_TRAINING_TAB VALUES ('West', 'Spring', 5000, 'Poor');
INSERT INTO PAL_TRAINING_TAB VALUES ('East', 'Spring', 200000, 'Good');
INSERT INTO PAL_TRAINING_TAB VALUES ('South', 'Summer', 25000, 'Poor');
INSERT INTO PAL_TRAINING_TAB VALUES ('South', 'Spring', 10000, 'Average');
INSERT INTO PAL_TRAINING_TAB VALUES ('North', 'Winter', 50000, 'Average');

-- Parameter table
DROP TABLE PAL_CONTROL_TAB;
CREATE COLUMN TABLE PAL_CONTROL_TAB(
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);
INSERT INTO PAL_CONTROL_TAB VALUES ('PERCENTAGE', null, 1.0, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER', 2, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('IS_SPLIT_MODEL', 0, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('MIN_NUMS_RECORDS', 1, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('CONTINUOUS_COL', 2, 25000, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('CONTINUOUS_COL', 2, 60000, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('PMML_EXPORT', 2, null, null);

-- Output tables and procedure call
DROP TABLE PAL_JSONMODEL_TAB;
CREATE COLUMN TABLE PAL_JSONMODEL_TAB(
    "ID" INT,
    "JSONMODEL" VARCHAR(5000)
);
DROP TABLE PAL_PMMLMODEL_TAB;
CREATE COLUMN TABLE PAL_PMMLMODEL_TAB(
    "ID" INT,
    "PMMLMODEL" VARCHAR(5000)
);
CALL _SYS_AFL.PAL_CREATEDT_WITH_CHAID(PAL_TRAINING_TAB, PAL_CONTROL_TAB, PAL_JSONMODEL_TAB, PAL_PMMLMODEL_TAB) with overview;
SELECT * FROM PAL_JSONMODEL_TAB;
SELECT * FROM PAL_PMMLMODEL_TAB;

Expected Result PAL_JSONMODEL_TAB:

PAL_PMMLMODEL_TAB:


PREDICTWITHDT
This function uses decision trees to perform prediction.

Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>', 'AFLPAL', 'PREDICTWITHDT', <signature table>);
The signature table should contain the following records:

Index   Table Type Name            Direction
1       <Data INPUT table type>    in
2       <PARAMETER table type>     in
3       <Model INPUT table type>   in
4       <OUTPUT table type>        out

Procedure Calling
CALL <procedure name>(<data input table>, <parameter table>, <model input table>, <output table>) with overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature

Input Tables
Table              Column          Column Data Type   Description
Predicted Data     1st column      Integer            ID
                   Other columns   Varchar or char    Data to be classified (predicted)
Predictive Model   1st column      Integer            ID
                   2nd column      Varchar            Serialized tree model

Parameter Table
Name            Data Type   Description
THREAD_NUMBER   Integer     Number of threads

Output Table
Table    Column       Column Data Type   Description
Result   1st column   Integer            ID
         2nd column   Varchar or char    Predictive result


Example
Assume that DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator, and that USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

-- Note: Before running this example, make sure you have created the tree model using the CREATEDTWITHCHAID function.
SET SCHEMA DM_PAL;

-- Table types for the procedure signature
DROP TYPE PAL_DATA_T;
CREATE TYPE PAL_DATA_T AS TABLE(
    "ID" INT,
    "REGION" VARCHAR(50),
    "SALESPERIOD" VARCHAR(50),
    "REVENUE" DOUBLE
);
DROP TYPE PAL_JSONMODEL_T;
CREATE TYPE PAL_JSONMODEL_T AS TABLE(
    "ID" INT,
    "JSONMODEL" VARCHAR(5000)
);
DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);
DROP TYPE PAL_RESULT_T;
CREATE TYPE PAL_RESULT_T AS TABLE(
    "ID" INT,
    "CLASSLABEL" VARCHAR(50)
);

-- create procedure
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA(
    "ID" INT,
    "TYPENAME" VARCHAR(100),
    "DIRECTION" VARCHAR(100)
);
INSERT INTO PDATA VALUES (1, 'DM_PAL.PAL_DATA_T', 'in');
INSERT INTO PDATA VALUES (2, 'DM_PAL.PAL_CONTROL_T', 'in');
INSERT INTO PDATA VALUES (3, 'DM_PAL.PAL_JSONMODEL_T', 'in');
INSERT INTO PDATA VALUES (4, 'DM_PAL.PAL_RESULT_T', 'out');
GRANT SELECT ON DM_PAL.PDATA TO SYSTEM;
CALL SYSTEM.afl_wrapper_generator('PAL_PREDICTWITHDT', 'AFLPAL', 'PREDICTWITHDT', PDATA);

-- Data to classify
DROP TABLE PAL_DATA_TAB;
CREATE COLUMN TABLE PAL_DATA_TAB (
    "ID" INT,
    "REGION" VARCHAR(50),
    "SALESPERIOD" VARCHAR(50),
    "REVENUE" DOUBLE
);
INSERT INTO PAL_DATA_TAB VALUES (0, 'South', 'Autumn', 60000);
INSERT INTO PAL_DATA_TAB VALUES (1, 'North', 'Spring', 30000);
INSERT INTO PAL_DATA_TAB VALUES (2, 'South', 'Summer', 25000);
INSERT INTO PAL_DATA_TAB VALUES (3, 'West', 'Winter', 5000);

-- Parameter table
DROP TABLE PAL_CONTROL_TAB;
CREATE COLUMN TABLE PAL_CONTROL_TAB (
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER', 2, null, null);

-- Output table and procedure call (PAL_JSONMODEL_TAB holds the model created by CREATEDTWITHCHAID)
DROP TABLE PAL_RESULT_TAB;
CREATE TABLE PAL_RESULT_TAB(
    "ID" INT,
    "CLASSLABEL" VARCHAR(50)
);
CALL _SYS_AFL.PAL_PREDICTWITHDT(PAL_DATA_TAB, PAL_CONTROL_TAB, PAL_JSONMODEL_TAB, PAL_RESULT_TAB) with overview;
SELECT * FROM PAL_RESULT_TAB;


Expected Result PAL_RESULT_TAB:


3.2.5 Exponential Regression

Exponential regression is an approach to modeling the relationship between a scalar variable y and one or more variables denoted X. In exponential regression, data are modeled using exponential functions, and unknown model parameters are estimated from the data. Such models are called exponential models. In PAL, exponential regression is implemented by transforming the model into a linear regression problem and solving that problem:

$y = \beta_0 \exp(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)$

Where $\beta_0, \dots, \beta_n$ are parameters that need to be calculated.

The steps are:
1. Take the natural logarithm of both sides:
   $\ln(y) = \ln(\beta_0 \exp(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n))$
2. Transform it into:
   $\ln(y) = \ln(\beta_0) + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$
3. Let $y' = \ln(y)$ and $\beta_0' = \ln(\beta_0)$:
   $y' = \beta_0' + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$

Thus, $y'$ and $x_1, \dots, x_n$ have a linear relationship that can be solved using the linear regression method.

The implementation also supports calculating the F value and R^2 to determine statistical significance.
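The log-transform approach can be sketched in plain Python. This is an illustration under stated assumptions, not the PAL implementation: the function names are hypothetical, the linear fit uses ordinary least squares via the normal equations, and every y value must be positive so that its logarithm exists.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for r in range(i + 1, n):
            factor = m[r][i] / m[i][i]
            for c in range(i, n + 1):
                m[r][c] -= factor * m[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def fit_exponential(xs, ys):
    """Fit y = b0 * exp(b1*x1 + ... + bn*xn); returns [b0, b1, ..., bn]."""
    rows = [[1.0] + list(x) for x in xs]   # leading 1 for the intercept ln(b0)
    logs = [math.log(y) for y in ys]       # y' = ln(y)
    k = len(rows[0])
    # Normal equations (A^T A) c = A^T y' of the linear least-squares problem.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * t for r, t in zip(rows, logs)) for i in range(k)]
    coef = solve(ata, aty)
    return [math.exp(coef[0])] + coef[1:]  # b0 = exp(b0')


# Synthetic data generated from y = 2 * exp(0.5*x1 - 1.0*x2).
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [2 * math.exp(0.5 * x1 - 1.0 * x2) for x1, x2 in xs]
print(fit_exponential(xs, ys))
```

Because the fit minimizes squared error in ln(y) rather than in y, it weights relative errors, which is a known property of this transformation.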

Prerequisites
No missing or null data in the inputs.
The data is numeric, not categorical.
Given the structure as Y and X1...Xn, there are more than n+1 records available for analysis.


EXPREGRESSION
This is an exponential regression function.

Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>', 'AFLPAL', 'EXPREGRESSION', <signature table>);
The signature table should contain the following records:

Index   Table Type Name                    Direction
1       <INPUT table type>                 in
2       <PARAMETER table type>             in
3       <Result OUTPUT table type>         out
4       <Fitted OUTPUT table type>         out
5       <Significance OUTPUT table type>   out
6       <PMML OUTPUT table type>           out

Procedure Calling
CALL <procedure name>(<input table>, <parameter table>, <result output table>, <fitted output table>, <significance output table>, <PMML output table>) with overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature

Input Table
Table   Column          Column Data Type     Description
Data    1st column      Integer or varchar   ID
        2nd column      Integer or double    Variable y
        Other columns   Integer or double    Variable Xn


Parameter Table
Name            Data Type   Description
THREAD_NUMBER   Integer     Number of threads
PMML_EXPORT     Integer     0 (default): does not export the exponential regression model in PMML.
                            1: exports the exponential regression model in PMML in a single row.
                            2: exports the exponential regression model in PMML in several rows, each row containing a maximum of 5000 characters.

Output Tables
Table          Column       Column Data Type     Description
Result         1st column   Integer              ID
               2nd column   Integer or double    Value Ai
                                                 A0: the intercept
                                                 A1: the beta coefficient for X1
                                                 A2: the beta coefficient for X2
Fitted Data    1st column   Integer or varchar   ID
               2nd column   Integer or double    Value Yi
Significance   1st column   Varchar or char      Name (R^2 / F)
               2nd column   Double               Value
PMML Result    1st column   Integer              ID
               2nd column   CLOB or varchar      Exponential regression model in PMML format


Example
Assume that DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator, and that USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;

-- Table types for the procedure signature
DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE("ID" INT, "Y" DOUBLE, "X1" DOUBLE, "X2" DOUBLE);
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("ID" INT, "Ai" DOUBLE);
DROP TYPE FITTED_T;
CREATE TYPE FITTED_T AS TABLE("ID" INT, "Fitted" DOUBLE);
DROP TYPE SIGNIFICANCE_T;
CREATE TYPE SIGNIFICANCE_T AS TABLE("NAME" VARCHAR(50), "VALUE" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
DROP TYPE MODEL_T;
CREATE TYPE MODEL_T AS TABLE("ID" INT, "Model" VARCHAR(5000));

-- Signature table and procedure generation
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA("ID" INT, "TYPENAME" VARCHAR(100), "DIRECTION" VARCHAR(100));
INSERT INTO PDATA VALUES (1, 'DM_PAL.DATA_T', 'in');
INSERT INTO PDATA VALUES (2, 'DM_PAL.CONTROL_T', 'in');
INSERT INTO PDATA VALUES (3, 'DM_PAL.RESULT_T', 'out');
INSERT INTO PDATA VALUES (4, 'DM_PAL.FITTED_T', 'out');
INSERT INTO PDATA VALUES (5, 'DM_PAL.SIGNIFICANCE_T', 'out');
INSERT INTO PDATA VALUES (6, 'DM_PAL.MODEL_T', 'out');
GRANT SELECT ON DM_PAL.PDATA TO SYSTEM;
CALL SYSTEM.afl_wrapper_generator('palExpR', 'AFLPAL', 'EXPREGRESSION', PDATA);

-- Parameter table
DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER', 8, null, null);
INSERT INTO #CONTROL_TAB VALUES ('PMML_EXPORT', 2, null, null);

-- Input data
DROP TABLE DATA_TAB;
CREATE COLUMN TABLE DATA_TAB ("ID" INT, "Y" DOUBLE, "X1" DOUBLE, "X2" DOUBLE);
INSERT INTO DATA_TAB VALUES (0, 0.5, 0.13, 0.33);
INSERT INTO DATA_TAB VALUES (1, 0.15, 0.14, 0.34);
INSERT INTO DATA_TAB VALUES (2, 0.25, 0.15, 0.36);
INSERT INTO DATA_TAB VALUES (3, 0.35, 0.16, 0.35);
INSERT INTO DATA_TAB VALUES (4, 0.45, 0.17, 0.37);
INSERT INTO DATA_TAB VALUES (5, 0.55, 0.18, 0.38);
INSERT INTO DATA_TAB VALUES (6, 0.65, 0.19, 0.39);
INSERT INTO DATA_TAB VALUES (7, 0.75, 0.19, 0.31);
INSERT INTO DATA_TAB VALUES (8, 0.85, 0.11, 0.32);
INSERT INTO DATA_TAB VALUES (9, 0.95, 0.12, 0.33);

-- Output tables and procedure call
DROP TABLE RESULTS_TAB;
CREATE COLUMN TABLE RESULTS_TAB ("ID" INT, "Ai" DOUBLE);
DROP TABLE FITTED_TAB;
CREATE COLUMN TABLE FITTED_TAB ("ID" INT, "Fitted" DOUBLE);
DROP TABLE SIGNIFICANCE_TAB;
CREATE COLUMN TABLE SIGNIFICANCE_TAB ("NAME" VARCHAR(50), "VALUE" DOUBLE);
DROP TABLE MODEL_TAB;
CREATE COLUMN TABLE MODEL_TAB ("ID" INT, "PMMLMODEL" VARCHAR(5000));
CALL _SYS_AFL.palExpR(DATA_TAB, "#CONTROL_TAB", RESULTS_TAB, FITTED_TAB, SIGNIFICANCE_TAB, MODEL_TAB) with overview;
SELECT * FROM RESULTS_TAB;
SELECT * FROM FITTED_TAB;
SELECT * FROM SIGNIFICANCE_TAB;
SELECT * FROM MODEL_TAB;


Expected Result RESULTS_TAB:

FITTED_TAB:

SIGNIFICANCE_TAB:

MODEL_TAB:


FORECASTWITHEXPR
This function performs prediction with the exponential regression result.

Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>', 'AFLPAL', 'FORECASTWITHEXPR', <signature table>);
The signature table should contain the following records:

Index   Table Type Name                  Direction
1       <Data INPUT table type>          in
2       <Coefficient INPUT table type>   in
3       <PARAMETER table type>           in
4       <OUTPUT table type>              out

Procedure Calling
CALL <procedure name>(<data input table>, <coefficient input table>, <parameter table>, <output table>) with overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature

Input Tables
Table             Column          Column Data Type     Description
Predictive Data   1st column      Integer or varchar   ID
                  Other columns   Integer or double    Variable Xn
Coefficient       1st column      Integer              ID (start from 0)
                  2nd column      Integer or double    Value Ai

Parameter Table
Name            Data Type   Description
THREAD_NUMBER   Integer     Number of threads

Output Table
Table           Column       Column Data Type     Description
Fitted Result   1st column   Integer or varchar   ID
                2nd column   Integer or double    Value Yi


Example
Assume that DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator, and that USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;

-- Table types for the procedure signature
DROP TYPE PREDICT_T;
CREATE TYPE PREDICT_T AS TABLE("ID" INT, "X1" DOUBLE, "X2" DOUBLE);
DROP TYPE COEFFICIENT_T;
CREATE TYPE COEFFICIENT_T AS TABLE("ID" INT, "Ai" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
DROP TYPE FITTED_T;
CREATE TYPE FITTED_T AS TABLE("ID" INT, "Fitted" DOUBLE);

-- Signature table and procedure generation
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA("ID" INT, "TYPENAME" VARCHAR(100), "DIRECTION" VARCHAR(100));
INSERT INTO PDATA VALUES (1, 'DM_PAL.PREDICT_T', 'in');
INSERT INTO PDATA VALUES (2, 'DM_PAL.COEFFICIENT_T', 'in');
INSERT INTO PDATA VALUES (3, 'DM_PAL.CONTROL_T', 'in');
INSERT INTO PDATA VALUES (4, 'DM_PAL.FITTED_T', 'out');
GRANT SELECT ON DM_PAL.PDATA TO SYSTEM;
CALL SYSTEM.afl_wrapper_generator('palForecastWithExpR', 'AFLPAL', 'FORECASTWITHEXPR', PDATA);

-- Parameter table
DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER', 8, null, null);

-- Data to predict
DROP TABLE PREDICTDATA_TAB;
CREATE COLUMN TABLE PREDICTDATA_TAB ("ID" INT, "X1" DOUBLE, "X2" DOUBLE);
INSERT INTO PREDICTDATA_TAB VALUES (0, 0.5, 0.3);
INSERT INTO PREDICTDATA_TAB VALUES (1, 4, 0.4);
INSERT INTO PREDICTDATA_TAB VALUES (2, 0, 1.6);
INSERT INTO PREDICTDATA_TAB VALUES (3, 0.3, 0.45);
INSERT INTO PREDICTDATA_TAB VALUES (4, 0.4, 1.7);

-- Regression coefficients
DROP TABLE COEEFICIENT_TAB;
CREATE COLUMN TABLE COEEFICIENT_TAB ("ID" INT, "Ai" DOUBLE);
INSERT INTO COEEFICIENT_TAB VALUES (0, 1.7120914258645001);
INSERT INTO COEEFICIENT_TAB VALUES (1, 0.2652771198483208);
INSERT INTO COEEFICIENT_TAB VALUES (2, -3.471103742302148);

-- Output table and procedure call
DROP TABLE FITTED_TAB;
CREATE COLUMN TABLE FITTED_TAB ("ID" INT, "Fitted" DOUBLE);
CALL _SYS_AFL.palForecastWithExpR(PREDICTDATA_TAB, COEEFICIENT_TAB, "#CONTROL_TAB", FITTED_TAB) with overview;
SELECT * FROM FITTED_TAB;

Expected Result FITTED_TAB:


3.2.6 KNN

K-Nearest Neighbor (KNN) is a machine learning algorithm for classifying objects based on learning by analogy, that is, comparing a given tuple with similar training tuples. The training tuples are described by n attributes, each tuple representing a point in an n-dimensional space. All the training tuples are stored in an n-dimensional pattern space. Once there is an unknown tuple, the KNN method searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k nearest neighbors of the unknown tuple.
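The neighbor search described above can be sketched in a few lines of Python. This is an illustration only, not PAL's implementation: the function name `knn_classify`, the Euclidean distance measure, and the inverse-distance weights are assumptions made for this sketch, and the sample tuples are illustrative. The `weighted` flag loosely mirrors the two voting types (majority and distance-weighted) listed in the parameter table of this section.

```python
import math
from collections import defaultdict

def knn_classify(training, point, k, weighted=False):
    """Classify `point` from `training`, a list of (class_label, attributes) pairs.

    Assumptions (not taken from PAL): Euclidean distance, and inverse-distance
    weights when `weighted` is True.
    """
    # Distance from the unknown tuple to every training tuple.
    scored = [(math.dist(attrs, point), label) for label, attrs in training]
    # Keep the k training tuples that are closest to the unknown tuple.
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]
    votes = defaultdict(float)
    for distance, label in nearest:
        # Majority voting adds 1 per neighbor; weighted voting favors close ones.
        votes[label] += 1.0 / (distance + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

# Illustrative training tuples: (class type, (X1, X2)).
training = [
    (2, (1, 1)), (3, (10, 10)), (3, (10, 11)), (3, (10, 10)),
    (1, (1000, 1000)), (1, (1000, 1001)), (1, (1000, 999)),
]
print(knn_classify(training, (9, 10), k=3))       # 3
print(knn_classify(training, (1000, 1000), k=3))  # 1
```

With k = 3, the point (9, 10) is surrounded by class 3 tuples, so both voting types return 3 in this sketch.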

Prerequisites
The first column of the training data and input data is an ID column. The second column of the training data is of class type. The class type column is of integer type. Other data columns are of integer or double type. The input data does not contain null values.

KNN

This is a classification function using the KNN algorithm.

Procedure Generation

CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','KNN', <signature table>);

The signature table should contain the following records:

Index  Table Type Name              Direction
1      <Training INPUT table type>  in
2      <Class INPUT table type>     in
3      <PARAMETER table type>       in
4      <OUTPUT table type>          out

Procedure Calling

CALL <procedure name>(<training input table>, <class input table>, <parameter table>, <output table>) with overview;

The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.


Signature

Input Tables

Table          Column         Column Data Type    Description
Training Data  1st column     Integer or varchar  ID
               2nd column     Integer             Class type
               Other columns  Integer or double   Attribute data
Class Data     1st column     Integer or varchar  ID
               Other columns  Integer or double   Attribute data

Parameter Table

Name                  Data Type  Description
K_NEAREST_NEIGHBOURS  Integer    Number of nearest neighbors (k)
ATTRIBUTE_NUM         Integer    Number of attributes
VOTING_TYPE           Integer    Voting type:
                                 0 = majority voting
                                 1 = distance-weighted voting
THREAD_NUMBER         Integer    Number of threads

Output Table

Table   Column      Column Data Type    Description
Result  1st column  Integer or varchar  ID
        2nd column  Integer or double   Class type

Example

Assume that: DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;

DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE("ID" INT, "TYPE" INT, "X1" DOUBLE, "X2" DOUBLE);
DROP TYPE CLASSDATA_T;
CREATE TYPE CLASSDATA_T AS TABLE("ID" INT, "X1" DOUBLE, "X2" DOUBLE);
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("ID" INT, "Type" INT);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));

DROP table PDATA;
CREATE column table PDATA("ID" INT, "TYPENAME" VARCHAR(100), "DIRECTION" VARCHAR(100));
insert into PDATA values (1,'DM_PAL.DATA_T','in');
insert into PDATA values (2,'DM_PAL.CLASSDATA_T','in');
insert into PDATA values (3,'DM_PAL.CONTROL_T','in');
insert into PDATA values (4,'DM_PAL.RESULT_T','out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('palKNN','AFLPAL','KNN',PDATA);

DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
INSERT INTO #CONTROL_TAB VALUES ('K_NEAREST_NEIGHBOURS',3,null,null);
INSERT INTO #CONTROL_TAB VALUES ('ATTRIBUTE_NUM',2,null,null);
INSERT INTO #CONTROL_TAB VALUES ('VOTING_TYPE',0,null,null);
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',8,null,null);

DROP TABLE DATA_TAB;
CREATE COLUMN TABLE DATA_TAB ("ID" INT, "TYPE" INT, "X1" DOUBLE, "X2" DOUBLE);
INSERT INTO DATA_TAB VALUES (0,2,1,1);
INSERT INTO DATA_TAB VALUES (1,3,10,10);
INSERT INTO DATA_TAB VALUES (2,3,10,11);
INSERT INTO DATA_TAB VALUES (3,3,10,10);
INSERT INTO DATA_TAB VALUES (4,1,1000,1000);
INSERT INTO DATA_TAB VALUES (5,1,1000,1001);
INSERT INTO DATA_TAB VALUES (6,1,1000,999);
INSERT INTO DATA_TAB VALUES (7,1,999,999);
INSERT INTO DATA_TAB VALUES (8,1,999,1000);
INSERT INTO DATA_TAB VALUES (9,1,1000,1000);

DROP TABLE CLASSDATA_TAB;
CREATE COLUMN TABLE CLASSDATA_TAB ("ID" INT, "X1" DOUBLE, "X2" DOUBLE);
INSERT INTO CLASSDATA_TAB VALUES (0,2,1);
INSERT INTO CLASSDATA_TAB VALUES (1,9,10);
INSERT INTO CLASSDATA_TAB VALUES (2,9,11);
INSERT INTO CLASSDATA_TAB VALUES (3,15000,15000);
INSERT INTO CLASSDATA_TAB VALUES (4,1000,1000);
INSERT INTO CLASSDATA_TAB VALUES (5,500,1001);
INSERT INTO CLASSDATA_TAB VALUES (6,500,999);
INSERT INTO CLASSDATA_TAB VALUES (7,199,999);

DROP TABLE RESULTS_TAB;
CREATE COLUMN TABLE RESULTS_TAB ("ID" INT, "Type" INT);

CALL _SYS_AFL.palKNN(DATA_TAB, CLASSDATA_TAB, "#CONTROL_TAB", RESULTS_TAB) with overview;

SELECT * FROM RESULTS_TAB;

Expected Result RESULTS_TAB:


3.2.7 Multiple Linear Regression

Linear regression is an approach to modeling the linear relationship between a variable Y, usually referred to as the dependent variable, and one or more variables, usually referred to as independent variables, denoted X1, X2, X3, and so on. In linear regression, data are modeled using linear functions, and unknown model parameters are estimated from the data. Such models are called linear models. According to linear least-squares estimation, linear regression solves the following equation:

$$(A^{T} A)\, x = A^{T} y$$

Where A is an M×N matrix, x is an N×1 matrix, and y is an M×1 matrix.

The implementation also supports calculating F and R^2 to determine statistical significance.
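As a rough illustration of the least-squares equation above, the following Python sketch builds the matrix A with a leading column of ones (so that the first coefficient plays the role of the intercept A0), solves the normal equations by Gaussian elimination, and computes R^2. The function names and the sample data are hypothetical, and the sketch makes no claim about how PAL solves the system internally.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix [matrix | rhs], copied so the inputs stay untouched.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def fit_linear(xs, ys):
    """Return [A0, A1, ..., An] solving the normal equations (A^T A) x = A^T y."""
    # Each row of A is [1, X1, ..., Xn]; the leading 1 yields the intercept A0.
    a = [[1.0] + list(row) for row in xs]
    cols = len(a[0])
    ata = [[sum(r[i] * r[j] for r in a) for j in range(cols)] for i in range(cols)]
    aty = [sum(r[i] * y for r, y in zip(a, ys)) for i in range(cols)]
    return solve_linear_system(ata, aty)

def r_squared(xs, ys, coeffs):
    """Coefficient of determination R^2 = 1 - SS_res / SS_tot."""
    fitted = [coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row)) for row in xs]
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical noise-free data generated from y = 1 + 2*X1 + 3*X2.
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1, 3, 4, 6, 8]
coeffs = fit_linear(xs, ys)
print([round(c, 6) for c in coeffs])
print(round(r_squared(xs, ys, coeffs), 6))
```

Because the sample data is noise-free, the recovered coefficients are close to 1, 2, and 3, and R^2 is close to 1.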

Prerequisites
No missing or null data in the inputs. The data is numeric, not categorical. Given the structure as Y and X1...Xn, there are more than n+1 records available for analysis.

LRREGRESSION

This is a multiple linear regression function.

Procedure Generation

CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','LRREGRESSION', <signature table>);

The signature table should contain the following records:

Index  Table Type Name                   Direction
1      <INPUT table type>                in
2      <PARAMETER table type>            in
3      <Result OUTPUT table type>        out
4      <Fitted OUTPUT table type>        out
5      <Significance OUTPUT table type>  out
6      <PMML OUTPUT table type>          out

Procedure Calling

CALL <procedure name>(<input table>, <parameter table>, <result output table>, <fitted output table>, <significance output table>, <PMML output table>) with overview;

The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.


Signature

Input Table

Table  Column         Column Data Type    Description
Data   1st column     Integer or varchar  ID
       2nd column     Integer or double   Variable y
       Other columns  Integer or double   Variable Xn

Parameter Table

Name           Data Type  Description
VARIABLE_NUM   Integer    Number of variable X.
THREAD_NUMBER  Integer    Number of threads.
PMML_EXPORT    Integer    0 (default): does not export multiple linear regression model in PMML.
                          1: exports multiple linear regression model in PMML in single row.
                          2: exports multiple linear regression model in PMML in several rows, each row containing a maximum of 5000 characters.

Output Tables

Table         Column      Column Data Type    Description                                           Constraint
Result        1st column  Integer             ID
              2nd column  Integer or double   Value Ai                                              A0: the intercept
                                                                                                    A1: the beta coefficient for X1
                                                                                                    A2: the beta coefficient for X2
                                                                                                    ...
Fitted Data   1st column  Integer or varchar  ID
              2nd column  Integer or double   Value Yi
Significance  1st column  Varchar or char     Name (R^2 / F)
              2nd column  Double              Value
PMML Result   1st column  Integer             ID
              2nd column  CLOB or varchar     Multiple linear regression model in PMML format


Example

Assume that: DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;

DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE("ID" INT, "Y" DOUBLE, "X1" DOUBLE, "X2" DOUBLE);
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("ID" INT, "Ai" DOUBLE);
DROP TYPE FITTED_T;
CREATE TYPE FITTED_T AS TABLE("ID" INT, "Fitted" DOUBLE);
DROP TYPE SIGNIFICANCE_T;
CREATE TYPE SIGNIFICANCE_T AS TABLE("NAME" varchar(50), "VALUE" DOUBLE);
DROP TYPE MODEL_T;
CREATE TYPE MODEL_T AS TABLE("ID" INT, "Model" varchar(5000));
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));

DROP table PDATA;
CREATE column table PDATA("ID" INT, "TYPENAME" VARCHAR(100), "DIRECTION" VARCHAR(100));
insert into PDATA values (1,'DM_PAL.DATA_T','in');
insert into PDATA values (2,'DM_PAL.CONTROL_T','in');
insert into PDATA values (3,'DM_PAL.RESULT_T','out');
insert into PDATA values (4,'DM_PAL.FITTED_T','out');
insert into PDATA values (5,'DM_PAL.SIGNIFICANCE_T','out');
insert into PDATA values (6,'DM_PAL.MODEL_T','out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('palLR','AFLPAL','LRREGRESSION',PDATA);

DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',8,null,null);
INSERT INTO #CONTROL_TAB VALUES ('PMML_EXPORT',0,null,null);

DROP TABLE DATA_TAB;
CREATE COLUMN TABLE DATA_TAB ("ID" INT, "Y" DOUBLE, "X1" DOUBLE, "X2" DOUBLE);
INSERT INTO DATA_TAB VALUES (0,0.5,0.13,0.33);
INSERT INTO DATA_TAB VALUES (1,0.15,0.14,0.34);
INSERT INTO DATA_TAB VALUES (2,0.25,0.15,0.36);
INSERT INTO DATA_TAB VALUES (3,0.35,0.16,0.35);
INSERT INTO DATA_TAB VALUES (4,0.45,0.17,0.37);
INSERT INTO DATA_TAB VALUES (5,0.55,0.18,0.38);
INSERT INTO DATA_TAB VALUES (6,0.65,0.19,0.39);
INSERT INTO DATA_TAB VALUES (7,0.75,0.19,0.31);
INSERT INTO DATA_TAB VALUES (8,0.85,0.11,0.32);
INSERT INTO DATA_TAB VALUES (9,0.95,0.12,0.33);

DROP TABLE RESULTS_TAB;
CREATE COLUMN TABLE RESULTS_TAB ("ID" INT, "Ai" DOUBLE);
DROP TABLE FITTED_TAB;
CREATE COLUMN TABLE FITTED_TAB ("ID" INT, "Fitted" DOUBLE);
DROP TABLE SIGNIFICANCE_TAB;
CREATE COLUMN TABLE SIGNIFICANCE_TAB ("NAME" varchar(50), "VALUE" DOUBLE);
DROP TABLE PAL_PMMLMODEL_TAB;
CREATE COLUMN TABLE PAL_PMMLMODEL_TAB ("ID" INT, "PMMLMODEL" VARCHAR(5000));

CALL _SYS_AFL.palLR(DATA_TAB, "#CONTROL_TAB", RESULTS_TAB, FITTED_TAB, SIGNIFICANCE_TAB, PAL_PMMLMODEL_TAB) with overview;

SELECT * FROM RESULTS_TAB;
SELECT * FROM FITTED_TAB;
SELECT * FROM SIGNIFICANCE_TAB;


Expected Result RESULTS_TAB:

FITTED_TAB:

SIGNIFICANCE_TAB:

PAL_PMMLMODEL_TAB:


FORECASTWITHLR

This function performs prediction with the linear regression result.

Procedure Generation

CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','FORECASTWITHLR', <signature table>);

The signature table should contain the following records:

Index  Table Type Name                 Direction
1      <Data INPUT table type>         in
2      <Coefficient INPUT table type>  in
3      <PARAMETER table type>          in
4      <OUTPUT table type>             out

Procedure Calling

CALL <procedure name>(<data input table>, <coefficient input table>, <parameter table>, <output table>) with overview;

The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature

Input Tables

Table            Column         Column Data Type    Description
Predictive Data  1st column     Integer or varchar  ID
                 Other columns  Integer or double   Variable Xn
Coefficient      1st column     Integer             ID (start from 0)
                 2nd column     Integer or double   Value Ai

Parameter Table

Name           Data Type  Description
VARIABLE_NUM   Integer    Number of variable X
THREAD_NUMBER  Integer    Number of threads

Output Table

Table          Column      Column Data Type    Description
Fitted Result  1st column  Integer or varchar  ID
               2nd column  Integer or double   Value Yi


Example

Assume that: DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;

DROP TYPE PREDICT_T;
CREATE TYPE PREDICT_T AS TABLE("ID" INT, "X1" DOUBLE, "X2" DOUBLE);
DROP TYPE COEFFICIENT_T;
CREATE TYPE COEFFICIENT_T AS TABLE("ID" INT, "Ai" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
DROP TYPE FITTED_T;
CREATE TYPE FITTED_T AS TABLE("ID" INT, "Fitted" DOUBLE);

DROP table PDATA;
CREATE column table PDATA("ID" INT, "TYPENAME" VARCHAR(100), "DIRECTION" VARCHAR(100));
insert into PDATA values (1,'DM_PAL.PREDICT_T','in');
insert into PDATA values (2,'DM_PAL.COEFFICIENT_T','in');
insert into PDATA values (3,'DM_PAL.CONTROL_T','in');
insert into PDATA values (4,'DM_PAL.FITTED_T','out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('palForecastWithLR','AFLPAL','FORECASTWITHLR',PDATA);

DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',2,null,null);

DROP TABLE PREDICTDATA_TAB;
CREATE COLUMN TABLE PREDICTDATA_TAB ("ID" INT, "X1" DOUBLE, "X2" DOUBLE);
INSERT INTO PREDICTDATA_TAB VALUES (0,0.5,0.3);
INSERT INTO PREDICTDATA_TAB VALUES (1,4,0.4);
INSERT INTO PREDICTDATA_TAB VALUES (2,0,1.6);
INSERT INTO PREDICTDATA_TAB VALUES (3,0.3,0.45);
INSERT INTO PREDICTDATA_TAB VALUES (4,0.4,1.7);

DROP TABLE COEEFICIENT_TAB;
CREATE COLUMN TABLE COEEFICIENT_TAB ("ID" INT, "Ai" DOUBLE);
INSERT INTO COEEFICIENT_TAB VALUES (0,1.7120914258645001);
INSERT INTO COEEFICIENT_TAB VALUES (1,0.2652771198483208);
INSERT INTO COEEFICIENT_TAB VALUES (2,-3.471103742302148);

DROP TABLE FITTED_TAB;
CREATE COLUMN TABLE FITTED_TAB ("ID" INT, "Fitted" DOUBLE);

CALL _SYS_AFL.palForecastWithLR(PREDICTDATA_TAB, COEEFICIENT_TAB, "#CONTROL_TAB", FITTED_TAB) with overview;

SELECT * FROM FITTED_TAB;

Expected Result FITTED_TAB:


3.2.8 Polynomial Regression

Polynomial regression is an approach to modeling the relationship between a scalar variable y and one or more variables denoted X. In polynomial regression, data are modeled using polynomial functions, and unknown model parameters are estimated from the data. Such models are called polynomial models. In PAL, the implementation of polynomial regression is to transform the problem to linear regression and solve it:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n$$

Where $\beta_0, \dots, \beta_n$ are parameters that need to be calculated.

Let $x_1 = x,\ x_2 = x^2,\ \dots,\ x_n = x^n$, and then

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

So, the relationship between y and $x_1, \dots, x_n$ is linear and can be solved using the linear regression method.

The implementation also supports calculating the F value and R^2 to determine statistical significance.
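The transformation to linear regression can be sketched as follows. This is a minimal Python illustration, assuming an ordinary least-squares solve of the normal equations; the function names `expand_powers`, `solve`, and `fit_polynomial` and the sample data are hypothetical and do not describe PAL's internal implementation.

```python
def expand_powers(x, degree):
    """Substitution x1 = x, x2 = x^2, ..., xn = x^n for one observation."""
    return [x ** p for p in range(1, degree + 1)]

def solve(matrix, rhs):
    """Gauss-Jordan elimination with partial pivoting for a small square system."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    # The matrix part is now diagonal, so each unknown is a simple quotient.
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_polynomial(xs, ys, degree):
    """Return [b0, b1, ..., bn] by linear least squares on the expanded inputs."""
    # After the substitution, each row is [1, x1, ..., xn], as in linear regression.
    rows = [[1.0] + expand_powers(x, degree) for x in xs]
    size = degree + 1
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(size)] for i in range(size)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(size)]
    return solve(ata, aty)

# Hypothetical noise-free data generated from y = 4 + 3x + 2x^2 + x^3.
xs = [0, 1, 2, 3, 4, 5]
ys = [4 + 3 * x + 2 * x ** 2 + x ** 3 for x in xs]
coeffs = fit_polynomial(xs, ys, degree=3)
print([round(c, 4) for c in coeffs])
```

The only polynomial-specific step is `expand_powers`; everything after it is the same least-squares solve that a multiple linear regression with n variables would use.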

Prerequisites
No missing or null data in the inputs. The data is numeric, not categorical. Given the structure as Y and X1...Xn, there are more than n+1 records available for analysis.


POLYNOMIALREGRESSION

This is a polynomial regression function.

Procedure Generation

CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL', 'POLYNOMIALREGRESSION', <signature table>);

The signature table should contain the following records:

Index  Table Type Name                   Direction
1      <INPUT table type>                in
2      <PARAMETER table type>            in
3      <Result OUTPUT table type>        out
4      <Fitted OUTPUT table type>        out
5      <Significance OUTPUT table type>  out
6      <PMML OUTPUT table type>          out

Procedure Calling

CALL <procedure name>(<input table>, <parameter table>, <result output table>, <fitted output table>, <significance output table>, <PMML output table>) with overview;

The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature

Input Table

Table  Column      Column Data Type    Description
Data   1st column  Integer or varchar  ID
       2nd column  Integer or double   Variable y
       3rd column  Integer or double   Variable X


Parameter Table

Name           Data Type  Description
THREAD_NUMBER  Integer    Number of threads.
PMML_EXPORT    Integer    0 (default): does not export polynomial regression model in PMML.
                          1: exports polynomial regression model in PMML in single row.
                          2: exports polynomial regression model in PMML in several rows, each row containing a maximum of 5000 characters.

Output Tables

Table         Column      Column Data Type    Description                                    Constraint
Result        1st column  Integer             ID
              2nd column  Integer or double   Value Ai                                       A0: the intercept
                                                                                             A1: the beta coefficient for X1
                                                                                             A2: the beta coefficient for X2
                                                                                             ...
Fitted Data   1st column  Integer or varchar  ID
              2nd column  Integer or double   Value Yi
Significance  1st column  Varchar or char     Name (R^2 / F)
              2nd column  Double              Value
PMML Result   1st column  Integer             ID
              2nd column  CLOB or varchar     Polynomial regression model in PMML format


Example

Assume that: DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;

DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE("ID" INT, "Y" DOUBLE, "X1" DOUBLE);
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("ID" INT, "Ai" DOUBLE);
DROP TYPE FITTED_T;
CREATE TYPE FITTED_T AS TABLE("ID" INT, "Fitted" DOUBLE);
DROP TYPE SIGNIFICANCE_T;
CREATE TYPE SIGNIFICANCE_T AS TABLE("NAME" varchar(50), "VALUE" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
DROP TYPE MODEL_T;
CREATE TYPE MODEL_T AS TABLE("ID" INT, "Model" varchar(5000));

DROP table PDATA;
CREATE column table PDATA("ID" INT, "TYPENAME" VARCHAR(100), "DIRECTION" VARCHAR(100));
insert into PDATA values (1,'DM_PAL.DATA_T','in');
insert into PDATA values (2,'DM_PAL.CONTROL_T','in');
insert into PDATA values (3,'DM_PAL.RESULT_T','out');
insert into PDATA values (4,'DM_PAL.FITTED_T','out');
insert into PDATA values (5,'DM_PAL.SIGNIFICANCE_T','out');
insert into PDATA values (6,'DM_PAL.MODEL_T','out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('palPolynomialR','AFLPAL','POLYNOMIALREGRESSION',PDATA);

DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
INSERT INTO #CONTROL_TAB VALUES ('VARIABLE_NUM',3,null,null);
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',8,null,null);
INSERT INTO #CONTROL_TAB VALUES ('PMML_EXPORT',2,null,null);

DROP TABLE DATA_TAB;
CREATE COLUMN TABLE DATA_TAB ("ID" INT, "Y" DOUBLE, "X1" DOUBLE);
INSERT INTO DATA_TAB VALUES (0,5,1);
INSERT INTO DATA_TAB VALUES (1,20,2);
INSERT INTO DATA_TAB VALUES (2,43,3);
INSERT INTO DATA_TAB VALUES (3,89,4);
INSERT INTO DATA_TAB VALUES (4,166,5);
INSERT INTO DATA_TAB VALUES (5,247,6);
INSERT INTO DATA_TAB VALUES (6,403,7);

DROP TABLE RESULTS_TAB;
CREATE COLUMN TABLE RESULTS_TAB ("ID" INT, "Ai" DOUBLE);
DROP TABLE FITTED_TAB;
CREATE COLUMN TABLE FITTED_TAB ("ID" INT, "Fitted" DOUBLE);
DROP TABLE SIGNIFICANCE_TAB;
CREATE COLUMN TABLE SIGNIFICANCE_TAB ("NAME" varchar(50), "VALUE" DOUBLE);
DROP TABLE MODEL_TAB;
CREATE COLUMN TABLE MODEL_TAB ("ID" INT, "PMMLMODEL" VARCHAR(5000));

CALL _SYS_AFL.palPolynomialR(DATA_TAB, "#CONTROL_TAB", RESULTS_TAB, FITTED_TAB, SIGNIFICANCE_TAB, MODEL_TAB) with overview;

SELECT * FROM RESULTS_TAB;
SELECT * FROM FITTED_TAB;
SELECT * FROM SIGNIFICANCE_TAB;
SELECT * FROM MODEL_TAB;


Expected Result RESULTS_TAB:

FITTED_TAB:

SIGNIFICANCE_TAB:

MODEL_TAB:


FORECASTWITHPOLYNOMIALR

This function performs prediction with the polynomial regression result.

Procedure Generation

CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL', 'FORECASTWITHPOLYNOMIALR', <signature table>);

The signature table should contain the following records:

Index  Table Type Name                 Direction
1      <Data INPUT table type>         in
2      <Coefficient INPUT table type>  in
3      <PARAMETER table type>          in
4      <OUTPUT table type>             out

Procedure Calling

CALL <procedure name>(<data input table>, <coefficient input table>, <parameter table>, <output table>) with overview;

The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature

Input Tables

Table            Column      Column Data Type    Description
Predictive Data  1st column  Integer or varchar  ID
                 2nd column  Integer or double   Variable X
Coefficient      1st column  Integer             ID (start from 0)
                 2nd column  Integer or double   Value Ai

Parameter Table

Name           Data Type  Description
THREAD_NUMBER  Integer    Number of threads

Output Table

Table          Column      Column Data Type    Description
Fitted Result  1st column  Integer or varchar  ID
               2nd column  Integer or double   Value Yi
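The following Python sketch shows how a fitted value could be derived from the coefficient table, assuming that the forecast evaluates the polynomial A0 + A1·x + ... + An·x^n from the model equation in this section, with the coefficient ID used as the power of x. That reading is an assumption, the function name `forecast_polynomial` is hypothetical, and the printed numbers are not PAL's documented expected result.

```python
def forecast_polynomial(coefficients, x):
    """Evaluate A0 + A1*x + ... + An*x^n, where coefficients[i] is Ai (ID = i)."""
    result = 0.0
    # Horner's rule: start from the highest-order coefficient and fold in x.
    for a in reversed(coefficients):
        result = result * x + a
    return result

# Illustrative coefficient values, ordered by ID starting from 0.
coefficients = [4.0, 3.0, 2.0, 1.0]
for x in (0.3, 4.0, 1.6):
    print(x, forecast_polynomial(coefficients, x))
```

Horner's rule is used here only to avoid computing each power separately; a direct sum of Ai·x^i gives the same value.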


Example

Assume that: DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;

DROP TYPE PREDICT_T;
CREATE TYPE PREDICT_T AS TABLE("ID" INT, "X1" DOUBLE);
DROP TYPE COEFFICIENT_T;
CREATE TYPE COEFFICIENT_T AS TABLE("ID" INT, "Ai" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
DROP TYPE FITTED_T;
CREATE TYPE FITTED_T AS TABLE("ID" INT, "Fitted" DOUBLE);

DROP table PDATA;
CREATE column table PDATA("ID" INT, "TYPENAME" VARCHAR(100), "DIRECTION" VARCHAR(100));
insert into PDATA values (1,'DM_PAL.PREDICT_T','in');
insert into PDATA values (2,'DM_PAL.COEFFICIENT_T','in');
insert into PDATA values (3,'DM_PAL.CONTROL_T','in');
insert into PDATA values (4,'DM_PAL.FITTED_T','out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('palForecastWithPolynomialR','AFLPAL','FORECASTWITHPOLYNOMIALR',PDATA);

DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
INSERT INTO #CONTROL_TAB VALUES ('VARIABLE_NUM',3,null,null);
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',8,null,null);

DROP TABLE PREDICTDATA_TAB;
CREATE COLUMN TABLE PREDICTDATA_TAB ("ID" INT, "X1" DOUBLE);
INSERT INTO PREDICTDATA_TAB VALUES (0,0.3);
INSERT INTO PREDICTDATA_TAB VALUES (1,4.0);
INSERT INTO PREDICTDATA_TAB VALUES (2,1.6);
INSERT INTO PREDICTDATA_TAB VALUES (3,0.45);
INSERT INTO PREDICTDATA_TAB VALUES (4,1.7);

DROP TABLE COEEFICIENT_TAB;
CREATE COLUMN TABLE COEEFICIENT_TAB ("ID" INT, "Ai" DOUBLE);
INSERT INTO COEEFICIENT_TAB VALUES (0,4.0);
INSERT INTO COEEFICIENT_TAB VALUES (1,3.0);
INSERT INTO COEEFICIENT_TAB VALUES (2,2.0);
INSERT INTO COEEFICIENT_TAB VALUES (3,1.0);

DROP TABLE FITTED_TAB;
CREATE COLUMN TABLE FITTED_TAB ("ID" INT, "Fitted" DOUBLE);

CALL _SYS_AFL.palForecastWithPolynomialR(PREDICTDATA_TAB, COEEFICIENT_TAB, "#CONTROL_TAB", FITTED_TAB) with overview;

SELECT * FROM FITTED_TAB;

Expected Result FITTED_TAB:


3.2.9 Logistic Regression

Logistic regression is a prediction approach similar to Ordinary Least Squares (OLS) regression, but logistic regression can be used to predict a dichotomous outcome. Logistic regression allows you to predict a discrete outcome, such as group membership, from a set of variables that are continuous, discrete, dichotomous, or a mix of any of these. Generally, the dependent or response variable is dichotomous, such as presence/absence or success/failure. Discriminant analysis is also used to predict group membership with only two groups, but only from continuous independent variables. Thus, when independent variables are categorical, or a mix of continuous and categorical, logistic regression is preferred. A simple logistic regression function can be defined by the formula:

$$p(t) = \frac{1}{1 + e^{-t}}$$

In PAL, the logistic regression model is made by:

$$h_\theta(x) = g(\theta^{T} x) = \frac{1}{1 + \exp(-\theta^{T} x)}$$

Where $\theta^{T} x = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$.

Assuming that there are only two class labels, {0,1}, you get the following formulas:

$$P(y = 1 \mid x; \theta) = h_\theta(x)$$

$$P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

And merge them into:

$$P(y \mid x; \theta) = h_\theta(x)^{y} \, (1 - h_\theta(x))^{1 - y}$$

Where $\theta_0, \theta_1, \dots, \theta_n$ are regression coefficients, and their values can be obtained through the Maximum Likelihood Estimation (MLE) method. The log likelihood function is:

$$\begin{aligned}
\ell(\theta) = \ln L(\theta) &= \ln \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}; \theta) \\
&= \ln \prod_{i=1}^{m} \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}} \\
&= \sum_{i=1}^{m} \ln\left( \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}} \right) \\
&= \sum_{i=1}^{m} \left( y^{(i)} \ln h_\theta(x^{(i)}) + (1 - y^{(i)}) \ln\left(1 - h_\theta(x^{(i)})\right) \right)
\end{aligned}$$

The maximum value of the function can be obtained through the method of Newton iteration or Stochastic Gradient Ascent. You choose an exit threshold and then iterate the formula until the difference between the new value and the old value is smaller than this threshold. The formula is:

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

Where $\alpha$ is the step size.
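The update rule can be sketched in Python as follows. This is a minimal stochastic gradient ascent loop for illustration, not PAL's implementation (for which Newton iteration is the recommended method). The function name `fit_logistic`, its default argument values, and the sample data are assumptions; the arguments only loosely mirror the STEP_SIZE, EXIT_THRESHOLD, and MAX_ITERATION parameters of the LOGISTICREGRESSION function.

```python
import math

def sigmoid(t):
    """Logistic function 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(rows, labels, step_size=0.1, exit_threshold=1e-6, max_iteration=10000):
    """Stochastic gradient ascent on the log likelihood.

    rows: list of attribute tuples (x1, ..., xn); labels: 0 or 1.
    Returns [theta0, theta1, ..., thetan], with x0 = 1 for the intercept.
    """
    theta = [0.0] * (len(rows[0]) + 1)
    for _ in range(max_iteration):
        previous = list(theta)
        for row, y in zip(rows, labels):
            x = [1.0] + list(row)
            error = y - sigmoid(sum(t * v for t, v in zip(theta, x)))
            # theta_j := theta_j + step_size * (y - h(x)) * x_j
            theta = [t + step_size * error * v for t, v in zip(theta, x)]
        # Exit when no coefficient moved by more than the threshold in one pass.
        if max(abs(t - p) for t, p in zip(theta, previous)) < exit_threshold:
            break
    return theta

# Hypothetical one-variable data with overlapping classes.
rows = [(0.5,), (1.0,), (1.5,), (2.0,), (2.5,), (3.0,), (3.5,), (4.0,)]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
theta = fit_logistic(rows, labels)
print(theta)
print(sigmoid(theta[0] + theta[1] * 4.0))  # probability of class 1 at x = 4.0
```

The sample classes overlap on purpose: with perfectly separable data the maximum likelihood coefficients grow without bound, so the loop would stop only at `max_iteration`.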
Prerequisites
No missing or null data in inputs. Data is numeric, not categorical. Given the structure as Y and X1...Xn, there must be more than n+1 records available for analysis.


LOGISTICREGRESSION

This is a logistic regression function.

Procedure Generation

CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL', 'LOGISTICREGRESSION', <signature table>);

The signature table should contain the following records:

Index  Table Type Name             Direction
1      <INPUT table type>          in
2      <PARAMETER table type>      in
3      <Result OUTPUT table type>  out
4      <PMML OUTPUT table type>    out

Procedure Calling

CALL <procedure name>(<input table>, <parameter table>, <result output table>, <PMML output table>) with overview;

The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature

Input Table

Table  Column       Column Data Type   Description    Constraint
Data   Columns      Integer or double  Variable Xn
       Type column  Integer            Variable TYPE  Only 0 and 1 are supported

Parameter Table

Name            Data Type  Description
VARIABLE_NUM    Integer    Number of variable X.
METHOD          Integer    0 (recommended): uses the Newton iteration method.
                           1: uses the gradient-descent method.
STEP_SIZE       Double     Step size for convergence. This parameter is used only when METHOD is 1.
EXIT_THRESHOLD  Double     Threshold (actual value) for exiting the iterations.
THREAD_NUMBER   Integer    Number of threads.
                           Note: It is recommended to set this parameter to a value equal to or greater than 4.
MAX_ITERATION   Integer    Maximum number of iterations.
PMML_EXPORT     Integer    0 (default): does not export logistic regression model in PMML.
                           1: exports logistic regression model in PMML in single row.
                           2: exports logistic regression model in PMML in several rows, each row containing a maximum of 5000 characters.

Output Tables

Table                                    Column      Column Data Type   Description
Result                                   1st column  Integer            ID
                                         2nd column  Integer or double  Value Ai
                                                                        A0: intercept
                                                                        A1: beta coefficient for X1
                                                                        A2: beta coefficient for X2
PMML Result (logistic regression model)  1st column  Integer            ID
                                         2nd column  CLOB or varchar    Logistic regression model in PMML format


Example Assume that: DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role. SET SCHEMA DM_PAL; DROP TYPE DATA_T; CREATE TYPE DATA_T AS TABLE("X1" DOUBLE,"X2" DOUBLE,"TYPE" INT); DROP TYPE RESULT_T; CREATE TYPE RESULT_T AS TABLE("ID" INT,"Ai" DOUBLE); DROP TYPE CONTROL_T; CREATE TYPE CONTROL_T AS TABLE( "Name" VARCHAR (50),"intArgs" INTEGER,"doubleArgs" DOUBLE,"stringArgs" VARCHAR (100)); DROP TYPE PAL_PMMLMODEL_T; CREATE TYPE PAL_PMMLMODEL_T AS TABLE( "ID" INT, "PMMLMODEL" VARCHAR(5000) ); DROP table PDATA; CREATE column table PDATA("ID" INT,"TYPENAME" VARCHAR(100),"DIRECTION" VARCHAR(100)); insert into PDATA values (1,'DM_PAL.DATA_T','in'); insert into PDATA values (2,'DM_PAL.CONTROL_T','in'); insert into PDATA values (3,'DM_PAL.RESULT_T','out'); insert into PDATA values (4,'DM_PAL.PAL_PMMLMODEL_T','out'); GRANT SELECT ON DM_PAL.PDATA to SYSTEM; call SYSTEM.afl_wrapper_generator('palLogisticR','AFLPAL','LOGISTICREGRESSION', PDATA); DROP TABLE DATA_TAB; CREATE COLUMN TABLE DATA_TAB ("X1" DOUBLE,"X2"DOUBLE,"TYPE" INT); INSERT INTO DATA_TAB VALUES (110,2.62,1); INSERT INTO DATA_TAB VALUES (110,2.875,1); INSERT INTO DATA_TAB VALUES (93,2.32,1); INSERT INTO DATA_TAB VALUES (110,3.215,0);

SAP AG 2013

102

SAP HANA Predictive Analysis Library (PAL) Reference

INSERT INTO DATA_TAB VALUES (175,3.44,0); INSERT INTO DATA_TAB VALUES (105,3.46,0); INSERT INTO DATA_TAB VALUES (245,3.57,0); INSERT INTO DATA_TAB VALUES (62,3.19,0); INSERT INTO DATA_TAB VALUES (95,3.15,0); INSERT INTO DATA_TAB VALUES (123,3.44,0); INSERT INTO DATA_TAB VALUES (123,3.44,0); INSERT INTO DATA_TAB VALUES (180,4.07,0); INSERT INTO DATA_TAB VALUES (180,3.73,0); INSERT INTO DATA_TAB VALUES (180,3.78,0); INSERT INTO DATA_TAB VALUES (205,5.25,0); INSERT INTO DATA_TAB VALUES (215,5.424,0); INSERT INTO DATA_TAB VALUES (230,5.345,0); INSERT INTO DATA_TAB VALUES (66,2.2,1); INSERT INTO DATA_TAB VALUES (52,1.615,1); INSERT INTO DATA_TAB VALUES (65,1.835,1); INSERT INTO DATA_TAB VALUES (97,2.465,0); INSERT INTO DATA_TAB VALUES (150,3.52,0); INSERT INTO DATA_TAB VALUES (150,3.435,0); INSERT INTO DATA_TAB VALUES (245,3.84,0); INSERT INTO DATA_TAB VALUES (175,3.845,0); INSERT INTO DATA_TAB VALUES (66,1.935,1); INSERT INTO DATA_TAB VALUES (91,2.14,1); INSERT INTO DATA_TAB VALUES (113,1.513,1); INSERT INTO DATA_TAB VALUES (264,3.17,1); INSERT INTO DATA_TAB VALUES (175,2.77,1); INSERT INTO DATA_TAB VALUES (335,3.57,1); INSERT INTO DATA_TAB VALUES (109,2.78,1); DROP TABLE #CONTROL_TAB; CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ( "Name" VARCHAR (50), "intArgs" INTEGER, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100)); INSERT INTO #CONTROL_TAB VALUES ('VARIABLE_NUM',2,null,null); INSERT INTO #CONTROL_TAB VALUES ('EXIT_THRESHOLD',null,0.00001,null); INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',8,null,null); INSERT INTO #CONTROL_TAB VALUES ('MAX_ITERATION',80,null,null); INSERT INTO #CONTROL_TAB VALUES ('PMML_EXPORT', 1, null, null); INSERT INTO #CONTROL_TAB VALUES ('METHOD', 0, null, null);

DROP TABLE RESULTS_TAB;
CREATE COLUMN TABLE RESULTS_TAB ("ID" INT, "Ai" DOUBLE);
DROP TABLE PAL_PMMLMODEL_TAB;
CREATE COLUMN TABLE PAL_PMMLMODEL_TAB ("ID" INT, "PMMLMODEL" VARCHAR(5000));
CALL _SYS_AFL.palLogisticR(DATA_TAB, "#CONTROL_TAB", RESULTS_TAB, PAL_PMMLMODEL_TAB) WITH OVERVIEW;
SELECT * FROM RESULTS_TAB;
SELECT * FROM PAL_PMMLMODEL_TAB;

Expected Result
RESULTS_TAB:

PAL_PMMLMODEL_TAB:

FORECASTWITHLOGISTICR
This function performs prediction using the result of a logistic regression.

Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','FORECASTWITHLOGISTICR', <signature table>);
The signature table should contain the following records:

Index   Table Type Name                  Direction
1       <Data INPUT table type>          in
2       <PARAMETER table type>           in
3       <Coefficient INPUT table type>   in
4       <OUTPUT table type>              out

Procedure Calling
CALL <procedure name>(<data input table>, <parameter table>, <coefficient input table>, <output table>) with overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature
Input Table

Table             Column          Column Data Type    Description
Predictive Data   1st column      Integer             ID
                  Other columns   Integer or double   Variable Xn
Coefficient       1st column      Integer             ID
                  2nd column      Integer or double   Value Ai

Parameter Table

Name            Data Type   Description
VARIABLE_NUM    Integer     Number of variables X
THREAD_NUMBER   Integer     Number of threads

Output Table

Table           Column       Column Data Type    Description
Fitted Result   1st column   Integer             ID
                2nd column   Integer or double   Value Yi
                3rd column   Integer             TYPE

Example
Assume that: DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;
DROP TYPE PREDICT_T;
CREATE TYPE PREDICT_T AS TABLE("ID" INT, "X1" DOUBLE, "X2" DOUBLE);
DROP TYPE COEFFICIENT_T;
CREATE TYPE COEFFICIENT_T AS TABLE("ID" INT, "Ai" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
DROP TYPE FITTED_T;
CREATE TYPE FITTED_T AS TABLE("ID" INT, "Fitted" DOUBLE, "TYPE" INT);
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA("ID" INT, "TYPENAME" VARCHAR(100), "DIRECTION" VARCHAR(100));
INSERT INTO PDATA VALUES (1,'DM_PAL.PREDICT_T','in');
INSERT INTO PDATA VALUES (2,'DM_PAL.CONTROL_T','in');
INSERT INTO PDATA VALUES (3,'DM_PAL.COEFFICIENT_T','in');
INSERT INTO PDATA VALUES (4,'DM_PAL.FITTED_T','out');
GRANT SELECT ON DM_PAL.PDATA TO SYSTEM;
CALL SYSTEM.afl_wrapper_generator('palForecastWithLogisticR','AFLPAL','FORECASTWITHLOGISTICR',PDATA);
DROP TABLE PREDICTDATA_TAB;
CREATE COLUMN TABLE PREDICTDATA_TAB ("ID" INT, "X1" DOUBLE, "X2" DOUBLE);
INSERT INTO PREDICTDATA_TAB VALUES (0,120,2.8);
INSERT INTO PREDICTDATA_TAB VALUES (1,110,2.875);

INSERT INTO PREDICTDATA_TAB VALUES (2,93,2.32);
INSERT INTO PREDICTDATA_TAB VALUES (3,110,3.215);
INSERT INTO PREDICTDATA_TAB VALUES (4,175,3.44);
INSERT INTO PREDICTDATA_TAB VALUES (5,105,3.46);
INSERT INTO PREDICTDATA_TAB VALUES (6,245,3.57);
INSERT INTO PREDICTDATA_TAB VALUES (7,62,3.19);
INSERT INTO PREDICTDATA_TAB VALUES (8,95,3.15);
INSERT INTO PREDICTDATA_TAB VALUES (9,123,3.44);
INSERT INTO PREDICTDATA_TAB VALUES (10,123,3.44);
INSERT INTO PREDICTDATA_TAB VALUES (11,180,4.07);
INSERT INTO PREDICTDATA_TAB VALUES (12,180,3.73);
INSERT INTO PREDICTDATA_TAB VALUES (13,180,3.78);
INSERT INTO PREDICTDATA_TAB VALUES (14,205,5.25);
INSERT INTO PREDICTDATA_TAB VALUES (15,215,5.424);
INSERT INTO PREDICTDATA_TAB VALUES (16,230,5.345);
INSERT INTO PREDICTDATA_TAB VALUES (17,66,2.2);
INSERT INTO PREDICTDATA_TAB VALUES (18,52,1.615);
INSERT INTO PREDICTDATA_TAB VALUES (19,65,1.835);
INSERT INTO PREDICTDATA_TAB VALUES (20,97,2.465);
INSERT INTO PREDICTDATA_TAB VALUES (21,150,3.52);
INSERT INTO PREDICTDATA_TAB VALUES (22,150,3.435);
INSERT INTO PREDICTDATA_TAB VALUES (23,245,3.84);
INSERT INTO PREDICTDATA_TAB VALUES (24,175,3.845);
INSERT INTO PREDICTDATA_TAB VALUES (25,66,1.935);
INSERT INTO PREDICTDATA_TAB VALUES (26,91,2.14);
INSERT INTO PREDICTDATA_TAB VALUES (27,113,1.513);
INSERT INTO PREDICTDATA_TAB VALUES (28,264,3.17);
INSERT INTO PREDICTDATA_TAB VALUES (29,175,2.77);
INSERT INTO PREDICTDATA_TAB VALUES (30,335,3.57);
INSERT INTO PREDICTDATA_TAB VALUES (31,109,2.78);
DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
INSERT INTO #CONTROL_TAB VALUES ('VARIABLE_NUM',2,null,null);
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',8,null,null);
DROP TABLE COEEFICIENT_TAB;
CREATE COLUMN TABLE COEEFICIENT_TAB ("ID" INT, "Ai" DOUBLE);
INSERT INTO COEEFICIENT_TAB VALUES (0, 18.866298717199392);
INSERT INTO COEEFICIENT_TAB VALUES (1, 0.03625559608220791);
INSERT INTO COEEFICIENT_TAB VALUES (2, -8.08347518244258);
DROP TABLE FITTED_TAB;

CREATE COLUMN TABLE FITTED_TAB ("ID" INT, "Fitted" DOUBLE, "TYPE" INT);
CALL _SYS_AFL.palForecastWithLogisticR(PREDICTDATA_TAB, "#CONTROL_TAB", COEEFICIENT_TAB, FITTED_TAB) WITH OVERVIEW;
SELECT * FROM FITTED_TAB;

Expected Result
FITTED_TAB:
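
As a rough illustration of what a logistic regression forecast computes, the following Python sketch applies the standard logistic function to the coefficients used in the example. It is not PAL code and is not confirmed to reproduce the PAL output. It assumes that coefficient ID 0 is the intercept, that IDs 1 and 2 weight X1 and X2, and that the predicted TYPE is obtained with a 0.5 threshold; the function name logistic_forecast is hypothetical.

```python
import math

# Coefficients copied from COEEFICIENT_TAB in the example.
# Assumption: ID 0 is the intercept A0, IDs 1..n are the weights of X1..Xn.
COEFFICIENTS = [18.866298717199392, 0.03625559608220791, -8.08347518244258]


def logistic_forecast(x, coefficients, threshold=0.5):
    """Return (fitted probability, predicted class) for one row of variables.

    The 0.5 threshold is an assumption made for this sketch.
    """
    intercept, weights = coefficients[0], coefficients[1:]
    if len(x) != len(weights):
        raise ValueError("expected %d variables, got %d" % (len(weights), len(x)))
    # Linear predictor followed by the logistic (sigmoid) function.
    z = intercept + sum(w * xi for w, xi in zip(weights, x))
    fitted = 1.0 / (1.0 + math.exp(-z))
    return fitted, int(fitted >= threshold)


# First two rows of PREDICTDATA_TAB as (X1, X2).
rows = [(120, 2.8), (110, 2.875)]
predictions = [logistic_forecast(row, COEFFICIENTS) for row in rows]
for row, (fitted, label) in zip(rows, predictions):
    print(row, round(fitted, 4), label)
```

The number of weights plays the role of VARIABLE_NUM here; the sketch is single-threaded and has no counterpart to THREAD_NUMBER.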

3.3 Association Algorithms

3.3.1 Apriori

Apriori is a classic predictive analysis algorithm for finding association rules used in association analysis. Association analysis uncovers the hidden patterns, correlations, or causal structures among a set of items or objects. For example, association analysis enables you to understand what products and services customers tend to purchase at the same time. By analyzing the purchasing trends of your customers with association analysis, you can predict their future behavior.

Apriori is designed to operate on databases containing transactions. As is common in association rule mining, given a set of items, the algorithm attempts to find subsets which are common to at least a minimum number of the item sets. Apriori uses a bottom-up approach, where frequent subsets are extended one item at a time, a step known as candidate generation, and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.

Apriori uses breadth-first search and a tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k-1, and then prunes the candidates which have an infrequent subpattern. The candidate set contains all frequent k-length item sets. After that, it scans the transaction database to determine frequent item sets among the candidates.

The Apriori function in PAL uses vertical data format to store the transaction data in memory. The function can take varchar/char or integer transaction ID and item ID as input. It supports the output of confidence, support, and lift value, but does not limit the number of output rules. However, you can use SQL script to select the number of output rules, for example:
SELECT TOP 2000 * FROM RULE_RESULTS WHERE lift > 0.5;
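
The level-wise search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the PAL implementation: the function name apriori_frequent_itemsets and the sample transactions are hypothetical, and support is expressed as a fraction of all transactions. The sketch uses a vertical layout, in which each item set maps to the set of transaction IDs that contain it, so the support of a candidate is obtained by intersecting two ID sets.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {frozenset(items): support} for all frequent item sets.

    transactions: dict mapping transaction ID -> iterable of item IDs.
    min_support: minimum support as a fraction of all transactions.
    """
    n = len(transactions)
    # Vertical format: each item maps to the set of transaction IDs containing it.
    tidsets = {}
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(frozenset([item]), set()).add(tid)

    current = {s: t for s, t in tidsets.items() if len(t) / n >= min_support}
    frequent = {}
    k = 1
    while current:
        frequent.update({s: len(t) / n for s, t in current.items()})
        k += 1
        # Candidate generation: join frequent (k-1)-item sets into k-item sets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        survivors = {}
        for cand in candidates:
            subsets = [frozenset(c) for c in combinations(cand, k - 1)]
            # Prune candidates that have an infrequent (k-1)-subset.
            if any(s not in current for s in subsets):
                continue
            # Two distinct (k-1)-subsets cover the candidate, so intersecting
            # their transaction ID sets yields the candidate's transactions.
            tids = current[subsets[0]] & current[subsets[1]]
            if len(tids) / n >= min_support:
                survivors[cand] = tids
        current = survivors
    return frequent


# Hypothetical sample data, not taken from the PAL examples.
sample = {
    1: ["bread", "milk"],
    2: ["bread", "butter"],
    3: ["bread", "milk", "butter"],
    4: ["milk"],
}
frequent = apriori_frequent_itemsets(sample, min_support=0.5)
for itemset, support in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

The loop ends when no candidate survives, which corresponds to the termination condition stated above.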

Prerequisite
The input data does not contain null values.

APRIORIRULE
This function reads input transaction data and generates association rules by the Apriori algorithm.

Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','APRIORIRULE', <signature table>);
The signature table should contain the following records:

Index   Table Type Name              Direction
1       <INPUT table type>           in
2       <PARAMETER table type>       in
3       <Result OUTPUT table type>   out
4       <PMML OUTPUT table type>     out

Procedure Calling
CALL <procedure name>(<input table>, <parameter table>, <result output table>, <PMML output table>) WITH overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature
Input Table

Table                      Column        Column Data Type            Description
Dataset/ Historical Data   1st column    Integer, varchar, or char   Transaction ID
                           Item column   Integer, varchar, or char   Item ID

Parameter Table

Name             Data Type   Description
MIN_SUPPORT      Double      User-specified minimum support (actual value).
MIN_CONFIDENCE   Double      User-specified minimum confidence (actual value).
THREAD_NUMBER    Integer     Number of threads.
MAXITEMLENGTH    Integer     Total length of leading items and dependent items in the output. The default is 10.
PMML_EXPORT      Integer     0 (default): does not export Apriori model in PMML.
                             1: exports Apriori model in PMML in single row.
                             2: exports Apriori model in PMML in several rows, each row containing a maximum of 5000 characters.

Output Tables

Table         Column       Column Data Type   Description
Result        1st column   Varchar or char    Leading items
              2nd column   Varchar or char    Dependent items
              3rd column   Double             Support value
              4th column   Double             Confidence value
              5th column   Double             Lift value
PMML Result   1st column   Integer            ID
              2nd column   CLOB or varchar    Apriori model in PMML format
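
To make the support, confidence, and lift columns concrete, the following Python sketch evaluates one rule on the transactions of the example in this section. It uses the usual textbook definitions (support as the fraction of transactions containing all items of the rule, confidence as that support divided by the support of the leading items, lift as confidence divided by the support of the dependent items). Whether PAL reports exactly these values is an assumption, and the function names are hypothetical.

```python
# Transactions from PAL_TRANS_TAB in the example of this section,
# keyed by the CUSTOMER column.
TRANSACTIONS = {
    0: {"item1", "item2", "item5"},
    1: {"item2", "item4"},
    2: {"item2", "item3"},
    3: {"item1", "item2", "item4"},
    4: {"item1", "item3"},
    5: {"item2", "item3"},
    6: {"item1", "item3"},
    7: {"item1", "item2", "item3", "item5"},
    8: {"item1", "item2", "item3"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def rule_metrics(leading, dependent, transactions):
    """Return (support, confidence, lift) of the rule leading -> dependent.

    Both sides must occur in at least one transaction, otherwise the
    divisions below fail.
    """
    both = support(set(leading) | set(dependent), transactions)
    confidence = both / support(leading, transactions)
    lift = confidence / support(dependent, transactions)
    return both, confidence, lift


sup, conf, lift = rule_metrics({"item1"}, {"item2"}, TRANSACTIONS)
print(sup, conf, lift)
```

For the rule item1 -> item2, four of the nine transactions contain both items, six contain item1, and seven contain item2, so the rule clears the MIN_SUPPORT of 0.2 and the MIN_CONFIDENCE of 0.4 used in the example.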

Example
Assume that: DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;
DROP TYPE PAL_DATA_T;
CREATE TYPE PAL_DATA_T AS TABLE(
    "CUSTOMER" INT,
    "ITEM" VARCHAR(20)
);
DROP TYPE PAL_RESULT_T;
CREATE TYPE PAL_RESULT_T AS TABLE(
    "PRERULE" VARCHAR(500),
    "POSTRULE" VARCHAR(500),
    "SUPPORT" DOUBLE,
    "CONFIDENCE" DOUBLE,
    "LIFT" DOUBLE
);
DROP TYPE PAL_PMMLMODEL_T;
CREATE TYPE PAL_PMMLMODEL_T AS TABLE(
    "ID" INT,
    "PMMLMODEL" VARCHAR(5000)
);
DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
    "NAME" VARCHAR(50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR(100)
);
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA(
    "ID" INT,
    "TYPENAME" VARCHAR(100),
    "DIRECTION" VARCHAR(100)
);
INSERT INTO PDATA VALUES (1, 'DM_PAL.PAL_DATA_T', 'in');
INSERT INTO PDATA VALUES (2, 'DM_PAL.PAL_CONTROL_T', 'in');
INSERT INTO PDATA VALUES (3, 'DM_PAL.PAL_RESULT_T', 'out');
INSERT INTO PDATA VALUES (4, 'DM_PAL.PAL_PMMLMODEL_T', 'out');
GRANT SELECT ON DM_PAL.PDATA TO SYSTEM;
CALL SYSTEM.afl_wrapper_generator('PAL_APRIORI_RULE', 'AFLPAL', 'APRIORIRULE', PDATA);
DROP TABLE PAL_TRANS_TAB;
CREATE COLUMN TABLE PAL_TRANS_TAB(
    "CUSTOMER" INT,
    "ITEM" VARCHAR(20)
);
INSERT INTO PAL_TRANS_TAB VALUES (2, 'item2');
INSERT INTO PAL_TRANS_TAB VALUES (2, 'item3');
INSERT INTO PAL_TRANS_TAB VALUES (3, 'item1');
INSERT INTO PAL_TRANS_TAB VALUES (3, 'item2');
INSERT INTO PAL_TRANS_TAB VALUES (3, 'item4');
INSERT INTO PAL_TRANS_TAB VALUES (4, 'item1');
INSERT INTO PAL_TRANS_TAB VALUES (4, 'item3');
INSERT INTO PAL_TRANS_TAB VALUES (5, 'item2');
INSERT INTO PAL_TRANS_TAB VALUES (5, 'item3');
INSERT INTO PAL_TRANS_TAB VALUES (6, 'item1');
INSERT INTO PAL_TRANS_TAB VALUES (6, 'item3');
INSERT INTO PAL_TRANS_TAB VALUES (0, 'item1');
INSERT INTO PAL_TRANS_TAB VALUES (0, 'item2');
INSERT INTO PAL_TRANS_TAB VALUES (0, 'item5');
INSERT INTO PAL_TRANS_TAB VALUES (1, 'item2');
INSERT INTO PAL_TRANS_TAB VALUES (1, 'item4');
INSERT INTO PAL_TRANS_TAB VALUES (7, 'item1');
INSERT INTO PAL_TRANS_TAB VALUES (7, 'item2');
INSERT INTO PAL_TRANS_TAB VALUES (7, 'item3');
INSERT INTO PAL_TRANS_TAB VALUES (7, 'item5');
INSERT INTO PAL_TRANS_TAB VALUES (8, 'item1');

INSERT INTO PAL_TRANS_TAB VALUES (8, 'item2');
INSERT INTO PAL_TRANS_TAB VALUES (8, 'item3');
DROP TABLE PAL_CONTROL_TAB;
CREATE COLUMN TABLE PAL_CONTROL_TAB(
    "NAME" VARCHAR(50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR(100)
);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER', 2, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('MIN_SUPPORT', null, 0.2, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('MIN_CONFIDENCE', null, 0.4, null);
DROP TABLE PAL_RESULT_TAB;
CREATE COLUMN TABLE PAL_RESULT_TAB(
    "PRERULE" VARCHAR(500),
    "POSTRULE" VARCHAR(500),
    "SUPPORT" DOUBLE,
    "CONFIDENCE" DOUBLE,
    "LIFT" DOUBLE
);
DROP TABLE PAL_PMMLMODEL_TAB;
CREATE COLUMN TABLE PAL_PMMLMODEL_TAB(
    "ID" INT,
    "PMMLMODEL" VARCHAR(5000)
);
CALL _SYS_AFL.PAL_APRIORI_RULE(PAL_TRANS_TAB, PAL_CONTROL_TAB, PAL_RESULT_TAB, PAL_PMMLMODEL_TAB) WITH OVERVIEW;
SELECT * FROM PAL_RESULT_TAB;
SELECT * FROM PAL_PMMLMODEL_TAB;

Expected Result: PAL_RESULT_TAB:

PAL_PMMLMODEL_TAB:

LITEAPRIORIRULE
This is a light association rule mining algorithm to realize the Apriori algorithm. It only calculates two large item sets.

Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','LITEAPRIORIRULE', <signature table>);
The signature table should contain the following records:

Index   Table Type Name              Direction
1       <INPUT table type>           in
2       <PARAMETER table type>       in
3       <Result OUTPUT table type>   out
4       <PMML OUTPUT table type>     out

Procedure Calling
CALL <procedure name>(<input table>, <parameter table>, <result output table>, <PMML output table>) WITH overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature
Input Table

Table                      Column       Column Data Type            Description
Dataset/ Historical Data   1st column   Integer, varchar, or char   Transaction ID
                           2nd column   Integer, varchar, or char   Item ID

Parameter Table

Name                Data Type           Description
MIN_SUPPORT         Double              User-specified minimum support (actual value).
MIN_CONFIDENCE      Double              User-specified minimum confidence (actual value).
THREAD_NUMBER       Integer             Number of threads.
OPTIMIZATION_TYPE   Integer or double   If you want to use the entire data, set it to 0. If you want to sample the source input data, specify a double value as the sampling percentage.
IS_RECALCULATE      Integer             If you use the sampling data, this parameter indicates whether to calculate the precise result. The setting 0 represents NOT to recalculate the precise result.
PMML_EXPORT         Integer             0 (default): does not export liteApriori model in PMML.
                                        1: exports liteApriori model in PMML in single row.
                                        2: exports liteApriori model in PMML in several rows, each row containing a maximum of 5000 characters.

Output Tables

Table         Column       Column Data Type   Description
Result        1st column   Varchar or char    Leading items
              2nd column   Varchar or char    Dependent items
              3rd column   Double             Support value
              4th column   Double             Confidence value
              5th column   Double             Lift value
PMML Result   1st column   Integer            ID
              2nd column   CLOB or varchar    liteApriori model in PMML format
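
The following Python sketch shows one way to mine rules from item pairs only. It rests on the assumption that "two large item sets" means frequent item sets of at most two items, so that every rule has one leading item and one dependent item. It is not the PAL implementation, it ignores the sampling options OPTIMIZATION_TYPE and IS_RECALCULATE, and the function name lite_rules is hypothetical. The transactions are those of the example in this section.

```python
from collections import Counter
from itertools import combinations, permutations


def lite_rules(transactions, min_support, min_confidence):
    """Mine single-item -> single-item rules from item pairs only.

    Returns a sorted list of (leading, dependent, support, confidence, lift).
    """
    n = len(transactions)
    item_count = Counter()
    pair_count = Counter()
    for items in transactions:
        unique = sorted(set(items))
        item_count.update(unique)
        pair_count.update(combinations(unique, 2))

    rules = []
    for pair, count in pair_count.items():
        pair_support = count / n
        if pair_support < min_support:
            continue
        # Each frequent pair can yield a rule in both directions.
        for lead, dep in permutations(pair):
            confidence = count / item_count[lead]
            if confidence >= min_confidence:
                lift = confidence / (item_count[dep] / n)
                rules.append((lead, dep, pair_support, confidence, lift))
    return sorted(rules)


# Item lists per customer, copied from PAL_TRANS_TAB in the example.
baskets = [
    ["item1", "item2", "item5"],
    ["item2", "item4"],
    ["item2", "item3"],
    ["item1", "item2", "item4"],
    ["item1", "item3"],
    ["item2", "item3"],
    ["item1", "item3"],
    ["item1", "item2", "item3", "item5"],
    ["item1", "item2", "item3"],
]
rules = lite_rules(baskets, min_support=0.3, min_confidence=0.4)
for rule in rules:
    print(rule)
```

Restricting the search to pairs needs only a single pass over the data and no candidate generation beyond length two, which is the saving such a light variant offers over the full level-wise search.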

Example
Assume that: DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;
DROP TYPE PAL_DATA_T;
CREATE TYPE PAL_DATA_T AS TABLE(
    "CUSTOMER" INT,
    "ITEM" VARCHAR(20)
);
DROP TYPE PAL_RESULT_T;
CREATE TYPE PAL_RESULT_T AS TABLE(
    "PRERULE" VARCHAR(500),
    "POSTRULE" VARCHAR(500),
    "SUPPORT" DOUBLE,
    "CONFIDENCE" DOUBLE,
    "LIFT" DOUBLE
);
DROP TYPE PAL_PMMLMODEL_T;
CREATE TYPE PAL_PMMLMODEL_T AS TABLE(
    "ID" INT,
    "PMMLMODEL" VARCHAR(5000)
);
DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
    "NAME" VARCHAR(50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR(100)
);
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA(
    "ID" INT,
    "TYPENAME" VARCHAR(100),
    "DIRECTION" VARCHAR(100)
);
INSERT INTO PDATA VALUES (1, 'DM_PAL.PAL_DATA_T', 'in');
INSERT INTO PDATA VALUES (2, 'DM_PAL.PAL_CONTROL_T', 'in');
INSERT INTO PDATA VALUES (3, 'DM_PAL.PAL_RESULT_T', 'out');
INSERT INTO PDATA VALUES (4, 'DM_PAL.PAL_PMMLMODEL_T', 'out');

GRANT SELECT ON DM_PAL.PDATA TO SYSTEM;
CALL SYSTEM.afl_wrapper_generator('PAL_LITE_APRIORI_RULE', 'AFLPAL', 'LITEAPRIORIRULE', PDATA);
DROP TABLE PAL_TRANS_TAB;
CREATE COLUMN TABLE PAL_TRANS_TAB(
    "CUSTOMER" INT,
    "ITEM" VARCHAR(20)
);
INSERT INTO PAL_TRANS_TAB VALUES (2, 'item2');
INSERT INTO PAL_TRANS_TAB VALUES (2, 'item3');
INSERT INTO PAL_TRANS_TAB VALUES (3, 'item1');
INSERT INTO PAL_TRANS_TAB VALUES (3, 'item2');
INSERT INTO PAL_TRANS_TAB VALUES (3, 'item4');
INSERT INTO PAL_TRANS_TAB VALUES (4, 'item1');
INSERT INTO PAL_TRANS_TAB VALUES (4, 'item3');
INSERT INTO PAL_TRANS_TAB VALUES (5, 'item2');
INSERT INTO PAL_TRANS_TAB VALUES (5, 'item3');
INSERT INTO PAL_TRANS_TAB VALUES (6, 'item1');
INSERT INTO PAL_TRANS_TAB VALUES (6, 'item3');
INSERT INTO PAL_TRANS_TAB VALUES (0, 'item1');
INSERT INTO PAL_TRANS_TAB VALUES (0, 'item2');
INSERT INTO PAL_TRANS_TAB VALUES (0, 'item5');
INSERT INTO PAL_TRANS_TAB VALUES (1, 'item2');
INSERT INTO PAL_TRANS_TAB VALUES (1, 'item4');
INSERT INTO PAL_TRANS_TAB VALUES (7, 'item1');
INSERT INTO PAL_TRANS_TAB VALUES (7, 'item2');
INSERT INTO PAL_TRANS_TAB VALUES (7, 'item3');
INSERT INTO PAL_TRANS_TAB VALUES (7, 'item5');
INSERT INTO PAL_TRANS_TAB VALUES (8, 'item1');
INSERT INTO PAL_TRANS_TAB VALUES (8, 'item2');
INSERT INTO PAL_TRANS_TAB VALUES (8, 'item3');
DROP TABLE PAL_CONTROL_TAB;
CREATE COLUMN TABLE PAL_CONTROL_TAB(
    "NAME" VARCHAR(50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR(100)
);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER', 2, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('MIN_SUPPORT', null, 0.3, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('MIN_CONFIDENCE', null, 0.4, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('OPTIMIZATION_TYPE', 0, 0.7, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('IS_RECALCULATE', 1, null, null);

DROP TABLE PAL_RESULT_TAB;
CREATE COLUMN TABLE PAL_RESULT_TAB(
    "PRERULE" VARCHAR(500),
    "POSTRULE" VARCHAR(500),
    "SUPPORT" DOUBLE,
    "CONFIDENCE" DOUBLE,
    "LIFT" DOUBLE
);
DROP TABLE PAL_PMMLMODEL_TAB;
CREATE COLUMN TABLE PAL_PMMLMODEL_TAB(
    "ID" INT,
    "PMMLMODEL" VARCHAR(5000)
);
CALL _SYS_AFL.PAL_LITE_APRIORI_RULE(PAL_TRANS_TAB, PAL_CONTROL_TAB, PAL_RESULT_TAB, PAL_PMMLMODEL_TAB) WITH OVERVIEW;
SELECT * FROM PAL_RESULT_TAB;
SELECT * FROM PAL_PMMLMODEL_TAB;

Expected Result
PAL_RESULT_TAB:

PAL_PMMLMODEL_TAB:

3.4 Time Series Algorithms

3.4.1 Single Exponential Smoothing

Single exponential smoothing is often used in financial market and economic data. In PAL, the algorithm begins by setting $S_2$ to $y_1$, where $S$ stands for smoothed observation, $y$ stands for the original observation, and the subscripts refer to the time periods, 1, 2, ..., n. There is no $S_1$, because the smoothed series starts with the smoothed version of the second observation. For any time period $t$, the smoothed value $S_t$ is found by computing

$$S_t = \alpha\, y_{t-1} + (1 - \alpha)\, S_{t-1} \qquad (t > 1)$$

where the constant or parameter $\alpha$ is the smoothing factor, and $0 < \alpha < 1$.

You can get $S_1$ through the following equation:

$$S_1 = y_0$$

Prerequisites
No missing or null data in the inputs. The data is numeric, not categorical.
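
The recurrence of single exponential smoothing can be sketched in Python as follows. This is a minimal illustration that uses the zero-based form S_1 = y_0 and S_t = alpha * y_(t-1) + (1 - alpha) * S_(t-1); it is not the PAL implementation, it has no counterpart to the RAW_DATA_COL, FORECAST_NUM, and STARTTIME parameters, and the function name single_smooth is hypothetical.

```python
def single_smooth(y, alpha):
    """Return [S_1, ..., S_n] for observations y = [y_0, ..., y_(n-1)].

    S_1 = y_0 and S_t = alpha * y_(t-1) + (1 - alpha) * S_(t-1) for t > 1.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    if not y:
        raise ValueError("y must contain at least one observation")
    smoothed = [float(y[0])]  # S_1 = y_0
    for t in range(2, len(y) + 1):
        smoothed.append(alpha * y[t - 1] + (1.0 - alpha) * smoothed[-1])
    return smoothed


# First three raw values of SINGLE_TAB from the SINGLESMOOTH example.
values = single_smooth([200.0, 135.0, 195.0], alpha=0.1)
print(values)
```

Because S_t depends only on observations up to time t-1, the last element of the returned list can be read as a one-step-ahead estimate for the period after the final observation.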

SINGLESMOOTH
This is a single exponential smoothing function.

Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','SINGLESMOOTH', <signature table>);
The signature table should contain the following records:

Index   Table Type Name          Direction
1       <INPUT table type>       in
2       <PARAMETER table type>   in
3       <OUTPUT table type>      out

Procedure Calling
CALL <procedure name>(<input table>, <parameter table>, <output table>) with overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature
Input Table

Table   Column       Column Data Type    Description
Data    1st column   Integer             ID
        2nd column   Integer or double   Raw data

Parameter Table

Name           Data Type   Description
RAW_DATA_COL   Integer     Column number of the column that contains the raw data.
ALPHA          Double      Value of the smoothing constant alpha (0 < α < 1).
FORECAST_NUM   Integer     Number of values to be forecast. When it is set to 1, the algorithm only forecasts one value.
STARTTIME      Integer     Start time of raw data sequence. The default is 1.

Output Table

Table    Column       Column Data Type    Description
Result   1st column   Integer             ID
         2nd column   Integer or double   Output result

Example
Assume that: DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;
DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE("ID" INT, "RAWDATA" DOUBLE);
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("TIME" INT, "OUTPUT" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA("ID" INT, "TYPENAME" VARCHAR(100), "DIRECTION" VARCHAR(100));
INSERT INTO PDATA VALUES (1,'DM_PAL.DATA_T','in');
INSERT INTO PDATA VALUES (2,'DM_PAL.CONTROL_T','in');

INSERT INTO PDATA VALUES (3,'DM_PAL.RESULT_T','out');
GRANT SELECT ON DM_PAL.PDATA TO SYSTEM;
CALL SYSTEM.afl_wrapper_generator('SINGLESMOOTH_TEST','AFLPAL','SINGLESMOOTH',PDATA);
DROP TABLE CONTROL_TAB;
CREATE COLUMN TABLE CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
INSERT INTO CONTROL_TAB VALUES ('RAW_DATA_COL',1,null,null);
INSERT INTO CONTROL_TAB VALUES ('ALPHA',null,0.1,null);
INSERT INTO CONTROL_TAB VALUES ('FORECAST_NUM',1,null,null);
INSERT INTO CONTROL_TAB VALUES ('STARTTIME',2000,null,null);
DROP TABLE SINGLE_TAB;
CREATE COLUMN TABLE SINGLE_TAB ("ID" INT, "RAWDATA" DOUBLE);
INSERT INTO SINGLE_TAB VALUES (0,200.0);
INSERT INTO SINGLE_TAB VALUES (1,135.0);
INSERT INTO SINGLE_TAB VALUES (2,195.0);
INSERT INTO SINGLE_TAB VALUES (3,197.5);
INSERT INTO SINGLE_TAB VALUES (4,310.0);
INSERT INTO SINGLE_TAB VALUES (5,175.0);
INSERT INTO SINGLE_TAB VALUES (6,155.0);
INSERT INTO SINGLE_TAB VALUES (7,130.0);
INSERT INTO SINGLE_TAB VALUES (8,220.0);
INSERT INTO SINGLE_TAB VALUES (9,277.5);
INSERT INTO SINGLE_TAB VALUES (10,235.0);
DROP TABLE RESULT_TAB;
CREATE COLUMN TABLE RESULT_TAB ("TIME" INT, "OUTPUT" DOUBLE);
CALL _SYS_AFL.SINGLESMOOTH_TEST(SINGLE_TAB, CONTROL_TAB, RESULT_TAB) WITH OVERVIEW;
SELECT * FROM RESULT_TAB;

Expected Result RESULT_TAB:

3.4.2 Double Exponential Smoothing

In PAL, double exponential smoothing is also referred to as "Holt-Winters double exponential smoothing." The algorithm uses weighted historical trending to predict the future values of an account. It is more accurate for accounts that tend to trend in one direction over time. In PAL, the result of double exponential smoothing is computed by the following formula:

$$S_0 = X_0$$
$$B_0 = X_1 - X_0$$
$$S_t = \alpha X_t + (1 - \alpha)(S_{t-1} + B_{t-1})$$
$$B_t = \beta (S_t - S_{t-1}) + (1 - \beta) B_{t-1}$$
$$F_{t+m} = S_t + m B_t$$

Where:
$\{X_t\}$: raw data sequence of observations, beginning at time $t = 0$.
$\{S_t\}$: smoothed value for time $t$.
$\{B_t\}$: the best estimate of the trend at time $t$.
$F_{t+m}$: output of the algorithm, which is an estimate of $x$ at time $t + m$ based on the raw data up to time $t$.
$\alpha$: data smoothing factor. The range is $0 < \alpha < 1$.
$\beta$: trend smoothing factor. The range is $0 < \beta < 1$.

Note: $F_0$ is not defined because there is no estimation for time 0. According to the definition, you can get $F_1 = S_0 + B_0$ and so on.

Prerequisites
No missing or null data in the inputs. The data is numeric, not categorical.
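
The double exponential smoothing recurrences of this section can be sketched in Python as follows. This is a minimal illustration of the formulas, not the PAL implementation; the function name double_smooth is hypothetical, and the sketch returns only the forecasts beyond the last observation, without any STARTTIME offset.

```python
def double_smooth(x, alpha, beta, forecast_num):
    """Return forecast_num forecasts following x = [X_0, ..., X_(n-1)].

    The m-th returned value is F_(t+m) = S_t + m * B_t, where t = n - 1
    is the time of the last observation.
    """
    if len(x) < 2:
        raise ValueError("need at least two observations to set B_0 = X_1 - X_0")
    s, b = float(x[0]), float(x[1] - x[0])  # S_0 = X_0, B_0 = X_1 - X_0
    for t in range(1, len(x)):
        s_prev = s
        s = alpha * x[t] + (1.0 - alpha) * (s_prev + b)
        b = beta * (s - s_prev) + (1.0 - beta) * b
    return [s + m * b for m in range(1, forecast_num + 1)]


# Hypothetical, perfectly linear series: level and trend are tracked exactly,
# so the forecasts continue the line.
forecasts = double_smooth([1.0, 2.0, 3.0, 4.0], alpha=0.5, beta=0.3, forecast_num=2)
print(forecasts)
```

On a linear series the trend estimate stays at the constant slope whatever the values of alpha and beta, which is why the method suits data that trends in one direction over time.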

DOUBLESMOOTH
This is a double exponential smoothing function.

Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','DOUBLESMOOTH', <signature table>);
The signature table should contain the following records:

Index   Table Type Name          Direction
1       <INPUT table type>       in
2       <PARAMETER table type>   in
3       <OUTPUT table type>      out

Procedure Calling
CALL <procedure name>(<input table>, <parameter table>, <output table>) with overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature
Input Table

Table   Column       Column Data Type    Description
Data    1st column   Integer             ID
        2nd column   Integer or double   Raw data

Parameter Table

Name           Data Type   Description
RAW_DATA_COL   Integer     Column number of the column that contains the raw data.
ALPHA          Double      Value of the smoothing constant alpha (0 < α < 1).
BETA           Double      Value of the smoothing constant beta (0 < β < 1).
FORECAST_NUM   Integer     Number of values to be forecast (num > 0).
STARTTIME      Integer     Start time of raw data sequence. The default is 1.

Output Table

Table    Column       Column Data Type    Description
Result   1st column   Integer             ID
         2nd column   Integer or double   Output result

Example
Assume that: DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;
DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE("ID" INT, "RAWDATA" DOUBLE);
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("TIME" INT, "OUTPUT" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA("ID" INT, "TYPENAME" VARCHAR(100), "DIRECTION" VARCHAR(100));
INSERT INTO PDATA VALUES (1,'DM_PAL.DATA_T','in');
INSERT INTO PDATA VALUES (2,'DM_PAL.CONTROL_T','in');
INSERT INTO PDATA VALUES (3,'DM_PAL.RESULT_T','out');
GRANT SELECT ON DM_PAL.PDATA TO SYSTEM;
CALL SYSTEM.afl_wrapper_generator('DOUBLESMOOTH_TEST','AFLPAL','DOUBLESMOOTH',PDATA);
DROP TABLE CONTROL_TAB;
CREATE COLUMN TABLE CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
INSERT INTO CONTROL_TAB VALUES ('RAW_DATA_COL',1,null,null);
INSERT INTO CONTROL_TAB VALUES ('ALPHA',null,0.501,null);
INSERT INTO CONTROL_TAB VALUES ('BETA',null,0.072,null);
INSERT INTO CONTROL_TAB VALUES ('FORECAST_NUM',6,null,null);
INSERT INTO CONTROL_TAB VALUES ('STARTTIME',2000,null,null);
DROP TABLE DOUBLE_TAB;
CREATE COLUMN TABLE DOUBLE_TAB ("ID" INT, "RAWDATA" DOUBLE);
INSERT INTO DOUBLE_TAB VALUES (0,143.0);
INSERT INTO DOUBLE_TAB VALUES (1,152.0);
INSERT INTO DOUBLE_TAB VALUES (2,161.0);
INSERT INTO DOUBLE_TAB VALUES (3,139.0);
INSERT INTO DOUBLE_TAB VALUES (4,137.0);

INSERT INTO DOUBLE_TAB VALUES (5,174.0);
INSERT INTO DOUBLE_TAB VALUES (6,142.0);
INSERT INTO DOUBLE_TAB VALUES (7,141.0);
INSERT INTO DOUBLE_TAB VALUES (8,162.0);
INSERT INTO DOUBLE_TAB VALUES (9,180.0);
INSERT INTO DOUBLE_TAB VALUES (10,164.0);
INSERT INTO DOUBLE_TAB VALUES (11,171.0);
INSERT INTO DOUBLE_TAB VALUES (12,206.0);
INSERT INTO DOUBLE_TAB VALUES (13,193.0);
INSERT INTO DOUBLE_TAB VALUES (14,207.0);
INSERT INTO DOUBLE_TAB VALUES (15,218.0);
INSERT INTO DOUBLE_TAB VALUES (16,229.0);
INSERT INTO DOUBLE_TAB VALUES (17,225.0);
INSERT INTO DOUBLE_TAB VALUES (18,204.0);
INSERT INTO DOUBLE_TAB VALUES (19,227.0);
INSERT INTO DOUBLE_TAB VALUES (20,223.0);
INSERT INTO DOUBLE_TAB VALUES (21,242.0);
INSERT INTO DOUBLE_TAB VALUES (22,239.0);
INSERT INTO DOUBLE_TAB VALUES (23,266.0);
DROP TABLE RESULT_TAB;
CREATE COLUMN TABLE RESULT_TAB ("TIME" INT, "OUTPUT" DOUBLE);
CALL _SYS_AFL.DOUBLESMOOTH_TEST(DOUBLE_TAB, CONTROL_TAB, RESULT_TAB) WITH OVERVIEW;
SELECT * FROM RESULT_TAB;

Expected Result RESULT_TAB

3.4.3 Triple Exponential Smoothing

Triple exponential smoothing is used to handle time series data containing a seasonal component. This method is based on three smoothing equations: stationary component, trend, and seasonal. Both the seasonal and the trend component can be additive or multiplicative. PAL implements the multiplicative form, in which triple exponential smoothing is given by the following formulas:

$$S_t = \alpha \frac{X_t}{C_{t-L}} + (1 - \alpha)(S_{t-1} + B_{t-1})$$
$$B_t = \beta (S_t - S_{t-1}) + (1 - \beta) B_{t-1}$$
$$C_t = \gamma \frac{X_t}{S_t} + (1 - \gamma) C_{t-L}$$
$$F_{t+m} = (S_t + m B_t)\, C_{t-L+1+((m-1) \bmod L)}$$

Where:
$\alpha$: Data smoothing factor. The range is $0 < \alpha < 1$.
$\beta$: Trend smoothing factor. The range is $0 < \beta < 1$.
$\gamma$: Seasonal change smoothing factor. The range is $0 < \gamma < 1$.
$X$: Observation
$S$: Smoothed observation
$B$: Trend factor
$C$: Seasonal index
$F$: The forecast at $m$ periods ahead
$t$: The index that denotes a time period

Note: $\alpha$, $\beta$, and $\gamma$ are the constants that must be estimated in such a way that the MSE of the error is minimized.

The formula for the initial trend estimate is:
[formula missing]

Setting the initial estimates for the seasonal indices for i = 0,1,...,L-1 is a bit more involved:
[formula missing]

Where:
[formula missing]

Note: [symbol missing] is the average value of x in the L cycle of your data.

Prerequisites
No missing or null data in the inputs. The data is numeric, not categorical.
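
The multiplicative triple exponential smoothing recurrences of this section can be sketched in Python as follows. This is an illustration only, not the PAL implementation. The starting values are assumptions of the sketch and are not taken from this document: the level starts at the mean of the first cycle, the trend at the average per-period change between the first two cycles, and the seasonal indices at each first-cycle observation divided by the first-cycle mean. The function name triple_smooth is hypothetical.

```python
def triple_smooth(x, alpha, beta, gamma, cycle, forecast_num):
    """Multiplicative triple exponential smoothing; returns forecast_num forecasts.

    x = [X_0, ..., X_(n-1)], cycle = L. The starting values below are
    assumptions made for this sketch.
    """
    L = cycle
    if L < 2 or len(x) < 2 * L:
        raise ValueError("need cycle > 1 and at least two full cycles of data")
    first_mean = sum(x[:L]) / L
    s = first_mean                                          # assumed initial level
    b = sum((x[L + i] - x[i]) / L for i in range(L)) / L    # assumed initial trend
    c = [x[i] / first_mean for i in range(L)]               # assumed C_0 .. C_(L-1)

    for t in range(L, len(x)):
        s_prev = s
        s = alpha * x[t] / c[t - L] + (1.0 - alpha) * (s_prev + b)
        b = beta * (s - s_prev) + (1.0 - beta) * b
        c.append(gamma * x[t] / s + (1.0 - gamma) * c[t - L])  # C_t

    t = len(x) - 1
    # F_(t+m) = (S_t + m * B_t) * C_(t - L + 1 + ((m - 1) mod L))
    return [(s + m * b) * c[t - L + 1 + ((m - 1) % L)]
            for m in range(1, forecast_num + 1)]


# Hypothetical series with a pure two-period seasonal pattern and no trend.
series = [10.0, 20.0] * 4
forecasts = triple_smooth(series, alpha=0.5, beta=0.1, gamma=0.1,
                          cycle=2, forecast_num=3)
print(forecasts)
```

The seasonal index used for each forecast is taken from the last full cycle and wraps around with (m - 1) mod L, so forecasts more than one cycle ahead reuse the most recent seasonal pattern.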

TRIPLESMOOTH
This is a triple exponential smoothing function.

Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','TRIPLESMOOTH', <signature table>);
The signature table should contain the following records:

Index   Table Type Name          Direction
1       <INPUT table type>       in
2       <PARAMETER table type>   in
3       <OUTPUT table type>      out

Procedure Calling
CALL <procedure name>(<input table>, <parameter table>, <output table>) with overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature
Input Table

Table   Column       Column Data Type    Description
Data    1st column   Integer             ID
        2nd column   Integer or double   Raw data

Parameter Table

Name           Data Type   Description
RAW_DATA_COL   Integer     Column number of the column that contains the raw data.
ALPHA          Double      Value of the smoothing constant alpha (0 < α < 1).
BETA           Double      Value of the smoothing constant beta (0 < β < 1).
GAMMA          Double      Value of the smoothing constant gamma (0 < γ < 1).
CYCLE          Integer     A cycle of length L (L > 1). For example, quarterly data cycle is 4, and monthly data cycle is 12.
FORECAST_NUM   Integer     Number of values to be forecast (num > 0).
STARTTIME      Integer     Start time of raw data sequence (default value = 1).

Output Table

Table    Column       Column Data Type    Description
Result   1st column   Integer             ID
         2nd column   Integer or double   Output result

Example
Assume that: DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;
DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE("ID" INT, "RAWDATA" DOUBLE);
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("TIME" INT, "OUTPUT" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA("ID" INT, "TYPENAME" VARCHAR(100), "DIRECTION" VARCHAR(100));
INSERT INTO PDATA VALUES (1,'DM_PAL.DATA_T','in');
INSERT INTO PDATA VALUES (2,'DM_PAL.CONTROL_T','in');
INSERT INTO PDATA VALUES (3,'DM_PAL.RESULT_T','out');
GRANT SELECT ON DM_PAL.PDATA TO SYSTEM;
CALL SYSTEM.afl_wrapper_generator('TRIPLESMOOTH_TEST','AFLPAL','TRIPLESMOOTH',PDATA);
DROP TABLE CONTROL_TAB;
CREATE COLUMN TABLE CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "strArgs" VARCHAR(100));

SAP AG 2013

131

SAP HANA Predictive Analysis Library (PAL) Reference

INSERT INTO CONTROL_TAB VALUES ('RAW_DATA_COL',1,null,null); INSERT INTO CONTROL_TAB VALUES ('ALPHA',null,0.822,null); INSERT INTO CONTROL_TAB VALUES ('BETA',null,0.055,null); INSERT INTO CONTROL_TAB VALUES ('GAMMA',null,0.055,null); INSERT INTO CONTROL_TAB VALUES ('CYCLE',4,null,null); INSERT INTO CONTROL_TAB VALUES ('STARTTIME',2000,null,null); INSERT INTO CONTROL_TAB VALUES ('FORECAST_NUM',6,null,null); DROP TABLE TRIPLE_TAB; CREATE COLUMN TABLE TRIPLE_TAB ("ID" INT, "RAWDATA" DOUBLE); INSERT INTO TRIPLE_TAB VALUES (0,362.0); INSERT INTO TRIPLE_TAB VALUES (1,385.0); INSERT INTO TRIPLE_TAB VALUES (2,432.0); INSERT INTO TRIPLE_TAB VALUES (3,341.0); INSERT INTO TRIPLE_TAB VALUES (4,382.0); INSERT INTO TRIPLE_TAB VALUES (5,409.0); INSERT INTO TRIPLE_TAB VALUES (6,498.0); INSERT INTO TRIPLE_TAB VALUES (7,387.0); INSERT INTO TRIPLE_TAB VALUES (8,473.0); INSERT INTO TRIPLE_TAB VALUES (9,513.0); INSERT INTO TRIPLE_TAB VALUES (10,582.0); INSERT INTO TRIPLE_TAB VALUES (11,474.0); INSERT INTO TRIPLE_TAB VALUES (12,544.0); INSERT INTO TRIPLE_TAB VALUES (13,582.0); INSERT INTO TRIPLE_TAB VALUES (14,681.0); INSERT INTO TRIPLE_TAB VALUES (15,557.0); INSERT INTO TRIPLE_TAB VALUES (16,628.0); INSERT INTO TRIPLE_TAB VALUES (17,707.0); INSERT INTO TRIPLE_TAB VALUES (18,773.0); INSERT INTO TRIPLE_TAB VALUES (19,592.0); INSERT INTO TRIPLE_TAB VALUES (20,627.0); INSERT INTO TRIPLE_TAB VALUES (21,725.0); INSERT INTO TRIPLE_TAB VALUES (22,854.0); INSERT INTO TRIPLE_TAB VALUES (23,661.0); DROP TABLE RESULT_TAB; CREATE COLUMN TABLE RESULT_TAB ("TIME" INT, "OUTPUT" DOUBLE); CALL _SYS_AFL.TRIPLESMOOTH_TEST(TRIPLE_TAB, CONTROL_TAB, RESULT_TAB) with overview; SELECT * FROM RESULT_TAB;


Expected Result RESULT_TAB:


3.5 Preprocessing Algorithms

3.5.1 Binning

Binning data is a common requirement prior to running certain predictive algorithms. It generally reduces the complexity of the model, for example, the model in a decision tree. Binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it. The sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.

There are four binning methods:
- Equal widths based on the number of bins
- Equal widths based on the bin width
- Equal number of records per bin
- Mean / standard deviation bin boundaries

There are three smoothing methods:
- Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.
- Smoothing by bin medians: each bin value is replaced by the bin median.
- Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by its closest boundary value.
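The following Python sketch illustrates one binning method (equal widths based on the number of bins) together with the three smoothing methods. It is only an illustration of the idea, not the PAL implementation: the function names, the 0-based bin index, the clamping of the maximum value into the last bin, and the tie-breaking rule for boundaries are all assumptions made for this example.

```python
from statistics import mean, median


def equal_width_bins(values, bin_count):
    """Assign each value a 0-based bin index using equal-width bins (assumed convention)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_count
    if width == 0:
        return [0] * len(values)
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), bin_count - 1) for v in values]


def smooth(values, bins, method="means"):
    """Replace each value by its bin mean, bin median, or closest bin boundary."""
    members = {}
    for v, b in zip(values, bins):
        members.setdefault(b, []).append(v)
    smoothed = []
    for v, b in zip(values, bins):
        group = members[b]
        if method == "means":
            smoothed.append(mean(group))
        elif method == "medians":
            smoothed.append(median(group))
        elif method == "boundaries":
            lo, hi = min(group), max(group)
            # Ties go to the lower boundary (an assumption of this sketch).
            smoothed.append(lo if v - lo <= hi - v else hi)
        else:
            raise ValueError("unknown smoothing method: " + method)
    return smoothed


temperatures = [6.0, 12.0, 13.0, 15.0, 10.0, 23.0, 24.0, 30.0, 32.0, 25.0, 38.0]
bins = equal_width_bins(temperatures, 4)
for value, bin_index, pre_result in zip(temperatures, bins, smooth(temperatures, bins)):
    print(value, bin_index, pre_result)
```

Because the bin boundary convention is assumed, the printed values are not guaranteed to match the output of the BINNING function.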

Prerequisites
- The input data does not contain null values.
- The data is numeric, not categorical.


BINNING
This function preprocesses the data.

Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','BINNING', <signature table>);
The signature table should contain the following records:
Index  Table Type Name         Direction
1      <INPUT table type>      in
2      <PARAMETER table type>  in
3      <OUTPUT table type>     out

Procedure Calling
CALL <procedure name>(<input table>, <parameter table>, <output table>) with overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature
Input Table
Table  Column      Column Data Type   Description
Data   1st column  Integer or string  ID
       2nd column  Integer or double  Variable temperature


Parameter Table
Name            Data Type  Description
BINNING_METHOD  Integer    Binning methods:
                           0: equal widths based on the number of bins
                           1: equal widths based on the bin width
                           2: equal number of records per bin
                           3: mean / standard deviation bin boundaries
SMOOTH_METHOD   Integer    Smoothing methods:
                           0: smoothing by bin means
                           1: smoothing by bin medians
                           2: smoothing by bin boundaries
BIN_NUMBER      Integer    Number of needed bins.
BIN_DISTANCE    Integer    Specifies the distance for binning. This is required only when you have set BINNING_METHOD to 1.
SD              Integer    Specifies the standard deviation method. This is required only when you have set BINNING_METHOD to 3. Examples: 1 S.D.; 2 S.D.; 3 S.D.

Output Table
Table   Column      Column Data Type   Description
Result  1st column  Integer or string  ID
        2nd column  Integer            Variable TYPE
        3rd column  Integer or double  Variable PRE_RESULT


Example
Assume that:
- DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and
- USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;
DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE("ID" INT, "TEMPERATURE" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR (50), "intArgs" INTEGER, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("ID" INT, "BIN_NUMBER" INT, "PRE_RESULT" DOUBLE);
DROP table PDATA;
CREATE column table PDATA("ID" INT,"TYPENAME" VARCHAR(100),"DIRECTION" VARCHAR(100));
insert into PDATA values (1,'DM_PAL.DATA_T','in');
insert into PDATA values (2,'DM_PAL.CONTROL_T','in');
insert into PDATA values (3,'DM_PAL.RESULT_T','out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('BINNING_TEST','AFLPAL','BINNING',PDATA);
DROP TABLE DATA_TAB;
CREATE COLUMN TABLE DATA_TAB ("ID" INT, "TEMPERATURE" DOUBLE);
INSERT INTO DATA_TAB VALUES (0, 6.0);
INSERT INTO DATA_TAB VALUES (1, 12.0);
INSERT INTO DATA_TAB VALUES (2, 13.0);
INSERT INTO DATA_TAB VALUES (3, 15.0);
INSERT INTO DATA_TAB VALUES (4, 10.0);
INSERT INTO DATA_TAB VALUES (5, 23.0);
INSERT INTO DATA_TAB VALUES (6, 24.0);
INSERT INTO DATA_TAB VALUES (7, 30.0);
INSERT INTO DATA_TAB VALUES (8, 32.0);
INSERT INTO DATA_TAB VALUES (9, 25.0);
INSERT INTO DATA_TAB VALUES (10, 38.0);
DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR (50), "intArgs" INTEGER, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
INSERT INTO #CONTROL_TAB VALUES ('BINNING_METHOD',0,null,null);
INSERT INTO #CONTROL_TAB VALUES ('SMOOTH_METHOD',0,null,null);
INSERT INTO #CONTROL_TAB VALUES ('BIN_NUMBER',4,null,null);
INSERT INTO #CONTROL_TAB VALUES ('BIN_DISTANCE',10,null,null);
INSERT INTO #CONTROL_TAB VALUES ('SD',1,null,null);
DROP TABLE RESULT_TAB;
CREATE TABLE RESULT_TAB ("ID" INT, "BIN_NUMBER" INT, "PRE_RESULT" DOUBLE);
CALL _SYS_AFL.BINNING_TEST(DATA_TAB, "#CONTROL_TAB", RESULT_TAB) with overview;
SELECT * FROM RESULT_TAB;

Expected Result RESULT_TAB:


3.5.2 Inter-quartile Range Test

Given a series of numeric data, the inter-quartile range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1) of the data:

$$IQR = Q_3 - Q_1$$

$Q_1$ is equal to the 25th percentile and $Q_3$ is equal to the 75th percentile. The p-th percentile of a numeric vector is a number that is greater than or equal to p% of all the values of this numeric vector.

The IQR test is a method for detecting outliers in a series of numeric data. The algorithm performs the following tasks:
1. Calculates $Q_1$, $Q_3$, and $IQR$.
2. Sets the upper and lower bounds as follows:
   Upper-bound = $Q_3 + 1.5 \times IQR$
   Lower-bound = $Q_1 - 1.5 \times IQR$
3. Tests every value of the numeric vector to determine whether it is in the range. A value outside the range is marked as an outlier, meaning it does not pass the IQR test.
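The three steps above can be sketched in Python as follows. This is a minimal illustration, not the PAL implementation: the function name `iqr_test` is hypothetical, and the quartiles are computed with the "inclusive" interpolation of Python's `statistics.quantiles`, which is an assumption because the quartile convention used by PAL is not stated here. The sample data are the values used in the IQRTEST example of this section.

```python
from statistics import quantiles


def iqr_test(values, multiplier=1.5):
    """Return (Q1, Q3, flags) where flag 1 marks a value outside the bounds."""
    # Quartile convention is an assumption of this sketch.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    flags = [0 if lower <= v <= upper else 1 for v in values]
    return q1, q3, flags


ids = ["P%d" % i for i in range(1, 16)]
vals = [10, 11, 10, 9, 10, 24, 11, 12, 10, 9, 1, 11, 12, 13, 12]

q1, q3, flags = iqr_test(vals)
outliers = [i for i, f in zip(ids, flags) if f == 1]
print(q1, q3, outliers)
```

With this quartile convention the bounds are 7 and 15, so only the values 24 and 1 fall outside the range.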

Prerequisites
The input data does not contain null values. The algorithm issues an error when it encounters a null value.

IQRTEST
This function performs the inter-quartile range test and outputs the test results.


Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','IQRTEST', <signature table>);
The signature table should contain the following records:
Index  Table Type Name           Direction
1      <INPUT table type>        in
2      <PARAMETER table type>    in
3      <IQR OUTPUT table type>   out
4      <Test OUTPUT table type>  out

Procedure Calling
CALL <procedure name>(<input table>, <parameter table>, <IQR output table>, <test output table>) with overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature
Input Table
Table  Column      Column Data Type           Description
Data   1st column  Integer, varchar, or char  ID
       2nd column  Integer or double          Data that needs to be tested

Parameter Table
Name        Data Type  Description
MULTIPLIER  Double     The multiplier used in the IQR test. The default is 1.5.

Output Tables
Table        Column      Column Data Type   Description
IQR Values   1st column  Double             Q1 value
             2nd column  Double             Q3 value
Test Result  1st column  Integer            ID
             2nd column  Integer or double  Test result:
                                            0: a value is in the range
                                            1: a value is out of range


Example
Assume that:
- DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and
- USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;
DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE("ID" VARCHAR(10),"VAL" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
DROP TYPE IQR_T;
CREATE TYPE IQR_T AS TABLE("Q1" DOUBLE, "Q3" DOUBLE);
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("ID" VARCHAR(10), "TEST" INT);
DROP table PDATA;
CREATE column table PDATA("ID" INT,"TYPENAME" VARCHAR(100),"DIRECTION" VARCHAR(100));
insert into PDATA values (1,'DM_PAL.DATA_T','in');
insert into PDATA values (2,'DM_PAL.CONTROL_T','in');
insert into PDATA values (3,'DM_PAL.IQR_T','out');
insert into PDATA values (4,'DM_PAL.RESULT_T','out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('palIQR','AFLPAL','IQRTEST',PDATA);
DROP TABLE TESTDT_TAB;
CREATE COLUMN TABLE TESTDT_TAB("ID" VARCHAR(10),"VAL" DOUBLE);
INSERT INTO TESTDT_TAB VALUES ('P1', 10);
INSERT INTO TESTDT_TAB VALUES ('P2', 11);
INSERT INTO TESTDT_TAB VALUES ('P3', 10);
INSERT INTO TESTDT_TAB VALUES ('P4', 9);
INSERT INTO TESTDT_TAB VALUES ('P5', 10);
INSERT INTO TESTDT_TAB VALUES ('P6', 24);
INSERT INTO TESTDT_TAB VALUES ('P7', 11);
INSERT INTO TESTDT_TAB VALUES ('P8', 12);
INSERT INTO TESTDT_TAB VALUES ('P9', 10);
INSERT INTO TESTDT_TAB VALUES ('P10', 9);
INSERT INTO TESTDT_TAB VALUES ('P11', 1);
INSERT INTO TESTDT_TAB VALUES ('P12', 11);
INSERT INTO TESTDT_TAB VALUES ('P13', 12);
INSERT INTO TESTDT_TAB VALUES ('P14', 13);
INSERT INTO TESTDT_TAB VALUES ('P15', 12);
DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
INSERT INTO #CONTROL_TAB VALUES ('MULTIPLIER',null,1.5,null);
DROP TABLE IQR_TAB;
CREATE COLUMN TABLE IQR_TAB ("Q1" DOUBLE, "Q3" DOUBLE);
DROP TABLE RESULTS_TAB;
CREATE COLUMN TABLE RESULTS_TAB ("ID" VARCHAR(10), "TEST" INT);
CALL _SYS_AFL.palIQR(TESTDT_TAB, "#CONTROL_TAB", IQR_TAB, RESULTS_TAB) with overview;
SELECT * FROM IQR_TAB;
SELECT * FROM RESULTS_TAB;


Expected Result IQR value:

Test result:


3.5.3 Sampling

Sampling is used to extract a subset of sample units from a population. It is usually difficult for researchers to make direct observations on every individual in the population of concern, so they extract a part of the sample units for research. The basic requirement for sampling is that the extracted sample units are representative of the whole population.

There are many sampling methods. PAL supports eight of them:
- First_N
- Middle_N
- Last_N
- Every_Nth
- SimpleRandom_WithReplacement
- SimpleRandom_WithoutReplacement
- Systematic
- Stratified
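As an illustration of four of these methods, the Python sketch below shows what First_N, Last_N, and the two simple random variants mean in general terms. It is not the PAL implementation: the function names and the `seed` argument are hypothetical, and the remaining methods are left out because their exact selection rules are not described in this section.

```python
import random


def first_n(rows, n):
    """Take the first n rows."""
    return rows[:n]


def last_n(rows, n):
    """Take the last n rows."""
    return rows[max(len(rows) - n, 0):]


def simple_random_with_replacement(rows, n, seed=None):
    """Draw n rows at random; the same row may be drawn more than once."""
    return random.Random(seed).choices(rows, k=n)


def simple_random_without_replacement(rows, n, seed=None):
    """Draw n distinct rows at random (requires n <= len(rows))."""
    return random.Random(seed).sample(rows, n)


employee_numbers = list(range(1, 26))
print(first_n(employee_numbers, 8))
print(last_n(employee_numbers, 8))
print(simple_random_with_replacement(employee_numbers, 8, seed=1))
print(simple_random_without_replacement(employee_numbers, 8, seed=1))
```

Sampling without replacement never returns a row twice, whereas sampling with replacement can, which is the practical difference between the two random methods.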

Prerequisites
The input data does not contain null values.

SAMPLING
This function takes samples from a population.

Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','SAMPLING', <signature table>);
The signature table should contain the following records:
Index  Table Type Name         Direction
1      <INPUT table type>      in
2      <PARAMETER table type>  in
3      <OUTPUT table type>     out

Procedure Calling
CALL <procedure name>(<input table>, <parameter table>, <output table>) with overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.


Signature
Input Table
Table  Column   Column Data Type                  Description
Data   Columns  Integer, double, varchar, or char  Any data users need

Parameter Table
Name             Data Type  Description
SAMPLING_METHOD  Integer    Sampling method:
                            0: First_N
                            1: Middle_N
                            2: Last_N
                            3: Every_Nth
                            4: SimpleRandom_WithReplacement
                            5: SimpleRandom_WithoutReplacement
                            6: Systematic
                            7: Stratified
SAMPLING_SIZE    Integer    Number of samples. Use this parameter when PERCENTAGE is not set.
PERCENTAGE       Double     Percentage of the samples. Use this parameter when SAMPLING_SIZE is not set.
THREAD_NUMBER    Integer    Number of threads.
INTERVAL         Integer    The interval between two samples. Note: This parameter is only required for the Every_Nth method. If this parameter is not specified, the SAMPLING_SIZE parameter will be used.
STRATA_NUM       Integer    The number of sub-populations. Note: This parameter is only required for the Stratified method. In this function a population with three strata is sampled.
STRATA1_COUNT    Integer    Number of samples needed from the first stratum.
STRATA2_COUNT    Integer    Number of samples needed from the second stratum.
STRATA3_COUNT    Integer    Number of samples needed from the third stratum.

Output Table
Table   Column   Column Data Type                  Description
Result  Columns  Integer, double, varchar, or char  The Output Table has the same structure as defined in the Input Table.


Example
Assume that:
- DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and
- USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;
DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE("EMPNO" INT, "GENDER" VARCHAR (50), "INCOME" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR (50), "intArgs" INTEGER, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("RESULT_EMPNO" INT, "RESULT_GENDER" VARCHAR (50), "RESULT_INCOME" DOUBLE);
DROP table PDATA;
CREATE column table PDATA("ID" INT,"TYPENAME" VARCHAR(100),"DIRECTION" VARCHAR(100));
insert into PDATA values (1,'DM_PAL.DATA_T','in');
insert into PDATA values (2,'DM_PAL.CONTROL_T','in');
insert into PDATA values (3,'DM_PAL.RESULT_T','out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('SAMPLING_TEST','AFLPAL','SAMPLING',PDATA);
DROP TABLE DATA_TAB;
CREATE COLUMN TABLE DATA_TAB ("EMPNO" INT, "GENDER" VARCHAR (50), "INCOME" DOUBLE);
INSERT INTO DATA_TAB VALUES (1, 'male', 4000.5);
INSERT INTO DATA_TAB VALUES (2, 'male', 5000.7);
INSERT INTO DATA_TAB VALUES (3, 'female', 5100.8);
INSERT INTO DATA_TAB VALUES (4, 'male', 5400.9);
INSERT INTO DATA_TAB VALUES (5, 'female', 5500.2);
INSERT INTO DATA_TAB VALUES (6, 'male', 5540.4);
INSERT INTO DATA_TAB VALUES (7, 'male', 4500.9);
INSERT INTO DATA_TAB VALUES (8, 'female', 6000.8);
INSERT INTO DATA_TAB VALUES (9, 'male', 7120.8);
INSERT INTO DATA_TAB VALUES (10, 'female', 8120.9);
INSERT INTO DATA_TAB VALUES (11, 'female', 7453.9);
INSERT INTO DATA_TAB VALUES (12, 'male', 7643.8);
INSERT INTO DATA_TAB VALUES (13, 'male', 6754.3);
INSERT INTO DATA_TAB VALUES (14, 'male', 6759.9);
INSERT INTO DATA_TAB VALUES (15, 'male', 9876.5);
INSERT INTO DATA_TAB VALUES (16, 'female', 9873.2);
INSERT INTO DATA_TAB VALUES (17, 'male', 9889.9);
INSERT INTO DATA_TAB VALUES (18, 'male', 9910.4);
INSERT INTO DATA_TAB VALUES (19, 'male', 7809.3);
INSERT INTO DATA_TAB VALUES (20, 'female', 8705.7);
INSERT INTO DATA_TAB VALUES (21, 'male', 8756.0);
INSERT INTO DATA_TAB VALUES (22, 'female', 7843.2);
INSERT INTO DATA_TAB VALUES (23, 'male', 8576.9);
INSERT INTO DATA_TAB VALUES (24, 'male', 9560.9);
INSERT INTO DATA_TAB VALUES (25, 'female', 8794.9);
DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR (50), "intArgs" INTEGER, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
INSERT INTO #CONTROL_TAB VALUES ('SAMPLING_METHOD',0,null,null);
INSERT INTO #CONTROL_TAB VALUES ('SAMPLING_SIZE',8,null,null);
--INSERT INTO #CONTROL_TAB VALUES ('PERCENTAGE',NULL,0.1,null);
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',2,null,null);
INSERT INTO #CONTROL_TAB VALUES ('INTERVAL',5,null,null);
INSERT INTO #CONTROL_TAB VALUES ('STRATA_NUM',3,null,null);
INSERT INTO #CONTROL_TAB VALUES ('STRATA1_COUNT',9,null,null);
INSERT INTO #CONTROL_TAB VALUES ('STRATA2_COUNT',9,null,null);
INSERT INTO #CONTROL_TAB VALUES ('STRATA3_COUNT',7,null,null);
DROP TABLE RESULT_TAB;
CREATE TABLE RESULT_TAB ("RESULT_EMPNO" INT, "RESULT_GENDER" VARCHAR (50), "RESULT_INCOME" DOUBLE);
CALL _SYS_AFL.SAMPLING_TEST(DATA_TAB, "#CONTROL_TAB", RESULT_TAB) WITH OVERVIEW;
SELECT * FROM RESULT_TAB;


Expected Result If method is 0 and SAMPLING_SIZE is 8:

If method is 1 and SAMPLING_SIZE is 8:

If method is 2 and SAMPLING_SIZE is 8:


If method is 3 and INTERVAL is 5:

If method is 4 and SAMPLING_SIZE is 8:

If method is 5 and SAMPLING_SIZE is 8:


If method is 6 and SAMPLING_SIZE is 8:

If method is 7 and SAMPLING_SIZE is 8:

If method is 0 and PERCENTAGE is 0.1:


3.5.4 Scaling Range

This function is used when the attribute data are to be scaled to fall within a specified range, such as -1.0 to 1.0, or 0.0 to 1.0. You can normalize an attribute by scaling its values to make them fall within a specified range. Normalization is particularly useful for classification algorithms involving neural networks, or distance measurements such as nearest neighbor classification and clustering.

There are many data normalization methods. In PAL, the scaling range algorithm includes three methods: min-max normalization, z-score normalization, and normalization by decimal scaling.

Min-max normalization performs a linear transformation on the original data. Suppose that $min_A$ and $max_A$ are the minimum and maximum values of an attribute, A. Min-max normalization maps a value, $v$, of A to $v'$ in the range $[new\_min_A, new\_max_A]$ by computing

$$v' = \frac{(v - min_A)(new\_max_A - new\_min_A)}{max_A - min_A} + new\_min_A$$

In z-score normalization (or zero-mean normalization), the values for an attribute, A, are normalized based on the mean and standard deviation of A. A value, $v$, of A is normalized to $v'$ by computing

$$v' = \frac{v - \bar{A}}{\sigma_A}$$

where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation, respectively, of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.

Normalization by decimal scaling normalizes by moving the decimal point of the values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value, $v$, of A is normalized to $v'$ by computing

$$v' = \frac{v}{10^j}$$

where $j$ is the smallest integer such that $\max(|v'|) < 1$.
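The three formulas can be sketched in Python as follows. This is an illustration only, not the PAL implementation. The function names are hypothetical, the z-score uses the population standard deviation (the text does not say whether PAL uses the population or the sample form), and the decimal scaling search assumes that j is non-negative. The sample data are the X1 values from the SCALINGRANGE example in this section.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map values into [new_min, new_max]; assumes max > min."""
    lo, hi = min(values), max(values)
    return [(v - lo) * (new_max - new_min) / (hi - lo) + new_min for v in values]


def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer giving max |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


x1 = [6.0, 12.1, 13.5, 15.4, 10.2, 23.3, 24.4, 30.6, 32.5, 25.6, 38.7, 38.7]
print(min_max(x1))
print(z_score(x1))
print(decimal_scaling(x1))
```

For these data the largest absolute value is 38.7, so decimal scaling divides every value by 100.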

Prerequisites
- The input data does not contain null values.
- The data is numeric, not categorical.


SCALINGRANGE
This function normalizes the data.

Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','SCALINGRANGE', <signature table>);
The signature table should contain the following records:
Index  Table Type Name         Direction
1      <INPUT table type>      in
2      <PARAMETER table type>  in
3      <OUTPUT table type>     out

Procedure Calling
CALL <procedure name>(<input table>, <parameter table>, <output table>) with overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature
Input Table
Table  Column         Column Data Type   Description
Data   1st column     Integer or string  ID
       Other columns  Integer or double  Variable Xn

Parameter Table
Name            Data Type          Description
SCALING_METHOD  Integer            Scaling method:
                                   0: Min-max normalization
                                   1: Z-score normalization
                                   2: Decimal scaling normalization
THREAD_NUMBER   Integer            Number of threads
NEW_MAX         Double or integer  The new maximum value of the min-max normalization method
NEW_MIN         Double or integer  The new minimum value of the min-max normalization method


Output Table
Table   Column         Column Data Type   Description
Result  1st column     Integer or string  ID
        Other columns  Integer or double  Variable Xn

Example
Assume that:
- DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and
- USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;
DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE("ID" INT, "X1" DOUBLE, "X2" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR (50), "intArgs" INTEGER, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("ID" INT, "PRE_X1" DOUBLE, "PRE_X2" DOUBLE);
DROP table PDATA;
CREATE column table PDATA("ID" INT,"TYPENAME" VARCHAR(100),"DIRECTION" VARCHAR(100));
insert into PDATA values (1,'DM_PAL.DATA_T','in');
insert into PDATA values (2,'DM_PAL.CONTROL_T','in');
insert into PDATA values (3,'DM_PAL.RESULT_T','out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('SCALINGRANGE_TEST','AFLPAL','SCALINGRANGE',PDATA);
DROP TABLE DATA_TAB;
CREATE COLUMN TABLE DATA_TAB ("ID" INT, "X1" DOUBLE, "X2" DOUBLE);
INSERT INTO DATA_TAB VALUES (0, 6.0, 9.0);
INSERT INTO DATA_TAB VALUES (1, 12.1, 8.3);
INSERT INTO DATA_TAB VALUES (2, 13.5, 15.3);
INSERT INTO DATA_TAB VALUES (3, 15.4, 18.7);
INSERT INTO DATA_TAB VALUES (4, 10.2, 19.8);
INSERT INTO DATA_TAB VALUES (5, 23.3, 20.6);
INSERT INTO DATA_TAB VALUES (6, 24.4, 24.3);
INSERT INTO DATA_TAB VALUES (7, 30.6, 25.3);
INSERT INTO DATA_TAB VALUES (8, 32.5, 27.6);
INSERT INTO DATA_TAB VALUES (9, 25.6, 28.5);
INSERT INTO DATA_TAB VALUES (10, 38.7, 29.4);
INSERT INTO DATA_TAB VALUES (11, 38.7, 29.4);
DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR (50), "intArgs" INTEGER, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
INSERT INTO #CONTROL_TAB VALUES ('SCALING_METHOD',2,null,null);
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',2,null,null);
INSERT INTO #CONTROL_TAB VALUES ('NEW_MAX',null,1.0,null);
INSERT INTO #CONTROL_TAB VALUES ('NEW_MIN',null,0.0,null);
DROP TABLE RESULT_TAB;
CREATE TABLE RESULT_TAB ("ID" INT, "PRE_X1" DOUBLE, "PRE_X2" DOUBLE);
CALL _SYS_AFL.SCALINGRANGE_TEST(DATA_TAB, "#CONTROL_TAB", RESULT_TAB) with overview;
SELECT * FROM RESULT_TAB;

Expected Result If method is 0:


If method is 1:

If method is 2:


3.5.5 Variance Test

Variance Test is a method to identify the outliers of $n$ numeric data values $\{x_i\}$, where $0 < i < n + 1$, using the mean $\mu$ and the standard deviation $\sigma$ of the $n$ values.

Below is the algorithm for Variance Test:
1. Calculate the mean ($\mu$) and the standard deviation ($\sigma$):

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$$

2. Set the upper and lower bounds as follows:
   Upper-bound = $\mu + multiplier \times \sigma$
   Lower-bound = $\mu - multiplier \times \sigma$
   where the multiplier is a double type coefficient provided by the user to test whether all the values of a numeric vector are in the range.

If a value is outside the range, it does not pass the Variance Test and is marked as an outlier.
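The algorithm can be sketched in Python as follows. This is an illustration, not the PAL implementation: the function name `variance_test` is hypothetical, and the standard deviation uses the 1/n form, which is an assumption about the exact definition used. The sample data and the multiplier of 3.0 are taken from the VARIANCETEST example in this section.

```python
from math import sqrt


def variance_test(values, multiplier):
    """Return (mean, standard deviation, flags) where flag 1 marks an out-of-bounds value."""
    n = len(values)
    mu = sum(values) / n
    # Population form (1/n) assumed for the standard deviation.
    sigma = sqrt(sum((x - mu) ** 2 for x in values) / n)
    lower = mu - multiplier * sigma
    upper = mu + multiplier * sigma
    flags = [0 if lower <= x <= upper else 1 for x in values]
    return mu, sigma, flags


data = [25, 20, 23, 29, 26, 23, 22, 21, 22, 25,
        26, 28, 29, 27, 26, 23, 22, 23, 25, 103]

mu, sigma, flags = variance_test(data, 3.0)
out_of_bounds = [i for i, f in enumerate(flags) if f == 1]
print(mu, sigma, out_of_bounds)
```

In this data set only the last value (103, ID 19) lies more than three standard deviations from the mean; this holds whether the 1/n or the 1/(n-1) form of the standard deviation is used.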

Prerequisites
No missing or null data in the inputs. The data is numeric, not categorical.


VARIANCETEST
This is a variance test function.

Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','VARIANCETEST', <signature table>);
The signature table should contain the following records:
Index  Table Type Name             Direction
1      <INPUT table type>          in
2      <PARAMETER table type>      in
3      <Result OUTPUT table type>  out
4      <Test OUTPUT table type>    out

Procedure Calling
CALL <procedure name>(<input table>, <parameter table>, <result output table>, <test output table>) with overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.

Signature
Input Table
Table  Column      Column Data Type    Description
Data   1st column  Integer or varchar  ID
       2nd column  Integer or double   Raw data

Parameter Table
Name           Data Type  Description
SIGMA_NUM      Double     Multiplier for sigma
THREAD_NUMBER  Integer    Number of threads

Output Tables
Table   Column      Column Data Type    Description         Constraint
Result  1st column  Double              Mean value
        2nd column  Double              Standard deviation
Test    1st column  Integer or varchar  ID
        2nd column  Integer             Result output       0: in bounds
                                                            1: out of bounds

Example
Assume that:
- DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and
- USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;
DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE("ID" INT,"X" DOUBLE);
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("MEAN" DOUBLE,"SD" DOUBLE);
DROP TYPE TEST_T;
CREATE TYPE TEST_T AS TABLE("ID" INT,"Test" INT);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE,"strArgs" VARCHAR(100));
DROP table PDATA;
CREATE column table PDATA("ID" INT,"TYPENAME" VARCHAR(100),"DIRECTION" VARCHAR(100));
insert into PDATA values (1,'DM_PAL.DATA_T','in');
insert into PDATA values (2,'DM_PAL.CONTROL_T','in');
insert into PDATA values (3,'DM_PAL.RESULT_T','out');
insert into PDATA values (4,'DM_PAL.TEST_T','out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('palVarianceTest','AFLPAL','VARIANCETEST',PDATA);
DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE,"strArgs" VARCHAR(100));
INSERT INTO #CONTROL_TAB VALUES ('SIGMA_NUM',null,3.0,null);
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',8,null,null);
DROP TABLE DATA_TAB;
CREATE COLUMN TABLE DATA_TAB ("ID" INT,"X" DOUBLE);
INSERT INTO DATA_TAB VALUES (0,25);
INSERT INTO DATA_TAB VALUES (1,20);
INSERT INTO DATA_TAB VALUES (2,23);
INSERT INTO DATA_TAB VALUES (3,29);
INSERT INTO DATA_TAB VALUES (4,26);
INSERT INTO DATA_TAB VALUES (5,23);
INSERT INTO DATA_TAB VALUES (6,22);
INSERT INTO DATA_TAB VALUES (7,21);
INSERT INTO DATA_TAB VALUES (8,22);
INSERT INTO DATA_TAB VALUES (9,25);
INSERT INTO DATA_TAB VALUES (10,26);
INSERT INTO DATA_TAB VALUES (11,28);
INSERT INTO DATA_TAB VALUES (12,29);
INSERT INTO DATA_TAB VALUES (13,27);
INSERT INTO DATA_TAB VALUES (14,26);
INSERT INTO DATA_TAB VALUES (15,23);
INSERT INTO DATA_TAB VALUES (16,22);
INSERT INTO DATA_TAB VALUES (17,23);
INSERT INTO DATA_TAB VALUES (18,25);
INSERT INTO DATA_TAB VALUES (19,103);
DROP TABLE RESULT_TAB;
CREATE COLUMN TABLE RESULT_TAB ("MEAN" DOUBLE,"SD" DOUBLE);
DROP TABLE TEST_TAB;
CREATE COLUMN TABLE TEST_TAB ("ID" INT,"Test" INT);
CALL _SYS_AFL.palVarianceTest(DATA_TAB, "#CONTROL_TAB", RESULT_TAB, TEST_TAB) with overview;
SELECT * FROM RESULT_TAB;
SELECT * FROM TEST_TAB;


Expected Result RESULT_TAB:

TEST_TAB:


3.6 Miscellaneous

3.6.1 ABC Analysis

This algorithm is used to classify objects (such as customers, employees, or products) based on a particular measure (such as revenue or profit). It suggests that the inventories of an organization are not of equal value and can therefore be grouped into three categories (A, B, and C) by their estimated importance. A items are very important for an organization. B items are of medium importance, that is, less important than A items and more important than C items. C items are of the least importance.

An example of ABC classification is as follows:
- A items: 20% of the items (customers) account for 70% of the revenue.
- B items: 30% of the items (customers) account for 20% of the revenue.
- C items: 50% of the items (customers) account for 10% of the revenue.
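One common way to carry out such a classification is to rank the items by value and assign classes according to the cumulative share of the total. The Python sketch below follows that approach. It is an illustration under stated assumptions, not the PAL implementation: the function name `abc_classify` is hypothetical, and the rule that an item belongs to a class when the cumulative share before it is still below the class limit is an assumption about how boundary items are handled. The sample data and the 70% / 20% split are taken from the ABC example in this section, with class C receiving the remainder.

```python
def abc_classify(items, percent_a, percent_b):
    """Label (name, value) pairs as A, B, or C by cumulative share of the total value."""
    total = sum(value for _, value in items)
    ranked = sorted(items, key=lambda item: item[1], reverse=True)
    result = []
    cumulative = 0.0
    for name, value in ranked:
        share_before = cumulative / total
        # Boundary rule (assumed): classify by the share accumulated before this item.
        if share_before < percent_a:
            label = "A"
        elif share_before < percent_a + percent_b:
            label = "B"
        else:
            label = "C"
        result.append((label, name))
        cumulative += value
    return result


items = [("item1", 15.4), ("item2", 200.4), ("item3", 280.4), ("item4", 100.9),
         ("item5", 40.4), ("item6", 25.6), ("item7", 18.4), ("item8", 10.5),
         ("item9", 96.15), ("item10", 9.4)]

for label, name in abc_classify(items, 0.7, 0.2):
    print(label, name)
```

A different boundary rule would move the items that straddle the 70% and 90% limits into the next class, so the result should be read as one plausible classification rather than the output of the ABC function.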

Prerequisites
- The input data cannot contain null values.
- The item names in the Input table must be of string data type and be unique.

ABC
This function performs the ABC analysis algorithm.

Procedure Generation
CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>','AFLPAL','ABC', <signature table>);
The signature table should contain the following records:
Index  Table Type Name         Direction
1      <INPUT table type>      in
2      <PARAMETER table type>  in
3      <OUTPUT table type>     out

Procedure Calling
CALL <procedure name>(<input table>, <parameter table>, <output table>) with overview;
The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.


Signature
Input Table
Table  Column      Column Data Type  Description
Data   1st column  VARCHAR/CHAR      Item name
       2nd column  Double            Value

Parameter Table
Name           Data Type  Description
THREAD_NUMBER  Integer    Number of threads
PERCENT_A      Double     Interval for A class
PERCENT_B      Double     Interval for B class
PERCENT_C      Double     Interval for C class

Output Table
Table   Column      Column Data Type  Description
Result  1st column  VARCHAR/CHAR      ABC class
        2nd column  VARCHAR/CHAR      Items

Example
Assume that:
- DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and
- USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;
DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE("ITEM" VARCHAR(100),"VALUE" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE,"strArgs" VARCHAR(100));
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("ABC" VARCHAR(10),"ITEM" VARCHAR(100));
DROP table PDATA;
CREATE column table PDATA("ID" INT,"TYPENAME" VARCHAR(100),"DIRECTION" VARCHAR(100));
insert into PDATA values (1,'DM_PAL.DATA_T','in');
insert into PDATA values (2,'DM_PAL.CONTROL_T','in');
insert into PDATA values (3,'DM_PAL.RESULT_T','out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('PAL_ABC','AFLPAL','ABC',PDATA);
DROP TABLE #CONTROL_TBL;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TBL ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE,"strArgs" VARCHAR(100));
INSERT INTO #CONTROL_TBL VALUES ('THREAD_NUMBER',1,null,null);
INSERT INTO #CONTROL_TBL VALUES ('PERCENT_A',null,0.7,null);
INSERT INTO #CONTROL_TBL VALUES ('PERCENT_B',null,0.2,null);
INSERT INTO #CONTROL_TBL VALUES ('PERCENT_C',null,0.1,null);
DROP TABLE TESTABCTAB;
CREATE COLUMN TABLE TESTABCTAB("ITEM" VARCHAR(100),"VALUE" DOUBLE);
INSERT INTO TESTABCTAB VALUES ('item1', 15.4);
INSERT INTO TESTABCTAB VALUES ('item2', 200.4);
INSERT INTO TESTABCTAB VALUES ('item3', 280.4);
INSERT INTO TESTABCTAB VALUES ('item4', 100.9);
INSERT INTO TESTABCTAB VALUES ('item5', 40.4);
INSERT INTO TESTABCTAB VALUES ('item6', 25.6);
INSERT INTO TESTABCTAB VALUES ('item7', 18.4);
INSERT INTO TESTABCTAB VALUES ('item8', 10.5);
INSERT INTO TESTABCTAB VALUES ('item9', 96.15);
INSERT INTO TESTABCTAB VALUES ('item10', 9.4);
DROP TABLE RESULT_TBL;
CREATE COLUMN TABLE RESULT_TBL("ABC" VARCHAR(10),"ITEM" VARCHAR(100));
CALL _SYS_AFL.PAL_ABC(TESTABCTAB, "#CONTROL_TBL", RESULT_TBL) with overview;
select * from RESULT_TBL;


Expected Result RESULT_TBL:


3.6.2 Weighted Score Table

A weighted score table is a method of evaluating alternatives when the importance of each criterion differs. In a weighted score table, each alternative is given a score for each criterion. These scores are then weighted by the importance of each criterion. All of an alternative's weighted scores are then added together to calculate its total weighted score. The alternative with the highest total score should be the best alternative. You can use weighted score tables to make predictions about future customer behavior. You first create a model based on historical data in the data mining application, and then apply the model to new data to make the prediction. The prediction, that is, the output of the model, is called a score. You can create a single score for your customers by taking into account different dimensions. A function defined by weighted score tables is a linear combination of functions of a variable.

$f(x_1, \ldots, x_n) = w_1 f_1(x_1) + \cdots + w_n f_n(x_n)$

Prerequisites
The input data does not contain null values. The columns of the Map Function table are sorted by the attribute order of the Input Data table.
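The linear combination defined for weighted score tables can be sketched in Python as follows. This is an illustration only, not the WEIGHTEDTABLE implementation: the helper names, the sample criteria, and in particular the way a continuous value is mapped (to the score of the largest key that does not exceed it) are assumptions of this sketch.

```python
from bisect import bisect_right


def discrete_map(table, default=0.0):
    """Return f(x) that looks x up in a key -> score dictionary."""
    def f(x):
        return table.get(x, default)
    return f


def step_map(keys, scores):
    """Return f(x) giving the score of the largest key <= x.

    `keys` must be sorted in ascending order. Values below the first key
    fall back to the first score (an assumption of this sketch).
    """
    def f(x):
        index = bisect_right(keys, x) - 1
        return scores[max(index, 0)]
    return f


def weighted_score(row, weights, functions):
    """Compute w1*f1(x1) + ... + wn*fn(xn) for one row of attributes."""
    return sum(w * f(x) for w, f, x in zip(weights, functions, row))


# Hypothetical criteria: a discrete "tier" and a continuous "spend".
functions = [
    discrete_map({"gold": 2.0, "silver": 1.0}),
    step_map([0, 100, 500], [0.0, 1.0, 2.0]),
]
weights = [0.5, 2.0]

score = weighted_score(("gold", 250), weights, functions)  # 0.5*2.0 + 2.0*1.0
```

Each attribute gets its own map function and its own weight, which mirrors the structure of one Key/Value column pair per attribute in the Map Function table and one weight row per attribute.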

WEIGHTEDTABLE
This function performs weighted table calculation. It is similar to the Volume Driver function in the Business Function Library (BFL). Volume Driver calculates only one column, but WEIGHTEDTABLE calculates multiple columns at the same time.

Procedure Generation

CALL SYSTEM.AFL_WRAPPER_GENERATOR('<procedure name>', 'AFLPAL', 'WEIGHTEDTABLE', <signature table>);

The signature table should contain the following records:

Index   Table Type Name              Direction
1       <Data INPUT table type>      in
2       <Map INPUT table type>       in
3       <Control INPUT table type>   in
4       <PARAMETER table type>       in
5       <OUTPUT table type>          out

Procedure Calling

CALL <procedure name>(<data input table>, <map input table>, <control input table>, <parameter table>, <output table>) with overview;

The procedure name is the same as specified in the procedure generation. The input, parameter, and output tables must be of the types specified in the signature table.


Signature

Input Tables

Table: Target/Input Data
Column: Columns
Column Data Type: Varchar, char, integer, or double
Description: Specifies which will be used to calculate the scores
Constraint: Discrete value: integer, string, double. Continuous value: integer, double. An ID column is mandatory. Its data type should be integer.

Table: Map Function
Column: Columns
Column Data Type: Varchar, char, integer, or double
Description: Creates the map function
Constraint: Every attribute (except ID) in the Input Data table maps to two columns in the Map Function table: Key column and Value column. The Value column must be of double type.

Table: Control
Column: Columns
Column Data Type: Integer or double
Constraint: This table has three columns. When the Input Data table has n attributes (except ID), the Weight Table will have n rows.

Parameter Table

Name            Data Type   Description
THREAD_NUMBER   Integer     Number of threads

Output Table

Table    Column       Column Data Type   Description
Result   1st column   Integer            ID
         2nd column   Double             Result value


Example

Assume that:
DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and
USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

SET SCHEMA DM_PAL;

DROP TYPE PAL_DATA_T;
CREATE TYPE PAL_DATA_T AS TABLE(
    "ID" INT,
    "GENDER" VARCHAR(10),
    "INCOME" INT,
    "HEIGHT" DOUBLE
);
DROP TYPE PAL_MAP_FUN_T;
CREATE TYPE PAL_MAP_FUN_T AS TABLE(
    "GENDER" VARCHAR(10),
    "VAL1" DOUBLE,
    "INCOME" INT,
    "VAL2" DOUBLE,
    "HEIGHT" DOUBLE,
    "VAL3" DOUBLE
);
DROP TYPE PAL_PARA_T;
CREATE TYPE PAL_PARA_T AS TABLE(
    "WEIGHT" DOUBLE,
    "ISDIS" INT,
    "ROWNUM" INT
);
DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);
DROP TYPE PAL_RESULT_T;
CREATE TYPE PAL_RESULT_T AS TABLE(
    "ID" INT,
    "RESULT" DOUBLE
);

-- create procedure
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA(
    "ID" INT,
    "TYPENAME" VARCHAR(100),
    "DIRECTION" VARCHAR(100)
);
INSERT INTO PDATA VALUES (1, 'DM_PAL.PAL_DATA_T', 'in');
INSERT INTO PDATA VALUES (2, 'DM_PAL.PAL_MAP_FUN_T', 'in');
INSERT INTO PDATA VALUES (3, 'DM_PAL.PAL_PARA_T', 'in');
INSERT INTO PDATA VALUES (4, 'DM_PAL.PAL_CONTROL_T', 'in');
INSERT INTO PDATA VALUES (5, 'DM_PAL.PAL_RESULT_T', 'out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('PAL_WEIGHTEDTABLE', 'AFLPAL', 'WEIGHTEDTABLE', PDATA);

-- input data
DROP TABLE PAL_DATA_TAB;
CREATE COLUMN TABLE PAL_DATA_TAB (
    "ID" INT,
    "GENDER" VARCHAR(10),
    "INCOME" INT,
    "HEIGHT" DOUBLE
);
INSERT INTO PAL_DATA_TAB VALUES (0,'male',5000,1.73);
INSERT INTO PAL_DATA_TAB VALUES (1,'male',9000,1.80);
INSERT INTO PAL_DATA_TAB VALUES (2,'female',6000,1.55);
INSERT INTO PAL_DATA_TAB VALUES (3,'male',15000,1.65);
INSERT INTO PAL_DATA_TAB VALUES (4,'female',2000,1.70);
INSERT INTO PAL_DATA_TAB VALUES (5,'female',12000,1.65);
INSERT INTO PAL_DATA_TAB VALUES (6,'male',1000,1.65);
INSERT INTO PAL_DATA_TAB VALUES (7,'male',8000,1.60);
INSERT INTO PAL_DATA_TAB VALUES (8,'female',5500,1.85);
INSERT INTO PAL_DATA_TAB VALUES (9,'female',9500,1.85);

-- map function table: one Key/Value column pair per attribute
DROP TABLE PAL_MAP_FUN_TAB;
CREATE COLUMN TABLE PAL_MAP_FUN_TAB (
    "GENDER" VARCHAR(10),
    "VAL1" DOUBLE,
    "INCOME" INT,
    "VAL2" DOUBLE,
    "HEIGHT" DOUBLE,
    "VAL3" DOUBLE
);
INSERT INTO PAL_MAP_FUN_TAB VALUES ('male',2.0, 0,0.0, 1.5,0.0);
INSERT INTO PAL_MAP_FUN_TAB VALUES ('female',1.5, 5500,1.0, 1.6,1.0);
INSERT INTO PAL_MAP_FUN_TAB VALUES (null,0.0, 9000,2.0, 1.71,2.0);
INSERT INTO PAL_MAP_FUN_TAB VALUES (null,0.0, 12000,3.0, 1.80,3.0);

-- control (weight) table: one row per attribute
DROP TABLE PAL_PARA_TAB;
CREATE COLUMN TABLE PAL_PARA_TAB (
    "WEIGHT" DOUBLE,
    "ISDIS" INT,
    "ROWNUM" INT
);
INSERT INTO PAL_PARA_TAB VALUES (0.5,1,2);
INSERT INTO PAL_PARA_TAB VALUES (2.0,-1,4);
INSERT INTO PAL_PARA_TAB VALUES (1.0,-1,4);

-- parameter table
DROP TABLE PAL_CONTROL_TAB;
CREATE COLUMN TABLE PAL_CONTROL_TAB (
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER',2,null,null);

-- result table and call
DROP TABLE PAL_RESULT_TAB;
CREATE COLUMN TABLE PAL_RESULT_TAB(
    "ID" INT,
    "RESULT" DOUBLE
);
CALL _SYS_AFL.PAL_WEIGHTEDTABLE(PAL_DATA_TAB, PAL_MAP_FUN_TAB, PAL_PARA_TAB, PAL_CONTROL_TAB, PAL_RESULT_TAB) with overview;
SELECT * FROM PAL_RESULT_TAB;


Expected Result PAL_RESULT_TAB:


End-to-End Scenarios

Scenario
You want to predict the segment (cluster) of new customers for a supermarket. First, use the KMEANS function in PAL to segment the supermarket's existing customers. The output can then be used as the training data for the C4.5 Decision Tree function, which predicts the segment of new customers.

Technology Background
K-means clustering is a method of cluster analysis whereby the algorithm partitions N observations or records into K clusters, in which each observation belongs to the cluster with the nearest center. It is one of the most commonly used clustering algorithms.

Decision trees are powerful and popular tools for classification and prediction. Decision tree learning, used in statistics, data mining, and machine learning, uses a decision tree as a predictive model that maps the observations about an item to conclusions about the item's target value.
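To make the "nearest center" idea concrete, the following Python sketch runs a basic K-means loop (assign each point to its nearest center, then recompute each center as the mean of its members). It is not the PAL KMEANS implementation. The function names and the choice of initial centers are assumptions of this sketch, and cluster numbering depends on that initialization, so it need not match the CENTER_ASSIGN values PAL returns. The sample points reuse the age and income values from the scenario data.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centers, max_iterations=100):
    """Basic K-means: returns (labels, centers) once assignments stop changing."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: index of the nearest center for every point.
        new_labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centers


# (age, income) pairs; initial centers are a hypothetical choice for this sketch.
points = [
    (20, 100000), (21, 101000), (22, 102000),
    (30, 200000), (31, 201000), (32, 202000),
    (40, 400000), (41, 401000), (42, 402000),
]
labels, centers = kmeans(points, centers=[points[0], points[3], points[6]])
```

Because income dominates the distance in unnormalized data, the three income bands form the three clusters here. This is also why the NORMALIZATION and initialization settings can change the outcome on real data.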

Implementation Steps
Assume that: DM_PAL is a schema belonging to USER1, who has been granted the privilege of executing SYSTEM.afl_wrapper_generator; and USER1 has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE role.

Step 1
Input customer data and use the K-means function to partition the data set into K clusters. In this example, nine rows of data will be input. K equals 3, which means the customers will be partitioned into three levels.

SET SCHEMA DM_PAL;

DROP TYPE PAL_KMEANS_RESASSIGN_T;
CREATE TYPE PAL_KMEANS_RESASSIGN_T AS TABLE(
    "ID" INT,
    "CENTER_ASSIGN" INT,
    "DISTANCE" DOUBLE
);
DROP TYPE PAL_KMEANS_DATA_T;
CREATE TYPE PAL_KMEANS_DATA_T AS TABLE(
    "ID" INT,
    "AGE" DOUBLE,
    "INCOME" DOUBLE,
    primary key("ID")
);


DROP TYPE PAL_KMEANS_CENTERS_T;
CREATE TYPE PAL_KMEANS_CENTERS_T AS TABLE(
    "CENTER_ID" INT,
    "V000" DOUBLE,
    "V001" DOUBLE
);
DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);

-- create kmeans procedure
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA(
    "ID" INT,
    "TYPENAME" VARCHAR(100),
    "DIRECTION" VARCHAR(100)
);
INSERT INTO PDATA VALUES (1, 'DM_PAL.PAL_KMEANS_DATA_T', 'in');
INSERT INTO PDATA VALUES (2, 'DM_PAL.PAL_CONTROL_T', 'in');
INSERT INTO PDATA VALUES (3, 'DM_PAL.PAL_KMEANS_RESASSIGN_T', 'out');
INSERT INTO PDATA VALUES (4, 'DM_PAL.PAL_KMEANS_CENTERS_T', 'out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('PAL_KMEANS', 'AFLPAL', 'KMEANS', PDATA);

DROP TABLE PAL_KMEANS_DATA_TAB;
CREATE COLUMN TABLE PAL_KMEANS_DATA_TAB(
    "ID" INT,
    "AGE" DOUBLE,
    "INCOME" DOUBLE,
    primary key("ID")
);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (0 , 20, 100000);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (1 , 21, 101000);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (2 , 22, 102000);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (3 , 30, 200000);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (4 , 31, 201000);


INSERT INTO PAL_KMEANS_DATA_TAB VALUES (5 , 32, 202000);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (6 , 40, 400000);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (7 , 41, 401000);
INSERT INTO PAL_KMEANS_DATA_TAB VALUES (8 , 42, 402000);

DROP TABLE PAL_CONTROL_TAB;
CREATE COLUMN TABLE PAL_CONTROL_TAB(
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER',2,null,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('GROUP_NUMBER',3,null,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('INIT_TYPE',4,null,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('DISTANCE_LEVEL',2,null,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('MAX_ITERATION',100,null,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('EXIT_THRESHOLD',null,0.000001,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('NORMALIZATION',0,null,null);

--clean kmeans result
DROP TABLE PAL_KMEANS_RESASSIGN_TAB;
CREATE COLUMN TABLE PAL_KMEANS_RESASSIGN_TAB(
    "ID" INT,
    "CENTER_ASSIGN" INT,
    "DISTANCE" DOUBLE,
    primary key("ID")
);
DROP TABLE PAL_KMEANS_CENTERS_TAB;
CREATE COLUMN TABLE PAL_KMEANS_CENTERS_TAB(
    "CENTER_ID" INT,
    "V000" DOUBLE,
    "V001" DOUBLE
);
CALL _SYS_AFL.PAL_KMEANS(PAL_KMEANS_DATA_TAB, PAL_CONTROL_TAB, PAL_KMEANS_RESASSIGN_TAB, PAL_KMEANS_CENTERS_TAB) with overview;

SELECT * FROM PAL_KMEANS_CENTERS_TAB;
SELECT * FROM PAL_KMEANS_RESASSIGN_TAB;

-- join the cluster assignment back to the customer attributes
DROP TABLE PAL_KMEANS_RESULT_TAB;
CREATE COLUMN TABLE PAL_KMEANS_RESULT_TAB(
    "AGE" DOUBLE,
    "INCOME" DOUBLE,
    "LEVEL" INT
);
TRUNCATE TABLE PAL_KMEANS_RESULT_TAB;
INSERT INTO PAL_KMEANS_RESULT_TAB(
    SELECT PAL_KMEANS_DATA_TAB.AGE, PAL_KMEANS_DATA_TAB.INCOME, PAL_KMEANS_RESASSIGN_TAB.CENTER_ASSIGN
    FROM PAL_KMEANS_RESASSIGN_TAB
    INNER JOIN PAL_KMEANS_DATA_TAB
    ON PAL_KMEANS_RESASSIGN_TAB.ID = PAL_KMEANS_DATA_TAB.ID
);
SELECT * FROM PAL_KMEANS_RESULT_TAB;

The result should show the following in PAL_KMEANS_RESULT_TAB.


Step 2
Use the above output as the training data of C4.5 Decision Tree. The C4.5 Decision Tree function will generate a tree model which maps the observations about an item to the conclusions about the item's target value.

SET SCHEMA DM_PAL;

DROP TYPE PAL_DATA_T;
CREATE TYPE PAL_DATA_T AS TABLE(
    "AGE" DOUBLE,
    "INCOME" DOUBLE,
    "LEVEL" INT
);
DROP TYPE PAL_JSONMODEL_T;
CREATE TYPE PAL_JSONMODEL_T AS TABLE(
    "ID" INT,
    "JSONMODEL" VARCHAR(5000)
);
DROP TYPE PAL_PMMLMODEL_T;
CREATE TYPE PAL_PMMLMODEL_T AS TABLE(
    "ID" INT,
    "PMMLMODEL" VARCHAR(5000)
);
DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR(100)
);

--create procedure
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA(
    "ID" INT,
    "TYPENAME" VARCHAR(100),
    "DIRECTION" VARCHAR(100)
);
INSERT INTO PDATA VALUES (1, 'DM_PAL.PAL_DATA_T', 'in');
INSERT INTO PDATA VALUES (2, 'DM_PAL.PAL_CONTROL_T', 'in');
INSERT INTO PDATA VALUES (3, 'DM_PAL.PAL_JSONMODEL_T', 'out');
INSERT INTO PDATA VALUES (4, 'DM_PAL.PAL_PMMLMODEL_T', 'out');


GRANT SELECT ON DM_PAL.PDATA to SYSTEM;
call SYSTEM.afl_wrapper_generator('PAL_CREATEDT', 'AFLPAL', 'CREATEDT', PDATA);

-- Note: PAL_TRAINING_TAB is not referenced by the call below, which trains on PAL_KMEANS_RESULT_TAB.
DROP TABLE PAL_TRAINING_TAB;
CREATE COLUMN TABLE PAL_TRAINING_TAB(
    "REGION" VARCHAR(50),
    "SALESPERIOD" VARCHAR(50),
    "REVENUE" Double,
    "CLASSLABEL" VARCHAR(50)
);

DROP TABLE PAL_CONTROL_TAB;
CREATE COLUMN TABLE PAL_CONTROL_TAB(
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);
INSERT INTO PAL_CONTROL_TAB VALUES ('PERCENTAGE',null,1.0,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER',2,null,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('IS_SPLIT_MODEL',1,null,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('PMML_EXPORT', 2, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('CONTINUOUS_COL',1,102001,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('CONTINUOUS_COL',1,202001,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('CONTINUOUS_COL',0,23,null);
INSERT INTO PAL_CONTROL_TAB VALUES ('CONTINUOUS_COL',0,12,null);

DROP TABLE PAL_JSONMODEL_TAB;
CREATE COLUMN TABLE PAL_JSONMODEL_TAB(
    "ID" INT,
    "JSONMODEL" VARCHAR(5000)
);
DROP TABLE PAL_PMMLMODEL_TAB;
CREATE COLUMN TABLE PAL_PMMLMODEL_TAB(
    "ID" INT,
    "PMMLMODEL" VARCHAR(5000)
);
CALL _SYS_AFL.PAL_CREATEDT(PAL_KMEANS_RESULT_TAB, PAL_CONTROL_TAB, PAL_JSONMODEL_TAB, PAL_PMMLMODEL_TAB) with overview;
SELECT * FROM PAL_JSONMODEL_TAB;


SELECT * FROM PAL_PMMLMODEL_TAB;
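As background for the tree built in Step 2: the C4.5 algorithm in general chooses splits by gain ratio, that is, information gain divided by the split information of the partition. The Python sketch below evaluates one candidate threshold on a continuous attribute. It illustrates the criterion only and says nothing about how the CREATEDT function is implemented internally; the function names, the labels, and the threshold are assumptions of this sketch.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` vs. `value > threshold`."""
    left = [label for v, label in zip(values, labels) if v <= threshold]
    right = [label for v, label in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing

    total = len(labels)
    remainder = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    gain = entropy(labels) - remainder
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info


# Income values from the scenario, with hypothetical cluster labels 0, 1, 2.
incomes = [100000, 101000, 102000, 200000, 201000, 202000, 400000, 401000, 402000]
levels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

ratio = gain_ratio(incomes, levels, threshold=150000)
```

A threshold of 150000 cleanly isolates the lowest income band, so its information gain equals its split information and the ratio is approximately 1. A threshold outside the data range separates nothing and yields 0.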


Step 3
Use the above tree model to map each new customer to the corresponding level he or she belongs to.

SET SCHEMA DM_PAL;

DROP TYPE PAL_DATA_T;
CREATE TYPE PAL_DATA_T AS TABLE(
    "ID" INT,
    "AGE" DOUBLE,
    "INCOME" DOUBLE
);
DROP TYPE PAL_JSONMODEL_T;
CREATE TYPE PAL_JSONMODEL_T AS TABLE(
    "ID" INT,
    "JSONMODEL" VARCHAR(5000)
);
DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);
DROP TYPE PAL_RESULT_T;
CREATE TYPE PAL_RESULT_T AS TABLE(
    "ID" INT,
    "CLASSLABEL" VARCHAR(50)
);

-- create procedure
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA(
    "ID" INT,
    "TYPENAME" VARCHAR(100),
    "DIRECTION" VARCHAR(100)
);
INSERT INTO PDATA VALUES (1, 'DM_PAL.PAL_DATA_T', 'in');
INSERT INTO PDATA VALUES (2, 'DM_PAL.PAL_CONTROL_T', 'in');
INSERT INTO PDATA VALUES (3, 'DM_PAL.PAL_JSONMODEL_T', 'in');
INSERT INTO PDATA VALUES (4, 'DM_PAL.PAL_RESULT_T', 'out');
GRANT SELECT ON DM_PAL.PDATA to SYSTEM;


call SYSTEM.afl_wrapper_generator('PAL_PREDICTWITHDT', 'AFLPAL', 'PREDICTWITHDT', PDATA);

-- new customers to classify
DROP TABLE PAL_DATA_TAB;
CREATE COLUMN TABLE PAL_DATA_TAB (
    "ID" INT,
    "AGE" DOUBLE,
    "INCOME" DOUBLE
);
INSERT INTO PAL_DATA_TAB VALUES (10 ,20, 100003);
INSERT INTO PAL_DATA_TAB VALUES (11 ,30, 200003);
INSERT INTO PAL_DATA_TAB VALUES (12 ,40, 400003);

DROP TABLE PAL_CONTROL_TAB;
CREATE COLUMN TABLE PAL_CONTROL_TAB (
    "NAME" VARCHAR (50),
    "INTARGS" INTEGER,
    "DOUBLEARGS" DOUBLE,
    "STRINGARGS" VARCHAR (100)
);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER',2,null,null);

DROP TABLE PAL_RESULT_TAB;
CREATE TABLE PAL_RESULT_TAB(
    "ID" INT,
    "CLASSLABEL" VARCHAR(50)
);
CALL _SYS_AFL.PAL_PREDICTWITHDT(PAL_DATA_TAB, PAL_CONTROL_TAB, PAL_JSONMODEL_TAB, PAL_RESULT_TAB) with overview;
SELECT * FROM PAL_RESULT_TAB;

The expected prediction result is as follows:


Best Practices
Create an SQL view for the input table if the table structure does not match what is specified in this guide.

Avoid null values in the input data. PAL functions cannot infer default values, so replace null values with defaults yourself by using an SQL statement (SQL view or SQL update).

Create the parameter table as a local temporary table to avoid table name conflicts.

If you do not use PMML export, you do not need to create a PMML output table to store the result. Set the PMML_EXPORT parameter to 0 and pass ? or null to the function.

When using the KMEANS function, different INIT_TYPE and NORMALIZATION settings may produce different results. You may need to try a few combinations of these two parameters to get the best result.

When using the APRIORIRULE function, the rule set can become very large in some circumstances. To avoid an excessively long runtime, set the MAXITEMLENGTH parameter to a smaller number, such as 2 or 3.
