
Building data warehouses using open source technologies - Michel Jansen

Building data warehouses using open source technologies


Draft version 146, Michel Jansen (mjansen@betterbe.com)

1 Table of Contents
1 Table of Contents
2 Introduction
  2.1 Intended audience
  2.2 Why build a data warehouse
3 Building a data warehouse
  3.1 Designing a dimensional data warehouse
    3.1.1 Asking questions
    3.1.2 Modeling structures
    3.1.3 Picking a fact grain
    3.1.4 Adding dimensions
  3.2 Constructing the data warehouse
    3.2.1 Designing transformations using Spoon
      Some preparations
      Updating type 1 dimensions
      Updating type 2 dimensions and the fact table
      Aggregating the data
    3.2.2 Putting it all together
4 Using the data warehouse
  4.1 Preparing for online analytical processing
    4.1.1 From relational to dimensional
    4.1.2 Doing it in Mondrian
  4.2 Asking multidimensional queries
  4.3 Visualization and presentation
5 References
6 History of this document
  1st of September 2006: Initial release
Appendix A: Technology overview
  Kettle
  Mondrian
  JPivot
Appendix B: Generating a date dimension using JavaScript
  JavaScript code
Appendix C: Example Mondrian XML schema
License

2 Introduction
This article is about building data warehouses. A data warehouse is a computer database that collects, integrates and stores an organization's data with the aim of producing accurate and timely management information and supporting data analysis [1]. The article explains the importance of a good data warehouse and covers the process of building such a specialized database using open source technologies.

2.1 Intended audience

This article is meant for software developers, database administrators, integrated software vendors or other people who face the challenge of making it possible to analyze large amounts of data generated by a business or an information system. It assumes a basic understanding of the concepts it covers, such as databases, information systems and business transactions.

2.2 Why build a data warehouse

There are many reasons to justify building a data warehouse, but almost all of them boil down to the same basic wish: provide means for analysis of data to support management decisions. You are probably already providing management with data like site usage statistics, referrer trends and the number of registered users of their information system. This is basic information, which can be retrieved directly from the information system itself. However, this has some important drawbacks.

First of all, since all of this data is retrieved directly from the information system, it places a considerable burden on this system. Analysis often requires huge amounts of data to be processed, which is often a problem if, for instance, the system's database tables get locked for retrieval. A data warehouse can be completely detached from the information system, even running on a different system.

Secondly, an OLTP [2] system's data model is rarely optimized for analysis. We all learn to develop systems to use a normalized database modeled after our entity relationships. This is a good thing for the information system, since the underlying model is a close reflection of the system itself, but it makes querying for large sums of aggregated data a costly operation. Furthermore, redundancy is rarely part of this database design, because redundancy is hard to maintain, often causing data inconsistencies or worse. For data analysis, redundancy can be great, because it speeds up the process.

Moreover, data in an OLTP system might change over time. A customer might move to another country, leaving you, the data analysis provider, with an impossible choice: either you update the customer's record, discarding his previous state and invalidating previous analysis, or you somehow create a new record for the online system and change all references to it. Neither option is desirable, but in an offline data warehouse you can keep both the old and the new state of the customer and specify at what point in time it changed.

Finally, there are probably a lot of data sources your information system isn't using that could be useful in data analysis. A data warehouse can provide central storage for all this data, so all the collected information can be queried in one step.

[1] http://en.wikipedia.org/wiki/Data_warehouse
[2] On-Line Transaction Processing

3 Building a data warehouse


In this chapter, I will guide you through the different phases in building a data warehouse. I will illustrate this by using a simplified web-based information system as an example for creating a data warehouse. This system contains familiar entities such as requests, users and pages. The data model of this system is shown in figure <TODO>.

3.1 Designing a dimensional data warehouse

3.1.1 Asking questions

The most important step in building a data warehouse is designing it. You have to ask yourself: what does the management want to know? First, you'll have to figure out which questions need to be answered by analysis of the data warehouse to-be [3]. For example, is there a correlation between users of a web-based system and the pages it provides? Do certain groups of users visit other pages than others? Where do the visitors of different pages come from? My belief is that at least 80% of all management questions about the data generated by an information system can be answered by a decent data warehouse.

3.1.2 Modeling structures

Once you get a good view of what questions you'll want to ask the data warehouse later on, you have to determine the different data arrangements that come with these questions. For example, a question about sales will have to operate on a different structure than one about employment. These structures are called cubes in the world of OLAP [4], because they are essentially an extension to the two-dimensional array of data that is stored in a conventional (for example SQL) database. Depending on the data, a cube may have more than two or even three dimensions. We'll model these cubes in a relational database as a star schema, containing a single fact table linking together multiple dimension tables.

Illustration 1: Request cube containing only a time dimension

3.1.3 Picking a fact grain

The art in making up this structure is finding a good basis for a fact table. The more aggregated the chosen fact element for each line in the fact table is, the harder it becomes to add useful dimensions. The ideal basis for a fact table is the lowest possible atomic grain of data. This is often a single transaction, payment or, as in my example, a page request.

[3] This is not a typo
[4] On-Line Analytical Processing

3.1.4 Adding dimensions

Now that we have a central starting point for our data cube, it is time to decorate the basic facts by adding new dimensions, containing aspects that can be attributed to the facts. If you can't think of any dimension that is valid for every value in your fact table, then you've probably chosen the wrong grain to begin with.

The most trivial of all dimensions is that of time. Every request takes place at a certain point in time, so we create a dimension table time and have every request entry in the fact table link to the entry in this table corresponding to the time it took place. If two requests took place at the same moment, they will link to the same entry in the time dimension's table.

Illustration 2: A cube modelled as an OLAP star in a relational database

Another dimension we want to add is that of pages. For each distinct page associated with a request in the fact table, an entry will exist in the page dimension table. The same goes for referrers.

Finally, we'll want to link requests to the users that made them. Every request in the fact table was made either by a known user, or by no user at all, which will be a special entry in the user dimension's table.

As we create and link dimensions, an important rule is to never use the existing keys from the online system, as we have no control over how they might change or even disappear. Instead, every dimension will get its own surrogate key, called the technical key, which is unique to the data warehouse. This also goes for the fact table.
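To make the idea of a fact table linked to dimension tables through technical keys more concrete, here is a minimal in-memory sketch in plain JavaScript. It is not how Kettle or a database stores anything; the helper createDimension and all field names are assumptions made up for illustration.

```javascript
// Hypothetical in-memory star schema: one fact table, two dimension tables.
// All names here are illustrative assumptions, not part of Kettle.
function createDimension(naturalKeyOf) {
  const rows = [];          // the dimension table
  const index = new Map();  // natural key -> technical key
  return {
    rows,
    // Return the technical key for a record, inserting it on first sight.
    lookupOrInsert(record) {
      const naturalKey = naturalKeyOf(record);
      if (!index.has(naturalKey)) {
        const technicalKey = rows.length + 1; // surrogate key owned by the warehouse
        rows.push({ id: technicalKey, ...record });
        index.set(naturalKey, technicalKey);
      }
      return index.get(naturalKey);
    },
  };
}

const timeDim = createDimension((t) => `${t.year}-${t.month}-${t.day}`);
const pageDim = createDimension((p) => `${p.domain}${p.path}`);
const requestFacts = [];

// Each fact row holds only its own key and the technical keys of its dimensions.
function addRequest(request) {
  requestFacts.push({
    id: requestFacts.length + 1,
    time_id: timeDim.lookupOrInsert(request.time),
    page_id: pageDim.lookupOrInsert(request.page),
  });
}

addRequest({ time: { year: 2006, month: 9, day: 1 }, page: { domain: "example.org", path: "/home" } });
addRequest({ time: { year: 2006, month: 9, day: 1 }, page: { domain: "example.org", path: "/about" } });
```

Both requests happen on the same day, so they share one entry in the time dimension while pointing at two different page entries.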

3.2 Constructing the data warehouse

Now that we know which cubes we want to express as a star join schema, it's time to get to work. In this step, we'll populate a data warehouse with data from the OLTP system. This phase of the process is known as ETL, which stands for Extract, Transform, Load. This is exactly what needs to be done: extract the data needed for the fact and dimension tables from all different data sources, transform it to fit our needs and load it into the data warehouse so it can be queried.

We'll perform these actions with Kettle, an open source ETL tool. Kettle is roughly made up of two packages: Spoon and Pan, which are used to create and execute transformations, and Chef and Kitchen, which let you define jobs from transformations and schedule and execute them.

3.2.1 Designing transformations using Spoon

We start by using Spoon to make the transformations that will populate our data warehouse. In order to be able to fill the central fact table, the keys to all of the dimensions must be known. Therefore we make a distinction between two types of dimensions (not to be confused with Ralph Kimball's slowly changing dimension types):

1. Dimensions consisting of data already known to the online system.
2. Dimensions that are to be generated from the fact data and surrounding sources.

We generate or update the dimensions of type two while filling our fact table and thus know their keys at that time, but we cannot do this for the dimensions of the first kind.

In our example, the time and page dimensions are of the second kind, meaning they are generated when updating our fact table. The user dimension has to be known before then, because it is based on some independent tables in the online system. Because of this, we will fill our data warehouse in two separate transformations, first ensuring the existence of our type 1 user dimension and then filling the other dimensions and the fact table itself.

Some preparations
Before we can begin using Spoon, we have to define the data sources. In our case, we have only one data source: the database of our online system. We need to add this database as a Connection to Spoon as described in its documentation about Database connections. We also define the connection to our target database this way.

Updating type 1 dimensions

In our example, the user dimension is the only one that cannot be updated while filling the fact table. The data already exists in the source system as a single table, so all we have to do is read it and run it through Spoon's great Dimension lookup/update step.

This step is capable of creating, updating and looking up slowly changing dimensions as described by Ralph Kimball. For sources that contain changing data, such as the customer records mentioned earlier, it can do this in two ways:

I. Overwriting the existing record with the updated one.
II. Creating another dimension record, maintaining multiple copies of changed records over time.

This second type is implemented by adding a version, date_from and date_to field to the dimension's table and it is almost always the most useful one.

In our example, we want the user dimension to be a slowly changing dimension of type II, so the transformation in Spoon will look like Illustration 3. First, the data is read from the online system's user table, then the data we want to put in our dimension table is filtered using a select step and finally, it all gets inserted into our data warehouse's user_dim table. Because all of the steps are quite accurately described in Spoon's documentation, we will only explain how to configure the dimension lookup/update step for this example.

Illustration 3: The "update user dimension" transformation in Spoon

Because we create this dimension directly from a table from the online system, the key field to use in this dimension is trivial. It is the key used in the online system, which in our case is the user's email. As said before, this field will not be used as a key for our data warehouse, as we replace it by a technical key unique to the warehouse. It is merely needed to be able to retrieve this technical key later, as we will need it to link the dimension to the fact table. In our case, this technical key field is called user_id.

In the Fields tab, we specify the fields that are of importance to this dimension, such as hierarchical data. In our case, we keep track of information like the user's full name, the company he or she is part of and his or her sex.

In case of type II dimensions, the Version field is used to keep track of the different versions in a slowly changing dimension. The date ranges will be used to indicate the period of validity for each version in that case. If the online system keeps track of when records change, it is useful to use the Stream Date field for better validity, but there are many more applications possible, which are not in the scope of this document. Usually, the default (now) will suffice.

Illustration 4: Dimension Lookup/Update dialog
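The type II behaviour can be sketched in a few lines of plain JavaScript. This is only an illustration of the idea under stated assumptions, not the implementation of Spoon's step: the field names version, date_from and date_to come from the text, while updateDimension, lookup, the country attribute and the far-future end date are made up for this example.

```javascript
// Hypothetical type II slowly changing dimension, kept in memory.
// OPEN_END marks the version that is currently valid (an assumption of this sketch).
const OPEN_END = new Date("2199-12-31T00:00:00Z");

function updateDimension(dim, naturalKey, attributes, now) {
  const current = dim.find((r) => r.email === naturalKey && r.date_to === OPEN_END);
  if (current) {
    const unchanged = Object.keys(attributes).every((k) => current[k] === attributes[k]);
    if (unchanged) return current.user_id; // nothing changed, reuse the key
    current.date_to = now;                 // close the old version
  }
  const row = {
    user_id: dim.length + 1,               // technical key
    email: naturalKey,                     // key from the online system
    ...attributes,
    version: current ? current.version + 1 : 1,
    date_from: now,
    date_to: OPEN_END,
  };
  dim.push(row);
  return row.user_id;
}

// Find the technical key of the version that was valid at a given moment.
function lookup(dim, naturalKey, at) {
  const row = dim.find((r) => r.email === naturalKey && r.date_from <= at && at < r.date_to);
  return row ? row.user_id : null;
}

const userDim = [];
const k1 = updateDimension(userDim, "ann@example.org", { country: "NL" }, new Date("2006-01-01T00:00:00Z"));
const k2 = updateDimension(userDim, "ann@example.org", { country: "BE" }, new Date("2006-06-01T00:00:00Z"));
const k3 = updateDimension(userDim, "ann@example.org", { country: "BE" }, new Date("2006-07-01T00:00:00Z"));
```

A request made in March would be linked to the first version of the user, one made in August to the second, so older analyses stay valid after the user moves.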

Updating type 2 dimensions and the fact table

Now that all independent dimensions have been prepared, it is time to populate the fact table. In the process, the remaining dimensions can be updated as well, having their technical keys ready when needed.

We start by reading the basic grain for the fact table from the request log.

Then, the date dimension is generated, which is done by using the JavaScript step on the timestamp field in the request log, as described in Appendix B: Generating a date dimension using JavaScript. Using the Dimension Lookup/Update step, we put this data into the database, and keep references to it in the form of the generated technical key time_id, which is added to the stream as an extra column.

Next up is the user dimension. Remember, we already updated that one in the previous step, so now all we have to do is link its corresponding entry for each request in the fact table. Through a complex lookup using the session key to match each request to a login action from the actions table, we are able to finally match a user to each request. If no user can be found for this session, we turn its key (the email) to NULL, so it will match a special case created by Spoon's Dimension Lookup/Update step: the unknown user. Because the filtering creates separate threads, we sort the user streams afterwards using Spoon's sort step. When it reaches the Lookup user_dim step, the stream contains an extra field named email, which contains the email of each request, or NULL if it is not known. Remember we specified email as the lookup key when updating this dimension earlier, so using this field, the transformation step can find a technical key for each entry in the stream, adding it as an extra field called user_id.

All of the information needed to fill the page dimension is already in the request log's stream. While filling the dimension table with hierarchical data about this request's domain, path and page, the newly generated technical key field page_id is added to the stream.

The same is done for a referrer dimension we added for fun, which was not mentioned earlier, but is similar to the page dimension.

Finally, the data that is to go into the fact table is filtered so that only the technical keys to the dimensions and the grain's facts remain. It is then inserted into a table, request_facts in this example.

Illustration 5: Dimension and fact update transformation
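The flow of this transformation can be mimicked outside Kettle as a rough sketch. The JavaScript below is only an illustration under assumptions: the lookup tables, the session-to-email map, the per-day time key and the value 0 for the unknown user are all invented for the example and do not reflect how Spoon stores or names things.

```javascript
// Hypothetical lookup tables: natural key -> technical key.
// Using 0 as the key of the "unknown user" entry is an assumption of this sketch.
const UNKNOWN_USER_ID = 0;
const userKeys = new Map([["ann@example.org", 1], ["bob@example.org", 2]]);
const sessionLogins = new Map([["s1", "ann@example.org"]]); // session -> email

// Return the technical key for a natural key, generating one if needed.
function getOrCreateKey(keys, naturalKey) {
  if (!keys.has(naturalKey)) keys.set(naturalKey, keys.size + 1);
  return keys.get(naturalKey);
}

// Turn one request log line into a fact row holding only technical keys.
function toFactRow(request, timeKeys, pageKeys) {
  // Resolve the user through the session; null when nobody logged in.
  const email = sessionLogins.has(request.session) ? sessionLogins.get(request.session) : null;
  const userId = email !== null && userKeys.has(email) ? userKeys.get(email) : UNKNOWN_USER_ID;
  return {
    id: request.id,
    time_id: getOrCreateKey(timeKeys, request.timestamp.slice(0, 10)), // one entry per day
    page_id: getOrCreateKey(pageKeys, request.url),
    user_id: userId,
  };
}

const timeKeys = new Map();
const pageKeys = new Map();
const log = [
  { id: 1, session: "s1", timestamp: "2006-09-01T10:00:00", url: "/home" },
  { id: 2, session: "s2", timestamp: "2006-09-01T10:05:00", url: "/home" },
  { id: 3, session: "s1", timestamp: "2006-09-02T09:00:00", url: "/about" },
];
const facts = log.map((r) => toFactRow(r, timeKeys, pageKeys));
```

The second request belongs to a session without a login, so it ends up linked to the unknown user instead of being dropped.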

Aggregating the data


TODO: Explain how to make aggregate tables for speeding up analysis.

3.2.2 Putting it all together

Now that we've defined the transformations that fill our data warehouse, it's time to prove that one plus one equals three by using Kettle's Chef and Kitchen to combine the transformations into one job, adding logging and debugging functionality in the process.

The tool Chef doesn't require much explanation, as its interface is almost equal to that of Spoon. Where Spoon is used to combine steps to create a transformation, Chef links together transformations to make a job. Because Chef is already pretty well documented, a simple screenshot says it all for this example.

Because there are two types of dimensions in our case, the process of filling the data warehouse is divided in two transformation phases, as described earlier. First, Chef will execute the transformation that updates our type 1 User dimension. On failure, it will send an email with more details on what went wrong. If it succeeds, it will continue with the second phase: the transformation that updates the other dimensions and the fact table. In this example, I also wanted to receive an email on success, so I also linked the last transformation to the Mail step.

Illustration 6: The Chef job to fill the data warehouse

A final feature of Chef is that its jobs can be scheduled. Using Kitchen as a command line tool, this makes Kettle a very powerful system for creating, updating and maintaining data warehouses.
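The control flow of such a job can be summarized in a short sketch. The JavaScript below is not a Kettle API: runJob, the transformation objects and the mail callback are placeholders invented here to show the order of execution and the mail on failure or success. Stopping after any failed phase is an assumption of the sketch.

```javascript
// Hypothetical stand-in for the Chef job: run the transformations in order
// and report the outcome by mail.
function runJob(transformations, sendMail) {
  for (const { name, run } of transformations) {
    try {
      run();
    } catch (error) {
      sendMail(`FAILED: ${name}`, String(error.message));
      return false; // later phases depend on earlier ones, so stop here
    }
  }
  sendMail("SUCCESS", "Data warehouse updated");
  return true;
}

const sent = [];
const mail = (subject, body) => sent.push({ subject, body });

// Both phases succeed: one mail reporting success.
const ok = runJob(
  [
    { name: "update user dimension", run: () => {} },
    { name: "update dimensions and fact table", run: () => {} },
  ],
  mail
);

// The first phase fails: the second phase is never started.
const failed = runJob(
  [
    { name: "update user dimension", run: () => { throw new Error("source database unreachable"); } },
    { name: "update dimensions and fact table", run: () => { throw new Error("must not run"); } },
  ],
  mail
);
```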


4 Using the data warehouse


This chapter describes how to perform analysis on a data warehouse. It assumes a star joined database structure, like the one described in the previous chapter's example, is in place.

4.1 Preparing for online analytical processing

In section 3.1.2: Modeling structures, we started by structuring data cubes in the data warehouse as a relational star schema, containing a single fact table linking together multiple dimension tables. This way, each dimension is linked to every other dimension in an n:n way for every cube it is part of.

Because of the special way the data is organized, it is possible to query the data in a generalized form. This is the task for an OLAP server. In our example, we will use the open source relational OLAP server Mondrian to do so.

4.1.1 From relational to dimensional

As explained earlier, the data in the warehouse is stored in a relational form. A fact table links together multiple dimensions, each in their own table, containing their own data. In an OLAP cube, data is organized in a multidimensional form that consists of all of these tables together.

Each cube can contain multiple measures, which can be obtained from the fact table. These measures are used to calculate aggregations and group data. A cube's measures can be anything, like sales quantity, product price or a calculated total of these facts. In the case of factless fact tables there doesn't even have to be a measure; it can be 1 all the time.

Every cube also has multiple dimensions. A dimension is one way of looking at the data, usually constructed from one dimension table in the relational database. The most trivial example of a dimension is, of course, that of time.

Every dimension can have one or more hierarchies. A hierarchy is, as you might expect, an ordered set of increasingly aggregated levels of information about the dimension. In case of a time dimension, one hierarchy might look like: day → month → quarter → year. Hierarchies are very useful in grouping and aggregating data while analyzing.

These concepts can be used by an OLAP server to query the data in a multidimensional way. I will now tell you how to configure the Mondrian OLAP server to do so.

4.1.2 Doing it in Mondrian

After the installation of the Mondrian OLAP server, which is trivial and outside the scope of this document, we need to tell it about our data warehouse's structure. This is done using a fairly well documented XML format, containing information about how the OLAP cubes and their dimensions, hierarchies and measures are to be built from tables in a relational database.

Let's do this for our example. As can be seen in Appendix C: Example Mondrian XML schema, the configuration starts by telling Mondrian the name of the schema. In our case, the schema contains only one cube, which is called Requests. The Requests cube is linked directly to the request_facts table, and has one measure and four dimensions.

Because the only facts in the fact table are the variables and we don't really care about those right now, the measure is a trivial one: the count of the ids of the requests, to which each request contributes exactly one.

The four dimensions are defined as being part of the cube. In our case we have a Users, Page, Referrer and Time dimension, each containing one or more hierarchies. For every dimension, Mondrian needs to know by which table it is represented and by what key it is linked to the fact table. For each level of a hierarchy, a table column or calculated value is to be defined.

Once Mondrian knows how to interpret the data warehouse's relational database, it can act as a translation layer, allowing multidimensional queries to be run on a relational database.

4.2 Asking multidimensional queries


The dimensional nature of the OLAP cubes Mondrian uses makes it possible to ask complex questions of our data warehouse. To do this, Mondrian supports a dialect of the MDX language, which is a multidimensional counterpart to SQL. Its syntax is roughly like that in Illustration 7: Basic MDX syntax.

SELECT [<axis_specification> [, <axis_specification>...]]
FROM [<cube_specification>]
[WHERE [<slicer_specification>]]

Illustration 7: Basic MDX syntax

MDX differs from SQL in that SQL queries always return a two-dimensional data set: a single table with a fixed number of columns on one axis and zero or more rows on another axis, whereas MDX queries operate on a multidimensional data set and can therefore return a result with any number of axes. An example of what such a query might look like when we want to know which companies visited our online system in which months is shown in Illustration 8: Example MDX query.


SELECT {Descendants([Time].[2006], [Time].[Month])} ON COLUMNS,
       {[Users].[Company Name].Members} ON ROWS
FROM [Requests]

Illustration 8: Example MDX query
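To get a feeling for what such a query computes, the result can be imitated on a handful of rows in plain JavaScript. This is only a conceptual sketch and says nothing about how Mondrian evaluates MDX; the sample rows, the company names and the helper pivotCount are invented for the example.

```javascript
// Hypothetical, already-joined fact rows: each row is one request with the
// attributes of its time and user dimension entries attached.
const requests = [
  { year: 2006, month: 1, company: "Acme" },
  { year: 2006, month: 1, company: "Acme" },
  { year: 2006, month: 2, company: "Acme" },
  { year: 2006, month: 2, company: "Initech" },
  { year: 2005, month: 12, company: "Initech" }, // outside 2006, filtered out
];

// Count requests per (row member, column member), restricted by a filter.
function pivotCount(rows, rowOf, columnOf, keep) {
  const result = {};
  for (const row of rows.filter(keep)) {
    const r = rowOf(row);
    const c = columnOf(row);
    result[r] = result[r] || {};
    result[r][c] = (result[r][c] || 0) + 1;
  }
  return result;
}

const perCompanyPerMonth = pivotCount(
  requests,
  (row) => row.company,      // ROWS: company name
  (row) => row.month,        // COLUMNS: month
  (row) => row.year === 2006 // restriction to the year 2006
);
```

Each cell of the resulting table holds the value of the Pages measure, here the number of requests, for one company in one month of 2006.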


4.3 Visualization and presentation

The last step in providing a useful system for analysis to the end user is adding a presentation layer that is capable of formulating multidimensional queries and visualizing their results. One of the tools capable of doing so is JPivot.

JPivot is included in recent distributions of Mondrian, so it requires no additional setup. Its configuration is easy: just specify the data set to use in a JSP file and start querying.

Anything is possible. As more projects like JPivot and JFreeReport are being embedded in the Pentaho project, it will become easier to generate the kind of presentation required for any situation.


5 References
A Dimensional Modeling Manifesto, Ralph Kimball, 1997
Slowly Changing Dimensions, Ralph Kimball, 1996
Spoon Documentation, Pentaho, 2001-2006


6 History of this document


1st of September 2006: Initial release
The first version to be released to the community, still in draft form. I would like to thank my employer, Betterbe, for giving me the time and resources to work on this document. Without their support, this document would not exist. Also, I would like to thank, in advance, anyone who will contribute to this document in the future. You make open source what it is today.

Michel Jansen


Appendix A: Technology overview


In this appendix, I'll explain some important technologies used and their place in the data warehouse creation process.

Kettle
http://www.kettle.be

Kettle is an open source ETL suite for Extracting data from various sources, Transforming it and Loading it into the data warehouse.

Kettle consists of two sets of tools. Spoon and Pan are used to create and execute transformations on data, respectively. Chef and Kitchen are used to organize transformations into jobs and execute them in a scheduled way.

Illustration 9: Used technologies and their place

Mondrian
http://mondrian.sourceforge.net

Mondrian is a relational OLAP server, which acts as a layer on top of a relational database to allow for multidimensional queries to be performed on the data.

JPivot
http://jpivot.sourceforge.net

JPivot is a tool that can act as a presentation layer on top of Mondrian. It generates multidimensional queries and displays their results as interactive pivot tables.


Appendix B: Generating a date dimension using JavaScript


A dimension present in almost all data warehouses is that of time. Spoon's JavaScript step makes it easy to extract hierarchical data from date or time fields in the source system. While most of this is described in Spoon's documentation, an example of how to do this is given here.

JavaScript code
// The fields we want to calculate
var day_of_month;
var week_of_year;
var month_of_year;
var year;
var quarter;
var name_day;
var name_month;

// Calculate! Each field is derived from a clone of the dateTime input field.
day_of_month = dateTime.Clone().dat2str("dd");
week_of_year = dateTime.Clone().dat2str("ww");
month_of_year = dateTime.Clone().dat2str("MM");
year = dateTime.Clone().dat2str("yyyy");
name_day = dateTime.Clone().dat2str("E").getString();
name_month = dateTime.Clone().dat2str("MMMM").getString();

// The quarter follows from the month (01-12), not from the day of the month.
if (month_of_year <= 3) {
    quarter = "Q1";
} else if (month_of_year <= 6) {
    quarter = "Q2";
} else if (month_of_year <= 9) {
    quarter = "Q3";
} else {
    quarter = "Q4";
}
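For readers who want to try the calculation outside Spoon, the same fields can be derived with the standard JavaScript Date object. This is a sketch under assumptions, not a replacement for the step above: the function names are made up, dates are treated as UTC, and the week number follows ISO 8601, which may differ from the locale-dependent week number that the "ww" pattern produces in Spoon.

```javascript
// Plain JavaScript sketch (no Kettle API) producing one date dimension row.
function dateDimensionRow(date) {
  const dayNames = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"];
  const monthNames = ["January", "February", "March", "April", "May", "June",
                      "July", "August", "September", "October", "November", "December"];
  const month = date.getUTCMonth() + 1; // getUTCMonth() is zero-based
  return {
    day_of_month: date.getUTCDate(),
    week_of_year: isoWeek(date),
    month_of_year: month,
    year: date.getUTCFullYear(),
    quarter: "Q" + Math.ceil(month / 3), // months 1-3 -> Q1, 4-6 -> Q2, ...
    name_day: dayNames[date.getUTCDay()],
    name_month: monthNames[month - 1],
  };
}

// ISO 8601 week number: weeks start on Monday, week 1 contains the first Thursday.
function isoWeek(date) {
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  const weekday = d.getUTCDay() || 7;         // Monday = 1 ... Sunday = 7
  d.setUTCDate(d.getUTCDate() + 4 - weekday); // move to the Thursday of this week
  const yearStart = new Date(Date.UTC(d.getUTCFullYear(), 0, 1));
  return Math.ceil(((d - yearStart) / 86400000 + 1) / 7);
}

const row = dateDimensionRow(new Date(Date.UTC(2006, 8, 1))); // 1 September 2006
```

Deriving the quarter from the month with Math.ceil(month / 3) avoids the chain of comparisons, at the cost of being slightly less explicit.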


Appendix C: Example Mondrian XML schema


<?xml version="1.0"?>
<Schema name="DataWareHouseTest">
  <Cube name="Requests">
    <Table name="request_facts" />
    <Dimension name="Users" foreignKey="user_id">
      <Hierarchy hasAll="true" allMemberName="All Users" primaryKey="user_id">
        <Table name="user_dim"/>
        <Level name="Company Branche" column="companybranche" uniqueMembers="false"/>
        <Level name="Company Name" column="companyname" uniqueMembers="false"/>
        <Level name="Role" column="role" uniqueMembers="false"/>
      </Hierarchy>
      <Hierarchy hasAll="true" primaryKey="user_id">
        <Level name="Sex" column="sex" uniqueMembers="false"/>
        <Level name="User" column="email" uniqueMembers="true">
          <Property name="First Name" column="firstname"/>
          <Property name="Infix" column="infix"/>
          <Property name="Last Name" column="lastname"/>
        </Level>
      </Hierarchy>
    </Dimension>
    <Dimension name="Page" foreignKey="page_id">
      <Hierarchy hasAll="true" primaryKey="page_id">
        <Table name="page_dim" />
        <Level name="Domain" column="domain" />
        <Level name="Path" column="path" />
        <Level name="Page" column="page" />
      </Hierarchy>
    </Dimension>
    <Dimension name="Referrer" foreignKey="referrer_id">
      <Hierarchy hasAll="true" primaryKey="referrer_id">
        <Table name="referrer_dim" />
        <Level name="Domain" column="referrerDomain" />
        <Level name="Path" column="referrerPath" />
      </Hierarchy>
    </Dimension>
    <Dimension name="Time" type="TimeDimension" foreignKey="time_id">
      <Hierarchy hasAll="false" primaryKey="time_id">
        <Table name="time_dim"/>
        <Level name="Year" column="year" type="Numeric" uniqueMembers="true" levelType="TimeYears"/>
        <Level name="Quarter" column="quarter" uniqueMembers="false" levelType="TimeQuarters"/>
        <Level name="Month" column="month_of_year" uniqueMembers="false" type="Numeric" levelType="TimeMonths"/>
      </Hierarchy>
      <Hierarchy hasAll="true" name="Weekly" primaryKey="time_id" defaultMember="[Time.Weekly].[All Time.Weeklys]">
        <Table name="time_dim"/>
        <Level name="Year" column="year" type="Numeric" uniqueMembers="true" levelType="TimeYears"/>
        <Level name="Week" column="week_of_year" type="Numeric" uniqueMembers="false" levelType="TimeWeeks"/>
        <Level name="Day" column="day_of_month" uniqueMembers="false" type="Numeric" levelType="TimeDays"/>
      </Hierarchy>
    </Dimension>
    <Measure name="Pages" column="id" aggregator="count" datatype="Integer" formatString="#,##" />
  </Cube>
</Schema>


License
THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS PROTECTED BY COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF THE WORK OTHER THAN AS AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS PROHIBITED.

BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND AGREE TO BE BOUND BY THE TERMS OF THIS LICENSE. THE LICENSOR GRANTS YOU THE RIGHTS CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND CONDITIONS.

1. Definitions

   1. "Collective Work" means a work, such as a periodical issue, anthology or encyclopedia, in which the Work in its entirety in unmodified form, along with a number of other contributions, constituting separate and independent works in themselves, are assembled into a collective whole. A work that constitutes a Collective Work will not be considered a Derivative Work (as defined below) for the purposes of this License.
   2. "Derivative Work" means a work based upon the Work or upon the Work and other pre-existing works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which the Work may be recast, transformed, or adapted, except that a work that constitutes a Collective Work will not be considered a Derivative Work for the purpose of this License. For the avoidance of doubt, where the Work is a musical composition or sound recording, the synchronization of the Work in timed-relation with a moving image ("synching") will be considered a Derivative Work for the purpose of this License.
   3. "Licensor" means the individual or entity that offers the Work under the terms of this License.
   4. "Original Author" means the individual or entity who created the Work.
   5. "Work" means the copyrightable work of authorship offered under the terms of this License.
   6. "You" means an individual or entity exercising rights under this License who has not previously violated the terms of this License with respect to the Work, or who has received express permission from the Licensor to exercise rights under this License despite a previous violation.
   7. "License Elements" means the following high-level license attributes as selected by Licensor and indicated in the title of this License: Attribution, ShareAlike.


2.FairUseRights.Nothinginthislicenseisintendedtoreduce,limit,orrestrictanyrightsarising fromfairuse,firstsaleorotherlimitationsontheexclusiverightsofthecopyrightownerunder copyrightlaworotherapplicablelaws. 3.LicenseGrant.SubjecttothetermsandconditionsofthisLicense,LicensorherebygrantsYoua worldwide,royaltyfree,nonexclusive,perpetual(forthedurationoftheapplicablecopyright)license toexercisetherightsintheWorkasstatedbelow: 1.toreproducetheWork,toincorporatetheWorkintooneormoreCollectiveWorks,andto reproducetheWorkasincorporatedintheCollectiveWorks; 2.tocreateandreproduceDerivativeWorks; 3.todistributecopiesorphonorecordsof,displaypublicly,performpublicly,andperformpublicly bymeansofadigitalaudiotransmissiontheWorkincludingasincorporatedinCollectiveWorks; 4.todistributecopiesorphonorecordsof,displaypublicly,performpublicly,andperformpublicly bymeansofadigitalaudiotransmissionDerivativeWorks. 5. Fortheavoidanceofdoubt,wheretheworkisamusicalcomposition: 1.PerformanceRoyaltiesUnderBlanketLicenses.Licensorwaivestheexclusiverighttocollect, whetherindividuallyorviaaperformancerightssociety(e.g.ASCAP,BMI,SESAC),royaltiesforthe publicperformanceorpublicdigitalperformance(e.g.webcast)oftheWork. 2.MechanicalRightsandStatutoryRoyalties.Licensorwaivestheexclusiverighttocollect, whetherindividuallyorviaamusicrightssocietyordesignatedagent(e.g.HarryFoxAgency), royaltiesforanyphonorecordYoucreatefromtheWork("coverversion")anddistribute,subjecttothe compulsorylicensecreatedby17USCSection115oftheUSCopyrightAct(ortheequivalentinother jurisdictions). 6.WebcastingRightsandStatutoryRoyalties.Fortheavoidanceofdoubt,wheretheWorkisasound recording,Licensorwaivestheexclusiverighttocollect,whetherindividuallyorviaaperformance rightssociety(e.g.SoundExchange),royaltiesforthepublicdigitalperformance(e.g.webcast)ofthe Work,subjecttothecompulsorylicensecreatedby17USCSection114oftheUSCopyrightAct(or theequivalentinotherjurisdictions). 
The above rights may be exercised in all media and formats whether now known or hereafter devised. The above rights include the right to make such modifications as are technically necessary to exercise the rights in other media and formats. All rights not expressly granted by Licensor are hereby reserved.
4. Restrictions. The license granted in Section 3 above is expressly made subject to and limited by the following restrictions:
   1. You may distribute, publicly display, publicly perform, or publicly digitally perform the Work only under the terms of this License, and You must include a copy of, or the Uniform Resource Identifier for, this License with every copy or phonorecord of the Work You distribute, publicly display, publicly perform, or publicly digitally perform. You may not offer or impose any terms on the Work that alter or restrict the terms of this License or the recipients' exercise of the rights granted hereunder. You may not sublicense the Work. You must keep intact all notices that refer to this License and to the disclaimer of warranties. You may not distribute, publicly display, publicly perform, or publicly digitally perform the Work with any technological measures that control access or use of the Work in a manner inconsistent with the terms of this License Agreement. The above applies to the Work as incorporated in a Collective Work, but this does not require the Collective Work apart from the Work itself to be made subject to the terms of this License. If You create a Collective Work, upon notice from any Licensor You must, to the extent practicable, remove from the Collective Work any credit as required by clause 4(c), as requested. If You create a Derivative Work, upon notice from any Licensor You must, to the extent practicable, remove from the Derivative Work any credit as required by clause 4(c), as requested.
   2. You may distribute, publicly display, publicly perform, or publicly digitally perform a Derivative Work only under the terms of this License, a later version of this License with the same License Elements as this License, or a Creative Commons iCommons license that contains the same License Elements as this License (e.g. Attribution-ShareAlike 2.5 Japan). You must include a copy of, or the Uniform Resource Identifier for, this License or other license specified in the previous sentence with every copy or phonorecord of each Derivative Work You distribute, publicly display, publicly perform, or publicly digitally perform. You may not offer or impose any terms on the Derivative Works that alter or restrict the terms of this License or the recipients' exercise of the rights granted hereunder, and You must keep intact all notices that refer to this License and to the disclaimer of warranties. You may not distribute, publicly display, publicly perform, or publicly digitally perform the Derivative Work with any technological measures that control access or use of the Work in a manner inconsistent with the terms of this License Agreement. The above applies to the Derivative Work as incorporated in a Collective Work, but this does not require the Collective Work apart from the Derivative Work itself to be made subject to the terms of this License.
   3. If you distribute, publicly display, publicly perform, or publicly digitally perform the Work or any Derivative Works or Collective Works, You must keep intact all copyright notices for the Work and provide, reasonable to the medium or means You are utilizing: (i) the name of the Original Author (or pseudonym, if applicable) if supplied, and/or (ii) if the Original Author and/or Licensor designate another party or parties (e.g. a sponsor institute, publishing entity, journal) for attribution in Licensor's copyright notice, terms of service or by other reasonable means, the name of such party or parties; the title of the Work if supplied; to the extent reasonably practicable, the Uniform Resource Identifier, if any, that Licensor specifies to be associated with the Work, unless such URI does not refer to the copyright notice or licensing information for the Work; and in the case of a Derivative Work, a credit identifying the use of the Work in the Derivative Work (e.g., "French translation of the Work by Original Author," or "Screenplay based on original Work by Original Author"). Such credit may be implemented in any reasonable manner; provided, however, that in the case of a Derivative Work or Collective Work, at a minimum such credit will appear where any other comparable authorship credit appears and in a manner at least as prominent as such other comparable authorship credit.
5. Representations, Warranties and Disclaimer
UNLESS OTHERWISE AGREED TO BY THE PARTIES IN WRITING, LICENSOR OFFERS THE WORK AS-IS AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE MATERIALS, EXPRESS, IMPLIED, STATUTORY OR OTHERWISE, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTIBILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OF ABSENCE OF ERRORS, WHETHER OR NOT DISCOVERABLE. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OF IMPLIED WARRANTIES, SO SUCH EXCLUSION MAY NOT APPLY TO YOU.
6. Limitation on Liability. EXCEPT TO THE EXTENT REQUIRED BY APPLICABLE LAW, IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY FOR ANY SPECIAL, INCIDENTAL, CONSEQUENTIAL, PUNITIVE OR EXEMPLARY DAMAGES ARISING OUT OF THIS LICENSE OR THE USE OF THE WORK, EVEN IF LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
7. Termination
   1. This License and the rights granted hereunder will terminate automatically upon any breach by You of the terms of this License. Individuals or entities who have received Derivative Works or Collective Works from You under this License, however, will not have their licenses terminated provided such individuals or entities remain in full compliance with those licenses. Sections 1, 2, 5, 6, 7, and 8 will survive any termination of this License.
   2. Subject to the above terms and conditions, the license granted here is perpetual (for the duration of the applicable copyright in the Work). Notwithstanding the above, Licensor reserves the right to release the Work under different license terms or to stop distributing the Work at any time; provided, however that any such election will not serve to withdraw this License (or any other license that has been, or is required to be, granted under the terms of this License), and this License will continue in full force and effect unless terminated as stated above.
8. Miscellaneous
   1. Each time You distribute or publicly digitally perform the Work or a Collective Work, the Licensor offers to the recipient a license to the Work on the same terms and conditions as the license granted to You under this License.
   2. Each time You distribute or publicly digitally perform a Derivative Work, Licensor offers to the recipient a license to the original Work on the same terms and conditions as the license granted to You under this License.
   3. If any provision of this License is invalid or unenforceable under applicable law, it shall not affect the validity or enforceability of the remainder of the terms of this License, and without further action by the parties to this agreement, such provision shall be reformed to the minimum extent necessary to make such provision valid and enforceable.
   4. No term or provision of this License shall be deemed waived and no breach consented to unless such waiver or consent shall be in writing and signed by the party to be charged with such waiver or consent.
   5. This License constitutes the entire agreement between the parties with respect to the Work licensed here. There are no understandings, agreements or representations with respect to the Work not specified here. Licensor shall not be bound by any additional provisions that may appear in any communication from You. This License may not be modified without the mutual written agreement of the Licensor and You.
