You are on page 1of 47

Building a large scale SaaS app

Open Source, Storage and Scalability


Dan Hanley, CTO
http://www.magus.co.uk
14 March, 2008

1
Agenda
Who are Magus?
ƒ What do we do?
ƒ Who do we do it for?

How do we do it?
ƒ SOA
ƒ Scalability
ƒ Storage
ƒ F/OSS

2
The Magus proposition
• Leading provider of innovative web-content
engineering
g g solutions to g
global corporations
p
• Specialise in managed applications that help
clients build value from their online assets and from
the wider web
• Three main applications:
ƒ ActiveStandards
ƒ RemoteSearch
ƒ CrucialInformation
• Delivering solutions since 1995
Our managed applications

Delivering Software-as-a-service (ASP model)

z ActiveStandards designed to help companies stay on-brand, on-line


by tracking and managing corporate web standards compliance,
worldwide

z RemoteSearch a multi-site
multi site search engine,
engine providing integrated search
frameworks for enterprise websites

z CrucialInformation
C i lI f ti a premiumi currentt awareness service
i d delivering
li i
high-quality, strategic intelligence from the web and syndicated services

4
ActiveStandards

5
RemoteSearch

6
CrucialInformation

7
Social Networking

8
Our clients

9
Technically - where we were
1 product
• Web
W bd design
i bbusiness
i
• All home grown
• No appservers
• No failover
• No common
infrastructure
• Scalability worries
• No version
ersion control
• Unclear methodology

10
Technically – where we are now

• 3 main applications
pp
• Bespoke capability
• Common
infrastructure
• Platform of services
• Fault tolerant
• Scalable
• Defined process &
methodology

11
Approach

• Do a lot with a little – 35 p


people,
p ,p punching
g
above our weight

• Don't reinvent the wheel

• Extract commonality – keep it DRY

12
The components of the stack

• Trawl • Routing
• Harvest • Store
• Index • Quartz
• Search • ClientEngine
• Analysis • Profile
• Monitor • LinkChecker

13
Logical architecture

REST (not SOAP)

14
Trawl

• Responsible for managing the gathering of data in its raw


form into the Store.
• Currently have Trawlers for:
ƒ HTTP
ƒ FTP (several flavors)
ƒ RSS,
RSS A Atom etc
ƒ SMTP
ƒ Google
G l
ƒ Technorati
ƒ Moreover
M
ƒ FT (several flavors)

15
Trawler service
Pluggable architecture based on JMX Mbean service

16
Harvest

• Responsible for extracting explicit data from Links


and
d storing
t i ththe fi
fielded
ld d d
data
t iin th
the d
database,
t b andd th
the
non fielded data in the Store.

17
Harvest service
Pluggable architecture based on JMX Mbean service

18
Index

• Responsible for building, purging, maintaining indices.

19
Search

• Responsible for searching indices and delivering results.

20
Analysis

• Responsible for deriving scores for information implicit in the page


ƒ Sentiment
ƒ Readability
ƒ Language
g g detection etc

21
Monitor

− Badly named, should be called “Classifier”


− Responsible for creating filings between Links and Categories.
− A Link
Li k can b
be a b
bookmark,
k k news iitem, bl
blog article
i l etc.
− A Category can be Users Bookmarks, News Topic, an AST Guideline etc.

22
Classifier (monitor) service
Pluggable architecture based on JMX Mbean service

23
LinkChecker

• Responsible for checking the life of links and removing them correctly
from the system when they have expired
expired.

24
Routing

• Manages the workflow of jobs through the stack


• Has the capability to dynamically loadbalance workloads
workloads.

25
Content stores

z We needed a multiple terabyte (currently 24 TB) distributed, fail


safe,
f filesystem
fil
z NFS was crumbling under load

z ZFS was vapourware

z Lustre was too complex

z We built our own!

z Magus Contentstores, responsible for holding both the raw and

processed non fielded content of links which have been trawled


and harvested

26
Content stores - configuration

<mbean code="uk.co.magus.store.service.StoreService" name="magus.service.store:service=StoreServiceLocalCalls">


<attribute name="JndiName">magus/services/StoreServiceLocalCalls</attribute>
<attribute name="Config">
<TryEachStripeStore>
<List>
<MirrorStore>
<List>
<RemoteStore>nas:1299;StoreServiceRemoteCallsInvokeTarget</RemoteStore>
<RemoteStore>m4:1099;StoreServiceRemoteCallsInvokeTarget</RemoteStore>
</List>
</MirrorStore>
<MirrorStore>
<List>
<RemoteStore>nas:1199;StoreServiceRemoteCallsInvokeTarget</RemoteStore>
<RemoteStore>m5:1099;StoreServiceRemoteCallsInvokeTarget</RemoteStore>
</List>
</MirrorStore>
</List>
</TryEachStripeStore>
</attribute>
<depends>jboss:service=Naming</depends>
</mbean>

27
28
Store Interfaces

29
Store JMX Beans

30
Contentstore - engines
Can use many types of engine on a node
Currently supports:
ƒ Mysql
ƒ SleepyCat
ƒ Filesystem

These can be decorated to enhance functionality


Content Store Classes

32
Quartz

• Responsible for firing messages on time.


• The “heartbeat” of the stack.

33
Client Engine

• Responsible for stack based processing for Client


A li ti
Applications.
• Keeps “heavy lifting” out of the Web Tier.
• Coordinates Client Applications requests across
multiple stack services.

34
Management Application

zManage
g taxonomyy
zManage rules

zManage scheduling

zFocus on managing the business

zLeave
L service
i managementt tto JMX or web
b
consoles
zSwing

35
Management App
Management App
Management App
Management App
Profile

• An internal service used to collect metrics on


system
t wide
id performance
f

40
41
Infrastructure
architecture
hit t

42
Methodology

• Agility
g y – sprints
p

• Issue tracking – Jira

• Regular,
R l scheduled,
h d l d d
deployments
l t

• Consolidated build & version control

43
Deployment
Deployment
1. C heck out

Subversion
(Code repository )
2. C ode /
Loc al T est

4. auto C heck out

D eveloper
3. C hec k In

Developer Local Box

6 & 12 N otify

5 . Build / U nit T es ts /
Metr ic s

7 . Publish r esults

8. D eploy Bamboo

10. D eploy D ependenc ies


9 & 11 . F IT T ests

D ev C luster

13 . Pr epar e R eleas e N ote


P a y to
$

R elease N ote 14. G et R eleas e N ote

15 . R eject R elease G r anite T L 16. G et Applic ation Ar tifac ts


Pr oduc t O w ner /
D ev T L

17 . Manage T est & Pr oduction Envir onments

18. D eploy Applications


Str ess T est T est C luster

19 . D eploy Applications

Jboss ON Pr oduc tion

44
Throughput

• 11,000
, sources in system
y

• ~16
16,000,000
000 000 pages rolling store

• ~200,000
200 000 new pages per day
d

• Average < 2 minutes from page detection to


fully classified and indexed.

45
Cost comparisons

• Apples
pp and oranges?
g
Proprietary Licence Free
Product Per CPU CPUs Total Product Per CPU CPUs Total
O l
Oracle 20 000 00
20,000.00 10 $200 000
$200,000 M S l
MySql $0 00
$0.00 10 $0.00
$0 00
Weblogic AS 10,000.00 38 $380,000 Jboss AS $0.00 38 $0.00
MS Windows Server 3,919.00 48 $188,112 Redhat/Apa $0.00 48 $0.00
Visual Team Studio 1,000.00 12 $12,000 Eclipse $0.00 12 $0.00
ClearCase 4,125.00 1 $4,125 Subversion $0.00 1 $0.00
Jira 2 000 00
2,000.00 1 $2 000
$2,000 Trac $0 00
$0.00 1 $0 00
$0.00
Autonomy IDOL bundl 75,000.00 2 $150,000 Carrot2 $0.00 12 $0.00
IBM Intelligent Datami 132,000.00 1 $132,000 LingPipe $0.00 12 $0.00
Verity K2 50,000.00 2 $100,000 Lucene $0.00 8 $0.00
UIMA $0.00 12 $0.00
$1,068,237 $0.00
£580,531.26
€849,629.77

46
Thank you

Questions?

47

You might also like