
Mesosphere

Documentation for Version 1.9
    High Availability
Networking
    Load Balancing and VIPs
    Marathon-LB
    High-Availability
    DNS Quick Reference
Upgrading
Installing DC/OS
    Upgrading
    DC/OS Custom Installation Options
    DC/OS Cloud Installation Options
    Local
    High-Availability
    DC/OS Ports
    Opt-Out
    Frequently Asked Questions
    Troubleshooting a Custom Installation
Administering Clusters
    Monitoring, Logging, and Debugging
        Performance Monitoring
        Performance Monitoring
        Performance Monitoring
        Logging
        Debugging from the DC/OS Web Interface
        Debugging
Jobs
    Quick Start
Tutorials
    Building an IoT Pipeline
    Autoscaling with Marathon
    Creating and Running a Service
    Deploying Marathon Apps with Jenkins
    Labeling Tasks and Jobs
Release Notes
GUI
CLI
Security
    Managing users and groups
    Identity provider-based authentication
Storage
    Mount Disk Resources
    External Persistent Volumes
    Local Persistent Volumes
Metrics
    Quick Start
    Metrics API
    Metrics Reference
Deploying Jobs
Deploying Services and Pods
    Installing Services
    Pods
    Monitoring Services
    Updating a User-Created Service
    Service Ports
    Exposing a Service
    Deploying non-native Marathons
    Marathon REST API
    Enabling GPU Support
    Frequently Asked Questions
Developing DC/OS Services
    Service Requirements Specification
    CLI Specification
    Creating a Universe Package
    Access by Proxy and VPN using DC/OS Tunnel
    DC/OS Integration
Documentation for Version 1.9
ENTERPRISE DC/OS Updated: April 17, 2017

Welcome to the documentation for DC/OS version 1.9. For information about new and
changed features, see the release notes.

Getting Started
The overview topics help you get started and learn the DC/OS fundamentals.
DC/OS is a distributed operating system based on the Apache Mesos distributed systems
kernel. ...

High Availability
Networking
Load Balancing and VIPs
Marathon-LB
High-Availability
DNS Quick Reference
Upgrading
This document provides instructions for upgrading a DC/OS cluster. If this upgrade is
performed on a supported OS with all prerequisites fulfilled, this upgrade should preserve
the...
Installing DC/OS
Enterprise DC/OS is designed to be configured, deployed, managed, scaled, and upgraded
on any cluster of physical or virtual machines. You can install DC/OS in the environment of
y...

Upgrading
DC/OS Custom Installation Options
DC/OS Cloud Installation Options
Local
High-Availability



Administering Clusters

Monitoring, Logging, and Debugging


Monitoring the health of all the pieces that make up DC/OS is vital to datacenter operators
and for troubleshooting hard-to-diagnose bugs. You can monitor the health of your clust...

Performance Monitoring
Performance Monitoring
Performance Monitoring
Logging
Debugging from the DC/OS Web Interface



Jobs
You can create scheduled jobs in DC/OS without installing a separate service. Create and
administer jobs in the DC/OS web interface, the DC/OS CLI, or via an API. Note: The Jobs
fu...

Quick Start
Tutorials
This is a collection of tutorials about using DC/OS. Learn how to run services and operate
services in production.

Building an IoT Pipeline


Autoscaling with Marathon
Creating and Running a Service
Deploying Marathon Apps with Jenkins
Labeling Tasks and Jobs


Release Notes

GUI
The DC/OS web interface provides a rich graphical view of your DC/OS cluster. With the
web interface you can view the current state of your entire cluster and DC/OS services. The
w...
CLI
You can use the DC/OS command-line interface (CLI) to manage your cluster nodes, install
DC/OS packages, inspect the cluster state, and administer the DC/OS service
subcommands. Yo...
Security
Enterprise DC/OS makes managing users easier with LDAP, SAML, and OpenID Connect
integrations. You can also use permissions to define which resources users can access. In
strict a...

Managing users and groups


Identity provider-based authentication
Storage
DC/OS applications lose their state when they terminate and are
relaunched. In some contexts, for instance, if your application uses MySQL, or if ...

Mount Disk Resources


External Persistent Volumes
Local Persistent Volumes
Metrics
The metrics component provides metrics from DC/OS cluster hosts,
containers running on those hosts, and from applications running on DC/OS that se...

Quick Start
Metrics API
Metrics Reference
Deploying Jobs
You can create scheduled jobs in DC/OS without installing a separate service. Create and
administer jobs in the DC/OS web interface, the DC/OS CLI, or via an API. Note: The Jobs
fu...
Deploying Services and Pods
DC/OS uses Marathon to manage processes and services. Marathon is the init system for
DC/OS. Marathon starts and monitors your applications and services, automaticall...

Installing Services
Pods
Monitoring Services
Updating a User-Created Service
Service Ports



Developing DC/OS Services
This section describes the developer-specific DC/OS components, explaining what is
necessary to package and provide your own service on DC/OS. The Mesosphere
Datacenter Operating S...

Service Requirements Specification


CLI Specification
Creating a Universe Package
Access by Proxy and VPN using DC/OS Tunnel
DC/OS Integration



MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
GETTING STARTED

High Availability
Updated: April 17, 2017

This document discusses the high availability (HA) features in DC/OS and best practices for
building HA applications on DC/OS.

Leader/Follower Architecture
A common pattern in HA systems is the leader/follower concept. This is also sometimes
referred to as: master/slave, primary/replica, or some combination thereof. This architecture
is used when you have one authoritative process, with N standby processes. In some
systems, the standby processes might also be capable of serving requests or performing
other operations. For example, when running a database like MySQL with a master and
replica, the replica is able to serve read-only requests, but it cannot accept writes (only the
master will accept writes).

In DC/OS, a number of components follow the leader/follower pattern. We'll discuss some of
them here and how they work.

Mesos

Mesos can be run in high availability mode, which requires running 3 or 5 masters. When run
in HA mode, one master is elected as the leader, while the other masters are followers. Each
master has a replicated log which contains some state about the cluster. The leading master
is elected by using ZooKeeper to perform leader election. For more detail on this, see the
Mesos HA documentation.

Marathon

Marathon can be run in HA mode, which allows running multiple Marathon instances (at
least 2 for HA), with one elected leader. Marathon uses ZooKeeper for leader election. The
followers do not accept writes or API requests, instead the followers proxy all API requests
to the leading Marathon instance.

ZooKeeper

ZooKeeper is used by numerous services in DC/OS to provide consistency. ZooKeeper can


be used as a distributed locking service, a state store, and a messaging system. ZooKeeper
uses Paxos-like log replication (https://en.wikipedia.org/wiki/Paxos_(computer_science))
and a leader/follower architecture to maintain consistency across multiple ZooKeeper
instances. For a more detailed explanation of how ZooKeeper works, check out the
ZooKeeper internals document.

Fault Domain Isolation


Fault domain isolation is an important part of building HA systems. To correctly handle
failure scenarios, systems must be distributed across fault domains to survive outages.
There are different types of fault domains, a few examples of which are:

Physical domains: this includes machine, rack, datacenter, region, and availability zone.
Network domains: machines within the same network may be subject to network
partitions. For example, a shared network switch may fail or have invalid configuration.

With DC/OS, you can distribute masters across racks for HA. Agents can be distributed
across regions, and it's recommended that you tag agents with attributes to describe their
location. Synchronous services like ZooKeeper should also remain within the same region to
reduce network latency. For more information, see the Configuring High-Availability
documentation.

Applications which require HA should also be distributed across fault domains. With
Marathon, this can be accomplished by using the UNIQUE and GROUP_BY constraint operators.
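For illustration, a minimal Marathon app fragment using these constraint operators might
look like the sketch below; the rack_id attribute is an assumption and only applies if your
agents have been tagged with such an attribute:

    {
      "id": "/my-ha-app",
      "instances": 3,
      "constraints": [
        ["hostname", "UNIQUE"],
        ["rack_id", "GROUP_BY"]
      ]
    }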

Separation of Concerns
HA services should be decoupled, with responsibilities divided amongst services. For
example, web services should be decoupled from databases and shared caches.

Eliminating Single Points of Failure


Single points of failure come in many forms. For example, a service like ZooKeeper can
become a single point of failure when every service in your system shares one ZooKeeper
cluster. You can reduce risks by running multiple ZooKeeper clusters for separate services.
There's an Exhibitor Universe package that makes this easy.

Other common single points of failure include:

Single database instances (for example, a single MySQL server)


One-off services
Non-HA load balancers

Fast Failure Detection


Fast failure detection comes in many forms. Services like ZooKeeper can be used to provide
failure detection, such as detecting network partitions or host failures. Service health checks
can also be used to detect certain types of failures. As a matter of best practice, services
should expose health check endpoints, which can be used by services like Marathon.
Fast Failover
When failures do occur, failover should be as fast as possible. Fast failover can be achieved
by:

Using an HA load balancer like Marathon-LB, or the internal Layer 4 load balancer.
Building apps in accordance with the 12-factor app manifesto.
Following REST best-practices when building services: in particular, avoiding storing
client state on the server between requests.

A number of DC/OS services follow the fail-fast pattern in the event of errors. Specifically,
both Mesos and Marathon will shut down in the case of unrecoverable conditions such as
losing leadership.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9

Networking
ENTERPRISE DC/OS Updated: April 17, 2017

Welcome to the documentation for DC/OS version 1.9. For information about new and
changed features, see the release notes.

Load Balancing and VIPs


DC/OS comes with an east-west load balancer that's meant to be used to enable multi-tier
microservices architectures. It acts as a TCP Layer 4 load balancer, and it's t...
Marathon-LB
Marathon-LB is based on HAProxy, a rapid proxy and load balancer. HAProxy provides
proxying and load balancing for TCP and HTTP based applications, with features such as
SSL suppor...
High-Availability
This document discusses the high availability (HA) features in DC/OS and best practices for
building HA applications on DC/OS. Terminology Zone A zone is a failure domain that has ...
DNS Quick Reference
This quick reference provides a summary of the available options. To help explain, we'll use
this imaginary application: The Service is in the following hierarchy: Group: out...

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
NETWORKING

Load Balancing and VIPs


Updated: April 17, 2017

DC/OS comes with an east-west load balancer that's meant to be used to enable multi-tier
microservices architectures. It acts as a TCP Layer 4 load balancer, and it's tightly integrated
with the kernel.

Usage
You can use the layer 4 load balancer by assigning a VIP from the DC/OS web interface.
Alternatively, if you're using something other than Marathon, you can create a label on the
port protocol buffer while launching a task on Mesos. This label's key must be in the format
VIP_$IDX, where $IDX is replaced by a number, starting from 0. Once you create a task, or a
set of tasks with a VIP, they will automatically become available to all nodes in the cluster,
including the masters.
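For example, with Marathon this label can be attached to a port definition in the app's JSON.
The fragment below is only a sketch; the VIP address and port (1.2.3.4:5000) are placeholders,
and the rest of the app definition is omitted:

    "portDefinitions": [
      {
        "port": 0,
        "protocol": "tcp",
        "labels": { "VIP_0": "1.2.3.4:5000" }
      }
    ]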

Details
When you launch a set of tasks with these labels, DC/OS distributes them to all of the nodes
in the cluster. All of the nodes in the cluster act as decision makers in the load balancing
process. A process runs on all of the agents; the kernel consults this process when packets
are recognized with a VIP destination address. This process keeps track of the availability and
reachability of these tasks so that requests are sent to the right backends.

Recommendations
Caveats

Do not firewall traffic between the nodes.


Do not change ip_local_port_range.
You must have the ipset package installed.
You must use a supported operating system.
Persistent Connections

It is recommended that you keep long-running, persistent connections when you use VIPs,
because otherwise you can very quickly fill up the TCP socket table. The default local port
range on Linux allows source connections from 32768 to 61000. This allows 28232 connections
to be established between a given source IP and a destination address and port pair. TCP
connections must go through the TIME_WAIT state prior to being reclaimed, and the Linux
kernel's default TCP TIME_WAIT period is 120 seconds. Given this, you would exhaust the
connection table by making only about 235 new connections per second (28232 / 120 is roughly 235).

Health checks

We also recommend taking advantage of Mesos health checks. Mesos health checks are
surfaced to the load balancing layer. Marathon only converts command health checks to
Mesos health checks. You can simulate HTTP health checks via a command similar to test
"$(curl -4 -w '%{http_code}' -s http://localhost:${PORT0}/|cut -f1 -d" ")" == 200.
This ensures the HTTP status code returned is 200. It also assumes your application binds
to localhost. The ${PORT0} is set as a variable by Marathon. We do not recommend using
TCP health checks as they can be misleading as to the liveness of a service.

Important: Docker container command health checks are run inside the Docker container.
For example, if cURL is used to check NGINX, the NGINX container must have cURL
installed, or the container must mount /opt/mesosphere in RW mode.
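As a sketch, the command above could be wired into a Marathon app definition as a COMMAND
health check; the grace period, interval, timeout, and failure threshold below are illustrative
values rather than prescribed ones:

    "healthChecks": [
      {
        "protocol": "COMMAND",
        "command": {
          "value": "test \"$(curl -4 -w '%{http_code}' -s http://localhost:${PORT0}/|cut -f1 -d\" \")\" == 200"
        },
        "gracePeriodSeconds": 300,
        "intervalSeconds": 60,
        "timeoutSeconds": 20,
        "maxConsecutiveFailures": 3
      }
    ]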

Demo
If you would like to run a demo, you can configure a Marathon app as mentioned above, and
use the URI https://s3.amazonaws.com/sargun-mesosphere/linux-amd64, as well as the
command chmod 755 linux-amd64 && ./linux-amd64 -listener=:${PORT0} -say-string=version1 to
execute it. You can then test it by hitting the application with the command
curl http://1.2.3.4:5000. This app exposes an HTTP API that answers with the PID,
hostname, and the say-string that's specified in the app definition. In
addition, it exposes a long-running endpoint at http://1.2.3.4:5000/stream, which will
continue to stream until the connection is terminated. The code for the application is
available here: https://github.com/mesosphere/helloworld.
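Putting the demo pieces together, a Marathon app definition along these lines could be used.
This is a sketch only: the app id, resource sizes, and the VIP 1.2.3.4:5000 are illustrative
placeholders, while the URI and command come from the description above:

    {
      "id": "/vip-demo",
      "cmd": "chmod 755 linux-amd64 && ./linux-amd64 -listener=:${PORT0} -say-string=version1",
      "uris": ["https://s3.amazonaws.com/sargun-mesosphere/linux-amd64"],
      "cpus": 0.1,
      "mem": 64,
      "instances": 1,
      "portDefinitions": [
        { "port": 0, "protocol": "tcp", "labels": { "VIP_0": "1.2.3.4:5000" } }
      ]
    }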

Exposing it to the outside

Prior to this, you had to run a complex proxy that would reconfigure itself based on the tasks
running on the cluster. Fortunately, you no longer need to do this. Instead, you can use an
incredibly simple HAProxy configuration like so:

defaults
  log global
  mode tcp
  contimeout 50000000
  clitimeout 50000000
  srvtimeout 50000000

listen appname 0.0.0.0:80
  mode tcp
  balance roundrobin
  server mybackend 1.2.3.4:5000

A Marathon app definition for this looks like:

{ "acceptedResourceRoles": [ "slave_public" ], "container": { "docker": { "image":


"sargun/haproxy-demo:3", "network": "HOST" }, "type": "DOCKER" }, "cpus": 0.5, "env": {
"CONFIGURL":
"https://gist.githubusercontent.com/sargun/3037bdf8be077175e22c/raw/be172c88f4270d9dfe409114a36
21a28d01294c3/gistfile1.txt" }, "instances": 1, "mem": 128, "ports": [ 80 ], "requirePorts":
true }

This will run an HAProxy on the public agent, on port 80. If you'd like, you can make the
number of instances equal to the number of public agents. Then, you can point your external
load balancer at the pool of public agents on port 80. Adapting this would simply involve
changing the backend entry, as well as the external port.

Potential Roadblocks
IP Overlay
Problems can arise if the VIP address that you specified is used elsewhere in the network.
Although the VIP is a 3-tuple, it is best to ensure that the IP dedicated to the VIP is only in
use by the load balancing software and isn't in use at all in your network. Therefore, you
should choose IPs from the RFC1918 range.

IPSet
You must have the command ipset installed. If you do not, you may see an error like:

15:15:59.731 [error] Unknown response: {ok,"iptables v1.4.21: Set minuteman doesn't
exist.\n\nTry `iptables -h' or 'iptables --help' for more information.\n"}
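If ipset is missing, it can be installed from your distribution's package manager. On CentOS
or RHEL, for example (the same command appears in the upgrade prerequisites later in this
document):

    sudo yum install -y ipset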

Ports
Ports 61420 and 61421 must be open for the load balancer to work correctly. Because
the load balancer maintains a partial mesh, it needs to ensure that connectivity between
nodes is unhindered.
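If a host-level firewall is nevertheless in place, these ports must be allowed between nodes.
One possible form, assuming iptables is managed directly (adjust to your own firewall tooling
and policies, and note the general recommendation above not to firewall traffic between nodes):

    sudo iptables -A INPUT -p tcp --dport 61420 -j ACCEPT
    sudo iptables -A INPUT -p tcp --dport 61421 -j ACCEPT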

Connection table exhaustion


If you begin to see the behaviour described earlier, where the connection table is being
exhausted, you'll see various errors in the logs. You can set two sysctls to alleviate this
issue, but they do not come without caveats.

net.netfilter.nf_conntrack_tcp_timeout_time_wait=0: You can set this to 0, but doing so
may break connection tracking for other applications that rely on the TIME_WAIT state.
net.ipv4.tcp_tw_reuse=1: This sysctl can be dangerous and can break firewalls as well as
NAT implementations. However, if the firewall properly implements TCP timestamp tracking,
it will be okay. Do not set the net.ipv4.tcp_tw_recycle sysctl, as it is not RFC-compliant
and will break firewall connection tracking.

More information about these sysctls can be found here:


https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt.
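A minimal sketch of applying these settings at runtime is shown below; they can also be
persisted through /etc/sysctl.d/, and the caveats above should be weighed before doing either:

    sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=0
    sudo sysctl -w net.ipv4.tcp_tw_reuse=1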
Debugging
The load balancer exposes a few endpoints on every node in the DC/OS cluster that can be
used for gathering statistics. The URI for these metrics is
http://localhost:61421/metrics. It includes data about the backends and the dataplane
runtime.

Implementation
The local process polls the master node roughly every 5 seconds. The master node caches
this for 5 seconds as well, bounding the propagation time for an update to roughly 11
seconds. Although this is the case for new VIPs, it is not the case for failed nodes.

Data plane
The load balancer dataplane primarily utilizes Netfilter. The load balancer installs 4 IPTables
rules to enable this, therefore the load balancer must start after firewalld, or any other
destructive IPTables management system. These 4 rules tell IPTables to put the packets
that match them on an NFQueue. NFQueue is a kernel subsystem that allows userspace to
process network traffic.

The rules are of two types: the first type intercepts the traffic, and the second drops it.
The purpose of the latter rule is to provide an immediate connection reset to the client. The
former set of rules matches based on the combination of a TCP packet, the SYN flag (and
no others), and an IPSet match which is populated with the list of VIPs.

Once the packet is on the nfqueue, we calculate the backend that the connection should be
mapped to. We use this information to program in an in-kernel conntrack entry which maps
(port DNATs) the 5-tuple from the original destination (the VIP) to the new destination (the
backend). In some cases where hairpin load balancing occurs, SNAT may be required as
well.

Once the NAT programming is done, the packet is released back into the kernel. Since our
rules are in the raw chain, the packet doesn't yet have a conntrack entry associated with it.
The conntrack subsystem recognizes the connection based on the prior program and
continues to handle the rest of the flow independently from the load balancer.

Load balancing algorithm


The load balancing algorithm is adapted from The Power of Two Choices in Randomized
Load Balancing (IEEE Trans. Parallel Distrib. Syst. 12, Michael Mitzenmacher). We switch
between the simple form of this algorithm and a more probabilistic variant, depending on
whether more than 10 backends exist for a given VIP. The simple vs. probabilistic
algorithm utilization is exposed in the metrics information.

The simple algorithm maintains an EWMA of latencies for a given backend at connection
time. It also maintains a record of consecutive failures and when they happened. If a backend
observes enough consecutive failures in a short period of time (less than 5 minutes), it is
considered unavailable. A failure is classified as the three-way handshake failing to complete.
The algorithm works primarily by iterating over the backends and finding those that are
assumed to be available after taking into account the historical failures as well as the group
failure detector. It then takes two random nodes from the most-available bucket.

The probabilistic variant randomly chooses backends and checks whether the group failure
detector considers the agent to be alive. It continues to do this until it either finds 2
backends that are in the ideal bucket, or until 20 lookups happen. In the former case, it
chooses one of the 2 at random; in the latter case, it chooses one of the 20 at random.

Failure detection
The load balancer includes a state-of-the-art failure detection scheme. This failure detection
scheme draws on work done in HyParView. The failure detector maintains
a fully connected sparse graph of connections amongst the nodes in the cluster.

Every node maintains an adjacency table. These adjacency tables are gossiped to every
other node in the cluster. These adjacency tables are then used to build an application-level
multicast overlay.

These connections are monitored via an adaptive ping algorithm. The adaptive ping
algorithm maintains a window of pings between neighbors, and if a ping times out, the
connection is severed. Once this connection is severed, the new adjacencies are gossiped to
all other nodes, potentially triggering cascading health checks. This allows the system to
detect failures in less than a second. However, the system applies backpressure when there
are many failures, and failure detection time can rise to 30 seconds.

Next steps
Assign a VIP to your application

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
NETWORKING

Marathon-LB
Updated: April 17, 2017

Marathon-LB is based on HAProxy, a rapid proxy and load balancer. HAProxy provides
proxying and load balancing for TCP and HTTP based applications, with features such as
SSL support, HTTP compression, health checking, Lua scripting and more. Marathon-LB
subscribes to Marathon's event bus and updates the HAProxy configuration in real time.

Up to date documentation for Marathon-LB can be found on the GitHub page.

Marathon-LB GitHub project


Detailed templates documentation

You can configure Marathon-LB with various topologies. Here are some examples of
how you might use Marathon-LB:

Use Marathon-LB as your edge load balancer and service discovery mechanism. You
could run Marathon-LB on public-facing nodes to route ingress traffic. You would use the
IP addresses of your public-facing nodes in the A-records for your internal or external
DNS records (depending on your use-case).
Use Marathon-LB as an internal LB and service discovery mechanism, with a separate
HA load balancer for routing public traffic in. For example, you may use an external F5
load balancer on-premise, or an Elastic Load Balancer on Amazon Web Services.
Use Marathon-LB strictly as an internal load balancer and service discovery mechanism.
You might also want to use a combination of internal and external load balancers, with
different services exposed on different load balancers.

Here we discuss Marathon-LB as an edge load balancer and as an internal and external
load balancer.

Marathon-LB as an edge load balancer

Marathon-LB as an internal and external load balancer

Next Steps
Install

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
NETWORKING
High-Availability
PREVIEW Updated: April 17, 2017

This document discusses the high availability (HA) features in DC/OS and best practices for
building HA applications on DC/OS.

Terminology
Zone
A zone is a failure domain that has isolated power, networking, and connectivity. Typically, a
zone is a single data center or independent fault domain on-premise, or managed by a cloud
provider. For example, AWS Availability Zones or GCP Zones. Servers within a zone are
connected via high bandwidth (e.g. 1-10+ Gbps), low latency (up to 1 ms), and low cost
links.

Region
A region is a geographical region, such as a metro area, that consists of one or more zones.
Zones within a region are connected via high bandwidth (e.g. 1-4 Gbps), low latency (up to
10 ms), low cost links. Regions are typically connected through public internet via variable
bandwidth (e.g. 10-100 Mbps) and latency (100-500 ms) links.

Rack
A rack is typically composed of a set of servers (nodes). A rack has its own power supply
and switch (or switches), all attached to the same frame. On public cloud platforms such as
AWS, there is no equivalent concept of a rack.

General Recommendations
Latency
DC/OS master nodes should be connected to each other via highly available and low latency
network links. This is required because some of the coordinating components running on
these nodes use quorum writes for high availability. For example, Mesos masters, Marathon
schedulers, and ZooKeeper use quorum writes.

Similarly, most DC/OS services use ZooKeeper (or etcd, consul, etc) for scheduler leader
election and state storage. For this to be effective, service schedulers must be connected to
the ZooKeeper ensemble via a highly available, low latency network link.

Routing
DC/OS networking requires a unique address space. Cluster entities cannot share the same
IP address. For example, apps and DC/OS agents must have unique IP addresses.

All IP addresses should be routable within the cluster.

Leader/Follower Architecture
A common pattern in HA systems is the leader/follower concept. This is also sometimes
referred to as: master/slave, primary/replica, or some combination thereof. This architecture
is used when you have one authoritative process, with N standby processes. In some
systems, the standby processes might also be capable of serving requests or performing
other operations. For example, when running a database like MySQL with a master and
replica, the replica is able to serve read-only requests, but it cannot accept writes (only the
master will accept writes).

In DC/OS, a number of components follow the leader/follower pattern. We'll discuss some of
them here and how they work.

Mesos

Mesos can be run in high availability mode, which requires running 3 or 5 masters. When run
in HA mode, one master is elected as the leader, while the other masters are followers. Each
master has a replicated log which contains some state about the cluster. The leading master
is elected by using ZooKeeper to perform leader election. For more detail on this, see the
Mesos HA documentation.

Marathon

Marathon can be run in HA mode, which allows running multiple Marathon instances (at
least 2 for HA), with one elected leader. Marathon uses ZooKeeper for leader election. The
followers do not accept writes or API requests, instead the followers proxy all API requests
to the leading Marathon instance.

ZooKeeper

ZooKeeper is used by numerous services in DC/OS to provide consistency. ZooKeeper can


be used as a distributed locking service, a state store, and a messaging system. ZooKeeper
uses Paxos-like log replication and a leader/follower architecture to maintain consistency
across multiple ZooKeeper instances. For a more detailed explanation of how ZooKeeper
works, check out the ZooKeeper internals document.
Fault Domain Isolation
Fault domain isolation is an important part of building HA systems. To correctly handle
failure scenarios, systems must be distributed across fault domains to survive outages.
There are different types of fault domains, a few examples of which are:

Physical domains: this includes machine, rack, datacenter, region, and availability zone.
Network domains: machines within the same network may be subject to network
partitions. For example, a shared network switch may fail or have invalid configuration.

For more information, see the multi-zone and multi-region documentation.

Applications which require HA should also be distributed across fault domains. With
Marathon, this can be accomplished by using the UNIQUE and GROUP_BY constraint operators.

Separation of Concerns
HA services should be decoupled, with responsibilities divided amongst services. For
example, web services should be decoupled from databases and shared caches.

Eliminating Single Points of Failure


Single points of failure come in many forms. For example, a service like ZooKeeper can
become a single point of failure when every service in your system shares one ZooKeeper
cluster. You can reduce risks by running multiple ZooKeeper clusters for separate services.
There's an Exhibitor Universe package that makes this easy.

Other common single points of failure include:


Single database instances (for example, a single MySQL server)
One-off services
Non-HA load balancers

Fast Failure Detection


Fast failure detection comes in many forms. Services like ZooKeeper can be used to provide
failure detection, such as detecting network partitions or host failures. Service health checks
can also be used to detect certain types of failures. As a matter of best practice, services
should expose health check endpoints, which can be used by services like Marathon.

Fast Failover
When failures do occur, failover should be as fast as possible. Fast failover can be achieved
by:
Using an HA load balancer like Marathon-LB, or the internal Layer 4 load balancer.
Building apps in accordance with the 12-factor app manifesto.
Following REST best-practices when building services: in particular, avoiding storing
client state on the server between requests.

A number of DC/OS services follow the fail-fast pattern in the event of errors. Specifically,
both Mesos and Marathon will shut down in the case of unrecoverable conditions such as
losing leadership.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
NETWORKING

DNS Quick Reference


Updated: April 17, 2017

This quick reference provides a summary of the available options.

To help explain, we'll use this imaginary application:

The Service is in the following hierarchy:


Group: outergroup > Group: subgroup > Service Name: myapp

Port: 555
Port Name: myport
Load Balanced

Running on a Virtual Network


Running on the Marathon framework (if unsure, it's probably this one)
If you are running another framework, then replace any instance of marathon with the
name of your framework.

Service Discovery Options


Use one of these options to find the DNS name for your task.
You should choose the first option that satisfies your requirements:

outergroupsubgroupmyapp.marathon.l4lb.thisdcos.directory:555
This is only available when the service is load balanced. :555 is not a part of the DNS
address, but is there to show that this address and port is load balanced as a pair
rather than individually.

myapp-subgroup-outergroup.marathon.containerip.dcos.thisdcos.directory
This is only available when the service is running on a virtual network.

myapp-subgroup-outergroup.marathon.agentip.dcos.thisdcos.directory
This is always available and should be used when the service is not running on a
virtual network.

myapp-subgroup-outergroup.marathon.autoip.dcos.thisdcos.directory
This is always available and should be used to address an application that is
transitioning on or off a virtual network.

myapp-subgroup-outergroup.marathon.mesos
This is always available, and is equivalent for the most part to the agentip. However, it
is less specific and less performant than the agentip, so its use is discouraged.

Other discovery option(s):


_myport._myapp.subgroup.outergroup._tcp.marathon.mesos
This is not a DNS A record but rather a DNS SRV record. This is only available when
the port has a name. SRV records are a mapping from a
name to an Address + Port pair.
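To check what these names resolve to from a node inside the cluster, standard DNS tools can
be used. For example, assuming dig is installed:

    dig +short myapp-subgroup-outergroup.marathon.agentip.dcos.thisdcos.directory
    dig +short SRV _myport._myapp.subgroup.outergroup._tcp.marathon.mesos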

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9

Upgrading
ENTERPRISE DC/OS Updated: April 17, 2017

This document provides instructions for upgrading a DC/OS cluster.

If this upgrade is performed on a supported OS with all prerequisites fulfilled, this upgrade
should preserve the state of running tasks on the cluster. This document reuses portions of
the Advanced DC/OS Installation Guide.

Important:

Review the release notes before upgrading DC/OS.


The DC/OS GUI and other higher-level system APIs may be inconsistent or unavailable
until all master nodes have been upgraded. For example, an upgraded DC/OS Marathon
leader cannot connect to the leading Mesos master until it has also been upgraded.
When this occurs:
The DC/OS GUI may not provide an accurate list of services.
For multi-master configurations, after one master has finished upgrading, you can
monitor the health of the remaining masters from the Exhibitor UI on port 8181.

The VIP features, added in DC/OS 1.8, require that ports 32768 - 65535 are open
between all agent and master nodes for both TCP and UDP.
Virtual networks require Docker 1.11 or later. For more information, see the
documentation.
An upgraded DC/OS Marathon leader cannot connect to a non-secure (i.e., not
upgraded) leading Mesos master. The DC/OS UI cannot be trusted until all masters are
upgraded. There are multiple Marathon scheduler instances and multiple Mesos masters,
each being upgraded, and the Marathon leader may not be the Mesos leader.
Task history in the Mesos UI will not persist through the upgrade.

Modifying DC/OS configuration


You cannot change your cluster configuration at the same time as upgrading to a new
version. Cluster configuration changes must be done with an update to an already installed
version. For example, you cannot simultaneously upgrade a cluster from 1.9.x to 1.9.y and
add more public agents. You can add more public agents with an update to 1.9.x, and then
upgrade to 1.9.y. Or you can upgrade to 1.9.y and then add more public agents by updating
1.9.y after the upgrade.

To modify your DC/OS configuration, you must run the installer with the modified
config.yaml and update your cluster using the new installation files. Changes to the DC/OS
configuration have the same risk as upgrading a host. Incorrect configurations could
potentially crash your hosts, or an entire cluster.

Only a subset of DC/OS configuration parameters can be modified. The adverse effects on
any software that is running on top of DC/OS are outside the scope of this document.
Contact Mesosphere Support for more information.

Here is a list of the parameters that you can modify:

check_time

dns_search

docker_remove_delay

gc_delay

resolvers

telemetry_enabled

use_proxy
http_proxy

https_proxy

no_proxy

The security mode (security) can be changed but has special caveats.
You can only update to a stricter security mode. Security downgrades are not supported.
For example, if your cluster is in permissive mode and you want to downgrade to
disabled mode, you must reinstall the cluster and terminate all running workloads.

During each update, you can only increase your security by a single level. For example,
you cannot update directly from disabled to strict mode. To increase from disabled to
strict mode you must first update to permissive mode, and then update from permissive
to strict mode.

See the security mode documentation for a description of the different security modes and
what each means.

Instructions
These steps must be performed for version upgrades and cluster configuration changes.

Prerequisites
Mesos, Mesos Frameworks, Marathon, Docker and all running tasks in the cluster should
be stable and in a known healthy state.
For Mesos compatibility reasons, we recommend upgrading any running Marathon-on-
Marathon instances to Marathon version 1.3.5 before proceeding with this DC/OS
upgrade.
You must have access to copies of the config files used with the previous DC/OS version:
config.yaml and ip-detect.

You must be using systemd 218 or newer to maintain task state.


All hosts (masters and agents) must be able to communicate with all other hosts on all
ports, for both TCP and UDP.
In CentOS or RedHat, install IP sets with this command (used in some IP detect scripts):
$ sudo yum install -y ipset

You must be familiar with using systemctl and journalctl command line tools to review
and monitor service status. Troubleshooting notes can be found at the end of this
document.
You must be familiar with the Advanced DC/OS Installation Guide.
You should take a snapshot of ZooKeeper prior to upgrading. Marathon supports
rollbacks, but does not support downgrades.

Bootstrap Node
Choose your desired security mode and then follow the applicable upgrade instructions.

Installing DC/OS 1.9 without changing security mode


Installing DC/OS 1.9 in permissive mode
Installing DC/OS 1.9 in strict mode

Installing DC/OS 1.9 without changing security mode
This procedure upgrades a DC/OS 1.8 cluster to DC/OS 1.9 without changing the cluster's
security mode.

Copy your existing config.yaml and ip-detect files to an empty folder on your bootstrap
node.
Merge the old config.yaml into the new config.yaml format. In most cases the differences
will be minimal.
Important:

You cannot change the exhibitor_zk_backend setting during an upgrade.


The syntax of the DC/OS 1.9 config.yaml may be different from the 1.8 version. For a
detailed description of the 1.9 config.yaml syntax and parameters, see the
documentation.

Modify the ip-detect file as desired.


Build your installer package.
Download the dcos_generate_config.ee.sh file.
Generate the installation files. Replace <installed_cluster_version> in the below
command with the DC/OS version currently running on the cluster you intend to
upgrade, for example 1.8.8.
$ dcos_generate_config.ee.sh --generate-node-upgrade-script <installed_cluster_version>

The command in the previous step will produce a URL in the last line of its output,
prefixed with Node upgrade script URL:. Record this URL for use in later steps. It will
be referred to in this document as the Node upgrade script URL.
Run the nginx container to serve the installation files (an example command appears after this procedure).

Go to the DC/OS Master procedure to complete your installation.
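For the nginx step in the procedure above, one possible command is shown below. It assumes
Docker is available on the bootstrap node and that the generated files were written to
./genconf/serve; the host port and path are placeholders to adjust for your environment:

    sudo docker run -d -p 8080:80 -v $PWD/genconf/serve:/usr/share/nginx/html:ro nginx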

Installing DC/OS 1.9 in permissive mode


This procedure upgrades to DC/OS 1.9 in permissive security mode.

Prerequisite:
Your cluster must be upgraded to DC/OS 1.9 and running in disabled security mode
before it can be upgraded to permissive mode. If your cluster was running in permissive
mode before it was upgraded to DC/OS 1.9, you can skip this procedure.

To update a 1.9 cluster from disabled security to permissive security, complete the following
procedure:

Replace security: disabled with security: permissive in your config.yaml.


Modify the ip-detect file as desired.
Build your installer package.
Download the dcos_generate_config.ee.sh file.
Generate the installation files. Replace <installed_cluster_version> in the below
command with the DC/OS version currently running on the cluster you intend to
upgrade, for example 1.8.8.
$ dcos_generate_config.ee.sh --generate-node-upgrade-script <installed_cluster_version>

The command in the previous step will produce a URL in the last line of its output,
prefixed with Node upgrade script URL:. Record this URL for use in later steps. It will
be referred to in this document as the Node upgrade script URL.
Run the nginx container to serve the installation files.

Go to the DC/OS Master procedure to complete your installation.

Installing DC/OS 1.9 in strict mode


This procedure upgrades to DC/OS 1.9 in strict security mode.

If you are updating a running DC/OS cluster to run in security: strict mode, beware that
security vulnerabilities may persist even after migration to strict mode. When moving to strict
mode, your services will now require authentication and authorization to register with Mesos
or access its HTTP API. You should test these configurations in permissive mode before
upgrading to strict, to maintain scheduler and script uptimes across the upgrade.

Prerequisite:

Your cluster must be upgraded to DC/OS 1.9 and running in permissive security mode
before it can be updated to strict mode. If your cluster was running in strict mode before it
was upgraded to DC/OS 1.9, you can skip this procedure.

To update a cluster from permissive security to strict security, complete the following
procedure:

Replace security: permissive with security: strict in your config.yaml.


Modify the ip-detect file as desired.
Build your installer package.
Download the dcos_generate_config.ee.sh file.
Generate the installation files. Replace <installed_cluster_version> in the below
command with the DC/OS version currently running on the cluster you intend to
upgrade, for example 1.8.8.
$ dcos_generate_config.ee.sh --generate-node-upgrade-script <installed_cluster_version>

The command in the previous step will produce a URL in the last line of its output,
prefixed with Node upgrade script URL:. Record this URL for use in later steps. It will
be referred to in this document as the Node upgrade script URL.
Run the nginx container to serve the installation files.

Go to the DC/OS Master procedure to complete your installation.

DC/OS Masters
Proceed with upgrading every master node one-at-a-time in any order using the following
procedure. When you complete each upgrade, monitor the Mesos master metrics to ensure
the node has rejoined the cluster and completed reconciliation.

Download and run the node upgrade script:


$ curl -O <Node upgrade script URL>
$ sudo bash dcos_node_upgrade.sh

Validate the upgrade:


Monitor Exhibitor and wait for it to converge at
http://<master-ip>:8181/exhibitor/v1/ui/index.html. Confirm that the master rejoins
the ZooKeeper quorum successfully (the status indicator will turn green).
Tip: If you are upgrading from permissive to strict mode, this URL will be https://....

Wait until the dcos-mesos-master unit is up and running.


Verify that curl http://<dcos_master_private_ip>:5050/metrics/snapshot has the
metric registrar/log/recovered with a value of 1.
Tip: If you are upgrading from permissive to strict mode, this URL will be curl
https://... and you will need a JWT for access.

Verify that $ /opt/mesosphere/bin/mesos-master --version indicates that the upgraded


master is running Mesos 1.2.0.

Go to the DC/OS Agents procedure to complete your installation.

DC/OS Agents
Important: When upgrading agent nodes, there is a 5 minute timeout for the agent to
respond to health check pings from the mesos-masters before it is considered lost and its
tasks are given up for dead.
On all DC/OS agents:

Navigate to the /opt/mesosphere/lib directory and delete the libltdl.so.7 library file.
Deleting this file will prevent conflicts.
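For example, on each agent:

    sudo rm /opt/mesosphere/lib/libltdl.so.7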

Download and run the node upgrade script:


$ curl -O <Node upgrade script URL>
$ sudo bash dcos_node_upgrade.sh

Validate the upgrade:


Verify that curl http://<dcos_agent_private_ip>:5051/metrics/snapshot has the
metric slave/registered with a value of 1.
Monitor the Mesos UI to verify that the upgraded node rejoins the DC/OS cluster and
that tasks are reconciled (http://<master-ip>/mesos). If you are upgrading from
permissive to strict mode, this URL will be https://<master-ip>/mesos.

Troubleshooting Recommendations
The following commands should provide insight into upgrade issues:

On All Cluster Nodes

$ sudo journalctl -u dcos-download
$ sudo journalctl -u dcos-spartan
$ sudo systemctl | grep dcos

On DC/OS Masters

$ sudo journalctl -u dcos-exhibitor
$ less /opt/mesosphere/active/exhibitor/usr/zookeeper/zookeeper.out
$ sudo journalctl -u dcos-mesos-dns
$ sudo journalctl -u dcos-mesos-master

On DC/OS Agents

$ sudo journalctl -u dcos-mesos-slave

Notes:
Packages available in the DC/OS 1.9 Universe are newer than those in the DC/OS 1.8
Universe. Services are not automatically upgraded when DC/OS 1.9 is installed because
not all DC/OS services have upgrade paths that will preserve existing state.
MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9

Installing DC/OS
ENTERPRISE DC/OS Updated: April 17, 2017

Welcome to the documentation for DC/OS version 1.9. For information about new and
changed features, see the release notes.

Upgrading
This document provides instructions for upgrading a DC/OS cluster. If this upgrade is
performed on a supported OS with all prerequisites fulfilled, this upgrade should preserve
the...
DC/OS Custom Installation Options
The DC/OS Enterprise Edition is installed in your environment by using a dynamically
generated setup file. This file is generated by using specific parameters that are set during
c...
DC/OS Cloud Installation Options
You can install DC/OS by using cloud templates.
Local
This installation method uses Vagrant to create a cluster of virtual machines on your local
machine that can be used for demos, development, and testing with DC/OS. System
requirem...
High-Availability
This document discusses the high availability (HA) features in DC/OS and best practices for
building HA applications on DC/OS. Terminology Zone A zone is a failure domain that has ...
DC/OS Ports
This topic lists the ports that are required to launch DC/OS. Additional ports may be required
to launch the individual DC/OS services. All nodes TCP Port DC/OS component systemd u...
Opt-Out
Telemetry You can opt-out of providing anonymous data by disabling telemetry for your
cluster. To disable telemetry, add this parameter to your config.yaml file during installation...
Frequently Asked Questions
Q. Can I install DC/OS on an already running Mesos cluster? We recommend starting with a
fresh cluster to ensure all defaults are set to expected values. This prevents unexpected c...
Troubleshooting a Custom Installation
General troubleshooting approach Verify that you have a valid IP detect script,
functioning DNS resolvers to bind the DC/OS services to, and that all nodes are synchr...

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
INSTALLING DC/OS

Upgrading
ENTERPRISE DC/OS Updated: April 17, 2017

This document provides instructions for upgrading a DC/OS cluster.

If this upgrade is performed on a supported OS with all prerequisites fulfilled, this upgrade
should preserve the state of running tasks on the cluster. This document reuses portions of
the Advanced DC/OS Installation Guide.

Important:

Review the release notes before upgrading DC/OS.


The DC/OS GUI and other higher-level system APIs may be inconsistent or unavailable
until all master nodes have been upgraded. For example, an upgraded DC/OS Marathon
leader cannot connect to the leading Mesos master until it has also been upgraded.
When this occurs:
The DC/OS GUI may not provide an accurate list of services.
For multi-master configurations, after one master has finished upgrading, you can
monitor the health of the remaining masters from the Exhibitor UI on port 8181.

The VIP features, added in DC/OS 1.8, require that ports 32768 - 65535 are open
between all agent and master nodes for both TCP and UDP.
Virtual networks require Docker 1.11 or later. For more information, see the
documentation.
An upgraded DC/OS Marathon leader cannot connect to a non-secure (i.e., not
upgraded) leading Mesos master. The DC/OS UI cannot be trusted until all masters are
upgraded. There are multiple Marathon scheduler instances and multiple Mesos masters,
each being upgraded, and the Marathon leader may not be the Mesos leader.
Task history in the Mesos UI will not persist through the upgrade.
Enterprise DC/OS downloads can be found here.

Supported upgrade paths


From the latest GA version of 1.8 to the latest GA version of 1.9. For example, if 1.8.8 is
the latest and 1.9.0 is the latest, this upgrade would be supported.
From any 1.9 release to the next. For example, an upgrade from 1.9.1 to 1.9.2 would be
supported.
From any 1.9 release to an identical 1.9 release. For example, an upgrade from 1.9.0 to
1.9.0 would be supported. This is useful for making configuration changes.

Modifying DC/OS configuration


You cannot change your cluster configuration at the same time as upgrading to a new
version. Cluster configuration changes must be done with an update to an already installed
version. For example, you cannot simultaneously upgrade a cluster from 1.9.x to 1.9.y and
add more public agents. You can add more public agents with an update to 1.9.x, and then
upgrade to 1.9.y. Or you can upgrade to 1.9.y and then add more public agents by updating
1.9.y after the upgrade.

To modify your DC/OS configuration, you must run the installer with the modified
config.yaml and update your cluster using the new installation files. Changes to the DC/OS
configuration have the same risk as upgrading a host. Incorrect configurations could
potentially crash your hosts, or an entire cluster.

Only a subset of DC/OS configuration parameters can be modified. The adverse effects on
any software that is running on top of DC/OS are outside the scope of this document.
Contact Mesosphere Support for more information.
Here is a list of the parameters that you can modify:

check_time

dns_search

docker_remove_delay

gc_delay

resolvers

telemetry_enabled

use_proxy
http_proxy

https_proxy

no_proxy

The security mode (security) can be changed but has special caveats.

You can only update to a stricter security mode. Security downgrades are not supported.
For example, if your cluster is in permissive mode and you want to downgrade to
disabled mode, you must reinstall the cluster and terminate all running workloads.

During each update, you can only increase your security by a single level. For example,
you cannot update directly from disabled to strict mode. To increase from disabled to
strict mode you must first update to permissive mode, and then update from permissive
to strict mode.

See the security mode for a description of the different security modes and what each
means.

Instructions
These steps must be performed for version upgrades and cluster configuration changes.

Prerequisites
Mesos, Mesos Frameworks, Marathon, Docker and all running tasks in the cluster should
be stable and in a known healthy state.
For Mesos compatibility reasons, we recommend upgrading any running Marathon-on-
Marathon instances to Marathon version 1.3.5 before proceeding with this DC/OS
upgrade.
You must have access to copies of the config files used with the previous DC/OS version:
config.yaml and ip-detect.

You must be using systemd 218 or newer to maintain task state.


All hosts (masters and agents) must be able to communicate with all other hosts on all
ports, for both TCP and UDP.
In CentOS or RedHat, install IP sets with this command (used in some IP detect scripts):
sudo yum install -y ipset

You must be familiar with using systemctl and journalctl command line tools to review
and monitor service status. Troubleshooting notes can be found at the end of this
document.
You must be familiar with the Advanced DC/OS Installation Guide.
You should take a snapshot of ZooKeeper prior to upgrading. Marathon supports
rollbacks, but does not support downgrades.

Bootstrap Node
Choose your desired security mode and then follow the applicable upgrade instructions.

Installing DC/OS 1.9 without changing security mode


Installing DC/OS 1.9 in permissive mode
Installing DC/OS 1.9 in strict mode

Installing DC/OS 1.9 without changing security mode
This procedure upgrades a DC/OS 1.8 cluster to DC/OS 1.9 without changing the cluster's
security mode.

Copy your existing config.yaml and ip-detect files to an empty folder on your bootstrap
node.
Merge the old config.yaml into the new config.yaml format. In most cases the differences
will be minimal.
Important:

You cannot change the exhibitor_zk_backend setting during an upgrade.


The syntax of the DC/OS 1.9 config.yaml may be different from the 1.8 version. For a
detailed description of the 1.9 config.yaml syntax and parameters, see the
documentation.

Modify the ip-detect file as desired.


Build your installer package.
Download the dcos_generate_config.ee.sh file.
Generate the installation files. Replace <installed_cluster_version> in the below
command with the DC/OS version currently running on the cluster you intend to
upgrade, for example 1.8.8.
```bash
dcos_generate_config.ee.sh --generate-node-upgrade-script <installed_cluster_version>
```

The command in the previous step will produce a URL in the last line of its output,
prefixed with Node upgrade script URL:. Record this URL for use in later steps. It will
be referred to in this document as the Node upgrade script URL.
Run the nginx container to serve the installation files (a hedged example follows these steps).

Go to the DC/OS Master procedure to complete your installation.
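
As a hedged sketch of the nginx step above: assuming the installation files were generated into ./genconf/serve on the bootstrap node and that port 8080 is free, serving them could look like this. Adjust the port and paths for your environment.

```bash
# Serve the generated installation files over HTTP from the bootstrap node.
# The host port (8080) and the genconf/serve path are assumptions; adjust as needed.
sudo docker run -d -p 8080:80 -v $PWD/genconf/serve:/usr/share/nginx/html:ro nginx
```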

Installing DC/OS 1.9 in permissive mode


This procedure upgrades to DC/OS 1.9 in permissive security mode.

Prerequisite:
Your cluster must be upgraded to DC/OS 1.9 and running in disabled security mode
before it can be upgraded to permissive mode. If your cluster was running in permissive
mode before it was upgraded to DC/OS 1.9, you can skip this procedure.

To update a 1.9 cluster from disabled security to permissive security, complete the following
procedure:

Replace security: disabled with security: permissive in your config.yaml (see the sketch after these steps).


Modify the ip-detect file as desired.
Build your installer package.
Download the dcos_generate_config.ee.sh file.
Generate the installation files. Replace <installed_cluster_version> in the below
command with the DC/OS version currently running on the cluster you intend to
upgrade, for example 1.8.8.
```bash
dcos_generate_config.ee.sh --generate-node-upgrade-script <installed_cluster_version>
```

The command in the previous step will produce a URL in the last line of its output,
prefixed with Node upgrade script URL:. Record this URL for use in later steps. It will
be referred to in this document as the Node upgrade script URL.
Run the nginx container to serve the installation files.

Go to the DC/OS Master procedure to complete your installation.
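
One possible way to make the security-mode edit from step 1, assuming config.yaml sits in the current directory and contains a top-level security: disabled line:

```bash
# Switch the security mode in place, then confirm the change.
sed -i 's/^security: disabled$/security: permissive/' config.yaml
grep '^security:' config.yaml   # should now print: security: permissive
```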

Installing DC/OS 1.9 in strict mode


This procedure upgrades to DC/OS 1.9 in security strict mode.

If you are updating a running DC/OS cluster to run in security: strict mode, beware that
security vulnerabilities may persist even after migration to strict mode. When moving to strict
mode, your services will now require authentication and authorization to register with Mesos
or access its HTTP API. You should test these configurations in permissive mode before
upgrading to strict, to maintain scheduler and script uptimes across the upgrade.

Prerequisite:

Your cluster must be upgraded to DC/OS 1.9 and running in permissive security mode
before it can be updated to strict mode. If your cluster was running in strict mode before it
was upgraded to DC/OS 1.9, you can skip this procedure.

To update a cluster from permissive security to strict security, complete the following
procedure:

Replace security: permissive with security: strict in your config.yaml.


Modify the ip-detect file as desired.
Build your installer package.
Download the dcos_generate_config.ee.sh file.
Generate the installation files. Replace <installed_cluster_version> in the below
command with the DC/OS version currently running on the cluster you intend to
upgrade, for example 1.8.8.
```bash
dcos_generate_config.ee.sh --generate-node-upgrade-script <installed_cluster_version>
```

The command in the previous step will produce a URL in the last line of its output,
prefixed with Node upgrade script URL:. Record this URL for use in later steps. It will
be referred to in this document as the Node upgrade script URL.
Run the nginx container to serve the installation files.

Go to the DC/OS Master procedure to complete your installation.

DC/OS Masters
Proceed with upgrading every master node one-at-a-time in any order using the following
procedure. When you complete each upgrade, monitor the Mesos master metrics to ensure
the node has rejoined the cluster and completed reconciliation.

Download and run the node upgrade script:


curl -O <Node upgrade script URL>
sudo bash dcos_node_upgrade.sh

Verify that the upgrade script succeeded and exited with the status code 0:
echo $?

Validate the upgrade:


Monitor Exhibitor and wait for it to converge at
http://<master-ip>:8181/exhibitor/v1/ui/index.html. Confirm that the master rejoins
the ZooKeeper quorum successfully (the status indicator will turn green).
Tip: If you are upgrading from permissive to strict mode, this URL will be https://....

Wait until the dcos-mesos-master unit is up and running.


Verify that curl http://<dcos_master_private_ip>:5050/metrics/snapshot has the
metric registrar/log/recovered with a value of 1 (a hedged sketch of this check follows this list).
Tip: If you are upgrading from permissive to strict mode, this URL will be curl
https://... and you will need a JWT for access.

Verify that /opt/mesosphere/bin/mesos-master --version indicates that the upgraded
master is running Mesos 1.2.0.
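
A hedged sketch of the registrar/log/recovered check above, using python -m json.tool to pretty-print the metrics snapshot; the grep filter is illustrative only.

```bash
# Run from a machine that can reach the master; replace the placeholder IP.
curl -s http://<dcos_master_private_ip>:5050/metrics/snapshot \
  | python -m json.tool | grep '"registrar/log/recovered"'
# Expected once the replicated log has recovered: a value of 1
```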

Go to the DC/OS Agents procedure to complete your installation.


DC/OS Agents
Important: When upgrading agent nodes, there is a 5 minute timeout for the agent to
respond to health check pings from the mesos-masters before it is considered lost and its
tasks are given up for dead.

On all DC/OS agents:

Navigate to the /opt/mesosphere/lib directory and delete the library file libltdl.so.7.
Deleting this file prevents conflicts.

Download and run the node upgrade script:


curl -O <Node upgrade script URL>
sudo bash dcos_node_upgrade.sh

Verify that the upgrade script succeeded and exited with the status code 0:
echo $?

Validate the upgrade:


Verify that curl http://<dcos_agent_private_ip>:5051/metrics/snapshot has the
metric slave/registered with a value of 1 (a hedged sketch follows this list).
Monitor the Mesos UI to verify that the upgraded node rejoins the DC/OS cluster and
that tasks are reconciled (http://<master-ip>/mesos). If you are upgrading from
permissive to strict mode, this URL will be https://<master-ip>/mesos.
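
A similar hedged sketch for the agent registration check; again, the grep filter is illustrative.

```bash
# Run from a machine that can reach the agent; replace the placeholder IP.
curl -s http://<dcos_agent_private_ip>:5051/metrics/snapshot \
  | python -m json.tool | grep '"slave/registered"'
# Expected: a value of 1
```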

Troubleshooting Recommendations
The following commands should provide insight into upgrade issues:

On All Cluster Nodes

sudo journalctl -u dcos-download
sudo journalctl -u dcos-spartan
sudo systemctl | grep dcos

On DC/OS Masters

sudo journalctl -u dcos-exhibitor
less /opt/mesosphere/active/exhibitor/usr/zookeeper/zookeeper.out
sudo journalctl -u dcos-mesos-dns
sudo journalctl -u dcos-mesos-master
On DC/OS Agents

sudo journalctl -u dcos-mesos-slave

Notes:
Packages available in the DC/OS 1.9 Universe are newer than those in the DC/OS 1.8
Universe. Services are not automatically upgraded when DC/OS 1.9 is installed because
not all DC/OS services have upgrade paths that will preserve existing state.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
INSTALLING DC/OS

DC/OS Custom Installation Options


ENTERPRISE DC/OS Updated: April 17, 2017

The DC/OS Enterprise Edition is installed in your environment by using a dynamically


generated setup file. This file is generated by using specific parameters that are set during
configuration. This installation file contains a Bash install script and a Docker container that
is loaded with everything you need to deploy a customized DC/OS build.

The DC/OS installation process requires a cluster of nodes to install DC/OS onto and a
single node to run the DC/OS installation from.

Contact your sales representative or sales@mesosphere.io for access to the DC/OS setup
file.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
INSTALLING DC/OS

DC/OS Cloud Installation Options


ENTERPRISE DC/OS Updated: April 17, 2017
You can install DC/OS by using cloud templates.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
INSTALLING DC/OS

Local
ENTERPRISE DC/OS Updated: April 17, 2017

This installation method uses Vagrant to create a cluster of virtual machines on your local
machine that can be used for demos, development, and testing with DC/OS.

System requirements
Hardware
Minimum 5 GB of memory to run DC/OS.

Software
Enterprise DC/OS setup file. Contact your sales representative or
sales@mesosphere.com to obtain this file.
DC/OS Vagrant. The installation and usage instructions are maintained in the dcos-
vagrant GitHub repository. Follow the deploy instructions to set up your host machine
correctly and to install DC/OS (a minimal sketch of the workflow follows this list).
For the latest bug fixes, use the master branch.
For increased stability, use the latest official release.
For older releases of DC/OS, you may need to download an older release of DC/OS
Vagrant.
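
A minimal sketch of the dcos-vagrant workflow, assuming Vagrant and VirtualBox are already installed; the repository URL and steps below are a summary, so treat the repository's own deploy documentation as authoritative.

```bash
# Clone the project and bring up a small demo cluster.
git clone https://github.com/dcos/dcos-vagrant
cd dcos-vagrant
# Place the Enterprise DC/OS setup file where the repository's deploy docs expect it,
# then start the VMs:
vagrant up
```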

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
INSTALLING DC/OS
High-Availability
PREVIEW Updated: April 17, 2017

This document discusses the high availability (HA) features in DC/OS and best practices for
building HA applications on DC/OS.

Terminology
Zone
A zone is a failure domain that has isolated power, networking, and connectivity. Typically, a
zone is a single data center or independent fault domain on-premise, or managed by a cloud
provider. For example, AWS Availability Zones or GCP Zones. Servers within a zone are
connected via high bandwidth (e.g. 1-10+ Gbps), low latency (up to 1 ms), and low cost
links.

Region
A region is a geographical region, such as a metro area, that consists of one or more zones.
Zones within a region are connected via high bandwidth (e.g. 1-4 Gbps), low latency (up to
10 ms), low cost links. Regions are typically connected through public internet via variable
bandwidth (e.g. 10-100 Mbps) and latency (100-500 ms) links.

Rack
A rack is typically composed of a set of servers (nodes). A rack has its own power supply
and switch (or switches), all attached to the same frame. On public cloud platforms such as
AWS, there is no equivalent concept of a rack.

General Recommendations
Latency
DC/OS master nodes should be connected to each other via highly available and low latency
network links. This is required because some of the coordinating components running on
these nodes use quorum writes for high availability. For example, Mesos masters, Marathon
schedulers, and ZooKeeper use quorum writes.

Similarly, most DC/OS services use ZooKeeper (or etcd, consul, etc) for scheduler leader
election and state storage. For this to be effective, service schedulers must be connected to
the ZooKeeper ensemble via a highly available, low latency network link.

Routing
DC/OS networking requires a unique address space. Cluster entities cannot share the same
IP address. For example, apps and DC/OS agents must have unique IP addresses.

All IP addresses should be routable within the cluster.

Leader/Follower Architecture
A common pattern in HA systems is the leader/follower concept. This is also sometimes
referred to as: master/slave, primary/replica, or some combination thereof. This architecture
is used when you have one authoritative process, with N standby processes. In some
systems, the standby processes might also be capable of serving requests or performing
other operations. For example, when running a database like MySQL with a master and
replica, the replica is able to serve read-only requests, but it cannot accept writes (only the
master will accept writes).

In DC/OS, a number of components follow the leader/follower pattern. We'll discuss some of
them here and how they work.

Mesos

Mesos can be run in high availability mode, which requires running 3 or 5 masters. When run
in HA mode, one master is elected as the leader, while the other masters are followers. Each
master has a replicated log which contains some state about the cluster. The leading master
is elected by using ZooKeeper to perform leader election. For more detail on this, see the
Mesos HA documentation.

Marathon

Marathon can be run in HA mode, which allows running multiple Marathon instances (at
least 2 for HA), with one elected leader. Marathon uses ZooKeeper for leader election. The
followers do not accept writes or API requests, instead the followers proxy all API requests
to the leading Marathon instance.

ZooKeeper

ZooKeeper is used by numerous services in DC/OS to provide consistency. ZooKeeper can


be used as a distributed locking service, a state store, and a messaging system. ZooKeeper
uses Paxos-like log replication and a leader/follower architecture to maintain consistency
across multiple ZooKeeper instances. For a more detailed explanation of how ZooKeeper
works, check out the ZooKeeper internals document.
Fault Domain Isolation
Fault domain isolation is an important part of building HA systems. To correctly handle
failure scenarios, systems must be distributed across fault domains to survive outages.
There are different types of fault domains, a few examples of which are:

Physical domains: this includes machine, rack, datacenter, region, and availability zone.
Network domains: machines within the same network may be subject to network
partitions. For example, a shared network switch may fail or have invalid configuration.

For more information, see the multi-zone and multi-region documentation.

Applications that require HA should also be distributed across fault domains. With
Marathon, this can be accomplished by using the UNIQUE and GROUP_BY constraint operators.
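
As a hedged sketch, assuming the DC/OS CLI is installed and authenticated: the Marathon app below uses the UNIQUE operator on hostname to spread instances across agents. The app id and resource values are illustrative only.

```bash
# Define a simple app whose instances must land on distinct agents, then submit it.
cat > ha-app.json <<'EOF'
{
  "id": "/ha-app",
  "cmd": "sleep 3600",
  "cpus": 0.1,
  "mem": 64,
  "instances": 3,
  "constraints": [["hostname", "UNIQUE"]]
}
EOF
dcos marathon app add ha-app.json
```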

Separation of Concerns
HA services should be decoupled, with responsibilities divided amongst services. For
example, web services should be decoupled from databases and shared caches.

Eliminating Single Points of Failure


Single points of failure come in many forms. For example, a service like ZooKeeper can
become a single point of failure when every service in your system shares one ZooKeeper
cluster. You can reduce risks by running multiple ZooKeeper clusters for separate services.
There's an Exhibitor Universe package that makes this easy.

Other common single points of failure include:


Single database instances (like MySQL)
One-off services
Non-HA load balancers

Fast Failure Detection


Fast failure detection comes in many forms. Services like ZooKeeper can be used to provide
failure detection, such as detecting network partitions or host failures. Service health checks
can also be used to detect certain types of failures. As a matter of best practice, services
should expose health check endpoints, which can be used by services like Marathon.
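
For example, a Marathon health check definition might look like the following fragment; the endpoint path and timing values here are illustrative assumptions.

```bash
# A healthChecks fragment that could be merged into a Marathon app definition.
cat > healthcheck-fragment.json <<'EOF'
{
  "healthChecks": [
    {
      "protocol": "HTTP",
      "path": "/health",
      "gracePeriodSeconds": 30,
      "intervalSeconds": 10,
      "timeoutSeconds": 5,
      "maxConsecutiveFailures": 3
    }
  ]
}
EOF
```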

Fast Failover
When failures do occur, failover should be as fast as possible. Fast failover can be achieved
by:
Using an HA load balancer like Marathon-LB, or Minuteman for internal layer 4 load
balancing.
Building apps in accordance with the 12-factor app manifesto.
Following REST best-practices when building services: in particular, avoiding storing
client state on the server between requests.

A number of DC/OS services follow the fail-fast pattern in the event of errors. Specifically,
both Mesos and Marathon will shut down in the case of unrecoverable conditions such as
losing leadership.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
INSTALLING DC/OS

DC/OS Ports
ENTERPRISE DC/OS Updated: April 17, 2017

This topic lists the ports that are required to launch DC/OS. Additional ports may be required
to launch the individual DC/OS services.

All nodes
TCP
Port DC/OS component systemd unit
61003 REX-Ray dcos-rexray.service
61053 Mesos DNS dcos-mesos-dns.service
61420 Erlang Port Mapping Daemon (EPMD) dcos-epmd.service
62053 DNS Forwarder (Spartan) dcos-spartan.service
62080 Navstar dcos-navstar.service
62501 DNS Forwarder (Spartan) dcos-spartan.service
62502 Navstar dcos-navstar.service

UDP
Port DC/OS component systemd unit
61053 Mesos DNS dcos-mesos-dns.service
62053 DNS Forwarder (Spartan) dcos-spartan.service
64000 Navstar dcos-navstar.service
Master
TCP
Port DC/OS component systemd unit
53 DNS Forwarder (Spartan) dcos-spartan.service
80 Admin Router Master dcos-adminrouter.service
443 Admin Router Master dcos-adminrouter.service
1050 DC/OS Diagnostics (3DT) dcos-3dt.service
1337 DC/OS Secrets dcos-secrets.service
2181 Exhibitor and Zookeeper dcos-exhibitor.service
5050 Mesos Master dcos-mesos-master.service
7070 DC/OS Package Manager (Cosmos) dcos-cosmos.service
8080 Marathon dcos-marathon.service
8101 DC/OS Identity and Access Manager dcos-bouncer.service
8123 Mesos DNS dcos-mesos-dns.service
8181 Exhibitor and Zookeeper dcos-exhibitor.service
8200 Vault dcos-vault.service
8888 DC/OS Certificate Authority dcos-ca.service
9990 DC/OS Package Manager (Cosmos) dcos-cosmos.service
15055 DC/OS History dcos-history-service.service
15101 Marathon libprocess dcos-marathon.service
15201 DC/OS Jobs (Metronome) libprocess dcos-metronome.service
62500 DC/OS Network Metrics dcos-networking_api.service
Dynamic DC/OS Jobs (Metronome) dcos-metronome.service
Dynamic DC/OS Component Package Manager (Pkgpanda) dcos-pkgpanda-api.service

UDP
Port DC/OS component systemd unit
53 DNS Forwarder (Spartan) dcos-spartan.service

Agent
TCP
Port DC/OS component systemd unit
5051 Mesos Agent dcos-mesos-slave.service
61001 Admin Router Agent dcos-adminrouter-agent
61002 DC/OS Diagnostics (3DT) dcos-3dt.service
1025-2180 Default advertised port ranges (for Marathon health checks)
2182-3887 Default advertised port ranges (for Marathon health checks)
3889-5049 Default advertised port ranges (for Marathon health checks)
5052-8079 Default advertised port ranges (for Marathon health checks)
8082-8180 Default advertised port ranges (for Marathon health checks)
8182-32000 Default advertised port ranges (for Marathon health checks)

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
INSTALLING DC/OS

Opt-Out
ENTERPRISE DC/OS Updated: April 17, 2017

Telemetry
You can opt-out of providing anonymous data by disabling telemetry for your cluster. To
disable telemetry, add this parameter to your config.yaml file during installation (note this
requires using the CLI or advanced installers):

telemetry_enabled: 'false'

If you've already installed your cluster and want to disable this in-place, you can go through
an upgrade with the same parameter set.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
INSTALLING DC/OS

Frequently Asked Questions


Updated: April 17, 2017
Q. Can I install DC/OS on an already running Mesos
cluster?
We recommend starting with a fresh cluster to ensure all defaults are set to expected values.
This prevents unexpected conditions related to mismatched versions and configurations.

Q. What are the OS requirements of DC/OS?


See the system requirements.

Q. Does DC/OS install ZooKeeper, or can I use my own


ZooKeeper quorum?
DC/OS runs its own ZooKeeper supervised by Exhibitor and systemd, but users are able to
create their own ZooKeeper quorums as well. The ZooKeeper quorum installed by default
will be available at master.mesos:[2181|2888|3888].

Q. Is it necessary to maintain a bootstrap node after the cluster is created?

If you specify an Exhibitor storage backend type other than exhibitor_storage_backend:
static in your cluster configuration file, you must maintain the external storage for the
lifetime of your cluster to facilitate leader elections. If your cluster is mission critical, you
should harden your external storage by using S3 or running the bootstrap ZooKeeper as a
quorum. Interruptions of service from the external storage can be tolerated, but permanent
loss of state can lead to unexpected conditions.

Q. How do I add Mesos attributes to nodes in order to use Marathon constraints?

In DC/OS, add the line MESOS_ATTRIBUTES=<key>:<value> to the file /var/lib/dcos/mesos-slave-common
(it may need to be created) for each attribute you'd like to add. More
information can be found in the Mesos documentation.
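
An illustrative example, assuming an attribute named zone with the value us-east-1a; note that changing attributes on an agent that has already registered may also require clearing the agent's metadata, as described in the Mesos documentation.

```bash
# Append the attribute and restart the agent so it takes effect.
echo 'MESOS_ATTRIBUTES=zone:us-east-1a' | sudo tee -a /var/lib/dcos/mesos-slave-common
sudo systemctl restart dcos-mesos-slave
```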

Q. How do I gracefully shut down an agent?


To gracefully kill an agent node's Mesos process and allow systemd to restart it, use the
following command. Note: If Auto Scaling Groups are in use, the node will be replaced
automatically:

```bash
sudo systemctl kill -s SIGUSR1 dcos-mesos-slave
```

For a public agent:

```bash
sudo systemctl kill -s SIGUSR1 dcos-mesos-slave-public
```

To gracefully kill the process and prevent systemd from restarting it, add a stop command:

```bash
sudo systemctl kill -s SIGUSR1 dcos-mesos-slave && sudo systemctl stop dcos-mesos-slave
```

For a public agent:

```bash
sudo systemctl kill -s SIGUSR1 dcos-mesos-slave-public && sudo systemctl stop dcos-mesos-slave-public
```

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
INSTALLING DC/OS

Troubleshooting a Custom
Installation
Updated: April 17, 2017

General troubleshooting approach


Verify that you have a valid IP detect script, functioning DNS resolvers to bind the
DC/OS services to, and that all nodes are synchronized with NTP.
IP detect script
You must have a valid ip-detect script. You can manually run ip-detect on all the nodes in
your cluster or check /opt/mesosphere/bin/detect_ip on an existing installation to ensure
that it returns a valid IP address. A valid IP address does not have:

- extra lines
- white space
- special or hidden characters

It is recommended that you use the ip-detect examples.

DNS resolvers
You must have working DNS resolvers, specified in your config.yaml file. It is recommended
that you have forward and reverse lookups for FQDNs, short hostnames, and IP addresses. It is
possible for DC/OS to function in environments without valid DNS support, but the following
must work to support DC/OS services, including Spark:

- hostname -f returns the FQDN
- hostname -s returns the short hostname

You should sanity check the output of hostnamectl on all of your nodes as well.

When troubleshooting problems with a DC/OS installation, you should explore the components
in this sequence:

1. Exhibitor
2. Mesos master
3. Mesos DNS
4. DNS Forwarder (Spartan)
5. DC/OS Marathon
6. Jobs
7. Admin Router

Be sure to check that all services are up and healthy on the masters before checking the
agents.

NTP
Network Time Protocol (NTP) must be enabled on all nodes for clock synchronization. By
default, during DC/OS startup you will receive an error if this is not enabled. You can check
if NTP is enabled by running one of these commands, depending on your OS and configuration:

```bash
ntptime
adjtimex -p
timedatectl
```

Ensure that firewalls and any other connection-filtering mechanisms are not interfering with
cluster component communications. TCP, UDP, and ICMP must be permitted. Ensure that services
that bind to port 53, which is required by DNS Forwarder (dcos-spartan.service), are
disabled and stopped. For example:

sudo systemctl disable dnsmasq && sudo systemctl stop dnsmasq

Verify that Exhibitor is up and running at http://<MASTER_IP>:8181/exhibitor. If Exhibitor
is not up and running:

SSH to your master node and enter this command to check the Exhibitor service logs:

```bash
journalctl -flu dcos-exhibitor
```

Verify that /tmp is mounted without noexec. If it is mounted with noexec, Exhibitor will
fail to bring up ZooKeeper because Java JNI won't be able to exec a file it creates in /tmp,
and you will see multiple permission denied errors in the log. To repair /tmp mounted with
noexec:

1. Enter this command:

```bash
mount -o remount,exec /tmp
```

2. Check the output of /exhibitor/v1/cluster/status and verify that it shows the correct
number of masters, that all of them are "serving", and that only one of them is designated
as "isLeader": true.

For example, SSH to your master node and enter this command:

```bash
curl -fsSL http://localhost:8181/exhibitor/v1/cluster/status | python -m json.tool
[
    {
        "code": 3,
        "description": "serving",
        "hostname": "10.0.6.70",
        "isLeader": false
    },
    {
        "code": 3,
        "description": "serving",
        "hostname": "10.0.6.69",
        "isLeader": false
    },
    {
        "code": 3,
        "description": "serving",
        "hostname": "10.0.6.68",
        "isLeader": true
    }
]
```

**Note:** Running this command in multi-master configurations can take up to 10-15 minutes
to complete. If it doesn't complete after 10-15 minutes, you should carefully review the
`journalctl -flu dcos-exhibitor` logs.

Verify whether you can ping the DNS Forwarder (ready.spartan). If not, review the DNS
Dispatcher service logs:

journalctl -flu dcos-spartan

Verify that you can ping leader.mesos and master.mesos. If not:

Review the Mesos-DNS service logs with this command:

journalctl -flu dcos-mesos-dns

If you are able to ping ready.spartan, but not leader.mesos, review the Mesos master
service logs by using this command:

journalctl -flu dcos-mesos-master

The Mesos masters must be up and running with a leader elected before Mesos-DNS
can generate its DNS records from /state.
Component logs
During DC/OS installation, each of the components will converge from a failing state to a
running state in the logs.
Admin Router

DC/OS agent nodes

DC/OS Marathon

gen_resolvconf

Mesos DNS

Mesos master process

ZooKeeper and Exhibitor

Admin Router
The Admin Router is started on the master nodes. The Admin Router provides central
authentication and proxy to DC/OS services within the cluster. This allows you to
administer your cluster from outside the network without a VPN or an SSH tunnel. For HA,
an optional load balancer can be configured in front of each master node, load balancing
port 80, to provide failover and load balancing.

Troubleshooting:
SSH to your master node and enter this command to view the logs from boot time:
```bash
journalctl -u dcos-adminrouter -b
```

For example, here is a snippet of the Admin Router log as it converges to a successful state:

```bash
systemd[1]: Starting A high performance web server and a reverse proxy server...
systemd[1]: Started A high performance web server and a reverse proxy server.
nginx[1652]: ip-10-0-7-166.us-west-2.compute.internal nginx: 10.0.7.166 - - [18/Nov/2015:14:01:10 +0000] "GET /mesos/master/state-summary HTTP/1.1" 200 575 "-" "python-requests/2.6.0 CPython/3.4.2 Linux/4.1.7-coreos"
nginx[1652]: ip-10-0-7-166.us-west-2.compute.internal nginx: 10.0.7.166 - - [18/Nov/2015:14:01:10 +0000] "GET /metadata HTTP/1.1" 200 175 "-" "python-requests/2.6.0 CPython/3.4.2 Linux/4.1.7-coreos"
```

DC/OS agent nodes


DC/OS private and public agent nodes are started. Deployed apps and services are run
on the private agent nodes. You must have at least 1 private agent node.

Publicly accessible applications are run in the public agent node. Public agent nodes can
be configured to allow outside traffic to access your cluster. Public agents are optional
and there is no minimum. This is where you'd run a load balancer, providing a service
from inside the cluster to the external public.

Troubleshooting:
You might not be able to SSH to agent nodes, depending on your cluster network
configuration. We have made this a little bit easier with the DC/OS CLI. For more
information, see SSHing to a DC/OS cluster.

You can get the IP address of registered agent nodes from the Nodes tab in the
DC/OS web interface. Nodes that have not registered are not shown.

SSH to your agent node and enter this command to view the logs from boot time:

```bash
journalctl -u dcos-mesos-slave -b
```

For example, here is a snippet of the Mesos agent log as it converges to a successful state:
```bash mesos-slave[1080]: I1118 14:00:43.687366 1080 main.cpp:272] Starting Mesos slave
mesos-slave[1080]: I1118 14:00:43.688474 1080 slave.cpp:190] Slave started on
1)@10.0.1.108:5051 mesos-slave[1080]: I1118 14:00:43.688503 1080 slave.cpp:191] Flags at
startup: --appc_store_dir="/tmp/mesos/store/appc" --authenticatee="crammd5" --
cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --
cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --
container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --
disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --
docker_remove_delay="1hrs" --docker_socket="/var/run/docker.sock" --
docker_stop_timeout="0ns" --enforce_container_disk_quota="false" --
executor_environment_variables="{"LD_LIBRARY_PATH":"\/opt\/mesosphere\/lib","PATH":"\/usr\/b
in","SASL_PATH":"\/opt\/mesosphere\/lib\/sasl2","SHELL":"\/usr\/bin\/bash"}" --
executor_registration_timeout="5mins" --executor_shutdown_grace_period="5secs" --
fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --
gc_delay="2days" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --
hostname_lookup="false" --image_provisioner_backend="copy" --
initialize_driver_logging="true" --ip_discovery_command="/opt/mesosphere/bin/detect_ip" --
isolation="cgroups/cpu,cgroups/mem" --launcher_dir="/opt/mesosphere/packages/mesos-
-30d3fbeb6747bb086d71385e3e2e0eb74ccdcb8b/libexec/mesos" --log_dir="/var/log/mesos" --
logbufsecs="0" --logging_level="INFO" --master="zk://leader.mesos:2181/mesos" --
oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins"
--port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --
recovery_timeout="15mins" --registration_backoff_factor="1secs" --
resource_monitoring_interval="1secs" --
resources="ports:[1025-2180,2182-3887,3889-5049,5052-8079,8082-8180,8182-32000]" --
revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --
slave_subsystems="cpu,memory" --strict="true" --switch_user="true" --
systemd_runtime_directory="/run/systemd/system" --version="false" --
work_dir="/var/lib/mesos/slave" mesos-slave[1080]: I1118 14:00:43.688711 1080 slave.cpp:211]
Moving slave process into its own cgroup for subsystem: cpu mesos-slave[1080]: 2015-11-18
14:00:43,689:1080(0x7f9b526c4700):ZOO_INFO@check_events@1703: initiated connection to server
[10.0.7.166:2181] mesos-slave[1080]: I1118 14:00:43.692811 1080 slave.cpp:211] Moving slave
process into its own cgroup for subsystem: memory mesos-slave[1080]: I1118 14:00:43.697872
1080 slave.cpp:354] Slave resources: ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079,
8082-8180, 8182-32000]; cpus(*):4; mem(*):14019; disk(*):32541 mesos-slave[1080]: I1118
14:00:43.697916 1080 slave.cpp:390] Slave hostname: 10.0.1.108 mesos-slave[1080]: I1118
14:00:43.697928 1080 slave.cpp:395] Slave checkpoint: true ```

DC/OS Marathon
DC/OS Marathon is started on the master nodes. It is the native Marathon instance that
serves as the init system for DC/OS: it starts and monitors applications and services.

Troubleshooting:

Go to the Services > Services tab on the web interface and view status.

SSH to your master node and enter this command to view the logs from boot time:

```bash
journalctl -u dcos-marathon -b
```

For example, here is a snippet of the DC/OS Marathon log as it converges to a successful
state: ```bash java[1288]: I1118 13:59:39.125041 1363 group.cpp:331] Group process
(group(1)@10.0.7.166:48531) connected to ZooKeeper java[1288]: I1118 13:59:39.125100 1363
group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
java[1288]: I1118 13:59:39.125121 1363 group.cpp:403] Trying to create path '/mesos' in
ZooKeeper java[1288]: [2015-11-18 13:59:39,130] INFO Scheduler actor ready
(mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-dispatcher-5)
java[1288]: I1118 13:59:39.147804 1363 detector.cpp:156] Detected a new leader: (id='1')
java[1288]: I1118 13:59:39.147924 1363 group.cpp:674] Trying to get
'/mesos/json.info_0000000001' in ZooKeeper java[1288]: I1118 13:59:39.148727 1363
detector.cpp:481] A new leading master (UPID=master@10.0.7.166:5050) is detected java[1288]:
I1118 13:59:39.148787 1363 sched.cpp:262] New master detected at master@10.0.7.166:5050
java[1288]: I1118 13:59:39.148952 1363 sched.cpp:272] No credentials provided. Attempting to
register without authentication java[1288]: I1118 13:59:39.150403 1363 sched.cpp:641]
Framework registered with cdcb6222-65a1-4d60-83af-33dadec41e92-0000 ```

gen_resolvconf
gen_resolvconf is started. This is a service that helps the agent nodes locate the master
nodes. It updates /etc/resolv.conf so that agents can use the Mesos-DNS service for
service discovery. gen_resolvconf uses either an internal load balancer, vrrp, or a static
list of masters to locate the master nodes. For more information, see the
master_discovery configuration parameter.

Troubleshooting:
When gen_resolvconf is up and running, you can view /etc/resolv.conf contents. It should
contain one or more IP addresses for the master nodes, and the optional external DNS server.

SSH to your master node and enter this command to view the logs from boot time:

```bash
journalctl -u dcos-gen-resolvconf -b
```

For example, here is a snippet of the gen_resolvconf log as it converges to a successful state:

```bash
systemd[1]: Started Update systemd-resolved for mesos-dns.
systemd[1]: Starting Update systemd-resolved for mesos-dns...
gen_resolvconf.py[1073]: options timeout:1
gen_resolvconf.py[1073]: options attempts:3
gen_resolvconf.py[1073]: nameserver 10.0.7.166
gen_resolvconf.py[1073]: nameserver 10.0.0.2
gen_resolvconf.py[1073]: Updating /etc/resolv.conf
```

Mesos master process


The Mesos master process starts on the master nodes. The mesos-master process runs
on a node in the cluster and orchestrates the running of tasks on agents by receiving
resource offers from agents and offering those resources to registered services, such as
Marathon or Chronos. For more information, see the Mesos Master Configuration
documentation.

Troubleshooting:
Go directly to the Mesos web interface and view status at <master-hostname>/mesos.

SSH to your master node and enter this command to view the logs from boot time:
```bash
journalctl -u dcos-mesos-master -b
```

For example, here is a snippet of the Mesos master log as it converges to a successful
state: ```bash mesos-master[1250]: I1118 13:59:33.890916 1250 master.cpp:376] Master
cdcb6222-65a1-4d60-83af-33dadec41e92 (10.0.7.166) started on 10.0.7.166:5050 mesos-
master[1250]: I1118 13:59:33.890945 1250 master.cpp:378] Flags at startup: --
allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --
authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --
cluster="pool-880dfdbf0f2845bf8191" --framework_sorter="drf" --help="false" --
hostname_lookup="false" --initialize *driver_logging="true" --
ip_discovery_command="/opt/mesosphere/bin/detect_ip" --log_auto_initialize="true" --
log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max*
slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="1" --
recovery_slave_removal_limit="100%" --registry="replicated_log" --
registry_fetch_timeout="1mins" --registry_sto re_timeout="5secs" --registry_strict="false" -
-roles="slave_public" --root_submissions="true" --slave_ping_timeout="15secs" --
slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --
webui_dir="/opt/mesosphere/packages/mesos-
-30d3fbeb6747bb086d71385e3e2e0eb74ccdcb8b/share/mesos/webui" --weights="slave_public=1" --
work_dir="/var/lib/mesos/mas ter" --zk="zk://127.0.0.1:2181/mesos" --
zk_session_timeout="10secs" mesos-master[1250]: 2015-11-18
13:59:33,891:1250(0x7f14427fc700):ZOO_INFO@check_events@1750: session establishment complete
on server [127.0.0.1:2181], sessionId=0x1511ae440bc0001, negotiated timeout=10000 ```

Mesos-DNS
Mesos-DNS is started on the DC/OS master nodes. Mesos-DNS provides service
discovery within the cluster. Optionally, Mesos-DNS can forward unhandled requests to
an external DNS server, depending on how the cluster is configured. For example,
anything that does not end in .mesos will be forwarded to the external resolver.

Troubleshooting:
SSH to your master node and enter this command to view the logs from boot time:
```bash
journalctl -u dcos-mesos-dns -b
```
For example, here is a snippet of the Mesos-DNS log as it converges to a successful state:
```bash mesos-dns[1197]: I1118 13:59:34.763885 1197 detect.go:135] changing leader node from
"" -> "json.info_0000000001" mesos-dns[1197]: I1118 13:59:34.764537 1197 detect.go:145]
detected master info:
&MasterInfo{Id:*cdcb6222-65a1-4d60-83af-33dadec41e92,Ip:*2785476618,Port:*5050,Pid:*master@1
0.0.7.166:5050,Hostname:*10\.0.7.166,Version:*0\.25.0,Address:&Address{Hostname:*10\.0.7.166
,Ip:*10\.0.7.166,Port:*5050,XXX_unrecognized:[],},XXX_unrecognized:[],} mesos-dns[1197]:
VERY VERBOSE: 2015/11/18 13:59:34 masters.go:47: Updated leader:
&MasterInfo{Id:*cdcb6222-65a1-4d60-83af-33dadec41e92,Ip:*2785476618,Port:*5050,Pid:*master@1
0.0.7.166:5050,Hostname:*10\.0.7.166,Version:*0\.25.0,Address:&Address{Hostname:*10\.0.7.166
,Ip:*10\.0.7.166,Port:*5050,XXX_unrecognized:[],},XXX_unrecognized:[],} mesos-dns[1197]:
VERY VERBOSE: 2015/11/18 13:59:34 main.go:76: new masters detected: [10.0.7.166:5050] mesos-
dns[1197]: VERY VERBOSE: 2015/11/18 13:59:34 generator.go:70: Zookeeper says the leader is:
10.0.7.166:5050 mesos-dns[1197]: VERY VERBOSE: 2015/11/18 13:59:34 generator.go:162:
reloading from master 10.0.7.166 mesos-dns[1197]: I1118 13:59:34.766005 1197 detect.go:219]
notifying of master membership change:
[&MasterInfo{Id:*cdcb6222-65a1-4d60-83af-33dadec41e92,Ip:*2785476618,Port:*5050,Pid:*master@
10.0.7.166:5050,Hostname:*10\.0.7.166,Version:*0\.25.0,Address:&Address{Hostname:*10\.0.7.16
6,Ip:*10\.0.7.166,Port:*5050,XXX_unrecognized:[],},XXX_unrecognized:[],}] mesos-dns[1197]:
VERY VERBOSE: 2015/11/18 13:59:34 masters.go:56: Updated masters:
[&MasterInfo{Id:*cdcb6222-65a1-4d60-83af-33dadec41e92,Ip:*2785476618,Port:*5050,Pid:*master@
10.0.7.166:5050,Hostname:*10\.0.7.166,Version:*0\.25.0,Address:&Address{Hostname:*10\.0.7.16
6,Ip:*10\.0.7.166,Port:*5050,XXX_unrecognized:[],},XXX_unrecognized:[],}] mesos-dns[1197]:
I1118 13:59:34.766124 1197 detect.go:313] resting before next detection cycle ```

ZooKeeper and Exhibitor


ZooKeeper and Exhibitor start on the master nodes. The Exhibitor storage location must
be configured properly for this to work. For more information, see the
exhibitor_storage_backend parameter.

DC/OS uses ZooKeeper, a high-performance coordination service, to manage the
installed DC/OS services. Exhibitor automatically configures ZooKeeper on the master
nodes during your DC/OS installation. For more information, see Configuration
Parameters.
Go to the Exhibitor web interface and view status at <master-hostname>/exhibitor.

SSH to your master node and enter this command to view the logs from boot time:

```bash
journalctl -u dcos-exhibitor -b
```

For example, here is a snippet of the Exhibitor log as it converges to a successful state:
```bash INFO com.netflix.exhibitor.core.activity.ActivityLog Automatic Instance Management
will change the server list: ==> 1:10.0.7.166 [ActivityQueue-0] INFO
com.netflix.exhibitor.core.activity.ActivityLog State: serving [ActivityQueue-0] INFO
com.netflix.exhibitor.core.activity.ActivityLog Server list has changed [ActivityQueue-0]
INFO com.netflix.exhibitor.core.activity.ActivityLog Attempting to stop instance
[ActivityQueue-0] INFO com.netflix.exhibitor.core.activity.ActivityLog Attempting to
start/restart ZooKeeper [ActivityQueue-0] INFO
com.netflix.exhibitor.core.activity.ActivityLog Kill attempted result: 0 [ActivityQueue-0]
INFO com.netflix.exhibitor.core.activity.ActivityLog Process started via:
/opt/mesosphere/active/exhibitor/usr/zookeeper/bin/zkServer.sh [ActivityQueue-0] ERROR
com.netflix.exhibitor.core.activity.ActivityLog ZooKeeper Server: JMX enabled by default
[pool-3-thread-1] ERROR com.netflix.exhibitor.core.activity.ActivityLog ZooKeeper Server:
Using config: /opt/mesosphere/active/exhibitor/usr/zookeeper/bin/../conf/zoo.cfg [pool-3-
thread-1] INFO com.netflix.exhibitor.core.activity.ActivityLog ZooKeeper Server: Starting
zookeeper ... STARTED [pool-3-thread-3] INFO com.netflix.exhibitor.core.activity.ActivityLog
Cleanup task completed [pool-3-thread-6] INFO
com.netflix.exhibitor.core.activity.ActivityLog Cleanup task completed [pool-3-thread-9] ```


MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9

Administering Clusters
Updated: April 17, 2017
MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9

Monitoring, Logging, and Debugging
ENTERPRISE DC/OS Updated: April 17, 2017

Welcome to the documentation for DC/OS version 1.9. For information about new and
changed features, see the release notes.

Performance Monitoring
Here are some recommendations for monitoring a DC/OS cluster. You can use any
monitoring tools. The endpoints listed below will help you troubleshoot when issues
occur. Your monito...
Performance Monitoring
Here are some recommendations for monitoring a DC/OS cluster. You can use any
monitoring tools. The endpoints listed below will help you troubleshoot when issues
occur. Your monito...
Performance Monitoring
Here are some recommendations for monitoring a DC/OS cluster. You can use any
monitoring tools. The endpoints listed below will help you troubleshoot when issues
occur. Your monito...
Logging
DC/OS cluster nodes generate logs that contain diagnostic and status information for
DC/OS core components and DC/OS services. Service, Task, and Node Logs ...
Debugging from the DC/OS Web Interface
You can debug your service or pod from the DC/OS web interface. Service and Pod
Health and Status Summaries If you have added a Marathon health check to your service
or pod, the Se...
Debugging
DC/OS offers several tools to debug your services when they are stuck in deployment or
are not behaving as you expect. This topic discusses how to debug your services using
both th...
MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
MONITORING, LOGGING, AND DEBUGGING

Performance Monitoring
ENTERPRISE DC/OS Updated: April 17, 2017

Here are some recommendations for monitoring a DC/OS cluster. You can use any
monitoring tools. The endpoints listed below will help you troubleshoot when issues
occur.

Your monitoring tools should leverage historic data points so that you can track changes
and deviations. You should monitor your cluster when it is known to be in a healthy state
as well as unhealthy. This will give you a baseline for what is normal in the DC/OS
environment. With this historical data, you can fine tune your tools and set appropriate
thresholds and conditions. When these thresholds are exceeded, you can send alerts to
administrators.

Mesos and Marathon expose the following types of metrics:


Gauges are metrics that provide the current state at the moment it was queried.

Counters are metrics that are additive and include past and present results. These metrics
are not persisted across failover.

Marathon also has a timer metric that measures how long an event takes. Timers do not
exist for Mesos observability metrics.

Marathon metrics
Marathon provides a number of metrics for monitoring. Here are the ones that are
particularly useful to DC/OS.

Lifetime metrics
service.mesosphere.marathon.uptime (gauge) This metric provides the uptime, in
milliseconds, of the reporting Marathon process. Use this metric to diagnose stability
problems that can cause Marathon to restart.

service.mesosphere.marathon.leaderDuration (gauge) This metric provides the amount of


time, in milliseconds, since the last leader election occurred. Use this metric to diagnose
stability problems and determine the frequency of leader election.

Running tasks
service.mesosphere.marathon.task.running.count (gauge) This metric provides the
number of tasks that are running.

Staged tasks
service.mesosphere.marathon.task.staged.count (gauge) This metric provides the number
of tasks that are staged. Tasks are staged immediately after they are launched. A
consistently high number of staged tasks indicates a high number of tasks are being stopped
and restarted. This can be caused by either:
A high number of app updates or manual restarts.

Apps with stability problems that are automatically restarted frequently.

Task update status


service.mesosphere.marathon.core.task.update.impl.ThrottlingTaskStatusUpdateProcessor.queued
(gauge) This metric provides the number of queued status updates.

service.mesosphere.marathon.core.task.update.impl.ThrottlingTaskStatusUpdateProcessor.processing
(gauge) This metric provides the number of status updates that are currently being processed.

service.mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl.publishFuture
(timer) This metric measures how long it takes Marathon to process status updates.

App and group count


service.mesosphere.marathon.app.count (gauge) This metric provides the number of apps
that are defined. The number of apps defined affects the performance of Marathon: the more
apps that are defined, the lower the Marathon performance.

service.mesosphere.marathon.group.count (gauge) This metric provides the number of
groups that are defined. The number of groups that are defined affects the performance of
Marathon: the more groups that are defined, the lower the Marathon performance.

Communication between Marathon and Mesos

If healthy, these metrics should always be increasing.


service.mesosphere.marathon.core.launcher.impl.OfferProcessorImpl.incomingOffers
This metric provides the number of resource offers that Marathon is receiving from Mesos.
service.mesosphere.marathon.MarathonScheduler.resourceOffers This Dropwizard
metric measures the number of resource offers that Marathon receives from Mesos.

service.mesosphere.marathon.MarathonScheduler.statusUpdate This Dropwizard metric


measures the number of status updates that Marathon receives from Mesos.
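
A hedged way to pull the gauges and counters named above, assuming an authenticated DC/OS CLI and access through Admin Router; the metrics path and grep filter are illustrative.

```bash
# Fetch Marathon's metrics endpoint and pick out a couple of the values discussed here.
curl -s -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
  "$(dcos config show core.dcos_url)/service/marathon/metrics" \
  | python -m json.tool | grep -E 'marathon\.uptime|marathon\.leaderDuration'
```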

Mesos metrics
Mesos provides a number of metrics for monitoring. Here are the ones that are
particularly useful to DC/OS.

Master
These metrics should not increase over time

If these metrics increase, something is probably wrong.


master/slave_reregistrations (counter) This metric provides the number of agent re-
registrations and restarts. Use this metric along with historical data to determine
deviations and spikes of when a network partition occurs. If this number drastically
increases, then the cluster has experienced an outage but has reconnected.

master/slave_removals (counter) This metric provides the number of agents removed for
various reasons, including maintenance. Use this metric to determine network partitions
after a large number of agents have disconnected. If this number greatly deviates from the
previous number, your system administrator should be notified (PagerDuty etc).
master/tasks_error (counter) This metric provides the number of invalid tasks.
master/tasks_failed (counter) This metric provides the number of failed tasks.
master/tasks_killed (counter) This metric provides the number of killed tasks.
master/tasks_lost (counter) This metric provides the number of lost tasks. A lost task
means a task was killed or disconnected by an external factor. Watch this metric when a large
number of tasks deviates from the previous historical number.

master/slaves_disconnected (gauge) This metric provides the number of disconnected


agents. This metric is helpful along with master/slave_removals. If an agent disconnects
this number will increase. If an agent reconnects, this number will decrease.

master/messages_kill_task (counter) This metric provides the number of kill task


messages.

master/slave_removals (counter) This metric provides the number of agents that were not
re-registered during master failover. This is a broad endpoint that combines
../reason_unhealthy, ../reason_unregistered, and ../reason_registered. You can
monitor this explicitly or leverage master/slave_removals/reason_unhealthy,
master/slave_removals/reason_unregistered, and
master/slave_removals/reason_registered for specifics.
master/slave_removals/reason_unhealthy (counter) This metric provides the number of
agents failed because of failed health checks. This endpoint returns the total number of
agents that were unhealthy.

master/slave_removals/reason_unregistered (counter) This metric provides the number


of agents unregistered. If this number increases drastically, this indicates that the master
or agent is unable to communicate properly. Use this endpoint to determine network
partition.

master/slave_removals/reason_registered (counter) This metric provides the number of


agents that were removed when new agents were registered at the same address. New agents
replaces old agents. This should be a rare event. If this number increases, your system
administrator should be notified (PagerDuty etc).

These metrics should not decrease over time


master/slaves_active (counter) This metric provides the number of active agents. The
number of active agents is calculated by adding slaves_connected and
slaves_disconnected.
master/slaves_connected (counter) This metric provides the number of connected agents.
This number should equal the total number of Mesos agents (slaves_active). Use this metric
to determine the general health of your cluster as a percentage of the total.

master/elected (gauge) This metric indicates whether this is the elected master. This
metric should be fetched from all masters, and the values should add up to 1. If the sum is
not 1 for a period of time, your system administrator should be notified (PagerDuty etc).

master/uptime_secs (gauge) This metric provides the master uptime, in seconds. This
number should be at least 5 minutes (300 seconds) to indicate a stable master. You can use
this metric to detect flapping. For example, if the master has an uptime of less than 1
minute (60 seconds) for more than 10 minutes, it has probably restarted 10 or more times.

master/messages_decline_offers (counter) This metric provides the number of declined


offers. This number should equal the number of agents x the number of frameworks. If this
number drops to a low value, something is probably getting starved.

Agent
These metrics should not decrease over time
slave/uptime_secs (gauge) This metric provides the agent uptime, in seconds. This number
should always be increasing. The moment this number resets to 0, the agent process has been
restarted. You can use this metric to detect flapping. For example, if the agent has an
uptime of less than 1 minute (60 seconds) for more than 10 minutes, it has probably
restarted 10 or more times.

slave/registered (gauge) This metric indicates whether this agent is registered with a
master. This value should always be 1. A indicates that the agent is looking to join a new
master.

General
Check the Marathon app health API endpoint for your critical applications.

Check for agents being shut down:

Tail the /var/log/mesos warning logs and watch for "Shutting down" messages.

Watch for increases in the Mesos endpoint that indicates how many agents have been shut down.

Check for Mesos masters with short uptimes, which is exposed in the Mesos metrics.

Change the mom-marathon-service logging level from WARN to INFO.

Modify the mesos-master log rotation configuration to store the complete logs for at least
one day.
Make sure the master nodes have plenty of disk space.

Change the log rotation option from rotate 7 to maxage 14 or more. For example:

...
/var/log/mesos/* {
  olddir /var/log/mesos/archive
  maxsize 2000k
  daily
  maxage 14
  copytruncate
  postrotate
    find /var/log/mesos /var/log/mesos/archive -mtime +14 -delete
  endscript
}
EOF
...

See the Apache Mesos documentation for Mesos basic alerts.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
MONITORING, LOGGING, AND DEBUGGING

Performance Monitoring
ENTERPRISE DC/OS Updated: April 17, 2017

Here are some recommendations for monitoring a DC/OS cluster. You can use any
monitoring tools. The endpoints listed below will help you troubleshoot when issues
occur.
Your monitoring tools should leverage historic data points so that you can track changes
and deviations. You should monitor your cluster when it is known to be in a healthy state
as well as unhealthy. This will give you a baseline for what is normal in the DC/OS
environment. With this historical data, you can fine tune your tools and set appropriate
thresholds and conditions. When these thresholds are exceeded, you can send alerts to
administrators.
Mesos and Marathon expose the following types of metrics:
Gauges are metrics that provide the current state at the moment they are queried.

Counters are metrics that are additive and include past and present results. These metrics
are not persisted across failover.

Marathon has a timer metric that measures how long an event takes. Timers do not exist for
Mesos observability metrics.

Marathon metrics
Marathon provides a number of metrics for monitoring. Here are the ones that are
particularly useful to DC/OS.

Lifetime metrics
service.mesosphere.marathon.uptime (gauge) This metric provides the uptime, in
milliseconds, of the reporting Marathon process. Use this metric to diagnose stability
problems that can cause Marathon to restart.

service.mesosphere.marathon.leaderDuration (gauge) This metric provides the amount of
time, in milliseconds, since the last leader election occurred. Use this metric to diagnose
stability problems and determine the frequency of leader election.

Running tasks
service.mesosphere.marathon.task.running.count (gauge) This metric provides the
number of tasks that are running.

Staged tasks
service.mesosphere.marathon.task.staged.count (gauge) This metric provides the number
of tasks that are staged. Tasks are staged immediately after they are launched. A
consistently high number of staged tasks indicates a high number of tasks are being stopped
and restarted. This can be caused by either:
A high number of app updates or manual restarts.

Apps with stability problems that are automatically restarted frequently.

Task update status

service.mesosphere.marathon.core.task.update.impl.ThrottlingTaskStatusUpdateProcessor.queued
(gauge) This metric provides the number of queued status updates.

service.mesosphere.marathon.core.task.update.impl.ThrottlingTaskStatusUpdateProcessor.processing
(gauge) This metric provides the number of status updates that are currently being processed.

service.mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl.publishFuture
(timer) This metric measures how long it takes Marathon to process status updates.

App and group count

service.mesosphere.marathon.app.count (gauge) This metric provides the number of apps
that are defined. The number of apps defined affects the performance of Marathon: the more
apps that are defined, the lower the Marathon performance.

service.mesosphere.marathon.group.count (gauge) This metric provides the number of
groups that are defined. The number of groups that are defined affects the performance of
Marathon: the more groups that are defined, the lower the Marathon performance.

Communication between Marathon and Mesos

If healthy, these metrics should always be increasing.

service.mesosphere.marathon.core.launcher.impl.OfferProcessorImpl.incomingOffers
This metric provides the number of offers that Marathon is receiving from Mesos.

service.mesosphere.marathon.MarathonScheduler.resourceOffers This Dropwizard
metric measures the number of resource offers that Marathon receives from Mesos.

service.mesosphere.marathon.MarathonScheduler.statusUpdate This Dropwizard metric
measures the number of status updates that Marathon receives from Mesos.

Mesos metrics
Mesos provides a number of metrics for monitoring. Here are the ones that are
particularly useful to DC/OS.

Master
These metrics should not increase over time

If these metrics increase, something is probably wrong.


master/slave_reregistrations (counter) This metric provides the number of agent re-
registrations and restarts. Use this metric along with historical data to detect
deviations and spikes when a network partition occurs. If this number drastically
increases, the cluster has experienced an outage but has since reconnected.

master/slave_removals (counter) This metric provides the number of agents removed for
various reasons, including maintenance. Use this metric to determine network partitions
after a large number of agents have disconnected. If this number greatly deviates from the
previous number, your system administrator should be notified (for example, via PagerDuty).

master/tasks_error (counter) This metric provides the number of invalid tasks.


master/tasks_failed (counter) This metric provides the number of failed tasks.
master/tasks_killed (counter) This metric provides the number of killed tasks.
master/tasks_lost (counter) This metric provides the number of lost tasks. A lost task
means a task was killed or disconnected by an external factor. Watch this metric for large
deviations from the historical number of lost tasks.

master/slaves_disconnected (gauge) This metric provides the number of disconnected
agents. This metric is helpful along with master/slave_removals. If an agent disconnects,
this number increases. If an agent reconnects, this number decreases.

master/messages_kill_task (counter) This metric provides the number of kill-task
messages.

master/slave_removals (counter) This metric provides the number of agents that were not
re-registered during master failover. This is a broad endpoint that combines
../reason_unhealthy, ../reason_unregistered, and ../reason_registered. You can
monitor this explicitly or leverage master/slave_removals/reason_unhealthy,
master/slave_removals/reason_unregistered, and
master/slave_removals/reason_registered for specifics.
master/slave_removals/reason_unhealthy (counter) This metric provides the number of
agents that were removed because they failed health checks. This endpoint returns the total
number of agents that were unhealthy.

master/slave_removals/reason_unregistered (counter) This metric provides the number
of agents that unregistered. If this number increases drastically, it indicates that the
master or agent is unable to communicate properly. Use this endpoint to detect network
partitions.

master/slave_removals/reason_registered (counter) This metric provides the number of
agents that were removed when new agents were registered at the same address. New agents
replace old agents. This should be a rare event. If this number increases, your system
administrator should be notified (for example, via PagerDuty).

These metrics should not decrease over time

master/slaves_active (counter) This metric provides the number of active agents. The
number of active agents is calculated by adding slaves_connected and
slaves_disconnected.

master/slaves_connected (counter) This metric provides the number of connected agents.
This number should equal the total number of Mesos agents (slaves_active). Use this metric
to determine the general health of your cluster as a percentage of the total.

master/elected (gauge) This metric indicates whether this is the elected master. This
metric should be fetched from all masters; the values should add up to 1. If this number is
not 1 for a period of time, your system administrator should be notified (for example, via
PagerDuty).

master/uptime_secs (gauge) This metric provides the master uptime, in seconds. This
number should be at least 5 minutes (300 seconds) to indicate a stable master. You can use
this metric to detect flapping. For example, if the master has an uptime of less than 1
minute (60 seconds) for more than 10 minutes, it has probably restarted 10 or more times.

master/messages_decline_offers (counter) This metric provides the number of declined
offers. This number should equal the number of agents multiplied by the number of
frameworks. If this number drops to a low value, something is probably getting starved.

Agent
These metrics should not decrease over time
slave/uptime_secs (gauge) This metric provides the agent uptime, in seconds. This number
should always be increasing. The moment this number resets to 0, it indicates that the
agent process has been restarted. You can use this metric to detect flapping. For example,
if the agent has an uptime of less than 1 minute (60 seconds) for more than 10 minutes, it
has probably restarted 10 or more times.

slave/registered (gauge) This metric indicates whether this agent is registered with a
master. This value should always be 1. A value of 0 indicates that the agent is looking to
join a new master.

General
Check the Marathon app health API endpoint for your critical applications.

Check for agents being shut down:

Tail the /var/log/mesos warning logs and watch for "Shutting down" messages.

Watch for increases in the Mesos endpoint that indicates how many agents have been shut down.

Check for Mesos masters with short uptimes, which is exposed in the Mesos metrics.

Change the mom-marathon-service logging level from WARN to INFO.

Modify the mesos-master log rotation configuration to store the complete logs for at least
one day.
Make sure the master nodes have plenty of disk space.

Change the log rotation option from rotate 7 to maxage 14 or more. For example:

...
/var/log/mesos/* {
  olddir /var/log/mesos/archive
  maxsize 2000k
  daily
  maxage 14
  copytruncate
  postrotate
    find /var/log/mesos /var/log/mesos/archive -mtime +14 -delete
  endscript
}
EOF
...

See the Apache Mesos documentation for Mesos basic alerts.


MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
MONITORING, LOGGING, AND DEBUGGING

Performance Monitoring
Updated: April 17, 2017

Here are some recommendations for monitoring a DC/OS cluster. You can use any
monitoring tools. The endpoints listed below will help you troubleshoot when issues
occur.

Your monitoring tools should leverage historic data points so that you can track changes
and deviations. You should monitor your cluster when it is known to be in a healthy state
as well as unhealthy. This will give you a baseline for what is normal in the DC/OS
environment. With this historical data, you can fine tune your tools and set appropriate
thresholds and conditions. When these thresholds are exceeded, you can send alerts to
administrators.

Mesos and Marathon expose the following types of metrics:


Gauges are metrics that provide the current state at the moment they are queried.

Counters are metrics that are additive and include past and present results. These metrics
are not persisted across failover.

Marathon has a timer metric that measures how long an event takes. Timers do not exist for
Mesos observability metrics.

Marathon metrics
Marathon provides a number of metrics for monitoring. Here are the ones that are
particularly useful to DC/OS. You can query the metrics HTTP endpoint in your DC/OS
cluster at <Master-Public-IP>/marathon/metrics.
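
For example, assuming the DC/OS CLI is authenticated against your cluster, a request like the
following is one way to pull those metrics for inspection. This is a minimal sketch that
reaches the same /marathon/metrics path through the cluster URL stored in the CLI
configuration; adjust the address and authentication to your environment.

```bash
# Sketch: fetch the Marathon metrics JSON and pretty-print it.
# The token and cluster URL are read from the DC/OS CLI configuration.
curl -s -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
  "$(dcos config show core.dcos_url)/marathon/metrics" | python -m json.tool
```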

Lifetime metrics
service.mesosphere.marathon.uptime (gauge) This metric provides the uptime, in
milliseconds, of the reporting Marathon process. Use this metric to diagnose stability
problems that can cause Marathon to restart.

service.mesosphere.marathon.leaderDuration (gauge) This metric provides the amount of
time, in milliseconds, since the last leader election occurred. Use this metric to diagnose
stability problems and determine the frequency of leader election.

Running tasks
service.mesosphere.marathon.task.running.count (gauge) This metric provides the
number of tasks that are running.
Staged tasks
service.mesosphere.marathon.task.staged.count (gauge) This metric provides the number
of tasks that are staged. Tasks are staged immediately after they are launched. A
consistently high number of staged tasks indicates a high number of tasks are being stopped
and restarted. This can be caused by either:
A high number of app updates or manual restarts.

Apps with stability problems that are automatically restarted frequently.

Task update status

service.mesosphere.marathon.core.task.update.impl.ThrottlingTaskStatusUpdateProcessor.queued
(gauge) This metric provides the number of queued status updates.

service.mesosphere.marathon.core.task.update.impl.ThrottlingTaskStatusUpdateProcessor.processing
(gauge) This metric provides the number of status updates that are currently being processed.

service.mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl.publishFuture
(timer) This metric measures how long it takes Marathon to process status updates.

App and group count

service.mesosphere.marathon.app.count (gauge) This metric provides the number of apps
that are defined. The number of apps defined affects the performance of Marathon: the more
apps that are defined, the lower the Marathon performance.

service.mesosphere.marathon.group.count (gauge) This metric provides the number of
groups that are defined. The number of groups that are defined affects the performance of
Marathon: the more groups that are defined, the lower the Marathon performance.

Communication between Marathon and Mesos

If healthy, these metrics should always be increasing.

service.mesosphere.marathon.core.launcher.impl.OfferProcessorImpl.incomingOffers
This metric provides the number of offers that Marathon is receiving from Mesos.

service.mesosphere.marathon.MarathonScheduler.resourceOffers This Dropwizard
metric measures the number of resource offers that Marathon receives from Mesos.

service.mesosphere.marathon.MarathonScheduler.statusUpdate This Dropwizard metric
measures the number of status updates that Marathon receives from Mesos.

Mesos metrics
Mesos provides a number of metrics for monitoring. Here are the ones that are
particularly useful to DC/OS.
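
As a rough sketch of how you might spot-check one of these values from a shell, the snippet
below reads the master metrics snapshot and prints the master uptime. It assumes the metrics
are reachable through the cluster URL at the /mesos/metrics/snapshot path and that jq is
installed; both are assumptions about your environment, not part of this guide.

```bash
# Sketch: read the Mesos master metrics snapshot and print master/uptime_secs.
curl -s -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
  "$(dcos config show core.dcos_url)/mesos/metrics/snapshot" \
  | jq '."master/uptime_secs"'
```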

Master
These metrics should not increase over time

If these metrics increase, something is probably wrong.


master/slave_reregistrations (counter) This metric provides the number of agent re-
registrations and restarts. Use this metric along with historical data to detect
deviations and spikes when a network partition occurs. If this number drastically
increases, the cluster has experienced an outage but has since reconnected.

master/slave_removals (counter) This metric provides the number of agents removed for
various reasons, including maintenance. Use this metric to determine network partitions
after a large number of agents have disconnected. If this number greatly deviates from the
previous number, your system administrator should be notified (for example, via PagerDuty).

master/tasks_error (counter) This metric provides the number of invalid tasks.


master/tasks_failed (counter) This metric provides the number of failed tasks.
master/tasks_killed (counter) This metric provides the number of killed tasks.
master/tasks_lost (counter) This metric provides the number of lost tasks. A lost task
means a task was killed or disconnected by an external factor. Watch this metric for large
deviations from the historical number of lost tasks.

master/slaves_disconnected (gauge) This metric provides the number of disconnected
agents. This metric is helpful along with master/slave_removals. If an agent disconnects,
this number increases. If an agent reconnects, this number decreases.

master/messages_kill_task (counter) This metric provides the number of kill-task
messages.

master/slave_removals (counter) This metric provides the number of agents that were not
re-registered during master failover. This is a broad endpoint that combines
../reason_unhealthy, ../reason_unregistered, and ../reason_registered. You can
monitor this explicitly or leverage master/slave_removals/reason_unhealthy,
master/slave_removals/reason_unregistered, and
master/slave_removals/reason_registered for specifics.
master/slave_removals/reason_unhealthy (counter) This metric provides the number of
agents that were removed because they failed health checks. This endpoint returns the total
number of agents that were unhealthy.

master/slave_removals/reason_unregistered (counter) This metric provides the number
of agents that unregistered. If this number increases drastically, it indicates that the
master or agent is unable to communicate properly. Use this endpoint to detect network
partitions.

master/slave_removals/reason_registered (counter) This metric provides the number of
agents that were removed when new agents were registered at the same address. New agents
replace old agents. This should be a rare event. If this number increases, your system
administrator should be notified (for example, via PagerDuty).

These metrics should not decrease over time

master/slaves_active (counter) This metric provides the number of active agents. The
number of active agents is calculated by adding slaves_connected and
slaves_disconnected.

master/slaves_connected (counter) This metric provides the number of connected agents.
This number should equal the total number of Mesos agents (slaves_active). Use this metric
to determine the general health of your cluster as a percentage of the total.

master/elected (gauge) This metric indicates whether this is the elected master. This
metric should be fetched from all masters; the values should add up to 1. If this number is
not 1 for a period of time, your system administrator should be notified (for example, via
PagerDuty).

master/uptime_secs (gauge) This metric provides the master uptime, in seconds. This
number should be at least 5 minutes (300 seconds) to indicate a stable master. You can use
this metric to detect flapping. For example, if the master has an uptime of less than 1
minute (60 seconds) for more than 10 minutes, it has probably restarted 10 or more times.

master/messages_decline_offers (counter) This metric provides the number of declined
offers. This number should equal the number of agents multiplied by the number of
frameworks. If this number drops to a low value, something is probably getting starved.

Agent
These metrics should not decrease over time
slave/uptime_secs (gauge) This metric provides the agent uptime, in seconds. This number
should always be increasing. The moment this number resets to 0, it indicates that the
agent process has been restarted. You can use this metric to detect flapping. For example,
if the agent has an uptime of less than 1 minute (60 seconds) for more than 10 minutes, it
has probably restarted 10 or more times.

slave/registered (gauge) This metric indicates whether this agent is registered with a
master. This value should always be 1. A value of 0 indicates that the agent is looking to
join a new master.

General
Check the Marathon app health API endpoint for your critical applications.

Check for agents being shut down:

Tail the /var/log/mesos warning logs and watch for "Shutting down" messages.

Watch for increases in the Mesos endpoint that indicates how many agents have been shut down.

Check for Mesos masters with short uptimes, which is exposed in the Mesos metrics.

Change the mom-marathon-service logging level from WARN to INFO.

Modify the mesos-master log rotation configuration to store the complete logs for at least
one day.
Make sure the master nodes have plenty of disk space.

Change the log rotation option from rotate 7 to maxage 14 or more. For example:

...
/var/log/mesos/* {
  olddir /var/log/mesos/archive
  maxsize 2000k
  daily
  maxage 14
  copytruncate
  postrotate
    find /var/log/mesos /var/log/mesos/archive -mtime +14 -delete
  endscript
}
EOF
...

See the Apache Mesos documentation for Mesos basic alerts.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
MONITORING, LOGGING, AND DEBUGGING

Logging
PREVIEW Updated: April 17, 2017

DC/OS cluster nodes generate logs that contain diagnostic and status information for
DC/OS core components and DC/OS services.

Service, Task, and Node Logs


The logging component provides an HTTP API (/system/v1/logs/), which exposes the
system logs.

You can access information about DC/OS scheduler services, like Marathon or Kafka,
with the following CLI command:

dcos service log --follow <scheduler-service-name>

You can access DC/OS task logs by running this CLI command:

dcos task log --follow <service-name>

You can access the logs for the master node with the following CLI command:

dcos node log --leader

To access the logs for an agent node, run dcos node to get the Mesos IDs of your nodes,
then run the following CLI command:

dcos node log --mesos-id=<node-id>

You can download all the log files for your service from the Services > Services tab in the
DC/OS GUI. You can also monitor stdout/stderr. For more information, see the Service and
Task Logs quick start guide.

System Logs
DC/OS components use systemd-journald to store their logs. To access the DC/OS core
component logs, SSH into a node and run this command to see all logs:

journalctl -u "dcos-*" -b

You can view the logs for specific components by entering the component name. For
example, to access Admin Router logs, run this command:
journalctl -u dcos-nginx -b

You can find which components are unhealthy in the DC/OS GUI from the Nodes tab.


Aggregation
Unfortunately, streaming logs from machines in your cluster isn't always viable.
Sometimes, you need the logs stored somewhere else as a history of what's happened.
This is where log aggregation is required. Check out how to get it set up with some
of the most common solutions:

ELK
Splunk
MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
MONITORING, LOGGING, AND DEBUGGING

Debugging from the DC/OS Web Interface
PREVIEW Updated: April 17, 2017

You can debug your service or pod from the DC/OS web interface.

Service and Pod Health and Status Summaries


If you have added a Marathon health check to your service or pod, the Services Health
box on the DC/OS dashboard will report the health of your service or pod.

The Services > Services page lists each service or pod, the resources it has requested,
and its status. Possible statuses are Deploying, Waiting, or Running. If you have set up a
Marathon health check, you can also see the health of your service or pod: a green dot
for healthy and a red dot for unhealthy. If you have not set up a health check, the dot will
be gray.

Debugging Page
Clicking the name of a service or pod and then the Debug tab reveals a detailed
debugging page. There, you will see sections for Last Changes, Last Task Failure, Task
Statistics, and Recent Resource Offers. You will also see a Summary of resource offers and
what percentage of those offers matched your pod's or service's requirements, as well as a
Details section that lists the host where your service or pod is running and which
resource offers were successful and unsuccessful for each deployment. You can use the
information on this page to learn where and how you need to modify your service or pod
definition.
MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
MONITORING, LOGGING, AND DEBUGGING

Debugging
PREVIEW Updated: April 17, 2017
DC/OS offers several tools to debug your services when they are stuck in deployment or are
not behaving as you expect. This topic discusses how to debug your services using both the
DC/OS CLI and the DC/OS web interface.

The dcos task exec command allows you to execute an arbitrary command inside a
task's container and stream its output back to your local terminal. It offers an experience
very similar to docker exec.

Users do not need SSH keys to execute the dcos task exec command. Enterprise
DC/OS provides several debugging permissions so that users do not need the
dcos:superuser permission either.

You can execute this command in any of the following four modes.
dcos task exec <task-id> <command> (no flags): streams STDOUT and STDERR from the
remote terminal to your local terminal as raw bytes.

dcos task exec --tty <task-id> <command>: streams STDOUT and STDERR from the
remote terminal to your local terminal, but not as raw bytes. Instead, this option puts
your local terminal into raw mode, allocates a remote pseudo terminal (PTY), and
streams the STDOUT and STDERR through the remote PTY.

dcos task exec --interactive <task-id> <command>: streams STDOUT and STDERR
from the remote terminal to your local terminal and streams STDIN from your local
terminal to the remote command.

dcos task exec --interactive --tty <task-id> <command>: streams STDOUT and
STDERR from the remote terminal to your local terminal and streams STDIN from
your local terminal to the remote terminal. Also puts your local terminal into raw mode;
allocates a remote pseudo terminal (PTY); and streams STDOUT, STDERR, and
STDIN through the remote PTY. This mode offers the maximum functionality.

Note: If your mode streams raw bytes, you won't be able to launch programs like vim
because these programs require the use of control characters.

Tip: We have included the text of the full flags above for readability, but each one can be
shortened. Instead of typing --interactive, you can just type -i. Likewise, instead of
typing --tty, you can just type -t.
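
For instance, to get an interactive shell inside a task's container (assuming a shell such as
bash exists in that container and <task-id> comes from the output of dcos task), you could
combine both short flags:

```bash
# Sketch: interactive pseudo-terminal session inside the task's container.
dcos task exec -it <task-id> bash
```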

Requirement: To use the debugging feature, the service or job must be launched using
either the Mesos container runtime or the Universal container runtime. Debugging cannot
be used on containers launched with the Docker runtime. See Using Mesos
Containerizers for more information.

For more information, see:


Command reference

Quick Start

Provisioning a user with debugging permissions


MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9

Jobs
PREVIEW Updated: April 17, 2017

Welcome to the documentation for DC/OS version 1.9. For information about new and
changed features, see the release notes.
Quick Start
You can create and administer jobs in the DC/OS web interface, from the DC/OS CLI, or
via the API. DC/OS Web Interface Note: The DC/OS web interface provides a subset of
the CLI an...

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
JOBS

Quick Start
PREVIEW Updated: April 17, 2017

You can create and administer jobs in the DC/OS web interface, from the DC/OS CLI, or
via the API.

DC/OS Web Interface


Note: The DC/OS web interface provides a subset of the CLI and API functionality. For
advanced job configurations, use the dcos job commands or the Jobs API.

Add a Job
From the DC/OS web interface, click the Jobs tab, then the Create a Job button. Fill in
the following fields, or toggle to JSON mode to edit the JSON directly.

General Tab
ID The ID of your job.

Description A description of your job.

CPUs The amount of CPU your job requires.

Mem The amount of memory, in MiB, your job requires.

Disk space The amount of disk space, in MiB, your job requires.

Command The command your job will execute. Leave this blank if you will use a Docker
image.

Schedule Tab
Check Run on a Schedule to reveal the following fields.
* Cron Schedule Specify the schedule in cron format. Use this crontab generator for
help.
* Time Zone Enter the time zone in TZ format, e.g. America/New_York.
* Starting Deadline This is the time, in seconds, to start the job if it misses its
scheduled time for any reason. Missed job executions are counted as failed.

Docker Container Tab


Image Enter the Docker image you will use to specify the action of your job, if you
are using one.

Labels
Label Name and Label Value Attach metadata to your jobs so you can filter them.
Learn more about labels.

Modify, View, or Remove a Job


From the Jobs tab, click the name of your job and then the menu on the upper right to
modify or delete it. While the job is running you can click the job instance to drill down to
Details, Files, and Logs data.

DC/OS CLI
You can create and manage jobs from the DC/OS CLI using dcos job commands. To see
a full list of available commands, run dcos job --help.

Add a Job
Create a job file in JSON format. The id parameter is the job ID. You will use this ID later to
manage your job.

{ "id": "myjob", "description": "A job that sleeps regularly", "run": { "cmd": "sleep
20000", "cpus": 0.01, "mem": 32, "disk": 0 }, "schedules": [ { "id": "sleep-schedule",
"enabled": true, "cron": "20 0 * * *", "concurrencyPolicy": "ALLOW" } ] }

Note: You can only assign one schedule to a job.

Add the job:


$ dcos job add <myjob>.json

Note: You can choose any name for your job file.

Go to the Jobs tab of the DC/OS web interface to verify that you have added your job,
or verify from the CLI:

$ dcos job list

Schedule-Only JSON
If you use the same schedule for more than one job, you can create a separate JSON file
for the schedule. Use the $ dcos job schedule add <job-id> <schedule-file> command
to associate a job with the schedule.

{ "concurrencyPolicy": "ALLOW", "cron": "20 0 * * *", "enabled": true, "id": "nightly",


"nextRunAt": "2016-07-26T00:20:00.000+0000", "startingDeadlineSeconds": 900, "timezone":
"UTC" }

Remove a Job
Enter the following command on the DC/OS CLI:

$ dcos job remove <job-id>

Go to the Jobs tab of the DC/OS web interface to verify that you have removed your job, or
verify from the CLI:

$ dcos job list

Modify a Job
To modify your job, update your JSON job file, then run:

$ dcos job update <job-file>.json

Modify a Job's Schedule

You can update the schedule of your job in two ways, depending on whether your job has a
schedule specified in the <job-file>.json or if your job's schedule is kept in a separate
file.

Modify a Job with a Schedule

Modify the schedules parameter of your <job-file>.json. Then run

$ dcos job update <job-file>.json

Modify a Job with a Separate Schedule file

Modify your <schedule-file>.json. Then, run one of the following commands:

$ dcos job schedule add <job-id> <schedule-file>.json $ dcos job schedule remove <job-id>
<schedule-id> $ dcos job schedule update <job-id> <schedule-file>.json

View Job Details


List all jobs:

$ dcos job list

List all previous runs of your job:

$ dcos job history <job-id>

To view details about your job, run:

$ dcos job show <job-id>

To view details about your job's schedule, run:

$ dcos job schedule show <job-id>

Read Job logs


Inspect the log for your job:

$ dcos task log --completed <job-id>

To get the log for only a specific job run, use a job run ID from dcos job history <job-id>:

$ dcos task log --completed <job-run-id>


Jobs API
You can also create and administer jobs via the API. View the full API here.

Note: The DC/OS CLI and web interface support a combined JSON format (accessed via
the /v0 endpoint) that allows you to specify a schedule in the job descriptor. To schedule
a job via the API, use two calls: one to add an unscheduled job and another to associate
a <schedule-file>.json with the job.

Add a Job
The following command adds a job defined in the file myjob.json.

$ curl -X POST -H "Content-Type: application/json" -H "Authorization: token=$(dcos config


show core.dcos_acs_token)" $(dcos config show core.dcos_url)/service/metronome/v1/jobs -
d@/Users/<your-username>/<myjob>.json

Remove a Job
The following command removes a job regardless of whether the job is running:

$ curl -X DELETE -H "Authorization: token=$(dcos config show core.dcos_acs_token)" $(dcos


config show core.dcos_url)/service/metronome/v1/jobs/<myjob>?stopCurrentJobRuns=true

To remove a job only if it is not running, set stopCurrentJobRuns to False.
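
As a sketch, the same DELETE call with stopCurrentJobRuns disabled would look like this
(placeholders as in the example above):

```bash
# Remove the job only if no runs are currently in progress.
curl -X DELETE -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
  "$(dcos config show core.dcos_url)/service/metronome/v1/jobs/<myjob>?stopCurrentJobRuns=false"
```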

Modify or View a Job


The following command shows all jobs:

$ curl -H "Authorization: token=$(dcos config show core.dcos_acs_token)" $(dcos config show


core.dcos_url)/service/metronome/v1/jobs

The following command lists job runs:

$ curl -H "Authorization: token=$(dcos config show core.dcos_acs_token)" "$(dcos config show


core.dcos_url)/service/metronome/v1/jobs/<myjob>/runs/"

Stop a run with the following command:

$ curl -X POST -H "Authorization: token=$(dcos config show core.dcos_acs_token)" "$(dcos


config show
core.dcos_url)/service/metronome/v1/jobs/<myjob>/runs/20160725212507ghwfZ/actions/stop"
Add a Schedule to a Job
The following command adds a schedule to a job:

$ curl -X POST -H "Content-Type: application/json" -H "Authorization: token=$(dcos config


show core.dcos_acs_token)" $(dcos config show core.dcos_url)/service/metronome/v1/jobs/<job-
id>/schedules -d@/Users/<your-username>/<schedule-file>.json

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9

Tutorials
ENTERPRISE DC/OS Updated: April 17, 2017

Welcome to the documentation for DC/OS version 1.9. For information about new and
changed features, see the release notes.

Building an IoT Pipeline


In this tutorial, a containerized Ruby on Rails app named Tweeter is installed and
deployed using DC/OS. Tweeter is an app similar to Twitter that you can use to post 140-
character...
Autoscaling with Marathon
You can use autoscaling to automatically increase or decrease computing resources
based on usage so that you're using only the resources you need. Here are some
examples to s...
Creating and Running a Service
This tutorial shows how to create and deploy a simple one-command service and a
containerized service using both the DC/OS web interface and the CLI. Prerequisites
Deploying Marathon Apps with Jenkins
About Deploying Applications on Marathon This tutorial shows how to deploy applications
on Marathon using Jenkins for DC/OS. We'll walk you through creating a new Jenkins
job...
Labeling Tasks and Jobs
This tutorial illustrates how labels can be defined using the DC/OS web interface and the
Marathon HTTP API, and how information pertaining to applications and jobs that are
runnin...

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
TUTORIALS

Building an IoT Pipeline


ENTERPRISE DC/OS Updated: April 17, 2017

In this tutorial, a containerized Ruby on Rails app named Tweeter is installed and
deployed using DC/OS. Tweeter is an app similar to Twitter that you can use to post 140-
character messages to the internet. Then, you use Zeppelin to perform real-time
analytics on the data created by Tweeter.

Tweeter:
Stores tweets in the DC/OS Cassandra service.

Streams tweets to the DC/OS Kafka service in real-time.

Performs real-time analytics with the DC/OS Spark and Zeppelin services.

This tutorial uses DC/OS to launch and deploy these microservices to your cluster.
The Cassandra database is used on the backend to store the Tweeter app data.

The Kafka publish-subscribe message service receives tweets from Cassandra and routes them
to Zeppelin for real-time analytics.

The Marathon load balancer (Marathon-LB) is an HAProxy based load balancer for Marathon
only. It is useful when you require external routing or layer 7 load balancing features.

Zeppelin is an interactive analytics notebook that works with DC/OS Spark on the backend to
enable interactive analytics and visualization. Because it's possible for Spark and Zeppelin
to consume all of your cluster resources, you must specify a maximum number of cores for the
Zeppelin service.

This tutorial demonstrates how you can build a complete IoT pipeline on DC/OS in about
15 minutes! You will learn:
How to install DC/OS services.

How to add apps to DC/OS Marathon.

How to route public traffic to the private application with Marathon-LB.

How your apps are discovered.

How to scale your apps.

Prerequisites:
Enterprise DC/OS is installed with:
Security mode set to permissive or strict. By default, DC/OS installs in permissive
security mode.

Minimum 5 private agents and 1 public agent.


DC/OS CLI installed.

The public IP address of your public agent node. After you have installed DC/OS with a
public agent node declared, you can look up the public IP address of that node.

Git:
OS X: Get the installer from Git downloads.

Unix/Linux: See these installation instructions.

Install the DC/OS services you'll need


From the DC/OS web interface Universe -> Packages tab, install Cassandra, Kafka, and
Zeppelin.

Tip: You can also install DC/OS packages from the DC/OS CLI with the dcos package
install command.
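
For example, a CLI-only install of the packages used in this tutorial could look like the
sketch below (default options; Zeppelin is given its advanced configuration in the steps
that follow):

```bash
# Sketch: install the tutorial's packages from the DC/OS CLI instead of the Universe UI.
dcos package install cassandra
dcos package install kafka
dcos package install zeppelin
```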

Find the cassandra package and click the INSTALL PACKAGE button and accept the default
installation. Cassandra will spin up to at least 3 nodes.

Find the kafka package and click the INSTALL button and accept the default installation.
Kafka will spin up 3 brokers.

Install Zeppelin.
Find the zeppelin package and click the INSTALL button and then choose the ADVANCED
INSTALLATION option.
Click the spark tab and set cores_max to 8.

Click REVIEW AND INSTALL and INSTALL to complete your installation.

Install Marathon-LB.
Install the security CLI (dcos-enterprise-cli) by using the DC/OS CLI package install
commands. You will use this to partially configure the Marathon-LB security.
Search for the security CLI package repository by using the dcos package search command. In
this example the partial value enterprise* is used as an argument.

dcos package search enterprise*

Here is the output:

NAME                 VERSION  SELECTED  FRAMEWORK  DESCRIPTION
dcos-enterprise-cli  1.0.7    False     False      Enterprise DC/OS CLI

Install the security CLI package.

dcos package install dcos-enterprise-cli

Here is the output:

Installing CLI subcommand for package [dcos-enterprise-cli] version [1.0.7]
New command available: dcos security

Configure service authentication for Marathon-LB.


Create a public-private key pair by using the security CLI.

dcos security org service-accounts keypair private-key.pem public-key.pem

Create a new service account with the ID marathon-lb-service-acct. This command uses the
public-key.pem created in the previous step.

dcos security org service-accounts create -p public-key.pem -d "Marathon-LB service account" marathon-lb-service-acct

Create a new secret (marathon-lb-secret) using the private key (private-key.pem) and the
name of the service account (marathon-lb-service-acct).

dcos security secrets create-sa-secret private-key.pem marathon-lb-service-acct marathon-lb-secret

You can verify that the secret was created successfully with this command.

dcos security secrets list /

You should see output similar to this:

- marathon-lb-secret

Assign the Marathon-LB permissions.


Run this command to get the DC/OS certificate for your cluster, where <master-ip> is your
master IP address.

curl -k -v https://<master-ip>/ca/dcos-ca.crt

You should see output similar to this:

> GET /ca/dcos-ca.crt HTTP/1.1
> Host: 54.149.23.77
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: openresty/1.9.15.1
< Date: Tue, 11 Oct 2016 18:30:49 GMT
< Content-Type: application/x-x509-ca-cert
< Content-Length: 1241
< Last-Modified: Tue, 11 Oct 2016 15:17:28 GMT
< Connection: keep-alive
< ETag: "57fd0288-4d9"
< Accept-Ranges: bytes
<
-----BEGIN CERTIFICATE-----
MIIDaDCCAlCgAwI...
-----END CERTIFICATE-----

Copy the contents of dcos-ca.crt between -----BEGIN CERTIFICATE----- and
-----END CERTIFICATE-----, and save as dcos-cert.pem.

Create the necessary permissions by using the dcos-cert.pem file.

curl -X PUT --cacert dcos-cert.pem \
  -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
  $(dcos config show core.dcos_url)/acs/api/v1/acls/dcos:service:marathon:marathon:services:%252F \
  -d '{"description":"Allows access to any service launched by the native Marathon instance"}' \
  -H 'Content-Type: application/json'

curl -X PUT --cacert dcos-cert.pem \
  -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
  $(dcos config show core.dcos_url)/acs/api/v1/acls/dcos:service:marathon:marathon:admin:events \
  -d '{"description":"Allows access to Marathon events"}' \
  -H 'Content-Type: application/json'

Grant the permissions and the allowed action to the service account.

curl -X PUT --cacert dcos-cert.pem \
  -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
  $(dcos config show core.dcos_url)/acs/api/v1/acls/dcos:service:marathon:marathon:services:%252F/users/marathon-lb-service-acct/read

curl -X PUT --cacert dcos-cert.pem \
  -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
  $(dcos config show core.dcos_url)/acs/api/v1/acls/dcos:service:marathon:marathon:admin:events/users/marathon-lb-service-acct/read

Install the Marathon-LB package by using the DC/OS CLI.


Create a config.json Marathon app definition file with these contents. A Marathon app
definition file specifies the required parameters for launching a containerized app with
Marathon.

{ "marathon-lb": { "secret_name": "marathon-lb-secret" } }

Install Marathon-LB from the DC/OS CLI with the config.json file specified.

dcos package install --options=config.json marathon-lb

Monitor the Services tab to watch as your microservices are deployed on DC/OS. You will see the
Health status go from Idle to Unhealthy, and finally to Healthy as the nodes come online. This
may take several minutes.

Note: It can take up to 10 minutes for Cassandra to initialize with DC/OS because of race
conditions.

Deploy the containerized app


In this step you deploy the containerized Tweeter app to a public node.
Navigate to the Tweeter GitHub repository and save the /tweeter/tweeter.json Marathon app
definition file.

Add the HAPROXY_0_VHOST definition with the public IP address of your public agent node to
your tweeter.json file.

Important: You must remove the leading http:// and the trailing /.

... ], "labels": { "HAPROXY_GROUP": "external", "HAPROXY_0_VHOST": "<public-agent-IP>" } }

In this example, a DC/OS cluster is running on AWS:

... ], "labels": { "HAPROXY_GROUP": "external", "HAPROXY_0_VHOST": "joel-ent-publicsl-


e7wjol669l9f-741498241.us-west-2.elb.amazonaws.com" } }

Install and deploy Tweeter to your DC/OS cluster with this CLI command.

dcos marathon app add tweeter.json

Tip: The instances parameter in tweeter.json specifies the number of app instances.
Use the following command to scale your app up or down:

dcos marathon app update tweeter instances=<number_of_desired_instances>

The service talks to Cassandra via node-0.cassandra.mesos:9042, and Kafka via
broker-0.kafka.mesos:9557 in this example. Traffic is routed via Marathon-LB because of
the HAPROXY_0_VHOST definition in the tweeter.json app definition file.

Go to the Services tab to verify your app is up and healthy.

Navigate to your public agent node endpoint to see the Tweeter UI and post a tweet!

Tip: If you're having trouble, verify the HAPROXY_0_VHOST value in the tweeter.json file.

Post 100K Tweets


Deploy the post-tweets containerized app to see DC/OS load balancing in action. This
app automatically posts a large number of tweets from Shakespeare. The app will post
more than 100k tweets one by one, so you'll see them coming in steadily when you
refresh the page.
Navigate to the Tweeter GitHub repository and save the tweeter/post-tweets.json Marathon
app definition file.

Deploy the post-tweets.json Marathon app definition file.

dcos marathon app add post-tweets.json


After the post-tweets.json is running:
Refresh your browser to see the incoming Shakespeare tweets.

Click the Networking -> Service Addresses tab in the DC/OS web interface and select
the 1.1.1.1:30000 virtual network to see the load balancing in action.

The post-tweets app works by streaming to the VIP 1.1.1.1:30000. This address is
declared in the cmd parameter of the post-tweets.json app definition.

{ "id": "/post-tweets", "cmd": "bin/tweet shakespeare-tweets.json http://1.1.1.1:30000", ...

The Tweeter app uses the service discovery and load balancer service that is installed on
every DC/OS node. This address is defined in the tweeter.json definition VIP_0.

... { "containerPort": 3000, "hostPort": 0, "servicePort": 10000, "labels": { "VIP_0":


"1.1.1.1:30000" ...

Add Streaming Analytics


Next, you'll perform real-time analytics on the stream of tweets coming in from Kafka.
Navigate to Zeppelin at https://<master_ip>/service/zeppelin/, click Import Note and
import tweeter-analytics.json. Zeppelin is preconfigured to execute Spark jobs on the DC/OS
cluster, so there is no further configuration or setup required. Be sure to use https://, not
http://.
Tip: Your master IP address is the URL of the DC/OS web interface.

Navigate to Notebook > Tweeter Analytics.

Run the Load Dependencies step to load the required libraries into Zeppelin.

Run the Spark Streaming step, which reads the tweet stream from ZooKeeper and puts
them into a temporary table that can be queried using SparkSQL.

Run the Top Tweeters SQL query, which counts the number of tweets per user using the
table created in the previous step. The table updates continuously as new tweets come
in, so re-running the query will produce a different result every time.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
TUTORIALS
Autoscaling with Marathon
ENTERPRISE DC/OS Updated: April 17, 2017

You can use autoscaling to automatically increase or decrease computing resources
based on usage so that you're using only the resources you need. Here are some
examples to show you how to implement autoscaling for your services.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
TUTORIALS

Creating and Running a Service


Updated: April 17, 2017

This tutorial shows how to create and deploy a simple one-command service and a
containerized service using both the DC/OS web interface and the CLI.

Prerequisites

A DC/OS cluster

Create and Run a Simple Service from the DC/OS Web


Interface
Click the Services tab of the DC/OS web interface, then click RUN A SERVICE.
Click Single Container.
In the SERVICE ID field, enter a name for your service.
In the Command field, enter sleep 10.

Click MORE SETTINGS and choose your container runtime.

DOCKER ENGINE Use this option if you require specific features of the Docker package.
If you select this option, you must specify a Docker container image in the Container
Image field. For example, you can specify the Alpine Docker image.

MESOS RUNTIME Use this option if you prefer the original Mesos container runtime. It
does not support Docker containers.

UNIVERSAL CONTAINER RUNTIME Use this option if you are using Pods or GPUs. This
option also supports Docker images without depending on the Docker Engine. If you select
this option, you can optionally specify a Docker container image in the Container Image
field. For example, you can specify the Alpine Docker image.
For more information, see the containerizer documentation.

Click REVIEW & RUN.

That's it! Click the name of your service in the Services view to see it running and monitor
health.


Create and Run a Simple Service from the DC/OS CLI


Create a JSON file called my-app-cli.json with the following contents:

{ "id": "/my-app-cli", "cmd": "sleep 10", "instances": 1, "cpus": 1, "mem": 128, "disk": 0,
"gpus": 0, "backoffSeconds": 1, "backoffFactor": 1.15, "maxLaunchDelaySeconds": 3600,
"upgradeStrategy": { "minimumHealthCapacity": 1, "maximumOverCapacity": 1 },
"portDefinitions": [ { "protocol": "tcp", "port": 10000 } ], "requirePorts": false }

Run the service with the following command.


dcos marathon app add my-app-cli.json

Run the following command to verify that your service is running:
dcos marathon app list

You can also click the name of your service in the Services view of the DC/OS web
interface to see it running and monitor health.

Create and Run a Containerized Service from the


DC/OS Web Interface
Go to the hello-dcos page of the Mesosphere Docker Hub repository and note down the latest
image tag.

Click the Services tab of the DC/OS web interface, then click RUN A SERVICE.
Click Single Container and enter a name for your service in the SERVICE ID field.

Click the Container Settings tab and enter the following in the Container Image field:
mesosphere/hello-dcos:<image-tag>. Replace <image-tag> with the tag you copied in step
1.

![Containerized service in the DC/OS UI](/1.9/usage/tutorials/img/deploy-container-ui.png)

Click Deploy.
In the Services tab, click the name of your service, then choose one of the task instances.
Click Logs, then toggle to the STDERR and STDOUT to see the output of the service.

![Running containerized service in the DC/OS UI](/1.9/usage/tutorials/img/container-running-ui.png)

Create and Run a Containerized Service from the


DC/OS CLI
Go to the hello-dcos page of the Mesosphere Docker Hub repository and note down the latest
image tag.

Create a JSON file called hello-dcos-cli.json with the following contents. Replace <image-
tag> in the docker:image field with the tag you copied in step 1.
{
"id": "/hello-dcos-cli",
"instances": 1,
"cpus": 1,
"mem": 128,
"disk": 0,
"gpus": 0,
"backoffSeconds": 1,
"backoffFactor": 1.15,
"maxLaunchDelaySeconds": 3600,
"container": {
"docker": {
"image": "mesosphere/hello-dcos:<image-tag>",
"forcePullImage": false,
"privileged": false,
"network": "HOST"
}
},
"upgradeStrategy": {
"minimumHealthCapacity": 1,
"maximumOverCapacity": 1
},
"portDefinitions": [
{
"protocol": "tcp",
"port": 10001
}
],
"requirePorts": false
}
Run the service with the following command.
dcos marathon app add hello-dcos-cli.json

Run the following command to verify that your service is running:
dcos marathon app list

In the Services tab of the DC/OS web interface, click the name of your service, then
choose one of the task instances. Click Logs, then toggle to the Output (stdout) view to
see the output of the service.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
TUTORIALS

Deploying Marathon Apps with Jenkins
Updated: April 17, 2017

About Deploying Applications on Marathon
This tutorial shows how to deploy applications on Marathon using Jenkins for DC/OS.
We'll walk you through creating a new Jenkins job, publishing a Docker container on
source code changes, and deploying those changes to Marathon based on the
application definition contained in the project's marathon.json file.

Prerequisite:
This tutorial assumes that you have a working Jenkins installation and permission to
launch applications on Marathon. Jenkins for DC/OS must be installed as described on
the Jenkins Quickstart page.

The Example Project


The project used in this tutorial is taken from the cd-demo repository and runs a Jekyll
website inside a Docker container.

The required files for this tutorial are Dockerfile, marathon.json, and the site directory.
Copy those items to a new project and push to a new Git repository on the host of your
choice.

This tutorial uses Docker Hub to store the created image and requires account
information to perform this task.

Accessing Jenkins for DC/OS


Jenkins for DC/OS can be accessed through the Dashboard or Services navigation
menus within the DC/OS web interface.

Click the Jenkins service and then Open Service to access the Jenkins web interface.

Adding Docker Hub Credentials


Jenkins stores account credentials within its Credential Store, which allows jobs to utilize
credentials in a secure manner. From the main Jenkins page, click Credentials from the
left-hand menu. From there, select System (also from the left-hand menu) and finally
the Global credentials (unrestricted) link presented in the main viewing area. The left-
hand menu should now have an Add Credentials option.
Click Add Credentials to create a new credential for Docker Hub. The Kind drop-down
should have the Username with password option selected. Fill out the rest of the
information to match your Docker Hub account.

The Job
We'll create a new Jenkins job that performs several operations with Docker Hub and
then either update or create a Marathon application.

Create a new Freestyle job with a name that includes only lowercase letters and
hyphens. This name will be used later in the Docker image name and possibly as the
Marathon application ID.

SCM / Git
From the Example Project section above, fill in the Git repository URL with the newly
created Git repository. This must be accessible to Jenkins and may require adding
credentials to the Jenkins instance.

Build Triggers
Select the Poll SCM build trigger with a schedule of: */5 * * * *. This will check the Git
repository every five minutes for changes.

Build Steps
The Jenkins job performs these actions:
Build a new Docker image.

Push the new image to Docker Hub.

These steps can be performed by a single build step using the Docker Build and Publish
plugin, which is already included and ready for use. From the Add build step drop-down
list, select the Docker Build and Publish option.

The Repository Name is your Docker Hub username with /${JOB_NAME} attached to the
end (myusername/${JOB_NAME}); the Tag field should be ${GIT_COMMIT}.
Set the Registry credentials to the credentials for Docker Hub that were created above.

Marathon Deployment
Add a Marathon Deployment post-build action by selecting the Marathon Deployment
option from the Add post-build action drop-down.

The Marathon instance within DC/OS can be accessed using the URL
http://leader.mesos/service/marathon. Fill in the fields appropriately, using Jenkins
variables if desired. The Docker Image should be the same as the build step above
(myusername/${JOB_NAME}:${GIT_COMMIT}) to ensure the correct image is used.

How It Works
The Marathon Deployment post-build action reads the application definition file, by
default marathon.json, contained within the project's Git repository. This is a JSON file
and must contain a valid Marathon application definition.
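
For illustration only (this is not the file shipped with the cd-demo project), a minimal
marathon.json for a Docker-based app might look like the sketch below; the hypothetical
id and image values would then be overwritten by the post-build action at deployment time:

```json
{
  "id": "/jekyll-site",
  "cpus": 0.5,
  "mem": 256,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "myusername/my-jenkins-job:latest",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 80, "hostPort": 0 }
      ]
    }
  }
}
```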

The configurable fields in the post-build action will overwrite the content of matching
fields from the file. For example, setting the Application Id will replace the id field in the
file. In the configuration above, Docker Image is configured and will overwrite the image
field contained within the docker field.

The final JSON payload is sent to the configured Marathon instance and the application
is updated or created.

Save
Save the job configuration.

Build It
Click Build Now and let the job build.
Deployment
Upon a successful run in Jenkins, the application will begin deployment on DC/OS. You
can visit the DC/OS web interface to monitor progress.

When the Status has changed to Running, the deployment is complete and you can visit
the website.

Visit Your Site


Visit port 80 on the public DC/OS agent to display a jekyll website.

Adding a New Post


The content in the _posts directory generates a Jekyll website. For this example project,
that directory is site/_posts. Copy an existing post and create a new one with a more
recent date in the filename. I added a post entitled An Update.

Commit the new post to Git. Shortly after the new commit lands on the master branch,
Jenkins will see the change and redeploy to Marathon.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
TUTORIALS

Labeling Tasks and Jobs


Updated: April 17, 2017

This tutorial illustrates how labels can be defined using the DC/OS web interface and the
Marathon HTTP API, and how information pertaining to applications and jobs that are
running can be queried based on label value criteria.

When you deploy applications, containers, or jobs in a DC/OS cluster, you can associate
a tag or label with your deployed components in order to track and report usage of the
cluster by those components. For example, you may want to assign a cost center
identifier or a customer number to a Mesos application and produce a summary report at
the end of the month with usage metrics such as the amount of CPU and memory
allocated to the applications by cost center or customer.

Assigning Labels to Applications and


Tasks
You can attach labels to tasks either via the Services tab of the DC/OS web interface or
from the DC/OS CLI. You can specify more than one label, but each label can have only
one value.

Assign a Label to an Application or Task from the


DC/OS Web Interface
From the DC/OS web interface, click the Services tab. You can add labels when you
deploy a new service or edit an existing one from the Labels tab.

Assign a Label to an Application or Task from the


DC/OS CLI
You can also specify label values in the labels parameter of your application definition.

vi myapp.json

{ "id": "myapp", "cpus": 0.1, "mem": 16.0, "ports": [ 0 ], "cmd":
"/opt/mesosphere/bin/python3 -m http.server $PORT0", "instances": 2, "labels": {
"COST_CENTER": "0001" } }

Then, deploy from the DC/OS CLI:

dcos marathon app add <myapp>.json

Assigning Labels to Jobs


You can attach labels to jobs either via the Jobs tab of the DC/OS web interface or from
the DC/OS CLI. You can specify more than one label, but each label can have only one
value.

Assign a Label to a Job from the DC/OS Web Interface


From the DC/OS web interface, click the Jobs tab. You can add labels when you deploy
a new job or edit an existing one from the Labels tab.


Assign a Label to a Job from the DC/OS CLI


You can also specify label values in the labels parameter of your job definition.

vi myjob.json

{ "id": "my-job", "description": "A job that sleeps", "labels": { "department":
"marketing" }, "run": { "cmd": "sleep 1000", "cpus": 0.01, "mem": 32, "disk": 0 } }

Then, deploy from the DC/OS CLI:

dcos job add <myjob>.json

Displaying Label Information


Once your application is deployed and started, you can filter by label from the Services
tab of the DC/OS UI.

You can also use the Marathon HTTP API from the DC/OS CLI to query the running
applications based on the label value criteria.
The code snippet below shows an HTTP request issued to the Marathon HTTP API. The
curl program is used in this example to submit the HTTP GET request, but you can use
any program that is able to send HTTP GET/PUT/DELETE requests. You can see that the
HTTP endpoint is https://52.88.210.228/marathon/v2/apps and the parameters sent
along with the HTTP request include the label criteria ?label=COST_CENTER==0001:

curl --insecure \
https://52.88.210.228/marathon/v2/apps?label=COST_CENTER==0001 \
| python -m json.tool | more
You can also specify multiple label criteria like so:


?label=COST_CENTER==0001,COST_CENTER==0002
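
For instance, the same query extended to both cost centers would look like the sketch
below (it reuses the example endpoint above):

curl --insecure \
https://52.88.210.228/marathon/v2/apps?label=COST_CENTER==0001,COST_CENTER==0002 \
| python -m json.tool | more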

In the example above, the response you receive will include only the applications that
have a label COST_CENTER defined with a value of 0001. The resource metrics are also
included, such as the number of CPU shares and the amount of memory allocated. At the
bottom of the response, you can see the date/time this application was deployed, which
can be used to compute the uptime for billing or charge-back purposes.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9

Release Notes
ENTERPRISE DC/OS Updated: April 17, 2017

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9

GUI
ENTERPRISE DC/OS Updated: April 17, 2017

The DC/OS web interface provides a rich graphical view of your DC/OS cluster. With the
web interface you can view the current state of your entire cluster and DC/OS services.
The web interface is installed as a part of your DC/OS installation.

Additionally, there is a User Menu on the upper-left side of the web interface that includes
links for documentation, CLI installation, and user sign out.

Dashboard
The dashboard is the home page of the DC/OS web interface and provides an overview
of your DC/OS cluster.
From the dashboard you can easily monitor the health of your cluster.
The CPU Allocation panel provides a graph of the current percentage of available general
compute units that are being used by your cluster.

The Memory Allocation panel provides a graph of the current percentage of available
memory that is being used by your cluster.

The Task Failure Rate panel provides a graph of the current percentage of tasks that
are failing in your cluster.

The Services Health panel provides an overview of the health of your services. Each
service provides a healthcheck, run at intervals. This indicator shows the current
status according to that healthcheck. A maximum of 5 services are displayed, sorted
by priority of the most unhealthy. You can click the View all Services button for
detailed information and a complete list of your services.

The Tasks panel provides the current number of tasks that are staged and running.

The Nodes panel provides a view of the nodes in your cluster.

Services
The Services tab provides a full featured interface to the native DC/OS Marathon
instance.

You can click the Deployments tab to view all active Marathon deployments.

Tip: You can access the Mesos web interface at <hostname>/mesos.


Jobs
The Jobs tab provides native support for creating and administering scheduled jobs. You
can set up jobs with a scheduler by using the cron format. For more information, see the
documentation.
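
For illustration of the cron format, a job deployed from the CLI might carry a schedule
like the sketch below; the field names follow the DC/OS job definition format and the
values here are examples only:

{
  "id": "nightly-cleanup",
  "run": { "cmd": "echo cleaning up", "cpus": 0.01, "mem": 32, "disk": 0 },
  "schedules": [ { "id": "nightly", "cron": "0 2 * * *", "enabled": true } ]
}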

Universe
The Universe tab shows all of the available DC/OS services. You can install packages
from the DC/OS Universe with a single click. The packages can be installed with defaults
or customized directly in the web interface.
Nodes
The Nodes tab provides a comprehensive view of all of the nodes that are used across
your cluster. You can view a graph that shows the allocation percentage rate for CPU,
memory, or disk.

By default all of your nodes are displayed in List view, sorted by hostname. You can filter
nodes by service type or hostname. You can also sort the nodes by number of tasks or
percentage of CPU, memory, or disk space allocated.
You can switch to Grid view to see a donut percentage visualization.

Clicking on a node opens the Nodes side panel, which provides CPU, memory, and disk
usage graphs and lists all tasks on the node. Use the dropdown or a custom filter to sort
tasks and click on details for more information. Click on a task listed on the Nodes side
panel to see detailed information about the task's CPU, memory, and disk usage and the
task's files and directory tree.

Networking
The Networking tab provides a comprehensive view of the health of your VIPs. For more
information, see the documentation.

Security
The Security tab provides secret and certificates management. For more information, see
the secrets and certificates documentation.
System Overview
View the cluster details from the System Overview tab.
Components
View the system health of your DC/OS components from the Components tab.

Settings
Manage your DC/OS package repositories, secrets stores, LDAP directories, and identity
providers from the Settings tab.
Organization
Manage user access from the Organization tab.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9

CLI
Updated: April 17, 2017

You can use the DC/OS command-line interface (CLI) to manage your cluster nodes,
install DC/OS packages, inspect the cluster state, and administer the DC/OS service
subcommands. You can install the CLI from the DC/OS web interface.

To list available commands, either run dcos with no parameters or run dcos help:

dcos

Command line utility for the Mesosphere Datacenter Operating System (DC/OS). The
Mesosphere DC/OS is a distributed operating system built around Apache Mesos. This
utility provides tools for easy management of a DC/OS installation.

Available DC/OS commands:

    auth        Authenticate to DCOS cluster
    config      Get and set DC/OS CLI configuration properties
    help        Display command line usage information
    marathon    Deploy and manage applications on the DC/OS
    node        Manage DC/OS nodes
    package     Install and manage DC/OS packages
    service     Manage DC/OS services
    task        Manage DC/OS tasks

Get detailed command description with 'dcos <command> --help'.

Environment Variables
The DC/OS CLI supports the following environment variables, which you can set
dynamically.

DCOS_CONFIG Set the path to the DC/OS configuration file. By default, this variable is set to
DCOS_CONFIG=/<home-directory>/.dcos/dcos.toml. For example, if you moved your
DC/OS configuration file to /home/jdoe/config/, you can specify this command:

export DCOS_CONFIG=/home/jdoe/config/dcos.toml

DCOS_SSL_VERIFY Indicates whether to verify SSL certificates for HTTPS (true or false), or
specifies the path to the SSL certificates. By default, this variable is set to true. This is
equivalent to setting the core.ssl_verify option in the DC/OS configuration file. For
example, to disable SSL verification:

export DCOS_SSL_VERIFY=false

DCOS_LOG_LEVEL Prints log messages to stderr at or above the level indicated. This is
equivalent to the --log-level command-line option. The severity levels are:
debug Prints all messages to stderr, including informational, warning, error, and
critical.

info Prints informational, warning, error, and critical messages to stderr.

warning Prints warning, error, and critical messages to stderr.

error Prints error and critical messages to stderr.

critical Prints only critical messages to stderr.

For example, to set the log level to warning:

export DCOS_LOG_LEVEL=warning

DCOS_DEBUG Indicates whether to print additional debug messages to stdout. By default
this is set to false. For example:

export DCOS_DEBUG=true
Configuration Files
By default, the DC/OS command line stores its configuration files in a directory called
~/.dcos within your HOME directory. However, you can specify a different location by
using the DCOS_CONFIG environment variable.

The configuration settings are stored in the dcos.toml file. You can modify these settings
with the dcos config command.
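
As an illustration only (your generated file will differ), dcos.toml is a TOML document
with the settings grouped under a core section:

[core]
dcos_url = "https://52.36.102.191"
email = "jdoe@mesosphere.com"
reporting = true
ssl_verify = "false"
timeout = 5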

dcos_url The public master IP of your DC/OS installation. This is set by default during
installation. For example:

dcos config set core.dcos_url 52.36.102.191

email Your email address. This is set by default during installation. For example, to reset
your email address:

dcos config set core.email jdoe@mesosphere.com

mesos_master_url The Mesos master URL. This must be of the format:


http://<host>:<port>. For example, to set your Mesos master URL:

dcos config set core.mesos_master_url 52.34.160.132:5050

reporting Indicates whether to report usage events to Mesosphere. By default this is set to
True. For example, to set to false:

dcos config set core.reporting False

ssl_verify Indicates whether to verify SSL certs for HTTPS or path to certs. By default this
is set to False. For example, to set to true:

dcos config set core.ssl_verify True

timeout Request timeout in seconds, with a minimum value of 1 second. By default this is
set to 5 seconds. For example, to set to 3 seconds:

dcos config set core.timeout 3

token The OAuth access token. For example, to change the OAuth token:

dcos config set core.token <token>


MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9

Security
ENTERPRISE DC/OS Updated: April 17, 2017

Welcome to the documentation for DC/OS version 1.9. For information about new and
changed features, see the release notes.

Managing users and groups


Enterprise DC/OS can manage two types of users: Local: local user accounts exist only
in DC/OS. External: DC/OS stores only the user's ID or user name, along with other
DC/OS...
Identity provider-based authentication
Configuring identity provider-based authentication To provide Single Sign-On (SSO) in
your organization, you can configure Enterprise DC/OS to authenticate users against one
or mor...

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
SECURITY

Managing users and groups


ENTERPRISE DC/OS Updated: April 17, 2017

Enterprise DC/OS can manage two types of users:


Local: local user accounts exist only in DC/OS.

External: DC/OS stores only the user's ID or user name, along with other DC/OS-
specific information, such as permissions and group membership. DC/OS never
receives or stores the passwords of external users. Instead, it delegates the
verification of the user's credentials to one of the following: LDAP directory, SAML, or
OpenID Connect.

All users must have a unique identifier, i.e., a user ID or user name. Because DC/OS
needs to pass the user's name or ID in URLs, it cannot contain any spaces or commas.
Only the following characters are supported: lowercase alphabet, uppercase alphabet,
numbers, @, ., \, _, and -.

Enterprise DC/OS also allows you to create groups of users and import groups of users
from LDAP. Groups can make it easier to manage permissions. Instead of assigning
permissions to each user account individually, you can assign the permissions to an
entire group of users at once.

Importing groups from LDAP makes it easier to add external users.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
SECURITY

Identity provider-based
authentication
ENTERPRISE DC/OS Updated: April 17, 2017

Configuring identity provider-based


authentication
To provide Single Sign-On (SSO) in your organization, you can configure Enterprise
DC/OS to authenticate users against one or more external user identity providers. In
contrast to directory-based authentication, the identity provider-based authentication is
not as rich (less information available) but more flexible for individual users.

When a user attempts to log on from the DC/OS web interface, they will be presented
with a list of the third-party identity providers that you have configured. They can click the
one that they have an account with for SSO.

Users logging in from the DC/OS CLI can discover the names of the IdPs that have been
configured with dcos auth list-providers. They can then log in using an IdP with
dcos auth login --provider=<provider-name> --username=<user-email> --password=<secret-password>.
Enterprise DC/OS supports two types of identity provider-based authentication methods:
Security Assertion Markup Language (SAML) and OpenID Connect (OIDC):
Adding a SAML Identity Provider

Adding an OpenID Identity Provider:

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9

Storage
Updated: April 17, 2017

Welcome to the documentation for DC/OS version 1.9. For information about new and
changed features, see the release notes.

Mount Disk Resources


Overview With DC/OS you can configure Mesos mount disk resources across your
cluster by simply mounting storage resources on agents using a well-known path. When
a DC/OS agent init...
External Persistent Volumes
Warning: Volume size is specified in GiB. There is currently an inaccuracy in the DC/OS
GUI where the vo...
Local Persistent Volumes
When you specify a local volume or volumes, tasks and their associated data are
pinned to the node they are first launched on and will be relaunched on that node if t...

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
STORAGE

Mount Disk Resources


Updated: April 17, 2017
Overview
With DC/OS you can configure Mesos mount disk resources across your cluster by
simply mounting storage resources on agents using a well-known path. When a DC/OS
agent initially starts, it scans for volumes that match the pattern /dcos/volumeN, where N is
an integer. The agent is then automatically configured to offer these disk resources to
other services.

Example using loopback device


In this example, a disk resource is added to a DC/OS agent post-install on a running
cluster. These same steps can be used pre-install without having to stop services or clear
the agent state.

Warning: This will terminate any running tasks or services on the node.
Connect to an agent in the cluster with SSH.

Examine the current agent resource state.

Note there are no references yet for /dcos/volume0.

cat /var/lib/dcos/mesos-resources

# Generated by make_disk_resources.py on 2016-05-05 17:04:29.868595
#
MESOS_RESOURCES='[{"ranges": {"range": [{"end": 21, "begin": 1}, {"end": 5050, "begin": 23}, {"end": 32000, "begin": 5052}]}, "type": "RANGES", "name": "ports"}, {"role": "*", "type": "SCALAR", "name": "disk", "scalar": {"value": 47540}}]'
Stop the agent.
On a private agent:

sudo systemctl stop dcos-mesos-slave.service

On a public agent:

sudo systemctl stop dcos-mesos-slave-public.service
Clear agent state.
Remove Volume Mount Discovery resource state with this command:

sudo rm -f /var/lib/dcos/mesos-resources

Remove agent checkpoint state with this command:

sudo rm -f /var/lib/mesos/slave/meta/slaves/latest

Create a 200 MB loopback device.


This is suitable for testing purposes only. Mount volumes must have at least 200 MB of
free space available. 100 MB on each volume is reserved by DC/OS and is not available
for other services.

sudo mkdir -p /dcos/volume0
sudo dd if=/dev/zero of=/root/volume0.img bs=1M count=200
sudo losetup /dev/loop0 /root/volume0.img
sudo mkfs -t ext4 /dev/loop0
sudo losetup -d /dev/loop0

Create fstab entry and mount.


Ensure the volume is mounted automatically at boot time. Something similar could also
be done with a Systemd Mount unit.

echo "/root/volume0.img /dcos/volume0 auto loop 0 2" | sudo tee -a /etc/fstab
sudo mount /dcos/volume0

Reboot.
sudo reboot

SSH to the agent and verify a new resource state.


Review the journald logs for references to the new volume /dcos/volume0. In particular,
there should be an entry for the agent starting up and the new volume0 Disk Mount
resource.

journalctl -b | grep '/dcos/volume0'

May 05 19:18:40 dcos-agent-public-01234567000001 systemd[1]: Mounting /dcos/volume0...
May 05 19:18:42 dcos-agent-public-01234567000001 systemd[1]: Mounted /dcos/volume0.
May 05 19:18:46 dcos-agent-public-01234567000001 make_disk_resources.py[888]: Found matching mounts : [('/dcos/volume0', 74)]
May 05 19:18:46 dcos-agent-public-01234567000001 make_disk_resources.py[888]: Generated disk resources map: [{'name': 'disk', 'type': 'SCALAR', 'disk': {'source': {'mount': {'root': '/dcos/volume0'}, 'type': 'MOUNT'}}, 'role': '*', 'scalar': {'value': 74}}, {'name': 'disk', 'type': 'SCALAR', 'role': '*', 'scalar': {'value': 47540}}]
May 05 19:18:58 dcos-agent-public-01234567000001 mesos-slave[1891]: " --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --resources="[{"name": "ports", "type": "RANGES", "ranges": {"range": [{"end": 21, "begin": 1}, {"end": 5050, "begin": 23}, {"end": 32000, "begin": 5052}]}}, {"name": "disk", "type": "SCALAR", "disk": {"source": {"mount": {"root": "/dcos/volume0"}, "type": "MOUNT"}}, "role": "*", "scalar": {"value": 74}}, {"name": "disk", "type": "SCALAR", "role": "*", "scalar": {"value": 47540}}]" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --slave_subsystems="cpu,memory" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos/slave"

Cloud Provider Resources


Cloud provider storage services are typically used to back DC/OS Mount Volumes. This
reference material can be useful when designing a production DC/OS deployment:
Amazon: EBS

Azure: About disks and VHDs for Azure virtual machines

Azure: Introduction to Microsoft Azure storage (see Blob Storage section on Page blobs)

Best Practices
Disk Mount Resources are primarily for stateful services like Kafka and Cassandra which
can benefit from having dedicated storage available throughout the cluster. Any service
that utilizes a Disk Mount Resource has exclusive access to the reserved resource.
However, it is still important to consider the performance and reliability requirements for
the service. The performance of a Disk Mount Resource is based on the characteristic of
the underlying storage and DC/OS does not provide any data replication services.
Consider the following:
Use Disk Mount Resources with stateful services that have strict storage requirements.

Carefully consider the filesystem type, storage media (network attached, SSD, etc.), and
volume characteristics (RAID levels, sizing, etc.) based on the storage needs and
requirements of the stateful service.

Label Mesos agents using a Mesos attribute that reflects the characteristics of the agent's
Disk Mounts, e.g. IOPS200, RAID1, etc.

Associate stateful services with storage Agents using Mesos Attribute constraints.

Consider isolating demanding storage services to dedicated storage agents, since the
filesystem page cache is a host-level shared resource.

Ensure all services using Disk Mount Resources are designed to handle the permanent loss of one
or more Disk Mount Resources. Services are still responsible for managing data replication
and retention, graceful recovery from failed agents, and backups of critical service state.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
STORAGE

External Persistent Volumes


EXPERIMENTAL Updated: April 17, 2017

Warning: Volume size is specified in GiB. There is currently an inaccuracy in the DC/OS
GUI where the volume size is marked in MiB. When creating a volume, make sure you
specify the number of GiB you need, not MiB.

Use external volumes when fault-tolerance is crucial for your app. If a host fails, the
native Marathon instance reschedules your app on another host, along with its
associated data, without user intervention. External volumes also typically offer a larger
amount of storage.

Marathon applications normally lose their state when they terminate and are relaunched.
In some contexts, for instance, if your application uses MySQL, you'll want your
application to preserve its state. You can use an external storage service, such as
Amazon's Elastic Block Store (EBS), to create a persistent volume that follows your
application instance.

An external storage service enables your apps to be more fault-tolerant. If a host fails,
Marathon reschedules your app on another host, along with its associated data, without
user intervention.

Configuring External Volumes


To use external volumes with DC/OS, you must enable them during installation. If you're
installing DC/OS from the AWS cloud templates, then your cluster will be configured to
use Amazon EBS and you can skip this section. Otherwise, install DC/OS using the CLI
or Advanced installation method with these special configuration settings:
Specify your REX-Ray config in the rexray_config parameter in your genconf/config.yaml
file. Consult the REX-Ray documentation for more information on how to create your
configuration.

rexray_config:
  rexray:
    loglevel: info
    modules:
      default-admin:
        host: tcp://127.0.0.1:61003
    storageDrivers:
      - ec2
    volume:
      unmount:
        ignoreusedcount: true

This example configures REX-Ray to use Amazon's EBS for storage and IAM for
authorization.

Note: Setting rexray.modules.default-admin.host to tcp://127.0.0.1:61003 is


recommended to avoid port conflicts with tasks running on agents.

If your cluster will be hosted on Amazon Web Services and REX-Ray is configured to use
IAM, assign an IAM role to your agent nodes with the following policy:

{ "Version": "2012-10-17", "Statement": [ { "Action": [ "ec2:CreateTags",


"ec2:DescribeInstances", "ec2:CreateVolume", "ec2:DeleteVolume", "ec2:AttachVolume",
"ec2:DetachVolume", "ec2:DescribeVolumes", "ec2:DescribeVolumeStatus",
"ec2:DescribeVolumeAttribute", "ec2:CreateSnapshot", "ec2:CopySnapshot",
"ec2:DeleteSnapshot", "ec2:DescribeSnapshots", "ec2:DescribeSnapshotAttribute" ],
"Resource": "*", "Effect": "Allow" } ] }

Consult the REX-Ray documentation for more information.

Scaling your App


Apps that use external volumes can only be scaled to a single instance because a
volume can only attach to a single task at a time. This may change in a future release.

If you scale your app down to 0 instances, the volume is detached from the agent where
it was mounted, but it is not deleted. If you scale your app up again, the data that was
associated with it is still available.

Create an Application with External Volumes


Create an Application with a Marathon App Definition
You can specify an external volume in your Marathon app definition. Learn more about
Marathon application definitions.

Using a Mesos Container

{ "id": "hello", "instances": 1, "cpus": 0.1, "mem": 32, "cmd": "/usr/bin/tail -f


/dev/null", "container": { "type": "MESOS", "volumes": [ { "containerPath": "test-rexray-
volume", "external": { "size": 100, "name": "my-test-vol", "provider": "dvdi", "options": {
"dvdi/driver": "rexray" } }, "mode": "RW" } ] }, "upgradeStrategy": {
"minimumHealthCapacity": 0, "maximumOverCapacity": 0 } }

In the app definition above:


containerPath specifies where the volume is mounted inside the container. For Mesos
external volumes, this must be a single-level path relative to the container; it cannot
contain a forward slash (/). For more information, see the REX-Ray documentation on data
directories.

The size of the volume must be specified in GiB.

name is the name that your volume driver uses to look up your volume. When your task
is staged on an agent, the volume driver queries the storage service for a volume with
this name. If one does not exist, it is created implicitly. Otherwise, the existing volume is
reused.

The external.options["dvdi/driver"] option specifies which Docker volume driver to use


for storage. The only Docker volume driver provided with DC/OS is rexray. Learn more about
REX-Ray.

You can specify additional options with


container.volumes[x].external.options[optionName]. The dvdi provider for Mesos
containers uses dvdcli, which offers the options documented here. The availability of
any option depends on your volume driver.

Create multiple volumes by adding additional items in the container.volumes array, as
shown in the sketch after the note below.
Volume parameters cannot be changed after you create the application.

Important: Marathon will not launch apps with external volumes if


upgradeStrategy.minimumHealthCapacity is greater than 0.5, or if
upgradeStrategy.maximumOverCapacity does not equal 0.
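
As referenced above, a container.volumes array that declares two external volumes might
look like this sketch (volume names and container paths are illustrative):

"volumes": [
  {
    "containerPath": "first-volume",
    "external": { "name": "my-vol-1", "provider": "dvdi", "options": { "dvdi/driver": "rexray" } },
    "mode": "RW"
  },
  {
    "containerPath": "second-volume",
    "external": { "name": "my-vol-2", "provider": "dvdi", "options": { "dvdi/driver": "rexray" } },
    "mode": "RW"
  }
]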

Using a Docker Container

Below is a sample app definition that uses a Docker container and specifies an external
volume:

{ "id": "/test-docker", "instances": 1, "cpus": 0.1, "mem": 32, "cmd": "/usr/bin/tail -f


/dev/null", "container": { "type": "DOCKER", "docker": { "image": "alpine:3.1", "network":
"HOST", "forcePullImage": true }, "volumes": [ { "containerPath": "/data/test-rexray-
volume", "external": { "name": "my-test-vol", "provider": "dvdi", "options": {
"dvdi/driver": "rexray" } }, "mode": "RW" } ] }, "upgradeStrategy": {
"minimumHealthCapacity": 0, "maximumOverCapacity": 0 } }
The containerPath must be absolute for Docker containers.

Important: Refer to the REX-Ray documentation to learn which versions of Docker are
compatible with the REX-Ray volume
driver.

Create an Application with External Volumes from the DC/OS Web


Interface
Click the Services tab, then RUN A SERVICE.
If you are using a Docker Container, click Container Settings and configure your Docker
container.

Click Volumes and enter your Volume Name and Container Path.

Click Deploy.

Implicit Volumes
The default implicit volume size is 16 GiB. If you are using the Mesos containerizer, you
can modify this default for a particular volume by setting volumes[x].external.size. You
cannot modify this default for a particular volume if you are using the Docker
containerizer. For both the Mesos and Docker containerizers, however, you can modify
the default size for all implicit volumes by modifying the REX-Ray configuration.
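
For example, with the Mesos containerizer you could request a 32 GiB volume instead of
the 16 GiB default by setting the size on the volume entry; the name and value below are
illustrative:

"external": { "size": 32, "name": "my-implicit-vol", "provider": "dvdi", "options": { "dvdi/driver": "rexray" } }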

Potential Pitfalls
You can only assign one task per volume. Your storage provider might have other limitations.

The volumes you create are not automatically cleaned up. If you delete your cluster,
you must go to your storage provider and delete the volumes you no longer need. If
you're using EBS, find them by searching for the container.volumes.external.name
that you set in your Marathon app definition. This name corresponds to an EBS
volume Name tag.

Volumes are namespaced by their storage provider. If you're using EBS, volumes
created on the same AWS account share a namespace. Choose unique volume
names to avoid conflicts.

If you are using Docker, you must use a compatible Docker version. Refer to the REX-
Ray documentation to learn which versions of Docker are compatible with the REX-
Ray volume driver.

If you are using Amazon's EBS, it is possible to create clusters in different availability
zones (AZs). If you create a cluster with an external volume in one AZ and destroy it,
a new cluster may not have access to that external volume because it could be in a
different AZ.

Launch time might increase for applications that create volumes implicitly. The amount
of the increase depends on several factors, which include the size and type of the
volume. Your storage provider's method of handling volumes can also influence
launch time for implicitly created volumes.

For troubleshooting external volumes, consult the agent or system logs. If you are
using REX-Ray on DC/OS, you can also consult the systemd journal.


MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
STORAGE

Local Persistent Volumes


Updated: April 17, 2017

When you specify a local volume or volumes, tasks and their associated data are
pinned to the node they are first launched on and will be relaunched on that node if they
terminate. The resources the application requires are also reserved. Marathon will
implicitly reserve an appropriate amount of disk space (as declared in the volume via
persistent.size) in addition to the sandbox disk size you specify as part of your
application definition.

Benefits of using local persistent volumes


All resources needed to run tasks of your stateful service are dynamically reserved, thus
ensuring the ability to relaunch the task on the same node using the same volume when
needed.

You don't need constraints to pin a task to a particular agent where its data resides
You can still use constraints to specify distribution logic

Marathon lets you locate and destroy an unused persistent volume if you don't need it
anymore

Create an application with local


persistent volumes
Prerequisites
See the DC/OS system requirements.

Configuration options
Configure a persistent volume with the following options:

{ "containerPath": "data", "mode": "RW", "persistent": { "size": 10 } }

containerPath: The path where your application will read and write data. This must be a
single-level path relative to the container; it cannot contain a forward slash (/).
("data", but not "/data", "/var/data" or "var/data"). If your application requires an
absolute path, or a relative path with slashes, use this configuration.

mode: The access mode of the volume. Currently, "RW" is the only possible value and will
let your application read from and write to the volume.

persistent.size: The size of the persistent volume in MiBs.

You also need to set the residency field in order to tell Marathon to set up a stateful
application. Currently, the only valid option for this is:
"residency": { "taskLostBehavior": "WAIT_FOREVER" }

Specifying an unsupported container path


The value of containerPath must be relative to allow you to dynamically add a local
persistent volume to a running container and to ensure consistency across operating
systems. However, your application may require an absolute container path, or a
relative one with slashes.

If your application does require an unsupported containerPath, configure two volumes.


The first volume has the absolute container path you need and does not have the
persistent parameter. The hostPath parameter will match the relative containerPath
value for the second volume.

{ "containerPath": "/var/lib/data", "hostPath": "mydata", "mode": "RW" }

The second volume is a persistent volume with a containerPath that matches the
hostPath of the first volume.

{ "containerPath": "mydata", "mode": "RW", "persistent": { "size": 1000 } }

For a complete example, see the Running stateful MySQL on Marathon section.

Creating a stateful application via the DC/OS Web


Interface
Create a new service via the web interface in Services > Services > RUN A SERVICE.
Click the Volumes tab.

Choose the size of the volume or volumes you will use. Be sure that you choose a volume size
that will fit the needs of your application; you will not be able to modify this size after you
launch your application.

Specify the container path from which your application will read and write data. The container
path must be non-nested and cannot contain slashes e.g. data, but not ../../../etc/opt or
/user/data/. If your application requires such a container path, use this configuration.
Click Create.

Scaling stateful applications


When you scale your app down, the volumes associated with the terminated instances
are detached but all resources are still reserved. At this point, you may delete the tasks
via the Marathon REST API, which will free reserved resources and destroy the
persistent volumes.
Since all the resources your application needs are still reserved when a volume is
detached, you may wish to destroy detached volumes in order to allow other applications
and frameworks to use the resources. You may wish to leave them in the detached state,
however, if you think you will be scaling your app up again; the data on the volume will
still be there.

Notes:
If your app is destroyed, any associated volumes and reserved resources will also be
deleted.

Mesos will currently not remove the data but might do so in the future.

Upgrading or restarting stateful


applications
The default UpgradeStrategy for a stateful application is a minimumHealthCapacity of 0.5
and a maximumOverCapacity of 0. If you override this default, your definition must stay
below these values in order to pass validation. The UpgradeStrategy must stay below
these values because Marathon needs to be able to kill old tasks before starting new
ones so that the new versions can take over reservations and volumes and Marathon
cannot create additional tasks (as a maximumOverCapacity > 0 would induce) in order to
prevent additional volume creation.

Note: For a stateful application, Marathon will never start more instances than specified
in the UpgradeStrategy, and will kill old instances rather than create new ones during an
upgrade or restart.

Under the Hood


Marathon leverages three Mesos features to run stateful applications: dynamic
reservations, reservation labels, and persistent volumes.

In contrast to static reservations, dynamic reservations are created at runtime for a given
role and will associate resources with a combination of frameworkId and taskId using
reservation labels. This allows Marathon to restart a stateful task after it has terminated
for some reason, since the associated resources will not be offered to frameworks that
are not configured to use this role. Consult non-unique roles for more information.

Mesos creates persistent volumes to hold your application's stateful data. Because
persistent volumes are local to an agent, the stateful task using this data will be pinned to
the agent it was initially launched on, and will be relaunched on this node whenever
needed. You do not need to specify any constraints for this to work: when Marathon
needs to launch a task, it will accept a matching Mesos offer, dynamically reserve the
resources required for the task, create persistent volumes, and make sure the task is
always restarted using these reserved resources so that it can access the existing data.
When a task that used persistent volumes has terminated, its metadata will be kept. This
metadata will be used to launch a replacement task when needed.

For example, if you scale down from 5 to 3 instances, you will see 2 tasks in the Waiting
state along with the information about the persistent volumes the tasks were using as
well as about the agents on which they are placed. Marathon will not unreserve those
resources and will not destroy the volumes. When you scale up again, Marathon will
attempt to launch tasks that use those existing reservations and volumes as soon as it
gets a Mesos offer containing the labeled resources. Marathon will only schedule
unreserve/destroy operations when:
the application is deleted (in which case volumes of all its tasks are destroyed, and all
reservations are deleted).

you explicitly delete one or more suspended tasks with a wipe=true flag.

If reserving resources or creating persistent volumes fails, the created task will timeout
after the configured task_reservation_timeout (default: 20 seconds) and a new
reservation attempt will be made. In case a task is LOST (because its agent is
disconnected or crashed), the reservations and volumes will not timeout and you need to
manually delete and wipe the task in order to let Marathon launch a new one.

Potential Pitfalls
Be aware of the following issues and limitations when using stateful applications in
Marathon that make use of dynamic reservations and persistent volumes.

Resource requirements
Currently, the resource requirements of a stateful application cannot be changed. Your
initial volume size, cpu usage, memory requirements, etc., cannot be changed once
you've posted the AppDefinition.

Replication and Backups


Because persistent volumes are pinned to nodes, they are no longer reachable if the
node is disconnected from the cluster, e.g. due to a network partition or a crashed agent.
If the stateful service does not take care of data replication on its own, you need to
manually setup a replication or backup strategy to guard against data loss from a network
partition or from a crashed agent.

If an agent re-registers with the cluster and offers its resources, Marathon is eventually
able to relaunch a task there. If a node does not re-register with the cluster, Marathon will
wait forever to receive expected offers, as its goal is to re-use the existing data. If the
agent is not expected to come back, you can manually delete the relevant tasks by
adding a wipe=true flag and Marathon will eventually launch a new task with a new
volume on another agent.
Disk consumption
As of Mesos 0.28, destroying a persistent volume will not cleanup or destroy data. Mesos
will delete metadata about the volume in question, but the data will remain on disk. To
prevent disk consumption, you should manually remove data when you no longer need it.

Non-unique Roles
Both static and dynamic reservations in Mesos are bound to roles, not to frameworks or
framework instances. Marathon will add labels to claim that resources have been
reserved for a combination of frameworkId and taskId, as noted above. However, these
labels do not protect from misuse by other frameworks or old Marathon instances (prior
to 1.0). Every Mesos framework that registers for a given role will eventually receive
offers containing resources that have been reserved for that role.

However, if another framework does not respect the presence of labels and the
semantics as intended and uses them, Marathon is unable to reclaim these resources for
the initial purpose. We recommend never using the same role for different frameworks if
one of them uses dynamic reservations. Marathon instances in HA mode do not need to
have unique roles, though, because they use the same role by design.

The Mesos Sandbox


The temporary Mesos sandbox is still the target for the stdout and stderr logs. To view
these logs, go to the Marathon pane of the DC/OS web interface.

Examples
Running stateful PostgreSQL on Marathon
A model app definition for PostgreSQL on Marathon would look like this. Note that we set
the postgres data folder to pgdata which is relative to the Mesos sandbox (as contained in
the $MESOS_SANDBOX variable). This enables us to set up a persistent volume with a
containerPath of pgdata. This path is not nested and is relative to the sandbox, as
required:

{ "id": "/postgres", "cpus": 1, "instances": 1, "mem": 512, "container": { "type": "DOCKER",


"volumes": [ { "containerPath": "pgdata", "mode": "RW", "persistent": { "size": 100 } } ],
"docker": { "image": "postgres:latest", "network": "BRIDGE", "portMappings": [ {
"containerPort": 5432, "hostPort": 0, "protocol": "tcp", "name": "postgres" } ] } }, "env":
{ "POSTGRES_PASSWORD": "password", "PGDATA": "/mnt/mesos/sandbox/pgdata" }, "residency": {
"taskLostBehavior": "WAIT_FOREVER" }, "upgradeStrategy": { "maximumOverCapacity": 0,
"minimumHealthCapacity": 0 } }
Running stateful MySQL on Marathon
The default MySQL docker image does not allow you to change the data folder. Since we
cannot define a persistent volume with an absolute nested containerPath like
/var/lib/mysql, we need to configure a workaround to set up a docker mount from
hostPath mysqldata (relative to the Mesos sandbox) to /var/lib/mysql (the path that
MySQL attempts to read/write):

{ "containerPath": "/var/lib/mysql", "hostPath": "mysqldata", "mode": "RW" }

In addition to that, we configure a persistent volume with a containerPath mysqldata,


which will mount the local persistent volume as mysqldata into the docker container:

{ "containerPath": "mysqldata", "mode": "RW", "persistent": { "size": 1000 } }

The complete JSON application definition reads as follows:

{ "id": "/mysql", "cpus": 1, "mem": 512, "disk": 0, "instances": 1, "container": { "type":


"DOCKER", "volumes": [ { "containerPath": "mysqldata", "mode": "RW", "persistent": { "size":
1000 } }, { "containerPath": "/var/lib/mysql", "hostPath": "mysqldata", "mode": "RW" } ],
"docker": { "image": "mysql", "network": "BRIDGE", "portMappings": [ { "containerPort":
3306, "hostPort": 0, "servicePort": 10000, "protocol": "tcp" } ], "forcePullImage": false }
}, "env": { "MYSQL_USER": "wordpress", "MYSQL_PASSWORD": "secret", "MYSQL_ROOT_PASSWORD":
"supersecret", "MYSQL_DATABASE": "wordpress" }, "upgradeStrategy": {
"minimumHealthCapacity": 0, "maximumOverCapacity": 0 } }

Inspecting and deleting suspended stateful tasks


In order to destroy and clean up persistent volumes and free the reserved resources
associated with a task, perform 2 steps:
Locate the agent containing the persistent volume and remove the data inside it.

Send an HTTP DELETE request to Marathon that includes the wipe=true flag.

To locate the agent, inspect the Marathon UI and check out the detached volumes on the
Volumes tab. Or, query the /v2/apps endpoint, which provides information about the host
and Mesos slaveId.

http GET http://dcos/service/marathon/v2/apps/postgres/tasks

response: { "appId": "/postgres", "host": "10.0.0.168", "id": "postgres.53ab8733-fd96-11e5-8e70-76a1c19f8c3d",
"localVolumes": [ { "containerPath": "pgdata", "persistenceId": "postgres#pgdata#53ab8732-
fd96-11e5-8e70-76a1c19f8c3d" } ], "slaveId": "d935ca7e-e29d-4503-94e7-25fe9f16847c-S1" }

Note: A running task will show stagedAt, startedAt and version in addition to the
information provided above.

You can then:


Remove the data on disk by ssh'ing into the agent and running the rm -rf <volume-path>/*
command.

Delete the task with wipe=true, which will expunge the task information from the Marathon
internal repository and eventually destroy the volume and unreserve the resources previously
associated with the task:

http DELETE
http://dcos/service/marathon/v2/apps/postgres/tasks/postgres.53ab8733-fd96-11e5-8e70-76a1c19
f8c3d?wipe=true

View the Status of Your Application with Persistent


Local Volumes
After you have created your application, click the Volumes tab of the application detail
view to get detailed information about your app instances and associated volumes.

The Status column tells you if your app instance is attached to the volume or not. The
app instance will read as detached if you have scaled down your application. Currently
the only Operation Type available is read/write (RW).

Click a volume to view the Volume Detail Page, where you can see information about the
individual volume.

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9

Metrics
PREVIEW Updated: April 17, 2017

Welcome to the documentation for DC/OS version 1.9. For information about new and
changed features, see the release notes.

Quick Start
Use this guide to get started with the DC/OS metrics component. The metrics component
is natively integrated with DC/OS and no additional setup is required. Prerequisites: You
must...
Metrics API
Use the Metrics API to poll for data about your cluster, hosts, containers, and
applications. You can then pass this data to a third party service of your choice.
Metrics Reference
These metrics are automatically collected by DC/OS. Node Metrics Metric Description
cpu.cores Percentage of cores used. cpu.idle Percentage of CPUs idle. cpu.system
Percentage of s...

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
METRICS

Quick Start
ENTERPRISE DC/OS PREVIEW Updated: April 17, 2017

Use this guide to get started with the DC/OS metrics component. The metrics component
is natively integrated with DC/OS and no additional setup is required.

Prerequisites:
You must have the DC/OS CLI installed and be logged in as a superuser via the dcos auth
login command.
Optional: the CLI JSON processor jq.

Optional: Deploy a sample Marathon app for use in this quick start guide. If you already have
tasks running on DC/OS, you can skip this setup step.
Create the following Marathon app definition and save as test-metrics.json.

{ "id": "/test-metrics", "cmd": "while true;do echo stdout;echo stderr >&2;sleep 1;done",
"cpus": 0.001, "instances": 1, "mem": 128 }

Deploy the app with this CLI command:

dcos marathon app add test-metrics.json

Obtain your DC/OS authentication token and copy for later use:

dcos config show core.dcos_acs_token

The output should resemble:

eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOiJib290c3RyYXB1c2VyIi...

SSH to the agent node that is running your app, where (--mesos-id=<mesos-id>) is the Mesos
ID of the node running your app.
dcos node ssh --master-proxy --mesos-id=<mesos-id>

Tip: To get the Mesos ID of the node that is running your app, run dcos task followed by
dcos node. For example:

Running dcos task shows that host 10.0.0.193 is running the Marathon task test-
metrics.93fffc0c-fddf-11e6-9080-f60c51db292b.

dcos task
NAME          HOST        USER  STATE  ID
test-metrics  10.0.0.193  root  R      test-metrics.93fffc0c-fddf-11e6-9080-f60c51db292b

Running dcos node shows that host 10.0.0.193 has the Mesos ID 7749eada-4974-44f3-
aad9-42e2fc6aedaf-S1.

dcos node
HOSTNAME    IP          ID
10.0.0.193  10.0.0.193  7749eada-4974-44f3-aad9-42e2fc6aedaf-S1

View metrics.
Metrics for all containers running on a host
To show all containers that are deployed on the agent node, run this command from
your agent node with your authentication token (<auth-token>) specified.

curl -H "Authorization: token=<auth-token>"


http://localhost:61001/system/v1/metrics/v0/containers | jq

The output should resemble this:

["121f82df-b0a0-424c-aa4b-81626fb2e369","87b10e5e-6d2e-499e-ae30-1692980e669a"]

Metrics for a specific container


To view the metrics for a specific container, run this command from your agent node
with your authentication token (<auth-token>) and container ID (<container-id>)
specified.

curl -H "Authorization: token=<auth-token>"


http://localhost:61001/system/v1/metrics/v0/containers/<container-id>/app | jq

The output should resemble:

{ "datapoints": [ { "name": "dcos.metrics.module.container_received_bytes_per_sec",


"value": 0, "unit": "", "timestamp": "2016-12-15T18:12:24Z" }, { "name":
"dcos.metrics.module.container_throttled_bytes_per_sec", "value": 0, "unit": "",
"timestamp": "2016-12-15T18:12:24Z" } ], "dimensions": { "mesos_id": "", "container_id":
"d41ae47f-c190-4072-abe7-24d3468d40f6", "executor_id": "test-metrics.e3a1fe9e-c2f1-11e6-
b94b-2e2d1faf2a70", "framework_id": "fd39fe4f-930a-4b89-bb3b-a392e518c9a5-0001",
"hostname": "" } }

Metrics from container-level cgroup allocations


To view cgroup allocations, run this command from your agent node with your
authentication token (<auth-token>) and container ID (<container-id>) specified.

curl -H "Authorization: token=<auth-token>"


http://localhost:61001/system/v1/metrics/v0/containers/<container-id> | jq

The output will contain a datapoints array that contains information about container
resource allocation and utilization provided by Mesos. For example:

{ "datapoints": [ { "name": "cpus_system_time_secs", "value": 0.68, "unit": "",


"timestamp": "2016-12-13T23:15:19Z" }, { "name": "cpus_limit", "value": 1.1, "unit": "",
"timestamp": "2016-12-13T23:15:19Z" }, { "name": "cpus_throttled_time_secs", "value":
23.12437475, "unit": "", "timestamp": "2016-12-13T23:15:19Z" }, { "name":
"mem_total_bytes", "value": 327262208, "unit": "", "timestamp": "2016-12-13T23:15:19Z"
},

The output will also contain an object named dimensions that contains metadata about
the cluster/node/app.

... "dimensions": { "mesos_id": "a29070cd-2583-4c1a-969a-3e07d77ee665-S0",


"container_id": "6972ad7c-1701-4970-ae14-4372f76eda37", "executor_id": "confluent-
kafka.7aff271b-c182-11e6-a88f-22e5385a5fd7", "framework_name": "marathon",
"framework_id": "a29070cd-2583-4c1a-969a-3e07d77ee665-0001", "framework_role":
"slave_public", "hostname": "", "labels": { "DCOS_MIGRATION_API_PATH": "/v1/plan",
"DCOS_MIGRATION_API_VERSION": "v1", "DCOS_PACKAGE_COMMAND":
"eyJwaXAiOlsiaHR0cHM6Ly9kb3dubG9hZHMubWVzb3NwaGVyZS5jb20va2Fma2EvYX...
"DCOS_PACKAGE_FRAMEWORK_NAME": "confluent-kafka", "DCOS_PACKAGE_IS_FRAMEWORK": "true",
"DCOS_PACKAGE_METADATA": "eyJwYWNrYWdpbmdWZXJzaW9uIjoiMy4wIi... "DCOS_PACKAGE_NAME":
"confluent-kafka", "DCOS_PACKAGE_REGISTRY_VERSION": "3.0", "DCOS_PACKAGE_RELEASE": "10",
"DCOS_PACKAGE_SOURCE": "https://universe.mesosphere.com/repo", "DCOS_PACKAGE_VERSION":
"1.1.16-3.1.1", "DCOS_SERVICE_NAME": "confluent-kafka", "DCOS_SERVICE_PORT_INDEX": "1",
"DCOS_SERVICE_SCHEME": "http", "MARATHON_SINGLE_INSTANCE_APP": "true" } } } ...

Host level metrics


To view host-level metrics, run this command from your agent node with your
authentication token (<auth-token>) specified:

curl -H "Authorization: token=<auth-token>"


http://localhost:61001/system/v1/metrics/v0/node | jq

The output will contain a datapoints array about resource allocation and utilization.
For example:

... "datapoints": [ { "name": "uptime", "value": 23631, "unit": "", "timestamp":


"2016-12-14T01:00:19Z" }, { "name": "processes", "value": 209, "unit": "", "timestamp":
"2016-12-14T01:00:19Z" }, { "name": "cpu.cores", "value": 4, "unit": "", "timestamp":
"2016-12-14T01:00:19Z" } ...

The output will contain an object named dimensions that contains metadata about the
cluster and node. For example:
... "dimensions": { "mesos_id": "a29070cd-2583-4c1a-969a-3e07d77ee665-S0", "hostname":
"10.0.2.255" } ...

MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
METRICS

Metrics API
ENTERPRISE DC/OS PREVIEW Updated: April 17, 2017

About the Metrics API


Use the Metrics API to periodically poll for data about your cluster, hosts, containers, and
applications. You can then pass this data to a third party service of your choice to
achieve informative charts, dashboards, and alerts.

Response format
The API supports JSON only. You will not need to send any JSON, but must indicate
Accept: application/json in the HTTP header, as shown below.

Accept: application/json

Host name or IP address


The host name or IP address to use varies according to where your app is running.
Private agents will only return metrics to apps running inside of the DC/OS cluster. For
this reason, we recommend situating your app inside the cluster so that it can obtain
private agent metrics. You might also consider running it as a DC/OS service or job.
If your app will run inside of the cluster, use http[s]://localhost[:61001].
If your app will run outside of the DC/OS cluster, you should use the cluster URL. In a
production environment, this should be the path to the load balancer that sits in front
of your masters. To obtain the cluster URL, launch the DC/OS web interface and copy
the domain name from the browser. Alternatively, you can log into the DC/OS CLI and
type dcos config show core.dcos_url to get the cluster URL. In addition, the DC/OS
CLI makes this value available as a variable that you can reference using $(dcos
config show core.dcos_url).

Base path
Append /system/v1/metrics/v0/ to the host name, as shown below.

https://<host-name-or-ip>/system/v1/metrics/v0/
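For example, assuming your app runs outside the cluster and the DC/OS CLI is logged in, a request for node-level metrics might look like the following sketch (the authentication token used here is discussed in the next section):

curl -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
  "$(dcos config show core.dcos_url)/system/v1/metrics/v0/node" | jq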

Authentication and authorization


About authentication and authorization
All Metrics API endpoints require an authentication token with one of the following
permissions:
dcos:superuser
dcos:adminrouter:ops:system-metrics

We recommend using dcos:adminrouter:ops:system-metrics for more secure operations.

Obtaining an authentication token


Via the IAM API
To get an authentication token, pass the user name and password of a user with the
required permissions in the body of a request to the /auth/login endpoint of the Identity
and Access Management Service API. It returns an authentication token as shown
below.

{ "token":
"eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1aWQiOiJib290c3RyYXB1c2VyIiwiZXhwIjoxNDgyNjE1NDU2fQ
.j3_31keWvK15shfh_BII7w_10MgAj4ay700Rub5cfNHyIBrWOXbedxdKYZN6ILW9vLt3t5uCAExOOFWJkYcsI0sVFcM
1HSV6oIBvJ6UHAmS9XPqfZoGh0PIqXjE0kg0h0V5jjaeX15hk-LQkp7HXSJ-
V7d2dXdF6HZy3GgwFmg0Ayhbz3tf9OWMsXgvy_ikqZEKbmPpYO41VaBXCwWPmnP0PryTtwaNHvCJo90ra85vV85C02NE
dRHB7sqe4lKH_rnpz980UCmXdJrpO4eTEV7FsWGlFBuF5GAy7_kbAfi_1vY6b3ufSuwiuOKKunMpas9_NfDe7UysfPVH
lAxJJgg" }
Via the DC/OS CLI
When you log into the DC/OS CLI using dcos auth login, it stores the authentication
token value locally. You can reference this value as a variable in curl commands
(discussed in the next section).

Alternatively, you can use the following command to get the authentication token value.

$ dcos config show core.dcos_acs_token

Passing an authentication token


Via the HTTP header
Copy the token value and pass it in the Authorization field of the HTTP header, as
shown below.

Authorization:
token=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1aWQiOiJib290c3RyYXB1c2VyIiwiZXhwIjoxNDgyNjE1N
DU2fQ.j3_31keWvK15shfh_BII7w_10MgAj4ay700Rub5cfNHyIBrWOXbedxdKYZN6ILW9vLt3t5uCAExOOFWJkYcsI0
sVFcM1HSV6oIBvJ6UHAmS9XPqfZoGh0PIqXjE0kg0h0V5jjaeX15hk-LQkp7HXSJ-
V7d2dXdF6HZy3GgwFmg0Ayhbz3tf9OWMsXgvy_ikqZEKbmPpYO41VaBXCwWPmnP0PryTtwaNHvCJo90ra85vV85C02NE
dRHB7sqe4lKH_rnpz980UCmXdJrpO4eTEV7FsWGlFBuF5GAy7_kbAfi_1vY6b3ufSuwiuOKKunMpas9_NfDe7UysfPVH
lAxJJgg

Via curl as a string value


Using curl, for example, you would pass this value as follows.

$ curl -H "Authorization:
token=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1aWQiOiJib290c3RyYXB1c2VyIiwiZXhwIjoxNDgyNjE1N
DU2fQ.j3_31keWvK15shfh_BII7w_10MgAj4ay700Rub5cfNHyIBrWOXbedxdKYZN6ILW9vLt3t5uCAExOOFWJkYcsI0
sVFcM1HSV6oIBvJ6UHAmS9XPqfZoGh0PIqXjE0kg0h0V5jjaeX15hk-LQkp7HXSJ-
V7d2dXdF6HZy3GgwFmg0Ayhbz3tf9OWMsXgvy_ikqZEKbmPpYO41VaBXCwWPmnP0PryTtwaNHvCJo90ra85vV85C02NE
dRHB7sqe4lKH_rnpz980UCmXdJrpO4eTEV7FsWGlFBuF5GAy7_kbAfi_1vY6b3ufSuwiuOKKunMpas9_NfDe7UysfPVH
lAxJJgg"

Via curl as a DC/OS CLI variable


You can then reference this value in your curl commands, as shown below.

$ curl -H "Authorization: token=$(dcos config show core.dcos_acs_token)"

Refreshing the authentication token


Authentication tokens expire after five days by default. If your program needs to run
longer than five days, you will need a service account. Please see Provisioning custom
services for more information.

API reference
The interactive API reference for the Metrics endpoints is rendered in the online DC/OS documentation and is not reproduced here.

Logging
While the API returns informative error messages, you may also find it useful to check
the logs of the Metrics service. Refer to Service and Task Logging for instructions.


Metrics Reference
PREVIEW Updated: April 17, 2017

These metrics are automatically collected by DC/OS.
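For example, any of the metrics listed below can be pulled out of the Metrics API output with a jq filter. This is a sketch that assumes you run it on an agent node with a valid authentication token, as described in the Metrics API documentation:

curl -s -H "Authorization: token=<auth-token>" \
  http://localhost:61001/system/v1/metrics/v0/node \
  | jq '.datapoints[] | select(.name == "cpu.total")'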

Node
Metrics
Metric           Description
cpu.cores        Number of CPU cores.
cpu.idle         Percentage of CPU idle.
cpu.system       Percentage of CPU used by the system.
cpu.total        Percentage of CPU used.
cpu.user         Percentage of CPU used by user processes.
cpu.wait         Percentage of CPU idle while waiting for an operation to complete.
load.1min        Load average for the past minute.
load.5min        Load average for the past 5 minutes.
load.15min       Load average for the past 15 minutes.
memory.buffers   Number of memory buffers.
memory.cached    Amount of cached memory.
memory.free      Amount of free memory in bytes.
memory.total     Total memory in bytes.
processes        Number of processes that are running.
swap.free        Amount of free swap space.
swap.total       Total swap space.
swap.used        Amount of swap space used.
uptime           System uptime in seconds.

Filesystems
Metric                               Description
filesystem.{{.Name}}.capacity.free   Amount of available capacity in bytes.
filesystem.{{.Name}}.capacity.total  Total capacity in bytes.
filesystem.{{.Name}}.capacity.used   Capacity used in bytes.
filesystem.{{.Name}}.inodes.free     Number of free inodes.
filesystem.{{.Name}}.inodes.total    Total number of inodes.
filesystem.{{.Name}}.inodes.used     Number of inodes used.

Note: {{.Name}} is part of a Go template and is automatically populated based on the mount
path of the local filesystem (e.g., /, /boot, etc.).

Network interfaces
Metric                         Description
network.{{.Name}}.in.bytes     Number of bytes received.
network.{{.Name}}.in.dropped   Number of inbound packets dropped.
network.{{.Name}}.in.errors    Number of inbound errors.
network.{{.Name}}.in.packets   Number of packets received.
network.{{.Name}}.out.bytes    Number of bytes sent.
network.{{.Name}}.out.dropped  Number of outbound packets dropped.
network.{{.Name}}.out.errors   Number of outbound errors.
network.{{.Name}}.out.packets  Number of packets sent.

Note: {{.Name}} is part of a Go template and is automatically populated based on the name of
the network interface (e.g., eth0).

Container
The following per-container resource utilization metrics are collected.

CPU usage info
Metric                    Description
cpus_limit                The number of CPU shares allocated.
cpus_system_time_secs     Total CPU time spent in kernel mode, in seconds.
cpus_throttled_time_secs  Total time, in seconds, that CPU was throttled.
cpus_user_time_secs       Total CPU time spent in user mode, in seconds.

Disk info
Metric            Description
disk_limit_bytes  Hard limit on disk space, in bytes.
disk_used_bytes   Disk space used, in bytes.

Memory info
Metric           Description
mem_limit_bytes  Hard memory limit for a container, in bytes.
mem_total_bytes  Total memory of a process in RAM (as opposed to in swap), in bytes.

Network info
Metric          Description
net_rx_bytes    Bytes received.
net_rx_dropped  Packets dropped on receive.
net_rx_errors   Errors reported on receive.
net_rx_packets  Packets received.
net_tx_bytes    Bytes sent.
net_tx_dropped  Packets dropped on send.
net_tx_errors   Errors reported on send.
net_tx_packets  Packets sent.

Dimensions
Dimensions are metadata about the metrics that are contained in a common message format
and are broadcast to one or more metrics producers.
For more information, see the dcos-metrics repository.


Deploying Jobs
PREVIEW Updated: April 17, 2017
You can create scheduled jobs in DC/OS without installing a separate service. Create
and administer jobs in the DC/OS web interface, the DC/OS CLI, or via an API.

Note: The Jobs functionality of DC/OS is provided by the DC/OS Jobs (Metronome)
component, an open source Mesos framework that comes pre-installed with DC/OS. You
may sometimes see the Jobs functionality referred to as Metronome in the logs, and the
service endpoint is service/metronome.

Functionality
A job can run either a single command that you include when you create the job, or a
Docker image that you point to.

When you create your job, you can specify:


The amount of CPU your job will consume.
The amount of memory your job will consume.

The disk space your job will consume.

The schedule for your job, in cron format. You can also set the time zone and starting
deadline.

An arbitrary number of labels to attach to your job.

Permissions for your job.
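To illustrate these options, a minimal job definition might look like the following sketch (the id, command, schedule, and the file name myjob.json are placeholders):

{
  "id": "nightly-cleanup",
  "description": "Example scheduled job",
  "run": {
    "cmd": "echo cleaning && sleep 10",
    "cpus": 0.1,
    "mem": 32,
    "disk": 0
  },
  "schedules": [
    { "id": "nightly", "cron": "0 2 * * *", "timezone": "UTC", "enabled": true }
  ]
}

You could then create the job from the DC/OS CLI with a command along the lines of:

dcos job add myjob.json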


Deploying Services and Pods


Updated: April 17, 2017

Welcome to the documentation for DC/OS version 1.9. For information about new and
changed features, see the release notes.

Installing Services
Installing a service using the CLI The general syntax for installing a service with the CLI
follows. dcos package install [--options=<config-file-name>.json] <servicename>...
Pods
Monitoring Services
You can monitor deployed DC/OS services from the CLI and web interface. Monitoring
Universe services CLI From the DC/OS CLI, enter the dcos service command. In this
example you can...
Updating a User-Created Service
You can easily view and update the configuration of a deployed app by using the dcos
marathon command. Note: The process for updating packages from the DC/OS Universe
is different....
Service Ports
Port configuration for applications in Marathon can be confusing and there is an
outstanding issue to redesign the ports API. This page attempts to explain more clearly
how they wo...
Exposing a Service
DC/OS agent nodes can be designated as public or private during installation. Public
agent nodes provide access from outside of the cluster via infrastructure networking to
your DC...
Deploying non-native Marathons
About deploying non-native Marathons Each service that Marathon launches uses the
same Mesos role that Marathon registered with for quotas and reservations. In addition,
any users ...
Marathon REST API
The Marathon API allows you to manage long-running containerized services (apps and
pods). The Marathon API is backed by the Marathon component, which runs on the
master nodes. One...
Enabling GPU Support
DC/OS supports allocating GPUs (Graphics Processing Units) to your long-running
DC/OS services. Adding GPUs to your service can dramatically accelerate big data
workloads. Learn mo...
Frequently Asked Questions
We've collected some questions we often encounter concerning the usage of DC/OS.
Have a new question you'd like to see answered? Use the Submit feedback button at the
bottom...

Installing Services
Updated: April 17, 2017

Installing a service using the CLI


The general syntax for installing a service with the CLI follows.

dcos package install [--options=<config-file-name>.json] <servicename>

Use the optional --options flag to specify the name of the customized JSON file you
created in advanced configuration.

For example, you would use the following command to install Chronos with the default
parameters.

dcos package install chronos
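For example, if you had saved customized settings to a hypothetical chronos-options.json file, the command might look like this:

dcos package install --options=chronos-options.json chronos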

Installing a service using the GUI


From the DC/OS GUI you can install services from the Services or Universe tab. The
Universe tab shows all of the available DC/OS services from package repositories. The
Services tab provides a full featured interface to the native DC/OS Marathon instance.

Universe tab
Navigate to the Universe > Packages page in the DC/OS GUI.

Choose your package and click INSTALL PACKAGE.

Confirm your installation or choose ADVANCED INSTALLATION.

Services tab
Navigate to the Services tab in the DC/OS GUI.

Click RUN A SERVICE and specify your Marathon app definition.


Verifying your installation
CLI

dcos package list

Web GUI
Go to the Services tab and confirm that the service is running. For more information, see
the GUI documentation.

Tip: Some services from the Community Packages section of the Universe will not show
up in the DC/OS service listing. For these, inspect the service's Marathon app in the
Marathon GUI to verify that the service is running and healthy.


Pods
ENTERPRISE DC/OS Updated: April 17, 2017


Monitoring Services
Updated: April 17, 2017

You can monitor deployed DC/OS services from the CLI and web interface.
Monitoring Universe services
CLI
From the DC/OS CLI, enter the dcos service command. In this example you can see the
installed DC/OS services Chronos, HDFS, and Kafka.

dcos service
NAME     HOST            ACTIVE  TASKS  CPU   MEM     DISK  ID
chronos  <privatenode1>  True    0      0.0   0.0     0.0   <service-id1>
hdfs     <privatenode2>  True    1      0.35  1036.8  0.0   <service-id2>
kafka    <privatenode3>  True    0      0.0   0.0     0.0   <service-id3>

Web interface
See the monitoring documentation.

Monitoring user-created services


CLI
From the DC/OS CLI, enter the dcos task command. In this example you can see tasks from the
installed DC/OS service Cassandra and the user-created service suzanne-simple-service.

dcos task
NAME                    HOST        USER  STATE  ID
cassandra               10.0.3.224  root  R      cassandra.36031a0f-feb4-11e6-b09b-3638c949fe6b
node-0                  10.0.3.224  root  R      node-0__0b165525-13f2-485b-a5f8-e00a9fabffd9
suzanne-simple-service  10.0.3.224  root  R      suzanne-simple-service.47359150-feb4-11e6-b09b-3638c949fe6b

Web interface
See the monitoring documentation.

Updating a User-Created Service
Updated: April 17, 2017

You can easily view and update the configuration of a deployed app by using the dcos
marathon command.

Note: The process for updating packages from the DC/OS Universe is different. For more
information, see the documentation.

Update an Environment Variable


Use the dcos marathon app update command from the DC/OS CLI to update any aspect
of your service's JSON service definition. For instance, follow the instructions below to
update the environment variable (env field) of the service definition.

A single element of the env field can be updated by specifying a JSON string in a
command argument.

dcos marathon app update test-app env='{"APISERVER_PORT":"25502"}'

Now, run the command below to see the result of your update:

dcos marathon app show test-app | jq '.env'

Update all Environment Variables


The entire env field can also be updated by specifying a JSON file in a command
argument.

First, save the existing environment variables to a file:

dcos marathon app show test-app | jq .env >env_vars.json

The file will contain the JSON for the env field:

{ "SCHEDULER_DRIVER_PORT": "25501", }
Now edit the env_vars.json file. Make the JSON a valid object by enclosing the file
contents with { "env" :} and add your update:

{ "env" : { "APISERVER_PORT" : "25502", "SCHEDULER_DRIVER_PORT" : "25501" } }

Run this CLI command with the JSON file as input:

dcos marathon app update test-app < env_vars.json

View the results of your update:

dcos marathon app show test-app | jq '.env'


Service Ports
Updated: April 17, 2017

Port configuration for applications in Marathon can be confusing, and there is an
outstanding issue to redesign the ports API. This page attempts to explain more clearly
how ports work.

You can use virtual addresses (VIPs) to make ports management easier. VIPs simplify
inter-app communication and implement a reliable service-oriented architecture. VIPs
map traffic from a single virtual address to multiple IP addresses and ports.

Definitions
containerPort: A container port specifies a port within a container. This is only necessary
as part of a port mapping when using BRIDGE or USER mode networking with a Docker
container.

hostPort: A host port specifies a port on the host to bind to. When used with BRIDGE or
USER mode networking, you specify a port mapping from a host port to a container port. In
HOST networking, requested ports are host ports by default. Note that only host ports are
made available to a task through environment variables.

BRIDGE networking: used by Docker applications that specify BRIDGE mode networking.
In this mode, container ports (a port within the container) are mapped to host ports (a
port on the host machine). In this mode, applications bind to the specified ports within the
container and Docker networking binds to the specified ports on the host.

USER networking: used by Docker applications that specify USER mode networking. In
this mode, container ports (a port within the container) are mapped to host ports (a port
on the host machine). In this mode, applications bind to the specified ports within the
container and Docker networking binds to the specified ports on the host. USER network
mode is expected to be useful when integrating with user-defined Docker networks. In
the Mesos world such networks are often made accessible via CNI plugins used in
concert with a Mesos CNI network isolator.

HOST networking: used by non-Docker Marathon applications and Docker applications


that use HOST mode networking. In this mode, applications bind directly to one or more
ports on the host machine.

portMapping: In Docker BRIDGE mode, a port mapping is necessary for every port that
should be reachable from outside of your container. A port mapping is a tuple containing
a host port, container port, service port and protocol. Multiple port mappings may be
specified for a Marathon application; an unspecified hostPort defaults to 0 (meaning that
Marathon will assign one at random). In Docker USER mode the semantic for hostPort
slightly changes: hostPort is not required for USER mode and if left unspecified Marathon
WILL NOT automatically allocate one at random. This allows containers to be deployed
on USER networks that include containerPort and discovery information, but do NOT
expose those ports on the host network (and by implication would not consume host port
resources).

ports: The ports array is used to define ports that should be considered as part of a
resource offer in HOST mode. It is necessary only if no port mappings are specified. Only
one of ports and portDefinitions should be defined for an application.
portDefinitions: The portDefinitions array is used to define ports that should be
considered as part of a resource offer. It is necessary only to define this array if you are
using HOST networking and no port mappings are specified. This array is meant to replace
the ports array, and makes it possible to specify a port name, protocol and labels. Only
one of ports and portDefinitions should be defined for an application.

protocol: Protocol specifies the internet protocol to use for a port (e.g. tcp, udp or udp,tcp
for both). This is only necessary as part of a port mapping when using BRIDGE or USER
mode networking with a Docker container.

requirePorts: requirePorts is a property that specifies whether Marathon should


specifically look for specified ports in the resource offers it receives. This ensures that
these ports are free and available to be bound to on the Mesos agent. This does not
apply to BRIDGE or USER mode networking.

servicePort: When you create a new application in Marathon (either through the REST
API or the front end), you may assign one or more service ports to it. You can specify all
valid port numbers as service ports or you can use 0 to indicate that Marathon should
allocate free service ports to the app automatically. If you do choose your own service
port, you have to ensure yourself that it is unique across all of your applications.

Random Port Assignment


Using the value 0 for any port setting indicates to Marathon that you would like a random
port assignment. However, if containerPort is set to 0 within a portMapping, it is set to
the same value as hostPort.

Environment Variables
Each host port value is exposed to the running application instance via environment
variables $PORT0, $PORT1, etc. Each Marathon application is given a single port by default,
so $PORT0 is always available. These variables are available inside a Docker container
being run by Marathon too. Additionally, if the port is named NAME, it will also be
accessible via the environment variable, $PORT_NAME.

When using BRIDGE or USER mode networking, be sure to bind your application to the
containerPorts you have specified in your portMappings. However, if you have set
containerPort to 0 then this will be the same as hostPort and you can use the $PORT
environment variables.

Example Configuration
Host Mode
Host mode networking is the default networking mode for Docker containers and the only
networking mode for non-Docker applications. Note that it is not necessary to EXPOSE ports
in your Dockerfile.

Enabling Host Mode


Host mode is enabled by default for containers. If you wish to be explicit, you can also
specify it manually through the network property:

"container": { "type": "DOCKER", "docker": { "image": "my-image:1.0", "network": "HOST" }


},

For non-Docker applications, you don't need to specify anything.


Specifying Ports
You can specify the ports that are available through the ports array:

"ports": [ 0, 0, 0 ],

Or through the portDefinitions array:

"portDefinitions": [ {"port": 0}, {"port": 0}, {"port": 0} ],

In this example, we specify three randomly assigned host ports which would then be
available to our command via the environment variables $PORT0, $PORT1 and $PORT2.
Marathon will also randomly assign three service ports in addition to these three host
ports.

You can also specify specific service ports:

"ports": [ 2001, 2002, 3000 ],

Or:

"portDefinitions": [ {"port": 2001}, {"port": 2002}, {"port": 3000} ],

In this case, host ports $PORT0, $PORT1 and $PORT2 remain randomly assigned. However,
the three service ports for this application are now 2001, 2002 and 3000.

In this example, as with the previous one, it is necessary to use a service discovery
solution such as HAProxy to proxy requests from service ports to host ports.

If you want the application's service ports to be equal to its host ports, you can set
requirePorts to true (requirePorts is false by default). This will tell Marathon to only
schedule this application on agents which have these ports available:

"ports": [ 2001, 2002, 3000 ], "requirePorts" : true

The service and host ports (including the environment variables $PORT0, $PORT1, and
$PORT2), are both now 2001, 2002 and 3000.

This property is useful if you don't use a service discovery solution to proxy requests from
service ports to host ports.

Defining the portDefinitions array allows you to specify a protocol, a name and labels for
each port. When starting new tasks, Marathon will pass this metadata to Mesos. Mesos will
expose this information in the discovery field of the task. Custom network discovery
solutions can consume this field.

Example port definition requesting a dynamic tcp port named http with the label VIP_0
set to 10.0.0.1:80:

"portDefinitions": [ { "port": 0, "protocol": "tcp", "name": "http", "labels": {"VIP_0":


"10.0.0.1:80"} } ],

The port field is mandatory. The protocol, name and labels fields are optional. A port
definition in which only the port field is set is equivalent to an element of the ports
array.

Note that the ports array and the portDefinitions array should not be specified together,
unless all their elements are equivalent.

Referencing Ports
You can reference host ports in the Dockerfile for our fictitious app as follows:

CMD ./my-app --http-port=$PORT0 --https-port=$PORT1 --monitoring-port=$PORT2

Alternatively, if you aren't using Docker or have specified a cmd in your Marathon
application definition, it works in the same way:

"cmd": "./my-app --http-port=$PORT0 --https-port=$PORT1 --monitoring-port=$PORT2"

Bridge Mode
Bridge mode networking allows you to map host ports to ports inside your container and
is only applicable to Docker containers. It is particularly useful if you are using a container
image with fixed port assignments that you can't modify. Note that it is not necessary to
EXPOSE ports in your Dockerfile.

Enabling Bridge Mode


You need to specify bridge mode through the network property:

"container": { "type": "DOCKER", "docker": { "image": "my-image:1.0", "network": "BRIDGE" }


},

Enabling User Mode


You need to specify user mode through the network property:

"container": { "type": "DOCKER", "docker": { "image": "my-image:1.0", "network": "USER" }


}, "ipAddress": { "networkName": "someUserNetwork" }
Specifying Ports
Port mappings are similar to passing -p into the Docker command line and specify a
relationship between a port on the host machine and a port inside the container. In this
case, the portMappings array is used instead of the ports or portDefinitions array used
in host mode.

Port mappings are specified inside the portMappings object for a container:

"container": { "type": "DOCKER", "docker": { "image": "my-image:1.0", "network": "BRIDGE",


"portMappings": [ { "containerPort": 0, "hostPort": 0 }, { "containerPort": 0, "hostPort": 0
}, { "containerPort": 0, "hostPort": 0 } ] } },

In this example, we specify 3 mappings. A value of 0 will ask Marathon to randomly assign a
value for hostPort. In this case, setting containerPort to 0 will cause it to have the same
value as hostPort. These values are available inside the container as $PORT0, $PORT1 and
$PORT2 respectively.

Alternatively, if our process running in the container had fixed ports, we might do
something like the following:

"container": { "type": "DOCKER", "docker": { "image": "my-image:1.0", "network": "BRIDGE",


"portMappings": [ { "containerPort": 80, "hostPort": 0 }, { "containerPort": 443,
"hostPort": 0 }, { "containerPort": 4000, "hostPort": 0 } ] } },

In this case, Marathon will randomly allocate host ports and map these to ports 80, 443
and 4000 respectively. It's important to note that the $PORT variables refer to the host
ports. In this case, $PORT0 will be set to the value of hostPort for the first mapping and so
on.

Specifying Protocol

You can also specify the protocol for these port mappings. The default is tcp:

"container": { "type": "DOCKER", "docker": { "image": "my-image:1.0", "network": "BRIDGE",


"portMappings": [ { "containerPort": 80, "hostPort": 0, "protocol": "tcp" }, {
"containerPort": 443, "hostPort": 0, "protocol": "tcp" }, { "containerPort": 4000,
"hostPort": 0, "protocol": "udp" } ] } },

Specifying Service Ports

By default, Marathon creates service ports for each of these ports and assigns them random
values. Service ports are used by service discovery solutions, and it is often desirable to
set these to well-known values. You can do this by setting a servicePort for each mapping:

"container": { "type": "DOCKER", "docker": { "image": "my-image:1.0", "network": "BRIDGE",


"portMappings": [ { "containerPort": 80, "hostPort": 0, "protocol": "tcp", "servicePort":
2000 }, { "containerPort": 443, "hostPort": 0, "protocol": "tcp", "servicePort": 2001 }, {
"containerPort": 4000, "hostPort": 0, "protocol": "udp", "servicePort": 3000 } ] } },
In this example, the host ports $PORT0, $PORT1 and $PORT2 remain randomly assigned. However,
the service ports for this application are now 2000, 2001 and 3000. An external
proxy, like HAProxy, should be configured to route from the service ports to the host
ports.

Referencing Ports
If you set containerPort to 0, then you should specify ports in the Dockerfile for our
fictitious app as follows:

CMD ./my-app --http-port=$PORT0 --https-port=$PORT1 --monitoring-port=$PORT2

However, if you've specified containerPort values, you simply use the same values in
the Dockerfile:

CMD ./my-app --http-port=80 --https-port=443 --monitoring-port=4000

Alternatively, if you specify a cmd in your Marathon application definition, it works in the
same way as before:

"cmd": "./my-app --http-port=$PORT0 --https-port=$PORT1 --monitoring-port=$PORT2"

Or, if you've used fixed values:

"cmd": "./my-app --http-port=80 --https-port=443 --monitoring-port=4000"


Exposing a Service
Updated: April 17, 2017

DC/OS agent nodes can be designated as public or private during installation. Public
agent nodes provide access from outside of the cluster via infrastructure networking to
your DC/OS services. By default, services are launched on private agent nodes and are
not accessible from outside the cluster.

To launch a service on a public node, you must create a Marathon app definition with the
"acceptedResourceRoles":["slave_public"] parameter specified and configure an edge
load balancer and service discovery mechanism.

Prerequisites:
DC/OS is installed

DC/OS CLI is installed

Create a Marathon app definition with the required


"acceptedResourceRoles":["slave_public"] parameter specified. For example:

{ "id": "/product/service/myApp", "container": { "type": "DOCKER", "docker": { "image":


"group/image", "network": "BRIDGE", "portMappings": [ { "hostPort": 80, "containerPort": 80,
"protocol": "tcp"} ] } }, "acceptedResourceRoles": ["slave_public"], "instances": 1, "cpus":
0.1, "mem": 64 }

For more information about the acceptedResourceRoles parameter, see the Marathon
REST API documentation.

Add your app to Marathon by using this command, where myApp.json is your app:

dcos marathon app add myApp.json

If this is added successfully, there is no output.

Tip: You can also add your app by using the Services tab of DC/OS GUI.

Verify that the app is added with this command:

dcos marathon app list

The output should look like this:

ID      MEM  CPUS  TASKS  HEALTH  DEPLOYMENT  CONTAINER  CMD
/myApp  64   0.1   0/1    ---     scale       DOCKER     None

Tip: You can also view deployed apps by using the Services tab of DC/OS GUI.

Configure an edge load balancer and service discovery mechanism.


AWS users: If you installed DC/OS by using the AWS CloudFormation templates, an ELB is
included. However, you must reconfigure the health check on the public ELB to expose the app
to the port specified in your app definition (e.g. port 80).

All other users: You can use Marathon-LB, a rapid proxy and load balancer that is based on
HAProxy.

Go to your public agent to see the site running. For information about how to find your public
agent IP, see the documentation.
You should see your app's output in your browser.
Next steps
Learn how to load balance your app on a public node using Marathon-LB.


Deploying non-native Marathons


ENTERPRISE DC/OS PREVIEW Updated: April 17, 2017

About deploying non-native Marathons


Each service that Marathon launches uses the same Mesos role that Marathon
registered with for quotas and reservations.

In addition, any users of a given Marathon can run tasks under any Linux user that
Marathon can run tasks under.

To achieve finer-grained control over reservations, quotas and Linux user accounts, you
must deploy non-native instances of Marathon. The non-native instances of Marathon will
be launched by the native instance of Marathon.

While the native Marathon instance runs on the master nodes, the non-native Marathon
instances will run on the private agent nodes. You may need additional private agent
nodes to accommodate the increased resource demands.

Namespacing considerations
You can copy and paste the code snippets in this section as is and succeed in deploying
a single non-native Marathon instance. However, if you need to deploy more than one
non-native Marathon instance or desire more descriptive names, you will need to modify
the code snippets before issuing them.

We recommend a simple strategy for modifying the code snippets. Just replace each
instance of serv-group with the name of the service group that the non-native Marathon
will be deployed into.
In the procedures, we will use the following names:
momee-serv-group as the name of the service group

momee-serv-group-service as the name of the non-native Marathon service

momee-serv-group-principal as the name of the non-native Marathon service account

momee-serv-group/momee-serv-group-service/momee-serv-group-secret where momee-
serv-group-secret is the name of the non-native Marathon service account secret and
the secret is available in the momee-serv-group/momee-serv-group-service/ path

momee-serv-group-role as the Mesos role of the non-native Marathon

momee-serv-group-private-key.pem as the name of the file containing the private key of


the non-native Marathon service account

momee-serv-group-public-key.pem as the name of the file containing the public key of


the non-native Marathon service account

Let's imagine you have a service group called test, another called dev, and a third called
prod. After replacing serv-group with the name of the actual service group as we
recommend, you will end up with the following names.
momee-test as the name of the service group

momee-test-service as the name of the non-native Marathon service

momee-test-principal as the name of the non-native Marathon service account

momee-test/momee-test-service/momee-test-secret where momee-test-secret is the


name of the non-native Marathon service account secret and the secret is available in
the momee-test/momee-test-service/ path

momee-test-role as the Mesos role of the non-native Marathon

momee-test-private-key.pem as the name of the file containing the private key of the
non-native Marathon service account

momee-test-public-key.pem as the name of the file containing the public key of the
non-native Marathon service account

By following this scheme, you will end up with unique yet descriptive names for each of
your non-native Marathon instances. These names will match with the various roles,
service accounts, secrets, and files associated with the non-native Marathon instance. In
addition, following this scheme will protect the service account secret from other non-
native Marathon instances and from the services that the non-native Marathon launches.
Linux user account considerations
The procedures that follow will result in a non-native Marathon instance that runs under
the nobody Linux user. Feel free to replace the nobody Linux user in the config.json and
in the code snippets with another user of your choice. However, you must ensure that the
Linux user account exists on each of your agent nodes before attempting to deploy.

Deploying a non-native Marathon


instance
Requirement: You must have a private Docker registry that each private DC/OS agent
can access over the network. Popular options include DockerHub, Amazon EC2
Container Registry, and Quay. We also provide a Docker registry DC/OS package, but it
is experimental at this time and not explicitly covered in the following procedures.

To deploy a non-native Marathon instance, complete the following steps.

Contact your sales or support representative to obtain the Marathon tarball.


Important: Ensure that you receive mesosphere/marathon-dcos-ee:1.4.0-RC4_1.9.4 or
later.

Load and push the Marathon image up to your private Docker registry.

Create a service account for the non-native Marathon instance (a sketch of this step appears after this list).

Provision each private agent with the credentials for the private Docker registry.

Deploy the non-native Marathon.

Deploy a test service.

Granting users permissions to the non-native Marathon.
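As an illustration of the service-account step above, the Enterprise DC/OS CLI (with the dcos security subcommand installed) can create the key pair, the service account, and the secret roughly as follows. This is only a sketch that reuses the naming scheme from this page, not the full procedure:

dcos security org service-accounts keypair momee-serv-group-private-key.pem momee-serv-group-public-key.pem
dcos security org service-accounts create -p momee-serv-group-public-key.pem -d "Non-native Marathon service account" momee-serv-group-principal
dcos security secrets create-sa-secret momee-serv-group-private-key.pem momee-serv-group-principal momee-serv-group/momee-serv-group-service/momee-serv-group-secret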


Marathon REST API


Updated: April 17, 2017
The Marathon API allows you to manage long-running containerized services (apps and
pods).

The Marathon API is backed by the Marathon component, which runs on the master
nodes.

One of the Marathon instances is elected as leader, while the rest are hot backups in
case of failure. All API requests must go through the Marathon leader. To enforce this,
Admin Router proxies requests from any master node to the Marathon leader.

For more information about using Marathon, see Managing Services.

Routes
Access to the Marathon API is proxied through the Admin Router on each master node
using the following route:

/service/marathon/
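For example, assuming the DC/OS CLI is installed and logged in, you could list all deployed apps through this route with a request like the following sketch (/v2/apps is Marathon's standard apps endpoint):

curl -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
  "$(dcos config show core.dcos_url)/service/marathon/v2/apps" | jq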

Resources
The interactive API resource reference is rendered from /1.9/api/marathon.yaml in the
online documentation and is not reproduced here.


Enabling GPU Support


PREVIEW Updated: April 17, 2017

DC/OS supports allocating GPUs (Graphics Processing Units) to your long-running


DC/OS services. Adding GPUs to your service can dramatically accelerate big data
workloads. Learn more.

To get started, create a GPU-enabled DC/OS cluster.


Configure an AWS cluster.

Configure a non-AWS cluster.

After creating a GPU-enabled DC/OS cluster, you can configure your service to use
GPUs.
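As a sketch of what such a configuration can look like, a Marathon app definition requests GPUs with a gpus field alongside cpus and mem. The values and command below are placeholders, and the cluster must have GPU-enabled agents available:

{
  "id": "/gpu-test",
  "cmd": "nvidia-smi && sleep 3600",
  "cpus": 1,
  "mem": 128,
  "disk": 0,
  "gpus": 1,
  "instances": 1
}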


Frequently Asked Questions


Updated: April 17, 2017

We've collected some questions we often encounter concerning the usage of DC/OS.
Have a new question you'd like to see answered? Use the Submit feedback button at the bottom
of this page to suggest it, or check out how you can also contribute the answer to it.

Why is my Marathon app stuck in Waiting?


This most commonly occurs when an application being launched has higher system
requirements than any of the available offers coming to Marathon via Mesos. The
deployment will eventually fail; check system requirements and increase the resources to
the application if you want the deployment to succeed.

Why is my Marathon app launching on a private agent instead of a public one?
By default apps are launched on private nodes. For more information, see the
documentation.

What is meant by service ports in a Marathon app?


A service port is the globally-unique port assigned to the app when using automatic app
discovery through a proxy or load balancer system, as described in Service Discovery
and Load Balancing.

Why can't I start more tasks? I have free resources in my cluster.
The most common causes for this are requesting ports or resource roles that are not
available in the cluster. Tasks cannot launch unless they find agents with the required
port available, and they will not accept offers that do not contain their accepted resource
roles.

How can I automatically add more agents to the cluster?
DC/OS cannot automatically spin up new nodes in response to load on hardware unless
the cloud provider autoscaling groups have been configured with standby hosts and
dcos_install.sh placed on the standby node. This is an involved process that requires
setting up Autoscaling groups with your Cloud Provider (AWS, GCE, Azure) and placing
an install file on each node. Please reach out to Mesosphere Support for more guidance
if you need to set this up.

What is your best practice for service discovery?


A comprehensive overview of a few common service discovery implementations is
available at Service Discovery in Marathon.

Is it possible to span my cluster over different cloud providers?
This is not currently supported. For more information, see this document.

How do I add Mesos roles to a node in order to dedicate it to certain apps?
Please review this link on the Mesosphere Knowledge Base.

How can I upload files to Spark driver/executor?


Here is an example of a command you can launch to make it work:

dcos spark run --submit-args='--conf spark.mesos.uris=https://path/to/pi.conf --class JavaSparkPiConf https://path/to/sparkPi_without_config_file.jar /mnt/mesos/sandbox/pi.conf'

More info:

--conf spark.mesos.uris=...   A comma-separated list of URIs to be downloaded to the sandbox
when the driver or executor is launched by Mesos. This applies to both coarse-grained and
fine-grained mode.

/mnt/mesos/sandbox/pi.conf   A path to the downloaded file which your main class receives as
its 0th parameter (see the code snippet below). /mnt/mesos/sandbox/ is a standard path inside
a container which is mapped to the corresponding Mesos task sandbox.

How does the installer work?


DC/OS is installed in your environment by using a dynamically generated setup file. This
file is generated by using specific parameters that are set during configuration. This
installation file contains a Bash install script and a Docker container that is loaded with
everything you need to deploy a customized DC/OS build.

For more information, see the installation documentation.

What versions of kernel, local OS, Docker Engine, and union mount are recommended?

We recommend using CoreOS, which ships with correct versions and sensible defaults for
Docker, the filesystem, and other settings.


Developing DC/OS Services


Updated: April 17, 2017

Welcome to the documentation for DC/OS version 1.9. For information about new and
changed features, see the release notes.

Service Requirements Specification


Disclaimer: This document provides the DC/OS Service requirements, but is not the
complete DC/OS service certification. For the complete DC/OS Service Specification,
send an email ...
CLI Specification
This document is intended for a developer creating new DC/OS CLI commands. See also
the DC/OS Service Specification. The DC/OS CLI You can install the DC/OS Command
Line Interface ...
Creating a Universe Package
This page covers general advice and information about creating a DC/OS package that
can be published to the Mesosphere Universe. Consult the Publish a Package page
of the Univ...
Access by Proxy and VPN using DC/OS Tunnel
When developing services on DC/OS, you may find it helpful to access your cluster from
your local machine via SOCKS proxy, HTTP proxy, or VPN. For instance, you can work
from your ...
DC/OS Integration
You can leverage several integration points when creating a DC/OS Service. The
sections below explain how to integrate with each respective component. Admin Router
When a DC/OS Ser...


Service Requirements Specification


Updated: April 17, 2017

Disclaimer: This document provides the DC/OS Service requirements, but is not the
complete DC/OS service certification. For the complete DC/OS Service Specification,
send an email to partnerships@mesosphere.com.

This document is intended for a developer creating a Mesosphere DC/OS Service. It is
assumed that you are familiar with Mesos framework development.

The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT,
RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in
RFC 2119.

By completing the requirements below, you can integrate with DC/OS and have your
service certified by Mesosphere.
Terminology
Universe
DC/OS Universe contains all services that have been certified by Mesosphere. For more
information on DC/OS Universe, see the GitHub Universe repository.

Framework
A Mesos framework is the combination of a Mesos scheduler and an optional custom
executor. A framework receives resource offers describing CPU, RAM, etc., and
allocates them for discrete tasks that can be launched on Mesos agent nodes.
Mesosphere-certified Mesos frameworks, called DCOS services, are packaged and
available from public GitHub package repositories. DCOS services include Mesosphere-
certified Mesos frameworks and other applications.

DC/OS Marathon
The native Marathon instance that is the init system for DCOS. It starts and monitors
DCOS applications and services.

State abstraction
Mesos provides an abstraction for accessing storage for schedulers for Java and C++
only. This is the preferred method to access ZooKeeper.

Service
01. Service MUST be able to install without supplying a
configuration.
Your service must be installable by using default values. The options.json must not be
required for installation. There are cases where a service might require a license to work.
Your service must provide a CLI option to pass the license information to the service to
enable it.

If the service isn't running because it is missing license information, that fact MUST be
logged through stdout or stderr.

02. Service MUST be uninstallable.


A DC/OS user can uninstall your service with this command:

dcos package uninstall <service name>


Packaging
03. Service MUST use standard DC/OS packaging.
A DC/OS user must be able to install your service by running this command:

dcos package install <service name>

For this to work, the metadata for your service must be registered in the Mesosphere
Universe package repository. The metadata format is defined in the Universe repository
README.

04. Service SHOULD have a simple lowercase service name.


The name of the service is the name provided in Universe. That name should be a simple
name without reference to Mesos or DC/OS. For example, the HDFS-Mesos framework
is listed in the universe as hdfs. This name should also be the first level property of the
config.json file.

05. Service package MUST include a Marathon deployment descriptor file.


Services in DC/OS are started and managed by the native DC/OS Marathon instance.
Your DC/OS service package MUST include a Marathon descriptor file (usually named
marathon.json) which is used to launch the service. The Scheduler must be designed so
that it can be launched by Marathon.
You MUST supply a marathon.json.mustache file as part of your Service metadata.

Your long-running app MAY use a Docker image retrieved by using a Docker registry or a
binary retrieved by using a CDN backed HTTP server.

All resource configurations MUST be parameterized.

06. Service package MUST specify install-time configuration in a


config.json file.
The Marathon descriptor file (marathon.json.mustache) must be templatized, following the
examples in the Universe repository. All variables must be defined in the config.json file.

Any components that are dynamically configured, for example the Mesos master or
ZooKeeper configuration, MUST be available as command line parameters or
environment variables to the service executable. This allows the parameters to be
passed to the scheduler during package installation.

07. Service package MUST specify framework-name in a config.json file.


The framework-name property is required. The framework-name property:
MUST be a second-level property under the service property. For example, see the HDFS
config.json file.

MUST default to the service name.


SHOULD be used as the app-id in the marathon.json file. For example, see the Spark
marathon.json.mustache file.

08. All URIs used by the scheduler and executor MUST be specified in
config.json.
All URIs that are used by the service MUST be specified in the config.json file. Any URL
that is accessed by the service must be overridable and specified in the config.json
file, including:
URLs required in the marathon.json file

URLs that retrieve the executor (if not supplied by the scheduler)

URLs required by the executors, except for URLs that are for the scheduler; or a process
launched by the scheduler for retrieving artifacts or executors that are local to the
cluster.

All URLs that are used by the service must be passed in by using the command line or
provided as environment variables to the scheduler at startup time.

09. Service MUST provide a package.json file.


The package.json file MUST have:
The name field in package.json must match the package name in Universe and the default
value for the framework-name parameter in config.json. See the Chronos package in
Universe as an example.

Contact email address of owner

Description

Indication of whether this is a framework

Tags

All images

License information

Pre-install notes that indicate required resources

Post-install notes that indicate documentation, tutorials, and how to get support

Post-uninstall notes that indicate any documentation for a full uninstallation

For a reference example, see the Marathon package.

10. Service MAY provide a command.json file.


The command.json file is required when a command line subcommand is provided by the
service. This file specifies a Python wheel package for subcommand installation. For a
reference example, see the Spark package.

11. Service MAY provide a resource.json file.


The resource.json file is specified in the Universe repository.
Scheduler
12. Scheduler MUST provide a health check.
The app running in Marathon to represent your service, usually the scheduler, MUST
implement one or more Marathon health checks.

The output from these checks is used by the DC/OS web interface to display your service
health:
If ALL of your health checks pass, your service is marked in green as Healthy.

If ANY of your health checks fail, your service is marked in red as Sick. Your documentation
must provide troubleshooting information for resolving the issue.

If your Service has no tasks running in Marathon, your service is marked in yellow as Idle.
This state is normally temporary and occurs only when your service is launching.

Your app MAY set maxConsecutiveFailures=0 on any of your health checks to prevent
Marathon from terminating your app if the failure threshold of the health check is
reached.

Services must have health checks configured in the marathon.json file.
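As an illustration, a health check block in marathon.json.mustache might look like the following sketch (the path and timing values are placeholders):

"healthChecks": [
  {
    "protocol": "HTTP",
    "path": "/v1/health",
    "portIndex": 0,
    "gracePeriodSeconds": 300,
    "intervalSeconds": 60,
    "timeoutSeconds": 20,
    "maxConsecutiveFailures": 0
  }
]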

13. Scheduler MUST distribute its own binaries for executor and tasks.
The scheduler MUST attempt to run executors/tasks with no external dependencies. If an
executor/task requires custom dependencies, the scheduler should bundle the
dependencies and configure the Mesos fetcher to download the dependencies from the
scheduler or run executors/tasks in a preconfigured Docker image.

Mesos can fetch binaries by using HTTP[S], FTP[S], HDFS, or Docker pull. Many
frameworks run an HTTP server in the scheduler that can distribute the binaries, or just
rely on pulling from a public or private Docker registry. Remember that some clusters do
not have access to the public internet.

URLs for downloads must be parameterized and externalized in the config.json file, with
the exception of Docker images. The scheduler and executor MUST NOT use URLS
without externalizing them and allowing them to be configurable. This requirement
ensures that DC/OS supports on-prem datacenter environments which do not have
access to the public internet.

14. Configuration MUST be via CLI parameters or environment variables.


If your service requires configuration, the scheduler and executors MUST implement this
by passing parameters on the command line or setting environment variables.

Secrets or other sensitive information should NOT be passed as command-line
parameters, since those are exposed by ps. Storing sensitive information in environment
variables, files, or otherwise is fine.

Secrets/tokens themselves may be passed around as URIs, task labels, or otherwise. A
hook may place those credentials on disk somewhere and update the environment to
point to the on-disk credentials.
15. Service MAY provide a DC/OS CLI Subcommand.
Your Service MAY provide a custom DC/OS subcommand. For the DC/OS CLI
Specification, send an email to partnerships@mesosphere.com.

16. A Service with a DC/OS CLI Subcommand MUST implement the


minimum command set.
If providing a custom DC/OS CLI subcommand, you must implement the minimum set of
requirements.

17. A Service with DC/OS CLI MUST be driven by HTTP APIs.


Custom subcommands must interact with your service by using HTTP. The supported
method of interaction with your service is through the DC/OS Admin Router. Your service
will be exposed under the convention <dcos>/service/<service-name>.

18. In config.json all required properties MUST be specified as required.


Any property that is used by the marathon.json file that is required MUST be specified in
its appropriate required block. For an example, see Marathon's optional HTTPS mode
which makes the marathon.https-port parameter optional.

ALL properties that are used in the marathon.json file that are not in a conditional block
must be defined as required.


CLI Specification
ENTERPRISE DC/OS Updated: April 17, 2017

This document is intended for a developer creating new DC/OS CLI commands.

See also the DC/OS Service Specification.

The DC/OS CLI


You can install the DC/OS Command Line Interface (CLI) locally on your machine. The
DC/OS CLI communicates with the DC/OS cluster, running either on-premise or with a
cloud provider.

The DC/OS CLI uses a single command, dcos. All functions are expressed as
subcommands, and are shown with the dcos help command.

The DC/OS CLI is open and extensible: anyone can create a new subcommand and
make it available for installation by end users.

For example, the Spark DC/OS Service provides CLI extensions for working with Spark.
When installed, you can type the following command to run Spark jobs in the datacenter
and query their status:

dcos spark [parameters]

How to create a new DC/OS CLI subcommand


Requirements
DC/OS CLI subcommands must provide executables for Mac, Linux, and Windows.

Example: Hello World


The Hello World example implements a new subcommand called helloworld:

dcos package install helloworld --cli
dcos helloworld

How the DC/OS CLI discovers subcommands


When the dcos command is run, it searches the current shell's PATH for executables with
names that are prefixed with dcos- in the ~/.dcos/subcommands directory.

In the Hello World example, written in Python, you can create an executable of the
subcommand using pyinstaller.
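For example, with a hypothetical dcos-helloworld.py entry point, a single-file executable could be produced roughly as follows:

pip install pyinstaller
pyinstaller --onefile dcos-helloworld.py
# the resulting executable is written to dist/dcos-helloworld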

DC/OS CLI configuration


The DC/OS CLI maintains a configuration file in TOML format, where subcommands can
store configuration data.

The environment variable DCOS_CONFIG contains the path to this file.

Example of a configuration file:

[marathon] host = "localhost" port = "8080" [package] sources = [


"git://github.com/mesosphere/universe.git",
"https://github.com/mesosphere/universe/archive/master.zip",] cache = "/tmp/dcos-cache"
[your-subcommand] foo = [ "bar", "baz" ]
You can make changes to the configuration file by using the dcos config command. For
example, to change the marathon.host value:

dcos config set marathon.host localhost

Standard flags
You must assign a standard set of flags to each DC/OS CLI subcommand, described
below:

--info --version --help -h --config-schema

info

The --info flag shows a short, one-line description of the function of your subcommand.
This content is displayed when the user runs dcos help.

Example from the Spark CLI:

dcos spark --info
Run and manage Spark jobs

dcos help | grep spark
spark            Run and manage Spark jobs

version

The --version flag shows the version of the subcommand package. Notice that the version of
the subcommand package is unrelated to the version of the Service running on DC/OS.

For example, Spark v1.2.1 might be installed on DC/OS, whereas the local spark DC/OS
CLI package might be at v0.1.0.

An example from the Marathon CLI:

dcos marathon --version
dcos-marathon version 0.1.0

help and -h

The --help and -h flags both show the detailed usage for your subcommand.

An example from the Marathon CLI:

dcos marathon --help
Deploy and manage applications on the DC/OS

Usage:
    dcos marathon --config-schema
    dcos marathon --info
    dcos marathon app add [<app-resource>]
    dcos marathon app list
    dcos marathon app remove [--force] <app-id>
    dcos marathon app restart [--force] <app-id>
    dcos marathon app show [--app-version=<app-version>] <app-id>
    dcos marathon app start [--force] <app-id> [<instances>]
    dcos marathon app stop [--force] <app-id>
    dcos marathon app update [--force] <app-id> [<properties>...]
    dcos marathon app version list [--max-count=<max-count>] <app-id>
    dcos marathon deployment list [<app-id>]
    dcos marathon deployment rollback <deployment-id>
    dcos marathon deployment stop <deployment-id>
    dcos marathon deployment watch [--max-count=<max-count>] [--interval=<interval>] <deployment-id>
    dcos marathon task list [<app-id>]
    dcos marathon task show <task-id>

Options:
    -h, --help                     Show this screen
    --info                         Show a short description of this subcommand
    --version                      Show version
    --force                        ...
    --app-version=<app-version>    ...
    --config-schema                ...
    --max-count=<max-count>        ...
    --interval=<interval>          ...

Positional arguments:
    <app-id>            The application id
    <app-resource>      ...
    <deployment-id>     The deployment id
    <instances>         The number of instances to start
    <properties>        ...
    <task-id>           The task id

config-schema

The DC/OS CLI validates configurations set with the dcos config set command, by
comparing them against a JSON Schema that you define.

When your Marathon CLI subcommand is passed the --config-schema flag, it MUST
output a JSON Schema document for its configuration.

Here's an example from the Marathon CLI:

dcos marathon --config-schema
{
  "$schema": "http://json-schema.org/schema#",
  "additionalProperties": false,
  "properties": {
    "host": {
      "default": "localhost",
      "description": "",
      "title": "Marathon hostname or IP address",
      "type": "string"
    },
    "port": {
      "default": 8080,
      "description": "",
      "maximum": 65535,
      "minimum": 1,
      "title": "Marathon port",
      "type": "integer"
    }
  },
  "required": [ "host", "port" ],
  "type": "object"
}

Parameter naming conventions


The naming convention for DC/OS CLI commands is:

dcos <subcommand> <resource> <verb>

A resource is typically a noun. For example:

dcos marathon app add

Logging
The DC/OS CLI sets the environment variable DCOS_LOG_LEVEL to the log level that the
user specifies at the command line.

The logging levels are described in Python's logging HOWTO: DEBUG, INFO,
WARNING, ERROR, and CRITICAL.
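
For example, a Python subcommand might translate that variable into the standard logging configuration like this (a minimal sketch, not part of any official template):

import logging
import os

def configure_logging():
    # The DC/OS CLI exports DCOS_LOG_LEVEL (e.g. "debug", "info", "warning").
    # Fall back to WARNING if the variable is missing or unrecognized.
    level_name = os.environ.get("DCOS_LOG_LEVEL", "warning").upper()
    logging.basicConfig(level=getattr(logging, level_name, logging.WARNING))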

The DC/OS CLI Module


The DC/OS Python package is a set of common functionality useful to subcommand
developers.

Packaging
To make your subcommand available to end users, you must add a package entry to the
Mesosphere Universe repository. See the Universe README for the specification.

The package entry contains a file named resource.json that contains links to the
executable subcommands.
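
As a rough sketch only, the CLI portion of resource.json can look something like the following; the exact field names and required attributes (such as content hashes for each binary) are defined by the Universe schema, so treat this as illustrative rather than authoritative:

"cli": {
  "binaries": {
    "darwin": {
      "x86-64": { "kind": "executable", "url": "https://example.com/dcos-foo/darwin/dcos-foo" }
    },
    "linux": {
      "x86-64": { "kind": "executable", "url": "https://example.com/dcos-foo/linux/dcos-foo" }
    }
  }
}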

When the end user runs dcos package install spark --cli:

The package entry for Spark is retrieved from the repository.

The resource.json file is parsed to find the CLI resources.

The executable for the user's platform is downloaded.

How to install a new CLI subcommand


You can install a new CLI subcommand by using this syntax:

dcos package install <cli package> --cli

The same packaging format and repository are used for both DC/OS Services and CLI
subcommands.


Creating a Universe Package


Updated: April 17, 2017

This page covers general advice and information about creating a DC/OS package that
can be published to the Mesosphere Universe. Consult the Publish a Package page
of the Universe documentation for full details.

Each DC/OS Universe package consists of four files:


package.json: High-level metadata about the package.

resource.json: Contains all of the externally hosted resources (e.g., Docker images, HTTP objects, and images) that are required to install the application.

config.json: Configuration properties supported by the package, represented as a json-schema.

marathon.json.mustache: A mustache template that, when rendered, creates a Marathon app definition capable of running your service.

package.json

Every package in Universe must have a package.json file that specifies the highest-level
metadata about the package (comparable to a package.json in Node.js or setup.py in
Python).

Currently, a package can specify one of two values for .packagingVersion, either 2.0 or
3.0. The version declared will dictate which other files are required for the complete
package as well as the schemas all the files must adhere to.

Consider the following guidelines when creating your package.json file:


Focus the description on the service. Assume that all users are familiar with DC/OS and
Mesos.

The tags parameter is used for user searches (dcos package search <criteria>). Add tags
that distinguish the service in some way. Avoid the following terms: Mesos, Mesosphere,
DC/OS, and datacenter. For example, the unicorns service could have: "tags": ["rainbows",
"mythical"].
The preInstallNotes parameter gives the user information they'll need before starting the
installation process. For example, you could explain what the resource requirements are for
the service: "preInstallNotes": "Unicorns take 7 nodes with 1 core each and 1TB
of ram."
The postInstallNotes parameter gives the user information they'll need after the
installation. Focus on providing a documentation URL, a tutorial, or both. For example:
"postInstallNotes": "Thank you for installing the Unicorn
service.\n\n\tDocumentation: http://<service-url>\n\tIssues:
https://github.com/"
The postUninstallNotes parameter gives the user information they'll need after an
uninstall, for example, any further cleanup required before reinstalling and a link to the
relevant instructions. A common issue is cleaning up ZooKeeper entries. For example:
"postUninstallNotes": "The Unicorn DC/OS Service has been uninstalled and will
no longer run.\nPlease follow the instructions at http://<service-URL> to
clean up any persisted state"

See package.json for details on what can be defined in package.json.

Example package.json
{ "packagingVersion": "2.0", // use either 2.0 or 3.0 "name": "foo", // your package name
"version": "1.2.3", // the version of the package "tags": ["mesosphere", "framework"],
"maintainer": "help@bar.io", // who to contact for help "description": "Does baz.", //
description of package "scm": "https://github.com/bar/foo.git", "website":
"http://bar.io/foo", "framework": true, "postInstallNotes": "Have fun foo-ing and baz-ing!"
}

resource.json

This file declares all the externally hosted assets the package will need, for example:
Docker containers, images, or native binary CLIs. See resource.json for details on
what can be defined in resource.json.

Example resource.json
{ "images": { "icon-small": "http://some.org/foo/small.png", "icon-medium":
"http://some.org/foo/medium.png", "icon-large": "http://some.org/foo/large.png",
"screenshots": [ "http://some.org/foo/screen-1.png", "http://some.org/foo/screen-2.png" ] },
"assets": { "uris": { "log4j-properties": "http://some.org/foo/log4j.properties" },
"container": { "docker": { "23b1cfe8e04a": "some-org/foo:1.0.0" } } } }

config.json

This file declares the package's configuration properties, such as the amount of CPUs,
number of instances, and allotted memory. The defaults specified in config.json will be
part of the context when marathon.json.mustache is rendered. This file describes the
configuration properties supported by the package, represented as a json-schema.

Each property should provide a default value, specify whether it's required, and provide
validation (minimum and maximum values). Users can then override specific values at
installation time by passing an options file to the DC/OS CLI or by setting config values
through the DC/OS web interface.

Example config.json
{ "type": "object", "properties": { "foo": { "type": "object", "properties": { "baz": {
"type": "integer", "description": "How many times to do baz.", "minimum": 0, "maximum": 16,
"required": false, "default": 4 } }, "required": ["baz"] } }, "required": ["foo"] }

marathon.json.mustache

marathon.json.mustache is a mustache template that, when rendered, creates a Marathon
app definition capable of running your service.

Variables in the mustache template are evaluated from a union object created by
merging three objects in the following order:

1. Defaults specified in config.json.

2. User-supplied options from either the DC/OS CLI or the DC/OS UI.

3. The contents of resource.json.


Example marathon.json.mustache
{ "id": "foo", "cpus": "1.0", "mem": "1024", "instances": "1", "args": ["{{{foo.baz}}}"],
"container": { "type": "DOCKER", "docker": { "image":
"{{resource.assets.container.docker.foo23b1cfe8e04a}}", "network": "BRIDGE", "portMappings":
[ { "containerPort": 8080, "hostPort": 0, "servicePort": 0, "protocol": "tcp" } ] } } }

Testing and Distributing Your Package


To test your package, follow these instructions to build and run a Universe Server. After
your Universe Server is up and running, install your package using either the DC/OS CLI
or DC/OS UI.

After you have tested your package, follow the Submit Your Package instructions to
submit it.


Access by Proxy and VPN using DC/OS Tunnel
Updated: April 17, 2017

When developing services on DC/OS, you may find it helpful to access your cluster from
your local machine via SOCKS proxy, HTTP proxy, or VPN. For instance, you can work
from your own development environment and immediately test against your DC/OS
cluster.

Warning: DC/OS Tunnel is appropriate for development, debugging, and testing only. Do
not use DC/OS Tunnel in production.
SOCKS
DC/OS Tunnel can run a SOCKS proxy over SSH to the cluster. SOCKS proxies work for
any protocol, but your client must be configured to use the proxy, which runs on port
1080 by default.

HTTP
The HTTP proxy can run in two modes: transparent and standard.

Transparent Mode
In transparent mode, the HTTP proxy runs as superuser on port 80 and does not require
modification to your application. Access URLs by appending the mydcos.directory
domain. You can also use DNS SRV records as if they were URLs. The HTTP proxy
cannot currently access HTTPS in transparent mode.

Standard Mode
Though you must configure your client to use the HTTP proxy in standard mode, it does
not have any of the limitations of transparent mode. As in transparent mode, you can use
DNS SRV records as URLs.

SRV Records
An SRV DNS record is a mapping from a name to an IP/port pair. DC/OS creates SRV
records in the form _<port-name>._<service-name>._tcp.marathon.mesos. The HTTP
proxy exposes these as URLs. This feature can be useful for communicating with DC/OS
services.
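
For example, for a service named myapp in the group mygroup with a port named myport (the sample application used below), the record is _myport._myapp.mygroup._tcp.marathon.mesos, which you can inspect from a node inside the cluster with a standard DNS tool:

dig SRV _myport._myapp.mygroup._tcp.marathon.mesos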

VPN
DC/OS Tunnel provides you with full access to the DNS, masters, and agents from within
the cluster. OpenVPN requires root privileges to configure these routes.
DC/OS Tunnel Options at a Glance
SOCKS
  Pros: Specify ports; All protocols
  Cons: Requires application configuration

HTTP (transparent)
  Pros: SRV as URL; No application configuration
  Cons: Cannot specify ports (except through SRV); Only supports HTTP; Runs as superuser

HTTP (standard)
  Pros: SRV as URL; Specify ports
  Cons: Requires application configuration; Only supports HTTP/HTTPS

VPN
  Pros: No application configuration; Full and direct access to cluster; Specify ports; All protocols
  Cons: More prerequisites; Runs as superuser; May need to manually reconfigure DNS; Relatively heavyweight

Using DC/OS Tunnel


Prerequisites
Only Linux and macOS are currently supported.

The DC/OS CLI.

The DC/OS Tunnel package. Run dcos package install tunnel-cli --cli.

SSH access (key authentication only).

The OpenVPN client for VPN functionality.

Example Application
All examples will refer to this sample application:
* Service Name: myapp
* Group: mygroup
* Port: 555
* Port Name: myport

myapp is a web server listening on port 555. We'll be using curl as our client application.
Each successful example will result in the HTML served by myapp being output as text.
Using DC/OS Tunnel to run a SOCKS Proxy
Run the following command from the DC/OS CLI:

dcos tunnel socks

## Example
curl --proxy socks5h://127.0.0.1:1080 myapp-mygroup.marathon.agentip.dcos.thisdcos.directory:555

Configure your application to use the proxy on port 1080.

Using DC/OS Tunnel to run an HTTP Proxy


Transparent Mode
Run the following command from the DC/OS CLI:

sudo dcos tunnel http

## Example
curl _myport._myapp.mygroup._tcp.marathon.mesos.mydcos.directory

### Watch out!
## This won't work because you can't specify a port in transparent mode
curl myapp-mygroup.marathon.agentip.dcos.thisdcos.directory.mydcos.directory:555

In transparent mode, the HTTP proxy works by port forwarding. Append .mydcos.directory to the
end of your domain when you enter commands. For instance, http://example.com/?query=hello
becomes http://example.com.mydcos.directory/?query=hello. Note: In transparent mode,
you cannot specify a port in a URL.

Standard mode
To run the HTTP proxy in standard mode, without root privileges, use the --port flag to
configure it to use another port:

dcos tunnel http --port 8000

## Example
curl --proxy 127.0.0.1:8000 _myport._myapp.mygroup._tcp.marathon.mesos
curl --proxy 127.0.0.1:8000 myapp-mygroup.marathon.agentip.dcos.thisdcos.directory:555

Configure your application to use the proxy on the port you specified above.

SRV Records
The HTTP proxy exposes DC/OS SRV records as URLs in the form _<port-
name>._<service-name>._tcp.marathon.mesos.mydcos.directory (transparent mode) or
_<port-name>._<service-name>._tcp.marathon.mesos (standard mode).

Find your Service Name

The <service-name> is the entry in the ID field of a service you create from the DC/OS
web interface or the value of the id field in your Marathon application definition.
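
For example, a minimal Marathon application definition for the sample application above might begin like this (the cmd value is hypothetical):

{
  "id": "/mygroup/myapp",
  "instances": 1,
  "cmd": "./start-myapp"
}
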
Add a Named Port from the DC/OS Web Interface

To name a port from the DC/OS web interface, go to the Services > Services tab, click
the name of your service, and then click Edit. Enter a name for your port on the
Networking tab.

Add a Named Port in a Marathon Application Definition

Alternatively, you can add a name to the portMappings or portDefinitions field of a
Marathon application definition. Whether you use portMappings or portDefinitions
depends on whether you are using BRIDGE or HOST networking. Learn more about
networking and ports in Marathon.

"portMappings": [ { "name": "<my-port-name>", "containerPort": 3000, "hostPort": 0,


"servicePort": 10000, "labels": { "VIP_0": "1.1.1.1:30000" } } ]

"portDefinitions": [ { "name": "<my-port-name>", "protocol": "tcp", "port": 0, } ]

Using DC/OS Tunnel to run a VPN


Run the following command from the DC/OS CLI:

sudo dcos tunnel vpn

## Example
curl myapp-mygroup.marathon.agentip.dcos.thisdcos.directory:555

The VPN client attempts to auto-configure DNS, but this functionality does not work on
macOS. To use the VPN client on macOS, add the DNS servers that DC/OS Tunnel
instructs you to use.

When you use the VPN, you are virtually within your cluster. You can access
your master and agent nodes directly:

ping master.mesos
ping slave.mesos

macOS OpenVPN Client Installation


If you use Homebrew, install it with:

brew install openvpn

Then to use it:

Either add /usr/local/sbin to your $PATH,

or add the flag --client=/usr/local/sbin/openvpn like so:

sudo dcos tunnel vpn --client=/usr/local/sbin/openvpn


Another option is to install Tunnelblick (don't run it; we are only installing it for the
openvpn executable) and add the flag
--client=/Applications/Tunnelblick.app/Contents/Resources/openvpn/openvpn-*/openvpn like so:

sudo dcos tunnel vpn --client=/Applications/Tunnelblick.app/Contents/Resources/openvpn/openvpn-*/openvpn

Linux OpenVPN Client Installation


openvpn should be available via your distribution's package manager.

For example:
* Ubuntu: apt-get update && apt-get install openvpn
* ArchLinux: pacman -S openvpn


DC/OS Integration
Updated: April 17, 2017

You can leverage several integration points when creating a DC/OS Service. The
sections below explain how to integrate with each respective component.

Admin Router
When a DC/OS Service is installed and run on DC/OS, the service is generally deployed
on a private agent node. In order to allow users to access a running instance of the
service, Admin Router can function as a reverse proxy for the DC/OS Service.

Admin Router currently supports only one reverse proxy destination.

Service Endpoints
Admin Router allows Marathon tasks to define custom service UI and HTTP endpoints,
which are made available under /service/<service-name>. Set the following Marathon task
labels to enable this:

"labels": { "DCOS_SERVICE_NAME": "<service-name>", "DCOS_SERVICE_PORT_INDEX": "0",


"DCOS_SERVICE_SCHEME": "http" }

In this case, http://<dcos-cluster>/service/<service-name> would be forwarded to the
host running the task using the first port allocated to the task.

In order for the forwarding to work reliably across task failures, we recommend co-
locating the endpoints with the task. This way, if the task is restarted on another host and
with different ports, Admin Router will pick up the new labels and update the routing.
Note: Due to caching, there can be an up to 30-second delay before the new routing is
working.

We recommend having only a single task setting these labels for a given service name. If
multiple task instances have the same service name label, Admin Router will pick one of
the task instances deterministically, but this might make debugging issues more difficult.
Since the paths to resources for clients connecting to Admin Router will differ from those
paths the service actually has, ensure the service is configured to run behind a proxy.
This often means relative paths are preferred to absolute paths. In particular, resources
expected to be used by a UI should be verified to work through a proxy.

Tasks running in nested marathon app groups will be available only using their service
name (i.e., /service/<service-name>), not by the marathon app group name (i.e.,
/service/app-group/<service-name>).

DC/OS UI
Service health check information can be surfaced in the DC/OS Services UI tab by:
Defining one or more healthChecks in the service's Marathon template, for example:

"healthChecks": [ { "path": "/", "portIndex": 1, "protocol": "HTTP", "gracePeriodSeconds":


5, "intervalSeconds": 60, "timeoutSeconds": 10, "maxConsecutiveFailures": 3 } ]

Defining the label DCOS_PACKAGE_FRAMEWORK_NAME in the service's Marathon template, with the
same value that will be used when the framework registers with Mesos. For example:

"labels": { "DCOS_PACKAGE_FRAMEWORK_NAME": "unicorn" }

Setting .framework to true in package.json


CLI Subcommand
If you would like to publish a DC/OS CLI subcommand for use with your service, it is
common to have the subcommand communicate with the running service by sending
HTTP requests through Admin Router to the service.
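
A minimal sketch of that pattern in Python is shown below; the service path /service/unicorn/v1/status and the use of the requests library are illustrative assumptions, and a real cluster will usually also require an authentication token header:

import requests  # third-party HTTP client, used here only for illustration

def fetch_status(dcos_url, service_name="unicorn"):
    # Send the request to Admin Router, which reverse-proxies it to the service.
    # The "/v1/status" endpoint is a hypothetical API exposed by the service.
    url = "{}/service/{}/v1/status".format(dcos_url.rstrip("/"), service_name)
    response = requests.get(url)
    response.raise_for_status()
    return response.json()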

See dcos-helloworld for an example on how to develop a CLI Subcommand.
