Mesosphere DC/OS Masterclass : Tips and Tricks to Make Life Easier

DC/OS is an open-source distributed operating system for the data center, built on the Apache Mesos distributed systems kernel. As a distributed system, it is a cluster of master nodes and private/public agent nodes, where each node also has a host operating system that manages the underlying machine.

It enables the management of multiple machines as if they were a single computer. It automates resource management, schedules process placement, facilitates inter-process communication, and simplifies the installation and management of distributed services. Its included web interface and available command-line interface (CLI) facilitate remote management and monitoring of the cluster and its services.

Distributed System : DC/OS is a distributed system made up of a group of private and public agent nodes coordinated by master nodes.
Cluster Manager : DC/OS is responsible for running tasks on agent nodes and providing them with the resources they need. DC/OS uses Apache Mesos to provide this cluster management functionality.
Container Platform : All DC/OS tasks are containerized. DC/OS uses two container runtimes, Docker and Mesos, so containers can be started from Docker images or can be native executables (binaries or scripts) that Mesos containerizes at runtime.
Operating System : As the name suggests, DC/OS is an operating system that abstracts the cluster's hardware and software resources and provides common services to applications.

Unlike Linux, DC/OS is not a host operating system. DC/OS spans multiple machines, but relies on each machine to have its own host operating system and host kernel.

The high-level architecture of DC/OS:

[Figure: DC/OS high-level architecture]

For the detailed architecture and components of DC/OS, see the DC/OS documentation.

Adoption and usage of Mesosphere DC/OS:

Mesosphere customers include :

30% of the Fortune 50 U.S. Companies
5 of the top 10 North American Banks
7 of the top 12 Worldwide Telcos
5 of the top 10 Highest Valued Startups

Some companies using DC/OS are :

Cisco
Yelp
Tommy Hilfiger
Uber
Netflix
Verizon
Cerner
NIO

Installing and using DC/OS

A guide to installing DC/OS is available in the DC/OS documentation. After installing DC/OS on any platform, install the DC/OS CLI by following the CLI installation documentation.

Using the DC/OS CLI, we can manage cluster nodes, manage Marathon tasks and services, and install or remove packages from the Universe. It also lends itself well to automation, since each CLI command can emit its output as JSON.

NOTE: The tasks below were executed and tested with the following tools:

DC/OS 1.11 Open Source
DC/OS cli 0.6.0
jq:1.5-1-a5b5cbe

DC/OS commands and scripts

Setup DC/OS cli with DC/OS cluster

dcos cluster setup <CLUSTER URL>

Example :

dcos cluster setup http://dcos-cluster.com

The above command prints a link for OAuth authentication and prompts for an auth token (provided OAuth is enabled). You can authenticate with a Google, GitHub or Microsoft account, then paste the generated token at the CLI prompt.

DC/OS authentication token

dcos config show core.dcos_acs_token

DC/OS cluster url

dcos config show core.dcos_url

DC/OS cluster name

dcos config show cluster.name

Access Mesos UI

<DC/OS_CLUSTER_URL>/mesos

Example:

http://dcos-cluster.com/mesos

Access Marathon UI

<DC/OS_CLUSTER_URL>/service/marathon

Example:

http://dcos-cluster.com/service/marathon

Access any DC/OS service, like Marathon, Kafka, Elastic, Spark etc.[DC/OS Services]

<DC/OS_CLUSTER_URL>/service/<SERVICE_NAME>

Example:

http://dcos-cluster.com/service/marathon
http://dcos-cluster.com/service/kafka
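
The URL pattern above is easy to wrap in a small shell helper. The following is only a sketch: the function name dcos_service_url is hypothetical (it is not part of the DC/OS CLI), and it simply assumes the cluster URL may or may not end with a slash.

```shell
#!/bin/bash
# Hypothetical helper (not part of the dcos CLI): build a service URL
# from a cluster URL and a service name.
dcos_service_url() {
    local cluster_url="${1%/}"   # drop one trailing slash, if present
    local service="${2#/}"       # drop one leading slash, if present
    printf '%s/service/%s\n' "$cluster_url" "$service"
}

dcos_service_url "http://dcos-cluster.com/" "kafka"

# Against a live cluster the first argument could come from the CLI:
#   dcos_service_url "$(dcos config show core.dcos_url)" marathon
```

The demo call prints http://dcos-cluster.com/service/kafka.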

Access DC/OS slaves info in json using Mesos API [Mesos Endpoints]

curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" "$(dcos config show core.dcos_url)/mesos/slaves" | jq .

Access DC/OS slaves info in json using DC/OS cli

dcos node --json

Note : The DC/OS CLI command 'dcos node --json' is equivalent to querying the Mesos slaves endpoint (/mesos/slaves).

Access DC/OS private slaves info using DC/OS cli

dcos node --json | jq '.[] | select(.type | contains("agent")) | select(.attributes.public_ip == null) | "Private Agent : " + .hostname ' -r

Access DC/OS public slaves info using DC/OS cli

dcos node --json | jq '.[] | select(.type | contains("agent")) | select(.attributes.public_ip != null) | "Public Agent : " + .hostname ' -r

Access DC/OS private and public slaves info using DC/OS cli

dcos node --json | jq '.[] | select(.type | contains("agent")) | if (.attributes.public_ip != null) then "Public Agent : " else "Private Agent : " end + " - " + .hostname ' -r | sort

Get public IP of all public agents

#!/bin/bash

for id in $(dcos node --json | jq --raw-output '.[] | select(.attributes.public_ip == "true") | .id'); 
do 
      dcos node ssh --option StrictHostKeyChecking=no --option LogLevel=quiet --master-proxy --mesos-id=$id "curl -s ifconfig.co"
done 2>/dev/null

Note: 'dcos node ssh' requires your private key to be available to ssh. Make sure you add the private key as an ssh identity using :

ssh-add </path/to/private/key/file/.pem>

Get public IP of master leader

dcos node ssh --option StrictHostKeyChecking=no --option LogLevel=quiet --master-proxy --leader "curl -s ifconfig.co" 2>/dev/null

Get all master nodes and their private ip

dcos node --json | jq '.[] | select(.type | contains("master"))
| .ip + " = " + .type' -r

Get list of all users who have access to DC/OS cluster

curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" "$(dcos config show core.dcos_url)/acs/api/v1/users" | jq -r '.array[].uid'

Add users to cluster using Mesosphere script (Run this on master)

Users to add are given in list.txt, each user on new line

for i in `cat list.txt`; do echo $i;
sudo -i dcos-shell /opt/mesosphere/bin/dcos_add_user.py $i; done
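
Looping over `cat list.txt` splits on any whitespace and will also pick up blank or commented lines. A slightly more defensive way to read the list could look like the sketch below. The helper name list_users and the demo addresses are made up for illustration, and treating lines that start with '#' as comments is an assumption, not something the Mesosphere script requires.

```shell
#!/bin/bash
# Hypothetical helper: print the usable entries of a user list file,
# skipping empty lines and lines that start with '#'.
list_users() {
    # The "|| [ -n ... ]" part keeps a last line that lacks a trailing newline.
    while IFS= read -r user || [ -n "$user" ]; do
        case "$user" in
            ''|'#'*) continue ;;
        esac
        printf '%s\n' "$user"
    done < "$1"
}

# Demo with a throwaway file. On a master node, each printed entry could be
# passed to dcos_add_user.py as in the loop above.
demo_list=$(mktemp)
printf 'alice@example.com\n\n# contractors\nbob@example.com\n' > "$demo_list"
list_users "$demo_list"
rm -f "$demo_list"
```

The demo prints the two addresses and drops the empty line and the comment. Lines containing only spaces are not filtered in this sketch.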

Add users to cluster using DC/OS API

#!/bin/bash
# Usage : dcosAddUsers.sh
# Users to add are listed in users.list, one user per line.
for i in $(cat users.list);
do
  echo "$i"
  curl -X PUT -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" "$(dcos config show core.dcos_url)/acs/api/v1/users/$i" -d "{}"
done

Delete users from DC/OS cluster organization

#!/bin/bash
# Usage : dcosDeleteUsers.sh
# Users to delete are listed in users.list, one user per line.
for i in $(cat users.list);
do
  echo "$i"
  curl -X DELETE -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" "$(dcos config show core.dcos_url)/acs/api/v1/users/$i" -d "{}"
done

Offers/resources from individual DC/OS agent

In recent versions of many DC/OS services, the scheduler endpoint at

http://yourcluster.com/service/<service-name>/v1/debug/offers

will display an HTML table containing a summary of recently-evaluated offers. This table’s contents are currently very similar to what can be found in logs, but in a slightly more accessible format. Alternately, we can look at the scheduler’s logs in stdout. An offer is a set of resources all from one individual DC/OS agent.

<DC/OS_CLUSTER_URL>/service/<service_name>/v1/debug/offers

Example:

http://dcos-cluster.com/service/kafka/v1/debug/offers
http://dcos-cluster.com/service/elastic/v1/debug/offers

Save JSON configs of all running Marathon apps

#!/bin/bash
# Save marathon configs in json format for all marathon apps
# Usage : saveMarathonConfig.sh
for app in $(dcos marathon app list --quiet | sort); do
  # App ids look like /group/app: keep the id intact for the lookup and
  # flatten it (no leading slash, "/" -> "_") for the file name.
  file="$(echo "${app#/}" | tr '/' '_').json"
  # Only stdout goes to the file, so CLI errors stay visible on the terminal.
  dcos marathon app show "$app" | jq '. | del(.tasks, .version, .versionInfo, .tasksHealthy, .tasksRunning, .tasksStaged, .tasksUnhealthy, .deployments, .executor, .lastTaskFailure, .args, .ports, .residency, .secrets, .storeUrls, .uris, .user)' > "$file"
done

Get report of Marathon apps with details like container type, Docker image, tag or service version used by Marathon app.

#!/bin/bash
# Report container type, Docker image and tag (or package version) for every Marathon app.
TMP_CSV_FILE=$(mktemp /tmp/dcos-config.XXXXXX.csv)
TMP_CSV_FILE_SORT="${TMP_CSV_FILE}_sort"
# One CSV line per app: id,container type,image (or package name:version, or "[ CMD ]")
dcos marathon app list --json | jq '.[] | .id + if (.container.type == "DOCKER") then ",Docker Container," + .container.docker.image else ",Mesos Container," + if(.labels.DCOS_PACKAGE_VERSION !=null) then .labels.DCOS_PACKAGE_NAME+":"+.labels.DCOS_PACKAGE_VERSION  else "[ CMD ]" end end' -r > "$TMP_CSV_FILE"
# Strip the leading slash from app ids
sed -i "s|^/||g" "$TMP_CSV_FILE"
sort -t "," -k2,2 -k3,3 -k1,1 "$TMP_CSV_FILE" > "$TMP_CSV_FILE_SORT"
cnt=1
printf '%.0s=' {1..150}
printf "\n  %-5s%-35s%-23s%-40s%-20s\n" "No" "Application Name" "Container Type" "Docker Image" "Tag / Version"
printf '%.0s=' {1..150}
while IFS=, read -r app typ image;
do
        # No tag means "latest"; plain command apps have no image at all
        tag=$(echo "$image" | awk -F':' -v im="$image" '{tag=(im=="[ CMD ]")?"NA":($2=="")?"latest":$2; print tag}')
        image=$(echo "$image" | awk -F':' '{print $1}')
        printf "\n  %-5s%-35s%-23s%-40s%-20s" "$cnt" "$app" "$typ" "$image" "$tag"
        cnt=$((cnt + 1))
        sleep 0.3
done < "$TMP_CSV_FILE_SORT"
printf "\n"
printf '%.0s=' {1..150}
printf "\n"
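
One caveat with splitting the image on ':' is that a registry with a port (for example registry.local:5000/team/app) would be cut at the port. A possible alternative, sketched here with plain shell parameter expansion, treats a colon as a tag separator only when it appears after the last '/'. The function names image_name and image_tag and the sample references are hypothetical, and digest references (image@sha256:...) are not handled.

```shell
#!/bin/bash
# Hypothetical helpers: split a Docker image reference into name and tag.
# A colon only introduces a tag when it comes after the last '/'.
image_tag() {
    local last="${1##*/}"            # component after the last slash
    case "$last" in
        *:*) printf '%s\n' "${last##*:}" ;;
        *)   printf 'latest\n' ;;    # same default the report script uses
    esac
}

image_name() {
    local last="${1##*/}"
    case "$last" in
        *:*) printf '%s\n' "${1%:*}" ;;   # cut the shortest ":tag" suffix
        *)   printf '%s\n' "$1" ;;
    esac
}

for ref in nginx:1.15 nginx registry.local:5000/team/app; do
    printf '%-30s %-30s %s\n' "$ref" "$(image_name "$ref")" "$(image_tag "$ref")"
done
```

These two functions could stand in for the two awk calls inside the while loop, provided the "[ CMD ]" case is still checked first.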

Get DC/OS nodes with more information like node type, node ip, attributes, number of running tasks, free memory, free cpu etc.

#!/bin/bash
printf "\n  %-15s %-18s%-18s%-10s%-15s%-10s\n" "Node Type" "Node IP" "Attribute" "Tasks" "Mem Free (MB)" "CPU Free"
printf '%.0s=' {1..90}
printf "\n"
# Rows are tab separated, so sort needs a literal tab as field separator
TAB=$(printf '\t')
dcos node --json | jq '.[] | if (.type | contains("leader")) then "Master (leader)" elif ((.type | contains("agent")) and .attributes.public_ip != null) then "Public Agent" elif ((.type | contains("agent")) and .attributes.public_ip == null) then "Private Agent" else empty end + "\t"+ if(.type |contains("master")) then .ip else .hostname end + "\t" +  (if (.attributes | length !=0) then (.attributes | to_entries[] | join(" = ")) else "NA" end) + "\t" + if(.type |contains("agent")) then (.TASK_RUNNING|tostring) + "\t" + ((.resources.mem - .used_resources.mem)| tostring) + "\t\t" +  ((.resources.cpus - .used_resources.cpus)| tostring)  else "\t\tNA\tNA\t\tNA"  end' -r | sort -t"$TAB" -k1,1d -k3,3d -k2,2d
printf '%.0s=' {1..90}
printf "\n"

Framework Cleaner

Uninstall a framework and clean up any resources it still reserves after it is deleted or uninstalled. (The cleanup step applies to DC/OS 1.9 or older; on DC/OS 1.10 and later, uninstalling through the CLI is sufficient.)

# Set SERVICE_NAME to the name of the service being removed
SERVICE_NAME=
dcos package uninstall $SERVICE_NAME
dcos node ssh --option StrictHostKeyChecking=no --master-proxy --leader "docker run mesosphere/janitor /janitor.py -r ${SERVICE_NAME}-role -p ${SERVICE_NAME}-principal -z dcos-service-${SERVICE_NAME}"

Get DC/OS apps and their placement constraints

dcos marathon app list --json | jq '.[] |
if (.constraints != null) then .id, .constraints else empty end'

Run shell command on all slaves

#!/bin/bash
# Run any shell command on all slave nodes (private and public)
# Usage : dcosRunOnAllSlaves.sh <CMD= any shell command to run, Ex: ulimit -a >
CMD=$1
for i in $(dcos node | grep -E -v "TYPE|master" | awk '{print $1}'); do
   echo -e "\n###> Running command [ $CMD ] on $i"
   dcos node ssh --option StrictHostKeyChecking=no --option LogLevel=quiet --master-proxy --private-ip=$i "$CMD"
   echo -e "======================================\n"
done

Run shell command on master leader

CMD=<shell command, Ex: ulimit -a >
dcos node ssh --option StrictHostKeyChecking=no --option LogLevel=quiet --master-proxy --leader "$CMD"

Run shell command on all master nodes

#!/bin/bash
# Run any shell command on all master nodes
# Usage : dcosRunOnAllMasters.sh <CMD= any shell command to run, Ex: ulimit -a >
CMD=$1
for i in $(dcos node | grep -E -v "TYPE|agent" | awk '{print $2}')
do
  echo -e "\n###> Running command [ $CMD ] on $i"
  dcos node ssh --option StrictHostKeyChecking=no --option LogLevel=quiet --master-proxy --private-ip=$i "$CMD"
  echo -e "======================================\n"
done

Add node attributes to dcos nodes and run apps on nodes with required attributes using placement constraints

#!/bin/bash
#1. SSH on node
#2. Create or edit file /var/lib/dcos/mesos-slave-common
#3. Add contents as :
#    MESOS_ATTRIBUTES=<key>:<value>
#    Example:
#    MESOS_ATTRIBUTES=TYPE:DB;DB_TYPE:MONGO;
#4. Stop dcos-mesos-slave service
#    systemctl stop dcos-mesos-slave
#5. Remove link for latest slave metadata
#    rm -f /var/lib/mesos/slave/meta/slaves/latest
#6. Start dcos-mesos-slave service
#    systemctl start dcos-mesos-slave
#7. Wait for some time, node will be in HEALTHY state again.
#8. Add app placement constraint with field = key and value = value
#9. Verify attributes, run on any node
#    curl -s http://leader.mesos:5050/state | jq '.slaves[]| .hostname ,.attributes'
#    OR Check DCOS cluster UI
#    Nodes => Select any Node => Details Tab

tmpScript=$(mktemp "/tmp/addDcosNodeAttributes-XXXXXXXX")

# key:value paired attributes, separated by ;
ATTRIBUTES=NODE_TYPE:GPU_NODE

# ${ATTRIBUTES} is expanded here, while the script is generated
cat <<EOF > ${tmpScript}
echo "MESOS_ATTRIBUTES=${ATTRIBUTES}" | sudo tee /var/lib/dcos/mesos-slave-common
sudo systemctl stop dcos-mesos-slave
sudo rm -f /var/lib/mesos/slave/meta/slaves/latest
sudo systemctl start dcos-mesos-slave
EOF

# nodes.txt lists the private IPs of the nodes to add attributes to, one IP per line.
for i in $(cat nodes.txt); do
    echo $i
    dcos node ssh --master-proxy --option StrictHostKeyChecking=no --private-ip $i <$tmpScript
    sleep 10
done
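
Step 8 above (adding a placement constraint whose field is the attribute key) can be illustrated with a small helper that turns an attribute string into the "constraints" fragment of a Marathon app definition. This is a sketch only: the function name attributes_to_constraints is hypothetical, and using the CLUSTER operator for every attribute is an assumption; Marathon has other operators that may suit a given app better.

```shell
#!/bin/bash
# Hypothetical helper: turn "KEY:VALUE;KEY2:VALUE2" into a Marathon
# "constraints" fragment, using the CLUSTER operator for each attribute.
attributes_to_constraints() {
    local attrs="${1%;}" out="" pair key value
    local old_ifs="$IFS"
    IFS=';'                          # split the attribute string on ';'
    for pair in $attrs; do
        key="${pair%%:*}"            # text before the first ':'
        value="${pair#*:}"           # text after the first ':'
        out="${out:+$out, }[\"$key\", \"CLUSTER\", \"$value\"]"
    done
    IFS="$old_ifs"
    printf '"constraints": [%s]\n' "$out"
}

attributes_to_constraints "TYPE:DB;DB_TYPE:MONGO;"
```

The demo call prints "constraints": [["TYPE", "CLUSTER", "DB"], ["DB_TYPE", "CLUSTER", "MONGO"]], which can be pasted into the JSON definition of an app that should only run on nodes carrying those attributes.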

Install DC/OS Datadog metrics plugin on all DC/OS nodes

#!/bin/bash

# Usage : bash installDCOSDataDogMetricsPlugin.sh <Datadog API KEY>

DDAPI=$1

if [[ -z $DDAPI ]]; then
    echo "[Datadog Plugin] Need datadog API key as parameter."
    echo "[Datadog Plugin] Usage : bash installDCOSDataDogMetricsPlugin.sh <Datadog API KEY>."
    # Stop here, otherwise the plugin would be installed with an empty key
    exit 1
fi
tmpScriptMaster=$(mktemp "/tmp/installDatadogPlugin-XXXXXXXX")
tmpScriptAgent=$(mktemp "/tmp/installDatadogPlugin-XXXXXXXX")

declare agent=$tmpScriptAgent
declare master=$tmpScriptMaster

for role in "agent" "master"
do
cat <<EOF > ${!role}
curl -s -o /opt/mesosphere/bin/dcos-metrics-datadog -L https://downloads.mesosphere.io/dcos-metrics/plugins/datadog
chmod +x /opt/mesosphere/bin/dcos-metrics-datadog
echo "[Datadog Plugin] Downloaded dcos datadog metrics plugin."
export DD_API_KEY=$DDAPI
export AGENT_ROLE=$role
sudo curl -s -o /etc/systemd/system/dcos-metrics-datadog.service https://downloads.mesosphere.io/dcos-metrics/plugins/datadog.service
echo "[Datadog Plugin] Downloaded dcos-metrics-datadog.service."
sudo sed -i "s/--dcos-role master/--dcos-role \$AGENT_ROLE/g;s/--datadog-key .*/--datadog-key \$DD_API_KEY/g" /etc/systemd/system/dcos-metrics-datadog.service
echo "[Datadog Plugin] Updated dcos-metrics-datadog.service with DD API Key and agent role."
sudo systemctl daemon-reload
sudo systemctl start dcos-metrics-datadog.service
echo "[Datadog Plugin] dcos-metrics-datadog.service is started !"
servStatus=\$(sudo systemctl is-failed dcos-metrics-datadog.service)
echo "[Datadog Plugin] dcos-metrics-datadog.service status : \${servStatus}"
#sudo systemctl status dcos-metrics-datadog.service | head -3
#sudo journalctl -u dcos-metrics-datadog
EOF
done

echo "[Datadog Plugin] Temp script for master saved at : $tmpScriptMaster"
echo "[Datadog Plugin] Temp script for agent saved at : $tmpScriptAgent"

for i in `dcos node | egrep -v "TYPE|master" | awk '{print $1}'` 
do 
    echo -e "\n###> Node - $i"
    dcos node ssh --option LogLevel=quiet --option StrictHostKeyChecking=no --master-proxy --private-ip=$i < $tmpScriptAgent
    echo -e "======================================================="
done

for i in `dcos node | egrep -v "TYPE|agent" | awk '{print $2}'` 
do 
    echo -e "\n###> Master Node - $i"
    dcos node ssh --option LogLevel=quiet --option StrictHostKeyChecking=no --master-proxy --private-ip=$i < $tmpScriptMaster
    echo -e "======================================================="
done

# Check status of dcos-metrics-datadog.service on all nodes.
#for i in `dcos node | egrep -v "TYPE|master" | awk '{print $1}'` ; do  echo -e "\n###> $i"; dcos node ssh --option StrictHostKeyChecking=no --option LogLevel=quiet --master-proxy --private-ip=$i "sudo systemctl is-failed dcos-metrics-datadog.service"; echo -e "======================================\n"; done
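
The script above relies on a heredoc detail that is easy to miss: inside an unquoted <<EOF block, a plain $VAR is expanded while the temp script is being generated, whereas an escaped \$VAR is written out literally and only expanded when the script runs on the node. The sketch below shows the difference in isolation. The api_key value and the function name render_remote_script are placeholders for illustration, and the remote node is mimicked by a local sh.

```shell
#!/bin/bash
api_key="abc123"    # placeholder value, known only on the machine generating the script

# Print the text of a generated script. $api_key is expanded right now,
# \$AGENT_ROLE stays a literal variable reference in the output.
render_remote_script() {
cat <<EOF
echo "key baked in at generation time: $api_key"
echo "role resolved at run time: \$AGENT_ROLE"
EOF
}

# Run the generated text in a separate shell to mimic the remote node.
AGENT_ROLE=agent sh -c "$(render_remote_script)"
```

The second line of output changes with the value of AGENT_ROLE in the shell that runs the generated text, while the first line is fixed at generation time.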

Get app / node metrics fetched by dcos-metrics component using metrics API

Get DC/OS node id [dcos node]
Get Node metrics (CPU, memory, local filesystems, networks, etc) : <DC/OS_CLUSTER_URL>/system/v1/agent/<agent_id>/metrics/v0/node
Get id of all containers running on that agent : <DC/OS_CLUSTER_URL>/system/v1/agent/<agent_id>/metrics/v0/containers
Get Resource allocation and usage for the given container ID : <DC/OS_CLUSTER_URL>/system/v1/agent/<agent_id>/metrics/v0/containers/<container_id>
Get Application-level metrics from the container (shipped in StatsD format using the listener available at STATSD_UDP_HOST and STATSD_UDP_PORT) : <DC/OS_CLUSTER_URL>/system/v1/agent/<agent_id>/metrics/v0/containers/<container_id>/app
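
The four metrics endpoints above share one prefix, so they can be built with a single helper. This is a minimal sketch: the function name metrics_url is hypothetical, and the agent and container ids in the demo are left as placeholders to be replaced with real values from 'dcos node' and the containers endpoint.

```shell
#!/bin/bash
# Hypothetical helper: build a dcos-metrics URL for an agent.
# The third argument is the resource under metrics/v0 and defaults to "node".
metrics_url() {
    local base="${1%/}" agent_id="$2" resource="${3:-node}"
    printf '%s/system/v1/agent/%s/metrics/v0/%s\n' "$base" "$agent_id" "$resource"
}

metrics_url "http://dcos-cluster.com" "<agent_id>"
metrics_url "http://dcos-cluster.com" "<agent_id>" "containers"
metrics_url "http://dcos-cluster.com" "<agent_id>" "containers/<container_id>/app"

# Against a live cluster, the result could be fetched with the same
# Authorization header used earlier in this post:
#   curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" \
#        "$(metrics_url "$(dcos config show core.dcos_url)" "$AGENT_ID")" | jq .
```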

Get app / node metrics fetched by dcos-metrics component using dcos cli

Summary of container metrics for a specific task

dcos task metrics summary <task-id>

All metrics in details for a specific task

dcos task metrics details <task-id>

Summary of Node metrics for a specific node

dcos node metrics summary <mesos-node-id>

All Node metrics in details for a specific node

dcos node metrics details <mesos-node-id>

NOTE : All of the above commands accept a '--json' flag so they can be used programmatically.

Launch / run command inside container for a task

The DC/OS CLI command 'dcos task exec' only supports Mesos containers. The script below supports both Mesos and Docker containers.

#!/bin/bash
echo "DCOS Task Exec 2.0"
if [ "$#" -eq 0 ]; then
        echo "Need task name or id as input. Exiting."
        exit 1
fi
taskName=$1
taskCmd=${2:-bash}
TMP_TASKLIST_JSON=/tmp/dcostasklist.json
dcos task --json > $TMP_TASKLIST_JSON
# Count the tasks whose name matches exactly
taskExist=$(jq --arg tname "$taskName" '.[] | if(.name == $tname ) then .name else empty end' -r $TMP_TASKLIST_JSON | wc -l)
if [[ $taskExist -eq 0 ]]; then
        echo "No task with name $taskName exists."
        echo "Do you mean ?"
        dcos task | grep "$taskName" | awk '{print $1}'
        exit 1
fi
# Container type and id of the first matching task
taskType=$(jq --arg tname "$taskName" '[.[] | select(.name == $tname)][0] | .container.type' -r $TMP_TASKLIST_JSON)
TaskId=$(jq --arg tname "$taskName" '[.[] | select(.name == $tname)][0] | .id' -r $TMP_TASKLIST_JSON)
if [[ $taskExist -ne 1 ]]; then
        echo -e "More than one instance found. Please enter the task ID to run the command on.\n"
        # List the matching tasks so an ID can be copied from the output
        dcos task "$taskName"
        echo ""
        read TaskId
fi
if [[ $taskType !=  "DOCKER" ]]; then
        echo "Task [ $taskName ] is of type MESOS Container."
        execCmd="dcos task exec --interactive --tty $TaskId $taskCmd"
        echo "Running [$execCmd]"
        $execCmd
else
        echo "Task [ $taskName ] is of type DOCKER Container."
        taskNodeIP=$(dcos task $TaskId | awk 'FNR == 2 {print $2}')
        echo "Task [ $taskName ] with task Id [ $TaskId ] is running on node [ $taskNodeIP ]."
        # Find the Docker container that carries this Mesos task id as a label
        taskContID=$(dcos node ssh --option LogLevel=quiet --option StrictHostKeyChecking=no --private-ip=$taskNodeIP --master-proxy "docker ps -q --filter label=MESOS_TASK_ID=$TaskId" 2> /dev/null)
        # Strip the carriage return added by the remote terminal
        taskContID=$(echo $taskContID | tr -d '\r')
        echo "Task Docker Container ID : [ $taskContID ]"
        echo "Running [ docker exec -it $taskContID $taskCmd ]"
        dcos node ssh --option StrictHostKeyChecking=no --option LogLevel=quiet --private-ip=$taskNodeIP --master-proxy "docker exec -it $taskContID $taskCmd" 2>/dev/null
fi

Get DC/OS tasks by node

#!/bin/bash 
function tasksByNodeAPI
{
    echo "DC/OS Tasks By Node"
    if [ "$#" -eq 0 ]; then
        echo "Need node ip as input. Exiting."
        exit 1
    fi
    nodeIp=$1
    mesosId=`dcos node | grep $nodeIp | awk '{print $3}'`
    if [ -z "$mesosId" ]; then
        echo "No node found with ip $nodeIp. Exiting."
        exit 1
    fi
    curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" "$(dcos config show core.dcos_url)/mesos/tasks?limit=10000" | jq --arg mesosId "$mesosId" '.tasks[] | select (.slave_id == $mesosId and .state == "TASK_RUNNING") | .name + "\t\t\t" + .id'  -r
}
function tasksByNodeCLI
{
        echo "DC/OS Tasks By Node"
        if [ "$#" -eq 0 ]; then
                echo "Need node ip as input. Exiting."
                exit 1
        fi
        nodeIp=$1
        dcos task | egrep "HOST|$nodeIp"
}
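
The jq filter used in tasksByNodeAPI keeps tasks whose `slave_id` matches the agent and whose `state` is `TASK_RUNNING`. A rough Python equivalent, assuming the response has already been fetched and decoded, could look like this; the payload and agent IDs below are made up for illustration.

```python
def running_tasks_on_agent(payload, agent_id):
    """Return (name, id) pairs for tasks running on the given Mesos agent."""
    return [
        (task["name"], task["id"])
        for task in payload.get("tasks", [])
        if task.get("slave_id") == agent_id and task.get("state") == "TASK_RUNNING"
    ]


# Hypothetical payload in the shape the jq filter above expects.
payload = {
    "tasks": [
        {"name": "web", "id": "web.1", "slave_id": "agent-S1", "state": "TASK_RUNNING"},
        {"name": "job", "id": "job.7", "slave_id": "agent-S1", "state": "TASK_FINISHED"},
        {"name": "db", "id": "db.3", "slave_id": "agent-S2", "state": "TASK_RUNNING"},
    ]
}

for name, task_id in running_tasks_on_agent(payload, "agent-S1"):
    print(name, task_id, sep="\t")
```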

Get cluster metadata – cluster Public IP and cluster ID

curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" \
  "$(dcos config show core.dcos_url)/metadata"

Sample Output:

{
"PUBLIC_IPV4": "123.456.789.012",
"CLUSTER_ID": "abcde-abcde-abcde-abcde-abcde-abcde"
}
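
The curl calls in this part of the article all share one shape: the cluster URL, an endpoint path, and the token in an Authorization header. A minimal Python sketch of that shape is shown below. It only builds the request and sends nothing; the cluster URL and token are placeholders, and the header format simply mirrors the curl commands used here.

```python
import urllib.request


def dcos_request(cluster_url, token, path):
    """Build (but do not send) an authenticated request for a DC/OS endpoint."""
    url = cluster_url.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(url, headers={"Authorization": "Bearer " + token})


# Placeholder values; in practice they would come from
# `dcos config show core.dcos_url` and `dcos config show core.dcos_acs_token`.
req = dcos_request("https://dcos.example.com/", "TOKEN", "/metadata")
print(req.full_url)
print(req.get_header("Authorization"))
```

Passing the request to `urllib.request.urlopen` would perform the call against a real cluster.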

Get DC/OS metadata – DC/OS version

curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" \
  "$(dcos config show core.dcos_url)/dcos-metadata/dcos-version.json"

Sample Output:

{
"version": "1.11.0",
"dcos-image-commit": "b6d6ad4722600877fde2860122f870031d109da3",
"bootstrap-id": "a0654657903fb68dff60f6e522a7f241c1bfbf0f"
}

Get Mesos version

curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" \
  "$(dcos config show core.dcos_url)/mesos/version"

Sample Output:

{
"build_date": "2018-02-27 21:31:27",
"build_time": 1519767087.0,
"build_user": "",
"git_sha": "0ba40f86759307cefab1c8702724debe87007bb0",
"version": "1.5.0"
}
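
In the sample output, build_time looks like the Unix timestamp of build_date expressed in UTC. A quick way to check that reading, using only the values shown above:

```python
from datetime import datetime, timezone


def build_time_to_utc(build_time):
    """Format a Unix timestamp as 'YYYY-MM-DD HH:MM:SS' in UTC."""
    moment = datetime.fromtimestamp(build_time, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(build_time_to_utc(1519767087.0))  # 2018-02-27 21:31:27
```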

Access DC/OS cluster exhibitor UI (Exhibitor supervises ZooKeeper and provides a management web interface)

<CLUSTER_URL>/exhibitor

Access DC/OS cluster data from cluster zookeeper using Zookeeper Python client – Run inside any node / container

from kazoo.client import KazooClient
zk = KazooClient(hosts='leader.mesos:2181', read_only=True)
zk.start()
clusterId = ""
# Here we can give znode path to retrieve its decoded data,
# for ex to get cluster-id, use
# data, stat = zk.get("/cluster-id")
# clusterId = data.decode("utf-8")
# Get cluster Id
if zk.exists("/cluster-id"):
    data, stat = zk.get("/cluster-id")
    clusterId = data.decode("utf-8")
zk.stop()
print (clusterId)

Access dcos cluster data from cluster zookeeper using exhibitor rest API

# Get znode data using endpoint :
# /exhibitor/exhibitor/v1/explorer/node-data?key=/path/to/node
# Example : Get znode data for path = /cluster-id
curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" \
  "$(dcos config show core.dcos_url)/exhibitor/exhibitor/v1/explorer/node-data?key=/cluster-id"

Sample Output:

{
"bytes": "3333-XXXXXX",
"str": "abcde-abcde-abcde-abcde-abcde-",
"stat": "XXXXXX"
}

Get cluster name using Mesos API

curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" \
  "$(dcos config show core.dcos_url)/mesos/state-summary" | jq .cluster -r

Mark Mesos node as decommissioned

Instances running as DC/OS nodes sometimes get terminated and cannot come back online; a terminated AWS EC2 instance, for example, cannot be started again. When Mesos detects that a node has stopped, it puts the node in the UNREACHABLE state, because it does not know whether the node is temporarily stopped and will come back online or is permanently gone. If we know the node will not come back, we can explicitly tell Mesos to put it in the GONE state:

dcos node decommission <mesos-agent-id>

Conclusion

We learned about Mesosphere DC/OS, its functionality and roles. We also learned how to set up and use the DC/OS CLI, how to use HTTP authentication to access DC/OS APIs, and how to use the DC/OS CLI for automating tasks.

We went through different API endpoints such as Mesos, Marathon, DC/OS metrics, Exhibitor and DC/OS cluster organization. Finally, we looked at different tricks and scripts to automate DC/OS, such as DC/OS node details, task exec, the Docker report and DC/OS API HTTP authentication.

December 12, 2022

Installing Redis Cluster with Persistent Storage on Mesosphere DC/OS
In the first part of this blog, we saw how to install standalone redis service on DC/OS with Persistent storage using RexRay and AWS EBS volumes.

A single server is a single point of failure in every system, so to ensure high availability of the Redis database, we can deploy a master-slave cluster of Redis servers. In this blog, we will see how to set up such a 6-node (3 master, 3 slave) Redis cluster and persist data using RexRay and AWS EBS volumes. After that, we will see how to import existing data into this cluster.

Redis Cluster

Redis Cluster is a set of replicated Redis servers in a multi-master architecture. All data is sharded into 16384 hash slots; every master node is assigned a subset of the slots (generally an even share), and each master is replicated by its slaves. It provides more resilience and scaling for production-grade deployments where a heavy workload is expected. Applications can connect to any node in cluster mode, and the request will be redirected to the respective master node.

Source: Octo
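
To make the sharding concrete: according to the Redis Cluster specification, a key's slot is CRC16(key) mod 16384, using the XMODEM variant of CRC16, and only the part between the first `{` and the following `}` is hashed when such a non-empty "hash tag" exists. The function names below are our own; this is a sketch of the mapping, not code taken from Redis.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Map a key to one of the 16384 hash slots, honouring hash tags."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end > start + 1:  # a closing brace exists and the tag is non-empty
            data = data[start + 1:end]
    return crc16_xmodem(data) % 16384


print(key_slot("foo"))
print(key_slot("{user1000}.following") == key_slot("{user1000}.followers"))
```

Keys that share a hash tag land in the same slot, which is what allows multi-key operations on them in cluster mode.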

Objective: To create a Redis cluster with a number of services in a DC/OS environment with persistent storage, and import the existing Redis dump.rdb data into the cluster.

Prerequisites :
- Make sure rexray component is running and is in a healthy state for DCOS cluster.
Steps:
- As per the Redis documentation, a minimal cluster should have at least 3 master and 3 slave nodes, for a total of 6 Redis services.
- All services will use a similar JSON configuration, differing only in the service name, external volume and port mappings.
- We will deploy one Redis service for each Redis cluster node and, once all services are running, form the cluster among them.
- We will use the host network for the Redis node containers, so each Redis node is pinned to a particular DC/OS node. This helps when troubleshooting the cluster: the IP is fixed, so a Redis node can be restarted at any time without data loss.
- Using the host network adds a prerequisite: the number of DC/OS nodes must be >= the number of Redis nodes.
1. First create Redis node services on DCOS:
- Click on the Add button in the Services tab of the DCOS UI
- Click on JSON configuration
- Add the JSON config below for the Redis service, changing the values written in BLOCK letters with # as prefix and suffix.
- #NODENAME# – Name of the Redis node (Ex. redis-node-1)
- #NODEHOSTIP# – IP of the DCOS node on which this Redis node will run. This IP must be unique for each Redis node. (Ex. 10.2.12.23)
- #VOLUMENAME# – Name of the persistent volume; give a name that identifies the volume on AWS EBS (Ex. <dcos cluster name>-redis-node-<node number>)
- #NODEVIP# – VIP for the Redis node. It must be ‘redis’ for the first Redis node; for others it can be the same as NODENAME (Ex. redis-node-2)
```
{
   "id": "/#NODENAME#",
   "backoffFactor": 1.15,
   "backoffSeconds": 1,
   "constraints": [
     [
       "hostname",
       "CLUSTER",
       "#NODEHOSTIP#"
     ]
   ],
   "container": {
     "type": "DOCKER",
     "volumes": [
       {
         "external": {
           "name": "#VOLUMENAME#",
           "provider": "dvdi",
           "options": {
             "dvdi/driver": "rexray"
           }
         },
         "mode": "RW",
         "containerPath": "/data"
       }
     ],
     "docker": {
       "image": "parvezkazi13/redis:latest",
       "forcePullImage": false,
       "privileged": false,
       "parameters": []
     }
   },
   "cpus": 0.5,
   "disk": 0,
   "fetch": [],
   "healthChecks": [],
   "instances": 1,
   "maxLaunchDelaySeconds": 3600,
   "mem": 4096,
   "gpus": 0,
   "networks": [
     {
       "mode": "host"
     }
   ],
   "portDefinitions": [
     {
       "labels": {
         "VIP_0": "/#NODEVIP#:6379"
       },
       "name": "#NODEVIP#",
       "protocol": "tcp",
       "port": 6379
     }
   ],
   "requirePorts": true,
   "upgradeStrategy": {
     "maximumOverCapacity": 0,
     "minimumHealthCapacity": 0.5
   },
   "killSelection": "YOUNGEST_FIRST",
   "unreachableStrategy": {
     "inactiveAfterSeconds": 300,
     "expungeAfterSeconds": 600
   }
 }
```
- After updating the highlighted fields, copy the above JSON to the JSON configuration box and click the ‘Review & Run’ button in the right corner; this will start the service with the above configuration.
- Once the service is up and running, repeat the steps above for each Redis node, with the respective values for the highlighted fields.
- So for a 6-node cluster, we will end up with 6 Redis nodes up and running.
Note: Since we are using an external volume for persistent storage, we cannot scale these services, i.e. each service can have at most one instance. Trying to scale a service results in an error.
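
Filling in the four placeholders by hand for six services is easy to get wrong, so the substitution can be scripted. The sketch below uses a trimmed copy of the JSON above (only the fields that carry placeholders, plus a few others) and a hypothetical naming scheme: nodes are called redis-node-N, volumes are prefixed with the cluster name, and the first node gets the lower-case VIP `redis`.

```python
import json

# Trimmed version of the service definition; not the full configuration.
TEMPLATE = """
{
  "id": "/#NODENAME#",
  "constraints": [["hostname", "CLUSTER", "#NODEHOSTIP#"]],
  "container": {
    "type": "DOCKER",
    "volumes": [{
      "external": {"name": "#VOLUMENAME#", "provider": "dvdi",
                   "options": {"dvdi/driver": "rexray"}},
      "mode": "RW",
      "containerPath": "/data"
    }]
  },
  "networks": [{"mode": "host"}],
  "portDefinitions": [{
    "labels": {"VIP_0": "/#NODEVIP#:6379"},
    "name": "#NODEVIP#", "protocol": "tcp", "port": 6379
  }]
}
"""


def render_nodes(cluster_name, host_ips):
    """Return one app definition (as a dict) per DC/OS node IP."""
    if len(set(host_ips)) != len(host_ips):
        raise ValueError("each Redis node needs its own DC/OS node IP")
    apps = []
    for number, ip in enumerate(host_ips, start=1):
        name = f"redis-node-{number}"
        values = {
            "#NODENAME#": name,
            "#NODEHOSTIP#": ip,
            "#VOLUMENAME#": f"{cluster_name}-redis-node-{number}",
            "#NODEVIP#": "redis" if number == 1 else name,
        }
        text = TEMPLATE
        for placeholder, value in values.items():
            text = text.replace(placeholder, value)
        apps.append(json.loads(text))
    return apps


apps = render_nodes("demo", ["10.0.1.90", "10.0.0.19", "10.0.9.203",
                             "10.0.9.79", "10.0.3.199", "10.0.9.104"])
print(json.dumps(apps[0], indent=2))
```

Each resulting dict can be written to a file and pasted into the JSON configuration box.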

2. Form the Redis cluster between Redis node services:
- To create or manage Redis-cluster, first deploy redis-cluster-util container on DCOS using below json config:
```
{
 "id": "/infrastructure/redis-cluster-util",
 "backoffFactor": 1.15,
 "backoffSeconds": 1,
 "constraints": [],
 "container": {
   "type": "DOCKER",
   "volumes": [
     {
       "containerPath": "/backup",
       "hostPath": "backups",
       "mode": "RW"
     }
   ],
   "docker": {
     "image": "parvezkazi13/redis-util",
     "forcePullImage": true,
     "privileged": false,
     "parameters": []
   }
 },
 "cpus": 0.25,
 "disk": 0,
 "fetch": [],
 "instances": 1,
 "maxLaunchDelaySeconds": 3600,
 "mem": 4096,
 "gpus": 0,
 "networks": [
   {
     "mode": "host"
   }
 ],
 "portDefinitions": [],
 "requirePorts": true,
 "upgradeStrategy": {
   "maximumOverCapacity": 0,
   "minimumHealthCapacity": 0.5
 },
 "killSelection": "YOUNGEST_FIRST",
 "unreachableStrategy": {
   "inactiveAfterSeconds": 300,
   "expungeAfterSeconds": 600
 },
 "healthChecks": []
}
```
This will run service as :
- Get the IP addresses of all Redis nodes to form the cluster, as Redis-cluster can not be created with node’s hostname / dns. This is an open issue.
Since we are using host network, we need the dcos node IP on which Redis nodes are running.

Get all Redis nodes IP using:
```
NODE_BASE_NAME=redis-node
dcos task $NODE_BASE_NAME | grep -E "$NODE_BASE_NAME-\<[0-9]\>" | awk '{print $2":6379"}' | paste -s -d' '
```
Here redis-node is the prefix used for all Redis nodes.

Note the output of this command, we will use it in further steps.
- Get the node where redis-cluster-util container is running and ssh to dcos node using:
```
dcos node ssh --master-proxy --private-ip $(dcos task | grep "redis-cluster-util" | awk '{print $2}')
```
- Now find the docker container id of redis-cluster-util and exec it using:
```
docker exec -it $(docker ps -qf ancestor="parvezkazi13/redis-util") bash  
```
- Now we are inside the redis-cluster-util container. Run the command below to form the Redis cluster.
```
redis-trib.rb create --replicas 1 <Space separated IP address:PORT pair of all Redis nodes>
```
- Here, use the Redis node IP addresses that were retrieved earlier.
```
redis-trib.rb create --replicas 1 10.0.1.90:6379 10.0.0.19:6379 10.0.9.203:6379 10.0.9.79:6379 10.0.3.199:6379 10.0.9.104:6379
```
- Parameters:
- The option --replicas 1 means that we want a slave for every master created.
- The other arguments are the list of addresses (host:port) of the instances we want to use to create the new cluster.
- Output:
- Select ‘yes’ when it prompts to set the slot configuration shown.
- Run the command below to check the status of the newly created cluster
```
redis-trib.rb check <Any redis node host:PORT>
# Example:
redis-trib.rb check 10.0.1.90:6379
```
- Parameters:
- host:port of any node from the cluster.
- Output:
- If all OK, it will show OK with status, else it will show ERR with the error message.
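
Assembling the create command from the collected addresses can be sketched as follows. The helper name is ours; the check mirrors the requirement stated earlier that a cluster needs at least 3 masters, which with one replica per master means at least 6 nodes.

```python
def build_create_command(nodes, replicas=1):
    """Build the redis-trib.rb create command for a list of host:port nodes."""
    masters = len(nodes) // (replicas + 1)
    if masters < 3:
        raise ValueError(
            f"{len(nodes)} nodes with {replicas} replica(s) per master "
            f"gives only {masters} master(s); at least 3 are required"
        )
    return f"redis-trib.rb create --replicas {replicas} " + " ".join(nodes)


ips = ["10.0.1.90", "10.0.0.19", "10.0.9.203",
       "10.0.9.79", "10.0.3.199", "10.0.9.104"]
print(build_create_command([ip + ":6379" for ip in ips]))
```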
3. Import existing dump.rdb to Redis cluster
- At this point, all the Redis nodes should be empty and each one should have an ID and some assigned slots:
Before reusing existing dump data, we have to reshard all slots to one instance. We specify the number of slots to move (all, so 16384), the ID of the node we move them to (here node 1, 10.0.1.90:6379) and where we take these slots from (all other nodes).
```
redis-trib.rb reshard 10.0.1.90:6379  
```
Parameters:

host:port of any node from the cluster.

Output:

It will prompt for number of slots to move – here all. i.e 16384

Receiving node id – here id of node 10.0.1.90:6379 (redis-node-1)

Source node IDs – here all, as we want to shard all slots to one node.

And prompt to proceed – press ‘yes’
- Now check again node 10.0.1.90:6379
```
redis-trib.rb check 10.0.1.90:6379  
```
Parameters: host:port of any node from the cluster.

Output: it will show all (16384) slots moved to node 10.0.1.90:6379
- Next step is Importing our existing Redis dump data.
Now copy the existing dump.rdb to our redis-cluster-util container using below steps:

– Copy existing dump.rdb to dcos node on which redis-cluster-util container is running. Can use scp from any other public server to dcos node.

– Now that we have dump.rdb on our DCOS node, copy it to the redis-cluster-util container using the command below:
```
docker cp dump.rdb $(docker ps -qf ancestor="parvezkazi13/redis-util"):/data
```
Now we have dump.rdb in our redis-cluster-util container, we can import it to our Redis cluster. Execute and go to the redis-cluster-util container using:
```
docker exec -it $(docker ps -qf ancestor="parvezkazi13/redis-util") bash
```
This opens a bash shell inside the redis-cluster-util container, which is already running.

Run below command to import dump.rdb to Redis cluster:
```
rdb --command protocol /data/dump.rdb | redis-cli --pipe -h 10.0.1.90 -p 6379
```
Parameters:

Path to dump.rdb

host:port of any node from the cluster.

Output:

If successful, you’ll see something like:
```
All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 0, replies: 4341259
```
as well as this in the Redis server logs:
```
95086:M 01 Mar 21:53:42.071 * 10000 changes in 60 seconds. Saving...
95086:M 01 Mar 21:53:42.072 * Background saving started by pid 98223
98223:C 01 Mar 21:53:44.277 * DB saved on disk
```
WARNING:
Just as an Oracle DB instance can have multiple databases, Redis saves keys in keyspaces.
When Redis is in cluster mode, it does not accept dumps that contain more than one keyspace. As per the documentation:

“Redis Cluster does not support multiple databases like the stand alone version of Redis. There is just database 0 and the SELECT command is not allowed.”

So when such a multi-keyspace Redis dump is imported, the server fails on startup with the error below:
```
23049:M 16 Mar 17:21:17.772 * DB loaded from disk: 5.222 seconds
23049:M 16 Mar 17:21:17.772 # You can't have keys in a DB different than DB 0 when in Cluster mode. Exiting.
```
Solution / workaround:

There is a Redis command, MOVE, that moves a key from one keyspace to another.

We can also run the command below to move all keys from keyspace 1 to keyspace 0:
```
redis-cli -h "$HOST" -p "$PORT" -n 1 --raw keys "*" |  xargs -I{} redis-cli -h "$HOST" -p "$PORT" -n 1 move {} 0
```
- Verify import status, using below commands : (inside redis-cluster-util container)
```
redis-cli -h 10.0.1.90 -p 6379 info keyspace
```
It will run Redis info command on node 10.0.1.90:6379 and fetch keyspace information, like below:

# Keyspace
db0:keys=33283,expires=0,avg_ttl=0
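
The same output can be used to check whether a standalone instance still holds keys outside database 0 before its dump is imported. A small parsing sketch, assuming the text has been captured from `redis-cli info keyspace` and that all field values are integers as in the sample above:

```python
def parse_keyspace(info_text):
    """Parse 'info keyspace' output into {db_number: {field: int}}."""
    dbs = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line.startswith("db"):
            continue  # skips the '# Keyspace' header and blank lines
        name, _, fields = line.partition(":")
        stats = {}
        for field in fields.split(","):
            key, _, value = field.partition("=")
            stats[key] = int(value)
        dbs[int(name[2:])] = stats
    return dbs


sample = "# Keyspace\ndb0:keys=33283,expires=0,avg_ttl=0\n"
dbs = parse_keyspace(sample)
print(dbs)
print("databases other than 0:", sorted(db for db in dbs if db != 0))
```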
- Now reshard all the slots to all instances evenly
The reshard command will again list the existing nodes, their IDs and the assigned slots.
```
redis-trib.rb reshard 10.0.1.90:6379
```
Parameters:

host:port of any node from the cluster.

Output:

It will prompt for number of slots to move – here (16384 /3 Masters = 5461)

Receiving node id – here id of master node 2

Source node IDs – id of first instance which has currently all the slots. (master 1)

And prompt to proceed – press ‘yes’

Repeat above step and for receiving node id, give id of master node 3.
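
The slot counts for these two moves follow from integer division. A short worked check, assuming all 16384 slots start on master 1 and 3 masters in total:

```python
TOTAL_SLOTS = 16384


def reshard_plan(masters):
    """Return (slots moved to each other master, slots left on master 1)."""
    per_move = TOTAL_SLOTS // masters
    remaining = TOTAL_SLOTS - per_move * (masters - 1)
    return per_move, remaining


per_move, remaining = reshard_plan(3)
print(per_move, remaining)  # 5461 5462
```

Since 16384 is not divisible by 3, master 1 keeps one slot more than the other two.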
- After the above step, all 3 masters will have equal slots and imported keys will be distributed among the master nodes.
- Put keys to cluster for verification
```
redis-cli -h 10.0.1.90 -p 6379 set foo bar
OK
redis-cli -h 10.0.1.90 -p 6379 set foo bar
(error) MOVED 4813 10.0.9.203:6379
```
The MOVED error shows that the slot for this key is served by instance 10.0.9.203:6379, so the client has to send the request there. To follow redirections automatically, use the “-c” flag, which enables cluster mode in redis-cli:
```
redis-cli -h 10.0.1.90 -p 6379 -c set foo bar
OK
```
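
A cluster-aware client does the same thing as the -c flag: it reads the slot and the target address from the MOVED reply and retries there. Parsing the reply can be sketched like this (the helper name is ours, and the input is the error line shown above):

```python
def parse_moved(error):
    """Split a MOVED reply into (slot, host, port)."""
    parts = error.replace("(error)", "").split()
    if len(parts) != 3 or parts[0] != "MOVED":
        raise ValueError(f"not a MOVED reply: {error!r}")
    host, _, port = parts[2].rpartition(":")
    return int(parts[1]), host, int(port)


slot, host, port = parse_moved("(error) MOVED 4813 10.0.9.203:6379")
print(f"retry on {host}:{port} (slot {slot})")
```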
Redis Entrypoint

The application entrypoint for the Redis cluster mostly depends on how your Redis client handles cluster support. Generally, connecting to one of the master nodes should do the work.

Use below host:port in your applications :

redis.marathon.l4lb.thisdcos.directory:6379

Automation of Redis Cluster Creation

We have an automation script in place that deploys 6 Redis nodes and forms a cluster between them.

Script location: Github
- It deploys 6 Marathon apps for 6 Redis nodes. Each Redis node is deployed on a different DCOS node, with CLUSTER_NAME as the prefix of the volume name.
- Once all nodes are up and running, it deploys the redis-cluster-util app, which will be used to form the Redis cluster.
- Then it prints the Redis nodes and their IP addresses and prompts the user to proceed with cluster creation.
- If the user chooses to proceed, it runs the redis-cluster-util app and creates the cluster using the collected IP addresses. The util container will prompt for some input that the user has to provide.
Conclusion

We learned about Redis cluster deployment on DCOS with persistent storage using RexRay. We also learned how RexRay automatically manages volumes over AWS EBS and how to integrate them in DCOS apps/services. We saw how to use the redis-cluster-util container to manage the Redis cluster for different purposes, such as forming the cluster, resharding and importing existing dump.rdb data. Finally, we looked at automating the whole cluster setup using the DCOS CLI and bash.

December 12, 2022

Installing Redis Service in DC/OS With Persistent Storage Using AWS Volumes

Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker.

It supports various data structures such as Strings, Hashes, Lists, Sets etc. DCOS offers Redis as a service.

Why Do We Use External Persistent Storage for Redis Mesos Containers?

Since Redis is an in-memory database, an instance/service restart will result in loss of data. To counter this, it is always advisable to snapshot the Redis in-memory database from time to time.

This lets a Redis instance recover its data as of the latest snapshot after a failure.

In DCOS, Redis is deployed as a stateless service. To make it stateful and persist its data, we can configure local volumes or external volumes.

The disadvantage of having a local volume mapped to Mesos containers is that when a slave node goes down, the local volume becomes unavailable and data loss occurs.

However, external persistent volumes are available on each node of the DCOS cluster, so a slave node failure does not impact data availability.

Rex-Ray

REX-Ray is an open source, storage management solution designed to support container runtimes such as Docker and Mesos.

REX-Ray enables stateful applications such as databases to persist and maintain their data after the life cycle of the container has ended. Built-in high availability enables orchestrators such as Docker Swarm, Kubernetes, and Mesos frameworks like Marathon to automatically orchestrate storage tasks between hosts in a cluster.

Built on top of the libStorage framework, REX-Ray’s simplified architecture consists of a single binary and runs as a stateless service on every host using a configuration file to orchestrate multiple storage platforms.

Objective: To create a Redis service in DC/OS environment with persistent storage.

Warning: The Persistent Volume feature is still in beta Phase for DC/OS Version 1.11.

Prerequisites:

Make sure the rexray service is running and is in a healthy state for the cluster.

Steps:

Click on the Add button in Services component of DC/OS GUI.

Click on JSON Configuration.

Note: For persistent storage, below code should be added in the normal Redis service configuration JSON file to mount external persistent volumes.

"volumes": [
      {
        "containerPath": "/data",
        "mode": "RW",
        "external": {
          "name": "redis4volume",
          "provider": "dvdi",
          "options": {
            "dvdi/driver": "rexray"
          }
        }
      }
    ],
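
When the service definition is kept as a file or generated by a script, the same block can be added programmatically. The helper below is a sketch with a name of our own choosing; it copies the app definition and appends the external volume entry shown above, leaving the original untouched.

```python
import copy
import json


def with_external_volume(app, volume_name, container_path="/data"):
    """Return a copy of a Marathon app dict with a RexRay external volume added."""
    updated = copy.deepcopy(app)
    volumes = updated.setdefault("container", {}).setdefault("volumes", [])
    volumes.append({
        "containerPath": container_path,
        "mode": "RW",
        "external": {
            "name": volume_name,
            "provider": "dvdi",
            "options": {"dvdi/driver": "rexray"},
        },
    })
    return updated


# Minimal, hypothetical starting point for a Redis service definition.
base = {"id": "/redis", "container": {"type": "DOCKER"}}
print(json.dumps(with_external_volume(base, "redis4volume"), indent=2))
```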

Make sure the service is up and in a running state:

If you look closely, the service was suspended and respawned on a different slave node. We populated the database with dummy data and saved the snapshot in the data directory.

When the service came up on a different node, 10.0.3.204, the data persisted and the volume was visible on the new node.

core@ip-10-0-3-204 ~ $ /opt/mesosphere/bin/rexray volume list
- name: datavolume
  volumeid: vol-00aacade602cf960c
  availabilityzone: us-east-1a
  status: in-use
  volumetype: standard
  iops: 0
  size: "16"
  networkname: ""
  attachments:
  - volumeid: vol-00aacade602cf960c
    instanceid: i-0d7cad91b62ec9a64
    devicename: /dev/xvdb

Check the volume tab :

Note: For external volumes, the status will be unavailable. This is an issue with DC/OS.

The Entire Service JSON file:

{
  "id": "/redis4.0-new-failover-test",
  "instances": 1,
  "cpus": 1.001,
  "mem": 2,
  "disk": 0,
  "gpus": 0,
  "backoffSeconds": 1,
  "backoffFactor": 1.15,
  "maxLaunchDelaySeconds": 3600,
  "container": {
    "type": "DOCKER",
    "volumes": [
      {
        "containerPath": "/data",
        "mode": "RW",
        "external": {
          "name": "redis4volume",
          "provider": "dvdi",
          "options": {
            "dvdi/driver": "rexray"
          }
        }
      }
    ],
    "docker": {
      "image": "redis:4",
      "network": "BRIDGE",
      "portMappings": [
        {
          "containerPort": 6379,
          "hostPort": 0,
          "servicePort": 10101,
          "protocol": "tcp",
          "name": "default",
          "labels": {
            "VIP_0": "/redis4.0:6379"
          }
        }
      ],
      "privileged": false,
      "forcePullImage": false
    }
  },
  "healthChecks": [
    {
      "gracePeriodSeconds": 60,
      "intervalSeconds": 5,
      "timeoutSeconds": 5,
      "maxConsecutiveFailures": 3,
      "portIndex": 0,
      "protocol": "TCP"
    }
  ],
  "upgradeStrategy": {
    "minimumHealthCapacity": 0.5,
    "maximumOverCapacity": 0
  },
  "unreachableStrategy": {
    "inactiveAfterSeconds": 300,
    "expungeAfterSeconds": 600
  },
  "killSelection": "YOUNGEST_FIRST",
  "requirePorts": true
}

Redis entrypoint

To connect with Redis service, use below host:port in your applications:

redis.marathon.l4lb.thisdcos.directory:6379

Conclusion

We learned about standalone Redis service deployment from the DCOS catalog. We also saw how to add persistent storage to it using RexRay, how RexRay automatically manages volumes over AWS EBS, and how to integrate them in DCOS apps/services. Finally, we saw how other applications can communicate with this Redis service.

December 12, 2022
