eBay Tech Blog

Delivering eBay’s CI Solution with Apache Mesos – Part II

by The eBay PaaS Team on 05/12/2014

in Cloud, Data Infrastructure and Services, Software Engineering

In part I of this post we laid out in detail how to run a large Jenkins CI farm in Mesos. In this post we explore running the builds inside Docker containers and more:

  • Explain the motivation for using Docker containers for builds.
  • Show how to handle the case where the build itself is a Docker build.
  • Peek into how the Mesos 0.19 release is going to change Docker integration.
  • Walk through a Vagrant all-in-one-box setup so you can try things out.

Overview

Jenkins follows the master-slave model and is capable of launching tasks as remote Java processes on Mesos slave machines. Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks. We can leverage the capabilities of Jenkins and Mesos to run a Jenkins slave process within a Docker container using Mesos as the resource manager.

Why use Docker containers?

This page gives a good picture of what Docker is all about.

At eBay Inc., we have several different build clusters. They are partitioned primarily due to a number of factors: requirements to run different OS flavors (mostly RHEL and Ubuntu), software version conflicts, associated application dependencies, and special hardware. When using Mesos, we try to operate a single cluster with heterogeneous workloads instead of maintaining specialized clusters. Docker provides a good solution for isolating the different dependencies inside the container, irrespective of the host setup where the Mesos slave is running, thereby helping us operate on a single cluster. Special hardware requirements can always be handled through slave attributes, which the Jenkins plugin already supports. Overall, this scheme helps maintain consistent host images in the cluster and avoids introducing a wide combination of Mesos slave host flavors, while still handling all the varied build dependencies within a container.
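For instance, a pool of slaves with special hardware can advertise that fact at startup, and build jobs can then be pinned to it through the plugin's slave-attribute matching. A minimal sketch, assuming hypothetical attribute names and our Vagrant setup's ZooKeeper address:

# Start a Mesos slave that advertises custom attributes
# (the attribute names/values here are illustrative, not our exact ones)
mesos-slave --master=zk://192.168.56.101:2181/mesos \
            --attributes="os:rhel6;hw:gpu"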

Now, why support a Docker-in-Docker setup?

When we started experimenting with running the builds in Docker containers, some of our teammates were working on enabling Docker images for applications. They posed the question: how do we support Docker build and push/pull operations within the Docker container used for the build? Valid point! So, we will explore two ways of handling this challenge. Many thanks to Jérôme Petazzoni from the Docker team for his guidance.

Environment setup

A Vagrant development VM setup demonstrates CI using Docker containers. This VM can be used for testing other frameworks like Chronos and Aurora; however, we will focus on the CI use of it with Marathon. The screenshots shown below have been taken from the Vagrant development environment setup, which runs a cluster of three Mesos masters, three Mesos slave instances, and one Marathon instance. (Marathon is a Mesos framework for long-running services. It provides a REST API for starting, stopping, and scaling services.)

192.168.56.101 mesos1 marathon1
192.168.56.102 mesos2
192.168.56.103 mesos3

Running Jenkins slaves inside Mesos Docker containers requires the following ecosystem:

  1. Jenkins master server with the Mesos scheduler plugin installed (used for building Docker containers via CI jobs).
  2. Apache Mesos master server with at least one slave server.
  3. Mesos Docker Executor installed on all Mesos slave servers. Mesos slaves delegate execution of tasks within Docker containers to the Docker executor. (Note that integration with Docker changes with the Mesos 0.19 release, as explained in the miscellaneous section at the end of this post.)
  4. Docker installed on all slave servers (to automate the deployment of any application as a lightweight, portable, self-sufficient container that will run virtually anywhere).
  5. Docker build container image in the Docker registry.
  6. Marathon framework.

1. Creating the Jenkins master instance

We first needed to launch a standalone Jenkins master instance in Mesos via the Marathon framework. We placed Jenkins plugins in the plugins directory and included a default config.xml file with pre-configured settings. Jenkins was then launched by executing the jenkins.war file. Here is the directory structure that we used for launching the Jenkins master:

.
├── README.md
├── config.xml
├── hudson.model.UpdateCenter.xml
├── jenkins.war
├── jobs
├── nodeMonitors.xml
├── plugins
│   ├── mesos.hpi
│   └── saferestart.jpi
└── userContent
    └── readme.txt

3 directories, 8 files

2. Launching the Jenkins master instance

Marathon launched the Jenkins master instance using the following command, also shown in the Marathon UI screenshots below. We zipped our Jenkins files and had Marathon download them for the job via the URIs field in the UI; however, for demonstration purposes, below we show using a Git repository to achieve the same goal.

git clone https://github.com/ahunnargikar/jenkins-standalone && cd jenkins-standalone;
export JENKINS_HOME=$(pwd);
java -jar jenkins.war \
  --webroot=war \
  --httpPort=$PORT0 \
  --ajp13Port=-1 \
  --httpListenAddress=0.0.0.0 \
  --ajp13ListenAddress=127.0.0.1 \
  --preferredClassLoader=java.net.URLClassLoader \
  --logfile=../jenkins.log
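Equivalently, the same app definition could be posted straight to Marathon's REST API. Here is a hedged sketch, assuming a Marathon build that exposes the v2 endpoints on port 8080; the app id and resource sizes are illustrative, not our exact values:

# Submit the Jenkins master as a long-running Marathon app (sketch)
curl -X POST http://192.168.56.101:8080/v2/apps \
  -H "Content-Type: application/json" \
  -d '{
        "id": "jenkins-master",
        "cmd": "git clone https://github.com/ahunnargikar/jenkins-standalone && cd jenkins-standalone; export JENKINS_HOME=$(pwd); java -jar jenkins.war --webroot=war --httpPort=$PORT0 --ajp13Port=-1 --httpListenAddress=0.0.0.0 --ajp13ListenAddress=127.0.0.1 --preferredClassLoader=java.net.URLClassLoader --logfile=../jenkins.log",
        "cpus": 1,
        "mem": 1024,
        "instances": 1
      }'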

[Screenshots: configuring and launching the Jenkins master app in the Marathon UI]

3. Launching Jenkins slaves using the Mesos Docker executor

[Diagram: Jenkins slaves launched as Docker containers on Mesos slaves]

Here’s a sample supervisord startup configuration for a Docker image capable of executing Jenkins slave jobs:

[supervisord]
nodaemon=true

[program:java-jenkins-slave]
command=/bin/bash -c "eval $JENKINS_COMMAND"

As you can see, Jenkins passed its slave launch command as an environment variable to the Docker container. The container then initialized the Jenkins slave process, which fulfilled the basic requirement for kicking off the Jenkins slave job.

This configuration was sufficient to launch regular builds within the Docker container of choice. Now let’s walk through the two options that we explored to run Docker operations for a CI build inside a Docker container. Strategy #1 required use of supervisord to control the Docker daemon process. For the default case (regular non-Docker builds) and strategy #2, supervisord was not required; one could simply pass the command directly to the Docker container.
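In those cases the container can simply evaluate the slave launch command itself. A minimal sketch, assuming a hypothetical build image name:

# No supervisord required: evaluate the Jenkins slave command directly (sketch)
docker run -e JENKINS_COMMAND="$JENKINS_COMMAND" my-build-image \
  /bin/bash -c 'eval $JENKINS_COMMAND'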

3.1 Strategy #1 – Using an individual Docker-in-Docker (dind) setup on each Mesos slave

This strategy, inspired by this blog, involved running a dedicated Docker daemon inside the Docker container. The advantage of this approach was that we didn't have a single Docker daemon handling a large number of container builds. On the flip side, each container absorbed the I/O overhead of downloading and duplicating all the AUFS file system layers.

[Diagram: per-container Docker daemons, each downloading and duplicating the AUFS layers]

The Docker-in-Docker container had to be launched in privileged mode (by including the “-privileged” option in the Mesos Docker executor code); otherwise, nested Docker containers wouldn’t work. Using this strategy, we ended up having two Docker executors:  one for launching Docker containers in non-privileged mode (/var/lib/mesos/executors/docker) and the other for launching Docker-in-Docker containers in privileged mode (/var/lib/mesos/executors/docker2). The supervisord process manager configuration was updated to run the Docker daemon process in addition to the Jenkins slave job process.

[program:docker] 
command=/usr/local/bin/wrapdocker 

The following Docker-in-Docker image is provided for demonstration purposes, for testing out the multi-Docker setup:

ahunnargikar/jenkins-dind/multiple-docker
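To give a flavor of what goes into such an image, here is a hedged Dockerfile sketch along the lines of Jérôme Petazzoni's dind setup; the base image and package list are illustrative of the Docker 0.9 era, not the exact contents of the image above:

# Illustrative Docker-in-Docker build image (not the exact image above)
FROM ubuntu:12.04

# Assumes the Docker apt repository has already been configured in the image
RUN apt-get update && \
    apt-get install -y lxc-docker supervisor wget openjdk-7-jre-headless

# wrapdocker (from the dind project) prepares cgroups and starts the inner daemon
ADD wrapdocker /usr/local/bin/wrapdocker
RUN chmod +x /usr/local/bin/wrapdocker

# supervisord manages both the inner Docker daemon and the Jenkins slave process
ADD supervisord.conf /etc/supervisor/conf.d/supervisord.conf
CMD ["/usr/bin/supervisord"]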

In real life, the actual build container image would capture the build dependencies and base image flavor, in addition to the contents of the above dind image. The actual command that the Docker executor ran looked similar to this one:

docker run \
  -cidfile /tmp/docker_cid.6c6bba3db72b7483 \
  -privileged \
  -c 51 -m 302365697638 \
  -e JENKINS_COMMAND="wget -O slave.jar http://192.168.56.101:9000/jnlpJars/slave.jar && java -DHUDSON_HOME=jenkins -server -Xmx256m -Xms16m -XX:+UseConcMarkSweepGC -Djava.net.preferIPv4Stack=true -jar slave.jar -jnlpUrl http://192.168.56.101:9000/computer/mesos-jenkins-beb3a8ae-3de7-4117-8c4e-efe50b37fbb4/slave-agent.jnlp" \
  hashish/jenkins-dind

3.2 Strategy #2 – Using a shared Docker Setup on each Mesos slave

All of the Jenkins slaves running on a Mesos slave host could simply use a single Docker daemon for running their Docker containers, which is the standard default setup. This approach eliminated the redundant network and disk I/O involved in downloading the AUFS file system layers. For example, all Java application projects could now reuse the same AUFS file system layers containing the JDK, Tomcat, and other static Linux package dependencies. We lost isolation as far as the Docker daemon was concerned, but we gained a massive reduction in I/O and were able to leverage caching of build layers. This was the optimal strategy for our use case.

[Diagram: a single shared Docker daemon per Mesos slave, with a shared AUFS layer cache]

The Docker container mounted the host’s /var/run/docker.sock file descriptor as a shared volume so that its native Docker binary, located at /usr/local/bin/docker, could now communicate with the host server’s Docker daemon. So all Docker commands were now directly being executed by the host server’s Docker daemon. This eliminated the need for running individual Docker daemon processes on the Docker containers that were running on a Mesos slave server.
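A quick way to verify this wiring is to mount the socket and list the host daemon's containers from inside a container. A hedged example, assuming an image whose Docker client binary is on the PATH and whose command can be overridden:

# The container's docker CLI talks to the host daemon via the shared socket
docker run -v /var/run/docker.sock:/var/run/docker.sock \
  hashish/jenkins-dind-single docker ps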

The following Docker image is provided for demonstration purposes for a shared Docker setup. The actual build Docker container image of choice essentially just needed to execute the Docker binary via its CLI. We could even have mounted the Docker binary from the host server itself to the same end.

ahunnargikar/jenkins-dind/single-docker

The actual command that the Docker executor ran looked similar to this:

docker run \
  -cidfile /tmp/docker_cid.6c6bba3db72b7483 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -c 51 -m 302365697638 \
  -e JENKINS_COMMAND="wget -O slave.jar http://192.168.56.101:9000/jnlpJars/slave.jar && java -DHUDSON_HOME=jenkins -server -Xmx256m -Xms16m -XX:+UseConcMarkSweepGC -Djava.net.preferIPv4Stack=true -jar slave.jar -jnlpUrl http://192.168.56.101:9000/computer/mesos-jenkins-beb3a8ae-3de7-4117-8c4e-efe50b37fbb4/slave-agent.jnlp" \
  hashish/jenkins-dind-single

4. Specifying the cloud configuration for the Jenkins master

We then needed to configure the Jenkins master so that it would connect to the Mesos master server and start receiving resource offers, after which it could begin launching tasks on Mesos. The following screenshots illustrate how we configured the Jenkins master via its web administration UI.

[Screenshots: Mesos cloud configuration in the Jenkins administration UI]

Note: The Docker-specific configuration options above are not available in the stable release of the Mesos plugin. Major changes are underway in the upcoming Mesos 0.19.0 release, which will introduce the pluggable containerizer functionality. We decided to wait for 0.19.0 to be released before making a pull request for this feature. Instead, a modified .hpi plugin file was created from this Jenkins Mesos plugin branch and has been included in the Vagrant dev setup.

[Screenshots: Docker-specific options in the modified Mesos plugin configuration]

5. Creating the Jenkins Mesos Docker job

Now that the Jenkins scheduler had registered as a framework in Mesos, it started receiving resource offers from the Mesos master. The next step was to create a Jenkins job that would be launched on a Mesos slave whose resource offer satisfied the cloud configuration requirements.

5.1 Creating a Docker Tomcat 7 application container image

Jenkins first needed a Docker container base image that packaged the application code and dependencies as well as a web server. For demonstration purposes, here's a sample Docker Tomcat 7 image created from this GitHub repository:

hashish/tomcat7
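A Dockerfile for such an image might look roughly like the following sketch; the base image, package names, and paths are illustrative assumptions, not the exact contents of the repository above:

# Illustrative Tomcat 7 application image (not the exact hashish/tomcat7 contents)
FROM ubuntu:12.04

RUN apt-get update && \
    apt-get install -y openjdk-7-jre-headless tomcat7

ENV CATALINA_HOME /usr/share/tomcat7
ENV CATALINA_BASE /var/lib/tomcat7
RUN mkdir -p /var/lib/tomcat7/temp

# Bake the application war into the image at build time
ADD target/myapp.war /var/lib/tomcat7/webapps/ROOT.war

EXPOSE 8080
CMD ["/usr/share/tomcat7/bin/catalina.sh", "run"]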

Every application's Git repository would be expected to have its own Dockerfile, with whatever combination of Java/PHP/Node.js it needs pre-installed in a base container. In the case of our Java apps, we simply built the .war file using Maven and then inserted it into the Docker image at build time. The Docker image was then tagged with the application name, version, and timestamp, and uploaded into our private Docker registry.
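Concretely, the job's build steps boiled down to something like this sketch; the registry host, application name, and tag format are hypothetical stand-ins:

# Build the war, bake it into an image, tag with name/version/timestamp, push
mvn clean package
docker build -t registry.example.com/myapp:1.0.0-20140512 .
docker push registry.example.com/myapp:1.0.0-20140512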

5.2 Running a Jenkins Docker job

For demonstration purposes, the following example assumes that we are building a basic Java web application.

[Screenshots: creating and running the Jenkins Mesos Docker job]

Once Jenkins built and uploaded the new application’s Docker image containing the war, dependencies, and other packages, this Docker image was launched in Mesos and scaled up or down to as many instances as required via the Marathon APIs.
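Scaling via Marathon's REST API looks roughly like the following hedged sketch; the app id is illustrative, and a v2-capable Marathon on port 8080 is assumed:

# Scale the deployed application to four instances through the Marathon v2 API
curl -X PUT http://192.168.56.101:8080/v2/apps/myapp \
  -H "Content-Type: application/json" \
  -d '{"instances": 4}'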

Miscellaneous points

Our Docker integration with Mesos will soon be outdated by the 0.19 release; our setup was built against Mesos 0.17 and Docker 0.9. You can read about the Mesos pluggable containerizer feature in this blog and in this ticket. The Mesosphere team is also working on the deimos project to integrate Docker with the external containerization approach. There is an old pull request against the Mesos Jenkins plugin to integrate containerization once it's released. We will update our setup accordingly when this feature is rolled out. As a disclaimer, the Docker integration described in this post hasn't been tested at scale yet; we will do our due diligence once Mesos 0.19 and deimos are out.

For differing build dependencies, you can define a separate build label for each. A merged PR already allows slave attributes to be specified per label; hence, a Docker container image of choice can be configured per build label.

Conclusion

This concludes the description of our journey: an overview of how we ran a distributed CI solution on top of Mesos, utilizing resources efficiently and isolating build dependencies through Docker.

Comments

Daniel May 18, 2014 at 5:41AM

Linking /var/run/docker.sock into the container seems an interesting approach. I have experimented with this setup as well, but now I struggle with the container: it cannot be removed (“device or resource busy” when removing the root file system of the container) or started again (“Error getting container … from driver aufs”). Did you face the same problem?


The eBay PaaS Team May 21, 2014 at 12:05PM

Daniel, thanks for bringing this point up. In the version of the mesos-docker executor from Mesosphere that we were using for our testing, cleanup of the Mesos task only did a docker stop operation, which succeeded even though the unmount failed.
In the latest version of the mesos-docker executor,
https://github.com/mesosphere/mesos-docker/blob/master/bin/mesos-docker
cleanup_container() does a “docker rm”, which would hang, at least in Docker 0.9, because of the mounted docker.sock. However, the good news is that the issue is fixed in the latest Docker 0.11 release. It was also tracked in this resolved issue:
https://github.com/dotcloud/docker/issues/4844

We will also update the Vagrant setup mentioned in the article to use the latest Docker executor and Docker 0.11, so that the benefits of removing the container are achieved.
The latest Docker documentation also endorses this approach:
http://docs.docker.io/reference/commandline/cli/#run
(To quote: ‘By bind-mounting the docker unix socket and statically linked docker binary (such as that provided by https://get.docker.io), you give the container the full access to create and manipulate the host’s docker daemon.’)


Daniel May 27, 2014 at 3:28AM

Thanks for the answer and insights!


Adam Spektor June 1, 2014 at 7:16AM

Hi
Great article.
I have several questions. In case the parent Docker daemon executes all the commands, we will still have a problem with concurrent execution of Tomcat (for example); I'm talking about port collisions. I thought all containers would be isolated in the child Docker daemon, so I could execute several Tomcats inside different containers without worrying about random ports. Also, when I stop the internal Docker daemon, I can still see the containers that were executed inside it, so I have to take care of tear-down.

In case all this is true (I hope it's not :)), what are the benefits of using Docker inside Docker?

Thanks.

Adam


The eBay PaaS Team June 2, 2014 at 3:24PM

Adam,

Glad you asked! This blog post is in the context of CI builds and not about running application Docker containers on Mesos. To simply run multiple docker containers on Mesos, you do not need to apply any of the strategies that we talked about.

To answer your question about port assignments & collisions first – You can rely on the Marathon framework to run Docker containers in a Mesos cluster. In general Mesos has the ability to dynamically assign host-level ports which the executor (mesos-docker in this case) maps to the static ports defined in the app Dockerfile. This takes care of the port collision problem.

eBay has a polyglot platform running Java, C++, Node.js, Python, and Scala applications. Running CI builds as plain Mesos tasks would require us to install all the dependencies on the Mesos slave host server. Imagine having to install the latest JDK or Python updates on thousands of production Mesos slave nodes… painful indeed! Downloading and installing these dependencies during build time in every CI job is equally painful due to the I/O overhead. So relying on Docker is a necessity, since all these dependencies are now isolated within the cached Docker container layers used to run the Jenkins job.

Now, Docker-in-Docker wouldn’t be required if the final deliverable of the Jenkins job was simply a war or zip file (a popular use case). A single Docker build container isolating the different dependencies can achieve that. Our final build deliverable in this case is a standalone Docker image and in our example the build job running inside the Docker container had to run docker operations like build and push. The outer Docker container handles caching of the build job CI dependencies and the inner Docker installation handles the build & push operation of the app Docker image. By using strategy #2 as described in the above post we’re simply relaying the Docker build & push commands to the Mesos slave host’s Docker daemon.

Finally, about the container tear-down and cleanup: I believe that if you used strategy #1, you would have to clean up the AUFS layers used by the nested Docker container manually, and that's why we preferred strategy #2. Using the Docker “rm” option should clean up the remnant container. Please refer to http://docs.docker.io/reference/commandline/cli/

Hope this answers your questions.

Thanks!


Chong Chen June 10, 2014 at 10:16AM

Very interesting article. I do have a technical question regarding the Mesos Docker executor and the Jenkins slave.

In the Mesos world, once a framework gets an offer, it launches tasks to use the offered resources. In the Jenkins context, what does each task represent here? A build item, or just a Jenkins slave daemon that connects back to the master to fetch build items? In particular, what does this task mean in your picture?

mesos-jenkins-be65c8fa-d409-4743-a3fa-c8679808c7cc

And when do you terminate the Jenkins Docker container? Do you terminate it once each build item completes?


The eBay PaaS Team June 11, 2014 at 11:16AM

A “Task” is a generic term for anything launched in Mesos. For example, if you run a bash script in Mesos using the default command executor, that is a task; if you run a Docker container in Mesos using the Docker executor, that too is a task. The Jenkins master instance, itself acting as a Mesos framework, launches its build job/task, and a unique id is assigned to it, in this case “mesos-jenkins-be65c8fa-d409-4743-a3fa-c8679808c7cc”. This task is actually a Docker container inside which the Jenkins slave agent is initialized using supervisord. The slave agent then proceeds to download the job build scripts from the Jenkins master and executes them.

Jenkins has an “Idle Termination Minutes” field in its configuration (shown in one of the screenshots above) which controls how long the slave job is kept around after its build has completed. Developers can set the timeout so that the Docker container is terminated by the Jenkins master via the Docker executor immediately after the build job completes, or kept around for several hours for reuse in a subsequent build. Either approach is fine, depending on how you want to manage your Mesos cluster resources.

Thanks!


Julien Eid June 16, 2014 at 4:01PM

Hey! I’m trying to use https://github.com/ahunnargikar/mesos-plugin which is located in https://github.com/ahunnargikar/vagrant-mesos and I’m finding that I get this error with Docker enabled.

INFO: Received offers 1
Jun 16, 2014 10:54:06 PM org.jenkinsci.plugins.mesos.JenkinsScheduler resourceOffers
INFO: Received offers 1
Jun 16, 2014 10:54:09 PM org.jenkinsci.plugins.mesos.MesosCloud provision
INFO: Provisioning Jenkins Slave on Mesos with 1 executors. Remaining excess workload: 0 executors)
Jun 16, 2014 10:54:09 PM hudson.slaves.NodeProvisioner update
INFO: Started provisioning MesosCloud from MesosCloud with 1 executors. Remaining excess workload:0.0
Jun 16, 2014 10:54:09 PM org.jenkinsci.plugins.mesos.MesosComputerLauncher
INFO: Constructing MesosComputerLauncher
Jun 16, 2014 10:54:09 PM org.jenkinsci.plugins.mesos.MesosSlave
INFO: Constructing Mesos slave
Jun 16, 2014 10:54:12 PM org.jenkinsci.plugins.mesos.JenkinsScheduler resourceOffers
INFO: Received offers 1
Jun 16, 2014 10:54:17 PM org.jenkinsci.plugins.mesos.JenkinsScheduler resourceOffers
INFO: Received offers 1
Jun 16, 2014 10:54:19 PM org.jenkinsci.plugins.mesos.MesosComputerLauncher launch
INFO: Launching slave computer mesos-jenkins-bc416717-768b-4e8b-a7cd-3bab75ae0db4
Jun 16, 2014 10:54:19 PM org.jenkinsci.plugins.mesos.MesosComputerLauncher launch
INFO: Sending a request to start jenkins slave mesos-jenkins-bc416717-768b-4e8b-a7cd-3bab75ae0db4
Jun 16, 2014 10:54:19 PM org.jenkinsci.plugins.mesos.JenkinsScheduler requestJenkinsSlave
INFO: Enqueuing jenkins slave request
Jun 16, 2014 10:54:19 PM hudson.slaves.NodeProvisioner update
INFO: MesosCloud provisioning successfully completed. We have now 2 computer(s)
Jun 16, 2014 10:54:22 PM org.jenkinsci.plugins.mesos.JenkinsScheduler resourceOffers
INFO: Received offers 1
Jun 16, 2014 10:54:22 PM org.jenkinsci.plugins.mesos.JenkinsScheduler matches
WARNING: Ignoring disk resources from offer
Jun 16, 2014 10:54:22 PM org.jenkinsci.plugins.mesos.JenkinsScheduler matches
INFO: Ignoring ports resources from offer
Jun 16, 2014 10:54:22 PM org.jenkinsci.plugins.mesos.JenkinsScheduler resourceOffers
INFO: Offer matched! Creating mesos Docker task
Jun 16, 2014 10:54:22 PM org.jenkinsci.plugins.mesos.JenkinsScheduler createMesosDockerTask
INFO: Launching task mesos-jenkins-bc416717-768b-4e8b-a7cd-3bab75ae0db4 with command exec /var/lib/mesos/executors/docker ubuntu
java.lang.NoSuchMethodError: org.apache.mesos.MesosSchedulerDriver.launchTasks(Ljava/util/Collection;Ljava/util/Collection;Lorg/apache/mesos/Protos$Filters;)Lorg/apache/mesos/Protos$Status;
at org.jenkinsci.plugins.mesos.JenkinsScheduler.createMesosDockerTask(JenkinsScheduler.java:420)
at org.jenkinsci.plugins.mesos.JenkinsScheduler.resourceOffers(JenkinsScheduler.java:191)
Jun 16, 2014 10:54:22 PM org.jenkinsci.plugins.mesos.JenkinsScheduler$1 run
SEVERE: The mesos driver was aborted!

It is a NoSuchMethodError. I only have this issue when Docker is enabled, which causes driver.launchTasks(offerIds, tasks, filters); to run instead of the usual driver.launchTasks(offer.getId(), tasks, filters); when Docker is disabled. I see that offerIds is a List of IDs; does launchTasks actually take a List? Because normally you pass a single id.


The eBay PaaS Team June 18, 2014 at 11:09AM

Julien,

Just a guess, but the exception might indicate an incompatibility between the Mesos driver version that the plugin is using and your Mesos cluster version. The driver.launchTasks() method probably requires a list of OfferIDs now instead of a single OfferID.

Are you using the above Vagrant setup, which has Mesos 0.18.2 and Docker 0.11.0, or is it a different environment? The plugin .hpi file was last built using Mesos 0.17.0 jars and seems to work fine with Mesos 0.18.2.

Thanks!


Ivan Kurnosov July 6, 2014 at 8:50PM

These are two really interesting articles, but something is not clear to me:

You started the Jenkins master as a Mesos (/Marathon) job and didn’t do anything explicit to persist its state.

This means that as soon as the Jenkins master dies, Marathon will resurrect it, but it will be a clean installation without the jobs and other configured settings.

Is this for the sake of simplicity of the article, or am I missing something?


The eBay PaaS Team July 7, 2014 at 4:01PM

Ivan,

Great question! You’ve pointed out correctly that as soon as the Jenkins instance dies and Marathon re-spawns it, all the job configs and history will be lost. At eBay our PaaS system maintains preconfigured Jenkins config.xml and job templates depending on the stack chosen. It then provides a vanity Jenkins master URL to the developer using an HTTP proxy which resolves it to the correct Mesos instance. This vanity URL doesn’t change and the Marathon event bus can be used to update the proxy dynamically, capture lost tasks and create replacement tasks. Check out https://github.com/mesosphere/marathon/wiki/Event-Bus for more information.

The build artifacts can be persisted using NFS mounts, Amazon S3, or OpenStack Swift, which is outside the scope of this blog post.

Thanks!

