RDO Community News

See also blogs.rdoproject.org

Recent blog posts

It's been a little while since we've posted a roundup of blog posts around RDO, and you all have been rather prolific in the past month!

Here's what we as a community have been talking about:

Hooroo! Australia bids farewell to incredible OpenStack Summit by August Simonelli, Technical Marketing Manager, Cloud

We have reached the end of another successful and exciting OpenStack Summit. Sydney did not disappoint giving attendees a wonderful show of weather ranging from rain and wind to bright, brilliant sunshine. The running joke was that Sydney was, again, just trying to be like Melbourne. Most locals will get that joke, and hopefully now some of our international visitors do, too!

Read more at http://redhatstackblog.redhat.com/2017/11/16/hooroo-australia-bids-farewell-to-incredible-openstack-summit/

Build your Software Defined Data Center with Red Hat CloudForms and Openstack – part 2 by Michele Naldini

Welcome back, here we will continue with the second part of my post, where we will work with Red Hat Cloudforms. If you remember, in our first post we spoke about Red Hat OpenStack Platform 11 (RHOSP). In addition to the blog article, at the end of this article is also a demo video I created to show to our customers/partners how they can build a fully automated software data center.

Read more at https://developers.redhat.com/blog/2017/11/02/build-software-defined-data-center-red-hat-cloudforms-openstack/

Build your Software Defined Data Center with Red Hat CloudForms and Openstack – part 1 by Michele Naldini

In this blog, I would like to show you how you can create your fully software-defined data center with two amazing Red Hat products: Red Hat OpenStack Platform and Red Hat CloudForms. Because of the length of this article, I have broken this down into two parts.

Read more at https://developers.redhat.com/blog/2017/11/02/build-software-defined-data-center-red-hat-cloudforms-openstack-2/

G’Day OpenStack! by August Simonelli, Technical Marketing Manager, Cloud

In less than one week the OpenStack Summit is coming to Sydney! For those of us in the Australia/New Zealand (ANZ) region this is a very exciting time as we get to showcase our local OpenStack talents and successes. This summit will feature Australia’s largest banks, telcos, and enterprises and show the world how they have adopted, adapted, and succeeded with Open Source software and OpenStack.

Read more at http://redhatstackblog.redhat.com/2017/10/30/gday-openstack/

Restarting your TripleO hypervisor will break cinder volume service thus the overcloud pingtest by Carlos Camacho

I don't usually restart my hypervisor. Today I had to install LVM2 and virsh stopped working, so a restart was required. Once the VMs were up and running, the overcloud pingtest failed as cinder was not able to start.

Read more at http://anstack.github.io/blog/2017/10/30/restarting-your-tripleo-hypervisor-will-break-cinder.html

CERN CentOS Dojo, Part 4 of 4, Geneva by rbowen

On Friday evening, I went into downtown Geneva with several of my colleagues and various people who had attended the event.

Read more at http://drbacchus.com/cern-centos-dojo-part-4-of-4-geneva/

CERN CentOS Dojo, part 3 of 4: Friday Dojo by rbowen

On Friday, I attended the CentOS Dojo at CERN, in Meyrin Switzerland.

Read more at http://drbacchus.com/cern-centos-dojo-part-3-of-4-friday-dojo/

CERN Centos Dojo, event report: 2 of 4 – CERN tours by rbowen

The second half of Thursday was where we got to geek out and tour various parts of CERN.

Read more at http://drbacchus.com/cern-centos-dojo-cern-tours/

CERN Centos Dojo 2017, Event Report (1 of 4): Thursday Meeting by rbowen

On Thursday, prior to the main event, a smaller group of CentOS core community got together for some deep-dive discussions around the coming challenges that the project is facing, and constructive ways to address them.

Read more at http://drbacchus.com/cern-centos-dojo-2017-thursday/

CERN Centos Dojo 2017, Event report (0 of 4) by rbowen

For the last few days I’ve been in Geneva for the CentOS dojo at CERN.

Read more at http://drbacchus.com/cern-centos-dojo-2017/

Using Ansible Openstack modules on CentOS 7 by Fabian Arrotin

Suppose that you have an RDO/OpenStack cloud already in place, but you'd like to automate some operations: what can you do? On my side, I already mentioned that I used puppet to deploy initial clouds, but I still prefer Ansible myself when having to launch ad-hoc tasks, or even change configuration[s]. It's particularly true for our CI environment where we run "agentless" so all configuration changes happen through Ansible.

Read more at https://arrfab.net/posts/2017/Oct/11/using-ansible-openstack-modules-on-centos-7/

Using Falcon to cleanup Satellite host records that belong to terminated OSP instances by Simeon Debreceni

In an environment where OpenStack instances are automatically subscribed to Satellite, it is important that Satellite is notified of terminated instances so that it can safely delete its host record. Not doing so will:

Read more at https://developers.redhat.com/blog/2017/10/06/using-falcon-cleanup-satellite-host-records-belong-terminated-osp-instances/

My interview with Cool Python Codes by Julien Danjou

A few days ago, I was contacted by Godson Rapture from Cool Python codes to answer a few questions about what I work on in open source. Godson regularly interviews developers and I invite you to check out his website!

Read more at https://julien.danjou.info/blog/2017/interview-coolpythoncodes

Using Red Hat OpenStack Platform director to deploy co-located Ceph storage – Part Two by Dan Macpherson, Principal Technical Writer

Previously we learned all about the benefits in placing Ceph storage services directly on compute nodes in a co-located fashion. This time, we dive deep into the deployment templates to see how an actual deployment comes together and then test the results!

Read more at http://redhatstackblog.redhat.com/2017/10/04/using-red-hat-openstack-platform-director-to-deploy-co-located-ceph-storage-part-two/

Using Red Hat OpenStack Platform director to deploy co-located Ceph storage – Part One by Dan Macpherson, Principal Technical Writer

An exciting new feature in Red Hat OpenStack Platform 11 is full Red Hat OpenStack Platform director support for deploying Red Hat Ceph storage directly on your overcloud compute nodes. Often called hyperconverged, or HCI (for Hyperconverged Infrastructure), this deployment model places the Red Hat Ceph Storage Object Storage Daemons (OSDs) and storage pools directly on the compute nodes.

Read more at http://redhatstackblog.redhat.com/2017/10/02/using-red-hat-openstack-director-to-deploy-co-located-ceph-storage-part-one/

View article »

Anomaly Detection in CI logs

Continuous Integration jobs can generate a lot of data, and it can take a lot of time to figure out what went wrong when a job fails. This article demonstrates new strategies to assist with failure investigations and to reduce the need to crawl boring log files.

First, I will introduce the challenge of anomaly detection in CI logs. Second, I will present a workflow to automatically extract and report anomalies using a tool called LogReduce. Lastly, I will discuss the current limitations and how more advanced techniques could be used.

Introduction

Finding anomalies in CI logs using simple patterns such as "grep -i error" is not enough because interesting log lines don't necessarily feature obvious anomalous messages such as "error" or "failed". Sometimes you don't even know what you are looking for.

Compared to regular logs, such as the system logs of a production service, CI logs have a very interesting characteristic: they are reproducible. Thus, it is possible to carefully look for new events that are not present in other job execution logs. This article focuses on this particular characteristic to detect anomalies.

The challenge

For this article, baseline events are defined as the collection of log lines produced by nominal job executions, and target events are defined as the collection of log lines produced by a failed job run.

Searching for anomalous events is challenging because:

  • Events can be noisy: they often include unique features such as timestamps, hostnames or UUIDs.
  • Events can be scattered across many different files.
  • False positive events may appear for various reasons, for example when a new test option has been introduced. However, they often share common semantics with some baseline events.

Moreover, there can be a very high number of events, for example more than 1 million lines for tripleo jobs. Thus, we cannot simply look for each target event that is not present in the baseline events.
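
To make the scale problem concrete, the naive approach of a plain set difference over raw lines does not work, because nearly every line contains unique features. A minimal sketch with hypothetical variable names:

# Naive approach: report every raw target line never seen in the baseline.
baseline = set(baseline_raw_lines)
new_lines = [line for line in target_raw_lines if line not in baseline]
# Because of timestamps, request ids and uuids, almost every line is "new",
# so this reports most of the log instead of the few real anomalies.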

OpenStack Infra CRM114

It is worth noting that anomaly detection is already happening live in the openstack-infra operated review system using classify-log.crm, which is based on CRM114 Bayesian filters.

However, it is currently only used to classify global failures in the context of the elastic-recheck process. The main drawbacks to using this tool are:

  • Events are processed word by word without considering complete lines: it only computes the distances of up to a few words.
  • Reports are hard to find for regular users: they have to go to the elastic-recheck uncategorized page and click the crm114 links.
  • It is written in an obscure language.

LogReduce

This part presents the techniques I used in LogReduce to overcome the challenges described above.

Reduce noise with tokenization

The first step is to reduce the complexity of the events to simplify further processing. Here is the line processor I used (see the Tokenizer module):

  • Skip known bogus events such as SSH scans: "sshd.+[iI]nvalid user"
  • Remove known words:
    • Hashes, which are hexadecimal words that are 32, 64 or 128 characters long
    • UUID4
    • Date names
    • Random prefixes such as "(tmp req- qdhcp-)[^\s\/]+"
  • Discard every character that is not [a-z_\/]

For example this line:

  2017-06-21 04:37:45,827 INFO [nodepool.builder.UploadWorker.0] Uploading DIB image build 0000000002 from /tmpxvLOTg/fake-image-0000000002.qcow2 to fake-provider

Is reduced to:

  INFO nodepool builder UploadWorker Uploading image build from /fake image fake provider
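
Here is a rough Python sketch of such a line processor. The regular expressions below are illustrative approximations of the rules above, not the actual Tokenizer module:

import re

BOGUS_RE = re.compile(r"sshd.+[iI]nvalid user")
NOISE_RES = [
    re.compile(r"\b(?:[0-9a-fA-F]{128}|[0-9a-fA-F]{64}|[0-9a-fA-F]{32})\b"),          # hashes
    re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[0-9a-f]{4}-[0-9a-f]{12}\b"),  # UUID4
    re.compile(r"\b(?:mon|tue|wed|thu|fri|sat|sun|jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w*\b", re.I),  # date names
    re.compile(r"(?:tmp|req-|qdhcp-)[^\s/]+"),                                         # random prefixes
]

def tokenize(line):
    """Reduce a raw log line to a simplified token string (sketch)."""
    if BOGUS_RE.search(line):
        return ""                        # skip known bogus events
    for noise_re in NOISE_RES:
        line = noise_re.sub(" ", line)   # remove known noisy words
    line = re.sub(r"[^a-zA-Z_/ ]", " ", line)  # discard every other character
    return " ".join(line.split())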

Index events in a NearestNeighbors model

The next step is to index the baseline events. I used a NearestNeighbors model to query the target events' distance from the baseline events. This helps remove false-positive events that are similar to known baseline events. The model is fitted with all the baseline events transformed using Term Frequency Inverse Document Frequency (tf-idf). See the SimpleNeighbors model.

import sklearn.feature_extraction.text
import sklearn.neighbors

vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(
    analyzer='word', lowercase=False, tokenizer=None,
    preprocessor=None, stop_words=None)
nn = sklearn.neighbors.NearestNeighbors(
    algorithm='brute',
    metric='cosine')
train_vectors = vectorizer.fit_transform(train_data)
nn.fit(train_vectors)

Instead of having a single model per job, I built a model per file type. This requires some pre-processing work to figure out which model to use for each file. File names are converted to model names using another tokenization process to group similar files. See the filename2modelname function.

For example, the following files are grouped like so:

audit.clf: audit/audit.log audit/audit.log.1
merger.clf: zuul/merger.log zuul/merge.log.2017-11-12
journal.clf: undercloud/var/log/journal.log overcloud/var/log/journal.log
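
For illustration, here is a simplified sketch of how such a filename-to-model-name conversion could work; it is not the actual filename2modelname function:

import os
import re

def filename_to_modelname(path):
    """Group similar file paths under a single model name (sketch)."""
    name = os.path.basename(path)
    name = re.sub(r"\.(log|txt|gz)", "", name)               # drop common extensions
    name = re.sub(r"\.(\d+|\d{4}-\d{2}-\d{2})$", "", name)   # drop rotation suffixes
    return name + ".clf"

# filename_to_modelname("zuul/merger.log")            -> "merger.clf"
# filename_to_modelname("zuul/merger.log.2017-11-12") -> "merger.clf"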

Detect anomalies based on kneighbors distance

Once the NearestNeighbors model is fitted with the baseline events, we can repeat the tokenization and tf-idf transformation for the target events. Then, using the kneighbors query, we compute the distance of each target event.

test_vectors = vectorizer.transform(test_data)
distances, _ = nn.kneighbors(test_vectors, n_neighbors=1)

Using a distance threshold, this technique can effectively detect anomalies in CI logs.
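
As an illustration, the distances can then be turned into an anomaly report by keeping only the lines above the threshold. This is a sketch continuing the snippets above; test_lines, the list of raw target lines, is a hypothetical name, and the 0.2 value matches the threshold used in the worker pseudo-code below:

threshold = 0.2
anomalies = [
    (float(dist), line)
    for dist, line in zip(distances[:, 0], test_lines)
    if dist > threshold
]
for dist, line in sorted(anomalies, reverse=True):
    print("%0.2f | %s" % (dist, line))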

Automatic process

Instead of manually running the tool, I added a server mode that automatically searches and reports anomalies found in failed CI jobs. Here are the different components:

  • listener connects to mqtt/gerrit event-stream/cistatus.tripleo.org and collects all successful and failed jobs.

  • worker processes jobs collected by the listener. For each failed job, it does the following in pseudo-code:

Build model if it doesn't exist or if it is too old:
	For each of the last 5 successful jobs (baseline):
		Fetch logs
	For each baseline file group:
		Tokenize lines
		TF-IDF fit_transform
		Fit file group model
Fetch target logs
For each target file:
	Look for the file group model
	Tokenize lines
	TF-IDF transform
	file group model kneighbors search
	yield lines that have distance > 0.2
Write report
  • publisher processes each report computed by the worker and notifies:
    • IRC channel
    • Review comment
    • Mail alert (e.g. a periodic job which doesn't have an associated review)

Reports example

Here are a couple of examples to illustrate LogReduce reporting.

In this change I broke a service configuration (zuul gerrit port), and logreduce correctly found the anomaly in the service logs (zuul-scheduler can't connect to gerrit): sf-ci-functional-minimal report

In this tripleo-ci-centos-7-scenario001-multinode-oooq-container report, logreduce found 572 anomalies out of 1,078,248 lines. The interesting ones are:

  • Non-obvious new DEBUG statements in /var/log/containers/neutron/neutron-openvswitch-agent.log.txt.
  • The new firewall_driver=openvswitch setting in neutron was detected in:
    • /var/log/config-data/neutron/etc/neutron/plugins/ml2/ml2_conf.ini.txt
    • /var/log/extra/docker/docker_allinfo.log.txt
  • New usage of cinder-backup was detected across several files such as:
    • /var/log/journal contains new puppet statement
    • /var/log/cluster/corosync.log.txt
    • /var/log/pacemaker/bundles/rabbitmq-bundle-0/rabbitmq/rabbit@centos-7-rax-iad-0000787869.log.txt.gz
    • /etc/puppet/hieradata/service_names.json
    • /etc/sensu/conf.d/client.json.txt
    • pip2-freeze.txt
    • rpm-qa.txt

Caveats and improvements

This part discusses the caveats and limitations of the current implementation and suggests other improvements.

Empty success logs

This method doesn't work when the debug events are only included in the failed logs. To successfully detect anomalies, failure and success logs need to be similar; otherwise, all the extra information in failed logs will be considered anomalous.

This situation happens with testr results where success logs only contain 'SUCCESS'.

Building good baseline model

Building a good baseline model with nominal job events is key to anomaly detection. We could use periodic execution (with or without failed runs), or the gate pipeline.

Unfortunately, Zuul currently lacks build reporting and we have to scrape Gerrit comments or status web pages, which is sub-optimal. Hopefully the upcoming zuul-web builds API and zuul-scheduler MQTT reporter will make this task easier to implement.

Machine learning

I am by no means proficient at machine learning. Logreduce happens to be useful as it is now. However, here are some other strategies that may be worth investigating.

The model currently uses a word dictionary to build the feature vectors, and this may be improved by using different feature extraction techniques more suited to log line events, such as MinHash and/or Locality Sensitive Hashing.

The NearestNeighbors kneighbors query tends to be slow for large samples, and this may be improved upon by using a Self-Organizing Map, RandomForest or OneClassSVM model.
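
For instance, a OneClassSVM could be swapped in for the NearestNeighbors model with only a few changes. This is a sketch of the idea, reusing the variable names from the earlier snippets, not something Logreduce does today:

import sklearn.svm

# Fit a one-class SVM on the baseline tf-idf vectors instead of NearestNeighbors.
ocsvm = sklearn.svm.OneClassSVM(kernel="linear", nu=0.01)
ocsvm.fit(train_vectors)

# predict() returns -1 for outliers and +1 for inliers; the nu parameter would
# need tuning instead of the 0.2 cosine-distance threshold used above.
predictions = ocsvm.predict(test_vectors)
anomalies = [line for pred, line in zip(predictions, test_lines) if pred == -1]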

When line sizes are not homogeneous in a file group, the model doesn't work well. For example, mistral/api.log line sizes vary between 10 and 8000 characters. Using separate models binned by line size may be a great improvement.

CI logs analysis is a broad subject on its own, and I suspect someone good at machine learning might be able to find other clever anomaly detection strategies.

Further processing

Detected anomalies could be further processed by:

  • Merging similar anomalies discovered across different files.
  • Looking for known anomalies in a system like elastic-recheck.
  • Reporting new anomalies to elastic-recheck so that affected jobs could be grouped.

Conclusion

CI log analysis is a powerful service to assist failure investigations. The end goal would be to report anomalies instead of exhaustive job logs.

Early results of the LogReduce models look promising and I hope we can set up such services for any CI job in the future. Please get in touch by mail or IRC (tristanC on Freenode) if you are interested.

View article »

Mailing List Changes

You need to be aware of recent changes to our mailing lists

What Happened, and Why?

Since the start of the project we have had one mailing list for both users and developers of the RDO project. Over time, we felt that user questions have been drowned out by the more technical developer-oriented discussion, leaving users/operators out of the conversation.

To this end, we've decided to split the one mailing list - rdo-list@redhat.com - into two new mailing lists - dev@lists.rdoproject.org and users@lists.rdoproject.org.

We've also moved the rdo-newsletter@redhat.com list to the new newsletter@lists.rdoproject.org email address.

What you need to do

You need to update your contacts list to reflect this change, and start sending email to the new addresses.

As in any typical open source project, user conversations (questions, discussion, community announcements, and so on) should go to the users list, while developer-related discussion should go to the dev list.

If you send email to the old address, you should receive an immediate autoresponse reminding you of the new addresses.

List descriptions and archives are now all at https://lists.rdoproject.org/mailman/listinfo. Please let me know if you see references to the old list information, so we can get it updated.

View article »

CentOS Dojo @ CERN

Hi,

Alan, Matthias, Rich and I were at CERN last week on Thursday and Friday to attend the CentOS dojo. Rich also wrote a series of blog posts about the dojo.

First day: CentOS SIGs meetup

Thursday was dedicated to a SIGs meeting. I'll give a few highlights, but you can read the full notes on the etherpad.

  • We managed to agree on a proposal to allow bot accounts for SIGs, which is one of RDO's current pain points.
  • There was also progress on improving CI for SIG content, like defining a matrix of SIGs that depend on each other in order to trigger tests.
  • Testing against CentOS extras is also an issue. SIGs were advised to provide automated tests that CentOS QA can run and send feedback on (not blocking updates, but still an improvement), thanks to the t_functional framework.
  • Many discussions around the package build workflow (signing, embargoed builds, deprecating content).
  • SIG process: what happens when a chair is MIA? (this happened for the Storage SIG) This was a very productive and focused session; we even managed not to run over schedule. Defining a proper agenda ahead of time helped.

At the end of the day, we had a tour of the datacenter (to see and touch the nodes that run RDO <3). Then, we visited the ATLAS experiment facility.


Second day: CentOS dojo

Friday was the dojo itself (see the schedule, with slides attached!). We had about 100 people registered, with more or less 20 not showing up. It started with Belmiro Moreira's talk about the OpenStack infrastructure at CERN. It is amazing to see that their RDO cloud runs over 279k cores and has been updated to Pike. It was followed by a talk from Hervé Rousseau about CERN storage facilities and the challenges they are facing (Data Deluge in 2026!). They are big users of Ceph and CephFS.

Afterwards, we had SIG status updates from Storage, Opstools (mrunge) and Cloud (myself). It seems the attendees were happy to discover Opstools in a new light; Matthias had many questions after his talk. For my Cloud SIG talk (slides), I collected many stats to show the vitality of our community. I would like to thank boucher and the Software Factory Team for the RepoXplorer project, which was really helpful for the stats. Then, I spoke about our contributions to cross-SIG collaboration, including amoralej's proposal for a Ceph build pipeline inspired by ours. And I ended with our own infrastructure, showing off DLRN, WeIRDO, etc. The day ended with a talk from kwizart (RPM Fusion maintainer) about CentOS and 3rd-party repositories.

The hallway track was also interesting, as I got to meet the Magnum PTL and the other folks maintaining it at CERN. I finally got feedback that the magnum packaging is working fine, and we spoke about adding RDO 3rd-party CI to magnum. We don't ship magnum in OSP, but it is a visible project used by RDO's biggest use case, so helping them set it up is excellent news for RDO.


Conclusion

This was an excellent event, where SIGs were able to focus on solving our current pain points. As a community, RDO does value our collaboration with CentOS to provide a native and rock-solid experience of OpenStack, from the kernel to the API endpoints!

View article »

Project Teams Gathering interviews

Several weeks ago I attended the Project Teams Gathering (PTG) in Denver, and conducted a number of interviews with project teams and a few of the PTLs (Project Technical Leads).

These interviews are now all up on the RDO YouTube channel. Please subscribe, as I'll be doing more interviews like this at OpenStack Summit in Sydney, as well as at future events.

I want to draw particular attention to my interview with the Swift crew about how they collaborate across company lines and across timezones. Very inspiring.

Watch all the videos now.

View article »

Recent blog posts

Here's what the RDO community has been blogging about recently:

OpenStack 3rd Party CI with Software Factory by jpena

Introduction When developing for an OpenStack project, one of the most important aspects to cover is to ensure proper CI coverage of our code. Each OpenStack project runs a number of CI jobs on each commit to test its validity, so thousands of jobs are run every day in the upstream infrastructure.

Read more at http://rdoproject.org/blog/2017/09/openstack-3rd-party-ci-with-software-factory/

OpenStack Days UK by Steve Hardy

Yesterday I attended the OpenStack Days UK event, held in London. It was a very good day and there were a number of interesting talks, and it provided a great opportunity to chat with folks about OpenStack. I gave a talk, titled "Deploying OpenStack at scale, with TripleO, Ansible and Containers", where I gave an update on the recent rework in the TripleO project to make more use of Ansible and enable containerized deployments. I'm planning some future blog posts with more detail on this topic, but for now here's a copy of the slide deck I used, also available on github.

Read more at http://hardysteven.blogspot.com/2017/09/openstack-days-uk-yesterday-i-attended.html

OpenStack Client in Queens - Notes from the PTG by jpichon

Here are a couple of notes about the OpenStack Client, taken while dropping in and out of the room during the OpenStack PTG in Denver, a couple of weeks ago.

Read more at http://www.jpichon.net/blog/2017/09/openstack-client-queens-notes-ptg/

Event report: OpenStack PTG by rbowen

Last week I attended the second OpenStack PTG, in Denver. The first one was held in Atlanta back in February.

Read more at http://drbacchus.com/event-report-openstack-ptg/

View article »

OpenStack 3rd Party CI with Software Factory

Introduction

When developing for an OpenStack project, one of the most important aspects to cover is to ensure proper CI coverage of our code. Each OpenStack project runs a number of CI jobs on each commit to test its validity, so thousands of jobs are run every day in the upstream infrastructure.

In some cases, we will want to set up an external CI system, and make it report as a 3rd Party CI on certain OpenStack projects. This may be because we want to cover specific software/hardware combinations that are not available in the upstream infrastructure, or want to extend test coverage beyond what is feasible upstream, or any other reason you can think of.

While the process to set up a 3rd Party CI is documented, some implementation details are missing. In the RDO Community, we have been using Software Factory to power our 3rd Party CI for OpenStack, and it has worked very reliably for several cycles.

The main advantage of Software Factory is that it integrates all the pieces of the OpenStack CI infrastructure in an easy-to-consume package, so let's have a look at how to build a 3rd Party CI from the ground up.

Requirements

You will need the following:

  • An OpenStack-based cloud, which will be used by Nodepool to create temporary VMs where the CI jobs will run. It is important to make sure that the default security group in the tenant accepts SSH connections from the Software Factory instance.
  • A CentOS 7 system for the Software Factory instance, with at least 8 GB of RAM and 80 GB of disk. It can run on the OpenStack cloud used for Nodepool; just make sure it runs in a separate project.
  • DNS resolution for the Software Factory system.
  • A 3rd Party CI user on review.openstack.org. Follow this guide to configure it.
  • Some previous knowledge on how Gerrit and Zuul work is advisable, as it will help during the configuration process.

Basic Software Factory installation

For a detailed installation walkthrough, refer to the Software Factory documentation. We will highlight here how we set it up on a test VM.

Software installation

On the CentOS 7 instance, run the following commands to install the latest release of Software Factory (2.6 at the time of this article):

$ sudo yum install -y https://softwarefactory-project.io/repos/sf-release-2.6.rpm
$ sudo yum update -y
$ sudo yum install -y sf-config

Define the architecture

Software Factory has several optional components and can be set up to run them on more than one system. In our setup, we will install the minimum required components for a 3rd Party CI system, all on one node.

$ sudo vi /etc/software-factory/arch.yaml

Make sure the nodepool-builder role is included. Our file will look like:

---
description: "OpenStack 3rd Party CI deployment"
inventory:
  - name: managesf
    ip: 192.168.122.230
    roles:
      - install-server
      - mysql
      - gateway
      - cauth
      - managesf
      - gitweb
      - gerrit
      - logserver
      - zuul-server
      - zuul-launcher
      - zuul-merger
      - nodepool-launcher
      - nodepool-builder
      - jenkins

In this setup, we are using Jenkins to run our jobs, so we need to create an additional file:

$ sudo vi /etc/software-factory/custom-vars.yaml

And add the following content

nodepool_zuul_launcher_target: False

Note: As an alternative, we could use zuul-launcher to run our jobs and drop Jenkins. In that case, there is no need to create this file. However, later when defining our jobs we will need to use the jobs-zuul directory instead of jobs in the config repo.

Edit Software Factory configuration

$ sudo vi /etc/software-factory/sfconfig.yaml

This file contains all the configuration data used by the sfconfig script. Make sure you set the following values:

  • Password for the default admin user.
authentication:
  admin_password: supersecurepassword
  • The fully qualified domain name for your system.
fqdn: sftests.com
  • The OpenStack cloud configuration required by Nodepool.
nodepool:
  providers:
  - auth_url: http://192.168.1.223:5000/v2.0
    name: microservers
    password: cloudsecurepassword
    project_name: mytestci
    region_name: RegionOne
    regions: []
    username: ciuser
  • The authentication options if you want other users to be able to log into your instance of Software Factory using OAuth providers like GitHub. This is not mandatory for a 3rd party CI. See this part of the documentation for details.

  • If you want to use LetsEncrypt to get a proper SSL certificate, set:

  use_letsencrypt: true

Run the configuration script

You are now ready to complete the configuration and get your basic Software Factory installation running.

$ sudo sfconfig

After the script finishes, just point your browser to https://<your FQDN> (sftests.com in this example) and you will see the Software Factory interface.

SF interface

Configure SF to connect to the OpenStack Gerrit

Once we have a basic Software Factory environment running, and our service account set up in review.openstack.org, we just need to connect both together. The process is quite simple:

  • First, make sure the local Zuul user SSH key, found at /var/lib/zuul/.ssh/id_rsa.pub, is added to the service account at review.openstack.org.

  • Then, edit /etc/software-factory/sfconfig.yaml again, and edit the zuul section to look like:

zuul:
  default_log_site: sflogs
  external_logservers: []
  gerrit_connections:
  - name: openstack
    hostname: review.openstack.org
    port: 29418
    puburl: https://review.openstack.org/r/
    username: mythirdpartyciuser
  • Finally, run sfconfig again. Log information will start flowing in /var/log/zuul/server.log, and you will see a connection to review.openstack.org port 29418.

Create a test job

In Software Factory 2.6, a special project named config is automatically created on the internal Gerrit instance. This project holds the user-defined configuration, and changes to the project must go through Gerrit.

Configure images for nodepool

All CI jobs will use a predefined image, created by Nodepool. Before creating any CI job, we need to prepare this image.

  • As a first step, add your SSH public key to the admin user in your Software Factory Gerrit instance.

Add SSH Key

  • Then, clone the config repo on your computer and edit the nodepool configuration file:
$ git clone ssh://admin@sftests.com:29418/config sf-config
$ cd sf-config
$ vi nodepool/nodepool.yaml
  • Define the disk image and assign it to the OpenStack cloud defined previously:
---
diskimages:
  - name: dib-centos-7
    elements:
      - centos-minimal
      - nodepool-minimal
      - simple-init
      - sf-jenkins-worker
      - sf-zuul-worker
    env-vars:
      DIB_CHECKSUM: '1'
      QEMU_IMG_OPTIONS: compat=0.10
      DIB_GRUB_TIMEOUT: '0'

labels:
  - name: dib-centos-7
    image: dib-centos-7
    min-ready: 1
    providers:
      - name: microservers

providers:
  - name: microservers
    cloud: microservers
    clean-floating-ips: true
    image-type: raw
    max-servers: 10
    boot-timeout: 120
    pool: public
    rate: 2.0
    networks:
      - name: private
    images:
      - name: dib-centos-7
        diskimage: dib-centos-7
        username: jenkins
        min-ram: 1024
        name-filter: m1.medium

First, we are defining the diskimage-builder elements that will create our image, named dib-centos-7.

Then, we are assigning that image to our microservers cloud provider, and specifying that we want to have at least 1 VM ready to use.

Finally we define some specific parameters about how Nodepool will use our cloud provider: the internal (private) and external (public) networks, the flavor for the virtual machines to create (m1.medium), how many seconds to wait between operations (2.0 seconds), etc.

  • Now we can submit the change for review:
$ git add nodepool/nodepool.yaml
$ git commit -m "Nodepool configuration"
$ git review
  • In the Software Factory Gerrit interface, we can then check the open change. The config repo has some predefined CI jobs, so you can check if your syntax was correct. Once the CI jobs show a Verified +1 vote, you can approve it (Code Review +2, Workflow +1), and the change will be merged in the repository.

  • After the change is merged in the repository, you can check the logs at /var/log/nodepool and see the image being created, then uploaded to your OpenStack cloud.

Define test job

There is a special project in OpenStack meant to be used to test 3rd Party CIs, openstack-dev/ci-sandbox. We will now define a CI job to "check" any new commit being reviewed there.

  • Assign the nodepool image to the test job
$ vi jobs/projects.yaml

We are going to use a pre-installed job named demo-job. All we have to do is to ensure it uses the image we just created in Nodepool.

- job:
    name: 'demo-job'
    defaults: global
    builders:
      - prepare-workspace
      - shell: |
          cd $ZUUL_PROJECT
          echo "This is a demo job"
    triggers:
      - zuul
    node: dib-centos-7
  • Define a Zuul pipeline and a job for the ci-sandbox project
$ vi zuul/upstream.yaml

We are creating a specific Zuul pipeline for changes coming from the OpenStack Gerrit, and specifying that we want to run a CI job for commits to the ci-sandbox project:

pipelines:
  - name: openstack-check
    description: Newly uploaded patchsets enter this pipeline to receive an initial +/-1 Verified vote from Jenkins.
    manager: IndependentPipelineManager
    source: openstack
    precedence: normal
    require:
      open: True
      current-patchset: True
    trigger:
      openstack:
        - event: patchset-created
        - event: change-restored
        - event: comment-added
          comment: (?i)^(Patch Set [0-9]+:)?( [\w\\+-]*)*(\n\n)?\s*(recheck|reverify)
    success:
      openstack:
        verified: 0
    failure:
      openstack:
        verified: 0

projects:
  - name: openstack-dev/ci-sandbox
    openstack-check:
      - demo-job

Note that we are telling our job not to send a vote for now (verified: 0). We can change that later if we want to make our job voting.

  • Apply configuration change
$ git add zuul/upstream.yaml jobs/projects.yaml
$ git commit -m "Zuul configuration for 3rd Party CI"
$ git review

Once the change is merged, Software Factory's Zuul process will be listening for changes to the ci-sandbox project. Just try creating a change and see if everything works as expected!

Troubleshooting

If something does not work as expected, here are some troubleshooting tips:

Log files

You can find the Zuul log files in /var/log/zuul. Zuul has several components, so start with checking server.log and launcher.log, the log files for the main server and the process that launches CI jobs.

The Nodepool log files are located in /var/log/nodepool. builder.log contains the log from image builds, while nodepool.log has the log for the main process.

Nodepool commands

You can check the status of the virtual machines created by nodepool with:

$ sudo nodepool list

Also, you can check the status of the disk images with:

$ sudo nodepool image-list

Jenkins status

You can see the Jenkins status from the GUI at https://<your FQDN>/jenkins/, if logged in with the admin user. If no machines show up in the 'Build Executor Status' pane, that means that either Nodepool could not launch a VM, or there was some issue in the connection between Zuul and Jenkins. In that case, check the Jenkins logs at /var/log/jenkins, or restart the service if there are errors.

Next steps

For now, we have only run a test job against a test project. The real power comes when you create a proper CI job for a project you are interested in. You should now:

  • Create a file under jobs/ with the JJB definition for your new job.

  • Edit zuul/upstream.yaml to add the project(s) you want your 3rd Party CI system to watch.

View article »

Recent blog posts

It's been a few weeks since I did one of these blog wrapups, and there's been a lot of great content by the RDO community recently.

Here's some of what we've been talking about recently:

Project Teams Gathering (PTG) report - Zuul by tristanC

The OpenStack infrastructure team gathered in Denver (September 2017). This article reports some of Zuul's topics that were discussed at the PTG.

Read more at http://rdoproject.org/blog/2017/09/PTG-report-zuul/

Evaluating Total Cost of Ownership of the Identity Management Solution by Dmitri Pal

Increasing Interest in Identity Management: During last several months I’ve seen a rapid growth of interest in Red Hat’s Identity Management (IdM) solution. This might have been due to different reasons.

Read more at http://rhelblog.redhat.com/2017/09/18/evaluating-total-cost-of-ownership-of-the-identity-management-solution/

Debugging TripleO Ceph-Ansible Deployments by John

Starting in Pike it is possible to use TripleO to deploy Ceph in containers using ceph-ansible. This is a guide to help you if there is a problem. It asks questions, somewhat rhetorically, to help you track down the problem.

Read more at http://blog.johnlikesopenstack.com/2017/09/debug-tripleo-ceph-ansible.html

Make a NUMA-aware VM with virsh by John

Grégory showed me how he uses virsh edit on a VM to add something like the following:

Read more at http://blog.johnlikesopenstack.com/2017/09/make-numa-aware-vm-with-virsh.html

Writing a SELinux policy from the ground up by tristanC

SELinux is a mechanism that implements mandatory access controls in Linux systems. This article shows how to create a SELinux policy that confines a standard service:

Read more at http://rdoproject.org/blog/2017/09/SELinux-policy-from-the-ground-up/

Trick to test external ceph clusters using only tripleo-quickstart by John

TripleO can stand up a Ceph cluster as part of an overcloud. However, if all you have is a tripleo-quickstart env and want to test an overcloud feature which uses an external Ceph cluster, then you can have quickstart stand up two heat stacks, one to make a separate ceph cluster and the other to stand up an overcloud which uses that ceph cluster.

Read more at http://blog.johnlikesopenstack.com/2017/09/trick-to-test-external-ceph-clusters.html

RDO Pike released by Rich Bowen

The RDO community is pleased to announce the general availability of the RDO build for OpenStack Pike for RPM-based distributions, CentOS Linux 7 and Red Hat Enterprise Linux. RDO is suitable for building private, public, and hybrid clouds. Pike is the 16th release from the OpenStack project, which is the work of more than 2300 contributors from around the world (source).

Read more at http://rdoproject.org/blog/2017/09/rdo-pike-released/

OpenStack Summit Sydney preview: Red Hat to present at more than 40 sessions by Peter Pawelski, Product Marketing Manager, Red Hat OpenStack Platform

The next OpenStack Summit will take place in Sydney, Australia, November 6-8. And despite the fact that the conference will only run three days instead of the usual four, there will be plenty of opportunities to learn about OpenStack from Red Hat’s thought leaders.

Read more at http://redhatstackblog.redhat.com/2017/08/31/openstack-summit-fall2017-preview/

Scheduled snapshots by Tim Bell

While most of the machines on the CERN cloud are configured using Puppet with state stored in external databases or file stores, there are a few machines where this has been difficult, especially for legacy applications. Doing a regular snapshot of these machines would be a way of protecting against failure scenarios such as hypervisor failure or disk corruptions.

Read more at http://openstack-in-production.blogspot.com/2017/08/scheduled-snapshots.html

Ada Lee: OpenStack Security, Barbican, Novajoin, TLS Everywhere in Ocata by Rich Bowen

Ada Lee talks about OpenStack Security, Barbican, Novajoin, and TLS Everywhere in Ocata, at the OpenStack PTG in Atlanta, 2017.

Read more at http://rdoproject.org/blog/2017/08/ada-lee-openstack-security-barbican-novajoin-tls-everywhere-in-ocata/

Octavia Developer Wanted by assafmuller

I’m looking for a Software Engineer to join the Red Hat OpenStack Networking team. I am presently looking to hire in Europe, Israel and US East. The candidate may work from home or from one of the Red Hat offices. The team is globally distributed and comprised of talented, autonomous, empowered and passionate individuals with a healthy work/life balance. The candidate will work on OpenStack Octavia and LBaaS. The candidate will write and review code while working with upstream community members and fellow Red Hatters. If you want to do open source, Red Hat is objectively where it’s at. We have an institutional culture of open source at all levels and this has a ripple effect on your day to day and your career at the company.

Read more at https://assafmuller.com/2017/08/18/octavia-developer-wanted/

View article »

Project Teams Gathering (PTG) report - Zuul

The OpenStack infrastructure team gathered in Denver (September 2017). This article reports some of Zuul's topics that were discussed at the PTG.

For your reference, I highlighted some of the new features coming in Zuul version 3 in this article.

Cutover and jobs migration

The OpenStack community has grown a complex set of CI jobs over the past several years that now need to be migrated. A zuul-migrate script has been created to automate the migration from the Jenkins Job Builder format to the new Ansible-based job definition. The migrated jobs are prefixed with "legacy-" to indicate they still need to be manually refactored to fully benefit from the ZuulV3 features.

The team couldn't finish the migration and disable the current ZuulV2 services at the PTG because the job migration took longer than expected. However, a new cutover attempt will occur in the next few weeks.

Ansible devstack job

The devstack job has been completely rewritten as a fully fledged Ansible job. This is a good example of what a job looks like in the new Zuul:

A project that needs a devstack CI job uses a new job definition like this one:

- job:
    name: shade-functional-devstack-base
    parent: devstack
    description: |
      Base job for devstack-based functional tests
    pre-run: playbooks/devstack/pre
    run: playbooks/devstack/run
    post-run: playbooks/devstack/post
    required-projects:
      # These jobs will DTRT when shade triggers them, but we want to make
      # sure stable branches of shade never get cloned by other people,
      # since stable branches of shade are, well, not actually things.
      - name: openstack-infra/shade
        override-branch: master
      - name: openstack/heat
      - name: openstack/swift
    roles:
      - zuul: openstack-infra/devstack-gate
    timeout: 9000
    vars:
      devstack_localrc:
        SWIFT_HASH: "1234123412341234"
      devstack_local_conf:
        post-config:
          "$CINDER_CONF":
            DEFAULT:
              osapi_max_limit: 6
      devstack_services:
        ceilometer-acentral: False
        ceilometer-acompute: False
        ceilometer-alarm-evaluator: False
        ceilometer-alarm-notifier: False
        ceilometer-anotification: False
        ceilometer-api: False
        ceilometer-collector: False
        horizon: False
        s-account: True
        s-container: True
        s-object: True
        s-proxy: True
      devstack_plugins:
        heat: https://git.openstack.org/openstack/heat
      shade_environment:
        # Do we really need to set this? It's cargo culted
        PYTHONUNBUFFERED: 'true'
        # Is there a way we can query the localconf variable to get these
        # rather than setting them explicitly?
        SHADE_HAS_DESIGNATE: 0
        SHADE_HAS_HEAT: 1
        SHADE_HAS_MAGNUM: 0
        SHADE_HAS_NEUTRON: 1
        SHADE_HAS_SWIFT: 1
      tox_install_siblings: False
      tox_envlist: functional
      zuul_work_dir: src/git.openstack.org/openstack-infra/shade

This new job definition greatly simplifies the devstack integration tests, and projects now have much more fine-grained control over their integration with the other OpenStack projects.

Dashboard

I have been working on the new zuul-web interfaces to replace the scheduler webapp, so that we can scale out the REST endpoints and prevent direct connections to the scheduler. Here is a summary of the new interfaces:

  • /tenants.json : return the list of tenants,
  • /{tenant}/status.json : return the status of the pipelines,
  • /{tenant}/jobs.json : return the list of jobs defined, and
  • /{tenant}/builds.json : return the list of builds from the sql reporter.
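
As a small illustration, these endpoints can be queried directly from Python; the base URL below is an assumption, and the exact response fields are only sketched from the endpoint descriptions above:

import json
import urllib.request

base_url = "https://zuul.example.com"   # hypothetical zuul-web address

def get_json(path):
    with urllib.request.urlopen(base_url + path) as resp:
        return json.load(resp)

for tenant in get_json("/tenants.json"):
    name = tenant["name"]
    builds = get_json("/%s/builds.json" % name)
    print("%s: %d builds recorded" % (name, len(builds)))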

Moreover, the new interfaces enable new use cases, for example, users can now:

  • Get the list of available jobs and their description,
  • Check the results of post and periodic jobs, and
  • Dynamically list jobs' results using filters, for example, the last tripleo periodic jobs can be obtained using:
$ curl "${TENANT_URL}/builds.json?project=tripleo&pipeline=periodic" | python -mjson.tool
[
    {
        "change": 0,
        "patchset": 0,
        "id": 16,
        "job_name": "periodic-tripleo-ci-centos-7-ovb-ha-oooq",
        "log_url": "https://logs.openstack.org/periodic-tripleo-ci-centos-7-ovb-ha-oooq/2cde3fd/",
        "pipeline": "periodic",
		...
    },
    ...
]

OpenStack health

The openstack-health service is likely to be modified to better interface with the new Zuul design. It is currently connected to an internal gearman bus to receive job completion events before running the subunit2sql process.

This processing could be rewritten as a post playbook to do the subunit processing as part of the job. Then the data could be pushed to the SQL server with the credentials stored in a Zuul secret.

Roadmap

On the last day, even though most of us were exhausted, we spent some time discussing the roadmap for the upcoming months. While the roadmap is still being defined, here are some highlights:

  • Based on new users' walkthroughs, the documentation will be greatly improved. For example, see this nodepool contribution.
  • Jobs will be able to return structured data to improve the reporting. For example, a pypi publisher may return the published URL. Similarly, an rpm-build job may return the repository URL.
  • Dashboard web interface and javascript tooling,
  • Admin interface to manually trigger a unique build or cancel a buildset,
  • Nodepool quotas to improve performance,
  • Cross-source dependencies, for example a GitHub change in Ansible could depend on a Gerrit change in shade,
  • More Nodepool drivers such as Kubernetes or AWS, and
  • Fedmsg and MQTT Zuul drivers for message bus reporting and trigger sources.

In conclusion, the ZuulV3 efforts were extremely fruitful, and this article only covers a few of the design sessions. Once again, we have made great progress and I'm looking forward to further developments. Thank you all for the great team gathering event!

View article »

Writing a SELinux policy from the ground up

SELinux is a mechanism that implements mandatory access controls in Linux systems. This article shows how to create a SELinux policy that confines a standard service:

  • Limit its network interfaces,
  • Restrict its system access, and
  • Protect its secrets.

Mandatory access control

By default, unconfined processes use discretionary access controls (DAC). A user has all the permissions over their objects; for example, the owner of a log file can modify it or make it world-readable.

In contrast, mandatory access control (MAC) enables more fine-grained controls; for example, it can restrict the owner of a log file to append-only operations. Moreover, MAC can also be used to reduce the capabilities of a regular process, for example by denying debugging or networking capabilities.

This is great for system security, but it is also a powerful tool for controlling and better understanding an application. Security policies reduce a service's attack surface and describe its system operations in depth.

Policy module files

A SELinux policy is composed of:

  • A type enforcement file (.te): describes the policy type and access control,
  • An interface file (.if): defines functions available to other policies,
  • A file context file (.fc): describes the path labels, and
  • A package spec file (.spec): describes how to build and install the policy.

The packaging is optional but highly recommended since it's a standard method to distribute and install new pieces on a system.

Under the hood, these files are written using macro processors:

  • A policy file (.pp) is generated using: make NAME=targeted -f "/usr/share/selinux/devel/Makefile"
  • An intermediary file (.cil) is generated using: /usr/libexec/selinux/hll/pp

Policy development workflow:

The first step is to get the service running in a confined domain. Then we define new labels to better protect the service. Finally, the service is run in permissive mode to collect the accesses it needs.

As an example, we are going to create a security policy for the scheduler service of the Zuul program.

Confining a Service

To get the basic policy definitions, we use the sepolicy generate command to generate a bootstrap zuul-scheduler policy:

sepolicy generate --init /opt/rh/rh-python35/root/bin/zuul-scheduler

The --init argument tells the command to generate a service policy. Other types of policies could be generated, such as user application, inetd daemon or confined administrator.

The .te file contains:

  • A new zuul_scheduler_t domain,
  • A new zuul_scheduler_exec_t file label,
  • A domain transition from systemd to zuul_scheduler_t when the zuul_scheduler_exec_t is executed, and
  • Miscellaneous definitions such as the ability to read localization settings.

The .fc file contains regular expressions to match a file path with a label: /bin/zuul-scheduler is associated with zuul_scheduler_exec_t.

The .if file contains methods (macros) that enable role extension. For example, we could use the zuul_scheduler_admin method to authorize a staff role to administer the zuul service. We won't use this file because the admin user (root) is unconfined by default and doesn't need special permissions to administer the service.

To install the zuul-scheduler policy we can run the provided script:

$ sudo ./zuul_scheduler.sh
Building and Loading Policy
+ make -f /usr/share/selinux/devel/Makefile zuul_scheduler.pp
Creating targeted zuul_scheduler.pp policy package
+ /usr/sbin/semodule -i zuul_scheduler.pp

Restarting the service should show (using "ps Zax") that it is now running with the system_u:system_r:zuul_scheduler_t:s0 context instead of the system_u:system_r:unconfined_service_t:s0.

And looking at audit.log, it should show many "avc: denied" errors because no permissions have been defined yet. Note that the service still runs fine because this initial policy defines the zuul_scheduler_t domain as permissive.

Before authorizing the service's access, let's define the zuul resources.

Define the service resources

The service is trying to access /etc/opt/rh/rh-python35/zuul and /var/opt/rh/rh-python35/lib/zuul, which inherited the etc_t and var_lib_t labels. Instead of giving zuul_scheduler_t access to etc_t and var_lib_t, we will create new types. Moreover, the zuul-scheduler manages secret keys that we could isolate from its general home directory, and it requires two TCP ports.

In the .fc file, define the new paths:

/var/opt/rh/rh-python35/lib/zuul/keys(/.*)?  gen_context(system_u:object_r:zuul_keys_t,s0)
/etc/opt/rh/rh-python35/zuul(/.*)?           gen_context(system_u:object_r:zuul_conf_t,s0)
/var/opt/rh/rh-python35/lib/zuul(/.*)?       gen_context(system_u:object_r:zuul_var_lib_t,s0)
/var/opt/rh/rh-python35/log/zuul(/.*)?       gen_context(system_u:object_r:zuul_log_t,s0)

In the .te file, declare the new types:

# System files
type zuul_conf_t;
files_type(zuul_conf_t)
type zuul_var_lib_t;
files_type(zuul_var_lib_t)
type zuul_log_t;
logging_log_file(zuul_log_t)

# Secret files
type zuul_keys_t;
files_type(zuul_keys_t)

# Network label
type zuul_gearman_port_t;
corenet_port(zuul_gearman_port_t)
type zuul_webapp_port_t;
corenet_port(zuul_webapp_port_t);

Note that the files_type() macro is important since it provides unconfined access to the new types. Without it, even the admin user could not access the files.

In the .spec file, add the new path and setup the tcp port labels:

%define relabel_files() \
restorecon -R /var/opt/rh/rh-python35/lib/zuul/keys
...

# In the %post section, add
semanage port -a -t zuul_gearman_port_t -p tcp 4730
semanage port -a -t zuul_webapp_port_t -p tcp 8001

# In the %postun section, add
for port in 4730 8001; do semanage port -d -p tcp $port; done

Rebuild and install the package:

sudo ./zuul_scheduler.sh && sudo rpm -ivh ./noarch/*.rpm

Check that the new types are installed using "ls -Z" and "semanage port -l":

$ ls -Zd /var/opt/rh/rh-python35/lib/zuul/keys/
drwx------. zuul zuul system_u:object_r:zuul_keys_t:s0 /var/opt/rh/rh-python35/lib/zuul/keys/
$ sudo semanage port -l | grep zuul
zuul_gearman_port_t            tcp      4730
zuul_webapp_port_t             tcp      8001

Update the policy

With the service resources now declared, let's restart the service and start using it to collect all the access it needs.

After a while, we can update the policy using "./zuul_scheduler.sh --update", which basically does: "ausearch -m avc --raw | audit2allow -R". This collects all the denied permissions and generates type enforcement rules.

We can repeat these steps until all the required accesses are collected.

Here is what the resulting zuul-scheduler rules look like:

allow zuul_scheduler_t gerrit_port_t:tcp_socket name_connect;
allow zuul_scheduler_t mysqld_port_t:tcp_socket name_connect;
allow zuul_scheduler_t net_conf_t:file { getattr open read };
allow zuul_scheduler_t proc_t:file { getattr open read };
allow zuul_scheduler_t random_device_t:chr_file { open read };
allow zuul_scheduler_t zookeeper_client_port_t:tcp_socket name_connect;
allow zuul_scheduler_t zuul_conf_t:dir getattr;
allow zuul_scheduler_t zuul_conf_t:file { getattr open read };
allow zuul_scheduler_t zuul_exec_t:file getattr;
allow zuul_scheduler_t zuul_gearman_port_t:tcp_socket { name_bind name_connect };
allow zuul_scheduler_t zuul_keys_t:dir getattr;
allow zuul_scheduler_t zuul_keys_t:file { create getattr open read write };
allow zuul_scheduler_t zuul_log_t:file { append open };
allow zuul_scheduler_t zuul_var_lib_t:dir { add_name create remove_name write };
allow zuul_scheduler_t zuul_var_lib_t:file { create getattr open rename write };
allow zuul_scheduler_t zuul_webapp_port_t:tcp_socket name_bind;

Once the service is no longer being denied permissions, we can remove the "permissive zuul_scheduler_t;" declaration and deploy it in production. To avoid issues, the domain can be set to permissive at first using:

$ sudo semanage permissive -a zuul_scheduler_t

Too long, didn't read

In short, to confine a service:

  • Use sepolicy generate
  • Declare the service's resources
  • Install the policy and restart the service
  • Use audit2allow

Here are some useful documents:

View article »