DevNetSecOps

Network security group automation with Ansible, AWS, git, and Jenkins to move fast with security and auditability.

Etymology

The title is a joke combining:

  • DevOps: the practice of “…building, evolving and operating rapidly-changing resilient systems at scale” (Jez Humble) [among many definitions]

  • DevSecOps: the practice of integrating security into continuous delivery environments and workflows

  • NetSec: the practice of securing network traffic via encryption, segregation, etc.

Background

Our team, Developer Infrastructure, is charged with accelerating application development while keeping our systems secure and stable. The ability to leverage Amazon Web Services’ (AWS) network security features (security groups), APIs, and tools was a driving factor in adopting it as our compute platform of choice. As we moved more and more of our applications to AWS, we found that centrally controlled network security administration would not scale to allow our teams to move as fast as they needed to.

At Flatiron, our products are rapidly evolving, necessitating frequent updates to production network security rules. In May 2016 there were 26 such changes - more than one per working day! This post describes how we built a safe, auditable way to configure network security that enables our developers and security stakeholders to move quickly.

Infrastructure Configuration Drift

In the early phases of our AWS adoption we controlled security groups through both code and the AWS web console. This put us in a confusing situation. Changes made in the web console would be wiped out whenever we executed our Ansible security group playbook, which resets all security group definitions to what is defined in our code repo. This divergence between the running configuration and the code is called Configuration Drift, and it led to wasted time reconciling what went wrong and how things should be, as well as ambiguity around the preferred source of truth for the configuration.

Solution

Since we have our desired state defined in code, we can have our Jenkins workers read the state of AWS and check whether reality matches our desired state. Jenkins alerts us when the two are out of sync.

Mechanics

We defined a Jenkins job that runs every ten minutes and uses Ansible’s Check Mode to decide if any part of AWS’s configuration has deviated from our code-based definitions. If the Ansible playbook returns output indicating that any group definition would be changed or created then we have deviated from the codebase definition.

Our top level playbook splits groups up by team for organization.

sg-update main tasks.yml

---
- include: infra.yml
- include: team1.yml
- include: team2.yml

Team-level playbooks define security groups that are aligned to that team. In the example below, the infrastructure team manages the Jenkins network security rules.

infra.yml

...
- name: jenkins security group
  ec2_group:
   name: jenkins
   description: jenkins
   vpc_id: ""
   region: us-east-1
   rules:
     - proto: tcp
       from_port: 443
       to_port: 443
       cidr_ip: ""
...

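Our check-security-group.sh script can be run by Jenkins or by users. It exits with a non-zero status (indicating an error) equal to the total number of changed, failed, and unreachable items Ansible reports. When the only problem is drift, that is the number of changes that would have to be made to AWS to sync its definitions to what our repo describes.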

check-security-group.sh

#!/bin/bash -e

# activate python virtual environment

# run sg-update in check (read-only) mode
# report if anything has failed/changed/unreachable
# export the region so the ansible-playbook child process actually sees it
export AWS_DEFAULT_REGION=us-east-1
SG_RESULT=$(ansible-playbook aws/sg-update.yml --check)
echo "${SG_RESULT}"

# ansible output of
## PLAY RECAP ****************************************************************
## localhost                  : ok=131  changed=0    unreachable=0    failed=0
# store as environment variables $ok $changed $unreachable $failed
eval $(echo "${SG_RESULT}" | grep '^localhost' | cut -d":" -f2)
echo "ok: $ok changed: $changed failed: $failed unreachable: $unreachable"
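# exit statuses wrap modulo 256, so cap the total to keep a very large drift from reading as success
total=$((failed + changed + unreachable))
exit $((total > 255 ? 255 : total))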
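
The script above reads the recap counters by passing Ansible's output through eval. A minimal sketch of a stricter alternative follows, assuming the same PLAY RECAP format; `recap_total` is a hypothetical helper name and is not part of our original script.

```shell
#!/bin/bash
# Hypothetical helper: sum changed, unreachable, and failed from a PLAY RECAP line
# without eval. Assumes key=value pairs after "localhost :", as in the recap above.
recap_total() {
  local recap_line pair key value total=0
  # Take the first line starting with "localhost" and keep what follows the colon.
  recap_line=$(printf '%s\n' "$1" | grep '^localhost' | head -n 1 | cut -d':' -f2)
  # No recap line means the run did not finish; report that instead of printing 0.
  [ -n "$recap_line" ] || return 1
  for pair in $recap_line; do
    key=${pair%%=*}
    value=${pair#*=}
    case "$key" in
      changed|unreachable|failed) total=$((total + value)) ;;
    esac
  done
  echo "$total"
}

sample='PLAY RECAP ****************
localhost                  : ok=131  changed=2    unreachable=0    failed=1'
recap_total "$sample"   # prints 3
```

Only the three counters of interest are read, so extra fields in the recap line are ignored rather than evaluated.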

In order for Jenkins to read the state of security groups, we assign it this IAM policy via an EC2 IAM Role.

{
   "Statement": [
       {
           "Action": "ec2:DescribeSecurityGroups",
           "Effect": "Allow",
           "Resource": "*",
           "Sid": "AllowSG"
       }
   ],
   "Version": "2012-10-17"
}

Workflow

  1. Engineer proposes Security Group changes in a code diff via Phabricator (our code review tool)
  2. Security and/or our team approves after review
  3. Engineer merges changes to master
  4. Drift detected (code is ahead of cloud conf) - Notify #aws

    Mechanism: check-security-group.sh returns >0 if anything would be updated, which turns the Jenkins job red

  5. Admin acts to run playbook
  6. Jenkins job turns green again
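
The drift-detection step of this workflow can be sketched as a small wrapper that runs the check and turns its exit status into a one-line status message. This is only an illustration under stated assumptions: `drift_message` and `run_check` are hypothetical names, and in our setup the actual notification is sent by Jenkins, not by this code.

```shell
#!/bin/bash
# Hypothetical helper: describe a check exit status in one line.
drift_message() {
  local status=$1
  if [ "$status" -eq 0 ]; then
    echo "security groups in sync with repo"
  else
    echo "drift detected: check exited with status ${status}; review and run the playbook"
  fi
}

# Hypothetical wrapper: run any check command, report the result, and
# pass the command's exit status through so the build still goes red.
run_check() {
  local status=0
  "$@" || status=$?
  drift_message "$status"
  return "$status"
}

# Stand-in commands for illustration; a real job would call
# run_check ./check-security-group.sh
run_check true
run_check sh -c 'exit 3' || true
```

Passing the status through unchanged keeps the red/green signal intact while adding a readable message to the build log.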

Systems Design Principles

Simplicity: the core systems design principle. Complexity is inertia against change and participation.

Our automation introduces no new systems or tools - just a new Jenkins build job. We were already using Slack, Jenkins, AWS, Phabricator, and Ansible. Our security group rules were already defined as code. Proposing a change is a diff that goes through our normal code review process.

Reproducibility: if I cannot tear it down and remake it automatically, it is not simple.
A destructive change to our security group definitions can be reverted or reset instantly.

Transparency: I should be able to observe the state of the system quickly and easily. The configuration repo is simple to read. The status of compliance is a Jenkins green light or red X.

Visibility: when things are going wrong the right parties should be notified. Chat notifications from the Jenkins job alert administrators when the system needs attention. Consumers of network security may opt into monitoring configuration management files via Phabricator, our code review tool. For example, I am CC’d on all infrastructure-related security group changes.

[Screenshot: Phabricator]

Auditability: I should be able to say what has happened before by observing a system of record. Our Jenkins build record shows all deviations from approved configurations. Every line of our security group automation is attributed to a commit and an author, and each commit ties back to a code diff that shows security or engineering approval.

[Screenshot: git blame]

Self-service: improvements and additions are not only the domain of sysadmins. All engineers can suggest changes and observe what currently exists.

Why only allow admins to execute changes?

We wanted to allow as much self-service as possible. When weighing security against ease of use, we decided that a single gatekeeper step before production was the best tradeoff for us.

Conclusion

We think trading off between moving fast and staying secure is a false dilemma. We hope to allow our teams to move fast and stay secure through workflows like this.

If this sounds cool, please consider spending time in our #devsecnetops-geek channel at Flatiron as part of our engineering team :)

Thank you to contributors, reviewers, and editors: Nick Arvanitis, Dan Eisenberg, Darren Gruber, Ann Jaskiw, Brian McNamara, and Joe Mou