Add openaddress batch machine #26

Open
wants to merge 5 commits into
base: master
Changes from 4 commits
51 changes: 51 additions & 0 deletions .github/workflows/build_openaddresses.yml
@@ -0,0 +1,51 @@
# This workflow builds the openaddresses-batch-machine Docker image and scans it for vulnerabilities. It does not push the image or deploy to Azure Kubernetes Service.
#
# https://github.com/Azure/actions-workflow-samples/tree/master/Kubernetes
#
# To configure this workflow:
#
# 1. Set up the following secrets in your workspace:
#    a. REGISTRY_USERNAME with ACR username
#    b. REGISTRY_PASSWORD with ACR password
#    c. AZURE_CREDENTIALS with the output of `az ad sp create-for-rbac --sdk-auth` (not referenced by any step below)
#
# 2. Change the value of the REGISTRY_NAME environment variable (below). CLUSTER_NAME and CLUSTER_RESOURCE_GROUP are also defined but not used by any step.
name: build_openaddresses
on: [pull_request]

# Environment variables available to all jobs and steps in this workflow
env:
  REGISTRY_NAME: k8scc01covidacr
  CLUSTER_NAME: k8s-cancentral-02-covid-aks
  CLUSTER_RESOURCE_GROUP: k8s-cancentral-01-covid-aks
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@master

    # Connect to Azure Container registry (ACR)
    - uses: azure/docker-login@v1
      with:
        login-server: ${{ env.REGISTRY_NAME }}.azurecr.io
        username: ${{ secrets.REGISTRY_USERNAME }}
        password: ${{ secrets.REGISTRY_PASSWORD }}

    - name: Free disk space
      run: |
        sudo swapoff -a
        sudo rm -f /swapfile
        sudo apt clean
        docker rmi $(docker image ls -aq)
        df -h

    - run: |
        docker build -f ./openaddresses-batch-machine/container/Dockerfile -t ${{ env.REGISTRY_NAME }}.azurecr.io/daaas-openaddresses-batch-machine:${{ github.sha }} ./openaddresses-batch-machine/container
        docker tag ${{ env.REGISTRY_NAME }}.azurecr.io/daaas-openaddresses-batch-machine:${{ github.sha }} ${{ env.REGISTRY_NAME }}.azurecr.io/daaas-openaddresses-batch-machine:latest

    # Scan image for vulnerabilities
    - uses: Azure/container-scan@v0
      with:
        image-name: ${{ env.REGISTRY_NAME }}.azurecr.io/daaas-openaddresses-batch-machine:${{ github.sha }}
        severity-threshold: CRITICAL
        run-quality-checks: false
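A note on the `Free disk space` step above: `docker rmi` exits with an error when it receives no image IDs, so the step would fail on a runner that has no cached images. A minimal sketch of a guard, assuming the rest of the step stays as written (`remove_all_images` is a hypothetical helper name, not part of this PR):

```shell
# Remove every local Docker image, but only when at least one exists.
# remove_all_images is a hypothetical helper name used for illustration.
remove_all_images() {
    ids=$(docker image ls -aq)
    if [ -n "$ids" ]; then
        # Unquoted on purpose: each image ID must become its own argument.
        docker rmi $ids
    else
        echo "no images to remove"
    fi
}
```

Calling `remove_all_images` in place of the bare `docker rmi $(docker image ls -aq)` line would keep the step from failing when the image list is empty.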
57 changes: 57 additions & 0 deletions .github/workflows/publish_openaddresses.yml
@@ -0,0 +1,57 @@
# This workflow builds the openaddresses-batch-machine Docker image, pushes it to Azure Container Registry, and scans it for vulnerabilities. It does not deploy to Azure Kubernetes Service.
#
# https://github.com/Azure/actions-workflow-samples/tree/master/Kubernetes
#
# To configure this workflow:
#
# 1. Set up the following secrets in your workspace:
#    a. REGISTRY_USERNAME with ACR username
#    b. REGISTRY_PASSWORD with ACR password
#    c. AZURE_CREDENTIALS with the output of `az ad sp create-for-rbac --sdk-auth` (not referenced by any step below)
#
# 2. Change the value of the REGISTRY_NAME environment variable (below). CLUSTER_NAME and CLUSTER_RESOURCE_GROUP are also defined but not used by any step.
name: publish_openaddresses
on:
  push:
    branches:
    - master

# Environment variables available to all jobs and steps in this workflow
env:
  REGISTRY_NAME: k8scc01covidacr
  CLUSTER_NAME: k8s-cancentral-02-covid-aks
  CLUSTER_RESOURCE_GROUP: k8s-cancentral-01-covid-aks
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@master

    # Connect to Azure Container registry (ACR)
    - uses: azure/docker-login@v1
      with:
        login-server: ${{ env.REGISTRY_NAME }}.azurecr.io
        username: ${{ secrets.REGISTRY_USERNAME }}
        password: ${{ secrets.REGISTRY_PASSWORD }}

    - name: Free disk space
      run: |
        sudo swapoff -a
        sudo rm -f /swapfile
        sudo apt clean
        docker rmi $(docker image ls -aq)
        df -h

    # Container build and push to an Azure Container registry (ACR)
    - run: |
        docker build -f ./openaddresses-batch-machine/container/Dockerfile -t ${{ env.REGISTRY_NAME }}.azurecr.io/daaas-openaddresses-batch-machine:${{ github.sha }} ./openaddresses-batch-machine/container
        docker tag ${{ env.REGISTRY_NAME }}.azurecr.io/daaas-openaddresses-batch-machine:${{ github.sha }} ${{ env.REGISTRY_NAME }}.azurecr.io/daaas-openaddresses-batch-machine:latest
        docker push ${{ env.REGISTRY_NAME }}.azurecr.io/daaas-openaddresses-batch-machine:${{ github.sha }}
        docker push ${{ env.REGISTRY_NAME }}.azurecr.io/daaas-openaddresses-batch-machine:latest

    # Scan image for vulnerabilities
    - uses: Azure/container-scan@v0
      with:
        image-name: ${{ env.REGISTRY_NAME }}.azurecr.io/daaas-openaddresses-batch-machine:${{ github.sha }}
        severity-threshold: CRITICAL
        run-quality-checks: false
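The build-and-push step above spells out the full image reference several times. As a sketch of how that reference is composed from the registry name, the repository name and a tag, the hypothetical helper `image_ref` below prints the same kind of string the workflow writes inline; it is illustrative only and not part of this PR.

```shell
# Compose an ACR image reference: <registry>.azurecr.io/<repository>:<tag>.
# image_ref is a hypothetical helper; the workflow writes these strings inline.
image_ref() {
    printf '%s.azurecr.io/%s:%s\n' "$1" "$2" "$3"
}

image_ref k8scc01covidacr daaas-openaddresses-batch-machine latest
# prints: k8scc01covidacr.azurecr.io/daaas-openaddresses-batch-machine:latest
```

In the workflow, the tag is either `${{ github.sha }}`, which identifies one exact build, or `latest`, which moves with every push to master.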
9 changes: 9 additions & 0 deletions openaddresses-batch-machine/README.md
@@ -0,0 +1,9 @@
# Summary

This repo builds and provides access to [this modification](https://github.com/JosephKuchar/batch-machine) of [OpenAddresses batch-machine](https://github.com/openaddresses/batch-machine). The container build is similar to the one specified by the OpenAddresses repo, but it is pinned to a specific commit from JosephKuchar/batch-machine and runs as a non-root user.

# Usage:

See `../.github/workflows/build_openaddresses.yml` (or `publish_openaddresses.yml`) for CI/build details; both workflows build `./container/Dockerfile`.

See `./pipeline/get_openaddresses_data.ipynb` for example usage. Typically, the easiest way to invoke this is through a Kubeflow Pipeline.
26 changes: 26 additions & 0 deletions openaddresses-batch-machine/container/Dockerfile
@@ -0,0 +1,26 @@
FROM alpine:3.11

ENV BATCH_MACHINE_PATH=/batch-machine

RUN apk add nodejs yarn git python3 python3-dev py3-pip \
    py3-gdal gdal gdal-dev make bash sqlite-dev zlib-dev \
    postgresql-libs gcc g++ musl-dev postgresql-dev cairo \
    py3-cairo file

# Download and install Tippecanoe
RUN git clone -b 1.35.0 https://github.com/mapbox/tippecanoe.git /tmp/tippecanoe && \
    cd /tmp/tippecanoe && \
    make && \
    PREFIX=/usr/local make install && \
    rm -rf /tmp/tippecanoe

# Get/install batch-machine
RUN git clone https://github.com/JosephKuchar/batch-machine $BATCH_MACHINE_PATH && \
    pip3 install $BATCH_MACHINE_PATH

# Restrict to non-root access (BusyBox adduser takes the group via -G; -g sets the GECOS field)
RUN addgroup appgroup && \
    adduser -S -G appgroup appuser
USER appuser

CMD python3 ${BATCH_MACHINE_PATH}/test.py
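The README describes the image as pinned to a specific commit of JosephKuchar/batch-machine, while the `git clone` in this Dockerfile takes whatever the default branch points to at build time. A minimal sketch of one way to pin, assuming the commit SHA is supplied by the caller (`clone_at_commit` and its arguments are hypothetical, and no particular SHA is implied):

```shell
# Clone a repository and check out one exact commit, so rebuilds are repeatable.
# Arguments (all placeholders): repository URL or path, commit SHA, destination.
clone_at_commit() {
    repo=$1
    commit=$2
    dest=$3
    git clone --quiet "$repo" "$dest" &&
        git -C "$dest" checkout --quiet --detach "$commit"
}
```

Inside the Dockerfile, the same two git commands could run in the existing `RUN` instruction before `pip3 install`.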
@@ -0,0 +1,35 @@
name: Copy to Minio
inputs:
- {name: Minio URL, type: URL, description: 'Minio instance URL, starting with http://'}
- {name: Minio access key, type: String}
- {name: Minio secret key, type: String}
- {name: Local source, description: 'Local source of upload'}
- {name: Minio destination, type: String, description: 'Minio destination location in format <bucket>/<location_in_bucket>'}
- {name: Flags, optional: true, default: '', type: String, description: 'Flags/options passed to mc'}
outputs:
- {name: Minio destination, type: String}
- {name: md5sum, type: String, description: 'A combined md5sum of all data passed to MinIO'}
implementation:
  container:
    image: minio/mc
    command:
    - sh
    - -ex
    - -c
    - |
      FLAGS=$7
      mkdir -p "$(dirname "$5")"
      mkdir -p "$(dirname "$6")"
      mc config host add my_minio "$0" "$1" "$2"
      mc cp $FLAGS "$3" "my_minio/$4"
      echo "$4" > "$5"
      # Use find in case we retrieved a directory - this gets all files in the dir
      find "$3" -type f -exec md5sum {} \; | sort -k 2 | md5sum | awk '{print $1}' > "$6"
    - {inputValue: Minio URL}
    - {inputValue: Minio access key}
    - {inputValue: Minio secret key}
    - {inputPath: Local source}
    - {inputValue: Minio destination}
    - {outputPath: Minio destination}
    - {outputPath: md5sum}
    - {inputValue: Flags}
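The `md5sum` output of this component comes from a single pipeline: hash every file, sort the lines by path so the order is stable, then hash that listing. A minimal sketch of the same pipeline as a standalone function, assuming GNU or BusyBox `md5sum` is available (`combined_md5` is a hypothetical name used only here):

```shell
# Print one MD5 digest that summarises every regular file under a path.
# combined_md5 is a hypothetical helper name used for illustration.
combined_md5() {
    find "$1" -type f -exec md5sum {} \; | sort -k 2 | md5sum | awk '{print $1}'
}

# Example: two small files in a scratch directory.
workdir=$(mktemp -d)
printf 'a\n' > "$workdir/one.txt"
printf 'b\n' > "$workdir/two.txt"
combined_md5 "$workdir"
```

Because each `md5sum` line includes the file path as well as the digest, the combined value changes when the same files sit under a different path, so it is only comparable between runs that use the same local source location.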
@@ -0,0 +1,26 @@
name: Download Data from OpenAddresses
inputs:
- {name: source_json, type: JsonObject, description: 'OpenAddresses source specification in JSON format'}
- {name: args, type: String, optional: true, default: '', description: 'Optional command line args to pass to openaddr-process-one, such as "--layer addresses --layersource city"'}
outputs:
- {name: data, description: 'All data downloaded from OpenAddresses call'}
implementation:
  container:
    image: k8scc01covidacr.azurecr.io/daaas-openaddresses-batch-machine:latest
    command:
    - sh
    - -ex
    - -c
    - |
      SOURCE_JSON=$0
      ARGS=$1
      OUTPUT_PATH=$2
      mkdir -p "$OUTPUT_PATH"

      cat "$SOURCE_JSON"

      openaddr-process-one "$SOURCE_JSON" "$OUTPUT_PATH" $ARGS

    - {inputPath: source_json}
    - {inputValue: args}
    - {outputPath: data}
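Both component scripts rely on how `sh -c` assigns arguments: the first item after the script becomes `$0`, the next `$1`, and so on. The sketch below reproduces that mapping with a stand-in script that only prints what it received, and shows why `$ARGS` is left unquoted (`show_args` and the paths are hypothetical placeholders):

```shell
# Stand-in for the component script: reports how the arguments arrive.
script='
SOURCE_JSON=$0
ARGS=$1
OUTPUT_PATH=$2
# Unquoted on purpose: $ARGS is split into separate words, as the component does.
set -- $ARGS
printf "%s %s %s\n" "$SOURCE_JSON" "$OUTPUT_PATH" "$#"
'

# show_args is a hypothetical wrapper that passes its arguments after the script.
show_args() {
    sh -c "$script" "$@"
}

show_args /tmp/source.json "--layer addresses --layersource city" /tmp/out
# prints: /tmp/source.json /tmp/out 4
```

The single `args` string reaches `openaddr-process-one` as four separate arguments, and an empty default contributes none. Quoting `$ARGS` would instead pass one argument, which would be an empty string when no args are given.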