Open Ownership Register Exporter

This repository contains the code which exports the beneficial ownership data stored in the Open Ownership Register database. The input source is a MongoDB database, with the data processed and mapped into a Gzipped JSON-lines file, stored in S3.

The Exporter is written in Ruby and runs on "serverless" AWS technologies.

The exporter:

  • is orchestrated by AWS Step Functions
  • uses S3 for temporary storage of processed parts
  • uses S3 for storage of the final output file
  • uses AWS Athena (a serverless Presto implementation) to deduplicate and concatenate the output of small stages
  • uses DynamoDB to store metadata about which parts are completed or queued
  • uses SQS as a queueing system, both for processing parts and for extracting data from completed parts during the final export stage
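
As an illustration of how these pieces fit together, here is a minimal sketch, not the project's actual code, of recording a part in DynamoDB and enqueueing it on SQS. The table name, queue URL variable, and attribute names here are hypothetical.

require "aws-sdk-sqs"
require "aws-sdk-dynamodb"
require "json"

sqs      = Aws::SQS::Client.new(region: "eu-west-1")
dynamodb = Aws::DynamoDB::Client.new(region: "eu-west-1")

# Hypothetical part metadata; the real attribute names may differ.
part = { "export_id" => "example-export", "part_number" => 1, "status" => "queued" }

# Record the part as queued so completed/queued state can be tracked.
dynamodb.put_item(table_name: "export_parts", item: part)

# Enqueue the part for a worker to pick up.
sqs.send_message(
  queue_url:    ENV.fetch("SQS_QUEUE_URL"),
  message_body: JSON.generate(part)
)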

Contents

Setting up local development

Setting up the AWS Stack

The setup in AWS is almost fully automated once the env files have been set up.

Note: currently an existing bucket is needed to hold the code; this should be automated.

Note: currently the scripts have only been tested with admin user access, run from a local environment; this access should be restricted before adding them to any sort of remote scripts.

  1. Create a deployment .deploy.env file

Sample .deploy.env

AWS_REGION=eu-west-1
ACCESS_KEY_ID=
SECRET_ACCESS_KEY=
AWS_ACCOUNT_ID=
CODE_BUCKET=
CODE_PREFIX=code
REPO_NAME=openownership/register-files-combiner
  • The code bucket should be an existing S3 bucket, in the same region you are deploying into.
  • The AWS access key / secret access key should have admin access to your AWS account.
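
The deploy scripts presumably load this file before creating any AWS clients. A minimal sketch of that pattern, assuming the dotenv gem; the scripts' actual loading mechanism may differ:

require "dotenv"
require "aws-sdk-core"

# Load .deploy.env into ENV, then build credentials for the AWS clients.
Dotenv.load(".deploy.env")

credentials = Aws::Credentials.new(
  ENV.fetch("ACCESS_KEY_ID"),
  ENV.fetch("SECRET_ACCESS_KEY")
)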
  2. Create an ENV file in deploy/envs

TODO

  3. Create Stack

Choose an appropriate name for the stack. This would usually be "dev", "staging", "prod", etc., but any alphanumeric name is fine.

bundle exec deploy/create_stack $STACK_NAME
bundle exec deploy/create_stack dev

TODO: currently the DynamoDB tables are deployed with a provisioned capacity of 1, which is not enough when running a large export. They need to autoscale, but this has not been added to the deployment script yet. If necessary, increase the capacities manually before running the export, as sketched below.
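
One way to bump the capacity from Ruby as a stopgap; "export_parts" is a hypothetical table name, so substitute the stack's actual tables:

require "aws-sdk-dynamodb"

dynamodb = Aws::DynamoDB::Client.new(region: "eu-west-1")

# Raise read/write capacity before a large export; revert afterwards to
# avoid unnecessary cost.
dynamodb.update_table(
  table_name: "export_parts", # hypothetical table name
  provisioned_throughput: {
    read_capacity_units:  50,
    write_capacity_units: 50
  }
)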

Code Deployment

  1. Package up the code
bundle exec deploy/package_code $BRANCH
bundle exec deploy/package_code main
  • The script currently accepts only a branch name, but in the future a git SHA would be more appropriate.
  • This will clone a fresh copy of the repository, install dependencies, zip the code up, and upload the final artifact to the S3 bucket defined in .deploy.env

Note: currently this is run locally and uses RVM to install and select the Ruby version. This should be dockerized for simplicity.
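
The upload at the end of packaging amounts to an S3 put_object call; a rough sketch, with the zip filename and key layout assumed rather than taken from the script:

require "aws-sdk-s3"

branch = "main"
s3     = Aws::S3::Client.new(region: ENV.fetch("AWS_REGION"))

# Upload the zipped code under CODE_PREFIX; the exact key naming used by
# deploy/package_code may differ.
File.open("code-#{branch}.zip", "rb") do |file|
  s3.put_object(
    bucket: ENV.fetch("CODE_BUCKET"),
    key:    "#{ENV.fetch('CODE_PREFIX')}/code-#{branch}.zip",
    body:   file
  )
end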

  2. Deploy code

Once the code has been packaged up and the stack has been created, the code will need to be deployed to the stack.

This will update the Lambda environment and the code, but the rest of the settings will remain unchanged.

bundle exec deploy/deploy_code $STACK_NAME $BRANCH $ENV_NAME
bundle exec deploy/deploy_code dev main dev 
  • The STACK_NAME is the name provided when creating the stack
  • The BRANCH is the name provided when packaging the code
  • The ENV_NAME should match the name of the .json file in deploy/envs/{ENV_NAME}.json. For example, deploy/envs/dev.json would have the ENV_NAME dev.
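
As noted above, deployment updates the Lambda code and environment; a sketch of what those two API calls look like, with a hypothetical function name and key layout, and assuming the env file is a flat map of variable names to values:

require "aws-sdk-lambda"
require "json"

lambda_client = Aws::Lambda::Client.new(region: "eu-west-1")

# Assumes deploy/envs/dev.json is a flat map of env var names to values.
env_vars = JSON.parse(File.read("deploy/envs/dev.json"))

# Point the function at the newly packaged code in the code bucket.
lambda_client.update_function_code(
  function_name: "register-files-combiner-dev", # hypothetical name
  s3_bucket:     ENV.fetch("CODE_BUCKET"),
  s3_key:        "code/code-main.zip"           # assumed key layout
)

# Refresh the Lambda environment; other settings remain unchanged.
lambda_client.update_function_configuration(
  function_name: "register-files-combiner-dev",
  environment:   { variables: env_vars }
)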

Running an export

Find the appropriate Step Function in the AWS console:

https://eu-west-1.console.aws.amazon.com/states/home?region=eu-west-1#/statemachines

(The region may differ.)

Open the state machine with the matching name, click "Start Execution", and start the execution with empty input {}.
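
The same can be done programmatically; a sketch using the Step Functions SDK, with a placeholder state machine ARN:

require "aws-sdk-states"

states = Aws::States::Client.new(region: "eu-west-1")

# Start an export with empty input, mirroring "Start Execution" with {}.
resp = states.start_execution(
  state_machine_arn: "arn:aws:states:eu-west-1:123456789012:stateMachine:example", # placeholder
  input: "{}"
)
puts resp.execution_arn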

Wait for the function to complete (production currently takes about 1 hour to export the 27 million records). The results will be written in two places:

  • In {S3_BUCKET} path bods_exports_results/{EXPORT_ID}.jsonl.gz
  • In {S3_BUCKET} path bods_exports_results/export_parts/export={EXPORT_ID}/partX.jsonl.gz

The latter is used when creating the final output file, but it is also useful when inspecting the data, as downloading and unzipping the full file can be unwieldy.
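
For inspecting a single part locally, something like the following works; the bucket and key here are placeholders:

require "aws-sdk-s3"
require "zlib"
require "json"

s3 = Aws::S3::Client.new(region: "eu-west-1")

# Download one part and stream its JSON-lines records without writing
# the decompressed data to disk.
obj = s3.get_object(
  bucket: "my-export-bucket", # placeholder
  key:    "bods_exports_results/export_parts/export=EXPORT_ID/part0.jsonl.gz"
)

Zlib::GzipReader.new(obj.body).each_line do |line|
  record = JSON.parse(line)
  puts record.keys.first(5).inspect # peek at each record's shape
end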

Running tests

There is no CI/CD currently active for this project.

To run tests locally, use the provided .test.env file, altering the Mongo address if necessary.

BODS_EXPORT_S3_BUCKET_NAME=
BODS_EXPORT_AWS_ACCESS_KEY_ID=
BODS_EXPORT_AWS_SECRET_ACCESS_KEY=
ATHENA_DATABASE=
SQS_QUEUE_URL=

Note that this Mongo connection must have write access, and the database's collections will be dropped during the tests. For this reason, tests must only be run with the environment variable RACK_ENV=test.
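
A destructive suite like this is usually protected by a guard of roughly this shape; this is a sketch, not necessarily the project's exact code:

# e.g. at the top of spec/spec_helper.rb: refuse to run destructive tests
# against anything other than the test environment.
abort("Refusing to run: RACK_ENV must be 'test'") unless ENV["RACK_ENV"] == "test"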

RACK_ENV=test bundle exec rspec

This will run the unit and integration tests, which run completely locally and do not use any AWS resources.

End-to-end tests, which spin up a stack and test the whole flow, are in progress.