This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Merge pull request #19 from ds-wizard/release/2.5.0
Release 2.5.0
MarekSuchanek committed Jul 8, 2020
2 parents 6878896 + 7f7c80e commit 67be649
Showing 21 changed files with 274 additions and 417 deletions.
1 change: 1 addition & 0 deletions .github/FUNDING.yml
@@ -0,0 +1 @@
github: ds-wizard
2 changes: 2 additions & 0 deletions .gitignore
@@ -131,3 +131,5 @@ dmypy.json
# IDE
.idea/
.vscode/

workdir/
7 changes: 6 additions & 1 deletion Dockerfile
@@ -24,4 +24,9 @@ COPY . /app

RUN python setup.py install

CMD ["docworker", "/app/config.yml", "/app/templates"]
ENV DOCWORKER_CONFIG /app/config.yml
ENV DOCWORKER_WORKDIR /tmp/docworker

RUN mkdir /tmp/docworker

CMD ["docworker"]
20 changes: 2 additions & 18 deletions README.md
@@ -1,9 +1,11 @@
# Data Stewardship Wizard Document Worker

[![Documentation Status](https://readthedocs.org/projects/ds-wizard/badge/?version=latest)](https://docs.ds-wizard.org/en/latest/?badge=latest)
[![GitHub release (latest SemVer)](https://img.shields.io/github/v/release/ds-wizard/document-worker)](https://github.com/ds-wizard/document-worker/releases)
[![Docker Pulls](https://img.shields.io/docker/pulls/datastewardshipwizard/document-worker)](https://hub.docker.com/r/datastewardshipwizard/document-worker)
[![Document Worker CI](https://github.com/ds-wizard/document-worker/workflows/Document%20Worker%20CI/badge.svg?branch=master)](https://github.com/ds-wizard/document-worker/actions)
[![GitHub](https://img.shields.io/github/license/ds-wizard/document-worker)](LICENSE)
[![Documentation Status](https://readthedocs.org/projects/ds-wizard/badge/?version=latest)](https://docs.ds-wizard.org/en/latest/)

*Worker for assembling and transforming documents*

@@ -14,23 +16,6 @@
- [wkhtmltopdf](https://github.com/wkhtmltopdf/wkhtmltopdf)
- [pandoc](https://github.com/jgm/pandoc)

## Templates

We are using HTML Jinja2 templates described by a JSON file within specified directory. The JSON file can look like this:

```json
{
"uuid": "4bfe909b-7dbc-40a7-8609-085e9af1df98",
"name": "My cool template",
"rootFile": "my/relative/dir/index.html.j2",
"wkhtmltopdf": "",
"pandoc": ""
}
```

The `wkhtmltopdf` and `pandoc` fields are optional and you can specify extra command line options and arguments for calls of those commands for converting document. Path specified in `rootFile` is relative to JSON file, then paths in Jinja2 are relative to the root file.


## Docker

Docker image is prepared with basic dependencies and worker installed. It is available though Docker Hub: [datastewardshipwizard/document-worker](https://hub.docker.com/r/datastewardshipwizard/document-worker).
@@ -46,7 +31,6 @@ $ docker build . -t docworker:local
### Mount points

- `/app/config.yml` = configuration file (see [example](config.yml))
- `/app/templates` = directory with templates
- `/usr/share/fonts/<type>/<name>` = fonts according to [Debian wiki](https://wiki.debian.org/Fonts/PackagingPolicy) (for wkhtmltopdf)

### Fonts
42 changes: 22 additions & 20 deletions config.yml
@@ -1,37 +1,39 @@
mongo:
host: localhost
port: 27017
database: test_db
collection: documents
fs_collection: documentFs
# host: localhost
# port: 27017
database: engine-wizard
# collection: documents
# fs_collection: documentFs
# templates_collection: templates
# assets_fs_collection: templateAssetFs
# auth:
# username:
# password:
# database: (database)
# mechanism: SCRAM-SHA-256

mq:
host: localhost
port: 5672
vhost: /
queue: test_queue
#mq:
# host: localhost
# port: 5672
# vhost: /
# queue: document.generation
# auth:
# username:
# password:

logging:
level: INFO

documents:
naming:
strategy: sanitize # uuid|slugify|sanitize
#documents:
# naming:
# strategy: sanitize # uuid|slugify|sanitize

externals:
pandoc:
executable: pandoc
args: --standalone
#externals:
# pandoc:
# executable: pandoc
# args: --standalone
# timeout:
wkhtmltopdf:
executable: wkhtmltopdf
args:
# wkhtmltopdf:
# executable: wkhtmltopdf
# args:
# timeout:
1 change: 0 additions & 1 deletion docker-compose.yml
@@ -5,5 +5,4 @@ services:
context: .
dockerfile: Dockerfile
volumes:
- './examples/templates:/app/templates:ro'
- './config.yml:/app/config.yml:ro'
10 changes: 6 additions & 4 deletions document_worker/cli.py
@@ -29,10 +29,12 @@ def validate_config(ctx, param, value: IO):


@click.command(name='docworker')
@click.argument('config', type=click.File('r'), callback=validate_config)
@click.argument('templates_dir', type=click.Path(exists=True, file_okay=False, dir_okay=True))
def main(config: DocumentWorkerConfig, templates_dir):
worker = DocumentWorker(config, templates_dir)
@click.argument('config', envvar='DOCWORKER_CONFIG',
type=click.File('r'), callback=validate_config)
@click.argument('workdir', envvar='DOCWORKER_WORKDIR',
type=click.Path(dir_okay=True, exists=True))
def main(config: DocumentWorkerConfig, workdir: str):
worker = DocumentWorker(config, workdir)
try:
worker.run()
except Exception as e:
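
The `envvar=` settings added to the two `click.argument` declarations above let the worker start with no positional arguments when `DOCWORKER_CONFIG` and `DOCWORKER_WORKDIR` are set, which is what the new Dockerfile relies on. A minimal sketch of that precedence rule using only the standard library; `resolve` is a hypothetical helper written for illustration and is not part of `click` or of this repository:

```python
import os


def resolve(cli_value, env_name: str, environ=os.environ):
    """Prefer an explicit command-line value, else fall back to the environment."""
    if cli_value is not None:
        return cli_value
    if env_name in environ:
        return environ[env_name]
    # Assumption: a missing value is fatal; the real CLI reports a usage error.
    raise SystemExit(f'Missing value: pass it as an argument or set {env_name}')


# Simulated container environment, mirroring the ENV lines in the Dockerfile.
env = {
    'DOCWORKER_CONFIG': '/app/config.yml',
    'DOCWORKER_WORKDIR': '/tmp/docworker',
}

config_path = resolve(None, 'DOCWORKER_CONFIG', env)        # taken from the environment
workdir = resolve('/data/work', 'DOCWORKER_WORKDIR', env)   # explicit value wins
```

An explicit argument still overrides the environment, so a local run with custom paths keeps working alongside the containerized default.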
15 changes: 14 additions & 1 deletion document_worker/config.py
@@ -18,12 +18,15 @@ class MongoConfig:

def __init__(self, host: str, port: int, username: str, password: str,
database: str, collection: str, fs_collection: str,
templates_collection: str, assets_fs_collection: str,
auth_database: str, auth_mechanism: str):
self.host = host
self.port = port
self.database = database
self.collection = collection
self.fs_collection = fs_collection
self.templates_collection = templates_collection
self.assets_fs_collection = assets_fs_collection
self.username = username
self.password = password
self.auth_database = auth_database
@@ -36,6 +39,8 @@ def __str__(self):
f'- database = {self.database} ({type(self.database)})\n' \
f'- collection = {self.collection} ({type(self.collection)})\n' \
f'- fs_collection = {self.fs_collection} ({type(self.fs_collection)})\n' \
f'- templates_collection = {self.templates_collection} ({type(self.templates_collection)})\n' \
f'- assets_fs_collection = {self.assets_fs_collection} ({type(self.assets_fs_collection)})\n' \
f'- username = {self.username} ({type(self.username)})\n' \
f'- password = {self.password} ({type(self.password)})\n' \
f'- auth_database = {self.auth_database} ({type(self.auth_database)})\n' \
@@ -104,7 +109,7 @@ def __init__(self, level, message_format: str):
self.message_format = message_format

def __str__(self):
return f'MQueueConfig\n' \
return f'LoggingConfig\n' \
f'- level = {self.level} ({type(self.level)})\n' \
f'- message_format = {self.message_format} ({type(self.message_format)})\n'

@@ -175,6 +180,8 @@ class DocumentWorkerYMLConfigParser:
'port': 27017,
'collection': 'documents',
'fs_collection': 'documentFs',
'templates_collection': 'templates',
'assets_fs_collection': 'templateAssetFs',
MONGO_AUTH_SUBSECTION: {
'username': None,
'password': None,
@@ -274,6 +281,8 @@ def mongo(self) -> MongoConfig:
database=self.get_or_default(self.MONGO_SECTION, 'database'),
collection=self.get_or_default(self.MONGO_SECTION, 'collection'),
fs_collection=self.get_or_default(self.MONGO_SECTION, 'fs_collection'),
templates_collection=self.get_or_default(self.MONGO_SECTION, 'templates_collection'),
assets_fs_collection=self.get_or_default(self.MONGO_SECTION, 'assets_fs_collection'),
username=self.get_or_default(self.MONGO_SECTION, self.MONGO_AUTH_SUBSECTION, 'username'),
password=self.get_or_default(self.MONGO_SECTION, self.MONGO_AUTH_SUBSECTION, 'password'),
auth_database=self.get_or_default(self.MONGO_SECTION, self.MONGO_AUTH_SUBSECTION, 'database'),
@@ -337,6 +346,8 @@ class DocumentWorkerCFGConfigParser(configparser.ConfigParser):
'password': None,
'collection': 'documents',
'fs_collection': 'documentFs',
'templates_collection': 'templates',
'assets_fs_collection': 'templateAssetFs',
'auth_database': None,
'auth_mechanism': 'SCRAM-SHA-256'
},
@@ -418,6 +429,8 @@ def mongo(self) -> MongoConfig:
database=self.get_or_default(self.MONGO_SECTION, 'database'),
collection=self.get_or_default(self.MONGO_SECTION, 'collection'),
fs_collection=self.get_or_default(self.MONGO_SECTION, 'fs_collection'),
templates_collection=self.get_or_default(self.MONGO_SECTION, 'templates_collection'),
assets_fs_collection=self.get_or_default(self.MONGO_SECTION, 'assets_fs_collection'),
username=self.get_or_default(self.MONGO_SECTION, 'username'),
password=self.get_or_default(self.MONGO_SECTION, 'password'),
auth_database=self.get_or_default(self.MONGO_SECTION, 'auth_database'),
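
The two new Mongo settings are read through `get_or_default`, so a config file that omits them falls back to `templates` and `templateAssetFs`. The body of `get_or_default` is not shown in this diff; the following is a sketch of how such a nested lookup with defaults could work, assuming the section key is `mongo` (as in `config.yml`) and using a free function rather than the parser method:

```python
# Defaults copied from the values visible in the diff; the 'mongo' key is an assumption.
DEFAULTS = {
    'mongo': {
        'port': 27017,
        'collection': 'documents',
        'fs_collection': 'documentFs',
        'templates_collection': 'templates',
        'assets_fs_collection': 'templateAssetFs',
    },
}


def get_or_default(config: dict, defaults: dict, *path: str):
    """Return config[path...] if present, otherwise defaults[path...], else None."""
    def lookup(tree, keys):
        for key in keys:
            if not isinstance(tree, dict) or key not in tree:
                return False, None
            tree = tree[key]
        return True, tree

    found, value = lookup(config, path)
    if found:
        return value
    found, value = lookup(defaults, path)
    return value if found else None


# A config that only sets the database, like the new config.yml.
user_config = {'mongo': {'database': 'engine-wizard'}}

templates = get_or_default(user_config, DEFAULTS, 'mongo', 'templates_collection')
database = get_or_default(user_config, DEFAULTS, 'mongo', 'database')
```

This is why the new `config.yml` can comment out almost every key: only values that differ from the defaults need to be written.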
33 changes: 32 additions & 1 deletion document_worker/consts.py
@@ -13,7 +13,7 @@ class DocumentField:
UUID = 'uuid'
NAME = 'name'
STATE = 'state'
TEMPLATE = 'templateUuid'
TEMPLATE = 'templateId'
FORMAT = 'formatUuid'
RETRIEVED = 'retrievedAt'
FINISHED = 'finishedAt'
@@ -22,6 +22,37 @@ class DocumentField:
METADATA_FILENAME = 'fileName'


class TemplateFileField:
FILENAME = 'fileName'
CONTENT = 'content'


class TemplateAssetField:
UUID = 'uuid'
FILENAME = 'fileName'
CONTENT_TYPE = 'contentType'


class FormatField:
UUID = 'uuid'
NAME = 'name'
STEPS = 'steps'


class StepField:
NAME = 'name'
OPTIONS = 'options'


class TemplateField:
ID = 'id'
NAME = 'name'
METAMODEL_VERSION = 'metamodelVersion'
FILES = 'files'
FORMATS = 'formats'
ASSETS = 'assets'


class JobDataField:
DOCUMENT_UUID = 'documentUuid'
DOCUMENT_CONTEXT = 'documentContext'
25 changes: 18 additions & 7 deletions document_worker/conversions.py
@@ -8,11 +8,12 @@
from document_worker.documents import FileFormat, FileFormats


def run_conversion(args: list, input_data: bytes, name: str,
def run_conversion(*, args: list, workdir: str, input_data: bytes, name: str,
source_format: FileFormat, target_format: FileFormat, timeout=None) -> bytes:
command = ' '.join(args)
logging.info(f'Calling "{command}" to convert from {source_format} to {target_format}')
p = subprocess.Popen(args,
cwd=workdir,
stdin=subprocess.PIPE,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
@@ -44,12 +45,17 @@ def __init__(self, config: DocumentWorkerConfig = None):
self.config = config

def __call__(self, source_format: FileFormat, target_format: FileFormat,
data: bytes, metadata: dict) -> bytes:
data: bytes, metadata: dict, workdir: str) -> bytes:
template_args = self.extract_template_args(metadata)
command = self.config.wkhtmltopdf.command + self.ARGS1 + template_args + self.ARGS2
return run_conversion(
command, data, type(self).__name__, source_format, target_format,
timeout=self.config.wkhtmltopdf.timeout
args=command,
workdir=workdir,
input_data=data,
name=type(self).__name__,
source_format=source_format,
target_format=target_format,
timeout=self.config.wkhtmltopdf.timeout,
)

@staticmethod
@@ -63,13 +69,18 @@ def __init__(self, config: DocumentWorkerConfig = None):
self.config = config

def __call__(self, source_format: FileFormat, target_format: FileFormat,
data: bytes, metadata: dict) -> bytes:
data: bytes, metadata: dict, workdir: str) -> bytes:
args = ['-f', source_format.name, '-t', target_format.name, '-o', '-']
template_args = self.extract_template_args(metadata)
command = self.config.pandoc.command + template_args + args
return run_conversion(
command, data, type(self).__name__, source_format, target_format,
timeout=self.config.pandoc.timeout
args=command,
workdir=workdir,
input_data=data,
name=type(self).__name__,
source_format=source_format,
target_format=target_format,
timeout=self.config.pandoc.timeout,
)

@staticmethod
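
The `run_conversion` change above makes every argument keyword-only (the leading `*`) and starts the external tool with `cwd=workdir`, so relative paths in a template resolve inside the working directory. A minimal sketch of that calling pattern, assuming a stand-in command (the Python interpreter in place of `pandoc` or `wkhtmltopdf`) and a hypothetical helper name `run_in_workdir`; the timeout and error handling are assumptions, since that part of the function is not shown in the diff:

```python
import subprocess
import sys
import tempfile


def run_in_workdir(*, args: list, workdir: str, input_data: bytes,
                   timeout: float = None) -> bytes:
    """Run an external command inside workdir, feeding input_data on stdin."""
    p = subprocess.Popen(args,
                         cwd=workdir,
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    try:
        stdout, stderr = p.communicate(input=input_data, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Assumption: kill the child and re-raise; the real worker may differ.
        p.kill()
        p.communicate()
        raise
    if p.returncode != 0:
        message = stderr.decode(errors='replace')
        raise RuntimeError(f'Command failed ({p.returncode}): {message}')
    return stdout


# Stand-in "converter": upper-cases whatever arrives on stdin.
UPPERCASE = [sys.executable, '-c',
             'import sys; sys.stdout.write(sys.stdin.read().upper())']

with tempfile.TemporaryDirectory() as workdir:
    result = run_in_workdir(args=UPPERCASE, workdir=workdir,
                            input_data=b'hello', timeout=30)
```

Keyword-only arguments make call sites such as the two `__call__` methods above self-describing, and adding `workdir` could not silently shift the meaning of a positional argument.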
36 changes: 12 additions & 24 deletions document_worker/templates/formats.py
@@ -2,34 +2,24 @@
import uuid

from document_worker.templates.steps import create_step, FormatStepException
from document_worker.consts import FormatField, StepField
from document_worker.documents import DocumentFile


class FormatMetaField:
UUID = 'uuid'
NAME = 'name'
STEPS = 'steps'


class StepMetaField:
NAME = 'name'
OPTIONS = 'options'


class Format:

FORMAT_META_REQUIRED = [FormatMetaField.UUID,
FormatMetaField.NAME,
FormatMetaField.STEPS]
FORMAT_META_REQUIRED = [FormatField.UUID,
FormatField.NAME,
FormatField.STEPS]

STEP_META_REQUIRED = [StepMetaField.NAME,
StepMetaField.OPTIONS]
STEP_META_REQUIRED = [StepField.NAME,
StepField.OPTIONS]

def __init__(self, template, metadata: dict):
self.template = template
self._verify_metadata(metadata)
self.uuid = uuid.UUID(metadata[FormatMetaField.UUID])
self.name = metadata[FormatMetaField.NAME]
self.uuid = uuid.UUID(metadata[FormatField.UUID])
self.name = metadata[FormatField.NAME]
logging.info(f'Setting up format "{self.name}" ({self.uuid})')
self.steps = self._create_steps(metadata)
if len(self.steps) < 1:
@@ -39,26 +29,24 @@ def _verify_metadata(self, metadata: dict):
for required_field in self.FORMAT_META_REQUIRED:
if required_field not in metadata:
self.template.raise_exc(f'Missing required field {required_field} for format')
for step in metadata[FormatMetaField.STEPS]:
for step in metadata[FormatField.STEPS]:
for required_field in self.STEP_META_REQUIRED:
if required_field not in step:
self.template.raise_exc(f'Missing required field {required_field} for step in format "{self.name}"')

def _create_steps(self, metadata: dict):
steps = []
for step_meta in metadata[FormatMetaField.STEPS]:
step_name = step_meta[StepMetaField.NAME]
step_options = step_meta[StepMetaField.OPTIONS]
for step_meta in metadata[FormatField.STEPS]:
step_name = step_meta[StepField.NAME]
step_options = step_meta[StepField.OPTIONS]
try:
steps.append(
create_step(self.template.config, self.template, step_name, step_options)
)
except FormatStepException as e:
import logging
logging.warning('Handling job exception', exc_info=True)
self.template.raise_exc(f'Cannot load step "{step_name}" of format "{self.name}": {e.message}')
except Exception as e:
import logging
logging.warning('Handling job exception', exc_info=True)
self.template.raise_exc(f'Cannot load step "{step_name}" of format "{self.name}" ({e})')
return steps
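
With the field names moved to `consts.py`, the required-field check in `_verify_metadata` reads from `FormatField` and `StepField`. The sketch below illustrates that check in isolation. It is not the repository's code: it collects problems in a list instead of calling `template.raise_exc`, the helper name `missing_fields` is hypothetical, and the step name `jinja` is only a placeholder:

```python
class FormatField:
    UUID = 'uuid'
    NAME = 'name'
    STEPS = 'steps'


class StepField:
    NAME = 'name'
    OPTIONS = 'options'


def missing_fields(metadata: dict) -> list:
    """List human-readable problems found in a format's metadata."""
    problems = []
    for field in (FormatField.UUID, FormatField.NAME, FormatField.STEPS):
        if field not in metadata:
            problems.append(f'format: missing {field}')
    # Use .get so a missing 'steps' key is reported once instead of raising KeyError.
    for index, step in enumerate(metadata.get(FormatField.STEPS, [])):
        for field in (StepField.NAME, StepField.OPTIONS):
            if field not in step:
                problems.append(f'step {index}: missing {field}')
    return problems


good = {'uuid': '4bfe909b-7dbc-40a7-8609-085e9af1df98', 'name': 'PDF',
        'steps': [{'name': 'jinja', 'options': {}}]}
bad = {'name': 'PDF', 'steps': [{'name': 'jinja'}]}

good_problems = missing_fields(good)
bad_problems = missing_fields(bad)
```

Reporting by step index rather than by format name keeps the check independent of any attribute that is assigned only after verification has passed.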