This is the first in a series of articles demonstrating how to engineer a machine learning pipeline and deploy it to a production environment. We’re going to assume that a solution to an ML problem already exists within a Jupyter notebook, and that our task is to engineer this solution into an operational ML system that can train a model, serve it via a web API and automatically repeat this process on a schedule when new data is made available.
The focus will be on software engineering and DevOps, as applied to ML, with an emphasis on ‘best practices’. All of the code developed in each part of this project is available on GitHub, with a dedicated branch for each part, so you can explore the code in its various stages of development.
This first part focuses on how to set up an ML pipeline engineering project and covers:
A manufacturer of industrial spare-parts wants the ability to give its customers an estimate for the time it could take to dispatch an order. This depends on how many existing orders have yet to be processed, such that customers ordering late on a busy day can encounter unexpected delays, which sometimes leads to complaints; this is an exercise in keeping customers happy by managing their expectations.
Orders are placed on a B2B eCommerce platform that is developed and maintained by the manufacturer’s in-house software engineering team. The product manager for the platform wants the estimated dispatch time to be presented to the customer (through the UI) before they place an order.
A data scientist has worked on this (regression) task and has handed us the Jupyter notebook containing their solution. They have concluded that optimal performance can be achieved by training on the preceding week’s orders data, so the model will have to be re-trained and redeployed on a weekly basis.
At the end of each week, the data engineering team deliver a new tranche of training data, as a CSV file on cloud object storage (AWS S3). The platform engineering team want access to order-dispatch estimates via a web service with a simple REST API, and have supplied us with an example request and response (reproduced below). The platform and data engineering teams both deploy their systems and services to AWS, and we too are required to deploy our solution (the pipeline) to AWS.
The architecture for the target solution is outlined above - the workflow is as follows:
The pipeline will be split into two stages, each of which will be implemented as an executable Python module:
The pipeline will be deployed in containers to AWS EKS (managed Kubernetes cluster), using Bodywork.
The files in the project’s git repository are organised as follows:
This file contains the configuration for the project’s CI/CD pipeline, using CircleCI. CI/CD and CircleCI will be discussed in more depth later on.
All of the Jupyter notebooks required to understand the ML solution to the business problem. All of the Python package requirements to run the notebooks should be included in notebooks/requirements_nb.txt.
All Python modules that define the pipeline.
Python modules defining automated tests for the pipeline.
Python packages required by the CI/CD pipeline - e.g. for running tests and deploying the pipeline.
Python packages required by the pipeline - e.g. Scikit-Learn, FastAPI, etc.
`flake8.ini` & `mypy.ini`
Configuration for the Tox test automation framework. Tox automates test execution and executes all tests in fresh Python virtual environments, isolating them from the idiosyncrasies of the local development environment.
Bodywork deployment configuration file.
We’ve split the various Python package requirements into separate files:
We’re planning to deploy the pipeline using Bodywork, which currently targets the Python 3.9 runtime, so we create a Python 3.9 virtual environment in which to install all requirements.
We’re going to use pytest to support test development and we’re going to run the tests via the Tox test automation framework. The best way to get this operational is to write some skeleton code for the pipeline that can be covered by a couple of basic tests. For example, at a trivial level the `train_model.py` batch job should provide us with some basic logs, whose existence we can test for in `test_train_model.py`. Taking a Test-Driven Development (TDD) approach, we start with the test in `test_train_model.py`,
Where we use pytest’s `caplog` fixture to capture log messages. We now provide the implementation in `train_model.py`,
Where `configure_logger` configures a Python logger that will be common to both `train_model.py` and `serve_model.py`.
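The original listings for this test and its implementation are not reproduced in this text. As a minimal sketch of the idea, assuming a logger named `pipeline`, a log format and a log message that are all hypothetical, the skeleton batch job and its test could look something like this (shown in one block for brevity, although the article keeps them in separate modules):

```python
import logging
import sys


def configure_logger(name: str = "pipeline") -> logging.Logger:
    """Return a logger that writes to stdout (the name is an assumption)."""
    log = logging.getLogger(name)
    if not log.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        )
        log.addHandler(handler)
    log.setLevel(logging.INFO)
    return log


log = configure_logger()


def main() -> None:
    """Skeleton batch job: for now it only emits a log message."""
    log.info("Hello from the train_model stage.")  # hypothetical message


def test_main_logs_a_message(caplog) -> None:
    """pytest injects the caplog fixture, which records emitted log records."""
    caplog.set_level(logging.INFO)
    main()
    assert "Hello from the train_model stage" in caplog.text
```

Because the logger propagates its records to the root logger, `caplog` can capture them without any change to the logging configuration.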
Similarly for the `serve_model.py` module, we can write a trivial test for the REST API endpoint in `test_serve_model.py`,
This loads the FastAPI test client and uses it to verify that sending a request with valid data results in a response with an HTTP status code of `200`, but sending invalid data results in an HTTP `422` (Unprocessable Entity) error. In `serve_model.py` we implement the code to satisfy these tests,
You can run all tests in the tests folder using,
Or isolate a specific test using the `-k` flag, for example,
Tox is a test automation framework that helps to manage groups of tests, together with isolated environments in which to run them. Configuration for Tox is defined in `tox.ini`, which is reproduced below.
Calling Tox from the command line,
Will run every set of tests - those defined in the commands tagged with `unit_and_functional` and `static_code_analysis` - for every chosen environment, which in this case is just Python 3.9 (`py39`). This environment will have none of the environment variables or commands that are present in the local shell, unless they’ve been specified (we haven’t), and can only use the packages specified in `requirements_cicd.txt` and `requirements_pipe.txt`. Individual test-environment pairs can be executed using the `-e` flag, for example,
Will only run Flake8 and MyPy (static code analysis tools) and leave out the unit and functional tests. For more information on working with Tox, see the documentation.
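The original `tox.ini` listing is not reproduced in this text. Purely as an illustration, a configuration consistent with the description above might look like the following. The `pipeline` and `tests` directory names and the exact commands are assumptions; only the environment names and the requirements files come from the text.

```ini
[tox]
envlist = py39-{unit_and_functional,static_code_analysis}
skipsdist = true

[testenv]
deps =
    -rrequirements_cicd.txt
    -rrequirements_pipe.txt
commands =
    unit_and_functional: pytest tests/
    static_code_analysis: mypy --config-file mypy.ini pipeline
    static_code_analysis: flake8 --config flake8.ini pipeline
```

With a layout like this, each command is prefixed with the factor that selects it, so `py39-static_code_analysis` runs only the MyPy and Flake8 commands.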
Sometimes you just need to test on an ad hoc basis, by running the modules, setting breakpoints, etc. You can run the batch job in `train_model.py` using,
Which should print the following to stdout,
Similarly, the web API defined in `serve_model.py` can be started with,
Which should print the following to stdout,
And make the API available for testing locally - e.g., issuing the following request from the command line,
As defined in the tests. FastAPI will also automatically expose the following endpoints on your service:
Here at Bodywork HQ, we’re advocates for the “Hello, Production” school-of-thought that encourages teams to make the deployment of a skeleton application (such as the trivial pipeline sketched-out in this article) one of the first tasks for any new project. As we have written about before, there are many benefits to taking deployment pains early on in a software development project, and then using the initial deployment skeleton as the basis for rapidly delivering useful functionality into production.
We’re planning to deploy to Kubernetes using Bodywork, but we appreciate that not everyone has easy access to a Kubernetes cluster for development. If this is your reality, then the next best thing your team could do is to start by deploying to a local test cluster, to make sure that the pipeline is at least deployable. You can get started with a single node cluster on your laptop, using Minikube - see our guide to get this up-and-running in under 10 minutes.
The full description of the deployment is contained in `bodywork.yaml`, which we’ve reproduced below.
This describes a deployment with two stages - `train_model` and `serve_model` - that are executed one after the other, as described in `pipeline.DAG`. For more information on how to configure a Bodywork deployment, check out the User Guide.
Once you have access to a test cluster, configure it for Bodywork deployments,
And then deploy the workflow directly from the GitHub repository (so make sure all commits have been pushed to your remote branch),
We like to watch our deployments rolling-out using the Kubernetes dashboard, as you can see in the video clip below.
Once the deployment has completed successfully, retrieve the details of the prediction service,
You can manually test the deployed prediction endpoint using,
Which should return the same response as before,
See our guide to accessing services for information on how to determine `CLUSTER_IP`.
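If you would rather script this manual check than type it at the command line, the same kind of request can be issued from Python using only the standard library. This is a minimal sketch: the helper names are hypothetical, and the endpoint URL and payload fields are placeholders that you would replace with the values for your own deployment.

```python
import json
from urllib.request import Request, urlopen


def build_request(endpoint_url: str, payload: dict) -> Request:
    """Build a JSON POST request for a prediction endpoint."""
    return Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def get_prediction(endpoint_url: str, payload: dict, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urlopen(build_request(endpoint_url, payload), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `get_prediction(url, {"product_code": "SKU001", "orders_placed": 100})` would post a hypothetical order to whichever `url` your service is exposed on, whether that is the local server or the address derived from `CLUSTER_IP`.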
Now that the overall structure of the project has been created, all that remains is to put in place the processes required to get new code merged and deployed as quickly and efficiently as possible. The process of getting new code merged on an ad hoc basis is referred to as Continuous Integration (CI), while getting new code deployed as soon as it is merged is known as Continuous Deployment (CD). The workflow we intend to impose is outlined in the diagram above. Briefly:
Here at Bodywork HQ we use GitHub and CircleCI to run this workflow. Branch protection rules on GitHub are used to prevent changes being pushed to `master` unless automated tests and peer review have been passed. CircleCI is a paid-for CI/CD service (with an outrageously generous free-tier) that integrates with GitHub so that jobs (such as automated tests) are triggered automatically following merge requests, changes to the `master` branch, etc. Our CircleCI pipeline is defined in `.circleci/config.yml` and reproduced below.
Although this configuration file is specific to CircleCI, it will be easily recognisable to anyone who’s ever worked with similar services such as GitHub Actions, GitLab CI/CD, Travis CI, etc. In essence, it defines the following:
In the first part of this project we have expended a lot of effort to lay the foundations for the work that is to come - developing the model training job, the prediction service and deploying these to a production environment where they will need to be monitored. Thanks to automated tests and CI/CD, our team will be able to quickly iterate towards a well-engineered solution, with results that can be demonstrated to stakeholders early on.