MLOps
=====

In this module, we introduce concepts and techniques related to MLOps, that is, the automation of operations involved in building, validating and deploying ML pipelines for applications. By the end of this module, students should be able to:

* Describe the basics of the entire ML application workflow and lifecycle.
* Implement deployment automation for an arbitrary ML model using Python, Flask, and Docker.
* Implement deployment automation for Tensorflow models using Tensorflow Serving.

The ML Lifecycle and MLOps
---------------------------

Just like any other software, applications developed with machine learning are products that evolve over time. We repeatedly update the software as we gather more data, develop better models, and generally make improvements.

MLOps is a modification of the concept of DevOps, a set of techniques and best practices that originated in the late 2000s with the organization of DevOpsDays (2009) and related events. DevOps is a combination of the terms *software development* and *IT operations*, and the idea was to shorten the time between when new software features were developed (i.e., new code being written) and when those features would be released and available to users.

In order to shorten that time window, a number of tasks needed to be automated, including:

* Building/compiling software binaries, and/or assembling software packages.
* Running tests on the new software to ensure it was high quality.
* Integrating the software with other components in the ecosystem (e.g., other microservices in a microservice architecture) and running integration tests across the entire system.
* Deploying the tested code to production.

MLOps modifies the DevOps concept to accommodate the processes involved with building and deploying new ML applications. At a high level, the life cycle consists of:

1. *Data collection and preprocessing:* As we have seen, large amounts of high-quality data are essential to training models that can make accurate predictions.
2. *Model training:* Once we have gathered and preprocessed data, we train our model. As we have seen, this can involve lengthy executions to search across different model types and hyperparameter spaces.
3. *Model evaluation and validation:* As we train new versions of our model, we evaluate them using various metrics (e.g., accuracy, precision, recall, F1, etc.). Beyond evaluating models against specific metrics, a number of additional validation steps can be taken, including: *automated testing*, to ensure the application works as intended on important and/or edge cases; and *fairness/bias assessment*, for example, by evaluating your model against a separate held-out dataset that is specifically engineered to be representative and inclusive of all perspectives.
4. *Model deployment:* After a model version has been validated, we are ready to package and deploy it, either to a pre-production environment such as a test or QA environment, or to production.
5. *Integration Testing, Acceptance Testing, and other automated testing:* Typically, the trained model is first deployed to a test or QA environment where automated tests are run. Various kinds of tests could be run, including integration tests (to test the ML model's integration into the rest of the application), functional or acceptance tests (to evaluate high-level functions of the application), and performance and/or load tests (to ensure the application performs well under various loads). If the testing produces acceptable results, the model is then deployed to production.
6. *Production Monitoring:* Once an ML model has been deployed to production, it must be monitored just like any other application component. The model will see new data samples in production, and these should be collected and evaluated.
   Various changes in the environment, referred to as *drift*, can cause model performance to degrade. For example, some facial recognition systems saw performance degradation during the pandemic as a result of people wearing masks. Language models see performance degradation over time due to the introduction and use of new words.

.. figure:: ./images/MLOps.png
    :width: 800px
    :align: center

    The ML Application Development and Operations Lifecycle

So far in this course, we have mostly focused on 1), 2) and 3). Below, we discuss specific techniques for Model Deployment.

Model Deployment
-----------------

There are a number of considerations when planning a model deployment. At a minimum, the software must be packaged and delivered in a way that allows it to be utilized by the rest of the application.

Inference Server
^^^^^^^^^^^^^^^^

The concept of an *inference server* has gained traction in the ML community. The idea is to wrap the trained ML model in a lightweight server that can be called over the network. Commonly, this is done either as an HTTP 1.x/REST API or an HTTP 2/gRPC server.

For example, a REST API inference server for the model we developed to classify images of clothing items may have the following endpoints:

+---------------------------------+------------+---------------------------------------------+
| **Route**                       | **Method** | **What it should do**                       |
+---------------------------------+------------+---------------------------------------------+
| ``/models/clothes/v1``          | GET        | Return basic information about v1 of model  |
+---------------------------------+------------+---------------------------------------------+
| ``/models/clothes/v1``          | POST       | Classify clothes object in image payload    |
|                                 |            | using version 1 (v1) of the model.          |
+---------------------------------+------------+---------------------------------------------+

When a client makes an HTTP POST request to ``/models/clothes/v1`` they send an image as part of the payload.
The inference server must:

1. Retrieve the image out of the request payload.
2. Perform any preprocessing necessary on the image byte stream.
3. Apply the model to the processed image data to get a classification result.
4. Package the classification result into a convenient data structure (e.g., JSON).
5. Send a response with the classification data structure included as the message body.

As you can see, we have encoded both the kind of model ("clothes") as well as the version ("v1") into our URL structure. This means that if we developed another model, for example, our handwritten digits classifier, we could easily add it to our inference server. We could also easily add a new version of the clothes model and serve both at the same time.

There are a number of advantages to using an inference server architecture, many of which are just the advantages enjoyed by all HTTP/microservice architectures:

1. *Framework agnostic:* Regardless of which ML framework your model is developed in, it can be packaged into an inference server. With that said, some solutions are framework-specific. In fact, one of the solutions we'll look at is Tensorflow Serving, which serves Tensorflow models (and other kinds of *servables*).
2. *Language agnostic API:* Components of the application can interact easily with the inference server, regardless of the programming language they are written in, because all modern languages have an HTTP client.
3. *Scalability:* Multiple components of the application can interact with the model inference server, even from different computers. Additionally, multiple instances of the inference server itself can be deployed to increase the throughput of inferences.
4. *Plug-and-play and model chaining:* The concept of *plug-and-play* for ML models is the idea or goal of enabling different models to be "plugged" into an application with little to no code changes to the rest of the application.
   In order to achieve this, different models that perform the same (or similar) task must conform to a common interface. An HTTP interface is one possible mechanism. Similarly, *model chaining* is the idea that we can feed outputs of one model as inputs to another model. For example, we may have one model that finds language characters in an image and another model that translates words from one language to another (for example, `Google image translate `_). If individual models use HTTP requests and responses, the response from one model can easily be fed in as a request to the next model.
5. *Versioning:* There are multiple, intuitive ways to version a model inference server. One, suggested above, is to use the URL to encode the version. These methods will be familiar to most developers, as REST APIs (and HTTP services more generally) have become common in cloud computing.

What do we need to build an ML inference server? The basic ingredients are as follows:

1. *Serialize and deserialize trained models:* With sklearn one can use the Python pickle module, but we will quickly see how to do this with Keras.
2. *Write the inference server code:* We will see two methods for doing this: a "generic" method using Flask and a Tensorflow-specific method (Tensorflow Serving).
3. *Package the server as a Docker container image:* This will simplify deployment and make our server more portable.
4. *Deploy the server as a container:* We can use a simple script, docker-compose, or something more elaborate such as Kubernetes.

Serializing and Deserializing Tensorflow Models
------------------------------------------------

The Python pickle module can be used to serialize an sklearn model. If you are interested in this topic, see the supplement on `Model Persistence with Pickle `_.

For serializing a Tensorflow model, we recommend using the built-in ``model.save()`` method.
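
To make the pickle approach mentioned above concrete, here is a minimal sketch of a serialize/deserialize round trip using only the Python standard library. The ``trained_model`` dictionary and the ``model.pkl`` file name are hypothetical stand-ins chosen for illustration (they are not part of the course code); a fitted sklearn estimator can generally be passed to ``pickle.dump()`` in the same way.

.. code-block:: python3

    import os
    import pickle
    import tempfile

    # Hypothetical stand-in for a trained model: a plain dict of learned parameters.
    trained_model = {"name": "demo", "weights": [0.25, -1.5, 3.0], "bias": 0.1}

    # Work in a temporary directory so the sketch leaves no files behind.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "model.pkl")

        # Serialize: pickle files must be opened in binary mode.
        with open(path, "wb") as f:
            pickle.dump(trained_model, f)

        # Deserialize: only unpickle files that come from a source you trust.
        with open(path, "rb") as f:
            restored_model = pickle.load(f)

    # The restored object is an equal but separate copy of the original.
    print(restored_model == trained_model)

The same two calls, ``pickle.dump()`` and ``pickle.load()``, are all that is needed for objects that support pickling.
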
In general, attempting to use pickle on Tensorflow models can lead to errors related to model objects not being pickleable.

We'll illustrate the techniques in this section using a model trained against the Fashion MNIST dataset. Recall that the dataset consists of 28x28 grayscale images containing different articles of clothing, and our goal was to build a model that could perform image classification to determine the type of clothing in the image.

We built a few different model architectures. Here I will work with the LeNet-5 architecture. We collect the essential code for building the model below:

.. code-block:: python3

    import keras
    from tensorflow.keras.datasets import fashion_mnist
    from tensorflow.keras.utils import to_categorical

    # Importing all the different layers and optimizers
    from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D, AveragePooling2D
    from keras import layers
    from keras import models
    from tensorflow.keras.optimizers import Adam
    from keras.applications.vgg16 import VGG16

    # data load
    (X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

    # normalize pixel values to the range [0, 1]
    X_train_normalized = X_train / 255.0
    X_test_normalized = X_test / 255.0

    # Convert to "one-hot" vectors using the to_categorical function
    num_classes = 10
    y_train_cat = to_categorical(y_train, num_classes)

    # Initializing a sequential model
    model = models.Sequential()

    # Layer 1: Convolutional layer with 6 filters of size 5x5, followed by average pooling
    model.add(Conv2D(6, kernel_size=(5, 5), activation='relu', input_shape=(28, 28, 1)))
    model.add(AveragePooling2D(pool_size=(2, 2)))

    # Layer 2: Convolutional layer with 16 filters of size 5x5, followed by average pooling
    model.add(Conv2D(16, kernel_size=(5, 5), activation='relu'))
    model.add(AveragePooling2D(pool_size=(2, 2)))

    # Flatten the feature maps to feed into fully connected layers
    model.add(Flatten())

    # Layer 3: Fully connected layer with 120 neurons
    model.add(Dense(120, activation='relu'))

    # Layer 4: Fully connected layer with 84 neurons
    model.add(Dense(84, activation='relu'))

    # Output layer: Fully connected layer with num_classes neurons (10 for Fashion MNIST)
    model.add(Dense(num_classes, activation='softmax'))

    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.summary()
    model.fit(X_train_normalized, y_train_cat, validation_split=0.2, epochs=20, batch_size=128, verbose=2)

The output will look similar to the following at the bottom:

.. code-block:: console

    . . .
    Epoch 18/20
    375/375 - 3s - 7ms/step - accuracy: 0.9130 - loss: 0.2334 - val_accuracy: 0.9043 - val_loss: 0.2704
    Epoch 19/20
    375/375 - 3s - 7ms/step - accuracy: 0.9161 - loss: 0.2265 - val_accuracy: 0.9043 - val_loss: 0.2703
    Epoch 20/20
    375/375 - 3s - 7ms/step - accuracy: 0.9174 - loss: 0.2215 - val_accuracy: 0.9022 - val_loss: 0.2695

It's possible that a few more epochs might improve performance, but we're at or above 90% accuracy on both the train and validation sets, and the validation accuracy has started to plateau, so this seems like a good time to save the model.

We use the ``model.save()`` function, passing in the name of the file to save the model to. I will use the simple name ``clothes.keras``. It is a good habit to save models with a ``.keras`` extension.

.. code-block:: python3

    model.save("clothes.keras")

There should now be a file, ``clothes.keras``, in the same directory as the notebook you are writing. If we inspect this file, we will see that it is a zip archive and about 550KB:

.. code-block:: console

    $ file clothes.keras
    clothes.keras: Zip archive data, at least v2.0 to extract

.. note::

    Keras supports multiple file format versions for saving models. The latest version, v3, will automatically be used whenever the file name passed ends in the ".keras" extension.
    From the official docs: *"The new Keras v3 saving format, marked by the .keras extension, is a more simple, efficient format that implements name-based saving, ensuring what you load is exactly what you saved, from Python's perspective. This makes debugging much easier, and it is the recommended format for Keras."*

At this point, we can easily load our model from the saved file into a new Python program. To illustrate, let's restart our notebook kernel before running the following code.

With our kernel restarted, we'll use the ``tf.keras.models.load_model()`` function to load the model directly from our archive file. Keep in mind that we will need to re-import tensorflow.

.. code-block:: python3

    import tensorflow as tf
    model = tf.keras.models.load_model('clothes.keras')

Let's evaluate our model on the training set to convince ourselves that this is indeed our pre-trained model:

.. code-block:: python3

    # check accuracy on the training set without fitting the model
    from tensorflow.keras.datasets import fashion_mnist
    from tensorflow.keras.utils import to_categorical

    # NOTE: we need to perform the same pre-processing...
    (X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

    # normalize
    X_train_normalized = X_train / 255.0

    # Convert to "one-hot" vectors using the to_categorical function
    num_classes = 10
    y_train_cat = to_categorical(y_train, num_classes)

    # evaluate() returns [loss, accuracy] for the compiled metrics
    results_train = model.evaluate(X_train_normalized, y_train_cat, batch_size=128)
    print(results_train)

    469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9222 - loss: 0.2040
    [0.21835999190807343, 0.9186166524887085]

Indeed, we get roughly 92% accuracy on the training set (0.9186), in line with what we saw at the end of training. We're ready to build our inference server.

.. warning::

    Be very careful about the version of tensorflow you use to save the model and the version used to load it. Changing major versions (e.g., tensorflow v1 to v2) can cause the model to fail to load, and so can changing from 2.15 to 2.16, because 2.16 introduced a new major version of Keras (v3).
    See this `issue `_ from last year. The safest approach is always to use identical versions when saving and loading.

Developing An Inference Server in Flask
---------------------------------------

We'll first look at building an inference server using the Flask framework. This approach is easy to implement and gives us full control over how requests are handled.

Initial Flask Server
^^^^^^^^^^^^^^^^^^^^^

To begin, we'll create a new directory, ``models``, and move our ``clothes.keras`` model into it. We'll create a file called ``api.py`` at the same level as the ``models`` directory. The ``api.py`` file will contain our Flask code.

We need to install the Flask Python package into our containers. To do that, use the following ``pip`` command from within a terminal inside your Jupyter notebook server:

.. code-block:: console

    pip install Flask==3.1.2

Remember, you only need to run this command once.

We'll implement two routes, a ``GET`` route and a ``POST`` route, as per the table above. The GET route will just return information about the model in a JSON object.

Here is the starter code. We're importing the Flask class and creating the ``app`` object, which is the basic object used for configuring a Flask server. We use the ``@app.route()`` decorator to create a new *route*, specifying the URL path and HTTP request methods that the route function should handle. We define a ``model_info`` function which just returns a dictionary of metadata about our model; Flask converts a returned dictionary to a JSON response.

.. code-block:: python3

    from flask import Flask

    app = Flask(__name__)


    @app.route('/models/clothes/v1', methods=['GET'])
    def model_info():
        return {
            "version": "v1",
            "name": "clothes",
            "description": "Classify images containing articles of clothing",
            "number_of_parameters": 133280
        }


    # start the development server
    if __name__ == '__main__':
        app.run(debug=True, host='0.0.0.0')

The code at the bottom just runs the Flask development server whenever our Python module ``api.py`` is invoked from the command line.

.. note::

    If you prefer command-line tools, you may wish to install ``vim`` and ``lsof`` using apt: ``apt install vim lsof``

To run the server, make sure Flask is installed and then execute:

.. code-block:: console

    $ python api.py

Then, in a separate terminal, one can use ``curl`` to test the endpoint:

.. code-block:: console

    curl http://127.0.0.1:5000/models/clothes/v1
    {
      "description": "Classify images containing articles of clothing",
      "name": "clothes",
      "number_of_parameters": 133280,
      "version": "v1"
    }

For more details on Flask, see COE 332 `notes `_ or the official `documentation `_.

.. note::

    The class Jupyter container image does not include the Flask package. You can install it with ``pip install Flask==3.1.2``

Packaging the Inference Server with Docker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now that we have a minimal server, let's build a Docker image for it so we can test it out. We'll use a Dockerfile for that. The basic steps are:

1) Start with an official Python image.
2) Install the Flask library.
3) Copy our source code, ``api.py``.
4) Set the default command in the container to run our program.

If you need a refresher on Docker, see the COE 332 `notes `_.

Here is the Dockerfile that does that:

.. code-block:: console

    # Image: jstubbs/ml-clothes-api

    FROM python:3.11

    RUN pip install Flask==3.1.0

    COPY api.py /api.py

    CMD ["python", "api.py"]

To build our image, we use the ``docker build`` command. We'll use the ``-t`` flag to tag it with a name. I'll use ``jstubbs/ml-clothes-api``. You'll want to change the username ``jstubbs`` to your own username on Docker Hub.

.. code-block:: console

    docker build -t jstubbs/ml-clothes-api .

.. note::

    You will not be able to build the Docker image from within your Jupyter notebook server terminal. This is because Docker itself is not installed/mounted in the container. Instead, you should SSH directly to your VM and execute the build command there.

.. note::

    Executing the ``docker build`` command establishes the current working directory (and all subdirectories) as the "context" for the build. Note that if you have files owned by the root user (or any other user) in the current directory or any subdirectories, the docker build will fail. You can change ownership of files (recursively) to the ``ubuntu`` user using a command like ``chown -R ubuntu:ubuntu`` followed by the directory to change.

Now that our image is built, we can start a container for our inference server using the ``docker run`` command. We'll use the following flags to that command:

* ``-it``: run the container in interactive mode and attach to stdout. This is helpful for seeing the logs from our Flask server.
* ``--rm``: remove the container once we stop it.
* ``-p 5000:5000``: map port 5000 in the container to port 5000 on the host. This is important because we want to be able to make requests to our container.

We then need to specify the image name (in my case ``jstubbs/ml-clothes-api``), but we don't need to specify a program to run since we set the default command using the ``CMD`` instruction in our Dockerfile.

Here is the full command to run our server container:

.. code-block:: console

    docker run -it --rm -p 5000:5000 jstubbs/ml-clothes-api

Let's check that it is working. In another window on your VM, use ``curl`` to try the GET route:

.. code-block:: console

    curl localhost:5000/models/clothes/v1
    {
      "description": "Classify images containing articles of clothing",
      "name": "clothes",
      "number_of_parameters": 133280,
      "version": "v1"
    }

Looks good! Now, let's go back and add the inference route.

Adding the Inference Route
^^^^^^^^^^^^^^^^^^^^^^^^^^

Our real goal is to make inference available as a service. For that, we need to add the POST route. That route should take an image, apply the model to it, and return the prediction.

We'll need tensorflow so that we can load and execute the model, and of course, we'll need our model file. Let's start by adding those to the Dockerfile.
.. code-block:: console
    :emphasize-lines: 5,8

    # Image: jstubbs/ml-clothes-api

    FROM python:3.11

    RUN pip install tensorflow==2.15
    RUN pip install Flask==3.0

    COPY models /models
    COPY api.py /api.py

    CMD ["python", "api.py"]

Back in the ``api.py`` file, we need to implement the POST route. We'll want to load the model as well. It's good to load the model on server start up so that the model is ready to go when a request comes in.

As for implementing the route itself, we have some choices about what kind of data the user will send, with each choice offering different pros and cons. For example, we could:

1. Require the user to send a raw image file, such as a png or jpg.
2. Require the user to send a numpy array, serialized as some kind of binary stream (e.g., using the pickle library).
3. Require the user to send a JSON list of numbers.

Additionally, within options 2) and 3), we can require the user to preprocess the data before sending (e.g., normalizing) or we can perform that function for them. In all three options, we could also consider allowing the user to send a batch of images for inference, instead of just one.

Option 1) is appealing for some use cases, but for this dataset we load the data directly into numpy using the ``fashion_mnist.load_data()`` function, so in some ways, this option isn't the most convenient for our demonstration purposes.

Option 2) is likely more efficient than option 3), but it has the downside of only working for Python clients. If our application will be written in other languages, requiring a numpy array would be overly imposing. It's also complicated to implement, as we would need client and server to agree on a serialization scheme (e.g., pickling).

We'll go for option 3). It's easy, supports multiple languages and lends itself perfectly well to batching, though we won't implement that here. Instead, we'll assume the user sends us one image, and we'll take care of preprocessing it.

.. note::

    In Project 3 you are required to use Option 1.
    We will provide some guidelines on how to handle that case later in this module, but consult the official Flask `documentation `_ for full details.

To do that preprocessing, we'll convert the JSON list to a numpy array. We'll then reshape it so that it conforms to the shape required for the ``predict()`` method. (Remember, like with sklearn, the Keras ``model.predict()`` function expects a batch of images, so we'll need to add an extra leading dimension to the array.)

.. code-block:: python3

    def preprocess_input(im):
        """
        Converts user-provided input into an array that can be used with the model.
        This function could raise an exception.
        """
        # convert to a numpy array
        d = np.array(im)
        # then add an extra dimension (a batch containing a single 28x28 image)
        return d.reshape(1, 28, 28)

With that code in place, we can write the route. We specify ``POST`` as the method and we use the ``request.json`` object, which is a dictionary, to get at the request data. We are assuming the message contains a single field, ``image``, holding the JSON list.

Note that we also handle two error cases:

1. The request does not contain JSON with an ``image`` field.
2. The ``image`` field cannot be converted to a numpy array and reshaped.

Case 2) causes the ``preprocess_input()`` function to raise an exception, so we use a ``try...except`` block.

To apply the model, we use ``model.predict()`` on the preprocessed data. Note that the result is a numpy array, which is not JSON serializable, so at a minimum we'll need to cast it to a normal Python list; we do that using the ``.tolist()`` method:

.. code-block:: python3

    @app.route('/models/clothes/v1', methods=['POST'])
    def classify_clothes_image():
        im = request.json.get('image')
        if not im:
            return {"error": "The `image` field is required"}, 404
        try:
            data = preprocess_input(im)
        except Exception as e:
            return {"error": f"Could not process the `image` field; details: {e}"}, 404
        return {"result": model.predict(data).tolist()}

Be sure to add the necessary imports and load the model object.
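
As an aside, the POST route above returns the raw output of ``predict()``: a nested list with one row of ten class scores per image. A client will often want a class label instead. The sketch below shows one way a client might do that in plain Python. The helper name ``top_label`` and the ``example`` response are hypothetical, and the label order is assumed to be the standard Fashion MNIST class order (index 0 is "T-shirt/top", index 9 is "Ankle boot").

.. code-block:: python3

    # Standard Fashion MNIST class names, indexed by label (0-9).
    CLASS_NAMES = [
        "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
        "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
    ]


    def top_label(response_json):
        """Return (label, score) for the first image in a response (hypothetical helper)."""
        # "result" holds one row of class scores per image; take the first row
        scores = response_json["result"][0]
        # index of the largest score, i.e., an argmax over the row
        best_index = max(range(len(scores)), key=lambda i: scores[i])
        return CLASS_NAMES[best_index], scores[best_index]


    # A response shaped like the ones the inference server returns.
    example = {"result": [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]}
    print(top_label(example))

This is client-side code only; it does not change ``api.py``. Keeping the raw scores in the server response leaves it up to each client whether to report a single label, the top few labels, or the full distribution.
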
Here is the complete solution:

.. code-block:: python3

    from flask import Flask, request
    import tensorflow as tf
    import numpy as np

    app = Flask(__name__)

    # load the model once, at server start up
    model = tf.keras.models.load_model('models/clothes.keras')


    @app.route('/models/clothes/v1', methods=['GET'])
    def model_info():
        return {
            "version": "v1",
            "name": "clothes",
            "description": "Classify images containing articles of clothing",
            "number_of_parameters": 133280
        }


    def preprocess_input(im):
        """
        Converts user-provided input into an array that can be used with the model.
        This function could raise an exception.
        """
        # convert to a numpy array
        d = np.array(im)
        # then add an extra dimension (a batch containing a single 28x28 image)
        return d.reshape(1, 28, 28)


    @app.route('/models/clothes/v1', methods=['POST'])
    def classify_clothes_image():
        im = request.json.get('image')
        if not im:
            return {"error": "The `image` field is required"}, 404
        try:
            data = preprocess_input(im)
        except Exception as e:
            return {"error": f"Could not process the `image` field; details: {e}"}, 404
        return {"result": model.predict(data).tolist()}


    # start the development server
    if __name__ == '__main__':
        app.run(debug=True, host='0.0.0.0')

Handling Raw Files
^^^^^^^^^^^^^^^^^^

In Project 3, you will need to handle raw files passed directly to your Flask server (option 1 described above). The idea is to allow the user to pass a file as a multi-part form. The parts of the form are still named, so we'll assume the file has been added to the form under the ``image`` key. We need a way to get at that file from within our Flask route. To do so, use Flask's built-in ``request.files`` object and look for a key ``image``.

Here is a snippet of code illustrating the technique:

.. code-block:: python

    @app.route('/??', methods=['POST'])
    def upload_file():
        # check if the post request has the file part
        if 'image' not in request.files:
            # if the user did not pass the image under `image`, we don't know what they are
            # doing, so return an error.
            return '{"error": "Invalid request; pass a binary image file as a multi-part form under the image key."}'

        # get the data
        data = request.files['image']
        # do something with data...

Testing the Inference Server
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We'll use ``requests`` to test our server because it makes it easy to work with the test data. The process is straightforward: we select an item from the ``X_test`` array and cast it to a list using ``tolist()``.

.. code-block:: python3

    import requests

    # grab an entry from X_test -- here, we grab the first one
    l = X_test[0].tolist()

    # make the POST request passing the single test case as the `image` field:
    rsp = requests.post("http://172.17.0.1:5000/models/clothes/v1", json={"image": l})

    # print the json response
    rsp.json()

    {'result': [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]}

Note that our inference server returns the "raw" result of ``predict()``, which is an array. If we compare that to the actual label (one-hot encoded with ``to_categorical``, as we did for ``y_train``), we'll see that our model got the right answer:

.. code-block:: python3

    y_test_cat[0]
    -> array([0., 0., 0., 0., 0., 0., 0., 0., 0., 1.])

Note that to send an inference request to a Flask server that is expecting a multi-part form, we need to use the ``files`` argument, passing binary data as part of a dictionary where the key is the expected one (in our case, ``image``). Here is an example:

.. code-block:: python3

    # create the files dictionary; `path` is the location of the image file on disk
    data = {"image": open(path, 'rb')}

    # send the POST request; `url` is the full URL of the inference route
    rsp = requests.post(url, files=data)

    # process the response...
    # . . .

We have written a complete grader module that you can use to test your Project 3 inference servers. See the code and the README in the class repository `here `_.

Additional References
----------------------

1. Machine Learning Systems with TinyML. Chapter 14: Embedded AIOps.
   https://harvard-edge.github.io/cs249r_book/contents/ops/ops.html#key-components-of-mlops
2. Tensorflow Documentation, v2.15: tf.saved_model.save.
   https://www.tensorflow.org/versions/r2.15/api_docs/python/tf/saved_model/save
3. Tensorflow Serving with Docker.
   https://www.tensorflow.org/tfx/serving/docker