Experiments

An experiment lets you run different models, or several variants of the same model, in a deployment to compare their performance on real data. Experiments can be used to implement A/B testing, shadow deployments, and many other multi-model comparisons.

There are currently three supported modes: random split, key split, and shadow deploy. Random split routes incoming requests to the specified models at random, in accordance with the specified percentages. Key split routes incoming requests to the specified models based on the value of a meta key field; this lets you treat a cohort of models as a single functional unit in which every incoming inference is routed to the correct model in the cohort. Shadow deploy sends each request to multiple models, returns the result from the default model, and logs the results of all the models.

The random split mode allows you to run a randomized-controlled-trial-style experiment in which the model handling each request is chosen at random. You specify the share of requests each model receives by assigning a ‘weight’; weights are automatically normalized into percentages for you, so you don’t need to worry about them adding up to a particular value. You can also specify a meta key field used to assign the random cohorts in a consistent manner. This is useful if you want the model that handles requests from a particular user (or group) to be chosen randomly but consistently; in that case the split_key might be the session_id or something similar.
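
To make the normalization and split-key behavior concrete, here is a minimal sketch in plain Python. This is an illustration of the idea only, not Wallaroo's internal routing: the weights are scaled to sum to 1, and hashing the split key in place of a random draw makes the choice stable for any given key.

# Illustration only; Wallaroo's actual routing is internal to the platform.
import hashlib
import random

def normalize(weighted_models):
    # Scale arbitrary weights so they sum to 1.
    total = sum(w for w, _ in weighted_models)
    return [(w / total, m) for w, m in weighted_models]

def pick_model(weighted_models, split_key_value=None):
    # With no split key the draw is random; with a key, hashing the key
    # means the same key always lands in the same bucket.
    if split_key_value is None:
        r = random.random()
    else:
        digest = hashlib.sha256(str(split_key_value).encode()).hexdigest()
        r = int(digest, 16) / 16**64
    cumulative = 0.0
    for share, model in normalize(weighted_models):
        cumulative += share
        if r < cumulative:
            return model
    return weighted_models[-1][1]

# Weights of 2 and 1 become roughly a 66% / 33% split.
pick_model([(2, "control"), (1, "challenger")], split_key_value="session_42")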

Key split allows you to choose exactly which model handles requests for a user (or group): for example, routing all ‘gold’ card users to one fraud prediction model and all ‘black’ card users to another.
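
Conceptually, key split is just a lookup on the meta key value with a fallback to the default model. A toy sketch (the model names here are placeholders, not models from this notebook):

# Toy sketch: route on the meta key value, falling back to the default model.
routes = {"gold": "gold_fraud_model", "black": "black_fraud_model"}  # placeholder names

def route(request, default="control"):
    return routes.get(request.get("card_type"), default)

route({"card_type": "gold"})    # -> 'gold_fraud_model'
route({"card_type": "silver"})  # -> 'control' (no explicit route)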

Shadow deploy allows you to test new models without removing the default/control model. This is particularly useful for “burn-in” testing a new model with real world data without displacing the currently proven model.
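
The essence of shadow deploy can be sketched in a few lines as well: every model runs, every result is logged, and only the champion's result is returned. The lambdas below are stand-ins for real models, used purely for illustration.

# Toy sketch: run champion and challengers, log everything, return only the champion's result.
def shadow_infer(request, champion, challengers, audit_log):
    results = {name: model(request) for name, model in [champion] + challengers}
    audit_log.append(results)    # every model's result is recorded
    return results[champion[0]]  # but only the champion's result is returned

audit_log = []
champion = ("control", lambda x: x * 2)          # stand-in model
challengers = [("challenger", lambda x: x * 3)]  # stand-in model
shadow_infer(10, champion, challengers, audit_log)  # returns 20; both results are logged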

Use cases include validating a new model variant and comparing the performance of different models.

Random Split Experiment Example

Let’s take a quick look at how we can define and deploy an experiment. First, let’s initialize a reference to the Wallaroo system.

[1]:
import os
import wallaroo

wl = wallaroo.Client()

First, we’ll upload our “default” or control model along with an alternate “challenger” model that we’ll use in this notebook.

[2]:
default_model    = wl.upload_model("control", "./keras_ccfraud.onnx")
challenger_model = wl.upload_model("challenger", "./modelA.onnx")

Now, we can verify that the models have been uploaded.

[3]:
models = wl.list_models()
[{"model_class": m.name(), "model_name": m.version()} for m in models]
[3]:
[{'model_class': 'challenger',
  'model_name': '61aa19e7-6ef3-4bed-b9ea-41e56478fdee'},
 {'model_class': 'control',
  'model_name': 'cb89782b-4dba-45de-b85e-83f485bbfa28'}]

Experiments run in Pipelines

To set up an experiment, we create a pipeline that processes the requests in each inference engine. There are convenience classes and methods to help create the appropriate pipeline for each mode.

Let’s set up a random split experiment pipeline where the default model has a weight of 2 and the challenger has a weight of 1. This means that the default model will receive approximately 66% (2/3) of the traffic and the challenger approximately 33% (1/3). Feel free to use .66 and .33 as weights if you prefer; they normalize to the same split, as sketched after the next cell.

[4]:
pipeline = (wl.build_pipeline("randomsplitpipeline")
            .add_random_split([(2, default_model), (1, challenger_model)], "session_id"))
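
Because weights are normalized, the same split can also be expressed with fractional weights. Here is a sketch of the equivalent definition; it is not deployed in this notebook, and the pipeline name is just an illustrative label:

# Equivalent random split expressed with fractional weights; they normalize the same way.
fractional_pipeline = (wl.build_pipeline("randomsplitpipeline-fractional")  # illustrative name
                       .add_random_split([(0.66, default_model), (0.33, challenger_model)],
                                         "session_id"))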

Now, let’s deploy the pipeline we just built.

[5]:
pipeline.deploy()
Waiting for deployment - this will take up to 45s ...... ok
[5]:
{'name': 'randomsplitpipeline', 'create_time': datetime.datetime(2022, 3, 3, 22, 15, 35, 649256, tzinfo=tzutc()), 'definition': "[{'RandomSplit': {'hash_key': 'session_id', 'weights': [{'model': {'name': 'control', 'version': '59d4baf0-152b-4b62-aae8-caa095436414', 'sha': 'bc85ce596945f876256f41515c7501c399fd97ebcb9ab3dd41bf03f8937b4507'}, 'weight': 2}, {'model': {'name': 'challenger', 'version': '98679550-af37-4709-b930-d114b53acc8a', 'sha': '438cd2762590b712106235dc4d635ca50b21304f42ee9529c7acd0b0aecac624'}, 'weight': 1}]}}]"}

While that is coming up, let’s create a little function to help us calculate the percentage of requests each model receives.

[6]:
import collections

def frequencies(responses):
    # Count how many responses came from each model and print the percentages.
    c = collections.Counter(r.model()[0] for r in responses)
    total = sum(c.values())
    for k, v in c.items():
        print(f"{k:15} {v} inferences {v/total*100:5.2f}%")

And let’s load some test data:

[7]:
import json

with open('dev_smoke_test.json', "rb") as f:
    data = json.load(f)

Now, let’s run a number of inferences and look at the percentages. We should see that the control model gets roughly 66% of the requests. Note that this is a random process, so the actual ratios will vary.

[8]:
num_runs = 1000
responses = []
for _ in range(num_runs):
    responses.extend(pipeline.infer(data))

frequencies(responses)
control         661 inferences 66.10%
challenger      339 inferences 33.90%

Let’s try a stable hash key

The hash key session_id was set up in our pipeline configuration, but we didn’t pass it in during that first trial. Let’s set an arbitrary key on each request and see what happens. It should result in similar ratios.

[9]:
responses = []
for i in range(num_runs):
    data['session_id'] = f"session_{i}"
    responses.extend(pipeline.infer(data))
frequencies(responses)
control         694 inferences 69.40%
challenger      306 inferences 30.60%

Let’s try a constant stability key

With a constant key, we’d expect every request to go to the same randomly chosen model. You can change the session_id value and see that the model chosen changes randomly.

[10]:
responses = []
data['session_id'] = "session_abc"  # the same key for every request
for _ in range(num_runs):
    responses.extend(pipeline.infer(data))
frequencies(responses)
control         1000 inferences 100.00%
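
If you want to convince yourself that the assignment is stable per key while still varying from key to key, you can repeat a handful of fixed keys and check that each one always resolves to a single model. The key values below are arbitrary:

# Each fixed key should always resolve to exactly one model,
# though which model that is varies from key to key.
for key in ["session_abc", "session_def", "session_xyz"]:
    data['session_id'] = key
    picks = {pipeline.infer(data)[0].model()[0] for _ in range(5)}
    print(key, picks)  # expect a single model name per key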

Now, let’s shut down this experiment and try another one.

[11]:
pipeline.undeploy()
[11]:
{'name': 'randomsplitpipeline', 'create_time': datetime.datetime(2022, 3, 3, 22, 15, 35, 649256, tzinfo=tzutc()), 'definition': "[{'RandomSplit': {'hash_key': 'session_id', 'weights': [{'model': {'name': 'control', 'version': '59d4baf0-152b-4b62-aae8-caa095436414', 'sha': 'bc85ce596945f876256f41515c7501c399fd97ebcb9ab3dd41bf03f8937b4507'}, 'weight': 2}, {'model': {'name': 'challenger', 'version': '98679550-af37-4709-b930-d114b53acc8a', 'sha': '438cd2762590b712106235dc4d635ca50b21304f42ee9529c7acd0b0aecac624'}, 'weight': 1}]}}]"}

Key Split Experiment Example

Now, let’s define a pipeline that automatically splits inference data on a given meta_key_name. Requests with card_type == gold go to the challenger, and all other requests go to the default model. We’ll reuse the same model variants as before to save time, but each experiment deployment would typically have its own models.

[12]:
meta_key_name = 'card_type'
pipeline = wl.build_pipeline("keysplitpipeline")
[13]:
pipeline.add_key_split(default_model, meta_key_name, {"gold": challenger_model}).deploy()
[13]:
{'name': 'keysplitpipeline', 'create_time': datetime.datetime(2022, 3, 3, 22, 39, 56, 44234, tzinfo=tzutc()), 'definition': "[{'MetaValueSplit': {'split_key': 'card_type', 'control': {'name': 'control', 'version': '59d4baf0-152b-4b62-aae8-caa095436414', 'sha': 'bc85ce596945f876256f41515c7501c399fd97ebcb9ab3dd41bf03f8937b4507'}, 'routes': {'gold': {'name': 'challenger', 'version': '98679550-af37-4709-b930-d114b53acc8a', 'sha': '438cd2762590b712106235dc4d635ca50b21304f42ee9529c7acd0b0aecac624'}}}}]"}

And test that inference data goes to the control by default:

[15]:
data['card_type'] = 'silver'
pipeline.infer(data)[0].model()
[15]:
('control', 'cb89782b-4dba-45de-b85e-83f485bbfa28')

And finally, test that data gets routed to the experimental model when the right meta-value is set:

[16]:
data['card_type'] = 'gold'
pipeline.infer(data)[0].model()
[16]:
('challenger', '61aa19e7-6ef3-4bed-b9ea-41e56478fdee')
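
The ‘gold’/‘black’ example from the introduction just means adding more entries to the routing dictionary. Here is a sketch of what that would look like; black_model and its file path are hypothetical and not part of this notebook:

# Hypothetical: a third model handles 'black' card traffic.
black_model = wl.upload_model("black-variant", "./model_black.onnx")  # illustrative path

multi_route_pipeline = (wl.build_pipeline("keysplitpipeline-multi")   # illustrative name
                        .add_key_split(default_model, meta_key_name,
                                       {"gold": challenger_model,
                                        "black": black_model}))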

The audit logs can be retrieved with pipeline.logs() for further analysis. Each log record is similar to the result of an inference and contains the pipeline_id, the model and variant chosen, the inputs and outputs, the timestamp, and the elapsed time.

[17]:
logs = pipeline.logs()
for log in logs:
    print(log.timestamp, log.model_name)
2022-02-04 20:19:32.769000 control
2022-02-04 20:19:35.871000 challenger
[18]:
pipeline.undeploy()
[18]:
{'name': 'keysplitpipeline', 'create_time': datetime.datetime(2022, 3, 3, 22, 39, 56, 44234, tzinfo=tzutc()), 'definition': "[{'MetaValueSplit': {'split_key': 'card_type', 'control': {'name': 'control', 'version': '59d4baf0-152b-4b62-aae8-caa095436414', 'sha': 'bc85ce596945f876256f41515c7501c399fd97ebcb9ab3dd41bf03f8937b4507'}, 'routes': {'gold': {'name': 'challenger', 'version': '98679550-af37-4709-b930-d114b53acc8a', 'sha': '438cd2762590b712106235dc4d635ca50b21304f42ee9529c7acd0b0aecac624'}}}}]"}

Shadow Deploy Experiment Example

Finally, let’s create a “shadow deployment” experiment pipeline. The champion model and all challengers are run for each input. The result data for all models is logged, but the output of the champion is the only result returned.

This is particularly useful for “burn-in” testing a new model with real world data without displacing the currently proven model.

[19]:
pipeline = wl.build_pipeline("shadowdeploypipeline")
pipeline.add_shadow_deploy(default_model, [challenger_model])
pipeline.deploy()
Waiting for deployment - this will take up to 45s ...... ok
[19]:
{'name': 'shadowdeploypipeline', 'create_time': datetime.datetime(2022, 3, 3, 22, 42, 53, 628431, tzinfo=tzutc()), 'definition': "[{'ModelInference': {'models': [{'name': 'control', 'version': '59d4baf0-152b-4b62-aae8-caa095436414', 'sha': 'bc85ce596945f876256f41515c7501c399fd97ebcb9ab3dd41bf03f8937b4507'}, {'name': 'challenger', 'version': '98679550-af37-4709-b930-d114b53acc8a', 'sha': '438cd2762590b712106235dc4d635ca50b21304f42ee9529c7acd0b0aecac624'}]}}, {'AuditResults': {'from': 1, 'to': None}}, {'Nth': {'index': 0}}]"}

We get the response from the control model.

[20]:
res = pipeline.infer(data)
res[0].model()
[20]:
('control', 'cb89782b-4dba-45de-b85e-83f485bbfa28')

And logs are created for both models.

[22]:
logs = pipeline.logs()
for log in logs:
    print(log.timestamp, log.model_name)
2022-02-04 20:20:26.072000 challenger
2022-02-04 20:20:26.072000 control
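
Because shadow deploy writes one log record per model for each request, a quick count over the log records (using the same model_name attribute printed above) confirms that the champion and the challenger were each run equally often:

# Count shadow-deploy log records per model; the champion and each challenger
# should appear once for every request.
import collections

counts = collections.Counter(log.model_name for log in logs)
for name, n in counts.items():
    print(f"{name:15} {n} log records")
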
[23]:
pipeline.undeploy()
[23]:
{'name': 'shadowdeploypipeline', 'create_time': datetime.datetime(2022, 3, 3, 22, 42, 53, 628431, tzinfo=tzutc()), 'definition': "[{'ModelInference': {'models': [{'name': 'control', 'version': '59d4baf0-152b-4b62-aae8-caa095436414', 'sha': 'bc85ce596945f876256f41515c7501c399fd97ebcb9ab3dd41bf03f8937b4507'}, {'name': 'challenger', 'version': '98679550-af37-4709-b930-d114b53acc8a', 'sha': '438cd2762590b712106235dc4d635ca50b21304f42ee9529c7acd0b0aecac624'}]}}, {'AuditResults': {'from': 1, 'to': None}}, {'Nth': {'index': 0}}]"}

API Reference

See:

  • Model

  • ModelConfig

  • Pipeline

  • Deployment