Monitoring
Model drift is a critical problem in any production ML workflow. One key method for detecting drift is model validation, with tooling built directly into the Wallaroo platform. These tools can be used to diagnose and flag anomalous data within production pipelines. This enables fast identification and mitigation of model drift.
Some key model validations are so important that immediate action is required when they fail. The SDK allows users to configure custom alerts so they can be notified when these failures occur.
Value Validation
The simplest form of validation is boundary checking on model inputs and outputs. Let’s use our credit card fraud model to demonstrate:
[1]:
import wallaroo
client = wallaroo.Client()
fraud = client.upload_model('ccfraud', 'keras_ccfraud.onnx')
Suppose we know a priori that the “likelihood of fraud” output of this model should never exceed 0.95. We can create a pipeline with this simple validation to continuously verify this in production:
[2]:
p = client.build_pipeline('fraud')
p = p.add_model_step(fraud)
p = p.add_validation('no_high_fraud', fraud.outputs[0][0] < 0.95)
p.deploy()
Waiting for deployment - this will take up to 45s ..... ok
[2]:
{'name': 'fraud', 'create_time': datetime.datetime(2022, 3, 3, 22, 49, 12, 101145, tzinfo=tzutc()), 'definition': '[{'ModelInference': {'models': [{'name': 'ccfraud', 'version': 'dae25648-bf42-4eb4-b873-9aa126fbee81', 'sha': 'bc85ce596945f876256f41515c7501c399fd97ebcb9ab3dd41bf03f8937b4507'}]}}, {'Check': {'tree': ['{"values": {"no_high_fraud": {"root": {"node": "binop", "op": "<", "left": {"node": "variable", "variant_id": {"name": "ccfraud"}, "position": "output", "key": [0, 0]}, "right": {"node": "literal", "float": 0.95}}, "required_data": [{"name": "ccfraud"}]}}, "gauges": [], "validations": ["no_high_fraud"]}']}}]'}
Of particular importance is the input to .add_validation(). It’s not a Python value; it’s an unevaluated Expression, similar to formulas in R or expressions in SQL builders like SQLAlchemy.
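To illustrate what "unevaluated" means, here is a minimal sketch of how operator overloading can build an expression tree instead of computing a boolean. The Variable, BinOp, and Literal classes are hypothetical stand-ins, not the Wallaroo SDK's actual classes; the dictionary layout loosely mirrors the Check tree visible in the deployment output above.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Literal:
    """A constant on one side of a comparison (hypothetical class)."""
    value: float

    def to_dict(self) -> dict:
        return {"node": "literal", "float": self.value}


@dataclass(frozen=True)
class BinOp:
    """A deferred binary operation: it records the operator, it does not apply it."""
    op: str
    left: Any
    right: Any

    def to_dict(self) -> dict:
        return {"node": "binop", "op": self.op,
                "left": self.left.to_dict(), "right": self.right.to_dict()}


@dataclass(frozen=True)
class Variable:
    """A reference to a model output that has no value until inference time."""
    model: str
    key: Tuple[int, ...] = ()

    def __getitem__(self, index: int) -> "Variable":
        # Indexing extends the key path rather than reading data.
        return Variable(self.model, self.key + (index,))

    def __lt__(self, other: float) -> BinOp:
        # `<` returns a tree node, not True/False.
        return BinOp("<", self, Literal(float(other)))

    def to_dict(self) -> dict:
        return {"node": "variable", "variant_id": {"name": self.model},
                "position": "output", "key": list(self.key)}


outputs = Variable("ccfraud")
check = outputs[0][0] < 0.95
print(type(check).__name__)  # BinOp
print(check.to_dict())
```

Because the comparison yields a tree rather than a boolean, it can be serialized and shipped to the pipeline, where it is evaluated once per inference.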
We can now run inference directly against this pipeline:
[3]:
p.infer_from_file('./dev_smoke_test.json')
[3]:
[InferenceResult({'check_failures': [],
'elapsed': 132317,
'model_name': 'ccfraud',
'model_version': '5a4d7651-17af-41e5-8322-0886a413f198',
'original_data': {'tensor': [[1.0678324729342086,
0.21778102664937624,
-1.7115145261843976,
0.6822857209662413,
1.0138553066742804,
-0.43350000129006655,
0.7395859436561657,
-0.28828395953577357,
-0.44726268795990787,
0.5146124987725894,
0.3791316964287545,
0.5190619748123175,
-0.4904593221655364,
1.1656456468728569,
-0.9776307444180006,
-0.6322198962519854,
-0.6891477694494687,
0.17833178574255615,
0.1397992467197424,
-0.35542206494183326,
0.4394217876939808,
1.4588397511627804,
-0.3886829614721505,
0.4353492889350186,
1.7420053483337177,
-0.4434654615252943,
-0.15157478906219238,
-0.26684517248765616,
-1.454961775612449]]},
'outputs': [{'Float': {'data': [0.001497417688369751],
'dim': [1, 1],
'v': 1}}],
'pipeline_name': 'fraud',
'time': 1643992160482})]
This example is a known good value with a low chance of fraud. We can verify this by observing that no validation failures have been logged:
[4]:
p.logs()
[4]:
Timestamp | Output | Input | Anomalies |
---|---|---|---|
2022-04-Feb 16:29:20 | [0.001497417688369751] | [[1.0678324729342086, 0.21778102664937624, -1.7115145261843976, 0.6822857209662413, 1.0138553066742804, -0.43350000129006655, 0.7395859436561657, -0.28828395953577357, -0.44726268795990787, 0.5146124987725894, 0.3791316964287545, 0.5190619748123175, -0.4904593221655364, 1.1656456468728569, -0.9776307444180006, -0.6322198962519854, -0.6891477694494687, 0.17833178574255615, 0.1397992467197424, -0.35542206494183326, 0.4394217876939808, 1.4588397511627804, -0.3886829614721505, 0.4353492889350186, 1.7420053483337177, -0.4434654615252943, -0.15157478906219238, -0.26684517248765616, -1.454961775612449]] | 0 |
However, if we run inference on data with an abnormally high chance of fraud, we can see an anomaly is logged:
[5]:
p.infer_from_file('./dev_high_fraud.json')
[5]:
[InferenceResult({'check_failures': [{'False': {'expr': 'ccfraud.outputs[0][0] < 0.95'}}],
'elapsed': 104293,
'model_name': 'ccfraud',
'model_version': '5a4d7651-17af-41e5-8322-0886a413f198',
'original_data': {'tensor': [[1.0678324729342086,
18.155556397512136,
-1.658955105843852,
5.2111788045436445,
2.345247064454334,
10.467083577773014,
5.0925820522419745,
12.82951536371218,
4.953677046849403,
2.3934736228338225,
23.912131817957253,
1.7599568310350209,
0.8561037518143335,
1.1656456468728569,
0.5395988813934498,
0.7784221343010385,
6.75806107274245,
3.927411847659908,
12.462178276650056,
12.307538216518656,
13.787951906620115,
1.4588397511627804,
3.681834686805714,
1.753914366037974,
8.484355003656184,
14.6454097666836,
26.852377436250144,
2.716529237720336,
3.061195706890285]]},
'outputs': [{'Float': {'data': [0.9811990261077881], 'dim': [1, 1], 'v': 1}}],
'pipeline_name': 'fraud',
'time': 1643992165536})]
[6]:
p.logs()
[6]:
Timestamp | Output | Input | Anomalies |
---|---|---|---|
2022-04-Feb 16:29:20 | [0.001497417688369751] | [[1.0678324729342086, 0.21778102664937624, -1.7115145261843976, 0.6822857209662413, 1.0138553066742804, -0.43350000129006655, 0.7395859436561657, -0.28828395953577357, -0.44726268795990787, 0.5146124987725894, 0.3791316964287545, 0.5190619748123175, -0.4904593221655364, 1.1656456468728569, -0.9776307444180006, -0.6322198962519854, -0.6891477694494687, 0.17833178574255615, 0.1397992467197424, -0.35542206494183326, 0.4394217876939808, 1.4588397511627804, -0.3886829614721505, 0.4353492889350186, 1.7420053483337177, -0.4434654615252943, -0.15157478906219238, -0.26684517248765616, -1.454961775612449]] | 0 |
2022-04-Feb 16:29:25 | [0.9811990261077881] | [[1.0678324729342086, 18.155556397512136, -1.658955105843852, 5.2111788045436445, 2.345247064454334, 10.467083577773014, 5.0925820522419745, 12.82951536371218, 4.953677046849403, 2.3934736228338225, 23.912131817957253, 1.7599568310350209, 0.8561037518143335, 1.1656456468728569, 0.5395988813934498, 0.7784221343010385, 6.75806107274245, 3.927411847659908, 12.462178276650056, 12.307538216518656, 13.787951906620115, 1.4588397511627804, 3.681834686805714, 1.753914366037974, 8.484355003656184, 14.6454097666836, 26.852377436250144, 2.716529237720336, 3.061195706890285]] | 1 |
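The two log rows above can be reproduced by hand. The following sketch evaluates a check tree of the shape shown in the pipeline definition against the two logged model outputs. The evaluate function is a hypothetical illustration, not Wallaroo's validation engine; it handles only the node types that appear in this example, and reading the key [0, 0] as indices into a nested list is an assumption.

```python
import operator

# Comparison operators that a check tree of this kind might contain.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# Tree for `no_high_fraud`, copied from the pipeline definition shown earlier.
NO_HIGH_FRAUD = {
    "node": "binop",
    "op": "<",
    "left": {"node": "variable", "variant_id": {"name": "ccfraud"},
             "position": "output", "key": [0, 0]},
    "right": {"node": "literal", "float": 0.95},
}


def evaluate(node, outputs):
    """Evaluate a check tree against `outputs`, a dict of model name -> nested list."""
    kind = node["node"]
    if kind == "literal":
        return node["float"]
    if kind == "variable":
        value = outputs[node["variant_id"]["name"]]
        for index in node["key"]:  # walk the key path, e.g. [0, 0]
            value = value[index]
        return value
    if kind == "binop":
        left = evaluate(node["left"], outputs)
        right = evaluate(node["right"], outputs)
        return OPS[node["op"]](left, right)
    raise ValueError(f"unsupported node type: {kind}")


# Outputs taken from the two inference results above, shaped as dim [1, 1].
normal = {"ccfraud": [[0.001497417688369751]]}
high = {"ccfraud": [[0.9811990261077881]]}

print(evaluate(NO_HIGH_FRAUD, normal))  # True: validation passes, 0 anomalies
print(evaluate(NO_HIGH_FRAUD, high))    # False: validation fails, 1 anomaly
```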
[7]:
p.undeploy()
[7]:
{'name': 'fraud', 'create_time': datetime.datetime(2022, 3, 3, 22, 49, 12, 101145, tzinfo=tzutc()), 'definition': '[{'ModelInference': {'models': [{'name': 'ccfraud', 'version': 'dae25648-bf42-4eb4-b873-9aa126fbee81', 'sha': 'bc85ce596945f876256f41515c7501c399fd97ebcb9ab3dd41bf03f8937b4507'}]}}, {'Check': {'tree': ['{"values": {"no_high_fraud": {"root": {"node": "binop", "op": "<", "left": {"node": "variable", "variant_id": {"name": "ccfraud"}, "position": "output", "key": [0, 0]}, "right": {"node": "literal", "float": 0.95}}, "required_data": [{"name": "ccfraud"}]}}, "gauges": [], "validations": ["no_high_fraud"]}']}}]'}
Baseline Drift Detection
Another important method for model validation is checking for model drift against a baseline. First, let’s upload our baseline and production models:
[8]:
baseline = client.upload_model('fraud-base', 'xgboost_ccfraud.onnx')
production = client.upload_model('fraud-prod', 'keras_ccfraud.onnx')
We can now create a shadow deployment, with the baseline shadowing our production model. Then, we can validate that their outputs never drift further apart than a fixed tolerance:
[9]:
from wallaroo import functions as fn
p = client.build_pipeline('fraud-shadow')
p = p.add_shadow_deploy(production, [baseline])
p = p.add_validation(
    'no_model_drift',
    fn.abs(production.outputs[0][0] - baseline.outputs[0][0]) <= 0.05
)
p.deploy()
Waiting for deployment - this will take up to 45s ...... ok
[9]:
{'name': 'fraud-shadow', 'create_time': datetime.datetime(2022, 3, 3, 23, 7, 52, 323520, tzinfo=tzutc()), 'definition': '[{'ModelInference': {'models': [{'name': 'fraud-prod', 'version': '661c4e9b-43ec-414f-8f2a-0f3797053da3', 'sha': 'bc85ce596945f876256f41515c7501c399fd97ebcb9ab3dd41bf03f8937b4507'}, {'name': 'fraud-base', 'version': 'ca7cb8c5-32f7-419a-bf66-1e4af58111bd', 'sha': '054810e3e3ebbdd34438d9c1a08ed6a6680ef10bf97b9223f78ebf38e14b3b52'}]}}, {'Check': {'tree': ['{"values": {"no_model_drift": {"root": {"node": "binop", "op": "<=", "left": {"node": "fn", "fn": "abs", "arguments": [{"node": "binop", "op": "-", "left": {"node": "variable", "variant_id": {"name": "fraud-prod"}, "position": "output", "key": [0, 0]}, "right": {"node": "variable", "variant_id": {"name": "fraud-base"}, "position": "output", "key": [0, 0]}}]}, "right": {"node": "literal", "float": 0.05}}, "required_data": [{"name": "fraud-prod"}, {"name": "fraud-base"}]}}, "gauges": [], "validations": ["no_model_drift"]}']}}, {'AuditResults': {'from': 1, 'to': None}}, {'Nth': {'index': 0}}]'}
The use of functions.abs instead of Python’s built-in abs is critical here. The built-in abs is not compatible with unevaluated Expression objects.
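A small sketch shows one way such an incompatibility can arise. It assumes a hypothetical Expr class that does not implement `__abs__`; whether the SDK's Expression behaves exactly this way is an assumption, and expr_abs is an illustrative stand-in for functions.abs, not its real implementation.

```python
class Expr:
    """Hypothetical unevaluated expression node (not the Wallaroo class)."""

    def __init__(self, node: dict):
        self.node = node

    def __sub__(self, other: "Expr") -> "Expr":
        # Subtraction is deferred: it produces a new tree node.
        return Expr({"node": "binop", "op": "-",
                     "left": self.node, "right": other.node})


def expr_abs(expr: Expr) -> Expr:
    """Stand-in for an expression-aware abs: wraps the argument in an `fn` node."""
    return Expr({"node": "fn", "fn": "abs", "arguments": [expr.node]})


prod = Expr({"node": "variable", "variant_id": {"name": "fraud-prod"}})
base = Expr({"node": "variable", "variant_id": {"name": "fraud-base"}})
diff = prod - base

try:
    abs(diff)  # built-in abs looks for __abs__, which Expr does not define
    builtin_abs_worked = True
except TypeError:
    builtin_abs_worked = False

print(builtin_abs_worked)         # False
print(expr_abs(diff).node["fn"])  # abs
```

The expression-aware version records the call as a node in the tree, which matches the "fn": "abs" entry in the pipeline definition above.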
We can now verify that a normal inference does not log a validation failure:
[10]:
p.infer_from_file('./dev_smoke_test.json')
[10]:
[InferenceResult({'check_failures': [],
'elapsed': 119481,
'model_name': 'fraud-prod',
'model_version': '948f3c65-381e-4bef-92c3-ffc30d8c3427',
'original_data': {'tensor': [[1.0678324729342086,
0.21778102664937624,
-1.7115145261843976,
0.6822857209662413,
1.0138553066742804,
-0.43350000129006655,
0.7395859436561657,
-0.28828395953577357,
-0.44726268795990787,
0.5146124987725894,
0.3791316964287545,
0.5190619748123175,
-0.4904593221655364,
1.1656456468728569,
-0.9776307444180006,
-0.6322198962519854,
-0.6891477694494687,
0.17833178574255615,
0.1397992467197424,
-0.35542206494183326,
0.4394217876939808,
1.4588397511627804,
-0.3886829614721505,
0.4353492889350186,
1.7420053483337177,
-0.4434654615252943,
-0.15157478906219238,
-0.26684517248765616,
-1.454961775612449]]},
'outputs': [{'Float': {'data': [0.001497417688369751],
'dim': [1, 1],
'v': 1}}],
'pipeline_name': 'fraud-shadow',
'time': 1643992224943})]
[11]:
p.logs()
[11]:
Timestamp | Output | Input | Anomalies |
---|---|---|---|
2022-04-Feb 16:30:24 | [0.001497417688369751] | [[1.0678324729342086, 0.21778102664937624, -1.7115145261843976, 0.6822857209662413, 1.0138553066742804, -0.43350000129006655, 0.7395859436561657, -0.28828395953577357, -0.44726268795990787, 0.5146124987725894, 0.3791316964287545, 0.5190619748123175, -0.4904593221655364, 1.1656456468728569, -0.9776307444180006, -0.6322198962519854, -0.6891477694494687, 0.17833178574255615, 0.1397992467197424, -0.35542206494183326, 0.4394217876939808, 1.4588397511627804, -0.3886829614721505, 0.4353492889350186, 1.7420053483337177, -0.4434654615252943, -0.15157478906219238, -0.26684517248765616, -1.454961775612449]] | 0 |
We can also verify that an inference on drifted values does log an anomaly:
[12]:
p.infer_from_file('./dev_drift.json')
[12]:
[InferenceResult({'check_failures': [{'False': {'expr': 'abs(fraud-prod.outputs[0][0] - '
'fraud-base.outputs[0][0]) <= 0.05'}}],
'elapsed': 76097,
'model_name': 'fraud-prod',
'model_version': '948f3c65-381e-4bef-92c3-ffc30d8c3427',
'original_data': {'tensor': [[16.309736909599135,
6.599583312777217,
-1.7115145261843976,
3.843584873057595,
1.8125832479102169,
-0.43350000129006655,
17.30937776667656,
-0.2807742641284464,
6.2609333637166005,
9.713576049131042,
6.0047041206552,
10.752836139413064,
-0.4774067823292544,
4.1756599181627845,
-0.6972038718887318,
1.3471917248192462,
0.7307732879172534,
0.17833178574255615,
1.960185481479672,
1.3272984903998677,
1.6581661476267306,
13.303105662486775,
-0.32679284503894385,
5.602161198855116,
5.098649721637451,
-0.4434654615252943,
0.958688640259473,
-0.15767580972070405,
1.6405488923696567]]},
'outputs': [{'Float': {'data': [0.0001131892204284668],
'dim': [1, 1],
'v': 1}}],
'pipeline_name': 'fraud-shadow',
'time': 1643992231446})]
[13]:
p.logs()
[13]:
Timestamp | Output | Input | Anomalies |
---|---|---|---|
2022-04-Feb 16:30:24 | [0.001497417688369751] | [[1.0678324729342086, 0.21778102664937624, -1.7115145261843976, 0.6822857209662413, 1.0138553066742804, -0.43350000129006655, 0.7395859436561657, -0.28828395953577357, -0.44726268795990787, 0.5146124987725894, 0.3791316964287545, 0.5190619748123175, -0.4904593221655364, 1.1656456468728569, -0.9776307444180006, -0.6322198962519854, -0.6891477694494687, 0.17833178574255615, 0.1397992467197424, -0.35542206494183326, 0.4394217876939808, 1.4588397511627804, -0.3886829614721505, 0.4353492889350186, 1.7420053483337177, -0.4434654615252943, -0.15157478906219238, -0.26684517248765616, -1.454961775612449]] | 0 |
2022-04-Feb 16:30:31 | [0.0001131892204284668] | [[16.309736909599135, 6.599583312777217, -1.7115145261843976, 3.843584873057595, 1.8125832479102169, -0.43350000129006655, 17.30937776667656, -0.2807742641284464, 6.2609333637166005, 9.713576049131042, 6.0047041206552, 10.752836139413064, -0.4774067823292544, 4.1756599181627845, -0.6972038718887318, 1.3471917248192462, 0.7307732879172534, 0.17833178574255615, 1.960185481479672, 1.3272984903998677, 1.6581661476267306, 13.303105662486775, -0.32679284503894385, 5.602161198855116, 5.098649721637451, -0.4434654615252943, 0.958688640259473, -0.15767580972070405, 1.6405488923696567]] | 1 |
We can see in the logs that the production model has drifted significantly from its baseline sanity check.
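The arithmetic behind the no_model_drift check is simple enough to state directly. The sketch below applies the same 0.05 tolerance to pairs of scores in plain Python. The baseline scores are hypothetical, since the logs above show only the production model's output, and within_tolerance is an illustrative helper rather than part of the SDK.

```python
TOLERANCE = 0.05  # same tolerance as the no_model_drift validation


def within_tolerance(production: float, baseline: float, tol: float = TOLERANCE) -> bool:
    """Return True when the two model scores differ by at most `tol`."""
    return abs(production - baseline) <= tol


# Production scores are from the inference results above; baseline scores are made up.
agreeing = within_tolerance(0.001497417688369751, 0.002)
drifted = within_tolerance(0.0001131892204284668, 0.9)

print(agreeing)  # True
print(drifted)   # False
```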
[14]:
p.undeploy()
[14]:
{'name': 'fraud-shadow', 'create_time': datetime.datetime(2022, 3, 3, 23, 7, 52, 323520, tzinfo=tzutc()), 'definition': '[{'ModelInference': {'models': [{'name': 'fraud-prod', 'version': '661c4e9b-43ec-414f-8f2a-0f3797053da3', 'sha': 'bc85ce596945f876256f41515c7501c399fd97ebcb9ab3dd41bf03f8937b4507'}, {'name': 'fraud-base', 'version': 'ca7cb8c5-32f7-419a-bf66-1e4af58111bd', 'sha': '054810e3e3ebbdd34438d9c1a08ed6a6680ef10bf97b9223f78ebf38e14b3b52'}]}}, {'Check': {'tree': ['{"values": {"no_model_drift": {"root": {"node": "binop", "op": "<=", "left": {"node": "fn", "fn": "abs", "arguments": [{"node": "binop", "op": "-", "left": {"node": "variable", "variant_id": {"name": "fraud-prod"}, "position": "output", "key": [0, 0]}, "right": {"node": "variable", "variant_id": {"name": "fraud-base"}, "position": "output", "key": [0, 0]}}]}, "right": {"node": "literal", "float": 0.05}}, "required_data": [{"name": "fraud-prod"}, {"name": "fraud-base"}]}}, "gauges": [], "validations": ["no_model_drift"]}']}}, {'AuditResults': {'from': 1, 'to': None}}, {'Nth': {'index': 0}}]'}
Alerting
Some failures are critical, and require immediate action. The SDK can be configured to send any number of notifications to responsible parties:
[15]:
from wallaroo import notify
notification = notify.Email(to='<placeholder>@example.com')
These notifications can then be attached to a relevant alert. Alerts are built directly on top of the same expression syntax used for validations. However, unlike validations, alerts must be defined on a pipeline-wide Aggregate. For example, the following alert fires if high fraud is inferred more than ten times in a five-minute window:
[16]:
alerted = client.upload_model('fraud-prod', 'keras_ccfraud.onnx')
p = client.build_pipeline('fraud-alerting')
p.add_model_step(alerted)
low_fraud = alerted.outputs[0][0] <= 0.95
high_fraud = alerted.outputs[0][0] > 0.95
p.add_validation('low_fraud', low_fraud)
p.add_alert('high_fraud_5m', fn.count(high_fraud, '5m', '1s') > 10, [notification])
p.deploy()
Waiting for deployment - this will take up to 45s ...... ok
[16]:
{'name': 'fraud-alerting', 'create_time': datetime.datetime(2022, 3, 3, 23, 13, 23, 22692, tzinfo=tzutc()), 'definition': '[{'ModelInference': {'models': [{'name': 'fraud-prod', 'version': 'bc4f9717-4e55-4019-9c9c-3fd84ea94e00', 'sha': 'bc85ce596945f876256f41515c7501c399fd97ebcb9ab3dd41bf03f8937b4507'}]}}, {'Check': {'tree': ['{"values": {"high_fraud_5m:left": {"root": {"node": "fn", "fn": "count", "arguments": [{"node": "binop", "op": ">", "left": {"node": "variable", "variant_id": {"name": "fraud-prod"}, "position": "output", "key": [0, 0]}, "right": {"node": "literal", "float": 0.95}}, {"node": "literal", "timedelta": "5m"}, {"node": "literal", "timedelta": "1s"}]}, "required_data": [{"name": "fraud-prod"}]}}, "gauges": ["high_fraud_5m:left"], "validations": []}']}}, {'Check': {'tree': ['{"values": {"low_fraud": {"root": {"node": "binop", "op": "<=", "left": {"node": "variable", "variant_id": {"name": "fraud-prod"}, "position": "output", "key": [0, 0]}, "right": {"node": "literal", "float": 0.95}}, "required_data": [{"name": "fraud-prod"}]}}, "gauges": [], "validations": ["low_fraud"]}']}}]'}
functions.count here is a windowed aggregate function that collects metrics across the entire pipeline. The first argument is the condition to count, the second is the sliding window of time to operate on, and the final argument is how often the aggregate value is updated.
Note that aggregate functions cannot be used in validations, as they track whole pipeline metrics and are not currently available in per-inference gauges.
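To make the windowing idea concrete, here is a minimal pure-Python sketch of a sliding-window count. WindowedCount is a hypothetical class, not the implementation behind functions.count; it recomputes the value on demand instead of on a fixed update interval, and the one-per-second timing of the inferences is made up for illustration.

```python
from collections import deque
from datetime import datetime, timedelta


class WindowedCount:
    """Counts events that satisfied a condition within a sliding time window."""

    def __init__(self, window: timedelta):
        self.window = window
        self._hits = deque()  # timestamps of matching events, oldest first

    def observe(self, timestamp: datetime, matched: bool) -> None:
        if matched:
            self._hits.append(timestamp)

    def value(self, now: datetime) -> int:
        # Drop timestamps that have fallen out of the window ending at `now`.
        cutoff = now - self.window
        while self._hits and self._hits[0] <= cutoff:
            self._hits.popleft()
        return len(self._hits)


start = datetime(2022, 2, 4, 16, 31, 0)
counter = WindowedCount(timedelta(minutes=5))

# Fifteen high-fraud inferences, one per second (hypothetical timing).
for i in range(15):
    score = 0.9811990261077881
    counter.observe(start + timedelta(seconds=i), score > 0.95)

now = start + timedelta(seconds=15)
print(counter.value(now) > 10)                          # True: alert condition holds
print(counter.value(now + timedelta(minutes=10)) > 10)  # False: events have aged out
```

A count like this depends on the history of many inferences, which is consistent with the note above that aggregates are not available to per-inference validations.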
With this alert configured, we can test it by sending in a burst of apparent fraud:
[17]:
for _ in range(15):
    p.infer_from_file('./dev_high_fraud.json')
Now, we can see the fraudulent entry failing our validation in the logs:
[18]:
p.logs(limit=1)
[18]:
Timestamp | Output | Input | Anomalies |
---|---|---|---|
2022-04-Feb 16:31:58 | [0.9811990261077881] | [[1.0678324729342086, 18.155556397512136, -1.658955105843852, 5.2111788045436445, 2.345247064454334, 10.467083577773014, 5.0925820522419745, 12.82951536371218, 4.953677046849403, 2.3934736228338225, 23.912131817957253, 1.7599568310350209, 0.8561037518143335, 1.1656456468728569, 0.5395988813934498, 0.7784221343010385, 6.75806107274245, 3.927411847659908, 12.462178276650056, 12.307538216518656, 13.787951906620115, 1.4588397511627804, 3.681834686805714, 1.753914366037974, 8.484355003656184, 14.6454097666836, 26.852377436250144, 2.716529237720336, 3.061195706890285]] | 2 |
Also, we’ll receive an email notification that the alert is active.
[19]:
p.undeploy()
[19]:
{'name': 'fraud-alerting', 'create_time': datetime.datetime(2022, 3, 3, 23, 13, 23, 22692, tzinfo=tzutc()), 'definition': '[{'ModelInference': {'models': [{'name': 'fraud-prod', 'version': 'bc4f9717-4e55-4019-9c9c-3fd84ea94e00', 'sha': 'bc85ce596945f876256f41515c7501c399fd97ebcb9ab3dd41bf03f8937b4507'}]}}, {'Check': {'tree': ['{"values": {"high_fraud_5m:left": {"root": {"node": "fn", "fn": "count", "arguments": [{"node": "binop", "op": ">", "left": {"node": "variable", "variant_id": {"name": "fraud-prod"}, "position": "output", "key": [0, 0]}, "right": {"node": "literal", "float": 0.95}}, {"node": "literal", "timedelta": "5m"}, {"node": "literal", "timedelta": "1s"}]}, "required_data": [{"name": "fraud-prod"}]}}, "gauges": ["high_fraud_5m:left"], "validations": []}']}}, {'Check': {'tree': ['{"values": {"low_fraud": {"root": {"node": "binop", "op": "<=", "left": {"node": "variable", "variant_id": {"name": "fraud-prod"}, "position": "output", "key": [0, 0]}, "right": {"node": "literal", "float": 0.95}}, "required_data": [{"name": "fraud-prod"}]}}, "gauges": [], "validations": ["low_fraud"]}']}}]'}