"Ensuring Privacy while Utilizing Potentially Sensitive Data for Machine Learning" by Sameer Wadkar and Dave Torok of Comcast

Location: 18th Floor


Our team at Comcast has developed a framework for operationalizing ML models. It covers the full ML lifecycle, from data ingestion, feature engineering, model training, and model deployment through model evaluation at runtime. We process roughly 3-5 billion predictions per day on this platform. The system supports proactive inference (triggered by event combinations on a stream) as well as reactive inference (on demand).
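As a rough illustration of those two modes, the sketch below shows proactive inference fired by an event combination observed on a stream alongside reactive inference on demand. This is not the actual framework; all names (`ModelRunner`, the trigger event set, the stand-in model) are hypothetical:

```python
from collections import defaultdict
from typing import Callable

class ModelRunner:
    """Wraps a trained model behind a single predict() call."""
    def __init__(self, model: Callable[[dict], float]):
        self.model = model

    def predict(self, features: dict) -> float:
        return self.model(features)

# --- Proactive mode: inference triggered by an event combination on a stream ---
TRIGGER = {"login_failure", "modem_reboot"}     # hypothetical event combination
seen_events: dict[str, set] = defaultdict(set)  # events observed per account

def on_stream_event(runner: ModelRunner, account: str, event: str) -> float | None:
    seen_events[account].add(event)
    if TRIGGER.issubset(seen_events[account]):  # combination observed -> infer
        seen_events[account].clear()
        return runner.predict({"account": account, "trigger": sorted(TRIGGER)})
    return None

# --- Reactive mode: inference on demand ---
def predict_on_demand(runner: ModelRunner, features: dict) -> float:
    return runner.predict(features)

if __name__ == "__main__":
    runner = ModelRunner(lambda f: 0.87)  # stand-in for a real model
    on_stream_event(runner, "acct-1", "login_failure")
    score = on_stream_event(runner, "acct-1", "modem_reboot")  # trigger fires here
    print(score, predict_on_demand(runner, {"account": "acct-2"}))
```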

ML models are designed to extract signals via complex feature engineering on raw data, which may contain potentially sensitive information, including PII. A solution such as ours must allow model developers to access this information for feature engineering while still ensuring that customer privacy is protected. We will describe how our framework manages this using a combination of methods, such as encryption, removal, anonymization, and aggregation, to protect privacy without losing model efficacy.
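To make those methods concrete, here is a minimal, hypothetical sketch of a sanitization step: removal (dropping a field entirely), anonymization (salted hashing as a stand-in for real tokenization or managed-key encryption), and aggregation (bucketing an exact value into a range). All field names and the salt are illustrative, not Comcast's actual scheme:

```python
import hashlib

SALT = b"demo-salt"                     # hypothetical; real salts live in a key store
DROP_FIELDS = {"ssn"}                   # removal: never leaves ingestion
HASH_FIELDS = {"email", "mac_address"}  # anonymization: stable pseudonyms

def pseudonymize(value: str) -> str:
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:16]

def bucket_age(age: int) -> str:
    """Aggregation: coarsen an exact value into a range."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def sanitize(record: dict) -> dict:
    out = {}
    for key, value in record.items():
        if key in DROP_FIELDS:
            continue                       # removal
        if key in HASH_FIELDS:
            out[key] = pseudonymize(value)  # anonymization
        elif key == "age":
            out[key] = bucket_age(value)    # aggregation
        else:
            out[key] = value
    return out

if __name__ == "__main__":
    raw = {"email": "a@b.com", "ssn": "123-45-6789", "age": 37, "plan": "x1"}
    print(sanitize(raw))  # ssn removed, email hashed, age bucketed to '30-39'
```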

We will also describe how we support a consistent feature-engineering pipeline that processes both data at rest (for model training) and data on a stream (for model inference). Consistent feature engineering, coupled with appropriate governance processes, is another safeguard that helps ensure user privacy.
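A toy sketch of that idea, with hypothetical feature and field names: the feature logic is defined once and invoked unchanged by both the batch (data-at-rest) path and the streaming path, so training and inference see identical features by construction:

```python
from typing import Iterable, Iterator

def engineer_features(event: dict) -> dict:
    """Single definition of the feature logic, shared by both paths."""
    return {
        "account": event["account"],
        "bytes_per_session": event["bytes"] / max(event["sessions"], 1),
        "is_peak": 18 <= event["hour"] < 23,
    }

# Batch path: materialize features over historical records for model training.
def batch_features(records: Iterable[dict]) -> list[dict]:
    return [engineer_features(r) for r in records]

# Stream path: apply the identical logic per event for online inference.
def stream_features(events: Iterator[dict]) -> Iterator[dict]:
    for event in events:
        yield engineer_features(event)

if __name__ == "__main__":
    history = [{"account": "a1", "bytes": 5_000, "sessions": 4, "hour": 20}]
    print(batch_features(history))
    print(next(stream_features(iter(history))))  # identical output by construction
```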

We will describe how individual model predictions are persisted to a database and protected from external access via traditional firewalling methods. Another model, called the Decision Engine, exposes an endpoint to external users; it securely retrieves and aggregates the predictions of the other models and uses them to provide recommendations to the end user. The Decision Engine endpoint takes only the user's account number as input, further shielding individual model predictions and their engineered features.
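A minimal, hypothetical sketch of that pattern: per-model predictions live in an internal store (standing in here for the firewalled database), and the only externally callable function takes just an account number and returns an aggregated recommendation, never the raw predictions. All names and thresholds are illustrative:

```python
from statistics import mean

# Internal, firewalled store: per-model predictions keyed by account number.
PREDICTION_STORE: dict[str, dict[str, float]] = {
    "acct-42": {"churn_model": 0.81, "outage_model": 0.12, "upsell_model": 0.66},
}

def decision_engine(account_number: str) -> dict:
    """The only externally exposed call; input is just the account number."""
    predictions = PREDICTION_STORE.get(account_number)
    if predictions is None:
        return {"account": account_number, "recommendation": "no-action"}
    score = mean(predictions.values())  # aggregate; parts are never exposed
    action = "retention-offer" if predictions["churn_model"] > 0.7 else "no-action"
    return {"account": account_number, "score": round(score, 2),
            "recommendation": action}   # raw per-model predictions stay internal

if __name__ == "__main__":
    print(decision_engine("acct-42"))
```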