Data doesn’t sit still; it’s constantly moving and growing. We know there’s valuable intelligence to be extracted from this data, which has led to the current boom in business analytics and related technologies, particularly machine learning. However, as we depend upon these analytics in our operations, we must also continually monitor data to understand when it changes enough to impact the validity of the analytics.
As any data scientist can tell you, developing a new analytics model begins with collecting, identifying, cleansing and normalizing the data to be analyzed. This is often the longest part of the process of building a model, as many data integrity checks need to be made to ensure proper data quality.
But the data issues don’t end when the model is developed. Why? Because the data used to create the model won’t actually be the data that needs to be analyzed. The quality of a model’s output–the insights–depends on the continued integrity of the data it’s fed. As the development dataset and the production data stop resembling one another, the model’s performance will degrade—and it will do so in a non-uniform manner. Data never seen in model development is what we need to be monitoring for in production.
Most analytics methodologies offer a way to ensure that a model developed on today’s data will generalize, that is, make good predictions when used to score future data. For instance, a “holdout sample” of the data available for modeling is set aside to validate the model. Once the model is in production, data scientists need to monitor data statistics, model score distributions and model performance to determine when the model may need to be updated based on changes in the production environment. Often, model performance is the key driver, but what do you do when you have no outcomes from which to compute model performance, as with the unsupervised models now used in many areas such as cybersecurity?
This is where understanding the validity of the model, and of the decisions that follow from it, depends critically on continually interrogating the production data to determine whether it is diverging from the data used to develop the analytics. To accomplish this, data scientists are turning to deep learning methods to answer specific questions about data divergence. Fortunately, artificial intelligence has evolved to meet this challenge: We have operationalized the use of “auto-encoders” to produce scores on how similar or dissimilar the production data is to the development data. Where there is dissimilarity, the models and analytics should be used with more caution, given that such data exemplars were not represented in model development.
Modeling the Changes in Data
Auto-encoders are themselves a type of neural network, but they are trained toward a different goal than the scoring networks used for prediction. A scoring network takes in raw data and, using a network of computational “neurons” or nodes, outputs a score. With auto-encoders, the process is similar but the output isn’t a score; it’s a version or “reconstruction” of the input data. Through unsupervised deep learning, the auto-encoder learns the latent features (a model) that reproduce the input development data.
This auto-encoder model is important because it can indicate which types of future data have and have not been seen during model development. When reconstruction errors are large, it means that the combinations of data elements being passed through the model in production are different from what was seen during model training. This could indicate that the scores used in decision making will be less accurate for certain types of data examples or that the model needs revision.
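To make the idea of reconstruction error concrete, here is a minimal sketch in plain Python. It is an illustration only, not the production approach described in this article: it assumes a deliberately tiny linear auto-encoder with a single latent unit and tied weights, trained on synthetic two-column data in which the second column is roughly twice the first. The function names, the data and the training settings are all hypothetical.

```python
import random

def train_linear_autoencoder(rows, epochs=50, lr=0.01):
    """Fit a tied-weight linear auto-encoder with one latent unit.

    encode: z = w . x        decode: x_hat = w * z
    Trained by stochastic gradient descent on squared reconstruction error.
    """
    w = [0.1] * len(rows[0])  # small deterministic starting weights
    for _ in range(epochs):
        for x in rows:
            z = sum(wi * xi for wi, xi in zip(w, x))        # encode
            resid = [xi - wi * z for wi, xi in zip(w, x)]   # x - x_hat
            w_dot_r = sum(wi * ri for wi, ri in zip(w, resid))
            # Gradient of ||x - w (w . x)||^2 w.r.t. w is -2 (z * resid + (w . resid) * x),
            # so stepping against it adds the term below.
            w = [wi + lr * 2 * (z * ri + w_dot_r * xi)
                 for wi, ri, xi in zip(w, resid, x)]
    return w

def reconstruction_error(w, x):
    """Squared distance between a record and its reconstruction."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return sum((xi - wi * z) ** 2 for wi, xi in zip(w, x))

# Synthetic "development" data (an assumption): second column is about 2x the first.
rng = random.Random(0)
dev_rows = []
for _ in range(200):
    a = rng.uniform(-1.0, 1.0)
    dev_rows.append([a, 2.0 * a + rng.gauss(0.0, 0.05)])

weights = train_linear_autoencoder(dev_rows)

# A record that follows the pattern seen in development...
familiar = reconstruction_error(weights, [0.5, 1.0])
# ...and one whose combination of values was never seen.
unfamiliar = reconstruction_error(weights, [0.5, -1.0])
print(f"familiar: {familiar:.4f}  unfamiliar: {unfamiliar:.4f}")
```

Each value in the unfamiliar record is ordinary on its own; only the combination is new, which is why a per-column statistic would not flag it while the reconstruction error does.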
A quick look at some real use cases for auto-encoders will help highlight their value.
Neural networks like the one mentioned earlier are often used by financial institutions to detect fraud. These networks analyze terabytes of transaction data from multiple countries around the world.
Fraud is the proverbial ‘needle in the haystack’ problem, and subtle changes in attack vectors are often undetectable by standard data statistics. But an auto-encoder can ferret out minute data integrity issues, and that is what brings the shimmer of these needles into sight. For example, a shift in transaction amounts for one company in Kazakhstan isn’t a change likely to impact the whole population, but it is an anomaly to explore as it could be evidence of fraud.
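A small sketch shows why a shift like this slips past population-level statistics. It does not implement an auto-encoder; it only illustrates the blind spot, using synthetic data. The merchant count, amounts and the shifted merchant ID are all made-up assumptions.

```python
import random
from statistics import mean

def make_amounts(rng, shifted_merchant=None):
    """Synthetic (merchant_id, amount) pairs: 500 merchants, 20 transactions each."""
    rows = []
    for merchant in range(500):
        centre = 300.0 if merchant == shifted_merchant else 100.0
        rows.extend((merchant, rng.gauss(centre, 20.0)) for _ in range(20))
    return rows

rng = random.Random(1)
dev = make_amounts(rng)                        # development-time data
prod = make_amounts(rng, shifted_merchant=42)  # one merchant's amounts have shifted

# Relative change in the mean amount, overall and for the shifted merchant only.
overall_change = mean(a for _, a in prod) / mean(a for _, a in dev) - 1.0
merchant_change = (mean(a for m, a in prod if m == 42)
                   / mean(a for m, a in dev if m == 42) - 1.0)

print(f"overall mean change:  {overall_change:+.1%}")
print(f"merchant mean change: {merchant_change:+.1%}")
```

The overall mean barely moves because the shifted merchant contributes only 20 of 10,000 transactions, while that merchant's own mean roughly triples.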
Cyber Security Analytics
Cyber security is one of the most pressing areas of concern to businesses today. The patterns of attacks change continually, and good-quality tagged data on attacks and breaches is often suspect or not collected at all.
Here unsupervised models are built with little to no historical data, so it is even more important to train an auto-encoder on the development data, which embodies the assumptions about how that data was gathered, simulated, or limited. The auto-encoder model runs alongside the unsupervised model and identifies patterns in the production data that violate the assumptions underlying the unsupervised model. The auto-encoder can also show where the unsupervised model is still working well after later shifts in the production data, say, this month compared to six months ago when the model was deemed to be working well.
Once the models are installed in the production environment, the auto-encoder monitors the reconstruction error regularly. When the auto-encoder model tells us that the error has grown too large, signaling that the production environment is varying considerably, a new version of the unsupervised model may need to be constructed, or different strategies may have to be adopted for small segments of data where the reconstruction error signals that we should not rely as heavily on the unsupervised model. Further, and of great benefit in the battle against cyber-attacks, the auto-encoder model spots new and often subtle patterns in the data to focus on in future model enhancements.
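The monitoring step might be sketched as follows. This is one possible arrangement, not a description of any particular product: it assumes the reconstruction errors have already been computed, takes the alert threshold from a high quantile of the development errors, and reports the share of records above that threshold for each segment. The function names, the 99% quantile and the 5% review limit are all hypothetical choices.

```python
from collections import defaultdict

def error_threshold(dev_errors, quantile=0.99):
    """Reconstruction error at the given quantile of the development data."""
    ordered = sorted(dev_errors)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]

def share_above_threshold(prod_records, threshold):
    """Map each segment to the fraction of its records whose error exceeds the threshold.

    prod_records is an iterable of (segment, reconstruction_error) pairs.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for segment, error in prod_records:
        totals[segment] += 1
        if error > threshold:
            flagged[segment] += 1
    return {seg: flagged[seg] / totals[seg] for seg in totals}

def segments_needing_review(shares, max_share=0.05):
    """Segments where too many records look unlike the development data."""
    return sorted(seg for seg, share in shares.items() if share > max_share)
```

A segment returned by `segments_needing_review` is one where the unsupervised model's output should be treated with more caution. If most segments are returned, that suggests the model as a whole may need rebuilding.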
Auto-encoders can help businesses understand the nearly undetectable data changes that can affect the accuracy of analytical models. Businesses that can do this will be able to adjust their business strategies when subtle data changes occur, and modify their analytics accordingly. This is an excellent way to extract more useful information relevant to how their business is performing now, instead of basing today’s decisions on yesterday’s data and business environment.