Machine Learning is one of the most talked-about terms in industry today, not specific to any one sector but used across all of them. ML is valued by practitioners for its predictive power, and it is here to stay. This blog post gives a real-life picture of how ML models can be built and deployed across industries.
ML derives value and insight from past data. Statistical techniques can find hidden correlations between an outcome variable and a set of explanatory variables, so the most important ingredient for developing an ML model is a repository of historical data. Organizations typically have a lot of this data in their databases. For example, a retail organization would have transaction data for sales by customer, region, store, item, price, week, day, discounts offered, promotions applied, competitor pricing and so on. Retail managers would be interested in knowing things like:
- Which customers should be targeted for a given promotion?
- How much in sales can I expect when I offer an item at a certain discount level?
- How much in sales can I expect given a specific competitor price?
A company's past data can provide fairly accurate predictions for the above scenarios. Based on the predictions from these analytical models, companies can derive further valuable information, such as the price point at which profit or revenue is maximized. Industries can therefore gain a lot of business insight and operational efficiency by putting their past data to work. This is not just for retail organizations: banks can use models built on past data to flag potential defaulters and to decide whether to issue a credit card or what the credit limit should be. ML can also flag banking transactions as fraudulent by looking at patterns found in past fraudulent transactions. Video recommendations on YouTube and the ads shown by the sites you visit are also applications of Machine Learning. Candidate use cases can be found in every industry.
Identifying the business problem to solve is the first step in considering a Machine Learning solution. Analyze the problem carefully to decide whether it is a good candidate for ML by asking questions like:
- Do we have sufficient past data readily available, or can it be procured?
- Will future transactions be similar to the past transactions?
Once we have established that it is a Machine Learning problem, the next task is to identify how the available data correlates with the outcome variable.
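As a first pass, a quick correlation check can surface which variables move with the outcome. The sketch below is a minimal example in pandas; the file name and column names (weekly_sales, price, etc.) are hypothetical placeholders for whatever the warehouse extract actually contains.

```python
# A minimal sketch: inspect how candidate variables correlate with the outcome.
# The file and column names are hypothetical.
import pandas as pd

sales = pd.read_csv("weekly_sales.csv")  # hypothetical extract of past transactions

# Correlation of every numeric column with the (hypothetical) outcome column
correlations = (
    sales.select_dtypes("number")
         .corr()["weekly_sales"]
         .sort_values(ascending=False)
)
print(correlations)
```

Correlation is only a rough guide (it misses non-linear effects), but it helps narrow down which fields in the warehouse are worth engineering further.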
The data we receive from production systems will be raw and at a granular level. Most organizations already have data warehouse systems that hold some form of aggregation over this raw data.
Data Scientists should ask what the outcome variable is and what the explanatory variables could be. Domain knowledge of the industry is needed at this step. For example, if the Data Scientist is trying to predict a sales figure, questions like what could contribute to the sale of a specific item should be answered. Things like the make of the product, the manufacturer, promotional price, discount level, method of advertising, seasonality, duration of the promotion, holiday effect, competitor price and so on will all play a part in determining the sales of a product. Is this information readily available in the data warehouse systems? In most cases it won't all be readily available, but by applying transformations we can derive most of the details we need. Once we have the information, we have to check the data for outliers, missing values and so on, and deal with them according to the approach we choose for our ML implementation. Sometimes we can model with missing data; sometimes it is better to remove it altogether; sometimes we can derive the missing values from the other information available to us. The same goes for outliers: we have to ask why the outlier exists. Is it valid data? Is it a measurement error? Answering these questions helps determine how to deal with them.
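To make this concrete, here is a minimal data-preparation sketch using pandas. The competitor_price and weekly_sales columns are assumed, and the 1.5×IQR rule is just one common convention for flagging outliers, not a prescription.

```python
# A minimal data-preparation sketch; column names and the 1.5*IQR rule are
# illustrative assumptions, not fixed choices.
import pandas as pd

sales = pd.read_csv("weekly_sales.csv")  # hypothetical extract

# See how much data is missing per column before deciding how to handle it
print(sales.isna().sum())

# One option: fill missing competitor prices with the median,
# and drop rows where the outcome itself is missing
sales["competitor_price"] = sales["competitor_price"].fillna(sales["competitor_price"].median())
sales = sales.dropna(subset=["weekly_sales"])

# Flag potential outliers with a simple IQR rule so they can be reviewed,
# rather than removed blindly
q1, q3 = sales["weekly_sales"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = sales[(sales["weekly_sales"] < q1 - 1.5 * iqr) |
                 (sales["weekly_sales"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} rows flagged as potential outliers")
```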
Once we have all the data available, we can consider an ML algorithm for making the predictions. There are many statistical algorithms to choose from, and factors like data size, processing time, interpretability and prediction accuracy all come into play when selecting a model for a prediction task. Often we have to evaluate several algorithms and pick the best one in terms of accuracy, interpretability, ease of deployment and so on.
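One common way to run such a comparison is with scikit-learn's cross_val_score. The sketch below compares a linear regression against a random forest; the feature and outcome columns continue the hypothetical retail example above.

```python
# A minimal sketch of comparing two candidate algorithms; the file, feature and
# outcome names are hypothetical, continuing the example above.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

sales = pd.read_csv("weekly_sales_prepared.csv")  # hypothetical prepared dataset
X = sales[["price", "discount", "competitor_price", "holiday_flag"]]
y = sales["weekly_sales"]

candidates = {
    "linear regression": LinearRegression(),  # simple and easy to interpret
    "random forest": RandomForestRegressor(n_estimators=200, random_state=42),  # often more accurate, harder to interpret
}

# 5-fold cross-validation; a lower RMSE is better
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    print(f"{name}: mean RMSE = {-scores.mean():,.0f}")
```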
The effectiveness of a model during training can be measured using cross-validation. From the available data we can hold out a portion as testing data, train the model on the training data, and validate its effectiveness by testing it on the testing data. Various metrics are available for measuring prediction accuracy.
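A bare-bones version of that split-and-score step with scikit-learn might look like this; it reuses the hypothetical X and y from the previous snippet and reports two common regression metrics.

```python
# A minimal sketch of a train/test split and accuracy check, reusing the
# hypothetical X and y defined in the previous snippet.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, predictions))
print("R^2:", r2_score(y_test, predictions))
```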
We can also score the historical data already in the database and see how the model performs on it.
Once we have developed a model, the next step is to deploy it, and there are different ways to do so. If it is a simple linear model, where there is a coefficient for each variable, we can store the coefficients in a database and do the scoring in production using those coefficients. For tree-based models, we can save the models in particular formats and load them during production scoring.
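As an illustration rather than a prescription, both options could look roughly like the sketch below, using pandas and joblib; the file names are made up and the fitted objects come from the earlier snippets.

```python
# A minimal deployment sketch covering both options above; file names are
# hypothetical and the fitted data comes from the earlier snippets.
import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# Option 1: a linear model reduces to its coefficients, which can be stored in
# a database table and applied as a weighted sum at scoring time (even in SQL).
linear_model = LinearRegression().fit(X_train, y_train)
coefficients = pd.DataFrame({
    "feature": list(X_train.columns) + ["intercept"],
    "weight": list(linear_model.coef_) + [linear_model.intercept_],
})
coefficients.to_csv("linear_coefficients.csv", index=False)  # or insert into a DB table

# Option 2: a tree-based model is serialized to a file and reloaded for scoring.
joblib.dump(model, "sales_random_forest.joblib")
scorer = joblib.load("sales_random_forest.joblib")
batch_predictions = scorer.predict(X_test)
```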
Deployed models should be periodically checked for performance (accuracy), and they should also be refreshed periodically with the latest data available to keep them up to date.
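One way such a periodic check might look in code is sketched below, assuming recent actuals can be pulled alongside the model's predictions; the tolerance and baseline numbers are purely illustrative.

```python
# A minimal monitoring sketch; the tolerance and baseline error are illustrative
# assumptions, and `scorer`, X_test, y_test come from the earlier snippets.
from sklearn.metrics import mean_absolute_error

def model_needs_refresh(model, recent_X, recent_y, baseline_mae, tolerance=0.15):
    """Return True if the error on recent data has drifted well above the training-time error."""
    recent_mae = mean_absolute_error(recent_y, model.predict(recent_X))
    return recent_mae > baseline_mae * (1 + tolerance)

# Run on a schedule (e.g. weekly); retrain on the latest data if accuracy has degraded.
if model_needs_refresh(scorer, X_test, y_test, baseline_mae=1200.0):
    scorer.fit(X_train, y_train)  # in practice, retrain on the most recent data window
```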