Most companies today realize that their data is an asset they should exploit to make better decisions. Modern data science provides powerful tools for building predictive models using machine learning and AI that can support this process. However, those tools require extensive technical expertise to wield. Given the significant cost and the shortage of specialized data scientists, companies have not been able to fully leverage the data they’ve collected to build predictive models that meaningfully improve their decision-making.
This raises an important question: Can we make these tools accessible to business experts so they can build predictive models without needing a technical expert?
To answer this question, we will look at two areas: Automation and Interaction.
Automation: When building a predictive model, there are a lot of technical decisions to be made. AI research has long been concerned with trying to use data to make these decisions automatically. Even skilled experts use automation tools because these tools allow them to discover better models faster.
Interaction: Even with a fully automated algorithm, there is still a lot of context and business knowledge that might influence how to build the best predictive model for a specific target. Today, as a business expert, you cannot incorporate your knowledge into a predictive model directly. You can either use a data-driven model that ignores your knowledge, use a handcrafted model based on formulas that do not exploit your data, or work with a technical expert who will try to improve the forecasting model based on your feedback. A more efficient approach would be for business experts to interact directly with a predictive model and improve it without needing to understand the technical details.
Let’s dive into these two areas and explore the challenges and opportunities.
Automation
Automated machine learning (AutoML)
There is a long history of attempts to automate parts of the data science workflow. The most prominent and well-developed area is probably AutoML. The goal of AutoML is to automatically choose the best machine learning algorithm for a task, as well as the best configuration (hyperparameters) for that algorithm. Many AutoML approaches also support selecting the right features to include in a model.
There are versions of AutoML for small data and for huge data. One approach, TabPFN, focuses on getting very fast results for datasets with at most 1,000 training examples, 100 features, and 10 classes. There is the classic auto-sklearn, which is a version of the popular scikit-learn machine learning library that removes the need to manually select an algorithm and its hyperparameters. And for deep learning, Neural Architecture Search (NAS) automates the design of artificial neural networks.
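At its core, AutoML is a search over candidate algorithms and their hyperparameters, scored on held-out data. The sketch below shows that idea in miniature, using toy models (a k-nearest-neighbour voter and a decision stump) and synthetic data; real AutoML systems like auto-sklearn search far larger spaces with smarter strategies than this brute-force loop.

```python
# Minimal sketch of the AutoML idea: try several algorithm +
# hyperparameter combinations and keep the one that scores best
# on held-out validation data. Models and data are toy stand-ins.
import random

random.seed(0)

def make_data(n):
    """Synthetic binary classification: label 1 when the two
    features sum to a positive number, plus 10% label noise."""
    data = []
    for _ in range(n):
        x = (random.uniform(-1, 1), random.uniform(-1, 1))
        y = 1 if x[0] + x[1] > 0 else 0
        if random.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

train, valid = make_data(200), make_data(100)

def knn_predict(x, k):
    """k-nearest-neighbour majority vote over the training set."""
    nearest = sorted(train, key=lambda p: (p[0][0] - x[0]) ** 2
                                          + (p[0][1] - x[1]) ** 2)[:k]
    return 1 if sum(y for _, y in nearest) * 2 > k else 0

def stump_predict(x, feature):
    """Decision stump: threshold a single feature at zero."""
    return 1 if x[feature] > 0 else 0

# The "search space": (algorithm, hyperparameter) candidates.
candidates = ([("knn", k) for k in (1, 5, 15)]
              + [("stump", f) for f in (0, 1)])

def accuracy(model, param):
    correct = 0
    for x, y in valid:
        pred = knn_predict(x, param) if model == "knn" else stump_predict(x, param)
        correct += (pred == y)
    return correct / len(valid)

best = max(candidates, key=lambda c: accuracy(*c))
print("best candidate:", best, "validation accuracy:", accuracy(*best))
```

The same pattern scales up: replace the toy models with real learning algorithms and the brute-force loop with Bayesian optimization, and you have the skeleton of a modern AutoML system.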
In finance, many advanced predictive analytics solutions now automatically choose the best from a set of predictive models for time-series forecasting. In practice, however, choosing the right model is only part of the solution. There are a lot of other decisions to be made:
- Identifying outliers and choosing how to interpolate them
- Preprocessing the data to handle seasonalities
- Reshaping data into a format suitable for machine learning
- Transforming data to extract useful features
- Collecting, cleaning, and mapping data about other indicators that can serve as features for the prediction target
These steps make up a large part of the effort to set up a performant predictive model – up to 80%, depending on whom you ask.
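To make these steps concrete, here is a hedged sketch of three of them for a monthly time series: interpolating outliers, removing a simple yearly seasonality, and building lag features for a learning algorithm. The thresholds, the 12-month period, and the choice of lags are illustrative assumptions, not a prescription.

```python
# Toy preprocessing pipeline for a monthly series: clean outliers,
# remove seasonality, and turn the series into supervised rows.

def interpolate_outliers(series, z=3.0):
    """Replace points far from the mean with the average of their neighbours."""
    mean = sum(series) / len(series)
    std = (sum((v - mean) ** 2 for v in series) / len(series)) ** 0.5
    out = list(series)
    for i, v in enumerate(series):
        if std > 0 and abs(v - mean) > z * std:
            lo = series[i - 1] if i > 0 else v
            hi = series[i + 1] if i < len(series) - 1 else v
            out[i] = (lo + hi) / 2
    return out

def deseasonalize(series, period=12):
    """Subtract the average value of each calendar month across years."""
    month_means = [sum(series[i::period]) / len(series[i::period])
                   for i in range(period)]
    return [v - month_means[i % period] for i, v in enumerate(series)]

def lag_features(series, lags=(1, 2, 3)):
    """Turn the series into (features, target) rows: predict each
    value from the previous three values."""
    rows = []
    for t in range(max(lags), len(series)):
        rows.append(([series[t - l] for l in lags], series[t]))
    return rows

raw = [10, 12, 11, 300, 13, 12, 14, 13, 15, 14, 16, 15,   # 300 is a data glitch
       11, 13, 12, 14, 14, 13, 15, 14, 16, 15, 17, 16]
clean = interpolate_outliers(raw)
rows = lag_features(deseasonalize(clean))
print(len(rows), "training rows from", len(raw), "data points")
```

Each helper encodes a decision that a practitioner would otherwise make by hand, which is exactly the kind of work automated data science tries to take over.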
Automated data science
Automated data science is a more recent field that extends automation to these areas as well. The field is less uniform, with different approaches focusing on different parts of the data science workflow. Roughly, the goal of automated data science is to automatically assemble and test entire pipelines. As in AutoML, data is used to evaluate different pipelines and find good candidates.
This can be an issue when there is a limited amount of data available. For example, strategic monthly forecasts often have fewer than a hundred data points to learn from. Automatically choosing thousands of parameter values on such small datasets leads to models that over-optimize, modeling every random fluctuation in the data instead of capturing the underlying relationships (in machine learning, this is called overfitting). As a result, we cannot rely purely on data to choose the best setup.
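A tiny simulation makes the overfitting point tangible. In this sketch the target is pure noise around a constant, so the best possible model is the simple mean; a maximally flexible model that memorizes its 20 training points scores perfectly on them but generalizes worse. The data and models are synthetic stand-ins.

```python
# Why small data punishes flexible models: a 1-nearest-neighbour
# "memoriser" gets zero training error but loses to a simple mean
# predictor on fresh data, because it models the noise.
import random

random.seed(1)

def sample(n):
    # The target is a constant 5.0 plus noise; x carries no signal.
    return [(random.uniform(0, 1), 5.0 + random.gauss(0, 1)) for _ in range(n)]

train, test = sample(20), sample(200)

def mse(predict, data):
    """Mean squared error of a prediction function on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

def memoriser(x):
    """1-NN: return the target of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

mean_y = sum(y for _, y in train) / len(train)
def mean_model(x):
    return mean_y

print("memoriser  train/test MSE:", mse(memoriser, train), mse(memoriser, test))
print("mean model train/test MSE:", mse(mean_model, train), mse(mean_model, test))
```

With only 20 points, the gap between training and test error is the overfitting the text describes; an automated search that trusts the training score alone would pick the memoriser.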
One way around this is to look at data science best practices and encode them as rules to create sensible models out of the box. Using large language models, we can also automate some of the common-sense decisions that practitioners would make based on their experience.
Including business knowledge
Often, however, decisions are not just based on technical considerations but also need to consider the business context. At Predikt, we use a lot of market data to predict financial KPIs. With millions of potential market indicators, purely data-driven approaches might pick up coincidental correlations. We need to incorporate business knowledge on which indicators to include and how to include them to make sure that our forecasting models will deliver accurate predictions in the future.
AI can help us automatically include common-sense business knowledge while constructing our predictive models. The main difficulty is to find a way to translate that business knowledge into technical knobs or parameters. In technical terms, this typically comes down to setting specific parameters, adding constraints to machine learning algorithms, or adding regularization terms to the loss function.
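As one concrete illustration of "adding regularization terms", the sketch below fits a single coefficient but penalizes deviation from a value the business expert believes in. The prior value, the penalty strength, and the data are illustrative assumptions; the point is only how a belief becomes a term in the objective.

```python
# Encoding business knowledge as regularization: fit w in y ~ w * x
# by minimizing  sum (w*x - y)^2 + lam * (w - w_prior)^2,
# where w_prior is the expert's belief and lam is how much to trust it.

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]  # roughly y = 2x
w_prior = 1.0   # expert believes the indicator's effect is around 1.0
lam = 10.0      # strength of the expert prior (illustrative)

def fit(lam_value):
    """Closed-form minimizer of the regularized squared loss:
    setting the derivative to zero gives
    w = (sum x*y + lam * w_prior) / (sum x*x + lam)."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return (sxy + lam_value * w_prior) / (sxx + lam_value)

print("data-only fit:    ", fit(0.0))    # pure least squares
print("with expert prior:", fit(lam))    # pulled toward w_prior
```

Increasing `lam` moves the fitted coefficient from the purely data-driven estimate toward the expert's belief; the same mechanism underlies many of the constraint-based techniques mentioned later.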
Interaction
While it is possible to emulate common sense and high-level business knowledge to a certain degree, business users often have additional context and expertise. Unless we include this knowledge, our forecasting model will always be suboptimal. Today, this interaction is mostly facilitated by a technical expert. However, most teams do not have technical experts readily available to them. Most often, business users end up having to choose between a handcrafted model and a purely data-driven model that ignores their expertise.
Allowing business experts to interact directly with predictive models requires addressing two core challenges: 1) Business users will need to understand what the model is doing, and 2) Users need to be able to correct the model without technical knowledge.
Explaining predictive models
Predictive models have a lot of settings and parameters. Explaining these parameters requires translating their impact to a business context. Users need to understand how the model makes its predictions, whether the learned features and logical relationships make sense, and the implications of how the model is set up.
There has been a lot of work on explaining machine learning models in the field of Explainable AI. Bringing automated data science tools to business experts usually requires an extra step to make those explanations interpretable to non-technical users. Frameworks such as human-in-the-loop machine learning monitor how users work with a model’s outputs and use that signal to produce more useful results. In many cases, making explanations interpretable is application-specific. At Predikt, for example, we focus on creating intuitive visualizations and explanations for financial forecasts.
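One generic, model-agnostic explanation technique from the Explainable AI toolbox is permutation importance: shuffle one feature at a time and measure how much validation accuracy drops. Big drops mark the features the model actually relies on, which can then be translated into business terms ("this forecast leans heavily on indicator X"). The model and data below are toy stand-ins.

```python
# Permutation importance sketch: shuffle one feature at a time and
# measure the accuracy drop. Only feature 0 matters in this toy setup.
import random

random.seed(2)

data = []
for _ in range(300):
    x = [random.uniform(-1, 1) for _ in range(3)]
    data.append((x, 1 if x[0] > 0 else 0))  # label depends on feature 0 only

def model(x):
    """Stand-in for any trained black-box model."""
    return 1 if x[0] > 0 else 0

def accuracy(rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

def permutation_importance(rows, feature):
    """Accuracy drop when one feature's values are shuffled across rows."""
    shuffled_vals = [x[feature] for x, _ in rows]
    random.shuffle(shuffled_vals)
    permuted = []
    for (x, y), v in zip(rows, shuffled_vals):
        xs = list(x)
        xs[feature] = v
        permuted.append((xs, y))
    return accuracy(rows) - accuracy(permuted)

for f in range(3):
    print(f"feature {f}: importance {permutation_importance(data, f):.2f}")
```

Because the technique only needs predictions, not model internals, it works for any pipeline an automated system produces; the remaining work is presenting the resulting scores in language a business user recognizes.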
Improving predictive models
Once users understand how the predictive model makes predictions, they can spot faulty explanations. Enabling users to correct such mistakes and improve the model without having to understand which technical knobs to turn is an active area of research. The best approach will typically be application-specific, but large language models can help offer a generic interface that can adapt to the domain at hand.
There are a lot of interesting developments in this area, so I want to highlight a few approaches that I’ve encountered while working on automated data science in the past – this part will be a little more technical.
- In interactive machine learning, the key to building trust is to explain machine learning predictions and let users give intuitive feedback. For example, in image classification, the machine learning model might highlight what parts of an image were important for its classification. This allows users to spot when machine learning models are making decisions based on background information that might be specific to the dataset. Users can then erase parts of the image that should not be used, and the model is then updated through dataset augmentation or an adaptation of the loss function.
- At the Leuven ML lab, we designed an automated data science approach that lets users proactively give hints on how to preprocess and clean data in Excel sheets by selecting cells that belong together. These hints were then transformed into constraints that were combined with unsupervised information extracted from the data. Together, the constraints and extracted properties are used to build predictive pipelines automatically.
- Research on machine learning fairness has developed various approaches for eliminating biases from machine learning models. For example, some techniques add constraints or regularization terms to the machine learning process. Beyond encoding fairness constraints, these techniques can often be reused to constrain learning algorithms to respect feedback from users.
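A minimal sketch of the constraint-reuse idea from the last bullet: train a two-feature linear model by gradient descent, but after each step project the weights onto a user's constraint. Here the (hypothetical) feedback is "do not rely on feature 1", which the user flagged as a coincidental correlation; the data and setup are illustrative assumptions.

```python
# Respecting user feedback via projection: after each gradient step,
# project the weights onto the constraint "weight of feature 1 is 0".
# Fairness methods use the same trick with fairness constraints instead.

data = [((1.0, 0.9), 2.0), ((2.0, 2.1), 4.1),
        ((3.0, 2.9), 5.9), ((4.0, 4.2), 8.2)]  # feature 1 tracks feature 0 by coincidence

def train(constrain_feature=None, steps=2000, lr=0.01):
    w = [0.0, 0.0]
    for _ in range(steps):
        # Gradient of the mean squared error of y ~ w0*x0 + w1*x1.
        grad = [0.0, 0.0]
        for x, y in data:
            err = w[0] * x[0] + w[1] * x[1] - y
            grad[0] += 2 * err * x[0] / len(data)
            grad[1] += 2 * err * x[1] / len(data)
        w = [w[0] - lr * grad[0], w[1] - lr * grad[1]]
        if constrain_feature is not None:
            w[constrain_feature] = 0.0  # projection onto the user's constraint
    return w

print("unconstrained fit:     ", train())
print("user feedback applied: ", train(constrain_feature=1))
```

The business user never sees the projection step; they only flag the indicator, and the learning algorithm routes the predictive signal through the features that remain allowed.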
Collaborative data science
Data science is typically an iterative process in which practitioners try a model and analyze the predictions to determine whether they make sense and improve the model. Through automated and interactive tooling, we can make this feedback loop accessible to business experts and empower them to iteratively build predictive models based on both data and domain expertise.
Looking ahead, there is an exciting transition coming where automated and interactive data science will extend to multiple users. In practice, the best predictive models integrate knowledge from multiple business experts with different areas of expertise. AI software will be able to foster this collaboration. As a result, companies will be able to rely on predictive models that effectively harness the collective knowledge of their employees to make better strategic decisions. Let’s make that happen!