Machine Learning for Post-Processing NWM Data

Decision trees and XGBoost

Author: Savalan Naser Neisary (PhD Student, CIROH & The University of Alabama)

1. Introduction

1.1. Overview of the Workshop’s Goals and Structure

This is a 60-minute workshop in which we will:

  • Understand the basics of machine learning and decision-tree algorithms.

  • Learn how to apply and train an XGBoost model for hydrological modeling.

  • Learn how to implement feature selection using the XGBoost algorithm.

We will first review the theoretical background of decision trees and the pros and cons of the most powerful decision-tree algorithms. Then we will start the hands-on part of the workshop by setting up our environments and retrieving the code and data from the GitHub repositories. Next, we will preprocess the data and begin model development using the XGBoost algorithm. After that, we will discuss feature selection and hyperparameter tuning (both manual and automated). Finally, we will evaluate the performance of XGBoost at different stations.

1.2. Post-processing Hydrological Predictions

Effective and sustainable management of water resources is crucial for providing an adequate water supply to human societies, regardless of their geographical location. Accurate and precise prediction of future hydrological variables, including streamflow, is a critical component of effective water systems management, and various studies have presented methods, such as post-processing, to increase the accuracy of hydrological predictions. Post-processing methods seek to quantify the uncertainties of hydrological model outcomes and correct their biases by using a statistical model to transform model outputs based on the relationship(s) between observations and model results. According to the literature, machine learning (ML) models have proven useful for post-processing the results of other ML or physically based hydrological models. Therefore, in this workshop we will use decision-tree algorithms, an ensemble subgroup of ML models, to post-process the streamflow outputs of a physically based model.

1.3. Post-processing Retrospective National Water Model (NWM) Streamflow Data

NOAA introduced the NWM to address the need for an operational large-scale hydrological forecasting model that provides streamflow predictions across CONUS. While it can predict streamflow in about 2.7 million river reaches, the literature shows that the NWM has low accuracy in regions west of the 95th meridian that experience drought and low-flow conditions, and in controlled basins with extensive water infrastructure. This low performance in western US watersheds stems from the lack of representation of water operations and of a comprehensive groundwater and snow model, as well as from the NWM being calibrated mostly on watersheds in the eastern US. To compensate for these shortcomings, in this workshop we will demonstrate how decision trees can increase the NWM's accuracy by post-processing its outputs and adding the impact of human activity.
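
Conceptually, post-processing can be framed as a supervised learning problem: the raw NWM streamflow (plus other predictors) form the features, and the observed streamflow is the target. Below is a minimal sketch of that framing; the file name and column names (station_data.csv, nwm_flow, precip, obs_flow) are hypothetical placeholders, not the workshop's actual dataset.

```python
# Minimal sketch of framing NWM post-processing as supervised learning.
# The file name and column names below are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("station_data.csv", parse_dates=["date"])

X = df[["nwm_flow", "precip"]]   # raw NWM streamflow plus other predictors
y = df["obs_flow"]               # observed (gauge) streamflow is the target

# Hold out a later period (no shuffling) to evaluate the post-processed predictions
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False, test_size=0.2)
```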

2. Theoretical Background

2.1. Decision-Trees Algorithm

A decision tree is a non-parametric supervised learning algorithm that is used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes. Decision trees can be viewed as recursively constructed multidimensional histograms. Decision tree learning employs a divide-and-conquer strategy, conducting a greedy search to identify the optimal split points within a tree. This splitting process is then repeated in a top-down, recursive manner until all, or the majority of, records have been classified under specific class labels. Hunt's algorithm, which was developed in the 1960s to model human learning in psychology, forms the foundation of many popular decision-tree algorithms, such as ID3, C4.5, and CART.
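
As a minimal illustration of this greedy, recursive splitting, the sketch below fits a small regression tree with scikit-learn and prints the learned hierarchy of split points. The data and feature names (precip, temp) are synthetic and purely illustrative.

```python
# Minimal sketch: fit a small regression tree with scikit-learn.
# The synthetic data and feature names are illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))                       # e.g., [precip, temp]
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 1, 200)   # synthetic target

tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)   # greedy, top-down search for the best split points

# Inspect the resulting hierarchy of root, internal, and leaf nodes
print(export_text(tree, feature_names=["precip", "temp"]))
```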


Advantages

  • Easy to interpret: The Boolean logic and visual representations of decision trees make them easier to understand and consume. The hierarchical nature of a decision tree also makes it easy to see which attributes are most important, which isn’t always clear with other algorithms, like neural networks.

  • Little to no data preparation required: Decision trees have a number of characteristics that make them more flexible than other classifiers. They can handle various data types, i.e., discrete or continuous values, and continuous values can be converted into categorical values through the use of thresholds. They can also handle data with missing values, which can be problematic for other classifiers, like Naïve Bayes.

  • More flexible: Decision trees can be leveraged for both classification and regression tasks, making them more flexible than some other algorithms. They are also insensitive to underlying relationships between attributes; this means that if two variables are highly correlated, the algorithm will choose only one of the features to split on.

Disadvantages

  • Prone to overfitting: Complex decision trees tend to overfit and do not generalize well to new data. This can be avoided through pre-pruning or post-pruning: pre-pruning halts tree growth when there is insufficient data, while post-pruning removes subtrees with inadequate data after the tree has been constructed (see the sketch after this list).

  • High variance estimators: Small variations within data can produce a very different decision tree. Bagging, or the averaging of estimates, can be a method of reducing variance of decision trees. However, this approach is limited as it can lead to highly correlated predictors.

  • More costly: Given that decision trees take a greedy search approach during construction, they can be more expensive to train compared to other algorithms.
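
To make the pruning idea concrete, here is a small sketch contrasting pre-pruning (depth and leaf-size constraints) with cost-complexity post-pruning in scikit-learn. The data are synthetic and the parameter values are illustrative, not recommendations.

```python
# Sketch: pre-pruning vs. cost-complexity post-pruning (scikit-learn, synthetic data).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(300, 2))
y_train = X_train[:, 0] ** 2 + rng.normal(0, 5, 300)

# Pre-pruning: halt tree growth early with depth / leaf-size constraints
pre_pruned = DecisionTreeRegressor(max_depth=4, min_samples_leaf=20).fit(X_train, y_train)

# Post-pruning: grow the full tree, then remove weak subtrees via cost-complexity pruning
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)
post_pruned = DecisionTreeRegressor(ccp_alpha=path.ccp_alphas[-2],  # a strong pruning level
                                    random_state=0).fit(X_train, y_train)
```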

2.2. Random Forest (RF) Algorithm

RF is a widely used machine learning algorithm developed by Leo Breiman and Adele Cutler. RF is built from decision trees and relies on the "wisdom of the crowd": it aggregates the results of a group of DTs. Using the results of more than one model is called an ensemble, so RF is an ensemble algorithm, since it aggregates the results of several DTs. Using an ensemble of decision trees can largely reduce overfitting and prediction variance, providing more accurate results. RF extends the bagging approach by generating a random subset of both samples and features when training each tree. While a DT considers all features when making decisions, the RF algorithm uses only a subset of features at each split, which can reduce the influence of highly correlated features on model predictions.
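
The sketch below shows this bagging-plus-feature-subsampling idea with scikit-learn. The data are synthetic and the parameter values are illustrative only.

```python
# Sketch: random forest = bagging over trees + a random feature subset at each split.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))                        # six illustrative predictors
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.3, 500)

rf = RandomForestRegressor(
    n_estimators=300,      # number of trees whose predictions are averaged
    max_features=0.5,      # each split considers only a random half of the features
    bootstrap=True,        # each tree is trained on a bootstrap sample of the rows
    oob_score=True,        # evaluate on the out-of-bag samples
    random_state=0,
)
rf.fit(X, y)
print("Out-of-bag R^2:", rf.oob_score_)
```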

Advantages

  • Reduced risk of overfitting: Decision trees run the risk of overfitting as they tend to tightly fit all the samples within training data. However, when there’s a robust number of decision trees in a random forest, the classifier won’t overfit the model since the averaging of uncorrelated trees lowers the overall variance and prediction error.

  • Provides flexibility: Since random forest can handle both regression and classification tasks with a high degree of accuracy, it is a popular method among data scientists. Feature bagging also makes the random forest classifier an effective tool for estimating missing values as it maintains accuracy when a portion of the data is missing.

  • Easy to determine feature importance: Random forest makes it easy to evaluate variable importance, or contribution, to the model. There are a few ways to evaluate feature importance. Gini importance, or mean decrease in impurity (MDI), measures how much each feature reduces impurity across the splits of the forest. Permutation importance, also known as mean decrease in accuracy (MDA), is another measure; it identifies the average decrease in accuracy obtained by randomly permuting a feature's values in the out-of-bag (OOB) samples (see the sketch below).
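
A minimal sketch of these two importance measures, assuming the fitted rf, X, and y from the previous random-forest sketch (any fitted forest and dataset would do):

```python
# Sketch: two common importance measures for a fitted random forest.
# Assumes `rf`, `X`, and `y` from the previous sketch.
from sklearn.inspection import permutation_importance

# Mean decrease in impurity (MDI / "Gini importance"): how much each feature
# reduces impurity across all splits, averaged over the trees.
print("MDI importances:", rf.feature_importances_)

# Permutation importance (mean decrease in accuracy, MDA): how much the score
# drops when a feature's values are randomly shuffled.
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print("Permutation importances:", perm.importances_mean)
```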

Disadvantages

  • Time-consuming process: Since random forest algorithms can handle large data sets, they can provide more accurate predictions, but can be slow to process data as they are computing data for each individual decision tree.

  • Requires more resources: Since random forests process larger data sets, they’ll require more resources to store that data.

  • More complex: The prediction of a single decision tree is easier to interpret when compared to a forest of them.

2.3. Extreme Gradient Boosting (XGBoost) Algorithm

XGBoost is an algorithm based on the boosting ensemble method, in which predictors are trained sequentially, each one trying to correct its predecessor. XGBoost fits each new predictor to the residual errors made by the previous predictors. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models. XGBoost has gained significant favor in the last few years, having helped individuals and teams win virtually every Kaggle structured-data competition.
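
As a minimal sketch of how this looks in practice for post-processing, the snippet below trains an XGBoost regressor on the hypothetical X_train / y_train / X_test / y_test splits from the earlier data-framing sketch. The hyperparameter values are illustrative placeholders, not tuned settings.

```python
# Sketch: XGBoost regressor for post-processing NWM streamflow.
# Assumes the hypothetical X_train / y_train / X_test / y_test splits built earlier.
import xgboost as xgb
from sklearn.metrics import mean_squared_error

model = xgb.XGBRegressor(
    n_estimators=500,              # number of sequentially added trees
    learning_rate=0.05,            # shrinks each new tree's contribution
    max_depth=5,                   # depth of each individual tree
    objective="reg:squarederror",  # standard regression loss
)

# Each new tree is fit to the residual errors of the current ensemble
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

corrected_flow = model.predict(X_test)                         # post-processed streamflow
print("RMSE:", mean_squared_error(y_test, corrected_flow) ** 0.5)
```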

Advantages

  • Gradient boosting is a relatively easy-to-read and interpretable algorithm, which makes most of its predictions easy to explain.

  • Boosting is a resilient and robust method that prevents and curbs overfitting quite easily.

  • XGBoost performs very well on small-to-medium, structured/tabular datasets with subgroups and a moderate number of features.

  • It is a great approach to reach for because the large majority of real-world problems involve classification and regression, two tasks where XGBoost excels.

Disadvantages

  • XGBoost does not perform so well on sparse and unstructured data.

  • A point that is often forgotten is that gradient boosting is very sensitive to outliers, since every new predictor is forced to fix the errors of its predecessors.

  • The overall method is hard to scale. Because each estimator bases its corrections on the previous predictors, the procedure is difficult to parallelize and streamline.