Development of a Machine Learning model based on feature selection to predict Volve production rate

Production forecasting is one of the most difficult tasks in petroleum engineering. There are various ways to predict future production; the most popular are decline curve analysis, rate transient analysis and reservoir simulation.

This article demonstrates a supervised machine learning technique, the random forest, to predict oil & gas production rates. It provides an alternative to model-based production forecasting, which requires high computational cost, time and, moreover, an experienced modeler.

The Jupyter Notebook and associated files are available on the Github page: discover-volve/AI_ProductionForecasting (github.com)

The code is also available in a separate article: https://discovervolve.com/2021/02/23/code-development-of-a-machine-learning-model-based-on-feature-selection-to-predict-volve-production-rate/

Workflow

In this article we develop a machine learning model using the Python-based scikit-learn library to predict the Volve production rate (oil and gas) from parameters such as pressure, temperature, choke size, etc. The production dataset is available here.

Machine learning (ML) algorithms generally follow the workflow below. The parameters that have the most effect on the outcome are extracted (feature extraction) for the ML algorithms, then the various trained models undergo parameter tuning on a validation dataset, and finally the selected model is tested on a test dataset.

Image: Workflow for using machine and deep learning algorithms

The algorithm receives data characterized by X variables (features: BHP, BHT, DP_Choke_size, etc.) and annotated with Y variables (targets: oil rate, gas rate). We then specify the type of model that the algorithm must learn and tune its hyperparameters using cross-validation: an optimization algorithm searches for the hyperparameters that give the best performance. Finally, the model is ready to be used on new data to predict the value of the target.
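As a minimal sketch of this setup (assuming the Volve daily production data has been exported to a CSV file; the file name and column list below are assumptions based on the columns discussed later in this article):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed file name; the daily production data has one row per well per day
df = pd.read_csv("Volve_production_data.csv")

# Candidate features (X) and target (Y); names assumed to follow the Volve export
feature_cols = ["ON_STREAM_HRS", "AVG_DOWNHOLE_PRESSURE", "AVG_DOWNHOLE_TEMPERATURE",
                "AVG_DP_TUBING", "AVG_WHP_P", "AVG_WHT_P",
                "DP_CHOKE_SIZE", "AVG_CHOKE_SIZE_P"]
target_col = "BORE_OIL_VOL"   # oil rate; BORE_GAS_VOL is handled the same way

X = df[feature_cols]
y = df[target_col]

# Hold out a test set; validation is handled later via cross-validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)
```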

Dataset

The production dataset used in this article is from the Volve field on the Norwegian continental shelf, covering around eight years of history. The inputs used in this work are:

  • Bottom hole pressure (BHP)
  • Tubing head pressure (THP)
  • Bottom hole temperature (BHT)
  • Well head temperature (WHT)
  • Pressure differential in casing (DP)
  • Choke size (CS)
Image: Workflow to implement the Machine Learning Model.

Exploratory Data Analysis (EDA)

Data exploration helps us identify the types of variables and the location of the input and target data.

It is useful to know the types of variables because, in a large dataset, we want to know where the numerical and character-based variables are located.

Image: Distribution of the types of variables in the Production dataset

It is better to take a small step forward by analyzing just two or three interesting variables rather than taking a large step back by creating a lot of graphs and getting lost completely in our analysis.

Missing Data

We display the entire dataset as a heat map to identify missing or null values. In the image, the oil production rate (BORE_OIL_VOL) and gas production rate (BORE_GAS_VOL) values are missing in the rows where the water injection volume (BORE_WI_VOL) is available.
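One way to produce such a heat map (a sketch, continuing with the DataFrame df loaded earlier) is to plot the boolean null mask with seaborn:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# True where a value is missing; the mask is drawn as a heat map so gaps stand out
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap="Greys")
plt.title("Missing values in the production dataset")
plt.show()
```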

Image: Heatmap indicating the null values present in the production dataset. The grey colored section is the missing data.
Table: The data distribution and types of wells in the Volve Oilfield

From the table above, we conclude that the 15/9-F-4 is a water injector well and we remove it from our dataset.

On the other hand, the 15/9-F-1 C and 15/9-F-5 wells are both producers and injectors. So we use the empirical cumulative distribution function (image below) to look at the distribution of the oil and gas production rates. For the 15/9-F-1 C well only about 40% of the oil & gas rate data are zero or NaN, while about 80% are NaN for the 15/9-F-5 well. So we keep well 15/9-F-1 C and eliminate well 15/9-F-5.
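A simple way to compute the empirical CDF per well (a sketch; the well-name column and labels are assumptions and may differ in the actual export):

```python
import numpy as np
import matplotlib.pyplot as plt

def ecdf(values):
    """Return sorted values and their cumulative fraction (empirical CDF)."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Assumed well-name column and well labels
for well in ["15/9-F-1 C", "15/9-F-5"]:
    rates = df.loc[df["NPD_WELL_BORE_NAME"] == well, "BORE_OIL_VOL"].fillna(0)
    x, y = ecdf(rates)
    plt.step(x, y, where="post", label=well)

plt.xlabel("Oil rate")
plt.ylabel("Cumulative fraction")
plt.legend()
plt.show()
```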

Image: Empirical cumulative distribution of the 15/9-F-1 C well

Data Cleaning

Examples of data that can be removed (a code sketch follows the list) are:

  1. Empty columns.
  2. Water injection volume.
  3. Categorical columns – Well Bore Code, NPD Field Code, NPD Field Name, NPD Facility Code etc.
  4. Negative values stored in Downhole pressure, Downhole Temperature, Bore Oil Volume etc.
  5. Zero values of target variables (Oil production rate & Gas production rate).
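A hedged sketch of these cleaning steps (the column names are assumed to match the Volve export; adjust as needed):

```python
# Remove the injector well 15/9-F-4 and the mostly-injecting well 15/9-F-5
df_clean = df[~df["NPD_WELL_BORE_NAME"].isin(["15/9-F-4", "15/9-F-5"])].copy()

# Drop categorical / non-feature columns and the water injection volume
drop_cols = ["WELL_BORE_CODE", "NPD_FIELD_CODE", "NPD_FIELD_NAME",
             "NPD_FACILITY_CODE", "BORE_WI_VOL"]
df_clean = df_clean.drop(columns=[c for c in drop_cols if c in df_clean.columns])
df_clean = df_clean.dropna(axis=1, how="all")        # remove fully empty columns

# Remove physically impossible negative readings in key columns
for col in ["AVG_DOWNHOLE_PRESSURE", "AVG_DOWNHOLE_TEMPERATURE", "BORE_OIL_VOL"]:
    df_clean = df_clean[df_clean[col].isna() | (df_clean[col] >= 0)]

# Drop rows where the targets are zero (non-producing days)
df_clean = df_clean[(df_clean["BORE_OIL_VOL"] > 0) & (df_clean["BORE_GAS_VOL"] > 0)]
```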

Box Plot of Features

We use a box plot of the features for each wellbore to understand the statistics of the dataset, i.e. the distribution of each feature.

It is clear that the data is highly skewed and varies from well to well, so the missing values cannot be estimated with the mean.
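One way to draw such per-well box plots with seaborn (a sketch; the well-name column and the feature shown are assumptions):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One box per well for a given feature, e.g. average wellhead pressure
plt.figure(figsize=(10, 5))
sns.boxplot(data=df_clean, x="NPD_WELL_BORE_NAME", y="AVG_WHP_P")
plt.xticks(rotation=45)
plt.title("AVG_WHP_P per well")
plt.tight_layout()
plt.show()
```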

Understanding a Box plot

Image: Comparison of a boxplot of a nearly normal distribution and a probability density function (pdf) for a normal distribution

Image: Box Plot of the wells in Volve Oilfield relative to various properties

Correlations:

After filtering our dataset, we use the seaborn pairplot function to better understand our data.

The results of this function allow us to identify the fairly linear relations between variables that are strongly correlated (as shown by the correlation matrix below), such as DP_CHOKE_SIZE and AVG_WHP_P. It is recommended to keep only one feature from each strongly correlated pair.
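A sketch of both plots with seaborn (a subset of columns keeps the figures readable; the names are assumptions):

```python
import matplotlib.pyplot as plt
import seaborn as sns

cols = ["AVG_DOWNHOLE_PRESSURE", "AVG_WHP_P", "AVG_WHT_P",
        "DP_CHOKE_SIZE", "BORE_OIL_VOL"]

# Pairwise scatter plots to spot roughly linear relationships
sns.pairplot(df_clean[cols].dropna())
plt.show()

# Correlation matrix as an annotated heat map
corr = df_clean[cols].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```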

Image: Correlation matrix to find relations between features

Data Pre-processing

After identifying and removing unnecessary data, the data needs to be formatted and normalized before applying machine learning algorithms. The data pre-processing involves three major steps:

Data Imputation

A pandas DataFrame of the dataset is created and the missing values are replaced using the fillna method.
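A minimal sketch of this step (filling with the per-column median is an assumption, motivated by the skewed distributions seen in the box plots):

```python
# Replace remaining missing values column by column with the column median
numeric_cols = df_clean.select_dtypes(include="number").columns
df_clean[numeric_cols] = df_clean[numeric_cols].fillna(df_clean[numeric_cols].median())
```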

Data scaling:

It is important to normalize the quantitative data because many ML models rely on gradient descent, distance calculations or variance calculations, which are sensitive to feature scale. In this project we use min-max normalization:

x' = \frac{x - x_{min}}{x_{max} - x_{min}}
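With scikit-learn this is a one-liner using MinMaxScaler (a sketch; in practice the scaler should be fitted on the training split only to avoid data leakage):

```python
from sklearn.preprocessing import MinMaxScaler

# Scale each feature to [0, 1]: x' = (x - x_min) / (x_max - x_min)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(df_clean[feature_cols])
# Note: tree-based models such as random forest are not sensitive to scaling,
# but scaled features are useful for distance- or gradient-based algorithms.
```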

Feature engineering:

Features are selected based on their importance. Here, we find the most useful variables for our model in the dataset using statistical tests and the scikit-learn transformer SelectKBest.

We also fit a random forest model and use its feature_importances_ attribute to identify which variables contribute most to the splits of the random forest; a sketch of both selection steps follows.
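A sketch of both selection steps (using f_regression as the score function for SelectKBest is an assumption; the target shown is the oil rate):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression

X = df_clean[feature_cols]
y = df_clean["BORE_OIL_VOL"]

# Univariate statistical test: score each feature against the target
selector = SelectKBest(score_func=f_regression, k=5).fit(X, y)
print(dict(zip(feature_cols, selector.scores_.round(1))))

# Random forest feature importances as a second ranking
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False))
```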

Image: Feature Importance Histogram
Image: Cumulative Importance of features

Finally, ‘ON_STREAM_HRS’, ‘AVG_DOWNHOLE_PRESSURE’, ‘AVG_DP_TUBING’, ‘AVG_WHP_P’ and ‘AVG_WHT_P’ are selected as the important training features.

Model Training

We further optimize the model by finding the best combination of hyperparameters. Cross-validation with random search and GridSearchCV is used to find this combination. There is a loop between optimizing and validating the model: the procedure is repeated until the desired accuracy score is reached.

Grid search: set up a grid of hyperparameter values and, for each combination, train a model and score it on the validation data. In this approach every single combination of hyperparameter values is tried, which can be very inefficient!

Random search: set up a grid of hyperparameter values and select random combinations to train the model and score. The number of search iterations is set based on time/resources.
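A sketch of both search strategies on a RandomForestRegressor, using the features selected above (the parameter grid and split sizes are assumptions):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

selected = ["ON_STREAM_HRS", "AVG_DOWNHOLE_PRESSURE", "AVG_DP_TUBING",
            "AVG_WHP_P", "AVG_WHT_P"]
X_train, X_test, y_train, y_test = train_test_split(
    df_clean[selected], df_clean["BORE_OIL_VOL"], test_size=0.15, random_state=42)

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}

# Grid search: every combination is tried with 5-fold cross-validation
grid = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                    cv=5, scoring="r2", n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)

# Random search: only n_iter random combinations are tried
rand = RandomizedSearchCV(RandomForestRegressor(random_state=42), param_grid,
                          n_iter=10, cv=5, scoring="r2",
                          random_state=42, n_jobs=-1)
rand.fit(X_train, y_train)
print(rand.best_params_)
```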

Cross Validation

Cross-validation consists of training and evaluating the performance of the model on several possible splits of the dataset: training set (70%), validation set (15%), test set (15%).

The validation set is used to examine the model trained on the training dataset. If the results are not good, we go back, modify the model on the training data and re-examine it on the validation dataset. This continues as long as both the training and validation errors keep going down.

At some point the validation error starts going up while the training error is still going down; this is the point where the model starts to overfit and training should stop. Finally, we use the test data to evaluate the final model.

Here, the mean squared error (MSE) and the coefficient of determination (R²) are used to validate the results.
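A sketch of the evaluation on the held-out test set, using the best estimator from the grid search above:

```python
from sklearn.metrics import mean_squared_error, r2_score

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R2 :", r2_score(y_test, y_pred))
```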

Results:

After completing all the previous steps, the model is able to forecast production. In this study, the production rates (oil rate and gas rate) are the outputs of the trained model.

Image: The Actual and Predicted Oil rate of the Wells have been plotted indicating high accuracy in prediction.
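Such a comparison plot can be produced as follows (a sketch, continuing from the test split and predictions above):

```python
import matplotlib.pyplot as plt

# Overlay actual and predicted oil rates for the test samples
plt.figure(figsize=(10, 4))
plt.plot(y_test.to_numpy(), label="Actual oil rate")
plt.plot(y_pred, label="Predicted oil rate")
plt.xlabel("Test sample")
plt.ylabel("Oil rate")
plt.legend()
plt.show()
```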


