Solcast — Solar Power Forecasting

Predicting the hourly energy production of a photovoltaic module with an Artificial Neural Network.

Alex K
15 min read · Apr 4, 2021

Global Energy Crisis

The world has 2 big energy problems.

First, billions of people in LEDCs suffer from energy poverty. These individuals, the bulk of whom reside in countries with a GDP per capita of less than $25,000, lack access to modern energy sources, like electricity:

This access gap breeds disastrous consequences. When people don’t have access to the very modern energy sources necessary for cooking and heating, they rely on solid fuels — mostly firewood, dung, and crop waste. The usage of these alternative sources comes at a massive public health cost: indoor air pollution. For the poorest people on the planet, it’s the single largest risk factor for early death, leading to a total of 1.6 million deaths per year.

In Africa alone, some 600 million people lack access to electricity. These people typically rely on fuelwood as an alternative, particularly across East, Central, & West Africa, where it provides over half of the total energy. Regrettably, the FAO has recently reported that this reliance on fuelwood is the single most important driver of forest degradation in Africa.

More generally, a lack of access to electricity subjects people to a life with no refrigeration of food, no washing machine, no dishwasher, no light at night, and no electric central heating — simply put, a life stricken with poverty.

Children in Africa work under the illumination of a single work lamp — the epitome of energy poverty.

On the flip side of the coin, despite having overcome the problems associated with energy poverty, people in more developed countries face a challenge entirely of their own: they produce greenhouse gas emissions that are too high to be sustainable in the long run. These emissions have meant that Earth is warmer today than at any other point in human history. Shockingly, up to 200 million people could be displaced as a result of global warming.

The energy challenge the world faces today is therefore twofold: much of the world still lives in poor conditions and cannot afford sufficient energy, while those who have left this energy poverty behind rely excessively on fossil fuels to meet their energy needs. These twin challenges are actually two sides of one big problem: we lack large-scale energy alternatives to fossil fuels that are both cheap and sustainable.

We’re stuck between a proverbial rock and a hard place. We want to be in the turquoise region in this graph.

Fortunately, humanity has awakened to the troubling reality surrounding this looming energy crisis and made significant efforts to develop and deploy scalable renewable energy sources. In 2019, renewables made up a record-high 11.4% of global primary energy, more than double their share in 2001.

A Solar Solution

Indeed, within this booming shift to renewables, solar energy has taken the spotlight. Enough solar energy reaches the Earth every hour to cover the world’s energy demands for an entire year. That’s insane. As the most abundant and freely available energy resource, solar is widely considered the most promising candidate for sustainable bulk power generation out of all the renewables.

A 225-acre facility in Tucson, Arizona produces enough emission-free solar energy to power 5,620 homes.

In fact, dedicating only half of the American land currently leased by oil and gas — which is to say, just 0.5% of total land in the US — to solar panels would produce enough power to fully meet America’s energy needs. A crazy statistic like that does make one wonder: what’s stopping energy suppliers from shifting to solar and leveraging this untapped potential?

The answer lies in the fact that solar power generation is highly intermittent — that is to say, the level of solar output depends entirely on meteorological and temporal parameters. This breeds a significant challenge for companies looking to securely integrate photovoltaic (PV for short) systems into the smart grid: forecasting the solar power generated by a particular PV module, in a certain location, over a given time period.

Should solutions to this energy management quandary be insufficiently accurate, unexpected fluctuations in the level of solar power generated could adversely affect the health of the solar grid, the bottom line of the firm managing it, and by consequence, the quality of life of energy consumers.

A thick blanket of snow immobilizes this PV module — a problem which could leave millions without power.

Accuracy Maximization

The vast majority of current predictive algorithms rely exclusively on physical inputs, like humidity, temperature and wind speed. However, such systems are generally inaccurate, to the point that many energy suppliers remain uncomfortable about the prospect of transitioning to solar energy sources.

However, recent research has called attention to the benefits of implementing machine learning algorithms to supplement the usage of physical inputs in forecasting solar power generation — these systems are both model-based and data-driven. In spite of this blossoming interest, the realm of building and implementing such solutions remains mostly unexplored.

Accordingly, this project represents an investigation into this emerging synthesis of hard data and machine learning algorithms in the pursuit of predicting solar output — an attempt to answer the field’s burning question: how accurate can these hybrid predictive models be?

Introducing Solcast: a machine learning-based solar power forecasting model.

To this end, we will train Solcast — a model which uses an Artificial Neural Network (ANN) to predict the hourly solar power generation of a given PV module as accurately as possible. Solcast will be built in a Jupyter Notebook with Python, leveraging the TensorFlow implementation of the Keras API (a machine learning framework which enables the development of the ANN). Without further delay, let’s get building!

Importing the Libraries

The first item on our agenda involves importing various libraries which offer valuable functionality that isn’t built into Python. Leveraging these libraries allows our program to import and analyze our dataset (pandas, numpy), train Solcast (tensorflow, keras), and graphically visualize the results (matplotlib, seaborn). The full code, including the import statements for these libraries, can be found here.
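The exact import cell lives in the linked notebook; a minimal version, assuming the standard aliases, might look something like this:

```python
# Data handling
import numpy as np
import pandas as pd

# Model building and training
import tensorflow as tf
from tensorflow import keras

# Splitting, scaling, and evaluation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
```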

Importing the Dataset

We also have to load in the dataset that we’ll be using to build Solcast and validate its performance. The dataset, which can be found here, consists of the measurements taken by a specific PV module over the course of several years for a wide range of input parameters. With a cumulative total of 88473 data points, this dataset represents a rich bank of historical information, with which we can train an efficient, probabilistic, predictive model. This emerging approach stands in stark contrast with the imprudent strategy taken by current systems, which are deemed deterministic, insofar as they only consider the values of physical inputs at the exact moment of measurement.
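The loading step itself is a one-liner with pandas. The file name below is a placeholder for wherever the linked CSV is saved locally, not the dataset’s actual name:

```python
import pandas as pd

# Placeholder path: substitute the actual file downloaded from the link above
df = pd.read_csv("solar_generation_data.csv")

print(df.shape)  # expect (4213, 21): 4213 rows x 21 attributes = 88473 data points
```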

Indeed, this dataset contains 21 attributes, 20 of which are the input features, which range from average barometric pressure to the solar zenith angle. The final attribute — our target variable — quantifies the amount of electricity generated by the PV module in kW. The following heat map is created to visualize the 441 individual correlations between our 21 attributes:
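A heat map of this kind can be generated with seaborn in a few lines; a rough sketch, reusing the df loaded above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# 21 x 21 = 441 pairwise correlations between the attributes
corr = df.corr()

plt.figure(figsize=(14, 10))
sns.heatmap(corr, cmap="coolwarm")
plt.title("Correlations between the 20 input features and the target")
plt.show()
```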

Currently, the 20 input parameters and 1 target variable are in the same 2D array — a format which hinders training the model on the data. In order to resolve this, the data frame is reshaped into 2 new arrays: x, which corresponds to the 20 features, and y, which contains the 1 target variable. Each column in both arrays contains 4213 values, one per row of the dataset. The start and end format, as well as the code enabling this transformation, are depicted below:
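In code, the reshaping amounts to slicing the data frame into a feature matrix and a target vector. The target column name used below is an assumption; the actual dataset may label it differently:

```python
# Assumed name for the target column; adjust to match the dataset's header
TARGET = "generated_power_kw"

x = df.drop(columns=[TARGET]).values  # shape (4213, 20): the input features
y = df[TARGET].values                 # shape (4213,): hourly power generated in kW
```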

Splitting the Dataset

Next, we have to split this dataset of 4213 different rows into our training data, which, as the name suggests, we use to train the model itself, before obtaining an unbiased validation of Solcast’s performance with our testing data. We achieve this by leveraging sklearn’s train_test_split module, opting for the recommended 80:20 split between our training and testing data. This leaves 3370 and 843 rows in our training and testing data respectively:
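With sklearn, this reduces to a single call; fixing random_state simply keeps the split reproducible (the seed value is arbitrary):

```python
from sklearn.model_selection import train_test_split

# 80:20 split of the 4213 rows -> 3370 training rows and 843 testing rows
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)
print(x_train.shape, x_test.shape)  # (3370, 20) (843, 20)
```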

Feature Scaling

The final step we take before training and testing Solcast is feature scaling, which involves standardizing the features, which naturally span very different ranges, into a fixed range, such that no particular feature has a disproportionate effect on the model’s output. The following graphic depicts this standardization process, alongside the lines of code which implement it:
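Standardization here presumably means rescaling each feature to zero mean and unit variance; with sklearn’s StandardScaler, that looks roughly like this (fitting on the training data only, so no information from the test set leaks into the scaler):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)  # learn each feature's mean and std from the training data
x_test = scaler.transform(x_test)        # apply the same transformation to the test data
```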

Building the Model

With the data pre-processing component of our program completed, we can progress onto actually building Solcast. We do so by training an Artificial Neural Network (ANN), a computing system which loosely simulates the way human brains process information. ANNs are constructed from 3 types of layers:

  • The Input Layer brings the initial data into the system and passes it along for further processing by subsequent layers. Since our training data contains 20 unique features, the ANN’s input layer consists of 20 discrete input nodes — 1 for each input parameter.
  • The Hidden Layers multiply the received input values by learned weights and apply non-linear transformations to the results. This enables the ANN to identify more complex relationships in the data.
  • The Output Layer simply produces the outputs — the ANN’s prediction for the value corresponding to the quantity of power generated (in kW). In training the ANN, we’re aiming to minimize the discrepancy between the predicted and actual values of power generation.

The architecture of the ANN we train is depicted in the following diagram:

In order to structure the ANN as such, we leverage Keras, which is a high-level, open-source machine learning API written in Python and built on TensorFlow, an ML platform. We begin by defining create_solcast, a function which specifies the architecture of the ANN. In order to actually create Solcast, though, we must remember to call this function. Finally, we train the newly-created ANN for 250 epochs, with 32 samples propagated through the ANN per forward/backward pass. This 3-step process is outlined in the following illustration, along with the code used to enact it:
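As a sketch of that 3-step process, the snippet below assumes a small fully-connected network with two hidden layers of 64 ReLU units each; the hidden-layer sizes, the Adam/MSE compile settings, and the use of validation_data to track test-set error each epoch are illustrative assumptions rather than details taken from the diagram:

```python
import tensorflow as tf
from tensorflow import keras

def create_solcast():
    """Step 1: define and compile the ANN (hidden-layer sizes are illustrative)."""
    model = keras.Sequential([
        keras.layers.Input(shape=(20,)),            # 20 input nodes, 1 per feature
        keras.layers.Dense(64, activation="relu"),  # hidden layer 1
        keras.layers.Dense(64, activation="relu"),  # hidden layer 2
        keras.layers.Dense(1),                      # output: predicted power in kW
    ])
    model.compile(
        optimizer="adam",
        loss="mse",
        metrics=[keras.metrics.RootMeanSquaredError(name="rmse")],
    )
    return model

# Step 2: actually create Solcast by calling the function
solcast = create_solcast()

# Step 3: train for 250 epochs, 32 samples per forward/backward pass
history = solcast.fit(
    x_train, y_train,
    epochs=250,
    batch_size=32,
    validation_data=(x_test, y_test),
    verbose=2,  # one log line per epoch
)
```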

We train the ANN for 250 epochs specifically because this is approximately the point beyond which the marginal decrease in the model’s error for each additional epoch is practically negligible. In layman’s terms, adding more epochs beyond a certain point won’t produce any meaningful increase in our model’s performance, and that point happens to occur at about 250 epochs in this particular situation.

An excellent general-purpose method of measuring how the model’s ‘error’ varies as the number of epochs increases is Root-Mean-Square Error (RMSE); this technique involves finding the square root of the mean of the squared errors. To compute RMSE, the formula below is used:
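\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}

where y_i is the observed power output for sample i, \hat{y}_i is the corresponding prediction, and n is the number of samples.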

Since the verbose argument is set to 2, the program logs the RMSE once per epoch. These data points are then plotted, as shown in the following graph:
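Assuming the fit call in the sketch above stored its History object (with the RMSE metric named "rmse" and validation data supplied), the per-epoch curves can be plotted like so:

```python
import matplotlib.pyplot as plt

# history is the object returned by solcast.fit() above
plt.plot(history.history["rmse"], label="training RMSE")
plt.plot(history.history["val_rmse"], label="testing RMSE")
plt.xlabel("Epoch")
plt.ylabel("RMSE")
plt.legend()
plt.show()
```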

Results

An RMSE of 0.23 (the lower, the better) after our 250th epoch indicates a relatively close agreement between observed and predicted values. The model’s performance can be further evaluated by finding the R² score, a statistical measure which represents the proportion of the dependent variable’s variance (in our case, this is the power generated in kW) that is explained by the independent variables (our 20 input features). The standard deviation, which quantifies the spread of the target variable, is also provided:

The overall R² score works out to approximately 90.2%, indicating that Solcast was able to predict the hourly energy production of a PV module to a very high degree of accuracy. However, a 21.4 percentage point gap between the R² scores on the training and testing data suggests that the model suffered from overfitting (which occurs when the model describes the random fluctuations and stochastic noise in the data rather than the underlying relationships between the attributes) to a somewhat larger extent than we would’ve liked.
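The train/test comparison itself only takes a few lines with sklearn’s r2_score, reusing the model and splits from the sketches above:

```python
from sklearn.metrics import r2_score

# Predictions on the training and testing sets (flattened from shape (n, 1) to (n,))
train_preds = solcast.predict(x_train).flatten()
test_preds = solcast.predict(x_test).flatten()

print("Training R²:", r2_score(y_train, train_preds))
print("Testing R²: ", r2_score(y_test, test_preds))
```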

It is easier to perceive this phenomenon graphically. As depicted in the graph on the left, with an R² score of ~ 94.5%, the model excelled at predicting the solar output on the training data. The graph on the right, by contrast, highlights the model’s somewhat weaker (albeit reasonably strong in absolute terms) performance on just the testing data, with an R² score of ~ 73.1%:

Finally, we can use sklearn’s Lasso module to visualize the impact that each of our 20 features had in determining the extent of solar power generation. As seen in the graph below, just 4 features (shortwave radiation, solar zenith angle, solar azimuth angle, and the angle of incidence) were responsible for 66.5% of the total level of solar power generated, with none of the remaining features having more than a 4.8% bearing on the output:
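One rough way to build such a view is to fit a Lasso regression on the standardized features and compare the magnitudes of its coefficients. The alpha value and the normalization into percentage shares below are illustrative assumptions, not taken from the original notebook:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso

# Fit a Lasso regression on the standardized training data
lasso = Lasso(alpha=0.01)  # regularization strength chosen for illustration
lasso.fit(x_train, y_train)

# Express each feature's coefficient magnitude as a share of the total
importance = np.abs(lasso.coef_)
importance_pct = 100 * importance / importance.sum()

feature_names = df.drop(columns=[TARGET]).columns  # from the loading step above
plt.figure(figsize=(8, 6))
plt.barh(feature_names, importance_pct)
plt.xlabel("Relative importance (%)")
plt.tight_layout()
plt.show()
```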

Model Improvements

With an overall R² score of approximately 90.2%, the approach taken by Solcast to solar power forecasting has been validated. We should nonetheless be wary of resting on our laurels, as there are certainly several steps we could take to minimize the extent of the model’s overfitting:

  • Regularization, which involves supplementing the error function with an additional penalty term — this technique ensures that coefficients don’t take extreme values, thus augmenting the ANN’s ability to generalize. The following graph illustrates this process, where RSS is a measure of the amount of variance which isn’t explained by the regression model itself:
  • Early Stopping, which stops the training process as soon as the testing data error starts rising. This form of regularization, implemented with a simple callback in Keras, stands in contrast with the approach we took of training the ANN for a fixed number of epochs, and is clearly illustrated in the following diagram (a code sketch combining it with regularization follows after this list):
  • Data Augmentation, which entails artificially increasing the size of our dataset to provide more data points that can be used to train and test the model. For instance, one example from the wide-ranging pool of possible data augmentation techniques is SMOTE, which consists of synthesizing new examples for the minority class in particular. These methods could raise the number of rows in our dataset from 4213 to upwards of 10000:
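For the first two of these, Keras exposes kernel_regularizer arguments and an EarlyStopping callback. Below is a minimal sketch of how they could be wired into the network defined earlier; the L2 strength and patience values are illustrative, not tuned:

```python
from tensorflow import keras

def create_solcast_regularized():
    """Same architecture as before, with an L2 penalty on the hidden-layer weights."""
    model = keras.Sequential([
        keras.layers.Input(shape=(20,)),
        keras.layers.Dense(64, activation="relu",
                           kernel_regularizer=keras.regularizers.l2(0.001)),
        keras.layers.Dense(64, activation="relu",
                           kernel_regularizer=keras.regularizers.l2(0.001)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse",
                  metrics=[keras.metrics.RootMeanSquaredError(name="rmse")])
    return model

# Stop training once the test-set RMSE hasn't improved for 10 consecutive epochs
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_rmse", patience=10, restore_best_weights=True
)

model = create_solcast_regularized()
model.fit(
    x_train, y_train,
    epochs=250, batch_size=32,
    validation_data=(x_test, y_test),
    callbacks=[early_stop],
    verbose=2,
)
```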

Optimization Strategy

The top 4 input features identified represent a valuable set of priorities which firms in the energy management sector should take into account when deciding where to site their next solar facility. In doing so, firms must choose whether to optimize for raw solar power production or to maximize the removal of CO₂ from the atmosphere (one of the key goals motivating the development of solar facilities in the first place). Should companies adopt the first approach, 3 parameters should be considered:

  • Location: Although not designated as a specific factor in its own right in our dataset, the specific location one chooses to place PV modules underlies all 20 physical inputs. As such, it is critical to align the placement of solar power plants with the regions of the country which receive the greatest solar irradiance. Currently, energy suppliers are doing a pretty good job of positioning their PV modules to optimize for solar output. However, as seen here, there are still some notable areas for growth in terms of the placement of solar plants, as well as locations which aren’t maximizing the level of solar output:
  • Shape: Having selected the optimal geographical locations to place solar plants, energy suppliers must now decide what shape to give the PV modules within them to gain the maximum solar power. This involves considering sunrise & sunset positions as well as solar elevation angles. The general consensus is that a rectangular shape gets longer solar exposure during the day, and therefore gains more solar incidence, during the winter months, while the sun’s wider rise & set angles and higher solar horizon angle render a square shape optimal in the summer months:
  • Orientation: Having chosen the shape of the PV modules and where to place them, the final decision energy suppliers have to make is what angle to tilt them at. With solar zenith and azimuth alone accounting for over 40% of predicted solar output, this decision holds a lot of weight. As seen here, irradiance maps can be modelled at various tilt angles to produce potential maps according to the measured energy density:

Should, on the other hand, the removal of carbon dioxide from the atmosphere be prioritized over the maximization of the solar power produced by PV modules, energy suppliers must take a fourth factor into consideration:

  • Grid Emission Dirtiness. This parameter gauges the level of CO₂ emissions associated with the construction and operation of the solar power plants themselves: a crucial addition if the goal is to comprehensively map out the flow of CO₂ into & out of the atmosphere.

Representatives from 3 firms working to accelerate and smooth the transition to solar energy recently conducted a study in which they overlaid the first 3 parameters (all of which relate to solar potential) with the new grid emission dirtiness attribute, to map out the best geographical targets for building new capacity optimized for the removal of CO₂ from the atmosphere:

As seen above, the location of these best targets (in dark red) varies starkly from the areas in the US which receive the greatest solar irradiance, thus confirming the previously conjectured trade-off between optimizing for solar power production and optimizing for CO₂ removal. The following recommendation provides guidance to companies in the energy management sector as to the optimal locations to build their next solar facility, for both possible prioritizations:

A notable absence from both components of the above recommendation is California, which currently contains 43.5% of the 1,721 solar power plants in America. Although the state might very well be the so-called ‘champion of renewable energy’, it’s not a particularly strong candidate, for 2 reasons:

  • In terms of solar power production, the solar irradiance map portrays the counties of San Bernardino, Riverside, San Diego, and Imperial as ideal locations to build photovoltaic power stations; however, as depicted in the solar plant map, these 4 regions are already home to a large number of solar facilities, rendering them suboptimal sites for sourcing additional solar output.
  • If the removal of CO₂ is the priority, it must be noted that the most recent map paints solar plants in California as dirtier than those in other states. This is most likely because many of these power stations, such as the Ivanpah Solar Electric Generating System, burn natural gas to continuously maintain peak power generation. This preference for using renewables in tandem with fossil fuels makes California a less than ideal location for building new solar plants.
Ivanpah, the largest concentrating solar plant, uses natural gas, which renders its power generation scheme ‘dirty’.

Closing Thoughts

With solar energy expected to be the fastest growing renewable energy source from now to 2050, harnessing predictive power generation models will naturally play a critical role in smoothing the transition to a smarter grid. Machine learning-based models like Solcast are set to drastically improve the accuracy of these forecasting algorithms, representing a much-needed step toward securing cleaner, more reliable solar energy systems.


Alex K

17 y/o researcher in Machine Learning & Computational Biology.