NERC Numerical Earth Science Modelling Short Course: Lecture 3 – Using subfunctions

Dr Simon A Mathias

Department of Earth Sciences

Durham University

This week’s leaning objectives
Cumulative distribution functions
Saving and loading data in MAT files
Ranking data using sort
Comparison with the normal distribution
Application of subfunctions
Assignment

This week’s leaning objectives

At the end of the week you should be able to:

Construct and interpret a cumulative distribution plot.
Rank data using sort.
Organise a script file using subfunctions.
Save and load data in mat files.

Cumulative distribution functions

In the last two weeks we have been learning how to import and manipulate data from excel spreadsheets. We have also applied some basic statistical procedures. This week we will develop our code further to develop a a probabilistic analysis of the Durham weather dataset. In this way, we will be applying similar techniques to different variables of interest. Application of subfunctions will make the task much easier.

But first, we will introduce some basic probability theory:

Consider a set of 5 measurements of a given variable, xn

%-----------
% n     xn
%-----------
% 1     35
% 2     27
% 3     30
% 4     10
% 5     25
%------------

If we rank these from small to large we get:

%-----------
% m     xm
%-----------
% 1     10
% 2     25
% 3     27
% 4     30
% 5     35
%------------

The probability of not exceeding a value, xm, in the future, P, can be estimated from the so-called Weibull plotting position:

where m is the associated rank number of the value, xm, and N is the number of values of x that have been previously observed.

For the table above we can therefore say:

%-----------------------------
% m     xm      P
%-----------------------------
% 1     10      1/6 = 0.17
% 2     25      2/6 = 0.33
% 3     27      3/6 = 0.50
% 4     30      4/6 = 0.67
% 5     35      5/6 = 0.83
%-----------------------------

Note that the need to add 1 to N is a way of avoiding the idea that it is impossible to exceed the largest value that has been previously observed.

The above table represents a discrete set of points from the so-called cumulative distribution function (CDF) of the data.

Open MATLAB.

Click on “File”, “New”, “Script”. The “Editor” window should now appear.

At the top of the window paste the following code.

function DurhamWeatherStatistics

%Import Durham weather data from excel
data=xlsread('DurhamWeather.xlsx','Sheet1');
%Extract years and store in yyyy
yyyy=data(:,1);
%Extract months and store in mm
mm=data(:,2);
%Extract monthly rainfall and store in rain
rain=data(:,6);
%Generate a column vector of the years of interest
YEAR=[1880:2011]';
%Count how many years we are interested in and store in M
M=numel(YEAR);
%Perform a loop for each year of interest
for i=1:M
    %Obtain the locations of all yyyy which are equal to the ith YEAR
    condition=yyyy==YEAR(i,1);
    ind=find(condition);
    %Store all rain data for year of interest in rainYEAR
    rainYEAR=rain(ind,1);
    %Calculate annual mean rainfall for that year
    %and store as the ith value of rainANNUAL
    rainANNUAL(i,1)=mean(rainYEAR,1);
end

Save the file as DurhamWeatherStatistics.m in the My_mfiles directory. Then run (by pressing the green arrow at the top of the editor window):

Again we have imported the Durham monthly rainfall time-series but this time generated a corresponding sequence of annual mean rainfall time-series.

Saving and loading data in MAT files

Now add the following:

%Save all the generated data in a MAT file
save Annualdata

%Load the generated data stored in the MAT file
load Annualdata

%Check which variables are available in the workspace
whos

  Name               Size            Bytes  Class      Attributes

  M                  1x1                 8  double               
  YEAR             132x1              1056  double               
  condition       1591x1              1591  logical              
  data            1591x7             89096  double               
  i                  1x1                 8  double               
  ind               12x1                96  double               
  mm              1591x1             12728  double               
  rain            1591x1             12728  double               
  rainANNUAL       132x1              1056  double               
  rainYEAR          12x1                96  double               
  yyyy            1591x1             12728  double

We have saved everything generated prior to the line “save Annualdata” in a file called Annualdata.mat. Now comment everything from the line starting with “data=xlsread(…” down to “save Annualdata”. This can be done by highlighting all the code of concern, rightclicking and choosing “comment”. All the text highlighted should now be preceeded with a % sign on each line and should be colored green. Now run.

Note that “whos” verifies that we are able access all the data in Annualdata.mat due the line “load Annualdata”. The neat thing about this is that loading the data in this way is much faster than regenerating the data from scratch each time.

Ranking data using sort

To generate the CDF for the annual mean rain data we first need to rank the data. The recommended MATLAB command for this purpose is sort. Type “help sort” in the Command Window to find out more.

The command, sort, can be used to rank the data in either the columns or rows of a 2D array.

Type the following in the Command Window

A=[2 5 8 6; 9 4 3 1; 7 12 10 11]
sort(A,1)

A =

     2     5     8     6
     9     4     3     1
     7    12    10    11


ans =

     2     4     3     1
     7     5     8     6
     9    12    10    11

Applying sort(A,1) has ranked all the columns in A from smallest to largest. Now try

sort(A,2)

ans =

     2     5     6     8
     1     3     4     9
     7    10    11    12

We have now ranked all the rows from smallest to largest

To generate and plot the CDF for Durham’s annual rainfall paste the following additional code into “DurhamWeatherStatistics.m”:

%Rank the rain from smallest to largest
RankedRain=sort(rainANNUAL,1);
%Count the number of data points
M=numel(RankedRain);
%Generate the rank numbers (i.e., numbers from 1 to M in increments of 1)
m=[1:M]';
%Calculate probability of non-exceedance using Weibull plotting position
P=m/(M+1);

figure
%Plot data as x-y scatter
plot(RankedRain,P,'.')
%Note that typing in '.' cause the results to be plotted as dots.
xlabel('Annual mean rainfall (mm/month)')
ylabel('Probability of non-exceedance')
title('Rainfall in Durham from 1880 to 2012')

Comparison with the normal distribution

Most of you will be familiar with calculating the mean, , of data. Many of you will also be familiar with the so-called standard deviation, $\sigma$ . The significance of these variables is best understood in the context of the normal distribution probability density function:

$f(x)=\frac{1}{\sigma \sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}$

The associated cumulative distribution function is obtained from

$P(x) = \int_{-\infty}^x f(\xi) d\xi = \frac{1}{2} erfc\left(\frac{\mu-x}{\sigma\sqrt 2}\right)$

where erfc denotes the so-called complementary error function. This, of course is already available in MATLAB. Type “help erfc” in the Command Window and find out more.

It is often interesting to compare an empirical cdf (as we have previously obtained from our rainfall data) with a theoretical normal cdf. Add the following code to “DurhamWeatherStatistics.m”

%Calculate mean
mu=mean(RankedRain,1);
%Calculate a standard deviation
sigma=std(RankedRain,[],1);

%Generate a set of 50 equally spaced points from the min to the max
rainMIN=min(RankedRain,[],1);
rainMAX=max(RankedRain,[],1);
%Type "help linspace" in Command Window to find out more.
x=linspace(rainMIN,rainMAX,50)';
%Evaluate Normal CDF
u=(mu-x)/sigma/sqrt(2);
Pnormal=erfc(u)/2;

%Plot results alongside empirical CDF
figure(2)
%Plot data as x-y scatter
plot(RankedRain,P,'.',x,Pnormal)
xlabel('Annual mean rainfall (mm/month)')
ylabel('Probability of non-exceedance')
title('Rainfall in Durham from 1880 to 2012')
legend('Empirical','Theoretical')

Application of subfunctions

Code can become much more organised and ultimately shorter when one seeks to reduce a programme into a sequence of subfunctions.

For example paste the following code into a new script and save and run in the directory, My_mfiles

function SomeAlgebra

%Generate a set of 20 equally spaced points from 0 to 10
x=linspace(0,10,20)';
%Define parameter values
a=5;
b=10;
c=0.5;
%Apply both formulae to get y and z
y=b*sin(c*x)+a;
z=y./x;
%Plot results
figure
plot(x,y,x,z)

Now replace the code with the two functions below and run.

function SomeAlgebra
%Generate a set of 20 equally spaced points from 0 to 10
x=linspace(0,10,20)';
%Define parameter values
a=5;
b=10;
c=0.5;
[y,z]=MySubFun(x,a,b,c);
%Plot results
figure
plot(x,y,x,z)
function [y,z]=MySubFun(x,a,b,c)

%Apply both formulae to get y and z
y=b*sin(c*x)+a;
z=y./x;

Note that exactly the same result has been achieved. What we have done is partitioned part of the code into a piece of code underneath called a subfunction. The subfunction has input arguments, denoted in the (), and output arguments denoted in the []. The code within the subfunction only knows about the data given as input arguments. The code that accesses the subfunction only knows about the information given in the output arguments.

An important fact is that the names of input and output arguments in the subfunction do not need to be the same as those corresponding arguments being used in the the function accessing the subfunction. For example the following code will perform exactly the same task as above.

function SomeAlgebra
%Generate a set of 20 equally spaced points from 0 to 10
x=linspace(0,10,20)';
%Define parameter values
a=5;
b=10;
c=0.5;
[y,z]=MySubFun(x,a,b,c);
%Plot results
figure
plot(x,y,x,z)
function [y1,z1]=MySubFun(x1,a1,b1,c1)

%Apply both formulae to get y1 and z1
y1=b1*sin(c1*x1)+a1;
z1=y1./x1;

So in the subfunction the input arguments are called x1, a1, b1 and c1 but in the function accessing the subfunction, the names are different: x, a, b, c. The key thing is for the arguments to be in the correct order.

Now consider a slightly briefer version of DurhamWeatherStatistics.m

function DurhamWeatherStatistics

%Import Durham weather data from excel
data=xlsread('DurhamWeather.xlsx','Sheet1');
%Extract years and store in yyyy
yyyy=data(:,1);
%Extract months and store in mm
mm=data(:,2);
%Extract monthly rainfall and store in rain
rain=data(:,6);
%Generate a column vector of the years of interest
YEAR=[1880:2011]';
%Count how many years we are interested in and store in M
M=numel(YEAR);
%Perform a loop for each year of interest
for i=1:M
    %Obtain the locations of all yyyy which are equal to the ith YEAR
    condition=yyyy==YEAR(i,1);
    ind=find(condition);
    %Store all rain data for year of interest in rainYEAR
    rainYEAR=rain(ind,1);
    %Calculate annual mean rainfall for that year
    %and store as the ith value of rainANNUAL
    rainANNUAL(i,1)=mean(rainYEAR,1);
end
%Rank the rain from smallest to largest
RankedRain=sort(rainANNUAL,1);
%Count the number of data points
M=numel(RankedRain);
%Generate the rank numbers (i.e., numbers from 1 to M in increments of 1)
m=[1:M]';
%Calculate probability of non-exceedance using Weibull plotting position
P=m/(M+1);
figure(1)
%Plot data as x-y scatter
plot(RankedRain,P,'.')
xlabel('Annual mean rainfall (mm/month)')
ylabel('Probability of non-exceedance')
title('Rainfall in Durham from 1880 to 2012')

Lets put some of the analysis into a subfunction:

function DurhamWeatherStatistics
%Import Durham weather data from excel
data=xlsread('DurhamWeather.xlsx','Sheet1');
%Extract years and store in yyyy
yyyy=data(:,1);
%Extract months and store in mm
mm=data(:,2);
%Extract monthly rainfall and store in rain
rain=data(:,6);
%Generate a column vector of the years of interest
YEAR=[1880:2011]';
%Apply sub function to get CDF for annual mean monthly rainfall
[RankedRain,P]=GetCDF(YEAR,yyyy,rain);
figure(1)
%Plot data as x-y scatter
plot(RankedRain,P,'.')
xlabel('Annual mean rainfall (mm/month)')
ylabel('Probability of non-exceedance')
title('Rainfall in Durham from 1880 to 2012')

function [RankedRain,P]=GetCDF(YEAR,yyyy,rain)

%Count how many years we are interested in and store in M
M=numel(YEAR);
%Perform a loop for each year of interest
for i=1:M
    %Obtain the locations of all yyyy which are equal to the ith YEAR
    condition=yyyy==YEAR(i,1);
    ind=find(condition);
    %Store all rain data for year of interest in rainYEAR
    rainYEAR=rain(ind,1);
    %Calculate annual mean rainfall for that year
    %and store as the ith value of rainANNUAL
    rainANNUAL(i,1)=mean(rainYEAR,1);
end
%Rank the rain from smallest to largest
RankedRain=sort(rainANNUAL,1);
%Count the number of data points
M=numel(RankedRain);
%Generate the rank numbers (i.e., numbers from 1 to M in increments of 1)
m=[1:M]';
%Calculate probability of non-exceedance using Weibull plotting position
P=m/(M+1);

The major advantage of this is that now we can apply the same analysis to the temperature as well. For example:

function DurhamWeatherStatistics
%Import Durham weather data from excel
data=xlsread('DurhamWeather.xlsx','Sheet1');
%Extract years and store in yyyy
yyyy=data(:,1);
%Extract months and store in mm
mm=data(:,2);
%Extract monthly rainfall and store in rain
rain=data(:,6);
%Extract monthly mean daily maximum temperature (deg C) and store in tmax
tmax=data(:,3);
%Generate a column vector of the years of interest
YEAR=[1880:2011]';
%Apply sub function to get CDF for annual mean monthly rainfall
[RankedRain,P]=GetCDF(YEAR,yyyy,rain);
%Apply sub function to get CDF for annual mean daily max temperature
[RankedTmax,P]=GetCDF(YEAR,yyyy,tmax);
figure(1)
subplot(1,2,1)
%Plot data as x-y scatter
plot(RankedRain,P,'.')
xlabel('Annual mean rainfall (mm/month)')
ylabel('Probability of non-exceedance')
title('Rainfall in Durham from 1880 to 2012')
subplot(1,2,2)
plot(RankedTmax,P,'.')
xlabel('Annual mean daily max temp (^oC)')
ylabel('Probability of non-exceedance')
title('Temperature in Durham from 1880 to 2012')


function [RankedRain,P]=GetCDF(YEAR,yyyy,rain)

%Count how many years we are interested in and store in M
M=numel(YEAR);
%Perform a loop for each year of interest
for i=1:M
    %Obtain the locations of all yyyy which are equal to the ith YEAR
    condition=yyyy==YEAR(i,1);
    ind=find(condition);
    %Store all rain data for year of interest in rainYEAR
    rainYEAR=rain(ind,1);
    %Calculate annual mean rainfall for that year
    %and store as the ith value of rainANNUAL
    rainANNUAL(i,1)=mean(rainYEAR,1);
end
%Rank the rain from smallest to largest
RankedRain=sort(rainANNUAL,1);
%Count the number of data points
M=numel(RankedRain);
%Generate the rank numbers (i.e., numbers from 1 to M in increments of 1)
m=[1:M]';
%Calculate probability of non-exceedance using Weibull plotting position
P=m/(M+1);

Note how we are able to use GetCDF to look at tmax without having to change the variable names in the subfunction in GetCDF. Also we have used another new command called “subplot”. Type “help subplot” to find more.

Assignment

Extend DurhamWeatherStatistics.m further to plot the theoretical normal CDF for both the annual mean monthly rainfall and the annual mean daily maximum temperature using an additional subfunction.

Published with MATLAB® 7.13