# PreCourse 3

# NERC Numerical Earth Science Modelling Short Course: Lecture 3 – Using subfunctions

Dr Simon A Mathias

Department of Earth Sciences

Durham University

## Contents

- This week’s leaning objectives
- Cumulative distribution functions
- Saving and loading data in MAT files
- Ranking data using sort
- Comparison with the normal distribution
- Application of subfunctions
- Assignment

## This week’s leaning objectives

At the end of the week you should be able to:

- Construct and interpret a cumulative distribution plot.
- Rank data using sort.
- Organise a script file using subfunctions.
- Save and load data in mat files.

## Cumulative distribution functions

In the last two weeks we have been learning how to import and manipulate data from excel spreadsheets. We have also applied some basic statistical procedures. This week we will develop our code further to develop a a probabilistic analysis of the Durham weather dataset. In this way, we will be applying similar techniques to different variables of interest. Application of subfunctions will make the task much easier.

But first, we will introduce some basic probability theory:

Consider a set of 5 measurements of a given variable, xn

%----------- % n xn %----------- % 1 35 % 2 27 % 3 30 % 4 10 % 5 25 %------------

If we rank these from small to large we get:

%----------- % m xm %----------- % 1 10 % 2 25 % 3 27 % 4 30 % 5 35 %------------

The probability of not exceeding a value, xm, in the future, P, can be estimated from the so-called Weibull plotting position:

where m is the associated rank number of the value, xm, and N is the number of values of x that have been previously observed.

For the table above we can therefore say:

%----------------------------- % m xm P %----------------------------- % 1 10 1/6 = 0.17 % 2 25 2/6 = 0.33 % 3 27 3/6 = 0.50 % 4 30 4/6 = 0.67 % 5 35 5/6 = 0.83 %-----------------------------

Note that the need to add 1 to N is a way of avoiding the idea that it is impossible to exceed the largest value that has been previously observed.

The above table represents a discrete set of points from the so-called cumulative distribution function (CDF) of the data.

Open MATLAB.

Click on “File”, “New”, “Script”. The “Editor” window should now appear.

At the top of the window paste the following code.

function DurhamWeatherStatistics

%Import Durham weather data from excel data=xlsread('DurhamWeather.xlsx','Sheet1'); %Extract years and store in yyyy yyyy=data(:,1); %Extract months and store in mm mm=data(:,2); %Extract monthly rainfall and store in rain rain=data(:,6); %Generate a column vector of the years of interest YEAR=[1880:2011]'; %Count how many years we are interested in and store in M M=numel(YEAR); %Perform a loop for each year of interest for i=1:M %Obtain the locations of all yyyy which are equal to the ith YEAR condition=yyyy==YEAR(i,1); ind=find(condition); %Store all rain data for year of interest in rainYEAR rainYEAR=rain(ind,1); %Calculate annual mean rainfall for that year %and store as the ith value of rainANNUAL rainANNUAL(i,1)=mean(rainYEAR,1); end

Save the file as DurhamWeatherStatistics.m in the My_mfiles directory. Then run (by pressing the green arrow at the top of the editor window):

Again we have imported the Durham monthly rainfall time-series but this time generated a corresponding sequence of annual mean rainfall time-series.

## Saving and loading data in MAT files

Now add the following:

%Save all the generated data in a MAT file save Annualdata %Load the generated data stored in the MAT file load Annualdata %Check which variables are available in the workspace whos

Name Size Bytes Class Attributes M 1x1 8 double YEAR 132x1 1056 double condition 1591x1 1591 logical data 1591x7 89096 double i 1x1 8 double ind 12x1 96 double mm 1591x1 12728 double rain 1591x1 12728 double rainANNUAL 132x1 1056 double rainYEAR 12x1 96 double yyyy 1591x1 12728 double

We have saved everything generated prior to the line “save Annualdata” in a file called Annualdata.mat. Now comment everything from the line starting with “data=xlsread(…” down to “save Annualdata”. This can be done by highlighting all the code of concern, rightclicking and choosing “comment”. All the text highlighted should now be preceeded with a % sign on each line and should be colored green. Now run.

Note that “whos” verifies that we are able access all the data in Annualdata.mat due the line “load Annualdata”. The neat thing about this is that loading the data in this way is much faster than regenerating the data from scratch each time.

## Ranking data using sort

To generate the CDF for the annual mean rain data we first need to rank the data. The recommended MATLAB command for this purpose is sort. Type “help sort” in the Command Window to find out more.

The command, sort, can be used to rank the data in either the columns or rows of a 2D array.

Type the following in the Command Window

A=[2 5 8 6; 9 4 3 1; 7 12 10 11] sort(A,1)

A = 2 5 8 6 9 4 3 1 7 12 10 11 ans = 2 4 3 1 7 5 8 6 9 12 10 11

Applying sort(A,1) has ranked all the columns in A from smallest to largest. Now try

sort(A,2)

ans = 2 5 6 8 1 3 4 9 7 10 11 12

We have now ranked all the rows from smallest to largest

To generate and plot the CDF for Durham’s annual rainfall paste the following additional code into “DurhamWeatherStatistics.m”:

%Rank the rain from smallest to largest RankedRain=sort(rainANNUAL,1); %Count the number of data points M=numel(RankedRain); %Generate the rank numbers (i.e., numbers from 1 to M in increments of 1) m=[1:M]'; %Calculate probability of non-exceedance using Weibull plotting position P=m/(M+1); figure %Plot data as x-y scatter plot(RankedRain,P,'.') %Note that typing in '.' cause the results to be plotted as dots. xlabel('Annual mean rainfall (mm/month)') ylabel('Probability of non-exceedance') title('Rainfall in Durham from 1880 to 2012')

## Comparison with the normal distribution

Most of you will be familiar with calculating the mean, , of data. Many of you will also be familiar with the so-called standard deviation, . The significance of these variables is best understood in the context of the normal distribution probability density function:

The associated cumulative distribution function is obtained from

where erfc denotes the so-called complementary error function. This, of course is already available in MATLAB. Type “help erfc” in the Command Window and find out more.

It is often interesting to compare an empirical cdf (as we have previously obtained from our rainfall data) with a theoretical normal cdf. Add the following code to “DurhamWeatherStatistics.m”

%Calculate mean mu=mean(RankedRain,1); %Calculate a standard deviation sigma=std(RankedRain,[],1); %Generate a set of 50 equally spaced points from the min to the max rainMIN=min(RankedRain,[],1); rainMAX=max(RankedRain,[],1); %Type "help linspace" in Command Window to find out more. x=linspace(rainMIN,rainMAX,50)'; %Evaluate Normal CDF u=(mu-x)/sigma/sqrt(2); Pnormal=erfc(u)/2; %Plot results alongside empirical CDF figure(2) %Plot data as x-y scatter plot(RankedRain,P,'.',x,Pnormal) xlabel('Annual mean rainfall (mm/month)') ylabel('Probability of non-exceedance') title('Rainfall in Durham from 1880 to 2012') legend('Empirical','Theoretical')

## Application of subfunctions

Code can become much more organised and ultimately shorter when one seeks to reduce a programme into a sequence of subfunctions.

For example paste the following code into a new script and save and run in the directory, My_mfiles

function SomeAlgebra

%Generate a set of 20 equally spaced points from 0 to 10 x=linspace(0,10,20)'; %Define parameter values a=5; b=10; c=0.5; %Apply both formulae to get y and z y=b*sin(c*x)+a; z=y./x; %Plot results figure plot(x,y,x,z)

Now replace the code with the two functions below and run.

function SomeAlgebra %Generate a set of 20 equally spaced points from 0 to 10 x=linspace(0,10,20)'; %Define parameter values a=5; b=10; c=0.5; [y,z]=MySubFun(x,a,b,c); %Plot results figure plot(x,y,x,z) function [y,z]=MySubFun(x,a,b,c)

%Apply both formulae to get y and z y=b*sin(c*x)+a; z=y./x;

Note that exactly the same result has been achieved. What we have done is partitioned part of the code into a piece of code underneath called a subfunction. The subfunction has input arguments, denoted in the (), and output arguments denoted in the []. The code within the subfunction only knows about the data given as input arguments. The code that accesses the subfunction only knows about the information given in the output arguments.

An important fact is that the names of input and output arguments in the subfunction do not need to be the same as those corresponding arguments being used in the the function accessing the subfunction. For example the following code will perform exactly the same task as above.

function SomeAlgebra %Generate a set of 20 equally spaced points from 0 to 10 x=linspace(0,10,20)'; %Define parameter values a=5; b=10; c=0.5; [y,z]=MySubFun(x,a,b,c); %Plot results figure plot(x,y,x,z) function [y1,z1]=MySubFun(x1,a1,b1,c1)

%Apply both formulae to get y1 and z1 y1=b1*sin(c1*x1)+a1; z1=y1./x1;

So in the subfunction the input arguments are called x1, a1, b1 and c1 but in the function accessing the subfunction, the names are different: x, a, b, c. The key thing is for the arguments to be in the correct order.

Now consider a slightly briefer version of DurhamWeatherStatistics.m

function DurhamWeatherStatistics

%Import Durham weather data from excel data=xlsread('DurhamWeather.xlsx','Sheet1'); %Extract years and store in yyyy yyyy=data(:,1); %Extract months and store in mm mm=data(:,2); %Extract monthly rainfall and store in rain rain=data(:,6); %Generate a column vector of the years of interest YEAR=[1880:2011]'; %Count how many years we are interested in and store in M M=numel(YEAR); %Perform a loop for each year of interest for i=1:M %Obtain the locations of all yyyy which are equal to the ith YEAR condition=yyyy==YEAR(i,1); ind=find(condition); %Store all rain data for year of interest in rainYEAR rainYEAR=rain(ind,1); %Calculate annual mean rainfall for that year %and store as the ith value of rainANNUAL rainANNUAL(i,1)=mean(rainYEAR,1); end %Rank the rain from smallest to largest RankedRain=sort(rainANNUAL,1); %Count the number of data points M=numel(RankedRain); %Generate the rank numbers (i.e., numbers from 1 to M in increments of 1) m=[1:M]'; %Calculate probability of non-exceedance using Weibull plotting position P=m/(M+1); figure(1) %Plot data as x-y scatter plot(RankedRain,P,'.') xlabel('Annual mean rainfall (mm/month)') ylabel('Probability of non-exceedance') title('Rainfall in Durham from 1880 to 2012')

Lets put some of the analysis into a subfunction:

function DurhamWeatherStatistics %Import Durham weather data from excel data=xlsread('DurhamWeather.xlsx','Sheet1'); %Extract years and store in yyyy yyyy=data(:,1); %Extract months and store in mm mm=data(:,2); %Extract monthly rainfall and store in rain rain=data(:,6); %Generate a column vector of the years of interest YEAR=[1880:2011]'; %Apply sub function to get CDF for annual mean monthly rainfall [RankedRain,P]=GetCDF(YEAR,yyyy,rain); figure(1) %Plot data as x-y scatter plot(RankedRain,P,'.') xlabel('Annual mean rainfall (mm/month)') ylabel('Probability of non-exceedance') title('Rainfall in Durham from 1880 to 2012') function [RankedRain,P]=GetCDF(YEAR,yyyy,rain)

%Count how many years we are interested in and store in M M=numel(YEAR); %Perform a loop for each year of interest for i=1:M %Obtain the locations of all yyyy which are equal to the ith YEAR condition=yyyy==YEAR(i,1); ind=find(condition); %Store all rain data for year of interest in rainYEAR rainYEAR=rain(ind,1); %Calculate annual mean rainfall for that year %and store as the ith value of rainANNUAL rainANNUAL(i,1)=mean(rainYEAR,1); end %Rank the rain from smallest to largest RankedRain=sort(rainANNUAL,1); %Count the number of data points M=numel(RankedRain); %Generate the rank numbers (i.e., numbers from 1 to M in increments of 1) m=[1:M]'; %Calculate probability of non-exceedance using Weibull plotting position P=m/(M+1);

The major advantage of this is that now we can apply the same analysis to the temperature as well. For example:

function DurhamWeatherStatistics %Import Durham weather data from excel data=xlsread('DurhamWeather.xlsx','Sheet1'); %Extract years and store in yyyy yyyy=data(:,1); %Extract months and store in mm mm=data(:,2); %Extract monthly rainfall and store in rain rain=data(:,6); %Extract monthly mean daily maximum temperature (deg C) and store in tmax tmax=data(:,3); %Generate a column vector of the years of interest YEAR=[1880:2011]'; %Apply sub function to get CDF for annual mean monthly rainfall [RankedRain,P]=GetCDF(YEAR,yyyy,rain); %Apply sub function to get CDF for annual mean daily max temperature [RankedTmax,P]=GetCDF(YEAR,yyyy,tmax); figure(1) subplot(1,2,1) %Plot data as x-y scatter plot(RankedRain,P,'.') xlabel('Annual mean rainfall (mm/month)') ylabel('Probability of non-exceedance') title('Rainfall in Durham from 1880 to 2012') subplot(1,2,2) plot(RankedTmax,P,'.') xlabel('Annual mean daily max temp (^oC)') ylabel('Probability of non-exceedance') title('Temperature in Durham from 1880 to 2012') function [RankedRain,P]=GetCDF(YEAR,yyyy,rain)

%Count how many years we are interested in and store in M M=numel(YEAR); %Perform a loop for each year of interest for i=1:M %Obtain the locations of all yyyy which are equal to the ith YEAR condition=yyyy==YEAR(i,1); ind=find(condition); %Store all rain data for year of interest in rainYEAR rainYEAR=rain(ind,1); %Calculate annual mean rainfall for that year %and store as the ith value of rainANNUAL rainANNUAL(i,1)=mean(rainYEAR,1); end %Rank the rain from smallest to largest RankedRain=sort(rainANNUAL,1); %Count the number of data points M=numel(RankedRain); %Generate the rank numbers (i.e., numbers from 1 to M in increments of 1) m=[1:M]'; %Calculate probability of non-exceedance using Weibull plotting position P=m/(M+1);

Note how we are able to use GetCDF to look at tmax without having to change the variable names in the subfunction in GetCDF. Also we have used another new command called “subplot”. Type “help subplot” to find more.

## Assignment

Extend DurhamWeatherStatistics.m further to plot the theoretical normal CDF for both the annual mean monthly rainfall and the annual mean daily maximum temperature using an additional subfunction.

Published with MATLABĀ® 7.13