Exercise 5 (demo) - Optimal interpolation#

Aim: To map temperature data based on individual Argo profile data into an x-y-t set of maps.

Data: Data from early Argo profiles in the North Atlantic, provided in a *.mat file.

Directions: Create an *.ipynb and figures of the mapped data.

Demo: Since this is a more-involved calculation, most of the code has been provided for you below. There are a few places where you need to make edits/tweaks and choices. Some of these are indicated by the ellipses .... Others simply say in the text Try this.

This exercise is based on one from a course by Kathie Kelly in 2008.


See also

The method is introduced in: Bretherton et al. [1976] A technique for objective analysis and design of oceanographic experiments applied to MODE-73.

Explore the data#

Create a notebook#

  1. Create an *.ipynb containing the commands for this assignment, or copy this file.

    File naming convention

    Name your python notebook something useful ex<X>-<Lastname>-<slug>-seaocn.ipynb where you replace <X> with the exercise number and <slug> with the short slug to name the topic, and <Lastname> with your last name.

    Figures should be named something like ex<X>fig<Y>-<Lastname>-<slug>-seaocn.png where you replace <X> with the exercise number, <Y> with the figure number, and <Lastname> with your last name.

  2. Import necessary packages.

    For example, matplotlib and pandas and numpy and xarray. You may also need scipy.

    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np
    import xarray as xr
    from datetime import datetime
    

    If you are missing any of these packages, please refer to Resources: Python.

# Your code here
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
import numpy as np
import xarray as xr
from datetime import datetime
import scipy.io as sio
from matplotlib.patches import Rectangle

Load Matlab data#

  1. Use the file argo_med.mat in the data/ directory. With the scipy package, you can load matlab-format data files into python.

  2. Load the file. Here, we’re loading the matlab file into something called mat_contents. Make a basic exploration. How big are the data? What are the coordinates?

    mat_contents = sio.loadmat(fname)
    print(mat_contents.keys())
    print(mat_contents['time'])
    

    Since the functions you have available depend on the type of the variable, don’t forget to use type() to find out what’s in there.

    For consistency with the code below, use the names time, pres0, latd, lond, tprof.

    Note that the time is in Matlab datenum format: a serial day number counted from the year 0, where 719529 corresponds to 1970-01-01. Check out some tips on how to convert between Matlab datenum and Python datetime; a minimal sketch is shown below.
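    For example (a minimal sketch; 731599.0 is just an illustrative datenum value, not taken from the data):

    # Matlab datenum -> pandas Timestamp: subtract the datenum of the Unix epoch (719529)
    pd.to_datetime(731599.0 - 719529, unit='D')   # Timestamp('2003-01-18 00:00:00')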

# Your code here

# Change the path to match where you put the data
file_path = '../data/'
filename = 'argo_med.mat'
fname = file_path + filename
mat_contents = sio.loadmat(fname)

# Pull out the individual data fields
time = mat_contents['time']
pres0 = mat_contents['pres0']
latd = mat_contents['latd']
lond = mat_contents['lond']
tprof = mat_contents['tprof']

# Create a variable `timestamps` which is in Python datetime format.
# BUT, don't replace the `time` vector with this new one.  We will still use `time`
datenums = time.flatten()
timestamps = pd.to_datetime(datenums-719529, unit='D')
# Verify that the time works.
# The code below should produce the output `18-Jan-2003`
tstr = timestamps.strftime("%d-%b-%Y")
tstr[0]
'18-Jan-2003'

Convert data to xarray#

We like xarray in this course, so we will re-bundle the data into an xarray object: a Dataset. The code below shows one way to do this which will throw an error. Uncomment the lines and run it to see what the error looks like.

#data_xr = xr.DataArray(data = tprof,
#                       coords={'time': time, 'pres': pres0},
#                       dims = ["time", "pres"])

Error: MissingDimensionsError#

If you get an error like MissingDimensionsError: cannot set variable 'pres' with 2-dimensional data without explicit dimension names. Pass a tuple of (dims, data) instead., this is a problem with the shape or size of your pressure or time dimensions. Xarray wants these to be 1-dimensional.

Try a np.shape(time) to see how big it is.

If it is 1-dimensional, it will return something like (1019,)

If it is returning (1, 1019), this is 2-dimensional, and xarray is having trouble with it.

See also

We dealt with problems of dimensionality in Ex-tseries. The solution there was a little magical function called squeeze() available in xarray. Our variable here is a numpy array, but there also exists the function numpy.squeeze(). So, squeeze your variables to remove extraneous dimensions.
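A quick illustration (with a made-up array) of what squeezing does to the shape:

# A made-up (1, 1019) array: squeeze() drops the length-1 dimension
arr = np.ones((1, 1019))
print(arr.shape)              # (1, 1019)
print(np.squeeze(arr).shape)  # (1019,)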

After squeezing everything, create several xarray.DataArray with the coordinates time and pres, where applicable. For example:

data_xr2 = xr.DataArray(name = 'latd', data = latd,
                        coords = {'time': time},
                        dims = ["time"])

Once you have done this for each of tprof, latd, lond, and timestamps, you can merge them into a single xarray.Dataset using:

argo_med = xr.merge([data_xr,data_xr2,data_xr3,data_xr4])
# Your code here.  

# Note that several of the commands below produce output, but you'll only see the latest output printed to the screen.
# If you want to see the output of the others, encase it with `print(...)`
type(tprof)
np.shape(tprof)
type(latd)
np.shape(time)

# Now squeeze
time = time.squeeze()
pres0 = ...
latd = ...
lond = ...
np.shape(time)
np.shape(pres0)

# Now create a data array for each of `tprof`, `latd`, `lond`, and `timestamps`.  Use the coordinate `time` and dimensions `time` and `pres0`.
data_xr = xr.DataArray(name = 'tprof', data = tprof,
                       coords={'time': time, 'pres': pres0},
                       dims = ["time", "pres"])
data_xr2 = ...

data_xr3 = ...

data_xr4 = ...

argo_med = xr.merge([data_xr,data_xr2,data_xr3,data_xr4])

print(argo_med)

Plot data locations#

This is a quick-and-dirty plot, so matplotlib is fine.

# Your code here

fig, ax = plt.subplots()
plt.plot(argo_med.lond, argo_med.latd,'*')
plt.xlabel('Longitude [deg E]')
plt.ylabel('Latitude [deg N]')
plt.title('Argo profile locations')

Fig 1. Data locations#

minlat = 40
maxlat = 50
minlon = 320
maxlon = 340
  • Draw a box on top of your figure to show where the selected region is.

rect = Rectangle((minlon, minlat), maxlon-minlon, maxlat-minlat, linewidth=1, edgecolor='r', facecolor='none')
ax.add_patch(rect)
  • Save the figure as Fig 1 for this exercise. ex5fig1-<LastName>-data_locations-seaocn.png

We’ll work with data in an area with particularly high density of float profiles.

Choose a pressure surface to work with. Note that your pressure vector is 6 elements long; pick one of those depths to work with.

# Define region
minlat = 40
maxlat = 50
minlon = 320
maxlon = 340

# Your code here
fig, ax = plt.subplots()

plt.plot(argo_med.lond, argo_med.latd,'*')
plt.xlabel('Longitude [deg E]')
plt.ylabel('Latitude [deg N]')
plt.title('Argo profile locations')

# Create a Rectangle patch & add to plot
rect = Rectangle((minlon, minlat), maxlon-minlon, maxlat-minlat, linewidth=1, edgecolor='r', facecolor='none')
ax.add_patch(rect)

# Save the figure to subdirectory figures/
fig.savefig("figures/ex5fig1-Example-data_locations-seaocn.png")

# Figure out how big the datasets are
temp = argo_med['tprof'][:,0]
print(temp)

Fig 2. Clip to a region and plot#

You’ve defined a maximum and minimum longitude and latitude above, and now you’d like to make your dataset smaller to work only within this region.

xarray has a few options for finding/selecting/choosing subsets of a dataset. These include .sel(), .where(), and .loc[], among others. They have some similarities and differences which we won’t go into here.

See also

If you do a google search for subset xarray by lat lon region, one of the options that shows up (as of 17 April 2024) is: https://gis.stackexchange.com/questions/353698/how-to-clip-an-xarray-to-a-smaller-extent-given-the-lat-lon-coordinates

There are a few solutions offered by the community. Try them and see which works.

One suggested method:

mask_lon = (argo_med.lond >= minlon) & (argo_med.lond <= maxlon)

which creates a boolean vector that is True where both of these conditions are satisfied (the boolean AND operator & requires both conditions to hold; the equivalent OR is |). After creating this “mask”, the xarray.Dataset.where() method can be used as:

argo_reg = argo_med.where(mask_lon & mask_lat, drop=True)

The option drop=True says to drop the elements of the dataset that do not satisfy the conditions.
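As a tiny self-contained illustration (toy numbers, not the Argo data):

# Toy example of boolean masking followed by where(..., drop=True)
da = xr.DataArray(np.arange(5), dims="x")
mask = (da >= 1) & (da <= 3)
print(da.where(mask, drop=True).values)   # [1. 2. 3.]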

# Note: coordinate-based selection (e.g. .sel() with lat/lon slices) fails here because our latitude and
# longitude are data 'variables' and time is the only coordinate, so we build boolean masks and use where() instead
mask_lon = (argo_med.lond >= minlon) & (argo_med.lond <= maxlon)
mask_lat = ...

argo_reg = ...

# How large is the dataset now?
print(argo_reg)

# Your code here
fig, ax = plt.subplots()

plt.plot(argo_reg.lond,argo_reg.latd,'*')
plt.xlabel('Longitude [deg E]')
plt.ylabel('Latitude [deg N]')
plt.title('Argo profile locations')
ax = plt.gca()

# Create a Rectangle patch
rect = Rectangle((minlon, minlat), maxlon-minlon, maxlat-minlat, linewidth=1, edgecolor='r', facecolor='none')

# Add the patch to the Axes
ax.add_patch(rect)

fig.savefig("figures/ex5fig2-Example-data_locations-seaocn.png")

Remove mean & any obvious latitudinal dependence#

Here you should plot against latitude and fit a polynomial (1 degree, if that suffices to remove most of the signal). Note that a 1-degree polynomial will also remove the mean, so you don’t need to separately remove the mean of these values. To check the distribution of the data (and get a sense of the mean), use the plt.hist() command from matplotlib.

plt.hist(argo_reg.tprof.values.flatten())

Then check for a latitudinal dependence. For example:

plt.plot(argo_reg.latd,argo_reg.tprof[:,3],'*')

This figure does not need to be saved - it is a quick and dirty plot to take a look at the data.

# Your code here

fig, ax = plt.subplots(2,1)
ax[0].hist(argo_reg.tprof.values.flatten())
ax[1].plot(argo_reg.latd,argo_reg.tprof[:,3],'*')

Fig 3. Compute latitudinal temperature dependence#

Recall from exercise 2a tseries how to fit a 1-degree polynomial.

  1. Extract data from one pressure level. Note that the second dimension or axis of the tprof dataArray is not unit-length. Pick one depth surface to use for the later exercises. Call this vector temp1 for consistency with later code examples.

  2. Remove the NaN-values (if any). For example, find the indices where there are NaN-values using np.isnan, then subselect only the good stuff from temp1 using it in temp1[~np.isnan(temp1)]. Remove these from latitude as well.

    ikeep = ~np.isnan(temp1)
    temp1 = temp1[ikeep]
    
  3. Fit a 1-d polynomial to temp1 as a function of lat1 (latitude, with the same NaN-values removed).

    pf = np.polyfit(lat1, temp1, 1)
    xp = np.linspace(minlat,maxlat,20)
    p1 = np.poly1d(pf)
    
  4. Remove the polynomial fit from the original data to create anomalies

    lat_model = p1(lat1)
    temp_anom[ikeep] = temp1 - lat_model
    
  5. Create a new xarray.DataArray containing temp_anom then merge it using xr.merge back into argo_reg.

# Fit a line
lat1 = argo_reg.latd.values
# This is picking a specific depth.  What depth is it at?
temp1 = argo_reg.tprof[:,3].values

# Our new data vector will be called `temp_anom`; copy so we don't modify the original array in place
temp_anom = temp1.copy()
ikeep = ~np.isnan(temp1)

lat1 = lat1[ikeep]
temp1 = temp1[ikeep]

# Fit a 1-degree polynomial
pf = np.polyfit(lat1, temp1, 1)
xp = np.linspace(minlat, maxlat, 20)
p1 = np.poly1d(pf)

# Calculate anomalies relative to the latitudinal gradient
lat_model = p1(lat1)
temp_anom[ikeep] = temp1-lat_model

# Merge these back into the original
# See what else needs to be used by checking the cell above
data_xr4 = xr.DataArray(name = 'temp_anom', ...)

# Merge into the same dataset
argo_reg = xr.merge([argo_reg,...])

print(argo_reg) 

Now plot the process of fitting a line#

And save as your 3rd figure.

# Your code here
fig, ax = plt.subplots(2,1)

# Determine latitude dependence
ax[0].plot(argo_reg.latd,argo_reg.tprof[:,3],'o')
ax[0].plot(lat1,temp1,'r+')
ax[0].plot(xp,p1(xp))

ax[1].plot(lat1, temp_anom[ikeep],'*')

# Annotate figure
ax[1].set_xlabel('Latitude [deg N]')
ax[1].set_ylabel('Temperature anomaly [deg C]')
ax[0].set_ylabel('Temperature [deg C]')


fig.savefig("figures/ex5fig3-Example-data_locations-seaocn.png")

(Optional) Remove outliers larger than 3 standard deviations#

  1. Compute the standard deviation of your data vector

    std1 = np.nanstd(temp_anom)
    
  2. Create a mask which is True for the non-outliers (values within 3 standard deviations)

    mask_outlier = (argo_reg.temp_anom > -3*std1) & (argo_reg.temp_anom < 3*std1) 
    
  3. Subselect the argo_reg for the non-outliers and save it into argo_reg_good.

    argo_reg_good = argo_reg.where(mask_outlier, drop=True)
    
# In case you don't do the optional, you still need the newly named `argo_reg_good` for later code.
argo_reg_good = argo_reg

# Remove the outliers (optional)
# Uncomment and complete the lines
#std1 = ...
#mask_outlier = ...
#argo_reg_good = ...


# Print something to the screen to tell how many datapoints were removed (if any)
mp = argo_reg_good.temp_anom.shape[0]
mq = argo_reg.temp_anom.shape[0]

print('Removed ' + str(mq-mp) + ' values out of a total of ' + str(mq) + ' (' + str(np.around(100*(mq-mp)/mq,2)) + '% of data points removed)')

Computing the covariance function#

Calculate pairwise differences#

For every Argo measurement, calculate the difference to every other Argo measurement, and the difference in location (dx) and time (dt)

Read through the code below to understand what is going on.

  • How are new vectors being initialised? What is their initial size? How does their size get increased in the loop?

  • What are the indices of the for-loops? What values will jj (the outer loop index) take? What about kk (the inner loop index)? You can check this by adding some lines of code in relevant places within the loop:

    print(jj)
    

    Make a new cell and try the very simple loop:

    for jj in range(0,5):
        print(jj)
    

    What is the largest value of jj? How does this compare to the inputs you put in range()?

  • What is the theta variable? What are we using it for?

  • What is the factor np.cos(theta... doing? Why is there an np.pi/180 inside?

  • How big is index after executing the loop? Note that this variable is only counting for you - it’s not doing anything else in this loop (but we will use it later). How did we increment it? What would happen if you changed the line to

    index += 2
    
  • Did you notice the indentation used on the loops? These are nested loops, so the activity inside the 2nd loop is indented twice.

# New cell for your simple loop
# The size of the vectors will be ndat^2/2, so this gets pretty big
ndat = argo_reg_good.temp_anom.shape[0]

# Initialise some empty vectors
dist = np.empty((0,0))
delt = np.empty((0,0))
temp_diff = np.empty((0,0))
myind = np.empty((0,0))
index = 0

for jj in range(0,ndat):
    for kk in range(jj+1,ndat): 
        dt = np.abs(argo_reg_good.time[jj] - argo_reg_good.time[kk])
        dlat = argo_reg_good.latd[jj] - argo_reg_good.latd[kk]
        dlon = argo_reg_good.lond[jj] - argo_reg_good.lond[kk]

        theta = np.mean([argo_reg_good.latd[jj], argo_reg_good.latd[kk]])

        # Distance
        dist1 = np.sqrt(dlat**2 + (dlon * np.cos(theta*np.pi/180))**2)*111
        dist = np.append(dist, dist1)

        # Time difference
        delt = np.append(delt,dt)
        
        # Temperature difference
        dif_temp1 = argo_reg_good.temp_anom[jj] - argo_reg_good.temp_anom[kk]
        temp_diff = np.append(temp_diff, dif_temp1)

        # Keep a running count of the number of pairs
        myind = np.append(myind,index)
        index += 1

Create a new xarray to hold the results#

The data fields should include time_diff, distance, temp_diff, against the index index.

# Create an xarray
data_xr = xr.DataArray(name = 'time_diff', data = delt,
                       coords={'index': myind},
                       dims = ["index"])
data_xr2 = xr.DataArray(name = 'distance', data = dist,
                        coords= {'index': myind},
                        dims = ["index"])
data_xr3 = xr.DataArray(name = 'temp_diff', data = temp_diff,
                        coords = {'index': myind},
                        dims = ["index"])

data_covar = xr.merge([data_xr, data_xr2, data_xr3])
print(data_covar)

Calculating the structure function#

\[ <(T_1-T_2)^2> = <T_1^2> + <T_2^2> - 2<T_1T_2> = 2<T^2> + 2<err^2> - 2<T_1T_2> \]

where \(<\cdot>\) denotes an average or mean. The structure function \(S\), which at large separations approaches the signal variance plus the squared error, is one half of the mean squared difference,

\[ S = \frac{1}{2}<(T_1-T_2)^2> \]

In python, this becomes

delt_squared = data_covar.temp_diff**2/2

which is one half the squared temperature difference, but without the averaging in \(S\)

# Your code here
delt_squared = data_covar.temp_diff**2/2

data_xr4 = xr.DataArray(name = 'deltsqr', data = delt_squared,
                        coords = {'index': myind},
                        dims = ["index"])
data_covar = xr.merge([data_covar, data_xr4])

nprs = delt.shape[0]

# We will bin the differences as a function of space and time
# Choose the width of each bin
ds_bin = 25 # Spatial  - Use something like 10 - 100 kilometers (you can try a few)
dt_bin = 30 # Temporal - an initial choice in days, used below; something in the range 10 - 60 days works

# Choose a maximum distance to use
maxd = 900 
# Since the dataset is large, initially only work where the difference in time is small
itime_small = delt<dt_bin
len(itime_small)
type(itime_small)

Bin the structure function as a function of distance#

Create the variable \(S(x)\) which is a binned estimate of the squared temperature differences, as a function of distance \(x\). We will call this S_dx in python, against the distance vector ds.

# Mask for nearby in time and in space
# There is no point in running this over the entire dataset since far away in space/time are unlikely to influence values at a point
mask_small_dt = (data_covar.time_diff < dt_bin)
mask_small_ds = (data_covar.distance < maxd)
data_covar_near = data_covar.where(mask_small_dt & mask_small_ds, drop=True)

print(data_covar_near.distance.values.max())

# Subset the distances and the other values for small time differences

# Length of vector for distances
ds = np.arange(ds_bin/2, maxd, ds_bin)
# Initialise the vector
S_dx = np.empty((0,0))

# Loop through the distance vector `ds`
for jj in range(len(ds)):
    # Create the lower and upper distance limits within which to compute the average
    dsmall = ds[jj]-ds_bin/2
    dlarge = ds[jj]+ds_bin/2

    # Select for data within these limits
    dt1 = data_covar_near.where((data_covar_near.distance >= dsmall) & (data_covar_near.distance < dlarge), drop=True)

    # Calculate the average
    smean1 = np.mean(dt1.deltsqr)

    # Append the value to the S_dx vector
    S_dx = np.append(S_dx,smean1)
# Plot the binned structure function

# Variance /error increases with distance
fig, ax = plt.subplots()
plt.plot(ds,S_dx)
ax.set_xlabel('Distance [km]')
ax.set_ylabel('$T_{rms}^2$ [deg C^2]')

Estimate noise and variance#

The noise (errsqr), signal variance (var_signal), and total variance (var_total) appear in the equation for the structure function as

\[ 2S = <T_1^2> + <T_2^2> - 2<T_1T_2> = 2<T^2> + 2<err^2> - 2<T_1T_2> \]

For zero separation (delt = 0), the structure function gives the squared error.

For large separations, it gives the total variance (the signal variance plus the squared error).

You can repeat the above calculations changing the mask_small_dt values.
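One possible way to read these off the binned structure function from the previous section (a sketch under the assumption that the binned curve levels off at large distances; the `_est` names are just illustrative):

# S at the smallest separations approximates the squared noise, and S at large
# separations levels off near the total variance.
errsqr_est = S_dx[0]                   # may be NaN if the first bin happens to be empty
var_total_est = np.nanmean(S_dx[-5:])  # average over the largest-distance bins
var_signal_est = var_total_est - errsqr_est
print(errsqr_est, var_total_est, var_signal_est)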

# Experiment with some values here that will affect your fit in the later code.
errsqr = 1
var_total = 1.2
var_signal = var_total - errsqr

Fit a covariance function#

The covariance function must be estimated either from the data to be mapped or from independent data. The function should be as simple as possible while still describing the essential variations in the data. Generally, an analytic function is used to allow for simple computation for a variety of spatial (and temporal) separations. The covariance function should decrease with increasing distance (or time), which minimizes the possibility that (5) will produce unrealistic (e.g., negative) squared error estimates, another reason for removing periodic signals. Included in the covariance function is an estimate of the variance of the field, which may vary somewhat spatially. Finally, an estimate is needed of the random error in the measurements, in addition to the error associated with not having a measurement at the exact location (or time) of the grid points for the map.

Here we will use:

\[ cov=var_{signal}*\exp(-ds^2/L_s^2-dt/L_t) \]

In your calculations below, guess an initial decorrelation length scale of \(L_s=300\) km and a decorrelation time scale of \(L_t=10\) days.

# Make some initial guesses for the decorrelation scales
# These are the distances in kilometers and days over which you expect data to be similar (or to have an influence on nearby datapoints in the map)
Ls = ...
Lt = ...


# Define a simple covariance function as a function of distance
def my_covar_x(var_signal,ds,Ls):
    covx = var_signal*np.exp(-(ds/Ls)**2)
    
    return covx

# Adjusting the value from the cell above to better fit the curve.
var_sig2 = .9

# Covariance
cov_ds = var_total - S_dx

# Calculate the decorrelation curve
covx = my_covar_x(var_sig2, ds, Ls)

# Plot it against the structure function
fig,ax = plt.subplots()
plt.plot(ds,cov_ds)
plt.plot(ds,covx,'*')
ax.set_xlabel('Distance [km]')
ax.set_ylabel('Temperature covariance [(deg C)$^2$]')
ax.set_title('For Ls = ' + str(Ls) + 'km')

Bin the structure function as a function of time#

Repeat the binning, now as a function of time difference, experimenting with values for dt_bin and maxt. Note that there is no point in a very large maxt value since the data were all collected within about 6 months.

# Mask for nearby in time and in space
dt_bin = ... # Temporal - Use something like 10 - 60 days (you can try a few)
maxt = 150

mask_small_dt = (data_covar.time_diff < maxt)
mask_small_ds = (data_covar.distance < ds_bin*2)
data_covar_near = data_covar.where(mask_small_dt & mask_small_ds, drop=True)

print(data_covar_near.distance.values.max())

# Subset the distances and the other values for small time differences
# Length of vector for distances
dt = np.arange(dt_bin/2, maxt, dt_bin)
S_dt = np.empty((0,0))
for jj in range(len(dt)):
    dsmall = dt[jj]-dt_bin/2
    dlarge = dt[jj]+dt_bin/2
    dt1 = data_covar_near.where((data_covar_near.time_diff >= dsmall) & (data_covar_near.time_diff < dlarge), drop=True)
    smean1 = np.nanmean(dt1.deltsqr)
    S_dt = np.append(S_dt,smean1)

# Plot the binned structure function
# Variance /error increases with distance

fig, ax = plt.subplots()
plt.plot(dt,S_dt)
ax.set_xlabel('Time differential [days]')
ax.set_ylabel('$T_{rms}^2$ [deg C^2]')
# Fit a covariance function in time
Ls = ...
Lt = ...
cov_dt = var_total - S_dt

# Simple covariance in time
def my_covar_t(var_signal, dt, Lt):
    covt = var_signal*np.exp(-dt/Lt)

    return covt

covt = my_covar_t(var_sig2, dt, Lt)

fig,ax = plt.subplots()
plt.plot(dt,cov_dt)
plt.plot(dt,covt,'*')
ax.set_xlabel('Time differential [days]')
ax.set_ylabel('Temperature covariance [(deg C)$^2$]')
ax.set_title('For Lt = ' + str(Lt) + 'days')
# Combined time and space decorrelation
def my_covar_xt(var_signal,ds,Ls,dt,Lt):
    covxt = var_signal*np.exp(-(ds/Ls)**2-dt/Lt)

    return covxt

Optimal interpolation#

Optimal interpolation (or “objective mapping,” as it is frequently called in oceanography, or “kriging” in geophysics) is a procedure for obtaining a map from observations that has the minimum squared error, based on the statistics of the field to be mapped Bretherton et al. [1976]. The unique aspect of this procedure is the error map that accompanies the map of the desired variable.

Why compute an objective map?

  • to get a gridded field from irregularly spaced data (for data assimilation or for forcing a model)

  • to get a field with which to derive other estimates (e.g., geostrophic velocity from fields of density)

  • to get an estimate of the errors associated with a set of measurements at various locations and times

Deriving the objective estimate#

Given a set of measurements at arbitrary locations (and times), \(D(x,y,t)\), we want to create a series of maps at regular locations (and times). The measurement locations may be fixed, as in measurements from moored instruments, random as from Argo floats, or repeating but not on a uniform grid, as for satellite observations. The problem is to find the best linear combination of observed values for an estimate at a fixed location and time

\[ \hat{d}(x_0,y_0,t_0) = D\alpha\qquad\qquad (1) \]

where \(\alpha\) is a vector of coefficients that are applied to the matrix of observations \(D\). The columns of \(D\) are time series at the various locations \((x,y)\). The best estimate is one that minimizes the square of the error

\[ \epsilon = \hat{d}(x_0,y_0,t_0)-d_T(x_0,y_0,t_0)\qquad\qquad (2) \]

where \(d_T\) is the actual value of the variable at that location and time.

To ensure the smallest squared error, we require that the error be orthogonal to the estimate, that is,

\[ \hat{d}^T\left(\hat{d}-d_T\right)=0 \]

This statement requires some interpretation, because it assumes that we have a time series (or an ensemble) of estimates of the variable and therefore an ensemble of errors, with which to determine \(\alpha\). For simplicity, take a hypothetical case in which data come from fixed locations (as from a mooring) with no gaps (imagine!). Then at any given location the coefficients \(\alpha\) will not change in time and we have the time series for (2). Combining (1) and (2) gives

\[ (D\alpha)^T(D\alpha-d_T)=0 \]

or

\[ \alpha^T D^T D\alpha = \alpha^T D^T d_T \]

Dividing both sides by \(N\), the length of the time series in \(D\), and defining the data-data covariance matrix

\[ C = \frac{D^TD}{N} \]

that is, the covariance of all the observations (at each of the moorings) with each other, along with the vector of covariances between the observations and the actual value \(d_T\),

\[ z = \frac{D^Td_T}{N} \]

this becomes

\[ \alpha^T C\alpha = \alpha^T z \qquad\qquad (3) \]

We can solve for the coefficients \(\alpha\) explicitly as

\[ \alpha = C^{-1}z \qquad\qquad (4) \]

For the actual calculation in python, we use

alpha = np.linalg.solve(C,zz)

It seems reasonable to estimate \(C\) from the observations (at least for the mooring example), but how do we get \(z\) when we do not know the actual value of the variable? In practice, we approximate the covariances using a simple (analytic) function. For a spatial map, the covariance function may simply be a function of the distance between the data locations, or it might vary in the \(x\) and \(y\)-directions, if their statistics differ significantly. In addition to the spatial separation, it might include the temporal separation. So, the covariance matrix \(C\) between data at different locations is just a function of their separation. Similarly, the covariance vector \(z\) is just a function of the distance between the observations (here, the moorings) and the grid point in the map.

To understand the form of the coefficients (4), note that the “numerator” \(z\) depends on how close the data are to the grid point \((x_0,y_0)\), that is, the data that are nearest to the grid point are weighted most heavily in the estimate (1). What about the denominator \(C\)? The data covariance matrix describes how well the data are correlated with each other, that is, the extent to which observations that are close to each other are redundant or, conversely, that widely separated observations are independent. The denominator can be seen as a correction to the weighting by distance from the grid point (the numerator) to account for data redundancy.
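To make this concrete, here is a toy calculation with invented numbers (two observations, one grid point); the values are purely illustrative:

# Toy illustration of alpha = C^{-1} z (invented covariances)
C = np.array([[1.0, 0.8],    # the two observations are close together, so highly correlated
              [0.8, 1.0]])
zz = np.array([0.6, 0.5])    # observation 1 is slightly closer to the grid point
alpha = np.linalg.solve(C, zz)
print(alpha)                 # ~[0.56, 0.06]: the redundant, slightly farther observation is strongly down-weighted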

The error estimate#

What is the (squared) error associated with this estimate?

\[ \epsilon^T\epsilon = \left(\hat{d}-d_T\right)^T\left(\hat{d}-d_T\right) = \hat{d}^T\hat{d} - 2\hat{d}^Td_T + d_T^Td_T \]

Using (1) and (3) we can combine the first and second terms (on the right) to get

\[ \epsilon^T\epsilon = d_T^T d_T- \hat{d}^Td_T = d_T^Td_T - \alpha^TD^Td_T \]

where the first term is the data variance (times \(N\)) and the second term can be simplified using (4) to get

\[ \left<\epsilon^2\right>=\sigma_d^2-\alpha^Tz \qquad\qquad (5) \]

where the first term on the RHS is the data variance. For observations that are poorly correlated with the location of a grid point in the map, the squared error can approach the variance. In this case, the estimate is not useful.
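Continuing the toy numbers above, and assuming a data variance of 1, the expected squared error from (5) would be:

# Squared error estimate for the toy example (assumed data variance of 1)
err_sq = 1.0 - alpha @ zz
print(err_sq)   # ~0.64: these two observations only weakly constrain the grid point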

Removing the mean and periodic signals#

The objective map is performed on random variables whose statistics can be described by a covariance function. Generally, if the data is a function of time \(d(x,t)\), the temporal mean is removed, because of its implicit association with the ensemble average. By removing the mean and mapping the anomaly, the mapping program will not have to try to recover the mean value using sparse data. If the expected squared error is large, the anomaly can then be set to zero. A similar argument can be made for a periodic signal, if it is reasonably well-defined either by climatology or a long data record. The mapped anomaly can be added to the mean (and periodic signal) to recover the total value of the variable.

def obj_map_2D(vart, var, smax, Ls, tmax, Lt, data, longrid, latgrid, tmap):
    # 2D objective map using data(lon,lat,time)
    #
    # Note: temporal mean of data should be removed first
    # compute 2D Gauss-Markov (objective) estimate 
    #
    # data - xarray of location, time, and observed value
    #	 variables lond, latd, time and temp_anom
    # longrid, latgrid - output locations for maps
    # tmap - time of output map - in Matlab datenum format (serial days,
    #        where 719529 corresponds to 1970-01-01)
    #
    # data variance (vart) is given by
    #  vart = var + noise^2
    #
    # error is in units of signal (NOT fractional)
    #
    # For each output grid point:
    #   1) find data within smax and tmax of grid pt & tmap
    #	2) compute data-data covariance matrix 
    #	3) compute data-xgrid covariance vector zz				
    #	4) compute the coefficients 
    #		alpha = inv(C) * zz      (alpha=C\zz)
    #	5) compute the estimate for each grid point
    #		dhat=dij*alpha
    #	6) compute the normalized error estimate
    #		err^2 = var - zz'*alpha 
    #	7) check for instabilities in error calculation & replace with closest 
    #      single datum, if necessary
    #
    [ny, nx] = longrid.shape
    # Unpack the input data
    lon_data = data.lond
    lat_data = data.latd
    time_data = data.time # Note: these are in Matlab datenum format (serial days)
    val_data = data.temp_anom

    err0 = np.sqrt(vart);

    # Initialise the mapped data & error
    mapdata = np.empty((ny, nx))
    err_rms = err0*np.ones(mapdata.shape)

    # Conversions for lat to kilometers
    yscl = 111
    cs = np.cos(np.mean(lat_data)*np.pi/180)

    # Cycle through positions in the grid
    for jj in range(0,ny):
        
        for ii in range(0,nx):
            # Find distance (km) from data to the map grid point at
            # latgrid[jj,ii] and longrid[jj,ii]
            # Note: lon_data and lat_data are vectors of the same length
            dr = yscl * np.sqrt((cs*(lon_data - longrid[jj,ii]))**2 + (lat_data - latgrid[jj,ii])**2)
            # Find the time difference between the time for the map (tmap) and of the data (time_data)
            dt = abs(time_data - tmap)
    
            # Boolean mask for distances < smax and time diff < tmax
            mask_near = (dr < smax) & (dt < tmax)

            # How many data points are near enough to be used?
            npt = mask_near.sum().values

            # If there are any original datapoints nearby in time/space, then continue
            if npt>0:
                # Subselect the input data for only those points near in time/space
                lat_near_ij = lat_data.where(mask_near, drop=True).values
                lon_near_ij = lon_data.where(mask_near, drop=True).values
                time_near_ij = time_data.where(mask_near, drop=True).values
                val_near_ij = val_data.where(mask_near, drop=True).values
                dr_near_ij = dr.where(mask_near, drop=True).values
                dt_near_ij = dt.where(mask_near, drop=True).values

                # Compute data-grid covariance vector
                zz = my_covar_xt(var,dr_near_ij,Ls,dt_near_ij,Lt)
                
                # Compute data-data covariance matrix
                # Initialise an empty matrix, sized npt x npt square matrix
                C = np.empty((npt,npt))

                for iii in range(0,npt):
                    lat_data1 = lat_near_ij[iii]
                    lon_data1 = lon_near_ij[iii]
                    time_data1 = time_near_ij[iii]
                    
                    for jjj in range(0,npt):
                        lat_data2 = lat_near_ij[jjj]
                        lon_data2 = lon_near_ij[jjj]
                        time_data2 = time_near_ij[jjj]

                        # Compute distance between pairs of data points
                        dr_data = yscl*np.sqrt((cs*(lon_data2 - lon_data1))**2 + (lat_data2 - lat_data1)**2)

                        # Compute delta time between the data points
                        dt_data = abs(time_data2 - time_data1)

                        # Covariance matrix
                        C[iii,jjj] = my_covar_xt(var,dr_data,Ls,dt_data,Lt)
    
                # Compute map value and error estimate
                alpha = np.linalg.solve(C,zz)
    
                # Map the data values into the map locations
                dhat = val_near_ij.conj().transpose() @ alpha
                # Estimate the error
                ersqs = vart - zz.conj().transpose() @ alpha
    
                # Value & error at nearest point (max covariance)
                zz0 = zz.max()
                kk = zz.argmax()
                err1 = vart - zz0**2/var
    
                # Check for instability:
                # 1) estimated error > error for single nearest point, or
                # 2) squared error < 0
                # For one datum: alpha = zz0/var
                if (ersqs > err1) | (ersqs < 0):
                    # Fall back to the single nearest datum (the one with maximum covariance zz0)
                    ersqs = err1
                    alpha = zz0/var
                    dhat = val_near_ij[kk] * alpha

                # Map values
                mapdata[jj,ii] = dhat
                err_rms[jj,ii] = np.sqrt(ersqs)
            else:
                # No data near enough in space/time: leave the anomaly at zero and the error at the full signal error
                mapdata[jj,ii] = 0
                err_rms[jj,ii] = err0

    return mapdata, err_rms

Run the optimal interpolation#

Loop through time (months).

Add comments to the code below to describe what each section or line is doing.

## run obj map routine 
# map 1 month at a time
# Define the maximum space/time intervals over which to grid
smax = .7*Ls
tmax = .7*Lt

# Define a map grid
latg = np.arange(minlat, maxlat, 1)
long = np.arange(minlon, maxlon, 1)

[xgrid, ygrid] = np.meshgrid(long,latg)

d1 = my_covar_xt(var_signal,smax,Ls,0,Lt)     # max space lag, no time lag
d2 = my_covar_xt(var_signal,0,Ls,tmax,Lt)     # max time lag, no space lag

data = argo_reg_good

epoch0 = mdates.date2num(np.datetime64('0000-01-01'))
epoch1 = mdates.date2num(np.datetime64('1970-01-01')+1)

vart = var_total
var = var_signal

ny, nx = xgrid.shape
nt = 4

mapT = np.empty((ny,nx,nt))
errT = np.empty((ny,nx,nt))

for mm in range(0,4):
    time1 = mdates.date2num(np.datetime64('2003-0' + str(mm+1) + '-15'))
    time2 = time1 + epoch1 - epoch0
    # call obj_map_2D
    mapdata, err_rms = obj_map_2D(vart,var,smax,Ls,tmax,Lt,data,xgrid,ygrid,time2);
    mapT[:,:,mm] = mapdata;
    errT[:,:,mm] = err_rms;

Fig 4. Plot the mapped data#

For one of the depth surfaces, plot the mapped data and error estimate. Basic code is below. Update for a better projection (e.g. Mercator).

Be sure to annotate your figure with the choice of \(L_s\), \(L_t\) and the depth/pressure surface that you are mapping.

# Plot an example of the mapped data.

fig, ax = plt.subplots(2,1)
ax[0].pcolormesh(xgrid,ygrid,mapT[:,:,0])
ax[0].set_xlabel(r'Longitude [$^\circ$E]')
ax[0].set_ylabel(r'Latitude [$^\circ$N]')
ax[0].set_title('Ls = ' + str(Ls) + ' km, Lt = ' + str(Lt) + ' d')
ax[1].pcolormesh(xgrid,ygrid,errT[:,:,0])
ax[1].set_xlabel(r'Longitude [$^\circ$E]')
ax[1].set_ylabel(r'Latitude [$^\circ$N]')


# Add the original Argo data over

# Fix the colorbar.  Try to load something from https://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3

Fig 5. Add back in your mean and latitudinal dependence + plot#

Note that for the mapped data, you need to do this using your xgrid or ygrid and with the polynomial fit you identified above.
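A minimal sketch of one way to do this for a single month (assuming the latitude fit p1, the map grid ygrid, and the mapped anomalies mapT from the cells above):

# Evaluate the latitudinal fit on the map grid and add it back to the mapped anomaly (month 0)
mapT_total = mapT[:,:,0] + p1(ygrid)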

Plot the mapped data and save as Fig 5.