Part 2 - Data Ingest and Plotting
Now that we have Python setup within Anaconda, let’s use it for some science! Here, we will plot data using matplotlib and xarray, which will require data wrangling using NumPy, and pandas. Through this process we will build a skillset fundamental to (geo)data science that can be applied to a variety of fields outside of the geosciences.
If you’re following along in this tutorial series, there’s no need to install any Python modules in Anaconda as the previous tutorial made sure that the environment we created had the entire scientific Python stack installed. So, let’s open up Jupyter Lab and create a new Jupyter Notebook to hold the code that we’ll be writing today.
Time series plotting with matplotlib
Within our notebook we’ll need to call the three packages that we need and this can be accomplished with this code chunk:
import matplotlib.pyplot as plt import numpy as np import pandas as pd
Within this code, we call the matplotlib Pyplot module into the Python instance as
plt, so we will access all functions by including the prefix
plt. and then typing in the function we want (e.g.
plt.scatter for a scatterplot or
plt.plot for a line plot). This same syntax is used to import NumPy and pandas and their associated functions.
With the modules we need in place, let’s ingest a timeseries of global mean annual temperature through time and plot it.
First let’s start ingesting the csv by calling
pd.read_csv and providing it with the URL of the HadCRUT5 time series. Note, that pandas has many functions and can readily ingest data from Excel and text data without a problem.
# Read in csv from the HadCRUT5 dataset hadcrut5 = pd.read_csv("https://www.metoffice.gov.uk/hadobs/hadcrut5/data/current/analysis/diagnostics/HadCRUT.220.127.116.11.analysis.summary_series.global.annual.csv") # Lets take a look at the first and last 5 rows of the data using pd.head() # This tells us what the data looks like and the column names which we need to plot the data hadcrut5.head
With the dataset imported and stored as the
hadcrut5 variable, lets plot it. But first, lets get some terminology out the way.
matplotlib creates “figures” that are composed of “subplots”. This structure is very powerful as it allows you to stitch together any plots you want within the same figure, as long as they’re separate subplots. Here, we will be creating a single figure composed of a single subplot which is apparent in the code. For a thorough explanation of how to use
matplotlib to create more complicated visualizations see this guide. Since we are only making a single time series plot, this code can be shortened quite a bit, but I’ve chosen to leave it verbose to get you thinking about
matplotlib plots as a collection of subplots because that’s a key feature of
# Create figure instance fig = plt.figure(figsize = (6, 6)) # Figure size width, height in inches. # Add subplot to figure instance and name it ax ax = fig.add_subplot(111) # 111 means "1x1 grid, first subplot" # Plot ax.plot(hadcrut5['Time'], hadcrut5['Anomaly (deg C)'])
Notice how we call the ‘Time’ and ‘Anomaly (deg C)’ columns that we uncovered by running
hadcrut5.head using brackets next to the pandas data frame variable name that we assigned earlier. The same can be accomplished by typing in
hadcrut5.Time, but using the bracket notation is preferable in case you have column names with spaces, as in
'Anomaly (deg C)'.
Our plot still look’s a bit rough so lets polish it by adding axis labels and a title:
# Create figure instance fig = plt.figure(figsize = (6, 6)) # Figure size width, height in inches. # Add subplot to figure instance and name it ax ax = fig.add_subplot(111) # 111 means "1x1 grid, first subplot" # Plot ax.plot(hadcrut5['Time'], hadcrut5['Anomaly (deg C)']) # Subplot title ax.set_title('Global Mean Annual Temperature Anomaly') # X-axis label ax.set_xlabel("Year") # Y-axis label ax.set_ylabel("Temperature Anomaly (°C)")
The plot is looking quite a bit better but we can still improve it by removing some whitespace which will require modifying some existing code that we have. Namely, the line that we call the figure instance.
# Notice how we've added the 'constrained_layout' argument and set it to true fig = plt.figure(figsize = (6, 6), constrained_layout=True) # Add subplot to figure instance and name it ax ax = fig.add_subplot(111) # 111 means "1x1 grid, first subplot" # Plot ax.plot(hadcrut5['Time'], hadcrut5['Anomaly (deg C)']) # Subplot title ax.set_title('Global Mean Annual Temperature Anomaly') # X-axis label ax.set_xlabel("Year") # Y-axis label ax.set_ylabel("Temperature Anomaly (°C)")
The plot is almost perfect, however the automatic x-ticks generated by
matplotlib make it seem as the though the data extends to the year 2025 when it ends at the year 2022. Changing this portion of the plot will require using NumPy to produce a 1-dimensional array (1D vector) of the x-ticks that we want. We can make the process a bit easier by making use of a handy function in NumPy that makes arrays for us of a sequence of numbers,
np.arange. The function has the arguments
start for the starting number of the sequence,
stop for the last number of the sequence, and
step for the step size to take in the sequence. We’ll start our sequence at 1850, end it at 2022, and take steps of 21.5 years. Unfortunately, using a step size of 21.5 will produce numbers with decimal points, which we can fix by appending the
.round() function to the
# Create figure instance fig = plt.figure(figsize = (6, 6), constrained_layout=True) # Add subplot to figure instance and name it ax ax = fig.add_subplot(111) # 111 means "1x1 grid, first subplot" # Plot ax.plot(hadcrut5['Time'], hadcrut5['Anomaly (deg C)']) # Subplot title ax.set_title('Global Mean Annual Temperature Anomaly') # X-axis label ax.set_xlabel("Year") # Y-axis label ax.set_ylabel("Temperature Anomaly (°C)") # X-axis ticks ax.set_xticks(np.arange(start=1850, stop=2023, step=21.5).round())
Although, I only gloss over it, NumPy is a critical package for scientific computing. It allows for the easy creation and manipulation of N-dimensional arrays necessary for machine learning, linear algebra, and climate model analysis (which we will get to in the next tutorial).
Plotting spatial data with xarray
As Earth scientists we commonly work with spatial data which brings its own set of challenges for data analysis and visualization. We often have to contend with geographical coordinated systems (CRS), data in 4 dimensions (time, latitude, longitude, altitude), paleogeography, among many other complications. These issues require their own blog posts and for simplicity, we will ignore all of these issues. Our goal will be to download a dataset that has a spatial component using xarray and plot it.
As with the previous section of this tutorial, let’s begin by loading
xarray into our Python shell.
import xarray as xr
Now let’s download the spatially explicit global mean annual temperatures from GISTEMP v4, the counterpart to the time series plot we just made. Unlike
xarray does not support reading files off of the internet (this is only half true, but to keep things simple we won’t worry about it), so we’ll have to manually download the data and put the data file in the correct location to read it into Python using
xarray. Download the HadCRUT5 netCDF file from here and place the file in
C:\Users\ on Windows or
~/ on MacOS and most Linux distributions.
# Read in file - if you get an error here, it likely means the file is not in the correct location hadcrut5_spatial = xr.open_dataset("HadCRUT.18.104.22.168.analysis.anomalies.ensemble_mean.nc") # Let's take a look at the data structure hadcrut5_spatial <xarray.Dataset> Dimensions: (time: 2073, latitude: 36, longitude: 72, bnds: 2) Coordinates: * time (time) datetime64[ns] 1850-01-16T12:00:00 ... 2022-09-16 * latitude (latitude) float64 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5 * longitude (longitude) float64 -177.5 -172.5 -167.5 ... 172.5 177.5 realization int64 100 Dimensions without coordinates: bnds Data variables: tas_mean (time, latitude, longitude) float64 ... time_bnds (time, bnds) datetime64[ns] 1850-01-01 ... 2022-10-01 latitude_bnds (latitude, bnds) float64 -90.0 -85.0 -85.0 ... 85.0 90.0 longitude_bnds (longitude, bnds) float64 -180.0 -175.0 ... 175.0 180.0 realization_bnds (bnds) int64 1 200 Attributes: comment: 2m air temperature over land blended with sea water tempera... history: Data set built at: 2022-10-20T13:33:18+00:00 institution: Met Office Hadley Centre / Climatic Research Unit, Universi... licence: HadCRUT5 is licensed under the Open Government Licence v3.0... reference: C. P. Morice, J. J. Kennedy, N. A. Rayner, J. P. Winn, E. H... source: CRUTEM.22.214.171.124 HadSST.126.96.36.199 title: HadCRUT.188.8.131.52 blended land air temperature and sea-surfac... version: HadCRUT.184.108.40.206 Conventions: CF-1.7
We’re particularly interested in the
Data variablesbecause that dictates how we will plot the data - we have 2073 time steps (monthly data) on a grid that is 72 longitude cells by 36 latitude cells. Importantly, the data we are interested in is within the
tas_mean data variable. Plotting 2073 time steps of temperature anomalies isn’t very informative, so let’s simplify this data by taking the average of the last decade. To do so, let’s first take annual averages for the entire dataset and then average over the last 10 years. Note, there are more elegant ways to write this code, but we’re trying to keep things simple. Additionally, I’m choosing not to introduce the
.slice() functions, which will be introduced in a later tutorial.
# First take the annual mean # Note, that this is *not* weighted by days in each month for simplicity hadcrut5_annual_mean = hadcrut5_spatial['tas_mean'].groupby("time.year").mean("time") # From annual means, let's take the average of the last 10 years hadcrut5_last_decade = hadcrut5_annual_mean.sel(year=slice(2012, 2022)).mean("year")
We now have our average of the last 10 years of HadCRUT5, so let’s plot the data. We have three options for how to do so: 1) plotting with
xarrays built in
.plot() function, 2) manually plotting the data using
matplotlib, or 3) create subplots with
matplotlib but plot using
xarray. Option 1 is incredibly simple and great for quickly visualizing the data but has limited customization. Option 2 requires a bit more code but is much more customizable - figure size and subplot orientation can be readily changed.
# Option 1 -------------------------------------------------------------------- hadcrut5_last_decade.plot()
# Option 3 -------------------------------------------------------------------- # Create figure instance fig = plt.figure(figsize = (8, 5)) # Add subplot to figure instance and name it ax ax = fig.add_subplot(111) # Plot - notice how we add the ax argument and point it to our created subplot hadcrut5_last_decade.plot(ax=ax)
In both plots we can make out the fundamental features of contemporary climate change! The Arctic is warming faster than the rest of the Earth and land masses are warming faster than the oceans. Neither of these visualizations are publication ready, so stay tuned for further tutorials on how to produce a polished, publication-ready figure! For homework, use what was introduced in the Time series plotting with matplotlib section to reduce the amount of whitespace, rename axis labels, rename the plot, and rename the colorbar label.
Be on the lookout for Part 3 which will more comprehensively introduce the power of
xarray for handling spatial data.