Data Brochure

The Bureau of Meteorology releases national forecasts (up to 7 days in advance) in the afternoon each day. These forecasts are released as the Australian Digital Forecast Database (ADFD) which provides gridded data over the entire nation and surrounding waters. The current forecast can be viewed at http://www.bom.gov.au/australia/meteye/.

This dataset is prepared by the Evidence Targeted Automation team (gfe_eta@bom.gov.au) from the Science to Services Group, using forecast data obtained from the ADFD and observational data obtained from the Australian Data Archive for Meteorology (ADAM). This dataset focuses on the comparison of temperature, rainfall and wind forecasts against observations over land at the surface level, and includes data for an approximately three-year period from May 2015 to April 2018.

Forecast weather elements include temperature, maximum and minimum temperature, rainfall probabilities and rainfall amounts, dew point, relative humidity, wind magnitude and wind direction. Different forecast products have different time resolutions. For example, temperature forecasts are made for each hour, while maximum and minimum temperature forecasts are made for each day. The native time resolutions of forecast products are preserved in this dataset.

Observation data is provided from ~500 automatic weather stations (AWS) throughout Australia. AWS observation data have a native time resolution of one minute, and report the current air temperature, precipitation, dew point, relative humidity, wind magnitude and wind direction, amongst other weather elements not included in this dataset. The one-minute data are aggregated here into hourly blocks so that the observations can be directly compared against forecasts. For certain forecast elements (e.g. daily maximum temperature), additional aggregation of the hourly AWS data is needed to compare against forecasts, as in the sketch below.
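
As an illustration, the following is a minimal sketch (using pandas; the file name is illustrative) of how hourly MaxT observations could be aggregated to a daily maximum. Note that it naively uses UTC calendar days; a real comparison would need to match the forecast period boundaries exactly.

import pandas as pd

# load one day of hourly observations (illustrative file name)
obs = pd.read_csv('obs/aws_hourly_20180101.csv')
maxt = obs[obs['parameter'] == 'MaxT'].copy()
maxt['valid_start'] = pd.to_datetime(maxt['valid_start'], unit='s')

# take the maximum of the hourly maxima per station per UTC day
daily_maxt = (
    maxt.set_index('valid_start')
        .groupby('station_number')['value']
        .resample('D')
        .max()
)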

This data brochure contains the following sections:

  • Sneak Peek
  • Data Schema for Forecast and Observations
  • Station Data
  • Meteorological parameters
  • Example Code

Sneak Peek

The first few rows of Op_Official_20180101.csv

station_number,area_code,parameter,valid_start,valid_end,value,unit,statistic,level,base_time
69132,NSW_PT019,DailyPoP,1514818800,1514905200,62.0,%,point,SFC,1514246400
62101,NSW_PT094,DailyPoP,1514818800,1514905200,17.0,%,point,SFC,1514246400
69138,NSW_PT149,DailyPoP,1514818800,1514905200,65.0,%,point,SFC,1514246400
66212,NSW_PT132,DailyPoP,1514818800,1514905200,65.0,%,point,SFC,1514246400
69147,NSW_PT087,DailyPoP,1514818800,1514905200,45.0,%,point,SFC,1514246400

The first few rows of aws_hourly_20180101.csv

station_number,area_code,parameter,valid_start,valid_end,value,unit,statistic,level,qc_valid_minutes,qc_valid_start,qc_valid_end
541150,,MaxT,1514764800,1514768400,28.2,Celsius,max,SFC,60,1514764800,1514768400
512019,,MaxT,1514764800,1514768400,31.6,Celsius,max,SFC,60,1514764800,1514768400
507500,,MaxT,1514764800,1514768400,32.1,Celsius,max,SFC,60,1514764800,1514768400
505056,,MaxT,1514764800,1514768400,33.6,Celsius,max,SFC,60,1514764800,1514768400
505053,,MaxT,1514764800,1514768400,34.2,Celsius,max,SFC,60,1514764800,1514768400

Data Schema for Forecast and Observations

TJLite schema

CSV files in this dataset follow the "TJLite schema", a convention for flexible storage of site-based meteorological data. There are two datasets, forecast data and observations data, stored in the fcst and obs directories respectively.

Data are stored in separate CSV files for each day, with a naming format of {source}_{date}.csv. The date is nominal and is used only to partition the data into multiple files; it approximately corresponds to the date of the records contained within. Times within each file are fully recorded in UTC to avoid ambiguity.
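
For example, the daily file names for a date range can be generated as in this minimal sketch (the Op_Official and aws_hourly source prefixes are those used in this dataset):

import pandas as pd

# enumerate daily file names following the {source}_{date}.csv convention
dates = pd.date_range('2018-01-01', '2018-01-07', freq='D')
fcst_paths = [f'fcst/Op_Official_{d:%Y%m%d}.csv' for d in dates]
obs_paths = [f'obs/aws_hourly_{d:%Y%m%d}.csv' for d in dates]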

field name | data type | explanation of content
---- | ---- | ----
station_number | integer | a unique number for the weather station
area_code | string | a BoM identifier for the geographical area where the station resides. This field may be blank.
parameter | string | the name of the weather element. For more information, see note [1].
valid_start | integer | time in seconds since the Epoch (see note [2]). This is the start of the time period that the forecast or observed value corresponds to.
valid_end | integer | time in seconds since the Epoch (see note [2]). This is the end of the time period that the forecast or observed value corresponds to.
value | float | numerical value for the given weather element
unit | string | the unit for the value (e.g. Celsius, mm, %)
statistic | string | the function used to calculate the weather element value
level | string | in this dataset, all rows should be SFC, denoting measurements/forecasts at the surface
base_time | integer | present only for forecast data. Time in seconds since the Epoch (see note [2]). This is a standardised base time that represents when a forecast was prepared. See note [3] below.
qc_valid_minutes | integer | present only for observational data. The number of minutes used to compute an hourly aggregate. This field is populated for rows containing non-instantaneous parameters (see note [4]). Most weather stations report data every minute, and the one-minute data are aggregated into hourly data using the method given in statistic. A station may occasionally fail to report for a number of reasons, so the number of valid minutes used in each hourly aggregate is recorded here.
qc_valid_start | integer | present only for observational data. Time in seconds since the Epoch (see note [2]). For non-instantaneous parameters (see note [4]), this is the timestamp of the start of the first valid minute used for the aggregate. For instantaneous parameters, this is the timestamp of the valid minute used (e.g. if the minute corresponding to valid_start is not available, the next closest valid minute in the hour-block is reported).
qc_valid_end | integer | present only for observational data. Time in seconds since the Epoch (see note [2]). This field is populated for non-instantaneous parameters (see note [4]) and gives the timestamp of the end of the last valid minute used for the aggregate.

Notes

  1. Additional information on parameter (string), the weather element name:

    • For forecast data, these elements are
       ['DailyPoP', 'DailyPoP1', 'DailyPoP5', 'DailyPoP10', 'DailyPoP15',
        'DailyPoP25', 'DailyPoP50', 'DailyPrecip', 'DailyPrecip10Pct',
        'DailyPrecip25Pct', 'DailyPrecip50Pct', 'DailyPrecip75Pct',
        'PoP', 'Precip', 'Precip10Pct', 'Precip25Pct', 'Precip50Pct',
        'MaxT', 'MinT', 'T', 'Td', 'WindGust', 'WindMag', 'WindDir', 'RH']
    • For observational data, these are
       ['MaxT', 'MinT', 'Precip', 'T', 'Td', 'RH',
        'WindDir', 'WindMag', 'MaxWindMag']
    • For an overview of the different weather elements, see below.
  2. Time since Epoch is calculated as time in seconds since the Unix Epoch (1970-01-01T00:00:00+00:00).

  3. Different regions (state offices) may publish their forecasts at slightly different times. This can also vary slightly by field type. For consistency, each forecast is given a nominal base_time which represents the time the forecast was issued. This dataset contains two forecast base times for each day, corresponding to forecasts issued in the morning and afternoon. The difference valid_start - base_time gives the lead_time of a forecast, i.e. how far into the future a forecast is. Generally, forecast accuracy tends to be better at shorter lead times.

  4. Some parameters with hourly periods, such as 'T', 'Td', 'WindDir', and 'WindMag', are considered instantaneous, meaning that their valid_start and valid_end are the same and their value applies to the start of the hour. By contrast, non-instantaneous parameters have values that represent an aggregate over the time period between their valid_start and valid_end times.
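
To make notes [3] and [4] concrete, here is a small sketch of computing forecast lead times and filtering hourly observations to fully-sampled aggregates (requiring all 60 valid minutes is an illustrative choice):

import pandas as pd

# lead time: how far into the future each forecast row is (note [3])
fcst = pd.read_csv('fcst/Op_Official_20180101.csv')
fcst['lead_time'] = pd.to_timedelta(fcst['valid_start'] - fcst['base_time'], unit='s')

# keep only hourly aggregates built from all 60 one-minute reports;
# instantaneous parameters (note [4]) have no qc_valid_minutes and are
# dropped by this filter, so treat them separately if needed
obs = pd.read_csv('obs/aws_hourly_20180101.csv')
complete = obs[obs['qc_valid_minutes'] == 60]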

Station Data

Map of AWS stations in this dataset

StationData.csv

spatial/StationData.csv holds information about the automatic weather station (AWS) sites that are present in the forecast and observational data (they are identified by their station_number in these datasets).

The first few rows of StationData.csv:

WMO_NUM,station_number,station_name,LATITUDE,LONGITUDE,STN_HT,AVIATION_ID,REGION,GridPt Lat,GridPt Lon,MSAS elevation,Distance from GridPt,Roughness,Distance from coast,Category,forecast_district,sa_special
95214,1006,WYNDHAM AERO,-15.51,128.1503,3.8,YWYM,WA,-15.49,128.17,79.76,2.1,39.1,51,mountains2,WA_PW001,
94102,1007,TROUGHTON ISLAND,-13.7542,126.1485,6,YTTI,WA,-13.74,126.17,0.16,2,0.1,5,coast,WA_PW001,
94100,1019,KALUMBURU,-14.2964,126.6453,23,YKAL,WA,-14.28,126.63,17.4,2.3,31.4,10,coast,WA_PW001,
95101,1020,TRUSCOTT,-14.09,126.3867,51,YTST,WA,-14.07,126.38,31.76,1.5,14.3,9,coast,WA_PW001,
99201,2012,HALLS CREEK MO,-18.2292,127.6636,422,HCR,WA,-18.24,127.67,423.36,2.3,12.7,999,mountains2,WA_PW001,
field name | data type | explanation of content
---- | ---- | ----
WMO_NUM | integer | the World Meteorological Organisation (WMO) id of the weather station (unlikely to be relevant).
station_number | integer | id of each weather station.
station_name | string | a human-readable name for the station. Note that AWS is an acronym for "automatic weather station".
LATITUDE | float | the latitude of the station (all negative since Australia is in the Southern Hemisphere!).
LONGITUDE | float | the longitude of the station (all positive since Australia is in the Eastern Hemisphere!).
STN_HT | float | the elevation in metres of the station above mean sea level.
AVIATION_ID | string | a unique identifier for the aviation industry. Where a station is associated with an airport, this is the ICAO code for that airport (e.g. SYDNEY AIRPORT AMO, station_number 66037, has AVIATION_ID YSSY).
REGION | string | the state or territory to which the station belongs (i.e. one of WA, NT, QLD, NSW, VIC, TAS/ANT, SA). Note that stations in the ACT (Australian Capital Territory) belong to NSW.
GridPt Lat | float | the latitude of the grid cell in which the station resides.
GridPt Lon | float | the longitude of the grid cell in which the station resides.
MSAS elevation | float | the elevation of the grid cell in which the station resides in the MSAS analysis (a gridded observational dataset).
Distance from GridPt | float | the distance in kilometres of the station from the centre of the grid cell to which it belongs.
Roughness | float | a parameter specifying the 'roughness' of the terrain around the MSAS grid cell in which the station resides. Higher values indicate rougher terrain.
Distance from coast | integer | the distance in kilometres of the station from the nearest coast. Distances greater than 999 km are clipped to 999. Negative values indicate offshore stations.
Category | string | a classification of the terrain type of the station.
forecast_district | string | id of the forecast district in which the station resides.
sa_special | string | extra information for South Australian stations.
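
As an example, station metadata can be joined onto forecast or observation rows via station_number; a minimal sketch:

import pandas as pd

# attach station coordinates and elevation to observation rows
stations = pd.read_csv('spatial/StationData.csv')
obs = pd.read_csv('obs/aws_hourly_20180101.csv')
merged = obs.merge(
    stations[['station_number', 'station_name', 'LATITUDE', 'LONGITUDE', 'STN_HT']],
    on='station_number',
    how='left',
)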

Station group files

YAML files are used for partitioning the stations into various groups, identified by their station_number.

Three station grouping YAML files are included in the spatial folder:

  • temperature_station_groups.yml: station groups tailored for temperature verification, also likely to be relevant for dew point and relative humidity
  • wind_station_groups.yml: station groups tailored for wind verification
  • station_groups.yml: general station groups, suitable for rainfall and any other verification

Each YAML file has the following general structure:

STATE:  # Cross-region, NSW/ACT, QLD, VIC, etc...
    THEME:  # Topographical, District, Special
        SUB_GROUPS_WITHIN_THEME:  # e.g. Topographical has sub-groups Mountainous, Coastal, and Flat and Hilly
            [1111, 2222, ...]  # the station numbers
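
A minimal sketch of reading one of these files with PyYAML follows; the STATE/THEME/sub-group keys used here are illustrative only and should be adjusted to the names actually present in the file:

import yaml  # PyYAML

with open('spatial/temperature_station_groups.yml') as f:
    groups = yaml.safe_load(f)

# illustrative keys based on the structure shown above
coastal_nsw = groups['NSW/ACT']['Topographical']['Coastal']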

Meteorological parameters

Forecast parameters

The forecast data (in files fcst/Op_Official_{date}.csv) has the following parameters:

parameter | unit | frequency | description of forecast
---- | ---- | ---- | ----
T | degrees Celsius | 1H | the instantaneous air temperature.
MaxT | degrees Celsius | 24H | the maximum temperature over the forecast period.
MinT | degrees Celsius | 24H | the minimum temperature over the forecast period.
Td | degrees Celsius | 1H | the instantaneous dew point temperature.
RH | percentage | 1H | the instantaneous relative humidity.
PoP | percentage | 3H | the chance of receiving at least 0.2mm of rainfall (the detection limit of the rain gauges) over a 3-hour period.
DailyPoP | percentage | 24H | the chance of receiving at least 0.2mm of rainfall (the detection limit of the rain gauges) over a 24-hour period.
DailyPoP1 | percentage | 24H | the chance of receiving at least 1mm of rainfall over a 24-hour period.
DailyPoP5 | percentage | 24H | the chance of receiving at least 5mm of rainfall over a 24-hour period.
DailyPoP10 | percentage | 24H | the chance of receiving at least 10mm of rainfall over a 24-hour period.
DailyPoP15 | percentage | 24H | the chance of receiving at least 15mm of rainfall over a 24-hour period.
DailyPoP25 | percentage | 24H | the chance of receiving at least 25mm of rainfall over a 24-hour period.
DailyPoP50 | percentage | 24H | the chance of receiving at least 50mm of rainfall over a 24-hour period.
Precip | mm | 3H | the expected (in a statistical sense) amount of rainfall over a 3-hour period.
Precip10Pct | mm | 3H | there is a 10% chance of receiving at least this much rainfall over the 3-hour period.
Precip25Pct | mm | 3H | there is a 25% chance of receiving at least this much rainfall over the 3-hour period.
Precip50Pct | mm | 3H | there is a 50% chance of receiving at least this much rainfall over the 3-hour period.
DailyPrecip | mm | 24H | the expected (in a statistical sense) amount of rainfall over a 15Z-aligned 24-hour period.
DailyPrecip10Pct | mm | 24H | there is a 10% chance of receiving at least this much rainfall over the 24-hour period.
DailyPrecip25Pct | mm | 24H | there is a 25% chance of receiving at least this much rainfall over the 24-hour period.
DailyPrecip50Pct | mm | 24H | there is a 50% chance of receiving at least this much rainfall over the 24-hour period.
DailyPrecip75Pct | mm | 24H | there is a 75% chance of receiving at least this much rainfall over the 24-hour period.
WindMag | kt | 1H | the instantaneous 10-minute mean wind speed.
WindDir | degrees | 1H | the instantaneous 10-minute mean wind direction.
WindGust | kt | 1H | the instantaneous maximum wind gust. Note that no corresponding observation is provided with this dataset.
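
As an illustration of how a probability forecast such as DailyPoP might be verified, the sketch below computes a Brier score. It assumes the observed outcomes (whether at least 0.2mm fell in each matching 24-hour period) have already been derived from the hourly Precip observations; brier_score is a hypothetical helper, not part of this dataset.

import numpy as np

def brier_score(pop_percent, rained):
    """Mean squared error of probability forecasts against 0/1 outcomes."""
    p = np.asarray(pop_percent, dtype=float) / 100.0  # DailyPoP is in percent
    o = np.asarray(rained, dtype=float)               # 1 if >= 0.2mm fell, else 0
    return float(np.mean((p - o) ** 2))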

Observational parameters

The observational data (in files obs/aws_hourly_{date}.csv) has the following parameters:

parameter | unit | description
---- | ---- | ----
T | degrees Celsius | the instantaneous air temperature.
MaxT | degrees Celsius | the maximum recorded air temperature over a one-hour period.
MinT | degrees Celsius | the minimum recorded air temperature over a one-hour period.
Td | degrees Celsius | the instantaneous dew point temperature.
RH | percentage | the instantaneous relative humidity.
WindMag | kt | the instantaneous 10-minute mean wind speed.
WindDir | degrees | the instantaneous 10-minute mean wind direction.
MaxWindMag | kt | the maximum 10-minute mean wind speed over a one-hour period.
Precip | mm | the amount of rainfall recorded over a one-hour period, in increments of 0.2mm.
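
Note that wind direction is circular, so naively subtracting forecast from observed direction can overstate the error (e.g. 350 and 10 degrees differ by 20 degrees, not 340). A minimal sketch of a circular difference:

import numpy as np

def wind_direction_error(fcst_deg, obs_deg):
    """Smallest signed angular difference, in the range [-180, 180)."""
    return (np.asarray(fcst_deg) - np.asarray(obs_deg) + 180.0) % 360.0 - 180.0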

Example Code

Useful Functions

The following are some useful functions in Python for loading and working with the data.

In [1]:
import pandas as pd
import xarray as xr

def epoch_seconds_to_timestamp(inp):
    """
    Converts an integer (seconds since UNIX epoch) or list of integers
    into a Pandas Timestamp object.
    
    Example:
        >>> epoch_seconds_to_timestamp(1461945600)
        Timestamp('2016-04-29 16:00:00')
        
        >>> epoch_seconds_to_timestamp([1461945600, 1461945660, 1461945720])
        DatetimeIndex(['2016-04-29 16:00:00', '2016-04-29 16:01:00',
                       '2016-04-29 16:02:00'],
                      dtype='datetime64[ns]', freq=None)
    """
    return pd.to_datetime(inp, unit='s')


def pd_read_fcst_csv(file_path):
    """
    Reads a forecast CSV file and returns a Pandas DataFrame with the
    appropriate type conversions.
    
    Examples:
        >>> pd_read_fcst_csv('fcst/Op_Official_20150501.csv')
    """
    df = pd.read_csv(file_path)
    for field in ['valid_start', 'valid_end', 'base_time']:
        df[field] = epoch_seconds_to_timestamp(df[field])
    df['station_number'] = df['station_number'].astype('int')

    return df


def pd_read_obs_csv(file_path):
    """
    Reads an observations CSV file and returns a Pandas DataFrame with the
    appropriate type conversions.
    
    Example:
        >>> pd_read_obs_csv('obs/aws_hourly_20150501.csv')
    """
    df = pd.read_csv(file_path)
    for field in ['valid_start', 'valid_end', 'qc_valid_start', 'qc_valid_end']:
        df[field] = epoch_seconds_to_timestamp(df[field])
    df['station_number'] = df['station_number'].astype('int')

    return df


def dataframe_param_to_xarray(dataframe, param, indices):
    """
    Filters the `dataframe` using its "parameter" column to contain only
    the supplied `param`, then creates an xarray DataArray object using
    the supplied `indices`.
    
    Example:
        >>> df = pd.DataFrame(
        ...     [[0, 'max', 5],
        ...      [0, 'min', 3],
        ...      [1, 'max', 10],
        ...      [1, 'min', 0]],
        ...     columns=['x', 'parameter', 'value'])
        >>> dataframe_param_to_xarray(df, 'min', ['x'])
        <xarray.DataArray 'min' (x: 2)>
        array([3, 0])
        Coordinates:
          * x        (x) int64 0 1
    """
    # filter by parameter
    dataframe = dataframe[dataframe['parameter'] == param]

    # keep only the indices and value columns
    selection = dataframe[indices + ['value']]

    # drop any spurious duplicated data
    # (the same indices and value, i.e. some data was saved twice?)
    selection = selection.drop_duplicates(subset=indices)

    # drop any spurious NaN-data
    # (indices contain NaN, incomplete data?)
    selection = selection.dropna(subset=indices)

    # set the dataframe index
    selection = selection.set_index(indices)

    # obtain an xarray.DataArray from the 'value' column, using the indices
    # as coordinate values
    data_array = xr.DataArray.from_series(selection["value"])
    data_array.name = param

    return data_array


def fcst_param_to_xarray(dataframe, param, indices=None):
    """
    Filters a forecast dataframe to obtain an xarray DataArray for the specified
    forecast parameter.
    
    Indices defaults to ['station_number', 'base_time', 'valid_start'], but an
    alternative set of indices can be supplied.
    
    Example:
        >>> df = pd_read_fcst_csv('fcst/Op_Official_20150501.csv')
        >>> fcst_param_to_xarray(df, 'T')
    """
    if indices is None:
        indices = ['station_number', 'base_time', 'valid_start']
    return dataframe_param_to_xarray(dataframe, param, indices=indices)


def obs_param_to_xarray(dataframe, param, indices=None):
    """
    Filters an observations dataframe to obtain an xarray DataArray for the specified
    observations parameter.
    
    Indices defaults to ['station_number', 'valid_start'], but an alternative
    set of indices can be supplied.
    
    Example:
        >>> df = pd_read_obs_csv('obs/aws_hourly_20150501.csv')
        >>> obs_param_to_xarray(df, 'T')
    """
    if indices is None:
        indices = ['station_number', 'valid_start']
    return dataframe_param_to_xarray(dataframe, param, indices=indices)

Example calculation of root-mean-square error

In [2]:
# load one week of forecasts from 2018-01-01 - 2018-01-07
# this may take a little while
fcst_dfs = [
    pd_read_fcst_csv('fcst/Op_Official_20180101.csv'),
    pd_read_fcst_csv('fcst/Op_Official_20180102.csv'),
    pd_read_fcst_csv('fcst/Op_Official_20180103.csv'),
    pd_read_fcst_csv('fcst/Op_Official_20180104.csv'),
    pd_read_fcst_csv('fcst/Op_Official_20180105.csv'),
    pd_read_fcst_csv('fcst/Op_Official_20180106.csv'),
    pd_read_fcst_csv('fcst/Op_Official_20180107.csv'),
]
fcst_df = pd.concat(fcst_dfs)
# extract the hourly-temperature forecast into an xarray DataArray
fcst = fcst_param_to_xarray(fcst_df, 'T')

# load observations from the same period to compare against the forecasts
obs_dfs = [
    pd_read_obs_csv('obs/aws_hourly_20180101.csv'),
    pd_read_obs_csv('obs/aws_hourly_20180102.csv'),
    pd_read_obs_csv('obs/aws_hourly_20180103.csv'),
    pd_read_obs_csv('obs/aws_hourly_20180104.csv'),
    pd_read_obs_csv('obs/aws_hourly_20180105.csv'),
    pd_read_obs_csv('obs/aws_hourly_20180106.csv'),
    pd_read_obs_csv('obs/aws_hourly_20180107.csv'),
]
obs_df = pd.concat(obs_dfs)
# extract the hourly-temperature observations into an xarray DataArray
obs = obs_param_to_xarray(obs_df, 'T')

# calculate squared errors
se = pow(fcst - obs, 2.0)

# calculate root-mean-squared errors averaged over all stations
rmse = pow(se.mean(dim=['station_number']), 0.5)
In [3]:
# how many valid (not NaN) points are there?
print('Data points: ', int(se.count()))
Data points:  1455567
In [4]:
# how many stations were matched?
print('No. of stations: ', len(se.coords['station_number']))
No. of stations:  500
In [5]:
# what forecast base-times do we have for this period?
# These times indicate when each forecast was prepared.
fcst.base_time
Out[5]:
<xarray.DataArray 'base_time' (base_time: 33)>
array(['2017-12-23T00:00:00.000000000', '2017-12-23T12:00:00.000000000',
       '2017-12-24T00:00:00.000000000', '2017-12-24T12:00:00.000000000',
       '2017-12-25T00:00:00.000000000', '2017-12-25T12:00:00.000000000',
       '2017-12-26T00:00:00.000000000', '2017-12-26T12:00:00.000000000',
       '2017-12-27T00:00:00.000000000', '2017-12-27T12:00:00.000000000',
       '2017-12-28T00:00:00.000000000', '2017-12-28T12:00:00.000000000',
       '2017-12-29T00:00:00.000000000', '2017-12-29T12:00:00.000000000',
       '2017-12-30T00:00:00.000000000', '2017-12-30T12:00:00.000000000',
       '2017-12-31T00:00:00.000000000', '2017-12-31T12:00:00.000000000',
       '2018-01-01T00:00:00.000000000', '2018-01-01T12:00:00.000000000',
       '2018-01-02T00:00:00.000000000', '2018-01-02T12:00:00.000000000',
       '2018-01-03T00:00:00.000000000', '2018-01-03T12:00:00.000000000',
       '2018-01-04T00:00:00.000000000', '2018-01-04T12:00:00.000000000',
       '2018-01-05T00:00:00.000000000', '2018-01-05T12:00:00.000000000',
       '2018-01-06T00:00:00.000000000', '2018-01-06T12:00:00.000000000',
       '2018-01-07T00:00:00.000000000', '2018-01-07T12:00:00.000000000',
       '2018-01-08T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * base_time  (base_time) datetime64[ns] 2017-12-23 2017-12-23T12:00:00 ...
In [6]:
# Each valid_start time may have forecast values issued up to 7 days in advance
# For example 2018-01-01 has forecast values issued from 2017-12-25 
# to 2018-01-01. 
fcst.sel(valid_start=pd.to_datetime('2018-01-01T00:00'))\
    .dropna(dim='base_time')\
    .base_time
# Note 1: some base times may be missing due to irregular archiving of 
# forecast grids which has been improved since 2015.
# Note 2: Occasionally you may find forecasts that were issued more than 7
# days in advance. These should not be considered to be reliable data.
Out[6]:
<xarray.DataArray 'base_time' (base_time: 14)>
array(['2017-12-25T00:00:00.000000000', '2017-12-25T12:00:00.000000000',
       '2017-12-26T00:00:00.000000000', '2017-12-26T12:00:00.000000000',
       '2017-12-27T00:00:00.000000000', '2017-12-27T12:00:00.000000000',
       '2017-12-28T00:00:00.000000000', '2017-12-28T12:00:00.000000000',
       '2017-12-29T00:00:00.000000000', '2017-12-30T00:00:00.000000000',
       '2017-12-30T12:00:00.000000000', '2017-12-31T00:00:00.000000000',
       '2017-12-31T12:00:00.000000000', '2018-01-01T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * base_time    (base_time) datetime64[ns] 2017-12-25 2017-12-25T12:00:00 ...
    valid_start  datetime64[ns] 2018-01-01
In [7]:
# select one forecast base-time and plot the average root-mean-squared 
# error for the validity times included in that forecast
# The error increases as we forecast farther into the future
from matplotlib import pyplot as plt
%matplotlib inline

rmse.sel(base_time=pd.to_datetime('2018-01-01T00:00')).plot()
plt.ylabel("RMSE (degrees C)")
plt.title(
    "RMSE of forecasts issued at 2018-01-01T00:00 (base-time)\n"
    "averaged over all stations"
)
Out[7]:
Text(0.5,1,'RMSE of forecasts issued at 2018-01-01T00:00 (base-time)\naveraged over all stations')