Data Brochure

The Bureau of Meteorology releases national forecasts (up to 7 days in advance) in the afternoon each day. These forecasts are released as the Australian Digital Forecast Database (ADFD) which provides gridded data over the entire nation and surrounding waters. The current forecast can be viewed at http://www.bom.gov.au/australia/meteye/.

This dataset is prepared by the Evidence Targeted Automation team (gfe_eta@bom.gov.au) from the Science to Services Group, using forecast data obtained from the ADFD and observational data obtained from the Australian Data Archive for Meteorology (ADAM). This dataset focuses on the comparison of temperature, rainfall and wind forecasts against observations over land at the surface level, and includes data for an approximately three-year period from May 2015 to April 2018.

Forecast weather elements include temperature, maximum and minimum temperature, rainfall probabilities and rainfall amounts, dew point, relative humidity, wind magnitude and wind direction. Different forecast products have different time resolutions. For example, temperature forecasts are made for each hour, while maximum and minimum temperature forecasts are made for each day. The native time resolutions of forecast products are preserved in this dataset.

Observation data is provided from ~500 automatic weather stations (AWS) throughout Australia. AWS observation data have a native time resolution of one minute, and report the current air temperature, precipitation, dew point, relative humidity, wind magnitude and wind direction, amongst other weather elements not included in this dataset. The one-minute data are aggregated here into hourly blocks so that the observations can be directly compared against forecasts. For certain forecast elements (e.g. daily maximum temperature), additional aggregation of the hourly AWS data is needed to compare against forecasts, as in the sketch below.
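
As an illustration, the following is a minimal sketch (using pandas; the file name is illustrative) of how hourly MaxT observations could be aggregated to a daily maximum. Note that it naively uses UTC calendar days; a real comparison would need to match the forecast period boundaries exactly.

import pandas as pd

# load one day of hourly observations (illustrative file name)
obs = pd.read_csv('obs/aws_hourly_20180101.csv')
maxt = obs[obs['parameter'] == 'MaxT'].copy()
maxt['valid_start'] = pd.to_datetime(maxt['valid_start'], unit='s')

# take the maximum of the hourly maxima per station per UTC day
daily_maxt = (
    maxt.set_index('valid_start')
        .groupby('station_number')['value']
        .resample('D')
        .max()
)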

This data brochure contains the following sections:

  • Sneak Peek
  • Data Schema for Forecast and Observations
  • Station Data
  • Meteorological parameters
  • Example Code

Sneak Peek

The first few rows of Op_Official_20180101.csv

station_number,area_code,parameter,valid_start,valid_end,value,unit,statistic,level,base_time
69132,NSW_PT019,DailyPoP,1514818800,1514905200,62.0,%,point,SFC,1514246400
62101,NSW_PT094,DailyPoP,1514818800,1514905200,17.0,%,point,SFC,1514246400
69138,NSW_PT149,DailyPoP,1514818800,1514905200,65.0,%,point,SFC,1514246400
66212,NSW_PT132,DailyPoP,1514818800,1514905200,65.0,%,point,SFC,1514246400
69147,NSW_PT087,DailyPoP,1514818800,1514905200,45.0,%,point,SFC,1514246400

The first few rows of aws_hourly_20180101.csv

station_number,area_code,parameter,valid_start,valid_end,value,unit,statistic,level,qc_valid_minutes,qc_valid_start,qc_valid_end
541150,,MaxT,1514764800,1514768400,28.2,Celsius,max,SFC,60,1514764800,1514768400
512019,,MaxT,1514764800,1514768400,31.6,Celsius,max,SFC,60,1514764800,1514768400
507500,,MaxT,1514764800,1514768400,32.1,Celsius,max,SFC,60,1514764800,1514768400
505056,,MaxT,1514764800,1514768400,33.6,Celsius,max,SFC,60,1514764800,1514768400
505053,,MaxT,1514764800,1514768400,34.2,Celsius,max,SFC,60,1514764800,1514768400

Data Schema for Forecast and Observations

TJLite schema

CSV files in this dataset follow the "TJLite schema", a convention for flexible storage of site-based meteorological data. There are two datasets, forecast data and observations data, stored in the fcst and obs directories respectively.

Data are stored in separate CSV files for each day, with a naming format of {source}_{date}.csv. The date is nominal and is used only to partition the data into multiple files; it approximately corresponds to the date of the records contained within. Times within each file are fully recorded in UTC to avoid ambiguity.
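
For example, the daily file names for a date range can be generated as in this minimal sketch (the Op_Official and aws_hourly source prefixes are those used in this dataset):

import pandas as pd

# enumerate daily file names following the {source}_{date}.csv convention
dates = pd.date_range('2018-01-01', '2018-01-07', freq='D')
fcst_paths = [f'fcst/Op_Official_{d:%Y%m%d}.csv' for d in dates]
obs_paths = [f'obs/aws_hourly_{d:%Y%m%d}.csv' for d in dates]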

field name | data type | explanation of content
---- | ---- | ----
station_number | integer | a unique number for the weather station
area_code | string | a BoM identifier for the geographical area where the station resides. This field may be blank.
parameter | string | the name of the weather element. For more information, see note [1].
valid_start | integer | time in seconds since the Epoch (see note [2]). This is the start of the time period that the forecast or observed value corresponds to.
valid_end | integer | time in seconds since the Epoch (see note [2]). This is the end of the time period that the forecast or observed value corresponds to.
value | float | numerical value for the given weather element
unit | string | the unit for the value (e.g. Celsius, mm, %)
statistic | string | the function used to calculate the weather element value
level | string | in this dataset, all rows should be SFC, denoting measurements/forecasts at the surface
base_time | integer | present only for forecast data. Time in seconds since the Epoch (see note [2]). This is a standardised base time that represents when a forecast was prepared. See note [3] below.
qc_valid_minutes | integer | present only for observational data. The number of minutes used to compute an hourly aggregate. This field is populated for rows containing non-instantaneous parameters (see note [4]). Most weather stations report data every minute, and the one-minute data are aggregated into hourly data using the method given in statistic. A station may occasionally fail to report for a number of reasons, so the number of valid minutes used in each hourly aggregate is recorded here.
qc_valid_start | integer | present only for observational data. Time in seconds since the Epoch (see note [2]). For non-instantaneous parameters (see note [4]), this is the timestamp of the start of the first valid minute used for the aggregate. For instantaneous parameters, this is the timestamp of the valid minute used (e.g. if the minute corresponding to valid_start is not available, the next closest valid minute in the hour-block is reported).
qc_valid_end | integer | present only for observational data. Time in seconds since the Epoch (see note [2]). This field is populated for non-instantaneous parameters (see note [4]) and gives the timestamp of the end of the last valid minute used for the aggregate.

Notes

  1. Additional information on parameter (string), the weather element name:

    • For forecast data, these elements are
       ['DailyPoP', 'DailyPoP1', 'DailyPoP5', 'DailyPoP10', 'DailyPoP15',
        'DailyPoP25', 'DailyPoP50', 'DailyPrecip', 'DailyPrecip10Pct',
        'DailyPrecip25Pct', 'DailyPrecip50Pct', 'DailyPrecip75Pct',
        'PoP', 'Precip', 'Precip10Pct', 'Precip25Pct', 'Precip50Pct',
        'MaxT', 'MinT', 'T', 'Td', 'WindGust', 'WindMag', 'WindDir', 'RH']
    • For observational data, these are
       ['MaxT', 'MinT', 'Precip', 'T', 'Td', 'RH',
        'WindDir', 'WindMag', 'MaxWindMag']
    • For an overview of the different weather elements, see below.
  2. Time since Epoch is calculated as time in seconds since the Unix Epoch (1970-01-01T00:00:00+00:00).

  3. Different regions (state offices) may publish their forecasts at slightly different times. This can also vary slightly by field type. For consistency, each forecast is given a nominal base_time which represents the time the forecast was issued. This dataset contains two forecast base times for each day, corresponding to forecasts issued in the morning and afternoon. The difference valid_start - base_time gives the lead_time of a forecast, i.e. how far into the future a forecast is. Generally, forecast accuracy tends to be better at shorter lead times.

  4. Some parameters with hourly periods, such as 'T', 'Td', 'WindDir', and 'WindMag', are considered instantaneous, meaning that their valid_start and valid_end are the same and their value applies to the start of the hour. By contrast, non-instantaneous parameters have values that represent an aggregate over the time period between their valid_start and valid_end times.
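
To make notes [3] and [4] concrete, here is a small sketch of computing forecast lead times and filtering hourly observations to fully-sampled aggregates (requiring all 60 valid minutes is an illustrative choice):

import pandas as pd

# lead time: how far into the future each forecast row is (note [3])
fcst = pd.read_csv('fcst/Op_Official_20180101.csv')
fcst['lead_time'] = pd.to_timedelta(fcst['valid_start'] - fcst['base_time'], unit='s')

# keep only hourly aggregates built from all 60 one-minute reports;
# instantaneous parameters (note [4]) have no qc_valid_minutes and are
# dropped by this filter, so treat them separately if needed
obs = pd.read_csv('obs/aws_hourly_20180101.csv')
complete = obs[obs['qc_valid_minutes'] == 60]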

Station Data

Map of AWS stations in this dataset

StationData.csv

spatial/StationData.csv holds information about the automatic weather station (AWS) sites that are present in the forecast and observational data (they are identified by their station_number in these datasets).

The first few rows of StationData.csv:

WMO_NUM,station_number,station_name,LATITUDE,LONGITUDE,STN_HT,AVIATION_ID,REGION,GridPt Lat,GridPt Lon,MSAS elevation,Distance from GridPt,Roughness,Distance from coast,Category,forecast_district,sa_special
95214,1006,WYNDHAM AERO,-15.51,128.1503,3.8,YWYM,WA,-15.49,128.17,79.76,2.1,39.1,51,mountains2,WA_PW001,
94102,1007,TROUGHTON ISLAND,-13.7542,126.1485,6,YTTI,WA,-13.74,126.17,0.16,2,0.1,5,coast,WA_PW001,
94100,1019,KALUMBURU,-14.2964,126.6453,23,YKAL,WA,-14.28,126.63,17.4,2.3,31.4,10,coast,WA_PW001,
95101,1020,TRUSCOTT,-14.09,126.3867,51,YTST,WA,-14.07,126.38,31.76,1.5,14.3,9,coast,WA_PW001,
99201,2012,HALLS CREEK MO,-18.2292,127.6636,422,HCR,WA,-18.24,127.67,423.36,2.3,12.7,999,mountains2,WA_PW001,
field name | data type | explanation of content
---- | ---- | ----
WMO_NUM | integer | the World Meteorological Organisation (WMO) id of the weather station (unlikely to be relevant).
station_number | integer | id of each weather station.
station_name | string | a human-readable name for the station. Note that AWS is an acronym for "automatic weather station".
LATITUDE | float | the latitude of the station (all negative since Australia is in the Southern Hemisphere!).
LONGITUDE | float | the longitude of the station (all positive since Australia is in the Eastern Hemisphere!).
STN_HT | float | the elevation in metres of the station above mean sea level.
AVIATION_ID | string | a unique identifier for the aviation industry. Where a station is associated with an airport, this is the ICAO code for that airport (e.g. SYDNEY AIRPORT AMO, station_number 66037, has AVIATION_ID YSSY).
REGION | string | the state or territory to which the station belongs (i.e. one of WA, NT, QLD, NSW, VIC, TAS/ANT, SA). Note that stations in the ACT (Australian Capital Territory) belong to NSW.
GridPt Lat | float | the latitude of the grid cell in which the station resides.
GridPt Lon | float | the longitude of the grid cell in which the station resides.
MSAS elevation | float | the elevation of the grid cell in which the station resides in the MSAS analysis (a gridded observational dataset).
Distance from GridPt | float | the distance in kilometres of the station from the centre of the grid cell to which it belongs.
Roughness | float | a parameter specifying the 'roughness' of the terrain around the MSAS grid cell in which the station resides. Higher values indicate rougher terrain.
Distance from coast | integer | the distance in kilometres of the station from the nearest coast. Distances greater than 999 km are clipped to 999. Negative values indicate offshore stations.
Category | string | a classification of the terrain type of the station.
forecast_district | string | id of the forecast district in which the station resides.
sa_special | string | extra information for South Australian stations.
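
As an example, station metadata can be joined onto forecast or observation rows via station_number; a minimal sketch:

import pandas as pd

# attach station coordinates and elevation to observation rows
stations = pd.read_csv('spatial/StationData.csv')
obs = pd.read_csv('obs/aws_hourly_20180101.csv')
merged = obs.merge(
    stations[['station_number', 'station_name', 'LATITUDE', 'LONGITUDE', 'STN_HT']],
    on='station_number',
    how='left',
)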

Station group files

YAML files are used for partitioning the stations into various groups, identified by their station_number.

Three station grouping YAML files are included in the spatial folder:

  • temperature_station_groups.yml: station groups tailored for temperature verification, also likely to be relevant for dew point and relative humidity
  • wind_station_groups.yml: station groups tailored for wind verification
  • station_groups.yml: general station groups, suitable for rainfall and any other verification

Each YAML file has the following general structure:

STATE:  # Cross-region, NSW/ACT, QLD, VIC, etc...
    THEME:  # Topographical, District, Special
        SUB_GROUPS_WITHIN_THEME:  # e.g. Topographical has sub-groups Mountainous, Coastal, and Flat and Hilly
            [1111, 2222, ...]  # the station numbers
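
A minimal sketch of reading one of these files with PyYAML follows; the STATE/THEME/sub-group keys used here are illustrative only and should be adjusted to the names actually present in the file:

import yaml  # PyYAML

with open('spatial/temperature_station_groups.yml') as f:
    groups = yaml.safe_load(f)

# illustrative keys based on the structure shown above
coastal_nsw = groups['NSW/ACT']['Topographical']['Coastal']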

Meteorological parameters

Forecast parameters

The forecast data (in files fcst/Op_Official_{date}.csv) has the following parameters:

parameter | unit | frequency | description of forecast
---- | ---- | ---- | ----
T | degrees Celsius | 1H | the instantaneous air temperature.
MaxT | degrees Celsius | 24H | the maximum temperature over the forecast period.
MinT | degrees Celsius | 24H | the minimum temperature over the forecast period.
Td | degrees Celsius | 1H | the instantaneous dew point temperature.
RH | percentage | 1H | the instantaneous relative humidity.
PoP | percentage | 3H | the chance of receiving at least 0.2mm of rainfall (the detection limit of the rain gauges) over a 3-hour period.
DailyPoP | percentage | 24H | the chance of receiving at least 0.2mm of rainfall (the detection limit of the rain gauges) over a 24-hour period.
DailyPoP1 | percentage | 24H | the chance of receiving at least 1mm of rainfall over a 24-hour period.
DailyPoP5 | percentage | 24H | the chance of receiving at least 5mm of rainfall over a 24-hour period.
DailyPoP10 | percentage | 24H | the chance of receiving at least 10mm of rainfall over a 24-hour period.
DailyPoP15 | percentage | 24H | the chance of receiving at least 15mm of rainfall over a 24-hour period.
DailyPoP25 | percentage | 24H | the chance of receiving at least 25mm of rainfall over a 24-hour period.
DailyPoP50 | percentage | 24H | the chance of receiving at least 50mm of rainfall over a 24-hour period.
Precip | mm | 3H | the expected (in a statistical sense) amount of rainfall over a 3-hour period.
Precip10Pct | mm | 3H | there is a 10% chance of receiving at least this much rainfall over the 3-hour period.
Precip25Pct | mm | 3H | there is a 25% chance of receiving at least this much rainfall over the 3-hour period.
Precip50Pct | mm | 3H | there is a 50% chance of receiving at least this much rainfall over the 3-hour period.
DailyPrecip | mm | 24H | the expected (in a statistical sense) amount of rainfall over a 15Z-aligned 24-hour period.
DailyPrecip10Pct | mm | 24H | there is a 10% chance of receiving at least this much rainfall over the 24-hour period.
DailyPrecip25Pct | mm | 24H | there is a 25% chance of receiving at least this much rainfall over the 24-hour period.
DailyPrecip50Pct | mm | 24H | there is a 50% chance of receiving at least this much rainfall over the 24-hour period.
DailyPrecip75Pct | mm | 24H | there is a 75% chance of receiving at least this much rainfall over the 24-hour period.
WindMag | kt | 1H | the instantaneous 10-minute mean wind speed.
WindDir | degrees | 1H | the instantaneous 10-minute mean wind direction.
WindGust | kt | 1H | the instantaneous maximum wind gust. Note that no corresponding observation is provided with this dataset.
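
As an illustration of how a probability forecast such as DailyPoP might be verified, the sketch below computes a Brier score. It assumes the observed outcomes (whether at least 0.2mm fell in each matching 24-hour period) have already been derived from the hourly Precip observations; brier_score is a hypothetical helper, not part of this dataset.

import numpy as np

def brier_score(pop_percent, rained):
    """Mean squared error of probability forecasts against 0/1 outcomes."""
    p = np.asarray(pop_percent, dtype=float) / 100.0  # DailyPoP is in percent
    o = np.asarray(rained, dtype=float)               # 1 if >= 0.2mm fell, else 0
    return float(np.mean((p - o) ** 2))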

Observational parameters

The observational data (in files obs/aws_hourly_{date}.csv) has the following parameters:

parameter | unit | description
---- | ---- | ----
T | degrees Celsius | the instantaneous air temperature.
MaxT | degrees Celsius | the maximum recorded air temperature over a one-hour period.
MinT | degrees Celsius | the minimum recorded air temperature over a one-hour period.
Td | degrees Celsius | the instantaneous dew point temperature.
RH | percentage | the instantaneous relative humidity.
WindMag | kt | the instantaneous 10-minute mean wind speed.
WindDir | degrees | the instantaneous 10-minute mean wind direction.
MaxWindMag | kt | the maximum 10-minute mean wind speed over a one-hour period.
Precip | mm | the amount of rainfall recorded over a one-hour period, in increments of 0.2mm.
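
Note that wind direction is circular, so naively subtracting forecast from observed direction can overstate the error (e.g. 350 and 10 degrees differ by 20 degrees, not 340). A minimal sketch of a circular difference:

import numpy as np

def wind_direction_error(fcst_deg, obs_deg):
    """Smallest signed angular difference, in the range [-180, 180)."""
    return (np.asarray(fcst_deg) - np.asarray(obs_deg) + 180.0) % 360.0 - 180.0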

Example Code

Useful Functions

The following are some useful functions in Python for loading and working with the data.

In [1]:
import pandas as pd
import xarray as xr

def epoch_seconds_to_timestamp(inp):
    """
    Converts an integer (seconds since UNIX epoch) or list of integers
    into a Pandas Timestamp object.
    
    Example:
        >>> epoch_seconds_to_timestamp(1461945600)
        Timestamp('2016-04-29 16:00:00')
        
        >>> epoch_seconds_to_timestamp([1461945600, 1461945660, 1461945720])
        DatetimeIndex(['2016-04-29 16:00:00', '2016-04-29 16:01:00',
                       '2016-04-29 16:02:00'],
                      dtype='datetime64[ns]', freq=None)
    """
    return pd.to_datetime(inp, unit='s')


def pd_read_fcst_csv(file_path):
    """
    Reads a forecast CSV file and returns a Pandas DataFrame with the
    appropriate type conversions.
    
    Examples:
        >>> pd_read_fcst_csv('fcst/Op_Official_20150501.csv')
    """
    df = pd.read_csv(file_path)
    for field in ['valid_start', 'valid_end', 'base_time']:
        df[field] = epoch_seconds_to_timestamp(df[field])
    df['station_number'] = df['station_number'].astype('int')

    return df


def pd_read_obs_csv(file_path):
    """
    Reads an observations CSV file and returns a Pandas DataFrame with the
    appropriate type conversions.
    
    Example:
        >>> pd_read_obs_csv('obs/aws_hourly_20150501.csv')
    """
    df = pd.read_csv(file_path)
    for field in ['valid_start', 'valid_end', 'qc_valid_start', 'qc_valid_end']:
        df[field] = epoch_seconds_to_timestamp(df[field])
    df['station_number'] = df['station_number'].astype('int')

    return df


def dataframe_param_to_xarray(dataframe, param, indices):
    """
    Filters the `dataframe` using its "parameter" column to contain only
    the supplied `param`, then creates an xarray DataArray object using
    the supplied `indices`.
    
    Example:
        >>> df = pd.DataFrame(
        ...     [[0, 'max', 5],
        ...      [0, 'min', 3],
        ...      [1, 'max', 10],
        ...      [1, 'min', 0]],
        ...     columns=['x', 'parameter', 'value'])
        >>> dataframe_param_to_xarray(df, 'min', ['x'])
        <xarray.DataArray 'min' (x: 2)>
        array([3, 0])
        Coordinates:
          * x        (x) int64 0 1
    """
    # filter by parameter
    dataframe = dataframe[dataframe['parameter'] == param]

    # keep only the indices and value columns
    selection = dataframe[indices + ['value']]

    # drop any spurious duplicated data
    # (the same indices and value, i.e. some data was saved twice?)
    selection = selection.drop_duplicates(subset=indices)

    # drop any spurious NaN-data
    # (indices contain NaN, incomplete data?)
    selection = selection.dropna(subset=indices)

    # set the dataframe index
    selection = selection.set_index(indices)

    # obtain an xarray.DataArray from the 'value' column, using the indices
    # as coordinate values
    data_array = xr.DataArray.from_series(selection["value"])
    data_array.name = param

    return data_array


def fcst_param_to_xarray(dataframe, param, indices=None):
    """
    Filters a forecast dataframe to obtain an xarray DataArray for the specified
    forecast parameter.
    
    Indices defaults to ['station_number', 'base_time', 'valid_start'], but an
    alternative set of indices can be supplied.
    
    Example:
        >>> df = pd_read_fcst_csv('fcst/Op_Official_20150501.csv')
        >>> fcst_param_to_xarray(df, 'T')
    """
    if indices is None:
        indices = ['station_number', 'base_time', 'valid_start']
    return dataframe_param_to_xarray(dataframe, param, indices=indices)


def obs_param_to_xarray(dataframe, param, indices=None):
    """
    Filters an observations dataframe to obtain an xarray DataArray for the specified
    observations parameter.
    
    Indices defaults to ['station_number', 'valid_start'], but an alternative
    set of indices can be supplied.
    
    Example:
        >>> df = pd_read_obs_csv('obs/aws_hourly_20150501.csv')
        >>> obs_param_to_xarray(df, 'T')
    """
    if indices is None:
        indices = ['station_number', 'valid_start']
    return dataframe_param_to_xarray(dataframe, param, indices=indices)

Example calculation of root-mean-square error

In [2]:
# load one week of forecasts from 2018-01-01 - 2018-01-07
# this may take a little while
fcst_dfs = [
    pd_read_fcst_csv('fcst/Op_Official_20180101.csv'),
    pd_read_fcst_csv('fcst/Op_Official_20180102.csv'),
    pd_read_fcst_csv('fcst/Op_Official_20180103.csv'),
    pd_read_fcst_csv('fcst/Op_Official_20180104.csv'),
    pd_read_fcst_csv('fcst/Op_Official_20180105.csv'),
    pd_read_fcst_csv('fcst/Op_Official_20180106.csv'),
    pd_read_fcst_csv('fcst/Op_Official_20180107.csv'),
]
fcst_df = pd.concat(fcst_dfs)
# extract the hourly-temperature forecast into an xarray DataArray
fcst = fcst_param_to_xarray(fcst_df, 'T')

# load observations from the same period to compare against the forecasts
obs_dfs = [
    pd_read_obs_csv('obs/aws_hourly_20180101.csv'),
    pd_read_obs_csv('obs/aws_hourly_20180102.csv'),
    pd_read_obs_csv('obs/aws_hourly_20180103.csv'),
    pd_read_obs_csv('obs/aws_hourly_20180104.csv'),
    pd_read_obs_csv('obs/aws_hourly_20180105.csv'),
    pd_read_obs_csv('obs/aws_hourly_20180106.csv'),
    pd_read_obs_csv('obs/aws_hourly_20180107.csv'),
]
obs_df = pd.concat(obs_dfs)
# extract the hourly-temperature observations into an xarray DataArray
obs = obs_param_to_xarray(obs_df, 'T')

# calculate squared errors
se = pow(fcst - obs, 2.0)

# calculate root-mean-squared errors averaged over all stations
rmse = pow(se.mean(dim=['station_number']), 0.5)
In [3]:
# how many valid (not NaN) points are there?
print('Data points: ', int(se.count()))
Data points:  1455567
In [4]:
# how many stations were matched?
print('No. of stations: ', len(se.coords['station_number']))
No. of stations:  500
In [5]:
# what forecast base-times do we have for this period?
# These times indicate when each forecast was prepared.
fcst.base_time
Out[5]:
<xarray.DataArray 'base_time' (base_time: 33)>
array(['2017-12-23T00:00:00.000000000', '2017-12-23T12:00:00.000000000',
       '2017-12-24T00:00:00.000000000', '2017-12-24T12:00:00.000000000',
       '2017-12-25T00:00:00.000000000', '2017-12-25T12:00:00.000000000',
       '2017-12-26T00:00:00.000000000', '2017-12-26T12:00:00.000000000',
       '2017-12-27T00:00:00.000000000', '2017-12-27T12:00:00.000000000',
       '2017-12-28T00:00:00.000000000', '2017-12-28T12:00:00.000000000',
       '2017-12-29T00:00:00.000000000', '2017-12-29T12:00:00.000000000',
       '2017-12-30T00:00:00.000000000', '2017-12-30T12:00:00.000000000',
       '2017-12-31T00:00:00.000000000', '2017-12-31T12:00:00.000000000',
       '2018-01-01T00:00:00.000000000', '2018-01-01T12:00:00.000000000',
       '2018-01-02T00:00:00.000000000', '2018-01-02T12:00:00.000000000',
       '2018-01-03T00:00:00.000000000', '2018-01-03T12:00:00.000000000',
       '2018-01-04T00:00:00.000000000', '2018-01-04T12:00:00.000000000',
       '2018-01-05T00:00:00.000000000', '2018-01-05T12:00:00.000000000',
       '2018-01-06T00:00:00.000000000', '2018-01-06T12:00:00.000000000',
       '2018-01-07T00:00:00.000000000', '2018-01-07T12:00:00.000000000',
       '2018-01-08T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * base_time  (base_time) datetime64[ns] 2017-12-23 2017-12-23T12:00:00 ...
In [6]:
# Each valid_start time may have forecast values issued up to 7 days in advance
# For example 2018-01-01 has forecast values issued from 2017-12-25 
# to 2018-01-01. 
fcst.sel(valid_start=pd.to_datetime('2018-01-01T00:00'))\
    .dropna(dim='base_time')\
    .base_time
# Note 1: some base times may be missing due to irregular archiving of 
# forecast grids which has been improved since 2015.
# Note 2: Occasionally you may find forecasts that were issued more than 7
# days in advance. These should not be considered to be reliable data.
Out[6]:
<xarray.DataArray 'base_time' (base_time: 14)>
array(['2017-12-25T00:00:00.000000000', '2017-12-25T12:00:00.000000000',
       '2017-12-26T00:00:00.000000000', '2017-12-26T12:00:00.000000000',
       '2017-12-27T00:00:00.000000000', '2017-12-27T12:00:00.000000000',
       '2017-12-28T00:00:00.000000000', '2017-12-28T12:00:00.000000000',
       '2017-12-29T00:00:00.000000000', '2017-12-30T00:00:00.000000000',
       '2017-12-30T12:00:00.000000000', '2017-12-31T00:00:00.000000000',
       '2017-12-31T12:00:00.000000000', '2018-01-01T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * base_time    (base_time) datetime64[ns] 2017-12-25 2017-12-25T12:00:00 ...
    valid_start  datetime64[ns] 2018-01-01
In [7]:
# select one forecast base-time and plot the average root-mean-squared 
# error for the validity times included in that forecast
# The error increases as we forecast farther into the future
from matplotlib import pyplot as plt
%matplotlib inline

rmse.sel(base_time=pd.to_datetime('2018-01-01T00:00')).plot()
plt.ylabel("RMSE (degrees C)")
plt.title(
    "RMSE of forecasts issued at 2018-01-01T00:00 (base-time)\n"
    "averaged over all stations"
)
Out[7]:
Text(0.5,1,'RMSE of forecasts issued at 2018-01-01T00:00 (base-time)\naveraged over all stations')