The Bureau of Meteorology releases national forecasts (up to 7 days in advance) in the afternoon each day. These forecasts are released as the Australian Digital Forecast Database (ADFD) which provides gridded data over the entire nation and surrounding waters. The current forecast can be viewed at http://www.bom.gov.au/australia/meteye/.
This dataset is prepared by the Evidence Targeted Automation team (gfe_eta@bom.gov.au) from the Science to Services Group, using forecast data obtained from the ADFD and observational data obtained from the Australian Data Archive for Meteorology (ADAM). This dataset focuses on the comparison of temperature, rainfall and wind forecasts against observations over land at the surface level, and includes data for an approximately three-year-long period from May 2015 to April 2018.
Forecast weather elements include temperature, maximum and minimum temperature, rainfall probabilities and rainfall amounts, dew point, relative humidity, wind magnitude and wind direction. Different forecast products have different time resolutions. For example, temperature forecasts are made for each hour, while maximum and minimum temperature forecasts are made for each day. The native time resolutions of forecast products are preserved in this dataset.
Observation data is provided from ~500 automatic weather stations (AWS) throughout Australia. AWS observations data have a native time resolution of one minute, and report the current air temperature, precipitation, dew point, relative humidity, wind magnitude and wind direction, amongst other weather elements not included in this dataset. The one-minute data are aggregated here to hourly blocks so that the observations can be directly compared against forecasts. For certain forecast elements (e.g. daily maximum temperature), additional aggregation of hourly AWS data will be needed to compare against forecasts.
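As a sketch of that additional aggregation, hourly `MaxT` rows can be reduced to a daily maximum per station with pandas. The tiny hand-made frame below stands in for a real observations file (station number and values are illustrative only):

```python
import pandas as pd

# Hypothetical hourly MaxT rows for one station (times in epoch seconds).
hourly = pd.DataFrame({
    'station_number': [66037, 66037, 66037],
    'parameter': ['MaxT', 'MaxT', 'MaxT'],
    'valid_start': [1514764800, 1514768400, 1514772000],
    'value': [28.2, 29.0, 27.5],
})
hourly['valid_start'] = pd.to_datetime(hourly['valid_start'], unit='s')

# Reduce the hourly maxima to one daily maximum per station.
daily_max = (
    hourly[hourly['parameter'] == 'MaxT']
    .groupby(['station_number', pd.Grouper(key='valid_start', freq='D')])['value']
    .max()
)
print(daily_max)
```

The resulting series is indexed by station number and day, which lines up naturally with the daily `MaxT` forecast rows.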
This data brochure contains the following sections:
The first few rows of Op_Official_20180101.csv
station_number,area_code,parameter,valid_start,valid_end,value,unit,statistic,level,base_time
69132,NSW_PT019,DailyPoP,1514818800,1514905200,62.0,%,point,SFC,1514246400
62101,NSW_PT094,DailyPoP,1514818800,1514905200,17.0,%,point,SFC,1514246400
69138,NSW_PT149,DailyPoP,1514818800,1514905200,65.0,%,point,SFC,1514246400
66212,NSW_PT132,DailyPoP,1514818800,1514905200,65.0,%,point,SFC,1514246400
69147,NSW_PT087,DailyPoP,1514818800,1514905200,45.0,%,point,SFC,1514246400
The first few rows of aws_hourly_20180101.csv
station_number,area_code,parameter,valid_start,valid_end,value,unit,statistic,level,qc_valid_minutes,qc_valid_start,qc_valid_end
541150,,MaxT,1514764800,1514768400,28.2,Celsius,max,SFC,60,1514764800,1514768400
512019,,MaxT,1514764800,1514768400,31.6,Celsius,max,SFC,60,1514764800,1514768400
507500,,MaxT,1514764800,1514768400,32.1,Celsius,max,SFC,60,1514764800,1514768400
505056,,MaxT,1514764800,1514768400,33.6,Celsius,max,SFC,60,1514764800,1514768400
505053,,MaxT,1514764800,1514768400,34.2,Celsius,max,SFC,60,1514764800,1514768400
CSV files in this dataset follow the "TJLite schema", a convention for flexible storage of site-based meteorological data. There are two datasets, forecast data and observations data, stored in the `fcst` and `obs` directories respectively.

Data are stored in separate CSV files for each day, with a naming format of `{source}_{date}.csv`. The date is nominal and is only used to partition the data into multiple files; it corresponds to the date of the records contained within. Times within the files are fully recorded in UTC to avoid ambiguity.
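Because of this one-file-per-day convention, the file names for a date range can be generated rather than written out by hand. A minimal sketch (the `Op_Official` source name is taken from the forecast files above; the date range is arbitrary):

```python
import pandas as pd

def daily_filenames(source, start, end):
    """Build the per-day CSV file names for an inclusive date range."""
    dates = pd.date_range(start, end, freq='D')
    return [f"{source}_{d.strftime('%Y%m%d')}.csv" for d in dates]

names = daily_filenames('Op_Official', '2018-01-01', '2018-01-03')
print(names)
# ['Op_Official_20180101.csv', 'Op_Official_20180102.csv', 'Op_Official_20180103.csv']
```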
field name | data type | explanation of content |
---|---|---|
station_number | integer | a unique number for the weather station |
area_code | string | a BoM identifier for the geographical area where the station resides. This field may be blank. |
parameter | string | the name of the weather element. For more information see note [1]. |
valid_start | integer | time in seconds since the Epoch (see note [2]). This is the start of the time period that the forecast or observed value corresponds to. |
valid_end | integer | time in seconds since the Epoch (see note [2]). This is the end of the time period that the forecast or observed value corresponds to. |
value | float | numerical value for the given weather element |
unit | string | the unit for the value (e.g. Celsius, mm, %) |
statistic | string | the function used to calculate the weather element value. |
level | string | In this dataset, all rows should be SFC to denote measurements/forecasts for the surface. |
base_time | integer | present only for forecast data. Time in seconds since the Epoch (see note [2]). This is a standardised base time that represents when a forecast is prepared. See note [3] below. |
qc_valid_minutes | integer | present only for observational data. Number of minutes used to perform an hourly-aggregate. This field is populated for rows containing non-instantaneous (see note [4]) parameters. Most weather stations report data every minute, and the one-minute data are aggregated into hourly data using the method given in statistic. Sometimes a station may fail to report data for a number of reasons. The number of valid minutes used in each hourly aggregate is recorded here. |
qc_valid_start | integer | present only for observational data. Time in seconds since the Epoch (see note [2]). For non-instantaneous (see note [4]) parameters, this is the timestamp of the start of first valid minute used for the aggregate. For instantaneous parameters, this is the timestamp of the valid minute used (e.g. if the minute corresponding to valid_start is not available, the next closest valid minute in the hour-block will be reported). |
qc_valid_end | integer | present only for observational data. Time in seconds since the Epoch (see note [2]). This field is populated for non-instantaneous parameters (see note [4]). This is the timestamp of the end of the last valid minute. |
[1] Additional information on `parameter` (string), the weather element name. Forecast files may contain the following parameters:

['DailyPoP', 'DailyPoP1', 'DailyPoP5', 'DailyPoP10', 'DailyPoP15',
 'DailyPoP25', 'DailyPoP50', 'DailyPrecip', 'DailyPrecip10Pct',
 'DailyPrecip25Pct', 'DailyPrecip50Pct', 'DailyPrecip75Pct',
 'PoP', 'Precip', 'Precip10Pct', 'Precip25Pct', 'Precip50Pct',
 'MaxT', 'MinT', 'T', 'Td', 'WindGust', 'WindMag', 'WindDir', 'RH']

Observation files may contain the following parameters:

['MaxT', 'MinT', 'Precip', 'T', 'Td', 'RH',
 'WindDir', 'WindMag', 'MaxWindMag']
[2] Time since Epoch is the time in seconds since the Unix Epoch (1970-01-01T00:00:00+00:00).
[3] Different regions (state offices) may publish their forecasts at slightly different times, and this can also vary slightly by field type. For consistency, each forecast is given a nominal `base_time` which represents the time the forecast was issued. This dataset contains two forecast base times for each day, which correspond to forecasts issued in the morning and afternoon. The difference `valid_start - base_time` gives the lead time of a forecast, i.e. how far into the future the forecast reaches. Generally, forecast accuracy tends to be better at shorter lead times.
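Using the epoch values from the sample `DailyPoP` rows shown earlier, the lead time works out directly:

```python
# Epoch-second values taken from the sample Op_Official rows above.
base_time = 1514246400    # when the forecast was prepared
valid_start = 1514818800  # start of the period the forecast applies to

# Lead time: how far into the future the forecast reaches.
lead_time_hours = (valid_start - base_time) / 3600
print(lead_time_hours)  # 159.0  (about 6.6 days)
```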
[4] Some parameters with hourly periods, such as `T`, `Td`, `WindDir`, and `WindMag`, are considered to be instantaneous, meaning that their valid_start and valid_end are the same, and their value applies to the start of the hour. By contrast, non-instantaneous parameters have values that represent an aggregate over the time period between their valid_start and valid_end times.
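This distinction can be checked mechanically: instantaneous rows have identical `valid_start` and `valid_end`. A sketch with two hand-made rows (values hypothetical):

```python
import pandas as pd

# 'T' is instantaneous; 'Precip' aggregates over the hour.
df = pd.DataFrame({
    'parameter': ['T', 'Precip'],
    'valid_start': [1514764800, 1514764800],
    'valid_end': [1514764800, 1514768400],
})

# Instantaneous rows: valid_start == valid_end.
instantaneous = df[df['valid_start'] == df['valid_end']]
print(instantaneous['parameter'].tolist())  # ['T']
```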
StationData.csv

`spatial/StationData.csv` holds information about the automatic weather station (AWS) sites that are present in the forecast and observational data (they are identified by their `station_number` in these datasets).
The first few rows of StationData.csv:
WMO_NUM,station_number,station_name,LATITUDE,LONGITUDE,STN_HT,AVIATION_ID,REGION,GridPt Lat,GridPt Lon,MSAS elevation,Distance from GridPt,Roughness,Distance from coast,Category,forecast_district,sa_special
95214,1006,WYNDHAM AERO,-15.51,128.1503,3.8,YWYM,WA,-15.49,128.17,79.76,2.1,39.1,51,mountains2,WA_PW001,
94102,1007,TROUGHTON ISLAND,-13.7542,126.1485,6,YTTI,WA,-13.74,126.17,0.16,2,0.1,5,coast,WA_PW001,
94100,1019,KALUMBURU,-14.2964,126.6453,23,YKAL,WA,-14.28,126.63,17.4,2.3,31.4,10,coast,WA_PW001,
95101,1020,TRUSCOTT,-14.09,126.3867,51,YTST,WA,-14.07,126.38,31.76,1.5,14.3,9,coast,WA_PW001,
99201,2012,HALLS CREEK MO,-18.2292,127.6636,422,HCR,WA,-18.24,127.67,423.36,2.3,12.7,999,mountains2,WA_PW001,
field name | data type | explanation of content |
---|---|---|
WMO_NUM | integer | the World Meteorological Organisation (WMO) id of weather stations (unlikely to be relevant). |
station_number | integer | id of each weather station. |
station_name | string | a human-readable name for the station. Note that AWS is an acronym for "automatic weather station". |
LATITUDE | float | the latitude of the station (all negative since Australia is in the Southern Hemisphere!). |
LONGITUDE | float | the longitude of the station (all positive since Australia is in the Eastern Hemisphere!). |
STN_HT | float | the elevation in metres of the station above mean sea level. |
AVIATION_ID | string | a unique identifier for the aviation industry. Where a station is associated with an airport this is the ICAO code for that airport (e.g. SYDNEY AIRPORT AMO, station_number:66037, has AVIATION_ID: YSSY). |
REGION | string | the state or territory to which the station belongs (i.e. one of WA, NT, QLD, NSW, VIC, TAS/ANT, SA). Note that the stations in the ACT (Australian Capital Territory) belong to NSW. |
GridPt Lat | float | the latitude of the grid cell in which the station resides. |
GridPt Lon | float | the longitude of the grid cell in which the station resides. |
MSAS elevation | float | the elevation of the grid cell in which the station resides in the MSAS analysis - a gridded observational dataset. |
Distance from GridPt | float | the distance in kilometres of the station from the centre of the grid cell to which it belongs. |
Roughness | float | a parameter specifying the 'roughness' of terrain around the MSAS grid cell in which the station resides. Higher values indicate rougher terrain. |
Distance from coast | integer | distance in kilometres of the station from the nearest coast. Distances greater than 999km are clipped to 999. Negative values indicate offshore stations. |
Category | string | a classification of the terrain type of the station. |
forecast_district | string | id of the forecast district in which the station resides. |
sa_special | string | extra information for South Australian stations. |
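Since both data files and the station table share `station_number`, station metadata can be attached to forecast or observation rows with a simple join. A sketch using small hand-made frames standing in for the real files (values copied loosely from the samples above):

```python
import pandas as pd

# Stand-ins for StationData.csv and an observations file.
stations = pd.DataFrame({
    'station_number': [1006, 1007],
    'station_name': ['WYNDHAM AERO', 'TROUGHTON ISLAND'],
    'REGION': ['WA', 'WA'],
})
obs = pd.DataFrame({
    'station_number': [1006, 1006, 1007],
    'parameter': ['T', 'T', 'T'],
    'value': [28.2, 29.0, 27.5],
})

# Left join keeps every observation row and attaches station metadata.
merged = obs.merge(stations, on='station_number', how='left')
print(merged[['station_number', 'station_name', 'value']])
```

The same pattern works for filtering, e.g. keeping only stations in a given REGION before computing verification scores.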
YAML files are used for partitioning the stations into various groups, identified by their `station_number`. Three station grouping YAML files are included in the `spatial` folder:

  * `temperature_station_groups.yml`: station groups tailored for temperature verification, also likely to be relevant for dew point and relative humidity
  * `wind_station_groups.yml`: station groups tailored for wind verification
  * `station_groups.yml`: station groups useful for rainfall and any other verification

Each YAML file has the following general structure:
STATE: # Cross-region, NSW/ACT, QLD, VIC, etc...
  THEME: # Topographical, District, Special
    SUB_GROUPS_WITHIN_THEME: # e.g. Topographical has sub-groups Mountainous, Coastal, and Flat and Hilly
      [1111, 2222, ...] # the station numbers
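A group can be pulled out of such a file with PyYAML (assumed to be available; the fragment below mirrors the structure above with hypothetical group names and station numbers):

```python
import yaml  # PyYAML package

# A fragment with the STATE -> THEME -> sub-group structure described above.
text = """
NSW/ACT:
  Topographical:
    Coastal: [66037, 66212]
    Mountainous: [71032]
"""

groups = yaml.safe_load(text)
coastal = groups['NSW/ACT']['Topographical']['Coastal']
print(coastal)  # [66037, 66212]
```

The resulting station-number lists can be used directly to subset a DataFrame, e.g. `df[df['station_number'].isin(coastal)]`.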
The forecast data (in files `fcst/Op_Official_{date}.csv`) has the following parameters:
parameter | unit | frequency | description of forecast |
---|---|---|---|
`T` | degrees Celsius | 1H | the instantaneous air temperature. |
`MaxT` | degrees Celsius | 24H | the maximum temperature over the forecast period. |
`MinT` | degrees Celsius | 24H | the minimum temperature over the forecast period. |
`Td` | degrees Celsius | 1H | the instantaneous dew point temperature. |
`RH` | percentage | 1H | the instantaneous relative humidity. |
`PoP` | percentage | 3H | the chance of receiving at least 0.2mm rainfall (the detection limit of the rain gauges) over a 3-hour period. |
`DailyPoP` | percentage | 24H | the chance of receiving at least 0.2mm rainfall (the detection limit of the rain gauges) over a 24-hour period. |
`DailyPoP1` | percentage | 24H | the chance of receiving at least 1mm rainfall over a 24-hour period. |
`DailyPoP5` | percentage | 24H | the chance of receiving at least 5mm rainfall over a 24-hour period. |
`DailyPoP10` | percentage | 24H | the chance of receiving at least 10mm rainfall over a 24-hour period. |
`DailyPoP15` | percentage | 24H | the chance of receiving at least 15mm rainfall over a 24-hour period. |
`DailyPoP25` | percentage | 24H | the chance of receiving at least 25mm rainfall over a 24-hour period. |
`DailyPoP50` | percentage | 24H | the chance of receiving at least 50mm rainfall over a 24-hour period. |
`Precip` | mm | 3H | the expected (in a statistical sense) amount of rainfall over a 3-hour period. |
`Precip10Pct` | mm | 3H | there is a 10% chance of receiving at least this much rainfall over the 3-hour period. |
`Precip25Pct` | mm | 3H | there is a 25% chance of receiving at least this much rainfall over the 3-hour period. |
`Precip50Pct` | mm | 3H | there is a 50% chance of receiving at least this much rainfall over the 3-hour period. |
`DailyPrecip` | mm | 24H | the expected (in a statistical sense) amount of rainfall over a 15Z-aligned 24-hour period. |
`DailyPrecip10Pct` | mm | 24H | there is a 10% chance of receiving at least this much rainfall over the 24-hour period. |
`DailyPrecip25Pct` | mm | 24H | there is a 25% chance of receiving at least this much rainfall over the 24-hour period. |
`DailyPrecip50Pct` | mm | 24H | there is a 50% chance of receiving at least this much rainfall over the 24-hour period. |
`DailyPrecip75Pct` | mm | 24H | there is a 75% chance of receiving at least this much rainfall over the 24-hour period. |
`WindMag` | kt | 1H | the instantaneous 10-minute mean wind speed. |
`WindDir` | degrees | 1H | the instantaneous 10-minute mean wind direction. |
`WindGust` | kt | 1H | the instantaneous maximum wind gust. Note that no corresponding observation is provided with this dataset. |
The observational data (in files `obs/aws_hourly_{date}.csv`) has the following parameters:
parameter | unit | description |
---|---|---|
`T` | degrees Celsius | the instantaneous air temperature. |
`MaxT` | degrees Celsius | the maximum recorded air temperature over a one-hour period. |
`MinT` | degrees Celsius | the minimum recorded air temperature over a one-hour period. |
`Td` | degrees Celsius | the instantaneous dew point temperature. |
`RH` | percentage | the instantaneous relative humidity. |
`WindMag` | kt | the instantaneous 10-minute mean wind speed. |
`WindDir` | degrees | the instantaneous 10-minute mean wind direction. |
`MaxWindMag` | kt | the maximum 10-minute mean wind speed over a one-hour period. |
`Precip` | mm | the amount of rainfall recorded over a one-hour period, in increments of 0.2mm. |
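Note that hourly `Precip` observations do not line up directly with the 15Z-aligned `DailyPrecip` forecasts; the hourly totals first need to be summed into 15Z-aligned 24-hour blocks. A minimal sketch with hand-made data (times and values hypothetical):

```python
import pandas as pd

# Hypothetical hourly Precip totals straddling a 15Z day boundary.
precip = pd.DataFrame({
    'valid_start': pd.to_datetime(
        ['2018-01-01T14:00', '2018-01-01T15:00', '2018-01-01T16:00']),
    'value': [1.0, 0.2, 0.4],
})

# Sum into 24-hour bins whose edges fall at 15:00 UTC.
daily = precip.resample('24h', on='valid_start', offset='15h')['value'].sum()
print(daily)
```

The 14:00 total falls into the block ending 2018-01-01T15:00, while the 15:00 and 16:00 totals start the next 15Z-aligned block.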
The following are some useful functions in Python for loading and working with the data.
import pandas as pd
import xarray as xr
def epoch_seconds_to_timestamp(inp):
"""
Converts an integer (seconds since UNIX epoch) or list of integers
into a Pandas Timestamp object.
Example:
>>> epoch_seconds_to_timestamp(1461945600)
Timestamp('2016-04-29 16:00:00')
>>> epoch_seconds_to_timestamp([1461945600, 1461945660, 1461945720])
DatetimeIndex(['2016-04-29 16:00:00', '2016-04-29 16:01:00',
'2016-04-29 16:02:00'],
dtype='datetime64[ns]', freq=None)
"""
return pd.to_datetime(inp, unit='s')
def pd_read_fcst_csv(file_path):
"""
Reads a forecast CSV file and returns a Pandas DataFrame with the
appropriate type conversions.
Examples:
>>> pd_read_fcst_csv('fcst/Op_Official_20150501.csv')
"""
df = pd.read_csv(file_path)
for field in ['valid_start', 'valid_end', 'base_time']:
df[field] = epoch_seconds_to_timestamp(df[field])
df['station_number'] = df['station_number'].astype('int')
return df
def pd_read_obs_csv(file_path):
"""
Reads an observations CSV file and returns a Pandas DataFrame with the
appropriate type conversions.
Example:
>>> pd_read_obs_csv('obs/aws_hourly_20150501.csv')
"""
df = pd.read_csv(file_path)
for field in ['valid_start', 'valid_end', 'qc_valid_start', 'qc_valid_end']:
df[field] = epoch_seconds_to_timestamp(df[field])
df['station_number'] = df['station_number'].astype('int')
return df
def dataframe_param_to_xarray(dataframe, param, indices):
"""
Filters the `dataframe` using its "parameter" column to contain only
the supplied `param`, then creates an xarray DataArray object using
the supplied `indices`.
Example:
>>> df = pd.DataFrame(
[[0, 'max', 5],
[0, 'min', 3],
[1, 'max', 10],
[1, 'min', 0]],
columns=['x', 'parameter', 'value'])
>>> dataframe_param_to_xarray(df, 'min', ['x'])
<xarray.DataArray 'min' (x: 2)>
array([3, 0])
Coordinates:
* x (x) int64 0 1
"""
# filter by parameter
dataframe = dataframe[dataframe['parameter'] == param]
# keep only the indices and value columns
selection = dataframe[indices + ['value']]
# drop any spurious duplicated data
# (the same indices and value, i.e. some data was saved twice?)
selection = selection.drop_duplicates(subset=indices)
# drop any spurious NaN-data
# (indices contain NaN, incomplete data?)
selection = selection.dropna(subset=indices)
# set the dataframe index
selection = selection.set_index(indices)
# obtain an xarray.DataArray from the 'value' column, using the indices
# as coordinate values
data_array = xr.DataArray.from_series(selection["value"])
data_array.name = param
return data_array
def fcst_param_to_xarray(dataframe, param, indices=None):
"""
Filters a forecast dataframe to obtain an xarray DataArray for the specified
forecast parameter.
Indices defaults to ['station_number', 'base_time', 'valid_start'], but an
alternative set of indices can be supplied.
Example:
>>> df = pd_read_fcst_csv('fcst/Op_Official_20150501.csv')
>>> fcst_param_to_xarray(df, 'T')
"""
if indices is None:
indices = ['station_number', 'base_time', 'valid_start']
return dataframe_param_to_xarray(dataframe, param, indices=indices)
def obs_param_to_xarray(dataframe, param, indices=None):
"""
Filters an observations dataframe to obtain an xarray DataArray for the specified
observations parameter.
Indices defaults to ['station_number', 'valid_start'], but an alternative
set of indices can be supplied.
Example:
>>> df = pd_read_obs_csv('obs/aws_hourly_20150501.csv')
>>> obs_param_to_xarray(df, 'T')
"""
if indices is None:
indices = ['station_number', 'valid_start']
return dataframe_param_to_xarray(dataframe, param, indices=indices)
# load one week of forecasts from 2018-01-01 to 2018-01-07
# this may take a little while
fcst_dfs = [
pd_read_fcst_csv('fcst/Op_Official_20180101.csv'),
pd_read_fcst_csv('fcst/Op_Official_20180102.csv'),
pd_read_fcst_csv('fcst/Op_Official_20180103.csv'),
pd_read_fcst_csv('fcst/Op_Official_20180104.csv'),
pd_read_fcst_csv('fcst/Op_Official_20180105.csv'),
pd_read_fcst_csv('fcst/Op_Official_20180106.csv'),
pd_read_fcst_csv('fcst/Op_Official_20180107.csv'),
]
fcst_df = pd.concat(fcst_dfs)
# extract the hourly-temperature forecast into an xarray DataArray
fcst = fcst_param_to_xarray(fcst_df, 'T')
# load observations from the same period to compare against the forecasts
obs_dfs = [
pd_read_obs_csv('obs/aws_hourly_20180101.csv'),
pd_read_obs_csv('obs/aws_hourly_20180102.csv'),
pd_read_obs_csv('obs/aws_hourly_20180103.csv'),
pd_read_obs_csv('obs/aws_hourly_20180104.csv'),
pd_read_obs_csv('obs/aws_hourly_20180105.csv'),
pd_read_obs_csv('obs/aws_hourly_20180106.csv'),
pd_read_obs_csv('obs/aws_hourly_20180107.csv'),
]
obs_df = pd.concat(obs_dfs)
# extract the hourly-temperature observations into an xarray DataArray
obs = obs_param_to_xarray(obs_df, 'T')
# calculate squared errors
se = pow(fcst - obs, 2.0)
# calculate root-mean-squared errors averaged over all stations
rmse = pow(se.mean(dim=['station_number']), 0.5)
# how many valid (not NaN) points are there?
print('Data points: ', int(se.count()))
# how many stations were matched?
print('No. of stations: ', len(se.coords['station_number']))
# what forecast base-times do we have for this period?
# These times indicate when each forecast was prepared.
fcst.base_time
# Each valid_start time may have forecast values issued up to 7 days in advance
# For example 2018-01-01 has forecast values issued from 2017-12-25
# to 2018-01-01.
fcst.sel(valid_start=pd.to_datetime('2018-01-01T00:00'))\
.dropna(dim='base_time')\
.base_time
# Note 1: some base times may be missing due to irregular archiving of
# forecast grids, which has improved since 2015.
# Note 2: Occasionally you may find forecasts that were issued more than 7
# days in advance. These should not be considered to be reliable data.
# select one forecast base-time and plot the average root-mean-squared
# error for the validity times included in that forecast
# The error increases as we forecast farther into the future
from matplotlib import pyplot as plt
%matplotlib inline
rmse.sel(base_time=pd.to_datetime('2018-01-01T00:00')).plot()
plt.ylabel("RMSE (degrees C)")
plt.title(
"RMSE of forecasts issued at 2018-01-01T00:00 (base-time)\n"
"averaged over all stations"
)