
Whole-of-Australian Government Web Crawl

This dataset includes publicly accessible, human-readable material from Australian government websites. The crawl used the Australian Government Organisations Register (AGOR) as its seed list and domain categorisation source, obeyed robots.txt and sitemap.xml directives, and was gathered over a 10-day period.
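As an illustration of what obeying robots.txt directives can look like in practice, here is a minimal sketch using Python's standard `urllib.robotparser` module. It is not the crawler used to build this dataset. The robots.txt content, the `example-crawler` user agent and the `www.example.gov.au` URLs are all hypothetical.

```python
from typing import List
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://www.example.gov.au/sitemap.xml
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, url: str,
              user_agent: str = "example-crawler") -> bool:
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def listed_sitemaps(parser: RobotFileParser) -> List[str]:
    """Return any Sitemap: URLs declared in robots.txt (Python 3.8+)."""
    return parser.site_maps() or []


parser = build_parser(EXAMPLE_ROBOTS_TXT)
```

With this sketch, a crawler would call `may_crawl(parser, url)` before each request and could use `listed_sitemaps(parser)` to discover further URLs.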

Several non-*.gov.au domains are included in the AGOR; these have been crawled up to a limit of 10,000 URLs.

Several binary file formats are included and have been converted to HTML: doc, docm, docx, dot, epub, keys, numbers, pages, pdf, ppt, pptm, pptx, rtf, xls, xlsm, xlsx.

URLs returning responses larger than 10MB are not included in the dataset.
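The format list and size limit above can be expressed as two simple checks, which may be useful when reasoning about why a given URL is or is not present in the dataset. This is a sketch only, not the crawler's actual logic: the function names are hypothetical, and interpreting 10MB as 10 × 1024 × 1024 bytes is an assumption, since the source does not say whether decimal or binary megabytes are meant.

```python
import posixpath
from urllib.parse import urlparse

# Extensions listed in the dataset description as converted to HTML.
CONVERTED_EXTENSIONS = frozenset({
    "doc", "docm", "docx", "dot", "epub", "keys", "numbers", "pages",
    "pdf", "ppt", "pptm", "pptx", "rtf", "xls", "xlsm", "xlsx",
})

# Assumption: 10MB is read as 10 * 1024 * 1024 bytes.
MAX_RESPONSE_BYTES = 10 * 1024 * 1024


def extension_of(url: str) -> str:
    """Return the lower-cased file extension of the URL path, without the dot."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lstrip(".").lower()


def is_converted_format(url: str) -> bool:
    """True if the URL's extension is one of the listed binary formats."""
    return extension_of(url) in CONVERTED_EXTENSIONS


def within_size_limit(response_bytes: int) -> bool:
    """True if a response of this size is not larger than the assumed limit."""
    return response_bytes <= MAX_RESPONSE_BYTES
```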

Data is published in the Web Archive (WARC) format, as both a single multi-gigabyte WARC file and a split series.
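For readers unfamiliar with WARC, the following is a minimal sketch of iterating over the records in a WARC stream using only the Python standard library. It assumes each record has a `WARC/` version line, header lines, a blank line, and a payload of `Content-Length` bytes. It is not a full implementation of the format, and a dedicated WARC library is likely a better choice for real work. The function names and the sample record are hypothetical.

```python
import gzip
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def open_warc(path: str) -> BinaryIO:
    """Open a WARC file, decompressing on the fly if the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record; header names are lower-cased."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separating records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line[:40])
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        payload = stream.read(int(headers["content-length"]))
        yield headers, payload


def target_uris(stream: BinaryIO) -> Iterator[str]:
    """Yield the target URI of every 'response' record in the stream."""
    for headers, _payload in iter_warc_records(stream):
        if headers.get("warc-type") == "response":
            yield headers.get("warc-target-uri", "")


# Hypothetical in-memory record, so the sketch runs without a real file.
SAMPLE_PAYLOAD = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
SAMPLE_WARC = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: https://www.example.gov.au/\r\n",
    b"Content-Length: " + str(len(SAMPLE_PAYLOAD)).encode("ascii") + b"\r\n",
    b"\r\n",
    SAMPLE_PAYLOAD,
    b"\r\n\r\n",
])

sample_uris = list(target_uris(io.BytesIO(SAMPLE_WARC)))
```

To read a downloaded file instead of the in-memory sample, pass the result of `open_warc(path)` to `target_uris`. Because records are read one at a time, the single multi-gigabyte file does not need to fit in memory.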

Licence

Web content contained within these WARC files was originally authored by the agency hosting the referenced material. Authoring agencies are responsible for the choice of licence attached to the original material.

A consistent licence across the entirety of the WARC files' contents should not be assumed. Agencies may have indicated copyright and licence information for a given URL as metadata embedded in a WARC file entry, but this should not be assumed to be present for all WARC entries.

Data and Resources

Additional Info

Field: Value
Title: Whole-of-Australian Government Web Crawl
Type: Dataset
Language: English
Licence: Other
Data Status: active
Update Frequency: quarterly
Landing Page: https://data.gov.au/dataset/99f43557-1d3d-40e7-bc0c-665a4275d625
Date Published: 2018-08-06
Date Updated: 2018-12-11
Contact Point: Digital Transformation Agency, 0427 136 791, data@digital.gov.au
Temporal Coverage: 2018-08-03 to 2018-11-20
Geospatial Coverage: Australia
Jurisdiction: Commonwealth of Australia
Data Portal: data.gov.au
Publisher/Agency: Digital Transformation Agency
Fields of Research: Communications and Media Policy