As an introduction, I'd like to acknowledge the quality of this paper - it was really a good read and I can only encourage the team behind the work to continue in this direction.
Please find below a short summary of my comments/suggestions:
Suggestions on the 2 main issues of findability and practical usability:
- findability: maintain a national metadata catalog with a balanced number of attributes for each dataset
- practical usability: make datasets available through stable APIs
Other points I wanted to convey in this email:
- re: draft principle 1: only interesting datasets need to be released (let the community choose)
- re: draft principle 3: enhance reusability and interoperability by using information management standards (data normalisation, standard spatial data reference systems, ISO codes, character encoding...)
- re: draft principle 8: APIs can decrease the time and effort needed to release a dataset (they simplify interactions with community members interested in reuse)
- re: draft principle 8: costly application implementations (websites, iPhone apps, widgets, ...) should not be part of government agencies' spending (providing access via APIs is much more cost effective)
- re: draft principle 10: custodian agencies should provide an open, collaborative mechanism for the community to feed inaccuracies back to them (this is a way to add value to PSI and to engage community members)
Below are my comments on 5 topics that I wanted to discuss in a bit more detail.
- findability and metadata: dataset findability can be enhanced by a metadata record associated with each dataset. Metadata endeavours are notorious for being time-consuming and tedious, but a significant effort of a data publication framework should be to come up with a useful set of metadata that really enhances the discoverability of datasets. Traditional metadata templates are just too rich and too complex. I'd see a minimal subset of metadata properties such as custodian, publication date, data currency and accuracy, spatial extent (if relevant), spatial reference system and a searchable text description as an initial basis for a metadata repository. Community feedback could be used to choose which properties to make available / searchable. There is probably a need for additional academic research to investigate which compromise (enhanced discoverability vs. metadata maintenance overhead) maximises benefits for both the custodian agencies and the dataset seekers. This is one aspect that can be part of the coordinated effort across all jurisdictions: a dataset would have to expose its metadata in a standard templated form to be considered "released at Australian Government standards", and this service catalog would be managed centrally by OAIC (it could be the back-end of the public website data.gov.au).
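To make the minimal metadata subset concrete, here is a rough sketch of what one catalog record and a text search over it could look like. The class name, field names and the search function are my own assumptions for illustration, not an existing standard or an OAIC template.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata record; all field names are illustrative."""
    title: str
    custodian: str
    publication_date: date
    data_currency: date          # date at which the data was last known to be current
    accuracy_note: str           # free-text statement about accuracy
    description: str             # searchable text description
    spatial_reference_system: Optional[str] = None  # only for spatial datasets
    # (min_x, min_y, max_x, max_y) in the reference system above, if relevant
    spatial_extent: Optional[Tuple[float, float, float, float]] = None


def search(catalog: List[DatasetRecord], term: str) -> List[DatasetRecord]:
    """Case-insensitive substring match on title and description."""
    needle = term.lower()
    return [
        record for record in catalog
        if needle in record.title.lower() or needle in record.description.lower()
    ]
```

A central catalog would hold one such record per released dataset, and community feedback could decide which fields are worth indexing for search.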
- suitability of data for publication: datasets should be published in the rawest form possible (i.e. without post-processing that could introduce errors) and as soon as possible (i.e. as close to real time as possible). Obviously no dataset is 100% correct / accurate, so custodian agencies should provide a mechanism for community users to feed inaccuracies back to them. This is an open mechanism in which government gains additional transparency by publishing the raw data, and benefits from community feedback that will increase the accuracy (hence the value) of the dataset over time - a virtuous circle of continuous improvement.
- format of publication: published information should make no assumption about how the datasets are to be used. Instead of releasing a "National Public Toilet Map", the custodian agency should release a service that can be queried in a standard fashion (JSON, XML, RDF) and implemented (mashed up) on any platform. It is about stepping back from the costly implementation (on multiple platforms for multiple target uses) and focusing solely on providing re-usable data access. It would be called the "National Public Toilet dataset", which can subsequently be implemented as a map, as a list of the 10 closest facilities, etc. The implementation is always a costly endeavour and should be left to the industries and individuals whose core business it is - government agencies should step away from these projects and focus on managing the data efficiently internally and providing an access point (an API).
To use a simplistic metaphor, it's about leading the horse to the water - and if the water is any good, the horse will drink :-) If the dataset has any value for the community, someone will make use of it. More seriously, this is a responsible, cost-effective approach because it takes the guesswork out of deciding which datasets are going to be of any use (and when).
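As a rough illustration of "data access without a built-in presentation", the sketch below serves the same rows as JSON or XML and leaves the map or list to whoever re-uses them. The sample rows, element names and function names are made up for the example; RDF is left out only to keep the sketch within the Python standard library.

```python
import json
import xml.etree.ElementTree as ET

# Made-up sample rows standing in for the "National Public Toilet dataset".
FACILITIES = [
    {"id": 1, "name": "Example Park", "lat": -37.81, "lon": 144.96},
    {"id": 2, "name": "Sample Station", "lat": -33.87, "lon": 151.21},
]


def to_json(rows):
    """Serialise the rows as a JSON array of objects."""
    return json.dumps(rows, ensure_ascii=False)


def to_xml(rows):
    """Serialise the rows as <facilities><facility>...</facility></facilities>."""
    root = ET.Element("facilities")
    for row in rows:
        item = ET.SubElement(root, "facility")
        for key, value in row.items():
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def render(rows, fmt):
    """Return the rows in the requested format; the data itself never changes."""
    serialisers = {"json": to_json, "xml": to_xml}
    if fmt not in serialisers:
        raise ValueError(f"unsupported format: {fmt}")
    return serialisers[fmt](rows)
```

The agency maintains one set of rows; whether they end up on a map, in a phone app or in a spreadsheet is decided by the re-user.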
- publication location stability: making data available for download in a file is a step, but it is not the option preferred by community stakeholders interested in reusing the data programmatically. A preferred option is an access point (an API) that remains stable over time and that delivers up-to-date information. It saves re-users from having to download files from sometimes complex websites that are regularly re-organised. Another metaphor: it's like having to find a can of baked beans in a supermarket where aisles are re-organised frequently :-) You'd better order the can online, so that you don't have to know where the cans are physically stored. By having an API to access the data, re-users can plug into stable connection points and enable up-to-date, higher value information delivery to the community. At the government agency level, we are talking about a read-only database (duplicated from the one maintained by the agency) exposed to the web by a simple querying service, at a stable address - the dataset does not need to be in a file on the website.
My main point here is that only 4 datasets on data.gov.au have been updated in the past year, which probably reflects the overhead of preparing and uploading the data for most agencies - or maybe the fact that most agencies are not interested in the task of publishing. By making it a self-service plugged directly into an agency database, publishing costs can be driven down dramatically (both financial and human resources).
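The read-only duplicate behind a simple querying service could be sketched as follows. This is only an illustration under my own assumptions: SQLite stands in for whatever database an agency actually runs, and the table, column names and rows are invented. A real service would put a stable web address in front of the query function.

```python
import json
import sqlite3


def build_demo_replica():
    """Stand-in for the duplicated agency database (in-memory, made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE facility (id INTEGER PRIMARY KEY, name TEXT, suburb TEXT)"
    )
    conn.executemany(
        "INSERT INTO facility (id, name, suburb) VALUES (?, ?, ?)",
        [
            (1, "Example Park", "Carlton"),
            (2, "Sample Station", "Carlton"),
            (3, "Demo Reserve", "Fitzroy"),
        ],
    )
    conn.commit()
    # From here on this connection refuses any statement that changes data.
    conn.execute("PRAGMA query_only = ON")
    return conn


def query_by_suburb(conn, suburb):
    """Return matching rows as JSON; the parameter is bound, never interpolated."""
    rows = conn.execute(
        "SELECT id, name, suburb FROM facility WHERE suburb = ? ORDER BY id",
        (suburb,),
    ).fetchall()
    return json.dumps([{"id": r[0], "name": r[1], "suburb": r[2]} for r in rows])


def write_is_rejected(conn):
    """Try an INSERT through the published connection and report if it fails."""
    try:
        conn.execute("INSERT INTO facility (name, suburb) VALUES ('x', 'y')")
    except sqlite3.OperationalError:
        conn.rollback()
        return True
    return False
```

Because re-users only ever see the query function at its stable address, the agency can reorganise its website or refresh the replica without breaking anything built on top of it.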
- information management standards for true interoperability: to provide true interoperability of datasets with other applications (datasets-as-a-service), agencies should have consistent guidelines and implementations of data management best practices. This point comes from having experimented with VicRoads' CrashStats data system, only to realise that the spatial reference system used to locate the crash (the X and Y coordinates) was proprietary, i.e. designed and used only at VicRoads, which makes it quite difficult to reuse within other information systems. A related point is character encoding, which can prevent data from one agency from being used by another, for instance because it contains accented characters. Also, ISO country codes could be a government standard.
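The kind of guideline I have in mind could be sketched as a small normalisation step applied before release: decode text from its declared source encoding into Unicode, check country codes against ISO 3166-1, and publish coordinates together with the name of their reference system. The function names, the three-entry code table and the record layout are assumptions made for the example; converting coordinates between reference systems is out of scope here.

```python
import unicodedata

# Small excerpt of ISO 3166-1 alpha-2 codes, for illustration only.
ISO_3166_ALPHA2 = {"AU": "Australia", "NZ": "New Zealand", "FR": "France"}


def normalise_text(raw: bytes, source_encoding: str) -> str:
    """Decode bytes from their declared encoding and return NFC-normalised text."""
    return unicodedata.normalize("NFC", raw.decode(source_encoding))


def validate_country(code: str) -> str:
    """Return the upper-cased code, or raise if it is not in the excerpt above."""
    cleaned = code.strip().upper()
    if cleaned not in ISO_3166_ALPHA2:
        raise ValueError(f"unknown ISO 3166-1 alpha-2 code: {code!r}")
    return cleaned


def normalise_record(raw_name: bytes, source_encoding: str, country: str,
                     x: float, y: float, srs: str) -> dict:
    """Build a release-ready record; `srs` names the coordinate reference system."""
    return {
        "name": normalise_text(raw_name, source_encoding),
        "country": validate_country(country),
        "x": x,
        "y": y,
        "srs": srs,  # e.g. an EPSG identifier rather than an in-house grid
    }
```

For example, `normalise_record(b"Caf\xe9", "latin-1", "au", 144.96, -37.81, "EPSG:4326")` yields a record whose name can be written out as UTF-8 and whose coordinates any other system can interpret.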
I am available for further discussion if necessary.
All the best to this quality initiative,
0433 800 629