Explainable AI and Other Questions Where Provenance Matters

Lindsay Frost
January 10, 2019


On the night of 18th March 2018 a woman walking across a road in Tempe, Arizona, was struck and killed by an autonomous vehicle [1]. On 11th December 2018 Google CEO Sundar Pichai [2] faced questions before the US Congress during a 3-hour public hearing [3] about alleged political bias in filtering of news. In Europe, it is almost certain that in the next few years some major company will be in court facing GDPR fines of up to 4% of annual global revenues [4] if judged culpable in using personal data outside the agreed context [5]. So be warned: if you are an owner or operator of a decision-making software platform, unclear provenance in decision-making and/or context puts you at risk.

This article briefly explains why you (as a data scientist or a CTO or indeed as a citizen) need to worry about data provenance and metadata and then takes a look at the various kinds of tools available to help reduce the risks and costs.

Firstly, unclear sourcing of data and consequent inaccuracy or misunderstandings already cost billions of dollars … and also lives.

The cost of correcting or “cleaning up” data which has incorrect or misinterpreted provenance, before including it in data warehouses, data lakes, CRM systems, etc., is huge. A blogged guesstimate for the USA in 2011 was $3.1 trillion per year [6] (a catchy number uncritically used without attribution by IBM [7], taken up by Harvard Business Review [8] and mentioned by dozens of others – making it itself an example of poor recording of provenance). Nonetheless, errors and duplicates in e.g. company customer records (CRM) obviously do waste millions through erroneous billings and cause a multiplicity of unsolicited credit-card mailings every year, which you – dear customer – pay for one way or another.

Recent publications guesstimate that simply correcting typos, formatting and misinterpretations, so-called Data Wrangling, continues to consume “half the time of data scientists” [9]. This is sufficient reason to spawn an industry for data clean-up [10] and now also an industry for 'Self-Service Data Preparation' [11], i.e. cloud services that help data producers to improve the initial labelling (provisioning) of their information with metadata.

But lives are also at risk, for example when hospital records are entered incorrectly and nothing and no-one checks (or has time to check) the context. The quality (completeness, correctness, concordance, plausibility, and currency) of Electronic Health Records (EHRs) is “often not consistent with research standards” [12]. Pregnancy tests ordered for men? … happens all the time [13]; male patient records with the checkbox “cervical cancer diagnosed” marked? … routinely found! In an analysis [14] of the English National Health Service, the annual 2012 hospital statistics showed approximately 20,000 adults attending paediatric outpatient services, approximately 17,000 males admitted to obstetrical inpatient services, and about 8,000 males admitted to gynaecology inpatient services. Many projects in genome research rely on correlating EHRs with DNA nucleotide sequences to infer drug efficacy and diagnostics, so not all mistakes are merely amusing. A concerted effort to detect such errors is underway [12].
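To make the idea of a plausibility check concrete, here is a minimal sketch in Python. It is not taken from any of the cited projects: the record fields, the service names and the age threshold are all hypothetical assumptions chosen to mirror the NHS examples above.

```python
from dataclasses import dataclass

# Hypothetical rule set: the service names and the age threshold are
# illustrative assumptions, not taken from any real hospital system.
FEMALE_ONLY_SERVICES = {"obstetrics", "gynaecology"}
PAEDIATRIC_SERVICES = {"paediatrics"}
ADULT_AGE = 18


@dataclass
class Episode:
    patient_id: str
    sex: str        # "M" or "F" as recorded
    age: int        # age in years at attendance
    service: str    # service the patient was admitted to


def implausible(episode: Episode) -> list[str]:
    """Return the reasons (if any) why an episode looks implausible."""
    reasons = []
    if episode.sex == "M" and episode.service in FEMALE_ONLY_SERVICES:
        reasons.append(f"male patient recorded in {episode.service}")
    if episode.age >= ADULT_AGE and episode.service in PAEDIATRIC_SERVICES:
        reasons.append(f"adult (age {episode.age}) recorded in {episode.service}")
    return reasons


# Made-up sample records, for illustration only.
episodes = [
    Episode("p1", "M", 34, "obstetrics"),
    Episode("p2", "F", 29, "obstetrics"),
    Episode("p3", "F", 52, "paediatrics"),
]

flagged = {e.patient_id: implausible(e) for e in episodes if implausible(e)}
for patient_id, reasons in flagged.items():
    print(patient_id, "->", "; ".join(reasons))
```

A flag of this kind is only a prompt for a human to look at the context. It does not prove that the record is wrong, and it says nothing about which field (sex, age or service) was mis-entered.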

Other errors arise not in the data but in its interpretation or prioritization, e.g. in the machine learning algorithms which are used to define such things as ranking for job applications, eligibility for personal credit, admissibility for a business visa, decisions during automated-vehicle driving, allocation of hospital emergency response resources during peak periods, traffic planning to reduce air pollution near kindergartens and aged-care centres, etc.

For example, do you feel comfortable knowing that there exists job-applicant screening software, for screening future staff in contact with young children, for which the authors claim that over 19,000 test cases have shown that it “correctly identified 77% of the men and over 72% of the women who posed a sexual risk” [15]? Whatever the methodology, I would like to know how many people were screened out by being “incorrectly identified” and what biases might be inherent to the system?
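The question of how many people are “incorrectly identified” cannot be answered from a detection rate alone. The sketch below shows the arithmetic. Only the 77% sensitivity comes from the quoted claim; the number of applicants, the prevalence and the specificity are purely illustrative assumptions and are not taken from [15].

```python
def screening_outcomes(applicants: int, prevalence: float,
                       sensitivity: float, specificity: float) -> dict[str, float]:
    """Split a screened population into true/false positives and negatives."""
    at_risk = applicants * prevalence
    not_at_risk = applicants - at_risk
    true_pos = at_risk * sensitivity                # correctly flagged
    false_neg = at_risk - true_pos                  # missed
    false_pos = not_at_risk * (1.0 - specificity)   # wrongly flagged
    flagged = true_pos + false_pos
    return {
        "true_positives": true_pos,
        "false_negatives": false_neg,
        "false_positives": false_pos,
        # Share of flagged applicants who really pose a risk.
        "precision": true_pos / flagged if flagged else 0.0,
    }


result = screening_outcomes(
    applicants=10_000,   # assumption
    prevalence=0.01,     # assumption: 1% of applicants pose a risk
    sensitivity=0.77,    # the figure quoted for men in the text above
    specificity=0.90,    # assumption: 90% of harmless applicants pass
)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed numbers roughly 990 harmless applicants would be flagged alongside 77 correctly identified ones, so more than nine out of ten flagged people would be flagged wrongly. Different assumptions give different numbers, which is why the missing figures matter.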

The various forms of provenance accountability, and methodologies available, can be broadly summarized in the table below (with some recent references as examples):

‘data provenance accountability’ concerns issues of correct recording of the source(s) of information and such meta-data (context) as the timing, location, procedural history of derived information, declared accuracy, declared producing entity and so on (all of which may need to be collated into a cumulative ‘history’ when the data is processed/aggregated)
  • Ontology Management [16]
  • Provenance Ontology [17]
  • Self-Service Data Preparation [9]
‘data flow accountability’ concerns issues of ensuring that the data is permitted (licensed) to flow through a series of correctly identified processes/systems (this is particularly important for privacy regulations in Europe and elsewhere, which require that personal data is only used for the pre-agreed purpose [4] and that there is a ‘right to erasure’ [19] and a ‘right to object to further processing of personal data' [20])
  • Provenance-aware Software Coding [18]
‘algorithmic accountability’ concerns issues of fairness, transparency, and explainability of decision-making (or filtering or rating) software, particularly regarding machine learning
  • Explainable AI [21]
‘legal non-repudiability’ for some or all of the above information may be required when legal liability is asserted, requiring that the accuracy of appropriate records is trusted by all parties (an area of application for e.g. blockchain distributed ledger technologies)
  • Distributed Ledgers for provenance [22]
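As a small illustration of the last row, the sketch below builds a hash-chained provenance log using only the Python standard library. It is not a distributed ledger and it is not the design proposed in [22]: a hash chain is just the tamper-evidence building block, and replication and consensus between parties are left out. All record fields and names are hypothetical.

```python
import hashlib
import json


def record_hash(body: dict) -> str:
    """Hash a record deterministically (sorted keys, compact separators)."""
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(log: list[dict], source: str, action: str, timestamp: str) -> None:
    """Append a provenance record that commits to the previous record's hash."""
    body = {
        "source": source,
        "action": action,
        "timestamp": timestamp,
        "prev_hash": log[-1]["hash"] if log else None,
    }
    log.append({**body, "hash": record_hash(body)})


def verify(log: list[dict]) -> bool:
    """Check every record's own hash and its link to the predecessor."""
    prev_hash = None
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if body["prev_hash"] != prev_hash or record_hash(body) != record["hash"]:
            return False
        prev_hash = record["hash"]
    return True


log: list[dict] = []
append_record(log, "sensor-17", "measured", "2019-01-08T10:00:00Z")
append_record(log, "etl-job-3", "aggregated", "2019-01-08T10:05:00Z")
print(verify(log))               # True for the untouched log

log[0]["source"] = "sensor-99"   # tamper with history
print(verify(log))               # False: the stored hash no longer matches
```

A single party could still rewrite the whole chain and recompute every hash, which is the gap that sharing the records across mutually distrusting parties is meant to close.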

Now, imagine you have all of the above sufficiently covered within your domain of interest ... how do you share the provenance and context information with another system? Since around 2010 the W3C has developed a body of work (PROV) for modelling and transferring provenance information [17], which according to Google has been referenced over half a million times, not counting references internal to W3C. Various protocol bindings are available.
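To give a feel for what a PROV record contains, here is a minimal sketch of a provenance document as a plain Python dictionary. It loosely follows the layout of the PROV-JSON serialization (entities, activities, agents, and the relations between them). The identifiers under the `ex:` prefix are hypothetical, and the helper function `sources_of` is my own illustration rather than part of any PROV library.

```python
import json

# All "ex:" identifiers are hypothetical examples.
prov_document = {
    "prefix": {"ex": "http://example.org/"},
    "entity": {
        "ex:raw-readings": {},
        "ex:daily-average": {},
    },
    "activity": {
        "ex:aggregation-run": {
            "prov:startTime": "2019-01-08T10:00:00Z",
            "prov:endTime": "2019-01-08T10:05:00Z",
        }
    },
    "agent": {"ex:etl-service": {}},
    "used": {
        "_:u1": {"prov:activity": "ex:aggregation-run",
                 "prov:entity": "ex:raw-readings"}
    },
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:daily-average",
                 "prov:activity": "ex:aggregation-run"}
    },
    "wasAssociatedWith": {
        "_:a1": {"prov:activity": "ex:aggregation-run",
                 "prov:agent": "ex:etl-service"}
    },
    "wasDerivedFrom": {
        "_:d1": {"prov:generatedEntity": "ex:daily-average",
                 "prov:usedEntity": "ex:raw-readings"}
    },
}


def sources_of(document: dict, entity_id: str) -> set[str]:
    """Follow wasDerivedFrom links back to every upstream entity.

    Assumes the derivation graph is acyclic, as it should be.
    """
    direct = {
        rel["prov:usedEntity"]
        for rel in document.get("wasDerivedFrom", {}).values()
        if rel["prov:generatedEntity"] == entity_id
    }
    found = set(direct)
    for source in direct:
        found |= sources_of(document, source)
    return found


print(json.dumps(prov_document, indent=2))
print(sources_of(prov_document, "ex:daily-average"))
```

The point of such a record is that a receiving system can answer “where did this value come from, and who produced it?” without access to the producing system's internals.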

Meanwhile, in the Internet of Things and Smart City areas of application, work is under way within ETSI to standardise an API and a protocol called NGSI-LD [23], which is designed especially to encourage ontology management, context information management and ultimately data flow accountability across systems, as illustrated in the figure below.

Figure 1: Cross-platform exchange of context and provenance information [23].
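As a rough illustration of the kind of context information exchanged in Figure 1, the sketch below shows an NGSI-LD style entity as a Python dictionary, with provenance attached to an individual property. The entity, its identifiers and the second context URL are hypothetical. The member names (`observedAt`, `Property`, `Relationship`, `object`) and the core context URL reflect my reading of the NGSI-LD information model and should be checked against the specification [23] before use.

```python
import json

entity = {
    "id": "urn:ngsi-ld:AirQualityObserved:kindergarten-01",   # hypothetical
    "type": "AirQualityObserved",
    "NO2": {
        "type": "Property",
        "value": 22,
        # When the value was observed, carried with the value itself.
        "observedAt": "2019-01-08T10:00:00Z",
        # Who provided it, expressed as a link to another entity.
        "providedBy": {
            "type": "Relationship",
            "object": "urn:ngsi-ld:Organization:city-lab",     # hypothetical
        },
    },
    "@context": [
        # Assumed location of the core context; verify against [23].
        "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld",
        # Hypothetical domain-specific context.
        "https://example.org/contexts/air-quality.jsonld",
    ],
}


def attribute_provenance(entity: dict) -> dict[str, dict]:
    """Collect observation time and provider for each Property of an entity."""
    summary = {}
    for name, attr in entity.items():
        if isinstance(attr, dict) and attr.get("type") == "Property":
            provider = attr.get("providedBy", {})
            summary[name] = {
                "observedAt": attr.get("observedAt"),
                "providedBy": provider.get("object"),
            }
    return summary


print(json.dumps(attribute_provenance(entity), indent=2))
```

Because the timing and the provider travel with each value, a consuming platform does not have to guess the context from the surrounding system.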



We are living in the age of Digital Transformation of business and society. The European Union is spending billions attempting to guide and facilitate a “soft landing” into a society which is fair, efficient and empowering of citizens. On the other hand, we are also living in the age of “fake news”.

Proving how you know what you know is becoming mission-critical.


  1. https://www.nytimes.com/interactive/2018/03/20/us/self-driving-uber-pedestrian-killed.html  Published 21st March 2018. See also a 700 page overview sponsored by Daimler and Benz Stiftung: Maurer, Markus, J. Christian Gerdes, Barbara Lenz, and Hermann Winner (eds.), “Autonomous driving: Technical, Legal and Social Aspects”. Published by Springer, Berlin, 2016. Accessed on 8th January 2019 at http://www.oapen.org/download?type=document&docid=1002194#page=81
  2. https://en.wikipedia.org/wiki/Sundar_Pichai
  3. See https://www.youtube.com/watch?v=Ul5fMAG2tk4 (at timemarker 52 minutes)
  4. GDPR General Data Protection Regulation (EU) 2016/679 of 27 April 2016. Correg. 23rd May 2018. Accessed 8th January 2019 at https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:02016R0679-20160504&from=EN
  5. GDPR Art 5(1)(b).
  6. Tibbetts, Hollis. “$3 Trillion Problem: Three Best Practices for Today's Dirty Data Pandemic”. Published online 10th September 2011 at http://hollistibbetts.sys-con.com/node/1975126
  7. https://www.ibmbigdatahub.com/infographic/four-vs-big-data
  8. Redman, Thomas C. “Bad Data Costs the U.S. $3 Trillion Per Year”. Published by Harvard Business Review online 22 September 2016. Accessed on 02 January 2019 at https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year
  9. Hellerstein, Joseph M., Jeffrey Heer, and Sean Kandel. "Self-Service Data Preparation: Research to Practice." IEEE Data Eng. Bull. 41, no. 2 (2018): 23-34. Accessed on 02 January 2019 at http://sites.computer.org/debull/A18june/p23.pdf
  10. Chu, Xu, Ihab F. Ilyas, Sanjay Krishnan, and Jiannan Wang. "Data cleaning: Overview and emerging challenges." In Proceedings of the 2016 International Conference on Management of Data, pp. 2201-2206. ACM, 2016. Accessed on 2nd January 2019 at https://www.cs.sfu.ca/~jnwang/papers/sigmod2016-datacleaning-tutorial.pdf; Tang, Nan. "Big RDF data cleaning." In 2015 31st IEEE International Conference on Data Engineering Workshops (ICDEW), pp. 77-79. IEEE, 2015. Accessed 2nd January 2019 at http://da.qcri.org/ntang/pubs/desweb2015.pdf; Tang, Nan. "Big data cleaning." In Asia-Pacific Web Conference, pp. 13-24. Springer, Cham, 2014. Accessed 2nd January 2019 at https://pdfs.semanticscholar.org/cc63/18aed11065cd1b5773f472c38f8feec51702.pdf
  11. Howard, Philip. “Data Preparation (self-service)”. Published online 04 July 2018 at https://www.bloorresearch.com/technology/data-preparation-self-service  Note: twenty major companies are listed. See also Zaidi, Ehtisham, Rita Sallam, Shubhangi Vashisth. “Market Guide for Data Preparation, ID G00315888” Published by Gartner online on 14th December 2017. See https://www.gartner.com/document/3838463
  12. Hersh, William A., et al. (2013). “Caveats for the use of operational electronic health record data in comparative effectiveness research”. Medical Care, 51(8 Suppl 3), S30-7. Accessed on 2nd January 2019 at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3748381/pdf/nihms491343.pdf
  13. Bethel, Dennis (PhD Med.). Published online on 27 March 2016 at https://www.kevinmd.com/blog/2016/03/this-doctor-orders-pregnancy-tests-on-men-youre-probably-doing-it-too.html Note: the doctor complains about software in major hospitals, not about medical malpractice.
  14. Brennan L, Watson M, Klaber R, Charles T. “The importance of knowing context of hospital episode statistics when reconfiguring the NHS”. British Medical Journal. 2012;344:e2432.
  15. Abel, Gene G., Alan Jordan, Nora Harlow, and Yu-Sheng Hsu. "Preventing child sexual abuse: screening for hidden child molesters seeking jobs in organizations that care for children." Sexual Abuse (2018): 1079063218793634, published 16th August 2018.
  16. Paschke, A., & Schäfermeier, R. (2018). OntoMaven-Maven-Based Ontology Development and Management of Distributed Ontology Repositories. In Synergies Between Knowledge Engineering and Software Engineering (pp. 251-273). Springer, Cham. Accessed on 3rd January 2019 at https://arxiv.org/pdf/1309.7341.pdf but see also the seminal work Noy, Natalya F., and Mark A. Musen. "Ontology versioning in an ontology management framework." IEEE Intelligent Systems 19, no. 4 (2004): 6-13.
  17. Groth, P., and Moreau, L. (eds.). “PROV-Overview. An Overview of the PROV Family of Documents”. W3C Working Group Note. Published online on 30 April 2013 by the World Wide Web Consortium at https://www.w3.org/TR/prov-overview
  18. Sáenz-Adán, C., Pérez, B., Huynh, T. D., & Moreau, L. (2018, January). UML2PROV: Automating Provenance Capture in Software Engineering. In International Conference on Current Trends in Theory and Practice of Informatics (pp. 667-681). Accessed on 02 January 2019 at https://nms.kcl.ac.uk/luc.moreau/papers/uml2prov-sofsem18.pdf
  19. GDPR, Art 17.
  20. GDPR, Art 21.
  21. Malone, Brandon, Alberto García-Durán, and Mathias Niepert. "Knowledge Graph Completion to Predict Polypharmacy Side Effects." In International Conference on Data Integration in the Life Sciences, pp. 144-149. Springer, Cham, 2018. Accessed 8th January 2019 at https://arxiv.org/pdf/1810.09227. See also references in the “European Union High-Level Expert Group on Artificial Intelligence Draft Ethics Guidelines for Trustworthy AI”, published 18th December 2018 at https://ec.europa.eu/newsroom/dae/document.cfm?doc_id=56433
  22. Kim, Henry M., and Marek Laskowski. "Toward an ontology‐driven blockchain design for supply‐chain provenance." Intelligent Systems in Accounting, Finance and Management 25, no. 1 (2018): 18-27. Accessed on 3rd January 2019 at https://arxiv.org/ftp/arxiv/papers/1610/1610.02922.pdf
  23. NGSI-LD API. “Context Information Management Application Programming Interface (API): For Public Review”. Published on 18th December 2018 at https://docbox.etsi.org/ISG/CIM/Open/ISG_CIM_NGSI-LD_API_Draft_for_public_review.pdf



Lindsay Frost is Chief Standardization Engineer at NEC Laboratories Europe GmbH. He was elected chairman of ETSI ISG CIM in February 2017, elected to the Board of ETSI in November 2017 and is ETSI delegate to the sub-committee of the EC Multi-Stakeholder Platform (Digitizing European Industry) and to the CEN-CENELEC-ETSI Sector Forum on Smart and Sustainable Cities and Communities. He began his career in experimental physics facilities in Australia, Germany and Italy, before joining NEC in 1999 where he has managed R&D teams for 3GPP, WiMAX, fixed-mobile convergence and WLAN. Contact him at Lindsay.Frost@neclab.eu.