Explainable AI and Other Questions Where Provenance Matters
On the night of 18th March 2018 a woman walking across a road in Tempe, Arizona, was struck and killed by an autonomous vehicle. On 11th December 2018 Google CEO Sundar Pichai faced questions before the US Congress during a 3-hour public hearing about alleged political bias in the filtering of news. In Europe, it is almost certain that in the next few years some major company will be in court facing GDPR fines of up to 4% of annual global revenues if judged culpable of using personal data outside the agreed context. So be warned: if you are an owner or operator of a decision-making software platform, unclear provenance in decision-making and/or context puts you at risk.
This article briefly explains why you (as a data scientist or a CTO or indeed as a citizen) need to worry about data provenance and metadata and then takes a look at the various kinds of tools available to help reduce the risks and costs.
Firstly, unclear sourcing of data and consequent inaccuracy or misunderstandings already cost billions of dollars … and also lives.
The cost of correcting or “cleaning up” data which has incorrect or misinterpreted provenance, before including it in data warehouses, data lakes, CRM systems, etc., is huge. A blogged guesstimate for the USA in 2011 was $3.1 trillion per year (a catchy number uncritically used without attribution by IBM, taken up by Harvard Business Review and mentioned by dozens of others – making it itself an example of poor recording of provenance). Nonetheless, errors and duplicates in e.g. company customer records (CRM) obviously do waste millions through erroneous billings and cause a multiplicity of unsolicited credit-card mailings every year, which you – dear customer – pay for one way or another.
Recent publications guesstimate that simply correcting typos, formatting and misinterpretations, so-called Data Wrangling, continues to consume “half the time of data scientists”. This is sufficient reason to spawn an industry for data clean-up and now also an industry for 'Self-Service Data Preparation', i.e. cloud services helping data producers to improve the initial labelling (provisioning) of their information with metadata.
But lives are also at risk, for example when hospital records are entered incorrectly and nothing and no-one checks (or has time to check) the context. The quality (completeness, correctness, concordance, plausibility, and currency) of Electronic Health Records (EHRs) is “often not consistent with research standards”. Pregnancy tests ordered for men? It happens all the time. Male patient records with the checkbox “cervical cancer diagnosed” marked? Routinely found! In an analysis of the English National Health Service, the annual 2012 hospital statistics showed approximately 20,000 adults attending paediatric outpatient services, approximately 17,000 males admitted to obstetrical inpatient services, and about 8,000 males admitted to gynaecology inpatient services. Many projects in genome research rely on correlating EHRs with DNA nucleotide sequences to infer drug efficiencies and diagnostics, so not all mistakes are merely amusing. A concerted effort to detect such errors is underway.
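The kind of plausibility check that would catch the hospital-statistics anomalies described above can be sketched in a few lines. This is only an illustration: the `Episode` fields, the service names and the age threshold are hypothetical and do not come from any real EHR schema or from the cited analyses.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    """One hospital episode; all field names are hypothetical."""
    patient_id: str
    sex: str       # "M" or "F"
    age: int       # in years
    service: str   # e.g. "obstetrics", "paediatrics"


# Assumed rules, chosen only to mirror the anomalies mentioned in the text.
FEMALE_ONLY_SERVICES = {"obstetrics", "gynaecology"}
PAEDIATRIC_MAX_AGE = 17


def implausibilities(ep: Episode) -> List[str]:
    """Return a human-readable list of context violations for one episode."""
    problems = []
    if ep.service in FEMALE_ONLY_SERVICES and ep.sex == "M":
        problems.append(f"male patient in {ep.service}")
    if ep.service == "paediatrics" and ep.age > PAEDIATRIC_MAX_AGE:
        problems.append(f"adult (age {ep.age}) in paediatrics")
    return problems


episodes = [
    Episode("p1", "F", 29, "obstetrics"),
    Episode("p2", "M", 41, "obstetrics"),
    Episode("p3", "F", 34, "paediatrics"),
]

# Keep only the episodes for which at least one rule fired.
flagged: Dict[str, List[str]] = {}
for ep in episodes:
    found = implausibilities(ep)
    if found:
        flagged[ep.patient_id] = found
```

A flagged record is not necessarily wrong (the context may explain it), which is exactly why the context needs to be recorded alongside the data.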
Other errors arise not in the data but in its interpretation or prioritization, e.g. in the machine learning algorithms which are used to define such things as ranking for job applications, eligibility for personal credit, admissibility for a business visa, decisions during automated-vehicle driving, allocation of hospital emergency response resources during peak periods, traffic planning to reduce air pollution near kindergartens and aged-care centres, and so on.
For example, do you feel comfortable knowing that there exists job-applicant screening software, for screening future staff in contact with young children, for which the authors claim that over 19,000 test cases have shown that it “correctly identified 77% of the men and over 72% of the women who posed a sexual risk”? Whatever the methodology, I would like to know how many people were screened out by being “incorrectly identified” and what biases might be inherent in the system.
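To see why the number of “incorrectly identified” applicants matters, a rough base-rate calculation helps. In the sketch below only the 77% detection rate is taken from the quotation above; the number of applicants, the prevalence and the false-positive rate are invented purely for illustration and are not figures from the cited study.

```python
def screening_outcomes(n_applicants, prevalence, sensitivity, false_positive_rate):
    """Split screened applicants into correctly and incorrectly flagged groups."""
    at_risk = n_applicants * prevalence
    not_at_risk = n_applicants - at_risk
    true_positives = at_risk * sensitivity
    false_positives = not_at_risk * false_positive_rate
    flagged = true_positives + false_positives
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "share_of_flagged_wrongly": false_positives / flagged,
    }


result = screening_outcomes(
    n_applicants=10_000,       # hypothetical
    prevalence=0.02,           # hypothetical: 2% of applicants pose a risk
    sensitivity=0.77,          # the 77% quoted in the text
    false_positive_rate=0.05,  # hypothetical: 5% of harmless applicants flagged
)
```

With these made-up inputs about 154 applicants would be flagged correctly and about 490 incorrectly, so roughly three out of four flagged people would be screened out in error. The real proportions depend entirely on figures that the quoted claim does not supply.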
The various forms of provenance accountability, and methodologies available, can be broadly summarized in the table below (with some recent references as examples):
|‘data provenance accountability’ concerns issues of correct recording of the source(s) of information and such meta-data (context) as the timing, location, procedural history of derived information, declared accuracy, declared producing entity and so on (all of which may need to be collated into a cumulative ‘history’ when the data is processed/aggregated)||
|‘data flow accountability’ concerns issues of ensuring that the data is permitted (licensed) to flow through a series of correctly identified processes/systems (this is particularly important for privacy regulations in Europe and elsewhere, which require that personal data is only used for the pre-agreed purpose and that there is a ‘right to erasure’ and a ‘right to object to further processing of personal data’)||
|‘algorithmic accountability’ concerns issues of fairness, transparency, and explainability of decision-making (or filtering or rating) software, particularly regarding machine learning||
|‘legal non-repudiability’ for some or all of the above information may be required when legal liability is asserted, requiring that the accuracy of appropriate records is trusted by all parties (an area of application for e.g. blockchain distributed ledger technologies)||
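The first two rows of the table can be made concrete with a small sketch: a record that carries its source and its permitted purposes, accumulates a processing history, and refuses processing for a purpose that was not agreed. All class, field and service names here are hypothetical; a real system would also need identity management, persistence and auditing.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Set


@dataclass
class ProvenanceStep:
    """One entry in the cumulative history of a record."""
    actor: str
    action: str
    purpose: str
    at: str  # ISO 8601 timestamp


@dataclass
class Record:
    value: dict
    source: str                  # declared producing entity
    allowed_purposes: Set[str]   # purposes agreed with the data subject
    history: List[ProvenanceStep] = field(default_factory=list)


class PurposeNotPermitted(Exception):
    """Raised when data would flow to a purpose that was not agreed."""


def process(record: Record, actor: str, action: str, purpose: str) -> Record:
    # Data flow accountability: check the purpose before touching the data.
    if purpose not in record.allowed_purposes:
        raise PurposeNotPermitted(f"{actor} may not use this record for {purpose}")
    # Data provenance accountability: append to the cumulative history.
    timestamp = datetime.now(timezone.utc).isoformat()
    record.history.append(ProvenanceStep(actor, action, purpose, timestamp))
    return record


reading = Record(
    value={"customer": "c-17", "kwh": 412},
    source="smart-meter-42",
    allowed_purposes={"billing"},
)
process(reading, actor="billing-service", action="compute-invoice", purpose="billing")

try:
    process(reading, actor="ad-engine", action="profile", purpose="marketing")
    blocked = False
except PurposeNotPermitted:
    blocked = True
```

The history kept here is only as trustworthy as the system that writes it, which is where the non-repudiability row of the table comes in.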
Now, imagine you have all of the above sufficiently covered within your domain of interest ... how do you share the provenance and context information with another system? Since around 2010 the W3C has developed a body of work (PROV) for modelling and transferring provenance information, which, according to Google, has been referenced over half a million times, not counting references internal to W3C. Various protocol bindings are available.
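As a rough illustration of what such a shareable provenance description might look like, the sketch below builds a small document whose layout is loosely modelled on the PROV-JSON serialization, using entities, activities, agents and the relations between them. The `ex:` identifiers and timestamps are invented, the document has not been validated against any schema, and the `sources_of` helper is a hypothetical convenience, not part of PROV.

```python
import json

# A report derived from raw readings by an aggregation activity.
prov_document = {
    "prefix": {"ex": "http://example.org/"},
    "entity": {
        "ex:raw-readings": {},
        "ex:monthly-report": {},
    },
    "activity": {
        "ex:aggregate": {
            "prov:startTime": "2019-01-02T09:00:00Z",
            "prov:endTime": "2019-01-02T09:05:00Z",
        }
    },
    "agent": {"ex:analytics-team": {}},
    "used": {
        "_:u1": {"prov:activity": "ex:aggregate", "prov:entity": "ex:raw-readings"}
    },
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:monthly-report", "prov:activity": "ex:aggregate"}
    },
    "wasDerivedFrom": {
        "_:d1": {
            "prov:generatedEntity": "ex:monthly-report",
            "prov:usedEntity": "ex:raw-readings",
        }
    },
    "wasAssociatedWith": {
        "_:a1": {"prov:activity": "ex:aggregate", "prov:agent": "ex:analytics-team"}
    },
}


def sources_of(doc, entity_id):
    """List the entities from which entity_id was directly derived."""
    return [
        relation["prov:usedEntity"]
        for relation in doc.get("wasDerivedFrom", {}).values()
        if relation["prov:generatedEntity"] == entity_id
    ]


# The serialized form is what would be handed to another system.
serialized = json.dumps(prov_document, indent=2)
```

A receiving system can then answer “where did this report come from?” by calling `sources_of(prov_document, "ex:monthly-report")` without knowing anything about the producer's internals.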
Meanwhile, from the Internet of Things and Smart City areas of application, attempts are being made to standardise within ETSI an API and a protocol called NGSI-LD which is designed especially for encouraging ontology management, context information management and ultimately data flow accountability across systems, as illustrated in the figure below.
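To give a feel for context information in this style, here is a minimal sketch of an entity expressed with properties and relationships in the manner of NGSI-LD, where a measured value carries its own observation time and a link to the party that provided it. The entity type, attribute names, identifiers and the `@context` URL are all hypothetical, and the two helper functions are conveniences for this example rather than part of any API.

```python
import json


def make_property(value, observed_at=None, **sub_attributes):
    """Build a Property, optionally with a timestamp and nested attributes."""
    prop = {"type": "Property", "value": value}
    if observed_at is not None:
        prop["observedAt"] = observed_at
    prop.update(sub_attributes)
    return prop


def make_relationship(target_id):
    """Build a Relationship pointing at another entity by its identifier."""
    return {"type": "Relationship", "object": target_id}


entity = {
    "id": "urn:ngsi-ld:AirQualitySensor:kindergarten-7",
    "type": "AirQualitySensor",
    # The measurement carries provenance metadata of its own.
    "NO2": make_property(
        41.5,
        observed_at="2019-01-08T07:30:00Z",
        providedBy=make_relationship("urn:ngsi-ld:Organization:city-lab"),
    ),
    "locatedNear": make_relationship("urn:ngsi-ld:Building:kindergarten-7"),
    # Hypothetical context document mapping the short names to ontology terms.
    "@context": ["https://example.org/hypothetical-context.jsonld"],
}

payload = json.dumps(entity, indent=2)
```

Because the `@context` ties each short name to a shared vocabulary, a consuming system can interpret `NO2` or `providedBy` without a private agreement with the producer.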
We are living in the age of Digital Transformation of business and society. The European Union is spending billions attempting to guide and facilitate a “soft landing” into a society which is fair, efficient and empowering of citizens. On the other hand, we are also living in the age of “fake news”.
Proving how you know what you know is becoming mission-critical.
- https://www.nytimes.com/interactive/2018/03/20/us/self-driving-uber-pedestrian-killed.html Published 21st March 2018. See also a 700 page overview sponsored by Daimler and Benz Stiftung: Maurer, Markus, J. Christian Gerdes, Barbara Lenz, and Hermann Winner (eds.), “Autonomous driving: Technical, Legal and Social Aspects”. Published by Springer, Berlin, 2016. Accessed on 8th January 2019 at http://www.oapen.org/download?type=document&docid=1002194#page=81
- See https://www.youtube.com/watch?v=Ul5fMAG2tk4 (at timemarker 52 minutes)
- GDPR General Data Protection Regulation (EU) 2016/679 of 27 April 2016. Correg. 23rd May 2018. Accessed 8th January 2019 at https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:02016R0679-20160504&from=EN
- GDPR Art 5(1)(b).
- Tibbetts, Hollis. “$3 Trillion Problem: Three Best Practices for Today's Dirty Data Pandemic”. Published online 10 September 2011 at http://hollistibbetts.sys-con.com/node/1975126
- Redman, Thomas C. “Bad Data Costs the U.S. $3 Trillion Per Year”. Published by Harvard Business Review online 22 September 2016. Accessed on 02 January 2019 at https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year
- Hellerstein, Joseph M., Jeffrey Heer, and Sean Kandel. "Self-Service Data Preparation: Research to Practice." IEEE Data Eng. Bull. 41, no. 2 (2018): 23-34. Accessed on 02 January 2019 at http://sites.computer.org/debull/A18june/p23.pdf
- Chu, Xu, Ihab F. Ilyas, Sanjay Krishnan, and Jiannan Wang. "Data cleaning: Overview and emerging challenges." In Proceedings of the 2016 International Conference on Management of Data, pp. 2201-2206. ACM, 2016. Accessed on 2nd January 2019 at https://www.cs.sfu.ca/~jnwang/papers/sigmod2016-datacleaning-tutorial.pdf; Tang, Nan. "Big RDF data cleaning." In 2015 31st IEEE International Conference on Data Engineering Workshops (ICDEW), pp. 77-79. IEEE, 2015. Accessed 2nd January 2019 at http://da.qcri.org/ntang/pubs/desweb2015.pdf; Tang, Nan. "Big data cleaning." In Asia-Pacific Web Conference, pp. 13-24. Springer, Cham, 2014. Accessed 2nd January 2019 at https://pdfs.semanticscholar.org/cc63/18aed11065cd1b5773f472c38f8feec51702.pdf
- Howard, Philip. “Data Preparation (self-service)”. Published online 04 July 2018 at https://www.bloorresearch.com/technology/data-preparation-self-service Note: twenty major companies are listed. See also Zaidi, Ehtisham, Rita Sallam, Shubhangi Vashisth. “Market Guide for Data Preparation, ID G00315888” Published by Gartner online on 14th December 2017. See https://www.gartner.com/document/3838463
- Hersh, William A., et al. (2013). “Caveats for the use of operational electronic health record data in comparative effectiveness research”. Medical Care, 51(8 Suppl 3), S30-7. Accessed on 2nd January 2019 at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3748381/pdf/nihms491343.pdf
- Bethel, Dennis (PhD Med.). Published online on 27 March 2016 at https://www.kevinmd.com/blog/2016/03/this-doctor-orders-pregnancy-tests-on-men-youre-probably-doing-it-too.html Note: the doctor complains about software in major hospitals, not about medical malpractice.
- Brennan L, Watson M, Klaber R, Charles T. “The importance of knowing context of hospital episode statistics when reconfiguring the NHS”. British Medical Journal. 2012;344:e2432.
- Abel, Gene G., Alan Jordan, Nora Harlow, and Yu-Sheng Hsu. "Preventing child sexual abuse: screening for hidden child molesters seeking jobs in organizations that care for children." Sexual Abuse (2018): 1079063218793634, published 16th August 2018.
- Paschke, A., & Schäfermeier, R. (2018). OntoMaven-Maven-Based Ontology Development and Management of Distributed Ontology Repositories. In Synergies Between Knowledge Engineering and Software Engineering (pp. 251-273). Springer, Cham. Accessed on 3rd January 2019 at https://arxiv.org/pdf/1309.7341.pdf but see also the seminal work Noy, Natalya F., and Mark A. Musen. "Ontology versioning in an ontology management framework." IEEE Intelligent Systems 19, no. 4 (2004): 6-13.
- Groth, P., and Moreau, L. (eds.). “PROV-Overview. An Overview of the PROV Family of Documents. W3C Working Group Note”. Published online on 30 April 2013 at https://www.w3.org/TR/prov-overview by World Wide Web Consortium
- Sáenz-Adán, C., Pérez, B., Huynh, T. D., & Moreau, L. (2018, January). UML2PROV: Automating Provenance Capture in Software Engineering. In International Conference on Current Trends in Theory and Practice of Informatics (pp. 667-681). Accessed on 02 January 2019 at https://nms.kcl.ac.uk/luc.moreau/papers/uml2prov-sofsem18.pdf
- GDPR, Art 17.
- GDPR, Art 21.
- Malone, Brandon, Alberto García-Durán, and Mathias Niepert. "Knowledge Graph Completion to Predict Polypharmacy Side Effects." In International Conference on Data Integration in the Life Sciences, pp. 144-149. Springer, Cham, 2018. Accessed 8th January 2019 at https://arxiv.org/pdf/1810.09227. See also references in the “European Union High-Level Expert Group on Artificial Intelligence Draft Ethics Guidelines for Trustworthy AI”, published 18th December 2018 at https://ec.europa.eu/newsroom/dae/document.cfm?doc_id=56433
- Kim, Henry M., and Marek Laskowski. "Toward an ontology‐driven blockchain design for supply‐chain provenance." Intelligent Systems in Accounting, Finance and Management 25, no. 1 (2018): 18-27. Accessed on 3rd January 2019 at https://arxiv.org/ftp/arxiv/papers/1610/1610.02922.pdf
- NGSI-LD API “Context Information Management Application Programming Interface (API): For Public Review”. Published on 18th December 2018 at https://docbox.etsi.org/ISG/CIM/Open/ISG_CIM_NGSI-LD_API_Draft_for_public_review.pdf
Lindsay Frost is Chief Standardization Engineer at NEC Laboratories Europe GmbH. He was elected chairman of ETSI ISG CIM in February 2017, elected to the Board of ETSI in November 2017 and is ETSI delegate to the sub-committee of the EC Multi-Stakeholder Platform (Digitizing European Industry) and to the CEN-CENELEC-ETSI Sector Forum on Smart and Sustainable Cities and Communities. He began his career in experimental physics facilities in Australia, Germany and Italy, before joining NEC in 1999 where he has managed R&D teams for 3GPP, WiMAX, fixed-mobile convergence and WLAN. Contact him at Lindsay.Frost@neclab.eu.