Using Shape Expressions for data quality and consistency in Wikidata

Andra Waagmeester, Katherine Thornton, Lucas Werkmeister and Gregory Stupp

As a truly open data infrastructure, Wikidata is exposed to community issues such as disagreement, bias, human error and vandalism. From a curator's perspective, it can be challenging to filter through the different views represented in Wikidata while maintaining one's own definitions and standards. Whether the cause is a benign difference of opinion or something more malignant, such as vandalism or the introduction of low-quality evidence, public databases face extra challenges in ensuring data quality in the public domain. Here we propose the use of W3C Shape Expressions (ShEx) as a toolkit to model, validate and filter the interactions between designated public resources and Wikidata. ShEx is a schema language for expressing constraints on RDF graphs. Wikidata is fundamentally a graph, so ShEx can be used to validate Wikidata items, communicate expected graph patterns, and generate user interfaces and interface code. It will also allow us to efficiently:

- exchange and understand each other's models;
- express a shared model of our footprint in Wikidata;
- agilely develop and test that model against sample data, and evolve it;
- catch disagreements, inconsistencies or errors at input time or in batch inspections.
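To give a flavour of what validating an item against an expected graph pattern means, the following is a minimal sketch in plain Python. It is not ShEx and does not use any ShEx library; it is a simplified stand-in that only mimics cardinality and value constraints on a node's outgoing statements. The shape, the helper names (Constraint, HUMAN_SHAPE, validate) and the items ex:item1 and ex:item2 are hypothetical and chosen for illustration. The only Wikidata identifiers assumed are P31 (instance of), P569 (date of birth) and Q5 (human).

```python
from typing import Callable, NamedTuple

# A statement is modelled as a (subject, predicate, object) triple of strings.
Triple = tuple[str, str, str]


class Constraint(NamedTuple):
    """One expected property: how often it may occur and which values are allowed."""
    predicate: str
    min_count: int
    max_count: float  # float so that float("inf") can mean "unbounded"
    value_ok: Callable[[str], bool]


def looks_like_date(value: str) -> bool:
    # Very rough YYYY-MM-DD check, sufficient for this illustration only.
    return len(value) == 10 and value[4] == "-" and value[7] == "-"


# Hypothetical shape: a human has at least one "instance of" statement,
# all pointing at Q5, and at most one date of birth.
HUMAN_SHAPE = [
    Constraint("wdt:P31", 1, float("inf"), lambda v: v == "wd:Q5"),
    Constraint("wdt:P569", 0, 1, looks_like_date),
]


def validate(node: str, triples: list[Triple], shape: list[Constraint]) -> list[str]:
    """Return a list of human-readable violations; an empty list means conformant."""
    errors = []
    for c in shape:
        values = [o for s, p, o in triples if s == node and p == c.predicate]
        if not c.min_count <= len(values) <= c.max_count:
            errors.append(
                f"{c.predicate}: expected between {c.min_count} and "
                f"{c.max_count} values, found {len(values)}"
            )
        for v in values:
            if not c.value_ok(v):
                errors.append(f"{c.predicate}: unexpected value {v!r}")
    return errors


# Made-up sample data: item1 conforms, item2 lacks P31 and has two birth dates.
TRIPLES = [
    ("ex:item1", "wdt:P31", "wd:Q5"),
    ("ex:item1", "wdt:P569", "1952-03-11"),
    ("ex:item2", "wdt:P569", "1952-03-11"),
    ("ex:item2", "wdt:P569", "1953-01-01"),
]

if __name__ == "__main__":
    for item in ("ex:item1", "ex:item2"):
        print(item, validate(item, TRIPLES, HUMAN_SHAPE))
```

The same check can be run on a single item at input time or over many items in a batch inspection. A real ShEx schema would express these constraints declaratively and be evaluated by a ShEx validator rather than by hand-written code like this.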