Wikidata as authority linking hub

Joachim Neubert (ZBW), Jakob Voß (VZG)

DINI AG KIM Workshop, Mannheim, 4 May 2017

Introduction

Authority files

Consistently refer to entities

  • Via identifier (“things, not strings”)
  • GND, MeSH, STW, ISIL, RePEc-Authors…

Linking hubs

Connect identifiers among authority files

  • owl:sameAs, skos:exactMatch, skos:closeMatch
  • VIAF, sameAs.org, Wikidata

Wikidata

  • Knowledge base of Wikimedia projects
  • All kinds of entities
    • concepts, places, people, works…

Wikidata Usage

  • Editable by anyone
    • via website and API
    • via apps that use the API
  • Data available
    • http://query.wikidata.org/ (SPARQL)
    • JSON API & database dumps
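
A minimal sketch of how the SPARQL endpoint could be queried from Python. It assumes the endpoint address https://query.wikidata.org/sparql with the `query` and `format=json` URL parameters; the helper names and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(property_id, limit=10):
    """Return a SPARQL query listing items that have a value for the property."""
    return (
        "SELECT ?item ?value WHERE { "
        f"?item wdt:{property_id} ?value . "
        "} "
        f"LIMIT {limit}"
    )


def build_url(query):
    """Encode the query as a GET request URL asking for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return f"{ENDPOINT}?{params}"


def run_query(query):
    """Send the query (needs network access) and return the result rows."""
    request = urllib.request.Request(
        build_url(query),
        headers={"User-Agent": "authority-hub-example/0.1"},  # hypothetical name
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run_query(build_query("P791"))` would list a few items together with their ISIL values.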

Wikidata Statements

Wikidata item example

Authority file identifiers in Wikidata

More than half of all Wikidata properties are authority file identifiers

Wikidata—ISIL (organizations)

Example:

Neuschwanstein Castle (Q4152) ISIL (P791): DE-MUS-051612
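
In Wikidata's JSON representation, such a statement sits under the item's `claims`, keyed by property. The sketch below reads the identifier from a hand-written, heavily trimmed fragment; the real JSON for Q4152 contains many more fields, and the function name is hypothetical.

```python
def string_values(entity, property_id):
    """Collect the plain values of all statements for one property."""
    values = []
    for statement in entity.get("claims", {}).get(property_id, []):
        snak = statement.get("mainsnak", {})
        # Statements marked "no value" or "unknown value" carry no datavalue.
        if snak.get("snaktype") == "value":
            values.append(snak["datavalue"]["value"])
    return values


# Trimmed, hand-written fragment for illustration only.
entity = {
    "id": "Q4152",
    "claims": {
        "P791": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P791",
                    "datavalue": {"value": "DE-MUS-051612", "type": "string"},
                }
            }
        ]
    },
}

isil_values = string_values(entity, "P791")
```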

Current state:

  • lobid.org: ~15,000 ISIL (DACH only)
  • Wikidata: ~6,500 ISIL

Tool: Mix’n’match

  • Web application mapping tool
  • Helps to add 1-to-1 mappings

https://tools.wmflabs.org/mix-n-match/

Step 1: Upload ISIL list with names

Step 2: Confirm match candidates

Automatically matched
Visual Mode

GND—RePEc Authors

  • In the EconBiz economics search portal, authors are identified in two different ways:
    • by GND ID in data from ZBW’s Econis catalog (and from others)
    • by RePEc Author ID in data from Research Papers in Economics (RePEc)
  • Large volumes: 450,000 vs. 50,000 distinct persons
  • ~3,000 pairs of IDs discovered in a previous project

Utilizing Wikidata as Linking Hub

  • Wikidata properties for both identifier systems
    • GND ID (P227): ~375,000 items which are humans
    • RePEc Short-ID (P2428): ~2,200 items
  • Since every identifier should identify exactly one person, we can derive
    • GND ID ⟶ Wikidata ID ⟶ RePEc ID
    • RePEc ID ⟶ Wikidata ID ⟶ GND ID
    where both properties have values (~760 items)
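
The derivation above amounts to a join on the Wikidata item. A minimal sketch, assuming the items have already been harvested into a dictionary keyed by Wikidata ID; all identifiers in the sample are invented placeholders.

```python
GND, REPEC = "P227", "P2428"


def derive_pairs(items):
    """Map GND ID -> RePEc Short-ID for items that carry both properties."""
    pairs = {}
    for ids in items.values():
        if GND in ids and REPEC in ids:
            pairs[ids[GND]] = ids[REPEC]
    return pairs


def invert(pairs):
    """Turn the GND -> RePEc mapping into RePEc -> GND."""
    return {repec: gnd for gnd, repec in pairs.items()}


# Invented sample data: only the first item has both identifiers.
sample_items = {
    "Q00000001": {GND: "gnd-0001", REPEC: "pxx1"},
    "Q00000002": {GND: "gnd-0002"},
    "Q00000003": {REPEC: "pxx3"},
}

gnd_to_repec = derive_pairs(sample_items)
repec_to_gnd = invert(gnd_to_repec)
```

The inversion is only safe under the assumption stated above, namely that every identifier identifies exactly one person.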

Step 1: Supplement WD items with RePEc Short-IDs

Bulk editing with QuickStatements2

Further simplification is expected with the upcoming release of the wdmapper command-line tool
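
A sketch of how the input for such a bulk edit could be generated. It assumes the tab-separated QuickStatements command format (item, property, string value in double quotes); the function name and all sample identifiers are invented placeholders, not the script used in the project.

```python
def repec_statements(known_pairs, wikidata_gnd, has_repec):
    """Build one command per item that has a GND ID but no RePEc Short-ID.

    known_pairs:  GND ID -> RePEc Short-ID from an existing mapping
    wikidata_gnd: GND ID -> Wikidata item ID, as harvested from Wikidata
    has_repec:    set of item IDs that already carry a RePEc Short-ID
    """
    lines = []
    for gnd, repec in sorted(known_pairs.items()):
        qid = wikidata_gnd.get(gnd)
        if qid is None or qid in has_repec:
            continue  # no item for this GND ID, or nothing left to add
        lines.append(f'{qid}\tP2428\t"{repec}"')
    return lines


# Invented sample data.
known_pairs = {"gnd-0001": "pxx1", "gnd-0002": "pxx2", "gnd-0009": "pxx9"}
wikidata_gnd = {"gnd-0001": "Q00000001", "gnd-0002": "Q00000002"}
has_repec = {"Q00000002"}

commands = repec_statements(known_pairs, wikidata_gnd, has_repec)
```

Swapping the roles of the two properties gives the commands for the opposite direction.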

Step 2: Supplement WD items with GND IDs

  • 384 WD items with a RePEc Short-ID but without a GND ID
  • same process as in the other direction

Step 3: Add “most important” authors with RePEc identifiers

Step 4: Add “most important” authors with GND identifiers

  • 18,000 authors with >30 publications in EconBiz
  • loaded as Mix’n’match set GND economists (de)
  • order by publication count (descending)
  • 25% matched automatically with Wikidata items

Work to do

Step 5: Rinse and repeat

  • Repeat the Mix’n’match “sync” operation before starting to work manually
    • people are often adding data at a fast rate!
  • Repeat the bulk adding of missing identifiers to make use of complementary identifiers added in the meantime

Step 6: Add missing Wikidata items

  • Verify missing authors indeed are not in Wikidata
  • Generate Wikidata items from existing mappings or lists, e.g. top female economists
  • Use Wikidata’s QuickStatements tool for this
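
A sketch of how item creation could be scripted, assuming the QuickStatements `CREATE` / `LAST` command syntax, where `Len` sets the English label and `P31` with value `Q5` marks the item as a human. The function name and the sample person are invented.

```python
def new_person_commands(label, gnd_id=None, repec_id=None):
    """Return QuickStatements commands that create one item for a person."""
    lines = [
        "CREATE",
        f'LAST\tLen\t"{label}"',  # English label
        "LAST\tP31\tQ5",          # instance of: human
    ]
    if gnd_id:
        lines.append(f'LAST\tP227\t"{gnd_id}"')
    if repec_id:
        lines.append(f'LAST\tP2428\t"{repec_id}"')
    return lines


# Invented example; a real run needs verified names and identifiers.
batch = new_person_commands("Jane Example", gnd_id="gnd-0001", repec_id="pxx1")
```

Each person should be checked against Wikidata first, as noted above, so that no duplicate items are created.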

Result

The mapping currently (2017-05-02) consists of

  • 1,233 matching pairs of GND IDs and RePEc Short-IDs
    • 769 matches from ZBW’s mapping
    • 464 matches contributed by non-ZBW staff
  • Finally all 3,000 pairs from ZBW’s mapping

Further Results

  • Identifiers and items inserted by individual Wikidata contributors add up continuously
  • Mapping steps can be repeated with additional input data (e.g., top economists from Latin America, “all authors affiliated to Leibniz institutions in economics”, …)
  • Further identifiers (VIAF, ORCID, …) provide more opportunities for indirect matching

Results from every step in the mapping process and all individual efforts are immediately available and preserved

Tools

  • Mix’n’match (intellectual matching)
  • QuickStatements/2 (addition of generated properties and items)
  • wdmapper (harvest, diff & add mappings)
    • Support of indirect mappings (e.g., GND-WD-RePEc) in one step
    • Work in progress (adding is not supported yet)
    • Daily harvested mappings in multiple formats: http://coli-conc.gbv.de/concordances/wikidata/

Tools for mass editing require an approved bot account.

Limitations

  • Mapping algorithms to find mapping candidates
  • Limitation to easy 1-to-1 relationships
    • part-whole
    • often new Wikidata items required
    • depends on the use case
  • Large sets of mappings and results
  • Regular review required for maintenance

Benefits

  • Outsourced interface, storage, and operation
  • Crowdsourced mapping maintenance
  • Wikidata has policies and tools for data quality
  • Open Data for multiple and unknown uses
  • Additional benefits:
    • multilingual Wikipedia links
    • lots of (formatted) data, nice pictures, …
    • links to multiple other vocabularies