How the OAR improves data quality

The OAR is powered by a deduplication algorithm, which processes each line of contributed data to detect whether or not facilities already exist in the database.

The OAR’s technical team runs regular training exercises to develop and refine the algorithm, which is based on a statistical model. In addition to these training exercises, the OAR team continually moderates data in the tool to ensure high-quality data and maintain the trust of our valued users. We also moderate data reactively when users report discrepancies to us.

How data is processed

Every line of data contributed to the OAR is processed by a deduplication algorithm.

The workflow uses the Dedupe Python library and begins by comparing a new set of records submitted by a contributor against the existing set of mapped facilities. Dedupe can match large lists accurately because it uses blocking and active learning to reduce the amount of work required.
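
To illustrate the idea behind blocking, here is a minimal standard-library sketch. It does not use Dedupe's API and is not the OAR's implementation: the blocking key (country code plus the first letter of the name) and the field names `name` and `country` are assumptions chosen for illustration. The point is only that records are compared within a shared block rather than against every existing facility.

```python
from collections import defaultdict


def block_key(record):
    """Hypothetical blocking key: country code plus first letter of the name."""
    name = record["name"].strip().lower()
    return (record["country"].upper(), name[:1])


def candidate_pairs(new_records, existing_records):
    """Yield (new, existing) pairs that share a blocking key."""
    blocks = defaultdict(list)
    for existing in existing_records:
        blocks[block_key(existing)].append(existing)
    for new in new_records:
        # Only facilities in the same block are considered for comparison.
        for existing in blocks.get(block_key(new), []):
            yield new, existing


# Illustrative data only; these are not real OAR records.
existing = [
    {"name": "Acme Textiles", "country": "BD"},
    {"name": "Apex Knitwear", "country": "BD"},
    {"name": "Blue Garments", "country": "VN"},
]
submitted = [{"name": "acme textiles ltd", "country": "bd"}]

pairs = list(candidate_pairs(submitted, existing))
print(f"{len(pairs)} candidate pairs instead of {len(submitted) * len(existing)}")
```

The surviving candidate pairs would then be scored by the matching model. A real blocking scheme is learned from the data rather than hand-written like this one.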

The OAR uses simple string comparisons for its facility data. Pre-processing strings before comparing them maximizes the quality of match results.
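
As a rough sketch of what pre-processing before a string comparison can look like, the example below normalizes case, accents, punctuation, and whitespace, then compares the cleaned strings. The specific cleaning steps and the use of `difflib.SequenceMatcher` are assumptions for illustration; the source does not state which steps or comparators the OAR uses.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Case-fold, strip accents and punctuation, and collapse whitespace."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Replace punctuation and underscores with spaces; keep letters and digits.
    cleaned = re.sub(r"[^\w\s]|_", " ", without_accents.casefold())
    # Collapse runs of whitespace and trim the ends.
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


print(normalize("  Café  Textiles, Ltd. "))
print(similarity("ACME Textiles Ltd.", "acme textiles ltd"))
```

Without the normalization step, differences in capitalization or punctuation alone would lower the similarity score for two entries that describe the same facility.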

Each entry to the database then falls into one of three categories:

  • The OAR auto-accepts matches with 80% confidence or greater

  • It presents matches with at least 50% but less than 80% confidence to the contributor as a “potential match”, asking for human intervention to “confirm” or “reject” the potential match

  • Anything below the 50% threshold is automatically created as a new entity in the tool and allocated its own unique OAR ID
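
The three categories above can be sketched as a small function. This is an illustration under stated assumptions, not the OAR's code: the function name, the constant names, and the returned labels are hypothetical, and the confidence score is assumed to be a number between 0 and 1. Only the 80% and 50% thresholds come from the description above.

```python
AUTO_ACCEPT_THRESHOLD = 0.80      # 80% confidence or greater
POTENTIAL_MATCH_THRESHOLD = 0.50  # 50% up to, but not including, 80%


def classify_match(confidence):
    """Map a match confidence in [0, 1] to one of three outcomes."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= AUTO_ACCEPT_THRESHOLD:
        return "auto_accept"
    if confidence >= POTENTIAL_MATCH_THRESHOLD:
        # The contributor is asked to confirm or reject the match.
        return "potential_match"
    # Below 50%: a new facility is created with its own OAR ID.
    return "new_facility"


for score in (0.93, 0.80, 0.65, 0.50, 0.12):
    print(score, classify_match(score))
```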

See this technical blog for more detail on how the OAR processes data.

How data is moderated

Alongside the automated work of the OAR’s algorithm, the OAR team continually moderates data in the tool to eliminate any duplicates that may have crept into the database, to promote higher-quality data, and to update GPS coordinates. All data moderation is logged in our publicly available Moderation Log.

You can view the OAR’s full moderation policy here. Want to understand more about the tricky issue of duplicates? Read this.

How you can help

Spotted a duplicate in our data? Have access to more accurate GPS coordinates for a facility? Contact the team to report improvements or suggested changes to OAR data.

Merge facilities: share OAR IDs of the facilities that you think should be merged, i.e. where you have identified two or more entries in the database which you believe are the same facility.

Split facilities: share OAR IDs of the facilities that you believe should be unmerged/split, i.e. where two or more names and addresses have been identified as a match, but which you believe should be listed as separate entities.

Promote alternative name and address: in most instances, the master entry for a facility in the OAR is the original entry for that facility in the database. Share the OAR ID and the alternative name or address that you believe should be promoted to the primary position, i.e. where more complete or accurate address details are available.

Update GPS coordinates: members of the OAR community with local knowledge of facility locations may have access to more accurate GPS coordinates than those assigned to a facility profile by Google’s geo-coder for the OAR. Google’s geo-coding is considered best-in-class but is not perfect, so the OAR team welcomes the submission of alternative GPS coordinates. The OAR never overwrites name and address details or GPS coordinates, but additional GPS coordinates may be added to a facility’s profile, with the caveat that one facility cannot exist in two physical locations.

Report a facility as closed: you can report a facility as closed on the OAR by selecting "Report facility as closed" on a facility profile. Read our Facility Closure policy here.

Image shows how to report a facility as closed.