Schema alignment
A Wikibase schema is a template of Wikidata edits that is applied to each row in the project. This page describes how each part of this template works, and how it generates edits depending on the contents of the table cells.
Items
An item in the schema represents a set of changes on a particular Wikidata item, generated by a single row. This item can contain changes in terms (labels, descriptions and aliases) or statements.
It is possible to make edits on different items for each row of your table: just add multiple items in your schema. Each item has a subject, which can either be entered manually (when the item on which the edits should be made is the same for all rows) or taken from a reconciled column, by dropping that column in this field. In the latter case, the edits will depend on the reconciliation status of each cell:
- If the cell is matched to an item, edits will be made on that item;
- If the cell is marked as corresponding to a new item, a new item will be created for it. See New items for more details about how this works;
- If the cell has reconciliation candidates but has not been matched to any of them, the edit will be skipped (even if there is only one candidate with a high reconciliation score);
- If the cell is not reconciled or blank, the edit will be skipped.
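The four cases above can be summarised as a small decision function. This is only an illustrative sketch: the `Recon` enum, the `subject_action` function and the returned strings are hypothetical names chosen for this example, not part of OpenRefine.

```python
from enum import Enum
from typing import Optional


class Recon(Enum):
    """Hypothetical labels for the reconciliation status of a cell."""
    MATCHED = "matched to an item"
    NEW = "marked as new item"
    UNMATCHED = "has candidates, none chosen"
    NONE = "not reconciled or blank"


def subject_action(status: Recon, matched_id: Optional[str] = None) -> str:
    """Describe what happens to the edits generated for one row."""
    if status is Recon.MATCHED and matched_id:
        return f"edit {matched_id}"
    if status is Recon.NEW:
        return "create new item"
    # Unmatched candidates (even a single high-scoring one),
    # unreconciled cells and blank cells all lead to a skipped edit.
    return "skip"
```

For instance, `subject_action(Recon.MATCHED, "Q42")` describes an edit on Q42, while `subject_action(Recon.UNMATCHED)` is skipped regardless of candidate scores.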
Do not worry about the ordering of items in the schema or the order of your rows, as OpenRefine will rearrange your edits to optimize their upload. If your project makes edits on the same item across multiple rows, these edits will be merged together and performed in one edit. See Uploading your changes for details.
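Conceptually, merging edits across rows amounts to grouping the generated changes by target item. The sketch below illustrates the idea only; `merge_row_edits` and its input shape (a list of `(item_id, changes)` pairs, one per row) are assumptions made for this example and do not reflect OpenRefine's internal scheduler.

```python
from collections import defaultdict


def merge_row_edits(row_edits):
    """Group per-row changes by item so each item is edited once.

    row_edits: iterable of (item_id, list_of_changes) pairs.
    """
    merged = defaultdict(list)
    for item_id, changes in row_edits:
        merged[item_id].extend(changes)
    return dict(merged)
```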
Terms
Terms are the language-specific strings that you find at the top of Wikidata items: labels, descriptions and aliases. OpenRefine lets you edit these terms via the Wikidata schema.
Languages
Each term belongs to a particular language. Wikidata supports hundreds of languages, which are designated by language codes. For each term that you want to add to an item, you will need to specify the language for this term. There are two cases:
- Either the language is constant across your dataset: you know that all the names in a given column are spelled in the same language. In this case, type the name of the language in the input and select the language in the drop-down suggestion dialog. This will place the appropriate language code in the input.
- Or the language varies across your dataset. In this case, you need to provide a column of Wikimedia language codes that indicates the language for each term that you want to add. Just drag and drop this column to the language field. If there are any invalid language codes in this column, the corresponding terms will be ignored. OpenRefine will translate any deprecated language codes to their preferred values silently.
Labels
Wikidata items can have at most one label per language, so you need to choose whether to override any existing label (the default behaviour before 3.2) or to insert your label only if there is no label yet in the given language (the default behaviour from 3.2 onwards). When the cell providing the label is blank, nothing is changed, so it is not possible to remove labels.
Descriptions
Descriptions work like labels: there is at most one description per language, and OpenRefine can override existing descriptions or leave them unchanged. It is not possible to remove descriptions either.
Aliases
Aliases are added to the list of existing aliases in the given language. When adding an alias in a language where no label has been added yet, the alias is automatically promoted to a label for this language. It is not possible to remove aliases or to override any existing aliases.
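The rules for labels, descriptions and aliases can be sketched as follows. The item is modelled as a plain dictionary and `apply_terms` is a hypothetical helper written for this example. Two details are assumptions rather than documented behaviour: an alias is promoted when the item has no label in that language at that point, and duplicate aliases are not added twice.

```python
def apply_terms(item, lang, label="", description="", aliases=(), override=False):
    """Apply term edits for one language to a dict-based item (sketch)."""
    labels = item.setdefault("labels", {})
    descriptions = item.setdefault("descriptions", {})
    alias_map = item.setdefault("aliases", {})

    # Blank cells never change anything, so terms cannot be removed.
    if label.strip() and (override or lang not in labels):
        labels[lang] = label
    if description.strip() and (override or lang not in descriptions):
        descriptions[lang] = description

    for alias in aliases:
        if not alias.strip():
            continue
        if lang not in labels:
            # No label in this language yet: the alias becomes the label.
            labels[lang] = alias
        elif alias != labels[lang] and alias not in alias_map.setdefault(lang, []):
            # Otherwise the alias is appended; existing aliases are kept.
            alias_map[lang].append(alias)
    return item
```

With `override=False`, an existing English label is left alone, while aliases in a language without a label promote the first alias to a label.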
Statements
You can add statements in the schema: this will generate new statements on the corresponding items. These statements are merged with any existing statements on the actual Wikidata items, and the merging process depends on the upload medium. Finer control over the merging strategy is planned for a future release.
Main values
Statements must have main values: "novalue" or "somevalue" statements are not supported yet. The main value of a statement is a data value whose type depends on the property used for the statement. If the main value cannot be evaluated (for instance because one of the cells it depends on is empty), then the entire statement will be skipped.
See the data values section for more details about how to specify each type of data value and when they are skipped.
Qualifiers
Qualifiers can be added on each statement. When their values are skipped, only the qualifier will be discarded: the rest of the statement will still be added.
References
References can (and should) be added to back each statement. If values inside the reference are skipped, the corresponding part of the reference will be discarded but the reference will still be added (unless the reference becomes empty).
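The skipping rules for main values, qualifiers and references can be illustrated with a short sketch. Here a skipped value is represented by `None`; `build_statement` and the dictionary layout are hypothetical and only serve to show the cascade.

```python
def build_statement(prop, main_value, qualifiers=(), references=()):
    """Assemble one statement, applying the skipping rules (sketch).

    qualifiers: iterable of (property_id, value) pairs.
    references: iterable of references, each a list of (property_id, value).
    A value of None stands for a value that could not be evaluated.
    """
    if main_value is None:
        # A skipped main value discards the whole statement.
        return None

    # A skipped qualifier is dropped; the statement itself survives.
    kept_qualifiers = [(p, v) for p, v in qualifiers if v is not None]

    kept_references = []
    for reference in references:
        parts = [(p, v) for p, v in reference if v is not None]
        if parts:  # a reference left empty is not added at all
            kept_references.append(parts)

    return {
        "property": prop,
        "value": main_value,
        "qualifiers": kept_qualifiers,
        "references": kept_references,
    }
```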
Editing mode
The editing mode of a statement determines how it contributes to the corresponding entity. OpenRefine offers three editing modes:
- '''Add or merge''', which adds the statement or merges it with the first existing statement that matches it;
- '''Add''', which only adds the statement if there are no matching statements on the entity; otherwise, the existing statements are left untouched;
- '''Delete''', which deletes all matching statements.
The way statements are matched is controlled by the matching strategy, which can be configured for each statement in the schema.
Matching strategy
The matching strategy determines how the candidate statements generated by the schema are compared to the existing statements on the entity. OpenRefine offers three matching strategies:
- '''Property''', which compares statements by their main property only. This means that any two statements using the same main property will be considered equivalent. For instance, using this matching strategy in conjunction with the '''Delete''' editing mode will delete all statements with a particular main property on the target entity.
- '''Property and value''', which compares statements by their main property and main value only. This is what QuickStatements does. In addition, it is possible (and enabled by default) to match statement values in a lax way, for instance to ignore differences in trailing whitespace or rounding of quantities.
- '''Qualifiers''', which compares statements using their property, main value and qualifiers. It is possible to define a list of property identifiers which determines which qualifiers are discriminating. Other qualifiers will not be taken into account when comparing statements. By default, all qualifiers are taken into account. This matching strategy also supports lax value matching.
These matching strategies are not honoured when exporting to QuickStatements, as the QuickStatements formats do not make it possible to represent them.
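The interplay between editing modes and matching strategies can be sketched as follows. Statements are modelled as dictionaries, and the mode and strategy names are plain strings chosen for this example. What exactly gets merged in "Add or merge" mode is not specified above, so appending the candidate's references to the first match is an assumption; lax value matching is left out for brevity.

```python
def statements_match(candidate, existing, strategy, discriminating=None):
    """Return True when two statements are equivalent under a strategy."""
    if candidate["property"] != existing["property"]:
        return False
    if strategy == "property":
        return True
    if candidate["value"] != existing["value"]:
        return False
    if strategy == "property_and_value":
        return True

    # strategy == "qualifiers": compare discriminating qualifiers only
    # (all qualifiers when no list of property ids is given).
    def relevant(statement):
        return {
            (pid, value)
            for pid, value in statement.get("qualifiers", [])
            if discriminating is None or pid in discriminating
        }
    return relevant(candidate) == relevant(existing)


def apply_statement(existing, candidate, mode, strategy, discriminating=None):
    """Return the statement list after applying one candidate statement."""
    matched = [s for s in existing
               if statements_match(candidate, s, strategy, discriminating)]
    if mode == "delete":
        return [s for s in existing
                if not statements_match(candidate, s, strategy, discriminating)]
    if not matched:
        return existing + [candidate]
    if mode == "add":
        return list(existing)  # matching statements are left untouched
    # mode == "add_or_merge": merge into the first match (assumed behaviour).
    first = matched[0]
    merged = dict(first)
    old_refs = first.get("references", [])
    new_refs = [r for r in candidate.get("references", []) if r not in old_refs]
    merged["references"] = old_refs + new_refs
    return [merged if s is first else s for s in existing]
```

Combining `"delete"` with the `"property"` strategy removes every statement that uses the candidate's property, which is the case described in the list above.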
Lax value matching
When lax value matching is enabled, the following values are considered equal for statement matching purposes:
- strings which differ by whitespace at the beginning or end (such as "Berlin" and "Berlin ");
- URLs which differ by trailing slash or http/https differences (such as http://wikiba.se and https://wikiba.se/);
- quantities with the same unit, whose uncertainty domains overlap (such as 47±1 and 48±0.5);
- geographical coordinates whose uncertainty domains overlap (note that since the uncertainty of geographical coordinates is expressed in degrees, this does not guarantee a distance threshold below which the coordinates will match);
- monolingual text values whose values differ by leading or trailing whitespace;
- dates which differ in attributes which are rendered irrelevant by the lowest precision of both values to compare (such as 1976-01-01 and 1976).
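Some of these comparisons are easy to approximate in a few lines. The helpers below are illustrative only and are not OpenRefine's implementation: the function names are made up, uncertainty intervals are treated as closed (an assumption), and the date helper handles only positive years written as YYYY, YYYY-MM or YYYY-MM-DD.

```python
def lax_string_equal(a, b):
    """Strings are equal up to leading and trailing whitespace."""
    return a.strip() == b.strip()


def normalize_url(url):
    """Drop the http/https scheme and any trailing slash."""
    for scheme in ("https://", "http://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    return url.rstrip("/")


def lax_url_equal(a, b):
    return normalize_url(a) == normalize_url(b)


def quantities_overlap(a, a_unc, b, b_unc, unit_a=None, unit_b=None):
    """Same unit, and the intervals a±a_unc and b±b_unc intersect."""
    if unit_a != unit_b:
        return False
    return a - a_unc <= b + b_unc and b - b_unc <= a + a_unc


def lax_date_equal(a, b):
    """Compare only the components present at the lower precision."""
    parts_a, parts_b = a.split("-"), b.split("-")
    n = min(len(parts_a), len(parts_b))
    return parts_a[:n] == parts_b[:n]
```

For example, 47±1 covers 46 to 48 and 48±0.5 covers 47.5 to 48.5, so `quantities_overlap(47, 1, 48, 0.5)` holds.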
Ranks
All statement ranks are set to Normal. It is currently not possible to set a different rank.
Data values
Data values are the data that you can find as target of a statement (or qualifier, or part of a reference). Each property dictates a particular type of data value. In each case, OpenRefine uses a particular process to translate cell contents to a data value of the appropriate type. This section explains the process for all data types.
Items
Items are evaluated in the same way as the subjects of items in the schema. They can be input directly using the auto-suggest service provided, or any column reconciled against Wikidata can be used. Refer to the first Items section to see how they are evaluated.
Strings and external identifiers
Bare strings and external identifiers can be input directly as constants (if they do not change across rows) or using any column. If a reconciled column is used for a string value, it is the value of the cell that is going to be used, not the name of the reconciled item (which is what OpenRefine displays). Values are skipped when the cell is blank or null.
Monolingual texts
Monolingual texts consist of two parts:
- the language: see Languages for their structure;
- the value of the text: see the section above.
A monolingual text is skipped when any of its parts is skipped (that is, if the language or the text is invalid).
Dates
Dates are parsed from cell contents (or from any constant provided in the schema) and the precision of the date is inferred from its format. Here are the valid formats:
- YYYYM, such as 2001M (millennium precision)
- YYYYC, such as 1901C (century precision)
- YYYYD, such as 1981D (decade precision)
- YYYY, such as 1984 (year precision)
- YYYY-MM, such as 2019-03 (month precision)
- YYYY-MM-DD, such as 1897-08-14 (day precision)
Any value that does not match any of these formats will be ignored. All dates are represented in UTC, Gregorian calendar.
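Inferring the precision from the six basic formats can be approximated with regular expressions. This is a minimal sketch, not OpenRefine's parser: `date_precision` is a hypothetical helper, it does not check that months and days are in range, and it assumes the cell content carries no surrounding whitespace.

```python
import re

# One pattern per supported format, paired with the inferred precision.
_DATE_FORMATS = [
    (re.compile(r"[0-9]{4}M"), "millennium"),
    (re.compile(r"[0-9]{4}C"), "century"),
    (re.compile(r"[0-9]{4}D"), "decade"),
    (re.compile(r"[0-9]{4}"), "year"),
    (re.compile(r"[0-9]{4}-[0-9]{2}"), "month"),
    (re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}"), "day"),
]


def date_precision(text):
    """Return the precision implied by the format, or None if ignored."""
    for pattern, precision in _DATE_FORMATS:
        if pattern.fullmatch(text):
            return precision
    return None
```

A value such as `14/08/1897` matches none of the patterns and yields `None`, which corresponds to the value being ignored.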
In OpenRefine 3.3, the following new formats have been introduced:
- TODAY returns today's date with day precision. This will be evaluated when performing the edits (or exporting to QuickStatements);
- YYYY-MM-DD_QID can be used to specify a date in a particular calendar (such as the proleptic Julian calendar (Q1985786)).
In OpenRefine 3.5, the following new format has been introduced:
- -234 represents the year 234 BCE
Quantities
Quantities consist of two parts: the amount and the unit.
- the amount is mandatory and must be a string, such as 18,229.1020. The precision that is displayed will be respected (the same number of trailing zeros will be shown in Wikidata). By default, no upper and lower bounds will be set. To define these, one needs to use the engineering notation, such as 3.45E+3, which will be interpreted as 3,450±5. As usual, the amount can be provided as a constant or as a column variable. In the latter case, the values in the column must be strings.
- the unit is optional. It is an item, so it can be provided either with the auto-suggest dialog or as a reconciled column. It is important to note that if a reconciled column is used, any unreconciled cells will discard the entire quantity value. So a template for a quantity value is either always unit-less, or always has a unit.
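The way bounds follow from the notation can be reproduced with Python's `decimal` module, which keeps trailing zeros and exponents as written. This is a sketch under two assumptions: the comma in the example amount is a thousands separator, and the bounds are half a unit of the last significant digit. `parse_amount` is a hypothetical helper, not an OpenRefine function.

```python
from decimal import Decimal


def parse_amount(text):
    """Return (amount, lower_bound, upper_bound) for an amount string.

    Bounds are None unless exponent notation is used.
    """
    cleaned = text.replace(",", "")  # assumed thousands separator
    value = Decimal(cleaned)
    if "E" in cleaned.upper():
        # Half a unit of the last significant digit: 3.45E+3 is precise
        # to the tens, so the bounds are 5 on either side.
        half_step = Decimal(1).scaleb(value.as_tuple().exponent) / 2
        return value, value - half_step, value + half_step
    return value, None, None
```

Here `parse_amount("3.45E+3")` yields bounds of 3445 and 3455, while `parse_amount("18,229.1020")` keeps the trailing zero and sets no bounds.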
Globe coordinates
Geographic coordinates are specified as strings with the following formats, where all components are floating point numbers in degrees:
- latitude,longitude for a default precision of ten micro degrees (for instance: 49.265278,4.028611 can be used to indicate the position of Reims, France);
- latitude,longitude,precision when specifying an explicit precision (for instance: 49.265278,4.028611,0.1 can be used to indicate the position of Reims within a tenth of a degree).
All globe coordinates are on Earth (Q2).
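Reading the two coordinate formats can be sketched as follows. `parse_coordinates` and the returned dictionary are hypothetical; the default precision of 0.00001 is ten micro degrees as stated above, and no range check on latitude or longitude is performed here.

```python
DEFAULT_PRECISION = 1e-5  # ten micro degrees


def parse_coordinates(text):
    """Parse 'lat,lon' or 'lat,lon,precision'; return None if invalid."""
    parts = text.split(",")
    if len(parts) not in (2, 3):
        return None
    try:
        numbers = [float(part) for part in parts]
    except ValueError:
        return None
    precision = numbers[2] if len(numbers) == 3 else DEFAULT_PRECISION
    return {
        "latitude": numbers[0],
        "longitude": numbers[1],
        "precision": precision,
        "globe": "Q2",  # all coordinates are on Earth
    }
```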
If your coordinates are in a different format, such as 49° 15′ 55″ N, 4° 1′ 43″ E, you will need to convert them to decimal format first.
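One possible way to do this conversion outside OpenRefine is sketched below in Python. `dms_to_decimal` is a hypothetical helper that assumes exactly the degree, prime and double-prime symbols used in the example, with latitude given first; other notations would need a different pattern.

```python
import re

# Degrees, minutes, seconds and a hemisphere letter, e.g. 49° 15′ 55″ N
_DMS = re.compile(r"([0-9]+)°\s*([0-9]+)′\s*([0-9]+(?:\.[0-9]+)?)″\s*([NSEW])")


def dms_to_decimal(text):
    """Convert 'D° M′ S″ N, D° M′ S″ E' into 'latitude,longitude'."""
    parts = _DMS.findall(text)
    if len(parts) != 2:
        raise ValueError(f"expected two DMS components in {text!r}")
    values = []
    for degrees, minutes, seconds, hemisphere in parts:
        value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
        # Southern and western hemispheres are negative.
        values.append(-value if hemisphere in "SW" else value)
    return f"{values[0]:.6f},{values[1]:.6f}"
```

Applied to the example above, this produces `49.265278,4.028611`, which is in the latitude,longitude format.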
Media on Commons
Media on Wikimedia Commons is treated like strings, whose values must exactly match filenames on Commons. These values are not checked during schema evaluations: if they are wrong, uploading the statements will fail.
Tabular data and Geoshapes must be prefixed with the Data: namespace. This is indicated by the placeholder in the field that appears when constructing the schema.
Properties
Properties are always constants: there is currently no way to reconcile a column against properties. They have to be selected with the auto-suggest dialog.
Other data types
URLs, mathematical expressions and other textual datatypes are supported and treated as strings. At the time of writing, all datatypes supported by Wikidata are supported by OpenRefine.