Data sets used for experimental evaluation in the related publication: Evaluating Web Table Annotation Methods: From Entity Lookups to Entity Embeddings
The data sets are contained within archive folders corresponding to the three gold standard data sets used in the related publication. Each is presented in both .csv and .json formats.
The gold standard data sets are collections of web tables:
T2D consists of a schema-level gold standard of 1,748 Web tables, manually annotated with class- and property-mappings, as well as an entity-level gold standard of 233 Web tables.
Limaye consists of 400 manually annotated Web tables with entity-, class-, and property-level correspondences, where single cells (not rows) are mapped to entities. The corrected version of this gold standard is adapted to annotate rows with entities, from the annotations of the label column cells.
WikipediaGS is an instance-level gold standard developed from 485K Wikipedia tables, in which links in the label column are used to infer the annotation of a row to a DBpedia entity.
The .csv files are formatted as double quoted (' " ') fields, separated by commas (',').
In the tables files, each file corresponds to one table, each field represents a column, and each line represents a different row.
In the entities files, there are only three fields:
"DBpedia uri","cell string","row number"
representing the correct annotation, the string of the label column cell, and the row (starting from 0) in which this mapping is found, respectively.
Tables and entities files that correspond to the same table have the same filename.
The same formatting and naming convention is used in the T2D gold standard (http://webdatacommons.org/webtables/goldstandard.html).
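The .csv layout described above can be read with Python's standard csv module. The following is a minimal sketch, not part of the original distribution: the function names read_table and read_entities are hypothetical, the inline sample rows are made up for illustration, and UTF-8 encoding of the real files is an assumption.

```python
import csv
import io


def read_table(stream):
    """Return a tables file as a list of rows, each row a list of cell strings."""
    return [record for record in csv.reader(stream, delimiter=",", quotechar='"')]


def read_entities(stream):
    """Return an entities file as (dbpedia_uri, cell_string, row_number) tuples."""
    mappings = []
    for record in csv.reader(stream, delimiter=",", quotechar='"'):
        if not record:  # skip blank lines
            continue
        uri, cell, row = record
        mappings.append((uri, cell, int(row)))  # row numbers start from 0
    return mappings


# Made-up stand-ins for a tables file and the entities file with the same filename.
# For real files, use e.g. open(path, newline="", encoding="utf-8") (encoding assumed).
sample_table = io.StringIO('"City","Country"\n"Paris","France"\n"Berlin","Germany"\n')
sample_entities = io.StringIO(
    '"http://dbpedia.org/resource/Paris","Paris","1"\n'
    '"http://dbpedia.org/resource/Berlin","Berlin","2"\n'
)

table = read_table(sample_table)
entities = read_entities(sample_entities)

# Index the annotations by row number so they can be joined with the table rows.
uri_by_row = {row: uri for uri, _cell, row in entities}
```

Indexing by row number keeps the join independent of which column is the label column, which the entities file does not state explicitly.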
Each line in a .json file corresponds to a table, written as a JSONObject. T2D and Limaye tables files contain only one line (table) per file, while the Wikipedia gold standard contains multiple lines (tables) per .json file. In T2D and Limaye, the entity mappings of a table can be found in the entities file with the same filename, while in Wikipedia, the entity mappings of a table can be found in the line of the entities files whose "tableId" field has the same value as that of the corresponding table.
The contents of a table in .json are given as a two-dimensional array (a JSONArray of JSONArrays), called "contents". Each JSONArray in the contents represents a table row. Each element of this array is a JSONObject, representing one cell of the row. The field "data" of each cell contains the cell string contents, while there may also be a field "isHeader" to denote whether the current cell is in a header row. In the Wikipedia gold standard there may also be a "wikiPageId" field, denoting the existing hyperlink of this cell to a Wikipedia page. It only contains the suffix of a Wikipedia URL, skipping the first part "https://en.wikipedia.org/wiki/".
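As an illustration of the structure above, the following sketch parses one table line into header rows, body rows, and reconstructed Wikipedia links. It is an assumption-laden example rather than the authors' code: the function name parse_table is hypothetical, the sample line is made up, and "isHeader" is assumed to be a JSON boolean.

```python
import json

WIKI_PREFIX = "https://en.wikipedia.org/wiki/"


def parse_table(line):
    """Split one .json table line into header rows, body rows, and cell links."""
    table = json.loads(line)
    header, body, links = [], [], {}
    for r, row in enumerate(table["contents"]):
        cells = [cell.get("data", "") for cell in row]
        # "isHeader" is optional; a missing field is treated as a non-header cell.
        if any(cell.get("isHeader", False) for cell in row):
            header.append(cells)
        else:
            body.append(cells)
        for c, cell in enumerate(row):
            if "wikiPageId" in cell:
                # Only the URL suffix is stored, so the common prefix is restored here.
                links[(r, c)] = WIKI_PREFIX + str(cell["wikiPageId"])
    return header, body, links


# A made-up table line in the described format (not taken from the data sets).
sample_line = json.dumps({
    "contents": [
        [{"data": "City", "isHeader": True}, {"data": "Country", "isHeader": True}],
        [{"data": "Paris", "wikiPageId": "Paris"}, {"data": "France"}],
    ]
})

header, body, links = parse_table(sample_line)
```

The links dictionary is keyed by (row, column) positions in the original "contents" array, so the row index lines up with the row numbers used in the entities files.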
The entity mappings files are in the same format as in csv:
["DBpedia uri","cell string",row number] inside the "mappings" field of a json file.
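For the Wikipedia gold standard, where tables and mappings are matched through "tableId", the lookup could be sketched as follows. The function name index_mappings, the sample lines, and the integer type of "tableId" are assumptions made for illustration only; the presence of a "tableId" field in each table line is inferred from the description above.

```python
import json


def index_mappings(entity_lines):
    """Map each tableId to a {row_number: (dbpedia_uri, cell_string)} dictionary."""
    index = {}
    for line in entity_lines:
        if not line.strip():  # skip blank lines
            continue
        record = json.loads(line)
        rows = {int(row): (uri, cell) for uri, cell, row in record["mappings"]}
        index[record["tableId"]] = rows
    return index


# Made-up lines standing in for an entities file and one line of a tables file.
entity_lines = [
    json.dumps({"tableId": 7,
                "mappings": [["http://dbpedia.org/resource/Paris", "Paris", 1]]}),
    json.dumps({"tableId": 8, "mappings": []}),
]
table_line = json.dumps({
    "tableId": 7,
    "contents": [[{"data": "City", "isHeader": True}], [{"data": "Paris"}]],
})

index = index_mappings(entity_lines)
table = json.loads(table_line)

# Rows without an entry in the result have no entity annotation.
row_annotations = index.get(table["tableId"], {})
```

For T2D and Limaye the same "mappings" parsing applies, but the pairing is done by filename rather than by "tableId".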
Note on license: please refer to the README.txt. The data is derived from Wikipedia and other sources, which may have different licenses.
Wikipedia contents can be shared under the terms of Creative Commons, as outlined on Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content
The correspondences of the T2D gold standard are provided under the terms of the disclaimer of warranties and limitation of liabilities that apply to the Common Crawl corpus.
The DBpedia subset is licensed under the terms of the Creative Commons Attribution-ShareAlike License and the GNU Free Documentation License that applies to DBpedia.
The Limaye gold standard was downloaded from: http://websail-fe.cs.northwestern.edu/TabEL/ (download date: August 25, 2016). Please refer to the original website and the following paper for more details and citation information:
G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and Searching Web Tables
Using Entities, Types and Relationships. PVLDB, 3(1):1338–1347, 2010.
Also: THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.