The automatic fusion of conflicting entity values can easily introduce incorrect information into a knowledge graph, and even restricted human intervention is problematic on a large scale.
The order in which data sources and their updates are integrated into a Knowledge Graph can significantly influence the final quality of the graph.
Assessing the population accuracy of entity and relation types in a knowledge graph requires extending standard metrics like precision and recall to compare against a gold standard.
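Such an extension can be illustrated with a small sketch: population accuracy for one entity type reduces to set-based precision and recall over canonical identifiers, compared against a gold-standard set. The identifiers and the gold standard below are hypothetical.

```python
# Minimal sketch: precision/recall of the entities populated for one type,
# measured against a gold-standard set of identifiers (all data hypothetical).

def precision_recall(predicted: set, gold: set) -> tuple:
    """Set-based precision and recall over entity identifiers."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Entities of type 'City' in the KG vs. a gold standard:
kg_cities = {"Q64", "Q90", "Q84", "Q1"}     # one spurious entity (Q1)
gold_cities = {"Q64", "Q90", "Q84", "Q60"}  # one missing entity (Q60)

p, r = precision_recall(kg_cities, gold_cities)
print(p, r)  # 0.75 0.75
```

The same computation applies per relation type by comparing sets of (subject, relation, object) triples instead of entity identifiers.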
It is unclear how well the dstlr pipeline handles incremental knowledge graph updates, such as the deletion or updating of entities, despite its ability to ingest new batches of documents via Apache Solr.
Knowledge graph solutions often use rule-based mappings to extract entities and relations from semi-structured sources, as seen in DBpedia, Yago, DRKG, VisualSem, and WorldKG.
Ehrlinger and Wöß define a knowledge graph as a system that 'acquires and integrates information into an ontology and applies a reasoner to derive new knowledge.'
Integrating a new source into a knowledge graph requires matching the source ontology or schema with the existing knowledge graph ontology to determine which elements already exist and which should be added.
Evaluating the quality of a Knowledge Graph depends on the scope of the graph, and is generally easier for domain-specific Knowledge Graphs than for very large Knowledge Graphs covering many domains, for which completeness may not be possible.
Initial knowledge graph versions are typically created using a batch-like process, such as transforming a single data source or integrating multiple sources.
Incremental Entity Resolution requires developing incremental versions of blocking, matching, and clustering phases to handle new entities alongside existing ones in a Knowledge Graph.
Streaming-like data ingestion into a knowledge graph requires support for dynamic, real-time matching of new entities with existing knowledge graph entities.
Off-the-shelf named-entity recognition tools do not provide canonicalized identifiers for extracted mentions, necessitating a second step to link mentions to existing entities in a knowledge graph or to assign new identifiers.
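A minimal sketch of that second step, under assumed inputs: a label-to-identifier index represents the existing knowledge graph, known mentions are canonicalized to existing identifiers, and unknown (emerging) mentions receive newly minted identifiers. The index contents and the `NEW:` identifier scheme are illustrative assumptions.

```python
# Hedged sketch: link each extracted mention to an existing KG identifier,
# or mint a new identifier for an emerging entity. The label index and the
# 'NEW:' id scheme are illustrative assumptions, not a specific tool's API.

import itertools

class MentionLinker:
    def __init__(self, label_index: dict):
        self.label_index = dict(label_index)   # label -> canonical KG id
        self._counter = itertools.count(1)

    def link(self, mention: str) -> str:
        key = mention.lower()
        if key in self.label_index:
            return self.label_index[key]       # canonicalize to existing id
        new_id = f"NEW:{next(self._counter)}"  # emerging entity: new id
        self.label_index[key] = new_id         # remember it for next time
        return new_id

linker = MentionLinker({"berlin": "Q64", "paris": "Q90"})
print(linker.link("Berlin"))   # Q64
print(linker.link("Leipzig"))  # NEW:1  (not yet in the KG)
print(linker.link("Leipzig"))  # NEW:1  (same id on repeat)
```

Real entity-linking systems additionally rank candidate entities by context, rather than relying on exact label matches.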
The 'KGClean' system is an embedding-powered knowledge graph cleaning framework, published as a CoRR (arXiv) preprint in 2020.
Entity Linking systems face specific challenges including coreference resolution, where entities are referred to indirectly (e.g., via pronouns), and the handling of emerging entities that are recognized but not yet present in the target Knowledge Graph.
The quality of a knowledge graph and its data sources can be measured along four primary dimensions: correctness, freshness, comprehensiveness, and succinctness.
Knowledge graph toolsets are mostly closed-source, which limits their usability for new knowledge graph projects or research investigations.
Removing irrelevant entities that do not pertain to the intended domain can be preferable to filling in missing data, as it prevents the knowledge graph from becoming unnecessarily bloated.
The HKGB platform accepts annotations for its knowledge graph only if there is high agreement across annotators, defined as exceeding a confidence threshold of 0.81.
Ontology and schema matching is a key prerequisite for integrating individual ontologies or knowledge graphs into an existing version of an overall knowledge graph.
When integrating data into a knowledge graph, systems can determine relevant subsets of data based on the quality or trustworthiness of a source and the importance of single entities.
Calculating disjoint axioms by identifying wrong type statements based on existing relations, such as domain and range checks, is a method for validating knowledge graph consistency.
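A domain/range check of this kind can be sketched as follows; the toy ontology, type assignments, and triples are hypothetical. Each triple's subject and object types are compared against the domain and range declared for the relation, and mismatches are flagged as candidate wrong type statements.

```python
# Sketch of a domain/range consistency check: flag triples whose subject or
# object type violates the relation's declared domain/range. The ontology,
# type map, and triples below are hypothetical.

ontology = {  # relation -> (expected subject type, expected object type)
    "recordedBy": ("Song", "Artist"),
    "locatedIn": ("City", "Country"),
}
entity_types = {"s1": "Song", "a1": "Artist", "c1": "City", "x1": "Genre"}

def find_type_violations(triples):
    violations = []
    for subj, rel, obj in triples:
        domain, rng = ontology[rel]
        if entity_types.get(subj) != domain or entity_types.get(obj) != rng:
            violations.append((subj, rel, obj))
    return violations

triples = [("s1", "recordedBy", "a1"), ("x1", "recordedBy", "a1")]
print(find_type_violations(triples))  # [('x1', 'recordedBy', 'a1')]
```

In practice such checks are usually expressed declaratively, e.g. via OWL axioms or SHACL shapes, rather than hand-coded.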
Ontologies enable the inference of new implicit knowledge from explicitly represented information in a knowledge graph.
The choice between using RDF, Property Graph Models (PGM), or a custom data model depends on the targeted application or use case of the final knowledge graph.
Knowledge graph systems require support for incremental updates, which can be performed periodically in a batch-like manner or continuously in a streaming-like fashion to maintain data freshness.
Blocking for incremental or streaming Entity Resolution requires identifying a subset of existing Knowledge Graph entities for matching to ensure efficiency, as Knowledge Graphs are typically large and growing.
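The idea can be sketched with a deliberately naive blocking key (a name prefix): existing knowledge graph entities are indexed once by key, and a new entity is compared only against the entities in its block instead of the whole graph. The key function and entity records are illustrative assumptions.

```python
# Minimal blocking sketch for incremental ER: a new entity is compared only
# against existing KG entities that share its blocking key. The name-prefix
# key is a deliberately naive, illustrative choice.

from collections import defaultdict

def blocking_key(entity: dict) -> str:
    return entity["name"][:3].lower()

# Existing KG entities, indexed once by blocking key.
kg_index = defaultdict(list)
for e in [{"id": "e1", "name": "Berlin"}, {"id": "e2", "name": "Bern"},
          {"id": "e3", "name": "Paris"}]:
    kg_index[blocking_key(e)].append(e)

def candidates(new_entity: dict) -> list:
    """Return only the block's entities instead of scanning the whole KG."""
    return kg_index[blocking_key(new_entity)]

print([e["id"] for e in candidates({"name": "Berlim"})])  # ['e1', 'e2']
```

Production systems use more robust keys (phonetic codes, token n-grams, embeddings) and keep the index updated as new entities are added.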
Maintaining an audit trail of changes and ensuring traceability in a knowledge graph supports data governance and reproducibility.
Evaluating the quality of a constructed knowledge graph is difficult because it ideally requires a near-perfect 'gold standard' result for both the initial knowledge graph and its subsequent updates.
Methods for fixing or mitigating detected quality issues by refining and repairing the knowledge graph are required for successful construction.
Deep provenance allows for fact-level changes in a knowledge graph without re-computing the entire pipeline and helps identify the origin of incorrect values.
Assessing the quality of a Knowledge Graph is extremely challenging because there are many valid ways to structure and populate Knowledge Graphs, and even subproblems like evaluating the quality of a Knowledge Graph ontology are difficult.
Building an initial knowledge graph structure from existing data sources requires cleaning and enrichment processes to ensure sufficient domain coverage and quality.
A knowledge graph's ontology defines the concepts, relationships, and rules governing the semantic structure within a knowledge graph, including the types and properties of entities and their relationships.
The XI Pipeline constructs knowledge graphs semi-automatically from unstructured or semi-structured documents like publications and social network content.
NELL continuously crawls the web for new data but performs updates to the knowledge graph in a batch-like manner.
Succinctness in a knowledge graph requires a high focus of data, such as on a single domain, and the exclusion of unnecessary information to improve resource consumption, scalability, and system availability.
Knowledge graph-specific approaches have limitations regarding scalability to many sources, support for incremental updates, metadata management, ontology management, entity resolution and fusion, and quality assurance.
The authors of 'Construction of Knowledge Graphs: State and Challenges' focused their analysis on open and semi-automatic knowledge graph implementations, while only briefly discussing closed knowledge graphs because their data is not publicly accessible and their construction techniques are not verifiable.
The intended use cases of a Knowledge Graph influence its quality requirements and should be considered during construction and evaluation.
Validating a knowledge graph's data integrity concerning its underlying semantic structure (ontology) is a specific quality aspect of knowledge graph construction.
A central metadata repository simplifies access to knowledge graph-relevant metadata, whereas using multiple metadata repositories offers more flexibility but can introduce complexity, inconsistencies, and hinder discovery.
The SAGA system maintains a 'Live Graph' that continuously integrates streaming data and references stable entities from a batch-based Knowledge Graph, utilizing an inverted index and a key-value store for scalability and near-real-time query performance.
Provenance metadata can capture the steps of schema and data transformations applied within a knowledge graph pipeline.
VisualSem is a high-quality knowledge graph designed for vision and language tasks, as documented by H. Alberts et al. in 2020.
Wikipedia's category system can be used to derive relevant classes for a knowledge graph through NLP-based 'category cleaning' techniques.
The SLOGERT workflow operates in two phases: (1) extract data and parameters from log files to generate RDF templates conforming to the knowledge graph ontology, and (2) convert log files to graphs and integrate them into the final knowledge graph by connecting local context information to computer log domain identifiers and external sources.
The temporal development of a knowledge graph can be managed through a versioning concept where new versions of the graph are periodically released.
Correctness in a knowledge graph implies the validity of information (accuracy) and consistency, meaning each entity, concept, relation, and property is canonicalized with a unique identifier and included exactly once.
The term 'knowledge graph' dates back to 1973.
Knowledge graphs typically either integrate sources from different domains (cross-domain) or focus on a single domain such as research, biomedicine, or Covid-19.
Fact-level metadata in a knowledge graph can be stored either embedded with the data items or in parallel to the data using unique IDs for referencing.
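The "parallel" variant can be sketched as two stores joined by a unique fact ID: the fact store holds the triples, and a separate metadata store holds provenance, confidence, and timestamps keyed by the same IDs. The field names and values are illustrative assumptions.

```python
# Sketch of parallel fact-level metadata: facts carry unique ids, and
# metadata (provenance, confidence, timestamps) lives in a separate store
# referenced by those ids. Field names and values are hypothetical.

facts = {
    "f1": ("Q64", "country", "Q183"),
}
fact_metadata = {
    "f1": {"source": "dbpedia", "confidence": 0.93, "created": "2023-01-10"},
}

def fact_with_metadata(fact_id: str):
    """Join a fact with its metadata record via the shared id."""
    return facts[fact_id], fact_metadata.get(fact_id, {})

triple, meta = fact_with_metadata("f1")
print(triple, meta["source"])  # ('Q64', 'country', 'Q183') dbpedia
```

The embedded variant would instead attach the metadata fields directly to each fact record, e.g. via RDF-star annotations or property-graph edge properties.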
Recent approaches to entity resolution for knowledge graphs utilize multi-source big data techniques, Deep Learning, or knowledge graph embeddings.
Hao et al. introduced detective rules (DRs) that enable actionable decisions on relational data by establishing connections between a relation and a knowledge graph.
DBpedia and Yago are limited to batch updates that require a full recomputation of the knowledge graph.
The AI-KG knowledge graph, which is an automatically generated knowledge graph of artificial intelligence, was presented by D. Dessì, F. Osborne, D. Reforgiato Recupero, D. Buscaldi, E. Motta, and H. Sack at the International Semantic Web Conference in 2020.
Data cleaning approaches used in Knowledge Graph construction can be applied to the final Knowledge Graph to identify outliers or contradicting information.
Methods for learning ontologies from relational databases focus on reverse engineering or using mappings to transform a relational database schema into an ontology or knowledge graph. Reverse engineering allows for the derivation of an Entity-Relationship diagram or conceptual model from the relational schema, though this requires careful handling of trigger and constraint definitions to prevent semantic loss.
Figure 1 in 'Construction of Knowledge Graphs: State and Challenges' visualizes a simplified knowledge graph containing ten entities across eight types (Country, City, Artist, Album, Record Label, Genre, Song, Year) and various relationships, with ontological information like types or 'is-a' relations represented by dashed lines.
Paulheim et al. [19] define retrospective evaluation as a method where human judges assess the correctness of a knowledge graph, typically restricted to a sample due to the voluminous nature of the graphs, with accuracy or precision as the reported quality metric.
Knowledge Completion involves extending a knowledge graph by learning missing type information, predicting new relations, and enhancing domain-specific data.
Extraction methods for semi-structured data typically combine data cleaning and rule-based mappings to transform input data into a knowledge graph, targeting defined classes and relations of an existing ontology.
Accuracy in a Knowledge Graph indicates the correctness of facts, including type, value, and relation correctness, and can be separated into syntactic accuracy (assessing wrong value datatype/format) and semantic accuracy (assessing wrong information).
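Syntactic accuracy is the mechanically checkable half: a sketch with assumed per-property format patterns is shown below. Semantic accuracy (a well-formed but factually wrong value) cannot be caught this way and requires comparison against reference data.

```python
# Sketch of a syntactic accuracy check: validate that literal values match
# the datatype/format expected for a property. The property names and
# patterns are assumptions for illustration.

import re

FORMATS = {
    "birthDate": re.compile(r"\d{4}-\d{2}-\d{2}"),  # ISO 8601 date
    "population": re.compile(r"\d+"),               # non-negative integer
}

def syntactically_valid(prop: str, value: str) -> bool:
    pattern = FORMATS.get(prop)
    # Properties without a declared format are not checked.
    return bool(pattern.fullmatch(value)) if pattern else True

print(syntactically_valid("birthDate", "1967-03-15"))  # True
print(syntactically_valid("birthDate", "15.03.1967"))  # False
print(syntactically_valid("population", "3.6M"))       # False
```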
The dstlr tool extracts mentions and relations from text, links them to Wikidata, and populates the resulting knowledge graph with additional facts from Wikidata.
Deep or statement-level provenance refers to metadata attached to individual facts (entities, relations, properties) in a knowledge graph, such as creation date, confidence scores of extraction methods, or the original text paragraph from which the fact was derived.
Most existing entity resolution approaches for knowledge graphs are designed for static or batch-like processing where matches are determined within or between datasets of a fixed size.
The NELL (Never-Ending Language Learning) project features the highest number of continuously and incrementally generated knowledge graph versions, with over 1100 dumps.
Quantifying criteria for source selection and computing the cost of integrating a source helps save effort while producing a high-quality knowledge graph.
Data integration and canonicalization in knowledge graphs involve entity linking, entity resolution, entity fusion, and the matching and merging of ontology concepts and properties.
Recomputing a knowledge graph from scratch for every update results in redundant computation, which limits scalability as the number and size of input sources increase.
Comprehensiveness in a knowledge graph requires good coverage of all relevant data (completeness) and the combination of complementing data from different sources.
Provenance metadata is essential for Knowledge Graph quality assurance because it helps explain and maintain data regarding the context and validity of conflicting values.
CS-KG is a large-scale knowledge graph that aggregates research entities and claims within the field of computer science, as detailed by Dessì et al. in the 2022 proceedings of the 21st International Semantic Web Conference.
The SAGA approach is one of the few methods that attempts to maintain the trustworthiness of facts within a knowledge graph.
A comparative analysis of current knowledge graph-specific pipelines and toolsets, as presented in 'Construction of Knowledge Graphs: State and Challenges', reveals significant differences in input data structure, construction methods, ontology management, the ability to integrate new information, and the tracking of provenance.
Checking a single fact across different datasets is a method to detect inaccurate facts in a knowledge graph.
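One simple realization is majority voting: the fact is looked up in each independent dataset that knows the subject/property, and it is accepted only if most of them agree. The source data below is hypothetical.

```python
# Illustrative cross-check of a single fact against independent datasets:
# accept the fact only if a majority of sources that cover it agree.
# All source data is hypothetical.

def cross_check(fact, sources) -> bool:
    """Majority vote among sources that know the (subject, property) pair."""
    subj, prop, value = fact
    votes = [src[(subj, prop)] == value
             for src in sources if (subj, prop) in src]
    return votes.count(True) > len(votes) / 2 if votes else False

sources = [
    {("Berlin", "country"): "Germany"},
    {("Berlin", "country"): "Germany"},
    {("Berlin", "country"): "Austria"},  # one erroneous source
]
print(cross_check(("Berlin", "country", "Germany"), sources))  # True
```

Weighted variants replace the uniform vote with per-source trust scores.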
Knowledge graph entities possess unique identifiers.
Van Assche et al. utilize Linked Data Event Streams (LDES) to continuously update a knowledge graph with changes originating from underlying data sources.
Hogan et al. argue that the definition of a knowledge graph provided by Ehrlinger and Wöß is too specific and excludes various industrial knowledge graphs that helped popularize the concept.
Lassila et al. conclude that both RDF and Property Graph Models (PGM) are qualified to meet their respective challenges but neither is perfect for every use case, recommending increased interoperability between both models to reuse existing techniques.
Entity Resolution and Fusion is the process of identifying matching entities and merging them within a knowledge graph.
To mitigate the impact of integration order on Knowledge Graph quality, it is advisable to integrate the highest-quality data sources first.
Open Information Extraction requires a secondary canonicalization step to deduplicate extracted relations and link them to synonymous relations already contained in the knowledge graph.
Evaluating the accuracy of a knowledge graph against a manually labeled subset of entities and relations is a conventional approach, but it is costly and typically results in small gold standard datasets according to Paulheim et al. [19].
KATARA utilizes crowdsourcing to verify whether data values that do not match an existing knowledge graph are correct.
Distant supervision is a common method for link prediction that involves linking knowledge graph entities to a text corpus using NLP approaches and identifying patterns between those entities within the text.
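A toy version of the idea: sentences that mention both entities of a known knowledge graph pair are taken as (noisy) evidence for the relation, and the text between the two mentions becomes a candidate extraction pattern. The pair, corpus, and relation name are hypothetical.

```python
# Toy distant-supervision sketch: sentences mentioning a known KG pair are
# treated as noisy evidence, and the text between the two mentions becomes
# a candidate pattern for the relation. All data is hypothetical.

import re

kg_pairs = {("Bob Dylan", "Columbia Records"): "signedWith"}
corpus = ["Bob Dylan signed a contract with Columbia Records in 1961."]

patterns = {}
for (e1, e2), rel in kg_pairs.items():
    for sent in corpus:
        m = re.search(re.escape(e1) + r"(.+?)" + re.escape(e2), sent)
        if m:
            patterns.setdefault(rel, []).append(m.group(1).strip())

print(patterns)  # {'signedWith': ['signed a contract with']}
```

Real systems aggregate many such noisy matches and train a classifier, since individual sentences mentioning a pair need not express the relation.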
Common techniques for determining metadata for Knowledge Graph data sources include data profiling, topic modeling, keyword tagging, and categorization.
Quality Assurance and knowledge graph completion steps are not required for every knowledge graph update and may be executed asynchronously within separate pipelines.
TripleCheckMate is a crowdsourcing tool that allows users to evaluate single resources in an RDF knowledge graph by annotating found errors with one of 17 error classes.
While high degrees of automation are possible for individual Knowledge Graph construction tasks, human interaction generally tends to significantly improve data quality, though it can become a limiting factor for scalability regarding data volume and the number of sources.
Knowledge graph quality can be improved by enriching domain knowledge through loading specific entity information from external, open-accessible knowledge bases, rather than integrating entire external data collections.
Data quality problems should be addressed during the import process to prevent the ingestion of low-quality or incorrect data into a knowledge graph.
Data profiling and cleaning techniques can be applied to identify erroneous values in a knowledge graph based on their distribution.
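As a minimal example of distribution-based profiling, values of a numeric property can be flagged when they deviate strongly from that property's observed distribution (here via a simple z-score). The population figures are hypothetical, and the threshold is an illustrative choice.

```python
# Sketch: flag property values that deviate strongly from the value
# distribution observed for that property (simple z-score profiling).

import statistics

def distribution_outliers(values, threshold=2.0):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing to flag
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical 'population' values for entities of type City; one is
# clearly erroneous (e.g., a unit or extraction error).
populations = [3_600_000, 2_100_000, 1_800_000, 2_900_000, 1_400_000,
               3_100_000, 2_200_000, 1_900_000, 2_400_000, 360_000_000]
print(distribution_outliers(populations))  # [360000000]
```

Robust statistics (median absolute deviation) are usually preferred in practice, since extreme outliers inflate the mean and standard deviation themselves.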
Human-in-the-loop approaches for knowledge graph evaluation may require sampling, so that only sub-graphs of the entire knowledge graph have to be assessed, because manual evaluation does not scale to the full graph.
Freshness (timeliness) in a knowledge graph requires the continuous updating of instances and ontological information to incorporate changes from relevant data sources.
The matching step of incremental Entity Resolution involves performing pair-wise comparisons between new entities and existing Knowledge Graph entities identified by the preceding blocking step.
Graph data models should facilitate the knowledge graph construction process by supporting the acquisition, transformation, and integration of heterogeneous data from different sources, utilizing formats that allow for seamless data exchange between pipeline steps.
Ontology development is the incremental process of creating or extending an ontological knowledge base, which is required for both the initial construction of a knowledge graph and its subsequent updates to incorporate new information.
Neural methods for entity resolution in knowledge graphs have recently faced increased scrutiny following a period of significant hype.
Li et al. investigate the correctness of a fact in a knowledge graph by searching for evidence in other knowledge bases, web data, and search logs.
Entity Linking (EL) or Named Entity Disambiguation (NED) is the process of linking recognized named entities in text to a knowledge base or Knowledge Graph (KG) by selecting the correct entity from a set of candidates.
Using external datasets to validate knowledge graph facts can introduce errors from two sources: errors within the target knowledge graph itself and errors in the linkage between the target knowledge graph and the external reference sources.
WorldKG utilizes an unsupervised machine learning approach for ontology alignment, whereas most other knowledge graph approaches perform alignment and merging of ontologies manually.
The SAGA knowledge graph construction system manages incremental integration by updating a stable knowledge graph in batches while simultaneously serving a live knowledge graph that prioritizes data freshness over certain quality assurance steps.
Public biochemical databases, such as those hosted by the National Library of Medicine, allow for the retrieval of gene and protein data based on their symbols to enrich knowledge graphs.
Heiko Paulheim surveys approaches that exploit links to other knowledge graphs to verify information and fill existing data gaps.
Zhu et al. focus on the creation of multi-modal knowledge graphs, specifically by combining symbolic knowledge in a knowledge graph with corresponding images.
Using a dictionary (also called a lexicon or gazetteer) is a reliable and simple method to detect entity mentions in text, as it maps labels of desired entities to identifiers in a knowledge graph, effectively performing named-entity recognition and entity linking in a single step.
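A minimal sketch of the idea: a gazetteer maps lowercase surface labels to knowledge graph identifiers, and a greedy longest-match scan over the tokens performs recognition and linking in one pass. The gazetteer entries are hypothetical, and the whitespace tokenization deliberately ignores punctuation.

```python
# Sketch of gazetteer-based mention detection: labels map directly to KG
# identifiers, so recognition and linking happen in a single pass. The
# gazetteer and the greedy longest-match scan are illustrative.

gazetteer = {"new york": "Q60", "york": "Q42462", "berlin": "Q64"}

def detect_mentions(text: str):
    tokens = text.lower().split()
    mentions, i = [], 0
    while i < len(tokens):
        # Try the longest span starting at token i first.
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j])
            if span in gazetteer:
                mentions.append((span, gazetteer[span]))
                i = j  # continue after the matched span
                break
        else:
            i += 1
    return mentions

print(detect_mentions("She moved from Berlin to New York"))
# [('berlin', 'Q64'), ('new york', 'Q60')]
```

The longest-match rule is what lets "New York" win over the shorter "York" entry; the approach still cannot disambiguate genuinely ambiguous labels, which is where full entity-linking systems come in.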
Empirical evaluation on DBpedia by Acosta et al. [234] shows that combining expert crowdsourcing and paid microtasks on Amazon Mechanical Turk is a complementary and affordable way to enhance knowledge graph data quality.
Solutions for detecting changes in Knowledge Graph data sources include manual user notifications via email, accessing change APIs using publish-subscribe protocols, and computing differences by repeatedly crawling external data to compare against previous snapshots.
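The snapshot-comparison strategy can be sketched as a three-way diff between the previous and the newly crawled snapshot, yielding additions, deletions, and updates. Modeling snapshots as id-to-record dictionaries is an illustrative assumption.

```python
# Sketch of snapshot-based change detection: repeatedly crawl a source and
# diff the new snapshot against the previous one. Snapshots are modeled as
# id -> record dictionaries (an illustrative assumption).

def diff_snapshots(old: dict, new: dict):
    added = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}
    return added, deleted, updated

old = {"e1": {"name": "Berlin"}, "e2": {"name": "Bonn"}}
new = {"e1": {"name": "Berlin, Germany"}, "e3": {"name": "Paris"}}

added, deleted, updated = diff_snapshots(old, new)
print(sorted(added), sorted(deleted), sorted(updated))
# ['e3'] ['e2'] ['e1']
```

The resulting change sets can then drive incremental pipeline runs instead of full recomputation.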
Semantic reasoning and inference allow for the validation of a knowledge graph's consistency based on a given ontology or individual structural constraints.
The term 'knowledge graph' gained popularity following a 2012 blog post about the Google Knowledge Graph.
Most knowledge graph construction solutions produce a final knowledge graph that contains a union of all extracted values, either with or without provenance, leaving the final consolidation or selection of entity identifiers and values to the targeted applications.