Maintaining a metadata repository (MDR) is beneficial for knowledge graph construction because it allows for the storage and organization of different kinds of metadata in a uniform and consistent way.
Link prediction is a task in knowledge graph construction that aims to identify missing relations between entities, such as identifying that the song 'Ageispolis' was written by the artist 'Aphex Twin'.
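The idea behind link prediction can be sketched with an embedding-based scoring function in the TransE style, where a triple (head, relation, tail) is plausible if the head embedding plus the relation embedding lands near the tail embedding. All embedding values below are toy illustrative numbers, not taken from any real knowledge graph.

```python
# Minimal sketch of embedding-based link prediction (TransE-style scoring).
# The triple ('Ageispolis', 'writtenBy', 'Aphex Twin') and all embedding
# values are illustrative assumptions.

def transe_score(head, relation, tail):
    """Lower score = more plausible triple: score = ||h + r - t||_1."""
    return sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))

# Toy 3-dimensional embeddings (hypothetical values).
emb = {
    "Ageispolis": [0.9, 0.1, 0.3],
    "Aphex Twin": [1.0, 0.6, 0.2],
    "writtenBy":  [0.1, 0.5, -0.1],
    "Berlin":     [-0.8, 0.0, 0.9],
}

candidates = ["Aphex Twin", "Berlin"]
scores = {t: transe_score(emb["Ageispolis"], emb["writtenBy"], emb[t])
          for t in candidates}
best = min(scores, key=scores.get)  # predicted tail entity
```

In a real system the embeddings would be trained on the existing graph, and the highest-ranked unseen triples would be proposed as missing relations.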
Optimizing data freshness is a relevant criterion for knowledge graph construction to guarantee up-to-date results in downstream applications.
For quality assurance, the NELL knowledge graph construction solution relies solely on final user approval of the correctness of extracted values or patterns.
Xiaogang Ma reviewed knowledge graph applications and construction approaches in the geoscience domain, noting that creation methods range from manual approaches to data mining of crowdsourced data.
The main tasks for knowledge graph construction include: (1) Data Acquisition & Preprocessing (selection of sources, acquisition, transformation, and cleaning), (2) Metadata Management (acquisition and management of provenance, structural, temporal, quality, and log metadata), and (3) Ontology Management (creation and incremental evolution of the ontology).
Şimşek et al. provided a high-level overview of the knowledge graph construction process within the context of a knowledge graph's lifecycle, utilizing a case study to illustrate encountered challenges.
The HKGB knowledge graph construction solution provides a description of its entity resolution (ER) process that is too vague to allow for a definitive assessment of its capabilities.
Fan et al. (2020) utilized deep learning-based named entity recognition for knowledge graph construction specifically applied to geological hazards.
Knowledge graph construction and maintenance tools should enable the definition and execution of powerful, efficient, and scalable pipelines.
The primary goal of entity matching in knowledge graph construction is to identify similar entities as candidates for a final clustering step, which determines whether a new entity should be added to an existing cluster or form a new one.
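The match-then-cluster pattern described above can be sketched as follows: a similarity function identifies candidate matches, and a clustering step decides whether a new entity joins an existing cluster or starts a new one. The similarity function, threshold, and entity names are illustrative assumptions.

```python
# Minimal sketch of entity matching followed by cluster assignment.
# Jaccard token overlap and the 0.5 threshold are illustrative choices.

def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def assign(entity, clusters, threshold=0.5):
    """Add entity to the most similar cluster, or open a new one."""
    best, best_sim = None, 0.0
    for cluster in clusters:
        sim = max(jaccard(entity, member) for member in cluster)
        if sim > best_sim:
            best, best_sim = cluster, sim
    if best is not None and best_sim >= threshold:
        best.append(entity)
    else:
        clusters.append([entity])
    return clusters

clusters = [["Aphex Twin", "Richard D. James"]]
assign("Richard David James", clusters)  # similar enough: joins the cluster
assign("Boards of Canada", clusters)     # no overlap: forms a new cluster
```

Production systems replace the pairwise loop with blocking and use learned similarity models, but the join-or-create decision at the end is the same.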
Future work in knowledge graph construction faces challenges regarding incremental approaches, open toolsets, and benchmarks.
Existing toolsets for knowledge graph construction generally offer richer functionality than open-source alternatives, but since most of them are closed-source, they cannot be reused for new knowledge graph projects or research investigations.
Knowledge graph construction requires methods to evaluate the quality of each step of the construction pipeline as well as the resulting knowledge graph.
Metadata for knowledge graph construction can be created manually by human users or automatically by computer programs using heuristics or algorithms.
The entity linking component of knowledge extraction can render an additional entity resolution step unnecessary in knowledge graph construction.
While NLP tasks have benefited from reusable implementations like Stanford CoreNLP, other knowledge graph construction tasks, such as entity resolution, currently lack similar modular, reusable implementations.
SAGA is a closed-source toolset that supports multi-source data integration for both batch-like incremental knowledge graph construction and continuous knowledge graph updates.
Knowledge Graph construction requires handling heterogeneous data formats, including CSV, XML, JSON, and RDF, as well as various access technologies like downloadable files, databases, and APIs.
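Handling heterogeneous formats typically means normalizing each input into a common representation before integration. A minimal sketch, assuming toy CSV and JSON sources and a simple subject-property-value triple form:

```python
# Minimal sketch of normalizing heterogeneous inputs (CSV and JSON here)
# into a common triple representation. File contents, column names, and
# property names are illustrative assumptions.
import csv
import io
import json

def csv_to_triples(text, subject_col):
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subj = row.pop(subject_col)
        triples += [(subj, prop, val) for prop, val in row.items()]
    return triples

def json_to_triples(text, subject_key):
    triples = []
    for obj in json.loads(text):
        subj = obj.pop(subject_key)
        triples += [(subj, prop, str(val)) for prop, val in obj.items()]
    return triples

csv_src = "id,name\nQ1,Aphex Twin\n"
json_src = '[{"id": "Q1", "genre": "IDM"}]'
triples = csv_to_triples(csv_src, "id") + json_to_triples(json_src, "id")
# Both sources now contribute to the same entity "Q1".
```

Real pipelines additionally handle XML and RDF inputs, schema mappings, and access technologies such as databases and APIs, but the normalization step follows the same shape.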
Benchmark datasets exist for specific subtasks of knowledge graph construction, such as entity resolution (e.g., Gollum) and knowledge completion (e.g., CoDEx).
Knowledge graph construction pipelines are often created in a batch-like manner, making them unable to incorporate newly arriving facts without fully re-computing individual tasks.
The authors of 'Construction of Knowledge Graphs: State and Challenges' expand the scope of knowledge graph construction research to include non-RDF-based models like the Property Graph Model, the integration of structured and unstructured data, and incremental maintenance.
Methods for fixing or mitigating detected quality issues by refining and repairing the knowledge graph are required for successful construction.
There is a lack of widely used end-to-end benchmark datasets for knowledge graph construction, leading researchers to often create custom datasets or use subsets of existing datasets to evaluate their construction pipelines.
Data and metadata management are cross-cutting tasks in knowledge graph construction that are necessary throughout the entire pipeline.
SLOGERT (Semantic LOG ExtRaction Templating) is a framework for automated knowledge graph construction from log data, utilized in security-related applications to detect emerging threats and vulnerabilities.
Knowledge graph construction pipelines often require manual intervention at different steps, which limits scalability to large data volumes and increases the time required for updating a knowledge graph.
Knowledge graph construction approaches are categorized into two types: KG-specific approaches, which focus on integrating data from a fixed set of sources for a single knowledge graph, and toolsets or strategies, which are generic and applicable to different sources and knowledge graphs.
Despite their complexity, current benchmarks for knowledge graph construction tasks often have gaps regarding scalability and domain diversity.
Validating a knowledge graph's data integrity concerning its underlying semantic structure (ontology) is a specific quality aspect of knowledge graph construction.
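This kind of integrity check can be illustrated with a highly simplified stand-in for constraint-based validation; in practice, declarative constraint languages such as SHACL and a dedicated validator would be used. The entity types and the property constraint below are illustrative assumptions.

```python
# Highly simplified sketch of validating triples against ontology-derived
# domain/range constraints. Types and constraints are illustrative.

types = {"Ageispolis": "Song", "Aphex Twin": "Artist"}
# property -> (expected subject type, expected object type)
constraints = {"writtenBy": ("Song", "Artist")}

def validate(triples):
    """Return triples that violate the domain/range constraints."""
    violations = []
    for s, p, o in triples:
        if p in constraints:
            dom, rng = constraints[p]
            if types.get(s) != dom or types.get(o) != rng:
                violations.append((s, p, o))
    return violations

ok = validate([("Ageispolis", "writtenBy", "Aphex Twin")])   # no violations
bad = validate([("Aphex Twin", "writtenBy", "Ageispolis")])  # type mismatch
```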
The WorldKG approach to knowledge graph construction manually verifies all matches to external ontologies for quality assurance.
A tutorial-style overview of knowledge graph construction and curation, with a focus on integrating data from textual and semi-structured sources like Wikipedia, is provided in reference [12].
Knowledge graph construction requires incremental approaches that can build upon previous match decisions to determine if new entities are already represented in the knowledge graph or should be added as new entities.
The authors of the paper 'Construction of Knowledge Graphs: State and Challenges' focus their requirements analysis on the knowledge graph construction process for structured input data and on tasks such as entity resolution, rather than solely on the knowledge graph itself.
WordNet, ImageNet, and BabelNet are frequently used as starting points for knowledge graph construction.
Among the surveyed solutions, DBpedia and YAGO are the only knowledge graph construction approaches that perform an automatic consistency check.
The work by Şimşek et al. provides insights into real-world knowledge graph construction but lacks a systematic comparison of state-of-the-art approaches regarding the requirements of knowledge graph construction.
Knowledge Graph construction should involve integrating additional data sources after the initial premium sources to capture 'long tail' entities, such as less prominent persons.
The SAGA knowledge graph construction solution utilizes several truth discovery and source reliability-based fusion methods for entity fusion.
Data cleaning approaches used in Knowledge Graph construction can be applied to the final Knowledge Graph to identify outliers or contradicting information.
A benchmark for knowledge graph construction should ideally involve the initial construction and incremental update of domain-specific or cross-domain knowledge graphs from diverse data sources, using predefined ontologies and data models like RDF or property graphs to facilitate evaluation.
Storing blocking keys in data structures with efficient lookup, such as hash maps, speeds up the retrieval of comparison candidates and thereby entity comparisons in Knowledge Graph construction.
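The effect of blocking can be sketched as follows: entities sharing a blocking key land in the same hash-map bucket, so candidate pairs are drawn per bucket instead of comparing all entities pairwise. The prefix-based key function and entity names are illustrative assumptions.

```python
# Minimal blocking sketch: a hash map from blocking key to bucket of
# entities; comparisons happen only within buckets. The 3-character
# prefix key is an illustrative, deliberately simplistic choice.
from collections import defaultdict
from itertools import combinations

def blocking_key(name):
    return name.lower()[:3]  # simplistic prefix key

entities = ["Aphex Twin", "Aphex twin", "Boards of Canada", "Autechre"]
blocks = defaultdict(list)
for e in entities:
    blocks[blocking_key(e)].append(e)

candidate_pairs = [pair
                   for bucket in blocks.values()
                   for pair in combinations(bucket, 2)]
# Only the pair ("Aphex Twin", "Aphex twin") is compared,
# not all 6 possible pairs.
```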
The construction of a knowledge graph is a multi-disciplinary effort that requires expertise from natural language processing, data integration, knowledge representation, and knowledge management.
Knowledge graph construction tasks, such as data fusion for determining final entity values, should utilize fact-level provenance metadata.
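A minimal sketch of how fact-level provenance can drive fusion, assuming a weighted-voting scheme over source reliability scores; the sources, weights, and attribute values are all illustrative.

```python
# Minimal sketch of provenance-aware data fusion: each candidate value
# carries its source, and the final value is chosen by summing per-source
# reliability weights. All names and numbers are illustrative assumptions.
from collections import defaultdict

reliability = {"source_a": 0.9, "source_b": 0.4, "source_c": 0.4}

def fuse(candidates):
    """candidates: list of (value, source) pairs for one entity attribute."""
    votes = defaultdict(float)
    for value, source in candidates:
        votes[value] += reliability.get(source, 0.1)  # default low trust
    return max(votes, key=votes.get)

# The single high-reliability source outweighs two low-reliability ones.
birth_year = fuse([("1971", "source_a"), ("1972", "source_b"),
                   ("1972", "source_c")])
```

Truth-discovery methods such as those used by SAGA additionally estimate the reliability weights themselves from the data rather than fixing them up front.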
The DRKG, HKGB, and SAGA knowledge graph construction solutions use machine learning-based link prediction on graph embeddings to find further knowledge for knowledge completion.
Most knowledge graph construction approaches integrate supplementary data such as mapping rules, training data, or quality constraints like SHACL shapes.
The main challenges in utilizing data for knowledge graph construction include creating more efficient learning schemes, handling complex contexts such as relational information across sentences, and detecting undefined relations in new domains.
Knowledge Graph construction must account for continuously changing data sources, which requires mechanisms for change recognition and version maintenance.
The HKGB knowledge graph construction solution relies heavily on user interaction for quality assurance.
The SAGA knowledge graph construction solution attempts to automatically detect potential errors or vandalism and quarantines them for human curation; changes are applied directly to the live graph before being propagated to the stable graph.
Reusing existing toolsets for knowledge graph construction often requires transformation or mapping between different data formats and processing steps.
Completely automatic knowledge graph construction is currently not achievable because steps such as identifying relevant data sources and developing the knowledge graph ontology typically require human input from individuals, expert groups, or communities.
Applying automatic approaches to knowledge graph construction can cause the extraction of irrelevant information, necessitating either manual intervention or the leveraging of known information from existing structured databases.
Knowledge graph construction requires identifying relevant sources and determining relevant subsets of data, as it is generally unnecessary to integrate all information from a source for a specific project.
Requirements for knowledge graph construction and maintenance are grouped into four aspects: input consumption, incremental data processing capabilities, tooling/pipelining, and quality assurance.
Modular processing workflows with transparent interfaces increase the reusability of alternative tools and implementations in knowledge graph construction.
In incremental entity resolution, it is beneficial to know the type of new entities from previous steps in the knowledge graph construction pipeline to limit comparisons to existing knowledge graph entities of the same or related types.
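The type-restricted candidate lookup described above can be sketched with a type index over existing knowledge graph entities; entity names and types are illustrative assumptions.

```python
# Minimal sketch of limiting incremental entity-resolution comparisons
# to existing KG entities of the same type via a type index.
from collections import defaultdict

kg_entities = [("Aphex Twin", "Artist"), ("Autechre", "Artist"),
               ("Ageispolis", "Song")]

by_type = defaultdict(list)
for name, etype in kg_entities:
    by_type[etype].append(name)

def candidates_for(new_entity_type):
    """Only entities of the same type are considered for comparison."""
    return by_type.get(new_entity_type, [])

# A new Artist is compared against 2 entities instead of all 3.
artist_candidates = candidates_for("Artist")
```

Related types (e.g., subclasses in the ontology) can be folded in by expanding the lookup to a set of compatible types.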
Knowledge graph construction requires scalable methods for the acquisition, transformation, and integration of diverse input data, including structured, semi-structured, and multimodal unstructured data such as textual documents, web data, images, and videos.
The SLOGERT knowledge graph construction solution adds links to external information based on previously extracted persistent identifiers (PIDs).
Entity resolution is supported by only a few knowledge graph construction approaches.
Effective knowledge graph construction pipelines must support the management of metadata for data sources, processing steps, intermediate results, and the knowledge graph versions themselves.
Current knowledge graph construction solutions have limited metadata support, with few approaches acknowledging the importance of provenance tracking and debugging capabilities.
Effective data and metadata management is essential for open and incremental knowledge graph construction processes.
Entity fusion is the least supported task among the knowledge graph construction solutions considered in the study, with none of the KG-specific solutions performing classical, sophisticated entity fusion.
Quality assurance is necessary throughout the entire Knowledge Graph construction process, including source selection, data cleaning, knowledge extraction, ontology evolution, and entity fusion.
Constructing knowledge graphs requires tools that possess good interoperability, a high degree of automation, high customizability, and the ability to adapt to new domain requirements.
Quality assurance in knowledge graph construction is a cross-cutting topic that addresses ontological consistency, data quality of entities and relations (comprehensiveness), and domain coverage.
Knowledge graph construction requires propagating not just new information, but also deletions and updates from source data to the knowledge graph.
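When a source does not emit an explicit change feed, the changes to propagate can be derived by diffing two source snapshots; the record shapes and values below are illustrative assumptions.

```python
# Minimal sketch of change propagation via snapshot diffing: comparing an
# old and a new source snapshot yields the additions, deletions, and
# updates to apply to the knowledge graph. Data is illustrative.

old = {"Q1": {"name": "Aphex Twin"}, "Q2": {"name": "Old Entry"}}
new = {"Q1": {"name": "Aphex Twin", "genre": "IDM"},
       "Q3": {"name": "Autechre"}}

added   = {k: new[k] for k in new.keys() - old.keys()}   # Q3 appears
deleted = set(old.keys() - new.keys())                   # Q2 is gone
updated = {k: new[k] for k in new.keys() & old.keys()
           if new[k] != old[k]}                          # Q1 changed
```

Each of the three sets then triggers a different propagation action: inserts for additions, retractions for deletions, and value revisions for updates.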
Knowledge graph construction pipelines face significant challenges, including the need for scalability, the integration of heterogeneous data sources, and the tracking of data provenance.
The SLOGERT knowledge graph construction solution suggests that entity resolution might be necessary in some cases, but recommends that this process be performed using an external tool.
DRKG and WorldKG represent one-time efforts in knowledge graph construction without any updates.
There is a need for more comprehensive data quality measures and repair strategies that minimize human intervention to retain scalability in knowledge graph construction.
Selecting relevant data sources for Knowledge Graph construction is typically a manual process, though it can be supported by data catalogs that provide metadata about the sources and their contents.
Most knowledge graph construction solutions produce a final knowledge graph that contains a union of all extracted values, either with or without provenance, leaving the final consolidation or selection of entity identifiers and values to the targeted applications.
The acquisition of provenance data is the most common form of metadata support in knowledge graph construction, ranging from simple source identifiers and confidence scores to the inclusion of original values.
Most knowledge graph construction approaches have either no support or unknown support for incremental updates, meaning they cannot integrate changes in data sources or new sources without a full recomputation of the knowledge graph.
Existing benchmarks for knowledge graph construction are currently limited to individual tasks such as knowledge extraction, ontology matching, entity resolution, and knowledge graph completion.