Data cleaning
Facts (22)
Sources
Construction of Knowledge Graphs: State and Challenges - arXiv arxiv.org 20 facts
Claim: User-interaction cleaning methods incorporate human knowledge to improve the quality of data cleaning results while reducing the required human effort.
Reference: I.F. Ilyas and X. Chu authored the book 'Data Cleaning', published by Morgan & Claypool in 2019.
Reference: ActiveClean is an interactive data cleaning framework for statistical modeling developed by S. Krishnan, J. Wang, E. Wu, M.J. Franklin, and K. Goldberg, published in the Proceedings of the VLDB Endowment in 2016.
Reference: The 'CrowdCleaner' system performs data cleaning for multi-version data on the web via crowdsourcing, as presented at the IEEE 30th International Conference on Data Engineering (ICDE 2014) in Chicago, IL.
Reference: E. Rahm and H.H. Do published a review of data cleaning problems and current approaches in the IEEE Data Engineering Bulletin in 2000.
Reference: The 'KATARA' system performs reliable data cleaning using knowledge bases and crowdsourcing, as published in the Proceedings of the VLDB Endowment in 2015.
Claim: Data cleaning approaches used in knowledge graph construction can also be applied to the final knowledge graph to identify outliers or contradicting information.
Reference: P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis introduced conditional functional dependencies as a method for data cleaning in their 2007 paper presented at the ICDE conference.
Procedure: Extraction methods for semi-structured data typically combine data cleaning and rule-based mappings to transform input data into a knowledge graph, targeting defined classes and relations of an existing ontology.
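The cleaning-plus-mapping combination described above can be sketched as follows. This is a minimal illustration, not any cited system: the field names, ontology terms, and mapping rules are all assumptions.

```python
# Hypothetical sketch: clean a semi-structured record, then apply rule-based
# mappings onto classes and relations of an assumed target ontology.

def clean(record):
    """Minimal cleaning: strip whitespace and drop empty string fields."""
    return {k: v.strip() for k, v in record.items()
            if isinstance(v, str) and v.strip()}

# Rule-based mapping: input field -> predicate of the (assumed) ontology.
MAPPING_RULES = {
    "name":     "rdfs:label",
    "employer": "ex:worksFor",
}

def to_triples(record, entity_id):
    """Transform one cleaned record into knowledge graph triples."""
    cleaned = clean(record)
    triples = [(entity_id, "rdf:type", "ex:Person")]
    for field, predicate in MAPPING_RULES.items():
        if field in cleaned:
            triples.append((entity_id, predicate, cleaned[field]))
    return triples

triples = to_triples({"name": " Ada Lovelace ", "employer": ""}, "ex:e1")
print(triples)
```

The empty `employer` field is removed during cleaning, so only the type assertion and the label triple are emitted.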
Procedure: Data cleaning typically involves four subtasks: data profiling to identify quality issues, data repair to correct identified problems, data transformation to standardize data representations, and data deduplication to eliminate duplicate entities.
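The four subtasks can be illustrated with a toy pipeline. All record fields and repair choices below are illustrative assumptions, not a standard API.

```python
# Toy pipeline over dirty records: profile -> repair -> transform -> deduplicate.

records = [
    {"name": "Alice", "year": "1984"},
    {"name": "alice", "year": "1984"},
    {"name": "Bob",   "year": "19x4"},  # quality issue: non-numeric year
]

# 1. Data profiling: identify quality issues (non-numeric year values).
issues = [r for r in records if not r["year"].isdigit()]

# 2. Data repair: correct identified problems (here: null out bad values).
for r in issues:
    r["year"] = None

# 3. Data transformation: standardize representations (title-case names).
for r in records:
    r["name"] = r["name"].title()

# 4. Data deduplication: eliminate duplicate entities via a key.
deduped = list({(r["name"], r["year"]): r for r in records}.values())
print(deduped)
```

After transformation the two Alice records agree on the key `(name, year)` and collapse into one, leaving two entities.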
Reference: X. Chu, I.F. Ilyas, and P. Papotti proposed a holistic data cleaning approach that puts violations into context, presented at the 29th IEEE International Conference on Data Engineering in 2013.
Claim: Data profiling and cleaning techniques can be applied to identify erroneous values in a knowledge graph based on their distribution.
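Distribution-based detection of erroneous values can be sketched with a simple z-score check. The property, values, and threshold are illustrative assumptions; real systems use more robust statistics.

```python
# Sketch: flag values of a numeric property whose z-score deviates strongly
# from the property's overall distribution (threshold 2 is arbitrary).
from statistics import mean, stdev

heights_cm = [172, 168, 181, 175, 17500, 169, 178]  # one erroneous value

mu, sigma = mean(heights_cm), stdev(heights_cm)
outliers = [v for v in heights_cm if abs(v - mu) / sigma > 2]
print(outliers)
```

The erroneous value 17500 (likely a unit or entry error) dominates the distribution and is the only value flagged.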
Claim: Rule-based methods are classic techniques used for data cleaning that handle errors violating integrity constraint rules, such as functional dependencies (FDs), conditional functional dependencies (CFDs), and denial constraints (DCs).
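As a minimal example of such a rule, a functional dependency `zip -> city` can be checked by grouping tuples on the left-hand side and flagging groups with more than one right-hand-side value. The data and attribute names are illustrative assumptions.

```python
# Sketch: detect violations of the functional dependency zip -> city.
from collections import defaultdict

rows = [
    {"zip": "04109", "city": "Leipzig"},
    {"zip": "04109", "city": "Leipzig"},
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10115", "city": "Brelin"},  # typo violates zip -> city
]

def fd_violations(rows, lhs, rhs):
    """Return LHS values that map to more than one RHS value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {k: v for k, v in seen.items() if len(v) > 1}

violations = fd_violations(rows, "zip", "city")
print(violations)
```

The check detects that something is wrong for zip 10115 but cannot tell which value is correct; repair needs additional evidence, which is one reason obtaining enough correct rules is hard in practice.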
Claim: Data cleaning should be integrated into all major steps of the knowledge graph construction pipeline to limit the entry of dirty or incorrect information.
Claim: Rule-based methods for data cleaning are limited by the difficulty of obtaining a sufficient number of correct rules.
Claim: Quality assurance is necessary throughout the entire knowledge graph construction process, including source selection, data cleaning, knowledge extraction, ontology evolution, and entity fusion.
Claim: Data cleaning in knowledge graphs involves detecting and removing errors and inconsistencies to improve data quality.
Claim: Machine learning approaches for data cleaning have gained prominence because they simplify the configuration of various subtasks.
Claim: Quality improvement for knowledge graphs includes data cleaning, error correction, outlier detection, entity resolution, data fusion, and continuous ontology development.
Reference: The 'DANCE' system performs data cleaning using constraints and experts, as presented at the 33rd IEEE International Conference on Data Engineering (ICDE 2017) in San Diego, CA.
Practices, opportunities and challenges in the fusion of knowledge ... frontiersin.org 1 fact
Claim: Constructing and maintaining high-quality knowledge graphs typically involves significant human effort, including data cleaning, entity alignment, relation labeling, and expert validation, which is particularly labor-intensive in domains requiring expert knowledge.
On Hallucinations in Artificial Intelligence–Generated Content ... jnm.snmjournals.org 1 fact
Claim: Systematic data cleaning during preprocessing can reduce inconsistencies and improve data fidelity to mitigate hallucinations, although defining objective criteria for data quality standards remains a complex challenge.