
Methods of Entity Resolution and Data Fusion

Methods of Entity Resolution and Data Fusion in the ETL-Process and their Implementation in the Hadoop Environment.

Author(s): A. Vovchenko, L. Kalinichenko, D. Kovalev.
Published: Informatics and Applications. Moscow: IPI RAN, 2014. -- V. 8, Iss. 4. -- P. 94-109.
Extracting entities, transforming them, and loading them into an integrated repository is the main problem of data integration. These actions make up the ETL (extract, transform, load) process. An entity is a digital representation of a real-world object (for example, information about a person). Entity resolution covers duplicate detection, deduplication, record linkage, object identification, reference matching, and other ETL-related tasks. After the entity resolution step, related entities should be merged into a single reference entity that contains information from all of them. This merging step, data fusion, is the final step in the data integration process. The paper gives an overview of entity resolution and data fusion methods. It also presents techniques for programming these methods to implement the ETL process in the Hadoop environment. HIL (High-Level Integration Language), a declarative language that focuses on the resolution and fusion of entities in the Hadoop infrastructure, is used in this part of the paper.
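The two steps named in the abstract, entity resolution followed by data fusion, can be illustrated with a minimal Python sketch. It is not HIL and is not taken from the paper. The record fields (name, email, city), the string-similarity threshold of 0.85, and the "most common non-empty value" fusion rule are all assumptions chosen only for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def is_match(a, b, threshold=0.85):
    """Assumed matching rule: two records match if their names are similar enough."""
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return ratio >= threshold


def resolve(records, threshold=0.85):
    """Entity resolution: cluster matching records with union-find."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Compare every pair; a real pipeline would use blocking to avoid this.
    for i, j in combinations(range(len(records)), 2):
        if is_match(records[i], records[j], threshold):
            parent[find(i)] = find(j)

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record)
    return list(clusters.values())


def fuse(cluster):
    """Data fusion: build one reference entity from a cluster.

    Assumed conflict rule: take the most common non-empty value per attribute.
    """
    fused = {}
    for key in sorted({k for record in cluster for k in record}):
        values = [r[key] for r in cluster if r.get(key) not in (None, "")]
        if values:
            fused[key] = Counter(values).most_common(1)[0][0]
    return fused


# Hypothetical sample data: three variants of one person plus one other person.
records = [
    {"name": "John Smith", "email": "john@example.com", "city": ""},
    {"name": "john  smith", "email": "", "city": "Moscow"},
    {"name": "Jon Smith", "email": "john@example.com", "city": "Moscow"},
    {"name": "Alice Brown", "email": "alice@example.com", "city": "Kazan"},
]
entities = [fuse(cluster) for cluster in resolve(records)]
print(entities)
```

The pairwise comparison is quadratic in the number of records, which is one reason such pipelines are distributed over an environment like Hadoop in practice; the sketch ignores that concern to keep the two steps visible.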

Supported by Synthesis Group