Journal: Bulletin of the Technical Committee on Data Engineering
Publication year: 2011
Volume: 34
Issue: 03
Publisher: IEEE Computer Society
Abstract: We present Midas, a system that uses complex data processing to extract and aggregate facts from a large
collection of structured and unstructured documents into a set of unified, clean entities and relationships.
Midas focuses on data for financial companies and is based on periodic filings with the U.S. Securities
and Exchange Commission (SEC) and Federal Deposit Insurance Corporation (FDIC). We show that,
by using data aggregated by Midas, we can provide valuable insights about financial institutions either
at the whole system level or at the individual company level.
The key technology components that we implemented in Midas and that enable the various financial
applications are: information extraction, entity resolution, mapping and fusion, all on top of a scalable
infrastructure based on Hadoop. We describe our experience in building the Midas system and also
outline the key research questions that remain to be addressed towards building a generic, high-level
infrastructure for large-scale data integration from public sources.