Journal: Bulletin of the Technical Committee on Data Engineering
Year: 2018
Volume: 41
Issue: 1
Page: 63
Publisher: IEEE Computer Society
Abstract: Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. Data provenance support is a key building block in libraries that aim to provide debugging support for data processing pipelines. In this paper we report our experience in building Titian: a data provenance system targeting the Apache Spark framework. Our focus here is to analyze the design choices and trade-offs that we and others made. Ultimately, we believe there is still more work to do before data provenance reaches widespread adoption outside the research community.