Journal: Bulletin of the Technical Committee on Data Engineering
Publication year: 2016
Volume: 39
Issue: 2
Page: 78
Publisher: IEEE Computer Society
Abstract: Temporal data pose unique data quality challenges due to the presence of autocorrelations, trends, seasonality, and gaps in the data. Data streams are a special case of temporal data where velocity, volume, and variety present additional layers of complexity in measuring the veracity of the data. In this paper, we discuss a general, widely applicable framework for data quality measurement of streams in a dynamic environment that takes into account the evolving nature of streams. We classify data quality anomalies using four types of constraints, identify violations that could be potential data glitches, and use statistical distortion as a metric for measuring data quality in a near real-time fashion. We illustrate our framework using commercially available streams of NYSE stock prices consisting of aggregates of prices and trading volumes collected every minute over a one-year period from November 2011 to November 2012.