Journal: Bulletin of the Technical Committee on Data Engineering
Year: 2012
Volume: 35
Issue: 02
Publisher: IEEE Computer Society
Abstract: One trend in the implementation of modern web systems is the use of activity data in the form of log or
event messages that capture user and server activity. This data is at the heart of many internet systems in
the domains of advertising, relevance, search, recommendation systems, and security, as well as continuing
to fulfill its traditional role in analytics and reporting. Many of these uses place real-time demands
on data feeds. Activity data is extremely high volume, and real-time pipelines present new design challenges.
This paper discusses the design and engineering problems we encountered in moving LinkedIn's
data pipeline from a batch-oriented file aggregation mechanism to a real-time publish-subscribe system
called Kafka. This pipeline currently runs in production at LinkedIn and handles more than 10 billion
message writes each day with a sustained peak of over 172,000 messages per second. Kafka supports
dozens of subscribing systems and delivers more than 55 billion messages to these consumer processes
each day. We discuss the origins of this system, missteps on the path to real-time, and the design and
engineering problems we encountered along the way.