[Flume Cookbook] What is Apache Flume

From the Apache Flume website: "Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store."

Flume can be used to transport a variety of data, be it metrics, logs, or any other kind of data, by customizing different parts of Flume.

An Event is the basic unit of data flow inside the Flume system. A Flume Agent is a Java process that hosts and manages the different components of the Flume system on a node or host. Let's look inside a Flume Agent.

Flume Agent Overview


The picture above shows the basic components of a Flume Agent.

The main components are:

  • Agent process
  • Source
  • Channel
  • Sink

We shall look at all these components in detail in subsequent chapters.
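To make these roles concrete, here is a minimal sketch of a Flume Agent configuration wiring one Source to one Sink through a Channel. The agent name `agent1` and the component names `src1`, `ch1`, and `sink1` are hypothetical; the `netcat` source and `logger` sink are standard Flume component types, used here purely for illustration.

```properties
# Hypothetical agent "agent1": name the components first
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: listen for newline-terminated text on a local TCP port
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Channel: buffer Events in memory between Source and Sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# Sink: log each Event (useful for testing)
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1
```

An agent like this would typically be started with something like `bin/flume-ng agent --conf conf --conf-file example.conf --name agent1`.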

Flume supports various deployment topologies. You can have single Agents that each write data directly into HDFS, or you can have multi-agent flows. Let's briefly look at some example topologies.

Single Agent Topology

In this topology, a Flume Agent sends the data directly to the destination, HDFS in this example.

Single Agent Topology

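A single-agent flow writing to HDFS might be configured as in the hypothetical sketch below. The `exec` source tailing a log file, the `file` channel, and the HDFS sink are standard Flume component types; the agent name, log file path, and HDFS path are assumptions for illustration.

```properties
agent1.sources = tail1
agent1.channels = ch1
agent1.sinks = hdfs1

# Source: tail an application log (path is hypothetical)
agent1.sources.tail1.type = exec
agent1.sources.tail1.command = tail -F /var/log/app.log
agent1.sources.tail1.channels = ch1

# Channel: durable on-disk buffering
agent1.channels.ch1.type = file

# Sink: write Events into date-bucketed HDFS directories
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
agent1.sinks.hdfs1.hdfs.fileType = DataStream
agent1.sinks.hdfs1.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs1.channel = ch1
```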

Multi Agent Topology

In this topology, Flume Agents send data to intermediate Flume Agents, which then send the data to the destination. A common example is a fleet of web servers: it is better to have a few intermediate Flume Agents aggregate and write the data to HDFS than to have every web server write to HDFS directly.

Multi-Agent Topology

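Tiers of agents are usually chained with Flume's built-in Avro sink and Avro source: the first-tier agent's Avro sink points at the collector agent's Avro source. The sketch below shows only the relevant fragments of two hypothetical agent configurations (`web` and `collector`); the hostname, port, and channel names are assumptions.

```properties
# --- First-tier agent "web" (e.g. on a web server) ---
# Avro sink forwards Events to the collector tier
web.sinks.avro1.type = avro
web.sinks.avro1.hostname = collector-host
web.sinks.avro1.port = 4545
web.sinks.avro1.channel = ch1

# --- Second-tier agent "collector" ---
# Avro source receives Events from first-tier agents
collector.sources.avro1.type = avro
collector.sources.avro1.bind = 0.0.0.0
collector.sources.avro1.port = 4545
collector.sources.avro1.channels = ch1
```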

Replicating Topology

In this topology, we use Flume's capability to replicate Events. With just a simple configuration, we can send the same data to batch-processing systems or archival storage, alongside real-time analytics systems like Storm or Druid.

Replicating Topology

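Replication is done with Flume's `replicating` channel selector (which is in fact the default), so a source fans each Event out to every channel it is attached to. In the hypothetical sketch below, one source feeds two channels, each drained by a different sink; the component names are illustrative.

```properties
agent1.sources = src1
agent1.channels = batchCh rtCh
agent1.sinks = hdfsSink rtSink

# Replicating selector: every Event is copied to both channels
agent1.sources.src1.selector.type = replicating
agent1.sources.src1.channels = batchCh rtCh

# One channel feeds the batch/archival path...
agent1.sinks.hdfsSink.channel = batchCh
# ...the other feeds the real-time analytics path
agent1.sinks.rtSink.channel = rtCh
```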
