From the Apache Flume website: “Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.”
Flume can be used to transport a variety of data, be it metrics, log data, or any other kind of data, by customizing the different parts of Flume.
An Event is the basic unit of data flow inside the Flume system. A Flume Agent is a Java process that hosts and manages the different components of the Flume system on a node or host. Let’s look inside a Flume Agent.
The picture above shows the basic components of a Flume Agent.
The main components are:
- Agent process
- Source
- Channel
- Sink
We shall look at all these components in detail in subsequent chapters.
Flume supports various deployment topologies. You can have single Agents that each dump data directly into HDFS, or you can build multi-agent flows. Let’s briefly look at some example topologies.
Single Agent Topology
In this topology, a Flume Agent sends the data directly to the destination, HDFS in this example.
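A single-agent flow like this is defined in a plain properties file. The sketch below is illustrative: the agent name `agent1`, the component names, the tailed log path, and the HDFS URL are all assumptions, not values from this document.

```properties
# Hypothetical agent "agent1": one source, one channel, one sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: tail an application log (exec source is one common choice)
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

# Channel: buffer Events in memory between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Sink: write Events directly to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
agent1.sinks.sink1.channel = ch1
```

Such an agent would typically be started with `flume-ng agent --conf conf --conf-file agent1.conf --name agent1`. Note that a source binds to `channels` (it may feed several), while a sink binds to a single `channel`.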
Multi Agent Topology
In this topology, Flume Agents send data to intermediate Flume Agents, which then forward the data to the destination. A common example is a fleet of web servers: it is better to have a few intermediate Flume Agents write the data to HDFS than to have every web server write to HDFS at once.
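A two-tier flow like this is usually wired together with an Avro sink on the first tier pointing at an Avro source on the collector tier. The agent names, hostname, and port below are assumptions for illustration.

```properties
# First tier (runs on each web server): ship Events to a collector over Avro
webagent.sources = src1
webagent.channels = ch1
webagent.sinks = avroSink
webagent.sources.src1.type = exec
webagent.sources.src1.command = tail -F /var/log/httpd/access.log
webagent.sources.src1.channels = ch1
webagent.channels.ch1.type = memory
webagent.sinks.avroSink.type = avro
webagent.sinks.avroSink.hostname = collector.example.com
webagent.sinks.avroSink.port = 4545
webagent.sinks.avroSink.channel = ch1

# Second tier (the collector): receive Avro Events and write them to HDFS
collector.sources = avroSrc
collector.channels = ch1
collector.sinks = hdfsSink
collector.sources.avroSrc.type = avro
collector.sources.avroSrc.bind = 0.0.0.0
collector.sources.avroSrc.port = 4545
collector.sources.avroSrc.channels = ch1
collector.channels.ch1.type = memory
collector.sinks.hdfsSink.type = hdfs
collector.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/weblogs
collector.sinks.hdfsSink.channel = ch1
```

The Avro sink/source pair is Flume’s standard mechanism for agent-to-agent hops, so the collector sees the same Events the web-server agents produced.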
Replicating Topology
In this topology, we use Flume’s capability to replicate Events. We can send the same data to batch processing systems or archival storage and, alongside, to real-time analytics systems like Storm or Druid, with just a simple configuration.
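Replication is configured with Flume’s replicating channel selector, which copies every Event from a source onto each of its channels; a different sink then drains each channel. The component names and paths below are illustrative assumptions.

```properties
# Hypothetical agent with one source fanned out to two channels
agent1.sources = src1
agent1.channels = batchCh rtCh
agent1.sinks = hdfsSink avroSink

# Replicating selector: every Event is copied to BOTH channels
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app.log
agent1.sources.src1.selector.type = replicating
agent1.sources.src1.channels = batchCh rtCh

agent1.channels.batchCh.type = memory
agent1.channels.rtCh.type = memory

# One copy goes to HDFS for batch processing / archival
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/archive
agent1.sinks.hdfsSink.channel = batchCh

# The other copy is forwarded (here via Avro) toward a real-time pipeline
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.hostname = realtime.example.com
agent1.sinks.avroSink.port = 4545
agent1.sinks.avroSink.channel = rtCh
```

`replicating` is actually the default selector type when a source lists multiple channels; it is spelled out here to make the fan-out explicit.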