In the second post in the series, let's examine some of the key Flume terms.
In part 1 of the series, we set up a single-node cluster. Before we dive deeper into Flume, let's look at some basic concepts that will help in understanding what follows.
An Event is the byte payload that needs to be transmitted from source to destination. The byte payload is the data that the application needs to store. An Event also carries certain header information.
The Flume Agent is central to a Flume deployment. An Agent is an independent process that hosts and manages Sources, Sinks and Channels.
A Source is the component through which data is ingested into a Flume Agent. It may belong to the first Agent that receives data into Flume, or to one of the intermediate Agents that relay data toward the end destination.
A Channel is a transient store where events are held until a Sink consumes them. It is the link between a Source and a Sink: the Source accepts data and pushes it to the Channel, and the Sink takes the data from the Channel and writes it either to the next destination or to the final destination. A Source can write to more than one Channel.
A Sink is responsible for consuming events from a Channel. It either forwards those events to the next Source or, if it is the last Sink in the chain, writes them to the eventual destination, such as a file system or HDFS.
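To make the relationship between Agent, Source, Channel and Sink concrete, here is a minimal sketch of an Agent configuration in Flume's properties format. The names agent1, src1, ch1 and sink1 are placeholders chosen for illustration, and the netcat source, memory channel and logger sink are just one simple combination, not the setup used elsewhere in this series.

```properties
# Declare the components hosted by the Agent named "agent1" (hypothetical name)
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: listens on a local TCP port and turns each line of text into an Event
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
# A Source may list more than one Channel here, separated by spaces
agent1.sources.src1.channels = ch1

# Channel: transient in-memory store between the Source and the Sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# Sink: consumes Events from exactly one Channel and writes them to the Agent's log
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1
```

Note the asymmetry in the property names: a Source uses "channels" (plural) because it can feed several Channels, while a Sink uses "channel" (singular) because it drains only one.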
A Client is an implementation that resides at the point of origin of Events and is capable of delivering Events to a Flume Agent. For example, if we use the Flume Log4j appender, it acts as a Client and delivers the logs to the configured Flume Agent.
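As a rough illustration of the Log4j case, an application's log4j.properties might look like the sketch below. It assumes Log4j 1.x, the Flume SDK and Log4j appender jars on the application's classpath, and an Agent with an Avro source listening on localhost port 41414; the host and port are placeholder values.

```properties
# Route all INFO-and-above log messages to the Flume appender
log4j.rootLogger = INFO, flume

# The appender acts as the Client: it wraps each log message in an Event
# and delivers it to the Agent's Avro source
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
# Placeholder host and port of the receiving Agent
log4j.appender.flume.Hostname = localhost
log4j.appender.flume.Port = 41414
```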
Flow is the movement of Events across the Flume topology, from the Source to the eventual destination.
The picture below shows a sample flow with two nodes (Agents).
Topology is the way Flume Agents are arranged along the flow path of Events from source to destination. A sample topology is shown below. Other topologies are possible, but we shall touch upon them after the discussion on Sinks.
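A two-Agent flow of the kind described above could be wired up roughly as follows. This is only a sketch under assumptions: the Agent names, component names, the host collector-host, port 4545 and the HDFS path are all hypothetical, and an Avro sink paired with an Avro source is used here as one common way to connect Agents.

```properties
# ---- First Agent: receives data and forwards it to the next Agent ----
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = avroSink

agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

# The Sink of agent1 writes to the Source of agent2
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.hostname = collector-host
agent1.sinks.avroSink.port = 4545
agent1.sinks.avroSink.channel = ch1

# ---- Second Agent: receives from agent1 and writes to the final destination ----
agent2.sources = avroSrc
agent2.channels = ch2
agent2.sinks = hdfsSink

# Must listen on the port that agent1's Avro sink points at
agent2.sources.avroSrc.type = avro
agent2.sources.avroSrc.bind = 0.0.0.0
agent2.sources.avroSrc.port = 4545
agent2.sources.avroSrc.channels = ch2

agent2.channels.ch2.type = memory

# Last Sink in the chain: writes Events to HDFS (placeholder path)
agent2.sinks.hdfsSink.type = hdfs
agent2.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
agent2.sinks.hdfsSink.channel = ch2
```

Each Agent is started as its own process with its own name, so the two halves would normally live on different machines.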