Data Stream Generator

gvdongen edited this page Aug 10, 2021 · 3 revisions

What?

The code can be found here.

This component generates a data stream that is published to Kafka. The processing job can then ingest this data for further processing.

The original data contains one measurement per minute. In this benchmark, we speed up the sending of the data: the data of one minute is sent each second. We replace the timestamp in each message with the current timestamp.
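This time compression can be sketched as follows; the record layout and the helper name `compress_minute` are hypothetical illustrations, not taken from the generator's actual code:

```python
import time

def compress_minute(observations, now=None):
    """Replace each observation's original timestamp with the current
    time, so that one minute of data can be emitted in one second."""
    now = time.time() if now is None else now
    return [{**obs, "timestamp": now} for obs in observations]

# One (hypothetical) minute of input data: two observations.
minute = [
    {"measurementId": "A12", "lane": "lane1", "timestamp": 1_300_000_000},
    {"measurementId": "A12", "lane": "lane2", "timestamp": 1_300_000_000},
]
resent = compress_minute(minute, now=1_600_000_000)
```

Each resent record keeps its payload but carries the send-time timestamp instead of the original one.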

Data description

At data volume 0:

  • There are 68400 observations in the data
  • There are 19 distinct measurement ID + lane ID combinations in the data
  • There are 12 distinct measurement IDs in the data
  • There are 34200 speed observations in the data
  • There are 34200 flow observations in the data
  • There are no missing or invalid observations
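As a sanity check, these counts appear internally consistent: each of the 19 measurement ID + lane ID combinations yields one speed and one flow observation per minute of data, so the data set covers 68400 / 38 = 1800 minutes:

```python
combinations = 19                # distinct measurement ID + lane ID pairs
speed_obs = 34200
flow_obs = 34200
total_obs = speed_obs + flow_obs           # 68400 observations in total
per_minute = combinations * 2              # one speed + one flow observation each
minutes_covered = total_obs // per_minute  # 1800 minutes of data
```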

In the publishers we inflate these numbers ten times.

  Throughput = 380*($DATA_VOLUME+1)

If only one stream is published, this should be divided by two:

  Throughput_single_stream = 190*($DATA_VOLUME+1)
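The two formulas can be captured in a small helper; `data_volume` mirrors the `DATA_VOLUME` setting described below:

```python
def throughput(data_volume, streams=2):
    """Messages per second published by the generator.

    With both streams: 380 * (data_volume + 1);
    with a single stream, half of that: 190 * (data_volume + 1).
    """
    if streams == 2:
        return 380 * (data_volume + 1)
    if streams == 1:
        return 190 * (data_volume + 1)
    raise ValueError("either one or two streams are published")
```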

The data can be found in the resources folder of this project.

Settings

Either one or two streams are published, depending on the stage that is executed.

S3_ACCESS_KEY=xxxx; # To read the input data from S3. You can also use the data in the local resources.
S3_SECRET_KEY=xxxx; # To read the input data from S3. You can also use the data in the local resources.
KAFKA_BOOTSTRAP_SERVERS=$(hostname -I | head -n1 | awk '{print $1}'):9092; # The Kafka brokers.
DATA_VOLUME=0; # With two streams: Throughput = 380*($DATA_VOLUME+1); with one stream: Throughput = 190*($DATA_VOLUME+1).
MODE=constant-rate; # Which mode to run in: constant-rate, periodic-burst, single-burst or faulty-event.
FLOWTOPIC=ndwflow; # Name of the flow topic.
SPEEDTOPIC=ndwspeed; # Name of the speed topic.
RUNS_LOCAL=true; # Whether to run in local mode.

Modes

There are four different modes: constant-rate, periodic-burst, single-burst and faulty-event.

Constant rate

Constant-rate is the default mode and is used for the latency, sustainable-throughput, and master- and worker-failure workloads. It sends data at a constant pace, in small batches every 5 milliseconds.
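At the default volume, this pace works out to very small batches. The arithmetic below is illustrative only, assuming the throughput formula above is in messages per second:

```python
batch_interval_s = 0.005                    # one batch every 5 milliseconds
batches_per_second = 1 / batch_interval_s   # 200 batches per second
messages_per_second = 380                   # DATA_VOLUME=0, two streams
messages_per_batch = messages_per_second / batches_per_second  # ~1.9
```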

Periodic bursts

This mode sends a low constant throughput of data with periodic bursts (every 10 seconds). This mode can be used to test the resilience of the framework against bursty data workloads.
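A burst schedule of this shape can be sketched as a rate function of elapsed time; only the 10-second period comes from the text, while the burst length and the base/burst rates here are hypothetical:

```python
def periodic_burst_rate(elapsed_s, base_rate, burst_rate,
                        period_s=10, burst_len_s=1):
    """Low constant rate, with a burst at the start of every period."""
    return burst_rate if (elapsed_s % period_s) < burst_len_s else base_rate
```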

Single burst

This mode publishes a large data volume during the first five minutes and then continues at a low volume. It can be used to preload Kafka with five minutes of data, then start the stream processing framework and measure the peak throughput it can reach while catching up on the backlog.

Faulty event

In this mode, the publisher sends data at a constant rate onto Kafka and sends an erroneous event after 15 minutes. This can be used to check the effect of erroneous events on the framework and to make the job fail.