Authors: Vojtěch Juránek, Gustavo Fernandes
Level: Intermediate
Summary: The spark quickstart demonstrates how to process data on the Data Grid from Apache Spark.
Technologies: Apache Spark, JBoss Data Grid, Scala, Streaming, Listeners, Hot Rod
Target Product: JDG
Product Versions: JDG 7.x
Source: https://github.com/infinispan/jdg-quickstart
The spark quickstart demonstrates how to use Apache Spark to process data stored in the JBoss Data Grid. The quickstart
simulates a network of temperature sensors, each of which sends measurements from a different place to the Data Grid.
The temperatures are stored in a Cache<String, Double>.
A Spark Streaming job listens for changes in the temperature readings and calculates the average for each place,
storing the results back into the "avg-temperatures" cache. Finally, the user can subscribe to one or more cities in order
to be notified when a temperature average changes.
- JDK 8, since Spark does not yet work with Java 9+
- Maven 3+
- JBoss Data Grid 7.3.x
- Apache Spark 2.3+
If you have not yet done so, you must Configure Maven before testing the quickstarts.
- Obtain the JDG server distribution on Red Hat's Customer Portal at https://access.redhat.com/jbossnetwork/restricted/listSoftware.html
- Start JDG:
  - Open a command line and navigate to the root of the JDG directory.
  - The following shows the command line to start the server:
    For Linux: $JDG_HOME/bin/standalone.sh
    For Windows: %JDG_HOME%\bin\standalone.bat
- Create a local cache called avg-temperatures:
  - Open a command line and navigate to the root of the JDG directory.
  - The following shows the commands used to create the cache:
    For Linux:
    bin/cli.sh -c command="/subsystem=datagrid-infinispan/cache-container=local/configurations=CONFIGURATIONS/local-cache-configuration=avg-temperatures:add(start=EAGER,template=false)"
    bin/cli.sh -c command="/subsystem=datagrid-infinispan/cache-container=local/local-cache=avg-temperatures:add(configuration=avg-temperatures)"
    For Windows:
    bin\cli.bat -c command="/subsystem=datagrid-infinispan/cache-container=local/configurations=CONFIGURATIONS/local-cache-configuration=avg-temperatures:add(start=EAGER,template=false)"
    bin\cli.bat -c command="/subsystem=datagrid-infinispan/cache-container=local/local-cache=avg-temperatures:add(configuration=avg-temperatures)"
- Download and unpack version 2.3.3+ of Apache Spark, picking a pre-built package such as Pre-built for Hadoop 2.6 and later. On Windows, you can use 7-zip to extract the .tgz file. The extract location will be referred to as SPARK_HOME.
NOTE for Windows users: Spark on Windows relies on some binaries that are not packaged with the distributed .tgz. You'll need to download a Hadoop 2.6 distribution that contains Windows binaries, such as winutils.exe, in its \bin folder, and set an environment variable HADOOP_HOME pointing to the Hadoop distribution folder.
- Start the Spark master:
  - Open a command line and navigate to SPARK_HOME.
  - Run the following command to start the Spark master:
    For Linux: sbin/start-master.sh --webui-port 9080 -h localhost
    For Windows: bin\spark-class.cmd org.apache.spark.deploy.master.Master --webui-port 9080 -h localhost
Start the Spark Worker:
-
Open a command line and navigate to
SPARK_HOME
: -
Run the following command to start a Spark slave:
For Linux: sbin/start-slave.sh spark://127.0.0.1:7077 --webui-port 9081 For Windows: bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077 --webui-port 9081
From now the Spark admin console can be accessed at http://localhost:9080
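Before moving on, it can help to confirm that the Spark master and worker are actually listening. The following is a minimal sketch, not part of the quickstart, that tries a TCP connection to the ports used in the commands above (7077 for the master, 9080 and 9081 for the web UIs). The class name ServiceCheck and the one-second timeout are assumptions made for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class ServiceCheck {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused or timed out: nothing is listening on that port.
            return false;
        }
    }

    public static void main(String[] args) {
        // Ports taken from the start commands above; the timeout is an arbitrary choice.
        int[] ports = {7077, 9080, 9081};
        for (int port : ports) {
            System.out.printf("localhost:%d reachable: %b%n",
                    port, isReachable("localhost", port, 1000));
        }
    }
}
```

A successful connection only shows that something is listening on the port; the admin console at http://localhost:9080 remains the place to confirm that the worker has registered with the master.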
NOTE: The following build command assumes you have configured your Maven user settings. If you have not, you must include Maven setting arguments on the command line. See the main README for more information.
- Make sure you have started JDG and Spark as described above.
- Open a command line and navigate to the root directory of this quickstart.
- Type this command to build the archives:
  mvn clean package
The directory temperature-sensor contains an application that simulates a network of temperature sensors.
The simulation randomly selects one of the European capitals and randomly generates a temperature measurement.
It then stores the (place, temperature) pair into Infinispan. It generates such a pair every 100 ms.
To start the sensor network simulation:
- Open a command line and navigate to the directory temperature-sensor of the spark quickstart.
- Run the following command to start it:
  java -jar target/temperature-sensor-jar-with-dependencies.jar
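The simulation loop described above can be sketched in plain Java. This is an illustration only and not the quickstart's source: a ConcurrentHashMap stands in for the remote cache so the example runs without a server, and the class name SensorSimulation, the list of capitals, and the temperature range are all assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

class SensorSimulation {
    // A few European capitals; the list used by the real quickstart may differ.
    static final List<String> PLACES =
            Arrays.asList("Prague", "Vienna", "Berlin", "Paris", "Rome");

    /** Picks a random place and a random temperature, stores the pair, and returns the place. */
    static String emit(Map<String, Double> cache, Random random) {
        String place = PLACES.get(random.nextInt(PLACES.size()));
        // Hypothetical range: from -10.0 (inclusive) to 40.0 (exclusive) degrees Celsius.
        double temperature = -10.0 + 50.0 * random.nextDouble();
        cache.put(place, temperature);
        return place;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the Cache<String, Double> held by the Data Grid.
        Map<String, Double> cache = new ConcurrentHashMap<>();
        Random random = new Random();
        for (int i = 0; i < 20; i++) {
            String place = emit(cache, random);
            System.out.printf("%s -> %.1f%n", place, cache.get(place));
            Thread.sleep(100); // one measurement every 100 ms, as described above
        }
    }
}
```

Because the cache is keyed by place, each new reading overwrites the previous one for that place, which is why the averaging has to react to changes as they arrive.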
The Spark job TemperatureAnalysis recomputes the average temperature for each (place, temperature) pair that arrives from the sensor network.
To start the job:
- Open a command line and navigate to the directory spark-temperature-analysis of the quickstart.
- Run the following command to start the job:
  For Linux: $SPARK_HOME/bin/spark-submit --master spark://127.0.0.1:7077 --class org.infinispan.quickstart.spark.TemperatureAnalysis target/spark-temperature-analysis-jar-with-dependencies.jar
  For Windows: %SPARK_HOME%\bin\spark-submit.cmd --master spark://127.0.0.1:7077 --class org.infinispan.quickstart.spark.TemperatureAnalysis target\spark-temperature-analysis-jar-with-dependencies.jar
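To make the recomputation concrete, here is a minimal sketch of an incremental average kept per place. It deliberately uses plain Java collections rather than the Spark Streaming API, so it is not how TemperatureAnalysis is written; the class name RunningAverages and the (sum, count) state layout are assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class RunningAverages {
    // Per-place state: index 0 holds the sum of readings, index 1 the number of readings.
    private final Map<String, double[]> state = new HashMap<>();

    /** Folds one reading into the state and returns the new average for that place. */
    double update(String place, double temperature) {
        double[] s = state.computeIfAbsent(place, p -> new double[2]);
        s[0] += temperature;
        s[1] += 1;
        return s[0] / s[1];
    }

    public static void main(String[] args) {
        RunningAverages averages = new RunningAverages();
        // Stand-in for the "avg-temperatures" cache.
        Map<String, Double> avgTemperatures = new HashMap<>();
        avgTemperatures.put("Prague", averages.update("Prague", 10.0));
        avgTemperatures.put("Vienna", averages.update("Vienna", 20.0));
        avgTemperatures.put("Prague", averages.update("Prague", 14.0));
        System.out.println("Prague: " + avgTemperatures.get("Prague")); // Prague: 12.0
        System.out.println("Vienna: " + avgTemperatures.get("Vienna")); // Vienna: 20.0
    }
}
```

Keeping only a sum and a count per place means each new reading updates the average in constant time, without storing the history of measurements.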
The directory temperature-client contains the end-user application, which is notified about average temperature changes in selected places.
To start the client application:
- Open a command line and navigate to the directory temperature-client of the quickstart.
- Run the following command to start it:
  java -jar target/temperature-client-jar-with-dependencies.jar Prague Vienna
The trailing arguments are a space-separated list of capitals to subscribe to for updates; at least one place is required. The application will listen for updates for 5 minutes.
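The subscription behaviour can be sketched as a filter over incoming average updates. This is an assumption-laden illustration rather than the client's real code: it does not use the Hot Rod listener API, and the class name AverageSubscriber and its onAverage method are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class AverageSubscriber {
    private final Set<String> places;
    private final Map<String, Double> lastSeen = new HashMap<>();

    AverageSubscriber(String... places) {
        if (places.length == 0) {
            throw new IllegalArgumentException("At least one place is required");
        }
        this.places = new HashSet<>(Arrays.asList(places));
    }

    /** Returns true if the update is for a subscribed place and its average has changed. */
    boolean onAverage(String place, double average) {
        if (!places.contains(place)) {
            return false; // not subscribed to this place
        }
        Double previous = lastSeen.put(place, average);
        return previous == null || previous.doubleValue() != average;
    }

    public static void main(String[] args) {
        // Same places as in the command line above.
        AverageSubscriber subscriber = new AverageSubscriber("Prague", "Vienna");
        Object[][] updates = {
            {"Prague", 12.0}, {"Berlin", 5.0}, {"Prague", 12.0}, {"Vienna", 18.5}
        };
        for (Object[] update : updates) {
            String place = (String) update[0];
            double average = (Double) update[1];
            if (subscriber.onAverage(place, average)) {
                System.out.printf("New average for %s: %.1f%n", place, average);
            }
        }
    }
}
```

In this run only the first Prague update and the Vienna update are reported: Berlin is not subscribed, and the repeated Prague value is unchanged.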