Setup guide

This document describes how to setup all the dependencies to run the notebooks in this repository in two different environments:

a Linux system (local or an Azure Data Science Virtual Machine (DSVM))
Azure Databricks.

Compute environments

Depending on the type of recommender system and the notebook that needs to be run, there are different computational requirements.

Currently, this repository supports the following environments:

Python CPU
Python GPU
PySpark

Setup guide for Local or DSVM

Setup Requirements

Machine running Linux, Windows Subsystem for Linux (WSL) or macOS
Anaconda with Python version >= 3.6.
- This is pre-installed on Azure DSVM, for local setup Miniconda is a quick way to get started.
Apache Spark (this is only needed for the PySpark environment).

Dependencies setup

We install the dependencies with Conda. As a pre-requisite, we want to make sure that Anaconda and the package manager Conda are both up to date:

conda update conda
conda update anaconda

We provide a script, generate_conda_file.py, to generate a conda file, depending of the environment we want to use. This will create the environment using the Python version 3.6 with all the correct dependencies.

To install each environment, first we need to generate a conda yaml file and then install the environment. We can specify the environment name with the input -n.

Click on the following menus to see more details:

Python CPU environment

Assuming the repo is cloned as Recommenders in the local system, to install the Python CPU environment:

cd Recommenders
python scripts/generate_conda_file.py
conda env create -f reco_base.yaml

Python GPU environment

Assuming that you have a GPU machine, to install the Python GPU environment, which by default installs the CPU environment:

cd Recommenders
python scripts/generate_conda_file.py --gpu
conda env create -f reco_gpu.yaml

PySpark environment

To install the PySpark environment, which by default installs the CPU environment:

cd Recommenders
python scripts/generate_conda_file.py --pyspark
conda env create -f reco_pyspark.yaml

Additionally, if you want to test a particular version of spark, you may pass the --pyspark-version argument:

python scripts/generate_conda_file.py --pyspark-version 2.4.0

NOTE - for a PySpark environment, we need to set the environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to point to the conda python executable.

To set these variables every time the environment is activated, we can follow the steps of this guide. Assuming that we have installed the environment in /anaconda/envs/reco_pyspark, we create the file /anaconda/envs/reco_pyspark/etc/conda/activate.d/env_vars.sh and add:

#!/bin/sh
export PYSPARK_PYTHON=/anaconda/envs/reco_pyspark/bin/python
export PYSPARK_DRIVER_PYTHON=/anaconda/envs/reco_pyspark/bin/python

This will export the variables every time we do conda activate reco_pyspark. To unset these variables when we deactivate the environment, we create the file /anaconda/envs/reco_pyspark/etc/conda/deactivate.d/env_vars.sh and add:

#!/bin/sh
unset PYSPARK_PYTHON
unset PYSPARK_DRIVER_PYTHON

All environments

To install all three environments:

cd Recommenders
python scripts/generate_conda_file.py --gpu --pyspark
conda env create -f reco_full.yaml

Register the conda environment as a kernel in Jupyter

We can register our created conda environment to appear as a kernel in the Jupyter notebooks.

conda activate my_env_name
python -m ipykernel install --user --name my_env_name --display-name "Python (my_env_name)"

Troubleshooting for the DSVM

We found that there can be problems if the Spark version of the machine is not the same as the one in the conda file. You can use the option --pyspark-version to address this issue.
When running Spark on a single local node it is possible to run out of disk space as temporary files are written to the user's home directory. To avoid this on a DSVM, we attached an additional disk to the DSVM and made modifications to the Spark configuration. This is done by including the following lines in the file at /dsvm/tools/spark/current/conf/spark-env.sh.

SPARK_LOCAL_DIRS="/mnt"
SPARK_WORKER_DIR="/mnt"
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true, -Dspark.worker.cleanup.appDataTtl=3600, -Dspark.worker.cleanup.interval=300, -Dspark.storage.cleanupFilesAfterExecutorExit=true"

Setup guide for Azure Databricks

Requirements of Azure Databricks

Runtime version 4.3 (Apache Spark 2.3.1, Scala 2.11)
Python 3

Repository installation

You can setup the repository as a library on Databricks either manually or by running an installation script. Both options assume you have access to a provisioned Databricks workspace and cluster and that you have appropriate permissions to install libraries.

Quick install

This option utilizes an installation script to do the setup, and it requires additional dependencies in the environment used to execute the script.

To run the script, following prerequisites are required:
Install Azure Databricks CLI (command-line interface) and setup CLI authentication. Please find details about how to create a token and set authentication here. Very briefly, you can install and configure your environment with the following commands.
pip install databricks-cli
databricks configure --token
Get the target cluster id and start the cluster if its status is TERMINATED.
You can get the cluster id from the databricks CLI with:
databricks clusters list
If required, you can start the cluster with:
databricks clusters start --cluster-id <CLUSTER_ID>`
The script also requires the zip command line utility, which may not be installed. You can install it with:
sudo apt-get update
sudo apt-get install zip

Once you have confirmed the databricks cluster is RUNNING, install the modules within this repository with the following commands:

cd Recommenders
./scripts/databricks_install.sh <CLUSTER_ID>

Manual setup

To install the repo manually onto Databricks, follow the steps:

Clone the Microsoft Recommenders repository to your local computer.
Zip the contents inside the Recommenders folder (Azure Databricks requires compressed folders to have the .egg suffix, so we don't use the standard .zip):
```
cd Recommenders
zip -r Recommenders.egg .
```
Once your cluster has started, go to the Databricks workspace, and select the Home button.
Your Home directory should appear in a panel. Right click within your directory, and select Import.
In the pop-up window, there is an option to import a library, where it says: (To import a library, such as a jar or egg, click here). Select click here.
In the next screen, select the option Upload Python Egg or PyPI in the first menu.
Next, click on the box that contains the text Drop library egg here to upload and use the file selector to choose the Recommenders.egg file you just created, and select Open.
Click on the Create library. This will upload the egg and make it available in your workspace.
Finally, in the next menu, attach the library to your cluster.

Confirm Installation

After installation, you can now create a new notebook and import the utilities from Databricks in order to confirm that the import worked.

import reco_utils

Troubleshooting Installation on Azure Databricks

For the reco_utils import to work on Databricks, it is important to zip the content correctly. The zip has to be performed inside the Recommenders folder, if you zip directly above the Recommenders folder, it won't work.

Prepare Azure Databricks for Operationalization

This repository includes an end-to-end example notebook that uses Azure Datbaricks to estimate a recommendation model using Alternating Least Squares, writes pre-computed recommendations to Azure Cosmos DB, and then creates a real-time scoring service that retrieves the recommendations from Cosmos DB. In order to execute that notebook, you must install the Recommenders repository as a library (as described above), AND you must also install some additional dependencies. Similar to above, you can do so either manually or via an installation script.

Quick install

This option utilizes an installation script to do the setup, and it requires the same dependencies as the databricks installation script (see above).

Once you have:

Installed and configured the databricks CLI
Confirmed that the appropriate cluster is RUNNING
Installed the Recommenders egg as described above
Confirmed you are in the root directory of the Recommenders repository

you can install additional dependencies for operationalization with:

scripts/prepare_databricks_for_o16n.sh <CLUSTER_ID>

This script does all of the steps described in the Manual setup section below.

Manual setup

You must install three packages as libraries from PyPI:

azure-cli
azureml-sdk[databricks]
pydocumentdb

You can follow instructions here for details on how to install packages from PyPI.

Additionally, you must install the spark-cosmosdb connector on the cluster. The easiest way to manually do that is to:

Download the appropriate jar from MAVEN. NOTE This is the appropriate jar for spark versions 2.3.X, and is the appropriate version for the recommended Azure Databricks run-time detailed above.
Upload and install the jar by:
1. Log into your Azure Databricks workspace
2. Select the Clusters button on the left.
3. Select the cluster on which you want to import the library.
4. Select the Upload and Jar options, and click in the box that has the text Drop JAR here in it.
5. Navigate to the downloaded .jar file, select it, and click Open.
6. Click on Install.
7. Restart the cluster.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SETUP.md

SETUP.md

Setup guide

Table of Contents

Compute environments

Setup guide for Local or DSVM

Setup Requirements

Dependencies setup

Register the conda environment as a kernel in Jupyter

Troubleshooting for the DSVM

Setup guide for Azure Databricks

Requirements of Azure Databricks

Repository installation

Confirm Installation

Troubleshooting Installation on Azure Databricks

Prepare Azure Databricks for Operationalization

Files

SETUP.md

Latest commit

History

SETUP.md

File metadata and controls

Setup guide

Table of Contents

Compute environments

Setup guide for Local or DSVM

Setup Requirements

Dependencies setup

Register the conda environment as a kernel in Jupyter

Troubleshooting for the DSVM

Setup guide for Azure Databricks

Requirements of Azure Databricks

Repository installation

Confirm Installation

Troubleshooting Installation on Azure Databricks

Prepare Azure Databricks for Operationalization