
Automated Machine Learning (AutoML)

SAP HANA Cloud is enriched with an Automated Machine Learning (AutoML) approach. AutoML can be helpful, for example, to give a data scientist a head start in quickly finding a first machine learning model.

A machine learning model is a program that can find patterns or make decisions from a previously unseen dataset. This is made possible by first 'training' the model with a large dataset. During training, the machine learning algorithm is optimized to find certain patterns or outputs from the dataset, depending on the task. The output of this process - often a computer program with specific rules and data structures - is called a machine learning model.

AutoML with SAP HANA Cloud is a great starting point to see what is possible with a dataset, and whether it is worth investing more time into a use case.

Unique benefits of PAL (Predictive Analytics Library) AutoML include:

  • Improved PAL models and business impact
  • Composite pipeline models of multiple PAL algorithms
  • Automated algorithm comparison and selection, as well as parameter search and optimal selection
  • ML predictions of higher accuracy/value

Productivity uplift and expert experience

  • Expert data scientists derive the best models in less time, with better utilization of compute time
  • A comparable AutoML expert experience that addresses a trending/competitive capability gap and improves time to maximum value with PAL

For this section, a data set containing customer transactions has already been loaded into the SAP HANA Cloud database as the table GX_TRANSACTIONS.

The challenge is to predict whether a transaction is fraudulent or not. Such use cases are often quite challenging due to imbalanced data and thus require different techniques before implementing a machine learning model.


Try it out!

Start this lesson by downloading the following zipped JupyterLab notebook. Save the file locally and extract the notebook file. This file will be uploaded into the Jupyter environment later in the exercise.

The data resides in the HDI container schema, in the table GX_TRANSACTIONS. The data is accessed from the Python environment directly in SAP HANA Cloud, leveraging the native AutoML capability of SAP HANA Cloud.

All steps in this exercise will be based on the embedded Python code in the Notebook.

Setting up the environment

  1. To execute PAL and AutoML methods, the HDI container's application user (XXX_RT) needs to be assigned either the role AFL__SYS_AFL_AFLPAL_EXECUTE or AFL__SYS_AFL_AFLPAL_EXECUTE_WITH_GRANT_OPTION.

  2. We will grant the required roles to the HDI container's application user by creating a user-provided service and an .hdbgrants artifact.

  3. Open SAP Business Application Studio and click on mta.yaml in your project to view it. Observe that in the requires section of the db module and in the resources section, there is a reference to the SAP HANA HDI service instance that is bound to the application.

  4. Go to the SAP HANA PROJECTS view and click on add database connection.

  5. If you are prompted with Cloud Foundry Sign In, proceed by selecting SSO passcode and clicking on Open a new browser page to generate your SSO passcode, as shown in step 6 of this guide. Skip to the next step if Cloud Foundry Sign In is not prompted.

  6. In the Add Database Connection wizard, select Create user-provided service instance from the drop-down.

  7. Enter the following details and click Add:

     Service instance name: DA263-XXX (replace XXX with your user login number)
     User name: HDI_ML_GRANTOR
     Password: Walldorf11

  8. Check the mta.yaml file. Additional references to the created user-provided service will have been added automatically.

  9. Open the generated DA263-XXX.hdbgrants file and replace its content with the following:

```json
{
    "ServiceName_1": {
        "application_user": {
            "roles": [
                "HDI_ML_GRANTOR_ROLE"
            ]
        }
    }
}
```

The above .hdbgrants artifact grants the role HDI_ML_GRANTOR_ROLE to the HDI container's application user (the XXX_RT user) using the user-provided service.

  10. Deploy the database module.

  11. Now the HDI container's application user has the required roles to execute AutoML methods. The application user and password can be found in the .env file.

  12. To view the properties of .env in JSON format, create a dummy_env.json file and copy the contents of the .env file into it. Remove the VCAP_SERVICES=' prefix at the beginning and the single quote (') at the end.

  13. Right-click on the content and select Format Document. We will use the host, port, schema, user, and password values of the hdi-shared service to connect to the HDI container and execute AutoML methods.

Note: Please make a note of the above values; they will be used in the following exercises.

Setting up the dev space with Python tools

  1. Let us create a new dev space with Python tools enabled. Open SAP Business Application Studio in a new window.

  2. Create a new Dev Space

  3. Provide any name for the new dev space, select SAP Fiori, and on the right under Additional SAP Extensions, select Python Tools. Click on Create Dev Space.

  4. Once the new dev space is in the RUNNING state, click on the dev space.

  5. Once the workspace is loaded, we will create a folder to store our Python Jupyter notebook.

  6. Open a terminal and execute the following commands:

```shell
$ cd ~/projects/
$ mkdir auto_ml_hana
$ cd auto_ml_hana/
```
  7. Open the folder in the workspace.

  8. In the terminal, download pip, install it, add its location to PATH, and then install the hana-ml and hdbcli Python packages, which will be imported later in the notebook:
```shell
$ curl https://bootstrap.pypa.io/get-pip.py > get-pip.py && python3 get-pip.py && echo "export PATH=/home/user/.local/bin:$PATH" >> ~/.bashrc && source ~/.bashrc
$ pip install hdbcli hana-ml
$ pip install shapely
```
  9. After the installations are done, download the Jupyter notebook from this location FraudDetection_AUtoML.ipynb and extract the file.

  10. Drag and drop the notebook file onto the Explorer pane, or right-click on the Explorer pane and select Upload. Browse to the location where the file has been saved, select it, then click Open.

  11. The file will now appear in the Explorer pane. Double-click on the file to open it.

  12. The Notebook is now ready!

Analysis

  1. The first step is to install and import the SAP HANA Cloud ML library. Select the first cell and click the execute button (a minimal sketch of this cell follows the tip below).

Tip: Use the keyboard shortcut SHIFT+ENTER to execute the code cells throughout the rest of the Notebook.
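A minimal sketch of what this first cell typically contains, assuming the hana-ml package installed earlier in the dev space setup:

```python
# Import the SAP HANA ML client (installed earlier via `pip install hana-ml`)
import hana_ml

# Confirm which version of the client is installed
print(hana_ml.__version__)
```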

  2. The hana_ml library enables you to directly connect to your SAP HANA Cloud tenant. Use the connection details in the table below; get the host, port, schema, user, and password of your HDI container service from the .env file as shown in step 13 of the environment setup. A connection sketch follows the table.

| Input | Value |
| --- | --- |
| Host name | {placeholder} |
| Host Port | 443 |
| Username | {placeholder} |
| Password | {placeholder} |
| HANA Encrypt | True |
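A minimal connection sketch, assuming the placeholders are replaced with the values from your .env file:

```python
import hana_ml.dataframe as dataframe

# Fill in host, user, and password from your HDI container's .env file
conn = dataframe.ConnectionContext(
    address='<host>',        # placeholder
    port=443,
    user='<user>',           # placeholder
    password='<password>',   # placeholder
    encrypt='true'
)

# Verify that the connection was established
print(conn.connection.isconnected())
```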
  3. Create a data frame through SQL or a table function and get the row count.
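A sketch of both variants, using the GX_TRANSACTIONS table from the HDI container schema:

```python
# Reference the table directly ...
df_remote = conn.table('GX_TRANSACTIONS')
# ... or express the same thing in SQL:
# df_remote = conn.sql('SELECT * FROM GX_TRANSACTIONS')

# The row count is computed in SAP HANA; no data is transferred
print(df_remote.count())
```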

  4. Check the data and convert the following variables accordingly.

  5. Verify the conversion and take a look at a short description of the data (a sketch of both steps follows the notes below).

Note: The target variable is called Fraud. In addition, there are eight predictors capturing different information from a transaction.

Note: Data types have been altered accordingly.
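A rough sketch of the conversion and the check, assuming a target column named FRAUD (the exact column list comes from the notebook):

```python
# Hypothetical cast: treat the target as a categorical string column
df_remote = df_remote.cast('FRAUD', 'NVARCHAR(10)')

# Verify the conversion and view summary statistics computed in-database
print(df_remote.dtypes())
print(df_remote.describe().collect())
```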

  6. Split the data into a training and testing set.

  7. Check the size of the training and testing datasets (a sketch of the split and the check follows).
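A sketch of both steps, assuming a 70/30 split and a hypothetical ID column named TRANSACTION_ID:

```python
from hana_ml.algorithms.pal.partition import train_test_val_split

# Random 70/30 train/test split, performed in SAP HANA
df_train, df_test, _ = train_test_val_split(
    data=df_remote,
    id_column='TRANSACTION_ID',   # hypothetical key column
    random_seed=1234,
    partition_method='random',
    training_percentage=0.7,
    testing_percentage=0.3,
    validation_percentage=0.0
)

# Check the size of the resulting datasets
print(df_train.count(), df_test.count())
```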

  8. Import the following dependencies for the Automatic Classification.

  9. Manage the workload in your SAP HANA Cloud tenant by creating workload classes. Please execute the following SQL script to create the workload class that will be used in the Automatic Classification (a sketch follows the note below).

Note: Ignore the error if the workload class PAL_AUTOML_WORKLOAD already exists.
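A sketch of this step executed through the existing connection; the limit values below are illustrative assumptions, not prescribed settings:

```python
# Create the workload class that bounds the Automatic Classification run
cursor = conn.connection.cursor()
try:
    cursor.execute("""
        CREATE WORKLOAD CLASS "PAL_AUTOML_WORKLOAD"
        SET 'PRIORITY' = '3',
            'STATEMENT MEMORY LIMIT' = '4',
            'STATEMENT THREAD LIMIT' = '20'
    """)
except Exception:
    pass  # ignore the error if PAL_AUTOML_WORKLOAD already exists
finally:
    cursor.close()
```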


Run process

The AutoML approach automatically executes data processing, model fitting, comparison, and optimization.

First, create an AutoML classifier object auto_c in the following cell. It is helpful to review and set the respective AutoML configuration parameters:

  • The defined scenario will run two iterations of pipeline optimization. The total number of pipelines that will be evaluated equals population_size + generations × offspring_size; hence, in this case, 15 pipelines.
  • With elite_number, you specify how many of the best pipelines you want to compare. Setting random_seed=1234 helps to get reproducible AutoML runs.
  • Set the maximum runtime for individual pipeline evaluations with the parameter max_eval_time_mins, or determine whether AutoML shall stop early if there is no improvement for the set number of generations with the early_stop parameter. Further, you can set specific performance measures for the optimization with the scoring parameter. A sketch of the resulting classifier object follows the note below.

Important! Change <YourName> to your user id {DA263-XXX} in the .format() method.
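A sketch of the classifier creation with the parameters discussed above; the concrete values and the progress indicator id are assumptions:

```python
from hana_ml.algorithms.pal.auto_ml import AutomaticClassification

# population_size + generations x offspring_size = 5 + 2 x 5 = 15 pipelines
auto_c = AutomaticClassification(
    generations=2,
    population_size=5,
    offspring_size=5,
    elite_number=5,
    random_seed=1234,
    # Replace <YourName> with your user id {DA263-XXX}, as noted above
    progress_indicator_id='AutoML-classification-{}'.format('<YourName>')
)
# Optional: max_eval_time_mins, early_stop, and scoring measures can further
# bound and steer the search, as described in the list above

# Run the search under the workload class created earlier
auto_c.enable_workload_class('PAL_AUTOML_WORKLOAD')
```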

  1. Reinitialize and display the AutoML operators and their parameters.

Note: A default set of AutoML classification operators and parameters is provided as the global config-dict, which can be adjusted to the needs of the targeted AutoML scenario. Use methods like update_config_dict, delete_config_dict, and display_config_dict to update the scenario definition.
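A minimal sketch of the display step:

```python
# Show the current AutoML operators and their parameter search spaces
auto_c.display_config_dict()
```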

  2. Adjust some of the settings to narrow the search space. As the resampling method, choose SMOTETomek, since the data is imbalanced.

  3. Exclude the Transformer methods. As machine learning algorithms, keep the Hybrid Gradient Boosting Tree and the Multi Logistic Regression.

  4. Set some parameters for the optimization of the algorithms.

  5. Review the complete AutoML configuration for the classification (a combined sketch of steps 2 to 5 follows).
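A combined sketch of these adjustments; the operator names, parameter names, and values below are assumptions, so verify the exact spelling with display_config_dict() for your hana-ml version:

```python
# Narrow the search space (hypothetical operator/parameter names)
auto_c.delete_config_dict(category='Transformer')         # exclude Transformer methods
auto_c.delete_config_dict(operator_name='NB_Classifier')  # drop classifiers other than
auto_c.delete_config_dict(operator_name='DT_Classifier')  # HGBT and Multi Logistic Regression
auto_c.update_config_dict('HGBT_Classifier', 'ETA', [0.1, 0.3, 0.5])  # hypothetical tuning grid

# Review the complete configuration for the classification
auto_c.display_config_dict()
```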

  6. Fit the AutoML scenario on the training data. This may take a couple of minutes. If it takes too long, exclude SMOTETomek from the resampler settings in the config.
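A sketch of the fit call; the key and label column names are assumptions carried over from the earlier steps:

```python
# Run the AutoML search in SAP HANA under the workload class set earlier
auto_c.fit(data=df_train, key='TRANSACTION_ID', label='FRAUD')
```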

  7. Inspect the pipeline progress through the execution logs.

  8. Evaluate the best model on the testing data.

  9. Create predictions with your machine learning model (a sketch of the evaluation and prediction follows).
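A sketch under the same column-name assumptions as the fit step:

```python
# Score the best pipeline on the held-out testing data
scores = auto_c.evaluate(data=df_test, key='TRANSACTION_ID', label='FRAUD')
print(scores.collect())

# Predict on unseen rows (drop the label column before predicting)
preds = auto_c.predict(data=df_test.deselect('FRAUD'), key='TRANSACTION_ID')
print(preds.head(5).collect())
```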

  10. Save the best model in SAP HANA. To do so, create a Model Storage. Change 'YourSchema' in the code below to the schema of your HDI container.

Note: Please note that we are using the HDI container schema here for this exercise to save the ML models in tables in SAP HANA Cloud. This approach is not recommended in production, as all objects in an HDI container schema should be created via HDI design-time deployment. In production, it is recommended to use a plain schema for saving AutoML models.

  11. Save the model through the following command.
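A sketch of both steps; the model name is a hypothetical choice:

```python
from hana_ml.model_storage import ModelStorage

# Create the Model Storage in your HDI container schema (see the note above)
ms = ModelStorage(connection_context=conn, schema='YourSchema')

# Name and persist the best model found by the AutoML run
auto_c.name = 'AutoML Fraud Detection Model'  # hypothetical model name
ms.save_model(model=auto_c)
```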

Further Reading

For further information on AutoML with SAP HANA Cloud, check out the following links:

  • SAP HANA Cloud PAL AutoML Documentation
  • Python ML client for SAP HANA AutoML Reference Document
  • GitHub Repository with Example Code

Congratulations! This concludes the lesson on Automated Machine Learning in SAP HANA Cloud and finishes today's workshop.