Merge pull request #863 from openml/develop

Release OpenML 0.10.1
openml · Nov 5, 2019 · 949515f · 949515f
2 parents 0f36642 + 34d54d9
commit 949515f
Showing 76 changed files with 3,154 additions and 1,059 deletions.
diff --git a/README.md b/README.md
@@ -1,18 +1,18 @@
 [![License](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause)
 
-A python interface for [OpenML](http://openml.org). You can find the documentation on the [openml-python website](https://openml.github.io/openml-python).
-
-Please commit to the right branches following the gitflow pattern:
-http://nvie.com/posts/a-successful-git-branching-model/
+A python interface for [OpenML](http://openml.org), an online platform for open science collaboration in machine learning.
+It can be used to download or upload OpenML data such as datasets and machine learning experiment results.
+You can find the documentation on the [openml-python website](https://openml.github.io/openml-python).
+If you wish to contribute to the package, please see our [contribution guidelines](https://github.com/openml/openml-python/blob/develop/CONTRIBUTING.md).
 
 Master branch:
 
 [![Build Status](https://travis-ci.org/openml/openml-python.svg?branch=master)](https://travis-ci.org/openml/openml-python)
-[![Code Health](https://landscape.io/github/openml/openml-python/master/landscape.svg)](https://landscape.io/github/openml/openml-python/master)
+[![Build status](https://ci.appveyor.com/api/projects/status/blna1eip00kdyr25?svg=true)](https://ci.appveyor.com/project/OpenML/openml-python)
 [![Coverage Status](https://coveralls.io/repos/github/openml/openml-python/badge.svg?branch=master)](https://coveralls.io/github/openml/openml-python?branch=master)
 
 Development branch:
 
 [![Build Status](https://travis-ci.org/openml/openml-python.svg?branch=develop)](https://travis-ci.org/openml/openml-python)
-[![Code Health](https://landscape.io/github/openml/openml-python/master/landscape.svg)](https://landscape.io/github/openml/openml-python/master)
+[![Build status](https://ci.appveyor.com/api/projects/status/blna1eip00kdyr25/branch/develop?svg=true)](https://ci.appveyor.com/project/OpenML/openml-python/branch/develop)
 [![Coverage Status](https://coveralls.io/repos/github/openml/openml-python/badge.svg?branch=develop)](https://coveralls.io/github/openml/openml-python?branch=develop)
diff --git a/appveyor.yml b/appveyor.yml
@@ -43,4 +43,4 @@ build: false
 
 test_script:
   - "cd C:\\projects\\openml-python"
-  - "%CMD_IN_ENV% pytest -n 4 --timeout=600 --timeout-method=thread -sv --ignore='test_OpenMLDemo.py'"
+  - "%CMD_IN_ENV% pytest -n 4 --timeout=600 --timeout-method=thread -sv"
diff --git a/ci_scripts/install.sh b/ci_scripts/install.sh
@@ -36,12 +36,13 @@ pip install -e '.[test]'
 python -c "import numpy; print('numpy %s' % numpy.__version__)"
 python -c "import scipy; print('scipy %s' % scipy.__version__)"
 
-if [[ "$EXAMPLES" == "true" ]]; then
-    pip install -e '.[examples]'
-fi
 if [[ "$DOCTEST" == "true" ]]; then
     pip install sphinx_bootstrap_theme
 fi
+if [[ "$DOCPUSH" == "true" ]]; then
+    conda install --yes gxx_linux-64 gcc_linux-64 swig
+    pip install -e '.[examples,examples_unix]'
+fi
 if [[ "$COVERAGE" == "true" ]]; then
     pip install codecov pytest-cov
 fi
@@ -52,3 +53,5 @@ fi
 # Install scikit-learn last to make sure the openml package installation works
 # from a clean environment without scikit-learn.
 pip install scikit-learn==$SKLEARN_VERSION
+
+conda list
diff --git a/ci_scripts/test.sh b/ci_scripts/test.sh
@@ -28,7 +28,7 @@ run_tests() {
         PYTEST_ARGS=''
     fi
 
-    pytest -n 4 --durations=20 --timeout=600 --timeout-method=thread -sv --ignore='test_OpenMLDemo.py' $PYTEST_ARGS $test_dir
+    pytest -n 4 --durations=20 --timeout=600 --timeout-method=thread -sv $PYTEST_ARGS $test_dir
 }
 
 if [[ "$RUN_FLAKE8" == "true" ]]; then

diff --git a/doc/api.rst b/doc/api.rst
@@ -85,6 +85,7 @@ Modules
 
     list_evaluations
     list_evaluation_measures
+    list_evaluations_setups
 
 :mod:`openml.flows`: Flow Functions
 -----------------------------------

diff --git a/doc/contributing.rst b/doc/contributing.rst
@@ -21,20 +21,20 @@ you can use github's assign feature, otherwise you can just leave a comment.
 Scope of the package
 ====================
 
-The scope of the OpenML python package is to provide a python interface to
-the OpenML platform which integrates well with pythons scientific stack, most
+The scope of the OpenML Python package is to provide a Python interface to
+the OpenML platform which integrates well with Python's scientific stack, most
 notably `numpy <http://www.numpy.org/>`_ and `scipy <https://www.scipy.org/>`_.
 To reduce opportunity costs and demonstrate the usage of the package, it also
 implements an interface to the most popular machine learning package written
-in python, `scikit-learn <http://scikit-learn.org/stable/index.html>`_.
+in Python, `scikit-learn <http://scikit-learn.org/stable/index.html>`_.
 Thereby it will automatically be compatible with many machine learning
 libraries written in Python.
 
 We aim to keep the package as light-weight as possible and we will try to
 keep the number of potential installation dependencies as low as possible.
 Therefore, the connection to other machine learning libraries such as
 *pytorch*, *keras* or *tensorflow* should not be done directly inside this
-package, but in a separate package using the OpenML python connector.
+package, but in a separate package using the OpenML Python connector.
 
 .. _issues:
 
@@ -52,7 +52,7 @@ contains longer-term goals.
 How to contribute
 =================
 
-There are many ways to contribute to the development of the OpenML python
+There are many ways to contribute to the development of the OpenML Python
 connector and OpenML in general. We welcome all kinds of contributions,
 especially:
 
@@ -158,5 +158,67 @@ Happy testing!
 Connecting new machine learning libraries
 =========================================
 
-Coming soon - please stay tuned!
+Content of the Library
+~~~~~~~~~~~~~~~~~~~~~~
 
+To leverage support from the community and to tap into the potential of OpenML, interfacing
+with popular machine learning libraries is essential. However, the OpenML-Python team does
+not have the capacity to develop and maintain such interfaces on its own. For this reason, we
+have built an extension interface that allows others to contribute back. Building a suitable
+extension therefore requires an understanding of the current OpenML-Python support.
+
+`This example <examples/flows_and_runs_tutorial.html>`_ 
+shows how scikit-learn currently works with OpenML-Python as an extension. The *sklearn*
+extension packaged with the `openml-python <https://github.com/openml/openml-python>`_
+repository can be used as a template/benchmark to build the new extension.
+
+
+API
++++
+* The extension scripts must import the `openml` package and be able to interface with
+  any function from the OpenML-Python `API <api.html>`_.
+* The extension has to be defined as a Python class and must inherit from
+  :class:`openml.extensions.Extension`.
+* This class needs to override all the functions from `class Extension` as required.
+* The redefined functions should have adequate and appropriate docstrings. The
+  Sklearn Extension API (:class:`openml.extensions.sklearn.SklearnExtension`)
+  is a good benchmark to follow.
+
+
+Interfacing with OpenML-Python
+++++++++++++++++++++++++++++++
+Once the new extension class has been defined, the function
+:func:`openml.extensions.register_extension` must be called to allow OpenML-Python to
+interface with the new extension.
+
+
+Hosting the library
+~~~~~~~~~~~~~~~~~~~
+
+Each extension created should be a stand-alone repository, compatible with the
+`OpenML-Python repository <https://github.com/openml/openml-python>`_.
+The extension repository should work off-the-shelf with *OpenML-Python* installed.
+
+Create a `public Github repo <https://help.github.com/en/articles/create-a-repo>`_ with
+the following directory structure:
+
+::
+
+| [repo name]
+|    |-- [extension name]
+|    |    |-- __init__.py
+|    |    |-- extension.py
+|    |    |-- config.py (optionally)
+
+
+
+Recommended
+~~~~~~~~~~~
+* Test cases to keep the extension up to date with the `openml-python` upstream changes.
+* Documentation of the extension API, especially if any new functionality is added to
+  OpenML-Python's extension design.
+* Examples to show how the new extension interfaces and works with OpenML-Python.
+* Create a PR to add the new extension to the OpenML-Python API documentation.
+
+
+Happy contributing!
diff --git a/doc/index.rst b/doc/index.rst
@@ -38,7 +38,7 @@ Example
     # Publish the experiment on OpenML (optional, requires an API key.
     # You can get your own API key by signing up to OpenML.org)
     run.publish()
-    print('View the run online: %s/run/%d' % (openml.config.server, run.run_id))
+    print(f'View the run online: {openml.config.server}/run/{run.run_id}')
 
 You can find more examples in our `examples gallery <examples/index.html>`_.
 

diff --git a/doc/progress.rst b/doc/progress.rst
@@ -6,8 +6,57 @@
 Changelog
 =========
 
+0.10.1
+~~~~~~
+* ADD #175: Automatically adds the docstring of scikit-learn objects to flow and its parameters.
+* ADD #737: New evaluation listing call that includes the hyperparameter settings.
+* ADD #744: It is now possible to only issue a warning and not raise an exception if the package
+  versions for a flow are not met when deserializing it.
+* ADD #783: The URL to download the predictions for a run is now stored in the run object.
+* ADD #790: Adds the uploader name and id as new filtering options for ``list_evaluations``.
+* ADD #792: New convenience function ``openml.flow.get_flow_id``.
+* ADD #861: Debug-level log information is now written to a file in the cache directory (at most 2 MB).
+* DOC #778: Introduces instructions on how to publish an extension to support other libraries
+  than scikit-learn.
+* DOC #785: The examples section is completely restructured into simple examples, advanced
+  examples and examples showcasing the use of OpenML-Python to reproduce papers which were done
+  with OpenML-Python.
+* DOC #788: New example on manually iterating through the split of a task.
+* DOC #789: Improve the usage of dataframes in the examples.
+* DOC #791: New example for the paper *Efficient and Robust Automated Machine Learning* by Feurer
+  et al. (2015).
+* DOC #803: New example for the paper *Don’t Rule Out Simple Models Prematurely:
+  A Large Scale Benchmark Comparing Linear and Non-linear Classifiers in OpenML* by Benjamin
+  Strang et al. (2018).
+* DOC #808: New example demonstrating basic use cases of a dataset.
+* DOC #810: New example demonstrating the use of benchmarking studies and suites.
+* DOC #832: New example for the paper *Scalable Hyperparameter Transfer Learning* by
+  Valerio Perrone et al. (2019).
+* DOC #834: New example showing how to plot the loss surface for a support vector machine.
+* FIX #305: Do not require the external version in the flow XML when loading an object.
+* FIX #734: Better handling of *"old"* flows.
+* FIX #736: Attach a StreamHandler to the openml logger instead of the root logger.
+* FIX #758: Fixes an error which made the client API crash when loading a sparse data with
+  categorical variables.
+* FIX #779: Do not fail on corrupt pickle.
+* FIX #782: Assign the study id to the correct class attribute.
+* FIX #819: Automatically convert column names to type string when uploading a dataset.
+* FIX #820: Make ``__repr__`` work for datasets which do not have an id.
+* MAINT #796: Rename an argument to make the function ``list_evaluations`` more consistent.
+* MAINT #811: Print the full error message given by the server.
+* MAINT #828: Create base class for OpenML entity classes.
+* MAINT #829: Reduce the number of data conversion warnings.
+* MAINT #831: Warn if there's an empty flow description when publishing a flow.
+* MAINT #837: Also print the flow XML if a flow fails to validate.
+* FIX #838: Fix list_evaluations_setups to work when the number of evaluations is not a multiple of 100.
+* FIX #847: Fixes an issue where the client API would crash when trying to download a dataset
+  when there are no qualities available on the server.
+* MAINT #849: Move logic of most different ``publish`` functions into the base class.
+* MAINT #850: Remove outdated test code.
+
 0.10.0
 ~~~~~~
+
 * ADD #737: Add list_evaluations_setups to return hyperparameters along with list of evaluations.
 * FIX #261: Test server is cleared of all files uploaded during unit testing.
 * FIX #447: All files created by unit tests no longer persist in local.
@@ -25,6 +74,7 @@ Changelog
 * ADD #412: The scikit-learn extension populates the short name field for flows.
 * MAINT #726: Update examples to remove deprecation warnings from scikit-learn
 * MAINT #752: Update OpenML-Python to be compatible with sklearn 0.21
+* ADD #790: Add user ID and name to list_evaluations
 
 
 0.9.0
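
Two logging entries in the 0.10.1 changelog (FIX #736 and ADD #861) can be illustrated with the standard library alone. This is a sketch under assumptions, not the package's actual implementation: the logger name `openml`, the file name `openml_python.log`, the console threshold, and the use of a temporary directory in place of the cache directory are all chosen for illustration. The size cap is modelled as one 1 MiB file plus one 1 MiB backup, which is one possible reading of "at most 2 MB".

```python
import logging
import logging.handlers
import os
import tempfile

# Attach handlers to the package's own logger instead of the root logger, so
# applications that import the package keep control of their logging setup.
logger = logging.getLogger("openml")
logger.setLevel(logging.DEBUG)

# Console output: warnings and above (threshold chosen for this sketch).
console = logging.StreamHandler()
console.setLevel(logging.WARNING)
logger.addHandler(console)

# File output: debug level, size-capped through rotation.
cache_dir = tempfile.mkdtemp()  # stand-in for the real cache directory
log_path = os.path.join(cache_dir, "openml_python.log")  # hypothetical name
file_handler = logging.handlers.RotatingFileHandler(
    log_path,
    maxBytes=1024 * 1024,  # rotate once the active file reaches about 1 MiB
    backupCount=1,         # keep a single rotated backup
    delay=True,            # create the file only when first written to
)
file_handler.setLevel(logging.DEBUG)
logger.addHandler(file_handler)

# Below the console threshold, so this message reaches the file only.
logger.debug("written to the file only")
```

Note that `RotatingFileHandler` never rotates when `backupCount` is zero, so at least one backup is needed for the size cap to take effect.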

diff --git a/doc/usage.rst b/doc/usage.rst
@@ -21,11 +21,11 @@ Installation & Set up
 ~~~~~~~~~~~~~~~~~~~~~~
 
 The OpenML Python package is a connector to `OpenML <https://www.openml.org/>`_.
-It allows to use and share datasets and tasks, run
+It allows you to use and share datasets and tasks, run
 machine learning algorithms on them and then share the results online.
 
 The following tutorial gives a short introduction on how to install and set up
-the OpenML python connector, followed up by a simple example.
+the OpenML Python connector, followed up by a simple example.
 
 * `Introduction <examples/introduction_tutorial.html>`_
 
@@ -52,7 +52,7 @@ Working with tasks
 ~~~~~~~~~~~~~~~~~~
 
 You can think of a task as an experimentation protocol, describing how to apply
-a machine learning model to a dataset in a way that it is comparable with the
+a machine learning model to a dataset in a way that is comparable with the
 results of others (more on how to do that further down). Tasks are containers,
 defining which dataset to use, what kind of task we're solving (regression,
 classification, clustering, etc...) and which column to predict. Furthermore,
@@ -86,7 +86,7 @@ predictions of that run. When a run is uploaded to the server, the server
 automatically calculates several metrics which can be used to compare the
 performance of different flows to each other.
 
-So far, the OpenML python connector works only with estimator objects following
+So far, the OpenML Python connector works only with estimator objects following
 the `scikit-learn estimator API <http://scikit-learn.org/dev/developers/contributing.html#apis-of-scikit-learn-objects>`_.
 Those can be directly run on a task, and a flow will automatically be created or
 downloaded from the server if it already exists.
@@ -114,7 +114,7 @@ requirements and how to download a dataset:
 OpenML is about sharing machine learning results and the datasets they were
 obtained on. Learn how to share your datasets in the following tutorial:
 
-* `Upload a dataset <examples/create_upload_tutorial.html>`_
+* `Upload a dataset <examples/30_extended/create_upload_tutorial.html>`_
 
 ~~~~~~~~~~~~~~~~~~~~~~~
 Extending OpenML-Python

diff --git a/examples/20_basic/README.txt b/examples/20_basic/README.txt
@@ -0,0 +1,4 @@
+Introductory Examples
+=====================
+
+Introductory examples to the usage of the OpenML Python connector.
diff --git a/examples/introduction_tutorial.py b/examples/20_basic/introduction_tutorial.py
@@ -1,8 +1,8 @@
 """
-Introduction
-============
+Setup
+=====
 
-An introduction to OpenML, followed up by a simple example.
+An example of how to set up OpenML-Python, followed by a simple example.
 """
 ############################################################################
 # OpenML is an online collaboration platform for machine learning which allows
@@ -61,7 +61,7 @@
 openml.config.start_using_configuration_for_example()
 
 ############################################################################
-# When using the main server, instead make sure your apikey is configured.
+# When using the main server instead, make sure your apikey is configured.
 # This can be done with the following line of code (uncomment it!).
 # Never share your apikey with others.
 
@@ -96,7 +96,7 @@
 # For this tutorial, our configuration publishes to the test server
 # as to not crowd the main server with runs created by examples.
 myrun = run.publish()
-print("kNN on %s: http://test.openml.org/r/%d" % (data.name, myrun.run_id))
+print(f"kNN on {data.name}: http://test.openml.org/r/{myrun.run_id}")
 
 ############################################################################
 openml.config.stop_using_configuration_for_example()
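
The switch from `%` formatting to f-strings in this file produces the same output for the values involved. A minimal sketch with made-up values standing in for `data.name` and `myrun.run_id` (f-strings require Python 3.6 or newer):

```python
# Made-up stand-ins for data.name and myrun.run_id.
name = "iris"
run_id = 42

# Old style: positional substitution with type-checked placeholders.
old_style = "kNN on %s: http://test.openml.org/r/%d" % (name, run_id)

# New style: expressions are interpolated in place.
new_style = f"kNN on {name}: http://test.openml.org/r/{run_id}"

print(new_style)
```

One difference worth keeping in mind: `%d` raises a `TypeError` for a non-numeric value such as `None`, whereas the f-string would render it as text without complaint.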
diff --git a/examples/20_basic/simple_datasets_tutorial.py b/examples/20_basic/simple_datasets_tutorial.py
@@ -0,0 +1,68 @@
+"""
+========
+Datasets
+========
+
+A basic tutorial on how to list, load and visualize datasets.
+"""
+############################################################################
+# In general, we recommend working with tasks, so that the results can
+# be easily reproduced. Furthermore, the results can be compared to existing results
+# at OpenML. However, for the purposes of this tutorial, we are going to work with
+# the datasets directly.
+
+import openml
+############################################################################
+# List datasets
+# =============
+
+datasets_df = openml.datasets.list_datasets(output_format='dataframe')
+print(datasets_df.head(n=10))
+
+############################################################################
+# Download a dataset
+# ==================
+
+# Iris dataset https://www.openml.org/d/61
+dataset = openml.datasets.get_dataset(61)
+
+# Print a summary
+print(f"This is dataset '{dataset.name}', the target feature is "
+      f"'{dataset.default_target_attribute}'")
+print(f"URL: {dataset.url}")
+print(dataset.description[:500])
+
+############################################################################
+# Load a dataset
+# ==============
+
+# X - An array/dataframe where each row represents one example with
+# the corresponding feature values.
+# y - the classes for each example
+# categorical_indicator - an array that indicates which feature is categorical
+# attribute_names - the names of the features for the examples (X) and
+# target feature (y)
+X, y, categorical_indicator, attribute_names = dataset.get_data(
+    dataset_format='dataframe',
+    target=dataset.default_target_attribute
+)
+############################################################################
+# Visualize the dataset
+# =====================
+
+import pandas as pd
+import seaborn as sns
+import matplotlib.pyplot as plt
+sns.set_style("darkgrid")
+
+
+def hide_current_axis(*args, **kwds):
+    plt.gca().set_visible(False)
+
+
+# We combine all the data so that we can map the different
+# examples to different colors according to the classes.
+combined_data = pd.concat([X, y], axis=1)
+iris_plot = sns.pairplot(combined_data, hue="class")
+iris_plot.map_upper(hide_current_axis)
+plt.show()
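
The return values of `dataset.get_data` described in the tutorial's comments line up positionally: each entry of `categorical_indicator` describes the feature at the same position in `attribute_names`. The sketch below shows how the two lists might be combined. `split_feature_names` is a hypothetical helper, and the sample values are made up rather than downloaded from OpenML.

```python
from typing import List, Tuple


def split_feature_names(
    attribute_names: List[str], categorical_indicator: List[bool]
) -> Tuple[List[str], List[str]]:
    """Split feature names into (categorical, numeric) using the boolean mask."""
    if len(attribute_names) != len(categorical_indicator):
        raise ValueError("names and indicator must have the same length")
    pairs = list(zip(attribute_names, categorical_indicator))
    categorical = [name for name, is_cat in pairs if is_cat]
    numeric = [name for name, is_cat in pairs if not is_cat]
    return categorical, numeric


# Hypothetical example values, not taken from a real dataset.
names = ["sepallength", "sepalwidth", "colour"]
mask = [False, False, True]
categorical, numeric = split_feature_names(names, mask)
print(categorical, numeric)
```

A split like this could be used, for instance, to decide which columns to encode before fitting a model.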