OpenML Python ideas for Google Summer of Code

Joaquin Vanschoren edited this page Mar 16, 2015 · 10 revisions

About OpenML

OpenML.org is an open science platform for machine learning. It allows machine learning researchers to

  • share and download machine learning datasets. These datasets are auto-analyzed and annotated so you can filter through 1000s of datasets based on certain properties.
  • create tasks (such as classification, regression, clustering) on these datasets. OpenML auto-creates machine readable descriptions (given inputs, required outputs) so that software tools can automatically execute many of these tasks.
  • share and download programs that solve these tasks and generate the required output (e.g. models and predictions)
  • share and download the results of experiments, so you can immediately compare your models to the state of the art and build on the work of others (e.g. for meta-analysis)

What makes OpenML unique is that it is directly integrated into many machine learning tools and libraries. These integrations allow datasets, code and experiments to be shared automatically (even in the background) without requiring extra work from the user/researcher, except for entering login details. Users can log their research privately, work in teams, or work completely in the open. The integration also auto-annotates all experiments so that they are fully reproducible. Finally, everything is organized online: experiments are linked to datasets and code, as well as to the resulting models, a huge range of evaluation measures and instance-level predictions. OpenML also auto-evaluates models online to allow objective comparisons.

OpenML is also completely open source. It is language-agnostic, operating through an extensive REST API, but has language-specific APIs in languages such as Java, R, and Python. Currently, the Python API is still lacking many features, and we need talented people to help.

How it is going to change the world

Our goals are to revolutionize the way we do data science research:

  • Change the scale of collaboration in data-driven research: from small teams to massive collaborations. OpenML allows you to effortlessly collaborate with everyone in the world, and all contributions are auto-organized online.
  • Change the speed of collaboration: from days to seconds. OpenML allows you to collaborate in real time: your research can be auto-shared online, and people all over the world can work with it the second it appears.
  • Automate many aspects of data science research: you can easily evaluate new algorithms on 1000s of datasets, run many algorithms on a new dataset, or run very large studies more easily. OpenML makes it easy to access data and code and organizes all experiments online.
  • Make data-driven science reproducible: all experiments are auto-annotated with all details so that others (and you) can easily reproduce and build on them.
  • Make interdisciplinary research easier: datasets and code are better annotated and easy to access. This will allow researchers in drug discovery, bioinformatics (e.g. genetic epidemiology, cancer research), biology and many more sciences to share data in a way that data scientists can understand easily. Vice versa, data scientists auto-share their methods and results in a way that is easy for domain scientists to understand, so that they can see which methods work well on their data.

You can watch a general talk on OpenML here: https://www.youtube.com/watch?v=J84Eg-0RlCk

Who is using OpenML

OpenML is very young and we haven't started announcing it widely, but it is already used by over 300 registered machine learning researchers and visited by over 100 people every day (registration is only required to upload data). Its users also include domain scientists, for example in drug discovery and cancer research, who use the platform to collaborate online. Users have already shared 1000+ datasets, 1000+ algorithms and 300,000+ experiments.

Contacting OpenML

OpenML operates mostly through GitHub. Please use the issue tracker to announce yourself and start a conversation.

We opened a thread to announce yourself and discuss GSoC projects here: https://github.com/openml/python/issues/3

Also feel free to use our mailing list: [email protected]

Getting Started

Setup instructions for the current Python API can be found here: http://openml.readthedocs.org/en/latest/#

Code and discussions are organized on GitHub: https://github.com/openml/python

Information on the OpenML API can be found here: http://openml.org/guide

To get started, try out the current Python API as described in the projects below and suggest improvements. We use GitHub's issue tracker to report bugs and suggest features: https://github.com/openml/OpenML/issues. However, please report bugs and ideas specific to the Python API here: https://github.com/openml/python/issues

Skills required

  • Python programming (2 and 3)
  • Basic git/github skills
  • For the scikit-learn projects: practical machine learning experience in Python, e.g. by using a package like scikit-learn or pylearn2

Test

Here are two ways to show your skills:

  • Extend the documentation section working with tasks. It should include a walkthrough on how to:
    • download a task together with a dataset
    • load train/test splits from OpenML
    • evaluate a random forest from scikit-learn on that task
  • Implement one of the following missing API calls:
    • authenticate.check: Prior to sending a REST call to the OpenML server, the Python program should check whether the user is still authenticated. If not, it should re-authenticate the user.
    • run.get: This call should download all information about a run and store it in a python object.
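The two missing calls above could be approached along the following lines. This is only a minimal sketch and not the existing OpenML Python API: the `Session` and `Run` classes, the `ensure_authenticated` and `parse_run` helpers, the injected `login` callable and the simplified XML layout are all assumptions made for illustration, and the real server responses may look different.

```python
import time
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

# Assumed namespace and element names; the real run description may differ.
OML = "{http://openml.org/openml}"


@dataclass
class Session:
    """Hypothetical container for a session token and its expiry (epoch seconds)."""
    token: Optional[str] = None
    valid_until: float = 0.0

    def is_valid(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        return self.token is not None and now < self.valid_until


def ensure_authenticated(
    session: Session,
    login: Callable[[], Tuple[str, float]],
    now: Optional[float] = None,
) -> Session:
    """Call `login` only when the stored session is missing or expired."""
    if not session.is_valid(now):
        session.token, session.valid_until = login()
    return session


@dataclass
class Run:
    """Hypothetical Python object holding the information about one run."""
    run_id: int
    task_id: int
    parameters: Dict[str, str] = field(default_factory=dict)


def parse_run(xml_text: str) -> Run:
    """Turn a (simplified, assumed) run description into a Run object."""
    root = ET.fromstring(xml_text)
    parameters = {
        p.findtext(OML + "name"): p.findtext(OML + "value")
        for p in root.findall(OML + "parameter_setting")
    }
    return Run(
        run_id=int(root.findtext(OML + "run_id")),
        task_id=int(root.findtext(OML + "task_id")),
        parameters=parameters,
    )


# Made-up example document, used only to exercise parse_run.
SAMPLE_RUN_XML = """<oml:run xmlns:oml="http://openml.org/openml">
  <oml:run_id>7</oml:run_id>
  <oml:task_id>1</oml:task_id>
  <oml:parameter_setting>
    <oml:name>n_estimators</oml:name>
    <oml:value>100</oml:value>
  </oml:parameter_setting>
</oml:run>"""

run = parse_run(SAMPLE_RUN_XML)
```

Passing the login function in as an argument keeps the check independent of how authentication is actually performed, which also makes it easy to test without a server.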

GSOC Project ideas

Python is one of the leading languages for developing machine learning systems, and offers many mature modules such as scikit-learn. What is still missing is the possibility for these systems to share their datasets, workflows and experiments in a structured way so that others can build on them. Bridging Python and OpenML would be very useful for all Python programmers interested in machine learning. The project consists of one main goal and several smaller goals which build upon it.

A better OpenML Python API

  • Description: Improve the current module that interfaces Python programs with the OpenML web API. This will allow scientists to download and share machine learning datasets, tasks, programs and experiments. A prototype implementation exists, but GSOC participants can take this much further and develop a powerful interface between the OpenML and Python communities.
  • Skills: Python, working with REST APIs, knowledge of how OpenML works
  • Difficulty level: Intermediate
  • Related Readings/Links: You can read documentation on the current API: http://openml.readthedocs.org/en/latest/# and read current discussions on the issue tracker: https://github.com/openml/python/issues
  • Potential mentors: Matthias Feurer, Joaquin Vanschoren, Jan van Rijn

Integration in scikit-learn

  • Description: Connect OpenML closely to scikit-learn, so that powerful new functions can be offered for large-scale machine learning experiments. For instance, scikit-learn algorithms could automatically be run on 1000s of OpenML datasets and the results could be immediately uploaded and organized online, where they can be compared to the state of the art, including algorithms from other platforms such as R, WEKA and MOA. For inspiration on what is possible, here is a talk about how OpenML is integrated in mlr: https://www.youtube.com/watch?v=rzjkT1uLNi4
  • Skills: Python, working with REST APIs, knowledge of how OpenML works
  • Difficulty level: Intermediate
  • Related Readings/Links: You can read documentation on the current API: http://openml.readthedocs.org/en/latest/# and read current discussions on the issue tracker: https://github.com/openml/python/issues. scikit-learn documentation can be found here: http://scikit-learn.org/stable/
  • Potential mentors: Matthias Feurer, Andreas Mueller, Joaquin Vanschoren
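To give a rough idea of what running an algorithm on predefined train/test splits involves, here is a minimal sketch in plain Python. It is not part of the existing API: `evaluate_on_splits`, `MajorityClassifier` and the toy data are hypothetical, and the splits are written out by hand rather than downloaded from OpenML. The stand-in classifier follows the `fit`/`predict` convention used by scikit-learn estimators, so the assumption is that an estimator such as a random forest could be passed in its place.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# One split is a pair (train indices, test indices) into the dataset.
Split = Tuple[List[int], List[int]]


class MajorityClassifier:
    """Stand-in estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def evaluate_on_splits(estimator, X: Sequence, y: Sequence, splits: Sequence[Split]) -> List[float]:
    """Fit on each train fold and return the accuracy on the matching test fold."""
    scores = []
    for train_idx, test_idx in splits:
        estimator.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        predicted = estimator.predict([X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(predicted, test_idx))
        scores.append(correct / len(test_idx))
    return scores


# Toy data and hand-written splits, standing in for a downloaded task.
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = ["a", "a", "a", "a", "b", "b"]
splits = [([0, 1, 2, 3], [4, 5]), ([1, 2, 3, 4], [0, 5])]

scores = evaluate_on_splits(MajorityClassifier(), X, y, splits)
print(scores)  # [0.0, 0.5]
```

Using the splits supplied with a task, instead of generating new ones locally, is what makes results from different tools comparable on the same task.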

A connection with IPython Notebooks / Jupyter

  • Description: IPython Notebooks are a popular way to share machine learning studies. It would be great if they could be incorporated into OpenML so that machine learning practitioners can import datasets directly from OpenML into notebooks and thus quickly run many machine learning studies. This would be a major step forward for massive-scale reproducible machine learning research. The Python API may need to be extended if additional functionality is required.
  • Skills: Python, scikit-learn
  • Difficulty level: Intermediate
  • Related Readings/Links: You can read documentation on the current API: http://openml.readthedocs.org/en/latest/# and read current discussions on the issue tracker: https://github.com/openml/python/issues. Documentation on IPython Notebooks: http://ipython.org/notebook.html and Jupyter: http://jupyter.org/
  • Potential mentors: Matthias Feurer, Joaquin Vanschoren, Yongming Luo