Data Science Virtual Machine

The Data Science Virtual Machine (DSVM) is a customized VM image on Microsoft’s Azure cloud built specifically for data science. It has many popular data science tools and other tools pre-installed and pre-configured to jump-start building intelligent applications for advanced analytics. Instead of creating a VM and then downloading and installing all of these tools yourself, which can take many hours, you can be up and running within minutes.

The DSVM is designed and configured to support a broad range of usage scenarios. You can scale your environment up or down as your project needs change, program data science tasks in your preferred language, and install other tools to customize the system for your exact needs.

It is available on Windows Server 2012, Windows Server 2016, and Linux (either Ubuntu 16.04 LTS or the CentOS-based OpenLogic 7.2 distribution).

The key scenarios for using the Data Science VM:

  • Preconfigured analytics desktop in the cloud
  • Data science training and education
  • On-demand elastic capacity for large-scale projects
  • Short-term experimentation and evaluation
  • Deep learning

The DSVM has many popular data science and deep learning tools already installed and configured. It also includes tools that make it easy to work with various Azure data and analytics products. You can explore and build predictive models on large-scale data sets using Microsoft R Server or SQL Server 2016 (note that R Server and SQL Server on the DSVM are not licensed for use on production data). A host of other tools from the open source community and from Microsoft are also included, as well as sample code and notebooks. See the full list here, and the latest new and upgraded tools here.

Finally, Windows users should check out “Ten things you can do on the Data Science Virtual Machine”, and Linux users should check out “Data science on the Linux Data Science Virtual Machine”. For more information on how to run specific tools, see “Provision the Microsoft Data Science Virtual Machine” for Windows and “Provision the Linux Data Science Virtual Machine” for Linux.

UPDATE 10/7/17: There is now a Deep Learning VM (DLVM), a specially configured variant of the DSVM made to help users jump-start deep learning on Azure GPU VMs. The DLVM uses the same underlying VM images as the DSVM and hence comes with the same set of data science tools and deep learning frameworks as the base VM. More info.

More info:

Data Science Virtual Machine – A Walkthrough of end-to-end Analytics Scenarios (video)

Introduction to the cloud-based Data Science Virtual Machine for Linux and Windows

Introducing the new Data Science Virtual Machine on Windows Server 2016

About James Serra

James is a big data and data warehousing solution architect at Microsoft. Previously he was an independent consultant working as a Data Warehouse/Business Intelligence architect and developer. He is a prior SQL Server MVP with over 25 years of IT experience.

3 Responses to Data Science Virtual Machine

  1. Nice summary, James. We’ve had some clients ask if the DSVM is a server (i.e. multi-use) or a client (i.e. single use). We’ve found they operate better when each user has their own VM spun up for them and under their control, so they can customise it or right-size it, but I’d be interested in your thoughts on that. The VMs can also be spun up from a template customised to the client environment, such as one pre-connected to GitHub, with security settings, shared Azure Storage connectivity, or other customisations. Also, just a thought, but leveraging Azure DevTest Labs for resource management across a DSVM fleet is a nice way to manage the environment.

    • James Serra says:

      Great comments, Rolf, and I agree that it’s better for each user to spin up their own VM so they can customize it and not interfere with other users’ work.

  2. Robert Sterbal says:

    Would this be a candidate for a Bitnami stack?

    It might be nice to develop a process and documentation for doing an implementation with VirtualBox.