Distributed Writes

In SQL Server, scaling out reads (e.g. using active secondary replicas via AlwaysOn Availability Groups) is much easier than scaling out writes.  So what are your options when you have a tremendous volume of writes that scaling up will not handle, no matter how big your server is?  There are a number of options that allow you to write to many servers (instead of writing to one master server) that I’ll call distributed writes.  Here are some ideas:

The one option among those above that does not require coding and can support a large number of writes per second is Azure Cosmos DB.  All the other options can require significant coding and/or can only handle a limited number of writes per second.  This is because Cosmos DB stores documents (JSON) that contain all the information needed, so no joins are required, and the documents can be spread across multiple servers (see Partition and scale in Azure Cosmos DB and A technical overview of Azure Cosmos DB).  This is in contrast to relational databases, which use multiple tables that must be joined; if those tables are on different nodes, the joins cause a lot of data shuffling and performance problems.
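
To make the document-spreading idea concrete, here is a minimal sketch of hash-based partitioning in Python: each document's partition key is hashed to pick a node, so writes fan out across servers with no cross-node joins.  This illustrates the general technique, not Cosmos DB's actual algorithm, and the field names are made up.

```python
import hashlib

def assign_partition(partition_key: str, node_count: int) -> int:
    """Map a document's partition key to one of `node_count` nodes.

    A stable hash (not Python's process-randomized hash()) keeps the
    mapping consistent across processes, as a partitioned store needs.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count

# Documents with different partition keys spread across nodes,
# so write load is distributed instead of hitting one master.
docs = [{"id": i, "userId": f"user{i}"} for i in range(1000)]
placement = {}
for doc in docs:
    node = assign_partition(doc["userId"], node_count=4)
    placement.setdefault(node, []).append(doc["id"])
```

Because each document is self-contained, a write only ever touches the one node its key hashes to, which is what makes the scale-out linear.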

To go into greater detail on the benefits of Cosmos DB over SQL Server for distributed writes:

  • Consistency
    • Peer-to-peer SQL replication introduces pains around data consistency and conflict resolution
  • Availability
    • Sharding with SQL introduces pains around maintaining availability when increasing/decreasing the degree of scale-out.  Frequently, downtime is involved due to the need to re-balance data across shards
    • SQL requires rigid schemas and indexes to be defined upfront.  Every time schema or index updates are needed, you incur the heavy operational cost of running CREATE INDEX and ALTER TABLE scripts across all database shards and replicas.  Furthermore, this introduces availability issues while schemas are being altered.
  • Handling sustained heavy write ingestion
    • Putting queueing mechanisms in front of SQL only gives you a buffer for handling spikes in writes; at the end of the day, the database itself needs to support sustained heavy write ingestion in order to consume the buffered events.  What happens if events come into the buffer faster than you drain it?  You will need a database specifically designed for heavy write ingestion
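
The buffer argument can be made concrete with a toy model (the rates below are invented): if sustained ingest exceeds the drain rate, the backlog grows linearly and no buffer size saves you.

```python
def backlog_after(seconds: int, ingest_rate: int, drain_rate: int) -> int:
    """Queue depth after `seconds`, given events/sec in and events/sec out.

    A buffer only absorbs short spikes; sustained ingest above the
    drain rate grows the backlog without bound until it overflows.
    """
    backlog = 0
    for _ in range(seconds):
        backlog = max(0, backlog + ingest_rate - drain_rate)
    return backlog

# Sustained 10,000 writes/sec against a database draining 7,000/sec:
# after one hour the queue holds 10.8 million undrained events.
hourly_backlog = backlog_after(3600, ingest_rate=10_000, drain_rate=7_000)
```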

Azure Cosmos DB solves these by:

  • Providing five well-defined consistency models to help developers choose the right consistency vs. performance tradeoff for their scenario
  • Scaling on demand and supporting a flexible data model while maintaining high availability (99.99% availability SLA).  Scale-out and partition management are handled by the service on behalf of the user
  • Using log-structured techniques to be a truly latch-free database that sustains heavy write ingestion with durable persistence

In the end, eliminating schema and index management and JOINs is a necessary byproduct of the scale-out that Azure Cosmos DB provides.

After the initial post of this blog, I received the question “Why not just use SQL 2016 in-Memory tables for heavy write systems (info)?” and received a great reply from a Cosmos DB product manager:

SQL in-memory is only viable when:

  • Request and data volume are small enough to fit on a single machine.  You still have the fundamental problem of hard limits due to the scale-up model.
  • The scenario does not need durability, reliability, or availability – which are requirements for >99% of mission-critical customer scenarios.


  • If data is kept only in memory, you experience data loss upon any intermittent failure that requires the machine to restart (e.g. OS crash, power outage, an OS-initiated reboot for updates, etc.).  For data to be durable, it needs to be persisted to disk, and to offer resiliency against disk failures, it needs to be replicated across a number of disks
  • For durable scenarios, memory only acts as a buffer to absorb spikes.  To achieve sustained write ingestion, you need to flush the buffer as fast as you fill it.  Now you have a bottleneck on disk I/O unless you scale out
  • This is why they immediately have to clarify that this is for “applications where durability is not required”; durability is a requirement for >99% of data scenarios
  • Data loss and data corruption should be treated as a cardinal sin for any data product


  • This model is still a scale-up model, in which there are many hard limits
  • What happens when the data volume doesn’t fit in memory (which tends to be very small relative to disk storage)?  You need to scale out
  • What happens when the request volume exceeds what memory bandwidth can serve?  You need to scale out
  • This is why the throughput numbers in the blog are orders of magnitude smaller than what customers are doing every day on Cosmos DB, and the question of storage size is quietly ignored


  • Memory is 100x more expensive than SSD.  Achieving high storage in a scale-out system yields not only better scale and durability characteristics but also much lower costs for any large-scale scenario

More info:

Database Sharding

Posted in SQL Server, SQLServerPedia Syndication | 9 Comments

Azure Archive Blob Storage

Last week Microsoft released a public preview of a new service called Azure Archive Blob Storage, offering customers a lower-cost cloud storage solution for rarely accessed data.  This allows for storage tiering, where organizations can place their critical data on expensive, high-performance storage and then move it down the line as it winds up being accessed less frequently over time.

Last year Microsoft introduced Azure Cool Blob storage, which cost customers a penny per GB per month in some Azure regions.  Now, users have another, lower-cost option in Azure Archive Blob Storage, along with new Blob-Level Tiering data lifecycle management capabilities.  So there are now three Azure blob storage tiers: Hot, Cool, and Archive.

Azure Archive Blob Storage costs 0.18 cents per GB per month when the service is delivered through the East US 2 data center (for comparison, in the same region Hot is 1.8 cents and Cool is 1.0 cents per GB per month).  Customers can expect a 99 percent availability SLA (service level agreement) when the service makes its way out of the preview stage.

Complementing the new service is a new Blob-level Tiering feature that will allow customers to change the access tier of blob storage objects among Hot, Cool or Archive.  Also in preview, it enables users to match costs to usage patterns without moving data between accounts.

Archive storage has the lowest storage cost and highest data retrieval costs compared to hot and cool storage.
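
Using the preview East US 2 prices quoted above, the storage-cost arithmetic looks like this (a rough sketch that ignores transaction and retrieval fees, which are exactly where the archive tier gets expensive):

```python
# Per-GB monthly prices (USD) quoted above for East US 2.
PRICE_PER_GB = {"hot": 0.018, "cool": 0.010, "archive": 0.0018}

def monthly_storage_cost(gb: float, tier: str) -> float:
    """Storage-only monthly cost; excludes transaction and retrieval
    fees, which are much higher for the archive tier."""
    return gb * PRICE_PER_GB[tier]

# 100 TB of rarely-read backups: archive is 10x cheaper than hot.
hot_cost = monthly_storage_cost(100_000, "hot")          # about $1,800/month
archive_cost = monthly_storage_cost(100_000, "archive")  # about $180/month
```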

While a blob is in archive storage, it cannot be read, copied, overwritten, or modified.  Nor can you take snapshots of a blob in archive storage.  However, you may use existing operations to delete, list, get blob properties/metadata, or change the tier of your blob.  To read data in archive storage, you must first change the tier of the blob to hot or cool.  This process is known as rehydration and can take up to 15 hours to complete for blobs less than 50 GB.  Additional time required for larger blobs varies with the blob throughput limit.

During rehydration, you may check the “archive status” blob property to confirm if the tier has changed.  The status reads “rehydrate-pending-to-hot” or “rehydrate-pending-to-cool” depending on the destination tier.  Upon completion, the “archive status” blob property is removed, and the “access tier” blob property reflects the hot or cool tier.
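
The rehydration lifecycle described above can be modeled as a small state machine.  This is a local simulation of the documented behavior, not a call to the real storage API:

```python
class BlobTierSim:
    """Toy model of archive-blob rehydration status transitions."""

    def __init__(self):
        self.access_tier = "archive"
        self.archive_status = None  # property only present mid-rehydration

    def set_tier(self, destination: str) -> None:
        if self.access_tier == "archive" and destination in ("hot", "cool"):
            # Rehydration is asynchronous; the status names the destination.
            self.archive_status = f"rehydrate-pending-to-{destination}"
            self._destination = destination

    def complete_rehydration(self) -> None:
        # On completion the status property is removed and the
        # access tier reflects the hot or cool destination.
        self.access_tier = self._destination
        self.archive_status = None

blob = BlobTierSim()
blob.set_tier("hot")
status_during = blob.archive_status   # "rehydrate-pending-to-hot"
blob.complete_rehydration()
tier_after = blob.access_tier         # "hot"
```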

Example usage scenarios for the archive storage tier include:

  • Long-term backup, archival, and disaster recovery datasets
  • Original (raw) data that must be preserved, even after it has been processed into its final usable form (for example, raw media files after transcoding into other formats)
  • Compliance and archival data that needs to be stored for a long time and is hardly ever accessed (for example, security camera footage, old X-rays/MRIs for healthcare organizations, and audio recordings and transcripts of customer calls for financial services)

More info:

Announcing the public preview of Azure Archive Blob Storage and Blob-Level Tiering

Microsoft Unveils Cost-Cutting Archival Cloud Storage Option

Posted in Azure, SQLServerPedia Syndication | 2 Comments

Data Virtualization vs Data Warehouse

Data virtualization goes by a lot of different names: logical data warehouse, data federation, virtual database, and decentralized data warehouse.  Data virtualization allows you to integrate data from various sources, keeping the data in-place, so that you can generate reports and dashboards to create business value from the data.  It is an alternative to building a data warehouse, where you collect data from various sources and store a copy of the data in a new data store.

The main advantage of data virtualization is speed-to-market: you can build a solution in a fraction of the time it takes to build a data warehouse.  This is because you don’t need to design and build the data warehouse and the ETL to copy the data into it, and you also don’t need to spend as much time testing.  Copying the data means more hardware costs, more software licenses, more ETL flows to build and maintain, more data inconsistencies, and more data governance costs, so using data virtualization can also save you a lot of money.  Other reasons for data virtualization include rapid prototyping for batch data movement, self-service analytics via a virtual sandbox, and regulatory constraints on moving data.

Some of the more popular data virtualization products are Cisco Data Virtualization (previously called Composite Software), Denodo Platform for Data Virtualization, Informatica Data Virtualization, Dremio, and Incorta.

Along the same lines of data virtualization vs data warehouse is federated queries vs data lake.  Such technologies as PolyBase, Metanautix, and U-SQL in Azure Data Lake Analytics provide for federated queries.

But there are some major drawbacks to data virtualization and federated queries, so you have to ask the following questions when you are thinking about using it:

  • Speed.  Is this something I could use for a Power BI dashboard where I wanted to slice and dice data with sub-second response times?  Or is this more for operational type reporting?
  • How much will this affect the performance of the source system?  Could a query consume all the resources of a server with a data source that I’m querying against?  Does it push down the query in the same way PolyBase does?
  • Do I need to install something on each server that contains a data source I want to use?
  • Does it use the indexes of each technology on the data store, or does it create its own indexes?
  • How is security handled for giving users access to each data source?
  • How is master data management handled (i.e. the same customer in multiple data sources but the customer name spelled differently)?
  • Where and how will the data be cleaned?
  • Will reports break if the source system is changed?

And there are some very valid reasons why a physical data warehouse is required:

  • Many production systems don’t keep track of historical data.  This data must be stored somewhere for historical analysis, and the physical data warehouse is, in this case, the most obvious solution
  • Accessing production systems directly for reporting and analytics can lead to too much interference on those systems and to performance degradation.  Note that this was once the reason why physical data warehouses were developed in the first place
  • Speed: a data warehouse is optimized for read access, while a source system is usually optimized for writes
  • In building a data warehouse you will be restructuring, renaming, and joining data (i.e. creating star schemas) to make it easy for users to create reports
  • A data warehouse protects users against source system upgrades

A word about views: views, such as those in SQL Server, can be thought of as a “lightweight” data virtualization solution.  When users need access to operational data, views can be defined directly on the operational data store or the production databases.  But views have their own issues: operations for correcting and transforming data must be added to the view definitions, because the original operations are implemented in the physical data warehouse environment and are now bypassed.  They now have to be virtualized.
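
To illustrate the view-as-lightweight-virtualization idea, here is a minimal sketch using SQLite in place of SQL Server (the table, column, and view names are invented): the cleaning and renaming live entirely in the view definition, and the data is queried in place with no copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ops_customer (cust_nm TEXT, rev_amt REAL);
    INSERT INTO ops_customer VALUES ('  ACME corp ', 100.0), ('Globex', 250.0);

    -- The "virtualization" layer: correcting (TRIM) and renaming happen
    -- in the view definition; the operational data is never copied.
    CREATE VIEW v_customer_revenue AS
    SELECT TRIM(cust_nm) AS customer_name, rev_amt AS revenue
    FROM ops_customer;
""")
rows = conn.execute(
    "SELECT customer_name, revenue FROM v_customer_revenue ORDER BY customer_name"
).fetchall()
# rows -> [('ACME corp', 100.0), ('Globex', 250.0)]
```

The drawback shows up here too: every transformation the warehouse ETL would have done must now be written (and maintained) inside the view.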

More info:

IT pros reveal benefits, drawbacks of data virtualization software

Experts Reconsider the Data Warehouse

Clearly Defining Data Virtualization, Data Federation, and Data Integration



Mark Beyer, Father of the Logical Data Warehouse, Guest Post

The Logical Data Warehouse: Smart Consolidation for Smarter Warehousing

data federation technology (data virtualization technology or data federation services)

Logical Data Warehousing for Big Data

The Many Uses of Data Federation

Data Federation

Enterprise Data Management, Part 1

How Data Federation Can Co-exist with an EDW

Demystifying Data Federation for SOA

Federation Supplements The Data Warehouse – Not Either/Or, Never Was

Counterpoint: The Data Warehouse is Still Alive

Posted in Data Lake, Data warehouse, SQLServerPedia Syndication | Comments Off on Data Virtualization vs Data Warehouse

Microsoft Big Data Certification Exams

I previously blogged about Microsoft certification changes, and since then there have been some new Microsoft Big Data certifications exams released:

Already live:

70-773: Analyzing Big Data with Microsoft R Server

70-774: Perform Cloud Data Science with Azure Machine Learning

70-775: Perform Data Engineering on Microsoft Azure HDInsight

Now in beta:

70-776: Perform Big Data Engineering with Microsoft Cloud Services (Azure SQL Data Warehouse, Azure Data Lake Analytics, Azure Data Factory, Azure Stream Analytics)

70-778: Visualizing Data with Power BI

A few other related notes:

Posted in Certification, SQLServerPedia Syndication | 3 Comments

Azure Analysis Services web designer

Microsoft has released a preview of the Azure Analysis Services web designer.  This is a browser-based experience that will allow developers to start creating and managing Azure Analysis Services (AAS) semantic models quickly and easily.  SQL Server Data Tools (SSDT) and SQL Server Management Studio (SSMS) will still be the primary tools for development, but this new designer gives you another option for creating a new model or to do things such as adding a new measure to a development or production AAS model.

A highly requested feature is that you can import a Power BI Desktop file (.pbix) into an Analysis Services database, and once imported you can reverse engineer it to Visual Studio.  Note that for PBIX import only Azure SQL Database, Azure SQL Data Warehouse, Oracle, and Teradata are supported at this time, and DirectQuery models are not yet supported for import (Microsoft will be adding new connection types for import every month).

This initial release includes three major capabilities: model creation, model editing, and querying.  The model can be created directly from Azure SQL Database or SQL Data Warehouse, or, as mentioned, imported from Power BI Desktop PBIX files.  When creating from a database, you can choose which tables to include and the tool will create a DirectQuery model against your data source (DirectQuery means the query to obtain the data is passed down to the data source and executed there).  Then you can view the model metadata, edit the Tabular Model Scripting Language (TMSL) JSON, and add measures.  There are shortcuts to open a model in Power BI Desktop or Excel, or to open a Visual Studio project created from the model on the fly.  You can also run simple queries against the model to see the data or test out a new measure without having to use SSMS.

Keep in mind that if you save changes to the TMSL, it is updating the model in the cloud.  It is a best practice to make changes to a development server and propagate changes to production with a tool such as BISM Normalizer.

More info:

Introducing the Azure Analysis Services web designer

Azure Analysis Services Web Designer

Posted in Azure Analysis Services, SQLServerPedia Syndication, SSAS | 1 Comment

Azure SQL Data Sync 2.0

Azure SQL Data Sync has been ignored for quite some time, but has finally gotten an update (it’s in public preview).  This release includes several major improvements to the service including new Azure portal support, PowerShell and REST API support, and enhancements to security and privacy.

SQL Azure Data Sync is a Microsoft Windows Azure web service that provides data synchronization capabilities for SQL databases.  SQL Azure Data Sync allows data to be synchronized between on-premises SQL Server databases and Azure SQL databases; in addition, it can also keep multiple Azure SQL databases in sync (see Azure SQL Data Sync technical documentation).

SQL Data Sync targets the reference data replication scenario.  Its key capabilities are:

Sync between SQL Server (2005 SP2 and later) and Azure SQL databases, or between Azure SQL databases:

  • One-way and bi-directional sync
  • One-to-one and hub-spoke
  • Table filter and column filter
  • Scheduled and on-demand
  • Eventual consistency

Active Geo-Replication, in contrast, targets the GeoDR scenario for Azure SQL Database by replicating the database to another region.  It only supports one-way replication (secondaries are read-only), replication is at database granularity, there is no database or column/row filter support, and it is only available for Premium service tier.

Comparing Data Sync to transactional replication: if you only need one-way replication, both solutions work.  Transactional replication can provide better latency and transactional consistency, but it requires more complex setup and maintenance.  It doesn’t require the latest SQL Server as the publisher; SQL Server 2012 or later will work.  Data Sync is an Azure SQL Database feature that requires less maintenance and is easier to set up, but it has a minimum five-minute latency and a more significant performance impact on the source database (trigger-based change tracking).  It is optimized for bi-directional sync.
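
The trigger-based change tracking that Data Sync relies on can be illustrated with any database that supports triggers; SQLite stands in for SQL Server here, and the schema is invented.  Every write to the tracked table also writes a tracking row, which is the extra load on the source database, and a sync job drains the tracking table on its schedule:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product_tracking (id INTEGER, op TEXT);

    -- Each DML statement fires a trigger that records the change;
    -- this extra write per write is the performance impact on the
    -- source database mentioned above.
    CREATE TRIGGER trk_ins AFTER INSERT ON product
    BEGIN INSERT INTO product_tracking VALUES (NEW.id, 'I'); END;
    CREATE TRIGGER trk_upd AFTER UPDATE ON product
    BEGIN INSERT INTO product_tracking VALUES (NEW.id, 'U'); END;
""")
conn.execute("INSERT INTO product VALUES (1, 'widget')")
conn.execute("UPDATE product SET name = 'gadget' WHERE id = 1")
changes = conn.execute("SELECT id, op FROM product_tracking").fetchall()
# changes -> [(1, 'I'), (1, 'U')] -- a scheduled sync job replays these
```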

Data Sync is now available in the new Azure portal.  If you would like to try Data Sync refer to this tutorial.  Existing users will be migrated to the new service starting June 1, 2017.  For more information on migration look at the blog post “Migrating to Azure SQL Data Sync 2.0.”

More info:

Azure SQL Data Sync Refresh

Sync data across multiple cloud and on-premises databases with SQL Data Sync

Getting Started with Azure SQL Data Sync (Preview)

Posted in Azure, SQLServerPedia Syndication | Comments Off on Azure SQL Data Sync 2.0

4TB disk sizes for Azure IaaS VMs available

Microsoft has introduced two new disk sizes for Azure IaaS VMs: P40 (2TB) and P50 (4TB) for both managed and unmanaged Premium Disks, and S40 (2TB) and S50 (4TB) for both managed and unmanaged Standard Disks.  This enables customers to add 4x more disk storage capacity per VM.  Customers can now provision up to a total of 256TB of disk storage on a GS5 VM using 64 disks with 4TB capacity.  This means the max SQL Server database size on an Azure VM goes from 64TB to 256TB!

                  Premium Disks   Standard Disks
Managed Disks     P40, P50        S40, S50
Unmanaged Disks   P40, P50        Max up to 4,095 GB

Larger Premium Disks P40 and P50 will support your IO intensive workload and therefore offer higher provisioned disk performance. The maximum Premium Disk IOPS and bandwidth is increased to 7,500 IOPS and 250 MBps respectively.  Standard Disks, of all sizes, will offer up to 500 IOPS and 60 MBps.

                 P40         P50         S40              S50
Disk Size        2048 GB     4095 GB     2048 GB          4095 GB
Disk IOPS        7,500       7,500       Up to 500        Up to 500
Disk Bandwidth   250 MBps    250 MBps    Up to 60 MBps    Up to 60 MBps
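
The 256TB figure is just the per-VM disk limit multiplied out.  A quick sanity check of the numbers above (note that VM-level IOPS and bandwidth caps apply before the per-disk aggregate is ever reached):

```python
# Capacity: a GS5 VM supports up to 64 data disks.
disks = 64
p50_size_gb = 4095
total_tb = disks * p50_size_gb / 1024   # ~255.9 TB, the "256TB" above

# Provisioned IOPS across all P50 disks; the VM's own IOPS cap
# applies before this aggregate can be reached.
p50_iops = 7_500
aggregate_iops = disks * p50_iops       # 480,000
```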

You can create a larger disk or resize existing disks to larger disk sizes with your existing Azure tools through Azure Resource Manager (ARM) or the Azure Portal.  Azure Backup and Azure Site Recovery support for larger disks is coming soon (current timeline is before end of July).

You can visit the Managed Disk Pricing and unmanaged Disk Pricing pages for more details about pricing.

More info:

Azure increases the maximum size and performance of Azure Disks

Posted in Azure, SQLServerPedia Syndication | 2 Comments

Data lake details

I have blogged before about data lakes (see What is a data lake? and Why use a data lake?), and wanted to provide more details on this popular technology, some of which I cover in my presentation “Big data architectures and the data lake“.  In short, the data lake is a storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed.

The data lake should be the center of just about any big data solution for these major reasons:

  • Inexpensively store unlimited data
  • Collect all data “just in case”
  • Easy integration of differently-structured data
  • Store data with no modeling – “Schema on read”
  • Complements enterprise data warehouse (EDW)
  • Frees up expensive EDW resources, especially for refining data
  • A Hadoop cluster offers faster ETL processing than SMP solutions
  • Quick user access to data
  • Data exploration to see if data valuable before writing ETL and schema for relational database
  • Allows use of Hadoop tools such as ETL and extreme analytics
  • Place to land IoT streaming data
  • On-line archive or backup for data warehouse data
  • Easily scalable
  • With Hadoop, high availability is built in
  • Allows for data to be used many times for different analytic needs and use cases
  • Low-cost storage for raw data saving space on the EDW

The data lake introduces a new data analysis paradigm shift:

OLD WAY: Structure -> Ingest -> Analyze

NEW WAY: Ingest -> Analyze -> Structure

This allows you to avoid a lot of up-front work before you are able to analyze data.  With the old way, you have to know the questions to ask.  The new way supports situations when you don’t know the questions to ask.
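
A tiny sketch of schema-on-read (the records are invented): raw events land as-is, and structure is imposed only at query time, so unexpected shapes are stored rather than rejected.

```python
import json

# Ingest: raw events land as-is, with no upfront model ("just in case").
raw_lines = [
    '{"device": "d1", "temp_c": 21.5}',
    '{"device": "d2", "temp_c": 19.0, "humidity": 40}',
    '{"device": "d3", "firmware": "2.1"}',   # unexpected shape: still stored
]

# Analyze: the "schema" is whatever fields this question needs,
# applied at read time; records missing the field are simply skipped.
readings = [json.loads(line) for line in raw_lines]
temps = [r["temp_c"] for r in readings if "temp_c" in r]
avg_temp = sum(temps) / len(temps)   # 20.25
```

With the old way, the third record would have forced a schema decision before any data could be loaded; here that question is deferred until someone actually asks it.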

This solves the two biggest reasons why many EDW projects fail:

  • Too much time spent modeling when you don’t know all of the questions your data needs to answer
  • Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value

Data Lake layers:

  • Raw data layer – Raw events are stored for historical reference.  Also called the staging layer or landing area
  • Cleansed data layer – Raw events are transformed (cleaned and mastered) into directly consumable data sets.  The aim is to standardize the way files are stored in terms of encoding, format, data types, and content (e.g. strings).  Also called the conformed layer
  • Application data layer – Business logic is applied to the cleansed data to produce data ready to be consumed by applications (i.e. a DW application, an advanced analysis process, etc.).  This layer goes by a lot of other names: workspace, trusted, gold, secure, production-ready, governed
  • Sandbox data layer – Optional layer used to “play” in.  Also called the exploration layer or data science workspace
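
One common way to realize these layers is a folder convention in the lake store.  The helper below follows the layer names above, but the exact layout is a team convention, not a standard:

```python
from datetime import date

def lake_path(layer: str, source: str, dataset: str, d: date) -> str:
    """Build a conventional lake path: layer/source/dataset/date partitions."""
    layers = {"raw", "cleansed", "application", "sandbox"}
    if layer not in layers:
        raise ValueError(f"unknown layer: {layer}")
    return f"/{layer}/{source}/{dataset}/{d:%Y/%m/%d}"

# The same dataset appears once per layer as it is progressively refined.
p = lake_path("raw", "sales-db", "orders", date(2017, 9, 1))
# p -> "/raw/sales-db/orders/2017/09/01"
```

Date partitioning in the path is what lets downstream jobs reprocess a single day of raw data without scanning the whole layer.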

Be aware that you still need data governance so your data lake does not turn into a data swamp!  Having a data lake does not lessen the data governance that you would normally apply when building a relational data warehouse.  In fact, it usually requires more data governance.  So look at tools such as the Azure Data Catalog to keep track of all your data.

Data typically lands in products such as Hadoop Distributed File System (HDFS) or the Azure Data Lake Store (ADLS).   Since these technologies are just storage, you need other technologies for the compute part.  For example, here are ways to clean data in ADLS:

Since you will typically have both a Data Lake and a relational data warehouse when building a big data solution, it’s important to understand the roles of each:

Data Lake/Hadoop (staging and processing environment)

  • Batch reporting
  • Data refinement/cleaning
  • ETL workloads
  • Store historical data
  • Sandbox for data exploration
  • One-time reports
  • Data scientist workloads
  • Quick results

Data Warehouse/RDBMS (serving and compliance environment)

  • Low latency
  • High number of users
  • Additional security
  • Large support for tools
  • Easily create reports (self-service BI): A data lake is just a glorified file folder with data files in it – how many end users can accurately create reports from it?  Very few, hence a major reason to use an RDBMS

A question I get frequently is whether all relational data should be copied to the data lake, especially if some of the relational data is not needed for reporting or analytics in the data lake (just in the data warehouse).  There still could be reasons to copy the data to the data lake, such as for backup purposes, to use low-cost storage for raw data to save space on the data warehouse, to use Hadoop tools, or to offload the refining of the data from the data warehouse (especially if ETL jobs on the data warehouse are taking too long).  But in some cases you may want to skip the data lake, especially if you have many existing SSIS packages, as you can minimize the changes to them (i.e. just changing the destination).  Also keep in mind the extra time needed to export data out of a relational database into a flat file before copying it to the data lake.

One issue to be aware of is when you have to move a lot of data each day from a relational database to a data lake.  You may want to look at 3rd-party Change Data Capture (CDC) products, such as HVR and Attunity, to get high volumes of updates from your source system into a data lake.  Performing updates to rows of data in files sitting in a Hadoop/HDFS data lake can be very slow compared to appends.
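
The append-vs-update gap comes down to file immutability: in an HDFS-style store you cannot edit a file in place, so updating even one row means rewriting the whole file, while an append touches only the new bytes.  A toy cost model (sizes invented):

```python
def bytes_touched_append(file_rows: int, row_bytes: int, new_rows: int) -> int:
    """Appending writes only the new rows; the existing file is untouched."""
    return new_rows * row_bytes

def bytes_touched_update(file_rows: int, row_bytes: int, changed_rows: int) -> int:
    """Immutable files can't be edited in place: updating even one row
    means rewriting the entire file."""
    return file_rows * row_bytes

# A 10M-row file at 200 bytes/row: appending 1,000 rows writes 200 KB,
# but updating 1,000 rows rewrites the full 2 GB file.
append_cost = bytes_touched_append(10_000_000, 200, 1_000)   # 200,000 bytes
update_cost = bytes_touched_update(10_000_000, 200, 1_000)   # 2,000,000,000 bytes
```

This is why CDC tools that turn source-system updates into an append-only change stream are a better fit for lake ingestion than row-level updates.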

More info:

Data Lake Use Cases and Planning Considerations

Zones in a Data Lake

Posted in Azure Data Lake, Data Lake, SQLServerPedia Syndication | 3 Comments

Azure DevTest Labs

I have been working at Microsoft now for 3 years and 4 months (side note: it’s by far the best company I have ever worked for).  You would think by now I know about every Azure product, but we release new products and services at such a tremendously fast pace that almost weekly I discover something I did not know about.  Today was one of those days, as I discovered Azure DevTest Labs, which was made generally available in May 2016 (it entered public preview in November 2015).

Here is the overview:

Developers and testers are looking to solve the delays in creating and managing their environments by going to the cloud.  Azure solves the problem of environment delays and allows self-service within a new cost efficient structure.  However, developers and testers still need to spend considerable time configuring their self-served environments.  Also, decision makers are uncertain about how to leverage the cloud to maximize their cost savings without adding too much process overhead.

Azure DevTest Labs is a service that helps developers and testers quickly create environments in Azure while minimizing waste and controlling cost.  You can test the latest version of your application by quickly provisioning Windows and Linux environments using reusable templates and artifacts.  Easily integrate your deployment pipeline with DevTest Labs to provision on-demand environments.  Scale up your load testing by provisioning multiple test agents, and create pre-provisioned environments for training and demos.

Azure DevTest Labs addresses the problems in Dev/Test environments today through four major aspects:

  • Quickly be “ready to test” – DevTest Labs enables you to create pre-provisioned environments with everything your team needs to start developing and testing applications.  Simply claim the environments where the last good build of your application is installed and get working right away.  Or, use containers for even faster and leaner environment creation
  • Worry-free self-service – DevTest Labs makes it easier to control costs by allowing you to set policies on your lab – such as number of virtual machines (VM) per user and number of VMs per lab.  DevTest Labs also enables you to create policies to automatically shut down and start VMs
  • Create once, use everywhere – Capture and share environment templates and artifacts within your team or organization – all in source control – to create developer and test environments easily
  • Integrates with your existing toolchain – Leverage pre-made plug-ins or our API to provision Dev/Test environments directly from your preferred continuous integration (CI) tool, integrated development environment (IDE), or automated release pipeline. You can also use our comprehensive command-line tool

Jeff Gilbert’s TechNet blog has some great posts on Azure DevTest Labs, as does Praveen Kumar Sreeram’s, and there are some excellent short videos by Microsoft to help you get started.

You’ll need a subscription that provides you monthly Azure credits to use DevTest Labs.  Besides the pay-as-you-go option, there are free options and subscription options:

* MSDN Platforms is available exclusively through Microsoft Volume Licensing. For pricing and purchase details, contact your Microsoft account representative, Microsoft Partner, or an authorized volume licensing reseller.

More info:

Getting to know Azure DevTest Labs

How to Use Azure DevTest Labs for Test Environments and Dev Machines

Azure DevTest Labs

More about Azure DevTest Labs

Posted in Azure, SQLServerPedia Syndication | 2 Comments

Data Science Virtual Machine

The Data Science Virtual Machine (DSVM) is a customized VM image on Microsoft’s Azure cloud built specifically for doing data science.  It has many popular data science and other tools pre-installed and pre-configured to jump-start building intelligent applications for advanced analytics.  So instead of you having to create a VM and download and install all these tools which can take many hours, within a matter of minutes you can be up and running.

The DSVM is designed and configured for working with a broad range of usage scenarios.  You can scale your environment up or down as your project needs change.  You are able to use your preferred language to program data science tasks.  You can install other tools and customize the system for your exact needs.

It is available on Windows Server 2012 (create), Windows Server 2016 (create) and on Linux – either Ubuntu 16.04 LTS (create) or on OpenLogic 7.2 CentOS-based Linux distributions (create).

The key scenarios for using the Data Science VM:

  • Preconfigured analytics desktop in the cloud
  • Data science training and education
  • On-demand elastic capacity for large-scale projects
  • Short-term experimentation and evaluation
  • Deep learning

The DSVM has many popular data science and deep learning tools already installed and configured.  It also includes tools that make it easy to work with various Azure data and analytics products.  You can explore and build predictive models on large-scale data sets using the Microsoft R Server or using SQL Server 2016 (note that R Server and SQL Server on the DSVM are not licensed for use on production data).  A host of other tools from the open source community and from Microsoft are also included, as well as sample code and notebooks.  See the full list here and see the latest new and upgraded tools here.

Finally, for Windows users check out Ten things you can do on the Data science Virtual Machine and for Linux users check out Data science on the Linux Data Science Virtual Machine.  For more information on how to run specific tools for Windows see Provision the Microsoft Data Science Virtual Machine and for Linux see Provision the Linux Data Science Virtual Machine.

UPDATE 10/7/17: There is now a Deep Learning VM (DLVM), which is a specially configured variant of the DSVM that is custom made to help users jump start deep learning on Azure GPU VMs.  The DLVM uses the same underlying VM images of the DSVM and hence comes with the same set of data science tools and deep learning frameworks as the base VM.  More info.

More info:

Data Science Virtual Machine – A Walkthrough of end-to-end Analytics Scenarios (video)

Introduction to the cloud-based Data Science Virtual Machine for Linux and Windows

Introducing the new Data Science Virtual Machine on Windows Server 2016

Posted in SQLServerPedia Syndication | 3 Comments