Azure SQL Database monitoring

Even though Azure SQL Database stores all of its data in the Azure cloud, that does not mean your options for managing and monitoring databases are limited compared to on-prem databases.  In fact, the available options are very similar to those for on-prem SQL Server, including third-party products that support Azure SQL Database.  Here are the various options:


Scaling Azure SQL Database

One of the advantages Azure SQL Database has over on-prem SQL Server is the ease with which it can scale.  I’ll discuss the various options for horizontal scaling, vertical scaling, and other related features.

Horizontal scaling refers to adding or removing databases in order to adjust capacity or overall performance. This is also called “scaling out”.  Sharding, in which data is partitioned across a collection of identically structured databases, is a common way to implement horizontal scaling.

Vertical scaling refers to increasing or decreasing the performance level of an individual database—this is also known as “scaling up.”

Elastic Database features enable you to use the virtually unlimited database resources of Azure SQL Database to create solutions for transactional workloads, especially Software as a Service (SaaS) applications.  Elastic Database features are composed of the following:

You can change the service tier and performance level of your SQL database with the Azure portal, PowerShell (using the Set-AzureSqlDatabase cmdlet), the Service Management REST API (using the Update Database command), or Transact-SQL (via the ALTER DATABASE statement).  You can use DMVs to monitor the progress of the upgrade operation for a database.  This allows you to easily scale a database up or down, and it will remain online and available during the entire operation with no downtime.  This is vertical scaling.  See Change the service tier and performance level (pricing tier) of a SQL database.
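As a minimal sketch of the Transact-SQL route (the database name and service objective below are just examples; adjust them to your own tiers), scaling up is a single ALTER DATABASE statement, and you can watch the operation’s progress from the master database:

    -- Scale the database up to Standard S2 (example name and service objective)
    ALTER DATABASE MyAppDb MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S2');

    -- Run in the master database to monitor the progress of the scaling operation
    SELECT major_resource_id, operation, state_desc, percent_complete, start_time
    FROM sys.dm_operation_status
    WHERE major_resource_id = 'MyAppDb'
    ORDER BY start_time DESC;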

Another feature, Stretch Database, lets your on-prem SQL Server database hold just the core data, with old/cold data that continues to grow sidelined transparently in Azure SQL Database.  This feature is only available in SQL Server 2016.  See Stretch Database.
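As a rough sketch of enabling Stretch in SQL Server 2016 (the table name is hypothetical, and the exact syntax has shifted between preview builds, so treat this as illustrative rather than definitive):

    -- Allow the instance to use Stretch Database
    EXEC sp_configure 'remote data archive', 1;
    RECONFIGURE;

    -- After the database itself is enabled for Stretch (via the wizard or
    -- ALTER DATABASE ... SET REMOTE_DATA_ARCHIVE), start migrating a table's cold rows to Azure
    ALTER TABLE dbo.SalesHistory
        SET (REMOTE_DATA_ARCHIVE = ON (MIGRATION_STATE = OUTBOUND));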

More info:

Elastic Database features overview

Video Azure SQL Database Elastic Scale

Video Elastic for SQL – shards, pools, stretch

SQL Azure Performance Benchmarking

Azure SQL Database DTU Calculator


Microsoft Azure Stack in Preview!

Microsoft today made available the first technical preview of its new Microsoft Azure Stack offering, which was announced last week.  Azure Stack brings the cloud model of computing to every datacenter – your own private cloud.  Azure Stack is a new hybrid cloud platform product that enables organizations to deliver Azure services from their own datacenter in a way that is consistent with Azure.  You can download it now, but first check out the hardware and deployment requirements.  Once installed, check out the Tools and PaaS services for Azure Stack, which will be updated frequently.  If you run into any issues and need help, or if you’d like to provide feedback or make requests, visit the Azure Stack forum.

Also released is a whitepaper providing more information on key Azure Stack concepts and capabilities that should help you gain a much richer understanding of Microsoft’s approach.  Check out the full documentation.

On February 3, Mark Russinovich and Jeffrey Snover demonstrated Azure Stack in a webcast, followed by a question and answer session.  Also check out An early look at Azure Stack and what it means for IT, with Jeffrey Snover.

Azure Stack is essentially Microsoft’s better bridge to using its cloud services, both the platform-as-a-service (PaaS) and infrastructure-as-a-service (IaaS) kinds.  It uses the same code as Azure.  Scripting tools for management, such as PowerShell and command-line interfaces, will work across Microsoft’s Azure cloud computing services as well as local datacenter implementations of Azure Stack.  This allows a company to build a private cloud and then easily migrate parts of it to a public cloud, giving you a true hybrid cloud platform.  Microsoft is the only company to have both a private and public cloud option.

General availability is scheduled for Q4 of this year, but some services will only be in preview at that time:

Azure Stack services

More info:

First Microsoft Azure Stack Preview Sees Daylight

Microsoft readies first technical preview of Azure Stack hybrid-cloud bundle

Microsoft Azure Stack: Hardware requirements


Migrate from on-prem SQL Server to Azure SQL Database

This blog describes the various approaches you can use to migrate an on-premises SQL Server database to Azure SQL Database.

In this migration process you migrate both your schema and your data from the SQL Server database in your current environment into SQL Database, provided the existing database passes compatibility tests.  Fortunately, with SQL Database Version 12 (V12) there are very few remaining compatibility issues other than server-level and cross-database operations (see Azure SQL Database Transact-SQL differences, Azure SQL Database General Limitations and Guidelines, and Azure SQL Database resource limits).  Databases and applications that rely on partially supported or unsupported features will need some re-engineering to fix these incompatibilities before the SQL Server database can be migrated.

If not done already, create a server to host the database you wish to migrate, and then set up a firewall rule for it (see SQL Database tutorial: Create a SQL database in minutes using sample data and the Azure portal).

Note: If you are fairly confident your database is compatible and it is not a large database, you can skip testing for compatibility and just use the SSMS Migration Wizard.  If this wizard detects compatibility issues, errors will be displayed on the screen and the migration will not continue – you can then proceed to the steps below to test and fix the compatibility issues.  Or, you can run the SQL Azure Migration Wizard, which can fix most compatibility issues (the few that remain) while creating the SQL Database, so this could be the only tool you need since it also migrates the data.

Test compatibility

First you will want to test for SQL Database compatibility issues before you start the migration process:

Fix compatibility issues

If there are any compatibility issues, you will need to fix them before proceeding with the migration by using:

  • SQL Azure Migration Wizard, available on CodePlex, will generate a T-SQL script from the incompatible source database that is then transformed to make it compatible with SQL Database.  It will then connect to the SQL Database and execute the script on the target database.  It also has options to analyze trace files to determine compatibility issues, and the script can be generated with schema only or can include data in BCP format.  It will also process the body of functions or stored procedures, which is normally excluded from the validation performed by SSDT (see next option), so it may find issues that might not otherwise be reported by SSDT alone.  See the wizard in action via the video SQL Database Migration Wizard
  • SQL Server Data Tools for Visual Studio (SSDT).  Import the database schema into a Visual Studio database project for analysis.  Specify the target platform for the project as SQL Database V12 and then build the project.  If the build is successful, the database is compatible.  If the build fails, resolve the errors in SSDT.  Once the project builds successfully, you can publish it back as a copy of the source database and then use the data compare feature in SSDT to copy the data from the source database to the Azure SQL V12 compatible database.  You can then migrate this updated database
  • SQL Server Management Studio (SSMS).  Use SSMS to fix compatibility issues using various Transact-SQL commands, such as ALTER DATABASE

Migrate database

And lastly you will migrate the compatible SQL Server database to SQL Database.  There are several migration methods for various scenarios.  The method you choose depends upon your tolerance for downtime, the size and complexity of your SQL Server database, and your connectivity to the Microsoft Azure cloud.

To migrate schema and data from on-prem SQL Server (or Azure SQL Database) to Azure SQL Data Warehouse, use the Data Warehouse Migration Utility (Preview).

More info:

Migration cookbook now available for the latest Azure SQL Database Update (V12)

Migrating a SQL Server database to Azure SQL Database

How to Migrate from On-Premises to Azure SQL Database


HA/DR for Azure SQL Database

Azure SQL Database is a relational database-as-a-service in the cloud.  It uses a special version of Microsoft SQL Server as its backend that is nearly identical to SQL Server (see Azure SQL Database Transact-SQL differences).  While there are many benefits to using SQL Database over SQL Server, in this blog post I’ll talk about the various high-availability and disaster recovery options, which are much easier to set up than with SQL Server.

When you use the Azure portal to create a SQL Database, the pricing tier you choose falls under one of three service tiers: Basic, Standard, and Premium.  Here are those three tiers with their high-availability (HA) and disaster recovery (DR) options:

Basic: Automatic Backups, Point In Time Restore up to 7 days, Disaster recovery (DR): Geo-Restore, restore to any Azure region

Standard: Automatic Backups, Point In Time Restore up to 14 days, DR: Standard Geo-Replication, offline secondary

Premium: Automatic Backups, Point In Time Restore up to 35 days, DR: Active Geo-Replication, up to 4 online (readable) secondaries

Here are more details on those options:

High Availability: Each database has one primary and two local replica databases stored on LRS Azure Blob Storage that reside in the same datacenter, providing high availability within that datacenter.  At least two of those copies are written synchronously.  The hardware these replicas reside on is in completely physically separate sub-systems, so if the hardware fails, your database will automatically and seamlessly fail over to a synchronous copy.

Automatic Backups: All Basic, Standard, and Premium databases are protected by automatic backups.  Full backups are taken every week, differential backups every day, and log backups every 5 minutes.  The first full backup is scheduled immediately after a database is created.  Normally this completes within 30 minutes, but it can take longer.  If a database is “born big”, for example if it is created as the result of a database copy or a restore from a large database, then the first full backup may take longer to complete.  After the first full backup, all further backups are scheduled automatically and managed silently in the background.  The exact timing of full and differential backups is determined by the system to balance overall load.  Backup files are stored locally in blob storage in the same datacenter as your databases, with local redundancy.  When you restore a database, the required backup files are retrieved and applied.  The full, differential, and log backups are also copied to blob storage in the paired secondary region in the same geo-political area for disaster recovery purposes (RA-GRS).  These geo-redundant copies are used to enable geo-restore, as explained shortly.

Point In Time Restore: Point In Time Restore is designed to return your database to an earlier point in time.  It uses the full, differential, and transaction log backups that the service automatically maintains for every user database.  See Azure SQL Database Point in Time Restore.  To restore a database, see Recover an Azure SQL Database from a user error.  When you perform a restore, you’ll get a new database on the same server.

Geo-Restore: When you create a SQL Database server, you choose the region you want it in (e.g. East US), and this is your primary region.  If there is an incident in this region and a database is unavailable, you can restore it from the geo-redundant backup copy in the secondary region to any region, using the same technology as point in time restore, so databases are resilient to storage outages in the primary region.  Note that with this option your data could be up to one hour behind.  See Azure SQL Database Geo-Restore.

Standard Geo-Replication: This is where a copy of the data in your primary database is constantly being written asynchronously to a non-readable secondary database on a server in a different region (geo-redundancy).  In the event of a disaster you can fail over to the secondary.  Since the copy is asynchronous, the data in the secondary database will be behind the primary, but not by more than five seconds (you can block until committed transactions have been replicated by using the system procedure sp_wait_for_database_copy_sync).  See Azure SQL Database Standard Geo-Replication.

Active Geo-Replication: Similar to Standard Geo-Replication, your data is written asynchronously, except to up to four secondary servers in different regions, and these secondaries are readable (each continuous copy is referred to as an online secondary database).  You can also fail over to a secondary in the event of a disaster, in the same way as Standard Geo-Replication.  In addition, Active Geo-Replication can be used to support application upgrade or relocation scenarios without downtime, as well as load balancing for read-only workloads.  See Active Geo-Replication for Azure SQL Database.
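Here is a minimal sketch of driving geo-replication with Transact-SQL (the server and database names are placeholders); the ADD SECONDARY and FAILOVER statements run against the master database of the primary and secondary servers respectively:

    -- On the primary server's master database: add a readable secondary on another server
    ALTER DATABASE MyAppDb
        ADD SECONDARY ON SERVER MySecondaryServer
        WITH (ALLOW_CONNECTIONS = ALL);

    -- In the primary database: optionally block until committed transactions have replicated
    EXEC sp_wait_for_database_copy_sync @target_copy_lsn = 0;

    -- On the secondary server's master database: planned failover with no data loss
    -- (use FORCE_FAILOVER_ALLOW_DATA_LOSS instead during an actual disaster)
    ALTER DATABASE MyAppDb FAILOVER;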

A word about database failover:
If a region has an extended outage you will receive an alert in the Azure portal and will see your SQL Database server’s state set to Degraded.  At that point an application has a choice of initiating the failover or waiting for the datacenter to recover.  If your application needs to optimize for higher availability and can tolerate a data loss of 5 seconds, it should fail over as soon as you receive an alert or detect database connectivity failures.  If your application is sensitive to data loss, you may opt to wait for the SQL Database service to recover; if it does, no data loss will occur.  If you initiate the failover, you must reconfigure your applications appropriately to connect to the new primary databases.  Once you have completed the failover you will want to ensure that the new primary is also protected as soon as possible.  Since primary region recovery may take time, you will have to wait for your server to change from Degraded back to Online status; this will allow you to initiate geo-replication from the new primary to protect it.  Until seeding of the new secondary is completed, your new primary will remain unprotected.

More info:

Creating a large data warehouse in Azure

Business Continuity Overview

Design for business continuity

SQL Database Enable Geo Replication in Azure Portal

Fault-tolerance in Windows Azure SQL Database

Distributed Storage: How SQL Azure Replicas Work

High Availability and Disaster Recovery for Azure SQL Databases


Why use a data lake?

Previously I covered what a data lake is (including the Azure Data Lake and enhancements), and now I wanted to touch on the main reason why you might want to incorporate a data lake into your overall data warehouse solution.

To refresh: a data lake is a landing zone, usually in Hadoop, for disparate sources of data in their native format (NoSQL databases can be used for “data lakes” too, although they often require data conversion to a single format, so they are much less frequently used).  Data is not structured or governed on its way into the data lake.  This eliminates the upfront costs of data ingestion, especially transformation.  Once data is in the lake, it is available to everyone.  You don’t need an a priori understanding of how the data is related when it is ingested; rather, it relies on the end user to define those relationships as they consume it.  Data governance happens on the way out instead of on the way in.  This makes a data lake very efficient for processing huge volumes of data.  Another benefit is that the data lake allows for data exploration and discovery, to find out if data is useful, or to simply leave data in the data lake while you wait to see how you might eventually use it.  Or it can be used by people who don’t require top-quality cleansed data and are desperately trying to address time-critical business requirements.  Think of a data lake as a solution for any case in which you do not need the benefits that an RDBMS gives you, such as accelerated performance for query processing and complex joins, workload management features required to meet strict SLAs, and high concurrency.  Such use cases can be as simple as a one-time operational report, or as complicated as using the data lake to offload workloads for refining data, in which the data lake becomes the primary site for the preparation (cleansing, aggregation, and integration) of data.

The disadvantages of a data lake are that it is usually not useful for analytical processing.  If you need queries to return very fast, as in a few seconds or less, the data lake won’t give you that performance.  Also, the data is usually not clean, and it’s not easy to join the data.  This leads to the inability to determine data quality or the lineage of findings by other users who have previously found value in using the same data in the lake.  And finally, it can be difficult for less technical people to explore and ask questions of the data (e.g. having to write a Spark job in Java).  This is where federated query technologies like PolyBase help, by allowing end users to use regular SQL to access data in the data lake as well as combine that data with data in an RDBMS.  What many people don’t realize is that data governance needs to be applied to a data lake, just like it needs to be applied to a data warehouse, if people are going to report from it (as opposed to passing the raw data in the data lake to an RDBMS where it will be cleaned and made accessible to end users).  This means you should have multiple stages of data in the data lake, such as: raw, cleaned, mastered, test, production ready.  Don’t let the data lake become a collection of silos (a data swamp!).  Think about creating a data lake competency center to prevent this.
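As a hedged sketch of the PolyBase approach (the data source, file format, table, column, and container names below are all made up), you define the files sitting in Azure Blob Storage as an external table and then join them to relational tables with ordinary T-SQL:

    -- Point at the blob container holding raw data lake files (credential setup omitted)
    CREATE EXTERNAL DATA SOURCE LakeBlobStore
    WITH (TYPE = HADOOP, LOCATION = 'wasbs://rawdata@mystorageacct.blob.core.windows.net');

    CREATE EXTERNAL FILE FORMAT CsvFormat
    WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

    CREATE EXTERNAL TABLE dbo.ext_WebClicks
    (   ClickDate   DATE,
        CustomerId  INT,
        Url         VARCHAR(500) )
    WITH (LOCATION = '/clickstream/', DATA_SOURCE = LakeBlobStore, FILE_FORMAT = CsvFormat);

    -- Combine raw data in the lake with governed data in the relational warehouse
    SELECT c.Region, COUNT(*) AS Clicks
    FROM dbo.ext_WebClicks w
    JOIN dbo.DimCustomer c ON c.CustomerId = w.CustomerId
    GROUP BY c.Region;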

A data warehouse, which is a central repository of integrated data from one or more disparate sources, has many benefits.  However, it takes a lot of time and money to get the data ready to be analyzed: acquiring the data, prepping it, governing it, and finding keys to join it.  Often integration is done without clear visibility into the value that is going to be extracted from it.  Just because you can join and relate the data, if you have not figured out the value, the risk is that if it turns out not to be valuable you have wasted a lot of time.  Also, if relationships change between new data sources, it becomes a challenge.  The big benefit of a data warehouse is speed of queries.  If you create a dashboard where a user can slice and dice the data, you want to be hitting a data warehouse.  You get the speed because a data warehouse can live in an RDBMS and use a star schema format, which will give you optimal performance, much more than you would get from querying a data lake.

So using both a data lake and a data warehouse is the ideal solution.  Think of it as the data lake is a technology solution, and the data warehouse is the business solution.  The best way to see this in action is an architecture diagram that lists the many possible technologies and tools that you can use:

Data Lake Architecture

The details on the numbered steps in the above diagram (which is a cloud solution, but it can access on-prem sources):

1) Copy source data into the Azure Data Lake Store (twitter data example)
2) Massage/filter the data using Hadoop (or skip using Hadoop and use stored procedures in SQL DW/DB to massage data after step #5)
3) Pass data into Azure ML to build models using Hive query (or pass in directly from Azure Data Lake Store)
4) Azure ML feeds prediction results into the data warehouse
5) Non-relational data in Azure Data Lake Store copied to data warehouse in relational format (optionally use PolyBase with external tables to avoid copying data)
6) Power BI pulls data from data warehouse to build dashboards and reports
7) Azure Data Catalog captures metadata from Azure Data Lake Store and SQL DW/DB
8) Power BI and Excel can pull data from the Azure Data Lake Store via HDInsight

The beauty of the above diagram is that you are separating storage (Azure Data Lake Store) from compute (HDInsight), so you can shut down your HDInsight cluster to save costs without affecting the data.  You can fire up an HDInsight cluster, scale it up, do your processing, and then scale it down or shut it down when it is not in use.

You can also use Azure Data Lake Analytics and U-SQL instead of HDInsight/Hadoop.  ADL Analytics (big data queries as a service) gives you convenience, efficiency, and automatic scaling, while HDInsight (cluster as a service) gives you customization, control, and flexibility, allowing you to leverage the Hadoop ecosystem (Spark, Storm, HBase, etc.).

There are so many possible technologies and tools that can be used.  You can even have the data from a data lake feed a NoSQL database, a SSAS cube, a data mart, or go right into Power BI Desktop.  There is not a cookie-cutter solution – it depends on such things as the type of data, the size, the frequency (real-time streaming), the skill set, the required speed, the allowed company software, the company risk or comfort level (bleeding edge, leading edge), whether there is sensitive data that cannot be in the cloud, etc.  For example, even though the data lake is a staging area of the data warehouse, operational reports that run at night can be generated from the data lake in which various reporting tools that support the Azure Data Lake Store would be used.  Or if you wanted to stream data, tools like Azure Event Hubs and Azure Stream Analytics would be used.  There could even be cases where you want to take relational sources and use products like Data Quality Services (DQS) and Master Data Services (MDS) to clean and master the data, then import it using SSIS into SQL DW/DB, bypassing the data lake altogether.

If you would rather go the route of using Hadoop software, many of the above technologies have Hadoop or open source equivalents: AtScale and Apache Kylin create SSAS-like OLAP cubes on Hadoop, Jethro Data creates indexes on Hadoop data, Apache Atlas for metadata and lineage tools, Apache Drill to query Hadoop files via SQL, Apache Mahout or Spark MLib for machine learning, Apache Flink for distributed stream and batch data processing, Apache HBase for storing non-relational streaming data and supporting fast query response times, SQLite/MySQL/PostgreSQL for storing relational data, Apache Kafka for event queuing, Apache Falcon for data and pipeline management (ETL), and Apache Knox for authentication and authorization.

If you already have a data warehouse solution, incorporating a data lake does not mean all existing sources should land in the data lake.  Rather, it can be used for new sources and down the road you can migrate some of the older sources.  For now those existing sources can bypass the data lake and go right into the data warehouse.

Another big use for a data lake is offloading data (i.e. historical data) from other systems, such as your data warehouse or operational systems, where the per-terabyte cost of storage and processing is considerably higher.

The goal of this post is to educate you on all the tools so you can choose the best one.  The bottom line is that a data lake should be a key component of your modern data architecture.

More info:

Data Lakes – Five Tips to Navigating the Dangerous Waters

Data Lake Principles and Economics

Data lakes and the promise of unsiloed data

Data lakes, don’t confuse them with data warehouses, warns Gartner

Organize and discover your big data in the Azure Data Lake with Azure Data Catalog


Storage options on Azure

Microsoft Azure is a cloud computing platform and infrastructure, created by Microsoft, for building, deploying and managing applications and services through a global network of Microsoft-managed and Microsoft partner-hosted datacenters.  Included in this platform are multiple ways of storing data.  Below I will give a brief overview of each so you can get a feel for the best use case for each, with links provided that go into more detail:

  • Local Storage: Provides temporary storage for a running application instance.  Local storage represents a directory on the physical file system of the underlying hardware that an application instance is running on, and can be used to store any information that is specific to the local running application instance.  You can create a number of local stores for each instance
  • Windows Azure Storage:
    • Blob: A reliable, cost-effective cloud storage for large amounts of unstructured data, such as documents and media files.  It is a highly scalable, REST-based cloud object store.  The storage service offers three types of blobs: block blobs, page blobs, and append blobs.  Block blobs are best for sequential file I/O, page blobs are best for random-write pattern data, and append blobs are optimized for append operations
    • Data Lake Store: This is “HDFS as a service”.  It is a new flavor of Azure Blob Storage which can handle streaming data (low latency, high volume, short updates), is geo-distributed, data-locality aware and allows individual files to be sized at petabyte scale.  See Azure Data Lake
    • File: Fully managed file shares that use the standard SMB 3.0 protocol.  Migrate file share-based applications to the cloud with no code changes, allowing you to lift and shift legacy apps
    • Disks: Persistent block storage for Azure IaaS VMs.  The disks are actually .vhd files that need to reside in an Azure storage account.  You have the option of using Premium Storage disks: SSD-based, high IOPS, low latency
    • Table: A NoSQL key-value store for rapid development using massive semi-structured datasets.  Highly scalable to PBs, and has dynamic scaling based on load.  Has fast key/value lookups
    • Queue: A way to reliably store and send messages for scenarios including workflow processing or communication between application components.  Can decouple and scale components.  Helps your applications absorb unexpected traffic bursts and can prevent servers from being overwhelmed by a sudden flood of requests.  Instead of getting dropped, incoming requests are buffered in the queue until servers catch up—so traffic bursts don’t take down your applications
  • SQL Database: Fully managed database-as-a-service that is in the cloud and is based on the Microsoft SQL Server engine.  SQL Database delivers predictable performance, scalability with no downtime, business continuity and data protection—all with near-zero administration
  • DocumentDB: A NoSQL document database service that natively supports JSON and JavaScript directly inside the database engine. It’s a solution for applications that run in the cloud when predictable throughput, low latency, and flexible query are key

More info:

Data Storage Offerings on the Azure Platform

Azure Table Storage and Microsoft Azure SQL Database – Compared and Contrasted

Introduction to Microsoft Azure Storage


Redundancy Options in Azure Blob Storage

So you have data in Azure Blob Storage and are concerned about reliability.  Have no fear!  There are four replication options for redundancy:

1. Locally Redundant Storage (LRS): All data in the storage account is made highly durable and available within a facility/datacenter by replicating transactions synchronously to three different storage nodes within a single storage cluster (today those clusters are 20 racks within the same building).  Each storage node is on a different fault and upgrade domain to ensure recovery from common failures (disk, node, rack) without impacting your storage account’s availability and durability.  All storage writes are performed synchronously across the three replicas in the three separate fault domains before success is returned to the client.  Note there is an option to use premium storage LRS, which means the storage is SSD-based.  For internal details of Windows Azure Storage, check out a whitepaper here, a video here, and the slides here.

2. Zone-Redundant Storage (ZRS): Zone-redundant storage (ZRS) stores three replicas of your data across two to three facilities, with two of the facilities at least 25 miles apart.  It is designed to keep all three replicas within a single region (but they may span two regions), providing higher durability than LRS (which replicates data only within the same facility).  Data is synchronously replicated when the facilities are within the same region; in the event that multiple regions are needed, three copies are stored synchronously within the region and the data is asynchronously replicated to facilities in the other regions.  If your storage account has ZRS enabled, then your data is durable even in the case of failure at one of the facilities.  ZRS is currently available only for block blobs.

3. Geo-Redundant Storage (GRS): Like LRS, transactions are replicated synchronously to three different storage nodes within the same primary region chosen when creating the storage account (there are 24 regions throughout the world, with each region made up of multiple datacenters).  However, transactions are also queued for asynchronous replication to a secondary region that is hundreds of miles away (usually 300 miles or more) from the primary (geo-replication).  In this secondary region the data is again made durable by replicating it to three more storage nodes there (i.e. a total of 6 copies).  So even in the case of a complete regional outage or a regional disaster in which the primary location is not recoverable, your data is still durable.

4. Read-Access Geo-Redundant Storage (RA-GRS): This is the default option for redundancy when a storage account is created.  For a GRS storage account, there is the ability to turn on read-only access to the storage account’s data in the secondary region.  Since replication to the secondary region is done asynchronously, this provides an eventually consistent version of the data to read from (if the primary goes offline, the last 5-15 minutes of writes may not yet be available on the secondary).  When you enable read-only access to your secondary region, you get a secondary endpoint in addition to the primary endpoint for accessing your storage account.  This secondary endpoint is similar to the primary endpoint except for the suffix “-secondary”.  For example: if the primary endpoint is myaccount.<service>.core.windows.net, the secondary endpoint is myaccount-secondary.<service>.core.windows.net.

When creating a storage account, you will be presented with the following storage account types:

Storage account options

These options mean the data in your Microsoft Azure storage account is always replicated to ensure durability and high availability, meeting the Azure Storage SLA even in the face of transient hardware failures.


For locally redundant storage, Microsoft stores CRCs of the data to ensure correctness and periodically reads and validates the CRCs to detect bit rot (random errors occurring on the disk media over time).  If a CRC check fails, the data is recovered via an automated process.  And since each VM disk is a blob in Azure storage, if a CRC check fails on a disk, that disk is automatically decommissioned and a replacement commissioned.

Some customers may choose Locally Redundant Storage for storage that does not require the additional durability of Geo Redundant Storage and want to benefit from the discounted price compared to GRS.  This data typically falls into the categories of (a) non-critical or temporary data (such as logs), or (b) data that can be recreated if it is ever lost from sources stored elsewhere.  An example of the latter is encoded media files that could be recreated from the golden bits stored in another Windows Azure Storage account that uses Geo Redundant Storage.  In addition, some companies have geographical restrictions about what countries their data can be stored in, and choosing Locally Redundant Storage ensures that the data is only stored in the location chosen for the storage account.

For remote storage (GRS and RA-GRS), in the event of a major disaster that affects the primary storage location, Microsoft will first try to restore the primary location manually.  Restoring the primary is given precedence because failing over to the secondary may result in recent delta changes being lost (since replication is asynchronous), and not all applications may prefer failing over if availability of the primary can be restored.  Depending on the nature of the disaster and its impact, in some very rare occasions Microsoft may not be able to restore the primary location, and would need to perform a geo-failover.

When this happens, affected customers will be notified via their subscription contact information or via the Azure portal.  As part of the failover, the customer’s “account.service.core.windows.net” DNS entry would be updated to point from the primary location to the secondary location.  Once this DNS change is propagated, the existing Blob URIs will work.  This means that you do not need to change your application’s URIs – all existing URIs will work the same before and after a geo-failover.

After the failover occurs, the location that is accepting traffic is considered the new primary location for the storage account.  This location will remain the primary location unless another geo-failover occurs.  Once the new primary is up and accepting traffic, Microsoft will bootstrap a new secondary to get the data geo-redundant again.

More info:

Windows Azure Storage Redundancy Options and Read Access Geo Redundant Storage

Azure Storage replication

Azure Business Continuity Technical Guidance


Copying data from Azure Blob Storage

In a previous blog I talked about copying on-prem data to Azure Blob Storage (Getting data into Azure Blob Storage).  Let’s say you have copied the data and it is sitting in Azure Blob Storage (or an Azure Data Lake) and you now want to copy it from Azure Blob Storage into either SQL Server on an Azure Virtual Machine (SQL Server IaaS), SQL DW, or SQL DB.  Below I cover the various ways to do this by listing the technology and the supported destinations:

Here is an important point to keep in mind when reviewing your options if you are building a Big Data platform: all of these options copy data, but you can instead use PolyBase to query the data as it sits in Azure Blob Storage and avoid the ETL time and storage cost of copying it.
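If you do decide to copy the data into SQL DW, a rough sketch of the PolyBase loading pattern looks like the following; it assumes you have already defined an external table (the hypothetical dbo.ext_WebClicks) over the blob data with CREATE EXTERNAL TABLE, and CTAS then pulls it into a distributed internal table:

    -- Load the external (blob-backed) table into a SQL DW internal table in one pass
    CREATE TABLE dbo.WebClicks
    WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
    AS
    SELECT ClickDate, CustomerId, Url
    FROM dbo.ext_WebClicks;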

Note that all of the above technologies that work against Azure Blob Storage will also work against the Azure Data Lake, except for PolyBase, which is not supported yet.

More info:

Load data into SQL Data Warehouse

Migrate Your Data

Azure SQL Data Warehouse loading patterns and strategies


Azure Data Lake enhancements

I first blogged about Microsoft’s new product, the Azure Data Lake, a few months back (here).  There are already enhancements, as announced at Strata + Hadoop World.  Here they are in brief:

  • The Azure Data Lake has been renamed to the Azure Data Lake Store.  The Data Lake Store provides a single repository where you can easily capture data of any size, type, and speed and without forcing changes to your application as the data scales.  In the store your data is accessible from any HDFS application and tool via WebHDFS.  It will be available in preview later this year
  • A new service called Azure Data Lake Analytics, which is a distributed analytics service built on Apache YARN that allows developers to be productive immediately on big data.  This is accomplished by submitting a job to the service, where the service will automatically run it in parallel in the cloud and scale to process data of any size (scaling is achieved by simply moving a slider).  When the job completes, it winds down resources automatically, and you only pay for the processing power used.  This makes it easy to get started quickly and be productive with the SQL, .NET, or Hive skills you already have, whether you’re a DBA, data engineer, data architect, or data scientist.  Because the analytics service works over both structured and unstructured data, you can quickly analyze all of your data – social sentiment, web clickstreams, server logs, devices, sensors, and more.  There’s no infrastructure setup, configuration, or management.  It will be available in preview later this year
  • A new language called U-SQL, which is a big data language that seamlessly unifies the ease of use of SQL with the expressive power of C#.  U-SQL’s scalable distributed query capability enables you to efficiently analyze data in the Azure Data Lake Store and across Azure Blob Storage, SQL Servers in Azure, Azure SQL Database and Azure SQL Data Warehouse.  U-SQL is built on the learnings from Microsoft’s internal experience with SCOPE and existing languages such as T-SQL, ANSI SQL, and Hive.  See Introducing U-SQL – A Language that makes Big Data Processing Easy and Tutorial: develop U-SQL scripts using Data Lake Tools for Visual Studio
  • Azure HDInsight is now included as part of the Azure Data Lake.  This integrates the fully managed Apache Hadoop cluster service with the Azure Data Lake Store.  It was also announced that HDInsight on Linux, with an industry-leading 99.9% uptime SLA, is now generally available
  • Azure Data Lake Tools for Visual Studio provide an integrated development environment that spans the Azure Data Lake, dramatically simplifying authoring, debugging, and optimization for processing and analytics at any scale.  With these tools you can write U-SQL, and you can also write U-SQL in the Azure management portal.  See Azure Data Lake: Making Big Data Easy

Note that Hortonworks, Cloudera, and MapR, as well as other partners, will integrate with the Azure Data Lake Store.


The services are now in private preview and interested developers can sign up here.

More info:

Microsoft expands Azure Data Lake to unleash big data productivity

Announcing General Availability of HDInsight on Linux + new Data Lake Services and Language
