Hyperscale cloud services such as Azure are designed to work with large amounts of data, taking advantage of their economies of scale when purchasing storage hardware. Their close relationship with search engines like Bing and Google allows them to build on the algorithms and tools developed to analyze the public internet. It’s a combination that makes them an ideal platform for building applications that need to process massive data sets, at a scale that would be prohibitive in your own data center.
Microsoft has offered a range of data and analytics services on Azure since its early days, starting with its own SQL database (which quickly became a cloud-hosted version of the familiar SQL Server), adding HDInsight for Hadoop and other Apache data services, and offering a large-scale data lake that lets you mix structured and unstructured data. Until recently most of these services have been stand-alone, and if you wanted to bring them together, you’d need to build your own analytics tooling. At Ignite 2019, Microsoft relaunched Azure’s existing SQL Data Warehouse as Azure Synapse, rearchitected and rebranded, adding support for Apache Spark and its own Studio development and analytics tools.
Introducing Azure Synapse
Azure Synapse is more than a rebranding of an existing product; its focus is on integrating many of Azure’s data analysis capabilities into a single service. Unlike traditional data warehouses, it supports mixed relational and unstructured data while still letting you use existing SQL skills to build and test analytical models, building on Azure SQL’s PolyBase big data query engine. Because it uses in-memory column stores, it’s fast and efficient, an important feature when you’re paying for a cloud service by consumption.
Where Synapse differs from other data warehouse products is its roots in Azure SQL’s hyperscale option. Instead of a single compute node handling all your queries, it uses a cluster of what Microsoft is calling “data warehouse units.” These separate query compute from the underlying storage and let Synapse take a massively parallel approach to processing your queries. Each data warehouse unit has compute and a custom application, the Data Movement Service, that works across nodes and with Azure Storage to ensure that the right data is available in the right node. It’s certainly fast; a demo at Ignite compared it with Google’s BigQuery on a 30-petabyte data set, and showed Synapse to be 75 times faster.
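To make the idea of spreading one query across many nodes more concrete, here is a toy sketch in Python. It is not Synapse code and does not reflect how the Data Movement Service is implemented; the function names, the CRC32-based distribution, and the sample data are all assumptions made for illustration. It only shows the general pattern: distribute rows by key, let each node aggregate its own slice, then merge the partial results.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, node_count):
    """Hash-distribute (key, amount) rows across nodes.

    A toy stand-in for data movement: rows with the same key
    always land on the same node.
    """
    nodes = [[] for _ in range(node_count)]
    for key, amount in rows:
        index = zlib.crc32(key.encode("utf-8")) % node_count
        nodes[index].append((key, amount))
    return nodes


def local_sum(node_rows):
    """Aggregate the rows held by a single node."""
    partial = Counter()
    for key, amount in node_rows:
        partial[key] += amount
    return partial


def parallel_sum(rows, node_count=4):
    """Run the per-node aggregation concurrently, then merge the partials."""
    nodes = distribute(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(local_sum, nodes))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values per key
    return dict(total)


# Hypothetical sample data: (region, sales amount)
SALES = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]
```

Calling `parallel_sum(SALES)` gives the same totals whether the rows sit on one node or four, which is the property that lets a warehouse add compute without changing the query.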
There’s a price advantage to this architecture, too. Like much of Azure’s PaaS functionality, it’s essentially serverless, spinning up as required and pausing when no longer needed. This way you pay only for the compute used to run a query and for your underlying storage, adding more compute as required for more complex analysis.
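The arithmetic behind that saving is simple enough to sketch. The rates below are made-up placeholders, not Azure prices, and `monthly_cost` is a hypothetical helper; the point is only that storage is billed regardless, while compute is billed for the hours the pool is running.

```python
def monthly_cost(active_hours, compute_rate_per_hour, stored_tb, storage_rate_per_tb):
    """Return compute plus storage cost for one month.

    All rates are hypothetical placeholders, not real Azure pricing.
    """
    compute = active_hours * compute_rate_per_hour
    storage = stored_tb * storage_rate_per_tb
    return compute + storage


# Assumed figures for illustration only
COMPUTE_RATE = 12   # currency units per hour of compute
STORAGE_RATE = 25   # currency units per TB per month
STORED_TB = 10

# Pool left running all month (about 730 hours)
always_on = monthly_cost(730, COMPUTE_RATE, STORED_TB, STORAGE_RATE)

# Pool paused outside roughly 160 working hours
paused_when_idle = monthly_cost(160, COMPUTE_RATE, STORED_TB, STORAGE_RATE)

print(always_on, paused_when_idle)
```

With these assumed numbers the storage charge is identical in both cases, so the whole difference comes from the hours of compute you avoid by pausing.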
Getting started with Azure Synapse SQL Pools
Creating a new SQL Pool for Azure Synapse follows the same path as Azure SQL Data Warehouse. Start by making a new SQL Data Warehouse resource in the Azure Portal. This lets you create a new server, with log-on and password. Choosing a performance level picks the default server type, which sets the compute costs for your pool. Once it is provisioned and the appropriate firewall rules are set, you can start managing it with your choice of SQL Server management tool. Queries are built using the familiar T-SQL dialect. You can use PolyBase to load data from Azure blob storage into staging tables, which you then use to build the production tables for your queries.
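As a rough sketch of that PolyBase staging pattern, the Python helper below assembles the T-SQL you might run from your management tool. The helper itself, the object names (`BlobSource`, `CsvFormat`, the `ext_` prefix), and the CSV layout are all assumptions for illustration, and the script omits the database scoped credential that a non-public storage account would need.

```python
def polybase_load_script(storage_account, container, folder, staging_table, columns):
    """Build T-SQL that exposes blob files as an external table
    and copies them into a staging table.

    columns is a list of (name, sql_type) pairs. All object names
    are hypothetical; adjust them to your own conventions.
    """
    col_defs = ",\n    ".join(f"{name} {sql_type}" for name, sql_type in columns)
    location = f"wasbs://{container}@{storage_account}.blob.core.windows.net"
    return f"""-- Point the pool at the blob container
CREATE EXTERNAL DATA SOURCE BlobSource
WITH (TYPE = HADOOP, LOCATION = '{location}');

-- Describe the file layout (comma-delimited text assumed)
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- External table over the files; no data is copied yet
CREATE EXTERNAL TABLE ext_{staging_table} (
    {col_defs}
)
WITH (LOCATION = '{folder}', DATA_SOURCE = BlobSource, FILE_FORMAT = CsvFormat);

-- Load into a staging table inside the pool
CREATE TABLE {staging_table}
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS SELECT * FROM ext_{staging_table};
"""


# Example with placeholder account, container, and column names
script = polybase_load_script(
    "myaccount", "sales", "/2019/", "stage_sales",
    [("id", "INT"), ("amount", "DECIMAL(10,2)")],
)
print(script)
```

A round-robin heap is used here because it is a common choice for fast staging loads; production tables built from the staging data would typically pick a distribution and index that suit the queries you plan to run.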
The same platform will work with Apache Spark rather than SQL if you prefer to use it to analyze your data. That approach gives you an extra level of flexibility, letting you choose the appropriate tool for your problem.