Choose a data storage technology - Azure Architecture Center (2024)

  • Article

This topic compares options for data storage for big data solutions—specifically, data storage for bulk data ingestion and batch processing, as opposed to analytical data stores or real-time streaming ingestion.

What are your options when choosing data storage in Azure?

There are several options for ingesting data into Azure, depending on your needs.

File storage:

  • Azure Storage blobs
  • Azure Data Lake Storage Gen2

NoSQL databases:

Analytical databases:

Azure Data Explorer

Azure Storage blobs

Azure Storage is a managed storage service that is highly available, secure, durable, scalable, and redundant. Microsoft takes care of maintenance and handles critical problems for you. Azure Storage is the most ubiquitous storage solution Azure provides, due to the number of services and tools that can be used with it.

There are various Azure Storage services you can use to store data. The most flexible option for storing blobs from many data sources is Blob storage. Blobs are basically files. They store pictures, documents, HTML files, virtual hard disks (VHDs), big data such as logs, database backups—pretty much anything. Blobs are stored in containers, which are similar to folders. A container provides a grouping of a set of blobs. A storage account can contain an unlimited number of containers, and a container can store an unlimited number of blobs.

Azure Storage is a good choice for big data and analytics solutions, because of its flexibility, high availability, and low cost. It provides hot, cool, and archive storage tiers for different use cases. For more information, see Azure Blob Storage: Hot, cool, and archive storage tiers.

Azure Blob storage can be accessed from Hadoop (available through HDInsight). HDInsight can use a blob container in Azure Storage as the default file system for the cluster. Through a Hadoop Distributed File System (HDFS) interface provided by a WASB driver, the full set of components in HDInsight can operate directly on structured or unstructured data stored as blobs. Azure Blob storage can also be accessed via Azure Synapse Analytics using its PolyBase feature.

Other features that make Azure Storage a good choice are:

  • Multiple concurrency strategies.
  • Disaster recovery and high-availability options.
  • Encryption at rest.
  • Azure role-based access control (RBAC) to control access using Microsoft Entra users and groups.

Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 is a single, centralized repository where you can store all your data, both structured and unstructured. A data lake enables your organization to quickly and more easily store, access, and analyze a wide variety of data in a single location. With a data lake, you don't need to conform your data to fit an existing structure. Instead, you can store your data in its raw or native format, usually as files or as binary large objects (blobs).

Data Lake Storage Gen2 converges the capabilities of Azure Data Lake Storage Gen1 with Azure Blob Storage. For example, Data Lake Storage Gen2 provides file system semantics, file-level security, and scale. Because these capabilities are built on Blob storage, you also get low-cost, tiered storage, with high availability/disaster recovery capabilities.

Data Lake Storage Gen2 makes Azure Storage the foundation for building enterprise data lakes on Azure. Designed from the start to service multiple petabytes of information while sustaining hundreds of gigabits of throughput, Data Lake Storage Gen2 allows you to easily manage massive amounts of data.

Azure Cosmos DB

Azure Cosmos DB is Microsoft's globally distributed multi-model database. Azure Cosmos DB guarantees single-digit-millisecond latencies at the 99th percentile anywhere in the world, offers multiple well-defined consistency models to fine-tune performance, and guarantees high availability with multi-homing capabilities.

Azure Cosmos DB is schema-agnostic. It automatically indexes all the data without requiring you to deal with schema and index management. It's also multi-model, natively supporting document, key-value, graph, and column-family data models.

Azure Cosmos DB features:

  • Geo-replication
  • Elastic scaling of throughput and storage worldwide
  • Five well-defined consistency levels

HBase on HDInsight

Apache HBase is an open-source, NoSQL database that is built on Hadoop and modeled after Google BigTable. HBase provides random access and strong consistency for large amounts of unstructured and semi-structured data in a schemaless database organized by column families.

Data is stored in the rows of a table, and data within a row is grouped by column family. HBase is schemaless in the sense that neither the columns nor the type of data stored in them need to be defined before using them. The open-source code scales linearly to handle petabytes of data on thousands of nodes. It can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop ecosystem.

The HDInsight implementation uses the scale-out architecture of HBase to provide automatic sharding of tables, strong consistency for reads and writes, and automatic failover. Performance is enhanced by in-memory caching for reads and high-throughput streaming for writes. In most cases, you'll want to create the HBase cluster inside a virtual network so other HDInsight clusters and applications can directly access the tables.

Azure Data Explorer

Azure Data Explorer is a fast and highly scalable data exploration service for log and telemetry data. It helps you handle the many data streams emitted by modern software so you can collect, store, and analyze data. Azure Data Explorer is ideal for analyzing large volumes of diverse data from any data source, such as websites, applications, IoT devices, and more. This data is used for diagnostics, monitoring, reporting, machine learning, and additional analytics capabilities. Azure Data Explorer makes it simple to ingest this data and enables you to do complex ad hoc queries on the data in seconds.

Azure Data Explorer can be linearly scaled out for increasing ingestion and query processing throughput. An Azure Data Explorer cluster can be deployed to a Virtual Network for enabling private networks.

Key selection criteria

To narrow the choices, start by answering these questions:

  • Do you need managed, high-speed, cloud-based storage for any type of text or binary data? If yes, then select one of the file storage or analytics options.

  • Do you need file storage that is optimized for parallel analytics workloads and high throughput/IOPS? If yes, then choose an option that is tuned to analytics workload performance.

  • Do you need to store unstructured or semi-structured data in a schemaless database? If so, select one of the non-relational or analytics options. Compare options for indexing and database models. Depending on the type of data you need to store, the primary database models may be the largest factor.

  • Can you use the service in your region? Check the regional availability for each Azure service. For more information, see Products available by region.

Capability matrix

The following tables summarize the key differences in capabilities.

File storage capabilities

CapabilityAzure Data Lake Storage Gen2Azure Blob Storage containers
PurposeOptimized storage for big data analytics workloadsGeneral purpose object store for a wide variety of storage scenarios
Use casesBatch, streaming analytics, and machine learning data such as log files, IoT data, click streams, large datasetsAny type of text or binary data, such as application back end, backup data, media storage for streaming, and general purpose data
StructureHierarchical file systemObject store with flat namespace
AuthenticationBased on Microsoft Entra identitiesBased on shared secrets Account Access Keys and Shared Access Signature Keys, and Azure role-based access control (Azure RBAC)
Authentication protocolOpen Authorization (OAuth) 2.0. Calls must contain a valid JWT (JSON web token) issued by Microsoft Entra IDHash-based Message Authentication Code (HMAC). Calls must contain a Base64-encoded SHA-256 hash over a part of the HTTP request.
AuthorizationPortable Operating System Interface (POSIX) access control lists (ACLs). ACLs based on Microsoft Entra identities can be set file and folder level.For account-level authorization use Account Access Keys. For account, container, or blob authorization use Shared Access Signature Keys.
AuditingAvailable.Available
Encryption at restTransparent, server sideTransparent, server side; Client-side encryption
Developer SDKs.NET, Java, Python, Node.js.NET, Java, Python, Node.js, C++, Ruby
Analytics workload performanceOptimized performance for parallel analytics workloads, High Throughput and IOPSNot optimized for analytics workloads
Size limitsNo limits on account sizes, file sizes or number of filesSpecific limits documented here
Geo-redundancyLocally-redundant (locally redundant storage (LRS)), globally redundant (geo-redundant storage (GRS)), read-access globally redundant (read-access geo-redundant storage (RA-GRS)), zone-redundant (zone-redundant storage (ZRS)).Locally redundant (LRS), globally redundant (GRS), read-access globally redundant (RA-GRS), zone-redundant (ZRS). See Azure Storage redundancy for more information

NoSQL database capabilities

CapabilityAzure Cosmos DBHBase on HDInsight
Primary database modelDocument store, graph, key-value store, wide column storeWide column store
Secondary indexesYesNo
SQL language supportYesYes (using the Phoenix JDBC driver)
ConsistencyStrong, bounded-staleness, session, consistent prefix, eventualStrong
Native Azure Functions integrationYesNo
Automatic global distributionYesNoHBase cluster replication can be configured across regions with eventual consistency
Pricing modelElastically scalable request units (RUs) charged per-second as needed, elastically scalable storagePer-minute pricing for HDInsight cluster (horizontal scaling of nodes), storage

Analytical database capabilities

CapabilityAzure Data Explorer
Primary database modelRelational (column store), telemetry, and time series store
SQL language supportYes
Pricing modelElastically scalable cluster instances
AuthenticationBased on Microsoft Entra identities
Encryption at restSupported, customer-managed keys
Analytics workload performanceOptimized performance for parallel analytics workloads
Size limitsLinearly scalable

Contributors

This article is maintained by Microsoft. It was originally written by the following contributors.

Principal author:

Next steps

  • Azure Cloud Storage Solutions and Services
  • Review your storage options
  • Introduction to Azure Storage
  • Introduction to Azure Data Explorer
  • Big data architectures
  • Big data architecture style
  • Understand data store models
Choose a data storage technology - Azure Architecture Center (2024)
Top Articles
Strategies for Building and Growing a Freight Brokerage Team
What is the difference between a mortgage broker and a mortgage lender? | Consumer Financial Protection Bureau
Dte Outage Map Woodhaven
Pangphip Application
Craigslist Vans
Lighthouse Diner Taylorsville Menu
Jeremy Corbell Twitter
Professor Qwertyson
Tyrunt
Free VIN Decoder Online | Decode any VIN
Tanger Outlets Sevierville Directory Map
Maxpreps Field Hockey
Qhc Learning
83600 Block Of 11Th Street East Palmdale Ca
Darksteel Plate Deepwoken
Labor Gigs On Craigslist
7 Fly Traps For Effective Pest Control
Images of CGC-graded Comic Books Now Available Using the CGC Certification Verification Tool
Plan Z - Nazi Shipbuilding Plans
Accident On May River Road Today
Stardew Expanded Wiki
China’s UberEats - Meituan Dianping, Abandons Bike Sharing And Ride Hailing - Digital Crew
Curry Ford Accident Today
Pjs Obits
Kaitlyn Katsaros Forum
Morse Road Bmv Hours
Baldur's Gate 3: Should You Obey Vlaakith?
Troy Gamefarm Prices
Watson 853 White Oval
Coindraw App
2023 Ford Bronco Raptor for sale - Dallas, TX - craigslist
Top Songs On Octane 2022
Dtlr On 87Th Cottage Grove
Http://N14.Ultipro.com
Jay Gould co*ck
De beste uitvaartdiensten die goede rituele diensten aanbieden voor de laatste rituelen
Royals op zondag - "Een advertentie voor Center Parcs" of wat moeten we denken van de laatste video van prinses Kate?
10 Most Ridiculously Expensive Haircuts Of All Time in 2024 - Financesonline.com
ATM Near Me | Find The Nearest ATM Location | ATM Locator NL
Bbc Gahuzamiryango Live
Kazwire
Craigslist Putnam Valley Ny
Craigslist En Brownsville Texas
F9 2385
2132815089
Sofia Franklyn Leaks
Nu Carnival Scenes
Blue Beetle Showtimes Near Regal Evergreen Parkway & Rpx
Sara Carter Fox News Photos
Dineren en overnachten in Boutique Hotel The Church in Arnhem - Priya Loves Food & Travel
Fresno Craglist
Ok-Selection9999
Latest Posts
Article information

Author: Arielle Torp

Last Updated:

Views: 6450

Rating: 4 / 5 (41 voted)

Reviews: 88% of readers found this page helpful

Author information

Name: Arielle Torp

Birthday: 1997-09-20

Address: 87313 Erdman Vista, North Dustinborough, WA 37563

Phone: +97216742823598

Job: Central Technology Officer

Hobby: Taekwondo, Macrame, Foreign language learning, Kite flying, Cooking, Skiing, Computer programming

Introduction: My name is Arielle Torp, I am a comfortable, kind, zealous, lovely, jolly, colorful, adventurous person who loves writing and wants to share my knowledge and understanding with you.