simon_bisson
Contributor

How to use Azure Data Explorer for large-scale data analysis

analysis
Feb 12, 2019 · 6 mins
Analytics, Cloud Computing, Data Management

Microsoft’s tool for querying terabytes of data finally arrives for everyone to use

One of the big issues facing anyone building a data-driven devops practice is, quite simply, the scale of the data you’re collecting. Logs from millions of users quickly add up, and the same is true of the internet of things or any other large source of data. It’s a world where you’re generating terabytes of data, and you need to understand quickly what that data is telling you.

Traditional databases aren’t much help, because you have to run that data through an extract, transform, load (ETL) process before you can start to explore it, even if you’re considering using data warehouse-style analytics tools. Tools to handle massive amounts of data are becoming increasingly important, not only for analytical systems, but also to provide the training data needed to build machine learning models.

Introducing Azure Data Explorer

That’s where Azure’s Data Explorer comes in. It’s a tool for delving through your data, making ad-hoc queries while quickly bringing your data into a central store. Microsoft claims import speeds of up to 200MB/sec per node and queries across a billion records taking less than a second. Data can be analyzed using conventional techniques or across time series, with a fully managed platform where you only need to consider your data and your queries.

Working at cloud scale can mean generating large amounts of data, which can be hard to analyze using traditional tools. Like Cosmos DB, Azure Data Explorer is another example of Microsoft giving its own internal tools to its customers. Running a public cloud at scale has meant that Microsoft has needed to create new tools to cope with terabytes of data and to manage massive data centers. Azure Data Explorer brings those elements together and turns them into a tool that can work with your log files and your streaming data. That makes it an essential tool for anyone building massive distributed applications, on-premises or in the cloud.

Originally code-named Kusto, Azure Data Explorer is the commercial version of the tools Microsoft uses to manage its own logging data across Azure. Back in 2016, Microsoft was handling more than a trillion events and more than 600TB of data daily, enough to well and truly stress test the underlying system. Unless you’re running all the IoT systems for BP or another large oil company, you’re unlikely to need to process that much data, but it’s good to know that the option is there.
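To put those numbers in perspective, here is a rough back-of-envelope calculation in Python. It is only a sketch: it assumes decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes), a perfectly even load across the day, and that the quoted peak of 200MB/sec per node could be sustained, none of which the figures above guarantee.

```python
import math

# Figures quoted in the article. Decimal units are an assumption.
DAILY_BYTES = 600 * 10**12                 # "more than 600TB of data daily"
DAILY_EVENTS = 10**12                      # "more than a trillion events"
NODE_INGEST_BYTES_PER_SEC = 200 * 10**6    # "up to 200MB/sec per node"
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate(daily_total):
    """Average per-second rate if the daily total arrived evenly."""
    return daily_total / SECONDS_PER_DAY


def min_nodes(daily_bytes, per_node_rate):
    """Smallest whole number of nodes able to absorb the average byte rate."""
    return math.ceil(sustained_rate(daily_bytes) / per_node_rate)


bytes_per_sec = sustained_rate(DAILY_BYTES)
events_per_sec = sustained_rate(DAILY_EVENTS)
avg_event_bytes = DAILY_BYTES / DAILY_EVENTS
nodes = min_nodes(DAILY_BYTES, NODE_INGEST_BYTES_PER_SEC)

print(f"{bytes_per_sec / 10**9:.1f} GB/s sustained ingest")
print(f"{events_per_sec / 10**6:.1f} million events/s")
print(f"{avg_event_bytes:.0f} bytes per event on average")
print(f"at least {nodes} nodes at the quoted peak rate")
```

Under those assumptions the 2016 workload averages about 6.9GB/sec, or roughly 11.6 million events per second at around 600 bytes each, and would need at least 35 nodes for ingestion alone. Real traffic is bursty, so treat that as a floor rather than a sizing guide.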

Azure Data Explorer: a query engine for cloud-scale data

At the heart of Azure Data Explorer is a custom query engine, with its own query language that’s optimized for working with large amounts of data and able to work with a mix of structured and unstructured data from many sources. It’s a read-only tool particularly useful for working with logs and column stores. Microsoft uses elements of its Kusto query language in other Azure tools, including the Application Insights tool that’s at the heart of much of the operations side of Azure DevOps.

You start by creating a cluster with its associated databases, before ingesting data. Once it’s in place and receiving data, you can start to explore your data using a query engine that’s available as a standalone application or hosted in the Azure Portal. Adding it to an existing data pipeline won’t affect your applications; it’s another fork in your pipeline that takes advantage of Azure’s distributed architecture to operate outside your application flow.

Being part of your data pipeline but outside your applications is an important aspect of working with Azure Data Explorer. It’s a tool for speculative analysis of your data, one that can inform the code you build, optimizing what you query for or helping build new models that can become part of your machine learning platform. Queries won’t modify your data, and they can be shared with other users, making this a useful tool for a data science team.

Using Azure Data Explorer with application data

One of the more useful ways of working with Azure Data Explorer is with Event Hubs and Event Grid. You start by creating a table in your Data Explorer instance, which you’ll map to the structure of the JSON data that’s being processed by the event hub. Once that’s in place, you connect your event hub feed to a table, using your JSON mapping to populate the data. You also need to set up any connection strings, authorizing the link between your event hub and your Data Explorer table. Once you’re running, you may need to wait a while before querying your data, because the ingestion process batches data before feeding it into the table.
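As a minimal sketch of that first step, the Python below generates the two Kusto control commands involved: one to create the table and one to create the JSON ingestion mapping. The table name, mapping name, columns, and JSON paths are all hypothetical, and the mapping layout (a `column` name plus a `Properties.Path` entry) reflects my understanding of the current documentation rather than anything stated here, so check it against the ingestion mapping reference before relying on it.

```python
import json


def create_table_command(table, columns):
    """Build a `.create table` command from a {column_name: kusto_type} dict."""
    schema = ", ".join(f"{name}: {ktype}" for name, ktype in columns.items())
    return f".create table {table} ({schema})"


def json_mapping_command(table, mapping_name, paths):
    """Build a `.create table ... ingestion json mapping` command.

    `paths` maps each table column to the JSONPath of the matching field in
    the incoming event. The mapping body is embedded in single quotes, so
    this sketch assumes no name or path contains a single quote.
    """
    entries = [
        {"column": column, "Properties": {"Path": path}}
        for column, path in paths.items()
    ]
    body = json.dumps(entries, separators=(",", ":"))
    return f".create table {table} ingestion json mapping '{mapping_name}' '{body}'"


# Hypothetical schema for events arriving through an event hub.
columns = {"TimeStamp": "datetime", "Name": "string", "Metric": "int", "Source": "string"}
paths = {"TimeStamp": "$.timeStamp", "Name": "$.name", "Metric": "$.metric", "Source": "$.source"}

print(create_table_command("TestTable", columns))
print(json_mapping_command("TestTable", "TestMapping", paths))
```

You would run the two generated commands in the query window against your database, then name the same table and mapping when you set up the event hub data connection.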

With data feeding into a Data Explorer table, you can start to build queries. Use the Azure Portal to make your first queries, using its query builder tool. If you want to get started without creating your own data sources, Microsoft provides a preconfigured test cluster you can use for your query experiments.

To query a table, you simply start a query with its name, then apply your sorting criteria before filtering off the data you want to use. The Azure Portal-based query builder displays the results in a table. More complex queries let you choose which elements of a table you want to display, while the Recall command brings back previous queries so you can compare different passes through the same data.
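That pattern of a table name followed by a chain of piped operators can be sketched in a few lines of Python that assemble the query text. The `StormEvents` table and its `StartTime`, `State`, and `EventType` columns are assumptions borrowed from Microsoft’s sample data set rather than anything named in this article; swap in your own table and columns.

```python
def build_query(table, *stages):
    """Join a table name and its operator stages with the query language's pipe."""
    return "\n| ".join([table, *stages])


query = build_query(
    "StormEvents",                          # assumed sample table name
    "sort by StartTime desc",               # sorting criteria
    'where State == "TEXAS"',               # filter down to the rows you want
    "project StartTime, State, EventType",  # choose which columns to display
    "take 10",                              # keep a small slice of the result
)

print(query)
```

The printed text can be pasted straight into the query window. The stages follow the order described above, sorting first and then filtering; on large tables it is usually cheaper to put the `where` stage ahead of the `sort`, so the engine has fewer rows to order.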

The tool panel in the column view grid exposes more query options, giving you more ways to filter your data and even apply basic pivot table options. If you’ve used Excel’s data analytics features, you’ll find it very familiar as a tool for quickly spotting interesting data points that can drive deeper analysis. The query builder also includes tools for basic visualization, and you can choose from a range of chart types.

You’re not limited to using the portal query builder because Microsoft has also released a Python library targeted at data scientists. With Python an important tool for machine learning, you can start using tools like the Anaconda analytics environment and Jupyter Notebooks to work with your Azure Data Explorer data sets. Data scientists aren’t the only audience for Azure Data Explorer; there’s also a connector to Power BI for business analysts.
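Here is a hedged sketch of what working from Python might look like. It assumes the `azure-kusto-data` package (with pandas installed) and device-code sign-in; the table name, column name, cluster URL, and database are placeholders, and the client calls reflect my reading of the package’s documentation, so verify them against the version you install. The import sits inside the function so the query-building half runs even without the package.

```python
def daily_counts_query(table, time_column, days=7):
    """Build a simple time-series query: events per day over the last `days` days."""
    return (
        f"{table}\n"
        f"| where {time_column} > ago({days}d)\n"
        f"| summarize Events = count() by bin({time_column}, 1d)\n"
        f"| order by {time_column} asc"
    )


def run_query(cluster_url, database, query):
    """Run a query and return the first result table as a pandas DataFrame.

    Assumes `pip install azure-kusto-data pandas`. Imported lazily so the
    rest of this sketch works without those packages.
    """
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
    from azure.kusto.data.helpers import dataframe_from_result_table

    # Device-code authentication prompts you to sign in from a browser.
    kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(cluster_url)
    client = KustoClient(kcsb)
    response = client.execute(database, query)
    return dataframe_from_result_table(response.primary_results[0])


# Hypothetical table and timestamp column.
query = daily_counts_query("TestTable", "TimeStamp", days=7)
print(query)

# In a notebook, with your own cluster and database filled in:
# df = run_query("https://<your-cluster>.<region>.kusto.windows.net", "<your-database>", query)
# df.plot(x="TimeStamp", y="Events")
```

Returning a DataFrame keeps the result in a form that Jupyter, Anaconda’s bundled libraries, and most machine learning toolkits already understand.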

Author of InfoWorld's Enterprise Microsoft blog, Simon Bisson prefers to think of "career" as a verb rather than a noun, having worked in academic and telecoms research, as well as having been the CTO of a startup, running the technical side of UK Online (the first national ISP with content as well as connections), before moving into consultancy and technology strategy. He’s built plenty of large-scale web applications, designed architectures for multi-terabyte online image stores, implemented B2B information hubs, and come up with next generation mobile network architectures and knowledge management solutions. In between doing all that, he’s been a freelance journalist since the early days of the web and writes about everything from enterprise architecture down to gadgets.
