Arenadata Documentation
Our passion is building efficient, flexible solutions that scale to dozens of petabytes
Products
Explore our range of solutions in the world of Big Data

Arenadata Hadoop

Arenadata Hadoop (ADH) is a commercial distribution of the open-source Apache Hadoop software. It is a big data platform designed for storing, processing, and analyzing large volumes of structured and unstructured data.
Arenadata Hadoop includes various tools and components that are part of the Hadoop ecosystem, such as the Hadoop Distributed File System (HDFS), MapReduce, YARN, and various other Apache projects. It also includes additional software components and tools that are designed to make it easier to deploy, manage, and use Hadoop in enterprise environments.
Use cases
Big data analytics

ADH can be used to process and analyze large volumes of data, such as clickstream data, sensor data, social media data, and financial data. This can help businesses gain valuable insights into customer behavior, market trends, and other important metrics.

Machine learning and artificial intelligence

ADH can be used as a data processing platform for machine learning and artificial intelligence applications. This can help businesses to build predictive models, detect anomalies, and automate decision-making processes.

Data integration

ADH can be used to integrate data from multiple sources and formats into a unified, centralized data repository. This can help businesses to eliminate data silos and provide a single, consistent view of data.

Fraud detection and prevention

ADH can be used to detect and prevent fraud by analyzing large volumes of data in real-time. This can help businesses to identify and respond to fraudulent activities quickly, reducing losses and protecting their reputation.

Log analytics

ADH can be used to process and analyze log data generated by IT systems and applications. This can help businesses to troubleshoot issues, identify performance bottlenecks, and improve system reliability.

Enterprise
Community
Support for key Hadoop components
High availability and disaster recovery features
Advanced security features, including encryption, role-based access control
Automated management and monitoring tools
Deploy & upgrade automation
Offline installation
Technical support 24/7
Corporate training courses
Tailored solutions
Available integrations
ADQM
Arenadata QuickMarts
  • ADQM Spark connector provides high-speed parallel data exchange between Apache Spark in ADH and Arenadata QuickMarts (ADQM).
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • Flink JDBC connector allows reading data from and writing data into any relational databases with a JDBC driver.
ADB
ADB
  • ADB Spark connector provides high-speed parallel data exchange between Apache Spark and Arenadata DB (ADB).
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • Flink JDBC connector allows reading data from and writing data into any relational databases with a JDBC driver.
ADPG
ADPG
  • Spark JDBC connector connects Spark to any JDBC-compatible database like Arenadata Postgres (ADPG) and unlocks new opportunities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • Flink JDBC connector allows reading data from and writing data into any relational databases with a JDBC driver.
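As an illustrative sketch of the JDBC path described above (host, database, table, and credentials are hypothetical; the PostgreSQL JDBC driver must be on the Spark classpath, e.g. via `--jars` on spark-submit), reading an ADPG table into Spark might look like this:

```python
# Sketch: reading an ADPG (PostgreSQL) table into Spark via the JDBC source.
# All connection details below are hypothetical placeholders.

def adpg_jdbc_options(host: str, port: int, database: str,
                      user: str, password: str) -> dict:
    """Build the option map Spark's JDBC data source expects for PostgreSQL."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }

def read_adpg_table(table: str):
    # Lazy import so the helper above stays usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("adpg-jdbc-demo").getOrCreate()
    opts = adpg_jdbc_options("adpg.example.com", 5432, "demo", "etl", "secret")
    return (spark.read.format("jdbc")
            .options(**opts)
            .option("dbtable", table)
            .load())
```

The same pattern applies to any JDBC-compatible database; only the URL prefix and driver class change.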
ADS
ADS
  • Spark Streaming streamlines your real-time data processing with Spark Streaming, Kafka, or Arenadata Streaming (ADS), enabling seamless data ingestion, processing, and analysis at scale.
  • Flink Apache Kafka connector provides high-performance stream processing, enabling real-time data analysis, transformation, and visualization at scale.
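A minimal sketch of consuming an ADS (Kafka) topic with Spark Structured Streaming, assuming hypothetical broker addresses and topic name and the spark-sql-kafka-0-10 package available to Spark:

```python
# Sketch: streaming ingestion from ADS (Kafka) with Spark Structured Streaming.
# Broker addresses and the topic name are hypothetical placeholders.

def kafka_source_options(bootstrap_servers: list, topic: str) -> dict:
    """Options for Spark's Kafka source."""
    return {
        "kafka.bootstrap.servers": ",".join(bootstrap_servers),
        "subscribe": topic,
        "startingOffsets": "earliest",
    }

def stream_from_ads():
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("ads-stream-demo").getOrCreate()
    opts = kafka_source_options(["ads-broker1:9092", "ads-broker2:9092"],
                                "events")
    df = spark.readStream.format("kafka").options(**opts).load()
    # Kafka delivers keys and values as binary; cast to strings for processing.
    return df.select(col("key").cast("string"), col("value").cast("string"))
```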
Oracle
Oracle
  • Spark JDBC connector connects Spark to any JDBC-compatible database like Oracle and unlocks new opportunities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • Flink JDBC connector allows reading data from and writing data into any relational databases with a JDBC driver.
MS SQL
MS SQL
  • Spark JDBC connector connects Spark to any JDBC-compatible database like MS SQL and unlocks new possibilities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • Flink JDBC connector allows reading data from and writing data into any relational databases with a JDBC driver.
AWS S3
AWS S3
  • Hadoop AWS module provides support for AWS integration.
  • S3a connector provides a fast and efficient way to access data stored in Simple Storage Service (S3) from Spark applications.
  • Flink S3 connector lets Flink read and write data in S3 and use it with the streaming state backends.
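To make the S3a path concrete, here is a sketch of reading Parquet data from S3 with Spark (bucket name, endpoint, and credential placeholders are hypothetical; hadoop-aws and its AWS SDK dependency must be on the classpath):

```python
# Sketch: accessing S3 data from Spark through the S3a connector.
# Credentials, endpoint, and bucket below are hypothetical placeholders.

def s3a_conf(access_key: str, secret_key: str, endpoint: str) -> dict:
    """Hadoop configuration entries for the S3a filesystem."""
    return {
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.endpoint": endpoint,
    }

def s3a_path(bucket: str, key: str) -> str:
    """Build an s3a:// URI for a bucket and object key prefix."""
    return f"s3a://{bucket}/{key}"

def read_from_s3():
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("s3a-demo")
    for k, v in s3a_conf("ACCESS_KEY", "SECRET_KEY",
                         "s3.amazonaws.com").items():
        builder = builder.config(f"spark.hadoop.{k}", v)
    spark = builder.getOrCreate()
    return spark.read.parquet(s3a_path("my-bucket", "events/2024/"))
```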
Azure Storage
Azure Storage
  • Hadoop Azure module provides support for integration with Azure Blob Storage.
  • Spark WASB (Windows Azure Storage Blob) connector is an Apache Spark library that enables Spark applications to read and write data from Azure Blob Storage.
Azure Datalake
Azure Datalake
  • Spark ABFS (Azure Blob File System) connector provides an API for Spark applications to read and write data directly from ADLS Gen2 without the need to stage data on a local disk.
  • Flink ABS connector lets Flink read and write data in Azure Blob Storage.
GCS
GCS
  • Spark GS connector provides an API for Spark applications to read and write data directly from Google Cloud Storage (GCS) without the need to stage data on a local disk.
  • Flink GCS connector can be used for reading and writing data and for checkpoint storage.
JDBC
JDBC
  • Spark JDBC connector connects Spark to any JDBC-compatible database and unlocks new possibilities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • Flink JDBC connector allows reading data from and writing data into any relational databases with a JDBC driver.
Solr
Solr
The Spark Solr integration is a library that allows Spark applications to read and write data to Apache Solr. With the Spark Solr integration, Spark applications can read data from Solr using SolrRDD, which allows the parallelization of data processing across a Spark cluster.
Phoenix
Phoenix

The Spark Apache Phoenix integration is a library that enables Spark applications to interact with Apache Phoenix, an open-source SQL layer over Apache HBase that provides an SQL-like syntax to query and manage data stored in HBase.

With the Spark Apache Phoenix integration, Spark applications can read data from Phoenix tables using PhoenixRDD, which provides a distributed representation of the data stored in a Phoenix table.
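A sketch of reading a Phoenix table from Spark follows. The table name and ZooKeeper quorum are hypothetical, the phoenix-spark library must be on the classpath, and the exact data source name can differ between Phoenix versions:

```python
# Sketch: reading an Apache Phoenix table into a Spark DataFrame.
# Table name and ZooKeeper quorum are hypothetical placeholders.

def phoenix_read_options(table: str, zk_quorum: str) -> dict:
    """Options commonly accepted by the phoenix-spark data source."""
    return {"table": table, "zkUrl": zk_quorum}

def read_phoenix_table():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("phoenix-demo").getOrCreate()
    opts = phoenix_read_options("WEB_STAT", "zk1,zk2,zk3:2181")
    return (spark.read.format("org.apache.phoenix.spark")
            .options(**opts)
            .load())
```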

Zeppelin
Zeppelin

Apache Zeppelin is a web-based notebook interface for interactive data analytics with Apache Hadoop. It allows you to create and execute data-driven workflows using a variety of languages within a single, integrated environment.

Airflow
Airflow
Airflow2 is a platform for creating, scheduling, and monitoring data workflows. It provides a web-based interface for creating and managing workflows, which can include tasks such as data ingestion, transformation, and loading.
AVRO
AVRO
AVRO is a binary data format that is designed to be compact and fast. It supports schema evolution, which allows data schemas to change over time without requiring data to be rewritten or reloaded.
PARQUET
PARQUET
PARQUET is a columnar storage format that is optimized for processing large datasets. It stores data in a columnar fashion, which allows for faster access to individual columns and improved compression ratios.
ORC
ORC
ORC (Optimized Row Columnar) is another columnar storage format that is designed to be highly efficient and scalable. It supports compression and predicate push-down, which can greatly improve query performance.
DELTA
DELTA
DELTA is a transactional storage format that is built on top of Parquet and provides support for ACID transactions. It also supports schema evolution and provides features like versioning and time travel.
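As a sketch of the time-travel feature mentioned above (the table path is hypothetical, and the delta-spark package with its SQL extensions must be configured on the session):

```python
# Sketch: Delta Lake time travel from Spark -- reading a historical
# snapshot of a table by version number. The path is a hypothetical placeholder.

def delta_version_options(version: int) -> dict:
    """Reader option that selects a historical snapshot of a Delta table."""
    return {"versionAsOf": str(version)}

def read_delta_snapshot(path: str, version: int):
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-demo").getOrCreate()
    return (spark.read.format("delta")
            .options(**delta_version_options(version))
            .load(path))
```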
XML
XML
XML is a markup language used for representing structured data. Spark can handle XML data by using libraries like spark-xml.
JSON
JSON
JSON (JavaScript Object Notation) is a lightweight data format that is commonly used for exchanging data between applications. Spark has built-in support for reading and writing JSON data.
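The formats above are interchangeable Spark data sources behind a single read/write API, which the following sketch illustrates (output paths are hypothetical; Avro additionally requires the spark-avro package):

```python
# Sketch: writing one DataFrame out in several of the formats listed above.
# The base path is a hypothetical placeholder.

FORMATS = ("avro", "parquet", "orc", "json")

def output_path(base: str, fmt: str) -> str:
    """Derive a per-format output directory under a common base path."""
    assert fmt in FORMATS, f"unsupported format: {fmt}"
    return f"{base}/{fmt}"

def fan_out(base: str = "/tmp/formats-demo"):
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("formats-demo").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    for fmt in FORMATS:
        df.write.format(fmt).mode("overwrite").save(output_path(base, fmt))
```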
Operating systems
Alt Linux
Alt Linux 8.4 SP is supported
CentOS
CentOS 7 is supported
RedHat
RedHat 7 is supported
Astra Linux
Astra Linux SE 1.7 Orel is supported
Ubuntu
Ubuntu 22.04.2 LTS in development
RedOS
RedOS 7.3 in development
Support for key Hadoop components
High availability and disaster recovery features
Advanced security features, including encryption, role-based access control
Automated management and monitoring tools
Deploy & upgrade automation
Offline installation
Technical support 24/7
Corporate training courses
Tailored solutions
Available integrations
ADQM
Arenadata QuickMarts
Available only for Enterprise
ADB
ADB
Available only for Enterprise
ADPG
ADPG
  • Spark JDBC connector connects Spark to any JDBC-compatible database like Arenadata Postgres (ADPG) and unlocks new opportunities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • Flink JDBC connector allows reading data from and writing data into any relational databases with a JDBC driver.
ADS
ADS
  • Spark Streaming streamlines your real-time data processing with Spark Streaming, Kafka, or Arenadata Streaming (ADS), enabling seamless data ingestion, processing, and analysis at scale.
  • Flink Apache Kafka connector provides high-performance stream processing, enabling real-time data analysis, transformation, and visualization at scale.
Oracle
Oracle
  • Spark JDBC connector connects Spark to any JDBC-compatible database like Oracle and unlocks new opportunities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • Flink JDBC connector allows reading data from and writing data into any relational databases with a JDBC driver.
MS SQL
MS SQL
  • Spark JDBC connector connects Spark to any JDBC-compatible database like MS SQL and unlocks new possibilities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • Flink JDBC connector allows reading data from and writing data into any relational databases with a JDBC driver.
AWS S3
AWS S3
  • Hadoop AWS module provides support for AWS integration.
  • S3a connector provides a fast and efficient way to access data stored in Simple Storage Service (S3) from Spark applications.
  • Flink S3 connector lets Flink read and write data in S3 and use it with the streaming state backends.
Azure Storage
Azure Storage
  • Hadoop Azure module provides support for integration with Azure Blob Storage.
  • Spark WASB (Windows Azure Storage Blob) connector is an Apache Spark library that enables Spark applications to read and write data from Azure Blob Storage.
Azure Datalake
Azure Datalake
  • Spark ABFS (Azure Blob File System) connector provides an API for Spark applications to read and write data directly from ADLS Gen2 without the need to stage data on a local disk.
  • Flink ABS connector lets Flink read and write data in Azure Blob Storage.
GCS
GCS
  • Spark GS connector provides an API for Spark applications to read and write data directly from Google Cloud Storage (GCS) without the need to stage data on a local disk.
  • Flink GCS connector can be used for reading and writing data and for checkpoint storage.
JDBC
JDBC
  • Spark JDBC connector connects Spark to any JDBC-compatible database and unlocks new possibilities for data analysis, processing, and visualization.
  • Hive JdbcStorageHandler supports reading from a JDBC data source in Hive.
  • Flink JDBC connector allows reading data from and writing data into any relational databases with a JDBC driver.
Solr
Solr
The Spark Solr integration is a library that allows Spark applications to read and write data to Apache Solr. With the Spark Solr integration, Spark applications can read data from Solr using SolrRDD, which allows the parallelization of data processing across a Spark cluster.
Phoenix
Phoenix

The Spark Apache Phoenix integration is a library that enables Spark applications to interact with Apache Phoenix, an open-source SQL layer over Apache HBase that provides an SQL-like syntax to query and manage data stored in HBase.

With the Spark Apache Phoenix integration, Spark applications can read data from Phoenix tables using PhoenixRDD, which provides a distributed representation of the data stored in a Phoenix table.

Zeppelin
Zeppelin

Apache Zeppelin is a web-based notebook interface for interactive data analytics with Apache Hadoop. It allows you to create and execute data-driven workflows using a variety of languages within a single, integrated environment.

Airflow
Airflow
Airflow2 is a platform for creating, scheduling, and monitoring data workflows. It provides a web-based interface for creating and managing workflows, which can include tasks such as data ingestion, transformation, and loading.
AVRO
AVRO
AVRO is a binary data format that is designed to be compact and fast. It supports schema evolution, which allows data schemas to change over time without requiring data to be rewritten or reloaded.
PARQUET
PARQUET
PARQUET is a columnar storage format that is optimized for processing large datasets. It stores data in a columnar fashion, which allows for faster access to individual columns and improved compression ratios.
ORC
ORC
ORC (Optimized Row Columnar) is another columnar storage format that is designed to be highly efficient and scalable. It supports compression and predicate push-down, which can greatly improve query performance.
DELTA
DELTA
DELTA is a transactional storage format that is built on top of Parquet and provides support for ACID transactions. It also supports schema evolution and provides features like versioning and time travel.
XML
XML
XML is a markup language used for representing structured data. Spark can handle XML data by using libraries like spark-xml.
JSON
JSON
JSON (JavaScript Object Notation) is a lightweight data format that is commonly used for exchanging data between applications. Spark has built-in support for reading and writing JSON data.
Operating systems
Alt Linux
Available only for Enterprise
CentOS
CentOS 7 is supported
RedHat
RedHat 7 is supported
Astra Linux
Available only for Enterprise
Ubuntu
Ubuntu 22.04.2 LTS in development
RedOS
Available only for Enterprise
Components
Hue

In development. Hue (Hadoop User Experience) is a web-based interface for data analytics in the Hadoop ecosystem.

Hue allows users to perform data analysis without losing context. The goal is to promote self-service analytics and stay as simple as Excel so users can find, explore, query, and analyze data. One of the main advantages of Hue is its ability to connect to various data sources: Apache Hive, Impala, Flink SQL, Spark SQL, Phoenix, ksqlDB, Apache Hadoop HDFS, Ozone, HBase, etc.

Apache Ozone

In development. Apache Ozone is an open-source, scalable, and distributed object store designed for big data workloads. It is part of the Apache Hadoop ecosystem and is designed to overcome the scalability limitations of the Hadoop Distributed File System (HDFS).

Ozone is designed to provide high performance and scalability for storing and processing large amounts of unstructured data such as log files, images, videos, and other large data objects. It is optimized for workloads that require high throughput and low latency, such as big data analytics, machine learning, and streaming data processing.

One of the key features of Ozone is its support for multiple storage classes, including hot, warm, and cold storage. This allows users to store data based on its access patterns and lifecycle, optimizing cost and performance.

Ozone also includes built-in data replication and distribution capabilities, enabling data to be stored across multiple nodes in a Hadoop cluster for improved availability and durability.

Smart Storage Manager
Technology preview.
Technology Preview Services are not intended for use in a production environment and may not be fully functional. They are under development and are provided to the client for review and testing.
Smart Storage Manager (SSM) is a service that optimizes the efficiency of storing and managing data in the Hadoop Distributed File System. SSM collects HDFS operation data and system state information and, based on the collected metrics, can automatically apply techniques such as caching, storage policies, heterogeneous storage management (HSM), data compression, and erasure coding. In addition, SSM can be configured to asynchronously replicate data and namespaces to a backup cluster for disaster recovery (DR).
Apache Kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide SQL on Data Warehouses and Lakehouses.

Kyuubi builds distributed SQL query engines on top of modern computing frameworks such as Apache Spark, Flink, Hive, and Impala to query massive datasets distributed over fleets of machines from heterogeneous data sources.

Apache Impala

Apache Impala is an open-source massively parallel processing (MPP) SQL query engine for processing large volumes of data in real-time. It allows users to perform interactive queries on Apache Hadoop data stored in HDFS or Apache HBase. Impala was developed to address the need for a faster, more efficient SQL query engine for big data processing than traditional batch-oriented SQL engines.

Impala provides high-speed performance through its MPP architecture, which enables it to distribute processing across multiple nodes in a Hadoop cluster. It also includes support for advanced features such as complex joins, subqueries, and aggregation functions.

Impala is designed to be easy to use and integrate with existing BI and analytics tools. It supports standard SQL queries and JDBC/ODBC drivers for easy integration with a wide range of applications.

Apache ZooKeeper

Apache ZooKeeper is a distributed coordination service that is designed to help manage large distributed systems. It provides a centralized infrastructure for maintaining configuration information, naming, providing distributed synchronization, and providing group services. ZooKeeper is used extensively in Hadoop clusters to help manage the coordination of distributed systems and to ensure that each node in the cluster is aware of the state of the other nodes.

Hadoop Distributed File System (HDFS)

HDFS is a highly scalable and fault-tolerant distributed file system that forms the foundation of the ADH platform. It allows you to store large volumes of data across multiple nodes in a cluster, with built-in redundancy to ensure that data is always available, even in case of a node failure. HDFS is optimized for handling large files, making it an ideal choice for big data applications.

Apache YARN

YARN is a resource management and job scheduling framework that allows you to run multiple applications simultaneously on a Hadoop cluster. YARN enables you to allocate cluster resources dynamically, based on the needs of each application, and to monitor and manage those resources to ensure optimal performance.

Apache HBase

This is a NoSQL database that provides real-time read/write access to large datasets stored in Hadoop. HBase is designed to handle massive volumes of data and is optimized for random, real-time access to data, making it a popular choice for big data applications that require low-latency access to large datasets.

Apache Phoenix

Apache Phoenix is an open-source, SQL-like query engine for Hadoop that is designed to provide fast and efficient querying of large datasets. Phoenix is built on top of HBase, which means that it can handle massive amounts of data with low latency and provides support for real-time updates and access to data.

Apache Spark

Apache Spark is a fast and powerful open-source data processing engine that provides scalable, fault-tolerant data processing capabilities for big data workloads. The Apache Spark component of Arenadata Hadoop provides a high-performance and distributed computing framework that can process large datasets in parallel across a cluster of nodes. With its advanced analytics capabilities, including machine learning, graph processing, and SQL-like querying, Apache Spark can help businesses extract valuable insights from their data.

Apache Hive

Apache Hive is an open-source data warehouse infrastructure that provides data summarization, query, and analysis capabilities for large datasets stored in Hadoop. The Apache Hive component of Arenadata Hadoop provides a SQL-like interface for querying data in Hadoop, enabling businesses to perform ad-hoc queries, data analysis, and reporting. Hive translates SQL queries into MapReduce jobs, which can be executed on a Hadoop cluster. With its support for partitioning, indexing, and compression, Hive can help businesses optimize data storage and processing in Hadoop.

Apache Tez

Apache Tez is an open-source data processing framework that provides a flexible, efficient, and scalable way to execute complex data processing tasks on a Hadoop cluster. When used together with Apache Hive, Tez provides a faster and more efficient way to execute Hive queries, by replacing the MapReduce execution engine with a more optimized one. The Hive + Tez combination in Arenadata Hadoop provides a powerful and scalable platform for data warehousing, allowing businesses to perform ad-hoc queries, data analysis, and reporting at scale. With Tez's support for dynamic task scheduling and data partitioning, it can accelerate query processing by optimizing the data flow between Hive operators.

Apache Flink

Apache Flink is an open-source stream processing framework that enables the processing of large volumes of real-time data with low latency. The Apache Flink component of Arenadata Hadoop provides a distributed computing framework for real-time data processing that can be seamlessly integrated with batch processing. Flink supports event-driven processing and provides a unified programming model for both batch and stream processing, making it ideal for building end-to-end data processing pipelines. With its advanced features, including support for stateful streaming, windowing, and machine learning, Apache Flink can help businesses gain real-time insights from their data.

Apache Solr

Apache Solr is an open-source, enterprise-level search platform that is built on top of the Apache Lucene search library. Solr provides a robust and scalable search solution that is used by organizations of all sizes to power search functionality on their websites, mobile apps, and other applications.

Features
Time-saving
Reduced installation and configuration time compared to manual installation
Easy to use
Users can easily install and configure Hadoop without requiring extensive technical knowledge
Standardization
Standardized installation across multiple machines, reducing the risk of errors and inconsistencies
Increased efficiency
Reduced risk of system downtime and overall improved system efficiency
Expertise
Our team evaluates bug fixes and enhancements from the broader Hadoop community and determines which ones to incorporate into the product
Arenadata Platform Security
Enterprise edition
Arenadata Platform Security (ADPS) is a combination of two security components:
Apache Ranger
Apache Ranger is an open-source security framework that provides centralized policy management for Hadoop and other big data ecosystems. The Arenadata platform integrates with Apache Ranger to provide policy-based access control and fine-grained authorization for data and analytics applications.
Apache Knox
Apache Knox is an open-source gateway that provides secure access to Hadoop clusters and other big data systems. The Arenadata platform integrates with Apache Knox to provide secure access to the platform and its services.
Together, ADPS provides a comprehensive security framework that includes policy-based access control, fine-grained authorization, and secure access to the platform and its services. This helps organizations protect sensitive data and ensure compliance with regulations.
ADB Spark Connector
The ADB Spark connector provides high-speed parallel data exchange between Apache Spark and Arenadata DB.
It is highly configurable and offers many features, including:
  • high speed of data transmission;
  • automatic data schema generation;
  • flexible partitioning;
  • support for push-down operators;
  • support for batch operations.
ADQM Spark Connector
Multifunctional connector with support for parallel read/write operations between Apache Spark and Arenadata QuickMarts.
It is highly configurable and offers many features, including:
  • high speed of data transmission;
  • automatic data schema generation;
  • flexible partitioning;
  • support for push-down operators;
  • support for batch operations.
Product comparison
Infrastructure
Management system
Arenadata Cluster Manager (ADCM)

A single tool for managing the lifecycle of all Arenadata products.

ADCM is installed with one command and only requires Docker.

Cloudera Manager

Automatic deployment and configuration.

Custom monitoring and reporting.

Built-in monitoring
Yes
Yes
Centralized upgrade
Yes
Yes
IT landscape support
Ability to deploy various combinations of bare metal and cloud
Yes

By using infrastructure bundles, ADH supports installation on physical and virtual servers (on-premises) and in private and public clouds according to the IaaS model. Additionally, infrastructure bundles provide automatic installation on existing nodes and on-the-fly node creation for some cloud providers (Yandex Cloud, VK Cloud).

Yes

Supported.

Support for cloud providers
Yandex Cloud;
VK Cloud;
Sber Cloud;
Google Cloud Platform.
Google Cloud Platform;
AWS;
Azure.
Domestic OS support
Alt Linux
Yes
No
Astra Linux
Yes
No
Features
Offline installation
Yes
Yes
High availability
Yes

ADH supports high availability for key critical platform data services (YARN, HDFS, Hive).

Yes
Integration with other products
Yes

ADH supports a number of proprietary solutions for integration:

  • Spark Tarantool (Picodata) Connector;
  • Spark Arenadata DB Connector;
  • Spark Arenadata QuickMarts Connector.

ADH also provides:

  • Kerberos support for PXF;
  • Informatica DEI 10.4 support for ADH 2.X.
Yes
Security settings
SSL encryption
Yes

Via ADCM.

Yes
Standard access separation based on Role-Based Access Control
Yes

Flexible settings with Ranger in a separate ADPS product, which can serve multiple instances of ADH and other Arenadata products.

Yes
Single point of secure access
Yes

Knox as a part of ADPS.

Yes
Additionally
Technical support 24/7
Yes
Yes
On-demand fixes and improvements
Yes
Yes
Training/workshops
Yes

Full training on working with Arenadata products.

Not available for Russia
Community version
Yes

ADH is the only commercial Hadoop distribution with a freely downloadable Community version.

No
Documentation
Yes

Detailed documentation in Russian and English for all services, covering their installation, configuration, and operation.

Publicly available.

Yes

Publicly available.

Registration in the register of domestic software
Yes
No
Successful deployments
Yes

ADH has been used for hundreds of thousands of hours in more than 20 leading Russian companies as a central data platform, storing and processing up to 25 petabytes of data.

Yes
Release history with descriptions
Yes

A complete release history with service versions and descriptions of the upgraded functionality is publicly available.

Yes

A complete release history with service versions and descriptions of the upgraded functionality is publicly available.

Comparison of current service versions
Service

ADH 3.2.4.2

Cloudera 6.3.4

HDFS & YARN
3.2.4
3.0.0
Impala
4.2.0_arenadata1
3.2.0
Hive
3.1.3_arenadata6
2.1.1
HBase
2.4.17_arenadata1
2.1.4
Phoenix
5.1.3_arenadata2
5.0
Tez
0.10.1_arenadata1
0.9.2
Zeppelin
0.8.1
0.8.2
ZooKeeper
3.5.10
3.4.5
Sqoop
1.4.7_arenadata2
1.4.7
Airflow2
2.6.3
Solr
8.11.2
7.4.0
Spark2
2.3.2_arenadata2
2.4.0
Spark3
3.4.2_arenadata1
3.0.1
Knox
1.6.0
1.2.0
Ranger
2.4.0_arenadata1
2.1.0
Flink
1.17.1_arenadata1
Kyuubi
1.18.0_arenadata1
SSM
1.6.0_arenadata1
Hue
Currently in development
4.4.0

The “Product comparison” section is accurate as of 15.01.2024.

Releases
2023
ADH 3.2.4.2_b2
  • Removed the need to install Axiom JDK when using Astra Linux
  • Added the ability to set a custom value to the JAVA_HOME variable
  • Bug fixes
ADH 3.2.4.1_b3
  • Removed the need to install Axiom JDK when using Astra Linux
ADH 3.2.4.2_b1
  • Added a new service - Kyuubi
  • Added a new service - SSM
  • Upgraded Spark to 3.4.2
  • Added a new component for the Spark3 service - Spark Connect
  • Added Spark3 support for ADQM Spark connector
  • Added improvements related to the information security
ADH 3.2.4.1_b2
  • Patch release with bug fixes.
ADH 3.2.4.1_b1
  • Upgraded Hadoop to 3.2.4 and many other services
  • Added support for Astra Linux to ADH and ADPS
  • Added zstd support in HDFS
  • Excluded the vulnerability of the log4j library
  • Added the Spark3 Thrift Server component
  • Excluded Airflow1 from the bundle
ADH 3.1.2.1_b2
  • Patch release with bug fixes.
ADH 3.1.2.1
  • Added a new service - Apache Impala
  • HBase is upgraded to 2.2.7
  • Solr is upgraded to 8.11.2
  • Flink is upgraded to 1.16.2
  • Ranger is upgraded to 2.2.0
  • Introduced HA auto-management for ADH services
  • Hive is upgraded to 3.1.3_arenadata4 with some important fixes
  • Introduced the Maintenance mode, which allows removing any node from a cluster
ADH 2.1.10
  • Added the ability to select a TLS version for ADH services
  • Added support for custom Zeppelin interpreters
  • Spark version updated to 3.3.2
  • Added the new component Spark History Server for Spark3
  • Hive version updated to 3.1.3 with some important fixes
ADH 2.1.8
  • Airflow2: added the high availability mode
  • Airflow2: added LDAP authentication/authorization support
  • Airflow2: added support for external broker configuration
  • Hive version updated to 3.1.3 with some important fixes
ADH 2.1.7
  • Added the livy-spark3 component to the Spark3 service
  • Added the Apply configs from ADCM checkbox for all services
  • Flink build 1.15.1 is available
  • Added the ability to connect to Flink JobManager in the high availability mode
  • Optimized package checks during installation
ADH 2.1.6
  • Added support for Alt Linux 8.4
  • Added support for FreeIPA kerberization
  • Added support for customization of krb5.conf via ADCM
  • Added support for customization of ldap.conf via ADCM
ADH 2.1.4_b11
  • Added the ability to specify external nameservices
  • Added the ability to connect to HiveServer2 in the fault-tolerant mode
ADH 2.1.4_b10
  • The check box Rewrite current service SSL parameters is added for the Enable SSL action
  • Custom authentication (LDAP/AD) is enabled for Hive2Server
  • The Ranger plugin for Solr authorization is added
  • The ability to remove services from the cluster is added
  • The ability to customize configuration files via ADCM is added
  • The support of Kerberos REALM is added
ADH 2.1.4_b9
  • The Kerberos authentication is enabled for Web UI
  • The ability to configure SSL in the Hadoop clusters is added
ADH 2.1.4_b5
  • The ability to use Active Directory as Kerberos storage is implemented
  • The AD/LDAP/SIMPLE authorization is added for Zeppelin
ADH 2.1.4_b3
  • The MIT Kerberos integration is implemented in ADCM
  • The Ranger plugin is made operable on kerberized services
ADH 2.1.4_b2
  • Host actions are added
ADH 2.1.4_b1
  • The ability to use external PostgreSQL in Hive Metastore is added
  • Spark 3.1.1 is implemented for ADH 2.X
  • The offline installation is implemented for ADH
ADH 2.1.3
  • Implemented integration with Ranger 2.0.0
ADH 2.1.2.5
  • Client components for Flink are added
  • Client components for HDFS are added
  • Client components for YARN are added
ADH 2.1.2.3
  • The ADH bundle is divided into community and enterprise versions
  • The High Availability for NameNodes is implemented
ADH 2.1.2.2
  • The epel-release installation is disabled
  • Nginx is copied from the Epel repository to the ADH2 repository
ADH 2.1.2.1
  • Solr 8.2.0 is added for ADH 2.2
  • Sqoop is added into the ADH bundle
ADH 2.1.2.0
  • The ability to configure Hive ACID is added
  • Flink is added into the ADH bundle
  • GPU support is enabled for YARN
  • Airflow is added into the ADH bundle
ADH 2.1.1
  • YARN Scheduler configuration is implemented
  • HDFS mover is implemented
  • The cluster-wide Install button is added to the ADCM UI
ADH 2.1.0
Implemented service management for the following services:
  • Livy Server
  • Zeppelin
  • Spark Thrift Server
  • Spark Server
  • Phoenix Server
  • HBase Thrift
  • HBase Region Server
  • HBase Master
  • Node Manager
  • Resource Manager
  • Timeline Service
  • WebHCat
  • MySQL
  • Hive Metastore
  • Hive Server
  • DataNodes
  • Secondary NameNodes
  • NameNodes