Paul Strassmann’s blog: Apache Hadoop – Ordering Large Scale Diverse Data

Apache Hadoop is open source software for consolidating, combining and analyzing large-scale data. Apache Hadoop is a software library that supports distributed processing of vast amounts of data (in terabytes and petabytes) across huge clusters of computers (thousands of nodes). It scales up from single servers to thousands of machines, each offering server localized computation and storage. Rather than rely on hardware to deliver high-availability, the software is designed to detect and handle failures at the application layer. It delivers a service for computer clusters, each of which may be prone to failures.

Relational data base software excels in storing workloads consisting of structured data. Hadoop solves a different problem which is fast, reliable analysis of structured data as well as unordered complex data. Hadoop is deployed along legacy IT systems to combine old data with new incoming data sets.

Hadoop consists of reliable data storage using the Hadoop Distributed File System (HDFS). It uses high-performance parallel data processing using a technique called MapReduce.

Hadoop runs on commodity servers. Servers can be added or removed from a Hadoop cluster at will. A Hadoop server cluster is self-healing. It can run large-scale, high-performance processing jobs despite of system changes.

Dozens of open source firms participate in the upgrading and maintenance of Hadoop/MapReduce. Critical bug fixes and new features are added to a public repository, which is subject to rigorous tests to ensure software reliability. All major firms that offer cloud computing services already employ Hadoop/MapReduce. *

A Map/Reduce job splits input data into independent chunks, which are processed as separate tasks in a completely parallel manner. The Map/Reduce software sorts the outputs of the individual “maps” on separate servers, which are then fed into the reduce process. The software takes care of scheduling tasks, monitoring progress and re-executing any failed tasks.

The compute nodes and the storage nodes are identical. The Map/Reduce framework and the Hadoop Distributed File System run on the same set of servers. This configuration allows Hadoop to schedule tasks on the nodes where data is already present, resulting in high bandwidth across each cluster.

The Map/Reduce framework consists of a single master JobTracker and of separates TaskTrackers for each cluster-node. The master is responsible for scheduling the jobs' component tasks on the individual servers, monitoring them and re-executing any failed tasks.

Applications specify the input/output locations and supply the map of how a job is processed. This reduces processing overhead via implementations of all connecting interfaces. These, and other job parameters, the comprise configuration management for each application.

SUMMARY
The masses of data, such as is currently tracked at multiple DoD network control centers, cannot be analyzed by existing relational database software. In addition, access to multiple web sites to extract answers to customized queries requires a new architecture for organizing how data is stored and then extracted.

The current DoD incoming traffic is too diverse. It shows high real time volume peak loads. The text, graphics and video content are unstructured. They do not fit the orderly arrangements for filing of records into pre-defined formats. The bandwidth that is required for the processing of incoming messages, especially from cyber operations and from intelligence sources, calls for the processing of data in a massively parallel computer in order to generate sub-second answers.

The conventional method for processing information, such as the existing multi-billion Enterprise Resource Planning (ERP) systems, rely on a single massive master database for support.

A new approach, pioneered by Google ten years ago, relies on Hardoop/Map Reduce methods for searching through masses of transactions that far exceed the volume of transactions currently seen in the support conventional business data processing.
With the rapid expansion of wireless communication from a wide variety of personal devices, DoD messages subject to processing by means of massive parallel computers will be exceeding the conventional workload of legacy applications.

DoD is now confronted with the challenge of not only cutting the costs of IT, but also with the task of installing Hardoop/Map Reduce software in the next few years. In this regard the current emphasis on the reduction in the number of data centers is misdirected. The goal for DoD is to start organizing the computing as a small number of massive parallel computer networks, with processing distributed to thousands of interconnected servers. Cutting the number of data centers without a collateral thrust for software architecture innovation may be a road that will only increase the obsolescence of DoD IT assets as Amazon, Baidu, Facebook, EBay, LinkedIn, Rackspace, Twitter and Yahoo forge ahead at an accelerating pace.

Meanwhile DoD is wrestling how to afford funding the completion of projects started after FY01. DoD must start carving out a large share of its $36 billion+ IT budget to make sure that FY13-FY18 investments can catch up with rapid progress now made by commercial firms.

After all, DoD is still spending more money on IT than any one else in the world!

* http://wiki.apache.org/hadoop/PoweredBy

Paul Strassmann’s blog

Search This Blog

Apache Hadoop – Ordering Large Scale Diverse Data

No comments:

Post a Comment