Thursday, August 18, 2011

Apache Hadoop – Ordering Large Scale Diverse Data

Apache Hadoop is open source software for consolidating, combining and analyzing large-scale data. It is a software library that supports distributed processing of vast amounts of data (terabytes to petabytes) across large clusters of computers (thousands of nodes). It scales from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the software is designed to detect and handle failures at the application layer. It thereby delivers a highly available service on top of a cluster of computers, each of which may be prone to failure.

Relational database software excels at storing workloads consisting of structured data. Hadoop solves a different problem: fast, reliable analysis of structured data as well as unstructured, complex data. Hadoop is deployed alongside legacy IT systems to combine old data with new incoming data sets.

Hadoop consists of two parts: reliable data storage in the Hadoop Distributed File System (HDFS), and high-performance parallel data processing using a technique called MapReduce.

Hadoop runs on commodity servers. Servers can be added to or removed from a Hadoop cluster at will. A Hadoop server cluster is self-healing: it can run large-scale, high-performance processing jobs despite system changes.
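The self-healing idea can be illustrated with a small sketch. It assumes that each data block is kept on several servers and that, when a server leaves the cluster, missing copies are recreated on the remaining servers. The function name, the node names and the replication factor of 3 are illustrative assumptions, not Hadoop's actual code or API.

```python
def remove_node(placement, dead_node, live_nodes, replication=3):
    """Return a new block placement after dead_node leaves the cluster.

    placement:  dict mapping block id -> set of nodes holding a copy
    live_nodes: nodes that remain available after the removal
    """
    healed = {}
    for block, nodes in placement.items():
        # Copies on the removed server are lost.
        holders = set(nodes) - {dead_node}
        # Any remaining server without a copy can receive a new one.
        candidates = [n for n in sorted(live_nodes)
                      if n not in holders and n != dead_node]
        # Recreate copies until the block is fully replicated again.
        while len(holders) < replication and candidates:
            holders.add(candidates.pop(0))
        healed[block] = holders
    return healed


if __name__ == "__main__":
    placement = {"b1": {"n1", "n2", "n3"}, "b2": {"n2", "n3", "n4"}}
    print(remove_node(placement, "n2", ["n1", "n3", "n4", "n5"]))
```

In this toy version the copies are only bookkeeping entries; in a real cluster the data itself would be copied between servers.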

Dozens of open source firms participate in the upgrading and maintenance of Hadoop/MapReduce. Critical bug fixes and new features are added to a public repository, which is subject to rigorous tests to ensure software reliability. All major firms that offer cloud computing services already employ Hadoop/MapReduce.

A Map/Reduce job splits input data into independent chunks, which are processed as separate tasks in a completely parallel manner. The Map/Reduce software sorts the outputs of the individual “maps”, which run on separate servers, and feeds the sorted results into the reduce process. The software takes care of scheduling tasks, monitoring progress and re-executing any failed tasks.
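The flow described above (split, map, sort, reduce) can be sketched in a few lines of plain Python. This is a single-machine illustration of the technique, not Hadoop code: the word-count task, the function names and the thread pool standing in for separate servers are all assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_input(lines, chunk_size):
    """Split the input into independent chunks, one per map task."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_task(chunk):
    """Map: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle_and_sort(map_outputs):
    """Group the emitted values by key and return the groups in key order."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)
    return sorted(grouped.items())


def reduce_task(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines, chunk_size=2):
    chunks = split_input(lines, chunk_size)
    # Map tasks do not depend on each other, so they can run in parallel.
    with ThreadPoolExecutor() as pool:
        map_outputs = list(pool.map(map_task, chunks))
    return dict(reduce_task(k, v) for k, v in shuffle_and_sort(map_outputs))


if __name__ == "__main__":
    sample = ["big data big clusters", "data nodes", "big nodes"]
    print(run_job(sample))
```

Because each map task sees only its own chunk, adding more workers speeds up the map phase without changing the result.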

The compute nodes and the storage nodes are the same machines: the Map/Reduce framework and the Hadoop Distributed File System run on the same set of servers. This configuration allows Hadoop to schedule tasks on the nodes where the data is already present, resulting in high aggregate bandwidth across the cluster.
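A minimal sketch of this data-local scheduling follows. It assumes the scheduler knows which servers hold a copy of each block and how many free task slots each server has; the function and variable names are hypothetical and do not come from Hadoop.

```python
def schedule_task(block_id, block_locations, free_slots):
    """Pick a server for the task that reads block_id.

    block_locations: dict mapping block id -> list of nodes holding a copy
    free_slots:      dict mapping node -> number of free task slots
    Returns (node, is_local); node is None when the cluster is full.
    """
    # Prefer a server that already stores the data, so nothing is moved.
    for node in block_locations.get(block_id, []):
        if free_slots.get(node, 0) > 0:
            free_slots[node] -= 1
            return node, True
    # Otherwise use any free server; the data must then cross the network.
    for node, slots in free_slots.items():
        if slots > 0:
            free_slots[node] -= 1
            return node, False
    return None, False


if __name__ == "__main__":
    locations = {"b1": ["n1", "n2"], "b2": ["n1"]}
    slots = {"n1": 1, "n2": 1, "n3": 1}
    print(schedule_task("b1", locations, slots))
    print(schedule_task("b2", locations, slots))
```

The more tasks that land on a server holding their data, the less traffic the cluster network has to carry.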

The Map/Reduce framework consists of a single master JobTracker and a separate TaskTracker for each cluster node. The master is responsible for scheduling a job's component tasks on the individual servers, monitoring them and re-executing any failed tasks.
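The master's re-execution duty can be sketched as a simple retry loop. This is an illustration under stated assumptions, not the JobTracker's actual logic: the function name, the limit of three attempts and the `run_task` callback are all hypothetical.

```python
def run_with_retries(tasks, run_task, max_attempts=3):
    """Run every task, re-queueing any task that fails.

    tasks:    list of task ids
    run_task: callable(task_id, attempt) that returns a result or raises
    Returns (results, attempts) as dicts keyed by task id.
    """
    results = {}
    attempts = {}
    pending = list(tasks)
    while pending:
        task_id = pending.pop(0)
        attempts[task_id] = attempts.get(task_id, 0) + 1
        try:
            results[task_id] = run_task(task_id, attempts[task_id])
        except Exception:
            if attempts[task_id] >= max_attempts:
                raise RuntimeError(
                    f"task {task_id} failed {max_attempts} times")
            # Put the failed task back in the queue to be re-executed.
            pending.append(task_id)
    return results, attempts


if __name__ == "__main__":
    def flaky(task_id, attempt):
        # Simulate a server failure on the first attempt of task t2.
        if task_id == "t2" and attempt == 1:
            raise IOError("node lost")
        return task_id.upper()

    print(run_with_retries(["t1", "t2", "t3"], flaky))
```

The job as a whole succeeds even though one task failed once, which is the behavior the application never has to code for itself.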

Applications specify the input/output locations and supply the map and reduce functions that define how a job is processed, by implementing the interfaces the framework provides. These, and other job parameters, comprise the job configuration for each application.
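As a rough illustration of what such a job configuration holds, the sketch below bundles the input/output locations, the two functions and any extra parameters into one object. The class, its field names and the example paths are hypothetical; Hadoop's real configuration classes are written in Java and look different.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Tuple


@dataclass
class JobConfig:
    """Everything an application hands to the framework for one job."""
    input_path: str
    output_path: str
    mapper: Callable[[str], Iterable[Tuple[str, int]]]
    reducer: Callable[[str, Iterable[int]], int]
    params: Dict[str, str] = field(default_factory=dict)

    def validate(self):
        """Reject configurations that the framework could not run."""
        if not self.input_path or not self.output_path:
            raise ValueError("input and output locations are required")
        if self.input_path == self.output_path:
            raise ValueError("output location must differ from input")


if __name__ == "__main__":
    config = JobConfig(
        input_path="/data/in",    # hypothetical location
        output_path="/data/out",  # hypothetical location
        mapper=lambda line: [(word, 1) for word in line.split()],
        reducer=lambda key, values: sum(values),
        params={"job.name": "word-count"},
    )
    config.validate()
    print(config.params)
```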

The masses of data, such as those currently tracked at multiple DoD network control centers, cannot be analyzed by existing relational database software. In addition, accessing multiple web sites to extract answers to customized queries requires a new architecture for organizing how data is stored and then extracted.

The current DoD incoming traffic is too diverse. It shows high peak loads in real-time volume. The text, graphics and video content are unstructured. They do not fit the orderly arrangements for filing records into pre-defined formats. The bandwidth that is required for the processing of incoming messages, especially from cyber operations and from intelligence sources, calls for processing the data on a massively parallel computer in order to generate sub-second answers.

The conventional method for processing information, such as the existing multi-billion-dollar Enterprise Resource Planning (ERP) systems, relies on a single massive master database for support.

A new approach, pioneered by Google ten years ago, relies on Hadoop/MapReduce methods for searching through masses of transactions that far exceed the volume of transactions currently seen in support of conventional business data processing.

With the rapid expansion of wireless communication from a wide variety of personal devices, DoD messages subject to processing by means of massively parallel computers will exceed the conventional workload of legacy applications.

DoD is now confronted with the challenge of not only cutting the costs of IT, but also installing Hadoop/MapReduce software in the next few years. In this regard the current emphasis on reducing the number of data centers is misdirected. The goal for DoD is to start organizing its computing as a small number of massively parallel computer networks, with processing distributed to thousands of interconnected servers. Cutting the number of data centers without a collateral thrust for software architecture innovation may be a road that will only increase the obsolescence of DoD IT assets as Amazon, Baidu, Facebook, eBay, LinkedIn, Rackspace, Twitter and Yahoo forge ahead at an accelerating pace.

Meanwhile DoD is wrestling with how to fund the completion of projects started after FY01. DoD must start carving out a large share of its $36 billion+ IT budget to make sure that FY13-FY18 investments can catch up with the rapid progress now being made by commercial firms.

After all, DoD is still spending more money on IT than anyone else in the world!



