Hadoop and Big Data Analytics
Gartner defines Big Data as “high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making”.
Big data is data that, by virtue of its velocity, volume, or variety (the three Vs), cannot be easily stored or analyzed with traditional methods.
The term covers each and every piece of data your organization has stored till now. It includes all the data stored both on-premises or in the cloud. It could be papers, digital, structured and non-structured data within your company.
There is a deluge of structured and unstructured data, that is generated every second. This is known as Big Data, which can be analyzed to help customers turn that data into insights. AWS provides a broad platform of managed services, infrastructure and tools to tackle your next Big Data project. It enables you to collect, store, process, analyze and visualize Big Data on the cloud. It provides all the hardware, software, infrastructure to maintain and scale, so that you focus on building your application.
Some of the common Big Data Customer scenarios include Web & E-Tailing, Telecommunications, Government, Healthcare & Life Science, Bank & Financial Services and Retail, where Big Data is continuously generated.
How Big Data is consumed by Businesses
Businesses can gain a lot of insight into how their product is being consumed, by analyzing the huge Big Data generated. Big Data analytics is an area of rapidly growing diversity. Analyzing large data sets requires significant compute capacity that can vary in size based on the amount of input data and the analysis required. This characteristic of big data workloads is ideally suited to the pay-as-you-go cloud computing model, where applications can easily scale up and down based on demand.
Using Big Data analytics will give you a clear picture about how your data is being generated and consumed by the customers. It can be used for predictive marketing and plans to increase your business. It provides:
- Early key indicators, gives insights into business trends resulting in business fortunes.
- Analytics results in business advantage.
- Get more precise analysis and results with more data.
Limitations of using the traditional analytics methods:
The advancements in technologies has resulted in huge volume of data being generated every second. Storing, processing, analyzing and getting quality results is time consuming, costly and ineffective in the current scenario.
- Only a limited amount of high fidelity raw data is available for analysis.
- Storage is limited by the high volume of raw data that is continuously generated.
- Moving data for computation doesn’t scale accordingly.
- Data is archived regularly to conserve space. This limits the amount of data that is available for the analytical tools.
- The perception that traditional data warehousing processes are too slow and limited in scalability.
- The ability to converge data from multiple data sources, both structured and unstructured.
- The realization that time to information is critical to extract value from data sources that include mobile devices, RFID, the web and a growing list of automated sensory technologies.
As requirements change you can easily resize your environment (horizontally or vertically) on AWS to meet your Amazon Web Services.
In addition, there are at least four major developmental segments that underline the diversity to be found within Big Data analytics. These segments are MapReduce, scalable database, real-time stream processing and Big Data appliance.
Using Hadoop for Big Data Analytics
There is a big difference between Big Data and Hadoop. The former is an asset, often a complex and ambiguous one, while the latter is a program that accomplishes a set of goals and objectives for dealing with that asset.
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
Hadoop is a framework, which allows processing of large data sets. It completes the tasks in minutes, while the same done using the RDMS would take hours.
Hadoop has 2 main components:
- HDFS – Hadoop Distributed File System (for Storage)
- MapReduce (for Processing)
Hadoop Distributed File System works
The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. It consists of HDFS clusters, which each contain one or more data nodes. Incoming data is split into segments and distributed across data nodes to support parallel processing. Each segment is then replicated on multiple data nodes to enable processing to continue in the event of a node failure.
While HDFS protects against some types of failure, it is not entirely fault tolerant. A single NameNode located on a single server is required. If this server fails, the entire file system shuts down. A secondary NameNode periodically backs up the primary. The backup data is used to restart the primary but cannot be used to maintain operation.
HDFS is typically used in a Hadoop installation, yet other distributed file systems are also supported. The Amazon S3 file system can be used but does not maintain information on the location of data segments, reducing the ability of Hadoop to survive server or rack failures. Other file systems such as open source CloudStore and the MapR file system can be used to do maintain location information.
Distributed processing is handled by MapReduce
The idea behind MapReduce is that Hadoop can first map a large data set, and then perform a reduction on that content for specific results. A reduce function can be thought of as a kind of filter for raw data. The HDFS system then acts to distribute data across a network or migrate it as necessary.
The MapReduce feature consists of one JobTracker and multiple TaskTrackers. Client applications submit jobs to the JobTracker, which assigns each job to a TaskTracker node. When HDFS or another location-aware file system is in use, JobTracker takes advantage of knowing the location of each data segment. It attempts to assign processing to the same node on which the required data has been placed.
Apache Hadoop users typically build their own parallelized computing clusters from commodity servers, each with dedicated storage in the form of a small disk array or solid-state drive (SSD) for performance. These are commonly referred to as “shared-nothing” architectures.
Big Data is getting Big and more important
As more and more data are collected, the analysis of these data requires scalable, flexible, and high-performing tools to provide analysis and insight in a timely fashion. Big Data analytics is a growing field, with the need to parse large data sets from multiple sources, and to produce information in real-time or near-real-time gaining importance. IT organizations are exploring various analytics technologies to parse web-based data sources and extract value from the social networking boom.