Big data analytics beyond hadoop pdf

Saturday, April 20, 2019 admin Comments(0)

Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm, Spark, and More Hadoop Alternatives (FT Press Analytics) By Vijay Sr. Click link below. When most technical professionals think of Big Data analytics today, they think of Hadoop. But there are many cutting-edge applications that Hadoop isn't well. 1. IntroductionGoogle's seminal paper on Map-Reduce [1] was the trigger that led to lot of developments in the big data space. Though the.

Language: English, Spanish, Indonesian
Country: Liechtenstein
Genre: Art
Pages: 626
Published (Last): 30.08.2016
ISBN: 714-2-74759-871-7
ePub File Size: 24.44 MB
PDF File Size: 20.51 MB
Distribution: Free* [*Regsitration Required]
Downloads: 33680
Uploaded by: FELTON

Chapter 6 Conclusions: Big Data Analytics Beyond. Hadoop Map-Reduce. Big Data is that while Apache Hadoop is quite useful, and most certainly quite successful .. org/event/usenix99/invited_talks/ The term was also used. Big Data Analytics Beyond Hadoop Real-Time Applications with Storm, Spark, and More Hadoop - Ebook download as PDF File .pdf), Text File .txt). Big Data Analytics Beyond Hadoop: Real-Time Applications With Storm, Spark, And More Hadoop. Alternatives (FT Press Analytics). Offer us 5 mins and also we .

I have also used Mahout in a production system for realizing recommendation algorithms in financial domain and found it to be scalable. Recent efforts by traditional vendors such as SAS in-memory analytics also fall into this category. In particular. The motivation for Mesos Hindman et al. Social Media Data: Husbands, K.

GraphX is the other system with specific focus on graph construction and transformations. While Pregel is good at graph parallel abstraction, easy to reason with and ensures deterministic computation, it leaves it to the user to architect the movement of data. Further, like all BSP systems, it also suffers from the curse of the slow jobs — meaning that even a single slow job which could be due to load fluctuations or other reasons can slow down the whole computation.

GraphLab, as well as its subsequent version known as Powergraph, constitute the state of the art in graph processing and are especially well suited to power law graphs. Another interesting effort in this space is the Graph Search work from Facebook, who are building search over what they term as an entity graph. Another interesting effort in the third generation paradigms comes from Twitter, who built the Storm framework for real-time complex event processing.

We have built several machine learning algorithms over Storm in order to perform real-time analytics. Interesting alternatives to Storm include the Akka project, which is based on the Actors model of concurrent programming. The S4 system from Yahoo also falls in the category. The Dempsy system from Nokia is also comparable to Storm. However, Akka is promising due to its ability to deliver stream composition, failure recovery and two-way communication.

We have made detailed comparisons on a custom cluster at Impetus and have found that the while the logistic regression algorithm over Hadoop from Mahout took nearly seconds to run for a million records, Spark was able to run the logistic regression algorithm for a million records on the same cluster in only 30 seconds.

This is in line with the performance studies reported in the AmpLab site here: There are several similar real-life use cases for Spark as well as Storm and GraphLab, some of the third generation paradigms we have focused on. The same queries took nearly seconds on Cassandra natively took only seconds on a warmed cache , but less than 1 second using Spark! Quantifind is another start-up that uses Spark in production.

They use Spark to allow video companies to predict success of new releases — they have been able to move from running ML in hours over Hadoop to running it in seconds by using Spark. Conviva, a start-up uses Spark to run repeated queries on video data and found that it was nearly 30 times faster than Hive. Yahoo is also using Spark to build algorithmic learning for advertisement targeting for a new paradigm they name as continuous computing. There are also several use cases for Storm listed in the Storm page.

Umbrella security labs provides a case for using GraphLab to develop and test ML models quickly over complete data sets. They have implemented a page rank algorithm over a large graph and ran it using GraphLab in about 2 minutes. This tutorial has set the tone for big data analytics thinking beyond Hadoop by discussing the limitations of Hadoop. It has also brought out the three dimensions along which thinking beyond Hadoop is necessary:.

Simplified Data Processing on Large Clusters. Communications of the ACM, 51 1 , Resilient distributed datasets: Adapting scientific computing problems to clouds using MapReduce. Future Generation Computer Systems , Distributed GraphLab: Over a million developers have joined DZone.

Let's be friends: Big Data Analytics Beyond Hadoop. DZone 's Guide to. Free Resource. Like 0. Join the DZone community and get the full member experience. Join For Free. Beyond Hadoop What then are the alternatives to realize some of the complex iterative algorithms that may not be amenable to be implemented efficiently over Hadoop? Concluding Remarks This tutorial has set the tone for big data analytics thinking beyond Hadoop by discussing the limitations of Hadoop.

It would be an interesting area of further research to estimate the To summarize. Quadrature approaches that are sufficient for low-dimensional integrals might be realizable on Hadoop.

They arise in Bayesian inference as well as in random effects models. They occur in various domains—image de-duplication. MCMC is iterative in nature because the chain must converge to a stationary distribution. The catalog cross-matching problem can be posed as a generalized N-body problem. But the other forms might be hard to realize over Hadoop—when either dynamic programming is used or Hidden Markov Models HMMs are used. Spark could be seen as the next-generation data processing alternative to Hadoop in the big data space.

Alignment problems: The alignment problems are those that involve matching between data objects or sets of objects. The simpler approaches in which the alignment problem can be posed as a linear algebra problem can be realized over Hadoop. Storm has matured faster and has more production use cases than the others. By processing. The emerging focus of big data analytics is to make traditional techniques.

I will also discuss the alternatives to Spark in this space—systems such as HaLoop and Twister. I mean running page ranking or other ML algorithms on the graph. My perspective is that Hadoop is just one such paradigm. Storm from Twitter has emerged as an interesting alternative in this space. The other emerging approach for analytics focuses on new algorithms or techniques from ML and data mining for solving complex analytical problems.

The use of GraphLab for any of the other giants is an interesting topic of further research. It can be inferred that Hadoop is basically a batch processing system and is not well suited for real-time computations.

When using the term big data analytics. The key idea distinguishing Spark is its in-memory computation. GraphBuilder can be used for constructing the graph. Initial performance studies have shown that Spark can be times faster than Hadoop for certain applications. I refer to the capability to ask questions on large data sets and answer them appropriately.

GraphLab is focused on giant 4. I will discuss Storm in more detail in the later chapters of this book—though I will also attempt a comparison with the other alternatives for real-time analytics. The third dimension where beyond-Hadoop thinking is required is when there are specific complex data structures that need specialized processing—a graph is one such example.

There have been some efforts to use Hadoop for graph processing. They need to perform operations on the graphs. This is reflected in the approach of SAS and other traditional vendors to build Hadoop connectors.

GraphLab Low et al. The other dimension for which the beyond-Hadoop thinking is required is for real-time analytics. The traditional ML tools for ML and statistical analysis. This should help put in perspective some of the key aspects of the book and establish the beyond-Hadoop thinking along the three dimensions of real-time analytics.

The third-generation tools such as Spark. Efforts to scale traditional tools over Hadoop. These allow what I call a shallow analysis of big data. These facilitate deeper analysis of big data. In other words. Recent efforts by traditional vendors such as SAS in-memory analytics also fall into this category. These allow deep analysis on smaller data sets—data sets that can fit the memory of the node on which the tool runs.

The following discussion should help clarify the big data analytics spectrum. The first-generation tool vendors are addressing those limitations by building Hadoop connectors as well as providing clustering options—meaning that the vendors have made efforts to reengineer the tools such as R and SAS to scale horizontally.

Second-generation ML tools such as Mahout. First of all. Big Data Analytics: Evolution of Machine Learning Realizations I will explain the different paradigms available for implementing ML algorithms. The other second-generation tools are the traditional tools that have been scaled to work over Hadoop. Mahout has a set of algorithms for clustering and classification.

In essence. I have also used Mahout in a production system for realizing recommendation algorithms in financial domain and found it to be scalable. These include the linear regression. Mahout can thus be said to work on big data. One observation about Mahout is that it implements only a smaller subset of ML algorithms over Hadoop—only 25 algorithms are of production quality. The SAS in-memory analytics. This has been pointed out by several others as well—for instance.

In contrast. The choices in this space include the work done by Revolution Analytics.

I had to tweak the source significantly. I will explain these three primitives: It does provide a fast sequential implementation of the logistic regression.

Twister distinguishes between static and variable data. These tools are maturing fast and are open source especially Mahout.

What do I mean by iterative? A set of entities that perform a certain computation. This means 1 MR per primitive. The efforts in the third generation have been to look beyond Hadoop for analytics along different dimensions.

Tachyon provides a distributed file system abstraction and provides interfaces for file operations across the cluster. The other interesting work is by a small start-up known as Concurrent Systems. Zacharia et al. The HaLoop work Bu et al. So they proposed a new paradigm for cluster computing that can provide similar guarantees or fault tolerance FT as MR but would also be suitable for iterative and interactive applications.

I discuss the approaches along the three dimensions. The second component is the Tachyon file system built on top of Mesos. The main motivation for Spark was that the commonly used MR paradigm. The Berkeley researchers have proposed BDAS as a collection of technologies that help in running data analytics tasks across a cluster of nodes.

The runtime over Hadoop would allow these model files to be scaled over a Hadoop cluster. Storm is also compared to the other contenders in real-time computing.

Analytics big beyond pdf data hadoop

Real-time Analytics The second dimension for beyond-Hadoop thinking comes from real-time analytics. But Graph Processing Dimension The other important tool that has looked beyond Hadoop MR comes from Google—the Pregel framework for realizing graph computations Malewicz et al. The details of this architecture. Readers familiar with BSP can see why Pregel is a perfect example of BSP—a set of entities computing user-defined functions in parallel with global synchronization and able to exchange messages.

There is also the global barrier—which moves forward after all compute functions are terminated. HDFS spout. Each vertex in the graph is associated with a user-defined compute function. Storm is a scalable Complex Event Processing CEP engine that enables complex computations on event streams in real time. Computations in Pregel comprise a series of iterations.

Twister http: This is necessary because the Storm cluster is heavy in processing due to the ML involved. Kafka spout. The vertices can send messages through the edges and exchange values with other vertices. ML algorithms on the streams typically run here.

Twitter from Storm has emerged as the best contender in this space.

In this section:

Apache Hama Seo et al. They run the computations on the streams. Pregel ensures at each superstep that the user-defined compute function is invoked in parallel on each edge.

This is an application-specific wiring together of spouts and bolts—topology gets executed on a cluster of nodes. It might be that they do not want to be seen as being different from the Hadoop community.

The Kafka cluster stores up the events in the queue. Performance evaluations on the Twitter graph for page-ranking and triangle counting problems have verified the efficiency of GraphLab compared to other approaches. The other projects that are inspired by Pregel are Apache Giraph. GraphLab Gonzalez et al. GraphLab provides useful abstractions for processing graphs across a cluster of nodes deterministically.

Table 1. The other point to be noted is that most of the graph processing paradigms are not fault-tolerant. Golden Orb. The focus of this book is mainly on Giraph. It can be inferred that although the traditional tools have worked on only a single node and might not scale horizontally and might also have single points of failure. Closing Remarks This chapter has set the tone for the book by discussing the limitations of Hadoop along the lines of the seven giants.

It has also brought out the three dimensions along which thinking beyond Hadoop is necessary: Real-time analytics: Storm and Spark streaming are the choices. Analytics involving iterative ML: Spark is the technology of choice. Specialized data structures and processing requirements for these: GraphLab is an important paradigm to process large graphs. These are elaborated in the subsequent chapters of this book.

Happy reading! Andrieu, Christopher, N. Doucet, and M. Asanovic, K. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. A View from Berkeley. Available at www.

Last accessed September 11, Dean, Jeffrey, and Sanjay Ghemawat. Simplified Data Processing on Large Clusters. A Runtime for Iterative MapReduce. June , Chicago, Illinois. Available at http: Gonzalez, Joseph E. Konstan, Joseph A. Laney, Douglas.

Controlling Data Volume, Velocity, and Variety. Retrieved February 6, Luhn, H. Malewicz, Grzegorz, Matthew H. Austern, Aart J. Bik, James C. A System for Large-scale Graph Processing. The National Academies Press. Perry, Tekla S. Seo, Sangwon, Edward J. Michael J.. New York. Scott Shenker. Pelle Jakovits. Alvin AuYoung. Mosharaf Chowdhury. Satish Narayana. Michael J. Reynold S. Indrajit Roy. Joseph E.

Cluster Computing with Working Sets. One would have to create fresh MR jobs for every iteration in a lot of these classes of computations. The next iteration would need data to be initialized. It then goes on to discuss the design and architecture of BDAS. I have explained that Hadoop is well suited for giant 1 simple statistics as well as simpler problems in other giants.

The data flow diagram for iterative computing in Figure 2. Spark is the fulcrum of the BDAS framework. This implies that every value in Scala is an object and every operation a method call similar to object-oriented languages such as By its very nature.

Motivation One of the main motivations for proposing Spark was to allow distributed programming of Scala collections or sequences in a seamless fashion. These APIs facilitate programming at a much higher level of abstraction compared to traditional approaches. Figure 2. Hadoop is not well suited for these. The last kind of scenario is real-time computations.

Interactive querying is one such scenario. Scala is statically typed language that fuses object-oriented programming with functional programming. Hadoop is a batch-oriented system—implying that for every query. Combining the capability to handle batch. These sequences all sequences in Scala inherit from the scala. Spark provides a distributed shared object space that enables the previously enumerated Scala sequence operations over a distributed system Zaharia et al.

Greenplum database 4. But MR has sophisticated mechanisms for handling failures during the computation which is a direct consequence of the intermediate file creation. Smalltalk or Java. In addition. This implies that the amount of work to be redone can be significant. Efficiency can be viewed as comprising two parts: Both MR and parallel databases use replication to handle FT.

The comparison between MR and parallel database systems can be made along three axes: MR might not require a predefined schema. Map and filter are commonly used functions in Scala sequences—they apply map and filter operations to the elements of the sequence uniformly. With respect to execution strategy. With respect to indexing. Seq class and define a common set of interfaces for abstracting common operations.

Motivation The other dimension for large-scale analytics is interactive queries. Common sequences defined in the Scala library include arrays. These types of queries occur often in a big data environment.

Gamma DeWitt et al. MR creates intermediate files and transfers these from mappers to the reducers explicitly with a pull approach. Parallel databases distribute the data relational tables into a set of shared-nothing clusters and split queries into multiple nodes for efficient processing by using an optimizer that translates Structured Query Language SQL commands into a query plan.

The parallel databases do not persist intermediate results to disk. In case of complex queries involving joins. There are two broad approaches to solving interactive queries on massive data sets: The coarse-grained recovery point is true even in the case of some of the new low-latency engines proposed for querying large data sets such as Cloudera Impala.

Consider a traditional data warehousing environment. The reasoning behind this assertion is that Hive or Hbase. The third dimension is that the resources of the cluster must be utilized efficiently. In this case.

Big Data Analytics Beyond Hadoop

Motivation Most frameworks such as Spark. Hadoop can augment this environment in various ways. Although existing cluster managers. It could also help in certain preprocessing typically known as data wrangling. Google Dremel. The parallel database systems are good for optimized queries on a cluster of shared-nothing nodes. The motivation for Mesos Hindman et al. The frameworks need to run a set of processes in parallel on different nodes of the cluster.

Hadoop MR is not well suited for interactive queries. They also need to handle failures of these processes. Apache Drill. On top of the preceding limitations. The amount of work required to be redone is not significant in typical failure scenarios for the MR paradigm. This has been documented among others by Pavlo and others Pavlo et al. This might require monitoring the cluster resources and getting information about them quickly enough.

Hadoop beyond big pdf data analytics

There are other use cases too that motivate the need for multiple frameworks to coexist in the same physical cluster. Hadoop is not ideal for such use cases. As I have explained before. In the same environment. Hadoop Yarn. The frameworks in common use for resource management include Mesos. The lowest layer Layer1 deals with the resources available. These are ideal use cases for Mesos. The initial version of YARN addresses memory scheduling only.

It is targeted more on cloud-aware applications and has evolved into a multicloud resource manager Duplyakin et al. Layer2 is the data management layer. In the former case. It gives each framework access to the whole cluster without locking. Spark can also work with Tachyon. As can be inferred from Figure 2. For Spark is responsible for scheduling within its own containers. The other difference is that Mesos uses container groups to schedule frameworks.

Omega is a shared-state scheduler from Google Schwarzkopf et al. As a result. Hadoop MR also sits in the same layer. Nimbus has been used among others by the Storm project to manage the cluster resources. Shark is built over Spark and provides an SQL interface to applications. Spark is the key framework of BDAS in this layer because it is the in-memory cluster computing paradigm. This approach tries to combine the advantages of monolithic schedulers have complete control of cluster resources and not be limited to containers and two-level schedulers have a framework-specific scheduling policy.

The other interesting frameworks in this layer include SparkGraph. The other frameworks in this layer include MPI and Storm. These are the main applications: Monolithic schedulers use a single scheduling algorithm for all jobs. Layer3 is the data processing layer. Conviva has built a Spark-based optimization platform that enables end users to choose an optimal CDN for each user based on aggregate runtime statistics such as network load and buffering ratio for various CDNs.

Paradigm for Efficient Data Processing on a Cluster The data flow in Spark for iterative ML algorithms can be understood from an inspection of the illustration in Figure 2. The Spark production cluster supported 20 LPs. Shark is used for on-the-fly ad hoc queries due to its capability to answer ad hoc queries at low latency. Conviva is an end-user video content customizing company.

The optimization algorithm is based on the linear programming LP -based approach. The key in the Ooyala system is its capability to have a via media for the two extremes in video content querying: Yahoo has built a Shark Software-as-a-Service SaaS application and used it to power a pilot for advertisement data analytics.

The Shark pilot predicted which users and user segments are likely to be interested in specific ad campaigns and identified the right metrics for user engagement. It enables the end user to switch CDNs at runtime based on load and traffic patterns. Yahoo has extensively experimented with Spark and Shark.

Ooyala is another online video content provider. The other way of creating an RDD is to parallelize a Scala collection. The following sections explain more details of Spark internals—its design. The interesting property of RDDs is the capability to store its lineage or the series of transformations required for creating it.

After the termination condition check determines that the iterations should end. This implies that a Spark program can only make a reference to an RDD—which will have its lineage as to how it was created and what operations have been performed on it. The important collection in Spark is the RDD.

The operations such as map. Compared to these operations. The operations for creating RDDs are known as transformations in Spark. The lineage graph stores both transformations and actions on RDDs. For instance. A set of transformations and actions is summarized in Table 2. Various actions can be specified on RDDs.

The persistence as well as partitioning aspects of RDDs can be specified by the programmer. They include operations such as count. The following is the Spark code snippet one might end up writing for this: Table 2. The CDR structure is call id. Narrow dependencies arise in the case of map. The mapping is many-to-one from each parent partition to child partitions. When an action is invoked on an RDD. The RDD interface comprises five pieces of information as detailed in Table 2.

Spark Implementation Spark is implemented in about Spark can run over Mesos. Wide dependencies arise. Recovery is also faster with narrow dependencies. Wide dependencies can lead to inefficient pipelining and might require Hadoop MR shuffle-like transfers across the network. The types of dependencies influence the kind of pipelining that is possible on each cluster node.

The child partition RDDs use only the RDDs of the parent partitions one-to-one mapping from partitions in parent to partitions in child.


Spark runs the unmodified Scala interpreter. The BM is the component on each node responsible for serving cached RDDs and to receive shuffle data. The BM communicates with the CM to fetch remote blocks. The CM is an asynchronous networking library. The executor interacts with other components such as the Block Manager BM. This simplifies fault recovery. Each stage has only narrow dependencies.

It can also be viewed as a write-once key value store in each worker. The DS submits the stages of task objects to the task scheduler TS. For wide dependencies. TSM is an entity maintained by the TS—one per task set to track the task execution. The DS is also responsible for resubmission of stages whose outputs are lost. A task object is a self-contained entity that comprises code and transformations.

The TS maps tasks to nodes based on a scheduling algorithm known as delay scheduling Zaharia et al. The Worker component of Spark is responsible for receiving the task objects and invoking the run method on them in a thread pool. When map outputs are lost. Tasks are shipped to nodes—preferred locations. The MOT is the component responsible for keeping track of where each map task ran and communicates this information to the reducers—the workers cache this information.

Spark only needs to store the lineage graph for FT. RDDs that have only narrow dependencies are not good candidates for check-pointing. Recovery would require the operations to be reconstructed on the lost partitions of RDDs—but this can be done in parallel for efficiency. Straggler mitigation or backup tasks in a DSM are hard to realize due to the fact that both of the tasks might contend for the memory. The lineage graph has sufficient information to reconstruct lost partitions of the RDD.

On disk: This provides the slowest performance. The flip side of Spark is that the coarse-grained nature of RDD RDDs with wide dependencies can use check-pointing. As serialized Java objects in memory: This provides more memory-efficient representation. The other fundamental difference between Spark and DSM systems is the straggler mitigation strategy available in Spark due to the read-only nature of RDDs—this allows backup tasks to be executed in parallel.

This provides better performance because objects are in JVM memory itself. Although this restricts the kind of applications that can use Spark. It can be noted that the globals are a special way of mimicking DSM functionality within Spark. Readers familiar with BSP can realize why Pregel is a perfect example of BSP—a set of entities computing in parallel with global synchronization and able to exchange messages.

The bulk operators correspond to transformations in Spark. Accumulators are variables that can only be added by the workers and read by the driver program—parallel aggregates can be realized fault-tolerantly.

Since the same user function is applied to all vertices. The AMPLabs team themselves have built the entire Pregel as a small library over Spark in just a few hundred lines of code. Computations in Pregel Malewicz et al. Programmers can pass functions or closures to invoke the map. Piccolo Power and Li The list of cluster computing models expressible through RDDs and their operations are given here: Broadcast variables are used by the programmer to copy read-only data once to all the workers.

But it turns out that the expressibility of RDDs is good enough for a number of applications. The simpler case can be expressed as flatMap and groupByKey operations. Nectar is a software system targeted at data center management that treats data and computations as first-class entities functions in DryadLINQ [Yu et al.

The other perspective for in-memory computation is the fact that ML algorithms require to iterate over a working set of data and can HaLoop is Hadoop modified with a loop-aware TS and certain caching schemes. The main difference between Nectar and Spark is that Nectar might not allow the user to specify data partitioning and might not allow the user to specify which data to be persisted. This was observed in a study presented by Ananthanarayanan et al. The caching is for both loop invariant data that are cached at the mappers and reducer outputs that are cached to enable termination conditions to be checked efficiently.

Both Twister and HaLoop are interesting efforts that extend the MR paradigm for iterative computations.

This can be understood from two perspectives. Twister Ekanayake et al. SQL Interface over a Distributed System In-memory computation has become an important paradigm for massive data analytics. They are. HaLoop not only provides a programming abstraction for expressing iterative applications. Spark allows both and is hence more powerful.

This enables data to be derived by running appropriate computations in certain cases and to avoid recomputations for frequently used data. It has a much more powerful set of constructs. Systems Similar to Spark Nectar Gunda et al. These are simple to express in Spark because it lends itself very easily to iterative computations.

Twister provides publish-subscribe infrastructure to realize a broadcast construct. From one perspective. By joining this with the vertices RDD. HaLoop Bu et al. The recovery is fine-grained. The approach for logical plan generation in Shark is also similar to that of Hive.

Query parsing 2. Spark materializes map output to memory before a shuffle stage—reduce tasks later use this output through the MOT component. Shark modifies this first by collecting statistics specific to a partition as well as globally. It must be noted that Hive and Shark can be used often to query such data.

Logical plan generation 3. This generates an abstract syntax tree. The statistics collected can include partition sizes and record counts for skew detection. The physical plan generation is when both approaches are quite different.

Another Shark modification enables the DAG to be changed at runtime based on the statistics collected. Shark provides an SQL interface over Spark. Whereas a physical plan in Hive might be a series of MR jobs. Shark can be viewed as an in-memory distributed SQL system essentially. Partial DAG Execution This is a technique to create query execution plans at runtime based on statistics collected. Mapping the logical plan to a physical execution plan Shark uses the Hive query compiler for query parsing.

It must be noted that the Shark builds on top of query optimization approaches that work on a single node with the concept of the PDE. The coarse-grained nature of RDDs works well even for queries. This is true for data that are new not loaded into Shark before.

This is what I am referring to here. Shark has been shown to be faster than Hadoop for loading data into memory and provides the same throughput as Hadoop loading data into HDFS.

Distributed Data Loading Shark uses Spark executors for data loading. It provides two kinds of joins: This drastically reduces the number of objects in memory and improves GC and performance of Shark consequently. The statistics collection and subsequent DAG modification is useful in implementing distributed join operations in Shark. One kind of collection. The other way statistics are used for optimization in Shark is in determining the number of reducers or the degree of parallelism by examining partition sizes and fusing together smaller partitions.

This task makes an independent decision on compression whether this column needs to be compressed and. Broadcast join is realized by sending a small table to all nodes. But the disadvantage of this scheme is that it ends up creating a huge number of objects in the JVM memory. Shark has realized a columnar store that creates single objects out of entire columns of primitive types.

The remaining objects can be collected. In particular. Readers should keep in mind that as the number of objects in the Java heap increases. The broadcast join works efficiently only when one table is small—now the reader can see why such statistics are useful in dynamic query optimization in Shark.

The other kind is the stop-the-world major collection. Full Partition-Wise Joins Both tables are hash-partitioned on the join key in the case of a shuffle join. The resultant compression metadata is stored for each partition.

This has the advantage that they are natively available to JVM for faster access.

Related Post: BIG BOOK OF BUDS 2 PDF

It also results in improved space utilization compared to the naive approach of Spark. Shark creates Spark map tasks and avoids the expensive Shuffle operation to achieve higher efficiency. Although Hadoop does not allow such co-partitioning. Partition Pruning As known in the traditional database literature. It might. The Hadoop YARN scheduler is a monolithic scheduler and might allow several frameworks to run in the cluster. MPI employs a gang scheduling algorithm. Virtualization has been known to be a performance bottleneck.

As known in the traditional database literature. Yet another option is to allocate a set of virtual machines VMs for each framework.

This is achieved as Shark allows the RDDs representing the query plan to be returned in addition to the query results. Shark augments the statistics collected in the data loading process with range values and distinct values for enum types. Running both over the same cluster can result in conflicting requirements and allocations.

The other option is to physically partition the cluster into multiple smaller clusters and run the individual frameworks on smaller clusters. This implies that the user can initiate operations on this RDD—this is fundamental in that it makes the power of Spark RDDs available to Shark queries. This is where Mesos fits in—it allows the user to manage cluster resources across diverse frameworks sharing the cluster. While joining two co-partitioned tables.

This cannot be achieved with any existing schedulers. Mesos makes certain resource offers in the form of containers to the respective frameworks. Mesos offers frameworks the capability to set filters. Mesos Components The key components of Mesos are the master and slave daemons that run respectively on the Mesos master and Mesos slave. Samza has been recently open sourced from LinkedIn—Mesos allows these new frameworks to be experimentally deployed in an existing cluster coexisting with other frameworks.

It must be noted that frameworks never specify the required resources and have the option of rejecting requests that do not satisfy their requirements. The master sends the tasks along with resource requirements to the slave daemon.

To improve the efficiency of this process. This might be a less efficient way of utilizing the cluster resources compared to monolithic schedulers such as Hadoop YARN.

Each slave also hosts frameworks or parts of frameworks. In practice. But it allows flexibility—for instance. The framework scheduler accepts the request or can reject it. At the second level. Mesos is a two-level scheduler.

The remaining resources in the cluster are free to be allocated to other frameworks. At the first level. The slave daemons publish the list of available resources as an offer to the master daemon.

The master invokes the allocation module. The master then makes a resource offer to the framework scheduler. This ensures the resources are locked and available for this framework once the framework accepts the offer.

The Resource Manager RM also has the capability to rescind the offer. The framework might take some time to respond to the offer. If the framework is within its guaranteed resource allocation. The This is useful in situations in which Mesos must kill user tasks. The traditional hypervisor-based virtualization techniques. OS Level virtualization creates a partition of the physical machine resources using the concept of isolated user-space instances. This can be inefficient. It must be noted that min-max fairness is a common algorithm.

The interesting properties of the DRF algorithm are given in the following list: The Linux containers approach is an instance of an approach known as OS Level virtualization. Resource Allocation The allocation module is pluggable. The DRF is a generalization of the min-max fairness algorithm for heterogeneous resources. The DRF algorithm ensures that min-max can be applied across dominant resources of users. Xen Barham et al.

Hadoop beyond big analytics pdf data

Mesos can kill its processes. Frameworks can also read their guaranteed resource allocation through API calls. Fair schedulers such as those in Hadoop https: Isolation Mesos provides isolation by using Linux containers http: The state of the master comprises only the three pieces—namely.

Mesos also allows frameworks to register multiple schedulers and can connect to a slave scheduler in case Detailed performance evaluation studies in Xavier et al. Mesos uses LXC. Mesos also reports framework executors and tasks to the respective framework.

A specific test known as the fork bomb test which forks child processes repeatedly shows that the LXC cannot limit the number of child processes created currently. References The traditional parallel programming tools such as MPI or the new paradigms such as Spark are well-suited for such optimization problems and efficient realizations at scale. This is especially evident in the kinds of queries at Ooyala on video data.

Several other researchers have also observed that Hadoop is not good for iterative ML algorithms. Mesos was built as a resource manager that can manage resources of a single cluster running multiple frameworks such as Hadoop. Shark is very useful for a slightly different set of use cases: Starting fresh MR jobs for every iteration.

It must be noted that optimization problems with a large number of constraints and variables are notoriously difficult to solve in a Hadoop MR environment. You should. The only work that addresses one kind of optimization problem and its efficient realization over Hadoop is from the Vowpal Wabbit group. The primary reason is the lack of long-lived MR and the lack of in-memory programming support.

Stochastic approaches are more Hadoopable. This is useful in data warehousing environments. Samuel Madden. Tim Harris. Last published February Boris Dragovic. Microsoft Research. Alba Cristina M.. Andrew Wang. Srikanth Kandula. Michael L. San Francisco. Morgan Kaufmann Publishers Inc.

Coordinated Memory Caching for Parallel Jobs. Rolf Neugebauer. Magdalena Balazinska. Eugene Ng. Wesley W. Olivier Chapelle. Apache Software Foundation. Keir Fraser.