Research List of Apache Projects

From http://projects.apache.org/indexes/quick.html

[Now, Future], 2015-02-06 update.

Apache Accumulo

The Apache Accumulo sorted, distributed key/value store is based on Google's BigTable design. It is built on top of Apache Hadoop, ZooKeeper, and Thrift. It features a few novel improvements on the BigTable design in the form of cell-level access labels and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.

Categories: database
Languages: Java
PMC: Apache Accumulo
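
To make the idea of cell-level access labels concrete, the following Python sketch filters key/value cells by a label before returning them to a reader. It is an illustration only and does not use Accumulo's API: the `Cell` type, the `visible_cells` function, and the simplified rule that a label is a set of tokens the reader must all hold are assumptions made for this example. Accumulo's own label expressions are richer than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    label: frozenset  # hypothetical: tokens the reader must ALL hold


def visible_cells(cells, authorizations):
    """Return only the cells whose label is satisfied by the reader's tokens."""
    auths = frozenset(authorizations)
    # A cell is visible when its label is a subset of the reader's tokens.
    return [c for c in cells if c.label <= auths]


cells = [
    Cell("user1", "name", "Alice", frozenset()),
    Cell("user1", "salary", "1000", frozenset({"hr"})),
    Cell("user1", "diagnosis", "flu", frozenset({"medical", "admin"})),
]

for cell in visible_cells(cells, {"hr"}):
    print(cell.row, cell.column, cell.value)
```

A reader holding only the `hr` token sees the unlabeled cell and the `hr` cell, while the cell that also requires `medical` and `admin` is filtered out per cell rather than per table.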

Apache Ambari

Apache Ambari makes Hadoop cluster provisioning, managing, and monitoring dead simple.

Categories: big-data
Languages: Java, Python, JavaScript
PMC: Apache Ambari

Apache Avro

Apache Avro is a data serialization system.

Categories: library, big-data
Languages: C, C++, C#, Java, PHP, Python, Ruby
PMC: Apache Avro

Apache Chukwa

Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.

Categories: hadoop
Languages: Java, JavaScript
PMC: Apache Chukwa

Apache Drill

Apache Drill is a distributed MPP query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage systems. It was inspired in part by Google's Dremel.

Categories: big-data
Languages: Java
PMC: Apache Drill

Apache Giraph

Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections.

Categories: big-data
Languages: Java
PMC: Apache Giraph

Apache Hadoop

Hadoop is a distributed computing platform. This includes the Hadoop Distributed Filesystem (HDFS) and an implementation of MapReduce.

Categories: database
Languages: Java
PMC: Apache Hadoop
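
The MapReduce model mentioned in the Hadoop entry can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases using a word count, not Hadoop's API; the function names are hypothetical, and a real cluster would distribute each phase across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "big tables"]))
```

Because each call to `map_phase` depends only on its own line and each call to `reduce_phase` only on its own key, both phases can in principle run in parallel, which is the property the model relies on.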

Apache Hama

Apache Hama is an efficient and scalable general-purpose BSP (Bulk Synchronous Parallel) computing engine which can be used to speed up a large variety of compute-intensive analytics applications.

Categories: big-data
Languages: Java
PMC: Apache Hama
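
As a rough picture of the BSP model named in the Hama entry, the sketch below sums numbers that are split across several peers. Each superstep consists of local computation, message sending, and a barrier after which the messages become visible. It is a sequential simulation written for this list, not Hama's API; `bsp_sum` and the choice of peer 0 as the collector are assumptions.

```python
def bsp_sum(partitions):
    """Sum numbers split across peers using two BSP-style supersteps."""
    if not partitions:
        return 0
    inboxes = [[] for _ in partitions]

    # Superstep 1: each peer computes on its local data only,
    # then queues a message addressed to peer 0.
    outgoing = [(0, sum(local_data)) for local_data in partitions]

    # Barrier: messages are delivered only after every peer
    # has finished the superstep.
    for target, message in outgoing:
        inboxes[target].append(message)

    # Superstep 2: peer 0 combines the messages it received.
    return sum(inboxes[0])


print(bsp_sum([[1, 2], [3], [4, 5, 6]]))
```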

Apache HBase

Use Apache HBase software when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable, as described in "Bigtable: A Distributed Storage System for Structured Data" by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Categories: database
Languages: Java
PMC: Apache HBase

Apache Hive

The Apache Hive (TM) data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop (TM), it provides:

* tools to enable easy data extract/transform/load (ETL)
* a mechanism to impose structure on a variety of data formats
* access to files stored either directly in Apache HDFS (TM) or in other data storage systems such as Apache HBase (TM)
* query execution via MapReduce

Hive defines a simple SQL-like query language, called HiveQL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. HiveQL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs).

Categories: database
Languages: Java
PMC: Apache Hive

Apache Lucene Core

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Categories: database
Languages: Java
PMC: Apache Lucene
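
Full-text search libraries are commonly built around an inverted index, which maps each term to the documents that contain it. The sketch below shows that general idea in plain Python. It is not Lucene's API or implementation; `build_index`, `search`, and the whitespace tokenizer are simplifications assumed for this example.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Apache Lucene is a search library",
    2: "Apache Hadoop is a computing platform",
}
index = build_index(docs)
print(search(index, "apache search"))
```

Looking up a term is a dictionary access rather than a scan of every document, which is why this structure suits search over large collections.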

Apache Mahout

Apache Mahout is a scalable machine learning library.

Categories: library
Languages: Java
PMC: Apache Mahout

Apache Nutch

Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases:

* Nutch 1.x: A well-matured, production-ready crawler. 1.x enables fine-grained configuration, relying on Apache Hadoop data structures, which are great for batch processing.
* Nutch 2.x: An emerging alternative taking direct inspiration from 1.x, but which differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora for handling object-to-persistent mappings. This means we can implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions.

Being pluggable and modular has its benefits: Nutch provides extensible interfaces such as Parse, Index, and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, etc. Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster.

Categories: web-framework
Languages: Java
PMC: Apache Nutch

Apache Oozie

Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system-specific jobs (such as Java programs and shell scripts).

Categories: big-data
Languages: Java, JavaScript
PMC: Apache Oozie

Apache Pig

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs. Pig's language layer consists of a textual language called Pig Latin, which has the following key properties:

* Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
* Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
* Extensibility. Users can create their own functions to do special-purpose processing.

Categories: database
Languages: Java
PMC: Apache Pig
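
The phrase "data flow sequences" in the Pig entry can be pictured as a chain of small, named transformations. The Python sketch below is not Pig Latin and does not use Pig; it only mirrors the shape of such a flow (filter, then group, then aggregate). The sample records and step names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input records: (user, page, seconds_spent)
records = [
    ("alice", "home", 12),
    ("bob", "home", 3),
    ("alice", "search", 30),
    ("bob", "search", 45),
]


def filter_step(rows, min_seconds):
    """Keep only visits that lasted at least min_seconds."""
    return [row for row in rows if row[2] >= min_seconds]


def group_step(rows):
    """Group the remaining visit durations by page."""
    groups = defaultdict(list)
    for _user, page, seconds in rows:
        groups[page].append(seconds)
    return groups


def aggregate_step(groups):
    """Compute the total time spent per page."""
    return {page: sum(times) for page, times in groups.items()}


totals = aggregate_step(group_step(filter_step(records, 10)))
print(totals)
```

Each step takes a data set and returns a new one, so the whole analysis reads as a sequence of transformations rather than as hand-written map and reduce functions.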

Apache Spark

Apache Spark is a fast and general engine for large-scale data processing. It offers high-level APIs in Java, Scala and Python as well as a rich set of libraries including stream processing, machine learning, and graph analytics.

Categories: big-data
Languages: Java, Scala, Python
PMC: Apache Spark

Apache Sqoop

Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Categories: big-data
Languages: Java
PMC: Apache Sqoop

Apache Storm

Apache Storm is a distributed real-time computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing real-time computation.

Categories: big-data
Languages: Java
PMC: Apache Storm

Apache ZooKeeper

Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination.

Categories: database
Languages: Java
PMC: Apache ZooKeeper
