Quick Start
Interactive Analysis with the DCF Shell
Basics
DCF shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. Start it by running the following if you installed DCF.js globally:
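The exact command depends on how the package registers its binary; a global install would typically expose a shell along these lines (the binary name here is an assumption, check the package documentation):

```bash
# Hypothetical binary name registered by the globally installed package
dcf-shell
```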
Or by running the following if you run DCF.js from source:
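From a source checkout, the usual Node.js workflow applies; the script name below is an assumption, so check the scripts section of the repository's package.json:

```bash
# From the repository root: install dependencies, then start the shell
npm install
npm run shell   # hypothetical script name
```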
DCF's primary abstraction is a distributed collection of items, called a Dataset. Datasets can be created from any remote fileLoader (such as WebHDFS or Aliyun OSS) or by transforming other Datasets. Let's make a new Dataset from the text of the README file in the DCF source directory:
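A sketch of what this looks like in the shell. The context object name (`dcc`) and the `textFile` method are assumptions here; the actual names are in the API reference, and the path is resolved by whichever fileLoader you configured:

```js
// `dcc` (the shell's context) and `textFile` are assumed names
const textFile = dcc.textFile('README.md');
```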
You can get values from a Dataset directly by calling actions, or transform the Dataset to get a new one. For more details, please read the API reference.
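For instance, using the two actions this guide introduces later, `count` and `collect`:

```js
await textFile.count();   // action: total number of lines in this Dataset
await textFile.collect(); // action: all lines, returned as a plain array
```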
Now let's transform this Dataset into a new one. We call `filter` to return a new Dataset with a subset of the lines in the file:
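A sketch of the transformation (the `linesWithDCF` name matches the caching example further down):

```js
// filter is a transformation: it returns a new Dataset and computes
// nothing until an action is called on it
const linesWithDCF = textFile.filter(line => line.includes('DCF'));
```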
We can chain together transformations and actions:
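For example, a filter followed immediately by a count:

```js
// How many lines contain "DCF"? filter transforms, count acts.
await textFile.filter(line => line.includes('DCF')).count();
```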
More on Dataset Operations
Dataset actions and transformations can be used for more complex computations. Let's say we want to find the line with the most words:
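A sketch under the same assumptions as above:

```js
// Map each line to its word count, then reduce to the largest value
await textFile
  .map(line => line.split(' ').length)
  .reduce((a, b) => (a > b ? a : b));
```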
This first maps a line to an integer value, creating a new Dataset. `reduce` is called on that Dataset to find the largest word count. The arguments to `map` and `reduce` are JavaScript function literals (closures), and can use any language feature or JavaScript library. For example, we can easily call functions declared elsewhere. We'll use the `Math.max()` function to make this code easier to understand:
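The same computation, delegating the comparison to `Math.max`:

```js
await textFile
  .map(line => line.split(' ').length)
  .reduce((a, b) => Math.max(a, b));
```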
If you use upvalues in a closure, or require a module, please read How to pass a closure.
One common data flow pattern is MapReduce, as popularized by Hadoop. DCF can implement MapReduce flows easily:
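A sketch of the flow described below; the grouping signature is an assumption (a key function, as in Spark's Dataset API), so verify it against the API reference:

```js
// flatMap: Dataset of lines -> Dataset of words.
// groupByKey + count: per-word totals. Both signatures are assumed here.
const wordCounts = textFile
  .flatMap(line => line.split(' '))
  .groupByKey(word => word)
  .count();
```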
Here, we call `flatMap` to transform a Dataset of lines to a Dataset of words, and then combine `groupByKey` and `count` to compute the per-word counts in the file as a Dataset of (string, number) pairs. To collect the word counts in our shell, we can call `collect`:
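Continuing the sketch above:

```js
// collect is an action: it brings the (word, count) pairs back to the shell
console.log(await wordCounts.collect());
```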
Caching
DCF also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank. As a simple example, let’s mark our `linesWithDCF` dataset to be cached:
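A sketch, assuming a Spark-style `cache()` method (the exact name may differ; if your version returns a new Dataset instead of marking in place, capture the return value):

```js
// Mark the dataset for caching; the cache fills the first time an action
// computes it and is reused by later actions.
linesWithDCF.cache();
await linesWithDCF.count();
await linesWithDCF.count(); // served from the in-memory cache
```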
It may seem silly to use DCF to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting to a cluster, as described in the RDD Programming Guide.
Self-Contained Applications
Suppose we wish to write a self-contained application using the DCF API. We will walk through a simple application in JavaScript (with Node.js).
You should already have a `package.json` in your project. Then you can install `dcf` as a dependency:
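For example, with npm (yarn works equivalently):

```bash
npm install --save dcf
```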
As an example, we'll create a simple DCF application, `main.js`:
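A minimal sketch of `main.js`. How a context is obtained outside the shell is an assumption here (shown as a hypothetical `createContext` factory), as is `textFile`; only `filter` and `count` are the Dataset operations from this guide:

```js
// main.js — a SimpleApp sketch: counts lines containing 'a' and 'b'.
// `createContext` and `textFile` are assumed entry points; check the
// API reference for the real ones.
const dcf = require('dcf');

async function main() {
  const dcc = await dcf.createContext(); // hypothetical factory
  const textFile = dcc.textFile('README.md');

  const numAs = await textFile.filter(line => line.includes('a')).count();
  const numBs = await textFile.filter(line => line.includes('b')).count();

  console.log(`Lines with a: ${numAs}, lines with b: ${numBs}`);
}

main().catch(err => {
  console.error(err);
  process.exit(1);
});
```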
To learn why and when to use `await`, read Deferred API & Async API.
This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file. For applications that use third-party libraries, we can also add code dependencies with npm or yarn. `SimpleApp` is simple enough that we do not need to specify any code dependencies.
We can run this application with Node:
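Assuming `main.js` is in the current directory:

```bash
node main.js
```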