Quick Start

Interactive Analysis with the DCF Shell

Basics

The DCF shell provides a simple way to learn the API, as well as a powerful tool for analyzing data interactively. Start it by running the following if you installed DCF.js globally:

dcf-shell

Or run the following if you are running DCF.js from source:

npm start

DCF's primary abstraction is a distributed collection of items, called a Dataset. Datasets can be created from any remote fileLoader (such as WebHDFS or Aliyun OSS) or by transforming other Datasets. Let's make a new Dataset from the text of the README file in the DCF source directory:

> var textFile = dcc.textFile('README.md')

You can get values from a Dataset directly by calling actions, or transform the Dataset to get a new one. For more details, please read the API reference.

> textFile.count()
67
> textFile.take(1)
[ '## Distributed Computing Framework for Node.js' ]

Now let's transform this Dataset into a new one. We call filter to return a new Dataset with a subset of the lines in the file.

> var linesWithDCF = textFile.filter(v => v.indexOf('dcf') >= 0)

We can chain together transformations and actions:

> textFile.filter(v => v.indexOf('dcf') >= 0).count()
6

More on Dataset Operations

Dataset actions and transformations can be used for more complex computations. Let's say we want to find the line with the most words:

> textFile.map(v => v.split(' ').length).reduce((a, b) => a > b ? a : b)
24

This first maps a line to an integer value, creating a new Dataset. reduce is called on that Dataset to find the largest word count. The arguments to map and reduce are JavaScript function literals (closures), and can use any language feature or JavaScript library. For example, we can easily call functions declared elsewhere. We'll use the Math.max() function to make this code easier to understand:

If you use upvalues in a closure, or require a module, please read: How to pass a closure (see the sketch after the next example).

> textFile.map(v => v.split(' ').length).reduce((a, b) => Math.max(a, b))
24
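
To illustrate the closure note above: the filters so far only capture literals, but a closure may also refer to a variable defined outside its own body (an upvalue). A hypothetical sketch, where the shell variable keyword is such an upvalue that must be shipped to the workers as described in the closure-passing guide:

> var keyword = 'dcf' // a hypothetical upvalue defined in the shell
> textFile.filter(v => v.indexOf(keyword) >= 0).count() // the closure captures `keyword`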

One common data flow pattern is MapReduce, as popularized by Hadoop. DCF can implement MapReduce flows easily:

> var wordCounts = textFile.flatMap(line => line.split(' ')).filter(v => v).map(v => [v, 1]).reduceByKey((a, b) => a + b)

Here, we call flatMap to transform a Dataset of lines to a Dataset of words, filter out empty strings, and then combine map and reduceByKey to compute the per-word counts in the file as a Dataset of (string, number) pairs. To collect the word counts in our shell, we can call collect:

> wordCounts.collect()
[ [ 'Computing', 1 ],
  [ 'project', 2 ],
  ... ]
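
Since wordCounts is itself a Dataset, we can keep transforming it with the same operations shown above. For example, a small sketch that finds the highest per-word count using only map and reduce (the result depends on the contents of your README.md):

> wordCounts.map(pair => pair[1]).reduce((a, b) => Math.max(a, b))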

Caching

DCF also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank. As a simple example, let’s mark our linesWithDCF dataset to be cached:

> linesWithDCF = linesWithDCF.cache()
> linesWithDCF.count()
6
> linesWithDCF.count()
6

It may seem silly to use DCF to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting to a cluster, as described in the RDD Programming Guide.

Self-Contained Applications

Suppose we wish to write a self-contained application using the DCF API. We will walk through a simple application in JavaScript (with Node.js).

You should already have a package.json in your project; then you can install dcf as a dependency:

npm install --save dcf
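
If you don't have a package.json yet, npm init -y will generate a minimal one. After installing, it should contain a dcf dependency entry similar to this sketch (the package name and version numbers here are illustrative, not prescriptive):

{
  "name": "dcf-quickstart",
  "version": "1.0.0",
  "dependencies": {
    "dcf": "^1.0.0"
  }
}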

As an example, we'll create a simple DCF application, main.js:

const { LocalClient, Context } = require('dcf');

async function main() {
  // Initialize the client and context.
  const client = new LocalClient();
  await client.init();
  const dcc = new Context(client);

  // Do some work:
  const logFile = './README.md';

  const logData = dcc.textFile(logFile).cache();

  const numAs = await logData.filter(v => v.indexOf('a') >= 0).count();
  const numBs = await logData.filter(v => v.indexOf('b') >= 0).count();

  console.log('Lines with a: %i, lines with b: %i', numAs, numBs);

  // Dispose of the client:
  await client.dispose();
}

// Run the application and rethrow async errors.
main().catch(e => setTimeout(() => {
  throw e;
}));

For why and when to use await, read Deferred API & Async API.
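
As a rule of thumb (and as main.js above already shows): transformations such as filter chain synchronously and return Datasets, while actions such as count return promises that must be awaited. A minimal sketch, reusing the logData Dataset from inside main():

// Transformations chain synchronously and return new Datasets.
const nonEmpty = logData.filter(v => v.length > 0);

// Actions such as count() return a promise that resolves to the result.
const lineCount = await nonEmpty.count();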

This program just counts the number of lines containing 'a' and the number containing 'b' in a text file. For applications that use third-party libraries, we can also add code dependencies with npm or yarn. This application is simple enough that we do not need to specify any code dependencies.

We can run this application with Node:

node main.js
