I read the news about Apache Spark - “Run programs up to 100x faster than Hadoop MapReduce” - and I wanted to explore it. Furthermore, you can write Spark applications in Python. The trouble was that I hadn't even gotten around to trying Hadoop MapReduce, in spite of wanting to do so for years.
If all you have is a desktop, how do you experiment with and learn systems that are inherently distributed? You can't get much insight into them using a single machine.
Fortunately, even mid-range desktops now come with a quad-core processor and 8GB of RAM. So, you can run at least a few virtual machines. Many tools, e.g. VirtualBox, are very good and easy to use. But are they enough?
Spark is a “fast and general engine for large-scale data processing” (http://spark.apache.org). It allows you to build parallel applications.
It can access data from various sources, in particular existing Hadoop data. It can also run on Hadoop 2's YARN cluster manager.
So, you may want to understand and set up a Hadoop data source and possibly a YARN cluster manager.
You need to set up a cluster of virtual machines on which the master and slave instances of Spark will run.
You need to set up an HDFS cluster of virtual machines for Hadoop data. There will be a NameNode, which will manage the file system metadata, and DataNodes, which will store the actual data.
You may need to play around with the number of virtual machines and you don't want to create each virtual machine manually, with each machine opening a window on the desktop display. You want to manage the machines from a single environment, conveniently.
That brings us to the desire to create a local cloud on the desktop. OpenStack is a popular, open source option and Red Hat offers an open source distribution of it, http://openstack.redhat.com.
The RDO distribution of OpenStack will be included in the repositories of Fedora 21. You can add an additional repository for installing it on Fedora 20.
A bare bones cloud image is available from Fedora's download site. You can also build your own spin using or expanding the kickstart for a cloud image, fedora-x86_64-cloud.ks, a part of the fedora-kickstarts package.
You will have to break the exploration of big data on a desktop into smaller steps. Each step will build on the previous ones. Hopefully, it will run reasonably well on a quad-core desktop with 8GB of RAM and give you an understanding of the additional technology and the programming involved.
The current exploration will be on Fedora because my desktop runs Fedora 20, Fedora offers an OpenStack distribution, and it would be difficult to experiment on multiple distributions simultaneously.
The first step will be to create a local cloud.
You may want to use a virtual machine to minimize the risk to your desktop environment. A useful utility is appliance-creator, which is a part of the appliance-tools package.
You can use the kickstart file fedora-x86_64-cloud.ks, with a couple of changes in fedora-cloud-base.ks to allow signing in as root. By default, the image depends on cloud-init to create an account 'fedora' and inject ssh credentials for password-less login (see https://www.technovelty.org/linux/running-cloud-images-locally.html for an alternative solution). You also need to increase the size of the disk, and SELinux should be permissive or disabled.
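As a rough sketch of that build: it assumes the fedora-kickstarts package installs its files under /usr/share/spin-kickstarts, and that the `-c` (kickstart file) and `-n` (image name) options are available in your version of appliance-creator (check `appliance-creator --help`). The working directory, the image name `fedora-cloud-local` and the kickstart values shown in the comments are illustrative, not taken from the original setup.

```shell
# Work on copies so the packaged kickstarts stay untouched.
# (The source path is an assumption; `rpm -ql fedora-kickstarts` shows the real one.)
mkdir -p ~/cloud-image && cd ~/cloud-image
cp /usr/share/spin-kickstarts/fedora-x86_64-cloud.ks .
cp /usr/share/spin-kickstarts/fedora-cloud-base.ks .

# Edit the copies by hand before building. Illustrative kickstart directives:
#   rootpw --plaintext changeme   # allow root to sign in with a known password
#   selinux --permissive          # instead of enforcing
#   part / --size 10240           # larger root disk (size in MB); keep the
#                                 # other options already present on that line

# Build the image as root; "fedora-cloud-local" is a hypothetical name.
sudo appliance-creator -c fedora-x86_64-cloud.ks -n fedora-cloud-local
```

The resulting disk image is what you then import into virt-manager.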
You will need to make sure that the virtualization packages are installed (https://fedoraproject.org/wiki/Getting_started_with_virtualization). Just do the following:
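A minimal sketch of that step, assuming a Fedora 20 host that uses yum and systemd, and the `virtualization` package group described on the wiki page:

```shell
# Install the virtualization group (KVM, libvirt, virt-manager and related tools).
sudo yum install -y @virtualization

# Start the libvirt daemon now and on every boot.
sudo systemctl start libvirtd
sudo systemctl enable libvirtd

# Sanity check: this should print an (empty) list of machines without an error.
sudo virsh list --all
```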
Install the image created by appliance-creator using virt-manager. You will probably need 3GB of memory to install OpenStack successfully.
Now you are ready to follow the RDO “quick start” instructions (http://openstack.redhat.com/Quickstart).
The following commands are fairly quick:
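A sketch of the sequence, run as root inside the virtual machine. It assumes the steps the RDO Quickstart page listed for Fedora 20 at the time; the repository RPM URL in particular is an assumption and may have moved, so check the Quickstart page for the current one.

```shell
# Bring the VM up to date first.
yum update -y

# Add the RDO repository (URL as published by the Quickstart at the time; verify it).
yum install -y http://rdo.fedorapeople.org/rdo-release.rpm

# Install the packstack installer itself.
yum install -y openstack-packstack

# Install OpenStack with every service on this one machine.
# This last command is the slow one.
packstack --allinone
```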
The packstack command makes it trivial to install OpenStack and its dependencies. It uses Puppet. However, packstack may take a long time, depending on the network and download speeds. (In India, I usually find it better to add “&country=us,ca” to the mirror URLs of the fedora and updates repositories.)
You may find that packstack fails to run the remote script after setting up the ssh keys. If so, you need to set up the authorized keys yourself.
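The mirror tweak can be scripted. The helper below is a hypothetical convenience function, not part of any Fedora tool; it assumes GNU sed and that the repository files carry their mirror URL on a line starting with `mirrorlist=` or `metalink=`. It appends the country filter once and leaves lines that already have one alone.

```shell
# add_country FILE COUNTRIES
# Append "&country=COUNTRIES" to each mirror URL in a yum repo file,
# skipping lines that already contain a country filter.
add_country() {
    sed -i -e '/^\(mirrorlist\|metalink\)=/{/country=/!s/$/\&country='"$2"'/;}' "$1"
}

# Example use (as root), after backing up the files:
#   add_country /etc/yum.repos.d/fedora.repo us,ca
#   add_country /etc/yum.repos.d/fedora-updates.repo us,ca
```

Running it a second time on the same file changes nothing, so it is safe to rerun.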
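One way to do that for an all-in-one install, assuming packstack connects back to the same virtual machine as root over ssh and that an RSA key at the default path is acceptable:

```shell
# Run as root inside the VM.
mkdir -p ~/.ssh && chmod 700 ~/.ssh

# Create a key pair without a passphrase if none exists yet.
[ -f ~/.ssh/id_rsa ] || ssh-keygen -q -t rsa -N "" -f ~/.ssh/id_rsa

# Authorize that key for logins to this same account.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

After this, `ssh root@localhost` should log in without asking for a password, and packstack can be run again.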
The first time packstack is run, it creates an answer file with a name like packstack-answers-X-Y.txt. You will want to reuse the same answers in case you have to rerun packstack.
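A rerun with the saved answers might look like this, assuming the file is in the directory where packstack was first run; X and Y stand for whatever the generated name contains on your machine:

```shell
# Find the answer file that the first run generated.
ls packstack-answers-*.txt

# Rerun packstack with the same answers (substitute the real file name).
packstack --answer-file=packstack-answers-X-Y.txt
```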
After packstack completes successfully, the local cloud is ready. You can browse the site http://$VM-IP-ADDRESS/dashboard.