How to run Apache Spark on MicroK8s and Ubuntu Core, in the cloud: Part 4

robgibbon

on 18 August 2021

Tags: Big Data

This article was last updated 1 year ago.


In this series, we’ve been building up an Apache Spark cluster on Kubernetes using MicroK8s, Ubuntu Core OS, LXD and GCP. We’ve learned about and set up nested virtualisation on the cloud, and had some fun. But right, it’s retrospective time: in Part 1, we saw how to get MicroK8s up on LXD, on Ubuntu Core using Multipass. In Part 2, we looked at getting the same setup running under nested virtualisation on the cloud. And in Part 3, we bunged an Apache Spark cluster on top and interrogated it from a Jupyter notebook. All in a day’s work!

But we’re not finished quite yet. The setup we’ve built so far is vertically scalable. But we want to be able to scale horizontally and add more Ubuntu Core instances to a cluster in order to handle big data workloads with Spark. So in this final Part 4 of the series, we’ll get to that using LXD clustering and fan networking.

Fan networking builds a virtual, overlay software-defined network on top of the real, underlay network, so that VMs and Linux Containers can communicate with one another across hosts seamlessly. LXD’s fan networking implementation can handle up to a theoretical ~16 million hosts, with 65k VMs or Linux Containers per host. So for our Spark cluster, it should be good!

The first step is to deploy three Ubuntu Core VMs on GCE. We’ll take a smaller instance size this time, so this project shouldn’t end up costing you more than a regular oat milk latte from your favourite coffee shop. You can terminate and delete the VM we used the last time if you want, and we will start afresh. While we’re here, we’ll also adjust the MTU for the VPC so that everything works as expected.

Use the following commands:

gcloud compute networks update default --mtu=1500

for x in {0..2};
do
gcloud beta compute instances create ubuntu-core-20-$x --zone=europe-west1-b --machine-type=n2-standard-2 --network-interface network=default --network-tier=PREMIUM --maintenance-policy=MIGRATE --service-account=<YOUR_SERVICE_ACCOUNT>@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append --min-cpu-platform="Intel Cascade Lake" --image=ubuntu-core-20-secureboot --boot-disk-size=60GB --boot-disk-type=pd-balanced --boot-disk-device-name=ubuntu-core-20-$x --shielded-secure-boot --no-shielded-vtpm --no-shielded-integrity-monitoring --reservation-affinity=any --enable-nested-virtualization --create-disk=size=60,mode=rw,auto-delete=yes,name=storage-disk-$x,device-name=storage-disk-$x
done

Cool! We got the three systems up and deployed! Since we preinstalled LXD in the VM disk image back in Part 2, we can just log into the first one and initialize the LXD cluster and the fan network from it.

VLAN Party: building the LXD cluster and fan network

Time to cluster up kids, here we go! Run the following commands to initialize the cluster:

GCE_IIP0=$(gcloud compute instances list | grep ubuntu-core-20-0 | awk '{ print $5 }')
INT0_IP=$(gcloud compute instances list | grep ubuntu-core-20-0 | awk '{ print $4 }')
SUBNET=$(echo $INT0_IP | sed 's/\.[0-9]*$/\.0/')

cat > preseed-0.yaml <<EOF
config:
  core.https_address: $INT0_IP:8443
  core.trust_password: w0rkinNightZzz
networks:
- config:
    bridge.mode: fan
    fan.underlay_subnet: $SUBNET/16
  description: ""
  name: lxdfan0
  type: ""
  project: default
storage_pools:
- config:
    source: /dev/sdb
  description: ""
  name: local
  driver: zfs
profiles:
- config: {}
  description: ""
  devices:
    eth0:
      name: eth0
      network: lxdfan0
      type: nic
    root:
      path: /
      pool: local
      type: disk
  name: default
projects: []
cluster:
  server_name: ubuntu-core-20-0
  enabled: true
  member_config: []
  cluster_address: ""
  cluster_certificate: ""
  server_address: ""
  cluster_password: ""
  cluster_certificate_path: ""
EOF

scp preseed-0.yaml <Your Ubuntu ONE username>@$GCE_IIP0:.
ssh <Your Ubuntu ONE username>@$GCE_IIP0 "cat preseed-0.yaml | sudo lxd init --preseed"

Oh wow, that’s the first node bootstrapped. On to building the rest of the LXD cluster now. We just need the cert data from the bootstrap server in order to get going. We’ll build preseed files and apply them to the other two nodes:

GCE_IIP1=$(gcloud compute instances list | grep ubuntu-core-20-1 | awk '{ print $5 }')
GCE_IIP2=$(gcloud compute instances list | grep ubuntu-core-20-2 | awk '{ print $5 }')

INT1_IP=$(gcloud compute instances list | grep ubuntu-core-20-1 | awk '{ print $4 }')
INT2_IP=$(gcloud compute instances list | grep ubuntu-core-20-2 | awk '{ print $4 }')

CERT_DATA=$(ssh <Your Ubuntu ONE username>@$GCE_IIP0 "sudo sed 's/$/\n/g' /writable/system-data/var/snap/lxd/common/lxd/cluster.crt")

cat > preseed-1.yaml <<EOF
config: {}
networks: []
storage_pools: []
profiles: []
projects: []
cluster:
  server_name: ubuntu-core-20-1
  enabled: true
  member_config:
  - entity: storage-pool
    name: local
    key: zfs.pool_name
    value: ""
    description: '"zfs.pool_name" property for storage pool "local"'
  - entity: storage-pool
    name: local
    key: source
    value: /dev/sdb
    description: '"source" property for storage pool "local"'
  cluster_address: $INT0_IP:8443
  cluster_certificate: "$CERT_DATA"
  server_address: $INT1_IP:8443
  cluster_password: "w0rkinNightZzz"
  cluster_certificate_path: ""
EOF


cat > preseed-2.yaml <<EOF
config: {}
networks: []
storage_pools: []
profiles: []
projects: []
cluster:
  server_name: ubuntu-core-20-2
  enabled: true
  member_config:
  - entity: storage-pool
    name: local
    key: zfs.pool_name
    value: ""
    description: '"zfs.pool_name" property for storage pool "local"'
  - entity: storage-pool
    name: local
    key: source
    value: /dev/sdb
    description: '"source" property for storage pool "local"'
  cluster_address: $INT0_IP:8443
  cluster_certificate: "$CERT_DATA"
  server_address: $INT2_IP:8443
  cluster_password: "w0rkinNightZzz"
  cluster_certificate_path: ""
EOF

scp preseed-1.yaml <Your Ubuntu ONE username>@$GCE_IIP1:.
ssh -t <Your Ubuntu ONE username>@$GCE_IIP1 "cat preseed-1.yaml | sudo lxd init --preseed"

scp preseed-2.yaml <Your Ubuntu ONE username>@$GCE_IIP2:.
ssh -t <Your Ubuntu ONE username>@$GCE_IIP2 "cat preseed-2.yaml | sudo lxd init --preseed"

Alive! she cried: deploying MicroK8s

Your LXD cluster should be alive now. Alive! Now we can pour MicroK8s over the top, but this time all clustered and highly available. Since LXD has made our Ubuntu Core instances operate as one, we can launch several MicroK8s instances from one host, and they’ll be distributed across the LXD cluster as needed.

One gotcha to be aware of is that our LXD fan network, like our MicroK8s Calico network, is making use of VXLAN networking, meaning that we will need to adjust the MTUs at each layer. We already changed the MTU for the VPC to 1500 in a previous step. In the next steps, we’ll change the MTU for the VMs running MicroK8s to 1450, and the MTU for the Calico network to 1400, as well as deploying and clustering MicroK8s:

for x in {0..2};
do
ssh -t <Your Ubuntu ONE username>@$GCE0_IIP "sudo lxc init ubuntu:20.04 microk8s-$x --vm -c limits.memory=6GB -c limits.cpu=2"
ssh -t <Your Ubuntu ONE username>@$GCE0_IIP "sudo lxc config device override microk8s-$x root size=40GB"
ssh -t <Your Ubuntu ONE username>@$GCE0_IIP "sudo lxc start microk8s-$x"
sleep 90 # let it boot
ssh -t <Your Ubuntu ONE username>@$GCE0_IIP "sudo lxc exec microk8s-$x -- echo '            mtu: 1450' | sudo tee -a /etc/netplan/50-cloud-init.yaml"
ssh -t <Your Ubuntu ONE username>@$GCE0_IIP "sudo lxc exec microk8s-$x -- sudo netplan apply"
ssh -t <Your Ubuntu ONE username>@$GCE0_IIP "sudo exec microk8s-$x -- sudo snap install microk8s --classic"
ssh -t <Your Ubuntu ONE username>@$GCE0_IIP sudo exec microk8s-$x -- microk8s.kubectl patch configmap/calico-config -n kube-system --type merge -p '{"data":{"veth_mtu": "1400"}}'
ssh -t <Your Ubuntu ONE username>@$GCE0_IIP sudo exec microk8s-$x -- microk8s.kubectl rollout restart daemonset calico-node -n kube-system
done

for x in {1..2};
do
JOIN_CMD=$(ssh -t <Your Ubuntu ONE username>@$GCE_IIP0 "sudo lxc exec microk8s-0 -- microk8s add-node" | grep "microk8s join" | tail -n 1)

ssh -t <Your Ubuntu ONE username>@$GCE_IIP0 "sudo lxc exec microk8s-$x -- $JOIN_CMD"
done

Just a highly available Kubernetes cluster, on a highly available LXD cluster, on a bunch of next-gen Ubuntu Core OS boxes using nested virtualisation! To get Spark deployed and running on the top, we can replay the steps we did in Part 3 of this series.

Caesar wrap: review

To recap:

  • In Part 1, we used Multipass to spin up an Ubuntu Core VM on our workstation, and then we deployed MicroK8s on it using LXD and a Linux Container.
  • In Part 2, we took that work to the GCP cloud, and we made it even better by using a nested VM to run MicroK8s inside our Ubuntu Core instance, again using LXD.
  • In Part 3, we put Apache Spark on top and queried it from a Jupyter Notebook server.
  • And finally, in this Part 4, we made the solution fully distributed and highly available using LXD clustering and fan networking.

Thanks for being such great company on this journey – but now it’s your turn to take this further. I want you to have some fun, and don’t forget to tag us on Twitter (@ubuntu) with some examples of your nicest projects!

Contact us to learn more about MicroK8s, LXD, Ubuntu Core or any of the technologies that we’ve explored in this series. We’ll gladly set up a call so that you can find out more about how our technologies and services can help you to jump-start your next implementation.

Ubuntu cloud

Ubuntu offers all the training, software infrastructure, tools, services and support you need for your public and private clouds.

Newsletter signup

Get the latest Ubuntu news and updates in your inbox.

By submitting this form, I confirm that I have read and agree to Canonical's Privacy Policy.

Related posts

Spark or Hadoop: the best choice for big data teams?

I always find the Olympics to be an unusual experience. I’m hardly an athletics fanatic, yet I can’t help but get swept up in the spirit of the competition....

Can it play Doom? Running an AI LAN party on a Spark cluster with ViZDoom

It’s all about AI these days, so I decided to try and answer the important question: can you make a Spark cluster run AI agents that play a game of Doom, in a...

Migrating from Cloudera to a modern data hub architecture

In the early 2010s, Apache Hadoop captured the imagination of the tech community. A free and powerful open source platform, it gave users a way to process...