Wednesday, August 31, 2016

LinuxCon NA 2016 - Highlights

After visiting FISL this summer, my travels have now taken me to LinuxCon NA 2016 in Toronto.
As everyone knows, the hot topic of the moment is containers, and they were everywhere at LinuxCon. Several companies are working in this market, there are even hardware optimized for getting the best performance on containers!
However, besides containers, there were several other different subjects of which I had more contact with:
memory-driven computers, workqueues, bluetooth, graphics, file systems, power saving (check the talk highlights below).
I also met several amazing people working in different fields and contributing with the free software community.

The place:

The infrastructure of the event was great, wifi worked everywhere. There was breakfast for attendees and snacks during the small breaks during the day.
In the main Hall, there were several couches and tables, and the conference rooms were great.

Each morning there were keynotes that were hosted in a big fancy room.  These were also streamed to the main hall so other people could watch.
In the afternoons there were several talks happening in parallel in smaller rooms.

The women lunch:

On the first day there was a women's only lunch event promoted by Intel, which was populated by 100+ women from the tech field. I've never seen so many of us reunited like that!
It was a great event to socialize and learn where everybody works. Several of them work directly with coding, but not the majority.
It was a pleasure to meet everyone and I am looking forward to see even more women in tech.

Booths (Hightlights):



In this booth I met The Machine, which is based in a Memory-driven computer architecture that promises to revolutionize how we know computers today.
The main memory is based in memristors, which can be viewed as a non-volatile RAM, so instead of having our basic model of Caches/Main Memory/Disk we would only have one memory based on memristors, all connected through a photonic fabric instead of a copper bus.
This changes our current programing model. HP have a github available with a framework where you can emulate the hardware, test and start programing for it.


Diamanti is a company that offers a hardware based solution to optimize containers and virtual environments, as mentioned in my NVMe post, I am working in a patch to optimize performance of shared NVMe device for a guest system in software while Diamanti, instead of sharing a NVMe device by software, make their hardware pretend there are multiple NVMe devices and they attach each of this devices directly to a container or virtual machine, thus from a software point of view, the container controls the device without having the VMM interfering.
They also do the same this for other peripherals beside NVMe as network cards.


Besides the Linux distribution (Ubuntu), this booth was presenting Juju, which is a tool to manage your services in the cloud, and also LXD, an hypervisor for containers


As most of you know, the Docker project is a great tool to create containers, which are something in the middle of a virtual machine and a chroot, it uses the kernel from the host.
Docker is also the name of the company (I thought is was only the name of the project) and they use LXC as a base to create containers.
The company provides services for other companies using the Docker project as setting up the infrastructure, setting a private Docker Hub, providing support, etc.


Why was Microsoft was in LinuxCon? To declare its love for Linux! :)
In this booth I obtained many stickers written "Microsoft loves Linux".I guess they decided to stop fighting old battles and be friends with Linux in the server market.


CoreOs is a Linux distribution mostly meant to be a lightweight host system for docker containers.
Kubernetes is a tool for managing containers, automating deployment and scaling. So used in conjunction with CoreOs is a good match.

Talks (Hightlights):


Btrfs with High Speed Devices - Chris Mason, Facebook:

Currently the maintainer of Btrfs, Chris Mason talked about this file system, tools to debug and how to identify bottlenecks.
One of the bottlenecks was btree locking, where he presented a patch that has a new locking scheme that optimizes the file system.

Open Source Bluetooth Device Firmware for IoT and Makers - Marcel Holtmann, Intel:

In this talk, Marcel Hltmann gave a great overview of the Bluetooth stack and mentioned that Bluetooth 5.0 is coming with support for mesh network.
As he is the maintainer of the Bluetooth stack on Linux, he talked about BlueZ and other Bluetooth tools in Linux.
For IoT and Makers who usually use an nRF51/nRF52 Bluetooth chip with the proprietary SoftDevice firmware, Marcel talked about how we could use Zyphis or MyNewt (which are open source) instead of SoftDevice and how he managed to get it working on Arduino 101.

Async Execution with Workqueue - Bhaktipriya Shridhar, Linux Kernel

Bhaktipriya Shridhar gave a talk about her Outreachy project on workqueues and how she managed to migrated several drivers from the old API to the new one.
Workqueues is a mechanism in Linux Kernel to execute pieces of code in asynchronous fashion, in short: if you have a function to execute and you don't want to wait for it to return, you can add it in the workqueue.
Internally, the kernel has two API's, the old one, with several issues as proliferation of kernel threads (it could run out of process IDs before even executing user space), deadlocks (if wasn't handled correctly) and unnecessary context switches. And the new API, the Concurrency Managed Workqueue (cmwq), which solves most of these issues.


Kernel Internship Report and Outreachy Panel - Moderated by Karen Sandler, Software Freedom Conservancy; Helen Fornazier, Rik Van Riel & Bhaktipriya Shridhar

Outreachy is 3 month internship meant to promote the presence of minorities in free software community.
If you know what GSoC is, Outreachy is similar with small differences in the projects (not necessary about coding), the selection phase, who can participate, etc.

In the panel we had 2 former mentors Rik Van Riel and Tiffany Antopolski (who is also a former intern), Bhaktipriya Shridhar (current intern in the linux kernel), myself as former intern and Karen Sandler as host and part of the organization of the Outreachy program at Software Freedom Conservancy.
Each one shared their experience as a mentor or as an intern.

CPUfreq and The Scheduler: Revolution in CPU Power Management - Rafael J. Wysocki, Intel OTC

To save power when the system can't go idle, CPUFreq can decrease or increase the clock frequency of the CPU based in the current work load.
Rafael Wysocki (ACPI core maintainer) explained the architecture of the old system that was based on timers, that would sample the load from time to time and update the clock frequency accordingly. The new system provides a much better result by using a Scheduler-driven mechanism instead of timers, using data from the scheduler to make decisions on the next frequency.


Bringing Android Explicit Fencing to Mainline - Gustavo Padovan, Collabora Ltd.

In this talk, Gustavo Padovan explained how graphic fences are exposed to userspace to synchronize buffer sharing and increase performance compared to the implicit fencing where userspace is not aware.


The gala party:

In the last day we had a great gala for the 25th anniversary of Linux.

I had the pleasure to have a great conversation with Eduardo Habkost from Red Hat who has worked with virtualization for 10+ years and gave me a great explanation on how Qemu connects with KVM.

Tuesday, August 23, 2016

Increased performance of emulated NVMe devices

Nowadays, in Google Cloud Engine (GCE), it is possible to attach a local SSD with the NVMe interface to your virtual machine. Unfortunately, you only get a good number of iops (input/output operations per second) if you instantiate a machine with nvme-backports-debian-7-wheezy image; other available distributions on GCE will have a lower number of iops.

It turns out that Google's Virtual Machine Monitor (aka Hypervisor) implements a custom NVMe command that allows it to increase up to 4 times the number of  iops (note: this is from what I've tested so far, but it seems to be possible to get up to 5 times faster according to the original commit message; check the  Technical Details sessions to see how this is possible), however the kernel you use needs to support it and this is not yet the case with the mainline kernel.

This is not exclusive to GCE as Google released a patch not only to the kernel  but also to the qemu and is available here.

Collabora has been helping update, refactor and review the patches to the Linux Kernel to send it upstream, however since this is not yet an official nvme standard, it shouldn't be merged into Kernel mainline, as its specification may still receive changes.

Seeing as it considerably increases performance, the feature is in the process of being discussed and proposed to the NVMe workgroup with Collabora's help.
While the seems interested in adding an official extension to stardarize it, as published in the mailing list, nothing has been defined yet, as this is a very recent discussion and it can take up to a year to be ratified by the NVMe workgroup.

So, for the time being, you can get a more recent version of the patch and install the driver yourself here:

How it works?

Technical details


The NVMe interface basicaly works with command queues. The drive writes a command in a region known to both (driver and device controller) and then updates the tail of the queue, writting to an MMIO register called doorbell.

In an environment with several guest OSes on top of a VMM sharing a resource, communication between the guest OS and the real device is usually trapped by the VMM. As an MMIO is usually a syncronous acces to the device, it means that every MMIO access will cause a trap.

Example of emulated device in the VMM
The main idea here is to decrease the number of traps to the VMM by reducing the number of writtes to the doorbells.

This is achieved in two ways:
    1) Batching; or
    2) Letting the VMM pull the current doorbell value when it is already in execution.

The first one is easy, we can wait X commands to be written in the queue to ring the doorbell.
The second one is a bit more complicated. The guest OS needs to inform the emulated device in the VMM where it can pull the doorbell values, and the emulated NVMe device needs to inform the guest OS that it can restart the counter of X.

This is what this new feature does:
It adds a new command in the NVMe interface where the driver can send to the NVMe device controller two memory buffers:
1) A buffer where the real doorbell values are: Instead of writting to the MMIO  doorbell, the driver writtes the value in this buffer; and
2) Another buffer with a hint from the controller about how many commands the driver can write in the queue without ringing the doorbell.

The exact technical details may still change in the future, especially on how to properly implement the second item above. It is also very likely that Google's patches won't be compliant with the future ratified standard.

For the time being though, you can use the Collabora tree. Please let me know if you have any comments/feedback!