Tuesday, June 13, 2017

NVMe: Officially faster for emulated controllers!

The Doorbell Buffer Config command

When I last wrote about NVMe, the feature to improve NVMe performance in emulated environments was just a living discussion and a work-in-progress patch. However, it has now been officially released in the NVMe Specification Revision 1.3 under the name "Doorbell Buffer Config command", along with an implementation that is already in the mainline Linux Kernel! \o/

You can already feel the difference in performance if you compile Kernel 4.12-rc1 (or later) and run it on a virtual machine hosted on Google Compute Engine. Google actually updated their hypervisor as soon as the feature was ratified by the NVMe working group, even before it was publicly released.

There were very few changes from the original proposal, i.e. opcodes, return values and, now, fancy names: the buffers (as described in my last post) are now called the Shadow Doorbell and EventIdx buffers.

In short, the first one mimics the Doorbell registers in memory, allowing the emulated controller to fetch the Doorbell value when convenient instead of waiting for the Doorbell register to be written. For its part, the EventIdx provides a hint given by the emulated controller to tell the host if the Doorbell register needs to be updated (in case the emulated controller is not fetching the Doorbell value from the Shadow Doorbell buffer). You can check section 7.13 of the specification for an example of usage.
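As a rough illustration of how the EventIdx hint can be used, the check below is modeled on the virtio-style event index comparison. The function name, the argument names and the 16-bit counter width are assumptions of this sketch; the specification remains the normative reference.

```python
MASK16 = 0xFFFF  # assumption: doorbell values are compared as wrapping 16-bit counters


def need_mmio_doorbell(event_idx: int, new_idx: int, old_idx: int) -> bool:
    """Decide whether the host should also write the real (MMIO) Doorbell register.

    Returns True when the new doorbell value has moved past the EventIdx hint,
    i.e. old_idx <= event_idx < new_idx, modulo 2**16.
    """
    return ((new_idx - event_idx - 1) & MASK16) < ((new_idx - old_idx) & MASK16)


# The Shadow Doorbell is always updated in memory; the MMIO write is only
# needed when the check above says so.
print(need_mmio_doorbell(event_idx=3, new_idx=5, old_idx=3))
print(need_mmio_doorbell(event_idx=5, new_idx=5, old_idx=3))
```

The first call reports that an MMIO write is needed (the hint sits inside the range that was just passed), while the second reports that the in-memory update is enough.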


The following test results were obtained on a machine of type n1-standard-4 (4 vCPUs, 15 GB memory) on the Google Compute Engine platform with Kernel 4.12.0-rc5 using the following command:

$ sudo fio --time_based --name=benchmark --runtime=30 \
--filename=/dev/nvme0n1 --nrfiles=1 --ioengine=libaio --iodepth=32 \
--direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=1 \
--rw=randread --blocksize=4k --randrepeat=0

Results (in Input/Output Operations per Second):
Without Shadow Doorbell and EventIdx buffers: 43.9K IOPS
With Shadow Doorbell and EventIdx buffers: 184K IOPS
Gain ~= 4 times
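
The gain can be double-checked from the two measurements above (the variable names are just for this example):

```python
iops_without = 43.9e3  # without Shadow Doorbell and EventIdx buffers
iops_with = 184e3      # with Shadow Doorbell and EventIdx buffers

gain = iops_with / iops_without
print(f"gain = {gain:.2f}x")  # gain = 4.19x
```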

Screenshot - Without Shadow Doorbell and EventIdx buffers

Screenshot - With Shadow Doorbell and EventIdx buffers

Enjoy your enhanced numbers of IOPS! :D

Wednesday, May 3, 2017

Collabora Contributions to Linux Kernel 4.11

Linux Kernel v4.11 was released, and 9 different Collabora developers contributed a total of 44 patches, an increase of 5 patches from version 4.10. The majority of Collabora's work this time was around fixes and clean ups in the DRM. In addition to our contributions as authors, Collabora also added 22 Reviewed-by tags for patches reviewed by our engineers. You can learn more about the v4.11 merge window in the extensive coverage of it: part 1, part 2 and part 3.

Now here is a look at the specific changes made by Collaborans. To begin with, Enric Balletbo fixed an issue and improved documentation of IIO regarding sensors and also added support for several buses and peripherals for the Toby-Churchill SL50 board. Romain Perier added an ASoC machine driver for Rockchip rk3288-based boards that have an HDMI and analog audio output, and also added support for slave mode in the Everest Semi ES8328 audio codec, while including ES8388 as a compatible device in the ES8328's codec driver.

For his part, Tomeu Vizoso fixed the sink display error in DRM EDID when no deep color is available for Rotel RSX-1058 and also fixed several issues and code cleanups regarding CRC in DRM and integrated the new CRC debugfs API in i915. Gabriel Krisman Bertazi, who wrote a great article recently on tracing the user space and Operating System interactions, made several improvements to DRM by adding documentation, cleaning up the code and fixing several issues, including allowing QXL build when FBDEV_EMULATION is disabled.

Lastly, Daniel Stone, Collabora's Graphics Lead, fixed an important issue in DRM regarding the use of Atomic State in legacy ioctls, while Fabien Lahoudere cleaned up the code for the Epson RTC removing an unnecessary spinlock, and Robert Foss fixed a copy of uninitialized memory in Ethernet qed code.

Here is the complete list of Collabora contributions:

Daniel Stone (1):

Enric Balletbo i Serra (10):

Fabien Lahoudere (1):

Gabriel Krisman Bertazi (19):

Gustavo Padovan (1):

Martyn Welch (1):

Romain Perier (4):

Tomeu Vizoso (6):

Robert Foss (1):

Reviewed-by tags from Collabora engineers:

Gustavo Padovan (8):

Emil Velikov (5):

Gabriel Krisman Bertazi (3):

Robert Foss (3):

Daniel Stone (2):

Tomeu Vizoso (1):

Wednesday, August 31, 2016

LinuxCon NA 2016 - Highlights

After visiting FISL this summer, my travels have now taken me to LinuxCon NA 2016 in Toronto.
As everyone knows, the hot topic of the moment is containers, and they were everywhere at LinuxCon. Several companies are working in this market, and there is even hardware optimized for getting the best performance out of containers!
However, besides containers, there were several other subjects that I had more contact with:
memory-driven computers, workqueues, bluetooth, graphics, file systems, power saving (check the talk highlights below).
I also met several amazing people working in different fields and contributing to the free software community.

The place:

The infrastructure of the event was great, and wifi worked everywhere. There was breakfast for attendees and snacks during the small breaks throughout the day.
In the main Hall, there were several couches and tables, and the conference rooms were great.

Each morning there were keynotes that were hosted in a big fancy room.  These were also streamed to the main hall so other people could watch.
In the afternoons there were several talks happening in parallel in smaller rooms.

The women's lunch:

On the first day there was a women-only lunch event promoted by Intel, which was attended by 100+ women from the tech field. I've never seen so many of us gathered like that!
It was a great event to socialize and learn where everybody works. Several of them work directly with coding, but not the majority.
It was a pleasure to meet everyone and I am looking forward to seeing even more women in tech.

Booths (Highlights):



In this booth I met The Machine, which is based on a memory-driven computer architecture that promises to revolutionize how we know computers today.
The main memory is based on memristors, which can be viewed as a non-volatile RAM, so instead of having our basic model of Caches/Main Memory/Disk we would only have one memory based on memristors, all connected through a photonic fabric instead of a copper bus.
This changes our current programming model. HP has a GitHub repository with a framework where you can emulate the hardware, test and start programming for it.


Diamanti is a company that offers a hardware-based solution to optimize containers and virtual environments. As mentioned in my NVMe post, I am working on a patch that optimizes, in software, the performance of an NVMe device shared with a guest system. Diamanti, instead of sharing an NVMe device in software, makes its hardware present multiple NVMe devices and attaches each of these devices directly to a container or virtual machine. Thus, from a software point of view, the container controls the device without the VMM interfering.
They also do the same thing for other peripherals besides NVMe, such as network cards.


Besides the Linux distribution (Ubuntu), this booth was presenting Juju, which is a tool to manage your services in the cloud, and also LXD, a hypervisor for containers.


As most of you know, the Docker project is a great tool to create containers, which are something between a virtual machine and a chroot: a container uses the kernel from the host.
Docker is also the name of the company (I thought it was only the name of the project), and early versions of the project used LXC as a base to create containers.
The company provides services for other companies using the Docker project, such as setting up the infrastructure, setting up a private Docker Hub, providing support, etc.


Why was Microsoft at LinuxCon? To declare its love for Linux! :)
In this booth I obtained many stickers that read "Microsoft loves Linux". I guess they decided to stop fighting old battles and be friends with Linux in the server market.


CoreOS is a Linux distribution mostly meant to be a lightweight host system for Docker containers.
Kubernetes is a tool for managing containers, automating deployment and scaling, so using it in conjunction with CoreOS is a good match.

Talks (Highlights):


Btrfs with High Speed Devices - Chris Mason, Facebook:

Currently the maintainer of Btrfs, Chris Mason talked about this file system, its debugging tools and how to identify bottlenecks.
One of the bottlenecks was btree locking, for which he presented a patch with a new locking scheme that optimizes the file system.

Open Source Bluetooth Device Firmware for IoT and Makers - Marcel Holtmann, Intel:

In this talk, Marcel Holtmann gave a great overview of the Bluetooth stack and mentioned that Bluetooth 5.0 is coming with support for mesh networking.
As he is the maintainer of the Bluetooth stack on Linux, he talked about BlueZ and other Bluetooth tools in Linux.
For IoT and Makers who usually use an nRF51/nRF52 Bluetooth chip with the proprietary SoftDevice firmware, Marcel talked about how we could use Zephyr or Mynewt (which are open source) instead of SoftDevice and how he managed to get it working on Arduino 101.

Async Execution with Workqueue - Bhaktipriya Shridhar, Linux Kernel

Bhaktipriya Shridhar gave a talk about her Outreachy project on workqueues and how she managed to migrate several drivers from the old API to the new one.
Workqueues are a mechanism in the Linux Kernel to execute pieces of code asynchronously. In short: if you have a function to execute and you don't want to wait for it to return, you can add it to a workqueue.
Internally, the kernel has two APIs. The old one has several issues, such as proliferation of kernel threads (the system could run out of process IDs before even executing user space), deadlocks (if it wasn't handled correctly) and unnecessary context switches. The new API, the Concurrency Managed Workqueue (cmwq), solves most of these issues.


Kernel Internship Report and Outreachy Panel - Moderated by Karen Sandler, Software Freedom Conservancy; Helen Fornazier, Rik Van Riel & Bhaktipriya Shridhar

Outreachy is a 3-month internship meant to promote the presence of minorities in the free software community.
If you know what GSoC is, Outreachy is similar, with small differences in the projects (not necessarily about coding), the selection phase, who can participate, etc.

In the panel we had 2 former mentors, Rik Van Riel and Tiffany Antopolski (who is also a former intern), Bhaktipriya Shridhar (current intern in the Linux kernel), myself as a former intern and Karen Sandler as host and part of the organization of the Outreachy program at Software Freedom Conservancy.
Each one shared their experience as a mentor or as an intern.

CPUfreq and The Scheduler: Revolution in CPU Power Management - Rafael J. Wysocki, Intel OTC

To save power when the system can't go idle, CPUFreq can decrease or increase the clock frequency of the CPU based on the current workload.
Rafael Wysocki (ACPI core maintainer) explained the architecture of the old system that was based on timers, that would sample the load from time to time and update the clock frequency accordingly. The new system provides a much better result by using a Scheduler-driven mechanism instead of timers, using data from the scheduler to make decisions on the next frequency.


Bringing Android Explicit Fencing to Mainline - Gustavo Padovan, Collabora Ltd.

In this talk, Gustavo Padovan explained how graphic fences are exposed to userspace to synchronize buffer sharing and increase performance compared to the implicit fencing where userspace is not aware.


The gala party:

On the last day we had a great gala for the 25th anniversary of Linux.

I had the pleasure to have a great conversation with Eduardo Habkost from Red Hat, who has worked with virtualization for 10+ years and gave me a great explanation of how QEMU connects with KVM.

Tuesday, August 23, 2016

Increased performance of emulated NVMe devices

Nowadays, in Google Compute Engine (GCE), it is possible to attach a local SSD with the NVMe interface to your virtual machine. Unfortunately, you only get a good number of IOPS (input/output operations per second) if you instantiate a machine with the nvme-backports-debian-7-wheezy image; other available distributions on GCE will have a lower number of IOPS.

It turns out that Google's Virtual Machine Monitor (aka hypervisor) implements a custom NVMe command that allows it to increase the number of IOPS by up to 4 times. (Note: this is from what I've tested so far, but it seems to be possible to get up to 5 times faster according to the original commit message; check the Technical details section to see how this is possible.) However, the kernel you use needs to support it, and this is not yet the case with the mainline kernel.

This is not exclusive to GCE, as Google released a patch not only for the kernel but also for QEMU, which is available here.

Collabora has been helping update, refactor and review the patches to the Linux Kernel to send them upstream. However, since this is not yet an official NVMe standard, it shouldn't be merged into the mainline kernel, as its specification may still change.

Seeing as it considerably increases performance, the feature is in the process of being discussed and proposed to the NVMe workgroup with Collabora's help.
While there seems to be interest in adding an official extension to standardize it, as published in the mailing list, nothing has been defined yet, as this is a very recent discussion and it can take up to a year to be ratified by the NVMe workgroup.

So, for the time being, you can get a more recent version of the patch and install the driver yourself here:

How does it work?

Technical details


The NVMe interface basically works with command queues. The driver writes a command in a memory region known to both sides (the driver and the device controller) and then updates the tail of the queue by writing to an MMIO register called the doorbell.

In an environment with several guest OSes on top of a VMM sharing a resource, communication between the guest OS and the real device is usually trapped by the VMM. As an MMIO access is usually a synchronous access to the device, every MMIO access will cause a trap.

Example of emulated device in the VMM
The main idea here is to decrease the number of traps to the VMM by reducing the number of writes to the doorbells.

This is achieved in two ways:
    1) Batching; or
    2) Letting the VMM pull the current doorbell value when it is already in execution.

The first one is easy: we can wait for X commands to be written to the queue before ringing the doorbell.
The second one is a bit more complicated. The guest OS needs to inform the emulated device in the VMM where it can pull the doorbell values from, and the emulated NVMe device needs to inform the guest OS that it can restart the counter of X.

This is what this new feature does:
It adds a new command to the NVMe interface with which the driver can send two memory buffers to the NVMe device controller:
1) A buffer where the real doorbell values are: instead of writing to the MMIO doorbell, the driver writes the value in this buffer; and
2) Another buffer with a hint from the controller about how many commands the driver can write in the queue without ringing the doorbell.
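
To make the batching idea concrete, here is a toy model (not the driver code) that counts how many MMIO doorbell writes, and therefore traps to the VMM, happen for a given number of submitted commands. The `batch` parameter plays the role of X above; the function name and the final flush are assumptions of this sketch.

```python
def count_doorbell_writes(num_commands: int, batch: int) -> int:
    """Count MMIO doorbell writes when the doorbell is rung once every `batch` commands."""
    if batch < 1:
        raise ValueError("batch must be at least 1")
    writes = 0
    pending = 0
    for _ in range(num_commands):
        pending += 1          # command written to the queue (no trap)
        if pending == batch:  # ring the MMIO doorbell: one trap to the VMM
            writes += 1
            pending = 0
    if pending:               # assumption: flush the remainder so no command is left unannounced
        writes += 1
    return writes


print(count_doorbell_writes(32, 1))  # 32
print(count_doorbell_writes(32, 8))  # 4
```

With a queue depth of 32, ringing the doorbell on every command costs 32 traps, while batching 8 commands per doorbell costs 4.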

The exact technical details may still change in the future, especially on how to properly implement the second item above. It is also very likely that Google's patches won't be compliant with the future ratified standard.

For the time being though, you can use the Collabora tree. Please let me know if you have any comments/feedback!

Tuesday, August 4, 2015

[Outreachy Status Log] Scaler implementation

The implementation of the scaler node (which increases the size of the received image by a multiplier) is done and available in my github account:

The debayer filter (from the previous Outreachy Status Log post) is also available there.
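
To illustrate what scaling by a multiplier means, here is a minimal pixel-replication sketch. It is not the VIMC code: the replication strategy is an assumption, and `scale_image` and its arguments are names chosen for this example.

```python
def scale_image(pixels, mult):
    """Enlarge a 2D list of pixels by an integer multiplier using pixel replication."""
    if mult < 1:
        raise ValueError("mult must be a positive integer")
    scaled = []
    for row in pixels:
        # Repeat each pixel `mult` times horizontally...
        wide_row = [p for p in row for _ in range(mult)]
        # ...and repeat the widened row `mult` times vertically.
        scaled.extend(list(wide_row) for _ in range(mult))
    return scaled


for line in scale_image([[1, 2], [3, 4]], 2):
    print(line)
```

A 2x2 image scaled by 2 becomes a 4x4 image in which every original pixel covers a 2x2 block.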

The scaler still doesn't work if the received image is in Bayer format; I'll leave this for a future patch.

The next patch will allow user space to change the pad format (such as the frame size) and to enumerate the supported pixel formats through ioctls. I already have a sketch of this code but I need to test it a little more.

Monday, August 3, 2015

[Outreachy Status Log] Debayer implementation

What is a Bayer format

The sensor of a camera is composed of many color sensors, and each of these color sensors captures just a single color. Thus a sensor may be composed of many color sensors arranged, for example, as Red and Green in the odd lines and Green and Blue in the even lines:

| Red Sensor   | Green Sensor | Red Sensor   | Green Sensor |
| Green Sensor | Blue Sensor  | Green Sensor | Blue Sensor  |
| Red Sensor   | Green Sensor | Red Sensor   | Green Sensor |
| Green Sensor | Blue Sensor  | Green Sensor | Blue Sensor  |

We say it's an RGGB Bayer format. We could have any combination of these color sensors, such as GRBG (for Green and Red in the odd lines and Blue and Green in the even lines), BGGR, GBRG, etc.
Our eyes are usually more sensitive to green, which is why there are twice as many green sensors as sensors of each of the other color components.
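
The naming convention can be sketched as a small lookup: each order name spells out the 2x2 tile that repeats over the whole sensor. The dictionary and function below are hypothetical names for this example, with rows and columns counted from 0 (so row 0 is the first, "odd", line).

```python
# Each Bayer order is a 2x2 tile: (first line, second line).
BAYER_ORDERS = {
    "RGGB": (("R", "G"), ("G", "B")),
    "GRBG": (("G", "R"), ("B", "G")),
    "BGGR": (("B", "G"), ("G", "R")),
    "GBRG": (("G", "B"), ("R", "G")),
}


def sensor_color(order, row, col):
    """Return which color ('R', 'G' or 'B') the sensor at (row, col) captures."""
    return BAYER_ORDERS[order][row % 2][col % 2]


# Print the first four lines of an RGGB sensor.
for row in range(4):
    print(" ".join(sensor_color("RGGB", row, col) for col in range(4)))
```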

Here is a Bayer format image, 64x64 RGGB:

If you look closely you can notice that in the odd lines we have only Red and Green and in the even lines we have Green and Blue.

Thus, the camera processes the results of each color sensor to build the image we are used to. And we will simulate this filter in our VIMC (Virtual media controller; yes, it was renamed from VMC to VIMC).

Simple debayer algorithm

To get each color component of each pixel we will calculate the mean of the color components around a given color sensor. For example, take a Blue color sensor at the center of a 3x3 square (or mean window):


The Red component will be the mean of all four Red components in this square, the Green component will be the mean of all four Green components in this square and the Blue is the value read from the Blue sensor being evaluated.

The mean window can be any SxS size, where S is an odd value to allow the pixel being evaluated to be in the center of the square.
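
A minimal sketch of this mean-window idea for an RGGB image follows. It is not the VIMC implementation: the function name, the integer mean, skipping samples that fall outside the image, and averaging the pixel's own color too when the window is larger than 3x3 are all assumptions of this example.

```python
TILE = (("R", "G"), ("G", "B"))  # RGGB order, indexed by [row % 2][col % 2]


def debayer_pixel(raw, row, col, window=3):
    """Return (R, G, B) for one pixel as the mean of each color inside the window."""
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    sums = {"R": 0, "G": 0, "B": 0}
    counts = {"R": 0, "G": 0, "B": 0}
    for r in range(row - half, row + half + 1):
        for c in range(col - half, col + half + 1):
            # Assumption: samples outside the image are simply ignored.
            if 0 <= r < len(raw) and 0 <= c < len(raw[0]):
                color = TILE[r % 2][c % 2]
                sums[color] += raw[r][c]
                counts[color] += 1
    return tuple(sums[k] // counts[k] if counts[k] else 0 for k in "RGB")


raw = [
    [10, 20, 30],  # R  G  R
    [40, 99, 60],  # G  B  G
    [70, 80, 94],  # R  G  R
]
print(debayer_pixel(raw, 1, 1))  # (51, 50, 99)
```

For the Blue sensor in the center, Red is the mean of the four corners, Green is the mean of the four edge neighbours, and Blue is the value of the sensor itself.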


Debayer, window size 3x3, image size 64x64

Debayer, window size 5x5, image size 64x64

Debayer, window size 11x11, image size 64x64

Code structure - the pix map table

In the implementation I have a table which describes the order:

With this table I am able to write generic code for all the Bayer orders.

I am still cleaning up this code; it will be available in my github soon.

Sunday, July 19, 2015

Linux Kernel: memory corruption - debug tricks

Note: This post is meant to help you debug a memory corruption if you are building your own kernel from source.

The symptoms:

If you see one of the following errors in your dmesg log:

* BUG: unable to handle kernel paging request at 7f45402d
* invalid opcode: 0000 [#1] SMP
* general protection fault: 0000 [#1] SMP
* BUG: unable to handle kernel NULL pointer dereference at 0000000000000258

Then you probably have a memory corruption somewhere.

In my case, I didn't get these errors at first. I was adding objects to a list by calling list_add_tail(my_obj, my_list) and verifying that list_empty(my_list) was returning false as expected (and it was). But later in the code, when I called list_empty(my_list) again, it was returning true, and nowhere in my code was I removing objects from the list.

Weird behaviors that don't make sense are probably due to memory corruption. When I started simplifying the code to isolate the problem, I started getting the errors above in the dmesg log.

Where it is happening:

When you have an error, you get a log message in dmesg like:

We will be looking at two pieces of information there:
1) The IP (instruction pointer) address in the 2nd line
2) The call trace, starting at line 23

The IP

In the 2nd line of this log, we have the instruction pointer address, which is the address of the instruction the computer was executing when the error was generated.

In this example, the IP is ffffffff817c69f0, then we can find where in the code this address corresponds to using the addr2line tool:

Where the path_to_your_kernel_tree is the path to the kernel code where you downloaded it (where you do make && make install)
The vmlinux file is the uncompressed version of the Linux image (it is created when you do make)

Note: if it doesn't work, check in make menuconfig whether your kernel is compiled with debug information (the DEBUG_INFO option)

But this technique won't work if the corresponding address is not part of the kernel core code (if it was an error caused by some module, as we will see in the call trace section).

The call trace

In the log above, line 23 means that the function devres_remove was called by devres_destroy, which was called by devm_kfree, which was called by vmc_cap_destroy and so on.

Now suppose that the vmc_cap_destroy function calls devm_kfree in many places and you want to know exactly which one has triggered the bug. Let's do this in 2 steps:

1) Get the offset in the code section of the function using nm:
nm is a tool to get the offset of a symbol in a section:

In this example, the offset of the vmc_cap_destroy function is 0x680

2) Get the file and the line using addr2line:
In line 27 we have: vmc_cap_destroy+0x40, where the 0x40 is the offset inside the vmc_cap_destroy function where the code execution would return if the call to devm_kfree hadn't triggered the bug.

So let's add the function offset we found in the previous step to the offset inside the function: 0x680 + 0x40 = 0x6c0
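
If you do this often, the addition can be scripted. The helper below is a small convenience sketch (its name and input format are assumptions for this example): it takes the symbol offset as printed by nm and a call trace entry, and returns the address to give to addr2line.

```python
def module_offset(symbol_offset_hex, trace_entry):
    """Add the symbol offset reported by nm to the '+0x..' offset of a call trace entry."""
    # trace_entry looks like 'vmc_cap_destroy+0x40'; a trailing '/0x..' size is ignored.
    inside = trace_entry.split("+", 1)[1].split("/", 1)[0]
    return hex(int(symbol_offset_hex, 16) + int(inside, 16))


print(module_offset("0x680", "vmc_cap_destroy+0x40"))  # 0x6c0
```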

And now we use addr2line with the compiled module.ko to find out where it is:

Then the buggy devm_kfree call is the first one just before the line 396 in file vmc-capture.c

Enable debug prints:

Besides the techniques above, you can enable the debug level prints in the dmesg log:

sudo sh -c "echo 8 > /proc/sys/kernel/printk"

If this doesn't work, check if the DYNAMIC_DEBUG flag is enabled in your menuconfig. If so, check the next section about Dynamic debug.

In the case of a module that I was testing (the vivid module), I needed to change the vivid_debug parameter:

We can do this when we start the module:

sudo modprobe vivid vivid_debug=8

Or by changing the parameter while it's already running:

sudo sh -c "echo -n 0 > /sys/module/vivid/parameters/vivid_debug"

Dynamic debug:

If your kernel is compiled with the DYNAMIC_DEBUG flag, then changing the printk level probably won't enable the debug prints in the dmesg log.

Let's say the module we are working on is called media; let's load it into the kernel:

sudo modprobe media

And now we will enable all the debug prints in the media module by sending 'module media +p' to the dynamic_debug/control file:

If you look at the dynamic_debug/control file content (as we did above) you can see the "=p" which means that the prints in those lines are enabled.

To disable the prints, we send 'module media -p' to the dynamic_debug/control file:

If you look at the dynamic_debug/control file content again you can see the "=_" which means that the prints in those lines are disabled.

Enabling a specific print

Instead of enabling all the debug prints of the entire module, we can enable all the prints on a specific file, or we can specify a file and a line:

$ sudo sh -c "echo -n 'file media-entity.c +p' > /sys/kernel/debug/dynamic_debug/control"

$ sudo sh -c "echo -n 'file media-entity.c line 301 +p' > /sys/kernel/debug/dynamic_debug/control"
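
The query strings used in this section all follow the same shape, so they can be generated by a small helper. This is only a convenience sketch with hypothetical names: it builds the string and does not write to the control file itself.

```python
def dyndbg_query(module=None, file=None, line=None, enable=True):
    """Build a dynamic debug query such as 'file media-entity.c line 301 +p'."""
    parts = []
    if module:
        parts += ["module", module]
    if file:
        parts += ["file", file]
    if line is not None:
        parts += ["line", str(line)]
    if not parts:
        raise ValueError("give at least a module, a file or a line")
    parts.append("+p" if enable else "-p")  # +p enables the prints, -p disables them
    return " ".join(parts)


print(dyndbg_query(module="media"))
print(dyndbg_query(file="media-entity.c", line=301))
print(dyndbg_query(module="media", enable=False))
```

The resulting string is what gets echoed into /sys/kernel/debug/dynamic_debug/control, as in the commands above.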

Read dmesg log continuously

Instead of calling dmesg every time to check the log, you can do:

$ tail -f /var/log/kern.log

and the log will be printed continuously.