Stop The Ticks!

After those previous hurdles, I’d like to talk about what I’m working on currently 🙂
Busy CPUs always have a periodic tick interrupting them to update scheduling stats, trigger load balancing, preempt tasks and update the clock. When a CPU goes idle none of this is required, so with the CONFIG_NO_HZ_IDLE kernel configuration the tick is stopped on that CPU. The kernel is therefore designed so that when stats are missing, or there is no scheduling activity, it assumes the CPU was idle during that period.

However, when nohz_full was introduced, the field changed. Periodic ticks were no longer required to interrupt nohz_full CPUs while they were running a single task. Why it is sensible to stop ticks on such CPUs has been discussed before. We could have stopped periodic ticks altogether, but for some constraints. If a remote CPU reads the load on a nohz_full CPU while it is running one task, it reads a stale value: the tick is not running to update the load on the nohz_full CPU, although that CPU is running a task.
At best, the remote CPU would work with stale values, and at worst this could lead to crashes. So today we still need a tick running on the nohz_full CPU, although it can fire much less often. We can afford that because the tick only updates some statistics, and having a stale value for a second will hopefully not cause serious issues.
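
To make the staleness problem concrete, here is a toy model in Python (not kernel code). It assumes a hypothetical `CpuLoad` record that is only refreshed when a tick runs, and a remote reader that treats every missed tick as idle time, as described above. The names and the decay factor are assumptions made up for illustration.

```python
from dataclasses import dataclass

DECAY_PER_MISSED_TICK = 0.5  # assumed decay factor, purely illustrative


@dataclass
class CpuLoad:
    load: float       # last load value recorded by the tick
    last_update: int  # tick number at which it was recorded


def remote_read(stats: CpuLoad, now: int) -> float:
    """Read another CPU's load, treating every missed tick as idle time."""
    missed = now - stats.last_update
    return stats.load * (DECAY_PER_MISSED_TICK ** missed)


# A nohz_full CPU is running one task flat out, but its tick stopped at tick 100.
busy_cpu = CpuLoad(load=1.0, last_update=100)
print(remote_read(busy_cpu, now=100))  # value right after the last tick
print(remote_read(busy_cpu, now=104))  # four ticks later: looks almost idle
```

The CPU never stopped working, yet the reader's view of its load shrinks with every missed tick. That is the confusion between "idle" and "busy but tickless" that the residual tick currently papers over.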

Now we want to get rid of this residual tick on the nohz_full CPUs as well. One option is that whenever the CPU or task load stats are read for nohz_full CPUs, the callers are made aware that the numbers are lagging and calculate the pending updates themselves. This is hard, since we would need to identify each of the callers and find ways to help them distinguish between idle CPUs and nohz_full CPUs running single tasks.

The easier option would be to offload the job of updating the load stats on nohz_full CPUs to the housekeeping CPU. The housekeeping CPU is already doing the timekeeping duty in the nohz_full environment. It now needs to do some additional work on behalf of the nohz_full CPUs.
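
As a rough sketch of the offloading idea (again a toy model, not the actual kernel change), the housekeeping CPU could refresh the stats of the tickless CPUs from its own tick. The function name `housekeeping_tick`, the dictionaries and the 1.0/0.0 load values are all hypothetical.

```python
def housekeeping_tick(now, loads, running, nohz_full_cpus):
    """On the housekeeping CPU's tick, refresh load stats for tickless CPUs.

    loads:   dict cpu -> {"load": float, "updated_at": int}, updated in place
    running: dict cpu -> True if that CPU is currently running its task
    """
    for cpu in nohz_full_cpus:
        # Toy load value: 1.0 while the single task runs, 0.0 when idle.
        loads[cpu] = {"load": 1.0 if running[cpu] else 0.0, "updated_at": now}
    return loads


loads = housekeeping_tick(
    now=104,
    loads={},
    running={2: True, 3: False},
    nohz_full_cpus=[2, 3],
)
print(loads)
```

With this, a remote reader always finds a recent timestamp for CPUs 2 and 3, without either of them ever taking a tick.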

So this is what I’m working on now! Hopefully once we’re done, the kernel will handle HPC workloads much better 🙂

Timers and more timers!

After completing the migration of unpinned timers to non-nohz_full CPUs, this week I moved on to pinned timers. The idea, as I’ve mentioned before, is to reduce OS jitter or internal work on CPUs which need to concentrate on single HPC tasks.

‘Time you enjoy wasting is not wasted time.’ J.R.R. Tolkien

When a CPU is idle, or in other words is in a deep idle state in which it conserves more power, we don’t want to disturb it if we can afford not to.

This applies to timers and tasks (userspace and kernel space) that we’d like to defer or shift to a non-idle CPU.

Deferrable timers are those that can be ignored sometimes. They are not “cancelled”, they are “ignored”. We know that the periodic tick fires at its set frequency and does statistical work like figuring out the load on the CPUs, and schedules and preempts tasks accordingly. When the system is idle, the frequency of these ticks is reduced. While programming the clock to fire, deferrable timers are ignored and the clock is scheduled to fire at the next non-deferrable timer’s expiry time. If the queued timers are all deferrable, the ticks can be stopped altogether as the CPU is idle.
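
The rule for programming the clock can be sketched in a few lines of Python. This is only an illustration of the idea, not how the kernel implements it; the `Timer` type and `next_wakeup` function are made-up names.

```python
from typing import NamedTuple, Optional, Sequence


class Timer(NamedTuple):
    expires: int      # expiry time, in arbitrary time units
    deferrable: bool  # True if the timer may be ignored while idle


def next_wakeup(timers: Sequence[Timer]) -> Optional[int]:
    """Return when an idle CPU must next wake up, or None if it need not."""
    firm = [t.expires for t in timers if not t.deferrable]
    # Deferrable timers are skipped; with no firm timers the tick can stop.
    return min(firm) if firm else None


queued = [Timer(10, deferrable=True), Timer(50, False), Timer(30, False)]
print(next_wakeup(queued))                   # 30: the timer at 10 is ignored
print(next_wakeup([Timer(10, True)]))        # None: nothing forces a wakeup
```

The deferrable timer at 10 is not cancelled. It simply does not count when the idle CPU decides how long it can sleep.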

An example of this would be the intel_pstate timer. The Intel P-state driver manages CPU frequency by handling the P-states (performance states) that CPUs go into. If the CPU is idle, there is no load to evaluate on it and the intel_pstate timer can be safely ignored.

So now we will move on to see whether we can delay or move pinned timers so as to not disturb nohz_full CPUs.

CPU Idle States

In this blog post, I’d like to talk about some of what I’ve learned in order to properly understand the problem I’m addressing in the Full Dynticks implementation in the Linux kernel. I spoke about this in my previous blog post, so here I’ll just give a summary:

The idea is to affine unpinned timers to non-nohz_full CPUs, so that the nohz_full (adaptive-tick) CPUs are not disturbed by repeated preemption of their tasks.

Now, each CPU can be made to go into low power modes, or C-states, in order to save energy. The basic idea of these modes is to cut the clock signal (or interrupts) and power from idle units inside the CPU. The more units you stop (by cutting the clock), run at reduced voltage or even shut down completely, the more energy you save. The downside is that the deeper its state of sleep, the more time your CPU needs to wake up!
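
This trade-off between power saved and wake-up time can be sketched as a simple selection rule: pick the lowest-power state whose exit latency still fits the latency we can tolerate. The Python below is a simplified illustration, not the kernel's cpuidle governor, and the state table values are invented.

```python
from typing import NamedTuple, Optional, Sequence


class IdleState(NamedTuple):
    name: str
    exit_latency_us: int  # time needed to wake up from this state
    power_mw: int         # power drawn while in this state


# Made-up numbers: deeper states draw less power but wake up more slowly.
STATES = [
    IdleState("C1", 2, 1000),
    IdleState("C3", 80, 500),
    IdleState("C6", 200, 200),
]


def pick_state(states: Sequence[IdleState],
               latency_budget_us: int) -> Optional[IdleState]:
    """Return the lowest-power state that can wake up within the budget."""
    allowed = [s for s in states if s.exit_latency_us <= latency_budget_us]
    if not allowed:
        return None  # no state is fast enough; stay out of idle states
    return min(allowed, key=lambda s: s.power_mw)


print(pick_state(STATES, 100))   # C3: C6 would take too long to wake up
print(pick_state(STATES, 1000))  # C6: plenty of time, take the deepest state
```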

Consider this:
+---------------------+
|  +------+ +------+  |
|  | O  O | | O  O |  |
|  +------+ +------+  |
+---------------------+

The big box is the package, the two smaller boxes are the cores – Core 0 and Core 1, while the circles are the threads.

On my system (Intel® Core™ i7-5500U CPU @ 2.40GHz × 4) :
Package level has 4 idle states: C2, C3, C6, C7
Core level has 3 idle states: C3, C6, C7
Thread level has 6 idle states: C0, POLL, C1E, C3, C6, C7

Let’s discuss the CPU idle states now:

C0

A busy CPU is in an active or C0 state. Note that there are also P-states which are execution power saving states, but we’ll not talk about them here.

C1

The first idle state is the C1 or HLT (“Halt”) state. All x86 CPUs have an instruction called “HLT”, after which the CPU becomes idle and doesn’t run anything until it receives a hardware interrupt. This was the first power saving mode to be introduced; in it the internal clock signal is stopped (keeping, of course, a path for urgent signals intact).

POLL

POLL isn’t a real idle state, in that it does not save any power. Instead, a busy-loop is executed doing nothing for a short period of time. This state is used if the kernel knows that work has to be processed very soon and entering any real hardware idle state may result in a slight performance penalty.

C3

In C1, two internal CPU units are kept running: the bus interface unit and the APIC (Advanced Programmable Interrupt Controller). These units are kept running so the CPU can deal with important requests coming from the CPU external bus and can handle interrupts.

Sleep (C3) cuts the internal clock signals of the CPU, including the clocks of the bus interface unit and of the APIC. This means that when the CPU is in Sleep mode, it can’t answer important requests coming from the CPU external bus, nor interrupts.

You can observe the idle stats on your system using Powertop:

sudo apt-get install powertop

Run it with

sudo powertop

Switch to the idle stats tab!
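
If you would rather look at the raw numbers, Linux also exposes per-CPU idle states through sysfs under `/sys/devices/system/cpu/cpuN/cpuidle/`. Here is a small Python sketch that reads them. It assumes your kernel has cpuidle enabled and provides the `name`, `latency`, `usage` and `time` files in each `stateN` directory; the function name `read_idle_states` is my own.

```python
from pathlib import Path


def read_idle_states(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """Return a list of dicts describing the cpuidle states of one CPU."""
    state_dirs = sorted(
        Path(cpu_dir).glob("cpuidle/state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),  # state2 before state10
    )
    states = []
    for d in state_dirs:
        states.append({
            "name": (d / "name").read_text().strip(),
            "latency_us": int((d / "latency").read_text()),  # exit latency
            "usage": int((d / "usage").read_text()),         # times entered
            "time_us": int((d / "time").read_text()),        # total residency
        })
    return states


if __name__ == "__main__":
    # Prints nothing if the cpuidle directory does not exist on this machine.
    for state in read_idle_states():
        print(state)
```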

Outreachy Acceptance

Hey everyone!

Of late, I’ve been facing a world of hesitation because I don’t know how to effectively start my first blog post! I find myself staring at the computer screen for several hours and procrastinating, mainly because I am clueless about how to connect the dots and start writing it already.

On the 27th of April, I found out that I was accepted for the Outreachy program to intern in the Linux Kernel for the upcoming three months from May to August! 🙂

Starting from the beginning, let’s talk about Outreachy! It is a program wherein anyone aspiring to venture into FOSS can look into the participating organizations’ code and propose projects after discussions with the mentors on IRC (Internet Relay Chat).

Since being introduced to Computer System Organizations in college, I’ve been interested in learning more about it. You can find the patches I submitted during the application period here:

Now, coming to my project: I will work on the Linux kernel’s Full Dynticks system, also known as CONFIG_NO_HZ_FULL. CONFIG_NO_HZ_FULL support is of particular interest to real-time Linux users, for whom it reduces workload latency. It can also be of great benefit to HPC (High Performance Computing) workloads where there is only one task running (performance improvements of maybe 1%), and it can help desktop and mobile users where there is just one CPU task active on a given core.

The idea is to delay periodic interrupts or ‘ticks’ whenever possible. Every time the tick fires, a check is made to see if there are expired timers which are then executed. Some of these timers are pinned which means that they are executable only on specific CPUs while some are not which means they can execute on any CPU.

This is called timer affinity.

If a non-pinned timer is queued on a full dynticks CPU (one where the ticks are delayed), the tick will fire on that CPU in order to run the timer, which is undesirable as we wish to minimize disturbances on these CPUs.

The task I have as of now is to affine non-pinned timers to the CPUs that are not in full dynticks mode.
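
As a rough illustration of the decision involved (a Python sketch of the idea, not the kernel code I will be writing), pinned timers stay where they are, while non-pinned timers queued from a full dynticks CPU are redirected to a CPU that still has its tick. The function name and the choice of the lowest-numbered housekeeping CPU are assumptions.

```python
def target_cpu(pinned, current_cpu, nohz_full_cpus, housekeeping_cpus):
    """Choose the CPU on which a newly queued timer should run."""
    if pinned:
        return current_cpu  # pinned timers must stay on their CPU
    if current_cpu not in nohz_full_cpus:
        return current_cpu  # this CPU keeps its tick anyway, nothing to gain
    # Non-pinned timer on a full dynticks CPU: move it elsewhere.
    # Picking the lowest-numbered housekeeping CPU is an arbitrary choice.
    return min(housekeeping_cpus)


nohz_full = {1, 2, 3}
housekeeping = {0}
print(target_cpu(False, 2, nohz_full, housekeeping))  # 0: timer is moved
print(target_cpu(True, 2, nohz_full, housekeeping))   # 2: pinned, stays put
```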

I am looking forward to working on this project with the guidance of my mentors! I will update the blog regularly to keep in sync with the work I’m doing.