Tuesday, October 18, 2011

Deeper C states and increased latency

Processor power consumption and thermal management states are commonly described as C-states, written Cx, where x ranges from 0 to n (n depends on the CPU type). In the C0 state the CPU is running. In C-states higher than 0 the CPU is stopped (it sleeps). Higher C-states mean greater power savings but also a longer delay when returning to C0 (higher latency). The ACPI specification describes C0 - C3, but most recent CPUs support more C-states. With several BIOSes, the higher C-states are mapped onto C3. The mapping can be done dynamically according to operating conditions (e.g. on some laptops C6 is mapped when running on battery and C4 when running on AC). An overview of the best-known C-states is given in Table 1 (an incomplete compilation of [1-4]).

Table 1: Overview of the most known C-states

State  Name                    Description
C0     Operating state         CPU is fully turned on and executing instructions. It can be in one of P-states (P0 - Pn) which defines operational voltage and frequency.
C1     Halt                    CPU main internal clocks are stopped. Bus interface unit and APIC are kept running at full speed.
C1E    Enhanced Halt           CPU main internal clocks are stopped and the CPU voltage is reduced. Bus interface unit and APIC are kept running. The frequency can be also reduced.
C2     Stop Clock              CPU internal and external clocks are stopped via hardware.
C3     Deep Sleep              CPU internal and external clocks are stopped, L1/L2 cache can be flushed.
C4     Deeper Sleep            CPU voltage is reduced.
C5     Enhanced Deeper Sleep   CPU voltage is reduced even more and the memory cache is turned off.
C6     Deep Power Down         Core states are saved into memory with low power consumption. It can reduce the CPU internal voltage to any value, including 0 V.
C7     Deeper Power Down *1    Same as C6 + flush of L3 cache.

As mentioned earlier, higher C-states mean not only lower power consumption but also higher latency. That is why several BIOSes use a different mapping for AC and battery. A simple experiment illustrates these claims.

Comparison of latency for AC / battery mode

First we let the Lenovo T500 idle on AC power. After a while we got the following powertop output:
Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        ( 0,8%)       Turbo Mode     2,7%
polling           0,0ms ( 0,0%)         2,81 Ghz     0,0%
C1 mwait          0,0ms ( 0,0%)         2,14 Ghz     0,0%
C2 mwait          0,3ms ( 0,3%)         1,60 Ghz     0,2%
C4 mwait          6,6ms (98,9%)          800 Mhz    97,2%
As you can see, the CPU spends most of the time in C4. We can check the latency reported by the kernel with the following command:
# cat /sys/devices/system/cpu/cpu0/cpuidle/state3/latency
The command reports 57, which means 57 microseconds (C4 is currently mapped to state3, the last state in the cpuidle subdirectory). Then we pinged the T500 over the LAN from another machine and got 401 (31) us. This is the average of 10 runs; the sample standard deviation is given in parentheses.
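To see how the kernel has mapped the C-states, all cpuidle states of a CPU can be listed at once rather than reading state3 alone. The following is a minimal sketch; the function name list_idle_states is made up for illustration, and it assumes each state directory provides the name and latency files of the cpuidle sysfs interface.

```shell
# List the cpuidle states of one CPU with their names and exit latencies.
# The optional argument is the cpuidle directory (defaults to cpu0).
# list_idle_states is a hypothetical helper name, not a standard tool.
list_idle_states() {
    base=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for d in "$base"/state*; do
        # Skip anything that does not look like a state directory.
        [ -r "$d/latency" ] || continue
        printf '%s\t%s\t%s us\n' "${d##*/}" "$(cat "$d/name")" "$(cat "$d/latency")"
    done
}
```

Calling list_idle_states without arguments prints one line per state for cpu0; the last line corresponds to the deepest state currently available.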

Next we performed the same experiment with the T500 running on battery. We got the following powertop output:
Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        ( 0.1%)       Turbo Mode     0.9%
polling           0.0ms ( 0.0%)         2.81 Ghz     0.0%
C1 mwait          0.1ms ( 0.0%)         2.14 Ghz     0.0%
C2 mwait          0.7ms ( 0.1%)         1.60 Ghz     0.0%
C6 mwait         58.9ms (99.8%)          800 Mhz    99.1%

Wakeups-from-idle per second : 18.8     interval: 15.0s
Power usage (ACPI estimate): 12.5W (5.3 hours)
As you can see, the CPU now spends most of the time in C6, so the CPU power consumption can drop close to zero. For the latency:
# cat /sys/devices/system/cpu/cpu0/cpuidle/state3/latency
The command reports 162, which means 162 microseconds (C6 is currently mapped to state3). The ping result: 493 (33) us. That is an increase of about 90 us compared to AC.
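The ping figures above are a mean over 10 runs with the sample standard deviation in parentheses. A minimal sketch of that calculation follows, assuming the round-trip times are available as one number per line on standard input; the helper name mean_sd is made up for illustration, and how the original measurements were collected is not shown here.

```shell
# Print "mean (sample standard deviation)" for numbers read from stdin,
# one value per line, rounded to whole units.
# mean_sd is a hypothetical helper name.
mean_sd() {
    awk '{ n++; s += $1; ss += $1 * $1 }
         END {
             if (n < 2) exit 1             # sample deviation needs >= 2 values
             m = s / n
             v = (ss - n * m * m) / (n - 1)  # sample variance (n - 1 divisor)
             if (v < 0) v = 0              # guard against rounding noise
             printf "%.0f (%.0f)\n", m, sqrt(v)
         }'
}
```

For example, feeding it ten round-trip times in microseconds produces output in the same form as the results quoted in this post.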

The intel_idle driver

The latency problem can be even worse if the intel_idle driver [5] is used. This driver has been included since kernel version 2.6.35. It is a native hardware driver for the latest Intel CPUs and supersedes acpi_idle on supported processors (currently Intel Atom, Intel Core i3/i5/i7 and the associated Intel Xeons). The intel_idle driver knows more than ACPI and can bypass the firmware / BIOS settings, so the processor can enter deeper power-saving states more aggressively. The intel_idle driver is built into the Fedora kernel by default and is activated automatically on boot (on supported CPUs). This results in higher power savings by default, but also in higher latency.

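If the increased latency is unacceptable, the maximum allowed C-state can be specified with the kernel command line parameter intel_idle.max_cstate, e.g. boot with intel_idle.max_cstate=1 to limit the driver to C1. Note that intel_idle.max_cstate=0 does not restrict the CPU to C0; it disables the intel_idle driver, and the kernel falls back to acpi_idle.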
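To check which cpuidle driver is active and which limit is in effect, the corresponding sysfs files can be read. The following is a minimal sketch; the function name idle_driver_info is made up for illustration, and it assumes the kernel exposes current_driver under the cpuidle directory and the max_cstate parameter under /sys/module/intel_idle, which may not be the case on every kernel build.

```shell
# Report the active cpuidle driver and, if present, the intel_idle limit.
# The optional argument replaces the /sys prefix (useful for testing).
# idle_driver_info is a hypothetical helper name.
idle_driver_info() {
    root=${1:-/sys}
    drv="$root/devices/system/cpu/cpuidle/current_driver"
    max="$root/module/intel_idle/parameters/max_cstate"
    if [ -r "$drv" ]; then
        echo "driver: $(cat "$drv")"
    else
        echo "driver: unknown"
    fi
    # The parameter file is assumed to exist only when intel_idle is built in.
    if [ -r "$max" ]; then
        echo "intel_idle.max_cstate: $(cat "$max")"
    fi
}
```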

The PM QoS kernel interface

For finer runtime control the PM QoS interface [6] can be used. Through this interface every process can register its latency requirement, and the cpuidle driver will not transition to deeper C-states if the lowest request would not be satisfied. The request is written as a four-byte signed integer (in native byte order, i.e. little-endian on x86) to /dev/cpu_dma_latency. The request stays valid as long as the file descriptor is held open. For example, to request a latency of at most 100 us, the following commands can be used:
# exec 3>/dev/cpu_dma_latency
# echo -ne '\0144\000\000\000' >&3
For the echo command, 100 (decimal) was written as 0144 (octal), followed by three zero bytes. Let's repeat our experiment:
Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        ( 0.2%)       Turbo Mode     0.1%
polling           0.0ms ( 0.0%)         2.81 Ghz     0.0%
C1 mwait          0.1ms ( 0.2%)         2.14 Ghz     1.3%
C2 mwait         39.7ms (99.6%)         1.60 Ghz     0.0%
C6 mwait          0.0ms ( 0.0%)          800 Mhz    97.3%

Wakeups-from-idle per second : 50.5     interval: 20.0s
Power usage (ACPI estimate): 14.9W (4.3 hours)
As you can see, the CPU no longer transitions to C6. As a side effect the power consumption increased by about 2.5 W (according to the ACPI estimate), and the ping result is somewhat better: 301 (30) us.

When the lower latency is no longer needed, we can remove the requirement from the kernel by closing the file descriptor:
# exec 3>&-
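The manual octal conversion and the open / write / close sequence can be wrapped in small helpers. The following is a minimal sketch under stated assumptions: the function names encode_le32 and with_latency and the PM_QOS_DEV override variable are made up for illustration, and the byte order is assumed to be little-endian as on x86.

```shell
# Write a non-negative integer to stdout as 4 little-endian bytes.
# The inner printf builds an octal escape string such as \144\000\000\000,
# the outer printf turns those escapes into raw bytes.
encode_le32() {
    v=$1
    printf "$(printf '\\%03o\\%03o\\%03o\\%03o' \
        $((v & 255)) $(((v >> 8) & 255)) $(((v >> 16) & 255)) $(((v >> 24) & 255)))"
}

# Run a command while holding a PM QoS latency request of $1 microseconds.
# The request is dropped when the subshell exits and fd 3 is closed.
# PM_QOS_DEV is a hypothetical override used only to test without root.
with_latency() {
    us=$1
    shift
    (
        exec 3>"${PM_QOS_DEV:-/dev/cpu_dma_latency}" || exit 1
        encode_le32 "$us" >&3
        "$@"
    )
}
```

For example, running with_latency 100 sleep 60 as root would hold a 100 us request for one minute and release it automatically afterwards.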
Next, let's set the required latency to 0. The powertop results:
Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        ( 0.2%)       Turbo Mode     0.0%
polling          26.4ms (99.8%)         2.81 Ghz     0.0%
C1 mwait          0.0ms ( 0.0%)         2.14 Ghz     0.0%
C2 mwait          0.0ms ( 0.0%)         1.60 Ghz     0.0%
C6 mwait          0.0ms ( 0.0%)          800 Mhz    99.9%

Wakeups-from-idle per second : 37.8     interval: 15.0s
Power usage (ACPI estimate): 16.9W (3.9 hours) (long term: 15.7W,/4.2h)
Note the increased power consumption, and that cpufreq was not affected, i.e. the CPU is running at its lowest speed. The ping result: 195 (34) us.

User control of PM QoS

The PM QoS interface is also proxied by the upower daemon, so the PM QoS settings can be controlled through its D-Bus interface (org.freedesktop.UPower.QoS).

Support was also recently added to the tuned latency-performance profile (currently only in upstream git [7], but it will probably be part of the future v0.2.22 release). To test it:
# tuned-adm profile latency-performance
The powertop results:
Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        ( 0.4%)       Turbo Mode   100.0%
polling          74.3ms (99.6%)         2.81 Ghz     0.0%
C1 mwait          0.0ms ( 0.0%)         2.14 Ghz     0.0%
C2 mwait          0.0ms ( 0.0%)         1.60 Ghz     0.0%
C6 mwait          0.0ms ( 0.0%)          800 Mhz     0.0%

Wakeups-from-idle per second : 13.4     interval: 5.0s
Power usage (ACPI estimate): 31.4W (2.1 hours)
Note that the power consumption nearly doubles and the CPU runs in Turbo mode most of the time. The ping result: 176 (32) us. Thus we got the lowest latency, but our setup is far from power efficient.

*1 Name deduced; no official name was found in the Intel specs.
[1] http://www.acpi.info/spec.htm
[2] http://www.hardwaresecrets.com/article/611
[3] http://www.intel.com/content/www/us/en/processors/xeon/xeon-e3-1200-family-vol-1-datasheet.html
[4] http://www.lesswatts.org/documentation/silicon-power-mgmnt/
[5] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=2671717265ae6e720a9ba5f13fbec3a718983b65
[6] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob_plain;f=Documentation/power/pm_qos_interface.txt;hb=HEAD
[7] http://git.fedorahosted.org/git/?p=tuned.git;a=commit;h=31639fdec76b294fd67c78ec332fe26bf0ad7bb9
