
What time is it?

A man with a watch knows what time it is. A man with two watches is never sure.

— Segal's Law

One of the least important services an operating system provides is the current time. Least important? Well, it’s important for some use cases. People like to have a clock displayed somewhere on the computer, almost always visible. Games and gamers like to know how long it takes to render a frame. Software engineers like to know exactly how long certain operations take. Network engineers and system administrators like to know various time-related metrics about their networks, such as round-trip latency. But most programs really don’t care about time at all! And that’s a good thing.

Time is an extraordinarily complex, surprisingly technical subject, and it is hard to handle time correctly in computers. Why? Modern theories of how the universe works tell us that time is relative. Relativity predicts, and experiments confirm, that gravity and velocity both conspire to cause time to vary depending on the reference frame of the observer. For the most part, we can safely ignore this when measuring time on the surface of the Earth. But not always! GPS clocks need to take into account relativistic effects due to gravitational difference between their orbit and the surface, and their high speed. Without these "corrections" (which are only "corrections" because GPS doesn’t use an actual relativistic model, much to this scientist’s dismay), position estimates would accumulate errors at the rate of kilometers a day!

Fortunately, a lot of smart people have mostly worked out the kinks in getting a consistent, global reference clock. This lets us ignore relativity and pretend that our spacetime is Newtonian. That clock is International Atomic Time (TAI), and is maintained by the International Bureau of Weights and Measures (BIPM), an international organization. Coordinated Universal Time (UTC) is derived from TAI by adding leap seconds, to more closely approximate the time we would get by using a sundial: mean solar time. Mean solar time is what is largely relevant to us meatsacks. Assuming that there are global, high-accuracy time sources available, how does that get into our computers and how do we use it? Should we use it?

If you’re interested in the nitty-gritty of timekeeping, I highly recommend this 88-page book. I used it when researching this article and found it highly informative. It’s a tad old, but really not much has changed.

To answer the question of timekeeping for Robigalia, eternaleye and I have created the tempo crate. Here, I’ll describe the API and the implementation. For other operating systems, things can look a bit different.

Computer Timekeeping

There are two main sorts of things one does with time on a computer: receive a notification once some amount of time has passed (timers) and figure out what time it is (timestamping). We’ll discuss these in turn.

Timers

Timers are very easy to implement with a bit of hardware support. One possible implementation is a clock in the system which decrements a counter at some fixed frequency. When the counter reaches 0, the timer generates an interrupt which the OS can handle in its own way. Configuring this sort of timer comes down to programming the divider and counter value. They usually come in two modes: periodic and one-shot. Periodic timers reset the counter to some specific value after the interrupt is generated. One-shot timers don’t reset the counter, and don’t do anything until the OS reprograms them to do something else. Another possibility is to have a counter which counts up at some specific frequency. A "deadline" is programmed into the timer, and it generates an interrupt once the counter reaches or exceeds that deadline.
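
As a concrete (if simplified) sketch, the downcounter model above boils down to something like this. The names are hypothetical and don't correspond to any particular device's registers:

/// The two modes described above, for a hypothetical countdown timer.
enum TimerMode {
    /// Hardware reloads the counter after each interrupt.
    Periodic,
    /// The counter stays at 0 until the OS reprograms it.
    OneShot,
}

/// Counter value that reaches 0 after roughly `interval_ns`, given the
/// oscillator frequency and the programmed divider. A real driver would write
/// this value, the divider, and the mode into device registers.
fn counter_for_interval(input_hz: u64, divider: u64, interval_ns: u64) -> u64 {
    let ticks_per_sec = input_hz / divider;
    ticks_per_sec * interval_ns / 1_000_000_000
}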

Handling timers is more or less completely separate from the concerns of tempo, which is timestamps. Timers are handled by the chuudan (Japanese "中断", "interruption") crate, which I’ll write about later.

Timestamps

Timestamps are used to determine at what time some event occurred. Timestamps can be either relative or absolute. An example of a relative timestamp is the number of milliseconds that have elapsed since system boot. An example of an absolute timestamp in TAI is a 6-tuple of (year, month, day, hour, minute, second). This is actually still technically relative, but has a scope of the entire Earth with a standardized epoch. That’s as absolute as you can get. Timestamps don’t need to be related to physical time. They could, for example, be the state of a vector clock or interval tree clock. These sorts of logical clocks provide a notion of causality, not of time. This turns out to be hugely important in distributed systems, where synchronized timekeeping is very expensive and, at the largest scales, impossible (due to relativity). Here we consider only temporal clocks.

In tempo, both relative and absolute times are supported, with the Tick and Tai types respectively. These represent a specific instant in time. There are also the TickDifference and TaiDifference types, which represent the time elapsed between two different timestamps, or a duration. Converting from a Tick to a Tai is done in two steps. First, you divide the Tick by the ratio of TickDifference to TaiDifference returned by get_ratio. This ratio represents the frequency at which Tick increases (ticks per some number of nanoseconds). After this division, the number you get is the amount of "real time" that has passed since the local clock started ticking. To get the absolute time, you add to it an offset which aligns tick 0 with the TAI epoch. We also maintain start times for three other particularly notable epochs as Tick: when the machine started, when the process was created, and when the thread was created. These are retrieved with the get_start function.

There are some complications in this setup. For example, what if the user puts the system into hibernation? At that point in time, the local clock stops ticking entirely. This would make the ratio completely skewed! To handle this, we introduce the idea of discontinuities into the mix. A discontinuity represents a period of time where the system clock was stopped for some reason. Right now, the only reasons we have are:

  • Suspend, representing that the system was put in some low-power state where the clock couldn’t be running

  • Wraparound, representing that the system clock overflowed.

These are relatively rare events, but it’s important to account for them. If you can think of other cases, let us know by either tweeting @cmrx64 or bringing it up in #robigalia on Freenode!
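
In code, a discontinuity report might look roughly like this. This is a sketch based on the API shown later; the real definitions live in the tempo source, and I'm assuming Tick and TaiDifference are importable from the crate root:

use std::ops::Range;
use tempo::{TaiDifference, Tick};

/// Why the local clock stopped counting (sketch).
pub enum DiscontinuityKind {
    /// The system was in a low-power state where the clock couldn't run.
    Suspend,
    /// The hardware counter overflowed and wrapped around.
    Wraparound,
}

/// One reported discontinuity: what happened, roughly where in the tick
/// stream it happened, and an estimate of how much real time passed while
/// the clock was stopped.
pub type Discontinuity = (DiscontinuityKind, Range<Tick>, TaiDifference);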

The last hitch we care about is clock source changes. Maybe a particular clock was found to be unreliable, and was replaced with another. Or maybe a thread was migrated to a different core and the clock available to that core is slightly different from the one on the previous core. To compensate for this, we also include a current clocksource_id, indicating what hardware clock source a thread has access to.

Synchronization

Local relative clocks can be pretty good, losing only a few seconds a week (Intel specifies an accuracy of no worse than 100ppm in their Xeon datasheets, which, worst case, would correspond to ~1 minute a week). How do we convert that into an absolute clock? How do we make sure that our internal clock reflects proper time? The answer is… it’s complicated. Time synchronization is worthy of its own blog post, but I’ll cover the basics here.

Right now, we use RADclock's algorithm for this. These slides have an overview. Our get_now returns a Tick, which is a raw TSC measurement. The difference clock is acquired by dividing by get_ratio, which is an estimate of the frequency of the TSC (TSC ticks per some number of nanoseconds). The absolute clock is acquired by further adding get_wall_boot to the difference clock.

I’ve also looked at ntimed, which is certainly worthy of consideration. All of these use the widely successful Network Time Protocol as their source of reference clocks. Running large clusters of computers without clock synchronization is a surefire way to achieve sadness, especially as some cryptographic and filesystem protocols in practice need a decent estimate of the current time.

Other alternatives in this space are httpstime and roughtime, with very different characteristics and goals.

Security

No component of Robigalia would be complete without an analysis of its security properties. One of the things we care about for the long-term future is the ability to model processes as completely deterministic, pure functions of the messages they receive. In the current implementation of tempo, we avoid using IPC to get the current time, for performance reasons. Instead, we use the TSC, discussed below. Conceptually, getting the current time can still be modeled as IPC, so there are no problems here.

Are there any additional problems that come when a process can learn what time it is? Well, one possibility is mounting timing side-channel attacks on other processes in the system, by (ab)using properties of various shared hardware resources. If a process doesn’t have a high-frequency, high-quality source of time, it becomes challenging, if not impossible, to mount timing side-channel attacks. One possible mitigation is simply denying processes access to the clock. This is necessary, but not sufficient, to prevent timing attacks! Using shared memory and at least two threads, it’s easy to simulate a clock with good enough resolution to mount timing attacks. This has recently reared its ugly head in JavaScript and the new SharedArrayBuffer type.
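
As a sketch of why denying clock access alone isn't enough: with nothing but shared memory and a second thread, a process can build a perfectly serviceable clock. Plain std Rust, no tempo involved:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Spawn a thread that does nothing but increment a shared counter. Any other
/// thread can read the counter to get a "timestamp" with surprisingly good
/// resolution, no system clock required.
fn makeshift_clock() -> Arc<AtomicU64> {
    let counter = Arc::new(AtomicU64::new(0));
    let ticker = Arc::clone(&counter);
    thread::spawn(move || loop {
        ticker.fetch_add(1, Ordering::Relaxed);
    });
    counter
}

fn main() {
    let clock = makeshift_clock();
    let before = clock.load(Ordering::Relaxed);
    // ... the operation being timed would go here ...
    let after = clock.load(Ordering::Relaxed);
    println!("elapsed: ~{} increments", after - before);
}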

For this reason, processes by default don’t get access to a system clock, in the interest of least privilege. Of course, a process can be configured to have access to the system clock. For POSIX processes, this will probably be the default. We haven’t decided what to do about protecting against timing side-channels in these cases. Timing side-channel mitigation tends to be expensive or impossible on modern systems. The only recourse is probably to use software carefully written to avoid timing side channels, where that matters. In the future, I’m looking forward to Sanctum or other hardware extensions, for completely mitigating software attacks against critical components.

Knowing the current clocksource_id is a minor leak. With it one can observe the behavior of the thread’s scheduler. I don’t think this is meaningfully exploitable though, and it certainly doesn’t expose any information that a thread can’t infer by carefully watching a high-resolution clock.

Hardware clocks

In your typical x86 PC, there are at least 6 different clocks available: the real-time clock (RTC), the Programmable Interval Timer (PIT), the Local Advanced Programmable Interrupt Controller (LAPIC) timer, the ACPI power management timer, the High Precision Event Timer (HPET), and the Time Stamp Counter (TSC). We’ll look at these in turn and characterize their performance.

On other, less heavily standardized platforms (cough ARM cough), the variety of clocks is enormous and somewhat terrifying. We won’t consider those here, although if anyone has any concise information about timekeeping on ARM I’d love to see some pointers!

Real-time clock

The real-time clock (RTC) was added to the IBM PC AT and was originally a separate chip (MC146818), but these days it is part of the "chipset". It also contains at least 64 bytes of CMOS RAM which the firmware uses for storing various configuration options that really ought to persist past power loss. The chip has its own battery, which keeps the clock running even when the system is off and disconnected from power. It is accessed in software using IO ports at the well-known addresses 0x70 and 0x71. First, a register number is written to port 0x70, and some time later a value will be available for reading on port 0x71. This access scheme is a holdover from the original RTC chip. The RTC stores and increments seconds, minutes, hours, day of week, day of month, month, and year. As such, its precision is rather poor. It can also be programmed to generate periodic interrupts and wake up the machine at a specific time, but that functionality isn’t really used much anymore. Due to the one-second precision, it’s not terribly useful as a clock either. What it is useful for is figuring out what time it is at boot, to provide initial estimates before calibration with some other clock (usually using NTP) is possible.

There are some other complexities with using the RTC. It doesn’t store a time zone. Its year on older systems only contained two digits, which made it suffer from Y2K. Modern systems, with ACPI, have a separate century counter. There are also some timing constraints when using it. For several hundred microseconds when the clock updates, it is inaccessible. All of this conspires to make this a very annoying device. Not terrible, but annoying.
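
Here's a minimal ring-0 sketch of reading the RTC's seconds register. The register numbers come from the MC146818 datasheet below, and the port I/O is done with inline assembly, so this only makes sense inside a kernel or driver that is allowed to touch I/O ports; it is not code from Robigalia:

/// Read one CMOS/RTC register via ports 0x70 (index) and 0x71 (data).
unsafe fn cmos_read(reg: u8) -> u8 {
    let value: u8;
    core::arch::asm!(
        "out 0x70, al", // select the register
        "in al, 0x71",  // read its current value
        inout("al") reg => value,
        options(nomem, nostack),
    );
    value
}

/// Read the current second, waiting out the update window mentioned above.
unsafe fn rtc_seconds() -> u8 {
    // Status register A, bit 7: "update in progress". The time registers are
    // unreadable while it is set.
    while cmos_read(0x0A) & 0x80 != 0 {}
    cmos_read(0x00) // seconds; may be BCD depending on status register B
}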

Source: http://www.nxp.com/assets/documents/data/en/data-sheets/MC146818.pdf

Programmable Interval Timer

The programmable interval timer (PIT) has an internal oscillator that runs at roughly 1193182Hz (~1.193MHz), and feeds into a 16-bit downcounter. It’s "programmable" in that one can poke a 16-bit value into IO port 0x40 and the PIT will generate an interrupt on IRQ0 once the internal counter reaches 0, starting from that value. There are two modes for this: it will either immediately reset the internal counter back to the value that was poked (periodic), or wait for software to tell it to start counting again (oneshot). There are some divisors thrown in, to reduce the clock rate. Various frequencies from ~18.2Hz to the maximum clock rate are achievable. There are also additional modes that aren’t used at all in modern systems.
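
A sketch of programming it, for flavor. The mode/command byte goes to port 0x43 and the 16-bit reload value to port 0x40, per the osdev wiki source below; this is illustrative ring-0 code, not anything from Robigalia:

const PIT_HZ: u32 = 1_193_182;

/// Write one byte to an I/O port (ring 0 only).
unsafe fn outb(port: u16, val: u8) {
    core::arch::asm!("out dx, al", in("dx") port, in("al") val, options(nomem, nostack));
}

/// Program channel 0 as a periodic rate generator firing roughly `hz` times a
/// second. The 16-bit reload value acts as the divisor: the largest divisor
/// (65536) gives the ~18.2Hz floor mentioned above.
unsafe fn pit_set_periodic(hz: u32) {
    let reload = (PIT_HZ / hz).clamp(1, 65_535) as u16;
    outb(0x43, 0x34);                  // channel 0, lobyte/hibyte access, mode 2
    outb(0x40, (reload & 0xFF) as u8); // low byte of the downcounter
    outb(0x40, (reload >> 8) as u8);   // high byte
}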

This isn’t used very often these days, but when it is used, it is usually as the primary event source that determines when a running thread should be preempted. In that use case, on non-realtime systems, the relatively poor granularity (~0.83us) doesn’t hurt much. But programming the PIT takes ages (over a microsecond), so it’s avoided these days.

Source: http://wiki.osdev.org/PIT, http://www.mcamafia.de/pdf/ibm_hitrc05.pdf

ACPI PMT

The ACPI power management timer (PMT) is an optional 24- or 32-bit counter running at a frequency of 3579545Hz (~3.57MHz). It can generate an interrupt whenever the most significant bit of the counter changes state, to allow software to emulate a wider counter. It is available as either a location on the IO bus or somewhere in the physical address space as memory-mapped IO. Exactly where it is placed is determined by the firmware and motherboard, and indicated by the ACPI tables. When it’s exposed on the IO bus, it suffers from slow (over a microsecond) reads. This timer is OK for timestamps, but only used if the TSC is unavailable. The ACPI 6.1 specification recommends using the HPET over the ACPI PMT where it’s available.
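
The counter-widening trick that the MSB interrupt enables looks roughly like this. It's a sketch assuming the 24-bit variant; `raw` would come from reading the PMT's IO or MMIO location, and readings must happen at least once per wrap period:

/// Software extension of the PMT's 24-bit counter into a 64-bit timestamp.
struct PmtClock {
    last_raw: u32, // last 24-bit reading
    wraps: u64,    // number of observed wraparounds
}

impl PmtClock {
    fn update(&mut self, raw: u32) -> u64 {
        if raw < self.last_raw {
            // The counter went backwards, so it must have wrapped since the
            // last reading (or MSB-change interrupt).
            self.wraps += 1;
        }
        self.last_raw = raw;
        (self.wraps << 24) | u64::from(raw)
    }
}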

Source: http://uefi.org/sites/default/files/resources/ACPI_6_1.pdf

HPET

The High Precision Event Timer (HPET) was previously known as the "multimedia timer". It was introduced by Intel in 2000 and started shipping in real chips around 2005. It’s a one-stop shop for all sorts of timing needs. The spec "recommends" a minimum of a 64-bit counter, with a frequency of at least 10MHz, a drift of no more than 500ppm, and at least 3 comparators, at least one of which must support periodic timers, with various options for delivering interrupts. What a mouthful! On my system, an Intel Xeon E3-1230v2, Linux detects the HPET as having 8 comparators, a 64-bit counter, and a frequency of 14.318180MHz.

The HPET is architected as a single counter register, which feeds into 3 to 32 timers. Each of these timers consists of a comparator, an enable/disable bit, and some miscellaneous configuration. Software can program a value into the comparator, and the timer will generate an interrupt when the counter is equal to that value. This is problematic. If you want to program an interrupt to occur in the near future, and the counter passes that value while your OS is faffing about doing what it does, you’ll never get an interrupt. It would have been kinda nice if the comparators were level-based (using greater-than) instead of edge-based (using equals). Oh well. The HPET is fraught with race conditions all around, but this is the nastiest one.
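
The usual way to cope with that race is to re-check the main counter after programming the comparator. A sketch, with register offsets taken from the Intel spec linked below (not tempo or Robigalia code, and with setup and interrupt routing omitted):

/// Arm HPET timer 0 to fire once when the main counter reaches `deadline`.
/// Returns false if the counter had already passed the deadline by the time we
/// re-checked; in that case no interrupt will ever arrive, and the caller has
/// to handle the expiry itself.
unsafe fn hpet_arm_oneshot(hpet: *mut u64, deadline: u64) -> bool {
    let timer0_comparator = hpet.add(0x108 / 8); // timer 0 comparator register
    let main_counter = hpet.add(0xF0 / 8);       // free-running main counter
    timer0_comparator.write_volatile(deadline);
    // The comparator fires on equality, so if the counter is still below the
    // deadline here, equality (and thus the interrupt) is still ahead of us.
    main_counter.read_volatile() < deadline
}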

Due to the annoying race conditions and BIOS bugs, apparently neither Windows 8+ nor Linux will use it if anything else is available, except as a replacement for the RTC timers ("legacy replacement mode").

Source: http://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/software-developers-hpet-spec-1-0a.pdf

LAPIC Timer

The Local Advanced Programmable Interrupt Controller (LAPIC) is a piece of hardware associated with each logical processor, which corresponds to a thread in a system with hyperthreading. The LAPIC timer has a divider, a 32-bit initial count, a 32-bit current count, and a mode. It operates by feeding the same clock that drives the TSC into a divider, which reduces the frequency by any power of two up to 128, and uses that to decrement the current count. When the current count reaches 0, an interrupt is generated for the corresponding logical processor. If set to periodic mode, the current count is reset to the initial count. Otherwise, it stays at 0 until reprogrammed. There is another mode, TSC-deadline, that operates completely differently. When in TSC-deadline mode, software programs a value into the IA32_TSC_DEADLINE model-specific register (MSR). When the TSC is greater than or equal to that value, the LAPIC generates an interrupt. This allows absolute timers, not just relative timers. The seL4 realtime (RT) kernel uses this to decide when to reschedule, as does Linux, when it’s available.
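
Arming a TSC-deadline timer is pleasantly simple. A sketch: the MSR number is from the Intel SDM, and putting the LAPIC timer into TSC-deadline mode beforehand is assumed to have already happened:

const IA32_TSC_DEADLINE: u32 = 0x6E0;

/// Write a 64-bit value to a model-specific register (ring 0 only).
unsafe fn wrmsr(msr: u32, value: u64) {
    let lo = value as u32;
    let hi = (value >> 32) as u32;
    core::arch::asm!("wrmsr", in("ecx") msr, in("eax") lo, in("edx") hi, options(nomem, nostack));
}

/// Request an interrupt once the TSC reaches `deadline` (an absolute value,
/// not a countdown), assuming the LAPIC timer is already in TSC-deadline mode.
unsafe fn arm_tsc_deadline(deadline: u64) {
    wrmsr(IA32_TSC_DEADLINE, deadline);
}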

This is the best timer available on the system, as it uses the best clock on the system: the TSC. Let’s talk about that next.

Source: https://software.intel.com/en-us/articles/intel-sdm, §10.5.4, page 3032 of combined PDF (page 10-16).

TSC

Except for the RTC, the devices above are primarily timers, not sources of timestamps. Using a timer, you can emulate a timestamp by incrementing some counter every so often (following "oscillator + counter = clock"). This isn’t ideal, as it imposes software overhead on a function easily implementable by hardware. Enter the TSC. The TSC is the crème de la crème source of timestamps on any x86 system. It runs at the highest frequency in the system, and is very fast to access (less than 100 cycles) compared to the other clock sources. It’s but an RDTSCP instruction away! It used to count actual cycles of the CPU, but as of Nehalem, the TSC is "invariant", meaning it counts at a constant frequency regardless of the frequency the actual cores are running at or what power state they are in. It is entirely decoupled from actual instructions being executed. It more or less directly maps to the get_now() function in tempo.
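
On x86_64, a get_now-style read bottoms out in something like this. RDTSCP also returns the IA32_TSC_AUX value, which OSes commonly program with a CPU identifier, giving a cheap hint that two readings came from different cores:

#[cfg(target_arch = "x86_64")]
fn read_tsc() -> (u64, u32) {
    let mut aux: u32 = 0;
    // RDTSCP: read the time stamp counter and IA32_TSC_AUX, with slightly
    // stronger ordering guarantees than plain RDTSC.
    let tsc = unsafe { core::arch::x86_64::__rdtscp(&mut aux) };
    (tsc, aux)
}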

Source: https://software.intel.com/en-us/articles/intel-sdm, §17.15, page 3272 of combined PDF (page 17-40)

The tempo API

The public API of tempo can be completely summarized as:

/// A monotonically non-decreasing counter. Increases at a steady rate,
/// currently estimated by `get_ratio`.
///
/// Note that both monotonicity and steadiness are properties of a clock on a
/// single hardware thread. When migrating across CPUs or other boundaries,
/// timing weirdness may be observed. By ignoring any ticks inside a discontinuity
/// range, the worst can be avoided. Keeping ticks for long timescales (approximately
/// over 1000 seconds) exposes clock drift. It's best to divide by `get_ratio` in that
/// case.
pub fn get_now() -> Tick;

/// Get any discontinuities that occurred since some start point.
///
/// The OS does not maintain the tick list forever. In particular, the buffer of
/// discontinuities has a fixed size. Through mechanisms not described here, threads
/// can register to receive a notification when the buffer is about to fill, and copy
/// out all the remaining discontinuities for storage.
///
/// A discontinuity is some period of time where the counter wasn't increasing.
/// Due to difficulties in determining exactly how long and when the system
/// went up/down, the `TaiDifference` portion of the tuple is merely an estimate.
/// The `Range<Tick>` portion has no relation to the actual tick rate, and
/// merely indicates a convenient place where the count stopped/started.
///
/// The usize this returns is the number of discontinuities that couldn't fit
/// into the slice, and were dropped. There can only ever be at most 120 recent
/// discontinuities, so passing a slice of length 120 will always avoid this
/// issue.
pub fn get_recent_discontinuities(start: Tick, read_into: &mut [(DiscontinuityKind, Range<Tick>, TaiDifference)]) -> usize;

/// Get the total (estimated) amount of time the system has been suspended.
///
/// This is a running sum of all `TaiDifference`s reported in a discontinuity
/// with DiscontinuityKind::Suspend. This is offered as a convenience. It is
/// possible to compute this in userspace, but it is somewhat annoying.
pub fn get_suspend_time() -> TaiDifference;

/// Get the number of ticks that pass per some number of nanoseconds.
pub fn get_ratio() -> (TickDifference, TaiDifference);

/// Get the wall clock time the system was booted at.
///
/// This is merely an estimate, usually achieved by synchronizing with some
/// number of network time servers, which have either a reference clock
/// (GPS or atomic, usually) or are themselves synchronized with some network
/// time servers.
pub fn get_wall_boot() -> Tai;

/// Get the current tick count at various start times relevant to this thread.
pub fn get_start(_arg: StartKind) -> Tick;

/// Get an ID of the current clocksource that `get_now` measures.
pub fn get_clocksource_id() -> u64;

These should make sense, given the explanation provided above. There’s some other stuff, but you’ll need to look at the code to see all the details.

Compared to the POSIX API, get_now corresponds to clock_gettime(CLOCK_MONOTONIC_RAW) on Linux. Using get_recent_discontinuities, one can approximate clock_gettime(CLOCK_BOOTTIME). Using get_ratio, one can approximate clock_gettime(CLOCK_REALTIME). There is nothing like adjtime or adjtimex for modifying the rate of get_now or handling leap seconds. Any time adjustment of that sort would need to be handled by further userspace abstractions (for example, mapping TAI to UTC). Doing something like Google does, slewing the clock over the course of a day to account for a leap second, is not possible. A second is a second. A robust timekeeping system’s core shouldn’t pretend otherwise.

There are additional privileged APIs, outside the tempo crate:

/// Set the system-wide boot time returned by `get_wall_boot`.
pub fn set_wall_boot(time: Tai);

/// Set a start point for a process to a specific value.
pub fn set_start(process: Process, kind: StartKind, tick: Tick) -> Result<...>;

/// Set the system-wide ratio returned by `get_ratio`.
pub fn set_ratio(ratio: (TickDifference, TaiDifference));

/// Enable or disable timekeeping for a process.
///
/// With timekeeping disabled, a process is unable to directly observe real
/// time. `get_now()` in such a process will cause a (recoverable) fault.
pub fn set_process_timekeeping_allowed(process: Process, allowed: bool);

/// Determine if a process is allowed to observe the time.
pub fn process_timekeeping_allowed(process: Process) -> bool;

These are used by system software to synchronize the clock and handle timekeeping for other processes.

Next, let’s look at some examples using tempo's API.

Comparative microbenchmarking

enum RaceWinner {
    A,
    B,
    Tie
}

fn race<A: Fn(), B: Fn()>(a: A, b: B) -> RaceWinner {
    let start = get_now();
    a();
    let mid = get_now();
    b();
    let stop = get_now();

    let a_time = mid - start;
    let b_time = stop - mid;
    if a_time < b_time {
        RaceWinner::A
    } else if a_time > b_time {
        RaceWinner::B
    } else {
        RaceWinner::Tie
    }
}

This simple example compares how long it takes to run two different callbacks. A better implementation would also look at any discontinuities that occurred between start and stop, and if there were any, repeat the measurement. A more robust benchmarking harness would take more measurements to estimate the "true" winner and compensate for things like preemption and system load. The ideal benchmarking harness would use various microarchitectural performance counters instead of just time, to fully understand the performance characteristics of the involved code.
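
A sketch of that first improvement, using the API declared earlier. It assumes passing an empty slice to get_recent_discontinuities is acceptable, in which case the returned count is simply the number of discontinuities since `start`; the retry limit a real harness would want is omitted:

fn race_checked<A: Fn(), B: Fn()>(a: &A, b: &B) -> RaceWinner {
    loop {
        let start = get_now();
        let winner = race(a, b);
        if get_recent_discontinuities(start, &mut []) == 0 {
            // No suspend or wraparound overlapped the measurement window.
            return winner;
        }
        // Something interrupted the clock mid-measurement; measure again.
    }
}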

Absolute microbenchmarking

fn bench<F: FnOnce()>(f: F) -> TaiDifference {
    let start = get_now();
    f();
    let stop = get_now();

    let time = stop - start;
    let rate = get_ratio();
    let nanos = time / rate;
    nanos
}

Pretty much all the comments above apply to this one as well. The difference with this example is that it scales by get_ratio() to determine how much proper time (in nanoseconds) passed during the execution.

Uptime calculator

This is an example which uses get_start and get_suspend_time to estimate how long the machine has been running.

fn uptime() -> TaiDifference {
    let now = get_now();
    let boot = get_start(StartKind::Machine);
    let stopped_nanos = get_suspend_time();
    let total_time = now - boot;
    (total_time / get_ratio()) - stopped_nanos
}

Current time

Get the current time. This is rather similar to the previous example, but the discontinuity durations are added instead of subtracted. There is additionally an error term which would be subtracted in a more complete implementation; it is omitted from this sketch.

fn realtime() -> Tai {
    let now = get_now();
    let boot = get_start(StartKind::Machine);
    let stopped_nanos = get_suspend_time();
    let total_time = now - boot;
    get_wall_boot() + (total_time / get_ratio()) + stopped_nanos
}

tempo internals

tempo is organized around a syscall-free interface to the operating system, vaguely similar to Linux’s "vDSO" mechanism. This reduces the overhead of creating timestamps. Processes which are granted access to timekeeping receive a capability corresponding to a read-only region of shared memory that the OS manages. They can then map this into their address space and access it like regular memory. Unlike the vDSO, this region contains no code, only data which the tempo implementation accesses.

The layout of this shared memory region looks like this:

wall_boot_offset: u16
ratio_offset: u16
discont_offset: u16
machine_start: u64
// padding to cache line
wall_boot: i128 // start + wall_boot_offset
wall_boot_seqno: u32
// padding to cache line
ratio_high: u64 // start + ratio_offset
ratio_low: i128
ratio_seqno: u32
// padding to cache line
discont_seqno: u32 // start + discont_offset
discont_len: u16
total_discont: i128
discont_data: [Discontinuity]

The seqnos are used for synchronizing access to data which can’t be written atomically. The OS will never be blocked from writing, but user threads might be temporarily blocked from reading by the OS updating something. There is an additional per-thread page:

clocksource_id: u64
process_ticks: u64
thread_ticks: u64
thread_start: u64
process_start: u64

process_ticks and thread_ticks keep track of how many ticks the process or thread has been executing for, if that has been enabled for the process. If it hasn’t been enabled, these fields won’t change. thread_start and process_start are the Tick count when the process/thread was spawned.
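
For flavor, reading one seqno-guarded field might look like the sketch below. The field names are from the layout above; the exact write protocol is an assumption (the writer bumps the seqno to an odd value before updating and back to even after), not necessarily what tempo does:

use std::sync::atomic::{fence, AtomicU32, Ordering};

/// Read the 128-bit `wall_boot` field guarded by `wall_boot_seqno`, retrying
/// if the OS updates it mid-read. `wall_boot` points into the shared region.
unsafe fn read_wall_boot(wall_boot: *const i128, wall_boot_seqno: &AtomicU32) -> i128 {
    loop {
        let before = wall_boot_seqno.load(Ordering::Acquire);
        if before & 1 == 1 {
            continue; // the OS is mid-update
        }
        let value = wall_boot.read_volatile();
        fence(Ordering::Acquire); // order the data read before the re-check
        if wall_boot_seqno.load(Ordering::Relaxed) == before {
            return value; // nothing changed while we read; value is not torn
        }
    }
}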

Virtualization

In a virtual machine, time can be virtualized just as much as other resources. This poses problems. Fortunately, since we don’t depend on periodic interrupts for timekeeping, things can be considerably simpler than they otherwise would be. The ratio and wall boot time are not expected to be constant; they can change when a VM is migrated to a different host. Robigalia system timekeeping is robust enough that TSC passthrough should always be used. Where the TSC is slewed or offset at the whim of the hypervisor, various time-related functions may behave weirdly. Unfortunately, there’s nothing we can do about that. We advocate the approach in Virtualize Everything but Time, which can thankfully be adapted without hypervisor modifications.

In practice, the most widely used alternative is to emulate the TSC.

Conclusion

Timekeeping is hard! But by looking at it differently than traditional POSIX implementations do, we can arrive at a simpler, more robust solution. Some of the gravest sins of POSIX have been using UTC (which is corrected by leap seconds) in the core, and always using absolute real time. These techniques aren’t new, and we’ve been greatly inspired by RADclock when figuring out how to expose time in our operating system. Hopefully future architectures like RISC-V can learn from some of the mistakes in extant timing hardware to more easily enable the sort of timekeeping we use here. All indications are that this is the future.

Thanks to Alex Elsayed, Stefan O’Rear, Samuel A. Falvo II, Catherine Callegari, Jamey Sharp, and Paul Lussier for early feedback on this post!