Page MenuHomeFreeBSD

sound: Remove macro magic from pcm/feeder_eq.c
Needs ReviewPublic

Authored by christos on Dec 11 2024, 4:49 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Jan 9, 10:05 AM
Unknown Object (File)
Thu, Jan 9, 9:46 AM
Unknown Object (File)
Wed, Jan 8, 8:19 PM
Unknown Object (File)
Mon, Jan 6, 10:26 PM
Unknown Object (File)
Dec 26 2024, 12:23 PM
Unknown Object (File)
Dec 26 2024, 10:40 AM
Unknown Object (File)
Dec 26 2024, 9:55 AM
Unknown Object (File)
Dec 26 2024, 8:30 AM
Subscribers

Details

Summary

Turn the FEEDEQ_DECLARE macro into a single inline function
(feed_eq_biquad()). There is no reason to have this as a macro, and it
only complicates the code. An advantage of this patch is that, because
we no longer call the functions created by the macro through function
pointers (biquad_op), we can call feed_eq_biquad() directly in
feed_eq_feed().

Sponsored by: The FreeBSD Foundation
MFC after: 1 week

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 61567
Build 58451: arc lint + arc unit

Event Timeline

If you want to make your function equal to the macro in terms of performance, you have to call it with literal format parameters somehow. E.g. through a switch on the format parameters.

If you want to make your function equal to the macro in terms of performance, you have to call it with literal format parameters somehow. E.g. through a switch on the format parameters.

I do not have a benchmark right now to prove the opposite, but do you think such a performance hit is noticeable, if at all existent? Currently we use a function pointer to a specialized function (i.e one for each format), and the patch uses a generic function which fetches the format directly from the struct. I am not really well-versed with compiler optimizations, but I would suppose that the generic one is more likely to be inlined, and thus give us a performance boost (I did see the comment about CPU-branch prediction in D47932).

The main rationale behind this patch, as well as the other similar ones, is to make the code cleaner and easier to work with. There had been a few times already where I tried to make some changes to these files (especially with AFMT_FLOAT support) and the macros make it quite tedious. Even though D47932 fixes most of what I wanted, I still think it's good to have clean code everywhere.

If you want to make your function equal to the macro in terms of performance, you have to call it with literal format parameters somehow. E.g. through a switch on the format parameters.

I do not have a benchmark right now to prove the opposite, but do you think such a performance hit is noticeable, if at all existent? Currently we use a function pointer to a specialized function (i.e one for each format), and the patch uses a generic function which fetches the format directly from the struct. I am not really well-versed with compiler optimizations, but I would suppose that the generic one is more likely to be inlined, and thus give us a performance boost (I did see the comment about CPU-branch prediction in D47932).

The function pointer prevents inlining, alright. But inlining your generic function doesn't help to optimize the inner loops in feed_eq_biquad() that dominate here. The macro did that.

The main rationale behind this patch, as well as the other similar ones, is to make the code cleaner and easier to work with. There had been a few times already where I tried to make some changes to these files (especially with AFMT_FLOAT support) and the macros make it quite tedious. Even though D47932 fixes most of what I wanted, I still think it's good to have clean code everywhere.

I'm not telling you to go back to macros. All you need is a switch on the format parameter, in a dispatch function between the caller and feed_eq_biquad(). Something like:

switch(info->fmt) {
/* Cases you want to optimize for. */
case AFMT_S16_NE:
        feed_eq_biquad(..., ..., ..., AFMT_S16_NE);
        break;
case AFMT_S32_NE:
        feed_eq_biquad(..., ..., ..., AFMT_S32_NE);
        break;
/* Generic fallback, less optimized. */
default:
        feed_eq_biquad(..., ..., ..., info->fmt);
        break;
}

This will let the compiler inline with fixed parameters, transitively for the pcm_sample_read() and pcm_sample_write(). Whether it's critical to do so is hard to tell without a benchmark, but with this you shouldn't get performance regressions compared to the macros.

Bump.

The whole stack around these feeder changes depends on D47932, which is still broken. You may want to fix that one first.

If you want to make your function equal to the macro in terms of performance, you have to call it with literal format parameters somehow. E.g. through a switch on the format parameters.

I do not have a benchmark right now to prove the opposite, but do you think such a performance hit is noticeable, if at all existent? Currently we use a function pointer to a specialized function (i.e one for each format), and the patch uses a generic function which fetches the format directly from the struct. I am not really well-versed with compiler optimizations, but I would suppose that the generic one is more likely to be inlined, and thus give us a performance boost (I did see the comment about CPU-branch prediction in D47932).

The function pointer prevents inlining, alright. But inlining your generic function doesn't help to optimize the inner loops in feed_eq_biquad() that dominate here. The macro did that.

The main rationale behind this patch, as well as the other similar ones, is to make the code cleaner and easier to work with. There had been a few times already where I tried to make some changes to these files (especially with AFMT_FLOAT support) and the macros make it quite tedious. Even though D47932 fixes most of what I wanted, I still think it's good to have clean code everywhere.

I'm not telling you to go back to macros. All you need is a switch on the format parameter, in a dispatch function between the caller and feed_eq_biquad(). Something like:

switch(info->fmt) {
/* Cases you want to optimize for. */
case AFMT_S16_NE:
        feed_eq_biquad(..., ..., ..., AFMT_S16_NE);
        break;
case AFMT_S32_NE:
        feed_eq_biquad(..., ..., ..., AFMT_S32_NE);
        break;
/* Generic fallback, less optimized. */
default:
        feed_eq_biquad(..., ..., ..., info->fmt);
        break;
}

I suspect that to make sure this works as expected, you would also want to introduce functions like

static void __noinline
feed_eq_biquad_s16(...)
{
    feed_eq_biquad(..., AFMT_S16_NE);
}

and call those from the switch statement instead. That would make the programmer's intent more obvious, and prevent the compiler from trying to deduplicate the inlined code.

This will let the compiler inline with fixed parameters, transitively for the pcm_sample_read() and pcm_sample_write(). Whether it's critical to do so is hard to tell without a benchmark, but with this you shouldn't get performance regressions compared to the macros.

How frequently do these functions get called?

sys/dev/sound/pcm/feeder_eq.c
140

When parameterizing a function with a constant to get better code generation, it's a good idea to declar that parameter as const, so const uint32_t fmt here.

I suspect that to make sure this works as expected, you would also want to introduce functions like

static void __noinline
feed_eq_biquad_s16(...)
{
    feed_eq_biquad(..., AFMT_S16_NE);
}

and call those from the switch statement instead. That would make the programmer's intent more obvious, and prevent the compiler from trying to deduplicate the inlined code.

My experience from other projects is different, C++ in particular where these patterns are seen quite often using templates. But we're certainly making assumptions here about the compiler's optimization process. I'd want to have a look at the assembler generated to evaluate this.

This will let the compiler inline with fixed parameters, transitively for the pcm_sample_read() and pcm_sample_write(). Whether it's critical to do so is hard to tell without a benchmark, but with this you shouldn't get performance regressions compared to the macros.

How frequently do these functions get called?

The pcm_sample_read() and pcm_sample_write() get called for every sample processed. Higher level feeder functions get called on small buffer portions, e.g. ~100 frames, AFAIK.

christos marked an inline comment as done.

Mark fmt argument as const.