Page MenuHomeFreeBSD

csu: move common code to libc
Needs ReviewPublic

Authored by kib on Oct 31 2022, 1:34 AM.
Tags
None
Referenced Files
F55752337: D37220.diff
Thu, Feb 2, 5:19 PM
Unknown Object (File)
Mon, Jan 9, 7:35 AM
Unknown Object (File)
Sat, Jan 7, 3:35 PM
Unknown Object (File)
Dec 14 2022, 3:04 AM
Unknown Object (File)
Dec 4 2022, 7:49 PM
Subscribers

Details

Summary
Why? Most trivial point, it shaves around 600 bytes from the dynamic
binaries on amd64. Less trivial, the removed code is no longer part of
the ABI, and we can ship updates to it with libc updates. Right now most
of the csu is linked into the binaries and require us to do somewhat
tricky ABI compat when it needs to change. For instance, the init_array
change would be much simpler and does not require note tagging if we
have init calling code in libc.

This could be improved more, by splitting dynamic and static
initialization. For instance, &_DYNAMIC tests can be removed then. I
left this for later, after this change stabilizes.

[Only amd64/i386 are tested ATM]

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kib requested review of this revision.Oct 31 2022, 1:34 AM

Looks like a nice incremental improvement to me
environ and __progname in csu_common.h seems slightly surprising at first, takes a moment to realize it's not a traditional header

environ and __progname in csu_common.h seems slightly surprising at first, takes a moment to realize it's not a traditional header

It was csu_common.c first, but then I decided to use .h for included file. I can move it back if people prefer.

It was csu_common.c first, but then I decided to use .h for included file. I can move it back if people prefer.

I don't have a strong feeling one way or the other, it just took me a moment to figure out.

Add comment to csu_common.h.
Annotate _start() functions with __dead2.

Generally like it. Agnostic on .h vs .c for the shared include, both have their issues.

lib/libc/csu/ignore_init.c
123

can this be moved to libc_private.h?

158

Likewise...

kib marked 2 inline comments as done.Oct 31 2022, 9:09 PM
kib added inline comments.
lib/libc/csu/ignore_init.c
123

This is not easy, and definitely not as the part of this patch. There are a lot of 'extern char **environ' in libc sources, including getenv.c. The later would require approval from secteam@. I do not want to mix these things.

158

I do not see a benefit from the move. The declarations only needed locally.

kib marked 2 inline comments as done.

Add (not compiled) code for rest of the arches.

Fixes after tinderbox.
Microoptimize x86 static irel processing using powerpc64 pattern.

Move environ declaration to libc_private.h.

Suggested by: imp

For future static PIE support (and for static CHERI binaries which look like static PIEs even for PDEs) you still need to do relocation handling in csu itself as that needs to be done before taking pointers (whether to main or eprol/etext, and implicitly for function calls on architectures whose ABIs don't do direct calls, which I believe we don't have any more now MIPS is gone), at which point it's a bit odd to have non-IFUNC relocation handling in csu but IFUNC relocation handling in libc. I don't think this is incompatible with that, it just weakens the argument for the move in my opinion.

lib/libc/csu/ignore_init.c
164

Deduplicate with __libc_start1? Only difference is whether you do atexit(_mcleanup) and call monstartup.

Taking amd64 as an example, I think in my head the right split would be something like:

	int argc;
	char **argv;
	char **env;

 CSU:	argc = *(long *)(void *)ap;
 CSU:	argv = ap + 1;
 CSU:	env = ap + 2 + argc;
LIBC:	handle_argv(argc, argv, env);

	if (&_DYNAMIC != NULL) {
LIBC:		atexit(cleanup);
	} else {
 CSU:		process_irelocs();
LIBC:		_init_tls();
	}

#ifdef GCRT
LIBC:	atexit(_mcleanup);
LIBC:	monstartup(&eprol, &etext);
__asm__("eprol:");
#endif

LIBC:	handle_static_init(argc, argv, env);
LIBC:	exit(main(argc, argv, env));

That is, like what you've done, except leaving the process_irelocs (and arch-specific setup around that for CPU features) in csu by splitting out the handle_argv call in a separate __libc_start1_init_early-like function. Then static PIE support feels more natural, being initialisation done before the call to that new function.

Are you suggesting to move process_irelocs() to csu before the call to __libc_start1()?

Note that the split is NOP for static binaries, and only changes the final binary' text size for dynamic binaries. It only the matter where the .text content comes from.

I considered dropping static profiling as well, but decided not to. For static PIEs, libc_start1_gcrt() would need to adjust values of eprol/etext. I can add a comment about it.

Any other thoughts/views on this?

In D37220#848608, @kib wrote:

Are you suggesting to move process_irelocs() to csu before the call to __libc_start1()?

No, I'm suggesting splitting __libc_start1 into two halves so that process_irelocs can be called, along with any machine-dependent related initialisation like cpu_features, where it is today in csu. As a definitely-poorly-named sketch (that also ignores GCRT to more clearly illustrate this specifically), amd64 would look something like:

/* The entry function. */
void
_start(char **ap, void (*cleanup)(void))
{
	int argc;
	char **argv;
	char **env;

	argc = *(long *)(void *)ap;
	argv = ap + 1;
	env = ap + 2 + argc;
	__libc_start1_first(argc, argv, env, cleanup, main); // Just calls handle_argv for now

	if (&_DYNAMIC == NULL) {
		init_cpu_features();
		process_irelocs();
	}

	__libc_start1_second(argc, argv, env, cleanup, main); // Does the atexit or _init_tls, then handle_static_init, then main, then exit
}
In D37220#848608, @kib wrote:

Are you suggesting to move process_irelocs() to csu before the call to __libc_start1()?

No, I'm suggesting splitting __libc_start1 into two halves so that process_irelocs can be called, along with any machine-dependent related initialisation like cpu_features, where it is today in csu. As a definitely-poorly-named sketch (that also ignores GCRT to more clearly illustrate this specifically), amd64 would look something like:

/* The entry function. */
void
_start(char **ap, void (*cleanup)(void))
{
	int argc;
	char **argv;
	char **env;

	argc = *(long *)(void *)ap;
	argv = ap + 1;
	env = ap + 2 + argc;
	__libc_start1_first(argc, argv, env, cleanup, main); // Just calls handle_argv for now

	if (&_DYNAMIC == NULL) {
		init_cpu_features();
		process_irelocs();
	}

	__libc_start1_second(argc, argv, env, cleanup, main); // Does the atexit or _init_tls, then handle_static_init, then main, then exit
}

This means that init_cpu_features()/process_irelocs() are still pulled into the dynamic binaries, which is I am trying to avoid.

From your earlier comment, your suggestion based on existence of arches which need relocations to happen before first call, am I right? Would they serve itself by doing whatever is needed in csu::_start() and leaving process_irelocs() empty?

In D37220#851886, @kib wrote:
In D37220#848608, @kib wrote:

Are you suggesting to move process_irelocs() to csu before the call to __libc_start1()?

No, I'm suggesting splitting __libc_start1 into two halves so that process_irelocs can be called, along with any machine-dependent related initialisation like cpu_features, where it is today in csu. As a definitely-poorly-named sketch (that also ignores GCRT to more clearly illustrate this specifically), amd64 would look something like:

/* The entry function. */
void
_start(char **ap, void (*cleanup)(void))
{
	int argc;
	char **argv;
	char **env;

	argc = *(long *)(void *)ap;
	argv = ap + 1;
	env = ap + 2 + argc;
	__libc_start1_first(argc, argv, env, cleanup, main); // Just calls handle_argv for now

	if (&_DYNAMIC == NULL) {
		init_cpu_features();
		process_irelocs();
	}

	__libc_start1_second(argc, argv, env, cleanup, main); // Does the atexit or _init_tls, then handle_static_init, then main, then exit
}

This means that init_cpu_features()/process_irelocs() are still pulled into the dynamic binaries, which is I am trying to avoid.

From your earlier comment, your suggestion based on existence of arches which need relocations to happen before first call, am I right? Would they serve itself by doing whatever is needed in csu::_start() and leaving process_irelocs() empty?

Ping. Please comment on this question.

Regen the patch, just in case.

In D37220#851886, @kib wrote:
In D37220#848608, @kib wrote:

Are you suggesting to move process_irelocs() to csu before the call to __libc_start1()?

No, I'm suggesting splitting __libc_start1 into two halves so that process_irelocs can be called, along with any machine-dependent related initialisation like cpu_features, where it is today in csu. As a definitely-poorly-named sketch (that also ignores GCRT to more clearly illustrate this specifically), amd64 would look something like:

/* The entry function. */
void
_start(char **ap, void (*cleanup)(void))
{
	int argc;
	char **argv;
	char **env;

	argc = *(long *)(void *)ap;
	argv = ap + 1;
	env = ap + 2 + argc;
	__libc_start1_first(argc, argv, env, cleanup, main); // Just calls handle_argv for now

	if (&_DYNAMIC == NULL) {
		init_cpu_features();
		process_irelocs();
	}

	__libc_start1_second(argc, argv, env, cleanup, main); // Does the atexit or _init_tls, then handle_static_init, then main, then exit
}

This means that init_cpu_features()/process_irelocs() are still pulled into the dynamic binaries, which is I am trying to avoid.

From your earlier comment, your suggestion based on existence of arches which need relocations to happen before first call, am I right?

Would they serve itself by doing whatever is needed in csu::_start() and leaving process_irelocs() empty?

If you do that then what's the point of process_irelocs existing? We want uniformity between architectures rather than introducing new divergence. But sure, it'd "work", in the same way as just not using your new functions for those architectures at all and doing things the old way would "work". But ideally we'd have something here that is forward-looking, such as for static PIE (the requirements for which I believe are the same as CHERI here).

If you do that then what's the point of process_irelocs existing? We want uniformity between architectures rather than introducing new divergence. But sure, it'd "work", in the same way as just not using your new functions for those architectures at all and doing things the old way would "work".

It covers the current architectures/currently supported binaries.

But ideally we'd have something here that is forward-looking, such as for static PIE (the requirements for which I believe are the same as CHERI here).

The biggest issue with settling static PIE so far was that I do not have a (simple) example code which would produce all required relocations to handle.
Do you have such sample?

It might be a good base to finally add the support.