Treating tests as special case

Pjotr Prins

2018-04-05 05:24:39 UTC

Last night I was watching Rich Hickey's talk on Specs and deployment.
It is a very interesting talk in many ways, recommended. He talks about
tests at 1:02 into the talk:

http://youtu.be/oyLBGkS5ICk

and he gave me a new insight which rang immediately true. He said:
what is the point of running tests everywhere? If two people test the
same thing, what is the added value of that? (I paraphrase)

With Guix a reproducibly building package generates the same Hash on
all dependencies. Running the same tests every time on that makes no
sense.

And this hooks in with my main peeve about building from source. The
building takes long enough. Testing takes incredibly long with many
packages (especially language related) and is usually single core
(unlike the build). It is also bad for our carbon footprint. Assuming
everyone uses Guix on the planet, is that where we want to end up?

Burning down the house.

Like we pull substitutes we could pull a list of hashes of test cases
that are known to work (on Hydra or elsewhere). This is much lighter
than storing substitutes, so when the binaries get removed we can
still retain the test hashes and have fast builds. Also true for guix
repo itself.

I know there are two 'inputs' I am not accounting for: (1) hardware
variants and (2) the Linux kernel. But, honestly, I do not think we
are in the business of testing those. We can assume these work. If
not, any issues will be found in other ways (typically a segfault ;).
Our tests are generally meaningless when it comes to (1) and (2). And
packages that build differently on different platforms, like openblas,
we should opt out on.

I think this would be a cool innovation (in more ways than one).

Pj.

Gábor Boskovits

2018-04-05 06:05:39 UTC

Post by Pjotr Prins
Last night I was watching Rich Hickey's on Specs and deployment. It is
a very interesting talk in many ways, recommended. He talks about
http://youtu.be/oyLBGkS5ICk
what is the point of running tests everywhere? If two people test the
same thing, what is the added value of that? (I paraphrase)

Actually, running tests checks the behaviour of the software.
Unfortunately, a reproducible build does not guarantee reproducible
behaviour. Furthermore, there are still cases where the environment
around the running software is not the same, such as hardware or
kernel configuration settings leaking into the environment.
These can be spotted by running tests. Nondeterministic
failures can also be spotted more easily. There are a lot of
packages where pulling tests can be done, I guess, but probably not
for all of them. WDYT?

Pjotr Prins

2018-04-05 08:39:15 UTC

Post by Gábor Boskovits
Actually running tests test the behaviour of a software. Unfortunately
reproducible build does not guarantee reproducible behaviour.
Furthermore there are still cases, where the environment is
not the same around these running software, like hardware or
kernel configuration settings leaking into the environment.
These can be spotted by running tests. Nondeterministic
failures can also be spotted more easily. There are a lot of
packages where pulling tests can be done, I guess, but probably not
for all of them. WDYT?

Hi Gabor,

If that were a real problem we should not be providing substitutes -
same problem. With substitutes we also provide software with tests
that have been run once (at least).

We should not forbid people to run tests. But I don't think it should
be the default once tests have been run in a configuration.

Think of it as functional programming. In my opinion rerunning tests
can be cached.

My point is that we should not overestimate/overdo the idea of
leakage. Save the planet. We have responsibility.

Pj.

Hartmut Goebel

2018-04-05 08:58:51 UTC

Post by Pjotr Prins
We should not forbid people to run tests. But I don't think it should
be the default once tests have been run in a configuation.

+1

Post by Pjotr Prins
My point is that we should not overestimate/overdo the idea of
leakage. Save the planet. We have responsibility.

+1

--
Regards
Hartmut Goebel

| Hartmut Goebel | ***@crazy-compilers.com |
| www.crazy-compilers.com | compilers which you thought are impossible |

Björn Höfling

2018-04-05 06:21:15 UTC

Hi Pjotr,

great ideas!

Last night I did a

guix pull && guix package -i git

We have substitutes, right? Yeah, but someone had updated git, I had not
configured berlin.guixsd.org on my new machine yet, and hydra didn't
have any substitutes (build wasn't started yet?).

Building git was relatively fast, but all the tests took ages. And it
was just git. It should work. The git maintainers ran the tests. Marius
ran the tests when he updated it in commit 5c151862c. And that should
be enough testing. Let's skip the tests.

On the other hand, what if I create a new package definition and forget
to run the tests? What if upstream is too sloppy, did not run the tests
and had no continuous integration? Who will run the tests then?

What if I build my package with different sources?

And you mentioned different environment conditions like machine and
kernel. We still have "only" 70-90% reproducibility. The complement
should have tests enabled. And the question "is my package
reproducible?" is not trivial to answer, and is not computable.

We saw tests that failed only in 2% of the runs and were fine in 98%.
If we ran those tests "just once", we couldn't figure out that
there is a problem (assuming the problem really is in the software, not
just the tests).
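
To put rough numbers on this: assuming each run fails independently
with probability 2% (an idealisation; real flaky tests are not
necessarily independent), the chance of seeing at least one failure in
n runs is 1 - 0.98^n. A small Python sketch, with made-up function
names:

```python
import math


def detection_probability(failure_rate: float, runs: int) -> float:
    """Chance of seeing at least one failure in `runs` independent runs."""
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest number of independent runs that exposes the failure
    with at least the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    for n in (1, 10, 50, 100):
        print(n, "runs:", round(detection_probability(0.02, n), 3))
    print("runs for 95% confidence:", runs_needed(0.02, 0.95))
```

Under that assumption a single run catches the problem only 2% of the
time, and it takes on the order of 150 runs to be 95% sure of seeing it.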

There could also be practical problems with that: if everyone writes
their software nicely with autoconf and we just have a "make && make
test && make install", it's easy to skip the tests. But for more
complicated things we have to find a way to tell the build system how
to skip tests.

Björn

Pjotr Prins

2018-04-05 08:43:00 UTC

Post by Björn Höfling
great ideas!
Last night I did a
guix pull && guix package -i git
We have substitutes, right? Yeah, but someone updated git, on my new
machine I didn't configure berlin.guixsd.org yet and hydra didn't have
any substitutes (build wasn't started yet?).
Building git was relatively fast, but all the tests took ages. And it
was just git. It should work. The git maintainers ran the tests. Marius
when he updated it in commit 5c151862c ran the tests. And that should
be enough of testing. Let's skip the tests.

Not exactly what I am proposing ;). But, even so, I think we should
have a switch for turning off tests. Let the builder decide what is
good or bad. Too much nannying serves no one.

Post by Björn Höfling
On the other hand, if I create a new package definition and forget to
run the tests. If upstream is too sloppy, did not run the tests and had
no continuous integration. Who will run the tests then?

Hydra should always run the tests before publishing a hash that says
testing is done.

Post by Björn Höfling
What if I build my package with different sources?
And you mentioned different environment conditions like machine and
kernel. We still have "only" 70-90% reproducibility. The complement
should have tests enabled. And the question "is my package
reproducible?" is not trivial to answer, and is not computable.

Well, I believe that case is overrated and we prove that by actually
providing binary substitutes without testing ;)

Post by Björn Höfling
We saw tests that failed only in 2% of the runs and were fine in 98%.
If we would run those tests "just once", we couldn't figure out that
there is a problem (assuming the problem really is in the software, not
just the tests).
There could also be practible problems with that: If all write there
software nice and with autoconfigure and we just have a "make && make
test && make install" it's easy to skip the test. But for more
complicated things we have to find a way to tell the build-system how
to skip tests.

Totally agree. At this point I patch the tree not to run tests.

Pj.

Chris Marusich

2018-04-06 08:58:35 UTC

Post by Pjotr Prins
I think we should have a switch for turning off tests. Let the builder
decide what is good or bad. Too much nannying serves no one.

I think it would be OK to give users the choice of not running tests
when building from source, if they really don't want to. This is
similar to how users can choose to skip the "make check" step (and live
with the risk) when building something manually. However, I think we
should always run the tests by default.

Maybe you could submit a patch to add a "--no-tests" option?

Post by Ludovic Courtès
That is why I was suggesting putting effort in improving substitute
delivery rather than trying to come up with special mechanisms.

Yes, I think that improving substitute availability is the best path
forward. I'm willing to bet that Pjotr would not be so frustrated if
substitutes were consistently available.

Regarding Pjotr's suggestion to add a "test result substitute" feature:
It isn't clear to me how a "test result substitute" is any better than a
substitute in the usual sense. It sounds like Pjotr is arguing that if
the substitute server can tell me that a package's tests have passed,
then I don't need to run the tests a second time. But why would I have
to build the package from source in that case, anyway? Assuming the
substitute server has told me that the package's tests have passed, it
is almost certainly the case that the package has been built and its
substitute is currently available, so I don't have to build the package
myself - I can just download the substitute! Conversely, if a
substitute server says the tests have not passed, then certainly no
substitute will be available, so I'll have to build it (and run the
tests) myself. Perhaps I am missing something, but it does not seem to
me that the existence of a "test result substitute" would add value.

I think what Pjotr really wants is (1) better substitute availability,
or (2) the option to skip tests when he has to build from source because
substitutes are not available. I think (1) is the best goal, and (2) is
a reasonable request in line with Guix's goal of giving control to the
user.

Post by Ricardo Wurmus
An idea that came up on #guix several months ago was to separate the
building of packages from testing. Testing would be a continuation of
the build, like grafts could be envisioned as a continuation of the
build.

What problems would that solve?

Post by Pjotr Prins
The building takes long enough. Testing takes incredibly long with
many packages (especially language related) and are usually single
core (unlike the build).

Eelco told me that in Nix, they set --max-jobs to the number of CPU
cores, and --cores to 1, since lots of software has concurrency bugs
that are easier to work around by building on a single core. Notably,
Guix does the opposite: we set --max-jobs to 1 and --cores to the number
of CPU cores. I wonder if you would see faster builds by adjusting
these options for your use case?
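
As a rough illustration of the two policies, here is a Python sketch
that only assembles the --max-jobs and --cores options mentioned above.
The helper and the policy names are made up for this example, and the
printed command line is an assumption about how one might pass the
options to a build, not a recommendation:

```python
import os
from typing import List, Optional


def build_options(policy: str, cpus: Optional[int] = None) -> List[str]:
    """Return --max-jobs/--cores options for a (hypothetical) named policy."""
    n = cpus or os.cpu_count() or 1
    if policy == "nix-style":
        # Many derivations in parallel, each restricted to a single core.
        return [f"--max-jobs={n}", "--cores=1"]
    if policy == "guix-style":
        # One derivation at a time, allowed to use every core.
        return ["--max-jobs=1", f"--cores={n}"]
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    # Print an example command line for an 8-core machine.
    print("guix build", *build_options("nix-style", 8), "git")
```

Whether the first policy helps depends on the workload: it only pays
off when several independent derivations are waiting to be built, and
a single long single-core test suite takes just as long either way.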

Post by Pjotr Prins
It is also bad for our carbon footprint. Assuming everyone uses Guix
on the planet, is that where we want to end up?

When everyone uses Guix on the planet, substitutes will be ubiquitous.
You'll be able to skip the tests because, in practice, substitutes will
always be available (which means an authorized substitute server ran the
tests successfully). Or, if you are very concerned about correctness,
you might STILL choose to build from source - AND run the tests -
because you are concerned that your particular circumstances (kernel
version, file system type, hardware, etc.) were not tested by the build
farm.

Post by Pjotr Prins
I know there are two 'inputs' I am not accounting for: (1) hardware
variants and (2) the Linux kernel. But, honestly, I do not think we
are in the business of testing those. We can assume these work.

Even if those components worked for the maintainers who ran the tests on
their own machines and made a release, they might not work correctly in
your own situation. Mark's story is a great example of this! For this
reason, some people will still choose to build things from source
themselves, even if substitutes are available from some other place.

--
Chris

David Pirotte

2018-04-06 18:36:56 UTC

Hello,

Post by Chris Marusich

Post by Ricardo Wurmus
An idea that came up on #guix several months ago was to separate the
building of packages from testing. Testing would be a continuation of
the build, like grafts could be envisioned as a continuation of the
build.

What problems would that solve?

If one could run test suites locally on already-built packages, that would
already save quite a great deal of planet heat, I guess: no building from
source in the first place, only when someone finds a bug and fixes it. And
iiuc, Mark would still have found the bug he mentioned ...

Post by Chris Marusich
Even if those components worked for the maintainers who ran the tests on
their own machines and made a release, they might not work correctly in
your own situation. Mark's story is a great example of this! For this
reason, some people will still choose to build things from source
themselves, even if substitutes are available from some other place.

But they would rebuild from source just to run the tests? It sounds to me
that, if possible, separating the test suites from the build process would
add value to the current situation.

Cheers,
David

Ricardo Wurmus

2018-04-05 10:14:53 UTC

Post by Björn Höfling
And you mentioned different environment conditions like machine and
kernel. We still have "only" 70-90% reproducibility.

Where does that number come from? In my tests for a non-trivial set of
bioinfo pipelines I got to 97.7% reproducibility (or 95.2% if you
include very minor problems) for 355 direct inputs.

I rebuilt on three different machines.

--
Ricardo

Björn Höfling

2018-04-05 12:19:47 UTC

On Thu, 05 Apr 2018 12:14:53 +0200

Post by Ricardo Wurmus

Post by Björn Höfling
And you mentioned different environment conditions like machine and
kernel. We still have "only" 70-90% reproducibility.

Where does that number come from? In my tests for a non-trivial set
of bioinfo pipelines I got to 97.7% reproducibility (or 95.2% if you
include very minor problems) for 355 direct inputs.
I rebuilt on three different machines.

I have no numbers of my own, but I checked Ludovic's blog post from October 2017:

https://www.gnu.org/software/guix/blog/2017/reproducible-builds-a-status-update/

"We’re somewhere between 78% and 91%—not as good as Debian yet, [..]".

So if your numbers are valid for the whole repository, that is good
news and would mean we are now better than Debian [1], and that would
be worth a new blog post.

Björn

[1] https://isdebianreproducibleyet.com/

Ricardo Wurmus

2018-04-05 14:10:04 UTC

Hi Björn,

Post by Björn Höfling
On Thu, 05 Apr 2018 12:14:53 +0200

Post by Ricardo Wurmus

Post by Björn Höfling
And you mentioned different environment conditions like machine and
kernel. We still have "only" 70-90% reproducibility.

Where does that number come from? In my tests for a non-trivial set
of bioinfo pipelines I got to 97.7% reproducibility (or 95.2% if you
include very minor problems) for 355 direct inputs.
I rebuilt on three different machines.

https://www.gnu.org/software/guix/blog/2017/reproducible-builds-a-status-update/
"We’re somewhere between 78% and 91%—not as good as Debian yet, [..]".

Ah, I see.

Back then we didn’t have a fix for Python bytecode, which affects a
large number of packages in Guix but not on Debian (who simply don’t
distribute bytecode AFAIU).

Post by Björn Höfling
So if your numbers are valid for the whole repository, that is good
news and would mean we are now better than Debian [1], and that would
be worth a new blog post.

The analysis was only done for the “pigx” package and its
direct/propagated inputs.

I’d like to investigate the sources of non-determinism for remaining
packages and fix them one by one. For some we already know what’s wrong
(e.g. for Haskell packages the random order of packages in the database
seems to be responsible), but for others we haven’t made an effort to
look closely enough.

I’d also take the Debian numbers with a spoonful of salt (and then take
probiotics in an effort to undo some of the damage, see[1]), because
they aren’t actually rebuilding all Debian packages.

[1]: https://insights.mdc-berlin.de/en/2017/11/gut-bacteria-sensitive-salt/

--
Ricardo

GPG: BCA6 89B6 3655 3801 C3C6 2150 197A 5888 235F ACAC
https://elephly.net

Ricardo Wurmus

2018-04-05 10:26:09 UTC

Hi Pjotr,

Post by Pjotr Prins
And this hooks in with my main peeve about building from source. The
building takes long enough. Testing takes incredibly long with many
packages (especially language related) and are usually single core
(unlike the build).

I share the sentiment. Waiting for tests to complete can be quite
annoying.

An idea that came up on #guix several months ago was to separate the
building of packages from testing. Testing would be a continuation of
the build, like grafts could be envisioned as a continuation of the
build.

Packages with tests would then become leaf nodes in the graph — nothing
would depend on the packages with tests, only on the packages without
tests. Building the test continuation would thus be optional and could
be something that’s done by the build farm but not by users who need to
compile a package for lack of substitutes.
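
To make the "leaf node" idea concrete, here is a toy Python sketch. The
package names and edges are invented for illustration and this is not
how Guix represents derivations; it only shows that a user building a
package never pulls in any test node, while the build farm can ask for
the test node explicitly:

```python
from typing import Dict, List, Set

# Hypothetical graph: each node lists the nodes it depends on.
# "<pkg>-tests" depends on the build it checks; nothing depends on it.
GRAPH: Dict[str, List[str]] = {
    "openssl": [],
    "openssl-tests": ["openssl"],
    "curl": ["openssl"],
    "curl-tests": ["curl"],
    "git": ["curl", "openssl"],
    "git-tests": ["git"],
}


def closure(target: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Everything that has to be built to obtain `target`, itself included."""
    seen: Set[str] = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen


if __name__ == "__main__":
    print("user builds git:      ", sorted(closure("git", GRAPH)))
    print("farm builds git-tests:", sorted(closure("git-tests", GRAPH)))
```

Note that in this model even the build farm, when checking git, does
not re-run the tests of curl or openssl; each test node is requested
on its own.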

The implementation details are tricky: can it be a proper continuation
from the time after the build phase but before the install phase? Would
this involve reverting to a snapshot of the build container? There are
packages that force “make check” before “make install” — do we patch
them or ignore them? Will every package then produce one extra
derivation for tests?

--
Ricardo


Ludovic Courtès

2018-04-05 14:14:19 UTC

Hello!

I sympathize with what you write about the inconvenience of running
tests, when substitutes aren’t available. However, I do think running
tests has real value.

Of course sometimes we just spend time fiddling with the tests so they
would run in the isolated build environment, and they do run flawlessly
once we’ve done the usual adjustments (no networking, no /bin/sh, etc.)

However, in many packages we found integration issues that we would just
have missed had we not run the tests; that in turn can lead to very bad
user experience. In other cases we found real upstream bugs and were
able to report them
(cf. <https://github.com/TaylanUB/scheme-bytestructures/issues/30> for
an example from today.) Back when I contributed to Nixpkgs, tests were
not run by default and I think that it had a negative impact on QA.

So to me, not running tests is not an option.

The problem I’m more interested in is: can we provide substitutes more
quickly? Can we grow an infrastructure such that ‘master’, by default,
contains software that has already been built?

Post by Ricardo Wurmus
An idea that came up on #guix several months ago was to separate the
building of packages from testing. Testing would be a continuation of
the build, like grafts could be envisioned as a continuation of the
build.

I agree it would be nice, but I think there’s a significant technical
issue: test suites usually expect to run from the build tree.

Also, would a test failure invalidate the previously-built store
item(s)?

Thanks,
Ludo’.

Pjotr Prins

2018-04-05 14:59:29 UTC

Post by Ludovic Courtès
I sympathize with what you write about the inconvenience of running
tests, when substitutes aren’t available. However, I do think running
tests has real value.
Of course sometimes we just spend time fiddling with the tests so they
would run in the isolated build environment, and they do run flawlessly
once we’ve done the usual adjustments (no networking, no /bin/sh, etc.)
However, in many packages we found integration issues that we would just
have missed had we not run the tests; that in turn can lead to very bad
user experience. In other cases we found real upstream bugs and were
able to report them
(cf. <https://github.com/TaylanUB/scheme-bytestructures/issues/30> for
an example from today.) Back when I contributed to Nixpkgs, tests were
not run by default and I think that it had a negative impact on QA.
So to me, not running tests is not an option.

I am *not* suggesting we stop testing and stop writing tests. They are
extremely important for integration (though we could do with a lot
less and more focussed integration tests - ref Hickey). What I am
writing is that we don't have to rerun tests for everyone *once* they
succeed *somewhere*. If you have a successful reproducible build and
tests on a platform there is really no point in rerunning tests
everywhere for the exact same setup. It is a nice property of our FP
approach. Proof that it is not necessary is the fact that we
distribute substitute binaries without running tests there. What I am
proposing in essence is 'substitute tests'.

Ricardo is suggesting an implementation. I think it can be simpler. When
building a derivation we know the hash. If we have a list of hashes in
the database for successful tests (hash-tests-passed), it is
essentially a lookup and done. Even when the substitute gets removed,
that item can still remain at almost no cost.
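
A minimal Python sketch of that lookup, under stated assumptions: the
function, the enum and the example hashes are hypothetical, the two
sets stand in for whatever a substitute server would actually publish,
and the whole thing presumes the build is reproducible so that the
hash identifies the same inputs everywhere:

```python
from enum import Enum
from typing import Set


class Action(Enum):
    DOWNLOAD_SUBSTITUTE = "download substitute"
    BUILD_SKIP_TESTS = "build locally, skip tests"
    BUILD_AND_TEST = "build locally, run tests"


def plan(drv_hash: str, substitutes: Set[str], tests_passed: Set[str]) -> Action:
    """Pick an action for one derivation hash.

    substitutes:  hashes for which a binary is still available
    tests_passed: hashes whose tests succeeded once on a build server
                  (the "hash-tests-passed" list, kept after binaries expire)
    """
    if drv_hash in substitutes:
        return Action.DOWNLOAD_SUBSTITUTE
    if drv_hash in tests_passed:
        return Action.BUILD_SKIP_TESTS
    return Action.BUILD_AND_TEST


if __name__ == "__main__":
    substitutes = {"aaa111"}
    tests_passed = {"aaa111", "bbb222"}  # bbb222: binary no longer retained
    for h in ("aaa111", "bbb222", "ccc333"):
        print(h, "->", plan(h, substitutes, tests_passed).value)
```

The middle case is the only one where the proposal changes anything:
the binary is gone, but the record that its tests passed is still there.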

Ludo, I think we need to do this. There is no point in running tests
that already have been run. Hickey is right. I have reached
enlightenment. Almost everything I thought about testing is wrong. If
all the inputs are the same the test will *always* pass. There is no
point to it! The only way such a test won't pass is by divine
intervention or real hardware problems. Both we don't want to test
for.

If tests are so important to rerun: tell me why we are not running
tests when substituting binaries?

Post by Ludovic Courtès
The problem I’m more interested in is: can we provide substitutes more
quickly? Can we grow an infrastructure such that ‘master’, by default,
contains software that has already been built?

Sure, that is another challenge and an important one.

Post by Ludovic Courtès

Post by Ricardo Wurmus
An idea that came up on #guix several months ago was to separate the
building of packages from testing. Testing would be a continuation of
the build, like grafts could be envisioned as a continuation of the
build.

I agree it would be nice, but I think there’s a significant technical
issue: test suites usually expect to run from the build tree.

What I understand is that Nix already does something like this. They
have split testing out to allow for network access. I don't propose to
split the process. I propose to cache testing as part of the build.

Pj.

Ricardo Wurmus

2018-04-05 15:17:09 UTC

Post by Pjotr Prins
If all the inputs are the same the test will *always* pass. There is
no point to it! The only way such a test won't pass is by divine
intervention or real hardware problems. Both we don't want to test
for.
If tests are so important to rerun: tell me why we are not running
tests when substituting binaries?

I don’t understand this. People only run tests when they haven’t been
run on the build farm, because that’s part of the build. So when the
tests have passed (and the few short phases after that), then we have
substitutes anyway, and so users won’t re-run tests.

If you get substitutes you don’t need to run the tests.

Any change here seems to only affect the case where you build locally
even though there are substitutes. I’d say that this is a pretty rare
use case. Build farms do this, but they build binaries (and if they
differ from binaries built elsewhere the tests may also behave
differently).

--
Ricardo


Ludovic Courtès

2018-04-05 15:24:12 UTC

Post by Pjotr Prins
I am *not* suggesting we stop testing and stop writing tests. They are
extremely important for integration (thought we could do with a lot
less and more focussed integration tests - ref Hickey). What I am
writing is that we don't have to rerun tests for everyone *once* they
succeed *somewhere*. If you have a successful reproducible build and
tests on a platform there is really no point in rerunning tests
everywhere for the exact same setup. It is a nice property of our FP
approach. Proof that it is not necessary is the fact that we
distribute substitute binaries without running tests there. What I am
proposing in essence is 'substitute tests'.

Understood.

Post by Pjotr Prins
If tests are so important to rerun: tell me why we are not running
tests when substituting binaries?

Because you have a substitute if and only if those tests already passed
somewhere. This is exactly the property we’re interested in, right?

That is why I was suggesting putting effort in improving substitute
delivery rather than trying to come up with special mechanisms.

Post by Pjotr Prins

Post by Ludovic Courtès

Post by Ricardo Wurmus
An idea that came up on #guix several months ago was to separate the
building of packages from testing. Testing would be a continuation of
the build, like grafts could be envisioned as a continuation of the
build.

I agree it would be nice, but I think there’s a significant technical
issue: test suites usually expect to run from the build tree.

What I understand is that Nix already does something like this. they
have split testing out to allow for network access.

Do you have pointers to that? All I’m aware of is the ‘doCheck’
variable that is unset (i.e., false) by default:

https://github.com/NixOS/nixpkgs/blob/master/pkgs/stdenv/generic/setup.sh#L1192

Ludo’.

Pjotr Prins

2018-04-05 16:41:58 UTC

Post by Ludovic Courtès

Post by Pjotr Prins
I am *not* suggesting we stop testing and stop writing tests. They are
extremely important for integration (thought we could do with a lot
less and more focussed integration tests - ref Hickey). What I am
writing is that we don't have to rerun tests for everyone *once* they
succeed *somewhere*. If you have a successful reproducible build and
tests on a platform there is really no point in rerunning tests
everywhere for the exact same setup. It is a nice property of our FP
approach. Proof that it is not necessary is the fact that we
distribute substitute binaries without running tests there. What I am
proposing in essence is 'substitute tests'.

Understood.

Post by Pjotr Prins
If tests are so important to rerun: tell me why we are not running
tests when substituting binaries?

Because you have a substitute if and only those tests already passed
somewhere. This is exactly the property we’re interested in, right?

Yup. Problem is substitutes go away. We don't retain them and I often
encounter that use case.

Providing test-substitutes is much lighter and can be retained
forever.

When tests ever pass on a build server, we don't have to repeat them.
That is my story.

Pj.

Pjotr Prins

2018-04-05 18:35:23 UTC

Post by Pjotr Prins
Providing test-substitutes is much lighter and can be retained
forever.

See it as a light-weight substitute. It can also mean we can retire
large binary substitutes quicker. Saving disk space. I think it is a
brilliant idea ;)

A result of the Hickey insight is that I am going to cut down on my
own tests (the ones I write). Only integration tests are of interest
for deployment.

For those interested, attached patch disables tests in the build
system. You may need to adapt it a little for a recent checkout, but
you get the idea. Use at your own risk, but in a pinch it can be
handy.

Pj.

--

Ludovic Courtès

2018-04-06 07:57:00 UTC

Hello,

Post by Pjotr Prins

Post by Ludovic Courtès

Post by Pjotr Prins
I am *not* suggesting we stop testing and stop writing tests. They are
extremely important for integration (thought we could do with a lot
less and more focussed integration tests - ref Hickey). What I am
writing is that we don't have to rerun tests for everyone *once* they
succeed *somewhere*. If you have a successful reproducible build and
tests on a platform there is really no point in rerunning tests
everywhere for the exact same setup. It is a nice property of our FP
approach. Proof that it is not necessary is the fact that we
distribute substitute binaries without running tests there. What I am
proposing in essence is 'substitute tests'.

Understood.

Post by Pjotr Prins
If tests are so important to rerun: tell me why we are not running
tests when substituting binaries?

Because you have a substitute if and only those tests already passed
somewhere. This is exactly the property we’re interested in, right?

Yup. Problem is substitutes go away. We don't retain them and I often
encounter that use case.

I agree this is a problem. We’ve tweaked ‘guix publish’, our nginx
configs, etc. over time to mitigate this, but I suppose we could still
do better.

When that happens, could you try to gather data about the missing
substitutes? Like what packages are missing (where in the stack), and
also how old is the Guix commit you’re using.

More generally, I think there are connections with telemetry as we
discussed it recently: we should be able to monitor our build farms to
see concretely how much we’re retaining in high-level terms.

FWIW, today, on mirror.hydra.gnu.org, the nginx cache for nars contains
94G (for 3 architectures).

On berlin.guixsd.org, /var/cache/guix/publish takes 118G (3
architectures as well), and there’s room left.

Post by Pjotr Prins
Providing test-substitutes is much lighter and can be retained
forever.

I understand. Now, I agree with Ricardo that this would target the
specific use case where you’re building from source (explicitly
disabling substitutes), yet you’d like to avoid running tests.

We could address this using specific mechanisms (although like I said, I
really don’t see what it would look like.) However, I believe
optimizing substitute delivery in general would benefit everyone and
would also address the running-tests-takes-too-much-time issue.

Can we focus on measuring the performance of substitute delivery and
thinking about ways to improve it?

Thanks for your feedback,
Ludo’.

Mark H Weaver

2018-04-05 20:26:50 UTC

Hi Pjotr,

Post by Pjotr Prins
what is the point of running tests everywhere? If two people test the
same thing, what is the added value of that? (I paraphrase)
With Guix a reproducibly building package generates the same Hash on
all dependencies. Running the same tests every time on that makes no
sense.

I appreciate your thoughts on this, but I respectfully disagree.

Post by Pjotr Prins
I know there are two 'inputs' I am not accounting for: (1) hardware
variants and (2) the Linux kernel. But, honestly, I do not think we
are in the business of testing those. We can assume these work.

No, we can't. For example, I recently discovered that GNU Tar fails one
of its tests on my GuixSD system based on Btrfs. It turned out to be a
real bug in GNU Tar that could lead to data loss when creating an
archive of recently written files, with --sparse enabled. I fixed it in
commit 45413064c9db1712c845e5a1065aa81f66667abe on core-updates.

I would not have discovered this bug if I had simply assumed that since
GNU Tar passes its tests on ext4fs, it surely must also pass its tests
on every other file system.

Post by Pjotr Prins
If not, any issues will be found in other ways (typically a segfault
;).

The GNU Tar bug on Btrfs would never produce a segfault. The only way
the bug could be observed is by noticing that data was lost. I don't
think that's a good way to discover a bug. I'd much rather discover the
bug by a failing test suite.

Tests on different hardware/kernel/kernel-config/file-system
combinations are quite useful for those who care about reliability of
their systems. I, for one, would like to keep running test suites on my
own systems.

Regards,
Mark

Pjotr Prins

2018-04-06 06:06:01 UTC

Post by Mark H Weaver
Tests on different hardware/kernel/kernel-config/file-system
combinations are quite useful for those who care about reliability of
their systems. I, for one, would like to keep running test suites on my
own systems.

Sure. And it is a great example of why to test such scenarios. But why force
it down everyone's throat? I don't want to test Scipy or ldc over and
over again. Note that I can work around it, but we are forcing our
methods here on others. If I do not like it, others won't. I am just
looking at running tests a billion times uselessly around the planet.
Does that not matter? We need to be green.

Ludo is correct that provisioning binary substitutes is one solution.
But not cheap. Can we guarantee keeping all substitutes? At least the
ones with long running tests ;). I don't know how we remove
substitutes now, but it would make sense to me to base that on
download metrics and size. How about ranking downloads in the last 3
months times the time to build? And trim from the end. That may be
interesting.

Even so, with my idea of test substitutes you don't have to opt out of
testing. And you would still have found that bug. Those who care can
test all they please.
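
A minimal sketch of what a 'test substitute' lookup could look like,
assuming a build is identified by a hash of its exact inputs and that a
list of hashes whose tests already passed somewhere can be fetched. The
function name, the always_test switch and the set of hashes are all
hypothetical; nothing like this exists in Guix as far as this thread
shows.

```python
from __future__ import annotations


def should_run_tests(build_hash: str,
                     passed_elsewhere: set[str],
                     always_test: bool = False) -> bool:
    """Decide whether the test suite needs to run for this build.

    build_hash       -- hypothetical identifier of the exact build inputs
    passed_elsewhere -- hashes whose tests are known to have passed somewhere
    always_test      -- opt-in for those who want to test on their own system
    """
    if always_test:
        return True
    # Only skip when the very same build already passed its tests.
    return build_hash not in passed_elsewhere
```

The always_test switch is what keeps local testing available to those
who want it, for instance on an unusual file system.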

Anyway, that is enough. I made my point and I am certain that we will
change our ways at some point. The laborious solution is to remove all
meaningless tests. And I am sure over 90% are pretty damn meaningless
for our purposes. Like the glut in binaries, we will trim it down over
time.

One suggestion: let's also look at tests that are *not* about
integration or hardware/kernel configuration and allow for running them
optionally. Stupidly running all tests that people come up with is not
a great idea. We just run what authors decide should be run.

Pj.

Ricardo Wurmus

2018-04-06 08:27:42 UTC

Post by Pjotr Prins
Ludo is correct that provisioning binary substitutes is one solution.
But not cheap. Can we guarantee keeping all substitutes? At least the
ones with long running tests ;).

For berlin.guixsd.org we have an external storage array of a couple of
TB, which currently isn’t attached (I’ll get around to it some day). We
can keep quite a few substitutes with that amount of space.

Post by Pjotr Prins
Even so, with my idea of test substitutes you don't have to opt out of
testing. And you would still have found that bug. Those who care can
test all they please.

I am not sure there’s an easy implementation that allows us to make
tests optional safely. They are part of the derivation. We could make
execution dependent on an environment variable that is set or not by the
daemon, I suppose.
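
The environment-variable idea could be sketched roughly as follows.
This is only an illustration: the variable name SKIP_TESTS and the
run_check_phase helper are hypothetical, and the real build phases are
not written in Python.

```python
import os


def tests_enabled(environ=None) -> bool:
    """Tests run unless the (hypothetical) SKIP_TESTS variable is set to "1"."""
    env = os.environ if environ is None else environ
    return env.get("SKIP_TESTS", "") != "1"


def run_check_phase(run_tests, environ=None) -> str:
    """Run the test callable only when tests are enabled; report what happened."""
    if not tests_enabled(environ):
        return "skipped"
    run_tests()
    return "ran"
```

The build recipe stays the same in both cases; only the environment the
daemon hands to the build decides whether the check phase does anything.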

Post by Pjotr Prins
One suggestion: let's also look at tests that are *not* about
integration or hardware/kernel configuration and allow for running them
optionally. Stupidly running all tests that people come up with is not
a great idea. We just run what authors decide should be run.

We’ve already trimmed some of the longer test suites. There are some
libraries and applications that have different test suites for different
purposes, and in those cases we picked something lighter and more
appropriate for our purposes.

--
Ricardo

GPG: BCA6 89B6 3655 3801 C3C6 2150 197A 5888 235F ACAC
https://elephly.net
