Property based testing is fantastic.
Why is it not more popular?
My theory is that only code written in functional languages has complex properties you can actually test.
In imperative programs, you might have a few utils that are appropriate for property testing - things like to_title_case(str) - but the bulk of program logic can only be tested imperatively with extensive mocking.
I actually used property testing very successfully to test a DB driver and a migration to another DB driver in Go. I wrote up about it here https://blog.tiserbox.com/posts/2024-02-27-stateful-property...
>> Why is it not more popular?
Property, fuzz, and snapshot testing: great tools that make software more correct and reliable.
The challenge for most developers is that they need to change how they design code and think about testing.
I’ve always said the hardest part of programming isn’t learning, it’s unlearning what you already know…
But wouldn't that apply just as much to example-based testing?
I keep thinking I have a possible use case for property-based testing, and then I am up to my armpits in trying to understand the on-the-ground problem and don't feel like I have time to learn a DSL for describing all possible inputs and outputs when I already have an existing function (the subject-under-test) that I don't understand.
So rather than try to learn two black boxes at the same time, I fall back to "several more unit tests to document more edge cases to defensively guard against".
Is there some simple way to describe this defensive programming iteration pattern in Hypothesis? Normally we just null-check and return early and have to deal with the early-return case. How do I quickly write property tests to check that my code handles the most obvious edge cases?
In addition to what other people have said:
> [...] time to learn a DSL for describing all possible inputs and outputs when I already had an existing function [...]
You don't have to describe all possible inputs and outputs. Even just being able to describe some classes of inputs can be useful.
As a really simple example: many example-based tests have some values that are arbitrary and the test shouldn't care about them, like eg employees names when you are populating a database or whatever. Instead of just hard-coding 'foo' and 'bar', you can have hypothesis create arbitrary values there.
Just like learning how to write (unit) testable code is a skill that needs to be learned, learning how to write property-testable code is also a skill that needs practice.
What's less obvious: retrofitting property-based tests onto an existing codebase with existing example-based tests is almost a separate skill. It's harder than writing your code with property-based tests in mind.
---
Some common properties to test (a couple are sketched in code after this list):
* Your code doesn't crash on random inputs (or only throws a short whitelist of allowed exceptions).
* Applying a specific piece of functionality should be idempotent, i.e. doing that operation multiple times should give the same result as applying it only once.
* Order of input doesn't matter (for some functionality)
* Testing your prod implementation against a simpler implementation, that's perhaps too slow for prod or only works on a restricted subset of the real problem. The reference implementation doesn't even have to be simpler: just having a different approach is often enough.
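For instance, the no-crash and idempotence properties might look like this in Hypothesis (normalize_spaces here is just a stand-in for whatever function you are actually testing):

    from hypothesis import given, strategies as st

    def normalize_spaces(s: str) -> str:
        # Stand-in for the real function under test.
        return " ".join(s.split())

    @given(st.text())
    def test_never_crashes(s):
        # Property: no random input raises an unexpected exception.
        normalize_spaces(s)

    @given(st.text())
    def test_idempotent(s):
        # Property: applying the operation twice equals applying it once.
        once = normalize_spaces(s)
        assert normalize_spaces(once) == once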
But let's say employee names fail on apostrophes. Won't you just have a unit test that sometimes fails, but only when the testing tool randomly happens to add an apostrophe to the employee name?
Hypothesis keeps a database of failures to use locally and you can add a decorator to mark a specific case that failed. So you run it, see the failure, add it as a specific case and then that’s committed to the codebase.
The randomness can bite a little if that test failure happens on an unrelated branch, but it’s not much different to someone just discovering a bug.
As far as I remember, Hypothesis tests smartly, which means that likely-problematic strings are tested first. It then narrows down which exact part of the tested string caused the failure.
So it might as well just throw the kitchen sink at the function. If it handles that: great. If not: that string will get narrowed down until you arrive at a minimal set of failing inputs.
Either your code shouldn’t fail or the apostrophe isn’t a valid case.
In the former case, Hypothesis and other similar frameworks are deterministic and will replay the failing test on request, or remember the failing tests in a file to rerun in the future to catch regressions.
In the latter case, you just tell the framework not to generate such values, or at least to skip those test cases (not generating them at all is better for testing performance).
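In Hypothesis terms, that's the difference between constraining the strategy and calling assume() at runtime; a sketch, with save_employee as a hypothetical stand-in for the code under test:

    from hypothesis import assume, given, strategies as st

    def save_employee(name: str) -> None:
        pass  # hypothetical code under test

    # Preferred: never generate names containing an apostrophe.
    @given(st.text().filter(lambda name: "'" not in name))
    def test_save_without_apostrophes(name):
        save_employee(name)

    # Alternative: generate anything, but skip the unwanted cases at runtime.
    @given(st.text())
    def test_save_skipping_apostrophes(name):
        assume("'" not in name)
        save_employee(name)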
I think what they meant is, "won't Hypothesis sometimes fail to generate input with an apostrophe, thus giving you false confidence that your code can handle apostrophes?"
I think the answer to this is that, in practice, it will not fail to generate such input. My understanding is that it's pretty good at mutating input to cover a large amount of surface area with as few examples as possible.
Hypothesis is pretty good, but it's not magic. There are only so many corner cases it can cover in the 200 (or so) cases per test it runs by default.
But by default you also start with a new random seed every time you run the tests, so you build up more confidence in the older tests and older code over time, even if you haven't done anything specifically to address this problem.
Also, even with Hypothesis you can and should still write specific tests, or even specific generators, to cover particular classes of corner cases you are worried about in more detail.
No, Hypothesis iterates on test failures to isolate the simplest input that triggers it, so that it can report it to you explicitly.
Sibling comments have already mentioned some common strategies - but if you have half an hour to spare, the property-based testing series on the F# for Fun and Profit blog is well worth your time. The material isn’t really specific to F#.
https://fsharpforfunandprofit.com/series/property-based-test...
The simplest practical property-based tests are where you serialize some randomly generated data of a particular shape to JSON, then deserialize it, and ensure that the output is the same.
A more complex kind of PBT is if you have two implementations of an algorithm or data structure, one that's fast but tricky and the other one slow but easy to verify. (Say, quick sort vs bubble sort.) Generate data or operations randomly and ensure the results are the same.
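A minimal Hypothesis version of that roundtrip idea, with the "shape" here being just a dict of string keys to ints:

    import json
    from hypothesis import given, strategies as st

    @given(st.dictionaries(st.text(), st.integers()))
    def test_json_roundtrip(data):
        # Property: deserializing the serialized data gives back the original.
        assert json.loads(json.dumps(data)) == data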
> The simplest practical property-based tests are where you serialize some randomly generated data of a particular shape to JSON, then deserialize it, and ensure that the output is the same.
Testing that f(g(x)) == x for all x and some f and g that are supposed to be inverses of each other is a good test, but it's probably not the simplest.
The absolute simplest I can think of is just running your functionality on some randomly generated input and seeing that it doesn't crash unexpectedly.
For things like sorting, testing against an oracle is great. But even when you don't have an oracle, there are lots of other possibilities (a couple are sketched in code after this list):
* Test that sorting twice has the same effect as sorting once.
* Start with a known already in-order input like [1, 2, 3, ..., n]; shuffle it, and then check that your sorting algorithm re-creates the original.
* Check that the output of your sorting algorithm is in-order.
* Check that input and output of your sorting algorithm have the same elements in the same multiplicity. (If you don't already have a data structure / algorithm that does this efficiently, you can probe it with more randomness: create a random input (say a list of numbers), pick a random number X, count how many times X appears in your list (via a linear scan); then check that you get the same count after sorting.)
* Check that permuting your input doesn't make a difference.
* Etc.
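For example, the in-order check and the random-count probe might look like this (my_sort is a stand-in for the implementation under test):

    from hypothesis import given, strategies as st

    def my_sort(xs):
        # Stand-in for the sorting implementation under test.
        return sorted(xs)

    @given(st.lists(st.integers()))
    def test_output_is_in_order(xs):
        out = my_sort(xs)
        assert all(a <= b for a, b in zip(out, out[1:]))

    @given(st.lists(st.integers()), st.integers())
    def test_multiplicity_is_preserved(xs, x):
        # Pick a random value and check that its count survives sorting.
        assert my_sort(xs).count(x) == xs.count(x)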
Speaking for myself: those are definitely all simpler cases, but I never found them compelling enough (beyond the "it doesn't crash" property). The simplest case that truly motivated PBT for me was roundtrip serialization. Now I use PBT quite a lot, and most of my tests are either serialization roundtrips or oracle/model-based tests.
Oh, yes, I was just listing simple examples. I wasn't trying to find a use case that's compelling enough to make you want to get started.
I got started out of curiosity, and because writing property based tests is a lot more fun than writing example based tests.
I've only used it once before, not as unit testing, but as stress testing for a new customer-facing API. I wanted to say with confidence "this will never throw an NPE". Also, the logic was so complex (and the deadline so short) that the only reasonable way to test was to generate large amounts of output data and review it manually for anomalies.
Here are some fairly simple examples: testing port parsing https://github.com/meejah/fowl/blob/e8253467d7072cd05f21de7c...
...and https://github.com/magic-wormhole/magic-wormhole/blob/1b4732...
The simplest ones to get started with are "strings", IMO, and they also give you lots of mileage (because Hypothesis will definitely test some weird Unicode). So, somewhere in your API where you take some user-entered strings -- even something "open ended" like "a name" -- you can make use of Hypothesis to try a few things. This has definitely uncovered Unicode bugs for me.
Some more complex things can be made with some custom strategies. The most-Hypothesis-heavy tests I've personally worked with are from Magic Folder strategies: https://github.com/tahoe-lafs/magic-folder/blob/main/src/mag...
The only real downside is that a Hypothesis-heavy test-suite like the above can take a while to run (but you can instruct it to only produce one example per test). Obviously, one example per test won't catch everything, but it is way faster when developing, and Hypothesis remembers "bad" examples, so if you occasionally do a longer run it will re-try the inputs that caused errors before.
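For the "one example per test while developing" part, a pattern along these lines with Hypothesis settings profiles works well (the profile names here are arbitrary):

    import os
    from hypothesis import settings

    # Typically placed in conftest.py: a fast profile for everyday development
    # and a thorough one for occasional longer runs (e.g. CI or overnight).
    settings.register_profile("dev", max_examples=1)
    settings.register_profile("thorough", max_examples=1000)
    settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "dev"))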
I think the easiest way is to start with general properties and general input, and tighten them up as needed. The property might just be "doesn't throw an exception", in some cases.
If you find yourself writing several edge cases manually with a common test logic, I think the @example decorator in Hypothesis is a quick way to do that: https://hypothesis.readthedocs.io/en/latest/reference/api.ht...
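Something like this, pinning the apostrophe case from upthread while still generating everything else (escape_name is a hypothetical function under test):

    from hypothesis import example, given, strategies as st

    def escape_name(name: str) -> str:
        # Hypothetical code under test.
        return name.replace("'", "''")

    @given(st.text())
    @example("O'Brien")  # this edge case is always run, alongside the generated inputs
    def test_escaping_never_shortens(name):
        assert len(escape_name(name)) >= len(name)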
I love property-based testing, especially the way it can uncover edge cases you wouldn't have thought about. Haven't used Hypothesis yet, but I once had FsCheck (property-based testing for F#) find a case where the data structure I was writing failed when there were exactly 24 items in the list and you tried to append a 25th. That was a test case I wouldn't have thought to write on my own, but the particular number (it was always the 25th item that failed) quickly led me to find the bug. Once my property tests were running overnight and not finding any failures after thousands and thousands of random cases, I started to feel a lot more confident that I'd nailed down the bugs.
I had a similar thing, with F# as well actually.
We had some code that used a square root, and in some very rare circumstances, we could get a negative number, which would throw an exception. I don't think i would have even considered that possibility if FsCheck hadn't generated it.
That example caught my attention. What was it in your code that made length 24 special?
Coincidentally, I recently stumbled upon a similar library for Go[1].
I haven't used it, or property-based testing, but I can see how it could be useful.
[1]: https://github.com/flyingmutant/rapid
A little off topic, but there's this great video by Jon about property testing in Rust: https://youtu.be/64t-gPC33cc
I've taught this in my testing courses. I find that (pytest) fixtures are often good enough for coming up with multiple tests while staying simple to implement.
It’s been quite some time since I’ve been in the business of writing lots of unit tests, but back in the day, I found hypothesis to be a big force multiplier and it uncovered many subtle/embarrassing bugs for me. Recommend. Also easy and intuitive to use.
Huge second.
I've never used PBT and failed to find a new bug. I recommended it in a job interview; they used it and discovered a pretty clear bug on their first test. It's really powerful.
I concur. Hypothesis saved me many times. It also helped me prove the existence of bugs in third party code, since I was able to generate examples showing that a specific function was not respecting certain properties. Without that I would have spent a lot of time trying to manually find an example, let alone the simplest possible example.
Hypothesis is also a lot better at giving you 'nasty' floats etc than Haskell's QuickCheck or the relevant Rust and OCaml libraries are. (Or at least used to be, I haven't checked on all of them recently.)
This approach has two fundamental problems.
1. It requires you to essentially re-implement the business logic of the SUT (subject-under-test) so that you can assert it. Is your function doing a+b? Then instead of asserting that f(1, 2) == 3 you need to do f(a, b) == a+b since the framework provides a and b. You can do a simpler version that's less efficient, but in the end of the day, you somehow need to derive the expected outputs from input arguments, just like your SUT does. Any logical error that might be slipped into your SUT implementation has a high risk of also slipping into your test and will therefore be hidden by the complexity, even though it would be obvious from just looking at a few well thought through examples.
2. Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of. You basically just give up and leave edge case finding to chance. Testing for 0 or -1 or 1-more-than-list-length are obvious cases which both you the human test writer and some test framework can easily generate, and they are often actual edge cases. But what really constitutes an edge case depends on your implementation. You as the developer know the implementation and have a chance of coming up with the edge cases. You know the dark corners of your code. Random tests are just playing the lottery, replacing thinking hard.
> Then instead of asserting that f(1, 2) == 3 you need to do f(a, b) == a+b since the framework provides a and b. You can do a simpler version that's less efficient, but in the end of the day, you somehow need to derive the expected outputs from input arguments, just like your SUT does.
Not true. For example, if `f` is `+`, you can assert that f(x,y) == f(y,x). Or that f(x, 0) == x. Or that f(x, f(y, z)) == f(f(x, y), z).
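For instance, over the integers (a sketch where f merely stands in for the implementation being checked):

    from hypothesis import given, strategies as st

    def f(a: int, b: int) -> int:
        # Stand-in for the addition-like implementation being checked.
        return a + b

    @given(st.integers(), st.integers(), st.integers())
    def test_addition_laws(x, y, z):
        assert f(x, y) == f(y, x)              # commutativity
        assert f(x, 0) == x                    # identity element
        assert f(x, f(y, z)) == f(f(x, y), z)  # associativity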
Even a test as simple as "don't crash for any input" is actually extremely useful. This is fuzz testing, and it's standard practice for any safety-critical code, e.g. you can bet the JPEG parser on the device you're reading this on has been fuzz tested.
> You basically just give up and leave edge case finding to chance.
I don't know anything about Hypothesis in Python, but I don't think this is true in general. The reason is because the generator can actually inspect your runtime binary and see what branches are being triggered and try to find inputs that will cause all branches to be executed. Doing this for a JPEG parser actually causes it to produce valid images, which you would never expect to happen by chance. See: https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-th...
> Such a fuzzing run would be normally completely pointless: there is essentially no chance that a "hello" could be ever turned into a valid JPEG by a traditional, format-agnostic fuzzer, since the probability that dozens of random tweaks would align just right is astronomically low.
> Luckily, afl-fuzz can leverage lightweight assembly-level instrumentation to its advantage - and within a millisecond or so, it notices that although setting the first byte to 0xff does not change the externally observable output, it triggers a slightly different internal code path in the tested app. Equipped with this information, it decides to use that test case as a seed for future fuzzing rounds:
> I don't know anything about Hypothesis in Python, but I don't think this is true in general. The reason is because the generator can actually inspect your runtime binary and see what branches are being triggered and try to find inputs that will cause all branches to be executed.
The author of Hypothesis experimented with this feature once, but people usually want their unit tests to run really quickly, regardless of whether property based or example based. And the AFL style exploration of branch space typically takes quite a lot longer than what people have patience for in a unit test that runs eg on every update to every Pull Request.
(Hypothesis maintainer here)
Yup, a standard test suite just doesn't run for long enough for coverage guidance to be worthwhile by default.
That said, coverage-guided fuzzing can be a really valuable and effective form of testing (see eg https://hypofuzz.com/).
I have not met anyone who says you should only fuzz/property test, but claiming it can't possibly find bugs, or is unlikely to, is silly. I've caught numerous non-obvious problems, including a non-fatal but undesirable off-by-one error in math-heavy code, thanks to property testing. It works well for "NP"-hard-style problems where the code is harder than the verification. It does not work well for a+b, but for most problems it's generally easier to write assertions that have to hold when executing your function. And if it's not, don't use it; like all testing, it's an art to determine when it's useful and how to write it well.
Hypothesis in particular does something neat where it tries to generate random inputs that are more likely to execute novel paths within the code under test. That's not replicated in Rust, but it is super helpful for reaching more paths of your code, and it simply can't be done manually if you have a lot of non-obvious boundary conditions.
Yes, NP-style verification is a prime candidate.
But even for something like a+b, you have lots of properties you can test. All the group theory axioms (insofar as they are supposed to hold) for example. See https://news.ycombinator.com/item?id=45820009 for more.
> 1. It requires you to essentially re-implement the business logic of the SUT (subject-under-test) so that you can assert
No. That's one valid approach, especially if you have a simpler alternative implementation. But testing against an oracle is far from the only property you can check.
For your example: suppose you have implemented an add function for your fancy new data type (perhaps it's a crazy vector/tensor thing, whatever).
Here are some properties that you might want to check:
a + b == b + a
a + (b + c) == (a + b) + c
a + (-a) == 0
For all a and b and c, and assuming that these properties are actually supposed to hold in your domain, and that you have an additive inverse (-). Eg many of them don't hold for floating point numbers in general, so it's good to note that down explicitly.
Depending on your domain (eg https://en.wikipedia.org/wiki/Tropical_semiring), you might also have idempotence in your operation, so a + b + b = a + b is also a good one to check, where it applies.
You can also have an alternative implementation that only works for some classes of cases. Or sometimes it's easier to prepare a challenge than to find it, eg you can randomly move around in a graph quite easily, and you can check that your A* algorithm you are working on finds a route that's at most as long as the number of random steps you took.
> 2. Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of. You basically just give up and leave edge case finding to chance. Testing for 0 or -1 or 1-more-than-list-length are obvious cases which both you the human test writer and some test framework can easily generate, and they are often actual edge cases. But what really constitutes an edge case depends on your implementation. [...]
You'd be surprised how often the generic heuristics for edge cases actually work and how often manual test writers forget that zero is also a number, and how often the lottery does a lot of the rest.
Having said this: Python's Hypothesis is a lot better at its heuristics for these edge cases than eg Haskell's QuickCheck.
> Then instead of asserting that f(1, 2) == 3 you need to do f(a, b) == a+b
Not really, no, it's right there in the name: you should be testing properties (you can call them "invariants" if you want to sound fancy).
In the example of testing an addition operator, you could test:
1. f(x,y) >= max(x,y) if x and y are non-negative
2. f(x,y) is even iff x and y have the same parity
3. f(x, y) = 0 iff x=-y
etc. etc.
The great thing is that these tests are very easy and fast to write, precisely because you don't have to re-model the entire domain. (Although it's also a great tool if you have 2 implementations, or are trying to match a reference implementation)
I feel like this talk by John Hughes showed that there is real value in this approach with production systems of varying levels of complexity, with two different examples of using the approach to find very low level bugs that you'd never think to test for in traditional approaches.
https://www.youtube.com/watch?v=zi0rHwfiX1Q
> (...) but in the end of the day, you somehow need to derive the expected outputs from input arguments, just like your SUT does.
I think you're manifesting some misconceptions and ignorance about property-based testing.
Property-based testing is still automated testing. You still have a sut and you still exercise it to verify and validate invariants. This does not change.
The core trait of property-based testing is that, instead of defining and maintaining hard-coded test data to drive your tests (specific realizations of the input state), you let the framework generate sequences of random input data; in the event of a test failing, it follows up by employing reduction strategies to distil the input values into a minimum reproducible example.
As a consequence, tests don't focus on which specific value a sut returns when given a specific input value. Instead, they focus on verifying more general properties of a sut.
Perhaps the main advantage of property-based testing is that developers don't need to maintain test data anymore, and thus tests are no longer green just because you forgot to update the test data to cover a scenario or reflect an edge case. Developers instead define test data generators, and the property-based testing framework implements the hard parts, such as the input distillation step.
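In Hypothesis, such a generator is a "strategy"; a sketch, with an illustrative Employee type:

    from dataclasses import dataclass
    from hypothesis import given, strategies as st

    @dataclass(frozen=True)
    class Employee:
        name: str
        age: int

    # The generator the developer maintains, instead of hard-coded fixtures.
    employees = st.builds(
        Employee,
        name=st.text(min_size=1),
        age=st.integers(min_value=16, max_value=99),
    )

    @given(employees)
    def test_generated_employee_is_in_range(e):
        # Whatever invariant your sut must uphold would go here; this just
        # checks that the generated data stays inside the declared bounds.
        assert e.name and 16 <= e.age <= 99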
Property-based testing is no silver bullet though.
> Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of.
Your comment completely misses the point of property-based testing. You still need to exercise your sut to cover scenarios. Where property-based testing excels is that you no longer have to maintain curated sets of test data, or update them whenever you update a component. Your inputs are already randomly generated following the strategy you specified.
Is there something this nice for JS, with the decorators like that?
No decorators, but fast-check has add-ons to various test frameworks. E.g. if you use Vitest you can write:
https://www.npmjs.com/package/@fast-check/vitest?activeTab=r...
The decorators are a nice approach in Python, but they aren't really core to what Hypothesis does, nor what makes it better than eg Haskell's QuickCheck.
Not decorators (or at least not last time I looked) but we use fast-check.
Was already familiar with and using Hypothesis in Python so went in search of something with similar nice ergonomics. Am happy with fast-check in that regard.
https://fast-check.dev/
Make sure to read the docs and understand this well. It has its own vocabulary that can be very counterintuitive.
It seems to only implement half of the QuickCheck idea, because there is no counterexample shrinking. Good effort though! I wonder how hard it would be to derive generators for any custom types in Python - probably not too hard, because types are just values.
Shrinking is by far the most important and impressive part of Hypothesis. Compared to how good it is in Hypothesis, it might as well not exist in QuickCheck.
Proptest in Rust is mostly there but has many more issues with monadic bind than Hypothesis does (I wrote about this in https://sunshowers.io/posts/monads-through-pbt/).
Python's Hypothesis has some very clever features to deal with shrinking past a monadic bind.
If I remember right, it basically uses a binary 'tape' of random decisions. Shrinking is expressed as manipulations of that tape. Your generators (implicitly) define a projection from that tape to your desired types. When you shrink an early part of the tape, the later sub-generators try to re-use the later parts of the tape.
That's not guaranteed to work. But it doesn't have to work reliably for every shrink operation the library tries! It's sufficient if you merely have a good-enough chance of recovering enough of the previous structure to trigger the bug again.
I've always wondered if there could be a small machine learning model trained on shrinking.
I'm not sure whether it would be useful, but it would definitely get you a grant (if done academically) or VC money (if done as a company) these days.
> because there is no counterexample shrinking
Hypothesis does shrink the examples, though.
And Hypothesis is miles ahead of QuickCheck in how it handles shrinking! Not only does it shrink automatically, it has no problem preserving invariants from generation in your shrinking; like only prime numbers or only strings that begin with a vowel etc.
The way it does counterexample shrinking is the most clever part of Hypothesis.