[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [xsmith-dev] XSmith questions



Oops, I hit `reply` instead of `reply-all`.  I'm adding the mailing
list back on now.

On Tue, Jun 22, 2021 at 03:06:05PM -0600, William G Hatch wrote:
On Tue, Jun 22, 2021 at 12:43:50PM -0700, Sorawee Porncharoenwase wrote:
Hi everyone,

I just started using XSmith. I don’t think I grok the framework yet, so I
would really appreciate it if anyone could help me!

I'll do my best.  (I'm the main author of Xsmith, so you can blame
most of your issues on me.)  Xsmith is research software with lots of
sharp edges and insufficient documentation about those edges, and has
been successfully used mostly by its authors so far.  (Though at least
one other person has found some bugs using the fuzzers we've made.)

[This email is a bit rambly and repetetive, sorry.  I would edit it
better but I need to do something else now.  If this doesn't answer
your questions, please ask some more follow-up questions and I'll try
to write more cohesively.]

1) While I understand that dynamically typed language should specify
type-info to generate programs that don’t cause runtime type mismatch, I
find it weird that types are very integral to the framework. For example,
binder-info seems to presuppose that declaration AST nodes will have type
information in it.

On one hand, that’s strictly speaking not the case for many languages.

On the other hand, I can understand that the specified grammar here doesn’t
need to correspond to the actual language grammar. So let’s just add the
type field into the AST node. But this raises another question: how do I
know which other AST nodes will require additional type information as well?

If you really want to have a language where anything goes, you can
just create a `dyn` base-type and use it everywhere.  But, as you
mention, fuzzing will likely be ineffective since things will mostly
just crash with runtime type errors.  If you want to encode the fact
that various types can be implicitly coerced, you can add explicit
conversion AST nodes for Xsmith that go away when you pretty-print the
program.  Eg. you can define an IntToStringCoersion expression that is
rendered by just rendering the integer inside without adding anything
else to it.

Every node in the AST has type information.

Generally, an Xsmith language fuzzer will have 2-3 main AST node
types: Expression, Statement, and Definition.  Definitions are
separate to simplify Xsmith itself to do something consistent --
definitions can be expressions, statements, or something separate in
various languages, but always having Definition nodes in Xsmith makes
generic name analysis easier.

Expressions have the types you expect -- string, int, list-of-string,
etc.

Statement types are a bit of a hack.  Statements have two types:
no-return-statement and return-statement, which is a wrapper type that
includes an expression type inside that corresponds to the return type
of the function.

Definition nodes should have expression types, IE the type of the
right-hand-side of the definition.

In practice, you probably want to use canned-components to get
Definition node definitions and probably Statement node definitions as
well.  If you do that you don't need to provide type information for
them.  The main place you need to worry about types is with
expressions.

As an aside, Function definitions are often a special case in
programming languages, but we basically encode them as a definition
with a lambda on the right hand side that we maybe print differently.
(If you want to have both definitions with a lambda on the RHS and
function definitions with your language's shorthand syntax, you can
define a new Definition node subtype that prints differently.)


2) As far as I can see, all declaration nodes have types to be that of the
expression it contains. Again, I find this weird since declaration in most
languages is a statement, not an expression. I tried fixing this by
modifying the “Another Small Example with Variables” example to:

(add-property
arith
type-info
[Definition [no-return-type (λ (n t) (hash 'Expression
(fresh-type-variable)))]]
[LetStar [(fresh-type-variable)
         (λ (n t) (hash 'definitions (λ (cn) no-return-type)
                        'sideEs (λ (cn) (fresh-type-variable))
                        'Expression t))]]
...)

where no-return-type is from xsmith/canned-components. The generation
errors with with:

Exception:
unify!: subtype-unify!: can't unify these types: #<type-variable
(#<range:#<no-return-type>-#<no-return-type>>)> and #<int>

You need the type of the definition to be the type of the expression
of the RHS of the definition (modulo subtyping).  Despite the fact
that when printing you may make the definitions look like statements,
Xsmith just requires Definition nodes to have the same type as
variable references to that definition, and keeps Definition and
Statement nodes separate.

The return-type/no-return-type is sort of a hack to use the type
system to keep track of statements as well as expressions.  In eg. a
function, you need to track the return type throughout the function to
be sure that the function does return (potentially in multiple
branches of a conditional), and that all return statements return the
right type.  So by annotating which nodes do or don't return
something, Xsmith can be sure to generate well-formed functions.

The key to using Definitions in a statement language is basically a
`Block`-like statement that can have a series of statements that
include definitions needs to have a list of definitions and a list of
statements separately.  This is a bit of a limitation, since many
languages can interleave definitions and statements, and Xsmith
basically can't.  At any rate, the `Block` node from canned-components
does this.

Probably the best way to see how to do it in a full-sized (but still
relatively simple) example is to look at the simple/javascript.rkt
file in the xsmith-examples directory.  (We haven't actually used that
one yet, so I'm not 100% certain it generates actually correct
javascript, but it should be close if it's wrong.  At any rate, it's
probably the simplest full-size example to see the correlation between
the definition of the fuzzer and its output.)

The `Block` statement in canned-components, and that the javascript
example uses, is the main place to put definitions in a
statement-based language.  In the javascript example we have a series
of top-level definitions by using `ProgramWithBlock`, then basically
unwrapping the “block” and printing the definitions in the top level.
It also uses `LambdaWithBlock` for function definitions.  So if you
look at how it's used it should clarify how to have definitions in a
statement language.

In most examples that I saw, declarations are hoisted at the very top of
function / block / program. I think the no-return-type is particularly
going to be a problem when I want to allow declarations and statements to
interleave, since statements will be assigned a unique type (like
no-return-type), while declarations can’t do the same.

3) How do I debug when XSmith simply hangs and doesn’t output anything? It
looks like the problems I encountered are with types, since if I relax the
constraints, it can generate programs without any problem (though the
generated programs are incorrect, as expected).

Xsmith is probably hanging because it is in a lift loop.  IE it has
created a Definition node, and it needs to generate a right-hand-side
for the definition.  When it tries to generate the RHS expression, the
only legal expression it can find is VariableReference.  So it creates
a VariableReference, and it can't find a (non-circular) variable of
that type to reference, so it lifts one.  Then it needs to make an RHS
for that definition as well, and ...

We really ought to catch this situation somehow, crash, and report an
error.  But we don't at the moment.  Sorry.  If you hit control-C and
get the debug output, you can see the type of the lift cycle (by just
seeing what is being lifted over and over), and then try to figure out
why nothing else is legal.  You may have forgotten a literal node, or
maybe the literal node has holes (eg. a lambda is a literal but not
atomic, it needs statements/expressions inside), meaning that you need
to add an annotation of `#:prop wont-over-deepen #t` so it will
generate it even if it's at the max AST depth.

4) In xsmith/canned-components, there are several predefined types, but the
module only exports constructors, not accessors. What would be the best way
to extract information from it? unify!?

Yes, use `unify!`. Eg.

```
(define my-return-type (fresh-type-variable))
(define f (function-type (fresh-type-variable) my-return-type))
(unify! f some-function-type)
```

This is something I should also improve but haven't yet.  The basic
accessors can fail when logically they should succeed because they may
be passed a type variable instead of the struct for the type you think
it is.  So if you try to use eg. `(function-type t)` it might fail.  I
should provide a `function-type!` function that does the obvious
struct creation and unification, but I haven't yet.

I may add that over the coming weeks.  I've been busy with some other
things, but am shifting my gears back to Xsmith development, and hope
to finish a few final features (eg. work on stuff to do
feedback-directed fuzzing) and do some serious fuzzing with it.
(We've run fuzz campaigns and found some bugs in the past, but mostly
switched back to add-more-features mode to try to make it more
effective, and haven't quite gotten back to actually trying to find
bugs.)

Thanks!
Sorawee

I hope you find Xsmith useful!  What are you intending to fuzz?  I'm
also happy to look at any code you have on Github or such for some
debugging help.