[xsmith-dev] XSmith questions

Tue Jun 22 15:08:50 MDT 2021

Oops, I hit `reply` instead of `reply-all`.  I'm adding the mailing
list back on now.

On Tue, Jun 22, 2021 at 03:06:05PM -0600, William G Hatch wrote:
>On Tue, Jun 22, 2021 at 12:43:50PM -0700, Sorawee Porncharoenwase wrote:
>>Hi everyone,
>>
>>I just started using XSmith. I don’t think I grok the framework yet, so I
>>would really appreciate it if anyone could help me!
>
>I'll do my best.  (I'm the main author of Xsmith, so you can blame
>most of your issues on me.)  Xsmith is research software with lots of
>sharp edges and insufficient documentation about those edges, and has
>been successfully used mostly by its authors so far.  (Though at least
>one other person has found some bugs using the fuzzers we've made.)
>
>[This email is a bit rambly and repetetive, sorry.  I would edit it
>better but I need to do something else now.  If this doesn't answer
>your questions, please ask some more follow-up questions and I'll try
>to write more cohesively.]
>
>>1) While I understand that dynamically typed language should specify
>>type-info to generate programs that don’t cause runtime type mismatch, I
>>find it weird that types are very integral to the framework. For example,
>>binder-info seems to presuppose that declaration AST nodes will have type
>>information in it.
>>
>>On one hand, that’s strictly speaking not the case for many languages.
>>
>>On the other hand, I can understand that the specified grammar here doesn’t
>>need to correspond to the actual language grammar. So let’s just add the
>>type field into the AST node. But this raises another question: how do I
>>know which other AST nodes will require additional type information as well?
>
>If you really want to have a language where anything goes, you can
>just create a `dyn` base-type and use it everywhere.  But, as you
>mention, fuzzing will likely be ineffective since things will mostly
>just crash with runtime type errors.  If you want to encode the fact
>that various types can be implicitly coerced, you can add explicit
>conversion AST nodes for Xsmith that go away when you pretty-print the
>program.  Eg. you can define an IntToStringCoersion expression that is
>rendered by just rendering the integer inside without adding anything
>else to it.
>
>Every node in the AST has type information.
>
>Generally, an Xsmith language fuzzer will have 2-3 main AST node
>types: Expression, Statement, and Definition.  Definitions are
>separate to simplify Xsmith itself to do something consistent --
>definitions can be expressions, statements, or something separate in
>various languages, but always having Definition nodes in Xsmith makes
>generic name analysis easier.
>
>Expressions have the types you expect -- string, int, list-of-string,
>etc.
>
>Statement types are a bit of a hack.  Statements have two types:
>no-return-statement and return-statement, which is a wrapper type that
>includes an expression type inside that corresponds to the return type
>of the function.
>
>Definition nodes should have expression types, IE the type of the
>right-hand-side of the definition.
>
>In practice, you probably want to use canned-components to get
>Definition node definitions and probably Statement node definitions as
>well.  If you do that you don't need to provide type information for
>them.  The main place you need to worry about types is with
>expressions.
>
>As an aside, Function definitions are often a special case in
>programming languages, but we basically encode them as a definition
>with a lambda on the right hand side that we maybe print differently.
>(If you want to have both definitions with a lambda on the RHS and
>function definitions with your language's shorthand syntax, you can
>define a new Definition node subtype that prints differently.)
>
>
>>2) As far as I can see, all declaration nodes have types to be that of the
>>expression it contains. Again, I find this weird since declaration in most
>>languages is a statement, not an expression. I tried fixing this by
>>modifying the “Another Small Example with Variables” example to:
>>
>>(add-property
>>arith
>>type-info
>>[Definition [no-return-type (λ (n t) (hash 'Expression
>>(fresh-type-variable)))]]
>>[LetStar [(fresh-type-variable)
>>          (λ (n t) (hash 'definitions (λ (cn) no-return-type)
>>                         'sideEs (λ (cn) (fresh-type-variable))
>>                         'Expression t))]]
>>...)
>>
>>where no-return-type is from xsmith/canned-components. The generation
>>errors with with:
>>
>>Exception:
>>unify!: subtype-unify!: can't unify these types: #<type-variable
>>(#<range:#<no-return-type>-#<no-return-type>>)> and #<int>
>
>You need the type of the definition to be the type of the expression
>of the RHS of the definition (modulo subtyping).  Despite the fact
>that when printing you may make the definitions look like statements,
>Xsmith just requires Definition nodes to have the same type as
>variable references to that definition, and keeps Definition and
>Statement nodes separate.
>
>The return-type/no-return-type is sort of a hack to use the type
>system to keep track of statements as well as expressions.  In eg. a
>function, you need to track the return type throughout the function to
>be sure that the function does return (potentially in multiple
>branches of a conditional), and that all return statements return the
>right type.  So by annotating which nodes do or don't return
>something, Xsmith can be sure to generate well-formed functions.
>
>The key to using Definitions in a statement language is basically a
>`Block`-like statement that can have a series of statements that
>include definitions needs to have a list of definitions and a list of
>statements separately.  This is a bit of a limitation, since many
>languages can interleave definitions and statements, and Xsmith
>basically can't.  At any rate, the `Block` node from canned-components
>does this.
>
>Probably the best way to see how to do it in a full-sized (but still
>relatively simple) example is to look at the simple/javascript.rkt
>file in the xsmith-examples directory.  (We haven't actually used that
>one yet, so I'm not 100% certain it generates actually correct
>javascript, but it should be close if it's wrong.  At any rate, it's
>probably the simplest full-size example to see the correlation between
>the definition of the fuzzer and its output.)
>
>The `Block` statement in canned-components, and that the javascript
>example uses, is the main place to put definitions in a
>statement-based language.  In the javascript example we have a series
>of top-level definitions by using `ProgramWithBlock`, then basically
>unwrapping the “block” and printing the definitions in the top level.
>It also uses `LambdaWithBlock` for function definitions.  So if you
>look at how it's used it should clarify how to have definitions in a
>statement language.
>
>>In most examples that I saw, declarations are hoisted at the very top of
>>function / block / program. I think the no-return-type is particularly
>>going to be a problem when I want to allow declarations and statements to
>>interleave, since statements will be assigned a unique type (like
>>no-return-type), while declarations can’t do the same.
>>
>>3) How do I debug when XSmith simply hangs and doesn’t output anything? It
>>looks like the problems I encountered are with types, since if I relax the
>>constraints, it can generate programs without any problem (though the
>>generated programs are incorrect, as expected).
>
>Xsmith is probably hanging because it is in a lift loop.  IE it has
>created a Definition node, and it needs to generate a right-hand-side
>for the definition.  When it tries to generate the RHS expression, the
>only legal expression it can find is VariableReference.  So it creates
>a VariableReference, and it can't find a (non-circular) variable of
>that type to reference, so it lifts one.  Then it needs to make an RHS
>for that definition as well, and ...
>
>We really ought to catch this situation somehow, crash, and report an
>error.  But we don't at the moment.  Sorry.  If you hit control-C and
>get the debug output, you can see the type of the lift cycle (by just
>seeing what is being lifted over and over), and then try to figure out
>why nothing else is legal.  You may have forgotten a literal node, or
>maybe the literal node has holes (eg. a lambda is a literal but not
>atomic, it needs statements/expressions inside), meaning that you need
>to add an annotation of `#:prop wont-over-deepen #t` so it will
>generate it even if it's at the max AST depth.
>
>>4) In xsmith/canned-components, there are several predefined types, but the
>>module only exports constructors, not accessors. What would be the best way
>>to extract information from it? unify!?
>
>Yes, use `unify!`. Eg.
>
>```
>(define my-return-type (fresh-type-variable))
>(define f (function-type (fresh-type-variable) my-return-type))
>(unify! f some-function-type)
>```
>
>This is something I should also improve but haven't yet.  The basic
>accessors can fail when logically they should succeed because they may
>be passed a type variable instead of the struct for the type you think
>it is.  So if you try to use eg. `(function-type t)` it might fail.  I
>should provide a `function-type!` function that does the obvious
>struct creation and unification, but I haven't yet.
>
>I may add that over the coming weeks.  I've been busy with some other
>things, but am shifting my gears back to Xsmith development, and hope
>to finish a few final features (eg. work on stuff to do
>feedback-directed fuzzing) and do some serious fuzzing with it.
>(We've run fuzz campaigns and found some bugs in the past, but mostly
>switched back to add-more-features mode to try to make it more
>effective, and haven't quite gotten back to actually trying to find
>bugs.)
>
>>Thanks!
>>Sorawee
>
>I hope you find Xsmith useful!  What are you intending to fuzz?  I'm
>also happy to look at any code you have on Github or such for some
>debugging help.