The paper proposes a set of principles and a general architecture that may explain how language and meaning originate and complexify in a group of physically grounded, distributed agents. An experimental setup is introduced for concretising and validating specific mechanisms based on these principles. The setup consists of two robotic heads that watch static or dynamic scenes and engage in language games, in which one robot describes to the other what it sees. First results are reported from experiments showing the emergence of distinctions, of a lexicon, and of primitive syntactic structures.