The Project

Objectives and scientific hypotheses

In a nutshell, this pioneering interdisciplinary research will lead to a comprehensive linguistic analysis of the textualization process, i.e. the real-time progressive construction of a text. We will study bursts of writing, which are textual segments produced between two pauses, in order to provide insight into the relation between regularities of language performance and cognitive and contextual constraints. The aim is to understand some of the layout mechanisms that allow language to give rise to novelty out of known and prefabricated data. The Pro-TEXT project will develop linguistic and psycholinguistic methods and machine-learning tools to model these regularities and provide evidence about patterns of text processing.

Research Issues. Basically, a text is a configuration pertaining to the highest level of linguistic complexity and constituting a communicative unit. Yet despite the ubiquity of texts and our empirical awareness of them, there is still no theoretical model of textualization as both a process and a product: in the current state of the art there is no consensual theoretical definition of text (cf. Adam 2014), we are not fully informed on how a text is built, and automatic text-generation approaches have not yet found a satisfactory model (cf. Adorni & Zock 1996; Bowman et al. 2015 inter alii). Indeed, texts, and more specifically written texts, are produced under complex constraints, some of which were impossible to capture until recently because the textualization process itself could not be observed.

We contend that a better knowledge of the dynamics of textualization processes will

  • make it possible to grasp the mechanisms that connect structure, genre constraints and pragmatic aims;
  • help understand the way language unit layouts achieve qualitative leaps;
  • unveil the moves that enable a qualitatively new product, the text, to be forged out of available data and structures.

Real-time recording of the writing process using key-stroke logging provides access to the dynamics of the textualization process, and the current development of linguistics, cognitive psychology and machine learning provides the tools for its analysis.

Put simply, written or oral language performances are incremental linearizations constrained by temporality. The temporal axis compels a hierarchisation of thought, which is complex and non-linear. The knowledge stored in long-term memory is a network which includes multiple levels of representation. Language production calls for the linearization of these contents through organized linguistic strings, including topological disfluencies due to revisions of the immediately preceding text or of more distant fragments. During the textualization process, spontaneous language production is segmented by pauses. A textual segment produced between two pauses is called a burst (Chenoweth & Hayes 2001):

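e.g. (each burst is the segment between two [pause] markers):

une cousine qui [pause] peut venir partager du temps avec elle pendant [pause] le [pause] w [pause] eek [pause] – [pause] end. [pause]

(English gloss: "a cousin who can come and spend time with her during the weekend.")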
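As a rough illustration of the burst definition, the sketch below splits a pause-annotated transcript into bursts and, alternatively, groups timestamped keystrokes into bursts whenever the gap between two keystrokes reaches a threshold. This is not the project's tooling: the function names, the (timestamp, character) event format and the 2-second default threshold are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

PAUSE_MARKER = "[pause]"  # marker used in the annotated example


def bursts_from_transcript(transcript: str) -> List[str]:
    """Split a pause-annotated transcript into bursts (hypothetical helper)."""
    parts = transcript.split(PAUSE_MARKER)
    # Drop surrounding whitespace and the empty piece after a trailing marker.
    return [part.strip() for part in parts if part.strip()]


def bursts_from_keystrokes(
    events: List[Tuple[float, str]],
    threshold: float = 2.0,  # assumed pause threshold, in seconds
) -> List[str]:
    """Group (timestamp_seconds, character) events into bursts.

    A new burst starts whenever the gap since the previous keystroke
    is at least `threshold` seconds. The event format is an assumption.
    """
    bursts: List[str] = []
    current: List[str] = []
    previous_time: Optional[float] = None
    for time, char in events:
        gap_is_pause = previous_time is not None and time - previous_time >= threshold
        if gap_is_pause and current:
            bursts.append("".join(current))
            current = []
        current.append(char)
        previous_time = time
    if current:
        bursts.append("".join(current))
    return bursts


example = (
    "une cousine qui [pause] peut venir partager du temps avec elle pendant "
    "[pause] le [pause] w [pause] eek [pause] – [pause] end. [pause]"
)
transcript_bursts = bursts_from_transcript(example)

# Invented timings, for illustration only.
toy_events = [(0.0, "l"), (0.2, "e"), (3.0, "w"), (5.5, "e"), (5.6, "e"), (5.7, "k")]
keystroke_bursts = bursts_from_keystrokes(toy_events)
```

Applied to the example above, the first function yields seven bursts, including the fragments "w", "eek", "–" and "end." of a single word. This shows that burst boundaries need not coincide with word or clause boundaries.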

Objectives. The Pro-TEXT project aims to elucidate the dynamics of the textualization process by modeling the relations between the temporal indices of cognitive processes (such as pauses) and the nature of bursts of writing in French and in English-French translations. We argue that the way linguistic sequences linearly articulate during the process of textualization accounts for: i) the multilevel constraints underlying language performance and ii) specific relationships outside the scope of clause syntax.

The aims are i) to unearth the linguistic strings chosen by writers to build up their texts and the links by which they are interconnected; ii) to identify the types of sequences that constitute the linguistic material for textualization; iii) to establish the rules and layout regularities that support their organization into a formally and semantically valid text, together with the combinatorial strategies used by writers in various contexts and text genres; iv) to interpret pauses and bursts of writing by identifying the cognitive processes underlying them and how variations in cognitive demands affect them, as well as the linguistic forms and functions of bursts.

Incremental machine-learning approaches will fill a gap in the analysis and representation of real-time language performance, while revealing regularities that went unnoticed under previous methodologies.

Hypothesis. Our hypothesis is threefold:

  • First, we think that, despite revision disfluencies, the textualization process follows an incremental model which is mainly linear and operates with chunks segmented according to their semantic value, making the content-structure mapping evolve towards a communicatively relevant unit, the text.
  • Second, this functional model is general, above and beyond generic and individual registers.
  • Third, for the above-mentioned reasons, we consider that pause duration and location are defined by complex cognitive-linguistic constraints that do not reproduce traditional syntactic or sequential segmentation but follow semantic and constructional rules.

The expected research results are i) an extensive description of language performance units produced spontaneously during the textualization process; ii) a categorisation of types of pauses; iii) a model of the textualization process.


