The Project

Objectives and scientific hypotheses

In a nutshell, this pioneering interdisciplinary research will lead to a comprehensive linguistic analysis of the textualization process, i.e. the real-time progressive construction of a text. We will study bursts of writing, which are textual segments produced between two pauses, in order to provide insight into the relation between regularities of language performance and the cognitive and contextual constraints. The aim is to understand some of the layout mechanisms that allow language to give rise to novelty out of known and prefabricated data. The Pro-TEXT project will develop linguistic and psycholinguistic methods and machine-learning tools to model these regularities and provide evidence about patterns of text processing.

Research Issues. A text is a configuration at the highest level of linguistic complexity that constitutes a communicative unit. Yet despite the ubiquity of texts and our empirical familiarity with them, there is still no theoretical model of textualization as both a process and a product. In the current state of the art there is no consensual theoretical definition of text (cf. Adam 2014), we are not fully informed about how a text is built, and automatic text-generation approaches have not yet found a satisfactory model (cf. Adorni & Zock 1996; Bowman et al. 2016 inter alii). Indeed, texts, and more specifically written texts, are produced under complex constraints, some of which were impossible to capture until recently because the textualization process itself was inaccessible to observation.

We contend that a better knowledge of the dynamics of textualization processes will

  • make it possible to grasp the mechanisms that connect structure, genre constraints and pragmatic aims;
  • help understand the way language unit layouts achieve qualitative leaps;
  • unveil the moves that enable a qualitatively new product, the text, to be forged out of available data and structures.

Real-time recording of the writing process using key-stroke logging provides access to the dynamics of the textualization process, and the current development of linguistics, cognitive psychology and machine learning provides the tools for its analysis.

Put simply, written or oral language performances are incremental linearizations constrained by temporality. The temporal axis compels a hierarchisation of thought, which is complex and non-linear. The knowledge stored in long-term memory is a network that includes multiple levels of representation. Language production calls for the linearization of these contents through organized linguistic strings, including topological disfluencies due to revisions of the immediately preceding text or of more distant fragments. During the textualization process, spontaneous language production is segmented by pauses. A textual segment produced between two pauses is called a burst (Chenoweth & Hayes 2001):

e.g. (the bursts are the segments between the [pause] markers):

une cousine qui [pause] peut venir partager du temps avec elle pendant [pause] le [pause] w [pause] eek [pause] – [pause] end. [pause]
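
The notion of a burst can be illustrated with a minimal sketch, assuming a keystroke log reduced to (timestamp in milliseconds, character) pairs. The function name `segment_bursts`, the `Burst` record and the 2,000 ms pause threshold are assumptions made for this illustration, not the project's actual tooling or parameters. Deletions and cursor movements, which real logs contain, are ignored here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Burst:
    text: str                # characters typed in this burst
    start_ms: int            # timestamp of the first keystroke
    end_ms: int              # timestamp of the last keystroke
    preceding_pause_ms: int  # pause that opened the burst (0 for the first)


def segment_bursts(keystrokes: List[Tuple[int, str]],
                   pause_threshold_ms: int = 2000) -> List[Burst]:
    """Split a chronological keystroke log into bursts.

    A new burst starts whenever the gap between two consecutive
    keystrokes reaches the (assumed) pause threshold.
    """
    bursts: List[Burst] = []
    chars: List[str] = []
    start_ms = prev_ms = 0
    pause_before = 0
    for t, ch in keystrokes:
        gap = t - prev_ms
        if chars and gap >= pause_threshold_ms:
            # The gap counts as a pause: close the current burst.
            bursts.append(Burst("".join(chars), start_ms, prev_ms, pause_before))
            chars = []
            pause_before = gap
        if not chars:
            start_ms = t
        chars.append(ch)
        prev_ms = t
    if chars:
        bursts.append(Burst("".join(chars), start_ms, prev_ms, pause_before))
    return bursts


# Hypothetical timings for the fragment "le [pause] w [pause] eek".
log = [(0, "l"), (100, "e"), (2500, "w"), (5000, "e"), (5100, "e"), (5200, "k")]
print([b.text for b in segment_bursts(log)])  # ['le', 'w', 'eek']
```

Keeping the preceding pause alongside each burst makes it possible to relate pause duration to the linguistic nature of the segment that follows it.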

Objectives. The Pro-TEXT project aims to elucidate the dynamics of the textualization process by modeling the relations between the temporal indices of cognitive processes (such as pauses) and the nature of bursts of writing in French and in English-French translations. We argue that the way linguistic sequences linearly articulate during the process of textualization accounts for: i) the multilevel constraints underlying language performance and ii) specific relationships outside the scope of clause syntax.

The aims are i) to unearth the linguistic strings chosen by writers to build up their texts and the links by which they are interconnected; ii) to identify the types of sequences that constitute the linguistic material for textualization; iii) to establish the rules and layout regularities that support their organization into a formally and semantically valid text, together with the combinatorial strategies used by writers in various contexts and text genres; iv) to interpret the pauses of production and the bursts of writing by identifying the cognitive processes underlying them and how variations in cognitive demands affect these pauses and bursts, as well as the linguistic forms and functions of bursts.

Machine-learning incremental approaches will fill a gap in the analysis and representation of real-time language performance, while revealing regularities that remain unremarked under the methodologies used previously.

Hypothesis. Our hypothesis is threefold:

  • First of all, we think that, despite revision disfluencies, the textualization process follows an incremental model which is mainly linear and operates with chunks segmented pursuant to their semantic value, making the content-structure mapping evolve towards a communicatively relevant unit, the text.
  • Second, this functional model is general, above and beyond generic and individual registers.
  • Third, for the above-mentioned reasons we consider that pause duration and location are defined by complex cognitive-linguistic constraints that do not reproduce traditional syntactic or sequential segmentation but follow semantic and constructional rules.

The expected research results are i) an extensive description of the language performance units produced spontaneously during the textualization process; ii) a categorisation of types of pauses; iii) a model of the textualization process.
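
As a purely illustrative starting point for thinking about pause location, the sketch below labels a pause by the characters immediately around it. The function `pause_location`, its four labels and the set of sentence-final marks are hypothetical choices made for this example. They are a surface-level baseline only and do not represent the semantic and constructional categories the project hypothesises.

```python
SENTENCE_END = ".!?"  # assumed set of sentence-final punctuation marks


def pause_location(before: str, after: str) -> str:
    """Label a pause from the text typed before it and the text typed after it.

    Returns one of: 'text-initial', 'sentence-boundary',
    'word-boundary', 'within-word' (labels are illustrative).
    """
    if not before.strip():
        return "text-initial"
    trimmed = before.rstrip()
    if trimmed[-1] in SENTENCE_END:
        return "sentence-boundary"
    if before[-1].isspace() or not after or after[0].isspace():
        return "word-boundary"
    return "within-word"


# Pauses taken from the example fragment given earlier in the section.
print(pause_location("pendant le w", "eek"))  # within-word
print(pause_location("pendant ", "le"))       # word-boundary
```

A baseline of this kind could serve as a point of comparison: if pauses followed surface segmentation alone, such labels would suffice, whereas the third hypothesis predicts that they will not.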


Adam J.-M. (Ed.) 2014. Faire Texte. Frontières textuelles et opérations de textualisation. Besançon: Annales littéraires de l'Université de Franche-Comté.

Adorni G., Zock M. (Eds) 1996. Trends in Natural Language Generation: an Artificial Intelligence Perspective. New York: Springer Verlag.

Aggarwal C., Yu P. 2007. A Survey of Synopsis Construction Methods in Data Streams. In C. Aggarwal (Ed.) Data Streams: Models and Algorithms. Springer, p. 169–207.

Alves R. A., Branco M., Castro S. L., Olive T. 2011. Children of high transcription skill compose using bigger language bursts. In V. W. Berninger (Ed.) Past, Present, and Future Contributions of Cognitive Writing Research to Cognitive Psychology. New York: Psychology Press.

Alves R. A., Castro S. L., de Sousa L., Strömqvist S. 2007. Influence of typing skill on pause-execution cycles in written composition. In M. Torrance, L. van Waes, D. Galbraith (Eds) Writing and Cognition: Research and Applications. Amsterdam: Elsevier, p. 55–65.

Alamargot D., Dansac C., Chesnet D., Fayol M. 2007. Parallel processing before and after pauses: A combined analysis of graphomotor and eye movements during procedural text production. In M. Torrance, L. van Waes, D. Galbraith (Eds) Writing and Cognition: Research and Applications. Amsterdam: Elsevier, p. 13–29.

Asher N., Lascarides A. 2003. Logics of Conversation. Cambridge – New York: Cambridge University Press.

Auer P. 2005. Projection in interaction and projection in grammar. Text – Interdisciplinary Journal for the Study of Discourse 25(1): 7–36.

Ben-David S., von Luxburg U., Pál D. 2006. A sober look at clustering stability. In G. Lugosi, H. Simon (Eds) Learning Theory. Lecture Notes in Computer Science, vol. 4005. Berlin/Heidelberg: Springer, p. 5–19.

Benzitoun C., Dister A., Gerdes K., Kahane S., Pietrandrea P., Sabio F., Debaisieux J.-M. 2010. Tu veux couper là faut dire pourquoi. Propositions pour une segmentation syntaxique du français parlé. In F. Neveu et al. (Eds) Actes du 2ᵉ Congrès Mondial de Linguistique Française. Paris: Institut de Linguistique Française, p. 2075–2090.

Berrendonner A. 2016. Attentes et insertions parenthétiques. Langue française 192: 37-52.

Bhatia V. K. 1993. Analysing Genre. Language Use in Professional Settings. London/New York: Longman.

Biber D. 2009. A corpus-driven approach to formulaic language in English. Multi-word patterns in speech and writing. International Journal of Corpus Linguistics 14(3): 275–311.

Blanche-Benveniste C. 1987. Syntaxe, choix du lexique et lieux de bafouillage. DRLAV 36-37: 123–157.

Blumenthal-Dramé A. 2013. Entrenchment in Usage-Based Theories. What Corpus Data Do and Do Not Reveal About the Mind. Berlin: De Gruyter Mouton.

Boré C. (Ed.) 2016. La phrase en production d’écrits, approches nouvelles en didactique. Lidil 54. https://lidil.revues.org/4020

Bouveret M., Legallois D. (Eds) 2012. Constructions in French. Amsterdam/Philadelphia: John Benjamins.

Bowman S. R., Vilnis L., Vinyals O., Dai A. M., Jozefowicz R., Bengio S. 2016. Generating sentences from a continuous space. Proceedings of CoNLL.

Brazil D. 1995. A Grammar of Speech. Oxford: Oxford University Press.

Bybee J. 2010. Language, Usage and Cognition. Cambridge: Cambridge University Press.

Cabanes G., Bennani Y., Grozavu N. 2013. Unsupervised Learning for Analyzing the Dynamic Behavior of Online Banking Fraud. ICDM Workshops, Dallas, USA: 513–520.

Cabanes G., Bennani Y. 2012. Change detection in data streams through unsupervised learning. IJCNN, Brisbane, Australia: 1-6.

Chafe W. 1992. Information flow in speaking and writing. In P. Downing, S. D. Lima, M. Noonan (Eds) The Linguistics of Literacy. Amsterdam – Philadelphia: John Benjamins, p. 17-29.

Chenoweth N. A., Hayes J. R. 2001. Fluency in Writing: Generating text in L1 and L2. Written Communication 18(1): 80–98.

Christiansen M. H., Chater N. 2016. Creating Language: Integrating Evolution, Acquisition, and Processing. Cambridge, MA: The MIT Press.

Cislaru G., Olive T. 2018. Le processus de textualisation. Bruxelles: De Boeck.

Conklin K., Schmitt N. 2008. Formulaic sequences: Are they processed more quickly than nonformulaic language by native and nonnative speakers? Applied Linguistics 29: 72–89.

Cornuejols A., Wemmert C., Gancarski P., Bennani Y. 2018. Collaborative clustering: Why, when, what and how. Information Fusion 39: 81–95.

Daems J., Carl M., Vandepitte S., Hartsuiker R., Macken L. 2016. The Effectiveness of Consulting External Resources During Translation and Post-Editing of General Text Types. In J. W. Schwieter, A. Ferreira (Eds) Translation and Cognition. An Overview. John Wiley & Sons, p. 111–134.

Denis P., Sagot B. 2012. Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging. Language Resources and Evaluation 46(4): 721–736.

Doquet C. 2011. L’Écriture débutante. Rennes: Presses universitaires de Rennes.

Erman B., Warren B. 2000. The idiom principle and the open choice principle. Text 20: 29–62.

Foulin J.-N. 1995. Pauses et débits : Les indicateurs temporels de la production écrite. L’Année Psychologique 95: 483–504.

Grozavu N., Bennani Y. 2010. Topological collaborative clustering. Australian Journal of Intelligent Information Processing Systems 12(2).

Grozavu N., Cabanes G., Chahdi H., Rogovschi N. 2016. Automated topological co-clustering using fuzzy features partition. IJCNN, Killarney, Ireland: 2638-2645.

Guha S., Harb B. 2008. Approximation algorithms for wavelet transform coding of data streams. IEEE Transactions on Information Theory 54(2): 811–830.

Hoey M. 2005. Lexical Priming: A New Theory of Words and Language. London: Routledge.

Holat P., Plantevit M., Raïssi C., Tomeh N., Charnois T., Crémilleux B. 2014. Sequence classification based on delta-free sequential patterns. IEEE International Conference on DataMining (ICDM 14): 170-179.

Jayez J., Dargnat M. 2012. The Semantics of French Continuative Rises in SDRT. In A. Benz, M. Stede, P. Kühnlein (Eds) Constraints in Discourse 3. Representing and inferring discourse structure. Amsterdam/Philadelphia: John Benjamins, p. 109–135.

Kaufer D., Hayes J. R., Flower L. 1986. Composing written sentences. Research in the Teaching of English 20: 121-140.

Kehler A. 2002. Coherence, Reference and the Theory of Grammar. Stanford University: CSLI Publications.

Leblay C., Caporossi G. 2015. A graph theory approach to online writing data visualization. In G. Cislaru (Ed.) Writing(s) at the Crossroads. The process-product interface. Amsterdam: John Benjamins, p. 171–181.

Lefeuvre F., Moline E. (Eds) 2011. Unités syntaxiques et unités prosodiques [special issue]. Langue française 170.

Legallois D., Tutin A. (Eds) 2013. Vers une extension du domaine de la phraséologie [special issue]. Langages 189.

Leijten M., Van Waes L. 2006. Inputlog: New perspectives on the logging of on-line writing. In K. P. H. Sullivan, E. Lindgren (Eds) Computer Keystroke Logging and Writing: Methods and applications. Amsterdam: Elsevier, p. 73–94.

Limpo T., Alves R. A. 2017. Written Language Bursts Mediate the Relationship Between Transcription Skills and Writing Performance. Written Communication 29: 306-332. 

Mann W. C., Thompson S. 1988. Rhetorical Structure Theory: Toward a functional theory of text organization. Text 8(3): 243–281.

Medimorec S., Risko E. F. 2017. Pauses in written composition: on the importance of where writers pause. Reading and Writing: an Interdisciplinary Journal 30(6): 1267–1285.

Olive T. 2010. Methods, tools and techniques for the on-line study of the writing process. In N. L. Mertens (Ed.) Writing: Processes, Tools and Techniques. NY: Nova Publisher, p. 1–18.

Olive T., Alves R. A., Castro S. L. 2009. Cognitive processes in writing during pauses and execution periods. European Journal of Cognitive Psychology 21: 758–785.

Olive T., Kellogg R. T. 2002. Concurrent activation of high-and low-level production processes in written composition. Memory and Cognition 30: 594–600.

Olive T. 2014. Toward an Incremental and Cascading Model of Writing: A review of research on writing processes coordination. Journal of Writing Research 6: 173-194.

Perret C., Kandel S. 2014. Taking advantage of between- and within-participant variability? Frontiers in Psychology 5: 1235.

Proust-Lima C., Séne M., Taylor J.M.G., Jacqmin-Gadda H. 2014. Joint latent class models for longitudinal and time-to-event data: A review. Statistical Methods in Medical Research 23: 74-90.

Rogovschi N., Lebbah M., Bennani Y. 2008. Probabilistic Mixed Topological Map for Categorical and Continuous Data. ICMLA, San Diego, USA: 224-231.

Schilperoord J. 2002. On the Cognitive Status of Pauses in Discourse Production. In T. Olive, C. M. Levy (Eds) Contemporary Tools and Techniques for Studying Writing. Dordrecht: Kluwer Academic Press, p. 61–87.

Schwieter J. W., Ferreira A. 2016. Translation and Cognition. An Overview. John Wiley & Sons.

Sinclair J.R. 1991. Corpus, Concordance and Collocation. Oxford: Oxford University Press.

Sinclair J., Mauranen A. 2006. Linear Unit Grammar: Integrating speech and writing. Amsterdam – Philadelphia: John Benjamins.

Sublime J., Matei B., Grozavu N., Bennani Y., Cornuejols A. 2017. Entropy Based Probabilistic Collaborative Clustering. Pattern Recognition 72: 144–157.

Swales J. 1990. Genre Analysis. Cambridge: Cambridge University Press.

Torrance M., van Waes L., Galbraith D. (Eds) 2007. Writing and Cognition: Research and applications. In G. Rijlaarsdam (Series Ed.) Studies in Writing (Vol. 19). Amsterdam: Elsevier.

Vandaele S. 2007. Quelques repères épistémologiques pour une approche cognitive de la traduction. Application à la traduction spécialisée | Some epistemological reference points for a cognitive approach of translation. Application in specialized translation in biomedicine. Meta 52(1): 129–145.

Verschueren J., Brisard F. 2009. Adaptability. In J. Verschueren, J.-O. Östman (Eds) Key Notions for Pragmatics. Amsterdam – Philadelphia: John Benjamins, p. 28–47.

Wengelin A., Torrance M., Holmqvist K., Simpson S., Galbraith D., Johansson V., Johansson R. 2009. Combined eyetracking and keystroke-logging methods for studying cognitive processes in text production. Behavior Research Methods 41(2): 337–351.

Zimina M., Fleury S. 2014. Trameur: A Framework for Annotated Text Corpora Exploration. Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: 57-61.