Software

GIAT develops its own procedures for textual data analysis, mainly in R.

Many software packages that combine statistical, IT and linguistic resources are available, both free and commercial, and many general-purpose data analysis suites offer text mining tools (for example, IBM SPSS Text Analytics or SAS Text Miner). No single product is complete and suitable for every application; when choosing a package it is therefore necessary to keep in mind the main goal of the research question, and an integration of different tools is often necessary.


R (The R project for Statistical Computing)

R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, clustering, etc.) and graphical ones. One of R’s strengths is the ease with which well-designed, publication-quality plots can be produced, including mathematical symbols and formulae where needed. R is available in source code form as Free Software under the terms of the Free Software Foundation’s GNU General Public License. The R environment offers many packages for text mining applications, for example tm, quanteda and udpipe.
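
As a minimal illustration (a sketch assuming the CRAN package tm is installed), the following R code builds a document-term matrix from two toy documents; the same pipeline scales to real corpora:

    library(tm)

    # Two toy documents; in practice these would be read from files.
    docs <- c("Text mining combines statistical and linguistic resources.",
              "R offers many packages for text mining applications.")
    corp <- VCorpus(VectorSource(docs))
    corp <- tm_map(corp, content_transformer(tolower))
    corp <- tm_map(corp, removePunctuation)
    dtm  <- DocumentTermMatrix(corp)   # rows = documents, columns = terms
    inspect(dtm)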


CAQDAS

In the human and social sciences many software packages support content analysis and the qualitative and quantitative analysis of textual data. CAQDAS (Computer-Assisted Qualitative Data Analysis Software) packages form a large software family offering search and query tools, corpus organization and annotation. They are used mainly for content analysis in sociology and psychology; these products help researchers during the phase of recoding text into conceptual categories. Basically, they speed up the processes of information retrieval and automatic recoding. The most common are: Atlas.ti, Dedoose, Ethnograph, MAXQDA, WordStat, RQDA, N6 and NVivo.


Analysis of Textual Data

Alceste

Alceste (Analyse des Lexèmes Co-occurrents dans les Énoncés d’un Texte) is a textual data analysis software package developed by the company Image together with the French National Centre for Scientific Research (CNRS). Alceste first analyses the vocabulary of a corpus and builds a dictionary of its words with their roots and frequencies. It then cuts the text into homogeneous segments containing a sufficient number of words and classifies these segments by identifying the strongest oppositions between them. This method extracts classes of meaning, made up of the most specific words and sentences of each class; these classes represent the main ideas and themes of the corpus. The overall results, sorted according to their relevance and accompanied by several graphical representations and analysis reports, allow the user an easy and effective interpretation. Alceste treats any type of text, in several languages, and has applications in many different fields.
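
Alceste itself is proprietary, but the general idea of grouping text segments by their vocabulary can be sketched in R; the toy example below uses the quanteda package and ordinary hierarchical clustering, not Alceste’s actual algorithm:

    library(quanteda)

    # Four toy "segments" drawn from two lexical worlds (economy vs. school).
    segs <- c("inflation prices markets economy",
              "prices economy trade markets",
              "school pupils teachers education",
              "teachers education classroom pupils")
    dfmat <- dfm(tokens(segs, remove_punct = TRUE))
    d  <- dist(as.matrix(dfmat), method = "binary")  # presence/absence distance
    cl <- hclust(d, method = "ward.D2")
    cutree(cl, k = 2)   # two classes of segments, grouped by vocabulary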

Iramuteq

Iramuteq is a free (as in free speech) software package (GNU GPL licence) for data and text mining. It is based on R (IRaMuTeQ stands for R Interface for Multidimensional analysis of Texts and Questionnaires) and on the Python programming language. It can perform different types of text analysis and visualization on large text corpora (over hundreds of millions of occurrences). One of its particularities is that it reproduces the Reinert analysis (1983, 1991).

JGAAP

JGAAP (Java Graphical Authorship Attribution Program) is the EVL Lab's flagship product. JGAAP is an open-source software project that allows non-experts to apply some of the latest machine learning methods to their text classification problems. The Evaluating Variations in Language (EVL) Lab is a National Science Foundation (NSF) funded lab that studies different applications of stylometry, e.g. authorship attribution, personality detection, author profiling and author verification.
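
One basic stylometric idea used in authorship attribution, comparing relative frequencies of common function words, can be sketched in a few lines of base R. This is an illustration only, not JGAAP’s own code, and the texts are toy examples:

    # Toy texts; a real analysis would use long samples of known authorship.
    texts <- c(
      known_a = "the cat sat on the mat and the dog sat in the hall",
      known_b = "of mice and of men and of the sea and of the stars",
      unknown = "the bird sat on the fence and the cat sat in the sun"
    )
    fwords <- c("the", "of", "and", "on", "in")     # common function words
    freq <- sapply(texts, function(t) {
      toks <- tolower(unlist(strsplit(t, "\\W+")))
      sapply(fwords, function(w) mean(toks == w))   # relative frequencies
    })
    dist(t(freq))   # smaller distance = more similar function-word profile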

Lexico

Working with Lexico, the user maintains control over the entire lexicometric process, from initial segmentation to the publication of final results (segmentation, concordances, breakdown in graphic form, characteristic elements, and factorial analyses of repeated forms and segments). The main improvement in the latest versions is an object-oriented program architecture: the interactive modules can now exchange more complex data (forms, repeated segments and, in future, co-occurrences). This version also allows more precise characterization of the different parts of a corpus according to their most frequently employed forms, by isolating sections of the text in which this sort of distribution is particularly evident. Projecting these sections onto diagrams that represent the text allows the creation of a veritable textual topography.
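
The notion of repeated segments can be sketched in R, for example by counting word n-grams that occur more than once (an illustration with the quanteda package; Lexico’s own implementation differs):

    library(quanteda)

    txt <- "the rights of man and the rights of the citizen are the rights of all"
    ng  <- tokens_ngrams(tokens(txt), n = 2:3, concatenator = " ")
    tab <- table(unlist(as.list(ng)))
    tab[tab > 1]   # word sequences that occur more than once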

Nooj

NooJ is a linguistic development environment that includes large-coverage dictionaries and grammars and parses corpora in real time. It includes tools to create and maintain large-coverage lexical resources, as well as morphological and syntactic grammars. Dictionaries and grammars are applied to texts in order to locate morphological, lexical and syntactic patterns and to tag simple and compound words. It can build complex concordances for all types of finite-state and context-free patterns. Users can easily develop extractors to identify semantic units in large texts, such as names of persons, locations, dates, technical expressions of finance, etc.
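
NooJ relies on hand-built dictionaries and grammars; as a rough plain-R stand-in, the sketch below locates date expressions with a simple regular expression only:

    txt <- "The meeting of 12 March 2020 was moved to 3 April 2021."
    pat <- "\\b\\d{1,2} (January|February|March|April|May|June|July|August|September|October|November|December) \\d{4}\\b"
    regmatches(txt, gregexpr(pat, txt))   # -> "12 March 2020" "3 April 2021"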

Sphinx

Sphinx iQ offers a user-friendly environment for managing whole projects of study design, data collection and communication of results. Sphinx opens and imports any kind of corpus, for example speeches, websites, non-directive interviews, focus groups, bibliographic databases, etc. It produces glossaries and supports lexical browsing to highlight specific subjects and associations (concordances and related lexicons). Sphinx analyses language structures through syntactic analysis (a lemmatizer). It highlights textual specificities and displays them on charts. Finally, it is useful for lexicometry: it measures and codifies the text’s lexical features and creates the corresponding variables.
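
The lexicometric step, measuring the lexical features of a text and turning them into variables, can be sketched in a few lines of base R; this is an illustration, not Sphinx’s implementation:

    txt  <- "Sphinx measures and codifies the lexical features of a text."
    toks <- tolower(unlist(strsplit(txt, "\\W+")))
    toks <- toks[nzchar(toks)]                     # drop empty strings
    c(tokens     = length(toks),
      types      = length(unique(toks)),
      type_token = length(unique(toks)) / length(toks),
      mean_chars = mean(nchar(toks)))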

Taltac

TaLTaC is the acronym of Trattamento Automatico Lessicale e Testuale per l’Analisi del Contenuto (Italian for “Automatic Lexical and Textual Processing for the Analysis of Content”). TaLTaC is a software application for the automatic analysis of texts according to the logics of both Text Analysis (TA) and Text Mining (TM). Such an analysis makes it possible to define a quantitative representation of the phenomenon under study, both at the level of text units (words) and at the level of context units. Consequently, both the language and the contents of the text can be examined. The approach according to which the application has been designed makes it possible to carry out the analysis without actually reading the texts, that is, independently of the size of the corpus (which can be huge, including as many as millions of words). TaLTaC originates from research carried out at the Universities of Salerno and Rome “La Sapienza” during the 1990s under the supervision of Sergio Bolasco, Professor of Statistics at the Department of geo-economic, linguistic, statistical and historical studies for regional analysis of “La Sapienza” University. It is the result of the cooperation of researchers and colleagues of several Italian and French universities. It employs both statistical and linguistic resources, which are highly integrated with each other and can be customized by the user. This allows, both at the lexical and at the textual level, for text analysis and for information retrieval and extraction according to the principles of data and text mining.

T-Lab

T-Lab software is an all-in-one set of linguistic and statistical tools for content analysis and text mining. Its interface is very user-friendly, and many types of texts can be analysed: speech transcripts, newspaper articles, responses to open-ended questions, transcripts of interviews and focus groups, legislative texts, company documents, books, etc. T-LAB uses a text-driven automatic approach that allows meaningful patterns of words and themes to emerge. Various measures and several analysis methods can be applied. Tables and charts can be easily browsed and interpreted. The user interface and the contextual help are available in four languages: English, French, Spanish and Italian. T-LAB pre-processing steps include text segmentation, automatic lemmatisation and key-term selection. Subsequently, three sub-menus allow easy browsing among several tools for co-occurrence analysis, thematic analysis and comparative analysis.
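
Word co-occurrence within a window, one of the measures this kind of analysis builds on, can be sketched with quanteda’s fcm() function (T-LAB’s own measures and interface differ):

    library(quanteda)

    toks <- tokens("patterns of words and themes emerge from patterns of words in use",
                   remove_punct = TRUE)
    fcm(toks, context = "window", window = 5)   # feature co-occurrence matrix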

TXM

TXM is a free, open-source, Unicode- and XML-aware text/corpus analysis environment and graphical client based on CQP and R. It is available for Microsoft Windows, Linux, Mac OS X and as a J2EE web portal. TXM implements the textometry text analysis methodology. It provides qualitative analysis tools (concordances of lexical patterns based on the efficient CQP full-text search engine and its CQL query language, frequency lists, histograms of pattern occurrences) and quantitative tools (factorial correspondence analysis, clustering and collocation statistics). It may be used with any collection of Unicode-encoded documents in various formats: TXT, XML, various flavours of XML-TEI P5, XML-Transcriber, XML-TMX (aligned corpora), XML-PPS (Factiva), Europresse, etc. It applies various NLP tools on the fly to texts before analysis (e.g. TreeTagger for lemmatization and POS tagging). TXM is built on some of the best open-source components for text analysis: CQP, R, and Java and XSLT libraries.
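
As a stand-in for TXM’s built-in factorial analysis, a correspondence analysis of a small, made-up part-by-word lexical table can be run in R with the CRAN package ca:

    library(ca)

    # Made-up counts of three words across three parts of a corpus.
    lex <- matrix(c(12,  3,  5,
                     4, 10,  6,
                     2,  5, 11),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(parts = c("part1", "part2", "part3"),
                                  words = c("word_a", "word_b", "word_c")))
    res <- ca(lex)
    summary(res)    # plot(res) draws the factorial map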

Wordsmith Tools

WordSmith Tools is an integrated suite of programs for looking at how words behave in texts. The WordList tool provides a list of all the words or word clusters in a text, set out in alphabetical or frequency order. The concordancer, Concord, shows any word or phrase in context. With KeyWords it is possible to find the key words of a text. The tools have been used by Oxford University Press for their own lexicographic work in preparing dictionaries, by language teachers and students, and by researchers investigating language patterns in many different languages in countries world-wide.
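
Concord-style keyword-in-context lines can be approximated in R with quanteda’s kwic() function (WordSmith itself is a standalone suite):

    library(quanteda)

    toks <- tokens("WordSmith shows how words behave in texts and how words pattern in context")
    kwic(toks, pattern = "words", window = 3)   # left context, keyword, right context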

Stemmers

Porter stemmer

The Porter stemming algorithm (or Porter stemmer) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems. An on-line version for texts (written in English) is available.
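
In R, Porter stemming is available through the SnowballC package; the "porter" option selects the original algorithm (the "english" option is the newer Porter2/Snowball variant):

    library(SnowballC)

    wordStem(c("connect", "connected", "connecting", "connection", "connections"),
             language = "porter")
    # all five forms reduce to the stem "connect"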

Lemmatizers

TreeTagger

TreeTagger is a tool for annotating text with part-of-speech and lemma information. It was developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart. In 1993/1994 the project collected textual material for German, French and Italian and developed a representation for texts and markups, along with a query language and a corpus access system for linguistic exploration of the material. Texts and analysis results are kept separate from each other, for reasons of flexibility and extensibility of the system; this is made possible by a particular approach to storage and representation. Tool components, language-specific and general, range from morphosyntactic analysis to partial parsing, and from mutual information, t-scores, collocation extraction and clustering to HMM-based and n-gram tagging. Research on statistical models for noun phrases, verb-object collocations, etc. is ongoing. TreeTagger is open-source software; it has been successfully used to tag German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Greek, Portuguese, Galician, Chinese, Swahili, Latin, Estonian and Old French texts, and it is adaptable to other languages if a lexicon and a manually tagged training corpus are available.
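
A hedged sketch of calling a locally installed TreeTagger from R through the koRpus wrapper package; the installation path and preset below are assumptions about a typical setup:

    library(koRpus)
    library(koRpus.lang.en)   # English language support for koRpus

    # path and preset are assumptions about where TreeTagger is installed
    tagged <- treetag("sample.txt", treetagger = "manual", lang = "en",
                      TT.options = list(path = "~/TreeTagger", preset = "en"))
    taggedText(tagged)   # data frame of tokens with POS tags and lemmas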

UDPipe

UDPipe is a trainable pipeline for tokenizing, tagging, lemmatizing and parsing.

The R package udpipe (a wrapper around UFAL's UDPipe) makes it possible to perform the pre-processing operations that are fundamental for the preliminary analysis of a corpus, such as tokenization, part-of-speech (POS) tagging, lemmatization and dependency parsing. These operations can be carried out with pre-trained models (currently available for over 65 languages), or users can train their own model for annotating texts.
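
A minimal example with the udpipe package: download a pre-trained English model, annotate one sentence, and inspect tokens, lemmas and POS tags.

    library(udpipe)

    m  <- udpipe_download_model(language = "english")
    ud <- udpipe_load_model(file = m$file_model)
    x  <- udpipe_annotate(ud, x = "The cats were sleeping on the old sofa.")
    head(as.data.frame(x)[, c("token", "lemma", "upos")])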

Gatto

Gatto is the software used to manage and query the Corpus OVI dell'Italiano antico, a corpus of ancient Italian.