Symbolic Data Analysis: Taking Variability in Data into Account

Prof. Paula Brito. Faculdade de Economia & LIAAD - INESC TEC, Universidade do Porto, Portugal.

Abstract

Symbolic Data Analysis, introduced by E. Diday in the 1980s, is concerned with analysing data with intrinsic variability, which should be taken into account. In Data Mining, Multivariate Data Analysis and classical Statistics, the elements under analysis are generally individual entities for which a single value is recorded for each variable - e.g., individuals described by age, salary, education level, etc. But when the elements of interest are classes or groups of some kind - the citizens living in given towns; car models, rather than specific vehicles - then there is variability inherent to the data. Reducing this variability to central tendency measures - mean values, medians or modes - leads to a substantial loss of information.

Symbolic Data Analysis provides a framework for representing data with variability, using new variable types. Methods have also been developed that suitably take data variability into account. Symbolic data may be represented using the usual matrix-form data arrays, but now the elements of each cell are generally not single real values or categories, as in the classical case, but rather finite sets of values, intervals or, more generally, distributions.
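As an illustration of such a data array, here is a minimal Python sketch in which each cell holds a finite set of categories, an interval, or a distribution. The class name `Interval`, the variable names and all the values are hypothetical, chosen only for this example; they are not taken from SODAS or from any SDA library.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Union


@dataclass(frozen=True)
class Interval:
    """An interval-valued cell [lower, upper] (hypothetical helper class)."""
    lower: float
    upper: float

    def __post_init__(self):
        if self.lower > self.upper:
            raise ValueError("lower bound must not exceed upper bound")

    @property
    def midpoint(self) -> float:
        return (self.lower + self.upper) / 2

    @property
    def half_range(self) -> float:
        return (self.upper - self.lower) / 2


# A cell is a finite set of categories, an interval,
# or a distribution given as category -> relative weight.
Cell = Union[FrozenSet[str], Interval, Dict[str, float]]

# Rows are car models (groups of vehicles), not specific vehicles.
# All figures are invented for illustration.
symbolic_table: Dict[str, Dict[str, Cell]] = {
    "model_A": {
        "price_keur": Interval(18.0, 26.0),               # interval-valued
        "fuel": frozenset({"petrol", "diesel"}),          # set-valued
        "body_style": {"hatchback": 0.7, "estate": 0.3},  # distribution
    },
    "model_B": {
        "price_keur": Interval(31.0, 48.0),
        "fuel": frozenset({"diesel"}),
        "body_style": {"saloon": 1.0},
    },
}

price_a = symbolic_table["model_A"]["price_keur"]
print(price_a.midpoint, price_a.half_range)
```

The midpoint and half-range give one common way of summarising an interval by its location and its spread, so that the variability stays part of the description.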

In recent years, the term “Big Data” has emerged, referring to data sets so large and complex that they are difficult to process with traditional data analysis applications in a reasonable amount of time. SDA offers the possibility of aggregating data at the user's chosen degree of granularity while keeping the information on the intrinsic variability, and then analysing the resulting (symbolic) data arrays; it may therefore play an important role in this context.
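The aggregation step can be sketched as follows, assuming a small list of individual records grouped by town. The function `aggregate`, the field names and the records are hypothetical and serve only as an illustration; this is not the DB2SO procedure. Numeric variables are summarised by their [min, max] interval and categorical variables by their relative frequencies, which is one simple choice among several.

```python
from collections import Counter, defaultdict

# Invented individual-level records (microdata).
records = [
    {"town": "Porto", "age": 34, "education": "secondary"},
    {"town": "Porto", "age": 51, "education": "higher"},
    {"town": "Porto", "age": 27, "education": "higher"},
    {"town": "Braga", "age": 45, "education": "basic"},
    {"town": "Braga", "age": 62, "education": "secondary"},
]


def aggregate(records, group_key, numeric_vars, categorical_vars):
    """Aggregate individual records into one symbolic description per group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)

    table = {}
    for name, members in groups.items():
        row = {}
        for var in numeric_vars:
            values = [m[var] for m in members]
            # Interval-valued cell: observed range within the group.
            row[var] = (min(values), max(values))
        for var in categorical_vars:
            counts = Counter(m[var] for m in members)
            # Distribution-valued cell: relative frequency of each category.
            row[var] = {cat: n / len(members) for cat, n in counts.items()}
        table[name] = row
    return table


towns = aggregate(records, "town", ["age"], ["education"])
for town, description in towns.items():
    print(town, description)
```

Changing `group_key` changes the granularity of the aggregation, and each group keeps its range and its distribution instead of being reduced to a single mean or mode.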

In this course we shall introduce and motivate the field of Symbolic Data Analysis, present the new variable types in some detail, and illustrate them with examples. We shall furthermore discuss some issues that arise when analysing data that do not follow the usual classical model, and present data representation models for some variable types. We then review some methods that have been developed to analyse symbolic data. Some of the methods presented may be illustrated using the software package SODAS.

The course is aimed at all potential data analysts who need or are interested in analysing data with variability, e.g. data resulting from the aggregation of individual records into groups of interest, or data which represent abstract entities such as biological species or regions as a whole. This methodology is particularly interesting for Economics and Management studies, Marketing, Social Sciences, Geography, Official Statistics, as well as for Biology or Geology Data Analysis.

Participants are assumed to have mastered classical Statistics. Some of the topics in chapter 6 also require some background in Multivariate Data Analysis.

Course syllabus

A. The Symbolic Data Analysis Paradigm: 1. Introduction to Symbolic Data Analysis; 1.1. Motivation. Examples.
: 2. Sources of symbolic data : aggregation (contemporary, temporal); description of abstract concepts. Examples. Alternative to the use of central tendency measures.
: 3. Types of variables and their representations. Examples.
: 4. Application examples.
: 5. The SODAS package – presentation; 5.1. SDS and XML files.
: 6. Visualization of symbolic data with SODAS – the zoom-stars.
: 7. Interfaces : getting “native” data.
: 8. Data aggregation; 8.1. DB2SO – principles. Example.; 8.2. Other aggregation forms.

B. Methods for the Analysis of Symbolic Data: 1. Descriptive Statistics
: 2. PCA; 2.1. Centers method; 2.2. Vertices method; 2.3. Application.; 2.4. Reference to other methods
: 3. Classification; 3.1. Divisive Clustering : DIV; 3.2. Partitioning Clustering : SCLUST; 3.3. Hierarchical and Pyramidal Clustering : HIPYR; 3.4. Other methods.
: 4. Discriminant Analysis; 4.1. Decision trees : TREE; 4.2. Other methods.
: 5. Regression; 5.1. Linear Regression for interval-valued variables; 5.2. Linear Regression for histogram-valued variables
: 6. Parametric Modelling; 6.1. Principles and definitions; 6.2. Gaussian model; 6.3. R Package MAINT.DATA; 6.4. Tests; 6.5. ANOVA and MANOVA; 6.6. Discriminant Analysis
: 7. Reference to other programs / packages
: 8. Main bibliography; 8.1. Books; 8.2. Main papers
: 9. The SDA community and its activities.

Attendance Examination
The exam will take place on Saturday, 25 July.

Main References

Books: Bock, H.-H. and Diday, E. (2000). Analysis of Symbolic Data, Exploratory Methods for Extracting Statistical Information from Complex Data. Springer, Heidelberg.; Billard, L. and Diday, E. (2007). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley.; Diday, E. and Noirhomme-Fraiture, M. (2008). Symbolic Data Analysis and the SODAS Software. Wiley.
Papers: Brito, P. (2014). "Symbolic Data Analysis: another look at the interaction of Data Mining and Statistics". WIREs Data Mining and Knowledge Discovery, Volume 4, Issue 4, July/August 2014, 281–295. DOI: 10.1002/widm.1133; Brito, P. and Duarte Silva, A. P. (2012). "Modelling Interval Data with Normal and Skew-Normal Distributions". Journal of Applied Statistics, Volume 39, Issue 1, 3–20.; Noirhomme-Fraiture, M. and Brito, P. (2011). "Far Beyond the Classical Data Models: Symbolic Data Analysis". Statistical Analysis and Data Mining, Volume 4, Issue 2, 157–170.; Brito, P. (2007). "Modelling and Analysing Interval Data". In: "Advances in Data Analysis", Decker, R., Lenz, H.-J. (Eds.), Series "Studies in Classification, Data Analysis and Knowledge Organization", Springer, Berlin, Heidelberg, New York, 197–208.; Brito, P. (2007). "On the Analysis of Symbolic Data". In: "Selected Contributions in Classification and Data Analysis", Brito, P., Bertrand, P., Cucumel, G., De Carvalho, F. (Eds.), Series "Studies in Classification, Data Analysis and Knowledge Organization", Springer, Heidelberg, 13–22.; Duarte Silva, A. P. and Brito, P. (2006). "Linear Discriminant Analysis for Interval Data". Computational Statistics, 21, 2, 289–308.; Billard, L. and Diday, E. (2003). "From the statistics of data to the statistics of knowledge: Symbolic Data Analysis". Journal of the American Statistical Association, 98 (462), 470–487.

Slides

SDA Buenos Aires 1

SDA Buenos Aires Desc_Stat

Symbolic Regression

SDA Buenos Aires Class

Hierarchical Pyramidal Clustering