Principal components analysis
- The text on this page is taken from an equivalent page of the IEHIAS-project.
Principal Components Analysis (PCA) is a statistical factor analysis method, based on the law of conservation of mass. It operates by first extracting a series of principal factors (components) from the measured data on pollutant concentrations, on the basis of the mutual correlation between the different species. These factors are then interpreted as putative sources, and the contribution from each is estimated from the factor scores.
Scope
Purpose
Principal Components Analysis is a multivariate modeling technique that does not need data on source compositions as inputs. It can therefore be used to identify sources, and estimate contributions, where detailed information on source characteristics is not available, but where a substantial amount of measured concentration data exist.
Boundaries
PCA, like other multivariate receptor models, is based on the analysis of the correlation between measured concentrations of chemical species in a number of samples, assuming that highly correlated compounds come from the same source. The method can thus be used to detect the hidden source information from ambient measurement datasets. Ambiguities inevitably arise during interpretation of the factors, and the validity of the interpretation depends on good prior understanding of possible source characteristics; in the absence of such knowledge the results obtained can thus be somewhat hypothetical and tentative.
Assumptions
- Composition of the emission sources is constant over the period of sampling at the receptors;
- Chemical species used in PCA do not interact with each other and their concentrations are linearly additive;
- Source profiles (fpj) are linearly independent of each other;
- Marker elements (tracers) for each source should be included;
- Measurement errors are random and uncorrelated;
- The number of species (j) is greater than or equal to the number of sources (p);
- The number of samples is much greater than the number of source types to ensure statistically meaningful calculations;
- Variability of the concentrations from sample to sample is primarily due to differences in source contribution and not due to measurement uncertainty or changes in source composition;
- The effect of processes that affect all sources equally (e.g. atmospheric dispersion) is much smaller than the effect of processes that influence individual sources (e.g. wind direction and changes in emission rates).
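Some of the assumptions above can be checked numerically before running an analysis. The following is a minimal sketch, not part of the original method description: the function name, the `min_ratio` threshold of 10 samples per source, and the example profile values are all hypothetical. It assumes candidate source profiles (e.g. from the literature) are available as a sources-by-species array.

```python
import numpy as np

def check_pca_preconditions(n_samples, n_species, profiles, min_ratio=10):
    """Run simple checks on data dimensions and candidate source profiles.

    profiles: array of shape (sources, species) with fractional compositions.
    min_ratio: hypothetical rule of thumb for "samples much greater than sources".
    """
    profiles = np.asarray(profiles, dtype=float)
    n_sources = profiles.shape[0]
    return {
        # Number of species must be at least the number of sources.
        "species_ge_sources": n_species >= n_sources,
        # Number of samples should be much larger than the number of sources.
        "samples_much_greater": n_samples >= min_ratio * n_sources,
        # Profiles are linearly independent if the matrix has full row rank.
        "profiles_independent": bool(np.linalg.matrix_rank(profiles) == n_sources),
    }

# Illustrative (made-up) profiles for two sources and three species.
profiles = np.array([[0.6, 0.3, 0.1],
                     [0.1, 0.2, 0.7]])
checks = check_pca_preconditions(n_samples=120, n_species=3, profiles=profiles)
print(checks)
```

The remaining assumptions (constant source composition, non-interacting species, random errors) concern the physical system and cannot be verified from array shapes alone.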
Weaknesses and limitations
- PCA requires large datasets on measured concentrations (preferably >100 samples);
- Analysis is limited by the accuracy, precision, and range of species measured at the receptor (e.g. ambient monitoring) sites;
- A determination must be made of how many 'factors' to retain, and emission sources have to be deduced by interpreting these factors;
- Information is needed on source profiles or existing profiles in order to verify the representativeness of the calculated source profiles and uncertainties in the estimated source contributions;
- Vectors and components are usually related to broad source types rather than specific sources;
- Analysis is sensitive to extreme values in the data sets;
- Analysis may generate negative loadings on some components (sources);
- A large number of solutions can be obtained and it may not be clear whether an optimal solution has been found;
- The components are not always physically explainable, and fully satisfactory rotation techniques that clearly separate independent sources have not been identified;
- Results cannot be weighted to account for uncertainties in the measured data;
- PCA models cannot properly handle missing data or values below the detection limit (both of which commonly occur in environmental measurements).
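The sensitivity to extreme values listed above follows from PCA being built on correlations between species. As a small illustration with made-up numbers (not measured data), a single extreme sample can reverse the sign of the correlation between two species:

```python
import numpy as np

# Two hypothetical species that are perfectly anti-correlated in five samples.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
r_clean = np.corrcoef(x, y)[0, 1]

# Add one extreme sample in which both species are very high.
x_out = np.append(x, 100.0)
y_out = np.append(y, 100.0)
r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(f"correlation without extreme value: {r_clean:.3f}")
print(f"correlation with extreme value:    {r_outlier:.3f}")
```

Because the extracted components depend on the correlation matrix, a change of this kind would alter which species appear to share a source.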
Requirements
For exposure assessment, the number of samples analysed must be representative both in time and space.
Method description
Input
As input, the PCA model requires concentration data from a large number of samples (n>100) analysed for chemical constituents. In order to interpret the resulting components, information is also needed on the characteristics of putative sources, derived either from the literature or available measurements of emission composition.
Output
The output from a PCA model is a series of factors (components), which can be interpreted as emission sources. Interpretation is based on the elements/species which characterise each component. The amount of variance explained by each factor is then used as the basis for estimating the contribution associated with each putative source.
PCA also provides qualitative source profiles, which can be used as input to other receptor models such as CMB.
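Interpreting components by the species that characterise them can be sketched as a simple thresholding of the loadings. This is an illustration only: the species names, the loading values, and the cut-off of 0.5 are hypothetical, and absolute values are used because loadings can be negative.

```python
import numpy as np

def dominant_species(loadings, species, threshold=0.5):
    """List, for each component, the species whose absolute loading
    meets the (hypothetical) threshold."""
    loadings = np.asarray(loadings, dtype=float)
    groups = []
    for k in range(loadings.shape[1]):
        column = loadings[:, k]
        groups.append([name for name, value in zip(species, column)
                       if abs(value) >= threshold])
    return groups

# Made-up loadings: rows are species, columns are components.
species = ["Pb", "Br", "Al", "Si"]
loadings = np.array([[0.92, 0.10],
                     [0.88, 0.05],
                     [0.12, 0.90],
                     [0.08, 0.85]])
groups = dominant_species(loadings, species)
print(groups)
```

Assigning each group of species to a named source still requires prior knowledge of source characteristics; the code only organises the loadings for that judgement.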
Rationale
Principal Component Analysis (PCA) is a form of factor analysis. Its aim is to identify structures in the pattern of relationships among the variables (i.e. measured concentrations of different pollutant species) in order to determine whether the observed variables can be explained largely or entirely in terms of a much smaller number of (inferred) variables, or factors. These are then interpreted and allocated to identifiable sources by reference to prior information on, or knowledge about, possible origins of the emissions (e.g. based on literature or independent measurement data). Estimates of the contribution of each of these source categories to the measured pollutant mixture are then made on the basis of the percentage of variance explained by each factor at each measurement location.
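The extraction step can be sketched with an eigen-decomposition of the correlation matrix between species. This is a minimal sketch under stated assumptions rather than a prescribed procedure: the data are synthetic, the function name is hypothetical, and components are retained where the eigenvalue exceeds 1 (the Kaiser criterion), which is one common convention and is not specified in the text above.

```python
import numpy as np

def pca_on_correlation(concentrations, min_eigenvalue=1.0):
    """PCA of a (samples x species) array via the species correlation matrix."""
    X = np.asarray(concentrations, dtype=float)
    # Standardise each species (column) to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix between species.
    R = np.corrcoef(X, rowvar=False)
    # eigh suits symmetric matrices; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Assumed retention rule: keep components with eigenvalue above the cut-off.
    keep = eigvals > min_eigenvalue
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    scores = Z @ eigvecs[:, keep]
    # Fraction of total variance explained by each retained component.
    explained = eigvals[keep] / eigvals.sum()
    return loadings, scores, explained

# Synthetic data: 200 samples, 4 species, generated from 2 hidden sources.
rng = np.random.default_rng(0)
G = rng.uniform(0.0, 1.0, size=(200, 2))
F = np.array([[1.0, 0.8, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.9]])
C = G @ F + rng.normal(0.0, 0.05, size=(200, 4))

loadings, scores, explained = pca_on_correlation(C)
print("retained components:", loadings.shape[1])
print("variance explained:", np.round(explained, 3))
```

In this synthetic case the species driven by the same hidden source load on the same component. With real measurements the grouping is rarely this clean, and a rotation step is often applied before interpretation.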
Method
PCA models can be expressed as follows:
$$C_{ij} = \sum_{p} g_{ip} f_{pj} + e_{ij}$$
where the sum runs over all sources, and:
p is the number of sources;
j is the number of species, with j ≥ p;
Cij is the measured ambient concentration of species j in sample i;
fpj (source profiles) is the fractional concentration of species j in the emissions from source p;
gip is the concentration contribution of source p to sample i; and
eij is the portion of the measured elemental concentration that cannot be explained by the model.
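Reading the definitions above as the usual bilinear receptor model (each measured concentration is the sum over sources of contribution times profile fraction, plus an unexplained residual), a small worked example shows how the terms combine. All numbers are made up for illustration, and the variable names `G`, `F`, `E` and `C` are simply array versions of gip, fpj, eij and Cij.

```python
import numpy as np

# Contributions gip: 2 samples (rows) x 2 sources (columns), hypothetical units.
G = np.array([[10.0, 5.0],
              [2.0, 8.0]])

# Source profiles fpj: 2 sources (rows) x 3 species (columns), fractions.
F = np.array([[0.6, 0.3, 0.1],
              [0.1, 0.2, 0.7]])

# Residuals eij: the part of each concentration the model does not explain.
E = np.array([[0.1, -0.1, 0.0],
              [0.0, 0.1, -0.1]])

# Measured concentrations Cij for every sample i and species j.
C = G @ F + E

# The same value for sample 0, species 0, written out source by source.
c00 = G[0, 0] * F[0, 0] + G[0, 1] * F[1, 0] + E[0, 0]
print(C)
print(c00)
```

In practice only `C` is measured; PCA works backwards from the correlation structure of `C` to infer factors that are then interpreted as the sources.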
References
- Jolliffe, I.T. 1986 Principal component analysis. New York: Springer.
- Paatero, P., Hopke, P.K., Begum, B.A. and Biswas, S.K. 2005 A graphical diagnostic method for assessing the rotation in factor analytical models of atmospheric pollution. Atmospheric Environment 39, 193–201.
- Wolff, G.T., Korsog, P.E., Kelly, N.A. and Ferman, M.A. 1985 Relationships between fine particulate species, gaseous pollutants and meteorological parameters in Detroit. Atmospheric Environment 19, 1341–1349.
See also
- Source attribution in general
- Source attribution database contains results from source attribution studies