Principal components analysis

The text on this page is taken from an equivalent page of the IEHIAS-project.

Principal Components Analysis (PCA) is a statistical factor analysis method, based on the law of conservation of mass. It operates by first extracting a series of principal factors (components) from the measured data on pollutant concentrations, on the basis of the mutual correlation between the different species. These are then interpreted as putative sources, and the contribution from each estimated from the factor scores.

Scope

Purpose

Principal Components Analysis is a multivariate modeling technique that does not need data on source compositions as inputs. It can therefore be used to identify sources, and estimate contributions, where detailed information on source characteristics is not available, but where a substantial amount of measured concentration data exist.

Boundaries

PCA, like other multivariate receptor models, is based on the analysis of the correlation between measured concentrations of chemical species in a number of samples, assuming that highly correlated compounds come from the same source. The method can thus be used to detect the hidden source information from ambient measurement datasets. Ambiguities inevitably arise during interpretation of the factors, and the validity of the interpretation depends on good prior understanding of possible source characteristics; in the absence of such knowledge the results obtained can thus be somewhat hypothetical and tentative.
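The correlation idea described above can be illustrated with a small sketch. The species names and concentration values below are hypothetical and chosen only for illustration: two species are assumed to share a source and one is assumed to vary independently. Only the Python standard library is used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical concentrations of three species in six samples (arbitrary units).
# "Pb" and "Br" are assumed to come from one source, "Na" from another.
samples = {
    "Pb": [1.0, 2.1, 2.9, 4.2, 5.0, 6.1],
    "Br": [0.5, 1.0, 1.6, 2.0, 2.6, 3.0],
    "Na": [3.0, 1.0, 4.0, 1.5, 3.5, 2.0],
}

species = list(samples)
# Pairwise correlation matrix, stored as a dictionary keyed by species pair
corr = {(a, b): pearson(samples[a], samples[b]) for a in species for b in species}

for a in species:
    print(a, [round(corr[(a, b)], 2) for b in species])
```

In this toy dataset the two co-emitted species are strongly correlated, while the third is almost uncorrelated with either, which is the pattern PCA exploits when grouping species into components.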

Assumptions

Composition of the emission sources is constant over the period of sampling at the receptors;
Chemical species used in PCA do not interact with each other and their concentrations are linearly additive;
Source profiles (f_pj) are linearly independent of each other;
Marker elements (tracers) for each source should be included;
Measurement errors are random and uncorrelated;
The number of species (j) is greater than or equal to the number of sources (p);
The number of samples is much greater than the number of source types to ensure statistically meaningful calculations;
Variability of the concentrations from sample to sample is primarily due to differences in source contribution and not due to measurement uncertainty or changes in source composition;
The effect of processes that affect all sources equally (e.g. atmospheric dispersion) is much smaller than the effect of processes that influence individual sources (e.g. wind direction and changes in emission rates).
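Two of the assumptions above (linearly independent source profiles, and at least as many species as sources) can be checked numerically when candidate profiles are available. The following is a minimal sketch, assuming the profiles are given as a hypothetical sources-by-species table of fractions; the function names and the example values are illustrative, not part of the original method description.

```python
def matrix_rank(rows, tol=1e-10):
    """Rank of a small matrix by Gaussian elimination with partial pivoting."""
    m = [list(map(float, r)) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Row at or below `rank` with the largest absolute entry in this column
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def profiles_usable(profiles):
    """True if species count >= source count and profiles are linearly independent."""
    n_sources, n_species = len(profiles), len(profiles[0])
    return n_species >= n_sources and matrix_rank(profiles) == n_sources

# Hypothetical profiles f_pj (rows: sources p, columns: species j)
independent = [[0.6, 0.3, 0.1],
               [0.1, 0.2, 0.7]]
dependent = [[0.6, 0.3, 0.1],
             [0.3, 0.15, 0.05]]  # second row is half the first

print(profiles_usable(independent), profiles_usable(dependent))
```

Profiles that are exact multiples of each other, as in the second example, cannot be separated by the analysis, because any split of the contribution between them fits the data equally well.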

Weaknesses and limitations

PCA requires large datasets on measured concentrations (preferably >100 samples);
Analysis is limited by the accuracy, precision, and range of species measured at the receptor locations (e.g. ambient monitoring sites);
A determination must be made of how many 'factors' to retain, and emission sources have to be deduced by interpreting these factors;
Information on source profiles (newly measured or existing) is needed in order to verify the representativeness of the calculated source profiles and the uncertainties in the estimated source contributions;
Vectors and components are usually related to broad source types rather than specific sources;
Analysis is sensitive to extreme values in the data sets;
Analysis may generate negative loadings on some components (sources);
A large number of solutions can be obtained and it may not be clear whether an optimal solution has been found;
The components are not always physically explainable, and fully satisfactory rotation techniques to provide clear definition of independent sources have not been identified;
Results cannot be weighted to account for uncertainties in the measured data;
PCA models cannot properly handle missing data or values below the detection limit (both of which commonly occur in environmental measurements).
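The decision on how many factors to retain, listed among the limitations above, is often made with a simple rule of thumb. The sketch below assumes the Kaiser criterion (retain components with eigenvalue greater than 1), which is one common heuristic and is not prescribed by this page; the eigenvalues are hypothetical values for a six-species correlation matrix.

```python
# Hypothetical eigenvalues of a correlation matrix for six species.
# For a correlation matrix the eigenvalues sum to the number of species.
eigenvalues = [2.9, 1.6, 0.8, 0.4, 0.2, 0.1]

total = sum(eigenvalues)

# Kaiser criterion (an assumed heuristic): keep components whose eigenvalue
# exceeds 1, i.e. that explain more variance than one standardised species.
retained = [ev for ev in eigenvalues if ev > 1.0]

# Cumulative share of the total variance explained, in percent
cumulative = []
running = 0.0
for ev in eigenvalues:
    running += ev
    cumulative.append(100.0 * running / total)

print("retained components:", len(retained))
print("cumulative % variance:", [round(c, 1) for c in cumulative])
```

Other rules (for example a scree plot or a fixed cumulative-variance threshold) may select a different number of components, which is one reason the choice is listed as a limitation.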

Requirements

For exposure assessment, the number of samples analysed must be representative both in time and space.

Method description

Input

As input, the PCA model requires concentration data from a large number of samples (n>100) analysed for chemical constituents. In order to interpret the resulting components, information is also needed on the characteristics of putative sources, derived either from the literature or available measurements of emission composition.

Output

The output from a PCA model is a series of factors (components), which can be interpreted as emission sources. Interpretation is based on the elements/species which characterise each component. The amount of variance explained by each factor is then used as the basis for estimating the contribution associated with each putative source.

PCA also provides qualitative source profiles, which can be used as input to other receptor models such as Chemical Mass Balance (CMB).

Rationale

Principal Components Analysis (PCA) is a form of factor analysis. Its aim is to identify structures in the pattern of relationships among the variables (i.e. measured concentrations of different pollutant species) in order to determine whether the observed variables can be explained largely or entirely in terms of a much smaller number of (inferred) variables, or factors. These are then interpreted and allocated to identifiable sources by reference to prior information on, or knowledge about, possible origins of the emissions (e.g. based on literature or independent measurement data). Estimates of the contribution of each of these source categories to the measured pollutant mixture are then made on the basis of the percentage of variance explained by each factor at each measurement location.
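The extraction step described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes the components are taken from the correlation matrix of the species, uses power iteration with deflation so that only the Python standard library is needed (in practice a linear algebra library would normally be used), and runs on a hypothetical six-sample dataset that is far smaller than the more than 100 samples recommended on this page. The species names and values are illustrative.

```python
import math

def correlation_matrix(data):
    """Correlation matrix of the columns (species) of a samples-by-species table."""
    n, m = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(m)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / (n - 1))
           for j in range(m)]
    z = [[(row[j] - means[j]) / sds[j] for j in range(m)] for row in data]
    return [[sum(z[i][a] * z[i][b] for i in range(n)) / (n - 1)
             for b in range(m)] for a in range(m)]

def principal_components(R, k, iterations=500):
    """Leading k (eigenvalue, eigenvector) pairs of symmetric R,
    found by power iteration with deflation."""
    m = len(R)
    A = [row[:] for row in R]
    pairs = []
    for _ in range(k):
        v = [1.0 / (j + 1) for j in range(m)]  # arbitrary non-zero start vector
        for _ in range(iterations):
            w = [sum(A[a][b] * v[b] for b in range(m)) for a in range(m)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged unit vector
        ev = sum(v[a] * sum(A[a][b] * v[b] for b in range(m)) for a in range(m))
        pairs.append((ev, v))
        # Deflation: remove the component just found before the next pass
        A = [[A[a][b] - ev * v[a] * v[b] for b in range(m)] for a in range(m)]
    return pairs

# Hypothetical data: rows are samples, columns are species
species = ["Pb", "Br", "Zn", "Na", "Cl"]
data = [
    [1.0, 0.5, 0.9, 3.0, 1.4],
    [2.1, 1.0, 2.0, 1.0, 0.6],
    [2.9, 1.6, 3.2, 4.0, 2.1],
    [4.2, 2.0, 3.9, 1.5, 0.7],
    [5.0, 2.6, 5.2, 3.5, 1.8],
    [6.1, 3.0, 5.8, 2.0, 1.1],
]

R = correlation_matrix(data)
pairs = principal_components(R, k=2)

# The trace of a correlation matrix equals the number of species,
# so each eigenvalue divided by len(R) is a share of the total variance.
explained = [100.0 * ev / len(R) for ev, _ in pairs]

for (ev, vec), pct in zip(pairs, explained):
    loadings = [round(x * math.sqrt(ev), 2) for x in vec]
    print(f"eigenvalue {ev:.2f}, {pct:.1f}% of variance, loadings {loadings}")
```

In this toy dataset the first component loads mainly on the first three species and the second on the last two; assigning each component to a named source would still depend on prior knowledge of source characteristics, as the text notes. The sign of each loading vector is arbitrary.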

Method

PCA models can be expressed as follows:

C_ij = Σ_p (g_ip × f_pj) + e_ij

where the sum is taken over all sources, and:

p is the number of sources;

j is the number of species, with j ≥ p;

C_ij is the measured ambient concentration of species j in sample i;

f_pj (source profiles) is the fractional concentration of species j in the emissions from source p;

g_ip is the concentration contribution of source p to sample i; and

e_ij is the portion of the measured elemental concentration that cannot be explained by the model.
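A minimal numerical sketch of the terms defined above, assuming the standard receptor-model mass balance in which each measured concentration C_ij is the sum over sources of g_ip × f_pj plus the unexplained part e_ij. All values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical source profiles f_pj: fraction of species j in emissions of source p
# (2 sources, 3 species, so the number of species >= the number of sources)
f = [
    [0.6, 0.3, 0.1],  # source 0
    [0.1, 0.2, 0.7],  # source 1
]

# Hypothetical contributions g_ip of source p to sample i (3 samples)
g = [
    [10.0, 2.0],
    [4.0, 8.0],
    [6.0, 6.0],
]

# Hypothetical measured concentrations C_ij (3 samples, 3 species)
C = [
    [6.3, 3.3, 2.5],
    [3.1, 2.9, 6.0],
    [4.2, 3.0, 4.9],
]

n_sources, n_species = len(f), len(f[0])

# Modelled concentration: sum over sources of g_ip * f_pj
model = [[sum(g_i[p] * f[p][j] for p in range(n_sources))
          for j in range(n_species)] for g_i in g]

# Residual e_ij: the part of C_ij the model does not explain
e = [[C[i][j] - model[i][j] for j in range(n_species)] for i in range(len(C))]

for i in range(len(C)):
    print("sample", i,
          "modelled", [round(x, 2) for x in model[i]],
          "residual", [round(x, 2) for x in e[i]])
```

In a real application g and f are not known in advance: PCA estimates them from C alone, which is why the interpretation of the resulting components relies on prior knowledge of likely sources.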

References

Jolliffe, I.T. 1986 Principal component analysis. New York: Springer.
Paatero, P., Hopke, P.K., Begum, B.A. and Biswas, S.K. 2005 A graphical diagnostic method for assessing the rotation in factor analytical models of atmospheric pollution. Atmospheric Environment 39, 193–201.
Wolff, G.T., Korsog, P.E., Kelly, N.A. and Ferman, M.A. 1985 Relationships between fine particulate species, gaseous pollutants and meteorological parameters in Detroit. Atmospheric Environment 19, 1341–1349.
