Machine Learning to Explore Geographical Data

Self-Organising Maps (SOMs)

Exploratory data analysis methods such as Self-Organising Maps (SOMs) can provide an alternative way to analyse spatial data, especially when the data is multi-dimensional.

1 What are SOMs?

Self-Organising Maps (SOMs) are a type of unsupervised Artificial Neural Network (ANN). They were developed by Teuvo Kohonen in the early 1980s and are mostly used for clustering, visualisation and data exploration. SOMs reduce n-dimensional data and display it on a two-dimensional map, where similar data is placed into the same grid cells, hereinafter referred to as neurons or nodes.

2 How do SOMs work?

Imagine 1000 people in a (big) room. We define a number of attributes (e.g. gender, age, height, income) and ask the people in the room to move closer to the other people who are most similar to them according to all these attributes. After a while, everyone in the room is surrounded by people who share similar attribute values. This configuration is an example of a two-dimensional representation of multi-dimensional data points. Of course, the SOM algorithm is slightly more complicated. For a more detailed yet easy-to-understand explanation, click here.
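To make the idea a little more concrete, here is a minimal sketch of SOM training in Python with NumPy. It is not the R code used in the study below. The function names (train_som, best_matching_unit), the rectangular grid, the linear decay schedules and all default parameter values are assumptions chosen for illustration only.

```python
import numpy as np

def train_som(data, rows=5, cols=5, n_iter=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small rectangular SOM; returns weights of shape (rows, cols, n_features)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every node, used to measure distances on the map itself
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                    # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)    # neighbourhood radius shrinks, floor of 0.5
        x = data[rng.integers(len(data))]          # pick one random sample
        # Best matching unit (BMU): the node whose weights are closest to the sample
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Gaussian neighbourhood: nodes near the BMU on the grid move most
        grid_dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=2)
        h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
        weights += lr * h[..., None] * (x - weights)
    return weights

def best_matching_unit(weights, x):
    """Grid position (row, col) of the node closest to sample x."""
    dists = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(np.argmin(dists), dists.shape)
```

In the room analogy, each node is a spot in the room and the update step is what pulls similar people towards the same spot. Input variables should be scaled to comparable ranges before training, otherwise the variable with the largest values dominates the distances.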

3 An Example

SOMs offer insights that cannot be explored or displayed with linear indices. The following example provides an alternative look at deprivation patterns in Edinburgh, Scotland. The study uses the 2016 SIMD dataset containing 26 deprivation variables (excluding absolute measures), an overall rank and rankings for each of the seven domains (income, employment, education, health, access to services, crime and housing). To monitor deprivation patterns closely, the study area was restricted to the 597 data zones in Edinburgh. The 26 variables were summed within each of the themes, and the SOM was then trained with seven variables representing the seven themes. After the SOM training, seven clusters were created using hierarchical clustering. All analysis was conducted in RStudio and a copy of the code can be found here.

3.1 Training the SOM

The process starts with the user defining the size, shape and topology of the SOM grid. These factors are determined by the number of observations and can be altered to reduce edge effects. Once they are selected, the SOM is trained and the appropriate number of iterations is determined. As the training iterations progress, the distance from each node’s weights to the samples represented by that node decreases. Ideally, this distance should reach a minimum plateau. The training progress plot shows this distance over time. If the curve is still decreasing, more iterations are required. Once the plateau has been reached, further iterations do not improve the quality. Defining the size, shape and topology and training the SOM is an iterative process, and several combinations are tested before the final parameters are selected.
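The plateau check described above can be sketched in a few lines of Python with NumPy. This is an illustration only, not the study’s R code: the helper names (mean_bmu_distance, has_plateaued), the window size and the 1% tolerance are all assumptions.

```python
import numpy as np

def mean_bmu_distance(weights, data):
    """Mean distance from each sample to the weights of its closest node."""
    flat = weights.reshape(-1, weights.shape[-1])          # (n_nodes, n_features)
    # Distance from every sample to every node, shape (n_samples, n_nodes)
    d = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

def has_plateaued(history, window=10, rel_tol=0.01):
    """True if the last `window` values improved by less than rel_tol
    (relative) compared with the `window` values before them."""
    if len(history) < 2 * window:
        return False
    previous = np.mean(history[-2 * window:-window])
    latest = np.mean(history[-window:])
    return bool((previous - latest) <= rel_tol * previous)
```

Recording mean_bmu_distance at regular intervals during training gives the curve discussed above, and has_plateaued is one rough way to decide whether more iterations are still worthwhile.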

3.2 Clustering

Clustering within SOMs was first conducted by Ritter & Kohonen (1989), who developed their “semantic bird maps”. Since then, it has become a widely used technique in the field of SOMs. In this study, the number of clusters was chosen based on the Within-Cluster Sum of Squares (WCSS) metric, a rough indicator of the ideal number of clusters. On this basis, hierarchical clustering with seven clusters was performed. In addition, the fact that the resulting clusters are spatially contiguous within the SOM, without divided clusters or islands, indicates that the clustering was successful.
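As a rough illustration of the two steps above, the following Python sketch clusters a set of points (for example the SOM node weights) hierarchically and computes the WCSS for a given labelling. It is not the study’s R code. The function names are made up for this example, and complete linkage is an assumption, since the linkage method used in the study is not stated here.

```python
import numpy as np

def agglomerative_labels(points, k):
    """Repeatedly merge the two closest clusters (complete linkage) until k remain."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    clusters = [[i] for i in range(len(points))]   # start with one cluster per point
    while len(clusters) > k:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members
                d = dist[np.ix_(clusters[a], clusters[b])].max()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    labels = np.empty(len(points), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

def wcss(points, labels):
    """Sum of squared distances from each point to the centroid of its cluster."""
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        total += np.sum((members - members.mean(axis=0)) ** 2)
    return float(total)
```

Computing wcss for a range of cluster counts and looking for the point where the curve flattens gives the rough indicator mentioned above. The nested loops are deliberately naive and only suitable for the few hundred nodes of a typical SOM grid.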

By means of component plane analysis and display of the results on a geographic map, the clusters were characterised to facilitate interpretation. Note that the cluster descriptions are not based purely on scientific knowledge; rather, they aim to tell a narrative backed by common knowledge about the city.

Characteristics of the seven clusters. The node background colours represent the seven clusters. The codes show the seven variable properties for each neuron, with larger symbols indicating higher disadvantage. They are displayed to visualise similarities and differences between adjacent neurons.


3.3 Cluster Characteristics


  • Cluster 1 (7%) | The Precariats – The Precariats is a cluster defined by very high disadvantage within the income and employment domains, low education and health issues, whilst access, crime and housing do not seem to be major issues. The Precariats are the poorest and most deprived cluster.
  • Cluster 2 (23%) | Rough Edinburgh – The inhabitants of Rough Edinburgh share similarities with The Precariats but are less vulnerable. However, they still score very low within the income, employment, education and health domains. Typical “Rough Edinburgh” districts are areas with social housing such as Dumbiedykes.
  • Cluster 3 (13%) | The Wealthy Commuters – The Wealthy Commuters are the cluster most disadvantaged by access, but by choice. Apart from the access domain, they show very low disadvantage across all the themes. They typically live on the outskirts or in suburban areas, where they are homeowners and commute to work every day.
  • Cluster 4 (41%) | The Waitrose Shoppers (Edinburgh Posh) – This cluster is defined by uniformly low disadvantage across the seven themes. Typical middle-class families and professionals with high income and education living in urban and suburban areas belong to this cluster. Areas such as Marchmont and Stockbridge are typical of this cluster.
  • Cluster 5 (3%) | The Urban Intermediates – This cluster is defined by excellent access and intermediate characteristics within the income, employment, health and education domains. It is noticeable that there is a relatively high crime rate and rather poor housing conditions.
  • Cluster 6 (1%) | The Crime Triangle (Edinburgh Nightlife) – The Crime Triangle is defined by very high crime and bad housing. It represents only a very small area in the city centre near Princes Street. Data zones within this cluster are not typical residential areas but rather areas where people congregate and where young people enjoy the nightlife, which explains the high crime rate per inhabitant.
  • Cluster 7 (12%) | The Hotchpotch – The Hotchpotch is characterised by very bad housing and excellent access. It encompasses areas in the city centre with a high proportion of students and presumably flat shares, but also more ethnically diverse areas near the city centre.


Finally, the clusters can be projected back onto a map to allow analysis of their distribution.



4 Scientific Sources

Kohonen, T. 1982. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59-69, 10.1007/BF00337288.

Kohonen, T. 2001. Self-Organizing Maps. 3rd ed. Berlin Heidelberg: Springer-Verlag. Available at: (Accessed March 29, 2018).

Skupin, A. & Agarwal, P. 2008. Introduction: What is a Self-Organizing Map? In Agarwal, P. & Skupin, A., eds. Self-Organising Maps. Chichester, UK: John Wiley & Sons, Ltd, 1–20, 10.1002/9780470021699.ch1.

Algorithm to calculate flow within a flat surface or lake

Algorithms such as D8 (O’Callaghan & Mark 1984) or MD8 (Quinn et al. 2006) derive flow directions from an elevation raster by evaluating the elevation of neighbouring cells. However, they do not seem to represent flat surfaces, i.e. accumulations of cells with similar elevation such as lakes, appropriately.

This post proposes a simple algorithm to represent the flow within a lake by implementing a gravity flow towards the lake outflow. In contrast to the algorithm by Garbrecht & Martz (1997) (improved by Barnes et al. 2014), this solution does not increment elevations within the flat surfaces or lakes. Rather, it calculates a “downnode” (i.e. the neighbouring raster cell into which the water flows) for each cell, with gravity acting towards the outflow.


Assumptions made

  • It is assumed that all the nodes (raster cells) and the outflow of a lake are known
  • It is assumed that the outflow does not flow back into the lake
  • It is assumed that lakes / flat surfaces don’t have multiple outflows


Description of the Algorithm

The algorithm starts at the outflow of the lake (Figure 1, a). It searches for neighbours that are within the lake and sets their downnode to the outflow. Of these, the neighbour nearest to the outflow is chosen and its own neighbours are identified in turn (Figure 1, b). Then, again, the nearest node to the outflow that has not yet been checked is chosen and its neighbours are identified. The algorithm continues until every cell’s downnode has been set.


Illustration of Algorithm

Figure 1: Explanation of the lake flow algorithm. The raster cell with the star in (a) represents the lake outflow. Four different states are marked: Not touched (bright blue), Downnode reset (darker blue), Checknode (yellow star, current node to be checked), Checked (red cross, cells that have already been checknodes).



checknode = outflow
tocheck = []

while not every downnode is set:
     # point every unset lake neighbour of the current node at it
     for neighbour in getNeighbours(checknode):
          if neighbour in lake and downnode of neighbour not set:
               neighbour.downnode = checknode
               tocheck.append(neighbour)

     # get the nearest unchecked node to the outflow
     distanceNearest = infinity
     for node in tocheck:
          distance = outflow.distance(node)
          if distance < distanceNearest:
               distanceNearest = distance
               nodeNearest = node
     tocheck.remove(nodeNearest)
     checknode = nodeNearest



outflow – a raster cell representing the outflow of a lake

getNeighbours – gets the 8 neighbours of a cell/node

lake – represents all the nodes/cells within a lake

outflow.distance(node) – returns the Euclidean distance between the outflow and the node. An improved version of the algorithm would calculate the shortest path distance to the outflow instead of the Euclidean distance.
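For readers who want to try this out, the following is one possible runnable Python version of the steps above. It is a sketch under a few assumptions that are not part of the original description: cells are (row, col) tuples, the lake is a set of such tuples, the function name lake_downnodes is made up, and a heap replaces the linear search for the nearest node. The loop stops when no candidates are left, so lake cells that are not connected to the outflow simply get no downnode.

```python
import heapq
import math

def lake_downnodes(lake, outflow):
    """Return a dict mapping each reachable lake cell to its downnode.

    lake    -- set of (row, col) cells belonging to the lake
    outflow -- (row, col) cell of the lake outflow
    """
    # Offsets of the 8 neighbours of a cell
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    downnode = {}
    # Priority queue of nodes still to check, ordered by Euclidean distance to the outflow
    tocheck = [(0.0, outflow)]
    while tocheck:
        _, checknode = heapq.heappop(tocheck)       # nearest unchecked node
        for dr, dc in offsets:
            neighbour = (checknode[0] + dr, checknode[1] + dc)
            # The outflow itself never gets a downnode (it does not flow back into the lake)
            if neighbour in lake and neighbour != outflow and neighbour not in downnode:
                downnode[neighbour] = checknode
                heapq.heappush(tocheck, (math.dist(outflow, neighbour), neighbour))
    return downnode

# Example: a 3 x 3 lake with the outflow in the top-left corner
lake = {(r, c) for r in range(3) for c in range(3)}
flow = lake_downnodes(lake, outflow=(0, 0))
```

Following flow from any cell, one downnode at a time, leads to the outflow. Swapping math.dist for a shortest-path distance within the lake would give the improved variant mentioned above.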



Barnes, R., Lehman, C. & Mulla, D. 2014. An efficient assignment of drainage direction over flat surfaces in raster digital elevation models. Computers & Geosciences, 62, 128–135, 10.1016/j.cageo.2013.01.009.

Garbrecht, J. & Martz, L.W. 1997. The assignment of drainage direction over flat surfaces in raster digital elevation models. Journal of Hydrology, 193, 204–213, 10.1016/S0022-1694(96)03138-1.

O’Callaghan, J.F. & Mark, D.M. 1984. The extraction of drainage networks from digital elevation data. Computer vision, graphics, and image processing, 28, 323–344.

Quinn, P., Beven, K., Chevallier, P. & Planchon, O. 2006. The prediction of hillslope flow paths for distributed hydrological modelling using digital terrain models. Hydrological Processes, 5, 59–79, 10.1002/hyp.3360050106.


Thanks to Manuel Bär for the inspiration!

Welcome to Geo-Blog


We are very excited to start the new project Geo-Blog!

Geo-Blog is a platform that gives young geoscientists and students the opportunity to share and publish their ideas, experiences and projects. Find other people with shared interests to discuss current topics and get valuable input from fellow young scientists. Every specialisation is welcome to be a part of Geo-Blog! You are also invited to share your experiences in the field and find others to join your next trip.

We believe that the different geosciences work too separately on topics of shared interest. The potential for synergies is large but often not used or appreciated. Our platform marks a step in the opposite direction by giving you the opportunity to share your own ideas and experiences as well as give feedback to others. We believe that a community open to all geoscientists and students can bring together different perspectives on the same topic. This allows scientific discourse to open up and encourages thinking outside the box.

Feel free to join and write one of the first posts.


We are looking forward to welcoming you,

Tobias and Livia