Spatio-temporal Text Mining

OR:

Spatio-Temporal Inference of Personal Activity Patterns Using Text-Location Data from Social Media


Nick Malleson, Mark Birkin and Kirk Harland

School of Geography, University of Leeds

nickmalleson.co.uk/

mass.leeds.ac.uk/

Abstract

Crowd-sourced data potentially offer considerable insights into people's attitudes and behaviours. In particular, services that allow users' messages to be marked up with their accurate spatio-temporal location can provide a unique appreciation of people's conceptualisations of place and how these vary by individual, by time or by activity.

The aim of this research is to first identify spatio-temporal clusters of words and phrases posted to social media services. In this manner it will be possible to identify where and when different words or phrases are the most commonly used. The novel aspect of the work will be to then extend the analysis by combining these word clusters with an analysis of the personal geographies and actions of individual people. By identifying spatio-temporal clusters of messages that emerge from individual users it is possible to create a template for their activity spaces (where people live, work, spend leisure time, etc.) and then identify how their use of language varies relative to this personal geography. In this manner we hope to reveal insights into the functions of different places and how these functions vary by person/community and by time.

Outline

Background / Aims

Big data

The data: Twitter messages

The social life of words

Individual behaviour - anchor points

Conclusions and future work

Background

We don't really understand urban dynamics in any spatial/temporal detail

  • What are different areas used for?
  • How many people are in the city centre?
  • Where have they come from?
  • Where are they going to go?

Implications for crime, disease spread, emergency planning, etc.

Vast (new) data sets could replace traditional surveys?

Aims

Can new sources of data add substantially to our understanding of the short-term dynamics of intra-urban mobility and behaviour?

What kinds of methods are needed to extract robust intelligence with maximum value for important problem domains?

Explore public crowd-sourced data to better understand urban dynamics

Method

  1. Identify spatio-temporal clusters of words and phrases posted to social media services
  2. Combine with an analysis of personal geographies and actions
  3. Create a template for activity spaces (where people live, work, spend leisure time, etc.)
  4. Insights into the functions of different places and how these functions vary by person/community and by time.

Big Data

New research paradigm (?)

Very large N (relatively): N = all?

Correlation rather than causation, e.g. Google Flu Trends

Data-driven rather than hypothesis-driven: letting the data speak

Lower sampling bias, can accept lower data accuracy

Big data in the social sciences

Fourth paradigm: data-intensive research (Bell et al., 2009) in the physical sciences

“Crisis” in “empirical sociology” (Savage and Burrows, 2007)

One of the areas that is being most dramatically shaken up by N = all is the social sciences. They have lost their monopoly of making sense of empirical social data, as big data analysis replaces the highly skilled survey specialists of the past. ... When data is collected passively while people do what they normally do anyway, the old biases associated with sampling and questionnaires disappear.

(Mayer-Schönberger and Cukier, 2013)

New data sources

  • Social media
    • Facebook
    • Twitter
    • Foursquare
    • Flickr
  • VGI
    • OpenStreetMap
  • Commercial
    • Loyalty cards
    • Amazon customer database
    • Mobile telephone locations

New paradigms for data collection

(Successful) mobile apps to collect data

Offer something to users

New methodology for survey design (?)

Mappiness

The data: messages on Twitter

Collected with the Twitter Streaming API, which provides real-time access to tweets

Restricted to those with GPS coordinates near Leeds
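
The geographic restriction could be sketched as a simple bounding-box filter. This is an illustration only: the box extent, the `coords` field and the function names are assumptions, since the study's actual definition of "near Leeds" is not stated here.

```python
# Hypothetical bounding box around Leeds, in decimal degrees.
# The extent actually used in the study is not stated.
LEEDS_BBOX = {"south": 53.60, "north": 54.10, "west": -1.90, "east": -1.20}


def near_leeds(lat, lon, bbox=LEEDS_BBOX):
    """Return True if the point falls inside the bounding box."""
    return (bbox["south"] <= lat <= bbox["north"]
            and bbox["west"] <= lon <= bbox["east"])


def keep_geotagged(tweets, bbox=LEEDS_BBOX):
    """Keep only tweets that carry coordinates inside the box.

    Each tweet is assumed to be a dict whose "coords" entry is either
    a (lat, lon) tuple or None when the message has no GPS position.
    """
    kept = []
    for tweet in tweets:
        coords = tweet.get("coords")
        if coords is not None and near_leeds(coords[0], coords[1], bbox):
            kept.append(tweet)
    return kept
```

Messages without coordinates are dropped rather than geocoded from their text, which matches the restriction to GPS-located tweets.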

Density of all tweets

Message Content

Username Eco2solar
Location 53.98711 -1.50138
Time Wed Jul 20 08:34:04 +0000 2011
Message In Harrogate today for housing exhibition and maybe catch up with old friends
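
A record like the one above could be turned into typed fields as follows. This is a minimal sketch: the dict layout and the `parse_tweet` name are hypothetical and are not the Streaming API's actual payload format; only the timestamp layout is taken from the example.

```python
from datetime import datetime

# Timestamp layout of the example record, e.g. "Wed Jul 20 08:34:04 +0000 2011".
# Day and month names are parsed using the current locale (English assumed).
TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_tweet(raw):
    """Convert a raw record (hypothetical dict layout) into typed fields."""
    lat_text, lon_text = raw["location"].split()
    return {
        "user": raw["username"],
        "lat": float(lat_text),
        "lon": float(lon_text),
        "time": datetime.strptime(raw["time"], TIME_FORMAT),  # timezone-aware
        "text": raw["message"],
    }


example = parse_tweet({
    "username": "Eco2solar",
    "location": "53.98711 -1.50138",
    "time": "Wed Jul 20 08:34:04 +0000 2011",
    "message": "In Harrogate today for housing exhibition "
               "and maybe catch up with old friends",
})
```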

Tweets per user

Dominated by a small number of prolific users

Tweets per hour

More messages are posted in the evening
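
The hourly profile could be computed along these lines. The function name is an assumption, and the sketch counts by the hour stored in each timestamp; the example record is in UTC, so a conversion to local time may be needed before comparing against "evening".

```python
from collections import Counter
from datetime import datetime


def tweets_per_hour(timestamps):
    """Return a 24-element list: number of messages posted in each hour."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


# Usage with three made-up timestamps
sample = [
    datetime(2011, 7, 20, 8, 34),
    datetime(2011, 7, 20, 20, 5),
    datetime(2011, 7, 21, 20, 59),
]
hourly = tweets_per_hour(sample)
```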

The social life of words

Aim: Linkage between what people say and what they're doing / where they are

What are the most popular words in Leeds?

Graph of word counts

Spatio-temporal places

‘City’ = within 2 km of Leeds centre
‘Home’ = weekdays, midnight to 8am
‘Work’ = weekdays, 8am to 6pm
‘Play’ = evenings 6pm to midnight, weekends
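
These definitions could be applied to each message roughly as follows. This is a sketch under assumptions: the centre coordinates are approximate, times are taken to be local, and 'City' is treated as a spatial flag that can combine with any of the three time-based labels, which the definitions above do not state explicitly.

```python
import math
from datetime import datetime

LEEDS_CENTRE = (53.7997, -1.5491)  # assumed approximate coordinates
CITY_RADIUS_KM = 2.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    radius = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


def time_place(when):
    """Label a (local) datetime as 'Home', 'Work' or 'Play'."""
    if when.weekday() >= 5:      # Saturday or Sunday
        return "Play"
    if when.hour < 8:            # weekday, midnight to 8am
        return "Home"
    if when.hour < 18:           # weekday, 8am to 6pm
        return "Work"
    return "Play"                # weekday evening, 6pm to midnight


def in_city(lat, lon):
    """True if the point lies within the 'City' radius of the centre."""
    return haversine_km(lat, lon, *LEEDS_CENTRE) <= CITY_RADIUS_KM
```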

Table of most common words in each place

Index of dissimilarity

Explore the degree to which different words are distributed in space and time

Interested in words with higher dissimilarity values (the most segregated)
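
One common form of the index is D = 0.5 × Σ |w_i / W − o_i / O|, where w_i and o_i are counts in zone i and W, O are their totals. The sketch below assumes that form and assumes each word is compared against all other words; the slides do not state which variant or which zones were used.

```python
def dissimilarity_index(word_counts, other_counts):
    """Index of dissimilarity between two count distributions over zones.

    word_counts[i]  : occurrences of the word of interest in zone i
    other_counts[i] : occurrences of the comparison group in zone i
                      (assumed here to be all other words)
    Returns a value from 0 (identical distribution) to 1 (fully segregated).
    """
    if len(word_counts) != len(other_counts):
        raise ValueError("both inputs must cover the same zones")
    total_word = sum(word_counts)
    total_other = sum(other_counts)
    if total_word == 0 or total_other == 0:
        raise ValueError("each group needs at least one occurrence")
    return 0.5 * sum(
        abs(w / total_word - o / total_other)
        for w, o in zip(word_counts, other_counts)
    )


# Usage: a word concentrated in the first of two zones
d = dissimilarity_index([8, 2], [5, 5])
```

Zones here can be spatial units, time slots, or space-time combinations such as the places defined earlier.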


Spatio-temporal places

Significant variation in the words used in different types of times and places.

Clusters of the phrase 'LUFC' in and around Leeds

Space-time activity patterns

PC1       PC2      PC3     PC4         PC5       PC6
Team      Lunch    Dad     Office      Meeting   Love
Football  Pub      Dinner  University  Driving   E-mail
Goal      Wine     Film    School      Business  Young
Match     Pic      Sleep   Haha        Website   Gorgeous
Play      Photo    Watch   Pizza       Boring    Sweet
Score     Beer     Mum     Exam        Joke      Facebook
Mate      Park     Family  College               Funny
City      Lovely   Bought  Gutted                Annoying
Games
News
SPORTS    LEISURE  HOME    COLLEGE     WORK      EMOTION

Spatial distribution of 'home' clusters

Spatial distribution of the 'home' geodemographic cluster

Spatial distribution of 'work' clusters

Spatial distribution of the 'work' geodemographic cluster

Summary of cluster proportions for some chosen SOAs.

Individual behaviour - anchor points

Overall aim: to better understand urban dynamics.

Exploring activity patterns / activity space is one element of this.

Activity space: the areas a person visits regularly and is reasonably familiar with

Prior research: Spatio-temporal behaviour

Aim: To what extent can we learn about spatio-temporal behaviour from crowd-sourced data?

Stage 1 - Extract prolific users

Those with 50+ messages in the data
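
Stage 1 amounts to a count-and-threshold step, which might look like this. The `user` field and the function name are assumptions about how the messages are stored.

```python
from collections import Counter

MIN_MESSAGES = 50  # threshold for a "prolific" user, as stated above


def prolific_users(tweets, min_messages=MIN_MESSAGES):
    """Return the set of users with at least `min_messages` tweets.

    Each tweet is assumed to be a dict with a "user" entry.
    """
    counts = Counter(tweet["user"] for tweet in tweets)
    return {user for user, n in counts.items() if n >= min_messages}
```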

Total message density for all prolific users

Stage 2 - Generate a message density surface

Use kernel density estimation to highlight areas with high spatial message density
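
A density surface for one user could be generated along these lines. This is a minimal sketch, assuming a Gaussian kernel, a fixed bandwidth and coordinates already projected to metres; the kernel, bandwidth and cell size actually used in the study are not given here, and all names are illustrative.

```python
import math


def kde_surface(points, x_min, y_min, cell_size, n_cols, n_rows, bandwidth):
    """Gaussian kernel density evaluated at the centre of each grid cell.

    points    : list of (x, y) message locations in projected units
    bandwidth : kernel standard deviation, in the same units (assumed value)
    Returns grid[row][col], with row 0 at y_min.
    """
    if not points:
        raise ValueError("at least one point is required")
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    grid = []
    for row in range(n_rows):
        y = y_min + (row + 0.5) * cell_size
        line = []
        for col in range(n_cols):
            x = x_min + (col + 0.5) * cell_size
            total = 0.0
            for px, py in points:
                dist_sq = (x - px) ** 2 + (y - py) ** 2
                total += math.exp(-dist_sq / (2.0 * bandwidth ** 2))
            line.append(total * norm)
        grid.append(line)
    return grid


# Usage: three messages at one location, one message elsewhere
pts = [(150, 150), (150, 150), (150, 150), (850, 850)]
surface = kde_surface(pts, 0, 0, 100, 10, 10, bandwidth=100)
```

The nested loops are slow for large grids; with roughly 10,000 users a vectorised or library implementation would likely be preferable.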

Generating message density (3D)

Stage 3 - Identify areas of 'unusually' high density (anchor points)

GIS method used to identify peaks in digital elevation data

Landserf free software (Java)

Wood, J. (2011). Identifying Mountains with GIS. In Heywood, I., Cornelius, S. and Carver, S. An Introduction to Geographical Information Systems. Prentice Hall.
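
As an illustration of the idea only, peaks in a density grid can be found with a plain local-maximum test. This is a deliberately simpler stand-in and not the Landserf method: the 8-neighbour rule and the "mean plus k standard deviations" threshold for 'unusually' high density are both assumptions.

```python
from statistics import mean, pstdev


def find_anchor_points(grid, k=2.0):
    """Return (row, col, value) for cells that are candidate anchor points.

    A cell qualifies if its value is at least mean + k * std of the whole
    surface (assumed definition of 'unusually high') and strictly greater
    than all of its neighbours (up to 8).
    """
    values = [v for row in grid for v in row]
    threshold = mean(values) + k * pstdev(values)
    n_rows, n_cols = len(grid), len(grid[0])
    peaks = []
    for r in range(n_rows):
        for c in range(n_cols):
            value = grid[r][c]
            if value < threshold:
                continue
            neighbours = [
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(n_rows, r + 2))
                for cc in range(max(0, c - 1), min(n_cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                peaks.append((r, c, value))
    return peaks


# Usage on a small made-up surface with two peaks and one shoulder
demo = [[0.0] * 5 for _ in range(5)]
demo[1][1] = 10.0
demo[3][3] = 6.0
demo[3][2] = 3.0
anchors = find_anchor_points(demo, k=2.0)
```

A strict local-maximum rule ignores flat plateaux and does not measure how far a peak rises above its surroundings, which is one reason a dedicated terrain-analysis method may give better results.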

Example of GIS analysis to identify peaks
Activity space identification - good results
Activity space identification - bad results

Next...

Anchor points identified - do some analysis!

Variations in number, size, location of anchor points in different areas

Links to geodemographics (some groups have different awareness spaces?)

Temporal analysis

Links with word clusters

Issues with the Twitter data

Huge sampling bias:
~1% sample (from Twitter)
< 10% sample (from GPS)
Who’s missing?

Enormous skew

Similar problems with other data (e.g. mobile phones, rail travel smart cards, Oyster)

Some solutions:
Geodemographics
Linking to other individual-level data sets?

Problems (?) with Big Data analysis

Impossible to validate manually (~10,000 different density surfaces generated)
(Not letting the data speak)

Computational difficulties (no ArcGIS or Excel)

Ethics: Publicly available data, but are users aware?

Future work

More of the same...

Use (agent-based) simulation to incorporate data and overcome biases

Thank you
