4. twied.eventec: Event Detection
4.1. Event Detection
The event detection class is the main class which should be used for event detection. It takes in tweets and manages the other cluster modules.
-
class twied.eventec.eventdetection.EventDetection(field='geo.coordinate', tsfield='timestamp_obj', gridded=True, popfunc=None, popmaploc=None, mincnt=10, mdcnt=30, cluster_radius=10, cluster_timediff=datetime.timedelta(0, 1800), cluster_maxage=datetime.timedelta(0, 21600))
Main entry point for the Event Detection implementation.
This method is based on the one proposed in M. Walther and M. Kaisser, “Geo-spatial event detection in the twitter stream”, which focuses on clustering tweets temporally and spatially.
The technique monitors the most recent tweets and clusters those issued geographically close to each other in a given timeframe. Three modules manage the clusters:
ClusterCreator
checks the most recent tweets and creates new clusters where more than a threshold number of tweets were issued within a set time frame and within a set radius.
ClusterUpdater
adds new tweets to clusters if they are within the area of an existing cluster. This module also merges clusters that are overlapping. Clusters are deleted after they have reached a maximum amount of time with no new tweets being added to them.
ClusterManager
manages the clusters.
This class manages the creation, updating and deletion of event clusters by passing the tweets to each of the modules and deciding what should happen.
Usage: After the object is initialised, tweets should be fed to the process_tweet() method in ascending datetime order. The order is required because the classes compare times against the newest tweet, removing tweets or events older than a certain age.
>>> ed = EventDetection()
>>> for t in tweets:
...     ed.process_tweet(t)
Active clusters can be retrieved at any time using the get_clusters() method. Other methods also exist for getting expired events, which are clusters that did not add a new tweet for a period of time. The removal of clusters will only be performed once a new tweet has arrived.
Each cluster returned from these methods is an EventCluster object. A dictionary form of a cluster can be retrieved using the as_dict() method on the cluster object.
>>> clusters = ed.get_all_clusters()
>>> cdicts = [c.as_dict() for c in clusters]
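Because tweets must arrive in ascending datetime order, an unordered batch needs sorting before it is fed in. The following is a minimal sketch, assuming each tweet is a dictionary holding a datetime under the default tsfield 'timestamp_obj'; the helper name in_time_order is hypothetical and not part of twied.

```python
from datetime import datetime


def in_time_order(tweets, tsfield="timestamp_obj"):
    """Return the tweets sorted oldest first (hypothetical helper, not part of twied)."""
    return sorted(tweets, key=lambda t: t[tsfield])


tweets = [
    {"id": 2, "timestamp_obj": datetime(2016, 1, 1, 12, 30)},
    {"id": 1, "timestamp_obj": datetime(2016, 1, 1, 12, 0)},
]
ordered = in_time_order(tweets)
# Each tweet in `ordered` would then be passed to ed.process_tweet(t) in turn.
```

Sorting once up front is enough for a stored batch; a live stream would need to be consumed in arrival order instead.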
Initialises and sets up the event detection class.
Either the popfunc value or the popmaploc field must be set. If there is no popmaploc value, then the popfunc will be used to calculate the required number of tweets to start an event at a position. Otherwise the function using a PopMap will be used.
See also
PopMap, which describes the function for determining the number of required tweets.
Parameters:
- field (str) –
The field in the tweet dictionary where the coordinates are stored. (default ‘geo.coordinate’)
Can be passed a dot-delimited string giving the location of the coordinate in the tweet dictionary.
- tsfield (str) – The field in the tweet dictionary where the timestamp field is stored. (This field cannot be dot delimited). (default ‘timestamp_obj’)
- gridded (bool) – Whether to use the ClusterCreatorGrid or the less performant ClusterCreator. See the documentation for these classes for reasons to use either. (default `True`)
- popfunc (lambda) –
(Optional) Function which takes two parameters, lon and lat, and returns a single integer value: the number of tweets required to start an event at that coordinate. If this function is not set, the population map is assumed and the popmaploc string must be set.
Simple example for a static value:
>>> popfunc = lambda lon, lat: 25
- popmaploc (str) –
(Optional) The location of the population grid to be used for the population map function. This only works for the UK grid. If this value is not set then the popfunc value must be set.
If the popmap still needs to be used with a different function, this can be set up separately and passed in using the popfunc parameter.
- mincnt (int) – If the popmaploc value is set this is the minimum number of tweets required to start an event in the lowest populated areas. (default 10)
- mdcnt (int) – If the popmaploc value is set this is the median number of tweets required to start an event in the median populated areas. (default 30)
- cluster_radius (float) – Maximum radius of the events (km). (default 10km)
- cluster_timediff (datetime.timedelta) – Maximum time difference between all tweets to create a cluster. (default 30 minutes)
- cluster_maxage (datetime.timedelta) – Maximum age of a cluster since the last new tweet before it is deleted. (default 6 hours)
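A popfunc can encode any rule that maps a coordinate to a threshold, not only a constant. The sketch below assumes the (lon, lat) argument order given in the popfunc description; the factory name make_popfunc, the bounding box and the counts are all illustrative values rather than anything defined by twied.

```python
def make_popfunc(box, inside_count, outside_count):
    """Build a popfunc that requires more tweets inside a busy bounding box.

    box is (min_lon, min_lat, max_lon, max_lat); every value is illustrative.
    """
    min_lon, min_lat, max_lon, max_lat = box

    def popfunc(lon, lat):
        inside = min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
        return inside_count if inside else outside_count

    return popfunc


# Rough box around London (assumed values, for illustration only).
popfunc = make_popfunc((-0.5, 51.3, 0.3, 51.7), inside_count=40, outside_count=15)
```

The resulting function could then be passed as the popfunc argument in place of a static lambda.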
-
get_all_clusters()
Gets the list of current active clusters and old clusters appended together from the ClusterManager.
-
get_unclustered()
Gets the list of current unclustered tweets from the ClusterCreator.
Warning
Only the non-gridded ClusterCreator class supports getting the list of unclustered tweets. It is not possible to get this list when using the ClusterCreatorGrid class. An exception is raised if this method is used with the wrong class.
Returns: List of unclustered tweets.
-
process_tweet(tweet)
Process a tweet through the event detection.
Tweets should be fed into this method in ascending datetime order. If the coordinate or population of the tweet cannot be retrieved, the error is suppressed and the method returns False.
Parameters: tweet (dict) – Dictionary of tweet information.
Returns: Whether the tweet was processed correctly.
Return type: bool
4.2. Cluster Modules
These modules are primarily used by the EventDetection class to create and alter clusters.
-
class twied.eventec.clustermanager.ClusterManager(field, tsfield, radius=10, timediff=datetime.timedelta(0, 1800), maxage=datetime.timedelta(0, 21600))
Class to manage detected clusters. Stores current active clusters along with expired clusters.
Creates a cluster manager.
Parameters:
- field (str) –
The field in the tweet dictionary where the coordinates are stored.
Can be passed a dot-delimited string giving the location of the coordinate in the tweet dictionary.
- tsfield (str) – The field in the tweet dictionary where the timestamp field is stored. (This field cannot be dot delimited).
- radius (float) – Maximum radius of the events (km). (default 10km)
- timediff (datetime.timedelta) – Maximum time difference between all tweets to create a cluster. (default 30 minutes)
- maxage (datetime.timedelta) – Maximum age of a cluster since the last new tweet before it is deleted. (default 6 hours)
-
add_cluster(tweets, maintweet)
Creates a new cluster and saves it. The maintweet parameter will be used as the centre for the cluster.
Parameters:
- tweets – Tweets within the cluster.
- maintweet – Tweet which created the cluster.
Returns: The created cluster object.
Return type:
-
get_all_clusters()
Gets a list of all clusters that have been created, including ones that have reached the maximum age.
Returns: List of clusters.
Return type: list
-
get_coordinate(tweet)
Gets a Coordinate object from a tweet, which stores the values of the lat and long coordinates. Uses the field passed in during initialisation of this class to find the coordinate field in the tweet object.
Parameters: tweet (dict) – Dictionary of tweet information.
Returns: Coordinate object.
Return type: Coordinate
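The dot-delimited field lookup can be pictured as a walk through nested dictionaries. This is only a sketch of the idea: the helper name get_field is hypothetical, and the shape of the example tweet (including the order of the two coordinate values) is an assumption rather than the format twied requires.

```python
def get_field(tweet, field):
    """Follow a dot-delimited path such as 'geo.coordinate' through nested dicts."""
    value = tweet
    for key in field.split("."):
        value = value[key]  # raises KeyError if any level is missing
    return value


# Assumed tweet shape, for illustration only.
tweet = {"geo": {"coordinate": [-3.533687, 50.734910]}}
coordinate = get_field(tweet, "geo.coordinate")
```

A missing key raises KeyError here; a caller that wants to skip malformed tweets would catch it.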
-
merge_clusters(c1, c2)
Merges two clusters together. Will add all of the information from c1 to c2 and then remove c2 from the active clusters.
Parameters:
- c1 (TweetCluster) – First cluster to merge.
- c2 (TweetCluster) – Second cluster to merge.
-
remove_cluster(cluster)
Removes a cluster from the current active list and adds it to the list of old clusters.
Parameters: cluster – The TweetCluster object to remove from the active list.
Returns: Whether the cluster was removed or not. Returns False if the cluster is not in the active set.
Return type: bool
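Moving clusters from the active list to the old list once they exceed the maximum age can be sketched as below. This is an illustration under stated assumptions, not the ClusterManager implementation: clusters are plain dictionaries with a 'latest' datetime, and the function name split_expired is hypothetical. Age is measured against the newest tweet's time, which is why removal only happens when a new tweet arrives.

```python
from datetime import datetime, timedelta


def split_expired(active, newest_time, maxage=timedelta(hours=6)):
    """Split clusters into (still_active, expired) relative to the newest tweet time.

    Each cluster is a dict with a 'latest' datetime; this shape is an assumption.
    """
    still_active, expired = [], []
    for cluster in active:
        if newest_time - cluster["latest"] > maxage:
            expired.append(cluster)
        else:
            still_active.append(cluster)
    return still_active, expired


now = datetime(2016, 1, 1, 18, 0)
clusters = [
    {"name": "old", "latest": now - timedelta(hours=7)},
    {"name": "fresh", "latest": now - timedelta(minutes=20)},
]
active, old = split_expired(clusters, newest_time=now)
```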
-
class twied.eventec.clustermanager.Coordinate(lat, lon)
Object which stores the lat and lon coordinates and provides some methods to quickly access or reverse the list depending on the needs.
Create a new object.
Parameters:
- lat (float) – The latitude of the coordinate.
- lon (float) – The longitude of the coordinate.
-
class twied.eventec.clustermanager.TweetCluster(tweets, centre, clsman)
Holds information about a tweet cluster including the list of tweets within it, the centres of the cluster, the required population that initialised the cluster and the latest tweet within it.
This class is used by the other modules in the event detection and provides methods to merge the cluster or check if a tweet should be within. Also allows for new tweets to be added to the cluster.
Note
as clusters can be merged there can be multiple centres of the cluster.
Creates a new TweetCluster.
Parameters:
- tweets (list, dict) – List of tweets within the cluster.
- centre (dict) – Tweet which represents the centre of the cluster.
- clsman (ClusterManager) – The ClusterManager object.
-
add_tweet(tweet)
Add a tweet to the cluster. Will also update the value for the latest tweet.
Parameters: tweet (dict) – Dictionary of tweet information.
Note
This method does not test if the tweet should be within the cluster and will add it regardless. The in_cluster() method should be used first to ensure that the tweet is within the cluster radius.
-
as_dict()
Returns the information from the tweet cluster in a dictionary format. This is then used when saving the cluster information.
The returned dictionary includes:
- tweets - list of all tweets within the cluster, including each tweet's tweet_id, the time of posting and the coordinate of the posting location.
- centres - list of centres in the cluster.
- times - the start and finish times of the tweets posted in the cluster.
Returns: Cluster dictionary.
-
get_points()
Gets a list of the coordinates of all the tweets in the cluster.
Returns: List of Coordinate objects for all of the locations of the tweets.
Return type: list, Coordinate
-
in_cluster(tweet)
Checks if a tweet was posted within the radius of the cluster. Will check against all the centres of the cluster.
Parameters: tweet (dict) – The tweet dictionary to test.
Returns: Whether the tweet was posted within the required radius.
Return type: bool
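A radius test against several centres can be sketched with the haversine great-circle distance. Whether twied uses haversine or another distance measure is not stated here, so treat the formula choice, the function names and the sample coordinates as assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_any_radius(point, centres, radius_km=10):
    """True if point lies within radius_km of at least one centre."""
    return any(haversine_km(*point, *centre) <= radius_km for centre in centres)


centres = [(50.7349, -3.5337), (51.5074, -0.1278)]  # Exeter and London
near = in_any_radius((50.7260, -3.5275), centres)   # a point close to Exeter
far = in_any_radius((53.4808, -2.2426), centres)    # Manchester
```

Checking every centre matters because a merged cluster can have more than one.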
-
merge(cluster)
Merge two clusters by appending both the tweets and the cluster centres together.
Parameters: cluster (TweetCluster) – The other cluster to merge into this one.
-
class twied.eventec.clustercreator.ClusterCreator(clsman)
Class to manage the creation of clusters from tweets.
Tweets are fed in through the process_tweet() method. These are compared with the current unclustered tweets and a cluster is created if the required number of tweets were posted within a certain timeframe and radius.
Class also manages a list of unclustered tweets. Tweets are added using the add_unclustered() method. The process_tweet() method will remove tweets from the unclustered set if they are added to a cluster or if the difference between the times is greater than the threshold.
See also
ClusterCreatorGrid, a more efficient version of this class which only examines surrounding tweets using a grid data structure.
Create the cluster manager.
Parameters: clsman (ClusterManager) – The cluster manager object.
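The creation step described above can be sketched as a scan over the unclustered tweets. This is a simplified illustration, not the ClusterCreator code: tweets are dictionaries with assumed 'pos' (lat, lon) and 'time' keys, distance uses an equirectangular approximation, the name try_create_cluster is hypothetical, and reaching the required count (rather than exceeding it) is assumed to be enough.

```python
import math
from datetime import datetime, timedelta


def distance_km(a, b):
    """Approximate distance between two (lat, lon) points (equirectangular)."""
    mean_lat = math.radians((a[0] + b[0]) / 2)
    dx = math.radians(b[1] - a[1]) * math.cos(mean_lat)
    dy = math.radians(b[0] - a[0])
    return 6371.0 * math.hypot(dx, dy)


def try_create_cluster(new, unclustered, required, radius_km=10,
                       timediff=timedelta(minutes=30)):
    """Return the tweets forming a new cluster, or None if there are too few."""
    # Drop unclustered tweets that are too old relative to the newest tweet.
    unclustered[:] = [t for t in unclustered if new["time"] - t["time"] <= timediff]
    nearby = [t for t in unclustered if distance_km(t["pos"], new["pos"]) <= radius_km]
    members = nearby + [new]
    if len(members) < required:
        unclustered.append(new)
        return None
    # Clustered tweets leave the unclustered set.
    unclustered[:] = [t for t in unclustered if not any(t is m for m in nearby)]
    return members


start = datetime(2016, 1, 1, 12, 0)
pending = []
result = None
for i in range(3):
    tweet = {"pos": (50.73 + i * 0.001, -3.53), "time": start + timedelta(minutes=i)}
    result = try_create_cluster(tweet, pending, required=3)
```

Every new tweet is compared with all pending tweets in this version, which is the cost the gridded variant is meant to reduce.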
-
class twied.eventec.clustercreatorgrid.ClusterCreatorGrid(clsman)
Class to manage the creation of clusters from tweets. This is an alternative implementation of the ClusterCreator class which uses a GeoGrid for more efficient comparisons of nearby tweets.
Tweets are fed in through the process_tweet() method. These are compared with the current unclustered tweets and a cluster is created if the required number of tweets were posted within a certain timeframe and radius.
Class also manages a list of unclustered tweets. Tweets are added using the add_unclustered() method. The process_tweet() method will remove tweets from the unclustered set if they are added to a cluster or if the difference between the times is greater than the threshold.
See also
ClusterCreator, a less efficient version of this class which examines all unclustered tweets.
Note
This class will not work well for large cluster radii; ClusterCreator should be used instead.
Create the cluster manager.
Parameters: clsman (ClusterManager) – The cluster manager object.
-
class twied.eventec.clustercreatorgrid.GeoGrid
Simple class which stores tweets in a grid storage system for faster retrieval of neighbouring tweets.
The add_tweet() method adds a tweet to a cell based on its coordinate. The tweet is stored in the cell of the coordinate and in the surrounding cells. The cell key the tweet is stored under is the int value of the coordinates. The get_surrounding() method retrieves the tweets within a cell, meaning tweets in surrounding cells are also included.
Note
This class is not useful if the event radius parameter is too large.
Set up the geogrid.
-
add_tweet(tweet)
Add a tweet dictionary to the cell of its coordinates and the surrounding cells. Uses the get_str() method to calculate the cells the tweet will be stored in.
Parameters: tweet (dict) – Dictionary of tweet information.
-
static get_str(coord)
Gets a string of the lat and lon integer values of the coordinate for fast indexing using a dictionary.
Example:
>>> g = GeoGrid()
>>> g.get_str((50.734910, -3.533687))
'50,-3'
Parameters: coord (tuple/list, float) – Tuple of the (lat, lon) of the coordinate.
Returns: String of the key for the coordinate.
Note
This function does not round the values of the coordinates.
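The note about rounding can be illustrated with a stand-alone key function. This is a sketch that reproduces the documented output using int(), which truncates toward zero; the name grid_key is hypothetical and the actual get_str() implementation may differ.

```python
def grid_key(coord):
    """Build a 'lat,lon' key from the integer parts of a coordinate.

    int() truncates toward zero rather than rounding, so -3.53 becomes -3.
    """
    lat, lon = coord
    return "{},{}".format(int(lat), int(lon))


key = grid_key((50.734910, -3.533687))  # '50,-3', matching the example above
```

One consequence of truncation is that every value strictly between -1 and 1 maps to 0, so cells touching the equator or the prime meridian cover twice the usual span.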
-
get_surrounding(tweet)
Gets the tweets within a cell based on the coordinate of the tweet. As tweets are stored in surrounding cells, this also returns the tweets surrounding this cell.
Parameters: tweet (dict) – Dictionary of tweet information.
Returns: Surrounding tweet dictionaries.
Return type: list, dict
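The storage scheme, where each tweet is written to its own cell and the neighbouring cells so that one lookup returns the whole neighbourhood, can be sketched as follows. This toy class is an illustration only: TinyGrid is a hypothetical name, it uses tuple keys instead of the string keys produced by get_str(), and tweets are assumed to carry a 'pos' (lat, lon) tuple.

```python
from collections import defaultdict


class TinyGrid:
    """Toy one-degree grid: each tweet is stored in its cell and the 8 neighbours."""

    def __init__(self):
        self.cells = defaultdict(list)

    @staticmethod
    def _keys(lat, lon):
        base_lat, base_lon = int(lat), int(lon)
        return [(base_lat + dy, base_lon + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

    def add(self, tweet):
        for key in self._keys(*tweet["pos"]):
            self.cells[key].append(tweet)

    def surrounding(self, tweet):
        lat, lon = tweet["pos"]
        # One lookup suffices: neighbours were copied into this cell on add.
        return list(self.cells.get((int(lat), int(lon)), []))


grid = TinyGrid()
exeter = {"id": 1, "pos": (50.73, -3.53)}
bristol = {"id": 2, "pos": (51.45, -2.59)}
leeds = {"id": 3, "pos": (53.80, -1.55)}
for t in (exeter, bristol, leeds):
    grid.add(t)
ids = [t["id"] for t in grid.surrounding(exeter)]
```

The trade-off is extra memory (nine references per tweet) in exchange for a single dictionary lookup per query, which stops paying off once the event radius spans many cells.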
-
remove_tweet(tweet)
Removes a tweet dictionary from all cells it is in.
Parameters: tweet (dict) – Dictionary of tweet information.
Note
The tweet dictionary must be the same object as the one added using the add_tweet() method.
-
4.3. Population Grid
This class is used as an interface for the grid of worldwide population.
-
class twied.eventec.popcount.PopMap(filename='glds15ag.asc')
Provides access to the population map for choosing the required number of tweets to create a cluster.
This uses the population map retrieved from the SEDAC data centre. The file should be an .asc ASCII file containing the population of each point on the planet.
Note
While the get_reqfunc_uk() function is provided for the UK, there is no function for the rest of the world.
Create a new PopMap instance.
Parameters: filename (str) – The population file to load.
-
get_cell(lon, lat)
Translates a lat, lon coordinate into the coordinates of the population on the population grid.
Parameters:
- lon (float) – Longitude of the coordinate.
- lat (float) – Latitude of the coordinate.
Returns: The indexes of the population in the grid.
Return type: Tuple of the (x, y) grid coordinate in the population grid.
See also
get_ll()
performs the reverse of this method.
-
get_ll(clat, clon)
Translates a cell_lat, cell_lon coordinate back into global lat, lon coordinates. Note that this will not return the exact value that was used when creating the cell coordinates.
Parameters:
- clat (integer) – The x coordinate of the cell.
- clon (integer) – The y coordinate of the cell.
Returns: The lat, lon position of the centre of the grid cell.
Return type: tuple
See also
get_cell()
performs the reverse of this method.
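The round trip between coordinates and grid cells, and why it returns the cell centre rather than the original point, can be sketched for a regular ASCII grid. The header values below are assumptions chosen for illustration and are not read from glds15ag.asc; the function names to_cell and to_lonlat are hypothetical and do not mirror the PopMap argument order.

```python
import math

# Illustrative header values for an ESRI ASCII grid (assumptions, not read from a file).
XLLCORNER, YLLCORNER, CELLSIZE, NROWS = -180.0, -90.0, 0.25, 720


def to_cell(lon, lat):
    """Map a lon, lat position to (column, row); row 0 is the northernmost row."""
    col = int(math.floor((lon - XLLCORNER) / CELLSIZE))
    row = NROWS - 1 - int(math.floor((lat - YLLCORNER) / CELLSIZE))
    return col, row


def to_lonlat(col, row):
    """Map a (column, row) cell back to the lon, lat of the cell centre."""
    lon = XLLCORNER + (col + 0.5) * CELLSIZE
    lat = YLLCORNER + (NROWS - 1 - row + 0.5) * CELLSIZE
    return lon, lat


cell = to_cell(-3.533687, 50.734910)
centre = to_lonlat(*cell)
```

Every point inside a cell maps to the same indexes, so the reverse mapping can only recover the centre of that cell.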
-
get_population(lon, lat)
Gets the population at a lon, lat coordinate.
Parameters:
- lon (float) – Longitude of the coordinate.
- lat (float) – Latitude of the coordinate.
Returns: The population at the coordinate.
Return type: float
-
get_reqfunc_uk(mediancount, mincount)
Calculates and returns a function for finding the required number of tweets to initialise a cluster within the UK.
Method returns a lambda function which takes a population value and returns the number of tweets required to start the cluster. The function is a logarithmic function which will return a minimum value at the lowest population densities within the UK and a median value at places with median population densities. The function levels off for larger population values.
Parameters:
- mediancount – The median count to return at median population densities within the UK.
- mincount – The minimum count to return at the minimum population densities within the UK.
Returns: lambda function which takes a population and returns the required number of tweets.
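One way to build a logarithmic function with these properties is to fit count = a * ln(population) + b through the two known points. The sketch below is an assumption about the general shape only: the name make_reqfunc is hypothetical, and the min_pop and median_pop values are placeholders, not the UK figures used by twied.

```python
import math


def make_reqfunc(mediancount, mincount, min_pop, median_pop):
    """Fit count = a * ln(pop) + b through (min_pop, mincount) and (median_pop, mediancount)."""
    a = (mediancount - mincount) / (math.log(median_pop) - math.log(min_pop))
    b = mincount - a * math.log(min_pop)

    def reqfunc(population):
        # Clamp so sparse areas never need fewer than mincount tweets.
        population = max(population, min_pop)
        return int(round(a * math.log(population) + b))

    return reqfunc


# Population values here are placeholders, not the figures used by twied.
reqfunc = make_reqfunc(mediancount=30, mincount=10, min_pop=1.0, median_pop=100.0)
```

Because the curve is logarithmic, each further multiplication of the population adds the same fixed number of tweets, which is what makes the requirement flatten out in dense areas.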
-