Guide
=====

Installation
------------

Below is a guide on how to set up and use the package. This package requires *Python 3*. A number of other modules are required for the package to function correctly. Installing the package using *pip* should automatically include all of these modules; however, they may need to be installed separately. Required modules and supported versions are listed in the requirements file in the *GitHub* repository.

**Pip Installation:**

`twied` can be installed via pip:

.. code-block:: bash

    pip install git+https://github.com/Humpheh/twied.git

**Source Code:**

The source code for `twied` can be retrieved from *GitHub* at https://github.com/Humpheh/twied. This can be cloned using:

.. code-block:: bash

    git clone https://github.com/Humpheh/twied.git

**Example Scripts:**

Example usage scripts can be seen in the *GitHub* repository under the directory `/scripts/examples`.

Example System
--------------

Below is an image which shows the layout of components within a system. This system was originally designed for use in the author's dissertation.

.. image:: /projectsystem.png

SLP Example
-----------

Below is an example of the SLP (spatial label propagation) method.

.. warning:: This implementation is not complete and will not work outright without further work.

.. code-block:: python

    """Testing the spatial label propagation algorithm."""
    import logging
    import sys

    import pymongo
    from configparser import NoOptionError

    import twieds
    from labelprop.inference import InferSL

    # Must run this as a script
    if __name__ == "__main__":
        config = twieds.setup("logs/labelprop.log", "settings/locinf.ini")

        # Connect to the MongoDB
        logging.info("Connecting to MongoDB...")
        client = pymongo.MongoClient(config.get("mongo", "address"), config.getint("mongo", "port"))
        logging.info("Connected to MongoDB")

        # Select the database and collection based off the config
        try:
            db = client[config.get("mongo", "database")]
            user_col = db["users"]
            user_col.create_index([('user.id', pymongo.ASCENDING)], unique=True)
            user_col.create_index([('user.screen_name', pymongo.ASCENDING)])
        except NoOptionError:
            logging.critical("Cannot connect to MongoDB database and collection. Config incorrect?")
            sys.exit()

        # Get the tweet collection
        tweet_col = db[config.get("mongo", "collection")]
        cursor = tweet_col.find({'geo': {'$ne': None}, 'locinf.sl.test': None})

        infersl = InferSL(config, user_col, verbose=True)

        for tweet in cursor:
            print("\n\nNEXT USER", tweet['user']['screen_name'], ":\n")
            if input(tweet) == "s":
                continue

            inf = infersl.infer(tweet['user']['id'], test=True)
            print("\nInferred location:", inf)
            input(">")

            # Store the inferred location in the database
            db.tweets.update_one({'_id': tweet['_id']}, {
                '$set': {
                    'locinf.sl.test': str(inf)
                }
            })

            if inf is None:
                continue
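Once some tweets have been processed, the stored results can be inspected directly in MongoDB. Below is a minimal sketch of such a check; the host, port and database name are assumptions matching the example config further down, and the `tweets` collection and `locinf.sl.test` field are the ones written to by the loop above.

.. code-block:: python

    from pymongo import MongoClient

    # Connect to the same database used by the script above
    # (host, port and database name are assumptions - adjust to your config)
    client = MongoClient("localhost", 27017)
    db = client["twitter"]

    # Count the tweets that already have an SLP test inference stored
    # in the 'locinf.sl.test' field written by the loop above
    labelled = db["tweets"].count_documents({'locinf.sl.test': {'$ne': None}})
    print("Tweets with a stored SLP inference:", labelled)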
MI Example
----------

Configuration
~~~~~~~~~~~~~

Below is a basic breakdown of the steps required to set up and use the MI implementation:

1. Obtain the `twied` package.
2. Install prerequisite modules (*pip* will likely install them for you).
3. Download the databases (see below for the link).
4. Set up the config file.
5. Create an inference script.
6. Start the inference.

Below is an example config file for use with the :class:`twied.multiind.inference.InferThread` class. Each of the sections is broken down below the example.

.. code-block:: ini

    [twitter]
    app_key = wYHFS6G9fqVNxYwt53UNUcxT0
    app_secret = MU3r4yi2HGDrAbBma2syPpOvFOcWFxaUIiKmeySX8Ard80lr53
    oauth_token = 3950426785-SNgK3NmghSzdjLcJGRTAwQq3xyMait0bVQ6HVvV
    oauth_token_secret = x0FASasjEHqsSvLAZ3h6sqClPWtt54TcM78W8PLOJ1BLv

    [mongo]
    address = localhost
    port = 27017
    database = twitter
    collection = tweets

    #### Settings for Multi-Indicator Approach ####
    [multiindicator]
    workers = 10
    gadm_polydb_path = D:/ds/polydb_2.db
    tld_csv = D:/ds/tlds.csv

    [mi_weights]
    TAG = 10
    COD = 2.72
    GN = 1.51
    GN_1 = 2.01
    GN_2 = 1.96
    GN_3 = 1.96
    SP = 0.67
    LBS = 4.26
    TZ = 0.56
    WS_1 = 1.07

    [geonames]
    url = api.geonames.org
    user = humph
    limit = 5
    fuzzy = 0.8

    [dbpedia]
    spotlight_url = spotlight.sztaki.hu
    spotlight_port = 2222
    spotlight_page = /rest/annotate

    [slinf]
    min_mentions = 3 # 4
    max_depth = 3
    req_locations = 1
    max_iterations = 4
    num_timelines = 2

**Fields:**

- **twitter** - This section contains the settings for the Twitter API. These settings are not directly used by the inference class, so they can be omitted.
- **mongo** - This section contains the connection information for MongoDB, including the address and port of the database server and the database and collection names to infer the tweets from.
- **multiindicator** - The *workers* value is the number of inference threads to run concurrently. The *gadm_polydb_path* is the location of the polygon database (see below) and the *tld_csv* string is the location of the TLD-to-country-name file (see below).
- **mi_weights** - This contains the weights of each of the indicators. The default values in this config are the values taken from the original paper.

  :`TAG`: weight of the geotag indicator
  :`COD`: weight of the coordinate indicator
  :`GN`: weight of the default geonames indicator
  :`GN_1`: weight of the geonames indicator when the string is split by '/'
  :`GN_2`: weight of the geonames indicator when the string is split by '-'
  :`GN_3`: weight of the geonames indicator when the backup message indicator is used
  :`SP`: weight of the message indicator
  :`LBS`: weight of the location based services indicator (not implemented)
  :`TZ`: weight of both timezone indicators
  :`WS_1`: weight of the TLD indicator

- **geonames** - This contains settings for connecting to the GeoNames API. *limit* is the maximum number of suggestions to return and *fuzzy* is the search fuzziness parameter.
- **dbpedia** - This contains settings for the URL of the DBpedia Spotlight interface.
- **slinf** - This contains settings for the spatial label propagation method; it can be omitted.

**Files:**

The MI approach uses two main extra databases, the *polygon database* and the *TLD* database. These are compiled from various sources. Precompiled database files can be downloaded here: https://drive.google.com/open?id=0B0xoZYJ_Tg1aYVhvNTRlRGRiLW8
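The inference classes read these values with Python's standard :mod:`configparser` module. The short sketch below shows how the fields above map onto code; the file name `settings.ini` is only an assumption, and the accessors shown are plain ``configparser`` calls rather than part of the `twied` API.

.. code-block:: python

    import configparser

    # Load the example config shown above (the file name is an assumption)
    config = configparser.ConfigParser()
    config.read("settings.ini")

    # MongoDB connection details from the [mongo] section
    address = config.get("mongo", "address")
    port = config.getint("mongo", "port")

    # Number of concurrent inference workers and the polygon database path
    workers = config.getint("multiindicator", "workers")
    polydb = config.get("multiindicator", "gadm_polydb_path")

    # Indicator weights from the [mi_weights] section, e.g. the geotag weight
    tag_weight = config.getfloat("mi_weights", "TAG")

    print(address, port, workers, polydb, tag_weight)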
Example Script
~~~~~~~~~~~~~~

Below is an example script for running the Multi-Indicator inference process. The script loads a configuration file, connects to MongoDB and then infers locations for the tweets within that collection. Exceptions are handled by sleeping for a while before retrying. The script will also tweet to the authenticated Twitter account every 5000 tweets.

.. code-block:: python

    import configparser
    import time

    from twython import Twython, TwythonError
    from pymongo import MongoClient
    from urllib3.exceptions import MaxRetryError

    from twied.multiind.inference import InferThread
    from twied.multiind.indicators.locfieldindicator import GeonamesException
    from twied.multiind.interfaces.webinterfaces import GeonamesDecodeException

    # Setup the configuration file
    config = configparser.ConfigParser()
    config.read("settings.ini")

    # Connect to the MongoDB (database twitter, collection tweets)
    client = MongoClient("localhost", 27017)
    col = client["twitter"]["tweets"]

    # Query used for selecting tweets, empty because the target is all tweets
    query = {}

    # Setup a Twython object to tweet an error message if there is a problem
    api_settings = config._sections['twitter']
    twitter = Twython(**api_settings)


    # Function for tweeting a message if there is an error
    def tweetstr(string):
        global twitter
        try:
            print("[!] Attempting to send tweet: {0}".format(string))
            twitter.update_status(status=string)
            print("[+] Tweet sent.")
        except Exception:
            return


    # Name of the inference task
    inf_name = "MyCol"
    # Name of the field to save the result to
    field = "inf"

    # Run the inference
    inf = InferThread(col, config, inf_id=inf_name, tweetfunc=tweetstr, tweetint=5000, proc_id=1)

    while True:
        print("[+] Starting inference...")
        try:
            inf.infer(query, field=field)
            print("[!] Inference finished successfully.")
            tweetstr("@Humpheh %s - finished successfully." % inf_name)
            break
        except MaxRetryError:
            print("[!] Got a MaxRetryError - sleeping for 2 mins...")
            time.sleep(2 * 60)  # sleep for 2 mins
        except GeonamesException:
            print("[!] Got a GeonamesException - sleeping for 10 mins...")
            time.sleep(10 * 60)  # sleep for 10 mins
        except GeonamesDecodeException:
            print("[!] Got a GeonamesDecodeException - sleeping for 2 mins...")
            time.sleep(2 * 60)  # sleep for 2 mins
        except TwythonError:
            break
        except Exception as e:
            print("[!] Exception caught")
            tweetstr("@Humpheh %s - exited due to a %s." % (inf_name, type(e).__name__))
            raise
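Once the inference has finished, the results are stored alongside the tweets in MongoDB under the field name given above. A small sketch of reading them back, assuming the same database and collection as the script and that results are written to the top-level `inf` field named there (the exact structure stored in that field depends on the :class:`InferThread` implementation):

.. code-block:: python

    from pymongo import MongoClient

    # Same connection details as the inference script above
    client = MongoClient("localhost", 27017)
    col = client["twitter"]["tweets"]

    # Fetch a few tweets that have an inference stored in the 'inf' field
    # (assumes results were saved under the field name used above)
    for tweet in col.find({'inf': {'$ne': None}}).limit(5):
        print(tweet['_id'], tweet['inf'])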
Collection Example
------------------

Below is an example file using the collection class to collect tweets which contain the word 'Twitter'. The script saves these tweets to MongoDB, in the database 'test' and the collection 'coltest1'. A :class:`CounterThread` is also created which outputs the number of tweets collected in the previous 5 seconds.

.. note:: The API settings here have been altered so they are not valid. These would need to be created for your own app via the Twitter API.

.. code-block:: python

    import time

    from twied.twicol import TweetStreamer, CounterThread
    from pymongo import MongoClient

    # Save the Twitter API settings
    api_settings = {
        'app_key': 'wYHFS6G9fqVNxYwt53UNUcxT0',
        'app_secret': 'MU3r4yi2HGDrAbBma2syPpOvFOcWFxaUIiKmeySX8Ard80lr53',
        'oauth_token': '3950426785-SNgK3NmghSzdjLcJGRTAwQq3xyMait0bVQ6HVvV',
        'oauth_token_secret': 'x0FASasjEHqsSvLAZ3h6sqClPWtt54TcM78W8PLOJ1BLv',
    }

    search_str = "twitter"

    # Connect to the MongoDB and the correct collection
    client = MongoClient()
    collection = client['test']['coltest1']

    # Setup the counter thread to output the status count every 5 secs
    counter = CounterThread(5, lambda count: print("Received %i tweets in last 5 seconds" % count))

    # Setup the tweet streamer to listen for tweets with 'twitter' in them
    ts = TweetStreamer("test", search_str, db=collection, callbacks=counter, **api_settings)

    # Start the threads
    ts.start()
    counter.start()

    try:
        # Wait
        while True:
            time.sleep(1000)
    except Exception:
        print("Exception caught...")
        pass
    finally:
        # If an exception occurred, close all threads
        print("Stopping collection.")
        counter.stop()
        ts.stop()

Event Detection Example
-----------------------

The event detection is an object which you set up and then feed tweets, in ascending time order, via a method. Below is a basic example which shows connecting to the database, creating a :class:`twied.eventec.eventdetection.EventDetection` object, and then feeding the tweets in. The final step is getting the clusters from the object and saving them to a file. For more information about how the :class:`EventDetection` class operates, or about the class parameters, see the class documentation.

.. code-block:: python

    import logging
    import pickle

    from pymongo import MongoClient

    from twied.eventec.eventdetection import EventDetection

    output_filename = "output.pkl"

    # Connect to the MongoDB
    client = MongoClient()

    # Select the database and collection
    db = client["twitter"]
    col = db["ptweets"]

    # Get the tweet cursor (sorted by timestamp - note this is slow if there is no index)
    cursor = col.find(no_cursor_timeout=True).sort('timestamp', 1)

    # Create the EventDetection object with the parameters
    tf = EventDetection('centre', 'timestamp', popmaploc=r'D:\ds\population\glds15ag.asc')

    # Process each tweet that has been found
    for doc in cursor:
        tf.process_tweet(doc)

    # Get the clusters and save them
    allc = tf.get_all_clusters()
    carr = [c.as_dict() for c in allc]

    # Dump the output to a pickle file to save it
    pkl_file = open(output_filename, 'wb')
    pickle.dump(carr, pkl_file)
    pkl_file.close()

    cursor.close()
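The saved clusters can then be loaded back from the pickle file for further analysis. A minimal sketch, assuming the `output.pkl` file produced by the script above; each entry is whatever the cluster's ``as_dict()`` method returned.

.. code-block:: python

    import pickle

    # Load the clusters that the event detection script dumped above
    with open("output.pkl", "rb") as pkl_file:
        clusters = pickle.load(pkl_file)

    # Each entry is the dictionary produced by a cluster's as_dict() method
    print("Loaded %i clusters" % len(clusters))
    for cluster in clusters[:5]:
        print(cluster)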