Twitter Streaming with Python and SQLite

Twitter Streaming with Python and SQLite

3/20/2016


There are probably about a million Twitter streaming posts out there on the web, but I decided to make my own! Why not. For the unfamiliar, streaming with Twitter is when you basically track all tweets being created, filter via keywords, then do something with the data. My goal was to make a very simple streaming program that takes in Tweets, parses out the useful information, and then stores that data into a database. I'm going to have this program running all the time on a Raspberry Pi because I just don't feel like running that constantly on my desktop, and I definitely don't want to pay to host it somewhere.

The first thing I had to decide on was what data I wanted from the tweets (Twitter's API gives you an insane amount of data for a single tweet), and for that, I had to decide what I was really wanting to do with this data. The data I'm collecting is on my city, Cincinnati. I love the city and I'm curious what people are tweeting about it. So for using this data, I have a few ideas planned. My first idea is some sort of sentiment analysis / text mining project. I thought it would be cool to come up with some sort of algorithm that will score each tweet as it comes in based on its "view" of the city. Low score as a negative view and a high score as a positive view. There will definitely be some challenges with that since people are going to be tweeting about all sorts of things (sports, jobs, events) and it may be difficult to apply something to all of these different kinds of tweets. Oh well. We'll see what I can come up with in a later post. Another thing I know I can do for sure is just some simple visualizations of this data. Things like what time people are tweeting, from where, common themes, etc.. I will be doing that once I collect enough data.

Once I knew what I wanted to do with the data, I had to decide what data to collect. For each Tweet, I decided to parse out the Tweet text (duh), the user name, the user's number of followers, their location, and the time of the tweet. All of those minus the number of followers are included in the streaming data, so they are easy to get. The number of followers isn't hard to get either, since we know the user name and can use the magical Python library Tweepy to get the number of followers. Tweepy is also how I set up the streaming service in Python. It's an awesome library and makes everything really easy to do inside of Python.

Creating the database

Now, onto the code! First thing's first, let's create the database. I decided to go with SQLite for the DB. It's really lightweight (won't take up much space on my Pi), simple to use, and integrates well with Python. Creating the database is very straightforward. You could use SQLite Browser if you're a fan of having interfaces, but I just wrote a quick script to build it instead. If you want to use the SQLite Browser, just copy and paste that CREATE statement below into a SQL script in that browser.


import sqlite3

conn = sqlite3.connect('twitter.db')
c = conn.cursor()
c.execute('''CREATE TABLE tweets
    (tweetText text,
    user text,
    followers integer,
    date text,
    location text)''')
conn.commit()
conn.close()
		

This is pretty much the standard way of creating a SQLite DB in Python. Go to Python's SQLite docs and this is the example that you will see. By looking at the code, specifically the third line, you would think you already would have needed to create 'twitter.db', however that isn't the case. SQLite will actually create that file when this is ran and put it into whatever directory your python file is in. If you wanted more data from the tweets, you can just add more columns in the Create SQL statement there. Note that I don't have a primary key here. If you don't provide a primary key, SQLite will automatically use a column called RowID, which is an auto-incrementing integer that's used as the primary key.

Twitter Authentication

Now, onto the actual streaming part. I'm going to break this up into chunks. I'll show the code, then expain if need be. I have the code commented pretty throughly, so hopefully its somewhat easy to follow along. First up, Twitter Authentication and some other boring stuff!


import tweepy
import json
import sqlite3

# Twitter authentication stuff
global api
access_token = "(Your access token)"
access_token_secret = "(Your access token secret)"
consumer_key = "(Your consumer key)"
consumer_secret = "(Your consumer secret)"
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# DB stuff
conn = sqlite3.connect('twitter.db')
c = conn.cursor()

If you're following along with this, you need to head over to Twitter's dev site and create a new project. Follow along with the steps they give you, and eventually you will be given an access token, token secret, consumer key, and consumer secret all of which will be needed here. The top will just authenticate you with Twitter, then on the end I'm just setting up a connection with the database we created in the previous part.

Creating a Tweet class

Moving on, I then created a class for a Tweet. I don't know if this was really necessary, but I need practice doing object-oriented stuff, so I figured why not. Here's the code for the Tweet class.


# Class for defining a Tweet
class Tweet():

    # Data on the tweet
    def __init__(self, text, user, followers, date, location):
        self.text = text
        self.user = user
        self.followers = followers
        self.date = date
        self.location = location

    # Inserting that data into the DB
    def insertTweet(self):

        c.execute("INSERT INTO tweets (tweetText, user, followers, date, location) VALUES (?, ?, ?, ?, ?)",
            (self.text, self.user, self.followers, self.date, self.location))
        conn.commit()
    

The __init__ function is just setting up all of the attributes that go along with a tweet. If you were planning on adding more data, you would put your extra variable names here. Then the insertTweet function is what inserts the data into the database once it has been collected. Not too much else to say here.

Creating the Stream class

Next up is the the Stream class. A lot of what's written here is straight from the Tweepy docs. I just added in some things that I needed to get my desired results.


# Stream Listener class
class TweetStreamListener(tweepy.StreamListener):

    # When data is received
    def on_data(self, data):

        # Error handling because teachers say to do this
        try:

            # Make it JSON
            tweet = json.loads(data)

            # filter out retweets
            if not tweet['retweeted'] and 'RT @' not in tweet['text']:

                # Get user via Tweepy so we can get their number of followers
                user_profile = api.get_user(tweet['user']['screen_name'])

                # assign all data to Tweet object
                tweet_data = Tweet(
                    str(tweet['text'].encode('utf-8')),
                    tweet['user']['screen_name'],
                    user_profile.followers_count,
                    tweet['created_at'],
                    tweet['user']['location'])

                # Insert that data into the DB
                tweet_data.insertTweet()
                print("success")

        # Let me know if something bad happens
        except Exception as e:
            print(e)
            pass

        return True
   

There's a little bit more going on in this class, so I'll try to explain to the best of my abilities. Once a tweet comes in, the first thing I do is load it into a Python object, using json.loads(). This just deserializes the tweet JSON data to a Python object. Just as a note, the way to get to the attributes of a tweet in json, is indexing your tweet object by the json key. You can see I do this a bunch in this class, as it's what allows me to get to the data.

Next, I then have an if statement to filter out retweets because I didn't want to have a bunch of the same tweet in here. For example, last night Rihanna played a show in Cincy. When I was running this program without the retweet filter this morning, almost every result I was getting was a RT of Rihanna which had something about Cincinnati in it because of her show. I don't want all that redundancy, so I'm throwing the retweets out. Next, I have to use Tweepy again to get the user's profile. Since the tweet data itself doesn't contain the user's number of followers, I have to get that information a different way. Finally, I assigned all of that data using my Tweet class, then called the insertTweet function to commit the data to the database.

Final Touches

Finally we just need to add the main statement that's going to run the whole program. If you're familiar with Python, this is nothing new.


# Driver
if __name__ == '__main__':

    # Run the stream!
    l = TweetStreamListener()
    stream = tweepy.Stream(auth, l)

    # Filter the stream for these keywords. Add whatever you want here!
    stream.filter(track=['Cincinnati', 'Cincy'])
    

The bottom line there is where you would change your keywords to find whatever tweets you're looking for.

And that's it! I hope this was at least somewhat helpful if you're trying to implement Twitter streaming. I'll hopefully have another post sometime soon where I do something with all of this data I'm collecting. If you have any comments or corrections for my code, feel free to post in the comments below.

You can find all the code consolidated on my GitHub repo for this project.


Check out more posts in the Python category!



Tweet Follow @DavidRSchuler