Today, after the 8-hour “Industrial Control Systems Security for IT Professionals” class, I wanted to make something pretty. And code. And work on a protocol problem. I’ve needed to look a little at the new Stimulus bill for work lately, so I thought I’d try to at least say I’d written Python today, dissect the text of the bill into parsable chunks, then throw it into some visualizations. I can’t easily capture the interesting avenues of analysis I was pursuing visually (and I don’t feel like writing it up), but I did manage to make some kind of pretty pictures. Hopefully someone feels inspired by them and goes down a similar path. (I already have some ideas for further stats I want to parse from the bill to be able to look at it more meaningfully. Perhaps I’ll do it this weekend; this was just the first cut at setting it up.)
First, I grabbed the full text of the bill from HERE. Then, I wrote some (stupidly) simple Python (again, I’m never sure if it’s -good- Python) to parse the bill and turn it into a new file with five columns: Word Number, Word Length, Line Number, Word Position in Line, and the actual Word itself. This essentially turned the bill into a text file with every word in the bill on its own line (in the order it showed up), but with machine-readable metadata I could use to visually represent it.
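stimulus = open('/Users/sintixerr/Documents/stimulus.txt', 'r')
finalfile = open('/Users/sintixerr/Documents/sdump.txt', 'w')
wordnum = 0
for linenum, line in enumerate(stimulus, 1):
    word = line.split()  # loop bodies reconstructed from the five columns described above
    for pos, w in enumerate(word, 1):
        wordnum += 1
        finalfile.write('%d\t%d\t%d\t%d\t%s\n' % (wordnum, len(w), linenum, pos, w))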
Then, I opened up the new tab-delimited bill in my visualizer of choice and ran it through a few different ways of representing the bill.
First, the raw text, without any real manipulation, looked cool in and of itself, and I noticed some interesting, if obvious in hindsight, features. (I did clean out some obviously bad data first with a little sed action, but that mostly just involved removing punctuation that caused the same words to show up as different ones.)
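For anyone who would rather stay in Python than reach for sed, the punctuation cleanup could look roughly like this. It is a sketch, not what I actually ran: the `normalize` helper is a made-up name, and stripping punctuation only from the ends of each word is an assumption about what “cleaning” should mean.

```python
import string

def normalize(word):
    """Strip punctuation from both ends of a word (hypothetical helper).

    Punctuation inside the word, such as the commas in a dollar
    figure, is left alone.
    """
    return word.strip(string.punctuation)

sample = ["Act,", "(Act)", "Act", "$1,000,000", "health."]
cleaned = [normalize(w) for w in sample]
print(cleaned)
```

With this, “Act,” and “(Act)” both collapse to the same word as plain “Act”, which is the whole point of the cleanup.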
To start with, if you look about a fourth of the way from the left, and then again closer to halfway, you see a vertical “break” in the scatterplot where the density looks much lower. That is probably a major section break in the original document (I honestly haven’t actually read it in English yet). That possibility is supported by the second observation, which is: even in human-written documents, you can still discern protocol visually. (Again, obvious, but it’s neat.) If you look at the bottom third of the image, it looks nothing like the top two-thirds: much more curving paths, fewer horizontal lines, less density, etc. If you look at those “words”, they’re all document structure words (like section numbers, headings, etc.) and monetary figures. If you look closely, there appear at first glance to be two or more incompatible or unrelated document content structures there. Above that section is where the more obvious “free form” English exists in the set.
Moving on from there, I wanted to see if I could get anything intellectually or aesthetically interesting by using a scatterplot to draw out the shape of the bill. To do that, I plotted “Line Number” on the X axis and “Position of Word in the Line” on the Y axis. (Actually, originally those two were swapped, but the resulting image “looked better” when I swapped the X and Y). I colored everything by Word on a categorical scale so things wouldn’t blend together too much and then ratcheted up the size scale to reduce empty space. I was looking for a visual representation of the literal structure of the document, not an analysis tool or I wouldn’t have done that last bit.
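The visualizer did the drawing for me, but the mapping itself is simple enough to sketch. Assuming rows shaped like the five columns of the parsed file, something like the following would turn each word into an (x, y, color) point. The `scatter_points` name and the idea of numbering colors in order of first appearance are my own illustration, not anything the tool exposes.

```python
def scatter_points(rows):
    """Map parsed rows to (x, y, color_index) points.

    rows: iterable of (word_number, word_length, line_number,
    position_in_line, word) tuples, matching the five columns.
    x is the line number, y is the word's position in the line,
    and each distinct word gets its own categorical color index.
    """
    color_of = {}
    points = []
    for _num, _length, line_number, position, word in rows:
        # First time we see a word, give it the next unused color index.
        color = color_of.setdefault(word, len(color_of))
        points.append((line_number, position, color))
    return points

rows = [
    (1, 3, 1, 1, "The"),
    (2, 4, 1, 2, "bill"),
    (3, 3, 2, 1, "The"),
]
points = scatter_points(rows)
print(points)
```

Both occurrences of “The” end up with the same color index, so repeated words show up as same-colored dots wherever they land in the document.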
The resulting image looks like this:
Finally, I was curious if I could do a little manual clustering work. I tried to narrow down the words in the data set to those that might have some intrinsic meaning in the context of the stimulus bill. This means I got rid of prepositions, repeated filler words, etc. I did this by knocking out every word under 4 letters and all of those over 17 chars (those over 17 were all artifacts of turning the bill into something parsable, not actual real words). Then I created a bar chart of words, sorted it by how often words appeared in the document, and removed about the bottom 70% of words. I made an assumption (which is almost definitely so broad that the data will have to be sliced again a different way for meaningful analysis) that any words that weren’t repeated that often just weren’t a real “theme” to the people writing the document. Interestingly, things like “security” and “health” and some others were left in the set, but “cyber” was removed. Hmm. :) After that, I went manually through the remaining set of words and removed those that seemed to not have any cluster value (both through intuition and by visually watching the scatterplot of the whole set while I highlighted individual words to see what lit up). Lastly, since I originally wanted to make visually interesting things more than do real analysis, I used some blurring, resharpening, and layering to give a more cloudy, vibrant feeling to it. Interestingly, that created “clouds” around many of the clusters and made them easier to make out for analysis. That supports my whole theory that what the eyes and mind like to look at is what the mind and eyes are better able to make intelligent use of.
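I did the length and frequency cuts by hand in the visualizer, but the same two steps can be roughed out in Python. This is only a sketch under a few assumptions: `frequent_words` is a hypothetical helper, “bottom 70%” is read as the bottom 70% of distinct words ranked by count, and ties are broken by whichever word appeared first.

```python
from collections import Counter

def frequent_words(words, min_len=4, max_len=17, keep_percent=30):
    """Drop very short/long words, then keep only the most frequent ones.

    keep_percent is the share of distinct words to keep, counted from
    the most frequent down (30 means dropping the bottom 70%).
    """
    kept = [w for w in words if min_len <= len(w) <= max_len]
    ranked = [w for w, _count in Counter(kept).most_common()]
    n_keep = max(1, len(ranked) * keep_percent // 100)
    return ranked[:n_keep]

# Toy sample; the counts are invented for illustration, not taken from the bill.
sample = (["health"] * 5 + ["security"] * 4 + ["funds"] * 3
          + ["appropriations"] * 2 + ["cyber"] + ["the"] * 9)
top = frequent_words(sample, keep_percent=40)
print(top)
```

In this toy sample “the” is the most common word but never makes the list because it is under 4 letters, and a word that shows up only once falls out at the frequency cut.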
The final result is here: