Christina's LIS Rant
Wednesday, January 07, 2009
  A Structural Exploration of the Science Blogosphere: Director's Cut
Due to popular demand (well, 3 requests :) ), this is a commentary on, and additional information for, my conference paper and presentation:
Pikas, C. K. (2008). Detecting Communities in Science Blogs. Paper presented at eScience '08. IEEE Fourth International Conference on eScience, 2008. Indianapolis. 95-102. doi:10.1109/eScience.2008.30 (available in IEEE Xplore to institutional subscribers or e-mail me if you don't have access that way).

The presentation is embedded in another blog post, and is available online at SlideShare. The video of me talking about it is (will be?) available on the conference site, but I haven't gotten it to load.

I'm interested in scholarly communication in science, engineering, and math. Specifically, I'm interested in informal scholarly communication and in how information and communication technologies, in particular social computing technologies, can/do/might affect it. I'm also interested in knowledge production and public communication of science, two sub-areas of STS (this acronym has several expansions - the most common is probably science and technology studies).

As a blogger, a 2-time (soon to be 3) attendee of what was the NC Science Blogging Conference, and a reader of science blogs, I became curious about how and why scientists use blogs and whether their use is: a) similar to how non-scientists use blogs, b) for informal scholarly communication (to other scientists about their work), c) for public communication of science, d) for personal information management, or e) maybe for team collaboration(?)... The first way I looked at this was a study using content analysis and interviews of chemists and physicists (this has not been published yet, but maybe someday; these things aren't as perishable as writings in other fields, I hope). The second study swings all the way to a structural analysis of the science blogosphere - and that's what was reported here.

In social network analysis (SNA), you look at the link structure, not the attributes of the actors or nodes. The idea is that links show evidence of potential information flows or influence. You can pick out prestigious or central actors, and groups which are more tightly connected to each other than to the rest of the network.
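To make "central actors" a bit more concrete, here is a minimal sketch of one common measure, in-degree centrality, on a made-up blogroll network. The blog names and links are hypothetical and are not from the study data; the paper's actual measures were computed in SNA software, not with this code.

```python
from collections import defaultdict

# Hypothetical blogroll links: (blog with the blogroll, blog it lists).
edges = [
    ("astro_a", "hub"),
    ("astro_b", "hub"),
    ("chem_a", "hub"),
    ("astro_a", "astro_b"),
    ("astro_b", "astro_a"),
    ("hub", "chem_a"),
]

def in_degree_centrality(edges):
    """Fraction of the other blogs that link to each blog."""
    nodes = {node for edge in edges for node in edge}
    counts = defaultdict(int)
    # set() drops duplicate links; self-links are ignored.
    for source, target in set(edges):
        if source != target:
            counts[target] += 1
    denom = max(len(nodes) - 1, 1)
    return {node: counts[node] / denom for node in nodes}

centrality = in_degree_centrality(edges)
# Most-linked-to blogs first, ties broken alphabetically.
ranked = sorted(centrality, key=lambda node: (-centrality[node], node))
```

Only the links are used here; nothing about what the blogs say or who writes them enters the calculation, which is the point of the structural approach.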

The first major problem was locating science blogs - and even drawing any sort of boundary as to what a science blog was or wasn't. Given that I'm interested in how these things contribute to science, I drew the line thusly:
Blogs maintained by scientists that deal with any aspect of being a scientist
Blogs about scientific topics by non-scientists

Omitted were:
Primarily political speech
Ones maintained by corporations
Non-English language
(you could definitely draw the line somewhere else, but this is what I did!)

Also, given that I'm a great searcher but almost not a coder at all, I did this by search, snowball, and any hook or crook to get as big a set as possible. I went to each of these blogs and copied the URLs from the blogrolls (to answer a question from a Scibling - if you had a rotating list that showed up in JavaScript in the page source, I probably got it; if you have a second page with a list of 300 blogs (cough - Bora - cough), I probably got it; likewise if the list was generated by something like Google Reader)... so this was incredibly tedious, and I probably missed a few, but it's probably pretty accurate. So that was the first network.

The second network - and I originally had a much grander scheme - took the "most interesting" (most central by common measures) blogs from the first network, and then used Perl scripts (core script developed by Jen Golbeck, which I then customized to work for non-WordPress blogs and for blogs where people changed their templates a lot - you all really could have made this easier, lol) to pull all of the commenter links off of the last 10 posts (this was done around April).
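For anyone curious what pulling commenter links looks like, here is a rough sketch in Python (not the Perl scripts actually used). It assumes each signed comment wraps the commenter's link in an element with the class `comment-author`; that class name, the sample HTML, and the host names are all hypothetical, and real templates differ a lot, which is exactly why the customizing was needed.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class CommenterLinkParser(HTMLParser):
    """Collect hrefs of <a> tags on or inside comment-author elements."""

    def __init__(self, author_class="comment-author"):
        super().__init__()
        self.author_class = author_class  # assumed template class name
        self.depth = 0                    # > 0 while inside an author element
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.depth == 0 and self.author_class not in classes:
            return  # outside any comment-author element
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

def commenter_edges(blog_host, html):
    """Return sorted (commenter host, blog host) pairs for one page."""
    parser = CommenterLinkParser()
    parser.feed(html)
    hosts = {urlparse(link).netloc.lower() for link in parser.links}
    hosts.discard("")         # relative or malformed links
    hosts.discard(blog_host)  # the blogger replying on their own blog
    return sorted((host, blog_host) for host in hosts)

sample_html = """
<div class="post"><a href="http://elsewhere.example/in-post">in-post link</a></div>
<div class="comment">
  <span class="comment-author"><a href="http://alice.example/blog">Alice</a></span>
  <p>Nice post, see <a href="http://ignored.example/">this</a>.</p>
</div>
<div class="comment">
  <span class="comment-author vcard"><a href="http://bob.example/">Bob</a></span>
</div>
"""
edges = commenter_edges("myblog.example", sample_html)
```

Links in the post body and inside the comment text are deliberately skipped, so only the signature on the comment counts as a tie.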

Blogs have links between them a) in the content b) in the blogroll c) in signed comments... other studies have used basically any link on the page, but the fact is that it's not really saying much to link within a post (a little link love, but not a real endorsement). Blogrolls are some sort of endorsement, typically, and signing a comment means *something*.

So then I ran all the typical SNA things across it to look at central actors and to find cohesive subgroups. As far as centrality goes - no real surprises. As far as cohesive subgroups - a bit more tricky. Basically one large component - and not terribly clumpy, with the exception of the astro bloggers - they're pretty tight. Most of the community detection techniques use a binary split - or start with binary splits - and none of these were at all effective in dividing up the hairball. Spin glass, OTOH, worked beautifully to return 7 clusters. So then I went back and looked at the blogs and figured out the commonality for each of the clusters (yes, I could have used some NLP to extract terms and automatically label the clusters, but there were 7 so...).
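One common way to put a number on how well a grouping separates a network is modularity, which compares the share of links falling inside each cluster with what you'd expect from the nodes' degrees alone. The sketch below is only an illustration of that score on a toy undirected graph (two triangles joined by one link, with made-up node names); it is not the spin glass method and not the analysis from the paper.

```python
from collections import defaultdict

def modularity(edges, membership):
    """Newman modularity Q for a partition of an undirected, unweighted graph."""
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # links with both ends in the same cluster
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    cluster_degree = defaultdict(int)  # summed degree of each cluster's nodes
    for node, d in degree.items():
        cluster_degree[membership[node]] += d
    return sum(
        internal[c] / m - (cluster_degree[c] / (2 * m)) ** 2
        for c in cluster_degree
    )

# Toy graph: two triangles (a, b, c) and (d, e, f) joined by the link c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

by_triangle = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_blob = {node: 0 for node in by_triangle}

q_split = modularity(edges, by_triangle)  # clusters follow the triangles
q_blob = modularity(edges, one_blob)      # everything in a single cluster
```

Splitting along the triangles scores higher than lumping everything together; a hairball is the case where no split scores much better than the single blob.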

The single component isn't too surprising because we know from diffusion of innovations for ICTs that we would expect people to pick this up from other people and then probably link back. The power law degree distribution is also very typical when you're talking about the activities of people (whether Lotka, Zipf, Pareto, Bradford.... whatever law). The clusters were related to subject areas - very broad subject areas. One question in my mind was how far people would go outside of their home discipline in their reading/commenting... based on this network, certainly outside of their particular speciality, but still in the neighborhood, with the exception of a few "a-list" science bloggers who everyone reads.
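As a small illustration of what a degree distribution is, the sketch below counts how many nodes have each degree in a toy undirected network with one heavily linked hub. The edge list is invented for the example; it is far too small to test for a power law, it just shows the skewed shape where most nodes have few links and a few have many.

```python
from collections import Counter

def degree_distribution(edges):
    """Map each degree value to the number of nodes with that degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(sorted(Counter(degree.values()).items()))

# Hypothetical network: one hub linked to six blogs, plus one extra link.
edges = [("hub", f"blog{i}") for i in range(6)] + [("blog0", "blog1")]

dist = degree_distribution(edges)
for k, count in dist.items():
    print(f"degree {k}: {count} node(s)")
```

With real data, plotting these counts on log-log axes and seeing a roughly straight line is the usual first hint of a power law, though that alone doesn't prove one.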

What was interesting - and most definitely worthy of further investigation - is this cluster of blogs written mostly by women, discussing the scientific life, etc. The degree distribution was much closer to uniform within the cluster, and there were many comment links between all of the nodes. This, to me, indicates other uses for the blogs and perhaps a real community (or Blanchard's virtual settlement).

Also, I picked out the troll very easily using the commenter network - so this method could be used to automate troll identification. (In the first study I talked about this guy with a physicist, and the physicist basically only reins the troll in when he's so out of bounds as to be gross... so ID-ing a troll doesn't necessarily mean banning.)

I'm quickly running out of steam in this blog post - but this might end up being a pilot for my dissertation, so I'm definitely more than happy to talk about it either in the comments here, or on slideshare, or on friendfeed... or twitter or... just look for cpikas :)


Thanks for this - it's fascinating.

Have you checked Nature Blogs to see if they have any blogs you missed? Actually, I've got a couple of suggestions of ways to estimate the number of science blogs, which I haven't followed up. Mark recapture ideas seem the best.

How much clustering is there by platform? I'm thinking mainly of ScienceBlogs and Nature Network. Are they recognisable in the graph?
My data only include about 3 Nature blogs because this was a while ago, when Nature Network was just starting up. One of the first things I did when I loaded the blogroll network into the various analysis programs was to look at the Sciblings - definitely higher degree on average, but I didn't check to see if they linked to each other more than to other blogs.
This is great fun to read! Very nice.

Have you thought about doing a follow-up study? Or publish it somewhere? (As a "communication interest" I think more people might be interested in "how technology can interact" and it points to a new "social network", doesn't it?!?!)
I'm sorry. I missed the first paragraph. Silly me :) It's already presented....

I hope you got interesting comments?!
@Chall - did get some interesting comments and yes, definitely want to link it to other levels of analysis and social theory... this is a pilot of a pilot study :)

This is my blog on library and information science. I'm into Sci/Tech libraries, special libraries, personal information management, sci/tech scholarly comms.... My name is Christina Pikas and I'm a librarian in a physics, astronomy, math, computer science, and engineering library. I'm also a doctoral student at Maryland. Any opinions expressed here are strictly my own and do not necessarily reflect those of my employer or CLIS. You may reach me via e-mail at cpikas {at} gmail {dot} com.

Christina's LIS Rant by Christina K. Pikas is licensed under a Creative Commons Attribution 3.0 United States License.
