Monday, December 29, 2014

Scraping: The Bottom of the Barrel


Many of you who stop by regularly, here at A Family Tapestry, are also bloggers in your own right. You likely work hard to produce posts that will accurately represent your research—or share your latest discoveries in whatever topic you choose to discuss. Though you may not get paid for your efforts, you offer them in the sincere hope that your work will be of benefit to others.

In one way, as bloggers, we share and share alike.

That, however, is vastly different from what happens to those who freely offer their work on their own site, only to have it lifted, unbeknownst to them, and repackaged on another website, a place likely using that very content to make someone else a profit.

That little sleight of hand is called content scraping (or, in some cases, blog scraping) and it has occasionally become a topic of conversation amongst members of the genea-blogging community—usually in the form of outraged diatribes against such perpetrators by those personally wronged.

If you think you have never had that happen to you, dear blogger, think again. The mere effort of cutting and pasting the title, or an excerpt, from one of your recent blog posts into the search box at Google may reveal otherwise. I know it has done so for me.
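For those who would rather not paste excerpts by hand each time, here is a minimal sketch of how that check could be turned into a ready-made search link. It assumes that wrapping the excerpt in quote marks asks Google for an exact-phrase match, and that the usual `https://www.google.com/search?q=` address is used; the function name `exact_phrase_search_url` and the sample excerpt are hypothetical, made up for illustration.

```python
from urllib.parse import urlencode


def exact_phrase_search_url(excerpt: str) -> str:
    """Build a Google search URL that looks for an exact phrase.

    Hypothetical helper: the name and the sample text are illustrative only.
    """
    # Collapse line breaks and repeated spaces so a pasted excerpt
    # becomes one continuous phrase.
    phrase = " ".join(excerpt.split())
    # Remove embedded double quotes so they cannot end the quoted phrase early.
    phrase = phrase.replace('"', "")
    # Wrapping the phrase in quote marks requests an exact-phrase match;
    # urlencode takes care of escaping spaces and punctuation.
    return "https://www.google.com/search?" + urlencode({"q": f'"{phrase}"'})


if __name__ == "__main__":
    # Paste a distinctive sentence from one of your own posts here.
    print(exact_phrase_search_url("a distinctive sentence from one of my posts"))
```

Opening the printed link in a browser shows every indexed page that contains the sentence word for word. Any result that is not your own site is worth a closer look.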

Not only that, but there are tools available to such content scrapers to make their “job” even easier. When I googled the term to find relevant sites to support today’s post, the first item to come up was not an example or definition source, but an ad for software to facilitate content scraping. You see, you are not just up against a well-meaning but misguided, overzealous fellow researcher, but a worldwide variety of people who see no problem in stealing your hard work.

I’ve been blogging for less than four years, but during that time, I’ve also been an avid genealogy blog reader. And I recall several of my fellow bloggers reporting how they encountered that loss on a personal basis.

The instance that stands out most vividly in my mind is when it happened to GeneaBloggers originator Thomas MacEntee. Thomas blogs not only because of his fascination with genealogy but because, well, computer geeks can do this stuff blindfolded with one arm tied behind their backs. He took this loss as the serious threat to his business that it was, and put his considerable computing knowledge to work in fighting back.

If you don’t recall Thomas’ frustration, back in 2012, dealing with “sploggers” who were stealing his content, you might find it helpful to check out how he went about combating the problem. He also shared his resource page of links on how to do this, which he posted on Pinterest.

In a different episode, another blogger—Heather Kuhn Roelker of Leaves for Trees—had commented, “I work too hard on writing my blog for it just to be stolen.” Heather found Thomas’ advice helpful. I’m sure a number of others have, too.

Content scrapers do not only target genealogy bloggers, of course. So it is no surprise to find blogs which offer generic advice for all sorts of bloggers in this predicament, such as this one for WordPress bloggers. In fact, it was a recent announcement about a new anti-scraping plug-in for WP bloggers that got me re-thinking this very issue.

In the past, my thoughts had ranged everywhere from “Who would copy my stuff?” to “So what if they copy my stuff; I have enough internal links to lead readers back to my own site.” I pretty much still hold to that latter thought. However, because content scraping software likely doesn’t know how to differentiate between the rest of the post and a concluding sentence that essentially says, “Hey, if you didn’t find this post on my blog, come read it at my own site,” I’d like to start adding a sentence like that to the bottom of my posts. That way, when the scrapers scrape the rest of my content and lift it to their own site, they’ll also be lifting a sentence that tells readers where to go to get the rest of the story.
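For bloggers who prepare their posts as HTML before publishing, the idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `add_source_footer`, the `source-note` class, the wording of the note, and the example address are all hypothetical, and nothing here is tied to Blogger, WordPress, or any particular plug-in.

```python
from html import escape


def add_source_footer(post_html: str, post_url: str, blog_name: str) -> str:
    """Append a "read this at its original site" note to a post.

    Hypothetical helper: names, class, and wording are illustrative only.
    """
    # Escape the values so an ampersand or quote mark in a URL or blog name
    # cannot break the surrounding HTML.
    footer = (
        '<p class="source-note">This post was originally published at '
        f'<a href="{escape(post_url, quote=True)}">{escape(blog_name)}</a>. '
        "If you are reading it anywhere else, come read it at its original site.</p>"
    )
    # Leave the post alone if the note is already there, so running the
    # function twice does not stack up duplicate footers.
    if footer in post_html:
        return post_html
    return post_html.rstrip() + "\n" + footer


if __name__ == "__main__":
    draft = "<p>Today's research turned up a surprise.</p>"
    print(add_source_footer(draft, "https://example.com/todays-post", "My Blog"))
```

Because the note sits inside the post body itself, a scraper that copies the post wholesale would presumably carry the link along with it. A scraper that strips links or trims the ending would defeat it, so this is a low-effort deterrent rather than a guarantee.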

Of course, for you who are reading my posts here on my own blog site, it will seem redundant. But humor me. It can take me anywhere from ninety minutes to three or four hours to complete the research and writing for just one of my posts. I’m with Heather: I don’t want to do all that work for someone else’s online profit-making machine, either. However, I don’t want to add another several hours to that tally, just to fight my way through all the hoops necessary to get those people to cease and desist.

I just want to spend my time doing what I feel would be my best contribution: doing the research and writing for my own posts. For you—my regular readers who stop by here at A Family Tapestry to spend a moment every day, and perhaps share a few words of comment as well. It's for a readership community and to further our mutual research interests that I do this. I'd like to keep it that way. 

10 comments:

  1. Hmm -- what happened? Where is this going? What prompted this post?

    Replies
    1. Hmmm...lately, Wendy, you've been thinking of all the questions I should have asked myself before clicking the "publish" button...

      Nothing in particular has been happening--at least to my blog here--but there have been instances that trouble me. For instance, if I take a particular section of one of my posts, enclose it in quote marks and search for the exact phrase on Google, it will come up in the results...in two other sites. Neither of which has my permission to re-post my articles.

      Second, I can see from my analytics that, for some strange reason, I am deeply beloved by voracious readers in Ukraine--well, at least that's the latest country to pump my numbers up well over a hundred more than my daily norm. (Incidentally, if you are reading this in Ukrainian, you have my invitation to come read this post on its original site!)

      What prompted this post is that I'd like to give my readers--my real readers--a heads-up about some small statements that I'll be adding to my posts and pages in hopes of deflecting some of this bleed-off. To you who come to this site legitimately, the changes will (hopefully) be barely perceptible...but may seem silly at first, until you know the back story.

  2. Thanks Jacqui for the shout out and highlighting this issue - I've shared with the GeneaBloggers page over at Facebook. If there is a specific incident of scraping or copying that I can help with, please let me know!

    Replies
    1. Thomas, thanks so much! Your site is a wealth of information, so I'll start there. But thanks for the offer--and for sharing on your Facebook page. Much appreciated!

  3. I can't understand why anyone would do this sort of thing myself - but yeah, there is a slimeball for every spam, hack, virus, and malware out there... I just can't understand why they do what they do.... I wish I could give them a swift kick in the ass with steel toed boots.

    Replies
    1. Yeah, it's frustrating...but it is what it is, so we have to do whatever it takes. Fortunately, we have options :)

  4. I think it happens more than we realize. I was reading a blog one day and ran across a photo of mine that another blogger claimed was hers...only it had my watermark...
    What Iggy said!

    Replies
    1. Now, that takes the cake, Far Side! Incredible what gall people have! Wonder just what readers thought your watermark meant?!

  5. Excellent post! I've only been blogging for 6 months and I found this to be very interesting and informative. I will be reading Thomas's posts as well.

    Replies
    1. Dawn, thanks for stopping by! Yes, do look at Thomas' posts on this subject. Actually, he is a wealth of blogging information beyond just the link I've shared.

      Best wishes on your blogging endeavors, Dawn. You mentioned "only" blogging for six months, but I bet you've learned a lot in those six months!
