Posts Tagged 'science'

Behavior-driven development of research code

I’ve been reading a lot of papers on optimizing DAG-based workflows, plus supporting papers on scheduling, for my research. Most of the time, I have just characterized existing software on existing systems. This is my first time writing scheduler code from scratch. I started by programming by wishful thinking but got stumped on what a scheduler workflow actually looks like.

In parallel, I have been reading The Cucumber Book for use in a startup I’m bootstrapping with some friends. I thought that maybe it could help my graduate thought process as well. So I decided to drive my scientific-methodish thinking with a Gherkin file. Here it goes:

Feature: Pipeline workflow optimizations
  We hypothesize that optimizing a pipeline will have a better load balance
  than a data-aware scheduler.

  Scenario: Load balance comparison of DAG optimization and data-aware scheduler
    Given A pipeline workload with parameterized data
    """
    1->a->1.1->a.a->1.2
    2->b->2.1->b.a->2.2
    """
    When We obtain the load from a data-aware scheduler
    And We obtain the load from the DAG-optimized version
    Then DAG-optimized load is more balanced

Here I used the “Then” clause to describe my hypothesis and the “Given” and “When” clauses to describe the experiment that attempts to verify it. In my BDD thought process, based on the chapter “Working with Legacy Code”, if “Then” succeeds we accept the hypothesis; otherwise we reject it and write new Given and When steps.
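As a rough illustration of how the “Then” step could be made checkable, here is a minimal sketch in plain Ruby. The feature file does not define what “more balanced” means, so the metric (the busiest worker’s load divided by the mean load), the sample numbers, and the names `imbalance` and `more_balanced?` are all assumptions made up for this example.

```ruby
# Hypothetical helper: imbalance of a schedule, given the load on each worker.
# 1.0 means perfectly balanced; larger values mean the busiest worker
# carries more than its fair share.
def imbalance(loads)
  mean = loads.sum.to_f / loads.size
  loads.max / mean
end

# Hypothetical predicate for the "Then" step: true when the candidate
# schedule is strictly better balanced than the baseline.
def more_balanced?(candidate, baseline)
  imbalance(candidate) < imbalance(baseline)
end

# Made-up per-worker loads, not measurements.
data_aware    = [9, 3, 3, 1]
dag_optimized = [4, 4, 4, 4]

puts imbalance(data_aware)     # 2.25
puts imbalance(dag_optimized)  # 1.0
puts more_balanced?(dag_optimized, data_aware)
```

With a predicate like this, accepting or rejecting the hypothesis comes down to a single boolean, which is what a Cucumber “Then” step would need to assert.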

If you take a look at my Git commit history, you’ll see that I wrote a lower-level .feature file first, describing how a data-aware scheduler should work. After taking a short walk and reading my Gherkin features again, my ‘stakeholder hat’ started to kick in. Hopefully I’m Cuking from the outside correctly.

FASTA splitting with BioRuby

In reference to my previous post, here’s the splitter using BioRuby. Note that I also changed the outer loop to write one file per iteration instead of using some crazy rules for when to create a file.

#!/usr/bin/env ruby
#
# Script: dumpseq.rb [file] [N] [prefix]
# Description: Splits a FASTA file evenly across N files. Dumps the files in
#              the [prefix] directory.
require 'bio'
require 'fileutils'

include Bio


seqs =  FlatFile.open(ARGV[0])
ncpus = ARGV[1].to_i
prefix = ARGV[2]

# Remove this counting pass and hardwire n_seqs if you know the number of
# sequences in the file beforehand. Saves read time.
n_seqs = 0
seqs.each do |seq|
 n_seqs += 1
end
seqs.rewind

overflow = n_seqs % ncpus
split_size = n_seqs / ncpus

ncpus.times do |i|
  filename = sprintf "%s/D%07d/seq%07d.fasta", prefix, i, i
  FileUtils.mkdir_p File.dirname(filename)
  dump = File.new(filename, "w")
  split_size.times do |j|
    dump << seqs.next_entry.to_s
  end
  if i < overflow 
    dump << seqs.next_entry.to_s
  end
  dump.close
end
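The split arithmetic in the script can be checked without BioRuby or any files. The sketch below is only an illustration: `chunk_sizes` is a hypothetical helper name, not part of the script or of BioRuby.

```ruby
# Hypothetical helper: how many sequences each of n_files output files
# receives when n_seqs sequences are spread as evenly as possible.
def chunk_sizes(n_seqs, n_files)
  split_size, overflow = n_seqs.divmod(n_files)
  # The first `overflow` files take one extra sequence each.
  Array.new(n_files) { |i| i < overflow ? split_size + 1 : split_size }
end

p chunk_sizes(10, 3)  # [4, 3, 3]
p chunk_sizes(6, 3)   # [2, 2, 2]
p chunk_sizes(2, 4)   # [1, 1, 0, 0]
```

The sizes always add up to the number of sequences, and when there are fewer sequences than files the trailing files simply stay empty.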

Splitting bioinformatics FASTA files

I keep forgetting where my scripts are in my home directories. Below is my Ruby script to split a large FASTA [1] file into N sequences per file:

#!/usr/bin/env ruby
#
# Script: dumpseq.rb
# Description: Parses a BLAST FASTA file and dumps each sequence to a
#              file.
# Usage: dumpseq.rb [fasta_file]

require 'fileutils'


fasta_db  = File.new(ARGV[0])

sno = 0
d = 0

file = nil

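until fasta_db.eof?
  # Read one record, up to and including the ">" that starts the next one
  x = fasta_db.readline("\n>")
  # Strip that trailing ">" and push it back so the next read starts at a header
  fasta_db.ungetc(">") if x.sub!(/>\z/, "")
  if sno % 2 == 0 # 2 seqs per query
    file.close if file
    dir = sprintf("D%04d000", d / 1000)
    FileUtils.mkdir_p dir
    # short filenames
    fname = sprintf "SEQ%07d.fasta", d
    d += 1
    file = File.new("#{dir}/#{fname}", "w")
  end
  file << x
  sno += 1
end
# Close the last output file and the input
file.close if file
fasta_db.close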

It’s pretty hackish-looking. But then I found out that BioRuby [2] has wrappers for parsing FASTA files.
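For comparison, the same “N records per file” grouping can be sketched in plain Ruby without reading character by character. This is only an illustration on an in-memory string: `fasta_records` is a hypothetical helper, the sample data is made up, and the sketch prints what it would write instead of creating files.

```ruby
# Hypothetical helper: split FASTA text into records, one per ">" header.
def fasta_records(text)
  # The lookahead splits before each line-initial ">" without consuming it.
  text.split(/^(?=>)/).reject(&:empty?)
end

# Made-up sample data for illustration.
sample = ">s1\nACGT\n>s2\nGGCC\nTTAA\n>s3\nAT\n"

# Group records two per output chunk, as the script above does.
fasta_records(sample).each_slice(2).with_index do |group, i|
  printf "SEQ%07d.fasta gets %d record(s)\n", i, group.size
end
```

Unlike the streaming script, this loads the whole input into memory, so it only suits files that fit comfortably in RAM.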

[1] http://en.wikipedia.org/wiki/Fasta
[2] http://www.bioruby.org

On science productivity

Grid computing infrastructures were made to support the execution of science applications at larger scales. One challenge today in running your science on these behemoth systems is the requirement of “griddification” or “supercomputerification”. You need to know how to make the best of your hardware or grid sites in order to orchestrate beautiful workflows and process your science. So a lot of research has gone into creating languages such as Swift to make life easier for domain scientists.

I have been debugging a science application for the last several months so that it runs on petascale (100×10^3++ processors) systems. The main goal of the domain scientist was to process hundreds of thousands of data sequences. I got so carried away with debugging to make the application work that I have only looked at 3000 of the set. In other words, not much *real* work has been done.

Now, when debugging, I should always remember the scientists who took pains to measure this data, or who can’t get data at all. (Much like the “finish your food because there are millions of hungry children in developing countries” analogy.)

Great Chicago Book Sale

I quickly grabbed my bike after coming from a seminar class and arrived 10 minutes before closing time! Within a short span of time, and relying on my semi-rare buying impulsiveness, I got these two titles for 5 USD (buy-one-take-one):

W. T. Welford, Useful Optics (Chicago Lectures in Physics). University Of Chicago Press, October 1991.

Students and professionals alike have long felt the need of a modern source of practical advice on the use of optical tools in scientific research. Walter T. Welford’s _Useful Optics_ meets this need. Welford offers a succinct review of principles basic to the construction and use of optics in physics. His lucid explanations and clear illustrations will particularly help those whose interests lie in other areas but who nevertheless must understand enough about optics to create the experimental apparatus necessary to their research. Consistently emphasizing applications and practical points of design, Welford covers a host of topics: mirrors and prisms, optical materials, aberration, the limits of image formation and resolution, illumination for image-forming systems, laser beams, interference and interferometry, detectors and light sources, holography, and more. The final chapter deals with putting together an experimental optics system. Many areas of the physical sciences and engineering increasingly demand an appreciation of optics. Welford’s _Useful Optics_ will prove indispensable to any researcher trying to develop and use effective optical apparatus. Walter T. Welford (1916-1990) was professor of physics at Imperial College of Science, Technology and Medicine from 1951 until his death. He was a Fellow of the Royal Society and of the Optical Society of America.  Link to [Amazon.com]

T. P. Hughes, Human-Built World: How to Think about Technology and Culture (science * culture). University Of Chicago Press, May 2005.

To most people, technology has been reduced to computers, consumer goods, and military weapons; we speak of “technological progress” in terms of RAM and CD-ROMs and the flatness of our television screens. In Human-Built World, thankfully, Thomas Hughes restores to technology the conceptual richness and depth it deserves by chronicling the ideas about technology expressed by influential Western thinkers who not only understood its multifaceted character but who also explored its creative potential.

Hughes draws on an enormous range of literature, art, and architecture to explore what technology has brought to society and culture, and to explain how we might begin to develop an “ecotechnology” that works with, not against, ecological systems. From the “Creator” model of development of the sixteenth century to the “big science” of the 1940s and 1950s to the architecture of Frank Gehry, Hughes nimbly charts the myriad ways that technology has been woven into the social and cultural fabric of different eras and the promises and problems it has offered. Thomas Jefferson, for instance, optimistically hoped that technology could be combined with nature to create an Edenic environment; Lewis Mumford, two centuries later, warned of the increasing mechanization of American life.

Such divergent views, Hughes shows, have existed side by side, demonstrating the fundamental idea that “in its variety, technology is full of contradictions, laden with human folly, saved by occasional benign deeds, and rich with unintended consequences.” In Human-Built World, he offers the highly engaging history of these contradictions, follies, and consequences, a history that resurrects technology, rightfully, as more than gadgetry; it is in fact no less than an embodiment of human values. Link to [Amazon.com]

Even more information can be found on the UChicago Press site.

Happy 50th anniversary to DOST!

NSTW2008 banner

Every second week of July (7-11) we celebrate National Science and Technology Week. This year is DOST’s 50th anniversary, so it is expected to be a very grand celebration (also probably the reason for the cost-cutting in previous years :D). Too bad I won’t be able to go this year 😦

I looked at the DOST website to grab some teaser news, but the website does not seem to work. When I click on a news article, it goes back to the main page. Hey DOST web people, updates please! 🙂

Scientist Valentine’s day cards

I got this from my Make Magazine subscription. Ironic Sans made these cool scientist Valentine’s day cards. Check them out:


Now go out and spread that scientific love! 😀

Link to post: http://www.ironicsans.com/2008/02/idea_scientist_valentines.html