Archive for the 'Uncategorized' Category

Moved everything to my github page

Moved all blog posts to http://aespinosa.github.io/blog/.  I’ll be posting new content there from now on.

Manual berkshelf caching on chefspec

ChefSpec 3.2.0 recently introduced Berkshelf integration. However, this slows down the test suite, because ChefSpec sets up and tears down Berkshelf on every run.

Below is a snippet similar to the one in chefspec#242. In addition, it removes the vendored copy of the cookbook under development. Here is my rake task to do that build:

require "berkshelf"

task :build do
  # Keep the cookbook under development out of the vendored set
  File.open("chefignore", "a") { |f| f.puts "*aspnet_skeleton*" }
  berksfile = Berkshelf::Berksfile.from_file("Berksfile")
  berksfile.install path: "vendor/cookbooks"
  FileUtils.rmdir "vendor/cookbooks/aspnet_skeleton"
end

This approach has several advantages: [1] ChefSpec doesn’t need to rebuild the vendored directory every time rspec is invoked, [2] you can run the tests against the source code directory for a fast TDD feedback cycle, and [3] the build and test phases of your CI pipeline can be separated.

Here’s how ChefSpec consumes the vendored cookbooks and the cookbook under development simultaneously:

RSpec.configure do |config|
  config.cookbook_path = %w(vendor/cookbooks ../)
end
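
With this in place, vendoring happens once per build instead of once per test run, and a CI pipeline can call the two phases separately, e.g.:

 rake build
 rspec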

Mimic rspec’s “context” in minitest/spec

I like spec-style tests because you can describe the scenarios of a test in a more structured manner. However, I love the xUnit family’s assertion calls. Here’s a small helper that makes context a synonym for describe:

def context(*args, &block)
  describe(*args, &block)
end
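
Here’s a minimal, hypothetical example of the helper in use with minitest/spec, combining a nested context with a plain xUnit assertion (it assumes the helper above is loaded alongside minitest/autorun):

require "minitest/autorun"

describe Array do
  context "when empty" do
    it "has zero length" do
      assert_equal 0, [].length
    end
  end
end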

Creating fast spec coverage on legacy recipes

As techniques for testing Chef recipes are still evolving, most of us inherit large, untested cookbook codebases from eons ago (which translates to a few months at the Chef community’s pace). After watching the ChefConf 2013 livestreams, I decided to give chefspec [3] a try. However, most of the write-ups about chefspec cover pretty basic scenarios. Here, I will describe how I covered our legacy Chef repository with fast tests. For this example, I will write coverage for Opscode’s nginx recipe [1].

First, we begin by writing examples for all the resources that get created by default.

describe 'nginx::default' do
  it "loads the ohai plugin"
  it "starts the service"
end

Then, we create contexts for each test case in the recipe logic.

describe 'nginx::default' do
  it "loads the ohai plugin"

  it "builds from source when specified"

  context "install method is by package" do
    context "when the platform is redhat-based" do
      it "includes the yum::epel recipe if the source is epel"
      it "includes the nginx::repo recipe if the source is not epel"
    end
    it "installs the package"
    it "enables the service"
    it "includes common configurations"
  end

  it "starts the service"
end

Now we get a general idea of what tests to write from rspec’s documentation-format output:

 rspec --format documentation spec/default_spec.rb

nginx::default
  loads the ohai plugin (PENDING: Not yet implemented)
  builds from source when specified (PENDING: Not yet implemented)
  starts the service (PENDING: Not yet implemented)
  install method is by package
    installs the package (PENDING: Not yet implemented)
    enables the service (PENDING: Not yet implemented)
    includes common configurations (PENDING: Not yet implemented)
    when the platform is redhat-based
      includes the yum::epel recipe if the source is epel (PENDING: Not yet implemented)
      includes the nginx::repo recipe if the source is not epel (PENDING: Not yet implemented)

Pending:
  nginx::default loads the ohai plugin
    # Not yet implemented
    # ./spec/default_spec.rb:4
  nginx::default builds from source when specified
    # Not yet implemented
    # ./spec/default_spec.rb:6
  nginx::default starts the service
    # Not yet implemented
    # ./spec/default_spec.rb:18
  nginx::default install method is by package installs the package
    # Not yet implemented
    # ./spec/default_spec.rb:13
  nginx::default install method is by package enables the service
    # Not yet implemented
    # ./spec/default_spec.rb:14
  nginx::default install method is by package includes common configurations
    # Not yet implemented
    # ./spec/default_spec.rb:15
  nginx::default install method is by package when the platform is redhat-based includes the yum::epel recipe if the source is epel
    # Not yet implemented
    # ./spec/default_spec.rb:10
  nginx::default install method is by package when the platform is redhat-based includes the nginx::repo recipe if the source is not epel
    # Not yet implemented
    # ./spec/default_spec.rb:11

Finished in 0.00977 seconds
8 examples, 0 failures, 8 pending
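
From here, each pending example can be filled in one at a time. Below is a minimal sketch of two of them, assuming ChefSpec 3.x on an Ubuntu node and that both the package and the service resources are named nginx:

require 'chefspec'

describe 'nginx::default' do
  let(:chef_run) do
    ChefSpec::Runner.new(platform: 'ubuntu', version: '12.04')
                    .converge(described_recipe)
  end

  it "installs the package" do
    expect(chef_run).to install_package('nginx')
  end

  it "starts the service" do
    expect(chef_run).to start_service('nginx')
  end
end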

Progress can be found in [2], where I tested everything. Next, I will write about how I tested the nginx cookbook’s nginx_site definitions.

[1] http://community.opscode.com/cookbooks/nginx
[2] https://github.com/aespinosa/cookbook-nginx
[3] https://github.com/acrmp/chefspec
[4] https://www.destroyallsoftware.com/screencasts/catalog/untested-code-part-1-introduction

Disclaimer: I come from an xUnit-testing background, so I may be interchanging “test cases” with “examples” and other purist terminology. I may also need to proofread how my spec examples read.

Splitting bioinformatics FASTA files

I keep forgetting where my scripts are in my home directories. Below is my Ruby script to split a large FASTA [1] file into chunks of N sequences per file:

#!/usr/bin/env ruby
#
# Script: dumpseq.rb
# Description: Parses a BLAST FASTA file and dumps every N sequences to a
#              separate file.
# Usage: dumpseq.rb [fasta_file]

require 'fileutils'

SEQS_PER_FILE = 2

fasta_db = File.new(ARGV[0])

sno = 0 # sequence counter
d = 0   # output file counter

file = nil

until fasta_db.eof?
  # Read one record: everything up to and including the next "\n>", then
  # strip the delimiter that belongs to the following record.
  x = fasta_db.readline("\n>").sub(/>$/, "")
  if sno % SEQS_PER_FILE == 0
    file.close if file
    # Group 1000 files per directory
    dir = sprintf("D%04d000", d / 1000)
    FileUtils.mkdir_p dir
    # short filenames
    fname = sprintf "SEQ%07d.fasta", d
    d += 1
    file = File.new("#{dir}/#{fname}", "w")
  end
  file << x
  sno += 1
  # Push the consumed ">" back so the next record keeps its header marker
  fasta_db.ungetc ?> unless fasta_db.eof?
end

file.close if file

It’s pretty hackish-looking. But then I found out that BioRuby [2] has wrappers for parsing FASTA files.
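
For comparison, here is a rough sketch of the same split using BioRuby’s Bio::FlatFile reader. This is untested and assumes the bio gem is installed; the file layout mirrors my script above:

#!/usr/bin/env ruby
require 'bio'
require 'fileutils'

SEQS_PER_FILE = 2

file = nil
Bio::FlatFile.open(Bio::FastaFormat, ARGV[0]) do |ff|
  ff.each_with_index do |entry, sno|
    if sno % SEQS_PER_FILE == 0
      file.close if file
      d = sno / SEQS_PER_FILE
      dir = sprintf("D%04d000", d / 1000)
      FileUtils.mkdir_p dir
      file = File.new("#{dir}/#{sprintf('SEQ%07d.fasta', d)}", "w")
    end
    # entry.definition is the header line (without ">"); entry.seq the sequence
    file.puts ">#{entry.definition}"
    file.puts entry.seq
  end
end
file.close if file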

[1] http://en.wikipedia.org/wiki/Fasta
[2] http://www.bioruby.org

On science productivity

Grid computing infrastructures were made to support the execution of science applications at larger scales. One challenge today in running your science on these behemoth systems is the requirement of “griddification” or “supercomputerification”: you need to know how to make the best of your hardware or grid sites in order to orchestrate beautiful workflows and process your science. A lot of research has therefore gone into creating languages such as Swift to make life easier for domain scientists.

I was debugging a science application for the last several months to run on petascale (100×10^3++ processors) systems. The main goal of the domain scientist was to process hundreds of thousands of data sequences. I got too carried away in the debugging to make the application work and have only looked at 3,000 of the set. In other words, not much *real* work has been done.

Now, when debugging, I should always remember the scientists who took pains measuring this data, or who can’t get data at all (much like the analogy of “finish your food because there are millions of hungry children in developing countries”).

Unix timer utility

[Figure: Timer microbenchmark showing the performance of the C and Perl timer utilities timing "echo -n"]

The Unix time(1) command only gives a precision of 10 milliseconds by default. But being the kind of engineer who goes insane over precision, I made my own utility to measure differences in microseconds. My first timer utility was written in C, but I got stuck with the exec(3) family of functions, since you need to fork a child process for the parent process to do the timing. Hence I used Perl with the Time::HiRes library, which is a wrapper around <time.h> and <sys/time.h>. Later on, I found out that C itself has the system(3) function in <stdlib.h>.

Performance-wise, you can see that C invokes the program with a much faster runtime. But as the graph above shows, Perl produces much more consistent values, so its standard deviation is lower than C’s. When I tested both programs in my data-intensive computing experiments, I got better results with the Perl utility! Perhaps I forgot to replicate, in my C implementation, all the magic that Perl’s system function does.

Here is my Perl code:

#!/usr/bin/perl

use strict;
use warnings;
use Time::HiRes qw( tv_interval gettimeofday );

my $start = [gettimeofday];
system @ARGV;

my $elapsed = tv_interval($start);
print $elapsed, "\n";
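
Both utilities are invoked the same way. For example, assuming the Perl script is saved as a hypothetical hrtime.pl and made executable:

 ./hrtime.pl sleep 1

This prints an elapsed time slightly above 1 second, down to microseconds.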

Here is my C implementation:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <sys/time.h>


int main(int argc, char* argv[])
{
	struct timeval start, end, diff;
	size_t len = 0;
	int i;

	if (argc < 2)
		return 1;

	/* Allocate enough room for all arguments plus separating spaces */
	for (i = 1; i < argc; i++)
		len += strlen(argv[i]) + 1;
	char* command = malloc(len + 1);

	command[0] = '\0';
	strcat(command, argv[1]);
	for (i = 2; i < argc; i++)
	{
		strcat(command, " ");
		strcat(command, argv[i]);
	}

	/* Time only the command itself, not the string building */
	gettimeofday(&start, NULL);
	system(command);
	gettimeofday(&end, NULL);
	timersub(&end, &start, &diff);
	printf("%ld.%06ld\n", (long) diff.tv_sec, (long) diff.tv_usec);

	free(command);
	return 0;
}