So I’m in the process of learning how to use Qiime (pronounced ‘chime,’ don’t ask me why). Qiime is a powerful program that can do many, many things, but I’m primarily using it to sort 16S reads and assign them to taxonomic groups. If that sentence made very little sense to you, think of it like this: Qiime is sorting genetic information into ‘buckets’ based on which organism it thinks each gene sequence (specifically the 16S rRNA gene) belongs to. Qiime’s a really useful tool because, among other things, a) it can often sort better than humans could and b) it’s definitely faster than humans would be.
Qiime is run from the command line, which can be daunting at first but is a little bit fun as well, because playing with the command line always makes me feel powerful. 🙂 Unfortunately, running it from the command line can also be incredibly frustrating, because the smallest typo can result in an error.
The data set I’m using to familiarize myself with Qiime comes from seagrass samples, but I’m not looking at the seagrass DNA. I’m actually looking at the DNA of the microbes (specifically the bacteria) that live in and on the seagrass.
The first thing I did was install Qiime. This went pretty well, and because I have Linux on my computer I didn’t have to set up a virtual machine to run it for me. Then I followed the helpful tutorial on the Qiime website. Some of the steps in the tutorial aren’t applicable to my data because it’s the result of Illumina sequencing, not 454 sequencing, but the general steps are the same.
Overall things have gone pretty smoothly, except that everything takes a lot longer than I expected. The second script I ran took all night and a good portion of the day to finish:
pick_de_novo_otus.py -i split_library_output/seqs.fna -o otus
Now I’m in the process of cleaning up the output of that script. The script produces a table of OTUs (Operational Taxonomic Units; you can think of them as the buckets I mentioned earlier). The table contains a lot of interesting information about the bacterial communities on the plants, but at this point in the process it’s still too messy to analyze. Since I’m working with plant samples, I have to be careful about chloroplast data showing up along with the bacterial data. Both the seagrass chloroplasts and the bacteria carry the 16S gene, and because I’m not interested in the chloroplasts I have to filter them out before I begin to interpret the data. (If you want to know more about why this would happen, check out the very cool endosymbiotic theory.)
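To make the chloroplast filtering a bit more concrete, here is a minimal Python sketch of the idea. It is not the actual cleanup I ran. It assumes the OTU table has been exported as tab-delimited text with the taxonomy in the last column, and that chloroplast OTUs carry a Greengenes-style label containing "Chloroplast". The sample names, counts, and the function name drop_chloroplast_otus are all made up for illustration.

```python
import csv
import io

# Hypothetical toy OTU table: one row per OTU, one count column per sample,
# and the assigned taxonomy in the last column. All values are invented.
EXAMPLE_TABLE = (
    "#OTU ID\tleaf_1\troot_1\ttaxonomy\n"
    "otu_0\t12\t3\tk__Bacteria; p__Proteobacteria; c__Gammaproteobacteria\n"
    "otu_1\t250\t8\tk__Bacteria; p__Cyanobacteria; c__Chloroplast\n"
    "otu_2\t7\t41\tk__Bacteria; p__Bacteroidetes; c__Flavobacteriia\n"
)


def drop_chloroplast_otus(table_text, label="chloroplast"):
    """Split OTU rows into those to keep and those whose taxonomy mentions `label`."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    kept, removed = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        taxonomy = row[-1].lower()  # case-insensitive match on the taxonomy string
        if label in taxonomy:
            removed.append(row)
        else:
            kept.append(row)
    return header, kept, removed


header, kept, removed = drop_chloroplast_otus(EXAMPLE_TABLE)
print("kept:", [row[0] for row in kept])
print("removed:", [row[0] for row in removed])
```

Keeping the removed rows around, rather than silently discarding them, makes it easy to check how many reads were chloroplast before moving on. Qiime has its own script for this kind of taxonomy-based filtering (filter_taxa_from_otu_table.py), so a hand-rolled version like this is only useful for seeing what the step does.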
Once the cleaning process is done I’ll be able to visualize the data.