How do we know what the start point of the sequence is? Is it where the START codon (AUG) is?
Is the difference between H1 and H3 numbering dependent on whether the start codon is included in the numbering?
I don't really know. I have always assumed so, as it seems the logical place to start. Nearly all my sequence aligning was done years ago on H5N1, where I just got to know the sequences around the areas that were of interest to me (mainly binding and cleavage sites). I have only just begun to look at sequences again over the last week or so, and H1 is new to me, as are the sites and online tools.
I have just been looking for a downloadable BioEdit, but it seems to be no longer available. However, it is so basic that it may run from the executable without needing to be installed, so if anyone is interested I can see if it will work like that and upload a copy if it does. It is a lot more basic than most of the current stuff. The CLC (free version) only seems to work with multiple aligned nucleotides; when I try to convert to proteins it splits the alignment into individual sequences, which you can no longer see side by side.
Bioedit screen captures before and after Ctrl+G
[ATTACH]5225[/ATTACH]
[ATTACH]5226[/ATTACH]
The nucleotide sequences start with ~50 nucleotides
which are not translated into amino acids.
Often only part of these 50 is given, or none at all.
The first occurrence of "ATG" is usually the first translated amino acid
(methionine, Met, M).
also niman-H274Y is H275Y in N1
and D225G is D239G in H1
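A minimal sketch of the "first ATG" rule described above, in Python. The fragment in `example` is made up for illustration (8 leader nucleotides followed by ATG), and the function names are my own. Note the rule is only "usually" right, so this is a starting guess, not a guarantee of the true start codon.

```python
def first_atg(nucleotides: str) -> int:
    """Return the 0-based index of the first ATG, or -1 if there is none."""
    return nucleotides.upper().find("ATG")


def coding_part(nucleotides: str) -> str:
    """Drop the untranslated leader so that position 1 is the A of the first ATG."""
    start = first_atg(nucleotides)
    if start < 0:
        raise ValueError("no ATG found")
    return nucleotides[start:]


# Hypothetical fragment: 8 leader nucleotides, then the start codon.
example = "agcaaaagATGAAGGCA"
start = first_atg(example)
print(start + 1)                 # 1-based position of the A in ATG
print(coding_part(example)[:3])  # the start codon itself
```

If a submitted sequence has part of the leader trimmed, `first_atg` simply returns a smaller index, which is why the raw positions of two sequences can differ while their codon numbering agrees.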
Notice what Gs says here about the starting place: the first occurrence of "ATG" = "M".
So on the MUSCLE alignment example, you will notice their starting positions may vary but consensus starts with the first "atg". See my remarks about counting:
OK! The example I'm showing has the first "atg" for all three starting at position 9.
"We see the mutation at position 831 instead of 822;" so if I subtract 8, I'm at position 823... so I'm still 1 off?
I'm not sure how the counting will work out if we just look at the sequence itself instead of using an alignment program.
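The counting above can be checked with a few lines of Python. This is only a sketch with hypothetical function names: the first helper repeats the subtraction from the post, and the second counts real nucleotides in one aligned row, assuming gaps are written as "-". Gap columns are one possible reason an alignment column and a sequence position disagree, but I can't say that is what happened here.

```python
def position_from_start(column: int, atg_column: int) -> int:
    """Convert a 1-based column to a 1-based position counted from the A of ATG."""
    return column - atg_column + 1


def ungapped_position(aligned_row: str, column: int) -> int:
    """Count the real nucleotides (no '-') in a row up to and including a 1-based column."""
    return sum(1 for ch in aligned_row[:column] if ch != "-")


print(position_from_start(831, 9))       # same as 831 - 8
print(ungapped_position("ac--gtat", 6))  # toy row with two gap columns
```

The first call gives 823, the same number as in the post, so the offset of the first "atg" alone does not explain 822.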
The salvage of human life ought to be placed above barter and exchange ~ Louis Harris, 1918
Thanks for posting that; it's exactly what we needed.
Just to clarify,
I think Gs's example is of all the segments of one virus. We won't be comparing in that same manner; we compare like segments to like: segment 4 with segment 4, etc...
But this is how the sequences may look when aligned in MUSCLE. Notice how they all align at the ":" but some have fewer letters on the left hand side. MUSCLE begins searching for the mutations at the (":") consensus point.
Why do we start at the 2nd "atg" in segment 5?
Why would there be an R (2 of them) on the A/California/07/2009(H1N1) at about positions 715 and 718 ?
I found 3 California/07 sequences at Genbank. 2 have no changes and 1 has mixed signals at amino acid positions 225 (715-717 nucleotide) and 226 (718-720 nucleotide). "R" indicates "A"s and "G"s were found in that position.
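To connect the nucleotide positions to the amino acid numbers, here is a small sketch. It assumes the nucleotide positions are counted from the A of the start codon (position 1). The offset of 14 is only what the two numbers quoted earlier in this thread imply for this site (D225G in one numbering is D239G in H1); I would not assume it holds along the whole protein.

```python
from typing import Tuple


def codon_number(nt_position: int) -> int:
    """1-based codon that contains a 1-based nucleotide position."""
    return (nt_position - 1) // 3 + 1


def codon_span(codon: int) -> Tuple[int, int]:
    """First and last 1-based nucleotide position of a 1-based codon."""
    first = 3 * (codon - 1) + 1
    return first, first + 2


# Assumption taken from this thread only: 239 (counted from Met) = 225 at this site.
OFFSET_AT_THIS_SITE = 239 - 225

for nt in (715, 718):
    codon = codon_number(nt)
    print(nt, codon, codon - OFFSET_AT_THIS_SITE)
```

So nucleotides 715-717 are codon 239 counted from the methionine, which lines up with 225 after subtracting the offset, and 718-720 are codon 240, which lines up with 226.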
Here's a list from gs of what the letters in the nucleotide portion mean:
These are the nucleotides:
"A" = "Adenine"
"C" = "Cytosine"
"G" = "Guanine"
"T" = "Thymine"
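The "R" mentioned above is one of the standard IUPAC ambiguity letters. A short Python sketch that expands such letters into the plain sequences they could stand for is below. The codon "GRT" is a made-up example, not taken from any of the California/07 records.

```python
from itertools import product

# Standard IUPAC nucleotide codes: each letter maps to the bases it may represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def expand(seq: str) -> list:
    """List every unambiguous sequence that an IUPAC-coded sequence could mean."""
    choices = [IUPAC[ch] for ch in seq.upper()]
    return ["".join(p) for p in product(*choices)]


# Hypothetical codon with a mixed signal in the middle position.
print(expand("GRT"))  # GAT codes for D (aspartic acid), GGT for G (glycine)
```

A mixed A/G signal in the middle of a codon like this is how a single "R" can mean two different amino acids at one position.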
Make a database of a few hundred index consensus nucleotide sequences.
Compute all their length-12 subsequences and store the sequence numbers and
positions in the 4^12 database of 12-subsequences.
Whenever you get a new sequence, look up all its 12-subsequences in the database,
find the index sequence with the most matches, and choose its alignment.
If no index sequence matches well enough, add the new sequence to the index database.
This would be easier to use, easier to program, and faster and better than the
existing alignment programs that I'm aware of.
It _should_ exist already.
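The method above could be sketched roughly like this in Python. This is not the author's program: the class and method names are hypothetical, a dictionary stands in for the 4^12 table, and I let each shared subsequence vote for a (sequence, shift) pair, which is one way to "choose the alignment" and an assumption on my part.

```python
from collections import Counter, defaultdict

K = 12  # subsequence length proposed in the post


def kmers(seq, k=K):
    """Yield (position, subsequence) for every length-k window of seq."""
    seq = seq.upper()
    for pos in range(len(seq) - k + 1):
        yield pos, seq[pos:pos + k]


class KmerIndex:
    def __init__(self, k=K):
        self.k = k
        self.names = []
        self.table = defaultdict(list)  # subsequence -> [(sequence number, position)]

    def add(self, name, seq):
        """Store every subsequence of an index sequence with its number and position."""
        number = len(self.names)
        self.names.append(name)
        for pos, word in kmers(seq, self.k):
            self.table[word].append((number, pos))
        return number

    def best_match(self, seq):
        """Return (name, shift, hits) for the best index sequence, or None."""
        votes = Counter()
        for pos, word in kmers(seq, self.k):
            for number, index_pos in self.table.get(word, ()):
                votes[(number, index_pos - pos)] += 1
        if not votes:
            return None
        (number, shift), hits = votes.most_common(1)[0]
        return self.names[number], shift, hits

    def place(self, name, seq, min_hits=1):
        """Align to the best index sequence, or add seq to the index if none is good enough."""
        match = self.best_match(seq)
        if match is None or match[2] < min_hits:
            self.add(name, seq)
            return None
        return match


# Toy demonstration with k=4 so the sequences can stay short.
index = KmerIndex(k=4)
index.add("ref", "ACGTTGCAAGGCTA")
print(index.place("query", "TTGCAAGG"))
```

The returned shift says where the new sequence starts relative to the index sequence. The `min_hits` threshold is a placeholder for "matches well enough", and a real value would have to be tuned.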
-----------------------------------------------------------
With the author's help I have now installed MAFFT as a Windows XP executable.
It works well and is quite fast. But I can't run it from a batch file, since it doesn't return
to the command line (command.com ?) when finished. Presumably just a small
programming error.
-------------------------------------------------------------------
For big databases (>~20MB) of similar sequences in my format (one header line,
one line with nucleotides or proteins, ascii13+10 = EOL) I use my own
program align.c, which doesn't produce gaps but just finds the best match by shifting.
This is quite fast, e.g. 1 minute for 7000 avian PB2s, 16MB. But some (~1%)
of the sequences are badly aligned because of insertions or deletions (which are
usually sequencing errors). Those sequences can be filtered out and
aligned separately with MAFFT or MUSCLE, which is then much faster
because of the reduced size.
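A rough Python sketch of gapless alignment by shifting, as described above. This is not align.c, which I have not seen. The function names, the shift window and the 0.9 identity threshold for "badly aligned" are all assumptions for illustration.

```python
def matches_at_shift(ref, seq, shift):
    """Count identical positions when seq[i] is laid over ref[i + shift]."""
    count = 0
    for i, ch in enumerate(seq):
        j = i + shift
        if 0 <= j < len(ref) and ref[j] == ch:
            count += 1
    return count


def best_shift(ref, seq, max_shift=50):
    """Try every shift in [-max_shift, max_shift]; return (shift, matches) of the best."""
    best = (0, -1)
    for shift in range(-max_shift, max_shift + 1):
        score = matches_at_shift(ref, seq, shift)
        if score > best[1]:
            best = (shift, score)
    return best


def needs_gapped_alignment(ref, seq, min_identity=0.9, max_shift=50):
    """Flag sequences whose best gapless match is poor, e.g. because of an indel."""
    _, score = best_shift(ref, seq, max_shift)
    return score / max(len(seq), 1) < min_identity


ref = "GGGGACGTTGCAAGGCTA"
good = "ACGTTGCAAGGCTA"            # same as ref with the first 4 letters missing
indel = "ACGTTGTTTTCAAGGCTA"       # toy sequence with 4 inserted letters
print(best_shift(ref, good, max_shift=10))
print(needs_gapped_alignment(ref, indel, max_shift=10))
```

Sequences flagged by `needs_gapped_alignment` would be the ones to set aside for MAFFT or MUSCLE.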
---------------------------------------------------------
but as I wrote above, IMO everyone should be using that 12-subsequences-method above.
For longer sequences of other species we could use 14-subsequences.
---------------------------------------------------------------------------
I have a program typesz1.c , that finds the best match from a list of index-sequences
using that subsequences-database-method, but it doesn't align (yet).
-------------------------------------------------------------
or types1.c: finds the flugenome.org types of the unaligned sequences in a file with that method.
For gb191 (318MB, all 204496 GenBank flu-A sequences) it takes only 40 sec to assign the types,
out of a list of 189 types. The error rate is low.
It should be possible to align the sequences with this method in almost the same time.
-------------------------------------------------------------
this is flu-specific alignment only. But larger databases could be built to include more species.
Long (DNA) sequences could be split
--------------------------------------------------------
searching ...
CONTRAfold algorithm, instead of the McCaskill algorithm
MAFFT is a multiple alignment method that includes two algorithmic techniques: ...
MAFFT employs a progressive method (FFT-NS-2) and an iterative refinement ...
MAFFT: a novel method for rapid multiple sequence alignment based
on fast Fourier transform. Kazutaka Katoh
All pairwise alignments are computed with the Needleman-Wunsch algorithm.
More accurate but slower than --6merpair.
----------------------------------------------------------
blat - Standalone BLAT v. 33x5 fast sequence search command line tool
usage:
blat database query [-ooc=11.ooc] output.psl
where:
database and query are each either a .fa , .nib or .2bit file,
or a list these files one file name per line.
-ooc=11.ooc tells the program to load over-occurring 11-mers from
and external file. This will increase the speed
by a factor of 40 in many cases, but is not required
output.psl is where to put the output.
Subranges of nib and .2bit files may specified using the syntax:
/path/file.nib:seqid:start-end
or
/path/file.2bit:seqid:start-end
or
/path/file.nib:start-end
With the second form, a sequence id of file:start-end will be used.
options:
-t=type Database type. Type is one of:
dna - DNA sequence
prot - protein sequence
dnax - DNA sequence translated in six frames to protein
The default is dna
-q=type Query type. Type is one of:
dna - DNA sequence
rna - RNA sequence
prot - protein sequence
dnax - DNA sequence translated in six frames to protein
rnax - DNA sequence translated in three frames to protein
The default is dna
-prot Synonymous with -t=prot -q=prot
-ooc=N.ooc Use overused tile file N.ooc. N should correspond to
the tileSize
-tileSize=N sets the size of match that triggers an alignment.
Usually between 8 and 12
Default is 11 for DNA and 5 for protein.
-stepSize=N spacing between tiles. Default is tileSize.
-oneOff=N If set to 1 this allows one mismatch in tile and still
triggers an alignments. Default is 0.
-minMatch=N sets the number of tile matches. Usually set from 2 to 4
Default is 2 for nucleotide, 1 for protein.
-minScore=N sets minimum score. This is the matches minus the
mismatches minus some sort of gap penalty. Default is 30
-minIdentity=N Sets minimum sequence identity (in percent). Default is
90 for nucleotide searches, 25 for protein or translated
protein searches.
-maxGap=N sets the size of maximum gap between tiles in a clump. Usually
set from 0 to 3. Default is 2. Only relevent for minMatch > 1.
-noHead suppress .psl header (so it's just a tab-separated file)
-makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
-repMatch=N sets the number of repetitions of a tile allowed before
it is marked as overused. Typically this is 256 for tileSize
12, 1024 for tile size 11, 4096 for tile size 10.
Default is 1024. Typically only comes into play with makeOoc
-mask=type Mask out repeats. Alignments won't be started in masked region
but may extend through it in nucleotide searches. Masked areas
are ignored entirely in protein or translated searches. Types are
lower - mask out lower cased sequence
upper - mask out upper cased sequence
out - mask according to database.out RepeatMasker .out file
file.out - mask database according to RepeatMasker file.out
-qMask=type Mask out repeats in query sequence. Similar to -mask above but
for query rather than target sequence.
-repeats=type Type is same as mask types above. Repeat bases will not be
masked in any way, but matches in repeat areas will be reported
separately from matches in other areas in the psl output.
-minRepDivergence=NN - minimum percent divergence of repeats to allow
them to be unmasked. Default is 15. Only relevant for
masking using RepeatMasker .out files.
-dots=N Output dot every N sequences to show program's progress
-trimT Trim leading poly-T
-noTrimA Don't trim trailing poly-A
-trimHardA Remove poly-A tail from qSize as well as alignments in
psl output
-fastMap Run for fast DNA/DNA remapping - not allowing introns,
requiring high %ID
-out=type Controls output file format. Type is one of:
psl - Default. Tab separated format, no sequence
pslx - Tab separated format with sequence
axt - blastz-associated axt format
maf - multiz-associated maf format
sim4 - similar to sim4 format
wublast - similar to wublast format
blast - similar to NCBI blast format
blast8- NCBI blast tabular format
blast9 - NCBI blast tabular format with comments
-fine For high quality mRNAs look harder for small initial and
terminal exons. Not recommended for ESTs
-maxIntron=N Sets maximum intron size. Default is 750000
-extendThroughN - Allows extension of alignment through large blocks of N's
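For anyone who wants to call blat from a script, here is a small Python sketch that only builds the command line from the options listed in the usage text above (-tileSize, -minIdentity, -out). The file names are hypothetical, and I have not checked it against a blat installation.

```python
import shlex


def blat_command(database, query, output,
                 tile_size=None, min_identity=None, out_format=None):
    """Build the argument list: blat database query [options] output."""
    args = ["blat", database, query]
    if tile_size is not None:
        args.append(f"-tileSize={tile_size}")
    if min_identity is not None:
        args.append(f"-minIdentity={min_identity}")
    if out_format is not None:
        args.append(f"-out={out_format}")
    args.append(output)
    return args


# Hypothetical file names, for illustration only.
cmd = blat_command("index.fa", "new.fa", "hits.psl",
                   tile_size=12, min_identity=90, out_format="blast8")
print(shlex.join(cmd))
```

With blat installed and on the path, the list could be passed to `subprocess.run(cmd, check=True)`.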