Sequence Analysis Using MUSCLE


  • #16
    Re: Sequence Analysis Using MUSCLE

    Thanks JJackson.

    How do we know what the start point of the sequence is? Is it where the START codon (AUG) is?

    Is the difference between H1 and H3 numbering dependent on whether the start codon is included in the numbering?


    • #17
      Re: Sequence Analysis Using MUSCLE

      Originally posted by Sally View Post
      Thanks JJackson.

      How do we know what the start point of the sequence is? Is it where the START codon (AUG) is?

      Is the difference between H1 and H3 numbering dependent on whether the start codon is included in the numbering?
      I don't really know. I have always assumed so, as it seems the logical place to start. Nearly all my sequence aligning was done years ago on H5N1, where I just got to know the sequences around the areas that interested me (mainly binding and cleavage sites). I have only begun looking at sequences again over the last week or so, and H1 is new to me, as are the sites and online tools.
      I have just been looking for a downloadable BioEdit, but it seems to be no longer available. However, it is so basic that it may run straight from the executable without being installed, so if anyone is interested I can see whether it works that way and upload a copy if it does. It is a lot more basic than most of the current tools. CLC (free version) only seems to work with multiple aligned nucleotides; when I try to convert to proteins it splits the alignment into individual sequences, which you can no longer see side by side.
      [Two BioEdit screen captures, before and after Ctrl+G, were attached here.]


      • #18
        Re: Sequence Analysis Using MUSCLE

        Originally posted by gsgs View Post
        the nucleotide-sequences start with ~50 nucleotides
        which are not decoded to amino-acids.
        Often only part of these 50 is given, or none.
        The first occurrence of "ATG" is usually the first decoded amino acid
        (methionine, Met, M)

        also niman-H274Y is H275Y in N1
        and D225G is D239G in H1
        Notice what Gs says here about the starting place: the first occurrence of "ATG" = "M"

        So on the MUSCLE alignment example, you will notice their starting positions may vary but consensus starts with the first "atg". See my remarks about counting:

        OK! The example I'm showing has the first "atg" for all three starting at position 9.

        "We see the mutation at position 831 instead of 822;" so if I subtract 8, I'm at position 823... so I'm still 1 off?
        I'm not sure how the counting will work out if we just look at the sequence itself instead of using an alignment program.
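The counting described above can be sketched in Python. This is only an illustration, not what MUSCLE does internally: the function names are made up for this post, positions are 1-based as in the thread, and the start codon is taken to be the n-th "ATG" as gs describes. One thing worth checking when a count is off is whether gap characters ("-") sit in front of the site, because alignment columns count gaps while sequence positions do not.

```python
def first_start_codon(seq: str, occurrence: int = 1) -> int:
    """Return the 0-based index of the n-th 'ATG' in seq, or -1 if there is none."""
    seq = seq.upper().replace("U", "T")  # accept RNA spelling (AUG) as well
    idx = -1
    for _ in range(occurrence):
        idx = seq.find("ATG", idx + 1)
        if idx == -1:
            return -1
    return idx


def coding_position(raw_position: int, start_index: int) -> int:
    """Convert a 1-based position in the full sequence to a 1-based position
    counted from the 'A' of the start codon (start_index is 0-based)."""
    return raw_position - start_index


def codon_number(coding_pos: int) -> int:
    """1-based codon (amino acid) number for a 1-based coding position."""
    return (coding_pos - 1) // 3 + 1


def ungapped_position(aligned_seq: str, column: int) -> int:
    """Convert a 1-based alignment column to a 1-based position in the
    sequence itself by ignoring '-' gap characters up to that column."""
    return sum(1 for ch in aligned_seq[:column] if ch != "-")
```

With the first "atg" at position 9 (index 8), `coding_position(831, 8)` gives 823, the same subtraction as above, and `codon_number(715)` gives 239, which matches the D239G count quoted from gs.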
        The salvage of human life ought to be placed above barter and exchange ~ Louis Harris, 1918


        • #19
          Re: Sequence Analysis Using MUSCLE

          in segment 5 it's the 2nd ATG

          what I use:

          Code:
          >A/Index/******/2009/02/01(H1N1)
          XXXXXXXXXXXXXXXXXXXXTAGCAAAAAAGCAGGTCAAATATATTCAAT:ATGGAGAGAATA
          XXAGTTTGTAAAGGGACGTCCAGTAAGCAAAAGCAGGTCAAACCATTTGA:ATGGATGT
          XXXXXXXXXXXXXXXXXXXXXXXTTAGCAAAAAGCAGGTACTGATCCAAA:ATGGAAGACTTT
          XXXXXXAGCAATAACAAGAGCAAAAGCAGGGGAAAACAAAAGCAACAAAA:ATGAAG
          XTTAAGCAAAAGCAGGGTAGATAATCACCTCAATGAGTGACATCGAAGCC:ATGGCGT
          XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXAGCAAAAGCAGGAGTTTAAA:ATGAATCCAAACC
          XXXXXXXXXXXXXXXXXXXXCAGGGAGCAAAAGCAGGTAGATATTTAAAG:ATGAGTCTTCT
          XXXXXXXXXXXXXXXXXXXXXXXXAGCAAAAGCAGGGTGACAAAAACATA:ATGGACTCCAA
          start is after the ":"
          there are sometimes differences in the very first or last nucleotides, which I usually
          ignore as supposed sequencing-errors
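A rough Python sketch of how lines in this layout could be produced from a raw sequence. It is an assumption-laden illustration, not gs's actual program: the function name is hypothetical, the lead width of 50 and the "X" padding are read off the example above, and which "ATG" counts as the start (the 2nd one for segment 5) has to be supplied by the caller.

```python
def to_padded_line(seq: str, atg_occurrence: int = 1,
                   lead: int = 50, pad: str = "X") -> str:
    """Left-pad the untranslated start of seq so that the chosen ATG
    always begins right after a ':' in column lead + 1."""
    seq = seq.upper()
    start = -1
    for _ in range(atg_occurrence):
        start = seq.find("ATG", start + 1)
        if start == -1:
            raise ValueError("start codon not found")
    utr = seq[:start]
    # Keep at most `lead` untranslated nucleotides (the ones nearest the ATG).
    utr = utr[max(0, len(utr) - lead):]
    return utr.rjust(lead, pad) + ":" + seq[start:]
```

For example, `to_padded_line("agcaaaagcaggATGGAGAGA", lead=20)` pads the 12 leading nucleotides with 8 "X" characters, so the ":" lands in the same column for every sequence formatted with the same `lead`.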
          I'm interested in expert panflu damage estimates
          my current links: [url]http://bit.ly/hFI7H[/url] ILI-charts: [url]http://bit.ly/CcRgT[/url]


          • #20
            Re: Sequence Analysis Using MUSCLE

            Thanks for posting that; it's exactly what we needed.
            Just to clarify,
            I think Gs's example is of all the segments of one virus. We won't be comparing in that same manner; we compare like segments to like: segment 4 with segment 4, etc...

            But this is how the sequences may look when aligned in MUSCLE. Notice how they all align at the ":" but some have fewer letters on the left hand side. MUSCLE begins searching for the mutations at the (":") consensus point.

            Why do we start at the 2nd "atg" in segment 5?


            • #21
              Re: Mutations in A/H1N1 Not Confirmed to Affect Effectiveness of Current Vaccine

              Originally posted by Sally Furniss View Post
              Why would there be an R (2 of them) on the A/California/07/2009(H1N1) at about positions 715 and 718 ?
              I found 3 California/07 sequences at Genbank. 2 have no changes and 1 has mixed signals at amino acid positions 225 (715-717 nucleotide) and 226 (718-720 nucleotide). "R" indicates "A"s and "G"s were found in that position.
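A small Python sketch of what a mixed-signal letter such as "R" means for a codon. The table follows the standard IUPAC nucleotide codes (it also includes "M" for A or C); the function name and the example codon "GRT" are purely illustrative and are not taken from the California/07 record.

```python
from itertools import product

# IUPAC nucleotide codes mapped to the bases they can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "X": "ACGT",
}


def expand_codon(codon: str) -> list[str]:
    """List every unambiguous codon that an IUPAC-coded codon could represent."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in codon.upper()))]


# Hypothetical codon with a purine mix in the middle position:
# GAT encodes Asp (D) and GGT encodes Gly (G).
print(expand_codon("GRT"))
```

So a single "R" in the second codon position can be enough to make the amino acid at that site ambiguous between two residues.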

              Here's a list from gs of what the letters in the nucleotide sequence mean:

              These are the nucleotides:
              "A" = "Adenine"
              "C" = "Cytosine"
              "G" = "Guanine"
              "T" = "Thymine"

              These indicate mixed or ambiguous positions:
              "Y" = "Pyrimidine (C or T)"
              "R" = "Purine (A or G)"
              "W" = "Weak (A or T)"
              "S" = "Strong (G or C)"
              "K" = "Keto (T or G)"
              "M" = "Amino (A or C)"
              "D" = "Not C (A, G or T)"
              "V" = "Not T (A, C or G)"
              "H" = "Not G (A, C or T)"
              "B" = "Not A (C, G or T)"
              "X" = "Unknown"
              "N" = "Unknown"


              • #22
                Re: Sequence Analysis Using MUSCLE

                -------------------------------------
                better alignment program:

                make a database of a few hundred index (consensus) nucleotide sequences,
                compute all their length-12 subsequences and store the sequence numbers and
                positions in a 4^12-entry table of 12-subsequences.

                Whenever you get a new sequence, look up all its 12-subsequences in the database,
                find the index sequence with the most matches, and choose its alignment.
                If no index sequence matches well enough, add the new sequence to the index database.

                Easier to use, easier to program and faster and better than the existing alignment programs
                that I'm aware of.

                It _should_ exist already.
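A minimal Python sketch of the 12-subsequence lookup described above, under some stated assumptions: a dictionary stands in for the 4^12-entry table, every hit votes for a (reference, offset) pair, and the pair with the most votes wins. The function names are hypothetical and this is not the code of any existing program.

```python
from collections import Counter, defaultdict


def build_index(references: dict[str, str], k: int = 12) -> dict:
    """Map every length-k subsequence to the (reference name, position) pairs
    where it occurs."""
    index = defaultdict(list)
    for name, seq in references.items():
        seq = seq.upper()
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def best_reference(query: str, index: dict, k: int = 12):
    """Return (reference name, offset, hits) for the best-supported placement
    of query, or None if no k-mer of the query is in the index.
    offset = position in the reference minus position in the query."""
    votes = Counter()
    query = query.upper()
    for qpos in range(len(query) - k + 1):
        for name, rpos in index.get(query[qpos:qpos + k], ()):
            votes[(name, rpos - qpos)] += 1
    if not votes:
        return None
    (name, offset), hits = votes.most_common(1)[0]
    return name, offset, hits
```

A caller could treat a low `hits` value (threshold to be chosen) as "no index sequence matches well enough" and add the query to `references` before rebuilding the index. Because the offset is fixed per vote, this places a sequence without gaps; insertions and deletions would still need a real aligner.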

                -----------------------------------------------------------
                with the author's help I have now installed MAFFT as a Windows XP executable.
                Works well and quite fast. But I can't run it from batch since it doesn't return
                to the commandline (command.com ?) when finished. Presumably just a small
                programming error.
                -------------------------------------------------------------------
                For big databases (>~20MB) of similar sequences in my format (one header line,
                one line with nucleotides or proteins, ascii13+10 = EOL) I use my own
                program align.c which doesn't produce gaps, just finds the best match by shifting.
                This is quite fast, e.g. 1 minute for 7000 avian PB2s (16 MB). But some (~1%)
                of the sequences are badly aligned because of insertions or deletions (which are
                usually sequencing errors). Those sequences can be filtered out and
                aligned separately with MAFFT or MUSCLE, which then is much faster
                because of the reduced size.
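The gap-free "best match by shifting" idea can be sketched in Python as below. This is an illustration only and not a port of align.c, whose source is not shown in this thread; the function name and the `max_shift` limit are assumptions.

```python
def best_shift(reference: str, query: str, max_shift: int = 60) -> tuple[int, int]:
    """Slide query along reference without inserting gaps and return
    (shift, matches) for the shift with the most identical positions.
    A shift of s means query[i] is compared with reference[i + s]."""
    best = (0, -1)
    for shift in range(-max_shift, max_shift + 1):
        matches = 0
        for qpos, base in enumerate(query):
            rpos = qpos + shift
            if 0 <= rpos < len(reference) and reference[rpos] == base:
                matches += 1
        if matches > best[1]:
            best = (shift, matches)
    return best
```

A sequence whose best `matches` count stays well below its length is a candidate for the "filter out and align separately with MAFFT or MUSCLE" step, since an insertion or deletion breaks the single-shift assumption.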
                ---------------------------------------------------------
                but as I wrote above, IMO everyone should be using that 12-subsequences-method above.
                For longer sequences of other species we could use 14-subsequences.
                ---------------------------------------------------------------------------
                I have a program typesz1.c , that finds the best match from a list of index-sequences
                using that subsequences-database-method, but it doesn't align (yet).
                -------------------------------------------------------------
                or types1.c : finds the flugenome.org types of the unaligned sequences in a file with that method.
                On gb191 (318 MB, all 204496 GenBank flu-A sequences) it takes only 40 sec to assign the types
                out of a list of 189 types. The error rate is low.
                It should be possible to align the sequences with this method in almost the same time
                -------------------------------------------------------------
                this is flu-specific alignment only. But larger databases could be built to include more species.
                Long (DNA) sequences could be split
                --------------------------------------------------------
                searching ...
                http://www.google.de/#hl=de&sclient=...iw=971&bih=512

                http://en.wikipedia.org/wiki/Smith%E...rman_algorithm
                ...
                http://mafft.cbrc.jp/alignment/software/source66.html
                CONTRAfold algorithm, instead of the McCaskill algorithm
                MAFFT is a multiple alignment method that includes two algorithmic techniques: ...
                MAFFT employs a progressive method (FFT-NS-2) and an iterative refinement ...
                MAFFT: a novel method for rapid multiple sequence alignment based
                on fast Fourier transform. Kazutaka Katoh
                All pairwise alignments are computed with the Needleman-Wunsch algorithm.
                More accurate but slower than --6merpair.
                ----------------------------------------------------------


                • #23
                  Re: Sequence Analysis Using MUSCLE

                  http://en.wikipedia.org/wiki/BLAT_(bioinformatics)

                  Windows executable:
                  http://hgwdev.cse.ucsc.edu/~kent/exe...Suite.33x5.zip

                  blat - Standalone BLAT v. 33x5 fast sequence search command line tool
                  usage:
                  blat database query [-ooc=11.ooc] output.psl
                  where:
                  database and query are each either a .fa , .nib or .2bit file,
                  or a list these files one file name per line.
                  -ooc=11.ooc tells the program to load over-occurring 11-mers from
                  and external file. This will increase the speed
                  by a factor of 40 in many cases, but is not required
                  output.psl is where to put the output.
                  Subranges of nib and .2bit files may specified using the syntax:
                  /path/file.nib:seqid:start-end
                  or
                  /path/file.2bit:seqid:start-end
                  or
                  /path/file.nib:start-end
                  With the second form, a sequence id of file:start-end will be used.
                  options:
                  -t=type Database type. Type is one of:
                  dna - DNA sequence
                  prot - protein sequence
                  dnax - DNA sequence translated in six frames to protein
                  The default is dna
                  -q=type Query type. Type is one of:
                  dna - DNA sequence
                  rna - RNA sequence
                  prot - protein sequence
                  dnax - DNA sequence translated in six frames to protein
                  rnax - DNA sequence translated in three frames to protein
                  The default is dna
                  -prot Synonymous with -t=prot -q=prot
                  -ooc=N.ooc Use overused tile file N.ooc. N should correspond to
                  the tileSize
                  -tileSize=N sets the size of match that triggers an alignment.
                  Usually between 8 and 12
                  Default is 11 for DNA and 5 for protein.
                  -stepSize=N spacing between tiles. Default is tileSize.
                  -oneOff=N If set to 1 this allows one mismatch in tile and still
                  triggers an alignments. Default is 0.
                  -minMatch=N sets the number of tile matches. Usually set from 2 to 4
                  Default is 2 for nucleotide, 1 for protein.
                  -minScore=N sets minimum score. This is the matches minus the
                  mismatches minus some sort of gap penalty. Default is 30
                  -minIdentity=N Sets minimum sequence identity (in percent). Default is
                  90 for nucleotide searches, 25 for protein or translated
                  protein searches.
                  -maxGap=N sets the size of maximum gap between tiles in a clump. Usually
                  set from 0 to 3. Default is 2. Only relevent for minMatch > 1.
                  -noHead suppress .psl header (so it's just a tab-separated file)
                  -makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
                  -repMatch=N sets the number of repetitions of a tile allowed before
                  it is marked as overused. Typically this is 256 for tileSize
                  12, 1024 for tile size 11, 4096 for tile size 10.
                  Default is 1024. Typically only comes into play with makeOoc
                  -mask=type Mask out repeats. Alignments won't be started in masked region
                  but may extend through it in nucleotide searches. Masked areas
                  are ignored entirely in protein or translated searches. Types are
                  lower - mask out lower cased sequence
                  upper - mask out upper cased sequence
                  out - mask according to database.out RepeatMasker .out file
                  file.out - mask database according to RepeatMasker file.out
                  -qMask=type Mask out repeats in query sequence. Similar to -mask above but
                  for query rather than target sequence.
                  -repeats=type Type is same as mask types above. Repeat bases will not be
                  masked in any way, but matches in repeat areas will be reported
                  separately from matches in other areas in the psl output.
                  -minRepDivergence=NN - minimum percent divergence of repeats to allow
                  them to be unmasked. Default is 15. Only relevant for
                  masking using RepeatMasker .out files.
                  -dots=N Output dot every N sequences to show program's progress
                  -trimT Trim leading poly-T
                  -noTrimA Don't trim trailing poly-A
                  -trimHardA Remove poly-A tail from qSize as well as alignments in
                  psl output
                  -fastMap Run for fast DNA/DNA remapping - not allowing introns,
                  requiring high %ID
                  -out=type Controls output file format. Type is one of:
                  psl - Default. Tab separated format, no sequence
                  pslx - Tab separated format with sequence
                  axt - blastz-associated axt format
                  maf - multiz-associated maf format
                  sim4 - similar to sim4 format
                  wublast - similar to wublast format
                  blast - similar to NCBI blast format
                  blast8- NCBI blast tabular format
                  blast9 - NCBI blast tabular format with comments
                  -fine For high quality mRNAs look harder for small initial and
                  terminal exons. Not recommended for ESTs
                  -maxIntron=N Sets maximum intron size. Default is 750000
                  -extendThroughN - Allows extension of alignment through large blocks of N's
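A hedged Python sketch of calling BLAT from a script, using only options that appear in the usage text above. The file names are placeholders, the helper names are made up for this example, and the command is only run if a `blat` executable is found on the PATH.

```python
import shutil
import subprocess


def blat_command(database: str, query: str, output: str,
                 tile_size: int = 11, min_identity: int = 90,
                 out_format: str = "psl", no_head: bool = True) -> list[str]:
    """Build a blat argument list: database, query, options, output file."""
    cmd = ["blat", database, query,
           "-t=dna", "-q=dna",
           f"-tileSize={tile_size}",
           f"-minIdentity={min_identity}",
           f"-out={out_format}"]
    if no_head:
        cmd.append("-noHead")  # suppress the .psl header
    cmd.append(output)
    return cmd


def run_blat(database: str, query: str, output: str, **options) -> bool:
    """Run blat if it is installed; return False if it is not on the PATH."""
    if shutil.which("blat") is None:
        return False
    subprocess.run(blat_command(database, query, output, **options), check=True)
    return True
```

For example, `blat_command("index.fa", "new.fa", "hits.psl")` corresponds to the command line `blat index.fa new.fa -t=dna -q=dna -tileSize=11 -minIdentity=90 -out=psl -noHead hits.psl`.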


                  • #24
                    Re: Sequence Analysis Using MUSCLE

                    mentioned on the DNA-forums was also: (wrt. RNA-virus alignment)

                    Bowtie
                    BWA
                    TopHat


                    • #25
                      Re: Sequence Analysis Using MUSCLE

                      ahh, so many alignment programs ...
                      http://en.wikipedia.org/wiki/List_of...nment_software

                      what's the best