
how to get flu-sequence information from genbank


  • how to get flu-sequence information from genbank

    ------ main links -------------
    flu-sequences filtering page:
    daily flu updates:

    influenza A and B viruses have 8 segments of genetic information
    (influenza C has 7), about 13000 nucleotides (A,C,G or T) in total.
    (humans have 23 chromosomes with about 3000000000
    nucleotides in total, 230000 times more)

    These nucleotide sequences can be decoded and stored in databases,
    the most common of which is the public genbank.

    There are currently about 240000 influenza sequences at genbank
    from ~80000 flu viruses. Remember, one virus can give 8 sequences
    or even more, since it can be grown and sequenced multiple times.
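To put the numbers above side by side, here is a small sketch that only redoes the arithmetic with the approximate figures quoted in this post. The variable names are my own.

```python
# Approximate figures as quoted in the post above.
FLU_GENOME_NT = 13_000            # nucleotides in one influenza genome
HUMAN_GENOME_NT = 3_000_000_000   # nucleotides in the human genome
GENBANK_FLU_SEQUENCES = 240_000   # influenza sequences at genbank
GENBANK_FLU_VIRUSES = 80_000      # viruses those sequences come from

# How much bigger the human genome is than a flu genome.
human_vs_flu = HUMAN_GENOME_NT / FLU_GENOME_NT

# Average number of genbank sequences per virus.
seqs_per_virus = GENBANK_FLU_SEQUENCES / GENBANK_FLU_VIRUSES

print(f"human genome: about {human_vs_flu:,.0f} times larger")
print(f"sequences per virus: about {seqs_per_virus:.1f} on average")
```

An average of about 3 sequences per virus is well below 8, so with these figures many viruses must have only some of their segments at genbank.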

    In this thread you'll learn how to find and access these
    sequences at genbank (if there is interest ...)
    I'm interested in expert panflu damage estimates
    my current links: [url][/url] ILI-charts: [url][/url]

  • #2
    Re: how to get flu-sequence information from genbank

    Lesson 1
    access by accession number

    each sequence is stored in a record that holds some additional information
    in a somewhat standardized format.
    Each record is given an "accession number" (also called access code here).
    If you have the accession number, you can type in the URL directly
    to get the record:
    and then the accession number.

    e.g. :

    there are several ways to get the accession numbers.
    You can even download them all in ~10MB
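If you would rather fetch a record from a program than type a URL into the browser, one option is the NCBI E-utilities "efetch" service. Note that this is my substitution and not necessarily the URL meant above. The sketch below only builds the request URL; the function name and defaults are my own, and BAE47133 / AB212651 are the accession numbers from the blow fly example later in this thread.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def record_url(accession: str, database: str = "protein", fmt: str = "gp") -> str:
    """Build an efetch URL that returns one record as plain text.

    database: "protein" or "nuccore"
    fmt: "gp" (protein record), "gb" (nucleotide record) or "fasta"
    """
    params = {"db": database, "id": accession, "rettype": fmt, "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


protein_url = record_url("BAE47133")
nucleotide_url = record_url("AB212651", database="nuccore", fmt="gb")
print(protein_url)
print(nucleotide_url)

# To actually download (needs a network connection):
#   from urllib.request import urlopen
#   text = urlopen(protein_url).read().decode("ascii", errors="replace")
```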


    • #3
      Re: how to get flu-sequence information from genbank

      Lesson 2
      record files via ftp

      these "records" are also stored in big files and made available via ftp.
      The influenza records are put, together with other virus records,
      into one of the currently 24 virus files, gbvrl1.seq.gz ... gbvrl24.seq.gz
      These files are compressed with "gzip" and must be uncompressed
      before you can look at them with a viewer or editor.
      Uncompressed, each of these files is at most 250MB.
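If you process the files with a program instead of a viewer, you don't have to unpack them to disk first. Below is a minimal sketch, assuming the usual flat-file layout where every record starts with a LOCUS line and ends with a line containing only "//". The function names and the toy data are my own.

```python
import gzip
import io
from typing import Iterator, TextIO


def iter_records(handle: TextIO) -> Iterator[str]:
    """Yield one flat-file record at a time; each record ends with a '//' line."""
    lines = []
    for line in handle:
        if not lines and not line.startswith("LOCUS"):
            continue  # skip the file header and anything else before a LOCUS line
        lines.append(line)
        if line.rstrip() == "//":
            yield "".join(lines)
            lines = []


def open_release_file(path_or_file) -> TextIO:
    """Open a gzip-compressed flat file as text, decompressing on the fly."""
    return gzip.open(path_or_file, "rt", encoding="ascii", errors="replace")


# Tiny in-memory stand-in for a real gbvrl*.seq.gz file (toy content, not real data).
toy_text = "FILE HEADER LINE\n\nLOCUS       TOY1\n//\nLOCUS       TOY2\n//\n"
toy_gz = io.BytesIO(gzip.compress(toy_text.encode("ascii")))
with open_release_file(toy_gz) as handle:
    toy_records = list(iter_records(handle))
print(len(toy_records), "records")
```

For a real file you would call open_release_file("gbvrl1.seq.gz") after downloading it.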

      Only the short sequences, like those of viruses and bacteria, are stored in this directory;
      big data sets, e.g. human DNA, can be accessed through the genomes subdirectory.
      But that's another issue, this thread is for flu only.

      These big files are updated every 2 months in a new genbank "release".
      The history and statistics of these releases can be seen in


      • #4
        Re: how to get flu-sequence information from genbank

        Lesson 3
        short form, fasta format

        for each flu record there are several short forms available
        which contain only the essential information, in several searchable
        fields, in standard, short, computer-readable formats.
        Many such records can be put into one file and downloaded for analysis.
        These short forms are called "fasta", the files are "fasta files".
        Each record has a header line ("defline") with general info, followed by one or more
        lines with raw sequence data.

        The fields currently available for the header line are:
        (some of the meanings and details will have to be explained later ...)

        Gi --- another code to identify the records
        Accession --- accession number as in lesson 1
        Host --- the species of the animal whose cells were invaded by the flu virus
        Segment --- segments are PB2,PB1,PA,HA,NP,NA,MP,NS
        Name --- sequence name, e.g. A/Pennsylvania/02/2010(H3N2)
        Serotype --- the different flu types, specified by HA and NA types
        Segment number --- 1,2,3,4,5,6,7,8
        Country --- where collected
        Year Month Day --- the date when collected
        Strain ---
        Virus name ---
        Definition ---
        Age --- age of host
        Gender --- of host
        Mutations --- typical mutations, e.g. for resistance
        CDS Location --- coding section, i.e. which nucleotides are translated to amino acids
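To show how such a file can be read by a program, here is a minimal sketch of a fasta reader. It uses two of the comma-separated header lines from the example in this post. The field labels are my own guess from the order of the values (the 7th field is empty in the example, so I just call it "field7"), and the sequence lines are short placeholders, NOT real sequence data.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Labels guessed from the order of the values in the example headers (assumption).
DEFLINE_FIELDS = ["name", "date", "country", "subtype",
                  "segment_number", "protein", "field7", "host"]


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (defline, sequence) pairs; the sequence lines are joined together."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_defline(defline: str) -> Dict[str, str]:
    """Split a comma-separated defline into named fields."""
    values = defline.lstrip(">").split(",")
    return dict(zip(DEFLINE_FIELDS, values))


# Headers from the post; the sequence lines below are made-up placeholders.
example = """\
>A/little yellow-shouldered bat/Guatemala/060/2010,2010/09/,Guatemala,H17N10,7,M2,,Bat
ATGAGTC
TTCTAAC
>A/little yellow-shouldered bat/Guatemala/164/2009,2009/05/,Guatemala,H17N10,8,NS2,,Bat
ATGGATT
"""

records = list(read_fasta(example.splitlines()))
first_fields = parse_defline(records[0][0])
print(first_fields["subtype"], first_fields["host"], len(records[0][1]))
```

Splitting on commas only works as long as no field value contains a comma itself; with a customized defline you would pick a separator that cannot occur in the names.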

        example of a fasta file with 3 sequences (only the header lines are shown here, the sequence lines are left out):

        >A/little yellow-shouldered bat/Guatemala/060/2010,2010/09/,Guatemala,H17N10,7,M2,,Bat
        >A/little yellow-shouldered bat/Guatemala/060/2010,2010/09/,Guatemala,H17N10,8,NS2,,Bat
        >A/little yellow-shouldered bat/Guatemala/164/2009,2009/05/,Guatemala,H17N10,8,NS2,,Bat


        • #5
          Re: how to get flu-sequence information from genbank

          Lesson 4
          filtering, searching fasta headers

          searching genbank sequences for header (fasta, defline) characteristics
          or keywords can be done via the single-sequence page at

          or via the genome page (searches whole viruses, not just one segment) at

          here we discuss the former; the genome search is similar and will be
          discussed later

          you can combine several filters and then display or download the results,
          or search them by several criteria (explained later)

          1.) nucleotides or amino-acids
          2.) influenza A,B,C or any
          3.) host species
          4.) country/region
          5.) segment / protein
          6.) HA-subtype of the virus
          7.) NA-subtype of the virus
          8.) sequence-length-range
          9.) collection date period
          10.) submission period

          and additional filters for including/excluding

          pandemic H1N1 (2009)
          lab-strains (created,mutated,grown in cell cultures or lab-animals)
          lineage defining strains (theoretical sequences, deduced)
          sequences from the flu-project
          vaccine strains (as selected by WHO)
          mixed infections (reassorted/mixed in the host)

          once you have selected the desired criteria, you click
          "Add Query" (add to query builder), which displays your criteria in
          an extra window together with the number of sequences that match.
          Then you can continue and select more sequences by other criteria,
          add them to the query builder too, etc.
          In the query builder you can mark or unmark the queries to get
          the desired combination. I often reuse several queries
          here over several steps, just by marking/unmarking,
          without having to build new queries,
          e.g. for displaying counts or downloading fastas for several selected years.
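The mark/unmark idea can be mimicked offline in a few lines. This is only my own toy model of the query builder, not how genbank implements it: I assume that the marked queries are combined as a union (a record is shown if it matches at least one marked query). The three records use names that appear in this thread, with only host and year filled in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]


@dataclass
class Query:
    label: str
    matches: Callable[[Record], bool]
    marked: bool = True  # unmarked queries stay in the builder but are ignored


def run_marked(queries: List[Query], records: List[Record]) -> List[Record]:
    """Return the records matching at least one marked query (union, my assumption)."""
    active = [q for q in queries if q.marked]
    return [r for r in records if any(q.matches(r) for q in active)]


records = [
    {"name": "A/blow fly/Kyoto/93/2004(H5N1)", "host": "Blow fly", "year": "2004"},
    {"name": "A/little yellow-shouldered bat/Guatemala/060/2010", "host": "Bat", "year": "2010"},
    {"name": "A/little yellow-shouldered bat/Guatemala/164/2009", "host": "Bat", "year": "2009"},
]

queries = [
    Query("year 2009", lambda r: r["year"] == "2009"),
    Query("year 2010", lambda r: r["year"] == "2010"),
    Query("host Blow fly", lambda r: r["host"] == "Blow fly"),
]

# Like the count column in the query builder: matches per single query.
counts = {q.label: sum(1 for r in records if q.matches(r)) for q in queries}

# Unmark one query and rerun, without building anything new.
queries[2].marked = False
selected = run_marked(queries, records)
print(counts, len(selected))
```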

          once you have selected your query, you can either

          download those sequences as fasta (nucleotides or protein or headers only), or
          display all the headers in an extra page (this may take a while).
          This extra page then allows (combined) sorting by several criteria,
          easy clicking to display the full, long records, design of the defline, download options ...
          (discussed later)

          a word on timing, i.e. how long those requests may take:

          displaying many sequences in the extra window can take long
          (~a minute for 1000)
          downloading them is faster

          how this filtering (= selecting of specified fastas) is done
          is best shown by some examples, which follow

          to do:
          flu-project ?
          download whole database and use offline tools

          "lesson" , not lection


          • #6
            Re: how to get flu-sequence information from genbank

            Lesson 5

            time for a quick, easy example for a first success experience ...


            use all the default settings, except
            click on "Blow fly" in the Host menu (2nd dropdown menu)
            "Blow fly" is now highlighted in blue
            then click "Add Query" (bottom left)

            after a few seconds a new window is added below

            "Query builder" ...
            A Blow fly any any any any any any details 4

            meaning that it has found 4 sequences with that query (flu in Blow fly host)

            click : "Show Results"

            and you get a list of the 4 with the option to click on their Accession code
            for displaying the long record :
            4 protein sequences after collapsing (4 total)

            Accession  Length  Host      Protein  Subtype  Country  Region  Date  Virus name                                          Complete
            BAE47131   567     Blow fly  HA       H5N1     Japan    N       2004  Influenza A virus (A/blow fly/Kyoto/93/2004(H5N1))  c
            BAE47132   449     Blow fly  NA       H5N1     Japan    N       2004  Influenza A virus (A/blow fly/Kyoto/93/2004(H5N1))  c
            BAE47133   97      Blow fly  M2       H5N1     Japan    N       2004  Influenza A virus (A/blow fly/Kyoto/93/2004(H5N1))  c
            BAE47134   252     Blow fly  M1       H5N1     Japan    N       2004  Influenza A virus (A/blow fly/Kyoto/93/2004(H5N1))  c

            (the columns Mutations, Age, Gender, Lineage and VacStr are empty for these 4 records and left out here)

            clicking on the third, BAE47133, then highlighting, Ctrl-C, ... Ctrl-V lets me copy the record
            to a post on FT like this one

            membrane ion channel; M2 [Influenza A virus (A/blow fly/Kyoto/93/2004(H5N1))]
            GenBank: BAE47133.1

            LOCUS       BAE47133     97 aa     linear   VRL 30-SEP-2010
            DEFINITION  membrane ion channel; M2 [Influenza A virus (A/blow
            ACCESSION   BAE47133
            VERSION     BAE47133.1  GI:78210829
            DBSOURCE    accession AB212651.2
            KEYWORDS    .
            SOURCE      Influenza A virus (A/blow fly/Kyoto/93/2004(H5N1))
              ORGANISM  Influenza A virus (A/blow fly/Kyoto/93/2004(H5N1))
                        Viruses; ssRNA negative-strand viruses; Orthomyxoviridae;
                        Influenzavirus A.
            REFERENCE   1
              AUTHORS   Sawabe,K., Hoshino,K., Isawa,H., Sasaki,T., Hayashi,T., Tsuda,Y.,
                        Kurahashi,H., Tanabayashi,K., Hotta,A., Saito,T., Yamada,A. and
              TITLE     Detection and isolation of highly pathogenic H5N1 avian influenza A
                        viruses from blow flies collected in the vicinity of an infected
                        poultry farm in Kyoto, Japan, 2004
              JOURNAL   Am. J. Trop. Med. Hyg. 75 (2), 327-332 (2006)
               PUBMED   16896143
            REFERENCE   2 (residues 1 to 97)
              AUTHORS   Sawabe,K., Hoshino,K., Isawa,H. and Sasaki,T.
              TITLE     Direct Submission
              JOURNAL   Submitted (28-APR-2005) Contact:Kyoko Sawabe National Institute of
                        Infectious Diseases, Department of Medical Entomology; Toyama
                        1-23-1, Shinjuku-ku, Tokyo 162-8640, Japan
            FEATURES             Location/Qualifiers
                 source          1..97
                                 /organism="Influenza A virus (A/blow
                                 /strain="A/blow fly/Kyoto/93/2004(H5N1)"
                                 /isolation_source="blow fly"
                                 /country="Japan: Kyoto, Tamba"
                                 /note="isolated from blow fly in Kyoto, 2004"
                 Protein         1..97
                                 /product="membrane ion channel; M2"
                 Region          1..97
                                 /note="Influenza Matrix protein (M2); pfam00599"
                 CDS             1..97
                                 /coded_by="join(AB212651.2:26..52,AB212651.2:741..1007)"
                    1 mslltevetp trnewecrcs dssdplvvaa siigilhlil wildrlffkc iyrrlkyglk
                   61 rgpstagvpe smreeyrqeq qsavdvddgh fvniele
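A record like this can be cross-checked with a few lines of code. The sketch below (helper names are my own) joins the two sequence lines into one protein string and adds up the two pieces of the "join(...)" location from the CDS feature. I assume, as is usual for genbank CDS locations, that the nucleotide range includes the stop codon.

```python
import re
from typing import Iterable


def sequence_from_lines(lines: Iterable[str]) -> str:
    """Join lines like '1 mslltevetp trnewecrcs ...' into one sequence string."""
    blocks = []
    for line in lines:
        parts = line.split()
        blocks.extend(parts[1:])  # parts[0] is the running position number
    return "".join(blocks)


def join_length(location: str) -> int:
    """Total number of nucleotides covered by all start..end pieces of a location."""
    spans = re.findall(r"(\d+)\.\.(\d+)", location)
    return sum(int(end) - int(start) + 1 for start, end in spans)


# The two sequence lines of record BAE47133 as pasted above.
seq_lines = [
    "1 mslltevetp trnewecrcs dssdplvvaa siigilhlil wildrlffkc iyrrlkyglk",
    "61 rgpstagvpe smreeyrqeq qsavdvddgh fvniele",
]
protein = sequence_from_lines(seq_lines)

# Location string from the CDS feature, written without stray spaces.
cds_nt = join_length("join(AB212651.2:26..52,AB212651.2:741..1007)")
codons = cds_nt // 3

print(len(protein), "amino acids;", cds_nt, "nucleotides =", codons, "codons")
```

The two pieces (27 + 267 nucleotides) give 98 codons, i.e. the 97 amino acids from the LOCUS line plus one stop codon. The gap between the pieces is there because M2 is made from a spliced message of segment 7.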


            • #7
              Re: how to get flu-sequence information from genbank

              the whole thing is futile...
              I think the better way to deal with genbank is to download the whole database
              and then filter/search/process/analyze everything offline with your own programs and tools.
              Such tools are not yet widely available or popular.
              I feel like someone teaching a group how to use Windows after realizing that Linux
              would be the much better system for the task.

              All the full (long) genbank virus records (files gbvrl1,...,gbvrl24) together make up
              5705 MB, which can be compressed with 7zip to 280 MB.
              About 1898 MB of these are HIV viruses, 1235 MB are Influenza viruses and 645 MB are Hepatitis viruses;
              the rest is only about 1933 MB.

              All the 240000 full (long) genbank flu records make up a file of 1240 MB,
              which can be compressed with 7zip to 29.4 MB.

              All the 240000 short (fasta) genbank flu records make up a file of 380 MB,
              which can be compressed with 7zip to 7.5 MB.

              All the 240000 (long) headers of the genbank flu records (that's what most people need,
              unless they want to analyze the sequence data) make up a file of 28 MB, which can be compressed
              with 7zip to 1.4 MB.
              It would be reasonable IMO to update this 1.4 MB file and provide it for download on a daily or weekly
              basis, together with software to filter, sort and display the content, which could be
              similar to the software on the genbank web page (but better ...)
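To give an idea of what such offline processing could look like, here is a minimal sketch (not my actual tools). It walks through flat-file lines, keeps only the influenza records and drops the sequence part. I assume here that "header" means everything from the LOCUS line up to the ORIGIN line, and that flu records can be recognized by the word "Influenza" on the ORGANISM line. The toy lines are made up.

```python
from typing import Iterable, Iterator


def iter_flu_headers(lines: Iterable[str]) -> Iterator[str]:
    """Yield the part before ORIGIN of every record whose ORGANISM is an influenza virus."""
    header, in_sequence, is_flu = [], False, False
    for line in lines:
        if line.startswith("LOCUS"):
            header, in_sequence, is_flu = [line], False, False
        elif line.startswith("ORIGIN"):
            in_sequence = True  # everything from here to '//' is sequence data
        elif line.rstrip() == "//":
            if is_flu and header:
                yield "".join(header)
            header = []
        elif header and not in_sequence:
            header.append(line)
            if line.lstrip().startswith("ORGANISM") and "Influenza" in line:
                is_flu = True


# Toy flat-file lines (made up, only the layout matters).
toy_lines = [
    "LOCUS       TOY1         10 bp\n",
    "  ORGANISM  Influenza A virus (A/blow fly/Kyoto/93/2004(H5N1))\n",
    "ORIGIN\n",
    "        1 acgtacgtac\n",
    "//\n",
    "LOCUS       TOY2         10 bp\n",
    "  ORGANISM  Human immunodeficiency virus 1\n",
    "ORIGIN\n",
    "        1 acgtacgtac\n",
    "//\n",
]
flu_headers = list(iter_flu_headers(toy_lines))
print(len(flu_headers), "flu header(s) kept")
```

The lines can come straight from a gzip-compressed gbvrl file opened in text mode, so the big files never have to be unpacked to disk.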

              It's not hard to predict that this will come sooner or later, because it makes sense.
              I have written some tools to process, filter and search those big genbank flu files,
              but I still often use the genbank web page tools, since my programs are not
              complete, not perfect, and the genbank web page is still easier for me for many

              attached is a file; download it and decompress it with 7-zip into gb192a.nam,
              the extended names of 207028 flu-A sequences from release 192.0

              download and install 7-zip e.g. from here:
              that should be useful even without genbank


              • #8
                Re: how to get flu-sequence information from genbank

                googling for "customize fasta defline" gives 3 hits.

                One is a document at scribd, but it is heavily protected;
                you must register or something like that.
                That page killed my browser, so be careful.

                the page at
                doesn't contain anything of what google lists for it

                the third hit is FT

                the genbank help itself is not listed in the google search,
                so I copy it here so that google will pick it up.
                You'd better read it at the original site, because of the hyperlinks.

                Influenza Virus Resource presents data obtained from the NIAID Influenza Genome Sequencing Project as well as from GenBank, combined with tools for flu sequence analysis and annotation. In addition, it provides links to other resources that contain flu sequences, publications and general information about flu viruses.

                <table border="0" width="100%"><thead><tr bgcolor="EEEEEE"><th colspan="2" align="left">The NCBI Influenza Virus Sequence Database</th><th></th> </tr> </thead> <tbody> <tr align="left"> <td align="left">Nucleotide or protein sequences can be searched by adding a comma or space separated list of GenBank accession numbers or uploading a text file containing such a list under the "Get sequences by accession" section. The sequences can be added to the "Query builder" or shown directly by clicking the "Add query" or "Show results" buttons.

                To search the database using other terms, first decide whether you would like to search for protein sequences, their coding regions, or nucleotide sequences, by checking the radio buttons to the left of the sequence type names.

                The "Search for keyword" section allows users to search for sequences by 1). a string of word in virus strain names (e.g. New York); 2). a pattern in nucleotide or protein sequences (e.g. AGCGAAAGCAGGGGT or RSKV); and 3). drug-resistance mutations in protein sequences (e.g. S31N or H274Y). A list of mutations annotated in the database can be found here.

                In the "Define search set" section, select one or multiple names (by holding the Ctrl or Shift key) each from the lists provided, and/or fill in the boxes. The fields are virus type (e.g. Influenza virus A, B or C), Host (e.g. Human or Avian), Country/Region (e.g. Australia or Asia) or Year (or a range of year) viruses were isolated, Segment (1 through 8) or protein name (e.g. PB1-F2 or M1), Subtypes (e.g. H3N2 or H5), and a range of the lengths of the sequences.

                You can limit your search results to full-length sequences by checking the appropriate boxes. "Full-length only" applies to sequences that have complete coding regions including start and stop codons, and they are labelled as "c" (for complete) in the database query result. "Full-length plus" applies to all "Full-length only" sequences, plus those only missing start and/or stop codons, which are labelled as "nc" (for nearly complete) in the database query result. Partial sequences are labelled as "p" in the database query result.

                Month and day can be added in addition to year. Please note that not all sequences have month and day available. Therefore sequences with only year as collection date will not be included in a search if a month of the corresponding year is entered in the query. For example, a search for sequences from 2006/05 to 2008/11 will retrieve those with month in collection date for 2006 and 2008, but not those with only 2006 or 2008 as collection date (because they could be from 2006/04 or 2008/12). However, all sequences from 2007, with or without month in the collection date, will be included in such a query. Check the boxes next to "Month" or "Day" under "Collection date must contain" if one wants to retrieve only sequences with month or day in the collection date.
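The date rule in the paragraph above can be written down as a small check. This is my own reading of the help text, not NCBI's code: a record is kept only if every date it could possibly have lies inside the query range. The sketch works at month granularity only (days are ignored), and the names are made up.

```python
from typing import Optional, Tuple

YearMonth = Tuple[int, int]


def possible_range(year: int, month: Optional[int]) -> Tuple[YearMonth, YearMonth]:
    """Earliest and latest (year, month) a collection date could stand for."""
    if month is None:  # only the year is known
        return (year, 1), (year, 12)
    return (year, month), (year, month)


def in_query(year: int, month: Optional[int], start: YearMonth, end: YearMonth) -> bool:
    """True if the whole possible range of the record lies inside the query range."""
    earliest, latest = possible_range(year, month)
    return start <= earliest and latest <= end


# The example from the help text: a search from 2006/05 to 2008/11.
start, end = (2006, 5), (2008, 11)
results = {
    "2006 (year only)": in_query(2006, None, start, end),
    "2006/06": in_query(2006, 6, start, end),
    "2007 (year only)": in_query(2007, None, start, end),
    "2008 (year only)": in_query(2008, None, start, end),
    "2008/12": in_query(2008, 12, start, end),
}
for label, kept in results.items():
    print(label, "->", "kept" if kept else "dropped")
```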

                Released date is the date when a sequence first appeared in GenBank.

                Check boxes next to the segment/protein names under "Required segments" to retrieve sequences defined in the "Segment/Protein" field when all of the selected segments of the same virus isolate exist in the database. Check the "Full-length only" box in this section if the required segments must be full-length.

                From a drop-down menu next to "Pandemic (H1N1) viruses" (also known as the swine flu outbreak), you can include, exclude or retrieve only these sequences in your search results. Newly released sequences can be retrieved from the database by defining the GenBank release date. For example, A(H1N1)pdm09 virus sequences released in GenBank between June 30 and July 6, 2010 can be retrieved using this database query.

                From a drop-down menu next to "The FLU project", you can include, exclude or retrieve only these sequences in your search results. Sequences from the FLU project are those submitted to GenBank through a streamlined GenBank submission pipeline. These are mostly from large scale flu genome sequencing projects, which usually contain complete genomes, detailed source information and high quality of annotations. Currently, the major contributors are the NIAID Influenza Genome Sequencing Project, the St. Jude Influenza Genome Project, the Centers for Disease Control and Prevention, Centers of Excellence for Influenza Research and Surveillance (CEIRS), and the University of Hong Kong.
                Sequences from the FLU Project

                Sequences of reassortments or lab strains (those flagged as "LAB" in the country field) are excluded in the search by default, and the drop-down menu next to "Lab strains" can be used should you want to include or retrieve only those sequences. From a drop-down menu next to "Vaccine strains", you can include, exclude or retrieve only sequences of WHO recommended vaccine strains in your search results.

                From a drop-down menu next to "Lineage defining strains", you can include, exclude or retrieve only sequences of prototype viruses of well defined lineages/clades. Currently, this includes those for Influenza B viruses (Victoria and Yamagata), and the H5N1 and H9N2 subtypes of Influenza A viruses.

                By checking the box next to "Collapse identical sequences", all groups of identical sequences in a dataset will be represented by the oldest sequence in the group. This will reduce the number of sequences in some cases by keeping only unique sequences in a dataset.
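A rough sketch of what "Collapse identical sequences" does, as I understand the paragraph above. I assume that "oldest" refers to the collection date and that the dates are complete and zero-padded, so they sort correctly as text. The accessions and sequences are toy placeholders.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, str, str]  # (accession, collection date "YYYY/MM/DD", sequence)


def collapse_identical(entries: List[Entry]) -> List[Tuple[Entry, int]]:
    """Group entries with identical sequence; keep the oldest one plus the group size."""
    groups: Dict[str, List[Entry]] = {}
    for entry in entries:
        groups.setdefault(entry[2], []).append(entry)
    collapsed = []
    for members in groups.values():
        oldest = min(members, key=lambda e: e[1])  # earliest date represents the group
        collapsed.append((oldest, len(members)))
    return collapsed


# Toy placeholders, not real genbank entries.
entries = [
    ("ACC1", "2009/06/05", "ATGGC"),
    ("ACC2", "2009/04/20", "ATGGC"),
    ("ACC3", "2010/01/15", "ATGGA"),
]
collapsed = collapse_identical(entries)
summary = [(entry[0], size) for entry, size in collapsed]
print(summary)
```

The group size corresponds to the number shown in the "#" column of the results page.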

                After clicking the "Add Query" button, the query you selected and the number of resulting sequences will be shown in "Query Builder". If "any" is selected in "Virus Species" and/or "Segment", a warning message will be shown and the "Multiple sequence alignment" and "Tree building" functionalities will not be allowed in the subsequent steps when the resulting dataset contains sequences from different virus species and/or different segment. A sample query page can be found here.

                Multiple queries can be built by repeating the above steps. When a different "Virus Species" and/or "Segment" is selected in the new query, the same warning message described above will be shown and the "Multiple sequence alignment" and "Tree building" functionalities will not be allowed in the subsequent steps, if the resulting dataset contains sequences from different virus species and/or different segment. When a different sequence type (i.e. Protein sequence, Coding region or Nucleotide sequence) is selected for the new query, a pop-up window will ask whether you indeed would like to start a new query with a new sequence type (which will clear the current "Query Builder"), or you want to continue with the current sequence type by going back to the current query builder. This is to prevent mixing different sequence types in the same "Query Builder" (e.g. protein sequences with nucleotide sequences). Queries in any combination from the "Query Builder" can be selected to get sequences from the database.

                Sequences found by the selected queries will be shown in a separate window once you click the "Show results" button. By default, the sequences are ordered by the virus names. They can be reordered by up-to three fields sequentially, by holding the Ctrl or Shift key while clicking on field headers. A sample resulting page can be found here.

                Sequences of interest can be selected by checking the boxes to the left of accession numbers. When "Collapse identical sequences" is selected in query, the numbers of identical sequences in the collapsed groups are shown in the column "#". These groups can be expanded by clicking the numbers, and sequences within the groups can be selected to be included in the dataset as well.

                The corresponding protein, coding region or nucleotide sequences of the selected sequences can be downloaded by selecting the appropriate name in the "Download results" drop-down menu. To meet the need of different users, the definition line of the FASTA sequences in the downloaded files can be customized by clicking "Customize FASTA defline". The default defline is in the format of ">{accession} {strain} {year}/{month}/{day} {segname}" (e.g. >ADA83577 A/Argentina/HNRG13/2009 2009/06/05 PB2), but you are able to add any fields by clicking the ones listed, or remove any by deleting them from the Defline editing box. A space is inserted between fields by default, but it can be replaced with other characters by typing in the editing box. When the "Remember changes" box is checked, the defline format you defined will be remembered and used in all subsequent downloads, until it is reset or cookies are deleted in the browser. A list of GenBank accession numbers for selected protein or nucleotide sequences, and a table of the search result in XML, CSV or tab-delimited format can also be downloaded from the "Download results" menu.
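The defline template quoted above happens to look just like a Python format string, so the customization is easy to mimic offline. A minimal sketch: the default template and the example values are the ones from the help text, while the function name and the custom template with "|" separators are my own.

```python
from typing import Dict

# Default defline format as quoted in the help text.
DEFAULT_DEFLINE = ">{accession} {strain} {year}/{month}/{day} {segname}"


def make_defline(fields: Dict[str, str], template: str = DEFAULT_DEFLINE) -> str:
    """Fill a defline template with the fields of one sequence."""
    return template.format(**fields)


# Field values taken from the example in the help text.
fields = {
    "accession": "ADA83577",
    "strain": "A/Argentina/HNRG13/2009",
    "year": "2009",
    "month": "06",
    "day": "05",
    "segname": "PB2",
}

default_line = make_defline(fields)
custom_line = make_defline(fields, ">{strain}|{segname}|{accession}")
print(default_line)
print(custom_line)
```

A separator like "|" that never occurs inside strain names makes the headers easier to split again later than the default space does.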

                Further sequences analysis of the selected sequences can be performed by clicking the "Do multiple alignment" or "Build a tree" button, if they are allowed (i.e. no mixing species and/or segments in the dataset). User's own sequences (of the same sequence type in FASTA format) can be added to the selected sequences for analysis, by clicking the "Add your own sequences" button. The number of sequences added cannot be more than 128 KB in file size.

                <table border="0" width="100%"><thead><tr bgcolor="EEEEEE"><th colspan="2" align="left">Genome Set</th><th></th> </tr> </thead> <tbody> <tr align="left"> <td align="left">The Influenza Virus Genome Set Tool displays nucleotide sequences obtained from the NCBI Influenza Virus Sequence Database ordered by genome segments for each virus. All segments of the same virus are grouped together in the same background color, alternating in light blue and white. Genomes of the same virus isolate but sequenced in different labs are identified in the database, and are grouped separately based on the sequence submitters. This tool is a convenient way to check the completeness of genome segments for viruses of interest. Database searches can be performed similarly as described above. By default, this tool only gets viruses with a complete set of segments in full-length (or in full-length plus if the check box next to "Complete plus" is selected). To get all viruses with any number of sequences, check the radio button next to "All" in the "Show results" box. The results are shown in the descending order by the number of segments the viruses have.

                <table border="0" width="100%"><thead><tr bgcolor="EEEEEE"><th colspan="2" align="left">Alignment</th><th></th> </tr> </thead> <tbody> <tr align="left"> <td align="left">Multiple alignments of nucleotide or protein sequences from the NCBI Influenza Virus Sequence Database and/or user's input file can be obtained, using the MUSCLE program. Start the alignment by selecting the Alignment button in the top horizontal bar. This will open a database query interface similar to the one described above. Please follow the instruction for database query and be sure to select sequences from the same segment of the genome, and preferably of similar sizes. A maximum number of 1,000 is set for sequences allowed to be included in the alignment. For datasets larger than 1,000 sequences, it is recommended to download the sequences using the download tool of the database, and run the multiple sequence alignment using a program (e.g. MUSCLE) installed locally.

                After sequences of interest are selected from the database and/or added from an input file, click the "Do multiple alignment" button to get the alignment. The consensus sequence is displayed at the top of the alignment, and identical sequences to the consensus are shown in dots and gaps are shown in dashes. In the coding region alignment, non-synonymous changes (in triplets) are highlighted in a different background color. The alignment can be navigated horizontally either by typing in the position you would like the sequences to start from in the text box after "Go to position" and clicking "Go", or by moving the bottom scroll bar that wraps the alignment. When a sequence in the alignment is clicked, a small window will be popped up. The GenBank record for the sequence can be opened by clicking the accession number in the pop-up window. The sequence can also be selected to perform BLAST 2 Sequences (Click the "BLAST 2 seq." button after two different sequences are selected from the alignment). By clicking the "Select for anchor" option from the pop-up window, the consensus sequence will be replaced by the selected sequence. When the anchor sequence is clicked, a small window with options will be popped up. The anchor sequence can be reset to the consensus sequence, and the anchor/consensus sequence can be displayed for copying. The multiple alignment file in FASTA format can be downloaded by selecting "Download alignment". A printer-friendly version of the alignment can be obtained by clicking the "Print-friendly version" button. If desired, click the "Build a tree" button to build a tree from the aligned sequences.
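The dot display described above is easy to reproduce for sequences you have aligned yourself. In this sketch I assume the consensus is simply the most frequent character in each column (the help text does not say how NCBI computes it). The aligned sequences are toy data.

```python
from collections import Counter
from typing import List


def consensus(aligned: List[str]) -> str:
    """Most frequent character per alignment column (simple majority, my assumption)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*aligned))


def dot_view(sequence: str, reference: str) -> str:
    """Show positions identical to the reference as dots; gaps stay as dashes."""
    return "".join(
        "." if a == b and a != "-" else a
        for a, b in zip(sequence, reference)
    )


# Toy alignment: all rows have the same length, '-' marks a gap.
aligned = ["ATGGCA", "ATGGCA", "ATGACA", "ATG-CA"]
reference = consensus(aligned)
views = [dot_view(seq, reference) for seq in aligned]

print(reference)
for view in views:
    print(view)
```

Using another sequence as reference instead of the consensus corresponds to the "Select for anchor" option.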


                <table border="0" width="100%"><thead><tr bgcolor="EEEEEE"><th colspan="2" align="left">Clustering and phylogenetic analysis</th><th></th> </tr> </thead> <tbody> <tr align="left"> <td align="left"> Scope
                The interactive tool DatasetExplorer is part of the NCBI Influenza Virus Resource. It provides an easy way to perform preliminary analysis on nucleotide and protein sequences from the NCBI Influenza Virus Sequence Database and/or a user's input file. Datasets are visually represented using phylogenetic/clustering trees. Users can select the tree-building algorithm as well as the similarity criterion.

                </td></tr> <tr><td align="left"> Overview of the Methodology
                Start the tool by clicking the "Tree" button in the top horizontal bar. Sequences are acquired from the NCBI Influenza Virus Sequence Database or uploaded by a user as described above. After a dataset has been selected, the sequences are aligned using a multiple alignment algorithm, in order to identify common regions in the sequences and establish correspondence between sequence columns (we perform multiple protein alignment, while alignment of the nucleotide sequences for the coding regions is induced by the protein alignment). Distances between sequences are calculated based on their dissimilarity in a selected region of the alignment, and the analysis is performed. We offer visualization based on phylogenetic and clustering tree methods: the classical neighbor-joining method and agglomerative hierarchical clustering methods.

                Alignment of protein sequences is performed using the protein multiple alignment tool MUSCLE. We offer different distance measures for calculating pairwise distances between sequences. In particular, we use some of the distances implemented in the PHYLIP package, as well as the mPAM weight matrix.

                </td> </tr> <tr><td align="left"> Sequence Alignment
                The tool performs multiple protein alignments using the MUSCLE program and creates the nucleotide alignment of the corresponding coding regions from the protein alignment by using the codon-amino acid correspondence.
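Deriving a nucleotide alignment from a protein alignment can be sketched as follows: every residue in the gapped protein row is replaced by its own codon, and every gap by three dashes. This is a minimal illustration under the assumption that the coding sequence starts in frame and has at least one codon per residue; `thread_codons` is a hypothetical helper name, not part of any NCBI tool.

```python
def thread_codons(aligned_protein, cds):
    """Expand a gapped protein row into a gapped nucleotide row.

    aligned_protein: protein sequence with '-' for gaps
    cds: the ungapped coding sequence of the same protein, in frame
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if n_residues > len(codons):
        raise ValueError("coding sequence is shorter than the protein")
    out = []
    k = 0  # index of the next unused codon
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")       # a protein gap spans three nucleotides
        else:
            out.append(codons[k])   # the residue keeps its original codon
            k += 1
    return "".join(out)


if __name__ == "__main__":
    print(thread_codons("M-K", "ATGAAA"))
```

A trailing stop codon in `cds` is simply left unused, because it has no residue in the protein row.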

                After sequences are obtained from the NCBI Influenza Virus Sequence Database and/or a user's input file, click the "Build a tree" button in the database query results page to start the process. This opens a window with a graphic view of the multiple sequence alignment.

                Sequence Region Selection
                The graphic view of the multiple alignments of sequences selected from the previous step is displayed. The black and red colors in the graphics represent the presence and absence of amino acid residues at the corresponding positions. The positions in the longest sequence of the selected set for the first and last amino acid of each sequence are shown. A histogram showing the total number of amino acid residues at each position is displayed at the top of the page. The program automatically selects the sequence region to be analyzed so that the majority of the sequences in the set will be included. The sequence region can also be defined by users by first selecting all sequences in the set, and then entering the start and end positions in the boxes provided. When clicking the "Select sequences" button, the region from sequences that have complete coverage between the two positions will be selected, and sequences excluded from the selection will be highlighted with a background color in the graphic view.
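The "complete coverage" rule for a user-defined region can be approximated like this. It is a sketch with invented names (`residue_span`, `select_region`), and it assumes that coverage is judged only by the first and last residue of each aligned row, as in the graphic view; internal gaps are not checked.

```python
def residue_span(row):
    """1-based alignment columns of the first and last non-gap character."""
    cols = [i for i, ch in enumerate(row, start=1) if ch != "-"]
    return (cols[0], cols[-1]) if cols else None


def select_region(alignment, start, end):
    """Split an alignment into rows that cover [start, end] and rows that do not.

    alignment: dict mapping a sequence name to its gapped, aligned row
    start, end: 1-based, inclusive alignment columns
    """
    selected, excluded = {}, []
    for name, row in alignment.items():
        span = residue_span(row)
        if span and span[0] <= start and span[1] >= end:
            selected[name] = row[start - 1:end]   # keep only the chosen region
        else:
            excluded.append(name)                 # would be highlighted in the view
    return selected, excluded


if __name__ == "__main__":
    aln = {"a": "MKTAYI", "b": "--TAYI", "c": "MKTA--"}
    print(select_region(aln, 3, 6))
```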

                Phylogenetic/Clustering Tree
                A clustering or phylogenetic tree can be built by selecting one of the clustering algorithms and a distance calculation method from the list, and clicking the "Next step" button.

                Sequences of interest can be highlighted in the tree, and they can be selected or deselected using the check boxes to the right of each sequence.</td></tr></tbody></table>

                Distance methods approximating minimum evolution <table border="1" cellpadding="5" cellspacing="0" width="100%"> <tbody><tr> <td align="center"> Method </td> <td align="center"> Description </td> </tr> <tr><td align="center"> Neighbor-Joining</td> <td align="left"> At each step, the pair with the smallest value of D<sub>ij</sub> - b<sub>i</sub> - b<sub>j</sub> is chosen, where D<sub>ij</sub> is the distance between nodes i and j, and b<sub>i</sub> = ∑<sub>k</sub> D<sub>ik</sub> /(n-2), summed over the n current nodes. The distance between the new node u and each of the remaining nodes k is defined as D<sub>uk</sub> = (D<sub>ik</sub> + D<sub>jk</sub> - D<sub>ij</sub> ) /2. Branch lengths are defined as v<sub>ui</sub> = (D<sub>ij</sub> + b<sub>i</sub> - b<sub>j</sub> ) /2 and v<sub>uj</sub> = (D<sub>ij</sub> + b<sub>j</sub> - b<sub>i</sub> ) /2 (negative lengths are truncated to zero). </td> </tr> </tbody></table>
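To make the neighbor-joining step concrete, here is a small sketch that follows the description in the table: b_i is the sum of the distances from node i to all other current nodes divided by n-2, the pair minimizing D_ij - b_i - b_j is joined, and negative branch lengths are truncated to zero. The data layout (a dict of dicts) and the internal node names `U0`, `U1`, ... are my own choices, and ties are broken by list order, which the original page does not specify.

```python
def neighbor_joining(names, dist):
    """Return a list of edges (parent, child, branch_length).

    names: list of leaf labels
    dist: dict of dicts with symmetric pairwise distances, dist[a][b]
    """
    nodes = list(names)
    D = {a: dict(dist[a]) for a in nodes}   # working copy
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # b_i = sum_k D_ik / (n - 2)
        b = {i: sum(D[i][k] for k in nodes if k != i) / (n - 2) for i in nodes}
        pairs = ((x, y) for idx, x in enumerate(nodes) for y in nodes[idx + 1:])
        i, j = min(pairs, key=lambda p: D[p[0]][p[1]] - b[p[0]] - b[p[1]])
        u = f"U{counter}"
        counter += 1
        # branch lengths, negative values truncated to zero
        edges.append((u, i, max(0.0, (D[i][j] + b[i] - b[j]) / 2)))
        edges.append((u, j, max(0.0, (D[i][j] + b[j] - b[i]) / 2)))
        # distances from the new node u to every remaining node k
        D[u] = {}
        for k in nodes:
            if k not in (i, j):
                d = (D[i][k] + D[j][k] - D[i][j]) / 2
                D[u][k] = d
                D[k][u] = d
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, c = nodes
    edges.append((a, c, D[a][c]))           # connect the last two nodes
    return edges


if __name__ == "__main__":
    labels = ["a", "b", "c", "d", "e"]
    rows = [[0, 5, 9, 9, 8], [5, 0, 10, 10, 9], [9, 10, 0, 8, 7],
            [9, 10, 8, 0, 3], [8, 9, 7, 3, 0]]
    dist = {x: {y: rows[i][j] for j, y in enumerate(labels)}
            for i, x in enumerate(labels)}
    for edge in neighbor_joining(labels, dist):
        print(edge)
```

The five-taxon matrix in the demo is the usual textbook example; a and b are joined first because their pair has the smallest criterion value.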
                Agglomerative hierarchical clustering methods <table border="1" cellpadding="5" cellspacing="0" width="100%"><tbody><tr> <td align="center"> Method </td> <td align="center"> Alternative name </td> <td align="center"> Distance between clusters defined as: </td> </tr> <tr> <td align="center">Average Linkage</td> <td align="center">UPGMA</td> <td align="center"> Average distance between pairs of objects, one in one cluster, one in another </td> </tr> <tr> <td align="center">Complete Linkage</td> <td align="center">Furthest Neighbor</td> <td align="center"> Maximum distance between pairs of objects, one in one cluster, one in another </td> </tr> <tr> <td align="center">Single Linkage</td> <td align="center">Nearest Neighbor</td> <td align="center"> Minimum distance between pairs of objects, one in one cluster, one in another </td></tr></tbody></table>
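The three linkage rules in the table differ only in how the pairwise distances between two clusters are combined. A naive sketch (the function name `cluster` and the `frozenset` keyed distance table are assumptions for illustration; real tools use far more efficient updates):

```python
from itertools import combinations, product

# How the object-to-object distances between two clusters are combined
LINKAGE = {
    "single": min,                              # nearest neighbor
    "complete": max,                            # furthest neighbor
    "average": lambda ds: sum(ds) / len(ds),    # UPGMA
}


def cluster(names, dist, linkage="average"):
    """Agglomerative clustering; returns the merge history.

    dist: dict mapping frozenset({a, b}) to the distance between a and b
    Each history entry is (cluster_a, cluster_b, distance_at_merge).
    """
    combine = LINKAGE[linkage]

    def between(pair):
        a, b = pair
        return combine([dist[frozenset((x, y))] for x, y in product(a, b)])

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=between)
        merges.append((a, b, between((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


if __name__ == "__main__":
    d = {frozenset("AB"): 2, frozenset("AC"): 6, frozenset("BC"): 10}
    for method in LINKAGE:
        print(method, cluster(["A", "B", "C"], d, method))
```

In the demo all three methods merge A and B first; they disagree only on the height at which C joins (6, 10 and 8 for single, complete and average linkage).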

                <table border="0" width="100%"><tbody><tr><td align="left">Protein and Nucleotide Distances
                We offer different distance measures for calculating nucleotide and protein pairwise sequence distances: the Felsenstein F84 distance and the Hamming distance for nucleotide sequences; the Dayhoff PAM matrix, the JTT matrix model, the PMB model, and Kimura's approximation for protein sequences as implemented in the PHYLIP package; and the mPAM weight matrix for protein sequences.
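Of the measures listed, the Hamming distance is the simplest: it counts the aligned positions at which two sequences differ. A minimal sketch, assuming that columns with a gap in either sequence are skipped (the page does not say how the tool treats gaps) and using an invented function name:

```python
def hamming_distance(a, b, normalize=True):
    """Number (or fraction) of differing positions between two aligned rows.

    Columns where either row has a gap ('-') are ignored; this gap handling
    is an assumption, not documented behavior of the NCBI tool.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    diff = sum(x != y for x, y in pairs)
    if normalize and pairs:
        return diff / len(pairs)
    return diff


if __name__ == "__main__":
    print(hamming_distance("ATGC", "ATGA"))
    print(hamming_distance("ATGC", "A-GA", normalize=False))
```

The model-based distances (F84, PAM, JTT and so on) correct this raw count for multiple substitutions at the same site and are better taken from PHYLIP itself than reimplemented.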

                </td></tr> <tr><td align="left"> Tree Modification
                An adaptive approach is used to visualize the tree in an aggregated form adapted to the user's screen, allowing users to interactively refine or aggregate the visualization of different parts of the tree (see a paper for details). A branch on the tree can be selected by clicking its root node, and the resolution of the selected branch can be changed by moving along the scale bar. The GenBank accession numbers of the amino acid sequences in the selected branch of a tree can be exported by clicking the "Download accessions" button under the scale bar. Sequences on the tree can be searched by the fields in the database, and the resulting sequences or groups will be highlighted in green.

                </td></tr> <tr><td align="left"> Tree Export
                The complete tree can be exported in the Newick format by clicking the "Download full tree" button. The downloaded tree can be displayed by many tree-viewing programs.
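For readers who have not seen the Newick format: it is a nested-parentheses text notation ending in a semicolon, with optional `:length` after each node. The sketch below writes a tiny tree and pulls the leaf names back out of a Newick string. The tuple layout and both function names are assumptions for illustration, and the regular expression only works for simple unquoted labels (virus names containing spaces, commas or parentheses would need a real parser).

```python
import re


def to_newick(node):
    """Serialize a (name, branch_length, children) tuple, without the final ';'."""
    name, length, children = node
    label = f"{name}:{length}" if length is not None else name
    if children:
        return "(" + ",".join(to_newick(c) for c in children) + ")" + label
    return label


def leaf_names(newick):
    """Leaf labels of a simple Newick string (unquoted labels only)."""
    # a leaf label directly follows '(' or ',' and is not itself a '('
    return re.findall(r"[(,]\s*([^(),:;]+)", newick)


if __name__ == "__main__":
    tree = ("", None, [("A", 0.1, []),
                       ("", 0.05, [("B", 0.2, []), ("C", 0.3, [])])])
    text = to_newick(tree) + ";"
    print(text)
    print(leaf_names(text))
```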


                <table border="0" width="100%"><thead><tr bgcolor="EEEEEE"><th colspan="2" align="left">Sequence annotation</th><th></th> </tr> </thead> <tbody> <tr align="left"> <td align="left">The Influenza Virus Sequence Annotation Tool is a web application for annotating user-provided Influenza A virus, Influenza B virus and Influenza C virus sequences. It can predict the protein sequences encoded by a flu sequence and produce a feature table that can be used for sequence submission to GenBank, as well as a GenBank flat file. The type/segment/subtype of an input influenza sequence is first determined by BLAST; the sequence is then aligned against the corresponding sample protein set with the protein-to-nucleotide alignment tool ProSplign. The translated product from the best alignment to a sample protein sequence is used as the predicted protein encoded by the input sequence.

                Type/segment/subtype identification
                An input sequence is searched by BLAST against a specialized influenza sequences database to determine the virus type (A, B or C), segment (1 through 8) and subtype for the hemagglutinin and neuraminidase segments of Influenza A virus. The database contains one reference sequence for each virus segment and each subtype of the hemagglutinin and neuraminidase (available here). The top hit in the BLAST result is used to determine the virus type/segment/subtype of the input sequence.

                Sample protein sequences
                Representatives of published protein and mature peptide sequences for each virus segment and different subtypes for the hemagglutinin and neuraminidase segments of Influenza A virus are maintained on the server side (available in the PROTEIN-A, PROTEIN-B and PROTEIN-C directories located here). For the segments that encode proteins with large variations in amino acid sequences and mature peptide cleavage sites, more than one protein could be chosen to be included. For example, this collection currently has 16 different protein samples for hemagglutinin of Influenza A virus. Based on the segment and subtype determined by the BLAST result, a subset of sample protein sequences is selected and aligned against the input sequence.

                Protein to nucleotide alignment
                A special global protein-to-nucleotide alignment tool, ProSplign, was designed to accurately annotate spliced genes and mature peptides of influenza viruses. ProSplign also handles input sequences with insertions and/or deletions which may cause a frame shift in the coding region.

                Interpreting alignment result and creating outputs
                A successful protein-to-nucleotide alignment should pass the following criteria:
                1) The input sequence should start with a correct start codon (or span the beginning of input sequence in case of partial 5' end)
                2) The input sequence should end with one of the stop codons (or span the end of input sequence in case of partial 3' end)
                3) The input sequence should have no frameshifts or internal stop codons
                4) The number of exon(s) must be correct (2 for the second protein of segments 7 and 8 of Influenza A virus and segment 8 of Influenza B virus, 1 exon for all other segments/proteins)
                If an alignment passes all four criteria above, the tool adopts the translated protein from the alignment as the protein prediction. Positions of the start, stop, splice sites (if present) and mature peptide are taken from the alignment. If an alignment fails any of the criteria, the tool moves on and aligns the next sample protein from the reference subset. If none of the sample proteins produces an acceptable alignment, the best-aligned sample protein (the one with the highest alignment score) is used to generate an error report.
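A much-simplified version of the first three checks can be run on a bare coding sequence. This is only a sketch: the real tool evaluates a ProSplign protein-to-nucleotide alignment and also handles partial ends and spliced genes, none of which is attempted here. The function name `check_cds` and the message texts are invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(cds):
    """Return a list of problems found in a complete, unspliced coding sequence."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    # split into complete codons, ignoring any trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("missing start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing stop codon")
    if any(c in STOP_CODONS for c in codons[1:-1]):
        problems.append("internal stop codon")
    return problems


if __name__ == "__main__":
    for seq in ("ATGAAATAA", "ATGTAAAAATAA", "AAAAAATAA"):
        print(seq, check_cds(seq) or "ok")
```

An empty list means the sequence passed; the tool's iteration over sample proteins would correspond to trying the next reference whenever the list is not empty.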
                The first output of a successful annotation is a feature table, which is a five-column, tab-delimited table of feature locations and qualifiers. The tool also creates the ASN.1, XML and GenBank formatted views of the same annotation, using the following NCBI developed utilities: tbl2asn and asn2xml.
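As a rough illustration of the five-column layout: a feature table starts with a `>Feature` line naming the sequence, each feature is a line with start, stop and feature key, and each qualifier is a line with three empty columns followed by the qualifier name and value. The helper below is hypothetical, and the sequence ID and coordinates in the demo are made up rather than taken from a real record.

```python
def feature_table(seq_id, features):
    """Build a five-column, tab-delimited feature table as text.

    features: list of (start, stop, feature_key, [(qualifier, value), ...])
    """
    lines = [f">Feature {seq_id}"]
    for start, stop, key, qualifiers in features:
        lines.append(f"{start}\t{stop}\t{key}")        # columns 1-3
        for name, value in qualifiers:
            lines.append(f"\t\t\t{name}\t{value}")     # columns 4-5
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # seq_id and positions are placeholders, not a real annotation
    print(feature_table("my_segment7", [
        (1, 759, "gene", [("gene", "M1")]),
        (1, 759, "CDS", [("product", "matrix protein 1")]),
    ]))
```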

                Drug resistance prediction
                The tool can also detect and report the most common signature mutations that might confer drug resistance on the virus. Such mutations include L26F (e.g. CY009837), V27A (e.g. DQ186974), A30T (e.g. EU263348), S31N (e.g. DQ107508) and G34E (e.g. L25818) in the M2 protein, H274Y (e.g. DQ250165) and N294S (e.g. EF222322) in the N1 subtype of neuraminidase, and R292K (e.g. AY643089) and E119G/D/A/V (e.g. EU429720) in the N2 subtype of neuraminidase.
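Checking a protein for such a signature mutation boils down to looking at one position. The sketch below assumes the protein sequence is ungapped and that its numbering matches the numbering used in the mutation name. That is an important caveat: H274Y, for example, is given in N2 numbering and sits at position 275 in N1 numbering. The real tool works from an alignment; `classify` is an invented helper and the demo sequence is a toy string, not a real M2 protein.

```python
import re


def classify(protein, mutation):
    """Classify the residue at a signature position, e.g. mutation='S31N'.

    Returns 'mutant' if the residue equals the resistant amino acid,
    'reference' if it equals the sensitive one, 'other' for anything else,
    and 'not covered' if the sequence is too short.
    """
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if m is None:
        raise ValueError(f"cannot parse mutation name: {mutation}")
    ref, pos, alt = m.group(1), int(m.group(2)), m.group(3)
    if pos < 1 or pos > len(protein):
        return "not covered"
    residue = protein[pos - 1]          # mutation names are 1-based
    if residue == alt:
        return "mutant"
    if residue == ref:
        return "reference"
    return "other"


if __name__ == "__main__":
    toy = "G" * 30 + "N" + "G" * 10     # toy sequence with N at position 31
    for name in ("S31N", "G34E", "V27A"):
        print(name, classify(toy, name))
```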

                Other mutation detection
                The signature mutation, E627K, in the PB2 protein (e.g. AY651719) that might confer high virulence of influenza viruses will be detected and reported.

                To use the tool, paste one or more nucleotide sequences in FASTA format into the sequence box. Sequences can also be imported from a file by clicking the "Browse" button. After the "Annotate FASTA" button is clicked, the feature tables, one per input sequence and separated by a line of equal signs, are shown in a separate window. A message showing the predicted segment, and the subtype for hemagglutinin and neuraminidase segments, is also displayed. Warning messages are shown along with the feature table if the input sequence lacks a start/stop codon or contains ambiguity codes. If a frameshift is found in a coding region, or a mutation introduces a stop codon within the coding region, no feature table is produced; an error message is shown instead, indicating the nature (insertion, deletion or mutation), length and location of the error. Other output formats (GenBank flat file, ASN.1, XML, protein FASTA and alignment) can be selected and either shown in the browser or saved to files.
                This annotation tool uses published influenza protein sequences as training sets, so it may not work as expected for some new sequence variants. Please report such cases to us so we can improve this tool.

                How to cite the annotation tool
                Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Tatusova T. FLAN: a web server for influenza virus genome annotation. Nucleic Acids Research. 2007 Jul 1;35(Web Server issue):W280-4.


                <table border="0" width="100%"><thead><tr bgcolor="EEEEEE"><th colspan="2" align="left">FTP</th><th></th> </tr> </thead> <tbody> <tr align="left"> <td align="left">Data in the NCBI Influenza Virus Sequence Database are available via FTP. The FTP directory contains the following files and their compressed versions, which are updated every day:
                genomeset.dat - Table with supplementary genomeset data
                influenza_na.dat - Table with supplementary nucleotide data
                influenza_aa.dat - Table with supplementary protein data
                influenza.dat - Table with nucleotide, protein and coding regions IDs
                influenza.fna - FASTA nucleotide
                influenza.cds - FASTA coding regions
                influenza.faa - FASTA protein

                The genomeset.dat contains information for sequences of viruses with a complete set of segments in full-length (or nearly full-length). Those of the same virus are grouped together and separated by an empty line from those of other viruses.
                The genomeset.dat, influenza_na.dat and influenza_aa.dat files are tab-delimited tables which have the following fields:
                GenBank accession number, Host, Genome segment number, Subtype, Country, Year, Sequence length, Virus name, Age, Gender. The influenza_na.dat and influenza_aa.dat files have an additional field in the last column to indicate if a sequence is full-length.
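Once downloaded and uncompressed, these tables are easy to read with a few lines of code. The sketch below takes the column order from the field list above; the short column names in `FIELDS`, the function name `read_dat` and the two sample rows are my own inventions (the accessions are placeholders, not real records). Blank lines are treated as group separators, which matters for genomeset.dat.

```python
import io

# Column order as listed above; the last column exists only in
# influenza_na.dat and influenza_aa.dat
FIELDS = ["accession", "host", "segment", "subtype", "country", "year",
          "length", "virus_name", "age", "gender", "full_length"]


def read_dat(handle):
    """Yield groups of rows (lists of dicts); blank lines separate groups."""
    group = []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            if group:               # a blank line closes the current virus
                yield group
                group = []
            continue
        group.append(dict(zip(FIELDS, line.split("\t"))))
    if group:
        yield group


if __name__ == "__main__":
    # invented sample rows in genomeset.dat layout (10 columns)
    sample = "\n".join([
        "\t".join(["XX000001", "Human", "1", "H3N2", "USA", "2005",
                   "2341", "Influenza A virus (A/example/1/2005(H3N2))", "", ""]),
        "\t".join(["XX000002", "Human", "2", "H3N2", "USA", "2005",
                   "2341", "Influenza A virus (A/example/1/2005(H3N2))", "", ""]),
        "",
    ])
    for virus in read_dat(io.StringIO(sample)):
        print(virus[0]["virus_name"], [row["segment"] for row in virus])
```

For the real files, open them with `open("genomeset.dat", encoding="utf-8", errors="replace")` and pass the handle to `read_dat`; for influenza_na.dat every group will simply contain all rows up to the next blank line or the end of the file.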
                The influenza.dat file is a tab-delimited table which has the following fields:
                GenBank accession number for nucleotide, GenBank accession number for protein, Identifier for protein coding region
                A directory named "updates" contains daily updates for all of the above listed files in subdirectories for each date.
                A directory named "ANNOTATION" contains reference sequences used in the Influenza Virus Sequence Annotation Tool. The file blastDB.fasta has one representative sequence for each type/segment/subtype of influenza viruses A, B and C, and it is used to build a specialized BLAST database for determining the type/segment/subtype of input influenza virus sequences. The PROTEIN-A, PROTEIN-B and PROTEIN-C subdirectories each contain the sample protein and mature peptide sequences used to annotate user-provided sequences.


                Last edited by sharon sanders; March 1st, 2013, 07:33 AM. Reason: reformed with graphs and hyperlinks