I finished my first version (untested) of flu-genbank at:
5MB compressed, 110MB expanded, version from 2008/04/18
Description at
copy below
Errors are corrected, the notation is made uniform, and the file is computer-readable,
so hopefully future changes will be easy.
Some improvements are still possible...
I also have tools to extract/merge headers,
extract subsets by keyword,
make mutation tables, draw mutation graphs, etc.
They will be uploaded later;
this is work in progress, and I can send them by email if someone is interested:
names.exe
xtract.exe
merge.exe
seq1.exe
seqa.exe
align.exe
mn.exe
seq1q.exe
The source code is attached to the executables.
--------------------------------
file flu.gz
62869 records of 2 lines each: the first line is a header
with 16 comma-separated entries, the second line is
the nucleotide sequence.
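For anyone who wants to read the file without my tools, here is a minimal Python sketch of the two-line record layout described above. It assumes flu.gz is an ordinary gzip of a single text file; the function name read_records is only for illustration and is not one of my programs.

```python
import gzip
from typing import Iterator, Tuple


def read_records(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a gzip file of two-line records.

    Assumption: the file is plain gzip-compressed text in which every
    record is exactly one header line followed by one sequence line.
    """
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            sequence = handle.readline()
            yield header.rstrip("\r\n"), sequence.rstrip("\r\n")
```

Usage would be something like `for header, seq in read_records("flu.gz"): ...`, which keeps only one record in memory at a time.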
my current headers:
examples:
>AB000605,H,6,,Japan,1971,1136,C,C/Sapporo/71,,,y,199356,,26-MAR-2003,
>CY009388,H,4,H3N2,New Zealand,2000,1721,A,A/Canterbury/94/00(H3N2), 31411,F,y,363048,20-10-2000,15-MAR-2006,36817
1) GenBank accession code
2) host species (H:human,A:avian,S:swine)
3) segment 1..8 , 1..7 for C
4) serotype (empty for B,C,u)
5) country
6) year
7) length
8) type (A,B,C,u)
9) name
10) host-age in days
11) host-sex (m,f)
12) full-length ?
13) taxon
14) collection date (year and month at least, else empty)
15) submission date
16) days since 1900/01/01 (if collection date is given)
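The 16 fields above can be split with a few lines of Python. This is only a sketch: the field names in FIELDS are my own labels for this example, not part of the file, and it assumes no field contains a comma. The second function shows how field 16 relates to a collection date, counting 1900/01/01 as day 0.

```python
from datetime import date
from typing import Dict

# Illustrative labels for the 16 header entries, in file order.
FIELDS = (
    "accession", "species", "segment", "serotype", "country", "year",
    "length", "type", "name", "host_age_days", "host_sex", "full_length",
    "taxon", "collection_date", "submission_date", "days_since_1900",
)


def parse_header(line: str) -> Dict[str, str]:
    """Split one header line into a dict of 16 string fields."""
    parts = [p.strip() for p in line.rstrip("\r\n").lstrip(">").split(",")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def days_since_1900(year: int, month: int, day: int) -> int:
    """Number of days from 1900/01/01 (day 0) to the given date."""
    return (date(year, month, day) - date(1900, 1, 1)).days
```

With the second example header, parse_header gives serotype "H3N2" and host_age_days "31411", and days_since_1900(2000, 10, 20) gives 36817, which matches the last field of that header.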
The nucleotide sequences are aligned by inserting "-" for
influenza A: segments 1,2,3,5,7,8, 4-H1N1,4-H3N2,4-H5N1,6-H1N1,6-H3N2,6-H5N1
(simple alignment: "-"s are only attached at the start and end).
If a sequence has no neighbor within 5%, print it to an extra file instead.
Don't calculate every d(f,g) in full: if d>min, exit the for-loop early.
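The early-exit idea in the last two notes could look roughly like this in Python. This is a sketch under my own assumptions, not the code in align.exe: d(f,g) is taken to be the number of mismatching positions between two sequences of equal aligned length, positions where either sequence has "-" are skipped, and "5%" is taken relative to the query length. The names bounded_distance and nearest_neighbor are hypothetical.

```python
from typing import List, Optional, Tuple


def bounded_distance(f: str, g: str, limit: int) -> int:
    """Count mismatches between f and g, stopping once the count exceeds limit.

    Assumption: f and g have the same aligned length; "-" padding is ignored.
    """
    d = 0
    for a, b in zip(f, g):
        if a == "-" or b == "-":
            continue  # padding carries no information
        if a != b:
            d += 1
            if d > limit:
                break  # already worse than the best so far: exit the loop
    return d


def nearest_neighbor(
    query: str, others: List[str], max_fraction: float = 0.05
) -> Optional[Tuple[int, int]]:
    """Return (index, distance) of the closest sequence, or None if no
    sequence is closer than max_fraction of the query length."""
    best_index = None
    best = len(query) + 1  # larger than any possible distance
    for i, g in enumerate(others):
        d = bounded_distance(query, g, best)
        if d < best:
            best, best_index = d, i
    if best_index is None or best >= max_fraction * len(query):
        return None  # caller would write this sequence to the extra file
    return best_index, best
```

The saving comes from passing the current minimum as the limit, so most comparisons stop after a few mismatches instead of scanning the whole sequence.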