NCBI Minute: New Variation Services for Normalizing, Remapping, and Annotating Variants

So, the full title of this webinar is New
Variation Services for Normalizing, Remapping, and Annotating Variants. As I said, I am
Peter Cooper from the customer service part of NCBI. You can write to me at [email protected]. Lon Phan is also here with me; he is the head
of dbSNP. Rana Morris is also here to help us with questions
and any technical things that might come up in the webinar. What we’re going to talk about here today is
the set of bullet points here. What are the variation services APIs? We are going to talk about the variant model
these APIs use. It is called SPDI. I will spend a few minutes talking about that and explaining how it works. We’re going to talk about some applications
of the SPDI variation services. I will show you what you can do with it. I will show you an example where you can see
them in action and how they work behind the scenes. I will show you a couple of interfaces that
you can go to to test them out, to play around with them to see how things
work and what kinds of results you are going to get back. And then I will do some live practice using
those interfaces. And I will go to the command line as well
and show you a demonstration script. We will show you how to access them using
a script. And of course we will have questions at the
end. So, what are these things, the variation services
APIs? They are a set of URL-based, RESTful programming interfaces
for accessing and working with variation data at NCBI. They rely on a new variation representation
called SPDI, which stands for sequence:position:deletion:insertion.
They are used internally at NCBI by dbSNP and
ClinVar. It’s how we manage to deal with the large
numbers of variants that are present in both of those resources. We will talk a little bit about the variant
model for the next few minutes. It is a simple representation that uses a sequence
ID, the location, and the alleles: the deletion (take something out) and the insertion (put something back in). Variants are represented with precise coordinates. They’re easier for variant normalization. We will talk about what I mean by that
in a few minutes. They are designed to be machine-readable, although people can figure out what they
mean as well. I’ll explain why they are more machine-readable
than people-readable in just a minute. They are amenable to high-performance computation,
which is why we use them internally at NCBI for sequence variant processing. You can represent both nucleotide and protein
sequence variants. At the bottom of the slide, there are some
URLs that will take you to a web document about SPDI notation, which is helpful, but much
more detailed is the preprint available on bioRxiv. There is a link to it there on the bottom
of the slide (biorxiv.org/content/biorxiv/early/2019/01/31/537449.full.pdf). If you want to know more about this variant
model, that’s the place to go to find out all about it. Here is an example of a variant that is shown
in SPDI notation. This is the linear form. As you can also see on the slide, there is
a structured format, a JSON (JavaScript object) format for the SPDI, which is what the APIs
are going to return to you. Basically, you have a sequence ID; this is a reference sequence for chr8. Then there is a position in that reference sequence, and then
there is a base that is deleted and a base that is inserted. Notice that the SPDI variant model does not
require that you have the deleted base there. You can just use the length of the deleted sequence. And that brings me to the next point on the
slide, which is that SPDI uses (this is the part
that makes it a little tricky for people to read) a zero-based coordinate system. It is an interbase one. Basically, the insertion/deletion starts between
bases 1981972 and 1981973, in the space between those bases. The G that is being deleted is the G at position
1981973 in the one-based sequence coordinate system. And so, the advantage of this from a computational
point of view is that you can actually represent an insertion with a deletion length of 0. Here is the corresponding example of SPDI
notation for a protein variant. Again, it is written both ways, with the deleted sequence
or the deletion length. Again, you can get this as an object, a structured
format, from the API. We use SPDI behind the scenes in the variation
services and in the variation databases and resources
at NCBI. If you want to see them live and in person
in action, here’s a place you can see that. The variant search in ClinVar uses SPDI. If you search ClinVar with HGVS you’ll see
these kinds of things. It lets you know that you might have a variant
that is not in ClinVar, and then it will map you to dbSNP. Or there might be an equivalent variant, and it
will link you to that variant. So what can you do with these variation services? There are sort of three things you can do. You can normalize your variant. That means finding the equivalent one among the NCBI
dbSNP and ClinVar variants. You can convert formats. If you have a SPDI you can get the HGVS and
vice versa. You can do all these kinds of things to standardize
the way your variants are shown. You can correct for left and right shifting
of insertions and deletions. We will talk a little bit more about that
when we talk about contextual alleles. We can transform your data into right-shifted
HGVS or left-shifted VCF, which are the preferred ways of writing them. Another useful thing you can do is to remap the
sequence variant. That uses our alignment database system to allow you to map variants between assemblies and between the various sequences that are in there. It also gives you access to annotations, because you can access the SNP records in
a structured format. Those are going to include functional consequences,
allele frequency, and clinical significance and all kinds of other important information
about the variant. A couple of terms that we are going to come
across today that I am going to mention. One of them is the idea of a contextual allele. These are alphabetical on the slide. So, a contextual allele is a unique, normalized
representation that is corrected for something called over-precision. Over-precision is pretty easy to understand. Look at a portion of DNA, for example, one
that has a low-complexity region or repeat in it. If you look at the examples shown at the
bottom of the slide, you can think about a deletion. You see GAGA repeats there. If I remove one of those GAs, it doesn’t
matter which one I remove; I’m going to end up with the same resulting sequence. So really, over-specifying which one of those
is used is what is called over-precision. The problem with that is, it can give rise
to things that look like different variants when they are, in fact, not different variants. It is one of the things that SPDI has been
helpful for. It corrects for those kinds of things. The other thing that I want to mention is
this canonical representative allele or representative record that you can see. This is a really useful thing to do with the
variation services. It will allow you to map to a particular sequence
model. The thing that we call a canonical allele
is usually based on the latest genomic assembly, which is GRCh38. So if you want to map an allele on a different
sequence to the standard allele in GRCh38, you can do that using the variation services. This is just a slide that shows you how you
can do these kinds of things in sort of a schematic. You’ve got all these different variants in
different formats. You convert them to SPDI. In particular,
you want to try to get the contextual alleles. Those are going to normalize for over-precision,
for overly precise variants. And you can correct all of your problems and
make your VCF and HGVS compliant with the standard. And then, you could also remap these, using
the alignment database system, into various genome builds and things like that. Here are a couple of simple examples of sort
of normalization, or standardizing your variants. So, at the top of the slide, I have an HGVS
name for a variant:
NC_000007.14:g.117548634_117548635insTT
It’s written as an insertion, so insert two Ts. I can convert that using the SPDI services
to a SPDI contextual allele. This is an unambiguous representation of this
particular variant:
NC_000007.14:117548628:TTTTTTT:TTTTTTTTT
Then, I can convert that back into HGVS on
all these different reference sequences:
NG_016465.4:g.87851_87852dup
NC_000007.13:g.117188688_117188689dup
NC_000007.14:g.117548634_117548635dup
And the preference is to make this representation
a duplication rather than an insertion. Here you can see the correct way, or the preferred
way, of writing these, as a dup. This is a more complicated example. First of all, you will notice that the HGVS
expression contains a GenBank accession number here; that’s a BRCA2 mRNA:
U43746.1:c.2472_2477del
That works because the alignment database has some of
the INSDC, or GenBank, sequences in it, aligned to all of the reference sequences
and things like that. This is written as a deletion. And it is actually left-shifted HGVS, which
means that the deletion is shown at the left-hand side of where this variant should
be. This is a problem because this is a repeat
sequence. I can turn this into a unique representation
called a SPDI contextual allele:
U43746.1:2699:AAATGAAAAT:AAATT
And I can normalize the HGVS that way:
U43746.1:c.2476_2481delGAAAAT
What I get back is right-shifted, showing
the right-hand end of that insertion/deletion. I can of course get back all the contextual
alleles and all of the HGVS expressions. I am mapping it to chromosome 13 here in this
one:
NM_000059.3:c.2476_2481delGAAAAT
NG_012772.3:g.26352_26357delGAAAAT
NC_000013.10:g.32910968_32910973delGAAAAT
NC_000013.11:g.32336831_32336836delGAAAAT
Okay. So, how do you do this? If you are doing it with a program, a script,
you could use the API, which is a URL-based system. There is a base URL that is written for you
there. There are services you can use that take various
kinds of input. They take SPDI, they take HGVS, and they take VCF. And you can get back the reference record
using this beta reference service. There are some example URL calls. You can try those out yourself using your
web browser if you want to. I won’t click on them here because we will
do some live examples in just a couple of minutes.
https://api.ncbi.nlm.nih.gov/variation/v0/hgvs/NM_000041.3:c.388T>C/contextuals
https://api.ncbi.nlm.nih.gov/variation/v0/vcf/11/5248224/A/AC/contextuals?assembly=GCF_000001405.25
https://api.ncbi.nlm.nih.gov/variation/v0/spdi/NC_000011.9:5248224:C:CC/canonical_representative
One thing that the developers wanted me to
point out is that these URLs will be stable and backwards compatible. You don’t need to worry about that changing. So, here is the full list of services and
functions, which are the things you would add on to the end of that URL. There are two different demonstration interfaces. There is a simple web demo and an API documentation
interface. I will show you both of those and how they work. This is the simple demo interface, at this
URL here, https://www.ncbi.nlm.nih.gov/variation/services/demo (go to variation/services/demo). You can put any HGVS expression in there you
want. This just happens to be the one for the fairly
famous CFTR Delta 508 deletion. I can take that single HGVS and do all the
things that the SPDI service can do, for the most part. So I can get the SPDI representation. I can
get the identifiers here that will give me the link to the refSNP record. I can get the normalized HGVS, which in this
case, is the same. Then, I can map it to all these other sequences
here that are available. The other interface, which is useful in particular
if you are going to start writing software that accesses this, is the Swagger documentation of the API. It lets you see the full API response. These are the services here. You can expand these, and you can try them out. So, here is an example where I could put in
the HGVS, like I did before. And, I can run this to try it out and get
back a SPDI, the same one I got a few minutes ago on the simple interface. But now it comes back in the structured format. It’s going to show you the header and everything
else. It also gives you the request URL you can
use to get this, and a curl command line so you can try this
out on your command line to test it. So, just to wrap this part up, the main point
here is, these are demonstration interfaces. You really are going to want to be accessing
these using software. And, Lon Phan has put up some tutorials on
the NCBI GitHub site. This is the URL that will get you there: https://github.com/ncbi/dbsnp/tree/master/tutorials There are tutorials here about how to manage
the JSON format from dbSNP. There is also a set of variation services
demonstrations here, including the Jupyter notebook that shows
you how the calls work. And, a simple script in Python that will let
you access things via the API. What I hope we have time to do now, is to
try to do a couple of demos. The first ones I’m going to do are going to
be using variation services with the web interface. I’m going to try to do this with a few variants
that are the same. Then I will show you the Python demo script
from GitHub and I will show you how that runs on the command line if we have time. Let me go ahead and back out of my slides. I am going to go over to a web browser. What I want to do is to go ahead and take
the two variants that we showed in our slides a few minutes ago. I have made myself a little cheat sheet that’s
available on the FTP site. The two variants we are going to use are
here. So, here is our first variant, which is written
as 27dupG. Notice I can also write this as an insertion,
which is not the preferred way. I’m going to use that when we run this. The other thing I have is the VCF line here
that has another variant. It looks quite different. What we want to do is see if these two things
are the same. So, what I can do here is go to the simple
interface here, paste my VCF, and run this. So, it gives me the SPDI representation. It gives me a link to dbSNP. This is a variant associated with beta thalassemia. There is a normalized HGVS. It turned that insertion variant into a duplication
variant, which is the way we prefer to have it written. And here it is mapped over to all the other
sequences that we have available here. Now, what we want to do to find out if these
are really the same variant, is we want to map them to that canonical representative. That’s going to be this one here, if we turn
it into a SPDI. Let me do that. I’m going to grab that. Let me put that up here. So, this is our SPDI representation of that
particular variant. Okay. Now, what we want to do is to see if we get
the same SPDI representation for our HGVS allele. What I’m going to do is now, go to the variation
services Swagger documentation, which is over here. And we are going to compare the representation
that we get for that allele that is shown in the VCF. I can expand any one of these I want. For example, if I wanted to try to run that
HGVS expression to get that SPDI, I can do that here. That is in my demo if you want to look at
that script. But I’m not going to do that, just to save
some time, because we have already done that and seen
how that works. What I want to do is get the contextual allele
from the VCF that we have. So, let me get that open. What we want to do is try that out. Okay? So, we want to put in a chromosome. You don’t have to have the accession number
here, which we don’t have. The thing we are working with is over here. So you’ve got chromosome 11. Sorry, wrong one. We need our position. We need our ref. I think I can remember that. Let’s see if I do. Now, what we don’t have is the correct assembly
here. We need to know the assembly, because we need
variants for chromosome 11. The assembly I want to use is .25. Let me verify that. That’s the one that was in my notes over here. That is 1405.25 (GCF_000001405.25). So, there is my SPDI. It is returned to me as JSON. But it is the same one that we found before. These two alleles are the same allele. So, we verified that by using SPDI. That’s the same kind of thing we do here at
NCBI to verify these. The last thing I’m just going to touch on
is to show you the github site. Maybe we will run one or two examples from
that on the command line. Let me go over there. On my cheat sheet I have the GitHub site here. So, here are the tutorials for working
with the dbSNP JSON. The variation services tutorials are the ones we can use today. There is a Jupyter notebook. There is also this Python script, a simple Python script, but it gives you a good starting point if
you want to write your own little scripts to access variation services. And there are also some demonstration files
that you can use to run this. I’m going to go ahead and load spdi_batch.py. It’s got these nice little command lines that
you can run as an example. So, I’m going over here to a terminal window. And I have got spdi_batch.py in here, and
I’ve got some other files I can work with. Let’s do a couple of things here. So, I have a VCF file in there. So, let’s do that one. This is the one that I didn’t demonstrate
on the web interface. But it is a pretty handy thing to be able
to do. So, here is the VCF. And just so you know ahead of time, this is
mapped on GRCh37. What we don’t have is an ID column there. What we can do with this particular part of
the script is, we can go ahead and put the rs IDs in there. So here is a command line that will do that. I am just pasting these in, so you don’t have
to watch me type. So now, you can see, we have RefSNP identifiers. A couple of these don’t have identifiers,
and it lets you know that. This is a useful thing. And so, you could go ahead and try out some
of these other examples if you want on your own. That is just a simple one. You could take any kind of file that you want,
and I’ve got those available and will make them available on our FTP site if you want to use
the examples in my handout here. And included in there is a way to download
JavaScript objects and things like that. I think we should stop at this point, and
see if there are any questions that we need to answer. You probably have a few of them. Rana has been sort of collating
the questions. We can have her read them and we can answer
them.

[Rana Morris] Hi. There are a whole bunch of questions. We will take all of the relevant
questions that have come in and create a Q&A document. Over the next week, we will make sure we have detailed
answers to every single question that has come in. And we will post that on the FTP site that
I chatted out to you: https://go.usa.gov/xERZs The first one, I think, is kind of important
for all of the NCBI variation services at this point. One of the people specifically asked about
using this particular service with non-human genetic variation. In particular, he was interested in dogs. I’m going to let Lon answer that.

[Lon Phan] We no longer support non-human
organism variation. We are also looking at refactoring some of
these services, like the remapping service, as a separate utility. Presumably, you could then just provide any
type of alignments or data sets, and it would work for any organism. So if that is something that interests you,
when you get the feedback form that
we send out, just comment on there what you would like to see this work for. It will be considered and prioritized with
our other tasks, our workload. So thank you.

[Rana Morris] Next question, actually, I am
going to combine two. Somebody asked whether or not this service
is actually open source. Are there any licensing or use requirements? Another person asked, what are the volume
and scalability limits of the services? What kind of batches can be used? Can you do hundreds of thousands of variants
as an example? So, one, open source. Two, licensing, and three, hundreds of thousands
of variants as an example?

[Lon Phan] This is all open source. There is no license required. We just have the basic disclaimer about NCBI
data. What’s the other one? The volume. Right now we are sort of in testing mode. So go ahead and try it out. And, also, we will have some questions
regarding volume in our survey, so again, respond to that. That will help us plan in terms of scaling
up if needed. Thank you.

[Rana Morris] Another user asks, can these
services be run on private servers or infrastructures?

[Lon Phan] Again, we are considering having some of these things
run off-line, so as I said, you can run them with your own data set. That is something we are looking at, refactoring
in the future, running these functions as a utility off-line. Again, if you can just respond to the survey
and put your wish list on there. Okay.

[Rana Morris] Now we are going to get a little
more into the data. One user said the refsnp services are still
listed as being in beta mode. When will the final version be released?

[Lon Phan] The refsnp page that is live now is actually the final version. I think the alpha label was removed on that
page within the last week.

[Rana Morris] So, it will still be in development.

[Lon Phan] Yes. We will continue to improve it based on user
feedback. If you have anything you want, we will try
to prioritize it.

[Rana Morris] Okay. Someone is really interested in mapping alleles
over to the ClinGen registry alleles using the contextual
alleles. Is that possible? You may want to discuss ClinVar at that point.

[Lon Phan] Right now, we are not integrated with
the allele registry yet. It is something we can discuss with other
developers. Right now, they already integrate dbSNP data. I guess it’s possible we could provide all
the contextual alleles and make them available for you to normalize with the allele registry
alleles.

[Rana Morris] Now, formatting. A user was interested in posting a SPDI JSON
representation. Is posting in JSON possible for mapping
purposes?

[Lon Phan] Right. If you go to our API, the variation services
page, some of the functions will support POST. Others will not. Again, if you need other functions to support POST,
just put that in as a feature request in the survey. And again, we will look at it.

[Rana Morris] Peter, you may have a slide
that you can show for this. Now we are talking about VCFs as an input
file. They wanted to know, if you are going to put
in a VCF input file for, let’s say, a SPDI batch, how would you specify the build itself? Let’s say they want to do a search on build
GRCh38, as opposed to 37?

[Peter Cooper] So, you have to pass, and Lon
can correct me on this, I think you have to pass the assembly accession.

[Lon Phan] There is a parameter for you to
pass the assembly accession number, so that we will know what assembly you are
referring to, or at least the chromosome.

[Peter Cooper] In that example, I did not
point this out and I meant to. Let me see if that example is still there. Let me escape from this for a moment. On that example it should show you the URL. This is the newly released site. It looks a little different. If you see this URL here where I pass the
VCF fields, there is chr 11, there is the position, there
is the variant. You are using this service that gets you the
contextuals. We’re passing the assembly as an assembly accession
number. You need the assembly accession number for
GRCh37, and this one happens to be the assembly accession number for GRCh37.

[Lon Phan] If you look at the spdi_batch.py Python
script on GitHub that Peter demonstrated, if you look at the code, the assembly is passed;
there is an example in there.

[Peter Cooper] Right, in fact, that is the
example that I ran. In that particular case, that Python script is
fairly simple. It has the assembly accession encoded in it. But you can easily modify it so you could
give it the accession number.

[Rana Morris] All right. It is after 12:30. So, there are a couple of other questions
that are very specific and technical, with regard to the SPDI format, and they are interested
in understanding the rationale behind some of that. We will actually write that up in the Q&A document so you have a better
and fuller understanding of that.

[Peter Cooper] The bioRxiv article that is
linked on the slide explains a lot about the SPDI rationale and details about how it works to make those
contextual alleles. So, I suggest you check that out. That may answer your questions. We will definitely get the answer in the document
that we put up. I think we’re going to wrap it up. Thank you all very much for coming. We’re going to go ahead and sign off now.
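
As a recap of the SPDI model and the URL patterns shown in the webinar, here is a minimal Python sketch. It relies only on the base URL and the three example URLs quoted in the talk. The names `Spdi`, `parse_spdi`, `apply_spdi`, and the three `*_url` helpers are hypothetical and are not part of any NCBI library, and the toy reference `CTGAGAGAT` is invented for illustration. The sketch only builds URLs and never calls the service, so no response format is assumed. Percent-encoding the `>` in an HGVS expression is an assumption; the talk shows it written literally.

```python
from dataclasses import dataclass
from urllib.parse import quote, urlencode

# Base URL as shown on the webinar slide.
BASE_URL = "https://api.ncbi.nlm.nih.gov/variation/v0"


@dataclass(frozen=True)
class Spdi:
    """Hypothetical container for sequence:position:deletion:insertion."""
    seq_id: str
    position: int   # 0-based interbase coordinate
    deletion: str   # deleted sequence, or its length written as digits
    insertion: str

    @property
    def deletion_length(self) -> int:
        # The deletion field may hold the sequence itself or just its length.
        return int(self.deletion) if self.deletion.isdigit() else len(self.deletion)

    @property
    def one_based_start(self) -> int:
        # Interbase position N sits just before 1-based base N + 1.
        return self.position + 1


def parse_spdi(text: str) -> Spdi:
    """Split a linear SPDI string into its four fields."""
    seq_id, position, deletion, insertion = text.split(":")
    return Spdi(seq_id, int(position), deletion, insertion)


def apply_spdi(reference: str, variant: Spdi) -> str:
    """Apply a SPDI to a reference string: take something out, put something in."""
    end = variant.position + variant.deletion_length
    if not variant.deletion.isdigit() and reference[variant.position:end] != variant.deletion:
        raise ValueError("deleted sequence does not match the reference")
    return reference[:variant.position] + variant.insertion + reference[end:]


def hgvs_contextuals_url(hgvs: str) -> str:
    # Assumption: characters such as '>' are percent-encoded; ':' is kept as is.
    return f"{BASE_URL}/hgvs/{quote(hgvs, safe=':')}/contextuals"


def vcf_contextuals_url(chrom: str, pos: int, ref: str, alt: str, assembly: str) -> str:
    # The assembly is passed as an assembly accession, as in the webinar example.
    query = urlencode({"assembly": assembly})
    return f"{BASE_URL}/vcf/{chrom}/{pos}/{ref}/{alt}/contextuals?{query}"


def spdi_canonical_url(spdi: str) -> str:
    return f"{BASE_URL}/spdi/{quote(spdi, safe=':')}/canonical_representative"


if __name__ == "__main__":
    # Over-precision on a toy repeat: removing any one GA unit gives one result.
    toy = "CTGAGAGAT"
    results = {apply_spdi(toy, parse_spdi(f"toy:{p}:GA:")) for p in (2, 4, 6)}
    print(results)
    print(vcf_contextuals_url("11", 5248224, "A", "AC", "GCF_000001405.25"))
```

In the toy repeat, three differently positioned GA deletions all produce the same sequence, which is the over-precision that a contextual allele is meant to remove. The `one_based_start` property mirrors the chr8 example from the talk, where SPDI position 1981972 corresponds to the G at one-based position 1981973.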

Copyright © 2019 Toneatronic. All rights reserved.