Mask a field of a struct-type column in a Spark DataFrame


I have created a DataFrame from an XML file. It has the schema below.
val df = hiveContext.read.format("com.databricks.spark.xml").option("rowTag", row_tag_name).load(data_dir_path_xml)
df.printSchema()
root
|-- samples: struct (nullable = true)
| |-- sample: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- abc: string (nullable = true)
| | | |-- def: long (nullable = true)
| | | |-- type: string (nullable = true)
|-- abc: string (nullable = true)
I would like to mask abc/def in the DataFrame.
I was able to get to the field I want using:
val abc = df.select($"samples.sample".getField("abc"))
But I want to mask the fields abc/def (for example, replace the abc field with XXXX) in the DataFrame df itself. Please help me with this.
There doesn't seem to be much, if any, support in the databricks xml library for manipulating the contents of an XML-based dataframe (wouldn't it be cool to be able to use XSLT?!). But you can always manipulate the inferred rows directly, e.g.
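import org.apache.spark.sql.Row
val abc = df.map(row => {
  // Field 0 is the samples struct; its field 0 is the sample array
  val samples = row.getStruct(0).getSeq[Row](0)
  val maskedSamples = samples.map(sample => {
    // Replace abc with the mask; keep def and type
    Row("xxxxx", sample.getLong(1), sample.getString(2))
  })
  // Rebuild the row: the samples struct, then the top-level abc column
  Row(Row(maskedSamples), row.getString(1))
})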
The code above may not precisely match the transformation you want, since the question is somewhat unclear, but you get the idea.
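To make the row-rewriting idea above a bit more concrete, here is a minimal sketch that also turns the result back into a DataFrame with the original schema. It assumes the field positions from the printed schema (samples at 0, top-level abc at 1; inside each element abc at 0, def at 1, type at 2). MaskRows, maskSample, maskRow and maskSamples are hypothetical names, not part of any library. It goes through df.rdd because on Spark 2.x and later df.map would need an Encoder for Row, and it uses get(i) instead of getLong(i) so that null values in the nullable fields do not throw.

```scala
import org.apache.spark.sql.{DataFrame, Row}

object MaskRows {
  // One element of samples.sample: assumed layout (abc, def, type).
  // abc is replaced by the mask; def and type are passed through as-is (nulls included).
  def maskSample(sample: Row): Row =
    Row("XXXX", sample.get(1), sample.get(2))

  // One top-level row: assumed layout (samples, abc).
  def maskRow(row: Row): Row = {
    val maskedSamples: Any =
      if (row.isNullAt(0) || row.getStruct(0).isNullAt(0)) {
        row.get(0) // nothing to mask: samples or samples.sample is null
      } else {
        // Rebuild the samples struct around the masked array
        Row(row.getStruct(0).getSeq[Row](0).map(maskSample))
      }
    Row(maskedSamples, row.get(1))
  }

  // Rewrite every row, then re-apply the schema that was inferred from the XML.
  def maskSamples(df: DataFrame): DataFrame =
    df.sqlContext.createDataFrame(df.rdd.map(maskRow), df.schema)
}
```

With the df from the question this would be called as MaskRows.maskSamples(df). To mask def as well, or the top-level abc column, the corresponding get(...) would be swapped for a masked value of the same type.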
I would suggest splitting the samples array struct into separate columns (StructFields) so that you can mask or replace them as you want. You can also apply DataFrame functions to them later if needed.
Below is the code to split it into three columns:
df.withColumn("abcd", $"samples.sample.abc")
.withColumn("def", $"samples.sample.def")
.withColumn("type", $"samples.sample.type")
You can drop the samples column if you want
.drop("samples")
Since you want to mask abc and def with XXXX, you can do:
df.withColumn("abcd", lit("XXXX"))
.withColumn("def", lit("XXXX"))
.withColumn("type", $"samples.sample.type")
.drop("samples")
Note: the column name abcd is used because your schema already has a top-level column named abc.
Edited in response to @Raj's comments below:
If the original schema is to be preserved and separate columns are not required, then a case class and a udf function should do the trick:
import scala.collection.mutable
import org.apache.spark.sql.functions.{array, col, struct, udf}
def mask = udf((typ: mutable.WrappedArray[String]) => Raj("XXXXX", Option(0L), typ(0)))
The udf needs a case class Raj:
case class Raj(abc : String,
dfe : Option[Long],
typ: String)
Finally, call the udf in withColumn, passing it the type array:
df.withColumn("samples", struct(array(mask(col("samples.sample.type"))) as "sample"))
This should give you the desired output.
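One thing to watch with the udf above: as written it only reads typ(0) and sets the long field to 0, so every row ends up with a one-element array. If you want to keep every element and its def value, a variant could take the whole samples.sample array instead. This is only a sketch under the same positional assumptions as the printed schema (abc at 0, def at 1, type at 2). MaskedSample, MaskWithUdf, maskElement and maskAll are hypothetical names, and the case class copies the field names of Raj, so the struct fields in the result would be called abc, dfe, typ rather than abc, def, type.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, struct, udf}

// Same shape as the Raj case class above (hypothetical name).
case class MaskedSample(abc: String, dfe: Option[Long], typ: String)

object MaskWithUdf {
  // Mask abc in one array element, keeping def (nullable, hence Option) and type.
  def maskElement(s: Row): MaskedSample =
    MaskedSample(
      "XXXX",
      if (s.isNullAt(1)) None else Some(s.getLong(1)),
      s.getString(2)
    )

  // The array<struct> column arrives in the udf as a Seq[Row].
  val maskAll = udf { (samples: Seq[Row]) =>
    if (samples == null) null else samples.map(maskElement)
  }

  // Rebuild the samples struct around the masked array.
  def apply(df: DataFrame): DataFrame =
    df.withColumn("samples", struct(maskAll(col("samples.sample")).as("sample")))
}
```

Usage would be MaskWithUdf(df). Keeping the per-element logic in maskElement, outside the udf, means it can be checked on hand-built Row values without a running Spark context.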
