Mask a field of a struct-type column in a Spark DataFrame


I have created a DataFrame from an XML file. The resulting DataFrame has the schema below.
val df = hiveContext.read.format("com.databricks.spark.xml").option("rowTag", row_tag_name).load(data_dir_path_xml)
df.printSchema()
root
|-- samples: struct (nullable = true)
| |-- sample: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- abc: string (nullable = true)
| | | |-- def: long (nullable = true)
| | | |-- type: string (nullable = true)
|-- abc: string (nullable = true)
I would like to mask the abc and def fields inside samples.sample in the DataFrame.
I was able to get to the field I want using:
val abc = df.select($"samples.sample".getField("abc"))
But I want to mask the abc and def fields in place (for example, replace the abc field with XXXX) in the DataFrame df. How can I do this?
There doesn't seem to be much, if any, support in the Databricks XML library for manipulating the contents of an XML-based DataFrame (wouldn't it be cool to be able to use XSLT?!). But you can always manipulate the inferred rows directly, e.g.
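import org.apache.spark.sql.Row

// Rebuild every row, replacing abc in each sample with a mask.
val masked = df.rdd.map { row =>
  val samples = row.getStruct(0).getSeq[Row](0)
  val maskedSamples = samples.map { sample =>
    // get(...) rather than getLong/getString so that null values pass through
    Row("xxxxx", sample.get(1), sample.get(2))
  }
  Row(Row(maskedSamples), row.getString(1))
}
// Turn the RDD[Row] back into a DataFrame with the original schema.
val maskedDf = hiveContext.createDataFrame(masked, df.schema)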
The code above may not match your desired transformation exactly, since the question is somewhat unclear, but you get the idea.
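To try the row-rewriting approach without an XML file, here is a minimal sketch. It assumes Spark 2.x or later (a SparkSession rather than the hiveContext used in the question). The object name MaskRows, the helper maskRow and the sample values are made up for illustration, and the schema is built by hand to mirror the one printed in the question.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object MaskRows {
  // Hand-built schema mirroring the one printed in the question.
  val sampleType: StructType = StructType(Seq(
    StructField("abc", StringType),
    StructField("def", LongType),
    StructField("type", StringType)))
  val schema: StructType = StructType(Seq(
    StructField("samples", StructType(Seq(
      StructField("sample", ArrayType(sampleType))))),
    StructField("abc", StringType)))

  // Replace abc in every sample; def and type are copied as they are.
  // Assumes the samples struct itself is not null.
  def maskRow(row: Row): Row = {
    val samples = row.getStruct(0).getSeq[Row](0)
    val masked = samples.map(s => Row("XXXX", s.get(1), s.get(2)))
    Row(Row(masked), row.get(1))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("mask-rows").getOrCreate()
    // Illustrative data only.
    val rows = Seq(Row(Row(Seq(Row("secret", 1L, "a"), Row("other", 2L, "b"))), "top"))
    val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
    // Map over the underlying RDD, then re-attach the original schema.
    val maskedDf = spark.createDataFrame(df.rdd.map(maskRow), schema)
    maskedDf.show(false)
    spark.stop()
  }
}
```

Because the schema is passed back in explicitly, the result keeps the same nested layout as the input. To mask def as well, replace s.get(1) with a constant of the same type (a Long here).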
I would suggest splitting the fields of the samples.sample array of structs into separate columns so that you can mask or replace them as you want. You can also apply DataFrame functions to them later on.
The code below separates them into three columns:
df.withColumn("abcd", $"samples.sample.abc")
  .withColumn("def", $"samples.sample.def")
  .withColumn("type", $"samples.sample.type")
You can then drop the samples column if you want:
  .drop("samples")
Since you want to mask abc and def with XXXX, you can do:
df.withColumn("abcd", lit("XXXX"))
  .withColumn("def", lit("XXXX"))
  .withColumn("type", $"samples.sample.type")
  .drop("samples")
Note: the column name abcd is used because your schema already has another top-level column named abc.
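The flatten-and-mask steps can be tried on a small in-memory DataFrame. The sketch below assumes Spark 2.x or later. The names FlattenAndMask and flattenAndMask, the hand-built schema and the sample values are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types._

object FlattenAndMask {
  // Mask abc and def with a literal, lift type out of the nested array,
  // then drop the original struct column.
  def flattenAndMask(df: DataFrame): DataFrame =
    df.withColumn("abcd", lit("XXXX"))
      .withColumn("def", lit("XXXX"))
      .withColumn("type", col("samples.sample.type"))
      .drop("samples")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("flatten-and-mask").getOrCreate()
    // Hand-built schema mirroring the one printed in the question.
    val sampleType = StructType(Seq(
      StructField("abc", StringType),
      StructField("def", LongType),
      StructField("type", StringType)))
    val schema = StructType(Seq(
      StructField("samples", StructType(Seq(
        StructField("sample", ArrayType(sampleType))))),
      StructField("abc", StringType)))
    // Illustrative data only.
    val rows = Seq(Row(Row(Seq(Row("secret", 1L, "a"), Row("other", 2L, "b"))), "top"))
    val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
    val flat = flattenAndMask(df)
    flat.printSchema()
    flat.show(false)
    spark.stop()
  }
}
```

Note that selecting a field through an array of structs yields an array, so the type column holds one string per sample, while abcd and def become plain string columns. The per-sample structure of the masked fields is therefore lost with this approach.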
Edited in response to @Raj's comments below:
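If the nested layout has to be preserved and separate columns are not wanted, a case class and a udf function should do the trick.
First define a case class describing one masked sample (def and type are reserved words in Scala, hence the field names dfe and typ):
case class Raj(abc: String,
               dfe: Option[Long],
               typ: String)
Then define a udf that builds one masked Raj for every element of the type array, so that no samples are dropped:
import org.apache.spark.sql.functions._
val mask = udf((typ: Seq[String]) => typ.map(t => Raj("XXXXX", Option(0L), t)))
Finally, call the udf in withColumn, passing in the type array:
df.withColumn("samples", struct(mask(col("samples.sample.type")) as "sample"))
This should give you the masked output. Note that the nested fields are then named after the case class fields (abc, dfe, typ), and that the udf as written assumes the type array itself is not null.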
